Ensemble learning for detection of malicious content embedded in PDF documents

Citation:
Nath HV, Mehtre BM. 2015. Ensemble learning for detection of malicious content embedded in PDF documents, 19-21 Feb. 2015. 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES). :1-5.

Date Presented:

19-21 Feb. 2015

Abstract:

Portable Document Format (PDF) is the de facto standard for sharing documents. Although PDF is a document description language, it has many features in common with programming languages. Its add-on support for JavaScript, which can carry malicious scripts, together with the facility to embed any file in a PDF document, creates considerable potential for disastrous cyber attacks. Since 2008, malicious users have concentrated increasingly on embedding malicious code in PDF documents. Compared with Portable Executable (PE) files, PDF files pose a higher risk because the embedded content can be encrypted and/or encoded. Recently, multistage delivery of malware has been used for advanced persistent threats (APTs) and targeted attacks. In such attacks, PDF documents accomplish one or more stages, as in MiniDuke, where a PDF file was used for the first stage; it went undetected for almost two years. These files can be regarded as carriers of k-ary codes. In this paper, we bring out the importance of analyzing the data encoded in the stream tag along with other structural information. We give a proof of concept by embedding JavaScript in a PDF document in a way that is not detected by any of the existing PDF parsers. Finally, we propose ensemble learning for detecting such PDF files.
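The idea of combining decoded stream data with structural information can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the features or the ensemble used, so the regular expressions, the keyword list, the three voting rules, and the helper `make_pdf` are all assumptions chosen for illustration. The rules are hand-written stand-ins for trained base classifiers, and the stream matcher is deliberately simplified (a real parser would honour the `/Length` entry and other filters besides FlateDecode).

```python
import re
import zlib

# Simplified stream matcher (assumption); real parsers should use /Length.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Illustrative indicators only, not a vetted signature set.
SCRIPT_RE = re.compile(rb"eval\s*\(|unescape\s*\(")
PAYLOAD_RE = re.compile(rb"(?:%u[0-9a-fA-F]{4}){2,}")
STRUCT_KEYS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/EmbeddedFile")


def decode_streams(pdf_bytes):
    """Return each stream body, inflated when it is zlib (FlateDecode) data."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            bodies.append(raw)  # not Flate-encoded; keep the bytes as found
    return bodies


def extract_features(pdf_bytes):
    """Count structural keywords in the raw file and indicators in decoded streams."""
    decoded = b"\n".join(decode_streams(pdf_bytes))
    return {
        "structural_keys": sum(pdf_bytes.count(k) for k in STRUCT_KEYS),
        "script_tokens": len(SCRIPT_RE.findall(decoded)),
        "encoded_payloads": len(PAYLOAD_RE.findall(decoded)),
    }


def is_suspicious(pdf_bytes):
    """Majority vote of three toy rules standing in for trained base learners."""
    f = extract_features(pdf_bytes)
    votes = [
        f["structural_keys"] > 0,
        f["script_tokens"] > 0,
        f["encoded_payloads"] > 0,
    ]
    return sum(votes) >= 2


def make_pdf(body, compress=True):
    """Hypothetical helper: wrap a body in a minimal, PDF-like stream object."""
    data = zlib.compress(body) if compress else body
    filt = b" /Filter /FlateDecode" if compress else b""
    header = (b"%PDF-1.4\n1 0 obj\n<< /Length " + str(len(data)).encode()
              + filt + b" >>\nstream\n")
    return header + data + b"\nendstream\nendobj\n%%EOF\n"


if __name__ == "__main__":
    hidden = make_pdf(b'eval(unescape("%u9090%u9090%u4141"))')
    plain = make_pdf(b"BT /F1 12 Tf (Hello) Tj ET", compress=False)
    print(extract_features(hidden), is_suspicious(hidden))
    print(extract_features(plain), is_suspicious(plain))
```

In this toy setup the script is stored inside a compressed stream with no JavaScript-related keyword in the object dictionary, so a scan of structural keywords alone finds nothing, while the two stream-based rules carry the vote once the stream is decoded.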
