Wednesday, December 21, 2011

PDF Analysis

So this video is all about PDF analysis. The tools I use in the video are:
  • Contagio: great place to go for sample malware
  • PDF Stream Dumper: A bunch of tools thrown together in one amazing program. Did I mention it is free?
  • REMnux: A great RE Linux distribution by Lenny Zeltser. Has tools for PDF, JS, shellcode and much more. I was using INetSim in this demo to simulate network services so the malware had something to talk to.
  • CaptureBAT: Allows the collection of modified/created/deleted files and registry keys after clicking malware. Can also capture network traffic.
  • Process Explorer: SysInternals... enough said
Now, I really do not go into explaining what a PDF is composed of in the video, since I wanted to keep it to a reasonable length. So... consider this the 'fine print'... and is it a doooozy!

A PDF consists of objects, which can be multiple things: numbers, strings, code, streams (often compressed data), etc. Below is a screenshot to hopefully explain this a bit better:


Ok... so the left pane lists all the objects in this PDF, 14 in total. The pane on the right shows what is inside the selected object. Strictly speaking, each numbered object is an indirect object and what you see here is its dictionary, but I have seen it called header data too. You see how there are two numbers in front of each 'R'? The first one is the object number (the object you can find the data under) and the second is the generation number, sometimes loosely called the version number. Generation numbers can indicate previous or newer versions of the same object, and can be used by nefarious users to hide their code. The 'R' means Reference, so... we can tell from this screenshot:
  • Object 1 references 3 additional objects
    • Pages (go to obj 2 for more information) 
    • OpenAction (go to obj 11 for more information)
    • AcroForm (go to obj 13 for more information)
These objects may in turn reference additional objects. It can become a cat and mouse game, and with a lot of objects it gets tedious to sort through.
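To make the reference chasing a bit more concrete, here is a minimal Python sketch that pulls the '/Key N G R' entries out of an object's dictionary. The sample dictionary is hypothetical, modelled on what object 1 looks like in the screenshot, and the function name is my own. It only handles the simple case of a key followed directly by one reference, so references inside arrays or behind hex-obfuscated names would be missed.

```python
import re

# Matches "/Key N G R": a name, an object number, a generation number,
# and the keyword R that marks an indirect reference.
REF_RE = re.compile(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R\b")


def find_references(dictionary: bytes):
    """Return (key, object_number, generation) for each '/Key N G R' entry."""
    return [
        (m.group(1).decode("ascii"), int(m.group(2)), int(m.group(3)))
        for m in REF_RE.finditer(dictionary)
    ]


# Hypothetical dictionary, modelled on object 1 from the screenshot.
sample = b"<< /Type /Catalog /Pages 2 0 R /OpenAction 11 0 R /AcroForm 13 0 R >>"

for key, num, gen in find_references(sample):
    print(f"/{key} -> go to obj {num} (generation {gen})")
```

Running the same function over each object you land on gives you the next hop in the chain, which is roughly what you do by hand when clicking through the object list.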

Now what do these things mean? Well a quick run-down of items of interest:
  • Stream Objects: compressed/encoded data... you gotta decompress/decode to see what's inside
  • /Page: each one marks a page, so the count tells you how many pages are in the document (if it's 0... watch out)
  • /JS -or- /JavaScript: self-explanatory, but watch out because this can be obfuscated
  • /AA, /OpenAction -or- /AcroForm: indicates an automatic action when the PDF is opened
  • /RichMedia: indicates the presence of Flash (another way to exploit the system)
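As a rough triage sketch of the run-down above, the Python below counts how often each of those names shows up in the raw bytes of a file. The keyword list comes straight from the bullets; the function name and the sample bytes are made up for illustration. It is a simplification: it does not look inside compressed streams, and names obfuscated with hex (covered further down) will not be counted unless you decode them first.

```python
import re

# Names of interest, taken from the run-down above.
KEYWORDS = [b"/Page", b"/JS", b"/JavaScript", b"/AA",
            b"/OpenAction", b"/AcroForm", b"/RichMedia"]


def count_keywords(data: bytes) -> dict:
    """Count occurrences of each keyword in raw PDF bytes."""
    counts = {}
    for kw in KEYWORDS:
        # The lookahead stops /Page from also matching /Pages.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode("ascii")] = len(re.findall(pattern, data))
    return counts


# Hypothetical bytes standing in for a real file's contents.
sample = (b"1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 11 0 R "
          b"/AcroForm 13 0 R >> endobj "
          b"11 0 obj << /S /JavaScript /JS 12 0 R >> endobj")

for name, hits in count_keywords(sample).items():
    print(f"{name:12} {hits}")
```

For a real sample you would read the file with open(path, "rb").read() and pass the bytes in. A file with zero /Page entries but an /OpenAction and some /JS is the kind of thing that should make you look closer.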
So let's follow /OpenAction, which is in Object 11:


See what I mean? Another reference... this time to JavaScript in object 12, which is the obfuscated code in the video.

Oh, and the headers themselves can be obfuscated. PDF Stream Dumper is nice and converts them for you, but right click on an object and select 'Show Raw Header' to see what I mean. Here is what object 1's raw header looks like:

This is using hex to obfuscate the header data. Inside a PDF name, a '#' followed by two hex digits stands for the character with that code, so #50 is equal to the ASCII symbol 'P', #61 is 'a' and so on. There are a ton of hex-to-ASCII converters out there. A good site for tons of string manipulation options is http://www.string-functions.com.
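If you would rather do the conversion yourself, the hex decoding fits in a few lines of Python. This is just a sketch: the function name and the sample names are made up for illustration (they are not copied from the sample's raw header), and it simply swaps every '#XX' pair for the character it encodes.

```python
import re


def decode_pdf_name(raw: str) -> str:
    """Replace every #XX hex escape in a PDF name with its character."""
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        raw,
    )


# Hypothetical obfuscated names.
print(decode_pdf_name("/#50#61ges"))      # /Pages
print(decode_pdf_name("/Open#41ction"))   # /OpenAction
```

Decoding the names first also means a plain text search for /OpenAction or /JavaScript will actually find them.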

Honestly, playing around and researching on the internet is the best way to figure this stuff out. Didier Stevens has some awesome tools, which are included in the REMnux image. The guys over at Sourcefire also did a post on PDF analysis using Didier Stevens' tools. Oh, and did I mention Mr. Stevens wrote a book about PDF analysis?! Best thing: it's free.

I would be remiss if I didn't recommend watching the videos for PDF Stream Dumper too. No one knows the tool better than the guy who created it :) Watch, learn, play... enjoy

So without further ado... the video:



Oh, and in my haste to finish the video I forgot to show the network data captured by CaptureBAT. Here is a screenshot of Wireshark with the file opened:


Ok, so the first thing we see is the DNS query for googlemail.proxydns.com. This was the TCP item we saw when we looked at the process with Process Explorer. My REMnux box, running dutifully as a DNS server, says the website is at 192.168.10.1 (my REMnux box again). The malware then connects to the web server and POSTs to index.php. REMnux sends back a dummy file, which the malware does not know what to do with... however, we now know the domain the malware beacons out to and can block it by name and IP. Or, as analysts, we go out there and see what is on that site :)

This is the VirusTotal output for the PDF and the subsequent spoolsv file. Both bad.

What I am trying to say is that I barely scratched the surface of PDF analysis. It is always better in the long run to understand the structure of a file you are analyzing rather than depending on a tool to do it for you. That way, when something goes wrong, you have a better understanding of what is happening and potentially why. In a court of law, it does not look good as an expert witness if you say "Well your honor, you click this button and this pops out... I don't know how it arrives at that answer."

Never stop learning, my friends :)
 






6 comments:

Hunter_Forensics said...

Excellent post and video. Your explanations and descriptions were very helpful. Keep them coming!

Glenn said...

Glad to see you took a look at CaptureBAT...

Jose Miguel Esparza said...

Hi! Good article! You can also take a look at peepdf, it's a tool to analyse PDF files, written in Python, and it's also included in REMnux ;)

-Sketchymoose said...

REMnux is a treasure trove of tools, I hope to get better acquainted with them. I am hoping to do a video of PDF analysis via REMnux, so much thanks for the tip :)

Anonymous said...

In your movie you are mentioning using Sysinternals' strings but actually you are using a right-click on the file and choose strings. Isn't that the Malware Analyst Pack installation of strings?

-Sketchymoose said...

Good catch, and you are correct. I use the strings from the Malcode Analysis Pack (http://labs.idefense.com/files/labs/releases/previews/map/) in this video. Although it's similar, the strings in the Analysis Pack also shows the MD5 hash and size.

Gotta give credit where credit is due-- cheers for that!