So this video is all about PDF analysis. The tools I use in the video are:
- Contagio: great place to go for sample malware
- PDF Stream Dumper: A bunch of tools thrown together in one amazing program, did I mention it is free?
- REMnux: A great RE tool by Lenny Zeltser. Has tools for PDF, JS, shellcode and much more. I was using INetSim in this demo to simulate network services so the malware had something to talk to.
- CaptureBAT: Allows the collection of modified/created/deleted files and registry keys after clicking malware. Can also capture network traffic.
- Process Explorer: SysInternals... enough said
Now, I really do not go into explaining what a PDF is composed of in the video; I wanted to keep it to a reasonable length. So... consider this the 'fine print'... and is it a doooozy!
A PDF consists of objects, which can be multiple things: numbers, strings, code, streams (compressed data), etc. Below is a screenshot to hopefully explain this a bit better:
OK... so the left pane lists all the objects in this PDF; there are 14. The pane on the right shows what is inside the selected object. These are called indirect objects, but I have seen this called header data too. See how there are two numbers? The first one is the object number (the index you can find the data under) and the second is the generation number, which works like a version number. Generation numbers can indicate previous or newer versions of the same object, and can be used by nefarious users to hide their code. The 'R' means Reference, so... we can tell from this screenshot:
- Object 1 references 3 additional objects
- Pages (go to obj 2 for more information)
- OpenAction (go to obj 11 for more information)
- AcroForm (go to obj 13 for more information)
Following these objects may lead to references to additional objects. It can become a cat-and-mouse game, and with a lot of objects it can be tedious to sort through.
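To make the "two numbers and an R" idea concrete, here is a minimal Python sketch that pulls the references out of a raw object dictionary. The sample bytes are a hypothetical reconstruction of object 1 from the screenshot (the generation numbers are assumed to be 0), and `find_references` is a helper name made up for this example, not part of any tool mentioned here.

```python
import re

# Matches "/Name 12 0 R": a name key followed by an indirect reference
# (object number, generation number, then the letter R).
REF_PATTERN = re.compile(rb"/([A-Za-z]+)\s+(\d+)\s+(\d+)\s+R\b")


def find_references(dictionary_bytes):
    """Return (key, object_number, generation_number) tuples from a raw dictionary."""
    return [
        (m.group(1).decode("ascii"), int(m.group(2)), int(m.group(3)))
        for m in REF_PATTERN.finditer(dictionary_bytes)
    ]


# Hypothetical dictionary modeled on object 1; generation numbers are assumed.
sample = b"<< /Type /Catalog /Pages 2 0 R /OpenAction 11 0 R /AcroForm 13 0 R >>"

for key, obj_num, gen_num in find_references(sample):
    print(f"/{key} -> object {obj_num} (generation {gen_num})")
```

Running the same function on each referenced object in turn is the scripted version of the cat-and-mouse game described above. A regex like this is only a rough aid; it will miss references inside arrays or obfuscated names.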
Now what do these things mean? Well a quick run-down of items of interest:
- Stream Objects: compressed/encoded data... you gotta decompress/decode to see what's inside
- /Page: How many pages are in the document (if it's 0... watch out)
- /JS -or- /JavaScript: self-explanatory, watch out because this can be obfuscated
- /AA, /OpenAction -or- /AcroForm: indicates an automatic action when the PDF is opened
- /RichMedia: indicates the presence of Flash (another way to exploit the system)
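A quick way to triage a file for the items above is to count how often each name appears in the raw bytes. The sketch below is a rough illustration, not a replacement for a real tool: `count_keywords` and the sample bytes are invented for this example, and it will not catch names that are hex-obfuscated or hidden inside compressed streams.

```python
import re

# Names from the list above. PDF names are case-sensitive, so the match is too.
KEYWORDS = [b"/Page", b"/JS", b"/JavaScript", b"/AA",
            b"/OpenAction", b"/AcroForm", b"/RichMedia"]


def count_keywords(pdf_bytes):
    """Count whole-name occurrences of each keyword in raw PDF bytes."""
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead stops /Page from also counting /Pages.
        pattern = re.compile(re.escape(keyword) + rb"(?![A-Za-z0-9])")
        counts[keyword.decode("ascii")] = len(pattern.findall(pdf_bytes))
    return counts


# Hypothetical fragment loosely modeled on the sample in this post.
sample = (b"1 0 obj << /Type /Catalog /Pages 2 0 R /OpenAction 11 0 R "
          b"/AcroForm 13 0 R >> endobj "
          b"11 0 obj << /S /JavaScript /JS 12 0 R >> endobj")

for name, count in count_keywords(sample).items():
    print(f"{name:12} {count}")
```

In practice you would read the bytes from the file with `open(path, "rb").read()`. A /Page count of 0 together with /OpenAction and /JS hits is the kind of combination worth a closer look.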
So let's follow /OpenAction, which is in object 11:
See what I mean? Another reference... this time to JavaScript, which is in object 12. That is the obfuscated code in the video.
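Script like this is often stored in a compressed stream, so you have to inflate it before you can read it. Here is a minimal Python sketch assuming the stream uses the FlateDecode (zlib) filter; that is an assumption for illustration, not something confirmed for this sample, and `inflate_streams` plus the sample bytes are made up for the example.

```python
import re
import zlib

# Grab everything between the "stream" and "endstream" keywords.
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed contents of every stream that inflates with zlib."""
    inflated = []
    for match in STREAM_PATTERN.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate-encoded (or damaged); skip it in this sketch.
            continue
    return inflated


# Hypothetical object holding a harmless compressed script.
hidden = b"app.alert('hello');"
sample = (b"12 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
          + zlib.compress(hidden)
          + b"\nendstream\nendobj\n")

for content in inflate_streams(sample):
    print(content.decode("latin-1"))
```

Real samples can chain several filters or use other encodings entirely, which is exactly the work PDF Stream Dumper does for you.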
Oh, and headers (indirect objects) themselves can be obfuscated. PDF Stream Dumper is nice and converts them for you, but you can right-click on an object and select 'Show Raw Header' to see what I mean. Here is what object 1's indirect objects look like:
This is using hex to obfuscate the header data. #50 is equal to the ASCII symbol 'P', #61 is 'a' and so on. There are a ton of hex-to-ASCII converters. A good site for tons of string manipulation options is http://www.string-functions.com.
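If you would rather not paste strings into a website, the same conversion is a few lines of Python. This is a minimal sketch; `decode_name` is a name invented for this example, and the inputs are illustrative rather than copied from the sample.

```python
import re

# A '#' followed by two hex digits encodes one byte inside a PDF name.
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw_name):
    """Replace every #XX escape in a PDF name with the byte it stands for."""
    return HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw_name)


print(decode_name(b"/#50#61#67#65#73"))  # b'/Pages'
print(decode_name(b"/Open#41ction"))     # b'/OpenAction'
```

Decoding names first also matters for any keyword search, since a plain search for /OpenAction will not match /Open#41ction.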
Honestly, playing around and researching on the internet is the best way to figure this stuff out.
Didier Stevens has some awesome tools, which are included in the REMnux image. The guys over at Sourcefire also did a post on PDF analysis using Didier Stevens' tools. Oh, and did I mention Mr. Stevens wrote a book about PDF analysis?! Best thing: it's free.
I would be remiss if I didn't also recommend watching the PDF Stream Dumper videos; no one knows the tool better than the guy who created it :) Watch, learn, play... enjoy.
So without further ado... the video:
Oh, and in my haste to finish the video I forgot to show the network data captured by CaptureBAT. Here is a screenshot of Wireshark with the file opened:
OK, so the first thing we see is the DNS query for googlemail.proxydns.com. This was the TCP item we saw when we looked at the process with Process Explorer. My REMnux box, running dutifully as a DNS server, says the website is at 192.168.10.1 (my REMnux box again). The malware then connects to the web server and posts to a file on it, index.php. REMnux sends back a dummy file, which the malware does not know what to do with... however, we now know the domain the malware beacons out to and can block it by name and IP. Or, as analysts, we go out there and see what is on that site :)
This is the VirusTotal output for the PDF and the subsequent spoolsv file. Both bad.
What I am trying to say is that I barely scratched the surface of PDF analysis. It is always better in the long run to understand the structure of a file you are analyzing rather than depending on a tool to do it for you. This way, when something goes wrong, you have a better understanding of what is happening and potentially why. In a court of law, it does not look good as an expert witness if you say, "Well, your honor, you click this button and this pops out... I don't know how it arrives at that answer."
Never stop learning, my friends :)