
a notepad of things



Giving VMs Access to Physical Disks with VMware ESXi Server

So, I really wanted an ESXi server for various reasons, but didn’t want to build a whole new server. Having a quad core box w/ 8GB of RAM sitting around as my “storage” server, I decided to repurpose it as a VMware ESXi server and build a storage server as a VM.

This had one major issue for me: that server has an mdadm RAID5 array with 5x 2TB disks on it. So, I was extremely cautious about converting my storage server to a VM, as I definitely didn’t want to lose the data on my RAID array.

Sadly, ESXi doesn’t have support for mdadm (or any reasonable way to install it that I could see). Its shell looks Linux-like, but the VMkernel underneath isn’t Linux. So, I needed to come up with a way for my new storage VM to have physical access to the disks in the array so that I could reassemble the array inside the VM using mdadm.

I found this blog post detailing how to accomplish this, but I ran into one minor issue with the instructions: an `fdisk -l` wasn’t showing all of my drives. I knew they were there, as I saw them all listed as I was going through the ESXi installation process.

My minor tweak was simply to get the disk names from where the blog author said they were stored, in /dev/disks/. From there, you can see the full names that VMware has given each physical disk and proceed with using the `vmkfstools` command as instructed.

I found out my VM (by default?) was using LSI Logic, by looking at the SCSI controller settings, so all I did on the ESXi server was:

~ $ cd /vmfs/volumes/datastore1
/vmfs/volumes/datastore1 $ mkdir disks; cd disks
/vmfs/volumes/datastore1/disks/ $ vmkfstools -z /dev/disks/<name of disk> RAW-MDADM1A.vmdk -a lsilogic
And just repeated that last command for each of the disks I wanted the VM to have access to. Then, all I had to do was go into the vSphere client and start adding the existing VMDKs as hard disks to my VMs.
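Repeating that command by hand gets tedious with five disks, so here is a rough sketch of a loop that prints one `vmkfstools` line per disk. The function name and the RAW-MDADM<n>.vmdk numbering are placeholders of mine, not anything ESXi requires, and it only echoes the commands so they can be checked before anything is run.

```shell
# Dry-run sketch: print (not run) one vmkfstools command per disk name.
# Disk names are passed as arguments; copy them from `ls /dev/disks/`.
# The RAW-MDADM<n>.vmdk naming is a hypothetical convention.
rdm_commands() {
  n=1
  for disk in "$@"; do
    echo "vmkfstools -z /dev/disks/$disk RAW-MDADM$n.vmdk -a lsilogic"
    n=$((n + 1))
  done
}
```

Calling `rdm_commands` with the disk names as arguments prints the lines; once they look right, they can be pasted in or piped to `sh`.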

Then I just booted up my fresh Ubuntu Server 11.10 VM install, made sure the OS saw the disks by doing a quick:
$ sudo fdisk -l
Installed mdadm:
$ sudo apt-get install mdadm
And reassembled my array and mounted it with 2 simple commands:
$ sudo mdadm --assemble --scan
mdadm: /dev/md/0 has been started with 5 drives.
$ sudo mount /dev/md/0 /mnt/
And voila! My VM is now using the physical disks and successfully reconstructed the mdadm array, data totally unharmed. GREAT SUCCESS.

(via: cyborgworkshop.org)


Auerbach: More fun with objdump + Perl + MySQL

Finished my script to cache a list of imports from each sample of malware. I’m up to over 12GB in samples now (admittedly some duplicates), but this script ran surprisingly quickly. It finished almost 4000 unique samples in about 10 minutes.

Haven’t really even begun to do any serious data diving quite yet, but just to throw out some stats …

10 Most commonly imported functions:

mysql> select function_library,function_name,count(*) c from SampleFunctionMap left join LibraryFunctions using (function_id) where function_name is not null group by function_name order by c desc limit 10;
+------------------+---------------------+------+
| function_library | function_name       | c    |
+------------------+---------------------+------+
| kernel32.dll     | GetProcAddress      | 2175 |
| kernel32.dll     | LoadLibraryA        | 2101 |
| kernel32.dll     | WriteFile           | 2096 |
| kernel32.dll     | ExitProcess         | 2080 |
| kernel32.dll     | GetModuleHandleA    | 2073 |
| kernel32.dll     | GetLastError        | 2037 |
| kernel32.dll     | GetModuleFileNameA  | 2019 |
| kernel32.dll     | GetCurrentProcess   | 1999 |
| kernel32.dll     | MultiByteToWideChar | 1986 |
| kernel32.dll     | CreateFileA         | 1982 |
+------------------+---------------------+------+
10 rows in set (1.31 sec)
10 most commonly used DLLs:
mysql> select function_library,count(*) c from SampleFunctionMap left join LibraryFunctions using (function_id) where function_name is null group by function_library order by c desc limit 10;
+------------------+------+
| function_library | c    |
+------------------+------+
| kernel32.dll     | 3278 |
| user32.dll       | 3082 |
| gdi32.dll        | 2355 |
| advapi32.dll     | 2326 |
| shell32.dll      | 2308 |
| comctl32.dll     | 1947 |
| ole32.dll        | 1758 |
| oleaut32.dll     | 1328 |
| version.dll      | 1181 |
| wininet.dll      |  655 |
+------------------+------+
10 rows in set (1.00 sec)
I’m still really green to the malware analysis game, but I’m guessing this isn’t particularly surprising data to anyone that’s been analyzing malware for a while.

So what do I intend to do with this data? Well, I have a few ideas that I’d like to attempt:

- Review and provide descriptions for each API function (automatically if at all possible) to give a more general idea of what each sample might do given the functions it imports.

- Try to isolate/correlate functions which could indicate a suspicious executable, or even narrow it down to a “family” of malware. Ex: “FunctionA+FunctionB” indicates that this sample might be a keylogger.

- Eliminate functions which are routinely used by any legitimate executable, to improve the signal-to-noise ratio.

- Isolate samples that will likely generate network traffic based on the functions they import to analyze later and hopefully write Snort signatures for.

I’m thinking I can make good use of VirusTotal’s API for some of these ideas.
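As a rough illustration of the “FunctionA+FunctionB” idea above, here is a minimal sketch that lists samples importing every function in a given set. It assumes the sample-to-function mapping has been exported to a text file with one "<md5> <function_name>" pair per line; that file format and the function name `samples_importing_all` are assumptions of mine, not part of the actual scripts.

```shell
# Sketch: list samples whose imports include ALL of the given functions.
# Input file format (an assumption): one "<md5> <function_name>" pair per line.
# Usage: samples_importing_all <mapfile> <function> [<function> ...]
samples_importing_all() {
  map_path=$1
  shift
  awk -v wanted="$*" '
    BEGIN { n = split(wanted, w, " "); for (i = 1; i <= n; i++) want[w[i]] = 1 }
    # Count each wanted function only once per sample
    ($2 in want) && !seen[$1, $2]++ { hits[$1]++ }
    END { for (md5 in hits) if (hits[md5] == n) print md5 }
  ' "$map_path" | LC_ALL=C sort
}
```

For example, `samples_importing_all map.txt SetWindowsHookExA GetAsyncKeyState` would print the samples importing both. That pairing is only an illustrative guess at a keylogger-ish combination, not a verified signature.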

Simple regex hackery on objdump output to get and cache a list of imports:
sub objdumpImports {
  my ($folder,$md5) = @_; 

  ## Get a sample and run objdump on it and loop through/parse output
  open(my $cmd, "-|", "find $folder -type f | tail -1 | xargs -I {} objdump -x {}")
    or die "Could not run objdump on $folder: $!";
  my @output = <$cmd>;
  close($cmd);
  
  my $new_dll = 0;
  my $dll;
  foreach my $line (@output) {

    if ($line !~ /[A-Za-z0-9]+/) {
      $new_dll = 0;
      next;
    }   

    ## Is this a DLL .. ?
    if ($new_dll == 0 && $line =~ /DLL Name: ([^\s\n]+)/) {
      $dll = $1; 
      $new_dll = 1;
      print "Found DLL: $dll\n";

      ## Do we already have this DLL listed? If so, get ID
      my $function_id = &checkDLL($dll);

      ## If not, add it and get back the ID
      if ($function_id == 0) {
        print "Adding $dll to database .. \n";
        $function_id = &addDLL($dll);
        print "$dll id is $function_id\n";
      }   

      ## Add Mapping from ID to this sample md5
      &setFunctionSampleMap($function_id,$md5);
      next;
    }   

    ## If we're already "inside" a DLL, get function name ..
    if ($new_dll == 1 && $line =~ /\s+[0-9A-Fa-f]+\s+[0-9A-Fa-f]+\s+([^\s]+)/) {
      my $function_name = $1; 
      print "+Found $function_name in $dll\n";

      ## Do we already have this function listed? If so, get ID
      my $function_id = &checkFunction($dll,$function_name);

      ## If not, add it and get back the ID
      if ($function_id == 0) {
        print "Adding $dll:$function_name to database .. \n";
        $function_id = &addFunction($dll,$function_name);
        print "$dll:$function_name id is $function_id\n";
      }

      ## Add Mapping from ID to this sample md5
      &setFunctionSampleMap($function_id,$md5);

      next;
    }

  }
}


Project Auerbach Initial Data Mining

I’ve been downloading quite a bit of malware and trying to come up with ways to mine for correlative or generally interesting data.

One approach I’m taking for automated static analysis mining is to make use of objdump and strings. The script I’m running is a bit more sophisticated than the following one-liners (and stores its results in a set of database tables), but I thought I would post a few that I used while I was testing and looking for “interesting” things across all my downloaded malware samples.

Of course, pretty much all of these tactics are really only useful for non-packed PEs.

Gathering a list of the top imported DLLs

# for i in `find /mnt/Malware/samples -type f`; do objdump -x $i | grep "DLL Name" | awk '{print $3}'; done | sort | uniq -c | sort -rn
Gathering a list of file and domain names from `strings` (ignoring DLL imports that might show up)
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -E ':.*[A-Za-z0-9_:.-]{4,}\.[a-z]{2,3}'; done | grep -v dll
Gathering a list of URLs from `strings` that have GET arguments (possible download, phone-home or C&C URLs):
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -E ':.*http://.*\?'; done
Finding SQL queries from `strings` (surprisingly turned up a lot of hits from my sample set)
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -Ei ':.*(insert into.*|select.*from|update.*set)'; done
Finding references to (potentially malicious) Javascripts/iFrames from `strings`
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -Ei ':.*(<(script|iframe)|\.js[^a-z])'; done
Finding samples that are likely to generate (forged) HTTP requests
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -Ei ':.*((get|post) /|user-agent|content-type:|host:|http/1|accept:)'; done
Samples that might generate IRC activity (didn’t find any true positives in my sample set yet)
# for i in `find /mnt/Malware/samples -type f`; do strings --print-file-name $i | grep -Ei ':.*(nick|pass|join|connect|msg|dcc) '; done
I also have a few lines of perl to further parse output from objdump and store the results in a database. One thing I was trying to do was create a mapping of imported functions and libraries from each executable, to see if there’s a solid list of Windows API functions we could use to pick out suspicious executables over something more legitimate.

Another purpose for this was to separate out any PEs which are potentially packed (and unpack them if possible so we can perform the same level of static analysis on them as unpacked PEs). Still a WIP, so I’m leaving it off this blog for now.

I hardly think any of these would have a 100% true positive rate; they were just the result of some brainstorming on how to find interesting samples to look at. Part of my thinking for doing something along these lines was to come up with a system to prioritize automated dynamic analysis, by looking for common suspicious activity so that executables more likely to yield interesting results get analyzed ahead of others.
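One very rough way to sketch that prioritization: record which pattern categories each sample matched, then rank samples by how many distinct categories they hit. The input format (one "<sample> <category>" pair per line, e.g. collected from the one-liners above) and the function name `rank_samples` are assumptions for illustration only.

```shell
# Sketch: rank samples by the number of distinct suspicious categories they hit.
# Input file format (an assumption): one "<sample> <category>" pair per line.
# Output: "<score> <sample>", highest score first.
rank_samples() {
  # sort -u drops repeated (sample, category) pairs so each category counts once
  LC_ALL=C sort -u "$1" |
    awk '{ score[$1]++ } END { for (s in score) print score[s], s }' |
    LC_ALL=C sort -k1,1nr -k2,2
}
```

Samples at the top of the output would go to dynamic analysis first. Counting every category equally is a simplification; some hits are obviously worth more than others.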

Anyway, that’s enough for tonight. Hopefully over the next couple weeks, I’ll have some more code and stats to share from my findings.

Also (random note) I just wanted to thank the folks at VirusTotal for being so awesome about increasing my API limits. I haven’t quite finished writing scripts to interact with their API, but I’m excited to be swapping samples for information with them. I foresee this being a huge help in statistical static analysis correlation for samples on my end.


Malware Harvesting with Auerbach (Phase 1)

Had a moment at work today where I thought up something I could do, if nothing else, as a hobby for my own entertainment. I know of a few good publicly available sources for malware samples (I don’t run them, so I’m choosing not to post them here for good reason), so I decided to write up a few scripts in Perl (because it’s what I know well) with a MySQL back-end.

At this point, the intent is to gather samples as a hobby and education for myself. I’m unsure if this will evolve into something more publicly available or not, but I’ve always wanted to get my own (semi-)automated malware sandbox up and running. This suite of code I’m developing for this, I’m simply calling “Auerbach” for now.

I think this xkcd comic pretty well sums it up:


It’s very simple at the moment: scrape source(s) for malware samples and store the URLs in a “Queue” table in SQL. A separate script then goes and “processes” that queue and downloads them into local directories on a RAID 5 array by Year/Month/Day/MD5sum. I still need to spend some time building out an API and set of Perl modules to make my own life easier, but this is simply “Phase 1”.
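The Year/Month/Day/MD5sum layout could be sketched like this. The function name and the root directory argument are placeholders, and using the download date (today) for the Year/Month/Day part is an assumption on my part.

```shell
# Sketch: build the Year/Month/Day/MD5sum destination path for a downloaded sample.
# Assumes GNU coreutils md5sum and that the date part is the download date.
sample_dest() {
  root=$1
  file=$2
  md5=$(md5sum "$file" | awk '{ print $1 }')
  echo "$root/$(date +%Y/%m/%d)/$md5"
}
```

Something like `mkdir -p "$(dirname "$dest")" && mv "$file" "$dest"` would then file the sample away, with duplicates from the same day landing on the same path.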

I realized I will soon have a 250GB/mo bandwidth cap (thanks to moving back onto Comcast), so I wanted to make sure I wasn’t sapping all my bandwidth on this. I calculated it out to where I’d only use 10% of my bandwidth and rate limited my sample downloads to 9KB/sec, so that the scripts could run continuously without putting too big of a dent in my bandwidth allowance.

At this rate, if the average sample size is about 1MB, I should be able to get roughly 23,000 samples per month and around 280,000 samples per year. I don’t know exactly what all I want to do with it, but I have some ideas and definitely intend to try to whittle that list down to a more manageable size of more “interesting” samples.
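A quick way to sanity-check those numbers is to multiply it out. This is a back-of-the-envelope sketch that assumes 1MB = 1000KB and that the downloads really do run flat out around the clock; the function name is just a placeholder.

```shell
# Sketch: how many samples fit in a period at a fixed download rate.
# Arguments: rate in KB/sec, number of days, average sample size in KB.
samples_at_rate() {
  rate_kb_per_sec=$1
  days=$2
  avg_sample_kb=$3
  # 86400 seconds per day; integer division rounds down
  echo $(( rate_kb_per_sec * 86400 * days / avg_sample_kb ))
}
```

`samples_at_rate 9 30 1000` gives 23328 for a 30-day month and `samples_at_rate 9 365 1000` gives 283824 for a year, which is about 23GB of the 25GB monthly budget.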

“Phase 2” will involve scripting up some basic static analysis to pick out samples that might be “interesting” to look at.

“Phase 3” I think may involve doing some data mining and correlation on the automated static analysis, as well as on some IP/Domain/Location data.

“Phase 4” will most likely involve automated basic dynamic analysis by use of Truman. Probably going to take me a while to get to this point, since I need to spend some time building out my home lab to where I can actually make good use of Truman.

“Phase 5” will probably involve some more data mining, also involving automated Truman analysis.

For my co-workers who happen upon this blog: I know, we already do pretty much all of this (and better). Just something I’ve wanted to do for a while for fun.

Will check in once in a while with more stats and updates as I have them, but that’s all for today. Not bad for just a couple hours’ work getting the DBs set up and some initial code tested and laid down.


Grepping for the last log each device has reported in a logfile

Not the most concise title for a blog post, but an interesting problem I ran into at work today. Solution seemed simple in my head, but trying to throw together a one-liner to do it took me a while to get there. Some fun with awk, grep and xargs. I’m sure there’s more than one way to do this - and there may even be a better way - but here’s what I came up with:

awk '{print $4}' [logfile] | sort | uniq | xargs -I {} sh -c "tac [logfile] | grep -m1 {}" | awk '{print $4" "$3}'
Breaking it down:


awk '{print $4}' [logfile] | sort | uniq 
Prints the IP address of the device that generated each log (the 4th space-delimited field), pipes that through sort, and then through uniq so that each IP shows up only once.


| xargs -I {} sh -c "tac [logfile] | grep -m1 {}"
Using each IP that came out of the first bit, send those results to xargs. Within xargs, run “tac [logfile]”, which cats the file in reverse, and pipe that to “grep -m1 {}”, where {} is the IP we’re grepping for (indicated by xargs -I {}) and -m1 stops grepping after the first match is found. Using “tac” and “grep -m1” is simply to save time, so we’re not repeatedly grepping the whole file from beginning to end when we really only want the last matching line in the file.


| awk '{print $4" "$3}'
This last bit is just because I only wanted the IP address ($4) and the timestamp ($3) of the last log for each device, to try to find which devices might have their time set incorrectly (or could indicate the last time a device sent a log).
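One caveat with the grep approach: the IP is treated as a regex and matched anywhere on the line, so 10.0.0.1 would also match a line from 10.0.0.10. An alternative, sketched below, is a single awk pass that overwrites the stored timestamp every time a device shows up, so whatever is left at the end is the last log for each one. It assumes the same layout as above ($3 is the timestamp, $4 is the device IP); the function name is a placeholder.

```shell
# Sketch: one pass over the log, remembering the most recent timestamp per device.
# Assumes field 3 is the timestamp and field 4 is the device IP.
last_log_per_device() {
  awk '{ last[$4] = $3 } END { for (ip in last) print ip, last[ip] }' "$1" |
    LC_ALL=C sort
}
```

`last_log_per_device [logfile]` prints one "IP timestamp" line per device, and it only reads the file once regardless of how many devices there are.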


My ~/.vimrc

Just posting this online mostly so that I can get to it whenever I need it. Frustrating whenever I’m on a new machine that doesn’t have it and I have to go dig it up from work.

set tabstop=2
set shiftwidth=2
set wm=2
set number
set history=700
set ruler
set magic
set smarttab
set ai
set si
set wrap
set linebreak
set nolist
" paste mode turns off ai, si, smarttab and wm, so it stays off by default; toggle it with <F11>
let mapleader = ","
let g:mapleader = ","
map ss :setlocal spell!<cr>
autocmd FileType perl set makeprg=perl\ -c\ %\ $*
autocmd FileType perl set errorformat=%f:%l:%m
autocmd FileType perl set autowrite
set pastetoggle=<F11>
hi LineNr ctermfg=darkgrey
map <S-up> <C-W>+
map <S-down> <C-W>-
map <S-left> <C-W><
map <S-right> <C-W>>
map <C-up> <C-W>k
map <C-down> <C-W>j
map <C-left> <C-W>h
map <C-right> <C-W>l


Information Security for the Greater Good

I had this long opinion piece typed out and instead just decided to delete it and paraphrase.

I’ll preface this by saying that this is my own opinion and does not in any way reflect my opinion of my employer, nor does it reflect any opinion held by my employer. My thoughts here are my own and merely a reflection on the industry as a whole.

Where is it fair to draw the line between a corporation wanting to protect intellectual property and an employee wanting to provide for the greater good?

Isn’t it at some point just as good for the corporation to have the publicity that it essentially paid for the research that was done and provided to the public to help protect themselves?

Does it at any point actually become detrimental to the corporation to let an employee release a certain amount or volume of such information to the public?

[NAME REDACTED] fights for the user.
