Most normal user traffic communicates via a hostname and not an IP address, so looking at traffic that communicates directly by IP with no associated DNS request is a good thing to do. Some attackers, however, use DNS names for their communications. There is also malware, such as Skybot and the Styx exploit kit, that uses algorithmically chosen host names rather than IP addresses for its command and control channels. This malware uses what have been called Domain Generation Algorithms (DGA) to create random-looking host names for its TLS command and control channel or to digitally sign its SSL certificates. These do not look like normal host names. A human being can easily pick them out of logs and traffic, but it turns out to be a somewhat challenging thing to do in an automated process. Natural Language Processing and measuring randomness don't seem to work very well. Here is a video that illustrates the problem and one possible approach to solving it.
One way you might try to solve this is with a tool called ent. ent is a great Linux tool for measuring entropy in files. Highly random input scores close to 8:
Entropy = 7.999982 bits per byte.
[~]$ python -c "print('A'*1000000)" | ent
Entropy = 0.000021 bits per byte. -- 0 = not random
So 8 is highly random and 0 is not random at all.
[~]$ echo google | ent
Entropy = 2.235926 bits per byte.
[~]$ echo clearing-house | ent
Entropy = 3.773557 bits per byte. - Valid hosts are in the 2 to 4 range
google scores 2.23 and clearing-house scores 3.77. So it appears as though legitimate host names will be in the 2 to 4 range.
[~]$ echo e6nbbzucq2zrhzqzf | ent
Entropy = 3.503258 bits per byte.
[~]$ echo sdfe3454hhdf | ent
Entropy = 3.085055 bits per byte. - Malicious hosts from Skybot and Styx malware are in the same range as valid hosts
That's no good. Known malicious host names are also in the 2 to 4 range. They score just about the same as normal host names. We need a different approach to this problem.
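To see why entropy alone does not separate these names, here is a minimal Python sketch of the same bits-per-byte calculation. The function name shannon_entropy is my own and this is not ent's source code; it only assumes that ent reports Shannon entropy over the bytes it reads, including the newline that echo appends.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum -p * log2(p) over every distinct byte value.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# echo appends a newline, so add one to mirror `echo name | ent`.
for name in ("google", "clearing-house", "e6nbbzucq2zrhzqzf", "sdfe3454hhdf"):
    print(name, round(shannon_entropy(name.encode() + b"\n"), 6))
```

Entropy only looks at how often each character occurs, not at the order the characters appear in, which is why a short random-looking name and a short readable name can land in the same range.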
Normal readable English has some pairs of characters that appear more frequently than others. TH, QU and ER appear very frequently, but other pairs like WZ appear very rarely. Specifically, there is approximately a 40% chance that a T will be followed by an H. There is approximately a 97% chance that a Q will be followed by the letter U. There is approximately a 19% chance that an E is followed by an R. With regard to unlikely pairs, there is approximately a 0.004% chance that a W will be followed by a Z.
So here is the idea: let's analyze a bunch of text and figure out what normal looks like, then measure the host names against the resulting tables. I'm making this script and a Windows executable version of the tool available for you to try out. Let me know how it works. Here is a look at how to use the tool.
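The character-pair idea can be sketched in a few lines of Python. This is an illustration only, not the code of the freq tool: the names build_table and pair_probability, the choice to skip non-alphanumeric characters, and the tiny sample sentence are all assumptions of mine.

```python
from collections import defaultdict


def build_table(text: str) -> dict:
    """Count how often each character is followed by each other character."""
    table = defaultdict(lambda: defaultdict(int))
    text = text.lower()  # case-insensitive counting
    for first, second in zip(text, text[1:]):
        # Assumption: only letter/digit pairs are counted.
        if first.isalnum() and second.isalnum():
            table[first][second] += 1
    return table


def pair_probability(table, first: str, second: str) -> float:
    """Percent chance that `first` is followed by `second` in the training text."""
    row = table.get(first, {})
    total = sum(row.values())
    return 100.0 * row.get(second, 0) / total if total else 0.0


# A toy corpus; real tables need far more text than this.
sample = "the quick brown fox thought the queen was quite there"
table = build_table(sample)
print("q->u", pair_probability(table, "q", "u"))
print("t->h", pair_probability(table, "t", "h"))
print("w->z", pair_probability(table, "w", "z"))
```

With a corpus this small the percentages are exaggerated, but the shape is the same: common pairs get high values and pairs never seen get zero.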
Step 1) You need a frequency table. I shared two of them on my GitHub; if you want to use them, you can download them and skip to step 2.
1a) Create the table. I'm creating a table called custom.freq.
C:\freq>freq.exe --create custom.freq
1b) You can optionally turn ON case sensitivity if you want the frequency table to count uppercase letters and lowercase letters separately. Without this option the tool will convert everything to lowercase before counting character pairs.
C:\freq>freq.exe -t custom.freq
1c) Next, fill the frequency table with normal text. You might load it with known legitimate host names, such as the Alexa top 1 million most commonly accessed websites (http://s3.amazonaws.com/alexa-static/top-1m.csv.zip). I will just load it up with famous works of literature.
C:\freq>for %i in (txtdocs\*.*) do freq.exe --normalfile %i custom.freq
C:\freq>freq.exe --normalfile txtdocs\center_earth custom.freq
C:\freq>freq.exe --normalfile txtdocs\defoe-robinson-103.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\dracula.txt custom.freq
C:\freq>freq.exe --normalfile txtdocs\freck10.txt custom.freq
Step 2) Measure badness!
Once the frequency table is filled with data, you can start to measure strings to see how probable they are according to the frequency table.
C:\freq>freq.exe --measure google custom.freq
6.59612840648
C:\freq>freq.exe --measure clearing-house custom.freq
12.1836883765
So normal host names score above 5 (at least these two and most others do). We will consider anything above 5 to be good for our tests.
C:\freq>freq.exe --measure asdfl213u1 custom.freq
3.15113061843
C:\freq>freq.exe --measure po24sf92cxlk
Our malicious hosts score less than 5, so 5 seems to be a pretty good benchmark. In my testing it works pretty well for picking out these abnormal host names. But it isn't perfect. Nothing is. One problem is that very short host names and acronyms that are not in the source files you use to build your frequency tables will score below 5. For example, fbi and cia both come up below 5 when I use only classic literature to build my frequency tables. But I am not limited to classic literature. That leads us to step 3.
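To make the measuring step concrete, here is a rough Python sketch of how a string could be scored against a table of character pairs. I am assuming the score is the average percent probability of each pair in the string; the tool's exact formula may differ. The names train, measure and looks_odd are hypothetical, and the training text is a toy.

```python
from collections import Counter

THRESHOLD = 5.0  # the benchmark suggested in the text above


def train(text: str):
    """Return (pair counts, first-character counts) for the given text."""
    text = text.lower()
    pairs, firsts = Counter(), Counter()
    for a, b in zip(text, text[1:]):
        if a.isalnum() and b.isalnum():  # assumption: skip punctuation/spaces
            pairs[a + b] += 1
            firsts[a] += 1
    return pairs, firsts


def measure(name: str, pairs: Counter, firsts: Counter) -> float:
    """Average percent probability of each character pair (assumed formula)."""
    name = name.lower()
    probs = []
    for a, b in zip(name, name[1:]):
        if a.isalnum() and b.isalnum():
            probs.append(100.0 * pairs[a + b] / firsts[a] if firsts[a] else 0.0)
    return sum(probs) / len(probs) if probs else 0.0


def looks_odd(name: str, pairs: Counter, firsts: Counter,
              threshold: float = THRESHOLD) -> bool:
    """True when the name scores below the threshold."""
    return measure(name, pairs, firsts) < threshold


# Toy training data; a real table would be built from large text files.
pairs, firsts = train("google google")
print(measure("google", pairs, firsts), looks_odd("google", pairs, firsts))
print(measure("zzqx", pairs, firsts), looks_odd("zzqx", pairs, firsts))
```

The numbers from a toy table like this will not match the freq.exe output above. The point is only that pairs never seen in training pull the average toward zero.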
Step 3) Tune for your organization.
The real power of frequency tables comes when you tune them to match normal traffic for your network. Two options support this: --normal and --odd. --normal can be given a normal string and it will update the frequency table with that string. Both --normal and --odd can be used with the --weight option to control how much influence the given string has on the probabilities in the frequency table. Its effectiveness is demonstrated by the accompanying YouTube video. Note that marking random host names as --odd is not a good strategy. It simply injects noise into the frequency table. Like everything else in security, identifying all the bad in the world is a losing proposition. Instead, focus on learning normal and identifying anomalies. For example, passing --normal cia --weight 10000 adds 10000 counts of the pair ci and the pair ia to the frequency table and increases the probability of cia.
C:\freq>freq.exe --normal cia --weight 10000 custom.freq
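Here is a small Python sketch of what weighting a normal string does to the pair counts. The function names mark_normal and pair_percent and the starting counts are made up for illustration, and the sketch does not attempt to model --odd.

```python
from collections import Counter


def mark_normal(pairs: Counter, firsts: Counter, text: str, weight: int = 1) -> None:
    """Add `weight` counts for every adjacent character pair in text."""
    text = text.lower()
    for a, b in zip(text, text[1:]):
        pairs[a + b] += weight
        firsts[a] += weight


def pair_percent(pairs: Counter, firsts: Counter, a: str, b: str) -> float:
    """Percent chance that character a is followed by character b."""
    return 100.0 * pairs[a + b] / firsts[a] if firsts[a] else 0.0


# Hypothetical starting table: 'c' was followed by 'h' 90 times, by 'i' 10 times.
pairs = Counter({"ch": 90, "ci": 10})
firsts = Counter({"c": 100})

print("before:", pair_percent(pairs, firsts, "c", "i"))
mark_normal(pairs, firsts, "cia", weight=10000)
print("after: ", pair_percent(pairs, firsts, "c", "i"))
```

A large weight swamps the existing counts for those pairs, which is why it should be reserved for strings you are sure are normal on your network.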
The source code and a Windows executable version of this program can be downloaded from here: https://github.com/MarkBaggett/MarkBaggett/tree/master/freq
Tomorrow in my diary I will show you some other cool things you can do with this approach and how you can incorporate it into your own tools.
Follow me on twitter @MarkBaggett
Want to learn to use this code in your own script or build tools of your own? Join me for Python SEC573 in Las Vegas this September 14th!
What do you think? Leave a comment.
(c) SANS Internet Storm Center. https://isc.sans.edu Creative Commons Attribution-Noncommercial 3.0 United States License.