Intrusion Detection

We tried various machine learning techniques to build an Intrusion (Anomaly) Detector. Intrusions into systems or networks can be caught by observing the data and somehow measuring the deviation of the data from the normal. This involves getting a concept of what normal data looks like. Many different techniques have been applied to classify normal data. These include Nueral Networks, Hidden Markov Models, Bayesian Networks, etc. We tried a few techniques and evaluated the performance of the detector on synthesized data sets.

Frequentist

This detector uses statistical measures to detect anomolies in a system. Our implementation is based on the NIDES detector at SRI. There are two components of the NIDES detector and we concentrate on the statistical component that is described in this document. This is the closest working implementation of a Frequentist that we could find. We modified the detection mechanism to use a chi-squared significance test rather than a maintained distribution function of the statistic. This was done to simplify the evaluation procedure.

Datamining

There has been some recent work by Wenkee Lee and Sal Stolfo on using data mining as a intrusion detection mechanism. In their work, they use RIPPER, a rule based data mining tool by Cohen. We used another data mining tool called CN2, which is similar to RIPPER in that it is also a rule based tool. We used it to detect anomalies in the data sets and found that although the accuracy of the results is excellent, it falls short in detecting a lot of anomalies. We also concluded that it would be difficult for such a tool to detect novel attacks.

Genetic Programming

We then used a new approach to anomaly detection, genetic programming. The closest idea that we could find in the literature was using Genetic Algorithms, which are different from Genetic Programs. We devised a mechanism to represent the problem of anomaly detection in the GP domain. GP is different from GA in the sense that it allows one to use a more sophiscated tree structure to represent a problem than a simple string of bits. The evolution mechanisms are still mutations and crossovers. We used a Genetic Programming Kernel for this detector. Our results show that this approach is promising and can be used for further analysis on different data sets to judge its effects.

Data Files

The data files used during these tests have the following format.

Training Data
Test Data
Keys

Training data has a stream of characters. In these experiments we used a stream with an alphabet size of 6. Test data looks exactly like train data with anomalies injected in it. The key files tell the detector where the anomalies are located in the test file. This is used for evaluation after a detector has finished its process. The actual data files are avilable on mainmon:/usr2/kmct/.

Paper

A write-up of this project is available here. A short presentation was also made.

Source code

Frequentist: Frequentist.tgz
Datamining: Convert_data.tgz
Genetic Programming: gpd.tgz

Running the code

Frequentist

Compiling the source code gives a detector that can be run with command line arguments with the required data-files. It can then be run with the command line options

ad.01 -n alphabet_size -r rate_threshold -d decay_rate -f train_file -t test_file -o out_file

The out_file produced then needs to be run through a few lines of awk and is then used to compare the results with the key file using the checker find_it. Look at the file "run" in the source code for an example.

Data Mining

The source code is just for converting the data to the required format. You need CN2 for running the entire test. CN2 is also modified a little bit (the changed file is cn_print_thing.c) which are in the file "diff". Look at the file "run" to see an example of how to run this detector.

Genetic Programming

You need the Genetic Programming Kernel to run this detector. The code provided here will help to convert the data files into the correct format, and then run the GP using the file "gpd.ini". An example is present in the file "run".

Shahzad Ali <cheeko@cs.cmu.edu>