Detect SYN flooding attacks and port scan attacks

Reference no: EM131204535

1. Purpose

This homework implements algorithms for anomaly detection.

2. Description

Introduction

Compared with signature-based IDS, statistical IDS uses statistical metrics and algorithms to separate anomalous traffic from benign traffic and to tell different types of attack apart. Its advantage is that it can detect previously unknown attacks. DoS attacks happen frequently on the Internet, and SYN flooding is an important form of DoS attack. Port scanning is another main category of malicious intrusion. However, some kinds of port scan can easily be hidden in a SYN flooding attack, and this noise often confuses administrators and leads to the wrong response. How to differentiate SYN flooding from port scans is an active research topic in intrusion detection.

In this project, you will develop key parts of a tiny statistical IDS that differentiates SYN flooding attacks and port scan attacks from benign traffic, and further differentiates SYN flooding from port scans. We supply the main framework and the statistical model that automatically infers the detection threshold. You will find the statistical metrics that best characterize and classify the traffic into three categories (normal, SYN flooding, and port scan) and calculate those metrics from the raw network traffic trace. Then, based on the statistical models, you should be able to detect the SYN flooding attacks and port scan attacks.

This tiny IDS project includes the following steps:

1) Obtain the data set from [1], or https://www.ll.mit.edu/ideval/data/1998data.html, which is the network traffic trace of the DARPA98 IDS evaluation data set.

2) Calculate the metrics you selected

3) Use the statistical metrics calculated from the training data set to train the statistical model.

4) Use the model from (3) and the statistical metrics calculated from the test data set to detect the attacks in the test data set.

Do not copy the whole data set to your home directory, since it is too large; just read the data directly from that directory.

The DARPA98 IDS evaluation data set contains one training data set and one testing data set. For each detection trial, you select one to five statistical metrics and calculate them for both the training and testing data sets.

The HMM model can accept up to 5 metrics to detect the attacks, and the Gauss model can use only one. You need to figure out the metric combination that gives the best detection result, and you can try as many different combinations as you like. Use the statistical metrics calculated for the training data set, which has annotated attacks, to train the statistical model. In the training phase, we give the ground truth of the attacks to the statistical model, which automatically infers the threshold for differentiating the three categories (normal, SYN flooding, and port scan) from the different statistical characteristics of your metrics in each category. Next, use the trained statistical model and the statistical metrics calculated from the testing data set to detect the attacks in the testing data set.

In this assignment, you will use two classical statistical learning models: the k-means clustering algorithm and the Double Gaussian model. You can use the two models together or just use one model or design your own model.

For example, you can use the k-means clustering model to differentiate abnormal traffic (including SYN flooding and port scan) from normal traffic, and use the Double Gaussian model to differentiate SYN flooding from port scan; or you can use only the k-means clustering model to classify the traffic into 3 categories (normal, SYN flooding, and port scan).

For each detection, you can select 1 to 5 statistical metrics. For example, you can calculate the volume of SYN packets minus the volume of SYN ACK packets as a metric, for every 5-minute time interval.

Specification

You first need to write C/C++ programs to calculate the selected metrics for the training and testing data sets, and then use the Matlab program to train the statistical model and perform detection.

Basically you need to do the following:

A. Write the C parsing programs for both the training and testing data. For the training data sets, the output of your program should be a list of metrics plus the annotation flag. Each line presents the metrics calculated from 5 minutes of network traffic packet data. The program for the testing data sets should produce the same output, except that each line carries the time stamp instead of the annotation flag. You also need to add metric-specific code for the metrics you select.

B. Based on the outputs of training and testing, you need to use the Matlab program (e.g., TinyIDS.m) to train the statistical model for detection. The Matlab programs allow you to compare your result with the ground truth to calculate your accuracy.

Getting familiar with DARPA98 data set

As mentioned before, the DARPA98 data set has two parts: the training data set and the testing data set. The training data set contains 7 weeks (35 days) of Tcpdump data. It is in plain text format and consists of 35 data files. Each data file contains one day's network traffic in Tcpdump format.

In Tcpdump format, each line represents a packet transferred on the target network, e.g.:

897048008.080700 172.16.114.169.1024 > 195.73.151.50.25: S 1055330111:1055330111(0) win 512 <mss 1460>

The first column, 897048008.080700, is the time stamp of the packet in absolute time format: 897048008 is the number of seconds since January 1, 1970 (UTC), and .080700 is the fractional part, in microseconds.

The second column, 172.16.114.169.1024, is the source IP address 172.16.114.169 plus the source port 1024. The third column, >, gives the direction of the traffic: from the source IP.port in column 2 to the destination IP.port in column 4.

The fourth column, 195.73.151.50.25, is the destination IP address 195.73.151.50 plus the destination port 25. The fifth column, S, is the TCP flag of the packet; S stands for SYN.

The following columns are other TCP header fields. Note that the ack flag may appear in these fields. For more information about the format, see the tcpdump manual page (man tcpdump).
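The column layout described above can be turned into a small parser. The following is a minimal sketch, not part of the supplied framework: the names `Packet`, `split_endpoint`, and `parse_line` are hypothetical, and it assumes TCP records shaped like the example line (for training data, the caller would first strip the leading annotation column).

```cpp
#include <exception>
#include <sstream>
#include <string>

// Fields of one tcpdump text record (hypothetical struct for illustration).
struct Packet {
    std::string timestamp_text;  // original text, e.g. "897048008.080700"
    long long seconds;           // whole seconds since the Unix epoch
    std::string src_ip;
    int src_port;
    std::string dst_ip;
    int dst_port;
    std::string flags;           // tcpdump flag column, e.g. "S"
    bool has_ack;                // true if the word "ack" appears later in the line
};

// Splits "a.b.c.d.port" at the last dot; returns false if that fails.
static bool split_endpoint(const std::string& endpoint, std::string& ip, int& port) {
    std::string::size_type dot = endpoint.rfind('.');
    if (dot == std::string::npos || dot + 1 >= endpoint.size()) return false;
    try {
        port = std::stoi(endpoint.substr(dot + 1));
    } catch (const std::exception&) {
        return false;
    }
    ip = endpoint.substr(0, dot);
    return true;
}

// Parses one line; returns false for lines that do not match the layout.
// Assumption: non-TCP records (ICMP, UDP) are filtered by the caller
// by inspecting `flags`.
bool parse_line(const std::string& line, Packet& out) {
    std::istringstream in(line);
    Packet p;
    std::string src, arrow, dst;
    if (!(in >> p.timestamp_text >> src >> arrow >> dst >> p.flags)) return false;
    if (arrow != ">") return false;
    // The destination column ends with ':' in the trace; drop it.
    if (dst[dst.size() - 1] == ':') dst.erase(dst.size() - 1);
    try {
        p.seconds = std::stoll(p.timestamp_text);  // stops at the '.'
    } catch (const std::exception&) {
        return false;
    }
    if (!split_endpoint(src, p.src_ip, p.src_port)) return false;
    if (!split_endpoint(dst, p.dst_ip, p.dst_port)) return false;
    p.has_ack = false;
    std::string word;
    while (in >> word) {
        if (word == "ack") { p.has_ack = true; break; }
    }
    out = p;
    return true;
}
```

Keeping the original timestamp text alongside the integer seconds makes it easy to print the time stamp later in exactly the form used by the trace.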

In the training data set, one column is added at the beginning of each line to annotate the category the line belongs to: 1 denotes normal, 2 denotes SYN flooding, and 3 denotes port scan.

The testing data set contains 2 weeks (10 days) of data; use your tiny IDS to detect the anomalies in it. Based on the ground truth* of the testing data set, the TinyIDS.m program outputs the error ratio of your detection program, so you can change or adjust your statistical metrics to get a better detection result.

*Term: ground truth: the real attacks in the data set. If your detection results match the ground truth, you get 100 percent accuracy.

Metric calculation

Because the data sets are quite large, we recommend writing the calculation programs in C/C++ for efficiency. You should write 2 programs (cal_training and cal_testing), one for the training data set and the other for the testing data set. Each program should read the text Tcpdump trace data from stdin and write the calculated metrics to stdout.

We require you to calculate the metrics over every 5 minutes of traffic. For example, you may calculate the volume of SYN packets minus the volume of SYN ACK packets in every 5-minute time interval. Use the time stamp of each packet to determine which interval it falls in.
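As an illustration of the 5-minute bucketing, here is a minimal sketch of the SYN minus SYN ACK example metric. The names `WindowCounts`, `window_index`, and `add_packet` are hypothetical, and aligning windows to multiples of 300 seconds since the epoch is an assumption; the framework may instead expect windows that start at the first packet of each trace.

```cpp
#include <map>
#include <string>

// Width of one measurement window in seconds (5 minutes).
const long long kWindowSeconds = 300;

// Per-window counters for the example metric (hypothetical struct).
struct WindowCounts {
    long long syn = 0;      // SYN packets without ack
    long long syn_ack = 0;  // SYN packets that also carry ack
    long long unresponded() const { return syn - syn_ack; }
};

// Index of the window containing `seconds`.
// Assumption: windows are aligned to multiples of 300 s since the epoch.
long long window_index(long long seconds) {
    return seconds / kWindowSeconds;
}

// Updates the counters of the window that contains this packet.
// `flags` is tcpdump's flag column; `has_ack` says whether "ack"
// appeared among the remaining header fields.
void add_packet(std::map<long long, WindowCounts>& windows,
                long long seconds, const std::string& flags, bool has_ack) {
    if (flags.find('S') == std::string::npos) return;  // not a SYN packet
    WindowCounts& w = windows[window_index(seconds)];
    if (has_ack) {
        ++w.syn_ack;
    } else {
        ++w.syn;
    }
}
```

Because `std::map` keeps its keys sorted, iterating over `windows` afterwards visits the intervals in time order, which is convenient when writing one output line per interval.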

For the metrics_training.txt file, we recommend the following format of each line:
Metric1 Metric2 ... Annotation flag

Put the metrics you select in columns 1 to n, if you select n metrics. The HMM system can currently consider at most 5 metrics simultaneously, so make sure n is no larger than five. Put the annotation flag in the last column. Use one whitespace character to separate the columns.

Note: Within each 5-minute interval, if any packet record is annotated as SYN flooding, the record for that interval should be annotated as SYN flooding; otherwise, if any packet record is annotated as port scan, the record for that interval should be annotated as port scan. Intervals that contain no SYN flooding or port scan packets should be annotated as normal.
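The precedence rule in this note (SYN flooding over port scan over normal) can be sketched as a small merge function applied to every packet in an interval. The names `Label` and `merge_label` are hypothetical; the numeric codes follow the annotation column of the training data (1 normal, 2 SYN flooding, 3 port scan).

```cpp
// Annotation codes used in the training data.
enum Label { kNormal = 1, kSynFlood = 2, kPortScan = 3 };

// Merges one packet's annotation into the running label of its interval.
// SYN flooding outranks port scan, which outranks normal.
Label merge_label(Label interval, Label packet) {
    if (interval == kSynFlood || packet == kSynFlood) return kSynFlood;
    if (interval == kPortScan || packet == kPortScan) return kPortScan;
    return kNormal;
}
```

Starting each interval at `kNormal` and folding every packet's annotation through `merge_label` yields the flag for that interval's output line.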

For the metrics_testing.txt file, we require the following format:
Metric1 Metric2 ... Timestamp

The timestamp is used to compare the ground truth with your detection results, so it is very important for grading. Please make sure you use the same form as the timestamp in the DARPA98 data set.
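A minimal sketch of writing one line of metrics_testing.txt follows. The function name `format_testing_line` is hypothetical, and which packet's time stamp represents an interval (for example, the first packet seen in it) is an assumption that the text does not settle. The sketch passes the time stamp through as the original text so its form is unchanged.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds "Metric1 Metric2 ... Timestamp" with single spaces between columns.
// `timestamp_text` is copied verbatim from the trace to preserve its form.
std::string format_testing_line(const std::vector<double>& metrics,
                                const std::string& timestamp_text) {
    std::ostringstream out;
    // Raise the precision so large packet counts are not printed in
    // scientific notation (the default is 6 significant digits).
    out.precision(10);
    for (std::size_t i = 0; i < metrics.size(); ++i) {
        out << metrics[i] << ' ';
    }
    out << timestamp_text;
    return out.str();
}
```

The training file can be written the same way, with the annotation flag in place of the time stamp in the last column.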

Self-evaluation

After you finish the attack detection on the testing data set, the error ratio E is calculated based on the ground truth.

Since most of your grade for this project will be based on your error ratio E, you should try different metrics and combinations of the two statistical models to get the best results you can. Try to minimize the error ratio E.

Note 1: There are 3 categories: Normal, SYN-flooding, and Portscan. Each time interval that you assign to the wrong category increases your error count by 1. The error ratio E is the total number of errors divided by the total number of time intervals.

Note 2: A more complicated metric is not necessarily better than a simpler one. To select a metric, consider how the attacks affect the characteristics of the network traffic (packets). It is also up to you how to use the two classifiers: one classifier may be enough, or you may need both. Think about the characteristics of these network attacks, and try as many options as you can.

Note 3: In this project, our detection is a preliminary one. From the result, we only know whether each time interval belongs to Normal, SYN-flooding, or Portscan. But some attacks may last for many time intervals, so we do not really know how many attacks we detect and how many we miss. Therefore, we use the error ratio E to evaluate your results. Another reason for this measure is that each interval is classified into one of 3 categories; it is not a simple normal/abnormal question.
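The error ratio defined in Note 1 can be sketched as follows, for checking results outside Matlab. The function name `error_ratio` is hypothetical, and returning -1.0 for empty or mismatched input is an illustrative choice; the supplied TinyIDS.m program remains the reference for grading.

```cpp
#include <cstddef>
#include <vector>

// Fraction of time intervals whose predicted category (1 normal,
// 2 SYN flooding, 3 port scan) differs from the ground truth.
// Returns -1.0 if the inputs are empty or differ in length.
double error_ratio(const std::vector<int>& predicted,
                   const std::vector<int>& truth) {
    if (predicted.empty() || predicted.size() != truth.size()) return -1.0;
    std::size_t errors = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] != truth[i]) ++errors;
    }
    return static_cast<double>(errors) / static_cast<double>(predicted.size());
}
```

For example, with four intervals of which one is misclassified, E is 1/4 = 0.25.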

Evaluation Report

You also need to write an evaluation report which includes the following:

- Your comparative analysis of the statistical metrics chosen: why did you choose these metrics, based on the characteristics of port scans and/or SYN flooding?

- The best metrics and the detection results of the metrics.

- Some theoretical analysis, if possible, of why the metrics you chose are the best ones.

- A list of other metrics you tried but do not consider good. Please include the results of those metrics as well, and explain why, in theory, they are not as good as your best ones.

- A description of any important design decisions you made.

Evaluation

The total points are 100.

A. The metric and the accuracy of your result on the testing data set (we will test it separately)

B. Your evaluation report

You can enter the accuracy contest with your metrics and your implementation. Whoever has the highest detection accuracy will get 30 extra bonus points. If there are ties, then all the parties will get the 30 extra points plus gifts.

Submission
- Zip your entire project. Submit the ZIP and README file. The ZIP should include the following:

1) The source code of the metric calculation C/C++ programs

2) The metrics calculation result: metrics_training.txt and metrics_testing.txt

3) Your evaluation report (make sure to include your detection results in your report)

Reference

[1] Richard P. Lippmann, Robert K. Cunningham, David J. Fried, Isaac Graf, Kris R. Kendall, Seth E. Webster, Marc A. Zissman, "Results of the DARPA 1998 Offline Intrusion Detection Evaluation," slides presented at RAID 1999 Conference, September 7-9, 1999, West Lafayette, Indiana.

[2] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Feb. 1989.

Appendix

To give you some ideas about the statistical metrics you can use, this section lists a few examples. Feel free to use anything else that is not on the list. Note that the listed metrics are only examples, which may or may not be good choices.

- The traffic volume
- The packet volume
- The SYN packet volume
- The unresponded flows (the SYN packet volume minus the SYN ACK packet volume)
- The mean or standard deviation of the packet volume per host/port in the target network
- The number of peers that one host connects with in a time interval
- The unresponded flows divided by the total packets in a time interval
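As an illustration of one listed example, the number of peers that one host connects with in a time interval, here is a minimal sketch. The names `PeerTable`, `add_contact`, and `max_peer_count` are hypothetical, and reducing the per-host counts to a single maximum per interval is only one possible choice, not something the assignment prescribes.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// For one time interval: each source host mapped to the set of distinct
// destination hosts it contacted.
typedef std::map<std::string, std::set<std::string> > PeerTable;

// Records that src_ip sent a packet to dst_ip; repeats are ignored
// because std::set stores each destination once.
void add_contact(PeerTable& peers, const std::string& src_ip,
                 const std::string& dst_ip) {
    peers[src_ip].insert(dst_ip);
}

// Largest number of distinct peers contacted by any single host
// (0 if the interval saw no packets).
std::size_t max_peer_count(const PeerTable& peers) {
    std::size_t best = 0;
    for (PeerTable::const_iterator it = peers.begin(); it != peers.end(); ++it) {
        if (it->second.size() > best) best = it->second.size();
    }
    return best;
}
```

Clearing the table at each interval boundary keeps memory bounded by the traffic of a single 5-minute window.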


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd