Assignment - Machine Learning Modelling
Scenario
WA Cyber Command (WACY-COM) has acquired aggregate data about 200,000 identified cyber-attacks and scans. The data are sourced from a honeypot project which places fake servers across the globe and records attacker activity and techniques. As honeypots are simulated networks and devices, they allow researchers to safely monitor malicious traffic without endangering real computers or networks.
When analysing cyber-attacks, the level of sophistication of attackers can range from low-level scammers right up to Advanced Persistent Threats (APTs), which are often associated with state-sponsored cyber-attacks. The attacker tools and techniques generally vary depending on the sophistication of the attacker.
A research project has been undertaken by WACY-COM to determine what patterns exist in state-sponsored APT attacks.
Typically, a complex attack can involve multiple attacking computers (with different source-IP addresses) and different payloads and targets. By coordinating attacks from multiple devices, the attacks can become more difficult to detect and stop.
Note: The scenario and data are loosely based on real-world cyber threats and attacks. However, this data set has been curated entirely to help you understand the types of data, correlations and issues that you may experience when handling real-world cyber security data.
Data description
The aggregated data available to WACY-COM are described by the following features (with data types given in square brackets):
[Categorical] Port - The port or service that was being attacked on the honey-pot network. Well known ports include 80/443 (Web traffic), 25 (Email reception), 993 (Email collection)
[Categorical] Protocol - The Internet Protocol in use to conduct the attack
[Numeric] Hits - How many 'hits' the attacker made against the network
[Numeric] Average Request Size (Bytes) - Average 'payload' sent by the attacker
[Numeric] Attack Window (Seconds) - Duration of the attack
[Numeric] Average Attacker Payload Entropy (Bits) - An attempt to quantify whether payload data were encrypted (higher Shannon entropy may indicate random data, data obfuscation or encryption; see the brief illustration after this feature list)
[Categorical] Target Honeypot Server OS - The Operating System of the simulated server
[Numeric] Attack Source IP Address Count - How many unique IP addresses were used in the attack
[Numeric] Average ping to attacking IP (milliseconds) - Used to detect 'distance' to the attacker. The average ping time back to the attacker's IP addresses was calculated.
[Numeric] Average ping variability (st. dev.) - High ping variability can indicate a saturated or unreliable link.
[Numeric] Individual URLs requested - How many different URLs were probed or attacked (Only relevant for Web Server ports)
[Categorical] Source OS (Detected) - The detected operating system of the attacking IP address, acquired by scanning and fingerprinting the IP address of the attacking server
[Categorical] Source Port Range - The range of source ports used by the attacker. Typically, 'low' ports are reserved for system services; higher ports are used by end-user applications.
[Categorical] Source IP Type (Detected) - Whether the IP of the attacker can be linked to known proxies/VPNs or TOR (technologies that can be used to hide the real source of the attack), or Likely ISP traffic (which may indicate the attacker is leveraging compromised end-user computers)
[Numeric] IP Range Trust Score - A trust score generated by an existing WACY-COM system. This system integrates with open-source intelligence (OS-Int) databases to identify potentially compromised or malicious IP addresses
[Binary] APT - Was the attack conducted by a known Advanced Persistent Threat actor (APT).
The raw data for the above variables are contained in the WACY-COM.csv file.
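For illustration only (not part of the required tasks): Shannon entropy for a payload treated as a sequence of bytes is H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of byte value i. A minimal R sketch, assuming access to raw payload bytes (which the aggregated WACY-COM data do not include), might look like:

shannon_entropy <- function(bytes) {
  p <- table(as.integer(bytes)) / length(bytes)  # empirical byte frequencies
  -sum(p * log2(p))                              # H = -sum(p_i * log2(p_i))
}

shannon_entropy(as.raw(sample(0:255, 1000, replace = TRUE)))  # near 8 bits/byte (random-looking data)
shannon_entropy(charToRaw("aaaaaaaaaabbbbb"))                 # low entropy (repetitive data)

Values approaching 8 bits per byte are consistent with encrypted, compressed or otherwise obfuscated payloads.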
Objectives
You have been brought on as part of a data analysis team to determine if APT activity can be inferred from other attack parameters.
Task
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other.
Part 1 - General data preparation and cleaning.
Import the WACY-COM.csv (same version as Assignment 1) into R Studio.
Write the appropriate code in R Studio to prepare and clean the WACY-COM master dataset as follows:
Clean the whole dataset based on the feedback received for Assignment 1.
For the feature Source.OS.Detected, merge its categories Windows 10 and Windows Server 2008 together to form a new category, say Windows_All. Similarly, for Target.Honeypot.Server.OS, merge its categories Windows (Desktops) and Windows (Servers) to form the new category named Windows_DeskServ. Further, combine Linux and MacOS (All) to form the category MacOS_Linux. Hint: use the forcats::fct_collapse(.) function.
Log-transform Average.ping.variability using the log(.) function, and remove the original Average.ping.variability column from the dataset (unless you have overwritten it with the log-transformed data). Similarly, transform the following features using the square root, i.e. sqrt(.), function instead.
Hits;
Attack.Source.IP.Address.Count;
Average.ping.to.attacking.IP.milliseconds;
Individual.URLs.requested.
Select only the complete cases using the na.omit(.) function, and name the dataset WACY-COM_cleaned. (An illustrative code sketch for these cleaning steps follows this list.)
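A minimal sketch of the cleaning steps above is shown here for illustration. The file name, column names and factor level labels are assumptions inferred from the feature descriptions in this brief; check names() and levels() on your own import before running anything like this. Note also that WACY-COM_cleaned contains a hyphen, which is not a syntactic R name, so a close variant is used below.

library(forcats)

dat <- read.csv("WACY-COM.csv", stringsAsFactors = TRUE)  # file name assumed

# Merge factor levels (level labels are assumptions; check levels() first)
dat$Source.OS.Detected <- fct_collapse(dat$Source.OS.Detected,
  Windows_All = c("Windows 10", "Windows Server 2008"))

dat$Target.Honeypot.Server.OS <- fct_collapse(dat$Target.Honeypot.Server.OS,
  Windows_DeskServ = c("Windows (Desktops)", "Windows (Servers)"),
  MacOS_Linux      = c("Linux", "MacOS (All)"))

# Log and square-root transforms, overwriting the original columns
dat$Average.ping.variability                  <- log(dat$Average.ping.variability)
dat$Hits                                      <- sqrt(dat$Hits)
dat$Attack.Source.IP.Address.Count            <- sqrt(dat$Attack.Source.IP.Address.Count)
dat$Average.ping.to.attacking.IP.milliseconds <- sqrt(dat$Average.ping.to.attacking.IP.milliseconds)
dat$Individual.URLs.requested                 <- sqrt(dat$Individual.URLs.requested)

# Keep complete cases only
WACYCOM_cleaned <- na.omit(dat)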
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split (see the sketch after the note below). Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as csv files. You may be asked to provide these for verification purposes.
Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you are asked to use 30% of the data only to train your ML models to save time.
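For illustration, one way to produce the 30/70 split under the assumptions above (cleaned data in WACYCOM_cleaned, output file names chosen arbitrarily) is:

set.seed(12345678)  # replace with your own student ID

n         <- nrow(WACYCOM_cleaned)
train_idx <- sample(seq_len(n), size = round(0.30 * n))  # 30% of rows for training

train_set <- WACYCOM_cleaned[train_idx, ]
test_set  <- WACYCOM_cleaned[-train_idx, ]

write.csv(train_set, "WACYCOM_train.csv", row.names = FALSE)
write.csv(test_set,  "WACYCOM_test.csv",  row.names = FALSE)

Alternatively, caret::createDataPartition(WACYCOM_cleaned$APT, p = 0.30, list = FALSE) gives a stratified split that preserves the APT class proportions in both sets.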
Part 2 - Compare the performances of different ML algorithms
Determine your THREE randomly selected supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 ML approaches are given by myModels.
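The actual selection code and candidate list are supplied in the unit materials and are not reproduced in this document. Purely as an illustration of the mechanism, a seeded random draw of three approaches from a hypothetical candidate list might look like:

set.seed(12345678)  # replace with your own student ID

candidate_models <- c("Penalised logistic regression", "Bagging",
                      "Random forest", "Boosting", "SVM")  # hypothetical list
myModels <- sample(candidate_models, size = 3)
myModels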
For each of your three ML modelling approaches, you will need to:
Run the ML algorithm in R on the training set with APT as the outcome variable.
Perform hyperparameter tuning to optimise the model:
Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches. Report on the search range(s) for hyperparameter tuning, which k-fold CV was used, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (i.e. CV results, tables and plots), where appropriate. If you are using repeated CVs, a minimum of 2 repeats is required. (An illustrative tuning sketch follows this list.)
If one of your selected tree models is Bagging, you must tune the nbagg, cp and minsplit hyperparameters simultaneously, with 3 values for each.
If one of your selected tree models is Random Forest, you must tune the num.trees, mtry, and min.node.size hyperparameters simultaneously, with 3 values for each.
Be sure to set the randomisation seed using your student ID prior to training each model.
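As an illustration only (one possible approach, not the required one), a Random Forest tuned with caret and ranger might be set up as below. The grid values are placeholder assumptions; caret's "ranger" method tunes mtry, splitrule and min.node.size in its grid, so num.trees is looped over manually, and train_set$APT is assumed to be a factor (e.g. "No"/"Yes").

library(caret)
library(ranger)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 2)

# 3 values per hyperparameter; ranges are assumptions to be adjusted to your data
grid <- expand.grid(mtry          = c(2, 4, 6),
                    splitrule     = "gini",
                    min.node.size = c(1, 5, 10))

rf_fits <- list()
for (nt in c(250, 500, 1000)) {        # num.trees is not part of caret's grid,
  set.seed(12345678)                   # so loop over it; seed = your student ID
  rf_fits[[as.character(nt)]] <- train(
    APT ~ ., data = train_set,
    method    = "ranger",
    trControl = ctrl,
    tuneGrid  = grid,
    num.trees = nt)
}

# Best CV accuracy for each num.trees value
lapply(rf_fits, function(m) m$results[which.max(m$results$Accuracy), ])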
Evaluate the performance of each ML model on the test set. Provide the confusion matrices (see marking criteria for an example); report and describe them along with the following measures in the context of the project (an illustrative evaluation sketch follows this list of measures):
Sensitivity (the detection rate for APT)
Specificity (the detection rate for non-APT attacks)
Overall Accuracy
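For illustration, once a tuned caret model (here called final_fit, a hypothetical name) is in hand, the test-set confusion matrix and the three measures above can be obtained as follows. The positive class label "Yes" is an assumption; check levels(test_set$APT) on your data.

library(caret)

pred <- predict(final_fit, newdata = test_set)        # predicted APT class on the test set
confusionMatrix(data = pred, reference = test_set$APT,
                positive = "Yes")                     # reports sensitivity, specificity, accuracy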
Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony and, to a lesser extent, interpretability may be taken into account if the decision is close. You may outline your model coefficients (which you can place in the appendix) for your penalised logistic regression model if it helps your argument.
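If penalised logistic regression (e.g. caret's "glmnet" method) ends up among your models, the coefficients at the selected penalty can be extracted as sketched below; plr_fit is a hypothetical name for your tuned caret object.

coef(plr_fit$finalModel, s = plr_fit$bestTune$lambda)  # coefficients at the chosen lambda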
What to submit
Gather your findings into a report (maximum of 4 pages), citing sources where applicable. You may include an appendix (maximum of 2 pages) if appropriate. The minimum required font size is 11.
Outline how and why the data was manipulated, how the ML models were tuned and finally how they performed against each other. You may use graphs and tables where appropriate to help your reader understand your findings.
Make a final recommendation on which ML modelling approach is the best for this task.
Your final report should look professional, include appropriate headings and subheadings, should cite facts and reference source materials in APA-7th format.
Your submission must include the following:
Your report (minimum font size 11, 4 pages or less, excluding cover/contents/appendix/reference pages).
A copy of your R script (not the R workspace), which is to be submitted separately from the report.
Make sure you keep a copy of the training and test sets (in .csv format) in case you are asked to provide them later on.
The report must be submitted through TURNITIN and checked for originality. The R script is to be submitted separately via another submission link on Canvas.
Note that no marks will be given if the results you have provided cannot be confirmed by your code. Furthermore, all pages exceeding the 4-page limit will not be read or examined.
Marking Criteria
Criterion (contribution to assignment mark)
Accurate implementation of data cleaning and of each supervised machine learning algorithm in R.
Strictly about code
The code works from start to finish.
The code is appropriately documented.
External sources referenced in APA 7 referencing style (if applicable).
All the steps are performed correctly.
It is your own work.
Note: At least 80% of the code (excluding the code provided to you above) must align with unit content. Otherwise, a mark of zero will be awarded for this component.