Reference no: EM133840965
Assignment - Data Exploration and Preparation
Scenario
WA Cyber Command (WACY-COM) has acquired aggregate data about 200,000 identified cyber-attacks and scans. The data are sourced from a honeypot project, which places fake servers across the globe and records attacker activity and techniques. As honeypots are simulated networks and devices, they allow researchers to safely monitor malicious traffic without endangering real computers or networks.
When analysing cyber-attacks, the level of sophistication of attackers can range from low-level scammers right up to Advanced Persistent Threats (APTs), which are often associated with state-sponsored cyber-attacks. The attacker tools and techniques generally vary depending on the sophistication of the attacker.
A research project has been undertaken by WACY-COM to determine what patterns exist in state-sponsored APT attacks.
Typically, a complex attack can involve multiple attacking computers (with different source-IP addresses) and different payloads and targets. By coordinating attacks from multiple devices, the attacks can become more difficult to detect and stop.
Note: The scenario and data are loosely based on real-world cyber threats and attacks. However, this data set has been curated entirely to help you understand the types of data, correlations and issues that you may experience when handling real-world cyber security data.
Data description
The aggregated data available to WACY-COM are described by the following features (with data types given in square brackets):
[Categorical] Port - The port or service that was being attacked on the honeypot network. Well-known ports include 80/443 (Web traffic), 25 (Email reception) and 993 (Email collection).
[Categorical] Protocol - The Internet Protocol in use to conduct the attack
[Numeric] Hits - How many 'hits' the attacker made against the network
[Numeric] Average Request Size (Bytes) - Average 'payload' sent by the attacker
[Numeric] Attack Window (Seconds) - Duration of the attack
[Numeric] Average Attacker Payload Entropy (Bits) - An attempt to qualify whether payload data were encrypted (higher Shannon entropy may indicate random data, data obfuscation or encryption)
[Categorical] Target Honeypot Server OS - The Operating System of the simulated server
[Numeric] Attack Source IP Address Count - How many unique IP addresses were used in the attack
[Numeric] Average ping to attacking IP (milliseconds) - Used to detect 'distance' to the attacker. The average ping time 'back' to the attacker's IP addresses was calculated.
[Numeric] Average ping variability (st.dev) - High variability in ping times can indicate a saturated or unreliable link.
[Numeric] Individual URLs requested - How many different URLs were probed or attacked (Only relevant for Web Server ports)
[Categorical] Source OS (Detected) - The detected operating system of the attacking IP address, acquired by scanning and fingerprinting the IP address of the attacking server.
[Categorical] Source Port Range - The range of source ports used by the attacker. Typically, 'low' ports are reserved for system services. Higher ports are used by end-user applications.
[Categorical] Source IP Type (Detected) - Whether the IP of the attacker can be linked to known proxies/VPNs or TOR (technologies that can be used to hide the real source of the attack), or Likely ISP traffic (which may indicate the attacker is leveraging compromised end-user computers)
[Numeric] IP Range Trust Score - A trust score generated by an existing WACY-COM system. This system integrates with open-source intelligence (OS-Int) databases to identify potentially compromised or malicious IP addresses.
[Binary] APT - Was the attack conducted by a known Advanced Persistent Threat actor (APT).
The raw data for the above variables are contained in the WACY-COM.csv file.
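To make the payload entropy feature concrete, the following minimal sketch computes Shannon entropy in bits per byte for a vector of raw bytes. The function name shannon_entropy and the two sample payloads are illustrative assumptions; the brief does not state how WACY-COM actually derived the entropy values in the dataset.

```r
# Shannon entropy (bits per byte) of a raw byte vector.
# Hypothetical helper: the dataset's own entropy calculation is not documented.
shannon_entropy <- function(bytes) {
  if (length(bytes) == 0) return(0)
  counts <- table(as.integer(bytes))        # frequency of each byte value seen
  p <- as.numeric(counts) / length(bytes)   # empirical probabilities
  -sum(p * log2(p))
}

# Illustrative payloads (not taken from the dataset)
repetitive <- charToRaw(strrep("A", 64))  # a single repeated symbol
uniform    <- as.raw(0:255)               # every byte value exactly once

shannon_entropy(repetitive)  # lowest possible value for byte data
shannon_entropy(uniform)     # highest possible value for byte data
```

A payload made of one repeated byte sits at the bottom of the scale, while one in which all 256 byte values are equally likely sits at the top (8 bits), which is why encrypted or compressed payloads tend to score high.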
Objectives
You have been brought on as part of a data analysis team to determine if APT activity can be inferred from other attack parameters.
Your task is to perform data exploration and basic analysis, identify issues in the dataset, and recommend appropriate actions to address them.
Task
First, copy the code below to an R script. Enter your student ID into the command set.seed(.) and run the whole script. The code will create a sub-sample that is unique to you.
You are required to perform basic data analysis on the relevant features in mydata using R and report your findings.
Exploratory Data Analysis and Data Cleaning
(i) For each categorical or binary variable, determine the frequency N and percentage (%) of instances in each category and summarise the results in a table as follows. You do not need to recreate the table in R; your code only needs to generate the statistics required to populate it. You may export or copy the values to Microsoft Excel and format the table there. State all percentages to 1 decimal place.
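One possible way to generate the counts and percentages is sketched below. It uses a small toy data frame rather than mydata; the column names, the category values and the helper freq_table are all hypothetical and would need to be adapted to the actual columns of your sub-sample.

```r
# Toy stand-in for mydata; columns and values are hypothetical.
toy <- data.frame(
  Protocol = c("TCP", "TCP", "UDP", "ICMP", "TCP", "UDP"),
  APT      = c("Yes", "No", "No", "No", "Yes", "No"),
  stringsAsFactors = TRUE
)

# Frequency (N) and percentage (to 1 decimal place) for one variable.
# useNA = "ifany" keeps missing values visible as their own row.
freq_table <- function(x) {
  n <- table(x, useNA = "ifany")
  data.frame(
    Category = names(n),
    N        = as.vector(n),
    Percent  = round(100 * as.vector(n) / sum(n), 1)
  )
}

# One summary per categorical/binary column
summaries <- lapply(toy, freq_table)
summaries$Protocol
```

Keeping missing values as a separate row makes invalid or empty categories easy to spot before any cleaning decisions are made.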
(ii) Summarise each of your continuous/numeric variables in a table as follows. State all decimal values to 1 decimal place.
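A sketch of how the numeric summaries might be produced is given below. The set of statistics (missing count, minimum, median, mean, maximum, standard deviation) is an assumption, since the required table layout is not reproduced here; the toy data frame, its column names and the helper num_summary are likewise hypothetical.

```r
# Toy stand-in for mydata; columns and values are hypothetical.
toy <- data.frame(
  Hits = c(10, 12, 11, 13, NA, 250),
  Attack.Window.Seconds = c(30.5, 42.1, 38.0, 35.2, 40.7, 39.9)
)

# Summary statistics for one numeric variable, ignoring missing values
# but reporting how many there were.
num_summary <- function(x) {
  c(
    N_missing = sum(is.na(x)),
    Min       = min(x, na.rm = TRUE),
    Median    = median(x, na.rm = TRUE),
    Mean      = mean(x, na.rm = TRUE),
    Max       = max(x, na.rm = TRUE),
    SD        = sd(x, na.rm = TRUE)
  )
}

# Apply to every numeric column; one row per variable, rounded to 1 d.p.
numeric_cols  <- toy[sapply(toy, is.numeric)]
summary_table <- round(t(sapply(numeric_cols, num_summary)), 1)
summary_table
```

The resulting matrix can be written out with write.csv(summary_table, "numeric_summary.csv") (file name chosen here for illustration) and formatted in Excel.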
(iii) Examine the values in the tables in parts (i) and (ii). Are there any invalid categories/values for the categorical variables? If so, how will you deal with them and why? Is there any evidence of outliers for any of the continuous/numeric variables? If so, how many are there, what percentage of the observations do they represent, and how will you deal with them? Justify your decision in the treatment of outliers (if any).
Note: You may use plots/graphs to further support your observations/decisions.
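As one way to count outliers and express them as a percentage, the sketch below applies the common 1.5 x IQR rule (Tukey's fences). This rule is an assumption: the brief does not prescribe an outlier definition, and your unit notes may recommend a different one. The helper iqr_outliers and the sample values are hypothetical.

```r
# Count values outside Q1 - k*IQR and Q3 + k*IQR (k = 1.5 by convention).
# Missing values are never flagged, but they stay in the denominator,
# so Percent is relative to all observations.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  flagged <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  c(Count = sum(flagged), Percent = round(100 * mean(flagged), 1))
}

# Illustrative values (not taken from the dataset)
hits <- c(10, 12, 11, 13, 12, 11, 10, 14, 13, 250)
iqr_outliers(hits)

# A boxplot uses the same 1.5 x IQR whisker rule by default:
# boxplot(hits, main = "Hits")
```

Whether a flagged value is a data-entry error or a genuine extreme attack is a judgement call, so the count from this rule is a starting point for the justification rather than a decision in itself.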
A single report, not exceeding three (3) pages (excluding the cover page, contents page, and references, if applicable), containing:
summary tables of all the variables in the dataset;
a list of data issues (if any) and how you will deal with them in the data cleaning process.
Solutions should be in the order that the questions were posed in the assignment.
If you reference any sources in your analysis or discussion beyond the notes provided in the unit, you must cite them, including the use of ChatGPT or any other generative AI platform.
The dataset containing your sub-sample of 400 observations, i.e., mydata.
A copy of your R code.