Design and build a binary classifier over the dataset

Reference no: EM132285305

Big Data Analytics using Hadoop and Spark

Tasks:

(1) Understanding Dataset:

The raw network packets of the UNSW-NB15 dataset were created with the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump tool was used to capture 100 GB of raw traffic (e.g., pcap files). The dataset has nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.

a) The features are described here (attached).

b) The number of records per traffic type is described here (attached).

c) In this coursework, we use a total of 2,540,044 records stored in the CSV file (download). The total size is 560 MB, which is large enough to justify big data methodologies for analysis. As a big data specialist, you should first read and understand its features, then apply modelling techniques. To see a few records of this dataset, you can import it into Hadoop HDFS and then run a Hive query that prints the first 5-10 records.
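Before loading the file into HDFS, it can help to glance at a few rows locally. The following is a minimal sketch in plain Python, not the Hive route the brief asks for. It assumes the CSV is comma-separated, and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from itertools import islice


def peek_records(lines, n=5):
    """Return the first n parsed CSV rows from an iterable of text lines."""
    return list(islice(csv.reader(lines), n))


# Tiny in-memory stand-in for the real file; the values are made up.
sample = io.StringIO("10.0.0.1,80,tcp,0\n10.0.0.2,53,udp,1\n10.0.0.3,22,tcp,0\n")
for row in peek_records(sample, n=2):
    # Print the column count next to each row to check the expected width.
    print(len(row), row)
```

With the real file, the same function can be given an open file handle (opened with `newline=""`), so only the first few lines are read rather than all 560 MB.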

(2) Big Data Query & Analysis by Apache Hive
This task uses Apache Hive to convert big raw data into useful information for end users. First, study the dataset carefully. Then write at least four Hive queries that extract information from this big dataset. Apply appropriate visualization tools to present your findings numerically and graphically. Briefly interpret your findings. Finally, include screenshots of your scripts/code in the report.

Tip: the mark for this section depends on the complexity of the Hive queries; for instance, a simple SELECT query alone is not expected to earn full marks.
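To illustrate what a query beyond a simple SELECT might look like, here is a hedged sketch of an aggregate with GROUP BY, HAVING and ORDER BY. It runs on Python's built-in sqlite3 as a stand-in, not on Hive, so that it is runnable anywhere; the SQL shape is the point. The table name `flows`, the columns `attack_cat`, `label` and `sbytes`, and all rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema and made-up rows; the real table has 49 features.
rows = [
    ("Normal", 0, 100), ("Normal", 0, 300),
    ("DoS", 1, 900), ("DoS", 1, 1100), ("DoS", 1, 1000),
    ("Worms", 1, 50),
]


def summarise_by_category(records, min_count=2):
    """Count flows and average source bytes per category, largest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flows (attack_cat TEXT, label INTEGER, sbytes INTEGER)"
    )
    conn.executemany("INSERT INTO flows VALUES (?, ?, ?)", records)
    query = """
        SELECT attack_cat,
               COUNT(*)    AS n_flows,
               AVG(sbytes) AS avg_sbytes,
               SUM(label)  AS n_attacks
        FROM flows
        GROUP BY attack_cat
        HAVING COUNT(*) >= ?
        ORDER BY n_flows DESC
    """
    result = conn.execute(query, (min_count,)).fetchall()
    conn.close()
    return result


print(summarise_by_category(rows))
```

Categories with fewer than `min_count` flows are filtered out by the HAVING clause, which is the kind of step that separates an analytical query from a plain row listing.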

(3) Advanced Analytics using PySpark
In this section, you will conduct advanced analytics using PySpark.

Analyze and Interpret Big Data
a) We need to learn and understand the data through 3-4 descriptive analysis methods. Present your work numerically and graphically. Apply tooltip text, legends, titles, X-Y labels, etc. as appropriate to help end users gain insights.

b) Apply 3-4 advanced statistical analysis methods (e.g., correlation, hypothesis testing, density estimation and so on) to interpret the data precisely. Write a report of your methods and their configurations, and interpret your findings.
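As one example of the methods listed above, the Pearson correlation between two numeric features can be sketched in plain Python as follows. The feature names `duration` and `n_bytes` and their values are made up for illustration; on a Spark DataFrame, `DataFrame.stat.corr` computes the same quantity at scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Unnormalised covariance and variances; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for two hypothetical features.
duration = [1.0, 2.0, 3.0, 4.0]
n_bytes = [10.0, 20.0, 30.0, 40.0]
print(pearson(duration, n_bytes))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; it says nothing about non-linear dependence.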

Design and Build a Classifier
a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration. Explain your findings in both numerical and graphical representations.

b) How do you evaluate the performance of the model?

c) How do you verify the accuracy and the effectiveness of your model?
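One common way to approach questions b) and c) is to compare predictions against held-out labels and derive metrics from the confusion matrix. The sketch below is a plain-Python illustration under that assumption, not a prescribed method; the label vectors are made up, with 1 meaning attack and 0 meaning normal.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up held-out labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Accuracy alone can mislead when one class dominates the data, which is why precision, recall and F1 are reported alongside it here.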

d) Apply a multi-class classifier to classify the data into ten classes: one normal and nine attack classes (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms). Briefly explain your model with supporting statements on its parameters, accuracy and effectiveness.
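For a ten-class model, a single accuracy figure can hide poor performance on rare classes, so per-class scores are often reported as well. The following is a minimal sketch of per-class and macro-averaged F1 in plain Python; the choice of macro-F1 is an assumption, not something the brief requires, and the labels and predictions are made up.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each label seen in the true or predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    if not scores:
        raise ValueError("no labels to score")
    return sum(scores.values()) / len(scores)


# Made-up labels using three of the ten classes.
y_true = ["Normal", "Normal", "DoS", "DoS", "Worms", "Worms"]
y_pred = ["Normal", "Normal", "DoS", "Normal", "Worms", "DoS"]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because the macro average weights every class equally, a rare class with weak recall lowers the score as much as a common one would.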

(4) Individual Assessment
Discuss (1) what you learned from this coursework, (2) what alternative technologies are available for tasks 2 and 3 and how they differ (use academic references), and (3) what surprisingly new thinking was evoked and/or neglected at your end.
Tip: add the individual assessment of each member in the same report.

(5) Documentation
Document all your work. Your final report must follow the 5 sections detailed in the "format of final submission" section (refer to the next page). Your work must demonstrate appropriate understanding of academic writing and integrity.

Attachment:- Big Data Analytics.rar


Topic, total mark, and remarks (breakdown of marks for each sub-task):

Big Data Analytics using Hive: 30
(20) Provide big data query and analysis by Apache Hive.
(10) Visualize the outcomes of queries into the graphical representations to get big insights.

Big Data Analytics using Spark: 50
(30) Design and build advanced analytics over the big data for converting raw data to knowledge.
(10) Visualize the outcomes into the graphical representations.
(10) Evaluate the accuracy of the models.

Individual assessment: 10
(3) Express new understanding and knowledge of the topic.
(5) Find alternative solutions for high level query languages and analytics approaches.
(2) Express findings from big data analytics with relevant theories.

Documentation: 10
(10) Write down a scientific report.

Total: 100


Do the project on Hive and Apache Spark.

• Cover sheet to be attached to the front of the assignment when submitted
• Question paper to be attached to assignment when submitted
• All pages to be numbered sequentially
• All work has to be presented in a ready to submit state upon arrival at the ACE Helpdesk. Assignment cover sheets or stationery will NOT be provided by Helpdesk staff

This coursework must be attempted in groups of 2-3 students. It is divided into two sections: (1) Big Data analytics on a real case study and (2) group presentation. All members of the group must attend on the presentation date. If you do not turn up on the presentation date, you will fail the module. The overall mark for the coursework comes from two main activities:

1- Big Data Analytics (around 3,000 words, with a tolerance of ± 10%) (70%)
2- Presentation (30%)
