Data mining project

Reference no: EM13999715

The project must be carried out using any programming language or one of the suggested platforms and libraries. References to them are listed here and are also available on Blackboard.

·         KNIME, open source Data Mining platform (https://www.knime.org).

·         Weka, open source ML library in Java (https://www.cs.waikato.ac.nz/ml/weka).

·         R, free programming language for statistical computing (https://www.r-project.org).

The following data files are required for this coursework and are provided in Blackboard:

·         wine.csv (data file for tasks 1 and 2)

·         training100Ku.csv (data file for task 3)

·         test1K.csv (data file for task 3)

Wine dataset for Task #1 and Task #2

The data set (wine.csv) is obtained from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 chemical constituents found in each wine. Each data record contains the cultivar ID (1, 2 or 3) and 13 numerical attributes.

Task #1 – Data Exploration and Clustering

You are required to perform a clustering analysis for the multidimensional data set indicated above. This task has to be carried out two times: with and without normalisation.

Task1.1: Clustering without normalisation

Apply Principal Component Analysis (PCA) to generate two-dimensional coordinates and a 2D plot (plot1) of the records. Each data point in plot1 should be coloured according to its class label. Apply a clustering algorithm to the data set to generate three partitions. Generate a second 2D plot (plot2) based on the same PCA projection, where the colour is associated with the cluster ID (use different colours from those in plot1), and compare it with plot1. For the records assigned to each cluster, generate a 2D plot (plot3a, plot3b, plot3c) coloured by class label (using the same colours as plot1), and visually verify the distribution of class labels in each cluster.
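The PCA projection described above can be sketched as follows. This is a minimal illustration using NumPy's singular value decomposition, not a prescribed solution: the function name `pca_2d` and the synthetic matrix `X_demo` are assumptions introduced here, and the random values only stand in for the 13 numerical attributes of wine.csv. The columns are mean-centred but deliberately not rescaled, which matches the "without normalisation" setting.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)  # PCA works on mean-centred columns
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)
    coords = centred @ Vt[:2].T   # n x 2 coordinates for a scatter plot
    explained = singular_values**2 / np.sum(singular_values**2)
    return coords, explained[:2]

# Synthetic stand-in for the 13 attributes (hypothetical values, NOT wine.csv).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 13)) * np.linspace(1.0, 50.0, 13)

coords, explained = pca_2d(X_demo)
print(coords.shape)   # two coordinates per record
print(explained)      # fraction of variance captured by each component
```

The same `coords` array can be reused for plot1, plot2 and plot3a-c, changing only the colour assigned to each point, so that all plots share one projection.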

Select, describe and apply at least one cluster validity measure, and include the results in the report.

Task1.2: Clustering with normalisation

Apply a normalisation pre-processing step to the data set and repeat the steps of Task 1.1. Compare the new plots and the cluster validity measure with the previous ones.
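As an illustration of why the comparison matters, the sketch below applies z-score normalisation and computes the mean silhouette coefficient, one possible cluster validity measure, before and after. Both choices are assumptions made for this example: the task leaves the normalisation method and the validity measure open. The six records and their cluster labels are hypothetical toy values in which a large-scale attribute hides the grouping carried by a small-scale one.

```python
import numpy as np

def zscore(X):
    """Rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std

def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = dist[i, same].mean()  # mean distance within the own cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Hypothetical toy data: column 0 separates the two clusters, column 1 is
# large-scale noise that dominates raw Euclidean distances.
X_toy = [[0.0, 100], [0.1, 300], [0.2, 200],
         [1.0, 150], [1.1, 350], [1.2, 250]]
labels_toy = [0, 0, 0, 1, 1, 1]

raw_score = silhouette(X_toy, labels_toy)
norm_score = silhouette(zscore(X_toy), labels_toy)
print(raw_score, norm_score)
```

On this toy data the score is higher after normalisation because both attributes then contribute on a comparable scale. Whether the same holds for the wine data has to be checked on the actual records.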

The submission for Task #1 must contain two components:

·         a report section dedicated to your solution for Task #1,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

 

Task #2 – Comparison of Classification Models

You are required to learn and test classification models for the wine data set. For this task you need to carry out a performance comparison of TWO different classification algorithms. You should use a 10-fold cross-validation method to estimate the generalisation error.

In the report you should briefly describe the two algorithms and the method used to compare them.
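A 10-fold cross-validation loop can be sketched in plain Python as shown below. The helper names `k_fold_indices` and `cross_validated_error`, the `fit`/`predict` callables and the majority-class baseline are all hypothetical and serve only to make the example runnable. In an actual comparison each of the two chosen algorithms would be plugged in through the same interface and evaluated on the same folds.

```python
import random
from collections import Counter

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validated_error(fit, predict, X, y, k=10, seed=0):
    """Average misclassification rate over the k held-out folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k

# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, record):
    return model

X_cv = list(range(50))        # placeholder records
y_cv = [1] * 30 + [2] * 20    # placeholder class labels
baseline_error = cross_validated_error(fit_majority, predict_majority, X_cv, y_cv)
print(baseline_error)
```

Using the same `seed` for both algorithms keeps the folds identical, so any difference in the estimated error is not caused by a different split.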

The submission for Task #2 must contain two components:

·         a report section dedicated to your solution for Task #2,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

 

Task #3 – The Search for the God Particle: a Binary Classification Challenge

CERN's Large Hadron Collider (LHC) typically produces approximately 10^11 collisions per hour and about 300 (0.0000003%) of these collisions result in a Higgs boson, the so-called God particle. Detecting when interesting particles are produced is an important challenge, which is typically studied by the use of simulations. The data set for this task is related to simulations of collision events, which can be used to train a classification model to distinguish between collisions producing particles of interest (signal) and those producing other particles (background).

Two data files are provided: the training set (training100Ku.csv) and the test set (test1K.csv). The training set file has 100,000 records, each containing, in this order, 21 numerical low-level attributes, 7 high-level attributes and the class label (signal/background). The low-level attributes are kinematic properties measured by the particle detectors in the accelerator during the experiment. The high-level attributes are computed after the experiment by means of a complex model as a function of the low-level attributes (feature transformation).

The test set has 1,000 records, each containing a unique record identifier and 21 numerical low-level attributes (the same measurements in the same order as in the training set). The 7 high-level attributes and the class label are not present.

Your task is to predict the class label for the records of the test set. The resulting predictions must be submitted as a single file (CSV format) with only two columns: the record ID and the predicted class label (signal/background).
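The two-column submission file could be produced along these lines. This is a sketch under stated assumptions: the function `write_predictions` and the demo IDs and labels are hypothetical, and no header row is written because the specification does not say whether one is expected.

```python
import csv
import os
import tempfile

ALLOWED_LABELS = {"signal", "background"}

def write_predictions(path, record_ids, labels):
    """Write one 'record ID, predicted label' row per test record."""
    rows = list(zip(record_ids, labels))
    for record_id, label in rows:
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {label!r} for record {record_id}")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)

# Hypothetical IDs and predictions, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "Task3-predictions.csv")
write_predictions(demo_path, [1, 2, 3], ["signal", "background", "signal"])

with open(demo_path, newline="") as handle:
    written_rows = list(csv.reader(handle))
print(written_rows)
```

Validating the labels before opening the file avoids leaving a partially written submission behind if a prediction has an unexpected value.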

You must also include a section in the report describing the method used to generate the submitted predictions, together with an estimate of the following performance indices: accuracy, F-measure, precision and recall.
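The four indices can be computed from the confusion counts as sketched below. Because the test set carries no class labels, the estimate has to come from labelled records held out of the training set. The function name `binary_metrics`, the choice of "signal" as the positive class and the eight example labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive="signal"):
    """Return accuracy, precision, recall and F-measure for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure (F1): harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical held-out labels and predictions.
y_true = ["signal", "signal", "signal", "background",
          "background", "background", "background", "signal"]
y_pred = ["signal", "signal", "background", "background",
          "background", "signal", "background", "signal"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four indices equal 0.75.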

In summary, the submission for Task #3 must contain three components:

·         a report section dedicated to your solution for Task #3,

·         any KNIME workflow(*) and/or source code used (a zip/jar archive) and

·         the file “Task3-predictions.csv”.

 

 

 

(*) Important: do not include data when you export a KNIME workflow as a zip archive.
