Data mining project

Reference no: EM13999715

The project must be carried out using any programming language or one of the suggested platforms and libraries. References to them are listed here and are also available on Blackboard.

· KNIME, open source Data Mining platform (https://www.knime.org).

· Weka, open source ML library in Java (https://www.cs.waikato.ac.nz/ml/weka).

· R, free programming language for statistical computing (https://www.r-project.org).

The following data files are required for this coursework and are provided in Blackboard:

· wine.csv (data file for Tasks 1 and 2)

· training100Ku.csv (data file for Task 3)

· test1K.csv (data file for Task 3)

Wine dataset for Task #1 and Task #2

The data set (wine.csv) is obtained from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 chemical constituents found in each wine. Each data record contains the cultivar ID (1, 2 or 3) and 13 numerical attributes.

Task #1 – Data Exploration and Clustering

You are required to perform a clustering analysis for the multidimensional data set indicated above. This task has to be carried out two times: with and without normalisation.

Task 1.1: Clustering without normalisation

Apply Principal Component Analysis (PCA) to generate two-dimensional coordinates and a 2D plot (plot1) of the records. The data points in plot1 should be coloured according to their class label. Apply a clustering algorithm to the data set to generate three partitions. Generate a second 2D plot (plot2) based on the same PCA projection, in which the colour indicates the cluster ID (use colours different from those in plot1), and compare it with plot1. For the records in each cluster, generate a 2D plot (plot3a, plot3b, plot3c) coloured by class label (same colours as plot1), and visually verify the distribution of class labels in each cluster.
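The projection and clustering steps described above could be sketched as follows. This is only an illustration in plain Python, under several assumptions: PCA is computed by power iteration on the sample covariance matrix, the clustering algorithm is k-means (the task leaves the choice open) with a simple farthest-first initialisation, and toy_rows is made-up data, not wine.csv. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centre(rows):
    """Subtract the column means from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def principal_components(rows, n_components=2, iterations=500, seed=0):
    """Leading eigenvectors of the sample covariance matrix (power iteration)."""
    centred = centre(rows)
    d = len(centred[0])
    cov = [[sum(r[i] * r[j] for r in centred) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            w = [dot(cov_row, v) for cov_row in cov]
            # Gram-Schmidt step: stay orthogonal to components already found.
            for c in components:
                p = dot(w, c)
                w = [wi - p * ci for wi, ci in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [wi / norm for wi in w]
        components.append(v)
    return components


def project(rows, components):
    """Coordinates of each centred record on the given components."""
    return [[dot(row, c) for c in components] for row in centre(rows)]


def kmeans(rows, k=3, iterations=100):
    """Lloyd's algorithm with a deterministic farthest-first initialisation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy records (NOT the wine data): three separated groups in 3 dimensions.
toy_rows = [
    [0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0],
    [10.0, 1.0, 0.0], [11.0, 0.5, 0.5], [10.5, 1.5, 0.0], [10.0, 0.0, 1.0],
    [20.0, 0.0, 1.0], [21.0, 1.0, 0.5], [20.5, 0.5, 0.0], [20.0, 1.0, 1.0],
]
components = principal_components(toy_rows)
coords = project(toy_rows, components)  # 2D coordinates shared by all plots
clusters = kmeans(toy_rows, k=3)        # cluster IDs used to colour plot2
```

The same coords list would be reused for plot1 (coloured by class label), plot2 (coloured by clusters) and the per-cluster plots, so that all plots share one projection. The plotting itself is left to whichever tool is used.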

Select, describe and apply at least one cluster validity measure, and include the results in the report.

Task 1.2: Clustering with normalisation

Apply a normalisation pre-processing step to the data set and repeat the steps of Task 1.1. Compare the new plots and the cluster validity measure with the previous ones.
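As an illustration of the normalisation step and of one possible cluster validity measure, the sketch below uses z-score standardisation and the mean silhouette coefficient. Both are assumptions made for the example: the task does not prescribe a particular normalisation (min-max scaling would also qualify) or a particular validity measure. The function names are hypothetical.

```python
import math
import statistics


def zscore(rows):
    """Rescale every column to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, sds)] for row in rows]


def mean_silhouette(rows, labels):
    """Mean silhouette coefficient; needs at least two clusters.

    Values lie in [-1, 1]; higher means tighter, better separated clusters.
    """
    scores = []
    for i, row in enumerate(rows):
        by_cluster = {}  # cluster ID -> distances from record i to its members
        for j, other in enumerate(rows):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(row, other))
        own = by_cluster.get(labels[i])
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = statistics.fmean(own)  # mean distance within the own cluster
        b = min(statistics.fmean(d)  # mean distance to the nearest other cluster
                for lab, d in by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return statistics.fmean(scores)
```

Without normalisation, attributes measured on larger numeric scales dominate the Euclidean distances used by many clustering algorithms, which is why the two runs can produce different partitions. Computing mean_silhouette on both the raw and the zscore-transformed records gives one number per run to compare.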

The submission for Task #1 must contain two components:

· a report section dedicated to your solution for Task #1,

· any KNIME workflow(*) and source code used (a zip/jar archive).

Task #2 – Comparison of Classification Models

You are required to learn and test classification models for the wine data set. For this task you need to carry out a performance comparison of TWO different classification algorithms. You should use a 10-fold cross-validation method to estimate the generalisation error.

In the report you should briefly describe the two algorithms and the method used to compare the two algorithms.
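A 10-fold cross-validation comparison could be organised along the following lines. This is a minimal sketch under stated assumptions: the two classifiers (1-nearest-neighbour and nearest-centroid) are placeholders chosen only because they fit in a few lines, both are evaluated on the same folds, and a paired t statistic on the per-fold accuracies is just one possible comparison method. All names are hypothetical.

```python
import math
import random
import statistics


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the record indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbour(train_x, train_y, test_x):
    """Predict the label of the closest training record (1-NN)."""
    return [train_y[min(range(len(train_x)),
                        key=lambda i: math.dist(train_x[i], x))]
            for x in test_x]


def nearest_centroid(train_x, train_y, test_x):
    """Predict the label whose class mean is closest."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return [min(centroids, key=lambda lab: math.dist(centroids[lab], x))
            for x in test_x]


def cross_validate(classifier, xs, ys, folds):
    """Accuracy on each held-out fold, training on the remaining folds."""
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out]
        predicted = classifier(train_x, train_y, [xs[i] for i in fold])
        correct = sum(p == ys[i] for p, i in zip(predicted, fold))
        accuracies.append(correct / len(fold))
    return accuracies


def paired_t_statistic(acc_a, acc_b):
    """t statistic of the per-fold accuracy differences (k - 1 d.o.f.)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))
```

One minus the mean of the per-fold accuracies is the estimate of the generalisation error. Because the training sets of different folds overlap, the per-fold accuracies are not independent, so a t statistic computed this way should be interpreted with caution.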

The submission for Task #2 must contain two components:

· a report section dedicated to your solution for Task #2,

· any KNIME workflow(*) and source code used (a zip/jar archive).

Task #3 – The Search for the God Particle: a Binary Classification Challenge

CERN's Large Hadron Collider (LHC) typically produces approximately 10^11 collisions per hour, and about 300 (0.0000003%) of these collisions result in a Higgs boson, the so-called God particle. Detecting when interesting particles are produced is an important challenge, which is typically studied by the use of simulations. The data set for this task is related to simulations of collision events, which can be used to train a classification model to distinguish between collisions producing particles of interest (signal) and those producing other particles (background).
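The quoted percentage can be checked with exact arithmetic, reading the collision rate as 10^11 per hour (the value consistent with the stated 0.0000003%). The variable names below are illustrative only.

```python
from fractions import Fraction

collisions_per_hour = 10 ** 11  # collision rate, read as 10^11 per hour
higgs_per_hour = 300            # collisions that result in a Higgs boson

fraction = Fraction(higgs_per_hour, collisions_per_hour)  # 3 in 10^9
percentage = fraction * 100                               # 3 in 10^7

print(f"{float(percentage):.7f}%")  # prints 0.0000003%
```

In other words, roughly three collisions in every billion are of interest.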

Two data files are provided: the training set (training100Ku.csv) and the test set (test1K.csv). The training set file has 100,000 records, each containing, in this order, 21 numerical low-level attributes, 7 high-level attributes and the class label (signal/background). The low-level attributes are kinematic properties measured by the particle detectors in the accelerator during the experiment. The high-level attributes are computed after the experiment by means of a complex model as a function of the low-level attributes (feature transformation).

The test set has 1,000 records, each containing a unique record identifier and 21 numerical low-level attributes (the same measurements in the same order as in the training set). The 7 high-level attributes and the class label are not present.
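Since the high-level attributes are missing from the test set, a model that is to be applied to it can only use the 21 low-level attributes (or features derived from them). The column layout described above could be handled as sketched below. The assumptions here are that each record has already been parsed into a list of strings, that the record identifier is the first column of a test record, and that any header row has been skipped. None of this is confirmed by the task description, and the function names are hypothetical.

```python
N_LOW_LEVEL = 21   # low-level attributes, present in both files
N_HIGH_LEVEL = 7   # high-level attributes, training file only


def split_training_row(row):
    """Return (low_level, high_level, label) for one training record."""
    expected = N_LOW_LEVEL + N_HIGH_LEVEL + 1
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    low = [float(v) for v in row[:N_LOW_LEVEL]]
    high = [float(v) for v in row[N_LOW_LEVEL:N_LOW_LEVEL + N_HIGH_LEVEL]]
    label = row[-1]
    return low, high, label


def split_test_row(row):
    """Return (record_id, low_level) for one test record.

    Assumes the identifier comes first, followed by the 21 attributes.
    """
    expected = 1 + N_LOW_LEVEL
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    return row[0], [float(v) for v in row[1:]]
```

Keeping the high-level attributes separate also leaves the option of using them during training only, for example as targets for derived features.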

Your task is to predict the class label for the records of the test set. The resulting predictions must be submitted as a single file (CSV format) with only two columns: the record ID and the predicted class label (signal/background).
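The two-column prediction file could be produced with Python's csv module, as in the sketch below. Whether a header row is expected and the exact spelling of the label values are not stated in the task, so the header is optional here and the labels "signal" and "background" are an assumption. The function name is hypothetical.

```python
import csv
import io

ALLOWED_LABELS = ("signal", "background")  # assumed spelling of the classes


def write_predictions(handle, record_ids, predicted_labels, header=None):
    """Write one 'record ID, predicted label' line per test record."""
    if len(record_ids) != len(predicted_labels):
        raise ValueError("one prediction is required per record ID")
    writer = csv.writer(handle, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    for record_id, label in zip(record_ids, predicted_labels):
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        writer.writerow([record_id, label])


# Example with an in-memory buffer; a real run would pass a file opened
# with open("Task3-predictions.csv", "w", newline="").
buffer = io.StringIO()
write_predictions(buffer, [1, 2, 3], ["signal", "background", "background"])
```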

You must also include a section in the report describing the method used to generate the submitted predictions, together with an estimate of the following performance indices: accuracy, F-measure, precision and recall.
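The four performance indices can be computed from the counts of a binary confusion matrix, as in the sketch below. Treating "signal" as the positive class and using the F1 form of the F-measure are assumptions. Because the test set has no class labels, the estimate would have to come from labelled data held out of the training set. The function name is hypothetical.

```python
def binary_metrics(actual, predicted, positive="signal"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Conventions for empty denominators: report 0.0 rather than fail.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

For example, with 3 actual signal records of which 2 are detected, and 5 background records of which 1 is mislabelled as signal, accuracy is 6/8 = 0.75 and precision, recall and F-measure are all 2/3.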

In summary, the submission for Task #3 must contain three components:

· a report section dedicated to your solution for Task #3,

· any KNIME workflow(*) and/or source code used (a zip/jar archive) and

· the file “Task3-predictions.csv”.

(*) Important: do not include data when you export a KNIME workflow as a zip archive.
