Data mining project

Reference no: EM13999715

The project must be carried out using any programming language or one of the suggested platforms and libraries. References to them are listed here and are also available on Blackboard.

·         KNIME, open source Data Mining platform (http://www.knime.org).

·         Weka, open source ML library in Java (http://www.cs.waikato.ac.nz/ml/weka).

·         R, free programming language for statistical computing (http://www.r-project.org).

The following data files are required for this coursework and are provided on Blackboard:

·         wine.csv (data file for tasks 1 and 2)

·         training100Ku.csv (data file for task 3)

·         test1K.csv (data file for task 3)

Wine dataset for Task #1 and Task #2

The data set (wine.csv) is obtained from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 chemical constituents found in each wine. Each data record contains the cultivar ID (1, 2 or 3) and 13 numerical attributes.

Task #1 – Data Exploration and Clustering

You are required to perform a clustering analysis for the multidimensional data set indicated above. This task has to be carried out two times: with and without normalisation.

Task 1.1: Clustering without normalisation

Apply Principal Component Analysis (PCA) to generate two-dimensional coordinates and a 2D plot (plot1) of the records. Colour each data point in plot1 according to its class label. Apply a clustering algorithm to the data set to generate three partitions. Generate a second 2D plot (plot2) based on the same PCA projection, with the colour now indicating the cluster ID (use colours different from those in plot1), and compare it with plot1. For the records in each cluster, generate a 2D plot (plot3a, plot3b, plot3c) coloured by class label (same colours as plot1), and visually verify the distribution of class labels in each cluster.
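
The clustering step and the per-cluster check of class labels could be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses k-means (the brief leaves the choice of algorithm open), a deterministic farthest-first seeding chosen here for reproducibility, and a small made-up data set rather than wine.csv. All function names are hypothetical. The PCA projection and the plots are left to the chosen platform.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: start from the first record, then keep adding
    the record that is farthest from its nearest already-chosen seed."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (cluster ID per record, centroids)."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_iter):
        updated = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if updated == assignment:  # no record changed cluster: converged
            break
        assignment = updated
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def class_counts_per_cluster(assignment, labels):
    """Count how many records of each class label fall in each cluster."""
    table = {}
    for cluster_id, label in zip(assignment, labels):
        table.setdefault(cluster_id, Counter())[label] += 1
    return table

# Made-up records (two attributes) with known class labels 1, 2 and 3.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.8, 0.9)]
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]

cluster_ids, centroids = kmeans(points, 3)
table = class_counts_per_cluster(cluster_ids, labels)
for cluster_id in sorted(table):
    print(cluster_id, dict(table[cluster_id]))
```

The table printed at the end is the numerical counterpart of plot3a, plot3b and plot3c: a cluster dominated by one class label corresponds to a plot with mostly one colour. Cluster IDs are arbitrary, so they need not match the class IDs.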

Select, describe and apply at least one cluster validity measure, and include the results in the report.

Task 1.2: Clustering with normalisation

Apply a normalisation pre-processing step to the data set and repeat the steps of Task 1.1. Compare the new plots and the cluster validity measure with the previous ones.
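
As a rough illustration of why the comparison matters, the sketch below applies z-score normalisation and computes the mean silhouette coefficient before and after. Both choices are assumptions made for this example: the brief does not prescribe a particular normalisation or validity measure. The data are made up, with one large-scale attribute that does not separate the groups and one small-scale attribute that does. The grouping is held fixed here to isolate the effect of scaling, whereas in the actual task the clustering would be rerun on the normalised data. Function names are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column: avoid division by zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def silhouette(rows, clusters):
    """Mean silhouette coefficient over all records (between -1 and 1).
    Assumes at least two distinct cluster IDs."""
    scores = []
    for i, (p, ci) in enumerate(zip(rows, clusters)):
        own = [dist(p, q) for j, (q, cj) in enumerate(zip(rows, clusters))
               if cj == ci and j != i]
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to the other members of the same cluster
        b = min(       # mean distance to the closest other cluster
            mean(dist(p, q) for q, cj in zip(rows, clusters) if cj == other)
            for other in set(clusters) if other != ci
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Made-up records: the first attribute has a large range but no group structure,
# the second has a small range and separates the two groups.
raw = [(100.0, 0.10), (500.0, 0.20), (900.0, 0.15),
       (120.0, 0.90), (480.0, 0.80), (880.0, 0.85)]
clusters = [0, 0, 0, 1, 1, 1]

normalised = zscore_columns(raw)
s_raw = silhouette(raw, clusters)
s_norm = silhouette(normalised, clusters)
print(f"silhouette without normalisation: {s_raw:.3f}")
print(f"silhouette with normalisation:    {s_norm:.3f}")
```

In this toy case the large-scale attribute dominates the Euclidean distances on the raw data, so the same grouping scores worse before normalisation than after. Whether the wine data behave the same way is for the analysis to establish.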

The submission for Task #1 must contain two components:

·         a report section dedicated to your solution for Task #1,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

Task #2 – Comparison of Classification Models

You are required to learn and test classification models for the wine data set. For this task you need to carry out a performance comparison of TWO different classification algorithms. You should use a 10-fold cross-validation method to estimate the generalisation error.

In the report you should briefly describe the two algorithms and the method used to compare them.
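
A 10-fold cross-validation comparison might look like the following sketch. The two classifiers (1-nearest-neighbour and nearest-centroid) are stand-ins picked for brevity, not a recommendation, and the data are randomly generated rather than taken from wine.csv. All names are hypothetical. Both classifiers are evaluated on the same folds, so their per-fold error rates can be compared pairwise.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_nn(train_X, train_y, x):
    """Predict the label of the closest training record."""
    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[nearest]

def nearest_centroid(train_X, train_y, x):
    """Predict the label whose class mean is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

def cv_error(classify, X, y, folds):
    """Return (mean of the per-fold error rates, list of per-fold error rates)."""
    per_fold = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        wrong = sum(classify(train_X, train_y, X[i]) != y[i] for i in test_idx)
        per_fold.append(wrong / len(test_idx))
    return sum(per_fold) / len(per_fold), per_fold

# Randomly generated two-class data with some overlap between the classes.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)] + \
    [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(30)]
y = [0] * 30 + [1] * 30

folds = k_fold_indices(len(X), k=10)
err_1nn, folds_1nn = cv_error(one_nn, X, y, folds)
err_nc, folds_nc = cv_error(nearest_centroid, X, y, folds)
print(f"1-NN estimated error:             {err_1nn:.3f}")
print(f"nearest-centroid estimated error: {err_nc:.3f}")
```

The per-fold lists can feed whatever comparison method the report describes, for example the mean and spread of the fold-by-fold differences.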

The submission for Task #2 must contain two components:

·         a report section dedicated to your solution for Task #2,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

Task #3 – The Search for the God Particle: a Binary Classification Challenge

CERN’s Large Hadron Collider (LHC) typically produces approximately 10^11 collisions per hour, and about 300 (0.0000003%) of these collisions result in a Higgs boson, the so-called God particle. Detecting when interesting particles are produced is an important challenge, which is typically studied by the use of simulations. The data set for this task is related to simulations of collision events, which can be used to train a classification model to distinguish between collisions producing particles of interest (signal) and those producing other particles (background).

Two data files are provided: the training set (training100Ku.csv) and the test set (test1K.csv). The training set file has 100,000 records, each containing, in this order, 21 numerical low-level attributes, 7 high-level attributes and the class label (signal/background). The low-level attributes are kinematic properties measured by the particle detectors in the accelerator during the experiment. The high-level attributes are computed after the experiment by means of some complex model as a function of the low-level attributes (feature transformation).

The test set has 1,000 records, each containing a unique record identifier and 21 numerical low-level attributes (the same measurements in the same order as in the training set). The 7 high-level attributes and the class label are not present.
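
Because the test set lacks the 7 high-level attributes, a model applied to it can only take the 21 low-level attributes as input. The sketch below splits records according to the column order described above. It assumes each record arrives as a list of strings, as the standard csv reader would yield, and does not handle a header row, since the brief does not say whether the files have one. The function names and the synthetic rows are hypothetical.

```python
N_LOW, N_HIGH = 21, 7  # attribute counts stated in the task description

def split_training_row(row):
    """Training record: 21 low-level values, 7 high-level values, class label."""
    if len(row) != N_LOW + N_HIGH + 1:
        raise ValueError(f"expected {N_LOW + N_HIGH + 1} fields, got {len(row)}")
    low = [float(v) for v in row[:N_LOW]]
    high = [float(v) for v in row[N_LOW:N_LOW + N_HIGH]]
    label = row[-1]
    return low, high, label

def split_test_row(row):
    """Test record: record identifier followed by 21 low-level values."""
    if len(row) != N_LOW + 1:
        raise ValueError(f"expected {N_LOW + 1} fields, got {len(row)}")
    return row[0], [float(v) for v in row[1:]]

# Synthetic rows standing in for one line of each file.
train_row = [str(0.1 * i) for i in range(N_LOW + N_HIGH)] + ["signal"]
test_row = ["17"] + [str(0.1 * i) for i in range(N_LOW)]

low, high, label = split_training_row(train_row)
record_id, test_low = split_test_row(test_row)
print(len(low), len(high), label, record_id, len(test_low))
```

The high-level attributes can still be useful during training, for instance as targets for learned feature transformations, but that is a design choice the brief leaves open.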

Your task is to predict the class label for the records of the test set. The resulting predictions must be submitted as a single file (CSV format) with only two columns: the record ID and the predicted class label (signal/background).
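
Writing the two-column file could be done along these lines with Python's standard csv module. This is a sketch under assumptions: the brief does not say whether a header row is expected, so none is written here, and the function name and sample predictions are made up.

```python
import csv
import io

ALLOWED_LABELS = {"signal", "background"}

def write_predictions(rows, out_file):
    """Write (record ID, predicted label) pairs as a two-column CSV, without a header."""
    writer = csv.writer(out_file)
    for record_id, label in rows:
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {label!r} for record {record_id}")
        writer.writerow([record_id, label])

# Made-up predictions, written to an in-memory buffer for demonstration.
predictions = [(1, "signal"), (2, "background"), (3, "signal")]
buffer = io.StringIO()
write_predictions(predictions, buffer)
print(buffer.getvalue())
```

To produce the actual submission, the same function can be given a file opened with `open("Task3-predictions.csv", "w", newline="")` instead of the in-memory buffer.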

You must also include a section in the report to describe the method used to generate the submitted predictions and an estimation of these performance indices: accuracy, F-measure, precision and recall.
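
Since the test set carries no class labels, these indices have to be estimated on labelled records held out from the training set. The four measures can be computed from the confusion counts as in the sketch below, which treats "signal" as the positive class. That choice, the function name and the sample labels are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred, positive="signal"):
    """Accuracy, precision, recall and F-measure (F1) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f1,
    }

# Made-up held-out labels and predictions.
y_true = ["signal", "signal", "signal", "background",
          "background", "background", "background", "signal"]
y_pred = ["signal", "signal", "background", "background",
          "background", "signal", "background", "signal"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four indices equal 0.75.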

In summary, the submission for Task #3 must contain three components:

·         a report section dedicated to your solution for Task #3,

·         any KNIME workflow(*) and/or source code used (a zip/jar archive) and

·         the file “Task3-predictions.csv”.


(*) Important: do not include data when you export a KNIME workflow as a zip archive.

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd