Perform a k-means cluster analysis on the data

Reference no: EM13985384

Portfolio - Classification and partitioning

This coursework accounts for 40% of the total mark for the portfolio.

In addition to the combined marks for each of the portfolio tasks, you will also be graded on the structure, presentation and clarity of the portfolio as a whole. So your work should be professionally presented, with good use of English.

In the real world, you will be expected to communicate the results from any analysis you perform to non-specialists, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.

Task 1

This task uses the well-known Iris data set.

The data were first collected by the American botanist Edgar Anderson, but became a popular data set for exploring multivariate statistical methods after Ronald Fisher used them to illustrate discriminant analysis in 1936. This version is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Iris

The data consist of four measurements taken from 50 irises of each of three species. The original data set does not include any identification label for the observations, but I have added one; you may find it useful when assessing your results (don't forget that it should not be included in any analysis).

For some of the tasks, you will need to separate the data into training and testing data sets. As the data is ordered, you will need to use some method of randomisation or randomised sampling, which you should do using the appropriate software.

You should employ the sampling functions of the data mining software you use. For consistency, and to assess the relative strengths of the software and algorithms used, you may use the sets from one package in another. But I want to see evidence that you are using as much of the relevant functionality in your software as possible.
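As an illustration of why randomisation matters for ordered data, here is a minimal sketch of a seeded random train/test split. It is written in plain Python purely to show the idea; the coursework itself asks for the sampling functions of R and RapidMiner. The function name `train_test_split`, the 30% test fraction, the seed and the stand-in row ids are all assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing sets."""
    rng = random.Random(seed)      # fixed seed so the split can be reproduced
    shuffled = rows[:]             # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the ordered data: 150 row ids in species order.
rows = list(range(150))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Because the rows are shuffled before the cut, both sets should contain a mix of all three species rather than whole blocks of one species. Reusing the same seed gives the same split, which helps when the same sets are carried from one package to another.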

In each case, consider whether the strength of your models can be improved by restricting the variables used.

Compare the R and RapidMiner results, giving an account of their similarities and differences, and assessing their relative strengths and weaknesses.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the species.
Use your results to decide whether you need to standardise the data in any way for the models you will build.

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning. Can you learn anything from changing the value of k?

Use hierarchical cluster analysis to justify (or otherwise) your value for k.

d) Build a decision tree for the data using RapidMiner and R. Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable species.

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual species.

f) Summarise your results above, describing how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data, and in particular, if the results from one method helped you refine another.

Are there any observations which cause problems for the different methods?
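For part c), one possible way to express "accuracy of partitioning" is to cross-tabulate the cluster assignments against the known species and compute the purity: the share of observations that belong to the majority species of their cluster. The sketch below is plain Python for illustration only (the coursework itself uses R and RapidMiner). The function name `cluster_purity` and the toy assignments are assumptions, not real Iris results.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Return (purity, cross-tab), where purity is the fraction of observations
    that fall in the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(true_labels), by_cluster

# Made-up toy assignments, NOT real Iris output.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
species = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "b"]
purity, table = cluster_purity(clusters, species)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
print(purity)
```

The species labels are used only after clustering, to judge the result; they play no part in forming the clusters. Purity tends to rise as k increases, so it is best read alongside the cross-tab rather than on its own.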

Task 2

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. (A cultivar is a grouping of plants which have similar, usually sought-after, properties.) The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The data is originally attributed to M. Forina, and may have been much larger. This version was donated to the UCI Machine Learning Repository by Stephan Aeberhard.
See https://archive.ics.uci.edu/ml/datasets/Wine

(A slightly reduced version is available within your R installation, but this is the most complete version I could find.)

Note that this is a larger and more complex data set than was used in section A, and is therefore more like the data typically encountered.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the three different cultivars.

Note that, as this data set has 13 numeric variables, you may find that you can reduce the size of your models based on your EDA observations.

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning. Can you learn anything from changing the value of k?

Use hierarchical cluster analysis to justify (or otherwise) your value for k.

d) Build a decision tree for the data using RapidMiner and R.

Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable cultivars.

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual cultivars.

f) In the above sections you built your models based on classifying wines according to the cultivar from which they were made.

One could quite reasonably explore some other way of classifying wines - alcohol content, for example.

Using the results of your exploratory data analysis, find a suitable method of classifying wines by their alcohol content and re-run your data mining modules to reflect this.

How do your results compare to the first set of models?

g) Summarise your results above, describing how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data.

Are there any observations which cause problems for the different methods?
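For part f), one possible (not the required) way to turn the numeric alcohol content into classes is to cut it at its tertiles, giving three bands of roughly equal size. The sketch below uses Python's standard `statistics.quantiles` for illustration only; the coursework itself uses R and RapidMiner. The function name `alcohol_band`, the band names and the alcohol values are made up for this example and are not taken from the Wine data.

```python
import statistics

def alcohol_band(value, cuts):
    """Map a numeric alcohol value to a band using two cut points."""
    if value <= cuts[0]:
        return "low"
    if value <= cuts[1]:
        return "medium"
    return "high"

# Made-up alcohol percentages, NOT values from the Wine data set.
alcohol = [11.0, 11.5, 11.8, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]

cuts = statistics.quantiles(alcohol, n=3)   # two cut points: the tertiles
bands = [alcohol_band(a, cuts) for a in alcohol]
print(cuts)
print(bands)
```

Equal-frequency bands keep the classes balanced, which suits most classifiers. Your exploratory analysis may instead suggest cut points at natural gaps in the distribution, and that choice should be justified in the same way as the choice of k.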

Attachment:- Data.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd