Write code to perform k-means clustering on the values

Assignment Help Applied Statistics
Reference no: EM131434767

Part 1: K-means Clustering

K-means clustering partitions a data set into k groups. The algorithm can be described as follows:

0) Determine how many (k) clusters you will search for.

1) Randomly assign points in your data to each of the clusters.

2) Once all values have been assigned to a cluster, calculate the means or the centroid of the values in each cluster.

3) Reassign values to clusters by associating each value in the data set with the nearest (Euclidean distance) mean.

4) Repeat steps 2 and 3 until convergence. Convergence occurs when no values are reassigned to a new cluster.
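As a rough sketch only (not the required solution — 'dat' here is stand-in random data, and in the actual assignment the initial cluster assignments are provided for you), the four steps above might look like this in R:

```r
# Illustrative k-means loop for k = 3; all names are stand-ins.
set.seed(1)
dat <- matrix(rnorm(200), ncol = 2)                 # stand-in 100 x 2 data
clusters <- sample(1:3, nrow(dat), replace = TRUE)  # step 1: random assignment
repeat {
  # step 2: centroid of each cluster (3 x 2 matrix, one row per cluster)
  mu <- t(sapply(1:3, function(k) colMeans(dat[clusters == k, , drop = FALSE])))
  # step 3: Euclidean distance from every point to each centroid (100 x 3),
  # then reassign each point to the nearest centroid
  d <- apply(mu, 1, function(m) sqrt(rowSums(sweep(dat, 2, m)^2)))
  new_clusters <- apply(d, 1, which.min)
  # step 4: converged when no point changes cluster
  if (all(new_clusters == clusters)) break
  clusters <- new_clusters
}
```

Note the absence of an inner loop over observations: `sweep()` and `apply()` handle all points at once.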

Task 1 - Write code to perform k-means clustering on the values in the matrix 'dat'.

The true group labels are provided in the vector 'true_groups'. Of course, you can't use that until the very end, where you will perform some verification.

Requirements:

1) So that everyone gets consistent results, I have performed the initial assignment of points to clusters.

2) With each iteration, plot the data, colored by their current groupings, and the updated means.

3) Convergence is reached when group assignments no longer change. Your k-means clustering algorithm should reach convergence fairly quickly.

4) Print out a 'confusion' matrix showing how well the k-means clustering algorithm classified the data.

You'll probably want to write a function that will calculate distances from the three means. I also highly recommend avoiding the use of loops within each iteration. Instead try to use apply functions. My final code takes 15 lines. You don't need to write as efficiently as I do. You can write a hundred lines, but as long as it gets the job done correctly, it'll be accepted.
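One way to write the suggested distance helper with an apply function (the names 'dat' and 'mu' are illustrative, not the assignment's objects):

```r
# Distances from every point to each of three means, no explicit loops.
# 'dat' is n x 2; 'mu' is a 3 x 2 matrix with one mean per row.
dist_to_means <- function(dat, mu) {
  # returns an n x 3 matrix; column k holds distances to mean k
  apply(mu, 1, function(m) sqrt(rowSums(sweep(dat, 2, m)^2)))
}

dat <- matrix(rnorm(20), ncol = 2)   # stand-in 10 x 2 data
mu  <- matrix(c(0, 0, 1, 1, -1, 2), ncol = 2, byrow = TRUE)
d   <- dist_to_means(dat, mu)
groups <- apply(d, 1, which.min)     # nearest-mean assignment
```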

Part 2: EM Algorithm

The Expectation-Maximization algorithm is an iterative algorithm for finding maximum likelihood estimates of parameters when some of the data is missing. In our case, we are trying to estimate the model parameters of a mixture of multivariate Gaussian distributions, but we are missing the assignments of the data points.

The general form of the EM algorithm alternates between an expectation step and a maximization step. In the expectation step, we compute the expectation of the log-likelihood of the joint distribution of the observed data $X$ and the missing values $Z$, taken under the current estimates of $\theta$. In the maximization step, we find the values of $\theta$ that maximize this expected log-likelihood.

The great thing is that the solution to the maximization step can often be found analytically (versus having to search for it via a computational method). For example, the estimate of the mean that maximizes the likelihood of the data is just the sample mean.

EM Algorithm for Gaussian Mixtures -

This (brilliant) algorithm can be applied to perform clustering of Gaussian mixtures (among many other applications) in a manner similar to the k-means algorithm. A key difference between the k-means algorithm and the EM algorithm is that the EM algorithm is probabilistic. While the k-means algorithm assigns each value to the group with the nearest mean, the EM algorithm calculates the probability that a point belongs to each group.

In the context of a Gaussian mixture, we have the following components:

1) $X$ is our observed data

2) $Z$ is the missing data: the class or cluster to which the observations $X$ belong.

3) $X$ comes from a mixture of normal distributions defined by the unknown parameters $\Theta$ (the means $\mu$ and covariances $\Sigma$).

4) $Z$ is generated by a categorical distribution based on the unknown class mixing parameters $\alpha$. ($\sum \alpha_i = 1$)

The EM algorithm for Gaussian Mixtures will behave as follows:

1) Begin with some random or arbitrary starting values of $\Theta$ and $\alpha$.

2) E-Step. In the E-step, we will use Bayes' theorem to calculate the posterior probability that an observation $i$ belongs to component $k$.

3) M-step. Based on the estimates of the 'weights' found in the E-step, we will now perform Maximum Likelihood estimation for the model parameters.

This turns out to be fairly straightforward, as the MLE estimates for a normal distribution are fairly easy to obtain analytically.

For each component, we will find the mean, variance, and mixing proportion based on the data points that are "assigned" to the component. The data points are not actually "assigned" to the components like they are in k-means, but rather the components are given a "weight" or "responsibility" for each observation.

4) Iterate between steps 2 and 3 until convergence is reached.
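In symbols, one full iteration of steps 2 and 3 looks like this (with $w_{ik}$ the "responsibility" of component $k$ for observation $i$, and $n$ the number of observations):

E-step:

$$w_{ik} = \frac{\alpha_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \alpha_j \, N(x_i \mid \mu_j, \Sigma_j)}$$

M-step:

$$N_k = \sum_{i=1}^{n} w_{ik}, \qquad \alpha_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^T$$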

Coding the EM algorithm for Gaussian Mixtures -

Coding the algorithm is a matter of turning the above steps into code.

The package 'mvtnorm' handles multivariate normal distributions. The function 'dmvnorm()' can be used to find the probability of the data $N(x_i | \mu_k, \Sigma_k)$. It can even be applied in vector form, so you can avoid loops when trying to find the probabilities.
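If you want to sanity-check your use of the package, the density $N(x_i \mid \mu_k, \Sigma_k)$ can also be computed directly in base R. This is only a sketch (the matrix 'x' and function name are illustrative), but it matches what 'dmvnorm()' returns:

```r
# Density N(x_i | mu, Sigma) for every row of x at once, base R only.
dmvnorm_sketch <- function(x, mu, Sigma) {
  p <- ncol(x)
  centered <- sweep(x, 2, mu)
  # Mahalanobis quadratic form for each row
  q <- rowSums((centered %*% solve(Sigma)) * centered)
  exp(-0.5 * q) / sqrt((2 * pi)^p * det(Sigma))
}

x <- matrix(rnorm(10), ncol = 2)   # stand-in 5 x 2 data
dens <- dmvnorm_sketch(x, mu = c(0, 0), Sigma = diag(2))
```

With $\Sigma = I$, this reduces to the product of two univariate standard normal densities, which gives an easy way to verify it.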

You are dealing with a 1000 x 2 matrix of data points.

A few key things to remember / help you troubleshoot your code:

1) Your matrix of 'weights' will be 1000 x 3. (one row for each observation, one column for each cluster)

2) $N_k$ is a vector of three elements. It is effectively the column sums of the weight matrix $w$.

3) $\alpha$ is a vector of three elements. The elements will add to 1.

4) $\mu$ is a 3 x 2 matrix. One row for each cluster, one column for each x variable.

5) Each covariance matrix sigma is a 2x2 matrix. There are three clusters, so there are three covariance matrices.
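Putting the shapes above together, one full EM iteration might be sketched as follows. All names and the stand-in data are illustrative; the density helper replicates what 'dmvnorm()' computes so the sketch stays self-contained:

```r
# One EM iteration: n = 1000 points, k = 3 clusters, 2 variables.
set.seed(1)
n <- 1000; k <- 3
dat <- matrix(rnorm(2 * n), ncol = 2)              # stand-in 1000 x 2 data
alpha <- rep(1 / k, k)                             # mixing proportions, sum to 1
mu <- matrix(rnorm(2 * k), nrow = k)               # 3 x 2 matrix of means
Sigma <- replicate(k, diag(2), simplify = FALSE)   # three 2 x 2 covariances

dens <- function(x, m, S) {                        # N(x_i | m, S) for each row
  cen <- sweep(x, 2, m)
  exp(-0.5 * rowSums((cen %*% solve(S)) * cen)) / sqrt((2 * pi)^2 * det(S))
}

# E-step: 1000 x 3 weight matrix via Bayes' theorem
w <- sapply(1:k, function(j) alpha[j] * dens(dat, mu[j, ], Sigma[[j]]))
w <- w / rowSums(w)

# M-step: weighted maximum likelihood estimates
Nk    <- colSums(w)                                # three effective counts
alpha <- Nk / n                                    # still sums to 1
mu    <- t(sapply(1:k, function(j) colSums(w[, j] * dat) / Nk[j]))
Sigma <- lapply(1:k, function(j) {
  cen <- sweep(dat, 2, mu[j, ])
  crossprod(sqrt(w[, j]) * cen) / Nk[j]            # weighted 2 x 2 covariance
})
```

Wrapping the E- and M-steps in a loop with a convergence check (e.g. on the change in the log-likelihood or in the weights) gives the full algorithm.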

Part 3: Open-ended analysis

For this third part, I will ask you to perform a bit of open-ended analysis with real data.

Our data comes from Kaggle. It is a list of all the menu items offered at McDonald's restaurants and their nutrition information. Your task is to perform some cluster analysis on the McDonald's menu items.

While the task is open-ended, there are a few things that you will be required to do.

Requirements:

Data Preparation:

1) Remove the following variables, as they are almost 100% collinear with another variable:

- Calories from Fat

- Total Fat (% Daily Value)

- Saturated Fat (% Daily Value)

- Cholesterol (% Daily Value)

- Sodium (% Daily Value)

- Carbohydrates (% Daily Value)

- Dietary Fiber (% Daily Value)

2) Perform your analysis on the scaled data (just run 'scale()'). This prevents variables with large variances from dominating the analysis.

Initial Visualization:

3) Produce between 1 and 3 plots of the menu items and nutrition information. Try to identify variables that would be useful in clustering the data. (You are allowed to use whatever packages you want for this. ggplot2, scatterplot3d, etc.) [Don't spend too much time (over 20 min) on this task. You are not trying to win datafest visualization. Keep it simple.]

PCA as "pre-processing":

4) Perform principal components analysis on the scaled McDonald's data. Use the built-in function 'prcomp()'.

5) Plot your data using the first two principal components

6) Look at the resulting rotation matrix and identify which variables are weighted most heavily in each of the first two principal components.
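A minimal sketch of steps 4-6, using stand-in data in place of the scaled McDonald's matrix (the name 'nutri' is illustrative):

```r
# PCA on a stand-in 100 x 3 matrix; the real analysis uses the scaled
# nutrition data instead.
set.seed(1)
nutri <- matrix(rnorm(300), ncol = 3)
pca <- prcomp(scale(nutri))
scores <- pca$x[, 1:2]                 # first two principal components (step 5)
plot(scores, xlab = "PC1", ylab = "PC2")
pca$rotation[, 1:2]                    # loadings: which variables weigh heaviest (step 6)
```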

K-means on the PCA matrix

7) Use R's 'kmeans()' function to perform k-means clustering using just the first two components in the PCA matrix. (Just the first two columns of $x in the prcomp output)

8) You will need to decide how many clusters to look for. You can plot $tot.withinss against the number of clusters to see where adding additional clusters no longer meaningfully reduces the total within-cluster sum of squares.

I do a bit of this here: <https://youtu.be/YtQ9t_MozQM?t=17m1s>
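A minimal elbow-plot sketch, with stand-in data in place of the first two principal components ('scores' is an illustrative name; 'nstart > 1' guards against poor random starts):

```r
# Run kmeans() for k = 1..8 and track tot.withinss to look for an elbow.
set.seed(1)
scores <- matrix(rnorm(200), ncol = 2)   # stands in for the first two PCs
wss <- sapply(1:8, function(k) kmeans(scores, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters", ylab = "total within SS")
```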

9) Plot the results of the k-means analysis on the plot of the first two principal components

Explaining it all

10) Take a look at the clusters formed by your k-means analysis. Pair them up with the names of the menu items. Can you give each cluster a name? For example, you might call one category "sugary drinks and desserts" or something similar.

11) Using the results of your principal components analysis, try to pick the two or three most important variables in the original data. Produce a plot using those variables and the k-means groups.

Attachment: Assignment Files.rar



Hint: implementing the covariance updates is probably the hardest part of the code. Before trying to update the covariances, you can leave the covariance matrices as the identity matrix, or plug in the actual known covariance matrices for `sig1`, `sig2`, and `sig3`. This way you can test the rest of the code and see whether the values of the means update as you would expect.
