Data clustering using k-means

Reference no: EM13889468

Project

Project Title: Data Clustering using K-means

In this project, students are required to cluster Amazon product reviews that belong to four product categories: books, electronic appliances, DVDs, and kitchen appliances. Each category is further divided into positive-sentiment reviews and negative-sentiment reviews. In total, the attached data file "data.txt" contains reviews that belong to 4 × 2 = 8 categories.

The format of the data file is as follows. Each line of the data file corresponds to one review. The first element in the line is the label of the instance (e.g. kitchen-positive indicates that the review is a positive-sentiment review about some kitchen appliance). The remaining elements in the line, separated by spaces, are the unigram and bigram features extracted from the review. Note that the two words in a bigram feature are connected by two underscores. Reviews are represented using binary-valued features: a feature is either present in a review or absent, so a feature that is present appears exactly once in the corresponding line.
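The format described above can be read with a few lines of Python. The following is a minimal sketch, not a reference solution: the function names (`parse_line`, `read_instances`, `load_data`), the UTF-8 encoding, and the two sample lines are assumptions made for illustration and do not come from data.txt. It also assumes every label follows the `category-sentiment` pattern of the kitchen-positive example.

```python
def parse_line(line):
    """Split one review line into (label, set of features).

    The first whitespace-separated token is the label; the rest are
    unigram or bigram features (bigrams look like word1__word2).
    """
    tokens = line.split()
    return tokens[0], frozenset(tokens[1:])


def read_instances(lines):
    """Parse an iterable of lines into parallel lists of labels and feature sets."""
    labels, instances = [], []
    for line in lines:
        if not line.strip():  # skip blank lines, if any
            continue
        label, features = parse_line(line)
        labels.append(label)
        instances.append(features)
    return labels, instances


def load_data(path="data.txt"):
    """Load all instances from the data file (encoding is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return read_instances(handle)


# Hypothetical lines in the described format, used only to show the shapes.
sample = [
    "kitchen-positive great sturdy great__value\n",
    "books-negative boring plot boring__plot\n",
]
labels, instances = read_instances(sample)

# Split a label on its last hyphen to separate category from sentiment.
category, sentiment = labels[0].rsplit("-", 1)
```

Storing each review as a set of feature names keeps the binary representation sparse, which matters because the vocabulary of unigrams and bigrams is likely to be much larger than any single review.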

Questions

(1) Write a program to load the data instances to memory from the provided file data.txt.

(2) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. Make sure that you normalize each feature vector to unit L2 length before computing Euclidean distances.

(3) Instead of using the mean of a cluster as its center,

i. select the instance that is closest to the mean as the cluster center when performing k-means clustering, and

ii. use the k-medoids method to perform the clustering.

(4) Evaluate the clusters obtained in steps 2 and 3 (both variants) using a cross-validation evaluation method.

(5) Briefly discuss which clustering method is best for this data, and why.
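The clustering and evaluation questions above can be approached with one shared loop and a pluggable center-update rule. The sketch below is one possible starting point under stated assumptions, not the expected solution. All names (`normalize`, `dist`, `cluster`, `purity`, and so on) are hypothetical. Initial centers are passed in as instance indices so that runs are reproducible. The k-medoids variant shown is the simple alternating (assign, then re-pick the medoid) scheme rather than PAM. Purity against the given labels is an assumed scoring choice; the brief asks for cross-validation without defining the procedure, so how scores are computed over folds is left open here.

```python
import math
from collections import Counter


def normalize(features):
    """Turn a binary feature set into a sparse vector of unit L2 length."""
    if not features:
        return {}
    weight = 1.0 / math.sqrt(len(features))
    return {f: weight for f in features}


def dist(x, y):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = x.keys() | y.keys()
    return math.sqrt(sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys))


def mean(members):
    """Question 2: the ordinary k-means center (component-wise mean)."""
    total = {}
    for v in members:
        for k, val in v.items():
            total[k] = total.get(k, 0.0) + val
    return {k: val / len(members) for k, val in total.items()}


def closest_to_mean(members):
    """Question 3(i): the member instance nearest to the cluster mean."""
    m = mean(members)
    return min(members, key=lambda x: dist(x, m))


def medoid(members):
    """Question 3(ii): the member minimizing total distance to the others."""
    return min(members, key=lambda c: sum(dist(c, x) for x in members))


def cluster(vectors, init, update, max_iter=100):
    """Alternate assignment and center update until assignments stop changing.

    init   -- indices of the instances used as initial centers (k = len(init))
    update -- one of mean, closest_to_mean, medoid
    """
    centers = [vectors[i] for i in init]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda j: dist(x, centers[j]))
            for x in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(len(centers)):
            members = [x for x, a in zip(vectors, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = update(members)
    return assignment, centers


def purity(assignment, labels):
    """Fraction of instances that carry the majority label of their cluster."""
    counts = {}
    for a, y in zip(assignment, labels):
        counts.setdefault(a, Counter())[y] += 1
    return sum(c.most_common(1)[0][1] for c in counts.values()) / len(labels)


# Tiny hypothetical example: two obviously separate groups of reviews.
toy = [normalize(s) for s in ({"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"})]
toy_assignment, toy_centers = cluster(toy, init=[0, 2], update=medoid)
toy_score = purity(toy_assignment, ["books", "books", "dvd", "dvd"])
```

Because every vector has unit length, the squared Euclidean distance between two reviews equals 2 minus twice their dot product, so ranking by Euclidean distance here is equivalent to ranking by cosine similarity. Comparing the three update rules on the same initial centers keeps the comparison for question (5) fair.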

Submission Instructions

• Submit

(a) the source code for all your programs,

(b) a README file (plain text) describing how to compile and run your code to produce the various results, and

(c) a PDF file providing the answers to all of the above questions.

Compress all of the above files into a single zip/rar file and name it with your registration number.






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd