Determine the number of clusters in the data

Reference no: EM132068265

PART 1: CLASSIFICATION

1. Run the following classifiers, with the default parameters, on this data: ZeroR, OneR, J48, IBk. Construct a table of the training and cross-validation errors. You can get the training error by selecting "Use training set" as the test option. What do you conclude from these results?

Run No | Classifier | Parameters | Training Error | Cross-valid Error | Over-Fitting
1      | ZeroR      | None       | 30.0%          | 30.0%             | None

2. Using the J48 classifier, can you find a combination of the C and M parameter values that minimizes the amount of overfitting? Include the results of your best five runs, including the parameter values, in your table of results.

3. Reset J48 parameters to their default values. What is the effect of lowering the number of examples in the training set? Include your runs in your table of results.

4. Using the IBk classifier, can you find the value of k that minimizes the amount of overfitting? Include your runs in your table of results.

5. Try a number of other classifiers. Aside from ZeroR, which classifiers are best and worst in terms of predictive accuracy? Include 5 runs in your table of results.

6. Compare the accuracy of ZeroR, OneR and J48. What do you conclude?

7. What golden nuggets did you find, if any?

8. [OPTIONAL] Use an attribute selection algorithm to get a reduced attribute set. How does the accuracy on the reduced set compare with the accuracy on the full set?
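
The training and cross-validation error columns in the results table can be illustrated with a minimal sketch. It assumes a hypothetical label list with a 30% minority class (chosen to match the ZeroR row above) and a simple interleaved 10-fold split, which is not the same as Weka's stratified cross-validation. The function names are illustrative only and are not part of Weka.

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: always predict the most frequent class seen in training."""
    return Counter(train_labels).most_common(1)[0][0]

def error_rate(predicted, actual):
    """Fraction of examples whose predicted label differs from the actual one."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

def zero_r_errors(labels, folds=10):
    """Return (training error, cross-validation error) for ZeroR.

    Assumption: folds are formed by taking every `folds`-th example,
    a simplification of the stratified folds Weka builds.
    """
    majority = zero_r(labels)
    train_err = error_rate([majority] * len(labels), labels)
    wrong = 0
    for f in range(folds):
        test = labels[f::folds]
        train = [y for i, y in enumerate(labels) if i % folds != f]
        guess = zero_r(train)
        wrong += sum(guess != y for y in test)
    return train_err, wrong / len(labels)

# Hypothetical class labels: 70 "no" and 30 "yes".
labels = ["no"] * 70 + ["yes"] * 30
train_err, cv_err = zero_r_errors(labels)
overfit = cv_err - train_err  # gap between cross-validation and training error
print(train_err, cv_err, overfit)  # prints: 0.3 0.3 0.0
```

A gap of zero is what one would expect from ZeroR, since it ignores every attribute. For J48 or IBk the same subtraction gives a rough measure of over-fitting to record in the last column.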

PART 2: NUMERIC PREDICTION

Numeric Prediction of the balance attribute in the bank data of part 1. The main goal is to achieve the lowest mean absolute error with the lowest amount of overfitting.

1. Run the following classifiers, with default parameters, on this data: ZeroR, M5P, IBk. Construct a table of the training and cross-validation errors. You may want to turn on "Output Predictions" to get a better sense of the magnitude of the error on each example. What do you conclude from these results?

2. Explore different parameter settings for M5P and IBk. Which values give the best performance in terms of predictive accuracy and overfitting? Include the results of the best five runs in your table of results.

3. Investigate three other classifiers for numeric prediction and their associated parameters. Include your best five runs in your table of results. Which classifier gives the best performance in terms of predictive accuracy and overfitting?

4. What golden nuggets did you find, if any?
Report length: up to one page.
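
As a worked illustration of the mean absolute error that Part 2 asks you to minimize, the sketch below scores a ZeroR-style baseline, which for a numeric class predicts the mean of the training values. The balance figures are made up for the example and are not taken from the bank data.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - actual| over all examples."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical balance values (not from the assignment data).
balances = [100.0, 250.0, 400.0, 1250.0]

# ZeroR-style baseline for a numeric target: predict the mean for every example.
mean_balance = sum(balances) / len(balances)
baseline = [mean_balance] * len(balances)

mae = mean_absolute_error(baseline, balances)
print(mae)  # prints: 375.0
```

Any model worth reporting should beat this baseline on cross-validation, and the difference between its training and cross-validation MAE indicates how much it over-fits.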

PART 3: CLUSTERING

Clustering of the bank data of part 1. For this part use only the attributes age, marital, education, and balance.

The aim is to determine the number of clusters in the data and assess whether any of the clusters are meaningful.

1. Run the K-means clustering algorithm on this data for the following values of K: 1, 2, 3, 4, 5, 10, 20. Analyse the resulting clusters. What do you conclude?

2. Choose a value of K and run the algorithm with different seeds. What is the effect of changing the seed?

3. Run the EM algorithm on this data with the default parameters and describe the output.

4. The EM algorithm can be quite sensitive to whether the data is normalized or not. Use the Weka normalize filter (Preprocess --> Filter --> unsupervised --> normalize) to normalize the numeric attributes. What difference does this make to the clustering runs?

5. The EM algorithm can be quite sensitive to the values of minLogLikelihoodImprovementCV, minStdDev and minLogLikelihoodImprovementIterating. Explore the effect of changing these values. What do you conclude?

6. How many clusters do you think are in the data? Give an English language description of one of them.

7. Compare the use of K-means and EM for these clustering tasks. Which do you think is best? Why?

8. What golden nuggets did you find, if any?
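
The seed and normalization questions above can be explored outside Weka with a minimal sketch. It is a plain Lloyd's k-means on two numeric attributes only, using made-up age and balance values. It is not Weka's implementation, it ignores the nominal attributes marital and education, and the function names are illustrative assumptions.

```python
import random

def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iterations=20):
    """Plain Lloyd's k-means; returns (centroids, within-cluster squared error)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed controls the starting centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse

# Hypothetical values, not taken from the bank data.
ages = [25, 31, 38, 44, 52, 60]
balances = [200, 1500, 90, 4200, 310, 2800]
points = list(zip(min_max_normalize(ages), min_max_normalize(balances)))

for seed in (1, 2, 3):
    _, sse = kmeans(points, k=2, seed=seed)
    print(f"seed={seed} within-cluster squared error={sse:.4f}")
```

Running the loop with raw rather than normalized columns shows why scaling matters: balance spans thousands while age spans tens, so unscaled distances are dominated by balance.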

PART 4: ASSOCIATION FINDING

These files contain the same details of shopping transactions represented in two different ways. You can use a text viewer to look at the files.

1. What is the difference in representations?

2. Load the file supermarket1.arff into Weka and run the Apriori algorithm on this data. You might need to restrict the number of attributes and/or the number of examples. What significant associations can you find?

3. Explore different possibilities of the metric type and associated parameters. What do you find?

4. Load the file supermarket22.arff into Weka and run the Apriori algorithm on this data. What do you find?

5. Explore different possibilities of the metric type and associated parameters. What do you find?

6. Try the other associators. What are the differences to Apriori?

7. What golden nuggets did you find, if any?

8. [OPTIONAL] Can you find any meaningful associations in the bank data?
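
When comparing metric types for association rules, it can help to compute the measures by hand on a tiny example. The sketch below defines support, confidence, lift and leverage in their usual textbook form for a rule A -> B. The four transactions are invented for illustration and do not come from the supermarket files.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """confidence(A -> B) / support(B); values above 1 suggest positive association."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

def leverage(transactions, antecedent, consequent):
    """support(A and B) - support(A) * support(B)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) - (
        support(transactions, antecedent) * support(transactions, consequent)
    )

# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule = (["bread"], ["milk"])
print("support   ", support(transactions, rule[0] + rule[1]))
print("confidence", confidence(transactions, *rule))
print("lift      ", lift(transactions, *rule))
print("leverage  ", leverage(transactions, *rule))
```

In this toy data the rule bread -> milk has fairly high confidence but a lift below 1, which is the kind of disagreement between metrics that question 3 invites you to look for.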

Attachment:- Assign.rar


Course: COSC2110 - Data Mining - RMIT University

This assignment counts for 23% of the total marks in this course. Due date: 9:00am Monday 27. Submit through Canvas. You can work on this assignment individually or in a group of 2. In this assignment you are asked to apply a number of algorithms to a number of data sets and write a report on your findings. You will be assessed on methodology, analysis of results and conclusions.
