Find natural groupings in the data

Reference no: EM131177725

Data Analysis using R and Weka

Overview

The coursework is organized into three parts, each focusing on a different aspect of data pre-processing, data analysis or data mining. All parts use the same data set. The first part focuses on describing and visualizing the data and preparing it for subsequent treatment ('pre-processing'). The second part focuses on clustering and the third on classification. The main goal is to give you first-hand experience of working with a relatively large, real data set, from the earliest stages of data description to the later stages of knowledge extraction and prediction.

Data Set

The data set is a slightly modified version of a reference data set available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml). The data concerns the modeling of wine quality based on physicochemical tests (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). Each record consists of 11 attribute (input) columns, and one class (output) column corresponding to the quality of wine, rated on a ten-point scale. The attributes are various physical / chemical properties of the wine. The entire data set consists of 1600 instances. Figure 1 provides a small snapshot of the data: the header row and 15 instances of the data set. Some of the variables contain missing values, which are indicated by empty entries.

[Figure 1: snapshot of the data set, showing the header row and 15 instances]

Part 1 - Description, Visualisation and Pre-processing [R Only]

a) Explore the data

i. Use as many functions/techniques in R as necessary to adequately describe and visualize the data. Provide a table for all the attributes of the dataset including the measures of centrality (mean, median etc.), dispersion and how many missing values each attribute has. Use the table to make comments about the data.

ii. Produce histograms for each attribute. Explain how you created the histograms and comment on the distribution of the data. Also use the descriptive statistics you produced above to help you characterise the shape of each distribution.
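One possible way to build the summary table and the histograms is sketched below. It is only an illustration: the data frame `wine` and its columns `a1` and `a2` are made-up stand-ins for the coursework data, and the choice of standard deviation and interquartile range as dispersion measures is an assumption.

```r
# Toy data standing in for the coursework data set; the column names
# a1 and a2 are hypothetical, not the real attribute names.
wine <- data.frame(
  a1 = c(7.4, 7.8, NA, 11.2, 7.4),
  a2 = c(0.70, 0.88, 0.76, NA, NA)
)

# Centrality, dispersion and missing-value count for one attribute.
describe_attribute <- function(x) {
  c(mean    = mean(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    iqr     = IQR(x, na.rm = TRUE),
    missing = sum(is.na(x)))
}

# One row per attribute, one column per statistic.
summary_table <- t(sapply(wine, describe_attribute))
print(round(summary_table, 3))

# One histogram per attribute; hist() ignores missing values.
for (name in names(wine)) {
  hist(wine[[name]], main = name, xlab = name)
}
```

Comparing the mean and median columns of such a table gives a quick hint about skew before looking at the histograms.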

b) Explore the relationships between the attributes, and between the class and the attributes

i. Calculate the correlations between er and pgr, b1 and b2, and p1 and p2 (three correlations). What do these tell you about the relationships between these variables?

ii. Produce scatterplots between the class variable and er, pgr and h1 variables (note: you may have to recode the class variable as numeric to produce scatterplots). What do these tell you about the relationships between these three variables and the class?

c) General Conclusions

Take into consideration all the descriptive statistics, visualisations and correlations you produced, together with the missing values, and comment on the importance of the attributes. Which of the attributes seem to hold significant information, and which can you regard as insignificant? Provide an explanation for your choice.

d) Dealing with missing values in R

i. Write a script in R to find missing values and replace them using three strategies: replacement with 0, with the attribute mean, and with the attribute median.

ii. Compare and contrast these approaches.
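The three replacement strategies could be sketched along the following lines. The helper `impute()`, the data frame `wine` and its columns are hypothetical names chosen for illustration, and missing values are assumed to have been read into R as `NA`.

```r
# Toy data with missing entries; the column names are hypothetical.
wine <- data.frame(
  a1 = c(1, 2, NA, 9),
  a2 = c(NA, 4, 6, 8)
)

# Find the missing values: count of NAs per attribute.
missing_per_attribute <- colSums(is.na(wine))
print(missing_per_attribute)

# Replace NAs in every numeric column. `fill` is either a constant
# or a function such as mean or median that accepts na.rm = TRUE.
impute <- function(df, fill) {
  for (name in names(df)) {
    x <- df[[name]]
    if (is.numeric(x) && anyNA(x)) {
      value <- if (is.function(fill)) fill(x, na.rm = TRUE) else fill
      x[is.na(x)] <- value
      df[[name]] <- x
    }
  }
  df
}

wine_zero   <- impute(wine, 0)
wine_mean   <- impute(wine, mean)
wine_median <- impute(wine, median)
```

Keeping the three results under separate names makes it easy to compare, for example, how each strategy shifts the mean and standard deviation of an attribute.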

f) Attribute transformation

Explore the use of three transformation techniques (mean centering, normalisation and standardisation) to scale the attributes, and compare their various effects.
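As a rough illustration, the three transformations might be written as below. This sketch assumes that 'normalisation' means min-max scaling to the range [0, 1] and 'standardisation' means z-scores; the function names and the toy data frame are hypothetical.

```r
# Each helper transforms one numeric vector; NAs are ignored when
# computing the statistics and stay NA in the result.
mean_centre <- function(x) x - mean(x, na.rm = TRUE)
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Toy data with two attributes on very different scales.
wine <- data.frame(
  a1 = c(2, 4, 6, 8, 10),
  a2 = c(100, 150, 200, 250, 300)
)

wine_centred      <- as.data.frame(lapply(wine, mean_centre))
wine_normalised   <- as.data.frame(lapply(wine, min_max))
wine_standardised <- as.data.frame(lapply(wine, z_score))

# Compare the effect on location and spread of each attribute.
print(sapply(wine_centred, mean))
print(sapply(wine_normalised, range))
print(sapply(wine_standardised, sd))
```

Mean centering only shifts each attribute, while the other two also put attributes with different units on a comparable scale, which matters for distance-based methods such as k-means.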

g) Attribute / instance selection

i. Starting again from the raw data, consider attribute and instance deletion strategies to deal with missing values. Choose a number of missing values per instance or per attribute and delete instances/attributes accordingly. Explain your choice.

ii. Consider using correlations between attributes to reduce the number of attributes. Try to reduce the dataset to contain only uncorrelated attributes.

iii. Use principal component analysis in R to create a data set with ten attributes.
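A minimal sketch of the principal component step, using `prcomp()` from base R, is given below. The random 12-column data frame only stands in for the coursework data, and the sketch assumes missing values have already been handled, since `prcomp()` cannot work with `NA` entries as written here.

```r
# Random stand-in data: 50 instances, 12 numeric attributes, no NAs.
set.seed(1)
wine_complete <- as.data.frame(matrix(rnorm(50 * 12), nrow = 50))

# Centre and scale so that no attribute dominates because of its units.
pca <- prcomp(wine_complete, center = TRUE, scale. = TRUE)

# Keep the scores on the first ten principal components as a new data set.
wine_pca10 <- as.data.frame(pca$x[, 1:10])

# Cumulative share of variance retained by the first ten components.
cumulative <- summary(pca)$importance["Cumulative Proportion", 1:10]
print(round(cumulative, 3))
```

The cumulative proportion shows how much of the original variance the ten-attribute data set preserves, which is useful when justifying the reduction.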

As a result, you will end up with several different data sets to be used in Parts 2 and 3. Give each data set a clear and distinct name, so that you can easily refer to it again in the later stages.

Part 2 - Clustering [R Only]

Using R (only), explore the use of clustering to find natural groupings in the data, without using the class variable - i.e. use only the 20 numeric (input) attributes to perform the clustering. Once the data is clustered, you may use the class variable to evaluate or interpret the results (how do the new clusters compare to the original classes?).

a) Use hierarchical clustering, k-means and PAM to partition the data into seven clusters and report the results. Which algorithm produces the best results when compared to the class attribute?

b) As each of these algorithms has adjustable parameters, you may explore the 'optimisation' or 'tuning' of these parameters, either manually or (preferably) automatically. Which parameters produce the best results for each clustering algorithm? Explain the reasoning behind the techniques you used to find the optimal parameters.

c) Choose one of the above clustering algorithms and apply it to the alternative data sets that you produced in Part 1.

i. The reduced data set featuring only the first 10 Principal Components.

ii. The dataset after deletion of instances and attributes.

iii. The three datasets after you replaced missing values with the three techniques.

iv. Which of these datasets had a positive impact on the quality of the clustering? Provide explanations using the results for each clustering of the alternative data set.
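The comparison of the three algorithms against the class variable could be sketched as follows. Several things here are assumptions made for illustration: the built-in `iris` data stands in for the coursework data, `k` is set to 3 to match that toy data (the coursework asks for seven clusters), Ward linkage is one arbitrary choice for the hierarchical method, and purity is just one possible agreement measure. `pam()` comes from the `cluster` package, which ships with standard R distributions.

```r
library(cluster)  # provides pam()

# Stand-in data: numeric inputs only; the class is held back for evaluation.
features <- scale(iris[, 1:4])
classes  <- iris$Species
k <- 3  # toy value; the coursework asks for 7

set.seed(1)  # k-means starts from random centres
hc_labels  <- cutree(hclust(dist(features), method = "ward.D2"), k = k)
km_labels  <- kmeans(features, centers = k, nstart = 25)$cluster
pam_labels <- pam(features, k = k)$clustering

# Cross-tabulate clusters against the original classes.
print(table(cluster = km_labels, class = classes))

# Purity: share of instances that fall in the majority class of their cluster.
purity <- function(labels, classes) {
  tab <- table(labels, classes)
  sum(apply(tab, 1, max)) / sum(tab)
}

results <- sapply(
  list(hierarchical = hc_labels, kmeans = km_labels, pam = pam_labels),
  purity, classes = classes
)
print(round(results, 3))
```

Running the same lines on each alternative data set from Part 1 and collecting `results` in one table gives a direct basis for the comparison asked for in c).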

Part 3 - Classification [Weka and R]

You must use Weka to perform the classification, but you may choose to use R to present results. Use Weka to explore the use of various classification techniques to create models that predict the given class from the input attributes. Split the data randomly into a training set (2/3 of the data) and a test set (1/3 of the data).

a) Try the following classification algorithms: ZeroR, OneR, NaïveBayes, IBk (k-NN) and J48 (C4.5). Which algorithm produces the best results?

b) Choose one classification algorithm of the above and explore various parameter settings for each of the different splits of data. Which parameters improve the predictive ability of the algorithm?

c) Choose one of the above classification algorithms and apply it to the data sets you created in Part 1:

i. The reduced data set featuring only the first 10 Principal Components.

ii. The dataset after deletion of instances and attributes.

iii. The three datasets after you replaced missing values with the three techniques.

iv. Which of the datasets had a positive impact on the predictive ability of the algorithm? Provide explanations using the classification results for each alternative data set.
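If you prefer to prepare the random 2/3 and 1/3 split in R before loading the files into Weka, it might look like the sketch below. This is only one option (Weka can also split the data itself); the built-in `iris` data, the seed and the file names are placeholders.

```r
# Stand-in data; replace with one of the named data sets from Part 1.
data_set <- iris

set.seed(42)  # fixed seed so the same split can be reproduced
n <- nrow(data_set)
train_idx <- sample(n, size = round(2 / 3 * n))

train <- data_set[train_idx, ]
test  <- data_set[-train_idx, ]

# Write both parts as CSV files, a format Weka can open.
train_file <- file.path(tempdir(), "train.csv")
test_file  <- file.path(tempdir(), "test.csv")
write.csv(train, train_file, row.names = FALSE)
write.csv(test, test_file, row.names = FALSE)
```

Using the same seed for every alternative data set keeps the same instances in the training and test sets, provided the rows are in the same order, so differences in accuracy are easier to attribute to the pre-processing.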
