Perform suitable exploratory analyses to examine the data

Reference no: EM13838969

Portfolio 1 - Classification and partitioning

In addition to the combined marks for each of the portfolio tasks, you will also be graded on the structure, presentation and clarity of the portfolio as a whole. So your work should be professionally presented, with good use of English.

In the real world, you will be expected to communicate the results from any analysis you perform to non-specialists, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.

Deadline

The final submission deadline for this task is week 24. However, as you will be set further tasks over the coming weeks, I recommend that you try to complete this one by the end of week 22.

Description

This coursework requires you to use a variety of data mining techniques to explore the structure of your data, and use them to build a thorough description of the relationships and differences within the observations.

You may use any software you feel is appropriate for initial exploratory analysis of the data. However, you must use appropriate data mining software as your main tools. This should include both RapidMiner and R, although you may work with other packages or programming languages if you wish, and you may add extensions to RapidMiner, such as Weka and R.

(In particular, before using R or RapidMiner for discriminant analysis, you may wish to begin by performing an analysis in SPSS and/or Minitab. These will provide you with more detailed output which you can use to compare against the results you obtain from the data mining tools.)

N.B. It is possible to pass this assignment by just using the methods described in the lecture notes and course books (see below). However, additional marks are available for demonstrating research and exploration which leads to improving your models or overcoming initial limitations of the methods described.

Classification task 1

This task uses the well-known Iris data set.

The data were first collected by the American botanist Edgar Anderson, and became a popular vehicle for exploring multivariate statistical methods after Ronald Fisher used them to illustrate discriminant analysis in 1936. This version is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Iris

The data consists of four different measurements taken from 50 irises each of three different species. The original data set does not include any identification label for the observations, but I have added one - you may find it useful when assessing your results (don't forget that this should not be included in any analysis).

For some of the tasks, you will need to separate the data into training and testing data sets. As the data is ordered, you will need to use some method of randomisation or randomised sampling, which you should do using the appropriate software.
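The idea of a randomised split can be sketched outside the required tools. The following is a minimal illustration in plain Python, not the R or RapidMiner sampling functions the brief asks you to use; the function name, the 30% test fraction, the seed and the row layout are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)                  # copy, so the ordered original is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical ordered data: 150 row ids, 50 per species, mirroring the Iris layout.
ordered_rows = [(i, "species_%d" % (i // 50)) for i in range(150)]
train, test = train_test_split(ordered_rows)
```

Fixing the seed means the same training and testing sets can be reused in another package, which is what makes a like-for-like comparison of software possible.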

You should employ the sampling functions of the data mining software you use. For consistency, and to assess the relative strengths of the software and algorithms used, you may use the sets from one package in another. But I want to see evidence that you are using as much of the relevant functionality in your software as possible.

In each case, consider whether the strength of your models can be improved by restricting the variables used.

Compare the R and RapidMiner results, giving an account of their similarities and differences, and assessing their relative strengths and weaknesses.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the species.
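One simple form of exploratory analysis is a per-group summary of each variable. The sketch below, in plain Python rather than R or RapidMiner, computes the count, mean and standard deviation of one measurement for each group; the function name and the toy values are hypothetical and are not taken from the Iris data.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarise_by_group(rows):
    """rows is a list of (group, value) pairs; returns n, mean and sd per group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: {"n": len(v), "mean": mean(v), "sd": stdev(v)}
            for g, v in groups.items()}

# Hypothetical measurements for two made-up groups.
toy_rows = [("A", 1.0), ("A", 3.0), ("B", 10.0), ("B", 14.0)]
summary = summarise_by_group(toy_rows)
```

Comparing the group means against the within-group spread gives a first indication of which variables separate the groups well.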

Use your results to decide whether you need to standardise the data in any way for the models you will build.
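One common form of standardisation is the z-score, which rescales each variable to mean 0 and standard deviation 1 so that no variable dominates a distance calculation because of its units. This is a minimal sketch in plain Python, assuming z-scores are the transformation wanted; the function names and values are hypothetical.

```python
from statistics import mean, stdev

def standardise(column):
    """Return z-scores: subtract the column mean, divide by the sample standard deviation."""
    m = mean(column)
    s = stdev(column)
    return [(x - m) / s for x in column]

def standardise_columns(rows):
    """Apply standardise to every column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    scaled = [standardise(c) for c in columns]
    return [list(r) for r in zip(*scaled)]

# Hypothetical rows with two variables on very different scales.
toy = [[5.0, 100.0], [6.0, 200.0], [7.0, 300.0]]
scaled = standardise_columns(toy)
```

After scaling, both columns of the toy data carry equal weight in any distance-based method such as k-NN or k-means.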

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.
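The mechanics of k-NN can be sketched in a few lines: find the k training points closest to a new observation and take a majority vote of their labels. The version below is an illustration in plain Python using Euclidean distance, not the R or RapidMiner implementation; the function names, k = 3 and the toy points are assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(predicted, actual):
    """Proportion of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical two-variable data with two well-separated groups.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]
test_points = [(1.1, 1.0), (5.1, 5.0)]
test_labels = ["A", "B"]
predicted = [knn_predict(train_points, train_labels, q) for q in test_points]
score = accuracy(predicted, test_labels)
```

Repeating the accuracy check for several values of k, and with or without standardised variables, is one way to look for improvements.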

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning.

Can you learn anything from changing the value of k?

Use hierarchical cluster analysis to justify (or otherwise) your value for k.
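To show what k-means does and how changing k can be assessed, here is a minimal sketch of the standard assign-then-update loop in plain Python, with the within-cluster sum of squares as a measure of fit. It is not the R or RapidMiner routine; the function names, the seeded random start and the toy points are assumptions.

```python
import random
from math import dist

def kmeans(points, k, seed=0, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k randomly chosen observations
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # no point changed cluster: converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

def within_ss(points, assignment, centroids):
    """Total squared distance from each point to its cluster centroid."""
    return sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))

# Hypothetical two-variable data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
fit_1 = kmeans(points, 1)
fit_2 = kmeans(points, 2)
ss_1 = within_ss(points, *fit_1)
ss_2 = within_ss(points, *fit_2)
```

Plotting the within-cluster sum of squares against k and looking for the point where it stops falling sharply is one common way to argue for a particular k.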

d) Build a decision tree for the data using RapidMiner and R.

Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable species.
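Leaf purity can be made concrete with the Gini impurity, one of several split criteria used by tree algorithms. The sketch below, in plain Python, scores every candidate threshold on a single variable and keeps the one giving the purest pair of leaves. Whether your R or RapidMiner tree uses Gini, information gain or another criterion depends on its settings, so treat this as an assumption; the names and toy values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure leaf, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one variable."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))            # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical single measurement for three made-up classes.
values = [1.4, 1.5, 1.3, 4.5, 4.7, 5.9, 6.0]
labels = ["a", "a", "a", "b", "b", "c", "c"]
threshold, impurity = best_split(values, labels)
```

A full tree repeats this search over every variable and then recurses into each leaf, which is why pruning or a minimum leaf size is usually needed to avoid overfitting.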

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual species.
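A discriminant equation is a weighted sum of the variables, with weights chosen to separate the groups as well as possible. The sketch below works through the simplest case in plain Python: Fisher's linear discriminant for two classes and two variables, with a cut-off midway between the projected class means. The three-species problem in this task generalises this, and the R and RapidMiner output will differ in scaling. All names and data are hypothetical.

```python
def mean_vector(points):
    """Mean of each of the two variables."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]

def scatter(points, m):
    """2x2 within-class scatter matrix around the class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(class0, class1):
    """Return weights w = Sw^-1 (m1 - m0) and the midpoint cut-off on the score."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Multiply diff by the explicit inverse of the 2x2 matrix sw.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    score0 = w[0] * m0[0] + w[1] * m0[1]
    score1 = w[0] * m1[0] + w[1] * m1[1]
    return w, (score0 + score1) / 2

def classify(point, w, cut):
    """Assign class 1 when the discriminant score exceeds the cut-off, else class 0."""
    return 1 if w[0] * point[0] + w[1] * point[1] > cut else 0

# Hypothetical two-variable observations for two made-up classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 5.0), (7.0, 4.0), (7.0, 6.0), (8.0, 5.0)]
w, cut = fisher_discriminant(class0, class1)
assigned = [classify(p, w, cut) for p in class0 + class1]
```

Comparing the assigned classes with the actual ones gives the same kind of confusion table that the statistical packages report.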

f) Give an overall summary of your results above to give a description of how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data, and in particular, if the results from one method helped you refine another.

Are there any observations which cause problems for the different methods?

Classification task 2

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. (A cultivar is a grouping of plants which have similar, usually sought-after properties.) The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The data are originally attributed to M. Forina, and the original set may have been much larger. This version was donated to the UCI Machine Learning Repository by Stefan Aeberhard.

See https://archive.ics.uci.edu/ml/datasets/Wine

(A slightly reduced version is available within your R installation, but this is the most complete version I could find.)

Note that this is a larger and more complex data set than the one used in task 1, and is therefore more like the data typically encountered in practice.

a) Perform suitable exploratory analyses to examine the data, in particular how the values of the variables change with the three different cultivars.

Note that, as you have 13 numeric variables in this data set, you may find that you can reduce the size of your models based on your EDA observations.

b) Use the k-NN algorithm to produce an assignment model for the data, using R and RapidMiner. In both cases, check the accuracy of the predictions, and use appropriate methods to try to improve it if necessary.

c) Perform a k-means cluster analysis on the data. Explain your choice of value for k and assess the strength of your results in terms of accuracy of partitioning.

Can you learn anything from changing the value of k?

Use hierarchical cluster analysis to justify (or otherwise) your value for k.

d) Build a decision tree for the data using RapidMiner and R.

Use appropriate methods to refine the tree to try to achieve maximum leaf purity based on the outcome variable cultivars.

e) Use RapidMiner and R to produce a discriminant analysis of the data, with the goal of finding a set of discriminant equations which will best assign observations to their actual cultivars.

f) In the above sections you built your models based on classifying wines according to the cultivar from which they were made.

One could quite reasonably explore some other way of classifying wines - alcohol content, for example.

Using the results of your exploratory data analysis, find a suitable method of classifying wines by their alcohol content and re-run your data mining modules to reflect this.
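One possible way to turn a continuous variable such as alcohol content into classes is to cut it at its quantiles, giving groups of roughly equal size. The sketch below does this in plain Python with three bands. The choice of three equal-sized bands, the band names and the toy values are assumptions; your exploratory analysis may suggest different cut points.

```python
from bisect import bisect_right
from statistics import quantiles

def bin_by_quantiles(values, labels=("low", "medium", "high")):
    """Cut values into len(labels) bands at the sample quantiles; returns (classes, cuts)."""
    cuts = quantiles(values, n=len(labels))   # len(labels) - 1 cut points
    classes = [labels[bisect_right(cuts, v)] for v in values]
    return classes, cuts

# Hypothetical alcohol percentages, not taken from the Wine data.
alcohol = [11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0]
classes, cuts = bin_by_quantiles(alcohol)
```

The resulting class labels can then replace cultivar as the outcome variable when the earlier models are re-run.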

How do your results compare to the first set of models?

g) Give an overall summary of your results above to give a description of how the combination of classification techniques builds a picture of the data set.

Identify which methods, algorithms, software etc. do the best job of explaining the data.

Are there any observations which cause problems for the different methods?

Basic source material:

Lantz, B (2013) Machine Learning with R. Packt Publishing Ltd.

North, M (2012) Data Mining for the Masses. Global Text Project.

Zumel, N and Mount, J (2014) Practical Data Science with R. Manning Publications Co.
