Evaluate menu to determine which is the best tool

Reference no: EM132655837

Classification Using Rattle

A hypothetical Melbourne suburb was surveyed with a view to its redevelopment potential. In particular there was interest in finding adjacent properties for more intensive redevelopment. Only 2887 (2.5%) of more than 45,000 properties were redeveloped between 2004 and 2009, making this a relatively rare event. Our goal is to predict which 2004 properties were redeveloped between 2004 and 2009 based on various 2004 variables and recent changes in the immediate neighbourhood. The data should be partitioned 70:15:15 for training, validation and testing. Only a few of the available variables are considered in this assignment. Please include all outputs for each question.

Names of Variables	Measurement Scale	Example Property	Description
DwellingsConstructed_200m	Interval	4	Number of dwellings constructed within 200m between 2000 and 2004
NetDwellingIncrease_200m	Interval	3	Increase in number of dwellings within 200m between 2000 and 2004
redevPotIndex_2004	Interval	.025	2004 assessment of redevelopment potential based on property dimensions
strata	Binary	0	Strata housing (1=yes, 0=no)
BuildingProjects_200m	Interval	2	Number of building projects within 200m between 2000 and 2004
Demolitions_200m	Interval	1	Number of demolitions within 200m between 2000 and 2004
Road Frontage(m)	Interval	20	Length of road frontage
Redeveloped 2004-2009	Binary	?	The response/target variable coded equal to one for properties redeveloped between 2004 and 2009, 0 otherwise.

a) The redevelop.csv data contains data for a random sample of the properties that were not redeveloped and all the properties that were redeveloped, resulting in a data set containing a total of only 7409 properties.
i) Why was only a random sample of the properties that were not redeveloped between 2004 and 2009 chosen?
ii) What else could have been done to achieve a similar effect?

b) Open R and include the rattle package. What instructions did you use to do this?

c) Read the redevelop.csv data into Rattle and assign appropriate roles to your variables. Note that the partition is 70% for training, 15% for validation and 15% for testing.

What is the target variable?
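
The 70:15:15 partition in part (c) can be pictured with a short Python sketch. This is only an illustration of the idea, not how Rattle itself performs the split: the function name `partition_indices`, the seed value and the rounding rule (round down, with the remainder going to the test set) are all assumptions made for the example. The row count 7409 is the data set size given in part (a).

```python
import random

def partition_indices(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle row indices and split them into train/validate/test lists.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = int(fractions[0] * n_rows)      # round down
    n_validate = int(fractions[1] * n_rows)   # round down
    train = indices[:n_train]
    validate = indices[n_train:n_train + n_validate]
    test = indices[n_train + n_validate:]     # whatever is left over
    return train, validate, test

train, validate, test = partition_indices(7409)
print(len(train), len(validate), len(test))
```

The three lists are disjoint and together cover every row, so each property is used for exactly one of training, validation or testing.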

d) Produce suitable plots to visualise the differences in the distributions of the input variables for properties that were and were not redeveloped. Try to show at least six different types of plot.

e) Fit a classification tree for redeveloped properties assuming a loss matrix in which the loss for a false negative (Redeveloped="No" when it should be Redeveloped="Yes") is half the loss for a false positive (Redeveloped="Yes" when it should be Redeveloped="No"). Assume no loss when a correct decision is made. Draw your tree for the training data, then answer the following questions. Be sure to maximise your tree window before drawing your tree (again).

i. Complete the above loss matrix.

ii. What are the rules for the terminal node with the smallest error rate?

iii. How many splits are needed to minimise the cross-validation error? Explain your answer.

iv. Consider node 2 of your drawn tree. How many training observations are in node 2, and what are the rules for node 2?

v. At node 2 in the training data what is the average loss per property if we make a Redevelopment="Yes" decision? What is the average loss per property if we make a Redevelopment = "No" decision? Which is the better decision for this node?

vi. Repeat (v) for some other node where the better decision is unexpected. Explain why the better decision is unexpected.
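
The average-loss comparison in parts (v) and (vi) is plain arithmetic once the loss matrix is fixed. The sketch below assumes a false positive costs 2 and a false negative costs 1 (any pair in a 2:1 ratio satisfies the brief), and the node counts of 300 "No" and 500 "Yes" properties are hypothetical, chosen only to show a case where the lower-loss decision disagrees with the majority class. The function names are also illustrative.

```python
# Loss matrix keyed by (actual, decision).
# Assumed scale: false positive = 2, false negative = 1 (half as big).
LOSS = {
    ("No", "No"): 0.0,    # correct decision, no loss
    ("No", "Yes"): 2.0,   # false positive
    ("Yes", "No"): 1.0,   # false negative
    ("Yes", "Yes"): 0.0,  # correct decision, no loss
}

def average_loss(n_actual_no, n_actual_yes, decision):
    """Average loss per property if every property in the node gets `decision`."""
    total = n_actual_no + n_actual_yes
    loss = (n_actual_no * LOSS[("No", decision)]
            + n_actual_yes * LOSS[("Yes", decision)])
    return loss / total

def better_decision(n_actual_no, n_actual_yes):
    """Return the decision with the smaller average loss for the node."""
    return min(("No", "Yes"),
               key=lambda d: average_loss(n_actual_no, n_actual_yes, d))

# Hypothetical node: 300 properties not redeveloped, 500 redeveloped.
print(average_loss(300, 500, "Yes"))  # 300 * 2 / 800 = 0.75
print(average_loss(300, 500, "No"))   # 500 * 1 / 800 = 0.625
print(better_decision(300, 500))      # "No", although most of the node is "Yes"
```

With these made-up counts the majority class is "Yes", yet deciding "No" gives the lower average loss because a false positive is twice as costly. This is the kind of unexpected result part (vi) asks about.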

f) Run a random forest with your data with 500 trees, randomly selecting three input variables from which to choose your split variable at each node. Please include all outputs for each question.

i. What is the OOB estimate of the error rate and what does OOB mean?
ii. What is the error rate for the Redevelopment = "Yes" predictions with the test data and what is the error rate for Redevelopment = "No" predictions with the test data?
iii. Which are the top 3 predictor variables according to the Gini measure of variable importance and how is this measure defined?
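
As background for the Gini measure in part (iii): for a two-class node, the Gini impurity is 2p(1 - p), where p is the proportion of "Yes" cases, and a split is credited with the drop from the parent's impurity to the size-weighted impurity of its children. The sketch below computes that drop for one split. The class counts are hypothetical and the function names are illustrative. A forest's importance score for a variable is typically built by accumulating these decreases over every split that uses the variable and averaging over trees, and the exact weighting depends on the implementation.

```python
def gini_impurity(n_no, n_yes):
    """Gini impurity of a two-class node: 1 - p_no^2 - p_yes^2 = 2 * p_yes * p_no."""
    total = n_no + n_yes
    if total == 0:
        return 0.0
    p_yes = n_yes / total
    return 2 * p_yes * (1 - p_yes)

def gini_decrease(parent, left, right):
    """Impurity decrease for one split; each argument is a (n_no, n_yes) pair."""
    n_parent = sum(parent)
    weighted_children = (sum(left) * gini_impurity(*left)
                         + sum(right) * gini_impurity(*right)) / n_parent
    return gini_impurity(*parent) - weighted_children

# Hypothetical split of a node holding 60 "No" and 40 "Yes" properties.
decrease = gini_decrease(parent=(60, 40), left=(50, 10), right=(10, 30))
print(round(decrease, 4))
```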

g) Now try Boosting. Please include all outputs for each question.

i. Interpret the term Gain and explain why this measure provides a reliable measure of Variable Importance.

ii. What does the Error Plot suggest as the optimum number of trees?

h) Now try a neural network with two and then three hidden nodes. Use the Evaluate menu error matrix to answer the following questions. Please include all outputs for each question.

i. Is it necessary to transform any of the input variables? What transformations have you chosen and why?
ii. What is the error rate for properties that actually were redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iii. What is the error rate for properties that were not actually redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iv. Which is better, a 2 hidden node or a 3 hidden node solution? Why?
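
Parts (f)(ii) and (h)(ii)-(iii) ask for two different kinds of error rate from the same error matrix: one conditions on what was predicted, the other on what actually happened. The sketch below computes both from the four cell counts. The counts are hypothetical and the function and key names are illustrative, not Rattle output.

```python
def error_rates(tn, fp, fn, tp):
    """Error rates from a 2x2 error matrix given as cell counts.

    tn: actual No,  predicted No     fp: actual No,  predicted Yes
    fn: actual Yes, predicted No     tp: actual Yes, predicted Yes
    """
    return {
        # Conditioning on the actual class (rows of the matrix).
        "actual_no": fp / (tn + fp),
        "actual_yes": fn / (fn + tp),
        # Conditioning on the predicted class (columns of the matrix).
        "predicted_no": fn / (tn + fn),
        "predicted_yes": fp / (fp + tp),
        # Share of all cases that were misclassified.
        "overall": (fp + fn) / (tn + fp + fn + tp),
    }

# Hypothetical test-set counts.
rates = error_rates(tn=600, fp=80, fn=120, tp=300)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The two views can differ noticeably on the same matrix, so it is worth checking which one a question is asking for before reading a number off the Evaluate output.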

i) Use the Evaluate menu to determine which is the best tool for modelling your data: a single tree, a random forest, boosting or a neural network. Why have you chosen this method over the other three?

j) For this best tool show the ROC, sensitivity, risk and lift charts for the test data ONLY.

k) Explain the axes for each of the above four charts.
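
Of the four charts, the ROC curve has the most standard construction: the horizontal axis is the false positive rate and the vertical axis is the true positive rate (sensitivity), traced out as the classification threshold is lowered. The sketch below builds those points from predicted scores. It assumes no tied scores and at least one case of each class. The labels and scores are made up for illustration, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    labels: 1 for redeveloped, 0 otherwise. Assumes no tied scores.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one case at a time, highest score first.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up test cases: three redeveloped, three not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(area_under_curve(points))
```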

l) Which is the best method for choosing the most important predictor of Redevelopment = "Yes": plots, a single tree, a random forest, boosting or a neural network? Why have you chosen this method over the other four?

m) Do any of the above models appear to be worth commercialising? For what purpose?

Attachment:- Exercise.rar
