Reference no: EM132655837
Classification Using Rattle
A hypothetical Melbourne suburb was surveyed with a view to its redevelopment potential. In particular there was interest in finding adjacent properties for more intensive redevelopment. Only 2887 (2.5%) of more than 45,000 properties were redeveloped between 2004 and 2009, making this a relatively rare event. Our goal is to predict which 2004 properties were redeveloped between 2004 and 2009 based on various 2004 variables and recent changes in the immediate neighbourhood. The data should be partitioned 70:15:15 for training, validation and testing. Only a few of the available variables are considered in this assignment. Please include all outputs for each question.
| Names of Variables | Measurement Scale | Example Property (*) | Description |
|---|---|---|---|
| DwellingsConstructed_200m | Interval | 4 | Number of dwellings constructed within 200m between 2000 and 2004 |
| NetDwellingIncrease_200m | Interval | 3 | Increase in number of dwellings within 200m between 2000 and 2004 |
| redevPotIndex_2004 | Interval | .025 | 2004 assessment of redevelopment potential based on property dimensions |
| strata | Binary | 0 | Strata housing (1=yes, 0=no) |
| BuildingProjects_200m | Interval | 2 | Number of building projects within 200m between 2000 and 2004 |
| Demolitions_200m | Interval | 1 | Number of demolitions within 200m between 2000 and 2004 |
| Road Frontage (m) | Interval | 20 | Length of road frontage |
| Redeveloped 2004-2009 | Binary | ? | The response/target variable, coded 1 for properties redeveloped between 2004 and 2009, 0 otherwise. |
a) The redevelop.csv data contains data for a random sample of the properties that were not redeveloped and all the properties that were redeveloped, resulting in a data set containing a total of only 7409 properties.
i) Why was only a random sample of the properties that were not redeveloped between 2004 and 2009 chosen?
ii) What else could have been done to achieve a similar effect?
b) Open R and include the rattle package. What instructions did you use to do this?
c) Read the redevelop.csv data into Rattle and assign appropriate roles to your variables. Note that the partition is 70% for training, 15% for validation and 15% for testing.
What is the target variable?
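The 70:15:15 partition can be illustrated outside Rattle. The following is a minimal Python sketch, not the Rattle procedure itself: the function name `partition_indices` and the seed value are assumptions made for illustration, and only the row count (7409) comes from the assignment text.

```python
import random


def partition_indices(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle row indices and split them into train/validate/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    # A fixed seed (arbitrary choice here) makes the partition repeatable.
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_rows))
    n_valid = int(round(fractions[1] * n_rows))
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to testing
    return train, valid, test


train, valid, test = partition_indices(7409)
print(len(train), len(valid), len(test))
```

Every row lands in exactly one of the three sets, so the validation and test sets stay unseen while the models are being fitted.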
d) Produce suitable plots to visualise the differences in the distributions of the input variables for properties that were and were not redeveloped. Try to show at least six different types of plot.
e) Fit a classification tree for redeveloped properties assuming a loss matrix with losses half as big for a false negative (Redeveloped="No" when it should be Redeveloped="Yes") as for a false positive (Redeveloped="Yes" when it should be Redeveloped="No"). Assume no losses when a correct decision is made. Answer the following questions after drawing your tree for the training data. Be sure to maximise your tree window before drawing your tree (again).
i. Complete the above loss matrix.
ii. What are the rules for the terminal node with the smallest error rate?
iii. How many splits if we want to minimise the cross-validation error? Explain your answer.
iv. Consider node 2 of your drawn tree. How many training observations are there for node 2 and what are the rules for node 2?
v. At node 2 in the training data, what is the average loss per property if we make a Redevelopment = "Yes" decision? What is the average loss per property if we make a Redevelopment = "No" decision? Which is the better decision for this node?
vi. Repeat (v) for some other node where the better decision is unexpected. Explain why the better decision is unexpected.
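The average-loss comparison in (v) and (vi) is simple arithmetic once a loss matrix is fixed. The Python sketch below assumes one scaling that satisfies the stated ratio (false positive loss 2, false negative loss 1, correct decisions 0); any proportional scaling gives the same decision. The node counts are hypothetical and are not taken from the redevelop.csv tree.

```python
# Assumed scaling: a false negative costs half as much as a false positive.
LOSS_FALSE_POSITIVE = 2.0  # decide "Yes" when the property was not redeveloped
LOSS_FALSE_NEGATIVE = 1.0  # decide "No" when the property was redeveloped


def average_losses(n_yes, n_no):
    """Return (average loss if we decide "Yes", average loss if we decide "No")."""
    n = n_yes + n_no
    loss_if_yes = LOSS_FALSE_POSITIVE * n_no / n  # every actual "No" is a false positive
    loss_if_no = LOSS_FALSE_NEGATIVE * n_yes / n  # every actual "Yes" is a false negative
    return loss_if_yes, loss_if_no


def better_decision(n_yes, n_no):
    """Pick the decision with the smaller average loss (ties go to "No")."""
    loss_if_yes, loss_if_no = average_losses(n_yes, n_no)
    return "Yes" if loss_if_yes < loss_if_no else "No"


# Hypothetical node: 60 redeveloped and 40 not redeveloped properties.
loss_yes, loss_no = average_losses(n_yes=60, n_no=40)
print(loss_yes, loss_no, better_decision(60, 40))
```

In this hypothetical node most properties were redeveloped, yet deciding "No" has the lower average loss (0.6 against 0.8 per property). This is the kind of unexpected decision that an asymmetric loss matrix can produce.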
f) Run a random forest with your data with 500 trees, randomly selecting three input variables from which to choose your split variable at each node. Please include all outputs for each question.
i. What is the OOB estimate of the error rate and what does OOB mean?
ii. What is the error rate for the Redevelopment = "Yes" predictions with the test data, and what is the error rate for Redevelopment = "No" predictions with the test data?
iii. Which are the top 3 predictor variables according to the Gini measure of variable importance and how is this measure defined?
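As background to the Gini measure, the Python sketch below computes the Gini impurity of a two-class node and the impurity decrease produced by one split. A random forest's Gini importance for a variable accumulates such decreases over the splits made on that variable and averages them across trees. The function names, the weighting by node proportion and the counts are illustrative assumptions, not output from the assignment data.

```python
def gini(counts):
    """Gini impurity of a node given (n_yes, n_no) class counts."""
    n_yes, n_no = counts
    n = n_yes + n_no
    if n == 0:
        return 0.0
    p = n_yes / n
    return 2.0 * p * (1.0 - p)  # equals 1 - p^2 - (1 - p)^2 for two classes


def gini_decrease(parent, left, right):
    """Impurity decrease of a split, weighting each child by its share of the parent."""
    n_parent = sum(parent)
    weight_left = sum(left) / n_parent
    weight_right = sum(right) / n_parent
    return gini(parent) - weight_left * gini(left) - weight_right * gini(right)


# Hypothetical split: a 50/50 parent node divided into two purer children.
decrease = gini_decrease(parent=(50, 50), left=(40, 10), right=(10, 40))
print(round(decrease, 4))
```

A variable that repeatedly produces large decreases like this one across many trees ends up with a high importance score.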
g) Now try Boosting. Please include all outputs for each question.
i. Interpret the term Gain and explain why this measure provides a reliable measure of Variable Importance.
ii. What does the Error Plot suggest as the optimum number of trees?
h) Now try a neural network with two and then three hidden nodes. Use the Evaluate menu error matrix to answer the following questions. Please include all outputs for each question.
i. Is it necessary to transform any of the input variables? What transformations have you chosen and why?
ii. What is the error rate for properties that actually were redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iii. What is the error rate for properties that were not actually redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iv. Which is better, a 2 hidden node or a 3 hidden node solution? Why?
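Error rates can be read from an error matrix in two directions: by actual class (as asked here) or by predicted class (as asked for the random forest). The Python sketch below shows both, assuming a 2x2 matrix of counts with rows for actual classes and columns for predicted classes. The counts are hypothetical and the function name `error_rates` is an illustrative choice.

```python
def error_rates(tn, fp, fn, tp):
    """Return error rates read from a 2x2 error matrix of counts.

    Rows are actual classes, columns are predicted classes:
        tn = actual No,  predicted No     fp = actual No,  predicted Yes
        fn = actual Yes, predicted No     tp = actual Yes, predicted Yes
    """
    return {
        "actual_yes": fn / (fn + tp),     # redeveloped properties that were missed
        "actual_no": fp / (tn + fp),      # non-redeveloped properties flagged as Yes
        "predicted_yes": fp / (fp + tp),  # Yes predictions that were wrong
        "predicted_no": fn / (tn + fn),   # No predictions that were wrong
        "overall": (fp + fn) / (tn + fp + fn + tp),
    }


# Hypothetical test-set counts, for illustration only.
rates = error_rates(tn=600, fp=80, fn=100, tp=330)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

The same matrix gives different answers depending on whether the denominator is a row total (actual class) or a column total (predicted class), so it is worth stating which one is being reported.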
i) Use the Evaluate menu to determine which is the best tool for modelling your data: a single tree, a random forest, boosting or a neural network. Why have you chosen this one method over the other three methods?
j) For this best tool show the ROC, sensitivity, risk and lift charts for the test data ONLY.
k) Explain the axes for each of the above four charts.
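To make the chart axes concrete, the Python sketch below computes the quantities usually plotted on an ROC chart (false positive rate against true positive rate) and the lift at a chosen fraction of the ranked cases. It assumes distinct scores and uses made-up labels and scores. It does not reproduce Rattle's own charts, whose axis labels may differ.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    Assumes labels are 0/1 and scores are distinct (ties are not grouped).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def lift_at(labels, scores, fraction):
    """Lift = positive rate among the top-scored fraction / overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    k = max(1, round(fraction * len(labels)))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Made-up example: 1 = redeveloped, scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(roc_points(labels, scores))
print(lift_at(labels, scores, 0.25))
```

Here the top quarter of cases by score contains only redeveloped properties, twice the overall rate of 50%, so the lift at 25% is 2.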
l) Which is the best method for choosing the most important predictor of Redevelopment = "Yes"; plots, a single tree, a random forest, boosting, a neural network? Why have you chosen this one method over the other four methods?
m) Do any of the above models appear to be worth commercialising? For what purpose?
Attachment:- Exercise.rar