Reference no: EM132385795
STAT 601 R Programming Assignment
Answer every question in each problem and include a discussion of how your results address the question.
Please do the following problems from the textbook (R Handbook), as stated.
1. The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the package mlbench (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes (medv variable, in 1000s USD) based on the other predictors in the dataset. Use this dataset to do the following:
a) Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide a plot of the predicted vs. observed values. Plot the final tree.
b) Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.
c) Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as in (b)? If they differ, what do you think caused the difference? Provide a plot of the predicted vs. observed values.
d) Use the randomForest() function in R to fit a random forest. Report the prediction error (MSE). Provide a plot of the predicted vs. observed values.
e) Provide a table containing each method and associated MSE. Which method is more accurate?
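As a starting point for problem 1, the steps above could be sketched roughly as follows. This is only one possible layout, not a model answer: it assumes the mlbench, rpart and randomForest packages are installed, the helper name mse and the seed are made up for illustration, and the bagging in (b) is done by hand with bootstrap samples rather than with any particular bagging package. Note that the tree and hand-rolled bagging errors are computed on the training data, while the randomForest errors are out-of-bag, so the numbers are not strictly comparable.

```r
# Sketch only: assumes mlbench, rpart and randomForest are installed.
library(rpart)
library(randomForest)

data("BostonHousing", package = "mlbench")
set.seed(601)  # arbitrary seed, only for reproducibility

# Hypothetical helper: mean squared error
mse <- function(observed, predicted) mean((observed - predicted)^2)

# (a) Single regression tree, pruned at the cp with the lowest
#     cross-validated error in the cp table
tree_fit    <- rpart(medv ~ ., data = BostonHousing)
best_cp     <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
n_nodes     <- c(before = nrow(tree_fit$frame), after = nrow(tree_pruned$frame))
mse_tree    <- mse(BostonHousing$medv, predict(tree_pruned, newdata = BostonHousing))

# (b) Bagging by hand: 50 trees, each grown on a bootstrap sample,
#     with predictions averaged across trees
n <- nrow(BostonHousing)
bag_preds <- sapply(1:50, function(i) {
  idx <- sample(n, replace = TRUE)
  predict(rpart(medv ~ ., data = BostonHousing[idx, ]), newdata = BostonHousing)
})
mse_bag <- mse(BostonHousing$medv, rowMeans(bag_preds))

# (c) Bagging via randomForest(): try all p predictors at every split
p      <- ncol(BostonHousing) - 1
bag_rf <- randomForest(medv ~ ., data = BostonHousing, mtry = p, ntree = 50)

# (d) Random forest: default mtry (a random subset of predictors per split)
rf <- randomForest(medv ~ ., data = BostonHousing)

# predict() without newdata returns out-of-bag predictions
mse_bag_rf <- mse(BostonHousing$medv, predict(bag_rf))
mse_rf     <- mse(BostonHousing$medv, predict(rf))

# (e) Table of methods and errors
mse_table <- data.frame(
  method = c("rpart (pruned)", "bagging (by hand)", "randomForest, mtry = p", "randomForest, default"),
  MSE    = c(mse_tree, mse_bag, mse_bag_rf, mse_rf)
)
print(n_nodes)
print(mse_table)

# Predicted vs. observed for one of the fits; the same pattern applies to the others
plot(BostonHousing$medv, predict(rf), xlab = "Observed medv", ylab = "Predicted medv")
abline(0, 1)
```

The final tree can be drawn with plot(tree_pruned) followed by text(tree_pruned). A held-out test set would give a fairer comparison across all four methods than the mixed training and out-of-bag errors used here.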
2. Consider the glaucoma data (data = "GlaucomaM", package = "TH.data").
a) Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, a logistic regression model using the whole set of variables will not work here as it is sensitive to correlation.
glac_glm <- glm(Class ~., data = GlaucomaM, family = "binomial")
#warning messages -- variable selection needed
The solution is to select the variables that seem important for predicting the response and to use only those in the GLM. One way to do this is to examine the relationship between the response variable and each predictor using graphical or numerical summaries, which tends to be a tedious process. Alternatively, we can use a formal variable selection approach.
The step() function will do this in R. Using the step() function, choose any direction for variable selection and fit a logistic regression model. Discuss the model and its error rate.
#use of step() function in R
?step
glm.step <- step(glac_glm)
Do not print out the summaries of every single model built using variable selection. That will end up being dozens of pages long and not worth reading through. Your discussion needs to include the direction you chose. You may only report on the final model, the summary of that model, and the error rate associated with that model.
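One way to follow the reporting guidance above is sketched below. It assumes the TH.data package is installed, picks backward elimination purely as an example direction, and uses a 0.5 probability cutoff, which is also an assumption. Setting trace = 0 keeps step() from printing every intermediate model, and the error rate shown is computed on the training data, so it is likely optimistic.

```r
# Sketch only: assumes the TH.data package is installed.
data("GlaucomaM", package = "TH.data")

# Full model; the warnings it raises are expected and are suppressed here
glac_glm <- suppressWarnings(
  glm(Class ~ ., data = GlaucomaM, family = "binomial")
)

# Backward elimination (example direction); trace = 0 hides per-step output
glm.step <- suppressWarnings(
  step(glac_glm, direction = "backward", trace = 0)
)

# Report only the final model
print(formula(glm.step))
print(summary(glm.step))

# glm() models the probability of the second factor level of Class
lev  <- levels(GlaucomaM$Class)
prob <- predict(glm.step, type = "response")
pred <- ifelse(prob > 0.5, lev[2], lev[1])

# Training-data error rate and confusion table
err_rate <- mean(pred != GlaucomaM$Class)
print(table(observed = GlaucomaM$Class, predicted = pred))
print(err_rate)
```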
b) Build a logistic regression model with K-fold cross validation (k = 10). Report the error rate.
c) Find a function (and its package) in R that can perform "adaboost" ensemble modeling. Use it to predict glaucoma and report the error rate. Be sure to mention the package you used.
d) Report the error rates based on single tree, bagging and random forest. (A table would be great for this).
e) Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?
f) From the above analysis, which variables seem to be important in predicting Glaucoma?
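For parts (b) and (c), a rough sketch of 10-fold cross-validation and of one AdaBoost option is given below. Several things here are assumptions rather than requirements of the assignment: the helper name cv_error_glm is hypothetical, the formula uses the first three predictors only as an arbitrary placeholder (the formula of the model chosen in part (a) would be the natural substitute), adabag is just one of several packages that offer AdaBoost, and the boosting error is estimated on a simple random hold-out split with 50 boosting iterations.

```r
# Sketch only: assumes the TH.data and adabag packages are installed.
library(adabag)

data("GlaucomaM", package = "TH.data")
set.seed(601)  # arbitrary seed, only for reproducibility

# Hypothetical helper: k-fold cross-validated error rate for a logistic model
cv_error_glm <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  lev   <- levels(data$Class)
  wrong <- 0
  for (i in seq_len(k)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- suppressWarnings(glm(formula, data = train, family = "binomial"))
    prob  <- predict(fit, newdata = test, type = "response")
    pred  <- ifelse(prob > 0.5, lev[2], lev[1])  # glm models P(second level)
    wrong <- wrong + sum(pred != test$Class)
  }
  wrong / nrow(data)
}

# Placeholder formula built from the first three predictors (arbitrary choice)
placeholder_formula <- reformulate(names(GlaucomaM)[1:3], response = "Class")
cv_err <- cv_error_glm(placeholder_formula, GlaucomaM, k = 10)
print(cv_err)

# (c) AdaBoost with adabag::boosting() on a random 2/3 training split
train_idx <- sample(nrow(GlaucomaM), size = round(2 / 3 * nrow(GlaucomaM)))
train     <- GlaucomaM[train_idx, ]
test      <- GlaucomaM[-train_idx, ]

ada_fit  <- boosting(Class ~ ., data = train, mfinal = 50)
ada_pred <- predict(ada_fit, newdata = test)

# Hold-out error rate for the boosted model
ada_err <- mean(ada_pred$class != test$Class)
print(ada_err)
```

The same fold labels could be reused for the single tree, bagging and random forest fits in part (d), so that every row of the comparison table is based on identical splits.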