Build logistic regression model with K-fold cross validation


Reference no: EM132385795

STAT 601 R Programming Assignment

Answer all questions specified in each problem and include a discussion of how your results address the question.

Please do the following problems from the textbook, R Handbook, as stated.

1. The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the package mlbench (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes (the medv variable, in 1000s USD) from the other predictors in the dataset. Use this dataset to do the following:

a) Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide a plot of the predicted vs. observed values. Plot the final tree.

b) Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.

c) Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as in (b)? If they differ, what do you think caused the difference? Provide a plot of the predicted vs. observed values.

d) Use the randomForest() function in R to fit a random forest. Report the prediction error (MSE). Provide a plot of the predicted vs. observed values.

e) Provide a table containing each method and associated MSE. Which method is more accurate?
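
Parts (a) through (e) of problem 1 can be sketched as follows. This is a minimal outline under stated assumptions, not a model answer: the seed, the choice to prune at the cp value with the lowest cross-validated error, and the names mse, bag_pred and results are all illustrative. Note that predict() on a randomForest object without newdata returns out-of-bag predictions, whereas the tree and hand-rolled bagging errors below are computed on the training data, so the MSE values are not directly comparable.

```r
# Assumes the mlbench, rpart and randomForest packages are installed.
library(rpart)
library(randomForest)
data("BostonHousing", package = "mlbench")

set.seed(601)  # arbitrary seed, used only for reproducibility
obs <- BostonHousing$medv
mse <- function(obs, pred) mean((obs - pred)^2)  # illustrative helper

# (a) Regression tree, pruned at the cp with the lowest cross-validated error
tree <- rpart(medv ~ ., data = BostonHousing)
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree, cp = best_cp)
# Each row of $frame is one node (internal nodes and leaves)
n_nodes <- c(full = nrow(tree$frame), pruned = nrow(tree_pruned$frame))

# (b) Bagging by hand: 50 trees, each fit to a bootstrap sample of the rows
n <- nrow(BostonHousing)
bag_preds <- sapply(1:50, function(i) {
  idx <- sample(n, n, replace = TRUE)
  fit <- rpart(medv ~ ., data = BostonHousing[idx, ],
               control = rpart.control(xval = 0))  # skip internal CV for speed
  predict(fit, newdata = BostonHousing)
})
bag_pred <- rowMeans(bag_preds)  # average the 50 predictions per observation

# (c) Bagging via randomForest(): all predictors are candidates at every split
p <- ncol(BostonHousing) - 1
rf_bag <- randomForest(medv ~ ., data = BostonHousing, mtry = p, ntree = 50)

# (d) Random forest with the default mtry and ntree
rf <- randomForest(medv ~ ., data = BostonHousing)

# (e) Table of methods and MSE; the two randomForest rows use out-of-bag predictions
results <- data.frame(
  method = c("pruned tree", "bagging (manual, 50 trees)",
             "bagging (randomForest)", "random forest"),
  MSE = c(mse(obs, predict(tree_pruned)), mse(obs, bag_pred),
          mse(obs, predict(rf_bag)), mse(obs, predict(rf)))
)
print(n_nodes)
print(results)

# Predicted vs. observed for one method; repeat for the others as needed
plot(obs, predict(rf), xlab = "Observed medv", ylab = "Predicted medv")
abline(0, 1)
```

The final tree can be drawn with plot(tree_pruned) followed by text(tree_pruned).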

2. Consider the glaucoma data (data = "GlaucomaM", package = "TH.data").

a) Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, a logistic regression model that uses the whole set of variables will not work well here, because the fit is sensitive to correlated predictors.

data("GlaucomaM", package = "TH.data")

glac_glm <- glm(Class ~ ., data = GlaucomaM, family = "binomial")

#warning messages -- variable selection needed

The solution is to select the variables that seem important for predicting the response and use only those when fitting the GLM. One way to do this is to examine the relationship between the response and each predictor using graphical or numerical summaries, which tends to be tedious. Alternatively, we can use a formal variable selection approach.

The step() function does this in R. Using step(), choose any direction for variable selection (forward, backward, or both) and fit a logistic regression model. Discuss the model and its error rate.

#use of step() function in R

?step

glm.step <- step(glac_glm)

Do not print out the summaries of every single model built using variable selection. That will end up being dozens of pages long and not worth reading through. Your discussion needs to include the direction you chose. You may only report on the final model, the summary of that model, and the error rate associated with that model.

b) Build a logistic regression model with K-fold cross validation (k = 10). Report the error rate.

c) Find a function (and its package) in R that can perform "adaboost" ensemble modeling. Use it to predict glaucoma and report the error rate. Be sure to mention the package you used.

d) Report the error rates based on single tree, bagging and random forest. (A table would be great for this).

e) Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?

f) From the above analysis, which variables seem to be important in predicting Glaucoma?
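
The 10-fold cross-validation asked for in part (b) can be sketched by hand as follows. This is a minimal sketch, not a required approach: the function name cv_error, the 0.5 cutoff, the seed, and above all the formula Class ~ varg + vari + tms are assumptions made for illustration. The three predictors are placeholders and should be replaced by the variables retained by your own step() run.

```r
# Assumes the TH.data package is installed.
data("GlaucomaM", package = "TH.data")

# k-fold cross-validated misclassification rate for a logistic regression.
# cv_error is a hypothetical helper name, not a function from any package.
cv_error <- function(formula, data, k = 10, cutoff = 0.5) {
  response <- all.vars(formula)[1]
  lv <- levels(data[[response]])
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  wrong <- 0
  for (i in seq_len(k)) {
    test <- folds == i
    # Fit on the other k - 1 folds only
    fit <- glm(formula, data = data[!test, ], family = binomial)
    # For a factor response, glm() models the probability of the second level
    prob <- predict(fit, newdata = data[test, ], type = "response")
    pred <- ifelse(prob > cutoff, lv[2], lv[1])
    wrong <- wrong + sum(pred != data[[response]][test])
  }
  # Every row is held out exactly once, so this is the overall error rate
  wrong / nrow(data)
}

set.seed(601)  # arbitrary seed, used only for reproducibility
# Placeholder formula: swap in the predictors chosen by your variable selection
err_cv <- cv_error(Class ~ varg + vari + tms, data = GlaucomaM, k = 10)
print(err_cv)
```

Because the fold assignment is random, the error rate changes slightly with the seed; repeating the procedure over several seeds and averaging gives a steadier estimate.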

