Build logistic regression model with K-fold cross validation

Reference no: EM132385795

STAT 601 R Programming Assignment

Answer every question posed in each problem, and include a discussion of how your results address the question.

Please do the following problems from the textbook (R Handbook), as stated.

1. The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the package mlbench (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes (the medv variable, in 1000s USD) from the other predictors in the dataset. Use this dataset to do the following:

a) Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide a plot of the predicted vs. observed values. Plot the final tree.
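One way part (a) could be approached is sketched below. It is a starting point under stated assumptions, not a model answer: it assumes the mlbench and rpart packages are installed, prunes at the complexity parameter with the smallest cross-validated error (one common rule among several), and reports the in-sample MSE. The helper `mse()` and the seed are illustrative choices, not part of the assignment.

```r
# Illustrative helper (not from the assignment): mean squared error.
mse <- function(observed, predicted) mean((observed - predicted)^2)

if (requireNamespace("mlbench", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")
  set.seed(601)  # arbitrary seed; rpart's internal cross-validation is random

  tree <- rpart::rpart(medv ~ ., data = BostonHousing)

  # Prune at the complexity parameter with the smallest cross-validated error.
  best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
  pruned <- rpart::prune(tree, cp = best_cp)

  # Each row of $frame is one node, so nrow() counts nodes before and after.
  cat("nodes:", nrow(tree$frame), "->", nrow(pruned$frame), "\n")

  pred <- predict(pruned, newdata = BostonHousing)
  cat("in-sample MSE:", mse(BostonHousing$medv, pred), "\n")

  # Predicted vs. observed values, with the 45-degree reference line.
  plot(pred, BostonHousing$medv,
       xlab = "Predicted medv", ylab = "Observed medv")
  abline(0, 1)

  # Final tree.
  plot(pruned)
  text(pruned)
}
```

If pruning leaves the node count unchanged, the unpruned tree already sits at the minimum cross-validated error, which is itself worth noting in the discussion.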

b) Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.

c) Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as in (b)? If they differ, what do you think caused the difference? Provide a plot of the predicted vs. observed values.

d) Use randomForest() function in R to perform random forest. Report the prediction error (MSE). Provide a plot of the predicted vs. observed values.
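The difference between parts (c) and (d) comes down to the `mtry` argument of randomForest(): setting it to the number of predictors makes every predictor a candidate at every split, which is bagging, while the default for regression considers only about a third of them. A minimal sketch, assuming the mlbench and randomForest packages are installed; `ntree = 50` for the bagged model is chosen here only to mirror part (b), and the MSE shown is the out-of-bag estimate rather than a held-out test error.

```r
# Illustrative helper (not from the assignment): mean squared error.
mse <- function(observed, predicted) mean((observed - predicted)^2)

if (requireNamespace("mlbench", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")
  p <- ncol(BostonHousing) - 1  # number of predictors (all columns except medv)
  set.seed(601)                 # arbitrary seed for reproducibility

  # Bagging: all p predictors are candidates at every split.
  bag <- randomForest::randomForest(medv ~ ., data = BostonHousing,
                                    mtry = p, ntree = 50)

  # Random forest: default mtry for regression is floor(p / 3).
  rf <- randomForest::randomForest(medv ~ ., data = BostonHousing)

  # predict() without newdata returns out-of-bag predictions.
  cat("bagging OOB MSE:", mse(BostonHousing$medv, predict(bag)), "\n")
  cat("forest  OOB MSE:", mse(BostonHousing$medv, predict(rf)), "\n")

  plot(predict(rf), BostonHousing$medv,
       xlab = "Predicted medv (out-of-bag)", ylab = "Observed medv")
  abline(0, 1)
}
```

Because each run draws different bootstrap samples, the MSE values will vary with the seed, which is one possible source of the discrepancy that part (c) asks about.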

e) Provide a table containing each method and associated MSE. Which method is more accurate?

2. Consider the glaucoma data (data = "GlaucomaM", package = "TH.data").

a) Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, a logistic regression model using the whole set of variables will not work here as it is sensitive to correlation.

data("GlaucomaM", package = "TH.data")
glac_glm <- glm(Class ~ ., data = GlaucomaM, family = "binomial")

#warning messages -- variable selection needed

The solution is to select the variables that appear important for predicting the response and to use only those in the GLM. One way to do this is to examine the relationship between the response and each predictor using graphical or numerical summaries, which tends to be tedious. Alternatively, we can use a formal variable selection approach.

The step() function does this in R. Using step(), choose any direction for variable selection and fit a logistic regression model. Discuss the model and its error rate.

#use of step() function in R

?step

glm.step <- step(glac_glm)

Do not print out the summaries of every single model built using variable selection. That will end up being dozens of pages long and not worth reading through. Your discussion needs to include the direction you chose. You may only report on the final model, the summary of that model, and the error rate associated with that model.
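The reporting requirements above can be sketched as follows. This assumes the TH.data package is installed; `glm_error_rate()` is a hypothetical helper written for this example, `direction = "backward"` is only one of the permitted choices, and the 0.5 classification threshold is an assumption. Setting `trace = 0` keeps step() from printing every intermediate model. The rate computed here is the apparent (training) error, so it will tend to be optimistic.

```r
# Hypothetical helper: apparent (training) error rate of a fitted binomial
# glm whose response is a two-level factor.
glm_error_rate <- function(fit, threshold = 0.5) {
  observed <- model.response(model.frame(fit))
  prob <- fitted(fit)  # estimated probability of the second factor level
  predicted <- ifelse(prob > threshold,
                      levels(observed)[2], levels(observed)[1])
  mean(predicted != as.character(observed))
}

if (requireNamespace("TH.data", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")
  glac_glm <- glm(Class ~ ., data = GlaucomaM, family = "binomial")

  # trace = 0 suppresses the per-step output; "backward" is one possible choice.
  glm.step <- step(glac_glm, direction = "backward", trace = 0)

  print(formula(glm.step))         # variables retained in the final model
  print(glm_error_rate(glm.step))  # apparent error rate of the final model
}
```

Calling summary(glm.step) afterwards gives the single model summary the assignment asks for.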

b) Build a logistic regression model with K-fold cross validation (k = 10). Report the error rate.
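A minimal base-R sketch of 10-fold cross-validation for a logistic regression is given below. The function `cv_error_glm()` is a hypothetical helper, not a library function: it assumes a two-level factor response and a 0.5 classification threshold, refits the model on k - 1 folds, and counts misclassifications on the held-out fold. The two predictors in the usage example (`vari` and `varg`) are placeholders for illustration; substitute the variables your own selection step retains.

```r
# Hypothetical helper: k-fold cross-validated misclassification rate for a
# binomial glm with a two-level factor response.
cv_error_glm <- function(formula, data, k = 10, threshold = 0.5) {
  response <- all.vars(formula)[1]
  lev <- levels(data[[response]])
  # Randomly assign each row to one of k roughly equal-sized folds.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  wrong <- 0
  for (i in seq_len(k)) {
    train <- data[folds != i, , drop = FALSE]
    test <- data[folds == i, , drop = FALSE]
    fit <- glm(formula, data = train, family = binomial)
    # glm models the probability of the second factor level.
    prob <- predict(fit, newdata = test, type = "response")
    pred <- ifelse(prob > threshold, lev[2], lev[1])
    wrong <- wrong + sum(pred != as.character(test[[response]]))
  }
  wrong / nrow(data)  # every row is held out exactly once
}

if (requireNamespace("TH.data", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")
  set.seed(601)  # arbitrary seed; fold assignment is random
  # Placeholder predictors for illustration only.
  print(cv_error_glm(Class ~ vari + varg, data = GlaucomaM, k = 10))
}
```

Note that running variable selection once on the full data and then cross-validating only the final model understates the error; a stricter version would repeat the selection inside each fold.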

c) Find a function (package in R) that can conduct the "adaboost" ensemble modeling. Use it to predict glaucoma and report error rate. Be sure to mention the package you used.
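One candidate, offered as an assumption rather than the intended answer, is the boosting() function in the adabag package; other packages also provide AdaBoost implementations. The sketch below assumes adabag and TH.data are installed, holds out roughly a third of the rows as a test set, and uses 50 boosting iterations; the split, the seed, and the `error_rate()` helper are illustrative choices.

```r
# Illustrative helper (not from the assignment): misclassification rate.
error_rate <- function(observed, predicted) {
  mean(as.character(observed) != as.character(predicted))
}

if (requireNamespace("TH.data", quietly = TRUE) &&
    requireNamespace("adabag", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")
  set.seed(601)  # arbitrary seed for the train/test split

  # Hold out roughly one third of the rows for testing.
  test_rows <- sample(nrow(GlaucomaM), size = round(nrow(GlaucomaM) / 3))
  train <- GlaucomaM[-test_rows, ]
  test <- GlaucomaM[test_rows, ]

  # AdaBoost with 50 boosting iterations (mfinal).
  boost <- adabag::boosting(Class ~ ., data = train, mfinal = 50)
  pred <- predict(boost, newdata = test)

  print(error_rate(test$Class, pred$class))  # test-set error rate

  # Relative variable importance, largest first.
  print(head(sort(boost$importance, decreasing = TRUE)))
}
```

The importance scores printed at the end are one input for the question about which variables matter most.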

d) Report the error rates for a single tree, bagging, and random forest. (A table would be great for this.)

e) Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?

f) From the above analysis, which variables seem to be important in predicting Glaucoma?

