Build logistic regression model with K-fold cross validation

Reference no: EM132385795

STAT 601 R Programming Assignment

Answer all questions specified in each problem and include a discussion of how your results address the question.

Please complete the following problems from the textbook (R Handbook), as stated below.

1. The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as a data.frame in the package mlbench (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes (the medv variable, in 1000s USD) based on the other predictors in the dataset. Use this dataset to do the following:

a) Construct a regression tree using rpart(). The following need to be included in your discussion. How many nodes did your tree have? Did you prune the tree? Did it decrease the number of nodes? What is the prediction error (calculate MSE)? Provide a plot of the predicted vs. observed values. Plot the final tree.
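
As a starting point for part (a), the following is a minimal sketch, not a model answer. It assumes the mlbench and rpart packages are installed; the helper name mse, the seed, and the choice to prune at the cp value with the smallest cross-validated error are illustrative assumptions. The MSE it reports is computed on the training data.

```r
# Mean squared error between observed and predicted values (hypothetical helper).
mse <- function(observed, predicted) mean((observed - predicted)^2)

if (requireNamespace("mlbench", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  data("BostonHousing", package = "mlbench")
  set.seed(1)  # arbitrary seed; rpart's internal cross-validation is random

  # Grow the tree with rpart's default settings.
  tree <- rpart::rpart(medv ~ ., data = BostonHousing)

  # Prune at the complexity parameter with the smallest cross-validated error.
  best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
  pruned  <- rpart::prune(tree, cp = best_cp)

  # The frame component has one row per node, so nrow() counts nodes.
  print(c(full = nrow(tree$frame), pruned = nrow(pruned$frame)))

  # Training-set (resubstitution) MSE of the pruned tree.
  pred <- predict(pruned, newdata = BostonHousing)
  print(mse(BostonHousing$medv, pred))

  # Predicted vs. observed values, with the 45-degree reference line.
  plot(BostonHousing$medv, pred,
       xlab = "Observed medv", ylab = "Predicted medv")
  abline(0, 1)

  # Plot the final tree with split labels.
  plot(pruned)
  text(pruned)
}
```

If pruning leaves the node count unchanged, the full tree already had the lowest cross-validated error, which is worth stating in the discussion.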

b) Perform bagging with 50 trees. Report the prediction error (MSE). Provide the predicted vs observed plot.

c) Use the randomForest() function in R to perform bagging. Report the prediction error (MSE). Was it the same as in (b)? If they differ, what do you think caused the difference? Provide a plot of the predicted vs. observed values.

d) Use the randomForest() function in R to fit a random forest. Report the prediction error (MSE). Provide a plot of the predicted vs. observed values.

e) Provide a table containing each method and associated MSE. Which method is more accurate?
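
Parts (b) through (e) could be approached along the following lines. This is a hedged sketch that assumes the mlbench, ipred, and randomForest packages are installed; ipred::bagging() is only one possible choice for part (b), and the helper mse, the seed, and the use of 50 trees for all three fits are assumptions made for comparability. Setting mtry to the number of predictors makes randomForest() consider every variable at each split, which corresponds to bagging. The MSE values here are computed on the training data and are therefore optimistic.

```r
# Mean squared error between observed and predicted values (hypothetical helper).
mse <- function(observed, predicted) mean((observed - predicted)^2)

pkgs <- c("mlbench", "ipred", "randomForest")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  data("BostonHousing", package = "mlbench")
  set.seed(1)                    # arbitrary seed for reproducibility
  p <- ncol(BostonHousing) - 1   # number of predictors (all columns except medv)

  fits <- list(
    "Bagging (ipred)" =
      ipred::bagging(medv ~ ., data = BostonHousing, nbagg = 50),
    "Bagging (randomForest, mtry = p)" =
      randomForest::randomForest(medv ~ ., data = BostonHousing,
                                 mtry = p, ntree = 50),
    "Random forest (default mtry)" =
      randomForest::randomForest(medv ~ ., data = BostonHousing, ntree = 50)
  )

  # Table of methods and their training-set MSE.
  results <- data.frame(
    method = names(fits),
    MSE = vapply(fits, function(fit) {
      mse(BostonHousing$medv, predict(fit, newdata = BostonHousing))
    }, numeric(1)),
    row.names = NULL
  )
  print(results)

  # One predicted vs. observed plot per method.
  for (nm in names(fits)) {
    plot(BostonHousing$medv, predict(fits[[nm]], newdata = BostonHousing),
         xlab = "Observed medv", ylab = "Predicted medv", main = nm)
    abline(0, 1)
  }
}
```

For randomForest objects, calling predict() without newdata returns out-of-bag predictions, which give a less optimistic error estimate than predicting back onto the training data.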

2. Consider the glaucoma data (data = "GlaucomaM", package = "TH.data").

a) Build a logistic regression model. Note that most of the predictor variables are highly correlated. Hence, a logistic regression model using the whole set of variables will not work here as it is sensitive to correlation.

glac_glm <- glm(Class ~., data = GlaucomaM, family = "binomial")

#warning messages -- variable selection needed

The solution is to select the variables that seem important for predicting the response and use only those in the GLM. One way to do this is to examine the relationship between the response variable and each predictor using graphical or numerical summaries, which tends to be a tedious process. Alternatively, we can use a formal variable selection approach.

The step() function will do this in R. Using the step() function, choose any direction for variable selection and fit a logistic regression model. Discuss the model and its error rate.

#use of step() function in R

?step

glm.step <- step(glac_glm)

Do not print out the summaries of every single model built using variable selection. That will end up being dozens of pages long and not worth reading through. Your discussion needs to include the direction you chose. You may only report on the final model, the summary of that model, and the error rate associated with that model.
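
One way to follow these instructions is sketched below, assuming the TH.data package is installed. The helper error_rate, the 0.5 cutoff, and the choice of forward selection are illustrative assumptions; any direction is acceptable. When scope is omitted, step() defaults to backward elimination. Passing trace = 0 suppresses the printout of every intermediate model. The error rate here is computed on the data used to fit the model.

```r
# Misclassification rate of a binomial glm on a data frame (hypothetical helper).
# glm() models the probability of the second factor level of the response.
error_rate <- function(model, data, response, cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  lev  <- levels(data[[response]])
  pred <- ifelse(prob > cutoff, lev[2], lev[1])
  mean(pred != data[[response]])
}

if (requireNamespace("TH.data", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")

  # Forward selection: start from the intercept-only model and allow
  # every predictor in the data set to enter.
  null_glm <- glm(Class ~ 1, data = GlaucomaM, family = binomial)
  upper    <- reformulate(setdiff(names(GlaucomaM), "Class"), response = "Class")
  glm_fwd  <- step(null_glm, scope = upper, direction = "forward", trace = 0)

  # Report only the final model and its training error rate.
  print(summary(glm_fwd))
  print(error_rate(glm_fwd, GlaucomaM, "Class"))
}
```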

b) Build a logistic regression model with K-fold cross validation (k = 10). Report the error rate.
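
A minimal base-R sketch of 10-fold cross-validation for a logistic regression follows. The function name cv_glm_error, the 0.5 cutoff, and the seed are assumptions. The formula in the usage part uses the first three predictors purely as a placeholder for whichever variables your selection step retained; it requires the TH.data package.

```r
# K-fold cross-validated misclassification rate for a binomial glm
# (hypothetical helper). The response must be a two-level factor.
cv_glm_error <- function(formula, data, k = 10, cutoff = 0.5) {
  response <- all.vars(formula)[1]
  lev      <- levels(data[[response]])

  # Randomly assign each row to one of k folds of near-equal size.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  wrong <- logical(nrow(data))

  for (i in seq_len(k)) {
    test <- folds == i
    # Fit on the other k - 1 folds, then predict the held-out fold.
    fit  <- glm(formula, data = data[!test, ], family = binomial)
    prob <- predict(fit, newdata = data[test, ], type = "response")
    pred <- ifelse(prob > cutoff, lev[2], lev[1])
    wrong[test] <- pred != data[[response]][test]
  }
  mean(wrong)  # proportion of held-out observations misclassified
}

if (requireNamespace("TH.data", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")
  set.seed(1)  # arbitrary seed so the fold assignment is reproducible

  # Placeholder formula: replace with the variables you selected.
  predictors <- head(setdiff(names(GlaucomaM), "Class"), 3)
  f <- reformulate(predictors, response = "Class")
  print(cv_glm_error(f, GlaucomaM, k = 10))
}
```

Strictly speaking, variable selection should be repeated inside each fold; selecting variables on the full data first and cross-validating afterwards tends to understate the error rate.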

c) Find a function (package in R) that can conduct "adaboost" ensemble modeling. Use it to predict glaucoma and report the error rate. Be sure to mention the package you used.

d) Report the error rates for a single tree, bagging, and random forest. (A table would be great for this.)
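
As one candidate among several, the adabag package provides boosting(), which implements AdaBoost with classification trees. The sketch below assumes adabag and TH.data are installed; the 70/30 train/test split, the seed, the 50 boosting iterations, and the helper misclass are all illustrative assumptions rather than requirements of the assignment.

```r
# Proportion of predictions that differ from the observed classes
# (hypothetical helper).
misclass <- function(observed, predicted) {
  mean(as.character(observed) != as.character(predicted))
}

if (requireNamespace("adabag", quietly = TRUE) &&
    requireNamespace("TH.data", quietly = TRUE)) {
  data("GlaucomaM", package = "TH.data")
  set.seed(1)  # arbitrary seed for the split and the boosting resamples

  # Hold out roughly 30% of the rows as a test set.
  test_idx <- sample(nrow(GlaucomaM), size = round(0.3 * nrow(GlaucomaM)))
  train <- GlaucomaM[-test_idx, ]
  test  <- GlaucomaM[test_idx, ]

  # AdaBoost with 50 boosting iterations (mfinal).
  fit  <- adabag::boosting(Class ~ ., data = train, mfinal = 50)
  pred <- predict(fit, newdata = test)

  print(pred$confusion)                    # confusion matrix on the test set
  print(misclass(test$Class, pred$class))  # test-set error rate
}
```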

e) Write a conclusion comparing the above results (use a table to report models and corresponding error rates). Which one is the best model?

f) From the above analysis, which variables seem to be important in predicting Glaucoma?

Attachment:- R Programming Assignment Files.rar






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd