STAT 601 Homework
Answer every question posed in each problem and include a discussion of how your results address it. Submit your .rmd file together with the knitted PDF.
Please do the following problems from the textbook (R Handbook) as stated.
1. Collett (2003) argues that two outliers need to be removed from the plasma data. Try to identify those two unusual observations by means of a scatterplot.
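One possible starting point for Problem 1 is a scatterplot in which every point carries its row label, so unusual observations can be read off directly. The sketch below is hedged: the helper name labelled_scatter is hypothetical, and it assumes the plasma data come from the HSAUR3 package with columns fibrinogen, globulin and ESR (the assignment does not name the package).

```r
# Hypothetical helper: scatterplot of d[[y]] against d[[x]], coloured by a
# two-level factor d[[group]], with each point labelled by its row name.
labelled_scatter <- function(d, x, y, group) {
  palette2 <- c("blue", "red")
  cols <- palette2[as.integer(d[[group]])]
  plot(d[[x]], d[[y]], col = cols, pch = 19, xlab = x, ylab = y)
  text(d[[x]], d[[y]], labels = rownames(d), pos = 3, cex = 0.7)
  legend("topleft", legend = levels(d[[group]]), col = palette2, pch = 19)
  invisible(cols)
}

# Assumption: plasma is available from HSAUR3; skipped if not installed.
if (requireNamespace("HSAUR3", quietly = TRUE)) {
  data("plasma", package = "HSAUR3")
  labelled_scatter(plasma, "fibrinogen", "globulin", "ESR")
}
```

Points that sit far from the rest of their ESR group are candidates for the two observations Collett refers to; the plot only suggests them and does not prove they are outliers.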
2. (Multiple Regression) Continuing from the lecture on the hubble data from the gamair library:
a) Fit a quadratic regression model, i.e., a model of the form
Model 2: velocity = β1 × distance + β2 × distance^2 + ε
b) Plot the fitted curve from Model 2 on the scatterplot of the data.
c) Add the simple linear regression fit (fitted in class) to this plot. Use a different color and line type to differentiate the two fits, and add a legend to your plot.
d) Looking at the plot, which model do you consider most sensible given the nature of the data?
e) Which model is better? Provide a statistic to support your claim.
Note: The quadratic model here is still regarded as a "linear regression" model, since the term "linear" refers to the parameters of the model and not to the powers of the explanatory variables.
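A minimal sketch of how parts a) to e) of Problem 2 might be set up. It assumes the gamair hubble data frame with columns x (distance) and y (velocity), and that the linear model fitted in class also had no intercept; the function names fit_hubble_models and plot_hubble_fits are hypothetical.

```r
# Fit both through-the-origin models to a data frame with columns
# x (distance) and y (velocity).
fit_hubble_models <- function(d) {
  list(
    linear    = lm(y ~ x - 1, data = d),           # velocity = b1 * distance
    quadratic = lm(y ~ x + I(x^2) - 1, data = d)   # Model 2
  )
}

# Scatterplot of the data with both fitted curves and a legend.
plot_hubble_fits <- function(d, fits) {
  grid <- data.frame(x = seq(0, max(d$x), length.out = 200))
  plot(d$x, d$y, xlab = "Distance", ylab = "Velocity")
  lines(grid$x, predict(fits$linear, newdata = grid), col = "blue", lty = 1)
  lines(grid$x, predict(fits$quadratic, newdata = grid), col = "red", lty = 2)
  legend("topleft", legend = c("Linear", "Quadratic (Model 2)"),
         col = c("blue", "red"), lty = c(1, 2))
}

# Assumption: hubble is available from gamair; skipped if not installed.
if (requireNamespace("gamair", quietly = TRUE)) {
  data("hubble", package = "gamair")
  fits <- fit_hubble_models(hubble)
  plot_hubble_fits(hubble, fits)
  # The models are nested, so a partial F test and AIC are both usable
  # as supporting statistics for part e).
  print(anova(fits$linear, fits$quadratic))
  print(AIC(fits$linear, fits$quadratic))
}
```

I(x^2) keeps the squared term inside the formula, so the model remains linear in β1 and β2, as the note above points out.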
3. The leuk data from package MASS shows the survival times from diagnosis of patients suffering from leukemia and the values of two explanatory variables, the white blood cell count (wbc) and the presence or absence of a morphological characteristic of the white blood cells (ag).
a) Define a binary outcome variable according to whether or not patients lived for at least 24 weeks after diagnosis. Call it surv24.
b) Fit a logistic regression model to the data with surv24 as the response. It is advisable to transform the very large white blood cell counts to avoid regression coefficients very close to 0 (and odds ratios very close to 1). You may use a log transformation.
c) Construct some graphics useful in the interpretation of the final model you fit.
d) Fit a model with an interaction term between the two predictors. Which model fits the data better? Justify your answer.
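The steps in Problem 3 could be sketched as follows. This is one reasonable reading, not the required solution: the names leuk2, fit_main and fit_int are illustrative, and "at least 24 weeks" is taken to mean time >= 24.

```r
library(MASS)  # provides the leuk data frame (columns wbc, ag, time)

# a) Binary outcome: 1 if the patient survived at least 24 weeks.
leuk2 <- transform(leuk, surv24 = as.integer(time >= 24))

# b) Main-effects logistic regression on log-transformed wbc.
fit_main <- glm(surv24 ~ ag + log(wbc), family = binomial, data = leuk2)
print(summary(fit_main))

# c) Fitted survival probability against log(wbc), one curve per ag level.
cols <- c("blue", "red")
wbc_grid <- exp(seq(log(min(leuk2$wbc)), log(max(leuk2$wbc)),
                    length.out = 100))
plot(log(leuk2$wbc), leuk2$surv24, col = cols[as.integer(leuk2$ag)],
     pch = 19, xlab = "log(wbc)", ylab = "P(survive >= 24 weeks)")
for (i in seq_along(levels(leuk2$ag))) {
  nd <- data.frame(wbc = wbc_grid,
                   ag = factor(levels(leuk2$ag)[i], levels = levels(leuk2$ag)))
  lines(log(wbc_grid), predict(fit_main, newdata = nd, type = "response"),
        col = cols[i])
}
legend("right", legend = levels(leuk2$ag), col = cols, lty = 1, title = "ag")

# d) Interaction model, compared by likelihood-ratio test and AIC.
fit_int <- glm(surv24 ~ ag * log(wbc), family = binomial, data = leuk2)
print(anova(fit_main, fit_int, test = "Chisq"))
print(AIC(fit_main, fit_int))
```

Because the two models are nested, the likelihood-ratio test from anova() is a natural justification for part d); AIC gives a second view that penalizes the extra parameter.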
4. Load the Default dataset from the ISLR library. The dataset contains information on ten thousand customers, and the aim is to predict which customers will default on their credit card debt. It is a four-dimensional dataset with 10,000 observations. We want to examine how each predictor variable is related to the response (default). Do the following on this dataset:
a) Perform a descriptive analysis of the dataset to gain insight. Use summaries and appropriate exploratory graphics to answer the question of interest.
b) Use R to build a logistic regression model.
c) Discuss your result. Which predictor variables were important? Are there interactions?
d) How good is your model? Assess the performance of the logistic regression classifier. What is the error rate?
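For parts b) and d) of Problem 4, one hedged sketch is a small helper that fits the model and reports a confusion matrix and error rate. The name assess_logit and the 0.5 cutoff are assumptions for illustration, and the error rate here is computed on the training data, so it is likely optimistic compared with a held-out estimate.

```r
# Hypothetical helper: fit a logistic regression with a two-level factor
# response and summarise its training-set classification performance.
assess_logit <- function(formula, d, cutoff = 0.5) {
  fit <- glm(formula, family = binomial, data = d)
  prob <- predict(fit, type = "response")   # P(response == second level)
  y <- model.response(model.frame(fit))     # observed factor response
  predicted <- factor(ifelse(prob > cutoff, levels(y)[2], levels(y)[1]),
                      levels = levels(y))
  list(fit = fit,
       confusion = table(observed = y, predicted = predicted),
       error_rate = mean(predicted != y))
}

# Assumption: Default is available from ISLR; skipped if not installed.
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("Default", package = "ISLR")
  print(summary(Default))
  res <- assess_logit(default ~ student + balance + income, Default)
  print(summary(res$fit))
  print(res$confusion)
  print(res$error_rate)
  # Baseline for comparison: error rate of always predicting "No".
  print(mean(Default$default == "Yes"))
}
```

Comparing the error rate with the always-"No" baseline matters here because defaults are rare, so a low overall error rate alone says little about how well defaulters are identified.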
5. Run all the code (additional exploration of the data is allowed) and write your own explanation and interpretation.