Reference no: EM132389022
Assignment
The objective of this assignment is to use ridge regression and the lasso in order to train a number of regression models for prediction. You will use a data set from the University of Wisconsin where each record represents follow-up data for one breast cancer case after surgery.
The data set contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
Information about each patient's outcome is also included: time to recurrence or, for those who have not yet experienced recurrence, time last seen. Here, time to recurrence will be considered as the response variable of interest.
The variables in the data set are listed below, numbered by column:
1) ID number
2) Outcome (R = recur, N = nonrecur)
3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
4-33) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error (SE), and "worst" or largest value (mean of the three largest values) of each of these features were computed for each image, resulting in 30 features. For instance, column 4 contains Mean Radius, column 14 contains Radius SE and column 24 contains Worst Radius.
34) Tumor size - diameter of the excised tumor in centimeters
35) Lymph node status - number of positive axillary lymph nodes observed at time of surgery
The dataset has been prepared in a .csv format in the file bc_data.csv.
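As a minimal sketch of the data preparation asked for in Tasks 1 and 2 below, the following base R code reads a header-less file with "?" as the missing-value marker, names the 35 columns, and recodes the node count into three levels. It is only one possible approach. The rows in toy_lines are made-up values used purely so the example runs without bc_data.csv, and the column names (radius_mean, tumor_size, nodes, nodes_cat and so on) are assumptions, since the file itself has no header.

```r
# Made-up stand-in for bc_data.csv: 35 comma-separated fields per row,
# no header, "?" marking a missing value. NOT real patient data.
toy_lines <- c(
  paste(c(101, "R", 12, rep(1.5, 30), 2.0, 0),   collapse = ","),
  paste(c(102, "N", 60, rep(2.5, 30), 3.5, "?"), collapse = ","),
  paste(c(103, "R", 27, rep(1.8, 30), 1.2, 5),   collapse = ","),
  paste(c(104, "R",  8, rep(2.1, 30), 4.0, 2),   collapse = ",")
)

# With the real file this call would be:
#   bc <- read.csv("bc_data.csv", header = FALSE, na.strings = "?")
bc <- read.csv(text = toy_lines, header = FALSE, na.strings = "?")

# Hypothetical column names following the column numbering in the brief.
feat <- c("radius", "texture", "perimeter", "area", "smoothness",
          "compactness", "concavity", "concave_points", "symmetry",
          "fractal_dim")
names(bc) <- c("id", "outcome", "time",
               paste0(feat, "_mean"),   # columns 4-13
               paste0(feat, "_se"),     # columns 14-23
               paste0(feat, "_worst"),  # columns 24-33
               "tumor_size", "nodes")   # columns 34-35

# Recode the node count into three levels: 0, 1-3, 4 or more.
# Intervals are right-closed: (-Inf, 0], (0, 3], (3, Inf).
bc$nodes_cat <- cut(bc$nodes, breaks = c(-Inf, 0, 3, Inf),
                    labels = c("0", "1-3", "4+"))

# Keep the response and the 12 predictors, for recurrent cases only.
keep <- c("time", paste0(feat, "_mean"), "tumor_size", "nodes_cat")
rec  <- bc[bc$outcome == "R", keep]

summary(rec)  # basic descriptive statistics
```

Without na.strings = "?", column 35 would be read as character rather than numeric, and cut() would fail on it.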
Tasks
1. Read the data into R, making sure that you code the missing values properly. The character “?” is used for denoting missing values in the .csv file. Notice that there is no header in the data file.
2. In your analysis you will use as predictors only the mean values of the FNA features (a) – (j) described above (which are found in columns 4-13), and the variables found in columns 34 and 35. You first need to convert the number of axillary nodes (column 35) into a categorical variable with three levels: 0, 1-3, 4 or more.
Make a subset of the original dataset containing only the cases with recurrence. Using this dataset, generate appropriate descriptive statistics and plots for the predictors.
3. Using as predictors the 12 features described in 2 (11 continuous and 1 categorical), train a ridge regression model for prediction of time to recurrence. Use the default grid of values for the lambda parameter in the glmnet R function. Make a plot showing the coefficients of these predictors for different levels of regularization. Comment on the results.
4. Using 5-fold cross-validation, estimate and report the optimal value for lambda (i.e. the value that minimizes the MSE). Make a plot showing the MSE against the values of log(lambda). Report the coefficients of the predictors for the optimal lambda value.
5. Calculate the MSE on the whole set of the recurrent group for the model using the optimal lambda value.
6. Repeat tasks 3-5 above, but this time using the lasso method. Also report which features are selected at the optimal lambda value.
7. Comment on how the two methods compare, based on the analysis you did and the results you generated above. Suggest a rigorous method or approach for comparing the performance of the two prediction methods on these data. You do not have to apply this comparison method.
8. Make comments on the overall appropriateness of this “design” (i.e. choice of response variable and dataset) for achieving the objective.
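The mechanics of Tasks 3 to 6 can be sketched with the glmnet package as follows. This is an illustration under stated assumptions, not a model answer: the data frame dat is simulated noise standing in for the recurrence subset, the helper fit_and_score is a hypothetical name, and the seed is arbitrary. In glmnet, alpha = 0 gives ridge regression and alpha = 1 gives the lasso.

```r
library(glmnet)

set.seed(42)  # arbitrary seed; CV folds are assigned at random
n <- 60

# Simulated stand-in for the recurrence subset (NOT the real data):
# a response, 11 continuous predictors (X1..X11), one 3-level factor.
dat <- data.frame(
  time      = rexp(n, rate = 1 / 30),
  matrix(rnorm(n * 11), n, 11),
  nodes_cat = factor(sample(c("0", "1-3", "4+"), n, replace = TRUE))
)

# glmnet needs a numeric matrix: dummy-code the factor, drop the intercept.
x <- model.matrix(time ~ ., data = dat)[, -1]
y <- dat$time

# Hypothetical helper: fit the full path, pick lambda by 5-fold CV,
# and compute the in-sample MSE at that lambda.
fit_and_score <- function(alpha) {
  path <- glmnet(x, y, alpha = alpha)                # default lambda grid
  cv   <- cv.glmnet(x, y, alpha = alpha, nfolds = 5) # 5-fold CV, MSE loss
  pred <- predict(cv, newx = x, s = "lambda.min")
  list(path   = path,
       cv     = cv,
       lambda = cv$lambda.min,
       coef   = coef(cv, s = "lambda.min"),
       mse    = mean((y - pred)^2))
}

ridge <- fit_and_score(alpha = 0)
lasso <- fit_and_score(alpha = 1)

plot(ridge$path, xvar = "lambda", label = TRUE)  # coefficient paths
plot(ridge$cv)                                   # CV error across the lambda grid

# Lasso: predictors with a non-zero coefficient at the chosen lambda.
b        <- as.matrix(lasso$coef)
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
```

Note that glmnet does not accept missing values in x, and model.matrix drops rows containing NA by default, so with real data any records with a missing node count need to be handled before x and y are built, otherwise the two can end up with different lengths.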
You will need to submit two separate files: one with your report (in either Word or PDF format) and one R file with your code.
Make sure you annotate your figures and format your tables properly. Make sure you organize and document your code properly.