Plot the receiver operating characteristic

Reference no: EM131443621

Abalone data for Q1

In this problem, we analyze a dataset of 4177 abalone with 8 predictor variables and try to predict whether the number of rings of an abalone is greater than 9. The complete dataset description can be found at https://archive.ics.uci.edu/ml/datasets/Abalone. The variables in the dataset are:

- Sex: nominal variable - takes levels M, F, and I (infant)
- Length: continuous variable (mm) - longest shell measurement
- Diameter: continuous variable (mm) - perpendicular to length
- Height: continuous variable (mm) - with meat in shell
- Whole weight: continuous variable (grams) - whole abalone
- Shucked weight: continuous variable (grams) - weight of meat
- Viscera weight: continuous variable (grams) - gut weight (after bleeding)
- Shell weight: continuous variable (grams) - after being dried
- Rings: integer

We are interested in predicting whether the rings variable is greater than 9. To do so, create a binary response based on it:

faba <- read.table("abalone.data",sep=",")

faba$y <- ifelse(faba$V9>8,1,0)

head(faba)

##   V1    V2    V3    V4     V5     V6     V7    V8 V9 y
## 1  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15 1
## 2  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7 0
## 3  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9 1
## 4  M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10 1
## 5  I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7 0
## 6  I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8 0

Ships data for Q2

We are interested in the number of accidents per month for a sample of ships (a classic example given by McCullagh & Nelder, 1989). The data are in the file "ships.csv", which contains 40 observations and 14 variables. The response variable is called ACC. The explanatory variables are:

- TYPE: there are 5 types of ship, labelled 1-2-3-4-5. TYPE is a categorical variable, coded by the 5 dummy variables TA, TB, TC, TD, TE.
- CONSTRUCTION YEAR: the ships were constructed in one of four periods, leading to the dummy variables T6064, T6569, T7074, T7579.
- MONTHS: a measure of the number of service months that the ship has already carried out.

ships <- read.table("ships.csv",header=T,sep=",")
str(ships)

head(ships)

##   TYPE TA TB TC TD TE T6064 T6569 T7074 T7579 O6074 O7579 MONTHS ACC
## 1    1  1  0  0  0  0     1     0     0     0     1     0    127   0
## 2    1  1  0  0  0  0     1     0     0     0     0     1     63   0
## 3    1  1  0  0  0  0     0     1     0     0     1     0   1095   3
## 4    1  1  0  0  0  0     0     1     0     0     0     1   1095   4
## 5    1  1  0  0  0  0     0     0     1     0     1     0   1512   6
## 6    1  1  0  0  0  0     0     0     1     0     0     1   3353  18

Q1. Binary classification of Abalone data.

(1a) We are going to use the first 3133 samples to train the model, and the rest will be used as the test set. Show your R code to get the training data and testing data. Find the mean and standard error of the continuous variables (V2-V8). Standardize all the continuous predictors (V2-V8) in the training set using the formula (X - X̄)/sd(X). Use the mean and sd from the training set to standardize the corresponding predictors in the testing data set.

xtrain <- faba[1:3133,1:8]
ytrain <- as.factor( faba[1:3133,10] )
xtest <- faba[-c(1:3133),1:8]
ytest <- as.factor( faba[-c(1:3133),10] )
# continue to write your code
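
The standardization step in (1a) could be sketched as follows. This is a minimal sketch rather than the assignment's solution: the helper name `standardize_with_train` and the small simulated data frame `toy` are assumptions made so that the block runs without `abalone.data`, and the standard error is assumed to mean sd/sqrt(n).

```r
# Hypothetical helper: standardize the columns `cols` of the training and
# test sets using the training-set mean and sd only.
standardize_with_train <- function(train, test, cols) {
  mu  <- sapply(train[cols], mean)
  sdv <- sapply(train[cols], sd)
  se  <- sdv / sqrt(nrow(train))  # assumed definition of the standard error
  train[cols] <- as.data.frame(scale(train[cols], center = mu, scale = sdv))
  test[cols]  <- as.data.frame(scale(test[cols],  center = mu, scale = sdv))
  list(train = train, test = test, mean = mu, sd = sdv, se = se)
}

# Simulated stand-in data (NOT the abalone data), with the same column style.
set.seed(1)
toy <- data.frame(V1 = sample(c("M", "F", "I"), 10, replace = TRUE),
                  V2 = runif(10), V3 = runif(10))
res <- standardize_with_train(toy[1:7, ], toy[8:10, ], cols = c("V2", "V3"))

# After standardization the training columns have mean 0 and sd 1;
# the test columns generally do not, because they reuse the training statistics.
round(colMeans(res$train[c("V2", "V3")]), 10)
sapply(res$train[c("V2", "V3")], sd)
```

With the objects defined above, the same helper would be called as `standardize_with_train(xtrain, xtest, paste0("V", 2:8))`.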

(1b) Fit a LASSO logistic regression (i.e., logistic regression with a LASSO penalty) model using glmnet. Use 10-fold cross-validation to choose the optimal value of the regularizer; show your R code and print the optimal λ obtained from the cross-validation. Predict on both the training and testing data sets; for each, print the confusion matrix and report the mean error rate (fraction of incorrect labels).

# Training the model on the standardized training set
# alpha=0 for ridge penalty; alpha=1 for the LASSO penalty
library(glmnet)

# .....
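
One possible shape for (1b) is sketched below on simulated data, so that it runs without `abalone.data`. The objects `x`, `y` and `train_id` are hypothetical stand-ins for the standardized predictors and the binary response. With the real data, the nominal column V1 would first need dummy coding (for example with `model.matrix`), since glmnet expects a numeric matrix.

```r
library(glmnet)

# Simulated stand-in data (NOT the abalone data): 7 numeric predictors.
set.seed(303)
n <- 400
x <- matrix(rnorm(n * 7), n, 7, dimnames = list(NULL, paste0("V", 2:8)))
y <- factor(rbinom(n, 1, plogis(1.5 * x[, 1] - x[, 2])))
train_id <- 1:300

# 10-fold cross-validation over lambda; alpha = 1 is the LASSO penalty
# (alpha = 0 would give the ridge penalty instead).
cvfit <- cv.glmnet(x[train_id, ], y[train_id], family = "binomial",
                   alpha = 1, nfolds = 10, type.measure = "class")
cvfit$lambda.min  # lambda with the smallest cross-validated error

# Class predictions at the selected lambda, for training and test sets.
pred_train <- as.vector(predict(cvfit, newx = x[train_id, ],
                                s = "lambda.min", type = "class"))
pred_test  <- as.vector(predict(cvfit, newx = x[-train_id, ],
                                s = "lambda.min", type = "class"))

# Confusion matrices and mean error rates.
conf_train <- table(predicted = pred_train, actual = y[train_id])
conf_test  <- table(predicted = pred_test,  actual = y[-train_id])
err_train  <- mean(pred_train != as.character(y[train_id]))
err_test   <- mean(pred_test  != as.character(y[-train_id]))
```

Using `s = "lambda.min"` is one choice. `s = "lambda.1se"` is also available in glmnet and selects a more heavily penalized model.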

(1c) Plot the receiver operating characteristic (ROC) curve on the test data. Use the package ROCR to compute the ROC curve and ggplot2 to plot it. Report the area under the ROC curve (AUC).
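
A minimal sketch of the ROCR and ggplot2 workflow for (1c) follows. The vectors `labels` and `scores` are simulated here and are assumptions. In the assignment they would be the test labels and the predicted probabilities from the fitted model.

```r
library(ROCR)
library(ggplot2)

# Simulated stand-ins (NOT model output): true labels and predicted scores.
set.seed(303)
labels <- rbinom(200, 1, 0.5)
scores <- plogis(labels + rnorm(200))

# ROCR: build a prediction object, then extract the ROC coordinates and the AUC.
pred <- prediction(scores, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc  <- performance(pred, measure = "auc")@y.values[[1]]

# ggplot2: plot the ROC curve with the chance diagonal for reference.
roc_df <- data.frame(fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
p <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate",
       title = paste("ROC curve, AUC =", round(auc, 3)))
print(p)
```

The same steps apply to a ridge fit. Only the scores passed to `prediction` change.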

(1d) Plot the receiver operating characteristic (ROC) curve on the test data using the ridge penalty. Also report the area under the ROC curve (AUC).

Q2. Analysis of ships data.

(2a) Make a histogram of the variable ACC. Comment on its form.

ships=read.table("ships.csv",header=T,sep=",")

# ...
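
A minimal sketch for (2a), assuming that the `ships` data frame shipped with the MASS package holds the same McCullagh & Nelder data as `ships.csv`. Under that assumption its `incidents` column plays the role of ACC. The name `ships_mass` is used here only to avoid clashing with `ships`.

```r
library(MASS)

# Assumed stand-in for ships.csv: MASS::ships (columns type, year, period,
# service, incidents).
ships_mass <- MASS::ships

# Histogram of the accident counts.
h <- hist(ships_mass$incidents, breaks = 10,
          main = "Histogram of accidents", xlab = "Number of accidents")

# For a Poisson variable the mean and variance are equal, so comparing the
# two gives a rough first check on how Poisson-like the counts look.
c(mean = mean(ships_mass$incidents), variance = var(ships_mass$incidents))
```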

Comments:
. . .
(2b) Estimate the Poisson regression model including all explanatory variables and a constant term. Show your R code and summary output, and comment on the coefficient for the variable MONTHS: is it significant?
Be careful when fitting the Poisson model. Note that if you include all the type dummies (TA-TE) and all the construction-year dummies (T6064-T7579) together with a constant term, the predictors are perfectly collinear and the coefficients cannot all be estimated. To avoid this, use TA as the reference category for type and T6064 as the reference category for construction year.

ships=read.table("ships.csv",header=T,sep=",")
options(scipen=5)

#...
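
The fit in (2b) might look like the sketch below. It assumes that `MASS::ships` is the same data as `ships.csv` in factor form, with `incidents` for ACC and `service` for MONTHS. The names `ships_mass`, `year_f` and `fit` are hypothetical.

```r
library(MASS)

# Assumed stand-in for ships.csv.
ships_mass <- MASS::ships
ships_mass$year_f <- factor(ships_mass$year)  # construction period as a factor

# Poisson regression with a constant term. R drops the first level of each
# factor, so type A and the 60-64 period act as the reference categories.
fit <- glm(incidents ~ type + year_f + service,
           family = poisson, data = ships_mass)

# Estimate, standard error, z value and p-value for the service-months term.
coef(summary(fit))["service", ]

# With the dummy-coded ships.csv the corresponding call would be:
# glm(ACC ~ TB + TC + TD + TE + T6569 + T7074 + T7579 + MONTHS,
#     family = poisson, data = ships)
```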

Comments on the coefficient for the variable MONTHS:
. . .
(2c) Perform a Wald test for the joint significance of all the type dummy variables. Specify H0 and Ha, and state your conclusion.
#....
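
A joint Wald test can be computed directly from the estimated coefficients and their covariance matrix: under H0 (all type coefficients equal to zero), W = b' V^(-1) b is approximately chi-squared with as many degrees of freedom as coefficients tested. The sketch below assumes `MASS::ships` as a stand-in for `ships.csv`. The object names are hypothetical.

```r
library(MASS)

# Assumed stand-in for ships.csv, refitted so that this block is self-contained.
ships_mass <- MASS::ships
ships_mass$year_f <- factor(ships_mass$year)
fit <- glm(incidents ~ type + year_f + service,
           family = poisson, data = ships_mass)

# Pick out the type coefficients (typeB ... typeE; type A is the reference).
idx <- grep("^type", names(coef(fit)))
b   <- coef(fit)[idx]
V   <- vcov(fit)[idx, idx]

# Wald statistic and its chi-squared p-value.
W       <- as.numeric(t(b) %*% solve(V, b))
p_value <- pchisq(W, df = length(idx), lower.tail = FALSE)
c(W = W, df = length(idx), p_value = p_value)
```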

(2d) Consider a ship of type TA, constructed in the period 65-69, with MONTHS=1000. Predict the number of accidents per month. Also estimate (1) the probability that no accidents will occur for this ship, and (2) the probability that at least two accidents will occur.

#..
# prob of (1)
#..
# prob (2)
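
Given a fitted Poisson mean mu for the ship, the two probabilities follow from the Poisson distribution: P(Y = 0) = exp(-mu) and P(Y >= 2) = 1 - P(Y <= 1). A sketch, again assuming `MASS::ships` as a stand-in for `ships.csv` and using hypothetical object names:

```r
library(MASS)

# Assumed stand-in for ships.csv, refitted so that this block is self-contained.
ships_mass <- MASS::ships
ships_mass$year_f <- factor(ships_mass$year)
fit <- glm(incidents ~ type + year_f + service,
           family = poisson, data = ships_mass)

# Ship of type A, built 65-69, with 1000 service months.
new_ship <- data.frame(
  type    = factor("A",  levels = levels(ships_mass$type)),
  year_f  = factor("65", levels = levels(ships_mass$year_f)),
  service = 1000
)

# Predicted mean number of accidents (response scale).
mu <- as.numeric(predict(fit, newdata = new_ship, type = "response"))

p_none      <- dpois(0, mu)                        # (1) P(Y = 0)
p_two_plus  <- ppois(1, mu, lower.tail = FALSE)    # (2) P(Y >= 2)
c(mu = mu, p_none = p_none, p_two_plus = p_two_plus)
```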

Q3. Analysis of 3-way contingency table

 

 

You investigate the relationship between serum cholesterol (C), gender (G) and heart disease (H), and acquire the following data.

                        Heart disease
Gender   Cholesterol    Yes     No      Total
Male     High            16    256        272
Male     Low             28   2897       2925
Female   High            13    319        332
Female   Low             23   2565       2588
Total                    80   6037       6117

(3a) State the loglinear model that only expresses the main effects of the three characteristics on the expected counts. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with heart disease), according to the model.
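
Under the main-effects (mutual independence) model, the fitted count for a cell is the grand total times the product of the three marginal proportions. A minimal sketch of that arithmetic, using the counts from the table and hypothetical object names:

```r
# Counts from the table, stored as a 2 x 2 x 2 array (H varies fastest).
counts <- array(
  c(16, 256, 28, 2897, 13, 319, 23, 2565),
  dim = c(2, 2, 2),
  dimnames = list(H = c("Yes", "No"),
                  C = c("High", "Low"),
                  G = c("Male", "Female"))
)

n  <- sum(counts)              # grand total
mH <- margin.table(counts, 1)  # heart disease totals
mC <- margin.table(counts, 2)  # cholesterol totals
mG <- margin.table(counts, 3)  # gender totals

# Fitted count for (male, high cholesterol, heart disease = yes)
# under mutual independence: n * P(G) * P(C) * P(H).
mu_hat <- as.numeric(n * (mG["Male"] / n) * (mC["High"] / n) * (mH["Yes"] / n))
mu_hat
```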

(3b) State the loglinear model that expresses all the main effects, and also an interaction between Cholesterol and Gender, and an interaction between Cholesterol and Heart disease. Interpret the assumption of the model, and compute the fitted value for the top-left cell of the table, i.e. (male, high cholesterol, with heart disease), according to the model.

Which of the models in (3a) and (3b) is better? Base your conclusion on the AIC and a likelihood ratio test.
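
Both loglinear models can be fitted as Poisson regressions on the eight cell counts, which also yields the AIC values and the likelihood ratio test. The sketch below is one way to set this up. The names `heart`, `fit_a` and `fit_b` are hypothetical.

```r
# Cell counts from the table in long format (H varies fastest, then C, then G).
heart <- expand.grid(H = c("Yes", "No"),
                     C = c("High", "Low"),
                     G = c("Male", "Female"))
heart$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

# Main effects only (mutual independence of G, C and H).
fit_a <- glm(count ~ G + C + H, family = poisson, data = heart)

# Main effects plus C:G and C:H (G and H independent given C).
fit_b <- glm(count ~ G + C + H + C:G + C:H, family = poisson, data = heart)

# Fitted values for the first row: male, high cholesterol, heart disease = yes.
fitted(fit_a)[1]
fitted(fit_b)[1]

# Model comparison: AIC and likelihood ratio (deviance) test.
AIC(fit_a, fit_b)
anova(fit_a, fit_b, test = "Chisq")
```

The models are nested, so the difference in deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added interaction parameters (two here).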

Verified Expert

This assignment is based entirely on R programming, carried out in RStudio. The solution uses R functions for drawing graphs and installs the packages that are required. It first explores the structure of the data set and produces summary statistics such as the mean, standard error and count of observations for the variables used in the analysis. It then computes aggregate values for the variables relevant to the assignment tasks and plots them using histograms and ggplot.



- Be independent. Your solutions must be written up independently (i.e., your solutions should not be the same as another student's solutions).
- Due date: Late assignments will be subject to a deduction of 5% of the total marks for the assignment for each day late. Any late assignment submitted after the day I post the solution will receive a mark of zero.
- Presentation of solutions is very important.
