What percentage of the total data is in the training set

Reference no: EM133931714

Milagro: Predicting Store Profitability at a Fast-Casual Restaurant Chain

1. Dataset Preparation and Rationale

Kathleen's team has already split the Milagro store data into training and testing sets.

Questions
Why did Kathleen's team split the data into a training set (374 stores) and test set (85 stores)?
What percentage of the total data is in the training set?
Explain what the training set will be used for, what the test set will be used for, and why it is important not to use the test set during model building.
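
The split percentage follows directly from the two store counts stated in the question. A minimal sketch of that arithmetic in Python (the variable names are illustrative, not part of the assignment):

```python
# Store counts as given in the question text.
n_train = 374
n_test = 85

n_total = n_train + n_test
train_share = n_train / n_total
test_share = n_test / n_total

print(f"Total stores:   {n_total}")
print(f"Training share: {train_share:.1%}")
print(f"Test share:     {test_share:.1%}")
```

With 374 of 459 stores, the training share comes to about 81.5%, leaving roughly 18.5% held out for testing.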

2. Kathleen's Original Model

Kathleen originally built a multiple linear regression model using the training dataset to predict annual store profitability (annual.profit) as a function of four variables: agg.inc, sqft, col.grad, and com60.

Questions
Fit a linear regression model using the training data with the four variables:
agg.inc, sqft, col.grad, and com60.
Write the complete linear regression equation for predicting annual store profitability from these four predictors. Your equation should be in the form:
annual.profit = β0 + β1 × agg.inc + β2 × sqft + β3 × col.grad + β4 × com60
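
One way to obtain the coefficients for this equation is ordinary least squares. The sketch below uses NumPy and, because the Milagro training file is not reproduced here, fits a synthetic stand-in with made-up coefficients; `fit_ols` and `format_equation` are hypothetical helper names, and the printed numbers say nothing about the real data.

```python
import numpy as np

PREDICTORS = ["agg.inc", "sqft", "col.grad", "com60"]

def fit_ols(X, y):
    """Return [intercept, slope_1, ..., slope_k] estimated by least squares."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def format_equation(coef, names, target="annual.profit"):
    """Write the fitted model in the form target = b0 + b1 × x1 + ..."""
    terms = [f"{coef[0]:.4g}"]
    for b, name in zip(coef[1:], names):
        sign = "+" if b >= 0 else "-"
        terms.append(f"{sign} {abs(b):.4g} × {name}")
    return f"{target} = " + " ".join(terms)

# Synthetic stand-in for the 374 training stores (NOT the Milagro data).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(5e7, 3e8, 374),   # agg.inc, dollars
    rng.uniform(600, 1500, 374),  # sqft
    rng.uniform(0.1, 0.6, 374),   # col.grad, proportion
    rng.uniform(0.0, 0.3, 374),   # com60, proportion
])
made_up_coef = np.array([50_000.0, 0.001, 100.0, 200_000.0, -300_000.0])
y_demo = made_up_coef[0] + X_demo @ made_up_coef[1:] + rng.normal(0, 20_000, 374)

coef_demo = fit_ols(X_demo, y_demo)
print(format_equation(coef_demo, PREDICTORS))
```

To use it on the real data, replace `X_demo` and `y_demo` with the four predictor columns and annual.profit from the training set only.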

Using the estimated regression model, what annual profitability is predicted for a Milagro store located in an area with:
Aggregate income (agg.inc) of $100,000,000
Store size (sqft) of 800 square feet
College graduate percentage (col.grad) of 0.30 (30%)
Long commute percentage (com60) of 0.10 (10%)
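
The prediction for this location is the regression equation evaluated at the four values listed above. The sketch below shows the mechanics only: the site values come from the question, while every coefficient is a made-up placeholder to be replaced with the estimates from the fitted model.

```python
# Site characteristics taken from the question text.
site = {"agg.inc": 100_000_000, "sqft": 800, "col.grad": 0.30, "com60": 0.10}

# Placeholder coefficients, made up for illustration only. Replace them with
# the estimates from the model fitted on the training data.
coef = {
    "intercept": 10_000.0,
    "agg.inc": 0.001,
    "sqft": 50.0,
    "col.grad": 100_000.0,
    "com60": -200_000.0,
}

def predict_profit(coef, site):
    """Evaluate intercept + sum of (coefficient × predictor value)."""
    return coef["intercept"] + sum(coef[name] * value for name, value in site.items())

print(f"Predicted annual profit: ${predict_profit(coef, site):,.0f}")
```

Note that col.grad and com60 are entered as proportions (0.30 and 0.10), matching the way the question states them.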
Evaluate the quality of the original model:
What is the R² value on the training data?
What is the R² value on the test data?
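
Both values use the same definition, R² = 1 − SS_res / SS_tot, applied to different stores. For the test figure, the predictions come from the model fitted on the training data and are compared with the test stores' actual profits. A minimal sketch (the function name `r_squared` is illustrative):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Tiny made-up examples to show the range of possible values.
print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # no better than predicting the mean
print(r_squared([1, 2, 3], [3, 2, 1]))  # worse than predicting the mean
```

On held-out data R² can be negative, which means the model predicts worse than simply using the mean profit of the test stores.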
Test the statistical significance of the predictors:
Which independent variables are statistically significant at the 5% level (α = 0.05)?
Which variable has the smallest p-value (most statistically significant)?
Which variable has the largest p-value (least statistically significant, but still below 0.05)?
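
Significance is usually read from the p-values in a regression summary. The sketch below shows roughly where those numbers come from: it computes coefficient standard errors and t-statistics with NumPy, then approximates the two-sided p-value with the normal distribution. That approximation is an assumption made here to avoid extra dependencies; it is close to the t-distribution when the residual degrees of freedom are large (about 369 for 374 stores and four predictors), but a statistics package will report exact values. The data are synthetic and the names `x1` to `x4` are placeholders.

```python
import math
import numpy as np

def ols_p_values(X, y, names):
    """Approximate two-sided p-values for each slope (normal approximation)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    q, r = np.linalg.qr(design)            # QR avoids forming X'X directly
    r_inv = np.linalg.inv(r)
    coef = r_inv @ (q.T @ y)
    resid = y - design @ coef
    sigma2 = float(resid @ resid) / (n - k - 1)
    # diag((X'X)^-1) equals the row sums of squares of R^-1
    se = np.sqrt(sigma2 * np.sum(r_inv ** 2, axis=1))
    t_stats = coef / se
    p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]
    return dict(zip(names, p_values[1:]))  # skip the intercept

# Synthetic data (NOT the Milagro data); x4 has no true effect.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(374, 4))
y_demo = 2.0 + X_demo @ np.array([1.0, 0.5, 0.3, 0.0]) + rng.normal(size=374)

p_demo = ols_p_values(X_demo, y_demo, ["x1", "x2", "x3", "x4"])
significant = {v: p for v, p in p_demo.items() if p < 0.05}
most = min(significant, key=significant.get)
least = max(significant, key=significant.get)
print("significant at 5%:", sorted(significant))
print("smallest p-value:", most, "| largest p-value below 0.05:", least)
```

The last two lines mirror the questions above: the smallest p-value overall, and the largest p-value among the variables that still clear the 5% threshold.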

3. Exploratory Correlation Analysis
Kathleen wants to understand the relationships between variables in the expanded dataset before building more complex models.

Questions
Compute the correlation matrix for all numerical predictor variables (exclude store.number, annual.profit, and state).

The dataset now has 10 predictor variables: the 4 original variables plus 6 new variables. Identify the three pairs of variables with the strongest correlations (highest absolute values). Report the correlation coefficient for each pair.
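
Ranking the pairs by absolute correlation can be done programmatically rather than by scanning the matrix. The sketch below uses NumPy on synthetic columns with placeholder names `x1` to `x5`; it is not the Milagro data, and `strongest_pairs` is a hypothetical helper.

```python
from itertools import combinations

import numpy as np

def strongest_pairs(data, names, top=3):
    """Return the `top` variable pairs with the largest absolute correlation."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = [
        (names[i], names[j], float(corr[i, j]))
        for i, j in combinations(range(len(names)), 2)
    ]
    pairs.sort(key=lambda item: abs(item[2]), reverse=True)
    return pairs[:top]

# Synthetic columns: x2 tracks x1 closely, x4 moves against x3, x5 is unrelated.
rng = np.random.default_rng(2)
x1 = rng.normal(size=374)
x2 = x1 + 0.1 * rng.normal(size=374)
x3 = rng.normal(size=374)
x4 = -x3 + 0.5 * rng.normal(size=374)
x5 = rng.normal(size=374)
demo = np.column_stack([x1, x2, x3, x4, x5])

top_pairs = strongest_pairs(demo, ["x1", "x2", "x3", "x4", "x5"])
for a, b, r in top_pairs:
    print(f"{a} vs {b}: r = {r:+.3f}")
```

Sorting on the absolute value matters because a strong negative correlation is as relevant for multicollinearity as a strong positive one.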

Statistical significance of new variables: Build a regression model using ALL 10 variables (the 4 original plus 6 new variables). Test the statistical significance of each variable at the 5% level (α = 0.05).

Which of the 6 new variables (lci, nearcomp, nearmil, freestand, gini, housemed) are statistically significant?
Which of the new variables are NOT significant? What does this suggest about their usefulness in predicting store profitability?

4. Model Comparison
Now build and compare four different models.

Questions
Fit and evaluate four models using the training data:

Model A: Kathleen's Original Model

Variables: agg.inc, sqft, col.grad, com60

Model B: Full Model

Variables: All variables except store.number, annual.profit, and state

Model C: Parsimonious Model

Build this model by removing variables that meet either of these criteria:
Variables that are NOT statistically significant at the 5% level (from Question 7).
Variables involved in pairs with absolute correlation > 0.70 (from Question 6). For highly correlated pairs, keep the variable with stronger correlation to the outcome variable (annual.profit).
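
The two removal rules can be written as a small filter. The sketch below is one reading of them, stated as an assumption: a variable is dropped if it is not significant, or if it is the weaker member (by absolute correlation with annual.profit) of any pair whose absolute correlation exceeds 0.70. All inputs are made-up values with placeholder names `v1` to `v4`.

```python
def parsimonious_predictors(p_values, pair_corr, outcome_corr,
                            alpha=0.05, cutoff=0.70):
    """Apply both removal rules and return the surviving predictors."""
    # Rule 1: drop variables that are not significant at the alpha level.
    dropped = {v for v, p in p_values.items() if p >= alpha}
    # Rule 2: for each highly correlated pair, drop the member with the
    # weaker absolute correlation to the outcome.
    for (a, b), r in pair_corr.items():
        if abs(r) > cutoff:
            weaker = a if abs(outcome_corr[a]) < abs(outcome_corr[b]) else b
            dropped.add(weaker)
    return [v for v in p_values if v not in dropped]

# Made-up inputs for illustration only.
p_values = {"v1": 0.001, "v2": 0.20, "v3": 0.01, "v4": 0.03}
pair_corr = {("v1", "v3"): 0.85, ("v2", "v4"): 0.10}
outcome_corr = {"v1": 0.6, "v2": 0.1, "v3": 0.4, "v4": -0.3}

kept = parsimonious_predictors(p_values, pair_corr, outcome_corr)
print(kept)
```

Here `v2` is removed for failing the significance test and `v3` for being the weaker half of the correlated pair, so `v1` and `v4` remain.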

Model D: Alternative Model
Start with the original 4 variables (agg.inc, sqft, col.grad, com60).
Add ONE variable from the 4 significant new variables identified in Question 7:
lci, nearcomp, nearmil, freestand.

Test each of the 4 possible additions (one at a time) and choose the one that:
Improves test R² compared to Model A, and
Maintains total profitability prediction ≥ $40 million.
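
The two conditions above amount to a filter over the four candidate models. A minimal sketch of that selection logic follows, with made-up metrics and placeholder candidate names; the tie-break (highest test R² among candidates that pass both conditions) is an assumption, since the assignment does not say what to do if more than one qualifies.

```python
def choose_addition(candidates, baseline_test_r2, min_total_millions=40.0):
    """Pick the added variable that beats Model A on test R² and keeps the
    total predicted profitability at or above the target, or None."""
    eligible = {
        name: metrics
        for name, metrics in candidates.items()
        if metrics["test_r2"] > baseline_test_r2
        and metrics["total_millions"] >= min_total_millions
    }
    if not eligible:
        return None
    # Assumed tie-break: highest test R² among the eligible candidates.
    return max(eligible, key=lambda name: eligible[name]["test_r2"])

# Made-up metrics for illustration only (not results from the Milagro data).
candidates = {
    "cand_a": {"test_r2": 0.55, "total_millions": 38.0},
    "cand_b": {"test_r2": 0.52, "total_millions": 41.0},
    "cand_c": {"test_r2": 0.48, "total_millions": 43.0},
    "cand_d": {"test_r2": 0.53, "total_millions": 40.0},
}
choice = choose_addition(candidates, baseline_test_r2=0.50)
print(choice)
```

In this made-up example `cand_a` has the best test R² but misses the $40 million threshold, which is the same tension the recommendation question asks about.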

For Model D report which variable you added.
For each model, report:
Training R²
Test R²
Total predicted profitability for the 48 construction sites (in millions)
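
The three reported numbers can be produced by one routine that is called once per model with that model's predictor columns. The sketch below assumes NumPy and uses synthetic arrays sized like the case (374 training stores, 85 test stores, 48 sites); `evaluate_model` is a hypothetical helper and the output is not a result for the Milagro data.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def evaluate_model(train_X, train_y, test_X, test_y, site_X):
    """Fit on the training set, then report train R², test R², and the
    summed site predictions in millions."""
    coef, *_ = np.linalg.lstsq(add_intercept(train_X), train_y, rcond=None)

    def r2(X, y):
        pred = add_intercept(X) @ coef
        return float(1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

    total = float(np.sum(add_intercept(site_X) @ coef)) / 1e6
    return {
        "train_r2": r2(train_X, train_y),
        "test_r2": r2(test_X, test_y),
        "total_millions": total,
    }

# Synthetic data with two placeholder predictors (NOT the Milagro data).
rng = np.random.default_rng(3)

def make_profit(X):
    noise = rng.normal(0, 50_000, len(X))
    return 1_000_000 + 100_000 * X[:, 0] + 50_000 * X[:, 1] + noise

train_X = rng.normal(size=(374, 2))
test_X = rng.normal(size=(85, 2))
site_X = rng.normal(size=(48, 2))
report = evaluate_model(train_X, make_profit(train_X),
                        test_X, make_profit(test_X), site_X)
print({key: round(value, 3) for key, value in report.items()})
```

The coefficients are estimated from the training rows only; the test rows and the construction sites are used purely for prediction, which keeps the test R² an honest out-of-sample measure.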

Model recommendation and the dilemma: Review the performance of your four models. You should notice a critical dilemma: Models with the highest test R² (best predictive accuracy) predict profitability BELOW $40 million, while models that meet the $40 million target have lower test R².

Which model would you recommend to Harriman Capital? In your answer, discuss whether you should prioritize statistical performance (higher test R²) even if it means revising the $40M profitability estimate downward, or prioritize meeting the business requirement ($40M target) even with lower predictive accuracy. What are the business risks of each choice?
