Milagro: Predicting Store Profitability at a Fast-Casual Restaurant Chain
1. Dataset Preparation and Rationale
Kathleen's team has already split the Milagro store data into training and testing sets.
Questions
Why did Kathleen's team split the data into a training set (374 stores) and test set (85 stores)?
What percentage of the total data is in the training set?
Explain what the training set will be used for, what the test set will be used for, and why it is important not to use the test set during model building.
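As a quick check on the split arithmetic, the training share follows directly from the two store counts given above. A minimal Python sketch (the variable names are illustrative, not part of the case):

```python
# Store counts stated in the case
n_train = 374
n_test = 85

n_total = n_train + n_test        # 459 stores in all
train_share = n_train / n_total   # fraction of stores used for model fitting

print(f"Training share: {train_share:.1%}")  # prints "Training share: 81.5%"
```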
2. Kathleen's Original Model
Kathleen originally built a multiple linear regression model using the training dataset to predict annual store profitability (annual.profit) as a function of four variables: agg.inc, sqft, col.grad, and com60.
Questions
Fit a linear regression model using the training data with the four variables:
agg.inc, sqft, col.grad, and com60.
Write the complete linear regression equation for predicting annual store profitability from these four predictors. Your equation should be in the form:
annual.profit = β0 + β1 × agg.inc + β2 × sqft + β3 × col.grad + β4 × com60
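To illustrate how the coefficients β0 through β4 are estimated, here is a minimal least-squares sketch in plain Python that solves the normal equations. It runs on made-up toy data with two predictors, not the Milagro data, and the function name fit_ols is hypothetical. In practice a statistics package would normally be used, and a predictor on the scale of agg.inc would likely need rescaling before this naive solver is numerically reliable.

```python
def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] minimizing squared error via the normal equations."""
    # Prepend a column of ones so that b0 acts as the intercept
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (illustrative only)
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_ols(rows, y)  # approximately [1.0, 2.0, 3.0]
```

The same function accepts four predictors per row, which is the shape needed for agg.inc, sqft, col.grad, and com60.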
Using the estimated regression model, what annual profitability is predicted for a Milagro store located in an area with:
Aggregate income (agg.inc) of $100,000,000
Store size (sqft) of 800 square feet
College graduate percentage (col.grad) of 0.30 (30%)
Long commute percentage (com60) of 0.10 (10%)
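Once the coefficients are estimated, the prediction for the store described above is a direct substitution into the regression equation. The sketch below shows the mechanics only: the coefficient values are arbitrary placeholders, not estimates from the Milagro data, and the function name predict_profit is hypothetical.

```python
def predict_profit(coefs, agg_inc, sqft, col_grad, com60):
    """Evaluate b0 + b1*agg.inc + b2*sqft + b3*col.grad + b4*com60."""
    b0, b1, b2, b3, b4 = coefs
    return b0 + b1 * agg_inc + b2 * sqft + b3 * col_grad + b4 * com60


# PLACEHOLDER coefficients for illustration only; replace them with the
# values estimated from the training data.
placeholder_coefs = (1000.0, 0.001, 10.0, 2000.0, -3000.0)

# Inputs for the store described in the question
prediction = predict_profit(
    placeholder_coefs,
    agg_inc=100_000_000,  # aggregate income in dollars
    sqft=800,             # store size in square feet
    col_grad=0.30,        # proportion of college graduates
    com60=0.10,           # proportion with long commutes
)
```

Note that col.grad and com60 enter as proportions (0.30 and 0.10), not as whole-number percentages, so the inputs must match the units used when the model was fitted.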
Evaluate the quality of the original model:
What is the R² value on the training data?
What is the R² value on the test data?
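Both values can be computed with the same formula, R² = 1 − SSE/SST, applied first to the training stores and then to the test stores. The sketch below is a minimal version with a hypothetical function name. The optional baseline_mean argument is an assumption added here because some courses compute test-set R² against the training-set mean of the outcome; check which convention your course uses.

```python
def r_squared(actual, predicted, baseline_mean=None):
    """R^2 = 1 - SSE/SST.

    By default the baseline is the mean of `actual`. Pass the training-set
    mean as `baseline_mean` to get the out-of-sample variant.
    """
    if baseline_mean is None:
        baseline_mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model error
    sst = sum((a - baseline_mean) ** 2 for a in actual)         # baseline error
    return 1.0 - sse / sst


# Toy checks: perfect predictions give 1.0, predicting the mean gives 0.0
perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
```

A test R² noticeably below the training R² is the usual sign that the model has fitted noise in the training stores.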
Test the statistical significance of the predictors:
Which independent variables are statistically significant at the 5% level (α = 0.05)?
Which variable has the smallest p-value (most statistically significant)?
Which variable has the largest p-value (least statistically significant, but still below 0.05)?
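Given the p-values from the regression output, the three sub-questions above reduce to filtering and sorting. A minimal sketch follows; the variable names and p-values are made-up placeholders rather than Milagro results, and summarize_significance is a hypothetical helper.

```python
def summarize_significance(p_values, alpha=0.05):
    """Return (significant names sorted by p-value, most significant, least significant)."""
    significant = {name: p for name, p in p_values.items() if p < alpha}
    if not significant:
        return [], None, None
    ordered = sorted(significant, key=significant.get)  # ascending p-value
    return ordered, ordered[0], ordered[-1]


# PLACEHOLDER p-values for illustration only
example_p = {"var_a": 0.001, "var_b": 0.030, "var_c": 0.200, "var_d": 0.004}
names, smallest, largest = summarize_significance(example_p)
```

Here var_c is excluded because its p-value exceeds 0.05, so the "largest" p-value reported is the largest among the variables that remain significant.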
3. Exploratory Correlation Analysis
Kathleen wants to understand the relationships between variables in the expanded dataset before building more complex models.
Questions
Compute the correlation matrix for all numerical predictor variables (exclude store.number, annual.profit, and state).
The dataset now has 10 predictor variables: the 4 original variables plus 6 new variables. Identify the three pairs of variables with the strongest correlations (highest absolute values). Report the correlation coefficient for each pair.
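Finding the strongest pairs amounts to computing the Pearson correlation for every pair of predictors and ranking by absolute value. The sketch below does this in plain Python on made-up toy columns. The names pearson and strongest_pairs are hypothetical, and with the real data a library correlation-matrix routine would serve equally well.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_pairs(columns, k=3):
    """Return the k pairs (name_a, name_b, r) with the largest |r|."""
    pairs = [(a, b, pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda t: abs(t[2]), reverse=True)[:k]


# Toy columns for illustration only (not the Milagro predictors)
toy = {"u": [1, 2, 3, 4], "v": [2, 4, 6, 8.5], "w": [4, 3, 1, 2]}
top = strongest_pairs(toy, k=3)
```

Ranking by absolute value matters because a strong negative correlation signals redundancy between predictors just as much as a strong positive one.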
Statistical significance of new variables: Build a regression model using ALL 10 variables (the 4 original plus 6 new variables). Test the statistical significance of each variable at the 5% level (α = 0.05).
Which of the 6 new variables (lci, nearcomp, nearmil, freestand, gini, housemed) are statistically significant?
Which of the new variables are NOT significant? What does this suggest about their usefulness in predicting store profitability?
4. Model Comparison
Now build and compare four different models.
Questions
Fit and evaluate four models using the training data:
Model A: Kathleen's Original Model
Variables: agg.inc, sqft, col.grad, com60
Model B: Full Model
Variables: All variables except store.number, annual.profit, and state
Model C: Parsimonious Model
Build this model by removing variables that meet either of these criteria:
Variables that are NOT statistically significant at the 5% level (from Question 7).
Variables involved in pairs with absolute correlation > 0.70 (from Question 6). For highly correlated pairs, keep the variable with stronger correlation to the outcome variable (annual.profit).
Model D: Alternative Model
Start with the original 4 variables (agg.inc, sqft, col.grad, com60).
Add ONE variable from the 4 significant new variables identified in Question 7:
lci, nearcomp, nearmil, freestand.
Test each of the 4 possible additions (one at a time) and choose the one that:
Improves test R² compared to Model A, and
Maintains total profitability prediction ≥ $40 million.
For Model D report which variable you added.
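The two screening rules for Model C can be written as a short procedure. The sketch below reads the rules literally: drop every variable that is not significant, then for each highly correlated pair drop the member with the weaker correlation to annual.profit. All inputs are made-up placeholders, the function name is hypothetical, and dropping the second-listed variable on an exact tie is an assumption.

```python
def parsimonious_variables(p_values, pair_corr, outcome_corr,
                           alpha=0.05, threshold=0.70):
    """Apply the two Model C screening rules and return the surviving variables."""
    # Rule 1: keep only variables significant at the alpha level
    keep = {v for v, p in p_values.items() if p < alpha}
    # Rule 2: for each highly correlated pair, drop the variable that is
    # less correlated with the outcome (the second one on an exact tie)
    for (a, b), r in pair_corr.items():
        if abs(r) > threshold:
            weaker = a if abs(outcome_corr[a]) < abs(outcome_corr[b]) else b
            keep.discard(weaker)
    return sorted(keep)


# PLACEHOLDER inputs for illustration only
p_vals = {"a": 0.01, "b": 0.02, "c": 0.50, "d": 0.001}
pairs = {("a", "b"): 0.90, ("a", "d"): 0.10}
with_outcome = {"a": 0.6, "b": 0.3, "c": 0.1, "d": 0.4}
selected = parsimonious_variables(p_vals, pairs, with_outcome)
```

In this toy case c fails the significance rule and b loses the correlated-pair comparison to a, leaving a and d.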
For each model, report:
Training R²
Test R²
Total predicted profitability for the 48 construction sites (in millions)
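The total for the construction sites is the sum of the per-site predictions converted to millions, and the Model D choice is a filter over the candidate additions using that total and the test R². A minimal sketch with made-up numbers follows. The function names are hypothetical, and breaking ties by the highest test R² is an assumption, since the case states only the two conditions.

```python
def total_in_millions(site_predictions):
    """Sum per-site predicted annual profit (in dollars) and express it in millions."""
    return sum(site_predictions) / 1_000_000


def choose_model_d(candidates, baseline_test_r2, target_millions=40.0):
    """candidates maps each added variable to (test_r2, total_millions).

    Returns the variable that beats Model A's test R^2 and meets the target,
    preferring the highest test R^2 (an assumed tie-breaker), or None.
    """
    eligible = {v: m for v, m in candidates.items()
                if m[0] > baseline_test_r2 and m[1] >= target_millions}
    if not eligible:
        return None
    return max(eligible, key=lambda v: eligible[v][0])


# PLACEHOLDER metrics for illustration only (not Milagro results)
toy_candidates = {
    "var_w": (0.50, 41.0),  # better R^2 and meets the target
    "var_x": (0.60, 38.0),  # best R^2 but misses the target
    "var_y": (0.30, 45.0),  # meets the target but no R^2 gain
}
choice = choose_model_d(toy_candidates, baseline_test_r2=0.40)
toy_total = total_in_millions([500_000, 750_000, 250_000])
```

If no candidate satisfies both conditions the function returns None, which is itself worth reporting rather than relaxing a condition silently.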
Model recommendation and the dilemma: Review the performance of your four models. You should notice a critical dilemma: Models with the highest test R² (best predictive accuracy) predict profitability BELOW $40 million, while models that meet the $40 million target have lower test R².
Which model would you recommend to Harriman Capital? In your answer, discuss whether you should prioritize statistical performance (higher test R²) even if it means revising the $40M profitability estimate downward, or prioritize meeting the business requirement ($40M target) even with lower predictive accuracy. What are the business risks of each choice?