What percentage of the total data is in the training set

Reference no: EM133931714

Milagro: Predicting Store Profitability at a Fast-Casual Restaurant Chain

1. Dataset Preparation and Rationale

Kathleen's team has already split the Milagro store data into training and testing sets.

Questions
Why did Kathleen's team split the data into a training set (374 stores) and test set (85 stores)?
What percentage of the total data is in the training set?
Explain what the training set will be used for, what the test set will be used for, and why it is important not to use the test set during model building.
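
As a quick check on the split, the share of stores in each set follows directly from the counts given above (374 training, 85 test). The variable names below are illustrative only.

```python
# Store counts stated in the assignment.
n_train = 374
n_test = 85

n_total = n_train + n_test
train_share = n_train / n_total
test_share = n_test / n_total

print(f"total stores:   {n_total}")          # 459
print(f"training share: {train_share:.1%}")  # 81.5%
print(f"test share:     {test_share:.1%}")   # 18.5%
```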

2. Kathleen's Original Model

Kathleen originally built a multiple linear regression model using the training dataset to predict annual store profitability (annual.profit) as a function of four variables: agg.inc, sqft, col.grad, and com60.

Questions
Fit a linear regression model using the training data with the four variables: agg.inc, sqft, col.grad, and com60.
Write the complete linear regression equation for predicting annual store profitability from these four predictors. Your equation should be in the form:
annual.profit = β0 + β1 × agg.inc + β2 × sqft + β3 × col.grad + β4 × com60

Using the estimated regression model, what annual profitability is predicted for a Milagro store located in an area with:
Aggregate income (agg.inc) of $100,000,000
Store size (sqft) of 800 square feet
College graduate percentage (col.grad) of 0.30 (30%)
Long commute percentage (com60) of 0.10 (10%)
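
A minimal sketch of the fit-then-predict step, assuming ordinary least squares computed with NumPy. The Milagro data is not reproduced here, so the training rows and the coefficients used to generate them are invented placeholders; the printed prediction is therefore not the answer to the question. Only the four predictor names and the new-store values come from the assignment. The helper names fit_ols and predict are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply annual.profit = b0 + b1*x1 + ... + bk*xk to one or more rows."""
    X = np.atleast_2d(X)
    return coef[0] + X @ coef[1:]

# PLACEHOLDER data only: random values standing in for the 374 training stores.
rng = np.random.default_rng(0)
n = 374
X_train = np.column_stack([
    rng.uniform(5e7, 5e8, n),    # agg.inc (dollars)
    rng.uniform(500, 1500, n),   # sqft
    rng.uniform(0.1, 0.6, n),    # col.grad (proportion)
    rng.uniform(0.0, 0.3, n),    # com60 (proportion)
])
invented_coef = np.array([10000.0, 0.0005, 50.0, 200000.0, -150000.0])
y_train = invented_coef[0] + X_train @ invented_coef[1:] + rng.normal(0, 5000, n)

coef = fit_ols(X_train, y_train)

# New-store values taken from the question above.
new_store = np.array([100_000_000, 800, 0.30, 0.10])
predicted_profit = predict(coef, new_store)[0]
print("coefficients (b0..b4):", coef)
print("predicted annual.profit:", predicted_profit)
```

With the real training file, X_train and y_train would be loaded from the data instead of generated, and the fitted coefficients would be substituted into the equation above.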
Evaluate the quality of the original model:
What is the R² value on the training data?
What is the R² value on the test data?
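
One common way to compute R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition and uses the mean of whichever set is being evaluated as the baseline; some texts use the training mean as the baseline for test-set R², so check which convention the course expects. The numbers in the example are made up.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST, with SST measured around the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

# Made-up values for illustration, not Milagro results.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
example_r2 = r_squared(actual, predicted)
print(example_r2)
```

For the test-set figure, the predictions passed in should come from the model fitted on the training data only. On held-out data this quantity can be negative when the model predicts worse than the mean.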
Test the statistical significance of the predictors:
Which independent variables are statistically significant at the 5% level (α = 0.05)?
Which variable has the smallest p-value (most statistically significant)?
Which variable has the largest p-value (least statistically significant, but still below 0.05)?
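
A rough sketch of how the significance test works under the hood, assuming the usual OLS standard errors. To stay dependency-light it uses a normal approximation for the two-sided p-value, which is close to the exact t-distribution result when the residual degrees of freedom are large (about 369 here with 374 stores and 5 coefficients). A statistics package would report exact t-based p-values. The data below is synthetic and the function name coef_p_values is hypothetical.

```python
import math
import numpy as np

def coef_p_values(X, y):
    """Return OLS coefficients, t-statistics and approximate two-sided p-values."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]          # n - (k + 1)
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)  # coefficient covariance
    se = np.sqrt(np.diag(cov))
    t_stats = coef / se
    # Normal approximation to the two-sided p-value (assumption, see lead-in).
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return coef, t_stats, p_values

# SYNTHETIC data: x1 truly affects y, x2 does not.
rng = np.random.default_rng(1)
X = rng.normal(size=(374, 2))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(size=374)

coef, t_stats, p_values = coef_p_values(X, y)
for name, p in zip(["intercept", "x1", "x2"], p_values):
    print(f"{name}: p = {p:.4g}, significant at 5%: {p < 0.05}")
```

Predictors on very different scales, such as agg.inc in dollars next to proportions, can make the direct inverse of X'X numerically fragile, which is one more reason to rely on a statistics package for the actual assignment.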

3. Exploratory Correlation Analysis
Kathleen wants to understand the relationships between variables in the expanded dataset before building more complex models.

Questions
Compute the correlation matrix for all numerical predictor variables (exclude store.number, annual.profit, and state).

The dataset now has 10 predictor variables: the 4 original variables plus 6 new variables. Identify the three pairs of variables with the strongest correlations (highest absolute values). Report the correlation coefficient for each pair.
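
Ranking the pairs can be sketched as: compute the correlation matrix, list each pair once, and sort by absolute value. The example below assumes NumPy and uses four synthetic columns with placeholder names, not the Milagro variables; strongest_pairs is a hypothetical helper.

```python
import numpy as np

def strongest_pairs(data, names, top=3):
    """Return the `top` variable pairs with the largest |correlation|."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle: each pair once
            pairs.append((names[i], names[j], float(corr[i, j])))
    pairs.sort(key=lambda item: abs(item[2]), reverse=True)
    return pairs[:top]

# SYNTHETIC columns: b tracks a closely, c moves against a, d is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = -a + rng.normal(size=200)
d = rng.normal(size=200)
data = np.column_stack([a, b, c, d])

top_pairs = strongest_pairs(data, ["a", "b", "c", "d"])
for left, right, r in top_pairs:
    print(f"{left} vs {right}: r = {r:+.3f}")
```

Sorting on the absolute value matters because a strong negative correlation is as relevant for multicollinearity as a strong positive one.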

Statistical significance of new variables: Build a regression model using ALL 10 variables (the 4 original plus 6 new variables). Test the statistical significance of each variable at the 5% level (α = 0.05).

Which of the 6 new variables (lci, nearcomp, nearmil, freestand, gini, housemed) are statistically significant?
Which of the new variables are NOT significant? What does this suggest about their usefulness in predicting store profitability?

4. Model Comparison
Now build and compare four different models.

Questions
Fit and evaluate four models using the training data:

Model A: Kathleen's Original Model

Variables: agg.inc, sqft, col.grad, com60

Model B: Full Model

Variables: All variables except store.number, annual.profit, and state

Model C: Parsimonious Model

Build this model by removing variables that meet either of these criteria:
Variables that are NOT statistically significant at the 5% level (from Question 7).
Variables involved in pairs with absolute correlation > 0.70 (from Question 6). For highly correlated pairs, keep the variable with stronger correlation to the outcome variable (annual.profit).
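
The correlation rule for Model C can be sketched as follows. The variable names and correlation values are placeholders, and two details are assumptions not stated in the assignment: a pair is skipped if one of its members has already been dropped, and on an exact tie the second variable of the pair is dropped.

```python
def drop_correlated(pair_corr, outcome_corr, threshold=0.70):
    """Return the set of variables to drop under the |r| > threshold rule.

    pair_corr:    {(var_a, var_b): correlation between the two predictors}
    outcome_corr: {var: correlation between the predictor and annual.profit}
    """
    dropped = set()
    for (a, b), r in pair_corr.items():
        if abs(r) <= threshold:
            continue
        if a in dropped or b in dropped:
            continue  # assumption: the pair is already resolved
        # Keep the variable more strongly correlated with the outcome.
        weaker = a if abs(outcome_corr[a]) < abs(outcome_corr[b]) else b
        dropped.add(weaker)
    return dropped

# PLACEHOLDER values, not computed from the Milagro data.
pair_corr = {("x1", "x2"): 0.85, ("x1", "x3"): 0.30, ("x3", "x4"): -0.75}
outcome_corr = {"x1": 0.60, "x2": 0.40, "x3": -0.20, "x4": 0.50}

to_drop = drop_correlated(pair_corr, outcome_corr)
print(sorted(to_drop))
```

The variables that fail the significance test would be removed in addition to whatever this rule drops.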

Model D: Alternative Model
Start with the original 4 variables (agg.inc, sqft, col.grad, com60).
Add ONE variable from the 4 significant new variables identified in Question 7: lci, nearcomp, nearmil, freestand.

Test each of the 4 possible additions (one at a time) and choose the one that:
Improves test R² compared to Model A, and
Maintains total profitability prediction ≥ $40 million.
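
The two criteria above amount to a filter over the four candidate models. In the sketch below every number is INVENTED purely to exercise the logic and says nothing about which variable actually qualifies. Picking the highest test R² among the qualifying candidates is also an assumption, since the assignment does not say how to break ties.

```python
def choose_addition(candidates, baseline_test_r2, target_millions=40.0):
    """Pick the added variable that beats Model A's test R^2 and meets the target.

    candidates: {variable: {"test_r2": float, "total_millions": float}}
    Returns the chosen variable name, or None if no candidate qualifies.
    """
    eligible = {
        name: result
        for name, result in candidates.items()
        if result["test_r2"] > baseline_test_r2
        and result["total_millions"] >= target_millions
    }
    if not eligible:
        return None
    # Assumption: among qualifying candidates, prefer the highest test R^2.
    return max(eligible, key=lambda name: eligible[name]["test_r2"])

# INVENTED placeholder results, not outputs of any fitted model.
placeholder_results = {
    "lci":       {"test_r2": 0.55, "total_millions": 38.0},
    "nearcomp":  {"test_r2": 0.52, "total_millions": 41.0},
    "nearmil":   {"test_r2": 0.48, "total_millions": 42.0},
    "freestand": {"test_r2": 0.51, "total_millions": 40.5},
}

choice = choose_addition(placeholder_results, baseline_test_r2=0.50)
print(choice)
```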

For Model D report which variable you added.
For each model, report:
Training R²
Test R²
Total predicted profitability for the 48 construction sites (in millions)
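
The total for the construction sites is the sum of the per-site predictions, converted from dollars to millions. A short sketch, assuming predictions are in dollars and using a made-up flat value for all 48 sites:

```python
def total_in_millions(predictions):
    """Sum per-site predicted annual profit (dollars) and express in millions."""
    return sum(predictions) / 1_000_000

# MADE-UP predictions: 48 sites at a flat $850,000 each, for illustration only.
site_predictions = [850_000.0] * 48

total = total_in_millions(site_predictions)
meets_target = total >= 40.0  # the $40 million requirement from the assignment
print(f"total: {total:.1f} million, meets target: {meets_target}")
```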

Model recommendation and the dilemma: Review the performance of your four models. You should notice a critical dilemma: Models with the highest test R² (best predictive accuracy) predict profitability BELOW $40 million, while models that meet the $40 million target have lower test R².

Which model would you recommend to Harriman Capital? In your answer, discuss whether you should prioritize statistical performance (higher test R²) even if it means revising the $40M profitability estimate downward, or prioritize meeting the business requirement ($40M target) even with lower predictive accuracy. What are the business risks of each choice?
