Compute the out-of-sample performance of your predictors

Reference no: EM132040883

Problems -

1. Use the data in 401ksubs.csv to answer this question. The data consist of 9915 observations on 12 variables defined in the file readme 401ksubs.txt.

Part A - The goal of this exercise is simply to use machine learning/nonparametric modeling to build a model for prediction. Start by setting aside 3915 observations, which will be used for an out-of-sample comparison. Using the remaining 6000 observations as the training sample, use the following methods to obtain prediction rules:

(i) Estimate E[net_tfa].

(ii) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] using linear regression.

(iii) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] by taking transformations of the input variables to allow approximation of a potentially complicated nonlinear function by lasso with penalty parameter chosen by cross-validation. Document the terms you construct and briefly comment on your rationale for the considered transformations. Briefly comment on the terms selected by lasso.

(iv) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] by ridge using the same transformations of input variables as in (iii) with penalty parameter chosen by cross-validation.

(v) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] by elastic net using the same transformations of input variables as in (iv) with penalty parameters chosen by cross-validation. Briefly comment on the terms selected by the elastic net.

(vi) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] using a CART with cost-complexity chosen by cross-validation. Comment on the final tree structure.

(vii) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] using a random forest. Note how many bootstrap replications you use and any other tuning you do. Which variables seem most important in the forest fit?

(viii) Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}] using boosted regression trees with number of boosting iterations chosen by cross-validation. Comment on the tree depth you use and how you made this choice. Which variables seem most important in the boosted tree fit?
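For part (iii), one common way to build a dictionary of transformed inputs is to take polynomial powers of the continuous variables and pairwise interactions of all variables. The sketch below is written in Python purely for illustration (the assignment itself calls for R). The function name `expand_features`, the polynomial degree, and the example values are assumptions, not part of the assignment.

```python
from itertools import combinations

def expand_features(row, continuous, binary, degree=3):
    """Return a dict of constructed terms for one observation.

    row: dict mapping variable name -> numeric value.
    continuous: names that get polynomial powers up to `degree`.
    binary: 0/1 indicators, entered as-is (their powers would be redundant).
    All pairwise interactions of the raw variables are added as well.
    """
    terms = {}
    for name in continuous:
        for d in range(1, degree + 1):
            key = name if d == 1 else f"{name}^{d}"
            terms[key] = row[name] ** d
    for name in binary:
        terms[name] = row[name]
    for a, b in combinations(continuous + binary, 2):
        terms[f"{a}:{b}"] = row[a] * row[b]
    return terms

# Hypothetical observation; the values are made up for illustration only.
example = {"age": 40, "inc": 2.0, "marr": 1}
features = expand_features(example, continuous=["age", "inc"],
                           binary=["marr"], degree=2)
```

Because lasso, ridge, and elastic net penalties depend on the scale of each column, constructed terms like these are usually standardized before fitting.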

Use the 3915 observations you held out to compare the prediction rules obtained in parts (i)-(viii). Specifically, let ĝ_j(x) for j ∈ {(i), (ii), (iii), (iv), (v), (vi), (vii), (viii)} be the estimator of the conditional expectation obtained in the part of the question corresponding to j. Calculate the mean square forecast error as (1/3915) ∑_{i ∈ hold-out} (ĝ_j(x_i) - y_i)². Which procedure is best? Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error.]
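The standard error mentioned in the note can be written out explicitly. This is a sketch assuming the hold-out observations are independent draws; the symbols e_{ij}, n, and the hat notation for the estimated quantities are introduced here and are not part of the assignment text.

```latex
e_{ij} = \hat{g}_j(x_i) - y_i, \qquad n = 3915,
\qquad
\widehat{\mathrm{MSFE}}_j = \frac{1}{n} \sum_{i \in \text{hold-out}} e_{ij}^{2},
\qquad
\widehat{\mathrm{se}}\bigl(\widehat{\mathrm{MSFE}}_j\bigr)
  = \sqrt{ \frac{1}{n(n-1)} \sum_{i \in \text{hold-out}}
           \bigl( e_{ij}^{2} - \widehat{\mathrm{MSFE}}_j \bigr)^{2} } .
```

In words, the mean square forecast error is a sample mean of the squared forecast errors, so its standard error is the sample standard deviation of those squared errors divided by the square root of n.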

Part B - The goal of this exercise is to use machine learning/nonparametric modeling to aid in estimating a policy effect. Specifically, the goal is to estimate the effect of 401(k) eligibility on net financial assets after controlling for X = {age, inc, fsize, educ, db, marr, twoearn, pira, hown}.

Suppose we believe a reasonable model is

net_tfa = αe401 + g(X) + ε

where E[ε|X] = 0, and we wish to estimate and do inference for α, the coefficient on e401. Use your favorite nonparametric estimator and "Frisch-Waugh-Lovell partialling out" to estimate α. (That is, estimate E[net_tfa|X] and E[e401|X] using a nonparametric procedure. Form estimates of the residuals U = net_tfa - E[net_tfa|X] and V = e401 - E[e401|X]. Regress your estimate of U on your estimate of V to obtain your estimate of α.) Report the estimated coefficient α and the associated estimated standard error. Assuming that eligibility for a 401(k) can be taken as exogenous, what can you conclude about the causal effect of 401(k) eligibility on accumulated assets?
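The final step of the partialling-out procedure, regressing the estimated U on the estimated V, can be sketched as follows. This is an illustration in Python (the assignment calls for R), and it assumes the two residual vectors have already been produced by whatever nonparametric fits you chose. The function name `partial_out_alpha` and the toy residuals are hypothetical, and the heteroskedasticity-robust (HC0) standard error is one reasonable choice rather than the required one.

```python
import math

def partial_out_alpha(u, v):
    """Regress residual U on residual V without an intercept.

    Returns the slope and a heteroskedasticity-robust (HC0) standard error.
    Omitting the intercept assumes both residual series are roughly mean zero.
    """
    svv = sum(vi * vi for vi in v)
    alpha = sum(ui * vi for ui, vi in zip(u, v)) / svv
    # Sandwich "meat": sum of squared scores v_i * (u_i - alpha * v_i).
    meat = sum((vi * (ui - alpha * vi)) ** 2 for ui, vi in zip(u, v))
    return alpha, math.sqrt(meat) / svv

# Toy residuals, not computed from 401ksubs.csv.
v_hat = [1.0, -1.0, 2.0, -2.0]
u_hat = [2.5, -1.5, 4.0, -4.0]
alpha_hat, se_hat = partial_out_alpha(u_hat, v_hat)
```

An approximate 95% confidence interval for α is then `alpha_hat` plus or minus 1.96 times `se_hat`.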

2. This exercise is intended to have you compare the various estimators/predictors that we discussed in class. It is deliberately kept broad and somewhat vague - feel free to do more work than what's asked for.

a. Assume that we're interested in predicting the growth rate of a country based on its characteristics. Download the Barro-Lee data (accessible via the hdm package). Why does this correspond to the "big p" case? Why should we worry about overfitting in this case?

b. Consider several predictors that would allow us to potentially get rid of the overfitting problem. These include, but are not limited to,

- OLS with fewer, carefully chosen regressors ("small OLS")

- Lasso (with the penalty level λ chosen via plug-in method),

- post-Lasso (with the penalty level λ chosen via plug-in method),

- Lasso (with the penalty level λ chosen via cross-validation),

- Ridge Estimator (with the penalty level λ chosen via cross-validation),

- Elastic Net (with the penalty level λ chosen via cross-validation),

- Random Forests

- Pruned Trees.

Which one do you think would perform better? (i.e. Would you expect this model to be dense or sparse? How would you pick the regressors in small OLS? Do we really know how Random Forests work?) Speculate.

c. Split the data into training and test samples, estimate coefficients using the training sample, and run predictions for the test sample. Compute the out-of-sample performance of your predictors by computing the MSE for prediction on the test sample. Calculate the 95% confidence intervals for the MSE. How do the predictors compare? Discuss.
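The mechanics of part c, a reproducible random split followed by a confidence interval for the test MSE, might look like the sketch below. It is written in Python for illustration only (the assignment calls for R). The function names, the 25% default test fraction, the seed, and the toy numbers are all assumptions. The interval uses a normal approximation, which can be rough when the test sample is small.

```python
import math
import random

def train_test_indices(n, test_fraction=0.25, seed=0):
    """Shuffle 0..n-1 reproducibly and return (train_idx, test_idx)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def mse_confidence_interval(predictions, outcomes, z=1.96):
    """Test-sample MSE with a normal-approximation confidence interval."""
    sq_err = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    n = len(sq_err)
    mse = sum(sq_err) / n
    # Standard error of a sample mean of squared errors.
    se = math.sqrt(sum((e - mse) ** 2 for e in sq_err) / (n - 1) / n)
    return mse - z * se, mse, mse + z * se

# Toy inputs, not taken from the Barro-Lee data.
train_idx, test_idx = train_test_indices(10, test_fraction=0.3, seed=1)
lower, mse, upper = mse_confidence_interval([1.0, 2.0, 3.0, 4.0],
                                            [1.0, 4.0, 3.0, 2.0])
```

With very few test observations the lower bound can fall below zero; since an MSE is nonnegative, it can be truncated at zero when reporting.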

d. Now, let's get causal. Assume we're interested in estimating the effect of the initial level of per-capita GDP on the growth rate (the infamous "convergence hypothesis").

The specification is

y_i = α₀ d_i + ∑_{j=1}^{p} β_j x_ij + ε_i,

where y_i is the growth rate of GDP over a specified decade in country i, d_i is the log of GDP at the beginning of the decade, and the x_ij are country characteristics at the beginning of the decade. The convergence hypothesis holds that α₀ < 0. Test the convergence hypothesis via the Frisch-Waugh-Lovell partialling out using Lasso, post-Lasso, and Random Forest. Give intuition.
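Once a partialling-out estimate of α₀ and its standard error are in hand, the convergence hypothesis α₀ < 0 can be checked with a one-sided test. The sketch below is in Python for illustration only (the assignment calls for R) and relies on a normal approximation for the t-statistic. The function name `one_sided_lower_test` and the input numbers are placeholders, not estimates from the Barro-Lee data.

```python
import math

def one_sided_lower_test(estimate, std_error, level=0.05):
    """Test H0: alpha_0 >= 0 against H1: alpha_0 < 0.

    Uses the standard normal CDF, Phi(t) = 0.5 * (1 + erf(t / sqrt(2))),
    so the p-value is the probability of a t-statistic at least this negative.
    """
    t_stat = estimate / std_error
    p_value = 0.5 * (1.0 + math.erf(t_stat / math.sqrt(2.0)))
    return t_stat, p_value, p_value < level

# Placeholder numbers chosen only to exercise the function.
t_stat, p_value, reject = one_sided_lower_test(estimate=-0.05, std_error=0.02)
```

A rejection here is evidence in favor of convergence, conditional on the controls that were partialled out.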

Note - Total 2 pages. Calculations are to be done in RStudio.

Attachment:- Assignment Files.rar

Verified Expert

This solution gives a clear walkthrough of R code for lasso regression. The MSE of an estimator is the sum of its variance and the square of its bias. Since the regression MSE is an estimate of how much the data vary naturally around the unknown population regression hyperplane, we have little control over it other than making sure that we take our measurements as carefully as possible.
