What is the test set auc of the model

Assignment Help Other Subject
Reference no: EM133932389

Case: Lending Club Classification

Details

In this assignment, you will need to make use of the following data

Loans Data

Solve the Loan Repayment problem on pages 405-408 of the textbook.

b

For A and B parts' last question, please report accuracy, recall, F1 score, AUC respectively. The threshold for calculation is 0.5

Submission instructions:
Please remember to label your responses with question numbers.

Please try to write your responses in a markdown cell rather than using code comments or code strings (you can easily change a cell's type from "code" to "markdown," or if you're using Google Colab or VScode, you can click on "+text/+markdown").

It would be greatly appreciated if you could check your PDF file to ensure that your responses are not truncated during the file conversion.
Please share your findings in a written format. Raw code outputs of regressions and correlation tables without any written response are not preferred.

Loan Repayment

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders would like to minimize the risk of a borrower being unable to repay a loan.

In this exercise, we will use publicly available data from LendingClub, a website that connects borrowers and investors over the internet. The dataset is in the file Loans.csv (available in the Online Companion). There are 9,578 observations, each representing a 3-year loan that was funded through the LendingClub.com platform between May 2007 and February 2010. There are 14 variables in the dataset, described in Table 22.6. We will be trying to predict NotFullyPaid, using all of the other variables as independent variables.

a) Let us start by building a logistic regression model.

i) First, randomly split the dataset Loans.csv into a training set and a testing set. Put 70% of the data in the training set. What is the accuracy on the test set of a simple baseline model that predicts that all loans will be paid back in full (NotFullyPaid = 0) ? Our goal will be to build a model that adds value over this simple baseline method.

ii) Now, build a logistic regression model that predicts the dependent variable NotFullyPaid using all of the other variables as independent variables. Use the training set as the data for the model. Describe your resulting model. Which of the independent variables are significant in your model?

iii) Consider two loan applications, which are identical other than the fact that the borrower in Application A has a FICO credit score of 700 while the borrower in Application B has a FICO credit score of 710. Let Logit(A) be the value of the linear logit function of loan A not being paid back in full, according to our logistic regression model, and define Logit (B) similarly for loan B. What is the value of Logit(A) - Logit(B)?

iv) Now predict the probability of the test set loans not being paid back in full. Store these predicted probabilities in a variable named PredictedRisk and add it to your test set (we will use this variable in later parts of the problem). What is the accuracy of the logistic regression model on the test set using a threshold of 0.5? How does this compare to the baseline model?

v) What is the test set AUC of the model? Given the accuracy and the AUC of the model on the test set, do you think this model could be useful to an investor to make profitable investments? Get top-notch online assignment help.

b) Lending Club assigns the interest rate to a loan based on their estimate of that loan's risk. This variable, IntRate, is an independent variable in our dataset. In this part, we will investigate just using the loan's interest rate as a "smart baseline" to order the loans according to risk.

i) Using the training set, build a logistic regression model that predicts the dependent variable NotFullyPaid using IntRate as the only independent variable. Is IntRate significant in this model? Was it significant in the first logistic regression model you built? How would you explain this difference?

ii) Use the model you just built (with only one independent variable) to make predictions for the observations in the test set. What is the highest predicted probability of a loan not being paid back in full on the test set? How many loans would we predict would not be paid back in full if we used a threshold of 0.5 to make predictions?

iii) Compute the test set AUC of the model. How does this compare to the model using all of the independent variables? In your opinion, which model is stronger? Why?

c) Let us now see how our logistic regression model can be used to identify loans that are expected to be profitable.

i) If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years. Using continuous compounding of interest, this investment pays back c×ertc×ert dollars by the end of the t years. How much does a $10 investment with an annual interest rate of 6% pay back after3 years, using continuous compounding of interest? (HINT:Remember to convert the percentage to a proportion beforedoing the math.)

ii) While the investment has this value after collecting interest, the investor had to pay c dollars for the investment. What is the profit to the investor if the investment is paid back in full? What is the profit to the investor if the investment is not paid back in full?

iii) Compute the profit of a $1 investment in each loan, and save your result to a variable named Profit. Keep in mind that the profit computation should depend on the value of the variable NotFullyPaid. What is the maximum profit of a $1 investment in any loan in the testing set?

iv) A simple investment strategy of equally investing in all the loans would yield a profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem. Instead, let us analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%) to maximize return, but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full. We will model an investor who invests $1 in each of the most promising 100 loans.

First, create a new dataset called HighInterest consisting of the test set loans with an interest rate of at least 15%. What is the average profit of a $1 investment in one of these high-interest loans? What proportion of the high-interest loans were not paid back in full?

v) Next, sort the loans in the HighInterest dataset by the variable PredictedRisk that we computed earlier in the problem. Create a new dataset called SelectedLoans that consists of the 100 loans with the smallest values of PredictedRisk.

What is the profit of an investor who invested $1 in each of these 100 loans? How many of the 100 selected loans were not paid back in full? How does this compare to the simple strategy of investing in all loans, which yielded a profit of $20.94 for a $100 investment?

d) One of the most important assumptions of predictive modeling often does not hold in financial situations, causing predictive models to fail. What do you think this is? As an analyst, what could you do to improve the situation?

Reference no: EM133932389

Questions Cloud

These are cambial cells located in between vascular bundles : These are cambial cells located in between vascular bundles. This tissue is also known as the wood. It is also known as the functional xylem.
Discuss the brain regions involved in object : Discuss the brain regions involved in object and face recognition. What conditions arise if these brain regions are damaged or dysfunctional?
What is spironolactone and hypokalemia : What is the site of action of thiazides in the nephron? What is hypokalemia? What is spironolactone?
Carnivorous plants : Carnivorous plants What are Carnivorous plants? Where do Carnivorous plants live? How do carnivorous plants attract their prey?
What is the test set auc of the model : What is the accuracy of the logistic regression model on the test set using a threshold of 0.5? How does this compare to the baseline model
Seed and fruit dispersal is achieved with the aid of wind : Seed and fruit dispersal is achieved with the aid of wind , water, animals, mechanical means and humans.
What is a wild tongue : What is a wild tongue? On page 55, she looks at the different types of languages Chicano's speak. Do you identify to one or can relate to one?
Why does revictimization frequently occur : Why does revictimization frequently occur? Does this differ based on gender or age? Include a minimum of three studies looking at current research.
What is natural causality : What is natural causality? What is a simple description of evolution? What is the driving force of evolution?

Reviews

Write a Review

Other Subject Questions & Answers

  Cross-cultural opportunities and conflicts in canada

Short Paper on Cross-cultural Opportunities and Conflicts in Canada.

  Sociology theory questions

Sociology are very fundamental in nature. Role strain and role constraint speak about the duties and responsibilities of the roles of people in society or in a group. A short theory about Darwin and Moths is also answered.

  A book review on unfaithful angels

This review will help the reader understand the social work profession through different concepts giving the glimpse of why the social work profession might have drifted away from its original purpose of serving the poor.

  Disorder paper: schizophrenia

Schizophrenia does not really have just one single cause. It is a possibility that this disorder could be inherited but not all doctors are sure.

  Individual assignment: two models handout and rubric

Individual Assignment : Two Models Handout and Rubric,    This paper will allow you to understand and evaluate two vastly different organizational models and to effectively communicate their differences.

  Developing strategic intent for toyota

The following report includes the description about the organization, its strategies, industry analysis in which it operates and its position in the industry.

  Gasoline powered passenger vehicles

In this study, we examine how gasoline price volatility and income of the consumers impacts consumer's demand for gasoline.

  An aspect of poverty in canada

Economics thesis undergrad 4th year paper to write. it should be about 22 pages in length, literature review, economic analysis and then data or cost benefit analysis.

  Ngn customer satisfaction qos indicator for 3g services

The paper aims to highlight the global trends in countries and regions where 3G has already been introduced and propose an implementation plan to the telecom operators of developing countries.

  Prepare a power point presentation

Prepare the power point presentation for the case: Santa Fe Independent School District

  Information literacy is important in this environment

Information literacy is critically important in this contemporary environment

  Associative property of multiplication

Write a definition for associative property of multiplication.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd