Forecasting method called a binary classifier

Reference no: EM133923429

Unit: Self-Check Assignment 3: Diabetes Forecasting

This assignment builds on all of our previous work and introduces you to predictive analytics through a forecasting method called a binary classifier. We will then work on how to visualize and understand a binary classifier.

In this assignment, you will:

- Receive an introduction to binary classifiers, logistic regression, and the results, including true-positive, false-positive, true-negative, and false-negative results
- Run a binary classification algorithm on our diabetes data
- Visualize the results in Tableau

For this assignment, follow these steps:
1) Download the diabetes dataset if you need it
2) Learn about binary classifiers
3) Perform binary classification using a logistic regression in Python (this has been written for you; all you need to do is press 'run' in Colab)
4) Download the results
5) Visualize the results in Tableau

Question 1: Understanding the Problem

In the diabetes dataset, what is/are the possible input variable(s)? (Input variables are the things we will use to make our prediction.) Select all that apply.
A. Glucose
B. Insulin
C. BMI
D. Age
E. Blood pressure
F. Outcome

Question 2: Understanding the Problem

In the diabetes dataset, what is/are the possible output variable(s)? (An output variable is the thing we want to predict.) Select all that apply.
A. Glucose
B. Insulin
C. BMI
D. Age
E. Blood pressure
F. Outcome

There are many algorithms that can be used in data science for classification. Exactly how to determine which algorithm should be used, and how to evaluate its results, is beyond the scope of this course. But we will give you a very basic overview of how predictive analytics models work here. In the learning resources for this unit, we have provided a video from StatQuest about logistic regression. The video's example of predicting obesity in mice is very close to what we are doing here.
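To make the overview a little more concrete, here is a minimal sketch of the S-curve at the heart of logistic regression. The intercept and BMI coefficient are hypothetical values chosen only for illustration; they are not fitted to the diabetes data, and the function names are our own.

```python
import math

def sigmoid(log_odds: float) -> float:
    """Convert log odds into a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients, chosen only for illustration (not fitted to any data).
INTERCEPT = -5.0
BMI_COEFFICIENT = 0.1

def predicted_probability(bmi: float) -> float:
    """The log odds are a straight line in BMI; the probability is an S-curve."""
    return sigmoid(INTERCEPT + BMI_COEFFICIENT * bmi)

for bmi in (20, 40, 80):
    print(f"BMI {bmi}: probability {predicted_probability(bmi):.3f}")
```

Under these made-up numbers, the printed probabilities rise with BMI but never reach 1, and doubling the BMI does not simply double the probability.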

Question 3: What We Are Trying to Do Here with Logistic Regression

Which statement most closely resembles what we are trying to do here with our logistic regression binary classifier?
A. We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we realize that the relationship might not be linear. If you double the BMI, you might not double the chances of having diabetes.
B. We want to predict whether or not a person will have diabetes (our binary outcome). We want to use some combination of glucose, insulin, BMI, and other data, and we expect that the relationship will be linear for all variables. In other words, if you double glucose, you will double the diabetes. If you double insulin, you will double the diabetes. And if you double glucose and insulin, you will have four times the diabetes.
C. We want to predict the BMI of a person based on their diabetes status. We want to use the logistic regression S-curve to determine what the 25th, 50th, 75th, and 99th percentiles of BMI for diabetic and non-diabetic people in this sample are.
D. We want to predict the S-curve-shaped interrelationships between BMI, age, glucose, pregnancies, and other data. We want to be able to see, as age goes up, what happens to BMI, glucose, and pregnancies with a valid regression with a solid P-value.
E. We want to predict the log odds of having diabetes because mathematically, this will solve the problem that a straight-line linear relationship will often exceed 100%, especially when some numbers are outliers (like age of 80+ years or BMI at age 50+).

With binary classifiers, we typically build the model on our training data and then test the model (to see how good the predictions actually were) on the testing data. We then collect the results of our testing in a confusion matrix. You will find a learning resource about confusion matrices from StatQuest.
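The train-then-test routine can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data are synthetic, the 70/30 split is our choice, and a simple glucose cutoff stands in for the logistic regression so the example stays short.

```python
import random

# Synthetic (glucose, has_diabetes) pairs, invented for illustration only.
rng = random.Random(0)
records = []
for _ in range(200):
    has_diabetes = rng.random() < 0.35
    glucose = rng.gauss(140 if has_diabetes else 110, 20)
    records.append((glucose, has_diabetes))

# Hold out 30% of the shuffled rows for testing (the 30% share is an assumption).
rng.shuffle(records)
cut = len(records) * 7 // 10
train, test = records[:cut], records[cut:]

def accuracy(rows, cutoff):
    """Share of rows where 'glucose >= cutoff' matches the actual outcome."""
    return sum((g >= cutoff) == actual for g, actual in rows) / len(rows)

# "Training": pick the glucose cutoff that scores best on the training rows only.
best_cutoff = max(range(80, 181), key=lambda c: accuracy(train, c))

# "Testing": tally the four confusion-matrix cells on rows the model never saw.
cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
for glucose, actual in test:
    predicted = glucose >= best_cutoff
    if predicted and actual:
        cells["TP"] += 1
    elif predicted and not actual:
        cells["FP"] += 1
    elif not predicted and actual:
        cells["FN"] += 1
    else:
        cells["TN"] += 1

print("cutoff:", best_cutoff, "test accuracy:", round(accuracy(test, best_cutoff), 3))
print(cells)
```

The key design point is that the cutoff is chosen using the training rows alone, so the four counts describe how the model behaves on data it has not seen.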

Question 4: Our Diabetes Model Confusion Matrix

Let's say we want to predict whether a person has diabetes, and we are using the following confusion matrix:
                                          | Person actually has diabetes | Person actually does not have diabetes
Person is predicted to have diabetes      | A                            | B
Person is predicted to not have diabetes  | C                            | D

Match the cell with its label
(True positive, or TP)
(False positive, or FP)
(False negative, or FN)
(True negative, or TN)

Question 5: Practicing Our TP/TN/FP/FN Terminology

Let's say we have a person with a glucose of 136, insulin of 130, and BMI of 28.3, and they are 42 years old. Our logistic regression model predicts that this person will not have diabetes. However, their medical records indicate that they do indeed have diabetes. Which phrase should be used to describe this situation?

A True positive
B False positive
C False negative
D True negative

Perform Binary Classification Using Logistic Regression in Python

Now we are going to run a binary classification predictive analytics algorithm in Python and review the results. You won't have to write any code, but you will be running code which has been written for you.

1. Go to your browser and set up a new instance of Google Colab from the "Welcome to Colaboratory" page.
2. Upload two files:
a. Upload the "Diabetes_Classifier.ipynb" as a notebook:

b. Upload the "diabetes.csv" as a file uploaded to session storage:

Alt text: Google Colab
3. Run the first cell, the classifier model. You can ask ChatGPT to explain this to you more fully, but basically what we are doing here with this code is:
a. Importing a bunch of other code written by other people to help us build the model
b. Reading in the diabetes.csv dataset
c. Splitting the data into a training dataset (which we will use to build our logistic regression prediction model) and a testing dataset (which we will use to tell how good our model really was)
d. Running the model on our training data
e. Evaluating the model on our testing data
4. When the code in this cell has finished running, it gives a little confusion matrix. (Note this confusion matrix has its labels switched from the way StatQuest did them. If you are keeping close track of these things, you will notice that the matrix printed from this code has the actual values on the left and the predicted values on the top. If you are not keeping close track of these things, you don't need to keep close track of this switch either.)

Alt text: StatQuest

5. Run the next cell to generate the output file we will use to visualize the results in Tableau. Your output should look something like this, and you should have a "diabetes_predicted.csv" file available for download. It may take a minute or two to run and another minute or two to refresh, and you can click the "refresh" icon if you want to see the output file the very minute it is available:

Alt text: Classifier

6. Let's just look at the "diabetes_predicted.csv" file before we download it:

Alt text: csv file
a. Here, let's look at the first row, Patient_ID 767. This person has a glucose of 126, BMI of 30.1, and an age of 47. This person also had an actual outcome of Diabetes (fourth column) but was predicted to have Not Diabetes (fifth column). The Model Results column classified this as a False Negative for this person (sixth column).
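The way the Model Results column follows from the actual and predicted columns can be sketched as below. This is not the notebook's code: the helper name `model_result` is hypothetical, the column headers are assumed to match the field names used in this assignment, and only the first sample row mirrors the Patient_ID 767 example (the second row is invented).

```python
import csv
import io

def model_result(actual: str, predicted: str) -> str:
    """Name the confusion-matrix cell, treating "Diabetes" as the positive class."""
    correct = "True" if actual == predicted else "False"
    sign = "Positive" if predicted == "Diabetes" else "Negative"
    return f"{correct} {sign}"

# Two illustrative rows. The first mirrors the Patient_ID 767 example;
# the second is invented. The real file has more columns than shown here.
sample = io.StringIO(
    "Patient_ID,Actual Outcome Text,Predicted Outcome Text\n"
    "767,Diabetes,Not Diabetes\n"
    "9001,Not Diabetes,Not Diabetes\n"
)

for row in csv.DictReader(sample):
    label = model_result(row["Actual Outcome Text"], row["Predicted Outcome Text"])
    print(row["Patient_ID"], label)
```

The "True" or "False" part says whether the prediction matched the medical record, and the "Positive" or "Negative" part always describes what the model predicted.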

Question 6: Interpreting the Output File

Look further through the diabetes_predicted.csv file. For Patient_ID 526, what was their outcome?

A True positive
B False positive
C False negative
D True negative

7. Download the diabetes_predicted.csv file to your computer. We are now ready to visualize it using Tableau.

Visualize the Results in Tableau

We can see that these sorts of output files can be difficult to interpret. Let's use Tableau to help visualize them.

1. Fire up Tableau and import your diabetes_predicted.csv data file to Tableau. Be sure the file you import has both Actual Outcome Text and Predicted Outcome Text fields in it.
2. Check: You should have 231 total rows in this data source.
3. First, let's make a basic bar graph: How many model results were true positives? False positives? Other values?
a. Drag the Model Results to the Columns bar and the diabetes_predicted.csv (Count) to the Rows. It should look a little bit like the skeleton below, but you should have bar charts here.

Alt text: csv file

Question 7: Interpreting the Output File

How did the model do? Of the 231 people in this dataset, what was the most frequent model result?

A True positive: 49% of the results were true positive
B False positive: 18 people had a false-positive result
C False negative: 32% of the results were a false negative
D True negative: 132 people had a true-negative result
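The bar chart from step 3 is just a count of rows per Model Results value. As a rough cross-check outside Tableau, the same tally can be sketched in Python. The list below is invented for illustration and does not reproduce the real file.

```python
from collections import Counter

# Invented Model Results values standing in for the column in diabetes_predicted.csv.
model_results = [
    "True Negative", "True Negative", "False Negative", "True Positive",
    "True Negative", "False Positive", "True Positive", "True Negative",
]

result_counts = Counter(model_results)
total_rows = sum(result_counts.values())

# One line per bar: the count is the bar height, the share is count / total rows.
for label, count in result_counts.most_common():
    print(f"{label}: {count} rows ({count / total_rows:.0%})")
```

Reporting both the count and the share matters here, because a count and a percentage of the same total describe the same bar in two different ways.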

4. Let's take another look at these results, this time in a form more akin to the confusion matrix we saw earlier.
a. Go to another worksheet
b. Put the Actual Outcome Text in the Rows area, and the Predicted Outcome Text in the Columns area:

Alt text: outcome
c. Then drag the diabetes_predicted.csv (Count) to the area with the "Abc" in it:

Alt text: csv file

d. You will now have the numbers of the actual and predicted outcomes summed up for you:

Alt text: predicted outcomes
e. Let's get the Marks a bit fancier: Drag the diabetes_predicted.csv (Count) to the Size, and drag it once more to the Label. Then drag the Model Results to the Label and expand your graphic so you can see the whole thing. You will get something that should look like this:

Alt text: predicted csv

Question 8: Interpreting the Visual Confusion Matrix

Look at your visual matrix. Which statements would you agree with? Select all that apply.

A If a person actually has diabetes, their results would be found on the top row.
B If a person actually does not have diabetes, their results would be found on the bottom row.
C If the model predicts diabetes, the majority of the people in this category will turn out to have diabetes.
D If the model predicts not diabetes, the majority of the people in this category will not turn out to have diabetes.
E If a person has diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.
F If a person does not have diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.

5. Sometimes we want to see how a model's predictions vary as certain variables change. Does this model predict differently for people of different ages?
a. Go to a new worksheet and make a histogram of the age. Set the bin size to 10. It should look like this:

Alt text: bar graph
b. Add the Predicted Outcome Text in front of the Age (bin). You will now see histograms, but they are split by predictions:

Alt text: bar graph

Question 9: Interpreting the Split Histograms

Look at these two histograms. Which statements would you agree with? Select all that apply.

A Among those who are predicted not to have diabetes, the age distribution has a lot of younger people in it.
B In the age group 40-49, the model is predicting approximately the same number of people with and without diabetes.
C In the age group 40-49, the model is predicting approximately the same percentage of people with and without diabetes.
D In the group which is predicted to have diabetes, the ages are relatively evenly distributed between people in their 20s, 30s, 40s, and 50s, with a sharp drop-off at age 60 and older.
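The split histograms from step 5 amount to two operations: put each age into a 10-year bin, then count people per predicted outcome and bin. A small sketch follows, assuming bins are labeled by their lower edge (so 40 covers ages 40-49) and using invented people rather than the real dataset.

```python
from collections import Counter

def age_bin(age: int, size: int = 10) -> int:
    """Return the lower edge of the bin, e.g. 47 -> 40 with a bin size of 10."""
    return (age // size) * size

# Invented (age, predicted outcome) pairs for illustration.
people = [
    (23, "Not Diabetes"), (27, "Not Diabetes"), (31, "Not Diabetes"),
    (45, "Diabetes"), (47, "Not Diabetes"), (52, "Diabetes"), (68, "Diabetes"),
]

# One histogram per predicted outcome: count people in each (prediction, bin) pair.
histograms = Counter((predicted, age_bin(age)) for age, predicted in people)

for (predicted, lower), count in sorted(histograms.items()):
    print(f"{predicted:>12} {lower}-{lower + 9}: {count}")
```

Because these are raw head counts, two bars of equal height in the two histograms mean equal numbers of people, not equal percentages.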

6. Sometimes the total head count does not give the whole picture, and a percentage is a better way to go. Let's try to get our histograms to show us percentages of total.
a. Duplicate your paired Age histograms to a new sheet.
b. In the Rows, click the drop-down arrow on CNT(Age) and select Add Table Calculation.

Alt text: histogram
c. For your Table Calculation, choose Percent of Total, and have it compute using Table (down):

Alt text: table
d. Then put the Model Results on the Color so you can see what percentage of each age group has what sorts of model results:

Alt text: graph
e. The final touch: Often, culturally, we see green as "good/correct" and red as "bad/error." Let's go through and set the colors so the "true" outcomes are in the green family and the "false" outcomes are in the red family.

Alt text: graph
f. Now we can look at, for example, a person in their 20s who is predicted not to have diabetes. Do they need to worry?
i. The prediction is not diabetes, so we want the graph on the right (blue and red).
ii. Find the bar which represents people in their 20s who are not predicted to have diabetes.

Alt text: graph
iii. Let's look at this bar a little more closely. We can drag the diabetes_predicted.csv (Count) onto the labels to have it show us the total number of people here. We can see that it does pretty well (lots of true model outcomes) for people in their 20s who are predicted not to have diabetes.

Alt text: graph

Question 10: Interpreting the Stacked Percentage Bar Charts

Look at these charts. Which statements are accurate? Select all that apply.

A For people in their 40s (age 40-49), a model prediction of "no diabetes" is very good news because the model is nearly always correct, and they probably don't have diabetes.
B For very elderly people (age 80-89), there is only one person in the dataset of this age. Because the model predicts "diabetes" for this person, it will always predict "diabetes" for all people in this age group, regardless of their BMI, glucose, or other variables.
C Say you have 10 people in their 20s who receive a model prediction of "diabetes." Approximately 7 of those people will actually have diabetes, but 3 will be incorrectly predicted to have diabetes.
D Say you have 10 people in their 20s who receive a model prediction of "diabetes." Approximately 4 of those people will actually have diabetes, and these are the false positives.
E There are relatively few people in either category (predicted diabetes, predicted no diabetes) who are age 60-69, so we should be cautious about interpreting these percentages for a broader population.
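The stacked percentage bars from step 6 can be reasoned about as a grouped share calculation. The sketch below assumes that each bar (one predicted outcome and one age bin) is treated as its own total, so its colored segments add up to 100%. The rows are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented rows: (age, predicted outcome, model result), for illustration only.
rows = [
    (22, "Not Diabetes", "True Negative"),
    (25, "Not Diabetes", "True Negative"),
    (28, "Not Diabetes", "True Negative"),
    (29, "Not Diabetes", "False Negative"),
    (24, "Diabetes", "True Positive"),
    (26, "Diabetes", "False Positive"),
]

# Group by (prediction, age bin), then count each model result inside the group.
groups = defaultdict(Counter)
for age, predicted, result in rows:
    groups[(predicted, age // 10 * 10)][result] += 1

# Within each bar, every segment is its count divided by the bar's total.
shares = {
    key: {result: n / sum(tally.values()) for result, n in tally.items()}
    for key, tally in groups.items()
}

for (predicted, lower), segment in sorted(shares.items()):
    print(predicted, f"{lower}s", {r: f"{s:.0%}" for r, s in segment.items()})
```

A percentage built from only a handful of rows can swing a lot when a single person is added or removed, which is worth keeping in mind when reading the smaller age groups.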
