Compute the entropy for the overall data

Assignment Help Other Subject
Reference no: EM132371413

Data Science Assignment -

Question 1 - Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative (ratio)

1. House numbers assigned for a given street

2. Your calorie intake per day

3. Shape of a geometric objects commonly found in geometry classes.

4. Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.)

5. Longitude and Latitude of a location on earth.

Question 2 - Below is an example of customer transaction database. Each row is the purchased items (binary variable, 1 for purchased, 0 or empty for not purchased) by one customer. Assume there are millions of items.

Items

Item 1

Item 2

Item 3

Item 4

Item 5

Item 6

Item 7

Item 8

Item 9

Item 10

Customer 1

1

 

1

 

 

 

 

 

 

 

Customer 2

1

1

1

 

 

 

 

 

 

 

Customer 3

 

 

 

 

 

1

 

 

 

 

...

...

 

 

 

 

 

 

 

 

 

(1) Consider each row (customer purchased history) as a sample, and different columns as attributes. Are the binary attribute symmetric or asymmetric? Why?

(2) Compute similarity between Customer 1 and Customer 2, as well as similarity between customer 1 and customer 3. Use Simple Matching Coefficient and Jaccard Coefficient respectively.

SMC(1, 2) = ?

SMC(1,3) = ?

JC(1,2)=?

JC(1,3)=?

Which measure among the two better reflects customer similarity? Why?

(3) Assume the values in the table are not binary (purchased or not), but counts (non-negative integer) of items being purchased by a customer. Our goal is to recommend items to customers. In other words, we want to recommend an item that have not been purchased by a customer yet. Specifically, we will predict "the count of an item being purchased by a customer A" based on the average count of the same item being purchased by "the five most similar customers to A".

How would you measure similarity between customers in this case? Which measure would you prefer among SMC, Jaccard, Euclidean distance, Correlation, and Cosine? Why do you select this measure but not the others?

Question 3 - Assume sample p = (1,0,1,0,1,0), sample q = (3,-3,3,-3,3,-3). Answer the questions below:

(1) What are similarities between p and q based on "Cosine" and "Pearson's Correlation"?

Cosine(p,q)=?

Correlation(p,q)=?

(2) Assume we add all attribute in q by a constant 3, now q'=(6,0,6,0,6,0).

Cosine(p,q')=?

Correlation(p,q')=?

What did you find out by comparing results in (1) and (2)?

(3) Assume we multiple all attribute in q by 3, now q''=(9,-9,9,-9,9,-9).

Cosine(p,q'')=?

Correlation(p,q'')=?

What did you find out by comparing results in (1) and (3)?

Question 4 - Consider the dataset shown in Table 1.1 for a binary classification problem.

Customer ID

Housing Type

Gender

Marital Status

Class

1

Apartment

Male

Married

C0

2

House

Male

Single

C1

3

House

Female

Married

C1

4

Apartment

Female

Single

C0

5

Apartment

Male

Married

C0

6

Hostel

Male

Single

C1

7

House

Female

Married

C1

8

Apartment

Female

Single

C0

9

Apartment

Male

Married

C0

10

House

Male

Single

C1

11

Hostel

Female

Married

C1

12

House

Female

Single

C0

13

House

Male

Married

C0

14

Hostel

Male

Single

C1

15

Hostel

Female

Married

C1

16

Apartment

Female

Single

C0

a. Compute the entropy for the overall data.

b. Compute the entropy obtained for each of the four attributes (consider a multi-way split using each unique value of an attribute).

c. Compute the Information Gain (IG) obtained by splitting the overall data using each of the four attributes. Which attribute provides the highest IG, and which attribute provides the lowest IG.

d. For splitting at the root node, would you choose the attribute that provides the maximum IG? Briefly explain your choice.

f. Consider the following 3 decision trees:

496_figure.png

Compute the difference between the entropy of overall data with the weighted entropy of the leaves for each of the three trees. Based on these differences, which tree would you choose for performing classification? Is the attribute chosen at the root of this tree same as the attribute chosen for splitting in (e)? Briefly comment on the nature of your results, and the properties of the impurity measure used while constructing decision trees.

Question 5 - Consider a data set with instances belonging to one of two classes - positive(+) and negative(-). A classifier was built using a training set consisting of equal number of positive and negative instances. Among the training instances, the classifier has an accuracy m on the positive class and an accuracy of n on the negative class.

The trained classifier is now tested on two data sets. Both have similar data characteristics as the training set. The first data set has 1000 positive and 1000 negative instances. The second data set has 100 positive and 1000 negative instances.

A. Draw the expected confusion matrix summarizing the expected classifier performance on the two data sets.

B. What is the accuracy of the classifier on the training set? Compute the precision, TPR and FPR for the two test data sets using the confusion matrix from part A. Also report the accuracy of the classifier on both data sets.

C. i). If the skew in the test data - the ratio of the number of positive instances to the number of negative instances, is 1:s, what is the accuracy of the algorithm on this data set? Express your answer in terms of s, m, n.

ii). What value does the overall accuracy approach to if s is very large (>>1)? And when s is very small (<<1)?

D. In the scenario where the class imbalance is pretty high (say, s>500 for part C), how are precision and recall better metrics in comparison to overall accuracy? What information does precision capture that recall doesn't?

Question 6 - Below is the learning curve of a classifier ("classification accuracy on test set" versus "the number of training samples"). Please mark out which ranges of x-axis corresponding to "underfitting" and "overfitting" respectively. If you answer that the curve does not show "underfitting" or "overfitting", please explain why.

1959_figure1.png

Question 7 - In each of the classification scenarios listed below, you are given a set of classifiers and a description of the classification scenario. For each scenario, state the choice of the classifier that is best suited for the dataset along with a brief explanation supporting your answer.

(a) Many of the attributes are irrelevant (contain no information about the class). Classifiers available: KNN, ANN, Decision Trees.

(b) Dataset contains attributes that are not discriminative by themselves, but are discriminative in combination. Classifiers available: KNN, Naïve Bayes, Decision Trees (taking single attribute at a time in a tree node).

Question 8 - Please prove a conclusion related to PCA we discussed in the class: the unit vector to which the projected sample points have the largest variance is the first eigenvector of sample covariance matrix. Please provide brief and sufficient mathematical inductions.

Question 9 - Logistic regression classifier directly models P(x_i|y_i), while naïve Bayes classifier model P(x_i,y_i) together. This is why logistic regression classifier is called "discriminative" and naïve bayes classifier is called "generative".

(1) Please write down the mathematics of model learning and prediction in logistic regression and naïve Bayes respectively. Assume a binary logistic regression classifier. You can also assume x_i follow conditional Gaussian distribution in Naïve Bayes. Use the maximum likelihood method for parameter learning in both models.

(2) Does the learning process of Logistic Regression involve a close-form solution (a direct math formula to compute learned parameters)? Why? If not, what are the strategies to learn parameter fast? Write down detailed math in your answer.

Question 10 - Prove that as the number of iterations increase to infinite, the expected loss of Adaboost is decreasing towards 0.

Reference no: EM132371413

Questions Cloud

What produces two fertility hormones : What produces two fertility hormones, Follicle stimulating hormone, and Luteinizing hormone?
Determine whether the statement is true or false : Determine whether the statement is true or false, and why. "Hormones bond to a receptor protein to enter a cell and stimulate a response."
What are the peptide hormones in the body : What are the peptide hormones in the body?
Planning factors for response to global health emergency : What are the foundational planning factors for response to a global health emergency?
Compute the entropy for the overall data : CS 451/551 Data Science Assignment - Compute the entropy obtained for each of the four attributes. Compute the entropy for the overall data
Team-building and communications proposal : Team-Building and Communications Proposal You have just been hired as a consultant to the chief executive officer (CEO) of a health care organization.
Research on nervous system diseases : You are doing research on nervous system diseases; particularly, one that dissolves lipids. What part of the neuron would you expect to be affected
Depending on personal experience in obtaining health care : Depending on one's personal experience in obtaining health care, or one's view of the role of physicians in society, of law as a tool for social change,
Learned about punnett squares : In your lab reading and in class, you learned about Punnett squares. Punnett squares illustrate the probability of different genotypes and phenotypes appearing

Reviews

Write a Review

Other Subject Questions & Answers

  Cross-cultural opportunities and conflicts in canada

Short Paper on Cross-cultural Opportunities and Conflicts in Canada.

  Sociology theory questions

Sociology are very fundamental in nature. Role strain and role constraint speak about the duties and responsibilities of the roles of people in society or in a group. A short theory about Darwin and Moths is also answered.

  A book review on unfaithful angels

This review will help the reader understand the social work profession through different concepts giving the glimpse of why the social work profession might have drifted away from its original purpose of serving the poor.

  Disorder paper: schizophrenia

Schizophrenia does not really have just one single cause. It is a possibility that this disorder could be inherited but not all doctors are sure.

  Individual assignment: two models handout and rubric

Individual Assignment : Two Models Handout and Rubric,    This paper will allow you to understand and evaluate two vastly different organizational models and to effectively communicate their differences.

  Developing strategic intent for toyota

The following report includes the description about the organization, its strategies, industry analysis in which it operates and its position in the industry.

  Gasoline powered passenger vehicles

In this study, we examine how gasoline price volatility and income of the consumers impacts consumer's demand for gasoline.

  An aspect of poverty in canada

Economics thesis undergrad 4th year paper to write. it should be about 22 pages in length, literature review, economic analysis and then data or cost benefit analysis.

  Ngn customer satisfaction qos indicator for 3g services

The paper aims to highlight the global trends in countries and regions where 3G has already been introduced and propose an implementation plan to the telecom operators of developing countries.

  Prepare a power point presentation

Prepare the power point presentation for the case: Santa Fe Independent School District

  Information literacy is important in this environment

Information literacy is critically important in this contemporary environment

  Associative property of multiplication

Write a definition for associative property of multiplication.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd