How many parameters must be estimated to train

Assignment Help Data Structure & Algorithms
Reference no: EM131105458

10-701 Midterm Exam, Spring 2005

1. Big Picture

Following the example given, add 10 edges to Figure 1 relating the pair of algorithms. Each edge should be labeled with one characteristic the methods share, and one difference. These labels should be short and address basic concepts, such as types of learning problems, loss functions, and hypothesis spaces.

650_Figure.png

2. Short Questions

(a) Briefly describe the difference between a maximum likelihood hypothesis and a maximum a posteriori hypothesis.

(b) Consider a naive Bayes classifier with 3 boolean input variables, X1, X2 and X3, and one boolean output, Y.

  • How many parameters must be estimated to train such a naive Bayes classifier? (you need not list them unless you wish to, just give the total)
  • How many parameters would have to be estimated to learn the above classifier if we do not make the naive Bayes conditional independence assumption?

True or False? If true, explain why in at most two sentences. If false, explain why or give a brief counterexample in at most two sentences.

  • (True or False?) The error of a hypothesis measured over its training set provides a pessimistically biased estimate of the true error of the hypothesis.
  • (True or False?) If you are given m data points, and use half for training and half for testing, the difference between training error and test error decreases as m increases.
  • (True or False?) Overfitting is more likely when the set of training data is small.
  • (True or False?) Overfitting is more likely when the hypothesis space is small.

3. Learning Algorithms

Consider learning a target function of the form f: R2 → {A, B, C} that is, a function with 3 discrete values defined over the 2-dimensional plane. Consider the following learning algorithms:

  • Decision trees
  • Logistic regression
  • Support Vector Machine
  • 1-nearest neighbor

Note each of these algorithms can be used to learn our target function f, though doing so might require a common extension (e.g., in the case of decision trees, we need to utilize the usual method for handling real-valued input attributes).

For each of these algorithms,

A. Describe any assumptions you are making about the variant of the algorithm you would use.

B. Draw in the decision surface that would be learned given this training data (and describing any ambiguities in your decision surface).

C. Circle any examples that would be misclassified in a leave-one-out evaluation of this algorithm with this data. That is, if you were to repeatedly train on n-1 of these examples, and use the learned classifier to label the left out example, will it be misclassified?

4. Decision Trees

NASA wants to be able to discriminate between Martians (M) and Humans (H) based on the following characteristics: Green ∈ {N, Y }, Legs ∈ {2, 3}, Height ∈ {S, T }, Smelly ∈ {N, Y }.

Our available training data is as follows:

 

Species

Green

Legs

Height

Smelly

1)

M

N

3

S

Y

2)

M

Y

2

T

N

3)

M

Y

3

T

N

4)

M

N

2

S

Y

5)

M

Y

3

T

N

6)

H

N

2

T

Y

7)

H

N

2

S

N

8)

H

N

2

T

N

9)

H

Y

2

S

N

10)

H

N

2

T

Y

a) Greedily learn a decision tree using the ID3 algorithm and draw the tree.

b) i) Write the learned concept for Martian as a set of conjunctive rules (e.g., if (green=Y and legs=2 and height=T and smelly=N), then Martian; else if ... then Martian; ...; else Human).

ii) The solution of part b)i) above uses up to 4 attributes in each conjunction. Find a set of conjunctive rules using only 2 attributes per conjunction that still results in zero error in the training set. Can this simpler hypothesis be represented by a decision tree of depth 2? Justify

5. Loss functions and support vector machines

relationship between ridge regression and the maximum a posteriori (MAP) approximation for Bayesian learning in a particular probabilistic model. In this question, you will explore this relationship further, finally obtaining a relationship between SVM regression and MAP estimation.

(a) Ridge regression usually optimizes the squared (L2) norm:

W^L2 = arg minw j=1ΣN(tj - Σiwihi(xj))2 + λΣiwi2.         (1)

 The L2 norm minimizes the squared residual (tji wihi(xj))2, thus significantly weighing outlier points. (An outlier is a data point that falls far away from the prediction Σi wihi(xj).) An alternative that is less susceptible to outliers is to minimize the "sum of absolute values" (L1) norm:

W^L1 = arg minw j=1ΣN|tj - Σiwihi(xj)| + λΣiwi2.      (2)

(i) Plot a sketch of the L1 loss function, do not include the regularization term in your plot. (The x-axis should be the residual tj - Σi wihi(xj) and the y-axis is the loss function.)

(ii) Give an example of a case where outliers can hurt a learning algorithm.

(iii) Why do you think L1 is less susceptible to outliers than L2?

(vi) Are outliers always bad and we should always ignore them? Why? (Give one short reason for ignoring outliers, and one short reason against.)

(v) As with ridge regression in Equation 1, the regularized L1 regression in Equation 2 can also be viewed a MAP estimator. Explain why by describing the prior P(w) and the likelihood function P(t| x, w) for this Bayesian learning problem.

P(x) = (1/2b)e-|x-µ|/b.

(b) As mentioned in class, SVM regression is a margin-based regression algorithm that takes two parameters, ∈ > 0 and C ≥ 0, as input. In SVM regression, there is no penalty for points that are within ∈ of the hyperplane. Points that are further than ∈ are penalized using the hinge loss. Formally, the SVM regression QP is:

1444_Figure1.png

(i) Plot a sketch of the loss function function used by SVM regression. Again, the x-axis should be the residual tj - Σiwihi(xj) and the y-axis the loss function. However, do not include the ½w.w term in this plot of the loss function.

(ii) Compared to L2 and L1, how do you think SVM regression will behave in the presence of outliers?

(iii) SVM regression can also be view as a MAP estimator. What is the prior and the likelihood function for this case?

6. Learning Theory

This question asks you to consider the relationship between the VC dimension of a hypothesis space H and the number of queries the learner must make (in the worst case) to assure that it exactly learns an arbitrary target concept in H.

More precisely, we have a learner with a hypothesis space H containing hypotheses of the form h: X → {0, 1}. The target function c: X → {0, 1} is one of the hypotheses in H. Training examples are generated by the learner posing a query instance xi ∈ X, and the teacher then providing its label c(xi). The learner continues posing query instances until it has determined exactly which one of its hypothesis in H is the target concept c.

Show that in the worst case (i.e., if an adversary gets to choose c ∈ H based on the learner's queries thus far, and wishes to maximize the number of queries), then the number of queries needed by the learner will be at least VC(H), the VC dimension of H. Put more formally, let MinQueries(c, H) be the minimum number of queries needed to guarantee learning target concept c exactly, when considering hypothesis space H. We are interested in the worst case number of queries, W orstQ(H), where

W orstQ(H) = maxcH[MinQueries(c, H)]

You are being asked to prove that

WorstQ(H) ≥ V C(H)

You will break this down into two steps:

(a) Consider the largest subset of instances S ⊂ X that can be shattered by H. Show that regardless of its learning algorithm, in the worst case the learner will be forced to pose each instance x ∈ S as a separate query.

(b) Use the above to argue that W orstQ(H) ≥ V C(H).

(c) Is there a better case? In other words, if the learner knows that a friend (not an adversary) will be choosing c ∈ H, and that the friend wishes to minimize the number of learning queries, is it possible for the friend to choose a c that allows the learner to avoid querying all of the points in S? More formally, if we define

BestQ(H) = mincH[MinQueries(c, H)]

then is the following statement true or false?

BestQ(H) ≥ V C(H)

Justify your answer.

7. Extra credit question: Bias-Variance trade-off

Suppose you want to learn a function y = f(x) + ε, where ε is an independent zero-mean noise variable. In class, you learned that the expected squared error of a learned hypothesis fˆ(x) at any specific point x0 can be decomposed into three parts:

Err(x0) = E[(Y - fˆ(x0))2 | x = x0],

= σ2 + [f(x0) - E(fˆ(x0))]2 + E[( ˆf(x0) - E(fˆ(x0)))2],

= Irreducible Error + Bias2 + Variance.

where the expected value is taken over different training data sets.

For this question, we'll assume the target function to be learned is f(x) = x and that ε is a zero-mean Gaussian random variable with variance σ2. We'll also assume the available training data consists of a fixed set of n equally spaced data points in x ∈ [a, b], and therefore the only randomness in different draws of training data arises from the randomness in the values of ε. Below you will consider approximating f(x) using different types of ˆf(x) functions. In each case, assume fˆ(x) is trained using least-squares regression based on this data.

1495_Figure2.png

(a) Suppose the learner considers only constant functions as possible hypotheses, i.e., fˆ(x) = w0. (see figure 2).

(i) Write the result of regression fˆ(x) as a function of the input data {y1, . . . , yn}.

(ii) What is the expected value of our estimator, E[fˆ(x)], where the expectation is taken over possible data sets?

(iii) Provide an upper bound over x0 ∈ [a, b] for the value of the bias term:

maxx0[a,b][f(x0) - E(fˆ(x0))]2.

(iv) What is the variance term:

E[(fˆ(x0) - E(fˆ(x0)))2]?

Hint: The variance is the same for all x0.

(b) Now, suppose that fˆ(x) is represented by two step functions: fˆ1(x) takes value w1 for x ∈ [a, a+b/2 ], and fˆ2(x) takes value w2 for x ∈ (a+b/2 , b].

(i) Provide an upper bound over x0 ∈ [a, b] for the value of the bias term.

(ii) What is the variance term?

(c) Finally, suppose that fˆ(x) is represented by m ≤ n equally spaced step functions.

(i) Provide an upper bound over x0 ∈ [a, b] for the value of the bias term.

(ii) What is the variance term?

(iii) Based on this analysis, what is the optimal number of step functions for learning f(x) in this problem setting; that is, what number of step functions produces the smallest expected squared error in the learned hypothesis? (Your answer should be an equation in terms of n, a, b, σ2.)

Reference no: EM131105458

Questions Cloud

Discuss the fairness of covenants-not-to-compete agreements : Discuss the fairness of covenants-not-to-compete agreements from both the employer and employee viewpoints when the employee is let go.
Interest rate and n periods of compounding : How should the interest rate and n periods of compounding be modified if compounding for less than one year?
What is the project npv : The fixed assets will be sold for $2,000 at the end of year three. The project generates OCF of $13,000. What is the project's NPV? Please give me a clear detailed way on how to find the NPV.
What are derivatives how can derivatives be used : What are derivatives? How can derivatives be used to reduce risk? Can derivatives be used to increase risk? Explain
How many parameters must be estimated to train : Consider a naive Bayes classifier with 3 boolean input variables, X1, X2 and X3, and one boolean output, Y. How many parameters must be estimated to train such a naive Bayes classifier? (you need not list them unless you wish to, just give the tota..
What concerns should smith have at the given point : What was the effect of the production volume variance on plant operating income? If White's bonus is based on operating income, what concerns should Smith have at this point? Were White's directives justified?
Formulate this problem as a markov decision process : A soap company specializes in a luxury type of bath soap. The sales of this soap fluctuate between two levels-low and high-depending upon two factors: (1) whether they advertise and (2) the advertising and marketing of new products by competitors.
What is the effect of mistake of fact in contract law : Can contractual agreements be illegal as formed or performed? What is the effect of mistake of fact in contract law?
Why are financial markets essential for a healthy economy : Why are financial markets essential for a healthy economy and economic growth?

Reviews

Write a Review

Data Structure & Algorithms Questions & Answers

  Do you observe any changes in cluster memberships

Draw the graphic for the healthy set, representing the values, healthy and unhealthy and what is the degree of membership to the fuzzy set healthy of person B who has a BMI of 26.2? And to the fuzzy set unhealthy?

  Find use two-dimensional other or multidimensional array

Suppose that you are creating an app to keep track of the places that you have visited while traveling. Determine whether or not you would use a two-dimensional or other multidimensional array (choosing to use an array) for the task at hand. Provi..

  Write an algorithm that includes the three control structure

This project provides you with the opportunity to examine algorithms, identify the inaccuracies in the algorithms, and finally, to modify the algorithms with the correct details. Your next step will be to write an algorithm that includes the three..

  How to sort an array using insertion sort

How to sort an array using insertion sort and track teh number of swaps during the sorting - Can someone provide the answer with reference to data structure?

  Explain the benefits and limitations of developing software

You will be asked to demonstrate how your algorithm works and explain the benefits and limitations of developing software solutions to common problems such as that in the scenario.

  Algorithm bubble sort

This project is a great opportunity to express your unique problem solving skills as well as your use of structured logic.

  What is the machine run time in second for sorting array

Write computer program to implement this algorithm and demonstrate the results and what is the machine run time in second for sorting array A

  Question 1a explain the meaning of each of the following

question 1a explain the meaning of each of the following pointer declarations-i float a -0.137float pa ampaii double

  Generate a new element one at a time

Then you can check for each new element if it is already there using the lookup function and if there are enough elements in the tree using the count function before doing insert. Finally, use the print function to show the result.

  Devise a linear-time algorithm

Devise a linear-time algorithm to count the parallel edges in a graph. Write the algorithm in pseudocode.

  Learning for numeric prediction

Write down the output (class) values and number of instances that appear in each of the leaf nodes A, B and C of the tree - Learning for Numeric Prediction

  Analyze case asymptotic complexity of making interference

Analyze the worst-case asymptotic complexity of making an interference graph, for a program of size N (with at most N variables and at most N control-flow nodes).

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd