
10701 Machine Learning - Spring 2012 - Final Examinations

Q1. Decision trees and KNNs

In the following questions we use Euclidean distance for the KNN.

1.1 Assume we have a decision tree to classify binary vectors of length 100 (that is, each input is of size 100). Can we specify a 1-NN that would result in exactly the same classification as our decision tree? If so, explain why. If not, either explain why or provide a counter example.

1.2

[Figure 1: the decision tree referenced in question 1.2 (image not available)]

Assume we have the decision tree in Figure 1, which classifies two-dimensional vectors {X1, X2} ∈ R\{A, B}. In other words, the values A and B are never used in the inputs. Can this decision tree be implemented as a 1-NN? If so, explicitly state the values you use for the 1-NN (you should use the minimal number possible). If not, either explain why or provide a counter example.

Q2. Neural Networks

Consider neural networks where each neuron has a linear activation function, i.e., each neuron's output is given by g(x) = c + (b/n) ∑_{i=1}^{n} W_i x_i, where b and c are two fixed real numbers and n is the number of incoming links to that neuron.

1. Suppose you have a single neuron with a linear activation function g() as above, input x = (x_0, . . . , x_n), and weights W = (W_0, . . . , W_n). Write down the squared error function for this input if the true output is a scalar y, then write down the weight update rule for the neuron based on gradient descent on this error function.
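For part 1, here is a minimal sketch of such an error function and update rule, assuming the activation g(x) = c + (b/n) ∑ W_i x_i defined above, the conventional 1/2 factor in the squared error, and a hypothetical learning rate eta:

```python
import numpy as np

def neuron_output(W, x, b, c):
    """Linear activation: g(x) = c + (b/n) * sum_i W_i * x_i."""
    n = len(x)
    return c + (b / n) * np.dot(W, x)

def squared_error(W, x, y, b, c):
    """E = 0.5 * (y - g(x))^2; the 1/2 is a common convention, not given in the question."""
    return 0.5 * (y - neuron_output(W, x, b, c)) ** 2

def gradient_step(W, x, y, b, c, eta=0.1):
    """dE/dW_i = -(y - g(x)) * (b/n) * x_i, so gradient descent moves W the other way."""
    n = len(x)
    err = y - neuron_output(W, x, b, c)
    return W + eta * err * (b / n) * np.asarray(x)
```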

2. Now consider a network of linear neurons with one hidden layer of m units, n input units, and one output unit. For a given set of weights wk,j in the input-hidden layer and Wj in the hidden-output layer, write down the equation for the output unit as a function of wk,j, Wj, and the input x. Show that there is a single-layer linear network with no hidden units that computes the same function.
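For part 2, a quick numeric sanity check of the claim (a sketch with arbitrary sizes and random weights, not a proof): composing two layers of these affine neurons yields another affine map, which a single-layer network can reproduce exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                       # arbitrary input and hidden sizes
b, c = 2.0, 0.5                   # the fixed activation parameters from the question

w = rng.normal(size=(m, n))       # input -> hidden weights w_{k,j}
W = rng.normal(size=m)            # hidden -> output weights W_j
x = rng.normal(size=n)

hidden = c + (b / n) * (w @ x)    # each hidden unit applies g()
out = c + (b / m) * (W @ hidden)  # the output unit applies g() again

# Collapse into one affine map: out = bias + coeff @ x.
coeff = (b * b / (m * n)) * (W @ w)   # note: this composed map has rank at most m
bias = c + (b * c / m) * W.sum()
assert np.isclose(out, bias + coeff @ x)
```

The rank comment in the code is also relevant background for part 3 below: the composed map can never exceed rank m, while an unconstrained single layer has no such restriction.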

3. Now assume that the true output is a vector y of length o. Consider a network of linear neurons with one hidden layer of m units, n input units, and o output units. If o > m, can a single-layer linear network with no hidden units be trained to compute the same function? Briefly explain why or why not.

4. The model in 3) combines dimensionality reduction with regression. One could also reduce the dimensionality of the inputs (e.g. with PCA) and then use a linear network to predict the outputs. Briefly explain why this might not be as effective as 3) on some data sets.

Q3. Gaussian mixtures models

3.1 The E-step in estimating a GMM infers the probabilistic membership of each data point in each component Z: P(Zj|Xi), i = 1, ..., n, j = 1, ..., k, where i indexes data and j indexes components. Suppose a GMM has two components with known variance and an equal prior distribution

0.5 · N(µ1, 1) + 0.5 · N(µ2, 1)

The observed data point is x1 = 2, and the current estimates of µ1 and µ2 are 2 and 1 respectively. Compute the component memberships of this observed data point for the next E-step (hint: normal densities of the standardized variable y (µ = 0, σ = 1) at 0, 0.5, 1, 1.5, 2 are 0.4, 0.35, 0.24, 0.13, 0.05).
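A numeric sketch of this computation, taking the density values straight from the hint table rather than recomputing them:

```python
# Standard-normal densities at |x - mu| / sigma, as given in the hint.
density = {0.0: 0.40, 0.5: 0.35, 1.0: 0.24, 1.5: 0.13, 2.0: 0.05}

x, mu1, mu2, prior = 2.0, 2.0, 1.0, 0.5    # sigma = 1 for both components

lik1 = density[abs(x - mu1)]   # 0.40
lik2 = density[abs(x - mu2)]   # 0.24

# Bayes rule; the equal priors cancel after normalization.
r1 = prior * lik1 / (prior * lik1 + prior * lik2)   # 0.40 / 0.64 = 0.625
r2 = 1.0 - r1                                       # 0.375
print(r1, r2)
```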

3.2 The Gaussian mixture model (GMM) and the k-means algorithm are closely related: the latter is a special case of the former. The likelihood of a GMM with Z denoting the latent components is typically expressed as

P(X) = ∑_Z P(X|Z) P(Z)

where P(X|Z) is the (multivariate) Gaussian likelihood conditioned on the mixture component and P(Z) is the prior on the components. Such a likelihood formulation can also be used to describe a k-means clustering model. Which of the following statements are true? Choose all correct options if there are multiple ones (a small simulation after the options illustrates the limiting behavior at play).

a) P(Z) is uniform in k-means but this is not necessarily true in GMM.

b) The values in the covariance matrix in P(X|Z) tend towards zero in k-means but this is not so in GMM.

c) The values in the covariance matrix in P(X|Z) tend towards infinity in k-means but this is not so in GMM.

d) The covariance matrix in P(X|Z) in k-means is diagonal but this is not necessarily the case in GMM.
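As noted above, here is an illustrative simulation (with arbitrary numbers) of how GMM responsibilities behave as a shared, spherical component variance shrinks toward zero:

```python
import numpy as np

x = np.array([1.2])            # a single 1-D data point (arbitrary)
mus = np.array([1.0, 2.0])     # two component means (arbitrary)

for sigma in [1.0, 0.1, 0.01]:
    log_lik = -0.5 * ((x - mus) / sigma) ** 2   # log-density up to shared constants
    resp = np.exp(log_lik - log_lik.max())      # subtract max for numerical stability
    resp /= resp.sum()
    print(sigma, resp)   # responsibilities harden toward the nearest mean: [1, 0]
```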

Q4. Semi-Supervised learning

4.1 Assume we are trying to classify stocks to predict whether the stock will increase (class 1) or decrease (class 0) based on the stock closing value in the last 5 days (so our input is a vector with 5 values). We would like to use logistic regression for this task. We have both labeled and unlabeled data for the learning task. For each of the 4 semi-supervised learning methods we discussed in class, say whether it can be used for this classification problem or not. If it can, briefly explain how you would change the logistic regression target function to implement the algorithm. If it cannot, explain why.

1. Re-weighting

2. Co-Training

3. EM based

4. Minimizing overfitting

4.2 Assume we are using co-training to classify the rectangles and circles in Figure 2 a). The '?' marks represent unlabeled data. For our co-training we use linear classifiers, where the first classifier uses only the x coordinate values and the second only the y coordinate values. Choose from the answers below the number of iterations that will be performed until our co-training converges for this problem, and briefly explain:

1. 1

2. 2

3. more than 2

4. Impossible to tell.

[Figure 2: (a) co-training data with labeled rectangles, circles, and unlabeled '?' points; (b) boosting data (image not available)]

4.3 Now assume that we are using boosting (with linear classifiers, so each weak classifier is a line in the two-dimensional plane) to classify the points in Figure 2 b). We terminate the boosting algorithm once we reach a t such that εt = 0 or after 100 iterations. How many iterations do we need to perform until the algorithm converges? Briefly explain.

1. 1

2. 2

3. more than 2

4. Impossible to tell.

Q5. Bayesian networks

5.1 1. Show that a ⊥ {b, c} | d (a is conditionally independent of {b, c} given d) implies a ⊥ b | d (a is conditionally independent of b given d).
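One possible route for part 1 (a sketch, assuming all conditional distributions are well defined) is to marginalize c out of the joint conditional:

```latex
P(a, b \mid d) = \sum_{c} P(a, b, c \mid d)
              = \sum_{c} P(a \mid d)\, P(b, c \mid d)
              = P(a \mid d) \sum_{c} P(b, c \mid d)
              = P(a \mid d)\, P(b \mid d),
```

where the second equality uses a ⊥ {b, c} | d.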

2. Define the skeleton of a BN G over variables X as the undirected graph over X that contains an edge {X, Y} for every edge (X, Y) in G. Show that if two BNs G and G' over the same set of variables X have the same skeleton and the same set of v-structures, then they encode the same set of independence assumptions. A v-structure is a structure of three nodes X, Y, Z such that X → Y ← Z.

Hint: It suffices to show that any independence assumption A ⊥ B | C (where A, B, C are mutually exclusive sets of variables) is encoded by G if and only if it is also encoded by G'. Use d-separation.

5.2 For each of the following pairs of BNs in Figure 3, determine if the two BNs are equivalent, i.e., whether they have the same set of independence assumptions. When they are equivalent, state one independence/conditional independence assumption shared by them. When they are not equivalent, state one independence/conditional independence assumption satisfied by one but not the other.

[Figure 3: pairs of Bayesian networks to compare (image not available)]

5.3 We refer to Figure 4 in this question.

[Figure 4: Bayesian network over variables including A, S, D, H, C, J (image not available)]

1. What is the Markov blanket of {A, S}?

2. (T/F) For each of the following independence assumptions, please state whether it is true or false:

(a) D ⊥ S.

(b) D ⊥ S | H.

(c) C ⊥ J | H.

(d) C ⊥ J | A.

Q6. Hidden Markov Models

6.1 Assume we have temporal data from two classes (for example, 10 days of closing prices for stocks that increased/decreased on the following day). How can we use HMMs to classify this data?
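One standard recipe (a sketch, not the only acceptable answer): fit one HMM per class on that class's training sequences, then label a new sequence by the model under which it is most likely. The snippet below assumes the third-party hmmlearn package, Gaussian emissions, and an arbitrary number of hidden states; all three are illustrative choices, not part of the question.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumption: hmmlearn is installed

def fit_class_hmm(sequences, n_states=3):
    """Fit one HMM on all training sequences of a single class."""
    X = np.concatenate(sequences)          # stack the (T_i, 1)-shaped sequences
    lengths = [len(s) for s in sequences]  # so hmmlearn knows the boundaries
    return GaussianHMM(n_components=n_states).fit(X, lengths=lengths)

def classify(seq, hmm_up, hmm_down):
    """Class 1 if the 'increase' model assigns higher log-likelihood, else 0."""
    return 1 if hmm_up.score(seq) > hmm_down.score(seq) else 0
```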

6.2 Derive the probability:

P(o1, . . . , ot, qt-1 = sv, qt = sj)

You may use any of the model parameters (starting, emission and transition probabilities) and the following constructs (defined and derived in class) as part of your derivation:

pt(i) = p(qt = si)

αt(i) = p(o1, . . . , ot, qt = si)

δt(i) = max_{q1, . . . , qt−1} p(q1, . . . , qt−1, qt = si, o1, . . . , ot)

Note that you may not need to use all of these constructs to fully define the probability listed above.
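For intuition, a minimal sketch of how the forward variable α can enter such a derivation, using assumed names A[v, j] for the transition probability p(qt = sj | qt−1 = sv) and B[j, o] for the emission probability (these names are illustrative, not from the question):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward variables: alpha_t(i) = p(o_1, ..., o_t, q_t = s_i).
    pi: starting probabilities; A: transition matrix; B[j, o]: emission probs."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha

def joint(pi, A, B, obs, v, j):
    """p(o_1..o_t, q_{t-1} = s_v, q_t = s_j) = alpha_{t-1}(v) * A[v, j] * B[j, o_t].
    Assumes a sequence of length t >= 2."""
    alpha_prev = forward(pi, A, B, obs[:-1])   # forward pass up to time t-1
    return alpha_prev[v] * A[v, j] * B[j, obs[-1]]
```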

6.3 The following questions refer to Figure 5. In that figure we present an HMM and specify both the transition and emission probabilities. Let ptA be the probability of being in state A at time t. Define ptB and ptC similarly.

1. What is p3C?

2. Define pt2 as the probability of observing 2 at time point t. Express pt2 as a function of ptB and ptC (you can also use any of the model parameters defined in the figure if you need them).

[Figure 5: HMM over states A, B, C with transition and emission probabilities (image not available)]

Q7. Dimension Reduction

7.1 Which of the following unit vectors expressed in coordinates (X1, X2) correspond to theoretically correct directions of the 1st (p) and 2nd (q) principal components (via linear PCA) respectively for the data shown in Figure 6? Choose all correct options if there are multiple ones.

a) i) p(1, 0) q(0, 1) ii) p(1/√(2), 1/√(2)) q(1/√(2), -1/√(2))

b) i) p(1, 0) q(0, 1) ii) p(1/√(2), -1/√(2)) q(1/√(2), 1/√(2))

c) i) p(0, 1) q(1, 0) ii) p(1/√(2), 1/√(2)) q(-1/√(2), 1/√(2))

d) i) p(0, 1) q(1, 0) ii) p(1/√(2), 1/√(2)) q(-1/√(2), -1/√(2))

e) All of the above are correct

7.2 In linear PCA, the covariance matrix of the data C = XᵀX is decomposed into a weighted sum of its eigenvalues λi and eigenvectors pi:

C = ∑_i λi pi piᵀ

[Figure 6: 2-D data for PCA question 7.1 (image not available)]

Prove mathematically that the first eigenvalue λ1 is identical to the variance obtained by projecting the data onto the first principal component p1 (hint: PCA maximizes variance by projecting data onto its principal components).
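A numeric sanity check of the identity (not the requested proof), using the C = XᵀX convention from the question on centered data, so "variance" here is the uncorrected sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # arbitrary correlated data
X -= X.mean(axis=0)                                      # center the columns

C = X.T @ X
eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition, ascending
lam1, p1 = eigvals[-1], eigvecs[:, -1]     # top eigenpair

scores = X @ p1                            # projection onto p1
print(lam1, scores @ scores)               # equal: p1' C p1 = lam1 since |p1| = 1
```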

7.3 The key assumption of a naive Bayes (NB) classifier is that features are independent, which is not always desirable. Suppose that linear principal components analysis (PCA) is first used to transform the features, and NB is then used to classify data in this low-dimensional space. Is the following statement true? Justify your answer.

The independence assumption of NB would now be valid with PCA-transformed features because all principal components are orthogonal and hence uncorrelated.
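A classic example worth keeping in mind when answering (illustrative, with simulated data): zero correlation does not imply independence once features are non-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x ** 2                        # y is a deterministic function of x...

print(np.corrcoef(x, y)[0, 1])    # ...yet their correlation is ~0 (odd moments vanish)
```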

Q8. Markov Decision Processes

Consider the following Markov Decision Process (MDP), describing a simple robot grid world. The values of the immediate rewards R are written next to the transitions. Transitions with no value have an immediate reward of 0. Note that the action "go south" from state S5 results in one of two outcomes. With probability p the robot succeeds in transitioning to state S6 and receives immediate reward 100. However, with probability (1 - p) it gets stuck in sand and remains in state S5 with zero immediate reward. Assume the discount factor γ = 0.8 and the probability p = 0.9.
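To make the stochastic transition concrete, the Bellman backup for "go south" at S5 under the stated dynamics looks as follows (a sketch; V_s5 and V_s6 stand for state values determined by the rest of the grid, which depends on the figure):

```python
gamma, p = 0.8, 0.9

def q_south_s5(V_s5, V_s6):
    """Q(S5, south): succeed with prob p (reward 100, move to S6),
    get stuck with prob 1 - p (reward 0, stay in S5)."""
    return p * (100 + gamma * V_s6) + (1 - p) * (0 + gamma * V_s5)
```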

[Figure: MDP grid world with states S1-S6 and immediate rewards on the transitions (image not available)]

1. Mark the state-action transition arrows that correspond to one optimal policy. If there is a tie, always choose the state with the smallest index.

2. Is it possible to change the value for γ so that the optimal policy is changed? If yes, give a new value for γ and describe the change in policy that it causes. Otherwise briefly explain why this is impossible.

3. Is it possible to change the immediate reward function so that V changes but the optimal policy π remains unchanged? If yes, give such a change and describe the resulting changes to V. Otherwise briefly explain why this is impossible.

4. How sticky does the sand have to get before the robot will prefer to completely avoid it? Answer this question by giving a probability for p below which the optimal policy chooses actions that completely avoid the sand, even choosing the action "go west" over "go south" when the robot is in state S5.

Q9. Boosting

9.1 AdaBoost can be understood as an optimizer that minimizes an exponential loss function E = ∑_{i=1}^{N} exp(−y_i f(x_i)), where y = +1 or −1 is the class label, x is the data and f(x) is the weighted sum of weak learners. Show that the loss function E is strictly greater than, and hence an upper bound on, the 0-1 loss function E_{0-1} = ∑_{i=1}^{N} 1·(y_i f(x_i) < 0) (hint: E_{0-1} is a step function that assigns value 1 if the classifier predicts incorrectly and 0 otherwise).
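A quick pointwise check of the bound over a range of margins m = y·f(x) (illustrative only; the question asks for the algebraic argument):

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 13)        # m = y * f(x) on an arbitrary grid
exp_loss = np.exp(-margins)                 # per-example exponential loss
zero_one = (margins < 0).astype(float)      # 1 on a mistake, 0 otherwise

# exp(-m) > 1 when m < 0 and exp(-m) > 0 when m >= 0, so the bound is strict.
assert np.all(exp_loss > zero_one)
```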

9.2 The AdaBoost algorithm has two caveats. Answer the following questions regarding these.

a) Show mathematically why a weak learner with < 50% predictive accuracy presents a problem to AdaBoost.

b) AdaBoost is susceptible to outliers. Suggest a simple heuristic that relieves this.

9.3 Figure 7 illustrates the decision boundary (the middle intersecting line) after the first iteration of an AdaBoost classifier with decision stumps as the weak learner. The square points are from class -1 and the circles are from class 1. Draw (roughly) in a solid line the decision boundary at the second iteration. Draw in a dashed line the ensemble decision boundary based on the decisions at iterations 1 and 2. State your reasoning.

[Figure 7: two-class points with the first decision stump's boundary (image not available)]
