Compute the log marginal likelihood log

Reference no: EM132805402

Problem 1
Suppose $x$ is a scalar random variable drawn from a univariate Gaussian $p(x \mid \eta) = \mathcal{N}(x \mid 0, \eta)$. The variance $\eta$ is itself drawn from an exponential distribution, $p(\eta \mid \gamma) = \text{Exp}(\eta \mid \gamma^2/2)$, where $\gamma > 0$. Note that the exponential distribution is defined as $\text{Exp}(x \mid \lambda) = \lambda \exp(-\lambda x)$. Derive the expression for the marginal distribution of $x$, i.e., $p(x \mid \gamma) = \int p(x \mid \eta)\, p(\eta \mid \gamma)\, d\eta$, obtained by integrating out $\eta$. What does the marginal distribution $p(x \mid \gamma)$ mean?

Plot both $p(x \mid \eta)$ and $p(x \mid \gamma)$ and include the plots in the writeup PDF itself. What difference do you see between the shapes of these two distributions? Note: You don't need to submit the code used to generate the plots. Just the plots (appropriately labeled) are fine.

Hint: You will notice that $\int p(x \mid \eta)\, p(\eta \mid \gamma)\, d\eta$ is a hard integral to compute. However, the solution does have a closed-form expression. One way to get the result is to compute the moment generating function (MGF) of $\int p(x \mid \eta)\, p(\eta \mid \gamma)\, d\eta$ (note that this is a p.d.f.), compare the obtained MGF expression with the MGFs of various p.d.f.s given in the table on the following Wikipedia page, and identify which p.d.f.'s MGF it matches. That will give you the form of the distribution $p(x \mid \gamma)$. Specifically, name this distribution and identify its parameters.
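A numerical sanity check can help confirm whichever closed form the MGF argument yields. The sketch below integrates $p(x \mid \eta)\, p(\eta \mid \gamma)$ over $\eta$ by quadrature and compares the result with a Laplace density of scale $1/\gamma$. That candidate is an assumption here (it is the standard result for a Gaussian scale mixture with exponential mixing) and should be checked against your own derivation. The function names, the grid size, and the value of `gamma` are illustrative choices, not part of the assignment.

```python
import numpy as np

def marginal_numeric(x, gamma, num=200_001):
    """Numerically integrate p(x|eta) p(eta|gamma) over eta > 0.

    Uses the substitution eta = u**2, which removes the eta**(-1/2)
    singularity of the Gaussian density at eta = 0.
    """
    lam = gamma**2 / 2.0                 # rate of the exponential prior on eta
    u_max = 12.0 / gamma                 # exp(-lam * u_max**2) is negligible here
    u = np.linspace(1e-6, u_max, num)
    # After substitution the integrand is
    # 2*lam/sqrt(2*pi) * exp(-x^2 / (2 u^2) - lam * u^2).
    f = 2.0 * lam / np.sqrt(2.0 * np.pi) * np.exp(-x**2 / (2.0 * u**2) - lam * u**2)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u)))  # trapezoid rule

def laplace_pdf(x, gamma):
    """Candidate closed form: Laplace density with location 0 and scale 1/gamma."""
    return 0.5 * gamma * np.exp(-gamma * abs(x))

gamma = 1.5                              # illustrative value
xs = [-2.0, -0.5, 0.0, 0.7, 3.0]
numeric = [marginal_numeric(x, gamma) for x in xs]
closed = [laplace_pdf(x, gamma) for x in xs]
```

Evaluating both functions on a grid of `x` values also gives the curves needed for the requested plot of $p(x \mid \gamma)$, alongside a Gaussian density with a fixed $\eta$.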

Problem 2
(It Gets Better..) Recall that, for a Bayesian linear regression model with likelihood $p(y \mid x, w) = \mathcal{N}(y \mid w^\top x, \beta^{-1})$ and prior $p(w) = \mathcal{N}(w \mid 0, \lambda^{-1} I)$, the predictive posterior is $p(y_* \mid x_*) = \mathcal{N}(y_* \mid \mu_N^\top x_*, \beta^{-1} + x_*^\top \Sigma_N x_*) = \mathcal{N}(y_* \mid \mu_N^\top x_*, \sigma_N^2(x_*))$, where we have defined $\sigma_N^2(x_*) = \beta^{-1} + x_*^\top \Sigma_N x_*$, and $\mu_N$ and $\Sigma_N$ are the mean and covariance matrix of the Gaussian posterior on $w$, s.t. $\mu_N = \Sigma_N \left(\beta \sum_{n=1}^N y_n x_n\right)$ and $\Sigma_N = \left(\beta \sum_{n=1}^N x_n x_n^\top + \lambda I\right)^{-1}$. Here, we have used the subscript $N$ to denote that the model is learned using $N$ training examples. As the training set size $N$ increases, what happens to the variance of the predictive posterior? Does it increase, decrease, or remain the same? You must also prove your answer formally. You may find the following matrix identity useful:

$$(M + vv^\top)^{-1} = M^{-1} - \frac{(M^{-1}v)(v^\top M^{-1})}{1 + v^\top M^{-1} v}$$

where $M$ denotes a square matrix and $v$ denotes a column vector.
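The identity can be explored numerically before attempting the proof. The following sketch applies it as a rank-one update each time a training input is added, starting from the prior covariance $(\lambda I)^{-1}$, and records the predictive variance at a fixed test input after each update. The dimensions, precisions, and random inputs are hypothetical, and the helper name `sherman_morrison` is chosen only for this illustration. A numerical check of this kind suggests the answer but is no substitute for the formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, beta, lam = 3, 20, 4.0, 1.0      # hypothetical sizes and precisions
X = rng.normal(size=(N, D))            # rows are training inputs x_n
x_star = rng.normal(size=D)            # a fixed test input

def sherman_morrison(M_inv, v):
    """Return (M + v v^T)^{-1} given M^{-1}, using the matrix identity."""
    Mv = M_inv @ v
    return M_inv - np.outer(Mv, v @ M_inv) / (1.0 + v @ Mv)

Sigma = np.eye(D) / lam                # Sigma_0 = (lam I)^{-1}, the prior covariance
variances = [1.0 / beta + x_star @ Sigma @ x_star]
for n in range(N):
    v = np.sqrt(beta) * X[n]           # beta * x_n x_n^T equals v v^T
    Sigma = sherman_morrison(Sigma, v)
    variances.append(1.0 / beta + x_star @ Sigma @ x_star)

# Direct inverse, for comparison with the sequence of rank-one updates.
Sigma_direct = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))
```

Inspecting `variances` shows how $\sigma_N^2(x_*)$ behaves as $N$ grows for this particular random draw.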

Problem 3
(Distribution of Empirical Mean of Gaussian Observations) Consider $N$ scalar-valued observations $x_1, \ldots, x_N$ drawn i.i.d. from $\mathcal{N}(\mu, \sigma^2)$. Consider their empirical mean $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$. Representing the empirical mean as a linear transformation of a random variable, derive the probability distribution of $\bar{x}$. Briefly explain why the result makes intuitive sense.
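A short simulation can be used to check a derived distribution for $\bar{x}$. The sketch below writes the empirical mean as the linear map $a^\top x$ with $a = (1/N, \ldots, 1/N)$ and compares the sample mean and variance of many simulated $\bar{x}$ values with $\mu$ and $a^\top (\sigma^2 I)\, a = \sigma^2 / N$, which is what the linear-transformation property of Gaussians gives. The parameter values and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 2.0, 3.0, 10            # hypothetical parameters
trials = 100_000                       # number of simulated datasets

samples = rng.normal(mu, sigma, size=(trials, N))
# The empirical mean is the linear map a^T x with a = (1/N, ..., 1/N).
a = np.full(N, 1.0 / N)
xbar = samples @ a

emp_mean, emp_var = xbar.mean(), xbar.var()
theory_var = sigma**2 / N              # a^T (sigma^2 I) a
```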

Problem 4
(Benefits of Probabilistic Joint Modeling-1) Consider a dataset of test scores of students from $M$ schools in a district: $x = \{x^{(m)}\}_{m=1}^M = \{x_1^{(m)}, \ldots, x_{N_m}^{(m)}\}_{m=1}^M$, where $N_m$ denotes the number of students in school $m$.

Assume the scores of students in school $m$ are drawn independently as $x_n^{(m)} \sim \mathcal{N}(\mu_m, \sigma^2)$, where the Gaussian's mean $\mu_m$ is unknown and the variance $\sigma^2$ is the same for all schools and known (for simplicity). Assume the means $\mu_1, \ldots, \mu_M$ of the $M$ Gaussians to also be Gaussian distributed, $\mu_m \sim \mathcal{N}(\mu_0, \sigma_0^2)$, where $\mu_0$ and $\sigma_0^2$ are hyperparameters.

1. Assume the hyperparameters $\mu_0$ and $\sigma_0^2$ to be known. Derive the posterior distribution of $\mu_m$ and write down the mean and variance of this posterior distribution. Note: While you can derive it the usual way, the derivation will be much more compact if you use the result of Problem 2 and think of each school's data as a single observation (the empirical mean of observations) having the distribution derived in Problem 3.

2. Assume the hyperparameter $\mu_0$ to be unknown (but still keep $\sigma_0^2$ fixed, for simplicity). Derive the marginal likelihood $p(x \mid \mu_0, \sigma^2, \sigma_0^2)$ and use MLE-II to estimate $\mu_0$ (note again that $\sigma^2$ and $\sigma_0^2$ are known here). Note: Looking at the form/expression of the marginal likelihood, if the MLE-II result looks obvious to you, you may skip the derivation and directly write the result.

3. Consider using this MLE-II estimate of $\mu_0$ from part (2) in the posteriors of each $\mu_m$ you derived in part (1). Do you see any benefit in using the MLE-II estimate of $\mu_0$ as opposed to using a known value of $\mu_0$?
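The three parts can be tried out on made-up numbers. The sketch below assumes the standard conjugate results: each school's empirical mean is treated as one observation with variance $\sigma^2 / N_m$, the posterior over $\mu_m$ is Gaussian with precision $1/\sigma_0^2 + N_m/\sigma^2$, and the marginal of each empirical mean is $\mathcal{N}(\mu_0, \sigma_0^2 + \sigma^2 / N_m)$, so that the MLE-II estimate of $\mu_0$ is a precision-weighted average of the school means. These formulas should be verified against your own derivation. The function names and the data are hypothetical.

```python
import numpy as np

def posterior_mu_m(xbar_m, N_m, mu0, sigma2, sigma0_2):
    """Posterior mean and variance of mu_m given school m's empirical mean."""
    precision = 1.0 / sigma0_2 + N_m / sigma2
    mean = (mu0 / sigma0_2 + N_m * xbar_m / sigma2) / precision
    return mean, 1.0 / precision

def mle2_mu0(xbars, Ns, sigma2, sigma0_2):
    """MLE-II estimate of mu0, using xbar_m ~ N(mu0, sigma0^2 + sigma^2 / N_m)."""
    w = 1.0 / (sigma0_2 + sigma2 / np.asarray(Ns, dtype=float))
    return float(np.sum(w * np.asarray(xbars)) / np.sum(w))

# Hypothetical school means and sizes, for illustration only.
xbars = [62.0, 70.0, 75.0, 58.0]
Ns = [30, 5, 50, 12]
sigma2, sigma0_2 = 100.0, 25.0

mu0_hat = mle2_mu0(xbars, Ns, sigma2, sigma0_2)
posteriors = [posterior_mu_m(xb, n, mu0_hat, sigma2, sigma0_2)
              for xb, n in zip(xbars, Ns)]
```

Comparing each posterior mean with the corresponding raw school mean shows how much a school with few students is pulled toward the pooled estimate `mu0_hat`.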

Problem 5
(Benefits of Probabilistic Joint Modeling-2) Suppose we have student data from $M$ schools, where $N_m$ denotes the number of students in school $m$. The data for each school $m = 1, \ldots, M$ is in the following form: for student $n$ in school $m$, there is a response variable (e.g., score in some exam) $y_n^{(m)} \in \mathbb{R}$ and a feature vector $x_n^{(m)} \in \mathbb{R}^D$.

Assume a linear regression model for these scores, i.e., $p(y_n^{(m)} \mid x_n^{(m)}, w_m) = \mathcal{N}(y_n^{(m)} \mid w_m^\top x_n^{(m)}, \beta^{-1})$, where $w_m \in \mathbb{R}^D$ denotes the regression weight vector for school $m$, and $\beta$ is known. Note that this can also be denoted as $p(y^{(m)} \mid X^{(m)}, w_m) = \mathcal{N}(y^{(m)} \mid X^{(m)} w_m, \beta^{-1} I_{N_m})$, where $y^{(m)}$ is $N_m \times 1$ and $X^{(m)}$ is $N_m \times D$. Assume a prior $p(w_m) = \mathcal{N}(w_m \mid w_0, \lambda^{-1} I_D)$, with $\lambda$ known and $w_0$ unknown.

Derive the expression for the log of the MLE-II objective for estimating $w_0$. You do not need to optimize this objective w.r.t. $w_0$; just writing down the final expression of the objective function is fine. Also state the benefit of this approach, as opposed to fixing $w_0$ to some value, if our goal is to learn the school-specific weight vectors $w_1, \ldots, w_M$. (Feel free to make direct use of properties of Gaussian distributions.)
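Once an expression has been derived, it can be evaluated numerically. The sketch below assumes that integrating out $w_m$ gives the marginal $p(y^{(m)} \mid X^{(m)}, w_0) = \mathcal{N}(y^{(m)} \mid X^{(m)} w_0, \beta^{-1} I + \lambda^{-1} X^{(m)} X^{(m)\top})$, which follows from the standard linear-Gaussian marginalization rule, and sums the log densities over schools. The function name and the random data are hypothetical.

```python
import numpy as np

def log_mle2_objective(w0, Xs, ys, beta, lam):
    """Sum over schools of log N(y_m | X_m w0, beta^{-1} I + lam^{-1} X_m X_m^T)."""
    total = 0.0
    for X, y in zip(Xs, ys):
        N_m = X.shape[0]
        cov = np.eye(N_m) / beta + X @ X.T / lam   # marginal covariance of y_m
        r = y - X @ w0                             # residual around the prior mean
        _, logdet = np.linalg.slogdet(cov)
        total += -0.5 * (N_m * np.log(2 * np.pi) + logdet
                         + r @ np.linalg.solve(cov, r))
    return total

# Hypothetical data: 3 schools, D = 2 features, different numbers of students.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(n, 2)) for n in (5, 8, 3)]
ys = [rng.normal(size=X.shape[0]) for X in Xs]
value = log_mle2_objective(np.zeros(2), Xs, ys, beta=4.0, lam=1.0)
```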

Problem 6 - Programming Assignment

(Bayesian Linear Regression) Consider a toy data set consisting of 10 training examples $\{x_n, y_n\}_{n=1}^{10}$, with each input $x_n$, as well as the output $y_n$, being scalars. The data is given below.

x = [-2.23, -1.30, -0.42, 0.30, 0.33, 0.52, 0.87, 1.80, 2.74, 3.62];
y = [1.01, 0.69, -0.66, -1.34, -1.75, -0.98, 0.25, 1.57, 1.65, 1.51]

We would like to learn a Bayesian linear regression model using this data, assuming a Gaussian likelihood model for the outputs with fixed noise precision $\beta = 4$. However, instead of working with the original scalar-valued inputs, we will map each input $x$ using a degree-$k$ polynomial as $\phi_k(x) = [1, x, x^2, \ldots, x^k]^\top$. Note that, when using the mapping $\phi_k$, each original input becomes $(k+1)$-dimensional. Denote the entire set of mapped inputs as $\Phi_k(x)$, a $10 \times (k+1)$ matrix. Consider $k = 1, 2, 3$ and $4$, and learn a Bayesian linear regression model for each case. Assume the following prior on the regression weights: $p(w) = \mathcal{N}(w \mid 0, I)$ with $w \in \mathbb{R}^{k+1}$.
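The model setup can be sketched in NumPy as follows. The code builds the polynomial design matrix, computes the Gaussian posterior over the weights under the $\mathcal{N}(0, I)$ prior, draws weight samples from it, and evaluates the posterior predictive mean and standard deviation on a grid. The function names, the grid resolution, the random seed, and the choice `k = 2` are illustrative assumptions; plotting is left out and can be done with any plotting library.

```python
import numpy as np

x = np.array([-2.23, -1.30, -0.42, 0.30, 0.33, 0.52, 0.87, 1.80, 2.74, 3.62])
y = np.array([1.01, 0.69, -0.66, -1.34, -1.75, -0.98, 0.25, 1.57, 1.65, 1.51])
beta = 4.0                                   # known noise precision

def poly_features(x, k):
    """Map scalar inputs to rows [1, x, x^2, ..., x^k]."""
    return np.vander(np.atleast_1d(x), k + 1, increasing=True)

def posterior(Phi, y, beta):
    """Gaussian posterior N(mu, Sigma) over w under the prior N(0, I)."""
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.eye(Phi.shape[1]))
    mu = beta * Sigma @ Phi.T @ y
    return mu, Sigma

def predictive(x_star, mu, Sigma, k, beta):
    """Posterior predictive mean and standard deviation at the inputs x_star."""
    Phi_star = poly_features(x_star, k)
    mean = Phi_star @ mu
    # Per-row quadratic form phi^T Sigma phi, plus the noise variance 1/beta.
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_star, Sigma, Phi_star)
    return mean, np.sqrt(var)

rng = np.random.default_rng(0)
grid = np.linspace(-4, 4, 200)
k = 2                                        # repeat for k = 1, 2, 3, 4
mu, Sigma = posterior(poly_features(x, k), y, beta)
W = rng.multivariate_normal(mu, Sigma, size=10)   # 10 posterior weight samples
curves = poly_features(grid, k) @ W.T             # one sampled function per column
mean, std = predictive(grid, mu, Sigma, k, beta)
```

Each column of `curves` is one random function from the posterior, and `mean - 2 * std` and `mean + 2 * std` give the band for the predictive plot.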

1. For each k, compute the posterior of w and show a plot with 10 random functions drawn from the inferred posterior (show the functions for the input range x ∈ [-4, 4]). Also show the original training examples on the same plot to illustrate how well the functions fit the training data.

2. For each $k$, compute and plot the mean of the posterior predictive $p(y_* \mid \phi_k(x_*), \Phi_k(x), y, \beta)$ on the interval $x_* \in [-4, 4]$. On the same plot, also show the predictive posterior mean plus-and-minus two times the predictive posterior standard deviation.

3. Compute the log marginal likelihood $\log p(y \mid \Phi_k(x), \beta)$ of the training data for each of the 4 mappings $k = 1, 2, 3, 4$. Which of these 4 "models" seems to explain the data the best?

4. Using the MAP estimate $w_{MAP}$, compute the log likelihood $\log p(y \mid w_{MAP}, \Phi_k(x), \beta)$ for each $k$. Which of these 4 models seems to have the highest log likelihood? Is your answer the same as that based on the log marginal likelihood (part 3)? Which of these two criteria (highest log likelihood or highest log marginal likelihood) do you think is more reasonable to select the best model and why?

5. For your best model, suppose you could include an additional training input $x'$ (along with its output $y'$) to "improve" your learned model using this additional example. Where in the region $x \in [-4, 4]$ would you like the chosen $x'$ to be? Explain your answer briefly.

Your implementation should be in a Python notebook (and should not use an existing implementation of Bayesian linear regression from any library).
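The two model-selection criteria from parts 3 and 4 can be computed side by side. The sketch below assumes that, under the $\mathcal{N}(0, I)$ prior, the marginal likelihood is $\mathcal{N}(y \mid 0, \beta^{-1} I + \Phi_k \Phi_k^\top)$ and that the MAP estimate coincides with the posterior mean (both follow from the posterior being Gaussian). The function names are illustrative, and no conclusion about which $k$ wins is asserted here.

```python
import numpy as np

x = np.array([-2.23, -1.30, -0.42, 0.30, 0.33, 0.52, 0.87, 1.80, 2.74, 3.62])
y = np.array([1.01, 0.69, -0.66, -1.34, -1.75, -0.98, 0.25, 1.57, 1.65, 1.51])
beta = 4.0                                   # known noise precision

def poly_features(x, k):
    """Map scalar inputs to rows [1, x, x^2, ..., x^k]."""
    return np.vander(x, k + 1, increasing=True)

def log_gaussian(y, mean, cov):
    """Log density of a multivariate Gaussian N(y | mean, cov)."""
    r = y - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + r @ np.linalg.solve(cov, r))

def log_marginal_likelihood(Phi, y, beta):
    """log p(y | Phi, beta) = log N(y | 0, beta^{-1} I + Phi Phi^T)."""
    N = len(y)
    return log_gaussian(y, np.zeros(N), np.eye(N) / beta + Phi @ Phi.T)

def log_likelihood_at_map(Phi, y, beta):
    """log p(y | w_MAP, Phi, beta), with w_MAP equal to the posterior mean."""
    A = beta * Phi.T @ Phi + np.eye(Phi.shape[1])
    w_map = np.linalg.solve(A, beta * Phi.T @ y)
    return log_gaussian(y, Phi @ w_map, np.eye(len(y)) / beta)

results = {}
for k in (1, 2, 3, 4):
    Phi = poly_features(x, k)
    results[k] = (log_marginal_likelihood(Phi, y, beta),
                  log_likelihood_at_map(Phi, y, beta))
```

`results[k]` holds the pair (log marginal likelihood, log likelihood at the MAP estimate) for each degree, which can then be tabulated and compared in the notebook.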


Reviews

len2805402

2/22/2021 11:02:49 PM

This is an assignment related to statistics and Bayesian machine learning, primarily on MAP estimate, posterior, etc.
