Modelling for Data Analysis Assignment -
This assignment contains 6 questions.
1. Probabilities in Cards
Take a regular deck of cards with no jokers (13 cards per suit, 4 suits), giving 52 cards. Suppose we draw a 5-card hand, i.e. 5 cards without replacement. For each answer, write out the full calculation in R to show your working.
1.1 A special flush
What is the probability of getting a royal flush where the cards, ordered by rank, have alternating colours? Note that in a proper royal flush the cards are all one suit, but we have changed that to alternating colour. So, for example: red 10, black J, red Q, black K, red A. The order in which the cards are drawn from the pack is not considered.
1.2 No repeats
What is the probability that in the sequence of cards, as they are drawn, no rank occurs twice in a row? So ignoring the suit, the following are allowed: A, 10, 4, J, 10 or A, 10, A, 4, A, but the following are not allowed: A, A, 10, 4, A (A repeated in positions 1 and 2), A, 4, 10, 10, J (10 repeated in positions 3 and 4).
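The no-repeats probability in 1.2 can be sanity-checked by simulation before committing to the exact combinatorial calculation. The sketch below is a hedged cross-check, not the full R working the question asks for; all names are our own:

```r
# Monte-Carlo check for 1.2: probability that no rank repeats in
# consecutive draw positions. Suits are irrelevant here, so we sample
# rank values directly from the 52-card deck.
set.seed(1)
ranks  <- rep(1:13, 4)           # 52 rank values (4 copies of each rank)
n_sims <- 100000
hits <- replicate(n_sims, {
  hand <- sample(ranks, 5)       # 5 cards drawn without replacement
  all(diff(hand) != 0)           # no card's rank equals its predecessor's
})
mean(hits)                       # rough estimate to compare with the exact answer
```

An estimate in this range gives confidence that the exact expression (a product over consecutive-draw conditional probabilities) has been set up correctly.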
2. PDF and Expectations
Let X have the PDF given by a piecewise function with different negative and positive parts:
f(x) = (12/7)(1 + x)^2 for -1 < x ≤ 0
     = (12/7)(1 - x)^3 for 0 < x < 1
     = 0 otherwise
You can use Wolfram Alpha to do the definite integrals.
2.1 Plot
Draw the plot in R.
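One way to sketch this in R is to code the piecewise density as a vectorised function and plot it over a range that includes the zero regions; the function name and plotting choices below are our own:

```r
# Piecewise density from Question 2, coded with vectorised ifelse().
f <- function(x) {
  ifelse(x > -1 & x <= 0, (12/7) * (1 + x)^2,
  ifelse(x > 0  & x <  1, (12/7) * (1 - x)^3, 0))
}
x <- seq(-1.5, 1.5, by = 0.01)
plot(x, f(x), type = "l", xlab = "x", ylab = "f(x)", main = "PDF of X")
```

A quick check that the density is valid: `integrate(f, -1, 1)$value` should return 1 (up to numerical error).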
2.2 Mean
Find E(X). Why is it not zero?
2.3 Variance
Find variance, V ar(X).
2.4 Skewness
Find skewness, using the formula in the lecture notes. Interpret the value.
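The question allows Wolfram Alpha for the definite integrals; as a hedged numerical cross-check of 2.2-2.4, the moments can also be obtained in R with `integrate()` (the skewness formula used here is the standard third standardised moment, which should be checked against the lecture notes):

```r
# Numerical moments of the piecewise density from Question 2.
f <- function(x) {
  ifelse(x > -1 & x <= 0, (12/7) * (1 + x)^2,
  ifelse(x > 0  & x <  1, (12/7) * (1 - x)^3, 0))
}
EX   <- integrate(function(x) x * f(x), -1, 1)$value           # E(X)
VX   <- integrate(function(x) (x - EX)^2 * f(x), -1, 1)$value  # Var(X)
skew <- integrate(function(x) (x - EX)^3 * f(x), -1, 1)$value / VX^1.5
c(mean = EX, var = VX, skewness = skew)
```

The mean is nonzero because the two pieces of the density are not mirror images of each other, so the distribution is not symmetric about 0.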
3. Distributions
One study has evaluated a number of leukaemia records in a rural area. The population of the area was 35,000. In a year there were 16 leukaemia cases identified, of which 4 were not local residents but tourists or new immigrants (of which there are not many). In a general population, the annual rate of leukaemia is typically about one in 10,000.
3.1 Model
Describe the model you recommend to use for the counts, and estimate the parameters using suitable point estimates.
3.2 Checking
Also, consider the hypothesis "the annual rate of leukaemia in the area is 1/10,000". Assume this is the rate for the residents only. Plot the distribution over counts under this hypothesis. Where does your data lie, and do you think it is consistent with the hypothesis?
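A hedged sketch of the plot: under the hypothesised rate, resident counts would have mean 35000/10000 = 3.5, and a count model over a large population with a small individual rate is naturally plotted with `dpois()`. Treating 16 - 4 = 12 as the resident case count (as the question's residents-only assumption suggests) gives the reference line:

```r
# Distribution of annual resident leukaemia counts under the
# hypothesised rate of 1 in 10,000, for a population of 35,000.
lambda0 <- 35000 / 10000               # hypothesised mean count = 3.5
counts  <- 0:20
plot(counts, dpois(counts, lambda0), type = "h",
     xlab = "annual leukaemia cases (residents)", ylab = "probability")
abline(v = 12, col = "red", lty = 2)   # observed resident cases (16 - 4)
ppois(11, lambda0, lower.tail = FALSE) # P(X >= 12) under the hypothesis
```

The tail probability printed at the end is one way to quantify how far out in the hypothesised distribution the observed count lies.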
4. Entropy
In this question, we will use a modified version of the Titanic dataset from the Kaggle competition, Titanic: Machine Learning from Disaster. The dataset includes information about passenger characteristics as well as whether they survived the disaster.
Import the Titanic data using the following R code:
df <- read.csv("Titanic.csv",header=TRUE, sep=",")
The Survived column is coded 0/1, so convert it to a truth value with:
df[['Survived']] <- df[['Survived']]==1
4.1 Conditional probabilities
Compute tables for the frequency estimates of P(Survived), P(Survived|Pclass = val) and P(Survived|Gender = val), for the different vals. Do the computation in R, but it is OK to present the final table as a separate Word table (since it might be hard to lay out in R). What does this tell you about survival?
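The tables can be built with `table()` and `prop.table()`. Since the real Titanic.csv is in the attachment, the sketch below substitutes a tiny made-up data frame with the column names the question uses; the values are illustrative only:

```r
# Stand-in for the real Titanic data (illustrative values only).
df <- data.frame(
  Survived = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Pclass   = c(1, 3, 1, 2, 3, 3),
  Gender   = c("female", "male", "female", "male", "female", "male")
)
p_s   <- prop.table(table(df$Survived))                         # P(Survived)
p_s_c <- prop.table(table(df$Pclass, df$Survived), margin = 1)  # P(Survived | Pclass)
p_s_g <- prop.table(table(df$Gender, df$Survived), margin = 1)  # P(Survived | Gender)
```

`margin = 1` normalises each row, so each row of the conditional tables sums to 1, i.e. it is a probability distribution over Survived given that row's value of the conditioning variable.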
4.2 Entropies
Calculate the entropy (log2()) of Survived, H(Survived) and the conditional entropy of Survived given Pclass, H(Survived|Pclass), and of Survived given Gender, H(Survived|Gender). Do not use an entropy function but write the code yourself. Use R functions table() and prop.table() to gather stats and form probabilities from the data frame. What do these three entropies tell you about Survived?
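A hedged skeleton of the hand-written entropy calculation (again using a tiny made-up data frame in place of the real Titanic.csv; only the structure matters). The conditional entropy uses the standard decomposition H(Survived|Pclass) = Σ_c P(Pclass = c) H(Survived|Pclass = c), and the Gender case is identical with the other column:

```r
# Stand-in for the real Titanic data (illustrative values only).
df <- data.frame(
  Survived = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Pclass   = c(1, 3, 1, 2, 3, 3)
)

# Hand-written entropy in bits; skips zero-probability cells.
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

# H(Survived):
H <- entropy(prop.table(table(df$Survived)))

# H(Survived | Pclass) = sum over classes of P(class) * H(Survived | class):
joint <- prop.table(table(df$Pclass, df$Survived))  # joint distribution
p_c   <- rowSums(joint)                             # marginal over Pclass
cond  <- prop.table(table(df$Pclass, df$Survived), margin = 1)
Hcond <- sum(p_c * apply(cond, 1, entropy))
c(H = H, Hcond = Hcond)
```

Conditioning never increases entropy, so H(Survived|Pclass) ≤ H(Survived); how much smaller it is measures how informative Pclass is about survival.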
4.3 Coding
Consider the joint space (Survived, Pclass) which has six outcomes, (True, 1), (True, 2), (True, 3), (False, 1), (False, 2), (False, 3). Develop an efficient binary prefix code to transmit these outcomes. Would it be adequate to just provide the codelengths, or is a code needed too? Justify your answer.
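One hedged way to start 4.3 is to compute Shannon codelengths, ceiling(-log2 p), for each outcome from the empirical joint probabilities; a Huffman code built from the same probabilities does at least as well. The sketch again uses a made-up stand-in data frame:

```r
# Stand-in for the real Titanic data (illustrative values only).
df <- data.frame(
  Survived = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  Pclass   = c(1, 3, 1, 2, 3, 3)
)
joint <- prop.table(table(df$Survived, df$Pclass))  # joint over (Survived, Pclass)
p     <- joint[joint > 0]                           # observed outcomes only
lens  <- ceiling(-log2(p))                          # Shannon codelengths in bits
sum(2^(-lens))                                      # Kraft sum: <= 1 means a
                                                    # prefix code with these
                                                    # lengths exists
```

The Kraft inequality is the key fact for the "codelengths vs. actual code" part of the question: any set of lengths with Kraft sum at most 1 is realisable as a prefix code.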
5. Maximum likelihood estimation of parameters
One of the central problems of sensory neuroscience is to separate recordings of background physiological processes that are irrelevant (noise) from neural responses that are of experimental interest (signal). This is by no means an easy task, as the signals that neurons produce when they fire are extremely weak and highly variable. It is therefore of particular interest to examine the randomness of neural signals, as this allows researchers to study the brain at a cellular level.
Let's assume that we have conducted one experiment and recorded the spike signals from one particular neuron for a duration of 15 seconds. After some data processing, we can obtain spike signals with data given by a time in seconds and a spike size, similar to the following data and figure.
5.1 Model
Let us assume that the rate of signals remains constant over time, and that the size of each signal is also independent of time. If the rate of the signals remains constant over time, which distribution would be most suitable for modelling the probability distribution of the number of spike signals over 15 seconds? Why? Briefly answer this question in a sentence or two. Also, while we don't know enough to suggest a distribution for spike sizes, what properties should it have?
5.2 Maximum likelihood fitting
Using the model above, what is the log-likelihood function for number of spike signals for the period of experiment time, and what is the maximum likelihood estimate for its parameters?
You're told that a candidate distribution for spike sizes is the Weibull with shape 0.7 and unknown scale, between 0.5 and 2. This is supported in R by the [dpqr]weibull() functions. One can do maximum likelihood fitting of the Weibull density over the unknown parameter. Use the optimize() function for that, so something like:
optimize(fn, c(minvalue, maxvalue), maximum = TRUE, tol = .Machine$double.eps^0.25)
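A hedged sketch of how the pieces fit together. The real spike sizes come from the attached data file, so the sketch simulates stand-in data; the log-likelihood is the sum of `dweibull(..., log = TRUE)` terms with the shape fixed at 0.7:

```r
# Stand-in spike sizes (the real data comes from the attachment).
set.seed(42)
sizes <- rweibull(200, shape = 0.7, scale = 1.2)

# Log-likelihood of the scale parameter, shape fixed at 0.7.
loglik <- function(scale) {
  sum(dweibull(sizes, shape = 0.7, scale = scale, log = TRUE))
}

# Maximise over the stated range for the scale, 0.5 to 2.
fit <- optimize(loglik, c(0.5, 2), maximum = TRUE,
                tol = .Machine$double.eps^0.25)
fit$maximum   # maximum likelihood estimate of the scale
```

Using `log = TRUE` and summing is numerically safer than taking the log of a product of densities.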
6. Central Limit Theorem
Assume that we draw random integers from a Poisson distribution with rate λ equal to one of λ1 = 1, λ2 = 5, or λ3 = 20.
6.1 Sampling distribution
According to the Central Limit Theorem, what are the mean and standard deviation of the sample mean, for the three rates λ1, λ2, λ3, when we have a sample size of 10, 100, 1000 and 10000? Give the theory, then compute the values in R.
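Since a Poisson(λ) population has mean λ and variance λ, the theory gives mean λ and standard deviation sqrt(λ/n) for the sample mean. A sketch of the full grid of values (variable names are our own):

```r
# Theoretical mean and SD of the sample mean for Poisson(lambda):
# E(xbar) = lambda, SD(xbar) = sqrt(lambda / n).
lambdas <- c(1, 5, 20)
ns      <- c(10, 100, 1000, 10000)
grid <- expand.grid(lambda = lambdas, n = ns)
grid$mean_xbar <- grid$lambda
grid$sd_xbar   <- sqrt(grid$lambda / grid$n)
grid   # one row per (lambda, n) combination
```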
6.2 Simulation
Experimentally justify the result in the CLT that says the sample mean has a mean given by the population mean and a variance given by the population variance divided by the sample size. See the CLT Theorem in Lecture 4. Use simulation with sample sizes of 10, 100 and 1000. For each sample size use 50000 simulations to generate samples and their means. From these means compute the mean and variance of the sample means, and discuss how the results reflect the CLT. Plot the results (3 sample sizes and 3 rates with mean and SD) to demonstrate any effects you want to discuss.
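A minimal sketch for a single (λ, n) pair; the assignment loops this over all three rates and all three sample sizes:

```r
# Simulate 50000 sample means for Poisson(lambda) samples of size n,
# then compare their mean and variance with the CLT predictions.
set.seed(1)
lambda <- 5; n <- 100; n_sims <- 50000
means <- replicate(n_sims, mean(rpois(n, lambda)))
c(mean_of_means = mean(means),   # CLT: should be close to lambda
  var_of_means  = var(means),    # CLT: should be close to lambda / n
  theory_var    = lambda / n)
```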
6.3 Plotting normality
When rate λ1 = 1 and λ2 = 5 and sample size is 10 or 100, obtain the z scores of the sampling means (from 50000 simulations). Plot the distributions in a histogram with the theoretical Gaussian curve overlaid. Note for sample size 100, the plots overlay very nicely. But what happens with sample size 10? Explain the differences between the four plots.
For each simulation, the z score of the mean can be calculated as:
z = (X̄ - µ) / (σ/√n),
where X̄ is the mean of the sample, µ is the population mean and σ is the population SD.
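A sketch for one of the four panels (λ = 1, sample size 10); the other three combinations follow by changing `lambda` and `n`. For Poisson(λ), µ = λ and σ = sqrt(λ):

```r
# z scores of 50000 simulated sample means, with the standard
# normal density overlaid for comparison.
set.seed(1)
lambda <- 1; n <- 10; n_sims <- 50000
means <- replicate(n_sims, mean(rpois(n, lambda)))
z <- (means - lambda) / (sqrt(lambda) / sqrt(n))   # z = (xbar - mu)/(sigma/sqrt(n))
hist(z, breaks = 50, freq = FALSE,
     main = "z scores: lambda = 1, n = 10")
curve(dnorm(x), add = TRUE, col = "red")           # theoretical Gaussian
```

With small n and small λ the sample mean takes only a few discrete values, so the histogram is lumpy and skewed relative to the Gaussian; this is the effect the question asks you to explain.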