Reference no: EM132271449
Project - Statistics
In this project we will use the statistics-based commands available in MATLAB. A MATLAB data file called "StatsData.mat" is available on the website; you will need to download it.
This project involves writing programs that perform statistical analysis of data to establish which data sets are related and which are not. We will also run an experiment that demonstrates the concept of a Confidence Interval.
1. Regression and Correlation:
a) Using MATLAB, compute the Linear Regression (LR) parameters and the correlation coefficient (CC) between each of the first four rows and the fifth row of "StatsData.mat". In other words, compute the LR parameters and CC between row 1 and row 5, then between row 2 and row 5, between row 3 and row 5, and finally between row 4 and row 5.
b) Select the two rows with the highest CC from the independent rows (row 1, row 2, row 3 and row 4). Then perform a multivariate regression for the data in the file "StatsData.mat". The fifth row should be the dependent variable and the two selected rows are to be treated as the independent variables. Be sure to compute the Coefficient of Determination (CD) and then compute its square root, which has the same magnitude as the CC.
The selection of the two rows does not need to be done in software; it can be "hard coded" into the program.
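As a rough illustration of part a), the sketch below fits a straight line and computes the CC for each independent row against the dependent row. It runs on synthetic stand-in data, because the variable name stored inside the data file is not stated here; the names `X`, `y`, `slope`, `intercept` and `cc` are placeholders chosen for this example only.

```matlab
% Sketch only: synthetic stand-in data, NOT the contents of the assignment file.
rng(0);                                   % fixed seed so the sketch is repeatable
N = 500;
X = randn(2, N);                          % two hypothetical independent rows
y = 3*X(1,:) + 1 + 0.5*randn(1, N);       % hypothetical dependent row, related to row 1 only

nRows     = size(X, 1);
slope     = zeros(1, nRows);
intercept = zeros(1, nRows);
cc        = zeros(1, nRows);
for k = 1:nRows
    p = polyfit(X(k,:), y, 1);            % first-order fit: p(1) = slope, p(2) = intercept
    R = corrcoef(X(k,:), y);              % 2x2 correlation matrix
    slope(k)     = p(1);
    intercept(k) = p(2);
    cc(k)        = R(1, 2);               % off-diagonal entry is the CC
    fprintf('row %d: slope = %.3f, intercept = %.3f, CC = %.3f\n', ...
            k, slope(k), intercept(k), cc(k));
end
```

With the real data, `X(k,:)` would be replaced by rows 1 to 4 of the loaded matrix and `y` by row 5.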
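One possible way to set up the two-variable regression in part b) is an ordinary least-squares solve with the backslash operator, sketched below on synthetic data. The rows `x1` and `x2`, the dependent row `y`, and the coefficients used to generate them are invented for the example and say nothing about the real data.

```matlab
% Sketch only: synthetic stand-in data, NOT the contents of the assignment file.
rng(0);
N  = 500;
x1 = randn(1, N);                          % hypothetical first selected row
x2 = randn(1, N);                          % hypothetical second selected row
y  = 2*x1 - x2 + 0.5 + 0.3*randn(1, N);    % hypothetical dependent row

A    = [ones(N, 1), x1(:), x2(:)];         % design matrix with an intercept column
b    = A \ y(:);                           % least-squares coefficients [b0; b1; b2]
yhat = A * b;                              % fitted values

SSres = sum((y(:) - yhat).^2);             % residual sum of squares
SStot = sum((y(:) - mean(y)).^2);          % total sum of squares about the mean
CD    = 1 - SSres / SStot;                 % Coefficient of Determination
ccEquivalent = sqrt(CD);                   % comparable in magnitude to a CC

fprintf('b0 = %.3f, b1 = %.3f, b2 = %.3f, CD = %.4f, sqrt(CD) = %.4f\n', ...
        b(1), b(2), b(3), CD, ccEquivalent);
```

The backslash solve needs only base MATLAB. If the Statistics Toolbox is available, `regress` would be an alternative.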
c) Based on the CC computed from the 5 cases of regression analysis performed, what can be said about the data?
What data or terms appear to be related to the dependent variable and which are not related?
Is this consistent with the results of the case where two rows were used for regression?
2. Histograms, PDF's and Confidence Intervals:
a) Assuming that the mean and variance of the entire row are essentially the same as the parameters of the hidden process, compute the 95% Confidence Interval (CI) for each row in "StatsData.mat", assuming a sample size of 64 (note that sqrt(64) = 8).
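For a 64-point average, the 95% CI under a Gaussian assumption is the row mean plus or minus 1.96 times the row standard deviation divided by sqrt(64) = 8. A minimal sketch for one row follows; the row is synthetic and the variable names are placeholders.

```matlab
% Sketch only: one synthetic row standing in for a row of the assignment data.
rng(0);
row = 5 + 2*randn(1, 64000);        % hypothetical row

mu    = mean(row);                  % treated as the process mean
sigma = std(row);                   % treated as the process standard deviation
n     = 64;                         % sample size of each short average
z     = 1.96;                       % two-sided 95% point of the standard normal

halfWidth = z * sigma / sqrt(n);    % sqrt(64) = 8
ci = [mu - halfWidth, mu + halfWidth];

fprintf('95%% CI for a %d-point average: [%.4f, %.4f]\n', n, ci(1), ci(2));
```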
b) Produce a histogram of the data in each row of the matrix, setting the number of bins to the square root of the number of samples. Use the histograms to plot an estimate of the probability density of each row. Also plot the matching Gaussian distribution for the row, based on the mean and variance of the entire row.
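A histogram becomes a density estimate once the bin counts are divided by the total count times the bin width, so that the bars enclose unit area. The sketch below shows one way to do this and to overlay the matching Gaussian. It uses a synthetic row and rounds the bin count to the nearest integer; both choices are assumptions made for the example.

```matlab
% Sketch only: one synthetic row standing in for a row of the assignment data.
rng(0);
row = 5 + 2*randn(1, 64000);                   % hypothetical row

nBins = round(sqrt(numel(row)));               % bins = sqrt(number of samples), rounded
[counts, centers] = hist(row, nBins);          % equal-width bins; centers = bin centers
binWidth = centers(2) - centers(1);
pdfEst   = counts / (sum(counts) * binWidth);  % normalize so the area under the bars is 1

mu = mean(row);
s2 = var(row);
gaussPdf = exp(-(centers - mu).^2 / (2*s2)) / sqrt(2*pi*s2);  % Gaussian PDF at bin centers

bar(centers, pdfEst, 1);                       % histogram-based density estimate
hold on;
plot(centers, gaussPdf, 'r', 'LineWidth', 1.5);
hold off;
legend('Histogram estimate', 'Gaussian PDF');
xlabel('Value');
ylabel('Probability density');
```

`hist` is used here because it returns bin centers directly; in newer MATLAB releases `histcounts` or `histogram` would be alternatives.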
c) Then compute the average of each 64-point subsection of each row. There will be 1000 of these 64-point subsections in each row. These will be referred to hereafter as the Short Interval Averages (SIA's). Compute the mean and variance of the 1000 SIA's and compare them to the predicted mean and variance for a 64-point average. The mean of the SIA's should be the same as the mean of the row, while the variance should be the variance of the row divided by the length of the short intervals (64).
d) Count the number of times the SIA's fall inside the 95% CI bounds for each row. Convert this count to an estimate of probability and compare it to 95%. How well do they match, noting that some of the distributions are not Gaussian?
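The SIA's can be formed without an explicit loop by reshaping a row into a 64-by-1000 matrix, so that each column holds one run of 64 consecutive samples, and then averaging down the columns. The sketch below does this for a synthetic row; the names `L`, `K`, `segments` and `sia` are placeholders.

```matlab
% Sketch only: one synthetic row standing in for a row of the assignment data.
rng(0);
L = 64;                               % length of each short interval
K = 1000;                             % number of short intervals per row
row = 5 + 2*randn(1, L*K);            % hypothetical row of 64000 samples

segments = reshape(row, L, K);        % column k holds samples (k-1)*L+1 to k*L
sia      = mean(segments, 1);         % 1-by-K vector of Short Interval Averages

siaMean = mean(sia);
siaVar  = var(sia);
predictedMean = mean(row);            % predicted mean of a 64-point average
predictedVar  = var(row) / L;         % predicted variance of a 64-point average

fprintf('mean: SIA = %.4f, predicted = %.4f\n', siaMean, predictedMean);
fprintf('var:  SIA = %.5f, predicted = %.5f\n', siaVar, predictedVar);
```

Because every subsection has the same length, the mean of the SIA's matches the row mean up to rounding error, while the variances agree only approximately.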
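The counting step in part d) might look like the sketch below. It deliberately uses a uniformly distributed synthetic row to mimic a non-Gaussian case; the row and all variable names are assumptions made for illustration.

```matlab
% Sketch only: a synthetic uniform row standing in for a non-Gaussian row.
rng(0);
L = 64;                                        % short-interval length
K = 1000;                                      % number of short intervals
row = rand(1, L*K);                            % hypothetical non-Gaussian (uniform) row

mu    = mean(row);
sigma = std(row);
halfWidth = 1.96 * sigma / sqrt(L);            % 95% half-width for a 64-point average
ci = [mu - halfWidth, mu + halfWidth];

sia     = mean(reshape(row, L, K), 1);         % the K Short Interval Averages
nInside = sum(sia >= ci(1) & sia <= ci(2));    % how many SIA's fall inside the CI
pInside = nInside / K;                         % estimated coverage probability

fprintf('%d of %d SIA''s inside the CI (estimate %.3f, target 0.95)\n', ...
        nInside, K, pInside);
```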
e) Finally, compute the Squared-Sum-Difference (SSD) between the histogram estimate and the Gaussian PDF. The formula is given here.
$$\mathrm{SSD} = \sum_{n=1}^{N} \left( HE_n - PDF_n \right)^2$$
where $HE_n$ is the Histogram Estimate at bin $n$, $PDF_n$ is the PDF at bin $n$, and $N$ is the number of bins.
Based on the histogram plots, and the SSD, what type of distribution is each row, and can the SSD be used as a measure of how Gaussian a set of data is?
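A possible implementation of the SSD sums the squared differences between the normalized histogram and the Gaussian PDF, both evaluated at the bin centers. The sketch below applies it to two synthetic rows, one Gaussian and one uniform. These rows are invented for illustration and are not meant to represent the rows in the data file.

```matlab
% Sketch only: two synthetic rows standing in for rows of the assignment data.
rng(0);
M = 64000;
rows = [5 + 2*randn(1, M);                     % hypothetical Gaussian row
        10*rand(1, M)];                        % hypothetical uniform row

ssd = zeros(1, size(rows, 1));
for k = 1:size(rows, 1)
    x = rows(k, :);
    nBins = round(sqrt(numel(x)));             % bins = sqrt(number of samples), rounded
    [counts, centers] = hist(x, nBins);
    binWidth = centers(2) - centers(1);
    he = counts / (sum(counts) * binWidth);    % HE_n: histogram density estimate
    mu = mean(x);
    s2 = var(x);
    pdfG = exp(-(centers - mu).^2 / (2*s2)) / sqrt(2*pi*s2);  % PDF_n at bin centers
    ssd(k) = sum((he - pdfG).^2);              % Squared-Sum-Difference over all bins
    fprintf('row %d: SSD = %.5f\n', k, ssd(k));
end
```

Because the sum runs over bins without a bin-width factor, the value depends on how the data are binned, which is worth keeping in mind when comparing rows.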
Attachment:- Assignment Files.rar