Reference no: EM132393122
Assignment -
Learning Outcomes:
1. Identify challenges in data analytics: be able to critically evaluate and select appropriate solutions.
2. Demonstrate an understanding of the core methods and algorithms used in data analytics.
3. Analyse and manipulate data sets to extract statistics and features and provide analytic insights.
4. Critically evaluate, select and employ appropriate tools, technologies and data models to provide answers to analytics questions.
All answers require a rigorous explanation to support the findings.
PART 1 - Demonstrate the need for smooth functions
The air quality data: The data set airquality is one of the data frames available in R within the standard package datasets. It contains daily air quality measurements in New York from May to September 1973.
R data file: airquality in package datasets, with 153 observations on 6 variables
Ozone: in ppb
Solar.R: in langleys (lang)
Wind: in mph
Temp: in F
Month: Month (1-12)
Day: Day of month (1-31)
(a) Here we will use Ozone as the response variable and Solar.R, Wind and Temp as explanatory variables. (We will not consider Month and Day.) The data can be plotted using:
data(airquality)
plot(airquality[-c(5,6)])
Comment on the plot.
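As a numeric companion to the scatterplot matrix, the following minimal sketch counts the missing values and prints pairwise correlations on the complete rows. The object name aq is an assumption made here for illustration.

```r
data(airquality)
# Keep the response and the three explanatory variables only
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
# Number of missing values in each column
na_counts <- colSums(is.na(aq))
print(na_counts)
# Pairwise correlations computed on rows without any NA
print(round(cor(aq, use = "complete.obs"), 2))
```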
(b) Fit a standard regression model (i.e., with a normal distribution and constant variance) using the function lm().
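One possible way to set up this fit is sketched below. The model name m_lm is hypothetical. Note that lm() silently drops rows containing NA by default, so the fit uses fewer rows than the data frame holds.

```r
data(airquality)
# Linear model with normal errors and constant variance;
# rows with missing values are dropped by the default na.action
m_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
print(coef(m_lm))
# Number of observations actually used in the fit
print(nobs(m_lm))
```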
(c) Extract the fit summary (using summary()) and discuss the coefficients and their standard errors. Use the function termplot() and comment on the term plot.
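A minimal sketch of this step, assuming the same hypothetical model name m_lm as a stand-in for the fitted lm object. The partial.resid and se arguments are optional choices made here, not requirements of the exercise.

```r
data(airquality)
m_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
s <- summary(m_lm)
# Matrix of estimates, standard errors, t values and p-values
print(s$coefficients)
# One panel per explanatory variable, with partial residuals and SE bands
op <- par(mfrow = c(1, 3))
termplot(m_lm, partial.resid = TRUE, se = TRUE)
par(op)
```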
(d) Check and comment on the residuals using plot().
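The residual check can be sketched as follows. The 2 x 2 layout is a presentation choice assumed here so that the four default diagnostic plots of plot() for an lm object appear on one page.

```r
data(airquality)
m_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
# Four default diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(m_lm)
par(op)
# Raw residuals, one per complete row used in the fit
res <- residuals(m_lm)
```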
(e) Fit the same model using the gamlss() function, but note that the data set airquality has some missing observations (i.e. NA values). The gamlss() function does not work with NAs, so the missing values need to be removed before fitting the model.
(f) Summarize the fitted gamlss model using summary(). Plot the fitted terms using the corresponding function for gamlss, called term.plot().
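A minimal sketch of removing the missing values and refitting, assuming the gamlss package is installed. The names aq and m_no are hypothetical, and na.omit() is one of several ways to drop incomplete rows.

```r
library(gamlss)
data(airquality)
# Drop Month and Day, then remove every row that contains an NA
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
# Normal distribution (NO) with constant sigma, matching the lm() fit
m_no <- gamlss(Ozone ~ Solar.R + Wind + Temp, data = aq,
               family = NO, trace = FALSE)
print(coef(m_no))  # coefficients of the mu predictor
```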
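The summary and term plot for the gamlss fit might look like the sketch below. It assumes the gamlss package is installed and refits the model under the hypothetical name m_no so that the block runs on its own.

```r
library(gamlss)
data(airquality)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
m_no <- gamlss(Ozone ~ Solar.R + Wind + Temp, data = aq,
               family = NO, trace = FALSE)
# Coefficient tables for mu and sigma, plus global deviance and AIC
summary(m_no)
# Fitted terms of the mu predictor, all on one page
term.plot(m_no, pages = 1, partial.resid = TRUE)
```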
(g) Check the residuals using the plot() and wp() functions.
(h) Comment on the worm plot. Note the warning message that some points are missed out of the worm plot. Increase the limits in the vertical axis by using the argument ylim.all = 2 in wp().
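Steps (g) and (h) can be sketched together as follows, assuming the gamlss package is installed and using the hypothetical model name m_no. The first wp() call may warn that some points fall outside the plotting region; the second widens the vertical axis as the exercise suggests.

```r
library(gamlss)
data(airquality)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
m_no <- gamlss(Ozone ~ Solar.R + Wind + Temp, data = aq,
               family = NO, trace = FALSE)
# Four-panel residual diagnostics for a gamlss object
plot(m_no)
# Worm plot with default limits, then with a wider vertical axis
wp(m_no)
wp(m_no, ylim.all = 2)
```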
(i) Since the fitted normal distribution does not seem to be adequate, try fitting different distributions (e.g. gamma (GA), inverse Gaussian (IG) and Box-Cox Cole and Green (BCCGo)) to the data. Compare them with the normal distribution using GAIC with penalty k = 2 (i.e. AIC).
(j) Has the model improved according to the AIC? Use term.plot() output to see the fitted smooth functions for the predictor of μ for your chosen distribution. Use plot() and wp() output to check the residuals.
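One way to organise the comparison is sketched below, assuming the gamlss package is installed. All model names (m_no, m_ga, m_ig, m_bccg) and the formula object f are hypothetical. GAIC() lists the models from smallest to largest criterion value, so the first row is the preferred one under k = 2.

```r
library(gamlss)
data(airquality)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
f <- Ozone ~ Solar.R + Wind + Temp
# Same mu predictor, different response distributions
m_no   <- gamlss(f, data = aq, family = NO,    trace = FALSE)
m_ga   <- gamlss(f, data = aq, family = GA,    trace = FALSE)
m_ig   <- gamlss(f, data = aq, family = IG,    trace = FALSE)
m_bccg <- gamlss(f, data = aq, family = BCCGo, trace = FALSE)
# Penalty k = 2 corresponds to the AIC
aic_tab <- GAIC(m_no, m_ga, m_ig, m_bccg, k = 2)
print(aic_tab)
```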
PART 2 - Modelling the shape and scale parameters
The abdom data contain fetal abdominal circumference measurements (y) recorded against gestational age in weeks (x). Fit different response distributions and choose the 'best' model according to the GAIC criterion.
(a) Load the abdom data and print the variable names.
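A minimal sketch of this step, assuming the gamlss package is installed (the abdom data frame comes from gamlss.data, which is loaded along with gamlss).

```r
library(gamlss)
data(abdom)
# Variable names and dimensions of the data frame
print(names(abdom))
print(dim(abdom))
```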
(b) Fit the normal distribution model, using pb() to fit P-spline smoothers for the predictors of μ and σ with automatic selection of the smoothing parameters.
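The normal model with P-spline smoothers in both predictors might be set up as in the sketch below, assuming the gamlss package is installed. The model name m_no is hypothetical. With its default settings, pb() estimates the smoothing parameter automatically.

```r
library(gamlss)
data(abdom)
# P-spline smoothers for both mu and sigma
m_no <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
               data = abdom, family = NO, trace = FALSE)
# Effective degrees of freedom used by each parameter
print(edfAll(m_no))
```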
(c) Try fitting alternative distributions:
a. two-parameter distributions: GA, IG, GU, RG, LO,
b. three-parameter distributions: PE, TF, BCCG,
c. four-parameter distributions: BCT, BCPE.
(d) Apply pb() to all parameters of each distribution. Make sure to use different model names.
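The pattern for steps (c) and (d) can be sketched with one distribution from each group, assuming the gamlss package is installed. The model names m_ga, m_tf and m_bct are hypothetical; the remaining distributions follow the same pattern with a different family argument. Convergence is not guaranteed for every distribution, so warnings from the fitting algorithm should be inspected rather than ignored.

```r
library(gamlss)
data(abdom)
# Two parameters (mu, sigma): gamma
m_ga <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
               data = abdom, family = GA, trace = FALSE)
# Three parameters (mu, sigma, nu): t family
m_tf <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x), nu.formula = ~ pb(x),
               data = abdom, family = TF, trace = FALSE)
# Four parameters (mu, sigma, nu, tau): Box-Cox t
m_bct <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x), nu.formula = ~ pb(x),
                tau.formula = ~ pb(x),
                data = abdom, family = BCT, trace = FALSE)
```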
(e) Compare the fitted models using GAIC with each of the penalties k=2, k=3 and k=log(length(abdom$y)).
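A minimal sketch of the comparison under the three penalties, assuming the gamlss package is installed. Only three models are refitted here to keep the example short, and their names (m_no, m_lo, m_tf) are hypothetical; the full exercise would pass every fitted model to GAIC().

```r
library(gamlss)
data(abdom)
m_no <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
               data = abdom, family = NO, trace = FALSE)
m_lo <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
               data = abdom, family = LO, trace = FALSE)
m_tf <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x), nu.formula = ~ pb(x),
               data = abdom, family = TF, trace = FALSE)
# k = 2 is the AIC; k = log(n) is the BIC (SBC)
gaic_2   <- GAIC(m_no, m_lo, m_tf, k = 2)
gaic_3   <- GAIC(m_no, m_lo, m_tf, k = 3)
gaic_bic <- GAIC(m_no, m_lo, m_tf, k = log(length(abdom$y)))
print(gaic_2); print(gaic_3); print(gaic_bic)
```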
(f) Check the residuals for your chosen model, say m, by plot(m) and wp(m).
(g) For a chosen model, say m, look at the total effective degrees of freedom using edfAll(m), plot the fitted parameters using fittedPlot(m, x=abdom$x), and plot the data using plot() and add the fitted μ against x using lines().
(h) For a chosen model, examine the centile curves using centiles().
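Steps (f) and (g) can be sketched together as follows, assuming the gamlss package is installed. The logistic (LO) model is used purely as a placeholder for the chosen model m; it is not claimed to be the best one. The data are ordered by x before calling lines() so that the fitted curve is drawn left to right.

```r
library(gamlss)
data(abdom)
# Placeholder for the chosen model (an assumption for illustration)
m <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
            data = abdom, family = LO, trace = FALSE)
# Residual diagnostics and worm plot
plot(m)
wp(m)
# Effective degrees of freedom for every distribution parameter
print(edfAll(m))
# Fitted mu and sigma plotted against x
fittedPlot(m, x = abdom$x)
# Data with the fitted mu curve overlaid
plot(y ~ x, data = abdom, col = "grey")
o <- order(abdom$x)
lines(abdom$x[o], fitted(m)[o], lwd = 2)
```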
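The centile curves might be produced as in the sketch below, assuming the gamlss package is installed. The LO model again stands in for the chosen model m, and the cent values are an illustrative choice. centiles() also prints the percentage of observations falling below each curve, which can be compared with the nominal centiles.

```r
library(gamlss)
data(abdom)
# Placeholder for the chosen model (an assumption for illustration)
m <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
            data = abdom, family = LO, trace = FALSE)
# Centile curves of y against x at the selected percentages
centiles(m, xvar = abdom$x, cent = c(3, 10, 25, 50, 75, 90, 97))
```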