Question:
(a) (i) Define the term multicollinearity.
(ii) Explain why it is important to guard against multicollinearity.
(b) (i) Sometimes we encounter missing values in databases with a large number of fields. A common method of handling missing values is simply to omit from the analysis the records or fields with missing values. Explain why this may be dangerous.
(ii) Data analysts have turned to methods that would replace the missing value with a value substituted according to various criteria. Briefly give a choice of three possible replacement values for missing data.
(c) Variables tend to have ranges that vary greatly from each other. Data miners should normalise the numerical variables to standardise the scale of effect each variable has on the results. Name two techniques for normalisation and differentiate between each one of them.
(d) The usual measure used to evaluate estimation and prediction models is the mean square error (MSE). Write down the expression for the MSE.
(e) (i) Explain briefly the term measures of variability. (ii) Give four examples of typical measures of variability.