Principal Component Analysis
TASK
For your video presentation, you must demonstrate your PCA analysis on the continuous features of the WACY-COM dataset and interpret the results. Submit the recording via the Panopto link on Canvas. Please ensure you follow the instructions carefully.
Perform PCA and Visualise Data
First, copy the code below into an R script. Enter your student ID into the command set.seed(.) and run the whole code. The code will create a sub-sample of 400 observations that is unique to you.
(i) Extract only the continuous features and the APT feature from the WACY-COM dataset and store them as a data frame/tibble. Refer to Assignment 1 for the feature description if needed.
(ii) Clean the extracted data based on the feedback received from Assignment 1.
(iii) Remove incomplete cases so that the data can be used for PCA in R.
(iv) Perform PCA using prcomp(.) in R, but only on the numeric features (i.e. ignore APT in this step).
Explain why you believe the data should or should not be scaled, i.e. standardised, when performing PCA.
Display and describe the individual and cumulative proportions of variance (3 decimal places) explained by each of the principal components.
Outline how many principal components are adequate to explain at least 50% of the variability in your data.
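As a rough illustration of the PCA steps above, the sketch below runs the same workflow on R's built-in iris data. This is an assumption for demonstration only: iris stands in for your WACY-COM sub-sample, the two missing values are injected artificially, and the object names (dat, num_feats, var_tab, n_pc) are hypothetical. Whether scale. = TRUE is appropriate for your own data is part of what you are asked to justify.

```r
# Stand-in data: iris replaces the WACY-COM sub-sample (assumption for illustration).
dat <- iris
dat[c(3, 50), "Sepal.Length"] <- NA   # inject two missing values to show the clean-up step

# Keep only rows with no missing values, then keep only the numeric columns
dat_complete <- dat[complete.cases(dat), ]
num_feats <- dat_complete[, sapply(dat_complete, is.numeric)]

# scale. = TRUE standardises each feature to unit variance before PCA
mypca <- prcomp(num_feats, center = TRUE, scale. = TRUE)

# Rows: standard deviation, proportion of variance, cumulative proportion
var_tab <- round(summary(mypca)$importance, 3)
print(var_tab)

# Smallest number of PCs whose cumulative proportion reaches at least 50%
cum_prop <- summary(mypca)$importance["Cumulative Proportion", ]
n_pc <- unname(which(cum_prop >= 0.5)[1])
print(n_pc)
```

With your own data, replace dat with the data frame you extracted and cleaned, and make sure the APT column is excluded before calling prcomp(.).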
(v) Display and interpret the coefficients (or loadings) to 3 decimal places for PC1, PC2 and PC3. Describe which features (based on the loadings) are the key drivers for each of these three principal components.
(vi) Create and display the biplot for PC1 vs. PC2 to visualise the PCA results in the first two dimensions. Colour-code the points based on the APT feature. Explain the biplot by commenting on the PCA plot and the loadings plot individually, and then both plots combined (see Slides 28-29 of Module 3 notes). Finally, comment on and justify which (if any) features can help distinguish APT activity.
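One possible way to build a colour-coded biplot in base R is sketched below. It is not the method from the Module 3 notes; it simply overlays the PC1 and PC2 scores with loading arrows. The iris data and its Species column are stand-ins (assumptions) for your sub-sample and the APT feature, and arrow_scale is an arbitrary stretch factor chosen only to make the arrows readable.

```r
# Stand-in data: iris numeric columns replace your features, Species replaces APT (assumptions).
mypca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
grp <- iris$Species

scores <- mypca$x[, 1:2]          # PC1 and PC2 scores (the "PCA plot" part)
loads <- mypca$rotation[, 1:2]    # PC1 and PC2 loadings (the "loadings plot" part)
arrow_scale <- 2                  # arbitrary stretch so arrows are visible next to the scores

# One colour per class level
palette_cols <- c("steelblue", "darkorange", "forestgreen")
cols <- palette_cols[as.integer(grp)]

plot(scores, col = cols, pch = 19, xlab = "PC1", ylab = "PC2",
     main = "Biplot: PC1 vs PC2 (stand-in data)")
abline(h = 0, v = 0, lty = 3)

# Loading arrows from the origin, labelled with the feature names
arrows(0, 0, loads[, 1] * arrow_scale, loads[, 2] * arrow_scale, length = 0.08)
text(loads[, 1] * arrow_scale * 1.15, loads[, 2] * arrow_scale * 1.15,
     labels = rownames(loads), cex = 0.8)
legend("topright", legend = levels(grp), col = palette_cols, pch = 19)
```

Features whose arrows point along the direction in which the coloured groups separate are the natural candidates to discuss when commenting on what distinguishes the classes.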
(vii) Based on the results from parts (v) and (vi), describe whether PC1 or PC2 (choose one) best assists in classifying APT. Hint: Project all points in the PCA plot onto the PC1 axis (i.e. consider the PC1 scores only) and assess whether there is a clear separation between known and unknown APT actors. Then, project onto the PC2 axis (i.e. consider the PC2 scores only) and evaluate whether the separation is better than in PC1. You can access the PCA scores for PC1 and PC2 via mypca$x, assuming mypca contains your PCA results from prcomp(.).
Then describe the key features in this dimension that can drive this process. Hint: based on your decision above, examine the loadings of your chosen PC from part (v) and choose those whose absolute loading (i.e. disregarding the sign) is greater than 0.3.
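The two hints above can be sketched as follows: display the loadings to 3 decimal places, compare the PC1 and PC2 scores by class, then filter the loadings of the chosen component at the 0.3 threshold. As before, iris and Species are stand-ins (assumptions) for your sub-sample and APT, and chosen_pc is a hypothetical choice that you would set after inspecting your own scores.

```r
# Stand-in data: iris numeric columns replace your features, Species replaces APT (assumptions).
mypca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
grp <- iris$Species

# Loadings for the first three PCs, rounded to 3 decimal places
loadings3 <- round(mypca$rotation[, 1:3], 3)
print(loadings3)

# Projecting onto one axis means looking at that PC's scores on their own
scores <- data.frame(PC1 = mypca$x[, "PC1"], PC2 = mypca$x[, "PC2"], group = grp)
print(aggregate(cbind(PC1, PC2) ~ group, data = scores, FUN = median))

# Side-by-side boxplots give a quick visual check of separation on each axis
boxplot(PC1 ~ group, data = scores, main = "PC1 scores by class")
boxplot(PC2 ~ group, data = scores, main = "PC2 scores by class")

# Hypothetical choice: set this to the PC that separates your classes best
chosen_pc <- "PC1"
chosen_loadings <- mypca$rotation[, chosen_pc]

# Key features: absolute loading greater than 0.3 (the sign is disregarded)
key_feats <- names(chosen_loadings)[abs(chosen_loadings) > 0.3]
print(key_feats)
```

The sign of a loading can flip between runs or platforms without changing the interpretation, which is why the filter uses the absolute value.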
Video Presentation Checklist
In your video presentation, you must:
Run your code corresponding to parts (i) to (vii) above
Display the relevant output
Interpret the output
Your video presentation must include a camera shot of yourself in the video capture, unless there is an exceptional reason that is supported by a Learning Assessment Plan (LAP). 20% is automatically deducted from your final mark if this is not included in your video presentation. If you choose to record with another application, you must make sure that this feature is included.
Your video presentation must be between 4 and 5 minutes long.