Reference no: EM132370935
Data Science Practice Assignment
Assignment Task - This assignment consists of two deliverables:
- One code implementation - The code file in Jupyter Notebook format and the relevant data set files.
- A report - The report must be uploaded as a separate file.
Part I - PySpark source code
Important Note: For reproducibility, your code must be self-contained. That is, it should not require any libraries other than the PySpark environment we have used in the workshops. The data files must be packaged with your code file.
In this component, we need to utilise Python 3 and PySpark to complete the following data analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification
4. Clustering
You need to choose a dataset from Kaggle to complete these tasks. Remember to include the data set file in your source code submission.
Note: In your notebook, please use a Heading 1 Markdown cell to separate each subtask.
Task I.1: Exploratory data analysis
This subtask requires you to explore your dataset by
- reporting its number of rows and columns,
- cleaning the data (missing values or duplicated records) if necessary,
- selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, boxplot, etc.) for each to summarise it.
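As a rough illustration of the first two bullets, the sketch below counts rows and columns and removes incomplete or duplicated records. It is written in plain Python on a small made-up sample so that the logic is easy to follow; in the notebook the same steps would be carried out on a PySpark DataFrame, and the comments name the usual DataFrame methods. The sample records and the helper name `clean_rows` are assumptions for illustration only.

```python
# Made-up sample records (an assumption); None marks a missing value.
rows = [
    {"user": 1, "item": 10, "rating": 4.0},
    {"user": 1, "item": 10, "rating": 4.0},   # exact duplicate
    {"user": 2, "item": 11, "rating": None},  # missing rating
    {"user": 3, "item": 12, "rating": 2.5},
]


def clean_rows(records):
    """Drop records with missing values, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if any(value is None for value in record.values()):
            continue  # PySpark DataFrame analogue: df.dropna()
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # PySpark DataFrame analogue: df.dropDuplicates()
        seen.add(key)
        cleaned.append(record)
    return cleaned


n_rows = len(rows)     # PySpark DataFrame analogue: df.count()
n_cols = len(rows[0])  # PySpark DataFrame analogue: len(df.columns)
cleaned = clean_rows(rows)
print(n_rows, n_cols, len(cleaned))
```

Reporting the row count both before and after cleaning makes it clear how many records were removed and why.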
Task I.2: Recommendation engine
This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include
- Model training and predictions
- Model evaluation using MSE
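To make the evaluation step concrete, the following minimal sketch shows the arithmetic behind MSE: the average squared difference between actual and predicted ratings over the (user, item) pairs that have both. It is plain Python on made-up numbers, not a trained ALS model; the function name `mean_squared_error` and the sample ratings are assumptions. In the notebook the same figure would typically be obtained by keying the ratings and the model's predictions on (user, item), joining them, and averaging the squared differences.

```python
def mean_squared_error(actual, predicted):
    """MSE over the (user, item) keys present in both dictionaries."""
    shared = actual.keys() & predicted.keys()
    if not shared:
        raise ValueError("no (user, item) pairs in common")
    return sum((actual[k] - predicted[k]) ** 2 for k in shared) / len(shared)


# Made-up ratings keyed by (user, item); assumptions for illustration.
actual = {(1, 10): 4.0, (1, 11): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 11): 3.0, (2, 10): 4.0}

mse = mean_squared_error(actual, predicted)
print(round(mse, 4))
```

A lower MSE means the predicted ratings sit closer to the observed ones; computing it on held-out ratings rather than the training ratings gives a fairer picture.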
Task I.3: Classification
This subtask requires you to implement a classification system using logistic regression with the LogisticRegressionWithLBFGS class. You need to include
- Logistic Regression model training
- Model evaluation
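The brief does not prescribe an evaluation metric for the classifier, so the sketch below assumes a binary problem and reports accuracy together with the confusion counts. It works on (label, prediction) pairs, which is the shape usually produced by pairing each test point's label with the model's prediction. The function name `evaluate_binary` and the sample pairs are hypothetical, and the code is plain Python rather than a PySpark job.

```python
def evaluate_binary(label_and_pred):
    """Return accuracy and confusion counts for (label, prediction) pairs."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, pred in label_and_pred:
        if label == 1 and pred == 1:
            counts["tp"] += 1
        elif label == 0 and pred == 0:
            counts["tn"] += 1
        elif label == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return accuracy, counts


# Made-up (label, prediction) pairs; assumptions for illustration.
pairs = [(1, 1), (0, 0), (1, 0), (0, 0), (0, 1), (1, 1)]
accuracy, counts = evaluate_binary(pairs)
print(round(accuracy, 3), counts)
```

Reporting the confusion counts alongside accuracy helps when the classes are imbalanced, since accuracy alone can look good for a model that always predicts the majority class.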
Task I.4: Clustering
This subtask requires you to implement a clustering system using K-means. You need to include
- Model training
- Model evaluation
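The brief leaves the clustering metric open. One common choice, assumed here, is the Within Set Sum of Squared Errors (WSSSE): the sum, over all points, of the squared distance from each point to its nearest cluster centre. The sketch below computes it in plain Python for made-up points and centres; the function names and data are assumptions, and the centres are given directly rather than learnt.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centres):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(squared_distance(p, c) for c in centres) for p in points)


# Made-up 2-D points and two cluster centres; assumptions for illustration.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centres = [(0.0, 1.0), (10.0, 11.0)]

cost = wssse(points, centres)
print(cost)
```

Because WSSSE always falls as the number of clusters grows, it is usually compared across several values of k to look for the point where the improvement levels off.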
Part II - Report
You are required to write a report to explain your design and implementation of the machine learning parts in your code, including the following topics:
- An introduction to, or summary of, the ML algorithms and concepts used
- The learning settings, such as how the training and testing sets are prepared, what the key parameters are, and how you set them
- Comments on and evaluation of the models learnt
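As an example of the kind of learning setting worth documenting, the sketch below shows a seeded train/test split, so that the same partition can be reproduced when the notebook is rerun. It is a plain-Python analogue only: the function name `train_test_split`, the 80/20 ratio and the seed value are assumptions. PySpark RDDs and DataFrames provide `randomSplit`, which takes weights and a seed but returns splits of approximately, not exactly, the requested sizes.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle with a fixed seed and cut into training and testing lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten made-up record ids; an assumption for illustration.
train, test = train_test_split(range(10))
print(len(train), len(test))
```

Stating the split ratio and the seed in the report lets a reader reproduce the reported evaluation figures.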
Your report should use the following template:
Table of Contents
1.0 Introduction
Explain the data set you've chosen, including its source URL. Demonstrate your exploratory data analysis in this section.
2.0 Machine learning implementation
2.1 Collaborative filtering
2.2 Logistic regression
2.3 K-Means
3.0 Conclusion
References
Assignment Advice - This assignment will take several weeks to complete and will require a good understanding of machine learning and PySpark for successful completion. It is imperative that students take heed of the following points in relation to doing this assignment:
1. Ensure that you clearly understand the requirements for the assignment - what must be done and what are the deliverables.
2. If you do not understand any of the assignment requirements, please ask your tutor.
3. Each time you work on any aspect of the assignment, reread the assignment requirements to ensure that what is required is clearly understood.
4. We have practiced nearly all coding tasks in DataCamp before. If you have any difficulty, redoing the practices in DataCamp is recommended.
5. Prior to submitting your code, you should ensure not only that it executes as required, but also that it looks professional. It is expected that you adhere to Python standards for naming and indenting. All methods should be adequately documented such that another programmer examining your code will readily know what the code is doing.