Analyse given dataset using IPython notebook

Reference no: EM132294041

ASSESSMENT TASK ONE - DATA EXPLORATION: WINE RATING DATA

Task Description

We provide one IPython notebook, SIT742Task1.ipynb, together with two data files in the data subfolder:

wine.json This JSON file contains the wine ratings and reviews from WineEnthusiast.
stopwords.txt This text file contains the most common English stop words.

You are required to develop a data exploration report using IPython notebook to complete the following two sub-tasks.

Numeric and Categorical Value Analysis

For a data scientist, the first and most crucial task after obtaining a dataset is to gain a good understanding of the data he or she is dealing with. This includes examining the data attributes (or equivalently, data fields), seeing what they look like, identifying the data type of each field, and, from this information, determining suitable numerical/visual descriptions.

The first task is to read the JSON file into a Pandas DataFrame and delete the rows that contain invalid values in the "points" and "price" attributes.
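The loading and cleaning step can be sketched as follows. This is only a minimal sketch: it assumes that "invalid" means missing or non-numeric, and that a row is dropped when either field is invalid. The function name load_clean_wine and the sample rows are hypothetical; in the notebook the source would be the path to wine.json in the data subfolder.

```python
import io

import pandas as pd


def load_clean_wine(source):
    """Read wine reviews from JSON and drop rows without a valid points or price."""
    df = pd.read_json(source)
    # Assumption: non-numeric entries count as invalid, so coerce them to NaN
    # and drop them together with the genuinely missing values.
    for col in ("points", "price"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=["points", "price"]).reset_index(drop=True)


# Tiny in-memory stand-in for wine.json (hypothetical rows).
sample = io.StringIO(
    '[{"variety": "Shiraz", "points": 90, "price": 18.0},'
    ' {"variety": "Merlot", "points": 88, "price": null},'
    ' {"variety": "Riesling", "points": null, "price": 25.0}]'
)
clean = load_clean_wine(sample)
print(clean)
```

Only the Shiraz row survives here, because the other two rows each lack one of the two fields.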

Then, you need to answer the following two questions in your IPython notebook based on this dataset:

(1) What are the 10 varieties of wine that receive the highest number of reviews?

(2) Which varieties of wine have an average price below 20 and average points of at least 90? Assume there are no duplicate reviews in the data, i.e., each row represents a unique wine.
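One possible way to approach both questions with Pandas is sketched below. The column names "variety", "points" and "price" are assumed to match the JSON fields, and the helper names and sample rows are hypothetical.

```python
import pandas as pd


def top_varieties(df, n=10):
    """Return the n varieties with the most reviews (one row = one review)."""
    return df["variety"].value_counts().head(n)


def good_value_varieties(df, max_price=20, min_points=90):
    """Varieties whose average price is below max_price and average points >= min_points."""
    stats = df.groupby("variety")[["price", "points"]].mean()
    mask = (stats["price"] < max_price) & (stats["points"] >= min_points)
    return sorted(stats.index[mask])


# Hypothetical sample standing in for the cleaned wine data.
reviews = pd.DataFrame({
    "variety": ["Shiraz", "Shiraz", "Merlot", "Riesling"],
    "points": [91, 89, 92, 95],
    "price": [15.0, 19.0, 30.0, 12.0],
})
print(top_varieties(reviews, n=2))
print(good_value_varieties(reviews))
```

In this sample Merlot is excluded because its average price is 30, while Shiraz (average price 17, average points 90) and Riesling both qualify.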

In addition, you need to group all reviews by country, generate a statistics table, and save it as a csv file named "statisticByState.csv". The table must have four columns:
Country - listing the unique country name.
Variety - listing the varieties receiving the most reviews in that country.
AvgPoint - listing the average points (rounded to 2 decimal places) of wine in that country.
AvgPrice - listing the average price (rounded to 2 decimal places) of wine in that country.

Based on this table, which country/countries would you recommend Hotel TULIP to source wine from? Please state your reasons.
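The statistics table could be built along these lines. This is a sketch that assumes a "country" column exists in the data and that a single most-reviewed variety per country is enough (ties are broken arbitrarily); the function name stats_by_country and the sample rows are hypothetical.

```python
import pandas as pd


def stats_by_country(df):
    """Build the Country / Variety / AvgPoint / AvgPrice table."""
    grouped = df.groupby("country")
    table = pd.DataFrame({
        # Most-reviewed variety per country (assumption: one winner per country).
        "Variety": grouped["variety"].agg(lambda s: s.value_counts().idxmax()),
        "AvgPoint": grouped["points"].mean().round(2),
        "AvgPrice": grouped["price"].mean().round(2),
    })
    return table.rename_axis("Country").reset_index()


# Hypothetical sample standing in for the cleaned wine data.
reviews = pd.DataFrame({
    "country": ["Australia", "Australia", "Australia", "France"],
    "variety": ["Shiraz", "Shiraz", "Riesling", "Merlot"],
    "points": [90, 91, 88, 92],
    "price": [20.0, 25.0, 15.0, 40.0],
})
table = stats_by_country(reviews)
print(table)
```

Calling table.to_csv("statisticByState.csv", index=False) would then write the file in the required four-column layout.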

Text analysis

In this task, you are required to write Python code to extract keywords from the "description" column of the JSON data. These keywords will be used to redesign the wine menu for Hotel TULIP.

You need to generate two txt files:
HighFreq.txt This file contains the frequent unigrams that appear in more than 5000 reviews (one row in the dataframe is one review).

Shirazkey.txt This file contains the key unigrams with a tf-idf score higher than 0.4. To reduce the runtime, first extract the descriptions of the variety "Shiraz", and then calculate the tf-idf scores for the unigrams in these descriptions only.
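The tf-idf step for Shirazkey.txt might look like the sketch below. The task does not fix a particular tf-idf formula, so this sketch assumes one common variant (raw term frequency, smoothed idf, L2-normalised per description) and treats a unigram as a keyword if its score exceeds the threshold in at least one description. The function name tfidf_keywords, the tokenisation regex and the sample texts are hypothetical, and stop-word removal is assumed to have been applied beforehand. In the notebook the input would be the descriptions of the rows whose variety is "Shiraz".

```python
import math
import re
from collections import Counter


def tfidf_keywords(descriptions, threshold=0.4):
    """Return sorted unigrams whose tf-idf exceeds threshold in any description."""
    docs = [Counter(re.findall(r"[a-z']+", d.lower())) for d in descriptions]
    n_docs = len(docs)
    # Document frequency: number of descriptions containing each unigram.
    doc_freq = Counter(term for doc in docs for term in doc)
    keywords = set()
    for doc in docs:
        # Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1).
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in doc.items()
        }
        # L2-normalise so scores lie between 0 and 1 within each description.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        keywords.update(t for t, w in weights.items() if w / norm > threshold)
    return sorted(keywords)


# Hypothetical Shiraz descriptions.
shiraz = ["plum plum pepper", "pepper oak"]
print(tfidf_keywords(shiraz, threshold=0.6))
```

Other tf-idf definitions give different scores, so the 0.4 cut-off should be checked against whichever implementation the unit materials use.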

In both txt files, all unigrams are sorted alphabetically and saved one per line without duplicates. Before you calculate the unigram frequency or tf-idf, you need to remove the stop words from all descriptions, using either the provided "stopwords.txt" or a stop-word list built into a Python library.
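Stop-word removal and the document-frequency count for HighFreq.txt can be sketched as follows. The sketch assumes that stopwords.txt holds one word per line and that a simple lowercase regex is an acceptable tokeniser; the function names and sample texts are hypothetical. Each unigram is counted at most once per review, which matches "appear in more than 5000 reviews".

```python
import re
from collections import Counter


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (e.g. an open file)."""
    return {line.strip().lower() for line in lines if line.strip()}


def frequent_unigrams(descriptions, stop_words, min_reviews=5000):
    """Sorted unigrams that occur in more than min_reviews descriptions."""
    doc_freq = Counter()
    for text in descriptions:
        # A set ensures each unigram is counted once per review.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        doc_freq.update(tokens - stop_words)
    return sorted(word for word, count in doc_freq.items() if count > min_reviews)


# Hypothetical stand-ins for stopwords.txt and the description column.
stop_words = load_stop_words(["and\n", "with\n", "a\n", "of\n"])
descriptions = [
    "Ripe cherry and oak",
    "Cherry with a hint of spice",
    "Oak and vanilla",
]
high_freq = frequent_unigrams(descriptions, stop_words, min_reviews=1)
print(high_freq)
```

Writing "\n".join(high_freq) to HighFreq.txt then gives one unigram per line, already sorted and free of duplicates.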

*ASSESSMENT TASK TWO - DATA ANALYTICS: BANK MARKETING

Task Description

We provide one IPython notebook, SIT742Task2.ipynb, together with a csv file, bank.csv, in the data subfolder. You are required to analyse this dataset using IPython notebook with the Spark packages covered in SIT742, including spark.sql and pyspark.ml.

Table 2.1: Attribute information of the dataset

Attribute   Meaning
age         age of the customer
job         type of job
marital     marital status
education   education level
default     has credit in default?
balance     the balance of the customer
housing     has housing loan?
loan        has personal loan?
contact     contact communication type
day         last contact day of the week
month       last contact month of year
duration    last contact duration, in seconds
campaign    number of contacts performed
pdays       number of days that passed by after a previous campaign
previous    number of contacts performed before this campaign
poutcome    outcome of the previous marketing campaign
deposit     has the client subscribed a term deposit?

IPython Notebook

To systematically investigate this dataset, your IPython notebook should follow these six basic steps:

(1) Import the csv file, "bank.csv", as a Spark dataframe named df, then check and explore its individual attributes.

(2) Select important attributes from df as a new dataframe df2 for further investigation. You are required to select 13 important attributes from df: 'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'pdays', 'previous', 'poutcome' and 'deposit'.

(3) Remove all invalid rows in the dataframe df2 using spark.sql. A row is considered invalid if at least one of its attributes contains 'unknown'. For the attribute 'poutcome', the only valid values are 'failure' and 'success'.

(4) Convert all categorical attributes in df2 to numerical attributes using one-hot encoding, then apply Min-Max normalisation to each attribute.

(5) Perform unsupervised learning on df2 including k-means and PCA. For k-means, you can use the whole df2 as both training and testing data, and evaluate the clustering result using Accuracy. For PCA, you can generate a scatter plot using the first two components to investigate the data distribution.

(6) Perform supervised learning on df2 including Logistic Regression, Decision Tree and Naive Bayes. For the three classification methods, you can use 70% of df2 as the training data and the remaining 30% as the testing data, and evaluate their prediction performance using Accuracy.
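Two of the steps above involve small calculations that are easy to get wrong: Min-Max normalisation in step (4) and evaluating k-means with Accuracy in step (5). The sketch below illustrates both in plain Python only to show the arithmetic; it is not the pyspark.ml implementation the task asks for. It assumes that clustering accuracy is computed by mapping each cluster to its majority class label, which is one common convention and not something the task specifies. The function names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def clustering_accuracy(cluster_ids, labels):
    """Accuracy after assigning every cluster its majority true label."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)
    # Rows matching their cluster's majority label count as correct.
    correct = sum(
        Counter(group).most_common(1)[0][1] for group in members.values()
    )
    return correct / len(labels)


# Hypothetical values: three ages, and five rows with cluster ids and 'deposit' labels.
scaled_age = min_max_scale([18, 30, 60])
accuracy = clustering_accuracy([0, 0, 1, 1, 1], ["yes", "yes", "no", "no", "yes"])
print(scaled_age, accuracy)
```

With two clusters and a binary 'deposit' label, this majority mapping is equivalent to trying both possible cluster-to-label assignments and keeping the better one, provided the two clusters do not share the same majority label.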

Case Study Report
Based on your IPython notebook results, you are required to write a case study report of 500-1000 words, which should include the following information:

(1) The data attribute distribution

(2) The methods/algorithms you used for data wrangling and processing

(3) The performance of both unsupervised and supervised learning on the data

(4) The important features which affect the objective ('yes' in 'deposit') [Hint: you can refer to the coefficients generated by the Logistic Regression]

(5) Discuss the possible reasons for obtaining these analysis results and how to improve them

(6) Describe the group activities, such as the task distribution for group members and what you have learnt during this project.

*Note: Only need ASSESSMENT TASK TWO

Attachment:- Modern Data Science.rar


Reviews

len2294041

4/26/2019 3:37:05 AM

SIT742Task2.ipynb Your IPython notebook solution source file for the data exploration of the bank marketing data. You can fill in your group information at the relevant place in the first markdown cell. Please follow the PEP 8 guidelines (Section 3.1) for source code style.
Report.pdf A 500-1000 word report describing and discussing your analysis results.
No Special Consideration will be granted for this project. Students who have difficulty meeting the deadline because of illness, etc. must apply for an assignment extension no later than noon on the day prior to the deadline.

len2294041

4/26/2019 3:36:46 AM

Please familiarise yourself with the General Requirements (see Section 0.2) on Assignments Submission. By the due date, you are required to submit the following files to the corresponding Assignment (Dropbox) in CloudDeakin:
SIT742Task1.ipynb Your IPython notebook solution source file for the data exploration of the wine rating data. You can fill in your name and Deakin ID at the relevant place in the first markdown cell. Please follow the PEP 8 guidelines (Section 3.1) for source code style.

len2294041

4/26/2019 3:36:11 AM

Hi, there is a Python project assignment that needs to be done. All the requirements and details are given in the assignment pdf. Please go through it carefully and save the files accordingly. Assessment 2, "BANK MARKETING", is the one that needs to be done. I need this assignment by the 5th.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd