Write python code to integrate several datasets

Assignment Help Other Subject
Reference no: EM132796859

 FIT5196 Data wrangling - Monash University

For this assessment, you are required to write Python code to integrate several datasets into one single schema and find and fix possible problems in the data. Input and output of this assessment are shown below:

Table 1. The input and output of the task

 

Inputs

Output

Jupyter notebook and pdf

shopingcenters.html, real_state.json, hospitals.xlsx, supermarkets.pdf, real_state.xml, Vic_suburb_boundary.zip,

GTFS_Melbourne_Train_Infor mation.zip

<student_no>_A3_solution.csv

<student_no>_ass3.ipynb

<student_no>_ass3.pdf

Task 1: Data Integration
In this task, you are required to integrate the input datasets (i.e., 7 datasets including hospitals, supermarkets, shopping centers, real estate files (one XML and one Json), Vic_suburb_boundary, and GTFS_Melbourne_Train_Information files) into one dataset with the following schema.

Table 2. Description of the final schema

 

COLUMN

DESCRIPTION

Property_id

A unique id for the property

lat

The property latitude

lng

The property longitude

 

addr_street

The property address

suburb (15%)

The property suburb. Default value: "not available"

price

The property price

property_type

The type of the property

year

Year of sold

bedrooms

Number of bedrooms

bathrooms

Number of bathrooms

parking_space

The number of parking space of the property

Shopping_center_id (5%)

The closest shopping centre to the property. Default value: "not available"

Distance_to_sc (5%)

The Haversine Distance (3 decimal places with +-0.001 tolerance) from the closest shopping centre to the property. Default value: 0

Train_station_id (10%)

The closest train station to the property. Default value: 0

Distance_to_train_st ation (5%)

The Haversine Distance (3 decimal places with +-0.001 tolerance) from the closest train station to the property. Default value: 0

travel_min_to_CBD (15%)

The average travel time (minutes) from the closest train station to the "Flinders street" station on weekdays (i.e. Monday-Friday) departing between 7 to 9 am. For example, if there are 3 trip departing from the closest train station to the Flinders street station on weekdays between 7-9am and each take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3. If there are any direct transfers between the closest station and Flinders street station, only the average of direct transfers should be calculated. Default value: 0

Hospital_id (5%)

The closest hospital to the property. Default value: "not available"

 

Distance_to_hospital (5%)

The Haversine Distance (3 decimal places with +-0.001 tolerance) from the closest hospital to the property. Default value: 0

Supermarket_id (5%)

The closest supermarket to the property. Default value: "not available"

Distance_to_superm aket (5%)

The Haversine Distance (3 decimal places with +-0.001 tolerance) from the closest supermarket to the property. Default value: 0

Over_ave_price (15%)

The Boolean value to the property. Write a True value if the price is higher than the average house price of its suburb. Otherwise, the value should be False. The average house price of each property can be calculated by averaging the total house price of its suburb. Default value: 0

Task 2: Documentation

The main focus of the documentation would be on the quality of your explanation on task 1 but similar to the previous assignments, your notebook file should be in a decent format with proper sections and subsections.

Note 1: the output csv file must have the exact same columns as specified on the schema. Please note that the output files which are not in a correct format, as specified in the integrated schema, won't be marked.

Note 2: if you decide not to calculate any of the required columns, then you must have that column in your final dataframe with the ‘default value' as the value of all the rows. Please note that the output files which are not in a correct format, as specified in the integrated schema, won't be marked.

Note 3: No external data is allowed to calculate the values of the integrated schema. For example, to calculate the suburb, you can only use the shape files.

Note 4: the radius of the earth is still 6378 km!

Note 5: In table 2, numbers in front of some of the columns in the format of (a%) are the allocated mark associated with that column. For example, column "suburb" carries 15% of the total output mark of task 1. Also, please note that we are aware that the summation of percentages is 90%. The other 10% goes to the issue(s) that may appear during data integration tasks and you should find and resolve them.

Attachment:- Data wrangling.rar

Reference no: EM132796859

Questions Cloud

How much interest revenue will machine gun kelly company : How much interest revenue will Machine Gun Kelly Company report on its year end December 31, 2022 income statement relative to this note?
Prepare all necessary consolidation adjustment entries : Prepare all necessary consolidation adjustment entries needed to prepare the consolidated financial statements as at 30 June 2019
Find the change in the? manufacturer total revenue : A? manufacturer's marginal-revenue, If r is in? dollars, find the change in the? manufacturer's total revenue if production is increased from 400 to 800 units
How the flipped classroom idea can be used in conjunction : Discuss how the flipped classroom idea can be used in conjunction with CCSS (Math or English Language Arts). Describe ways you could incorporate technology.
Write python code to integrate several datasets : Write Python code to integrate several datasets into one single schema and find and fix possible problems in the data. Input and output of this assessment
Determine the pre-tax cost for a bond : Determine the pre-tax cost for a bond that sells at $ 925 par value and pays a coupon of $ 85 for 20 years. The flotation costs are $ 5 per bond.
What amount of goodwill should be recognized : Britain Corporation acquires all of English, Inc. for $800,000 cash. What amount of goodwill should be recognized on the date of the acquisition
How you will use findings in future professional practice : Choose a reading and writing concept with a strategy that aligns to what students in your field experience classroom are currently learning.
Calculate the cost of preferred capital : The cost of issuance (flotation cost) is $ 8 per share. Calculate the cost of preferred capital. You must show the calculations to receive a score for answer

Reviews

len2796859

2/13/2021 12:42:57 AM

I need this assignment made before 20th of this month. I need the assignment without any scope of getting caught in plagiarism.

Write a Review

Other Subject Questions & Answers

  Cross-cultural opportunities and conflicts in canada

Short Paper on Cross-cultural Opportunities and Conflicts in Canada.

  Sociology theory questions

Sociology are very fundamental in nature. Role strain and role constraint speak about the duties and responsibilities of the roles of people in society or in a group. A short theory about Darwin and Moths is also answered.

  A book review on unfaithful angels

This review will help the reader understand the social work profession through different concepts giving the glimpse of why the social work profession might have drifted away from its original purpose of serving the poor.

  Disorder paper: schizophrenia

Schizophrenia does not really have just one single cause. It is a possibility that this disorder could be inherited but not all doctors are sure.

  Individual assignment: two models handout and rubric

Individual Assignment : Two Models Handout and Rubric,    This paper will allow you to understand and evaluate two vastly different organizational models and to effectively communicate their differences.

  Developing strategic intent for toyota

The following report includes the description about the organization, its strategies, industry analysis in which it operates and its position in the industry.

  Gasoline powered passenger vehicles

In this study, we examine how gasoline price volatility and income of the consumers impacts consumer's demand for gasoline.

  An aspect of poverty in canada

Economics thesis undergrad 4th year paper to write. it should be about 22 pages in length, literature review, economic analysis and then data or cost benefit analysis.

  Ngn customer satisfaction qos indicator for 3g services

The paper aims to highlight the global trends in countries and regions where 3G has already been introduced and propose an implementation plan to the telecom operators of developing countries.

  Prepare a power point presentation

Prepare the power point presentation for the case: Santa Fe Independent School District

  Information literacy is important in this environment

Information literacy is critically important in this contemporary environment

  Associative property of multiplication

Write a definition for associative property of multiplication.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd