BUS5DWR Data Wrangling And R Assignment

Assignment Help Applied Statistics
Reference no: EM132393553

BUS5DWR Data Wrangling And R Assignment - La Trobe University, Australia

The purpose of this assignment is to develop and assess your skills in R programming including summarising, wrangling and plotting data. Using the tidyverse package is recommended but not compulsory. Please read through the entire assignment and understand the submission format and marking rubrics before starting.

Part 1 -

The spreadsheet titled 'sports_and_recreation.xlsx' details the location, types of sport played as well as the condition, age and other details about sports and recreational facilities in Victoria. You will see that it is far from being ready for analysis and needs to be 'wrangled'. Additionally a few errors have been deliberately introduced so these will need to be corrected by initial analysis.

1. Explain why the data in its current form is not considered to be in 'tidy' format.

2. Write R code to read in the data, manipulate it and output the result to a single csv file having the following header row. Each row should provide the details for a single sport within a sports and recreation facility (ignore cases where the sport is unspecified).

facility_ID,facility_name,street_no,street_name,street_type,suburb_town,postcode,LGA,latitude,long itude,sports_played,number_field_courts,field_surface_type,facility_age,facility_condition,facility_up grade_age.

Your code will have the following sections (not necessarily in the order given and the process may be iterative as you find more things to do). Please include comments in the code to separate each segment and explain your steps.

a) Read the data into a dataframe (or tibble).

b) Observe the layout of the data and describe any issues you encounter in terms of missing or duplicated data. Make the necessary modification of rows for consistency.

c) Write a function that takes in a facility ID and outputs a dataframe with one or more rows, where each row is specific to the details of a single sport within a facility.

d) Apply the function to each facility name and then combine the rows to form a single dataframe.

e) Split the (latitude, longitude) attribute into two separate columns.

f) Split the address attribute into its components as listed above (note that not all addresses are complete so there may be more than one case to consider).

i. Note that the street number may have two words such as "Lot 1A" but a street number is assumed to have at least one digit character.

ii. Note that the street type may have two words such as "Street North".

iii. Assume the suburb/town attribute is the part of the address in capitals before the postcode.

g) Include the above header row for the dataframe.

h) Do a summary of the dataframe to look for unusual values, then correct them until satisfied with the result.

i) Sort the dataframe by facility_ID, then sports_played.

j) Write the result to a csv file.

3. How many facilities offer softball with at least 5 fields?

Part 2 -

The online hospitality company Airbnb has made publicly available a number of datasets. This part of the assignment makes use of the listings.csv dataset.

It consists of a number of parameters related to properties available for lodging in the Melbourne metropolitan area.

Write R code to answer the following.

1. Find the average price of a listing by room type.

2. How many listings contain the following combinations of words (upper or lower case or mixed, either order) in the name column?

a. Spacious, cosy

b. Vibrant, bright

3. How many listings had last review date in the past 14 days up to and including January 6 2018?

4. How many host names (not host ids) are mentioned between 10 and 20 times (inclusive) in the listings?

5. List the five top neighbourhoods sorted in decreasing order by average number of reviews. In a third column show the average value of availability_365 for that neighbourhood.

6. Write a function that has listing id as input and calculates the (direct) distance in kilometres between the location of the listing and the Melbourne CBD with latitude/longitude coordinates (-37.8136o, 144.9631o). Look up and use the haversine formula to perform this using an earth radius of 6370km (you may use an R package). Hence determine the number of listings that are between 5 and 10km from the Melbourne CBD.

7. Suppose somebody wants to choose a listing based on the following criteria. Write one or more functions that inputs a listing id and calculates a score that is the sum of points as below:

a. Points for the location: 50 × (10 minus haversine formula distance in kilometres to the nearest table tennis facility in km) but not less than zero (this uses the sports_played and location attributes in the cleaned dataset from Part 1).

b. Points for price: (300 minus price) but not less than zero.

c. Points for the room type: 100 for Entire home/apt, 50 for Private room, 0 for Shared room.

d. Points for availability: (availability_365) divided by 5.

e. Points by popularity based on the number of reviews: 100 if at least in the top quartile, 0 if less than the median value, 50 otherwise.

Which id has the highest score according to the above system?

Part 3 -

Write a short report describing the two (processed) datasets from parts 1 and 2 through tables and plots with R including the following:

  • A summary of the variables using the summarytools package. Include at least two observations you feel are worth noting.
  • A histogram showing the distribution of the distance of listings from the Melbourne CBD (using the result from Part 2). Use a bin width of 1km.
  • A single map combining a subset of the data from each of the two datasets (e.g. it could be in a single LGA). You will need to use an R package that can map geospatial data.
  • Choose one sport and do a visualisation of your choice including one or more of the following variables from the dataset of part 2.

NumberFieldCourts FieldSurfaceType FacilityAge FacilityCondition FacilityUpgradeAge

Point out any interesting patterns (e.g. trends) you see from your plots.

Attachment:- Data Wrangling And R Assignment Files.rar

Reference no: EM132393553

Questions Cloud

Find the present value of 180-day non-interest-bearing note : Find the present value of 180-day non-interest-bearing note for $7,200 issued February 20, 2020, if money is worth 5.64%, on June 1, 2020.
What is the advantage of financial leverage : What is the advantage of financial leverage, the degree to which a firm or individual uses borrowed money to make money?
What are the equations for profit margin : What are the equations for: profit margin, current ratio, quick ratio, return on assets, debt to assets receivables turnover, Avg. collection period, inventory
Discuss the current political values : Discuss the current political values that are shaping the budget process?
BUS5DWR Data Wrangling And R Assignment : BUS5DWR Data Wrangling And R Assignment Help and Solution, La Trobe University, Australia.
What is the value of the firm stock : If the required return is 11 percent and the company just paid a dividend of $0.97, what is the value of the firm's stock?
Acquisition of new wrapping machine : Herky Foods is considering acquisition of new wrapping machine. By purchasing the machine, Herky will save money on packaging in each of the next 5 years
Withheld for federal income taxes : Anabella Cunningham is 24 years old and single, lives in an apartment, and has no dependents. Last year she earned $45,000 as a sales representative
Ptoject net present value : The annual operating cash flow is $38,519. The cost of capital is 19 percent. What is the project's net present value if the tax rate is 8 percent?

Reviews

Write a Review

Applied Statistics Questions & Answers

  Hypothesis testing

What assumptions about the number of pedestrians passing the location in an hour are necessary for your hypothesis test to be valid?

  Calculate the maximum reduction in the standard deviation

Calculate the maximum reduction in the standard deviation

  Calculate the expected value, variance, and standard deviati

Calculate the expected value, variance, and standard deviation of the total income

  Determine the impact of social media use on student learning

Research paper examines determine the impact of social media use on student learning.

  Unemployment survey

Find a statistics study on Unemployment and explain the five-step process of the study.

  Statistical studies

Locate the original poll, summarize the poling procedure (background on how information was gathered), the sample surveyed.

  Evaluate the expected value of the total number of sales

Evaluate the expected value of the total number of sales

  Statistic project

Identify sample, population, sampling frame (if applicable), and response rate (if applicable). Describe sampling technique (if applicable) or experimental design

  Simple data analysis and comparison

Write a report on simple data analysis and comparison.

  Analyze the processed data in statistical survey

Analyze the processed data in Statistical survey.

  What is the probability

Find the probability of given case.

  Frequency distribution

Accepting Manipulation or Manipulating

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd