Write a short report summarising the variables

Reference no: EM132324905

DATA WRANGLING AND R Assignment

The purpose of this assignment is to develop and assess your skills in R programming including wrangling, summarising and plotting data. Using the tidyverse package is recommended but not compulsory. Please read through the entire assignment and understand the submission format and marking rubrics before starting.

Part 1 - The spreadsheet titled 'censusdata.xlsx' contains information about the number of bedrooms in occupied private dwellings for local government areas in Melbourne for the years 2011 and 2016. You will see that it is far from ready for analysis and needs to be 'wrangled'. Additionally, a few errors have been deliberately introduced into the first two columns, so these will need to be identified through initial analysis and corrected.

1. Explain why the data in its current form is not considered to be in 'tidy' format.

2. Write R code to read in the data (readxl package), manipulate it and output it to a single csv file having the following header row.

region,year,br_count_0,br_count_1,br_count_2,br_count_3,br_count_4_or_more,br_count_unstated,av_per_dwelling,av_per_household

Your code will have the following sections (not necessarily in the order given and the process may be iterative as you find more things to do). Please include comments in the code to separate each segment and explain your steps.

Read the data sets into two data frames, df2011 and df2016.

Compare the layout of each of the two data frames, then remove appropriate rows of one data frame to match the format of the other.

Write a function that takes in a table of the original form and outputs a table in the desired form with columns specified above.

  • Remove unwanted rows or columns.
  • Split values into multiple columns to make them atomic.
  • Appropriately transform the data into the desired form.
  • Rename columns.
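The four steps above can be sketched on a toy table. The real layout of 'censusdata.xlsx' is not shown here, so the raw layout below (a packed `label` column plus a `count` column), the category labels and the function name `tidy_bedrooms` are all assumptions made for illustration. Base R is used so the sketch runs without extra packages; the same steps could be written with the tidyverse.

```r
# Hypothetical raw layout (an assumption, not the real spreadsheet):
# region and bedroom category are packed into a single label column.
raw <- data.frame(
  label = c("Banyule: None", "Banyule: 1 bedroom",
            "Bayside: None", "Bayside: 1 bedroom", NA),
  count = c(78, 1287, 40, 900, NA),
  stringsAsFactors = FALSE
)

tidy_bedrooms <- function(tbl, year) {
  # 1. Remove unwanted rows (here: rows without a label).
  tbl <- tbl[!is.na(tbl$label), ]

  # 2. Split the packed label into atomic region and category values.
  parts <- strsplit(tbl$label, ": ", fixed = TRUE)
  tbl$region <- vapply(parts, `[`, character(1), 1)
  tbl$category <- vapply(parts, `[`, character(1), 2)

  # 3. Transform: one row per region, one column per category.
  wide <- tapply(tbl$count, list(tbl$region, tbl$category), sum)
  out <- data.frame(
    region = rownames(wide),
    year = year,
    wide[, c("None", "1 bedroom"), drop = FALSE],
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
  rownames(out) <- NULL

  # 4. Rename columns to the names used in the target header.
  names(out)[names(out) == "None"] <- "br_count_0"
  names(out)[names(out) == "1 bedroom"] <- "br_count_1"
  out
}

tidy2011 <- tidy_bedrooms(raw, 2011)
print(tidy2011)
```

Passing the year as an argument keeps the function reusable for both tables, since the year is a property of the sheet rather than of any row in this assumed layout.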

Apply the function to each table to create two tables in the desired format.

Do a summary of each table to look for unusual values.

Correct those values until the two tables have the same dimensions and format.
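One way to look for unusual values in the first two columns is to compare them against what is expected. The sketch below uses a toy table; the misspelt region, the mistyped year and the reference vector `expected_regions` are invented for illustration and are not taken from the real data.

```r
# Toy table with two deliberately planted errors (hypothetical values).
tbl <- data.frame(
  region = c("Banyule", "Baysidee", "Casey"),
  year = c(2011, 2011, 2101),
  stringsAsFactors = FALSE
)

# A quick overview; odd minimum or maximum years stand out here.
print(summary(tbl))

# Assumed reference list of valid region names and valid years.
expected_regions <- c("Banyule", "Bayside", "Casey")
valid_years <- c(2011, 2016)

bad_regions <- setdiff(tbl$region, expected_regions)
bad_years <- tbl$year[!tbl$year %in% valid_years]
print(bad_regions)
print(bad_years)

# Correct the values once they have been inspected by hand.
tbl$region[tbl$region == "Baysidee"] <- "Bayside"
tbl$year[tbl$year == 2101] <- 2011
```

In practice the other year's table can serve as the reference list of region names, since both tables should cover the same regions.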

Merge the two tables into a single table so that we see data in the form

Banyule,2011,78,1287,8457,21865,11366,645,3.1,2.6

Banyule,2016,...

Bayside,2011,...

Bayside,2016,...

...

Victoria,2011,...

Victoria,2016,...

Australia,2011,...

Australia,2016,...

(listed alphabetically by region, then by year, with Victoria and Australia at the end) (2 marks)
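The required ordering (alphabetical by region, then by year, with Victoria and Australia last) can be obtained by sorting on a helper rank first. This is a minimal sketch on toy tables that contain only the two key columns; the names `t2011`, `t2016` and `tail_rank` are hypothetical.

```r
# Toy versions of the two tidy tables (only the key columns are shown).
t2011 <- data.frame(
  region = c("Victoria", "Banyule", "Australia", "Bayside"),
  year = 2011,
  stringsAsFactors = FALSE
)
t2016 <- data.frame(
  region = c("Victoria", "Banyule", "Australia", "Bayside"),
  year = 2016,
  stringsAsFactors = FALSE
)

# Stack the tables; this requires identical column names in both.
combined <- rbind(t2011, t2016)

# Rank 0 for ordinary regions, 1 for Victoria, 2 for Australia.
tail_rank <- match(combined$region, c("Victoria", "Australia"), nomatch = 0)

# Sort by rank first, then alphabetically by region, then by year.
combined <- combined[order(tail_rank, combined$region, combined$year), ]
rownames(combined) <- NULL
print(combined)
```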

Write the result to a csv file (it should have 65 rows including the header).

3. Which region(s) (ignoring Victoria and Australia) had the largest increase in the number of occupied dwellings with 3 or more bedrooms between 2011 and 2016? (Ignore the unstated counts.)
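A possible approach to this question, sketched on invented counts: add the 3-bedroom and 4-or-more columns, compare the two years per region, and keep every region that ties for the maximum. The column names follow the header row given earlier; the numbers are made up.

```r
# Toy merged table with invented counts.
census <- data.frame(
  region = c("Banyule", "Banyule", "Bayside", "Bayside", "Victoria", "Victoria"),
  year = c(2011, 2016, 2011, 2016, 2011, 2016),
  br_count_3 = c(100, 120, 200, 210, 1000, 1500),
  br_count_4_or_more = c(50, 60, 80, 120, 500, 700),
  stringsAsFactors = FALSE
)

# Ignore the state and national totals.
local <- census[!census$region %in% c("Victoria", "Australia"), ]

# Dwellings with 3 or more bedrooms (unstated counts are left out).
local$three_plus <- local$br_count_3 + local$br_count_4_or_more

# Put the 2011 and 2016 figures side by side for each region.
y2011 <- local[local$year == 2011, c("region", "three_plus")]
y2016 <- local[local$year == 2016, c("region", "three_plus")]
change <- merge(y2011, y2016, by = "region", suffixes = c("_2011", "_2016"))
change$increase <- change$three_plus_2016 - change$three_plus_2011

# Keep all regions that share the largest increase.
top_regions <- change$region[change$increase == max(change$increase)]
print(top_regions)
```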

Part 2 - The online hospitality company Airbnb has made publicly available a number of datasets. This part of the assignment makes use of the listings.csv dataset.

It consists of a number of parameters related to properties available for lodging in the Melbourne metropolitan area and can be visualized.

Write R code to answer the following.

1. Give the five neighbourhoods with the most listings (list them along with the counts in descending order).
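As a sketch, counting and ranking can be done with `table()` and `sort()`. The toy data below stands in for listings.csv; the column name `neighbourhood` is assumed to match the real file and the counts are invented.

```r
# Toy stand-in for the listings table (invented counts).
listings <- data.frame(
  neighbourhood = rep(
    c("Melbourne", "Port Phillip", "Yarra", "Stonnington", "Moreland", "Darebin"),
    times = c(6, 5, 4, 3, 2, 1)
  ),
  stringsAsFactors = FALSE
)

# Count listings per neighbourhood, largest first, and keep the top five.
counts <- sort(table(listings$neighbourhood), decreasing = TRUE)
top5 <- head(counts, 5)
print(top5)
```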

2. How many listings contain the following words (upper or lower case or mixed) in the name column?

a. Beautiful

b. Quiet

c. Amazing

d. <another adjective of your choice with at least 200 instances>
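A case-insensitive count can be sketched with `grepl()`. The listing names below are invented, and `count_word` is a hypothetical helper. Note that this matches substrings, so "Quiet" would also count "Quietly"; wrapping the word in `\\b` word boundaries is one way to restrict it to whole words if that reading of the question is preferred.

```r
# Toy listing names (invented), including a missing value.
listings <- data.frame(
  name = c("Beautiful flat in CBD", "QUIET and beautiful studio",
           "Amazing view", "Cosy room", NA),
  stringsAsFactors = FALSE
)

# Count the titles that contain the word, ignoring case.
# grepl() returns FALSE for NA, so missing names are not counted.
count_word <- function(word, titles) {
  sum(grepl(word, titles, ignore.case = TRUE))
}

word_counts <- sapply(c("Beautiful", "Quiet", "Amazing"),
                      count_word, titles = listings$name)
print(word_counts)
```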

3. How many listings are there with last review in 2016? Give month by month counts for the year 2016.
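A minimal sketch of the month-by-month count, assuming the `last_review` column holds dates as "YYYY-MM-DD" text (an assumption about the file; the values below are invented). Using a factor with all twelve levels keeps months with zero reviews in the output.

```r
# Toy last_review values, including an empty string and a missing value.
listings <- data.frame(
  last_review = c("2016-01-15", "2016-01-30", "2016-07-04", "2017-02-01", "", NA),
  stringsAsFactors = FALSE
)

# Parse the dates; entries that do not match the format become NA.
dates <- as.Date(listings$last_review, format = "%Y-%m-%d")

# Listings whose last review falls in 2016.
in_2016 <- !is.na(dates) & format(dates, "%Y") == "2016"
total_2016 <- sum(in_2016)

# Monthly counts, with all twelve months present even when the count is zero.
review_month <- factor(format(dates[in_2016], "%m"),
                       levels = sprintf("%02d", 1:12))
monthly <- table(review_month)
print(total_2016)
print(monthly)
```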

4. Create a new column of the table which calculates the number of ids that correspond to the given host_id. Your answer should match the calculated_host_listings_count column (only use this column to check your answer).
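One base R way to build such a column is `ave()`, which applies a function within groups and returns a value for every row. The toy table and the new column name `host_count` are hypothetical.

```r
# Toy listings table (invented ids and host ids).
listings <- data.frame(
  id = 1:5,
  host_id = c(10, 10, 20, 30, 10),
  calculated_host_listings_count = c(3, 3, 1, 1, 3)
)

# For every row, count how many ids share that row's host_id.
listings$host_count <- ave(listings$id, listings$host_id, FUN = length)

# Use the existing column only as a check.
matches <- all(listings$host_count == listings$calculated_host_listings_count)
print(listings)
print(matches)
```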

5. Write a function that inputs a listing id and outputs a score that is the sum of points according to the following criteria:

a. Points for the neighbourhood: (average number of bedrooms per dwelling in 2016) × 50 (this comes from the data set in Part 1)

b. Points for the room type: 200 for Entire home/apt, 100 for Private room, 0 for Shared room

c. Points for minimum nights: 50 for 1 night, 25 for 2 nights, 0 for 3 or more nights

d. Points for availability: (availability_365) divided by 5

e. Points for review frequency: 50 × (reviews per month), but no more than 100

f. Points for price: (300 minus price)

Which id (ids if more than one) has the highest score according to the above system?
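The scoring rules can be sketched as a single function. Several things here are assumptions: the toy data is invented, neighbourhood names are assumed to match the Part 1 region names exactly, `av_per_dwelling` is assumed to be the column holding the average number of bedrooms per dwelling, and a missing `reviews_per_month` is treated as zero. The function name `listing_score` is hypothetical.

```r
# Toy Part 1 output (2016 rows only) and toy listings; all values invented.
census <- data.frame(
  region = c("Banyule", "Bayside"),
  year = c(2016, 2016),
  av_per_dwelling = c(3.1, 3.2),
  stringsAsFactors = FALSE
)
listings <- data.frame(
  id = c(101, 102),
  neighbourhood = c("Banyule", "Bayside"),
  room_type = c("Entire home/apt", "Private room"),
  minimum_nights = c(1, 3),
  availability_365 = c(365, 100),
  reviews_per_month = c(3, NA),
  price = c(100, 50),
  stringsAsFactors = FALSE
)

listing_score <- function(listing_id, listings, census) {
  row <- listings[listings$id == listing_id, ]
  stopifnot(nrow(row) == 1)

  # a. Neighbourhood: 2016 average bedrooms per dwelling times 50.
  beds <- census$av_per_dwelling[census$region == row$neighbourhood &
                                   census$year == 2016]
  # b. Room type lookup.
  room_points <- c("Entire home/apt" = 200, "Private room" = 100, "Shared room" = 0)
  # c. Minimum nights: 50 for 1, 25 for 2, otherwise 0.
  night_points <- if (row$minimum_nights == 1) 50 else
    if (row$minimum_nights == 2) 25 else 0
  # e. Review frequency, capped at 100 (missing treated as 0: an assumption).
  rpm <- if (is.na(row$reviews_per_month)) 0 else row$reviews_per_month

  beds * 50 +
    room_points[[row$room_type]] +
    night_points +
    row$availability_365 / 5 +   # d. availability
    min(50 * rpm, 100) +
    (300 - row$price)            # f. price
}

scores <- vapply(listings$id, listing_score, numeric(1),
                 listings = listings, census = census)
best_ids <- listings$id[scores == max(scores)]
print(scores)
print(best_ids)
```

For a table the size of the real listings file, a vectorised version (computing each component as a column) would likely be faster than calling the function once per id, but the per-id function matches the wording of the task.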

Part 3 - Write a short report summarising the variables in the two (processed) datasets from parts 1 and 2 through tables [2 marks] and plots with R including the following:

  • A histogram showing the distribution of a variable of interest.
  • A plot of one or more variables with time on the x axis (e.g. month, year or date).
  • A word cloud of the words in the name column of the listings table. You may follow the instructions and use the packages referred to.
  • A map showing the price of listings by colour (e.g. a dot plot or heat map - you will need to use an R package that can map geospatial data).
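A word cloud is driven by word frequencies, so a first step might be a frequency table like the one sketched below. The listing names and the tiny stop-word list are invented; a word cloud package would then take the words and their counts as input.

```r
# Toy listing names (invented).
listing_names <- c("Beautiful flat in the CBD",
                   "Quiet flat near the beach",
                   "Beautiful beach house")

# Lower-case the names and split them on anything that is not a letter.
words <- unlist(strsplit(tolower(listing_names), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Drop very common words (a tiny hypothetical stop-word list).
stop_words <- c("in", "the", "near")
words <- words[!words %in% stop_words]

# Frequency table, most frequent words first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```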

Point out any interesting patterns (e.g. trends) you see from your plots and summaries.

Attachment:- Assignment Files.rar


This paper is about data processing, visualization and summaries in R. The first part uses the census data set; the requirement was to write a function that loads, cleans and processes the data set into a tidy data set, and the attached source code has such a function. The second part entails subsetting the listings data set according to specified variables such as date and year, merging the two data sets, writing an R function that calculates a score for a given listing id, and counting words in the listing names. The third part includes visualization of the two data sets using a Melbourne map (from leaflet), a histogram, a time plot and word frequencies for the words in the listing names.

Write a short report summarising the variables : La Trobe University, Australia - BUS5DWR DATA WRANGLING AND R Assignment, Write a short report summarising the variables


Your submission to this assignment will consist of two files. A single .R file with all the code used for this assignment (all parts), including comments that contain the answers for parts 1 and 2. A document (pdf or docx) for part 3 (including code here is optional, only code in the R file will be assessed). Do not include the answers for parts 1 or 2 here. Note: please keep the data files in the same directory as your scripts so that you do not specify directories in your code. This will make your R code easier to assess.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd