Reference no: EM132324905 
                                                                               
                                       
DATA WRANGLING AND R Assignment -
The purpose of this assignment is to develop and assess your skills  in R programming including wrangling, summarising and plotting data.  Using the tidyverse package is recommended but not compulsory. Please  read through the entire assignment and understand the submission format  and marking rubrics before starting.
Part 1 - The spreadsheet titled 'censusdata.xlsx' contains  information about the number of bedrooms in occupied private dwellings  for local government areas in Melbourne for the years 2011 and 2016. You  will see that it is far from being ready for analysis and needs to be  'wrangled'. Additionally a few errors have been deliberately introduced  into the first two columns so these will need to be corrected by initial  analysis.
1. Explain why the data in its current form is not considered to be in 'tidy' format.
2. Write R code to read in the data (readxl package), manipulate it  and output it to a single csv file having the following header row.
region,year,br_count_0,br_count_1,br_count_2,br_count_3,br_count_4_or_more,br_count_unstate d,av_per_dwelling,av_per_household
Your code will have the following sections (not necessarily in the  order given and the process may be iterative as you find more things to  do). Please include comments in the code to separate each segment and  explain your steps.
Read in the data sets into two data frames df2011 and df2016.
Compare the layout of each of the two data frames, then remove  appropriate rows of one data frame to match the format of the other.
Write a function that takes in a table of the original form and  outputs a table in the desired form with columns specified above.
-  Remove unwanted rows or columns.
-  Split values into multiple columns to make them atomic.
-  Appropriately transform the data into the desired form.
-  Rename columns.
Apply the function to each table to create two tables in the desired format.
Do a summary of each table to look for unusual values.
Correct those values until the two tables have the same dimensions and format.
Merge the two tables into a single table so that we see data in the form
Banyule,2011,78,1287,8457,21865,11366,645,3.1,2.6
Banyule,2016,...
Bayside,2011,...
Bayside,2016,...
...
Victoria,2011,...
Victoria,2016,...
Australia,2011,...
Australia,2016,...
(listed alphabetically by region, then by year, with Victoria and Australia at the end) (2 marks)
Write the result to a csv file (it should have 65 rows including the header).
3. Which region(s) (ignoring Victoria and Australia) had the largest  increase in the number of occupied dwellings with 3 or more bedrooms  between 2011 and 2016? (Ignore the unstated counts.)
Part 2 - The online hospitality company Airbnb has made  publicly available a number of datasets. This part of the assignment  makes use of the listings.csv dataset.
It consists of a number of parameters related to properties available  for lodging in the Melbourne metropolitan area and can be visualized.
Write R code to answer the following.
1. Give the five neighbourhoods with the most listings (list them along with the counts in descending order).
2. How many listings contain the following words (upper or lower case or mixed) in the name column?
a. Beautiful
b. Quiet
c. Amazing
d. <another adjective of your choice with at least 200 instances>
3. How many listings are there with last review in 2016? Give month by month counts for the year 2016.
4. Create a new column of the table which calculates the number of  ids that correspond to the given host_id . Your answer will match the  calculated_host_listings_count column (only use this column to check  your answer).
5. Write a function that inputs a listing id and outputs a score that is the sum of points according to the following criteria:
a. Points for the neighbourhood: (average number of bedrooms per dwelling in 2016) × 50 (this comes from the data set in Part 1)
b. Points for the room type: 200 for Entire home/apt, 100 for Private room, 0 for Shared room
c. Points for minimum nights: 50 for 1 night, 25 for 2 nights, 0 for 3 or more nights
d. Points for availability: (availability_365) divided by 5
e. Points for review frequency: 50 × (reviews per month), but no more than 100
f. Points for price: (300 minus price)
Which id (ids if more than one) has the highest score according to the above system?
Part 3 - Write a short report summarising the variables in the  two (processed) datasets from parts 1 and 2 through tables [2 marks]  and plots with R including the following:
-  A histogram showing the distribution of a variable of interest.
-  A plot of one or more variables with time on the x axis (e.g. month, year or date).
-  A word cloud of the words in the name column of the listings table.  You may follow the instructions and use the packages referred. 
-  A map showing the price of listings by colour (e.g. a dot plot or  heat map - you will need to use an R package that can map geospatial  data).
Point out any interesting patterns (e.g. trends) you see from your plots and summaries.
Attachment:- Assignment Files.rar