Reference no: EM132633170
FIT5145 Introduction to Data Science - Monash University
The aim of this assignment is to investigate and visualise data using various data science tools. It will test your ability to:
• Using R,
o read data files and extract related data from those files;
o wrangle and process data into the required formats;
o use various graphical and non-graphical tools to perform exploratory data analysis and visualisation; and
• communicate your findings in your report.
Tasks:
• There are two tasks (A & B) in this assignment. Each task has separate data set files.
• You need to use R to complete the tasks.
• You need to use R Markdown to communicate
o your answers,
o the code you used to complete the tasks, and
o your explanation of the steps you took and any issues that arose
It is crucial that the R Markdown report you submit clearly identifies which questions you are answering, and explains how you are processing the data and why you are processing the data in that way. It is not adequate for you to just answer the questions for each task or just supply the code you used.
The data supplied for each task will also have to be wrangled in order to answer the questions. The supplied data is not guaranteed to be "clean" and without faults. This may require you to
• examine the data,
• filter the data,
• deal with missing or inconsistent values or formats,
• deal with any outliers or exceptional values,
• merge or divide the values or data sets,
• sort the data, and/or
• any other pre-processing steps that are required in order to be able to analyse the data. Your report must explain why and how you are performing this data wrangling, including identifying any issues you find with the data.
Task A: Investigating the size of the Indigenous Australian Population
In this task, you are required to visualise the relationship between the distribution and age of Indigenous Australians and gain insights into relations and trends over time. The data files used in this task were originally downloaded from the Australian Bureau of Statistics (ABS). We have extracted the data from the original files and put it into a simpler format. Please download the data from Moodle:
• IndigAusPopData_by_region (Data1): This file contains yearly data regarding the estimated resident population of Indigenous Australians, grouping by indigenous regions, between 2016 to 2031.
• IndigAusPopData_by_state (Data2): This file contains yearly data regarding the estimated resident population of Indigenous Australians, grouping by state or territory, between 2006 and 2031.
A1. Investigating the Distribution of Indigenous Australians
Indigenous Australians are part of Australian society everywhere, but some parts of the country have larger populations than others. For Data1, Australia is segmented into regions (titled "Indigenous regions") and the expected Indigenous population for each region is indicated. This data also divides each region's population into different age groups.
1. Use R to read, wrangle and analyse the data in Data1. Make sure you describe any complications you encounter and the steps you take when answering the following questions.
a. What regions have the maximum and minimum total Indigenous populations in 2016 and 2031?
b. What region/s have the maximum and minimum growth or decay rates of their total
Indigenous population between 2016 and 2031?
Calculate these rates as the percentage difference between the 2016 and 2031, e.g., if 2031 population = 5500 & 2016 population = 5000,
then rate = (5500 - 5000) / 5000 = 500/5000 = 0.1, so 10% growth
c. Plot and describe the growth or decay of the total Indigenous populations for the capitals of the 8 state/territories across all time periods.
For these calculations, you will need to work out the growth/decay rates for each time period, where the total population of the capital in time period N is compared to that in time period N+1.
e.g., if 2017 population = 5050 and 2016 population = 5000,
then rate = (5050 - 5000) / 5000 = 50/5000 = 0.01, so 1% growth for 2016-2017
A2. Investigating the Ages of Indigenous Australians
On average, the lifespan of Indigenous Australians is lower than that of the overall Australian population, due to a variety of socio-economic factors. Data1 and Data2 give separate populations for different ages or age groups, but because this is about living populations, not when they die, we can't use it to calculate average lifespans. Instead, let's look at how many children are in the populations. Make sure you describe any complications you encounter and the steps you take when answering the following questions.
1. Using Data1, which region has the highest percentage of children in its total 2016 population?
For this, calculate this as a percentage of the total population for a region. The ABS commonly considers children to be under 15 years of age.
2. Data2 includes estimated populations measured for the years 2006-2016 and projected estimates predicted for the years 2016-2031. Data1 just uses projected estimates. Using Data2 only, calculate and discuss which state or territory has the highest percentage of children in its total 2006, 2016 and 2031 populations.
3. Use R to build a Motion Chart comparing the total Indigenous Australian population of each region to the percentage of Indigenous Australian children in each state/territory. Use the region populations calculated from Data1 and the child percentage values calculated from Data2. The motion chart should show the population on the x-axis, the percentage on the y-axis, the bubble size should depend on the population.
Hint: an example of how to construct an R motion chart can be found on Moodle. You will have to install the ‘googleVis' package and may have to allow Flash to work on your browser (see https://community.rstudio.com/t/gvismotionchart-from-googlevis-is-not-working-any- suggestion/6109/9 for advice on allowing Flash for Chrome). If you cannot get the example script to work, contact your tutor.
4. Using the Motion Chart, answer the following questions, supporting your answers with relevant R code and/or Motion Charts
a. Which region's population overtakes that of another region in the same state/territory? In which year/s does this happen?
b. Is there generally a relationship between the Indigenous Australian population size and percentage of children in the population? If so, what kind of relationship? Explain your answer.
c. Colour is commonly used in data visualisation to help understand data. Which aspect of this data would you use colour for in your plot and why?
d. Are there any other interesting things you notice in the data or any changes you would recommend for the Motion Chart?
B: Exploratory Analysis on Australian Immunisation rates
In this task, you are required to do some exploratory analysis on data relating to the Australian childhood immunisation rates. This data was originally prepared and released through the Australian Government's Australian Institute of Health and Welfare. We have extracted the data from the original files and put it into a simpler format. Please download the data from Moodle:
• AusImmunisationData (Data3): This file contains yearly data regarding the number of 1, 2 and 5 year-old Australian children fully or partially immunised in various Primary Health Network (PHN) areas.
COLUMN
|
DESCRIPTION
|
State
|
State or territory for the PHN area
|
PHN code
|
Identification number for PHN area relating to the data
|
PHN area name
|
Description of PHN area
|
Reporting Year
|
Financial period examined
|
Age group
|
Age group of children
|
Number of registered children
|
Number of children registered in the age group
|
Number fully immunised
|
Number of children in the age group who were fully immunised, according to government objectives
|
Number not fully immunised
|
Number of children in the age group who were not fully immunised, according to government objectives
|
Number of registered IndigAus children
|
Number of Indigenous Australian children in the age group
|
Number IndigAus fully immunised
|
Number of Indigenous Australian children in the age group who were fully immunised, according to government
objectives
|
Number IndigAus not fully immunised
|
Number of Indigenous Australian children in the age group who were not fully immunised, according to government objectives
|
Interpret with caution
|
This area's eligible population is between 26 and 100 registered children.
|
Use R to read, wrangle and analyse the data from Data3. Make sure you describe any complications you encounter and the steps you take when answering the following questions.
B1. Values and Variables
1. How many PHN areas does the data cover?
2. What are the possible values for 'PHN code'?
3. For each row, calculate the percentage of Australian children that are fully immunised (this is the immunisation rate). What are the average, maximum and minimum immunisation rates? Calculate the same for the group that are Indigenous Australian children. Do all of those values seem statistically reasonable to you?
B2. Variation in rates over Time, Age and Location
Generate boxplots (or other plots) of the immunisation rates versus year and age to answer the following questions:
1. Have the immunisation rates improved over time? Are the median immunisation rates increasing, decreasing or staying the same?
2. How do the immunisation rates vary with the age of the child?
Generate boxplots (or other plots) of the immunisation rates versus locations and answer the following questions:
3. What is the median rate per state/territory?
4. Which states or territories seem most consistent in their immunisation rates?
Attachment:- Introduction to Data Science.rar