Data Applications Assignment - Problem Set
Instructions - The problem set is organized as follows:
Question 1 simulates a data visualization task. It builds off skills you have applied in previous labs and problem sets.
Question 2 asks you to execute and interpret a fairly basic regression model.
You are required to use an R Notebook for problem sets, so your work should be saved in a file with the extension .Rmd. Please name your notebook file ps3_lastname1_lastname2_lastname3. This is the file you should submit via eLC. I will Knit your notebook and grade accordingly, adding comments to your notebook so everything is self-contained.
It is time to start using your R Notebook as if you are preparing a report for an external audience. Throughout this problem set, consider what code and results should be included in your output. For instance, a reader probably isn't interested in the importing and wrangling needed to produce plots. If you think code or results are not necessary for a reader to see, suppress accordingly. Use your best judgment; I will not grade your choices strictly.
Students may work in groups of at most three. If you work in a group, please provide one submission. I cannot enforce this, but I highly recommend that group members work together in person rather than splitting up the questions and working remotely.
Any external data or documentation required to complete a problem set will be available on eLC. Please read each question carefully and provide a thorough response.
Question 1 - United Nations life expectancy data
Life expectancy at birth can vary over time or between countries for many reasons: the evolution of medicine, the degree of development of countries, or the effect of armed conflicts. Life expectancy varies by sex as well: women generally live longer than men. Why? Several factors may contribute, including biological reasons and the theory that women tend to be more health conscious.
Let's create some plots to explore the inequalities in life expectancy at birth around the world. We will use a dataset from the United Nations Statistics Division that is available on eLC.
Part 1: Import
Take a look at the UNdata.csv file before importing. Since our focus is life expectancy between sexes by country and year, this file clearly has some unnecessary columns and rows.
Import the UNdata.csv dataset. Name the new object life_expectancy. Within the import command, change the variable names to the following:
- country
- sex
- year
- source
- unit
- lifeExp
- footnote
Also, tell R to skip the first row and to store every column as character except for the one quantitative variable, which should be stored as integer. Set the maximum number of rows imported so that the footnotes in the bottom rows of the .csv file are not imported. This maximum number should be equal to the number of rows that contain actual data in the .csv file.
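As a minimal sketch of the import options described above, the block below writes a small made-up CSV to a temporary file and reads it with readr's read_csv. The header names, the values, and n_max = 4 are assumptions that fit only this toy file; the real UNdata.csv needs its own row count. The sketch also assumes lifeExp is the one quantitative variable.

```r
library(readr)

# Toy stand-in for UNdata.csv (all values made up): one header row,
# four data rows, then footnote rows that should not be imported.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country or Area,Subgroup,Year,Source,Unit,Value,Value Footnotes",
  "Country A,Female,2000-2005,Toy source,Years,75,",
  "Country A,Male,2000-2005,Toy source,Years,68,",
  "Country B,Female,2000-2005,Toy source,Years,62,",
  "Country B,Male,2000-2005,Toy source,Years,61,",
  "footnoteSeqID,Footnote",
  "1,Toy footnote text"
), toy_path)

life_expectancy <- read_csv(
  toy_path,
  col_names = c("country", "sex", "year", "source", "unit", "lifeExp", "footnote"),
  skip      = 1,          # skip the original header row
  col_types = "cccccic",  # character everywhere except integer lifeExp
  n_max     = 4           # number of real data rows in this toy file
)
```

Because col_names is supplied and the header row is skipped, the new variable names are set inside the import command itself rather than in a later rename step.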
Part 2: Wrangle
Our first plot will compare male and female life expectancy with the most recent data available in life_expectancy. The dataset still requires some wrangling to facilitate making such a plot.
Generate a new object that contains a subset of life_expectancy according to the following instructions:
- Keep only the following variables:
- Include only the most recent time period.
- Drop the year variable since it is no longer necessary.
- Change the sex variable into two columns where lifeExp serves as their value.
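The filtering and reshaping steps above might look like the following sketch on made-up data. The object names toy and subdata are hypothetical, and the sketch assumes the periods are stored as strings such as "2000-2005", so that max() on the year column picks the most recent one.

```r
library(dplyr)
library(tidyr)

# Made-up long data: two countries, two sexes, two periods.
toy <- tibble(
  country = rep(c("A", "B"), each = 4),
  sex     = rep(c("Female", "Male"), times = 4),
  year    = rep(rep(c("1985-1990", "2000-2005"), each = 2), times = 2),
  lifeExp = c(70L, 65L, 75L, 68L, 60L, 58L, 62L, 61L)
)

subdata <- toy %>%
  filter(year == max(year)) %>%   # keep only the most recent period
  select(-year) %>%               # year is constant now, so drop it
  pivot_wider(names_from = sex, values_from = lifeExp)  # one column per sex
```

The result has one row per country with Female and Male columns. The older spread(sex, lifeExp) from tidyr does the same reshaping.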
Part 3: Scatterplot Step-by-Step
First, create a basic scatterplot to represent life expectancy of males (x-axis) against females (y-axis).
Next, adjust this plot to make it easier to interpret. Set limits for the x and y axis from 35 to 85. Add a dashed reference line that intersects the y-axis at 0 and has a slope of 1.
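A minimal sketch of the basic plot with fixed axis limits and the reference line, using made-up values in a hypothetical data frame named toy_wide:

```r
library(ggplot2)

# Made-up wide data: one row per country, one column per sex.
toy_wide <- data.frame(
  country = c("A", "B", "C"),
  Male    = c(68, 61, 50),
  Female  = c(75, 62, 48)
)

p <- ggplot(toy_wide, aes(x = Male, y = Female)) +
  geom_point() +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  # Dashed line through the origin with slope 1, i.e. Female = Male.
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
```

Using the same limits on both axes keeps the slope-1 line on the diagonal of the panel, which makes it easier to compare points against it.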
Briefly explain to your readers how they should interpret a point lying either above or below the reference line. In other words, what does the reference line help a reader interpret?
Next, adjust the scatterplot according to the following directions:
- Alter the points so that their outline color is "white", their fill color is "chartreuse3", shape equals 21, alpha equals 0.55, and size equals 4.
- Add an appropriate label for its title, a subtitle to specify which years the data include, a caption to report the source of the data, and appropriate labels for the x and y axes.
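The point styling and labels could be set as in the sketch below. The data are made up, and the title, subtitle, and axis text are placeholders rather than required wording.

```r
library(ggplot2)

# Made-up wide data: one row per country, one column per sex.
toy_wide <- data.frame(
  country = c("A", "B", "C"),
  Male    = c(68, 61, 50),
  Female  = c(75, 62, 48)
)

p <- ggplot(toy_wide, aes(x = Male, y = Female)) +
  # Shape 21 is a filled circle: colour sets the outline, fill the interior.
  geom_point(colour = "white", fill = "chartreuse3",
             shape = 21, alpha = 0.55, size = 4) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  labs(
    title    = "Life expectancy at birth by country",  # placeholder text
    subtitle = "Placeholder: state the period covered by the data",
    caption  = "Source: United Nations Statistics Division",
    x        = "Males",
    y        = "Females"
  )
```

These settings are constants, so they go outside aes(); placing them inside aes() would map them as data values instead.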
We want to draw attention to some countries where the gap in life expectancy between men and women is particularly high.
To do this, generate two new objects, top_male and top_female, that contain the 3 countries with the highest difference in life expectancy for males and females, respectively.
Lastly, modify your previous scatterplot code according to the following instructions:
- Add a label aesthetic assigned to country
- Add two text geoms. One should use top_male as its data and the other should use top_female. Set the size of the text equal to 3.
- Add a theme that you think is best.
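The highlighting steps can be sketched as follows on made-up data. The sketch assumes one reading of the instructions: top_male holds the three largest Male minus Female gaps and top_female the three largest Female minus Male gaps. The column name difference and the choice of theme_minimal() are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Made-up wide data: one row per country, one column per sex.
toy_wide <- tibble(
  country = c("A", "B", "C", "D", "E"),
  Male    = c(68, 61, 50, 70, 55),
  Female  = c(75, 62, 48, 80, 54)
)

# Assumed reading: rank countries by Male - Female, keep the top 3.
top_male <- toy_wide %>%
  mutate(difference = Male - Female) %>%
  arrange(desc(difference)) %>%
  head(3)

# Same idea with the sign reversed.
top_female <- toy_wide %>%
  mutate(difference = Female - Male) %>%
  arrange(desc(difference)) %>%
  head(3)

p <- ggplot(toy_wide, aes(x = Male, y = Female, label = country)) +
  geom_point(colour = "white", fill = "chartreuse3",
             shape = 21, alpha = 0.55, size = 4) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  # Each text layer inherits x, y, and label but uses its own subset.
  geom_text(data = top_male, size = 3) +
  geom_text(data = top_female, size = 3) +
  scale_x_continuous(limits = c(35, 85)) +
  scale_y_continuous(limits = c(35, 85)) +
  theme_minimal()
```

Passing a different data argument to each geom_text layer labels only the highlighted countries while the point layer still shows every country.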
Now that you have a fantastic plot, provide a brief interpretation containing what a reader should learn from it.
Part 4: Scatterplot 2
Since our data contain historical information, let's see how life expectancy between males and females has changed over time. Our second plot will represent the difference between men and women across countries between two periods: 2000-2005 and 1985-1990.
First, we need to generate a new object that subsets life_expectancy. Ultimately, we need a dataset where country is the unit of analysis, contains life expectancy for each sex in each time period as separate variables, and contains two more variables that are the difference in life expectancy between the two time periods for each sex.
The following instructions are provided to help you generate this new subset:
- Include only observations where year equals one of the two aforementioned time periods
- Unite the sex and year columns into one column separated by an underscore
- Change the hyphen separating years to an underscore so R does not think it is a minus operation (code provided below)
- Transform the sex_year column into four columns where lifeExp serves as their value
- Create two new variables that calculate the change in life expectancy over time for each sex (difference = new value - old value)
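One possible way to chain these steps is sketched below on made-up data. The object names, the new column names diff_Female and diff_Male, and the use of gsub() for the hyphen replacement are assumptions for illustration, not necessarily the code supplied with the assignment.

```r
library(dplyr)
library(tidyr)

# Made-up long data: two countries, two sexes, two periods.
toy <- tibble(
  country = rep(c("A", "B"), each = 4),
  sex     = rep(c("Female", "Male"), times = 4),
  year    = rep(rep(c("1985-1990", "2000-2005"), each = 2), times = 2),
  lifeExp = c(70L, 65L, 75L, 68L, 60L, 58L, 62L, 61L)
)

subdata2 <- toy %>%
  filter(year %in% c("1985-1990", "2000-2005")) %>%
  unite(sex_year, sex, year, sep = "_") %>%          # e.g. "Female_1985-1990"
  mutate(sex_year = gsub("-", "_", sex_year)) %>%    # e.g. "Female_1985_1990"
  pivot_wider(names_from = sex_year, values_from = lifeExp) %>%
  mutate(
    diff_Female = Female_2000_2005 - Female_1985_1990,  # new minus old
    diff_Male   = Male_2000_2005 - Male_1985_1990
  )
```

Replacing the hyphen before reshaping matters because a column called Female_1985-1990 would have to be wrapped in backticks every time it is used; otherwise R parses the hyphen as subtraction.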
Now we are ready to plot. This time we want to plot the difference in male life expectancy over time on the x-axis and the difference among females on the y-axis.
Fortunately, much of the code you used for the last plot will work for this new one. You can copy and paste the code from the last plot and make the required modifications to it to save some time.
As for modifying the code you just reused, you will obviously need to change the variables used for this new scatterplot. Also, the axis scales need to be adjusted based on the values of the new variables (hint: use the summary function to obtain min and max values). Choose the scale limits you think work best so long as they are the same for both x and y. You will also need to change the data source for the text labels that highlight interesting countries.
Finally, add two new reference lines: a dashed horizontal line at 0 and a dashed vertical line at 0.
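A sketch of the second plot with all three reference lines, on made-up differences. The data frame toy_diff, its column names, and the limits c(-10, 15) are hypothetical; pick limits from the summary of your own variables.

```r
library(ggplot2)

# Made-up changes in life expectancy between the two periods.
toy_diff <- data.frame(
  country     = c("A", "B", "C"),
  diff_Male   = c(3, -5, 10),
  diff_Female = c(5, -2, 8)
)

# Check the observed range before choosing common axis limits.
summary(toy_diff$diff_Male)
summary(toy_diff$diff_Female)
lims <- c(-10, 15)  # hypothetical limits covering both variables

p <- ggplot(toy_diff, aes(x = diff_Male, y = diff_Female, label = country)) +
  geom_point(colour = "white", fill = "chartreuse3",
             shape = 21, alpha = 0.55, size = 4) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  geom_hline(yintercept = 0, linetype = "dashed") +  # no change for females
  geom_vline(xintercept = 0, linetype = "dashed") +  # no change for males
  scale_x_continuous(limits = lims) +
  scale_y_continuous(limits = lims)
```

The horizontal and vertical lines split the panel into quadrants, so the sign of each change can be read directly from a point's position.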
Some code is provided below to assist you.
As before, provide an interpretation of the graph for readers. What does a point's position relative to each reference line mean? What is the main takeaway for each group of interesting countries?
Question 2 -
For this question, we will use the Georgia school district data from lab 3 and problem set 2.
load("ga_schdist_clean.RData")
After seeing the results of your last analysis concerning districts with the highest and lowest expenditures, suppose your boss is now interested to know what variables are associated with district revenues.
Based on the data available to us, we may suspect that the number of students enrolled in special programs (e.g. Limited English Proficiency) and the share of total enrollment made up of black and Hispanic students explain revenues.
Therefore, we want to create a new dataset that contains percentages of enrollment for special programs and race as well as revenues expressed in per pupil terms. The below code does this. Note the use of mutate_at as a shortcut to mutate multiple variables with the same function.
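As an illustration of mutate_at only, the sketch below uses a made-up data frame with hypothetical column names (enroll, lep, sped, total_rev), not the variables in the Georgia data. It converts two count columns to percentages of enrollment in one call and then expresses revenue per pupil.

```r
library(dplyr)

# Made-up district data; column names are hypothetical.
toy_dist <- tibble(
  district  = c("X", "Y"),
  enroll    = c(1000, 500),
  lep       = c(100, 25),
  sped      = c(120, 50),
  total_rev = c(10000000, 6000000)
)

toy_pct <- toy_dist %>%
  # Apply the same formula to every column listed in vars();
  # "." stands for the column currently being transformed.
  mutate_at(vars(lep, sped), ~ . / enroll * 100) %>%
  mutate(total_rev_pp = total_rev / enroll)
```

With a single unnamed function, mutate_at overwrites the selected columns in place. In dplyr 1.0 and later, mutate(across(c(lep, sped), ~ .x / enroll * 100)) is the equivalent form.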
Part 1 - Run Regressions
With this new dataset, run two regressions according to the following model:
$$Y_i = \beta_0 + \beta_1 \text{PctLEP}_i + \beta_2 \text{PctSPED}_i + \beta_3 \text{PctFRPL}_i + \beta_4 \text{PctBlack}_i + \beta_5 \text{PctHisp}_i + \varepsilon_i$$
where $Y_i$ is total revenues for each district $i$ in the first regression and total local revenues in the second.
Part 2 - Regression Tables
Provide your reader a table of results for each regression. Be sure to provide a line of text that tells your reader which table belongs to which regression model.
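A minimal sketch of fitting both models and extracting a coefficient table for each, using simulated data. Every variable name and every number here is made up for illustration and says nothing about the Georgia districts.

```r
# Simulated data only: 50 fake districts with hypothetical variable names.
set.seed(1)
n <- 50
toy_reg <- data.frame(
  pct_lep   = runif(n, 0, 20),
  pct_sped  = runif(n, 5, 20),
  pct_frpl  = runif(n, 20, 90),
  pct_black = runif(n, 0, 80),
  pct_hisp  = runif(n, 0, 40)
)
toy_reg$total_rev_pp <- 9000 + 50 * toy_reg$pct_sped + rnorm(n, sd = 500)
toy_reg$local_rev_pp <- 4000 - 20 * toy_reg$pct_frpl + rnorm(n, sd = 500)

# Regression 1: total revenues; regression 2: total local revenues.
reg1 <- lm(total_rev_pp ~ pct_lep + pct_sped + pct_frpl + pct_black + pct_hisp,
           data = toy_reg)
reg2 <- lm(local_rev_pp ~ pct_lep + pct_sped + pct_frpl + pct_black + pct_hisp,
           data = toy_reg)

# Coefficient tables: estimate, standard error, t value, p-value.
tab1 <- coef(summary(reg1))
tab2 <- coef(summary(reg2))
print(round(tab1, 3))
print(round(tab2, 3))
```

In a knitted notebook, knitr::kable(tab1, digits = 3) is one way to render such a matrix as a formatted table, with a sentence above each table naming the model it belongs to.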
Part 3 - Interpret Coefficients
Provide an interpretation for the following coefficients:
- The percent of students enrolled in special education in regression 1.
- The percent of black and Hispanic students in regression 2.