Assignment - Inferential Statistics
Task 1: Climate change and temperature anomalies
If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at [NASA's Goddard Institute for Space Studies].
To define temperature anomalies you need a reference, or base, period, which NASA states is 1951-1980.
You have two objectives in this section:
1. Select the year and the twelve month variables from the 'weather' dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use the 'select()' function.
2. Convert the dataframe from wide to 'long' format. Hint: use the 'gather()' or 'pivot_longer()' function. Name the new dataframe 'tidyweather', name the variable containing the name of the month 'month', and the temperature deviation values 'delta'. A sketch of both steps appears after the checklist below.
Inspect your dataframe. It should have three variables now, one each for
1. year,
2. month, and
3. delta, or temperature deviation.
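A minimal sketch of both steps, assuming the NASA file has already been read into a dataframe called 'weather' with a 'Year' column followed by the month columns 'Jan' through 'Dec' (adjust the column names to match your file):
'''{r tidy_weather_sketch, eval=FALSE}
library(tidyverse)

# keep only the year and the twelve month columns,
# then reshape from wide to long format
tidyweather <- weather %>%
  select(Year, Jan:Dec) %>%
  pivot_longer(cols = Jan:Dec,
               names_to = "month",
               values_to = "delta")
'''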
Plotting Information - Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called 'date' to ensure that the 'delta' values are plotted chronologically.
In the following chunk of code, I used the 'eval=FALSE' argument, which prevents a chunk from running; I did so so that you can knit the document before tidying the data and creating the new dataframe 'tidyweather'. When you actually want to run this code and knit your document, you must delete 'eval=FALSE', not just here but in all chunks where 'eval=FALSE' appears.
'''{r scatter_plot, eval=FALSE, warning=FALSE}
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")),
         month = month(date, label = TRUE),
         year = year(date))

ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  theme_bw() +
  labs(
    title = "Weather Anomalies"
  )
'''
Is the effect of increasing temperature more pronounced in some months? Use 'facet_wrap()' to produce a separate scatter plot for each month, again with a smoothing line. Your chart should have human-readable labels; that is, each month should be labeled "Jan", "Feb", "Mar" (full or abbreviated month names are fine), not '1', '2', '3'. A sketch of one possible approach appears after the empty chunk below.
'''{r facet_wrap, echo=FALSE, warning=FALSE}
# YOUR CODE GOES HERE
'''
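One possible approach, as a sketch; it reuses the 'tidyweather' dataframe and the labelled 'month' factor created in the scatter-plot chunk above:
'''{r facet_wrap_sketch, eval=FALSE, warning=FALSE}
# one panel per month; the labelled 'month' factor gives "Jan", "Feb", ...
ggplot(tidyweather, aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  facet_wrap(~month) +
  theme_bw() +
  labs(title = "Weather Anomalies by Month")
'''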
It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as the 1970s, 1980s, and 1990s. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called 'comparison' that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
We remove data before 1881 using 'filter'. Then, we use the 'mutate' function to create a new variable 'interval' which contains information on which period each observation belongs to. We can assign the different periods using 'case_when()'.
'''{r intervals, eval=FALSE}
comparison <- tidyweather %>%
  filter(Year >= 1881) %>% # remove years prior to 1881
  # create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))
'''
Inspect the 'comparison' dataframe by clicking on it in the 'Environment' pane.
Now that we have the 'interval' variable, we can create a density plot to study the distribution of monthly deviations ('delta'), grouped by the different time periods we are interested in. Set 'fill' to 'interval' to group and colour the data by different time periods.
'''{r density_plot, eval=FALSE, warning=FALSE}
ggplot(comparison, aes(x = delta, fill = interval)) +
  geom_density(alpha = 0.2) + # density plot with transparency set to 20%
  theme_bw() + # theme
  labs(
    title = "Density Plot for Monthly Temperature Anomalies",
    y = "Density" # changing y-axis label to sentence case
  )
'''
So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using 'group_by()' and 'summarise()', followed by a scatter plot to display the result.
'''{r averaging, warning=FALSE, eval=FALSE}
# creating yearly averages
average_annual_anomaly <- tidyweather %>%
  group_by(Year) %>% # grouping data by Year
  # creating summaries for mean delta
  # use 'na.rm=TRUE' to eliminate NA (not available) values
  summarise(annual_average_delta = mean(delta, na.rm = TRUE))

# plotting the data:
ggplot(average_annual_anomaly, aes(x = Year, y = annual_average_delta)) +
  geom_point() +
  # fit the best-fit line, using the LOESS method
  geom_smooth() +
  # change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs(
    title = "Average Yearly Anomaly",
    y = "Average Annual Delta"
  )
'''
Hypothesis Test -
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to determine (test) whether the difference in average temperature deviation (delta) since 2011 is (statistically) significantly different from 1.5 degrees.
First, state what you are doing. What is your null hypothesis? What is your alternative hypothesis?
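For example, one possible formulation (an illustration, not the only acceptable wording) is H0: the mean temperature deviation since 2011 equals 1.5 degrees, versus H1: the mean temperature deviation since 2011 is different from 1.5 degrees.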
Confidence Interval for delta -
Let us construct a confidence interval for the average annual delta since 2011. Recall that the dataframe 'comparison' has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
'''{r calculate_CI_by_hand, eval=FALSE}
formula_ci <- comparison %>%
# choose the interval 2011-present
# what dplyr verb will you use?
# calculate summary statistics for temperature deviation (delta)
# calculate mean, SD, count, SE, lower/upper 95% CI
# what dplyr verb will you use?
#print out formula_CI
formula_ci
'''
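As a sketch of one way to fill in the skeleton above; the column names 'mean_delta', 'sd_delta', and so on are illustrative choices, not required names:
'''{r calculate_CI_sketch, eval=FALSE}
formula_ci <- comparison %>%
  filter(interval == "2011-present") %>% # choose the interval 2011-present
  summarise(mean_delta = mean(delta, na.rm = TRUE),
            sd_delta   = sd(delta, na.rm = TRUE),
            count      = sum(!is.na(delta)),
            se_delta   = sd_delta / sqrt(count),
            # approximate 95% CI using the 1.96 critical value
            lower_ci   = mean_delta - 1.96 * se_delta,
            upper_ci   = mean_delta + 1.96 * se_delta)
formula_ci
'''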
t-stat for observed delta -
In hypothesis testing, we want to calculate a **t-stat**, namely how far away what we observed (the actual mean delta since 2011) is from what we assumed, expressed not in degrees Celsius but in standard errors. Given the 'formula_ci' numbers calculated earlier, how far away is the observed (actual) mean delta from 1.5 degrees? A sketch follows below.
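A sketch, assuming your 'formula_ci' contains columns named 'mean_delta' and 'se_delta' as in the sketch above (adjust to your own names):
'''{r t_stat_sketch, eval=FALSE}
# how many standard errors does the observed mean lie from the hypothesised 1.5?
t_stat <- (formula_ci$mean_delta - 1.5) / formula_ci$se_delta
t_stat
'''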
What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, the result of your test, and the interpretation of that result. One paragraph max, please!
Task 2: IMDB ratings: Differences between directors
I would like you to explore whether the mean IMDB ratings for Steven Spielberg and Tim Burton are the same or not. I have already calculated the confidence intervals for the mean ratings of these two directors and, as you can see, they overlap.
[Figure: overlapping 95% confidence intervals for the mean IMDB ratings of Steven Spielberg and Tim Burton]
You should use both the 't.test' command and the 'infer' package to simulate from a null distribution, where you assume zero difference between the two means.
Before anything, write down the null and alternative hypotheses, as well as the resulting test statistic and the associated t-stat or p-value. At the end of the day, what do you conclude?
You can load the data and examine its structure:
'''{r load-movies-data, message=FALSE, warning=FALSE}
movies <- read_csv(here::here("Data", "movies.csv"))
glimpse(movies)
'''
Your R code and analysis should go here. If you want to insert a blank chunk of R code, you can just hit 'Ctrl/Cmd+Alt+I'.
'''{r}
'''
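As a sketch of both approaches, assuming 'movies' contains a 'director' column and a numeric 'rating' column:
'''{r directors_test_sketch, eval=FALSE}
selected_directors <- movies %>%
  filter(director %in% c("Steven Spielberg", "Tim Burton"))

# classical two-sample t-test
t.test(rating ~ director, data = selected_directors)

# simulation-based test with the infer package
library(infer)

# observed difference in mean ratings
obs_diff <- selected_directors %>%
  specify(rating ~ director) %>%
  calculate(stat = "diff in means",
            order = c("Steven Spielberg", "Tim Burton"))

# null distribution assuming zero difference between the two directors
null_dist <- selected_directors %>%
  specify(rating ~ director) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means",
            order = c("Steven Spielberg", "Tim Burton"))

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
'''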
Task 3: Calculate and plot risk/return profile of stocks
We will use the 'tidyquant' package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
The 'tidyquant' package allows us to download historical prices for many financial assets, most of them coming through Yahoo Finance. We must first identify which stocks we want to download data for, and for this we must know their **ticker** symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald's as MCD, etc.
In September 2017, Samir Khan from the [investexcel.net website](https://investexcel.net/all-yahoo-finance-stock-tickers/) got a list of all Yahoo finance tickers, and we will use a modified version of that list of tickers.
'''{r get_tickers, warning=FALSE, message=FALSE}
tickers <- read_csv(here::here("Data","yahoo_finance_tickers.csv"))
'''
The 'tickers' dataframe contains 207,533 tickers of various instruments, the 'name' of the instrument, the 'exchange' it is traded at, which market sector it belongs to ('category_name'), and the 'type' of the instrument, namely stocks, market indices, or ETFs.
Based on this dataset, I want you to create two bar plots:
A bar plot that shows the top 25 countries with respect to the number of tickers. The bars should be arranged with the first one being the largest, etc.
Similarly, a bar plot with the top 25 market sectors ('category_name'), again arranged in descending order. A sketch for the country plot appears after the chunk below.
'''{r bar_plots_country_category}
# YOUR CODE GOES HERE
'''
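A sketch for the country plot, assuming 'tickers' has a 'country' column; the sector plot follows the same pattern with 'category_name':
'''{r bar_plots_sketch, eval=FALSE}
tickers %>%
  count(country, sort = TRUE) %>%  # number of tickers per country
  slice_max(n, n = 25) %>%         # keep the top 25
  ggplot(aes(x = n, y = fct_reorder(country, n))) + # largest bar first
  geom_col() +
  theme_bw() +
  labs(title = "Top 25 countries by number of tickers",
       x = "Number of tickers",
       y = NULL)
'''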
Next, choose around a dozen stocks, preferably from your country, or a sector ('category_name') that interests you. If I had chosen AAPL and MSFT, I would create a variable 'my_tickers <- c("AAPL", "MSFT")' and would then use tidyquant to download the last 3 years' worth of data.
'''{r get_price_data, message=FALSE, warning=FALSE}
my_tickers <- c("AAPL", "MSFT") # enter chosen tickers here -- *NOT* AAPL or MSFT, sorry

myStocks <- my_tickers %>%
  tq_get(get = "stock.prices",
         from = "2016-07-01",
         to = "2019-09-01") %>%
  group_by(symbol)

glimpse(myStocks) # examine the structure of the resulting data frame
'''
Financial performance and CAPM analysis depend on returns. If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices we downloaded, our first step is to calculate daily and monthly returns.
'''{r calculate_returns, message=FALSE, warning=FALSE}
# calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select = adjusted,
               mutate_fun = periodReturn,
               period = "monthly",
               type = "arithmetic",
               col_rename = "monthly_returns")
'''
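Since the text mentions daily returns as well, a sketch of the analogous daily calculation:
'''{r calculate_daily_returns_sketch, message=FALSE, warning=FALSE, eval=FALSE}
# daily returns follow the same pattern as the monthly ones above
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select = adjusted,
               mutate_fun = periodReturn,
               period = "daily",
               type = "arithmetic",
               col_rename = "daily_returns")
'''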
Create a table where you summarise monthly returns for each of the stocks: min, max, median, mean, SD. A sketch follows the chunk below.
'''{r summarise_monthly_returns}
# YOUR CODE GOES HERE
'''
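A sketch of one way to build this table from 'myStocks_returns_monthly':
'''{r summarise_returns_sketch, eval=FALSE}
# summary statistics of monthly returns, one row per stock
myStocks_returns_monthly %>%
  group_by(symbol) %>%
  summarise(min_return    = min(monthly_returns),
            max_return    = max(monthly_returns),
            median_return = median(monthly_returns),
            mean_return   = mean(monthly_returns),
            sd_return     = sd(monthly_returns))
'''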
Plot a faceted density plot, using 'geom_density()', for each of the stocks. A sketch follows the chunk below.
'''{r density_monthly_returns}
# YOUR CODE GOES HERE
'''
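A sketch using 'facet_wrap()' to give each stock its own panel:
'''{r density_returns_sketch, eval=FALSE}
# one density panel per stock
ggplot(myStocks_returns_monthly, aes(x = monthly_returns)) +
  geom_density() +
  facet_wrap(~symbol) +
  theme_bw() +
  labs(title = "Distribution of monthly returns by stock",
       x = "Monthly return",
       y = "Density")
'''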
What can you infer from this plot? Which stock has the highest/lowest volatility?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y-axis and the risk (standard deviation) on the X-axis. You can use different colours for different tickers, but more importantly please print the label of each ticker next to the stock, using 'geom_text(aes(label = ticker))'. A sketch follows the chunk below.
'''{r risk_return_plot}
# YOUR CODE GOES HERE
'''
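A sketch; note that in the tidyquant output the ticker column is called 'symbol', so the label aesthetic uses 'symbol' rather than 'ticker':
'''{r risk_return_sketch, eval=FALSE}
# mean monthly return (expected return) vs standard deviation (risk)
myStocks_returns_monthly %>%
  group_by(symbol) %>%
  summarise(mean_return = mean(monthly_returns),
            sd_return   = sd(monthly_returns)) %>%
  ggplot(aes(x = sd_return, y = mean_return, colour = symbol)) +
  geom_point() +
  geom_text(aes(label = symbol), hjust = -0.2) + # ticker label next to each point
  theme_bw() +
  labs(title = "Risk vs return of chosen stocks",
       x = "Risk (SD of monthly returns)",
       y = "Expected monthly return (mean)")
'''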
What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
Challenge 1: Ridge plots
Using your newfound visualisation skills (and referencing [the 'ggridges' vignette](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html)), make a ridge plot showing either
- the distribution of temperature anomalies from the NASA dataset over different periods, or
- the distribution of IMDB ratings by genre, as shown below.
[Figure: ridge plot of the distribution of IMDB ratings by genre]
Save the plot you create as a PNG file in your 'images' folder with 'ggsave()'. A sketch follows below.
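A sketch for the first option, reusing the 'comparison' dataframe from Task 1; the file name 'ridge_plot.png' is an illustrative choice:
'''{r ridges_sketch, eval=FALSE}
library(ggridges)

# one ridge per time period, showing the distribution of anomalies
ridge_plot <- ggplot(comparison, aes(x = delta, y = interval, fill = interval)) +
  geom_density_ridges(alpha = 0.4) +
  theme_bw() +
  labs(title = "Distribution of temperature anomalies by period",
       x = "Temperature deviation (delta)",
       y = NULL)

ggsave(here::here("images", "ridge_plot.png"), plot = ridge_plot)
'''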