Reference no: EM132374126
Statistics for Data Science Assignment - Capital BikeShare
Bike-sharing systems are a new generation of bike rental in which the whole process, from membership to rental and return, is automated. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, with some of the best and largest found in Hangzhou (China), Paris (France), London (England), New York City (US) and Montreal (Canada). There is great interest in these systems because of their role in addressing traffic congestion, environmental impact and population health issues in big cities.
The data for this assignment comes from one such program, called Capital Bikeshare, operating in Washington in the US. It has over 3000 bicycles that can be rented from over 350 stations across Washington, D.C., Arlington and Alexandria, VA and Montgomery County, MD. Their website encourages users to check out bikes for a trip to work, to run errands, go shopping, or visit friends and family. Users can join Capital Bikeshare for one to three days (casual membership), or for a month or a year (registered membership). Access to the Capital Bikeshare fleet of bikes is available 24 hours a day, 365 days a year. The first 30 minutes of each trip are free.
You will use data derived from Capital Bikeshare trip records to build a statistical model for the purposes of predicting the number of rentals per day.
References and Data Sources:
1. Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
2. Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
Data files for this assignment:
The main data file for this assignment is called daily.sas7bdat and contains daily counts of bike rentals for 2011 and 2012, derived from Capital Bikeshare trip history data, with additional weather and seasonal information. The data was downloaded from the UCI Machine Learning Repository. Variables in that file are as follows:
Variable | Description
instant | Record index
dteday | Date
season | Winter, spring, summer or fall (northern hemisphere)
yr | 0 = 2011, 1 = 2012
month | Month (January to December)
weekday | Day of the week (Monday to Sunday)
workingday | Working day = 1, weekend or public holiday = 0
temp | Normalised temperature in degrees Celsius; observed temperature divided by 41 (max)
atemp | Normalised 'feels like' temperature in degrees Celsius; values divided by 50 (max)
hum | Normalised humidity; observed values divided by 100 (max)
windspeed | Normalised wind speed; values divided by 67 (max)
casual | Count of casual users
registered | Count of registered users
count | Total count of bike rentals (casual plus registered)
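As a quick check that the file has been read correctly, the table can be inspected before any analysis. The following is a minimal sketch only: the folder `~/bikeshare` and the library name `bike` are assumptions, not part of the assignment, and should be replaced with wherever daily.sas7bdat was actually saved.

```sas
/* Assumption: daily.sas7bdat was uploaded to this (hypothetical) folder */
libname bike "~/bikeshare";

/* List variable names, types and formats */
proc contents data=bike.daily;
run;

/* Show the first five records */
proc print data=bike.daily(obs=5);
run;
```

The proc contents output shows whether variables such as season and weekday are stored as text or as formatted numbers, which matters later when they are used as categorical predictors.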
The second file for this assignment is called random_sample.xlsx and it can be downloaded from the Data Files folder on the course website. The file contains a stratified sample of bike rentals taken from the Capital Bikeshare trip history data for the second quarter of 2012. Variables in that file are as follows:
Variable | Description
Duration | Trip duration, in seconds
Start_date | Date and time stamp for the beginning of the trip
Start_station | Address for the location from which the bike was rented
End_date | Date and time stamp for the end of the trip
End_station | Address for the location to which the bike was returned
Bike_number | Bike identification number
User_type | Type of user (casual or registered)
Assignment tasks:
Question 1 -
(a) Use SAS to study the distribution of the total daily number of rentals. Obtain measures of location, dispersion, skewness and kurtosis. Obtain a boxplot, histogram and a quantile-quantile plot. Also carry out Normal goodness-of-fit tests. What are the key features of this distribution?
(b) Now use SAS to obtain boxplots of the total daily number of rentals according to season and by type of day (working day vs weekend or public holiday). What do the boxplots suggest about the pattern, if any, of bike rentals?
(c) In 2012, the east coast of the United States was struck by Hurricane Sandy. Is this severe weather event evident in your results? Provide a relevant graph to support your answer.
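The procedures needed for Question 1 might be set up along the following lines. This is a sketch under stated assumptions, not a model answer: the library path is hypothetical, yr is assumed to be numeric (1 = 2012, as in the variable table), and dteday is assumed to be stored as a SAS date value.

```sas
/* Assumption: daily.sas7bdat was uploaded to this (hypothetical) folder */
libname bike "~/bikeshare";
ods graphics on;

/* (a) Moments (including skewness and kurtosis), quantiles,
       Normal goodness-of-fit tests, histogram and Q-Q plot */
proc univariate data=bike.daily normal;
    var count;
    histogram count / normal;
    qqplot count / normal(mu=est sigma=est);
run;

/* (a) Boxplot of the total daily count */
proc sgplot data=bike.daily;
    vbox count;
run;

/* (b) Boxplots by season and by type of day */
proc sgplot data=bike.daily;
    vbox count / category=season;
run;

proc sgplot data=bike.daily;
    vbox count / category=workingday;
run;

/* (c) Daily counts over time for 2012 only (assumes yr = 1 codes 2012) */
proc sgplot data=bike.daily;
    where yr = 1;
    series x=dteday y=count;
run;
```

The `normal` option on proc univariate requests the goodness-of-fit tests, while the series plot in part (c) lets unusually low days be matched against calendar dates.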
Question 2 -
(a) Obtain a Pearson correlation matrix relating variables count, atemp, temp, hum and windspeed. Also obtain a scatterplot matrix of the same variables. Discuss the relationships.
(b) Fit a simple regression model relating count to atemp, with count as the dependent variable, and determine the residuals from this regression. Discuss the fitted relationship and the goodness of fit. Examine residual plots and influence diagnostics and comment on the residual behaviour.
(c) Obtain a correlation matrix relating the residuals from part (b) to variables temp, hum and windspeed. Comment on these correlations. What do they tell you about the importance of these variables for predicting the daily count of bike rentals?
(d) Using the correlations in part (c) identify a set of potential explanatory variables. Regress count on your selection of variables. Discuss the fitted relationship and the goodness of fit. Also examine and discuss residual patterns.
(e) Extend your multiple regression model from part (d) to include categorical predictors. You can use stepwise selection to help you find the most parsimonious (simplest) model with the highest R-square. Report and interpret in detail only your final model, but do indicate how it was obtained and why it was considered the 'best'.
In building your model consider as many potential explanatory variables as possible (you may need to define additional dummy variables). Be sure to check, and if necessary correct, for collinearity.
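One possible skeleton for the Question 2 workflow is sketched below. The library path, the output data set name `work.resid_atemp`, the candidate predictor list and the 0.05 entry and stay levels are all illustrative assumptions; the actual variable selection is left to the analysis.

```sas
/* Assumption: daily.sas7bdat was uploaded to this (hypothetical) folder */
libname bike "~/bikeshare";
ods graphics on;

/* (a) Pearson correlations and a scatterplot matrix */
proc corr data=bike.daily pearson plots=matrix(histogram);
    var count atemp temp hum windspeed;
run;

/* (b) Simple regression of count on atemp.
       The r and influence options print one row per day;
       residuals and influence measures are also saved. */
proc reg data=bike.daily;
    model count = atemp / r influence;
    output out=work.resid_atemp r=resid p=pred cookd=cooks_d h=leverage;
run;
quit;

/* (c) Correlate the saved residuals with the other weather variables */
proc corr data=work.resid_atemp pearson;
    var resid;
    with temp hum windspeed;
run;

/* Collinearity check: variance inflation factors and tolerances */
proc reg data=bike.daily;
    model count = atemp temp hum windspeed / vif tol;
run;
quit;

/* (e) Stepwise selection with categorical predictors
       (illustrative candidate list and significance levels) */
proc glmselect data=bike.daily;
    class season month weekday;
    model count = atemp hum windspeed yr workingday season month weekday
        / selection=stepwise(select=sl sle=0.05 sls=0.05) stats=adjrsq;
run;
```

The class statement in proc glmselect creates the dummy variables for the categorical predictors automatically, so they do not have to be coded by hand for the selection step.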
Question 3 -
(a) Upload the data file random_sample.xlsx into a folder of your choice in your home directory on the SAS server and then use the import procedure to convert the data file into a SAS table. The code snippet shown below assumes that the Excel data file was uploaded directly to the home directory in SAS Studio, and proc print is used to check that the data was converted correctly into SAS format:
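A minimal sketch of such an import is given here. The path `~/random_sample.xlsx` and the output table name `work.random_sample` are assumptions based on the description above (file uploaded directly to the home directory) and may need adjusting for a particular SAS Studio setup.

```sas
/* Assumption: the Excel file sits directly in the home directory */
proc import datafile="~/random_sample.xlsx"
    out=work.random_sample
    dbms=xlsx
    replace;
    getnames=yes;   /* take variable names from the first row */
run;

/* Check that the first few rows were converted correctly */
proc print data=work.random_sample(obs=10);
run;
```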
(b) Is there a statistically significant difference in duration of bike trips by casual versus registered users? If so, which trips are typically longer? Check the necessary conditions and perform an appropriate hypothesis test. Should it be a two-sample or a paired t-test? You may need to use a transformation (e.g. log) in order to justify performing a t-test on this data. Justify your choices and discuss your results.
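If a two-sample test on log-transformed durations is the approach chosen, it could be set up roughly as follows. This is a sketch only: the file path and the names `work.random_sample`, `work.trips` and `log_duration` are assumptions, and Duration is assumed to be imported as a numeric variable.

```sas
/* Assumption: the Excel file sits directly in the home directory */
proc import datafile="~/random_sample.xlsx"
    out=work.random_sample
    dbms=xlsx
    replace;
    getnames=yes;
run;

/* Log-transform the duration (seconds); non-positive values stay missing */
data work.trips;
    set work.random_sample;
    if Duration > 0 then log_duration = log(Duration);
run;

/* Check the shape of each group's distribution before testing */
proc univariate data=work.trips normal;
    class User_type;
    var log_duration;
    histogram log_duration / normal;
run;

/* Two-sample t-test: pooled and Satterthwaite results,
   plus the folded F test for equality of variances */
proc ttest data=work.trips;
    class User_type;
    var log_duration;
run;
```

Because the test is run on the log scale, a difference in means corresponds to a ratio of typical durations on the original scale, which is worth keeping in mind when interpreting the result.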
Question 4 -
Write a summary of your findings from Questions 1 to 3. Keep the technical details of the analyses that led you to these conclusions to the absolute minimum. Rather, focus on practical significance and present your findings in non-specialist terms. A few paragraphs (up to a page) will be sufficient.
Note - Please include screenshots of SAS graphs where needed, each followed by text explaining it, as the questions require. There is no need to explain a graph if the question does not ask for it.