Obtain a Pearson correlation matrix relating variables count

Reference no: EM132374126

Statistics for Data Science Assignment - Capital BikeShare

Bike sharing systems are a new generation of bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, with some of the best and largest found in Hangzhou (China), Paris (France), London (England), New York City (US) and Montreal (Canada). Great interest in these systems exists due to their role in addressing traffic congestion, environmental impact and population health issues in big cities.

The data for this assignment comes from one such program, called Capital Bikeshare, operating in Washington in the US. It has over 3000 bicycles that can be rented from over 350 stations across Washington, D.C., Arlington and Alexandria, VA and Montgomery County, MD. Their website encourages users to check out bikes for a trip to work, to run errands, go shopping, or visit friends and family. Users can join Capital Bikeshare for one to three days (casual membership), or for a month or a year (registered membership). Access to the Capital Bikeshare fleet of bikes is available 24 hours a day, 365 days a year. The first 30 minutes of each trip are free.

You will use data derived from Capital Bikeshare trip records to build a statistical model for the purposes of predicting the number of rentals per day.

References and Data Sources:

1. Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science.

2. Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

Data files for this assignment:

The main data file for this assignment is called daily.sas7bdat and contains daily counts of bike rentals for 2011 and 2012, derived from Capital Bikeshare trip history data, with additional weather and seasonal information. The data was downloaded from the UCI Machine Learning Repository. Variables in that file are as follows:

Variable	Description
instant	Record index
dteday	Date
season	Winter, spring, summer or fall (northern hemisphere)
yr	0 = 2011, 1 = 2012
month	Month (January to December)
weekday	Day of the week (Monday to Sunday)
workingday	Working day = 1, weekend or public holiday = 0
temp	Normalised temperature in degrees Celsius; observed temperature divided by 41 (max)
atemp	Normalised 'feels like' temperature in degrees Celsius; values divided by 50 (max)
hum	Normalised humidity; observed values divided by 100 (max)
windspeed	Normalised wind speed; values divided by 67 (max)
casual	Count of casual users
registered	Count of registered users
count	Total count of bike rentals (casual plus registered)
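
The normalisation described in the table can be undone by multiplying each weather variable by its stated divisor. The following is a minimal Python sketch of that arithmetic, offered only as an illustration (the assignment itself is to be completed in SAS). The function name `denormalise` and the example row are hypothetical; the divisors 41, 50, 100 and 67 are taken from the table, and the units of humidity and wind speed are not stated in the source.

```python
# Divisors copied from the variable table above. The source does not state
# the units of hum or windspeed, so results are simply "original scale".
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalise(record):
    """Return a copy of record with the normalised weather fields rescaled."""
    restored = dict(record)  # copy, so the input row is left untouched
    for name, divisor in SCALE.items():
        if name in restored:
            restored[name] = restored[name] * divisor
    return restored


# Hypothetical row: these values are invented for illustration only.
example = {"temp": 0.5, "atemp": 0.48, "hum": 0.6, "windspeed": 0.2, "count": 4000}
restored = denormalise(example)
```

Fields that are not in `SCALE`, such as `count`, pass through unchanged.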

The second file for this assignment is called random_sample.xlsx and it can be downloaded from the Data Files folder on the course website. The file contains a stratified sample of bike rentals taken from the Capital Bikeshare trip history data for the second quarter of 2012. Variables in that file are as follows:

Variable	Description
Duration	Trip duration, in seconds
Start_date	Date and time stamp for the beginning of the trip
Start_station	Address for the location from which the bike was rented
End_date	Date and time stamp for the end of the trip
End_station	Address for the location to which the bike was returned
Bike_number	Bike identification number
User_type	Type of user (casual or registered)

Assignment tasks:

Question 1 -

(a) Use SAS to study the distribution of the total daily number of rentals. Obtain measures of location, dispersion, skewness and kurtosis. Obtain a boxplot, histogram and a quantile-quantile plot. Also carry out Normal goodness-of-fit tests. What are the key features of this distribution?

(b) Now use SAS to obtain boxplots of the total daily number of rentals according to season and by type of day (working day vs weekend or public holiday). What do the boxplots suggest about the pattern, if any, of bike rentals?

(c) In 2012, the east coast of the United States was struck by Hurricane Sandy. Is this severe weather event evident in your results? Provide a relevant graph to support your answer.

Question 2 -

(a) Obtain a Pearson correlation matrix relating variables count, atemp, temp, hum and windspeed. Also obtain a scatterplot matrix of the same variables. Discuss the relationships.
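
As a reminder of what a Pearson correlation matrix contains, the sketch below computes one by hand in Python. It is an illustration of the calculation only and not a substitute for the SAS output the question asks for. The helper names `pearson` and `correlation_matrix` are hypothetical, and the toy values are invented; they are not taken from daily.sas7bdat.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # A constant column (sxx or syy equal to 0) has no defined correlation
    # and raises ZeroDivisionError here.
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data, invented for illustration; NOT values from the assignment file.
toy = {
    "count": [1200, 2500, 4100, 5200, 6100],
    "atemp": [0.20, 0.35, 0.50, 0.62, 0.70],
    "hum": [0.80, 0.65, 0.60, 0.55, 0.70],
}
matrix = correlation_matrix(toy)
```

Each entry `matrix[a][b]` lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, which is a quick sanity check to apply to any correlation output.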

(b) Fit a simple regression model relating count to atemp, with count as the dependent variable, and determine the residuals from this regression. Discuss the fitted relationship and the goodness of fit. Examine residual plots and influence diagnostics and comment on the residual behaviour.
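
The quantities named in part (b), namely the fitted line, the residuals and a goodness-of-fit measure, can be illustrated with an ordinary least squares fit written out by hand. This Python sketch is an assumption-laden illustration rather than the required SAS analysis: the function name `fit_simple_regression` is hypothetical, the toy data are invented, and no influence diagnostics are computed.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, residuals, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus fitted value.
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, residuals, r_squared


# Toy data, invented for illustration; NOT values from the assignment file.
atemp = [0.20, 0.35, 0.50, 0.62, 0.70]
count = [1200, 2500, 4100, 5200, 6100]
slope, intercept, residuals, r_squared = fit_simple_regression(atemp, count)
```

With an intercept in the model the residuals sum to zero (up to rounding), so any pattern seen in a residual plot reflects structure the straight line has missed rather than an overall offset.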

(c) Obtain a correlation matrix relating the residuals from part (b) to variables temp, hum and windspeed. Comment on these correlations. What do they tell you about the importance of these variables for predicting the daily count of bike rentals?

(d) Using the correlations in part (c) identify a set of potential explanatory variables. Regress count on your selection of variables. Discuss the fitted relationship and the goodness of fit. Also examine and discuss residual patterns.

(e) Extend your multiple regression model from part (d) to include categorical predictors. You can use stepwise selection to help you find the most parsimonious (simplest) model with the highest R-square. Report and interpret in detail only your final model, but do indicate how it was obtained and why it was considered the 'best'.

In building your model consider as many potential explanatory variables as possible (you may need to define additional dummy variables). Be sure to check, and if necessary correct, for collinearity.

Question 3 -

(a) Upload the data file random_sample.xlsx into a folder of your choice in your home directory on the SAS server and then use the import procedure to convert the data file into a SAS table. The code snippet shown below assumes that the Excel data file was uploaded directly to the home directory in SAS Studio, and proc print is used to check that the data was converted correctly into SAS format:

(b) Is there a statistically significant difference in duration of bike trips by casual versus registered users? If so, which trips are typically longer? Check the necessary conditions and perform an appropriate hypothesis test. Should it be a two-sample or a paired t-test? You may need to use a transformation (e.g. log) in order to justify performing a t-test on this data. Justify your choices and discuss your results.
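
To make the log-transform-then-test idea concrete, the Python sketch below computes a two-sample t statistic on log durations. It is an illustration under stated assumptions and not the required SAS analysis: it uses Welch's unequal-variance form, which is one possible choice that still has to be justified from the data; the function name `welch_t` is hypothetical; the durations are invented; and no p-value is computed, since that needs the t distribution.

```python
from math import log, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, df


# Invented trip durations in seconds; NOT values from random_sample.xlsx.
casual_secs = [1800, 2400, 3600, 900, 5400]
registered_secs = [600, 480, 900, 720, 300]

# Durations are typically right-skewed, so compare them on the log scale.
log_casual = [log(s) for s in casual_secs]
log_registered = [log(s) for s in registered_secs]
t_stat, df = welch_t(log_casual, log_registered)
```

A positive `t_stat` here means the first group (casual) has the larger mean log duration. On the log scale, a difference in means corresponds to a ratio of geometric means on the original scale, which is worth keeping in mind when reporting results.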

Question 4 -

Write a summary of your findings from Questions 1 to 3. Keep the technical details of the analyses that led you to these conclusions to the absolute minimum. Rather, focus on practical significance and present your findings in non-specialist terms. A few paragraphs (up to a page) will be sufficient.

Note - Please include screenshots of SAS graphs where needed, each followed by text explaining it, as required by the questions. There is no need to explain graphs if the question does not ask for it.


Instructions: This assignment is worth 25% of your final grade. It is due no later than 11pm on Sunday 22 September, at the end of Week 8. You will need to submit your assignment via Learnonline. There is no need to include a cover sheet, as one is generated automatically by the Learnonline system.

The submitted assignment needs to be a single file, in either Microsoft Word (doc or docx) or PDF format. The assignment is out of 120 marks. To achieve maximum marks for each question, you should aim to:

Complete the requested statistical analysis in SAS using appropriate tasks or procedures. (40%)
Provide and interpret only the output most relevant to the question. Do not include every piece of output produced by SAS! (40%)
Discuss the results in the context of the question. (20%)

Assignments submitted late, without an extension being granted, will attract a penalty of 10 marks for each day, or any part thereof, beyond the due date and time.
