CS 5834: Intro to Urban Computing
NYC Taxi Data Analysis and Modeling
In this homework, you will process taxi data collected in New York City, use regression models to predict the trip fare amount, and use different classification models to predict whether the tip rate was below 20% or not.
Problem 1. Download and process data.
1. The NYC taxi data can be found
In this data, the yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. You are free to choose the yellow taxi trip sheet (in CSV format) for any month of 2017 for your homework.
2. Randomly sample 10,000 trip records to solve Problems 2 and 3.
3. Create a dataset with the following attributes:
a. VendorID
b. Day/night. Please convert 'tpep_pickup_datetime' to 1 (day) or 0 (night).
c. Passenger_count
d. Trip_distance
e. PULocationID
f. DOLocationID
g. Payment_type
h. Payment_type_cat1, Payment_type_cat2, ...
Note: Please convert 'payment_type' to dummy variables.
i. Fare_amount
j. Tip_amount
k. Tip_rate_20.
First, calculate 'tip_rate' as 'tip_rate' = 'Tip_amount' / 'Fare_amount'. Second, if 'tip_rate' < 0.2, set 'Tip_rate_20' = 0; otherwise, set it to 1.
4. Save the dataset as a CSV file. The first line of the CSV file should contain the attribute names listed in item 3.
5. Plot the distributions of Fare_amount and Tip_amount.
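The attribute construction in item 3 could be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the lowercase raw column names (passenger_count, trip_distance, fare_amount, tip_amount) are assumed to match the trip sheet, the 06:00-18:00 definition of "day" is a hypothetical choice the assignment does not specify, and dropping non-positive fares is an added assumption so that the tip rate is well defined. The toy records are made up for the demonstration.

```python
import pandas as pd


def build_dataset(trips, day_start=6, day_end=18):
    """Derive attributes a-k from raw trip records (column names are assumed)."""
    # Assumption: discard non-positive fares so Tip_amount / Fare_amount is defined.
    trips = trips[trips["fare_amount"] > 0].reset_index(drop=True)
    hour = pd.to_datetime(trips["tpep_pickup_datetime"]).dt.hour
    out = pd.DataFrame({
        "VendorID": trips["VendorID"],
        # Hypothetical rule: pick-ups from day_start up to (not including) day_end count as day.
        "Day_night": ((hour >= day_start) & (hour < day_end)).astype(int),
        "Passenger_count": trips["passenger_count"],
        "Trip_distance": trips["trip_distance"],
        "PULocationID": trips["PULocationID"],
        "DOLocationID": trips["DOLocationID"],
        "Payment_type": trips["payment_type"],
    })
    # One 0/1 column per payment type: Payment_type_cat1, Payment_type_cat2, ...
    dummies = pd.get_dummies(trips["payment_type"], prefix="Payment_type_cat",
                             prefix_sep="", dtype=int)
    out = pd.concat([out, dummies], axis=1)
    out["Fare_amount"] = trips["fare_amount"]
    out["Tip_amount"] = trips["tip_amount"]
    tip_rate = out["Tip_amount"] / out["Fare_amount"]
    out["Tip_rate_20"] = (tip_rate >= 0.2).astype(int)
    return out


# Made-up records standing in for a random sample of the real trip sheet.
toy = pd.DataFrame({
    "VendorID": [1, 2, 1, 2],
    "tpep_pickup_datetime": ["2017-03-01 08:15:00", "2017-03-01 23:40:00",
                             "2017-03-02 12:05:00", "2017-03-02 03:30:00"],
    "passenger_count": [1, 2, 1, 3],
    "trip_distance": [1.2, 5.4, 0.8, 10.1],
    "PULocationID": [161, 230, 48, 132],
    "DOLocationID": [237, 79, 68, 265],
    "payment_type": [1, 2, 1, 1],
    "fare_amount": [7.0, 18.5, 5.5, 0.0],
    "tip_amount": [1.5, 0.0, 1.0, 0.0],
})
dataset = build_dataset(toy)
csv_text = dataset.to_csv(index=False)  # first line holds the attribute names
```

On the real data, the sample could be drawn with pandas' `DataFrame.sample(n=10000, random_state=...)` before calling the function, the CSV could be written by passing a file path to `to_csv`, and the two distributions could be drawn as histograms, for example with matplotlib's `hist`.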
Problem 2. Trip fare amount prediction
1. Build a linear regression model to predict the trip fare amount. You are free to use packages like sklearn or write your own code.
a. Here is a link to the linear regression module of sklearn package.
b. Use attributes b, c, d, e, f, h as input features and attribute i as the output.
c. Evaluate your model with 5-fold cross-validation and report the mean-squared error (MSE) averaged over the folds, together with its standard deviation. You can use this link to calculate MSE.
2. Similarly, build a KNN regression model to predict the trip fare amount. The model should be evaluated with 5-fold cross-validation. In each fold, 80% of the data should be used for training and 20% for testing.
You must choose the optimal value of K between 1 and 10 based on one half of the testing data, then calculate the MSE on the other half of the testing data with the best K. Finally, report the averaged MSE and standard deviation.
3. Compare the results of the two models.
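One possible reading of the evaluation protocol for both models is sketched below. It is an illustration only: the helper names `linear_cv_mse` and `knn_cv_mse` are hypothetical, the synthetic features stand in for attributes b, c, d, e, f, h, and the random half/half split of each test fold is one assumed way to separate the data used to choose K from the data used to score it. The standard deviation is the population standard deviation over the five fold scores (NumPy's default).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor


def linear_cv_mse(X, y, seed=0):
    """Return (mean, std) of the test MSE over 5 folds for linear regression."""
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, test_idx in folds.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(fold_mse)), float(np.std(fold_mse))


def knn_cv_mse(X, y, k_values=range(1, 11), seed=0):
    """Return (mean, std) of the MSE over 5 folds, choosing K inside each fold."""
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, test_idx in folds.split(X):
        # Half of the held-out 20% selects K; the other half gives the reported MSE.
        select_idx, eval_idx = train_test_split(test_idx, test_size=0.5,
                                                random_state=seed)
        models = {k: KNeighborsRegressor(n_neighbors=k).fit(X[train_idx], y[train_idx])
                  for k in k_values}
        best_k = min(models, key=lambda k: mean_squared_error(
            y[select_idx], models[k].predict(X[select_idx])))
        fold_mse.append(mean_squared_error(y[eval_idx],
                                           models[best_k].predict(X[eval_idx])))
    return float(np.mean(fold_mse)), float(np.std(fold_mse))


# Synthetic stand-in data: the target depends linearly on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.5 + 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

lin_mean, lin_std = linear_cv_mse(X, y)
knn_mean, knn_std = knn_cv_mse(X, y)
```

With the real dataset, `X` and `y` would be NumPy arrays taken from the feature and Fare_amount columns, for example via `DataFrame.to_numpy()`.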
Problem 3. Tip rate classification.
Sample 1000 trip records from your data, and solve the following problems.
1. Use a KNN model to predict the Tip_rate_20.
a. Set K in KNN to 5.
b. Use attributes b, c, d, h as input features.
c. Use attribute k as class labels.
d. Use Euclidean distance.
e. Run 5-fold cross-validation to evaluate your model.
f. Report precision, recall and F-score of the classification.
g. Please follow this link to KNN in sklearn:
2. Use a decision tree to predict the Tip_rate_20.
a. Build the decision tree with attributes b, c, d, g.
b. Use attribute k as class labels.
c. Use 5-fold cross-validation to evaluate your model.
d. Report precision, recall and F-score of the classification.
e. Here is the link to the Decision Tree in the sklearn package.
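The two classifiers and their evaluation could be sketched as below. This is a hedged illustration rather than the expected solution: `cv_report` is a hypothetical helper, the synthetic features stand in for the listed attributes, and the choice to pool the out-of-fold predictions before computing precision, recall and F-score (instead of averaging per-fold scores) is an assumption, since the assignment does not say which to use.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def cv_report(model, X, y, seed=0):
    """Precision, recall and F-score for class 1 from pooled 5-fold predictions."""
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    # Every record is predicted exactly once, by a model that never saw it in training.
    y_pred = cross_val_predict(model, X, y, cv=folds)
    precision, recall, f_score, _ = precision_recall_fscore_support(
        y, y_pred, average="binary", zero_division=0)
    return precision, recall, f_score


# Synthetic stand-in data: the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

knn_scores = cv_report(KNeighborsClassifier(n_neighbors=5, metric="euclidean"), X, y)
tree_scores = cv_report(DecisionTreeClassifier(random_state=0), X, y)
```

For the KNN model, `X` would hold attributes b, c, d, h; for the decision tree, attributes b, c, d, g; in both cases `y` would be Tip_rate_20.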
Problem 4. Subway Services
Suppose you are the CTO of WMATA and are looking to improve your services. If you are not familiar with WMATA, it runs the metro system in the greater Washington, DC area.
Every traveler buys a metro card and then uses it on automated fare collection systems while both entering and exiting stations. Many hotels and online travel websites also sell the metro card (apart from sales at stations). A major problem you need to solve is to differentiate between tourist trips and normal commuters in your system.
1. Given your knowledge of ML, can you pose this as one of the tasks we have seen before in class? Make sure you clearly describe how you will create your dataset and justify why your setup makes sense.
2. Will your answer change in any way if WMATA collected the fare directly at the entry point only (so no card swipe at exit)?
3. Finally, assuming you have built this ML model to differentiate these commuters, how can you use your knowledge for improving the user experience?
There is no one 'right' answer to the questions above; we are looking to see whether you can design well and reason about your choices and responses. Please try to keep your answer brief (3-4 lines) for each question.