CS 5834 Intro to Urban Computing: Assignment Problem

Reference no: EM132378054

CS 5834 : Intro to Urban Computing

NYC Taxi Data Analysis and Modeling

In this homework, you will process taxi data collected in New York City, use regression models to predict the trip fare amount, and use classification models to predict whether the tip rate was below 20% or at least 20%.

Problem 1. Download and process data.

1. The NYC taxi data can be found

In this data, the yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. You are free to choose a yellow taxi trip sheet (in CSV format) from any month of 2017 for your homework.

2. Randomly sample 10,000 trip records to use in Problems 2 and 3.

3. Create a dataset with the following attributes:

a. VendorID

b. Day/night. Convert 'tpep_pickup_datetime' to 1 (day) or 0 (night).

c. Passenger_count

d. Trip_distance

e. PULocationID

f. DOLocationID

g. Payment_type

h. Payment_type_cat1, Payment_type_cat2, ...

Note: Convert 'payment_type' to dummy variables.

i. Fare_amount

j. Tip_amount

k. Tip_rate_20.

First, calculate 'tip_rate' as 'tip_rate' = 'Tip_amount' / 'Fare_amount'. Second, if 'tip_rate' < 0.2, set 'Tip_rate_20' = 0; otherwise, set it to 1.

4. Save the dataset as a CSV file. The first line of the CSV file should list the attribute names described in item 3.

5. Plot the distributions of Fare_amount and Tip_amount.
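The processing steps in items 2 to 4 could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch under stated assumptions, not a reference solution: the in-memory RAW sample is made up, the column name Day_night, the 06:00 to 18:00 definition of "day", and the choice to skip trips with a non-positive fare (where the tip rate is undefined) are all assumptions that the assignment does not specify. Plotting (item 5) is left out.

```python
import csv
import io
import random
from datetime import datetime

# Tiny in-memory stand-in for a 2017 yellow-taxi CSV (values are made up).
RAW = """VendorID,tpep_pickup_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,tip_amount
1,2017-03-01 08:15:00,1,2.5,161,237,1,10.0,2.5
2,2017-03-01 23:40:00,2,1.1,48,50,2,6.5,0.0
1,2017-03-02 14:05:00,1,7.8,132,230,1,30.0,3.0
2,2017-03-02 03:30:00,3,0.9,79,4,1,0.0,0.0
"""

DAY_START, DAY_END = 6, 18  # assumption: "day" means 06:00 to 17:59
SAMPLE_SIZE = 10_000


def to_row(rec, payment_types):
    """Turn one raw trip record into the attributes a to k of item 3."""
    hour = datetime.strptime(rec["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S").hour
    fare = float(rec["fare_amount"])
    tip = float(rec["tip_amount"])
    row = {
        "VendorID": int(rec["VendorID"]),
        "Day_night": 1 if DAY_START <= hour < DAY_END else 0,
        "Passenger_count": int(rec["passenger_count"]),
        "Trip_distance": float(rec["trip_distance"]),
        "PULocationID": int(rec["PULocationID"]),
        "DOLocationID": int(rec["DOLocationID"]),
        "Payment_type": int(rec["payment_type"]),
    }
    # One 0/1 dummy column per payment type seen in the sample.
    for p in payment_types:
        row[f"Payment_type_cat{p}"] = 1 if row["Payment_type"] == p else 0
    row["Fare_amount"] = fare
    row["Tip_amount"] = tip
    row["Tip_rate_20"] = 0 if tip / fare < 0.2 else 1
    return row


# Assumption: trips with fare <= 0 are skipped because tip_rate is undefined.
records = [r for r in csv.DictReader(io.StringIO(RAW)) if float(r["fare_amount"]) > 0]
random.seed(0)
sample = random.sample(records, min(SAMPLE_SIZE, len(records)))
payment_types = sorted({int(r["payment_type"]) for r in sample})
dataset = [to_row(r, payment_types) for r in sample]

# Written to a string here; swap in open("taxi_sample.csv", "w", newline="")
# (hypothetical file name) to save to disk.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(dataset[0].keys()))
writer.writeheader()
writer.writerows(dataset)
```

With a real trip sheet, RAW would be replaced by the opened CSV file, and the header row written by DictWriter gives the attribute names required in item 4.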

Problem 2. Trip fare amount prediction

1. Build a linear regression model to predict the trip fare amount. You are free to use packages like sklearn or write your own code.

a. Here is a link to the linear regression module of sklearn package.

b. Use attributes b, c, d, e, f, h as input features and attribute i as the output.

c. Evaluate your model with 5-fold cross-validation and report the mean squared error (MSE) averaged over the folds, together with its standard deviation. You can use this link to calculate MSE.

2. Similarly, build a KNN regression model to predict the trip fare amount. Evaluate the model with 5-fold cross-validation. In each fold, 80% of the data should be used for training and 20% for testing.

Choose the optimal value of K between 1 and 10 using half of the testing data, then calculate the MSE on the other half of the testing data with the best K. Finally, report the averaged MSE and its standard deviation.

3. Compare the results of the two models.
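One possible shape for the evaluation protocol in Problem 2 is sketched below with NumPy only, which the assignment permits ("write your own code"). It is a sketch under assumptions: the data are synthetic stand-ins rather than taxi records, the helper names are hypothetical, and the first half of each test fold is arbitrarily taken as the half used to pick K. sklearn's LinearRegression and KNeighborsRegressor could replace the hand-written fit and predict functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and Fare_amount (not real taxi data).
n = 500
X = rng.uniform(0, 10, size=(n, 3))
y = 2.5 + 3.0 * X[:, 0] + rng.normal(0, 1.0, size=n)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def fit_linear(X_tr, y_tr):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef


def predict_linear(coef, X_te):
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef


def predict_knn(X_tr, y_tr, X_te, k):
    """Average the targets of the k nearest training points (Euclidean)."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)


# Five disjoint folds: each fold is 20% test, the remaining 80% is training.
folds = np.array_split(rng.permutation(n), 5)
lin_mse, knn_mse, best_ks = [], [], []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Linear regression is scored on the whole held-out fold.
    coef = fit_linear(X_tr, y_tr)
    lin_mse.append(mse(y[test_idx], predict_linear(coef, X[test_idx])))

    # KNN: one half of the test fold selects K, the other half reports MSE.
    half = len(test_idx) // 2
    sel_idx, eval_idx = test_idx[:half], test_idx[half:]
    scores = {k: mse(y[sel_idx], predict_knn(X_tr, y_tr, X[sel_idx], k))
              for k in range(1, 11)}
    best_k = min(scores, key=scores.get)
    best_ks.append(best_k)
    knn_mse.append(mse(y[eval_idx], predict_knn(X_tr, y_tr, X[eval_idx], best_k)))

print(f"linear MSE: {np.mean(lin_mse):.3f} +/- {np.std(lin_mse):.3f}")
print(f"KNN MSE:    {np.mean(knn_mse):.3f} +/- {np.std(knn_mse):.3f} (K per fold: {best_ks})")
```

Keeping the K-selection half separate from the reporting half means the reported KNN error is not measured on the same points that chose K, which is what makes the comparison with the linear model fair.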

Problem 3. Tip rate classification.

Sample 1000 trip records from your data, and solve the following problems.

1. Use a KNN model to predict Tip_rate_20.

a. Set K in KNN to 5.

b. Use attributes b, c, d, h as input features.

c. Use attribute k as class labels.

d. Use Euclidean distance.

e. Run 5-fold cross-validation to evaluate your model.

f. Report the precision, recall, and F-score of the classification.

g. Please follow this link to KNN in sklearn:

2. Use a decision tree to predict Tip_rate_20.

a. Build the decision tree with attributes b, c, d, g.

b. Use attribute k as class labels.

c. Use 5-fold cross-validation to evaluate your model.

d. Report the precision, recall, and F-score of the classification.

e. Here is the link to the Decision Tree in sklearn package
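The cross-validated evaluation asked for in Problem 3 might look roughly like the sketch below, written with NumPy only. It rests on several assumptions: the features and labels are synthetic stand-ins, the helper names are hypothetical, class 1 is treated as the positive class, and the metrics are computed once over the pooled out-of-fold predictions rather than averaged per fold. Only the KNN classifier (K = 5, Euclidean distance) is written out; a decision tree could be evaluated the same way by swapping the predictor, for example with sklearn's DecisionTreeClassifier.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for 1000 sampled trips (not real taxi data).
n = 1000
X = rng.normal(size=(n, 4))
labels = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # stand-in Tip_rate_20


def knn_classify(X_tr, y_tr, X_te, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    # With binary labels and odd k, a mean above 0.5 is a majority for class 1.
    return (y_tr[nearest].mean(axis=1) > 0.5).astype(int)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with class 1 as the positive class."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 5-fold cross-validation: every trip is predicted once, by a model
# trained on the other four folds.
folds = np.array_split(rng.permutation(n), 5)
pred = np.empty(n, dtype=int)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    pred[test_idx] = knn_classify(X[train_idx], labels[train_idx], X[test_idx], k=5)

precision, recall, f1 = precision_recall_f1(labels, pred)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because Euclidean distance is sensitive to feature scale, real attributes such as Trip_distance and the 0/1 dummy columns may need rescaling before KNN; that step is omitted here.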

Problem 4. Subway Services

Suppose you are the CTO for WMATA and are looking to improve your services. If you are not familiar with WMATA, it runs the metro system in the greater Washington, DC area.

Every traveler buys a metro card and then uses it on automated fare collection systems when both entering and exiting stations. Many hotels and online travel websites also sell the metro card (apart from sales at stations). A major problem you need to solve is to differentiate between tourists and regular commuters in your system.

1. Given your knowledge of ML, can you pose this as one of the tasks we have seen before in class? Make sure you clearly describe how you will create your dataset and justify why your setup makes sense.

2. Will your answer change in any way if WMATA collected the fare directly at the entry point only (so no card swipe at exit)?

3. Finally, assuming you have built this ML model to differentiate these commuters, how can you use your knowledge for improving the user experience?

There is no single 'right' answer to the questions above; we are looking to see whether you can design well and reason about your choices. Please try to keep your answer brief (3-4 lines) for each question.

Attachment:- NYC Taxi Data Analysis and Modeling.rar
