Perform a general analysis of this dataset

Reference no: EM131437574

Data Mining Assignment  

Task 1 -

Preface: The analysis of results from urban mobility simulations provides very valuable data for identifying and addressing problems in an urban road network. Public transport vehicles such as buses and taxis are often equipped with GPS location devices, and the location data is submitted to a central server for analysis.

The metropolitan city of Rome, Italy collected location data from 320 taxi drivers that work in the center of Rome. Data was collected during the period from 01/Feb/2014 until 02/March/2014. An extract of the dataset is found in taxi.csv. The dataset contains 4 attributes:

1. ID of a taxi driver. This is a unique numeric ID.

2. Date and time in the format Y:m:d H:m:s.msec+tz, where msec is microseconds and tz is a time-zone adjustment. (You may have to change the date format into one that R can understand.)

3. Latitude

4. Longitude

For a further description of this dataset: https://crawdad.org/roma/taxi/20140717/
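
The date conversion mentioned for attribute 2 might look roughly like the sketch below. It is written in Python purely for illustration (the assignment itself expects R), and it assumes that rows carry timestamps shaped like 2014-02-01 00:00:00.739166+01, which should be checked against the actual file. The function name parse_timestamp is hypothetical.

```python
import re
from datetime import datetime


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp such as '2014-02-01 00:00:00.739166+01'.

    ASSUMPTION: dashes in the date, optional microseconds, and an
    hour-only time-zone offset. Verify this against taxi.csv.
    """
    text = text.strip()
    # strptime's %z wants +HHMM or +HH:MM, so pad an hour-only offset.
    if re.search(r"[+-]\d{2}$", text):
        text += "00"
    # Some rows may lack the fractional-seconds part.
    if "." in text:
        fmt = "%Y-%m-%d %H:%M:%S.%f%z"
    else:
        fmt = "%Y-%m-%d %H:%M:%S%z"
    return datetime.strptime(text, fmt)


if __name__ == "__main__":
    start = parse_timestamp("2014-02-01 00:00:00.739166+01")
    end = parse_timestamp("2014-02-01 00:00:15.739166+01")
    print((end - start).total_seconds())
```

Parsing into time-zone-aware values makes differences between two fixes straightforward to compute, which is what the later questions on time driven rely on.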

Purpose of this task: Perform a general analysis of this dataset. Learn to work with large datasets. Obtain general information of the behaviour of some taxi drivers. Analyse and interpret results. This task also serves as a preparation for a project that will be based on this dataset.

Questions: By using the data in taxi.csv perform the following tasks:

(a) Plot the location points (2D plot) and clearly indicate the points that are outliers or noise points. The plot should be informative! Remove outliers and noise points before answering the subsequent sub-questions. Explain why you classified the removed points as noise points.

(b) Compute the minimum, maximum, and mean location values.

(c) Obtain the most active, least active, and average activity of the taxi drivers (most time driven, least time driven, and mean time driven)

(d) Look at the file Student_Taxi_Mapping.txt. The file contains two columns: the first is a 4-digit student code, and the second is the ID of a taxi driver. Take the first digit and the last three digits of your student number, locate that 4-digit code in the first column of Student_Taxi_Mapping.txt, then use the taxi driver ID listed in the second column. For example, if your student number is 52345678, you would look up 5678 in Student_Taxi_Mapping.txt and find that the corresponding taxi ID is 50. Use the taxi ID that matches your 4-digit student code to answer the following questions:

i. Plot the location points of taxi=ID

ii. Compare the mean, min, and max location value of taxi=ID with the global mean, min, and max.

iii. Compare total time driven by taxi=ID with the global mean, min, and max values.

iv. Compute the distance traveled by taxi=ID. To compute the distance between two points on the surface of the earth use the following method:

dlon = lon2 - lon1

dlat = lat2 - lat1

a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2

c = 2 * atan2(sqrt(a), sqrt(1 - a))

distance = R * c (where R is the radius of the Earth)

Assume that R=6,371,000 meters.
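
The distance method above can be sketched as follows. This is a Python illustration rather than the R solution the assignment asks for. The names haversine_m and path_length_m are hypothetical, and the sketch assumes that latitude and longitude are given in decimal degrees, so they are converted to radians before the trigonometric functions are applied.

```python
import math

EARTH_RADIUS_M = 6_371_000  # R as given in the assignment text


def haversine_m(lat1, lon1, lat2, lon2):
    """Distance in meters between two points given in decimal degrees."""
    # ASSUMPTION: inputs are degrees; sin/cos need radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_M * c


def path_length_m(points):
    """Total length of a time-ordered list of (lat, lon) pairs."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


if __name__ == "__main__":
    # Two made-up points, for illustration only.
    print(haversine_m(41.90, 12.49, 41.91, 12.50))
```

Summing the distances between consecutive fixes only gives a sensible total if the points are sorted by time and noise points have already been removed, as required in (a).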

Task 2 -

Preface: Banks often face the problem of deciding whether or not a client is creditworthy. Banks commonly employ data mining techniques to classify a customer into risk categories such as category A (highest rating) or category C (lowest rating).

A bank collects data from past credit assessments. The file creditworthiness.csv contains 2500 such assessments. Each assessment lists 46 attributes of a customer. The last attribute (the 47-th attribute) is the result of the assessment. Open the file and study its contents. You will notice that the columns are coded by numeric values. The meaning of these values is defined in the file definitions.txt. For example, a value 3 in the 47-th column means that the customer's creditworthiness is rated "C". Any attribute value not listed in definitions.txt is "as is".

This poses a "prediction" problem. A machine is to learn from the outcomes of past assessments and, once the machine has been trained, to assess any customer who has not yet been assessed. For example, the value 0 in column 47 indicates that this customer has not yet been assessed.

Purpose of this task:

You are to start with an analysis of the general properties of this dataset by using visualization and clustering techniques (e.g. those introduced during the lectures), and you are to obtain an insight into the degree of difficulty of this prediction task. Then you are to design and deploy an appropriate supervised prediction model (e.g. an MLP, as will be used in the lab of week 5) to obtain a prediction of customer ratings.

Question 1: Analyse the general properties of the dataset and obtain an insight into the difficulty of the prediction task. Create a statistical analysis of the attributes, then list 5 of the most interesting (or most valuable) attributes. Explain the reasons that make these attributes interesting. Note: A set of R script files is provided with this assignment (included in the zip file). These are similar to the scripts used in lab1. The scripts provided will allow you to produce some first results. However, virtually none of the parameters used in these scripts are suitable for obtaining a good insight into the general properties of the given dataset. Hence your task is to modify the scripts such that informative results are obtained from which conclusions about the learning problem can be drawn. Note that finding a good set of parameters is often very time consuming in data mining.

An additional challenge is to interpret the results correctly.

This is what you need to do: Find a good set of parameters (i.e. through a trial-and-error approach), obtain informative results, then offer an interpretation of the results. Write down your approach to conducting the experiments, explain your results, and offer a comprehensive interpretation of the results. Do not forget that you are also to provide an insight into the degree of difficulty of this learning problem (e.g. from the results that you obtained, can it be expected that a prediction model will be able to obtain 100% prediction accuracy?). Always explain your answers.

Question 2: Deploy a prediction model to predict the creditworthiness of customers who have not yet been assessed. The prediction capabilities of the MLP in lab4 were very poor. Your task is to:

a) Describe a valid strategy that maximises the accuracy of predicting the credit rating. Explain why your strategy can be expected to maximize the prediction capabilities.

b) Use your strategy to train MLP(s) then report your results. Give an interpretation of your results.

What is the best classification accuracy (expressed in % of correctly classified data) that you can obtain for data that were not used during training (i.e. the test set)?
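
The evaluation described above (set aside customers not yet assessed, hold out a test set, and report accuracy in percent) can be sketched as follows. This is a Python illustration only and does not train an MLP. All function names and the 20% test fraction are hypothetical choices, and the sketch assumes each row is a list of numbers whose last entry is the rating, with 0 meaning "not yet assessed" as stated in the task.

```python
import random


def split_assessed(rows, label_index=-1, unassessed=0):
    """Separate rows with a known rating from rows still to be predicted."""
    # ASSUMPTION: labels are already numeric (convert after reading the CSV).
    assessed = [r for r in rows if r[label_index] != unassessed]
    pending = [r for r in rows if r[label_index] == unassessed]
    return assessed, pending


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly and hold out a fraction of rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_percent(predicted, actual):
    """Percentage of test labels that were predicted correctly."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

Keeping the test rows out of training is what makes the reported percentage an estimate of performance on customers the model has not seen.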

https://teaching.cs.uow.edu.au/~markus/data/taxi.csv.zip

Attachment:- Assignment Files.rar






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd