Compute the minimum, maximum, and mean location values

Reference no: EM132271872

Data Mining Assignment -

Task 1 -

Preface: Analysing the results of urban mobility simulations can provide valuable information for identifying and addressing problems in an urban road network. Public transport vehicles such as buses and taxis are often equipped with GPS location devices, and the location data is submitted to a central server for analysis.

The metropolitan city of Rome, Italy collected location data from 320 taxi drivers who work in the center of Rome. Data was collected during the period from 01/Feb/2014 until 02/March/2014. An extract of the dataset is found in taxi.csv. The dataset contains 4 attributes:

1. ID of a taxi driver. This is a unique numeric ID.

2. Date and time in the format Y:m:d H:m:s.msec+tz, where msec is the microseconds field and tz is a time-zone adjustment. (You may have to change the format of the date into one that R can understand.)

3. Latitude

4. Longitude
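Attribute 2 has to be converted into a proper date-time value before any time arithmetic is possible. The following is a minimal sketch in Python (the assignment itself asks for R), assuming a row's timestamp looks like `2014-02-01 00:00:00.739166+01`. That sample string and the helper name `parse_timestamp` are assumptions for illustration and are not taken from taxi.csv; adjust the pattern if the real file differs.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: "YYYY-MM-DD HH:MM:SS[.ffffff]+HH[[:]MM]" (hypothetical, check taxi.csv).
_TS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"  # date and time to the second
    r"(?:\.(\d{1,6}))?"                         # optional fractional seconds
    r"([+-]\d{2})(?::?(\d{2}))?$"               # time-zone offset, minutes optional
)


def parse_timestamp(text):
    """Convert one timestamp string into a timezone-aware datetime."""
    match = _TS.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    base, frac, tz_hours, tz_minutes = match.groups()
    # Pad the fraction to six digits so ".5" means 500000 microseconds.
    micro = int((frac or "0").ljust(6, "0"))
    sign = -1 if tz_hours.startswith("-") else 1
    offset = timedelta(hours=int(tz_hours), minutes=sign * int(tz_minutes or 0))
    naive = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
    return naive.replace(microsecond=micro, tzinfo=timezone(offset))
```

Subtracting two parsed values gives a `timedelta`, which is what the later activity questions need.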

Purpose of this task: Perform a general analysis of this dataset. Learn to work with large datasets. Obtain general information of the behaviour of some taxi drivers. Analyse and interpret results. This task also serves as a preparation for projects that will be based on this dataset.

Questions: By using the data in taxi.csv perform the following tasks:

(a) Plot the location points (2D plot using all of the latitude, longitude value pairs in the dataset). Clearly indicate points that are invalid, outliers or noise points. The plot should be informative! Clearly explain the rationale that you used when identifying invalid points, noise points, and outliers.

Remove invalid points, outliers and noise points before answering the subsequent questions.
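One possible rationale for question (a), sketched in Python (the assignment itself asks for R): treat a point as invalid when its coordinates cannot exist on Earth, and as an outlier when it falls outside an interquartile-range fence computed from the valid points. The IQR rule, the factor `k=1.5`, and the function names are assumptions chosen for illustration; the brief does not prescribe a method, and a bounding box around Rome or a density-based rule would be equally defensible.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k interquartile ranges beyond Q1 and Q3."""
    q1, _, q3 = quantiles(values, n=4)  # needs at least two values
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def is_valid(lat, lon):
    """A coordinate pair is valid only if it can exist on the globe."""
    return -90 <= lat <= 90 and -180 <= lon <= 180


def label_points(points, k=1.5):
    """Label each (lat, lon) pair as 'invalid', 'outlier', or 'ok'."""
    valid = [(lat, lon) for lat, lon in points if is_valid(lat, lon)]
    lat_lo, lat_hi = iqr_bounds([lat for lat, _ in valid], k)
    lon_lo, lon_hi = iqr_bounds([lon for _, lon in valid], k)
    labels = []
    for lat, lon in points:
        if not is_valid(lat, lon):
            labels.append("invalid")
        elif not (lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi):
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels
```

The labels can drive the colours of the scatter plot, and only the points labelled `ok` would be kept for the later questions.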

(b) Compute the minimum, maximum, and mean location values.

(c) Obtain the most active, least active, and average activity of the taxi drivers (most time driven, least time driven, and mean time driven). Explain the rationale of your approach and explain your results.

(d) Look at the file Student_Taxi_Mapping.txt. The file contains two columns. The first column is a 4-digit code, the second column is the ID of a taxi driver. Use the first digit and the last three digits of your student number to obtain a 4-digit code. Locate that code in the first column of the file Student_Taxi_Mapping.txt, then use the corresponding ID of the taxi driver listed in column 2. Thus, for example, if your student number is 52345856 then you would look up 5856 in file Student_Taxi_Mapping.txt to find that the corresponding taxi ID is 50. Use the taxi ID that is listed next to your 4-digit student code to answer the following questions:

i. Plot the location points for taxi=ID.

ii. Compare the mean, min, and max location value of taxi=ID with the global mean, min, and max.

iii. Compare total time driven by taxi=ID with the global mean, min, and max values.

iv. Compute the distance traveled by taxi=ID. To compute the distance between two points on the surface of the earth use the following method:

dlon = lon2 - lon1

dlat = lat2 - lat1

a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2

c = 2 * atan2( sqrt(a), sqrt(1-a) )

distance = R * c (where R is the radius of the Earth, and lat1, lon1, lat2, lon2 are the latitudes and longitudes of the two points, converted to radians)

Assume that R=6,371,000 meters.
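The distance method above is the haversine formula. A minimal sketch in Python (the assignment itself asks for R) follows; the function names `haversine_m` and `path_length_m` are illustrative assumptions. Inputs are taken in degrees, as GPS data usually is, and converted to radians before the trigonometric functions are applied.

```python
from math import atan2, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # R as given in the assignment


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    a = min(1.0, a)  # guard against rounding pushing a just above 1
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return EARTH_RADIUS_M * c


def path_length_m(points):
    """Sum the distances between consecutive (lat, lon) points, in time order."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))
```

For the total distance travelled by one taxi, its points would first be sorted by timestamp and then passed to `path_length_m`.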

With each of your answers: Explain what knowledge can be derived from your answer.
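Questions (b) and (c) can be sketched as follows in Python (the assignment itself asks for R). The activity rule is an assumption for illustration: time driven is taken as the sum of gaps between a driver's consecutive GPS readings, ignoring any gap longer than `max_gap` seconds on the guess that the taxi was off duty. The 300-second threshold and the function names are hypothetical and should be justified or replaced in an actual answer.

```python
from collections import defaultdict
from statistics import fmean


def location_summary(points):
    """Minimum, maximum, and mean of latitude and longitude over (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "lat": {"min": min(lats), "max": max(lats), "mean": fmean(lats)},
        "lon": {"min": min(lons), "max": max(lons), "mean": fmean(lons)},
    }


def active_seconds(records, max_gap=300):
    """Total driving time per driver from (driver_id, time_in_seconds) records.

    Gaps longer than max_gap seconds are treated as breaks and not counted.
    """
    by_driver = defaultdict(list)
    for driver, t in records:
        by_driver[driver].append(t)
    totals = {}
    for driver, times in by_driver.items():
        times.sort()
        totals[driver] = sum(
            later - earlier
            for earlier, later in zip(times, times[1:])
            if later - earlier <= max_gap
        )
    return totals
```

The most and least active drivers are then `max(totals, key=totals.get)` and `min(totals, key=totals.get)`, and the mean is the average of `totals.values()`.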

Task 2 -

Preface: Banks often face the problem of deciding whether or not a client is creditworthy. Banks commonly employ data mining techniques to classify a customer into risk categories such as category A (highest rating) or category C (lowest rating).

A bank collects data from past credit assessments. The file "creditworthiness.csv" contains 2500 such assessments. Each assessment lists 46 attributes of a customer. The last attribute (the 47-th attribute) is the result of the assessment. Open the file and study its contents. You will notice that the columns are coded by numeric values. The meaning of these values is defined in the file "definitions.txt". For example, a value 3 in the 47-th column means that the customer's creditworthiness is rated "C". Any value of attributes not listed in definitions.txt is "as is".

This poses a "prediction" problem. A machine is to learn from the outcomes of past assessments and, once the machine has been trained, to assess any customer who has not yet been assessed. For example, the value 0 in column 47 indicates that this customer has not yet been assessed.
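The split described above, between customers with a known rating and customers still to be assessed, can be sketched as follows in Python (the assignment itself asks for R). It assumes each row has already been read as a list of numbers with the rating in the last position; the function name is an illustrative assumption.

```python
def split_by_assessment(rows, label_index=-1):
    """Separate assessed rows (rating != 0) from unassessed rows (rating == 0).

    Assessed rows can be used for training and testing; unassessed rows are
    the ones the trained model is eventually asked to rate.
    """
    labelled = [row for row in rows if row[label_index] != 0]
    unlabelled = [row for row in rows if row[label_index] == 0]
    return labelled, unlabelled
```

Only the labelled rows can be used to measure accuracy, because the unlabelled rows have no known outcome to compare against.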

Purpose of this task: You are to start with an analysis of the general properties of this dataset by using suitable visualization and clustering techniques (e.g. those introduced during the lectures), and you are to obtain an insight into the degree of difficulty of this prediction task. Then you are to design and deploy an appropriate supervised prediction model (e.g. an MLP, as will be used in the lab of week 5) to obtain a prediction of customer ratings.

Question 1: Analyse the general properties of the dataset and obtain an insight into the difficulty of the prediction task. Create a statistical analysis of the attributes and their values, then list 5 of the most interesting (most valuable) attributes. Explain the reasons that make these attributes interesting. Note: A set of R-script files is provided with this assignment (included in the zip-file). These are similar to the scripts used in labs. The scripts provided will allow you to produce some first results. However, virtually none of the parameters used in these scripts are suitable for obtaining a good insight into the general properties of the given dataset. Hence your task is to modify the scripts such that informative results can be obtained from which conclusions about the learning problem can be made. Note that finding a good set of parameters is often very time consuming in data mining.

An additional challenge is to make a correct interpretation of the results.

This is what you need to do: Find a good set of parameters (e.g. through a trial and error approach), obtain informative results, then offer an interpretation of the results. Write down your approach to conducting the experiments, explain your results, and offer a comprehensive interpretation of the results. Do not forget that you are also to provide an insight into the degree of difficulty of this learning problem (e.g. from the results that you obtained, can it be expected that a prediction model will be able to achieve a 100% prediction accuracy?). Always explain your answers.
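One simple input to the difficulty discussion is the distribution of the ratings among the assessed customers. The sketch below, in Python (the assignment itself asks for R), computes the share of each rating and the accuracy of always predicting the most common one. This majority-class baseline is an assumption added here for illustration and is not part of the provided scripts; a useful model should clearly beat it.

```python
from collections import Counter


def class_distribution(labels):
    """Fraction of samples carrying each rating."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent rating."""
    return max(class_distribution(labels).values())
```

A strongly skewed distribution means that a high raw accuracy can be reached without learning anything, which is worth stating when interpreting results.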

Question 2: Deploy a prediction model to predict the creditworthiness of customers who have not yet been assessed. The prediction capabilities of the MLP in lab 4 were very poor. Your task is to:

a) Describe a valid strategy that maximises the accuracy of predicting the credit rating. Explain why your strategy can be expected to maximize the prediction capabilities.

b) Use your strategy to train MLP(s) then report your results. Give an interpretation of your results. What is the best classification accuracy (expressed in % of correctly classified data) that you can obtain for data that were not used during training (i.e. the test set)?

c) You will find that 100% accuracy cannot be obtained on the test data. Explain why 100% accuracy could not be obtained on this test dataset. What would be needed to get the prediction accuracy closer to 100%?
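The accuracy asked for in (b) is measured on data held back from training. A minimal sketch of that bookkeeping in Python (the assignment itself asks for R, and the MLP training step is left out) is shown below. The 20% test fraction, the fixed seed, and the function names are assumptions for illustration.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_percent(predicted, actual):
    """Percentage of samples whose predicted rating equals the true rating."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

The model would be fitted on `train` only, and `accuracy_percent` would be reported for its predictions on `test`.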

Attachment:- Assignment Files.rar


INFO411/911 Data Mining Assignment, University of Wollongong, Australia.

I will send you the other data files. The work is in R: I need the code plus a Word file with the answers. I will attach all the documents. Submission of the answers must be done online via Moodle using the submission link that is associated with assignment 1 for this subject. One PDF document is to be submitted. The PDF must contain typed text of your answer (do not submit a scan of a handwritten document). The document can include computer-generated graphics and illustrations (hand-drawn graphics and illustrations will be ignored). The size limit for this PDF document is 20MB. All questions are to be answered. A clear and complete explanation and analysis needs to be provided with each answer.

Submissions made after the due time will be assessed as late submissions. A late submission is counted in full day increments (i.e. 1 minute late counts as a 1 day late submission). There is a 25% penalty in marks for each day after the due date. The submission site closes on the fourth day after the due date. No submission will be accepted after the submission site has closed. This is an individual assignment. Plagiarism of any part of the assignment will result in 0 marks for the assignment and for all students involved. You may need to do some research on background information for this assignment. For example, you may need to develop a deeper understanding of writing code in R, or study the general characteristics of GPS, obtain general geographic information about Rome, and study other topics that are related to the tasks in this assignment.

What you need: The R software package (RStudio is optional) and the file assignment1.zip. Successful completion of lab 4 and lab 5. You may use the R-script from the labs as a basis for attempting this question. Note that in this assignment the term "prediction capabilities" refers to a model's ability to predict the credit rating of samples that were not used to train the model (i.e. samples in a test set). Submission: Submit a single PDF document that contains your answers to both tasks of this assignment. Submit before the due date and follow the submission procedure described in the header of this assignment.
