Solution-Create a bar plot displaying the number of records

Reference no: EM131438761

You will be leveraging the "R in Action" book by Robert Kabacoff extensively in this assignment.

Preparing the Data

1. The first thing you need to do is convert the following variables to factors: record_type, day, state, homeowner, car_value, and married_couple. Refer to the variable descriptions to understand the meaning of the variables and know how to apply the appropriate labels to the factors, where needed. If the variable description specifies labels - for instance, for married_couple 1=no and 2=yes - then include them in the factor. Again, ?factor will be helpful here.

IMPORTANT: In order to make the factor changes stick (i.e., have the factor names you assign be permanent), you have to assign the results of the factor() function back to the column in the allstate data frame, like this: allstate$record_type <- factor().
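As a minimal sketch of the assign-back pattern described above: the small data frame below is a made-up stand-in for the real allstate data, and the record_type labels (0 = shopping, 1 = purchase) are an assumption; check the variable descriptions for the actual coding.

```r
# Toy stand-in for the allstate data frame (values are invented for illustration).
allstate <- data.frame(
  record_type    = c(0, 0, 1, 0, 1),
  married_couple = c(1, 2, 2, 1, 1)
)

# Assign the result of factor() back to the column so the change sticks.
# ASSUMPTION: 0 = shopping point, 1 = purchase point.
allstate$record_type <- factor(allstate$record_type,
                               levels = c(0, 1),
                               labels = c("shopping", "purchase"))

# Labels as stated in the assignment text: 1 = no, 2 = yes.
allstate$married_couple <- factor(allstate$married_couple,
                                  levels = c(1, 2),
                                  labels = c("no", "yes"))

# Confirm that both columns are now factors with the expected levels.
str(allstate)
```

Calling factor() without assigning the result only prints a converted copy; the column in the data frame is left unchanged.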

Exploring structure and summary statistics

2. Now view the structure of your data frame with the str() function to ensure that record_type, day, state, homeowner, car_value, and married_couple are factors with the correct levels. Notice that's the same view contained in the Environment tab in the upper-right quadrant of RStudio.

3. Produce a summary of the allstate data frame. Is car_age skewed? Is cost skewed? If they are, in what direction are they skewed (i.e., left or right)? Explain how you know the answers to these questions.

4. In a two dimensional table, display the count, mean, std dev, skew, kurtosis, and standard error of these variables: group_size, car_age, age_oldest, age_youngest, duration_previous, and cost. HINT: See Listing 7.2 of Kabacoff pg. 139. This is a great example of how to use sapply. It also provides you the formulas for skew and kurtosis. If you don't remember the formula for standard error, "google" it.

If you don't understand what the e means in the numbers in your output, google "scientific notation".

5. If you set up your mystats function like Kabacoff did on pg. 139, you probably noticed that the duration_previous column contains NA values. You can pass a TRUE value to the na.omit parameter of the mystats() function by including na.omit=T as the third parameter in the sapply function. Try it and notice how it affects the count of duration_previous.
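The sketch below shows how extra arguments to sapply() are forwarded to the function being applied. The mystats() here is written in the spirit of the approach the hints describe, not copied from the book, and the data frame is a made-up stand-in with one NA.

```r
# Summary statistics for one numeric vector.
# na.omit = TRUE drops missing values before anything is computed.
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]
  n    <- length(x)
  m    <- mean(x)
  s    <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  se   <- s / sqrt(n)            # standard error of the mean
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt, se = se)
}

# Toy data (invented values); duration_previous contains one NA.
vars <- data.frame(
  car_age           = c(1, 3, 5, 7, 20),
  duration_previous = c(2, NA, 4, 6, 8)
)

# Without na.omit the NA propagates into every statistic except the count.
with_na <- sapply(vars, mystats)

# The third argument to sapply() is passed through to mystats().
no_na <- sapply(vars, mystats, na.omit = TRUE)

print(with_na)
print(no_na)
```

In the second result the count for duration_previous drops by one, because the NA row is removed before length() is taken.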

6. As you'll discover, there are a plethora of packages and ways to generate simple descriptive statistics in R. One that Kabacoff does not cover is the ddply() function found in the plyr package. Install the plyr package with this command: install.packages("plyr"). Then load it in with the library() function and display the cost mean and standard error for each of the married_couple by homeowner groups using ddply(). Remember! Help is your friend: ?ddply. If you find help unhelpful (which is fairly normal), then google r ddply.

Creating count tables

7. Create a table to display the number of records in each state. Which state is most represented in this data set?

8. Create a table to display how many shopping points and purchase points are in the data. What's the approximate ratio of purchase points to shopping points? Hopefully, you have noticed that the table() function is useful for creating count tables.

9. Now, create a two-way table showing the counts of days of week and states, BUT only include purchases (i.e., exclude shopping points).
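A minimal sketch of one-way and two-way count tables with table(), using a made-up stand-in for the allstate data frame. The factor labels "shopping" and "purchase" are assumptions for illustration.

```r
# Toy stand-in for the allstate data (invented values).
allstate <- data.frame(
  record_type = factor(c("shopping", "shopping", "purchase",
                         "shopping", "purchase", "shopping")),
  day         = factor(c("Mon", "Tue", "Tue", "Mon", "Fri", "Fri")),
  state       = factor(c("NM", "NM", "NM", "NM", "ID", "ID"))
)

# One-way tables: counts per state and per record type.
state_counts <- table(allstate$state)
type_counts  <- table(allstate$record_type)

# The most represented state is the name of the largest count.
top_state <- names(which.max(state_counts))

# Two-way table restricted to purchases: filter first, then tabulate.
purchases    <- subset(allstate, record_type == "purchase")
day_by_state <- table(purchases$day, purchases$state)

print(state_counts)
print(type_counts)
print(day_by_state)
```

Filtering before tabulating keeps the shopping points out of the counts entirely, rather than tabulating everything and discarding a slice afterwards.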

10. Create a three-way table of counts using group_size, homeowner and risk_factor using the xtabs() function.

NOTE: The xtabs() function employs R's formula notation, which takes the pattern y ~ x1 + x2 + ... + xn, where y is the dependent variable and x1 through xn are the independent variables. It is common when using xtabs() to leave out the left-hand side of the formula if you just want to generate counts for each of the cross-tabulated groups. For example, with this data set, you might specify ~ risk_factor + day to get a two-way table. See Listing 7.11 on pg. 149 in the Kabacoff book for an example. If you want to sum the data in each cross-tabulated group, then you can specify the variable you want to sum on the left side of the formula. For example, if you wanted a sum of the costs in the previously specified two-way table, your formula would look like this: cost ~ risk_factor + day.

11. You probably noticed that the third dimension in that table is displayed kind of clunky. You can fix this by wrapping your table in the ftable() function. Again, refer to Listing 7.11 for an example. Go ahead and clean up your table with the ftable() function.
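The formula forms described in the note can be sketched as follows. The data frame is a made-up stand-in, and the sum table crosses risk_factor with group_size only to keep the toy data small.

```r
# Toy stand-in for the allstate data (invented values).
allstate <- data.frame(
  group_size  = c(1, 1, 2, 2, 1, 2),
  homeowner   = factor(c("no", "yes", "yes", "no", "yes", "yes")),
  risk_factor = c(1, 2, 1, 2, 2, 1),
  cost        = c(600, 650, 700, 620, 640, 680)
)

# Empty left-hand side: count the records in each cross-tabulated cell.
counts3 <- xtabs(~ group_size + homeowner + risk_factor, data = allstate)

# ftable() flattens the third dimension into a single compact layout.
flat <- ftable(counts3)

# A variable on the left-hand side is summed within each cell instead.
cost_sums <- xtabs(cost ~ risk_factor + group_size, data = allstate)

print(counts3)   # one 2-D slice per risk_factor level
print(flat)      # same counts in one flat table
print(cost_sums)
```

Printing counts3 and flat side by side shows the difference: the same numbers, but ftable() avoids the repeated per-slice headers.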

Creating Other Aggregated Tables

12. Create a table showing the average car age for each of the car_value levels. NOTE: Prior questions are dealing with counts, this is dealing with means. You'll need to use another function, try aggregate(). If you look in the examples for aggregate(), you'll notice that you can use the R formula notation to aggregate the data.
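A minimal sketch of aggregate() with formula notation, where the left side is the value being averaged and the right side is the grouping factor. The car_value levels and ages below are invented for illustration.

```r
# Toy stand-in for the allstate data (invented values).
allstate <- data.frame(
  car_value = factor(c("a", "a", "b", "b", "c")),
  car_age   = c(12, 10, 6, 4, 1)
)

# Mean car_age within each car_value level.
avg_age <- aggregate(car_age ~ car_value, data = allstate, FUN = mean)

print(avg_age)
```

The result is a data frame with one row per group, which makes it easy to reuse in later steps such as plotting.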

Creating plots and graphs

We'll start with bar plots. You've all seen them, but have you really thought about them? For instance, what kind of variable (i.e., categorical or continuous) do bar plots display? If you are thinking, "Hmmm, the different bars on the x-axis have to be driven by a categorical variable...", then you are absolutely correct. Well, that takes care of the x-axis, what about the y-axis? You might be tempted to think "Easy! Continuous!", but you would not be completely correct. The y-axis usually represents some aggregated value (e.g., a sum or a mean). So with that in mind, let's get going!

13. Create a bar plot displaying the number of records that are shopping points and the number of records that are purchase points. Give it an appropriate "main" title and axis labels. NOTE: You won't be able to feed the raw data frame into the barplot() function. You'll need to create a table first to create the aggregated values you want to plot. See pg. 118-120 of Kabacoff for examples.

14. Now add some color to the bar plot. Make shopping points blue and purchase points green. HINT: You can use "blue" and "green" in your color vector.
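A minimal sketch of the table-then-barplot pattern with a color vector. The record types are made-up values, and the level order is set explicitly so that the first color lands on the shopping bar.

```r
# Toy record types (invented values). Level order controls bar order,
# so "shopping" is listed first to receive the first color.
record_type <- factor(c("shopping", "shopping", "purchase",
                        "shopping", "purchase", "shopping"),
                      levels = c("shopping", "purchase"))

# barplot() needs aggregated values, so count the records first.
counts <- table(record_type)

# Colors are matched to bars by position: blue, then green.
barplot(counts,
        main = "Records by type",
        xlab = "Record type",
        ylab = "Number of records",
        col  = c("blue", "green"))
```

If the factor levels were left in their default alphabetical order, "purchase" would come first and the colors would be swapped.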

15. Create a bar plot (with color) that displays the average for the oldest person on the policy for only the purchase points for each of the risk_factor levels. Again you'll need to create a table first - try aggregate() or ddply(). You need to give barplot() a vector, not a matrix or data frame. CHALLENGE: If you want to play with different colors that are automatically generated, try using the RColorBrewer package.
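One possible way to turn an aggregate() result into the named vector that barplot() expects is sketched below. The data are invented, and heat.colors() from base R is used here as a stand-in for an automatically generated palette; it is not the RColorBrewer package mentioned in the challenge.

```r
# Toy stand-in for the allstate data (invented values).
allstate <- data.frame(
  record_type = factor(c("purchase", "purchase", "shopping",
                         "purchase", "purchase", "purchase")),
  risk_factor = c(1, 1, 1, 2, 2, 3),
  age_oldest  = c(30, 40, 99, 50, 60, 45)
)

# Keep only purchase points, then average age_oldest per risk_factor.
purchases <- subset(allstate, record_type == "purchase")
avg <- aggregate(age_oldest ~ risk_factor, data = purchases, FUN = mean)

# aggregate() returns a data frame; pull the means out as a named vector.
heights <- setNames(avg$age_oldest, avg$risk_factor)

barplot(heights,
        main = "Average oldest age by risk factor (purchases only)",
        xlab = "Risk factor",
        ylab = "Average age of oldest person",
        col  = heat.colors(length(heights)))
```

The names on the vector become the bar labels, so no separate names.arg is needed.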

Enough of bar plots. Time for histograms! Yay! A histogram is a special kind of bar plot intended to display the distribution of a variable. Why is it special? Well, there is no categorical variable on the x-axis. The x-axis is a bunch of "buckets" that break up an otherwise continuous variable along the x-axis. These buckets hold small ranges of the continuous variable you are plotting. So what kind of variable is on the y-axis? If you are thinking, "A continuous variable," then you are incorrect. It's not a continuous variable. It's the count of the number of values of the continuous variable that falls into each bucket. So a histogram really only involves a single variable.

16. Create a histogram of the cost variable using only the purchase points. Add a title and a label for the x-axis. Refer to pp. 125-126 of Kabacoff for help. Is cost normally distributed? (Oh, yeah! Now we are really wiping away the Statistics cobwebs, huh? Normally distributed? What the heck is that? "google" it if you need to. It will be important when we get to the regression world.)

17. Now increase the number of bins in the histogram you created to 25.
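A minimal sketch of a histogram with a requested bin count. The cost values are synthetic (drawn with rnorm(), not taken from the real data), so the shape here says nothing about the actual distribution.

```r
# Synthetic cost values for illustration only.
set.seed(1)
cost <- round(rnorm(500, mean = 635, sd = 45))

# breaks = 25 asks for roughly 25 bins; hist() treats a single number
# as a suggestion and picks "pretty" break points near that count.
h <- hist(cost,
          breaks = 25,
          main   = "Distribution of policy cost (purchases)",
          xlab   = "Cost")

# The returned object records the bins that were actually used.
length(h$counts)
```

Because the bin count is only a suggestion, checking length(h$counts) is a quick way to see how many bins R actually drew.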

The distribution of cost appears pretty much normally distributed (except for that rascally long tail on the left), but sometimes it's not easy to see the distribution with a histogram. That is when the density plot is useful. (No, McFly! You are not my density!)

18. Create a density plot for cost. Refer to Listing 6.7 in Kabacoff for help.

19. Now, if you are obsessive like I am, you are probably being driven nuts by that elongated left tail on our distribution plots. Find the values that are causing it and decide if you can remove them. If you remove them, create a new density plot. If you decide not to remove them, explain why. Either way, revisit the question of whether the cost variable is normally distributed and explain your thinking.

TIP: I suggest storing the cost column into a separate variable for this problem - like this: myCost <- [your data frame name]$cost. Then operate on myCost and use it to generate your new density plot.
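The workflow in the tip can be sketched as below. The vector is synthetic, with two artificially low values planted to mimic a left tail, and the cutoff of 400 is a hypothetical choice for this toy data only; whether removal is justified in the real data is the judgment the question asks for.

```r
# Synthetic cost values with two planted low values (illustration only).
set.seed(1)
myCost <- c(rnorm(500, mean = 635, sd = 45), 260, 270)

# Density plot of everything, including the low tail.
plot(density(myCost), main = "Density of cost (all values)", xlab = "Cost")

# Inspect the smallest values to see what is stretching the tail.
head(sort(myCost))

# HYPOTHETICAL cutoff chosen by eye for this toy data.
cutoff  <- 400
trimmed <- myCost[myCost >= cutoff]

# Density plot after dropping the values below the cutoff.
plot(density(trimmed), main = "Density of cost (low values removed)",
     xlab = "Cost")
```

Working on a copy such as myCost leaves the original data frame untouched, so the full data are still available if the values turn out to be legitimate.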

Now let's move on to box plots. Kabacoff pg. 129 has a good description of box plots, if you need refreshing. In short, box plots are another perspective on the distribution of a variable, only focused on the median and quartiles. (Oh man! What's the difference between a mean and a median?! Does a quartile have any meaning in relation to the mean?!)

20. Create a simple box plot of the age_youngest variable. Add an appropriate y-axis label and main title.

21. Now create a box plot to compare the distribution of the youngest age between whether a married couple is on the policy or not. HINT: You'll need to use the R formula notation. Add the proper axis labels.
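A minimal sketch of a grouped box plot using formula notation, read as "numeric variable by grouping factor". The ages below are invented for illustration.

```r
# Toy stand-in for the allstate data (invented values).
allstate <- data.frame(
  age_youngest   = c(22, 25, 31, 40, 44, 52),
  married_couple = factor(c("no", "no", "no", "yes", "yes", "yes"))
)

# One box per married_couple level.
bp <- boxplot(age_youngest ~ married_couple,
              data = allstate,
              main = "Youngest age by married couple status",
              xlab = "Married couple on policy",
              ylab = "Age of youngest person")

# The third row of bp$stats holds the median of each group.
bp$stats[3, ]
```

The same formula pattern works for comparing car_age across car_value levels.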

22. Create a box plot to compare the age of a car on the policy with the value of the car. Add appropriate axis labels. Based on the box plot, which level of car_value do you think represents the cars of least value?

Now we'll move on to Chapter 11 of Kabacoff and cover scatter plots (p. 256). Scatter plots are useful for comparing the relationship (think correlation!) between two continuous variables. (That's right! Say goodbye to the categorical in this realm!)

23. Show the relationship between the oldest age and the cost of policies purchased in New Mexico with a scatter plot. Refer to Listing 11.1 for help. HINT: I recommend extracting out the records for New Mexico and purchases into a separate variable.

24. Now fit a smoothed line on the scatterplot using the lines() and lowess() functions to emphasize any relationship between the age of the oldest person on the policy and the cost of the policy. Again, see Listing 11.1 for help. NOTE: You might need to make the line a different color than the points on the plot.
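A minimal sketch of a scatter plot with a lowess smoother drawn on top. The data frame name nm_purchased and its values are hypothetical, generated synthetically for illustration.

```r
# Synthetic stand-in for New Mexico purchase records (illustration only).
set.seed(42)
age_oldest   <- sample(18:75, 60, replace = TRUE)
nm_purchased <- data.frame(
  age_oldest = age_oldest,
  cost       = 600 + 0.8 * age_oldest + rnorm(60, sd = 20)
)

# Draw the points first.
plot(nm_purchased$age_oldest, nm_purchased$cost,
     main = "Cost vs. oldest age (NM purchases)",
     xlab = "Age of oldest person",
     ylab = "Cost")

# lowess() returns smoothed (x, y) pairs sorted by x;
# lines() adds them to the existing plot in a contrasting color.
fit <- lowess(nm_purchased$age_oldest, nm_purchased$cost)
lines(fit, col = "red", lwd = 2)
```

lines() only adds to a plot that already exists, so the plot() call has to come first.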

26. Now use a scatterplot to compare the duration of the customers' previous insurance issuer with the cost for both New Mexico and Idaho. Use only the "purchase" data. Include labels and boxplots on the x and y axes.

HINT: Again, subset out the data you need first.

HINT #2: Use the scatterplot() function from the car library as shown on pg. 258 in the Kabacoff book. You'll need to click the "Zoom" button in the "Plots" window to get a good view of it.

HINT #3: You will also need to reset the state factor after you subset the NM and ID data out. Something like this: nm_id_purchased$state <- factor(nm_id_purchased$state).

Which state has higher policy costs?
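The reason for resetting the factor can be sketched as follows: subsetting rows does not drop unused factor levels, so the other states would still appear as empty groups. The data frame below is a made-up stand-in for the purchase records.

```r
# Toy stand-in for the purchase records (invented values).
purchases <- data.frame(
  state = factor(c("NM", "ID", "CO", "NM", "UT")),
  cost  = c(640, 610, 655, 630, 620)
)

# Keep only New Mexico and Idaho.
nm_id_purchased <- subset(purchases, state %in% c("NM", "ID"))

# The subset still carries all four original levels.
levels_before <- levels(nm_id_purchased$state)

# Re-creating the factor keeps only the levels that are present.
nm_id_purchased$state <- factor(nm_id_purchased$state)
levels_after <- levels(nm_id_purchased$state)

print(levels_before)
print(levels_after)
```

Base R's droplevels() has the same effect on a factor and is an equivalent alternative here.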

27. Using the subset of data you created for New Mexico and Idaho, create a scatter plot matrix (like the one shown on pg. 260) of the following variables: car_age, risk_factor, age_oldest, duration_previous, and cost. Add an appropriate title to the plot.
What are those plots down the diagonal of the plot matrix? Which variable is the most normally distributed? Which is the least?

Which two variables are the most positively correlated? Negatively correlated?

Challenge Questions (no, not extra credit questions)

Now that you have really "whet your awa whistle" (as my 6 year-old might say when she's older and still can't pronounce her r's), I'm going to give you two open-ended questions where you need to employ your R-descriptive-statistics-plotting skills to answer. Good luck!

28. (Descriptive Table) Which state has highest average policy cost for policies purchased on Fridays? HINT: This page will be helpful in finding the maximum of the average costs.

29. (Graph or Plot) Your boss has requested to see the separate distributions of the oldest people on purchased policies from CO, ID, NM, and UT. He wants to see the oldest age distribution of each of the four states on a single plot.

Attachment:- r-help.rar

Verified Expert

This assignment is based entirely on R programming, and I used RStudio for it. R provides many functions for drawing graphs and installing the required packages. The work begins by exploring the structure of the data set and producing summary statistics (mean, standard error, skewness, max, min, and count of observations) for the variables that matter to this analysis. The next step is computing aggregate values for the variables related to the assignment tasks, then plotting those aggregates with bar plots, histograms, box plots, plot(), ggplot, and scatter plots. Random normal functions such as rnorm() were also used with the data frames, and the solution provides insights into which states perform better on purchase points, and so on.

The questions for this assignment are contained in the DescriptiveAnalytics_Assignment.rmd file. You will do your work in that file and then generate a DOCX file from it. Look at the comments at the top of the LearningR_Assignment.rmd for further instructions on how to do this. Submit your completed assignment as a DOCX document to Learn.

I recommend completing at least the Data Visualization module of the swirl Data Analysis course. The Data Analysis course has three modules. You do not need to complete the first two on central tendency (i.e., mean, median, and mode) or dispersion (i.e., variance, standard deviation) unless you need to refresh those concepts. Also, you do not need to watch the videos during the module(s) when it prompts you. You can install the Data Analysis course with the following command in R (assuming you already have swirl installed):
