Compute the correlation and the regression of these

Reference no: EM131277660

Part -1:

Dirty Data Assignment

Hardcore working with real data in Stata. Collapsing, Merging, Regressing.

1. Our goal for this problem is to learn how to clean up a small but difficult-to-use dataset that has information on population by county so that we can merge it with our original dataset of 990 forms.

a. Download the following dataset: https://www.dropbox.com/s/bg71qxiwxh0c8pn/UrbanityCodes.dta?dl=0
The password is PSAA643.

b. As always, get to know this new dataset. (Show your code and outputs) (sum describe list)

c. What information does this dataset give you?

d. How is the variable "Population_2010" coded? Why might this method of coding not be useful? Why might a different type of coding be more useful?

e. We want to create a new variable for 2010 population that is a number rather than a string. To do this, we will use the command "destring". Type "help destring" in your Stata window and read what it says about destring.

f. Create a new variable called "pop_2010_number" using the destring command. Hint: You will want to have a ", generate" in your command or you will not create a new variable.

g. Check to see if your new variable looks the same as your old variable by opening up the data browser and looking at the two variables side by side. Another way to check would be to do a cross-tab, but Stata IC often has problems doing cross-tabs with a large number of values, so the eye-ball method should be fine. (It is sufficient to say that you looked at it after you have done this check.)

h. Summarize your new variable. Do the means, max, and min look reasonable (as populations by county) to you? If not, what looks unusual?

i. We will be merging this dataset to your original dataset with a variable for county FIPS code. This is a way of keeping track of which county is which across datasets. In UrbanityCodes.dta there are two variables that we could possibly use for this purpose: FIPS and fips. What is the difference between these two codes? (Hint: use "describe".)

j. How is the fips variable coded in NCCS_CORE_2013_orig.dta?

k. In order to merge the two datasets on county FIPS code, we will need to have county FIPS variables in each set that are coded the exact same way, either as strings or as long. Create a new variable in the UrbanityCodes.dta set called fipsformerge: gen fipsformerge = FIPS . Why did I choose to go with the string option instead of the long option?

l. Save your dataset under a new name, UrbanityCode_formerge.dta
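
The cleaning steps in this problem can be sketched on toy data. This is a minimal sketch, not the assignment solution: the three rows below are made up, and it assumes Population_2010 holds plain digits stored as a string. If the real values contain commas or other characters, destring will refuse to convert them unless you add an option such as ignore(","); check "help destring" before relying on that.

```stata
* Toy data standing in for UrbanityCodes.dta (all values are made up)
clear
input str5 FIPS long fips str12 Population_2010
"01001" 1001 "54571"
"01003" 1003 "182265"
"06037" 6037 "9818605"
end

* Inspect storage types: here FIPS is a string and fips is numeric
describe

* Convert the string population to a number, keeping the original variable
destring Population_2010, generate(pop_2010_number)

* Compare old and new side by side, then check the range
list FIPS Population_2010 pop_2010_number
summarize pop_2010_number

* String copy of the county code to use as the merge key
gen fipsformerge = FIPS
```

On the real file you would replace the clear/input block with a use command and finish with a save under the new name.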

2. The goal for this problem is to collapse your original dataset to get a number for total expenses from all nonprofits by county. You will merge your collapsed data to your population data above because you are interested to see if the total amount of nonprofit expenditures in a county is correlated with the county's population.

a. Load NCCS_CORE_2013_orig.dta

b. Type: "help collapse". Read the section on collapsing data.

c. Use the following code to collapse expenditures (adding them up) to the county cell level: collapse (sum) exps, by(fips) .

d. You are interested in the total expenditures for all nonprofits in a county. How would your code change if instead you were interested in the *average* expenditures for all nonprofits in a county?

e. Name at least two other measures besides adding and averaging that you could get with the collapse command and explain why they might be of interest.

f. gen fipsformerge = fips . Why do you need to make this variable? Note that there is 1 missing value when you do this. Noticing that is an important attention-to-detail habit that you should get used to. In this case, I can tell you that the missing value comes from the 487 observations in the original dataset that do not have any county information given.

g. Since the observations with missing county information are unlikely to all come from the same county, merging them together would add measurement error. Drop them: drop if fipsformerge==""

h. Save your dataset under the name NCCS_CORE_2013_collapse.dta
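
The collapse steps above can be sketched as follows. The four nonprofit records are made up and only stand in for NCCS_CORE_2013_orig.dta; the sketch also assumes fips is stored as a string there, which is what part 1j asks you to verify.

```stata
* Toy nonprofit-level data (made-up values); the last row has no county
clear
input str5 fips double exps
"01001" 100
"01001" 250
"01003" 400
""      75
end

* One row per county, with expenditures added up
collapse (sum) exps, by(fips)

* Merge key, then drop the row that pooled all records with no county
gen fipsformerge = fips
drop if fipsformerge == ""

list
```

Swapping (sum) for another statistic listed in "help collapse" changes what each county row holds.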

3. The goal for this problem is to merge your two datasets together so that you can say something about how charitable expenditures are correlated with the population in a county.

a. Type "help merge" into Stata. Read about merging.

b. You will be merging your collapsed version of the original dataset to the population set using the variable "fipsformerge". This will be a 1:1 merge because each county only shows up once in each dataset. Merge your two new datasets together with fipsformerge as the merging variable.

c. If you did this correctly, you will notice that 57 observations didn't match. Sometimes not matching means you messed up. Sometimes not matching means that some numbers are in one dataset but not in the other or vice versa. It is important to determine which is the case. Look through the dataset carefully. Sometimes you'll notice that an entire state is missing from one dataset or another (for example one set might have Puerto Rico and another might not). That does not seem to be the situation in this case. After you look through the dataset, write down any comments you have about unmatched data in your solutions.

d. Note that it could be that 990 data are missing in some counties because there are no non-profits in those counties. If that is the case, then you would want to replace their expenditures with "0". We don't know enough at this time to figure out if that is something we should be doing. Right now, let's treat them as missing; they are not a large portion of your dataset, so they are hopefully unlikely to bias your results too much. In your solutions write down, "I need to know more about the 990 dataset to know what to do with missing 990 information."
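
A 1:1 merge and a first look at unmatched rows can be sketched like this. Both toy files and the county codes in them are made up, and the tempfile stands in for UrbanityCode_formerge.dta; on the real data the number of unmatched rows will differ.

```stata
* Toy population file (made-up values), saved to a temporary file
clear
input str5 fipsformerge long pop_2010_number
"01001" 54571
"01003" 182265
"01005" 27457
end
tempfile pop
save `pop'

* Toy collapsed expenditure file (made-up values)
clear
input str5 fipsformerge double exps
"01001" 350
"01003" 400
"99999" 10
end

* Each county appears once in each file, so merge 1:1 on the string key
merge 1:1 fipsformerge using `pop'

* _merge records where each row came from: master only, using only, or both
tabulate _merge
list if _merge != 3
```

Listing the rows where _merge is not 3 is one way to start judging whether a mismatch reflects a coding problem or a county that is truly absent from one file.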

4. Now let's do some data analysis.

a. Regress with exps as the Y variable and pop_2010_number as the X variable.

b. In words, how does an increase of 1 in the county population affect the total county non-profit 990 expenditures?

c. Is this correlation significant?

d. Can you say for certain that an increase in population causes an increase in non-profit expenditures? What else might be going on?
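
The regression in this problem has the following shape. The five observations are made up for illustration, so the coefficients say nothing about the real counties; the point is only where to read the slope, its p-value, and the correlation.

```stata
* Toy merged data (made-up values)
clear
input long pop_2010_number double exps
10000 120
20000 210
30000 330
40000 390
50000 520
end

* Y variable first, then the X variable
regress exps pop_2010_number

* The slope is the change in exps for a one-unit change in population
display "slope: " _b[pop_2010_number]

* Correlation between the same two variables
correlate exps pop_2010_number
```

In the regress output, the row for pop_2010_number gives the slope, and the P>|t| column is the usual place to judge significance.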

5. Bringing it all together.

a. Do you feel smarter?

b. How are you going to remember these skills when you need them in the future? (Note that you do not need to actually remember the exact code, just that these are things that you can do in Stata and that you can look them up.)

Part -2:

Exercises

1. Use gss2006_chapter8.dta. Imagine that you heard somebody say that there was no reason to provide more educational opportunities for women because so many of them just stay at home anyway. You have a variable measuring education, educ, and a variable measuring hours worked in the last week, hrs1. Do a correlation and regression of hours worked in the last week on years of education. Then do this separately for women and for men. Interpret the slope for the overall sample and then for women and for men separately. Is there an element of truth to what you heard?

2. Use gss2006_chapter8.dta. What is the relationship between the hours a person works and the hours his or her spouse works? Do this for women and for men separately. Compute the correlation and the regression for these two variables. Next, test whether the correlation is statistically significant and interpret the results and the scattergrams.

3. Use gss2006_chapter8.dta. Repeat figure 8.2 using your own subsample of 250 observations. Then repeat the figure using a jitter(3) option. Compare the two figures. Set your seed at 111.

4. Use gss2006_chapter8.dta. Compute the correlations between happy, and health by using correlate and then again by using pwcorr. Why are the results slightly different? Then estimate the correlations by using pwcorr, and get the significance level and the number of observations for each case. Finally, repeat the pwcorr command so that all the Ns are the same (that is, with casewise/listwise deletion).

5. Use gss2002_chapter8.dta. There are two variables called happy7 and satfam7. Run the codebook command on these variables. Notice how the higher score goes with being unhappy or being dissatisfied. You always want the higher score to mean more of a variable, so generate new variables (happynew and satfamnew) that reverse these codes so that a score of 1 on happynew means very unhappy, and a score of 7 means very happy. Similarly, a score of 1 on satfamnew means very dissatisfied and a score of 7 means very satisfied. Now do a regression of happiness on family satisfaction with the new variables. How correlated are these variables? Write the regression equation. Interpret the constant and the slope.
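
Reverse coding a 1 to 7 item, as in exercise 5, can be sketched on toy data. The five rows are made up, and the sketch assumes the only valid codes are 1 through 7; run codebook on the real variables first to confirm there are no other codes that need separate handling.

```stata
* Toy responses on a 1-7 scale (made-up values)
clear
input byte happy7 byte satfam7
1 1
2 1
4 3
6 5
7 7
end

* On a 1-7 scale, 8 minus the old score flips the direction
gen happynew  = 8 - happy7
gen satfamnew = 8 - satfam7

* Cross-tab old against new to confirm that 1 maps to 7, 2 to 6, and so on
tabulate happy7 happynew

* Regression and correlation with the recoded variables
regress happynew satfamnew
correlate happynew satfamnew

* pwcorr can also report significance levels and the N behind each pair
pwcorr happynew satfamnew, sig obs
```

Missing values stay missing under this recode, because 8 minus a missing value is still missing in Stata.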

Attachment:- Chi-square.rar



