EPIDEMIOLOGY STATA PROGRAMMING AND DATA MANAGEMENT Assignment -
Write a .do file that performs the tasks described below. Your .do file should follow conventions for .do file structure described in class. Make sure your script will run on our machines, even if we are using a different version of Stata. Do not submit your log files as part of the assignment.
Evaluation - For Question 1, use the datasets hw2_pra_hist.dta and hw2_hosp.dta to perform the required tasks. Your .do file will be run on a different dataset with more visits.
For the other questions, simply define your program. You do not need to run the programs you write in your .do file. The graders will run your programs using a dataset that will not be released to you.
Question 1 -
Context: You are conducting a study that examines regional variation in the distribution of panel-reactive antibody (PRA). So far, you have recruited 73 patients (px_id = 1, ..., 73) from 10 hospitals (hosp_id = 1, ..., 10) in 3 regions (region = A, B, C), and have measured PRA four times: visit 0 (baseline), visit 1, visit 2, and visit 3. You hear that the organization that funds your research plans to extend the funding for several more visits (visit 4, visit 5, ..., visit N). Since you do not know how many more visits there will be, you decide to write a .do file that works regardless of how many visits the dataset has.
Codebook
Variable | Description | Values/Range
hw2_pra_hist.dta
hosp_id | Hospital ID | Integers: 1 - 10
px_id | Patient ID | Integers: 1 - 73
pra_vX | PRA value at visit X | Integers: 0 - 100. Visit 0 indicates baseline.
hw2_hosp.dta
hosp_id | Hospital ID | Integers: 1 - 10
Region | Region | Letters
Note: the study might add more patients, hospitals, and regions in the future, so hosp_id, px_id, and region might include more values.
i) Load hw2_pra_hist.dta. Print a table, as shown in the attached file, that displays the number of patients with a valid PRA value greater than 80 (i.e., between 81 and 100) for each outcome variable (pra_v0, pra_v1, ..., pra_vN). N and XX should be replaced with the correct values from the dataset.
ii) Create a new variable peak_pra, which contains the highest value among the valid PRA measurements for each participant. Print the median (IQR) of peak_pra as shown in the attached file. XX.X should be replaced with the correct values from the dataset and formatted with one digit after the decimal point (e.g., 12.0). (Hint: the rowmax function in egen might be helpful.)
iii) Another dataset provided to you, hw2_hosp.dta, has information on which region each hospital is located in. Merge the current dataset in memory with hw2_hosp.dta. Use the list command to list the ID of the patient with the highest peak_pra value in each region, as shown in the attached file. X should be replaced with the correct values from the dataset. If there are ties (i.e., multiple patients with the highest value), print all tied patients. If region C has ties (while A and B do not), the table will look like the one in the attached file. If a region does not have any patients in hw2_pra_hist, do not list that region.
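The three steps of Question 1 can be sketched on toy data as follows. This is only an illustration under stated assumptions: the two small datasets are hypothetical stand-ins for hw2_pra_hist.dta and hw2_hosp.dta, the region variable is assumed to be named region in lowercase, and the display and list layouts are placeholders because the required output format is given only in the attached file.

```stata
// Hypothetical stand-in for hw2_hosp.dta (region C has no patients)
clear
input hosp_id str1 region
1 "A"
2 "B"
3 "C"
end
tempfile hosp
save `hosp'

// Hypothetical stand-in for hw2_pra_hist.dta with three visits
clear
input hosp_id px_id pra_v0 pra_v1 pra_v2
1 1 90 85 .
1 2 50 70 60
2 3 .  10 95
2 4 80 95 79
end

// (i) The wildcard pra_v* picks up every visit variable, however many exist.
// inrange() excludes missing values, which a plain "> 80" would count.
foreach v of varlist pra_v* {
    quietly count if inrange(`v', 81, 100)
    display "`v': " r(N)
}

// (ii) rowmax ignores missing values within each row
egen peak_pra = rowmax(pra_v*)
quietly summarize peak_pra, detail
display "Median (IQR): " %4.1f r(p50) " (" %4.1f r(p25) "-" %4.1f r(p75) ")"

// (iii) keep(master match) drops regions that have no patients
merge m:1 hosp_id using `hosp', keep(master match) nogenerate
bysort region: egen region_max = max(peak_pra)
sort region px_id
list region px_id if peak_pra == region_max & !missing(peak_pra), noobs
```

Because ties satisfy peak_pra == region_max for every tied patient, all of them are listed without any extra handling. In Stata a missing value compares as larger than any number, which is why the count in step (i) uses inrange() rather than a bare comparison.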
Question 2 -
Define a program called univar. This program runs a series of univariable (simple) linear or logistic regressions between each of the independent variables and the dependent variable.
For example, if the user runs
univar var1 var2 var3 var4, outcome(var5)
this program will quietly run four univariable linear regressions on var5,
regress var5 var1
regress var5 var2
regress var5 var3
regress var5 var4
and return the following output, assuming that var2 and var4 were significantly (p<0.05) associated with var5. P-values should be formatted with three digits after the decimal point.
Significantly associated with var5:
var2 (p=x.xxx)
var4 (p=x.xxx)
Similarly, if the user runs this program with the logistic option
univar var1 var2 var3 var4, outcome(var5) logistic
this program will quietly run four univariable logistic regressions on var5,
logistic var5 var1
logistic var5 var2
logistic var5 var3
logistic var5 var4
and return the following output, assuming that var2 and var4 were significantly (p<0.05) associated with var5. P-values should be formatted with three digits after the decimal point.
Significantly associated with var5:
var2 (p=x.xxx)
var4 (p=x.xxx)
This program should not alter the dataset in the memory: i.e., if you need to alter the dataset, restore to the original status after completing your procedures.
Hint: The program model in lecture 4 has some similarities with this question. The p-value after regress and logistic can be obtained using the following code:
Command | Code for p-value (change var1 as appropriate)
regress | ttail(e(df_r), abs(_b[var1]/_se[var1]))*2
logistic | (1-normal(abs(_b[var1]/_se[var1])))*2
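As a minimal sketch of how the two p-value expressions above might be used, the snippet below applies them to Stata's bundled auto dataset. The dataset, the variables price, weight, length, and foreign, and the scalar names pv_lin and pv_logit are all assumptions made for illustration; they are not part of the assignment. It shows only the p-value step and the three-decimal formatting, not the full univar program.

```stata
// Illustration on the auto dataset shipped with Stata (an assumption,
// not the assignment data)
sysuse auto, clear

// Linear case: loop over predictors, print only the significant ones
foreach x of varlist weight length {
    quietly regress price `x'
    scalar pv_lin = ttail(e(df_r), abs(_b[`x']/_se[`x']))*2
    if pv_lin < 0.05 display "`x' (p=" %5.3f pv_lin ")"
}

// Logistic case: _b[] holds the log-odds coefficient after logistic
quietly logistic foreign weight
scalar pv_logit = (1 - normal(abs(_b[weight]/_se[weight])))*2
if pv_logit < 0.05 display "weight (p=" %5.3f pv_logit ")"
```

The scalar names were chosen so that they do not abbreviate any variable in the dataset; in Stata expressions a variable name or abbreviation takes precedence over a scalar of the same name, so a scalar called p would be read as price here.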
Question 3 -
Print the following text: "Question 3: I estimate that it took me xxxx hours to complete this assignment."
For example, if it took you six hours of active work time (not counting when you ate/slept/did other things), your .do file should contain a line that prints this text with xxxx replaced by 6. Give an honest answer; this is just for our data collection purposes. However, this question is worth some points, so don't skip it!
Question 4 -
A prime number is a natural number greater than 1 that cannot be formed by multiplying two smaller natural numbers. Write prime, a program that takes any real number as an option n and determines whether the number is a prime number or not. The program will also display an error message when the user enters any number that is not a natural number greater than 1.
For example: If the user types prime, n(100), your program will display "100 is NOT a prime number."
If the user types prime, n(109), your program will display "109 is a prime number."
If the user types prime, n(1), your program will display "Invalid input: enter a natural number greater than 1."
If the user types prime, n(3.14), your program will display "Invalid input: enter a natural number greater than 1."
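One way the checks described above could be put together is sketched below, using trial division up to the square root of n. The program name prime_sketch, the choice to read n as a string and validate it with confirm number, and the returned scalar r(prime) are assumptions made for this illustration; the assignment itself does not prescribe them.

```stata
capture program drop prime_sketch
program define prime_sketch, rclass
    // Read n as text so that any input can be validated explicitly
    syntax , n(string)

    // Valid input: a number that is an integer and greater than 1
    local ok = 0
    capture confirm number `n'
    if _rc == 0 {
        if `n' == floor(`n') & `n' > 1 local ok = 1
    }
    if !`ok' {
        display "Invalid input: enter a natural number greater than 1."
        exit
    }

    // Trial division: a composite n has a divisor d with d*d <= n
    local is_prime = 1
    local d = 2
    while `d' * `d' <= `n' & `is_prime' {
        if mod(`n', `d') == 0 local is_prime = 0
        local ++d
    }

    if `is_prime' display "`n' is a prime number."
    else display "`n' is NOT a prime number."
    return scalar prime = `is_prime'
end

prime_sketch, n(109)
prime_sketch, n(100)
prime_sketch, n(3.14)
```

Stopping the loop once d*d exceeds n keeps the number of divisions small, and returning the result in r(prime) makes the outcome easy to check programmatically in addition to the displayed message.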
Note - Only Q2 and Q4 are needed, to be solved using Stata.
Attachment:- Assignment Files.rar