Reference no: EM132385487
Project Instructions -
This project will be based on the Data Science Evolution data set (please see the tutorial 1 instructions for a description of the data file). However, the data file has been augmented with 2 new variables. The first new variable is 'Satisfaction', which is a rating of agreement with the statement 'I leave work with a sense of achievement each day', rated on a 1 to 5 scale where 1 = 'strongly disagree', 2 = 'disagree', 3 = 'neither disagree nor agree', 4 = 'agree', and 5 = 'strongly agree'; an additional code, 6 = 'don't know', falls outside the rating scale. The second new variable is 'HiPo', which records whether a respondent has been formally identified as a high potential employee (0 = 'No', 1 = 'Yes', 6 = 'Don't know'). Make sure to download the assignment version of this data file, DataSciEvolution_A1.csv.
For this assignment, you need to subset your data file. If your student id number ends in 0, 1, 2, or 3, you should analyze the data subset for which industry = 8 (Health); if your student id number ends in 4, 5, or 6, you should analyze the data subset for which industry = 11 (Manufacturing); and if your student id number ends in 7, 8, or 9, you should analyze the data subset for which industry = 15 (Retail). Because different students analyze different subsets, your answers will be different to those of other students. Please see this week's instructional video to see how you can subset your data appropriately. Importantly, in this assignment, I am looking to see your substantive interpretation of the statistical results (i.e. your interpretation and conclusions matter as much as the statistical analysis!).
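The subsetting rule can be sketched as a small lookup. This is a minimal plain-Python illustration, not the method shown in the instructional video. The column name `industry`, the helper names, the student id, and the sample rows are all assumptions made for the example; check the real header in DataSciEvolution_A1.csv.

```python
import csv
import io

# Industry codes taken from the brief: 8 = Health, 11 = Manufacturing, 15 = Retail.
DIGIT_TO_INDUSTRY = {
    **dict.fromkeys("0123", 8),
    **dict.fromkeys("456", 11),
    **dict.fromkeys("789", 15),
}

def industry_for_student_id(student_id: str) -> int:
    """Return the industry code assigned to a student id by its last digit."""
    return DIGIT_TO_INDUSTRY[student_id.strip()[-1]]

def subset_rows(rows, industry_code, column="industry"):
    """Keep only rows whose industry column equals the assigned code.

    The column name "industry" is an assumption; adjust it to the real header.
    """
    return [r for r in rows if r[column].strip() == str(industry_code)]

# Made-up sample standing in for DataSciEvolution_A1.csv.
sample = io.StringIO("industry,Extraction\n8,3\n11,4\n15,2\n8,5\n")
rows = list(csv.DictReader(sample))

code = industry_for_student_id("1234567")  # hypothetical id ending in 7
subset = subset_rows(rows, code)
print(code, len(subset))
```

With the real file, `io.StringIO(...)` would be replaced by an opened file handle; the filtering logic stays the same.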
Questions -
i) What 'level of measurement' are the 'data science' variables in this data set (i.e. Extraction, Modeling, Visualization, Statistics, Programming, and Experimentation)? How might this impact the analyses you perform?
ii) Undertake data screening and cleaning. Ensure you recode any missing values appropriately, and make sure you examine the patterns of missing data in your analysis, including addressing both 'don't know' and 'missing data' responses. Note that we have a lecture on missing data analysis scheduled for Tuesday 15th; a video will be uploaded ahead of this class.
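One way to keep 'don't know' and 'missing data' responses distinct is to count them separately before recoding both as missing. The sketch below is a plain-Python illustration under two stated assumptions: missing responses appear as blank cells, and the code 6 means 'don't know' only for Satisfaction and HiPo (the brief does not say how the other variables are coded). The sample rows are made up.

```python
# Codes from the brief: 6 means "don't know" for Satisfaction and HiPo.
# Treating 6 as a valid score on every other variable is an assumption.
DONT_KNOW = {"Satisfaction": "6", "HiPo": "6"}

def missing_summary(raw_rows):
    """Count blank and "don't know" responses separately for each variable."""
    summary = {}
    for row in raw_rows:
        for name, value in row.items():
            counts = summary.setdefault(name, {"blank": 0, "dont_know": 0})
            if value.strip() == "":
                counts["blank"] += 1
            elif DONT_KNOW.get(name) == value.strip():
                counts["dont_know"] += 1
    return summary

def recode(row):
    """Return a copy of row with blanks and "don't know" codes set to None."""
    cleaned = {}
    for name, value in row.items():
        value = value.strip()
        if value == "" or DONT_KNOW.get(name) == value:
            cleaned[name] = None
        else:
            cleaned[name] = float(value)
    return cleaned

# Made-up rows as they might come out of csv.DictReader (all values are strings).
raw = [
    {"Satisfaction": "4", "HiPo": "1", "Modeling": "3"},
    {"Satisfaction": "6", "HiPo": "", "Modeling": ""},
    {"Satisfaction": "", "HiPo": "6", "Modeling": "5"},
]
summary = missing_summary(raw)
clean = [recode(r) for r in raw]
print(summary)
```

Counting before recoding preserves the distinction the question asks about; once both kinds of response become `None`, they can no longer be told apart.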
iii) Create three 'composite scores' by creating an average of the items for each scale. Composite 1 should include the average of the variables Extraction and Modeling, Composite 2 should include the average of the variables Experimentation and Statistics, Composite 3 should include the average of the variables Programming and Visualization.
Summarize and interpret each composite distribution by presenting a box plot for the variable (i.e. a graph of the five-number summary: the minimum, maximum, median, and lower and upper quartiles), and create histograms showing the distribution of each of your variables.
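The composite scores and the five-number summary behind a box plot can be sketched as follows. This is a plain-Python illustration with made-up scores; it assumes missing values have already been recoded to `None`, and it sets a composite to missing whenever either item is missing, which is one possible rule rather than a requirement of the brief. Quartile conventions differ between packages, so the quartiles here may not match other software exactly.

```python
import statistics

# Item pairings taken from question iii).
COMPOSITES = {
    "Composite1": ("Extraction", "Modeling"),
    "Composite2": ("Experimentation", "Statistics"),
    "Composite3": ("Programming", "Visualization"),
}

def composite(row, items):
    """Average the items; None if any item is missing (an assumed rule)."""
    values = [row[i] for i in items]
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Made-up respondents; only the Composite1 items are filled in here.
rows = [
    {"Extraction": 1.0, "Modeling": 1.0},
    {"Extraction": 2.0, "Modeling": 2.0},
    {"Extraction": 2.0, "Modeling": 4.0},
    {"Extraction": 4.0, "Modeling": 4.0},
    {"Extraction": 5.0, "Modeling": 5.0},
    {"Extraction": None, "Modeling": 3.0},
]
composite1 = [composite(r, COMPOSITES["Composite1"]) for r in rows]
summary1 = five_number_summary(composite1)
print(summary1)
```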
iv) Check that each of your three composites is reliable using Cronbach's alpha and interpret your results. Note that we will discuss the concept of reliability on Tuesday 8th, and a video will be uploaded following this class.
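For reference, Cronbach's alpha for a scale of $k$ items is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$, where $s_i^2$ is the variance of item $i$ and $s_T^2$ is the variance of the summed score. The sketch below applies that formula in plain Python to made-up scores. Dropping respondents with any missing item (listwise deletion) is an assumption of the example, not an instruction from the brief.

```python
import statistics

def cronbach_alpha(item_columns):
    """Cronbach's alpha for k items given as equal-length lists of scores.

    Respondents with a missing value (None) on any item are dropped
    (listwise deletion, an assumed choice).
    """
    complete = [r for r in zip(*item_columns) if None not in r]
    k = len(item_columns)
    # Sample variance of each item across the complete respondents.
    item_variances = [statistics.variance(col) for col in zip(*complete)]
    # Sample variance of each respondent's summed score.
    total_variance = statistics.variance([sum(r) for r in complete])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Made-up item scores for the two items of Composite 1.
extraction = [1, 2, 3, 4, 5, None]
modeling = [2, 2, 4, 4, 5, 3]
alpha = cronbach_alpha([extraction, modeling])
print(round(alpha, 3))
```

With only two items per composite, alpha depends entirely on the correlation between the pair, which is worth bearing in mind when interpreting the result.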
v) This is a question about associations between variables. Please examine the correlation between each of your three composite variables and the reported level of job satisfaction, labelled Satisfaction. Choose the most appropriate correlation coefficient, and interpret it.
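Spearman's rank correlation is one candidate when a variable such as Satisfaction is treated as ordinal; whether it is the most appropriate coefficient is for the analysis to justify. The sketch below is a plain-Python illustration with made-up scores: it ranks each variable (tied values share the mean of their ranks) and computes Pearson's correlation on the ranks. Dropping pairs with a missing value is an assumption of the example.

```python
import statistics

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / (sx * sy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho, dropping pairs with a missing value (assumed rule)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    return pearson(rx, ry)

# Made-up composite scores and Satisfaction ratings.
composite1 = [1.0, 2.0, None, 4.0, 5.0]
satisfaction = [1, 2, 3, None, 5]
rho = spearman_rho(composite1, satisfaction)
print(rho)
```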
vi) This is a question about differences between subgroups of respondents. Examine whether there is any difference in scores on your composites for people who are considered High Potentials and people who are not considered High Potentials. Create an appropriate graph that illustrates your results.
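A subgroup comparison can start from per-group summaries and a rank-based statistic. The sketch below is a plain-Python illustration with made-up scores; it uses the Mann-Whitney U statistic as one possible choice (a t-test may suit the data better, and that decision belongs to the analysis). It assumes HiPo has already been recoded so that 'don't know' and missing responses are `None`. It does not compute a p-value or draw the graph the question asks for.

```python
import statistics

def split_by_hipo(scores, hipo):
    """Group composite scores by HiPo flag (0 = No, 1 = Yes), skipping missing."""
    groups = {0: [], 1: []}
    for score, flag in zip(scores, hipo):
        if score is not None and flag in groups:
            groups[flag].append(score)
    return groups

def mann_whitney_u(a, b):
    """U statistic for group a: pairs where a > b, counting ties as one half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

def rank_biserial(a, b):
    """Effect size in [-1, 1]; positive when group a tends to score higher."""
    return 2 * mann_whitney_u(a, b) / (len(a) * len(b)) - 1

# Made-up composite scores and recoded HiPo flags.
scores = [2.0, 3.0, 4.5, 5.0, None, 3.0]
hipo = [0, 0, 1, 1, 1, None]

groups = split_by_hipo(scores, hipo)
medians = {flag: statistics.median(vals) for flag, vals in groups.items()}
u = mann_whitney_u(groups[1], groups[0])
effect = rank_biserial(groups[1], groups[0])
print(medians, u, effect)
```

The two lists in `groups` are also the inputs a side-by-side box plot would need.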
vii) Your colleagues are considering follow-up qualitative research interviews that they say will give a richer perspective on how data science skills have changed for segments. What ethical considerations should they factor into their thinking about a proposed research design?