STAT6001 Data Wrangling and Visualisation - University of Newcastle
Section A - Space Race (all launches since 1957)
For Section A of this assignment you will use Excel and/or PowerBI to prepare the dataset "A1A space race.csv" and to create visualisations that help answer questions about the data. The dataset was sourced from Kaggle, where it had been compiled by web scraping, and contains data on all space missions since 1957.
Question 1
a) Create a variable for Country based on the launch location. Document any decisions you make regarding the country of any launches conducted at sea or on islands.
Show a table of the total number of launches by country. Which two countries have the highest number of launches?
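One possible rule for part (a) is to take the text after the last comma of the launch location as the country, with a small table of documented exceptions for launches at sea or on islands. The Python sketch below only illustrates that logic; the assignment itself is to be completed in Excel and/or PowerBI. The location strings, the "site, region, country" layout and the override entry are all hypothetical, assumed for illustration rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical override table: each entry records a documented decision
# about a launch site at sea or on an island.
OVERRIDES = {
    "Pacific Ocean": "At sea (no country assigned)",
}


def country_from_location(location, overrides=OVERRIDES):
    """Return the text after the last comma, applying any documented override."""
    last_part = location.split(",")[-1].strip()
    return overrides.get(last_part, last_part)


# Made-up location strings in an assumed "site, region, country" layout.
sample_locations = [
    "Pad A, Example Space Center, Florida, USA",
    "Pad B, Example Cosmodrome, Kazakhstan",
    "Pad C, Example Range, California, USA",
    "Sea Platform, Pacific Ocean",
]

launches_by_country = Counter(
    country_from_location(loc) for loc in sample_locations
)

# most_common() lists countries from most to fewest launches.
for country, n_launches in launches_by_country.most_common():
    print(f"{country}: {n_launches}")
```

Keeping the exceptions in one table makes each decision about sea and island launches explicit and easy to document.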
b) Create a line graph showing the number of launches per year since 1957. According to the graph, what year was the peak?
c) Filter the data to launches in the USA only. Is there any seasonal trend in the timing of launches throughout the year?
d) Create a graph that shows the status of rockets and a graph that shows the status of missions. What proportion of rockets are active and what proportion of missions have been successful?
e) Create a table that shows the number of and total cost of rocket launches by country. Which are the two countries that have spent the most on rocket launches? What issues are there with this comparison?
Question 2
a) Dichotomise mission status into "Successful" and "Not successful". Create a stacked bar chart with heights set to 100% that shows the mission success rates of Russia and the USA.
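The dichotomisation and the 100% scaling in part (a) can be sketched as follows. This Python sketch is illustrative only, since the assignment is to be done in Excel and/or PowerBI. The status labels ("Success", "Failure", "Partial Failure") and the sample records are assumptions, not values confirmed from the dataset.

```python
from collections import defaultdict


def dichotomise(status, success_label="Success"):
    """Map a mission status to one of two categories.

    The label "Success" is an assumed value; check the actual labels
    in the dataset before relying on it.
    """
    return "Successful" if status.strip() == success_label else "Not successful"


def success_shares(records):
    """Return {country: {category: percent}} for (country, status) pairs."""
    counts = defaultdict(lambda: {"Successful": 0, "Not successful": 0})
    for country, status in records:
        counts[country][dichotomise(status)] += 1

    shares = {}
    for country, by_category in counts.items():
        total = sum(by_category.values())
        # Each country's two percentages sum to 100, which is what a
        # 100% stacked bar chart displays.
        shares[country] = {
            category: 100 * n / total for category, n in by_category.items()
        }
    return shares


# Made-up records for illustration only.
sample_records = [
    ("USA", "Success"),
    ("USA", "Failure"),
    ("USA", "Success"),
    ("USA", "Success"),
    ("Russia", "Success"),
    ("Russia", "Partial Failure"),
]

shares = success_shares(sample_records)
for country, by_category in shares.items():
    print(country, by_category)
```

Here every status other than the assumed success label is treated as "Not successful", including partial failures; that choice should be stated in the answer.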
b) Compare the annual number of launches over time for Russia and the USA. What periods of high activity and/or trend(s) do you see in terms of mission launches for the two nations?
Hint: the time period of the "space race" is generally considered to be 1955-1975, and the Cold War between the US and the Soviet Union spanned from 1947 to 1991.
c) Collapse "Russia" and "Kazakhstan" into a single category called "USSR/Russia". How would this affect the results of previous parts of Questions 1 and 2?
Section B - Earthquakes 1965-2016
Import the dataset "A1B earthquakes.csv" into SAS to answer the following questions. The dataset contains the date, time, location, size, and source of significant earthquakes (magnitude 5.5 or higher) recorded by seismograph networks between 1965 and 2016. The data were recorded by the National Earthquake Information Center (NEIC) and made available online by the United States Geological Survey (USGS).
Description of variables
• Latitude - number of degrees north or south of the equator (negative values for southern hemisphere, positive values for northern hemisphere), -90 to +90
• Longitude - number of degrees east or west of the prime meridian (negative values indicate west, positive values indicate east), -180 to +180
• Type - type of seismic event
• Depth - in kilometres, vertical distance below mean sea level
• Depth seismic stations - number of seismic stations that supplied data for the depth measurement
• Magnitude - best available estimate of the size of the seismic event at its source, measured on a (base 10) logarithmic scale
• Magnitude type - type of algorithm used to calculate magnitude
• Magnitude seismic stations - number of seismic stations that supplied data for the magnitude measurement
• Azimuthal gap - in degrees (0-360), gap between seismic stations. Larger values indicate higher uncertainty in depth and location measurements
• Horizontal distance - in kilometres, indicates uncertainty in the horizontal location measurement
• Status - indicates whether the event has been reviewed for validity by a human or automatically processed by the system.
Questions
For each question part, your answer should include only the necessary SAS output (tables, graphs) and brief sections of SAS code.
Question 1
a) Explore the variables in the dataset and complete the table below.
For each variable in the table, list the type (e.g., continuous, discrete, ordinal, categorical) and the number of rows missing an entry for that variable. If the variable is categorical or ordinal, list the number of levels; if the variable is continuous or discrete, list the minimum and maximum values.
| Variable | Variable type | N levels (if categorical) | Min, Max (if numeric) | N missing |
|---|---|---|---|---|
| Latitude | | | | |
| Longitude | | | | |
| Type | | | | |
| Depth | | | | |
| Depth seismic stations | | | | |
| Magnitude | | | | |
| Magnitude_type | | | | |
| Magnitude seismic stations | | | | |
| Azimuthal_gap | | | | |
| Horizontal_distance | | | | |
| Status | | | | |
b) Are there any range errors for the numeric variables? Explain why/why not.
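A range check compares each numeric value against the limits given in the description of variables. The sketch below shows that logic in Python for illustration only; the assignment answer must use SAS. The limits come from the variable descriptions above (with magnitude at least 5.5, as stated for the dataset), while the dictionary keys and the sample rows are hypothetical.

```python
# Valid ranges from the description of variables; None means no bound is stated.
VALID_RANGES = {
    "Latitude": (-90, 90),
    "Longitude": (-180, 180),
    "Azimuthal gap": (0, 360),
    "Magnitude": (5.5, None),
}


def range_errors(row, valid_ranges=VALID_RANGES):
    """Return the names of variables in `row` whose values fall outside their range."""
    errors = []
    for name, (low, high) in valid_ranges.items():
        value = row.get(name)
        if value is None:
            # A missing value is counted separately, not as a range error.
            continue
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            errors.append(name)
    return errors


# Made-up rows for illustration only.
good_row = {"Latitude": -20.5, "Longitude": 145.0, "Azimuthal gap": 35.0, "Magnitude": 6.1}
bad_row = {"Latitude": 95.0, "Longitude": -170.0, "Azimuthal gap": None, "Magnitude": 5.0}

print(range_errors(good_row))
print(range_errors(bad_row))
```

Separating missing values from out-of-range values keeps this check consistent with the "N missing" column requested in part (a).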
c) Use an appropriate graph and summary statistics to describe the distribution of magnitude.
d) Create a formatted numeric variable that categorises magnitude according to the following classes:
Show your SAS code and a frequency table of magnitude class.
e) Examine the distribution of depth using a histogram.
The depth of earthquakes can be categorised into three zones. Shallow earthquakes are between 0 and 70 km deep; intermediate earthquakes, 70-300 km deep; and deep earthquakes, 300-700 km deep.
Create a formatted numeric variable that categorises depth for earthquakes only (not other seismic events that are recorded in the dataset). Show your SAS code and a frequency table of depth zone.
What proportion of earthquakes occur in the Deep zone?
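The zone boundaries above can be expressed as a simple classification rule. The Python sketch below is only an illustration of that rule; the assignment requires a formatted variable in SAS. It assumes that a depth exactly on a boundary (70 or 300 km) belongs to the shallower zone, that depths outside 0-700 km are left unclassified, and that earthquakes are identified by the Type value "Earthquake"; the sample records are made up.

```python
from collections import Counter


def depth_zone(depth_km):
    """Classify a depth in km; boundary values go to the shallower zone (assumption)."""
    if depth_km is None or depth_km < 0 or depth_km > 700:
        return None  # missing or outside the stated 0-700 km range
    if depth_km <= 70:
        return "Shallow"
    if depth_km <= 300:
        return "Intermediate"
    return "Deep"


# Made-up (type, depth in km) records; "Other event" is a placeholder label.
sample_events = [
    ("Earthquake", 10.0),
    ("Earthquake", 70.0),
    ("Earthquake", 150.0),
    ("Earthquake", 650.0),
    ("Other event", 0.0),
]

# Keep earthquakes only, as the question requires, then classify.
zones = [depth_zone(depth) for kind, depth in sample_events if kind == "Earthquake"]
zone_counts = Counter(zones)

deep_proportion = zone_counts["Deep"] / len(zones)
print(zone_counts)
print(f"Proportion in Deep zone: {deep_proportion:.2%}")
```

Whichever boundary convention is chosen, it should be stated in the answer, because the zone definitions overlap at 70 km and 300 km.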
f) Examine the relationship between depth zone and magnitude class for Earthquakes using a contingency table.
Does magnitude differ by depth zone? Use appropriate summary table(s) and graph(s) to support your conclusion.
Question 2 - Own question
Propose your own question that can be answered by this dataset and investigate the answer using tables and/or charts. Write a summary of your findings (approximately 100-200 words).
For example, you might like to investigate one of these topics:
• Create maps in PowerBI showing the location of earthquakes. Show the depth zones and then the magnitude classes.
• Has the annual number of earthquakes changed over time? What about the average magnitude?
• What are the characteristics of the events that were not earthquakes?
• Investigate the bump in the tail of the distribution of depth.
• Compare one of the error variables (e.g., azimuthal gap, horizontal distance) by whether or not the measurements were verified by a human.
1. Describes a question or topic of interest as it relates to variables in the dataset.
2. Statements are supported by relevant tables or charts as evidence from the data.
3. Refers to specific quantities (counts, percentages, statistics) as part of written answer.
4. Communicates clearly regarding filtered/grouped data or categories when summarising data or making comparisons.
5. Writes with clarity and organisation using report-style language.