Reference no: EM132376748
Questions -
Question 1: Jaime is a researcher interested in security events. He suspected that people in the 20-29 age group were more likely to say they had experienced security events than people in the 30-39 age group. He obtained separate random samples of people from each age group. Here are the results:
Have you been hacked? | 20-29 | 30-39
Yes                   |    12 |    12
No                    |    68 |   108
Total                 |    80 |   120
Jaime wants to use these results to construct a 99% confidence interval to estimate the difference between the proportion of people in each age group who would say they have been hacked (x = p20s-p30s). Assume that all of the conditions for inference have been met.
Which of the following is a correct 99% confidence interval based on Jaime's samples?
a. x ≤ 0.142
b. x ≤ 0.145
c. x ≤ 0.171
d. x ≤ 0.175
e. x ≤ 0.125
f. -0.0747101 ≤ x ≤ 0.17471
g. -0.0419321 ≤ x ≤ 0.141932
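As a sanity check on the arithmetic, a minimal sketch of the usual large-sample two-proportion interval (the variable names are illustrative only):

```python
import math

# Counts read from the table above
yes_20s, n_20s = 12, 80     # 20-29 age group
yes_30s, n_30s = 12, 120    # 30-39 age group

p1 = yes_20s / n_20s        # sample proportion, 20-29
p2 = yes_30s / n_30s        # sample proportion, 30-39
diff = p1 - p2              # point estimate of p20s - p30s

# Standard error of the difference between two independent proportions
se = math.sqrt(p1 * (1 - p1) / n_20s + p2 * (1 - p2) / n_30s)

z = 2.576                   # critical value for 99% confidence
lower, upper = diff - z * se, diff + z * se
print(f"99% CI: ({lower:.4f}, {upper:.4f})")
```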
Question 2: The number of emails received at a certain email address on any randomly selected day is approximately normally distributed, with a mean of 18 and a standard deviation of 6. What is the probability that x will equal 20?
a. 1.64
b. 0.641
c. 0.0641
d. 0.164
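The wording "equal 20" is ambiguous for a continuous distribution; a hedged sketch of two common interpretations (evaluating the normal density at 20, or using a continuity-corrected interval), assuming SciPy is available:

```python
from scipy.stats import norm

mu, sigma = 18, 6   # mean and standard deviation from the question

# Interpretation 1: the normal density evaluated at x = 20
density_at_20 = norm.pdf(20, loc=mu, scale=sigma)

# Interpretation 2: continuity-corrected probability P(19.5 < X < 20.5)
prob_near_20 = norm.cdf(20.5, mu, sigma) - norm.cdf(19.5, mu, sigma)

print(density_at_20, prob_near_20)   # both come out near 0.06
```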
Question 3: Suppose you have a population of 1000 users online and you sample 55 of those users at random to investigate how many of them are choosing to access inappropriate links. You expect 5 of the sampled users to choose the inappropriate links, and you are using a 95% confidence level to justify lobbying management for stricter internet controls within the organisation. Calculate your confidence interval and justify whether or not it would be reasonable to implement stricter controls.
a. ±5.60% this is at a level where you can be 95% confident that the results are correct
b. ±5.60% this is at a level where you cannot be 95% confident that the results are correct
c. ±4.25% this is at a level where you can be 95% confident that the results are correct
d. ±4.25% this is at a level where you cannot be 95% confident that the results are correct
Question 4: Should we be concerned about the characteristics of the probability distribution in a situation where we have 49 data points with a sample mean of 6.25 and a sample variance of 12?
a. Yes
b. No
Question 5: Determine the null and alternative hypotheses for the following case (assuming the standard definitions of p and μ):
When inspecting log files of network traffic to establish a baseline, the following question is asked: "Is the majority of normal TELNET traffic from ISP 192.168.0.3?"
a. H0: p=0.5 vs Ha: p>0.5
b. H0: μ=0.5 vs Ha: μ>0.5
c. H0: p=0.5 vs Ha: p<0.5
d. H0: μ=0.5 vs Ha: μ<0.5
Question 6: Determine the null and alternative hypotheses (under the standard definitions of p and μ): A security team wants to see whether the mean lifetime of a VoIP communication is less than 5 minutes so that they can investigate any calls outside that duration. The security team have indicated that they think the average lifetime is at least 5 minutes.
a. H0: p=5 vs Ha: p<5
b. H0: μ=5 vs Ha: μ>5
c. H0: p=5 vs Ha: p≠5
d. H0: μ=5 vs Ha: μ<5
Question 7: Determine the null and alternative hypotheses for the following case: testing to see if rural or urban dwellers are more likely to experience an intrusion event for a device connected to the internet.
a. H0 : μrural>μurban, Ha : μrural = μurban
b. H0 : μrural=μurban, Ha : μrural > μurban
c. H0 : μrural<μurban, Ha : μrural = μurban
d. H0 : μrural=μurban, Ha : μrural < μurban
Question 8: Determine the null and alternative hypotheses (under the standard definitions of p and μ): The average file downloaded from a server is supposed to be approximately 450 kB and to take 8.5 minutes to download. An analyst wants to check whether the transfer rate is being affected by mitigation measures they have installed.
a. H0: p=0.88 vs Ha: p<0.88
b. H0: μ=0.88 vs Ha: μ<0.88
c. H0: p=0.88 vs Ha: p≠0.88
d. H0: μ=8.5 vs Ha: μ≠8.5
e. H0: μ=450 vs Ha: μ≠450
Question 9: Consider the following data set:
Dependent numbers (y) | Independent numbers (x)
 17 |  53
 49 | 112
 68 | 435
 86 | 509
 99 | 642
113 | 955
What is the expected value of y for an independent variable value of x = 80?
a. 32
b. 34
c. 35
d. 36
e. 43
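A minimal sketch of the usual least-squares fit and prediction, using the (x, y) pairs from the table above:

```python
# Simple least-squares regression of y on x, then prediction at x = 80
x = [53, 112, 435, 509, 642, 955]
y = [17, 49, 68, 86, 99, 113]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope b1 = S_xy / S_xx, intercept b0 = y_bar - b1 * x_bar
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(f"fitted line: y = {b0:.2f} + {b1:.4f}x")
print(f"expected y at x = 80: {b0 + b1 * 80:.1f}")
```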
Question 10: What is the coefficient of determination used for?
a. To determine which equation you should apply
b. To determine how well the Standard Deviation reflects the data
c. To determine how well the equation fits the data
d. All of the above
e. None of the above
Question 11: Would a coefficient of determination of 63% be considered to indicate a good fit?
a. yes
b. no
Question 12: Researchers conducted a study "comparing the number of intrusions to the number of connected devices" in a population to determine whether men or women were more likely to experience an event. The researchers obtained device and connectivity data for a random sample of adults.
The researchers calculated the average number of devices of the rural residents and urban residents in the sample population. They want to test whether rural residents have a higher average number of intrusion events than urban residents. Assume that all conditions for inference have been met.
Decide which is the most appropriate test for the question "is the number of intrusions related to the number of connected devices?"
a. An ANOVA test
b. A z-test
c. A paired T test
d. A Chi-squared test
Question 13: Calculate the Correlation Coefficient given the following data:
 x |  y
26 | 52
98 | 54
46 | 11
16 | 23
39 | 22
a. 0.21
b. 0.44
c. 0.67
d. 0.54
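A minimal sketch of the Pearson correlation calculation for these five (x, y) pairs:

```python
import math

x = [26, 98, 46, 16, 39]
y = [52, 54, 11, 23, 22]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Pearson correlation r = S_xy / sqrt(S_xx * S_yy)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r = {r:.2f}")
```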
Question 14: For a sample population, illustrated in the table below, we are trying to understand whether there is a relationship between the source of network traffic and the size of the packets. The x value shows the source and the y value shows the mean packet size. The remaining columns show the deviation of each source from the mean, together with the squared deviations and their cross products. The summary rows at the bottom give the sums and the means used in the regression analysis.
Source |  xi |  yi | (xi − x̄) | (yi − ȳ) | (xi − x̄)² | (yi − ȳ)² | (xi − x̄)(yi − ȳ)
A      |  95 |  85 |       17 |        8 |       289 |        64 |              136
B      |  85 |  95 |        7 |       18 |        49 |       324 |              126
C      |  80 |  70 |        2 |       -7 |         4 |        49 |              -14
D      |  70 |  65 |       -8 |      -12 |        64 |       144 |               96
E      |  60 |  70 |      -18 |       -7 |       324 |        49 |              126
Sum    | 390 | 385 |          |          |       730 |       630 |              470
Mean   |  78 |  77 |          |          |           |           |
Given that the regression equation is an equation of a line, of the form: y = b0 + b1x, solve for b0 and b1.
a. y = 27.19 + 0.010x
b. y = 0.010 + 27.19x
c. y = 0.644 + 26.768x
d. y = 26.768 + 0.644x
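A minimal sketch of how b0 and b1 follow from the summary sums in the table, using the standard least-squares formulas:

```python
# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2);  b0 = y_bar - b1 * x_bar
sum_cross_dev = 470    # sum of (xi - x_bar)(yi - y_bar) from the table
sum_sq_dev_x = 730     # sum of (xi - x_bar)^2 from the table
x_bar, y_bar = 78, 77  # means from the table

b1 = sum_cross_dev / sum_sq_dev_x
b0 = y_bar - b1 * x_bar
print(f"y = {b0:.3f} + {b1:.3f}x")
```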
Question 15: Given the Network Capture Data, what would you do first? Assumptions: you only have basic MS Office (or similar) tools for the analysis, and standard computing resources such as a Pentium processor, 8 GB RAM and ≤ 1 TB of storage.
a. Triage the data, format and munge it, and filter on interesting events and segments of no activity. Plot basic visualisations such as scatterplots and histograms to explore the data. Inspect the data using filters etc. for anomalies, nulls and oddities in formatting. Save any "event" data sets as smaller discrete sets covering the irregularities. Isolate segments of "baseline" data and "event" data for further investigation. Calculate basic statistics and confirm relationships using those statistics.
b. Use bash tools like grep, awk, sed and cut to get familiar with the data, e.g. to show the protocols, and then start playing around with RStudio to do the same thing. Then do the same thing with Splunk; under visualisations this will give a count and percentage of each event as well as a nice little bar chart.
c. Categorize the data and create a framework. This is often referred to as coding or indexing the data. Identify themes or patterns that may consist of ideas, concepts, behaviors, interactions, phrases and so forth. Set up a coding plan to provide a framework that will structure, label and define the data. Identify patterns and make connections. Interpret the data and explain findings.
d. Apply algorithms to the data to identify relationships among the variables, such as correlation or causation. Use inferential statistics, which include techniques to measure relationships between particular variables. For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y = aX + b + error, where the model is designed such that a and b minimise the error when the model predicts Y for a given range of values of X.
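As an illustration of the kind of triage described in option a, a minimal sketch using pandas and matplotlib; the file name capture.csv and the column names time, length and dst_port are hypothetical placeholders, since the layout of the attached data set is not reproduced here:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the capture with columns: time, src, dst, dst_port, length
df = pd.read_csv("capture.csv")

# Quick inspection for nulls and formatting oddities
print(df.info())
print(df.isna().sum())

# Basic exploratory plots: packet size over time and its distribution
fig1, ax1 = plt.subplots()
df.plot.scatter(x="time", y="length", s=2, ax=ax1, title="Packet size over time")
fig2, ax2 = plt.subplots()
df["length"].plot.hist(bins=50, ax=ax2, title="Packet size distribution")
plt.show()

# Save a smaller discrete set covering a suspected event window for closer analysis
event = df[(df["time"] > 7000) & (df["time"] < 7600)]
event.to_csv("event_window.csv", index=False)
```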
Question 16: After you analysed the Network Capture Data, what irregular behaviour did you notice?
a. Events around time point 2000, 7000. There is a time period at the start of the data set that appears to be free of any events of interest. The system then appears to operate normally for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs.
b. Events around time point 2000, 9000. There is a time period at the start of the data set that appears to be free of any events of interest. The system then appears to operate normally for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs.
c. Events around time point 0, 7000. There is a time period at the start of the data set that appears to be free of any events of interest (note here though that there was in fact an event near the beginning of the dataset where there was an increase in traffic. Is this the initial intrusion?). The system then appears to operate normally for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs.
d. Events around time point 0, 2000, 9000. There is a time period at the start of the data set that appears to be free of any events of interest (note here though that there was in fact an event near the beginning of the dataset where there was an increase in traffic. Is this the initial intrusion?). The system then appears to operate normally for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs.
Question 17: After you analysed the Network Capture Data, did you find any abnormal behaviour in the ports being used?
a. No
b. There are many different port numbers being used in an inconsistent manner, indicating that there may be a scan occurring. Most computers, in normal day-to-day operations, only use a small range of ports; the data is showing that a large range of ports is being used.
c. There are a few different port numbers being used in an inconsistent manner. This may indicate that there is a scan occurring, but this is unlikely. Most computers, in normal day-to-day operations, use a large range of ports, and as the data is showing that a large range of ports is being used in a consistent manner, something may be going on, but probably not.
d. There are many different port numbers being used in a consistent manner, indicating that there may be a scan occurring. However, most computers, in normal day-to-day operations, use a large range of ports, and as the data is showing that a large range of ports is being used in a consistent manner, something may be going on, but probably not.
Question 18: After you analysed the Network Capture Data set, what source do you think can be used for baseline data?
a. The time period immediately after the start of the data set appears to be free of any events of interest. The system then appears to operate "normally?" for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs. We should therefore take the data as close as possible to the beginning of the data set, because it should be enough data AND it should be far away from where events seem to start happening.
b. There is a time period just after the start of the data set that appears to be free of any events of interest. The system then appears to operate "normally?" for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs. The data in the middle of that period could be used for baselining. We don't take the data at the very start of the capture, as there could be an initial intrusion that hasn't yet begun any detectable behaviour.
c. The time period immediately after the start of the data set appears to be free of any events of interest. The system then appears to operate "normally?" for a period of time around 7000 seconds after epoch, before a large spike in traffic occurs. We should therefore take the data as close as possible to the beginning of the data set, because it should be enough data AND it should be far away from where events seem to start happening. Events seem to stop after 9000 seconds. We should therefore also take a good chunk out of there (>>9000), because together with the data from the beginning, this creates the most reliable baseline.
d. None of this data can be trusted, as events seem to happen everywhere. We should therefore get the baseline data from a different system that is identical to the "live" system and that we know with 100% certainty is free of malicious activity, e.g. because it is air-gapped.
e. This is a large data set, and although some events seem to occur, there are not many of them. Simply averaging all the data and using that as the baseline should therefore work quite well: the few events are averaged out. The baseline can then be used to find the outliers.
Question 19: What initial statistical analysis could you conduct on the Network Capture data to confirm your theories?
a. Standard Deviation. Would identify any traffic that varied from normal; normal traffic is predictable, usually around the same size and length, and tends to travel from a few ports to a few destination ports, so if traffic begins to scan across many ports that would be behaviour that may warrant investigation.
Simple Regression. Traffic tends towards consistent, predictable behaviour, and step-wise increases or decreases, unexplained inflection points and outliers may be seen using regression.
Multiple Regression. Looks at how changes in the combination of two or more predictor variables predict the level of change in the outcome variable.
Paired t-test. Checks the relationship between two variables from the same population, how much they vary and in what manner.
Independent t-test. Checks for a difference in the same variable between different populations (urban to rural, port to port).
ANOVA test. Tests more than two variables; tests between group means after any other variance in the outcome variable is accounted for.
b. Standard Deviation. Would identify any traffic that varied from normal; normal traffic is predictable, usually around the same size and length, and tends to travel from a few ports to a few destination ports, so if traffic begins to scan across many ports that would be behaviour that may warrant investigation.
Simple Regression. Traffic tends towards consistent, predictable behaviour, and step-wise increases or decreases, unexplained inflection points and outliers may be seen using regression.
Multiple Regression. Looks at how changes in the combination of two or more predictor variables predict the level of change in the outcome variable.
Paired t-test. Checks the relationship between two variables from the same population, how much they vary and in what manner.
Independent t-test. Checks for a difference in the same variable between different populations (urban to rural, port to port).
ANOVA test. Tests more than two variables; tests between group means after any other variance in the outcome variable is accounted for.
Cluster analysis. Maybe leave Bayesian analysis until later, but at least apply some initial k-means algorithms.
c. Standard Deviation. Would identify any traffic that varied from normal; normal traffic is predictable, usually around the same size and length, and tends to travel from a few ports to a few destination ports, so if traffic begins to scan across many ports that would be behaviour that may warrant investigation.
Simple Regression. Traffic tends towards consistent, predictable behaviour, and step-wise increases or decreases, unexplained inflection points and outliers may be seen using regression.
Paired t-test. Checks the relationship between two variables from the same population, how much they vary and in what manner.
Independent t-test. Checks for a difference in the same variable between different populations (urban to rural, port to port).
Cluster analysis. Maybe leave Bayesian analysis until later, but at least apply some initial k-means algorithms.
d. Standard Deviation. Would identify any traffic that varied from normal; normal traffic is predictable, usually around the same size and length, and tends to travel from a few ports to a few destination ports, so if traffic begins to scan across many ports that would be behaviour that may warrant investigation.
Simple Regression. Traffic tends towards consistent, predictable behaviour, and step-wise increases or decreases, unexplained inflection points and outliers may be seen using regression.
Multiple Regression. Looks at how changes in the combination of two or more predictor variables predict the level of change in the outcome variable.
Paired t-test. Checks the relationship between two variables from the same population, how much they vary and in what manner.
Independent t-test. Checks for a difference in the same variable between different populations (urban to rural, port to port).
Classification analysis. Including Bayesian analysis and supervised learning methods.
e. All of the above and more.
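For instance, a minimal sketch of applying one of the listed tests, an independent t-test comparing packet sizes in a baseline window with those in a suspected event window; the two samples below are purely hypothetical illustrations:

```python
from scipy.stats import ttest_ind

# Hypothetical packet-size samples (bytes) from a baseline window and an event window
baseline = [60, 64, 60, 66, 62, 60, 64, 60]
event = [1500, 1500, 1480, 1500, 1460, 1500, 1500, 1480]

# Welch's t-test (does not assume equal variances)
t_stat, p_value = ttest_ind(baseline, event, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```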
Question 20: What difficulties should you anticipate with analysing the Network Capture data set?
a. None. There are such great tools available nowadays that relatively small data sets like this one can always be handled easily.
b. The data file is very large and that volume will contribute to problems with data cleansing and manipulation.
The data needs to be separated into sets, cleaned, munged (or formatted so that it can be analysed), sorted and accessed easily.
As the extract, load, transform process occurs, the files grow to very large sizes making it difficult to manage them and conduct computations.
c. The data file is very large and that volume will contribute to problems with data cleansing and manipulation.
The time period may cause issues; to manage this, data can be rolled up into larger time intervals, for example 10-minute intervals or similar, to minimise the volume of data being manipulated. The risk that you may "average out" anomalies is fairly small, as the chance that they occur on such short time scales is limited.
The data needs to be separated into sets, cleaned, munged (or formatted so that it can be analysed), sorted and accessed easily.
d. The data file is very large and that volume will contribute to problems with data cleansing and manipulation.
The time period may cause issues; to manage this, data can be rolled up into larger time intervals, for example 10-minute intervals or similar, to minimise the volume of data being manipulated. The risk here is that you may "average out" anomalies.
The data needs to be separated into sets, cleaned, munged (or formatted so that it can be analysed), sorted and accessed easily.
As the extract, load, transform process occurs, the files grow to very large sizes making it difficult to manage them and conduct computations.
e. All of the above (except answer a.) and more.
Question 21: How would you manage any difficulties arising when analysing the Network Capture data set if you only had access to average office-type systems (for example a Pentium processor, 8 GB RAM and 1 TB of storage)?
a. After an initial inspection of the data, including creating a scatterplot and histogram to inspect data behaviour, small sample sets should be taken to analyse both baseline behaviour and investigate any anomalous behaviours. Statistically, this sample set will be analysed to determine the properties of the entire population and this should be considered in the sample selection, as both a set representing "normal" or "baseline" behaviour and a set for anomalies will need to be obtained.
b. After an initial inspection of the data (but not including creating a scatterplot and histogram to inspect data behaviour, because that is a later concern), small sample sets should be taken to analyse both baseline behaviour and investigate any anomalous behaviours. Statistically, this sample set will be analysed to determine the properties of the entire population, and this should be considered in the sample selection, as both a set representing "normal" or "baseline" behaviour and a set for anomalies will need to be obtained.
c. After an initial inspection of the data (but not including creating a scatterplot and histogram to inspect data behaviour, because that is a later concern), many chunks of large sample sets should be taken to analyse both baseline behaviour and investigate any anomalous behaviours. Statistically, this sample set will be analysed to determine the properties of the entire population, and this should be considered in the sample selection, as both a set representing "normal" or "baseline" behaviour and a set for anomalies will need to be obtained.
d. After an initial inspection of the data (but not including creating a scatterplot and histogram to inspect data behaviour, because that is a later concern), many chunks of large sample sets should be taken to analyse both baseline behaviour and investigate any anomalous behaviours. Statistically, this sample set will be analysed to determine the properties of parts of the population, and this should be considered in the sample selection, as separate sample sets representing "normal"/"baseline" behaviour and "anomalies" respectively will need to be obtained.
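A minimal sketch of one way to work within those limits: read the capture in chunks so the whole file never has to fit in RAM, and keep a random sample from each chunk for exploratory analysis. The file name capture.csv and the 10% sampling rate are illustrative assumptions only:

```python
import pandas as pd

samples = []
# chunksize keeps memory use bounded; each chunk is an ordinary DataFrame
for chunk in pd.read_csv("capture.csv", chunksize=100_000):
    samples.append(chunk.sample(frac=0.10, random_state=1))

sample_df = pd.concat(samples, ignore_index=True)
print(sample_df.describe())   # basic statistics on the sampled subset
```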
Attachment:- Data File.rar