Reference no: EM132931129, Length: 2000 words
5011CEM Big Data Programming Project - Coventry University
Learning Outcome 1: COMPUTATIONAL THINKING: develop and understand algorithms to solve problems; measure and optimise algorithm complexity; appreciate the limits of what may be done algorithmically in reasonable time or at all.
Learning Outcome 2: PROGRAMMING: create working solutions to a variety of computational and real-world problems using multiple programming languages chosen as appropriate for the task.
Learning Outcome 3: DATA SCIENCE: work with (potentially large) datasets; using appropriate storage technology; applying statistical analysis to draw meaningful conclusions; and using modern machine learning tools to discover hidden patterns.
Learning Outcome 4: SOFTWARE DEVELOPMENT: develop a product from the initial stage of requirement / analysis all the way through development to its final stages of testing / evaluation.
Learning Outcome 5: PROFESSIONAL PRACTICE: understand professional practices of the modern IT industry which include those technical (e.g. version control / automated testing) but also social, ethical & legal responsibilities.
Learning Outcome 6: TRANSFERABLE SKILLS: apply a wide variety of degree-level transferable skills including time management, team working, written and verbal presentation to both experts and non-experts, and critical reflection on own and others' work.
Learning Outcome 7: ADVANCED WORK: apply the above to advanced topics selected according to the interests of individual students.
Assessment Overview
Over the course of this module you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull together these techniques in a realistic scenario to complete a big data analysis project. Below is a realistic project scenario. By using the techniques presented during class you are to carry out the project and write a final project report for your client.
Project Scenario
You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it currently takes too long to run to be usable. They have asked you to investigate the use of big data techniques to reduce the processing time.
They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:
1. Current analysis time is approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.
2. The data for a single day of model output is approximately 250 MB. However, they have over 100 years' worth of data to analyse, making a total of over 9 TB.
3. Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
4. It is not possible to hold all of this in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is to occur, then 1 hour of data per worker can be loaded as needed.
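The figures in the requirements above imply a minimum speed-up, which can be worked through with a few lines of arithmetic. The Python sketch below is illustrative only: the constant names are invented here, the 365.25-day year is an assumption, and the worker count assumes ideal linear scaling, which real code rarely achieves.

```python
import math

# Figures taken from the client requirements above.
HOURS_PER_DATASET = 2.5    # requirement 1: analysis time for 1 hour of data
DATASETS_PER_DAY = 25      # requirement 3: data sets in a 24-hour period
TARGET_HOURS = 2.0         # requirement 3: must finish in under 2 hours
MB_PER_DAY = 250           # requirement 2: size of one day of model output
YEARS_OF_DATA = 100        # requirement 2: "over 100 years"
DAYS_PER_YEAR = 365.25     # assumption: average year length

# Total sequential time for one day's worth of data sets.
sequential_hours = HOURS_PER_DATASET * DATASETS_PER_DAY

# Speed-up needed to hit the target exactly.
required_speedup = sequential_hours / TARGET_HOURS

# "Under 2 hours" is a strict bound, so take the next integer above the ratio.
# Assumption: ideal linear scaling with no serial overhead.
min_workers_ideal = math.floor(required_speedup) + 1

# Sanity check on the stated archive size (decimal terabytes).
archive_tb = MB_PER_DAY * DAYS_PER_YEAR * YEARS_OF_DATA / 1_000_000

print(f"Sequential time per day: {sequential_hours} h")        # 62.5 h
print(f"Required speed-up:       {required_speedup}x")          # 31.25x
print(f"Ideal minimum workers:   {min_workers_ideal}")          # 32
print(f"Archive size:            {archive_tb:.2f} TB")          # 9.13 TB
```

If the 2.5-hour analysis of a single hour of data could not itself be subdivided, no number of workers would bring the wall-clock time below 2.5 hours, so at least part of the speed-up has to come from parallelism within each hour's analysis.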
You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:
1. Test and compare the processing speed of sequential and parallel processing.
2. Extrapolate your findings to indicate the number of processors required to achieve the target processing time.
3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.
4. Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
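One possible shape for the first two expectations is sketched below in Python. It is a minimal sketch under stated assumptions, not the client's method: `analyse_hour` is a hypothetical placeholder for the real analysis, the helper names are invented for illustration, and the extrapolation uses Amdahl's law as the scaling model, which is a modelling choice rather than something the brief prescribes.

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def analyse_hour(hour_index):
    """Hypothetical placeholder for the analysis of one hour of data."""
    total = 0
    for i in range(50_000):
        total += (i * (hour_index + 1)) % 7
    return total


def run_sequential(hours):
    """Process each hour one after another."""
    return [analyse_hour(h) for h in hours]


def run_parallel(hours, workers, executor_cls=ProcessPoolExecutor):
    """Process hours across a pool of workers; results keep input order.

    ProcessPoolExecutor suits CPU-bound work. ThreadPoolExecutor can be
    passed instead for a quick smoke test of the plumbing.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(analyse_hour, hours))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def workers_for_target(t_seq, t_par, workers, target_speedup):
    """Estimate the workers needed for target_speedup using Amdahl's law.

    t_seq and t_par are measured times for the sequential run and for a
    parallel run on `workers` workers. Returns None when the model says
    the target cannot be reached with any number of workers.
    """
    if workers < 2:
        raise ValueError("need a parallel measurement with at least 2 workers")
    inverse_speedup = t_par / t_seq
    # Solve 1/S = (1 - f) + f/p for the parallel fraction f.
    parallel_fraction = min((1 - inverse_speedup) / (1 - 1 / workers), 1.0)
    serial_fraction = 1 - parallel_fraction
    headroom = 1 / target_speedup - serial_fraction
    if headroom <= 0:
        return None
    return math.ceil(parallel_fraction / headroom)
```

In a script, the timed runs would be launched from under an `if __name__ == "__main__":` guard, for example `timed(run_parallel, range(25), 4)`, because process pools re-import the main module on some platforms. The measured times can then be passed to `workers_for_target` along with the speed-up the client needs.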
Assignment Brief 2
Learning Outcome 1: DATA SCIENCE: work with (potentially large) datasets; using appropriate storage technology; applying statistical analysis to draw meaningful conclusions; and using modern machine learning tools to discover hidden patterns.
B6: PROFESSIONAL PRACTICE: understand professional practices of the modern IT industry which include those technical (e.g. version control / automated testing) but also social, ethical & legal responsibilities.
B7: TRANSFERABLE SKILLS: apply a wide variety of degree-level transferable skills including time management, team working, written and verbal presentation to both experts and non-experts, and critical reflection on own and others' work.
VIVA TASK
The VIVA will take the form of a submission of a recorded presentation of your work.
The recording should be an informal, meeting-like presentation and should be considered as an opportunity to showcase your work. The aim is for you to present your work clearly and effectively to your client.
You are allowed 5 minutes to deliver your main content. You will then answer the questions below, where you are allowed up to 1 minute per answer. Poor timing will affect your grade.
VIVA Questions
Following the presentation of your work, please verbally answer the following questions. Keep your answers brief and concise and take account of the timing indicated for each.
1. You have tested your code using ozone (o3). We have many chemical species to analyse. How would you need to adapt your code to work with, for example, carbon monoxide (CO)?
2. If we wanted to analyse multiple chemical species at the same time, how would that affect our HPC requirements, e.g. number of processors?
3. One of our measuring instruments uses text entries such as "Instrument Error" or "Communication Error" as error codes instead of NaN. How might you adapt your code to check for and report these errors?
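As a starting point for thinking about questions 1 and 3, the Python sketch below shows one way a validation step could treat named text error codes, other non-numeric text, and NaN as separate reportable problems, with the species name passed in as a parameter rather than hard-coded. The function names, the report format, and the `ERROR_CODES` set are assumptions made for illustration; only the two error strings come from the question.

```python
import math

# The two strings quoted in question 3; a real instrument may use others.
ERROR_CODES = {"Instrument Error", "Communication Error"}


def check_values(values):
    """Split raw readings into usable numbers and a list of problems.

    Returns (clean, problems), where problems holds (index, reason) pairs.
    """
    clean, problems = [], []
    for index, raw in enumerate(values):
        # Known instrument error strings are reported under their own name.
        if isinstance(raw, str) and raw.strip() in ERROR_CODES:
            problems.append((index, raw.strip()))
            continue
        # Anything that cannot be converted to a number is flagged.
        try:
            number = float(raw)
        except (TypeError, ValueError):
            problems.append((index, f"non-numeric value {raw!r}"))
            continue
        # NaN used as an error code is flagged rather than analysed.
        if math.isnan(number):
            problems.append((index, "NaN"))
            continue
        clean.append(number)
    return clean, problems


def summarise(values, species="o3"):
    """Build a plain-text report for one species' readings."""
    clean, problems = check_values(values)
    lines = [f"{species}: {len(clean)} usable value(s), {len(problems)} problem(s)"]
    lines += [f"  index {i}: {reason}" for i, reason in problems]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [0.04, "Instrument Error", float("nan"), "abc", "0.05"]
    print(summarise(sample, species="CO"))
```

Keeping the species name as an argument means the same check can be reused for CO or any other variable without editing the function body.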