BIG DATA MANAGEMENT ON THE CLOUD
Objectives
Gain in-depth experience with big data tools (Hive, Spark RDDs, and Spark SQL).
Solve challenging big data processing tasks by finding highly efficient solutions.
Gain experience processing three different types of real data:
- Standard multi-attribute data (bank data)
- Time series data (Twitter feed data)
- Bag-of-words data
Practice using programming APIs to find the best API calls to solve your problem. Consult the API descriptions for Hive and Spark (for Spark, look especially under RDD, where there are many useful API calls).
- If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.
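As a minimal sketch of such an experiment, the lines below try out reduceByKey on a tiny hand-made RDD. The sample pairs and the local[*] master are assumptions made for illustration; inside spark-shell a session already exists and getOrCreate() simply returns it.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the session that is already running.
val spark = SparkSession.builder().master("local[*]").appName("api-experiment").getOrCreate()
val sc = spark.sparkContext

// Made-up (key, value) pairs, small enough to check the result by hand.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey merges all values that share a key using the given function.
val totals = pairs.reduceByKey(_ + _).collect().toMap

println(totals) // "a" maps to 4 and "b" maps to 2
```

Keeping the input this small makes it easy to predict the result before running the call, which is the quickest way to confirm what an unfamiliar API does.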
Assignment structure:
A script that puts all of the data files into HDFS automatically is provided for you. Because HDFS state is not maintained across Docker runs, you need to run this script again every time you restart the Docker container:
$ bash put_data_in_hdfs.sh
The script will output the names of all of the data files it copies into HDFS. If you do not run this script, solutions to the Spark questions will not work since they load data from HDFS.
To put the files onto HDFS, do the following:
First start the Docker container using run.sh, as you have done for your labs.
Change to the directory that contains put_data_in_hdfs.sh and then run the following command:
$ bash put_data_in_hdfs.sh
The above will put all the assignment files into HDFS. You can now view the HDFS contents in Hue as follows. Open the Firefox browser and enter the following URL: localhost:8888
Type in username: root and password: root
Next select the files icon on the left to see the files you have uploaded to HDFS.
For each Hive question a skeleton .hql file is provided for you to write your solution in. You can run these just like you did in labs:
$ hive -f Task_XX.hql
For each Spark question, a skeleton project is provided for you. Write your solution in the .scala file in the src directory. Build and run your Spark code using the provided scripts:
$ bash build_and_run.sh
Follow the instructions below to run a small test program that outputs to HDFS so you can see the output.
Change to the Task_test directory and type the following command:
$ bash build_and_run.sh
Next, look at the output of the program in Hue.
Tips:
Look at the data files before you begin each task. Try to understand what you are dealing with!
For each subtask we provide small example input and the corresponding output in the assignment specifications below. These small versions of the files are also supplied with the assignment (they have "-small" in the name). It's a good idea to get your solution working on the small inputs first before moving on to the full files.
In addition to testing the correctness of your code on the very small example input, you should also use the large input files that we provide to test the scalability of your solutions.
It can take some time to build and run Spark applications from .scala files. So for the Spark questions it's best to experiment using spark-shell first to figure out a working solution, and then put your code into the .scala files afterwards. As an example you can try to copy the following highlighted lines from the Task_test source file into the spark shell.
Task 1: Analysing Bank Data
We will be doing some analytics on real data from a Portuguese banking institution [1]. The data is stored in a semicolon (";") delimited format.
The data is supplied with the assignment at the following locations:
Small version                 Full version
Task_1/Data/bank-small.csv    Task_1/Data/bank.csv
The data has the following attributes
Attribute index  Attribute name  Description
0                age             numeric
1                job             type of job (categorical: admin., unknown, unemployed, management, housemaid, entrepreneur, student, blue-collar, self-employed, retired, technician, services)
2                marital         marital status (categorical: married, divorced, single; note: divorced means divorced or widowed)
3                education       (categorical: unknown, secondary, primary, tertiary)
4                default         has credit in default? (binary: yes, no)
5                balance         average yearly balance, in euros (numeric)
6                housing         has housing loan? (binary: yes, no)
7                loan            has personal loan? (binary: yes, no)
8                contact         contact communication type (categorical: unknown, telephone, cellular)
9                day             last contact day of the month (numeric)
10               month           last contact month of year (categorical: jan, feb, mar, ..., nov, dec)
11               duration        last contact duration, in seconds (numeric)
12               campaign        number of contacts performed during this campaign and for this client (numeric, includes last contact)
13               pdays           number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
14               previous        number of contacts performed before this campaign and for this client (numeric)
15               poutcome        outcome of the previous marketing campaign (categorical: unknown, other, failure, success)
16               termdeposit     has the client subscribed a term deposit? (binary: yes, no)
Here is a small example of the bank data that we will use to illustrate the subtasks below (we only list a subset of the attributes in this example, see the above table for the description of the attributes):
job marital education balance loan
management Married tertiary 2143 Yes
technician Divorced secondary 29 Yes
entrepreneur Single secondary 2 No
blue-collar Married unknown 1506 No
services Divorced secondary 829 Yes
technician Married tertiary 929 Yes
management Divorced tertiary 22 No
technician Married primary 10 No
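Before starting the subtasks it helps to see how one line of the file maps onto the attribute indices above. The sketch below is only an illustration: the Client record type and the parseClient helper are hypothetical names, the sample line fills the attributes not shown in the example table with placeholder values, and it assumes the file has no header row and no quoted fields (check the real file first).

```scala
// Hypothetical record type holding only the attributes shown in the example table.
case class Client(job: String, marital: String, education: String, balance: Int, loan: String)

// Splits one semicolon-delimited line and picks fields by attribute index.
// Assumes no header row and no quoting.
def parseClient(line: String): Client = {
  val f = line.split(";")
  Client(job = f(1), marital = f(2), education = f(3), balance = f(5).toInt, loan = f(7))
}

// First row of the example table; all other attribute values are placeholders.
val sample = "58;management;Married;tertiary;no;2143;yes;Yes;unknown;5;may;261;1;-1;0;unknown;no"
val client = parseClient(sample)

println(client) // Client(management,Married,tertiary,2143,Yes)
```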
Please note we specify whether you should use [Hive] or [Spark RDD] for each subtask at the beginning of each subtask.
a) [Hive] Report the number of clients for each marital status who have a balance above 500 and have a loan. Write the results to Task_1a-out. (For all questions, make sure you only modify code inside the TODO blocks; you may, however, change the input filename from the small file to the large file.) For the above small example data set you would report the following (output order is not important for this question):
b) [Hive] Report the average yearly balance for all people in each job category in descending order of average yearly balance. Write the results to Task_1b-out. For the small example data set you would report the following:
c) [Spark RDD] Group balance into the following three categories:
- Low: -infinity to 500
- Medium: 501 to 1500
- High: 1501 to +infinity
Report the number of people in each of the above categories. Write the results to "Task_1c-out" in text file format. For the small example data set you should get the following results (output order is not important in this question):
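One possible way to express the three ranges is a small classification function followed by a per-key count. The sketch below runs on the balance column of the small example only; the balanceCategory name, the assumption that balances are whole numbers, and the local[*] master are illustrative choices, not part of the provided skeleton.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("balance-buckets").getOrCreate()
val sc = spark.sparkContext

// Boundaries follow the three ranges listed above (balances assumed to be whole euros).
def balanceCategory(balance: Int): String =
  if (balance <= 500) "Low"
  else if (balance <= 1500) "Medium"
  else "High"

// Balance column of the small example table.
val balances = sc.parallelize(Seq(2143, 29, 2, 1506, 829, 929, 22, 10))

// Emit (category, 1) per person, then sum the ones per category.
val counts = balances
  .map(b => (balanceCategory(b), 1))
  .reduceByKey(_ + _)
  .collect()
  .toMap

println(counts) // Low has 4 people, Medium 2, High 2
```

In the real task the RDD would come from the bank file on HDFS and the result would be written with saveAsTextFile rather than collected.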
d) [Spark RDD] Output the following details for each person whose job category has an average balance above 500: education, balance, job, marital, loan. Make sure the output is in decreasing order of individual balance. Write the results to Task_1d-out in text file format (output to a single file). For the small example data set you would report the following:
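This subtask combines a per-key average with a join back to the individual rows. The sketch below shows one way to do that on the small example, with job names written in lowercase as in the attribute table; the variable names and the local[*] master are assumptions, and the final write to Task_1d-out is left out.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("task1d-sketch").getOrCreate()
val sc = spark.sparkContext

// (education, balance, job, marital, loan) rows of the small example table.
val clients = sc.parallelize(Seq(
  ("tertiary", 2143, "management", "Married", "Yes"),
  ("secondary", 29, "technician", "Divorced", "Yes"),
  ("secondary", 2, "entrepreneur", "Single", "No"),
  ("unknown", 1506, "blue-collar", "Married", "No"),
  ("secondary", 829, "services", "Divorced", "Yes"),
  ("tertiary", 929, "technician", "Married", "Yes"),
  ("tertiary", 22, "management", "Divorced", "No"),
  ("primary", 10, "technician", "Married", "No")
))

// Accumulate (sum, count) per job, then divide to get the average balance.
val avgByJob = clients
  .map { case (_, balance, job, _, _) => (job, (balance.toLong, 1L)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// Keep only the job categories whose average balance is above 500.
val qualifyingJobs = avgByJob.filter { case (_, avg) => avg > 500 }

// Join the rows to the qualifying jobs, then sort by individual balance
// into a single partition so that the output ends up in one file.
val selected = clients
  .keyBy(_._3)
  .join(qualifyingJobs)
  .map { case (_, (row, _)) => row }
  .sortBy(_._2, ascending = false, numPartitions = 1)

selected.collect().foreach(println)
```

On this input the management, blue-collar and services categories qualify, so four rows remain, ordered by balance 2143, 1506, 829, 22.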
Task 2: Analysing Twitter Time Series Data
In this task we will be doing some analytics on real Twitter data [2]. The data is stored in a tab ("\t") delimited format.
The data is supplied with the assignment at the following locations:
Small version                    Full version
Task_2/Data/twitter-small.tsv    Task_2/Data/twitter.tsv
The data has the following attributes
Attribute index  Attribute name  Description
0                tokenType       In our data set every row has the token type hashtag, so this attribute can be ignored for this assignment.
1                month           The year and month in the format YYYYMM: 4 digits for the year followed by 2 digits for the month. For example, 200905 means May 2009.
2                count           An integer representing the number of tweets with this hashtag in the given year and month
3                hashtagName     The hashtag name, e.g. babylove, mydate, etc.
Here is a small example of the Twitter data that we will use to illustrate the subtasks below:
Token type  Month   Count  Hashtag name
hashtag     200910  2      babylove
hashtag     200911  2      babylove
hashtag     200912  90     babylove
hashtag     200812  100    mycoolwife
hashtag     200901  201    mycoolwife
hashtag     200910  1      mycoolwife
hashtag     200912  500    mycoolwife
hashtag     200905  23     abc
hashtag     200907  1000   abc
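To make the column layout concrete, the sketch below parses one tab-delimited line and splits the YYYYMM value into year and month. The TagCount record type and the parseTagCount helper are hypothetical names introduced only for this illustration.

```scala
// Hypothetical record type; tokenType is dropped because it is always "hashtag".
case class TagCount(month: Int, count: Int, hashtag: String)

// Splits one tab-delimited line and picks fields by attribute index.
def parseTagCount(line: String): TagCount = {
  val f = line.split("\t")
  TagCount(month = f(1).toInt, count = f(2).toInt, hashtag = f(3))
}

val row = parseTagCount("hashtag\t200905\t23\tabc")

// Integer division and remainder separate the YYYY and MM parts.
val year = row.month / 100        // 2009
val monthOfYear = row.month % 100 // 5

println(s"$year-$monthOfYear ${row.hashtag} ${row.count}")
```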
a) [Spark RDD] Find the single row that has the highest count and, for that row, report the month, count and hashtag name. Print the result to the terminal using println. For the above small example data set the result would be:
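Finding a single maximum does not require sorting the whole data set; a pairwise reduce is enough. The sketch below shows this on the small example, with the rows typed in by hand as (month, count, hashtag) tuples. The tuple layout and the local[*] master are assumptions, and the exact print format expected by the markers is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("task2a-sketch").getOrCreate()
val sc = spark.sparkContext

// (month, count, hashtag) rows of the small example table.
val rows = sc.parallelize(Seq(
  (200910, 2, "babylove"), (200911, 2, "babylove"), (200912, 90, "babylove"),
  (200812, 100, "mycoolwife"), (200901, 201, "mycoolwife"),
  (200910, 1, "mycoolwife"), (200912, 500, "mycoolwife"),
  (200905, 23, "abc"), (200907, 1000, "abc")
))

// Keep whichever of two rows has the larger count; one pass, no sort.
val top = rows.reduce((a, b) => if (a._2 >= b._2) a else b)

println(top) // (200907,1000,abc)
```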
b) [Do twice, once using Hive and once using Spark RDD] Find the hashtag name that was tweeted the most in the entire data set across all months. Report the total number of tweets for that hashtag name. You can print the result to the terminal using println. For the above small example data set the output would be:
abc 1023
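For the Spark RDD variant, one approach is to sum the counts per hashtag and then reduce to the largest total. The sketch below covers only that variant and only the small example; the tuple layout and the local[*] master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("task2b-sketch").getOrCreate()
val sc = spark.sparkContext

// (month, count, hashtag) rows of the small example table.
val rows = sc.parallelize(Seq(
  (200910, 2, "babylove"), (200911, 2, "babylove"), (200912, 90, "babylove"),
  (200812, 100, "mycoolwife"), (200901, 201, "mycoolwife"),
  (200910, 1, "mycoolwife"), (200912, 500, "mycoolwife"),
  (200905, 23, "abc"), (200907, 1000, "abc")
))

// Sum the counts per hashtag; Long guards against overflow on the full data.
val totals = rows
  .map { case (_, count, tag) => (tag, count.toLong) }
  .reduceByKey(_ + _)

// Pick the hashtag with the largest total.
val (tag, total) = totals.reduce((a, b) => if (a._2 >= b._2) a else b)

println(s"$tag $total") // abc 1023
```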
c) [Spark RDD] Given two months x and y, where y > x, find the hashtag name whose number of tweets increased the most from month x to month y. Ignore the tweets in the months between x and y, and compare only the number of tweets at month x and at month y. Report the hashtag name and the number of tweets in months x and y. Ignore any hashtag names that had no tweets in either month x or y. You can assume that the combination of hashtag and month is unique, so the same hashtag and month combination cannot occur more than once. Print the result to the terminal using println. For the above small example data set:
For this subtask you can specify the months x and y as arguments to the script. This is required to test on the full-sized data. For example:
$ bash build_and_run.sh 200901 200902
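One way to compare exactly two months is to filter the data down to each month, key both sides by hashtag, and join them. The sketch below hard-codes x = 200910 and y = 200912, which are values chosen here purely for illustration; in the real program they would presumably be read from the program arguments. The tuple layout and the local[*] master are also assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("task2c-sketch").getOrCreate()
val sc = spark.sparkContext

// (month, count, hashtag) rows of the small example table.
val rows = sc.parallelize(Seq(
  (200910, 2, "babylove"), (200911, 2, "babylove"), (200912, 90, "babylove"),
  (200812, 100, "mycoolwife"), (200901, 201, "mycoolwife"),
  (200910, 1, "mycoolwife"), (200912, 500, "mycoolwife"),
  (200905, 23, "abc"), (200907, 1000, "abc")
))

// Illustrative months only; not taken from the assignment text.
val x = 200910
val y = 200912

val atX = rows.filter(_._1 == x).map { case (_, count, tag) => (tag, count) }
val atY = rows.filter(_._1 == y).map { case (_, count, tag) => (tag, count) }

// join is an inner join, so hashtags missing from either month drop out.
val best = atX.join(atY)
  .reduce((a, b) => if (a._2._2 - a._2._1 >= b._2._2 - b._2._1) a else b)

println(best) // (mycoolwife,(1,500))
```

With these two months, babylove rises from 2 to 90 and mycoolwife from 1 to 500, so mycoolwife has the larger increase; abc has no rows in either month and is ignored.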
Task 3: Indexing Bag of Words data
In this task you are asked to create a partitioned index of words to documents that contain the words. Using this index you can search for all the documents that contain a particular word efficiently.
The data is supplied with the assignment at the following locations [3]:
Small version                    Full version
Task_3/Data/docword-small.txt    Task_3/Data/docword.txt
Task_3/Data/vocab-small.txt      Task_3/Data/vocab.txt
The first file is called docword.txt, which contains the contents of all the documents stored in the following format:
Attribute index  Attribute name  Description
0                docId           The ID of the document that contains the word
1                vocabId         Instead of storing the word itself, we store an ID from the vocabulary file.
2                count           An integer representing the number of times this word occurred in this document.
The second file, vocab.txt, contains each word in the vocabulary, indexed by the vocabId used in docword.txt.
Here is a small example content of the docword.txt file.
docId vocabId count
3 3 600
2 3 702
1 2 120
2 5 200
2 2 500
3 1 100
3 5 2000
3 4 122
1 3 1200
1 1 1000
Here is an example of the vocab.txt file:
vocabId word
1 plane
2 car
3 motorbike
4 truck
5 boat
Complete the following subtasks using Spark:
a) [Spark SQL] Calculate the total count of each word across all documents. List the words in ascending alphabetical order. Write the results to "Task_3a-out" in CSV format (multiple output parts are allowed). So, for the above small example input the output would be the following:
Note: Spark SQL writes its output to multiple files. Make sure the data is sorted globally across all the files (parts), so that all words in part 0 come alphabetically before the words in part 1.
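A join on vocabId followed by a grouped sum and a sort is one way to approach this. The sketch below uses the DataFrame API on the two small example tables typed in by hand; the column name total, the variable names and the local[*] master are assumptions, and reading the input files and writing the CSV output are left out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("task3a-sketch").getOrCreate()
import spark.implicits._

// Small example contents of docword.txt and vocab.txt.
val docword = Seq(
  (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
  (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
).toDF("docId", "vocabId", "count")

val vocab = Seq((1, "plane"), (2, "car"), (3, "motorbike"), (4, "truck"), (5, "boat"))
  .toDF("vocabId", "word")

// Resolve each vocabId to its word, sum the counts per word, sort by word.
val totals = docword
  .join(vocab, "vocabId")
  .groupBy("word")
  .agg(sum("count").as("total"))
  .orderBy("word")

totals.show()
```

On the small example this yields boat 2200, car 620, motorbike 2502, plane 1100 and truck 122. Because orderBy performs a global sort, the part files written afterwards follow that order from one part to the next.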
b) [Spark SQL] Find the most frequently occurring word for each document and then output the following information: docId, word, count. Sort in decreasing order of count. Write the results to Task_3b-out in CSV format (multiple output parts are allowed).
So, for the above small example input, the output would be the following:
Note: Spark SQL writes its output to multiple files. Make sure the data is sorted globally across all the files (parts).
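Picking the top row within each document is a typical use of a window function. The sketch below ranks the words of each document by count and keeps the first one; it is one possible approach, not necessarily the intended one. The rank column name and the local[*] master are assumptions, and if two words tie within a document row_number keeps an arbitrary one of them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().master("local[*]").appName("task3b-sketch").getOrCreate()
import spark.implicits._

// Small example contents of docword.txt and vocab.txt.
val docword = Seq(
  (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
  (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
).toDF("docId", "vocabId", "count")

val vocab = Seq((1, "plane"), (2, "car"), (3, "motorbike"), (4, "truck"), (5, "boat"))
  .toDF("vocabId", "word")

// Within each document, order the words from most to least frequent.
val byDoc = Window.partitionBy("docId").orderBy(col("count").desc)

// Keep the first-ranked word per document, then sort globally by count.
val topWords = docword
  .join(vocab, "vocabId")
  .withColumn("rank", row_number().over(byDoc))
  .filter(col("rank") === 1)
  .select("docId", "word", "count")
  .orderBy(col("count").desc)

topWords.show()
```

On the small example the rows come out as (3, boat, 2000), (1, motorbike, 1200), (2, motorbike, 702).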
Bonus Marks:
1. Using Spark, perform the following task on the data set of Task 2.
[Spark RDD or Spark SQL] Find the hashtag name whose number of tweets increased the most between any two consecutive months, considering all hashtag names.
Consecutive months are, for example, 200801 to 200802, or 200902 to 200903. Report the hashtag name, the 1st month count, and the 2nd month count using println.
For the small example data set of task 2 the output would be:
Hash tag name: mycoolwife
count of month 200812: 100
count of month 200901: 201
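The subtle part of this task is that the month after 200812 is 200901, not 200813. The sketch below handles that with a hypothetical nextMonth helper and then joins every row with the row of the same hashtag one month later. It is an RDD-based illustration on the small example only; the tuple layout and the local[*] master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bonus-sketch").getOrCreate()
val sc = spark.sparkContext

// Returns the YYYYMM value that follows the given one, rolling over in December.
def nextMonth(yyyymm: Int): Int =
  if (yyyymm % 100 == 12) (yyyymm / 100 + 1) * 100 + 1 else yyyymm + 1

// (month, count, hashtag) rows of the small example table.
val rows = sc.parallelize(Seq(
  (200910, 2, "babylove"), (200911, 2, "babylove"), (200912, 90, "babylove"),
  (200812, 100, "mycoolwife"), (200901, 201, "mycoolwife"),
  (200910, 1, "mycoolwife"), (200912, 500, "mycoolwife"),
  (200905, 23, "abc"), (200907, 1000, "abc")
))

// Key each row by (hashtag, month) and also by (hashtag, following month).
val byMonth = rows.map { case (m, c, tag) => ((tag, m), c) }
val shifted = rows.map { case (m, c, tag) => ((tag, nextMonth(m)), (m, c)) }

// The inner join pairs each month with its successor when both exist.
val best = shifted.join(byMonth)
  .map { case ((tag, second), ((first, firstCount), secondCount)) =>
    (tag, first, firstCount, second, secondCount)
  }
  .reduce((a, b) => if (a._5 - a._3 >= b._5 - b._3) a else b)

println(best) // (mycoolwife,200812,100,200901,201)
```

Only three consecutive pairs exist in the small example (babylove 200910 to 200911 and 200911 to 200912, mycoolwife 200812 to 200901), and the last has the largest increase, which matches the expected output above.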