Big Data Management
Questions
Answer the same questions as in Assignment 1, but this time use the SQL API of Apache Spark. Prepare Cassandra structures and Spark code that saves the precomputed data into these structures:
1. Show the number of downloads for the package ggplot2.
2. Highest number of downloads by a country. Show its name.
3. Top 10 smallest sized packages.
4. What are the top 10 least popular packages?
5. Highest number of downloads by an Operating System.
6. What is the most popular package in Ireland?
7. What is the highest number of downloads by a single machine?
8. What OS is the least popular among the R programmers?
9. How many users use Mac OS?
10. List the total number of incomplete records (lines that have missing values).
Task 1
For your first task, you are required to use Apache Spark RDD transformations and actions to answer the questions above about the dataset.
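As an illustration of the RDD style this task asks for, here is a minimal sketch for question 10 (incomplete records). It assumes the logs are comma-separated with a header line and that missing values appear as empty fields or the string NA; neither is stated in the brief, so adjust to the actual files. The helper names parse_line, is_incomplete and count_incomplete are hypothetical.

```python
import csv

# Assumption: missing values are encoded as empty fields or "NA".
MISSING = {"", "NA"}


def parse_line(line):
    """Split one CSV line into a list of stripped field values."""
    return [field.strip() for field in next(csv.reader([line]))]


def is_incomplete(fields, expected_columns):
    """True if the record has the wrong field count or any missing value."""
    return len(fields) != expected_columns or any(f in MISSING for f in fields)


def count_incomplete(sc, path):
    """Count incomplete records in a log file using RDD operations.

    `sc` is an existing SparkContext; `path` points at one log file.
    """
    lines = sc.textFile(path)
    header = lines.first()                      # action: fetch the header line
    n_cols = len(parse_line(header))
    return (lines.filter(lambda l: l != header)  # transformation: drop header
                 .map(parse_line)                # transformation: split fields
                 .filter(lambda f: is_incomplete(f, n_cols))
                 .count())                       # action: trigger the job
```

The parsing and the missing-value test are kept as plain functions so they can be checked without a cluster; only count_incomplete touches Spark.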
Task 2
In this task you are required to use Apache Spark's SQL API to answer the questions above about the dataset. Store the results for each question in Apache Cassandra.
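One possible shape for this task, shown for questions 1 and 2 only, is sketched below. The column names package and country, the view name downloads, the keyspace rlogs and the table names are all assumptions made for illustration; the brief does not specify them. Writing to Cassandra here relies on the DataStax Spark Cassandra Connector being on the Spark classpath, and the tables are expected to exist before the write.

```python
# Hypothetical keyspace and table definitions, to be run in cqlsh beforehand.
DDL = [
    "CREATE KEYSPACE IF NOT EXISTS rlogs WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE IF NOT EXISTS rlogs.q1_ggplot2_downloads "
    "(package text PRIMARY KEY, downloads bigint)",
    "CREATE TABLE IF NOT EXISTS rlogs.q2_top_country "
    "(country text PRIMARY KEY, downloads bigint)",
]

# One Spark SQL query per Cassandra table; keys double as table names.
# Assumed columns: package, country (check against the real header).
QUERIES = {
    "q1_ggplot2_downloads": """
        SELECT package, COUNT(*) AS downloads
        FROM downloads
        WHERE package = 'ggplot2'
        GROUP BY package
    """,
    "q2_top_country": """
        SELECT country, COUNT(*) AS downloads
        FROM downloads
        GROUP BY country
        ORDER BY downloads DESC
        LIMIT 1
    """,
}


def run_and_store(spark, csv_path, keyspace="rlogs"):
    """Run each query through Spark SQL and append the result to Cassandra.

    `spark` is an existing SparkSession configured with the connector.
    """
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView("downloads")
    for table, sql in QUERIES.items():
        result = spark.sql(sql)
        (result.write
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace=keyspace, table=table)
               .mode("append")
               .save())
```

Keeping one small result table per question means each Cassandra table holds exactly the precomputed answer, which matches the "save the precomputed data" wording of the brief. The remaining questions would follow the same pattern with their own query and table.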
Task 3
In the last task, you are required to use Apache Spark's Streaming API to compute the real-time views for the questions. Store these views in Apache Cassandra. To emulate a live stream of the download logs, you are required to write a separate Python script that reads 1000 lines every 5 seconds from each log file and stores them as separate files (log1, log2, log3, etc.) in the streaming directory on which your application is listening.
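The emulator script described above could look roughly like the following sketch. It reads the log files one after another (one reading of "from each log file"; processing them in parallel would be another), and it writes each chunk under a dot-prefixed temporary name before renaming it, so the listening application only sees complete files. The function name emulate_stream and its parameters are hypothetical.

```python
import itertools
import time
from pathlib import Path


def emulate_stream(log_files, stream_dir, batch_size=1000, interval=5.0):
    """Copy log files into stream_dir in chunks of batch_size lines.

    Chunks are written as log1, log2, log3, ... with a pause of
    `interval` seconds after each one. Returns the list of written paths.
    """
    stream_dir = Path(stream_dir)
    stream_dir.mkdir(parents=True, exist_ok=True)
    counter = 0
    written = []
    for log_file in log_files:
        with open(log_file, encoding="utf-8") as src:
            while True:
                # Take the next batch_size lines without loading the whole file.
                batch = list(itertools.islice(src, batch_size))
                if not batch:
                    break
                counter += 1
                final = stream_dir / f"log{counter}"
                tmp = stream_dir / f".log{counter}.tmp"
                tmp.write_text("".join(batch), encoding="utf-8")
                tmp.replace(final)  # rename so the file appears in one step
                written.append(final)
                time.sleep(interval)
    return written
```

A call such as emulate_stream(["day1.csv", "day2.csv"], "streaming") (file names invented for the example) would then feed the directory watched by the streaming job. Note that only the first chunk of each file carries the header line, so the streaming code may need to filter it out.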
Submission
• Submit your solution on Moodle by the specified deadline.
• Acceptable file format: Python notebook, named student name.ipynb. The notebook should be exported as an IPython Notebook with the *.ipynb extension. If the code in your notebook does not run, it will result in a 20% penalty.
• Take two screenshots of your solution to each question (the code and its output into the Cassandra table, where applicable), insert them into a Word document, and generate a PDF of this document.
Attachment: Big Data Management.rar