Software Practice for Big Data Analytics
Assessment - Perform Data Analytics on Real-World Problems Using Amazon Web Services
Purpose of the assessment - Select the tools in the chosen software stack to design and program the big data analytics platform;
Relate the concept and use of visualization to big data analytics;
Develop and appraise big data platforms for predictive analytics in complex real-world domains.
Description
In this group assignment, you will delve into various aspects of big data analysis and manipulation using the Hadoop ecosystem, with a focus on Pig Latin and Hive Query Language (HiveQL). The main objective is to gain hands-on experience in processing large-scale datasets while observing data trends and changes over time. You will work with two distinct datasets: stock data from major tech companies and Amazon product sales data.
In the first part of the assignment, you will analyze the stock data of 14 leading technology companies to evaluate the total stock trading volume per year. This involves uploading stock files, creating directories within the Hadoop cluster, and using Pig Latin scripts to calculate the total number of shares traded annually for each company. This exercise will enhance your understanding of stock data analytics and long-term trend analysis.
The second part of the assignment focuses on sales data, where you will use HiveQL to examine Amazon product sales. Tasks will include uploading data to HDFS, joining multiple datasets, grouping records, and performing various calculations such as identifying top-rated products, filtering items by discount or price, and computing average product prices.
By completing this assignment, you will gain practical experience in utilizing the Hadoop ecosystem for efficient large-scale data processing and analysis. This foundational knowledge will support your continued learning and exploration in the domain of big data analytics.
Your Tasks
To complete Assignment 2, which comprises two main parts, your team will follow the steps outlined in the two questions below to perform data processing and analysis tasks using the Hadoop ecosystem, Pig Latin, and HiveQL. The primary focus is on working with stock and sales datasets, giving you hands-on experience in managing and processing large-scale information efficiently.
Part I: Download the big_tech_companies.csv and big_tech_stock_prices.csv files from the Assignment 2 folder on Moodle. These comma-separated values (CSV) files contain daily stock price and trading volume data for 14 leading tech companies from 2010 to 2023. The dataset includes firms such as Apple (AAPL), Amazon (AMZN), Alphabet (GOOGL), Meta Platforms (META), Adobe (ADBE), Cisco Systems (CSCO), IBM, Intel Corporation (INTC), Netflix (NFLX), Tesla (TSLA), and NVIDIA (NVDA). The "high" column indicates the highest stock price recorded on each trading day, while the "volume" column shows the total number of shares traded on that day.
[40 Marks] For Part I, use Pig Latin commands and Tableau to perform the following tasks (illustrative sketches of the commands and scripts follow the task list):
Upload the files to HDFS.
Create a directory on the cluster named Stock.
Transfer the files big_tech_companies and big_tech_stock_prices into the Stock directory.
Write a Pig script to compute the total number of shares traded per year for each company.
Write a Pig Latin script to calculate the average daily high price for each company in each year.
Analyze the trends in total trading volume and average daily high price for the years 2010 through 2023.
Using Tableau Software, visualize the results in a suitable manner. Choose the format that you find most appropriate.
Write a 350-word summary reflecting your understanding of the trends in trade prices and trading volumes of tech companies over time.
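A minimal sketch of the HDFS steps above, assuming the two CSV files sit in your current local directory and that the commands run from an edge node of the cluster (paths are illustrative):

    # Create the Stock directory in your HDFS home (path is illustrative)
    hdfs dfs -mkdir -p Stock

    # Upload both files into the Stock directory
    hdfs dfs -put big_tech_companies.csv big_tech_stock_prices.csv Stock/

    # Verify the transfer
    hdfs dfs -ls Stock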
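One possible Pig Latin sketch for the two aggregation tasks. The column names, their order, and the yyyy-MM-dd date format are assumptions about big_tech_stock_prices.csv; verify them against the file header before running:

    -- Load the price data (schema is assumed; check the CSV header)
    prices = LOAD 'Stock/big_tech_stock_prices.csv' USING PigStorage(',')
        AS (stock_symbol:chararray, trade_date:chararray, open:double,
            high:double, low:double, close:double, adj_close:double, volume:long);

    -- Drop the header row, if present (assumes the header cell reads 'date')
    data = FILTER prices BY trade_date != 'date';

    -- Derive the trading year from the yyyy-MM-dd date string
    with_year = FOREACH data GENERATE stock_symbol,
        SUBSTRING(trade_date, 0, 4) AS year, high, volume;

    grouped = GROUP with_year BY (stock_symbol, year);

    -- Total number of shares traded per company per year
    totals = FOREACH grouped GENERATE FLATTEN(group) AS (stock_symbol, year),
        SUM(with_year.volume) AS total_volume;

    -- Average daily high price per company per year
    avg_high = FOREACH grouped GENERATE FLATTEN(group) AS (stock_symbol, year),
        AVG(with_year.high) AS avg_high_price;

    STORE totals INTO 'Stock/total_volume_per_year' USING PigStorage(',');
    STORE avg_high INTO 'Stock/avg_high_per_year' USING PigStorage(',');

The two STORE outputs can then be pulled back with hdfs dfs -get and connected to Tableau as CSV sources for the visualization task.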
Part II: Download the saledata.zip file from the Assignment 2 folder on Moodle. This archive, when extracted, contains seven CSV files. Each file has nine columns, with each row detailing product information as described in Table 1.
For Part II, use HiveQL commands to perform the following operations (illustrative sketches follow the task list):
Upload all seven CSV files to the Hadoop Distributed File System (HDFS).
Create a directory on the cluster and name it Sales.
Set up a database named sales_db and create corresponding tables to load the seven CSV files.
Retrieve and display product names across all categories, including Appliances, Electronics, Exercise & Fitness, Grocery & Gourmet Foods, Home & Kitchen, Pet Supplies, and Sports Fitness & Outdoors.
Display the top 5 highest-rated products within each category.
List all products in each category that offer a discount greater than 40%.
Filter and show products with an actual price of $500 or less.
Identify and display the highest-priced product in each category based on the actual price.
Calculate and display the average actual price of products for each category.
Write a 350-word summary highlighting your key insights and findings from the analysis.
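The HDFS steps mirror Part I. A minimal sketch, assuming the extracted files sit in a local folder named saledata (names and paths are illustrative):

    # Create the Sales directory in your HDFS home and upload all seven files
    hdfs dfs -mkdir -p Sales
    hdfs dfs -put saledata/*.csv Sales/
    hdfs dfs -ls Sales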
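A sketch of the database and table setup. Because Table 1 is not reproduced here, the nine column names and types below are assumptions modeled on a typical Amazon product export; replace them with the actual columns from Table 1. One category table is shown; the other six follow the same pattern:

    -- Create the database for the sales data
    CREATE DATABASE IF NOT EXISTS sales_db;
    USE sales_db;

    -- Example table for one category (schema assumed; match it to Table 1)
    CREATE TABLE IF NOT EXISTS appliances (
        name           STRING,
        main_category  STRING,
        sub_category   STRING,
        image          STRING,
        link           STRING,
        ratings        DOUBLE,
        no_of_ratings  INT,
        discount_price DOUBLE,
        actual_price   DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ('skip.header.line.count'='1');

    -- Load the uploaded file from HDFS into the table (file name assumed)
    LOAD DATA INPATH 'Sales/appliances.csv' INTO TABLE appliances;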
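Sketches of the analysis queries. They assume the seven tables are first combined into a single view named products with an added category column; the view, table, and column names are all assumptions:

    -- Combine the category tables (two shown; extend to all seven)
    CREATE VIEW IF NOT EXISTS products AS
    SELECT 'Appliances' AS category, name, ratings, discount_price, actual_price
    FROM appliances
    UNION ALL
    SELECT 'Electronics' AS category, name, ratings, discount_price, actual_price
    FROM electronics;

    -- Product names across all categories
    SELECT category, name FROM products;

    -- Top 5 highest-rated products within each category
    SELECT category, name, ratings
    FROM (SELECT category, name, ratings,
                 ROW_NUMBER() OVER (PARTITION BY category
                                    ORDER BY ratings DESC) AS rn
          FROM products) ranked
    WHERE rn <= 5;

    -- Products discounted by more than 40% (assumes discount_price is the
    -- post-discount price, so the rate is derived from the two prices)
    SELECT category, name, actual_price, discount_price
    FROM products
    WHERE (actual_price - discount_price) / actual_price > 0.40;

    -- Products with an actual price of $500 or less
    SELECT category, name, actual_price FROM products WHERE actual_price <= 500;

    -- Highest-priced product in each category
    SELECT p.category, p.name, p.actual_price
    FROM products p
    JOIN (SELECT category, MAX(actual_price) AS max_price
          FROM products GROUP BY category) m
      ON p.category = m.category AND p.actual_price = m.max_price;

    -- Average actual price of products per category
    SELECT category, AVG(actual_price) AS avg_actual_price
    FROM products GROUP BY category;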