Perform data analytics on real-world problems

Reference no: EM133871455

Software Practice for Big Data Analytics

Assessment - Perform Data Analytics on Real-World Problems Using Amazon Web Services

Purpose of the assessment - Select the tools in the chosen software stack to design and program the big data analytics platform;
Relate the concept and use of visualization to big data analytics;
Develop and appraise big data platforms for predictive analytics in complex real-world domains.

Description

In this group assignment, you will delve into various aspects of big data analysis and manipulation using the Hadoop ecosystem, with a focus on Pig Latin and Hive Query Language (HiveQL). The main objective is to gain hands-on experience in processing large-scale datasets while observing data trends and changes over time. You will work with two distinct datasets: stock data from major tech companies and Amazon product sales data.

In the first part of the assignment, you will analyze the stock data of 14 leading technology companies to evaluate the total stock trading volume per year. This involves uploading stock files, creating directories within the Hadoop cluster, and using Pig Latin scripts to calculate the total number of shares traded annually for each company. This exercise will enhance your understanding of stock data analytics and long-term trend analysis.

The second part of the assignment focuses on sales data, where you will use HiveQL to examine Amazon product sales. Tasks will include uploading data to HDFS, joining multiple datasets, grouping records, and performing various calculations such as identifying top-rated products, filtering items by discount or price, and computing average product prices.

By completing this assignment, you will gain practical experience in utilizing the Hadoop ecosystem for efficient large-scale data processing and analysis. This foundational knowledge will support your continued learning and exploration in the domain of big data analytics.

Your Tasks
To complete Assignment 2, your team will follow the steps outlined in the two parts below to perform data processing and analysis tasks using the Hadoop ecosystem, Pig Latin, and HiveQL. The primary focus is on working with stock and sales datasets, giving you hands-on experience in managing and processing large-scale information efficiently.

Part I: Download the big_tech_companies.csv and big_tech_stock_prices.csv files from the Assignment 2 folder on Moodle. These comma-separated values (CSV) files contain daily stock price and trading volume data for 14 leading tech companies from 2010 to 2023. The dataset includes firms such as Apple (AAPL), Amazon (AMZN), Alphabet (GOOGL), Meta Platforms (META), Adobe (ADBE), Cisco Systems (CSCO), IBM, Intel Corporation (INTC), Netflix (NFLX), Tesla (TSLA), and NVIDIA (NVDA). The "high" column indicates the highest stock price recorded on each trading day, while the "volume" column shows the total number of shares traded on that day.
[40 Marks] For Part I, use Pig Latin commands and Tableau to perform the following tasks:
Upload the files to HDFS.
Create a directory on the cluster named Stock.
Transfer the files big_tech_companies.csv and big_tech_stock_prices.csv into the Stock directory.
Write a Pig script to compute the total number of shares traded per year for each company.
Write a Pig Latin script to calculate the average of daily high prices for each company in each year.
Perform an analysis of the trends in total trading volume and average trading price for the years 2010 through 2023.
Using Tableau, visualize the results in a suitable manner. Choose the format that you find most appropriate.
Write a 350-word summary reflecting your understanding of the trends in trade prices and trading volumes of tech companies over time.
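
The two Pig aggregations above (total volume and average daily high, per company per year) can be cross-checked on a small sample before running them on the cluster. The following is a minimal Python sketch, not the required Pig solution. It assumes the prices file has columns named stock_symbol and date (in YYYY-MM-DD form) alongside the "high" and "volume" columns described above; those two extra names, the function name yearly_stats, and the sample values are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory sample standing in for big_tech_stock_prices.csv.
# The column names stock_symbol and date are assumptions; only "high" and
# "volume" are named in the brief. The values are made up for illustration.
SAMPLE = """stock_symbol,date,high,volume
AAPL,2010-01-04,7.5,100
AAPL,2010-01-05,8.5,300
AAPL,2011-01-03,12.0,50
IBM,2010-01-04,130.0,10
"""


def yearly_stats(rows):
    """Return {(symbol, year): (total_volume, average_high)}."""
    volume = defaultdict(int)
    high_sum = defaultdict(float)
    days = defaultdict(int)
    for row in rows:
        # Group key: company symbol plus the year taken from the date string.
        key = (row["stock_symbol"], int(row["date"][:4]))
        volume[key] += int(row["volume"])
        high_sum[key] += float(row["high"])
        days[key] += 1
    # Average high = sum of daily highs / number of trading days in the group.
    return {k: (volume[k], high_sum[k] / days[k]) for k in volume}


stats = yearly_stats(csv.DictReader(io.StringIO(SAMPLE)))
for (symbol, year), (total, avg_high) in sorted(stats.items()):
    print(symbol, year, total, round(avg_high, 2))
```

If the Pig output for the same sample rows disagrees with these figures, the usual suspects are the header row being loaded as data or the year being extracted from the wrong part of the date field.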

Part II: Download the saledata.zip file from the Assignment 2 folder on Moodle. This compressed file, when extracted, contains seven CSV (Comma-Separated Values) files. Each file includes nine columns, with each row detailing product information as described in Table 1.

For Part II, use HiveQL commands to perform the following operations:
Upload all seven CSV files to the Hadoop Distributed File System (HDFS).
Create a directory on the cluster and name it Sales.
Set up a database named sales_db and create corresponding tables to load the seven CSV files.
Retrieve and display product names across all categories, including Appliances, Electronics, Exercise & Fitness, Grocery & Gourmet Foods, Home & Kitchen, Pet Supplies, and Sports Fitness & Outdoors.
Display the top 5 highest-rated products within each category.
List all products in each category that offer a discount greater than 40%.
Filter and show products with an actual price of $500 or less.
Identify and display the highest-priced product in each category based on the actual price.
Calculate and display the average actual price of products for each category.
Write a 350-word summary highlighting your key insights and findings from the analysis.
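
The per-category queries above (top 5 by rating, discount over 40%, actual price of $500 or less, highest price, and average price) can be prototyped on a few rows to confirm the expected results before writing the HiveQL. The following is a minimal Python sketch, not the required HiveQL solution. Because the nine columns of Table 1 are not reproduced here, the field names (name, category, rating, discount_pct, actual_price), the function names, and all sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative assumptions,
# since the nine columns of Table 1 are not reproduced in this brief.
PRODUCTS = [
    {"name": "Blender", "category": "Appliances", "rating": 4.5,
     "discount_pct": 50, "actual_price": 120.0},
    {"name": "Kettle", "category": "Appliances", "rating": 4.0,
     "discount_pct": 10, "actual_price": 40.0},
    {"name": "Laptop", "category": "Electronics", "rating": 4.7,
     "discount_pct": 45, "actual_price": 900.0},
    {"name": "Earbuds", "category": "Electronics", "rating": 4.2,
     "discount_pct": 40, "actual_price": 60.0},
]


def by_category(products):
    """Group product records by their category field."""
    groups = defaultdict(list)
    for p in products:
        groups[p["category"]].append(p)
    return groups


def top_rated(products, n=5):
    """Top n products per category, highest rating first."""
    return {c: sorted(items, key=lambda p: p["rating"], reverse=True)[:n]
            for c, items in by_category(products).items()}


def discounted(products, threshold=40):
    """Products whose discount is strictly greater than the threshold."""
    return [p for p in products if p["discount_pct"] > threshold]


def priced_at_most(products, limit=500):
    """Products whose actual price is at or below the limit."""
    return [p for p in products if p["actual_price"] <= limit]


def highest_priced(products):
    """Single highest-priced product per category."""
    return {c: max(items, key=lambda p: p["actual_price"])
            for c, items in by_category(products).items()}


def average_price(products):
    """Mean actual price per category."""
    return {c: sum(p["actual_price"] for p in items) / len(items)
            for c, items in by_category(products).items()}


print(average_price(PRODUCTS))
```

Note the boundary cases: a 40% discount does not satisfy "greater than 40%", while a price of exactly $500 does satisfy "$500 or less". The same comparisons should be used in the HiveQL WHERE clauses.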

Perform data analytics on real-world problems : MDA621 Software Practice for Big Data Analytics, Master of Data Analytics (MDA) - Melbourne Institute of Technology - Develop and appraise big data platforms