Setting up a pipeline to ingest data from Twitter

Reference no: EM133367353

Data Collection and Curation

Introduction

One can argue that the most challenging task in a Big Data setting is obtaining the data that will later be used for analysis and prediction. In this assignment, you will set up a pipeline that ingests data from Twitter, cleans and processes it, and loads it into a Hive table for analysis. You will use Apache Kafka and Apache Flume to ingest data into HDFS, Spark SQL for data analysis, and Spark ML for prediction.

Instructions

Step 1: Set up a Kafka producer to ingest tweets

Set up a Kafka producer in Python that retrieves data from Twitter for a specific set of keywords related to a topic (the choice of topic and keywords is up to you) and sends it to a topic in a Kafka broker. You will need to sign up for a developer account with Twitter, which is free. Format the data so that the other components of the pipeline can ingest it easily. Twitter limits the number of calls a client can make within a given time window. Check these limits and adjust your code so that tweets are received continuously without exceeding them. Some sample code for setting up the producer is provided, as well as online videos.
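The pacing and formatting requirements above can be sketched as follows. This is a minimal illustration rather than the provided sample code: `fetch` and `send` are hypothetical callables standing in for your Twitter client and your Kafka producer, the field names in `to_record` are assumptions about what your Twitter client returns, and the rate-limit figures in the usage note are placeholders to replace with the limits of your own access tier.

```python
import json
import time
from typing import Callable, Iterable, Optional


def min_interval_seconds(max_calls: int, window_seconds: float) -> float:
    """Smallest gap between API calls that stays within max_calls per window."""
    return window_seconds / max_calls


def to_record(tweet: dict, keyword: str) -> bytes:
    """Serialize the fields later steps need as one UTF-8 JSON line."""
    record = {
        "id": str(tweet["id"]),
        "created_at": tweet.get("created_at"),
        "lang": tweet.get("lang"),
        "keyword": keyword,
        # Collapse newlines so each record stays on a single line.
        "text": " ".join(tweet.get("text", "").split()),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def run_producer(
    fetch: Callable[[str], Iterable[dict]],   # hypothetical: one API call per keyword
    send: Callable[[bytes], object],          # hypothetical: publishes one record
    keywords: Iterable[str],
    interval: float,
    rounds: Optional[int] = None,             # None means run until interrupted
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Poll each keyword in turn, pausing after every API call."""
    keywords = list(keywords)
    sent = 0
    done = 0
    while rounds is None or done < rounds:
        for keyword in keywords:
            for tweet in fetch(keyword):
                send(to_record(tweet, keyword))
                sent += 1
            sleep(interval)  # wait before the next API call
        done += 1
    return sent
```

For example, if your tier allowed 450 calls per 900 seconds, `min_interval_seconds(450, 900)` gives a 2-second pause between calls. With the kafka-python package, `send` could be `lambda value: producer.send("tweets", value)`, where `producer = KafkaProducer(bootstrap_servers="localhost:9092")`; the topic name and broker address here are assumptions.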

Step 2: Set up a Kafka consumer

Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The consumer should be designed to handle large volumes of data and should be fault-tolerant. Some sample Kafka consumers are also available.
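One common way to meet the volume and fault-tolerance requirements is to write records in batches and commit the Kafka offset only after a batch has been stored. The sketch below shows that ordering with plain Python. `write_batch` and `commit` are hypothetical callables standing in for your HDFS writer and your consumer's offset commit, and the batch size is an arbitrary placeholder.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def consume_in_batches(
    messages: Iterable[Tuple[int, bytes]],        # (offset, value) pairs from the topic
    write_batch: Callable[[List[bytes]], object],  # hypothetical: stores one batch in HDFS
    commit: Callable[[int], object],               # hypothetical: commits up to this offset
    batch_size: int = 500,
) -> int:
    """Write records in batches; commit an offset only after its batch is stored."""
    buffer: List[bytes] = []
    last_offset: Optional[int] = None
    written = 0
    for offset, value in messages:
        buffer.append(value)
        last_offset = offset
        if len(buffer) >= batch_size:
            write_batch(buffer)   # if this raises, nothing is committed
            commit(last_offset)   # and the batch is re-read after a restart
            written += len(buffer)
            buffer = []
    if buffer:
        # Flush the partial batch when the message stream ends (e.g. on shutdown).
        write_batch(buffer)
        commit(last_offset)
        written += len(buffer)
    return written
```

Because the write happens before the commit, a crash between the two steps leads to the same batch being written again after restart (at-least-once delivery), so duplicates may need to be removed later, for example by tweet id. A long-running consumer would usually also flush on a timer so that a slow topic does not leave records waiting in memory.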

Step 3: Set up a Flume agent

Apache Flume is a streaming tool typically used for text data. It is lighter than Apache Kafka to install and configure. Review the videos posted on Apache Flume and set up a Flume agent that gets data from Twitter and saves it to HDFS.

Step 4: Clean and Process Data

The data saved to HDFS needs to be cleaned and split into multiple columns. Where you clean the data is up to you: in the Kafka producer, in the Kafka consumer, in Flume, or at the end of the pipeline. Ensure that the result is formatted so that it can be easily loaded into Spark for later processing (see below).
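As one possible approach to the cleaning step, the sketch below turns a raw JSON line into a flat set of columns and drops records that cannot be used. The input field names (`id`, `text`, `created_at`, `lang`, `keyword`) and the chosen output columns are assumptions for illustration; adapt them to whatever your producer or Flume agent actually writes.

```python
import json
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
SPACE_RE = re.compile(r"\s+")


def clean_line(line: str) -> Optional[dict]:
    """Parse one raw JSON line into cleaned columns, or return None to drop it."""
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None  # malformed line
    if not isinstance(raw, dict) or "id" not in raw:
        return None
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        return None  # nothing to analyse

    # Keep hashtags as their own column before stripping the '#' signs.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]

    cleaned = URL_RE.sub(" ", text)
    cleaned = MENTION_RE.sub(" ", cleaned)
    cleaned = cleaned.replace("#", " ")
    cleaned = SPACE_RE.sub(" ", cleaned).strip().lower()

    return {
        "id": str(raw["id"]),
        "created_at": raw.get("created_at"),
        "lang": raw.get("lang"),
        "keyword": raw.get("keyword"),
        "hashtags": hashtags,
        "clean_text": cleaned,
    }
```

Writing each cleaned record back out with `json.dumps` gives one JSON object per line, a layout that Spark can read directly.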

Step 5: Load Data into Spark SQL

The data must then be loaded into a Scala DataFrame for analysis. Use the DataFrame to run queries on the data you have read. The queries will depend on the topic you have chosen and the keywords you used to collect tweets.
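The loading and querying step might look like the sketch below. It is written in PySpark rather than Scala, purely to stay in the same language as the Python producer; the Scala DataFrame API offers the same calls (`spark.read.json`, `createOrReplaceTempView`, `spark.sql`). The view name and the `keyword` and `hashtags` columns are assumptions about your cleaned data, with `hashtags` assumed to be an array column.

```python
def tweets_per_keyword_query(view: str) -> str:
    """SQL counting how many tweets were collected for each keyword."""
    return (
        f"SELECT keyword, COUNT(*) AS n FROM {view} "
        f"GROUP BY keyword ORDER BY n DESC"
    )


def top_hashtags_query(view: str, limit: int = 10) -> str:
    """SQL listing the most frequent hashtags (assumes an array column 'hashtags')."""
    return (
        f"SELECT tag, COUNT(*) AS n "
        f"FROM {view} LATERAL VIEW explode(hashtags) t AS tag "
        f"GROUP BY tag ORDER BY n DESC LIMIT {int(limit)}"
    )


def run_queries(path: str, view: str = "tweets") -> None:
    """Load JSON-lines data from HDFS into a DataFrame and run both queries."""
    # Imported here so the query builders above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweet-queries").getOrCreate()
    df = spark.read.json(path)           # one JSON object per line
    df.createOrReplaceTempView(view)     # expose the DataFrame to Spark SQL
    spark.sql(tweets_per_keyword_query(view)).show()
    spark.sql(top_hashtags_query(view)).show()
```

A call such as `run_queries("hdfs:///user/you/tweets/clean")` would print both result tables; the path is a placeholder for wherever your pipeline stores the cleaned files.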

Step 6: Train a Spark ML algorithm

Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether the tweets you have ingested have positive or negative sentiment. You can also choose other predictions depending on the topic.
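Sentiment prediction needs labelled examples, and the brief does not say where the labels come from. The sketch below assumes one possible route: weak labels derived from a tiny, purely illustrative word list, followed by a standard Spark ML text pipeline (tokenizer, hashed term frequencies, IDF, logistic regression). The word lists, the `clean_text` and `label` column names, and the hyperparameters are all assumptions, not part of the assignment.

```python
from typing import Optional

# Illustrative word lists only; a real lexicon or hand-labelled data is preferable.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}


def weak_label(text: str) -> Optional[float]:
    """Return 1.0 (positive), 0.0 (negative), or None when there is no clear signal."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return 1.0
    if score < 0:
        return 0.0
    return None


def build_pipeline():
    """Assemble a Spark ML pipeline for binary text classification."""
    # Imported here so weak_label can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    tokenizer = Tokenizer(inputCol="clean_text", outputCol="words")
    term_freq = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16)
    idf = IDF(inputCol="tf", outputCol="features")
    classifier = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
    return Pipeline(stages=[tokenizer, term_freq, idf, classifier])
```

With a DataFrame `train_df` that has `clean_text` and `label` columns, `build_pipeline().fit(train_df)` returns a model whose `transform` method adds a `prediction` column. Note that a model trained only on lexicon-derived labels mostly relearns the lexicon, so a hand-labelled sample or a public labelled tweet dataset is likely to give more meaningful results.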


Course: BDAT 1008 Data Collection and Curation, Georgian College


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd