Create a contingency table and a bar plot

Reference no: EM132389326

1. Preprocessing

1a: Write a function preprocess(text, stop_words) which performs these steps:

  • word tokenization (with NLTK)
  • remove punctuation
  • remove "stop words" from the text; the stop words are given as a set, and the words should be matched case-insensitively.

To get all points, use one or more list comprehensions to achieve the filtering.
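One possible shape for such a function is sketched below. It is not the reference solution: it assumes NLTK's regex-based wordpunct_tokenize (which needs no extra data download; word_tokenize would also work but splits contractions differently), and it falls back to an equivalent regular expression if NLTK is not installed. The sample sentence and stop-word set are made up for illustration.

```python
import string

try:
    # Regex-based NLTK tokenizer; unlike word_tokenize it needs no data download.
    from nltk.tokenize import wordpunct_tokenize as tokenize
except ImportError:
    import re

    def tokenize(text):
        """Fallback with the same splitting rule, used only if NLTK is missing."""
        return re.findall(r"\w+|[^\w\s]+", text)


def preprocess(text, stop_words):
    """Tokenize text, drop punctuation tokens, then drop stop words."""
    stop_words = {word.lower() for word in stop_words}
    tokens = tokenize(text)
    # Keep a token only if at least one character is not punctuation.
    tokens = [t for t in tokens if any(ch not in string.punctuation for ch in t)]
    # Match stop words case-insensitively.
    return [t for t in tokens if t.lower() not in stop_words]


# Made-up example input.
tokens = preprocess("The cat sat on the mat, and the dog barked!", {"the", "on", "and"})
print(tokens)  # ['cat', 'sat', 'mat', 'dog', 'barked']
```

Both filtering steps are list comprehensions, as the assignment asks; lowercasing the stop-word set once keeps the per-token check cheap.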

When using Topic Models, it is common to chop long texts into chunks with a fixed number of tokens. For example, we might want to chop a novel into chunks of exactly 1000 tokens (regardless of where sentences or chapters end).

1b: Write a function chunker(tokens, n) that takes a text as a list of tokens, and returns a list with chunks of n tokens each. Hint: the extra optional arguments to range() might come in handy.
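The hint about range() points at its optional start and step arguments. A minimal sketch, assuming that a shorter final chunk is kept rather than discarded (the assignment does not say which is wanted):

```python
def chunker(tokens, n):
    """Split a list of tokens into consecutive chunks of n tokens each.

    Assumption: if len(tokens) is not a multiple of n, the last chunk
    is shorter than n instead of being dropped.
    """
    # range(start, stop, step) yields the index where each chunk begins.
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


chunks = chunker(["a", "b", "c", "d", "e", "f", "g"], 3)
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

To keep only full chunks, the result could be filtered on len(chunk) == n.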

1c: Integrate the previous two functions into a function chunk_text(input_filename, output_filename, stop_words, chunk_length) that:

  • reads a file
  • calls the preprocess function on the result
  • calls the chunker function on that result
  • writes those chunks to a file, with one chunk per line.
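The four steps above could be wired together roughly as follows. To keep the block runnable on its own, preprocess and chunker are compact stand-ins (a regular expression replaces the NLTK tokenizer here), and the choices of UTF-8 encoding and of joining the tokens of a chunk with single spaces are assumptions, not requirements stated in the assignment.

```python
import os
import re
import string
import tempfile


def preprocess(text, stop_words):
    """Compact stand-in: regex tokenization instead of NLTK."""
    stop_words = {w.lower() for w in stop_words}
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    tokens = [t for t in tokens if any(ch not in string.punctuation for ch in t)]
    return [t for t in tokens if t.lower() not in stop_words]


def chunker(tokens, n):
    """Compact stand-in: consecutive chunks of n tokens, last one may be shorter."""
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def chunk_text(input_filename, output_filename, stop_words, chunk_length):
    """Read a file, preprocess it, chunk it, and write one chunk per line."""
    with open(input_filename, encoding="utf-8") as infile:
        text = infile.read()
    chunks = chunker(preprocess(text, stop_words), chunk_length)
    with open(output_filename, "w", encoding="utf-8") as outfile:
        for chunk in chunks:
            outfile.write(" ".join(chunk) + "\n")


# Small demonstration on temporary files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "input.txt")
    target = os.path.join(tmp, "chunks.txt")
    with open(source, "w", encoding="utf-8") as f:
        f.write("The quick brown fox jumps over the lazy dog.")
    chunk_text(source, target, {"the", "over"}, 3)
    with open(target, encoding="utf-8") as f:
        written = f.read().splitlines()

print(written)  # ['quick brown fox', 'jumps lazy dog']
```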

2. Movie scripts

The year is 1994. You are the agent of Tom Cruise. Tom is offered the part of "Ethan" in the movie Mission Impossible. However, Tom only wants the part if his role is so big that he has more than 2.5x as many lines as the second-most prominent character. Unfortunately, you don't have the time to read the script (it is in the attached file mi.txt), so you will have to write a program that counts the number of lines each character has. Fortunately, movie scripts are formatted in a very particular way, with a certain number of spaces for each type of line. Recall the exercise in week 3 about Romeo & Juliet which solved a related problem.

2a: Your mission, should you choose to accept it, is to plot the top 20 characters with the most lines of dialogue in the script in a bar plot. The plot should have names on the y-axis and the number of lines for each character on the x-axis. Note that "ETHAN (CONT'D)" is not a name; the part in parentheses should be stripped off. For simplicity, count it as a separate line of dialogue even though it indicates a continuation of a previous line.
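A rough sketch of the counting and plotting is given below. The indentation width of character cues in mi.txt is not stated here, so name_indent is a hypothetical parameter that has to be read off the actual file; the rule "a cue is an all-caps line at exactly that indentation" is likewise an assumption. The sample lines are invented and use an indentation of 4 purely for illustration.

```python
import re
from collections import Counter


def speaker_of(line, name_indent):
    """Return the character name if line looks like a character cue, else None."""
    text = line.strip()
    indent = len(line) - len(line.lstrip(" "))
    if not text or indent != name_indent or not text.isupper():
        return None
    # Strip a trailing parenthetical: "ETHAN (CONT'D)" becomes "ETHAN".
    name = re.sub(r"\s*\(.*?\)\s*$", "", text)
    return name or None


def count_speakers(script_lines, name_indent):
    """Count how many speeches each character has."""
    names = (speaker_of(line, name_indent) for line in script_lines)
    return Counter(name for name in names if name is not None)


def plot_top_speakers(counts, top=20):
    """Horizontal bar plot: names on the y-axis, number of lines on the x-axis."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    names, values = zip(*counts.most_common(top))
    fig, ax = plt.subplots()
    ax.barh(names, values)
    ax.invert_yaxis()  # most prominent character at the top
    ax.set_xlabel("Lines of dialogue")
    ax.set_ylabel("Character")
    return ax


# Invented sample in which cues are indented by 4 spaces.
sample = [
    "    ETHAN\n",
    "  Hello.\n",
    "    CLAIRE\n",
    "  Hi.\n",
    "    ETHAN (CONT'D)\n",
    "  Again.\n",
    "INT. ROOM\n",
]
counts = count_speakers(sample, name_indent=4)
print(counts.most_common(20))  # [('ETHAN', 2), ('CLAIRE', 1)]
```

Because count_speakers takes any iterable of lines, the same two functions can be called on a second script without copying code, for example with the lines read from another file via open(...).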

2b: Now show a similar bar plot for the sequel, Mission Impossible 2: mi2.txt. Make sure your code is re-usable so that you don't have to repeat a lot of code to do this.

2c: Adapt your function so that it produces a Series with the lines of dialogue; the index should have the name of the character that's speaking. Show the first 5 lines in the script.
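One way this adaptation might look, again under loudly stated assumptions: both name_indent and dialogue_indent are hypothetical parameters to be determined from the script file, dialogue is taken to be every line at dialogue_indent following a cue, and the lines of one speech are joined with spaces. The sample lines are invented.

```python
import re

import pandas as pd


def dialogue_series(script_lines, name_indent, dialogue_indent):
    """Return a Series of speeches indexed by the speaking character's name."""
    names, speeches = [], []
    for line in script_lines:
        text = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        if text and indent == name_indent and text.isupper():
            # New cue: start a new speech, dropping "(CONT'D)" and similar.
            names.append(re.sub(r"\s*\(.*?\)\s*$", "", text))
            speeches.append([])
        elif text and indent == dialogue_indent and speeches:
            # Dialogue line: attach it to the most recent cue.
            speeches[-1].append(text)
    return pd.Series([" ".join(s) for s in speeches], index=names, name="dialogue")


# Invented sample: cues indented by 4 spaces, dialogue by 2.
sample = [
    "    ETHAN\n",
    "  Hello.\n",
    "  Anyone there?\n",
    "    CLAIRE\n",
    "  Hi.\n",
    "INT. ROOM\n",
    "    ETHAN (CONT'D)\n",
    "  Again.\n",
]
lines = dialogue_series(sample, name_indent=4, dialogue_indent=2)
print(lines.head(5))
```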

3. Tweets

We will look at tweets related to a crisis to analyze how different parties communicate about crises. In particular, we will look at the 2013 NY train crash. The data is in the directory 2013_NY_train_crash.

3a: Load the file called '2013_NY_train_crash-tweets_labeled.csv' into a dataframe called tweets. You may want to rename the columns, because the column names include leading spaces, which is error-prone.

This file does not contain timestamps. For that, load the other file '2013_NY_train_crash-tweetids_entire_period.csv'. Pass the option parse_dates=['Timestamp'] to properly load the timestamps as times instead of strings. This file contains duplicate rows which cause problems. Find a Pandas method to drop the duplicate rows.

Now take the column with the "Timestamp" and add it to the tweets DataFrame as a new column. Note that the timestamps are in the UTC timezone, not local NY time (EST timezone).
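The loading steps might be sketched as below. In-memory CSV text stands in for the two files so the block runs anywhere; with the real data the StringIO objects would be replaced by the file paths. The column names other than Timestamp (here "Tweet ID", "Tweet Text", "Information Source"), the rows themselves, and the choice to attach timestamps by merging on a shared ID column are all assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for the labeled file; note the leading spaces in the header.
labeled_csv = io.StringIO(
    "Tweet ID, Tweet Text, Information Source\n"
    '1,"Train derailed in the Bronx",Media\n'
    '2,"Praying for the victims",Outsiders\n'
)
# Stand-in for the timestamp file, including one duplicated row.
ids_csv = io.StringIO(
    "Timestamp,Tweet ID\n"
    "2013-12-01 12:30:00,1\n"
    "2013-12-01 12:30:00,1\n"
    "2013-12-01 17:05:00,2\n"
)

tweets = pd.read_csv(labeled_csv)
# Remove the leading spaces from the column names.
tweets.columns = tweets.columns.str.strip()

# Parse timestamps as datetimes and drop the duplicate rows.
times = pd.read_csv(ids_csv, parse_dates=["Timestamp"]).drop_duplicates()

# Attach the timestamps as a new column (assumes a shared ID column).
tweets = tweets.merge(times, on="Tweet ID", how="left")

# The stored times are UTC; this converts them to local New York time if needed.
local_time = tweets["Timestamp"].dt.tz_localize("UTC").dt.tz_convert("America/New_York")

print(tweets.head(5))
```

Without drop_duplicates, the merge would repeat the first tweet once per duplicated timestamp row, which is the kind of problem the assignment warns about.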

You should now have a DataFrame with tweets, timestamps and three other columns with manually annotated labels about each tweet. Show the first 5 rows in the dataframe.

3b: We are interested in knowing how different parties report about victims. For example: who is quicker to report on a disaster, the media or outsiders? Does the former react to the latter, or vice versa?

Select all tweets about "Affected individuals". For the resulting tweets we are interested in contrasting those that are close in time to the disaster (before 16:00 UTC) with tweets which are sent later. Add a column 'later' indicating whether the tweet was after '2013-12-01 16:00:00'. Note that you can compare the Timestamp column to a string with this time to achieve this: tweets['Timestamp'] > '2013-12-01 16:00:00'

Create a contingency table and a bar plot showing the number of tweets depending on the source and whether the tweet is 'later' or not.
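The selection, the 'later' column and the contingency table can be sketched on a tiny invented DataFrame. The column names "Information Source" and "Information Type" are assumptions about the labeled file, and the five rows are made up; only the label "Affected individuals" and the cut-off time come from the assignment text.

```python
import pandas as pd

# Invented stand-in for the tweets DataFrame built in 3a.
tweets = pd.DataFrame({
    "Information Source": ["Media", "Outsiders", "Media", "Outsiders", "Media"],
    "Information Type": ["Affected individuals"] * 4 + ["Other"],
    "Timestamp": pd.to_datetime([
        "2013-12-01 12:40:00", "2013-12-01 13:10:00",
        "2013-12-01 18:00:00", "2013-12-02 09:00:00",
        "2013-12-01 12:45:00",
    ]),
})

# Keep only tweets about affected individuals; copy to avoid modifying a view.
affected = tweets[tweets["Information Type"] == "Affected individuals"].copy()

# pandas parses the string on the right-hand side as a timestamp.
affected["later"] = affected["Timestamp"] > "2013-12-01 16:00:00"

# Contingency table: rows are sources, columns are False/True for 'later'.
table = pd.crosstab(affected["Information Source"], affected["later"])
print(table)
```

Calling table.plot.bar() would then draw one group of bars per source, with one bar each for the earlier and the later tweets (this step needs matplotlib).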

3c: Think of a simple hypothesis to explore on this dataset and show the results.

Attachment: Assignment Files.rar

