What is the average number of total words per document

Reference no: EM132030478

Assignment -

For this assignment, the first two problems require Hadoop MapReduce jobs, although you need only solve one of them. Each of these problems should have its own folder. The folder for a problem must contain a .txt file giving the command-line invocation for the job. For Java jobs, submit the project directory as well as a jar. The streaming job will require its own folder containing the files for the mapper and reducer. Problems carried out in Spark require only the file that will be submitted through spark-submit. Spark jobs will be implemented in Python. For Spark jobs, key-value output may include parentheses. For problems which do not require MapReduce or Spark, follow the instructions given below, including all work in the main submission zip.

Solve one of problems 1 and 2.

1. The following is a MapReduce exercise. You may use either the Java or Streaming API. From the UCI Machine Learning Repository, download the compressed files docwords.nytimes.txt.gz and vocab.nytimes.txt.gz; these are part of the Bag of Words data set. Create a file named words_nytimes.txt, which is the same as docwords.nytimes.txt but with the first three lines removed. Using the distributed cache, translate the records of the NYTimes data set into the form (docid, actual term, term count, max frequency for document). Parentheses should not be part of the output, and you may use different delimiters. The actual term is the mapping of a term id as given in the file vocab.nytimes.txt. The input file here is words_nytimes.txt, and the file to be placed in the distributed cache is vocab.nytimes.txt. The VM may have difficulty with the entire data set; if you are having issues, run on only a part of the file.
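A minimal plain-Python sketch of the translation step (the helper names are hypothetical; in the real streaming job, the vocabulary would be read from the distributed cache during mapper setup, and the per-document grouping would be done by the shuffle rather than in memory):

```python
from collections import defaultdict

def load_vocab(lines):
    # vocab.nytimes.txt has one term per line; the term id is the
    # 1-based line number.
    return {i + 1: term.strip() for i, term in enumerate(lines)}

def translate(records, vocab):
    # records are lines "docid termid count" from words_nytimes.txt.
    # Group by document, find the max term count in each document,
    # then emit: docid, actual term, count, max frequency.
    docs = defaultdict(list)
    for line in records:
        docid, termid, count = line.split()
        docs[docid].append((vocab[int(termid)], int(count)))
    for docid, terms in docs.items():
        max_freq = max(c for _, c in terms)
        for term, count in terms:
            yield f"{docid}\t{term}\t{count}\t{max_freq}"
```

In an actual MapReduce job the mapper would emit records keyed by docid and the reducer would compute the per-document maximum; the sketch above only fixes the intended output shape.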

2. In this exercise you will implement matrix multiplication as a streaming job using Python. You will do so by executing a secondary sort in such a way that no buffering is required in the reducer: your reducer may use only O(1) additional memory. For example, you may use a small number of variables storing floats or ints only.
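One common key design for this (a sketch, not the only valid one): for C = A×B, emit each a_ij under keys (i, k, j, 0) and each b_jk under keys (i, k, j, 1), partition and group by (i, k), and secondary-sort within a group by (j, tag). Matching A and B entries then arrive adjacently, so the reducer needs only a few scalars of state. A plain-Python simulation over the fully sorted mapper output (function names are hypothetical):

```python
def map_matmul(A, B, n_rows_A, n_cols_B):
    # A, B: sparse matrices as dicts (row, col) -> value.
    # Emit ((i, k, j, tag), value): tag 0 for A entries, 1 for B.
    for (i, j), a in A.items():
        for k in range(n_cols_B):
            yield (i, k, j, 0), a
    for (j, k), b in B.items():
        for i in range(n_rows_A):
            yield (i, k, j, 1), b

def reduce_matmul(sorted_records):
    # sorted_records: ((i, k, j, tag), value) sorted by full key.
    # O(1) state: the current cell, a running sum, and the last
    # A value seen together with its j index.
    cell = None
    total = 0.0
    a_val, a_j = 0.0, None
    for (i, k, j, tag), v in sorted_records:
        if (i, k) != cell:
            if cell is not None:
                yield cell, total
            cell, total, a_j = (i, k), 0.0, None
        if tag == 0:
            a_val, a_j = v, j       # remember a_ij until its b_jk arrives
        elif a_j == j:
            total += a_val * v      # matching pair for this j
    if cell is not None:
        yield cell, total
```

In the real streaming job, the partitioner would use only (i, k) while the comparator sorts on the full key (Hadoop Streaming exposes this via KeyFieldBasedPartitioner and the key-field comparator options); the simulation above just demonstrates that the reducer logic needs no buffering.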

3. In this problem you will build an inverted index for the NYTimes data, in the following sense: the output will be a term id together with a sorted list of the documents in which the term is found. To be precise, the output will be lines with tab-separated fields, where the first field is the term id and the subsequent fields are of the form docid:count, where count is the number of times the term appears in the document. Furthermore, the docid:count fields must be sorted, highest to lowest, by count, so the document with the greatest count appears first and the one with the least count appears last. You will implement this in Spark. Your submission will be a file whose lines contain the required data, together with a file giving the code/commands executed. Compress the submission data.
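The required output format can be pinned down with a plain-Python sketch (the actual submission must use Spark, e.g. group the records by term id and sort each posting list; the function name here is hypothetical):

```python
from collections import defaultdict

def inverted_index(records):
    # records are (docid, termid, count) triples. For each term id,
    # emit a tab-separated line: the term id, then docid:count
    # fields sorted by count, highest first.
    postings = defaultdict(list)
    for docid, termid, count in records:
        postings[termid].append((docid, count))
    for termid in sorted(postings):
        ranked = sorted(postings[termid], key=lambda p: p[1], reverse=True)
        yield "\t".join([str(termid)] + [f"{d}:{c}" for d, c in ranked])
```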

4. For this problem you will need to read about the tf-idf transform in the book Mining of Massive Datasets. The input will be the file words_nytimes.txt. The output will be the same as the input, except that the third field, which gives the count of the term in the document, will be replaced by the tf-idf score for the term in the document.

You may solve this using any method you like; however, the tf-idf score must be as defined in the above-mentioned text. You need only submit the output. You must compress the output and include it with your zipped submission.
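Assuming the MMDS-style definition (TF of a term in a document is its count divided by the maximum term count in that document, and IDF is log base 2 of the total number of documents over the number of documents containing the term — verify against the text), the transform can be sketched in plain Python:

```python
import math
from collections import defaultdict

def tf_idf(records):
    # records are (docid, termid, count) triples; yields
    # (docid, termid, score) with count replaced by tf-idf.
    max_in_doc = defaultdict(int)       # max term count per document
    docs_with_term = defaultdict(set)   # which documents contain each term
    docs = set()
    for d, t, c in records:
        max_in_doc[d] = max(max_in_doc[d], c)
        docs_with_term[t].add(d)
        docs.add(d)
    n = len(docs)
    for d, t, c in records:
        tf = c / max_in_doc[d]
        idf = math.log2(n / len(docs_with_term[t]))
        yield d, t, tf * idf
```

Note that a term appearing in every document gets an IDF of 0, so its score is 0 regardless of count.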

5. The following must be solved using Spark. You will submit your answers together with a file containing the commands you executed. It is recommended that you employ DataFrames for this problem. You may need to make use of AWS if your computer is unable to process the entire data set. When asked about particular words, give the id only. Referring to the New York Times data set mentioned above, answer the following questions.

(a) How many documents have at least 100 distinct words?

(b) Which document contains the most total words from the vocabulary?

(c) Which document contains the most distinct words from the vocabulary?

(d) Which document, with at least 100 words, has the greatest lexical richness with respect to the vocabulary? By lexical richness we mean the number of distinct words divided by the total number of words.

(e) Which document, with at least 100 words, has the least lexical richness?

(f) Which word from the vocabulary appears the most across all of the documents, in terms of total count?

(g) How many documents have fewer than 50 words from the vocabulary?

(h) What is the average number of total words per document?

(i) What is the average number of distinct words per document?
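Most of the questions above reduce to two per-document aggregates, total word count and distinct word count, which in Spark would be a groupBy on docid with sum and count aggregations. A plain-Python sketch of those aggregates (function names are hypothetical):

```python
from collections import defaultdict

def doc_stats(records):
    # records are (docid, termid, count) triples. Returns two maps:
    # docid -> total word occurrences, and docid -> distinct words.
    total = defaultdict(int)
    distinct = defaultdict(int)
    for docid, termid, count in records:
        total[docid] += count
        distinct[docid] += 1   # each (docid, termid) pair appears once
    return total, distinct

def average(values):
    values = list(values)
    return sum(values) / len(values)
```

Questions (a), (g), (h), and (i) follow directly from these maps; (b), (c), (f) are argmax queries; (d) and (e) divide the two per-document values after filtering to documents with at least 100 words.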

6. Download the file movies.txt.gz and familiarize yourself with its structure. This is a large file, and the download may take some time depending on your internet connection. After this, you will create a new file called reviews.csv which will have on each line the following:

review id, product id, score, helpfulness score

where the fields are separated by commas. There should be one line per review in the file. You may solve this exercise in whatever manner you choose. Now carry out the following parts; include the code and/or commands executed to answer these questions. You will also submit the compressed output. As in the previous problem, you may use any method you like.

(a) Verify that you have the correct number of reviews in the file you created.

(b) Verify the number of distinct products.

(c) Verify the number of distinct users.

(d) Verify the number of users with 50 or more reviews.

(e) Create a file called mean_rating.csv which has one line per unique reviewer, where each line has the user id and the mean score of all their ratings, separated by a comma. This file should also be compressed and submitted.
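A minimal sketch of part (e) in plain Python (assuming, as parts (c) and (d) suggest, that the review id column of reviews.csv identifies the reviewer):

```python
import csv
from collections import defaultdict
from io import StringIO

def mean_ratings(csv_text):
    # csv_text: contents of reviews.csv, lines of
    # review id, product id, score, helpfulness score.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, _product, score, _help in csv.reader(StringIO(csv_text)):
        sums[user] += float(score)
        counts[user] += 1
    # One (user, mean score) pair per unique reviewer.
    return {u: sums[u] / counts[u] for u in sorted(sums)}
```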

Textbook - Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.

Attachment: Assignment File.rar
