Prepare the data for a gensim word2vec model

Reference no: EM133967976

Information Technology and Marketing in the New Economy

Assignment 1

Part 1. Text representation

There are 100 reviews for restaurants and films in a collection under the IA1_1.csv file. For this assignment, you are asked to preprocess these reviews such that each of the reviews will be represented as a TF-IDF vector. In particular, please follow the steps listed below:

1. Tokenize each review in the collection.

2. Using the tokenized reviews from step 1, lemmatize all the words.

3. Based on the output in step 2, remove all the stop-words and punctuation.

4. Based on the output in step 3, convert each of the reviews to a TF-IDF vector. The minimal document frequency for each term is 3. Also, include 2-grams.

5. Based on the output in step 1, POS-tag each word and apply TF-IDF vectorization. The minimal document frequency for each term is 4 (please do not apply normalization or stop-word removal).

Tip: you may consider using a "for loop" for step 1 to step 3, so you could process the whole collection at once.
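
Steps 1 to 3 can be sketched as a single loop over the collection. This is a minimal sketch, not a reference solution: the regex tokenizer, the tiny stop-word set, and the `lemmatize` argument are illustrative stand-ins (assumptions). In a real notebook, NLTK's `word_tokenize`, `WordNetLemmatizer().lemmatize`, and `stopwords.words("english")` would typically take their places.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set; a full list (e.g. NLTK's) is larger.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it"}
PUNCTUATION = set(string.punctuation)

def tokenize(text):
    """Step 1: split into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def preprocess(reviews, lemmatize=lambda tok: tok):
    """Run steps 1-3 on every review and return one token list per review."""
    cleaned = []
    for review in reviews:
        tokens = tokenize(review)                     # step 1: tokenize
        lemmas = [lemmatize(tok) for tok in tokens]   # step 2: lemmatize
        kept = [tok for tok in lemmas                 # step 3: filter
                if tok not in STOP_WORDS and tok not in PUNCTUATION]
        cleaned.append(kept)
    return cleaned

# Assumption: two invented reviews standing in for the rows of IA1_1.csv.
reviews = ["The food was great, and the service was fast!",
           "It is a dull film."]
print(preprocess(reviews))
```

The default `lemmatize` leaves tokens unchanged, so a real lemmatizer has to be passed in for step 2 to have any effect.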

Please submit these files:

1. A Jupyter Notebook file (.ipynb) that includes your Python code, with comments (#) or markdown cells, and the output of each successful run. Use a markdown cell at the end of the .ipynb file to report the dimension of the vectors from step 4 and step 5.

2. A CSV file with your final TF-IDF vectors (step 4). Each review should correspond to one row and each column should correspond to one item in the vectors. (Note: you don't need to submit the intermediate output data in step 1, step 2 and step 3).

3. A CSV file with your POS-tag TF-IDF vectors (step 5). Each review should correspond to one row and each column should correspond to one item in the vectors. (Note: you don't need to submit the intermediate output data in step 1).
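
Step 4 and its CSV deliverable can be sketched with scikit-learn's `TfidfVectorizer`, where `min_df=3` sets the minimal document frequency and `ngram_range=(1, 2)` adds 2-grams to the single words. The toy token lists and the output file name `tfidf_step4.csv` are assumptions for illustration; the real input would be the output of step 3.

```python
import csv
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: toy output of step 3 (one token list per review).
cleaned = [
    ["good", "food", "fast", "service"],
    ["good", "food", "slow", "service"],
    ["good", "food", "great", "film"],
    ["dull", "film"],
]

# The tokens are already cleaned, so join them and split on whitespace only.
docs = [" ".join(tokens) for tokens in cleaned]
vectorizer = TfidfVectorizer(
    tokenizer=str.split,   # keep the tokens exactly as preprocessed
    token_pattern=None,    # unused because a tokenizer is supplied
    lowercase=False,
    min_df=3,              # drop terms found in fewer than 3 reviews
    ngram_range=(1, 2),    # single words and 2-grams
)
tfidf = vectorizer.fit_transform(docs)

# Order the vocabulary by column index to get the CSV header.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(tfidf.shape)  # (number of reviews, vector dimension)

# One row per review, one column per term.
with open("tfidf_step4.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(terms)
    writer.writerows(tfidf.toarray().tolist())
```

The second entry of `tfidf.shape` is the vector dimension to report in the notebook.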

Part 2. Word2Vec

The data in IA1_2.csv has the information about 11914 cars. There are two fields: Maker_Model and description. The description column contains a set of tags (separated by commas), where the Maker_Model is also included.
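
Since each description is a comma-separated list of tags, one plausible way to prepare it for gensim is to split every row into a list of tags, so that each car becomes one "sentence". This is a minimal sketch that assumes the two column names given above and uses two invented rows in place of IA1_2.csv. Keeping each `Maker_Model` as a single token is a modeling choice, made here so that multi-word names such as 'Toyota Camry' can be looked up later.

```python
import csv
import io

# Assumption: two invented rows standing in for IA1_2.csv.
sample = io.StringIO(
    "Maker_Model,description\n"
    'Toyota Camry,"Toyota Camry, Midsize, Sedan, automatic"\n'
    'Nissan Van,"Nissan Van, Compact, Passenger Minivan, manual"\n'
)

def to_sentences(handle):
    """Turn each description into a list of stripped, non-empty tags."""
    sentences = []
    for row in csv.DictReader(handle):
        tags = [tag.strip() for tag in row["description"].split(",")]
        sentences.append([tag for tag in tags if tag])
    return sentences

sentences = to_sentences(sample)
print(sentences[0])
```

A list of token lists like `sentences` is the shape gensim's `Word2Vec` accepts as its training corpus. Note that the `size` argument was renamed `vector_size` in gensim 4, so the 50-dimensional setting is spelled `vector_size=50` there.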

1. Prepare the data for a gensim Word2Vec model.

2. Run the model (with size = 50) and display the vector for 'Toyota Camry'.

3. Compute the similarity between 'Porsche 718 Cayman' and 'Nissan Van'.

4. Find the five cars most similar to 'Mercedes-Benz SLK-Class'.

5. Generate a t-SNE graph for a list of 50 unique cars.
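
In gensim, the similarity asked for in step 3 is the cosine similarity between two word vectors (this is what `model.wv.similarity` returns), and the ranking in step 4 orders the vocabulary by the same measure. Below is a small plain-Python sketch of that computation, using invented 3-dimensional vectors and hypothetical car names rather than real 50-dimensional model output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, topn=5):
    """Rank every other key by cosine similarity to `target`, best first."""
    scores = [(name, cosine_similarity(vectors[target], vec))
              for name, vec in vectors.items() if name != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Assumption: toy vectors, not trained embeddings.
vectors = {
    "car_a": [1.0, 0.0, 0.0],
    "car_b": [0.9, 0.1, 0.0],
    "car_c": [0.0, 1.0, 0.0],
}
print(most_similar("car_a", vectors, topn=2))
```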
