Reference no: EM133967976
Information Technology and Marketing in the New Economy
Assignment 1
Part 1. Text representation
The file IA1_1.csv contains a collection of 100 reviews of restaurants and films. For this assignment, you are asked to preprocess these reviews so that each review is represented as a TF-IDF vector. In particular, please follow the steps listed below:
1. Tokenize each review in the collection.
2. Using the tokenized reviews from step 1, lemmatize all the words.
3. Based on the output of step 2, remove all stop-words and punctuation.
4. Based on the output of step 3, convert each review to a TF-IDF vector. The minimum document frequency for each term is 3. Also include 2-grams.
5. Based on the output of step 1, POS-tag each word and perform a TF-IDF vectorization with a minimum document frequency of 4 for each term (please do not apply normalization or stop-word removal).
Tip: you may consider using a "for loop" for step 1 to step 3, so you could process the whole collection at once.
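The loop suggested in the tip could look roughly like the sketch below. It is only an illustration: toy_tokenize, toy_lemmatize and TOY_STOP_WORDS are hypothetical stand-ins, not part of the assignment, so that the code runs without downloading any language data. In the notebook, NLTK's word_tokenize, WordNetLemmatizer().lemmatize and stopwords.words("english") would be the natural replacements, passed in through the same arguments.

```python
import re
import string

# Hypothetical stand-in stop-word list; a real list is much longer.
TOY_STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it"}


def toy_tokenize(text):
    """Stand-in tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def toy_lemmatize(token):
    """Stand-in lemmatizer: only lowercases; swap in a real lemmatizer."""
    return token.lower()


def is_punctuation(token):
    """True when every character of the token is a punctuation mark."""
    return all(ch in string.punctuation for ch in token)


def preprocess(reviews, tokenize=toy_tokenize, lemmatize=toy_lemmatize,
               stop_words=TOY_STOP_WORDS):
    """Run steps 1 to 3 over the whole collection in one loop."""
    cleaned = []
    for review in reviews:
        tokens = tokenize(review)                      # step 1: tokenize
        lemmas = [lemmatize(tok) for tok in tokens]    # step 2: lemmatize
        kept = [w for w in lemmas                      # step 3: filter
                if w.lower() not in stop_words and not is_punctuation(w)]
        cleaned.append(kept)
    return cleaned
```

Calling preprocess(reviews) returns one list of cleaned tokens per review, which is the form step 4 expects as input. Step 5 starts from the step 1 tokens instead, so the raw tokens are worth keeping in a separate list.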
Please submit these files:
1. A Jupyter Notebook file (.ipynb) that includes your Python code, documented with # comments or markdown cells, and the output of each successful run. Use a markdown cell at the end of the .ipynb file to report the dimensions of the vectors from step 4 and step 5.
2. A CSV file with your final TF-IDF vectors (step 4). Each review should correspond to one row and each column should correspond to one item in the vectors. (Note: you don't need to submit the intermediate output data in step 1, step 2 and step 3).
3. A CSV file with your POS-tag TF-IDF vectors (step 5). Each review should correspond to one row and each column should correspond to one item in the vectors. (Note: you don't need to submit the intermediate output data in step 1).
Part 2. Word2Vec
The data in IA1_2.csv contains information about 11914 cars. There are two fields: Maker_Model and description. The description column contains a set of comma-separated tags, which also includes the Maker_Model.
1. Prepare the data for a gensim Word2Vec model.
2. Run the model (with size = 50) and display the vector for 'Toyota Camry'.
3. Compute the similarity between 'Porsche 718 Cayman' and 'Nissan Van'.
4. Find the five cars most similar to 'Mercedes-Benz SLK-Class'.
5. Generate a t-SNE graph for a list of 50 unique cars.