Reference no: EM132917304
Final Assignment
Part 1:
Step 1: Read the Tripadvisor hotel reviews dataset
Step 2: Create a diagram of the variable "Score" to see whether the majority of the customer ratings are positive or negative.
Step 3: Create wordclouds to see the most frequently used words in the reviews and save them.
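One way to approach Steps 1 and 2 is sketched below. It assumes the reviews are loaded into a pandas DataFrame with a "Score" column; the toy data stands in for the real file, and the function names are hypothetical.

```python
import pandas as pd


def score_distribution(df: pd.DataFrame, column: str = "Score") -> pd.Series:
    """Count how many reviews received each rating, ordered by rating."""
    return df[column].value_counts().sort_index()


def plot_score_distribution(counts: pd.Series, path: str) -> None:
    """Save a bar chart of the rating counts (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # draw to a file; no display is needed
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar")
    ax.set_xlabel("Score")
    ax.set_ylabel("Number of reviews")
    ax.figure.tight_layout()
    ax.figure.savefig(path)
    plt.close(ax.figure)


# Toy stand-in for the real dataset, which would be read with pd.read_csv(...)
reviews = pd.DataFrame({
    "Review": ["great stay", "dirty room", "nice pool", "loved it", "okay"],
    "Score": [5, 1, 4, 5, 3],
})
counts = score_distribution(reviews)
print(counts)
```

Calling plot_score_distribution(counts, "score_distribution.png") would then write the chart to disk; the output file name is only an example.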
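A minimal sketch of Step 3 follows. It counts word frequencies with the standard library and hands them to the third-party wordcloud package, which is assumed to be installed; the stop-word list and function names are illustrative only.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "in", "of", "to", "we"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


def save_wordcloud(frequencies, path):
    """Render the frequencies as a word cloud image (requires wordcloud)."""
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(dict(frequencies))
    cloud.to_file(path)


freqs = word_frequencies(["The room was clean", "Clean room and great staff"])
print(freqs.most_common(3))
```

save_wordcloud(freqs, "wordcloud.png") would save the image; the same helper can be reused later for the positive and negative subsets.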
Step 4: Run sentiment analysis with VADER
• Apply the model to our dataset
• Label reviews with compound > 0 as positive sentiment and compound < 0 as negative sentiment, and remove reviews with compound = 0
• Export the positive and negative reviews as CSV files (Part 2 uses positive.csv)
• Now that we have classified reviews into positive and negative, let's build wordclouds for each!
• Take a look at the distribution of reviews with sentiment across the dataset and save the diagram
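The bullets above could be sketched as follows, assuming the vaderSentiment package for scoring. The column name "compound", the file name negative.csv, and the helper names are assumptions; only positive.csv is named in the assignment.

```python
import os

import pandas as pd


def vader_compound(texts):
    """Return VADER compound scores (requires the vaderSentiment package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


def label_sentiment(df, compound_col="compound"):
    """Drop neutral rows (compound == 0) and label the remaining ones."""
    labelled = df[df[compound_col] != 0].copy()
    labelled["sentiment"] = labelled[compound_col].apply(
        lambda c: "positive" if c > 0 else "negative"
    )
    return labelled


def export_by_sentiment(labelled, out_dir="."):
    """Write positive.csv and negative.csv (the second name is assumed)."""
    for name in ("positive", "negative"):
        subset = labelled[labelled["sentiment"] == name]
        subset.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)


# Toy scores stand in for vader_compound(reviews["Review"])
reviews = pd.DataFrame({
    "Review": ["great stay", "dirty room", "room 12"],
    "compound": [0.8, -0.5, 0.0],
})
labelled = label_sentiment(reviews)
print(labelled["sentiment"].value_counts())  # distribution of sentiment
```

With the real data, the "compound" column would be filled by vader_compound, and the value counts could be plotted as a bar chart in the same way as the "Score" diagram.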
Step 5: Building the classification model
Build the sentiment analysis model! This model will take reviews in as input.
It will then come up with a prediction on whether the review is positive or negative.
This is a classification task, so you will train a simple logistic regression model to do it.
Step 6: Split the Dataframe
The new data frame should only have two columns - "Review", and "sentiment" (the target variable).
Training the sentiment analysis model
80% of the data will be used for training, and 20% will be used for testing.
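A short sketch of the split using scikit-learn's train_test_split. The toy rows and the fixed random_state are assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled reviews produced in Step 4
df = pd.DataFrame({
    "Review": [f"review text {i}" for i in range(10)],
    "Score": [5, 1, 4, 2, 5, 1, 4, 2, 5, 1],
    "sentiment": ["positive", "negative"] * 5,
})

# Keep only the input text and the target variable
model_df = df[["Review", "sentiment"]]

# 80% for training, 20% for testing
train_df, test_df = train_test_split(model_df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```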
Step 7: Create a bag of words
Use a count vectorizer from the Scikit-learn library.
Convert the text into a bag-of-words model since the logistic regression algorithm cannot understand text.
Step 8: Logistic Regression
Split target and independent variables
Fit model on data
Make predictions
Step 9: Test the accuracy of your model
Find accuracy, precision, recall
Create the classification report
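Steps 8 and 9 might look like the sketch below. The tiny hand-made train and test sets are placeholders for the real 80/20 split, and max_iter=1000 and pos_label="positive" are assumptions rather than requirements of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Placeholder data; in the assignment these come from the 80/20 split
train_reviews = [
    "clean room friendly staff", "great location lovely pool",
    "wonderful breakfast helpful staff", "comfortable bed great view",
    "dirty room rude staff", "terrible noise broken shower",
    "awful smell dirty bathroom", "rude reception terrible service",
]
train_labels = ["positive"] * 4 + ["negative"] * 4
test_reviews = ["friendly staff great pool", "dirty bathroom terrible noise"]
test_labels = ["positive", "negative"]

# Independent variables: bag-of-words counts; target: sentiment labels
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_reviews)
X_test = vectorizer.transform(test_reviews)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions,
                            pos_label="positive", zero_division=0)
recall = recall_score(test_labels, predictions,
                      pos_label="positive", zero_division=0)
report = classification_report(test_labels, predictions, zero_division=0)
print(accuracy, precision, recall)
print(report)
```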
Part 2: Topic Modelling
LDA
Step 1: Import the positive.csv dataset you have created in Part 1
Step 2: Apply LDA to the "Review" column
Step 3: Define number of topics as 5
Step 4: Create topics along with the probability distribution for each word in our vocabulary for each topic.
Step 5: Print the 10 words with the highest probabilities for all five topics
Step 6: Add a column to the original data frame that will store the topic for the reviews.
Step 7: Save the new dataset as: reviews_topic(lda).csv
Non-Negative Matrix Factorization (NMF)
Step 1: Import the positive.csv dataset you have created in Part 1
Step 2: Apply Non-Negative Matrix Factorization (NMF) on the dataset
Step 3: Define number of topics as 5
Step 4: Create topics along with the weight of each word in our vocabulary for each topic (NMF yields non-negative weights rather than a probability distribution).
Step 5: Print the 10 words with the highest weights for all five topics
Step 6: Add a column to the original data frame that will store the topic for the reviews.
Step 7: Save the new dataset as: reviews_topic(nmf).csv
Attachment: Reviews Assignment.rar