Different characteristics of the frequently misclassified, Mechanical Engineering

Assignment Help:

Sentiment analysis is a subfield of NLP concerned with the determination of opinion and subjectivity in a text, which has application in the analysis of online product reviews, recommendations, blogs, and other types of opinionated documents.

In this assignment you will develop classifiers for sentiment analysis of movie reviews using Support Vector Machines (SVMs), in the manner of the paper by Pang, Lee, and Vaithyanathan [1], which was the first research on this topic. The goal is a classifier that assigns each movie review a label of "positive" or "negative", predicting whether the author of the review liked or disliked the movie.

You may use either Java or Python for this assignment, but for the machine learning you must use SVMlight (section D).

B. Data

The data (available on the course web page) consists of 1,000 positive and 1,000 negative reviews. Each group of 1,000 reviews has been divided into training, validation, and test sets of 800, 100, and 100 reviews, respectively. To discourage you from optimizing against the testing set while developing your classifiers, the testing data will not be immediately available.

The reviews were obtained from Pang's website [2], and then part-of-speech tagged using a bidirectional Maximum Entropy Markov Model [3, 4].

Each document is formatted as one sentence per line. Each token is of the format word/POStag, where a "word" also includes punctuation. Each word is in lowercase. There is sometimes more than one slash in a token, such as in writer/director/NN.
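Because a token can contain more than one slash, splitting on the first slash would cut a word such as writer/director in half. A minimal sketch of one way to read a line, assuming the tag is always whatever follows the last slash (the function names are illustrative, not part of the assignment):

```python
def parse_token(token):
    """Split 'word/TAG' on the LAST slash, so 'writer/director/NN' stays intact."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        # Assumption: a token without any slash is treated as a word with no tag.
        return token, ""
    return word, tag


def parse_sentence(line):
    """Turn one line (one sentence) into a list of (word, tag) pairs."""
    return [parse_token(tok) for tok in line.split()]
```

For example, parse_token("writer/director/NN") yields the word "writer/director" with the tag "NN".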

C. Baseline system

For a baseline system, choose 20 words that you think would be indicative of a positive movie review, and 20 words that you think would be indicative of a negative review.

To develop the baseline classifier, take this approach: given a movie review, count how many times it contains either a positive word or a negative word (token occurrences). Assign the label POSITIVE if the review contains more positive words than negative words. Assign the label NEGATIVE if it contains more negative words than positive words. If there are an equal number of positive and negative words, it is a TIE.
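The counting rule above can be sketched as follows. The word sets are short hypothetical placeholders, not a suggested answer; your own lists should contain 20 words each.

```python
# Hypothetical placeholder lists; replace with your own 20 + 20 words.
POSITIVE_WORDS = {"great", "excellent", "wonderful"}
NEGATIVE_WORDS = {"bad", "boring", "awful"}


def baseline_label(words, positive_words=POSITIVE_WORDS, negative_words=NEGATIVE_WORDS):
    """Label a review from token occurrences of positive and negative words."""
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    if pos > neg:
        return "POSITIVE"
    if neg > pos:
        return "NEGATIVE"
    return "TIE"
```

Here words is the full list of lowercase tokens in one review, so a word that appears three times is counted three times.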

D. Machine learning

The machine learning software to be used is SVMlight [5], which learns Support Vector Machines for binary classification. It is available for Unix systems, Windows, and Mac OS X.

You will need to read the documentation on the SVMlight website in order to figure out how to use the software. To test whether you know how to use it, it might be helpful to first create a small, "toy" dataset by hand, and then train and test the SVM on it.

When training the classifier, select the option for classification:

-z {c,r,p} - select between classification (c), regression (r), and preference ranking (p)
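As a rough sketch, the two SVMlight programs can be driven from Python. The commands assume the usual invocation order from the SVMlight documentation (svm_learn takes an example file and a model file; svm_classify takes an example file, a model file, and an output file) and that both binaries are on your PATH. The file names and helper names are hypothetical.

```python
import shutil
import subprocess


def svmlight_commands(train_file, eval_file, model_file, pred_file):
    """Build the training and classification command lines (nothing is run here)."""
    return [
        ["svm_learn", "-z", "c", train_file, model_file],
        ["svm_classify", eval_file, model_file, pred_file],
    ]


def run_if_available(commands):
    """Run the commands only if the first executable can be found on PATH."""
    if shutil.which(commands[0][0]) is None:
        return False  # SVMlight is not installed or not on PATH
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True
```

A call such as run_if_available(svmlight_commands("train.dat", "val.dat", "model", "pred.txt")) would train on train.dat and write one prediction per line to pred.txt.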

A training file is of the format:

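<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

<target> .=. +1 | -1 | 0 | <float>

<feature> .=. <integer> | "qid"

<value> .=. <float>

<info> .=. <string>

Since we are doing binary classification, the value of <target> should be +1 or -1.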

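Every feature is identified by a positive integer (the only string the format accepts is the special feature "qid", which is used for preference ranking and is not needed here) and is associated with a value, which is a floating-point number. Feature/value pairs on a line must be listed in order of increasing feature number. If you want a feature to be binary-valued, you may use values of 0.0 and 1.0.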

With binary features, it is not necessary to include an explicit representation of features that do not occur. For example, suppose a document contains 100 different words out of a vocabulary of 50,000 possible words. If you are using binary features, it suffices to include a feature with a value of 1.0 for each of the words that do occur. You do not have to include a feature with a value of 0.0 for each of the 49,900 words that do not appear in the document.
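The sparse representation can be illustrated with a small helper that writes one training line per document. This is only a sketch: feature_index is a hypothetical dictionary that maps each feature to an integer starting at 1, and features missing from it are silently skipped.

```python
def format_example(label, present_features, feature_index, comment=""):
    """Build one SVMlight line with a 1.0 entry for each feature that occurs."""
    # Deduplicate and sort: feature numbers must appear in increasing order.
    ids = sorted({feature_index[f] for f in present_features if f in feature_index})
    target = "+1" if label > 0 else "-1"
    parts = [target] + [f"{i}:1.0" for i in ids]
    line = " ".join(parts)
    if comment:
        line += f" # {comment}"
    return line
```

With feature_index = {"good": 1, "bad": 2, "film": 3}, a positive review containing "film" and "good" becomes the line "+1 1:1.0 3:1.0"; no entry is written for "bad".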

You do not need to perform smoothing.

E. Feature sets

Use these feature sets for training and testing your classifier:

1. unigrams

2. bigrams

3. unigrams + POS

4. adjectives

5. top unigrams

6. optimized

Detailed explanation:

1. unigrams: use the word unigrams that occurred >= 4 times in the training data. Let this quantity be N.

2. bigrams: use the N most-frequent bigrams.

3. unigrams + POS: use all combinations of word/tag for each of the unigrams in (1). Since a word may occur with multiple tags, the quantity of this type of feature will be greater than N.

4. adjectives: use the adjectives that occurred >= 4 times. Let this quantity be M.

5. top unigrams: use the M most-frequent unigrams.

6. optimized: choose any combination of features you would like, to try to produce the best classifier possible. For example, you might choose different cutoff values for frequencies of different types of features. You could also create entirely new types of features. You could also try different settings for training the SVM. The optimized classifier should be produced through a process of repeatedly training the classifier and computing its performance on the validation set.
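The frequency cutoffs for feature sets 1, 2, 4, and 5 can be sketched with collections.Counter. Several choices below are assumptions rather than requirements of the assignment: each document is a list of sentences, each sentence is a list of (word, tag) pairs, bigrams are counted within sentences only, and adjectives are taken to be tokens whose tag starts with "JJ" (Penn Treebank style). Ties at the cutoff are broken by first-seen order, which is arbitrary.

```python
from collections import Counter


def count_unigrams(docs):
    """Count word occurrences over all training documents."""
    return Counter(word for doc in docs for sent in doc for word, _ in sent)


def frequent_unigrams(docs, min_count=4):
    """Feature set 1: words seen at least min_count times. N is the size of this set."""
    return {w for w, c in count_unigrams(docs).items() if c >= min_count}


def top_bigrams(docs, n):
    """Feature set 2: the n most frequent adjacent word pairs."""
    counts = Counter()
    for doc in docs:
        for sent in doc:
            words = [w for w, _ in sent]
            counts.update(zip(words, words[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]


def frequent_adjectives(docs, min_count=4):
    """Feature set 4: adjectives seen at least min_count times. M is the size of this set."""
    counts = Counter(
        word for doc in docs for sent in doc for word, tag in sent
        if tag.startswith("JJ")  # assumption: JJ, JJR, JJS mark adjectives
    )
    return {w for w, c in counts.items() if c >= min_count}


def top_unigrams(docs, m):
    """Feature set 5: the m most frequent words."""
    return [w for w, _ in count_unigrams(docs).most_common(m)]
```

A typical use would compute N = len(frequent_unigrams(train_docs)) and pass it to top_bigrams, then compute M from frequent_adjectives and pass it to top_unigrams.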

F. Evaluation

Train the SVMs on the training data and perform preliminary tests on the validation data. To evaluate your classifiers, compute the accuracy rate on the testing data, which is the percentage of movie reviews correctly classified. For the baseline classifier, also compute the number of ties.

Evaluate your classifiers on the testing data when it is released. Do not further optimize your system based on performance on the testing data.
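The accuracy computation can be sketched as below. It assumes the prediction file written by svm_classify holds one decision value per line, with the sign giving the predicted class; a value of exactly 0 is counted as negative here, which is an arbitrary choice. The function names are illustrative.

```python
def read_predictions(path):
    """Read one decision value per non-empty line of a prediction file."""
    with open(path) as f:
        return [float(line.split()[0]) for line in f if line.strip()]


def accuracy(gold_labels, decision_values):
    """Percentage of examples whose predicted sign matches the gold label (+1 or -1)."""
    if len(gold_labels) != len(decision_values):
        raise ValueError("gold labels and predictions differ in length")
    correct = sum(
        1 for gold, value in zip(gold_labels, decision_values)
        if (value > 0) == (gold > 0)
    )
    return 100.0 * correct / len(gold_labels)


def count_ties(baseline_labels):
    """Number of reviews the baseline system could not decide."""
    return sum(1 for label in baseline_labels if label == "TIE")
```

For the baseline system, you will need to decide and report how ties are treated when computing accuracy; the assignment only asks that their number be given.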

G. Turn in

Produce a document that states:

- Short descriptions of attached files

- A list of the positive and negative words chosen for your baseline system

- Performance of the baseline system on the test set

- A table listing the number of distinct features for each feature set. Since the split of the data into training and testing is not exactly the same as Pang et al.'s, the quantity of different features will be similar, but not identical.

- A table of performance of the classifiers on the validation set and test set

- A written comparison of your results with Pang et al.'s (minimum 5 lines)

- A table listing the 50 most frequently misclassified reviews (across all 6 classifiers) in the validation set, and the number of classifiers by which each was misclassified. For example, the review cv808_12635.txt might have been misclassified by 4 classifiers. Describe 5 different characteristics of the frequently misclassified reviews, showing excerpts from 2 reviews for each characteristic. For each of these characteristics, describe a possible feature that could be added to improve performance.
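One way to build the misclassification table is to tally, for each validation review, how many classifiers got it wrong. In this sketch errors_by_classifier is a hypothetical dictionary mapping a classifier name to the file names it misclassified; reviews with equal counts are listed in first-seen order.

```python
from collections import Counter


def most_misclassified(errors_by_classifier, top_n=50):
    """Return (review, number of classifiers that misclassified it), most frequent first."""
    counts = Counter()
    for misclassified in errors_by_classifier.values():
        # Use a set so a review counts at most once per classifier.
        counts.update(set(misclassified))
    return counts.most_common(top_n)
```

Each returned pair corresponds to one row of the table, for example a file name together with the count 4 if four of the six classifiers misclassified it.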

