K-nearest neighbor for text classification, Computer Engineering

Assignment 2: K-nearest neighbor for text classification.

The goal of text classification is to identify the topic of a piece of text (a news article, a blog post, etc.). Text classification has obvious utility in the age of information overload, and it has become a popular application area for machine learning algorithms. In this project, you will implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

1.       Download the dataset from my website. It was created from the original collection and contains a training file, a test file, the list of topics, and a description of the train/test file format.

2.       Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try the following distance or similarity measures with their corresponding representations.

a.       Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

b.       Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

c.       Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.
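As a rough illustration of the three measures in (a) to (c), here is a minimal Python sketch that operates on plain lists of equal length. The function names are hypothetical, and returning 0.0 for an all-zero vector in the cosine case is an assumption the assignment does not specify.

```python
import math

def hamming_distance(a, b):
    """Count the positions where two equal-length boolean vectors differ."""
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))

def euclidean_distance(a, b):
    """Square root of the summed squared differences between word counts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as having similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)
```

The first two functions return distances (smaller means closer), while the third returns a similarity (larger means closer), which matters when ranking neighbors.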

3.       Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the number in the vector in (b)). TF stands for term frequency, and TF(d,w)=N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency, and IDF(d,w)=log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w (so the value depends only on w, not on d). The base of the logarithm is irrelevant; you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(d,w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details about it online or in a standard IR textbook.
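The definitions above can be sketched in Python roughly as follows. This is only an illustration: the function name `tfidf_vectors`, the token-list input format, and the choice to compute D and C(w) over one collection are assumptions, and how to weight words in test documents is left open by the assignment.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns (vocab, list of TF-IDF vectors)."""
    vocab = sorted({w for doc in docs for w in doc})
    num_docs = len(docs)  # D
    # C(w): number of documents that contain w at least once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    # Natural log is used here; the base only rescales every weight.
    idf = {w: math.log(num_docs / doc_freq[w]) for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # N(d, w)
        total = len(doc)       # W(d)
        vectors.append(
            [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]
        )
    return vocab, vectors

# Tiny made-up example: "the" occurs in every document, so its weight is 0.
vocab, vectors = tfidf_vectors([["the", "cat"], ["the", "dog", "dog"]])
```

In this toy collection, log(D/C("the")) = log(2/2) = 0, which shows how a word that appears everywhere is discounted completely.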

4.       You should try k = 1, k = 3 and k = 5 with each of the representations above. Note that with a distance measure, the k nearest neighbors are the training documents with the smallest distances from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
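One way to handle both cases with the same code is to pass a flag saying whether a larger score means closer. The sketch below is a minimal illustration under stated assumptions: the names `knn_predict` and `higher_is_closer` are hypothetical, the prediction is a simple majority vote, and ties go to the label seen first among the nearest neighbors, which the assignment does not prescribe.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k, score, higher_is_closer=False):
    """Predict a label for query by majority vote among its k nearest neighbors."""
    scored = [(score(vec, query), label)
              for vec, label in zip(train_vectors, train_labels)]
    # Distances sort ascending; similarities sort descending.
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_closer)
    top_labels = [label for _, label in scored[:k]]
    # Assumption: on a tied vote, the label encountered first (nearest) wins.
    return Counter(top_labels).most_common(1)[0][0]

# Made-up toy data, only to show the call for each value of k.
train = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["a", "a", "b", "b"]
predictions = {k: knn_predict(train, labels, [5, 4], k, euclidean_distance)
               for k in (1, 3, 5)}
```

For cosine similarity, the same function would be called with the similarity function as `score` and `higher_is_closer=True`.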

 

 

