Reference no: EM132188009
Assignment -
In this exercise you will write a script called "docdistances" that will calculate distances between pairs of text documents. These distances will be based on a vanilla version of term frequency-inverse document frequency (tf-idf). Your script will calculate the distances between 6 documents: 3 documents are synopses of fairy tales (Red Riding Hood, the Princess and the Pea, and Cinderella); the other 3 documents are the abstracts of papers related to protein function prediction (identified as CAFA1, CAFA2 and CAFA3). You will find these documents in the attached file (the file names are: "RedRidingHood.txt", "PrincessPea.txt", "Cinderella.txt", "CAFA1.txt", "CAFA2.txt", "CAFA3.txt").
Your script will:
1. For each document, calculate its tf-idf vector.
The tf-idf vector of a document is a vector whose length is equal to the total number of different terms (words) present in the corpus (in this case, the corpus is the entire set of 6 documents). Each term is assigned a specific element of the vector, which is in the same position in the tf-idf vector of every document. For a given document d, the vector element corresponding to term t is calculated as the product of 2 values:
a) Term frequency: the number of times that term t appears in document d.
b) Inverse document frequency: the log base 10 of the inverse fraction of the documents that contain the term, i.e.
log10(number of documents in the corpus/number of documents where term t appears)
2. Calculate the cosine distance between every pair of tf-idf vectors representing each document. (This is equal to 1 minus the cosine of the angle between the 2 vectors.)
3. Collect these distances into a 6x6 matrix where the (i, j) element contains the distance between document i and document j. Then make a figure that displays the matrix. Your figure should look similar to the figure below (here I have used imagesc and set the colormap to gray).
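The three steps above can be sketched in code. The assignment appears to expect MATLAB (it mentions imagesc); the sketch below uses Python purely as an illustration and is not the reference solution. The function names (tokenize, tfidf_vectors, cosine_distance, distance_matrix), the tokenization rule (lowercase runs of letters) and the handling of all-zero vectors are assumptions, since the assignment does not specify them. The toy strings stand in for the contents of the six text files.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a term is a lowercase run of letters a-z;
    # the assignment does not define how to split text into terms.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors), one vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Sorted vocabulary gives every term the same position in every vector.
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Inverse document frequency: log10(N / df), as stated in step 1b.
    idf = {t: math.log10(n_docs / df[t]) for t in vocab}
    # Term frequency (raw count) times idf.
    vectors = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, vectors


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: the angle is undefined for an all-zero vector,
        # so report the maximum distance for non-negative vectors.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(docs):
    """Square matrix whose (i, j) entry is the distance between docs i and j."""
    _, vectors = tfidf_vectors(docs)
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Hypothetical toy corpus; the real script would read the six files.
    toy_docs = [
        "the wolf ate the girl",
        "the prince found the girl",
        "protein function prediction",
    ]
    for row in distance_matrix(toy_docs):
        print(["%.3f" % x for x in row])
```

In this toy corpus the first two strings share terms, so their distance is below 1, while the third shares no terms with the others and sits at distance 1 from both. A term that appears in every document gets an idf of log10(1) = 0 and so contributes nothing to any vector. To display the matrix in Python, matplotlib's `imshow` with `cmap="gray"` plays roughly the role that imagesc with a gray colormap plays in MATLAB.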
It is interesting to note that the 2 types of documents form 2 clear groups: the synopses of fairy tales are more similar to each other than they are to the scientific papers. Also, "The Princess and the Pea" is more similar to "Cinderella" than to "Red Riding Hood", which makes sense, as "The Princess and the Pea" and "Cinderella" have many elements in common.
Attachment:- Assignment Files.rar