Calculate the cosine distance between every pair of vectors

Reference no: EM132188009

Assignment -

In this exercise you will write a script called "docdistances" that calculates distances between pairs of text documents. These distances will be based on a vanilla version of term frequency-inverse document frequency (tf-idf). Your script will calculate the distances between 6 documents: 3 documents are synopses of fairy tales (Red Riding Hood, The Princess and the Pea, and Cinderella); the other 3 documents are abstracts of papers related to protein function prediction (identified as CAFA1, CAFA2 and CAFA3). You will find these documents in the attached file (the file names are: "RedRidingHood.txt", "PrincessPea.txt", "Cinderella.txt", "CAFA1.txt", "CAFA2.txt", "CAFA3.txt").

Your script will:

1. For each document, calculate its tf-idf vector.

The tf-idf vector of a document is a vector whose length is equal to the total number of distinct terms (words) present in the corpus (in this case, the corpus is the entire set of 6 documents). Each term is assigned a specific element of the vector, and that element is in the same position in the tf-idf vector of every document. For a given document d, the vector element corresponding to term t is calculated as the product of 2 values:

a) Term frequency: the number of times that term t appears in document d.

b) Inverse document frequency: the log base 10 of the inverse fraction of the documents that contain the term, i.e.

log10(number of documents in the corpus/number of documents where term t appears)
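The two factors above can be combined as in the following minimal sketch. It is written in Python purely for illustration, since the assignment does not name a language. The function names `tokenize` and `tfidf_vectors`, the lowercase letters-only tokenization, and the three toy sentences are assumptions and not part of the assignment.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a term is a lowercase run of letters. The assignment
    # does not define tokenization, so adjust this rule as needed.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of document strings."""
    # Term frequency: raw count of each term in each document.
    counts = [Counter(tokenize(doc)) for doc in documents]
    # One fixed position per term, shared by every document's vector.
    vocabulary = sorted(set().union(*counts))
    n_docs = len(documents)
    # Inverse document frequency: log10(N / number of docs containing t).
    idf = {
        term: math.log10(n_docs / sum(1 for c in counts if term in c))
        for term in vocabulary
    }
    vectors = [[c[term] * idf[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Toy corpus (hypothetical text, not the assignment files).
docs = [
    "the wolf ate the girl",
    "the princess felt the pea",
    "the protein function",
]
vocabulary, vectors = tfidf_vectors(docs)
```

Note that a term appearing in every document (here "the") has an inverse document frequency of log10(1) = 0, so it contributes nothing to any vector.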

2. Calculate the cosine distance between every pair of tf-idf vectors representing each document. (This is equal to 1 minus the cosine of the angle between the 2 vectors.)
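As a hedged illustration of this step, the cosine of the angle between two vectors is their dot product divided by the product of their Euclidean norms. The sketch below uses plain Python; the function name `cosine_distance` and the choice to raise an error for a zero vector are assumptions, since the assignment does not say how to handle that case.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: the angle is undefined for an all-zero vector,
        # so this sketch refuses to return a number.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Orthogonal vectors (documents with no weighted terms in common) have a distance of 1, and vectors pointing in the same direction have a distance of 0.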

3. Collect these distances into a 6x6 matrix where the (i, j) element contains the distance between document i and document j. Then make a figure that displays the matrix. Your figure should look similar to the figure below (here I have used imagesc and set the colormap to gray).

[Figure: 2497_figure.png, the 6x6 distance matrix displayed with imagesc and a gray colormap]
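A possible way to assemble and display the matrix is sketched below in Python. The names `distance_matrix` and `show_matrix` and the toy vectors are hypothetical. The plotting helper assumes matplotlib is installed and uses `imshow` with `cmap="gray"` as a rough counterpart of imagesc with a gray colormap; it is not the assignment's own method.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def distance_matrix(vectors):
    """Return a symmetric n x n list of lists of pairwise cosine distances."""
    n = len(vectors)
    # The diagonal stays 0: each document is at distance 0 from itself.
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def show_matrix(matrix, labels):
    # Assumption: matplotlib is available. It is imported here so that the
    # rest of the sketch runs without it.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="gray")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    plt.show()


# Toy vectors standing in for the six tf-idf vectors (hypothetical data).
toy_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
matrix = distance_matrix(toy_vectors)
```

With the real data, `vectors` would hold the six tf-idf vectors, and a call such as `show_matrix(matrix, labels)` with six document names would draw the figure. With a gray colormap, small distances appear dark and large distances appear light.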

It is interesting to note that the 2 types of documents form 2 clear groups: the synopses of fairy tales are more similar to each other than they are to the scientific abstracts. Also, "The Princess and the Pea" is more similar to "Cinderella" than to "Red Riding Hood", which makes sense because "The Princess and the Pea" and "Cinderella" have many elements in common.

Attachment:- Assignment Files.rar





All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd