Calculate the cosine distance between every pair of vectors

Reference no: EM132188009

Assignment -

In this exercise you will write a script called "docdistances" that calculates distances between pairs of text documents. These distances will be based on a vanilla version of term frequency-inverse document frequency (tf-idf). Your script will calculate the distances between 6 documents: 3 documents are synopses of fairy tales (Red Riding Hood, The Princess and the Pea, and Cinderella); the other 3 documents are abstracts of papers related to protein function prediction (identified as CAFA1, CAFA2 and CAFA3). You will find these documents in the attached file (the file names are: "RedRidingHood.txt", "PrincessPea.txt", "Cinderella.txt", "CAFA1.txt", "CAFA2.txt", "CAFA3.txt").

Your script will:

1. For each document, calculate its tf-idf vector.

The tf-idf vector of a document is a vector whose length is equal to the total number of different terms (words) present in the corpus (in this case, the corpus is the entire set of 6 documents). Each term is assigned a specific element of the vector, which is in the same position in the tf-idf vector of every document. For a given document d, the vector element corresponding to term t is calculated as the product of 2 values:

a) Term frequency: the number of times that term t appears in document d.

b) Inverse document frequency: the log base 10 of the inverse fraction of the documents that contain the term, i.e.

log10(number of documents in the corpus/number of documents where term t appears)
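The definition above can be sketched in Python as follows. This is a minimal illustration, not the assignment's reference solution: the function names `tokenize` and `tfidf_vectors` are hypothetical, and the tokenization rule (lowercase runs of letters) is an assumption, since the assignment does not say how words should be split.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "term" is a lowercase run of letters a-z.
    # The assignment does not specify a tokenization rule.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of document strings.

    vectors[k][p] is the tf-idf weight of term vocabulary[p] in document k.
    """
    # Term frequency: raw count of each term in each document.
    counts = [Counter(tokenize(doc)) for doc in documents]

    # One fixed position per distinct term in the corpus.
    vocabulary = sorted(set().union(*counts))
    n_docs = len(documents)

    # Inverse document frequency: log10(N / number of documents containing t).
    idf = {}
    for term in vocabulary:
        doc_freq = sum(1 for c in counts if term in c)
        idf[term] = math.log10(n_docs / doc_freq)

    # Each element is term frequency times inverse document frequency.
    vectors = [[c[term] * idf[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = tfidf_vectors(["red wolf red", "wolf pea"])
    print(vocab)
    print(vecs)
```

Note that a term appearing in every document gets an inverse document frequency of log10(1) = 0, so it contributes nothing to any vector.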

2. Calculate the cosine distance between every pair of tf-idf vectors representing each document. (This is equal to 1 minus the cosine of the angle between the 2 vectors.)
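As a hedged sketch of this step, the cosine distance between two equal-length vectors can be computed from their dot product and Euclidean norms. The function name `cosine_distance` is hypothetical, and raising an error for a zero vector is a design choice made here, not something the assignment prescribes.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat the undefined case as an error rather than guess a value.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
    print(cosine_distance([1, 2], [2, 4]))  # parallel vectors
```

Orthogonal vectors are at distance 1 and parallel vectors at distance (approximately) 0. A zero vector can only arise if every term in a document also appears in all other documents, which is unlikely for this corpus.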

3. Collect these distances into a 6x6 matrix where element (i, j) contains the distance between document i and document j. Then make a figure that displays the matrix. Your figure should look similar to the figure below (here I have used imagesc and set the colormap to gray).
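One possible way to assemble and display the matrix is sketched below in Python. The names `distance_matrix` and `show_matrix` are hypothetical. The display function assumes matplotlib is installed and uses `imshow` with `cmap="gray"` as a rough counterpart to imagesc with a gray colormap; it is not the tool named in the assignment.

```python
import math


def cosine_distance(u, v):
    # 1 minus the cosine of the angle between u and v (assumes non-zero vectors).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Return a symmetric n x n list of lists of pairwise cosine distances."""
    n = len(vectors)
    # The diagonal stays 0.0: a document is at distance 0 from itself.
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(vectors[i], vectors[j])
            matrix[i][j] = d
            matrix[j][i] = d  # cosine distance is symmetric
    return matrix


def show_matrix(matrix):
    # Assumption: matplotlib is available; imported here so the rest of the
    # sketch runs without it.
    import matplotlib.pyplot as plt

    plt.imshow(matrix, cmap="gray")
    plt.colorbar()
    plt.show()


if __name__ == "__main__":
    m = distance_matrix([[1, 0], [0, 1], [1, 0]])
    for row in m:
        print(row)
```

With a gray colormap, small distances (similar documents) appear dark and large distances appear light, so two groups of similar documents would show up as two dark blocks along the diagonal.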

2497_figure.png

It is interesting to note that the 2 types of documents form 2 clear groups: the synopses of fairy tales are more similar to each other than they are to the scientific papers. Also, "The Princess and the Pea" is more similar to "Cinderella" than to "Red Riding Hood", which makes sense because "The Princess and the Pea" and "Cinderella" have many elements in common.

Attachment:- Assignment Files.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd