Design our own n-gram model

Reference no: EM132412761

Exercise

An n-gram is a sequence of elements or tokens that appear together in a document or a longer sequence of tokens. In this structure, n is the sequence size. For instance, in the sentence:

"An n-gram is a sequence of tokens"

The 2-grams (bigrams) that make up the sentence are (An, n-gram), (n-gram, is), (is, a), (a, sequence), (sequence, of), (of, tokens). The 3-grams (trigrams) that make up the sentence are: (An, n-gram, is), (n-gram, is, a), (is, a, sequence), (a, sequence, of), (sequence, of, tokens).
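The sliding-window idea behind these lists can be sketched in a few lines of Java. This is only an illustration: the class name NgramExtractor, the method ngrams(), and the choice to represent an n-gram as its tokens joined by a single space are assumptions made for the example, not part of the exercise skeleton. The tokens are the English form of the example sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the exercise's provided classes.
final class NgramExtractor {

    // Returns every run of n consecutive tokens, in order of appearance.
    // Each n-gram is represented as its tokens joined by a single space (an assumption).
    static List<String> ngrams(List<String> tokens, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<>();
        // The window [i, i + n) slides one token at a time until it no longer fits.
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("An", "n-gram", "is", "a", "sequence", "of", "tokens");
        System.out.println(ngrams(tokens, 2)); // the six bigrams
        System.out.println(ngrams(tokens, 3)); // the five trigrams
    }
}
```

A text with k tokens yields k - n + 1 n-grams when k >= n, and none otherwise, which is why the seven-token sentence produces six bigrams and five trigrams.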

N-grams are very popular structures, mainly because of their use in natural language processing, a branch of computer science that aims to achieve adequate processing of human natural language by machines. Their popularity lies in their ability to detect common patterns in related documents. For instance, in sports texts we can commonly find bigrams like (the, player) or (the, team), whereas in other kinds of texts, such as fantasy novels, we will more frequently find bigrams like (the, damsel) or (a, castle).

In this exercise we will design our own n-gram model which will help us score texts based on their similarity to different reference texts. To do so:

In the edu.uoc.mecm.eda.ngram.NgramFrequencyScorer class you will have to implement the train() method. This method takes an input path to a system folder and reads all files ending with .txt. These files will become the training set for our text scoring model. For each file, the method extracts all the tokens in the text. You will have to complete the method to calculate the relative frequency of all n-grams that appear in the training set (globally in all texts). When the class is initialized, the type of n-grams to use is specified (attribute numWords in the class). Your code must be generic and has to work with n-grams of any size.
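One possible way to compute the global relative frequencies is sketched below. It is not the provided NgramFrequencyScorer class: the class name NgramFrequencySketch, the method relativeFrequencies(), and its parameters are hypothetical, and the sketch starts from already tokenized texts rather than reading .txt files from a folder. It also assumes that n-grams do not span file boundaries and that a HashMap keyed by the space-joined n-gram is an acceptable structure; the exercise asks you to justify your own choice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signatures are not taken from the exercise skeleton.
final class NgramFrequencySketch {

    // texts: one token list per training file (assumed to be tokenized already).
    // numWords: the n-gram size, mirroring the numWords attribute named in the exercise.
    static Map<String, Double> relativeFrequencies(List<List<String>> texts, int numWords) {
        if (numWords <= 0) {
            throw new IllegalArgumentException("numWords must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        long total = 0; // total number of n-gram occurrences across all texts

        for (List<String> tokens : texts) {
            // Assumption: windows never cross from one file into the next.
            for (int i = 0; i + numWords <= tokens.size(); i++) {
                String ngram = String.join(" ", tokens.subList(i, i + numWords));
                counts.merge(ngram, 1, Integer::sum);
                total++;
            }
        }

        // Relative frequency = occurrences of the n-gram / occurrences of all n-grams.
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            frequencies.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<List<String>> texts = Arrays.asList(
                Arrays.asList("the", "team", "the", "team", "wins"),
                Arrays.asList("the", "player"));
        System.out.println(relativeFrequencies(texts, 2));
    }
}
```

With a hash map, each lookup and update takes expected constant time, so training is linear in the total number of tokens (for a fixed n-gram size). A sorted map would only be worth its logarithmic cost if the n-grams had to be traversed in order.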

Once we have calculated the relative frequency of the n-grams of the training set, we are ready to score other texts. All texts that are similar to the training set will have a better score than other less similar texts. N-gram relative frequency can be seen as the probability of appearance of an n-gram in the training texts (p(x) where x is an n-gram and p(x) is its relative frequency). Therefore, if we assume independence between n-grams of the same text, we can evaluate a text with our model using the expression:

$$\mathrm{getScore}(X) = \sum_{x \in X} \log(p(x))$$

where X is the set of all n-grams that compose the text. Complete the getScore() method, which takes a text file as input parameter and returns the text's score based on our n-gram model.
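The scoring step can be sketched as a sum of logarithms over the n-grams of the evaluated text. Everything named here is an assumption made for illustration: the class NgramScoreSketch, the method score(), its parameters, and in particular unseenProbability, a hypothetical floor used for n-grams that never occurred in training. The exercise does not say how such n-grams should be handled, and without some rule log(0) would make the score negative infinity. The sketch also sums over every occurrence of an n-gram; summing over distinct n-grams only is another possible reading of "set".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the getScore() signature of the provided class.
final class NgramScoreSketch {

    // tokens: tokens of the text to evaluate (assumed to be tokenized already).
    // frequencies: relative frequency p(x) of each training n-gram, keyed by space-joined tokens.
    // unseenProbability: assumed floor for n-grams absent from training (not in the exercise).
    static double score(List<String> tokens, Map<String, Double> frequencies,
                        int numWords, double unseenProbability) {
        double total = 0.0;
        for (int i = 0; i + numWords <= tokens.size(); i++) {
            String ngram = String.join(" ", tokens.subList(i, i + numWords));
            double p = frequencies.getOrDefault(ngram, unseenProbability);
            total += Math.log(p); // natural logarithm; each term is <= 0 because p <= 1
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> frequencies = new HashMap<>();
        frequencies.put("the team", 0.5);
        frequencies.put("team wins", 0.5);

        List<String> similar = Arrays.asList("the", "team", "wins");
        List<String> different = Arrays.asList("the", "damsel", "wins");

        System.out.println(score(similar, frequencies, 2, 1e-6));
        System.out.println(score(different, frequencies, 2, 1e-6));
    }
}
```

Because every term is non-positive, scores are at most zero, and a text whose n-grams are frequent in the training set stays closer to zero than one made of rare or unseen n-grams. Longer texts also accumulate more negative terms, which matters when comparing texts of different lengths.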

If you take a look at the tests implemented in the edu.uoc.mecm.eda.tests.NgramFrequencyScorerTest class, you will see that the score of the first text is higher than that of the second text, which in turn is higher than that of the third. Explain why this happens. You may need to analyse both the training texts and the evaluation texts.


Reviews

len2412761

12/6/2019 11:56:38 PM

The solution must use Java and be run in IntelliJ IDEA. Exercises 3 and 4 are worth 30% each. In these exercises, the following will be evaluated: the correctness of the source code (passing all available unit tests without changing them in any way), the choice of the most appropriate data structure, the justification for that choice, and the code's legibility. section 3 searching link to website book


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd