Explore the effectiveness of this technique for task

Assignment Help Computer Engineering
Reference no: EM131141884

Quotations from the opposing candidates in the 2004

Presidential Debates:

I proposed a constitutional amendment. The reason I did so was because I was worried that activist judges are actually defining the definition of marriage, and the surest way to protect marriage between a man and woman is to amend the Constitution.

I believe marriage is between a man and a woman. But I also believe that because we are the United States of America, we're a country with a great, unbelievable Constitution, with rights that we afford people, that you can't discriminate in the workplace.

In this assignment, your task will be to write a computer program that can (usually) predict whether a quote such as these are more likely to have been said by George W. Bush or John F. Kerry. There are many reasonable approaches to this problem. In this assignment, we will consider one of these: We will begin by building a statistical model of the kind of language used by each of the candidates. Then, given a new piece of text, we will determine who said it by measuring how well it fits each of the statistical models, and selecting the one that gives the best fit.

Markov models. The probabilistic language models we will be using are called Markov models. Markov models are at the heart of modern speech recognition systems, and are used for a broad range of natural language processing tasks, as well as many other problems relevant to artificial intelligence. They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.

A Markov model defines a probability distribution over sequences of symbols. Said differently, a Markov model defines a probabilistic mechanism for randomly generating sequences over some alphabet of symbols. In a zero-th order Markov model, the symbols in the sequence are generated with a fixed probability that does not in any way depend on other symbols in the sequence. For instance, the alphabet might consist of just the letters a and b, and the zero-th order Markov model might specify that each symbol in the sequence is a with probability 2/3 and b with probability 1/3. Thus, such a zero-th order Markov model might generate a sequence like the following:

a b a b a a a b b b a a a a b b b a a b a a a a a b a a a a b a a a b b

In a first order Markov model, the probability of the current symbol that is being generated can depend on preceding symbol. For instance, consider a sequence in which every occurrence of b is always followed by a, while an a is followed with equal probability by either a or b. Such a model might generate a sequence like the following.

b a b a b a b a a b a b a b a b a a b a a b a a a a a a b a b a a a a b

In this example, the sequence is likely to have about the same proportion of a's and b's as in the preceding example. However, the model is first order rather than zero-th order because whether or not the current symbol is a or b significantly affects the probability of what symbol will be generated next.

This idea can be generalized to a k-th order Markov model in which the probability of the current symbol can depend on the preceding k symbols.

Markov models are frequently used to build a probabilistic model or statistical estimate of a natural language. In modeling ordinary English text, a zero-th order model is clearly inadequate for most purposes, capturing only the frequencies of each letter. On the other hand, a first order Markov model can capture the notion, for instance, that a q is nearly always followed by u. Higher order Markov models can capture the tendency of the letters proba to be followed by b or t. (In other language applications, entire words are regarded as the "letters" of the "alphabet" so that the Markov model provides a model of the sequences of words rather than letters that are likely to be observed. We will not consider such a framework in this assignment, but you are welcome to explore it on your own.)

Language modeling. The first part of the assignment is to implement a class called MarkovModel.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(p·c) / N(p), where N(p·c) denotes the number of times we observe the concatenated sequencep·c, and N(p) denotes the number of times that we observe p.

Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(p·c) + 1) / (N(p) + S), where S is the size of our alphabet.

This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.

For example, if the input text is "aabcabaacaac", and we are using a second-order Markov model (k = 2) and the three-letter alphabet is {a, b, c}, then we can estimate the probability that "aa" is followed by'c'; by (2 + 1) / (3 + 3) = 1/2 since N("aa") = 3, N("aac") = 2 and S = 3. Similarly, the probabilities that "aa" is followed by 'a' and 'b' is 1/6 and 1/3, respectively. Note that the probabilities sum to 1.
One thing to notice here: To handle the beginning and end of the string, we treat the string as circular. Thus, N("caa") = 2 instead of 1.
So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N(·) by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use. You should also compute and record the alphabet size S. Organize your program by creating a data type MarkovModel with the following API:

public class MarkovModel { public MarkovModel(int k, String corpus) // build an order-k Markov model from corpus public double laplace(String s) // return laplace-smoothed probability estimate of s public String toString() // return a string representation of this model }

Now, the first line below computes the number of occurrences of each substring from s of size k and k+1; the second line prints it to standard output; and remaining lines compute and print Laplace-smoothed probability estimates.

MarkovModel model = new MarkovModel(k, corpus); StdOut.println(model); StdOut.printf("%.4f\n", model.laplace("aac")); StdOut.printf("%.4f\n", model.laplace("aaa")); StdOut.printf("%.4f\n", model.laplace("aab"));

For example, whenk is 2 andcorpus is"aabcabaacaac", this code fragment should print the following:

alphabet size S = 3 "aa" 3 "ab" 2 "ac" 2 "ba" 1 "bc" 1 "ca" 3 "aab" 1 "aac" 2 "aba" 1 "abc" 1 "aca" 2 "baa" 1 "bca" 1 "caa" 2 "cab" 1 0.5000 0.1667 0.3333

Determining the likelihood of a new sequence. Next, we will want to use the Markov model that you constructed from data to compute a measure of how well that model fits a new text sequence. Later, we will use this measure to determine which of two models (one for Bush and one for Kerry) better fits a test sequence. To compute a measure of fit, we will compute the probability of the model generating the new sequence, a quantity usually called the likelihood of the sequence under the model. For each symbol c in the sequence, we can compute the probability of observing c under the model, given its k-letter context p. In particular, this is just the Laplace-smoothed estimate given above. To compute the likelihood, or probability of the entire sequence, we can multiply these probabilities together, for all symbols in the sequence. Although correct mathematically, the resulting product might be indistinguishable from zero using floating point arithmetic. Therefore, rather than multiplying probabilities, you will need to work with log probabilities. That is, for each symbol c in the sequence, you will need to compute the log of its probability, and you will then add (rather than multiply) these log probabilities to arrive at the final log likelihood of the sequence. Keep in mind that all of these log probabilities will be negative (since probabilities are never more than one).

Your second job is to compute, for a given string, and for each position i in the string, an estimate of the log probability of the character at position i, given its k-character context. The log likelihood of this entire sequence can then be computed as the sum of these log probabilities.

For instance, continuing the example above, having constructed the second order (k = 2) Markov model, suppose we are given a new string "aabca". We can compute the log probability of each of the five symbols in this string as follows (where the first column gives the context of each of the symbols in the second column):

context c log probability ------------------------------------------------ "aa" 'b' log((1 + 1) / (3 + 3)) = -1.0986 "ab" 'c' log((1 + 1) / (2 + 3)) = -0.9163 "bc" 'a' log((1 + 1) / (1 + 3)) = -0.6931 "ca" 'a' log((2 + 1) / (3 + 3)) = -0.6931 "aa" 'a' log((0 + 1) / (3 + 3)) = -1.7918 TOTAL log likelihood = -5.1930 AVERAGE log likelihood = -1.0386

One thing to notice here: To handle the beginning and end of the string, we again treat the string as circular. Thus, in "aabca", the character following the context "ca" is 'a'.

Choosing the best model. Now that we have a method of measuring the fit of a model to a particular string, we can choose the best model to be the one that maximizes the likelihood of the string. In other words, we can simply measure the likelihood of the string under each model (Bush or Kerry) and choose the one that is greatest.

Write a program TopModel.java as follows: its first argument is the order parameter k; its next two arguments are the names of the files containing excerpts from the two candidates (or other text that you might choose to experiment with). The remaining arguments are the names of the files containing text sequences you wish to classify. Your program should build models for each of the two candidates, read in each text sequence, and classify it using the output format described below.

Input format. Each file is a sequence of Unicode characters. We provide data files from the three presidential debates. You should train using the first two debates and test on the third. Read the files using the In.java library that we provide.

In in = new In("filename.txt"); String corpus = in.readAll();

For amusement, you might also see how your Bush/Kerry models predict when attempting to classify quotes from John Edwards and Dick Cheney taken from the vice presidential debate. Or you might use the data from the Obama/McCain and Biden/Palin debates.

Output format. For each text sequence, your program should output the filename of the text block and its average likelihood under each model (that is, the likelihood as computed above, divided by the length of the sequence-this is simply to prevent unusually long sequences from seeming more significant than they are). Also, print the difference of these two numbers; the sign of the difference will indicate which model is more likely, and the absolute value of the difference can be interpreted as a measure of confidence in this prediction.

To get an idea of why the algorithm is making its predictions, you should also find and print the ten positions in the test sequence where the difference in the log probabilities of the two models are greatest. That is, if the log probability of character i (given its context) under model j is lj[i], then you should find the 10 indices i for which |l1[i] - l2[i]| is greatest. For each of these, print i's context, characteri, l1[i], l2[i] and l1[i] - l2[i]. For instance, you might get output like this:

%java TopModel 2 bush1+2.txt kerry1+2.txt bush3-00.txt bush3-01.txt bush3-00.txt -2.1131 -2.1895 +0.0764 "spr" -2.715 -4.963 +2.247 "eek" -5.617 -3.457 -2.160 "eek" -5.617 -3.457 -2.160 "ari" -6.011 -3.901 -2.110 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "ban" -2.338 -4.174 +1.836 "sig" -2.744 -4.514 +1.769 "fea" -2.645 -4.401 +1.756 bush3-01.txt -2.1467 -2.1607 +0.0140 "Of " -2.351 -4.331 +1.979 " Of" -2.543 -4.382 +1.839 " My" -1.674 -3.308 +1.634 "My " -1.376 -2.983 +1.607 "sal" -4.956 -3.418 -1.538 "sal" -4.956 -3.418 -1.538 "nen" -2.275 -3.703 +1.428 " Go" -2.970 -4.357 +1.386 "k I" -5.380 -4.003 -1.377 "dis" -1.946 -3.292 +1.346

The first line indicates that for the file bush3-00.txt, the average likelihood under the model built using bush1+2.txt was -2.1131, and -2.1895 under the kerry1+2.txt model. The difference of these numbers is +0.0764, which, being positive, indicates an overall prediction that this quote was uttered by Bush. The next line indicates that the most significant difference in log probabilities between the two models came in predicting 'r' under context "sp" where the Bush log probability was -2.715 and the Kerry log probability was -4.963, the difference being +2.247.

When printing characters from the text, you should convert all white space to ordinary spaces.

Analysis. Analyze your approach to this problem giving estimates of its time and space requirements. Also, critically explore the effectiveness of this technique for this task.

Submission. Submit TopModel.java, MarkovModel.java, and any other files needed by your program (excluding those in stdlib.jar and adt.jar). Finally, submit a readme.txt file and answer the questions

Reference no: EM131141884

Questions Cloud

Discuss why avoiding dietary carbs might be ill-advised : Analyze the roles of carbohydrates and fiber and their importance to our health. Discuss why avoiding dietary carbs might be ill-advised. Evaluate current recommendations for carbohydrate and fiber intake.
What are the limitations of operant conditioning : The use of reward and punishment is an everyday human behavior. Which do you think is more effective for the child: reinforcement or punishment? What are the limitations of operant conditioning (reward and punishment)
Describe thought process about making an advance directive : Describe your thought process about making an advance directive. Explain your decision and note any barriers or resistance you may encounter when making such a directive
Determine the bending moments produced at the joints : Determine the bending moments produced at the joints of the rigidly jointed frame shown in Figure S7.7by the indicated loads. The relative EI / L values are shown ringed alongside the members.
Explore the effectiveness of this technique for task : Analyze your approach to problem giving estimates of its time and space requirements. Also, critically explore the effectiveness of this technique for this task.
Determine certain human behavioral characteristics : For this paper, you will have to evaluate your beliefs with respect to ONE of the following claims: Genes alone can determine certain human behavioral characteristics
Why do we need protein and amino acids : Design three personalized recommendations on how Mrs. Miller can cut down on saturated and trans-fats and increase consumption of healthier fats. Include three sources of protein-rich foods, which will not contradict the fat recommendations you s..
Discuss a bank financial management related study : You are required to write a report to comprehensively and critically discuss a bank financial management related study. In choosing a topic, you could consider the Risks of financial institutions.
What kind of pain killing effect can related to acupuncture : Acupuncture may have some pain-killing effects which can be related to some kind of pain experienced in the body. What kind of pain killing effect can be related to Acupuncture

Reviews

Write a Review

 

Computer Engineering Questions & Answers

  Write down an applet that reads in the size of the side

Write down an applet that reads in the size of the side of a square and displays a hollow square of that size out of asterisks, by using the drawString method inside your applet's paint method. Use an input dialog to read the size from the user. Y..

  How many bits are required for each physical address

A process of a system has a virtual memory space of 2048 words or addresses with a page size of 64 words. There are 32 blocks in the main memory with 16 words per block. How many bits are required for each physical address?

  Imagine that the array is unsorted

Write down a method, removeAll, that takes three parameters: an array of integers, the length of the array, and an integer, say, removeItem. The method should find and delete all occurrences of removeItem in the array.

  Explain a mitigating strategy for the risks

Briefly identify and describe a mitigating strategy for the risks.

  Write down a for loop which will display all numbers that

write a for loop that will display all numbers that are multiples of 10 from 0 to 100 and then display the average of

  Provide the requested after values

Suppose the before values when the given instructions is executed. Provide the requested after values:

  Mutual authentication and key establishment protocol

Assume a server-assisted mutual authentication and key establishment protocol. Suppose that Alice and the Server share the pairwise symmetric key KAS; however Bob and Server share a pairwise symmetric key KBS.

  Find an ordered pair that is a solution

Use the indicated value for the given equation to find an ordered pair that is a solution 3x+y=27. Let x=7 the solution of the equation is 3x+y+27 is. Type an ordered pair.

  Suggestions for ensuring an adequate change control

What are two suggestions for ensuring the adequate change control on projects that involve outside contracts.

  Management tools for microsoft windows server

Based on the Kercher (2009) article, how can management tools for Microsoft Windows Server 2008 automate user account administration

  Give a new technology is perceived to be superior

Relative benefit is the degree to that a new technology is perceived to be superior to the current technology.

  How would go about preventing unwanted eyes

the Director of Human Resources has a file documenting the pay rates of all employees. How will you prevent users on the network from actually being able to see the file on that particular workstation.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd