Characterize the plays written by Shakespeare

Reference no: EM131291126

Imperative Programming - Stylometrics

Our goal here is to use a rudimentary characterization of authors' uses of words to identify the authors of unknown works. We will use dictionaries and simple statistics (really just ratios) to categorize an author's work by the frequency with which they use their 50 most popular words.

For example, say you wanted to characterize the plays written by Shakespeare and the stories written by Melville. You might choose a large sample of each. For example,

1 For Shakespeare you might choose 3 plays: Macbeth, Othello and All's Well that Ends Well.
2 For Melville you might choose Moby Dick, Bartleby and Omoo.

You can find these texts on the Internet, for instance at https://www.gutenberg.org. Melville's Moby Dick is at https://www.gutenberg.org/ebooks/2701. Because we want to work with plain text, we should use the Plain Text UTF-8 files, e.g. https://www.gutenberg.org/cache/epub/2701/pg2701.txt. You might then characterize these files by some simple statistics. For example, you might characterize the Shakespeare texts by the words that appear at least a certain number of times (as a percentage of the total number of unique words) in the Shakespeare plays but under some percentage in the Melville texts. You will have to experiment to determine these percentages.

Then use these characterizations to decide which of, say, 10 different files contain works of Shakespeare and which contain works of Melville. These 10 works can be found on the Internet and saved as files, say file1.txt, …, file10.txt. See if you can use the characterizations (or vocabulary signatures) in this way to identify authors.

Feel free to modify the parameters of this project so long as you at least try this simple characterization.

You may try additional tasks. For example you might work with a larger set of authors. You might try categorizing scientific articles as to their field or sub-fields.

To characterize authors (at least Shakespeare and Melville) use 3 works. For Melville, use:
- https://www.gutenberg.org/cache/epub/2701/pg2701.txt
- https://www.gutenberg.org/cache/epub/11231/pg11231.txt
- https://www.gutenberg.org/cache/epub/4045/pg4045.txt
For Shakespeare, use:
- https://www.gutenberg.org/cache/epub/2264/pg2264.txt
- https://www.gutenberg.org/cache/epub/2267/pg2267.txt
- https://www.gutenberg.org/cache/epub/1125/pg1125.txt
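One way to fetch the six texts is sketched below. It is only an illustration: the pairing of each URL with a local file name is an assumption (only 2701 is identified in the text as Moby Dick; the rest are inferred from the order in which the works are listed), and the names text_url, SOURCES and download_all are made up for this sketch.

```python
import os
import urllib.request

# URL pattern taken from the links listed above.
BASE = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt"

# ASSUMPTION: which ebook id corresponds to which local file name.
# Only 2701 (Moby Dick) is stated in the text; the others are inferred
# from the order of the lists and should be checked by hand.
SOURCES = {
    "moby.txt": 2701,
    "bartleby.txt": 11231,
    "omoo.txt": 4045,
    "macbeth.txt": 2264,
    "othello.txt": 2267,
    "allswell.txt": 1125,
}


def text_url(ebook_id):
    """Build the plain-text URL for a Project Gutenberg ebook id."""
    return BASE.format(id=ebook_id)


def download_all(directory="."):
    """Download every file in SOURCES into directory (needs network access)."""
    for name, ebook_id in SOURCES.items():
        urllib.request.urlretrieve(text_url(ebook_id), os.path.join(directory, name))
```

Calling download_all() once saves the files in the current directory; nothing is downloaded until the function is called.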

To characterize an author we build a dictionary, one for each author.

1 We read in a large body of work by that author (e.g. 3 works). From this work, we build a dictionary of the work's 50 most frequently used words and their counts (as in wordfreq.py from our handout).

2 We go through the dictionary, replacing each count with a ratio:

- We compute this ratio by dividing the count by the total number of words (we should count them as we process them in (1)). The total number of words includes duplicates; it is the total number of words in the entire body of text that we are characterizing.
- So each of the 50 most popular words in the author's work is mapped to the ratio of the number of uses of that word to the total number of words in the text.
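Steps 1 and 2 could be sketched roughly as follows. This is not the wordfreq.py from the handout; the function names tokenize and word_ratios, and the rule for what counts as a word (lowercase letters with optional internal apostrophes), are assumptions made for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words.

    ASSUMPTION: a word is a run of letters, optionally with internal
    apostrophes (so "all's" stays one word). Other rules are possible.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def word_ratios(text, top_n=50):
    """Map the top_n most frequent words to count / total number of words."""
    words = tokenize(text)
    total = len(words)  # includes duplicates, as step 2 requires
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.most_common(top_n)}


# Tiny demonstration with top_n=2 instead of 50.
sample = "the cat and the dog and the bird"
ratios = word_ratios(sample, top_n=2)
print(ratios)
```

For an author, the text passed in would be all three works together, for example joined with "\n".join(...), so that the total covers the whole body of work.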

Then we define a function identifyAuthor(), such that identifyAuthor(filename), where filename is a string name of a file containing the text we want to identify (e.g. an unknown work by one of the authors), returns either the name of the author who we think wrote the work, or "unknown" if we think none of our authors wrote the work. The function identifyAuthor() should do the following:

1 Read in the work from the named file.
2 Build a dictionary, mapping the work's 50 most frequent words to the ratios, calculated in the same way we did for the authors' works.
3 Compute a difference between this dictionary and the dictionary for each of the authors:

- For each word in the 50-word dictionary for this unknown work, look up the ratio in both this dictionary and that for the author; if the word is not in the author's dictionary, use 0 (zero) as the author's ratio.
- Compute the absolute value of the difference between the two ratios.
- The difference between the dictionary for the unknown work and the dictionary characterizing the author is the sum of these absolute differences over the 50 words.

4 We say that the author of the work is the one whose dictionary is the least different from that for the unknown work.

5 We define some arbitrary cutoff x (a difference) as indicating that none of the authors wrote the unknown work: if the differences between the dictionary for the unknown work and the dictionaries for all of the authors are greater than x, we say the author is unknown.
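Steps 3 to 5 can be illustrated on small hand-made dictionaries. The sketch below assumes the ratio dictionaries have already been built; the names difference and closest_author, the toy ratios and the cutoff values are all invented for the example, and a real cutoff has to come from experiment.

```python
def difference(unknown, author):
    """Sum of |unknown ratio - author ratio| over the unknown work's words.

    Words missing from the author's dictionary count as ratio 0.
    """
    return sum(abs(ratio - author.get(word, 0.0))
               for word, ratio in unknown.items())


def closest_author(unknown, signatures, cutoff):
    """Return the least-different author's name, or "unknown".

    signatures maps an author name to that author's ratio dictionary.
    The result is "unknown" when even the best difference exceeds cutoff.
    """
    if not signatures:
        return "unknown"
    scores = {name: difference(unknown, sig) for name, sig in signatures.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= cutoff else "unknown"


# Toy data (made up): two-word dictionaries instead of 50-word ones.
signatures = {
    "Shakespeare": {"the": 0.03, "thou": 0.01},
    "Melville": {"the": 0.06, "whale": 0.01},
}
unknown = {"the": 0.055, "whale": 0.008}

print(closest_author(unknown, signatures, cutoff=0.02))
print(closest_author(unknown, signatures, cutoff=0.005))
```

Here the difference from the Melville dictionary is about 0.007 and from the Shakespeare dictionary about 0.033, so the first call picks Melville, while the second call's stricter cutoff rejects both authors.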

So that I may test your identification method, make sure you name it identifyAuthor such that identifyAuthor("file") attempts to identify the author who wrote the work in the file named "file", and returns either the string containing the author's name or "unknown".

Experiment as much as possible. Write about your experiments and their results. Show results and discuss them.

You should submit two files to the vault for homework5:

1 memo.txt -- This will contain a (plain-text) narrative explaining the design of your solution, how you experimented in coming up with ratios and cut-offs for identifying authors, and the results of test runs.

2 stylometrics.py -- your Python program that implements your solution, defining identifyAuthor("file") and any helper functions you need. Don't forget your docstrings!

Important:

Your program should read moby.txt, bartleby.txt, and omoo.txt to build a characterization of Melville.

Your program should read macbeth.txt, othello.txt, and allswell.txt to build a characterization of Shakespeare.

These six files will be in my test directory, so all you need to submit is your memo.txt and stylometrics.py.

You must ensure the names are exactly correct; that's part of your assignment.

You may experiment with other files, and you should run tests, but be sure to comment all of that experimenting out.

I will simply execute identifyAuthor("some file name") a couple of times.
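Putting the pieces together, a stylometrics.py might be organized along the following lines. This is a sketch under stated assumptions, not a reference solution: the helper names, the word-splitting rule, the optional signatures and cutoff parameters, and above all the CUTOFF value are placeholders, and the cutoff must be tuned by experiment.

```python
import os
import re
from collections import Counter

# File names are the ones the assignment requires.
AUTHOR_FILES = {
    "Melville": ["moby.txt", "bartleby.txt", "omoo.txt"],
    "Shakespeare": ["macbeth.txt", "othello.txt", "allswell.txt"],
}
CUTOFF = 0.25  # ASSUMPTION: arbitrary placeholder, to be tuned by experiment


def _words(path):
    """Read a file and return its lowercase words (assumed word rule)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return re.findall(r"[a-z]+(?:'[a-z]+)*", handle.read().lower())


def _ratios(words, top_n=50):
    """Map the top_n most frequent words to count / total words."""
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).most_common(top_n)}


def build_signatures(directory="."):
    """Build one ratio dictionary per author from that author's three files."""
    signatures = {}
    for author, names in AUTHOR_FILES.items():
        words = []
        for name in names:
            words.extend(_words(os.path.join(directory, name)))
        signatures[author] = _ratios(words)
    return signatures


def identifyAuthor(filename, signatures=None, cutoff=CUTOFF):
    """Return the most likely author's name, or "unknown"."""
    if signatures is None:
        signatures = build_signatures()
    unknown = _ratios(_words(filename))
    scores = {
        author: sum(abs(r - sig.get(w, 0.0)) for w, r in unknown.items())
        for author, sig in signatures.items()
    }
    best = min(scores, key=scores.get)
    return best if scores[best] <= cutoff else "unknown"
```

With the six files in the working directory, identifyAuthor("file1.txt") works with a single argument, as required. Two caveats: this sketch rebuilds the author dictionaries on every call unless signatures is passed in, and it does not strip the Project Gutenberg header and license text that surrounds each work, which may be worth experimenting with.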
