Compute a distribution over k topics

Assignment Help Python Programming
Reference no: EM132398862

Assignment

Below is the algorithm details.

You are not required to implement the PLSA algorithm with a background model (we will run tests assuming the background model has not been implemented). You are provided with two data sets in the data folder: test.txt and dblp.txt. You can assume that each line is a separate document.

The test.txt data set contains 1000 lines. Each line is a document. The first 100 documents are labeled with the topic the document belongs to, either 0 (“Seattle”) or 1 (“Chicago”).  Each of the first 100 document is generated by sampling completely from the topic that is labelled (i.e., generated from one topic only). The rest 900 documents are generated from a mixture model of the topic “Seattle” and “Chicago”. Run your code to test if your EM algorithm returns reasonable results.

The DBLP.txt data set is made up of research paper abstracts from DBLP. Each line is a document. Make sure to test your code on the simpler data set test.txt before running it on DBLP.txt.

You are provided with a code skeleton plsa.py. Do not change anything in the def __init__() function. But feel free to change anything in the main() function to test your code.

Writing the PLSA Algorithm:

The original PLSA model does not contain a background model. This MP also is based on the original PLSA model, you do not have to worry about the background model. However, lectures are all about PLSA with a background model, so you should not attempt to map the formulas on lecture slides directly to the code. Instead, you would need to adjust the formulas on slides by assuming that there is zero probability that the background model would be chosen. 

That is, you should set λB to zero. If you do this, you will see that the E-step would essentially only compute a distribution over k topics for z=1, ..., k, given each word, i.e., p(z=i|w). The M-step would also be simpler as p(Z=B|w) is also zero for all words (due to the fact that λB=0). If you are still confused, please take a look at Section 3 of Chase Geigle's note on EM [2] and the note below.

The main data structures involved in the implementation of this EM algorithm are three matrices:

T (topics by words): this is the set of parameters characterizing topic content that we denoted by θi's. Each element is the probability of a particular word in a particular topic.

D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denoted by pij's. Each element is the probability of a particular topic is covered in a particular document.

Z (hidden variables):  For every document, we need one Z which represents the probability that each word in the document has been generated from a particular topic, so for any document, this is a "word-by-topic" matrix, encoding p(Z|w) for a particular document. Z is the matrix that we compute in the E-step (based on matrices T and D, which represent our parameters).

Note that we need to compute a different Z for each document, so we need to allocate a matrix Z for every document. If we do so, the M-step is simply to use all these Z matrices together with word counts in each document to re-estimate all the parameters, i.e., updating matrices T and D based on Z. Thus at a high level, this is what's happening in the algorithm:

T and D are initialized.

E-step computes all Z's based on T and D.

M-step uses all Z's to update T and D.

We iterate until the likelihood doesn't change much when we would use T and D as our output. Note that Zs are also very useful (can you imagine some applications of Zs?).

Attachment:- PLSA.rar

Reference no: EM132398862

Questions Cloud

Company tremendous competitive advantage : The decision to adopt or bypass a technology can give a company a tremendous competitive advantage or remove a company from the market completely.
Write a reflection paper on developed countries : write a reflection paper on developed countries in their development efforts and China and India's experiences with their development efforts
Standard performance measures are way of collecting data : Standard performance measures are a way of collecting data across similar functions, processes, costs, and providers.
Local variables of that function : On the x86-64, the arguments to a function are stored at higher addresses than the local variables of that function.
Compute a distribution over k topics : We need one Z which represents the probability that each word in the document has been generated from a particular topic, so for any document.
Discuss configuration control in the lan environment : Why would change control on network devices such as routers and switches be as vital as other more 'dynamic' devices such as servers in the LAN?
Write imaginary case study for your hypothetical patient : Write an imaginary case study for your hypothetical patient. Explain how the patient moved through the healthcare delivery system.
Identify different types of strategic change programmes : Critically analyse if the Resource-Based View (RBV) model of strategy can achieve sustainable competitiveness for an organisation
Comparison between two letters is based on alphabetical : Suppose we had a binary search tree where each node's value is a letter of the alphabet, and the comparison between two letters is based on alphabetical order.

Reviews

Write a Review

Python Programming Questions & Answers

  Write a python program to implement the diff command

Without using the system() function to call any bash commands, write a python program that will implement a simple version of the diff command.

  Write a program for checking a circle

Write a program for checking a circle program must either print "is a circle: YES" or "is a circle: NO", appropriately.

  Prepare a python program

Prepare a Python program which evaluates how many stuck numbers there are in a range of integers. The range will be input as two command-line arguments.

  Python atm program to enter account number

Write a simple Python ATM program. Ask user to enter their account number, and print their initail balance. (Just make one up). Ask them if they wish to make deposit or withdrawal.

  Python function to calculate two roots

Write a Python function main() to calculate two roots. You must input a,b and c from keyboard, and then print two roots. Suppose the discriminant D= b2-4ac is positive.

  Design program that asks user to enter amount in python

IN Python Design a program that asks the user to enter the amount that he or she has budget in a month. A loop should then prompt the user to enter his or her expenses for the month.

  Write python program which imports three dictionaries

Write a Python program called hours.py which imports three dictionaries, and uses the data in them to calculate how many hours each person has spent in the lab.

  Write python program to create factors of numbers

Write down a python program which takes two numbers and creates the factors of both numbers and displays the greatest common factor.

  Email spam filter

Analyze the emails and predict whether the mail is a spam or not a spam - Create a training file and copy the text of several mails and spams in to it And create a test set identical to the training set but with different examples.

  Improve the readability and structural design of the code

Improve the readability and structural design of the code by improving the function names, variables, and loops, as well as whitespace. Move functions close to related functions or blocks of code related to your organised code.

  Create a simple and responsive gui

Please use primarily PHP or Python to solve the exercise and create a simple and responsive GUI, using HTML, CSS and JavaScript.Do not use a database.

  The program is to print the time

The program is to print the time in seconds that the iterative version takes, the time in seconds that the recursive version takes, and the difference between the times.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd