COMP9318 (16S2) Project: Implement a basic initial NER classifier

Reference no: EM131224193

1. Objective: Named Entity Recognition

In this project, you will use scikit-learn and Python 3 to engineer an effective classifier for an important Information Extraction task called Named Entity Recognition.

2. Getting Started

Named Entity Recognition. The goal of Named Entity Recognition (NER) is to locate segments of text in an input document and classify them into one of a set of pre-defined categories (e.g., PERSON, LOCATION, and ORGANIZATION). In this project, you only need to perform NER for a single category, TITLE. We define a TITLE as an appellation associated with a person by virtue of occupation, office, birth, or as an honorific. For example, in the following sentence, both Prime Minister and MP are TITLEs.

Prime Minister Malcolm Turnbull MP visited UNSW yesterday.

Formulating NER as a Classification Problem. We can cast the TITLE NER problem as a binary classification problem as follows: for each token w in the input document, we construct a feature vector x, and classify it into one of two classes, TITLE or O (meaning other), using the logistic regression classifier.

The goal of the project is to achieve the best classification accuracy by configuring your classifier and performing feature engineering.
In this project, you must use the LogisticRegression classifier from scikit-learn version 0.17.1 in your implementation.

3. Your Tasks

Training and Testing Dataset. We will publish a training dataset on the project page to help you build the classifier. The training dataset contains a set of sentences, where each sentence is already tokenized into tokens. We also provide the part-of-speech tag and the correct named entity class (i.e., "TITLE" or "O") for each token. The data structure in python for the above sentence is:

[[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'), ('MP', 'NNP', 'TITLE'),
('visited', 'VBD', 'O'), ('UNSW', 'NNP', 'O'),
('yesterday', 'NN', 'O'), ('.', '.', 'O')]]

Therefore, the training dataset is formed as a list of sentences. Each sentence is a list of 3-tuples, each containing the token itself, its POS tag, and its named entity class.

In python, you can use the following code to load the training dataset (where training_data is the path to the provided training dataset file):

import pickle

with open(training_data, 'rb') as f:
    training_set = pickle.load(f)
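Once loaded, one common first step (a sketch, not part of the spec) is to flatten the list of sentences into parallel token/POS/label lists suitable for per-token classification:

```python
# A tiny in-memory stand-in for the pickled training file;
# in practice training_set comes from pickle.load as shown above.
training_set = [[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
                 ('Malcolm', 'NNP', 'O')]]

# Flatten the list of sentences into parallel token/POS/label lists.
tokens, pos_tags, labels = [], [], []
for sentence in training_set:
    for token, pos, label in sentence:
        tokens.append(token)
        pos_tags.append(pos)
        labels.append(label)

print(tokens)   # ['Prime', 'Minister', 'Malcolm']
print(labels)   # ['TITLE', 'TITLE', 'O']
```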

The final test dataset (which will not be provided to you) will be formed similarly to the training dataset, except each sentence will be a list of 2-tuples (only the token itself and its POS tag). For example, if the test dataset contains only one sentence, "Prime Minister Malcolm Turnbull MP visited UNSW yesterday.", then it will be formatted as:

[[('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP'),
('Turnbull', 'NNP'), ('MP', 'NNP'), ('visited', 'VBD'),
('UNSW', 'NNP'), ('yesterday', 'NN'), ('.', '.')]]

Feature Engineering. To build the classifier, you first need to extract features for each token. In this project, using the token itself as a feature usually achieves high accuracy on the training dataset; however, it can result in low accuracy on the test dataset due to overfitting. For example, it is not uncommon for the test dataset to contain titles that do not appear in the training dataset. Therefore, we encourage you to find and engineer meaningful, strong features and build a more robust classifier.
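As a sketch of what such features might look like — the particular features below (case shape, affixes, neighbouring tokens) are only illustrative suggestions, not a required set:

```python
def extract_features(sentence, i):
    """Build a feature dict for the i-th token of a sentence.

    Works for both training sentences (3-tuples) and test sentences
    (2-tuples), since only the first two fields are read.
    """
    token, pos = sentence[i][0], sentence[i][1]
    return {
        'pos': pos,
        'is_title_case': token.istitle(),
        'is_upper': token.isupper(),
        'prefix3': token[:3].lower(),
        'suffix3': token[-3:].lower(),
        'prev_token': sentence[i - 1][0].lower() if i > 0 else '<S>',
        'next_token': sentence[i + 1][0].lower() if i < len(sentence) - 1 else '</S>',
    }

sent = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP')]
print(extract_features(sent, 1)['prev_token'])  # 'prime'
```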

You will need to describe all the features that you have used in this project and justify your choices in your report.

Build the logistic regression classifier. In this project, you need to use the logistic regression classifier (i.e., sklearn.linear_model.LogisticRegression) from the scikit-learn package. For more details, refer to the scikit-learn documentation and the relevant course materials.
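One common way to connect per-token feature dicts to LogisticRegression is via a DictVectorizer. The feature dicts and labels below are made up purely to illustrate the plumbing; default LogisticRegression parameters are assumed:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy per-token feature dicts and labels; real features come from
# your own feature-engineering step.
X_dicts = [{'pos': 'NNP', 'is_title_case': True},
           {'pos': 'NNP', 'is_title_case': True},
           {'pos': 'VBD', 'is_title_case': False},
           {'pos': 'NN', 'is_title_case': False}]
y = ['TITLE', 'TITLE', 'O', 'O']

# DictVectorizer one-hot encodes string values (e.g. pos=NNP) and
# passes numeric/boolean values through as-is.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(X_dicts)

clf = LogisticRegression()
clf.fit(X, y)

# Transform (not fit_transform) keeps the feature space fixed at
# prediction time.
pred = clf.predict(vectorizer.transform([{'pos': 'NNP', 'is_title_case': True}]))
print(pred[0])  # 'TITLE'
```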

You also need to dump the trained classifier using the following code (where classifier is the trained classifier and classifier_path is the path of the output file):

with open(classifier_path, 'wb') as f:
    pickle.dump(classifier, f)

You are also required to submit the dumped classifier (which must be named classifier.dat).

Suggested Steps. You may want to implement a basic, initial NER classifier first. Then you can improve its performance in multiple ways. For example, you can find the best hyper-parameter settings for your model and features, or design and test different sets of features. It is recommended that you use cross validation and construct your own test datasets.
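Cross validation over the training data can be sketched as below (toy features again; note that in scikit-learn 0.17.1 cross_val_score lives in sklearn.cross_validation, whereas newer versions moved it to sklearn.model_selection):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
# In scikit-learn 0.17.1 this import is:
#   from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score

# Toy per-token features and labels standing in for the real data.
X_dicts = [{'pos': 'NNP'}, {'pos': 'NNP'}, {'pos': 'VBD'},
           {'pos': 'NN'}, {'pos': 'NNP'}, {'pos': 'VBD'}]
y = ['TITLE', 'TITLE', 'O', 'O', 'TITLE', 'O']

X = DictVectorizer().fit_transform(X_dicts)

# 3-fold cross validation, scored with macro-averaged F1.
scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring='f1_macro')
print(scores.mean())
```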

You need to describe how you have improved the performance of your classifier in your report.

Trainer. You need to submit a python file named trainer.py. It receives two command-line arguments which specify the path to the training dataset and the path to a file where your trained classifier will be dumped. Your program will be executed as below:
python trainer.py <path_to_training_data> <path_to_classifier>
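A minimal skeleton of this wiring might look as follows. The token/POS features and the choice to pickle the DictVectorizer together with the classifier are illustrative assumptions, not requirements of the spec (the spec shows dumping the classifier alone; however your tester must be able to reproduce the same feature space somehow):

```python
# trainer.py -- minimal sketch of the expected command-line wiring.
import pickle
import sys

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def main(training_path, classifier_path):
    with open(training_path, 'rb') as f:
        training_set = pickle.load(f)

    # Deliberately simplistic features (lowercased token + POS tag);
    # a real submission should use richer, more robust features.
    X_dicts, y = [], []
    for sentence in training_set:
        for token, pos, label in sentence:
            X_dicts.append({'token': token.lower(), 'pos': pos})
            y.append(label)

    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(X_dicts)
    classifier = LogisticRegression()
    classifier.fit(X, y)

    # Dump the vectorizer together with the classifier so tester.py
    # can reproduce the same feature space.
    with open(classifier_path, 'wb') as f:
        pickle.dump((vectorizer, classifier), f)


# The length check lets the file be imported or run without arguments.
if __name__ == '__main__' and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```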

Tester. You need to submit a python file named tester.py. It receives three command-line arguments which specify the path to the testing dataset, the path to the dumped classifier, and the path to a file where your output results will be dumped. Your program will be executed as below:

python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>

For each token in the testing dataset, you should output its named entity class (i.e., TITLE or O). Your output, internally in python, is a list of sentences, where each sentence is a list of (TOKEN, CLASS) tuples. For example, a possible output for the example in Section 3.1 is:
[[('Prime', 'TITLE'), ('Minister', 'TITLE'), ('Malcolm', 'O'),
('Turnbull', 'O'), ('MP', 'O'), ('visited', 'O'), ('UNSW', 'O'),
('yesterday', 'O'), ('.', 'O')]]

Then you should dump it to a file using the following code (where result is your result list and path_to_results is the path of the dumped file):
with open(path_to_results, 'wb') as f:
    pickle.dump(result, f)
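A minimal sketch of tester.py follows. It assumes the classifier file holds a (DictVectorizer, LogisticRegression) pair and that features are the lowercased token plus its POS tag; adapt both assumptions to whatever your trainer.py actually dumps:

```python
# tester.py -- minimal sketch of the expected command-line wiring.
import pickle
import sys


def main(testing_path, classifier_path, results_path):
    with open(testing_path, 'rb') as f:
        testing_set = pickle.load(f)
    # Assumes trainer.py dumped a (vectorizer, classifier) pair.
    with open(classifier_path, 'rb') as f:
        vectorizer, classifier = pickle.load(f)

    result = []
    for sentence in testing_set:
        # Same feature extraction as at training time.
        X = vectorizer.transform(
            [{'token': token.lower(), 'pos': pos} for token, pos in sentence])
        predicted = classifier.predict(X)
        result.append([(token, cls)
                       for (token, _), cls in zip(sentence, predicted)])

    with open(results_path, 'wb') as f:
        pickle.dump(result, f)


# The length check lets the file be imported or run without arguments.
if __name__ == '__main__' and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```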

Report. You need to submit a report (named report.pdf) which answers the following two questions:

- What features do you use in your classifier? Why are they important and what information do you expect them to capture?
- How did you experiment with and improve your classifier?

4. Execution
Your program will be tested automatically on a CSE Linux machine as follows:

- the following command will be used to build the classifier:
python trainer.py <path_to_training_data> <path_to_classifier>
where

- path_to_training_data indicates the path to the training dataset
- path_to_classifier indicates the path to the dumped classifier
- the following command will be used to test the classifier:
python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>

where

- path_to_testing_data indicates the path to the testing dataset
- path_to_classifier indicates the path to the dumped classifier
- path_to_results indicates the path to the dumped result

Your program will be executed using python 3 with only the following packages available:
- nltk 3.2.1
- numpy 1.11.1
- scipy 0.18.0
- scikit-learn 0.17.1
- pandas 0.18.1

5. Evaluation

We will use the F1-measure to evaluate the performance of your classifier. We will test your classifier using two test datasets:
- the first one is sampled from the same domain as the given training dataset, and
- the second one is sampled from a domain different from that of the given training dataset.
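For reference, sklearn.metrics.f1_score computes the measure used here; the gold labels and predictions below are made up purely to show the arithmetic:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for a handful of tokens.
gold = ['TITLE', 'TITLE', 'O', 'O', 'O', 'TITLE']
pred = ['TITLE', 'O',     'O', 'O', 'O', 'TITLE']

# F1 on the TITLE class: precision = 2/2, recall = 2/3 -> F1 = 0.8.
score = f1_score(gold, pred, pos_label='TITLE')
print(round(score, 2))  # 0.8
```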

Each test will contribute 40 points to your final score. Your report contributes the remaining 20 points.

In order to minimize the effect of randomness, we will execute trainer.py three times, and use the best performance achieved.

For each test dataset, your best-performing classifier will be compared with a reference classifier C. While the detailed scheme will be published later on the project website, the marking scheme will essentially reward you with more points if your classifier performs no worse than the reference classifier.

Neither the reference classifier C nor the two test datasets will be given to you. However, you can create your own test dataset, submit it, and get the performance of the reference classifier C on it. For more details, please refer to Section 6.

6. Customized Datasets

You are encouraged to construct your own test datasets to test your classifier and help improve its performance. You can submit your datasets to get the F1-measure of the pre-trained classifier C on them. Furthermore, all submitted datasets are accessible to every student in the class. You should also benefit from this experience, as it will give you many initial ideas about meaningful features to use in your own classifier. You can upload your test cases through the data submission website (the URL will be sent via email). You will need to log in before uploading your datasets or downloading other datasets. The id is your student number (e.g., z1234567) and your password will be sent to you by email.

Once you have logged into the system, you can:
- submit your dataset
- view the performance of all the datasets
- download a chosen dataset

Your submitted test datasets should be in the same format as the training dataset. You can submit at most 10 datasets to the system. We will release more details about how to use the system, along with some tools/tips to help you visualize and tag TITLE occurrences, later on the project website.


