COMP9318 (16S2) Project: Implement a Basic Initial NER Classifier

Reference no: EM131224193

1. Objective: Named Entity Recognition

In this project, you will use scikit-learn and Python 3 to engineer an effective classifier for an important Information Extraction task called Named Entity Recognition.

2. Getting Started

Named Entity Recognition. The goal of Named Entity Recognition (NER) is to locate segments of text in an input document and classify them into one of the pre-defined categories (e.g., PERSON, LOCATION, and ORGANIZATION). In this project, you only need to perform NER for a single category, TITLE. We define a TITLE as an appellation associated with a person by virtue of occupation, office, birth, or as an honorific. For example, in the following sentence, both Prime Minister and MP are TITLEs.

Prime Minister Malcolm Turnbull MP visited UNSW yesterday.

Formulating NER as a Classification Problem. We can cast the TITLE NER problem as a binary classification problem as follows: for each token w in the input document, we construct a feature vector x and classify it into one of two classes, TITLE or O (meaning other), using the logistic regression classifier.

The goal of the project is to achieve the best classification accuracy by configuring your classifier and performing feature engineering.
In this project, you must use the LogisticRegression classifier from scikit-learn version 0.17.1 in your implementation.

3. Your Tasks

Training and Testing Dataset. We will publish a training dataset on the project page to help you build the classifier. The training dataset contains a set of sentences, where each sentence is already tokenized into tokens. We also provide the part-of-speech tag (e.g., NNP) and the correct named entity class (i.e., "TITLE" or "O") for each token. The data structure in Python for the above sentence is:

[[('Prime', 'NNP', 'TITLE'), ('Minister', 'NNP', 'TITLE'),
('Malcolm', 'NNP', 'O'), ('Turnbull', 'NNP', 'O'), ('MP', 'NNP', 'TITLE'),
('visited', 'VBD', 'O'), ('UNSW', 'NNP', 'O'),
('yesterday', 'NN', 'O'), ('.', '.', 'O')]]

Therefore, the training dataset is formed as a list of sentences, and each sentence is formed as a list of 3-tuples, each containing the token itself, its POS tag, and its named entity class.

In Python, you can use the following code to load the training dataset (where training_data is the path of the provided training dataset file):

import pickle

with open(training_data, 'rb') as f:
    training_set = pickle.load(f)
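Since the classifier operates at the token level, the loaded sentence-level list usually has to be flattened into per-token examples first. A minimal sketch (function and variable names are illustrative):

```python
# Flatten the sentence-level training set into per-token examples,
# which is the shape a token-level classifier expects.
def flatten(training_set):
    tokens, labels = [], []
    for sentence in training_set:
        for token, pos, cls in sentence:
            tokens.append((token, pos))
            labels.append(cls)
    return tokens, labels
```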

The final test dataset (which will not be provided to you) will be formatted similarly to the training dataset, except that each sentence will be formed as a list of 2-tuples (only the token itself and its POS tag). For example, if the test dataset only contains the sentence "Prime Minister Malcolm Turnbull MP visited UNSW yesterday.", it will be formatted as:

[[('Prime', 'NNP'), ('Minister', 'NNP'), ('Malcolm', 'NNP'),
('Turnbull', 'NNP'), ('MP', 'NNP'), ('visited', 'VBD'), ('UNSW', 'NNP'),
('yesterday', 'NN'), ('.', '.')]]

Feature Engineering. In order to build the classifier, you first need to extract features for each token. In this project, using the token itself as a feature usually achieves high accuracy on the training dataset; however, it can result in low accuracy on the test dataset due to overfitting. For example, it is not uncommon for the test dataset to contain titles that do not exist in the training dataset. Therefore, we encourage you to find/engineer meaningful, strong features and build a more robust classifier.

You will need to describe all the features that you have used in this project, and justify your choice in your report.
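As one possible starting point, a per-token feature extractor can combine the token's own properties with its context. The concrete features below are illustrative examples only; finding strong ones is the point of the task:

```python
# Sketch of a per-token feature extractor; the features shown are
# illustrative examples, not a recommended final set.
def extract_features(sentence, i):
    token, pos = sentence[i][0], sentence[i][1]
    return {
        'pos': pos,                          # the token's own POS tag
        'is_title_case': token.istitle(),    # capitalization pattern (e.g. 'Minister')
        'is_upper': token.isupper(),         # all-caps tokens such as 'MP'
        'suffix3': token[-3:].lower(),       # short suffix, generalizes beyond seen titles
        'prev_pos': sentence[i - 1][1] if i > 0 else 'BOS',
        'next_pos': sentence[i + 1][1] if i + 1 < len(sentence) else 'EOS',
    }
```

Because it only indexes positions 0 and 1 of each tuple, the same function works on both the 3-tuple training sentences and the 2-tuple test sentences.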

Build the logistic regression classifier. In this project, you need to use the logistic regression classifier (i.e., sklearn.linear_model.LogisticRegression) from the scikit-learn package. For more details, you can refer to the scikit-learn documentation and the relevant course materials.
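As a sketch of how the pieces might fit together, per-token feature dicts can be one-hot encoded with DictVectorizer and fed to the required LogisticRegression (the two features used here are illustrative only):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Sketch: one-hot encode per-token feature dicts and fit the required
# classifier. The two features used here are illustrative only.
def train(sentences):
    feature_dicts, labels = [], []
    for sentence in sentences:
        for token, pos, cls in sentence:
            feature_dicts.append({'token': token.lower(), 'pos': pos})
            labels.append(cls)
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)
    classifier = LogisticRegression()
    classifier.fit(X, labels)
    return vectorizer, classifier
```

Keeping the fitted vectorizer together with the classifier matters, because the tester must transform unseen tokens into the same feature space before predicting.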

You also need to dump the trained classifier using the following code (where classifier is the trained classifier and classifier_path is the path of the output file):

with open(classifier_path, 'wb') as f:
    pickle.dump(classifier, f)

You are also required to submit the dumped classifier (which must be named classifier.dat).

Suggested Steps. You may want to implement a basic, initial NER classifier first. Then you can improve its performance in multiple ways. For example, you can find the best settings of the hyper-parameters of your model and features, or design and test different sets of features. It is recommended that you use cross validation and construct your own test datasets.
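For example, k-fold cross validation can be run as below (sketched on random placeholder data; substitute your real feature matrix X and label vector y). Note that in scikit-learn 0.17.1 cross_val_score lives in sklearn.cross_validation, while newer versions use sklearn.model_selection:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
# In scikit-learn 0.17.1: from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score

# Sketch: 3-fold cross validation on random placeholder data; substitute
# your real feature matrix X and label vector y.
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = np.array(['TITLE', 'O'] * 30)
scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring='f1_macro')
print(scores.mean())
```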

You need to describe how you have improved the performance of your classifier in your report.

Trainer. You need to submit a Python file named trainer.py. It receives two command line arguments, which specify the path to the training dataset and the path to a file where your trained classifier will be dumped. Your program will be executed as below:
python trainer.py <path_to_training_data> <path_to_classifier>
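A possible skeleton for trainer.py is sketched below; build_classifier is a hypothetical placeholder for your own feature extraction and model fitting:

```python
import pickle
import sys

def build_classifier(training_set):
    # Hypothetical placeholder: replace with your feature extraction
    # and LogisticRegression fitting.
    return {'n_training_sentences': len(training_set)}

def main():
    training_data, classifier_path = sys.argv[1], sys.argv[2]
    with open(training_data, 'rb') as f:
        training_set = pickle.load(f)
    classifier = build_classifier(training_set)
    with open(classifier_path, 'wb') as f:
        pickle.dump(classifier, f)

if __name__ == '__main__' and len(sys.argv) == 3:
    main()
```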

Tester. You need to submit a Python file named tester.py. It receives three command line arguments, which specify the path to the testing dataset, the path to the dumped classifier, and the path to a file where your output results will be dumped. Your program will be executed as below:

python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>

For each token in the testing dataset, you should output its named entity class (i.e., TITLE or O). Your output, internally in Python, is a list of sentences, where each sentence is a list of (TOKEN, CLASS) tuples. For example, a possible output for the example test sentence above is:
[[('Prime', 'TITLE'), ('Minister', 'TITLE'), ('Malcolm', 'O'),
('Turnbull', 'O'), ('MP', 'O'), ('visited', 'O'), ('UNSW', 'O'),
('yesterday', 'O'), ('.', 'O')]]
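Constructing this output might look like the following sketch, where predict_token is a hypothetical stand-in for your own vectorize-and-predict step:

```python
# Sketch: build the required [(TOKEN, CLASS), ...] result list for each
# sentence; predict_token is a hypothetical stand-in for your own
# vectorize-and-predict step.
def tag_sentences(test_set, predict_token):
    results = []
    for sentence in test_set:
        results.append([(sentence[i][0], predict_token(sentence, i))
                        for i in range(len(sentence))])
    return results
```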

Then you should dump it to a file using the following code (where result is your result list and path_to_results is the path of the dumped file):

with open(path_to_results, 'wb') as f:
    pickle.dump(result, f)

Report. You need to submit a report (named report.pdf) which answers the following two questions:

- What features do you use in your classifier? Why are they important, and what information do you expect them to capture?
- How did you experiment with and improve your classifier?

4. Execution
Your program will be tested automatically on a CSE Linux machine as follows:

- the following command will be used to build the classifier:
python trainer.py <path_to_training_data> <path_to_classifier>
where
  - path_to_training_data indicates the path to the training dataset
  - path_to_classifier indicates the path to the dumped classifier
- the following command will be used to test the classifier:
python tester.py <path_to_testing_data> <path_to_classifier> <path_to_results>
where
  - path_to_testing_data indicates the path to the testing dataset
  - path_to_classifier indicates the path to the dumped classifier
  - path_to_results indicates the path to the dumped results

Your program will be executed using Python 3, with only the following packages available:
- nltk 3.2.1
- numpy 1.11.1
- scipy 0.18.0
- scikit-learn 0.17.1
- pandas 0.18.1

5. Evaluation

We will use the F1-measure to evaluate the performance of your classifier. We will test your classifier using two test datasets:
- the first one is sampled from the same domain as the given training dataset, and
- the second one is sampled from a domain different from that of the given training dataset.
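As an illustration of the metric (the exact averaging and aggregation scheme used for marking is not specified here), the token-level F1 on the TITLE class can be computed with sklearn.metrics.f1_score on made-up gold and predicted labels:

```python
from sklearn.metrics import f1_score

# Sketch: token-level F1 on the TITLE class, using made-up labels.
# Here precision = 2/2 and recall = 2/3, so F1 = 0.8.
gold = ['TITLE', 'TITLE', 'O', 'O', 'TITLE', 'O']
pred = ['TITLE', 'O',     'O', 'O', 'TITLE', 'O']
score = f1_score(gold, pred, pos_label='TITLE')
print(round(score, 3))  # → 0.8
```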

Each test will contribute 40 points to your final score. Your report contributes the remaining 20 points.

In order to minimize the effect of randomness, we will execute trainer.py three times, and use the best performance achieved.

For each test dataset, your best-performing classifier will be compared with a reference classifier C. While the detailed scheme will be published later on the project website, the marking scheme will essentially reward you with more points if your classifier performs no worse than the reference classifier.

Neither the reference classifier C nor the two test datasets will be given to you. However, you can create your own test dataset, submit it, and get the performance of the reference classifier C on it. For more details, please refer to Section 6.

6. Customized Datasets

You are encouraged to construct your own test datasets to test your classifier and help improve its performance. You can submit your datasets to get the F1-measure of the pre-trained classifier C on them. Furthermore, all the submitted datasets are accessible by every student in the class. You should also benefit from this experience, because it will give you many initial ideas about meaningful features to use in your own classifier. You can upload your test cases through the data submission website (the URL will be sent via email). You will need to log in before uploading your datasets or downloading other datasets. The ID is your student number (e.g., z1234567), and your password will be sent to you by email.

Once you have logged into the system, you can:
- submit your dataset
- view the performance of all the datasets
- download a chosen dataset

Your submitted test datasets should be in the same format as the training dataset. You can submit at most 10 datasets to the system. We will release more details about how to use the system, as well as some tools/tips to help you visualize and tag TITLE occurrences, later on the project website.
