Designing an efficient regex in order to extract the data

Reference no: EM132221714

Task 1: Parsing Raw Text Files

This assessment covers the very first step of analyzing textual data: extracting data from unstructured text files. Each student is provided with a data-set that contains several job postings (find your own file in task1.rar, named <your_student_number>.dat).

Each data-set contains information about job advertisements, e.g., job title, job description, start date, and required qualifications (see sample.pdf and sample.txt for the data dictionary). Your task is to extract the data and transform it into the XML and JSON formats.
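The extraction step can be sketched as follows. This is only an illustration: the record layout, the field labels (ID, TITLE, START) and the `---` separator are hypothetical, since the real layout is defined in sample.txt and sample.pdf. The sketch uses nothing beyond the re package.

```python
import re

# HYPOTHETICAL raw layout, invented for illustration only.
# The real field names and separators come from sample.txt / sample.pdf.
RAW = """\
ID: 101
TITLE: Data Analyst
START: 2019-03-01
---
ID: 102
TITLE: Software Engineer
START: 2019-04-15
---
"""

# A single pattern compiled once, with named groups. Character classes such as
# \d and [^\n] are used instead of ".*" so the engine does little backtracking.
RECORD = re.compile(
    r"ID:\s*(?P<id>\d+)\n"
    r"TITLE:\s*(?P<title>[^\n]+)\n"
    r"START:\s*(?P<start>\d{4}-\d{2}-\d{2})\n"
)


def extract_jobs(text):
    """Return one dict per job posting found in the raw text."""
    return [match.groupdict() for match in RECORD.finditer(text)]


jobs = extract_jobs(RAW)
for job in jobs:
    print(job)
```

Compiling the pattern once and iterating with `finditer` scans the file in a single pass, which is usually cheaper than running a separate search for every field.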

Please note that re and json are the only Python packages you are allowed to use in this task, and that the following must be completed for the assessment:

• Designing an efficient regex in order to extract the data from the file

• Storing and submitting the extracted data into an XML file, <your_student_number>.xml following the format of example.xml

• Storing and submitting the extracted data into a JSON file <your_student_number>.json following the format of example.json

• Submitting task1_<your_student_number>.ipynb
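The two storage steps above could look roughly like the sketch below. The element and key names (`jobs`, `job`, `id`, `title`) are assumptions made for illustration; the required structure is whatever example.xml and example.json prescribe. Because only re and json are allowed, the XML is assembled as text and the special characters are escaped by hand.

```python
import json


def xml_escape(value):
    """Escape the characters that are not allowed in XML element text."""
    return (
        str(value)
        .replace("&", "&amp;")  # must run first so later entities are not re-escaped
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def to_xml(jobs):
    """Build an XML document as a string (hypothetical <jobs>/<job> layout)."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<jobs>"]
    for job in jobs:
        lines.append("  <job>")
        for key, value in job.items():
            lines.append(f"    <{key}>{xml_escape(value)}</{key}>")
        lines.append("  </job>")
    lines.append("</jobs>")
    return "\n".join(lines)


def to_json(jobs):
    """Serialize the same records with the json package."""
    return json.dumps({"jobs": jobs}, indent=2)


# Hypothetical extracted record, used only to exercise the two writers.
sample_jobs = [{"id": "101", "title": "R&D Engineer"}]
xml_text = to_xml(sample_jobs)
json_text = to_json(sample_jobs)
print(xml_text)
print(json_text)
```

The resulting strings can then be written to <your_student_number>.xml and <your_student_number>.json with the built-in `open()` function.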

Task 2: Text Pre-Processing

This assessment covers the next step of analyzing textual data: converting the extracted data into a proper format. You are required to write Python code to preprocess a set of resumes and convert them into numerical representations that are suitable as input to recommender-system or information-retrieval algorithms.

The data-set that we provide contains 250 CVs for each student. Consult resume_dataset.txt to find out which PDF files belong to your own data-set.

Each line in the csv file contains the ids of the resumes that a student needs to include in the data-set (for example, 1111111111: [3 34 5 ...] means that the data-set of student 1111111111 includes resume_(3), resume_(34), resume_(5), ...). The CVs contain information about the applicants and are provided in PDF format.
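Reading one such line can be sketched as below. The line format is taken from the example in the text; the `.pdf` extension appended to the `resume_(N)` names is an assumption, as is the helper name `resume_filenames`.

```python
import re

# Matches "<student number>: [<space-separated ids>]" as in the example line.
LINE = re.compile(r"^\s*(?P<student>\d+)\s*:\s*\[(?P<ids>[^\]]*)\]")


def parse_assignment_line(line):
    """Return (student_number, [resume ids]) for one line of the listing."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    ids = [int(token) for token in re.findall(r"\d+", match.group("ids"))]
    return match.group("student"), ids


def resume_filenames(ids):
    """Map ids to file names; the '.pdf' extension is assumed, not given."""
    return [f"resume_({i}).pdf" for i in ids]


student, ids = parse_assignment_line("1111111111: [3 34 5]")
print(student, resume_filenames(ids))
```

The student number is kept as a string so that any leading zeros are preserved.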

The information includes, for example, personal information, skills, work experience, and education. Your task is to extract and transform the information for each applicant.
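One possible numerical representation is a simple bag-of-words count vector, sketched below. The choice of representation is an assumption, as the brief does not name one, and the two resume texts are invented. The sketch also assumes the text has already been extracted from the PDF files, which needs a separate PDF library that is not shown here.

```python
import re
from collections import Counter

# Lowercase alphabetic tokens, optionally with an apostrophe part ("don't").
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def build_vocabulary(token_lists):
    """Assign each distinct token an index, in alphabetical order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def count_vector(tokens, vocabulary):
    """Return the term counts of one document as a dense list."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokens).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# INVENTED resume texts keyed by resume id, for illustration only.
resumes = {
    3: "Python developer with Python and SQL skills",
    34: "SQL analyst",
}
tokens_by_id = {rid: tokenize(text) for rid, text in resumes.items()}
vocabulary = build_vocabulary(tokens_by_id.values())
vectors = {rid: count_vector(toks, vocabulary) for rid, toks in tokens_by_id.items()}
print(vocabulary)
print(vectors)
```

Further preprocessing such as stop-word removal or stemming would slot in between `tokenize` and `build_vocabulary`.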





