Reference no: EM132221714
Task 1: Parsing Raw Text Files
This assessment covers the very first step of analyzing textual data: extracting data from unstructured text files. Each student is provided with a dataset that contains several job postings (find your own file in task1.rar, named <your_student_number>.dat).
Each dataset contains information about job advertisements, e.g., job title, job description, start date, and required qualifications (see sample.pdf and sample.txt for the data dictionary). Your task is to extract the data and transform it into the XML and JSON formats.
Please note that re and json are the only Python packages you are allowed to use in this task. The following must be performed to complete the assessment:
- Designing an efficient regex to extract the data from the file
- Storing the extracted data in an XML file, <your_student_number>.xml, following the format of example.xml, and submitting it
- Storing the extracted data in a JSON file, <your_student_number>.json, following the format of example.json, and submitting it
- Submitting task1_<your_student_number>.ipynb
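The steps above can be sketched as follows, using only re and json. This is a minimal illustration, not the required solution: the real record layout is defined by sample.txt and the real output layout by example.xml and example.json, none of which are reproduced here. The field labels (ID, TITLE, START), the sample text, and the element and key names are therefore all hypothetical.

```python
import json
import re

# Hypothetical raw text; the real layout is defined by sample.txt.
RAW = """ID: 101
TITLE: Data Analyst
START: 2019-03-01
ID: 102
TITLE: Chef & Kitchen Lead
START: 2019-04-15
"""

# One record = three labelled lines (an assumed layout).
# Named groups keep the extracted fields readable.
RECORD_RE = re.compile(
    r"^ID:[ \t]*(?P<id>\d+)[ \t]*\n"
    r"TITLE:[ \t]*(?P<title>.+?)[ \t]*\n"
    r"START:[ \t]*(?P<start>\d{4}-\d{2}-\d{2})[ \t]*$",
    re.MULTILINE,
)


def xml_escape(text):
    # Escape the characters that are special inside XML text nodes.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def parse_records(raw):
    # Return one dict per matched record, keyed by the group names.
    return [m.groupdict() for m in RECORD_RE.finditer(raw)]


def to_xml(records):
    # Build the XML by hand, since no XML package is allowed in this task.
    # The element names are placeholders; follow example.xml in practice.
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<jobs>"]
    for rec in records:
        lines.append('  <job id="{}">'.format(rec["id"]))
        lines.append("    <title>{}</title>".format(xml_escape(rec["title"])))
        lines.append("    <start>{}</start>".format(rec["start"]))
        lines.append("  </job>")
    lines.append("</jobs>")
    return "\n".join(lines)


records = parse_records(RAW)
xml_text = to_xml(records)
json_text = json.dumps({"jobs": records}, indent=2)
```

Compiling the pattern once and iterating with finditer avoids rescanning the file for each field. The strings xml_text and json_text can then be written to <your_student_number>.xml and <your_student_number>.json with the built-in open().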
Task 2: Text Pre-Processing
This assessment covers the next step of analyzing textual data: converting the extracted data into a proper format. You are required to write Python code to preprocess a set of resumes and convert them into numerical representations that are suitable as input to recommender-system or information-retrieval algorithms.
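As an illustration of what a numerical representation can look like, the sketch below builds simple term-count vectors over a shared vocabulary. The brief does not name a specific representation, so the choice of raw counts, the tokenization rule, and the two sample texts are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep alphabetic tokens of length >= 2.
    return re.findall(r"[a-z]{2,}", text.lower())


def build_vocabulary(docs):
    # Map every distinct token to a fixed column index (alphabetical order).
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(doc, vocab):
    # Count how often each vocabulary term occurs in the document.
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Hypothetical resume snippets, standing in for text extracted from PDFs.
docs = [
    "Python developer, Python and SQL skills",
    "SQL analyst with reporting skills",
]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Each resume becomes a list of integers with one position per vocabulary term, so all vectors have the same length and can be compared directly.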
The dataset that we provide contains 250 CVs for each student. See resume_dataset.txt to find which PDF files belong to your own dataset.
Each line in the file lists the IDs of the resumes that a student needs to include in their dataset. For example, 1111111111: [3 34 5 ...] means that the dataset of student 1111111111 includes resume_(3), resume_(34), resume_(5), and so on.
The CVs contain information about the applicants and are provided in PDF format.
The information includes, for example, personal details, skills, work experience, and education. Your task is to extract and transform the information for each applicant.
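Selecting your own resumes from the ID listing described above can be sketched as follows. This assumes each entry sits on a single line in the form shown in the example, and that the PDF files are named after the resume_(3) pattern with a .pdf extension. Both are assumptions; the sample line below is hypothetical.

```python
import re

# Assumed line format: "<student_number>: [<id> <id> ...]".
LINE_RE = re.compile(r"^\s*(?P<student>\d+)\s*:\s*\[(?P<ids>[\d\s]*)\]\s*$")


def parse_line(line):
    # Split one listing line into the student number and a list of resume IDs.
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognised line: {!r}".format(line))
    ids = [int(tok) for tok in m.group("ids").split()]
    return m.group("student"), ids


def resume_filenames(ids):
    # Assumed file naming, based on the resume_(3) style used in the brief.
    return ["resume_({}).pdf".format(i) for i in ids]


# Hypothetical, shortened listing line for illustration.
student, ids = parse_line("1111111111: [3 34 5]")
files = resume_filenames(ids)
```

Raising an error on unrecognised lines makes a format mismatch visible early, rather than silently producing an incomplete dataset.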