Reference no: EM133755183
Algorithms and Data Structures
Assignment: Spellcheck Lookup
Purpose: Improve your proficiency in C programming and your dexterity with dynamic memory allocation.
Demonstrate understanding of a concrete data structure (radix tree) and implement a set of algorithms.
Practice multi-file programming and improve your proficiency in using UNIX utilities.
Practice writing reports which systematically compare the efficiency of different data structures.
Background: In Assignment 1 you implemented a dictionary using a linked list. Many applications of dictionaries cannot assume that the user will query in the correct format. Take Google search as an example. If someone is so excited to check the recent Olympics updates, they may accidentally search something similar to this:
Thankfully, there is enough context in the original query that the intended search result is recommended back to the user. In Assignment 2 you will be extending your implementation to account for misspelled keys when querying your dictionary.
Your Task
Assignment:
You will be using the same dataset as Assignment 1.
Users will be able to query the radix tree and will get either the expected key, or the closest recommended key.
You will then write a report to analyse the time and memory complexity of your Assignment 1 linked list compared to your radix tree implementation.
C Implementation:
Your programs will build the dictionary by reading data from a file. They will insert each suburb into the dictionary (either the linked list (Stage 3) or radix tree (Stage 4)).
Your programs will handle the search for keys. There are three situations that your programs must handle:
Handle exact matches: output all records that match the key (Stage 3 and 4).
Handle similar matches: if there are no exact matches, find the most similar key and output its associated records (Stage 4 only).
Handle no matches being found: if neither exact nor similar matches are found, indicate that there is no match or recommendation (Stage 3 and 4).
Your programs should also be able to populate and query the dictionary with the Assignment 1 linked list approach and the radix tree approach.
Your programs should also provide enough information to compare the efficiency of the linked list with the radix tree with a operation-based measure that ignores less important run- time factors (e.g. comparison count).
Implementation Tips
Get a particular bit from a given character
Extract a stem from a given key number of bits from a given key.
and
which extracts a certain
If you want to understand the code provided, you need to understand the following bit operators over numbers a and b:
Implementation Details
Assignment 2 will involve roughly three new stages.
Stage 3 will extend your Assignment 1 solution to count the number of comparisons it takes to find a given key, i.e. a search query.
Stage 4 will implement the lookup of data by a given key (Official Name Suburb) in a Patricia tree.
Stage 5 is a report about the complexity of a linked list compared to the Patricia tree.
Stage 3: Linked List Search Comparisons
In this stage, you will extend your Assignment 1 solution to count the number of binary and string comparisons and node accesses used when searching for a given key.
Your will produce an executable called dict3 . The command line arguments are identical
to assignment 1 but with the stage being instead.
The first argument will be the stage, for this part, the value will always be lookup with comparison count added).
The second argument will be the name of the input CSV data file. The third argument will be the name of the output file.
(for linked list
Your program should function exactly the same as assignment 1's stage 1. You will add the
functionality to count comparisons when searching for a key which will be added to the output. For this stage, and this stage only, you may assume:
Each character compared adds exactly 8 bits to the bit comparison count. The node access is incremented once per accessed linked list node.
Each string comparison, even if a mismatch occurs on the first character, is 1 string comparison.
You should create the functionality to store comparisons with stage 4 in mind. You will also be recording comparisons in the Patricia tree implementation, so your code should be easily applied to both. For this reason, you may want to extend your search function to include a pointer to a struct that holds information about the query and results.
Stage 4: Patricia Tree Spellchecker
In Stage 4, you will implement the another dictionary allowing the lookup of data by key ( Suburb
Your should produce an executable program called dict4 . This program should take three
command line arguments:
The first argument will be the stage, for this part, the value will always be 4 (for Patricia tree lookup and comparison counting).
The second argument will be the name of the input CSV data file. The third argument will be the name of the output file.
program should:
Read the data in the same format as assignment 1 and stage 1, and save each entry in a Patricia tree using the Official Name Suburb as the key.
The program should:
Accept Official Name Suburb s from stdin , and search the tree for the matching key. Since your program should act as a simple spellchecker, an exact match may not be found. Follow the process "mismatch in key" as outlined below to implement spellchecking logic.
You must output all matching records to the output file, where a matching record may be an exact match, or the closest match determined by the mismatch key algorithm. If no matches for the query are found (i.e. the trie is empty), your program should output NOTFOUND . When outputting data records, you should follow the guidelines described in the slide Dataset and Assumptions.
In addition to outputting the record(s) to the output file, the number of records found and
the comparisons and node accesses when querying should be output to .
Stage 5: Complexity Report
The final deliverable for this assignment is a small report analyzing the complexity of both dictionaries. You will run various test cases through your program to test its correctness and also to examine the number of key comparisons used when searching keys across different dictionaries. You will report on the number of comparisons used by your linked list and patricia tree implementations for different data file inputs and key prefixes. You will compare these results with what you expect based on theory (Big-O).
Your approach should be systematic, varying the size of the files you use and their characteristics (e.g. sorted/unsorted, duplicates, range of key space, length of prefix, etc.), and observing how the number of key comparisons varies given different sizes of key prefixes. Looking at statistical measures about the results of your tests may help.
Implementation Requirements
The following implementation requirements must be adhered to:
Programming Language: You must write your code in the C programming language.
Record Structure: Each data record (representing a suburb) must be stored in a custom structure ( struct ) that reflects the previously mentioned data types. This struct will have member variables for each field (Year, Suburb Code, etc.) with the appropriate data types (int, double, string).
Linked List Implementation: You must use a linked list to implement the dictionary in Stage 3.
The order in which data records appear in the input file must be preserved within the linked list.
Patricia Tree Implementation: You must use a Patricia tree to implement the dictionary in Stage 4.
The order in which data records appear in the input file must be preserved within the Patricia tree (i.e. duplicates should appear in the same order as the relative order in the original file).
Modular Design: You must write your code in a modular way.
Your C program will be marked on the basis of accuracy, readability, and good C programming structure, safety and style, including documentation (1 mark). Safety refers to checking whether opening a file returns something, whether malloc() s do their job, etc. The documentation should explain all major design decisions, and should be formatted so that it does not interfere with reading the code. As much as possible, try to make your code self-documenting by choosing descriptive variable names.
Reflection
This form provides an opportunity to both look back at your work to reflect on it, as well as evaluate how you did.