How many of these one million pairs will hash to the same bucket

Reference no: EM131212170

This exercise is based on the entity-resolution problem of Example 22.9. For concreteness, suppose that the only pairs of records that could possibly be within total edit distance 5 of each other consist of a true copy of a record and another, corrupted version of the record. In the corrupted version, each of the three fields is changed independently. 50% of the time, a field has no change. 20% of the time, there is a change resulting in edit distance 1 for that field. There is a 20% chance of edit distance 2 and a 10% chance of edit distance 10. Suppose there are one million pairs of this kind in the dataset.

a) How many of the million pairs are within total edit distance 5 of each other?

b) If we hash each field to a large number of buckets, as suggested by Example 22.9, how many of these one million pairs will hash to the same bucket for at least one of the three hashings?

c) How many false negatives will there be; that is, how many of the one million pairs are within total edit distance 5, but will not hash to the same bucket for any of the three hashings?
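
The three counts can be checked by enumerating every combination of per-field edit distances. The following is a minimal sketch, not an official solution. It assumes that two records land in the same bucket for a field only when that field is unchanged (chance collisions are ignored because the number of buckets is large), and it reports expected counts. The names `FIELD_DIST` and `count_pairs` are hypothetical helpers introduced here.

```python
from fractions import Fraction
from itertools import product

# Per-field edit-distance distribution from the exercise statement.
FIELD_DIST = {
    0: Fraction(5, 10),   # field unchanged
    1: Fraction(2, 10),
    2: Fraction(2, 10),
    10: Fraction(1, 10),
}
TOTAL_PAIRS = 1_000_000
THRESHOLD = 5


def count_pairs(predicate):
    """Expected number of pairs whose three field distances satisfy predicate."""
    prob = Fraction(0)
    for distances in product(FIELD_DIST, repeat=3):
        if predicate(distances):
            p = Fraction(1)
            for d in distances:
                p *= FIELD_DIST[d]  # fields change independently
            prob += p
    return int(prob * TOTAL_PAIRS)


# (a) total edit distance at most 5
within_threshold = count_pairs(lambda ds: sum(ds) <= THRESHOLD)

# (b) at least one field unchanged, so at least one hashing agrees
# (assumption: only identical fields share a bucket)
same_bucket = count_pairs(lambda ds: 0 in ds)

# (c) within the threshold, yet every field changed: a false negative
false_negatives = count_pairs(lambda ds: sum(ds) <= THRESHOLD and 0 not in ds)

print(within_threshold, same_bucket, false_negatives)
```

Under these assumptions the enumeration gives 721,000 pairs for (a), 875,000 for (b), and 56,000 for (c). Exact fractions are used instead of floats so that the scaled counts come out as whole numbers.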

Example 22.9

Suppose for concreteness that records are as in the running example of Section 21.7: name-address-phone triples, where each of the three fields is a character string. Suppose also that we define records to be similar if the sum of the edit distances of their three corresponding pairs of fields is no greater than 5. Let us use a hash function h that hashes the name field of a record to one of a million buckets. How h works is unimportant, except that it must be a good hash function, one that distributes names roughly uniformly among the buckets. But we do not stop here. We also hash the records to another set of a million buckets, this time using the address and a suitable hash function on addresses. If h operates on any strings, we can even use h. Then, we hash records a third time to a million buckets, using the phone number. Finally, we examine each bucket in each of the three hash tables, a total of 3,000,000 buckets. For each bucket, we compare each pair of records in that bucket, and we report any pair that has total edit distance 5 or less. Suppose there are $n$ records. Assuming even distribution of records in each hash table, there are $n/10^6$ records in each bucket. The number of pairs of records in each bucket is approximately $n^2/(2 \times 10^{12})$. Since there are $3 \times 10^6$ buckets, the total number of comparisons is about $1.5\,n^2/10^6$. And since there are about $n^2/2$ pairs of records, we have managed to look at only a fraction $3 \times 10^{-6}$ of the pairs of records, a big improvement.
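
The procedure in Example 22.9 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the textbook's implementation: it uses Levenshtein distance (insertions, deletions, and substitutions) as the edit-distance metric, Python's built-in `hash` as a stand-in for h, and hypothetical names `edit_distance`, `total_distance`, and `similar_pairs`.

```python
from collections import defaultdict
from itertools import combinations

NUM_BUCKETS = 1_000_000  # one million buckets per hash table, as in the example


def edit_distance(a, b):
    """Levenshtein distance (assumed metric: insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def total_distance(r, s):
    """Sum of the edit distances of the three corresponding fields."""
    return sum(edit_distance(x, y) for x, y in zip(r, s))


def similar_pairs(records, threshold=5):
    """Hash on each field in turn; compare only records sharing a bucket."""
    found = set()
    for field in range(3):  # name, address, phone
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            # Built-in hash() stands in for the hash function h.
            buckets[hash(rec[field]) % NUM_BUCKETS].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if (i, j) in found:
                    continue  # already reported from an earlier hash table
                if total_distance(records[i], records[j]) <= threshold:
                    found.add((i, j))
    return found


records = [
    ("Ann Smith", "12 Oak St", "555-0100"),
    ("Ann Smyth", "12 Oak St", "555-0100"),  # name differs by one character
    ("Bob Jones", "9 Elm Ave", "555-0199"),
]
pairs = similar_pairs(records)
print(pairs)
```

In this small usage example the first two records share an address and a phone number, so they meet in a bucket and are reported. A pair in which all three fields differ slightly would almost never share a bucket, which is the source of the false negatives asked about in part (c).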
