Making smart comparisons
Say we have a magic function that takes in a pair of records and always returns False if the two records are distinct and True if they refer to the same person or organization.

Let's say that this function is pretty slow: it always takes one second to return. How long would it take to deduplicate a thousand records?
Within a dataset of a thousand records, there are 1,000 × 999 / 2 = 499,500 unique possible pairs of records. If we compared all of them using our magic function, it would take nearly six days.
But one second is a long time. Let's say we sped it up so that we can make 10,000 comparisons per second. Now we can get through our thousand-record dataset in under a minute.
Feeling good about our super-fast comparison function, let's take on a dataset of 100,000 records. Now there are 100,000 × 99,999 / 2 = 4,999,950,000 unique possible pairs. If we compare all of them with our super-fast comparison function, it will take nearly six days again.
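To make the arithmetic concrete, here is a short Python sketch; the record counts and comparison rates are just the numbers from this example:

```python
def unique_pairs(n):
    """Number of unique record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n, rate in [(1_000, 1), (1_000, 10_000), (100_000, 10_000)]:
    pairs = unique_pairs(n)
    seconds = pairs / rate
    print(f"{n:,} records at {rate:,} comparisons/sec: "
          f"{pairs:,} pairs, {seconds:,.0f} seconds (~{seconds / 86_400:.1f} days)")
```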
If we want to work with moderately sized data, we have to find a way of making fewer comparisons.
In real-world data, nearly all possible pairs of records are not duplicates.
In the four-record example below, only two pairs of records are duplicates (the first two and the last two), while four unique pairs of records are not duplicates. This fraction of true duplicate pairs is typical, and it gets very small, very quickly as the size of a dataset grows.
| first name | last name | address                  | phone        | record_id |
|------------|-----------|--------------------------|--------------|-----------|
| bob        | roberts   | 1600 pennsylvania ave.   | 555-0123     | 1         |
| Robert     | Roberts   | 1600 Pensylvannia Avenue |              | 2         |
| steve      | Jones     | 123 Cowabunga Lane       | 555-0000     | 3         |
| Stephen    | Janes     | 123 Cawabunga Ln         | 444-555-0000 | 4         |
If we could only compare records that were true duplicates, we would not run into this explosion of comparisons. Of course, if we already knew where the true duplicates were, we wouldn't need to compare any individual records at all. Unfortunately we don't, but we can do quite well if we just compare records that are somewhat similar.
Duplicate records almost always share something in common. If we define groups of data that share something and only compare the records within a group, or block, then we can dramatically reduce the number of comparisons we make. If we define these blocks well, we will make very few comparisons and still be confident that we will compare the records that truly are duplicates.
This task is called blocking, and we approach it in two ways: predicate blocks and canopies.
A predicate block is a block of records that all share a feature, produced by a relatively simple function called a predicate.
Predicates are functions that take in a record field and output a set of features of that field. These features could be something like "the first 3 characters of the field" or "every word in the field".
So, if we used our "first 3 characters" predicate on the address field above, we would have two blocks:
```python
{'160' : (1, 2),   # tuples of record_ids
 '123' : (3, 4)}
```
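To sketch how such a predicate block might be built (the records and helper function here are illustrative, not dedupe's actual API):

```python
from collections import defaultdict

# Hypothetical records, keyed by record_id, mirroring the table above.
records = {
    1: {'address': '1600 pennsylvania ave.'},
    2: {'address': '1600 Pensylvannia Avenue'},
    3: {'address': '123 Cowabunga Lane'},
    4: {'address': '123 Cawabunga Ln'},
}

def first_three_chars(field_value):
    """A simple predicate: the first 3 characters of the field."""
    return {field_value[:3]}

# Group record_ids into blocks keyed by each feature the predicate produces.
blocks = defaultdict(list)
for record_id, record in records.items():
    for feature in first_three_chars(record['address']):
        blocks[feature].append(record_id)

print(dict(blocks))  # {'160': [1, 2], '123': [3, 4]}
```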
Other simple predicates we use include the following (a few are sketched in code after the list):
- whole field
- token field
- common integer
- same three char start
- same five char start
- same seven char start
- near integers
- common four gram
- common six gram
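For illustration, a few of the predicates above might look something like this; these are simplified sketches, not dedupe's actual implementations:

```python
import re

def whole_field(field):
    """The entire field value as a single feature."""
    return {field}

def token_field(field):
    """Every word in the field."""
    return set(field.split())

def common_integer(field):
    """Every integer that appears in the field."""
    return set(re.findall(r'\d+', field))

def same_three_char_start(field):
    """The first three characters of the field."""
    return {field[:3]}

def common_four_gram(field):
    """Every sequence of four consecutive characters in the field."""
    return {field[i:i + 4] for i in range(len(field) - 3)}

print(token_field('1600 pennsylvania ave.'))   # {'1600', 'pennsylvania', 'ave.'}
print(common_integer('123 Cowabunga Lane'))    # {'123'}
```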
Dedupe also uses another way of producing blocks called canopies, an approach developed by Andrew McCallum, Kamal Nigam, and Lyle Ungar.
Here's the basic idea: we start with the first record in our data, add it to a new canopy group, and call it the target record. We then go through the rest of the records. If a record is close enough to the target record, we add it to the canopy group. Once we have passed through the data, we find the next record that was not assigned to a canopy; this becomes the target record for a new canopy, and we repeat the process until every record is part of a canopy.
In order to build canopies, we need a distance that we can compute without having to compare every single record in detail, which is what we were trying to avoid in the first place. We use the cosine similarity of TF-IDF weighted word vectors.
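Here is a rough sketch of that canopy-building pass, using scikit-learn's TF-IDF vectorizer for brevity; the threshold value is arbitrary and the whole thing is a simplification, not dedupe's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical field values to build canopies over.
addresses = [
    '1600 pennsylvania ave.',
    '1600 Pensylvannia Avenue',
    '123 Cowabunga Lane',
    '123 Cawabunga Ln',
]

# TF-IDF weighted word vectors for every record, computed in one pass.
vectors = TfidfVectorizer().fit_transform(addresses)
similarities = cosine_similarity(vectors)

THRESHOLD = 0.2   # arbitrary cutoff for "close enough" to the target record
unassigned = set(range(len(addresses)))
canopies = []

while unassigned:
    # The next record not yet in a canopy becomes the target record.
    target = min(unassigned)
    canopy = {i for i in unassigned if similarities[target, i] >= THRESHOLD}
    canopy.add(target)
    canopies.append(canopy)
    unassigned -= canopy

print(canopies)  # with this threshold: [{0, 1}, {2, 3}]
```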
If it's good to define blocks of records that share the same 'city' field, it might be even better to block records that share BOTH the 'city' field AND the 'zip code' field. Dedupe tries these cross-field blocks. These combination blocks are called disjunctive blocks.
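For illustration, a cross-field block key might be built by pairing the features from two predicates; the field names, records, and helper function here are hypothetical:

```python
from collections import defaultdict
from itertools import product

# Hypothetical records with 'city' and 'zip' fields.
records = {
    1: {'city': 'Chicago', 'zip': '60601'},
    2: {'city': 'Chicago', 'zip': '60601'},
    3: {'city': 'Chicago', 'zip': '60614'},
}

def whole_field(value):
    """A simple predicate: the entire field value."""
    return {value}

def cross_field_blocks(records, predicate_a, field_a, predicate_b, field_b):
    """Block on the combination of two predicates applied to two fields."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        features_a = predicate_a(record[field_a])
        features_b = predicate_b(record[field_b])
        for key in product(features_a, features_b):
            blocks[key].append(record_id)
    return blocks

blocks = cross_field_blocks(records, whole_field, 'city', whole_field, 'zip')
print(dict(blocks))
# {('Chicago', '60601'): [1, 2], ('Chicago', '60614'): [3]}
```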
Dedupe comes with a long list of predicate blocks and can create canopies for any field. When these are combined, dedupe has hundreds of possible blocking rules to choose from. We want to find a small set of these rules that minimizes the number of comparisons we will have to make while ensuring that nearly all true duplicates end up in some block.
We approach this problem with a greedy algorithm, specifically Chvatal's greedy set-cover algorithm: given a set of pairs labeled as distinct or duplicate, we try to find the best small set of predicates.
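A minimal sketch of that greedy set-cover idea: each candidate predicate "covers" the labeled duplicate pairs it would place in the same block, and we repeatedly pick the predicate that covers the most not-yet-covered duplicates (ignoring, for simplicity, any penalty for the distinct pairs a predicate also blocks together). The predicate names and coverage sets below are hypothetical:

```python
def greedy_predicate_cover(duplicate_pairs, coverage):
    """
    duplicate_pairs: set of labeled duplicate pairs we want blocked together.
    coverage: dict mapping each candidate predicate to the set of duplicate
              pairs it places in the same block.
    Returns a small list of predicates covering (nearly) all duplicates.
    """
    uncovered = set(duplicate_pairs)
    chosen = []
    while uncovered:
        # Pick the predicate that covers the most still-uncovered duplicates.
        best = max(coverage, key=lambda p: len(coverage[p] & uncovered))
        if not coverage[best] & uncovered:
            break  # no remaining predicate covers anything new
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Hypothetical labeled data: pairs (1, 2) and (3, 4) are duplicates.
duplicates = {(1, 2), (3, 4)}
coverage = {
    'same three char start, address': {(1, 2), (3, 4)},
    'whole field, phone':             set(),
    'token field, last name':         {(1, 2)},
}
print(greedy_predicate_cover(duplicates, coverage))
# ['same three char start, address']
```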
Back: Matching records | Next: Grouping duplicates