Slides available at: bit.ly/pybay-dupe (open as Desktop)
Jupyter Notebook source at: github.com/vintasoftware/deduplication-slides/
Real-world data is entered by people, and often it's:
This causes the following problems in the data:
All of that affects the ability to properly extract knowledge from one or more datasets.
The solution is to perform Record Linkage. It works by joining records in a fuzzy way using data like names, addresses, phone numbers, dates, etc.
The term Record Linkage is most commonly used when the linkage is applied across multiple datasets, like joining a Restaurant Food Inspection dataset with an Employee Wage dataset. In fact, someone did just that.
Record Linkage is also known as Data Matching, Entity Resolution, and other names.
What we'll discuss here is a specific application of Record Linkage, called Deduplication, which is applying Record Linkage on a single dataset against itself to find which records are duplicates. Here's a very simple example:
import warnings; warnings.simplefilter('ignore')
import logging; logging.disable(level=logging.INFO)
import pandas as pd
data = [
("Chin's","3200 Las Vegas Boulevard","New York"),
("Chin Bistro","3200 Las Vegas Blvd.","New York"),
("Bistro","3400 Las Vegas Blvd.","New York City"),
("Bistro","3400 Las Vegas B.","NYC"),
]
df = pd.DataFrame(data, columns=['restaurant', 'address', 'city'])
df
A good deduplication on the data above would find that:
(0, 1) are duplicates
(2, 3) are duplicates
(0, 1) and (2, 3) are different, despite being similar
The process for deduplicating a dataset usually is:
Let's explore each of those steps!
Without unique identifiers, we need to match records by fuzzy data like:
So it's important to clean them for matching.
Let's use string functions and regexes to normalize names and remove undesired variations:
import numpy as np
import re
import pprint
company_names = company_names_dirty = [
'APPLE COMPUTER INC',
'APPLE COMPUTER, INC.',
'APPLE INC',
'Apple Computer',
'Apple Computer Co.',
'Apple Computer Company',
'Apple Computer Inc',
'Apple Computer Incorporated',
'Apple Computer, Inc.',
'Apple Inc',
'Apple Inc.',
'Apple, Inc.'
]
print("Lower case:")
company_names = [c.lower() for c in company_names]
pprint.pprint(company_names)
print("Remove irrelevant separators:")
irrelevant_regex = re.compile(r'[^a-z0-9\s]')
company_names = [irrelevant_regex.sub(' ', c) for c in company_names]
pprint.pprint(company_names)
print("Remove multi-spaces:")
multispace_regex = re.compile(r'\s\s+')
company_names = [multispace_regex.sub(' ', c).strip() for c in company_names]
pprint.pprint(company_names)
print("Remove stopwords:")
business_stopwords = { # suppose we got this from somewhere
'computer',
'co',
'company',
'corp',
'corporation',
'inc',
'incorporated',
'llc',
#...
}
company_names = [
' '.join([c_part for c_part in c.split() if c_part not in business_stopwords])
for c in company_names
]
pprint.pprint(company_names)
We can also use the natural language processing library probablepeople to parse company names and extract just the parts we want (or break them into parts and match part by part in the comparison step later):
import probablepeople as pp
pp.parse("Apple Computer Incorporated")
company_names_alternative_1 = [
[
parsed_value
for parsed_value, parsed_type
in pp.parse(c)
if parsed_type == 'CorporationName'
]
for c in company_names_dirty
]
pprint.pprint(company_names_alternative_1)
probablepeople, as the name suggests, can parse people names too:
pp.parse('Mr. Guido van Rossum')
But performance isn't great on non-English names. So you can try nameparser, which is not probabilistic:
from nameparser import HumanName
from IPython.display import display
display(pp.parse('Flávio Juvenal da Silva Jr'))
display(HumanName('Flávio Juvenal da Silva Jr'))
If it's useful to ignore accents, try unidecode:
import unidecode
print("ASCII transliteration:")
brazilian_name = "Flávio"
print(brazilian_name, "->")
print(unidecode.unidecode(brazilian_name))
It's important to consider how the data values were inputted. If a value was typed while hearing it from someone else, try phonetic encoding with doublemetaphone to normalize words that sound alike:
from doublemetaphone import doublemetaphone
print(doublemetaphone('Andrew'))
print(doublemetaphone('André'))
If data was obtained via Optical Character Recognition (OCR), techniques exist to correct common misspellings that OCRs make (like C to L, S to 5, q to g, etc). See Christen, page 47 [2].
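For instance, here's a minimal, hypothetical sketch of that idea for fields that should be purely numeric, such as zip codes (the substitution rules below are illustrative, not taken from Christen):
# Map characters that OCR commonly confuses with digits (illustrative rules)
ocr_digit_confusions = str.maketrans({'S': '5', 'O': '0', 'I': '1', 'B': '8'})
def normalize_ocr_zip(value):
    return value.upper().translate(ocr_digit_confusions)
print(normalize_ocr_zip('94O24'))  # -> '94024'
print(normalize_ocr_zip('95oI4'))  # -> '95014'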
Geocoding street addresses, i.e., converting them to latitude/longitude, is very useful for matching because geocoders usually clean up irrelevant address variations. Also, having lat/lng enables the calculation of geographic distances between addresses.
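As a side note, once we have lat/lng, the distance between two addresses can be computed directly. Here's a minimal sketch using the haversine formula (later we'll let recordlinkage do a similar geographic comparison for us):
from math import radians, sin, cos, asin, sqrt
def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance between two (lat, lng) points, in kilometers
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
# Two illustrative points around Cupertino, roughly a kilometer apart
print(haversine_km(37.3230, -122.0322, 37.3318, -122.0312))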
import requests
import geocoder
full_addresses = [
"2066 Crist Drive, 94024, Los Altos, CALIFORNIA, US",
"2066 Crist Dr, 94024, LOS ALTOS, CALIFORNIA, US",
"20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US",
"20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US",
"10260 Bandley Drive, 95014, Cupertino, CALIFORNIA, US",
"10260 Bandley Dr., 95014, Cupertino, CALIFORNIA, US",
"20525 MARIANI AVENUE, 95014, CUPERTINO, CALIFORNIA, US",
"20525 Mariani Ave, 95014, CUPERTINO, CALIFORNIA, US",
"1 Infinite Loop, 95014, Cupertino, CALIFORNIA, US",
"One Infinite Loop,, 95014, Cupertino, CALIFORNIA, US",
"One Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
"1 Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
]
full_addresses_latlng = []
with requests.Session() as session:
for a in full_addresses:
a_geocoded = geocoder.google(a, session=session)
full_addresses_latlng.append(a_geocoded.latlng)
address_latlng = list(zip(full_addresses, full_addresses_latlng))
pprint.pprint((address_latlng[2], address_latlng[3]))
print()
Google geocoder is able to:
pprint.pprint((address_latlng[8], address_latlng[9]))
However, it gets slightly different addresses in the case of '1' vs 'One'.
Note geocoding from web APIs is slow and has quota limits! Alternatively, you can build your own geocoder with:
If geocoding isn't viable, we can use the library pypostal to normalize addresses:
from postal.expand import expand_address
print(full_addresses[8])
pprint.pprint(expand_address(full_addresses[8]))
print()
print(full_addresses[9])
pprint.pprint(expand_address(full_addresses[9]))
print(full_addresses[2])
pprint.pprint(expand_address(full_addresses[2]))
print()
print(full_addresses[3])
pprint.pprint(expand_address(full_addresses[3]))
The usaddress library can also parse US addresses into labeled parts, which is useful for part-by-part comparison:
import usaddress
print(full_addresses[0])
pprint.pprint(usaddress.parse(full_addresses[0]))
print()
print(full_addresses[1])
pprint.pprint(usaddress.parse(full_addresses[1]))
The phonenumbers library can normalize phone numbers from many countries:
import phonenumbers
print("Phone number normalization:")
phone = "(541) 555-3010"
print(phone, '->')
print(
phonenumbers.format_number(
phonenumbers.parse(phone, 'US'),
phonenumbers.PhoneNumberFormat.E164)
)
The dateparser library can guess date formats and parse them into datetime objects. It can even guess between DD/MM and MM/DD based on the language:
import dateparser
print(dateparser.parse("at 10/1/2018 10am"))
print(dateparser.parse("às 10/1/2018 10:00"))
pprint.pprint(dateparser.DateDataParser().get_date_data("às 10/1/2018 10:00"))
Also check dateutil's parser with fuzzy=True.
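Here's a minimal sketch of what that looks like (fuzzy=True tells the parser to skip tokens it doesn't recognize):
from dateutil import parser
print(parser.parse("see you at 10/1/2018 10am", fuzzy=True))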
Now that we know the tools, let's grab a dataset to preprocess and move on to the other deduplication steps. Our dataset is based* on the Restaurant dataset, a well-known dataset used by researchers. It contains 881 restaurant records, including 150 duplicates.
* We've introduced some changes that you can check by doing a diff restaurant.original.csv restaurant.csv
df_with_truth = pd.read_csv('restaurant.csv', skip_blank_lines=True)
df_with_truth.head(9)
The dataset comes with the true matches indicated by the 'cluster' column. We'll remove it. We'll also remove 'phone' to make things more difficult:
df = df_with_truth.drop(columns=['cluster', 'phone'])
df.head(9)
By running the info method, we can verify our dataset is quite clean, missing only 1 value of 'type'. We'll just replace it with an empty string. Note that on other datasets you might need to do more profiling to know what to fix. Check Mull's talk [1].
df.info(verbose=True, null_counts=True)
df = df.fillna('')
Now we'll clean the column values! Cleaning name:
def assign_no_symbols_name(df):
return df.assign(
name=df['name']
.str.replace(irrelevant_regex, ' ')
.str.replace(multispace_regex, ' '))
df = assign_no_symbols_name(df)
df.head(9)
from collections import Counter
possible_stopwords = Counter(" ".join(df["name"]).split()).most_common(20)
pprint.pprint(possible_stopwords)
def assign_cleaned_name(df):
restaurant_stopwords = {
's', 'the', 'la', 'le', 'of', 'and', 'on', 'l'}
restaurant_stopwords_regex = r'\b(?:{})\b'.format(
'|'.join(restaurant_stopwords))
return df.assign(
name=df['name']
.str.replace(restaurant_stopwords_regex, '')
.str.replace(multispace_regex, ' ')
.str.strip())
df = assign_cleaned_name(df)
df.head(9)
Geocoding addr:
all_addresses = df['addr'].str.cat(df['city'], sep=', ').values
unique_addresses = np.unique(all_addresses)
print(len(all_addresses), len(unique_addresses))
import os.path
import json
geocoding_filename = 'address_to_geocoding.json'
def geocode_addresses(address_to_geocoding):
remaining_addresses = (
set(unique_addresses) -
set(k for k, v in address_to_geocoding.items() if v is not None))
with requests.Session() as session:
for i, address in enumerate(remaining_addresses):
print(f"Geocoding {i + 1}/{len(remaining_addresses)}")
geocode_result = geocoder.google(address, session=session)
address_to_geocoding[address] = geocode_result.json
with open(geocoding_filename, 'w') as f:
json.dump(address_to_geocoding, f, indent=4)
if not os.path.exists(geocoding_filename):
address_to_geocoding = {}
geocode_addresses(address_to_geocoding)
else:
with open(geocoding_filename) as f:
address_to_geocoding = json.load(f)
geocode_addresses(address_to_geocoding)
address_to_postal = {
k: v['postal']
for k, v in address_to_geocoding.items()
if v is not None and 'postal' in v
}
address_to_latlng = {
k: (v['lat'], v['lng'])
for k, v in address_to_geocoding.items()
if v is not None
}
print(f"Failed to get postal from {len(address_to_geocoding) - len(address_to_postal)}")
print(f"Failed to get latlng from {len(address_to_geocoding) - len(address_to_latlng)}")
def assign_postal_lat_lng(df):
addresses = df['addr'].str.cat(df['city'], sep=', ')
addresses_to_postal = [address_to_postal.get(a) for a in addresses]
addresses_to_lat = [address_to_latlng[a][0] if a in address_to_latlng else None for a in addresses]
addresses_to_lng = [address_to_latlng[a][1] if a in address_to_latlng else None for a in addresses]
return df.assign(postal=addresses_to_postal, lat=addresses_to_lat, lng=addresses_to_lng)
df = assign_postal_lat_lng(df)
df.head(6)
Adding addr_variations:
def assign_addr_variations(df):
return df.assign(
addr_variations=df['addr'].apply(
lambda addr: frozenset(expand_address(addr))))
df = assign_addr_variations(df)
df[['name', 'addr', 'addr_variations']].head(6)
Now, with a clean dataset, we can move to the next step.
To explain the following deduplication steps, we'll use the Python library recordlinkage, aka Python Record Linkage Toolkit. We chose it because it doesn't abstract away too many details of the process, while still having a simple API.
We have the cleaned records; now we need the pairs we want to compare to find matches. To produce the pairs, we could do a "full" index, i.e., all records against all records:
import recordlinkage as rl
full_indexer = rl.FullIndex()
pairs = full_indexer.index(df)
print(f"Full index: {len(df)} records, {len(pairs)} pairs")
The formula for the total number of pairs is:
len(df) * (len(df) - 1) / 2 == 387640
The number of pairs grows too fast as the number of records grows: it grows quadratically. That's why we need indexing: to produce only pairs that are good candidates for being duplicates and avoid wasting too much time.
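To get a feel for how quickly that blows up, here's a quick back-of-the-envelope calculation:
# Number of candidate pairs in a full index, for growing dataset sizes
for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"{n:>9} records -> {n * (n - 1) // 2:>15} pairs")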
Indexing is also called blocking because the trivial way to index is to produce pairs that have some column value in common. By doing this, we produce blocks of record pairs and compare only those in the same block:
Below we produce pairs that have equal values for postal code:
postal_indexer = rl.BlockIndex('postal')
postal_index_pairs = postal_indexer.index(df)
print(f"Postal index: {len(postal_index_pairs)} pairs")
We could also produce pairs by sorting the unique values of some column and, for each value, getting the records with neighboring values. The idea is to produce pairs with close values on some column, like johnny and john, or 2018-01-02 and 2018-01-05. It looks like this:
Below we produce pairs that have neighboring values for name:
name_indexer = rl.SortedNeighbourhoodIndex('name', window=7)
name_index_pairs = name_indexer.index(df)
print(f"Name index: {len(name_index_pairs)} pairs")
Note that simply sorting values wouldn't be able to get kamila and camila as neighbors, as sorting is sensitive to the leading characters of strings. There are other ways to index that could handle that; check Christen, chapter 4 [2].
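One such option is phonetic blocking, sketched below (our own example, not used in the rest of this notebook): block on a phonetic encoding of the name, so that similar-sounding values like kamila and camila land in the same block.
# Sketch of phonetic blocking: block on the primary double metaphone
# encoding of the name, so similar-sounding names collide.
df_phonetic = df.assign(
    name_metaphone=df['name'].apply(lambda n: doublemetaphone(n)[0]))
phonetic_indexer = rl.BlockIndex('name_metaphone')
phonetic_index_pairs = phonetic_indexer.index(df_phonetic)
print(f"Phonetic name index: {len(phonetic_index_pairs)} pairs")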
To produce more pairs without introducing redundant pairs, we should union different indexes:
pairs = postal_index_pairs.union(name_index_pairs)
print(f"Postal or name index: {len(pairs)} pairs")
We now have which pairs we want to run comparisons on!
Now we want to run comparisons on the indexed pairs to produce a comparison vector for each pair. A comparison vector represents the similarity between 2 records by holding similarity values between 0 and 1 for each column.
pd.DataFrame([[1.0, 0.75, 0.0, 0.25, 0.0]],
columns=['name', 'addr', 'type', 'latlng', 'addr_variations'],
index=pd.MultiIndex.from_arrays([[100], [200]]))
For example, the comparison vector above means the pair of records (100, 200) has:
100% similar names
75% similar addrs
0% similar types
25% similar latlngs
0% similar addr_variations
To compute the comparison vectors for all indexed pairs, we define a similarity function for each column:
vectorized_intersection = np.vectorize(
lambda x, y: float(bool(x.intersection(y))))
def compare_addr_variations(a1, a2):
return vectorized_intersection(a1, a2)
comp = rl.Compare()
comp.string('name', 'name', method='jarowinkler', label='name')
comp.string('addr', 'addr', method='jarowinkler', label='addr')
comp.string('city', 'city', method='jarowinkler', label='city')
comp.string('type', 'type', method='jarowinkler', label='type')
comp.geo('lat', 'lng', 'lat', 'lng', method='exp', scale=0.5, label='latlng')
comp.compare_vectorized(compare_addr_variations,
'addr_variations', 'addr_variations',
label='addr_variations');
There are many similarity functions we could use for strings. For example:
jarowinkler gives priority to the beginning of the string
levenshtein cares more about the order
lcs cares less about the order
from recordlinkage.algorithms.string import (
    jarowinkler_similarity as jarowinkler,
    levenshtein_similarity as levenshtein,
    longest_common_substring_similarity as lcs)
for s1, s2 in [["arnie morton", "arnie morton s of chicago"],
["the palm", "palm the"]]:
print(f'jarowinkler("{s1}", "{s2}") ==',
jarowinkler([s1], [s2])[0])
print(f'levenshtein("{s1}", "{s2}") ==',
levenshtein([s1], [s2])[0])
print(f'lcs("{s1}", "{s2}") ==',
lcs([s1], [s2])[0])
print()
For more details on the different similarity functions you can use, check Christen, chapter 5 [2]. Also, check the similarity functions implemented in the recordlinkage library.
Now we'll compute the comparison vectors:
comparison_vectors = comp.compute(pairs, df)
comparison_vectors.head(5)
It's good to check the statistics of the vectors to see if they look right:
comparison_vectors.describe()
By looking at the summary we can see it's easier to get higher similarities on 'city', 'addr' and 'type', not so easy on 'name' and 'latlng', and very difficult on 'addr_variations'. Therefore, it seems that:
Now, with our comparison vectors, we'll explore different ways to classify them as matches and nonmatches.
But first, let's remember what's in records 0 to 5:
df.head(6)
A simple way to classify comparison vectors as matches or nonmatches is to compute a weighted average over the vectors to get a score:
scores = np.average(
comparison_vectors.values,
axis=1,
weights=[30, 10, 5, 10, 30, 15])
scored_comparison_vectors = comparison_vectors.assign(score=scores)
scored_comparison_vectors.head(5)
By looking at the data, we know record 0 truly matches 1 and 2, so we'll classify as a match any pair with score >= 0.85. That's our threshold:
matches = comparison_vectors[
scored_comparison_vectors['score'] >= 0.85]
matches.head(5)
Let's check another match we've found with this threshold: what are the matches of record 3?
display(df.loc[[3]])
print("matched:")
display(df.loc[matches.loc[(3,)].index])
Seems OK! Since we have the true match status in the cluster column, we can evaluate how well our threshold classification did:
golden_pairs = rl.BlockIndex('cluster').index(df_with_truth)
print("Golden pairs:", len(golden_pairs))
found_pairs_set = set(matches.index)
golden_pairs_set = set(golden_pairs)
true_positives = golden_pairs_set & found_pairs_set
false_positives = found_pairs_set - golden_pairs_set
false_negatives = golden_pairs_set - found_pairs_set
print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
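Since we'll compute these counts for several approaches, here's a small helper of our own (not part of recordlinkage) to summarize them as precision and recall:
def print_precision_recall(label, true_pos, false_pos, false_neg):
    # precision: fraction of found pairs that are true matches
    # recall: fraction of true matches that were found
    precision = len(true_pos) / (len(true_pos) + len(false_pos))
    recall = len(true_pos) / (len(true_pos) + len(false_neg))
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
print_precision_recall('threshold', true_positives, false_positives, false_negatives)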
We've got a small number of false positives. Some are really tricky cases:
print(f"False positives:")
for false_positive_pair in false_positives:
display(df.loc[list(false_positive_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
On the other hand, we got a lot of false negatives. We've missed a lot of matches!
print(f"False negatives (sample 10 of {len(false_negatives)}):")
for false_negative_pair in list(false_negatives)[:10]:
display(df.loc[list(false_negative_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
We've set the weights and the threshold by guessing, could we do any better?
Instead of trying to guess weights and thresholds, we can train a classifier to learn how to classify matches and nonmatches based on some training data we provide:
df_training = pd.read_csv('restaurant-training.csv', skip_blank_lines=True)
df_training = df_training.drop(columns=['phone'])
df_training
We need to preprocess our training data too:
df_training = assign_no_symbols_name(df_training)
df_training = assign_cleaned_name(df_training)
df_training = assign_postal_lat_lng(df_training)
df_training = assign_addr_variations(df_training)
df_training.head(5)
We'll feed a Support Vector Machine classifier with our training data. SVMs are resilient to noise, can handle correlated features (like 'addr' and 'latlng') and are robust to imbalanced training sets. That last attribute is relevant for deduplication, because we'll usually find more negative pairs than positive pairs to add to a training set. (Bilenko [3])
all_training_pairs = rl.FullIndex().index(df_training)
matches_training_pairs = rl.BlockIndex('cluster').index(df_training)
training_vectors = comp.compute(all_training_pairs, df_training)
svm = rl.SVMClassifier()
svm.learn(training_vectors, matches_training_pairs);
svm_pairs = svm.predict(comparison_vectors)
svm_found_pairs_set = set(svm_pairs)
svm_true_positives = golden_pairs_set & svm_found_pairs_set
svm_false_positives = svm_found_pairs_set - golden_pairs_set
svm_false_negatives = golden_pairs_set - svm_found_pairs_set
print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
print()
print('svm_true_positives total:', len(svm_true_positives))
print('svm_false_positives total:', len(svm_false_positives))
print('svm_false_negatives total:', len(svm_false_negatives))
Much better results! The only false positive we got on the SVM classifier and not on the threshold method is a really difficult case where most columns are very similar:
print("(SVM false positives) - (Threshold false positives):")
for svm_false_positive in (svm_false_positives - false_positives):
display(df.loc[list(svm_false_positive)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
But the SVM was tricked by some apparently simple cases, so we can't be very confident that it really learned to classify matches well:
print("(SVM false negatives) - (Threshold false negatives):")
for svm_false_negative in (svm_false_negatives - false_negatives):
display(df.loc[list(svm_false_negative)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
There are other classifiers in the recordlinkage library we could try, but the truth is:
The alternative to all that uncertainty is...
Active Learning methods identify training examples that "lead to maximal accuracy improvements" (Bilenko [3]), both to train optimal classifier weights and to find optimal indexing/blocking rules! A Python library called Dedupe implements that.
We'll see it in practice. First we define the fields/columns of our data:
import logging; logging.disable(level=logging.NOTSET)
import dedupe
fields = [
{
'field': 'name',
'variable name': 'name',
'type': 'ShortString',
'has missing': True
},
{
'field': 'addr',
'variable name': 'addr',
'type': 'ShortString',
},
{
'field': 'city',
'variable name': 'city',
'type': 'ShortString',
'has missing': True
},
] # ...
Dedupe comes with built-in variables for common data types like names, addresses, latlong, datetimes, thereby reducing the preprocessing work. It also supports custom variables:
def addr_variations_comparator(x, y):
return 1.0 - float(bool(x.intersection(y)))
fields.extend([
{
'field': 'postal',
'variable name': 'postal',
'type': 'ShortString',
'has missing': True
},
{
'field': 'latlng',
'variable name': 'latlng',
'type': 'LatLong',
'has missing': True
},
{
'field': 'addr_variations',
'variable name': 'addr_variations',
'type': 'Custom',
'comparator': addr_variations_comparator
},
])
Another nice feature of Dedupe is the built-in ability to model interactions between fields. What's that for? Imagine we have an exact address match: it makes no sense for 'addr', 'city', 'postal', 'latlng', and 'addr_variations' to all contribute additively to the matching odds, since we expect all of them to match whenever 'addr' and 'city' match exactly. So, to prevent a redundant increase of odds, we should add an interaction on those fields:
fields.extend([
{
'type': 'Interaction',
'interaction variables': [
'addr',
'city',
'postal',
'latlng',
'addr_variations'
]
}
])
Now we initialize the Dedupe instance with the fields. We can also use a pre-saved pickled Dedupe instance:
settings_filename = 'dedupe-settings.pickle'
if os.path.exists(settings_filename):
with open(settings_filename, 'rb') as sf:
deduper = dedupe.StaticDedupe(sf, num_cores=4)
else:
deduper = dedupe.Dedupe(fields, num_cores=4)
We need to adapt the data a bit to the format Dedupe wants:
data_for_dedupe = df.to_dict('index')
for record in data_for_dedupe.values():
# Change nans to None
for k, v in record.items():
if isinstance(v, float) and np.isnan(v):
record[k] = None
# Move lat and lng to a single field latlng
lat = record.pop('lat')
lng = record.pop('lng')
if lat is not None and lng is not None:
record['latlng'] = (lat, lng)
else:
record['latlng'] = None
Here we're using a Dedupe instance that we trained before. Let's check what the training input/output looked like:
training_input_output = 'training-input-output.txt'
if os.path.exists(training_input_output):
with open(training_input_output) as t:
txt = t.read()
print('\n'.join(txt.split('\n')[:114]))
print('...')
You can check the full training at training-input-output.txt. If you want to train it yourself, do a rm dedupe-settings.pickle dedupe-slides-training.json and run this whole Active Learning session again.
if not isinstance(deduper, dedupe.StaticDedupe):
deduper.sample(data_for_dedupe)
training_filename = 'dedupe-slides-training.json'
if os.path.exists(training_filename):
with open(training_filename) as tf:
deduper.readTraining(tf)
dedupe.consoleLabel(deduper)
with open(training_filename, 'w') as tf:
deduper.writeTraining(tf)
deduper.train()
with open(settings_filename, 'wb') as sf:
deduper.writeSettings(sf)
After training, we can see which blocking predicates (indexing rules) the deduper learned from our training input. It's good to do that to check if we trained enough:
deduper.predicates
Those predicates make sense:
The deduper selected those predicates from this extensive list of possible predicates:
deduper.data_model.predicates()
To proceed with the deduplication, we compute the clustering threshold and call the actual match:
import itertools
threshold = deduper.threshold(data_for_dedupe, recall_weight=2)
clustered_dupes = deduper.match(data_for_dedupe, threshold)
dedupe_found_pairs_set = set()
for cluster, __ in clustered_dupes: # we'll explain that later
for pair in itertools.combinations(cluster, 2):
dedupe_found_pairs_set.add(tuple(pair))
Now we'll evaluate how it performed:
dedupe_true_positives = golden_pairs_set & dedupe_found_pairs_set
dedupe_false_positives = dedupe_found_pairs_set - golden_pairs_set
dedupe_false_negatives = golden_pairs_set - dedupe_found_pairs_set
print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
print()
print('svm_true_positives total:', len(svm_true_positives))
print('svm_false_positives total:', len(svm_false_positives))
print('svm_false_negatives total:', len(svm_false_negatives))
print()
print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
Great! It found more true positives than the other methods. But it also found more false positives than SVM. It's possible to control that by lowering recall_weight:
threshold_low_recall = deduper.threshold(data_for_dedupe, recall_weight=0.5)
clustered_dupes_low_recall = deduper.match(data_for_dedupe, threshold_low_recall)
dedupe_low_recall_found_pairs_set = set()
for cluster, __ in clustered_dupes_low_recall:
for pair in itertools.combinations(cluster, 2):
dedupe_low_recall_found_pairs_set.add(tuple(pair))
dedupe_low_recall_true_positives = golden_pairs_set & dedupe_low_recall_found_pairs_set
dedupe_low_recall_false_positives = dedupe_low_recall_found_pairs_set - golden_pairs_set
dedupe_low_recall_false_negatives = golden_pairs_set - dedupe_low_recall_found_pairs_set
print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
print()
print('dedupe_low_recall_true_positives total:', len(dedupe_low_recall_true_positives))
print('dedupe_low_recall_false_positives total:', len(dedupe_low_recall_false_positives))
print('dedupe_low_recall_false_negatives total:', len(dedupe_low_recall_false_negatives))
As we can see, there's always a tradeoff between true positives and false positives, or, more precisely, between precision and recall. You have to decide what's more important for your business.
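Using the helper we sketched earlier, that tradeoff between the two recall_weight settings becomes explicit:
print_precision_recall('dedupe (recall_weight=2)',
                       dedupe_true_positives,
                       dedupe_false_positives,
                       dedupe_false_negatives)
print_precision_recall('dedupe (recall_weight=0.5)',
                       dedupe_low_recall_true_positives,
                       dedupe_low_recall_false_positives,
                       dedupe_low_recall_false_negatives)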
But let's suppose we want to find more true positives and use the previous dedupe_found_pairs_set. What false positives did it find?
print("Dedupe false positives")
for false_positive_pair in list(dedupe_false_positives):
display(df.loc[list(false_positive_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
The deduper is confused by "caffe"s that are located near each other. Could "caffe" be a stopword? Or, even better, we could incorporate some logic for name rarity: if a token isn't rare, like "caffe", it should contribute less to the name similarity than a rarer token like "reggio". It's possible to do that with an interaction with name frequency.
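Here's a rough sketch of that idea (ours, not wired into the Dedupe model above): weight shared name tokens by an inverse document frequency, so that common tokens like "caffe" contribute little and rare ones like "reggio" contribute a lot:
from collections import Counter
import math
# Document frequency of each name token across the dataset
token_counts = Counter(
    token for name in df['name'] for token in set(name.split()))
def rare_token_similarity(name1, name2):
    # Weight each token by a smoothed inverse document frequency,
    # so shared rare tokens count much more than shared common ones.
    def weight(token):
        return 1.0 / math.log(1 + token_counts.get(token, 1))
    tokens1, tokens2 = set(name1.split()), set(name2.split())
    shared = sum(weight(t) for t in tokens1 & tokens2)
    total = sum(weight(t) for t in tokens1 | tokens2)
    return shared / total if total else 0.0
print(rare_token_similarity('caffe reggio', 'caffe dante'))
print(rare_token_similarity('caffe reggio', 'cafe reggio'))
Plugging something like this into Dedupe would take a custom comparator, similar to addr_variations_comparator above; we won't pursue that here.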
Or maybe the deduper could have learned a more restrictive name blocking? Or the classifier should have put less weight on name? It's difficult to guess because those things could have introduced more false negatives too. What about them?
print("Dedupe false negatives")
for false_negative_pair in list(dedupe_false_negatives):
display(df.loc[list(false_negative_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
Some of these false negatives could be prevented with better address normalization, but for others, the data is simply bad: different addresses for the same place, maybe a corner, maybe two entrances? We could try to train the deduper more to fix this. But we'll leave it as it is and move on to the last step of the deduplication process.
Using the Threshold or the SVM, we got the matching pairs. But what Dedupe returned to us were clusters of matches:
clustered_dupes[:5]
Dedupe went one step further in the process and merged the matching pairs into clusters! Why is that important? Because the following can happen:
It doesn't make sense to consider (A, B) and (B, C) as a match, but (A, C) as a nonmatch. The solution for that ambiguity is computing the Transitive Closure with Clustering.
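In its simplest form, the transitive closure just groups matched pairs into connected components. Here's a minimal sketch using networkx (Dedupe's clustering is more sophisticated, since it also takes the match scores into account):
import networkx as nx
# Matches (A, B), (B, C) and (D, E): the closure groups A, B, C as one entity
match_graph = nx.Graph([('A', 'B'), ('B', 'C'), ('D', 'E')])
print(list(nx.connected_components(match_graph)))
# -> something like [{'A', 'B', 'C'}, {'D', 'E'}]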
Using some private methods, it's possible to get the unclustered pairs from Dedupe:
from dedupe.core import scoreDuplicates
candidate_records = itertools.chain.from_iterable(deduper._blockedPairs(deduper._blockData(data_for_dedupe)))
dedupe_matches = scoreDuplicates(candidate_records,
deduper.data_model,
deduper.classifier,
deduper.num_cores)
dedupe_unclustered_found_pairs_set = {tuple(pair) for ([*pair], __) in dedupe_matches}
Let's evaluate those unclustered pairs against the clustered pairs:
dedupe_unclustered_true_positives = golden_pairs_set & dedupe_unclustered_found_pairs_set
dedupe_unclustered_false_positives = dedupe_unclustered_found_pairs_set - golden_pairs_set
dedupe_unclustered_false_negatives = golden_pairs_set - dedupe_unclustered_found_pairs_set
print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
print()
print('dedupe_unclustered_true_positives total:', len(dedupe_unclustered_true_positives))
print('dedupe_unclustered_false_positives total:', len(dedupe_unclustered_false_positives))
print('dedupe_unclustered_false_negatives total:', len(dedupe_unclustered_false_negatives))
We've found the unclustered pairs are different from the clustered pairs! That means the clustering process can both create new matches and drop found matches. Therefore, even though clustering is necessary to disambiguate the deduplication result, it can either improve or worsen the quality of found pairs.
diff_set = dedupe_found_pairs_set ^ dedupe_unclustered_found_pairs_set
display(diff_set)
Here's a case where the clustering process got a new correct match:
from graph_utils import show_cluster_graphs
good_diff_all_ids = {34, 35, 36, 37}
show_cluster_graphs(
df,
golden_pairs_set, dedupe_found_pairs_set, dedupe_unclustered_found_pairs_set,
good_diff_all_ids)
And here's a case where the clustering process dropped a true match:
bad_diff_all_ids = {6, 7, 8}
show_cluster_graphs(
df,
golden_pairs_set, dedupe_found_pairs_set, dedupe_unclustered_found_pairs_set,
bad_diff_all_ids)
Remember the recall_weight parameter of the deduper.threshold method? That's what it controls: how aggressive we want to be about finding or dropping matches while clustering. Check the Dedupe docs for more details on how it performs clustering and how it computes a good threshold.
Finally, it's worth mentioning there's a web-based product version of Dedupe. If you don't want to write code to dedupe a dataset, check it out.
Once we have the clusters, how to consolidate data from many records into one? Check for material on Data Fusion:
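Just to give a taste, here's a naive consolidation rule of our own (a sketch, not taken from any particular reference): for each column, keep the most frequent non-empty value in the cluster, preferring longer values on ties.
from collections import Counter
def fuse_cluster(df, cluster_ids):
    # Naive data fusion: for each column, keep the most common non-empty value,
    # breaking ties by preferring the longest value.
    records = df.loc[list(cluster_ids)]
    fused = {}
    for column in records.columns:
        values = [v for v in records[column] if v not in (None, '')]
        if values:
            counts = Counter(values)
            fused[column] = max(values, key=lambda v: (counts[v], len(str(v))))
        else:
            fused[column] = None
    return fused
# e.g. fuse the first cluster found by Dedupe
first_cluster_ids, __ = clustered_dupes[0]
pprint.pprint(fuse_cluster(df, first_cluster_ids))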
What if new records arrive? Should we merge, unmerge, move records from clusters? Check for material on Incremental Record Linkage:
Also worth checking the Privacy implications of Record Linkage:
flavio@vinta.com.br
@flaviojuvenal
vinta.com.br
Special thanks to Russell Keith-Magee @freakboy3742, Forest Timothy Gregg @forestgregg, and Jonathan de Bruin @J535D165.