1 + 1 = 1 or Record Deduplication with Python

Flávio Juvenal
@flaviojuvenal

Partner at Vinta Software

Slides available at: bit.ly/pybay-dupe (open as Desktop)

Jupyter Notebook source at: github.com/vintasoftware/deduplication-slides/

Introduction

Real-world data is entered by people, and often it's:

  • Not reviewed
  • Not linked with related data
  • Not properly normalized by the input system
  • Or simply incorrect, because people make mistakes: typos, mishearing, miscalculation, misinterpretation, etc.

This causes the following problems in the data:

  • lack of unique identifiers (making it difficult to detect duplicates within a dataset or to link it with other datasets)
  • duplications (e.g. multiple records refer to a single person)
  • inconsistencies (e.g. a person appears with multiple addresses)
  • bad formatting (e.g. birth dates appear with multiple formats like DD/MM/YY and YYYY-MM-DD)

All of that affects the ability to properly extract knowledge from one or more datasets.

The solution is to perform Record Linkage. It works by joining records in a fuzzy way using data like names, addresses, phone numbers, dates, etc.

The term Record Linkage is most often used when the linkage is applied across multiple datasets, like joining a Restaurant Food Inspection dataset with an Employee Wage dataset. In fact, someone did just that.

Record Linkage is also known as Data Matching, Entity Resolution, and other names.

What we'll discuss here is a specific application of Record Linkage, called Deduplication, which is applying Record Linkage on a single dataset against itself to find which records are duplicates. Here's a very simple example:

In [1]:
import warnings; warnings.simplefilter('ignore')
import logging; logging.disable(level=logging.INFO)
In [2]:
import pandas as pd

data = [
    ("Chin's","3200 Las Vegas Boulevard","New York"),
    ("Chin Bistro","3200 Las Vegas Blvd.","New York"),
    ("Bistro","3400 Las Vegas Blvd.","New York City"),
    ("Bistro","3400 Las Vegas B.","NYC"),
]

df = pd.DataFrame(data, columns=['restaurant', 'address', 'city'])
In [3]:
df
Out[3]:
restaurant address city
0 Chin's 3200 Las Vegas Boulevard New York
1 Chin Bistro 3200 Las Vegas Blvd. New York
2 Bistro 3400 Las Vegas Blvd. New York City
3 Bistro 3400 Las Vegas B. NYC

A good deduplication on the data above would find that:

  • (0, 1) are duplicates
  • (2, 3) are duplicates
  • but (0, 1) and (2, 3) are different, despite being similar
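To get some intuition before formalizing the process, here's a minimal sketch (not part of the original notebook) that brute-forces a similarity score for every pair of the toy records using only the standard library. The rest of the talk is essentially about doing this cleaning, pairing, and scoring in a smarter way:

import difflib
import itertools

def naive_similarity(row_a, row_b):
    # Generic string similarity in [0, 1] over the concatenated fields
    return difflib.SequenceMatcher(None, ' '.join(row_a), ' '.join(row_b)).ratio()

for (i, a), (j, b) in itertools.combinations(enumerate(data), 2):
    print((i, j), round(naive_similarity(a, b), 2))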

Process

The process for deduplicating a dataset usually is:

  1. Preprocessing
    • input: dataset
    • output: cleaned dataset
  2. Indexing
    • output: pairs to compare
  3. Comparison
    • output: comparison vectors
  4. Classification
    • output: matching/nonmatching pairs
  5. Clustering
    • output: unique record clusters

Let's explore each of those steps!

0/4 - Preprocessing

Without unique identifiers, we need to match records by fuzzy data like:

  • Names
  • Addresses
  • Phone Numbers
  • Dates

So it's important to clean them for matching.

Cleaning names (companies or people)

Let's use string functions and regexes to normalize names and remove undesired variations:

In [4]:
import numpy as np
import re
import pprint


company_names = company_names_dirty = [
    'APPLE COMPUTER INC',
    'APPLE COMPUTER, INC.',
    'APPLE INC',
    'Apple Computer',
    'Apple Computer Co.',
    'Apple Computer Company',
    'Apple Computer Inc',
    'Apple Computer Incorporated',
    'Apple Computer, Inc.',
    'Apple Inc',
    'Apple Inc.',
    'Apple, Inc.'
]
In [5]:
print("Lower case:")
company_names = [c.lower() for c in company_names]
pprint.pprint(company_names)
Lower case:
['apple computer inc',
 'apple computer, inc.',
 'apple inc',
 'apple computer',
 'apple computer co.',
 'apple computer company',
 'apple computer inc',
 'apple computer incorporated',
 'apple computer, inc.',
 'apple inc',
 'apple inc.',
 'apple, inc.']
In [6]:
print("Remove irrelevant separators:")
irrelevant_regex = re.compile(r'[^a-z0-9\s]')
company_names = [irrelevant_regex.sub(' ', c) for c in company_names]
pprint.pprint(company_names)
Remove irrelevant separators:
['apple computer inc',
 'apple computer  inc ',
 'apple inc',
 'apple computer',
 'apple computer co ',
 'apple computer company',
 'apple computer inc',
 'apple computer incorporated',
 'apple computer  inc ',
 'apple inc',
 'apple inc ',
 'apple  inc ']
In [7]:
print("Remove multi-spaces:")
multispace_regex = re.compile(r'\s\s+')
company_names = [multispace_regex.sub(' ', c).strip() for c in company_names]
pprint.pprint(company_names)
Remove multi-spaces:
['apple computer inc',
 'apple computer inc',
 'apple inc',
 'apple computer',
 'apple computer co',
 'apple computer company',
 'apple computer inc',
 'apple computer incorporated',
 'apple computer inc',
 'apple inc',
 'apple inc',
 'apple inc']
In [8]:
print("Remove stopwords:")
business_stopwords = {  # suppose we got this from somewhere
    'computer',
    'co',
    'company',
    'corp',
    'corporation',
    'inc',
    'incorporated',
    'llc',
    #...
}

company_names = [
    ' '.join([c_part for c_part in c.split() if c_part not in business_stopwords])
    for c in company_names
]
pprint.pprint(company_names)
Remove stopwords:
['apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple',
 'apple']

We can also use the natural language processing library probablepeople to parse company names and extract just the parts we want (or break them into parts and match part by part in the comparison step later):

In [9]:
import probablepeople as pp

pp.parse("Apple Computer Incorporated")
Out[9]:
[('Apple', 'CorporationName'),
 ('Computer', 'CorporationName'),
 ('Incorporated', 'CorporationLegalType')]
In [10]:
company_names_alternative_1 = [
    [
        parsed_value
        for parsed_value, parsed_type
        in pp.parse(c)
        if parsed_type == 'CorporationName'
    ]
    for c in company_names_dirty
]
pprint.pprint(company_names_alternative_1)
[['APPLE', 'COMPUTER'],
 ['APPLE', 'COMPUTER,'],
 ['APPLE'],
 ['Apple', 'Computer'],
 ['Apple', 'Computer'],
 ['Apple', 'Computer'],
 ['Apple', 'Computer'],
 ['Apple', 'Computer'],
 ['Apple', 'Computer,'],
 ['Apple'],
 ['Apple'],
 ['Apple,']]

probablepeople, as the name suggests, can parse people names too:

In [11]:
pp.parse('Mr. Guido van Rossum')
Out[11]:
[('Mr.', 'PrefixMarital'),
 ('Guido', 'GivenName'),
 ('van', 'Surname'),
 ('Rossum', 'Surname')]

But performance isn't great for non-English names. So you can try nameparser, which is not probabilistic:

In [12]:
from nameparser import HumanName
from IPython.display import display

display(pp.parse('Flávio Juvenal da Silva Jr'))
display(HumanName('Flávio Juvenal da Silva Jr'))
[('Flávio', 'CorporationName'),
 ('Juvenal', 'CorporationName'),
 ('da', 'CorporationName'),
 ('Silva', 'CorporationName'),
 ('Jr', 'CorporationName')]
<HumanName : [
	title: '' 
	first: 'Flávio' 
	middle: 'Juvenal' 
	last: 'da Silva' 
	suffix: 'Jr'
	nickname: ''
]>

If it's useful to ignore accents, try unidecode:

In [13]:
import unidecode

print("ASCII transliteration:")
brazilian_name = "Flávio"
print(brazilian_name, "->")
print(unidecode.unidecode(brazilian_name))
ASCII transliteration:
Flávio ->
Flavio

It's important to consider how the data values were inputted. If a value was typed by someone while hearing it from someone else, try phonetic encoding with doublemetaphone to normalize words that sound similar:

In [14]:
from doublemetaphone import doublemetaphone

print(doublemetaphone('Andrew'))
print(doublemetaphone('André'))
('ANTR', 'ANTRF')
('ANTR', 'ANTR')

If data was obtained via Optical Character Recognition (OCR), techniques exist to correct common misspellings that OCRs make (like C to L, S to 5, q to g, etc). See Christen, page 47 [2].
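For illustration, here's a minimal sketch assuming a hand-made table of OCR-confusable characters (the techniques in Christen [2] are more sophisticated than this):

# Map characters that OCR commonly confuses to one canonical form.
# Note this blunt version also rewrites legitimate digits, so in practice
# you'd only apply it to fields that should be purely alphabetic.
ocr_confusions = str.maketrans({'0': 'o', '1': 'l', '5': 's', 'q': 'g'})

def ocr_normalize(text):
    return text.lower().translate(ocr_confusions)

print(ocr_normalize('APPLE C0MPUTER 5TORE'))  # -> apple computer store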

Cleaning addresses

Geocoding street addresses, i.e., converting them to latitude/longitude, is very useful for matching, because geocoders usually clean up irrelevant address variations. Also, having lat/lng enables the calculation of geometric distances between addresses.

In [15]:
import requests
import geocoder

full_addresses = [
    "2066 Crist Drive, 94024, Los Altos, CALIFORNIA, US",
    "2066 Crist Dr, 94024, LOS ALTOS, CALIFORNIA, US",
    "20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US",
    "20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US",
    "10260 Bandley Drive, 95014, Cupertino, CALIFORNIA, US",
    "10260 Bandley Dr., 95014, Cupertino, CALIFORNIA, US",
    "20525 MARIANI AVENUE, 95014, CUPERTINO, CALIFORNIA, US",
    "20525 Mariani Ave, 95014, CUPERTINO, CALIFORNIA, US",
    "1 Infinite Loop, 95014, Cupertino, CALIFORNIA, US",
    "One Infinite Loop,, 95014, Cupertino, CALIFORNIA, US",
    "One Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
    "1 Apple Park Way, 95014, Cupertino, CALIFORNIA, US",
]

full_addresses_latlng = []
with requests.Session() as session:
    for a in full_addresses:
        a_geocoded = geocoder.google(a, session=session)
        full_addresses_latlng.append(a_geocoded.latlng)

address_latlng = list(zip(full_addresses, full_addresses_latlng))
In [16]:
pprint.pprint((address_latlng[2], address_latlng[3]))
print()
(('20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US',
  [37.3241563, -122.0387297]),
 ('20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US',
  [37.3241563, -122.0387297]))

Google geocoder is able to:

  • Ignore lower/upper case difference
  • Ignore the difference between Suite 300 vs 330 (building entrance may be the same)
  • Consider 'Lupertino' as 'Cupertino'
  • Expand abbreviations, like 'STE' vs 'Suite'
In [17]:
pprint.pprint((address_latlng[8], address_latlng[9]))
(('1 Infinite Loop, 95014, Cupertino, CALIFORNIA, US',
  [37.3324975, -122.0289785]),
 ('One Infinite Loop,, 95014, Cupertino, CALIFORNIA, US',
  [37.33182, -122.03118]))

However, it returns slightly different coordinates in the case of '1' vs 'One'.

Note that geocoding from web APIs is slow and has quota limits! Alternatively, you can build your own geocoder.

If geocoding isn't viable, we can use the pypostal library to normalize addresses:

In [18]:
from postal.expand import expand_address

print(full_addresses[8])
pprint.pprint(expand_address(full_addresses[8]))
print()
print(full_addresses[9])
pprint.pprint(expand_address(full_addresses[9]))
1 Infinite Loop, 95014, Cupertino, CALIFORNIA, US
['1 infinite loop 95014 cupertino california us']

One Infinite Loop,, 95014, Cupertino, CALIFORNIA, US
['1 infinite loop 95014 cupertino california us']
In [19]:
print(full_addresses[2])
pprint.pprint(expand_address(full_addresses[2]))
print()
print(full_addresses[3])
pprint.pprint(expand_address(full_addresses[3]))
20863 Stevens Creek Blvd., Suite 300, 95015, CUPERTINO, CALIFORNIA, US
['20863 stevens creek boulevard suite 300 95015 cupertino california us']

20863 STEVENS CREEK BLVD STE 330, 95014, Lupertino, CALIFORNIA, US
['20863 stevens creek boulevard suite 330 95014 lupertino california us',
 '20863 stevens creek boulevard sainte 330 95014 lupertino california us']

From DataMade, the creators of probablepeople, there's a usaddress parser:

In [20]:
import usaddress

print(full_addresses[0])
pprint.pprint(usaddress.parse(full_addresses[0]))
print()
print(full_addresses[1])
pprint.pprint(usaddress.parse(full_addresses[1]))
2066 Crist Drive, 94024, Los Altos, CALIFORNIA, US
[('2066', 'AddressNumber'),
 ('Crist', 'StreetName'),
 ('Drive,', 'StreetNamePostType'),
 ('94024,', 'SubaddressIdentifier'),
 ('Los', 'PlaceName'),
 ('Altos,', 'PlaceName'),
 ('CALIFORNIA,', 'PlaceName'),
 ('US', 'StateName')]

2066 Crist Dr, 94024, LOS ALTOS, CALIFORNIA, US
[('2066', 'AddressNumber'),
 ('Crist', 'StreetName'),
 ('Dr,', 'StreetNamePostType'),
 ('94024,', 'ZipCode'),
 ('LOS', 'PlaceName'),
 ('ALTOS,', 'PlaceName'),
 ('CALIFORNIA,', 'PlaceName'),
 ('US', 'StateName')]

Cleaning phone numbers

The phonenumbers library can normalize phone numbers from many countries:

In [21]:
import phonenumbers

print("Phone number normalization:")
phone = "(541) 555-3010"
print(phone, '->')
print(
    phonenumbers.format_number(
        phonenumbers.parse(phone, 'US'),
        phonenumbers.PhoneNumberFormat.E164)
)
Phone number normalization:
(541) 555-3010 ->
+15415553010

Cleaning dates

The dateparser library can guess date formats and parse them into datetime objects. It can even guess DD/MM vs. MM/DD based on the language:

In [22]:
import dateparser

print(dateparser.parse("at 10/1/2018 10am"))
print(dateparser.parse("às 10/1/2018 10:00"))
pprint.pprint(dateparser.DateDataParser().get_date_data("às 10/1/2018 10:00"))
2018-10-01 10:00:00
2018-01-10 10:00:00
{'date_obj': datetime.datetime(2018, 1, 10, 10, 0),
 'locale': 'pt',
 'period': 'day'}

Check also dateutil parser with fuzzy=True.
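For example, a quick sketch of the dateutil option (dateutil is distributed as the python-dateutil package):

from dateutil import parser

# fuzzy=True makes the parser skip tokens it can't interpret ("Meeting", "at")
print(parser.parse("Meeting at 10/1/2018 10am", fuzzy=True))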

Cleaning a real dataset

Now that we know the tools, let's grab a dataset to preprocess and then go through the other deduplication steps. Our dataset is based* on the Restaurant dataset, a well-known dataset used by researchers. It contains 881 restaurant records, 150 of which are duplicates.

* We've introduced some changes that you can check by doing a diff restaurant.original.csv restaurant.csv

In [23]:
df_with_truth = pd.read_csv('restaurant.csv', skip_blank_lines=True)
df_with_truth.head(9)
Out[23]:
name addr city phone type cluster
0 arnie morton's of chicago 435 s. la cienega blv. los angeles 310/246-1501 american 0
1 arnie morton's of chicago 435 s. la cienega blvd. los angeles 310-246-1501 steakhouses 0
2 arnie morton 435 s. la cienega boulevard los angeles 310-246-1501 steakhouses 0
3 art's delicatessen 12224 ventura blvd. studio city 818/762-1221 american 1
4 art's deli 12224 ventura blvd. studio city 818-762-1221 delis 1
5 art's deli 12224 ventura blvd. los angeles 818-762-1221 delis 1
6 hotel bel-air 701 stone canyon rd. bel air 310/472-1211 californian 2
7 bel-air hotel 701 stone canyon rd. bel air 310-472-1211 californian 2
8 bel-air 701 stone canyon road bel air (310) 472-1211 american 2

The dataset comes with the true matches indicated by the 'cluster' column. We'll remove it. We'll also remove 'phone' to make things more difficult:

In [24]:
df = df_with_truth.drop(columns=['cluster', 'phone'])
df.head(9)
Out[24]:
name addr city type
0 arnie morton's of chicago 435 s. la cienega blv. los angeles american
1 arnie morton's of chicago 435 s. la cienega blvd. los angeles steakhouses
2 arnie morton 435 s. la cienega boulevard los angeles steakhouses
3 art's delicatessen 12224 ventura blvd. studio city american
4 art's deli 12224 ventura blvd. studio city delis
5 art's deli 12224 ventura blvd. los angeles delis
6 hotel bel-air 701 stone canyon rd. bel air californian
7 bel-air hotel 701 stone canyon rd. bel air californian
8 bel-air 701 stone canyon road bel air american

By running the info method, we can see our dataset is quite clean, missing only one value of 'type'. We'll just replace it with an empty string. Note that on other datasets you might need to do more profiling to know what to fix. Check Mull's talk [1].

In [25]:
df.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 881 entries, 0 to 880
Data columns (total 4 columns):
name    881 non-null object
addr    881 non-null object
city    881 non-null object
type    880 non-null object
dtypes: object(4)
memory usage: 27.6+ KB
In [26]:
df = df.fillna('')

Now we'll clean the column values! Cleaning name:

In [27]:
def assign_no_symbols_name(df):
    return df.assign(
        name=df['name']
             .str.replace(irrelevant_regex, ' ')
             .str.replace(multispace_regex, ' '))

df = assign_no_symbols_name(df)
df.head(9)
Out[27]:
name addr city type
0 arnie morton s of chicago 435 s. la cienega blv. los angeles american
1 arnie morton s of chicago 435 s. la cienega blvd. los angeles steakhouses
2 arnie morton 435 s. la cienega boulevard los angeles steakhouses
3 art s delicatessen 12224 ventura blvd. studio city american
4 art s deli 12224 ventura blvd. studio city delis
5 art s deli 12224 ventura blvd. los angeles delis
6 hotel bel air 701 stone canyon rd. bel air californian
7 bel air hotel 701 stone canyon rd. bel air californian
8 bel air 701 stone canyon road bel air american
In [28]:
from collections import Counter

possible_stopwords = Counter(" ".join(df["name"]).split()).most_common(20)
pprint.pprint(possible_stopwords)
[('s', 184),
 ('cafe', 81),
 ('the', 39),
 ('grill', 32),
 ('restaurant', 26),
 ('la', 24),
 ('le', 20),
 ('house', 20),
 ('bar', 19),
 ('of', 15),
 ('bistro', 13),
 ('room', 11),
 ('kitchen', 10),
 ('deli', 9),
 ('club', 9),
 ('and', 9),
 ('ritz', 9),
 ('carlton', 9),
 ('on', 8),
 ('buckhead', 8)]
In [29]:
def assign_cleaned_name(df):
    restaurant_stopwords = {
        's', 'the', 'la', 'le', 'of', 'and', 'on', 'l'}
    restaurant_stopwords_regex = r'\b(?:{})\b'.format(
        '|'.join(restaurant_stopwords))
    return df.assign(
        name=df['name']
             .str.replace(restaurant_stopwords_regex, '')
             .str.replace(multispace_regex, ' ')
             .str.strip())

df = assign_cleaned_name(df)
df.head(9)
Out[29]:
name addr city type
0 arnie morton chicago 435 s. la cienega blv. los angeles american
1 arnie morton chicago 435 s. la cienega blvd. los angeles steakhouses
2 arnie morton 435 s. la cienega boulevard los angeles steakhouses
3 art delicatessen 12224 ventura blvd. studio city american
4 art deli 12224 ventura blvd. studio city delis
5 art deli 12224 ventura blvd. los angeles delis
6 hotel bel air 701 stone canyon rd. bel air californian
7 bel air hotel 701 stone canyon rd. bel air californian
8 bel air 701 stone canyon road bel air american

Geocoding addr:

In [30]:
all_addresses = df['addr'].str.cat(df['city'], sep=', ').values
unique_addresses = np.unique(all_addresses)
print(len(all_addresses), len(unique_addresses))
881 819
In [31]:
import os.path
import json

geocoding_filename = 'address_to_geocoding.json'

def geocode_addresses(address_to_geocoding):
    remaining_addresses = (
        set(unique_addresses) -
        set(k for k, v in address_to_geocoding.items() if v is not None))
    
    with requests.Session() as session:
        for i, address in enumerate(remaining_addresses):
            print(f"Geocoding {i + 1}/{len(remaining_addresses)}")
            geocode_result = geocoder.google(address, session=session)
            address_to_geocoding[address] = geocode_result.json

        with open(geocoding_filename, 'w') as f:
            json.dump(address_to_geocoding, f, indent=4)

if not os.path.exists(geocoding_filename):
    address_to_geocoding = {}
    geocode_addresses(address_to_geocoding)
else:
    with open(geocoding_filename) as f:
        address_to_geocoding = json.load(f)
    geocode_addresses(address_to_geocoding)
 
address_to_postal = {
    k: v['postal']
    for k, v in address_to_geocoding.items()
    if v is not None and 'postal' in v
}
address_to_latlng = {
    k: (v['lat'], v['lng'])
    for k, v in address_to_geocoding.items()
    if v is not None
}
print(f"Failed to get postal from {len(address_to_geocoding) - len(address_to_postal)}")
print(f"Failed to get latlng from {len(address_to_geocoding) - len(address_to_latlng)}")
Failed to get postal from 8
Failed to get latlng from 0
In [32]:
def assign_postal_lat_lng(df):
    addresses = df['addr'].str.cat(df['city'], sep=', ')
    addresses_to_postal = [address_to_postal.get(a) for a in addresses]
    addresses_to_lat = [address_to_latlng[a][0] if a in address_to_latlng else None for a in addresses]
    addresses_to_lng = [address_to_latlng[a][1] if a in address_to_latlng else None for a in addresses]

    return df.assign(postal=addresses_to_postal, lat=addresses_to_lat, lng=addresses_to_lng)

df = assign_postal_lat_lng(df)
df.head(6)
Out[32]:
name addr city type postal lat lng
0 arnie morton chicago 435 s. la cienega blv. los angeles american 90048 34.070609 -118.376722
1 arnie morton chicago 435 s. la cienega blvd. los angeles steakhouses 90048 34.070609 -118.376722
2 arnie morton 435 s. la cienega boulevard los angeles steakhouses 90048 34.070609 -118.376722
3 art delicatessen 12224 ventura blvd. studio city american 91604 34.142966 -118.399469
4 art deli 12224 ventura blvd. studio city delis 91604 34.142966 -118.399469
5 art deli 12224 ventura blvd. los angeles delis 91604 34.142966 -118.399469

Adding addr_variations:

In [33]:
def assign_addr_variations(df):
    return df.assign(
        addr_variations=df['addr'].apply(
            lambda addr: frozenset(expand_address(addr))))

df = assign_addr_variations(df)
df[['name', 'addr', 'addr_variations']].head(6)
Out[33]:
name addr addr_variations
0 arnie morton chicago 435 s. la cienega blv. (435 s louisiana cienega boulevard, 435 south ...
1 arnie morton chicago 435 s. la cienega blvd. (435 san lane cienega boulevard, 435 sur louis...
2 arnie morton 435 s. la cienega boulevard (435 san lane cienega boulevard, 435 s louisia...
3 art delicatessen 12224 ventura blvd. (12224 ventura boulevard)
4 art deli 12224 ventura blvd. (12224 ventura boulevard)
5 art deli 12224 ventura blvd. (12224 ventura boulevard)

Now, with a clean dataset, we can move to the next step.

1/4 - Indexing

To explain the following deduplication steps, we'll use the Python library recordlinkage, aka the Python Record Linkage Toolkit. We chose it because it doesn't abstract away the details of the process too much, even though it has a simple API.

We have the cleaned records; now we need the pairs we want to compare to find matches. To produce the pairs, we could do a "full" index, i.e., all records against all records:

In [34]:
import recordlinkage as rl

full_indexer = rl.FullIndex()
pairs = full_indexer.index(df)

print(f"Full index: {len(df)} records, {len(pairs)} pairs")
Full index: 881 records, 387640 pairs

The formula for the total number of pairs is:
len(df) * (len(df) - 1) / 2 == 387640

The number of pairs grows quadratically with the number of records. That's why we need indexing: we want to produce only pairs that are good candidates for being duplicates, to avoid wasting comparison time.

Indexing is also called blocking because the trivial way to index is to produce pairs that have some column value in common. By doing this, we produce blocks of record pairs and compare only pairs within the same block.

Below we produce pairs that have equal values for postal code:

In [35]:
postal_indexer = rl.BlockIndex('postal')
postal_index_pairs = postal_indexer.index(df)

print(f"Postal index: {len(postal_index_pairs)} pairs")
Postal index: 6536 pairs
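As a sanity check (a sketch, not from the original notebook), we can see roughly where that number comes from: blocking keeps only the pairs inside each block, so the total is the sum of n * (n - 1) / 2 over the block sizes (the exact count from recordlinkage may differ slightly depending on how missing postal values are handled):

block_sizes = df.groupby('postal').size()
print(int((block_sizes * (block_sizes - 1) // 2).sum()))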

We could also produce pairs by sorting the unique values of some column and, for each value, taking the records with neighboring values. The idea is to produce pairs with close values in some column, like johnny and john, or 2018-01-02 and 2018-01-05.

Below we produce pairs that have neighboring values for name:

In [36]:
name_indexer = rl.SortedNeighbourhoodIndex('name', window=7)
name_index_pairs = name_indexer.index(df)

print(f"Name index: {len(name_index_pairs)} pairs")
Name index: 3132 pairs

Note that simply sorting values wouldn't place kamila and camila as neighbors, since sorting is sensitive to the leading characters of strings. There are other ways to index that can handle that; check Christen, chapter 4 [2].
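A tiny illustration of that limitation (not from the original notebook), plus one workaround using the phonetic encoding we imported earlier:

from doublemetaphone import doublemetaphone  # already imported above

names = ['camila', 'carla', 'joana', 'julia', 'kamila', 'karla']
# Plain sorted order keeps 'camila' and 'kamila' far apart,
# so a sliding window over it would never pair them:
print(sorted(names))
# Sorting on a phonetic key tends to bring them together instead:
print(sorted(names, key=lambda n: doublemetaphone(n)[0]))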

To produce more pairs without introducing redundant pairs, we should union different indexes:

In [37]:
pairs = postal_index_pairs.union(name_index_pairs)

print(f"Postal or name index: {len(pairs)} pairs")
Postal or name index: 9465 pairs

We now have the pairs we want to run comparisons on!

2/4 - Comparison

Now we want to run comparisons on the indexed pairs to produce a comparison vector for each pair. A comparison vector represents the similarity between 2 records by holding a similarity value between 0 and 1 for each column.

In [38]:
pd.DataFrame([[1.0, 0.75, 0.0, 0.25, 0.0]],
             columns=['name', 'addr', 'type', 'latlng', 'addr_variations'],
             index=pd.MultiIndex.from_arrays([[100], [200]]))
Out[38]:
name addr type latlng addr_variations
100 200 1.0 0.75 0.0 0.25 0.0

For example, the comparison vector above means the pair of records (100, 200) has:

  • Equal names
  • Similar addrs
  • Completely different types
  • Some distance on latlngs
  • No agreement on addr_variations

To compute the comparison vectors for all indexed pairs, we define a similarity function for each column:

In [39]:
vectorized_intersection = np.vectorize(
    lambda x, y: float(bool(x.intersection(y))))

def compare_addr_variations(a1, a2):
    return vectorized_intersection(a1, a2)

comp = rl.Compare()
comp.string('name', 'name', method='jarowinkler', label='name')
comp.string('addr', 'addr', method='jarowinkler', label='addr')
comp.string('city', 'city', method='jarowinkler', label='city')
comp.string('type', 'type', method='jarowinkler', label='type')
comp.geo('lat', 'lng', 'lat', 'lng', method='exp', scale=0.5, label='latlng')
comp.compare_vectorized(compare_addr_variations,
                        'addr_variations', 'addr_variations',
                        label='addr_variations');

There are many similarity functions we could use for strings. For example:

  • jarowinkler gives priority to the beginning of the string
  • levenshtein cares more about the order
  • lcs cares less about the order
In [40]:
from recordlinkage.algorithms.string import (
    jarowinkler_similarity as jarowinkler,
    levenshtein_similarity as levenshtein,
    longest_common_substring_similarity as lcs)

for s1, s2 in [["arnie morton", "arnie morton s of chicago"],
               ["the palm", "palm the"]]:
    print(f'jarowinkler("{s1}", "{s2}") ==',
          jarowinkler([s1], [s2])[0])
    print(f'levenshtein("{s1}", "{s2}") ==',
          levenshtein([s1], [s2])[0])
    print(f'lcs("{s1}", "{s2}") ==',
          lcs([s1], [s2])[0])
    print()
jarowinkler("arnie morton", "arnie morton s of chicago") == 0.896
levenshtein("arnie morton", "arnie morton s of chicago") == 0.48
lcs("arnie morton", "arnie morton s of chicago") == 0.6486486486486487

jarowinkler("the palm", "palm the") == 0.4166666666666667
levenshtein("the palm", "palm the") == 0.0
lcs("the palm", "palm the") == 0.875

For more details on the different similarity functions you can use, check Christen, chapter 5 [2]. Also, check the similarity functions implemented in the recordlinkage library.

Now we'll compute the comparison vectors:

In [41]:
comparison_vectors = comp.compute(pairs, df)
comparison_vectors.head(5)
Out[41]:
name addr city type latlng addr_variations
0 1 1.000000 0.985507 1.00000 0.310606 1.00000 1.0
2 0.920000 0.910774 1.00000 0.310606 1.00000 1.0
3 0.559722 0.593620 0.40404 1.000000 0.00001 0.0
4 0.572222 0.593620 0.40404 0.550000 0.00001 0.0
5 0.572222 0.593620 1.00000 0.550000 0.00001 0.0

It's good to check the statistics of the vectors to see if they look right:

In [42]:
comparison_vectors.describe()
Out[42]:
name addr city type latlng addr_variations
count 9465.000000 9465.000000 9465.000000 9465.000000 9465.000000 9465.000000
mean 0.515619 0.630213 0.827979 0.558777 0.326727 0.015320
std 0.167495 0.142923 0.263938 0.223105 0.298554 0.122827
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.447712 0.526462 0.589744 0.436508 0.000093 0.000000
50% 0.517949 0.593957 1.000000 0.527381 0.295513 0.000000
75% 0.588462 0.706536 1.000000 0.690476 0.561194 0.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

By looking at the summary we can see it's easier to get higher similarities on 'city', 'addr' and 'type', not so easy on 'name' and 'latlng', and very difficult on 'addr_variations'. Therefore, it seems that:

  • we have few unique values of 'city' in our dataset and that also causes...
  • ...lots of 'addr' to be similar
  • there are only a limited number of 'type' values, so they repeat a lot
  • and getting a 1 on 'addr_variations' is rare since it only considers an exact agreement on address variation

Now, with our comparison vectors, we'll explore different ways to classify them as matches and nonmatches.

3/4 - Classification

First, let's recall what's in records 0 to 5:

In [43]:
df.head(6)
Out[43]:
name addr city type postal lat lng addr_variations
0 arnie morton chicago 435 s. la cienega blv. los angeles american 90048 34.070609 -118.376722 (435 s louisiana cienega boulevard, 435 south ...
1 arnie morton chicago 435 s. la cienega blvd. los angeles steakhouses 90048 34.070609 -118.376722 (435 san lane cienega boulevard, 435 sur louis...
2 arnie morton 435 s. la cienega boulevard los angeles steakhouses 90048 34.070609 -118.376722 (435 san lane cienega boulevard, 435 s louisia...
3 art delicatessen 12224 ventura blvd. studio city american 91604 34.142966 -118.399469 (12224 ventura boulevard)
4 art deli 12224 ventura blvd. studio city delis 91604 34.142966 -118.399469 (12224 ventura boulevard)
5 art deli 12224 ventura blvd. los angeles delis 91604 34.142966 -118.399469 (12224 ventura boulevard)

Threshold-Based Classification

A simple way to classify comparison vectors as matches or nonmatches is to compute a weighted average over the vectors to get a score:

In [44]:
scores = np.average(
    comparison_vectors.values,
    axis=1,
    weights=[30, 10, 5, 10, 30, 15])
scored_comparison_vectors = comparison_vectors.assign(score=scores)
scored_comparison_vectors.head(5)
Out[44]:
name addr city type latlng addr_variations score
0 1 1.000000 0.985507 1.00000 0.310606 1.00000 1.0 0.929611
2 0.920000 0.910774 1.00000 0.310606 1.00000 1.0 0.898138
3 0.559722 0.593620 0.40404 1.000000 0.00001 0.0 0.347484
4 0.572222 0.593620 0.40404 0.550000 0.00001 0.0 0.306234
5 0.572222 0.593620 1.00000 0.550000 0.00001 0.0 0.336032

By looking at the data, we know record 0 truly matches 1 and 2, so we'll classify as a match any pair with score >= 0.85. That's our threshold:

In [45]:
matches = comparison_vectors[
    scored_comparison_vectors['score'] >= 0.85]
matches.head(5)
Out[45]:
name addr city type latlng addr_variations
0 1 1.00 0.985507 1.00000 0.310606 1.0 1.0
2 0.92 0.910774 1.00000 0.310606 1.0 1.0
1 2 0.92 0.923779 1.00000 1.000000 1.0 1.0
3 4 0.90 1.000000 1.00000 0.550000 1.0 1.0
5 0.90 1.000000 0.40404 0.550000 1.0 1.0

Let's check another match we've found with this threshold: what are the matches of record 3?

In [46]:
display(df.loc[[3]])
print("matched:")
display(df.loc[matches.loc[(3,)].index])
name addr city type postal lat lng addr_variations
3 art delicatessen 12224 ventura blvd. studio city american 91604 34.142966 -118.399469 (12224 ventura boulevard)
matched:
name addr city type postal lat lng addr_variations
4 art deli 12224 ventura blvd. studio city delis 91604 34.142966 -118.399469 (12224 ventura boulevard)
5 art deli 12224 ventura blvd. los angeles delis 91604 34.142966 -118.399469 (12224 ventura boulevard)

Seems OK! Since we have the true match status in the 'cluster' column, we can evaluate how well our threshold classification did:

In [47]:
golden_pairs = rl.BlockIndex('cluster').index(df_with_truth)
print("Golden pairs:", len(golden_pairs))
Golden pairs: 150
In [48]:
found_pairs_set = set(matches.index)

golden_pairs_set = set(golden_pairs)

true_positives = golden_pairs_set & found_pairs_set
false_positives = found_pairs_set - golden_pairs_set
false_negatives = golden_pairs_set - found_pairs_set

print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
true_positives total: 99
false_positives total: 7
false_negatives total: 51

We've got a small number of false positives. Some are really tricky cases:

In [49]:
print(f"False positives:")
for false_positive_pair in false_positives:
    display(df.loc[list(false_positive_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
False positives:
name addr city type lat lng
156 palace court 3570 las vegas blvd. s las vegas continental 36.116156 -115.175058
562 cafe roma 3570 las vegas blvd. s las vegas coffee shops/diners 36.116156 -115.175058
name addr city type lat lng
157 palace court 3570 las vegas blvd. s. las vegas french (new) 36.116156 -115.175058
562 cafe roma 3570 las vegas blvd. s las vegas coffee shops/diners 36.116156 -115.175058
name addr city type lat lng
196 ritz carlton cafe buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
198 ritz carlton dining room buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
name addr city type lat lng
199 restaurant ritz carlton atlanta 181 peachtree st. atlanta continental 33.758579 -84.387066
839 ritz carlton cafe atlanta 181 peachtree st. atlanta american (new) 33.758579 -84.387066
name addr city type lat lng
99 les celebrites 160 central park s new york french 40.766016 -73.978641
329 cafe botanica 160 central park s new york french 40.766016 -73.978641
name addr city type lat lng
200 ritz carlton restaurant 181 peachtree st. atlanta french (classic) 33.758579 -84.387066
839 ritz carlton cafe atlanta 181 peachtree st. atlanta american (new) 33.758579 -84.387066
name addr city type lat lng
195 cafe ritz carlton buckhead 3434 peachtree rd. atlanta ext 6108 international 33.850807 -84.364227
197 dining room ritz carlton buckhead 3434 peachtree rd. atlanta international 33.850807 -84.364227

On the other hand, we got a lot of false negatives. We've missed a lot of matches!

In [50]:
print(f"False negatives (sample 10 of {len(false_negatives)}):")
for false_negative_pair in list(false_negatives)[:10]:
    display(df.loc[list(false_negative_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
False negatives (sample 10 of 51):
name addr city type lat lng
170 brasserie coze 3393 peachtree rd. lenox square mall near ne... atlanta french 33.846181 -84.364109
171 brasserie coze 3393 peachtree rd. atlanta french bistro 33.846869 -84.362512
name addr city type lat lng
22 fenix at argyle 8358 sunset blvd. w. hollywood french (new) 34.095097 -118.371967
23 fenix at argyle 8358 sunset blvd. west hollywood american 34.095097 -118.371967
name addr city type lat lng
182 heera india 595 piedmont ave. rio shopping mall atlanta asian 33.795817 -84.370602
183 heera india 595 piedmont ave. atlanta indian 33.770495 -84.381425
name addr city type lat lng
138 sign dove 1110 3rd ave. at 65th st. new york american 40.76564 -73.963968
139 sign dove 1110 third ave. new york city american (new) 40.76564 -73.963968
name addr city type lat lng
164 abruzzi 2355 peachtree rd. peachtree battle shopping ... atlanta italian 33.820137 -84.387280
165 abruzzi 2355 peachtree rd. ne atlanta italian 33.824647 -84.387453
name addr city type lat lng
69 carmine 2450 broadway between 90th and 91st sts. new york italian 40.791096 -73.973991
70 carmine 2450 broadway new york city italian 40.791096 -73.973991
name addr city type lat lng
197 dining room ritz carlton buckhead 3434 peachtree rd. atlanta international 33.850807 -84.364227
198 ritz carlton dining room buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
name addr city type lat lng
28 restaurant katsu 1972 n. hillhurst ave. los angeles asian 34.107405 -118.28719
29 katsu 1972 hillhurst ave. los feliz japanese 34.107405 -118.28719
name addr city type lat lng
36 locanda veneta 8638 w 3rd st los angeles italian 34.073417 -118.381096
37 locanda w. third st. st los angeles italian 34.068958 -118.320928
name addr city type lat lng
168 bone 3130 piedmont road atlanta american 33.842103 -84.371103
169 bone restaurant 3130 piedmont rd. ne atlanta steakhouses 33.842103 -84.371103

We set the weights and the threshold by guessing; could we do better?

Supervised Classification

Instead of trying to guess weights and thresholds, we can train a classifier to learn how to classify matches and nonmatches based on some training data we provide:

In [51]:
df_training = pd.read_csv('restaurant-training.csv', skip_blank_lines=True)
df_training = df_training.drop(columns=['phone'])
df_training
Out[51]:
name addr city type cluster
0 locanda veneta 3rd st. los angeles italian 13
1 locanda veneta 8638 w. third st. los angeles italian 13
2 locanda veneta 8638 w 3rd st los angeles italian 13
3 cafe lalo 201 w. 83rd st. new york coffee bar 26
4 cafe lalo 201 w. 83rd st. new york city coffeehouses 26
5 les celebrites 160 central park s new york french 42
6 les celebrites 155 w. 58th st. new york city french (classic) 42
7 second avenue deli 156 2nd ave. at 10th st. new york delicatessen 58
8 second avenue deli 156 second ave. new york city delis 58
9 smith & wollensky 201 e. 49th st. new york american 62
10 smith & wollensky 797 third ave. new york city steakhouses 62
11 chin's 3200 las vegas blvd. s las vegas asian 67
12 chin's 3200 las vegas blvd. s. las vegas chinese 67
13 toulouse b peachtree rd. atlanta french 92
14 toulouse 293-b peachtree rd. atlanta french (new) 92
15 rose pistola 532 columbus ave. san francisco italian 111
16 rose pistola 532 columbus ave. san francisco italian 111
17 bistro garden 176 n. canon dr. los angeles californian 115
18 remi 3rd st. promenade santa monica italian 159
19 remi 145 w. 53rd st. new york italian 334
20 west 63rd street steakhouse 44 w. 63rd st. new york american 375
21 bistro 3400 las vegas blvd. s las vegas continental 429
22 mikado 3400 las vegas blvd. s las vegas asian 446
23 l'osteria del forno 519 columbus ave. san francisco italian 490
24 stars 150 redwood alley san francisco american 514
25 stars cafe 500 van ness ave. san francisco american 515
26 belvedere the 9882 little santa monica blvd. beverly hills pacific new wave 537
27 bernard's 515 s. olive st. los angeles continental 539
28 bistro 45 45 s. mentor ave. pasadena californian 540
29 cafe '50s 838 lincoln blvd. venice american 545
30 cafe blanc 9777 little santa monica blvd. beverly hills pacific new wave 546
31 la cachette 10506 little santa monica blvd. century city french (new) 568
32 moongate 3400 las vegas blvd. s. las vegas chinese 666

We need to preprocess our training data too:

In [52]:
df_training = assign_no_symbols_name(df_training)
df_training = assign_cleaned_name(df_training)
df_training = assign_postal_lat_lng(df_training)
df_training = assign_addr_variations(df_training)
df_training.head(5)
Out[52]:
name addr city type cluster postal lat lng addr_variations
0 locanda veneta 3rd st. los angeles italian 13 None 34.068958 -118.320928 (3rd saint, 3 saint, 3rd street, 3 street)
1 locanda veneta 8638 w. third st. los angeles italian 13 90048 34.073417 -118.381096 (8638 west 3 street, 8638 west 3rd street, 863...
2 locanda veneta 8638 w 3rd st los angeles italian 13 None NaN NaN (8638 west 3 street, 8638 west 3rd street, 863...
3 cafe lalo 201 w. 83rd st. new york coffee bar 26 10024 40.785981 -73.976727 (201 west 83rd street, 201 west 83rd saint, 20...
4 cafe lalo 201 w. 83rd st. new york city coffeehouses 26 10024 40.785981 -73.976727 (201 west 83rd street, 201 west 83rd saint, 20...

We'll feed our training data to a Support Vector Machine classifier. SVMs are resilient to noise, can handle correlated features (like 'addr' and 'latlng'), and are robust to imbalanced training sets. That last property is relevant for deduplication, because we'll usually find more negative pairs than positive pairs to add to a training set (Bilenko [3]).

In [53]:
all_training_pairs = rl.FullIndex().index(df_training)
matches_training_pairs = rl.BlockIndex('cluster').index(df_training)

training_vectors = comp.compute(all_training_pairs, df_training)

svm = rl.SVMClassifier()
svm.learn(training_vectors, matches_training_pairs);
In [54]:
svm_pairs = svm.predict(comparison_vectors)
svm_found_pairs_set = set(svm_pairs)

svm_true_positives = golden_pairs_set & svm_found_pairs_set
svm_false_positives = svm_found_pairs_set - golden_pairs_set
svm_false_negatives = golden_pairs_set - svm_found_pairs_set

print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
print()
print('svm_true_positives total:', len(svm_true_positives))
print('svm_false_positives total:', len(svm_false_positives))
print('svm_false_negatives total:', len(svm_false_negatives))
true_positives total: 99
false_positives total: 7
false_negatives total: 51

svm_true_positives total: 124
svm_false_positives total: 5
svm_false_negatives total: 26

Much better results! The only false positive we got with the SVM classifier but not with the threshold method is a really difficult case where most columns are very similar:

In [55]:
print("(SVM false positives) - (Threshold false positives):")
for svm_false_positive in (svm_false_positives - false_positives):
    display(df.loc[list(svm_false_positive)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
(SVM false positives) - (Threshold false positives):
name addr city type lat lng
643 stars 150 redwood alley san francisco american 37.780894 -122.419481
644 stars cafe 500 van ness ave. san francisco american 37.780298 -122.420002

But the SVM was tricked by some apparently simple cases, so we can't be very confident it really learned to classify matches well:

In [56]:
print("(SVM false negatives) - (Threshold false negatives):")
for svm_false_negative in (svm_false_negatives - false_negatives):
    display(df.loc[list(svm_false_negative)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
(SVM false negatives) - (Threshold false negatives):
name addr city type lat lng
40 palm 9001 santa monica blvd. los angeles american 34.083064 -118.387282
41 palm los angeles 9001 santa monica blvd. w. hollywood steakhouses 34.083064 -118.387282
name addr city type lat lng
6 hotel bel air 701 stone canyon rd. bel air californian 34.086594 -118.446351
7 bel air hotel 701 stone canyon rd. bel air californian 34.086594 -118.446351

There are other classifiers in the recordlinkage library we could try, but the truth is:

  • It's very difficult to build a good training set that takes into account all important cases of matches/nonmatches
  • It's possible to tune classifier parameters to get better results, but it's very difficult to pick parameters that will generalize well to future predictions
  • And we're not even sure the indexing rules we used are sane: we may be dropping true matches that never get blocked together (adding false negatives), or blocking together nonmatching pairs that our classifier then fails to reject (adding false positives)

The alternative to all that uncertainty is...

Active Learning Classification

Active Learning methods identify training examples that "lead to maximal accuracy improvements" (Bilenko [3]), both to train optimal classifier weights and to find optimal indexing/blocking rules! A Python library called Dedupe implements this.

We'll see it in practice. First we define the fields/columns of our data:

In [57]:
import logging; logging.disable(level=logging.NOTSET)
In [58]:
import dedupe

fields = [
    {
        'field': 'name',
        'variable name': 'name',
        'type': 'ShortString',
        'has missing': True
    },
    {
        'field': 'addr',
        'variable name': 'addr',
        'type': 'ShortString',
    },
    {
        'field': 'city',
        'variable name': 'city',
        'type': 'ShortString',
        'has missing': True
    },
] # ...

Dedupe comes with built-in variables for common data types like names, addresses, latlong, datetimes, thereby reducing the preprocessing work. It also supports custom variables:

In [59]:
def addr_variations_comparator(x, y):
    return 1.0 - float(bool(x.intersection(y)))

fields.extend([
    {
        'field': 'postal',
        'variable name': 'postal',
        'type': 'ShortString',
        'has missing': True
    },
    {
        'field': 'latlng',
        'variable name': 'latlng',
        'type': 'LatLong',
        'has missing': True
    },
    {
        'field': 'addr_variations',
        'variable name': 'addr_variations',
        'type': 'Custom',
        'comparator': addr_variations_comparator
    },
])

Another nice feature of Dedupe is the built-in ability to model interactions between fields. What's that for? Imagine we have an exact address match: it makes no sense to have 'addr', 'city', 'postal', 'latlng', and 'addr_variations' all contributing additively to the matching odds, since we expect all of them to match whenever 'addr' and 'city' match exactly. So to prevent a redundant increase in odds, we add an interaction on those fields:

In [60]:
fields.extend([
    {
        'type': 'Interaction',
        'interaction variables': [
            'addr',
            'city',
            'postal',
            'latlng',
            'addr_variations'
        ]
    }
])

Now we initialize the Dedupe instance with the fields. We can also use a pre-saved pickled Dedupe instance:

In [61]:
settings_filename = 'dedupe-settings.pickle'
if os.path.exists(settings_filename):
    with open(settings_filename, 'rb') as sf:
        deduper = dedupe.StaticDedupe(sf, num_cores=4)
else:
    deduper = dedupe.Dedupe(fields, num_cores=4)
INFO:dedupe.api:((SimplePredicate: (firstTokenPredicate, name), TfidfNGramCanopyPredicate: (0.6, addr)), (SimplePredicate: (commonTwoTokens, name), SimplePredicate: (fingerprint, name)))

We need to adapt the data a bit to the format Dedupe wants:

In [62]:
data_for_dedupe = df.to_dict('index')
for record in data_for_dedupe.values():
    # Change nans to None
    for k, v in record.items():
        if isinstance(v, float) and np.isnan(v):
            record[k] = None
    
    # Move lat and lng to a single field latlng
    lat = record.pop('lat')
    lng = record.pop('lng')
    if lat is not None and lng is not None:
        record['latlng'] = (lat, lng)
    else:
        record['latlng'] = None

Here we're using a Dedupe instance that we trained before. Let's check what the training input/output looked like:

In [63]:
training_input_output = 'training-input-output.txt'
if os.path.exists(training_input_output):
    with open(training_input_output) as t:
        txt = t.read()
        print('\n'.join(txt.split('\n')[:114]))
print('...')
name : cite
addr : 120 w. 51st st.
city : new york
postal : 10019
latlng : (40.7607952, -73.9812268)
addr_variations : frozenset({'120 west 51st saint', '120 west 51 saint', '120 w 51st street', '120 w 51st saint', '120 west 51st street', '120 w 51 saint', '120 w 51 street', '120 west 51 street'})

name : new york noodletown
addr : 28 1/2 bowery at bayard st.
city : new york
postal : 10013
latlng : (40.7150317, -73.9970383)
addr_variations : frozenset({'28 1 2 bowery at bayard saint', '28 1 2 bowery at bayard street'})

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
name : bernardin
addr : 155 w. 51st st.
city : new york city
postal : 10019
latlng : (40.7615691, -73.98180479999999)
addr_variations : frozenset({'155 w 51st saint', '155 west 51st saint', '155 w 51st street', '155 west 51 saint', '155 w 51 street', '155 w 51 saint', '155 west 51 street', '155 west 51st street'})

name : republic
addr : 37a union sq. w  between 16th and 17th sts.
city : new york
postal : 10003
latlng : (40.7369985, -73.9907851)
addr_variations : frozenset({'37a union square w between 16th and 17 streets', '37 a union square west between 16th and 17 streets', '37a union square w between 16th and 17th streets', '37 a union square west between 16th and 17th streets', '37 a union square west between 16 and 17th streets', '37 a union square w between 16th and 17th streets', '37 a union square w between 16 and 17 streets', '37 a union square w between 16th and 17 streets', '37 a union square west between 16 and 17 streets', '37 a union square w between 16 and 17th streets', '37a union square w between 16 and 17 streets', '37a union square w between 16 and 17th streets', '37a union square west between 16 and 17th streets', '37a union square west between 16 and 17 streets', '37a union square west between 16th and 17 streets', '37a union square west between 16th and 17th streets'})

0/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
name : dawat
addr : 210 e. 58th st.
city : new york
postal : None
latlng : None
addr_variations : frozenset({'210 e 58 street', '210 e 58th saint', '210 east 58 saint', '210 east 58th street', '210 east 58 street', '210 e 58 saint', '210 e 58th street', '210 east 58th saint'})

name : dawat
addr : 210 e. 58th st.
city : new york city
postal : 10022
latlng : (40.7604227, -73.9664276)
addr_variations : frozenset({'210 e 58 street', '210 e 58th saint', '210 east 58 saint', '210 east 58th street', '210 east 58 street', '210 e 58 saint', '210 e 58th street', '210 east 58th saint'})

0/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
name : art delicatessen
addr : 12224 ventura blvd.
city : studio city
postal : 91604
latlng : (34.1429661, -118.3994688)
addr_variations : frozenset({'12224 ventura boulevard'})

name : art deli
addr : 12224 ventura blvd.
city : los angeles
postal : 91604
latlng : (34.1429661, -118.3994688)
addr_variations : frozenset({'12224 ventura boulevard'})

1/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (fingerprint, addr), SimplePredicate: (wholeFieldPredicate, name))
name : gotham bar grill
addr : 12 e 12th st
city : new york city
postal : 10003
latlng : (40.734207, -73.99369899999999)
addr_variations : frozenset({'12 e 12th street', '12 east 12 saint', '12 e 12 street', '12 e 12 saint', '12 east 12th saint', '12 east 12th street', '12 east 12 street', '12 e 12th saint'})

name : gotham
addr : 12 e 12th st
city : new york
postal : 10003
latlng : (40.734207, -73.99369899999999)
addr_variations : frozenset({'12 e 12th street', '12 east 12 saint', '12 e 12 street', '12 e 12 saint', '12 east 12th saint', '12 east 12th street', '12 east 12 street', '12 e 12th saint'})

2/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (fingerprint, addr), SimplePredicate: (sortedAcronym, name))
name : dawat
addr : 210 e. 58th st.
city : new york
postal : None
latlng : None
addr_variations : frozenset({'210 e 58 street', '210 e 58th saint', '210 east 58 saint', '210 east 58th street', '210 east 58 street', '210 e 58 saint', '210 e 58th street', '210 east 58th saint'})

name : rainbow restaurant
addr : 2118 n. decatur rd.
city : decatur
postal : 30033
latlng : (33.7908588, -84.3052307)
addr_variations : frozenset({'2118 north decatur road', '2118 n decatur road'})

3/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (fingerprint, addr), TfidfNGramCanopyPredicate: (0.6, name))
...

You can check the full training at training-input-output.txt.

If you want to train it yourself, do a rm dedupe-settings.pickle dedupe-slides-training.json and run this whole Active Learning session again.

In [64]:
if not isinstance(deduper, dedupe.StaticDedupe):
    deduper.sample(data_for_dedupe)
    
    training_filename = 'dedupe-slides-training.json'
    if os.path.exists(training_filename):
        with open(training_filename) as tf:
            deduper.readTraining(tf)

    dedupe.consoleLabel(deduper)
    
    with open(training_filename, 'w') as tf:
        deduper.writeTraining(tf)
    
    deduper.train()
    
    with open(settings_filename, 'wb') as sf:
        deduper.writeSettings(sf)

After training, we can see which blocking predicates (indexing rules) the deduper learned from our training input. It's good to do that to check if we trained enough:

In [65]:
deduper.predicates
Out[65]:
((SimplePredicate: (firstTokenPredicate, name),
  TfidfNGramCanopyPredicate: (0.6, addr)),
 (SimplePredicate: (commonTwoTokens, name),
  SimplePredicate: (fingerprint, name)))

Those predicates make sense:

  • Pairs where the first token of the name is the same AND the addr is similar
  • ... union with ...
  • Pairs where the name shares two tokens AND the fingerprint of the name is the same (fingerprint means the sorted tokens)
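For intuition, here's a rough sketch (not dedupe's actual implementation) of what a fingerprint-style blocking key does. It would, for example, put records 6 and 7 from earlier in the same block despite the different word order:

def fingerprint_key(name):
    # Sort the tokens so word order doesn't matter
    return ' '.join(sorted(name.lower().split()))

print(fingerprint_key('hotel bel air'))  # 'air bel hotel'
print(fingerprint_key('bel air hotel'))  # 'air bel hotel' -> same block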

The deduper selected those predicates from this extensive list of possible predicates:

In [66]:
deduper.data_model.predicates()
Out[66]:
{ExistsPredicate: (Exists, city),
 ExistsPredicate: (Exists, latlng),
 ExistsPredicate: (Exists, name),
 ExistsPredicate: (Exists, postal),
 LevenshteinCanopyPredicate: (1, addr),
 LevenshteinCanopyPredicate: (1, city),
 LevenshteinCanopyPredicate: (1, name),
 LevenshteinCanopyPredicate: (1, postal),
 LevenshteinCanopyPredicate: (2, addr),
 LevenshteinCanopyPredicate: (2, city),
 LevenshteinCanopyPredicate: (2, name),
 LevenshteinCanopyPredicate: (2, postal),
 LevenshteinCanopyPredicate: (3, addr),
 LevenshteinCanopyPredicate: (3, city),
 LevenshteinCanopyPredicate: (3, name),
 LevenshteinCanopyPredicate: (3, postal),
 LevenshteinCanopyPredicate: (4, addr),
 LevenshteinCanopyPredicate: (4, city),
 LevenshteinCanopyPredicate: (4, name),
 LevenshteinCanopyPredicate: (4, postal),
 SimplePredicate: (alphaNumericPredicate, addr),
 SimplePredicate: (alphaNumericPredicate, city),
 SimplePredicate: (alphaNumericPredicate, name),
 SimplePredicate: (alphaNumericPredicate, postal),
 SimplePredicate: (commonFourGram, addr),
 SimplePredicate: (commonFourGram, city),
 SimplePredicate: (commonFourGram, name),
 SimplePredicate: (commonFourGram, postal),
 SimplePredicate: (commonIntegerPredicate, addr),
 SimplePredicate: (commonIntegerPredicate, city),
 SimplePredicate: (commonIntegerPredicate, name),
 SimplePredicate: (commonIntegerPredicate, postal),
 SimplePredicate: (commonSixGram, addr),
 SimplePredicate: (commonSixGram, city),
 SimplePredicate: (commonSixGram, name),
 SimplePredicate: (commonSixGram, postal),
 SimplePredicate: (commonThreeTokens, addr),
 SimplePredicate: (commonThreeTokens, city),
 SimplePredicate: (commonThreeTokens, name),
 SimplePredicate: (commonThreeTokens, postal),
 SimplePredicate: (commonTwoTokens, addr),
 SimplePredicate: (commonTwoTokens, city),
 SimplePredicate: (commonTwoTokens, name),
 SimplePredicate: (commonTwoTokens, postal),
 SimplePredicate: (doubleMetaphone, addr),
 SimplePredicate: (doubleMetaphone, city),
 SimplePredicate: (doubleMetaphone, name),
 SimplePredicate: (doubleMetaphone, postal),
 SimplePredicate: (fingerprint, addr),
 SimplePredicate: (fingerprint, city),
 SimplePredicate: (fingerprint, name),
 SimplePredicate: (fingerprint, postal),
 SimplePredicate: (firstIntegerPredicate, addr),
 SimplePredicate: (firstIntegerPredicate, city),
 SimplePredicate: (firstIntegerPredicate, name),
 SimplePredicate: (firstIntegerPredicate, postal),
 SimplePredicate: (firstTokenPredicate, addr),
 SimplePredicate: (firstTokenPredicate, city),
 SimplePredicate: (firstTokenPredicate, name),
 SimplePredicate: (firstTokenPredicate, postal),
 SimplePredicate: (hundredIntegerPredicate, addr),
 SimplePredicate: (hundredIntegerPredicate, city),
 SimplePredicate: (hundredIntegerPredicate, name),
 SimplePredicate: (hundredIntegerPredicate, postal),
 SimplePredicate: (hundredIntegersOddPredicate, addr),
 SimplePredicate: (hundredIntegersOddPredicate, city),
 SimplePredicate: (hundredIntegersOddPredicate, name),
 SimplePredicate: (hundredIntegersOddPredicate, postal),
 SimplePredicate: (latLongGridPredicate, latlng),
 SimplePredicate: (metaphoneToken, addr),
 SimplePredicate: (metaphoneToken, city),
 SimplePredicate: (metaphoneToken, name),
 SimplePredicate: (metaphoneToken, postal),
 SimplePredicate: (nearIntegersPredicate, addr),
 SimplePredicate: (nearIntegersPredicate, city),
 SimplePredicate: (nearIntegersPredicate, name),
 SimplePredicate: (nearIntegersPredicate, postal),
 SimplePredicate: (oneGramFingerprint, addr),
 SimplePredicate: (oneGramFingerprint, city),
 SimplePredicate: (oneGramFingerprint, name),
 SimplePredicate: (oneGramFingerprint, postal),
 SimplePredicate: (sameFiveCharStartPredicate, addr),
 SimplePredicate: (sameFiveCharStartPredicate, city),
 SimplePredicate: (sameFiveCharStartPredicate, name),
 SimplePredicate: (sameFiveCharStartPredicate, postal),
 SimplePredicate: (sameSevenCharStartPredicate, addr),
 SimplePredicate: (sameSevenCharStartPredicate, city),
 SimplePredicate: (sameSevenCharStartPredicate, name),
 SimplePredicate: (sameSevenCharStartPredicate, postal),
 SimplePredicate: (sameThreeCharStartPredicate, addr),
 SimplePredicate: (sameThreeCharStartPredicate, city),
 SimplePredicate: (sameThreeCharStartPredicate, name),
 SimplePredicate: (sameThreeCharStartPredicate, postal),
 SimplePredicate: (sortedAcronym, addr),
 SimplePredicate: (sortedAcronym, city),
 SimplePredicate: (sortedAcronym, name),
 SimplePredicate: (sortedAcronym, postal),
 SimplePredicate: (suffixArray, addr),
 SimplePredicate: (suffixArray, city),
 SimplePredicate: (suffixArray, name),
 SimplePredicate: (suffixArray, postal),
 SimplePredicate: (tokenFieldPredicate, addr),
 SimplePredicate: (tokenFieldPredicate, city),
 SimplePredicate: (tokenFieldPredicate, name),
 SimplePredicate: (tokenFieldPredicate, postal),
 SimplePredicate: (twoGramFingerprint, addr),
 SimplePredicate: (twoGramFingerprint, city),
 SimplePredicate: (twoGramFingerprint, name),
 SimplePredicate: (twoGramFingerprint, postal),
 SimplePredicate: (wholeFieldPredicate, addr),
 SimplePredicate: (wholeFieldPredicate, city),
 SimplePredicate: (wholeFieldPredicate, name),
 SimplePredicate: (wholeFieldPredicate, postal),
 TfidfNGramCanopyPredicate: (0.2, addr),
 TfidfNGramCanopyPredicate: (0.2, city),
 TfidfNGramCanopyPredicate: (0.2, name),
 TfidfNGramCanopyPredicate: (0.2, postal),
 TfidfNGramCanopyPredicate: (0.4, addr),
 TfidfNGramCanopyPredicate: (0.4, city),
 TfidfNGramCanopyPredicate: (0.4, name),
 TfidfNGramCanopyPredicate: (0.4, postal),
 TfidfNGramCanopyPredicate: (0.6, addr),
 TfidfNGramCanopyPredicate: (0.6, city),
 TfidfNGramCanopyPredicate: (0.6, name),
 TfidfNGramCanopyPredicate: (0.6, postal),
 TfidfNGramCanopyPredicate: (0.8, addr),
 TfidfNGramCanopyPredicate: (0.8, city),
 TfidfNGramCanopyPredicate: (0.8, name),
 TfidfNGramCanopyPredicate: (0.8, postal)}

To proceed with the deduplication, we compute the clustering threshold and call the actual match:

In [67]:
import itertools

threshold = deduper.threshold(data_for_dedupe, recall_weight=2)
clustered_dupes = deduper.match(data_for_dedupe, threshold)

dedupe_found_pairs_set = set()
for cluster, __ in clustered_dupes:  # we'll explain that later
    for pair in itertools.combinations(cluster, 2):
        dedupe_found_pairs_set.add(tuple(pair))
INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 0.998
INFO:dedupe.api:precision: 0.859
INFO:dedupe.api:With threshold: 0.195

Now we'll evaluate how it performed:

In [68]:
dedupe_true_positives = golden_pairs_set & dedupe_found_pairs_set
dedupe_false_positives = dedupe_found_pairs_set - golden_pairs_set
dedupe_false_negatives = golden_pairs_set - dedupe_found_pairs_set

print('true_positives total:', len(true_positives))
print('false_positives total:', len(false_positives))
print('false_negatives total:', len(false_negatives))
print()
print('svm_true_positives total:', len(svm_true_positives))
print('svm_false_positives total:', len(svm_false_positives))
print('svm_false_negatives total:', len(svm_false_negatives))
print()
print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
true_positives total: 99
false_positives total: 7
false_negatives total: 51

svm_true_positives total: 124
svm_false_positives total: 5
svm_false_negatives total: 26

dedupe_true_positives total: 141
dedupe_false_positives total: 7
dedupe_false_negatives total: 9

Great! Dedupe found more true positives than the other methods, but it also produced more false positives than the SVM. We can control that tradeoff by lowering recall_weight:

In [69]:
threshold_low_recall = deduper.threshold(data_for_dedupe, recall_weight=0.5)
clustered_dupes_low_recall = deduper.match(data_for_dedupe, threshold_low_recall)

dedupe_low_recall_found_pairs_set = set()
for cluster, __ in clustered_dupes_low_recall:
    for pair in itertools.combinations(cluster, 2):
        dedupe_low_recall_found_pairs_set.add(tuple(pair))
INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 0.972
INFO:dedupe.api:precision: 0.885
INFO:dedupe.api:With threshold: 0.727
In [70]:
dedupe_low_recall_true_positives = golden_pairs_set & dedupe_low_recall_found_pairs_set
dedupe_low_recall_false_positives = dedupe_low_recall_found_pairs_set - golden_pairs_set
dedupe_low_recall_false_negatives = golden_pairs_set - dedupe_low_recall_found_pairs_set

print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
print()
print('dedupe_low_recall_true_positives total:', len(dedupe_low_recall_true_positives))
print('dedupe_low_recall_false_positives total:', len(dedupe_low_recall_false_positives))
print('dedupe_low_recall_false_negatives total:', len(dedupe_low_recall_false_negatives))
dedupe_true_positives total: 141
dedupe_false_positives total: 7
dedupe_false_negatives total: 9

dedupe_low_recall_true_positives total: 137
dedupe_low_recall_false_positives total: 1
dedupe_low_recall_false_negatives total: 13

As we can see, there's always a tradeoff between true positives and false positives, or, more precisely, between precision and recall. You have to decide what's more important for your business.
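To make the tradeoff concrete, here's a small sketch (not part of the original notebook) that turns the pair counts printed above into precision, recall, and F1 for the two Dedupe runs:

def precision_recall_f1(tp, fp, fn):
    # Precision, recall, and F1 from true/false positive and false negative counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported above for recall_weight=2 and recall_weight=0.5
runs = {
    'dedupe (recall_weight=2)': (141, 7, 9),
    'dedupe (recall_weight=0.5)': (137, 1, 13),
}
for label, (tp, fp, fn) in runs.items():
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f'{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}')

The higher recall_weight run wins on recall, the lower one on precision; which to prefer depends on the cost of each kind of error.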

But let's suppose we prefer to find more true positives and stick with the previous dedupe_found_pairs_set. What false positives did it find?

In [71]:
print("Dedupe false positives")
for false_positive_pair in list(dedupe_false_positives):
    display(df.loc[list(false_positive_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
Dedupe false positives
name addr city type lat lng
337 caffe lure 169 sullivan st. between houston and bleecker... new york french 40.727928 -74.000985
338 caffe reggio 119 macdougal st. between 3rd and bleecker sts. new york coffee bar 40.730308 -74.000371
name addr city type lat lng
762 palm 837 second ave. new york city steakhouses 40.751701 -73.971180
763 palm too 840 second ave. new york city steakhouses 40.751467 -73.970686
name addr city type lat lng
196 ritz carlton cafe buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
198 ritz carlton dining room buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
name addr city type lat lng
335 caffe dante 81 macdougal st. between houston and bleeker ... new york coffee bar 40.728878 -74.001662
338 caffe reggio 119 macdougal st. between 3rd and bleecker sts. new york coffee bar 40.730308 -74.000371
name addr city type lat lng
195 cafe ritz carlton buckhead 3434 peachtree rd. atlanta ext 6108 international 33.850807 -84.364227
198 ritz carlton dining room buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
name addr city type lat lng
200 ritz carlton restaurant 181 peachtree st. atlanta french (classic) 33.758579 -84.387066
839 ritz carlton cafe atlanta 181 peachtree st. atlanta american (new) 33.758579 -84.387066
name addr city type lat lng
335 caffe dante 81 macdougal st. between houston and bleeker ... new york coffee bar 40.728878 -74.001662
337 caffe lure 169 sullivan st. between houston and bleecker... new york french 40.727928 -74.000985

The deduper is confused by "caffe"s located near each other. Maybe "caffe" could be treated as a stopword? Or, even better, we could incorporate some notion of name rarity: a common token like "caffe" should contribute less to the name similarity than a rarer token like "reggio". It's possible to do that with an interaction with name frequency.
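As a rough sketch of that idea (plain Python for illustration only, not Dedupe's interaction API), we could weight each name token by its inverse document frequency across all restaurant names, so that a shared "caffe" counts for much less than a shared "reggio":

import math
from collections import Counter

def idf_weights(names):
    # Inverse document frequency: rare tokens get high weight
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(name.split()))
    n = len(names)
    return {tok: math.log(n / df) for tok, df in doc_freq.items()}

def rarity_weighted_similarity(name_a, name_b, weights):
    # Share of total token weight that the two names have in common
    tokens_a, tokens_b = set(name_a.split()), set(name_b.split())
    common = sum(weights.get(t, 0.0) for t in tokens_a & tokens_b)
    total = sum(weights.get(t, 0.0) for t in tokens_a | tokens_b)
    return common / total if total else 0.0

weights = idf_weights(df['name'])
print(rarity_weighted_similarity('caffe lure', 'caffe reggio', weights))
print(rarity_weighted_similarity('shun lee west', 'shun lee palace', weights))

Under a measure like this, pairs that only share a frequent token such as "caffe" would likely score much lower than pairs sharing rarer tokens.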

Or maybe the deduper could have learned a more restrictive blocking on name? Or the classifier could have put less weight on name? It's hard to tell, because either change could have introduced more false negatives too. What about those?

In [72]:
print("Dedupe false negatives")
for false_negative_pair in list(dedupe_false_negatives):
    display(df.loc[list(false_negative_pair)][['name', 'addr', 'city', 'type', 'lat', 'lng']])
Dedupe false negatives
name addr city type lat lng
219 fringale 570 4th st. san francisco french 37.778542 -122.397193
220 fringale 570 fourth st. san francisco french bistro 37.778542 -122.397193
name addr city type lat lng
6 hotel bel air 701 stone canyon rd. bel air californian 34.086594 -118.446351
7 bel air hotel 701 stone canyon rd. bel air californian 34.086594 -118.446351
name addr city type lat lng
6 hotel bel air 701 stone canyon rd. bel air californian 34.086594 -118.446351
8 bel air 701 stone canyon road bel air american 34.086594 -118.446351
name addr city type lat lng
197 dining room ritz carlton buckhead 3434 peachtree rd. atlanta international 33.850807 -84.364227
198 ritz carlton dining room buckhead 3434 peachtree rd. ne atlanta american (new) 33.850807 -84.364227
name addr city type lat lng
28 restaurant katsu 1972 n. hillhurst ave. los angeles asian 34.107405 -118.28719
29 katsu 1972 hillhurst ave. los feliz japanese 34.107405 -118.28719
name addr city type lat lng
53 spago 1114 horn ave. los angeles californian 34.091172 -118.383161
54 spago los angeles 8795 sunset blvd. w. hollywood californian 34.091132 -118.383290
name addr city type lat lng
168 bone 3130 piedmont road atlanta american 33.842103 -84.371103
169 bone restaurant 3130 piedmont rd. ne atlanta steakhouses 33.842103 -84.371103
name addr city type lat lng
136 shun lee west 43 w. 65th st. new york asian 40.772900 -73.981348
137 shun lee palace 155 e. 55th st. new york city chinese 40.759428 -73.969068
name addr city type lat lng
199 restaurant ritz carlton atlanta 181 peachtree st. atlanta continental 33.758579 -84.387066
200 ritz carlton restaurant 181 peachtree st. atlanta french (classic) 33.758579 -84.387066

Some of these false negatives could be prevented with better address normalization (a minimal sketch of the idea follows below), but on others the data is simply bad: different addresses for the same place, maybe a corner, maybe two entrances? We could train the deduper further to fix this, but we'll leave it as is and move to the last step of the deduplication process.
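For instance, a tiny abbreviation-expansion pass (just a sketch with a hypothetical abbreviation table, not the normalization applied earlier in this notebook) would make "570 4th st." and "570 fourth st." compare as equal:

import re

# Hypothetical, very small abbreviation table; a real normalizer would rely
# on a library such as usaddress or libpostal and a much larger map.
ABBREVIATIONS = {
    'st': 'street', 'rd': 'road', 'ave': 'avenue', 'blvd': 'boulevard',
    '1st': 'first', '2nd': 'second', '3rd': 'third', '4th': 'fourth',
}

def normalize_address(addr):
    # Lowercase, strip punctuation, and expand known abbreviations
    tokens = re.sub(r'[^\w\s]', ' ', addr.lower()).split()
    return ' '.join(ABBREVIATIONS.get(t, t) for t in tokens)

print(normalize_address('570 4th st.'))     # 570 fourth street
print(normalize_address('570 fourth st.'))  # 570 fourth street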

4/4 - Clustering

Using the Threshold or the SVM, we got the matching pairs. But what Dedupe returned to us were clusters of matches:

In [73]:
clustered_dupes[:5]
Out[73]:
[((0, 1, 2), array([0.89782533, 0.8964192 , 0.88561887])),
 ((3, 4, 5), array([0.90827334, 0.91548833, 0.93204328])),
 ((7, 8), array([0.883838, 0.883838])),
 ((9, 10, 11, 12), array([0.91171577, 0.91171577, 0.91171577, 0.91134983])),
 ((13, 14), array([0.91189873, 0.91189873]))]

Dedupe went one step further in the process and merged the matching pairs into clusters! Why is that important? Because the following can happen:

  • We have the records A, B, and C
  • By deduplicating, we found the matching pairs (A, B) and (B, C). However, (A, C) was found to be a nonmatch

It doesn't make sense to treat (A, B) and (B, C) as matches but (A, C) as a nonmatch. The solution to that ambiguity is to compute the Transitive Closure via Clustering.
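As a minimal illustration of that idea (independent of how Dedupe implements it), the transitive closure of pairwise matches is just the set of connected components of the match graph:

import networkx as nx

# Hypothetical pairwise decisions: (A, B) and (B, C) matched, (A, C) didn't
matching_pairs = [('A', 'B'), ('B', 'C')]

graph = nx.Graph()
graph.add_edges_from(matching_pairs)

# Each connected component becomes one cluster of records for the same entity
clusters = list(nx.connected_components(graph))
print(clusters)  # [{'A', 'B', 'C'}] -- (A, C) ends up in the same cluster

Plain connected components are the most permissive choice; Dedupe's own clustering also takes the pairwise scores and the threshold into account, which is why it can both add and drop pairs, as we'll see below.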

Using some of Dedupe's private methods, it's possible to get the matching pairs before clustering:

In [74]:
from dedupe.core import scoreDuplicates

candidate_records = itertools.chain.from_iterable(deduper._blockedPairs(deduper._blockData(data_for_dedupe)))
dedupe_matches = scoreDuplicates(candidate_records,
                                 deduper.data_model,
                                 deduper.classifier,
                                 deduper.num_cores)
dedupe_unclustered_found_pairs_set = {tuple(pair) for ([*pair], __) in dedupe_matches}

Let's evaluate those unclustered pairs against the clustered pairs:

In [75]:
dedupe_unclustered_true_positives = golden_pairs_set & dedupe_unclustered_found_pairs_set
dedupe_unclustered_false_positives = dedupe_unclustered_found_pairs_set - golden_pairs_set
dedupe_unclustered_false_negatives = golden_pairs_set - dedupe_unclustered_found_pairs_set

print('dedupe_true_positives total:', len(dedupe_true_positives))
print('dedupe_false_positives total:', len(dedupe_false_positives))
print('dedupe_false_negatives total:', len(dedupe_false_negatives))
print()
print('dedupe_unclustered_true_positives total:', len(dedupe_unclustered_true_positives))
print('dedupe_unclustered_false_positives total:', len(dedupe_unclustered_false_positives))
print('dedupe_unclustered_false_negatives total:', len(dedupe_unclustered_false_negatives))
dedupe_true_positives total: 141
dedupe_false_positives total: 7
dedupe_false_negatives total: 9

dedupe_unclustered_true_positives total: 142
dedupe_unclustered_false_positives total: 10
dedupe_unclustered_false_negatives total: 8

We've found that the unclustered pairs differ from the clustered pairs! That means the clustering process can both create new matches and drop previously found ones. So even though clustering is necessary to disambiguate the deduplication result, it can either improve or worsen the quality of the found pairs.

In [76]:
diff_set = dedupe_found_pairs_set ^ dedupe_unclustered_found_pairs_set
display(diff_set)
{(6, 7),
 (34, 37),
 (195, 198),
 (196, 200),
 (196, 839),
 (197, 198),
 (198, 200),
 (198, 839)}

Here's a case where the clustering process produced a new correct match:

In [79]:
from graph_utils import show_cluster_graphs

good_diff_all_ids = {34, 35, 36, 37}
show_cluster_graphs(
    df,
    golden_pairs_set, dedupe_found_pairs_set, dedupe_unclustered_found_pairs_set,
    good_diff_all_ids)
name addr city type postal lat lng addr_variations
34 locanda veneta 3rd st. los angeles italian None 34.068958 -118.320928 (3rd saint, 3 saint, 3rd street, 3 street)
35 locanda veneta 8638 w. third st. los angeles italian 90048 34.073417 -118.381096 (8638 west 3 street, 8638 west 3rd street, 863...
36 locanda veneta 8638 w 3rd st los angeles italian 90048 34.073417 -118.381096 (8638 wohnung 3, 8638 wohnung 3rd, 8638 w 3rd,...
37 locanda w. third st. st los angeles italian None 34.068958 -118.320928 (w 3 saint, w 3rd street, west 3rd street, wes...

And here's a case where the clustering process dropped a true match:

In [78]:
bad_diff_all_ids = {6, 7, 8}
show_cluster_graphs(
    df,
    golden_pairs_set, dedupe_found_pairs_set, dedupe_unclustered_found_pairs_set,
    bad_diff_all_ids)
name addr city type postal lat lng addr_variations
8 bel air 701 stone canyon road bel air american 90077 34.086594 -118.446351 (701 stone canyon road)
6 hotel bel air 701 stone canyon rd. bel air californian 90077 34.086594 -118.446351 (701 stone canyon road)
7 bel air hotel 701 stone canyon rd. bel air californian 90077 34.086594 -118.446351 (701 stone canyon road)

Remember the recall_weight parameter of the deduper.threshold method? That's what it controls: how aggressively we want to find or drop matches while clustering. Check the Dedupe docs for more details on how it performs clustering and how it computes a good threshold.
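The log output above ("Maximum expected recall and precision") hints at how: the threshold is chosen to maximize an expected score that trades precision against recall, with recall weighted by recall_weight. A simplified sketch of that selection (using made-up expected precision/recall values, and assuming recall_weight plays the role of beta in an F-beta score, which is an approximation of Dedupe's internals) looks like this:

def weighted_f_score(precision, recall, recall_weight):
    # F-beta style score: recall_weight > 1 favors recall, < 1 favors precision
    beta_sq = recall_weight ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical expected (precision, recall) at a few candidate thresholds
candidates = {0.2: (0.86, 0.99), 0.5: (0.87, 0.98), 0.7: (0.89, 0.97)}

best = max(candidates, key=lambda t: weighted_f_score(*candidates[t], recall_weight=2))
print(best)  # the threshold with the best recall-weighted score

With recall_weight=2 the low, recall-friendly threshold wins; with recall_weight=0.5 the higher-precision threshold would win instead, matching what we saw above.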

Finally, it's worth mentioning there's a web-based product version of Dedupe. If you don't want to write code to dedupe a dataset, check it out.

Next Steps

Once we have the clusters, how do we consolidate the data from many records into one? Check for material on Data Fusion:
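Before diving into that material, a naive baseline (just a sketch, not a real data fusion strategy) is to pick, per field, the most common value within each cluster, breaking ties toward the more complete one:

from collections import Counter

def naive_canonicalize(cluster_records):
    # One canonical value per field: most frequent value, then longest
    canonical = {}
    for field in cluster_records[0]:
        values = [r[field] for r in cluster_records if r[field]]
        counts = Counter(values)
        canonical[field] = max(values, key=lambda v: (counts[v], len(str(v)))) if values else None
    return canonical

cluster = [
    {'name': 'bel air hotel', 'addr': '701 stone canyon rd.'},
    {'name': 'hotel bel air', 'addr': '701 stone canyon road'},
]
print(naive_canonicalize(cluster))
# {'name': 'bel air hotel', 'addr': '701 stone canyon road'}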

What if new records arrive? Should we merge, unmerge, or move records between clusters? Check for material on Incremental Record Linkage:

It's also worth checking the privacy implications of Record Linkage:

References

Thank you!

flavio@vinta.com.br
@flaviojuvenal
vinta.com.br

Special thanks to Russell Keith-Magee @freakboy3742, Forest Timothy Gregg @forestgregg, and Jonathan de Bruin @J535D165.