Address Data Matching Methods for Sanctions Compliance

Too many of the implemented matching models out there suffer from the same problem:

Garbage In

On a serious note, I don’t ever want to hear the saying “garbage in, garbage out” ever again, but alas, I have no intention of hunkering down in a cave until the end of time.

This proverbial phrase is thrown around so frequently in compliance circles that we have become dulled to its implications. However, as companies work tirelessly to ensure high standards for internal records, have we lost sight of the forest for the trees?


Specifically, I am questioning whether governmental institutions issuing sanctions meet the rigorous data-quality standards they impose on market players, and whether the generic matching algorithms used to generate alerts are suitable in all circumstances.

After all, sanctions compliance is a two-way intersection, with company data running perpendicular to the sanctioned-party information published by OFAC and the like. It is the job of a compliance analyst to investigate collisions (record matches) to ensure that the firm is not conducting business with any sanctioned parties.

Two criteria must be met for quality matches to be made:

  1. The internal company data and sanctions record data must be comparable.
  2. The method of comparison must be appropriate for the type of records being compared.

When it comes to matching name records, I would argue watchlist screening vendors (the guys that build the record-matching software) have done a pretty good job. However, current geographic information comparison methods are entirely lacking.

Often, instead of building a separate matching engine for geographic information and treating the comparison of this information as its own unique challenge, string-matching algorithms have been repurposed and lackadaisically applied to geographic data.

In theory, this should be sufficient; after all, we should be able to use those same string-distance-based algorithms to definitively say “123 Apple Street, New York, NY 12345 USA” is not the same as “123 Apple Street, Providence, RI 67890 USA”.

However, that presumption is contingent on both addresses having identical formatting and informational quality.
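
To be fair, when formatting and informational quality do line up, the string-distance approach holds. Here is a quick sketch (using the stringdist R package, which I will lean on again below) comparing those two hypothetical Apple Street addresses:

library(stringdist)

# Two cleanly formatted addresses that differ only in city, state and ZIP
address.a = "123 Apple Street, New York, NY 12345 USA"
address.b = "123 Apple Street, Providence, RI 67890 USA"

# Damerau-Levenshtein edit distance between the raw strings
d = stringdist(address.a, address.b, method = "dl")

# The edits concentrate in the differing city/state/ZIP segment, pulling the
# similarity well below what two renderings of the same address would score
similarity = 1 - (d / nchar(address.a))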

Cue spotty OFAC addressing standards.


Here are some examples sourced from the SDN Address File; as you can see, OFAC’s address data formatting is all over the place:

  • "Emerson No. 148 Piso 7","Mexico, D.F. 11570","Mexico"
  • "211 East 43rd Street","New York, NY 10017","United States"
  • "ul Pavlovskaya, 29","Kiev 01135","Ukraine"
  • "18 Ulitsa D. Ulyanova, Apartment 110","Simferopol, Crimea","Ukraine"
  • "9F-1, No. 22, Hsin Yi Rd., Sec. 2","Taipei","Taiwan"
  • "33 Akti Maouli","Pireas (Piraeus) 185-35","Greece"
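
These lines come straight out of the raw CSV. For reference, here is a minimal sketch of loading the address file into R; the column names are my own shorthand, and I am assuming the standard six-column ADD.CSV layout with OFAC's "-0-" null marker:

# Assumes OFAC's SDN address file (ADD.CSV) has been downloaded locally.
# Column names below are my own shorthand for the fields annotated further down.
sdn.addresses = read.csv("ADD.CSV", header = FALSE, stringsAsFactors = FALSE,
                         na.strings = "-0-",
                         col.names = c("ent_num",           # 1: SDN entity number
                                       "add_num",           # 2: address record number
                                       "address",           # 3: street-level address
                                       "city_state_postal", # 4: city/state/province/postal, all in one field
                                       "country",           # 5: country
                                       "remarks"))          # 6: free-text remarks

head(sdn.addresses[, c("address", "city_state_postal", "country")])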

OFAC has published a general description of their schema (here), but it offers little insight into how the data, given its formatting, is intended to be utilized for matching purposes.

Here are my annotations:

OFAC Address Schema - annotated

First, the address field (#3) has notorious standardization issues, making it the “Valley of Death” – a place where all bad algorithms go to die.

Then, field #4 contains a slew of information that should be parsed into separate fields. OFAC makes extensive use of NULL values, so I am at a loss as to why they chose to condense mid-level geographic information into a single field.

Finally, the country field (#5) relies on American sovereignty designations, which may not be applicable to internationally sourced records.
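
To make the field #4 complaint concrete, here is a rough sketch of the kind of parsing a vendor has to bolt on. The regular expression is purely illustrative and only handles the tidy "City, ST 12345" pattern seen in the U.S. entry above:

# Illustrative only: split a "City, ST 12345"-style field #4 value into parts.
# Many real OFAC entries ("Kiev 01135", "Mexico, D.F. 11570", "Simferopol, Crimea")
# do not follow this pattern, which is exactly the problem.
parseField4 = function(x) {
  m = regmatches(x, regexec("^(.+?),\\s*([A-Z]{2})\\s+(\\d{5})$", x))[[1]]
  if (length(m) == 4) {
    return(data.frame(city = m[2], state = m[3], postal = m[4]))
  }
  data.frame(city = x, state = NA, postal = NA)  # fall back to dumping everything into "city"
}

parseField4("New York, NY 10017")  # parses cleanly
parseField4("Kiev 01135")          # falls through - the postal code never gets split out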

Getting more into the weeds and providing concrete examples of data quality issues, here are all those address examples again (annotated):

OFAC Address Example - annotated

A number of issues present themselves:

  1. Translation & Abbreviations are not consistent. “Ulitsa” and its abbreviated form “Ul” are used interchangeably in records.
    Furthermore, address-term translations are not applied consistently. Hispanic address records use “Piso” and “Floor” interchangeably; Russian records use “Ulitsa” and “Street” interchangeably; and so on (a rough normalization sketch follows this list).
  2. Data Overload is rampant. Entry (6) contains granular address information (“No. 22” and “Sec. 2”) that will likely not make it into business data, making straight comparison difficult.
  3. Sanctioned Jurisdiction data is present in field (2), which is usually mapped to City or State data-fields once records are parsed, whereas some vendor systems use field (3) country data exclusively to screen for sanctioned jurisdictions.
  4. Inconsistent City Naming can show up from time to time. Entry (3)’s “Mexico, D.F.” stands for “Mexico Distrito Federal” which is commonly known as “Mexico City”. Entry (7) contains two alternate spellings for the city’s name.
  5. Sovereignty Issues can come into play, as with entries (5) and (6). Russian business data may list Simferopol’s country as Russia, and similarly, Chinese business data may list Taiwan’s country as China.
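
Issue #1, at least, can be partially mitigated before any comparison is run. A rough normalization sketch, with a dictionary that is nowhere near exhaustive:

# Map common abbreviations and translations onto a single canonical token
# before any string comparison is attempted. The dictionary below is deliberately
# tiny; a production version would need far broader coverage (and care around
# false expansions such as "St." for Saint).
normalizeTerms = function(address) {
  synonyms = c("\\bUL\\b"   = "ULITSA",
               "\\bSTR\\b"  = "STREET",
               "\\bST\\b"   = "STREET",
               "\\bPISO\\b" = "FLOOR")
  out = toupper(address)
  for (pattern in names(synonyms)) {
    out = gsub(pattern, synonyms[[pattern]], out)
  }
  out
}

normalizeTerms("ul Pavlovskaya, 29")  # "ULITSA PAVLOVSKAYA, 29"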

A single issue alone is enough to generate false negatives when a string-matching algorithm is repurposed for address matching. To illustrate, I applied a simple Damerau-Levenshtein edit distance function to an OFAC record and a reasonable variant. Let’s compare entry (4)’s “ul Pavlovskaya, 29” address with the “29 Ulitsa Pavlovskaya” variant, where “ulitsa” is unabbreviated and the street number has been brought to the front of the address line to comply with American address notation.

rm(list = ls())
library(stringdist)

# Text standardization function: remove all non-alphanumeric characters & make text ALL-CAPS
clean = function(text) {
  toupper(gsub("[^[:alnum:]]", "", text))
}

address.ofac = "ul Pavlovskaya, 29"    # OFAC address
address.data = "29 Ulitsa Pavlovskaya" # reformatted "company" address

# Damerau-Levenshtein edit distance between the standardized strings
editDistance = stringdist(clean(address.ofac), clean(address.data), method = "dl")
# editDistance = 8

# Similarity relative to the length of the original OFAC address
similarity = 1 - (editDistance/nchar(address.ofac))
# similarity = 55.56%

The edit distance between the two address strings is 8, meaning that they are only about 56% similar relative to the length of the original OFAC address (18 characters).

Hence, false-negative matches would reign supreme if this string-matching-based methodology were applied in a production environment, and their implications could be devastating when paired with Address-based Alert Suppression features.

Before we get into the technical aspects of address data utilization, I would like to point out the obvious: Regulatory institutions are not technology companies.

Consequently, these data-standardization deficiencies are here to stay, and it becomes the responsibility of watchlist screening software vendors to ensure that address data is properly cleaned and parsed and that the appropriate screening methodologies are applied.

Q: So how do we ‘address’ data-quality issues with geographic information?
A: Stop relying on string-matching algorithms to compare address data.

The proposed methodology standardizes both watch-list and business address data into coordinate locations, piggy-backing on systems already designed to compensate for address variation, and then runs distance-based address comparison calculations.

At its heart, this involves training and implementing a machine-learning model to compensate for address formatting variations.

Fortunately, vendors don’t need to reinvent the wheel: it is easy to piggy-back on established systems to standardize both watch-list and business data so they can be compared on an equal footing.

I am going to use Google’s geocoding service for data standardization, but there are plenty of other options.

rm(list = ls())

# Call dependent packages
library(RCurl)
library(jsonlite)
library(plyr)
library(geosphere)

# Function to form the geocoding API query URL
# (note: Google's geocoding endpoint now generally requires HTTPS and an API key)
url = function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

# Function to call the API and parse the query return
geoCode <- function(address, verbose = FALSE) {
  if(verbose) cat(address, "\n")
  u <- url(address)
  doc <- getURL(u)
  x <- fromJSON(doc)
  if(x$status == "OK") {
    lat <- x$results$geometry$location$lat
    lng <- x$results$geometry$location$lng
    location_type <- x$results$geometry$location_type
    formatted_address <- x$results$formatted_address
    return(c(lat, lng, location_type, formatted_address))
  } else {
    return(c(NA, NA, NA, NA))
  }
}

address.ofac = "ul Pavlovskaya, 29, Kiev 01135 Ukraine"    # OFAC address
address.data = "29 Ulitsa Pavlovskaya, Kiev 01135 Ukraine" # reformatted "company" address

# Geocode both addresses: each call returns c(lat, lng, location_type, formatted_address)
geo.ofac = geoCode(address.ofac)
geo.data = geoCode(address.data)

# Compute haversine distance (meters) between the two addresses
# distHaversine() expects (lon, lat), hence the [2:1] index reversal
geo.dist = distHaversine(p1 = as.numeric(geo.ofac[2:1]), p2 = as.numeric(geo.data[2:1]))
# geo.dist = 0

# If distance < 100m then generate an alert
if(geo.dist < 100){writeLines("ALERT - POTENTIAL ADDRESS MATCH")}

If you run this code, you get back, for each location, the latitude/longitude coordinates, a location-type accuracy indicator, and Google’s standardized formatted address.

This data can then be parsed to isolate the corresponding longitude/latitude coordinates and to calculate the Haversine distance between the two points, “as the crow flies”. The monitoring system can treat the addresses as potential matches if the distance between the two points falls below an arbitrary threshold.

Of course, there are additional factors that need to be considered when implementing such a methodology, such as distance-scaling factors based on coordinate accuracy, bad-address treatment, and so on.
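
For example, one way to handle the accuracy point is to widen the match radius when Google reports a less precise location_type. The thresholds below are purely illustrative, not calibrated values:

# Illustrative thresholds only. Google's geocoder reports a location_type for each
# result (ROOFTOP, RANGE_INTERPOLATED, GEOMETRIC_CENTER, APPROXIMATE);
# the less precise the geocode, the wider the radius we tolerate.
matchThreshold = function(location_type) {
  if (is.na(location_type)) return(Inf)  # unknown or failed geocode: err on the side of alerting
  switch(location_type,
         "ROOFTOP"            = 100,     # meters
         "RANGE_INTERPOLATED" = 250,
         "GEOMETRIC_CENTER"   = 1000,
         "APPROXIMATE"        = 5000,
         Inf)                            # unrecognized type: err on the side of alerting
}

threshold = max(matchThreshold(geo.ofac[3]), matchThreshold(geo.data[3]))
if (!is.na(geo.dist) && geo.dist < threshold) {
  writeLines("ALERT - POTENTIAL ADDRESS MATCH")
}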

But using a coordinate-based comparison methodology in tandem with current string-distance methodologies would at least put us on the right path.
