The holly grail of business intelligence is to have “one version of the truth” by integrating all systems within an organization and it’s ecosystem into a single, unified database. One of the great obstacles to doing this is the lack of common foreign key relationships like the ones you would fine within a single well designed system. This article shows you a way to use join records based on similarity. For instance, the CRM system might contain “Abc Inc.” while the accounting system has “Abc Incorporated”. A direct join would not connect these systems. Using similarity joins we can. The Killer App is to detect names that are similar but not exact matches.
A Little Help From Vladimir
Since there is no similarity operator in SQL, we’ll have to create one. We’ll leverage the idea of Levenshtein distance which tells us how far apart two strings are. Essentially this tells how how many edits are required to change one string into another. An exact match would have a distance of 0. The words “cat” and “hat” would have a distance of 1 because you’d need to make a single edit to turn one into another. This algorithm was discovered in 1965, before the software for the lunar landar was completed by a guy named Vladimir Levenshtein.
Normalizing It
While the original algorithms can be very helpful the “number of edits” value makes it hard to do large scale aggregate analysis because it is not relative to the number of letters. Only needing a single edit on a short word like “cat” is a lot less significant than only needing a single edit on an entire English paper. Having a score of 1 on the former still leaves some ambiguity while having the same score on the english paper reveals outright plagiarism. So to compensate we will divide by the number of letters in larger word. We will than subtract the value from 1 so the result is more consistent with statistics and probability. So having a 1 means there is a 100% chance of the words matching while lower values indicate a lower probability.
Speeding It Up
Since this is a fairly complex algorithm, we’ll assume that most matches are exact and put in an optimization to just return 1 for those.
Putting It In Action
One of the first appications would be to discover names of accounts that are similar but not exact matches. This query shows us which contacts have names most similar to their companies.
SELECT CompanyName,ContactName, dbo.Similarity(CompanyName,ContactName) FROM
Customers
ORDER BY dbo.Similarity(CompanyName,ContactName) DESC
The Results
We ran this on the customers table on the Northwind database. These two fields were never meant to be used together but our SIMILARITY function revealed the ones where the customer name and the contact name were similar.

Looks like Antonio got the top score because his Taqueria is named after him. There is about a 60% chance that the two are related. You can use this information to find patterns in your database or even to join together data based on probability. Click here to get the source for the SIMILARITY function. Note that this is designed to demonstrate the idea and a production version would include more optimizations. Another addition would be to give a boost to items that are subsets of each other or add special translations

Comments on this entry are closed.