String edit distance (Levenshtein)

  • 11 September 2018
  • 4 replies
  • 327 views

Userlevel 1

Hi all,


Is there a simple manner to compute the edit distance between two strings ?

In the following example, string comparison is performed in order to check whether the name of a user in the db looks alike his email.

I computed the following proxy: the email contains the first name or the last name, but I would like for a more generic function, such as the Levenshtein distance : how many edits from one string to the other. ‘johnsmith’ is 1 step distant from ‘john_smith’ and 3 steps to ‘jsmith’, but 9 steps to ‘billbrown’.



dimension: is_email_consistent {

type: yesno

sql: ${email} like concat(’%’,lower(${firstname}),’%’) OR ${email} like concat(’%’,lower(${lastname}),’%’) ;;

}



Do you know any more clever way to perform such distance ?


Many thanks in advance


4 replies

Userlevel 1

What database are you using? Postgres natively implements a levenshtein() function

We implemented Levenshtein distance on our Redshift cluster using a Python UDF (https://docs.aws.amazon.com/redshift/latest/dg/udf-creating-a-scalar-udf.html). User-defined functions might be available on your database as well.

Userlevel 7
Badge

I found that for the use case I cared about, Levenshtein distance was actually not a very helpful metric. I was using BigQuery, so was able to use a Javascript-based UDF. It basically looks for how many matching n-grams there are between the two strings, with the ability to configure stop words:


Userlevel 1

Thanks, I used a simplified version of your code to score consistencies of emails with last name/first name, in order to raise a flag on fraudulent behaviours (among many other flags)

Reply