Finding closest matches by name in BigQuery

  • 15 June 2018
  • 4 replies

Userlevel 7

We’ve all been there before: someone gives you a dataset without ID’s, keyed only by name. You either file it away and never look at it again, or spend the next two days matching up the records by hand.

Not to condone this situation, but it happens, so here’s a query (in BigQuery using a Javascript UDF) to help speed this up. I tried simply using Levenshtein distance, but found too many mismatches, so switched to a strategy of counting matching substrings of a given length. Even with this approach, there will be mismatches, so use this to speed up a manual review, rather than simply making it a PDT.

Note that you can customize:

  • stopWords, which get filtered out entirely. I chose this specific list when trying to match company names

  • tokens, which get shortened to limit their impact on the match score

  • grams, the number of consecutive characters that are considered when calculating the score. Maybe set it down to 2 or 3 if your dataset is composed of phrases with shorter words

    LANGUAGE js AS """
    var stopWords = /\\b(The|inc|ASP|Co|Ltd|LLC|SA|SAS|BV|Corp|formerly|previously|by|as|dba|com)\\b/ig
    var stopChars = /[-_ \\.,'"'\\/()]/g
    var tokens = /\\b(Global|Media Group|Group|Limited|Solutions|Software|Computing|Networks?|Systems|Health|Brands|Project|Services|Technologies|Technology|Companies|Holdings?|Studios?|Digital|Mobile|Financial|Management|Media|Maintenance|Agency|Support|Bank|reseller|account|internal|external|International|Dental|University|Corporation|Labs|Enterprises?|GmbH|Industries|Incorporated|Entertainment|Society|Security|Interactive|Capital)\\b/ig
    var grams = 4
    var tokenValue = 2
    var substrs1 = extract(str1)
    var substrs2 = extract(str2)
    var maxL = max(substrs1.length, substrs2.length)
    return ( parseInt(100*(
    ) / (0.01+maxL) / 2))
    function max(a,b){return a>b?a:b}
    function extract(str){
    return str.toLowerCase()
    .replace(tokens, m=>m.slice(0,tokenValue).toUpperCase())
    WITH closest_by_name as (
    MAX(match(, as similarity,
    CASE MAX(match(,
    CAST(ROUND(1000+match(, AS string)
    ,CAST(1000000 + official.tie_breaker as string)
    ),5+7) END as closest_name,
    CASE MAX(match(,
    CAST(ROUND(1000+match(, AS string)
    ,CAST(1000000 + official.tie_breaker as string)
    ),5+7) END as corresponding_id
    FROM (
    SELECT DISTINCT name FROM `table1`
    ) AS dataset
    SELECT id, name, ROW_NUMBER() OVER () as tie_breaker
    FROM `table2`
    --WHERE optional filter
    ) as official
    GROUP BY 1
    SELECT, main.similarity, main.closest_name, main.id_of_closest_name,
    STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN END) as id_is_better_matched_to,
    STRING_AGG( CASE WHEN dupe.similarity < main.similarity THEN END) as id_is_also_closest_to,
    CASE WHEN STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN END) IS NULL THEN main.id_of_closest_name END as suggested_id
    FROM closest_by_name as main
    LEFT JOIN closest_by_name as dupe
    ON dupe.id_of_closest_name = main.id_of_closest_name
    AND <>
    GROUP BY 1,2,3,4

4 replies

Hey @fabio! I know this is a bit of an old thread, but I was hoping you might be able to explain what each line of the UDF does to someone who has no background at all in JS? Your function has been super handy for me, since I’ll occasionally get a request where people want me to find the respective IDs for a list of app names.

I’ve been trying to remedy two issues I’ve been having with it, and I thought maybe if I understand the function a bit better, I’ll be able to solve them. The first issue I’m having is that some apps share the exact same name across iOS and Android platforms, and I’m trying to get the query to return a match for both of them, if possible. The second issue is that, in the cases of apps that are sequels, sometimes the query will return the ID of the original game instead of the sequel. For example, I know the app “Shadow Fight 2” is in our database, but the query will return the ID for “Shadow Fight” instead. Is there any quick way to account for that?

Sorry again for bumping such an old thread, but I look forward to hearing back from you about this! Thanks again for creating this useful function!

Userlevel 7


As for returning multiple matches, I would look at the SQL instead of the Javascript. All that the JS does is compare a pair of names and return a score. The SQL takes care of queuing up a comparison between all the names, and then picking the highest scores.

So for example, instead of SELECT name1, MAX(score & name2) GROUP BY name1, you might want WITH top_scores AS (SELECT name1, ROW_NUMBER() OVER (parition by name1 order by score) as rank, name2) SELECT * FROM top_scores WHERE rank <=3

As for why the original is picked instead of the sequel, I think that might be because they are both getting very high scores which are getting rounded up to 100 by the parseInt function and then SQL is just randomly choosing one of the two 100’s. I can’t tell/remember why I put the parseInt in there, maybe just because it looked nicer with fewer decimal places… I would try just removing that function call, and it should help.

Thanks a ton for taking the time to write up a response; I really appreciate it! I’ll give those changes a shot.

Thanks again!


sorry for necroposting but I created an account here specifically to thank you @fabio for your answer here. It just saved me weeks of work: on one side because I didn’t know I could write javascript code in bigquery, and on the other because your algorithm proved to be way more effective than any other more complex formula out there, when dealing with company names. I know such posts are not useful at all and I’m sorry about that, but I thought you had not been thanked enough for such an useful response :D