Question

Finding closest matches by name in BigQuery

  • 15 June 2018
  • 3 replies
  • 782 views

Userlevel 6
Badge

We’ve all been there before: someone gives you a dataset without ID’s, keyed only by name. You either file it away and never look at it again, or spend the next two days matching up the records by hand.


Not to condone this situation, but it happens, so here’s a query (in BigQuery using a Javascript UDF) to help speed this up. I tried simply using Levenshtein distance, but found too many mismatches, so switched to a strategy of counting matching substrings of a given length. Even with this approach, there will be mismatches, so use this to speed up a manual review, rather than simply making it a PDT.


Note that you can customize:




  • stopWords, which get filtered out entirely. I chose this specific list when trying to match company names




  • tokens, which get shortened to limit their impact on the match score




  • grams, the number of consecutive characters that are considered when calculating the score. Maybe set it down to 2 or 3 if your dataset is composed of phrases with shorter words


      CREATE TEMPORARY FUNCTION match(str1 STRING, str2 STRING)
    RETURNS FLOAT64
    LANGUAGE js AS """
    var stopWords = /\\b(The|inc|ASP|Co|Ltd|LLC|SA|SAS|BV|Corp|formerly|previously|by|as|dba|com)\\b/ig
    var stopChars = /[-_ \\.,'"'\\/()]/g
    var tokens = /\\b(Global|Media Group|Group|Limited|Solutions|Software|Computing|Networks?|Systems|Health|Brands|Project|Services|Technologies|Technology|Companies|Holdings?|Studios?|Digital|Mobile|Financial|Management|Media|Maintenance|Agency|Support|Bank|reseller|account|internal|external|International|Dental|University|Corporation|Labs|Enterprises?|GmbH|Industries|Incorporated|Entertainment|Society|Security|Interactive|Capital)\\b/ig
    var grams = 4
    var tokenValue = 2
    var substrs1 = extract(str1)
    var substrs2 = extract(str2)
    var maxL = max(substrs1.length, substrs2.length)
    return ( parseInt(100*(
    substrs1.filter(s=>~substrs2.indexOf(s)).length
    +substrs2.filter(s=>~substrs1.indexOf(s)).length
    ) / (0.01+maxL) / 2))
    function max(a,b){return a>b?a:b}
    function extract(str){
    return str.toLowerCase()
    .replace(stopWords,"")
    .replace(tokens, m=>m.slice(0,tokenValue).toUpperCase())
    .replace(stopChars,"")
    .split('').map((char,c,arr)=>arr.slice(max(0,1+c-grams),1+c).join('')).filter(s=>s.length>=3)
    }
    """;
    WITH closest_by_name as (
    SELECT
    dataset.name,
    MAX(match(dataset.name,official.name)) as similarity,
    CASE MAX(match(dataset.name,official.name))
    WHEN 0 THEN NULL
    ELSE SUBSTR(MAX(
    CONCAT(
    CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
    ,CAST(1000000 + official.tie_breaker as string)
    ,official.name
    )
    ),5+7) END as closest_name,
    CASE MAX(match(dataset.name,official.name))
    WHEN 0 THEN NULL
    ELSE SUBSTR(MAX(
    CONCAT(
    CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
    ,CAST(1000000 + official.tie_breaker as string)
    ,official.id
    )
    ),5+7) END as corresponding_id
    FROM (
    SELECT DISTINCT name FROM `table1`
    ) AS dataset
    CROSS JOIN (
    SELECT id, name, ROW_NUMBER() OVER () as tie_breaker
    FROM `table2`
    --WHERE optional filter
    ) as official
    GROUP BY 1
    )
    SELECT main.name, main.similarity, main.closest_name, main.id_of_closest_name,
    STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) as id_is_better_matched_to,
    STRING_AGG( CASE WHEN dupe.similarity < main.similarity THEN dupe.name END) as id_is_also_closest_to,
    CASE WHEN STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) IS NULL THEN main.id_of_closest_name END as suggested_id
    FROM closest_by_name as main
    LEFT JOIN closest_by_name as dupe
    ON dupe.id_of_closest_name = main.id_of_closest_name
    AND dupe.name <> main.name
    GROUP BY 1,2,3,4
    ORDER BY 2 ASC



3 replies

Thanks a ton for taking the time to write up a response; I really appreciate it! I’ll give those changes a shot.


Thanks again!

Userlevel 6
Badge

Hi!


As for returning multiple matches, I would look at the SQL instead of the Javascript. All that the JS does is compare a pair of names and return a score. The SQL takes care of queuing up a comparison between all the names, and then picking the highest scores.


So for example, instead of SELECT name1, MAX(score & name2) GROUP BY name1, you might want WITH top_scores AS (SELECT name1, ROW_NUMBER() OVER (parition by name1 order by score) as rank, name2) SELECT * FROM top_scores WHERE rank <=3


As for why the original is picked instead of the sequel, I think that might be because they are both getting very high scores which are getting rounded up to 100 by the parseInt function and then SQL is just randomly choosing one of the two 100’s. I can’t tell/remember why I put the parseInt in there, maybe just because it looked nicer with fewer decimal places… I would try just removing that function call, and it should help.

Hey @fabio! I know this is a bit of an old thread, but I was hoping you might be able to explain what each line of the UDF does to someone who has no background at all in JS? Your function has been super handy for me, since I’ll occasionally get a request where people want me to find the respective IDs for a list of app names.


I’ve been trying to remedy two issues I’ve been having with it, and I thought maybe if I understand the function a bit better, I’ll be able to solve them. The first issue I’m having is that some apps share the exact same name across iOS and Android platforms, and I’m trying to get the query to return a match for both of them, if possible. The second issue is that, in the cases of apps that are sequels, sometimes the query will return the ID of the original game instead of the sequel. For example, I know the app “Shadow Fight 2” is in our database, but the query will return the ID for “Shadow Fight” instead. Is there any quick way to account for that?


Sorry again for bumping such an old thread, but I look forward to hearing back from you about this! Thanks again for creating this useful function!

Reply