Finding closest matches by name in BigQuery

We’ve all been there before: someone gives you a dataset without ID’s, keyed only by name. You either file it away and never look at it again, or spend the next two days matching up the records by hand.

Not to condone this situation, but it happens, so here’s a query (in BigQuery using a Javascript UDF) to help speed this up. I tried simply using Levenshtein distance, but found too many mismatches, so switched to a strategy of counting matching substrings of a given length. Even with this approach, there will be mismatches, so use this to speed up a manual review, rather than simply making it a PDT.

Note that you can customize:

  • stopWords, which get filtered out entirely. I chose this specific list when trying to match company names

  • tokens, which get shortened to limit their impact on the match score

  • grams, the number of consecutive characters that are considered when calculating the score. Maybe set it down to 2 or 3 if your dataset is composed of phrases with shorter words

      CREATE TEMPORARY FUNCTION match(str1 STRING, str2 STRING)
      RETURNS FLOAT64
      LANGUAGE js AS """
        var stopWords = /\\b(The|inc|ASP|Co|Ltd|LLC|SA|SAS|BV|Corp|formerly|previously|by|as|dba|com)\\b/ig
        var stopChars = /[-_ \\.,'"'\\/()]/g
        var tokens = /\\b(Global|Media Group|Group|Limited|Solutions|Software|Computing|Networks?|Systems|Health|Brands|Project|Services|Technologies|Technology|Companies|Holdings?|Studios?|Digital|Mobile|Financial|Management|Media|Maintenance|Agency|Support|Bank|reseller|account|internal|external|International|Dental|University|Corporation|Labs|Enterprises?|GmbH|Industries|Incorporated|Entertainment|Society|Security|Interactive|Capital)\\b/ig
        var grams = 4
        var tokenValue = 2
        var substrs1 = extract(str1)
        var substrs2 = extract(str2)
        var maxL = max(substrs1.length, substrs2.length)
        return ( parseInt(100*(
            substrs1.filter(s=>~substrs2.indexOf(s)).length
            +substrs2.filter(s=>~substrs1.indexOf(s)).length
          ) / (0.01+maxL) / 2))
        function max(a,b){return a>b?a:b}
        function extract(str){
          return str.toLowerCase()
            .replace(stopWords,"")
            .replace(tokens, m=>m.slice(0,tokenValue).toUpperCase())
            .replace(stopChars,"")
            .split('').map((char,c,arr)=>arr.slice(max(0,1+c-grams),1+c).join('')).filter(s=>s.length>=3)
        }
      """;
      WITH closest_by_name as (
      	SELECT 
      	  dataset.name, 
      	  MAX(match(dataset.name,official.name)) as similarity,
      	  CASE MAX(match(dataset.name,official.name))
      	  WHEN 0 THEN NULL
      	  ELSE SUBSTR(MAX(
      		CONCAT(
      		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                        ,CAST(1000000 + official.tie_breaker as string)
      		  ,official.name
      		)
      	  ),5+7) END as closest_name,
      	  CASE MAX(match(dataset.name,official.name))
      	  WHEN 0 THEN NULL
      	  ELSE SUBSTR(MAX(
      		CONCAT(
      		  CAST(ROUND(1000+match(dataset.name,official.name)) AS string)
                        ,CAST(1000000 + official.tie_breaker as string)
      		  ,official.id
      		)
      	  ),5+7) END as corresponding_id
      	FROM (
      	  SELECT DISTINCT name FROM `table1`
      	) AS dataset 
      	CROSS JOIN (
      	  SELECT id, name, ROW_NUMBER() OVER () as tie_breaker
      	  FROM `table2`
      	  --WHERE optional filter
      	) as official
      	GROUP BY 1
      )
      SELECT main.name, main.similarity, main.closest_name, main.id_of_closest_name, 
      STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) as id_is_better_matched_to,
      STRING_AGG( CASE WHEN dupe.similarity < main.similarity THEN dupe.name END) as id_is_also_closest_to,
      CASE WHEN STRING_AGG( CASE WHEN dupe.similarity > main.similarity THEN dupe.name END) IS NULL THEN main.id_of_closest_name END as suggested_id
      FROM closest_by_name as main
      LEFT JOIN closest_by_name as dupe
       ON  dupe.id_of_closest_name = main.id_of_closest_name
       AND dupe.name <> main.name
      GROUP BY 1,2,3,4
      ORDER BY 2 ASC
2 4 2,676
4 REPLIES 4

Hey @fabio1! I know this is a bit of an old thread, but I was hoping you might be able to explain what each line of the UDF does to someone who has no background at all in JS? Your function has been super handy for me, since I’ll occasionally get a request where people want me to find the respective IDs for a list of app names.

I’ve been trying to remedy two issues I’ve been having with it, and I thought maybe if I understand the function a bit better, I’ll be able to solve them. The first issue I’m having is that some apps share the exact same name across iOS and Android platforms, and I’m trying to get the query to return a match for both of them, if possible. The second issue is that, in the cases of apps that are sequels, sometimes the query will return the ID of the original game instead of the sequel. For example, I know the app “Shadow Fight 2” is in our database, but the query will return the ID for “Shadow Fight” instead. Is there any quick way to account for that?

Sorry again for bumping such an old thread, but I look forward to hearing back from you about this! Thanks again for creating this useful function!

Hi!

As for returning multiple matches, I would look at the SQL instead of the Javascript. All that the JS does is compare a pair of names and return a score. The SQL takes care of queuing up a comparison between all the names, and then picking the highest scores.

So for example, instead of SELECT name1, MAX(score & name2) GROUP BY name1, you might want WITH top_scores AS (SELECT name1, ROW_NUMBER() OVER (parition by name1 order by score) as rank, name2) SELECT * FROM top_scores WHERE rank <=3

As for why the original is picked instead of the sequel, I think that might be because they are both getting very high scores which are getting rounded up to 100 by the parseInt function and then SQL is just randomly choosing one of the two 100’s. I can’t tell/remember why I put the parseInt in there, maybe just because it looked nicer with fewer decimal places… I would try just removing that function call, and it should help.

Thanks a ton for taking the time to write up a response; I really appreciate it! I’ll give those changes a shot.

Thanks again!

Hi,

sorry for necroposting but I created an account here specifically to thank you @fabio1 for your answer here. It just saved me weeks of work: on one side because I didn’t know I could write javascript code in bigquery, and on the other because your algorithm proved to be way more effective than any other more complex formula out there, when dealing with company names. I know such posts are not useful at all and I’m sorry about that, but I thought you had not been thanked enough for such an useful response 😄 

Top Labels in this Space
Top Solution Authors