Question

BigQuery - Migrating from Legacy SQL to Standard SQL

  • 18 October 2016
  • 1 reply
  • 326 views


Anyone else migrating to Standard SQL?


With BigQuery’s release of Standard SQL, the appeal of migrating away from Legacy SQL is pretty high. From the LookML perspective, Legacy SQL forces you to create a lot of PDTs, which are hard to maintain and even harder to scale as data volumes grow. Standard SQL has features like inequality joins and subqueries in SELECT that should make PDTs less prevalent.


More importantly, Legacy SQL has some very painful bugs (e.g. “Failed to save view. Cannot create valid output schema for field”), and it sounds like they will never be resolved. The only workaround for these bugs is to create even more PDTs:



Saving a query as a database view fails, while saving it as a PDT works, because of an inconsistency in how BQ treats column names.



So far, in evaluating the transition to Standard SQL, I see a couple of changes that need to happen.


1) Comma “,” replacement issues


This seems pretty straightforward. A simple regex should do the trick.
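For illustration, here’s a minimal before/after sketch of the change, using hypothetical tables mydataset.events_2016 and mydataset.events_2017 (in Legacy SQL a comma between tables in FROM means UNION ALL, while in Standard SQL it means JOIN, so the union has to be spelled out):

-- Legacy SQL: the comma unions the two tables
SELECT user_id FROM [mydataset.events_2016], [mydataset.events_2017]

-- Standard SQL: the union must be written explicitly
SELECT user_id FROM `mydataset.events_2016`
UNION ALL
SELECT user_id FROM `mydataset.events_2017`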


work required = X


2) UNION logic


This generally falls under (1), unless you have a lot of mismatched columns. Standard SQL forces columns to line up positionally, whereas Legacy SQL matched them by column name.



This is, by the way, either an advantage or a disadvantage. Position matching bypasses the inconsistency error above, but it forces the mundane task of making sure that columns are pulled consistently in the right order (allowing for either scenario would have been ideal, GOOGLE).
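As a small sketch of what that means in practice (table_a and table_b are hypothetical), every branch of the union must list its columns in the same order:

-- Standard SQL matches UNION ALL columns by position, not by name
SELECT id, name FROM `mydataset.table_a`
UNION ALL
SELECT id, name FROM `mydataset.table_b`  -- writing "name, id" here would swap the values or fail on a type mismatch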



work required = X to 100X


3) TABLE_DATE_RANGE replacement with _TABLE_SUFFIX


A regex would not work here, because we would need recursive logic for bracket matching.


The date range also moves to the WHERE clause, so the solution would need to parse the SQL into a tree in order to move the matching logic into the appropriate WHERE section.
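For context, here’s a rough sketch of the change on a hypothetical set of date-sharded tables named mydataset.events_YYYYMMDD:

-- Legacy SQL: TABLE_DATE_RANGE selects the shards in the FROM clause
SELECT event_id
FROM TABLE_DATE_RANGE([mydataset.events_],
                      TIMESTAMP('2016-10-01'), TIMESTAMP('2016-10-07'))

-- Standard SQL: a wildcard table plus a _TABLE_SUFFIX filter in the WHERE clause
SELECT event_id
FROM `mydataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20161001' AND '20161007'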


This is still fairly straightforward to automate. One major complication is the more flexible use of TABLE_QUERY in Legacy SQL. If you haven’t used it sparingly, you might just have to do this one by hand. So far, there are two use cases I have seen for TABLE_QUERY: 1) to access PDTs, and 2) to dynamically choose a table based on a filter condition. If you don’t have either of these, you are O.K.


work required = 200X


4) Filtered MAX/MIN WITHIN Record


This one is actually problematic. It can look like this:


MAX(IF(condition, value-if-true, value-if-false)) WITHIN RECORD AS ...

I do see the array equivalents for repeated fields:


Legacy                            Standard
NTH(index, arr) WITHIN RECORD     arr[SAFE_ORDINAL(index)]
COUNT(arr) WITHIN RECORD          ARRAY_LENGTH(arr)

So this should allow for MAX of an array since MAX is really just the last element of the sorted array.


However, I see nothing about filtered arrays - neither the ones using CASE/WHEN logic nor the ones using IF(). This means the same logic would have to be achieved through entirely different SQL. The simplest approach would be to pre-flatten these datasets in subqueries or CTEs and perform a regular MAX/MIN as an aggregation, as in the sketch below.
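Here’s a rough sketch of that pre-flatten approach, assuming a hypothetical table mydataset.orders with a repeated items record containing a price field:

-- Flatten the repeated field, then apply a regular filtered MAX per parent row
SELECT
  order_id,
  MAX(IF(item.price > 100, item.price, NULL)) AS max_big_ticket_price
FROM `mydataset.orders`, UNNEST(items) AS item  -- use LEFT JOIN UNNEST(items) AS item to keep rows with empty arrays
GROUP BY order_id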


work required = 200 to 800X+


Am I missing something?


1 reply

Re (4), here’s a solution. Consider the case where you have a table with one row per doctor, and a REPEATED RECORD (ARRAY OF STRUCTS) with information about email addresses. These two queries are equivalent:


Legacy SQL:


SELECT
  dr_id,
  MAX(IF(emails.email IS NOT NULL, TRUE, FALSE)) WITHIN RECORD AS has_email
FROM rsdw.doctor
ORDER BY 1

Standard SQL:


SELECT
  dr_id,
  IFNULL((SELECT MAX(IF(email IS NOT NULL, TRUE, FALSE)) FROM UNNEST(emails)), FALSE) AS has_email
FROM rsdw.doctor
ORDER BY 1

Notes:



  1. The slightly awkward IFNULL() is necessary to avoid getting NULLs instead of FALSE in cases where there are no email addresses.

  2. You could remove the IF() statements and just keep the condition, but I left them in for clarity.
