Anyone else migrating to Standard SQL?
With BigQuery’s release of Standard SQL, the appeal of migrating away from Legacy SQL is pretty high. From the LookML perspective, Legacy SQL forces you to create a lot of PDTs, which are hard to maintain and even harder to scale with increasing data volumes. Standard SQL has features like inequality joins and subqueries in SELECT that should make PDTs less prevalent.
More importantly, Legacy SQL has some very painful bugs (e.g. “Failed to save view. Cannot create valid output schema for field”), and it sounds like they won’t ever be resolved. The only workaround for these bugs is to create even more PDTs:
Saving a query as a database view fails, but saving it as a PDT works, because of an inconsistency in how BQ treats column names.
So far, in evaluating the transition to Standard SQL, I see a couple of changes that need to happen.
1) Comma “,” replacement issues
This seems pretty straightforward. A simple regex should do the trick.
work required = X
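For reference, here’s roughly what the mechanical change looks like (table and column names below are made up): in Legacy SQL a comma between tables in the FROM clause unions them, while Standard SQL needs an explicit UNION ALL:

```sql
-- Legacy SQL: the comma unions the two tables
SELECT event_id
FROM [myproject:mydataset.events_2016], [myproject:mydataset.events_2017]

-- Standard SQL: the comma must become an explicit UNION ALL
SELECT event_id FROM `myproject.mydataset.events_2016`
UNION ALL
SELECT event_id FROM `myproject.mydataset.events_2017`
```

Note the bracket-to-backtick change in table references comes along for the ride, so the regex has to handle that too.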
2) UNION logic
This generally falls under (1), unless you have a lot of mismatched columns. Standard SQL forces columns to line up positionally on top of each other, whereas Legacy SQL matched them by column name.
This is, by the way, either an advantage or a disadvantage. Position-matching bypasses the column-name inconsistency above, but it forces the mundane task of making sure columns are pulled in the right order every time (allowing for either scenario would have been ideal, GOOGLE).
work required = X to 100X
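A minimal illustration of the positional requirement (hypothetical table names):

```sql
-- Standard SQL matches UNION ALL columns by position, not by name.
SELECT first_name, last_name FROM `myproject.mydataset.staff`
UNION ALL
-- Swapping these two columns would silently put last names into
-- the first_name column of the result; no error is raised.
SELECT first_name, last_name FROM `myproject.mydataset.contractors`
```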
3) TABLE_DATE_RANGE replacement with _TABLE_SUFFIX
A regex would not work here, because we would need recursive logic for bracket matching. The date range also moves to the WHERE clause, so the solution would need to parse the SQL into a tree in order to move the matching logic into the appropriate clause.
This is still fairly straightforward to automate. One major complication is the more flexible use of TABLE_QUERY in Legacy SQL. If you haven’t used it sparingly, you might just have to do this one by hand. So far, there are two use cases I have seen for TABLE_QUERY: 1) to access PDTs, 2) to dynamically choose a table based on a filter condition. If you don’t have either of these, you are O.K.
work required = 200X
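To show what the rewrite involves (project and dataset names are hypothetical), the date range moves out of the table function and into the WHERE clause:

```sql
-- Legacy SQL: the range lives inside TABLE_DATE_RANGE
SELECT event_id
FROM TABLE_DATE_RANGE([myproject:mydataset.events_],
                      TIMESTAMP('2017-01-01'), TIMESTAMP('2017-01-31'))

-- Standard SQL: wildcard table plus a _TABLE_SUFFIX filter in WHERE
SELECT event_id
FROM `myproject.mydataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
```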
4) Filtered MAX/MIN WITHIN Record
This one is actually problematic. It can look like this:
MAX(IF(condition, value-if-true, value-if-false)) WITHIN RECORD AS ...
I do see the array equivalents for repeated fields:
NTH(index, arr) WITHIN RECORD → arr[SAFE_ORDINAL(index)]
COUNT(arr) WITHIN RECORD → ARRAY_LENGTH(arr)
So this should allow for MAX of an array since MAX is really just the last element of the sorted array.
However, I see nothing about filtered arrays, neither the ones using CASE/WHEN logic nor the ones using IF/ELSE. This means the same logic would have to be achieved through entirely different SQL. The simplest approach would be to pre-flatten these datasets in subqueries or CTEs and perform a regular MAX/MIN in aggregation.
work required = 200 to 800X+
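One way the pre-flattening could be sketched, using a correlated subquery over the unnested array rather than a CTE (table and field names here are hypothetical):

```sql
-- Legacy SQL: filtered MAX within a repeated record
SELECT
  MAX(IF(items.discounted, items.price, NULL)) WITHIN RECORD AS max_discounted_price
FROM [myproject:mydataset.orders]

-- Standard SQL: a subquery over the unnested repeated field
SELECT
  (SELECT MAX(IF(i.discounted, i.price, NULL))
   FROM UNNEST(items) AS i) AS max_discounted_price
FROM `myproject.mydataset.orders`
```

The subquery in SELECT is exactly the kind of Standard SQL feature that wasn’t available in Legacy, which is why the rewrite looks so different.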
Am I missing something?
Re (4), here’s a solution. Consider the case where you have a table with one row per doctor, and a REPEATED RECORD (ARRAY OF STRUCTS) with information about email addresses. These two queries are equivalent: