Question

SubSelect not allowed in SELECT clause - BigQuery Legacy SQL

  • 26 January 2017
  • 8 replies
  • 625 views

This is my first post so apologies if I do not follow a standard process.

We are using Looker to access BQ tables via Legacy SQL.

We have a number of brands that have deprecated ‘legacy’ GA Views as well as live Views.

Consequently, there are multiple ‘GA’ BQ tables in play.

I have designed it so that a user can enter, via a brand filter (type: string), the brand name for which they wish to perform some analysis.

I then use this as a templated filter to perform a conditional analysis, so that individual BQ tables are queried only for data relevant to that brand (some live GA views now track multiple brands).

However, for even greater efficiency I would like to interrogate only those BQ tables that are relevant to a given brand.

I have implemented a CASE WHEN solution that reads user brand input and selects the appropriate BQ tables by unioning them. However, running this in SQL Runner produced an error: Subselect not allowed in SELECT clause.
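The failing query was shaped roughly like this (a simplified reconstruction; the dataset, table, and field names are illustrative):

view: brand_case_attempt {
  derived_table: {
    # Simplified reconstruction of the failing pattern: Legacy SQL rejects a
    # subselect used as an expression in the SELECT list (e.g. inside a CASE)
    # with "Subselect not allowed in SELECT clause".
    sql: SELECT
           CASE
             WHEN brand = 'brand1'
               THEN (SELECT MAX(visits) FROM [mydataset.ga_brand1])
             ELSE (SELECT MAX(visits) FROM [mydataset.ga_brand2])
           END AS brand_visits
         FROM [mydataset.brands] ;;
  }
}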

Can anyone advise on a different strategy when dealing with multiple BQ tables? I know I could simply union everything and filter by brand en masse, but I was hoping for a more clinical solution.

Many Thanks,

James


8 replies

Userlevel 3

Sounds like a couple of things going on here.


First, TABLE_QUERY is how tables are dynamically/conditionally queried in Legacy SQL. You would still use a UNION, but tables that don’t meet the condition are never queried. One of my older posts covers this.
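A minimal sketch of that pattern, assuming a dataset called mydataset and hard-coding the brand for brevity:

view: ga_one_brand {
  derived_table: {
    # TABLE_QUERY only scans tables whose table_id matches the expression,
    # so the tables for other brands are never read. Names are illustrative.
    sql: SELECT date, totals.visits
         FROM (TABLE_QUERY(mydataset, 'table_id CONTAINS "ga_brand1"')) ;;
  }
}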


Second, it sounds like you are performing another SELECT within a dimension. That is only allowed in BigQuery’s Standard SQL.
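For example, a dimension shaped like this (a hypothetical sketch; table and field names are illustrative) compiles on a Standard SQL connection but raises exactly that error under Legacy SQL:

dimension: lifetime_visits {
  type: number
  # Scalar subselect in the SELECT list: Standard SQL only.
  sql: (SELECT SUM(s.totals.visits)
        FROM `mydataset.ga_sessions_*` s
        WHERE s.fullVisitorId = ${TABLE}.fullVisitorId) ;;
}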

Thanks Segah. Do you have a link to that older post?

Basically, my current solution unions legacy GA data from 3 BQ tables (1 for each legacy GA View).

It then applies a legacy URL redirect lookup so that URLs are normalised between old and new.

Then it applies a conditional templated brand filter to keep only the relevant brand’s URLs (after the simulated ETL process).

Then it merges the result with a BQ table (again filtered using the brand templated filter) that houses live GA data.

Instead of unioning all legacy GA data and applying the ETL en masse, I would prefer to union only the data relevant to the brand, i.e. conditionally select the relevant legacy BQ tables…

Can you confirm whether switching to BQ Standard SQL would make life a lot easier when dealing with multiple BQ tables by brand and across business units (several brands per unit)?

Userlevel 6

@JimmyCognos, I think your life would be much easier under Standard SQL. I’m supposing you have tables named with suffixes like:



ga_brand1

ga_brand2

ga_brandx



Where the data for each brand is stored in separate tables with identical structures and the brand name follows ‘ga_’.


If you use BigQuery Standard SQL, you could write an explore and view as simple as this:


explore: ga {
  always_filter: {
    filters: {
      field: ga.brand
      value: "brand1"
    }
  }
}

view: ga {
  sql_table_name: `ga_*` ;;

  dimension: brand {
    type: string
    sql: ${TABLE}._TABLE_SUFFIX ;;
  }

  ...
}

Using this structure, the query will only read tables that match whatever filter you set on the ‘Brand’ dimension, reducing the data it scans. _TABLE_SUFFIX is a special pseudo column in Standard SQL used, essentially, to filter the set of tables matched by a wildcard.


You’ll probably want to add an always_filter on brand to keep from running big queries.


_TABLE_SUFFIX is really very flexible and much better than TABLE_QUERY in Legacy SQL.
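To tie this back to the templated brand filter in the original question, a sketch along these lines (the dataset name and filter name are assumptions) pushes the user’s brand selection straight into the wildcard match:

view: ga {
  derived_table: {
    # The templated condition compiles to e.g. WHERE _TABLE_SUFFIX = 'brand1',
    # so BigQuery prunes the non-matching tables before scanning any data.
    sql: SELECT *, _TABLE_SUFFIX AS brand
         FROM `mydataset.ga_*`
         WHERE {% condition brand_select %} _TABLE_SUFFIX {% endcondition %} ;;
  }

  filter: brand_select {
    type: string
  }

  dimension: brand {
    type: string
    sql: ${TABLE}.brand ;;
  }
}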

Thanks Lloydtab.


Your response is very helpful. I suspected as much.


Kind Regards,

James

Userlevel 3

Same here; migrating to Standard SQL is a must. There are annoying backend issues with Legacy SQL that I’ve run into (for example, how database views are treated) that BigQuery simply won’t resolve, because they made the decision to rewrite the engine for Standard SQL. Ideally you can avoid going down the rabbit hole of a Legacy SQL implementation 🙂


One word of caution: a ga_* wildcard is limited to 1,000 tables, and performance (of querying a single brand/table) decreases as the number of tables grows (though that only becomes relevant once you are into the hundreds).

Hi segahm.


Thanks for your response and sharing your experience re: LegacySQL.


Best Regards,

James

Userlevel 4

Here’s a pattern to combine two different GA PROPERTIES (i.e. two different websites) in one query with a UNION ALL, utilizing the wildcard and _TABLE_SUFFIX pattern on both datasets:


view: unioned {
  derived_table: {
    sql: SELECT *, 'Property1' AS Property
         FROM `dataset_number.ga_sessions_*`
         WHERE {% condition partition_date %} TIMESTAMP(PARSE_DATE('%Y%m%d', REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d'))) {% endcondition %}
         UNION ALL
         SELECT *, 'Property2' AS Property
         FROM `dataset_number2.ga_sessions_*`
         WHERE {% condition partition_date %} TIMESTAMP(PARSE_DATE('%Y%m%d', REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d'))) {% endcondition %} ;;
  }

  filter: partition_date {
    type: date_time
  }
}
Userlevel 1

Hi, I had a similar situation where I needed to pull data from two tables. I created a derived table using a UNION ALL of both tables, but it was taking a long time: both tables held huge amounts of data, and a derived table re-executes its query in the backend every time we pull a report.

To overcome this I created a view in my Snowflake data warehouse, and it was drastically faster than the derived table.

A data warehouse view is better than using a derived table here; it is comparatively very fast. Unlike a native MySQL view, which is purely virtual, the view in our data warehouse effectively stores its data separately rather than virtually, so it is faster than a normal MySQL view.
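For completeness, staying inside Looker, a persistent derived table (PDT) gets a similar effect: the union is materialised in the warehouse on a trigger instead of being re-run per query. A sketch, with illustrative table names and a once-a-day rebuild:

view: unioned_persisted {
  derived_table: {
    sql: SELECT * FROM big_table_a
         UNION ALL
         SELECT * FROM big_table_b ;;
    # Persisting the derived table means the union runs when the trigger
    # value changes (here, once a day), not on every report query.
    sql_trigger_value: SELECT CURRENT_DATE ;;
  }
}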
