BigQuery daily snapshots

  • 21 April 2016
  • 2 replies

Continuing the discussion from ETL Tool Recommendations:

Hi @lloydtabb, thanks for the info!

I saw this post as well, and the TABLE_DATE_RANGE feature makes me wonder what the best approach is.

My first thought was just to add a current-date column to each day's snapshot and keep appending to the same table, but your post here seems to suggest it is better to create a new table each day with the snapshot date in the table name?
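For concreteness, I mean something like this (dataset, table, and column names are made up):

```sql
-- Option 1: one table, with each day's snapshot appended and tagged by date
SELECT *
FROM [mydataset.account_snapshots]
WHERE snapshot_date = TIMESTAMP('2016-04-20')

-- Option 2: one table per day, with the date in the table name
SELECT *
FROM [mydataset.account_snapshots20160420]
```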

Being a BigQuery novice still, is one clearly better than the other?

For context, I am currently cross joining a simple calendar date column against my table of interest and deriving my historical numbers using date logic. This works, but as you can guess, is not fast, since I am effectively multiplying my table's row count by the number of dates I care to process…
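Roughly what I'm doing today, simplified and with made-up names, just so it's clear:

```sql
-- Cross join a small calendar table against the events table, then use
-- date logic to count events up to each calendar date. Every calendar
-- row re-scans the whole events table, hence the slowness.
SELECT
  c.cal_date AS as_of,
  COUNT(e.id) AS cumulative_events
FROM [mydataset.calendar] c
CROSS JOIN [mydataset.events] e
WHERE e.created_at <= c.cal_date
GROUP BY as_of
```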


Aaron, it depends on the size of the table. If you have a transactional table, it is best to break it into per-day tables, but only if it is somewhat big (say, more than a few gigabytes).

There are limits on the number of tables you can union together in one query, so if you are looking at really long timeframes, you might not want to shard by day.
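If you do shard by day, TABLE_DATE_RANGE lets you query a whole range of daily tables at once. A sketch, assuming daily tables named with a `mydataset.events_YYYYMMDD` convention (names are hypothetical):

```sql
-- Union all daily tables from April 1 through April 21, 2016
SELECT COUNT(*)
FROM TABLE_DATE_RANGE([mydataset.events_],
                      TIMESTAMP('2016-04-01'),
                      TIMESTAMP('2016-04-21'))
```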

BigQuery does a full table scan for each query (believe it or not).



The only way to avoid a full table scan (on the columns you are querying) is to use table decorators:

If you have a 10 TB table for one day of data and only want to query the last 30 minutes, this is really helpful for saving time and cost (though it is not supported in Looker today).

I think BQ might be building other stuff to help solve for this too.

e.g. get table data added between one hour and half an hour ago:

SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000--1800000]