“Resources exceeded during query execution” in BigQuery because of cartesian product

BigQuery charges per query based on how much data it has to scan. It then allocates server capacity to your query based on this data scanning estimate.

Generally, this is great since it means you only pay for what you use. But occasionally, BigQuery underestimates how difficult a query will be and refuses to complete it. Ideally, it’d just offer to complete the query and charge you more, but it doesn’t do that yet.

So, in order to get what we need done, we have to get crafty. And I did.

The case in question was one where I wanted to compute the cartesian product of a relatively small table (by BigQuery standards) with itself. (My colleague has humorously titled this cross-self-join a “Cartesian Blast”.) Even though the table only has 50,000 rows, the cartesian product of it with itself produces 50,000 × 50,000 = 2.5 billion rows.

The problem arises because BigQuery bases its cost estimate on scanning one 50,000-row table. Since that looks like a tiny job, it allocates minimal resources to what is, in fact, a reasonably large one. And that leads us to the dreaded “Resources exceeded during query execution” error.
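
To make the mismatch concrete, here’s an illustration (with table standing in for any 50k-row table): both statements below reference essentially the same input bytes, so BigQuery’s pre-run estimate treats them as similarly small jobs, even though the second emits 50,000 times as many rows.

-- Illustration only: the cross join scans `table` at most twice, so its
-- bytes-scanned estimate is roughly double the first query's, even though
-- it produces 2.5 billion output rows instead of 50,000.
SELECT * FROM table                      -- 50,000 rows out

SELECT * FROM table AS a
CROSS JOIN table AS b                    -- 2,500,000,000 rows out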

The trick here is to think about exactly how BigQuery makes its cost estimates and play its game. More data scanned = more resources allocated = bigger queries possible.

So since we can’t just do this in one shot:

view: cartesian_blast {
  derived_table: {
    sql: SELECT
             *
         FROM
           table AS a
           CROSS JOIN table AS b;;
  }
}

we’ll instead do the first part of it and save the result:

view: cartesian_blast_part1 {
  derived_table: {
    sql: SELECT
             *
         FROM
           (SELECT * FROM table LIMIT 1000) AS a
           CROSS JOIN table AS b;;
    persist_for: "24 hours"
  }
}

Now, since we’ve cross-joined the first 1,000 rows of the table with the full 50k-row table, we have a 50 million row table to work with. Since that’s much bigger than 50k, BigQuery will allocate more resources, allowing part 2 to be bigger.

view: cartesian_blast_part2 {
  derived_table: {
    sql: 
         SELECT * FROM ${cartesian_blast_part1.SQL_TABLE_NAME}
         UNION ALL
         SELECT
             *
         FROM
           (SELECT * FROM table LIMIT 10000 OFFSET 1000) AS a
           CROSS JOIN table AS b;;
    persist_for: "24 hours"
  }
}

In part 2, we union the persisted output of part 1 with a fresh cross join covering the next chunk. This UNION ALL is the key piece: it pulls the 50m-row part-1 table into BigQuery’s scan estimate, which buys enough resources for part 2 to tackle the next 10,000 rows (rather than only 1,000). LIMIT and OFFSET together skip the first 1,000 rows so we don’t compute them twice.
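
One caveat I’ll add (it’s not part of the original trick): LIMIT/OFFSET without an ORDER BY isn’t guaranteed to return the same rows from one query to the next, so chunks could in principle overlap or miss rows between the separately-persisted parts. A minimal sketch of a more deterministic variant, assuming the table has a unique id column (hypothetical):

-- Sketch only: ordering the chunk subquery by a unique key (a hypothetical
-- `id` column) makes each chunk reproducible across the separate queries.
SELECT *
FROM
  (SELECT * FROM table ORDER BY id LIMIT 10000 OFFSET 1000) AS a
  CROSS JOIN table AS b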

We can keep doing this pattern over and over, with each successive query tackling bigger and bigger chunks of the job since including the previous tables in the query makes BigQuery allocate more resources.
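
Each step follows the same template (a sketch, with my placeholders in angle brackets):

-- Generic step N of the pattern:
SELECT * FROM <output_of_step_N-1>
UNION ALL
SELECT
    *
FROM
  (SELECT * FROM table LIMIT <rows_this_step> OFFSET <rows_done_so_far>) AS a
  CROSS JOIN table AS b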

You’ll have to tune the size of each step so as not to trip BQ’s resource limits, but for my task I was easily able to handle the 50k-row cartesian blast in three steps. My final one looked like:

view: cartesian_blast_part3 {
  derived_table: {
    sql: 
         SELECT * FROM ${cartesian_blast_part2.SQL_TABLE_NAME}
         UNION ALL
         SELECT
             *
         FROM
           (SELECT * FROM table LIMIT 10000000 OFFSET 11000) AS a
           CROSS JOIN table AS b;;
    persist_for: "24 hours"
  }
}

I made the limit on this last one effectively limitless, since the 550m-row part-2 table BQ is basing its resource allocation on (part 1’s 50 million rows plus part 2’s 10,000 × 50,000 = 500 million) is plenty big to handle lots more rows.

Now, if you wanted to get REALLY tricky, you could probably use this same pattern to just include a massive table in your query in a way that BQ would have to scan it but that didn’t affect your output.

SELECT 1 FROM really_big_table won’t work because BQ is too smart for that. But something like this would:

SELECT
  CASE
    WHEN rbt.id = 'never_gonna_happen' THEN 0
    ELSE NULL
  END
FROM really_big_table AS rbt
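
Note that, as written, that query still returns one NULL per row of really_big_table, so you’d want to collapse it before attaching it to your real query. Here’s one hedged sketch of how that might look, with my_real_query as a placeholder for the query you actually care about; whether BigQuery’s optimizer keeps the scan in its estimate is the same open question the whole trick rests on.

-- Sketch only: the aggregate collapses the padding branch to exactly one
-- row, so the CROSS JOIN leaves the real query's rows unchanged while (in
-- theory) still forcing a scan of the big table.
SELECT q.*
FROM my_real_query AS q
CROSS JOIN (
  SELECT COUNT(CASE WHEN rbt.id = 'never_gonna_happen' THEN 0 END) AS pad
  FROM really_big_table AS rbt
) AS pad_scan
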
3 REPLIES

Nice article @Mintz! How about the query below? I’ve tried removing the ORDER BY, but it still won’t run. What would the ideal query be?

SELECT id AS data, RANK() OVER (ORDER BY date_update) AS rank FROM test.sample ORDER BY id DESC LIMIT 1

Hey @fathinadya, in this case, I actually don’t think the issue is that too few resources are being allocated, but instead that RANK() is a window function that has to be performed on a single node. If the table you’re ranking is large enough, that node could run out of memory and kill the query. You can read more about that here: "Resources exceeded during query execution" when building derived table in BigQuery
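
If your data has a natural partitioning column, one common workaround (sketched below with a hypothetical some_partition_col) is to break the window into partitions so that no single node has to rank the entire table. Note this changes the semantics from a global rank to a rank within each partition:

-- Sketch only: each partition is ranked independently, so the work spreads
-- across nodes instead of landing on one. `some_partition_col` is a
-- hypothetical column with many distinct values.
SELECT
  id AS data,
  RANK() OVER (PARTITION BY some_partition_col ORDER BY date_update) AS rank
FROM test.sample
ORDER BY id DESC
LIMIT 1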

Thanks Mintz, this was super useful. Do you know if there’s documentation on exactly how BigQuery determines the CPU limit? I can’t find anything.
