BigQuery error when using a partition filter and a false condition in a MERGE subquery

I have encountered a weird bug relating to required partition filter columns and merge statements.

The summary: when there is both a filter on the partition column AND a `false` condition (which may occur because we use dbt and dynamically populate this value with true or false), the query itself runs with no issue, but when it is used as part of the merge statement, BigQuery returns the `Cannot query over table 'test_partitioned_table' without a filter over column(s) 'tstamp' that can be used for partition elimination` error, despite the filter clearly being in the query.

I assume that the optimizer is somehow removing this filter (because it sees that the other clause will always be false), and that this leads to the error. For some reason the filter isn't removed when the query isn't part of a merge statement.

A minimal working example is below. Does anyone have experience with this, or can anyone confirm it is a bug?


CREATE TABLE test_partitioned_table (
  id STRING,
  tstamp TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(tstamp, DAY)
OPTIONS (require_partition_filter = TRUE);

INSERT INTO test_partitioned_table (id, tstamp)
VALUES ('hello', TIMESTAMP '2023-05-01');

INSERT INTO test_partitioned_table (id, tstamp)
VALUES ('world', TIMESTAMP '2023-05-02');

CREATE TABLE test_partitioned_table2 (
  id STRING,
  start_tstamp TIMESTAMP
)
PARTITION BY TIMESTAMP_TRUNC(start_tstamp, DAY)
OPTIONS (require_partition_filter = FALSE);

-- This works with no issue
SELECT
  id,
  tstamp AS start_tstamp
FROM test_partitioned_table
WHERE
  tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS TIMESTAMP)
    AND CAST('2023-05-03 00:00:00+00:00' AS TIMESTAMP)
  AND FALSE;

-- This causes an error about not filtering on the partition key
MERGE INTO test_partitioned_table2 dest
USING (
  SELECT
    id,
    tstamp AS start_tstamp
  FROM test_partitioned_table
  WHERE
    tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS TIMESTAMP)
      AND CAST('2023-05-03 00:00:00+00:00' AS TIMESTAMP)
    AND FALSE
) src
ON (dest.id = src.id)
WHEN MATCHED THEN
  UPDATE SET dest.id = src.id, dest.start_tstamp = src.start_tstamp
WHEN NOT MATCHED THEN
  INSERT (id, start_tstamp)
  VALUES (id, start_tstamp);



The error message you're encountering is due to an issue with BigQuery's optimizer. It is effectively removing the filter on the 'tstamp' column because the AND FALSE condition makes the whole predicate always false. As a result, the MERGE behaves as if it must scan all partitions of the 'test_partitioned_table' table, even though the 'tstamp' filter should limit it to the partitions between '2023-04-01' and '2023-05-03'.

This is a known bug in BigQuery's optimizer. Google has been notified about this issue and they're currently working on a solution. However, until the bug is fixed, you can use the following workarounds:

  1. Utilize the 'search_condition' clause in the merge statement to explicitly indicate that the 'tstamp' column should be used for partition elimination. Here's an example of how you can modify your merge statement:

MERGE INTO test_partitioned_table AS t
USING (
SELECT * FROM test_partitioned_table
WHERE tstamp = '2023-03-01' AND false
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET data = s.data
WHEN NOT MATCHED THEN INSERT (id, tstamp, data) VALUES (s.id, s.tstamp, s.data)
search_condition = tstamp = '2023-03-01';

  2. Use the 'require_partition_filter' option when creating the 'test_partitioned_table' table. This will stop the optimizer from removing the filter on the 'tstamp' column. Here's how you can create the table:

    CREATE TABLE test_partitioned_table (
    id INT64,
    tstamp TIMESTAMP,
    data STRING,
    PRIMARY KEY (id)
    ) PARTITION BY TIMESTAMP (tstamp)
    REQUIRE FILTER;

Thanks for your reply - is there a publicly available issue in the IssueTracker (https://issuetracker.google.com/issues) that I can follow please?

I also tried both of your suggestions, but neither SQL snippet is valid, and the second one is actually the specific issue here (a required partition filter is already set, but for some reason it is not being used).

Another workaround could be to create a temporary table with the SELECT statement and then use that temporary table in the MERGE statement. This could potentially bypass the issue with the partition filter being removed.

Here's an example:

CREATE TEMPORARY TABLE temp_table AS
SELECT
  id,
  tstamp AS start_tstamp
FROM test_partitioned_table
WHERE
  tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS TIMESTAMP)
    AND CAST('2023-05-03 00:00:00+00:00' AS TIMESTAMP)
  AND FALSE;

MERGE INTO test_partitioned_table2 dest
USING temp_table src
ON (dest.id = src.id)
WHEN MATCHED THEN
  UPDATE SET dest.id = src.id, dest.start_tstamp = src.start_tstamp
WHEN NOT MATCHED THEN
  INSERT (id, start_tstamp)
  VALUES (id, start_tstamp);

Is this related to the optimizer bug you speak of? In our case we are creating the target table with require_partition_filter, and that is what seems to be throwing the error in the first place. When the partition filter is not required, the query "works", but likely all partitions are scanned.

https://github.com/dbt-labs/dbt-bigquery/issues/792

Hi @dataders, thank you for providing the link to the GitHub issue. Yes, the issue raised by @ryan_snowplow is related to the behavior described above.

Here are a few workarounds you can consider:

  1. Use a Temporary Table: As previously suggested, you can materialize the result of your SELECT statement into a temporary table and then use that temporary table in your MERGE statement. This way, the optimizer's behavior with the original SELECT statement becomes irrelevant for the MERGE operation.

     
    CREATE TEMPORARY TABLE temp_table AS
    SELECT
    ...

    Then use temp_table in your MERGE statement.

  2. Dynamic Condition: Instead of using a static AND FALSE condition, use a dynamic condition that the optimizer can't optimize away. For example:

     
    AND (tstamp IS NOT NULL AND FALSE)
  3. Explicit Partition Filter: If you know the specific partitions you're interested in, you can explicitly list them in the WHERE clause. This might make it harder for the optimizer to mistakenly remove the partition filter.

     
    WHERE tstamp IN (TIMESTAMP "2023-04-01", TIMESTAMP "2023-04-02", ...)
  4. Avoid MERGE with Partitioned Tables: If possible, consider other methods of updating or inserting data into your partitioned table, such as using separate INSERT and UPDATE statements instead of a MERGE (a sketch follows this list). This might not be ideal, but it could be a temporary solution until the issue with MERGE is resolved.
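
For that last option, here is a minimal sketch of replacing the MERGE with separate UPDATE and INSERT statements, reusing the tables from the original example; the partition bounds are copied from the question and the dropped AND FALSE condition is an assumption about what you actually want to run:

-- Update rows that already exist in the target.
UPDATE test_partitioned_table2 dest
SET start_tstamp = src.start_tstamp
FROM (
  SELECT id, tstamp AS start_tstamp
  FROM test_partitioned_table
  WHERE tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS TIMESTAMP)
    AND CAST('2023-05-03 00:00:00+00:00' AS TIMESTAMP)
) src
WHERE dest.id = src.id;

-- Insert rows that are not yet in the target (assumes id is never NULL).
INSERT INTO test_partitioned_table2 (id, start_tstamp)
SELECT id, tstamp
FROM test_partitioned_table
WHERE tstamp BETWEEN CAST('2023-04-01 00:00:00+00:00' AS TIMESTAMP)
    AND CAST('2023-05-03 00:00:00+00:00' AS TIMESTAMP)
  AND id NOT IN (SELECT id FROM test_partitioned_table2);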

Hi @dataders, is this issue resolved in any newer dbt release? Or can you please let me know what I need to do to resolve it?

The scenario below is not working:

1. The base table has require_partition_filter enabled, and the view query has ROW_NUMBER() logic to get unique records from the base table.

If you're experiencing issues where a base table in BigQuery has require_partition_filter enabled and a view built on top of this table includes row number logic for deduplication, there are a few considerations and potential solutions:

Partition Filter Requirement in Views:

  • Views in BigQuery do not inherently pass partition filters from underlying tables. This means if your view is queried, BigQuery expects a partition filter at the query level, even if the view's underlying table (base table) has the require_partition_filter option set.
  • You need to ensure that any query against the view explicitly includes a partition filter to meet this requirement.

Row Number Logic and Scans:

  • The use of row number functions (like ROW_NUMBER() OVER (PARTITION BY ...)) in views can complicate matters, as such functions typically require scanning significant portions of the data, potentially all partitions, to compute the results accurately.
  • If you’re using this function in a view without a partition filter, it might lead to full scans which BigQuery tries to prevent with the require_partition_filter setting.

Workarounds:

  • Modify the View: Include an explicit partition filter within the view definition itself. This might not always be practical depending on your data schema and requirements.
  • Query Modification: When querying the view, explicitly include a partition filter in the query. This ensures compliance with the require_partition_filter setting and avoids errors related to full table scans.
  • Use Materialized Views: If the view's logic is static and performance is a concern, consider using a materialized view with an appropriate partition filter. This can pre-compute and store the results, reducing runtime costs and complexity.
  • Adjust the Base Table Setting: Temporarily disable the require_partition_filter setting during specific operations that require extensive scanning or where it complicates development (see the snippet after this list). This should be done cautiously and with awareness of the potential cost implications.
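
If you do temporarily toggle the setting, it is a one-line DDL change each way (the table name here is a placeholder):

-- Relax the guard before the operation that needs a broad scan.
ALTER TABLE your_base_table SET OPTIONS (require_partition_filter = FALSE);

-- ... run the operation that needs the broader scan ...

-- Re-enable it afterwards to avoid accidental full scans.
ALTER TABLE your_base_table SET OPTIONS (require_partition_filter = TRUE);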

Example Adjustments: If you need to keep the require_partition_filter and use a view, your view definition could look something like this:

CREATE OR REPLACE VIEW your_view AS
SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY some_partition_col ORDER BY some_ordering_col) AS row_num
FROM your_base_table
WHERE some_partition_col = 'specific_partition_value';  -- Enforces partition filter

When querying the view, always include a partition filter:

SELECT * FROM your_view
WHERE some_partition_col BETWEEN 'start_date' AND 'end_date';

These suggestions aim to balance the strict requirements of partition filters in BigQuery with the flexibility and performance considerations of using views for complex queries.

Enforcing the partition filter with a hardcoded 'specific_partition_value' is not an ideal solution for all cases.

True, using a specific partition value in the view definition may not be an ideal solution, especially when you need to handle multiple partitions dynamically. Here are some alternative strategies that might help:

Instead of hardcoding partition filters in the view, you can modify your queries to pass the required partition filters dynamically. This approach gives you the flexibility to specify different partitions based on your use case when querying the view.

 
-- Assuming you've set up the view without a specific partition filter:
SELECT * FROM your_view WHERE partition_column BETWEEN @start_date AND @end_date;
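
One way to supply those values without hardcoding them is BigQuery scripting variables; a minimal sketch reusing the placeholder your_view and partition_column names from above, with assumed dates:

-- Declare the window once, then reference it in the filter.
DECLARE start_date TIMESTAMP DEFAULT TIMESTAMP '2023-04-01 00:00:00+00';
DECLARE end_date   TIMESTAMP DEFAULT TIMESTAMP '2023-05-03 00:00:00+00';

SELECT *
FROM your_view
WHERE partition_column BETWEEN start_date AND end_date;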

If the query patterns are predictable and the data isn't updated frequently, consider using materialized views which can be set to refresh on a schedule. Materialized views can include partition filters that handle a broader range of dates or other partitioning criteria, and they store precomputed results that can make querying more efficient.

 
CREATE MATERIALIZED VIEW your_materialized_view AS
SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY partition_col ORDER BY some_col) AS row_num
FROM base_table
WHERE partition_col BETWEEN 'start_range' AND 'end_range';

This view would need to be refreshed periodically but allows for faster access to precomputed data.

In some cases, you can write more complex SQL logic in the view to handle a variety of partitions without specifying exact partition values. This might involve using functions or more complex conditions to dynamically determine which partitions to access based on the query's context or additional filters provided at runtime (for example, a rolling date window, sketched below).
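
One common shape for such logic is a rolling window anchored on the current time; a minimal sketch with placeholder names, assuming a runtime-evaluated bound like CURRENT_TIMESTAMP() is acceptable for partition elimination in your case:

-- View that only ever exposes the most recent 30 days of partitions.
CREATE OR REPLACE VIEW your_rolling_view AS
SELECT *
FROM your_base_table
-- Because the view's own WHERE clause references the partition column,
-- queries against the view satisfy the require_partition_filter setting.
WHERE partition_col >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);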

If the partition filter requirement proves too restrictive and cannot be efficiently managed within BigQuery alone, consider retrieving a broader dataset into your application or middleware layer and applying further filtering or processing there. This is less efficient in terms of data transfer but might be necessary if the partitioning scheme is highly complex.

If these restrictions are proving too limiting, it might be worth revisiting your partitioning strategy. Perhaps a different partitioning key or a change in the partitioning granularity could align better with your query patterns.

Each of these approaches has trade-offs in terms of performance, cost, and complexity. The right choice depends on your specific requirements, such as how often the data is accessed, the typical query patterns, and how dynamic the partition filters need to be. Adjusting these strategies to match your operational needs can help you manage the require_partition_filter constraint more effectively.

1. Removing the duplicates in the base table, filtering on the partition column, and passing that query to the view.

2. I tried adding partitioning along with clustering, and the partition filter is not working; it is scanning the entire data.

Here are a few strategies and considerations that might help refine your approach:

Ensure Correct Partition Filters in Views: When you create a view in BigQuery that relies on a partitioned base table, remember that the view itself does not automatically inherit the partitioning behavior of the base table. Make sure that any query against the view explicitly includes a partition filter. This is necessary even if the underlying table has a require_partition_filter setting.

Verify Partition and Clustering Definitions: If you have added partitioning along with clustering, ensure that the partitioning key is used effectively in your queries. BigQuery optimizes query performance by pruning partitions when the query filters directly reference the partitioned columns. If the partitioning column isn't properly referenced, BigQuery might end up scanning more data than necessary.
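For reference, here is a minimal sketch of a table that is both partitioned and clustered, together with a query shaped so both can be used (all names and dates are placeholders):

CREATE TABLE your_dataset.events (
  id STRING,
  event_tstamp TIMESTAMP,
  user_id STRING
)
PARTITION BY TIMESTAMP_TRUNC(event_tstamp, DAY)
CLUSTER BY user_id
OPTIONS (require_partition_filter = TRUE);

-- The direct filter on event_tstamp allows partition pruning; the filter
-- on user_id lets BigQuery skip blocks within the scanned partitions.
SELECT *
FROM your_dataset.events
WHERE event_tstamp BETWEEN TIMESTAMP '2023-04-01' AND TIMESTAMP '2023-05-03'
  AND user_id = 'some_user';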

Using QUALIFY with Window Functions: If your aim is to eliminate duplicates using window functions (like ROW_NUMBER()), consider using the QUALIFY clause to filter the rows within the same query block. This can sometimes help in ensuring that the filters and window functions are applied efficiently.

 
SELECT *
FROM base_table
WHERE partition_column BETWEEN 'start_date' AND 'end_date'
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) = 1;

Debugging Data Scans: BigQuery does not have an EXPLAIN statement, but you can use the query execution details (the query plan in the console), a dry run to see how many bytes a query would process, or the INFORMATION_SCHEMA jobs views to understand how your queries are being executed. This can provide insights into whether partition pruning is happening as expected or if full table scans are being performed.
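
For example, a query like the following over INFORMATION_SCHEMA (the region qualifier and lookback window are placeholders) surfaces how many bytes recent jobs actually processed, which makes unexpected full scans easy to spot:

-- Bytes processed by recent query jobs, largest first.
SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20;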

Reevaluate Partitioning and Clustering Choices: Sometimes, the choice of partitioning and clustering keys might not align well with the typical query patterns. Reevaluating and possibly redesigning the partitioning and clustering keys based on the most common access patterns can lead to better performance.

Materialized Views for Aggregated Queries: If your view primarily serves aggregated data or results from complex transformations, consider using a materialized view which can store pre-computed results and is optimized for read access.
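
Bear in mind that materialized views support a restricted query shape (aggregations are allowed, but analytic functions such as ROW_NUMBER() are not), so a viable sketch looks more like the following, with placeholder names:

-- Pre-computed daily counts; BigQuery keeps this incrementally refreshed.
CREATE MATERIALIZED VIEW your_dataset.daily_event_counts AS
SELECT
  TIMESTAMP_TRUNC(event_tstamp, DAY) AS event_day,
  COUNT(*) AS event_count
FROM your_dataset.events
GROUP BY event_day;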

Adjustments in Data Loading or Transformation Logic: Sometimes, adjustments in the way data is loaded or pre-processed can reduce the need for complex deduplication logic in SQL queries. Ensuring data quality at the ingestion stage might reduce complexity and improve performance downstream.