Dataform: incremental table: stored procedure name

Hi,

Dataform generates code with a procedure name like:

 

EXECUTE IMMEDIATE
"""
CREATE OR REPLACE PROCEDURE `??????.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a`() OPTIONS(strict_mode=false)
BEGIN
 
Is there a way to change the name?
 
Thanks,
Cezary

 


Unfortunately, you cannot directly change the auto-generated procedure names within Dataform. Dataform automatically generates procedure names in the format project_id.dataset_id.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a. These hashed names, while not directly modifiable, serve important purposes:

  • Uniqueness: Hashes ensure that each procedure has a unique identifier, allowing Dataform to track changes and determine if modifications require updates or new procedures.
  • Dependency Management: Dataform relies on these names to accurately manage the intricate relationships between procedures within your data transformation pipelines.

Workarounds for Improved Readability

While you cannot directly change the auto-generated names, here are effective strategies to enhance readability and manageability:

  1. View Creation:

    • Create views in BigQuery that reference Dataform-generated tables.
    • Assign meaningful names to these views for easier use in queries.
     
    CREATE VIEW my_descriptive_view_name AS
    SELECT * FROM `project_id.dataset_id.df_d1b3851d423cca3c78ade6504a94655edc9a9d9e4b127897c0a14462705e8e9a`;
    
  2. Custom Operations:

     
    -- Custom operation in Dataform
    CREATE OR REPLACE PROCEDURE your_project.your_dataset.your_meaningful_procedure_name() 
    BEGIN 
      -- Your procedure logic here 
    END; 
    

Thank you but there is another problem.

The hashed name doesn't depend only on the name but also on the start time of the execution.

If you execute two processes at the same time that merge data into the same target table, then the same procedure gets executed twice, as the same name is generated for both of them, which leads to data from one source being skipped and duplicates being produced.
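To make the failure mode concrete, here is a sketch of what two simultaneous runs effectively do (the hashed name `df_abc123` and all table names are invented for illustration, not real Dataform output):

```sql
-- Run A creates the procedure under the shared hashed name:
CREATE OR REPLACE PROCEDURE `your_dataset.df_abc123`()
BEGIN
  MERGE `your_dataset.target` AS t
  USING `your_dataset.source_a` AS s
  ON t.id = s.id
  WHEN NOT MATCHED THEN INSERT (id) VALUES (s.id);
END;

-- Run B, started at the same instant, computes the SAME name and
-- silently overwrites Run A's body with its own:
CREATE OR REPLACE PROCEDURE `your_dataset.df_abc123`()
BEGIN
  MERGE `your_dataset.target` AS t
  USING `your_dataset.source_b` AS s  -- note: source_b, not source_a
  ON t.id = s.id
  WHEN NOT MATCHED THEN INSERT (id) VALUES (s.id);
END;

-- Both runs now CALL the same procedure, so source_a's merge is lost
-- and source_b's merge runs twice:
CALL `your_dataset.df_abc123`();
```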

You're correct! That's an important nuance, and I apologize for not addressing it earlier. Procedure names in Dataform are generated from a hash that includes factors such as the execution start time, which is intended to ensure uniqueness and to track changes to your data transformations. However, this means that procedures executed simultaneously can end up with identical names. This scenario is particularly problematic when those processes merge data into the same destination table, as it increases the risk of data loss or duplication.

Solutions and Best Practices

  1. Sequential Execution: To circumvent this issue, the most reliable method is to schedule your Dataform processes to run sequentially. This ensures that no two processes targeting the same table are executed at the same time.

  2. Dependency Management: In cases where sequential execution is impractical, meticulously structure your Dataform definitions to establish explicit dependencies between processes. This means:

    • Clearly define processes that consume data from the target table as dependents of any processes that concurrently write to it.
    • Rely on Dataform's built-in dependency resolution logic to enforce the correct execution order.
  3. Advanced Strategy: Temporary Tables and Merging: For complex scenarios, consider the following approach:

    • Designate a Primary Process: Identify one of the concurrent processes as the primary one.
    • Use Temporary Tables: Within the primary process, direct the output to a temporary table.
    • Merge Data: Perform a MERGE operation to consolidate the data from the temporary table into the target table in a controlled manner. Refer to the official BigQuery MERGE documentation for details.
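
The temporary-table-plus-MERGE approach can be sketched in BigQuery SQL as follows (all project, dataset, table, and column names are placeholders, not names Dataform generates):

```sql
-- Each concurrent process first writes its output to its own staging table.
CREATE TEMP TABLE staging AS
SELECT id, payload, updated_at
FROM `your_project.your_dataset.source_table`;

-- The primary process then consolidates the staged rows into the target,
-- updating existing keys and inserting new ones in one atomic statement.
MERGE `your_project.your_dataset.target_table` AS t
USING staging AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET payload = s.payload, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, payload, updated_at)
  VALUES (s.id, s.payload, s.updated_at);
```

Because each MERGE is a single atomic statement, funneling all concurrent writes through one controlled MERGE avoids the skipped-source and duplicate-row problems described above.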

Important Considerations

  • Complexity: Implementing the advanced strategy with temporary tables increases the complexity of your Dataform project.
  • Trade-offs: Each solution has implications for execution time, efficiency, and maintainability. Weigh the trade-offs carefully based on your specific requirements and data volumes.

ok thank you!

@ms4446 I just hit on the same problem while investigating random `Procedure is not found` errors which stop workflows with no option to retry.

Because our workflows are externally triggered (from GCP Workflows, on blob creation in GCS), 2 df procedures may get the same name, overwriting each other; and if the timing is right (I mean wrong), one workflow can drop the procedure after execution, before the other one tries to execute it. This can lead either to the wrong SQL definition being run, as per the OP, or to an error that the procedure doesn't exist (anymore).

I understand we can't control the temp name that df generates for the procedure but can you reach out internally and tell us which exact parameters are used to make up the hash assuming it's not just the timestamp?

Hi @yan-hic

The issue you're experiencing, particularly the 'Procedure not found' error in workflows triggered by external events like GCS blob creation, indeed points to a challenge in using Dataform in production with concurrently triggered executions.

In the interim, consider the following strategies to mitigate the impact:

  • Serialization: Where possible, adjust your workflow triggers to ensure Dataform processes are executed sequentially. This can help avoid race conditions.

  • Enhanced Error Handling: Implement comprehensive error logging and notification mechanisms within your Dataform projects to better manage and respond to 'Procedure not found' errors.

  • Provisional Procedures: As a temporary workaround, manually creating and managing procedures with known names might help reduce the frequency of these errors. While not ideal, this approach could offer a stopgap solution.

Engaging with Google Cloud Support

  • Support Channels: Reach out through Google Cloud's official support channels. Detailed reports, especially those affecting production workflows, are taken seriously and can lead to more direct action or guidance.

  • Issue Tracker: Check Issue Tracker for similar reports or create a new issue detailing your experience. This visibility can help prioritize a resolution.

@ms4446 that AI-assisted answer was unfortunately not edited by the product expert that you are, and just transcribed as-is. However, to answer on the proposed strategies:

- Serialization: it is not possible when triggered externally, if one doesn't control the time and rate at which files get copied to GCS. Dataform should have been designed - or corrected since the take-over - to support concurrency. The solution is easy - read on.

- Error Handling: errors are raised by the Dataform wrapping, not by user code, so users have no control over them. This is always the case for incremental tables where a unique key is specified: DF generates a script that creates a procedure to MERGE. The reason we (wanted to) use DF is that we don't need to code the incremental/full-refresh boilerplate ourselves.

- Temp procedures: you basically advise dropping incremental tables and using `operations` instead across a workflow...

The solution in my mind is straightforward though: DF should use a UUID for naming the temp procedure, instead of a hash based on the execution timestamp. This ensures uniqueness and isolation. To me, this is more a bug or a bad design assumption.
In the meantime, I would appreciate you reaching out to answer the question from my previous post: what exactly is the hash made of? If we know how it is built, we can see how to use distinct factors that will produce distinct hash values.
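
As a sketch of the proposed fix (this is not how Dataform currently names procedures; the dataset name and procedure body are placeholders), a collision-free per-run name could be built from a UUID like this:

```sql
-- Build a per-run unique procedure name from a UUID instead of a
-- timestamp-derived hash, so simultaneous runs can never collide.
DECLARE proc_name STRING DEFAULT CONCAT(
  'your_dataset.df_', REPLACE(GENERATE_UUID(), '-', '_'));

-- Create, call, and drop the uniquely named procedure.
EXECUTE IMMEDIATE FORMAT("""
  CREATE OR REPLACE PROCEDURE `%s`() OPTIONS(strict_mode=false)
  BEGIN
    -- the generated MERGE logic would go here
  END
""", proc_name);

EXECUTE IMMEDIATE FORMAT('CALL `%s`()', proc_name);
EXECUTE IMMEDIATE FORMAT('DROP PROCEDURE `%s`', proc_name);
```

Since `GENERATE_UUID()` is unique per call, a concurrent run would create and drop its own procedure without ever touching another run's definition.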