defaultSchema not applied in DDL

I had expected that the default schema be used in any non-qualified table reference.

For instance, I have bunch of DDLs that DataForm cannot generate at compiling time so we use BQ scripting.

 

EXECUTE IMMEDIATE "CREATE TABLE test AS SELECT <selList-from-UDF> FROM staging"

 

will error out with `table must be qualified with a dataset`

The workaround is to add a preceding `set @@dataset_id=`${schema()}` 

Can someone though clarify when the defaultSchema (e.g. overridden in the execution API) is used vs. where not (and why) ?

Thanks !

Solved Solved
0 1 137
1 ACCEPTED SOLUTION

The defaultSchema setting in Dataform, configurable in dataform.json or via the execution API, acts as the implicit dataset for table references within Dataform's SQLX code. This setup offers significant advantages:

  • Implicit Dataset References: Utilizing Dataform's SQLX syntax to define tables within the defaultSchema eliminates the need for full table qualification, streamlining script development.

  • Simplified SQL Generation: SQL generated by Dataform for creating views, transformations, and other operations typically adheres to the defaultSchema, simplifying project management and execution.

When employing BigQuery's EXECUTE IMMEDIATE for dynamic SQL execution, explicit table qualification becomes essential due to:

  • Scope and Isolation: EXECUTE IMMEDIATE commands run in a distinct execution context, separate from the Dataform environment. As such, the defaultSchema setting does not automatically apply within these dynamically constructed SQL commands.

  • Security and Clarity: Mandatory dataset qualification within EXECUTE IMMEDIATE statements ensures clarity and security, significantly reducing the risk of accidental operations on unintended datasets.

Limitations of defaultSchema

There are scenarios within Dataform where explicit table qualification is necessary:

  • Cross-Dataset References: Queries involving tables from outside the defaultSchema require full qualification (e.g., dataset_name.table_name) to ensure accuracy.

  • Avoiding Name Conflicts: In cases where identical table names exist across different datasets, full qualification is essential to prevent conflicts and ensure correct table references.

Visual Illustration Concept

Imagine a diagram where two blocks represent "Dataform SQLX" and "EXECUTE IMMEDIATE" processes. Arrows from the defaultSchema setting point towards the "Dataform SQLX" block, indicating direct applicability. In contrast, a barrier symbol or a detour sign between the defaultSchema and "EXECUTE IMMEDIATE" block illustrates the need for explicit dataset specification in dynamic SQL contexts.

The set @@dataset_id Workaround

For dynamically executing DDL statements within the desired dataset context using EXECUTE IMMEDIATE, the following approach is recommended:

 

 
set @@dataset_id = `${schema()}`;
EXECUTE IMMEDIATE "CREATE TABLE test AS SELECT <selList-from-UDF> FROM staging";
  • @@dataset_id: This BigQuery session variable temporarily overrides the default dataset, aligning dynamic SQL execution with the intended defaultSchema.

  • schema() Function: A Dataform utility that dynamically retrieves the current schema, facilitating adaptable and context-aware DDL execution.

Best Practices

  • Assertion Testing: Implement assertions within your Dataform project to ensure that tables are indeed created within the correct defaultSchema, enhancing data integrity and consistency.

  • Advanced Use Cases: For complex queries involving joins across different datasets, fully qualify table names (e.g., project.dataset1.table1 JOIN project.dataset2.table2) to maintain clarity and prevent ambiguity.

View solution in original post

1 REPLY 1

The defaultSchema setting in Dataform, configurable in dataform.json or via the execution API, acts as the implicit dataset for table references within Dataform's SQLX code. This setup offers significant advantages:

  • Implicit Dataset References: Utilizing Dataform's SQLX syntax to define tables within the defaultSchema eliminates the need for full table qualification, streamlining script development.

  • Simplified SQL Generation: SQL generated by Dataform for creating views, transformations, and other operations typically adheres to the defaultSchema, simplifying project management and execution.

When employing BigQuery's EXECUTE IMMEDIATE for dynamic SQL execution, explicit table qualification becomes essential due to:

  • Scope and Isolation: EXECUTE IMMEDIATE commands run in a distinct execution context, separate from the Dataform environment. As such, the defaultSchema setting does not automatically apply within these dynamically constructed SQL commands.

  • Security and Clarity: Mandatory dataset qualification within EXECUTE IMMEDIATE statements ensures clarity and security, significantly reducing the risk of accidental operations on unintended datasets.

Limitations of defaultSchema

There are scenarios within Dataform where explicit table qualification is necessary:

  • Cross-Dataset References: Queries involving tables from outside the defaultSchema require full qualification (e.g., dataset_name.table_name) to ensure accuracy.

  • Avoiding Name Conflicts: In cases where identical table names exist across different datasets, full qualification is essential to prevent conflicts and ensure correct table references.

Visual Illustration Concept

Imagine a diagram where two blocks represent "Dataform SQLX" and "EXECUTE IMMEDIATE" processes. Arrows from the defaultSchema setting point towards the "Dataform SQLX" block, indicating direct applicability. In contrast, a barrier symbol or a detour sign between the defaultSchema and "EXECUTE IMMEDIATE" block illustrates the need for explicit dataset specification in dynamic SQL contexts.

The set @@dataset_id Workaround

For dynamically executing DDL statements within the desired dataset context using EXECUTE IMMEDIATE, the following approach is recommended:

 

 
set @@dataset_id = `${schema()}`;
EXECUTE IMMEDIATE "CREATE TABLE test AS SELECT <selList-from-UDF> FROM staging";
  • @@dataset_id: This BigQuery session variable temporarily overrides the default dataset, aligning dynamic SQL execution with the intended defaultSchema.

  • schema() Function: A Dataform utility that dynamically retrieves the current schema, facilitating adaptable and context-aware DDL execution.

Best Practices

  • Assertion Testing: Implement assertions within your Dataform project to ensure that tables are indeed created within the correct defaultSchema, enhancing data integrity and consistency.

  • Advanced Use Cases: For complex queries involving joins across different datasets, fully qualify table names (e.g., project.dataset1.table1 JOIN project.dataset2.table2) to maintain clarity and prevent ambiguity.