How to add dependencies from .js script in Dataform or use wildcard/tag inside dependencies?

I have about 20 scripts to merge tables that I generate using .js inside Dataform (something like here https://www.googlecloudcommunity.com/gc/Data-Analytics/Wildcards-in-Dataform-sqlx-files/td-p/636028)...

Now I want to reference this .js file as a dependency in my other .sqlx script, but it does not work. I can reference each of the generated scripts, but I would prefer to have it dynamic like using tag or wildcard (because all generated scripts have the same prefix). Is it possible?

Solved Solved
0 20 2,478
2 ACCEPTED SOLUTIONS

To create a separate Dataform project for your generated scripts, follow these steps:

  1. Install the Dataform CLI.
  2. Initialize a new project:
dataform init
  1. Connect to your data warehouse:
dataform connect
  1. Create a new directory for your generated scripts.
  2. Create a new Dataform file in the directory and paste your generated scripts into the file.
  3. Save the file.
  4. In your other Dataform projects, add a dependency to the new project.

Using a Dataform Package:

To create a Dataform package that contains your generated scripts, follow these steps:

  1. Clone the Dataform base package repository:
git clone https://github.com/dataform/base-package.git
  1. Optionally, import the package to Dataform web:
dataform import package
  1. Update the base repository to add your package functionality.
  2. Connect to a data warehouse to test out your package.
  3. Release your package to the community (optional).

Using a Dataform Custom SQL Operation:

To use a Dataform custom SQL operation to generate your dependencies dynamically, follow these steps:

  1. Create a new Dataform file and paste the following SQL into the file:
 
CREATE OR REPLACE CUSTOM SQL OPERATION generate_dependencies() AS
SELECT table_name AS dependency
FROM `project.dataset.table_schema`
WHERE table_name LIKE 'generated_script_prefix%';
  1. Save the file.
  2. In your other Dataform projects, add a dependency to the custom SQL operation.
  3. In your Dataform code, call the custom SQL operation to generate your dependencies:
dependencies: generate_dependencies();

This will generate a list of dependencies that you can use in your Dataform code.

View solution in original post

@maffka123 , I would recommend reaching out to support. Be prepared to share your code and detailed information about the observed behavior to facilitate efficient troubleshooting and identify the root cause behind the unexpected execution sequence.

View solution in original post

20 REPLIES 20

To create a separate Dataform project for your generated scripts, follow these steps:

  1. Install the Dataform CLI.
  2. Initialize a new project:
dataform init
  1. Connect to your data warehouse:
dataform connect
  1. Create a new directory for your generated scripts.
  2. Create a new Dataform file in the directory and paste your generated scripts into the file.
  3. Save the file.
  4. In your other Dataform projects, add a dependency to the new project.

Using a Dataform Package:

To create a Dataform package that contains your generated scripts, follow these steps:

  1. Clone the Dataform base package repository:
git clone https://github.com/dataform/base-package.git
  1. Optionally, import the package to Dataform web:
dataform import package
  1. Update the base repository to add your package functionality.
  2. Connect to a data warehouse to test out your package.
  3. Release your package to the community (optional).

Using a Dataform Custom SQL Operation:

To use a Dataform custom SQL operation to generate your dependencies dynamically, follow these steps:

  1. Create a new Dataform file and paste the following SQL into the file:
 
CREATE OR REPLACE CUSTOM SQL OPERATION generate_dependencies() AS
SELECT table_name AS dependency
FROM `project.dataset.table_schema`
WHERE table_name LIKE 'generated_script_prefix%';
  1. Save the file.
  2. In your other Dataform projects, add a dependency to the custom SQL operation.
  3. In your Dataform code, call the custom SQL operation to generate your dependencies:
dependencies: generate_dependencies();

This will generate a list of dependencies that you can use in your Dataform code.

Hi Thank you very much for your answer!

The last part is exactly what I need , I just do not understand what do you mean in the point 3. In your other Dataform projects, add a dependency to the custom SQL operation. ? I actually what to have everything in one project and should I edit package.json file? How exactly?

just like that it does not work

maffka123_0-1699260909482.png

maffka123_1-1699260963586.png

 

 

 

In Dataform, you don't need to edit the package.json file to add a dependency for a custom SQL operation. Instead, you'll reference the output of the custom SQL operation within your Dataform SQLX files using the ref() function, which is how you define dependencies between datasets within a Dataform project.

Here's how you can reference a custom SQL operation in your SQLX files:

  1. Ensure your custom SQL operation is set to produce an output by including config { hasOutput: true } at the beginning of the file.
  2. Use the resulting dataset as you would use any other dataset by calling it with the ref() function.

For example, if your custom SQL operation creates a view named generated_dependencies, you would reference it in another SQLX file like this:

 
SELECT * FROM ${ref('generated_dependencies')}

This reference tells Dataform that there's a dependency on the generated_dependencies dataset, and it will ensure that the dataset is created before running the SQLX file that references it.

You just wrote above that I can include the list of dependencies as 

dependencies: generate_dependencies();

 If I will use ref() it will just make the view, as you wrote above and I want that the names from the view are included as dependencies, not the view itself.

 

Just to explain my problem again: I have a .js File which dynamically generates about 10 update operations for a list of tables, resulting operations are calles megre_<table>_stage, now I want that in other .sqlx file all the list of these generated scripts is included as `dependencies: [megre_<table1>_stage,megre_<table2>_stage...]` dynamically.

So far I have tried to include the name of my .js File as dependency, wildcard "megre_*_stage", suggested above solution, a bit modified just generates a view with the names, but do not include them as dependencies

Hi @maffka123 ,

Thanks for the clarification. 

In Dataform, dependencies for an action are specified in an array in the configuration block at the top of a .sqlx file. Unfortunately, as of my last update, Dataform does not support dynamic generation of this dependencies list within the .sqlx file itself; it expects a static array of strings, where each string is the name of another action in the project.

Since Dataform compiles the code before running it, the dynamic resolution of dependencies at runtime using something like dependencies: generate_dependencies(); isn't supported.

If you have a .js file generating multiple actions with predictable names (like merge_<table>_stage), one potential workaround would be to use JavaScript to dynamically generate the dependencies array. However, this would involve generating the .sqlx file itself via a script, not just the dependencies array.

If you're using JavaScript to define your Dataform actions, you might be able to use JavaScript to loop over your table names and programmatically generate the dependencies array. You would then inject this array into the config block of your .sqlx file.

Here's a conceptual example in Javascript&colon;

const tables = ["table1", "table2", "table3"]; // Your dynamic list of tables

publish("my_table", {
dependencies: tables.map(table => `merge_${table}_stage`),
query: `SELECT * FROM ...` // Your SQL query here
});

Please Note: This code isn't complete and would need to be part of a larger .js file that generates your Dataform actions, but it illustrates the idea of programmatically creating the dependencies array.

Thank you very much, i was afraid that i will have to do this 🙂

One more question, is it possible inside a .js file completely define an order of .sqlx file executions? Something like complex dependencies? Since we have about 5 steps in which need to be execude one after another (I know I can schedule them, but would be cool if they would be triggered only if previous was successfull and in one file!)

Something like

const tables = ["table1", "table2", "table3"]; 

execute("step1", {tables.map(table => `merge_${table}_stage`)
});

execute("step2", {"merge_other_table"});

etc.

In Dataform, you can indeed define execution order using JavaScript, essentially setting up a chain of dependencies that must be executed successfully before proceeding to the next step.

Here's a conceptual example of how you might structure this in a .js file:

 

const tableNames = ["table1", "table2", "table3"];

// Define the first step
const step1Actions = tableNames.map(table => {
  return publish(`${table}_merged`, {
    // SQL for the merge operation
    query: `SELECT * FROM ${table}`,
    // Add other configuration as needed
  });
});

// Define the second step which depends on all actions from step 1
publish("step2", {
  dependencies: step1Actions,
  query: `SELECT * FROM merge_other_table`,
  // Add other configuration as needed
});

// Continue chaining steps as needed...

To set up complex dependencies and execution order fully in one .js file, you'd utilize the dependencies attribute to manage the order. Remember, though, that this code needs to be part of a .js file in your Dataform project, and you'd have to adapt the queries and table names to match your actual schema and requirements.

This setup does not require scheduling since Dataform will automatically handle the execution order based on the dependencies you've defined. If one step fails, subsequent steps that depend on it will not run.

Cool! Thank you! Just one last question, can I reference existing .sqlx scripts instead of writing the whole query? Like:

 

 const step2Actions =  
    publish("adrkey_shedules").query(
      ctx => ctx.ref(AdrkeyAndTlog_TableFullLoad)
    );
 
Where AdrkeyAndTlog_TableFullLoad.sqlx is a script inside my DF project?

Yes, you can reference existing .sqlx scripts within your .js files by using the ref() function to refer to the output of those scripts. However, the publish() function does not have a query method that takes a function in Dataform. Instead, you would set the dependencies attribute of the publish() function to ensure that the action represented by AdrkeyAndTlog_TableFullLoad.sqlx is executed before the current action. Here's how you can do it:

 

const AdrkeyAndTlog_TableFullLoad = "AdrkeyAndTlog_TableFullLoad";

publish("step2_action", {
  dependencies: [AdrkeyAndTlog_TableFullLoad],
  // ... other configuration
});

This code doesn't execute the AdrkeyAndTlog_TableFullLoad query itself but ensures that the step2_action depends on the AdrkeyAndTlog_TableFullLoad action being completed first. If you want to directly use the result of AdrkeyAndTlog_TableFullLoad in another action's query, you would reference it like so:

SELECT * FROM ${ref('AdrkeyAndTlog_TableFullLoad')}

Remember, the actual SQL code for AdrkeyAndTlog_TableFullLoad stays within its .sqlx file, and you use ref() to reference its output in other scripts.

I get an error when I refence the first step: Uncaught TypeError: Cannot read properties of undefined (reading 'get')

 

publish("step2_action", {
  dependencies: step1Actions
});

 ideally I would like to reference it in a list:

publish("step2_action", {
  dependencies: [step1Actions, "otherTable"]
});

But then I get: Uncaught TypeError: Converting circular structure to JSON --> starting at object with constructor 'Session' | property 'actions' -> object with constructor 'Array' | index 0 -> object with constructor 'Operation' --- property 'session' closes the circle

The error you're encountering seems to be related to how Dataform is interpreting the dependencies attribute in your publish function. When you use step1Actions directly as a dependency, it appears that Dataform is not able to process it correctly, likely because step1Actions is an array of actions, not a string or array of strings, which is what the dependencies attribute expects.

To resolve this, you should refer to the names of the actions (as strings) rather than the action objects themselves. If step1Actions is an array of publish actions, you would need to extract their names and use those as dependencies. Here's an example of how you might adjust your code:

 

const step1ActionNames = tableNames.map(table => `${table}_merged`);
publish("step2_action", { dependencies: [...step1ActionNames, "otherTable"]
});

Hi, So I did construct a pipeline which runs, but unfortunately all dependencies in all steps are run at the same time and only steps themself are executed one by one

// prepare list of crm table run, as it is compiled from CRM.BSI_TablesMerge
const tables_list = Object.entries(crm_tables.tables_dict).map(([k, v]) => (`merge_${k}`));

// execute bsi_* table merges and tlog merge
publish("step1", {
  dependencies: [...tables_list, "Tlog"]
});

// execute a-t mapping
publish("step2", {
  dependencies: ["step1", "AdTlog"]
});

// execute features preparation
publish("step3", {
  dependencies: ["step2", "Attributes", "clc"],
  tags: "adrkeytlog_pipeline"
});

The issue you're encountering is due to Dataform's parallel execution of dependencies. Dataform ensures that all specified dependencies for a step are completed before executing that step, but it doesn't enforce sequential execution within each step.

To achieve the desired sequential execution, you can structure your pipeline by defining individual tasks as separate publish actions and then chaining them with dependencies. This way, you create a dependency chain that controls the execution order within each step.

Here's an example of how to restructure your code:

// Define individual tasks as separate actions
publish("merge_crm_table1", { dependencies: ["CRM.BSI_TablesMerge"] });
// ... other CRM table merge actions ...
publish("merge_tlog", { dependencies: ["Tlog"] });

// Collect dependencies for step1
const step1Dependencies = ["merge_crm_table1", /* other CRM merges */, "merge_tlog"];

// Execute step 1
publish("step1", { dependencies: step1Dependencies });

// Continue with further steps
// ...

// This approach ensures that actions within each step are executed in the order you define in your dependencies.

Hi, what I meant is not that dependencies are executet at the same time within one step, it is ok, but all dependencies from all steps run together, as if I have packed them in one step and not in separate steps.

Ensure that for each publish action in a step, you correctly specify the previous step's actions as dependencies. This way, step2 should only begin after all actions in step1 are completed, and so on. This setup should prevent all dependencies from running at the same time and ensure a sequential flow across steps.

Unfortunately  this seems does not work, I have posted above my code, I specify steps as dependencies:

// prepare list of crm table run, as it is compiled from CRM.BSI_TablesMerge
const tables_list = Object.entries(crm_tables.tables_dict).map(([k, v]) => (`merge_${k}`));

// execute bsi_* table merges and tlog merge
publish("step1", {
  dependencies: [...tables_list, "Tlog"]
});

// execute a-t mapping
publish("step2", {
  dependencies: ["step1", "AdTlog"]
});

// execute features preparation
publish("step3", {
  dependencies: ["step2", "Attributes", "clc"],
  tags: "adrkeytlog_pipeline"
});

 I see that steps are executed sequentially, but all the dependencies from all steps are running together..

Based on your description and the code you've shared, it seems like there might be a misunderstanding in how Dataform is interpreting the dependencies. In Dataform, when you declare a publish action like "step1", "step2", etc., these are treated as individual actions. The dependencies within these actions are considered for execution before the specific action ("step1", "step2", etc.) starts.

However, declaring "step1" as a dependency in "step2" does not necessarily mean that all dependencies of "step1" will complete before "step2" starts. It only ensures that the action "step1" itself is completed. If "step1" is a wrapper or a summary action without its own query, it might complete without waiting for its internal dependencies to finish.

To ensure sequential execution across steps, each action in step2 should explicitly list all actions from step1 as dependencies, and similarly for step3. This can become quite verbose, but it's necessary to explicitly define the dependency chain in Dataform.

If this approach still doesn't work, or if it's too cumbersome for your project, I would recommend reaching out to support.

listing all dependencies in each step also does not work

@maffka123 , I would recommend reaching out to support. Be prepared to share your code and detailed information about the observed behavior to facilitate efficient troubleshooting and identify the root cause behind the unexpected execution sequence.

Hello @maffka123  could you please let us know if you have found the solution for your problem ? 

Thank you