Dataform Incremental table - Am I thinking right?

Hi,

1) I have a table that is being formed through Dataform every hour (I have an Airflow DAG Calling Dataform)
2) The Dataform file is named: Table_A.sqlx and following is the query I type in the Table_A.sqlx to get the table Table_A

config {
    type: "table",
    description: "Example table called Table_A",
    
}

SELECT 
    a,
    b

FROM 
    (
    SELECT * FROM ${ref("XXX_YYY_ZZZ")} AS somename1
    UNION ALL
    SELECT * FROM ${ref("AAA_BBB_CCC")} AS somename2
    
    ) AS combined_tables

I want to do the following

3) I want to have an incremental table called: Table_B

4) So I will file called Table_B.sqlx

5) Table_A has a column called timestamp

6) From Table_A I want to write all the rows into Table_B. The Table_B should be appended with all rows of Table_A

7) Hence I say that Table_B is incremental table

😎 The condition for merging (or, putting) records of Table_B into Table_A is that the timestamp record of Table_B must be greater than the MAX(timestamp) record of Table A

What should be the content of file Table_B.sqlx?

IS this correct?

The contents of Table_B.sqlx will be as follows:

config {
    type: "incremental",
    description: "Example table called Table_B",
    
}

SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})`) }

IS this right?

@ms4446 

Solved Solved
0 4 90
1 ACCEPTED SOLUTION

The COALESCE function is designed to handle situations where you might have NULL values (which represent missing or unknown data). It takes a list of expressions and returns the first one that isn't NULL.

Below uses COALESCE cleverly to make sure all the data from Table_A gets copied into Table_B the first time you run the process:

 
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') FROM ${self()})
  1. MAX(timestamp): This part tries to find the most recent timestamp in Table_B. Since Table_B might be empty initially, this could return NULL.

  2. COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01'): COALESCE comes to the rescue here:

    • Scenario 1: Table_B has data: If MAX(timestamp) returns an actual timestamp, COALESCE simply returns that timestamp. This ensures that only new data (with more recent timestamps) gets pulled from Table_A.
    • Scenario 2: Table_B is empty: If MAX(timestamp) returns NULL, COALESCE provides the fallback timestamp '1970-01-01'. This makes sure all the records from Table_A are included in the initial load, as their timestamps will naturally be more recent than the fallback.

The beauty of COALESCE is its conciseness. It handles the "empty table" scenario elegantly in a single line of code.

View solution in original post

4 REPLIES 4

Hi @ayushmaheshwari ,

Your approach looks correct and your understanding of incremental tables in Dataform seems spot-on. Here are some additional insights:

Your Table_B.sqlx file is well-structured and should effectively create an incremental table:

 
config {
  type: "incremental",
  description: "Example table called Table_B",
}

SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})

Explanation

  1. config { type: "incremental" }: This is the heart of the incremental table setup. It tells Dataform to only append new rows based on the timestamp condition.

  2. SELECT * FROM ${ref("Table_A")}: This selects all columns from your source Table_A.

  3. WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()}): This is the crucial filtering logic. It ensures only rows from Table_A with a timestamp greater than the maximum timestamp currently in Table_B are included. The use of self() references Table_B itself.

Additional Considerations

  • Initial Population: The first time this incremental table runs, there won't be a MAX(timestamp) in Table_B. Consider replacing the subquery with something like COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') to ensure the initial load pulls all data from Table_A.
  • Data Type Compatibility: As mentioned in the detailed guide, if you're unsure about the timestamp column's data type, use CAST to explicitly convert it (e.g., CAST(timestamp AS TIMESTAMP)).
  • Performance: The provided solution is a great starting point. Explore more advanced change tracking methods if your datasets are massive and updates are extremely frequent.

Thank you @ms4446 . I do not understand the COALESCE you describe , can you explain if you do not mind? Yes, for the first time it has to simply pull all data from Table A into Table B. How do I tell Dataform that? I do not follow by telling it through COALESCE

The COALESCE function is designed to handle situations where you might have NULL values (which represent missing or unknown data). It takes a list of expressions and returns the first one that isn't NULL.

Below uses COALESCE cleverly to make sure all the data from Table_A gets copied into Table_B the first time you run the process:

 
SELECT * FROM ${ref("Table_A")}
WHERE timestamp > (SELECT COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01') FROM ${self()})
  1. MAX(timestamp): This part tries to find the most recent timestamp in Table_B. Since Table_B might be empty initially, this could return NULL.

  2. COALESCE(MAX(timestamp), TIMESTAMP '1970-01-01'): COALESCE comes to the rescue here:

    • Scenario 1: Table_B has data: If MAX(timestamp) returns an actual timestamp, COALESCE simply returns that timestamp. This ensures that only new data (with more recent timestamps) gets pulled from Table_A.
    • Scenario 2: Table_B is empty: If MAX(timestamp) returns NULL, COALESCE provides the fallback timestamp '1970-01-01'. This makes sure all the records from Table_A are included in the initial load, as their timestamps will naturally be more recent than the fallback.

The beauty of COALESCE is its conciseness. It handles the "empty table" scenario elegantly in a single line of code.

One caveat if the table gets partitioned by timestamp, it won't prune as-is.
One should declare a variable with the predicate (in `pre-operations`) and use that var value in the main query.