Dataform uniqueKey and partitionBy VS BigQuery primary key and partitioned on field

Hi there,

What is the different between primary keys in BigQuery and uniqueKey in GCP Dataform?  How come the uniqueKey is not showing in BQ? Is it only used in Dataform config {} when creating table/incremental table? 

Also, Is partitionBy (GCP Dataform) the same as the partitioned on field (BigQuery)? do i need to add partition by day or by the field (session_date) in GCP Dataform? They are not the same?

Thanks

sysph_0-1689949481866.png

sysph_1-1689949527642.png

 

 

Solved Solved
0 5 2,107
2 ACCEPTED SOLUTIONS

Neither BigQuery nor Dataform enforce primary key or unique key constraints, but both allow you to designate certain fields as unique identifiers for your own use. The main difference is how these unique identifiers are used: in BigQuery, you can use them to join tables, filter data, etc., while in Dataform, uniqueKey is used to handle incremental updates to tables.

The partitionBy configuration in Dataform and partitioning a table in BigQuery are not exactly the same, but they are related concepts and serve similar purposes.

In BigQuery, when you create a partitioned table, you specify the column on which to partition. This could be a _PARTITIONTIME pseudo column for ingestion-time partitioned tables, or any TIMESTAMP or DATE column for partitioning based on a column's values.

In Dataform, you can also specify a column to partition by using the partitionBy configuration setting. For example, if you have a session_date column that you want to partition on, you would add partitionBy: "session_date" in your Dataform config.

However, remember that in Dataform, the partitioning is done as part of the data transformation process, and the actual partitioning of the table in BigQuery will depend on how you have set up your data pipeline and how you write the data from Dataform to BigQuery.

As to your question about whether you need to partition by day or by session_date, it depends on your use case. If session_date is a DATE or TIMESTAMP column that represents the day of the session, and you often filter your queries by this date, then it would be a good idea to partition by session_date. If session_date is not a DATE or TIMESTAMP column, or if you don't often filter by this column, it may not make sense to partition by session_date.

In general, you want to partition on a column that is frequently used in the WHERE clause of your queries, and that divides your data into a large number of relatively equal-sized partitions.

View solution in original post

Both clusterBy and updatePartitionFilter are configuration settings in Dataform, but they serve different purposes.

clusterBy: This configuration setting is used to specify one or more columns to cluster a table by. Clustering is a technique used in BigQuery to improve the performance of queries that filter by the clustering columns. When a table is clustered, rows with similar values in the clustering columns are stored together, which can significantly improve query performance.

updatePartitionFilter: This configuration setting is used with incremental tables in Dataform. When you set updatePartitionFilter, Dataform uses it to limit the amount of data that needs to be scanned when updating an incremental table. It's a way to improve the performance of incremental updates by limiting the amount of data that needs to be scanned.

So while both settings are related to improving performance, they serve different purposes. clusterBy improves the performance of queries that filter by the clustering columns, while updatePartitionFilter improves the performance of incremental updates.

View solution in original post

5 REPLIES 5

Neither BigQuery nor Dataform enforce primary key or unique key constraints, but both allow you to designate certain fields as unique identifiers for your own use. The main difference is how these unique identifiers are used: in BigQuery, you can use them to join tables, filter data, etc., while in Dataform, uniqueKey is used to handle incremental updates to tables.

The partitionBy configuration in Dataform and partitioning a table in BigQuery are not exactly the same, but they are related concepts and serve similar purposes.

In BigQuery, when you create a partitioned table, you specify the column on which to partition. This could be a _PARTITIONTIME pseudo column for ingestion-time partitioned tables, or any TIMESTAMP or DATE column for partitioning based on a column's values.

In Dataform, you can also specify a column to partition by using the partitionBy configuration setting. For example, if you have a session_date column that you want to partition on, you would add partitionBy: "session_date" in your Dataform config.

However, remember that in Dataform, the partitioning is done as part of the data transformation process, and the actual partitioning of the table in BigQuery will depend on how you have set up your data pipeline and how you write the data from Dataform to BigQuery.

As to your question about whether you need to partition by day or by session_date, it depends on your use case. If session_date is a DATE or TIMESTAMP column that represents the day of the session, and you often filter your queries by this date, then it would be a good idea to partition by session_date. If session_date is not a DATE or TIMESTAMP column, or if you don't often filter by this column, it may not make sense to partition by session_date.

In general, you want to partition on a column that is frequently used in the WHERE clause of your queries, and that divides your data into a large number of relatively equal-sized partitions.

If I want to use _PARTITIONTIME pseudo column in dataform when I create a new table - How can I do it?
If I add it to the partitionBy configuration setting, it doesn't recognize this column.

Hi MS4446, 

I guess clusterBy and updatePartitionFilter in the config {} block will be the same purpose in Dataform when handling merge or insert incremental records to BQ ??  thanks

Both clusterBy and updatePartitionFilter are configuration settings in Dataform, but they serve different purposes.

clusterBy: This configuration setting is used to specify one or more columns to cluster a table by. Clustering is a technique used in BigQuery to improve the performance of queries that filter by the clustering columns. When a table is clustered, rows with similar values in the clustering columns are stored together, which can significantly improve query performance.

updatePartitionFilter: This configuration setting is used with incremental tables in Dataform. When you set updatePartitionFilter, Dataform uses it to limit the amount of data that needs to be scanned when updating an incremental table. It's a way to improve the performance of incremental updates by limiting the amount of data that needs to be scanned.

So while both settings are related to improving performance, they serve different purposes. clusterBy improves the performance of queries that filter by the clustering columns, while updatePartitionFilter improves the performance of incremental updates.

Thank you MS4446 for speedy responding .  That's really helpful.  Have a lovely weekend