snowflake: 2 correlated columns have very differen...

raivivan · 06-23-2022 05:16 AM

0

We have a table with 120M rows (over 2222 micropartitions), that has 2 important columns, record_id with values in format as prefix|<account_id>|<uuid> (unique) and column account_id, which has the value of <account_id>. Note that the prefix is same for all records. Then of course some factum columns, but that is not relevant.

Snowflake shows perfect clustering for the record_id column (automatically chosen by SF, no specified clustering is set by us) via clustering_information function:

"total_partition_count" : 2222,
 "total_constant_partition_count" : 2222,
 "average_overlaps" : 24.0,
 "average_depth" : 25.0,

However, for the column account_id, the clustering is very bad

 "total_constant_partition_count" : 0,
 "average_overlaps" : 2221.0,
 "average_depth" : 2222.0,

There is about 130 distinct account ids, which means that on average, records of one account_id should be over 17 partitions. Even if snowflake clusters by records_id, the beginning of that column (prefix|<account_id>) correlates with account_id column. So, records with the same account_id should end up in the same partitions. Therefore, I cannot figure out why there is 100% overlap of micropartitions for the account_id column. It is like if snowflake used some weird sorting for the record_id column and thus scattered rows of each account across all partitions. Is that possible?

This has negative consequences on performance, since doing a query with account_id filter results in scan of all partitions.

snowflake: 2 correlated columns have very different clustering information (one has perfect, the other has terrible)