0
We have a table with 120M rows (over 2222 micropartitions), that has 2 important columns, record_id with values in format as prefix|<account_id>|<uuid>
(unique) and column account_id, which has the value of <account_id>
. Note that the prefix is same for all records. Then of course some factum columns, but that is not relevant.
Snowflake shows perfect clustering for the record_id column (automatically chosen by SF, no specified clustering is set by us) via clustering_information function:
"total_partition_count" : 2222,
"total_constant_partition_count" : 2222,
"average_overlaps" : 24.0,
"average_depth" : 25.0,
However, for the column account_id, the clustering is very bad
"total_constant_partition_count" : 0,
"average_overlaps" : 2221.0,
"average_depth" : 2222.0,
There is about 130 distinct account ids, which means that on average, records of one account_id should be over 17 partitions. Even if snowflake clusters by records_id, the beginning of that column (prefix|<account_id>
) correlates with account_id column. So, records with the same account_id should end up in the same partitions. Therefore, I cannot figure out why there is 100% overlap of micropartitions for the account_id column. It is like if snowflake used some weird sorting for the record_id column and thus scattered rows of each account across all partitions. Is that possible?
This has negative consequences on performance, since doing a query with account_id filter results in scan of all partitions.