Messed up time zones while replicating

We replicate data from MySQL to BigQuery via DataFusion.

MySQL server has UTC+2 timezone while BQ has UTC by default.

We experience the next issue.

It looks like while DataFusion is on the stage of taking snapshots just copy timestamp column with value of UTC+2 but paste it as it was UTC (incorrect value). When snapshot stage is passed DataFusion replicates data but already with correct conversion from UTC+2 to UTC, hence we have 2 hours when count of records is "duplicated". They contain records both from UTC and UTC+2 timezones.

I want to know if my hypothesis is true? is it bug or feature? is there a good practice how to deal with this case besides set MySQL server to UTC or BigQuery to UTC+2?

Solved Solved
0 1 297
1 ACCEPTED SOLUTION

It seems like the issue you're encountering could indeed be related to the time zone differences between your MySQL server (UTC+2) and BigQuery (UTC). Data replication tools like DataFusion might handle timestamps differently during snapshot creation and data replication, leading to the observed duplication issue.

This isn't necessarily a bug but rather a consequence of how timestamps are handled during the replication process. When the snapshot is taken, timestamps might not be adjusted to UTC, leading to the initial discrepancy. However, during the replication process, DataFusion might correctly convert timestamps to UTC, resulting in the duplication of records with different timestamps.

One good practice to handle this situation is to standardize the time zones across your systems. You've mentioned adjusting the time zone settings for either MySQL or BigQuery, which is indeed a viable solution. Here are a couple of approaches:

  1. Set MySQL Server to UTC: Configuring your MySQL server to use UTC as its time zone ensures that timestamps stored in MySQL are already in UTC format. This would align with the default time zone of BigQuery and might prevent discrepancies during replication.

  2. Set BigQuery to UTC+2: Alternatively, if adjusting the MySQL server isn't feasible, you can configure BigQuery to use the UTC+2 time zone. This way, both systems align, and there's no need for on-the-fly conversions during replication.

  3. Adjust Time Zones in DataFusion: Check if DataFusion allows for time zone settings during the replication process. Some data replication tools offer options to specify time zone conversions during data transfer. Adjusting these settings might help prevent discrepancies.

  4. Handle Time Zone Conversion in Transformation: Another approach could involve performing time zone conversions explicitly within DataFusion. You might incorporate transformations in your data pipeline to convert timestamps from UTC+2 to UTC during the replication process.

View solution in original post

1 REPLY 1

It seems like the issue you're encountering could indeed be related to the time zone differences between your MySQL server (UTC+2) and BigQuery (UTC). Data replication tools like DataFusion might handle timestamps differently during snapshot creation and data replication, leading to the observed duplication issue.

This isn't necessarily a bug but rather a consequence of how timestamps are handled during the replication process. When the snapshot is taken, timestamps might not be adjusted to UTC, leading to the initial discrepancy. However, during the replication process, DataFusion might correctly convert timestamps to UTC, resulting in the duplication of records with different timestamps.

One good practice to handle this situation is to standardize the time zones across your systems. You've mentioned adjusting the time zone settings for either MySQL or BigQuery, which is indeed a viable solution. Here are a couple of approaches:

  1. Set MySQL Server to UTC: Configuring your MySQL server to use UTC as its time zone ensures that timestamps stored in MySQL are already in UTC format. This would align with the default time zone of BigQuery and might prevent discrepancies during replication.

  2. Set BigQuery to UTC+2: Alternatively, if adjusting the MySQL server isn't feasible, you can configure BigQuery to use the UTC+2 time zone. This way, both systems align, and there's no need for on-the-fly conversions during replication.

  3. Adjust Time Zones in DataFusion: Check if DataFusion allows for time zone settings during the replication process. Some data replication tools offer options to specify time zone conversions during data transfer. Adjusting these settings might help prevent discrepancies.

  4. Handle Time Zone Conversion in Transformation: Another approach could involve performing time zone conversions explicitly within DataFusion. You might incorporate transformations in your data pipeline to convert timestamps from UTC+2 to UTC during the replication process.