Deduplication when the subscriber application is not idempotent and we can't change it

If the subscriber application is legacy and not idempotent, can we handle deduplication (in case of redelivery, acknowledgement-deadline expiry, etc.) somehow without changing the subscriber application?
Can we still use Pub/Sub in such a scenario?

I went through https://cloud.google.com/pubsub/docs/exactly-once-delivery, but redelivery is still a problem for the subscriber application.

1 ACCEPTED SOLUTION

Correction 4/3/2024:

Google Cloud Pub/Sub provides a few features that can help with handling message deduplication and redelivery, even when dealing with a legacy, non-idempotent subscriber application.

  1. Message Deduplication: Pub/Sub deduplicates published messages on a best-effort basis. If a message with the same ordering key and message ID as a previously published message is republished before the previous message has been acknowledged, Pub/Sub treats it as a duplicate and does not deliver it to subscribers. Please note: the purpose of ordering keys is not to eliminate duplicate messages, but to ensure that messages sharing the same key are delivered in a specific sequence. Pub/Sub does, however, offer a feature known as "exactly-once delivery" (you can read more about it here: https://cloud.google.com/pubsub/docs/exactly-once-delivery), which guarantees that a message is not redelivered after it has been successfully acknowledged, providing more reliable handling of acknowledgements and message delivery.

    For effectively removing duplicate messages, as you've correctly identified, using Dataflow is the recommended approach. I've also asked the author of the Google Community answer to update it to reflect this clarification.

  2. Adjusting the Acknowledgement Deadline: One way to handle redelivery is to adjust the acknowledgement deadline. If your subscriber application needs more time to process a message, you can modify the acknowledgement deadline for a specific message. This prevents the message from being redelivered during this extended deadline.

  3. Dead Letter Topics: If a message is redelivered more times than you specify, Pub/Sub can send that message to a dead letter topic. This can help you identify problematic messages that your subscriber application is unable to process for some reason.
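To make the deadline mechanics in point 2 concrete, here is a minimal standalone sketch of the lease model (plain Python, not the Pub/Sub client library; the `LeaseTracker` class and its method names are invented for illustration):

```python
class LeaseTracker:
    """Toy model of per-message ack deadlines: an unacked message whose
    lease has expired becomes eligible for redelivery."""

    def __init__(self, default_deadline=10.0):
        self.default_deadline = default_deadline
        self.leases = {}  # message_id -> lease expiry (epoch seconds)

    def deliver(self, message_id, now):
        # Delivery starts a lease equal to the subscription's ack deadline.
        self.leases[message_id] = now + self.default_deadline

    def extend(self, message_id, seconds, now):
        # Analogue of modifyAckDeadline: restart the lease from "now".
        self.leases[message_id] = now + seconds

    def ack(self, message_id):
        # Acknowledgement ends the lease; no further redelivery.
        self.leases.pop(message_id, None)

    def redeliverable(self, now):
        # Messages whose lease has lapsed without an ack.
        return sorted(m for m, expiry in self.leases.items() if expiry <= now)
```

For example, a message delivered at t=0 with a 10-second deadline becomes redeliverable at t=10 unless the subscriber extends the lease or acks first, which is exactly the window the answer suggests widening.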

However, these methods still require the subscriber application to acknowledge the message once it has been processed. If the subscriber application is not capable of acknowledging messages, or if it cannot handle redelivery of messages at all, then using Pub/Sub may be more challenging.

In such cases, you might consider adding a middleware layer between Pub/Sub and the subscriber application. This middleware could handle the acknowledgement and redelivery logic. This would allow the legacy subscriber application to operate as it currently does, while the middleware layer handles the complexities of dealing with message deduplication and redelivery.
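A minimal sketch of such a middleware's deduplication core (plain Python; the class name, TTL value, and return strings are invented for illustration, and a production version would need a durable, shared store rather than an in-process dict):

```python
import time

class DedupMiddleware:
    """Sits between Pub/Sub and a legacy, non-idempotent handler: acks
    duplicates without invoking the handler, forwards first deliveries."""

    def __init__(self, handler, ttl_seconds=600.0):
        self.handler = handler
        self.ttl = ttl_seconds
        self.seen = {}  # message_id -> first-seen timestamp

    def on_message(self, message_id, payload, now=None):
        now = time.time() if now is None else now
        # Evict entries older than the TTL so the cache stays bounded.
        cutoff = now - self.ttl
        self.seen = {m: t for m, t in self.seen.items() if t >= cutoff}
        if message_id in self.seen:
            return "acked-duplicate"   # ack to Pub/Sub, skip the handler
        self.seen[message_id] = now
        self.handler(payload)          # only first deliveries reach the app
        return "processed"
```

The TTL trades memory against the longest redelivery gap you expect; a duplicate arriving after its entry has been evicted would slip through, which is one of the failure scenarios the next paragraph warns about.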

Keep in mind that this approach would require development and maintenance of this middleware layer, which could increase complexity. It's also important to ensure that this layer is designed and implemented to handle potential failure scenarios, to avoid data loss or duplication.

Finally, it's worth noting that while Pub/Sub provides many features for handling message delivery and redelivery, it may not be the best fit for every use case. Depending on the specific requirements of your legacy application, other messaging or event-driven architectures may be more appropriate. 


11 REPLIES


So is the below comment true even after the acknowledgement deadline expires?
As per my understanding, the message ID is managed by Pub/Sub itself, i.e. it generates this ID only when a message is successfully published.
Should we try to maintain a message ID on our own? Is that possible?


@ms4446 wrote:

Message Deduplication: Pub/Sub provides automatic deduplication of published messages on a best-effort basis. If you publish a message with the same ordering key and message ID as a previously published message, and the previous message has not yet been acknowledged, then Pub/Sub understands that this is a duplicate message and does not deliver it to subscribers.

As per https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/pubsub_subscription#m....
it looks like max_delivery_attempts for dead lettering must be between 5 and 100. That means it won't help with idempotency: before a message goes to the dead-letter topic, it will be redelivered at least 4 times.

max_delivery_attempts - (Optional) The maximum number of delivery attempts for any message. The value must be between 5 and 100. The number of delivery attempts is defined as 1 + (the sum of number of NACKs and number of times the acknowledgement deadline has been exceeded for the message). A NACK is any call to ModifyAckDeadline with a 0 deadline. Note that client libraries may automatically extend ack_deadlines. This field will be honored on a best effort basis. If this parameter is 0, a default value of 5 is used.


@ms4446 wrote:

Dead Letter Topics: If a message is redelivered more times than you specify, Pub/Sub can send that message to a dead letter topic. This can help you identify problematic messages that your subscriber application is unable to process for some reason.

Yes, the max_delivery_attempts for a Pub/Sub subscription's dead letter policy must be between 5 and 100. This means a message will be attempted at least 5 times before it is forwarded to the dead letter topic.

In this context, "delivery attempts" include both explicit negative acknowledgements (NACKs, which are essentially requests to re-deliver the message) and instances where the acknowledgement deadline has been exceeded. It's important to note that exceeding the acknowledgement deadline is considered a signal that the subscriber failed to process the message, so the message is re-delivered.
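The accounting above can be sketched directly from the documented formula (attempts = 1 + NACKs + ack-deadline expirations, with max_delivery_attempts clamped to [5, 100] and 0 falling back to the default of 5); the function names here are mine:

```python
def delivery_attempts(nacks, deadline_expirations):
    # Per the docs: 1 (initial delivery) + NACKs + ack-deadline expirations.
    return 1 + nacks + deadline_expirations

def should_dead_letter(nacks, deadline_expirations, max_delivery_attempts=5):
    if max_delivery_attempts == 0:       # 0 means "use the default of 5"
        max_delivery_attempts = 5
    max_delivery_attempts = max(5, min(100, max_delivery_attempts))
    return delivery_attempts(nacks, deadline_expirations) >= max_delivery_attempts
```

Even at the minimum setting, a message survives 4 failed attempts (for example, 2 NACKs plus 2 deadline expirations make the 5th attempt) before being dead-lettered, which is exactly why this floor cannot protect a non-idempotent subscriber.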

Therefore, if your subscriber application is not idempotent and has trouble handling redelivered messages, this aspect of Pub/Sub's dead letter policy could indeed pose a problem. Before a message is sent to the dead letter topic, it could be delivered multiple times to the subscriber, potentially causing unwanted side effects if the application doesn't handle duplicate messages gracefully.

As mentioned in my previous responses, one possible workaround could be to introduce a middleware layer between Pub/Sub and the subscriber application. This middleware could handle the deduplication logic, checking whether a message has been processed before forwarding it to the legacy application. This would require careful design and implementation to ensure reliability and consistency, but it could help mitigate the issues caused by multiple delivery attempts.

Correction 4/3/2024:

The message ID in Google Cloud Pub/Sub is indeed assigned by the service itself when a message is successfully published. This message ID is guaranteed to be unique within the topic for the lifetime of the topic.

However, the purpose of the message ID in Pub/Sub is not to enable end-user deduplication, but rather to allow the service to guarantee "at least once" delivery. When a message is published, it's assigned an ID by Pub/Sub. If the same message is published again (due to an error or other issue), the service will assign a new ID to the repeated message.

Pub/Sub also provides an optional ordering_key attribute. Note, however, that ordering keys govern delivery order; they are not a deduplication mechanism.

Please note: The purpose of ordering keys is not to eliminate duplicate messages, but to ensure that messages sharing the same key are delivered in a specific sequence. Pub/Sub does, however, offer a feature known as "exactly-once delivery" (you can read more about it here: https://cloud.google.com/pubsub/docs/exactly-once-delivery), which guarantees that a message is not redelivered after it has been successfully acknowledged, providing more reliable handling of acknowledgements and message delivery.

That said, maintaining your own unique identifier for each message could help with deduplication on the subscriber side, particularly for a legacy application that may not handle redelivery well. You could include this unique identifier in the message data or attributes, and then have the subscriber application check this identifier against a local cache or database to see if the message has been processed before.

This approach, however, has its own complexities and potential pitfalls. For example, you would need to ensure that the cache or database used for checking identifiers is highly available and consistent, to prevent processing duplicate messages or missing messages in case of errors or failures. Also, you would need to consider how to manage the size and lifetime of the data stored in this cache or database, to prevent it from growing indefinitely.

So, while it's possible to manage your own unique identifiers for messages in Pub/Sub, it's not a straightforward solution and requires careful consideration of the trade-offs and potential issues. The best solution would depend on the specific requirements and constraints of your application and system architecture.
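A sketch of the approach (plain Python; `dedup_id`, `make_message`, and `ProcessedStore` are invented names, and the in-memory set stands in for the highly available store the answer calls for):

```python
import uuid

def make_message(payload: bytes) -> dict:
    # Publisher side: attach an application-level unique id as an attribute,
    # so deduplication does not depend on the Pub/Sub-assigned messageID.
    return {"data": payload, "attributes": {"dedup_id": str(uuid.uuid4())}}

class ProcessedStore:
    """Subscriber side: records which dedup_ids have already been processed."""

    def __init__(self):
        self._done = set()

    def mark_if_new(self, dedup_id) -> bool:
        # True on first sight (process the message), False on a duplicate.
        if dedup_id in self._done:
            return False
        self._done.add(dedup_id)
        return True
```

The subscriber calls mark_if_new before handing the payload to the legacy application and simply acks when it returns False; the growth and retention concerns raised above apply to whatever durable store replaces the set.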

The ordering key is not for deduplication. The ordering key allows one to receive messages for the same key in the order in which they were published. There is no deduping inherent with ordering keys. See the documentation for more information about ordering keys. 

Facing the same issue: Pub/Sub pushes duplicate messages at times, just a few milliseconds apart. Not all applications/clients can be idempotent, and adding extra infrastructure just to deduplicate messages largely defeats the purpose of Pub/Sub. I would like to better understand the deduplication process on the Pub/Sub side.

Here are some resources that provide insights into the deduplication process in Google Cloud Pub/Sub:

  1. Handle message failures | Cloud Pub/Sub | Google Cloud: This page discusses how to configure dead-letter topics and handle message failures in Pub/Sub, which can be related to the deduplication process.

  2. Handling duplicate data in streaming pipeline using Pub/Sub and Dataflow - Google Cloud: This blog post explains how to handle duplicate data in a streaming pipeline that uses Pub/Sub and Dataflow, including insights into how duplicates might occur and strategies to manage them.

  3. Streaming with Pub/Sub | Cloud Dataflow | Google Cloud: This page explains how Dataflow complements Pub/Sub's delivery model with message deduplication and exactly-once processing, which might be useful for understanding how deduplication works in conjunction with other Google Cloud services.

@ms4446 does the messageID remain the same when Pub/Sub retries a message?

In Pub/Sub, the messageID remains the same for a message across all its delivery attempts. When Pub/Sub retries a message, it does not change the messageID of that message. This behavior is part of the design to ensure "at least once" delivery semantics.

The messageID is assigned by Pub/Sub when the message is first successfully published to a topic. If the message is not acknowledged by the subscriber (for reasons like processing failure, acknowledgement deadline expiry, etc.), and Pub/Sub retries delivering the message, the messageID will be the same for each retry.

This consistent messageID across retries can be used by the subscriber to recognize that it is receiving a retry of a previously sent message. However, as mentioned earlier, handling deduplication and idempotency is still the responsibility of the subscriber application. The application should use the messageID (or some other unique identifier within the message) to track whether it has already processed a given message and handle it accordingly to avoid duplicate processing.

The messageID remains the same, and duplicate messages are generally pushed a couple of milliseconds apart. Practically, if you are not using Dataflow, it is very hard to deduplicate messages. GCP doesn't seem to be a mature cloud platform; there are tons of issues with GCP, so we are moving to another provider.