How to stream MongoDB CDC changes to BigQuery using Dataflow?

RC1
Bronze 4

Hi folks, I have a use case where I need a real-time pipeline to push MongoDB CDC data to BigQuery for analytics. I went through the Google Cloud docs and found a solution that covers part of my problem [ https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#mongodb-to-bigquery-cdc ]. But the prerequisite there is that "The change stream pushing changes from MongoDB to Pub/Sub should be running". To set that up we would have to use Debezium or a similar kind of service, and we are mostly looking for a managed service for this use case rather than self-hosting Debezium. Are there any services in GCP that can help me with this? Also, are there any plans to extend the Apache Beam MongoDB connector libraries for CDC streaming?
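For reference, that prerequisite ("the change stream pushing changes from MongoDB to Pub/Sub") boils down to something like the sketch below. This is only a minimal illustration assuming pymongo and google-cloud-pubsub; the project, topic, connection string, database and collection names are placeholders, not part of the Dataflow template.

```python
# Minimal sketch: tail a native MongoDB change stream and publish each event
# to the Pub/Sub topic the Dataflow template reads from. All names are placeholders.
from bson import json_util
from google.cloud import pubsub_v1
from pymongo import MongoClient

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "mongodb-cdc")       # placeholder project/topic

client = MongoClient("mongodb+srv://user:pass@cluster.example.net")   # placeholder URI
collection = client["mydb"]["orders"]                                 # placeholder db/collection

# watch() opens a MongoDB change stream; full_document="updateLookup" includes
# the post-update state of the document in update events.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        # json_util handles BSON types (ObjectId, dates) when serializing the event.
        data = json_util.dumps(change).encode("utf-8")
        publisher.publish(topic_path, data=data).result()  # wait for the publish to complete
```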

cc: @fguan 

 

6 REPLIES

Reviewing the documentation you shared, I was wondering if you have also considered the following:

Streaming Pipelines and Streamline your real-time data pipeline with Datastream and MongoDB.

Also, an alternative to Debezium might be dbt, which is a good option to consider: you load the entire raw data into BigQuery and then use dbt for the transformation.

That said, Debezium is gaining ground as it is reliable and free. There is a Debezium connector for getting data out of MongoDB, and there are examples of how to load Debezium CDC data into BigQuery.

Searching for more documentation, I found the Using MongoDB Change Streams to replicate data into BigQuery guide, which describes a good end-to-end process.

MongoDB Atlas provides Database Triggers, which act as a serverless service that constantly monitors MongoDB change streams. You can choose to listen for specific events from specific collections and then filter that down further using $match expressions from the Aggregation framework (sketched at the end of this reply). You can then trigger a serverless function when one of those events occurs.

The serverless function itself is again managed by MongoDB, and from there you can write to Pub/Sub. You don't need Debezium, as MongoDB has its own change streams.

The one limitation of serverless functions on MongoDB Realm is that you have to write them in Node.js, but if it's just to publish to Pub/Sub then it's not much code at all. Plus, you can always trigger another Cloud Function if you particularly want to write a large chunk of work in another language.
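To make the $match filtering above concrete, here is a rough sketch of the same idea with a plain pymongo change stream (connection string, database and collection are placeholders); an Atlas Database Trigger accepts an equivalent match expression in its configuration and invokes your function instead of the loop body.

```python
# Sketch of $match filtering on a change stream; an Atlas Database Trigger's
# match expression works the same way against the change event document.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.net")  # placeholder URI

# Only inserts and updates are delivered; other operation types are filtered out server-side.
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]

with client["mydb"]["orders"].watch(pipeline=pipeline) as stream:     # placeholder db/collection
    for change in stream:
        print(change["operationType"], change["documentKey"])
```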

Hi @RC1, we have implemented the same thing using self-hosted Kafka and Debezium on a k8s cluster. Let me know if you want to discuss it.

@anuraag_zolo Can you please share the tutorial that you referred to?

@RC1 
I used Strimzi Kafka hosted on GCP k8s; the workloads have been running for a year and I have not faced any issue with it yet. (touch wood)
https://strimzi.io/docs/operators/latest/overview.html

Using a sample KafkaConnector image, create a Kafka connector for the MongoDB source and a BigQuery sink connector (a rough sketch follows the links below):
https://debezium.io/documentation/reference/2.1/connectors/mongodb.html
https://www.confluent.io/hub/wepay/kafka-connect-bigquery/
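As a rough illustration of the connector registration step (not the exact setup described above), the sketch below POSTs a Debezium MongoDB source connector config to the Kafka Connect REST API; with Strimzi you would more typically put the same config block in a KafkaConnector custom resource. The Connect URL is a placeholder, and the connection/topic properties vary by Debezium version, so take the exact keys from the docs linked above. A BigQuery sink connector is registered the same way with its own connector.class and config.

```python
# Hypothetical sketch: register a Debezium MongoDB source connector via the
# Kafka Connect REST API. URL and config values below are placeholders.
import requests

CONNECT_URL = "http://my-connect-cluster-connect-api:8083"  # placeholder Connect service

source_connector = {
    "name": "mongodb-source",
    "config": {
        # The class name comes from the Debezium MongoDB connector; fill in the
        # connection, topic-prefix and include-list properties from the Debezium
        # reference for the version you run.
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "tasks.max": "1",
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=source_connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```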

There will be some challenges along the way; I am planning to publish a full tutorial of this on Medium.

Until it's there, happy to help.


hi @anuraag_zolo 

The shared links are really helpful. In the meantime, were you able to complete the tutorial? If yes, could you please share the complete tutorial of how you addressed this challenge in detail? This would help me and the community a lot.

Thanks