GCP Datastream and Postgres CDC Integration

Hi, I'm Yogic, a Data Engineer. I have a CDC project capturing data from a Postgres database and streaming it into a GCP environment using Pub/Sub + Cloud Functions. I use Debezium Server for this because Datastream doesn't support Postgres as a source yet. However, I've found it difficult to maintain Debezium, especially with regard to scalability (it cannot scale horizontally). When will this service be fully released so we can use it for the CDC process, especially with Postgres? (I hope it will be released soon.) Thank you.
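For context, our current setup is configured roughly like the Debezium Server application.properties below - a simplified sketch with placeholder host, credentials, tables, and server name (exact property names can differ between Debezium versions):

```properties
# Debezium Server: Postgres source -> Pub/Sub sink (placeholder values)
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id=my-gcp-project

debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.plugin.name=pgoutput
debezium.source.database.hostname=my-postgres-host
debezium.source.database.port=5432
debezium.source.database.user=debezium
debezium.source.database.password=********
debezium.source.database.dbname=mydb
debezium.source.database.server.name=cdc
debezium.source.table.include.list=public.orders,public.customers

# Offsets are kept on local disk, which is part of why a single Debezium Server
# instance is awkward to scale horizontally.
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
```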


Hi, and thanks for your interest in Datastream!
Datastream will support PostgreSQL as a source in Q3'22 (public preview). You can follow our release notes for an update.
Thanks,
Etai

 

Is this still expected to go into public preview Q3'22?

Yes, we're on track for launching in September.

Amazing, keep us in the loop 🙂 looking forward to it too!

Support for PostgreSQL source is now available in public preview. We also launched Datastream for BigQuery - you can read more about it here: https://cloud.google.com/blog/products/data-analytics/introducing-seamless-database-replication-to-b...

That's great. I have read the documentation, but I have a question about schema drift. As I understand it, Datastream automatically adds new columns or attributes from the source. What about other cases, where someone deletes attributes or renames a column, e.g.
1. [id, number, created_at] -> [id, created_at] (deletion)
2. [id, number, created_at] -> [id, number_something, created_at] (rename)

I'm also curious about the replication mechanism and configuration of Datastream. Does it work like Debezium, or something else? (It would be good if you could provide detailed technical documentation about that so we can estimate the pros and cons of migrating.)

Column deletions are ignored - Datastream will not receive data for the deleted columns and will simply pass them to the destination as empty.

Column rename is interpreted as a column delete + column insert - i.e. the old 'number' column will not receive new values in the destination after the rename, and new events will only populate the new 'number_something' column (and existing rows will have an empty value for this column).
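If you do want the old values carried over after such a rename, a one-off backfill on the destination table is an option. A rough BigQuery sketch, reusing the column names from your example (project, dataset, and table names are placeholders):

```sql
-- Hypothetical one-off backfill after `number` was renamed to `number_something`
-- on the source: copy the old values into the new column where it is still empty.
UPDATE `my_project.my_dataset.my_table`
SET number_something = number
WHERE number_something IS NULL
  AND number IS NOT NULL;
```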

Datastream isn't based on, nor does it leverage Debezium in any way.

Hey etaim, 

do you by any chance know if arrays (character varying[] in Postgres) will be supported soon? Currently, it seems to drop those columns while transferring from Postgres to BQ.

I didn't find it in the known issues though, plus arrays are missing from the "translation" overview.

Best

Mathias

Hi Mathias,

Arrays aren't currently supported, but may be supported when the PostgreSQL source goes GA. We're finalizing the list of data types we'll support for GA, and arrays are currently being researched.

Thanks,
Etai

I'm happy to share that support for PostgreSQL ARRAYs is now available:

https://cloud.google.com/datastream/docs/release-notes#October_20_2023

@etaim,

What is the expected GA date for PostgreSQL support in Datastream? I understand previews are typically for a 3-month period. Does that mean PostgreSQL support will be GA in Dec 2022?

thanks,

KK

Typically timelines aren't shared in public forums like this ;), but GA of the PostgreSQL source is planned for H1'23.

Hi!

We are also using Datastream for CDC from Cloud SQL PostgreSQL to BigQuery, and columns with "character varying[]" and "citext" data types are not being replicated. Hopefully these PostgreSQL data types will be supported in the future.

Regards,

Denis Kotnik

We successfully set up a Datastream with all required permissions and privileges, and it has been running stably for a week now, ingesting data from PostgreSQL into BigQuery. Nevertheless, there is one critical issue: whenever a column's data type is not declared directly in the table (such as STRING or ENUM) but is instead defined as a standalone user-defined type in Postgres (e.g. to limit the string values it can take), and the table references this type as "schema-name".column_type, Datastream shows the "source data type" incorrectly as "schema-name".column_type instead of string - it does not seem to translate the referenced type. In this case the whole column seems to be dropped / not found in BigQuery, including the data it is supposed to contain. Are there some privileges, permissions, or settings that we are missing? How can we make sure that Datastream interprets/translates these columns into the correct type that the referenced type declares?

 

/edit: I might have found the issue: * Datastream doesn't support replication of columns of the enumerated (ENUM) data type.

.. is there any chance Datastream will support this in the future? Or any suggestion on the easiest way to work around this issue? We have dozens upon dozens of ENUM columns.
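For reference, the source-side change we're considering as a stopgap (just a sketch with placeholder table/column names - it alters the source schema, so it would need careful testing) is casting each ENUM column to text so it replicates as a plain string:

```sql
-- Hypothetical workaround: convert an ENUM column to text on the source
-- so Datastream sees a supported string type instead of the ENUM.
ALTER TABLE public.orders
  ALTER COLUMN status TYPE text
  USING status::text;
```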

We need to upgrade to v10 of Postgres in order to use Datastream. Is there a risk that support for v10 will go away within the next few years? The more versions we hop as part of the upgrade, the more complex it gets, but I also don't want to have to perform another upgrade in six months if/when v10 compatibility is dropped.

There is currently no plan or reason to expect that we will need to discontinue support for Postgres 10. 

We have it running for some time now and it's working great (except for the array issue :D).

However, yesterday we noticed "duplicates" within the BQ tables, which, as I read in the documentation, can happen. I am wondering whether it resolves this on its own at some point (backfilling was no use), or whether I need dedup queries for every table. Is there any way to decrease the likelihood of this happening?

 

Best,

Mathias

If the source table has a primary key then duplicate records should not occur. It's worth trying to identify why the duplicates happened in the first place - does the table have a primary key? Did the primary key change in any way while replication was already in progress?

If duplicates exist in BigQuery then they will not be de-duped (i.e. there's no process that "kicks in" and removes them), and backfill alone won't have any impact on duplicates without first truncating the table.
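If you do need to clean up duplicates that have already landed in BigQuery, a one-off query along these lines is a reasonable starting sketch - it assumes an `id` primary-key column and the `datastream_metadata.source_timestamp` field, so adjust the names for your tables:

```sql
-- Hypothetical one-off de-duplication: keep only the most recent version of each row.
CREATE OR REPLACE TABLE `my_project.my_dataset.orders_deduped` AS
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY datastream_metadata.source_timestamp DESC
    ) AS row_num
  FROM `my_project.my_dataset.orders`
)
WHERE row_num = 1;
```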

Hi, we have tried to set up the Postgres integration but we got errors on some tables. The error says:


Datastream can't create a table in BigQuery. BigQuery doesn't support tables with more than 16 primary keys, and the table has 19 primary keys

(This doesn't make sense - why are my tables considered to have so many primary keys?)
The tables have more than 10 columns, consisting of primary keys, foreign/composite keys, and other functional/descriptive/numerical attributes/columns. I think Datastream tries to infer each column's key status (I don't know how this happens under the hood) and ends up finding more primary keys than the maximum allowed - there is no such definition in the source Postgres schema/config. We have tried reading up and playing around with other settings such as column exclusion, but found no clues and no working solution for this.

We have set a normal config for this replication:
a pgoutput slot, REPLICA IDENTITY FULL, and a single publication for all tables.
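For what it's worth, this is roughly how we've been double-checking what the source itself reports as the primary key and replica identity for an affected table (the table name is a placeholder):

```sql
-- Columns that Postgres reports as the primary key of the table.
SELECT a.attname AS key_column
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum = ANY (i.indkey)
WHERE i.indrelid = 'public.my_table'::regclass
  AND i.indisprimary;

-- Replica identity setting: 'd' = default (primary key), 'f' = full (all columns).
SELECT relreplident
FROM pg_class
WHERE oid = 'public.my_table'::regclass;
```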

Are there any plans to enable CDC to run from a Postgres Cloud SQL read replica?

"Datastream can't replicate from a read replica instance, because PostgreSQL doesn't support logical decoding in read replicas", or any alternative plan to avoid using CDC on the primary?

There's currently no plan to support replication from a replica. Can you explain why you want to avoid using CDC on the replica?

thanks @etaim .

No, we actually want to make use of the replica as a source for Datastream.

@bekkerza 

I think this is more of a conceptual problem of data replication and doesn't have anything to do with Datastream or Cloud SQL for Postgres specifically.

You can replicate data in 3 different ways:
a) Single-leader (or leader-follower, master-slave)
b) Multi-leader
c) Leaderless

I assume that you are using single-leader replication in Cloud SQL, which implies that followers (replicas) can be used for read-only operations. I would say (but I'm not sure about it) that Datastream needs to write to the source database to acknowledge what has already been successfully replicated, since it acts as a subscriber in its relationship with the source database.

Hi @etaim, we have now created a replication using Datastream and a Postgres DB, which is currently doing real-time sync/replication to GBQ. We wanted to do CDC on the same database following the link below for Datastream and Dataflow analytics, which currently supports MySQL/Oracle DBs.

https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics

Can we do this on a Postgres DB, or is there any chance this will be enabled in the near future, so we can run analytics on Postgres DBs and capture all the events in GBQ rather than only the replication sync?

Your assistance will be greatly appreciated. 

We're planning to support "append-only" mode by EOY, which will capture all events in BigQuery (without applying any MERGE).

There is currently no ETA for updating the Dataflow template to support PostgreSQL, but it is something we are looking into. 

We will be building a new application where we need to stream change events from Cloud SQL PostgreSQL to a Pub/Sub topic. We checked, but Google Datastream currently doesn't support this.