Could I achieve a schema-on-read approach when loading data to Google BQ (BigQuery)?

So I am exploring fully using BQ as my primary storage for a Data Lakehouse architecture pattern.

However, one of the main features of the Data Lakehouse is that its first layer (raw, bronze, whatever it's called) is schema-on-read.

Are there any approaches where I could use BQ for my RAW layer with a schema-on-read approach?

Has anyone seen / done this?

For example, I am loading data from an RDBMS (MSSQL, Oracle) via a BQ connector, and even if a column changes its data type, a new column is added, or a column is removed, everything still works and the data is ingested just fine into BQ. Meaning at this RAW stage I don't have to worry about managing schema evolution.

Thank you, Vaiva!


Yes, you can achieve a schema-on-read approach when loading data to Google BigQuery. There are a few different ways to do this:
 
Loading data from an RDBMS into BigQuery using schema auto-detection:
  1. Use a BigQuery connector that supports schema auto-detection. Google Cloud Dataflow and Cloud Dataproc are two services that offer connectors for this purpose.
  2. Create a Dataflow or Dataproc job to load the data from the RDBMS into BigQuery.
  3. In the job configuration:
     - Specify the source RDBMS connection details.
     - Define the destination BigQuery table.
     - Enable the schema auto-detection option.
  4. Execute the job (a minimal sketch follows below).
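To illustrate the auto-detection idea, here is a minimal sketch using the google-cloud-bigquery Python client rather than a Dataflow/Dataproc connector. It assumes the RDBMS extract has already landed in Cloud Storage as newline-delimited JSON; the project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder locations -- adjust to your project, dataset, and extract path.
table_id = "my-project.raw_layer.orders_raw"
source_uri = "gs://my-raw-bucket/mssql/orders/extract-*.json"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the incoming files
    # Allow new columns to appear when the source schema evolves.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```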


Loading data from an RDBMS into BigQuery using a custom schema file:

  1. Prepare a schema file in JSON format that outlines the data's schema.
  2. Set up a Dataflow or Dataproc job to transfer the data from the RDBMS to BigQuery.
  3. Within the job configuration:
     - Provide the source RDBMS connection information.
     - Indicate the destination BigQuery table.
     - Reference the path to your custom schema file.
  4. Run the job (see the sketch below).
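As a comparison point, here is a small sketch of the explicit-schema variant with the same Python client. The schema file path, field names, and table locations are invented for the example.

```python
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.raw_layer.customers_raw"
source_uri = "gs://my-raw-bucket/oracle/customers/extract-*.json"

# customers_schema.json (hypothetical) would contain something like:
# [
#   {"name": "customer_id",   "type": "INTEGER",   "mode": "NULLABLE"},
#   {"name": "customer_name", "type": "STRING",    "mode": "NULLABLE"},
#   {"name": "created_at",    "type": "TIMESTAMP", "mode": "NULLABLE"}
# ]
schema = client.schema_from_json("customers_schema.json")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,  # schema is enforced on write, not inferred
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

client.load_table_from_uri(source_uri, table_id, job_config=job_config).result()
```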

Streaming data from an RDBMS to BigQuery:

  1. Use a streaming tool or service, such as Google Cloud Dataflow or Cloud Dataproc, to stream data from the RDBMS to BigQuery.
  2. Configure the chosen tool or service to direct the data to a BigQuery table set up for streaming.
  3. Initiate the streaming process (see the sketch below).
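And a rough sketch of the streaming path using the streaming insert API (insert_rows_json) from the same Python client. In practice a Dataflow pipeline or the Storage Write API would usually sit in front of this; the table name and row payloads below are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.raw_layer.events_raw"

# Rows as they might arrive from a change stream on the source RDBMS
# (field names are made up for the example).
rows = [
    {"event_id": "1001", "op": "INSERT", "payload": '{"order_id": 1, "amount": 9.99}'},
    {"event_id": "1002", "op": "UPDATE", "payload": '{"order_id": 1, "amount": 12.50}'},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print(f"Encountered errors while inserting rows: {errors}")
```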

@ms4446 thank you for your answer.

In its true sense, schema-on-read storage would mean that I can load & store whatever, ideally a data file + a metadata file. However, if I set up a BQ table with an auto-detected schema or a schema file, then the next time data is written to BQ the schema will be checked and some records will be rejected as exceptions. It means we have schema validation going on at write time, so it's not really schema-on-read.

One idea was to load all columns as strings; then schema validation would always pass, but the storage might not be optimal.
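To make that idea concrete, here is a rough sketch (project, dataset, table, and column names are placeholders) of an all-STRING raw table with a typed view layered on top, so the schema is effectively applied on read via SAFE_CAST:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Raw table: every column is STRING, so any extract loads without type errors.
client.query("""
CREATE TABLE IF NOT EXISTS `my-project.raw_layer.orders_raw` (
  order_id   STRING,
  amount     STRING,
  created_at STRING
)
""").result()

# Typed view: casts happen on read, and SAFE_CAST turns bad values into NULLs
# instead of failing the query.
client.query("""
CREATE OR REPLACE VIEW `my-project.curated_layer.orders` AS
SELECT
  SAFE_CAST(order_id   AS INT64)     AS order_id,
  SAFE_CAST(amount     AS NUMERIC)   AS amount,
  SAFE_CAST(created_at AS TIMESTAMP) AS created_at
FROM `my-project.raw_layer.orders_raw`
""").result()
```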