Hi all,
I have been assigned to design a CI/CD framework for our upcoming data pipelines. I am looking for the best way to integrate source control, metadata, and test management into our data pipeline deployments.
We want a Dev environment where developers test the data pipelines (both streaming and batch) and then push the code to automatically trigger the build process.
You get the idea...
I found these two documents and understood what is necessary at a high level.
What I didn't understand was how to create and access test data.
They mention a preproduction environment and recommend that it be very similar to prod. We have a massive data warehouse. Does that mean we have to double our storage to be able to work in a test environment?
They also mention creating a project for each developer for developing and testing code. I would like to understand this concept a bit better, since the documents don't provide details about that approach.
Can you please advise?
1.- The Beam SDK provides methods to supply test data to your pipeline and to verify the results of processing.
2.- Probably not. The preproduction environment needs to be similar to production, but that does not mean you have to double your storage to work in a test environment. It means that in this environment you verify all of the configuration before you promote it to the production environment.
3.- The approach is to create a new Dataflow job and pipeline, using `TestPipeline` instead of `Pipeline.create()`, so you can account for the quotas, the scale-ups/scale-downs, and the other services that will be used in your production environment. Basically, it is a separate pipeline you use to check all of the configuration and code that you'll need in the production pipeline.
Thanks