Hi all,
I have been assigned to design a CI/CD framework for our upcoming data pipelines. I am looking for the best way to integrate source control, metadata, and test management into our data pipeline deployments.
We want a Dev environment where developers test the data pipelines (both streaming and batch) and then push the code to automatically trigger the build process.
You get the idea...
I found these two documents and understood what is necessary at a high level.
What I didn't understand was how to create and access test data.
They mention a preproduction environment and recommend that it be very similar to prod. We have a massive data warehouse. Does that mean we have to double our storage to be able to work in a test environment?
They also mention creating a project for each developer for developing and testing code. I would like to understand this concept a bit better, since the documents don't provide details about that approach.
Can you please advise?
1.- The Beam SDK provides methods to supply test data to your pipeline and to verify the results of processing.
2.- Probably not. The preproduction environment needs to be similar to production, but that does not mean you have to double your storage to work in a test environment. It means that in this environment you verify all of the configuration before you promote it to the production environment.
3.- The approach is to create a new Dataflow job and pipeline, using `TestPipeline` instead of `Pipeline.create()`, so you can account for the quotas, the scale-ups/scale-downs, and the other services that will be used in your production environment. Basically, it is a separate pipeline you use to check all of the configuration and code that you'll need in the production pipeline.
Thanks