Workflows cold start latency (scaling issue)

deepak2023sso · 01-25-2024 03:08 AM

Hi,

I am reaching out to seek insights and guidance on a cold start latency issue we encounter when triggering more than 1,000 GCP Workflows instances simultaneously using Cloud Tasks.

Background: Our team is currently working on a project that involves the simultaneous triggering of a large number of GCP workflow instances through Cloud Tasks. We have noticed a significant cold start latency, impacting the overall performance and efficiency of our system.

Problem Description: Upon triggering 1,000 workflow instances simultaneously, the cold start time appears to be considerably higher than anticipated. This delay is affecting the responsiveness of our application and hindering its ability to scale seamlessly.

Request for Guidance: We are seeking advice and best practices from the community to optimize the cold start time for our GCP workflow instances. Any insights, recommendations, or experiences related to addressing similar challenges would be greatly appreciated.

Current Configuration:

Cloud Tasks for triggering workflow instances.
GCP workflows.
Concerns primarily related to cold start latency during simultaneous triggering.

Best regards,

Deepak

robertcarlos

Hi @deepak2023sso,

Welcome to Google Cloud Community!

You may be encountering "cold starts" even though you haven't reached the maximum number of workflows (10,000 per project) because of the following quotas and limits:

Concurrent executions
- There is a maximum number of active workflow executions per region, per project. Active executions are those that have started and have not yet completed or failed, including executions that are waiting.
Workflows API requests
- Quotas on Workflows API requests could also affect start latency when many requests arrive at once.
Step limits
- Workflows enforces these limits, which include assignments per step, conditions per switch, maximum call stack depth, and the minimum and maximum number of steps.
Parallel step limits
- These include branches per step, parallel depth, concurrent branches and iterations, and uncaught exceptions within a parallel step.
Resource limits
- Workflows enforces the following usage limits:
  - Source code size (128 KB)
  - Response size (2 MB)
  - Expression length (400 characters)
  - Data size (512 KB)
  - Environment variables (4 KiB)
  - Execution duration (1 year)
  - Execution retention (90 days)
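
As a rough illustration of the resource limits listed above, the following Python sketch compares sizes you have measured yourself against those values. The dictionary keys and the function name `exceeded_limits` are hypothetical, and the sketch assumes 1 KB = 1,024 bytes and 1 MB = 1,024 KB, which the list above does not specify.

```python
# Limits taken from the list above. Unit conversion is an assumption:
# KB and KiB are both treated as 1,024 bytes, MB as 1,024 * 1,024 bytes.
LIMITS = {
    "source_code_bytes": 128 * 1024,
    "response_bytes": 2 * 1024 * 1024,
    "expression_chars": 400,
    "data_bytes": 512 * 1024,
    "env_vars_bytes": 4 * 1024,
}


def exceeded_limits(measured: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {name: (measured_value, limit)} for every limit that is exceeded.

    Keys in `measured` that are not in LIMITS are ignored.
    """
    return {
        name: (value, LIMITS[name])
        for name, value in measured.items()
        if name in LIMITS and value > LIMITS[name]
    }


if __name__ == "__main__":
    # Hypothetical measurements for one workflow definition.
    report = exceeded_limits({"source_code_bytes": 200_000, "expression_chars": 120})
    for name, (value, limit) in report.items():
        print(f"{name}: {value} exceeds limit {limit}")
```

This only flags values you pass in; it does not inspect a workflow or call any Google Cloud API.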

Please check this documentation on workflows quotas and limits for additional information.
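
If the concurrent executions limit turns out to be the bottleneck, one possible mitigation is to dispatch the executions in waves rather than all at once. The Python sketch below only computes a schedule; it does not call Cloud Tasks or Workflows. The function name `plan_waves`, the limit of 300, and the 30-second interval are hypothetical values for illustration, and the approach assumes each wave finishes (or mostly finishes) before the next one starts.

```python
from datetime import datetime, timedelta, timezone


def plan_waves(total, max_concurrent, wave_interval_s, start=None):
    """Split `total` executions into waves of at most `max_concurrent`.

    Returns a list of (dispatch_time, count) pairs, one per wave,
    spaced `wave_interval_s` seconds apart.
    """
    if total < 0 or max_concurrent <= 0 or wave_interval_s < 0:
        raise ValueError("total and wave_interval_s must be >= 0, max_concurrent > 0")
    start = start or datetime.now(timezone.utc)

    waves = []
    remaining = total
    index = 0
    while remaining > 0:
        count = min(remaining, max_concurrent)
        dispatch_time = start + timedelta(seconds=index * wave_interval_s)
        waves.append((dispatch_time, count))
        remaining -= count
        index += 1
    return waves


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 executions, at most 300 per wave, 30 s apart.
    for when, count in plan_waves(1000, 300, 30):
        print(when.isoformat(), count)
```

Each dispatch time could then be used as the scheduled time of the corresponding Cloud Tasks tasks, so that the queue releases them gradually instead of in a single burst.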

If you believe you haven't reached any of the quotas or limits for your workflows, you can file a bug so that our engineers can take a look. We don't have a specific ETA for this, but you can track its progress once the ticket has been created.

Hope this helps.