Cloud Composer 2 - Airflow database connectivity issue

Hello

We deployed a fresh Cloud Composer 2 months ago and, after a period of development and testing, it is now fully functional and running ~20 DAGs.

For the last few days we have been experiencing an Airflow database connectivity issue:

  • currently 2 airflow-sqlproxy pods are running
  • the 1st (older) pod works fine without any issues
  • the 2nd pod very often has connectivity issues, and we get this message in the logs

 

dial tcp xxx.xxx.xxx.xxx:3307: connect: connection timed out

 

We checked the PostgreSQL connection pool and it looks OK; there is no problem with the number of active connections.

Any ideas about what is happening, or any solution, are welcome.

2 REPLIES

It sounds like you're encountering connectivity problems with the second Airflow SQL proxy pod in your Cloud Composer setup. The error message suggests a timeout when attempting to connect to the PostgreSQL database.
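To check whether the timeout can be reproduced outside Airflow, a plain TCP probe against the address and port from the log line may help. The following is only a minimal sketch: the function name `probe_tcp` is made up for illustration, and it assumes you can run Python from somewhere with the same network path as the affected pod.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 5.0):
    """Try one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # OSError covers timeouts, refused connections and unreachable hosts.
        return False, time.monotonic() - start, str(exc)
```

Calling `probe_tcp(<address from your log>, 3307)` repeatedly over a few minutes should show whether failures are intermittent and how long each attempt takes before it gives up. Comparing results from the node hosting the healthy pod and the node hosting the failing one may narrow things down.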

Here are some steps you could take to troubleshoot this issue:

Review the logs of the second airflow-sqlproxy pod in detail to gather more information about the connection issues. Look for specific error messages or patterns that might indicate the cause of the timeouts.
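If you export the pod's logs to a text file, a small script can count how often the timeout occurs and against which target. This is a rough sketch that assumes the log lines have the same shape as the message quoted in the question; `count_timeouts` and `TIMEOUT_RE` are hypothetical names, not part of any Composer or Airflow tooling.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in the question:
#   dial tcp <address>:<port>: connect: connection timed out
TIMEOUT_RE = re.compile(
    r"dial tcp (?P<addr>\S+):(?P<port>\d+): connect: connection timed out"
)


def count_timeouts(lines):
    """Count timeout lines per (address, port) target."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key = (match.group("addr"), int(match.group("port")))
            counts[key] += 1
    return counts
```

For example, `count_timeouts(open(path))` on a saved log file returns a `Counter` keyed by target. Running it on the logs of both pods would show whether the first pod really never hits the timeout or just hits it less often.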

Check the resource allocation for the second pod. It's possible that it's running out of resources or encountering contention issues due to insufficient memory, CPU, or other resource limitations.

Examine the PostgreSQL database performance during the times when the timeouts occur. There might be high loads or specific conditions causing the connection timeouts for this pod specifically.

Review the Airflow configuration for both pods, ensuring they are identical or at least have the same relevant configuration settings related to database connectivity.

Consider redeploying the problematic pod. Sometimes, reinitializing the pod or creating a new instance might resolve underlying issues in the deployment.

If the issue persists and you can't resolve it with the above steps, it might be beneficial to reach out to Google Cloud support.

Thank you for your reply.

Actually, we have already checked everything you mentioned, except perhaps the PostgreSQL database performance, since we don't have much visibility into this managed PostgreSQL instance.

For the moment we have disabled a couple of DAGs (including the airflow_clean_up DAG), and it is now working fine. We may have to optimize these DAGs.

What is weird is that this is happening to only 1 of the 2 running airflow-sqlproxy pods.