Cloud Run to Cloud SQL Connection stops working after 1 hour

HELP! 

 

 
We have a Spring-based API that we migrated from GKE to Cloud Run a week or two ago. After running perfectly for one hour, requests against the API start returning 500s, and in the logs we see:
java.lang.RuntimeException: [ourproject:region:instance] Failed to update metadata for Cloud SQL instance.
at com.google.cloud.sql.core.CloudSqlInstance.addExceptionContext (CloudSqlInstance.java:465)
at com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata (CloudSqlInstance.java:329)
...
So it seems that the service account's token expires and is not refreshed (its roles are evidently fine, since everything works for the first hour), after which this instance can no longer connect to the DB. The only way to fix it is to redeploy.
 
We are not using Unix domain sockets; we simply configured Cloud SQL as a "normal" external PostgreSQL DB, using a VPC connector for connectivity.
 
Anyone run into something similar?

Hi, you don't use the "Cloud SQL Proxy" to connect to Cloud SQL?

No - we just migrated from GKE and kept our existing Spring-configured DB connection to Cloud SQL, dropping it into Cloud Run and adding a VPC connector.

Which works fine for 1 hour... then we get these 3 errors in succession
"Failed to update metadata for Cloud SQL instance"

"Failed to create ephemeral certificate for the Cloud SQL instance."

"EOFException: SSL peer shut down incorrectly"

And the connection no longer works

Yeah that doesn't fill me with great hope 🙂

We are putting some work into doing things the "official" way, using the socket factory and the --add-cloudsql-instances deployment flag. As I understand it, the flag is essentially a shortcut to having the Cloud SQL proxy deployed alongside the container, but then we will be super-dependent on this hourly metadata refresh working, right?
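For readers unfamiliar with the socket factory approach mentioned above, here is a minimal sketch of the JDBC URL it expects for PostgreSQL. The `cloudSqlInstance` and `socketFactory` parameter names and the `com.google.cloud.sql.postgres.SocketFactory` class come from the cloud-sql-jdbc-socket-factory project; the database name `mydb` and the helper class `CloudSqlJdbcUrl` are hypothetical, and credentials are deliberately left out. This is not the poster's actual configuration.

```java
import java.util.Objects;

final class CloudSqlJdbcUrl {
    // Socket factory class provided by the postgres-socket-factory artifact.
    static final String SOCKET_FACTORY = "com.google.cloud.sql.postgres.SocketFactory";

    // Builds a URL with no host: the socket factory resolves the instance
    // from the connection name (project:region:instance) instead.
    static String build(String database, String instanceConnectionName) {
        Objects.requireNonNull(database, "database");
        Objects.requireNonNull(instanceConnectionName, "instanceConnectionName");
        if (database.isEmpty() || instanceConnectionName.isEmpty()) {
            throw new IllegalArgumentException("database and instance name must be non-empty");
        }
        return "jdbc:postgresql:///" + database
                + "?cloudSqlInstance=" + instanceConnectionName
                + "&socketFactory=" + SOCKET_FACTORY;
    }

    public static void main(String[] args) {
        // "mydb" is a placeholder; the instance name mirrors the one in the logs.
        System.out.println(build("mydb", "our-project:our-region:our-instance"));
    }
}
```

The resulting string would typically be supplied as `spring.datasource.url`, with the username and password set through the usual Spring datasource properties.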


@shenxiang wrote:

You may want to keep an eye on this issue: https://github.com/GoogleCloudPlatform/cloud-sql-jdbc-socket-factory/issues/502


Thank you for the helpful link.

Here is a fuller account of the order of events. This happens regardless of whether we use:

* Vanilla spring configuration of the pool and vpc-connector for connectivity

* --add-cloudsql-instances + socket factory

* just about any other method we can think of!

2021-06-24T08:11:28.220Z
Got more than one input failure. Logging failures after the first
java.lang.RuntimeException: [our-project:our-region:our-instance] Failed to update metadata for Cloud SQL instance.
com.google.cloud.sql.core.CloudSqlInstance.addExceptionContext(CloudSqlInstance.java:574)
com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:483)
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.net.ssl.SSLException: readHandshakeRecord
java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1320)
java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:440)
java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:411)
java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:567)
java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:148)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
com.google.cloud.sql.core.CloudSqlInstance.fetchMetadata(CloudSqlInstance.java:438)
... 9 common frames omitted
Suppressed: java.net.SocketException: Broken pipe (Write failed)
java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
java.base/sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:81)
java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:357)
java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:269)
java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:450)
20 common frames omitted
Caused by: java.net.SocketException: Broken pipe (Write failed)
java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
java.base/sun.security.ssl.SSLSocketOutputRecord.flush(SSLSocketOutputRecord.java:251)
java.base/sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:89)
java.base/sun.security.ssl.Finished$T12FinishedProducer.onProduceFinished(Finished.java:404)
java.base/sun.security.ssl.Finished$T12FinishedProducer.produce(Finished.java:379)
java.base/sun.security.ssl.SSLHandshake.produce(SSLHandshake.java:436)
java.base/sun.security.ssl.ServerHelloDone$ServerHelloDoneConsumer.consume(ServerHelloDone.java:182)
java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:392)
java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:183)
java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:171)
java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1403)
java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1309)
... 21 common frames omitted


2021-06-24T08:19:18.702Z
HikariPool-1 - Connection is not available, request timed out after 30001ms.

2021-06-24T08:19:18.702Z
Something unusual has occurred to cause the driver to fail. Please report this exception.

2021-06-24T08:19:18.707Z
Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection] with root cause
java.net.SocketException: Broken pipe (Write failed)
java.base/java.net.SocketOutputStream.socketWrite0(Native Method)


2021-06-24T08:19:18.736557Z HTTP 500

Hi, 

Do you see the error continuously after the first hour, or only for a brief period of time (while the certs have expired and the refresh has not yet completed)? Also, does your API environment throttle background activities, as the issue describes?

The refresh operation should happen automatically. If that is not the case for you, it would be better to create a tech support ticket so the problem can be investigated further.

Hello,

How are you connecting to Cloud SQL Postgres: on its public or its private address?

I would advise using the private address and Serverless VPC Access.

@erzz, we had the same problem. We had very long-running Cloud Run instances connecting to Cloud SQL Postgres via Serverless VPC Access. We reasoned that these instances were holding onto SQL connections that were older than two minutes; in this case, the underlying Cloud Run infrastructure drops these idle SQL connections. Our solution was to limit the lifetime of our SQL connections to 90 seconds. It could be that we could have simply limited the lifetime of idle SQL connections, but we were way too busy to investigate that further and the draconian solution was sufficient.
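As a rough illustration of the 90-second lifetime cap described above, the sketch below builds the corresponding Spring Boot properties for a HikariCP pool (the pool that appears in the logs earlier in the thread). The property keys `spring.datasource.hikari.max-lifetime` and `spring.datasource.hikari.idle-timeout` are standard Spring Boot bindings; the helper class `PoolLifetimeSettings` and the 60-second idle timeout are assumptions for illustration, not the poster's actual settings.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

final class PoolLifetimeSettings {
    // HikariCP documents 30 seconds as the minimum allowed maxLifetime.
    static final Duration MIN_MAX_LIFETIME = Duration.ofSeconds(30);

    // Returns Spring Boot property values (in milliseconds) that cap how long
    // a pooled connection may live and how long it may sit idle.
    static Map<String, Object> hikariDefaults(Duration maxLifetime, Duration idleTimeout) {
        if (maxLifetime.compareTo(MIN_MAX_LIFETIME) < 0) {
            throw new IllegalArgumentException("maxLifetime must be at least 30 seconds");
        }
        if (idleTimeout.compareTo(maxLifetime) >= 0) {
            throw new IllegalArgumentException("idleTimeout must be shorter than maxLifetime");
        }
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("spring.datasource.hikari.max-lifetime", Long.toString(maxLifetime.toMillis()));
        props.put("spring.datasource.hikari.idle-timeout", Long.toString(idleTimeout.toMillis()));
        return props;
    }

    public static void main(String[] args) {
        // 90 s matches the lifetime mentioned above; 60 s idle is an assumed value.
        hikariDefaults(Duration.ofSeconds(90), Duration.ofSeconds(60))
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The same two keys can simply be written into `application.properties`; in code, a map like this could be passed to `SpringApplication.setDefaultProperties` before the application starts. Whether the idle timeout alone would be enough is the open question the poster raises.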

 

With that said, we are noticing "Failed to create ephemeral certificate for the Cloud SQL instance" error messages that seem innocuous but are noisy, since they trigger incident reports.

I had the same problem recently with a Spring Boot application. What solved it was to update

<dependency>
<groupId>com.google.cloud</groupId>
<artifactId>spring-cloud-gcp-dependencies</artifactId>
<version>xxx</version>
<type>pom</type>
<scope>import</scope>
</dependency>

to the latest version. Suddenly everything started to work three times as fast and stopped crashing after one hour.
Hope it helps!

Hi Olav, we have tried upgrading it to 3.4.5, which is the latest for the Spring Boot 2.x versions, but we are still facing the issue. Any other solutions that you might have tried could help.