control-service: add timeouts to shedlock's database operations #693

tpalashki · 2022-02-02T15:18:11Z

We have experienced cases when operations against the database appear to hang
indefinitely (usually after a connectivity issue with the database). As a result,
tasks that attempt to obtain a distributed lock, fail to start. This is
particularly problematic for important tasks, such as the data job watch.

This commit attempts to resolve this by specifying a query timeout to the
Jdbc template used by shedlock to operate with the database.

Link: https://dzone.com/articles/threads-stuck-in-javanetsocketinputstreamsocketrea

Testing done: started the service locally and verified that the locks are
operating normally with the timeout.

Signed-off-by: Tsvetomir Palashki [email protected]

We have experienced cases when operations against the database appear to hang indefinitely (usually after a connectivity issue with the database). As a result tasks that attempt to obtain a distributed lock, fail to start. This is particularly problematic for important tasks, such as the data job watching. This commit attempts to resolve this by specifying a query timeout to the Jdbc template used by shedlock to operate with the database. Link: https://dzone.com/articles/threads-stuck-in-javanetsocketinputstreamsocketrea Testing done: started the service locally and veried that the locks are operating normally with thevtimeout. Signed-off-by: Tsvetomir Palashki <[email protected]>

ivakoleva · 2022-02-02T15:51:25Z

Isn't shedlock affected by generic db connectivity issues? maybe we need to address the JDBC connectivity recovery prior fine-tuning the shedlock that reuses same datasource configuration?

That is found at

versatile-data-kit/projects/control-service/projects/pipelines_control_service/src/main/resources/application-prod.properties

Line 18 in 29aa26b

spring.datasource.url=${DB_JDBC_URL}

for example we install it the Pipelines Control Service using helm install ... --set database.jdbcUrl=$DATABASE_JDBC_URL
https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/templates/deployment.yaml#L99

Maybe consider providing some defaults/configdocs for:

spring.datasource.testOnBorrow=true
spring.datasource.validationQuery = SELECT 1

maybe

spring.datasource.testWhileIdle = true 
spring.datasource.timeBetweenEvictionRunsMillis = 3600000

I see 3 database configurations only,
maybe consider introducing pool configs https://www.cockroachlabs.com/docs/stable/connection-pooling.html#example

mivanov1988 · 2022-02-03T14:51:21Z

Isn't shedlock affected by generic db connectivity issues? maybe we need to address the JDBC connectivity recovery prior fine-tuning the shedlock that reuses same datasource configuration?

That is found at

versatile-data-kit/projects/control-service/projects/pipelines_control_service/src/main/resources/application-prod.properties

Line 18 in 29aa26b

spring.datasource.url=${DB_JDBC_URL}

for example we install it the Pipelines Control Service using helm install ... --set database.jdbcUrl=$DATABASE_JDBC_URL
https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/templates/deployment.yaml#L99
Maybe consider providing some defaults/configdocs for:
spring.datasource.testOnBorrow=true
spring.datasource.validationQuery = SELECT 1
maybe
spring.datasource.testWhileIdle = true 
spring.datasource.timeBetweenEvictionRunsMillis = 3600000
I see 3 database configurations only, maybe consider introducing pool configs https://www.cockroachlabs.com/docs/stable/connection-pooling.html#example

As far as I know, only the Shedlock is affected (@tpalashki correct me if I am wrong), I would recommend merging this PR and tweaking the datasource in a separate one.

tpalashki · 2022-02-03T15:32:45Z

Judging by the thread dump, those are the threads to have appeared to be hung:

DataJobMonitorSync.updateDataJobStatus:31
DataJobMonitor.watchJobs:<before starting, within shedlock>
DeploymentMonitorSync.updateJobDeploymentStatuses:43
DeploymentMonitorSync.updateJobDeploymentStatuses:43

All of them are at the reading of the query result:

at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
at java.net.SocketInputStream.socketRead([email protected]/Unknown Source)
at java.net.SocketInputStream.read([email protected]/Unknown Source)
at java.net.SocketInputStream.read([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.read([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.readHeader([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/Unknown Source)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
at org.postgresql.core.PGStream.receiveChar(PGStream.java:443)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2056)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322)

All of them are using the HikariCP configuration, which should supposedly be taking care (?) of reviving the connections according to its defaults: https://github.com/brettwooldridge/HikariCP
Unfortunately, the suggested application-level properties are not applicable to HikariCP: https://stackoverflow.com/questions/53870473/hikari-and-test-on-borrow-option

vmwclabot added the cla-not-required label Feb 2, 2022

mivanov1988 approved these changes Feb 2, 2022

View reviewed changes

Merge branch 'main' into person/tpalashki/address-jdbc-hanging

45f9997

tpalashki enabled auto-merge (squash) February 7, 2022 13:48

tpalashki merged commit 15b6ab7 into main Feb 7, 2022

tpalashki deleted the person/tpalashki/address-jdbc-hanging branch February 7, 2022 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

control-service: add timeouts to shedlock's database operations #693

control-service: add timeouts to shedlock's database operations #693

tpalashki commented Feb 2, 2022 •

edited

Loading

ivakoleva commented Feb 2, 2022

mivanov1988 commented Feb 3, 2022

tpalashki commented Feb 3, 2022

control-service: add timeouts to shedlock's database operations #693

control-service: add timeouts to shedlock's database operations #693

Conversation

tpalashki commented Feb 2, 2022 • edited Loading

ivakoleva commented Feb 2, 2022

mivanov1988 commented Feb 3, 2022

tpalashki commented Feb 3, 2022

tpalashki commented Feb 2, 2022 •

edited

Loading