Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control-service: add timeouts to shedlock's database operations #693

Merged
merged 2 commits into from
Feb 7, 2022

Conversation

tpalashki
Copy link
Contributor

@tpalashki tpalashki commented Feb 2, 2022

We have experienced cases when operations against the database appear to hang
indefinitely (usually after a connectivity issue with the database). As a result,
tasks that attempt to obtain a distributed lock, fail to start. This is
particularly problematic for important tasks, such as the data job watch.

This commit attempts to resolve this by specifying a query timeout to the
Jdbc template used by shedlock to operate with the database.

Link: https://dzone.com/articles/threads-stuck-in-javanetsocketinputstreamsocketrea

Testing done: started the service locally and verified that the locks are
operating normally with the timeout.

Signed-off-by: Tsvetomir Palashki [email protected]

We have experienced cases when operations against the database appear to hang
indefinitely (usually after a connectivity issue with the database). As a result
tasks that attempt to obtain a distributed lock, fail to start. This is
particularly problematic for important tasks, such as the data job watching.

This commit attempts to resolve this by specifying a query timeout to the
Jdbc template used by shedlock to operate with the database.

Link: https://dzone.com/articles/threads-stuck-in-javanetsocketinputstreamsocketrea

Testing done: started the service locally and veried that the locks are
operating normally with thevtimeout.

Signed-off-by: Tsvetomir Palashki <[email protected]>
@ivakoleva
Copy link
Contributor

Isn't shedlock affected by generic db connectivity issues? maybe we need to address the JDBC connectivity recovery prior fine-tuning the shedlock that reuses same datasource configuration?

That is found at


for example we install it the Pipelines Control Service using helm install ... --set database.jdbcUrl=$DATABASE_JDBC_URL
https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/templates/deployment.yaml#L99

Maybe consider providing some defaults/configdocs for:

spring.datasource.testOnBorrow=true
spring.datasource.validationQuery = SELECT 1

maybe

spring.datasource.testWhileIdle = true 
spring.datasource.timeBetweenEvictionRunsMillis = 3600000

I see 3 database configurations only,
maybe consider introducing pool configs https://www.cockroachlabs.com/docs/stable/connection-pooling.html#example

@mivanov1988
Copy link
Collaborator

Isn't shedlock affected by generic db connectivity issues? maybe we need to address the JDBC connectivity recovery prior fine-tuning the shedlock that reuses same datasource configuration?

That is found at

for example we install it the Pipelines Control Service using helm install ... --set database.jdbcUrl=$DATABASE_JDBC_URL
https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/templates/deployment.yaml#L99
Maybe consider providing some defaults/configdocs for:

spring.datasource.testOnBorrow=true
spring.datasource.validationQuery = SELECT 1

maybe

spring.datasource.testWhileIdle = true 
spring.datasource.timeBetweenEvictionRunsMillis = 3600000

I see 3 database configurations only, maybe consider introducing pool configs https://www.cockroachlabs.com/docs/stable/connection-pooling.html#example

As far as I know, only the Shedlock is affected (@tpalashki correct me if I am wrong), I would recommend merging this PR and tweaking the datasource in a separate one.

@tpalashki
Copy link
Contributor Author

Judging by the thread dump, those are the threads to have appeared to be hung:

  • DataJobMonitorSync.updateDataJobStatus:31
  • DataJobMonitor.watchJobs:<before starting, within shedlock>
  • DeploymentMonitorSync.updateJobDeploymentStatuses:43
  • DeploymentMonitorSync.updateJobDeploymentStatuses:43

All of them are at the reading of the query result:

at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
at java.net.SocketInputStream.socketRead([email protected]/Unknown Source)
at java.net.SocketInputStream.read([email protected]/Unknown Source)
at java.net.SocketInputStream.read([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.read([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.readHeader([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketImpl.readApplicationRecord([email protected]/Unknown Source)
at sun.security.ssl.SSLSocketImpl$AppInputStream.read([email protected]/Unknown Source)
at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:161)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:128)
at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:113)
at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:73)
at org.postgresql.core.PGStream.receiveChar(PGStream.java:443)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2056)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:322)

All of them are using the HikariCP configuration, which should supposedly be taking care (?) of reviving the connections according to its defaults: https://github.com/brettwooldridge/HikariCP
Unfortunately, the suggested application-level properties are not applicable to HikariCP: https://stackoverflow.com/questions/53870473/hikari-and-test-on-borrow-option

@tpalashki tpalashki enabled auto-merge (squash) February 7, 2022 13:48
@tpalashki tpalashki merged commit 15b6ab7 into main Feb 7, 2022
@tpalashki tpalashki deleted the person/tpalashki/address-jdbc-hanging branch February 7, 2022 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants