Rethink how we check whether node's time is synchronized #1815
A few more ideas here: ideally we can report both on how the local node is doing compared to the network and on how other nodes are (not) nominating properly.
We've run into this recently. See logs:
you can clearly see the clock drift there.
NB: 000875a removed the NTP checker entirely. So we do not notice such drift-from-shared-reality at all.
@MonsieurNicolas I think trying to report on how other nodes are not nominating correctly may lead to false positives. Specifically, the ct picked for nomination depends on the node externalizing the latest ledger, so if the node is lagging for whatever reason, it may report other nodes as drifting (it will have a ct higher than what it sees externalized). Reporting the other way around seems good enough: a node with a drifted clock will see that its ct consistently disagrees with what the network externalizes. A rough sketch of that idea follows below.
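Here is a minimal C++ sketch of that "compare against what the network externalizes" idea. Everything in it (class name, window size, the mean-based estimate) is an illustrative assumption rather than stellar-core's actual code:

```cpp
#include <chrono>
#include <cstdint>
#include <deque>
#include <numeric>

// Hypothetical sketch, not stellar-core's actual API: estimate local clock
// drift by sampling the wall clock at the moment each ledger externalizes
// and comparing it against the close time (ct) the network agreed on. A
// consistently large delta suggests the local clock, not the network, is off.
class ClockDriftEstimator
{
    std::deque<int64_t> mDeltas;         // seconds: localTime - externalized ct
    static constexpr size_t WINDOW = 32; // assumed sample window size

  public:
    void
    onLedgerExternalized(uint64_t externalizedCloseTime)
    {
        auto now = std::chrono::duration_cast<std::chrono::seconds>(
                       std::chrono::system_clock::now().time_since_epoch())
                       .count();
        mDeltas.push_back(now - static_cast<int64_t>(externalizedCloseTime));
        if (mDeltas.size() > WINDOW)
        {
            mDeltas.pop_front();
        }
    }

    // Mean keeps the sketch short; a median would be more robust to the
    // processing-induced jitter mentioned later in this thread.
    double
    estimatedDriftSeconds() const
    {
        if (mDeltas.empty())
        {
            return 0.0;
        }
        return std::accumulate(mDeltas.begin(), mDeltas.end(), 0.0) /
               static_cast<double>(mDeltas.size());
    }
};
```

Note that a node that is slow to process the externalize message will also sample its clock late and inflate the delta, which is exactly the noise concern raised further down in this thread.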
I think you're mixing "a node not nominating in time" with "a node has a bad clock". The first kind of problem leads to things like:
The second kind of problem leads to:
This issue deals with the second kind of problem: "bad clock". Ideally, whatever solution we come up with allows us to warn:
I think it's acceptable to warn if a node is always behind: this points to a problem with that node's ability to follow quorums.
All I was saying is that the ct picked by a node with a bad clock is also impacted by how the node is doing overall, because ct also depends on ledger trigger timing. So my question was about the accuracy of reporting based on that.
(Copying a discussion from elsewhere for posterity.) What I'm trying to convey is that with the approach proposed in this issue, where we rely on relative time on other validators, we will have a lot of noise. Imagine a perfectly identical clock on all validators: even then, nodes will experience various levels of drift based on their processing of network events. What I don't want is operators getting confused by bogus reporting, so I'd rather report conservatively but with a higher probability that there's an issue. As for the questions above:
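As a sketch of what "report conservatively" could look like, the following only fires after sustained, large drift; the threshold and streak length are made-up numbers, not anything from this discussion:

```cpp
#include <cmath>

// A sketch of "report conservatively": only surface a clock-drift warning
// when the estimated drift exceeds a generous threshold for several
// consecutive samples, so ordinary processing jitter never trips it. The
// threshold and streak length are illustrative assumptions.
class ConservativeDriftReporter
{
    static constexpr double WARN_THRESHOLD_SECONDS = 30.0;
    static constexpr int REQUIRED_STREAK = 5;
    int mStreak = 0;

  public:
    // Feed one drift estimate per externalized ledger; returns true when a
    // warning should actually be emitted to the operator.
    bool
    onDriftSample(double driftSeconds)
    {
        if (std::abs(driftSeconds) > WARN_THRESHOLD_SECONDS)
        {
            ++mStreak;
        }
        else
        {
            mStreak = 0;
        }
        return mStreak >= REQUIRED_STREAK;
    }
};
```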
Summarizing the discussion on the issue:
So for now we concluded that the actual code change isn't worth pursuing here, and nodes should maintain correct time outside of the code (i.e. at the operating-system level).
I recently experienced a situation where I was running stellar-core inside a VM, and for reasons irrelevant here the VM's time drifted significantly. Upon a fresh boot of stellar-core with no prior data, the node didn't join the network, but it also didn't present any error messages, so it was difficult to determine that time drift was the culprit.
@MonsieurNicolas mentioned to me that it relates to this issue.
I think it would be super useful to give some indication that something is wrong with the clock when the node is out of sync. A sketch of what such a hint could look like is below.
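As an illustration of the kind of hint that would have helped in the VM situation above, this sketch assumes a hypothetical hook with access to the latest close time heard from peers; the function name, threshold, and logging are all assumptions, none of this is stellar-core's actual API:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>

// Hypothetical sketch: while the node is out of sync, compare the local
// clock against the close time of the latest ledger heard from peers and
// print a hint if the gap is implausibly large.
void
maybeWarnAboutClockSkew(uint64_t latestNetworkCloseTime)
{
    auto now = std::chrono::duration_cast<std::chrono::seconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                   .count();
    int64_t skew = now - static_cast<int64_t>(latestNetworkCloseTime);
    if (std::llabs(skew) > 60) // over a minute off: assumed threshold
    {
        std::cerr << "Local clock differs from network close time by " << skew
                  << "s; check the system clock / NTP configuration\n";
    }
}
```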
I think as part of this issue we should also audit all the places where we use the system clock and determine how problematic it would be for the network to have drifting nodes.
Observed on pubnet today:
We may want to split the problem into:
Maybe a new metric, "number of validators in the quorum set with clock skew relative to local", would help in both cases? I think we can only estimate it over time, which is not great, but better than nothing. A sketch of such a metric follows below.
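A sketch of such a metric, under assumed names and an assumed threshold (not anything from stellar-core's metrics code):

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>

// A sketch of the proposed metric: track, per validator in our quorum set,
// the last observed gap between the close time it nominated and our local
// clock at receipt, then count how many validators exceed a skew threshold.
// As noted above, this is only an estimate accumulated over time.
class QuorumClockSkewMetric
{
    std::map<std::string, int64_t> mLastSkewSeconds;      // nodeID -> skew
    static constexpr int64_t SKEW_THRESHOLD_SECONDS = 10; // assumed

  public:
    void
    onNomination(std::string const& nodeID, uint64_t nominatedCloseTime,
                 int64_t localNowSeconds)
    {
        mLastSkewSeconds[nodeID] =
            localNowSeconds - static_cast<int64_t>(nominatedCloseTime);
    }

    // The value to export as the "validators with clock skew" metric.
    size_t
    validatorsWithSkew() const
    {
        size_t n = 0;
        for (auto const& kv : mLastSkewSeconds)
        {
            if (std::llabs(kv.second) > SKEW_THRESHOLD_SECONDS)
            {
                ++n;
            }
        }
        return n;
    }
};
```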
Looks like we currently check whether local time is synchronized with an NTP server (via NtpSynchronizationChecker/NtpWork/NtpClient). There are a few things worth revisiting; for one, mIsSynchronized is updated from the worker thread (without using a mutex or std::atomic). A sketch of a fix for that data race follows below.
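For the thread-safety point specifically, here is a minimal sketch of the kind of fix being implied, assuming a simplified class shape rather than the real NtpSynchronizationChecker:

```cpp
#include <atomic>

// Minimal sketch of the data-race fix: if the worker thread writes
// mIsSynchronized while the main thread reads it, making the flag
// std::atomic<bool> removes the race without needing a mutex.
class NtpSynchronizationChecker
{
    std::atomic<bool> mIsSynchronized{false};

  public:
    // Called from the worker thread after an NTP query completes.
    void
    setSynchronized(bool synchronized)
    {
        mIsSynchronized.store(synchronized, std::memory_order_relaxed);
    }

    // Called from the main thread when deciding whether to warn.
    bool
    isSynchronized() const
    {
        return mIsSynchronized.load(std::memory_order_relaxed);
    }
};
```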