-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor hinted-handoff service #4516
Conversation
a105c9a
to
ff8291f
Compare
ff8291f
to
623b763
Compare
623b763
to
5ad30bd
Compare
|
||
# Interval between running checks for data that should be purged. Data is purged from | ||
# hinted-handoff queues for two reasons. 1) The data is older than the max age, or | ||
# 2) the target node has been dropped from the server. Data is never dropped until |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
target node? Do you mean shard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HH queues (which are wrapped in a NodeProcessor
) are (and always have been) on a per-node basis. By node I mean an InfluxDB process, server, etc whatever word makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I don't understand what "target node has been dropped from the server" means. That's "target server has been dropped from the server"? Do you actually mean "the target server has been dropped from the cluster"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I do, that Is a typo.
On Friday, October 23, 2015, Paul Dix [email protected] wrote:
In etc/config.sample.toml
#4516 (comment):
- enabled = true
- dir = "/var/opt/influxdb/hh"
- max-size = 1073741824
- max-age = "168h"
- retry-rate-limit = 0
Hinted handoff will start retrying writes to down nodes at a rate of once per second.
If any error occurs, it will backoff in an exponential manner, until the interval
reaches retry-max-interval. Once writes to all nodes are successfully completed the
interval will reset to retry-interval.
- retry-interval = "1s"
- retry-max-interval = "1m"
Interval between running checks for data that should be purged. Data is purged from
hinted-handoff queues for two reasons. 1) The data is older than the max age, or
2) the target node has been dropped from the server. Data is never dropped until
Then I don't understand what "target node has been dropped from the
server" means. That's "target server has been dropped from the server"? Do
you actually mean "the target server has been dropped from the cluster"?—
Reply to this email directly or view it on GitHub
https://github.com/influxdb/influxdb/pull/4516/files#r42859470.
What is a |
A Node is a distinct InfluxDB server i.e. a node in a cluster. |
d12ceec
to
574a108
Compare
574a108
to
3c9924c
Compare
3c9924c
to
8e18c18
Compare
5f091d0
to
605ecaf
Compare
Kicked tests on Circle, it was an unrelated Raft issue. |
bd60ccf
to
2e57816
Compare
edafdd3
to
f1f9fc3
Compare
A NodeProcessor wraps an on-disk queue and the goroutine that attempts to drain that queue and send the data to the associated target node.
f1f9fc3
to
f38c536
Compare
if s.closing != nil { | ||
close(s.closing) | ||
} | ||
s.wg.Wait() | ||
s.closing = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting this to nil
prevents existing goroutines from selecting it. That has created issues in other parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've seen this pattern introduced recently and it made sense to me.
So a nil
state of this variable is used to check the "open" state of the service. Sounds like I would then need to use a distinct boolean instead.
Not sure if the nil channels will be a problem in this code or no. They did cause problems in other places where they were supposed to abort early from a function call, but did not because they were set to nil after being closed. Up to you. 👍 |
Use of |
Refactor hinted-handoff service
This change refactors the hinted-handoff service.
Whereas previously the system repeatedly launched a short-lived goroutine per active node (to drain all HH data for that node), it now launches a single long-running goroutine per active node which remains running until the node processor is purged due to being marked inactive (which may never happen). A node is considered active until it is DROPped.
Giving each target node a dedicated, long-running, goroutine makes it easier to deal with some HH edge cases.
The change also cleans up stats, and adds HH diagnostics.
Multiple rounds of testing done with 3 nodes, with the 2 nodes not receiving the write requests killed every 10 seconds, multiple times. After these 2 nodes were finally brought online and kept online, they received all the missed writes.