validator-registration: simplify & optimize duty execution #2030
base: stage
Conversation
lgtm. Although as you mentioned I don't think this is a real issue, since the cache is by pubkey, we are overriding the pubkey each time and the cache doesn't grow. There are edge cases where a validator is removed and it might stay in the cache forever unused.
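(To illustrate the "cache is by pubkey" point: a minimal sketch, not the PR's actual code; the variable and field names are assumptions based on the diff further down.)

```go
// Sketch: re-registering the same validator overwrites its map entry, so the
// cache stays bounded by the number of registered validators.
gc.registrationMu.Lock()
gc.registrations[pubkey] = &validatorRegistration{ // pubkey: phase0.BLSPubKey (assumed in scope)
	VersionedSignedValidatorRegistration: signedRegistration, // *api.VersionedSignedValidatorRegistration (assumed in scope)
	new:                                  true,               // marks the entry for submission on the next tick
}
gc.registrationMu.Unlock()
```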
Good find. lgtm
@iurii-ssv it may in theory grow perpetually, however it's bounded by the number of registered active validators, so it should never be a problem. Whether we can drop the registrations or not depends on how often we should be submitting them. I recall that other validator clients submit every epoch, and if that's the case then we should probably stick to it. However, I did notice that we seem to submit all registrations every slot? If so, that does seem like a bit of an overkill 😄 we should probably be submitting for every validator only once per epoch.
I don't really know how validator registrations are supposed to work in full, but cleaning up the cache helps with "submitting too often / submitting too much data" (since we only submit when the cache is not empty). Submitting every epoch is another way to reduce the amount of requests/data sent, I guess, but
I think that's once per epoch; you can verify it by reading other validator clients such as Lighthouse. If we remove from the cache, we'll only submit once per 10 epochs, which isn't ideal. Because of that, if the above is true, what we can maybe do is a hot & cold cache system, where we move submitted registrations to the cold cache, which is submitted only once per epoch (if maybe …
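(Purely to illustrate the hot & cold cache idea above, a rough sketch; none of this is in the PR, and the type and field names are assumptions.)

```go
import (
	"crypto/sha256"
	"encoding/binary"
	"sync"

	"github.com/attestantio/go-eth2-client/api"
	"github.com/attestantio/go-eth2-client/spec/phase0"
)

// registrationCache splits registrations into a "hot" set (not yet submitted,
// sent on the next tick) and a "cold" set (already submitted, re-sent only once
// per epoch, spread across slots by pubkey).
type registrationCache struct {
	mu   sync.Mutex
	hot  map[phase0.BLSPubKey]*api.VersionedSignedValidatorRegistration
	cold map[phase0.BLSPubKey]*api.VersionedSignedValidatorRegistration
}

func (c *registrationCache) selectForSlot(slot phase0.Slot, slotsPerEpoch uint64) []*api.VersionedSignedValidatorRegistration {
	c.mu.Lock()
	defer c.mu.Unlock()

	var out []*api.VersionedSignedValidatorRegistration

	// Hot registrations are always submitted, then demoted to the cold set.
	for pk, r := range c.hot {
		out = append(out, r)
		c.cold[pk] = r
		delete(c.hot, pk)
	}

	// Cold registrations are re-submitted only once per epoch, each validator
	// on a slot derived from its pubkey hash.
	slotInEpoch := uint64(slot) % slotsPerEpoch
	for pk, r := range c.cold {
		h := sha256.Sum256(pk[:])
		if binary.LittleEndian.Uint64(h[:8])%slotsPerEpoch == slotInEpoch {
			out = append(out, r)
		}
	}
	return out
}
```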
// registrations is a set of validator-registrations (their latest versions) to be sent to
// Beacon node to ensure various entities in Ethereum network, such as Relays, are aware of
// participating validators
registrations map[phase0.BLSPubKey]*validatorRegistration
Would it make sense to eventually expire old validator registrations from the registrations map (e.g., if a validator is exited/slashed or removed)?
It would make sense to do that, but I don't think it would justify the added complexity (since "validator is exited/slashed or removed" events are rare and node-restarts solve it eventually)
- validator X is initially registered with operators 1,2,3,4
- fee recipient Y is submitted via operator 1,2,3,4
- validator X switches to a new cluster: 2,3,4,5
- fee recipient is updated to Z
- operator 5 now submits registration with recipient Z
- but operator 1 (still holding old state) may still submit with Y
- if validator X gets a proposal duty in the meantime → block may be built with outdated fee recipient Y
Hmm, that's an interesting scenario you are describing,
I'm not sure how realistic/bad it actually is (maybe @moshe-blox or @y0sher could chime in on it) but from what I understand:
- both Y and Z recipients "belong" to the "same user" (the owner of Validator X) - meaning he should be able to get those funds even though they might be sent to old address due to all these circumstances
- eventually (after a restart) operator 1 will stop sending out this old info, and so it will resolve - so the user will only need to "have access" to his old address for maybe ~ a week to pull those funds out of address Y
thus, if it's unlikely to ever happen, it doesn't seem too bad? wdyt
Also, that seems to be a problem for stage version as well, right ?
- both Y and Z recipients "belong" to the "same user" (the owner of Validator X) - meaning he should be able to get those funds even though they might be sent to old address due to all these circumstances
The assumption that "Y and Z belong to the same user" is risky.
In a permissionless system like Ethereum, the protocol can't rely on that being true.
Also, that seems to be a problem for stage version as well, right ?
If registrations aren’t removed from the cache after validator removals, then yes this issue likely exists in both stage and production.
I pretty much agree with all your input, just not sure where exactly it fits on our priority list - so I'll create an issue to document the problem for now (so we don't lose it) #2105, but it doesn't have to be a part of this PR
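(For the record, a rough sketch of the kind of cleanup #2105 is about; the removal hook itself is hypothetical and not part of this PR, and the receiver/field names are assumptions based on the diff below.)

```go
// Hypothetical cleanup hook: drop a removed/exited validator's cached
// registration so it is no longer re-submitted.
func (gc *GoClient) removeRegistration(pubkey phase0.BLSPubKey) {
	gc.registrationMu.Lock()
	defer gc.registrationMu.Unlock()
	delete(gc.registrations, pubkey)
}
```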
beacon/goclient/validator.go
Outdated
// Select registrations to submit.
gc.registrationMu.Lock()
allRegistrations := maps.Values(gc.registrations)
gc.registrationMu.Unlock()

registrations := make([]*api.VersionedSignedValidatorRegistration, 0)
for _, r := range allRegistrations {
	validatorPk, err := r.PubKey()
	if err != nil {
		gc.log.Error("Failed to get validator pubkey", zap.Error(err), fields.Slot(currentSlot))
		continue
	}

	// Distribute the registrations evenly across the epoch based on the pubkeys.
	slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()
	validatorHash := sha256.Sum256(validatorPk[:8])
	validatorDescriptor := binary.LittleEndian.Uint64(validatorHash[:])
	shouldSubmit := validatorDescriptor%gc.network.SlotsPerEpoch() == slotInEpoch

	if r.new || shouldSubmit {
		r.new = false
		registrations = append(registrations, r.VersionedSignedValidatorRegistration)
	}
}
With ~1000 registrations max, the tradeoff between copying maps.Values() and filtering inside the lock seems minor. Was this mainly to reduce lock time and avoid blocking SubmitValidatorRegistration()?
Setting r.new = false after unlocking could lead to a race — if the registration gets replaced before that line runs, it might cause unnecessary resubmissions.
setting r.new = false after unlocking could lead to a race — if the registration gets replaced before that line runs, it might cause unnecessary resubmissions
Right, I'm not sure if in the stage branch all racy behavior is avoided - but here in this PR we certainly have races like that. Still, I think these are non-harmful (sending a registration a couple of times extra isn't that bad),
while on the upside it simplifies mutex usage quite a bit.
With ~1000 registrations max, the tradeoff between copying maps.Values() and filtering inside the lock seems minor. Was this mainly to reduce lock time and avoid blocking SubmitValidatorRegistration()?
I think there just is no need to hold this mutex locked while filtering, unless you want to use this mutex to make operations involving r.new atomic ... which IMO isn't worth the added complexity (it's also not super obvious that locking this mutex achieves that - a comment would help somewhat with that ... but long comments aren't ideal either).
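(For comparison, a rough sketch of the "filter inside the lock" alternative being discussed; it reuses the names from the snippet above, iterates the map directly so the key replaces the r.PubKey() call, and keeps the r.new read/write under the same mutex. Not the PR's actual code.)

```go
gc.registrationMu.Lock()
registrations := make([]*api.VersionedSignedValidatorRegistration, 0, len(gc.registrations))
slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()
for pubkey, r := range gc.registrations {
	// Distribute the registrations evenly across the epoch based on the pubkeys.
	validatorHash := sha256.Sum256(pubkey[:])
	shouldSubmit := binary.LittleEndian.Uint64(validatorHash[:8])%gc.network.SlotsPerEpoch() == slotInEpoch

	if r.new || shouldSubmit {
		r.new = false // flipped while the lock is held, so no race with SubmitValidatorRegistration
		registrations = append(registrations, r.VersionedSignedValidatorRegistration)
	}
}
gc.registrationMu.Unlock()
```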
I understand the race here is low-impact — re-submitting a registration isn’t a big deal.
That said, using maps.Values() and then mutating .new outside the lock introduces a data race on the struct itself. The lock protects the map, but not the underlying pointer, which may have already been replaced concurrently (e.g., via SubmitValidatorRegistration).
It might be worth either moving the mutation under the lock or rethinking how submission state is tracked — just to ensure consistency and avoid surprises if the logic evolves.
Oh wait, you are right, I meant for new to be atomic.Bool. I thought I'd defined it like that already but looks like I forgot - changed it now 80f8d94, so the data race should no longer be an issue.
Edit: on second thought I think there isn't a data race in what you described above, because all SubmitValidatorRegistration does is a 1-time initialization that needs to be "propagated" to the goroutine running the registrationSubmitter func (and gc.registrationMu actually ensures that "propagation"/synchronization happens correctly for us) - and then the goroutine running registrationSubmitter reads/writes that data address sequentially (nobody reads/modifies it concurrently after that - it is the only goroutine that accesses this address from that point forward),
so I'm reverting 80f8d94 for now as unnecessary, @olegshmuelov let me know if I'm missing something.
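(For reference, a minimal sketch of the atomic.Bool variant mentioned above; the wrapper's exact shape is an assumption based on the diff, and this is not necessarily what 80f8d94 looked like.)

```go
import (
	"sync/atomic"

	"github.com/attestantio/go-eth2-client/api"
)

// validatorRegistration wraps a signed registration together with a flag that
// marks it as not yet submitted; atomic.Bool makes the flag safe to flip even
// outside registrationMu.
type validatorRegistration struct {
	*api.VersionedSignedValidatorRegistration
	new atomic.Bool
}
```

In the submitter loop, the flag would then be consumed with something like `r.new.CompareAndSwap(true, false) || shouldSubmit`, which clears it and selects the registration in one atomic step.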
Thanks for the detailed explanation! Appreciate you thinking it through.
Just worth noting - this is still a data race under the Go memory model, and relying on intuition like “only one goroutine accesses it after init” isn’t safe in concurrent code. Even benign races are better avoided, especially in infra-level systems.
That said, fine to leave it as-is if we agree the impact is negligible.
// Distribute the registrations evenly across the epoch based on the pubkeys.
slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()
Minor: slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch() could be moved outside the loop since it doesn't change per validator.
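(A rough sketch of the suggested hoisting, reusing the names from the diff above:)

```go
// Computed once per submission tick, since currentSlot is the same for every validator.
slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()

for _, r := range allRegistrations {
	// ... per-validator pubkey hashing and the shouldSubmit check, as above ...
}
```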
True, but I thought trivial optimizations like that are done automatically by the compiler; the Go compiler is kinda weak in that sense though, from what I last read about it (compared to C++ or Java, which are more aggressive) ...
but regardless, I would prefer code readability over some processing overhead (unless it's in some super-hot execution path).
cc @oleg-ssvlabs @moshe-blox @y0sher let me know if you think otherwise (just so we get on the same page about this)
This PR improves and clarifies how validator-registrations are submitted to the Beacon Node. The design separates the process into:
Known Issues:
Another spec-side issue: this introduces a potential data race, where the fee recipient may change between signing root calculation and actual signature generation. If this happens, reconstructed signing roots won't match, causing signature reconstruction to fail and the duty to break.
If we aim to further improve and harden the validator registration flow, we should consider addressing at least one of the known issues above.
This PR clarifies the validator-registration flow and also aims to simplify & improve the way validator-registrations are sent to the Beacon node, namely we want to: