
validator-registration: simplify & optimize duty execution #2030

Open
wants to merge 12 commits into stage from beacon-client-clean-up-registrations-cache

Conversation

iurii-ssv
Contributor

@iurii-ssv iurii-ssv commented Feb 7, 2025

This PR clarifies the validator-registration flow and also aims to simplify & improve the way validator registrations are sent to the Beacon node. Namely, we want to:

  • submit each validator registration once per epoch (like Lighthouse or Prysm do)
  • submit newly produced registrations in the next slot (they already waited 10 epochs, so let's not make them wait another one)
  • to reduce BN load, avoid submitting everything at the same time (also keep in mind that multiple operators in a cluster can be connected to the same Beacon node, which might create a bottleneck scenario if not accounted for); see the sketch below
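
For illustration, one way to spread submissions deterministically across an epoch by pubkey (a sketch under assumed names; the actual selection logic from this PR appears in the review excerpt further down):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/binary"
)

// shouldSubmitAt reports whether a validator's registration is due at the
// given slot by hashing its pubkey, so submissions spread evenly across
// the epoch's slots. Function name and signature are assumptions.
func shouldSubmitAt(pubkey [48]byte, slot, slotsPerEpoch uint64) bool {
	h := sha256.Sum256(pubkey[:8])
	descriptor := binary.LittleEndian.Uint64(h[:8])
	return descriptor%slotsPerEpoch == slot%slotsPerEpoch
}
```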

@iurii-ssv iurii-ssv requested a review from nkryuchkov February 7, 2025 13:11

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 26.15385% with 48 lines in your changes missing coverage. Please review.

Project coverage is 47.8%. Comparing base (df0b510) to head (1c1639f).

Files with missing lines                   Patch %   Lines
beacon/goclient/validator.go               25.9%     38 Missing and 2 partials ⚠️
doppelganger/mock.go                       0.0%      6 Missing ⚠️
operator/duties/validatorregistration.go   50.0%     2 Missing ⚠️


Contributor

@y0sher y0sher left a comment


lgtm. Although, as you mentioned, I don't think this is a real issue: since the cache is keyed by pubkey, we overwrite each entry in place and the cache doesn't grow. There are edge cases where a validator is removed and its entry might stay in the cache forever unused.

Contributor

@oleg-ssvlabs oleg-ssvlabs left a comment


Good find. lgtm

@moshe-blox
Contributor

@iurii-ssv it may in theory grow perpetually, however it's bounded by the number of registered active validators, so it should never be a problem

whether we can drop the registrations or not depends on how often we should be submitting them

I recall that other validator clients submit every epoch, and if that's the case then we should probably stick to it

however, I did notice that we seem to submit all registrations every slot? if so, that does seem like a bit of an overkill 😄

we should probably be submitting for every validator only once per epoch

@iurii-ssv
Contributor Author

iurii-ssv commented Mar 20, 2025

I don't really know how validator registrations are supposed to work in full, but cleaning up the cache helps with "submitting too often / submitting too much data" (since we only submit when the cache is not empty)

submitting every epoch is another way to reduce the amount of requests/data sent, I guess, but

  • maybe cleaning up the cache (+ submitting every slot) is good enough?
  • maybe we want to submit frequently so that the submission delay is as low as possible (but maybe that doesn't really affect anything)

@moshe-blox ^

@moshe-blox
Contributor

moshe-blox commented Mar 20, 2025

I don't really know how validator registrations are supposed to work in full

I think that's once per epoch; you can verify it by reading other validator clients such as Lighthouse

if we remove from the cache, we'll only submit once per 10 epochs, which isn't ideal if the above is true

what we could maybe do is a hot & cold cache system, where we move submitted registrations to the cold cache, which is submitted only once per epoch (if slot%32 == 0), whereas the hot cache is submitted every slot to keep the delay small (see the sketch below)

maybe pendingRegistrations and activeRegistrations
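
A minimal sketch of that hot & cold split, using the hypothetical names suggested above (an illustration of the idea, not the code that landed in the PR):

```go
package sketch

import (
	"sync"

	"github.com/attestantio/go-eth2-client/api"
	"github.com/attestantio/go-eth2-client/spec/phase0"
)

// registrationCache sketches the hot/cold split: pendingRegistrations ("hot")
// entries are submitted every slot until first pickup, then moved to
// activeRegistrations ("cold"), which is re-submitted once per epoch.
type registrationCache struct {
	mu                   sync.Mutex
	pendingRegistrations map[phase0.BLSPubKey]*api.VersionedSignedValidatorRegistration
	activeRegistrations  map[phase0.BLSPubKey]*api.VersionedSignedValidatorRegistration
}

// selectForSlot returns the registrations due at the given slot, draining the
// hot cache into the cold one as it goes.
func (c *registrationCache) selectForSlot(slot phase0.Slot, slotsPerEpoch uint64) []*api.VersionedSignedValidatorRegistration {
	c.mu.Lock()
	defer c.mu.Unlock()

	var out []*api.VersionedSignedValidatorRegistration
	// Cold cache first: re-submitted only at the first slot of each epoch.
	if uint64(slot)%slotsPerEpoch == 0 {
		for _, r := range c.activeRegistrations {
			out = append(out, r)
		}
	}
	// Hot cache: always submitted, then moved to the cold cache.
	for pk, r := range c.pendingRegistrations {
		out = append(out, r)
		c.activeRegistrations[pk] = r
		delete(c.pendingRegistrations, pk)
	}
	return out
}
```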

@iurii-ssv iurii-ssv force-pushed the beacon-client-clean-up-registrations-cache branch from 2bae57f to 504d1ba on March 20, 2025 19:50
@iurii-ssv iurii-ssv changed the title from "beacon-client: clean up registrations cache" to "validator-registration: simplify & optimize duty execution" on Mar 20, 2025
@iurii-ssv iurii-ssv marked this pull request as draft March 20, 2025 19:51
@iurii-ssv iurii-ssv force-pushed the beacon-client-clean-up-registrations-cache branch from c1284f3 to 273e455 on March 21, 2025 11:09
@iurii-ssv iurii-ssv marked this pull request as ready for review March 21, 2025 11:15
@iurii-ssv iurii-ssv requested a review from olegshmuelov March 21, 2025 11:24
@iurii-ssv iurii-ssv force-pushed the beacon-client-clean-up-registrations-cache branch from a0ce8b0 to 8bbbfb0 on March 21, 2025 15:09
```go
// registrations is a set of validator-registrations (their latest versions) to be sent to
// Beacon node to ensure various entities in Ethereum network, such as Relays, are aware of
// participating validators
registrations map[phase0.BLSPubKey]*validatorRegistration
```
Contributor


Would it make sense to eventually expire old validator registrations from the registrations map (e.g., if a validator is exited/slashed or removed)?

Contributor Author

It would make sense to do that, but I don't think it would justify the added complexity (since "validator is exited/slashed or removed" events are rare, and node restarts solve it eventually)

Contributor

  1. validator X is initially registered with operators 1,2,3,4
  2. fee recipient Y is submitted via operator 1,2,3,4
  3. validator X switches to a new cluster: 2,3,4,5
  4. fee recipient is updated to Z
  5. operator 5 now submits registration with recipient Z
  6. but operator 1 (still holding old state) may still submit with Y
  7. if validator X gets a proposal duty in the meantime → block may be built with outdated fee recipient Y

Contributor Author

@iurii-ssv iurii-ssv Mar 25, 2025

Hmm, that's an interesting scenario you are describing.

I'm not sure how realistic/bad it actually is (maybe @moshe-blox or @y0sher could chime in on it), but from what I understand:

  • both Y and Z recipients "belong" to the "same user" (the owner of validator X), meaning he should be able to get those funds even though they might be sent to the old address due to all these circumstances
  • eventually (after a restart) operator 1 will stop sending out this old info, and so it will resolve itself; the user will only need to "have access" to his old address for maybe ~a week to pull those funds out of address Y

thus, if it's unlikely to ever happen, it doesn't seem too bad? wdyt

Also, that seems to be a problem for the stage version as well, right?

Contributor

  • both Y and Z recipients "belong" to the "same user" (the owner of Validator X) - meaning he should be able to get those funds even though they might be sent to old address due to all these circumstances

The assumption that "Y and Z belong to the same user" is risky.
In a permissionless system like Ethereum, the protocol can't rely on that being true.

Contributor

Also, that seems to be a problem for stage version as well, right ?

If registrations aren't removed from the cache after validator removals, then yes, this issue likely exists in both stage and production.

Contributor Author

I pretty much agree with all your input, just not sure where exactly it fits on our priority list - so I'll create an issue to document the problem for now (so we don't lose it) #2105, but it doesn't have to be a part of this PR

Comment on lines 87 to 110
```go
// Select registrations to submit.
gc.registrationMu.Lock()
allRegistrations := maps.Values(gc.registrations)
gc.registrationMu.Unlock()

registrations := make([]*api.VersionedSignedValidatorRegistration, 0)
for _, r := range allRegistrations {
	validatorPk, err := r.PubKey()
	if err != nil {
		gc.log.Error("Failed to get validator pubkey", zap.Error(err), fields.Slot(currentSlot))
		continue
	}

	// Distribute the registrations evenly across the epoch based on the pubkeys.
	slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()
	validatorHash := sha256.Sum256(validatorPk[:8])
	validatorDescriptor := binary.LittleEndian.Uint64(validatorHash[:])
	shouldSubmit := validatorDescriptor%gc.network.SlotsPerEpoch() == slotInEpoch

	if r.new || shouldSubmit {
		r.new = false
		registrations = append(registrations, r.VersionedSignedValidatorRegistration)
	}
}
```
Contributor

With ~1000 registrations max, the tradeoff between copying maps.Values() and filtering inside the lock seems minor. Was this mainly to reduce lock time and avoid blocking SubmitValidatorRegistration()?

setting r.new = false after unlocking could lead to a race — if the registration gets replaced before that line runs, it might cause unnecessary resubmissions.

Contributor Author

@iurii-ssv iurii-ssv Mar 25, 2025

setting r.new = false after unlocking could lead to a race — if the registration gets replaced before that line runs, it might cause unnecessary resubmissions

Right, I'm not sure if in the stage branch all racy behavior is avoided - but here in this PR we certainly have races like that - still, I think these are non-harmful (sending a registration a couple of extra times isn't that bad)

while on the upside it simplifies mutex usage quite a bit

With ~1000 registrations max, the tradeoff between copying maps.Values() and filtering inside the lock seems minor. Was this mainly to reduce lock time and avoid blocking SubmitValidatorRegistration()?

I think there's just no need to hold this mutex locked while filtering,

unless you want to use this mutex to make operations involving r.new atomic ... which IMO isn't worth the added complexity (it's also not super obvious that locking this mutex achieves that - a comment would help somewhat, but long comments aren't ideal either)

Contributor

I understand the race here is low-impact — re-submitting a registration isn’t a big deal.
That said, using maps.Values() and then mutating .new outside the lock introduces a data race on the struct itself. The lock protects the map, but not the underlying pointer, which may have already been replaced concurrently (e.g., via SubmitValidatorRegistration).
It might be worth either moving the mutation under the lock or rethinking how submission state is tracked — just to ensure consistency and avoid surprises if the logic evolves.

Contributor Author

@iurii-ssv iurii-ssv Mar 26, 2025

Oh wait, you are right, I meant for new to be atomic.Bool

I thought I'd defined it like that already, but looks like I forgot - changed it now 80f8d94 so the data race should no longer be an issue

Edit: on second thought, I think there isn't a data race in what you described above, because all SubmitValidatorRegistration does is a one-time initialization that needs to be "propagated" to the goroutine running the registrationSubmitter func (and gc.registrationMu actually ensures that "propagation"/synchronization happens correctly for us); after that, the goroutine running registrationSubmitter reads/writes that data address sequentially (nobody else reads/modifies it concurrently from that point forward - it is the only goroutine that accesses this address)

so I'm reverting 80f8d94 for now as unnecessary, @olegshmuelov let me know if I'm missing something

Contributor

Thanks for the detailed explanation! Appreciate you thinking it through.
Just worth noting - this is still a data race under the Go memory model, and relying on intuition like “only one goroutine accesses it after init” isn’t safe in concurrent code. Even benign races are better avoided, especially in infra-level systems.
That said, fine to leave it as-is if we agree the impact is negligible.
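
For reference, a minimal sketch of the atomic.Bool variant discussed in this thread (commit 80f8d94, later reverted); names mirror the excerpt above, but this is illustrative rather than the merged code:

```go
package sketch

import (
	"sync/atomic"

	"github.com/attestantio/go-eth2-client/api"
)

// validatorRegistration wraps a signed registration with an atomic "new" flag,
// letting the submitter goroutine read-and-clear it without holding a mutex,
// even if another goroutine inspects the entry concurrently.
type validatorRegistration struct {
	*api.VersionedSignedValidatorRegistration
	new atomic.Bool
}

// dueForSubmission reports whether the registration should be submitted now.
// Swap returns the previous value, so a fresh registration is picked up
// exactly once even under concurrent access.
func dueForSubmission(r *validatorRegistration, shouldSubmit bool) bool {
	return r.new.Swap(false) || shouldSubmit
}
```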

```go
	}

	// Distribute the registrations evenly across the epoch based on the pubkeys.
	slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch()
```
Contributor

Minor: slotInEpoch := uint64(currentSlot) % gc.network.SlotsPerEpoch() could be moved outside the loop since it doesn't change per validator

Contributor Author

@iurii-ssv iurii-ssv Mar 25, 2025

True, but I thought trivial optimizations like that are done automatically by the compiler; from what I've read, the Go compiler is kinda weak in that sense though (compared to C++ or Java compilers, which are more aggressive) ...

but regardless, I would prefer code readability over some processing overhead (unless it's in some super-hot execution path)

cc @oleg-ssvlabs @moshe-blox @y0sher let me know if you think otherwise (just so we get on the same page about this)

@iurii-ssv iurii-ssv force-pushed the beacon-client-clean-up-registrations-cache branch from 7e385aa to 7722a70 on March 25, 2025 16:24
@olegshmuelov
Contributor

olegshmuelov commented Mar 25, 2025

This PR improves and clarifies how validator registrations are submitted to the Beacon Node. The design separates the process into:

  1. Duty Scheduling (ValidatorRegistrationHandler)
    • Every slot, the handler loads participating shares for epoch + frequencyEpochs (currently +10).
    • It executes validator-registration duties only if:
      uint64(share.ValidatorIndex) % registrationSlots == uint64(slot) % registrationSlots
  2. Duty Execution (ProcessPreConsensus) → Submission
    • Once pre-consensus quorum is reached, the duty sends a VersionedSignedValidatorRegistration to GoClient.SubmitValidatorRegistration().
  3. Registration Submission (registrationSubmitter)
    • Runs every slot and checks all registrations.
    • Submits:
      • fresh (.new) registrations immediately
      • all registrations once per epoch, distributed deterministically by pubkey hash
    • Batches submissions in groups of 500 to avoid BN overload (see the sketch below).
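
A condensed sketch of the per-slot submission loop in point 3; the batch size of 500 comes from the summary above, while submitForSlot and submitBatch are hypothetical names:

```go
package sketch

import "github.com/attestantio/go-eth2-client/api"

const batchSize = 500 // per the summary above: groups of 500 to avoid BN overload

// submitForSlot sends the registrations selected for this slot in
// fixed-size batches.
func submitForSlot(
	selected []*api.VersionedSignedValidatorRegistration,
	submitBatch func([]*api.VersionedSignedValidatorRegistration) error,
) error {
	for start := 0; start < len(selected); start += batchSize {
		end := start + batchSize
		if end > len(selected) {
			end = len(selected)
		}
		if err := submitBatch(selected[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```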

Known Issues:
Delayed submission for newly added validators or fee recipient updates.
There's a timing edge case where a new validator registration or fee recipient update might not take effect in time for a proposal. Here's how it can happen:

  1. A validator is added to the SSV network, or its fee recipient is updated.
    This creates a new validator registration that needs to be submitted to the Beacon Node.
  2. The duty scheduler attempts to assign a BNRoleValidatorRegistration duty.
    • But duties are only scheduled if:
      validatorIndex % registrationSlots == slot % registrationSlots
    • If the validator's index doesn't match the current slot (especially near the end of the frequencyEpochs window), the duty might be skipped for now.
  3. In the meantime, a block proposal duty is triggered for that validator.
  4. The proposal is built and submitted with the old or fallback fee recipient (e.g., the owner address), since the Beacon Node hasn't received the updated validator registration yet.

another spec-side issue:
ssvlabs/ssv-spec#504
The ValidatorRegistration duty does not carry the fee recipient directly; instead, it dynamically reads it from shared state, which the FeeRecipientUpdate event handler can modify during the pre-consensus process.

This introduces a potential data race, where the fee recipient may change between signing-root calculation and actual signature generation.

If this happens, reconstructed signing roots won't match, causing signature reconstruction to fail and the duty to break; see the mitigation sketch below.
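
For illustration, a sketch of the straightforward mitigation: snapshot the fee recipient once and use the same value for both the signing root and the final message (types and names here are hypothetical, not ssv-spec code):

```go
package sketch

import "sync"

// feeRecipientStore models the shared state that the FeeRecipientUpdate
// event handler mutates concurrently.
type feeRecipientStore struct {
	mu        sync.RWMutex
	recipient [20]byte
}

func (s *feeRecipientStore) Get() [20]byte {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.recipient
}

// buildAndSign snapshots the fee recipient exactly once, so the value used
// for the signing root cannot diverge from the one placed in the message.
func buildAndSign(
	store *feeRecipientStore,
	signingRoot func(recipient [20]byte) [32]byte,
	sign func(root [32]byte) []byte,
) (recipient [20]byte, sig []byte) {
	recipient = store.Get() // read the shared state once
	root := signingRoot(recipient)
	return recipient, sign(root)
}
```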

@olegshmuelov
Contributor

Known Issues: Delayed submission for newly added validators or fee recipient updates. There's a timing edge case where a new validator registration or fee recipient update might not take effect in time for a proposal. (See the four-step scenario in the previous comment.)

If we aim to further improve and harden the validator registration flow, we should consider addressing at least one of the known issues above.
