Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate cycilic increase in Committee fetch time on butteflynet #874

Closed
masih opened this issue Feb 6, 2025 · 6 comments
Closed

Investigate cycilic increase in Committee fetch time on butteflynet #874

masih opened this issue Feb 6, 2025 · 6 comments
Assignees

Comments

@masih
Copy link
Member

masih commented Feb 6, 2025

Committee fetch time cycles almost exactly every 12hrs increasing from mere milliseconds to ~30ms and back down again. Observed on Butterflynet. Understand why.

Image

The pattern matches heap object allocation:

Image

Link to Grafana dashboard.

@github-project-automation github-project-automation bot moved this to Todo in F3 Feb 6, 2025
@masih masih added this to the M2: Mainnet Passive Testing milestone Feb 6, 2025
@masih
Copy link
Member Author

masih commented Feb 6, 2025

allocs from preminer-1:

Image

@masih
Copy link
Member Author

masih commented Feb 6, 2025

Looks like this is related to the default powerTableFrequency the frequency by which power table of an instance is stored, currently set to every 1440 instances (in steady state this is twice a day). The frequency matches the length of cyclic pattern.

This means to get the power table of an instance as part of GetCommittee calls we are traversing half a day worth of finality certs to cumulatively calculate the target power table.

We may need to increase this frequency considering the sensitivity to instance misalignment if GetCommittee is slow and/or arbitrarily slow across a large network. This specific scenario was observed in previous round of passive testing as "Slow GetCommittee" calls, and was partially mitigated by introducing caching layers at consensus input; see #784.

The disadvantage of lower frequency is higher storage footprint of certstore.

@masih masih self-assigned this Feb 6, 2025
@masih masih moved this from Todo to In progress in F3 Feb 6, 2025
@masih
Copy link
Member Author

masih commented Feb 6, 2025

A way to significantly speed up certstore.GetPowerTable (in exchange for a small memory footprint) is to cache the latest known powertable in certStore. This would make calls to GetPowerTable for an instance made by consensus inputs in steady state very quick. This call makes up the vast majority of certstore workload.

@Stebalien Do you recall why we didn't do this to begin with? i.e.: Cache the latest powertable, and fallback to getting range from instance mod frequency to instance if latest is not cached.

@masih
Copy link
Member Author

masih commented Feb 6, 2025

Also unclear why consensus inputs is not using the powertable cache for getting committee. Never mind; that's just the cache of power table CIDs.

@Stebalien
Copy link
Member

Cache the latest powertable, and fallback to getting range from instance mod frequency to instance if latest is not cached.

Because I didn't want to complicate the initial implementation and/or get it wrong. But you're right, we absolutely should cache recent values.

I personally wouldn't store more intermediate values unless we really need to.

@masih
Copy link
Member Author

masih commented Feb 7, 2025

Captured fixing this in a separate issue as investigation is complete.

@masih masih closed this as completed Feb 7, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in F3 Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

2 participants