Nimbus VC to Lodestar BN shows attestation errors #6631

Closed
mbonenberger opened this issue Apr 4, 2024 · 9 comments · Fixed by #6668
Labels
meta-bug Issues that identify a bug and require a fix. scope-interop Issues that fix interop issues between Lodestar and CL, EL or tooling.

Comments

@mbonenberger

Describe the bug

We're running a Nimbus VC v24.3.0 in a cluster of several BNs, including a Lodestar BN 1.17.0, and are regularly seeing the following message in the VC logs:

WRN 2024-04-04 20:14:55.043+00:00 Beacon node reports internal error         reason="Unable to decode error response: [Serialization error];500;getAggregatedAttestation(best);internal-issue" node=http://xx.xx.xx.xx:5052[Lodestar/v1.17.0/def26ac] node_index=2 node_roles=AGBSDT
NTC 2024-04-04 20:14:55.064+00:00 Aggregated attestation published           delay=64ms947us130ns service=attestation_service validator=831bb96c@785062 attestation="(aggregation_bits: 0b1111111111111111111111111101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111, data: (slot: 8786472, index: 6, beacon_block_root: \"676aa960\", source: \"274576:b5e91a14\", target: \"274577:5b9344bd\"), signature: \"930cc3dc\")"
NTC 2024-04-04 20:14:55.514+00:00 Beacon node is online                      agent_version=Lodestar/v1.17.0/def26ac node=http://xx.xx.xx.xx:5052[Lodestar/v1.17.0/def26ac] node_index=2 node_roles=AGBSDT
NTC 2024-04-04 20:14:55.560+00:00 Beacon node is compatible                  node=http://xx.xx.xx.xx:5052[Lodestar/v1.17.0/def26ac] node_index=2 node_roles=AGBSDT
NTC 2024-04-04 20:14:55.582+00:00 Beacon node is in sync                     head_slot=8786472 sync_distance=0 is_optimistic=false node=http://xx.xx.xx.xx:5052[Lodestar/v1.17.0/def26ac] node_index=2 node_roles=AGBSDT

On the Lodestar BN it shows the following error:

Apr 04 22:14:55 ethereum-v5-fr-2 lodestar[115637]: Error: No attestation for slot=8786472 dataRoot=0xbcf1127f79d3e25defb09d0108262aa08991fce8aaf788aa3481cb51e49ff925
Apr 04 22:14:55 ethereum-v5-fr-2 lodestar[115637]:     at AttestationPool.getAggregate (file:///usr/local/bin/lodestar/packages/beacon-node/src/chain/opPools/attestationPool.ts:136:13)
Apr 04 22:14:55 ethereum-v5-fr-2 lodestar[115637]:     at Object.getAggregatedAttestation (file:///usr/local/bin/lodestar/packages/beacon-node/src/api/impl/validator/index.ts:1041:47)
Apr 04 22:14:55 ethereum-v5-fr-2 lodestar[115637]:     at processTicksAndRejections (node:internal/process/task_queues:95:5)
Apr 04 22:14:55 ethereum-v5-fr-2 lodestar[115637]:     at Object.handler (file:///usr/local/bin/lodestar/packages/api/src/utils/server/genericJsonServer.ts:45:23)

This suggests that connecting a Nimbus VC to a Lodestar BN creates attestation issues.

Expected behavior

Attestations should work when running Nimbus VC with Lodestar BN. No errors should be logged on either service.

Steps to reproduce

  1. Run Lodestar BN
  2. Attach Nimbus VC with active validators
  3. Observe logs on both services

Additional context

This issue seems to be related to #5553 and #6419

Operating system

Linux

Lodestar version or commit hash

v1.17.0/def26ac

@mbonenberger mbonenberger added the meta-bug Issues that identify a bug and require a fix. label Apr 4, 2024
@nflaig
Member

nflaig commented Apr 4, 2024

Thanks for reporting. Good news is that it is not an attestation issue but rather an issue with producing an aggregated attestation, so it shouldn't really impact your effectiveness or cause missed attestations. And it looks like the aggregate was successfully published anyways, likely due to the fact that you have multiple beacon nodes connected.
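To illustrate why a failure on one beacon node does not have to prevent the aggregate from being published, here is a minimal sketch of a validator client falling back across several nodes. Everything in it (`BeaconNode`, `getAggregateWithFallback`, the synchronous stand-in for the HTTP call) is hypothetical and is not how Nimbus or Lodestar implement this; it only shows the general idea that one node's error becomes a warning while another node serves the request.

```typescript
// Hypothetical sketch: these types and names are illustrative, not Nimbus or Lodestar APIs.
type AggregateRequest = {slot: number; dataRoot: string};

type BeaconNode = {
  name: string;
  // Synchronous stand-in for the HTTP call; throws when the node cannot serve the request.
  getAggregate: (req: AggregateRequest) => string;
};

type FallbackResult = {aggregate: string; servedBy: string; warnings: string[]};

function getAggregateWithFallback(nodes: BeaconNode[], req: AggregateRequest): FallbackResult {
  const warnings: string[] = [];
  for (const node of nodes) {
    try {
      const aggregate = node.getAggregate(req);
      // First node that answers wins; earlier failures are kept as warnings only.
      return {aggregate, servedBy: node.name, warnings};
    } catch (e) {
      const message = e instanceof Error ? e.message : String(e);
      warnings.push(`${node.name}: ${message}`);
    }
  }
  // Only when every node fails does the caller see an error.
  throw new Error(`All beacon nodes failed: ${warnings.join("; ")}`);
}
```

With one failing node and one working node, the call returns the working node's aggregate and records a single warning for the failing one, which matches the pattern of a `WRN` line followed by a successful publish.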

> regularly seeing the following message in the VC logs:

I would need debug logs over a longer period (at least a few epochs) to give you a proper answer on why this is failing. I have run Nimbus VC with Lodestar BN before and did not see those errors, so this might be something introduced in a newer release, or related to the cluster setup.

> This issue seems to be related to #5553 and #6419

The reported issue with Lighthouse VC was actually just noise and fixed on Lighthouse in the end. Might be similar here.

Based on the Beacon Node <> Validator Client compatibility matrix from the EF devops team, there are issues with Nimbus VC <> Lodestar BN. I will check with them if it's related to what you are seeing.

So for now, it would be great if you could provide debug logs from the Lodestar BN to further investigate this. My best guess is that Nimbus VC does not call the produce attestation API on the Lodestar BN but still tries to request an aggregate for the slot / data_root. In that case, Lodestar BN does not have the data cached to serve the request.
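The guess above can be sketched with a toy cache keyed by slot and attestation data root. This is a simplified, hypothetical model (`SimpleAttestationPool` and its methods are made up for illustration) and not Lodestar's actual `AttestationPool`; real aggregation merges signatures, whereas this sketch only stores placeholder strings. The error message format is copied from the Lodestar log in the issue description.

```typescript
// Hypothetical, simplified pool: not Lodestar's AttestationPool implementation.
class SimpleAttestationPool {
  // slot -> dataRoot -> attestations seen for that data (placeholder strings)
  private readonly bySlot = new Map<number, Map<string, string[]>>();

  add(slot: number, dataRoot: string, attestation: string): void {
    let byRoot = this.bySlot.get(slot);
    if (!byRoot) {
      byRoot = new Map<string, string[]>();
      this.bySlot.set(slot, byRoot);
    }
    const list = byRoot.get(dataRoot) ?? [];
    list.push(attestation);
    byRoot.set(dataRoot, list);
  }

  getAggregate(slot: number, dataRoot: string): string[] {
    const list = this.bySlot.get(slot)?.get(dataRoot);
    if (!list) {
      // Nothing was cached under this exact (slot, dataRoot) key.
      throw new Error(`No attestation for slot=${slot} dataRoot=${dataRoot}`);
    }
    return list;
  }
}
```

In this model, a request for a `(slot, dataRoot)` pair that the node never cached fails even if attestations exist for the same slot under a different data root.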

@philknows philknows added the scope-interop Issues that fix interop issues between Lodestar and CL, EL or tooling. label Apr 5, 2024
@mbonenberger
Author

Thank you for your reply. I'll run Nimbus and Lodestar with debug logs and come back.

@nflaig
Member

nflaig commented Apr 5, 2024

> Thank you for your reply. I'll run Nimbus and Lodestar with debug logs and come back.

Lodestar already has debug logs enabled by default, but those are only written to log files, not stdout, unless you have changed --logFileLevel to something else.

You can find the logs in --dataDir, see data retention for details.

@mbonenberger
Author

Please find the logs attached. Hope this helps. Please let me know if you need anything else for debugging.
lodestar-bn.log.tar.gz

@nflaig
Member

nflaig commented Apr 8, 2024

> Please find the logs attached. Hope this helps. Please let me know if you need anything else for debugging. lodestar-bn.log.tar.gz

Thanks for providing the logs. I summarized this issue here #6634 (comment) with an explanation on why it happens. The tldr is that it's not a real issue and just noisy logs which will be improved in our next release #6648. I have also created an issue on the Nimbus side status-im/nimbus-eth2#6184 to potentially improve error handling on their end.

@nflaig
Member

nflaig commented Apr 8, 2024

> lodestar-bn.log.tar.gz

Regarding those logs, I don't see a single "No attestation for slot" in there. Did you change something in your setup, or are those logs from the primary (first) node while the error was observed on a fallback node?

@mbonenberger
Author

mbonenberger commented Apr 8, 2024

> lodestar-bn.log.tar.gz
>
> Regarding those logs, I don't see a single "No attestation for slot" in there. Did you change something in your setup, or are those logs from the primary (first) node while the error was observed on a fallback node?

No, I didn't change anything in the setup and these are the logs from the affected node. I was referring to the type of error that you can find with the following command:

cat beacon-2024-04-05.log.bak|grep "Error: No attestation for slot"
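For anyone who wants more than a raw match list, the same search can be sketched as a small function that extracts the slot and data root from each matching line. This is an illustrative helper (`findFailedAggregates` is a made-up name), and it assumes the log file contains the same error text as the Lodestar excerpt in the issue description; the exact line prefix in the file may differ.

```typescript
// Matches the error text shown in the Lodestar log excerpt in the issue description.
const NO_ATTESTATION_RE = /Error: No attestation for slot=(\d+) dataRoot=(0x[0-9a-fA-F]+)/;

type FailedAggregate = {slot: number; dataRoot: string};

// Hypothetical helper: scans log text line by line and collects failed aggregate requests.
function findFailedAggregates(logText: string): FailedAggregate[] {
  const failures: FailedAggregate[] = [];
  for (const line of logText.split("\n")) {
    const match = NO_ATTESTATION_RE.exec(line);
    if (match) {
      failures.push({slot: Number(match[1]), dataRoot: match[2]});
    }
  }
  return failures;
}
```

The length of the returned array gives the number of failed aggregate requests, and the slots can then be checked against the attestations the validator submitted.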

@nflaig
Member

nflaig commented Apr 8, 2024

> cat beacon-2024-04-05.log.bak|grep "Error: No attestation for slot"

Thanks for double checking that. I was looking at the wrong log file, sorry for that. Turns out it is submitting attestations for the same slot, and the unaggregated attestation is also gossiped in a timely manner. I can't really explain why the aggregate failed, for example for slot 8791364. Your attestation made it onchain at least, likely aggregated by another validator.

While what I explained in #6631 (comment) still applies and would be the expected case where this can happen, what we are seeing on your node is not. I am not sure what the Nimbus VC does differently from ours.

Another problem right now is that we don't have good enough debug logs in our attestation pool (cache) to analyze this based on the logs. What's definitely clear is that producing the aggregate failed most of the time (28 times) and succeeded only once.

I will try to reproduce this in a simpler setup. For now, I am assuming it's an issue with our attestation pool and not something Nimbus does wrong.

While this is not that critical of an issue, please be aware that there is also a reported block production issue using Nimbus VC with Lodestar (#6634), but since you have multiple BNs connected that shouldn't be a problem.

@nflaig
Member

nflaig commented Apr 23, 2024

A fix (#6668) has been included in our latest release, v1.18.0. Feel free to reopen if you can still observe this issue.
