-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pindexer: error on undefined field #4999
Comments
To aid in debugging, I've uploaded a fresh copy of a raw cometbft event db, in custom postgres format, here: https://artifacts.plinfra.net/penumbra-1/cometbft-dbdump-height-3141978.dump. Plugging that into a local setup will allow rerunning pindexer quickly to 1) reproduce the problem; and 2) try candidate fixes if necessary. |
I was able to reproduce the failure from that dump, although I can't identify where the null field is coming from. Pulled another dump from a different nodes cometbft db, and also encountered a failure at a different block:
which makes me suspect we've got a data integrity issue in the databases. In order to debug we'll need to update the reindexer to support v0.81.x migrations and rebuild from scratch. It's been a long while since we've done that on the dbs in question. |
Prepared a bump for the reindexer in penumbra-zone/reindexer#21, running that code now to get a fresh database. |
After regenerating a cometbft database based on changes in penumbra-zone/reindexer#21, I was able to rerun pindexer against that db, and it completed successfully, with no code changes. Confirmed successful up to at least block 3186459 on penumbra-1. However, once I restarted the node to keep that cometbft db up to date, pindexer quickly crashed with the same error reported above. It appears we've got a regression in psql query to find any null values (thanks, claude)
But comparing between good/bad dbs didn't show anything significant. Perhaps if we revise the query to look for values that are null or an empty string, we might find more hits. Will continue to investigate the db layer now that indexing functionality is restored. |
Ran some tests locally. Given that we know the regression appeared in v1, we can prepare local indexing setups based on both the reusable SQL query to generate statements for searching for nulls
The workflow is:
Doing that against both databases, we have a clear culprit:
all show significant quantities (~75%) of null values. Notably nulls found on known-good db (v0.81.3)
Compare that with: nulls found on known-bad db (v1.0.0)
The raw data is hard to read, but if you My operating intuition is that the |
Thanks for the write-up, and digging into it, this is helpful
I don't understand why the crate move would cause this. At this point, I don't think it did. My sense is that the source of the problem is with the ecosystem crate bump, either we missed something during review, or a semantics change in an upstream, or both. There's an easy way to test this hypothesis: I can do a small pass over the ABCI event handling in the protocol later today, and report back. |
There was a small but important error in translating the ABCI event indexing layer while doing the tendermint ecosystem bump: #5009 that I missed during review. |
Adds some feature-gated integration tests, not yet hooked up to CI, to sanity check that the ABCI events pipeline is working: pd -> cometbft -> postgres -> pindexer The integration tests talk to local postgres databases and make assertions about their contents. There's also a small change to the process-compose orchestration, instructing pindexer to wait for the fullnode services to come up, lest it error out early due to an empty database. The error I observed on an empty db was: Error: error occurred while decoding column 0: unexpected null; try decoding as an `Option` Caused by: unexpected null; try decoding as an `Option` Which is distinct from the error reported in #4999. By waiting a bit longer we can ensure a clean start even on a fresh devnet. Will follow up with subsequent PRs to make sure these tests run in CI.
## Describe your changes We missed this on reviewing #4963 which causes a series of unfortunate events in the data pipeline: #4999 The important diff: ``` -set_index(false) +set_index(true) ``` ## Issue ticket number and link ## Checklist before requesting a review - [x] I have added guiding text to explain how a reviewer should test these changes. I have not tested this change. To test it we can run `pd` and check if events are making it to the comet pg indexer. - [x] If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason: > event indexing layer --------- Co-authored-by: Conor Schaefer <[email protected]>
Resolved via #5009. |
Describe the bug
pindexer errors out on mainnet db data due to a missing field
To Reproduce
Steps to reproduce the behavior:
pindexer
(at least v1.0.0) against mainnet events dbExpected behavior
pindexer ingests block 3139617 and continues without error.
Screenshots
Error message:
Additional context
Looks like the problem is caused by slow queries on the target database. Indeed, the database I'm talking to can be quite slow, due to heavy traffic. However, I don't think that's the source of the problem: I think we've got a syntax error in a query. We can rule this out by checking against a local db and verifying that the crash still happens.
The text was updated successfully, but these errors were encountered: