Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store connection issue prevents subgraph indexing until graph-node is restarted #4190

Closed
sduchesneau opened this issue Nov 21, 2022 · 12 comments
Assignees

Comments

@sduchesneau
Copy link
Contributor

BUG!

Current behavior:

  1. When the postgresql store is unavailable for a little while (ex: load is too high, server crashed or restarted)
  2. The resulting error is store error: no connection to the server
  3. This error halts the indexing of that subgraph for 30 minutes before retrying
  4. When it retries, even if the store is back and other subgraphs are correctly indexing, it leads to this error message:
    ERRO Subgraph failed with non-deterministic error: Failed to transact block operations: subgraph writer poisoned by previous error, retry_delay_s: 1800, attempt: 27, sgd: 865, subgraph_id: (...), component: SubgraphInstanceManager
  5. Then, it waits another 30 minutes before showing the same error about the connection being poisoned.
  6. If I restart the graph-node, this subgraph restarts indexing correctly.

How to reproduce:

I don't have a safe way to reproduce this on demand, unfortunately, but it happened to me on 3 different subgraphs at the same time (under heavy load)

Expected behavior:

  1. after these long 30 minutes of waiting, the graph-node should at least retry the connection and get back on its feet (instead of complaining about the writer being poisoned
  2. 30 minutes is a long wait time for something as trivial as "no connection to the server", it could easily be shorter for this kind of error.
@sduchesneau
Copy link
Contributor Author

@evaporei ^ as we discussed earlier

@azf20
Copy link
Contributor

azf20 commented Nov 23, 2022

thanks @sduchesneau - I think @fordN has also seen this error. Interesting that a restart fixes, while the regular retry does not. Will take a look

@azf20
Copy link
Contributor

azf20 commented Nov 23, 2022

@sduchesneau is this from a recent build from source?

@sduchesneau
Copy link
Contributor Author

@sduchesneau is this from a recent build from source?

It was from a docker build of master, graphprotocol/graph-node:013c2c9

@azf20
Copy link
Contributor

azf20 commented Nov 29, 2022

thanks @sduchesneau! This seems like it might be related to a failure to reset the subgraph on encountering this error:

2022-11-25T11:16:15.424986882Z stderr F Nov 25 11:16:15.424 ERRO Subgraph writer failed, error: subgraph `QmbaZPFGGifoWzDu4uRGHQmy4N4rg52TyX8BYbu3qTh88m` has already processed block `106046331`; there are most likely two (or more) nodes indexing this subgraph, sgd: 414711, subgraph_id: QmbaZPFGGifoWzDu4uRGHQmy4N4rg52TyX8BYbu3qTh88m, component: SubgraphInstanceManager

(we see this log in all of the shared examples where we have seen this)
Which means it keeps using a store that has encountered an error, hence the recurring complaint about being poisoned (and thanks @lutter for the help with diagnosis).

@balakhonoff
Copy link

balakhonoff commented Jan 27, 2023

Hey @sduchesneau and @azf20
Are there any solutions or workarounds?

@paymog
Copy link

paymog commented Feb 21, 2023

@sduchesneau did you see this while using RPC or Firehose providers? I started seeing this recently after adding a firehose provider and I'm not sure if it's because of the firehose provider (the provider being faulty or some bug in the graph node firehose code path).

@paymog
Copy link

paymog commented Feb 21, 2023

@balakhonoff have you encountered this issue as well? If so, were you using firehose providers?

@SozinM
Copy link
Contributor

SozinM commented Mar 16, 2023

@azf20 We did encounter this issue. Also, we can reproduce it reliably in our self-hosted environment.
Also, we could help with debugging using our indexer instances located in dev env.

@lutter
Copy link
Collaborator

lutter commented Mar 16, 2023

The issue is a mismatch between the subgraph runner and the store: because we write changes asynchronously, the store pretends to the subgraph runner that things have been written that aren't really in the database yet. That's all good when things are working. But if the async writer encounters an error when it is trying to write something, the subgraph runner's view will deviate from reality, and the store refuses to do anything more (that's what the poisoned error says) Note that "database not available" here shouldn't count as an error, but I think there are cases where accesses to the block store (not the subgraph store) don't have the proper retry logic in place.

What needs to happen at that point is that the subgraph runner tells the store explicitly "I got rid of all in-memory assumptions about what has been written" and reinitializes the store. That can either happen by calling SubgraphStore.writable here or we could add a method to the Writable to reinitialize itself.

The retry loop in the subgraph runner here doesn't take that into account and continues with a poisoned store. The easiest fix for this might be to add a method WritableStore.reinitialize() that does that and call it when restarting the subgraph here The new WritableStore.reinitialize() should be a noop for a non-poisoned store and return a new clean WritableStore for a poisoned store.

@azf20
Copy link
Contributor

azf20 commented Apr 12, 2023

Thanks @lutter! Is this fixed in #4533?

@lutter
Copy link
Collaborator

lutter commented Apr 13, 2023

Thanks @lutter! Is this fixed in #4533?

Yes, that should fix these recurring errors

@azf20 azf20 closed this as completed May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants