Store connection issue prevents subgraph indexing until graph-node is restarted #4190

sduchesneau · 2022-11-21T19:12:34Z

BUG!

Current behavior:

When the PostgreSQL store is unavailable for a little while (e.g. load is too high, or the server crashed or restarted):
The resulting error is "store error: no connection to the server".
This error halts the indexing of that subgraph for 30 minutes before retrying.
When it retries, even if the store is back and other subgraphs are indexing correctly, it fails with this error message:
ERRO Subgraph failed with non-deterministic error: Failed to transact block operations: subgraph writer poisoned by previous error, retry_delay_s: 1800, attempt: 27, sgd: 865, subgraph_id: (...), component: SubgraphInstanceManager
Then it waits another 30 minutes before showing the same error about the writer being poisoned.
If I restart the graph-node, this subgraph restarts indexing correctly.

How to reproduce:

I don't have a safe way to reproduce this on demand, unfortunately, but it happened to me on 3 different subgraphs at the same time (under heavy load)

Expected behavior:

After these long 30 minutes of waiting, the graph-node should at least retry the connection and get back on its feet (instead of complaining about the writer being poisoned).
30 minutes is a long wait for something as trivial as "no connection to the server"; the delay could easily be shorter for this kind of error.
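To illustrate the second point, here is a minimal sketch of a retry delay that depends on the kind of error. It is not graph-node's actual code: the `ErrorKind` enum, the `retry_delay` function, the 1-second base and the 60-second cap are all assumptions made up for this example. Only the 1800-second cap is taken from the `retry_delay_s: 1800` value in the log above.

```rust
use std::time::Duration;

/// Hypothetical error classification; these are not graph-node's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    /// Transient infrastructure trouble such as "no connection to the server".
    StoreUnavailable,
    /// Any other non-deterministic failure.
    Other,
}

/// Exponential backoff with a cap that depends on the error kind.
/// The 60 s cap is an assumed value; 1800 s matches `retry_delay_s: 1800`.
fn retry_delay(kind: ErrorKind, attempt: u32) -> Duration {
    let base = Duration::from_secs(1);
    let cap = match kind {
        ErrorKind::StoreUnavailable => Duration::from_secs(60),
        ErrorKind::Other => Duration::from_secs(1800),
    };
    // 2^attempt, falling back to u32::MAX so large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(cap)
}
```

With these assumed numbers, attempt 27 of a connection error would wait 60 seconds rather than 1800.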


sduchesneau · 2022-11-21T19:12:52Z

@evaporei ^ as we discussed earlier

azf20 · 2022-11-23T08:42:05Z

Thanks @sduchesneau - I think @fordN has also seen this error. Interesting that a restart fixes it, while the regular retry does not. Will take a look.

azf20 · 2022-11-23T09:17:32Z

@sduchesneau is this from a recent build from source?

sduchesneau · 2022-11-28T18:22:43Z

> @sduchesneau is this from a recent build from source?

It was from a docker build of master, graphprotocol/graph-node:013c2c9

azf20 · 2022-11-29T09:57:39Z

thanks @sduchesneau! This seems like it might be related to a failure to reset the subgraph on encountering this error:

2022-11-25T11:16:15.424986882Z stderr F Nov 25 11:16:15.424 ERRO Subgraph writer failed, error: subgraph `QmbaZPFGGifoWzDu4uRGHQmy4N4rg52TyX8BYbu3qTh88m` has already processed block `106046331`; there are most likely two (or more) nodes indexing this subgraph, sgd: 414711, subgraph_id: QmbaZPFGGifoWzDu4uRGHQmy4N4rg52TyX8BYbu3qTh88m, component: SubgraphInstanceManager

(We see this log in every shared example of this problem.)
This means it keeps using a store that has encountered an error, hence the recurring complaint about being poisoned (and thanks @lutter for the help with diagnosis).

balakhonoff · 2023-01-27T12:54:52Z

Hey @sduchesneau and @azf20
Are there any solutions or workarounds?

paymog · 2023-02-21T12:19:22Z

@sduchesneau did you see this while using RPC or Firehose providers? I started seeing this recently after adding a firehose provider and I'm not sure if it's because of the firehose provider (the provider being faulty or some bug in the graph node firehose code path).

paymog · 2023-02-21T12:19:49Z

@balakhonoff have you encountered this issue as well? If so, were you using firehose providers?

SozinM · 2023-03-16T10:45:33Z

@azf20 We did encounter this issue. Also, we can reproduce it reliably in our self-hosted environment.
Also, we could help with debugging using our indexer instances located in dev env.

lutter · 2023-03-16T18:24:38Z

The issue is a mismatch between the subgraph runner and the store: because we write changes asynchronously, the store pretends to the subgraph runner that things have been written that aren't really in the database yet. That's all good when things are working. But if the async writer encounters an error when it is trying to write something, the subgraph runner's view will deviate from reality, and the store refuses to do anything more (that's what the poisoned error says). Note that "database not available" here shouldn't count as an error, but I think there are cases where accesses to the block store (not the subgraph store) don't have the proper retry logic in place.
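A simplified, single-threaded sketch of that mechanism may help. Everything in it (`Db`, `QueuedWriter`, `transact`, `flush`, `StoreError`) is a hypothetical stand-in invented for illustration, not graph-node's real types. It only shows why the writer stays unusable after the database comes back.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum StoreError {
    NoConnection,
    Poisoned,
}

/// Stand-in for the database; `available` simulates an outage.
struct Db {
    available: bool,
    blocks: Vec<u64>,
}

/// Hypothetical writer that queues blocks and writes them later.
struct QueuedWriter {
    queue: VecDeque<u64>,
    poisoned: bool,
}

impl QueuedWriter {
    fn new() -> Self {
        QueuedWriter { queue: VecDeque::new(), poisoned: false }
    }

    /// Accepts a block and reports success before anything reaches the database.
    fn transact(&mut self, block: u64) -> Result<(), StoreError> {
        if self.poisoned {
            return Err(StoreError::Poisoned);
        }
        self.queue.push_back(block);
        Ok(())
    }

    /// Background step: drain the queue into the database.
    /// A failed write poisons the writer, because the runner already
    /// believes the queued blocks were written.
    fn flush(&mut self, db: &mut Db) -> Result<(), StoreError> {
        while let Some(block) = self.queue.front().copied() {
            if !db.available {
                self.poisoned = true;
                return Err(StoreError::NoConnection);
            }
            db.blocks.push(block);
            self.queue.pop_front();
        }
        Ok(())
    }
}
```

In this model, once `flush` has failed, every later `transact` call returns `Poisoned` regardless of whether `db.available` is true again, which mirrors the repeated "subgraph writer poisoned by previous error" message.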

What needs to happen at that point is that the subgraph runner tells the store explicitly "I got rid of all in-memory assumptions about what has been written" and reinitializes the store. That can either happen by calling SubgraphStore.writable, or we could add a method to the Writable to reinitialize itself.

The retry loop in the subgraph runner doesn't take that into account and continues with a poisoned store. The easiest fix for this might be to add a method WritableStore.reinitialize() that does that and call it when restarting the subgraph. The new WritableStore.reinitialize() should be a noop for a non-poisoned store and return a new clean WritableStore for a poisoned store.
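As a rough sketch of that contract (noop when healthy, fresh store when poisoned): the `WritableStore` struct and its fields below are hypothetical placeholders, and the real method in graph-node may look quite different.

```rust
/// Hypothetical stand-in for graph-node's writable store.
#[derive(Debug, Default, PartialEq)]
struct WritableStore {
    poisoned: bool,
    /// Blocks accepted from the runner but not yet written to the database.
    pending: Vec<u64>,
}

impl WritableStore {
    /// No-op for a healthy store. A poisoned store is replaced by a clean one,
    /// which drops the queued writes that never reached the database.
    fn reinitialize(self) -> WritableStore {
        if self.poisoned {
            WritableStore::default()
        } else {
            self
        }
    }
}
```

The retry loop would call `reinitialize()` before each restart, and the runner would then have to reload its own state from what is actually in the database.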

azf20 · 2023-04-12T09:55:35Z

Thanks @lutter! Is this fixed in #4533?

lutter · 2023-04-13T15:41:48Z

> Thanks @lutter! Is this fixed in #4533?

Yes, that should fix these recurring errors

azf20 assigned lutter Apr 12, 2023

acdibble mentioned this issue Apr 13, 2023

Staked node not displayed on the dashboard chainflip-io/support#2

Closed

azf20 closed this as completed May 2, 2023
