
Pull in transaction receipts only when necessary #1308

Merged
25 commits merged into master on Oct 28, 2019

Conversation

leoyvens (Collaborator)

When processing blocks considered final, we fetch the triggers only to drop them and re-process them from the transaction receipts. This is wasteful, particularly of disk space, because we cache hundreds of GBs' worth of transaction receipts we don't really need.

This PR makes it so that we only request and store transaction receipts when they are actually necessary, which is when processing blocks that may still be reorged. blocks_with_triggers now returns richer information instead of dropping it; its return type was changed from EthereumBlockPointer to EthereumBlockWithTriggers.

One thing that makes this change tricky is that we may need to re-process a block for dynamic data sources, at which point we again need to know whether the block is final or not, and we need extra information for non-final blocks. This is encoded in the BlockFinality enum, and the trigger re-processing is encapsulated in triggers_in_block. Some code was moved from the block stream to the ethereum adapter so that it can be shared between the block stream and the instance manager.
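
For reference, here is a minimal sketch of the shape of that enum; the payload types are placeholders for illustration, not the actual graph-node definitions.

    // Placeholder payload types, for illustration only.
    struct ThinEthereumBlock;  // header and transactions, no receipts
    struct FullEthereumBlock;  // includes transaction receipts

    enum BlockFinality {
        // Past the reorg threshold: receipts are not stored; if a dynamic
        // data source needs the triggers again, they are re-derived by
        // querying the node via triggers_in_block.
        Final(ThinEthereumBlock),
        // May still be reorged: keep the full block with receipts so it
        // can be re-processed without another fetch.
        NonFinal(FullEthereumBlock),
    }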

Since this touched some of the block stream, I took the opportunity to refactor it a bit. We no longer have special logic for advancing through an empty range; instead we insert a dummy block for the to block with no triggers and use the normal processing logic to advance. The special casing was in some sense an optimization, but it was optimizing the fastest possible case, an empty range, so it was considerable complexity for no real gain. In doing this, transact_block_operations got a bit simpler: it was only ever used to advance one block at a time, when it's perfectly capable of skipping blocks. It also no longer takes a block_ptr_from; it doesn't make sense for that to be anything other than the current block pointer, so we just assume that. Finally, some of the types in the block stream state machine got simplified a bit.

Sorry if this is a lot in one PR! Please ask if some change needs clarification or more comments. I tested this with Moloch and Betoken; both synced fine. I avoided any non-trivial change to the reversion code; any changes there are just formatting and minor refactors.

leoyvens requested a review from a team on October 22, 2019 at 15:30
-if block_ptr_from.number != block_ptr_to.number - 1 {
-    panic!("transact_block_operations must transact a single block only");
-}
+let block_ptr_from = self.block_ptr(subgraph_id.clone())?;
Collaborator

I am all for getting rid of set_block_ptr_with_no_changes, but I don't like that this change causes a query for data the caller already has. We should continue to pass block_ptr_from into transact_block_operations but change the rules for it: if block_ptr_from + 1 < block_ptr_to, mods has to be empty, and we must always have block_ptr_from < block_ptr_to.
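
For concreteness, a small self-contained sketch of the checks being proposed here; the pointer struct is a stand-in for EthereumBlockPointer, and mods_is_empty stands in for inspecting the entity modifications. The thread below discusses whether the empty-mods rule actually applies.

    // Illustrative only: the invariants a caller-supplied block_ptr_from would enable.
    struct BlockPtr {
        number: u64,
    }

    fn check_transact(from: &BlockPtr, to: &BlockPtr, mods_is_empty: bool) {
        // The pointer must always move forward.
        assert!(from.number < to.number, "block pointer must advance");
        // Skipping more than one block would only be valid when there is
        // nothing to apply for the skipped range.
        if from.number + 1 < to.number {
            assert!(mods_is_empty, "cannot skip blocks with pending modifications");
        }
    }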

Collaborator Author

It's an interesting observation that at some earlier point the block stream made this same DB lookup, and we could plumb that through in order to save this call. But if we wanted to keep block_ptr cached in memory for performance, I'd approach that differently, by having the store keep that cache without complicating other code. However, this query should be cheap enough to do in both places; if it's currently slow, that's because we're pulling the entire subgraph only to read the block pointer, which is an issue regardless of this change.

I can assert that block_ptr_from < block_ptr_to, though I don't follow why mods must be empty if block_ptr_from + 1 < block_ptr_to; that's not the case in this PR, is that a problem?

Collaborator Author

I measured the speed of this query for the Moloch subgraph; it never took more than 1 ms, so we're good on that.

Collaborator

What I was getting at is that current callers have block_ptr_from already, so why not continue to pass that in rather than look it up? Yes, it's a very cheap lookup, at least on an empty, uncontested DB, but I'd rather not query at all when it's not necessary.

As for empty mods, I was going by what the current set_block_ptr_with_no_changes does; what is in mods when we are jumping more than one block? I'd like to assert/check something about them; part of the reason for this is #1093 where we never got to the bottom of why we trip over a metadata update, even though that should be impossible, and I'd like to make sure we are not just silently overwriting stuff.

Contributor

The check we had in transact_block_operations caught bugs in my early block explorer data work, so I second the request for keeping the from pointer and performing the sanity checks that @lutter is recommending.

Collaborator Author

The caller doesn't have it. With these simplifications to no longer enforce a step of 1, I don't even know how to keep track of the block pointer in memory. In our current abstraction, the instance manager can't tell whether it was the last one to change the pointer by moving it forward, or whether it was the block stream with a revert; only the DB knows.

This block_ptr_from value is only used for asserts and logs; in fact, it's the only reason we have the weird metadata guard thing. There's more to simplify here, but that can be done as a follow-up. If we agree this PR is mostly an improvement, I'd prefer doing further improvements after merging this.

No matter how many blocks we're skipping, mods contains the modifications for the block being applied. I've added the block_ptr_from.number < block_ptr_to.number assert; I'm happy to add any asserts you think are worthwhile.

Collaborator

I didn't realize that the caller doesn't have an easy way to get at block_ptr_from; in that case, I am fine with the change. I would just really like to figure out what caused #1093.

Contributor

As am I (fine with the changes, that is).

"Found {} relevant block(s)",
descendant_ptrs.len()
logger,
"Found {} trigger(s)",

I think it's useful here to log the number of blocks; don't we log triggers farther down the pipeline?

Collaborator Author

You're right, it seems I got confused here. I've moved this log into triggers_in_block, and it now logs the number of blocks again.

parent_ptr,
Box::new(
self.eth_adapter
.load_block(&ctx.logger, subgraph_ptr.hash)

Curious why we expect this to succeed when we already checked that this block has been uncled; do Ethereum nodes reliably maintain uncles?

Collaborator Author

I didn't change this code; the diff is from formatting churn. This relies on the block ingestor always pulling in the blocks within the reorg threshold.

Contributor

It does. I'm not sure Ethereum nodes keep everything about uncles available, like transaction receipts, but the block headers are stored on chain. This is necessary for e.g. applying and verifying uncle rewards.

 // Yield one block
-Ok(Async::Ready(Some(next_block))) => {
+Some(next_block) => {

I think it's odd to push blocks without triggers to the subgraph instance; it's going to produce metrics observations of 0 for the triggers-per-block histogram, and logs which say something like "0 triggers in block".

At a higher level, I like that the block stream would advance the subgraph pointer in this case, even though at a lower level it adds more paths in the block stream code.

Collaborator Author

I see that a block with zero triggers is still a sort of special case, but it already happens in the current code for non-final blocks, and so far it's a special case that requires less complexity; the cases you mention with logs and metrics are easy to handle. I'm not fundamentally opposed to keeping set_block_ptr_with_no_changes, though I'm personally not a fan and @lutter also found it annoying. I'd be interested to know what @Jannis thinks.

Contributor

I like the idea of skipping irrelevant blocks more efficiently but I think set_block_ptr_with_no_changes unnecessarily triggered the logic that checks if subscriptions need to be updated? I'm all for dumping it if it doesn't have a detrimental impact on indexing performance.


Do we want to give the block stream the property of only emitting relevant blocks (those with triggers) to the subgraph instance? It seems cleaner that way to me. If we don't want that, or don't want it in this PR, we should make sure to filter which blocks get logged and added to metrics on the subgraph instance side.

Contributor

I just remembered another reason we had this: to get more frequent updates on indexing progress for any subgraph while scanning historic blocks. That is an important feature to preserve.

Collaborator Author

I don't mind the 0 triggers log; during syncing, most subgraphs will log that only once every 10k blocks. When synced, it seems relevant to log how many triggers are found on each new block, even if it's 0. If this is not helpful for the metrics, I can filter it out.

I spent some time measuring performance vs. master. I used Moloch because it has few events, so it seems like a good one for measuring the overhead per processing cycle. Sometimes master was faster, sometimes the branch was faster, so any difference seems to be in the noise. As far as I can tell, the block stream is network-bound once the blocks are cached, and this PR didn't change the RPC calls done in that scenario.


Ahh, makes sense re: logs. Let's just filter out the zero-trigger blocks from metrics, then.

Collaborator Author

@olibearo Done, let me know if any other metrics need to check for this.
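
For illustration, a minimal sketch of that kind of guard, assuming a prometheus-style histogram; the metric name and helper function are hypothetical, not the actual graph-node code.

    use prometheus::{Histogram, HistogramOpts};

    // Record the triggers-per-block histogram only for blocks that actually
    // have triggers, so zero-trigger blocks don't flood it with zeros.
    fn observe_triggers(histogram: &Histogram, trigger_count: usize) {
        if trigger_count > 0 {
            histogram.observe(trigger_count as f64);
        }
    }

    fn main() {
        let histogram = Histogram::with_opts(HistogramOpts::new(
            "triggers_per_block",
            "Number of triggers found per processed block",
        ))
        .unwrap();
        observe_triggers(&histogram, 0); // skipped
        observe_triggers(&histogram, 3); // recorded
    }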

Jannis (Contributor) left a comment

I've reviewed a bit more than 2/3 of the PR. I like most changes a lot. The things I still need to think about a bit are ThinEthereumBlock and BlockFinality and how they are used. Will do that tomorrow.

@@ -162,6 +167,7 @@ impl SubgraphInstanceManager {
 pub fn new<B, S, M>(
     logger_factory: &LoggerFactory,
     stores: HashMap<String, Arc<S>>,
+    eth_adapters: HashMap<String, Arc<dyn EthereumAdapter>>,
Contributor

One concern I have here is that we're integrating Ethereum more tightly into subgraph indexing. That's ok for now though.


Comment on lines 1093 to 1115
self.block_hash_by_block_number(&logger, to)
    .map(move |block_hash_opt| EthereumBlockPointer {
        hash: block_hash_opt.unwrap(),
        number: to,
    })
    .and_then(move |block_pointer| {
        stream::unfold(block_pointer, move |descendant_block_pointer| {
            if descendant_block_pointer.number < from {
                return None;
            }
            // Populate the parent block pointer
            Some(
                eth.block_parent_hash(&logger, descendant_block_pointer.hash)
                    .map(move |block_hash_opt| {
                        let parent_block_pointer = EthereumBlockPointer {
                            hash: block_hash_opt.unwrap(),
                            number: descendant_block_pointer.number - 1,
                        };
                        (descendant_block_pointer, parent_block_pointer)
                    }),
            )
        })
        .collect()
Contributor

How large are the block ranges we'd pass in here? While simple and easy to follow, populating the vector one item at a time could be slow; a more performant (but more complicated) approach would be to optimistically resolve all hashes for the (to, from) range in parallel (or in parallel batches) and check them for consistency afterwards.

Collaborator Author

Yes, this is slow and makes the performance of unconditional block triggers suck. I should improve this by implementing your suggestion.

Collaborator Author

I've improved this to fetch the block hashes in parallel, and refactored the code that does regular block fetching so that the two are similar. I also removed the special case for the scan range of block triggers; this helps a subgraph like Betoken be less slow, though the situation is still not ideal because we don't cache the number -> hash association in the database. Consistency checking is not necessary since this range is for final blocks.
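
As a rough illustration of the parallel approach (written in the futures 0.1 style of the snippet above), fetch_hash below is a stand-in for block_hash_by_block_number and the concurrency limit is arbitrary; this is a sketch, not the actual change.

    use futures::{future, stream, Future, Stream};

    // Stand-in for an RPC call that resolves a block number to its hash.
    fn fetch_hash(number: u64) -> impl Future<Item = (u64, [u8; 32]), Error = ()> {
        future::ok((number, [0u8; 32]))
    }

    // Resolve every number in the range concurrently instead of walking
    // parent hashes one block at a time; results come back in block order.
    fn hashes_in_range(from: u64, to: u64) -> impl Future<Item = Vec<(u64, [u8; 32])>, Error = ()> {
        stream::iter_ok::<_, ()>(from..=to)
            .map(fetch_hash)
            .buffered(10) // keep up to 10 lookups in flight
            .collect()
    }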

));
}

match block_filter.contract_addresses.len() {
Contributor

Should we put this one right next to the block_filter.trigger_every_block condition above, perhaps even in an else branch, to make it more obvious that the block filter is only applied once, depending on which type of block filter it is?
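
For illustration, a rough sketch of the mutually exclusive branches being suggested; the struct and field names are simplified stand-ins, not the actual EthereumBlockFilter.

    // Hypothetical, simplified filter type; only meant to show that exactly
    // one kind of block filter applies per scan.
    struct BlockFilterSketch {
        trigger_every_block: bool,
        contract_addresses: Vec<String>,
    }

    fn block_trigger_mode(filter: &BlockFilterSketch) -> &'static str {
        if filter.trigger_every_block {
            "every block in the range is a block trigger"
        } else if !filter.contract_addresses.is_empty() {
            "only blocks calling the listed contracts are block triggers"
        } else {
            "no block triggers"
        }
    }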

Collaborator Author

Good idea, I've refactored this a bit.

}),
)
as Box<dyn Future<Item = _, Error = _> + Send>,
BlockFinality::NonFinal(fat_block) => Box::new(future::ok({
Contributor

Could we rename fat_block to full_block, since that's more in line with e.g. load_full_block etc.

Collaborator Author

Done, thanks

leoyvens (Collaborator Author)

Some comment threads are still ongoing, but I've responded to everything. I sampled the size of an ethereum_blocks row before and after this change, and there was a 10x size reduction, from around 300k to around 30k. Yay!

Jannis (Contributor) left a comment

There's a typo in one of the commit messages:

ethereum: Rename methdos

leoyvens force-pushed the leo/simplify-block-stream-triggers branch from 642f16a to caf89a5 on October 25, 2019 at 15:11
leoyvens (Collaborator Author)

@Jannis rebased and fixed the message.

Jannis (Contributor) left a comment

As far as I am concerned, the changes and post-review updates look good to me. I'd be happy to merge this.


leoyvens merged commit 16abb45 into master on Oct 28, 2019
leoyvens deleted the leo/simplify-block-stream-triggers branch on October 28, 2019 at 13:31