
Pull in transaction receipts only when necessary #1308

Merged
25 commits merged into master on Oct 28, 2019

Conversation

leoyvens (Collaborator)

When processing blocks considered final, we fetch the triggers only to drop them and re-process them from the transaction receipts. This is wasteful, particularly of disk space, because we cache hundreds of GBs' worth of transaction receipts we don't really need.

This PR makes it so that we only request and store transaction receipts when they are actually necessary, which is when processing blocks that may still be reorged. blocks_with_triggers now returns richer information instead of dropping it; its return type was changed from EthereumBlockPointer to EthereumBlockWithTriggers.

One thing that makes this change tricky is that we may need to re-process a block for dynamic data sources, at which point we again need to know whether the block is final or not, and we need extra information for non-final blocks. This is encoded in the BlockFinality enum, and the trigger re-processing is encapsulated in triggers_in_block. Some code was moved from the block stream to the ethereum adapter so that it can be shared between the block stream and the instance manager.
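
For reference, here is a minimal sketch of the shape of that enum; the payload types are placeholders for illustration, not the actual graph-node definitions.

    // Placeholder payload types, for illustration only.
    struct ThinEthereumBlock;  // header and transactions, no receipts
    struct FullEthereumBlock;  // includes transaction receipts

    enum BlockFinality {
        // Past the reorg threshold: receipts are not stored; if a dynamic
        // data source needs the triggers again, they are re-derived by
        // querying the node via triggers_in_block.
        Final(ThinEthereumBlock),
        // May still be reorged: keep the full block with receipts so it
        // can be re-processed without another fetch.
        NonFinal(FullEthereumBlock),
    }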

Since this touched some of the block stream, I took the opportunity to refactor it a bit. We no longer have special logic for advancing through an empty range; instead we insert a dummy block for the to block with no triggers and use the normal processing logic to advance. The special casing was in some sense an optimization, but it was optimizing the fastest possible case, an empty range, so it was considerable complexity for no real gain. In doing this, transact_block_operations got a bit simpler: it was only ever used to advance one block at a time, when it's perfectly capable of skipping blocks. It also no longer takes a block_ptr_from; it doesn't make sense for that to be anything other than the current block pointer, so we just assume that. Finally, some of the types in the block stream state machine got simplified a bit.

Sorry if this is a lot in one PR! Please ask if some change needs clarification or more comments. I tested this with Moloch and Betoken; both synced fine. I avoided any non-trivial change to the reversion code; any changes there are just formatting and minor refactors.

leoyvens requested a review from a team on October 22, 2019 at 15:30
-if block_ptr_from.number != block_ptr_to.number - 1 {
-    panic!("transact_block_operations must transact a single block only");
-}
+let block_ptr_from = self.block_ptr(subgraph_id.clone())?;
Collaborator

I am all for getting rid of set_block_ptr_with_no_changes, but I don't like that this change causes a query for data the caller already has. We should continue to pass block_ptr_from into transact_block_operations but change the rules for it: if block_ptr_from + 1 < block_ptr_to, mods has to be empty, and we must always have block_ptr_from < block_ptr_to.
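
For concreteness, a small self-contained sketch of the checks being proposed here; the pointer struct is a stand-in for EthereumBlockPointer, and mods_is_empty stands in for inspecting the entity modifications. The thread below discusses whether the empty-mods rule actually applies.

    // Illustrative only: the invariants a caller-supplied block_ptr_from would enable.
    struct BlockPtr {
        number: u64,
    }

    fn check_transact(from: &BlockPtr, to: &BlockPtr, mods_is_empty: bool) {
        // The pointer must always move forward.
        assert!(from.number < to.number, "block pointer must advance");
        // Skipping more than one block would only be valid when there is
        // nothing to apply for the skipped range.
        if from.number + 1 < to.number {
            assert!(mods_is_empty, "cannot skip blocks with pending modifications");
        }
    }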

Collaborator Author

It's an interesting observation that at some earlier point the block stream made this same DB lookup, and we could plumb that through in order to save this call. But if we wanted to keep block_ptr cached in memory for performance, I'd approach that differently, by having the store keep that cache without complicating other code. However, this query should be cheap enough to do in both places; if it's currently slow, that's because we're pulling the entire subgraph only to read the block pointer, which is an issue regardless of this change.

I can assert that block_ptr_from < block_ptr_to, though I don't follow why mods must be empty if block_ptr_from + 1 < block_ptr_to; that's not the case in this PR, is that a problem?

Collaborator Author

I measured the speed of this query for the Moloch subgraph; it never took more than 1 ms, so we're good on that.

Collaborator

What I was getting at is that current callers have block_ptr_from already, so why not continue to pass that in rather than look it up? Yes, it's a very cheap lookup, at least on an empty, uncontested DB, but I'd rather not query at all when it's not necessary.

As for empty mods, I was going by what the current set_block_ptr_with_no_changes does; what is in mods when we are jumping more than one block? I'd like to assert/check something about them; part of the reason for this is #1093 where we never got to the bottom of why we trip over a metadata update, even though that should be impossible, and I'd like to make sure we are not just silently overwriting stuff.

Contributor

The check we had in transact_block_operations caught bugs in my early block explorer data work, so I second the request for keeping the from pointer and performing the sanity checks that @lutter is recommending.

Collaborator Author

The caller doesn't have it. With these simplifications to no longer enforce a step of 1, I don't even know how to keep track of the block pointer in memory. In our current abstraction, the instance manager can't tell whether it was the last one to change the pointer by moving it forward, or whether it was the block stream with a revert; only the DB knows.

This block_ptr_from value is only used for asserts and logs; in fact, it's the only reason we have the weird metadata guard thing. There's more to simplify here, but that can be done as a follow-up. If we agree this PR is mostly an improvement, I'd prefer doing further improvements after merging this.

No matter how many blocks we're skipping, mods contains the modifications for the block being applied. I've added the block_ptr_from.number < block_ptr_to.number assert; I'm happy to add any asserts you think are worthwhile.

Collaborator

I didn't realize that the caller doesn't have an easy way to get at block_ptr_from; in that case, I am fine with the change. I would just really like to figure out what caused #1093.

Contributor

As am I (fine with the changes, that is).

"Found {} relevant block(s)",
descendant_ptrs.len()
logger,
"Found {} trigger(s)",

I think it's useful here to log the number of blocks; don't we log triggers farther down the pipeline?

Collaborator Author

You're right, it seems I got confused here. I've moved this log into triggers_in_block, and it now logs the number of blocks again.

parent_ptr,
Box::new(
self.eth_adapter
.load_block(&ctx.logger, subgraph_ptr.hash)

Curious why we expect this to succeed when we already checked that this block has been uncled; do Ethereum nodes reliably maintain uncles?

Collaborator Author

I didn't change this code; the diff is from formatting churn. This relies on the block ingestor always pulling in the blocks within the reorg threshold.

Contributor

It does. I'm not sure Ethereum nodes keep everything about uncles available, like transaction receipts, but the block headers are stored on chain. This is necessary for e.g. applying and verifying uncle rewards.

 // Yield one block
-Ok(Async::Ready(Some(next_block))) => {
+Some(next_block) => {

I think it's odd to push blocks without triggers to the subgraph instance; it's going to produce metrics observations of 0 for the triggers-per-block histogram, and logs which say something like "0 triggers in block".

At a higher level, I like that the block stream would advance the subgraph pointer in this case, even though at a lower level it adds more paths in the block stream code.

Collaborator Author

I see that a block with zero triggers is still a sort of special case, but it already happens in the current code for non-final blocks, and so far it's a special case that requires less complexity; the cases you mention with logs and metrics are easy to handle. I'm not fundamentally opposed to keeping set_block_ptr_with_no_changes, though I'm personally not a fan and @lutter also found it annoying. I'd be interested to know what @Jannis thinks.

Contributor

I like the idea of skipping irrelevant blocks more efficiently but I think set_block_ptr_with_no_changes unnecessarily triggered the logic that checks if subscriptions need to be updated? I'm all for dumping it if it doesn't have a detrimental impact on indexing performance.


Do we want to give the block stream the property of only emitting relevant blocks (those with triggers) to the subgraph instance? It seems cleaner that way to me. If we don't want that, or don't want it in this PR, we should make sure to filter which blocks get logged and added to metrics on the subgraph instance side.

Contributor

I just remembered another reason we had this: to get more frequent updates on indexing progress for any subgraph while scanning historic blocks. That is an important feature to preserve.

Collaborator Author

I don't mind the 0 triggers log; during syncing, most subgraphs will log that only once every 10k blocks. When synced, it seems relevant to log how many triggers are found on each new block, even if it's 0. If this is not helpful for the metrics, I can filter it out.

I spent some time measuring performance vs. master. I used Moloch because it has few events, so it seems like a good one for measuring the overhead per processing cycle. Sometimes master was faster, sometimes the branch was faster, so any difference seems to be in the noise. As far as I can tell, the block stream is network-bound once the blocks are cached, and this PR didn't change the RPC calls done in that scenario.


Ahh, makes sense re: logs. Let's just filter out the zero-trigger blocks from metrics, then.

Collaborator Author

@olibearo Done, let me know if any other metrics need to check for this.
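
For illustration, a minimal sketch of that kind of guard, assuming a prometheus-style histogram; the metric name and helper function are hypothetical, not the actual graph-node code.

    use prometheus::{Histogram, HistogramOpts};

    // Record the triggers-per-block histogram only for blocks that actually
    // have triggers, so zero-trigger blocks don't flood it with zeros.
    fn observe_triggers(histogram: &Histogram, trigger_count: usize) {
        if trigger_count > 0 {
            histogram.observe(trigger_count as f64);
        }
    }

    fn main() {
        let histogram = Histogram::with_opts(HistogramOpts::new(
            "triggers_per_block",
            "Number of triggers found per processed block",
        ))
        .unwrap();
        observe_triggers(&histogram, 0); // skipped
        observe_triggers(&histogram, 3); // recorded
    }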

Jannis (Contributor) left a comment

I've reviewed a bit more than 2/3 of the PR. I like most changes a lot. The things I still need to think about a bit are ThinEthereumBlock and BlockFinality and how they are used. Will do that tomorrow.

@@ -162,6 +167,7 @@ impl SubgraphInstanceManager {
 pub fn new<B, S, M>(
     logger_factory: &LoggerFactory,
     stores: HashMap<String, Arc<S>>,
+    eth_adapters: HashMap<String, Arc<dyn EthereumAdapter>>,
Contributor

One concern I have here is that we're integrating Ethereum more tightly into subgraph indexing. That's ok for now though.


Comment on lines 1093 to 1115
self.block_hash_by_block_number(&logger, to)
    .map(move |block_hash_opt| EthereumBlockPointer {
        hash: block_hash_opt.unwrap(),
        number: to,
    })
    .and_then(move |block_pointer| {
        stream::unfold(block_pointer, move |descendant_block_pointer| {
            if descendant_block_pointer.number < from {
                return None;
            }
            // Populate the parent block pointer
            Some(
                eth.block_parent_hash(&logger, descendant_block_pointer.hash)
                    .map(move |block_hash_opt| {
                        let parent_block_pointer = EthereumBlockPointer {
                            hash: block_hash_opt.unwrap(),
                            number: descendant_block_pointer.number - 1,
                        };
                        (descendant_block_pointer, parent_block_pointer)
                    }),
            )
        })
        .collect()
Contributor

How large are the block ranges we'd pass in here? While simple and easy to follow, populating the vector one item at a time could be slow; a more performant (but more complicated) approach would be to optimistically resolve all hashes for the (to, from) range in parallel (or in parallel batches) and check them for consistency afterwards.

Collaborator Author

Yes, this is slow and makes the performance of unconditional block triggers suck. I should improve this by implementing your suggestion.

Collaborator Author

I've improved this to fetch the block hashes in parallel, and refactored the code that does regular block fetching so that the two are similar. I also removed the special case for the scan range of block triggers; this helps a subgraph like Betoken be less slow, though the situation is still not ideal because we don't cache the number -> hash association in the database. Consistency checking is not necessary since this range is for final blocks.
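
As a rough illustration of the parallel approach (written in the futures 0.1 style of the snippet above), fetch_hash below is a stand-in for block_hash_by_block_number and the concurrency limit is arbitrary; this is a sketch, not the actual change.

    use futures::{future, stream, Future, Stream};

    // Stand-in for an RPC call that resolves a block number to its hash.
    fn fetch_hash(number: u64) -> impl Future<Item = (u64, [u8; 32]), Error = ()> {
        future::ok((number, [0u8; 32]))
    }

    // Resolve every number in the range concurrently instead of walking
    // parent hashes one block at a time; results come back in block order.
    fn hashes_in_range(from: u64, to: u64) -> impl Future<Item = Vec<(u64, [u8; 32])>, Error = ()> {
        stream::iter_ok::<_, ()>(from..=to)
            .map(fetch_hash)
            .buffered(10) // keep up to 10 lookups in flight
            .collect()
    }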

));
}

match block_filter.contract_addresses.len() {
Contributor

Should we put this one right next to the block_filter.trigger_every_block condition above, perhaps even in an else branch, to make it more obvious that the block filter is only applied once, depending on which type of block filter it is?
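
For illustration, a rough sketch of the mutually exclusive branches being suggested; the struct and field names are simplified stand-ins, not the actual EthereumBlockFilter.

    // Hypothetical, simplified filter type; only meant to show that exactly
    // one kind of block filter applies per scan.
    struct BlockFilterSketch {
        trigger_every_block: bool,
        contract_addresses: Vec<String>,
    }

    fn block_trigger_mode(filter: &BlockFilterSketch) -> &'static str {
        if filter.trigger_every_block {
            "every block in the range is a block trigger"
        } else if !filter.contract_addresses.is_empty() {
            "only blocks calling the listed contracts are block triggers"
        } else {
            "no block triggers"
        }
    }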

Collaborator Author

Good idea, I've refactored this a bit.

}),
)
as Box<dyn Future<Item = _, Error = _> + Send>,
BlockFinality::NonFinal(fat_block) => Box::new(future::ok({
Contributor

Could we rename fat_block to full_block, since that's more in line with e.g. load_full_block etc.

Collaborator Author

Done, thanks

leoyvens (Collaborator Author)

Some comment threads are still ongoing, but I've responded to everything. I sampled the size of an ethereum_blocks row before and after this change, and there was a 10x size reduction, from around 300k to around 30k. Yay!

Jannis (Contributor) left a comment

There's a typo in one of the commit messages:

ethereum: Rename methdos

leoyvens force-pushed the leo/simplify-block-stream-triggers branch from 642f16a to caf89a5 on October 25, 2019 at 15:11
leoyvens (Collaborator Author)

@Jannis rebased and fixed the message.

Jannis (Contributor) left a comment

As far as I am concerned, the changes and post-review updates look good to me. I'd be happy to merge this.


leoyvens merged commit 16abb45 into master on Oct 28, 2019
leoyvens deleted the leo/simplify-block-stream-triggers branch on October 28, 2019 at 13:31