Cleanup validator uptime computation #1270

Merged
merged 10 commits into from
Jan 8, 2021

@mcortesi mcortesi commented Dec 15, 2020

Description

Several people had trouble following the current uptime scoring logic, which is prone to off-by-one errors. This PR does some cleanup to make it more maintainable.

  • Extracts the code into its own package; previously it was spread across 3+ files.
  • Adds tests
  • Adds godocs to functions
  • Renames functions/variables to make the code easier to understand
  • Introduces the type Window, which simplifies reasoning about UP blocks
  • Modifies the monitoringWindow and updateUptime functions so they work with the block whose signatures they intend to monitor, rather than the block whose parent block's signatures we want to monitor.

We can now summarize uptime computation as:

  • monitoringWindow(epoch): Range of blocks in an epoch where UP status will be monitored. Its size is also the maximum number of blocks within an epoch for which a validator can be considered UP.
  • currentLookbackWindow(blockNumber): Current range of blocks we consider when labeling a validator as UP. A validator is considered UP when it has signed at least one block in the currentLookbackWindow.
  • validator.UpBlocks: Total number of blocks within an epoch for which a validator has been labeled as UP.

Then the score is:

uptimeScore = validator.UpBlocks / monitoringWindow(epoch).Size()

Finally, for each block we process, the LastSignedBlock for each validator is updated, and then validator.UpBlocks is increased if: monitoringWindow.contains(blockNumber) && currentLookbackWindow.contains(lastSignedBlock)
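
The update rule above can be sketched in Go as follows. This is a minimal illustration, not the code from the PR: the names Window, UpBlocks and LastSignedBlock come from the description, while the Start/End fields, the lookbackWindowEndingOn and processBlock helpers, the signed slice, and the float64 score are assumptions made for the example.

```go
package main

import "fmt"

// Window is an inclusive block range [Start, End].
// The field names are assumed for this sketch.
type Window struct {
	Start, End uint64
}

// Contains reports whether block n falls inside the window.
func (w Window) Contains(n uint64) bool { return n >= w.Start && n <= w.End }

// Size returns the number of blocks in the window.
func (w Window) Size() uint64 { return w.End - w.Start + 1 }

// UptimeEntry tracks one validator within an epoch.
type UptimeEntry struct {
	UpBlocks        uint64
	LastSignedBlock uint64
}

// lookbackWindowEndingOn is a hypothetical helper: the window of `size`
// blocks that ends at `block`. It assumes block+1 >= size.
func lookbackWindowEndingOn(block, size uint64) Window {
	return Window{Start: block + 1 - size, End: block}
}

// processBlock applies the rule for one block. signed[i] says whether
// validator i signed blockNumber.
func processBlock(entries []UptimeEntry, signed []bool, blockNumber uint64, monitoring Window, lookbackSize uint64) {
	lookback := lookbackWindowEndingOn(blockNumber, lookbackSize)
	for i := range entries {
		if signed[i] {
			entries[i].LastSignedBlock = blockNumber
		}
		if monitoring.Contains(blockNumber) && lookback.Contains(entries[i].LastSignedBlock) {
			entries[i].UpBlocks++
		}
	}
}

// uptimeScore is UpBlocks divided by the monitoring window size.
// A float is used here only for illustration.
func uptimeScore(e UptimeEntry, monitoring Window) float64 {
	return float64(e.UpBlocks) / float64(monitoring.Size())
}

func main() {
	monitoring := Window{Start: 3, End: 9}
	entries := make([]UptimeEntry, 1)
	// The single validator signs blocks 1..5 and then goes offline.
	for block := uint64(1); block <= 10; block++ {
		processBlock(entries, []bool{block <= 5}, block, monitoring, 2)
	}
	fmt.Println(entries[0].UpBlocks, uptimeScore(entries[0], monitoring))
}
```

In this run the validator keeps counting as UP for one block after its last signature (block 6), because block 5 is still inside the two-block lookback window at that point.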

Other changes

  • Improve epoch related functions
  • Add godocs to unrelated functions

Tested

  • Unit tests

Related issues

Backwards compatibility

Yes

@mcortesi mcortesi marked this pull request as draft December 15, 2020 16:37
logger := um.logger.New("func", "Backend.updateValidatorScores", "epoch", epoch)
logger.Trace("Updating validator scores")

// The denominator is the (last block - first block + 1) of the val score tally window
Contributor

Might be clearer if we explicitly say this is the maximum possible uptime score

@mcortesi mcortesi Jan 7, 2021

Agreed that "denominator" doesn't really convey anything.

It's not the max uptime score (since the score is a ratio), but the total number of blocks that we are monitoring.

So, what about:

// totalMonitoredBlocks is the total number of blocks on which we monitor uptime for the epoch
	totalMonitoredBlocks := um.MonitoringWindowSize(epoch)

@oneeman oneeman left a comment

Definite improvement, but I think we can go further while we're at it. I added a bunch of comments; you can decide which of them make sense.

// than counting up to the second to last one.
lastBlockToMonitor := epochLastBlock - 1

return firstBlockToMonitor, lastBlockToMonitor
Contributor

To ask the question you mentioned to me previously, what happens if lastBlockToMonitor < firstBlockToMonitor? (i.e. if lookbackWindowSize > epochSize - 2). Do we have safeguards to prevent that from happening?
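
To make the boundary condition concrete, here is a hedged sketch of one possible safeguard. The arithmetic mirrors the snippet and test cases quoted in this thread (first = epoch first block + lookback size, last = epoch last block - 1), which come from an outdated revision and may differ from the merged code. The function signature, the error values, and the choice to return an error rather than fail fatally are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errEmptyWindow is a hypothetical error for configurations where the
// lookback window leaves no blocks to monitor.
var errEmptyWindow = errors.New("lookback window too big: no blocks to monitor")

// monitoringWindow returns the first and last block to monitor for an
// epoch, assuming epochs are numbered from 1 and epoch 1 starts at block 1.
func monitoringWindow(epoch, epochSize, lookbackWindowSize uint64) (first, last uint64, err error) {
	if epoch == 0 || epochSize == 0 {
		return 0, 0, errors.New("epoch and epochSize must be positive")
	}
	epochFirstBlock := (epoch-1)*epochSize + 1
	epochLastBlock := epoch * epochSize

	first = epochFirstBlock + lookbackWindowSize
	last = epochLastBlock - 1

	// Equivalent to lookbackWindowSize > epochSize - 2.
	if last < first {
		return 0, 0, errEmptyWindow
	}
	return first, last, nil
}

func main() {
	for _, lookback := range []uint64{2, 10} {
		first, last, err := monitoringWindow(1, 10, lookback)
		fmt.Println(first, last, err)
	}
}
```

Whether an invalid combination should surface as an error like this, or be rejected earlier when the configuration is loaded, is the open question raised in the comment above.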

Contributor Author

I'll review this when finishing #1136

{"tally on first epoch", args{1, 10, 2}, 1 + 2, 9},
{"tally on second epoch", args{2, 10, 2}, 11 + 2, 19},
{"lookback window too big", args{1, 10, 10}, 11, 9},
// TODO: Add test cases.
Contributor

Is this TODO intended to be completed before finalizing the PR? If not, I would remove it.

}{
{"tally on first epoch", args{1, 10, 2}, 1 + 2, 9},
{"tally on second epoch", args{2, 10, 2}, 11 + 2, 19},
{"lookback window too big", args{1, 10, 10}, 11, 9},
Contributor

I'm skeptical about including a case with an invalid combination of lookback window size and epoch size. Is this a case that we are supposed to support or that should lead to a fatal error somewhere? If it's the latter, we should test elsewhere that it does lead to a fatal error, and not test it here.

Contributor Author

I'll review this when finishing #1136

@mcortesi mcortesi marked this pull request as ready for review January 7, 2021 16:22
@mcortesi mcortesi force-pushed the mc/uptime-refactor branch from 4bc78f9 to 1923025 Compare January 7, 2021 17:07

@oneeman oneeman left a comment

Much improved. Left some minor comments here and there. Also, TestUptimeStorage is failing due to stray references to ScoreTally. That made me wonder: since the uptime store uses the db, would renaming a field cause problems when it comes to backwards compatibility?

Also, github says there are now merge conflicts (my guess is it's since 1.9.12 is merged to master now) 😬

mcortesi commented Jan 7, 2021

Much improved. Left some minor comments here and there. Also, TestUptimeStorage is failing due to stray references to ScoreTally. That made me wonder: since the uptime store uses the db, would renaming a field cause problems when it comes to backwards compatibility?

Also, github says there are now merge conflicts (my guess is it's since 1.9.12 is merged to master now) 😬

Fixed the stray reference.

About the field renaming: it doesn't cause problems, since RLP (the encoding) is unaware of the names, only the structure of the type.

Here's a test I just wrote to verify this:

func TestUptimeStorageChange(t *testing.T) {
	db := NewMemoryDatabase()
	epoch := uint64(0)

	// Create a test uptime to move around the database and make sure it's really new
	if entry := ReadAccumulatedEpochUptime(db, epoch); entry != nil {
		t.Fatalf("Non existent uptime returned: %v", entry)
	}

	expectedStoredUptime := &uptime.Uptime{
		Entries: []uptime.UptimeEntry{
			{UpBlocks: 0, LastSignedBlock: 1},
			{UpBlocks: 2, LastSignedBlock: 2},
			{UpBlocks: 8, LastSignedBlock: 8},
		},
		LatestBlock: 5,
	}

	type OldUptimeEntry struct {
		// Numbers of blocks validator is considered UP within monitored window
		UpBlocks        uint64
		LastSignedBlock uint64
	}
	type OldUptime struct {
		LatestBlock uint64
		Entries     []OldUptimeEntry
	}

	// Write and verify the uptime in the database
	uptime := &OldUptime{
		Entries: []OldUptimeEntry{
			{UpBlocks: 0, LastSignedBlock: 1},
			{UpBlocks: 2, LastSignedBlock: 2},
			{UpBlocks: 8, LastSignedBlock: 8},
		},
		LatestBlock: 5,
	}

	data, err := rlp.EncodeToBytes(uptime)
	if err != nil {
		t.Fatalf("Failed to RLP encode updated uptime %v", err)
	}
	if err := db.Put(uptimeKey(epoch), data); err != nil {
		t.Fatalf("Failed to store updated uptime: %v", err)
	}

	if entry := ReadAccumulatedEpochUptime(db, epoch); entry == nil {
		t.Fatalf("Stored uptime not found")
	} else if !reflect.DeepEqual(entry, expectedStoredUptime) {
		t.Fatalf("Retrieved uptime mismatch: have %v, want %v", entry, expectedStoredUptime)
	}
	// Delete the uptime and verify the execution
	DeleteAccumulatedEpochUptime(db, epoch)
	if entry := ReadAccumulatedEpochUptime(db, epoch); entry != nil {
		t.Fatalf("Deleted uptime returned: %v", entry)
	}
}
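
The same idea can be illustrated without a database or the rlp package. The sketch below uses a toy positional encoding as a stand-in for RLP: it is not RLP and not code from the PR. It only shows why an encoding that records field order, and not field names, survives a rename. The type names oldEntry and newEntry and the helpers encodeFields and decodeFields are hypothetical, and the helpers handle structs made of uint64 fields only.

```go
package main

import (
	"fmt"
	"reflect"
)

// oldEntry uses the pre-rename field name (ScoreTally).
type oldEntry struct {
	ScoreTally      uint64
	LastSignedBlock uint64
}

// newEntry uses the post-rename field name (UpBlocks).
type newEntry struct {
	UpBlocks        uint64
	LastSignedBlock uint64
}

// encodeFields flattens a struct of uint64 fields in declaration order.
// Field names are never written to the output.
func encodeFields(v interface{}) []uint64 {
	rv := reflect.ValueOf(v)
	out := make([]uint64, 0, rv.NumField())
	for i := 0; i < rv.NumField(); i++ {
		out = append(out, rv.Field(i).Uint())
	}
	return out
}

// decodeFields fills the struct pointed to by ptr positionally.
func decodeFields(data []uint64, ptr interface{}) error {
	rv := reflect.ValueOf(ptr).Elem()
	if rv.NumField() != len(data) {
		return fmt.Errorf("field count mismatch: have %d values, want %d", len(data), rv.NumField())
	}
	for i, x := range data {
		rv.Field(i).SetUint(x)
	}
	return nil
}

func main() {
	data := encodeFields(oldEntry{ScoreTally: 8, LastSignedBlock: 9})

	var e newEntry
	if err := decodeFields(data, &e); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```

The flip side of a positional encoding is that reordering, adding, or removing fields does change the stored layout, so those changes need more care than a rename.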

Instead of one method for first and last block on window, have a method to get the window [first, last] pair.

 - Add godocs to a few utils.go methods
 - Add tests to a few utils.go methods
- Centralize all uptime related logic into istanbul/uptime pkg.
- Use interface for storage to ease testing later
We define a `Window` type; a validator's up blocks are then the count of
all blocks where the block is within the monitoring window and the
validator's lastSignedBlock is within the current lookback window.

Additionally, the updateUptime() function now receives the block it is to
monitor and the corresponding signatures, instead of receiving a block
and the parent block's signatures. This simplifies the math.

MonitoringWindow() now returns the blocks whose signatures we want to
monitor and not the blocks whose parentBlock signatures we want to
monitor.
@mcortesi mcortesi force-pushed the mc/uptime-refactor branch from c41c105 to 42f0da3 Compare January 7, 2021 21:16

mcortesi commented Jan 7, 2021

Rebased to master to fix conflicts

@mcortesi mcortesi requested review from oneeman and prestwich January 7, 2021 21:16

@oneeman oneeman left a comment

LGTM, very nice!

@mcortesi mcortesi merged commit 591e331 into master Jan 8, 2021
@mcortesi mcortesi deleted the mc/uptime-refactor branch January 8, 2021 15:25