Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix performance regression due to increased hash allocations #55214

Merged
merged 2 commits into from
Oct 7, 2020

Conversation

arulajmani
Copy link
Collaborator

We recently increased allocations in the appStats structure when
adding support for transaction level statistics in #52704. This was
because the interactions with the fnv library were expensive in terms
of allocations. This patch aims to claw back the regression by:

  • Using our own implementation of the FNV algorithm instead of the
    library, which is significantly lighter weight (microbenchmarks below).
  • Re-organizes the code to only construct the statement IDs (deemed the
    expensive operation) if required.

When comparing the difference between the commit that introduced the
regression and the changes proposed by this diff, I got the following
improvements on the KV workload:

name             old ops/s   new ops/s   delta
kv95-throughput  34.5k ± 6%  35.7k ± 4%  +3.42%  (p=0.023 n=10+10)

Microbenchmarks for the new hashing algorithm (written/run by @knz):

name                     old time/op    new time/op    delta
ConstructStatementID-32     405ns ±17%      39ns ±12%  -90.34%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
ConstructStatementID-32      120B ± 0%       16B ± 0%  -86.67%  (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
ConstructStatementID-32      6.00 ± 0%      1.00 ± 0%  -83.33%  (p=0.008 n=5+5)

Closes #54515

Release note: None

@arulajmani arulajmani requested review from knz, solongordon, nvanbenschoten and a team October 4, 2020 01:06
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)


pkg/roachpb/app_stats.go, line 43 at r1 (raw file):

	}
	b := util.FNV64ToByteArray(hash)
	return StmtID(*(*string)(unsafe.Pointer(&b)))

you don't know that the byte array is not shared at this point. This cast is truly unsafe and should not be kept there.


pkg/sql/conn_executor_exec.go, line 1461 at r1 (raw file):

	commitLat := ex.phaseTimes.getCommitLatency()

	key := txnKey(fmt.Sprintf("%x", util.FNV64ToByteArray(ex.extraTxnState.transactionStatementsHash)))

that is wayyyy too expensive. Just never use printf to compute keys or things on the hot path.


pkg/util/hash.go, line 42 at r1 (raw file):

}

func FNV64ToByteArray(s0 uint64) []byte {

as per the previous discussion I'd encourage you to try make the stmt key an uint64 directly instead of a string

barring that, make this function return a string directly, otherwise you get weird allocation semantics with the unsafe cast.

@arulajmani arulajmani force-pushed the fix-txn-stats-regression branch from 7d1401f to f7bc4c2 Compare October 5, 2020 18:23
Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched the statement keys/txn keys to be uint64s instead of strings. RFAL.

CC-ing @dhartunian on this as this will probably affect the adminUI too. We recently discovered a significant performance regression because of increased string allocations. As a fix for this, I've updated the proto so that statement ID is now a uint64 instead of a string.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)


pkg/roachpb/app_stats.go, line 43 at r1 (raw file):

Previously, knz (kena) wrote…

you don't know that the byte array is not shared at this point. This cast is truly unsafe and should not be kept there.

Not applicable anymore as I've made StmtID a uint64 instead of a string.


pkg/sql/conn_executor_exec.go, line 1461 at r1 (raw file):

Previously, knz (kena) wrote…

that is wayyyy too expensive. Just never use printf to compute keys or things on the hot path.

Same comment as above.


pkg/util/hash.go, line 42 at r1 (raw file):

Previously, knz (kena) wrote…

as per the previous discussion I'd encourage you to try make the stmt key an uint64 directly instead of a string

barring that, make this function return a string directly, otherwise you get weird allocation semantics with the unsafe cast.

Switched to uint64.

@arulajmani arulajmani requested a review from knz October 5, 2020 18:34
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)


pkg/sql/crdb_internal.go, line 956 at r2 (raw file):

				stmtIDsDatum := tree.NewDArray(types.String)
				for _, stmtID := range s.statementIDs {
					if err := stmtIDsDatum.Append(tree.NewDString(strconv.FormatUint(uint64(stmtID), 10))); err != nil {

Maybe add a TODO here to migrate the schema of the vtable to use an integer type.

(Not in this change though)

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r1, 13 of 13 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @solongordon)


pkg/server/serverpb/status.proto, line 893 at r2 (raw file):

  message CollectedStatementStatistics {
    ExtendedStatementStatisticsKey key = 1 [(gogoproto.nullable) = false];
   uint64 id = 3 [(gogoproto.customname) = "ID",

nit: is the spacing off here?


pkg/sql/app_stats.go, line 247 at r2 (raw file):

	// is an expensive operation).
	s := a.getStatsForStmtWithKey(key, invalidStmtID, false /* createIfNonexistent */)
	if s == nil && createIfNonexistent {

What fraction of the time do we expect to hit this branch? Would it make sense to move constructStatementIDFromStmtKey under the lock and the if !ok && createIfNonexistent { branch to avoid locking twice?


pkg/sql/app_stats.go, line 257 at r2 (raw file):

	key stmtKey, ID roachpb.StmtID, createIfNonexistent bool,
) *stmtStats {
	a.Lock()

Can these functions acquire a read lock if !createIfNonexistent?


pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):

	// by the internal executor, in which case the hash is uninitialized (=0), and
	// can therefore be safely ignored.
	if ex.extraTxnState.transactionStatementsHash != 0 {

Should we be comparing against invalidStmtID instead of 0 in a few spots?


pkg/util/hash.go, line 32 at r2 (raw file):

const fnvPrime = 1099511628211

// FNV64Init initializes a new hash function.

Very cool!

You might consider adding some type safety here to prevent misuse. Something like:

type FNV64 struct {
    sum uint64
}

func NewFnv64() FNV64 {
    return FNV64{sum: fnvBase}
}

func (f *FNV64) Add(c uint64) {
    f.sum *= fnvPrime
	f.sum ^= c
}

func (f *FNV64) Sum() uint64 {
    return f.sum
}

Just an idea though, feel free to ignore.

@arulajmani arulajmani requested a review from a team October 6, 2020 18:37
@arulajmani arulajmani force-pushed the fix-txn-stats-regression branch from 9276903 to 87bb3c9 Compare October 6, 2020 21:17
Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)


pkg/server/serverpb/status.proto, line 893 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: is the spacing off here?

Seems so, fixed.


pkg/sql/app_stats.go, line 247 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What fraction of the time do we expect to hit this branch? Would it make sense to move constructStatementIDFromStmtKey under the lock and the if !ok && createIfNonexistent { branch to avoid locking twice?

Done!


pkg/sql/app_stats.go, line 257 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can these functions acquire a read lock if !createIfNonexistent?

I think they should, though that would be a bigger change as if I switch appStats to have a RWLock, all other places that lock app stats should be updated as well. Would you be okay if I tackled that in a follow-up PR instead of this change?


pkg/sql/crdb_internal.go, line 956 at r2 (raw file):

Previously, knz (kena) wrote…

Maybe add a TODO here to migrate the schema of the vtable to use an integer type.

(Not in this change though)

Wouldn't that cause potential overflows as these IDs are unsigned integer?


pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we be comparing against invalidStmtID instead of 0 in a few spots?

Not applicable anymore because of the FNV64 struct switch below.


pkg/util/hash.go, line 32 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Very cool!

You might consider adding some type safety here to prevent misuse. Something like:

type FNV64 struct {
    sum uint64
}

func NewFnv64() FNV64 {
    return FNV64{sum: fnvBase}
}

func (f *FNV64) Add(c uint64) {
    f.sum *= fnvPrime
	f.sum ^= c
}

func (f *FNV64) Sum() uint64 {
    return f.sum
}

Just an idea though, feel free to ignore.

Much cleaner, done.

@arulajmani arulajmani force-pushed the fix-txn-stats-regression branch 3 times, most recently from f61b130 to c942919 Compare October 6, 2020 21:59
Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found an issue with our existing code when I was moving statement ID creation inside getStatsForStmtWithKey. GetStatsForStmt and everything calling above it need to know the statement ID regardless of wether a stats object entry is created or not. This is because:

  • Statement stats collection could be turned off, but we still might be collecting transaction statistics.
  • A statement may not meet the latency threshold to be recorded, but the statement ID is still required to disambiguate transactions that comprise that statement.

The error was that even though we were creating the statement ID in getStatsForStmt, we never really returned it. This meant that if createIfNonExistent was false (case 2, from above), we would be returning a nil stats struct, and run into a nil exception further up the stack. I've fixed this oversight from before as part of my change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with nits, but also would be interested in nathan's final words

Reviewed 6 of 8 files at r3, 2 of 5 files at r5, 3 of 3 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)


pkg/sql/app_stats.go, line 257 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I think they should, though that would be a bigger change as if I switch appStats to have a RWLock, all other places that lock app stats should be updated as well. Would you be okay if I tackled that in a follow-up PR instead of this change?

Sure, but maybe add a TODO here
(Also I'd recommend you file an issue if you're not already working on the followup PR - I'd expect that there will be other high-priority work for you to work on next, and it'd be nice not to lose track of this)


pkg/sql/crdb_internal.go, line 956 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Wouldn't that cause potential overflows as these IDs are unsigned integer?

uint64 -> int64 is lossless, large positive values simply become negative


pkg/util/hash.go, line 32 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Much cleaner, done.

Pardon my ignorance Nathan, but doesn't the pointer receiver on the method cause the compiler to place the struct on the heap, and force an allocation that we were working hard to eliminate?


pkg/util/hash.go, line 38 at r3 (raw file):

// NewFNV64 initializes a new FNV64 hash state.
func NewFNV64() FNV64 {

a constructor that returns a value would be called "Make", not "New"

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 8 files at r3, 3 of 5 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @knz, and @solongordon)


pkg/server/serverpb/status.proto, line 893 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Seems so, though

I don't think this was fixed. It appears that there are only 3 spaces before the uint64.


pkg/sql/app_stats.go, line 247 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Done!

I'm not sure that this was done. We're still calling getStatsForStmtWithKey twice. It's not clear to me why we need to do that. Are we afraid of creating the stmtId under the lock because it's expensive? If so then why are we ok with allocating the stmtStats struct?

Or maybe this has to do with the fact that you want to return a stmtID even if if !ok && !createIfNonexistent {. If so, there are a few places around here that could use comments to describe what's going on. For instance, recordStatement should mention that it can never return an invalid statement ID. So should getStatsForStmt, even though it can return a nil stmtStats object.


pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Not applicable anymore because of the FNV64 struct switch below.

Do we still need invalidStmtID in that case?


pkg/util/hash.go, line 32 at r2 (raw file):

Previously, knz (kena) wrote…

Pardon my ignorance Nathan, but doesn't the pointer receiver on the method cause the compiler to place the struct on the heap, and force an allocation that we were working hard to eliminate?

The methods should all be inlined because they all have trivial complexity, so it shouldn't. And even if they aren't inlined, escape analysis should determine that they don't let the receiver escape. But it's worth checking. You can use the goescape script from https://wiki.crdb.io/wiki/spaces/CRDB/pages/73072868/Rando+scripts to do so. Run that and see if a call to the constructor results in a "moved to heap" line.

If you don't see anything, try making the following changes (which should cause the struct to escape) to confirm that you would detect a heap allocation if one existed.

type FNV64 struct {
    sum   uint64
    recur *FNV64
}

func (f *FNV64) Add(c uint64) {
    f.sum *= fnvPrime
    f.sum ^= c
    f.recur = f
}

pkg/util/hash.go, line 38 at r3 (raw file):

Previously, knz (kena) wrote…

a constructor that returns a value would be called "Make", not "New"

Yep, @knz is right about this. My mistake, thanks for catching!


pkg/util/hash.go, line 55 at r5 (raw file):

}

// Sum returns the hash value accumulated so now.

"to now"?

arulajmani and others added 2 commits October 7, 2020 11:15
We recently increased allocations in the `appStats` structure when
adding support for transaction level statistics in cockroachdb#52704. This was
because the interactions with the `fnv` library were expensive in terms
of allocations. This patch aims to claw back the regression by:

- Using our own implementation of the FNV algorithm instead of the
library, which is significantly lighter weight (microbenchmarks below).
- Reorganizing the code to only construct a statement ID from scratch
- if required.

When comparing the difference between the commit that introduced the
regression and the changes proposed by this diff, I got the following
improvements on the KV workload:

```
name             old ops/s   new ops/s   delta
kv95-throughput  34.5k ± 6%  35.7k ± 4%  +3.42%  (p=0.023 n=10+10)

```

Microbenchmarks for the new hashing algorithm (written/run by @knz):
```
name                     old time/op    new time/op    delta
ConstructStatementID-32     405ns ±17%      39ns ±12%  -90.34%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
ConstructStatementID-32      120B ± 0%       16B ± 0%  -86.67%  (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
ConstructStatementID-32      6.00 ± 0%      1.00 ± 0%  -83.33%  (p=0.008 n=5+5)
```

Closes cockroachdb#54515

Release note: None
This change upgrades the `admin-ui-components`
dependency. The latest version uses the modified
protobuf type for statements IDs in order to render
the Transactions Page Table.

No user-facing changes to any functionality were
made.

Release note: none
@arulajmani arulajmani force-pushed the fix-txn-stats-regression branch from c942919 to 9757d77 Compare October 7, 2020 15:15
Copy link
Collaborator Author

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all outstanding comments, should be good for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nvanbenschoten, and @solongordon)


pkg/server/serverpb/status.proto, line 893 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think this was fixed. It appears that there are only 3 spaces before the uint64.

Sorry, I misunderstood which spacing you meant earlier. Done now.


pkg/sql/app_stats.go, line 247 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm not sure that this was done. We're still calling getStatsForStmtWithKey twice. It's not clear to me why we need to do that. Are we afraid of creating the stmtId under the lock because it's expensive? If so then why are we ok with allocating the stmtStats struct?

Or maybe this has to do with the fact that you want to return a stmtID even if if !ok && !createIfNonexistent {. If so, there are a few places around here that could use comments to describe what's going on. For instance, recordStatement should mention that it can never return an invalid statement ID. So should getStatsForStmt, even though it can return a nil stmtStats object.

It's the latter, I revisited this after my initial comment which made the comment above outdated. I'll leave comments around here that paint the picture.


pkg/sql/app_stats.go, line 257 at r2 (raw file):

Previously, knz (kena) wrote…

Sure, but maybe add a TODO here
(Also I'd recommend you file an issue if you're not already working on the followup PR - I'd expect that there will be other high-priority work for you to work on next, and it'd be nice not to lose track of this)

Filed #55285, and added a TODO in the appStats struct.


pkg/sql/crdb_internal.go, line 956 at r2 (raw file):

Previously, knz (kena) wrote…

uint64 -> int64 is lossless, large positive values simply become negative

Yes, but I'm slightly hesitant as then we'd be displaying different values in the admin UI and crdb_internal.node_transaction_statistics. Also, negative numbers for an ID field seem slightly weird from a UX perspective.

I filed #55284 to capture the discussion, and I can make the change in the followup if my UX concerns don't stand. Added a TODO referencing the issue as well.


pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we still need invalidStmtID in that case?

Yeah, one other place still uses it.


pkg/util/hash.go, line 32 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The methods should all be inlined because they all have trivial complexity, so it shouldn't. And even if they aren't inlined, escape analysis should determine that they don't let the receiver escape. But it's worth checking. You can use the goescape script from https://wiki.crdb.io/wiki/spaces/CRDB/pages/73072868/Rando+scripts to do so. Run that and see if a call to the constructor results in a "moved to heap" line.

If you don't see anything, try making the following changes (which should cause the struct to escape) to confirm that you would detect a heap allocation if one existed.

type FNV64 struct {
    sum   uint64
    recur *FNV64
}

func (f *FNV64) Add(c uint64) {
    f.sum *= fnvPrime
    f.sum ^= c
    f.recur = f
}

There's no moved to heap line so I think we should be good.

The code snippet you posted also doesn't result in a moved to heap line, presumably because it's a self reference? The following does though:

func (f *FNV64) Add(c uint64) {
	f.sum *= fnvPrime
	f.sum ^= c
	a := MakeFNV64()
	f.recur = &a
}

pkg/util/hash.go, line 38 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yep, @knz is right about this. My mistake, thanks for catching!

Done.


pkg/util/hash.go, line 55 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"to now"?

so far/till now = so now 😓. Thanks for catching, fixed

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks for working through this!

Reviewed 10 of 10 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nvanbenschoten, and @solongordon)

@arulajmani
Copy link
Collaborator Author

Thanks for the reviews and help with this issue @nvanbenschoten @knz

bors r=nvanbenschoten, knz

@craig
Copy link
Contributor

craig bot commented Oct 7, 2020

Build succeeded:

@craig craig bot merged commit 04bfabb into cockroachdb:master Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: performance degradation in YCSB (A) between 20.1 and 20.2
5 participants