-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix performance regression due to increased hash allocations #55214
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)
pkg/roachpb/app_stats.go, line 43 at r1 (raw file):
} b := util.FNV64ToByteArray(hash) return StmtID(*(*string)(unsafe.Pointer(&b)))
you don't know that the byte array is not shared at this point. This cast is truly unsafe and should not be kept there.
pkg/sql/conn_executor_exec.go, line 1461 at r1 (raw file):
commitLat := ex.phaseTimes.getCommitLatency() key := txnKey(fmt.Sprintf("%x", util.FNV64ToByteArray(ex.extraTxnState.transactionStatementsHash)))
that is wayyyy too expensive. Just never use printf to compute keys or things on the hot path.
pkg/util/hash.go, line 42 at r1 (raw file):
} func FNV64ToByteArray(s0 uint64) []byte {
as per the previous discussion I'd encourage you to try make the stmt key an uint64 directly instead of a string
barring that, make this function return a string directly, otherwise you get weird allocation semantics with the unsafe cast.
7d1401f
to
f7bc4c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched the statement keys/txn keys to be uint64s instead of strings. RFAL.
CC-ing @dhartunian on this as this will probably affect the adminUI too. We recently discovered a significant performance regression because of increased string allocations. As a fix for this, I've updated the proto so that statement ID is now a uint64 instead of a string.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)
pkg/roachpb/app_stats.go, line 43 at r1 (raw file):
Previously, knz (kena) wrote…
you don't know that the byte array is not shared at this point. This cast is truly unsafe and should not be kept there.
Not applicable anymore as I've made StmtID
a uint64 instead of a string.
pkg/sql/conn_executor_exec.go, line 1461 at r1 (raw file):
Previously, knz (kena) wrote…
that is wayyyy too expensive. Just never use printf to compute keys or things on the hot path.
Same comment as above.
pkg/util/hash.go, line 42 at r1 (raw file):
Previously, knz (kena) wrote…
as per the previous discussion I'd encourage you to try make the stmt key an uint64 directly instead of a string
barring that, make this function return a string directly, otherwise you get weird allocation semantics with the unsafe cast.
Switched to uint64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)
pkg/sql/crdb_internal.go, line 956 at r2 (raw file):
stmtIDsDatum := tree.NewDArray(types.String) for _, stmtID := range s.statementIDs { if err := stmtIDsDatum.Append(tree.NewDString(strconv.FormatUint(uint64(stmtID), 10))); err != nil {
Maybe add a TODO here to migrate the schema of the vtable to use an integer type.
(Not in this change though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r1, 13 of 13 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @solongordon)
pkg/server/serverpb/status.proto, line 893 at r2 (raw file):
message CollectedStatementStatistics { ExtendedStatementStatisticsKey key = 1 [(gogoproto.nullable) = false]; uint64 id = 3 [(gogoproto.customname) = "ID",
nit: is the spacing off here?
pkg/sql/app_stats.go, line 247 at r2 (raw file):
// is an expensive operation). s := a.getStatsForStmtWithKey(key, invalidStmtID, false /* createIfNonexistent */) if s == nil && createIfNonexistent {
What fraction of the time do we expect to hit this branch? Would it make sense to move constructStatementIDFromStmtKey
under the lock and the if !ok && createIfNonexistent {
branch to avoid locking twice?
pkg/sql/app_stats.go, line 257 at r2 (raw file):
key stmtKey, ID roachpb.StmtID, createIfNonexistent bool, ) *stmtStats { a.Lock()
Can these functions acquire a read lock if !createIfNonexistent
?
pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):
// by the internal executor, in which case the hash is uninitialized (=0), and // can therefore be safely ignored. if ex.extraTxnState.transactionStatementsHash != 0 {
Should we be comparing against invalidStmtID
instead of 0
in a few spots?
pkg/util/hash.go, line 32 at r2 (raw file):
const fnvPrime = 1099511628211 // FNV64Init initializes a new hash function.
Very cool!
You might consider adding some type safety here to prevent misuse. Something like:
type FNV64 struct {
sum uint64
}
func NewFnv64() FNV64 {
return FNV64{sum: fnvBase}
}
func (f *FNV64) Add(c uint64) {
f.sum *= fnvPrime
f.sum ^= c
}
func (f *FNV64) Sum() uint64 {
return f.sum
}
Just an idea though, feel free to ignore.
9276903
to
87bb3c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)
pkg/server/serverpb/status.proto, line 893 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: is the spacing off here?
Seems so, fixed.
pkg/sql/app_stats.go, line 247 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What fraction of the time do we expect to hit this branch? Would it make sense to move
constructStatementIDFromStmtKey
under the lock and theif !ok && createIfNonexistent {
branch to avoid locking twice?
Done!
pkg/sql/app_stats.go, line 257 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can these functions acquire a read lock if
!createIfNonexistent
?
I think they should, though that would be a bigger change as if I switch appStats
to have a RWLock, all other places that lock app stats should be updated as well. Would you be okay if I tackled that in a follow-up PR instead of this change?
pkg/sql/crdb_internal.go, line 956 at r2 (raw file):
Previously, knz (kena) wrote…
Maybe add a TODO here to migrate the schema of the vtable to use an integer type.
(Not in this change though)
Wouldn't that cause potential overflows as these IDs are unsigned integer?
pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we be comparing against
invalidStmtID
instead of0
in a few spots?
Not applicable anymore because of the FNV64 struct switch below.
pkg/util/hash.go, line 32 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Very cool!
You might consider adding some type safety here to prevent misuse. Something like:
type FNV64 struct { sum uint64 } func NewFnv64() FNV64 { return FNV64{sum: fnvBase} } func (f *FNV64) Add(c uint64) { f.sum *= fnvPrime f.sum ^= c } func (f *FNV64) Sum() uint64 { return f.sum }Just an idea though, feel free to ignore.
Much cleaner, done.
f61b130
to
c942919
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found an issue with our existing code when I was moving statement ID creation inside getStatsForStmtWithKey
. GetStatsForStmt
and everything calling above it need to know the statement ID regardless of wether a stats object entry is created or not. This is because:
- Statement stats collection could be turned off, but we still might be collecting transaction statistics.
- A statement may not meet the latency threshold to be recorded, but the statement ID is still required to disambiguate transactions that comprise that statement.
The error was that even though we were creating the statement ID in getStatsForStmt
, we never really returned it. This meant that if createIfNonExistent
was false (case 2, from above), we would be returning a nil stats struct, and run into a nil exception further up the stack. I've fixed this oversight from before as part of my change.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @solongordon)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with nits, but also would be interested in nathan's final words
Reviewed 6 of 8 files at r3, 2 of 5 files at r5, 3 of 3 files at r6.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @solongordon)
pkg/sql/app_stats.go, line 257 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I think they should, though that would be a bigger change as if I switch
appStats
to have a RWLock, all other places that lock app stats should be updated as well. Would you be okay if I tackled that in a follow-up PR instead of this change?
Sure, but maybe add a TODO here
(Also I'd recommend you file an issue if you're not already working on the followup PR - I'd expect that there will be other high-priority work for you to work on next, and it'd be nice not to lose track of this)
pkg/sql/crdb_internal.go, line 956 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Wouldn't that cause potential overflows as these IDs are unsigned integer?
uint64 -> int64 is lossless, large positive values simply become negative
pkg/util/hash.go, line 32 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Much cleaner, done.
Pardon my ignorance Nathan, but doesn't the pointer receiver on the method cause the compiler to place the struct on the heap, and force an allocation that we were working hard to eliminate?
pkg/util/hash.go, line 38 at r3 (raw file):
// NewFNV64 initializes a new FNV64 hash state. func NewFNV64() FNV64 {
a constructor that returns a value would be called "Make", not "New"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 8 files at r3, 3 of 5 files at r5.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @knz, and @solongordon)
pkg/server/serverpb/status.proto, line 893 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Seems so, though
I don't think this was fixed. It appears that there are only 3 spaces before the uint64.
pkg/sql/app_stats.go, line 247 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Done!
I'm not sure that this was done. We're still calling getStatsForStmtWithKey
twice. It's not clear to me why we need to do that. Are we afraid of creating the stmtId under the lock because it's expensive? If so then why are we ok with allocating the stmtStats struct?
Or maybe this has to do with the fact that you want to return a stmtID even if if !ok && !createIfNonexistent {
. If so, there are a few places around here that could use comments to describe what's going on. For instance, recordStatement
should mention that it can never return an invalid statement ID. So should getStatsForStmt
, even though it can return a nil stmtStats
object.
pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Not applicable anymore because of the FNV64 struct switch below.
Do we still need invalidStmtID
in that case?
pkg/util/hash.go, line 32 at r2 (raw file):
Previously, knz (kena) wrote…
Pardon my ignorance Nathan, but doesn't the pointer receiver on the method cause the compiler to place the struct on the heap, and force an allocation that we were working hard to eliminate?
The methods should all be inlined because they all have trivial complexity, so it shouldn't. And even if they aren't inlined, escape analysis should determine that they don't let the receiver escape. But it's worth checking. You can use the goescape
script from https://wiki.crdb.io/wiki/spaces/CRDB/pages/73072868/Rando+scripts to do so. Run that and see if a call to the constructor results in a "moved to heap"
line.
If you don't see anything, try making the following changes (which should cause the struct to escape) to confirm that you would detect a heap allocation if one existed.
type FNV64 struct {
sum uint64
recur *FNV64
}
func (f *FNV64) Add(c uint64) {
f.sum *= fnvPrime
f.sum ^= c
f.recur = f
}
pkg/util/hash.go, line 38 at r3 (raw file):
Previously, knz (kena) wrote…
a constructor that returns a value would be called "Make", not "New"
Yep, @knz is right about this. My mistake, thanks for catching!
pkg/util/hash.go, line 55 at r5 (raw file):
} // Sum returns the hash value accumulated so now.
"to now"?
We recently increased allocations in the `appStats` structure when adding support for transaction level statistics in cockroachdb#52704. This was because the interactions with the `fnv` library were expensive in terms of allocations. This patch aims to claw back the regression by: - Using our own implementation of the FNV algorithm instead of the library, which is significantly lighter weight (microbenchmarks below). - Reorganizing the code to only construct a statement ID from scratch - if required. When comparing the difference between the commit that introduced the regression and the changes proposed by this diff, I got the following improvements on the KV workload: ``` name old ops/s new ops/s delta kv95-throughput 34.5k ± 6% 35.7k ± 4% +3.42% (p=0.023 n=10+10) ``` Microbenchmarks for the new hashing algorithm (written/run by @knz): ``` name old time/op new time/op delta ConstructStatementID-32 405ns ±17% 39ns ±12% -90.34% (p=0.008 n=5+5) name old alloc/op new alloc/op delta ConstructStatementID-32 120B ± 0% 16B ± 0% -86.67% (p=0.008 n=5+5) name old allocs/op new allocs/op delta ConstructStatementID-32 6.00 ± 0% 1.00 ± 0% -83.33% (p=0.008 n=5+5) ``` Closes cockroachdb#54515 Release note: None
This change upgrades the `admin-ui-components` dependency. The latest version uses the modified protobuf type for statements IDs in order to render the Transactions Page Table. No user-facing changes to any functionality were made. Release note: none
c942919
to
9757d77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed all outstanding comments, should be good for another look.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nvanbenschoten, and @solongordon)
pkg/server/serverpb/status.proto, line 893 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think this was fixed. It appears that there are only 3 spaces before the uint64.
Sorry, I misunderstood which spacing you meant earlier. Done now.
pkg/sql/app_stats.go, line 247 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm not sure that this was done. We're still calling
getStatsForStmtWithKey
twice. It's not clear to me why we need to do that. Are we afraid of creating the stmtId under the lock because it's expensive? If so then why are we ok with allocating the stmtStats struct?Or maybe this has to do with the fact that you want to return a stmtID even if
if !ok && !createIfNonexistent {
. If so, there are a few places around here that could use comments to describe what's going on. For instance,recordStatement
should mention that it can never return an invalid statement ID. So shouldgetStatsForStmt
, even though it can return a nilstmtStats
object.
It's the latter, I revisited this after my initial comment which made the comment above outdated. I'll leave comments around here that paint the picture.
pkg/sql/app_stats.go, line 257 at r2 (raw file):
Previously, knz (kena) wrote…
Sure, but maybe add a TODO here
(Also I'd recommend you file an issue if you're not already working on the followup PR - I'd expect that there will be other high-priority work for you to work on next, and it'd be nice not to lose track of this)
Filed #55285, and added a TODO in the appStats
struct.
pkg/sql/crdb_internal.go, line 956 at r2 (raw file):
Previously, knz (kena) wrote…
uint64 -> int64 is lossless, large positive values simply become negative
Yes, but I'm slightly hesitant as then we'd be displaying different values in the admin UI and crdb_internal.node_transaction_statistics. Also, negative numbers for an ID field seem slightly weird from a UX perspective.
I filed #55284 to capture the discussion, and I can make the change in the followup if my UX concerns don't stand. Added a TODO referencing the issue as well.
pkg/sql/executor_statement_metrics.go, line 200 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we still need
invalidStmtID
in that case?
Yeah, one other place still uses it.
pkg/util/hash.go, line 32 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The methods should all be inlined because they all have trivial complexity, so it shouldn't. And even if they aren't inlined, escape analysis should determine that they don't let the receiver escape. But it's worth checking. You can use the
goescape
script from https://wiki.crdb.io/wiki/spaces/CRDB/pages/73072868/Rando+scripts to do so. Run that and see if a call to the constructor results in a"moved to heap"
line.If you don't see anything, try making the following changes (which should cause the struct to escape) to confirm that you would detect a heap allocation if one existed.
type FNV64 struct { sum uint64 recur *FNV64 } func (f *FNV64) Add(c uint64) { f.sum *= fnvPrime f.sum ^= c f.recur = f }
There's no moved to heap
line so I think we should be good.
The code snippet you posted also doesn't result in a moved to heap
line, presumably because it's a self reference? The following does though:
func (f *FNV64) Add(c uint64) {
f.sum *= fnvPrime
f.sum ^= c
a := MakeFNV64()
f.recur = &a
}
pkg/util/hash.go, line 38 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yep, @knz is right about this. My mistake, thanks for catching!
Done.
pkg/util/hash.go, line 55 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"to now"?
so far/till now = so now 😓. Thanks for catching, fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working through this!
Reviewed 10 of 10 files at r7.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @nvanbenschoten, and @solongordon)
Thanks for the reviews and help with this issue @nvanbenschoten @knz bors r=nvanbenschoten, knz |
Build succeeded: |
We recently increased allocations in the
appStats
structure whenadding support for transaction level statistics in #52704. This was
because the interactions with the
fnv
library were expensive in termsof allocations. This patch aims to claw back the regression by:
library, which is significantly lighter weight (microbenchmarks below).
expensive operation) if required.
When comparing the difference between the commit that introduced the
regression and the changes proposed by this diff, I got the following
improvements on the KV workload:
Microbenchmarks for the new hashing algorithm (written/run by @knz):
Closes #54515
Release note: None