
Bug: disk_row_container invalid row length 0, expected 2 when running TPC-H query 8 #18600

Closed
richardwu opened this issue Sep 19, 2017 · 5 comments
richardwu commented Sep 19, 2017

When running TPC-H's query 8 with a scale factor of 1 (with default options everywhere else)

./tpch --queries=8

on a 3-node cluster locally with environment variables COCKROACH_PROPOSER_EVALUATED_KV=true COCKROACH_ENTERPRISE_ENABLED=true (and with a CCL license):

./cockroach start --insecure --host=localhost --port=26257 --http-port=8080 --store=cockroach-data/node0 &
sleep 3
./cockroach start --insecure --host=localhost --port=26258 --http-port=8081 --store=cockroach-data/node1 --join=localhost:26257 &
./cockroach start --insecure --host=localhost --port=26259 --http-port=8082 --store=cockroach-data/node2 --join=localhost:26257 &

CockroachDB node starting at 2017-09-19 10:40:32.921068629 -0400 EDT m=+6.900215001 (took 6.9s)
build:      CCL 46fa327e6 @ 2017/09/19 14:40:06 (go1.9)
admin:      http://localhost:8082
sql:        postgresql://root@localhost:26259?application_name=cockroach&sslmode=disable
logs:       /Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node2/logs
store[0]:   path=/Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node2
status:     restarted pre-existing node
clusterID:  b72fae7c-032f-45dc-932d-21e7046cd796
nodeID:     2
CockroachDB node starting at 2017-09-19 10:40:32.999973842 -0400 EDT m=+6.979079430 (took 7.0s)
build:      CCL 46fa327e6 @ 2017/09/19 14:40:06 (go1.9)
admin:      http://localhost:8081
sql:        postgresql://root@localhost:26258?application_name=cockroach&sslmode=disable
logs:       /Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node1/logs
store[0]:   path=/Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node1
status:     restarted pre-existing node
clusterID:  b72fae7c-032f-45dc-932d-21e7046cd796
nodeID:     3
CockroachDB node starting at 2017-09-19 10:40:33.826325061 -0400 EDT m=+10.814508847 (took 10.8s)
build:      CCL 46fa327e6 @ 2017/09/19 14:40:06 (go1.9)
admin:      http://localhost:8080
sql:        postgresql://root@localhost:26257?application_name=cockroach&sslmode=disable
logs:       /Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node0/logs
store[0]:   path=/Users/richardwu/go/src/github.com/cockroachdb/cockroach/cockroach-data/node0
status:     restarted pre-existing node
clusterID:  b72fae7c-032f-45dc-932d-21e7046cd796
nodeID:     1

one of the nodes panics and fails:

F170919 11:39:21.620075 6851945 sql/distsqlrun/disk_row_container.go:121  [client=127.0.0.1:59269,user=root,n1,Sorter] invalid row length 0, expected 2
goroutine 6851945 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x684e500, 0x14e5cd770c83f439, 0x626afbc, 0x24)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:832 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x684e580, 0xc400000004, 0x626afbc, 0x24, 0x79, 0xc42053ae60, 0x4d)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:735 +0x620
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6352d00, 0xc429b32cf0, 0x4, 0x2, 0x5776350, 0x22, 0xc4401b37e0, 0x2, 0x2)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:162 +0x4c4
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6352d00, 0xc429b32cf0, 0x1, 0xc400000004, 0x5776350, 0x22, 0xc4401b37e0, 0x2, 0x2)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:54 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x6352d00, 0xc429b32cf0, 0x5776350, 0x22, 0xc4401b37e0, 0x2, 0x2)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:148 +0x7e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*diskRowContainer).AddRow(0xc439fd4fc0, 0x6352d00, 0xc429b32cf0, 0x0, 0x0, 0x0, 0xc4244049a0, 0x1)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/disk_row_container.go:121 +0x590
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sortAllStrategy).Execute(0xc42f9f5960, 0x6352d00, 0xc429b32cf0, 0xc425532000, 0x0, 0x0)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sorterstrategy.go:99 +0x4be
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*sorter).Run(0xc425532000, 0x6352d00, 0xc429b32cf0, 0xc42168cea0)
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/sorter.go:147 +0x5d6
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Start
        /Users/richardwu/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:394 +0x3d7

The TPC-H tables were loaded via the RESTORE feature from Azure backups as outlined in backup option (1) here https://github.com/cockroachdb/loadgen/tree/master/tpch.

cc: @vivekmenezes

Additional info

./cockroach version

Build Tag:    v1.1-alpha.20170817-984-g3a9fd60f8
Build Time:   2017/09/19 16:40:25
Distribution: CCL
Platform:     darwin amd64
Go Version:   go1.9
C Compiler:   4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)
Build SHA-1:  3a9fd60f88742e47e9c1b6c7d37e0ac362cdfa77
Build Type:   development
@asubiotto

I have successfully reproduced this panic and will look into it. It seems that a nil row is being passed to the disk row container during its creation, which should never happen, since this row is supposed to be the row that caused the OOM error in the in-memory row container.

@asubiotto

The issue is that we get an OOM error from NextRow (when reading the input) and return a nil row. We only expect an OOM error from AddRow (when we add a row to the container). The reason we get an OOM error from NextRow is that an upstream Aggregator hits a memory limit (external storage has not yet been implemented for that processor) and the error is sent over as metadata. The error is interpreted as if the Sorter itself had hit the memory limit, so we fall back to disk and try to add a nil row to the container.

This is a bug and I will fix it, but the reason it hasn't been caught before is that the Aggregator in query 8 did not previously run out of memory. Here is the issue from when I ran query 8 back around 1.0 with distsql: #14295. Note that the query was also running successfully a couple of days ago, so something must have changed the behavior of the Aggregator.

@asubiotto

Correction: it is the HashJoiner, not the Aggregator, that runs out of memory.

The reason we now run out of memory is e12ba3c. The memory limit that we hit is max-sql-memory on the node (shared by multiple processors), not the workmem limit (for a single processor).

In the HashJoiner, we set workmem and limit the buffering of rows such that any memory limit would be hit only in the buffer phase, guaranteeing that we would not go over the memory limit in the following probe phase (whose memory is used to mark rows as seen). However, since max-sql-memory has been decreased and other processors run concurrently on the node, we now hit this memory limit in the probe phase.

Initially, running out of memory in the probe phase was handled, but I reverted to this suggested solution because the code was simpler. I can add the probe-phase handling back in; any objections?
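A minimal sketch of the two-phase interaction described above, assuming a toy shared node budget in place of max-sql-memory (all names are illustrative, not CockroachDB's actual memory-accounting API):

```go
package main

import (
	"errors"
	"fmt"
)

// budget is a toy memory account shared by all processors on a node,
// standing in for --max-sql-memory.
type budget struct{ used, limit int64 }

func (b *budget) grow(n int64) error {
	if b.used+n > b.limit {
		return errors.New("node memory budget exceeded")
	}
	b.used += n
	return nil
}

// hashJoin sketches the two phases: buffering one side's rows, bounded
// by a per-processor workmem, then probing, which still allocates a
// small "seen" marker per buffered row.
func hashJoin(node *budget, workmem, rowSize, numRows int64) error {
	// Buffer phase: stop buffering once workmem is reached, so this
	// phase alone can never exceed workmem.
	buffered := int64(0)
	for i := int64(0); i < numRows && (buffered+1)*rowSize <= workmem; i++ {
		if err := node.grow(rowSize); err != nil {
			return fmt.Errorf("buffer phase: %v", err)
		}
		buffered++
	}
	// Probe phase: marking rows as seen still draws from the shared
	// node budget, so with concurrent processors it can fail here.
	if err := node.grow(buffered); err != nil {
		return fmt.Errorf("probe phase: %v", err) // the case hit by query 8
	}
	return nil
}

func main() {
	// Other concurrent processors already hold most of the node budget,
	// so buffering fits under workmem but the probe markers do not.
	node := &budget{used: 930, limit: 1000}
	fmt.Println(hashJoin(node, 64, 8, 100))
}
```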

cc @cockroachdb/distsql

@knz

knz commented Sep 21, 2017

Alberto has explained to me twice this week that we should architect hash joins differently: we should ensure the code is able to work with arbitrarily little memory. The way to do that is called "segmented hash joins". If you think that will be necessary to durably address our concerns, we could add it to our roadmap.
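For context, a segmented (partitioned) hash join splits both inputs by a hash of the join key so that each partition pair can be joined within a small memory budget; spilled partitions would live on disk and be read back one at a time. A toy in-memory sketch, with integers standing in for rows and join keys:

```go
package main

import "fmt"

// partition splits rows into numParts buckets by a hash of the join key
// (here the row value itself).
func partition(rows []int, numParts int) [][]int {
	parts := make([][]int, numParts)
	for _, r := range rows {
		p := r % numParts
		parts[p] = append(parts[p], r)
	}
	return parts
}

// segmentedJoin joins matching partitions one pair at a time, so peak
// memory is bounded by the largest partition rather than the whole input.
func segmentedJoin(left, right []int, numParts int) [][2]int {
	var out [][2]int
	lp, rp := partition(left, numParts), partition(right, numParts)
	for i := 0; i < numParts; i++ {
		// Build a hash table over just this left partition; on disk,
		// these partitions would be spilled files loaded one at a time.
		seen := make(map[int]bool, len(lp[i]))
		for _, l := range lp[i] {
			seen[l] = true
		}
		for _, r := range rp[i] {
			if seen[r] {
				out = append(out, [2]int{r, r})
			}
		}
	}
	return out
}

func main() {
	fmt.Println(segmentedJoin([]int{1, 2, 3, 4}, []int{2, 4, 6}, 2))
}
```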

@asubiotto

A simpler change for 1.1 would be to preallocate the memory used in the probe phase, so I will go ahead with this.

Segmented hash joins are interesting but some infrastructure work is needed. Let's talk about this at the next meeting.
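The preallocation idea can be sketched as charging the probe phase's per-row marker at buffering time, so that by the time probing starts all of its memory is already reserved; again, all names here are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// account is a stand-in for a shared memory budget.
type account struct{ used, limit int64 }

func (a *account) grow(n int64) error {
	if a.used+n > a.limit {
		return errors.New("memory budget exceeded")
	}
	a.used += n
	return nil
}

// bufferRow charges the row itself plus one marker byte for the probe
// phase up front, so the probe phase never allocates and cannot fail.
func bufferRow(a *account, rowSize int64) error {
	const probeMarkerSize = 1
	return a.grow(rowSize + probeMarkerSize)
}

func main() {
	a := &account{limit: 100}
	for i := 0; i < 20; i++ {
		if err := bufferRow(a, 8); err != nil {
			// Any limit is now hit here, in the buffer phase, where
			// the HashJoiner already knows how to fall back to disk.
			fmt.Println("row", i, ":", err)
			return
		}
	}
}
```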
