
DELETE FROM ... [returning nothing] crashes a node #17921

Closed
AnyCPU opened this issue Aug 25, 2017 · 15 comments · Fixed by #22991 or #23258

AnyCPU commented Aug 25, 2017

A cluster of 3 nodes is running CockroachDB 1.0.5 (Linux, 64-bit).
The database has about 195 million records.
The schema is:

CREATE TABLE stats (
	numa STRING(32) NULL,
	bytes INT NULL,
	amount DECIMAL NULL,
	ts TIMESTAMP NULL,
	da INT NULL,
	INDEX idx_stats_ts (ts ASC) STORING (numa, bytes, amount, da),
	FAMILY "primary" (numa, bytes, amount, ts, da, rowid)
)

The database holds one month of data.
I want to delete data starting from the second day of the month, so
I run either delete from stats where ts > '2017-01-01 23:59:59'::timestamp; or delete from stats where ts > '2017-01-01 23:59:59'::timestamp returning nothing;.
I use the cockroach sql --insecure --host=... command to run my query.
After a few minutes the node being used dies.
The DB driver returns "bad connection",
and there are a lot of messages like context canceled while in command queue: ResolveIntent... in the log.

All nodes have free space.

Do I have to delete data in chunks?

Thanks.


AnyCPU commented Aug 25, 2017

I cannot even remove half a day of data.


AnyCPU commented Aug 25, 2017

I have tried to remove one hour of data, but I got:
pq: kv/txn_coord_sender.go:428: transaction is too large to commit: 478692 intents


AnyCPU commented Aug 25, 2017

It is possible to remove data in ten-minute chunks :-(
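For reference, a minimal sketch of that chunked-deletion workaround is below, using Go's database/sql with the lib/pq driver (the same driver the pq: error above comes from) against the stats schema from the original report. The connection string, database name, date range, and window size are illustrative assumptions, not values taken from this thread.

```go
// chunked_delete.go: delete one month of data in small time windows so no
// single transaction accumulates too many intents (the workaround that
// worked above). Assumes an insecure local cluster and a database named
// "mydb"; adjust the connection string, range, and window for real use.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Date(2017, 1, 2, 0, 0, 0, 0, time.UTC) // keep the first day
	end := time.Date(2017, 2, 1, 0, 0, 0, 0, time.UTC)
	window := 10 * time.Minute // the chunk size reported to work above

	for cur := start; cur.Before(end); cur = cur.Add(window) {
		// Each statement runs in its own implicit transaction, so the number
		// of intents per transaction is bounded by the rows in one window.
		res, err := db.Exec(
			`DELETE FROM stats WHERE ts >= $1 AND ts < $2`,
			cur, cur.Add(window),
		)
		if err != nil {
			log.Fatalf("delete for window starting at %s failed: %v", cur, err)
		}
		n, _ := res.RowsAffected()
		log.Printf("deleted %d rows in [%s, %s)", n, cur, cur.Add(window))
	}
}
```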


tbg commented Aug 25, 2017

Hey @AnyCPU, you're running into the fact that, as of the time of writing, support for large writing transactions in CockroachDB is not ideal. You've pretty much experienced all the problems: operations never succeeding, nodes running into memory trouble (though that one's somewhat unexpected), or failing with the "too large to commit" error. While deleting in small chunks is currently the best option, rest assured that we're working on improving this (see the tracking issue below). We have workarounds in place for the case in which you want to drop or truncate the whole table (which we can do more efficiently by essentially swapping out the table for a new one), but that clearly isn't going to help you here.

Tracking issue: #15849


AnyCPU commented Aug 25, 2017

Thank you @tschottdorf


tbg commented Feb 22, 2018

Despite many improvements, this is still an issue in master at the time of writing, though we now have a test that reproduces this reliably: #22876


tbg commented Feb 22, 2018

@spencerkimball let's move the discussion from #22876 here. I agree that the easiest way to diagnose this is to get heap profiles.

@petermattis if I were to introduce such a facility into roachtest, how do you think I should do it? I think it would be reasonable to write a little standalone program (or bash script for starters) that does little except periodically store heap dumps to artifacts on all relevant nodes (so that we can just run it in tests that want it, like a load gen). That way, it won't get in the way of perf tests. To get fancy, we could let the program assert on the heap dump, though I'd like to punt on that.
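A rough sketch of what such a standalone helper could look like is below: a small Go program that periodically fetches a heap profile from each node's debug endpoint and stores it under an artifacts directory. The node addresses, interval, and output layout are assumptions for illustration; it relies on the /debug/pprof/heap endpoint served on the admin HTTP port, and (as the next comment points out) cockroach itself can already write periodic heap profiles via COCKROACH_MEMPROF_INTERVAL.

```go
// heapwatch.go: periodically collect heap profiles from a set of nodes and
// store them as artifacts. Illustrative sketch only; addresses, interval,
// and directory layout are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Assumption: admin UI / HTTP addresses of the nodes under test.
	nodes := []string{"node1:8080", "node2:8080", "node3:8080"}
	const outDir = "artifacts/heapprof"

	for range time.Tick(time.Minute) {
		stamp := time.Now().Format("20060102T150405")
		for _, addr := range nodes {
			if err := fetchHeap(addr, filepath.Join(outDir, addr), stamp); err != nil {
				log.Printf("node %s: %v", addr, err)
			}
		}
	}
}

// fetchHeap downloads one heap profile from addr and writes it to
// dir/heap.<stamp>.pprof.
func fetchHeap(addr, dir, stamp string) error {
	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/heap", addr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	f, err := os.Create(filepath.Join(dir, "heap."+stamp+".pprof"))
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```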

@petermattis (Collaborator)

For heap profiles, you can already do `roachprod start -e COCKROACH_MEMPROF_INTERVAL=1m ...`. Did you mean something different by "heap dumps"?

@petermattis (Collaborator)

Also, when diagnosing, it is easiest to create a cluster using `roachprod create` and then run the test using `roachtest run -c <existing-cluster>`.


tbg commented Feb 22, 2018

> For heap profiles, you can already do `roachprod start -e COCKROACH_MEMPROF_INTERVAL=1m ...`. Did you mean something different by "heap dumps"?

No, that's what I mean. Forgot about the env var, thanks!

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018
In the absence of a fast path deletion, `DELETE` would generate one
potentially giant batch and OOM the gateway node. This became obvious
quickly via heap profiling.

Added chunking of the deletions to `tableDeleter`. SQL folks may have
stronger opinions on how to achieve this, or a better idea of a
preexisting chunking mechanism that works more reliably. If nothing
else, this change serves as a prototype to fix cockroachdb#17921.

With this change, `roachtest run drop` works (as in, it doesn't
out-of-memory right away; the run takes a long time so I can't yet
confirm that it actually passes).

Release note (sql change): deleting many rows at once now consumes less
memory.
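To make the chunking idea concrete, here is a simplified, hypothetical illustration of the pattern the commit message describes: flush the accumulated deletion batch every fixed number of rows instead of building one giant batch in memory. The names below (batch, deleteRows, flush) are stand-ins, not the actual tableDeleter code.

```go
// Illustrative only: bound the size of each deletion batch so the gateway
// never has to hold every pending delete in memory at once.
package main

import "fmt"

const chunkSize = 3 // tiny for the demo; a real value would be much larger

// batch is a stand-in for a KV write batch accumulating deletions.
type batch struct{ dels []string }

func (b *batch) del(key string) { b.dels = append(b.dels, key) }

// deleteRows deletes the given keys, flushing one bounded-size batch at a time.
func deleteRows(keys []string, flush func(*batch) error) error {
	b := &batch{}
	for _, k := range keys {
		b.del(k)
		if len(b.dels) >= chunkSize {
			if err := flush(b); err != nil { // hand this chunk to the KV layer
				return err
			}
			b = &batch{} // start fresh so memory stays bounded
		}
	}
	if len(b.dels) > 0 {
		return flush(b) // final partial chunk
	}
	return nil
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g"}
	_ = deleteRows(keys, func(b *batch) error {
		fmt.Println("flushing", b.dels)
		return nil
	})
}
```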

tbg commented Feb 23, 2018

With #22991, this seems to be making steady progress:

[screenshot: deletion progress graph]

Remains to be seen whether it manages to commit.


tbg commented Feb 23, 2018

Ok, as expected something did go wrong. I think after approximately 10 minutes we run into the timestamp cache and catch a retry and stagnate from then on:

[screenshot: progress stalls after the retry]

/debug/requests shows little of use; the SQL trace is huge since this line is extremely chatty and, well, because we're deleting ten million rows.

I think what you'd want here is for the refresh machinery to realize that nothing has changed, so that the restart can be hidden. Something is clearly going wrong, but it's not exactly clear to me what.

@spencerkimball, would you mind taking a look? This is really easy to run, just

roachprod create -n 9 spencer-test
roachtest run -c spencer-test drop

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018
tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018

tbg commented Feb 25, 2018

Reopening as there are still problems with such deletions. They should either fail gracefully or succeed, not hang indefinitely.

tbg reopened this Feb 25, 2018

tbg commented Mar 3, 2018

Reopening to verify fix.

tbg reopened this Mar 3, 2018

tbg commented Mar 3, 2018

Ah, already done: #23258 (comment)

tbg closed this as completed Mar 3, 2018