
DELETE FROM ... [returning nothing] crashes a node #17921

Closed
AnyCPU opened this issue Aug 25, 2017 · 15 comments · Fixed by #22991 or #23258

AnyCPU commented Aug 25, 2017

A cluster of 3 nodes is running CockroachDB 1.0.5 (Linux, 64-bit).
The database has about 195 million records.
The schema is:

CREATE TABLE stats (
	numa STRING(32) NULL,
	bytes INT NULL,
	amount DECIMAL NULL,
	ts TIMESTAMP NULL,
	da INT NULL,
	INDEX idx_stats_ts (ts ASC) STORING (numa, bytes, amount, da),
	FAMILY "primary" (numa, bytes, amount, ts, da, rowid)
)

The database holds one month of data.
I want to delete data starting from the second day of the month, so
I run either delete from stats where ts > '2017-01-01 23:59:59'::timestamp; or delete from stats where ts > '2017-01-01 23:59:59'::timestamp returning nothing;.
I use the cockroach sql --insecure --host=... command to run my query.
After a few minutes the node being used dies.
The DB driver returns "bad connection",
and there are a lot of messages like context canceled while in command queue: ResolveIntent... in the log.

All nodes have free space.

Do I have to delete data in chunks?

Thanks.


AnyCPU commented Aug 25, 2017

I cannot even remove half a day of data.


AnyCPU commented Aug 25, 2017

I have tried to remove one hour of data, but I got:
pq: kv/txn_coord_sender.go:428: transaction is too large to commit: 478692 intents


AnyCPU commented Aug 25, 2017

It is possible to remove data in ten-minute chunks :-(
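For reference, a minimal sketch of that chunked-deletion workaround is below, using Go's database/sql with the lib/pq driver (the same driver the pq: error above comes from) against the stats schema from the original report. The connection string, database name, date range, and window size are illustrative assumptions, not values taken from this thread.

```go
// chunked_delete.go: delete one month of data in small time windows so no
// single transaction accumulates too many intents (the workaround that
// worked above). Assumes an insecure local cluster and a database named
// "mydb"; adjust the connection string, range, and window for real use.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	start := time.Date(2017, 1, 2, 0, 0, 0, 0, time.UTC) // keep the first day
	end := time.Date(2017, 2, 1, 0, 0, 0, 0, time.UTC)
	window := 10 * time.Minute // the chunk size reported to work above

	for cur := start; cur.Before(end); cur = cur.Add(window) {
		// Each statement runs in its own implicit transaction, so the number
		// of intents per transaction is bounded by the rows in one window.
		res, err := db.Exec(
			`DELETE FROM stats WHERE ts >= $1 AND ts < $2`,
			cur, cur.Add(window),
		)
		if err != nil {
			log.Fatalf("delete for window starting at %s failed: %v", cur, err)
		}
		n, _ := res.RowsAffected()
		log.Printf("deleted %d rows in [%s, %s)", n, cur, cur.Add(window))
	}
}
```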


tbg commented Aug 25, 2017

Hey @AnyCPU, you're running into the fact that, as of the time of writing, support for large writing transactions in CockroachDB is not ideal. You've pretty much experienced all the problems: operations never succeeding, nodes running into memory trouble (though that one's somewhat unexpected), or failing with the "too large to commit" error. While deleting in small chunks is currently the best option, rest assured that we're working on improving this (see the tracking issue below). We have workarounds in place for the case in which you want to drop or truncate the whole table (which we can do more efficiently by essentially swapping out the table for a new one), but that clearly isn't going to help you here.

Tracking issue: #15849


AnyCPU commented Aug 25, 2017

Thank you @tschottdorf


tbg commented Feb 22, 2018

Despite many improvements, this is still an issue in master at the time of writing, though we now have a test that reproduces this reliably: #22876


tbg commented Feb 22, 2018

@spencerkimball let's move the discussion from #22876 here. I agree that the easiest way to diagnose this is to get heap profiles.

@petermattis if I were to introduce such a facility into roachtest, how do you think I should do it? I think it would be reasonable to write a little standalone program (or bash script for starters) that does little except periodically store heap dumps to artifacts on all relevant nodes (so that we can just run it in tests that want it, like a load gen). That way, it won't get in the way of perf tests. To get fancy, we could let the program assert on the heap dump, though I'd like to punt on that.
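A rough sketch of what such a standalone helper could look like is below: a small Go program that periodically fetches a heap profile from each node's debug endpoint and stores it under an artifacts directory. The node addresses, interval, and output layout are assumptions for illustration; it relies on the /debug/pprof/heap endpoint served on the admin HTTP port, and (as the next comment points out) cockroach itself can already write periodic heap profiles via COCKROACH_MEMPROF_INTERVAL.

```go
// heapwatch.go: periodically collect heap profiles from a set of nodes and
// store them as artifacts. Illustrative sketch only; addresses, interval,
// and directory layout are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Assumption: admin UI / HTTP addresses of the nodes under test.
	nodes := []string{"node1:8080", "node2:8080", "node3:8080"}
	const outDir = "artifacts/heapprof"

	for range time.Tick(time.Minute) {
		stamp := time.Now().Format("20060102T150405")
		for _, addr := range nodes {
			if err := fetchHeap(addr, filepath.Join(outDir, addr), stamp); err != nil {
				log.Printf("node %s: %v", addr, err)
			}
		}
	}
}

// fetchHeap downloads one heap profile from addr and writes it to
// dir/heap.<stamp>.pprof.
func fetchHeap(addr, dir, stamp string) error {
	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/heap", addr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	f, err := os.Create(filepath.Join(dir, "heap."+stamp+".pprof"))
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```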

@petermattis (Collaborator)

For heap profiles, you can already do `roachprod start -e COCKROACH_MEMPROF_INTERVAL=1m ...`. Did you mean something different by "heap dumps"?

@petermattis (Collaborator)

Also, when diagnosing, it is easiest to create a cluster using `roachprod create` and then run the test using `roachtest run -c <existing-cluster>`.


tbg commented Feb 22, 2018

> For heap profiles, you can already do `roachprod start -e COCKROACH_MEMPROF_INTERVAL=1m ...`. Did you mean something different by "heap dumps"?

No, that's what I mean. Forgot about the env var, thanks!

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018
In the absence of a fast path deletion, `DELETE` would generate one
potentially giant batch and OOM the gateway node. This became obvious
quickly via heap profiling.

Added chunking of the deletions to `tableDeleter`. SQL folks may have
stronger opinions on how to achieve this, or a better idea of a
preexisting chunking mechanism that works more reliably. If nothing
else, this change serves as a prototype to fix cockroachdb#17921.

With this change, `roachtest run drop` works (as in, it doesn't
out-of-memory right away; the run takes a long time so I can't yet
confirm that it actually passes).

Release note (sql change): deleting many rows at once now consumes less
memory.
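To make the chunking idea concrete, here is a simplified, hypothetical illustration of the pattern the commit message describes: flush the accumulated deletion batch every fixed number of rows instead of building one giant batch in memory. The names below (batch, deleteRows, flush) are stand-ins, not the actual tableDeleter code.

```go
// Illustrative only: bound the size of each deletion batch so the gateway
// never has to hold every pending delete in memory at once.
package main

import "fmt"

const chunkSize = 3 // tiny for the demo; a real value would be much larger

// batch is a stand-in for a KV write batch accumulating deletions.
type batch struct{ dels []string }

func (b *batch) del(key string) { b.dels = append(b.dels, key) }

// deleteRows deletes the given keys, flushing one bounded-size batch at a time.
func deleteRows(keys []string, flush func(*batch) error) error {
	b := &batch{}
	for _, k := range keys {
		b.del(k)
		if len(b.dels) >= chunkSize {
			if err := flush(b); err != nil { // hand this chunk to the KV layer
				return err
			}
			b = &batch{} // start fresh so memory stays bounded
		}
	}
	if len(b.dels) > 0 {
		return flush(b) // final partial chunk
	}
	return nil
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g"}
	_ = deleteRows(keys, func(b *batch) error {
		fmt.Println("flushing", b.dels)
		return nil
	})
}
```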

tbg commented Feb 23, 2018

With #22991, this seems to be making steady progress:

[screenshot: deletion progress graph]

Remains to be seen whether it manages to commit.


tbg commented Feb 23, 2018

Ok, as expected something did go wrong. I think after approximately 10 minutes we run into the timestamp cache and catch a retry and stagnate from then on:

[screenshot: progress stalls after the retry]

/debug/requests shows little of use; the SQL trace is huge since this line is extremely chatty and, well, because we're deleting ten million rows.

I think what you'd want here is for the refresh machinery to realize that nothing has changed, so that the restart can be hidden. Something is clearly going wrong, but it's not exactly clear to me what.

@spencerkimball, would you mind taking a look? This is really easy to run, just

roachprod create -n 9 spencer-test
roachtest run -c spencer-test drop

tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018
tbg added a commit to tbg/cockroach that referenced this issue Feb 23, 2018

tbg commented Feb 25, 2018

Reopening as there are still problems with such deletions. They should either fail gracefully or succeed, not hang indefinitely.

tbg reopened this Feb 25, 2018

tbg commented Mar 3, 2018

Reopening to verify fix.

tbg reopened this Mar 3, 2018

tbg commented Mar 3, 2018

Ah, already done: #23258 (comment)

tbg closed this as completed Mar 3, 2018