-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importer: AddSSTable sent after import cancellation #91418
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-disaster-recovery
Comments
cc @cockroachdb/disaster-recovery |
stevendanna
added a commit
to stevendanna/cockroach
that referenced
this issue
Dec 16, 2022
This ensures that goroutines started in the importer are shut down when the processor is shut down. While the previous comment indicates there is no need to wait on the goroutine, that isn't true in the case of cancellation in which the existing coordination is insufficient. Informs cockroachdb#91418 Release note: None
craig bot
pushed a commit
that referenced
this issue
Dec 16, 2022
93782: importer: use ctx in progress push, use ctxgroup r=adityamaru a=stevendanna This ensures that goroutines started in the importer are shut down when the processor is shut down. While the previous comment indicates there is no need to wait on the goroutine, that isn't true in the case of cancellation in which the existing coordination is insufficient. Informs #91418 Release note: None Co-authored-by: Steven Danna <[email protected]>
stevendanna
added a commit
to stevendanna/cockroach
that referenced
this issue
Apr 3, 2023
This ensures that goroutines started in the importer are shut down when the processor is shut down. While the previous comment indicates there is no need to wait on the goroutine, that isn't true in the case of cancellation in which the existing coordination is insufficient. Informs cockroachdb#91418 Release note: None
stevendanna
added a commit
to stevendanna/cockroach
that referenced
this issue
Apr 4, 2023
This ensures that goroutines started in the importer are shut down when the processor is shut down. While the previous comment indicates there is no need to wait on the goroutine, that isn't true in the case of cancellation in which the existing coordination is insufficient. Informs cockroachdb#91418 Release note: None Release justification: Low risk fix to avoid stray goroutines from running after IMPORT cancellation.
stevendanna
added a commit
to stevendanna/cockroach
that referenced
this issue
Apr 4, 2023
This ensures that goroutines started in the importer are shut down when the processor is shut down. While the previous comment indicates there is no need to wait on the goroutine, that isn't true in the case of cancellation in which the existing coordination is insufficient. Informs cockroachdb#91418 Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-disaster-recovery
Describe the problem
In we've observed AddSSTable requests that were applied after an
IMPORT
job'sOnFailOrCancel
callback attempted to clear the data added by the import.As a result, the table contains data from the cancelled
IMPORT.
We believe this happens because the import processor on the remote node is still running. While the import processor will eventually see a context cancellation, occasionally that context cancellation isn't seen until after another node has already adopted the
IMPORT
job and run theOnFailOrCancel
hook.We've identified a few possible causes of this:
We have at least 1 go routine that we do not wait for
cockroach/pkg/sql/importer/import_processor.go
Lines 179 to 185 in 8a36b94
Since jobs: Clear out claim info when pausing #89014
OnFailOrCancel
is eligible for execution on another node immediately after the Resumer's context has been cancelled. We've observed via logs OnFailOrCancel running before the Resumer has exited.Even with (1) and (2) fixed, we've observed that
dsp.Run
cockroach/pkg/sql/importer/import_processor_planning.go
Line 271 in 8a36b94
*** Possible Solutions ***
The following are possible solutions we've discussed in the past for this problem:
Review our distSQL flow and ensure we are using the correct contexts and implementing the correct callbacks to ensure as orderly a shutdown as possible.
Add new code in the job coordinator to explicitly wait for an affirmative shutdown from all processors. This would certainly help on the happy path, but it wouldn't cover all cases since the node responsible for doing the waiting may
Add a safety timeout before issuing any DeleteRange requests.
Periodically broadcast a timestamp to all processors that the processors will use for writing AddSSTables (rather than allowing them to use the batch timestamps). The node responsible for cancellation would then know the last timestamp at which nodes were possibly writing.
For
IMPORT INTO
on empty tables, as a special case, we could write into a different index and then swap over to that index on success.A new KV feature "span admin lock" which would lock a span for admin operations. Any AddSSTable requests that arrived with the wrong or an old lock token would be rejected.
To Reproduce
The failure can be seen in the unit test found here: #91407 when run under stress for a few minutes.
Jira issue: CRDB-21252
The text was updated successfully, but these errors were encountered: