Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: upstream batch task may close early and cause broken hash_shuffle_channel #7324

Closed
BowenXiao1999 opened this issue Jan 11, 2023 · 4 comments
Assignees
Labels
type/bug Something isn't working

Comments

@BowenXiao1999
Copy link
Contributor

BowenXiao1999 commented Jan 11, 2023

Describe the bug

Assume task A is the exchange source of task B.

If task A fail due to Expr Error, the exchange sender will be dropped, task A will notify frontend scheduler this error, and scheduler will cancel other tasks. At same time, task B will read from exchange receiver. Because all events are happen asynchrounsly, it's possible that after task A failed, exchange sender dropped, task B get broken hash_shuffle_channel because of dropped sender, before receive any cancel instructions from scheduler. In this case, we may return broken hash_shuffle_channel error instead of Expr Error. And this will catch by sqlsmith.

Solution:

if let Err(e) = t_1
.try_execute(exec, &mut sender, shutdown_rx, &mut state_tx)

Basic idea is try to store the sender in fields of task A to make it live the same as task A. But because of the strict limitation of rust, it's not easy to write an elegant solution.. It may looks ugly and I'm not sure whether we can do that.

cc @liurenjie1024

relate #7218

Log:
https://buildkite.com/risingwavelabs/pull-request/builds/15504#01859f6e-0524-4435-86d6-fef6fc27fc3d

To Reproduce

see the log

Expected behavior

return ExprError

Additional context

No response

@BowenXiao1999 BowenXiao1999 added the type/bug Something isn't working label Jan 11, 2023
@github-actions github-actions bot added this to the release-0.1.16 milestone Jan 11, 2023
@liurenjie1024
Copy link
Contributor

liurenjie1024 commented Jan 12, 2023

I think the sender can be cloned since the underlying channel is mpsc, so we can keep it in task A?

@BowenXiao1999
Copy link
Contributor Author

BowenXiao1999 commented Jan 12, 2023

I think the sender can be cloned since the underlying channel is mpsc, so we can keep it in task A?

But we need the mut self, not sure whether we need Arc<Mutex> (?), but TaskExecuiton do not provide a mut self🤔

Maybe just store the Sender itself (without Arc) in task A, and clone it into a mut one variable when use it, I'm not sure whether this trick is ok...

@liurenjie1024
Copy link
Contributor

Maybe just store the Sender itself (without Arc) in task A, and clone it into a mut one variable when use it, I'm not sure whether this trick is ok...

I prefer this one.

@BowenXiao1999
Copy link
Contributor Author

After #7367, we are expected to solve this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants