-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix raylet bug in driver cleanup #3962
Fix raylet bug in driver cleanup #3962
Conversation
} else { | ||
it++; | ||
} | ||
required_tasks_.erase(creating_task_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be erasing creating_task_id
from required_tasks_
or just erasing one element of required_tasks_[creating_task_id][object_id]
?
E.g., if there are some other tasks that still require object_id
, then creating_task_id
might still be required, right?
I may just be confusing myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the best way would be to erase from required_tasks[creating_task_id]
, but this really should be removing all the objects anyway, so it should be okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining that (or a check checking that that is the case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I missed this. But I did add a check to the end of the function that checks whether all of the tasks were erased from required_tasks_
.
it++; | ||
} | ||
required_tasks_.erase(creating_task_id); | ||
RAY_CHECK(!CheckObjectRequired(object_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will trivially pass because of the line required_tasks_.erase(creating_task_id);
above, but it's not clear to me why we would expect object_id
to not be required by any tasks. Couldn't object_id
still be required by some task that wasn't in task_ids
? Or are we making assumptions about when/how RemoveTasksAndRelatedObjects
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're making the assumption that if RemoveTasksAndRelatedObjects(task_ids)
is called, then it should include all local tasks that depend on tasks in task_ids
. Does that make sense? I can update the header to make this clear.
} | ||
required_tasks_.erase(creating_task_id); | ||
RAY_CHECK(!CheckObjectRequired(object_id)); | ||
HandleRemoteDependencyCanceled(object_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is a no-op given the above check, right? Since the implementation is
void TaskDependencyManager::HandleRemoteDependencyCanceled(const ObjectID &object_id) {
bool required = CheckObjectRequired(object_id);
// If the object is no longer required, then cancel the object.
if (!required) {
auto it = required_objects_.find(object_id);
if (it != required_objects_.end()) {
object_manager_.CancelPull(object_id);
reconstruction_policy_.Cancel(object_id);
required_objects_.erase(it);
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it cancels the object with the object_manager_
, etc.
task_dependencies_.erase(*it); | ||
required_tasks_.erase(*it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not do this anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It happens inside HandleRemoteDependencyCanceled
.
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What do these changes do?
When a driver exits, we clean up all of the task and object dependencies for related tasks on each node. However, there was a bug in the backend that caused us to miss some of the dependencies. Then, the raylet would crash here because a task that had been removed from all of the queues was still tracked by the
TaskDependencyManager
.Related issue number
Not sure if there are any open related issues, but this can cause crashes when a driver exits early.