Don't check fail on missing lineage cache entry #3861
Conversation
src/ray/raylet/lineage_cache.cc
Outdated
if (entry) {
  RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
} else {
  RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
It would be nice to add a comment about the conditions under which this entry doesn't exist.
Done. Added a pointer back to the issue.
It's not super clear what is evicting the lineage: perhaps some race condition on a task getting rescheduled, or the task succeeding but the node failing after that.
I strongly think we should not do this. The assertion is failing because there is a bug that we don't understand. The solution is not to get rid of the assertion but rather to fix the bug. If we remove assertions whenever we don't understand what is going on, technical debt will accumulate. You've mentioned that we shouldn't have fatal checks in production, and I'm ok with shipping wheels where DCHECKs just log errors or something like that, but this is turning it off even in the case where we're trying to do development and debugging.
What do we need to do to make RAY_DCHECK work? (See line 27 in 3027dde.)
I modified it to an error log followed by a DCHECK, but I don't know if this DCHECK is enabled in the right environments. Overall, though, I would prefer we move to a more defensively programmed, fault-tolerant implementation of the backend. By fault-tolerant, I mean that we keep running successfully even if certain components hit bugs. A bug in the lineage cache should not take down other parts of Ray. A stronger condition still would be to survive raylet crashes; that way, as long as raylets can remain alive for long enough, the application keeps making progress.
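For concreteness, a minimal sketch of the error-log-followed-by-a-DCHECK pattern under discussion (the lineage_ member and GetEntry call are assumptions modeled on the diff snippets in this thread; whether RAY_DCHECK actually fires depends on whether NDEBUG is defined in the build, which is the open question here):

auto entry = lineage_.GetEntry(task_id);  // lineage_ / GetEntry: assumed names
if (!entry) {
  // Always log, in every build type.
  RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
  // Fatal only in builds where DCHECKs are compiled in (NDEBUG not set);
  // in release builds this is a no-op and the caller degrades gracefully.
  RAY_DCHECK(entry);
}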
What's the resolution here? Is DCHECK the way to go?
Ah, ignore the above comment; I read the code wrong. The method you changed here is supposed to return a lineage with the requested task in it, but it won't always do that now. You'll have to modify the place where this gets called in the raylet so that the caller adds the correct task to the lineage.
src/ray/raylet/node_manager.cc
Outdated
auto entry = uncommitted_lineage.GetEntryMutable(task_id);
int num_forwards = -1;
if (entry) {
  Task &lineage_cache_entry_task = entry->TaskDataMutable();
You have to actually add the task to the uncommitted lineage if it is not already there. The receiving node manager expects the forwarded task to be in the lineage here.
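A hedged sketch of that fix in the forwarding caller (GetUncommittedLineage, GetEntryMutable, SetEntry, and TaskDataMutable follow the snippets in this thread; the GcsStatus::NONE status value is an assumption):

auto uncommitted_lineage = lineage_cache_.GetUncommittedLineage(task_id);
auto entry = uncommitted_lineage.GetEntryMutable(task_id);
if (!entry) {
  // The entry was evicted from the cache; re-insert the task being
  // forwarded so the receiving node manager sees a complete lineage.
  RAY_CHECK(uncommitted_lineage.SetEntry(task, GcsStatus::NONE));
  entry = uncommitted_lineage.GetEntryMutable(task_id);
}
Task &lineage_cache_entry_task = entry->TaskDataMutable();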
Thanks, I guess it makes more sense to handle this all in this function.
Compare 790ffa6 to e680553
Did some renaming to make the check semantics clearer.
@@ -1752,7 +1752,7 @@ void NodeManager::HandleTaskReconstruction(const TaskID &task_id) {
                         "allocation via "
                      << "ray.init(redis_max_memory=<max_memory_bytes>).";
   // Use a copy of the cached task spec to re-execute the task.
-  const Task task = lineage_cache_.GetTask(task_id);
+  const Task task = lineage_cache_.GetTaskOrDie(task_id);
We shouldn't check fail in this function. However, the issue now is that you can't fail a task without the TaskSpec.
Is there a function from TaskID -> ReturnIDs?
You can use ComputeReturnId. Unfortunately, we can't know how many return values to store the error for without the task spec... I guess the safest option for now is to just put one return value.
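A minimal sketch of that suggestion (ComputeReturnId is the function named above; the 1-based return index and the MarkObjectAsFailed helper are hypothetical assumptions about the surrounding raylet code):

// Reconstruct the first return ObjectID from the TaskID alone, since the
// TaskSpec (and therefore the true number of returns) is unavailable.
const ObjectID return_id = ComputeReturnId(task_id, /*return_index=*/1);
// Conservatively store an error for a single return object.
MarkObjectAsFailed(return_id);  // hypothetical helper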
What if I returned like 10? Could that cause an issue?
For actor tasks, the last return value is the dummy object, which isn't supposed to have any value in the object store... this could potentially break other parts of the raylet, but I'm not totally sure.
Note: I don't think NDEBUG is actually enabled in our prod builds, but that's probably fine for now. The proximate cause of the original issue is likely fixed by #3860.
Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?
Sure, I approved.
What do these changes do?
Under some race conditions with slow actor creation tasks, it seems like we hit this check. Be defensive and just return an empty lineage in this case.
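A simplified, non-recursive sketch of that defensive behavior (the real GetUncommittedLineage also walks parent tasks; the lineage_ member and GetEntry are assumed, while SetEntry and the logging call follow the diff snippet above):

Lineage LineageCache::GetUncommittedLineage(const TaskID &task_id) const {
  Lineage uncommitted_lineage;
  auto entry = lineage_.GetEntry(task_id);
  if (entry) {
    RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
  } else {
    // See #3813: the entry can be missing under a rescheduling race, so
    // return an empty lineage instead of RAY_CHECK-failing.
    RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
  }
  return uncommitted_lineage;
}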
Related issue number
#3813