Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't check fail on missing lineage cache entry #3861

Merged
merged 7 commits into from
Feb 5, 2019

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Jan 25, 2019

What do these changes do?

Under some race conditions with slow actor creation tasks, it seems like we hit this check. Be defensive and just return empty lineage in this case.

Related issue number

#3813

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11161/
Test FAILed.

if (entry) {
RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
} else {
RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a comment on under what conditions this entry doesn't exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added a pointer back to the issue.

It's not super clear what is evicting the lineage: perhaps some race condition on a task getting rescheduled, or the task succeeding but the node failing after that.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11174/
Test PASSed.

@robertnishihara
Copy link
Collaborator

I strongly think we should not do this. The assertion is failing because there is a bug that we don't understand. The solution is not to get rid of the assertion but rather to fix the bug. If we remove assertions whenever we don't understand what is going on, technical debt will accumulate.

You've mentioned that we shouldn't have fatal checks in production. And I'm ok with shipping wheels where DCHECKS just log errors or something like that, but this is turning it off even in the case where we're trying to do development and debugging.

@ericl
Copy link
Contributor Author

ericl commented Jan 26, 2019

What do we need to do to make RAY_DCHECK work?

#define RAY_DCHECK(condition) \

I modified it to a error log followed by a dcheck, but I don't know if this DCHECK is enabled in the right environments.

But overall, I would prefer we move to a more defensively progammed, fault-tolerant implementation of the backend. By fault-tolerant, I mean that we keep running successfully even if certain components hit bugs. A bug in lineage cache should not take down other parts of Ray.

A stronger condition yet would be to be able to survive raylet crashes. That way, as long as raylets can remain alive for long enough, the application keeps making progress.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11177/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Jan 27, 2019

What's the resolution here, is DCHECK the way to go?

@stephanie-wang
Copy link
Contributor

DCHECK makes sense, but I would keep the assertion (and the log message) in the same place as before so that we can actually catch it.

@stephanie-wang
Copy link
Contributor

Ah ignore the above comment, I read the code wrong.

The method that you changed here is supposed to return a lineage with the requested task in it, but it won't always do that now. You'll have to modify the place where this gets called in the raylet so that the caller adds the correct task to the lineage.

auto entry = uncommitted_lineage.GetEntryMutable(task_id);
int num_forwards = -1;
if (entry) {
Task &lineage_cache_entry_task = entry->TaskDataMutable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to actually add the task to the uncommitted lineage if it is not already there. The receiving node manager expects the forwarded task to be in the lineage here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I guess it makes more sense to handle this all in this function.

@ericl ericl force-pushed the fix-3813 branch 2 times, most recently from 790ffa6 to e680553 Compare January 28, 2019 00:19
Copy link
Contributor Author

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some renaming to make check semantics more clear.

@@ -1752,7 +1752,7 @@ void NodeManager::HandleTaskReconstruction(const TaskID &task_id) {
"allocation via "
<< "ray.init(redis_max_memory=<max_memory_bytes>).";
// Use a copy of the cached task spec to re-execute the task.
const Task task = lineage_cache_.GetTask(task_id);
const Task task = lineage_cache_.GetTaskOrDie(task_id);
Copy link
Contributor Author

@ericl ericl Jan 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't check fail in this function. However, the issue now is that you can't fail a task without the TaskSpec.

Is there a function from TaskID -> ReturnIDs?

Copy link
Contributor

@stephanie-wang stephanie-wang Jan 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use ComputeReturnId. Unfortunately, we can't know how many return values to put the error for without the task spec...I guess the safest option for now is to just put one return value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I returned like 10? Could that cause an issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For actor tasks, the last return value is the dummy object, which isn't supposed to have any value in the object store...this could potentially break other parts of the raylet, but I'm not totally sure.

@ericl
Copy link
Contributor Author

ericl commented Jan 28, 2019

Note: I don't think NDEBUG is actually enabled in our prod builds, but that's probably fine for now. The proximate cause of the original issue is likely fixed by #3860

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11220/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11216/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11219/
Test PASSed.

@ericl
Copy link
Contributor Author

ericl commented Feb 3, 2019

Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?

@stephanie-wang
Copy link
Contributor

Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?

Sure, I approved.

@ericl
Copy link
Contributor Author

ericl commented Feb 4, 2019

jenkins retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11540/
Test PASSed.

@ericl ericl merged commit 5fb813f into ray-project:master Feb 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants