Don't check fail on missing lineage cache entry #3861
Conversation
src/ray/raylet/lineage_cache.cc
Outdated
if (entry) {
  RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
} else {
  RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
It would be nice to add a comment about the conditions under which this entry doesn't exist.
Done. Added a pointer back to the issue.
It's not super clear what is evicting the lineage: perhaps some race condition on a task getting rescheduled, or the task succeeding but the node failing after that.
I strongly think we should not do this. The assertion is failing because there is a bug that we don't understand. The solution is not to get rid of the assertion but rather to fix the bug. If we remove assertions whenever we don't understand what is going on, technical debt will accumulate. You've mentioned that we shouldn't have fatal checks in production, and I'm ok with shipping wheels where DCHECKs just log errors or something like that, but this is turning it off even in the case where we're trying to do development and debugging.
What do we need to do to make RAY_DCHECK work? (See line 27 in 3027dde.)
I modified it to an error log followed by a DCHECK, but I don't know if this DCHECK is enabled in the right environments. Overall, though, I would prefer we move to a more defensively programmed, fault-tolerant implementation of the backend. By fault-tolerant, I mean that we keep running successfully even if certain components hit bugs. A bug in the lineage cache should not take down other parts of Ray. A stronger condition still would be to survive raylet crashes; that way, as long as raylets can remain alive for long enough, the application keeps making progress.
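For concreteness, a minimal sketch of the error-log-followed-by-a-DCHECK pattern under discussion (the lineage_ member and GetEntry call are assumptions modeled on the diff snippets in this thread; whether RAY_DCHECK actually fires depends on whether NDEBUG is defined in the build, which is the open question here):

auto entry = lineage_.GetEntry(task_id);  // lineage_ / GetEntry: assumed names
if (!entry) {
  // Always log, in every build type.
  RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
  // Fatal only in builds where DCHECKs are compiled in (NDEBUG not set);
  // in release builds this is a no-op and the caller degrades gracefully.
  RAY_DCHECK(entry);
}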
What's the resolution here? Is DCHECK the way to go?
Ah, ignore the above comment; I read the code wrong. The method you changed here is supposed to return a lineage with the requested task in it, but it won't always do that now. You'll have to modify the place where this gets called in the raylet so that the caller adds the correct task to the lineage.
src/ray/raylet/node_manager.cc
Outdated
auto entry = uncommitted_lineage.GetEntryMutable(task_id);
int num_forwards = -1;
if (entry) {
  Task &lineage_cache_entry_task = entry->TaskDataMutable();
You have to actually add the task to the uncommitted lineage if it is not already there. The receiving node manager expects the forwarded task to be in the lineage here.
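A hedged sketch of that fix in the forwarding caller (GetUncommittedLineage, GetEntryMutable, SetEntry, and TaskDataMutable follow the snippets in this thread; the GcsStatus::NONE status value is an assumption):

auto uncommitted_lineage = lineage_cache_.GetUncommittedLineage(task_id);
auto entry = uncommitted_lineage.GetEntryMutable(task_id);
if (!entry) {
  // The entry was evicted from the cache; re-insert the task being
  // forwarded so the receiving node manager sees a complete lineage.
  RAY_CHECK(uncommitted_lineage.SetEntry(task, GcsStatus::NONE));
  entry = uncommitted_lineage.GetEntryMutable(task_id);
}
Task &lineage_cache_entry_task = entry->TaskDataMutable();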
Thanks, I guess it makes more sense to handle this all in this function.
Compare 790ffa6 to e680553
Did some renaming to make the check semantics clearer.
@@ -1752,7 +1752,7 @@ void NodeManager::HandleTaskReconstruction(const TaskID &task_id) {
                         "allocation via "
                      << "ray.init(redis_max_memory=<max_memory_bytes>).";
   // Use a copy of the cached task spec to re-execute the task.
-  const Task task = lineage_cache_.GetTask(task_id);
+  const Task task = lineage_cache_.GetTaskOrDie(task_id);
We shouldn't check fail in this function. However, the issue now is that you can't fail a task without the TaskSpec.
Is there a function from TaskID -> ReturnIDs?
You can use ComputeReturnId. Unfortunately, we can't know how many return values to store the error for without the task spec... I guess the safest option for now is to just put one return value.
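A minimal sketch of that suggestion (ComputeReturnId is the function named above; the 1-based return index and the MarkObjectAsFailed helper are hypothetical assumptions about the surrounding raylet code):

// Reconstruct the first return ObjectID from the TaskID alone, since the
// TaskSpec (and therefore the true number of returns) is unavailable.
const ObjectID return_id = ComputeReturnId(task_id, /*return_index=*/1);
// Conservatively store an error for a single return object.
MarkObjectAsFailed(return_id);  // hypothetical helper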
What if I returned like 10? Could that cause an issue?
For actor tasks, the last return value is the dummy object, which isn't supposed to have any value in the object store... this could potentially break other parts of the raylet, but I'm not totally sure.
Note: I don't think NDEBUG is actually enabled in our prod builds, but that's probably fine for now. The proximate cause of the original issue is likely fixed by #3860.
Perhaps we should punt on the return value issue since it's more complicated? Any other issues here?
Sure, I approved.
What do these changes do?
Under some race conditions with slow actor creation tasks, it seems like we hit this check. Be defensive and just return an empty lineage in this case.
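A simplified, non-recursive sketch of that defensive behavior (the real GetUncommittedLineage also walks parent tasks; the lineage_ member and GetEntry are assumed, while SetEntry and the logging call follow the diff snippet above):

Lineage LineageCache::GetUncommittedLineage(const TaskID &task_id) const {
  Lineage uncommitted_lineage;
  auto entry = lineage_.GetEntry(task_id);
  if (entry) {
    RAY_CHECK(uncommitted_lineage.SetEntry(entry->TaskData(), entry->GetStatus()));
  } else {
    // See #3813: the entry can be missing under a rescheduling race, so
    // return an empty lineage instead of RAY_CHECK-failing.
    RAY_LOG(ERROR) << "No lineage cache entry found for task " << task_id;
  }
  return uncommitted_lineage;
}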
Related issue number
#3813