Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: resolves and addition tests for child preStart supervise #1385

Merged
merged 10 commits into from
Jul 15, 2024

Conversation

Roiocam
Copy link
Member

@Roiocam Roiocam commented Jul 3, 2024

Related with #1384 and resolves #1384, some of code contributed by @sadekmunawar

if PR can reproducer this issue, the ci checks will be unable to successd: https://github.com/apache/pekko/actions/runs/9770536752/job/26971887726?pr=1385

@Roiocam
Copy link
Member Author

Roiocam commented Jul 3, 2024

From the perspective of stack frame, the problem lies in the handling:

protected def faultCreate(): Unit = {
assert(mailbox.isSuspended, "mailbox must be suspended during failed creation, status=" + mailbox.currentStatus)
assert(perpetrator == self)
cancelReceiveTimeout()
cancelReceiveTimeoutTask()


[ERROR] [07/03/2024 09:46:12.113] [SupervisorSpec-pekko.actor.default-dispatcher-5] [pekko://SupervisorSpec/user/$a/$a] changing Recreate into Create after org.apache.pekko.actor.ActorInitializationException: pekko://SupervisorSpec/user/$a/$a: exception during creation, root cause message: [deliberate test failure]
[ERROR] [07/03/2024 09:46:12.114] [SupervisorSpec-pekko.actor.default-dispatcher-6] [pekko://SupervisorSpec/user/$a/$a] assertion failed
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:264)
	at org.apache.pekko.actor.dungeon.FaultHandling.faultCreate(FaultHandling.scala:160)
	at org.apache.pekko.actor.dungeon.FaultHandling.faultCreate$(FaultHandling.scala:158)
	at org.apache.pekko.actor.ActorCell.faultCreate(ActorCell.scala:420)
	at org.apache.pekko.actor.dungeon.FaultHandling.faultRecreate(FaultHandling.scala:96)
	at org.apache.pekko.actor.dungeon.FaultHandling.faultRecreate$(FaultHandling.scala:92)
	at org.apache.pekko.actor.ActorCell.faultRecreate(ActorCell.scala:420)
	at org.apache.pekko.actor.ActorCell.invokeAll$1(ActorCell.scala:526)
	at org.apache.pekko.actor.ActorCell.systemInvoke(ActorCell.scala:545)
	at org.apache.pekko.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:297)
	at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:232)
	at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:245)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)

@@ -643,7 +643,7 @@ private[pekko] class ActorCell(
def failActor(): Unit =
if (_actor != null) {
clearActorFields(actor, recreate = false)
setFailedFatally()
setFailed(actor.self)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this change will break other cases or not, I will investigate later.

relate PR:

Copy link
Member Author

@Roiocam Roiocam Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After investigation, we can not set the fault state here... otherwise, we can not distinguish between fatal and normal fault

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(these links to Akka PRs are OK because they're old, pre-BSL PRs :) )

@He-Pin
Copy link
Member

He-Pin commented Jul 3, 2024

I think we can attach a test case for this, will look into this over the weekend.

@Roiocam Roiocam marked this pull request as draft July 3, 2024 06:53
@Roiocam Roiocam marked this pull request as ready for review July 3, 2024 09:53
@Roiocam Roiocam requested review from jrudolph, He-Pin and raboof July 3, 2024 10:23
@Roiocam Roiocam changed the title chore: add unit test for child actor initial exception supervise fix: resolves and addition tests for child preStart supervise Jul 3, 2024
@Roiocam
Copy link
Member Author

Roiocam commented Jul 5, 2024

@sadekmunawar Hi, my unit test partly refers to your code in #1383 links.

Would you like to contribute this to the Pekko community?

@Roiocam
Copy link
Member Author

Roiocam commented Jul 5, 2024

Would you like to contribute this to the Pekko community?

@pjfanning Am I doing this correctly, may I ask?

@Roiocam Roiocam requested a review from pjfanning July 5, 2024 10:02
@pjfanning
Copy link
Contributor

Would you like to contribute this to the Pekko community?

@pjfanning Am I doing this correctly, may I ask?

The test code was written by https://discuss.lightbend.com/u/nodeninja/summary

@sadekmunawar can you confirm if that is you? We received your iCLA.

@sadekmunawar
Copy link
Contributor

sadekmunawar commented Jul 5, 2024

The test code was written by https://discuss.lightbend.com/u/nodeninja/summary

@sadekmunawar can you confirm if that is you? We received your iCLA.

@pjfanning yes, that's me. After finding the bug, I wrote that test for a solution I implemented. My solution converts the FatallyFailed into a case class and attaches an Optional ActorRef to it.

I also wrote a scala version of this test in ActorLifCyleSpec. Feel free to use either version of the test

"must not break supervisor strategy due to unhandled exception in preStart" in {
  val id = newUuid.toString
  val gen = new AtomicInteger(0)
  val maxRetryNum = 3

  val childProps = Props(new LifeCycleTestActor(testActor, id, gen) {
    override def preStart(): Unit = {
      report("preStart")
      throw new Exception("test exception")
    }
  }).withDeploy(Deploy.local)

  val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = maxRetryNum, timeout.duration) {
      case _: ActorInitializationException => SupervisorStrategy.Restart
      case _                               => SupervisorStrategy.Escalate
    }
  val supervisor =  system.actorOf(Props(classOf[Supervisor], supervisorStrategy))
  Await.result((supervisor ? childProps).mapTo[ActorRef], timeout.duration)

  (0 to maxRetryNum).foreach(i => {
    expectMsg(("preStart", id, i))
  })
  expectNoMessage()
  system.stop(supervisor)
}

Copy link
Contributor

@pjfanning pjfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@laglangyue
Copy link
Contributor

lgtm.
Although I have read the relevant code, I did not delve into testing and debugging

@@ -160,6 +160,33 @@ class ActorLifeCycleSpec extends PekkoSpec with BeforeAndAfterEach with Implicit
a ! "hello"
expectMsg(42)
}

"not break supervisor strategy due to unhandled exception in preStart" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contributed by @sadekmunawar

@Roiocam
Copy link
Member Author

Roiocam commented Jul 8, 2024

I also wrote a scala version of this test in ActorLifCyleSpec. Feel free to use either version of the test

Thanks, I have updated this PR and verified it works in locally

@pjfanning pjfanning merged commit 73c9362 into apache:main Jul 15, 2024
18 checks passed
@pjfanning
Copy link
Contributor

@Roiocam should we backport this to this 1.0.x branch?

@pjfanning pjfanning added this to the 1.1.0-M2 milestone Jul 15, 2024
@Roiocam Roiocam deleted the child-init-exception-supervise-spec branch July 16, 2024 00:44
Roiocam added a commit to Roiocam/pekko that referenced this pull request Jul 16, 2024
…#1385)

* chore: add unit test for child actor initial exception supervise

* make assertion correct

* trying to fix assertion failed

* distinguish between fatal and normal fault

* fix unit tests

* trying to fix

* fix NPE

* revert isFailed condition

* revert isFailed place

* additional tests
@Roiocam
Copy link
Member Author

Roiocam commented Jul 16, 2024

@Roiocam should we backport this to this 1.0.x branch?

Of course, this is a bug. backport on #1396

Roiocam added a commit that referenced this pull request Jul 16, 2024
…#1396)

* chore: add unit test for child actor initial exception supervise

* make assertion correct

* trying to fix assertion failed

* distinguish between fatal and normal fault

* fix unit tests

* trying to fix

* fix NPE

* revert isFailed condition

* revert isFailed place

* additional tests
@sadekmunawar
Copy link
Contributor

It looks like the checks did not run the tests in SupervisorHierarchySpec.

When I ran the tests locally, two tests in SupervisorHierarchySpec failed.

  • handle failure in creation when supervision strategy returns Resume and Restart
  • survive being stressed

Can someone check if these two unit tests pass?

@pjfanning
Copy link
Contributor

It looks like the checks did not run the tests in SupervisorHierarchySpec.

When I ran the tests locally, two tests in SupervisorHierarchySpec failed.

  • handle failure in creation when supervision strategy returns Resume and Restart
  • survive being stressed

Can someone check if these two unit tests pass?

Those tests are tagged as long-running tests. Maybe such tests are not run by default.

Did those tests fail before this PR was added?

@sadekmunawar
Copy link
Contributor

Did those tests fail before this PR was added?

No, they passed in my local workspace before this PR was added. Unless there's something wrong with my local setup, this PR is the likely cause of the failures. I ran into issues with these two unit tests when experimenting with my own fix.

@Roiocam
Copy link
Member Author

Roiocam commented Jul 18, 2024

I will investigate it. :)

@Roiocam
Copy link
Member Author

Roiocam commented Jul 18, 2024

both survive being stressed and handle failure in creation when supervision strategy returns Resume and Restart has passes in my local by using #1399

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes Need to release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug - Assertion failure caused by exception during pre-start
6 participants