Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track changes to consensusMode in the kernel DB #4510

Closed
michaelfig opened this issue Feb 9, 2022 · 8 comments · Fixed by #4768
Closed

Track changes to consensusMode in the kernel DB #4510

michaelfig opened this issue Feb 9, 2022 · 8 comments · Fixed by #4768
Assignees
Labels
enhancement New feature or request SwingSet package: SwingSet
Milestone

Comments

@michaelfig
Copy link
Member

What is the Problem Being Solved?

Refs: #4506

When verbose debugging is enabled, the validator will diverge from the non-debugging chain at some point. It would be better if that divergence happened quickly and obviously.

Description of the Design

Record consensus mode (a boolean) somewhere in the kernel DB so that its initial value and changes to it appear in the activityhash.

Security Considerations

Test Plan

@michaelfig michaelfig added enhancement New feature or request SwingSet package: SwingSet labels Feb 9, 2022
@warner
Copy link
Member

warner commented Feb 9, 2022

Let's see, the divergence is because userspace could use a getter or a Proxy to sense what console.log is or is not reading off the arguments, right? So it's userspace's fault, but we obviously can't allow that to cause a consensus failure.

If we elevate the consensus mode flag itself to be part of consensus, I think we need:

  • controller.setConsensusMode(bool), so changing it happens only as part of a (cosmos) transaction
  • kvStore.consensusMode = bool, to record the current state
  • read this flag just before each delivery, include in the deliver command to the worker

Storing it in kvStore will cause changes to get included in the crankHash and then the activityHash. We need to commit the crank buffer after changing it. I don't yet have a clear idea about how that buffer commit should happen for non-run-queue events, but the new controller.validateAndInstallBundle() method has the same needs.. I think the controller method should commit internally, but the host app needs to invoke the method as part of a block or equivalent (and commit the hostDB at some point afterwards).

We also need to remove whatever other controls exist over consensusMode so that it can only change as part of a consensus-driven transaction.

I don't know how this should interact with the "download a vat transcript from the chain, replay it locally with more debugging turned on" plan. I guess transcript replay happens independently of any kernel or crankhash/activityhash, so when you're doing that replay, you can set consensusMode in the ['deliver'] command to whatever you like, doesn't matter.

@michaelfig
Copy link
Member Author

Let's see, the divergence is because userspace could use a getter or a Proxy to sense what console.log is or is not reading off the arguments, right? So it's userspace's fault, but we obviously can't allow that to cause a consensus failure.

Actually, it's the fault of the supervisor. It allocates differently (and sends back console messages) depending on whether consensus mode is set.

@warner
Copy link
Member

warner commented Feb 9, 2022

Hm. console messages (as sent from the worker to the kernel process) aren't syscalls, and aren't included in the transcript. They shouldn't cause state divergence.

But you're right, metering differences in supervisor code execution will cause divergence in the vat's meter consumption, which will be reflected in state changes of the Meter object (for metered vats). And there's an edge case where a vat is very close to the hard-limit (the per-crank max computrons), and extra supervisor time might push it over the edge, killing the vat in the consensusMode=false case (where the supervisor does extra logging work), but allowing it to finish in the no-logging =true case. And that's not something userspace has to try to trigger, in fact userspace would have to go out of its way (never use console.log) to prevent it.

@michaelfig
Copy link
Member Author

michaelfig commented Feb 9, 2022

Hm. console messages (as sent from the worker to the kernel process) aren't syscalls, and aren't included in the transcript. They shouldn't cause state divergence.

But you're right, metering differences in supervisor code execution will cause divergence in the vat's meter consumption,

I saw a block that had console messages happen and bringOutYourDead fail to happen in the diverging node's Zoe vat.

@mhofman
Copy link
Member

mhofman commented Feb 9, 2022

I saw a block that had console messages happen and bringOutYourDead fail to happen in the diverging node's Zoe vat.

That feels like it shouldn't happen. If that's the case, maybe we should have a separate issue for it?

@warner
Copy link
Member

warner commented Feb 9, 2022

read this flag just before each delivery, include in the deliver command to the worker

Thinking about this more, I've gotten myself re-confused. If we say that consensusMode affects the behavior of a delivery, then each delivery needs to include the mode, and each time we make that delivery (e.g. when a vat transcript is replayed), we need it to use the same consensusMode that was used the first time around. That means we need to add consensusMode to the VatDeliveryObject that gets stored in the transcript, not merely include it in the kernel-to-worker command message next to the VDO.

But consensusMode = false means that we've given up on trying to maintain consensus: we're accepting that userspace can use a getter/Proxy to sense whatever non-consensus console.log behavior is currently in use, and we're accepting that metering results won't match any previous delivery. So consensusMode = false actually encompasses a range of possible behaviors (based on what logging is doing), whereas true means there's exactly one behavior (because console.log is a NOP).

Does it make any sense to every call controller.setConsensusMode(true)? Like, once you've allowed a vat to see consensusMode = false, can you ever hope to achieve consensus again?

If not, then it doesn't make sense to have a controller.setConsensusMode(), and instead it should be a static set-once-only flag in config. We can still store it in the DB and read it just before delivery, include it next to the VDO (and that might help with the local-debugging-replay case). But it shouldn't be dynamic.

And then, getting back to the original request, maybe what we actually need is for the initial contents of the kernel DB to get hashed into an initial value for activityHash. Or for the changes we make during initializeSwingset() to be hashed and used as that initial value. We might achieve this by a step (at the end of initializeSwingset() that walks the whole DB, excludes the keys that are supposed to be excluded, hashes everything, and stores that in activityHash. Or maybe some kind of pseudo-crank-0 that opens the crankBuffer just before initialization starts writing a whole lot of DB keys, and does a commit-and-update-crank-hash afterwards. (although I wouldn't want to spend the RAM on holding all those writes in the crankBuffer, that's a waste when we're never going to abandon it as we might a vat delivery).

It kind of points to:

  • each swingset is either consensus-mode or not consensus-mode, selected during initialization
  • replay-vat debugging doesn't use a kernel, so the person doing the replay can choose whether the vat is told to execute under consensus-mode or not (debug logs vs the potential for surprising getter/Proxy behavior)
    • we'd ignore metering results in this case anyways, but the vat might still hit the hard computron limit differently during non-consensus-mode replay

The consensus-mode switch is clearly the responsibility of the host application: swingset should just use whatever its config said to use.

If the host application (cosmic-swingset) offers the operator control of that switch (environment variable, whatever), then all validators must use the same setting. In fact, they must all use the true setting, because otherwise getter/Proxy tricks would allow userspace to cause divergence.

So why would cosmic-swingset ever offer a way to set it to false?

I think the only case is when you're running a degenerate (single-node) chain, or if you're running a test and you're willing to allow userspace to cause divergence in exchange for getting better vat log messages without doing a zillion replays. Neither sound like the main use case, so I think it's ok if it's not easy to set consensusMode = false.

So I think we only need to store consensusMode in the DB if it's easy to build a multi-node chain in which this switch is easy to turn off. And I think maybe we should not make that easy.

@michaelfig
Copy link
Member Author

We only have two main cosmic-swingset use cases: sim-chain (for debugging), where consensusMode should be hardwired false, and chain-main (for production), where consensusMode should be hardwired true. It's fine if the swingset config chooses between those at initialisation time.

@michaelfig
Copy link
Member Author

The $DEBUG environment variable needs to be decoupled from the selection of consensusMode, and then #4506 will be fixed, and @mhofman's divergence would be systematically avoided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants