Track changes to consensusMode in the kernel DB #4510

michaelfig · 2022-02-09T18:15:26Z

What is the Problem Being Solved?

When verbose debugging is enabled, the validator will diverge from the non-debugging chain at some point. It would be better if that divergence happened quickly and obviously.

Description of the Design

Record consensus mode (a boolean) somewhere in the kernel DB so that its initial value and changes to it appear in the activityhash.

Security Considerations

Test Plan

warner · 2022-02-09T18:27:34Z

Let's see, the divergence is because userspace could use a getter or a Proxy to sense what console.log is or is not reading off the arguments, right? So it's userspace's fault, but we obviously can't allow that to cause a consensus failure.
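A minimal sketch of the kind of sensing described here. `makeConsole` and `probe` are hypothetical names for illustration, not SwingSet APIs, and the sketch assumes the consensus-mode console ignores its arguments while the debug console formats them:

```javascript
// Hypothetical stand-in for the vat console: in consensus mode it is a no-op,
// otherwise it formats (and therefore reads) its arguments.
function makeConsole(consensusMode) {
  return {
    log: (...args) => {
      if (!consensusMode) {
        // Formatting touches enumerable properties, which triggers getters.
        args.map(arg => JSON.stringify(arg));
      }
    },
  };
}

// Userspace code that senses whether its argument was inspected.
function probe(vatConsole) {
  let wasRead = false;
  const spy = {
    get secret() {
      wasRead = true;
      return 42;
    },
  };
  vatConsole.log(spy);
  return wasRead;
}

const seenInConsensus = probe(makeConsole(true));
const seenInDebug = probe(makeConsole(false));
```

Userspace that branches on the result of `probe` would behave differently on a node with debug logging enabled than on one without it.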

If we elevate the consensus mode flag itself to be part of consensus, I think we need:

- controller.setConsensusMode(bool), so changing it happens only as part of a (cosmos) transaction
- kvStore.consensusMode = bool, to record the current state
- read this flag just before each delivery, include in the deliver command to the worker

Storing it in kvStore will cause changes to get included in the crankHash and then the activityHash. We need to commit the crank buffer after changing it. I don't yet have a clear idea about how that buffer commit should happen for non-run-queue events, but the new controller.validateAndInstallBundle() method has the same needs. I think the controller method should commit internally, but the host app needs to invoke the method as part of a block or equivalent (and commit the hostDB at some point afterwards).
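A simplified sketch of that flow, assuming a toy in-memory store. `makeKernelStore`, `commitCrank`, and the hash layout are illustrative assumptions and do not reflect the real kernel's crankHash/activityHash encoding:

```javascript
const { createHash } = require('node:crypto');

// Toy kernel store: writes land in a crank buffer and are folded into the
// running activityHash only when the crank is committed.
function makeKernelStore() {
  const committed = new Map();
  let crankBuffer = new Map();
  let activityHash = '';
  return {
    set(key, value) {
      crankBuffer.set(key, value);
    },
    get(key) {
      return crankBuffer.has(key) ? crankBuffer.get(key) : committed.get(key);
    },
    commitCrank() {
      // Hash every buffered write, then chain the result onto activityHash.
      const h = createHash('sha256');
      for (const [key, value] of crankBuffer) {
        h.update(JSON.stringify(['set', key, value]));
        committed.set(key, value);
      }
      const crankHash = h.digest('hex');
      activityHash = createHash('sha256')
        .update(JSON.stringify([activityHash, crankHash]))
        .digest('hex');
      crankBuffer = new Map();
      return activityHash;
    },
  };
}

// Hypothetical controller method: record the flag, then commit internally.
function setConsensusMode(store, flag) {
  store.set('consensusMode', String(flag));
  return store.commitCrank();
}

const storeA = makeKernelStore();
const storeB = makeKernelStore();
const hashA = setConsensusMode(storeA, true);
const hashB = setConsensusMode(storeB, false);
```

Two nodes that disagree on the flag end up with different hashes at the first commit, which is the quick and obvious divergence the issue asks for.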

We also need to remove whatever other controls exist over consensusMode so that it can only change as part of a consensus-driven transaction.

I don't know how this should interact with the "download a vat transcript from the chain, replay it locally with more debugging turned on" plan. I guess transcript replay happens independently of any kernel or crankhash/activityhash, so when you're doing that replay, you can set consensusMode in the ['deliver'] command to whatever you like, doesn't matter.

michaelfig · 2022-02-09T18:32:49Z

> Let's see, the divergence is because userspace could use a getter or a Proxy to sense what console.log is or is not reading off the arguments, right? So it's userspace's fault, but we obviously can't allow that to cause a consensus failure.

Actually, it's the fault of the supervisor. It allocates differently (and sends back console messages) depending on whether consensus mode is set.

warner · 2022-02-09T18:43:17Z

Hm. console messages (as sent from the worker to the kernel process) aren't syscalls, and aren't included in the transcript. They shouldn't cause state divergence.

But you're right, metering differences in supervisor code execution will cause divergence in the vat's meter consumption, which will be reflected in state changes of the Meter object (for metered vats). And there's an edge case where a vat is very close to the hard limit (the per-crank max computrons), and extra supervisor time might push it over the edge, killing the vat in the consensusMode=false case (where the supervisor does extra logging work), but allowing it to finish in the no-logging consensusMode=true case. And that's not something userspace has to try to trigger; in fact, userspace would have to go out of its way (never use console.log) to prevent it.
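The edge case can be illustrated with made-up numbers. `HARD_LIMIT`, `LOG_COST`, and `runCrank` are hypothetical and only model the arithmetic, not the real metering implementation:

```javascript
// Hypothetical per-crank computron limit and per-log supervisor overhead.
const HARD_LIMIT = 1000;
const LOG_COST = 5;

// Model one delivery: the supervisor only pays for logging when
// consensusMode is false.
function runCrank({ userWork, logCalls, consensusMode }) {
  const supervisorOverhead = consensusMode ? 0 : logCalls * LOG_COST;
  const used = userWork + supervisorOverhead;
  return { used, terminated: used > HARD_LIMIT };
}

// A delivery that sits just under the limit and logs once.
const delivery = { userWork: 998, logCalls: 1 };
const quiet = runCrank({ ...delivery, consensusMode: true });
const verbose = runCrank({ ...delivery, consensusMode: false });
```

With these numbers the same delivery finishes under `consensusMode: true` and is terminated under `consensusMode: false`, and the `used` values differ even for deliveries that are nowhere near the limit.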

michaelfig · 2022-02-09T18:49:00Z

> Hm. console messages (as sent from the worker to the kernel process) aren't syscalls, and aren't included in the transcript. They shouldn't cause state divergence.
>
> But you're right, metering differences in supervisor code execution will cause divergence in the vat's meter consumption,

I saw a block that had console messages happen and bringOutYourDead fail to happen in the diverging node's Zoe vat.

mhofman · 2022-02-09T19:23:57Z

> I saw a block that had console messages happen and bringOutYourDead fail to happen in the diverging node's Zoe vat.

That feels like it shouldn't happen. If that's the case, maybe we should have a separate issue for it?

warner · 2022-02-09T20:05:54Z

> read this flag just before each delivery, include in the deliver command to the worker

Thinking about this more, I've gotten myself re-confused. If we say that consensusMode affects the behavior of a delivery, then each delivery needs to include the mode, and each time we make that delivery (e.g. when a vat transcript is replayed), we need it to use the same consensusMode that was used the first time around. That means we need to add consensusMode to the VatDeliveryObject that gets stored in the transcript, not merely include it in the kernel-to-worker command message next to the VDO.

But consensusMode = false means that we've given up on trying to maintain consensus: we're accepting that userspace can use a getter/Proxy to sense whatever non-consensus console.log behavior is currently in use, and we're accepting that metering results won't match any previous delivery. So consensusMode = false actually encompasses a range of possible behaviors (based on what logging is doing), whereas true means there's exactly one behavior (because console.log is a NOP).

Does it make any sense to ever call controller.setConsensusMode(true)? Like, once you've allowed a vat to see consensusMode = false, can you ever hope to achieve consensus again?

If not, then it doesn't make sense to have a controller.setConsensusMode(), and instead it should be a static set-once-only flag in config. We can still store it in the DB and read it just before delivery, include it next to the VDO (and that might help with the local-debugging-replay case). But it shouldn't be dynamic.

And then, getting back to the original request, maybe what we actually need is for the initial contents of the kernel DB to get hashed into an initial value for activityHash. Or for the changes we make during initializeSwingset() to be hashed and used as that initial value. We might achieve this by a step (at the end of initializeSwingset()) that walks the whole DB, excludes the keys that are supposed to be excluded, hashes everything, and stores that in activityHash. Or maybe some kind of pseudo-crank-0 that opens the crankBuffer just before initialization starts writing a whole lot of DB keys, and does a commit-and-update-crank-hash afterwards. (Although I wouldn't want to spend the RAM on holding all those writes in the crankBuffer; that's a waste when we're never going to abandon it as we might a vat delivery.)
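A minimal sketch of the walk-and-hash option, assuming the DB can be viewed as a Map of string keys to string values. `hashInitialState`, the `local.` prefix, and the sample keys are hypothetical:

```javascript
const { createHash } = require('node:crypto');

// Walk every key in sorted order, skip the excluded ones, and hash the rest
// into a single digest that could seed activityHash.
function hashInitialState(kvEntries, isExcluded) {
  const h = createHash('sha256');
  const keys = [...kvEntries.keys()].filter(key => !isExcluded(key)).sort();
  for (const key of keys) {
    // JSON-encode each pair so key/value boundaries are unambiguous.
    h.update(JSON.stringify([key, kvEntries.get(key)]));
  }
  return h.digest('hex');
}

// Hypothetical exclusion rule: keys under "local." are not part of consensus.
const isLocal = key => key.startsWith('local.');

const nodeA = new Map([
  ['consensusMode', 'true'],
  ['vat.names', '[]'],
  ['local.note', 'node A'],
]);
const nodeB = new Map([
  ['consensusMode', 'true'],
  ['vat.names', '[]'],
  ['local.note', 'node B'],
]);
const nodeDebug = new Map([
  ['consensusMode', 'false'],
  ['vat.names', '[]'],
  ['local.note', 'node A'],
]);

const initialA = hashInitialState(nodeA, isLocal);
const initialB = hashInitialState(nodeB, isLocal);
const initialDebug = hashInitialState(nodeDebug, isLocal);
```

Streaming the keys through the hash this way avoids holding the initialization writes in a crankBuffer, which is the RAM concern raised above.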

It kind of points to:

- each swingset is either consensus-mode or not consensus-mode, selected during initialization
- replay-vat debugging doesn't use a kernel, so the person doing the replay can choose whether the vat is told to execute under consensus-mode or not (debug logs vs the potential for surprising getter/Proxy behavior)
  - we'd ignore metering results in this case anyways, but the vat might still hit the hard computron limit differently during non-consensus-mode replay

The consensus-mode switch is clearly the responsibility of the host application: swingset should just use whatever its config said to use.

If the host application (cosmic-swingset) offers the operator control of that switch (environment variable, whatever), then all validators must use the same setting. In fact, they must all use the true setting, because otherwise getter/Proxy tricks would allow userspace to cause divergence.

So why would cosmic-swingset ever offer a way to set it to false?

I think the only case is when you're running a degenerate (single-node) chain, or if you're running a test and you're willing to allow userspace to cause divergence in exchange for getting better vat log messages without doing a zillion replays. Neither sounds like the main use case, so I think it's ok if it's not easy to set consensusMode = false.

So I think we only need to store consensusMode in the DB if it's easy to build a multi-node chain in which this switch is easy to turn off. And I think maybe we should not make that easy.

michaelfig · 2022-02-09T20:29:27Z

We only have two main cosmic-swingset use cases: sim-chain (for debugging), where consensusMode should be hardwired false, and chain-main (for production), where consensusMode should be hardwired true. It's fine if the swingset config chooses between those at initialisation time.

michaelfig · 2022-02-09T20:31:02Z

The $DEBUG environment variable needs to be decoupled from the selection of consensusMode, and then #4506 will be fixed, and @mhofman's divergence would be systematically avoided.
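A rough sketch of that decoupling, assuming the mode is derived only from which of the two use cases is running. `selectConsensusMode` and `selectVerboseLogging` are hypothetical helpers and do not show how #4515 actually implements it:

```javascript
// consensusMode depends only on the kind of chain being run.
function selectConsensusMode(chainKind) {
  switch (chainKind) {
    case 'sim-chain':
      return false; // debugging: divergence is acceptable
    case 'chain-main':
      return true; // production: all validators must agree
    default:
      throw new Error(`unknown chain kind: ${chainKind}`);
  }
}

// Verbose logging is chosen separately and never feeds into consensusMode.
function selectVerboseLogging(env) {
  return Boolean(env.DEBUG);
}

const env = { DEBUG: 'SwingSet:vat' };
const mainConfig = {
  consensusMode: selectConsensusMode('chain-main'),
  verbose: selectVerboseLogging(env),
};
const simConfig = {
  consensusMode: selectConsensusMode('sim-chain'),
  verbose: selectVerboseLogging({}),
};
```

Setting `DEBUG` on a validator then changes only what gets logged, not the mode the vats run under.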

michaelfig added the enhancement and SwingSet labels Feb 9, 2022

warner mentioned this issue Feb 9, 2022

Consensus mode with verbose logging #4506

Closed

warner self-assigned this Feb 9, 2022

This was referenced Feb 9, 2022

fix(cosmic-swingset): enforce consensusMode, not by sniffing $DEBUG #4515

Merged

bringOutYourDead difference between non- and consensusMode vats #4517

Closed

michaelfig mentioned this issue Mar 8, 2022

fix(SwingSet): remove consensusMode flip-flop #4768

Merged

mergify bot closed this as completed in #4768 Mar 12, 2022

Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022
