-
Notifications
You must be signed in to change notification settings - Fork 11.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deterministic simulation for sui tests #4429
Conversation
5312763
to
ab8fdd8
Compare
I haven't taken a deep look at this yet, i'll need to book some time but here are some initial questions given i know very little about what you've done so far.
I thought we had found a way that didn't require forking tokio?
What about Udp given i'm actively migrating things off of tcp atm. How does this handle things that happen inside of 3rd party code? Does it also require that we fork those third-party libraries? |
5d29058
to
726d9b5
Compare
It mostly doesn't:
UDP is even easier to support than TCP (unsurprisingly), and TCP was only a day or two of work (https://github.com/MystenLabs/mysten-simulator/blob/main/msim-tokio/src/sim/net.rs). As soon as you have a branch that is using UDP, let me know, and I will make sure it is supported.
It uses I did have to make a minimal hyper fork, because it uses |
cc4f4d5
to
e0f7462
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minor non-blocking comments.
In general I am quite (pleasantly) surprised this worked and tests run deterministically, this is amazing :)
I am also curious to see how this interacts with quic networking stack and whether we still have determinism with that
config_directory: PathBuf, | ||
randomize_ports: bool, | ||
committee_size: NonZeroUsize, | ||
committee: Option<CommitteeConfig>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am tryin to understand why is this an option - seems like there is no code path in which it can be None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, that was unnecessary, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it was necessary - we need to be able to move out of self.committee_config
in the build()
method - the ValidatorGenesisInfo struct can't be made Clone() because it contains a keypair.
.expect("Response from lockservice was cancelled, should not happen!") | ||
exec_client_future(async move { | ||
let (os_sender, os_receiver) = oneshot::channel::<SuiResult>(); | ||
// NOTE: below is blocking, switch to Tokio channels which are async? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment seem outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed - will remove as long as i'm polluting the blame anyway.
// timing of the reply from the other thread. | ||
pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output { | ||
if cfg!(msim) { | ||
futures::executor::block_on(fut) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sort of surprised this worked - normally you can't call block_on from inside tokio thread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't call Runtime::block_on (I think that panics?) - but this is (almost) just a short-hand for calling poll() in a loop until it returns Ready. There's nothing to stop you from doing that.
Obviously this will deadlock if your future is waiting for something to happen on the current thread, but that's explicitly not the case here - we are sending a message to another thread and waiting for it to come back. So ultimately this boils down to just waiting on the condvar inside the channel, which is a perfectly reasonable thing to do.
crates/sui-storage/src/lib.rs
Outdated
// Used to exec futures that send data to/from other threads. In the simulator, this effectively | ||
// becomes a blocking call, which removes the non-determinism that would otherwise be caused by the | ||
// timing of the reply from the other thread. | ||
pub(crate) async fn exec_client_future<F: Future>(fut: F) -> <F as Future>::Output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand why you need this function, but I don't understand choice of the name to be honest :)
Maybe something like sim_blocking_call
or sim_no_preempt
would be better name? It would give intuition that
(a) this is done as a call specifically to make it work in simulator
(b) what it helps specifically with is communicating with threads that are outside of simulator scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion.
e0f7462
to
ea62763
Compare
Anemo just uses quinn, which uses tokio UdpSocket asynchronously, so there shouldn't be any major challenges. Getting it to work at all is a minor schlep, but once it works, I bet it will be deterministic. |
This PR enables running tests in a deterministic simulator. To use:
The simulator itself is at https://github.com/MystenLabs/mysten-simulator (in a separate repo so that it can be integrated into narwhal as well). The simulator provides a drop-in replacement for tokio which is patched in via cargo's patch mechanism - but only when running the tests. Normal builds use normal tokio.
How it works, in brief:
getrandom()
andgetentropy()
in order to transparently provide deterministic behavior from OsRng.Two new macros are provided:
#[sui_test]
- Runs the test as a#[tokio::test]
undercargo test
and a simulator test undercargo simtest
.#[sim_test]
- Marks the test as ignored forcargo test
, but runs it undercargo simtest
Tests defined with
#[tokio::test]
are ignored under simtest.Tests defined with
#[test]
are run in both. I would like to ignore them withsimtest
but I don't know how to do that.Most of the code in this PR is boilerplate necessary to handle configuration differences for swarm.
Remaining work for subsequent PRs:
#[sui_test]
and/or#[sim_test]