Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: tests fail when running cargo tarpaulin with recent nightly toolchain #885

Closed
enthusiastmartin opened this issue Nov 29, 2021 · 24 comments · Fixed by #896
Closed

bug: tests fail when running cargo tarpaulin with recent nightly toolchain #885

enthusiastmartin opened this issue Nov 29, 2021 · 24 comments · Fixed by #896
Assignees

Comments

@enthusiastmartin
Copy link

Describe the bug
We have been using tarpaulin for our project for quite some time.

We haven't had any issues until now. We updated toolchain to more recent one -> nightly-2021-11-08.

And using tarpaulin to generate code coverage, our tests started failing. The tests produce different results than expected.

However, running just the tests without tarpaulin is fine and results are correct.

Note using any earlier toolchain than the one from 2021-11-08 is fine (eg. 2021-11-07 is fine ).

To Reproduce
our project is quite complex, using even more complex Substrate framework.

However to reproduce - it is fairly simple:

  1. Clone repository Basilisk-node
  2. There is rust-toolchain file - edit to change toolchain to one from 2021-11-08.
  3. Run make coverage ( takes quite some time ). Fails.
  4. Run make test. Tests pass.

Expected behavior
I'd expect tests not to fail considering they pass correctly without tarpaulin.

Personally, I cannot tell yet what can cause this issue.

I'll dig deeper to figure out what have the failing tests in common - it is not obvious from the first quick look.

but I figured it would be good to let you know, as there might be something which you are already aware of.

Version
cargo tarpaulin --version
cargo-tarpauin version: 0.18.5

@xd009642
Copy link
Owner

This isn't something I'm aware of so looking into it a bit now to see if anything jumps out (otherwise it may be a longer investigation), but I can recreate it.

Initially got a list of things that failed first run with the newer toolchain

/home/daniel/personal/Basilisk-node/target/debug/deps/runtime_integration_tests-6533bd61a1bc44d5
cross_chain_transfer::transfer_from_relay_chain

/home/daniel/personal/Basilisk-node/target/debug/deps/pallet_exchange-9735471565d253de
tests::trade_limits_in_exact_match_scenario_should_work

/home/daniel/personal/Basilisk-node/target/debug/deps/pallet_xyk-e64700f0517e1fa1
money_in_sell_money_out_should_leave_the_same_balance
sell_test
sell_test_not_reaching_limit
sell_with_correct_fees_should_work
single_buy_should_work
work_flow_happy_path_should_work

/home/daniel/personal/Basilisk-node/target/debug/deps/pallet_duster-30499cb7ea4d1d0f
native_existential_deposit

@enthusiastmartin
Copy link
Author

enthusiastmartin commented Dec 1, 2021

Hi,
yes, that's correct.

I just wanted to dig little bit deeper to understand where the issue is. And this happened:

I renamed the sell_test to bug_test just to be easier to see it and run it.

And the test passed! I revert back the name to sell_test and the test fails again.

in fact , when i rename all failing tests to bug_test1,bug_test2,bug_test3 etc - they all pass ( at least in pallet-xyk crate under pallets/xyk directory) .

Can this give you any hint ?

@enthusiastmartin
Copy link
Author

update: i was just curious, so i renamed all failing tests - and they all pass. However few other tests starts failing after that change which were ok initially.

At this point i wont dig deeper. I guess it wont have anything to do with the code itself. Hope this information will give you an idea.

Let me know if you need anything else, I am happy to help.

@xd009642
Copy link
Owner

If you have time, if you could try #894 and let me know if it works. My laptop is quite weak so running the basilisk tests does take quite a lot of time 😅. Otherwise this fixes another issue that looks similar so I'll likely do a release next week with this PR included

@enthusiastmartin
Copy link
Author

If you have time, if you could try #894 and let me know if it works. My laptop is quite weak so running the basilisk tests does take quite a lot of time 😅. Otherwise this fixes another issue that looks similar so I'll likely do a release next week with this PR included

Sure. I will have a look and try it out later today. I'lllet you know.

@enthusiastmartin
Copy link
Author

enthusiastmartin commented Dec 13, 2021

If you have time, if you could try #894 and let me know if it works. My laptop is quite weak so running the basilisk tests does take quite a lot of time 😅. Otherwise this fixes another issue that looks similar so I'll likely do a release next week with this PR included

Using the branch remove-sched_affinity - it did not help.

Now the tests actually segfault.

running 50 tests
test mock::__construct_runtime_integrity_test::runtime_integrity_tests ... ok
Dec 13 01:31:44.770 ERROR cargo_tarpaulin: Failed to get test coverage! Error: Failed to run tests: A segfault occurred while executing tests
Error: "Failed to get test coverage! Error: Failed to run tests: A segfault occurred while executing tests"

to reproduce - you can actually run the tests for selected package only, so it does not take so long, eg:

cargo tarpaulin -p pallet-exchange --release --avoid-cfg-tarpaulin --all-features --skip-clean

This executes ok on tarpaulin develop branch with nightly-toolchain < 2021-11-08. With more recent one - it fails as described in the issue description.

And with the change from remove-sched_affinity branch - it segfaults with each toolchain.

@xd009642
Copy link
Owner

xd009642 commented Dec 13, 2021

Interesting, with latest nightly it works for me, maybe remove the --skip-clean as incremental compilation doesn't always work as expected when linker flags change (i.e. -Clink-dead-code) which tarpaulin uses. Edit: alternatively try using the develop tarpaulin and adding -- --test-threads N at the end where N is some value >1 (logical cores on CPU a reasonable default).

If it works with the test-threads arg it suggests there's some global mutable state in your test created via something like thread-local variables that isn't reset between tests. Which you probably want to fix as it means tests are subject to stop working based on the implementation of the test runner (i.e. rustc moving to a thread pool based implementation could cause them to fail)

@xd009642
Copy link
Owner

Just tried it twice with --test-threads 8 and it worked both times, also the sched affinity branch worked all times I tried it as well. It does look suspiciously like you have a test setting some thread local state that causes a single threaded test runner to fail.

The segfault is concerning though, I'll try to run the sched change across a variety of projects but if you could run without --skip-clean as well. If it's consistent I'll do a different PR to explicitly set test-threads if not specified by the user

@enthusiastmartin
Copy link
Author

enthusiastmartin commented Dec 13, 2021

Just tried it twice with --test-threads 8 and it worked both times, also the sched affinity branch worked all times I tried it as well. It does look suspiciously like you have a test setting some thread local state that causes a single threaded test runner to fail.

The segfault is concerning though, I'll try to run the sched change across a variety of projects but if you could run without --skip-clean as well. If it's consistent I'll do a different PR to explicitly set test-threads if not specified by the user

I can confirm that using --test-threads 8 on develop tarpaulin branch ( nightly-2021-12-10) - all seems to be fine.

However, it still segfaults using the remove-sched_affinity branch.

I am going to refactor our tests to not use the thread_local as it is not really needed in some cases ( substrate framework uses this approach in their tests so it was used as an inspiration ).

@xd009642
Copy link
Owner

I've done a PR using --test-threads then instead of removing the affinity, which I'll merge once CI passes

@enthusiastmartin
Copy link
Author

I've done a PR using --test-threads then instead of removing the affinity, which I'll merge once CI passes

actually my tests keeps failing .

I must have done something incorrect when it passed as mentioned previously ( no surprise since i was switching toolchains and branches)

Latest tarpaulin develop and nightly-2021-12-10 - > segfault when running all tests in workspace ( but these are tests which dont use thread_local at all)

but if run tests just for that specific pallet - they are fine.

Well, assuming that the segfault refers to the last executed test in the output ( i am not sure bout anything anymore at this point :-) ).

@xd009642 xd009642 reopened this Dec 13, 2021
@enthusiastmartin
Copy link
Author

Quick update - latest tarpaulin develop branch and nightly-2021-12-10.

using -- --test-threads=8 -> all tests pass.

Without that - tests fail ( it is either segfault or different outcome )

@xd009642
Copy link
Owner

Is it actually using the latest develop, and not one from say 2 days ago or the start of the day? Because now it uses num_cpus::get() to set --test-threads to be equal to number of logical CPUs if it isn't specified by the user

@enthusiastmartin
Copy link
Author

Is it actually using the latest develop, and not one from say 2 days ago or the start of the day? Because now it uses num_cpus::get() to set --test-threads to be equal to number of logical CPUs if it isn't specified by the user

Yes. it is precisely this commit - f73ed88bffcc37dd61fe5170eceb53b5828bf3dq

@xd009642
Copy link
Owner

Hmm, I just pushed a commit which adds verbose logging for the args and env vars which the test binary is executed with - since there didn't seem to be anything.

If you grab the latest develop commit and run with --verbose I'm just wondering if for some reason num_cpus is showing your system as having 1 logical core 🤔

@enthusiastmartin
Copy link
Author

Hmm, I just pushed a commit which adds verbose logging for the args and env vars which the test binary is executed with - since there didn't seem to be anything.

If you grab the latest develop commit and run with --verbose I'm just wondering if for some reason num_cpus is showing your system as having 1 logical core 🤔

Dec 14 04:38:52.118 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]

I guess that's the issue, correct ?

However, I can see also the following line for the first batch of tests:

Dec 14 04:39:28.210 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "8"]

After that it is only 1.

Grep only the debug:

Dec 14 04:40:20.915 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "8"]
Dec 14 04:40:21.083 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:21.271 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:23.382 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:23.562 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:23.702 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:26.283 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]
Dec 14 04:40:29.026 DEBUG cargo_tarpaulin::process_handling: Args: ["--test-threads", "1"]

@xd009642
Copy link
Owner

Oohh I believe I see what's happening, give me a sec and I'll put up a PR for it for you to try out!

@xd009642
Copy link
Owner

Well I couldn't build all of Basilisk after some attempts, maybe a wasm thing... Either way tested it on another project and it looks like it should fix the issue of --test-threads resetting! #899

@xd009642
Copy link
Owner

Interestingly one of the nightly CI jobs segfaulted so maybe that's the same issue, I'll dig in more

@xd009642
Copy link
Owner

decided to tweak it a bit further so it's more similar to what went before - that should solve the segfault and your issue 👍

@enthusiastmartin
Copy link
Author

I could see a segfault too but wanted to check first before reporting somethingincorrect;-)

I will try it with the latest update.

@xd009642
Copy link
Owner

It's never segfaulted on my machine but did on 1 CI run... Just part of the fun of using ptrace I guess 😅

@enthusiastmartin
Copy link
Author

Nice. Seems to be working fine now. Hopefully i did it correctly. Run it twice and all tests passed

thank you very much . Where is the sponsor button ?;-)

@xd009642
Copy link
Owner

Brilliant, I'll merge the PR then! And there should be a sponsor button somewhere as I do have github sponsors/patreon and the FUNDING.yml in the repo, but it's not necessary unless you really want to 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants