This issue was moved to a discussion.

Feedback on fuzzer benchmarking setup #985

Closed
wuestholz opened this issue Mar 17, 2023 · 12 comments

@wuestholz

I'm trying to compare Echidna with the Forge fuzzer on several benchmark contracts.

To make the comparison as fair as possible, I've created a benchmark generator that automatically generates challenging contracts. The benchmarks intentionally use a limited subset of Solidity to avoid language features that could be handled differently by different tools. Each contract contains ~50 assertions (some can fail, but others cannot due to infeasible path conditions). (If you're curious, you can find one of the benchmarks here. The benchmark-generation approach is inspired by the Fuzzle benchmark generator for C-based fuzzers.) To find the assertions that can fail, a fuzzer needs to generate up to ~15 transactions and satisfy some input constraints for each transaction.

Since I'm not deeply familiar with Echidna I'd like to check if there are any potential issues with my benchmark setup before sharing results.

For each fuzzing campaign I'm using the following settings that deviate from the defaults:

  • testLimit: 1073741823 (instead of 50000)
  • shrinkLimit: 1073741823 (instead of 5000)
  • codeSize: 0xc00000 (instead of 0x6000)

The motivation for increasing the testLimit and shrinkLimit settings is that I want to run long fuzzing campaigns (for instance, 1 hour for each contract), and I use the timeout setting to terminate the campaign after a fixed amount of time.

I also increased the codeSize setting to handle larger contracts, if necessary. Currently, all benchmark contracts are below the EVM limit when using the solc optimizer (0.8.19).
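Collected into a single file, the non-default settings described above would look roughly like this (YAML keys as accepted by Echidna's --config; the 1-hour timeout is an example value matching the campaign length mentioned above, not a setting listed in the bullets):

```yaml
# Non-default Echidna settings used for the benchmark campaigns
testLimit: 1073741823    # default 50000
shrinkLimit: 1073741823  # default 5000
codeSize: 0xc00000       # default 0x6000
timeout: 3600            # terminate each campaign after 1 hour
```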

Please let me know if you see any potential issues with this setup.

@ggrieco-tob
Member

Hi,

Thanks for testing Echidna on your new benchmark; happy to provide feedback. Some pointers:

  • codeSize should not impact performance, unless the code deploys extremely large contracts over and over again.
  • testLimit and timeout are the main way to define the resource limits you give the fuzzer. The default values should of course be changed for this.
  • shrinkLimit can impact the runtime, since shrinking can take some time (and there is no guarantee the result will be minimal). This depends on the "score" you give to each tool: in some cases, finding a bug is more important than providing a minimized input, but this is up to you.

Currently, all benchmark contracts are below the EVM limit when using the solc optimizer (0.8.19).

Very important point to make sure the experiments make sense.

There is one thing missing here: seqLen (the number of transactions before resetting the EVM) defaults to 100. It is unclear how this compares to other fuzzers, or whether this value is optimal for the benchmark. Perhaps a solution would be to run the tools with different values and report an average, but we would need more details of your experiments to help you.

@wuestholz
Author

@ggrieco-tob Thanks a lot for the quick response!

I have two follow-up questions:

(1) In my experiments, I observed that a small shrinkLimit (0 or even the default) terminates the whole campaign once the limit is hit. Is that actually the case? If not, I'm happy to use the default or an even smaller value (perhaps 0).

Here's a quick experiment I did:

$ curl https://gist.githubusercontent.com/wuestholz/aec07f7d3572af8d477e8e0a387fb7ab/raw/1832587591dd04ab5cd4c2ecee4cae69321d942d/maze-0.sol --output maze-0.sol
$ rm -rf echidna-corpus
$ mkdir echidna-corpus
$ printf 'testMode: "assertion"\ntestLimit: 1073741823\ntimeout: 60\nshrinkLimit: 1\nseed: 0\nformat: text\ncodeSize: 0xc00000\ncorpusDir: echidna-corpus' > config.yaml
$ time echidna-test --config config.yaml --contract Maze maze-0.sol

On my machine, the last command terminates after only ~5 seconds even though the time limit is 60s.

(2) It's difficult to say what value of seqLen is optimal. That's why I planned to leave it at the default. However, I'm happy to try other values. For the generated benchmarks, up to 15 transactions may be required to trigger some assertion failures. For this reason, I'm currently using 30 as the limit for Forge (Forge's default is 15). After all, some randomly generated transactions may fail, and it's probably quite unlikely that a fuzzer will generate exactly 15 successful transactions in a row. Should I also try 30 for Echidna?
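As a back-of-the-envelope check on that intuition (my own toy model, not part of the thread's setup): if each randomly generated transaction independently satisfies its input constraints with some probability p, the chance that a sequence of seqLen transactions contains at least 15 successful ones is a binomial tail, which grows sharply with seqLen.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical model: each transaction passes its input constraints
# with probability p = 0.5; the bug needs >= 15 passing transactions
# before the EVM state is reset after seqLen transactions.
print(prob_at_least(15, 15, 0.5))  # seqLen = 15: ~3.1e-05
print(prob_at_least(15, 30, 0.5))  # seqLen = 30: ~0.57
```

Under this (admittedly crude) independence assumption, seqLen = 15 would almost never reach the target state within one sequence, while seqLen = 30 reaches it more than half the time.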

@ggrieco-tob
Member

I observed that a small shrinkLimit (0 or even the default) will terminate the campaign when the limit is hit

Yes, this is correct. However, a large shrinkLimit will still force Echidna to spend precious campaign time that could be used to find other counterexamples. Since shrinking is a stochastic process, it does not check whether the transaction sequence is actually minimal and will keep trying. Again, it is up to you how to assign resources to the fuzzers for the experiments, but shrinking can also be performed after the campaign to make the results understandable for humans.

After all, some randomly generated transactions may fail and it's probably quite unlikely that a fuzzer will just generate 15 successful transactions. Should I also try 30 for Echidna?

Well, if you expect smart contract fuzzers to accumulate a particular state over 15 transactions, then clearly resetting every 15 is not enough (of course!). In fact, in our experience it is very important to select a much larger limit (100 or 200) to avoid resetting the state earlier than needed. We would love to see empirical experiments to support this intuition. Of course, there are some specially crafted examples where a fuzzer benefits from resetting the state early (e.g., if there is a state the fuzzer cannot leave), but these are not very common in our audits.

@wuestholz
Author

@ggrieco-tob Thanks for clarifying!

About the shrinkLimit: I would indeed like to spend as little time on shrinking as possible. What would be the recommended way to set this up?

It would be really useful if there was a shrinkLimitPerTest setting that could be set to 0. Have you considered this? Or is there another way to achieve this behavior?

About seqLen: I think your intuition makes sense unless there is some sort of coverage feedback that would add test cases into the corpus if they cover interesting/new states. Harvey does something like this, but I'm not sure any other fuzzers do.

@ggrieco-tob
Member

The shrinkLimit config is per test. Setting it to 0 should be enough to disable shrinking.

About seqLen: I think your intuition makes sense unless there is some sort of coverage feedback that would add test cases into the corpus if they cover interesting/new states. Harvey does something like this, but I'm not sure any other fuzzers do.

Echidna also uses coverage for adding elements to the corpus; however, we are not sure how much that value can be reduced, even when relying on coverage guidance.

@wuestholz
Author

@ggrieco-tob Thanks! Then I don't quite understand the behavior I'm observing above. It seems like the entire fuzzing campaign terminates when the shrinkLimit is exceeded for some test. It would be great if exceeding the limit would just stop the shrinking for that particular test, while still continuing the fuzzing campaign. Is that perhaps a bug in Echidna I'm hitting?

@ggrieco-tob
Member

Could be the case. Can you please create a small issue to reproduce it? It is odd that the complete campaign ends, unless there is nothing else to test (e.g., everything failed).

@wuestholz
Author

I tried to minimize the example:

pragma solidity ^0.8.19;
contract Maze {
  event AssertionFailed(string message);
  uint64 private x;
  uint64 private y;
  function moveNorth(uint64 p0, uint64 p1) payable external returns (int64) {
    uint64 ny = y + 1;
    require(ny < 7);
    y = ny;
    return step(p0, p1);
  }
  function moveSouth(uint64 p0, uint64 p1) payable external returns (int64) {
    require(0 < y);
    uint64 ny = y - 1;
    y = ny;
    return step(p0, p1);
  }
  function moveEast(uint64 p0, uint64 p1) payable external returns (int64) {
    uint64 nx = x + 1;
    require(nx < 7);
    x = nx;
    return step(p0, p1);
  }
  function moveWest(uint64 p0, uint64 p1) payable external returns (int64) {
    require(0 < x);
    uint64 nx = x - 1;
    x = nx;
    return step(p0, p1);
  }
  function step(uint64 p0, uint64 p1) internal returns (int64) {
    unchecked {
      if (x == 0 && y == 0) {
        // start
        return 0;
      }
      if (x == 2 && y == 2) {
        emit AssertionFailed("1"); assert(false);  // bug
        return 1;
      }
      if (x == 6 && y == 6) {
        if (p0 * p1 == 938957) {
          emit AssertionFailed("2"); assert(false);  // bug
        }
        return 2;
      }
      return 3;
    }
  }
}

Assertion 1 is easy to cover, but assertion 2 should be more difficult to cover.

@wuestholz
Author

@ggrieco-tob I observed that setting the shrink limit to 1 or even 0 works just fine when using the exploration test-mode (instead of the assertion test-mode). Perhaps the fuzzer simply terminates after finding the first bug and uses up the shrink budget before terminating. With a small budget, it terminates very quickly whereas with a large budget it "wastes" most of the allocated time just shrinking.

I changed the test-mode in my benchmarking setup to "exploration" and this improved Echidna's performance very significantly. I'm using the covered.*.txt files in the corpus to determine which assertions were hit (see earlier discussion at #682).
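For reference, a minimal sketch of how one might extract hit assertions from those covered.*.txt files. This assumes a format where each source line is prefixed by marker characters and a "|" separator, with "*" among the markers when the line was executed; the exact marker set may differ between Echidna versions, so check your version's output before relying on this.

```python
import re

def hit_assertions(covered_text: str) -> set:
    """Return the AssertionFailed message strings found on executed lines.

    Assumes each line of covered.*.txt looks like '<markers>|<source line>',
    with '*' among the markers when the line was executed (an assumption
    about the coverage-file format, not documented in this thread).
    """
    hits = set()
    for line in covered_text.splitlines():
        marker, sep, src = line.partition("|")
        if sep and "*" in marker:
            m = re.search(r'AssertionFailed\("([^"]+)"\)', src)
            if m:
                hits.add(m.group(1))
    return hits

sample = (
    '*   |        emit AssertionFailed("1"); assert(false);\n'
    '    |        emit AssertionFailed("2"); assert(false);\n'
)
print(hit_assertions(sample))  # only assertion "1" was executed -> {'1'}
```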

I also compared shrink limit 0 with the default (5000) and did not observe a noticeable difference. I'm leaning towards simply keeping the default, but I'm also happy to use 0.

@wuestholz
Author

Quick update: I also tried to set seqLen to 30 (like I did initially for Foundry). The performance was slightly worse than with 100 (the default). I'm leaning towards keeping the default, but I'll probably also run an experiment with 200.

@ggrieco-tob
Member

You can also try using echidna-parade, which uses swarm testing to combine different configurations of echidna in order to get more coverage.

@wuestholz
Author

Thanks for the suggestion! I'll see if I can make it work. I'm still trying to set up Hybrid-Echidna... :)

The increase from 100 to 200 did not have a significant performance impact.

@crytic crytic locked and limited conversation to collaborators May 25, 2023
@arcz arcz converted this issue into discussion #1059 May 25, 2023

