You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a `&mut
> Cache` (thereby pushing synchronization to a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that it's worth pointing out that this issue can be worked
> around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally),
> but it will force each thread to use the "owner" optimization
> described above. This does mean, for example, that you can't
> share a `Regex` across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate or even an actual `thread_local!` or something else entirely.
I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple other heuristics. The stack is chosen
by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy,
but it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.
A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread where as the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hit the shared stack behind a mutex and
suffers majorly from contention.
Here's what that benchmark looks like before this PR for 64 threads (on a
24-core CPU).
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 128.3 ms, System: 5.7 ms]
Range (min … max): 7.7 ms … 11.1 ms 278 runs
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
Time (mean ± σ): 1.938 s ± 0.036 s [User: 4.827 s, System: 41.401 s]
Range (min … max): 1.885 s … 1.992 s 10 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```
And here's what it looks like after this PR:
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 127.6 ms, System: 6.2 ms]
Range (min … max): 7.9 ms … 11.7 ms 287 runs
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
Time (mean ± σ): 55.0 ms ± 5.1 ms [User: 1050.4 ms, System: 12.0 ms]
Range (min … max): 46.1 ms … 67.3 ms 57 runs
Summary
'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```
So instead of things getting over 215x slower in the 64 thread case, it
"only" gets 6x slower.
Closes#934
0 commit comments