Rewrite cache synchronization to lock instead of spin #21124

Merged
merged 1 commit into from
Jun 5, 2020

Member

@roji roji commented Jun 3, 2020

Closes #18516

tl;dr While results aren't very conclusive, this PR replaces spinning with an equivalent lock-based approach.

  • The objective here is to remove the spinning loop that occurs when another thread is already compiling our query, and to generally simplify synchronization.
  • I checked replacing the spinning loop with two things:
    • LOCKING: A proper lock; this keeps the behavior where multiple threads don't compile the same query, but rather the first one compiles and the others wait for it (just not via spinning).
    • NOSYNC: No synchronization, so multiple threads that happen to execute an un-compiled query compile in parallel.
  • I used two benchmarking scenarios (the MemoryCache is compacted/reset at the beginning of each invocation):
    • Simply spin up 16 threads which execute the same heavy-ish query
    • Spin up 1 thread, wait a bit, then execute 15 more.
  • The results are a bit inconclusive: benchmarking scenarios like this is very messy, as the threads interfere with each other, etc. But I was able to generate a scenario where locking improved perf a bit. I'm not convinced this is important, but as @smitpatel argued for this and the implementation is simple, I went for that.
  • Unrelated: this PR also removes ICompiledQueryCache.GetOrAddAsync which isn't used anywhere (and is a bit more complicated to implement with locking). I don't think there is a justification for a compiled query cache which performs I/O as part of its job...
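For readers who want a concrete picture of the LOCKING option described in the list above, here is a minimal sketch. It is not the code from this PR: `LockingCache`, its dictionaries, and the `compile` delegate are hypothetical stand-ins, and a plain `ConcurrentDictionary` replaces the `MemoryCache` so the example only needs the base class library. It shows the first caller compiling while later callers block on a per-key lock instead of spinning.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a compiled query cache; NOT EF Core's implementation.
public sealed class LockingCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _entries =
        new ConcurrentDictionary<TKey, TValue>();

    // One lock object per key, so unrelated queries never block each other.
    // Assumption for brevity: lock objects are never removed in this sketch.
    private readonly ConcurrentDictionary<TKey, object> _locks =
        new ConcurrentDictionary<TKey, object>();

    public TValue GetOrAdd(TKey key, Func<TValue> compile)
    {
        // Fast path: the entry already exists, so no lock is taken.
        if (_entries.TryGetValue(key, out var cached))
            return cached;

        var keyLock = _locks.GetOrAdd(key, _ => new object());
        lock (keyLock)
        {
            // Re-check inside the lock: another thread may have finished
            // compiling while this one was blocked.
            if (_entries.TryGetValue(key, out cached))
                return cached;

            var value = compile();
            _entries[key] = value;
            return value;
        }
    }
}
```

Dropping the `lock` block and calling `compile()` directly after the failed lookup would give the NOSYNC behavior, where several threads may compile the same entry in parallel. A real cache would also need to clean up the per-key lock objects, which this sketch leaves out.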
Benchmark code
[Benchmark]
public virtual async Task MultipleThreadsNoDelay()
{
    _memoryCache.Compact(100); // Clear the cache between invocations

    for (var i = 0; i < 16; i++)
        _tasks[i] = Task.Run(ExecuteQuery);

    await Task.WhenAll(_tasks);
}

[Benchmark]
public virtual async Task MultipleThreadsDelay()
{
    _memoryCache.Compact(100); // Clear the cache between invocations

    _tasks[0] = Task.Run(ExecuteQuery);
    await Task.Delay(60);

    for (var i = 1; i < 16; i++)
        _tasks[i] = Task.Run(ExecuteQuery);

    await Task.WhenAll(_tasks);
}

async Task<List<Customer>> ExecuteQuery()
{
    using var context = _fixture.CreateContext(_serviceProvider);
    return await context.Customers
        .AsNoTracking()
        .Include(c => c.Orders)
        .ThenInclude(o => o.OrderLines)
        .ThenInclude(ol => ol.Product)
        .ToListAsync();
}
Benchmark results
### LOCKING, NODELAY

-------------------- Histogram --------------------                                                                                          
[ 41.421 ms ;  59.439 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                                                                                  
[ 59.439 ms ;  77.456 ms) |                              
[ 77.456 ms ;  92.401 ms) | @                                         
[ 92.401 ms ; 110.419 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[110.419 ms ; 121.052 ms) | @@                            
[121.052 ms ; 140.837 ms) |                              
[140.837 ms ; 158.855 ms) | @@@@@@@@@@@                                                                                                      
[158.855 ms ; 179.837 ms) | @                                                                                                                
[179.837 ms ; 191.093 ms) |                                                                                                                  
[191.093 ms ; 209.110 ms) | @@@@                          
[209.110 ms ; 223.323 ms) | @                            
---------------------------------------------------         
                                                                                                                                             
// * Summary *                                                                                                                               
                                                                                                                                             
BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04                             
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores                                                                        
.NET Core SDK=5.0.100-preview.6.20266.3                   
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
                                                                      
                                                                      
|                 Method |     Mean |    Error |   StdDev |
|----------------------- |---------:|---------:|---------:|
| MultipleThreadsNoDelay | 94.62 ms | 16.87 ms | 44.73 ms |

### LOCKING, DELAY

-------------------- Histogram --------------------
[ 46.843 ms ;  86.274 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@
[ 86.274 ms ; 117.981 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[117.981 ms ; 163.849 ms) | @@@@@@@@@@@@
[163.849 ms ; 200.926 ms) | @@@@@
[200.926 ms ; 222.260 ms) | @
[222.260 ms ; 253.967 ms) | @@@@@
[253.967 ms ; 277.726 ms) | 
[277.726 ms ; 309.433 ms) | @@@@@@@@@@@@@
[309.433 ms ; 343.544 ms) | @@@
[343.544 ms ; 375.251 ms) | @
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|               Method |     Mean |    Error |   StdDev |   Median |
|--------------------- |---------:|---------:|---------:|---------:|
| MultipleThreadsDelay | 146.5 ms | 28.69 ms | 83.25 ms | 114.1 ms |



### LOCKING_OLD, NODELAY

-------------------- Histogram --------------------
[ 43.083 ms ;  58.162 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@
[ 58.162 ms ;  73.240 ms) | 
[ 73.240 ms ;  81.958 ms) | 
[ 81.958 ms ;  91.323 ms) | @@@@@@
[ 91.323 ms ; 106.402 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[106.402 ms ; 121.480 ms) | 
[121.480 ms ; 136.559 ms) | 
[136.559 ms ; 158.610 ms) | @@@@@@@@@@@@
[158.610 ms ; 173.689 ms) | 
[173.689 ms ; 188.767 ms) | 
[188.767 ms ; 200.078 ms) | 
[200.078 ms ; 216.918 ms) | @@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|                 Method |     Mean |    Error |   StdDev |
|----------------------- |---------:|---------:|---------:|
| MultipleThreadsNoDelay | 93.30 ms | 14.08 ms | 37.59 ms |


### LOCKING_OLD, DELAY

-------------------- Histogram --------------------
[ 29.367 ms ;  47.891 ms) | @@@@@@@
[ 47.891 ms ;  72.333 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[ 72.333 ms ;  96.776 ms) | 
[ 96.776 ms ; 121.218 ms) | 
[121.218 ms ; 145.661 ms) | 
[145.661 ms ; 170.103 ms) | 
[170.103 ms ; 205.960 ms) | @@@@@@@@@@@@@@@@@@
[205.960 ms ; 229.476 ms) | @@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|               Method |     Mean |    Error |   StdDev |   Median |
|--------------------- |---------:|---------:|---------:|---------:|
| MultipleThreadsDelay | 93.21 ms | 22.17 ms | 63.95 ms | 56.50 ms |


### NOSYNC, NODELAY

-------------------- Histogram --------------------
[ 57.361 ms ;  78.356 ms) | @@@@@@@@@@@
[ 78.356 ms ;  99.352 ms) | 
[ 99.352 ms ; 113.401 ms) | 
[113.401 ms ; 127.650 ms) | @@
[127.650 ms ; 148.646 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[148.646 ms ; 168.370 ms) | @@@@@@@@@@@
[168.370 ms ; 189.366 ms) | 
[189.366 ms ; 214.092 ms) | @@@@@@@
[214.092 ms ; 235.088 ms) | @@@@@@@@@@@@@@@@@@@@@@
[235.088 ms ; 250.854 ms) | @@@@@@
[250.854 ms ; 271.849 ms) | 
[271.849 ms ; 299.214 ms) | 
[299.214 ms ; 320.210 ms) | @
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|                 Method |     Mean |    Error |   StdDev |
|----------------------- |---------:|---------:|---------:|
| MultipleThreadsNoDelay | 165.7 ms | 19.16 ms | 54.36 ms |


### NOSYNC, DELAY

-------------------- Histogram --------------------
[ 40.168 ms ;  67.306 ms) | @@@@@@@@@
[ 67.306 ms ;  94.106 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[ 94.106 ms ; 112.470 ms) | @@@@@
[112.470 ms ; 139.269 ms) | 
[139.269 ms ; 166.069 ms) | 
[166.069 ms ; 185.753 ms) | 
[185.753 ms ; 206.214 ms) | @@@@@@
[206.214 ms ; 233.014 ms) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[233.014 ms ; 257.876 ms) | @@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT


|               Method |     Mean |    Error |   StdDev |   Median |
|--------------------- |---------:|---------:|---------:|---------:|
| MultipleThreadsDelay | 142.3 ms | 24.25 ms | 70.36 ms | 96.36 ms |

@roji roji requested review from smitpatel and AndriySvyryd June 3, 2020 12:06
@AndriySvyryd
Member

I hope that the benchmark code provided is only for illustration and memoryCache.Compact(100); wasn't actually run inside the benchmark

@AndriySvyryd
Member

The query itself should also be larger for the difference in performance to manifest. See #18022 for some real-world candidates

@roji
Member Author

roji commented Jun 5, 2020

I hope that the benchmark code provided is only for illustration and memoryCache.Compact(100); wasn't actually run inside the benchmark

I ran both. Running a setup/cleanup per method invocation outside the method in BenchmarkDotNet can be done with IterationSetup, but that setting has a significant impact on the number of iterations etc. Since the cache only ever has one entry, the impact should be pretty negligible (and I didn't dive too deep into tweaking BDN). If you'd like to see more data I can also loop inside the function to reduce the impact of the clearing. At the end of the day I also don't think it matters too much: I think we all agreed to remove the current spinning loop, and between the proposed locking and not doing any synchronization I don't think it matters that much.

The query itself should also be larger for the difference in performance to manifest. See #18022 for some real-world candidates

Not sure we call that a real-world candidate, more like a cartesian explosion nightmare we recommend avoiding :)

But if you think that's important for deciding what to do with this PR, let me know and I'll run that.

@AndriySvyryd
Member

AndriySvyryd commented Jun 5, 2020

Not sure we call that a real-world candidate, more like a cartesian explosion nightmare we recommend avoiding :)

Yes, I'm talking only about the query compilation. The database shouldn't have any data for the benchmark

@smitpatel
Contributor

I think a more suitable example would be dynamic code generation with continuous AND conditions forming a skewed binary tree, where perhaps only 1 constant value at the very right changes. That gives a really large expression tree (which #18022 actually does not) and at the same time cache miss.
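As an illustration of the kind of input described in the comment above, the sketch below builds a left-deep chain of `AndAlso` nodes in which only the right-most constant differs between calls. The class name `SkewedPredicate`, the `x != constant` leaf shape, and the parameters are assumptions made for the example; the thread does not specify them.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical helper; the leaf shape (x != constant) is an assumption.
public static class SkewedPredicate
{
    // Builds ((x != 0 && x != 1) && ... ) && x != lastConstant as a left-deep tree.
    public static Expression<Func<int, bool>> Build(int terms, int lastConstant)
    {
        if (terms < 2)
            throw new ArgumentOutOfRangeException(nameof(terms));

        var x = Expression.Parameter(typeof(int), "x");
        Expression body = Expression.NotEqual(x, Expression.Constant(0));

        // Every new condition becomes the right child of a new root,
        // so the tree grows by one level per term.
        for (var i = 1; i < terms - 1; i++)
            body = Expression.AndAlso(body, Expression.NotEqual(x, Expression.Constant(i)));

        // Only this right-most leaf changes from one call to the next.
        body = Expression.AndAlso(body, Expression.NotEqual(x, Expression.Constant(lastConstant)));

        return Expression.Lambda<Func<int, bool>>(body, x);
    }
}
```

Calling `Build(terms, n)` with a different `n` each time yields trees that are identical except for one leaf, which is the combination of a large tree and a differing tree that the comment asks for. Very large `terms` values produce very deep trees, so a benchmark would need to pick a depth that the visitor and compiler can handle.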

@AndriySvyryd
Member

AndriySvyryd commented Jun 5, 2020

and at the same time cache miss.

I don't think this scenario is common enough. We could pregenerate the large skewed expression tree and use the same one in each iteration.

@roji
Member Author

roji commented Jun 5, 2020

We can certainly spend a lot of time tweaking and measuring this scenario, and I'll do the work if you guys think it's justified. Note also that the current PR does retain the locking (like @smitpatel originally wanted) - it just does it in a much better way (blocking on the lock instead of spinning). Unless someone strongly feels that the proposed locking mechanism is somehow bad, I don't think there's much value in continuing to investigate and benchmark this.

Let me know.

@roji
Member Author

roji commented Jun 5, 2020

One more note - adding an entry to the cache and then compacting takes less than two microseconds. That means it's really quite negligible, and including Compact inside the benchmark should be fine (no need to work extra hard to generate different expression trees etc.).

BenchmarkDotNet=v0.12.0, OS=ubuntu 20.04
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.6.20266.3
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

|              Method |     Mean |     Error |    StdDev |
|-------------------- |---------:|----------:|----------:|
| MemoryAddAndCompact | 1.756 us | 0.0315 us | 0.0263 us |
Benchmark
public class Program
{
    MemoryCache _cache;

    [GlobalSetup]
    public void Setup()
    {
        _cache = new MemoryCache(new MemoryCacheOptions { SizeLimit = 10240 });
    }

    [Benchmark]
    public void MemoryAddAndCompact()
    {
        _cache.Set("someKey", "somevalue", new MemoryCacheEntryOptions { Size = 10 });
        _cache.Compact(100);
    }

    static void Main(string[] args)
        => BenchmarkRunner.Run<Program>();
}

Member

@AndriySvyryd AndriySvyryd left a comment

I think this is an improvement, but it can still be improved further

@roji
Member Author

roji commented Jun 5, 2020

@AndriySvyryd @smitpatel can you provide more details on what you'd like to see? Our original discussion was between two options: preventing concurrent compilation of the same query via locks (which is what this PR does), or removing concurrent-compilation prevention altogether. What could we be doing better here?

@smitpatel
Contributor

What I would like to see is,
some perf numbers which proves that we are improving something here, or an article which details both the types of sync mechanism with pros/cons of both to determine if this is really needed change.
I am not against changing this code, but it is a crucial code path in query which has worked for years without any customer issue of any kind, so we should do some decent research before we make the change.
Since @AndriySvyryd approved the PR, I am fine taking his word that this is an improvement, and I am OK with merging the current change set. I do not have any other pattern to suggest.

@AndriySvyryd
Member

You could decompose lock into Monitor calls to save on some calls and perhaps allocations, but thinking about it I realized that it would introduce significant complexity for minor perf gain in corner cases, so you can stop thinking about this after this PR is in. Fixing #12905 would probably be better
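For context on the suggestion above, a C# `lock` statement is compiled into a `Monitor.Enter`/`Monitor.Exit` pair wrapped in `try`/`finally`. The sketch below puts the two forms side by side; the class, field, and method names are hypothetical, and the thread does not say which calls or allocations the decomposition was meant to save.

```csharp
using System.Threading;

// Hypothetical example; names are illustrative only.
public static class LockExpansion
{
    private static readonly object Gate = new object();
    private static int _counter;

    // The form the PR keeps: a plain lock statement.
    public static int IncrementWithLock()
    {
        lock (Gate)
        {
            return ++_counter;
        }
    }

    // Roughly what the compiler emits for the lock statement above.
    public static int IncrementWithMonitor()
    {
        var lockTaken = false;
        try
        {
            Monitor.Enter(Gate, ref lockTaken);
            return ++_counter;
        }
        finally
        {
            // Release only if Enter actually acquired the lock.
            if (lockTaken)
                Monitor.Exit(Gate);
        }
    }
}
```

Writing the `Monitor` calls by hand allows the enter and exit to be placed separately or combined with `Monitor.TryEnter`, at the cost of the extra complexity the comment mentions.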

@roji
Member Author

roji commented Jun 5, 2020

@AndriySvyryd I agree, the scenario where a query isn't already cached, and is being compiled by another thread, really, really doesn't seem like it's worth optimizing to this level.

@roji roji merged commit fb28b56 into master Jun 5, 2020
@roji roji deleted the CacheSync branch June 5, 2020 17:34
@roji
Member Author

roji commented Jun 5, 2020

some perf numbers which proves that we are improving something here, or an article which details both the types of sync mechanism with pros/cons of both to determine if this is really needed change.

@smitpatel the basic thing here is to avoid uncontrolled spin looping, which is a bad idea in almost any scenario. This is less about actual, visible perf (although I believe that's also relevant), and more about safety: if in any way our compilation blocks or takes a very long time (we've had several bugs like this in the history of EF), that means other threads are occupying CPU cores with 100% spin loops. Basically, we should never, ever spin without at least using something like SpinWait, which spins for a while and then switches to waiting. In any case that isn't really relevant here.
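To make the `SpinWait` remark above concrete, here is a small sketch of a wait loop that uses `System.Threading.SpinWait` instead of an unbounded busy loop. The helper `BoundedSpin.WaitUntil` and its `condition` delegate are hypothetical; this is not code from EF Core, and the merged PR uses a lock rather than this pattern.

```csharp
using System;
using System.Threading;

// Hypothetical helper showing SpinWait in place of a raw `while (!done) { }` loop.
public static class BoundedSpin
{
    // Waits until condition() returns true and reports how many times it spun.
    public static int WaitUntil(Func<bool> condition)
    {
        var spinner = new SpinWait();
        while (!condition())
        {
            // Busy-spins for the first few calls, then starts yielding the
            // thread so a long wait does not pin a core at 100%.
            spinner.SpinOnce();
        }

        return spinner.Count;
    }
}
```

A caller would pass something like `() => Volatile.Read(ref flag) == 1`. If the awaited work blocks or takes a long time, the waiting thread backs off instead of occupying a CPU core, which is the safety concern raised in the comment.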

Successfully merging this pull request may close these issues.

Query: Revisit caching synchronization mechanism