-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configurable Retry Logic #966
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to check in the two DLLs?
...ft.Data.SqlClient/tests/ManualTests/SQL/RetryLogic/Resources/CustomConfigurableRetryLogic.cs
Outdated
Show resolved
Hide resolved
If you mean |
I've been having trouble with the tests due to the new default use of the retry policy. It looks like if no policy is set and the connection can't be made (for example a bad connection string) you can end up in an infinite retry loop which stops the manual tests dead where they used to just fail the individual test and continue. I've also hit a case trying to debug where a config property is passed to the retry policy setup and assert that it isn't null but at runtime the null is accepted, since we don't run the tests with debug builds this will never be hit so either it's missing an important case that should be tested or it's a worthless assert. |
It's not clear to me under what circumstances it will go to an infinite loop. Could you provide more step-by-step information, please?
Suppose you're pointing to the |
Running the manual tests locally will always fail in the retry logic at the moment. Master, netcore2.1, build run and wait for it to just stop in the exponential retry test. |
Try breaking your named pipes connection string. Then it'll hit the infinite loop with an exponential backoff. In order to make it actually work you'll have to update the manual tests project to copy the config file on build, at the moment it doesn't and that's caused me a lot of problems. Named pipes didn't work on my machine for some reason and over the weekend I uninstalled cleaned and then reinstalled SqlServer. So I'm setting everything up from clean, if you can't connect using "np:localhost" then previously some tests failed but it was ok because I knew which ones they were. Now I can't complete a test run because it hits a loop. Yes, I should fix my machine and I'm working on that but having the ability for the tests to fall into an endless loop doesn't seem like a good idea anyway. A couple of things I noticed while looking at the retry code as it is as the moment in master; Lines 73 to 76 in 554c015
This is problematic because it causes an exception on any OpenAsync call which doesn't have a policy, which at the moment is all of them. Instead of null check then throw could you just if and jump to using the default? On that call stack that leads to that there's this call: SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs Lines 1459 to 1462 in 554c015
Which is going to cause an instance delegate to be allocated each time it's invoked. Could that be changed to a lambda which takes the token and connection as a params and then the compiler can use a hidden static delegate. |
Call Stack:
|
I used the following config to run the tests:
|
That's great if "np:localhost" works on your machine you'll have no problems. Change it to something that you know won't work and then run the manual tests, they'll hang. |
+ some improvements + fix test issues
Ok, i've got tests working so ignore that problem for now. |
I'm also in agreement with @Wraith2 's comment above and would suggest something like: public override object ExecuteScalar()
{
...
if (IsRetryOptedIn) // Check if user enabled App Context switch here
{
ds = RetryLogicProvider.Execute(this, () => RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true));
}
else
{
ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
} [Similarly for other APIs] I think the check for App Context switch that happens inside the Reliability layer is too late, but based on perf results I didn't consider that significant before. But I agree if we can avoid it's after-effects, that should be preferred way to go. So until users opt-in, they should not get affected by any means. For other reliability implementation inside SqlConnection too, we should be checking App context switch earliest before changing any flow, if it's avoidable. |
I'm prototyping adding a property `bool SqlRetryLogicBaseProvider.IsEnabled {get; }" and then set it to false in the no-op implementation. Then in the callsites if we updated to langversion 7.0 we could use local functions to avoid the closure entirely unless it's called, e.g.: public override Task OpenAsync(CancellationToken cancellationToken)
{
if (!RetryLogicProvider.Enabled)
{
return InternalOpenAsync(cancellationToken);
}
else
{
return OpenAsyncRetry(cancellationToken);
}
Task OpenAsyncWithRetry(CancellationToken token)
{
return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
}
} I don't know if the complier is smart enough to eliminate the closure allocation in the version you proposed @cheenamalhotra |
It will need to be split out into a second function. The compiler hoists the closure into the scope of the containing function to ensure that the variables contained within the closure are always accessed through it. Compare these two: With Local Function: public override Task OpenAsync(CancellationToken cancellationToken)
{
if (!RetryLogicProvider.Enabled)
{
return InternalOpenAsync(cancellationToken);
}
else
{
return OpenAsyncWithRetry(cancellationToken);
}
Task OpenAsyncWithRetry(CancellationToken token)
{
return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(token), token);
}
}
public override Task OpenAsync(CancellationToken cancellationToken)
{
if (!RetryLogicProvider.Enabled)
{
return InternalOpenAsync(cancellationToken);
}
return <OpenAsync>g__OpenAsyncWithRetry|148_0(cancellationToken);
}
[CompilerGenerated]
private Task <OpenAsync>g__OpenAsyncWithRetry|148_0(CancellationToken token)
{
<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
<>c__DisplayClass148_.<>4__this = this;
<>c__DisplayClass148_.token = token;
return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__1), <>c__DisplayClass148_.token);
} With Inline Call: public override Task OpenAsync(CancellationToken cancellationToken)
{
if (!RetryLogicProvider.Enabled)
{
return InternalOpenAsync(cancellationToken);
}
else
{
return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
}
}
public override Task OpenAsync(CancellationToken cancellationToken)
{
<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
<>c__DisplayClass148_.<>4__this = this;
<>c__DisplayClass148_.cancellationToken = cancellationToken;
if (!RetryLogicProvider.Enabled)
{
return InternalOpenAsync(<>c__DisplayClass148_.cancellationToken);
}
return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__0), <>c__DisplayClass148_.cancellationToken);
} Which shows that only if you split out the call do you avoid the allocations. I like the local function approach because it keeps the allocating call wrapper private to the callsite. I'm not wild about the compiler naming convention. I could be convinced that it's better as a full private function. |
In that case, we can pass method delegate to a common method too (compacted version): public override Task OpenAsync(CancellationToken cancellationToken) =>
(!IsRetryEnabled ? InternalOpenAsync(cancellationToken) :
DoTaskWithRetry(InternalOpenAsync(cancellationToken), cancellationToken));
// This can also go in a Util class so SqlCommand can also use this.
internal Task DoTaskWithRetry(Task task, CancellationToken token) =>
RetryLogicProvider.ExecuteAsync(this, () => task, token);
// Stays with other App Context Flags or in Utils
internal bool IsRetryEnabled = <TryGetSwitch>; Similar redirect APIs can be written for other RetryLogicProvider Execute APIs, so that we can prevent the default RetryLogicProvider from loading and control it externally. |
I'm not fully following the discussion, but if the concern here is about heap allocations introduced because of closures, then that can be avoided without introducing a special opt-in flag. Make your lambdas static (via the new C# 9 syntax), and pass whatever objects you need to reference as parameters (including |
See dotnet/efcore#24208 for an example of a similar change in EF Core. |
I looked into doing that @roji but that would need changes to the callsites in I'd like to get c#9 in the library because it would make the static delegates I add a lot cleaner, they'd be compiler generated instead of me littering them all over the place manually. I talked to roslyn team and they say it's not supported but that it will work and won't generate invalid il that netfx can't use as long as we don't use features that require runtime support, static delegates do not require runtime support. |
Hi @roji This is a preview feature and is not meant to be consumable by default. Hence it's controlled with App Context switch which is going to be disabled by default. Our intention is to not change driver behavior for user applications that don't want to use Retry Logic Provider right now, including any visible allocations/performance cost. We're trying to move AppContext switch checks from internal provider layer to driver layers where this feature begins to allocate resources that are not needed. |
Apologies if I'm misunderstanding things - I really am missing a lot of the context. This would indeed require adding a RetryLogicProvider.Execute overload which accepts a "state" parameter and passes it to the lambda; this overload could then be called from OpenAsync, ExecuteScalar or anywhere else. I understand that RetryLogicProvider is an abstraction meant to be passed from the outside - FWIW methods accepting both a lambda and a state parameter to be passed to that lambda are very common in .NET precisely to avoid the heap allocation issue. The fact that people are exposed to that parameter doesn't seem problematic to me - part of the contract of a retrying provider would simply be for the implementation to pass that parameter when invoking the lambda. BTW I'd suggest taking a look at what we've done around this in EF Core, the concepts are very similar (e.g. an abstraction that's meant to be implemented by external users or providers). Re changing driver behavior, if you have a default NoopRetryLogicProvider that simply executes the given lambda once, everything should just work well, with no behavior changes or perf regressions. But if there's some reason why you can't introduce the new overload, then the above suggestions do seem right. Re C# 9 static lambdas, that is only an optional thing in this context - as long as a lambda doesn't refer to anything other than its parameters, it wouldn't be a closure and trigger any heap allocations. The C# 9 feature only helps avoid accidentally introducing such a reference, but as long as you're willing to introduce the new overload above the problem should go away. |
It requires 2 state parameters, this and cancelationToken, which means a tuple or some custom context object. the instance delegate contains the this ref so you only see one stat parameter but if we make it static as you suggest (and i'd have liked) then we have to pass in two params. Ultimately for an async call to be retried we're going to have to save the state on the heap sometime so closures are likely to be somewhere i'd just like to move them into a path that is only taken if the user wants to have retry logic active. There are worse cases, ExecuteReaderAsync will need 4 state params. |
You can just put both in a simple value tuple (see dotnet/efcore#24208 for examples) - simple terse syntax, no allocations.
I don't quite follow this... the heap allocations from the closures are unrelated (and in addition to) any async state - unless you're doing something I'm not aware of. Basically it seems that the closure allocations don't have to happen in any path (regardless of whether the feature is turned on or not), plus the code would be considerably simpler too. |
Valuetuples box with async because they're always either part of the async closure or they're passed in an object typed state parameter. If you strongly type the state params you end up with the sort of code we have in AsyncHelper which is complicated and still can't avoid allocations. We also can't use valuetuple in netfx without requiring a new non oob dependency which might be possible but isn't my decision to make.
To do any async call to signature other than |
Oh and another thing I noticed was that the retry api is using IEnumerator parameters and i think it should probably be using IEnumerable parameters. It's technically a breach of the rules to re-use an enumerator since there's no way to rewind them. An enumerable object can return itself multiple times it is a static implementation that is safe. |
Yes, I'm suggesting making the state parameter generic. I'm not sure what's complicated about that, and especially why it wouldn't avoid allocation; it's just a generic parameter that gets passed to a function. Again, if you look at the PR cited above, you can see it working and not allocating. This part of it has nothing to do with whether the method is async or not - it's just about whether your lambda is a closure or not.
OK.
I honestly don't understand what the above has to do with closure heap allocations. I also suspect you're talking about some SqlClient-specific detail of how you guys implement async - regular C# async/await methods have no single state object typed as an object, etc. I really think all this is irrelevant to what we're discussing. |
The static lambda will eliminate the delegate allocation. That seems like a good idea, compiler supported or not. I'd like to use the static keyword for safety and ease of use. Agreed on those points. How the closure of state is transported, compiler generated or manually, isn't important to me since we can't avoid it. If we use manual context it will change the published surface area of the retry provider. I assume that would be a bad thing. |
OK, thanks for listening on this. To summarize my point, ignoring SqlClient-specific constraints for a second, in C# it's quite easy to manage the above in a way that will be zero-allocation. Specifically in SqlClient, if you can't add the required public method to RetryLogicProvider for some reason (Is this type already public? I'm not seeing it in the docs), and/or cannot use value tuples and prefer not to use structs in their place, then that indeed cannot be done. Splitting your method into two, so that the closure occurs only when the feature flag is on, seems like the right way to restrict the closure allocations to when the feature flag is on. |
Great conversation! The intent of the initial prototype was indeed to avoid any allocation in case retry logic was not configured for connection opening and command execution (meaning if no config file section or explicit assignment in code was there). Setting aside the App Context switch, which in my mind should only exists during Preview, why the IsRetryEnable switch couldn't just check the condition above? |
The exact implementation can be whatever works best I just extended the existing enabled flag from the provider internals to the provider surface. If there's any easier and lower cost way to do the operation I'd be happy with it. In order to check if there's a config the loader logic has to do a fairly costly one-time check (does io) parses and instantiates the used defined provider with using reflection. That logic is hidden in the static |
That was exactly my idea. Isn't that the same in other config sections like for authentication provider? if i remember correctly we mentioned that in previous conversations. |
It may be the same. The authentication code addition predates my experience with this library and it isn't something that has come up with any of the performance changes I've made so I don't know much about it. As long as there is some way to avoid casual users paying a penalty for a feature they aren't using I'm not too bothered about the implementation detail. |
+ check the safety switch directly in SqlConnection and SqlCommand + improved memory allocation by adding a private method to avoid the instance delegate on each call + remove error 64, 20, and 0 from the list of default transient errors + test improvement
I just did a quick profile run on this and the additional changes to avoid allocations worked well. I can't see anything obviously attributable to retry logic when it isn't enabled. Good job @DavoudEshtehari |
@DavoudEshtehari question - why is error 4060 on the list of transient errors? I see comments in SMO and SSMS code that say 4060 means the database doesn't exist. |
I haven't reviewed the SSMS logic, but according to the documentation, error number 4060 is listed as one of the transient errors for Azure SQL Server. |
@shueybubbles IIRC, it's because, on startup, database availability can be delayed between the time a server starts accepting connections and the time a particular database is available. So, adding automatic retries reduces the incidences of connection failure when a server is starting up but a database is not yet available. |
Well, we need a new error value for the cloud case, then. It's pointless to keep trying when the server knows there no such database. IMO we have chosen the wrong default behaviors for cloud connections to missing/sleeping databases. |
Yes, sleeping databases is another issue. I think they prioritized making serverless as seamless as possible for existing apps to adopt and made some compromises that limit options. There is a good recent conversation on the EF repo about transient 4060 errors and create database actions: dotnet/efcore#33191 |
This is a complimentary PR to complete the remaining works, extend the feature, and address comments on PR #693.