Configurable Retry Logic #966

DavoudEshtehari · 2021-03-05T20:52:52Z

This is a complimentary PR to complete the remaining works, extend the feature, and address comments on PR #693.

David-Engel

Did you mean to check in the two DLLs?

...ft.Data.SqlClient/tests/ManualTests/SQL/RetryLogic/Resources/CustomConfigurableRetryLogic.cs

DavoudEshtehari · 2021-03-06T00:46:15Z

Did you mean to check in the two DLLs?

If you mean ClassLibrary_CustomConfigurableRetryLogic DLLs, I'd say they're customized retry logics compiled for .Net Framework and .Net Core to assess the external injection by test project for different .Net versions.

Wraith2 · 2021-03-07T21:23:33Z

I've been having trouble with the tests due to the new default use of the retry policy. It looks like if no policy is set and the connection can't be made (for example a bad connection string) you can end up in an infinite retry loop which stops the manual tests dead where they used to just fail the individual test and continue.

I've also hit a case trying to debug where a config property is passed to the retry policy setup and assert that it isn't null but at runtime the null is accepted, since we don't run the tests with debug builds this will never be hit so either it's missing an important case that should be tested or it's a worthless assert.

DavoudEshtehari · 2021-03-08T23:21:04Z

I've been having trouble with the tests due to the new default use of the retry policy. It looks like if no policy is set and the connection can't be made (for example a bad connection string) you can end up in an infinite retry loop which stops the manual tests dead where they used to just fail the individual test and continue.

It's not clear to me under what circumstances it will go to an infinite loop. Could you provide more step-by-step information, please?

I've also hit a case trying to debug where a config property is passed to the retry policy setup and assert that it isn't null but at runtime the null is accepted, since we don't run the tests with debug builds this will never be hit so either it's missing an important case that should be tested or it's a worthless assert.

Suppose you're pointing to the SqlRetryLogicProvider. In that case, generally, I assert null values in debug mode for internal usage to enforce an object creation at design time. Since this is an internal type and isn't going to be accessible directly, I've included the assert in debug mode to prevent null usage at design time; and this parameter won't be null. If you have another scenario, please include it here for further discussion.

Wraith2 · 2021-03-08T23:34:19Z

It's not clear to me under what circumstances it will go to an infinite loop. Could you provide more step-by-step information, please?

Running the manual tests locally will always fail in the retry logic at the moment. Master, netcore2.1, build run and wait for it to just stop in the exponential retry test.

DavoudEshtehari · 2021-03-09T00:39:24Z

Running the manual tests locally will always fail in the retry logic at the moment. Master, netcore2.1, build run and wait for it to just stop in the exponential retry test.

I couldn't find any issue locally.

Wraith2 · 2021-03-09T01:07:27Z

Try breaking your named pipes connection string. Then it'll hit the infinite loop with an exponential backoff. In order to make it actually work you'll have to update the manual tests project to copy the config file on build, at the moment it doesn't and that's caused me a lot of problems.

Named pipes didn't work on my machine for some reason and over the weekend I uninstalled cleaned and then reinstalled SqlServer. So I'm setting everything up from clean, if you can't connect using "np:localhost" then previously some tests failed but it was ok because I knew which ones they were. Now I can't complete a test run because it hits a loop. Yes, I should fix my machine and I'm working on that but having the ability for the tests to fall into an endless loop doesn't seem like a good idea anyway.

A couple of things I noticed while looking at the retry code as it is as the moment in master;

SqlClient/src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/Reliability/SqlConfigurableRetryLogicManager.CreateProvider.cs

Lines 73 to 76 in 554c015

    
           if (configSection == null) 
        
           { 
        
               throw SqlReliabilityUtil.ArgumentNull(nameof(configSection)); 
        
           }

This is problematic because it causes an exception on any OpenAsync call which doesn't have a policy, which at the moment is all of them. Instead of null check then throw could you just if and jump to using the default?

On that call stack that leads to that there's this call:

SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs

Lines 1459 to 1462 in 554c015

    
           public override Task OpenAsync(CancellationToken cancellationToken) 
        
           { 
        
               return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken); 
        
           }

Which is going to cause an instance delegate to be allocated each time it's invoked. Could that be changed to a lambda which takes the token and connection as a params and then the compiler can use a hidden static delegate.

Wraith2 · 2021-03-09T01:16:51Z

Call Stack:

 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(Microsoft.Data.SqlClient.ServerInfo serverInfo = {Microsoft.Data.SqlClient.ServerInfo}, string newPassword = "", System.Security.SecureString newSecurePassword = null, bool redirectedUserInstance = false, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.SqlClient.SqlCredential credential = null, Microsoft.Data.ProviderBase.TimeoutTimer timeout = {Microsoft.Data.ProviderBase.TimeoutTimer})	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(Microsoft.Data.ProviderBase.TimeoutTimer timeout, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions, Microsoft.Data.SqlClient.SqlCredential credential, string newPassword, System.Security.SecureString newSecurePassword, bool redirectedUserInstance)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.SqlInternalConnectionTds(Microsoft.Data.ProviderBase.DbConnectionPoolIdentity identity, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.SqlClient.SqlCredential credential = null, object providerInfo, string newPassword = "", System.Security.SecureString newSecurePassword = null, bool redirectedUserInstance = false, Microsoft.Data.SqlClient.SqlConnectionString userConnectionOptions, Microsoft.Data.SqlClient.SessionData reconnectSessionData, bool applyTransientFaultHandling = true, string accessToken, Microsoft.Data.ProviderBase.DbConnectionPool pool)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(Microsoft.Data.Common.DbConnectionOptions options, Microsoft.Data.Common.DbConnectionPoolKey poolKey, object poolGroupProviderInfo, Microsoft.Data.ProviderBase.DbConnectionPool pool, System.Data.Common.DbConnection owningConnection, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionFactory.CreateNonPooledConnection(System.Data.Common.DbConnection owningConnection = {Microsoft.Data.SqlClient.SqlConnection}, Microsoft.Data.ProviderBase.DbConnectionPoolGroup poolGroup, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(System.Data.Common.DbConnection owningConnection = {Microsoft.Data.SqlClient.SqlConnection}, System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry = null, Microsoft.Data.Common.DbConnectionOptions userOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.ProviderBase.DbConnectionInternal oldConnection = {Microsoft.Data.ProviderBase.DbConnectionClosedNeverOpened}, out Microsoft.Data.ProviderBase.DbConnectionInternal connection = null)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(System.Data.Common.DbConnection outerConnection = {Microsoft.Data.SqlClient.SqlConnection}, Microsoft.Data.ProviderBase.DbConnectionFactory connectionFactory = {Microsoft.Data.SqlClient.SqlConnectionFactory}, System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.TryOpen(System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry = null, Microsoft.Data.SqlClient.SqlConnectionOverrides overrides)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlRetryLogicProvider.Execute<bool>(object sender = {Microsoft.Data.SqlClient.SqlConnection}, System.Func<bool> function = {Method = {System.Reflection.RuntimeMethodInfo}})	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.Open(Microsoft.Data.SqlClient.SqlConnectionOverrides overrides)	Unknown	Non-user code. Skipped loading symbols.
 	ManualTests.dll!Microsoft.Data.SqlClient.ManualTesting.Tests.SqlCommandReliabilityTest.ConcurrentExecution(string cnnString = "Data Source=np:localhost;Initial Catalog=Northwind;Integrated Security=True;Pooling=False;Connect Timeout=5", Microsoft.Data.SqlClient.SqlRetryLogicBaseProvider provider = {Microsoft.Data.SqlClient.SqlRetryLogicProvider})	Unknown	Non-user code. Skipped loading symbols.
 	[Native to Managed Transition]		Annotated Frame
 	[Managed to Native Transition]		Annotated Frame
 	xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<System.__Canon>.CallTestMethod(object testClassInstance) Line 150	C#	Non-user code. Symbols loaded.
 	xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.InvokeTestMethodAsync.AnonymousMethod__1() Line 257	C#	Non-user code. Symbols loaded.
 	xunit.execution.dotnet.dll!Xunit.Sdk.ExecutionTimer.AggregateAsync(System.Func<System.Threading.Tasks.Task> asyncAction = {Method = {System.Reflection.RuntimeMethodInfo}}) Line 48	C#	Non-user code. Symbols loaded.

DavoudEshtehari · 2021-03-09T01:50:12Z

I used the following config to run the tests:

    "TCPConnectionString": "Data Source=tcp:localhost;Database=Northwind;Integrated Security=true;",
    "NPConnectionString": "Data Source=np:localhost;Database=Northwind;Integrated Security=true;",
    "TCPConnectionStringHGSVBS": "",
    "TCPConnectionStringAASVBS": "",
    "TCPConnectionStringAASSGX": "",
    "EnclaveEnabled": false,
    "TracingEnabled": false,
    "AADAuthorityURL": "",
    "AADPasswordConnectionString": "",
    "AADServicePrincipalId": "",
    "AADServicePrincipalSecret": "",
    "AzureKeyVaultURL": "",
    "AzureKeyVaultTenantId": "",
    "AzureKeyVaultClientId": "",
    "AzureKeyVaultClientSecret": "",
    "SupportsIntegratedSecurity": true,
    "SupportsLocalDb": false,
    "SupportsFileStream": false,
    "UseManagedSNIOnWindows": false,
    "IsAzureSynapse": false,
    "EnclaveAzureDatabaseConnString": "",
    "UserManagedIdentityObjectId": ""

I bring it from the build directory.

Wraith2 · 2021-03-09T01:55:18Z

That's great if "np:localhost" works on your machine you'll have no problems. Change it to something that you know won't work and then run the manual tests, they'll hang.

DavoudEshtehari · 2021-03-09T02:35:07Z

Results by invalid "NPConnectionString": "Data Source=np:localhost;Database=Northwind__;Integrated Security=true;"

for the SqlConnectionReliabilityTest:
for the SqlCommandReliabilityTest:

And for SqlConnectionReliabilityTest by "NPConnectionString": "Data Source=npt:localhost;Database=Northwind__;Integrated Security=true;":

Still can't find hang or infinite loop!
Here is the command to run the tests:
dotnet test "src\Microsoft.Data.SqlClient\tests\ManualTests\Microsoft.Data.SqlClient.ManualTesting.Tests.csproj" /p:Platform="AnyCPU" /p:Configuration="Release" /p:TestTargetOS=Windowsnetcoreapp --no-build -v n --filter "SqlConnectionReliabilityTest" /p:TargetNetCoreVersion=netcoreapp2.1

+ some improvements + fix test issues

Wraith2 · 2021-03-09T19:43:58Z

Ok, i've got tests working so ignore that problem for now.
I am a little concerned about the amount of lambdas this feature adds to common paths.

DavoudEshtehari · 2021-03-10T18:01:32Z

I couldn't find any concern according to the perf test result. As a general rule, there is always a chance of improvement. Let's see how we can improve it.

Wraith2 · 2021-03-10T18:53:13Z

It depends on your perspective and how you're measuring it. As you probably know by now I often use an implementation of the TechEmpower fortunes benchmark to measure performance and identify items that can be streamlined.

Here are a couple of examples of running it for 60 seconds:

OpenAsync

ExecuteReaderAsync:

Both OpenAsync and ExecuteReaderAsync are unavoidable methods for users. If you want to connect to the database and read data you're going to use them. The Retry machinery is also present on the sync versions so there will be equivalent additional allocations on those paths as well. The new retry feature is adding overhead to all users without any ability to opt out of it.

Now, you can argue that they aren't dominating allocations in the overall trace:

and you'd be right but that isn't the problem. All those allocations at the top of the trace are either unavoidable (all those task related things) or something I haven't found a way to eliminate yet. The top is major rework territory and we're not in a position to do anything about them at the moment. So I pick off the other things like minor string allocations and boxed guids which have relatively small impacts but those impacts add up. Every allocation we avoid is one small piece of extra time before the runtime runs a GC and the less GC that happens the more we can do per unit time translating to better throughput and lower cost in terms of energy and money.

Some of these things are relatively easy to fix. Some are more difficult or impossible. For example if the public surface area of SqlRetryLogicBaseProvider is fixed then there are some things that can't be optimized. We can do careful work to remove delegate and closure allocations from common paths like OpenAsync but it's a bit disheartening to see the work I've done so carefully over the years to slim down and optimize the library being consumed by a feature that most users will never see or use. I think users should pay for what they use and not what they don't unless it is unavoidable.

cheenamalhotra · 2021-03-10T19:14:25Z

I'm also in agreement with @Wraith2 's comment above and would suggest something like:

public override object ExecuteScalar()
{
...
    if (IsRetryOptedIn) // Check if user enabled App Context switch here
    {
        ds = RetryLogicProvider.Execute(this, () => RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true));
    }
    else
    {
        ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
    }

[Similarly for other APIs]

I think the check for App Context switch that happens inside the Reliability layer is too late, but based on perf results I didn't consider that significant before. But I agree if we can avoid it's after-effects, that should be preferred way to go. So until users opt-in, they should not get affected by any means. For other reliability implementation inside SqlConnection too, we should be checking App context switch earliest before changing any flow, if it's avoidable.

Wraith2 · 2021-03-10T19:29:26Z

I'm prototyping adding a property `bool SqlRetryLogicBaseProvider.IsEnabled {get; }" and then set it to false in the no-op implementation. Then in the callsites if we updated to langversion 7.0 we could use local functions to avoid the closure entirely unless it's called, e.g.:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return OpenAsyncRetry(cancellationToken);
            }

            Task OpenAsyncWithRetry(CancellationToken token)
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
            }
        }

I don't know if the complier is smart enough to eliminate the closure allocation in the version you proposed @cheenamalhotra

Wraith2 · 2021-03-10T22:25:50Z

It will need to be split out into a second function. The compiler hoists the closure into the scope of the containing function to ensure that the variables contained within the closure are always accessed through it. Compare these two:

With Local Function:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return OpenAsyncWithRetry(cancellationToken);
            }

            Task OpenAsyncWithRetry(CancellationToken token)
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(token), token);
            }
        }

public override Task OpenAsync(CancellationToken cancellationToken)
{
	if (!RetryLogicProvider.Enabled)
	{
		return InternalOpenAsync(cancellationToken);
	}
	return <OpenAsync>g__OpenAsyncWithRetry|148_0(cancellationToken);
}

[CompilerGenerated]
private Task <OpenAsync>g__OpenAsyncWithRetry|148_0(CancellationToken token)
{
	<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
	<>c__DisplayClass148_.<>4__this = this;
	<>c__DisplayClass148_.token = token;
	return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__1), <>c__DisplayClass148_.token);
}

With Inline Call:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
            }
        }

public override Task OpenAsync(CancellationToken cancellationToken)
{
	<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
	<>c__DisplayClass148_.<>4__this = this;
	<>c__DisplayClass148_.cancellationToken = cancellationToken;
	if (!RetryLogicProvider.Enabled)
	{
		return InternalOpenAsync(<>c__DisplayClass148_.cancellationToken);
	}
	return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__0), <>c__DisplayClass148_.cancellationToken);
}

Which shows that only if you split out the call do you avoid the allocations. I like the local function approach because it keeps the allocating call wrapper private to the callsite. I'm not wild about the compiler naming convention. I could be convinced that it's better as a full private function.

cheenamalhotra · 2021-03-10T23:20:28Z

In that case, we can pass method delegate to a common method too (compacted version):

    public override Task OpenAsync(CancellationToken cancellationToken) =>
        (!IsRetryEnabled ? InternalOpenAsync(cancellationToken) : 
            DoTaskWithRetry(InternalOpenAsync(cancellationToken), cancellationToken));

    // This can also go in a Util class so SqlCommand can also use this.
    internal Task DoTaskWithRetry(Task task, CancellationToken token) => 
        RetryLogicProvider.ExecuteAsync(this, () => task, token);

    // Stays with other App Context Flags or in Utils
    internal bool IsRetryEnabled = <TryGetSwitch>;

Similar redirect APIs can be written for other RetryLogicProvider Execute APIs, so that we can prevent the default RetryLogicProvider from loading and control it externally.

roji · 2021-03-10T23:21:01Z

I'm not fully following the discussion, but if the concern here is about heap allocations introduced because of closures, then that can be avoided without introducing a special opt-in flag. Make your lambdas static (via the new C# 9 syntax), and pass whatever objects you need to reference as parameters (including this).

roji · 2021-03-10T23:22:08Z

See dotnet/efcore#24208 for an example of a similar change in EF Core.

Wraith2 · 2021-03-10T23:31:57Z

I looked into doing that @roji but that would need changes to the callsites in SqlRetryLogicBaseProvider which is publicly exposed. Even if we wanted to change that class it would expose the people who wanted to implement their own retry provider to the internal workings because they'd be given access to the closure object rather than it being baked into the delegate. The way around that it to make the retry provider less generic and have context objects for each of the internal calls that needs retrying.

I'd like to get c#9 in the library because it would make the static delegates I add a lot cleaner, they'd be compiler generated instead of me littering them all over the place manually. I talked to roslyn team and they say it's not supported but that it will work and won't generate invalid il that netfx can't use as long as we don't use features that require runtime support, static delegates do not require runtime support.

cheenamalhotra · 2021-03-10T23:33:58Z

Hi @roji

This is a preview feature and is not meant to be consumable by default. Hence it's controlled with App Context switch which is going to be disabled by default. Our intention is to not change driver behavior for user applications that don't want to use Retry Logic Provider right now, including any visible allocations/performance cost.

We're trying to move AppContext switch checks from internal provider layer to driver layers where this feature begins to allocate resources that are not needed.

roji · 2021-03-10T23:47:03Z

Apologies if I'm misunderstanding things - I really am missing a lot of the context. This would indeed require adding a RetryLogicProvider.Execute overload which accepts a "state" parameter and passes it to the lambda; this overload could then be called from OpenAsync, ExecuteScalar or anywhere else. I understand that RetryLogicProvider is an abstraction meant to be passed from the outside - FWIW methods accepting both a lambda and a state parameter to be passed to that lambda are very common in .NET precisely to avoid the heap allocation issue. The fact that people are exposed to that parameter doesn't seem problematic to me - part of the contract of a retrying provider would simply be for the implementation to pass that parameter when invoking the lambda. BTW I'd suggest taking a look at what we've done around this in EF Core, the concepts are very similar (e.g. an abstraction that's meant to be implemented by external users or providers).

Re changing driver behavior, if you have a default NoopRetryLogicProvider that simply executes the given lambda once, everything should just work well, with no behavior changes or perf regressions. But if there's some reason why you can't introduce the new overload, then the above suggestions do seem right.

Re C# 9 static lambdas, that is only an optional thing in this context - as long as a lambda doesn't refer to anything other than its parameters, it wouldn't be a closure and trigger any heap allocations. The C# 9 feature only helps avoid accidentally introducing such a reference, but as long as you're willing to introduce the new overload above the problem should go away.

Wraith2 · 2021-03-10T23:50:57Z

It requires 2 state parameters, this and cancelationToken, which means a tuple or some custom context object. the instance delegate contains the this ref so you only see one stat parameter but if we make it static as you suggest (and i'd have liked) then we have to pass in two params. Ultimately for an async call to be retried we're going to have to save the state on the heap sometime so closures are likely to be somewhere i'd just like to move them into a path that is only taken if the user wants to have retry logic active.

There are worse cases, ExecuteReaderAsync will need 4 state params.

roji · 2021-03-11T00:02:02Z

It requires 2 state parameters, this and cancelationToken, which means a tuple or some custom context object.

You can just put both in a simple value tuple (see dotnet/efcore#24208 for examples) - simple terse syntax, no allocations.

Ultimately for an async call to be retried we're going to have to save the state on the heap sometime so closures are likely to be somewhere i'd just like to move them into a path that is only taken if the user wants to have retry logic active.

I don't quite follow this... the heap allocations from the closures are unrelated (and in addition to) any async state - unless you're doing something I'm not aware of. Basically it seems that the closure allocations don't have to happen in any path (regardless of whether the feature is turned on or not), plus the code would be considerably simpler too.

Wraith2 · 2021-03-11T00:10:35Z

Valuetuples box with async because they're always either part of the async closure or they're passed in an object typed state parameter. If you strongly type the state params you end up with the sort of code we have in AsyncHelper which is complicated and still can't avoid allocations. We also can't use valuetuple in netfx without requiring a new non oob dependency which might be possible but isn't my decision to make.

the heap allocations from the closures are unrelated (and in addition to) any async state

To do any async call to signature other than static void fn() you require at least one piece of state. In order to suspend and allow threadless awaiting that information will always end up on the heap. The runtime only provides the ability to pass a single state object (unless you can conceal one in an TCS and recover it through task.AsyncState) and it is typed as object so any valuetype will box. There are no ways around this if you end up with a non synchronous completion.

Wraith2 · 2021-03-11T00:13:55Z

Oh and another thing I noticed was that the retry api is using IEnumerator parameters and i think it should probably be using IEnumerable parameters. It's technically a breach of the rules to re-use an enumerator since there's no way to rewind them. An enumerable object can return itself multiple times it is a static implementation that is safe.

roji · 2021-03-11T00:18:38Z

Valuetuples box with async because they're always either part of the async closure or they're passed in an object typed state parameter. If you strongly type the state params you end up with the sort of code we have in AsyncHelper which is complicated and still can't avoid allocations.

Yes, I'm suggesting making the state parameter generic. I'm not sure what's complicated about that, and especially why it wouldn't avoid allocation; it's just a generic parameter that gets passed to a function. Again, if you look at the PR cited above, you can see it working and not allocating. This part of it has nothing to do with whether the method is async or not - it's just about whether your lambda is a closure or not.

We also can't use valuetuple in netfx without requiring a new non oob dependency which might be possible but isn't my decision to make.

OK.

the heap allocations from the closures are unrelated (and in addition to) any async state

To do any async call to signature other than static void fn() you require at least one piece of state. In order to suspend and allow threadless awaiting that information will always end up on the heap. The runtime only provides the ability to pass a single state object (unless you can conceal one in an TCS and recover it through task.AsyncState) and it is typed as object so any valuetype will box. There are no ways around this if you end up with a non synchronous completion.

I honestly don't understand what the above has to do with closure heap allocations. I also suspect you're talking about some SqlClient-specific detail of how you guys implement async - regular C# async/await methods have no single state object typed as an object, etc. I really think all this is irrelevant to what we're discussing.

Wraith2 · 2021-03-11T00:45:22Z

The static lambda will eliminate the delegate allocation. That seems like a good idea, compiler supported or not. I'd like to use the static keyword for safety and ease of use. Agreed on those points.

How the closure of state is transported, compiler generated or manually, isn't important to me since we can't avoid it. If we use manual context it will change the published surface area of the retry provider. I assume that would be a bad thing.

roji · 2021-03-11T00:55:02Z

OK, thanks for listening on this. To summarize my point, ignoring SqlClient-specific constraints for a second, in C# it's quite easy to manage the above in a way that will be zero-allocation.

Specifically in SqlClient, if you can't add the required public method to RetryLogicProvider for some reason (Is this type already public? I'm not seeing it in the docs), and/or cannot use value tuples and prefer not to use structs in their place, then that indeed cannot be done. Splitting your method into two, so that the closure occurs only when the feature flag is on, seems like the right way to restrict the closure allocations to when the feature flag is on.

scoriani · 2021-03-11T07:54:03Z

In that case, we can pass method delegate to a common method too (compacted version):

    public override Task OpenAsync(CancellationToken cancellationToken) =>
        (!IsRetryEnabled ? InternalOpenAsync(cancellationToken) : 
            DoTaskWithRetry(InternalOpenAsync(cancellationToken), cancellationToken));

    // This can also go in a Util class so SqlCommand can also use this.
    internal Task DoTaskWithRetry(Task task, CancellationToken token) => 
        RetryLogicProvider.ExecuteAsync(this, () => task, token);

    // Stays with other App Context Flags or in Utils
    internal bool IsRetryEnabled = <TryGetSwitch>;

Similar redirect APIs can be written for other RetryLogicProvider Execute APIs, so that we can prevent the default RetryLogicProvider from loading and control it externally.

Great conversation! The intent of the initial prototype was indeed to avoid any allocation in case retry logic was not configured for connection opening and command execution (meaning if no config file section or explicit assignment in code was there). Setting aside the App Context switch, which in my mind should only exists during Preview, why the IsRetryEnable switch couldn't just check the condition above?

Wraith2 · 2021-03-11T08:39:31Z

Setting aside the App Context switch, which in my mind should only exists during Preview, why the IsRetryEnable switch couldn't just check the condition above?

The exact implementation can be whatever works best I just extended the existing enabled flag from the provider internals to the provider surface. If there's any easier and lower cost way to do the operation I'd be happy with it.

In order to check if there's a config the loader logic has to do a fairly costly one-time check (does io) parses and instantiates the used defined provider with using reflection. That logic is hidden in the static RetryLogicProvider internals and I'm not sure there's a low overhead way to do that and meet the hard requirement that this feature should require no calling code changes to be configurable. My view was that the lazy init had to happen sometime so it may as well happen on the first connection attempt and have it over with.

scoriani · 2021-03-11T09:54:11Z

In order to check if there's a config the loader logic has to do a fairly costly one-time check (does io) parses and instantiates the used defined provider with using reflection. That logic is hidden in the static RetryLogicProvider internals and I'm not sure there's a low overhead way to do that and meet the hard requirement that this feature should require no calling code changes to be configurable. My view was that the lazy init had to happen sometime so it may as well happen on the first connection attempt and have it over with.

That was exactly my idea. Isn't that the same in other config sections like for authentication provider? if i remember correctly we mentioned that in previous conversations.

Wraith2 · 2021-03-11T11:37:52Z

It may be the same. The authentication code addition predates my experience with this library and it isn't something that has come up with any of the performance changes I've made so I don't know much about it. As long as there is some way to avoid casual users paying a penalty for a feature they aren't using I'm not too bothered about the implementation detail.

+ check the safety switch directly in SqlConnection and SqlCommand + improved memory allocation by adding a private method to avoid the instance delegate on each call + remove error 64, 20, and 0 from the list of default transient errors + test improvement

Wraith2 · 2021-03-15T20:42:50Z

I just did a quick profile run on this and the additional changes to avoid allocations worked well. I can't see anything obviously attributable to retry logic when it isn't enabled. Good job @DavoudEshtehari

shueybubbles · 2024-05-03T22:18:09Z

@DavoudEshtehari question - why is error 4060 on the list of transient errors? I see comments in SMO and SSMS code that say 4060 means the database doesn't exist.

DavoudEshtehari · 2024-05-03T22:56:49Z

I haven't reviewed the SSMS logic, but according to the documentation, error number 4060 is listed as one of the transient errors for Azure SQL Server.

David-Engel · 2024-05-10T17:10:32Z

@shueybubbles IIRC, it's because, on startup, database availability can be delayed between the time a server starts accepting connections and the time a particular database is available. So, adding automatic retries reduces the incidences of connection failure when a server is starting up but a database is not yet available.

shueybubbles · 2024-05-13T15:29:41Z

Well, we need a new error value for the cloud case, then. It's pointless to keep trying when the server knows there no such database.

IMO we have chosen the wrong default behaviors for cloud connections to missing/sleeping databases.
For sleeping databases, accidental/transient connections forcing them to wake up can cost a customer a lot of money. I think we should have required a new connection string parameter to say "yes wake up the db if it's sleeping". Then tools like SSMS could have caught the "db is sleeping" error and prompted the user to say "Do you want to wake it up?".
Similarly, retrying to connect to a non-existent database multiple times just delays the user and creates a bad experience.

David-Engel · 2024-05-13T17:20:00Z

@shueybubbles

Yes, sleeping databases is another issue. I think they prioritized making serverless as seamless as possible for existing apps to adopt and made some compromises that limit options.

There is a good recent conversation on the EF repo about transient 4060 errors and create database actions: dotnet/efcore#33191

Add configuration manager tests

f9ea3fe

DavoudEshtehari requested review from cheenamalhotra, johnnypham, David-Engel, karinazhou and JRahnama March 5, 2021 20:52

DavoudEshtehari changed the title ~~Configurable Retry Logic~~ [WIP] Configurable Retry Logic Mar 5, 2021

David-Engel reviewed Mar 6, 2021

View reviewed changes

...ft.Data.SqlClient/tests/ManualTests/SQL/RetryLogic/Resources/CustomConfigurableRetryLogic.cs Outdated Show resolved Hide resolved

cheenamalhotra added this to the 3.0.0-preview1 milestone Mar 8, 2021

Address comments

6d3c377

+ some improvements + fix test issues

Address comments

574d705

+ check the safety switch directly in SqlConnection and SqlCommand + improved memory allocation by adding a private method to avoid the instance delegate on each call + remove error 64, 20, and 0 from the list of default transient errors + test improvement

DavoudEshtehari changed the title ~~[WIP] Configurable Retry Logic~~ Configurable Retry Logic Mar 12, 2021

cheenamalhotra approved these changes Mar 13, 2021

View reviewed changes

JRahnama approved these changes Mar 15, 2021

View reviewed changes

cheenamalhotra merged commit b35b3e3 into dotnet:master Mar 15, 2021

Configurable Retry Logic #966

Configurable Retry Logic #966

Conversation

DavoudEshtehari commented Mar 5, 2021

David-Engel left a comment

Choose a reason for hiding this comment

DavoudEshtehari commented Mar 6, 2021

Wraith2 commented Mar 7, 2021

DavoudEshtehari commented Mar 8, 2021

Wraith2 commented Mar 8, 2021

DavoudEshtehari commented Mar 9, 2021

Wraith2 commented Mar 9, 2021

Wraith2 commented Mar 9, 2021

DavoudEshtehari commented Mar 9, 2021

Wraith2 commented Mar 9, 2021

DavoudEshtehari commented Mar 9, 2021

Wraith2 commented Mar 9, 2021

DavoudEshtehari commented Mar 10, 2021

Wraith2 commented Mar 10, 2021

cheenamalhotra commented Mar 10, 2021

Wraith2 commented Mar 10, 2021

Wraith2 commented Mar 10, 2021

cheenamalhotra commented Mar 10, 2021

roji commented Mar 10, 2021

roji commented Mar 10, 2021

Wraith2 commented Mar 10, 2021

cheenamalhotra commented Mar 10, 2021

roji commented Mar 10, 2021

Wraith2 commented Mar 10, 2021 • edited Loading

roji commented Mar 11, 2021

Wraith2 commented Mar 11, 2021

Wraith2 commented Mar 11, 2021

roji commented Mar 11, 2021 • edited Loading

Wraith2 commented Mar 11, 2021

roji commented Mar 11, 2021 • edited Loading

scoriani commented Mar 11, 2021

Wraith2 commented Mar 11, 2021

scoriani commented Mar 11, 2021

Wraith2 commented Mar 11, 2021

Wraith2 commented Mar 15, 2021

shueybubbles commented May 3, 2024

DavoudEshtehari commented May 3, 2024

David-Engel commented May 10, 2024

shueybubbles commented May 13, 2024 • edited Loading

David-Engel commented May 13, 2024

Wraith2 commented Mar 10, 2021 •

edited

Loading

roji commented Mar 11, 2021 •

edited

Loading

roji commented Mar 11, 2021 •

edited

Loading

shueybubbles commented May 13, 2024 •

edited

Loading