Configurable Retry Logic #966

Merged Mar 15, 2021 (3 commits)


DavoudEshtehari
Contributor

This is a complementary PR to complete the remaining work, extend the feature, and address comments on PR #693.

@DavoudEshtehari DavoudEshtehari changed the title Configurable Retry Logic [WIP] Configurable Retry Logic Mar 5, 2021
Contributor

@David-Engel David-Engel left a comment

Did you mean to check in the two DLLs?

@DavoudEshtehari
Contributor Author

Did you mean to check in the two DLLs?

If you mean the ClassLibrary_CustomConfigurableRetryLogic DLLs, they're custom retry logic implementations compiled for .NET Framework and .NET Core, so that the test project can assess external injection on different .NET versions.

@Wraith2
Contributor

Wraith2 commented Mar 7, 2021

I've been having trouble with the tests due to the new default use of the retry policy. It looks like if no policy is set and the connection can't be made (for example a bad connection string) you can end up in an infinite retry loop which stops the manual tests dead where they used to just fail the individual test and continue.

I've also hit a case while debugging where a config property is passed to the retry policy setup and asserted to be non-null, but at runtime the null is accepted. Since we don't run the tests with debug builds, this will never be hit, so either it's missing an important case that should be tested or it's a worthless assert.

@cheenamalhotra cheenamalhotra added this to the 3.0.0-preview1 milestone Mar 8, 2021
@DavoudEshtehari
Contributor Author

I've been having trouble with the tests due to the new default use of the retry policy. It looks like if no policy is set and the connection can't be made (for example a bad connection string) you can end up in an infinite retry loop which stops the manual tests dead where they used to just fail the individual test and continue.

It's not clear to me under what circumstances it will go to an infinite loop. Could you provide more step-by-step information, please?

I've also hit a case while debugging where a config property is passed to the retry policy setup and asserted to be non-null, but at runtime the null is accepted. Since we don't run the tests with debug builds, this will never be hit, so either it's missing an important case that should be tested or it's a worthless assert.

Suppose you're pointing to the SqlRetryLogicProvider. In that case, generally, I assert null values in debug mode for internal usage to enforce an object creation at design time. Since this is an internal type and isn't going to be accessible directly, I've included the assert in debug mode to prevent null usage at design time; and this parameter won't be null. If you have another scenario, please include it here for further discussion.

@Wraith2
Contributor

Wraith2 commented Mar 8, 2021

It's not clear to me under what circumstances it will go to an infinite loop. Could you provide more step-by-step information, please?

Running the manual tests locally will always fail in the retry logic at the moment. Master, netcore2.1: build, run, and wait for it to just stop in the exponential retry test.

@DavoudEshtehari
Contributor Author

Running the manual tests locally will always fail in the retry logic at the moment. Master, netcore2.1: build, run, and wait for it to just stop in the exponential retry test.

I couldn't find any issue locally.
image
image

@Wraith2
Contributor

Wraith2 commented Mar 9, 2021

Try breaking your named pipes connection string. Then it'll hit the infinite loop with an exponential backoff. In order to make it actually work you'll have to update the manual tests project to copy the config file on build; at the moment it doesn't, and that's caused me a lot of problems.

Named pipes didn't work on my machine for some reason, and over the weekend I uninstalled, cleaned, and then reinstalled SQL Server. So I'm setting everything up from clean. If you can't connect using "np:localhost" then previously some tests failed, but it was OK because I knew which ones they were. Now I can't complete a test run because it hits a loop. Yes, I should fix my machine and I'm working on that, but having the ability for the tests to fall into an endless loop doesn't seem like a good idea anyway.

A couple of things I noticed while looking at the retry code as it is at the moment in master:

if (configSection == null)
{
    throw SqlReliabilityUtil.ArgumentNull(nameof(configSection));
}

This is problematic because it causes an exception on any OpenAsync call which doesn't have a policy, which at the moment is all of them. Instead of a null check followed by a throw, could you just check for null and fall back to using the default?

On the call stack that leads to it there's this call:

public override Task OpenAsync(CancellationToken cancellationToken)
{
    return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
}

This is going to cause an instance delegate to be allocated each time it's invoked. Could that be changed to a lambda which takes the token and connection as parameters, so that the compiler can use a hidden static delegate?
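
The difference between a capturing and a non-capturing lambda can be observed directly. The following is a minimal sketch, not SqlClient code: `FakeConnection` and its members are hypothetical stand-ins, and the delegate caching it shows is the behavior of current C# compilers rather than a language guarantee.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DelegateAllocationDemo
{
    // Hypothetical stand-in for the real connection type.
    public sealed class FakeConnection
    {
        public Task InternalOpenAsync(CancellationToken token) => Task.CompletedTask;

        // Captures 'this' and 'token': each call builds a closure object
        // and a new delegate that points at it.
        public Func<Task> CapturingDelegate(CancellationToken token) =>
            () => InternalOpenAsync(token);

        // Captures nothing: the connection and token arrive as parameters,
        // so the compiler can cache one delegate instance and reuse it.
        public Func<FakeConnection, CancellationToken, Task> NonCapturingDelegate() =>
            (connection, token) => connection.InternalOpenAsync(token);
    }

    public static bool CapturingIsReused()
    {
        var c = new FakeConnection();
        return ReferenceEquals(c.CapturingDelegate(default), c.CapturingDelegate(default));
    }

    public static bool NonCapturingIsReused()
    {
        var c = new FakeConnection();
        return ReferenceEquals(c.NonCapturingDelegate(), c.NonCapturingDelegate());
    }

    public static void Main()
    {
        Console.WriteLine($"capturing delegate reused:     {CapturingIsReused()}");
        Console.WriteLine($"non-capturing delegate reused: {NonCapturingIsReused()}");
    }
}
```

The capturing version can never be reused because each delegate is bound to a fresh closure; the non-capturing version needs the callee to pass the state through, which is what the rest of the thread debates.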

@Wraith2
Contributor

Wraith2 commented Mar 9, 2021

Call Stack:

 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.LoginNoFailover(Microsoft.Data.SqlClient.ServerInfo serverInfo = {Microsoft.Data.SqlClient.ServerInfo}, string newPassword = "", System.Security.SecureString newSecurePassword = null, bool redirectedUserInstance = false, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.SqlClient.SqlCredential credential = null, Microsoft.Data.ProviderBase.TimeoutTimer timeout = {Microsoft.Data.ProviderBase.TimeoutTimer})	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.OpenLoginEnlist(Microsoft.Data.ProviderBase.TimeoutTimer timeout, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions, Microsoft.Data.SqlClient.SqlCredential credential, string newPassword, System.Security.SecureString newSecurePassword, bool redirectedUserInstance)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlInternalConnectionTds.SqlInternalConnectionTds(Microsoft.Data.ProviderBase.DbConnectionPoolIdentity identity, Microsoft.Data.SqlClient.SqlConnectionString connectionOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.SqlClient.SqlCredential credential = null, object providerInfo, string newPassword = "", System.Security.SecureString newSecurePassword = null, bool redirectedUserInstance = false, Microsoft.Data.SqlClient.SqlConnectionString userConnectionOptions, Microsoft.Data.SqlClient.SessionData reconnectSessionData, bool applyTransientFaultHandling = true, string accessToken, Microsoft.Data.ProviderBase.DbConnectionPool pool)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnectionFactory.CreateConnection(Microsoft.Data.Common.DbConnectionOptions options, Microsoft.Data.Common.DbConnectionPoolKey poolKey, object poolGroupProviderInfo, Microsoft.Data.ProviderBase.DbConnectionPool pool, System.Data.Common.DbConnection owningConnection, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionFactory.CreateNonPooledConnection(System.Data.Common.DbConnection owningConnection = {Microsoft.Data.SqlClient.SqlConnection}, Microsoft.Data.ProviderBase.DbConnectionPoolGroup poolGroup, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(System.Data.Common.DbConnection owningConnection = {Microsoft.Data.SqlClient.SqlConnection}, System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry = null, Microsoft.Data.Common.DbConnectionOptions userOptions = {Microsoft.Data.SqlClient.SqlConnectionString}, Microsoft.Data.ProviderBase.DbConnectionInternal oldConnection = {Microsoft.Data.ProviderBase.DbConnectionClosedNeverOpened}, out Microsoft.Data.ProviderBase.DbConnectionInternal connection = null)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(System.Data.Common.DbConnection outerConnection = {Microsoft.Data.SqlClient.SqlConnection}, Microsoft.Data.ProviderBase.DbConnectionFactory connectionFactory = {Microsoft.Data.SqlClient.SqlConnectionFactory}, System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry, Microsoft.Data.Common.DbConnectionOptions userOptions)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.TryOpen(System.Threading.Tasks.TaskCompletionSource<Microsoft.Data.ProviderBase.DbConnectionInternal> retry = null, Microsoft.Data.SqlClient.SqlConnectionOverrides overrides)	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlRetryLogicProvider.Execute<bool>(object sender = {Microsoft.Data.SqlClient.SqlConnection}, System.Func<bool> function = {Method = {System.Reflection.RuntimeMethodInfo}})	Unknown	Non-user code. Skipped loading symbols.
 	Microsoft.Data.SqlClient.dll!Microsoft.Data.SqlClient.SqlConnection.Open(Microsoft.Data.SqlClient.SqlConnectionOverrides overrides)	Unknown	Non-user code. Skipped loading symbols.
 	ManualTests.dll!Microsoft.Data.SqlClient.ManualTesting.Tests.SqlCommandReliabilityTest.ConcurrentExecution(string cnnString = "Data Source=np:localhost;Initial Catalog=Northwind;Integrated Security=True;Pooling=False;Connect Timeout=5", Microsoft.Data.SqlClient.SqlRetryLogicBaseProvider provider = {Microsoft.Data.SqlClient.SqlRetryLogicProvider})	Unknown	Non-user code. Skipped loading symbols.
 	[Native to Managed Transition]		Annotated Frame
 	[Managed to Native Transition]		Annotated Frame
 	xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<System.__Canon>.CallTestMethod(object testClassInstance) Line 150	C#	Non-user code. Symbols loaded.
 	xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.InvokeTestMethodAsync.AnonymousMethod__1() Line 257	C#	Non-user code. Symbols loaded.
 	xunit.execution.dotnet.dll!Xunit.Sdk.ExecutionTimer.AggregateAsync(System.Func<System.Threading.Tasks.Task> asyncAction = {Method = {System.Reflection.RuntimeMethodInfo}}) Line 48	C#	Non-user code. Symbols loaded. 

@DavoudEshtehari
Contributor Author

I used the following config to run the tests:

    "TCPConnectionString": "Data Source=tcp:localhost;Database=Northwind;Integrated Security=true;",
    "NPConnectionString": "Data Source=np:localhost;Database=Northwind;Integrated Security=true;",
    "TCPConnectionStringHGSVBS": "",
    "TCPConnectionStringAASVBS": "",
    "TCPConnectionStringAASSGX": "",
    "EnclaveEnabled": false,
    "TracingEnabled": false,
    "AADAuthorityURL": "",
    "AADPasswordConnectionString": "",
    "AADServicePrincipalId": "",
    "AADServicePrincipalSecret": "",
    "AzureKeyVaultURL": "",
    "AzureKeyVaultTenantId": "",
    "AzureKeyVaultClientId": "",
    "AzureKeyVaultClientSecret": "",
    "SupportsIntegratedSecurity": true,
    "SupportsLocalDb": false,
    "SupportsFileStream": false,
    "UseManagedSNIOnWindows": false,
    "IsAzureSynapse": false,
    "EnclaveAzureDatabaseConnString": "",
    "UserManagedIdentityObjectId": ""
  • I bring it from the build directory.

@Wraith2
Contributor

Wraith2 commented Mar 9, 2021

That's great; if "np:localhost" works on your machine you'll have no problems. Change it to something that you know won't work and then run the manual tests, and they'll hang.

@DavoudEshtehari
Contributor Author

Results by invalid "NPConnectionString": "Data Source=np:localhost;Database=Northwind__;Integrated Security=true;"

  • for the SqlConnectionReliabilityTest:
    image
  • for the SqlCommandReliabilityTest:
    image

And for SqlConnectionReliabilityTest by "NPConnectionString": "Data Source=npt:localhost;Database=Northwind__;Integrated Security=true;":
image

Still can't find hang or infinite loop!
Here is the command to run the tests:
dotnet test "src\Microsoft.Data.SqlClient\tests\ManualTests\Microsoft.Data.SqlClient.ManualTesting.Tests.csproj" /p:Platform="AnyCPU" /p:Configuration="Release" /p:TestTargetOS=Windowsnetcoreapp --no-build -v n --filter "SqlConnectionReliabilityTest" /p:TargetNetCoreVersion=netcoreapp2.1

+ some improvements
+ fix test issues
@Wraith2
Contributor

Wraith2 commented Mar 9, 2021

OK, I've got the tests working so ignore that problem for now.
I am a little concerned about the number of lambdas this feature adds to common paths.

@DavoudEshtehari
Contributor Author

I couldn't find any concern according to the perf test result. As a general rule, there is always a chance of improvement. Let's see how we can improve it.
image

@Wraith2
Contributor

Wraith2 commented Mar 10, 2021

It depends on your perspective and how you're measuring it. As you probably know by now I often use an implementation of the TechEmpower fortunes benchmark to measure performance and identify items that can be streamlined.

Here are a couple of examples of running it for 60 seconds:

OpenAsync:
[image: OpenAsyncAllocation]
[image: OpenAsyncAllocation2]

ExecuteReaderAsync:
[image: ExecuteReaderAsync]

Both OpenAsync and ExecuteReaderAsync are unavoidable methods for users. If you want to connect to the database and read data you're going to use them. The Retry machinery is also present on the sync versions so there will be equivalent additional allocations on those paths as well. The new retry feature is adding overhead to all users without any ability to opt out of it.

Now, you can argue that they aren't dominating allocations in the overall trace:
image

and you'd be right but that isn't the problem. All those allocations at the top of the trace are either unavoidable (all those task related things) or something I haven't found a way to eliminate yet. The top is major rework territory and we're not in a position to do anything about them at the moment. So I pick off the other things like minor string allocations and boxed guids which have relatively small impacts but those impacts add up. Every allocation we avoid is one small piece of extra time before the runtime runs a GC and the less GC that happens the more we can do per unit time translating to better throughput and lower cost in terms of energy and money.

Some of these things are relatively easy to fix. Some are more difficult or impossible. For example if the public surface area of SqlRetryLogicBaseProvider is fixed then there are some things that can't be optimized. We can do careful work to remove delegate and closure allocations from common paths like OpenAsync but it's a bit disheartening to see the work I've done so carefully over the years to slim down and optimize the library being consumed by a feature that most users will never see or use. I think users should pay for what they use and not what they don't unless it is unavoidable.

@cheenamalhotra
Member

I'm also in agreement with @Wraith2 's comment above and would suggest something like:

public override object ExecuteScalar()
{
    ...
    if (IsRetryOptedIn) // Check if user enabled App Context switch here
    {
        ds = RetryLogicProvider.Execute(this, () => RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true));
    }
    else
    {
        ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
    }
    ...
}

[Similarly for other APIs]

I think the check for the App Context switch that happens inside the Reliability layer is too late, but based on perf results I didn't consider that significant before. I agree that if we can avoid its after-effects, that should be the preferred way to go. Until users opt in, they should not be affected in any way. For the other reliability implementation inside SqlConnection too, we should check the App Context switch as early as possible, before changing any flow, if it's avoidable.

@Wraith2
Contributor

Wraith2 commented Mar 10, 2021

I'm prototyping adding a property `bool SqlRetryLogicBaseProvider.IsEnabled { get; }` and then setting it to false in the no-op implementation. Then in the callsites, if we updated to langversion 7.0, we could use local functions to avoid the closure entirely unless it's called, e.g.:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return OpenAsyncWithRetry(cancellationToken);
            }

            Task OpenAsyncWithRetry(CancellationToken token)
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(token), token);
            }
        }

I don't know if the compiler is smart enough to eliminate the closure allocation in the version you proposed, @cheenamalhotra.

@Wraith2
Contributor

Wraith2 commented Mar 10, 2021

It will need to be split out into a second function. The compiler hoists the closure into the scope of the containing function to ensure that the variables contained within the closure are always accessed through it. Compare these two:

With Local Function:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return OpenAsyncWithRetry(cancellationToken);
            }

            Task OpenAsyncWithRetry(CancellationToken token)
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(token), token);
            }
        }

public override Task OpenAsync(CancellationToken cancellationToken)
{
	if (!RetryLogicProvider.Enabled)
	{
		return InternalOpenAsync(cancellationToken);
	}
	return <OpenAsync>g__OpenAsyncWithRetry|148_0(cancellationToken);
}

[CompilerGenerated]
private Task <OpenAsync>g__OpenAsyncWithRetry|148_0(CancellationToken token)
{
	<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
	<>c__DisplayClass148_.<>4__this = this;
	<>c__DisplayClass148_.token = token;
	return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__1), <>c__DisplayClass148_.token);
}

With Inline Call:

        public override Task OpenAsync(CancellationToken cancellationToken)
        {
            if (!RetryLogicProvider.Enabled)
            {
                return InternalOpenAsync(cancellationToken);
            }
            else
            {
                return RetryLogicProvider.ExecuteAsync(this, () => InternalOpenAsync(cancellationToken), cancellationToken);
            }
        }

public override Task OpenAsync(CancellationToken cancellationToken)
{
	<>c__DisplayClass148_0 <>c__DisplayClass148_ = new <>c__DisplayClass148_0();
	<>c__DisplayClass148_.<>4__this = this;
	<>c__DisplayClass148_.cancellationToken = cancellationToken;
	if (!RetryLogicProvider.Enabled)
	{
		return InternalOpenAsync(<>c__DisplayClass148_.cancellationToken);
	}
	return RetryLogicProvider.ExecuteAsync(this, new Func<Task>(<>c__DisplayClass148_.<OpenAsync>b__0), <>c__DisplayClass148_.cancellationToken);
}

Which shows that only if you split out the call do you avoid the allocations. I like the local function approach because it keeps the allocating call wrapper private to the callsite. I'm not wild about the compiler naming convention. I could be convinced that it's better as a full private function.
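
The effect described above can also be measured rather than read from decompiled output. The sketch below is an assumption-laden stand-in, not SqlClient code: `RetryEnabled`, `ExecuteWithRetryAsync` and the class itself are hypothetical, and `GC.GetAllocatedBytesForCurrentThread` requires .NET Core 3.0+ or .NET Framework 4.8. It compares the bytes allocated on the disabled path when the lambda lives in the same method as the fast path against the version where it is moved into a separate method.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SplitVersusInline
{
    // Hypothetical stand-in for the provider's enabled flag.
    public bool RetryEnabled { get; set; }

    private Task InternalOpenAsync(CancellationToken token) => Task.CompletedTask;

    // Hypothetical stand-in for RetryLogicProvider.ExecuteAsync.
    private Task ExecuteWithRetryAsync(Func<Task> operation, CancellationToken token) => operation();

    // Lambda shares a method with the fast path, so the closure for
    // 'cancellationToken' is set up at method entry on every call.
    public Task OpenInline(CancellationToken cancellationToken)
    {
        if (!RetryEnabled)
        {
            return InternalOpenAsync(cancellationToken);
        }
        return ExecuteWithRetryAsync(() => InternalOpenAsync(cancellationToken), cancellationToken);
    }

    // Lambda lives in its own method, so the closure is only created
    // when the retry path is actually taken.
    public Task OpenSplit(CancellationToken cancellationToken)
    {
        if (!RetryEnabled)
        {
            return InternalOpenAsync(cancellationToken);
        }
        return OpenWithRetry(cancellationToken);
    }

    private Task OpenWithRetry(CancellationToken token) =>
        ExecuteWithRetryAsync(() => InternalOpenAsync(token), token);

    public static long AllocatedBytes(Func<CancellationToken, Task> open, int iterations)
    {
        open(CancellationToken.None); // warm-up call before measuring
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            open(CancellationToken.None);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static void Main()
    {
        var connection = new SplitVersusInline { RetryEnabled = false };
        Console.WriteLine($"inline: {AllocatedBytes(connection.OpenInline, 10_000)} bytes");
        Console.WriteLine($"split:  {AllocatedBytes(connection.OpenSplit, 10_000)} bytes");
    }
}
```

Exact numbers depend on the runtime and JIT version, so the comparison between the two lines is more meaningful than either absolute value.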

@cheenamalhotra
Member

In that case, we can pass method delegate to a common method too (compacted version):

    public override Task OpenAsync(CancellationToken cancellationToken) =>
        (!IsRetryEnabled ? InternalOpenAsync(cancellationToken) : 
            DoTaskWithRetry(InternalOpenAsync(cancellationToken), cancellationToken));

    // This can also go in a Util class so SqlCommand can also use this.
    internal Task DoTaskWithRetry(Task task, CancellationToken token) => 
        RetryLogicProvider.ExecuteAsync(this, () => task, token);

    // Stays with other App Context Flags or in Utils
    internal bool IsRetryEnabled = <TryGetSwitch>;

Similar redirect APIs can be written for other RetryLogicProvider Execute APIs, so that we can prevent the default RetryLogicProvider from loading and control it externally.
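
One detail that may be worth checking in a compact form like this: a helper that receives an already-started `Task` and wraps it as `() => task` hands the retry logic the same task on every attempt, so the operation itself runs only once. The sketch below illustrates this under the assumption that the retry provider re-invokes the delegate on each attempt; `RetryAsync` and `FailingOpenAsync` are hypothetical and not part of SqlClient.

```csharp
using System;
using System.Threading.Tasks;

public static class RetryDelegateDemo
{
    public static int Starts;

    // Hypothetical retry helper: invokes the delegate up to 'attempts' times
    // and returns how many attempts failed.
    public static async Task<int> RetryAsync(Func<Task> operation, int attempts)
    {
        int failures = 0;
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                await operation();
                return failures;
            }
            catch (InvalidOperationException)
            {
                failures++;
            }
        }
        return failures;
    }

    // Hypothetical operation that always fails and counts how often it is started.
    public static Task FailingOpenAsync()
    {
        Starts++;
        return Task.FromException(new InvalidOperationException("transient"));
    }

    public static int StartsWhenPassingTask()
    {
        Starts = 0;
        Task alreadyStarted = FailingOpenAsync();
        // Every attempt awaits the same faulted task.
        RetryAsync(() => alreadyStarted, 3).GetAwaiter().GetResult();
        return Starts;
    }

    public static int StartsWhenPassingFactory()
    {
        Starts = 0;
        // Every attempt starts the operation again.
        RetryAsync(() => FailingOpenAsync(), 3).GetAwaiter().GetResult();
        return Starts;
    }

    public static void Main()
    {
        Console.WriteLine($"starts with Task argument:    {StartsWhenPassingTask()}");    // 1
        Console.WriteLine($"starts with factory delegate: {StartsWhenPassingFactory()}"); // 3
    }
}
```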

@roji
Member

roji commented Mar 10, 2021

I'm not fully following the discussion, but if the concern here is about heap allocations introduced because of closures, then that can be avoided without introducing a special opt-in flag. Make your lambdas static (via the new C# 9 syntax), and pass whatever objects you need to reference as parameters (including this).

@roji
Member

roji commented Mar 10, 2021

See dotnet/efcore#24208 for an example of a similar change in EF Core.

@Wraith2
Contributor

Wraith2 commented Mar 10, 2021

I looked into doing that @roji but that would need changes to the callsites in SqlRetryLogicBaseProvider, which is publicly exposed. Even if we wanted to change that class it would expose the people who wanted to implement their own retry provider to the internal workings, because they'd be given access to the closure object rather than it being baked into the delegate. The way around that is to make the retry provider less generic and have context objects for each of the internal calls that needs retrying.

I'd like to get C# 9 in the library because it would make the static delegates I add a lot cleaner; they'd be compiler generated instead of me littering them all over the place manually. I talked to the Roslyn team and they say it's not supported but that it will work and won't generate invalid IL that netfx can't use, as long as we don't use features that require runtime support. Static delegates do not require runtime support.

@cheenamalhotra
Member

Hi @roji

This is a preview feature and is not meant to be consumable by default. Hence it's controlled with App Context switch which is going to be disabled by default. Our intention is to not change driver behavior for user applications that don't want to use Retry Logic Provider right now, including any visible allocations/performance cost.

We're trying to move AppContext switch checks from internal provider layer to driver layers where this feature begins to allocate resources that are not needed.

@roji
Member

roji commented Mar 10, 2021

Apologies if I'm misunderstanding things - I really am missing a lot of the context. This would indeed require adding a RetryLogicProvider.Execute overload which accepts a "state" parameter and passes it to the lambda; this overload could then be called from OpenAsync, ExecuteScalar or anywhere else. I understand that RetryLogicProvider is an abstraction meant to be passed from the outside - FWIW methods accepting both a lambda and a state parameter to be passed to that lambda are very common in .NET precisely to avoid the heap allocation issue. The fact that people are exposed to that parameter doesn't seem problematic to me - part of the contract of a retrying provider would simply be for the implementation to pass that parameter when invoking the lambda. BTW I'd suggest taking a look at what we've done around this in EF Core, the concepts are very similar (e.g. an abstraction that's meant to be implemented by external users or providers).

Re changing driver behavior, if you have a default NoopRetryLogicProvider that simply executes the given lambda once, everything should just work well, with no behavior changes or perf regressions. But if there's some reason why you can't introduce the new overload, then the above suggestions do seem right.

Re C# 9 static lambdas, that is only an optional thing in this context - as long as a lambda doesn't refer to anything other than its parameters, it wouldn't be a closure and trigger any heap allocations. The C# 9 feature only helps avoid accidentally introducing such a reference, but as long as you're willing to introduce the new overload above the problem should go away.
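
The state-parameter pattern described above can be sketched as follows. This is an illustration only: `RetryProviderSketch`, `NoopRetryProvider` and `ConnectionSketch` are hypothetical names, and the overload shown is an assumed shape, not an existing SqlRetryLogicBaseProvider member.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical provider abstraction with a generic state parameter.
public abstract class RetryProviderSketch
{
    public abstract Task ExecuteAsync<TState>(
        TState state,
        Func<TState, CancellationToken, Task> operation,
        CancellationToken token);
}

// Hypothetical no-op provider: runs the operation once, with no retry.
public sealed class NoopRetryProvider : RetryProviderSketch
{
    public override Task ExecuteAsync<TState>(
        TState state,
        Func<TState, CancellationToken, Task> operation,
        CancellationToken token) => operation(state, token);
}

public sealed class ConnectionSketch
{
    private readonly RetryProviderSketch _provider = new NoopRetryProvider();

    public int OpenCount { get; private set; }

    private Task InternalOpenAsync(CancellationToken token)
    {
        OpenCount++;
        return Task.CompletedTask;
    }

    // 'this' is passed as state instead of being captured, so the lambda
    // refers only to its own parameters and needs no closure object.
    // With C# 9 the lambda could additionally be marked 'static'.
    public Task OpenAsync(CancellationToken cancellationToken) =>
        _provider.ExecuteAsync(
            this,
            (connection, token) => connection.InternalOpenAsync(token),
            cancellationToken);
}

public static class StateParameterDemo
{
    public static void Main()
    {
        var connection = new ConnectionSketch();
        connection.OpenAsync(CancellationToken.None).GetAwaiter().GetResult();
        Console.WriteLine($"opens: {connection.OpenCount}"); // opens: 1
    }
}
```

When a call needs more than one piece of state, the state argument would have to be a struct or value tuple, which is where the netfx dependency concern raised later in the thread comes in.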

@Wraith2
Contributor

Wraith2 commented Mar 10, 2021

It requires 2 state parameters, this and cancellationToken, which means a tuple or some custom context object. The instance delegate contains the this ref so you only see one state parameter, but if we make it static as you suggest (and I'd have liked) then we have to pass in two params. Ultimately for an async call to be retried we're going to have to save the state on the heap sometime, so closures are likely to be somewhere; I'd just like to move them into a path that is only taken if the user wants to have retry logic active.

There are worse cases, ExecuteReaderAsync will need 4 state params.

@roji
Member

roji commented Mar 11, 2021

It requires 2 state parameters, this and cancellationToken, which means a tuple or some custom context object.

You can just put both in a simple value tuple (see dotnet/efcore#24208 for examples) - simple terse syntax, no allocations.

Ultimately for an async call to be retried we're going to have to save the state on the heap sometime, so closures are likely to be somewhere; I'd just like to move them into a path that is only taken if the user wants to have retry logic active.

I don't quite follow this... the heap allocations from the closures are unrelated (and in addition to) any async state - unless you're doing something I'm not aware of. Basically it seems that the closure allocations don't have to happen in any path (regardless of whether the feature is turned on or not), plus the code would be considerably simpler too.

@Wraith2
Contributor

Wraith2 commented Mar 11, 2021

Valuetuples box with async because they're always either part of the async closure or they're passed in an object typed state parameter. If you strongly type the state params you end up with the sort of code we have in AsyncHelper which is complicated and still can't avoid allocations. We also can't use valuetuple in netfx without requiring a new non oob dependency which might be possible but isn't my decision to make.

the heap allocations from the closures are unrelated (and in addition to) any async state

To do any async call with a signature other than static void fn() you require at least one piece of state. In order to suspend and allow threadless awaiting, that information will always end up on the heap. The runtime only provides the ability to pass a single state object (unless you can conceal one in a TCS and recover it through task.AsyncState) and it is typed as object, so any valuetype will box. There are no ways around this if you end up with a non-synchronous completion.

@Wraith2
Contributor

Wraith2 commented Mar 11, 2021

Oh, and another thing I noticed was that the retry API is using IEnumerator parameters and I think it should probably be using IEnumerable parameters. It's technically a breach of the rules to re-use an enumerator since there's no way to rewind them. An enumerable object can be enumerated multiple times; it is a static implementation that is safe.
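
The enumerator point can be shown with a small sketch. The back-off intervals and method names below are made up for illustration and are not taken from the retry API.

```csharp
using System;
using System.Collections.Generic;

public static class EnumeratorReuseDemo
{
    // Hypothetical back-off schedule; the values are illustrative only.
    public static IEnumerable<TimeSpan> Intervals()
    {
        yield return TimeSpan.FromSeconds(1);
        yield return TimeSpan.FromSeconds(2);
        yield return TimeSpan.FromSeconds(4);
    }

    // Consumes whatever is left in a shared enumerator.
    public static int CountRemaining(IEnumerator<TimeSpan> enumerator)
    {
        int n = 0;
        while (enumerator.MoveNext())
        {
            n++;
        }
        return n;
    }

    // Asks the enumerable for a fresh enumerator on every call.
    public static int CountAll(IEnumerable<TimeSpan> sequence)
    {
        int n = 0;
        foreach (TimeSpan interval in sequence)
        {
            n++;
        }
        return n;
    }

    public static void Main()
    {
        IEnumerator<TimeSpan> shared = Intervals().GetEnumerator();
        Console.WriteLine(CountRemaining(shared)); // 3
        Console.WriteLine(CountRemaining(shared)); // 0: already exhausted

        IEnumerable<TimeSpan> sequence = Intervals();
        Console.WriteLine(CountAll(sequence)); // 3
        Console.WriteLine(CountAll(sequence)); // 3: fresh enumerator each time
    }
}
```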

@roji
Member

roji commented Mar 11, 2021

Valuetuples box with async because they're always either part of the async closure or they're passed in an object typed state parameter. If you strongly type the state params you end up with the sort of code we have in AsyncHelper which is complicated and still can't avoid allocations.

Yes, I'm suggesting making the state parameter generic. I'm not sure what's complicated about that, and especially why it wouldn't avoid allocation; it's just a generic parameter that gets passed to a function. Again, if you look at the PR cited above, you can see it working and not allocating. This part of it has nothing to do with whether the method is async or not - it's just about whether your lambda is a closure or not.

We also can't use valuetuple in netfx without requiring a new non oob dependency which might be possible but isn't my decision to make.

OK.

the heap allocations from the closures are unrelated (and in addition to) any async state

To do any async call with a signature other than static void fn() you require at least one piece of state. In order to suspend and allow threadless awaiting, that information will always end up on the heap. The runtime only provides the ability to pass a single state object (unless you can conceal one in a TCS and recover it through task.AsyncState) and it is typed as object, so any valuetype will box. There are no ways around this if you end up with a non-synchronous completion.

I honestly don't understand what the above has to do with closure heap allocations. I also suspect you're talking about some SqlClient-specific detail of how you guys implement async - regular C# async/await methods have no single state object typed as an object, etc. I really think all this is irrelevant to what we're discussing.

@Wraith2
Contributor

Wraith2 commented Mar 11, 2021

The static lambda will eliminate the delegate allocation. That seems like a good idea, compiler supported or not. I'd like to use the static keyword for safety and ease of use. Agreed on those points.

How the closure of state is transported, compiler generated or manually, isn't important to me since we can't avoid it. If we use manual context it will change the published surface area of the retry provider. I assume that would be a bad thing.

@roji
Member

roji commented Mar 11, 2021

OK, thanks for listening on this. To summarize my point, ignoring SqlClient-specific constraints for a second, in C# it's quite easy to manage the above in a way that will be zero-allocation.

Specifically in SqlClient, if you can't add the required public method to RetryLogicProvider for some reason (Is this type already public? I'm not seeing it in the docs), and/or cannot use value tuples and prefer not to use structs in their place, then that indeed cannot be done. Splitting your method into two, so that the closure occurs only when the feature flag is on, seems like the right way to restrict the closure allocations to when the feature flag is on.

@scoriani

In that case, we can pass method delegate to a common method too (compacted version):

    public override Task OpenAsync(CancellationToken cancellationToken) =>
        (!IsRetryEnabled ? InternalOpenAsync(cancellationToken) : 
            DoTaskWithRetry(InternalOpenAsync(cancellationToken), cancellationToken));

    // This can also go in a Util class so SqlCommand can also use this.
    internal Task DoTaskWithRetry(Task task, CancellationToken token) => 
        RetryLogicProvider.ExecuteAsync(this, () => task, token);

    // Stays with other App Context Flags or in Utils
    internal bool IsRetryEnabled = <TryGetSwitch>;

Similar redirect APIs can be written for other RetryLogicProvider Execute APIs, so that we can prevent the default RetryLogicProvider from loading and control it externally.

Great conversation! The intent of the initial prototype was indeed to avoid any allocation when retry logic is not configured for connection opening and command execution (meaning there is no config file section and no explicit assignment in code). Setting aside the App Context switch, which in my mind should only exist during Preview, why couldn't the IsRetryEnabled switch just check the condition above?

@Wraith2
Contributor

Wraith2 commented Mar 11, 2021

Setting aside the App Context switch, which in my mind should only exist during Preview, why couldn't the IsRetryEnabled switch just check the condition above?

The exact implementation can be whatever works best; I just extended the existing enabled flag from the provider internals to the provider surface. If there's an easier and lower-cost way to do the operation, I'd be happy with it.

In order to check whether there's a config, the loader logic has to do a fairly costly one-time check (it does IO), then parse and instantiate the user-defined provider using reflection. That logic is hidden in the static RetryLogicProvider internals, and I'm not sure there's a low-overhead way to do that and still meet the hard requirement that this feature should require no calling-code changes to be configurable. My view was that the lazy init had to happen sometime, so it may as well happen on the first connection attempt and have it over with.
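
The one-time lazy initialization described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's loader: `RetryConfigSketch` and the `RETRY_SKETCH_PROVIDER` environment variable are hypothetical, and the environment lookup merely stands in for the real config-file read and reflection step.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of a one-time, thread-safe configuration probe.
public static class RetryConfigSketch
{
    // Counts how often the expensive load ran; exposed only for demonstration.
    public static int LoadCount;

    // Lazy<T> guarantees the factory runs at most once, on first access.
    private static readonly Lazy<string> s_providerName =
        new Lazy<string>(LoadProviderName, LazyThreadSafetyMode.ExecutionAndPublication);

    // Cheap after the first call: later reads only return the cached value.
    public static bool IsConfigured => s_providerName.Value != null;

    private static string LoadProviderName()
    {
        Interlocked.Increment(ref LoadCount);
        // Placeholder for the costly part (reading the config section and
        // resolving the provider type via reflection). The variable name is
        // an assumption made up for this sketch.
        return Environment.GetEnvironmentVariable("RETRY_SKETCH_PROVIDER");
    }
}
```

With this shape the first connection attempt pays for the load, and every later check is a cached read, which matches the "have it over with" trade-off above.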

@scoriani

In order to check whether there's a config, the loader logic has to do a fairly costly one-time check (it does IO), then parse and instantiate the user-defined provider using reflection. That logic is hidden in the static RetryLogicProvider internals, and I'm not sure there's a low-overhead way to do that and still meet the hard requirement that this feature should require no calling-code changes to be configurable. My view was that the lazy init had to happen sometime, so it may as well happen on the first connection attempt and have it over with.

That was exactly my idea. Isn't that the same as in other config sections, like the one for the authentication provider? If I remember correctly, we mentioned that in previous conversations.

@Wraith2
Contributor

Wraith2 commented Mar 11, 2021

It may be the same. The authentication code addition predates my experience with this library, and it isn't something that has come up in any of the performance changes I've made, so I don't know much about it. As long as there is some way to avoid casual users paying a penalty for a feature they aren't using, I'm not too bothered about the implementation detail.

+ check the safety switch directly in SqlConnection and SqlCommand
+ improved memory allocation by adding a private method to avoid the instance delegate on each call
+ remove errors 64, 20, and 0 from the list of default transient errors
+ test improvement
@DavoudEshtehari DavoudEshtehari changed the title [WIP] Configurable Retry Logic Configurable Retry Logic Mar 12, 2021
@cheenamalhotra cheenamalhotra merged commit b35b3e3 into dotnet:master Mar 15, 2021
@Wraith2
Contributor

Wraith2 commented Mar 15, 2021

I just did a quick profile run on this, and the additional changes to avoid allocations worked well. I can't see anything obviously attributable to retry logic when it isn't enabled. Good job @DavoudEshtehari

@shueybubbles

@DavoudEshtehari question - why is error 4060 on the list of transient errors? I see comments in SMO and SSMS code that say 4060 means the database doesn't exist.

@DavoudEshtehari
Contributor Author

I haven't reviewed the SSMS logic, but according to the documentation, error number 4060 is listed as one of the transient errors for Azure SQL Server.

@David-Engel
Contributor

@shueybubbles IIRC, it's because, on startup, database availability can be delayed between the time a server starts accepting connections and the time a particular database is available. So, adding automatic retries reduces the incidence of connection failures when a server is starting up but a database is not yet available.

@shueybubbles

shueybubbles commented May 13, 2024

Well, we need a new error value for the cloud case, then. It's pointless to keep trying when the server knows there's no such database.

IMO we have chosen the wrong default behaviors for cloud connections to missing/sleeping databases.
For sleeping databases, accidental/transient connections forcing them to wake up can cost a customer a lot of money. I think we should have required a new connection string parameter to say "yes wake up the db if it's sleeping". Then tools like SSMS could have caught the "db is sleeping" error and prompted the user to say "Do you want to wake it up?".
Similarly, retrying to connect to a non-existent database multiple times just delays the user and creates a bad experience.
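
The cost of treating 4060 as transient can be seen in a small sketch where the set of retryable error numbers is supplied by the caller. Everything here is hypothetical (`FakeSqlException` stands in for a real SQL exception, `TransientSketch` for a retry policy); 40613 is used only as an example of another Azure SQL error number commonly treated as transient.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical exception carrying a server error number.
public sealed class FakeSqlException : Exception
{
    public int Number { get; }

    public FakeSqlException(int number) : base("error " + number)
    {
        Number = number;
    }
}

// Hypothetical retry helper driven by a caller-supplied error list.
public static class TransientSketch
{
    // Runs the operation, retrying only while the error number is in
    // transientErrors and tries remain. Returns the attempt that succeeded.
    public static int Execute(Action operation, ISet<int> transientErrors, int maxTries)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                operation();
                return attempt;
            }
            catch (FakeSqlException ex)
                when (attempt < maxTries && transientErrors.Contains(ex.Number))
            {
                // Transient and tries remain: loop again.
                // A real policy would wait between attempts.
            }
        }
    }
}
```

With 4060 in the set, an operation against a database that truly does not exist is attempted `maxTries` times before the error surfaces; with 4060 left out, it fails on the first attempt.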

@David-Engel
Contributor

@shueybubbles

Yes, sleeping databases are another issue. I think they prioritized making serverless as seamless as possible for existing apps to adopt and made some compromises that limit options.

There is a good recent conversation on the EF repo about transient 4060 errors and create database actions: dotnet/efcore#33191

8 participants