Connection does not support MultipleActiveResultSets error after different connection doomed #971
If you have a repro for this use-case, please upload it here so we can investigate further. |
I also work at accuRx with Nick. Sadly we've not been able to find any rhyme or reason as to why or when this occurs. We've spent a long time investigating internally but are really at a dead end. There is no obvious repro, but it does reliably occur on the order of tens of times a day. However, it's not particularly linked to busy periods and can easily go off around 11pm/midnight when our traffic is really low. Is there anything else we can try to give you here to help you investigate? I can also tell you it happens on a range of queries, including ones which should take 1-2ms (selecting a row which has an index on it). |
Is it possible to capture traces again in a developer environment with a custom driver version (from nightly builds)? You could try this feed to fetch the latest build: |
@cheenamalhotra Yes, I think we see the error in our staging env and should be able to add the package there with traces from the SqlClient package. |
Speculation,
Is there any previous trace information about connection 313 in the log about when it was used and returned to the pool? |
Thanks. Here is the previous trace we have for connection 313, just over a minute before the logs above, which as far as I can tell shows a normal connection life-cycle.
|
@cheenamalhotra I just checked all our staging logs and the error has only gone off once in the last month, so it's pretty unlikely we'll catch it there, sadly. We'd have to use the beta package in prod, but I'm less keen on this. That does seem to indicate it's more likely with higher traffic, which could mean it's a race? Is there anything else we can do to help pin this down? It's very disruptive for us (goes off tens of times a day in production). We are quite high-traffic consumers of Azure SQL if that helps diagnose (I imagine 100M queries/transactions a day). |
Again, not sure if helpful, but we've always found the MARS exception happens after this first exception: "A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)" |
It's an interesting situation; I'm still trying to wrap my head around what's going on. Can you give me the full trace between
Also, please provide the exact driver version for which the traces are captured, as otherwise I cannot match behaviors properly. |
This is the full log trace I have for this particular function invocation (happy to email it to you too). The reason it takes 15 minutes is that we needed to change a setting on our K8s cluster, just FYI (net.ipv4.tcp_retries2). One thing I haven't worked out is why we sometimes get 400 traces and sometimes get 11,000. Happy to share traces from other times this has gone off too. By driver version, do you just mean the NuGet package version, or something else?
|
On the contrary, can you check and tell whether your application is doing something like the below, theoretically:

```csharp
SqlConnection conn = new SqlConnection(connString);

try
{
    SqlCommand command = new SqlCommand("QUERY_TEXT", conn);
    // execute and read
}
catch { }

try
{
    // Continues to use old connection irrespective of whether last command failed with FATAL exception.
    SqlCommand command = new SqlCommand("ANOTHER QUERY_TEXT", conn);
    // execute and read
}
catch { }
```

Because from the logs, if you look carefully at the lines below:
It seems that the same connection is still being used by your application for the next command even when it failed with a FATAL class error, which is a non-retriable error class. So there are 2 possibilities, for which you can provide more info:
|
@cheenamalhotra I don't think there are any extra logs in that 30-second window. This is how I'm capturing logs:
The idea is that we only emit the verbose SqlClient logs if an exception actually goes off. Let me know if we're using this wrong or have misunderstood something, though. With regard to re-using the SQL connection, I can't be 100% sure, but I would be very surprised. We don't interact at all with your raw library; we use EF Core for all our database interactions. The context is passed into this Azure Function via dependency injection and we just execute a single query, which immediately throws. Again, not sure if helpful, but we've found that the EF Core context seems to be broken after this particular exception; we can't perform an "outer" retry of the same query. |
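The capture snippet referenced above isn't included in this thread; purely as a hypothetical illustration of that approach (buffer verbose SqlClient events and emit them only when an exception occurs - the class name, buffer cap, and flush policy are assumptions, not the original code), an EventListener against the driver's event source could look roughly like this:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Tracing;

// Hypothetical sketch: buffers verbose SqlClient traces in memory and only
// emits them when the application decides to flush (e.g. on an exception).
public class BufferedSqlClientListener : EventListener
{
    private readonly ConcurrentQueue<string> _buffer = new ConcurrentQueue<string>();

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Event source name exposed by Microsoft.Data.SqlClient for tracing.
        if (eventSource.Name == "Microsoft.Data.SqlClient.EventSource")
        {
            EnableEvents(eventSource, EventLevel.Verbose, EventKeywords.All);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string payload = eventData.Payload == null ? "" : string.Join(" ", eventData.Payload);
        _buffer.Enqueue($"{DateTime.UtcNow:O} {eventData.EventName}: {payload}");

        // Rough cap so the buffer cannot grow unbounded between exceptions.
        while (_buffer.Count > 10_000)
        {
            _buffer.TryDequeue(out _);
        }
    }

    // Called from the application's exception handler to emit the buffered traces.
    public void Flush(Action<string> log)
    {
        while (_buffer.TryDequeue(out string line))
        {
            log(line);
        }
    }
}
```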
I would have also expected, if we were doing something like that, for this to go off more often. On the particular Azure Function we are logging, I see this error about 1-4 times a day (the function is executing 100,000s of times a day). |
I've been searching for other instances of this where we have O(100s) of logs, not O(10,000), and grabbed this one from a more recent time. I've added a few extra columns in, and this one has our new TCP timeout, so it only goes over 2 minutes or so, not 15. Columns are time, activityID, message, OpCode, ThreadID, and log level.
|
Hi @lbargery I can advise a quick workaround for you in this case to ensure this does not block you. You can try to create a new connection instance when you capture Fatal Error class 20 on query execution, i.e. "A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)". Please re-initialize the SqlConnection instance in the DbContext in this case, and that should resolve your case. I'll continue to investigate and find the root cause to fix this issue properly so the connection stays usable again. |
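For illustration, a minimal sketch of that workaround at the plain ADO.NET level (the helper name, query shape, and single retry are assumptions made for this example; the thread's actual code path goes through EF Core rather than raw SqlClient):

```csharp
using System;
using Microsoft.Data.SqlClient;

public static class FatalErrorWorkaround
{
    // Hypothetical helper: run a scalar query, and if a fatal (Class >= 20)
    // error dooms the connection, retry once on a brand-new SqlConnection.
    public static object QueryWithFreshConnectionOnFatalError(string connectionString, string sql)
    {
        try
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                connection.Open();
                return command.ExecuteScalar();
            }
        }
        catch (SqlException ex) when (ex.Class >= 20)
        {
            // Class 20+ indicates a fatal error such as the transport-level
            // error quoted above; do not reuse the old SqlConnection instance.
            using (var freshConnection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, freshConnection))
            {
                freshConnection.Open();
                return command.ExecuteScalar();
            }
        }
    }
}
```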
Thanks for the workaround. Just to check I'm understanding correctly: in our application code layer we'd catch the MARS exception (we don't see the TCP one), right? Apologies if I'm being slow, but how do you re-initialize the SQL connection instance in a DbContext? Thanks for all your help - happy to provide as much detail as we can - appreciate this is a tricky one to debug without a repro! |
Added the following in our custom retry strategy for SQL; hopefully this is what you mean, where I've closed and re-opened the connection:
|
@cheenamalhotra We managed to get it to work using the following after the CloseAsync. Seems like re-open didn't actually reset the state. So this (for now) seems to "solve" the issue for us (though we will keep monitoring in Prod).
|
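The exact snippet isn't included above, so purely as a hypothetical illustration of a "close, then actually reset" step: SqlConnection.ClearPool is one way to force the next open to build a fresh physical connection, but whether it matches what was actually added after CloseAsync here is an assumption.

```csharp
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class ConnectionReset
{
    // Hypothetical sketch: "connection" is assumed to be the SqlConnection the
    // DbContext is using when the MARS/fatal error is caught.
    public static async Task ResetAsync(SqlConnection connection)
    {
        await connection.CloseAsync();

        // Close alone returns the (possibly corrupted) internal connection to
        // the pool; clearing the pool evicts it so the next Open creates a new one.
        SqlConnection.ClearPool(connection);

        await connection.OpenAsync();
    }
}
```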
I simply meant to drop the connection and recreate it by disposing/re-initializing the DbContext.
Is it possible to Dispose and re-create a new instance of the DbContext instead of going deep down into EF Core and disposing the connection? |
@cheenamalhotra I'm not sure if that's possible. I can definitely dispose the context in that method, but I'm not sure how I could create a new one. The method isn't even generic on the DbContext; it just passes in the concrete base type, so I don't know how you'd even know what type to re-construct. The above code (whilst a bit dirty) seems to have worked well in production today, FYI. |
Hi @lbargery I'm able to reproduce this issue locally with just simple code and a little modification in the driver:

Repro Code:

```csharp
using (SqlConnection sqlConnection = new SqlConnection(connString))
{
    // Execute 2 commands - should work.
    for (int i = 0; i < 2; i++)
    {
        sqlConnection.Open(); // Throws exception second time.
        using (SqlCommand cmd = new SqlCommand("SELECT @@VERSION", sqlConnection))
        {
            cmd.CommandText = "SELECT @@VERSION";
            try
            {
                using (var reader = await cmd.ExecuteReaderAsync())
                {
                    while (await reader.ReadAsync())
                    { }
                }
                Console.WriteLine(i + 1 + " passed");
            }
            catch (SqlException e)
            {
                Console.WriteLine(i + 1 + " " + e.Message);
            }
        }
    }
}
```

Driver modification in SNIStreams.cs#L34 [Force SocketException on ReadAsync]:

```csharp
internal class SNISslStream : SslStream
{
    ...
    public override async Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        await _readAsyncSemaphore.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            //return await base.ReadAsync(buffer, offset, count, cancellationToken).ConfigureAwait(false);
            throw new SocketException();
        }
        finally
        {
            _readAsyncSemaphore.Release();
        }
    }
    ...
}
```

Output:

I sometimes also get

And as you can see, I'm not doing anything special except forcing the driver to behave as if the socket is not able to read anything from the network, mocking the behavior as captured in the traces above. What we see here is:
Given the fact that the first exception is a FATAL exception (Class 20), and the internal connection and parser are no longer connected to the server, as of now there is no way you can continue working on this connection instance. I have yet to determine whether it should be possible to recover in this state and reconnect silently if a new command needs to be executed, but it will be tricky, as the driver does not currently behave that way. |
The error message in the second query is misleading, which is due to a bug in the async flow, but the connection is not usable either way.
|
@cheenamalhotra Thanks for the above. |
I would say an exception of Class 20 is fatal and should not be retried. I also found this section in the remarks: "When the severity level is 20 or greater, the server ordinarily closes the SqlConnection. However, the user can reopen the connection and continue. In both cases, a SqlException is generated by the method executing the command." The issue we are looking at is that reopening the connection fails and the SqlConnection stays corrupted for further activity unless you dispose it too. Cc @roji @ajcvickers |
@lbargery Could you share an actual stack trace that involves EF Core, please? |
@ErikEJ This is the stack trace we get
Worth saying the ExtendedSqlServerRetryingStrategy.cs is just us wrapping the default retry strategy and adding some properties to our logger. We now have the dispose code I posted above in this, which handles the MARS error |
Here are some thoughts from an EF perspective... Hopefully I've understood all of the above.

In regular coding patterns, EF Core closes and opens connections for every operation; this means that with a single DbContext instance, if two queries (or SaveChanges) are executed, DbConnection.OpenAsync is called before each query, and DbConnection.CloseAsync is called right after each query. In contrast, when the user explicitly opens the connection via e.g.

This means that in typical scenarios, a typical retrying execution strategy will attempt to retry the entire operation multiple times - and that includes closing and opening the connection. This means that as long as the underlying ADO.NET provider handles DbConnection.Close well, the DbConnection.Open from the next retry should succeed. Importantly, this also means that in the EF sense, an exception can be considered transient even if it dooms the current connection - just as long as closing, re-opening and retrying the operation could succeed (this is a wider definition of transience compared to being able to retry the operation without closing/re-opening).

To summarize, it seems to me that SqlClient should support closing and re-opening the DbConnection after this exception (or any other), without needing to Dispose; this is also how I'm reading the text in the docs linked to above by @cheenamalhotra:
After all, SqlConnection is merely a wrapper/façade over a physical connection - which itself may be doomed - but a close/re-open should get a new physical connection from the pool etc. Does that make sense? |
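To make the retry flow described above concrete, here is a rough sketch of EF Core's connection-resiliency pattern (the model, context, and connection string are assumed examples, not taken from this issue); the key point is that each retry attempt opens and closes the connection itself:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Assumed example model and context for illustration only.
public class Blog
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class BlogContext : DbContext
{
    public DbSet<Blog> Blogs { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder options) =>
        options.UseSqlServer(
            "<connection string>",
            sqlOptions => sqlOptions.EnableRetryOnFailure(maxRetryCount: 3));
}

public static class RetryDemo
{
    public static async Task RunAsync()
    {
        using (var context = new BlogContext())
        {
            // The execution strategy owns the retry loop. EF opens the
            // connection at the start of each attempt and closes it afterwards,
            // so a retry after a connection-dooming error should start on a
            // freshly opened connection - which is exactly the behavior this
            // issue is about.
            var strategy = context.Database.CreateExecutionStrategy();
            var blog = await strategy.ExecuteAsync(
                () => context.Blogs.FirstOrDefaultAsync(b => b.Id == 1));

            Console.WriteLine(blog?.Name ?? "not found");
        }
    }
}
```

Note that when the strategy wraps code that has opened the connection manually, this close/re-open does not happen per attempt - the edge case raised a little further down.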
Thanks @roji for the explanation. Yes, the issue is in SqlClient and we're looking into it. As you mentioned, EF Core's retry strategy is not retrying on the previously opened connection; that confirms there is no issue in the EF Core layer, and as per the documentation, SqlClient should be able to recover and re-open without the need to dispose. |
OK, good that I fully understood :) But this does raise an interesting question on the EF Core side - if the scenario uses a connection which has been manually opened by the user, the execution retrying strategy will not close/re-open the connection, and so exceptions which doom the connection would not be retriable. Opened dotnet/efcore#24432 to think about this. |
As another data point for the impact of fixing this, I think this is the root cause of this issue: madelson/DistributedLock#83 |
Closing as #1128 has been merged. Fix will be available in the next preview release. |
Thank you @cheenamalhotra |
Looks like this issue, or a very similar one, may also exist for .NET Framework 4.8 and EF 6.4.
|
This was closed, so I'd suggest making sure you have an up-to-date version of the library as the first step. |
@Wraith2 |
The System version is only updated with security and high-priority patches, and those come through Windows Update. This repo is for the Microsoft.Data.SqlClient library which EF Core uses; not sure about desktop framework EF 6. |
We do plan to make it possible to use EF 6 (non-Core) with Microsoft.Data.SqlClient - but that's happening in the coming year, and isn't possible at the moment. |
@Wraith2 @roji actually possible! https://www.nuget.org/packages/ErikEJ.EntityFramework.SqlServer/ FYI @eolamisan |
Oh great @ErikEJ, now I know to point to it :) |
Describe the bug
We are intermittently seeing a "System.InvalidOperationException: The connection does not support MultipleActiveResultSets". We have turned on tracing, and whenever the exception is thrown we see the following traces running up to the exception:

As can be seen, a connection (311) is doomed due to the TCP connection timing out. But then it retries on a different connection (313) and throws the exception. We repeatedly see the same pattern whenever that exception is thrown.
Expected behavior
Additional tracing to be able to diagnose the problem.
The "System.InvalidOperationException: The connection does not support MultipleActiveResultSets." should not be thrown.

Further technical details
Microsoft.Data.SqlClient version: 2.1.2 and 1.1.3
.NET Framework: Core 3.1
SQL Server version: Azure SQL Server
Operating system: Both Windows and Linux (docker container)