
SqlConnection fails to reliably keep long-running connection open #25458

Closed
kevcunnane opened this issue Mar 14, 2018 · 18 comments

@kevcunnane

This issue affects applications such as SQL Operations Studio that have long-running connections (e.g. backing a T-SQL editor) built on the .NET Core stack. On the .NET Framework stack, the connection reliably stays open or reports as non-open in its ConnectionState, allowing the application to reopen it as needed. On the .NET Core stack, however, the state is shown as Open, but any command that is attempted will run for ~60 seconds and then fail with an error message:

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)

This has a high impact on the reliability of tools built on top of .NET Core that use SqlClient.

Related issue: microsoft/azuredatastudio#870.

**Diagnosis by @saurabh500**:

  • The Windows implementation of SqlClient sets keep-alive values on the socket, which the Unix stack doesn't.
  • SqlClient should set the keep-alive values as well.
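
For reference, here is a minimal sketch of the kind of per-socket keep-alive configuration being discussed. This is not the driver's actual code; the TcpKeepAliveTime/TcpKeepAliveInterval options shown require .NET Core 3.0 or later, and the timing values are purely illustrative:

    using System.Net.Sockets;

    static void EnableKeepAlive(Socket socket)
    {
        // Turn TCP keep-alive on so an idle connection sends periodic probes
        // instead of silently going stale behind a firewall or NAT.
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

        // On .NET Core 3.0+ the probe timing can also be tuned per socket.
        // Illustrative values: first probe after 30 seconds idle, then every second.
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 30);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 1);
    }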

Original issue details:

  • SQL Operations Studio Version: Feb release & latest from master

Steps to Reproduce:

  • Use a macOS machine, such as the one owned by @twright-msft. Not all machines will work (mine is super-reliable for instance)
  • Run a query against any DB
  • Walk away for 30min
  • Re-run the query

Expected:

  • Query executes 2nd time around

Actual:

  • Executing... spins for ~1 minute, then we get an error message in the batch stating:
    • Note that even a quick failure that let us close & reopen the connection would be OK. It's the lack of reliability plus the long wait that makes this quite bad.

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught)

  • If you are running 2 batches, you may see another message:

Query failed: BeginExecuteReader requires an open and available Connection. The connection's current state is open.

Note that we do check whether the connection is open, but this isn't reliable when the underlying TCP socket has closed. This may well be a driver-side issue, but any fix/workaround we can apply would be good.

Here's our code to close & reopen on issues:

        private void VerifyConnectionOpen(DbConnection connection)
        {
            if (connection == null)
            {
                // Ignore this connection
                return;
            }
 
            if (connection.State != ConnectionState.Open)
            {
                // Note: this will fail and throw to the caller if something goes wrong.
                // This seems the right thing to do but if this causes serviceability issues where stack trace
                // is unexpected, might consider catching and allowing later code to fail. But given we want to get
                // an opened connection for any action using this, it seems OK to handle in this manner
                ClearPool(connection);
                connection.Open();
            }
        }

We can add connection.Close() in this case, which might fix issues where the connection is in a broken state, but it definitely doesn't fix the error overall.
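
For illustration, a variant of the snippet above with the suggested connection.Close() added might look like this (ClearPool is the same application helper used above; Close() is safe to call on an already-closed connection and resets a broken one):

    private void VerifyConnectionOpen(DbConnection connection)
    {
        if (connection == null)
        {
            return;
        }

        if (connection.State != ConnectionState.Open)
        {
            // Close() resets a Broken connection and is a no-op when already closed,
            // so it is safe to call before clearing the pool and reopening.
            connection.Close();
            ClearPool(connection);
            connection.Open();
        }
    }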

@corivera corivera self-assigned this Mar 14, 2018
@corivera corivera removed their assignment Mar 21, 2018
@jhudsoncedaron

We're facing some horrendous internal debates here about how to work around this not being fixed. We've known about it for ages and ages.

https://stackoverflow.com/questions/16347389/sqlserver-wedged-connection-due-to-network-drop

@kevcunnane
Author

@saurabh500 @corivera isn't this fixed in 2.2? The KeepAlive flag was added and we're using it in Azure Data Studio with the latest bits... correct?

@jhudsoncedaron

@kevcunnane : The diff of the unmerged dotnet/corefx#33024 says it's definitely not fixed yet.

@saurabh500
Contributor

This is not fixed in 2.2, and the PR for master is undergoing discussion about the unit of the API. For 2.2, the fix depends on networking APIs as well.

@saurabh500
Contributor

@jhudsoncedaron have you had a chance to use the OS keepalive settings to alleviate this problem?

@jhudsoncedaron

@saurabh500 : There is no OS setting that I know of to turn on keepalive for all sockets.

@saurabh500
Contributor

The recommended setting for SQL clients is mentioned at dotnet/corefx#33024 (comment).

@jhudsoncedaron, you are facing this issue on Linux/macOS. Is that right?

@jhudsoncedaron

@saurabh500 : I don't need to set precise timeout values; I just need keep-alive turned on at all, to keep threads from getting stuck forever inside SqlDataReader.Read(). Right now it doesn't matter what the OS is: while there are OS-specific settings to tune the values, the SetSocketOption call still needs to execute to turn keep-alive on per socket.

@saurabh500
Contributor

I don't think I follow. Based on the context of this issue, I understood that when you try to reuse a connection after a long idle period, the connection fails.
Are your calls getting stuck in SqlDataReader.Read()? It sounds like you are facing something other than what @kevcunnane describes in this issue.
Can you post an example usage to help us understand what exactly you are facing? I want to make sure we are not mixing two issues in a single thread.
The reason Windows vs. Unix matters, in the context of this issue, is that the Windows implementation sets the per-socket KeepAlive settings to prevent the network connection from dropping after short periods of inactivity.
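
For illustration only, per-socket keep-alive timings can be applied from managed code on Windows through Socket.IOControl with a Winsock tcp_keepalive structure; this is a sketch of that mechanism rather than the actual SNI source, and the helper name and timings are made up for the example:

    using System;
    using System.Net.Sockets;

    static void SetKeepAliveValues(Socket socket, uint idleMs, uint intervalMs)
    {
        // Winsock tcp_keepalive struct: onoff, keepalivetime (ms), keepaliveinterval (ms),
        // each stored as a 4-byte little-endian value.
        byte[] values = new byte[12];
        BitConverter.GetBytes(1u).CopyTo(values, 0);          // enable keep-alive
        BitConverter.GetBytes(idleMs).CopyTo(values, 4);      // idle time before the first probe
        BitConverter.GetBytes(intervalMs).CopyTo(values, 8);  // interval between probes
        socket.IOControl(IOControlCode.KeepAliveValues, values, null);
    }

For example, SetKeepAliveValues(socket, 30000, 1000) would request a probe after 30 seconds of idle time and then once per second.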

@jhudsoncedaron

@saurabh500 : I did a version hunt. It was restored to working again on Windows by turning off SNI globally. dotnet/corefx@23709d0

@saurabh500
Contributor

What about it? The diff you mentioned here turns off Managed SNI on Windows, which means that the SNI (SQL Network Interface) implementation written in C++ is being used on Windows.

@jhudsoncedaron

@saurabh500 : This discussion is probably going to degenerate. From our perspective, Managed SNI has been switched on and off a few times, the bug comes and goes, we're using hardware workarounds, and most of the time the thread that gets stuck isn't holding any locks. We're not willing to support environments without the hardware workarounds until the bug is really fixed. The code comments say Managed SNI is coming back again.

@saurabh500
Contributor

saurabh500 commented Oct 30, 2018

I can assure you that Managed SNI has not been turned on for Windows in any released version of SqlClient. It was turned on in one of the preview releases and then turned off.
Managed SNI is only operational on Linux and macOS.
You can't be facing issues on Windows because of Managed SNI being turned on and off, unless you have been using one of the pre-2.0 preview release builds.

@jhudsoncedaron

@saurabh500 : We used preview releases of stuff.

@saurabh500
Contributor

There are a lot of bug fixes after the 2.0 preview releases, and I do recommend upgrading to a newer released version of SqlClient.

@saurabh500
Contributor

saurabh500 commented Oct 30, 2018

@jhudsoncedaron Can you tell me your version of SqlClient as well? There are 4.6.0-preview builds on NuGet, which have a lot of bug fixes in them.
To stay on a released version rather than a preview, you could upgrade to 4.5.1.

@jhudsoncedaron

The current version is the 4.5.1 release. We've been on preview versions of SqlClient in the past, after discovering that cross-versioned binaries normally work really badly; there was a critical fix for something in .NET MVC that was only available in a preview release, and we move the entire framework as a unit.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 17, 2020