Missed synchronization #826
In the scenario I'm offering, there are already two threads (cancellation has already started while Connect is still in progress). So first the cancellation thread passes all the "fail" conditions, then the main thread passes all the "success" conditions and returns the socket. CTS.Dispose/finally does not affect this scenario at all. What's left: the socket is returned and being used, while at the same time the cancellation thread is scheduled to dispose that socket, and I believe the timing of that cannot be estimated in any way. For example, if the cancellation thread came from the pool, it may have just finished a heavy job and is now scheduled to run after all other N threads finish the same heavy N jobs. As you can imagine, this case is nearly impossible to reproduce on purpose. I've just searched the issues by the "ObjectDisposed" keyword and at least 2 of the top 20 describe similar behavior:
I don't see any other option but to do some synchronization. Probably the fastest would be to introduce an "interlocked exchange". But this code also relies on try/catch{}, and the number of non-reproducible issues would grow too fast to maintain. The best, in my opinion, would be to introduce Socket.Connect(timeout) (a timeout on an IO operation seems logical to me).
If the need were acute the method could be added to the runtime, but that's an issue for /runtime, and even if it were added, the first shipping product that contained it would be .NET 6, and we couldn't use it in this library on downlevel runtimes, so it wouldn't be very useful. I understand what you're saying and agree that there is the possibility of returning a disposed socket. However, there is no proof of causation, and I've debugged my way through too many incorrect assumptions to think that we can assert that this causes an observable problem. Is there a way that we can either 1) fix the issue you've highlighted without externally observable effect, or 2) enhance the traceability of the
I have found this issue already reported and reproduced; the repro is here: #449 (comment) It describes exactly the same behavior I do. The randomness of these issues is predicted by the theory. The correctness of the assumptions in these tickets can be proven by static code-analysis tools. That is my next point: code analysis is proposed specifically to avoid hard-to-observe issues, and I believe it should be used in SqlClient.
I think the last option I offered should do this: ConnectAsync should be used for easy synchronization of the 2 tasks. Or, again, any synchronization will eventually do the trick.
As I said, I think it's much easier to fix this by relying on code analysis than to trace it.
Am not following this issue in detail, but if you're looking for a socket timeout without doing async, there's already a way to do this. You can set the socket to non-blocking, call Connect (at which point you get a WouldBlock exception), and then use Socket.Select to wait for the connection to complete, but with a timeout. It's not incredibly pretty but it works.
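The non-blocking Connect + Select approach described above might be sketched like this. This is a minimal illustration under stated assumptions, not SqlClient code: the method name, null-on-failure convention, and error handling are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

internal static class ConnectSketch
{
    // Hypothetical helper: synchronous connect with a timeout, no async machinery.
    internal static Socket ConnectWithTimeout(IPEndPoint endpoint, TimeSpan timeout)
    {
        var socket = new Socket(endpoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        socket.Blocking = false;
        try
        {
            socket.Connect(endpoint);
        }
        catch (SocketException ex) when (
            ex.SocketErrorCode == SocketError.WouldBlock ||
            ex.SocketErrorCode == SocketError.InProgress)
        {
            // Expected for a non-blocking socket: the TCP handshake is now in progress.
        }

        // Select takes the timeout in microseconds; on return the lists contain
        // only the sockets that are writable (connected) or faulted.
        var writable = new List<Socket> { socket };
        var faulted = new List<Socket> { socket };
        Socket.Select(null, writable, faulted, (int)(timeout.TotalMilliseconds * 1000));

        if (writable.Count == 0 || faulted.Count > 0)
        {
            socket.Dispose();
            return null; // the caller treats null as a standard connect failure
        }

        socket.Blocking = true;
        return socket;
    }
}
```

Note that the timeout-to-microseconds conversion overflows `int` for timeouts above roughly 35 minutes, so a real implementation would clamp it or pass -1 for infinite.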
Thank you, this would work for us. But the code we are discussing was introduced in the name of performance. Probably we should avoid exceptions, at least on the successful execution path.
@jinek as this is code which establishes a physical network connection (i.e. TCP handshake), I really doubt this extra exception would have any perf impact at all. I'd at least set up a quick BenchmarkDotNet benchmark to verify this rather than assuming it's true.
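A quick harness of the kind @roji suggests might look like the following. Everything here is an assumption for illustration: `SocketConnector.ExceptionBased` and `SocketConnector.AsyncBased` stand in for the two candidate implementations, and a reachable local endpoint is presumed.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ConnectBenchmarks
{
    // Assumed local listener; in a real run point this at a test SQL Server.
    private static readonly IPEndPoint s_endpoint = new IPEndPoint(IPAddress.Loopback, 1433);
    private static readonly TimeSpan s_timeout = TimeSpan.FromSeconds(15);

    [Benchmark(Baseline = true)]
    public Socket ExceptionBased() => SocketConnector.ExceptionBased(s_endpoint, s_timeout);

    [Benchmark]
    public Socket AsyncBased() => SocketConnector.AsyncBased(s_endpoint, s_timeout);

    // Sockets are deliberately not disposed per-iteration here to keep the
    // sketch short; a real benchmark would clean them up in [IterationCleanup].
    public static void Main() => BenchmarkRunner.Run<ConnectBenchmarks>();
}
```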
Perhaps we're overthinking this. The caller handles null and not-connected cases by returning a SQL error, so why not just check the connected state of the return candidate after the cts has been disposed but before returning the value? At that point it can no longer be touched by the timer, so if it's alive it'll stay alive, and if it's dead we clean it up and return null, allowing the caller to interpret that as a standard failure to connect, which is what it is.
The socket passes the checks (as it got connected) while at the same time the Cancel method is already running.
At the point when the cts is disposed, either the cancel function has been invoked or it will not be invoked; as such, cancellation is either done/pending or won't happen. But yes, you're right: due to the vagaries of threading we can't assume anything about ordering. So the socket array needs to change to a (Socket, int) class array, with interlocks on the int used to guard socket access: 0→1 on cancel, 0→2 on take. Have I ever mentioned that threading and async give me headaches? I also dislike allocations. I don't like there being an array allocated for just two things, especially when most of the time it's never going to hit the second one, and that second socket never gets disposed. Fancy working on this?
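The (Socket, int) guard proposed above could be sketched roughly like this; the class and member names are illustrative, not the actual SqlClient implementation. The int starts at 0, the cancellation path swaps 0→1 before disposing, and the take path swaps 0→2 before returning, so exactly one side wins the race.

```csharp
using System.Net.Sockets;
using System.Threading;

internal sealed class GuardedSocket
{
    private const int Free = 0;
    private const int Cancelled = 1;
    private const int Taken = 2;

    private int _state; // initialized to Free
    public Socket Socket { get; }

    public GuardedSocket(Socket socket) => Socket = socket;

    // Called from the cancellation timer: dispose only if nobody has taken the socket.
    public void TryCancel()
    {
        if (Interlocked.CompareExchange(ref _state, Cancelled, Free) == Free)
        {
            Socket.Dispose();
        }
    }

    // Called from the connect path: returns the socket only if cancellation lost the race.
    public Socket TryTake()
        => Interlocked.CompareExchange(ref _state, Taken, Free) == Free ? Socket : null;
}
```

Because both transitions are compare-and-swap from `Free`, a `TryTake` that succeeds guarantees `TryCancel` will never dispose that socket, and vice versa, which closes the window described earlier in the thread.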
By the way, anything in the world either has happened, is going to happen, or won't happen 🤣
Yes, I would love SqlClient to work with multithreading. I'm still not sure about the implementation details, but something like what you mention should happen.
@jinek thanks for bringing this up. While I was working on issue #422 I came across an error message in debug mode. (@Wraith2 you can reproduce the issue by attaching your project to the driver and running it under netcoreapp3.1-Debug. Do not forget to add the switch to use ManagedSNI.) The error message stated that we were trying to write to a disposed socket.
There has been another issue in System.Data.SqlClient regarding the socket in
I've tried to run benchmarks for the implementations we have discussed here. The solution offered by @roji is almost equal in time to the async solution:
Code is here: jinek@e9e7309 Here are the implementations with async and with an exception (I've created a delegate to substitute implementations at runtime): Async
Exception
Original was
As for me, both solutions look good, but the solution by @roji does not use async code, which surely looks more attractive for a sync code path. Additionally, it allows excluding the exception path in the infinite-timeout case. @JRahnama Has that "async" issue been logged in this repository?
@jinek the issue and its related materials can be followed at issue #583, and it can also be traced back to a 2018 change in corefx at this issue.
I see, thank you. Can we try |
SqlClient-826 Missed synchronization. Additionally:
* Exception swallowing removed to satisfy CA1031 https://docs.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1031
* InternalException refactored to satisfy CA1032 https://docs.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1032 (Best practices: https://docs.microsoft.com/en-us/dotnet/standard/exceptions/best-practices-for-exceptions#include-three-constructors-in-custom-exception-classes)
* The InternalException constructor has been changed to public, as the class is not marked as sealed.
While investigating other issues I’ve found a few probably hard-to-reproduce issues.
I haven’t seen them happen, but I’ve seen several issues that are claimed to be hard to reproduce or to happen under heavy load, so maybe this could help there. It was also mentioned in the original PR (where this code was introduced).
This block accessing the sockets array
SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs
Line 341 in 8ad8da6
is not synchronized with this one
SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs
Line 368 in 8ad8da6
The following order seems possible:
In this case, the socket is first returned and then disposed at a random time while being used (the user does not receive any exceptions; the connection just disposes).
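Based on the scenario described in this thread, the problematic interleaving between the two unsynchronized blocks can be pictured as follows (a hypothetical trace, not output from the driver):

```csharp
// Hypothetical interleaving between the connect path and the cancellation
// timer in SNITcpHandle.cs (commit 8ad8da6), reconstructed from the
// discussion above:
//
//   connect thread                        cancellation (timer) thread
//   --------------                        ---------------------------
//   Connect completes, socket is live
//                                         cts fires; Cancel() begins, passes
//                                         its "fail" condition checks
//   passes the "success" checks
//   returns the socket to the caller
//                                         disposes that same socket
//
//   => the caller now holds a disposed socket and later observes a
//      seemingly random ObjectDisposedException under load.
```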
But there are also many hard-to-investigate synchronization/exception issues there. I think maybe that entire file and the related ones should be reviewed again.