Fix Cancel #234

Closed
wants to merge 3 commits into from

Wraith2
Contributor

@Wraith2 Wraith2 commented Sep 29, 2019

Fixes #44

Makes the parser lock in the internal Cancel method (called once from SqlCommand or SqlDataReader, whichever is active) optional. This means that sending the attention needed to cancel the running query is no longer blocked until the query completes. There's a new test, derived from the bug report, to verify the behaviour.
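
As a rough illustration of what "optional lock" means here, the sketch below tries to take a lock with a timeout and sends the attention signal whether or not the lock was obtained. It is not the code in this PR: `CancelSketch`, `Gate`, `PollTimeout`, `AttentionSent` and `LockWasTaken` are hypothetical names, the 50 ms timeout is an assumed value, and setting a flag stands in for actually sending attention.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the parser state object; names are illustrative only.
public sealed class CancelSketch
{
    private readonly object _gate = new object();

    // Assumed poll timeout; the real value is not stated in the description.
    private static readonly TimeSpan PollTimeout = TimeSpan.FromMilliseconds(50);

    // Exposed so a caller can simulate a command that is holding the lock.
    public object Gate => _gate;

    public bool AttentionSent { get; private set; }
    public bool LockWasTaken { get; private set; }

    public void Cancel()
    {
        bool hasLock = false;
        try
        {
            // Best effort: wait briefly for the lock, but do not insist on it.
            Monitor.TryEnter(_gate, PollTimeout, ref hasLock);
            LockWasTaken = hasLock;

            // Placeholder for sending the attention signal to the server.
            AttentionSent = true;
        }
        finally
        {
            // Only release the lock if this call actually acquired it.
            if (hasLock)
            {
                Monitor.Exit(_gate);
            }
        }
    }
}
```

If another thread holds `Gate` for the duration of a long-running command, `Cancel()` in this sketch returns after roughly the poll timeout with `AttentionSent` set, instead of waiting for the command to finish.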

@Wraith2 changed the title from "add cancel running query test and remove state lock in cancel method" to "Fix Cancel" on Sep 29, 2019
@cheenamalhotra
Member

Thanks @Wraith2 for coming up with the PR. I see that in the NetFx code (which works fine), the implementation of Cancel() is based on "objectID" instead of "caller" as in NetCore, which looks like a major difference to me. The conditions differ as follows:

NetFx implements the condition in Cancel as:

if ((!_cancelled) && (objectID == _allowObjectID) && (objectID != -1))

while in NetCore it is done as:

if ((!_cancelled) && (_cancellationOwner.Target == caller))

Would you please verify the behavior and try to match NetFx if possible? We want to avoid any unknown implications of making patches.

@Wraith2
Contributor Author

Wraith2 commented Sep 30, 2019

The ObjectID is just a number assigned to the command object when it's created; it's monotonically incremented in the ctor. It's used for tracing with the Bid object that was removed from the corefx version. If you trace the calls to TdsStartParser.Dispose they both go through command. It also seems to be used for identifying self in some cases, to make sure the thing that's being operated on isn't another connection.

So as far as I can see it's just a different way of expressing the question of whether the target is the one that is expected.
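
To make that comparison concrete, here is a minimal sketch of the two ownership checks side by side. It is an illustration under assumptions, not the library's code: `CommandSketch` and `OwnerCheckSketch` are hypothetical types, the counter field name is made up, and `_cancellationOwner` is assumed to be a `WeakReference` because the NetCore condition reads `.Target`.

```csharp
using System;
using System.Threading;

// Hypothetical command type; not the real SqlCommand.
public sealed class CommandSketch
{
    // Shared counter; each new instance gets the next value.
    private static int s_objectCount;

    public int ObjectID { get; }

    public CommandSketch()
    {
        ObjectID = Interlocked.Increment(ref s_objectCount);
    }
}

// Hypothetical holder showing both ways of asking "is this caller the owner?".
public sealed class OwnerCheckSketch
{
    private int _allowObjectID = -1;
    private readonly WeakReference _cancellationOwner = new WeakReference(null);

    public void SetOwner(CommandSketch owner)
    {
        _allowObjectID = owner.ObjectID;
        _cancellationOwner.Target = owner;
    }

    // NetFx-style check: compare numeric ids, treating -1 as "no owner".
    public bool MatchesById(int objectID)
    {
        return objectID == _allowObjectID && objectID != -1;
    }

    // NetCore-style check: compare object references.
    public bool MatchesByReference(object caller)
    {
        return _cancellationOwner.Target == caller;
    }
}
```

In this sketch both checks accept the registered owner and reject any other command, as long as the owner object is still alive. The other parts of the real conditions, such as the `_cancelled` flag, are left out.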

Now, which one is better I couldn't tell you. The provenance of both versions of this code is opaque to me. I can see the code, but the reasons behind any of the decisions that formed it aren't available to me. If you know a particular approach is better and should be followed then you'll need to instruct me. If you need reasons behind anything other than changes I've made you'll need to track down the original authors or version history notes. I can't access any of that information even though it would be incredibly helpful to me.

@David-Engel
Contributor

Interesting. The new test is failing against netfx...

@cheenamalhotra
Member

cheenamalhotra commented Sep 30, 2019

Okkk.. It's actually causing the build to hang. 🤔
I didn't pay attention to builds today.

Looking at #44, the user reported it works fine in SQL Server Management Studio, which uses System.Data.SqlClient (NetFx). That gave me the impression this was NetCore-specific.

But now it fails with Microsoft.Data.SqlClient (NetFx), so something has happened in between 🤔

Let me verify this test scenario with S.D.S and M.D.S, on both NetFx and NetCore, and I'll get back on this. We may have to drill down into the history to find the change that caused this.

@Wraith2
Contributor Author

Wraith2 commented Sep 30, 2019

We could try using the tests added in dotnet/corefx#38271 which use delay and timers instead of an infinite loop, that'd prevent the build hang.

I'm not sure what's different between corefx and this repo that means we see a different cancel behaviour. If this issue was in corefx, my other PR shouldn't fix it, since this one is in fully shared code and the other is in a specific SNI implementation.

The report of it working in Management Studio is also confusing, unless Management Studio uses another connection (or MARS) to cancel it.

@Wraith2
Contributor Author

Wraith2 commented Oct 17, 2019

I've updated this PR to use the test I mentioned. Those tests use DELAY so they won't hang the test runs.

@cheenamalhotra
Member

The changes fail on Unix, please check logs below:

Ubuntu-SQL.zip
Ubuntu-Azure.zip

@Wraith2
Contributor Author

Wraith2 commented Oct 17, 2019

Yes, they will do because of #248 which we've already discussed.

There are also tests failing because they're just wrong. For example PlainCancelAsync is failing because it's expecting a specific exception and message which I haven't changed. So was it failing before? I can't tell because I don't have a way to run the unix tests without hardcoding managed mode because I can't run tests in debug mode.

It really feels like you're trying to make even simple things much more difficult than they need to be and I don't understand why.

@cheenamalhotra
Member

So can we instead merge both PRs in one place and then attempt to fix the issue? It sounds like both PRs are dependent on each other.

Some tests do fail randomly, so you can ignore a little noise from failures that aren't related, but I'm only looking at new test failures that fail consistently, even when testing locally.

@cheenamalhotra
Member

The 4 tests failing here, as shown in the logs, are new failures coming from this PR:

image

@@ -598,41 +598,41 @@ internal void Cancel(object caller)
// Keep looping until we either grabbed the lock (and therefore sent attention) or the connection closes\breaks
while ((!hasLock) && (_parser.State != TdsParserState.Closed) && (_parser.State != TdsParserState.Broken))

@Samirat Samirat Oct 22, 2019

If we don't get the lock, won't this loop / send attention multiple times? Is that intended?

// try to take the lock so that if another command is attempted it will queue on the lock
// but don't require that the lock be taken because otherwise attention cannot be sent
// during command execution causing cancellation to wait
// This lock is also protecting against concurrent close and async continuations
Monitor.TryEnter(this, _waitForCancellationLockPollTimeout, ref hasLock);

What's the point of taking a lock at all here? If we actually need to prevent new commands being started while sending attention, then we can't allow this lock to be optional, and if we don't, then why have it at all?

{
try
_parser.Connection._parserLock.Wait(canReleaseFromAnyThread: false, timeout: _waitForCancellationLockPollTimeout, lockTaken: ref hasParserLock);

@Samirat Samirat Oct 22, 2019

Feels like there should be an async version of this function. If the locks were async, wouldn't that also fix the cancellation hang?

@Wraith2 Wraith2 closed this Oct 24, 2019
@Wraith2 Wraith2 deleted the bug-44 branch June 29, 2021 22:01
Development

Successfully merging this pull request may close these issues.

Canceling SQL Server query with while loop hangs forever
4 participants