
Performance issues when selecting megabytes of text from nvarchar(max) column using ExecuteScalarAsync #689

Closed
alser opened this issue Aug 13, 2020 · 16 comments


alser commented Aug 13, 2020

Describe the bug

Selecting ~20 MB of text data from an nvarchar(max) column using ExecuteScalarAsync has very poor performance: the call takes up to 10 minutes, with 100% CPU consumption (one core) and lots of garbage collections.

This probably happens because the DataReader creates a new byte array for each portion of data received from SQL Server.

Selecting the same text in SSMS completes in less than a second.

To reproduce

Sample project to reproduce the issue: https://github.com/alser/SqlDataClientStringPerformanceIssue

It's a console app connecting to SQL Server on localhost, so no network latency is involved.
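
In essence it does the following (a minimal sketch rather than the linked project verbatim; the table/column names and connection string are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class Program
{
    static async Task Main()
    {
        // Assumed setup: a local database with a single row holding
        // ~20 MB of text in an nvarchar(max) column.
        using var connection = new SqlConnection(
            "Server=localhost;Database=Test;Integrated Security=true");
        await connection.OpenAsync();

        using var command = new SqlCommand(
            "SELECT Body FROM Docs WHERE Id = 1", connection);

        var sw = Stopwatch.StartNew();
        // This async scalar read is the slow path; the synchronous
        // command.ExecuteScalar() completes in a couple of seconds.
        var text = (string)await command.ExecuteScalarAsync();
        Console.WriteLine($"{text.Length} chars in {sw.Elapsed}");
    }
}
```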

Expected behavior

Selecting the data should take at most a few seconds, with low CPU consumption and without excessive GC activity.

The DataReader should probably rent byte arrays when reading string fields instead of allocating new ones for each portion of data.
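
For illustration, the renting pattern being suggested could look roughly like this, using System.Buffers.ArrayPool (a sketch of the general technique, not the driver's actual internals):

```csharp
using System.Buffers;
using System.IO;

static class PooledPacketRead
{
    // Sketch: instead of allocating a fresh ~8000-byte buffer per network
    // packet (new byte[8000]), rent one from the shared pool and return it
    // once the packet has been consumed, so steady-state reads allocate
    // almost nothing.
    public static void ReadOnePacket(Stream network)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(8000);
        try
        {
            int read = network.Read(buffer, 0, 8000);
            // ... parse `read` bytes of packet data here ...
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```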

Further technical details

Microsoft.Data.SqlClient version: 2.0.0 (nuget)
.NET target: Core 3.1.7
SQL Server version: SQL Server 2016 (13.0.1742.0)
Operating system: Windows 10 version 10.0.19041.450

karinazhou (Member) commented:

Hi @alser,

Thank you for reporting this. I will try your repro project on my side. Does an older driver version, like v1.1.3, have the same performance problem?

alser (Author) commented Aug 13, 2020

Hello, @karinazhou

Yes, I've tried 1.1.3; it has the same problem.

System.Data.SqlClient 4.8.2 also has the same issue (after switching the NuGet package and namespace to System.Data.SqlClient).

I've also checked against SQL Server 2012 (11.0.5388.0); same result there.

alser (Author) commented Aug 13, 2020

I added the Microsoft.Data.SqlClient project to my sample solution, built from the latest commit on master (dbbcde8).

When ExecuteScalarAsync was executed, there were 2521 allocations of new byte[8000] at the following location:

file: src\Microsoft.Data.SqlClient\netcore\src\Microsoft\Data\SqlClient\TdsParserStateObject.cs

method: internal bool TryReadNetworkPacket()

line: _inBuff = new byte[_inBuff.Length];

roji (Member) commented Aug 14, 2020

Related to #593, possibly even a dup?

alser (Author) commented Aug 14, 2020

@roji, it seems so; a synchronous ExecuteScalar() call returns in 2 seconds.

alser (Author) commented Aug 14, 2020

When ExecuteScalarAsync was executed, there were 2521 allocations of new byte[8000] at the following location:

On second thought, that alone shouldn't affect performance this much: creating all these arrays is almost immediate, and their total size is only ~20 MB.

Wraith2 (Contributor) commented Aug 16, 2020

It isn't just the memory use (though that can be quite poor); it's the fact that the async path resets its state and replays from the first packet of the data each time a new packet arrives, which drives up CPU usage and wastes time. The larger the field, the more packets it spans and the longer it takes compared to the sync version.
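
(Back-of-envelope, using the ~2521 packets counted above: replaying from the first packet on each arrival costs on the order of 2521 × 2522 / 2 ≈ 3.2 million packet parses instead of 2521, i.e. the work grows quadratically with the size of the value.)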

If you can detect the data size, try a scalar approach when the field is larger than around 2/3 of your packet size. Or try using one of the streaming functions to get it, because those won't buffer packets. A sketch of one reading of this advice follows.
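
(The DATALENGTH pre-query, the 8000-byte packet size, and the table/column names below are all assumptions for illustration:)

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class SizeAwareRead
{
    public static async Task<string> ReadBodyAsync(SqlConnection connection)
    {
        const int packetSize = 8000; // assumed TDS packet size

        // Ask the server how big the value is before deciding how to read it.
        using var sizeCmd = new SqlCommand(
            "SELECT DATALENGTH(Body) FROM Docs WHERE Id = 1", connection);
        long bytes = Convert.ToInt64(await sizeCmd.ExecuteScalarAsync());

        using var cmd = new SqlCommand(
            "SELECT Body FROM Docs WHERE Id = 1", connection);

        // Large values span many packets, which is where the buffered
        // async path degrades; fall back to the sync call (or a
        // streaming read) for those.
        if (bytes > packetSize * 2 / 3)
            return (string)cmd.ExecuteScalar();

        return (string)await cmd.ExecuteScalarAsync();
    }
}
```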

alser (Author) commented Aug 16, 2020

@Wraith2, thanks for the advice.

I'm actually working on a large enterprise product that uses an ORM to access different databases (SQL Server and Postgres), with multiple layers of business logic on top. It's quite challenging to detect which type of request is executing, i.e. which fields are being selected, from which data source, and how the result should be processed.

Given the situation described in #593, I'm thinking of introducing some low-level hacks into the product's code base so that, for SQL Server, a synchronous method is called instead of the async one, but that can't be done everywhere.

Do you know when we can expect a fix for the async methods in the Microsoft.Data.SqlClient package? Thanks.

Wraith2 (Contributor) commented Aug 16, 2020

Do you know when we can expect a fix for the async methods in the Microsoft.Data.SqlClient package? Thanks.

It isn't broken. It's slower than we'd like, but don't confuse that with not working.

alser (Author) commented Aug 16, 2020

It's slower than we'd like, but don't confuse that with not working.

I meant a performance fix; I didn't say "not working". If there are no plans, please just say so; I wasn't demanding that the problem be addressed immediately.

Wraith2 (Contributor) commented Aug 16, 2020

I'm an external contributor; I have no say in what happens to this library. I mainly contribute performance improvements.

This particular performance problem is complicated, difficult, risky and time consuming. If I found the time to rework the library and change this behaviour, it would take me months, and the review would take months more; even then, given its scope, the change might still not be approved. It would also likely involve subtle behavioural changes, and the risk of new errors emerging is high. So it's a time-consuming and thankless task.

roji (Member) commented Aug 16, 2020

@Wraith2 just to confirm: is this the same issue as #593? In other words, after #603 (shipped in SqlClient 2.0.0), at least streaming should be good? If so, we should close this as a dup.

@alser, if so, then as @Wraith2 suggested above you can work around this by dropping down to raw ADO.NET and using the streaming API DbDataReader.GetTextReader. The same code should work as-is across both SqlClient and Npgsql.
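
For example (a sketch; the query and names are assumed), using CommandBehavior.SequentialAccess so the provider streams the column instead of buffering the whole row:

```csharp
using System.Data;
using System.Data.Common;
using System.IO;
using System.Threading.Tasks;

static class StreamingRead
{
    // Written against DbConnection/DbDataReader so the same code runs
    // on both SqlClient and Npgsql.
    public static async Task<string> ReadBodyAsync(DbConnection connection)
    {
        using DbCommand command = connection.CreateCommand();
        command.CommandText = "SELECT Body FROM Docs WHERE Id = 1";

        // SequentialAccess tells the provider to stream column data
        // as it arrives instead of buffering each row first.
        using DbDataReader reader =
            await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
        await reader.ReadAsync();

        using TextReader text = reader.GetTextReader(0);
        // ReadToEndAsync still materializes one big string, but it avoids
        // the packet-replay problem; for true streaming, read in chunks
        // into a fixed char buffer instead.
        return await text.ReadToEndAsync();
    }
}
```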

Another, possibly more reasonable option is to simply use sync I/O for the time being. I don't know what your application looks like or what its perf characteristics are, but that may be a completely reasonable option without any real drawbacks.

Wraith2 (Contributor) commented Aug 16, 2020

Looks like it. The repro code is simple: it creates a large string, then reads it back, and we know that's slow and why.

roji (Member) commented Aug 17, 2020

@cheenamalhotra as above, this seems like it can be closed as a dup of #593.

cheenamalhotra (Member) commented:

Closing as duplicate of #593

alser (Author) commented Aug 18, 2020

Thanks a lot for the answers! Yes, it's definitely a dup. It's easier to use the sync API in my case (given that queries are executed through an ORM), and in some cases streaming can be useful.

In my app the text column usually holds JSON data, which can be several megabytes in size.
