
Add Sylvan csv benchmark #2

Closed · wants to merge 5 commits

Conversation

MarkPflug

Adds my own CSV library, Sylvan.Data.Csv, to the benchmark set.

Current results on my machine:

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0      | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|-----------:|----------:|----------:|----------:|
| Sylvan      | 58.10 ms  | 0.411 ms | 0.364 ms | 1    | 2111.1111  | 1000.0000 | 111.1111  | 12 MB     |
| PipeLines   | 88.62 ms  | 1.672 ms | 1.642 ms | 2    | 3166.6667  | 1166.6667 | 333.3333  | 17 MB     |
| CsvHelper   | 178.20 ms | 3.187 ms | 2.981 ms | 3    | 12666.6667 | 4666.6667 | 2000.0000 | 77 MB     |
| AsyncStream | 181.59 ms | 3.024 ms | 2.828 ms | 3    | 10333.3333 | 3333.3333 | 1000.0000 | 64 MB     |

The pipelines run was previously faster than Sylvan until the most recent changes you added.

@davidfowl
Contributor

I'm guessing it's faster because of the synchronous reads; as for why it allocates less, that's surprising 😄. We could do an allocation profile to find out; my guess is that it's the async overhead. FileStream has been rewritten in .NET 6 to make the async operations much more efficient, so I'd like to see a run on that.

@MarkPflug
Author

One issue is that in the last commit the PipeLines code was changed from:

```csharp
if (Utf8Parser.TryParse(buffer, out DateTime value, out var bytesConsumed))
```

to:

```csharp
if (DateTime.TryParse(Encoding.UTF8.GetString(line[..comaAt]), out var doj))
```

Not sure why that was done, but it slows things down quite a bit. There is also an awkward per-line check for the header row that isn't ideal:

```csharp
if (line.IndexOf(ColumnHeaders) >= 0) // Ignore the Header row
```
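One way to avoid scanning every line would be to skip the header once before the loop. A rough sketch, where TryReadLine and ParseLine are stand-ins for the helpers in the Pipelines code, not its actual method names:

```csharp
// Sketch only: consume the first line (the header) once, instead of
// searching every line for the header text.
bool headerSkipped = false;
while (TryReadLine(ref buffer, out ReadOnlySequence<byte> line))
{
    if (!headerSkipped)
    {
        headerSkipped = true; // the first line is always the header
        continue;
    }
    ParseLine(line);
}
```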

Fixing those issues and increasing the buffer that the Pipelines code uses to 16k (and doing the same for Sylvan) changes things quite a bit. Using async with Sylvan causes a small perf regression, but it's not terrible as long as the buffer isn't too small.

| Method      | Mean     | Error    | StdDev   | Median   | Rank | Gen 0     | Gen 1     | Gen 2    | Allocated |
|-------------|---------:|---------:|---------:|---------:|-----:|----------:|----------:|---------:|----------:|
| PipeLines   | 30.88 ms | 0.439 ms | 0.410 ms | 30.95 ms | 1    | 2187.5000 | 1062.5000 | 125.0000 | 12 MB     |
| Sylvan      | 55.84 ms | 1.082 ms | 1.481 ms | 55.59 ms | 2    | 2200.0000 | 1100.0000 | 100.0000 | 13 MB     |
| SylvanAsync | 61.50 ms | 1.225 ms | 2.637 ms | 60.38 ms | 3    | 2111.1111 | 1000.0000 | 111.1111 | 13 MB     |

Quite a difference. But the performance disparity is likely because Sylvan uses a TextReader to allow for various encodings, so I'm paying a penalty to support encodings other than UTF-8. Also, the Pipelines implementation isn't a real CSV implementation, in that it doesn't handle the edge cases of quoted/escaped fields. Sometimes you can get away with that, depending on your dataset, but I wouldn't want to rely on it.
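For example (made-up row), a quoted field containing a comma defeats any parser that simply splits on commas:

```csharp
// Illustrative only: the quoted Name field embeds a comma, so a naive
// Split(',') produces four fields instead of three.
var line = "\"Smith, John\",john@example.com,08/12/2020";
var fields = line.Split(','); // fields.Length == 4, which is wrong
```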

@goldytech
Owner

@MarkPflug Thanks for your contribution. The code was changed for the DateTime parsing because Utf8Parser.TryParse() doesn't support parsing a date on its own; parsing is only supported for a full DateTime, and the csv's DateOfJoining column doesn't have a time in it, hence the change. https://docs.microsoft.com/en-us/dotnet/api/system.buffers.text.utf8parser.tryparse?view=net-5.0#System_Buffers_Text_Utf8Parser_TryParse_System_ReadOnlySpan_System_Byte__System_DateTime__System_Int32__System_Char_
The header check is important because the header is not part of the actual data that needs to be parsed.
If you have a solution, please raise a PR.
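A minimal repro of the difference (illustrative value; note that DateTime.TryParse is culture-sensitive):

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Utf8Parser wants a full date-and-time, so a date-only field fails to
// parse, while the string-based DateTime.TryParse accepts it.
byte[] dateOnly = Encoding.UTF8.GetBytes("08/12/2020");
bool utf8Ok = Utf8Parser.TryParse(dateOnly, out DateTime viaUtf8, out int consumed);
bool strOk = DateTime.TryParse("08/12/2020", out DateTime viaString);
Console.WriteLine($"{utf8Ok} {strOk}"); // "False True", per the behavior described above
```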

@davidfowl
Contributor

@goldytech so the parsing fails?

> Also, the Pipelines implementation isn't a real CSV implementation, in that it doesn't handle the edge cases of quoted/escaped fields. Sometimes you can get away with that, depending on your dataset, but I wouldn't want to rely on it.

Agreed. The code can be made faster and more correct, but I think that's beside the point of the blog post. Fit-to-purpose code is almost always faster than general-purpose code, because it can cut corners.

@MarkPflug if you're interested, can you do an allocation profile on the implementations to see what's being allocated?

@goldytech
Owner

goldytech commented Jun 4, 2021

@davidfowl Yes, the out variable comes back with its default value from the TryParse method.

MarkPflug added 3 commits June 4, 2021 07:13
Avoid allocating temp string for date parsing.
@MarkPflug
Author

MarkPflug commented Jun 4, 2021

@goldytech Interesting, I didn't know that about those new Utf8Parser methods. I assumed it would offer parity with the DateTime.TryParse method, so I would have expected all those date-only values to parse properly. The bad news is that the DateTime parsing is one of the more expensive bits of processing this file, so that's what's slowing things down the most. One way to improve this slightly is to use TryParseExact, but that can't currently be used for this dataset because the date formats are inconsistent: line 2 has "4/20/2847" while line 10 has "08/12/20". I think if these were "04/20/2847" and "08/12/0020" then you could TryParseExact with the format "MM/dd/yyyy" and it would speed things up a bit.
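That would look something like this (hypothetical: it only works if the data were first normalized to zero-padded months/days and four-digit years):

```csharp
using System;
using System.Globalization;

// TryParseExact skips format detection, so it's cheaper than TryParse,
// but every value must match the single format exactly.
bool ok = DateTime.TryParseExact(
    "04/20/2847",
    "MM/dd/yyyy",
    CultureInfo.InvariantCulture,
    DateTimeStyles.None,
    out DateTime doj);
```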

@davidfowl I ran an allocation profile (.NET Object Allocation Tracking in VS) and I see mostly strings, around 200k of them as expected. I don't have a lot of experience with that tool, so not sure what else to look for exactly. I was expecting it to be broken down by allocation size, but it only seems to show counts.

My previous results with PipeLines at ~30 ms were bogus: it was failing to parse the dates, and it fails fast in that scenario. I've increased the buffer size, fixed the temp string allocation used when parsing the dates, and cleaned up a few things in the PipeLines code. With those changes, I'm seeing these results on my machine:

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0      | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|-----------:|----------:|----------:|----------:|
| PipeLines   | 53.06 ms  | 0.360 ms | 0.320 ms | 1    | 2100.0000  | 1000.0000 | 100.0000  | 13 MB     |
| Sylvan      | 53.83 ms  | 0.459 ms | 0.429 ms | 1    | 2200.0000  | 1100.0000 | 100.0000  | 13 MB     |
| SylvanAsync | 59.51 ms  | 0.193 ms | 0.161 ms | 2    | 2222.2222  | 1000.0000 | 111.1111  | 13 MB     |
| CsvHelper   | 167.06 ms | 3.242 ms | 4.100 ms | 3    | 12666.6667 | 4666.6667 | 2000.0000 | 77 MB     |
| AsyncStream | 170.41 ms | 2.369 ms | 1.978 ms | 3    | 9666.6667  | 3333.3333 | 666.6667  | 60 MB     |

Do you see anything obvious that could make the PipeLines code any faster? All in all, I feel pretty good about where my library sits in these results. This probably isn't entirely fair to CsvHelper; there are likely things that could be done to improve it, but I haven't investigated.

Edit: I should also mention these results were run on net6.0 preview 4.

@MarkPflug
Author

Tweaked the CsvHelper code a bit to improve its performance somewhat. There might be other things that can be done, but I'm not that familiar with the library.

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0     | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|----------:|----------:|----------:|----------:|
| PipeLines   | 51.17 ms  | 0.596 ms | 0.497 ms | 1    | 2100.0000 | 1000.0000 | 100.0000  | 13 MB     |
| Sylvan      | 53.76 ms  | 0.455 ms | 0.426 ms | 2    | 2200.0000 | 1100.0000 | 100.0000  | 13 MB     |
| SylvanAsync | 61.06 ms  | 1.196 ms | 1.597 ms | 3    | 2222.2222 | 1000.0000 | 111.1111  | 13 MB     |
| CsvHelper   | 130.28 ms | 2.594 ms | 4.544 ms | 4    | 9250.0000 | 3250.0000 | 1250.0000 | 56 MB     |
| AsyncStream | 172.89 ms | 3.318 ms | 3.687 ms | 5    | 9666.6667 | 3333.3333 | 666.6667  | 60 MB     |

@davidfowl
Contributor

> @goldytech Interesting, I didn't know that about those new Utf8Parser methods. I assumed it would offer parity with the DateTime.TryParse method, so I would have expected all those date-only values to parse properly. The bad news is that the DateTime parsing is one of the more expensive bits of processing this file, so that's what's slowing things down the most. One way to improve this slightly is to use TryParseExact, but that can't currently be used for this dataset because the date formats are inconsistent: line 2 has "4/20/2847" while line 10 has "08/12/20". I think if these were "04/20/2847" and "08/12/0020" then you could TryParseExact with the format "MM/dd/yyyy" and it would speed things up a bit.

@pgovind @tannergooding This seems like something we should add support for in Utf8Parser.
cc @jeffhandley

> @davidfowl I ran an allocation profile (.NET Object Allocation Tracking in VS) and I see mostly strings, around 200k of them as expected. I don't have a lot of experience with that tool, so not sure what else to look for exactly. I was expecting it to be broken down by allocation size, but it only seems to show counts.

You can show more columns and there should be a size there. Can you share the profile?

> Do you see anything obvious that could make the PipeLines code any faster? All in all, I feel pretty good about where my library sits in these results. This probably isn't entirely fair to CsvHelper; there are likely things that could be done to improve it, but I haven't investigated.

One of the things that could lower allocations is using a smaller buffer size for the FileStream to avoid the copy.

```csharp
await using var fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, BufferSize);
```

Pass a BufferSize of 1 to the FileStream to disable its internal buffering.
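Concretely, that might look like this (a sketch; the PipeReader does its own buffering, so FileStream's buffer is pure overhead):

```csharp
using System.IO;
using System.IO.Pipelines;

// bufferSize: 1 disables FileStream's internal buffer, so bytes flow
// straight into the PipeReader's buffers without an extra copy.
await using var fileStream = new FileStream(
    filePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1);
var reader = PipeReader.Create(fileStream);
```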

There are other micro-optimizations that could be applied to ParseLines to avoid the SequenceReader when the buffer is in a single segment (a fast path for buffer.IsSingleSegment).
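A rough sketch of that fast path (modeled on the usual TryReadLine pattern, not the exact benchmark code):

```csharp
using System;
using System.Buffers;

static bool TryReadLine(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
{
    if (buffer.IsSingleSegment)
    {
        // Fast path: the whole buffer is one contiguous span, so search it
        // directly and avoid constructing a SequenceReader.
        int newline = buffer.FirstSpan.IndexOf((byte)'\n');
        if (newline >= 0)
        {
            line = buffer.Slice(0, newline);
            buffer = buffer.Slice(newline + 1);
            return true;
        }
        line = default;
        return false;
    }

    // Slow path: multi-segment buffer, fall back to SequenceReader.
    var reader = new SequenceReader<byte>(buffer);
    if (reader.TryReadTo(out line, (byte)'\n'))
    {
        buffer = buffer.Slice(reader.Position);
        return true;
    }
    return false;
}
```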

@MarkPflug
Author

@davidfowl Captured a diag session running the following code:

```csharp
var test = new FileIOTest();
await test.PipeLines();
await Task.Delay(1000);
await test.PipeLines();
```

[Screenshot: .NET Object Allocation Tracking capture]

Pretty much just strings. It's a 6 MB CSV, and eyeballing the data, the Name and Email columns make up the majority of it, so it isn't surprising that there's so much string allocation. Estimating that 4 MB of the 6 MB is in those two columns, doubling for the utf-8 => UTF-16 conversion gives 8 MB; add the per-object overhead of the ~200k string objects and that comes to roughly 12 MB per run. The Employee[] allocation is big, but there's just the one thanks to the pooling.
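Rough arithmetic, assuming ~20 bytes of per-object string overhead on 64-bit:

```
4 MB of utf-8 text   × 2 (UTF-16 chars)    ≈  8 MB
~200k string objects × ~20 B overhead each ≈  4 MB
                                     total ≈ 12 MB per run
```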

Looking at the other allocations after string and Employee[], they mostly appear to be related to System.Diagnostics stuff: is this a side effect of running in the VS diag session?

I have the diagsession saved if you want it.

@davidfowl
Contributor

> Pretty much just strings. It's a 6 MB CSV, and eyeballing the data, the Name and Email columns make up the majority of it, so it isn't surprising that there's so much string allocation. Estimating that 4 MB of the 6 MB is in those two columns, doubling for the utf-8 => UTF-16 conversion gives 8 MB; add the per-object overhead of the ~200k string objects and that comes to roughly 12 MB per run. The Employee[] allocation is big, but there's just the one thanks to the pooling.

You can double-click and get a backtrace to figure out exactly where the allocations are coming from. That's great though, as strings are the one thing you can't get rid of (for string fields, that is).

I have the diagsession saved if you want it.

Yes, if you could share it that would be great. Though I can stop being lazy and collect it myself 😄

@MarkPflug
Author

@davidfowl
diag.zip

@tannergooding

> This seems like something we should add support for in Utf8Parser.

Feel free to open an API proposal ;)

@davidfowl
Contributor

@goldytech do you want to make a new API proposal on https://github.com/dotnet/runtime for parsing just the date using Utf8Parser?

@davidfowl
Contributor

@MarkPflug thanks for the profile! It's just as you said, mostly strings 👍🏾. A few very positive things we learned so far:

- Utf8Parser doesn't support parsing just a date; we can fill that gap.
- Sylvan is a really fast full CSV parser 😄
- Async typically has overhead over sync in cases like this, where the entire file is being read off an SSD. In .NET 6 that overhead has been significantly reduced.
- There are a few optimizations that can be made to the parser (probably in all implementations) to improve the performance further.

I don't know if you all want to keep going but this is fun 😄

@goldytech
Owner

@davidfowl API proposal created at dotnet/runtime#53768
Thanks to all the contributors; it was a great learning experience for me. Our .NET community is awesome 👏

@goldytech goldytech closed this Jun 5, 2021
@MarkPflug
Author

Thanks @davidfowl, was fun looking at this with you.

> There are a few optimizations that can be made to the parser (probably in all implementations) to improve the performance further.

If you can spot a performance optimization for my library, that would be awesome. I believe it is currently the fastest CSV parser available for .NET, as measured by a member of the NuGet team. I've updated my library since the last update to his post, so it should be back on top.

> Sylvan is a really fast full CSV parser 😄

🤩 Thanks, that means a lot coming from you. I've got a WIP to use it as a formatter for ASP.NET Core MVC (text/csv); I'll ping you when I get around to making it public.
