
Add Sylvan csv benchmark #2

Closed · wants to merge 5 commits

Conversation

MarkPflug

Adds my own CSV library, Sylvan.Data.Csv, to the benchmark set.

Current results on my machine:

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0      | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|-----------:|----------:|----------:|----------:|
| Sylvan      | 58.10 ms  | 0.411 ms | 0.364 ms | 1    | 2111.1111  | 1000.0000 | 111.1111  | 12 MB     |
| PipeLines   | 88.62 ms  | 1.672 ms | 1.642 ms | 2    | 3166.6667  | 1166.6667 | 333.3333  | 17 MB     |
| CsvHelper   | 178.20 ms | 3.187 ms | 2.981 ms | 3    | 12666.6667 | 4666.6667 | 2000.0000 | 77 MB     |
| AsyncStream | 181.59 ms | 3.024 ms | 2.828 ms | 3    | 10333.3333 | 3333.3333 | 1000.0000 | 64 MB     |

The pipelines run was previously faster than Sylvan until the most recent changes you added.

@davidfowl
Contributor

I'm guessing it's faster because of the synchronous reads; as for why it allocates less, that's surprising 😄. We could do an allocation profile to find out; my guess is that it's the async overhead. FileStream has been rewritten in .NET 6 to make the async operations much more efficient, so I'd like to see a run on that.

@MarkPflug
Author

One issue is that in the last commit the PipeLines code was changed from:

```csharp
if (Utf8Parser.TryParse(buffer, out DateTime value, out var bytesConsumed))
```

to:

```csharp
if (DateTime.TryParse(Encoding.UTF8.GetString(line[..comaAt]), out var doj))
```

Not sure why that was done, but it slows things down quite a bit. There is also an awkward per-line check for the header row that isn't ideal:

```csharp
if (line.IndexOf(ColumnHeaders) >= 0) // Ignore the Header row
```
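One way to avoid scanning every line would be to skip the header once before the loop. A rough sketch, where TryReadLine and ParseLine are stand-ins for the helpers in the Pipelines code, not its actual method names:

```csharp
// Sketch only: consume the first line (the header) once, instead of
// searching every line for the header text.
bool headerSkipped = false;
while (TryReadLine(ref buffer, out ReadOnlySequence<byte> line))
{
    if (!headerSkipped)
    {
        headerSkipped = true; // the first line is always the header
        continue;
    }
    ParseLine(line);
}
```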

Fixing those issues and increasing the buffer that the Pipelines code uses to 16k (and doing the same for Sylvan) changes things quite a bit. Using async with Sylvan causes a small perf regression, but it's not terrible as long as the buffer isn't too small.

| Method      | Mean     | Error    | StdDev   | Median   | Rank | Gen 0     | Gen 1     | Gen 2    | Allocated |
|-------------|---------:|---------:|---------:|---------:|-----:|----------:|----------:|---------:|----------:|
| PipeLines   | 30.88 ms | 0.439 ms | 0.410 ms | 30.95 ms | 1    | 2187.5000 | 1062.5000 | 125.0000 | 12 MB     |
| Sylvan      | 55.84 ms | 1.082 ms | 1.481 ms | 55.59 ms | 2    | 2200.0000 | 1100.0000 | 100.0000 | 13 MB     |
| SylvanAsync | 61.50 ms | 1.225 ms | 2.637 ms | 60.38 ms | 3    | 2111.1111 | 1000.0000 | 111.1111 | 13 MB     |

Quite a difference. But the performance disparity is likely because Sylvan uses a TextReader to allow for various encodings, so I'm paying a penalty to support encodings other than UTF-8. Also, the Pipelines implementation isn't a real CSV implementation, in that it doesn't handle the edge cases of quoted/escaped fields. Sometimes you can get away with that, depending on your dataset, but I wouldn't want to rely on it.
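For example (made-up row), a quoted field containing a comma defeats any parser that simply splits on commas:

```csharp
// Illustrative only: the quoted Name field embeds a comma, so a naive
// Split(',') produces four fields instead of three.
var line = "\"Smith, John\",john@example.com,08/12/2020";
var fields = line.Split(','); // fields.Length == 4, which is wrong
```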

@goldytech
Owner

@MarkPflug Thanks for your contribution. The code was changed for the DateTime parsing because Utf8Parser.TryParse() doesn't support parsing a date on its own; parsing is only supported for a full DateTime, and the csv's DateOfJoining column doesn't have a time in it, hence the change. https://docs.microsoft.com/en-us/dotnet/api/system.buffers.text.utf8parser.tryparse?view=net-5.0#System_Buffers_Text_Utf8Parser_TryParse_System_ReadOnlySpan_System_Byte__System_DateTime__System_Int32__System_Char_
The header check is important because the header is not part of the actual data that needs to be parsed.
If you have a solution, please raise a PR.
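A minimal repro of the difference (illustrative value; note that DateTime.TryParse is culture-sensitive):

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Utf8Parser wants a full date-and-time, so a date-only field fails to
// parse, while the string-based DateTime.TryParse accepts it.
byte[] dateOnly = Encoding.UTF8.GetBytes("08/12/2020");
bool utf8Ok = Utf8Parser.TryParse(dateOnly, out DateTime viaUtf8, out int consumed);
bool strOk = DateTime.TryParse("08/12/2020", out DateTime viaString);
Console.WriteLine($"{utf8Ok} {strOk}"); // "False True", per the behavior described above
```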

@davidfowl
Contributor

@goldytech so the parsing fails?

> Also, the Pipelines implementation isn't a real CSV implementation, in that it doesn't handle the edge cases of quoted/escaped fields. Sometimes you can get away with that, depending on your dataset, but I wouldn't want to rely on it.

Agreed. The code can be made faster and more correct, but I think that's beside the point of the blog post. Fit-to-purpose code is almost always faster than general-purpose code, because it can cut corners.

@MarkPflug if you're interested, can you do an allocation profile on the implementations to see what's being allocated?

@goldytech
Owner

goldytech commented Jun 4, 2021

@davidfowl Yes, the out variable comes back with its default value from the TryParse method.

MarkPflug added 3 commits June 4, 2021 07:13
Avoid allocating temp string for date parsing.
@MarkPflug
Author

MarkPflug commented Jun 4, 2021

@goldytech Interesting, I didn't know that about those new Utf8Parser methods. I assumed it would offer parity with the DateTime.TryParse method, so I would have expected all those date-only values to parse properly. The bad news is that the DateTime parsing is one of the more expensive bits of processing this file, so that's what's slowing things down the most. One way to improve this slightly is to use TryParseExact, but that can't currently be used for this dataset because the date formats are inconsistent: line 2 has "4/20/2847" while line 10 has "08/12/20". I think if these were "04/20/2847" and "08/12/0020" then you could TryParseExact with the format "MM/dd/yyyy" and it would speed things up a bit.
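That would look something like this (hypothetical: it only works if the data were first normalized to zero-padded months/days and four-digit years):

```csharp
using System;
using System.Globalization;

// TryParseExact skips format detection, so it's cheaper than TryParse,
// but every value must match the single format exactly.
bool ok = DateTime.TryParseExact(
    "04/20/2847",
    "MM/dd/yyyy",
    CultureInfo.InvariantCulture,
    DateTimeStyles.None,
    out DateTime doj);
```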

@davidfowl I ran an allocation profile (.NET Object Allocation Tracking in VS) and I see mostly strings, around 200k of them as expected. I don't have a lot of experience with that tool, so not sure what else to look for exactly. I was expecting it to be broken down by allocation size, but it only seems to show counts.

My previous results with PipeLines at ~30 ms were bogus: it was failing to parse the dates, and it fails fast in that scenario. I've increased the buffer size, fixed the temp string allocation used when parsing the dates, and cleaned up a few things in the PipeLines code. With those changes, I'm seeing these results on my machine:

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0      | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|-----------:|----------:|----------:|----------:|
| PipeLines   | 53.06 ms  | 0.360 ms | 0.320 ms | 1    | 2100.0000  | 1000.0000 | 100.0000  | 13 MB     |
| Sylvan      | 53.83 ms  | 0.459 ms | 0.429 ms | 1    | 2200.0000  | 1100.0000 | 100.0000  | 13 MB     |
| SylvanAsync | 59.51 ms  | 0.193 ms | 0.161 ms | 2    | 2222.2222  | 1000.0000 | 111.1111  | 13 MB     |
| CsvHelper   | 167.06 ms | 3.242 ms | 4.100 ms | 3    | 12666.6667 | 4666.6667 | 2000.0000 | 77 MB     |
| AsyncStream | 170.41 ms | 2.369 ms | 1.978 ms | 3    | 9666.6667  | 3333.3333 | 666.6667  | 60 MB     |

Do you see anything obvious that could make the PipeLines code any faster? All in all, I feel pretty good about where my library sits in these results. This probably isn't entirely fair to CsvHelper; there are likely things that could be done to improve it, but I haven't investigated.

Edit: I should also mention these results were run on net6.0 preview 4.

@MarkPflug
Author

Tweaked the CsvHelper code a bit to improve its performance somewhat. There might be other things that can be done, but I'm not that familiar with the library.

| Method      | Mean      | Error    | StdDev   | Rank | Gen 0     | Gen 1     | Gen 2     | Allocated |
|-------------|----------:|---------:|---------:|-----:|----------:|----------:|----------:|----------:|
| PipeLines   | 51.17 ms  | 0.596 ms | 0.497 ms | 1    | 2100.0000 | 1000.0000 | 100.0000  | 13 MB     |
| Sylvan      | 53.76 ms  | 0.455 ms | 0.426 ms | 2    | 2200.0000 | 1100.0000 | 100.0000  | 13 MB     |
| SylvanAsync | 61.06 ms  | 1.196 ms | 1.597 ms | 3    | 2222.2222 | 1000.0000 | 111.1111  | 13 MB     |
| CsvHelper   | 130.28 ms | 2.594 ms | 4.544 ms | 4    | 9250.0000 | 3250.0000 | 1250.0000 | 56 MB     |
| AsyncStream | 172.89 ms | 3.318 ms | 3.687 ms | 5    | 9666.6667 | 3333.3333 | 666.6667  | 60 MB     |

@davidfowl
Contributor

> @goldytech Interesting, I didn't know that about those new Utf8Parser methods. I assumed it would offer parity with the DateTime.TryParse method, so I would have expected all those date-only values to parse properly. The bad news is that the DateTime parsing is one of the more expensive bits of processing this file, so that's what's slowing things down the most. One way to improve this slightly is to use TryParseExact, but that can't currently be used for this dataset because the date formats are inconsistent: line 2 has "4/20/2847" while line 10 has "08/12/20". I think if these were "04/20/2847" and "08/12/0020" then you could TryParseExact with the format "MM/dd/yyyy" and it would speed things up a bit.

@pgovind @tannergooding This seems like something we should add support for in Utf8Parser.
cc @jeffhandley

> @davidfowl I ran an allocation profile (.NET Object Allocation Tracking in VS) and I see mostly strings, around 200k of them as expected. I don't have a lot of experience with that tool, so not sure what else to look for exactly. I was expecting it to be broken down by allocation size, but it only seems to show counts.

You can show more columns and there should be a size there. Can you share the profile?

> Do you see anything obvious that could make the PipeLines code any faster? All in all, I feel pretty good about where my library sits in these results. This probably isn't entirely fair to CsvHelper; there are likely things that could be done to improve it, but I haven't investigated.

One of the things that could lower allocations is using a smaller buffer size for the FileStream to avoid the copy.

```csharp
await using var fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, BufferSize);
```

Pass a BufferSize of 1 to the FileStream to disable its internal buffering.
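Concretely, that might look like this (a sketch; the PipeReader does its own buffering, so FileStream's buffer is pure overhead):

```csharp
using System.IO;
using System.IO.Pipelines;

// bufferSize: 1 disables FileStream's internal buffer, so bytes flow
// straight into the PipeReader's buffers without an extra copy.
await using var fileStream = new FileStream(
    filePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1);
var reader = PipeReader.Create(fileStream);
```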

There are other micro-optimizations that could be applied to ParseLines to avoid the SequenceReader when the buffer is in a single segment (a fast path for buffer.IsSingleSegment).
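A rough sketch of that fast path (modeled on the usual TryReadLine pattern, not the exact benchmark code):

```csharp
using System;
using System.Buffers;

static bool TryReadLine(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> line)
{
    if (buffer.IsSingleSegment)
    {
        // Fast path: the whole buffer is one contiguous span, so search it
        // directly and avoid constructing a SequenceReader.
        int newline = buffer.FirstSpan.IndexOf((byte)'\n');
        if (newline >= 0)
        {
            line = buffer.Slice(0, newline);
            buffer = buffer.Slice(newline + 1);
            return true;
        }
        line = default;
        return false;
    }

    // Slow path: multi-segment buffer, fall back to SequenceReader.
    var reader = new SequenceReader<byte>(buffer);
    if (reader.TryReadTo(out line, (byte)'\n'))
    {
        buffer = buffer.Slice(reader.Position);
        return true;
    }
    return false;
}
```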

@MarkPflug
Author

@davidfowl Captured a diag session running the following code:

```csharp
var test = new FileIOTest();
await test.PipeLines();
await Task.Delay(1000);
await test.PipeLines();
```

[Screenshot: .NET Object Allocation Tracking capture]

Pretty much just strings. It's a 6 MB CSV, and eyeballing the data, the Name and Email columns make up the majority of it, so it isn't surprising that there's so much string allocation. Estimating that 4 MB of the 6 MB is in those two columns, doubling for the utf-8 => UTF-16 conversion gives 8 MB; add the per-object overhead of the ~200k string objects and that comes to roughly 12 MB per run. The Employee[] allocation is big, but there's just the one thanks to the pooling.
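Rough arithmetic, assuming ~20 bytes of per-object string overhead on 64-bit:

```
4 MB of utf-8 text   × 2 (UTF-16 chars)    ≈  8 MB
~200k string objects × ~20 B overhead each ≈  4 MB
                                     total ≈ 12 MB per run
```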

Looking at the other allocations after string and Employee[], they mostly appear to be related to System.Diagnostics stuff: is this a side effect of running in the VS diag session?

I have the diagsession saved if you want it.

@davidfowl
Contributor

> Pretty much just strings. It's a 6 MB CSV, and eyeballing the data, the Name and Email columns make up the majority of it, so it isn't surprising that there's so much string allocation. Estimating that 4 MB of the 6 MB is in those two columns, doubling for the utf-8 => UTF-16 conversion gives 8 MB; add the per-object overhead of the ~200k string objects and that comes to roughly 12 MB per run. The Employee[] allocation is big, but there's just the one thanks to the pooling.

You can double-click and get a backtrace to figure out exactly where the allocations are coming from. That's great though, as strings are the one thing you can't get rid of (for string fields, that is).

I have the diagsession saved if you want it.

Yes, if you could share it that would be great. Though I can stop being lazy and collect it myself 😄

@MarkPflug
Author

@davidfowl
diag.zip

@tannergooding

> This seems like something we should add support for in Utf8Parser.

Feel free to open an API proposal ;)

@davidfowl
Contributor

@goldytech do you want to make a new API proposal on https://github.com/dotnet/runtime for parsing just the date using Utf8Parser?

@davidfowl
Contributor

@MarkPflug thanks for the profile! It's just as you said, mostly strings 👍🏾. A few very positive things we learned so far:

- Utf8Parser doesn't support parsing just a date; we can fill that gap.
- Sylvan is a really fast full CSV parser 😄
- Async typically has overhead over sync in cases like this, where the entire file is being read off an SSD. In .NET 6 that overhead has been significantly reduced.
- There are a few optimizations that can be made to the parser (probably in all implementations) to improve the performance further.

I don't know if you all want to keep going but this is fun 😄

@goldytech
Owner

@davidfowl API proposal created at dotnet/runtime#53768
Thanks to all the contributors; it was a great learning experience for me. Our .NET community is awesome 👏

@goldytech goldytech closed this Jun 5, 2021
@MarkPflug
Author

Thanks @davidfowl, was fun looking at this with you.

> There are a few optimizations that can be made to the parser (probably in all implementations) to improve the performance further.

If you can spot a performance optimization for my library, that would be awesome. I believe it is currently the fastest CSV parser available for .NET, as measured by a member of the NuGet team. I've updated my library since the last update to his post, so it should be back on top.

> Sylvan is a really fast full CSV parser 😄

🤩 Thanks, that means a lot coming from you. I've got a WIP to use it as a formatter for ASP.NET Core MVC (text/csv); I'll ping you when I get around to making it public.
