Add Sylvan csv benchmark #2
Conversation
I'm guessing it's better because of the synchronous reads; as for why it allocates less, that's surprising 😄. I guess we could do an allocation profile to find out. My guess is the async overhead: FileStream was rewritten in .NET 6 to make the async operations much more efficient. I'd like to see a run on that.
One issue is that in the last commit the Pipelines code was changed from: Not sure why that was done, but it slows things down quite a bit. There is also an awkward check in the loop for the header row that isn't ideal:
Fixing those issues and increasing the buffer that the Pipelines uses to 16k (and doing the same for Sylvan) changes things quite a bit. Using async with Sylvan causes a small perf regression, but not terrible so long as the buffer isn't too small.
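For reference, a minimal sketch of what a buffer-size change like that might look like, assuming the benchmark creates its `PipeReader` over a `FileStream` via `PipeReader.Create`. The temp-file setup is only there to keep the snippet self-contained; the benchmark would open the real CSV file.

```csharp
using System;
using System.IO;
using System.IO.Pipelines;

class Demo
{
    static void Main()
    {
        // Placeholder file just so the snippet runs; the benchmark reads the real CSV.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "id,name\n1,alice\n");

        using FileStream file = File.OpenRead(path);

        // The relevant change: a 16k buffer instead of the default.
        PipeReader reader = PipeReader.Create(
            file,
            new StreamPipeReaderOptions(bufferSize: 16 * 1024));

        Console.WriteLine(reader != null); // True
        reader.Complete();
    }
}
```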
Quite a difference. But the performance disparity is likely because Sylvan uses a TextReader to allow for various encodings, so I'm paying a penalty to support encodings other than UTF-8. Also, the Pipelines implementation isn't a real CSV implementation, in that it doesn't handle the edge cases of quoted/escaped fields. Sometimes you can get away with that, depending on your dataset, but I wouldn't want to rely on it.
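To illustrate the edge case being described, here's a small standalone example (not code from either implementation) showing how a delimiter-only scan miscounts fields when a quoted field contains a comma, per the RFC 4180 quoting rules:

```csharp
using System;

class Demo
{
    static void Main()
    {
        // RFC 4180 edge case: a comma inside a quoted field.
        string line = "\"Smith, John\",js@example.com";

        // Naive split (what a delimiter-only scanner effectively does):
        Console.WriteLine(line.Split(',').Length); // 3 — wrong

        // Quote-aware scan:
        int fields = 1;
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == ',' && !inQuotes) fields++;
        }
        Console.WriteLine(fields); // 2 — correct
    }
}
```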
@MarkPflug Thanks for your contribution. The code was changed for DateTime parsing because
@goldytech so the parsing fails?
Agreed. The code can be made faster and more correct, though, but I think that's beside the point of this blog post. Fit-to-purpose things are almost always faster than general-purpose things because they cut corners. @MarkPflug if you're interested, can you do an allocation profile on the implementations to see what's being allocated?
@davidfowl Yes, the out variable has a null value in
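A standalone repro of the behavior being discussed, assuming the file contains date-only text like `08/12/2020`: `Utf8Parser.TryParse` for `DateTime` only accepts a handful of standard formats, and its default format expects a time component, so date-only input fails where `DateTime.TryParse` succeeds.

```csharp
using System;
using System.Buffers.Text;
using System.Globalization;
using System.Text;

class Demo
{
    static void Main()
    {
        // DateTime.TryParse handles date-only text...
        Console.WriteLine(DateTime.TryParse("08/12/2020",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out _)); // True

        // ...but Utf8Parser's default DateTime format ("MM/dd/yyyy HH:mm:ss")
        // expects a time component, so the same date-only bytes fail to parse.
        byte[] dateOnly = Encoding.UTF8.GetBytes("08/12/2020");
        Console.WriteLine(Utf8Parser.TryParse(dateOnly, out DateTime _, out int _)); // False

        byte[] withTime = Encoding.UTF8.GetBytes("08/12/2020 00:00:00");
        Console.WriteLine(Utf8Parser.TryParse(withTime, out DateTime _, out int _)); // True
    }
}
```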
Avoid allocating temp string for date parsing.
Clean up benchmarks.
@goldytech Interesting, I didn't know that about those new Utf8Parser methods. I assumed it would offer parity with the DateTime.TryParse method, so I would have expected all those date-only values to parse properly. The bad news is that the DateTime parsing is one of the more expensive bits of processing this file, so that's what's slowing things down the most. One way to improve this slightly is to use TryParseExact, but that can't currently be used for this dataset because the date formats are inconsistent: line 2 has "4/20/2847" while line 10 has "08/12/20". I think if these were "04/20/2847" and "08/12/0020" then you could TryParseExact with the format "MM/dd/yyyy" and it would speed things up a bit.

@davidfowl I ran an allocation profile (.NET Object Allocation Tracking in VS) and I see mostly strings, around 200k of them, as expected. I don't have a lot of experience with that tool, so I'm not sure what else to look for exactly. I was expecting it to be broken down by allocation size, but it only seems to show counts.

My previous result with Pipelines at ~30ms was bogus, because it was failing to parse the dates and it fails fast in that scenario. I've increased the buffer size being used, fixed up the temp string allocation used when parsing the dates, and cleaned up a few things in the Pipelines code. With those changes, I'm seeing these results on my machine:
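To make the format-inconsistency point concrete, a small sketch (invariant culture assumed) of why a single exact format can't cover both "4/20/2847" and "08/12/20", while the overload that takes several candidate formats can:

```csharp
using System;
using System.Globalization;

class Demo
{
    static void Main()
    {
        var ci = CultureInfo.InvariantCulture;
        var none = DateTimeStyles.None;

        // A single exact format only matches fully padded values.
        Console.WriteLine(DateTime.TryParseExact("04/20/2847", "MM/dd/yyyy", ci, none, out _)); // True
        Console.WriteLine(DateTime.TryParseExact("4/20/2847", "MM/dd/yyyy", ci, none, out _));  // False
        Console.WriteLine(DateTime.TryParseExact("08/12/20", "MM/dd/yyyy", ci, none, out _));   // False

        // The multi-format overload handles the mixed data,
        // at some cost relative to one known format.
        string[] formats = { "MM/dd/yyyy", "M/d/yyyy", "MM/dd/yy" };
        Console.WriteLine(DateTime.TryParseExact("4/20/2847", formats, ci, none, out _));       // True
    }
}
```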
Do you see anything obvious that could make the Pipelines code any faster? All in all, I feel pretty good about where my library sits in these results. This probably isn't entirely fair to CsvHelper; there are probably things that could be done to improve it, but I haven't investigated. Edit: I should also mention these results are running net6.0 preview 4.
Tweaked the CsvHelper code a bit to improve its performance somewhat. There might be other things that can be done, but I'm not that familiar with the library.
@pgovind @tannergooding This seems like something we should add support for in Utf8Parser.
You can show more columns, and there should be a size there. Can you share the profile?
One of the things that could lower allocations is using a smaller buffer size for the
There are other micro-optimizations that can be applied to ParseLines to avoid the SequenceReader if the buffer is in a single segment (a fast path for buffer.IsSingleSegment).
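A hedged sketch of what such a fast path could look like. This is illustrative, not the PR's actual ParseLines; the ParseLine helper here is a stand-in that just counts complete lines, and real field parsing would go in its place.

```csharp
using System;
using System.Buffers;
using System.Text;

class Demo
{
    static int _lines;

    // Sketch: when the buffer is one contiguous segment, scan its span directly
    // and skip the SequenceReader machinery entirely.
    static void ParseLines(ref ReadOnlySequence<byte> buffer)
    {
        if (buffer.IsSingleSegment)
        {
            ReadOnlySpan<byte> span = buffer.FirstSpan;
            int consumed = 0;
            int nl;
            while ((nl = span.Slice(consumed).IndexOf((byte)'\n')) >= 0)
            {
                ParseLine(span.Slice(consumed, nl));
                consumed += nl + 1;
            }
            buffer = buffer.Slice(consumed); // unconsumed tail stays in the buffer
            return;
        }

        // Slow path: SequenceReader copes with segment boundaries.
        var reader = new SequenceReader<byte>(buffer);
        while (reader.TryReadTo(out ReadOnlySpan<byte> line, (byte)'\n'))
            ParseLine(line);
        buffer = buffer.Slice(reader.Position);
    }

    static void ParseLine(ReadOnlySpan<byte> line) => _lines++; // stand-in for real parsing

    static void Main()
    {
        var seq = new ReadOnlySequence<byte>(Encoding.UTF8.GetBytes("a,b\n1,2\n3,4\npartial"));
        ParseLines(ref seq);
        Console.WriteLine(_lines); // 3 complete lines; "partial" remains in the buffer
    }
}
```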
@davidfowl Captured a diag session running the following code:
Pretty much just strings. It is a 6 MB CSV, and eyeballing the data, the Name and Email columns are the majority, so it isn't surprising that there's so much string allocation. Estimating that 4 MB of the 6 MB is in those two columns, multiplying by 2 for the UTF-8 => UTF-16 conversion gives 8 MB; then there's additional overhead for the .NET objects (200k of them), which I guess comes to ~12 MB total per run. The Employee[] allocation is big, but there's just the one, due to the pooling. Looking at the other allocations after string and Employee[], they mostly appear to be related to System.Diagnostics stuff: is this a side effect of running in the VS diag session? I have the diagsession saved if you want it.
You can double-click and get a backtrace to figure out exactly where allocations are coming from. That is great, though, as strings are the one thing you can't get rid of (for string fields, that is).
Yes, if you could share it that would be great. Though I can stop being lazy and collect it myself 😄
Feel free to open an API proposal ;)
@goldytech do you want to make a new API proposal on https://github.com/dotnet/runtime for parsing just the date using Utf8Parser? |
@MarkPflug thanks for the profile! It's just as you said, mostly strings 👍🏾. A few very positive things we learned so far:
I don't know if you all want to keep going, but this is fun 😄
@davidfowl API Proposal created at dotnet/runtime#53768 |
Thanks @davidfowl, was fun looking at this with you.
If you can spot a performance optimization for my library that would be awesome. I believe it is currently the fastest CSV parser available for .NET, as measured by a member of the NuGet team. I've updated my library since the last update to his post, so it should be back on top.
🤩 Thanks, means a lot coming from you. I've got a WIP to use it as a formatter for ASP.NET Core MVC (text/csv); I'll ping you when I get around to making it public.
Adds my own CSV library, Sylvan.Data.Csv, to the benchmark set.
Current results on my machine:
The Pipelines run was faster than Sylvan until the most recent changes you added.