
More optimizations after #157 #158

Merged 2 commits from opt2 into Data-Liberation-Front:master on Oct 8, 2015
Conversation

jpmckinney

Memoizes the results of CSV#encode_str and CSV#escape_re, and disables CSV's converters feature.
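As a rough illustration of the memoization pattern at work here (a sketch, not the PR's actual patch — the class and cache names are invented), caching an expensive per-string computation in a hash keyed by its input looks like:

```ruby
# Illustrative sketch: cache an expensive per-string computation in a
# hash so repeated inputs reuse earlier work. Class and variable names
# are made up for the example; Regexp.escape stands in for the real
# work done inside CSV#escape_re.
class EscapeCache
  def initialize
    # The default block computes and stores the value on first lookup;
    # later lookups with the same key return the cached object.
    @escape_re_cache = Hash.new do |cache, str|
      cache[str] = Regexp.escape(str)
    end
  end

  # Memoized stand-in for CSV#escape_re.
  def escape_re(str)
    @escape_re_cache[str]
  end
end

cache = EscapeCache.new
cache.escape_re("|")  # computed once
cache.escape_re("|")  # served from the cache on every later call
```

The win comes from CSV calling these helpers once per row with a small set of recurring inputs, so the cache hit rate is high.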

At this point, the two big bottlenecks are build_formats (already optimized, but eliminating some if branches will make it faster), and CSV#init_separators, which is slow because determining the row_sep is slow.

Call graph after memoizing escape_re:

[stack — stackprof call graph screenshot]

After memoizing encode_str:

[stack — stackprof call graph screenshot]

After disabling converters:

[stack — stackprof call graph screenshot]

@jpmckinney
Author

About 40% faster with all optimizations.

@jpmckinney
Author

Note that after switching to FastCSV, init_parsers can be overridden to skip the building of @parsers, since those are only used by CSV#shift, which FastCSV overrides.
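A minimal sketch of that override (assuming the stdlib CSV of that era, where init_parsers was a private method taking the options hash; the wrapper class name is invented for the example):

```ruby
require 'csv'

# Sketch of skipping the @parsers construction described above. In the
# CSV shipped with Ruby at the time, the private init_parsers method
# built a hash of regexps consumed only by CSV#shift; since FastCSV
# supplies its own shift, a subclass can stub the work out.
# FastCSVReader is an illustrative name, not a real class.
class FastCSVReader < CSV
  private

  # Leave @parsers empty: nothing in FastCSV's shift path reads it.
  def init_parsers(options)
    @parsers = {}
  end
end
```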

@pezholio
Contributor

pezholio commented Oct 8, 2015

This is great stuff - thanks so much! Stackprof looks like a really useful tool too; I'll definitely be using it as I do more optimisations 👍

pezholio added a commit that referenced this pull request Oct 8, 2015
@pezholio pezholio merged commit 11dabd3 into Data-Liberation-Front:master Oct 8, 2015
@jpmckinney jpmckinney deleted the opt2 branch October 8, 2015 14:43
@jpmckinney
Author

I think the next big optimization will involve rewriting validate_stream and parse_line so they don't do all that line break/quote char work themselves, and instead let the CSV library do as much as possible (i.e. call CSV.new once for the entire file, not once for every row!). Not sure yet what that looks like, though.
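One rough shape that direction could take — hand a single CSV instance the whole stream and let it enumerate rows — purely as an illustration (StringIO stands in for the real input):

```ruby
require 'csv'
require 'stringio'

# Sketch of parsing an entire stream with one CSV instance instead of
# constructing a parser per line. The data here is a toy example.
io = StringIO.new("a,b\n1,2\n3,4\n")
csv = CSV.new(io)   # one parser for the whole stream
rows = csv.to_a     # each/shift pulls rows off the IO lazily
# rows == [["a", "b"], ["1", "2"], ["3", "4"]]
```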

@pezholio
Contributor

pezholio commented Oct 8, 2015

Yeah, it's a tricky one, because parse_line doesn't really know if it's dealing with a full string (especially when streaming from a URL), so some of the work needs to be manual. I'm assuming that would be faster than calling the CSV library anyway?

@jpmckinney
Author

The CSV library calls @io.gets(@row_sep) in its shift method. You can pass to the CSV library an IO object whose gets method waits for more data from Typhoeus until it hits a @row_sep (which is the semantics of IO#gets). This sounds like something someone doing streaming would have already implemented elsewhere.
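A sketch of that kind of IO adapter (all names invented; a plain array of chunks stands in for data arriving from Typhoeus's streaming callbacks):

```ruby
# Illustrative IO-like adapter: gets(sep) buffers incoming chunks and
# only returns once a full line ending in sep is available, matching
# IO#gets semantics (nil at EOF, trailing data returned without sep).
class ChunkedIO
  def initialize(chunks)
    @chunks = chunks  # stand-in for an HTTP body chunk queue
    @buffer = +""
  end

  def gets(sep = $/)
    # "Wait" for more data until the buffer contains sep or we run dry.
    until (i = @buffer.index(sep)) || @chunks.empty?
      @buffer << @chunks.shift
    end
    if i
      @buffer.slice!(0, i + sep.length)
    elsif !@buffer.empty?
      @buffer.slice!(0, @buffer.length)  # trailing data without a sep
    end
  end

  def eof?
    @buffer.empty? && @chunks.empty?
  end
end

io = ChunkedIO.new(["a,b\n1,", "2\n3", ",4\n"])
io.gets("\n")  # => "a,b\n"
io.gets("\n")  # => "1,2\n"
```

In the real streaming case the chunk queue would be fed by the HTTP client's on-body callback rather than pre-filled, but the gets contract is the same.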

If CSVlint isn't parsing a remote file, then you can just use the CSV library normally as before.
