
More optimizations after #157 #158

Merged 2 commits from opt2 into Data-Liberation-Front:master on Oct 8, 2015
Conversation

jpmckinney

Memoizes the results of CSV#encode_str and CSV#escape_re, and disables CSV's converters feature.
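As a rough illustration of the memoization pattern at work here (a sketch, not the PR's actual patch — the class and cache names are invented), caching an expensive per-string computation in a hash keyed by its input looks like:

```ruby
# Illustrative sketch: cache an expensive per-string computation in a
# hash so repeated inputs reuse earlier work. Class and variable names
# are made up for the example; Regexp.escape stands in for the real
# work done inside CSV#escape_re.
class EscapeCache
  def initialize
    # The default block computes and stores the value on first lookup;
    # later lookups with the same key return the cached object.
    @escape_re_cache = Hash.new do |cache, str|
      cache[str] = Regexp.escape(str)
    end
  end

  # Memoized stand-in for CSV#escape_re.
  def escape_re(str)
    @escape_re_cache[str]
  end
end

cache = EscapeCache.new
cache.escape_re("|")  # computed once
cache.escape_re("|")  # served from the cache on every later call
```

The win comes from CSV calling these helpers once per row with a small set of recurring inputs, so the cache hit rate is high.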

At this point, the two big bottlenecks are build_formats (already optimized, but eliminating some if branches will make it faster), and CSV#init_separators, which is slow because determining the row_sep is slow.

Call graph after memoizing escape_re:

[stack — stackprof call graph screenshot]

After memoizing encode_str:

[stack — stackprof call graph screenshot]

After disabling converters:

[stack — stackprof call graph screenshot]

@jpmckinney
Author

About 40% faster with all optimizations.

@jpmckinney
Author

Note that after switching to FastCSV, init_parsers can be overridden to skip the building of @parsers, since those are only used by CSV#shift, which FastCSV overrides.
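A minimal sketch of that override (assuming the stdlib CSV of that era, where init_parsers was a private method taking the options hash; the wrapper class name is invented for the example):

```ruby
require 'csv'

# Sketch of skipping the @parsers construction described above. In the
# CSV shipped with Ruby at the time, the private init_parsers method
# built a hash of regexps consumed only by CSV#shift; since FastCSV
# supplies its own shift, a subclass can stub the work out.
# FastCSVReader is an illustrative name, not a real class.
class FastCSVReader < CSV
  private

  # Leave @parsers empty: nothing in FastCSV's shift path reads it.
  def init_parsers(options)
    @parsers = {}
  end
end
```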

@pezholio
Contributor

pezholio commented Oct 8, 2015

This is great stuff - thanks so much! Stackprof looks like a really useful tool too; I'll definitely be using it as I do more optimisations 👍

pezholio added a commit that referenced this pull request Oct 8, 2015
@pezholio pezholio merged commit 11dabd3 into Data-Liberation-Front:master Oct 8, 2015
@jpmckinney jpmckinney deleted the opt2 branch October 8, 2015 14:43
@jpmckinney
Author

I think the next big optimization will involve rewriting validate_stream and parse_line so they don't do all that line break/quote char work themselves, and instead let the CSV library do as much as possible (i.e. call CSV.new once for the entire file, not once for every row!). Not sure yet what that looks like, though.
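One rough shape that direction could take — hand a single CSV instance the whole stream and let it enumerate rows — purely as an illustration (StringIO stands in for the real input):

```ruby
require 'csv'
require 'stringio'

# Sketch of parsing an entire stream with one CSV instance instead of
# constructing a parser per line. The data here is a toy example.
io = StringIO.new("a,b\n1,2\n3,4\n")
csv = CSV.new(io)   # one parser for the whole stream
rows = csv.to_a     # each/shift pulls rows off the IO lazily
# rows == [["a", "b"], ["1", "2"], ["3", "4"]]
```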

@pezholio
Contributor

pezholio commented Oct 8, 2015

Yeah, it's a tricky one, because parse_line doesn't really know if it's dealing with a full string (especially when streaming from a URL), so some of the work needs to be manual. I'm assuming that would be faster than calling the CSV library anyway?

@jpmckinney
Author

The CSV library calls @io.gets(@row_sep) in its shift method. You can pass to the CSV library an IO object whose gets method waits for more data from Typhoeus until it hits a @row_sep (which is the semantics of IO#gets). This sounds like something someone doing streaming would have already implemented elsewhere.
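A sketch of that kind of IO adapter (all names invented; a plain array of chunks stands in for data arriving from Typhoeus's streaming callbacks):

```ruby
# Illustrative IO-like adapter: gets(sep) buffers incoming chunks and
# only returns once a full line ending in sep is available, matching
# IO#gets semantics (nil at EOF, trailing data returned without sep).
class ChunkedIO
  def initialize(chunks)
    @chunks = chunks  # stand-in for an HTTP body chunk queue
    @buffer = +""
  end

  def gets(sep = $/)
    # "Wait" for more data until the buffer contains sep or we run dry.
    until (i = @buffer.index(sep)) || @chunks.empty?
      @buffer << @chunks.shift
    end
    if i
      @buffer.slice!(0, i + sep.length)
    elsif !@buffer.empty?
      @buffer.slice!(0, @buffer.length)  # trailing data without a sep
    end
  end

  def eof?
    @buffer.empty? && @chunks.empty?
  end
end

io = ChunkedIO.new(["a,b\n1,", "2\n3", ",4\n"])
io.gets("\n")  # => "a,b\n"
io.gets("\n")  # => "1,2\n"
```

In the real streaming case the chunk queue would be fed by the HTTP client's on-body callback rather than pre-filled, but the gets contract is the same.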

If CSVlint isn't parsing a remote file, then you can just use the CSV library normally as before.
