Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make csvlint way faster #99

Merged
merged 11 commits into from
Nov 11, 2014
Merged

Make csvlint way faster #99

merged 11 commits into from
Nov 11, 2014

Conversation

jpmckinney
Copy link

The speed gain is directly correlated to the size of the CSV. I tested on a 1.7MB CSV. These commits dropped its validation time from 9s to 1s. I used stackprof to identify bottlenecks. Main bottleneck now is Ruby's CSV library, though date format detection is still slow. I recommend reviewing one commit at a time. In brief:

  • Using exceptions to control flow is incredibly slow (biggest speed gain here)
  • Metaprogramming is slow
  • A rescue that swallows all exceptions is an anti-pattern, because it can hide bugs, in this case "TypeError: no implicit conversion of Proc into String"

With the new, fast version, the code spends time roughly as follows:

  • 17% garbage collection
  • 82% loop in parse_csv
    • 49% csv.shift (previously 13%)
    • 24% build_formats (previously 61%)
    • 8% build_errors
    • 5% col_counts (because Rails' String#blank? is slow)

So, csvlint now spends a lot more time parsing CSVs (a cost we can't reduce) and less time rescuing exceptions (a cost I cut here).

@jpmckinney
Copy link
Author

Weird, no test failure locally. I'll try later on Ruby 2.0 (running 2.1). Update: Confirm works in 2.1 but not 2.0. Looking into it.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling e498fe8 on jpmckinney:speed into f926eb5 on theodi:master.

@jpmckinney
Copy link
Author

I've got some more commits for more speed, but I don't want to overcomplicate this PR. (e.g. 12MB CSV takes ~10s, not ~230s).

@pezholio
Copy link
Contributor

pezholio commented Nov 9, 2014

This is awesome. Thanks so much 😄

I'll review this in more detail and merge tomorrow if all's good 👍

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling cf3dc46 on jpmckinney:speed into f926eb5 on theodi:master.

pezholio added a commit that referenced this pull request Nov 11, 2014
@pezholio pezholio merged commit e7ab4a9 into Data-Liberation-Front:master Nov 11, 2014
@pezholio
Copy link
Contributor

Hurrah for open source! 🎆

@jpmckinney jpmckinney deleted the speed branch November 12, 2014 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants