Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit lines #101

Merged
merged 11 commits into from
Dec 17, 2014
Merged

Limit lines #101

merged 11 commits into from
Dec 17, 2014

Conversation

Hoedic
Copy link
Contributor

@Hoedic Hoedic commented Nov 9, 2014

Currently csvlint.rb reads all the lines of the CSV files which can become very long for large files while most of the issues will be discovered in the first 10,000 or 100,000 lines (in files with potentially millions of records).

The proposed change add a supported parameter (limitLines) in the configuration dialect to limit the maximum number of lines evaluated, allowing to control the execution duration. (could even be useful for the web instance csvlint.io where most users will think that no output will come.

@coveralls
Copy link

Coverage Status

Coverage remained the same when pulling fdf7a01 on Hoedic:limit_lines into f926eb5 on theodi:master.

@pezholio
Copy link
Contributor

I'm going to leave this for now. Most of the time we'll want to analyse all of the CSV, as some errors like inconsistent_values could appear anywhere within the file. That said, we do have an ambition to take http://csvlint.io to beta at some point in the new year, so I'll leave this open for reinvestigation when we get time.

Thanks for the PR though, great that people are finding this useful! 😄 👍

@jpmckinney
Copy link

@pezholio What's the harm in adding an option to limit the number of lines? If the option is not set, then the whole file is analyzed.

While it's possible for a CSV to have all its bad rows in the last 10% of the file, bad rows in real CSVs tend to be spread throughout a file.

@JeniT JeniT mentioned this pull request Dec 17, 2014
4 tasks
@Floppy
Copy link
Member

Floppy commented Dec 17, 2014

I've finally got to looking at this, and it's a great pile of stuff, thanks!

Yes, adding a limit is perfectly fine as an option, as it won't affect existing users unless they choose to use it.

Only thing is, the dialect option mirrors CSVDDF, so I think I'd rather not add it to that, but instead to a separate options hash at the end of the initialize parameter list. Does that make sense?

I think we should:

  • separate this PR into a general improvements and fixes one (basically everything up to the last commit), and
  • change the limit_lines option to be in a separate options hash.

I will pull across the changes from your branch and open separate PRs for those, then get it all merged.

@Floppy
Copy link
Member

Floppy commented Dec 17, 2014

Ah, my mistake, I see the speed fixes are already in, but obviously this branch just needs a merge from master.

@Floppy
Copy link
Member

Floppy commented Dec 17, 2014

I'll merge this, and then change to a separate option.

Floppy added a commit that referenced this pull request Dec 17, 2014
@Floppy Floppy merged commit 05f7d44 into Data-Liberation-Front:master Dec 17, 2014
@Hoedic
Copy link
Contributor Author

Hoedic commented Dec 17, 2014

Thanks @Floppy.

Sorry for having the speed improvement already in my branch. I initially planned to only have the limit change in my branch but I was already working with the speed improvement and noticed it only when the PR was already sent.

@Floppy
Copy link
Member

Floppy commented Dec 17, 2014

No worries, sorry I got confused - I hadn't had coffee yet :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants