Improving encoding detection #2
Comments
One approach that has reasonable results is to sample the file, e.g. read the first X bytes, then use charlock holmes. A sample of 1000 bytes gives a similar result to the above: the same encoding, with a confidence of 55. Checking line by line shows that the reported encoding and confidence level can vary considerably.
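A rough sketch of that sampling approach, assuming the charlock_holmes gem; the file name and the 1000-byte sample size are illustrative placeholders, not values from the original comment:

```ruby
require 'charlock_holmes'

# Detect the encoding from a fixed-size sample of the file, then repeat
# the detection line by line to see how much the guess varies.
sample = File.binread('honours.csv', 1000)
detection = CharlockHolmes::EncodingDetector.detect(sample)
puts "sample: #{detection[:encoding]} (confidence #{detection[:confidence]})"

File.open('honours.csv', 'rb') do |io|
  io.each_line do |line|
    guess = CharlockHolmes::EncodingDetector.detect(line)
    next unless guess # detection can fail on very short lines
    puts "line: #{guess[:encoding]} (confidence #{guess[:confidence]})"
  end
end
```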
To avoid an extra web request we could automatically run the text through charlock after we've read a configurable number of lines, though that might need some tuning. The only downside is the need to rely on a native tool that doesn't install automatically via bundler; on Ubuntu it needs additional system packages installed first.
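A sketch of what running the detector after a configurable number of lines could look like; the 100-line threshold and the standalone read loop are assumptions for illustration, not the csvlint code:

```ruby
require 'charlock_holmes'

# Buffer lines as they are read and run the detector once a configurable
# threshold is reached, instead of making a second web request.
LINE_THRESHOLD = 100 # illustrative value; would need tuning
buffer = String.new

File.open('data.csv', 'rb') do |io|
  io.each_line.with_index(1) do |line, lineno|
    buffer << line
    if lineno == LINE_THRESHOLD
      detected = CharlockHolmes::EncodingDetector.detect(buffer)
      warn "detected #{detected[:encoding]} (confidence #{detected[:confidence]})"
      break
    end
  end
end
```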
I've created a branch that uses charlock holmes to add encoding detection: https://github.com/theodi/csvlint.rb/tree/charlock-integration. I've not created a PR for it yet; it's just for discussion. The build fails because of the need to install charlock_holmes, and I'm not sure I have access to Travis to fix that.
Perhaps worth looking at sniffing in other contexts, e.g. HTML (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#encoding-sniffing-algorithm), to work out how to configure/manage the encoding sniffing?
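To make the idea concrete, here is a hedged sketch of a simplified precedence order along the lines of that algorithm: BOM, then transport-layer charset, then a charlock_holmes guess, then a default. The method name and return format are illustrative, not existing csvlint code:

```ruby
require 'charlock_holmes'

# Returns [encoding, source, confidence] so a report can say how the
# encoding was determined as well as what it is.
def sniff_encoding(bytes, transport_charset = nil)
  # Only the UTF-8 BOM is checked here to keep the sketch short.
  return ['UTF-8', :bom, :certain] if bytes.start_with?("\xEF\xBB\xBF".b)
  return [transport_charset, :transport, :certain] if transport_charset

  guess = CharlockHolmes::EncodingDetector.detect(bytes)
  return [guess[:encoding], :guessed, :tentative] if guess

  ['UTF-8', :default, :tentative]
end
```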
There's some useful stuff there. Jotting down some working notes. The algorithm there is essentially:
That algorithm is used to identify a character encoding and a confidence level (certain, tentative). The encoding is then used to process the document. We could adapt our parsing similarly; currently we're delegating to the Ruby open-uri/CSV code to handle that. When we report on encoding metadata it'd be useful to include details on how the encoding was determined (e.g. schema, transport layer, guesswork, default). That might be useful for people debugging issues. I think we'll also want to expand warnings to cover:
So for the honours list we'd generate two warnings: no explicit encoding, and use of ISO-8859-1 rather than UTF-8.
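A minimal sketch of how those two warnings might be generated; the warning symbols and the helper method are assumptions rather than the csvlint API:

```ruby
# Build encoding warnings from the declared charset (if any) and the
# encoding that was actually detected.
def encoding_warnings(declared_charset, detected_encoding)
  warnings = []
  warnings << :no_encoding_declared if declared_charset.nil?
  warnings << :non_utf8_encoding if detected_encoding && detected_encoding != 'UTF-8'
  warnings
end

# The honours file: no charset on the Content-Type, detected as ISO-8859-1.
encoding_warnings(nil, 'ISO-8859-1') # => [:no_encoding_declared, :non_utf8_encoding]
```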
We've got the basic features for encoding checking in place for now, so I'm closing this issue.
Revert "Add feature to check schema against length of supplied headers."
Ideally we want to detect what encoding a file is actually using, as this might differ from what is advertised.
See features in theodi/shared#120
For example, take the New Year honours spreadsheet:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv/preview
This is delivered with a `Content-Type` of `text/csv`, with no charset specified. `US-ASCII` might be assumed to be the default, although I think `UTF-8` is increasingly common and might be a reasonable default assumption. However, when trying to open the file, it apparently has invalid characters.
E.g.:
Using `charlock_holmes` we get a bit more information: the confidence level is relatively low, but the guessed encoding seems reasonable.
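A minimal sketch of that check, assuming the charlock_holmes gem and open-uri, and assuming the URL without the `/preview` suffix returns the raw CSV; the original output is not reproduced here:

```ruby
require 'open-uri'
require 'charlock_holmes'

# Fetch the honours CSV and ask charlock_holmes for its best guess.
url = 'https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv'
content = URI.open(url, 'rb') { |f| f.read }
detection = CharlockHolmes::EncodingDetector.detect(content)
puts "#{detection[:encoding]} (confidence #{detection[:confidence]}, type #{detection[:type]})"
```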