
Improving encoding detection #2

Closed
ldodds opened this issue Jan 7, 2014 · 5 comments

@ldodds
Contributor

ldodds commented Jan 7, 2014

Ideally we want to detect what encoding a file is actually using, as this might differ from what is advertised.

See features in theodi/shared#120

For example, take the NY honours spreadsheet:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv/preview

This is delivered with a Content-Type of text/csv, with no charset parameter. US-ASCII might be assumed to be the default, although I think UTF-8 is increasingly common and might be a reasonable default assumption.

However when trying to open the file, it apparently has invalid characters.

E.g.:

$ curl -v https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv >/tmp/test.csv
$ file -bi /tmp/test.csv
text/plain; charset=unknown-8bit

Using charlock_holmes we get a bit more information:

$ irb
> require "charlock_holmes"
> contents = File.read("/tmp/test.csv")
> detection = CharlockHolmes::EncodingDetector.detect(contents)
 => {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>61, :language=>"en"} 

The confidence level is relatively low, but the guessed encoding seems reasonable.

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

One approach that gives reasonable results is to sample the file, e.g. read the first X bytes, then run charlock_holmes over the sample. A sample of 1000 bytes gives a similar result to the above: the same encoding, with a confidence of 55.
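
A minimal sketch of that byte-sampling approach (the path and 1000-byte sample size are just the ones used above):

require "charlock_holmes"

# Read only the first 1000 bytes rather than the whole file.
sample = File.read("/tmp/test.csv", 1000)
detection = CharlockHolmes::EncodingDetector.detect(sample)
# => e.g. {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>55, :language=>"en"}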

Checking line by line shows that the reported encoding and confidence level can vary considerably:

{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>50, :language=>"en"}
{:type=>:text, :encoding=>"UTF-8", :confidence=>10}
{:type=>:text, :encoding=>"UTF-8", :confidence=>10}
{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>66, :language=>"en"}
{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>92, :language=>"en"}

To avoid an extra web request we could automatically run the text through charlock after we've read a configurable number of lines, though the sample size might need some tuning.
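
A rough sketch of that line-buffering idea (the 100-line cut-off is just an illustrative, configurable default):

require "charlock_holmes"

SAMPLE_LINES = 100  # configurable; would need tuning

# Buffer lines as we read them, then guess the encoding from the buffered sample.
buffer = []
File.foreach("/tmp/test.csv") do |line|
  buffer << line
  break if buffer.size >= SAMPLE_LINES
end
detection = CharlockHolmes::EncodingDetector.detect(buffer.join)
# detection[:encoding] and detection[:confidence] could then feed into the validation result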

The only downside is the need to rely on a native library (ICU) that doesn't install automatically via bundler.

On Ubuntu: sudo apt-get install libicu-dev. On Mac: brew install icu4c

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

I've created a branch that uses charlock_holmes to add a guessed_encoding to the validation result. It's a hash of :encoding and :confidence.

https://github.com/theodi/csvlint.rb/tree/charlock-integration

I've not created a PR for it yet; it's just for discussion. The build fails because charlock_holmes needs to be installed, and I'm not sure I have access to Travis to fix that.

@JeniT

JeniT commented Jan 7, 2014

Perhaps worth looking at sniffing in other contexts, e.g. HTML (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#encoding-sniffing-algorithm), to work out how to configure/manage the encoding sniffing?

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

There's some useful stuff there. Jotting down some working notes.

The algorithm there is essentially:

  1. Allow a user override to force the encoding used for processing -- in our case this might be an encoding specified in the schema
  2. Look for byte order marks to identify the common cases of UTF-8 and UTF-16. Not sure if charlock does this, but it looks easy to add (see the sketch after this list)
  3. Use the transport-layer encoding from the Content-Type charset -- open-uri seems to apply this currently
  4. Attempt to guess the encoding using a pre-scan (sample) of the data -- this is where charlock helps
  5. Assume a default -- we could assume UTF-8; I think Ruby may do that anyway
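
For step 2, a minimal BOM check could look something like this (a sketch only; not necessarily how charlock behaves internally):

# Return the encoding implied by a leading byte order mark, or nil if there isn't one.
def bom_encoding(bytes)
  return "UTF-8"    if bytes.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16BE" if bytes.start_with?("\xFE\xFF".b)
  return "UTF-16LE" if bytes.start_with?("\xFF\xFE".b)
  nil
end

bom_encoding(File.read("/tmp/test.csv", 4, mode: "rb"))  # => nil if the file has no BOM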

That algorithm is used to identify a character encoding and a confidence level (certain or tentative). The encoding is then used to process the document. We could adapt our parsing similarly; currently we're delegating to Ruby's open-uri/CSV code to handle that.

When we report on encoding metadata it'd be useful to include details of how the encoding was determined (e.g. schema, transport layer, guesswork, default). That might help people debugging issues.

I think we'll also want to expand warnings to cover:

  • A declared schema encoding that differs from the transport encoding (Content-Type) -- consistency
  • Lack of any explicit encoding -- clients shouldn't have to guess
  • A transport or schema encoding that differs from the actual data -- for this we will want to scan the file to look for discrepancies anyway. The algorithm above won't necessarily give that by default.
  • Using anything other than UTF-8 -- recommended default

So for the honours list we'd generate two warnings: no explicit encoding, and use of ISO-8859-1 rather than UTF-8.
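
A rough sketch of how those warnings might be derived (the method and warning names here are hypothetical, not the csvlint API):

# Hypothetical helper: compare the schema encoding and the Content-Type charset
# against the detected encoding and the recommended UTF-8 default.
def encoding_warnings(schema_encoding, content_type_charset, detected)
  warnings = []
  warnings << :no_encoding        if content_type_charset.nil?
  warnings << :encoding_mismatch  if schema_encoding && content_type_charset &&
                                     schema_encoding.casecmp(content_type_charset) != 0
  warnings << :declared_vs_actual if content_type_charset && detected &&
                                     content_type_charset.casecmp(detected) != 0
  warnings << :non_utf8           if detected && detected.casecmp("UTF-8") != 0
  warnings
end

# The honours list: no declared encoding, charlock guesses ISO-8859-1.
encoding_warnings(nil, nil, "ISO-8859-1")  # => [:no_encoding, :non_utf8]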

@ldodds
Contributor Author

ldodds commented Jan 13, 2014

We've got the basic features for encoding checking in place for now, so I'm closing this issue.

@ldodds ldodds closed this as completed Jan 13, 2014
pezholio pushed a commit that referenced this issue Jul 14, 2015
Revert "Add feature to check schema against length of supplied headers."