
Improving encoding detection #2

Closed
ldodds opened this issue Jan 7, 2014 · 5 comments

@ldodds
Contributor

ldodds commented Jan 7, 2014

Ideally we want to detect what encoding a file is actually using, as this might differ from what is advertised.

See features in theodi/shared#120

For example, take the NY honours spreadsheet:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv/preview

This is delivered with a Content-Type of text/csv, with no charset parameter. US-ASCII might be assumed to be the default, although I think UTF-8 is increasingly common and might be a reasonable default assumption.

However when trying to open the file, it apparently has invalid characters.

E.g.:

$ curl -v https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/246500/New_Year_Honours_2013_full_list.csv >/tmp/test.csv
$ file -bi /tmp/test.csv
text/plain; charset=unknown-8bit

Using charlock_holmes we get a bit more information:

$ irb
> require "charlock_holmes"
> contents = File.read("/tmp/test.csv")
> detection = CharlockHolmes::EncodingDetector.detect(contents)
 => {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>61, :language=>"en"} 

The confidence level is relatively low, but the guessed encoding seems reasonable.

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

One approach that gives reasonable results is to sample the file, e.g. read the first X bytes, then run charlock_holmes over the sample. A sample of 1000 bytes gives a similar result to the above: the same encoding, with a confidence of 55.
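
A minimal sketch of that byte-sampling approach (the path and 1000-byte sample size are just the ones used above):

require "charlock_holmes"

# Read only the first 1000 bytes rather than the whole file.
sample = File.read("/tmp/test.csv", 1000)
detection = CharlockHolmes::EncodingDetector.detect(sample)
# => e.g. {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>55, :language=>"en"}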

Checking line by line shows that the reported encoding and confidence level can vary considerably:

{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>50, :language=>"en"}
{:type=>:text, :encoding=>"UTF-8", :confidence=>10}
{:type=>:text, :encoding=>"UTF-8", :confidence=>10}
{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>66, :language=>"en"}
{:type=>:text, :encoding=>"ISO-8859-1", :confidence=>92, :language=>"en"}

To avoid an extra web request we could automatically run the text through charlock after we've read a configurable number of lines, though the sample size might need some tuning.
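
A rough sketch of that line-buffering idea (the 100-line cut-off is just an illustrative, configurable default):

require "charlock_holmes"

SAMPLE_LINES = 100  # configurable; would need tuning

# Buffer lines as we read them, then guess the encoding from the buffered sample.
buffer = []
File.foreach("/tmp/test.csv") do |line|
  buffer << line
  break if buffer.size >= SAMPLE_LINES
end
detection = CharlockHolmes::EncodingDetector.detect(buffer.join)
# detection[:encoding] and detection[:confidence] could then feed into the validation result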

The only downside is the need to rely on a native library (ICU) that doesn't install automatically via bundler.

On Ubuntu: sudo apt-get install libicu-dev. On Mac: brew install icu4c

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

I've created a branch that uses charlock_holmes to add a guessed_encoding to the validation result. It's a hash of :encoding and :confidence.

https://github.com/theodi/csvlint.rb/tree/charlock-integration

I've not created a PR for it yet; it's just for discussion. The build fails because charlock_holmes needs to be installed, and I'm not sure I have access to Travis to fix that.

@JeniT

JeniT commented Jan 7, 2014

Perhaps worth looking at sniffing in other contexts, e.g. HTML (http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#encoding-sniffing-algorithm), to work out how to configure/manage the encoding sniffing?

@ldodds
Contributor Author

ldodds commented Jan 7, 2014

There's some useful stuff there. Jotting down some working notes.

The algorithm there is essentially:

  1. Allow a user override to force the encoding used for processing -- in our case this might be an encoding specified in the schema
  2. Look for byte order marks to identify the common cases of UTF-8 and UTF-16. Not sure if charlock does this, but it looks easy to add (see the sketch after this list)
  3. Use the transport-layer encoding from the Content-Type charset -- open-uri seems to apply this currently
  4. Attempt to guess the encoding using a pre-scan (sample) of the data -- this is where charlock helps
  5. Assume a default -- we could assume UTF-8; I think Ruby may do that anyway
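
For step 2, a minimal BOM check could look something like this (a sketch only; not necessarily how charlock behaves internally):

# Return the encoding implied by a leading byte order mark, or nil if there isn't one.
def bom_encoding(bytes)
  return "UTF-8"    if bytes.start_with?("\xEF\xBB\xBF".b)
  return "UTF-16BE" if bytes.start_with?("\xFE\xFF".b)
  return "UTF-16LE" if bytes.start_with?("\xFF\xFE".b)
  nil
end

bom_encoding(File.read("/tmp/test.csv", 4, mode: "rb"))  # => nil if the file has no BOM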

That algorithm is used to identify a character encoding and a confidence level (certain or tentative). The encoding is then used to process the document. We could adapt our parsing similarly; currently we're delegating to Ruby's open-uri/CSV code to handle that.

When we report on encoding metadata it'd be useful to include details of how the encoding was determined (e.g. schema, transport layer, guesswork, default). That might help people debugging issues.

I think we'll also want to expand warnings to cover:

  • A declared schema encoding that differs from the transport encoding (Content-Type) -- consistency
  • Lack of any explicit encoding -- clients shouldn't have to guess
  • A transport or schema encoding that differs from the actual data -- for this we will want to scan the file to look for discrepancies anyway. The algorithm above won't necessarily give that by default.
  • Using anything other than UTF-8 -- recommended default

So for the honours list we'd generate two warnings: no explicit encoding, and use of ISO-8859-1 rather than UTF-8.
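
A rough sketch of how those warnings might be derived (the method and warning names here are hypothetical, not the csvlint API):

# Hypothetical helper: compare the schema encoding and the Content-Type charset
# against the detected encoding and the recommended UTF-8 default.
def encoding_warnings(schema_encoding, content_type_charset, detected)
  warnings = []
  warnings << :no_encoding        if content_type_charset.nil?
  warnings << :encoding_mismatch  if schema_encoding && content_type_charset &&
                                     schema_encoding.casecmp(content_type_charset) != 0
  warnings << :declared_vs_actual if content_type_charset && detected &&
                                     content_type_charset.casecmp(detected) != 0
  warnings << :non_utf8           if detected && detected.casecmp("UTF-8") != 0
  warnings
end

# The honours list: no declared encoding, charlock guesses ISO-8859-1.
encoding_warnings(nil, nil, "ISO-8859-1")  # => [:no_encoding, :non_utf8]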

@ldodds
Contributor Author

ldodds commented Jan 13, 2014

We've got the basic features for encoding checking in place for now, so I'm closing this issue.

@ldodds ldodds closed this as completed Jan 13, 2014
pezholio pushed a commit that referenced this issue Jul 14, 2015
Revert "Add feature to check schema against length of supplied headers."