Respect package encoding when parsing source code #605

jeroen · 2017-06-24T10:58:49Z

There is some support for UTF-8 (#550); however, by default test_check() sources all test files in the native encoding. I think that if DESCRIPTION contains Encoding: UTF-8, all test files should be sourced as UTF-8.
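A minimal sketch of how the declared encoding could be looked up, assuming the package root contains a standard DESCRIPTION file. `package_encoding()` and its `default` argument are hypothetical names used for illustration, not part of testthat:

```r
# Hypothetical helper: return the Encoding field from a package's
# DESCRIPTION file, or a fallback when the field is not declared.
package_encoding <- function(path, default = "native.enc") {
  desc <- read.dcf(file.path(path, "DESCRIPTION"), fields = "Encoding")
  enc <- desc[1, "Encoding"]
  if (is.na(enc)) default else unname(enc)
}

# Throwaway package skeleton to try the helper on.
pkg <- tempfile("pkg")
dir.create(pkg)
writeLines(
  c("Package: demo", "Version: 0.0.1", "Encoding: UTF-8"),
  file.path(pkg, "DESCRIPTION")
)

package_encoding(pkg)
```

The returned value could then be passed on to whatever reads and parses the test files.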


gaborcsardi · 2017-06-24T12:28:09Z

It seems that non-ASCII characters are allowed in tests (at least R CMD check does not warn about them), so it would indeed make sense to use the package encoding.


hadley · 2017-10-02T15:00:25Z

We should pull across the code from r-lib/roxygen2#649.

Unless you feel strongly, I think it's better to simply read in everything as UTF-8, and warn the user that testthat doesn't support other encodings.
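A rough sketch of the "read everything as UTF-8 and warn" idea, using only base R (`readLines()` and `validUTF8()`). The helper name `read_lines_utf8()` and the warning text are assumptions made for illustration; the actual testthat implementation may differ:

```r
# Hypothetical helper: read a file as UTF-8 and warn when its bytes
# are not valid UTF-8 (for example, a latin1-encoded file).
read_lines_utf8 <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  if (!all(validUTF8(lines))) {
    warning(
      "'", path, "' is not valid UTF-8; other encodings are not supported",
      call. = FALSE
    )
  }
  lines
}

# A file containing x <- "á" encoded as UTF-8 (bytes 0xc3 0xa1) ...
utf8_file <- tempfile(fileext = ".R")
writeBin(
  c(charToRaw('x <- "'), as.raw(c(0xc3, 0xa1)), charToRaw('"\n')),
  utf8_file
)

# ... and the same code encoded as latin1 (single byte 0xe1).
latin1_file <- tempfile(fileext = ".R")
writeBin(
  c(charToRaw('x <- "'), as.raw(0xe1), charToRaw('"\n')),
  latin1_file
)

lines <- read_lines_utf8(utf8_file)
Encoding(lines)
```

Calling `read_lines_utf8(latin1_file)` should raise the warning, because a lone 0xe1 byte is not a valid UTF-8 sequence.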

jeroen · 2017-10-02T15:08:17Z

I don't feel strongly, though if the package declares Encoding: latin1, the correct thing would be to assume that. OTOH, people should just switch to UTF-8 to write portable code, so perhaps requiring this in testthat is a good thing.

gaborcsardi · 2017-11-01T16:25:10Z

@hadley There are still problems with this. Reading the files in UTF-8 is one thing, but you also need to supply the encoding to parse(), otherwise it will convert the input strings to the native encoding. So parse() here needs an encoding = "UTF-8" argument:

testthat/R/source.R

Line 27 in 1faa32f

exprs <- parse(text = lines, n = -1, srcfile = srcfile)

And maybe in test_example as well.

This said, I am not convinced that defaulting to UTF-8 is best. UTF-8 is still a bit painful on Windows, as I have just experienced. It is also not the default, of course.

gaborcsardi · 2017-11-01T16:38:22Z

The old testthat actually (by default) keeps UTF-8 files as UTF-8, because it marks them as "unknown" and then parse() does not convert the strings. (Although they are marked as unknown.)

The new behavior is worse, because you end up with strings recoded into the native encoding, typically latin1 on Windows, and this conversion loses information, so it is not even possible to convert them back to UTF-8.

Can I fix the parse() calls?

hadley · 2017-11-01T19:02:01Z

Yes please!

gaborcsardi · 2017-11-01T23:24:48Z

Not so easy. :( parse seems to be buggy and always converts to the native encoding, if text= is used. Luckily, with textConnection it works fine. Hopefully this will not make the parsing much slower:

## Need to run this in a latin1 locale
old_locale <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")

## UTF-8 string, quoted, so we can parse it
lines <- as.raw(c(0x22, 0xc3, 0xa1, 0x72, 0x76, 0xc3, 0xad, 0x7a, 0x74,
  0xc5, 0xb1, 0x72, 0xc5, 0x91, 0x20, 0x74, 0xc3, 0xbc, 0x6b, 0xc3,
  0xb6, 0x72, 0x66, 0xc3, 0xba, 0x72, 0xc3, 0xb3, 0x67, 0xc3, 0xa9,
  0x70, 0x22))
lines <- rawToChar(lines)
Encoding(lines) <- "UTF-8"
stringi::stri_enc_isutf8(lines)
#> [1] TRUE

## Parse it, keep it UTF-8
expr <- parse(text = lines, encoding = "UTF-8")[[1]]
Encoding(expr)
#> [1] "UTF-8"

## Ooops
stringi::stri_enc_isutf8(expr)
#> [1] FALSE

## It was recoded into latin1 (the native encoding) :(
stringi::stri_enc_isutf8(iconv(expr, "latin1", "UTF-8"))
#> [1] TRUE

## With a text connection it is OK. Phew!
expr2 <- parse(textConnection(lines, encoding = "UTF-8"), encoding = "UTF-8")[[1]]
stringi::stri_enc_isutf8(expr2)
#> [1] TRUE

Sys.setlocale("LC_CTYPE", old_locale)
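The text-connection workaround could be wrapped in a small helper along these lines. This is only a sketch: `parse_utf8()` is a hypothetical name, and it assumes `lines` is a character vector of UTF-8 source lines.

```r
# Hypothetical helper: parse UTF-8 source lines through a text connection
# instead of parse(text = ), following the workaround shown above.
parse_utf8 <- function(lines) {
  con <- textConnection(lines, encoding = "UTF-8")
  on.exit(close(con), add = TRUE)  # always release the connection
  parse(con, encoding = "UTF-8", keep.source = FALSE)
}

exprs <- parse_utf8(c("x <- 1 + 1", "y <- \"\u00e1\""))
length(exprs)
```

Closing the connection in `on.exit()` keeps repeated calls from leaking connections. Note that `keep.source = FALSE` drops source references, which a real test runner would probably want to keep.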

hadley added the feature label (a feature request or enhancement) Oct 2, 2017

hadley closed this as completed in e23e7c4 Oct 3, 2017

hadley reopened this Nov 1, 2017

gaborcsardi added a commit that referenced this issue Nov 2, 2017

Parse in UTF-8, always, really fixes #605

875c7be

gaborcsardi added a commit that referenced this issue Nov 2, 2017

Parse in UTF-8, always, really fixes #605

3450adb

gaborcsardi mentioned this issue Nov 2, 2017

Allow turning symbols off on UTF-8 platforms r-lib/cli#29

Merged

hadley closed this as completed in a4cfc61 Nov 2, 2017

sfirke mentioned this issue Jan 22, 2018

offer other case options as an argument in clean_names sfirke/janitor#96

Closed
