Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect package encoding when parsing source code #605

Closed
jeroen opened this issue Jun 24, 2017 · 7 comments
Closed

Respect package encoding when parsing source code #605

jeroen opened this issue Jun 24, 2017 · 7 comments
Labels
feature a feature request or enhancement

Comments

@jeroen
Copy link
Member

jeroen commented Jun 24, 2017

There is some support for utf8 however by default test_check() sources all test files as native. I think that if DESCRIPTION contains Encoding: UTF-8 all test files should be sourced as UTF-8.

@gaborcsardi
Copy link
Member

gaborcsardi commented Jun 24, 2017 via email

@hadley
Copy link
Member

hadley commented Oct 2, 2017

We should pull across the code from r-lib/roxygen2#649.

Unless you feel strongly, I think it's better to simply read in everything as UTF-8, and warn the user that testthat doesn't support other encodings.

@hadley hadley added the feature a feature request or enhancement label Oct 2, 2017
@jeroen
Copy link
Member Author

jeroen commented Oct 2, 2017

I don't feel strongly, though if the package declares Encoding: latin1 the correct thing would be to assume that. OTOH, people should just switch to UTF8 to write portable code, so perhaps requiring this in testthat is a good thing.

@hadley hadley closed this as completed in e23e7c4 Oct 3, 2017
@gaborcsardi
Copy link
Member

gaborcsardi commented Nov 1, 2017

@hadley There are still problems with this. Reading the files in UTF-8 is one thing, but you also need you to supply the encoding to parse(), otherwise it will mark convert the input strings to the native encoding. So parse() here needs an encoding = "UTF-8" argument:

exprs <- parse(text = lines, n = -1, srcfile = srcfile)

And maybe in test_example as well.

This said, I am not convinced that defaulting to UTF-8 is best. UTF-8 is still a bit painful on Windows, as I have just experienced. It is also not the default, of course.

@gaborcsardi
Copy link
Member

gaborcsardi commented Nov 1, 2017

The old testthat actually (by default) keeps UTF-8 files as UTF-8, because it marks them as "unknown" and then parse() does not convert the strings. (Although they are marked as unknown.)

The new behavior is worse, because you end up with stings recoded into the native encoding, typically latin1 on windows, and this conversion loses information, so it is not even possible to convert it back to UTF-8.

Can I fix the parse() calls?

@hadley
Copy link
Member

hadley commented Nov 1, 2017

Yes please!

@hadley hadley reopened this Nov 1, 2017
@gaborcsardi
Copy link
Member

gaborcsardi commented Nov 1, 2017

Not so easy. :( parse seems to be buggy and always converts to the native encoding, if text= is used. Luckily, with textConnection it works fine. Hopefully this will not make the parsing much slower:

## Need to run this in a latin1 locale
old_locale <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")

## UTF-8 string, quoted, so we can parse it
lines <- as.raw(c(0x22, 0xc3, 0xa1, 0x72, 0x76, 0xc3, 0xad, 0x7a, 0x74,
  0xc5, 0xb1, 0x72, 0xc5, 0x91, 0x20, 0x74, 0xc3, 0xbc, 0x6b, 0xc3,
  0xb6, 0x72, 0x66, 0xc3, 0xba, 0x72, 0xc3, 0xb3, 0x67, 0xc3, 0xa9,
  0x70, 0x22))
lines <- rawToChar(lines)
Encoding(lines) <- "UTF-8"
stringi::stri_enc_isutf8(lines)
# > [1] TRUE

## Parse it, keep it UTF-8
expr <- parse(text = lines, encoding = "UTF-8")[[1]]
Encoding(expr)
#> [1] "UTF-8"

## Ooops
stringi::stri_enc_isutf8(expr)
#> [1] FALSE

## It was recoded into latin1 (the native encoding) :(
stringi::stri_enc_isutf8(iconv(expr, "latin1", "UTF-8"))
#> [1] TRUE

## With a text connection it is OK. Phew!
expr2 <- parse(textConnection(lines, encoding = "UTF-8"), encoding = "UTF-8")[[1]]
stringi::stri_enc_isutf8(expr2)
#> [1] TRUE

Sys.setlocale("LC_CTYPE", old_locale)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants