Skip to content
This repository has been archived by the owner on May 7, 2020. It is now read-only.

Use of UTF-8 encoding for i18n properties files #2639

Closed
MHerbst opened this issue Dec 10, 2016 · 26 comments
Closed

Use of UTF-8 encoding for i18n properties files #2639

MHerbst opened this issue Dec 10, 2016 · 26 comments

Comments

@MHerbst
Copy link
Contributor

MHerbst commented Dec 10, 2016

While testing the Kodi binding with the german translation properties file I realized that all german umlauts where not displayed correctly.
The same problem appears in the Yahoo binding's description:
image
The reason is simple: the properties file where the texts are stored uses encoding ISO 8859-1.
It seems that all properties files (in Smarthome and openHAB) are using this encoding while all Java files are using UTF-8.
The german properties file of the Hue binding uses an interesting solution: it contains a Unicode constante for the "ü":

Es wurde keine IP Adresse f\u00FCr die Hue Bridge angegeben.

Is there any special reason why all properties files are stored with ISO 8859-1? In my opinition it would be better to use generally UTF-8. Otherwise all bindings would have to use unicode constants to get correctly displayed umlauts and other international characters.
For the Kodi binding I have changed the encoding to UTF-8, corrected all umlautes and everything was displayed correctly.

@hakan42
Copy link
Contributor

hakan42 commented Dec 10, 2016

I seem to remember that java property files are defined by the JVM to be in 8859-1. A leftover from the bad old days of Java 0.9 perhaps?

I tend to use the character constant way in all my projects.

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 11, 2016

Your memory was correct @hakan42 . I checked the Java documentation and it explicitly says that properties files are read with 8859-1. Only properties files in XML format can use UTF-8 or UTF-16.
UTF-8 seems to work (at least on Windows OS), but I agree with you that it is better to stick with 8859-1. BTW: the Java SDK event contains a little utility to create the correct UTF-8 constant values (https://docs.oracle.com/javase/8/docs/technotes/tools/windows/native2ascii.html).

So I will correct the properties files from the Yahoo binding.

@kaikreuzer What do you think about adding a hint regarding the encoding to the documentation? I can write it and create a pull request (I think the best place would be this section https://github.com/eclipse/smarthome/blob/master/docs/documentation/features/internationalization.md#internationalize-binding-xml-files)

@kaikreuzer
Copy link
Contributor

@MHerbst I do not yet understand the problem and your solution.
Java expects property files in ISO8859-1 and this is what the IDE uses as an encoding. So having an "ü" in there (and correctly displayed in the IDE) should be just fine - no need to replace it by any escape sequences. Might the problem not rather be somewhere else in the chain?

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 12, 2016

@kaikreuzer I agree with you that we don't have a problem in the IDE (it was only my first thought that the encoding was wrong).
But in order to get correct characters in browser (e.g. in Paper UI) all characters must be encoded as valid UTF-8 characters. Otherwise only a question mark is shown.

I see two possible ways to solve this problem:

  1. Using of Unicode escapes in the .properties files. This is the way it was solved in the Hue binding ( https://github.com/eclipse/smarthome/blob/master/extensions/binding/org.eclipse.smarthome.binding.hue/ESH-INF/i18n/hue_de.properties). That's why I proposed to use Unicode escapes. The JDK even contains a utility (native2ascii) that converts e.g. all umlauts into Unicode escapes.
  2. Convert each string that is read from a .properties file from 8859-1 to UTF-8 encoding.

@kaikreuzer
Copy link
Contributor

Imho, option 2 is the correct one. If property files are supposed to e in ISO8859-1, we should not constrain our developers to use pure ASCII instead. In general, Java should be able to handle the conversion transparently. I assume that we some where have a bug on the way from reading the property file to serving it through the REST API.

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 12, 2016

OK, I will try to find out what's maybe going wrong.
I have set up a test environment where I can remote debug the Smarthome classes and I think I know where I have to start :-).
Maybe #2589 is related to this problem.

@kaikreuzer
Copy link
Contributor

I just tried option 2 the way it is suggested here - unfortunately, the black-squared questionmark is then only replaced by a normal questionmark, but not by the correct character.
I tried to chase this down in the debugger, but I am left very puzzled. It seems that the Java PropertyResourceBundle already reads it wrongly; the chars are not correctly returned by the getString() method.
Interestingly, if I set the encoding of the property files to UTF-8 in Eclipse (and make sure the content is proper UTF-8), everything is fine in the Paper UI. This is very much against everything that is written in the JavaDocs, though...

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 19, 2016

I tried to chase this down in the debugger, but I am left very puzzled. It seems that the Java PropertyResourceBundle already reads it wrongly; the chars are not correctly returned by the getString() method.

As far as I could see the Java PropertyResourceBundle returns a string encodes with ISO 8859-1 and this string must then be converted to UTF-8. In a separate test program it tried it this way (it is not exactly the same as in the Stackflow example but very similar:

        Properties prop = new Properties();
        prop.load(new FileInputStream("D:/yahooweather_de.properties"));
        String text = prop.getProperty("binding.yahooweather.description");
        System.out.println(new String(text.getBytes(), "UTF-8"));

It seems to work (but I am not absolutely sure why ...) but before adding it to the Smarthome code I wanted to make sure that the locale object used when reading the properties is correct (see my latest test results in #2589) and perform some more test.

Interestingly, if I set the encoding of the property files to UTF-8 in Eclipse (and make sure the content is proper UTF-8), everything is fine in the Paper UI. This is very much against everything that is written in the JavaDocs, though...

I have tried it too and I assume that it will work on most of the platforms. But I don't like to use undocumented features because there remains a risk that it won't work somewhere (e.g. I don't know what happens under the Open JDK).

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 23, 2016

@kaikreuzer For the last two hours I tried do figure out what happens and how to solve it. The method getTranslatedText executes resourceBunde.getString(key). This method reads the String from the properties file und converts it from ISO-8859-1 to UTF-8. All non ISO-8859-1 chars (like "ä", "ö", "ü") are replaced by hex values EF BF DB. This combination is an UTF-8 character that means REPLACEMENT CHARACTER and is shown as question mark.

I think we have the following possibilities to solve the problem:

  1. We continue to use ISO-8859-1 for the properties files and replace all non-8859 chars by their UTF-8 Constants ("\u...."). Its ugly but it works.
  2. We use UTF-8 for the properties files. This works but is not supported officially :-(
  3. We change to UTF-8 but use an own implementation to read the bund files (I found this solution on Stackoverflow: http://stackoverflow.com/a/4660195/979217). I have tested it with success.

My favorite is the third variant. It would require to change the encodings of all language resource bundles. But we would then have the same encoding for all Java sources and text files.

@hakan42
Copy link
Contributor

hakan42 commented Dec 23, 2016

In my opinion, option three would be the best too. It avoids surprises as by now many developers are accustomed to simply saving everything in UTF-8 and expect this to just work (tm)

Actually, now that you mention this and remind me of it, I have a very similiar class in many of my projects 😄

@NorbertHD
Copy link

All non ISO-8859-1 chars (like "ä", "ö", "ü")

äöü are ISO-8859-1 or do you mean ANSI 7-Bit?

The initial problem is, getTranslatedText() uses a getBundle method with a custom resourceClassLoader.
This resourceClassLoader defined in org.eclipse.smarthome.core.common.osgi.ResourceBundleClassLoader
sets the read encoding to UTF-8.
With the current code the language files have to have UTF-8 encoding.
If you want that the language files remain with ISO-8859-1 encoding you have to change ResourceBundleClassLoader or use another one.

@MHerbst
Copy link
Contributor Author

MHerbst commented Dec 23, 2016

äöü are ISO-8859-1 or do you mean ANSI 7-Bit?

You are right @NorbertHD . They are of course valid ISO-8859-1 but can't be handled if a file is interpreted as UTF_8. Now, with your hints, I understand why the umlauts are replaced by the "question mark".
Thank you for pointing us to the ResourceBundleClassLoader class. I have overlooked this parameter in the call to getBundle and did not see that the files are already read as UTF-8 files. It seems that nobody had this in mind because all i18/*.properties that I have seem are using ISO-8859-1 encoding :-).

@kaikreuzer @hakan42 For me this means that we can and should absolutely use UTF-8 encoding for the language properties files. The number of affected files is currently not so great. If you agree I can change the encoding of all i18/*.properties files and correct the content if necessary.

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 2, 2017

@kaikreuzer There are about 50 i18n/*properties in the smarthome and openhab2 repositories. If you agree with changing the encoding to UTF-8 I would perform the necessary changes and create two PRs.
I would also check the new bindings that are waiting for review.

@kaikreuzer
Copy link
Contributor

Sorry for not having followed up on this since christmas!

@NorbertHD Good catch indeed! I also never noticed the ResourceBundleClassLoader, which indeed is the culprit here that we were looking for.

we can and should absolutely use UTF-8 encoding for the language properties files.

I am not yet totally convinced. The other very simple option would be to change ResourceBundleClassLoader from UTF-8 to ISO-8859-1 it is only used to read *.properties files from i18n folders, so this should be without risk.
Using UTF-8 would mean to not follow Java standards. People & tools that expect the files to be "correct" Java property files might be trapped. Note that Eclipse automatically always uses a Property editor for these files which sets the encoding to ISO-8859-1. So when you use Eclipse, you can be sure that the encoding is correct (and I just checked that IntelliJ does the same thing. If we decide for UTF-8 instead, we will have to make sure that users as well as IDEs are aware of this - and this might be a tricky job to do, what do you think?

@hakan42
Copy link
Contributor

hakan42 commented Jan 2, 2017

Maybe we should add an .editorconfig file to the repo. Many editors support this, and eclipse and intellij have native support for it.

That way, we define once and for all time what encoding the property files should have (which should be iso-8859-1 as this is what java natively expects).

I believe we have an editor config file in the docs project, if not, I can provide a pull request tomorrow morning.

@kaikreuzer
Copy link
Contributor

According to http://editorconfig.org/#download, you need a plug-in for most IDEs (with the exception of IntelliJ) - so while the idea is neat, it might not solve the issue (because if people know that they need a plugin, they could also know that they need to use UTF-8 encoding).

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 2, 2017

@kaikreuzer I understand your concerns. No matter what encoding we will choose, there will be problems.
I especially don't like the idea of an additional plugin only to ensure the correct encoding.

I am not absolutely sure but maybe its possible to allow both encodings and detect dynamically the used encoding. I will check this. If it is possible we would onlsy have to modify the ResourceBundleClassLoader class.

@kaikreuzer
Copy link
Contributor

Did you check which encoding most of the 50 files have today? As all of them were created through Eclipse, I would actually expect them all to be ISO-8859-1. So there wouldn't really even be the need to do an automatic check for the encoding.

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 3, 2017

@kaikreuzer Your assumption is correct. I made a quick check and all files were using ISO-8859-1. There is at least one binding (hue) that uses unicode constants.
To define that all i18 properties files must use ISO-8859-1 would be easiest solution and we probably don't need the ResourceBundleClassLoader any longer (it is only used by LanguageResourceBundleManager and a Groovy test).

@kaikreuzer
Copy link
Contributor

I think we would still need the ResourceBundleClassLoader, but we could change it to read ISO-8859-1 from now on.

@kaikreuzer
Copy link
Contributor

It actually even already tries to be clever and identify the encoding (in method detectCharset) - so we should only add ISO to the list of supported charsets.

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 3, 2017

@kaikreuzer That's the fastest way to solve it :-). I have tried it and it looks good. I will do some further tests and then create a PR.
What do you think about adding a hint regarding the encoding to the documentation (https://www.eclipse.org/smarthome/documentation/features/internationalization.html)?

@kaikreuzer
Copy link
Contributor

What do you think about adding a hint regarding the encoding to the documentation

Sure, if you could add that in your PR, it would be great. (Note, the source file of the documentation is here.

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 3, 2017

Of course, I will chnange the doc, too.

Tests are looking good. It even works with properties files that are using Unicode constants.

@MHerbst
Copy link
Contributor Author

MHerbst commented Jan 3, 2017

The detectCharset method contains a nice bug that sometimes leads to a wrong identification. It checks the charset in chunks of 512 bytes. If the first chunk is not valid UTF-8 but the second chunk, then the encoding is still identified as UTF-8 ...

@maggu2810
Copy link
Contributor

maggu2810 commented Jan 4, 2017

IMHO we should not be such clever and detect the encoding but we should state that property files (e.g. for i18n) must use the ISO-8859-1 encoding.
If we would like to support also UTF-8 encoded i18n files, I would prefer that for such non ISO-8859-1 encoded files we use another file extension.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants