Use of UTF-8 encoding for i18n properties files #2639
Comments
I seem to remember that Java properties files are defined by the JVM to be in ISO 8859-1. A leftover from the bad old days of Java 0.9, perhaps? I tend to use the character constant way in all my projects. |
Your memory is correct, @hakan42. I checked the Java documentation and it explicitly says that properties files are read as ISO 8859-1. Only properties files in XML format can use UTF-8 or UTF-16. So I will correct the properties files of the Yahoo binding. @kaikreuzer What do you think about adding a hint regarding the encoding to the documentation? I can write it and create a pull request (I think the best place would be this section: https://github.com/eclipse/smarthome/blob/master/docs/documentation/features/internationalization.md#internationalize-binding-xml-files) |
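To make the behaviour described above concrete, here is a minimal sketch of how `java.util.Properties.load(InputStream)` treats the same text in three forms. The key `label` and the class name are made up for illustration and do not come from any binding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesEncodingDemo {

    // Loads the given bytes with Properties.load(InputStream), which decodes
    // the stream as ISO-8859-1, and returns the value of the hypothetical
    // key "label".
    static String loadLabel(byte[] data) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("label");
    }

    public static void main(String[] args) {
        // File content that spells the umlaut as a six-character escape sequence.
        byte[] escaped = "label=T\\u00FCr\n".getBytes(StandardCharsets.US_ASCII);
        // File content that stores the umlaut as the single ISO-8859-1 byte 0xFC.
        byte[] latin1 = "label=T\u00FCr\n".getBytes(StandardCharsets.ISO_8859_1);
        // File content that stores the umlaut as the two UTF-8 bytes 0xC3 0xBC.
        byte[] utf8 = "label=T\u00FCr\n".getBytes(StandardCharsets.UTF_8);

        System.out.println(loadLabel(escaped).length()); // 3: the escape became one character
        System.out.println(loadLabel(latin1).length());  // 3: the byte 0xFC became one character
        System.out.println(loadLabel(utf8).length());    // 4: each UTF-8 byte became its own character
    }
}
```

The first two forms produce the intended three-character string, while the UTF-8 file is misread because each of its two umlaut bytes is taken as a separate ISO-8859-1 character.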
@MHerbst I do not yet understand the problem and your solution. |
@kaikreuzer I agree with you that we don't have a problem in the IDE (it was only my first thought that the encoding was wrong). I see two possible ways to solve this problem:
|
Imho, option 2 is the correct one. If property files are supposed to be in ISO8859-1, we should not constrain our developers to use pure ASCII instead. In general, Java should be able to handle the conversion transparently. I assume that we have a bug somewhere on the way from reading the property file to serving it through the REST API. |
OK, I will try to find out what might be going wrong. |
I just tried option 2 the way it is suggested here - unfortunately, the black-squared question mark is then only replaced by a normal question mark, but not by the correct character. |
As far as I could see, the Java PropertyResourceBundle returns a string encoded with ISO 8859-1, and this string must then be converted to UTF-8. In a separate test program I tried it this way (it is not exactly the same as in the Stack Overflow example, but very similar):
It seems to work (but I am not absolutely sure why ...), but before adding it to the Smarthome code I wanted to make sure that the locale object used when reading the properties is correct (see my latest test results in #2589) and perform some more tests.
I have tried it too and I assume that it will work on most platforms. But I don't like to use undocumented features, because there remains a risk that it won't work somewhere (e.g. I don't know what happens under OpenJDK). |
@kaikreuzer For the last two hours I tried to figure out what happens and how to solve it. The method I think we have the following possibilities to solve the problem:
My favorite is the third variant. It would require changing the encoding of all language resource bundles, but we would then have the same encoding for all Java sources and text files. |
In my opinion, option three would be the best too. It avoids surprises, as by now many developers are accustomed to simply saving everything in UTF-8 and expect this to just work (tm). Actually, now that you mention this and remind me of it, I have a very similar class in many of my projects 😄 |
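For readers following along: the test program mentioned above is not preserved in this thread. The following is only a sketch of the general re-decoding technique it describes (taking a string whose UTF-8 bytes were decoded as ISO-8859-1 and reinterpreting them as UTF-8). The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1ToUtf8 {

    // Recovers the original bytes by encoding the string back to ISO-8859-1
    // (a lossless step for characters up to U+00FF) and decodes them as UTF-8.
    static String redecode(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What a UTF-8 encoded u-umlaut looks like after being read as ISO-8859-1.
        String misdecoded = "T\u00C3\u00BCr";
        System.out.println(redecode(misdecoded).equals("T\u00FCr")); // true

        // Applying the same step to text that was already decoded correctly
        // damages it: the lone byte 0xFC is not valid UTF-8.
        System.out.println(redecode("T\u00FCr").contains("\uFFFD")); // true
    }
}
```

The second call shows why this step is only safe when the file really was UTF-8: on a genuine ISO-8859-1 file it turns the umlaut into the replacement character.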
äöü are valid ISO-8859-1, or do you mean ANSI 7-bit? The initial problem is that getTranslatedText() uses a getBundle method with a custom resourceClassLoader. |
You are right, @NorbertHD. They are of course valid ISO-8859-1 but can't be handled if a file is interpreted as UTF-8. Now, with your hints, I understand why the umlauts are replaced by the "question mark". @kaikreuzer @hakan42 For me this means that we can and should absolutely use UTF-8 encoding for the language properties files. The number of affected files is currently not so great. If you agree, I can change the encoding of all i18n/*.properties files and correct the content if necessary. |
@kaikreuzer There are about 50 i18n/*.properties files in the smarthome and openhab2 repositories. If you agree with changing the encoding to UTF-8, I would perform the necessary changes and create two PRs. |
Sorry for not having followed up on this since Christmas! @NorbertHD Good catch indeed! I also never noticed the ResourceBundleClassLoader, which indeed is the culprit that we were looking for.
I am not yet totally convinced. The other very simple option would be to change ResourceBundleClassLoader from UTF-8 to ISO-8859-1. It is only used to read *.properties files from i18n folders, so this should be without risk. |
Maybe we should add an .editorconfig file to the repo. Many editors support this, and Eclipse and IntelliJ have native support for it. That way, we define once and for all what encoding the property files should have (which should be ISO-8859-1, as this is what Java natively expects). I believe we have an editor config file in the docs project; if not, I can provide a pull request tomorrow morning. |
According to http://editorconfig.org/#download, you need a plug-in for most IDEs (with the exception of IntelliJ) - so while the idea is neat, it might not solve the issue (because if people know that they need a plugin, they could also know that they need to use UTF-8 encoding). |
@kaikreuzer I understand your concerns. No matter what encoding we choose, there will be problems. I am not absolutely sure, but maybe it's possible to allow both encodings and dynamically detect the encoding used. I will check this. If it is possible, we would only have to modify the ResourceBundleClassLoader class. |
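One way such dynamic detection could work, as a rough sketch only: try a strict UTF-8 decode first and fall back to ISO-8859-1 when the bytes are not valid UTF-8. This is not the logic of ResourceBundleClassLoader; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {

    // Decodes the bytes as UTF-8 if they form valid UTF-8, otherwise as
    // ISO-8859-1 (which accepts every byte sequence).
    static String decodeUtf8OrLatin1(byte[] data) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the legacy single-byte encoding.
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "T\u00FCr".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "T\u00FCr".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeUtf8OrLatin1(utf8).equals("T\u00FCr"));   // true
        System.out.println(decodeUtf8OrLatin1(latin1).equals("T\u00FCr")); // true
    }
}
```

This is a heuristic rather than a guarantee: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well and would be decoded as UTF-8, which is one argument for fixing a single encoding instead.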
Did you check which encoding most of the 50 files have today? As all of them were created through Eclipse, I would actually expect them all to be ISO-8859-1. So there wouldn't really even be the need to do an automatic check for the encoding. |
@kaikreuzer Your assumption is correct. I made a quick check and all files are using ISO-8859-1. There is at least one binding (hue) that uses Unicode constants. |
I think we would still need the ResourceBundleClassLoader, but we could change it to read ISO-8859-1 from now on. |
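As a rough illustration of what reading with an explicit charset means (not the actual ResourceBundleClassLoader code), `Properties.load(Reader)` lets the caller pick the charset through an `InputStreamReader`. The key `label` and the class name are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class ExplicitCharsetLoader {

    // Loads properties from raw bytes, decoding them with the given charset
    // instead of relying on the default of the InputStream variant.
    static Properties load(byte[] data, Charset charset) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // A file saved as ISO-8859-1, containing the umlaut as the byte 0xFC.
        byte[] latin1File = "label=T\u00FCr\n".getBytes(StandardCharsets.ISO_8859_1);

        String asLatin1 = load(latin1File, StandardCharsets.ISO_8859_1).getProperty("label");
        String asUtf8 = load(latin1File, StandardCharsets.UTF_8).getProperty("label");

        System.out.println(asLatin1.equals("T\u00FCr")); // true: charset matches the file
        System.out.println(asUtf8.contains("\uFFFD"));   // true: umlaut became the replacement character
    }
}
```

Reading an ISO-8859-1 file through a UTF-8 reader reproduces the replacement character seen in the UI, and Unicode escape sequences in the file are still resolved by `load(Reader)` regardless of the charset.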
It actually even already tries to be clever and identify the encoding (in method |
@kaikreuzer That's the fastest way to solve it :-). I have tried it and it looks good. I will do some further tests and then create a PR. |
Sure, if you could add that in your PR, it would be great. (Note, the source file of the documentation is here. |
Of course, I will change the doc, too. Tests are looking good. It even works with properties files that are using Unicode constants. |
The |
IMHO we should not try to be so clever and detect the encoding; instead, we should state that property files (e.g. for i18n) must use the ISO-8859-1 encoding. |
While testing the Kodi binding with the German translation properties file, I realized that none of the German umlauts were displayed correctly.
[image]
The same problem appears in the Yahoo binding's description:
The reason is simple: the properties file where the texts are stored uses encoding ISO 8859-1.
It seems that all properties files (in Smarthome and openHAB) are using this encoding while all Java files are using UTF-8.
The German properties file of the Hue binding uses an interesting solution: it contains a Unicode constant for the "ü":
Is there any special reason why all properties files are stored with ISO 8859-1? In my opinion it would be better to use UTF-8 in general. Otherwise all bindings would have to use Unicode constants to get correctly displayed umlauts and other international characters.
For the Kodi binding I have changed the encoding to UTF-8 and corrected all umlauts, and everything was displayed correctly.