Add plugin for detecting values in Burmese Zawgyi encoding #2443
Background: Zawgyi is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (they can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, text gets displayed as garbled characters unless the user happens to have a non-standard font installed. With this non-standard font, the system is able to render Zawgyi, but properly encoded Unicode strings look broken. Also, because the Zawgyi encoding abuses codepoints intended for Myanmar’s minority languages, installing a Zawgyi font breaks the display of text in those minority languages. The situation is a bit like with ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend to be Unicode. Because of the structural similarity to Unicode, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects with tag values that are encoded in Burmese Zawgyi instead of proper Unicode.
This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose.
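The detect-then-convert flow described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the function names, the `0.95` threshold, and the injected-callable design are assumptions made for the example. The optional `load_real_backends` helper assumes the `myanmartools` package (Google's detector) and PyICU are installed.

```python
from typing import Callable, Optional, Tuple

# Assumption: illustrative cut-off, not the value used by the plugin.
ZAWGYI_THRESHOLD = 0.95


def contains_myanmar(text: str) -> bool:
    """True if any code point falls in the Myanmar block U+1000..U+109F."""
    return any("\u1000" <= ch <= "\u109f" for ch in text)


def suggest_fix(
    value: str,
    probability: Callable[[str], float],
    convert: Callable[[str], str],
    threshold: float = ZAWGYI_THRESHOLD,
) -> Optional[str]:
    """Return a Unicode replacement for a likely-Zawgyi tag value, else None.

    `probability` estimates how likely the string is Zawgyi (0.0 to 1.0);
    `convert` maps Zawgyi text to standard Unicode.
    """
    if not contains_myanmar(value):
        return None  # nothing Burmese to inspect
    if probability(value) < threshold:
        return None  # probably already proper Unicode
    fixed = convert(value)
    # Only suggest a fix if the conversion actually changes the value.
    return fixed if fixed != value else None


def load_real_backends() -> Tuple[Callable[[str], float], Callable[[str], str]]:
    """Wire up the detector and the ICU converter.

    Imported lazily so the rest of the sketch runs without these packages.
    """
    from myanmartools import ZawgyiDetector  # Google's Zawgyi detector
    from icu import Transliterator  # PyICU bindings for ICU

    detector = ZawgyiDetector()
    converter = Transliterator.createInstance("Zawgyi-my")
    return detector.get_zawgyi_probability, converter.transliterate
```

With both packages available, `suggest_fix(tag_value, *load_real_backends())` would return either a converted string or `None`. Passing the detector and converter in as callables keeps the decision logic testable without either dependency.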
Fixes #2442.