Add plugin for detecting values in Burmese Zawgyi encoding #2443

Merged on Feb 7, 2025 (2 commits)

Contributor

brawer commented Feb 4, 2025

Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (they can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, the text is displayed as garbled characters unless the user happens to have a non-standard Zawgyi font installed. With that font, the system can render Zawgyi, but properly encoded Unicode strings look broken. Also, because the Zawgyi encoding abuses codepoints intended for Myanmar’s minority languages, installing a Zawgyi font breaks the display of text in those languages. The situation resembles that of ISO 8859 in the 1980s, but is worse because Zawgyi text and fonts pretend to be Unicode. Because of this structural similarity, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects whose tag values are encoded in Burmese Zawgyi instead of proper Unicode.

This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose.
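
To illustrate the general idea of comparing two Markov chains, here is a minimal toy sketch. It is not the code of Google's detector or of this plugin: the function names, the tiny training samples, and the smoothing value are all assumptions made up for illustration. It relies on only one known difference between the encodings (Zawgyi stores the vowel sign U+1031 before the consonant, in visual order, while Unicode stores it after), whereas a real detector is trained on large corpora.

```python
import math
from collections import Counter

START, END = "^", "$"  # artificial start/end states, not real characters


def train_bigram_model(samples, alphabet, smoothing=1.0):
    """Return log P(next | prev), estimated from samples with add-k smoothing."""
    pair_counts = Counter()
    prev_counts = Counter()
    for text in samples:
        padded = START + text + END
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    num_states = len(set(alphabet) | {START, END})

    def log_prob(prev, nxt):
        numerator = pair_counts[(prev, nxt)] + smoothing
        denominator = prev_counts[prev] + smoothing * num_states
        return math.log(numerator / denominator)

    return log_prob


def log_likelihood(text, log_prob):
    """Sum the transition log-probabilities along the padded string."""
    padded = START + text + END
    return sum(log_prob(prev, nxt) for prev, nxt in zip(padded, padded[1:]))


# Hypothetical toy data: three consonants combined with vowel sign E.
KA, MA, NA, TA, E = "\u1000", "\u1019", "\u1014", "\u1010", "\u1031"
ALPHABET = KA + MA + NA + TA + E
logical_model = train_bigram_model([KA + E, MA + E, NA + E], ALPHABET)  # Unicode order
visual_model = train_bigram_model([E + KA, E + MA, E + NA], ALPHABET)   # Zawgyi-like order


def probability_visual_order(text):
    """P(visual order | text) with equal priors: sigmoid of the log-likelihood gap."""
    delta = log_likelihood(text, visual_model) - log_likelihood(text, logical_model)
    delta = max(-700.0, min(700.0, delta))  # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(-delta))
```

With this toy data, `probability_visual_order(E + TA)` comes out at about 0.8 and `probability_visual_order(TA + E)` at about 0.2, even though the consonant TA never appears in the training samples. A checker built on such a score would only suggest a conversion above some chosen threshold, since the result is a likelihood estimate rather than a certain classification.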

Fixes #2442.

Contributor Author

brawer commented Feb 4, 2025

`./tools/pytest.sh sax` fails for me, but as far as I can tell, these breakages look unrelated.

Contributor Author

brawer commented Feb 7, 2025

OK to merge, or would you like to see some changes?

frodrigo merged commit 0e22d0a into osm-fr:dev on Feb 7, 2025
3 checks passed
Member

frodrigo commented Feb 7, 2025

Thank you. Merged, but not deployed yet.

cc @jocelynj: note the changes to the Python requirements.

Development

Successfully merging this pull request may close these issues.

Detect Zawgyi (Burmese pseudo-Unicode)
2 participants