Detect Zawgyi (Burmese pseudo-Unicode) #2442
Comments
Yes. This check could be integrated into Osmose. Could you please open a PR with this code? Item 5070 could be fine, with class=50706. https://osmose.openstreetmap.fr/en/issues/open?item=5070 Please add a […] Maybe also limit the check to some tags?
brawer added a commit to brawer/osmose-backend that referenced this issue on Feb 4, 2025
Zawgyi is an obsolete font encoding that was shoehorned into Unicode and is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (and can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, the result looks garbled. Because of the similarity to proper Unicode, detection of Zawgyi is non-trivial and can only be done probabilistically. This plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to determine the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The plugin suggests fixes by converting the detected problem cases from Zawgyi to Unicode using the Unicode ICU library. Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue on Feb 4, 2025
Zawgyi is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (and can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, the result is displayed as garbled characters unless the user happens to install and use a non-standard font adapted to the hacked encoding. (The situation is a bit like with ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend to be Unicode.) Because of the structural similarity to Unicode, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects with tag values that are encoded in Burmese Zawgyi instead of proper Unicode. This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose. Background: https://en.wikipedia.org/wiki/Zawgyi_font Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue on Feb 4, 2025
Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (they can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, the result is displayed as garbled characters unless the user happens to install a non-standard font. With a non-standard font, the system can render Zawgyi, but not properly encoded Unicode strings. Also, because the Zawgyi encoding abuses codepoints intended for Myanmar’s minority languages, installing a Zawgyi font breaks the display of those minority languages. The situation is a bit like with ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend to be Unicode. Because of the structural similarity to Unicode, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects with tag values that are encoded in Burmese Zawgyi instead of proper Unicode. This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose. Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue on Feb 4, 2025
Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (they can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, text gets displayed as garbled characters unless the user happens to have a non-standard font installed. With this non-standard font, the system is able to render Zawgyi, but properly encoded Unicode strings look broken. Also, because the Zawgyi encoding abuses codepoints intended for Myanmar’s minority languages, installing a Zawgyi font breaks the display of text in those minority languages. The situation is a bit like with ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend to be Unicode. Because of the structural similarity to Unicode, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects with tag values that are encoded in Burmese Zawgyi instead of proper Unicode. This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose. Fixes osm-fr#2442.
Could Osmose detect Burmese text in Zawgyi encoding?
Zawgyi is a historic font encoding that looks like Unicode, and can therefore be passed around as UTF-8, but abuses certain Burmese codepoints in a non-standard way. When rendered as a Unicode string, Zawgyi-encoded text will show up with garbled characters. Although Zawgyi is on the decline, there’s still a fair amount of Zawgyi-encoded text in OpenStreetMap as of early 2025. A current list of bad strings in OSM is found by this tool.
This tool is written in Go and calls Google’s probabilistic open-source Zawgyi detector, plus a custom open-source Zawgyi-to-Unicode converter. Perhaps the above tool could upload its findings to Osmose via the XML API. But actually, it should be trivial to implement a new analyzer from scratch as a Python plugin for Osmose. Zawgyi detection and conversion also exist in Python, published as the PyPI packages myanmartools and PyICU, respectively. A plug-in for Osmose might look roughly like this. What do you think, would you be OK with this? What error class should be used, and how to get an ID for it?
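The detection-and-fix idea described above could be sketched in Python roughly as follows. This is a minimal illustration under assumptions, not the code from the linked proposal or the eventual plugin: the function names, the 0.95 threshold, and the Myanmar-block prefilter are hypothetical choices, and the wiring into Osmose's plugin class is left out. The item and class numbers are the ones suggested in the maintainer's comment. The detector and converter are passed in as plain callables so the core logic can be exercised without myanmartools or PyICU installed.

```python
from typing import Callable, Dict, Iterator, Tuple

# Item and class numbers suggested in the maintainer's comment.
OSMOSE_ITEM = 5070
OSMOSE_CLASS = 50706

# Assumption: only report strings the detector is quite sure about.
ZAWGYI_THRESHOLD = 0.95


def contains_myanmar(text: str) -> bool:
    """Return True if text has a character in the Myanmar block (U+1000..U+109F)."""
    return any(0x1000 <= ord(ch) <= 0x109F for ch in text)


def find_zawgyi(
    tags: Dict[str, str],
    detect: Callable[[str], float],
    convert: Callable[[str], str],
    threshold: float = ZAWGYI_THRESHOLD,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (key, original, suggested_fix) for tag values that look like Zawgyi."""
    for key, value in tags.items():
        # Cheap prefilter: skip values without any Burmese characters.
        if not contains_myanmar(value):
            continue
        if detect(value) > threshold:
            fixed = convert(value)
            # Only report when the conversion actually changes the string.
            if fixed != value:
                yield key, value, fixed


def load_backends() -> Tuple[Callable[[str], float], Callable[[str], str]]:
    """Build the real detector and converter (needs myanmartools and PyICU)."""
    from icu import Transliterator
    from myanmartools import ZawgyiDetector

    detector = ZawgyiDetector()
    converter = Transliterator.createInstance("Zawgyi-my")
    return detector.get_zawgyi_probability, converter.transliterate
```

With both packages installed, `detect, convert = load_backends()` followed by `list(find_zawgyi(tags, detect, convert))` would return the suspicious tags of one OSM object together with a suggested Unicode replacement. Restricting the loop to selected keys, as the maintainer suggests, would be a small filter on `key`.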