Detect Zawgyi (Burmese pseudo-Unicode) #2442

brawer · 2025-02-03T17:17:10Z

Could Osmose detect Burmese text in Zawgyi encoding?

Zawgyi is a historic font encoding that looks like Unicode, and can therefore be passed around as UTF-8, but abuses certain Burmese codepoints in a non-standard way. When rendered as a Unicode string, Zawgyi-encoded text will show up with garbled characters. Although Zawgyi is on the decline, there’s still a fair amount of Zawgyi-encoded text in OpenStreetMap as of early 2025. A current list of bad strings in OSM can be found with this tool.
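
To make the "looks like Unicode" point concrete, here is a small sketch that uses only Python's standard library. It lists the codepoints of a string and applies one crude hint: Zawgyi stores the vowel sign ေ (U+1031) before its consonant, in visual order, whereas Unicode stores it after the consonant. The function names are made up for illustration, and the hint is far weaker than a real probabilistic detector.

```python
import unicodedata

# The Myanmar block, U+1000..U+109F (same range as the pre-filter in the plugin).
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def describe(text):
    """Return (codepoint, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]


def in_myanmar_block(text):
    """True if at least one character lies in the Myanmar block."""
    return any(ord(c) in MYANMAR_BLOCK for c in text)


def starts_with_vowel_sign_e(text):
    """Crude illustrative hint only: a leading U+1031 suggests visual order."""
    return text.startswith("\u1031")


# Both orders are valid sequences of assigned codepoints, so both survive
# a round trip through UTF-8; only the order differs.
zawgyi_style = "\u1031\u1021"   # vowel sign first (visual order)
unicode_style = "\u1021\u1031"  # consonant first (logical order)
for sample in (zawgyi_style, unicode_style):
    print(describe(sample), starts_with_vowel_sign_e(sample))
```

Because every character is a legitimate Myanmar codepoint either way, a check on the codepoint range alone cannot tell the two encodings apart.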

This tool is written in Go and calls Google’s probabilistic open-source Zawgyi detector, plus a custom open-source Zawgyi-to-Unicode converter. Perhaps that tool could upload its findings to Osmose via the XML API. But actually, it should be trivial to implement a new analyzer from scratch as a Python plugin for Osmose. Zawgyi detection and conversion also exist in Python, published as the PyPI packages myanmartools and PyICU, respectively. A plugin for Osmose might look roughly like this. What do you think, would you be OK with this? Which error class should be used, and how do I get an ID for it?

import myanmartools
from icu import Transliterator  # the PyICU package is imported as "icu"

from plugins.Plugin import Plugin

class TagFix_ZawgyiBurmese(Plugin):

    def init(self, logger):
        Plugin.init(self, logger)
        self.detector = myanmartools.ZawgyiDetector()
        self.converter = Transliterator.createInstance('Zawgyi-my')

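    def node(self, data, tags):
        err = []
        for key, value in tags.items():
            # Skip values without any character from the Myanmar block.
            if not any(0x1000 <= ord(c) <= 0x109F for c in value):
                continue
            # The detector returns the probability that the value is Zawgyi.
            score = self.detector.get_zawgyi_probability(value)
            if score < 0.8:
                continue
            # Skip values that the converter leaves unchanged.
            fixed_value = self.converter.transliterate(value)
            if value == fixed_value:
                continue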
            err.append({
               "class": ????, "subclass": ????,
               "text": T_("Value contains Zawgyi-encoded Burmese"),
               "fix": {key: fixed_value},
            })
        return err

    def way(self, data, tags, nds):
        return self.node(data, tags)

    def relation(self, data, tags, members):
        return self.node(data, tags)

###########################################################################                              
from plugins.Plugin import TestPluginCommon

class Test(TestPluginCommon):
    def test(self):
        a = TagFix_ZawgyiBurmese(None)
        a.init(None)
        for name in [
                     u"foo",
                     u"",
                     u"ဘားအံ",
                     u"ကျိုက်မရော အဝေးပြေးလမ်း",
                    ]:
            assert not a.node(None, {"name": name}), name
        for name in [
                     u"ေအာင္ခ်မ္းသာလမ္း",
                     u"သင္းပယ္",
                    ]:
            self.check_err(a.node(None, {"addr:street": name}), name)

frodrigo · 2025-02-03T19:08:30Z

Yes. This check could be integrated into Osmose.

Could you please open a PR with this code?

Item 5070 would be fine, with class=50706.

https://osmose.openstreetmap.fr/en/issues/open?item=5070

Please add an only_for to limit where the plugin runs. I see it is not documented, but look at the existing plugins in the code for an example.

Maybe also limit the check to some tags?
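
One way to limit the check to some tags could look like the sketch below, which uses only the standard library. The set of keys and prefixes is a guess for illustration, not a list agreed on in this thread, and `candidate_tags` is a hypothetical helper name. It keeps the Myanmar-block pre-filter from the proposed plugin.

```python
# Assumption: which keys are worth checking is a guess, not decided in this thread.
CHECKED_KEYS = {"name", "addr:street", "addr:city", "operator", "description"}
CHECKED_PREFIXES = ("name:", "alt_name", "official_name")


def is_checked_key(key):
    """True if the tag key is one of the free-text keys we want to inspect."""
    return key in CHECKED_KEYS or key.startswith(CHECKED_PREFIXES)


def has_myanmar_chars(value):
    """True if the value contains a character from U+1000..U+109F."""
    return any(0x1000 <= ord(c) <= 0x109F for c in value)


def candidate_tags(tags):
    """Return only the tags that would be handed to the Zawgyi detector."""
    return {
        key: value
        for key, value in tags.items()
        if is_checked_key(key) and has_myanmar_chars(value)
    }


example = {
    "name": "\u1018\u102c\u1038",  # Burmese text in a checked key
    "highway": "residential",      # not a free-text key
    "name:my": "foo",              # checked key, but no Myanmar characters
    "ref": "\u1041",               # Myanmar digit, but key is not checked
}
print(candidate_tags(example))
```

Filtering by key first keeps the comparatively expensive detector call away from tags such as `highway` or `ref`, where a free-text Burmese value is unlikely.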

Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an obsolete font encoding that is incompatible with proper Unicode. Structurally, Zawgyi strings look like Unicode (they can be passed around and stored as UTF-8, UTF-16, etc.), but when rendered, text gets displayed as garbled characters unless the user happens to have a non-standard font installed. With this non-standard font, the system is able to render Zawgyi, but properly encoded Unicode strings look broken. Also, because the Zawgyi encoding abuses codepoints intended for Myanmar’s minority languages, installing a Zawgyi font breaks the display of text in those minority languages. The situation is a bit like with ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend to be Unicode.

Because of the structural similarity to Unicode, detecting Zawgyi is non-trivial and can only be done probabilistically. As of early 2025, Zawgyi is on the decline on the general Internet. However, OpenStreetMap still contains thousands of objects with tag values that are encoded in Burmese Zawgyi instead of proper Unicode.

This Osmose plugin makes use of Google’s open-source Zawgyi detector, which uses Markov chains to estimate the likelihood of a Burmese string being Zawgyi-encoded versus proper Unicode. The Osmose plugin suggests fixes by converting mis-encoded strings from Zawgyi to Unicode using the Unicode ICU library, which comes with a built-in converter for this purpose. Fixes osm-fr#2442.

brawer mentioned this issue Feb 3, 2025

Detect Zawgyi within Osmose bdon/OpenStreetMap-BurmeseEncoding#3

Open

brawer mentioned this issue Feb 4, 2025

Add plugin for detecting values in Burmese Zawgyi encoding #2443

Merged

frodrigo closed this as completed in #2443 Feb 7, 2025
