Detect Zawgyi (Burmese pseudo-Unicode) #2442

Closed
brawer opened this issue Feb 3, 2025 · 1 comment · Fixed by #2443

brawer commented Feb 3, 2025

Could Osmose detect Burmese text in Zawgyi encoding?

Zawgyi is a historic font encoding that looks like Unicode, and can therefore be passed around as UTF-8, but abuses certain Burmese codepoints in a non-standard way. When rendered as a Unicode string, Zawgyi-encoded text will show up with garbled characters. Although Zawgyi is on the decline, there’s still a fair amount of Zawgyi-encoded text in OpenStreetMap as of early 2025. A current list of bad strings in OSM is found by this tool.

This tool is written in Go and calls Google’s probabilistic open-source Zawgyi detector, plus a custom open-source Zawgyi-to-Unicode converter. Perhaps the tool could upload its findings to Osmose via the XML API. However, it should be straightforward to implement a new analyzer from scratch as a Python plugin for Osmose. Zawgyi detection and conversion are also available in Python through the PyPI packages myanmartools and PyICU, respectively. A plugin for Osmose might look roughly like this. What do you think, would you be OK with this? What error class should be used, and how do I get an ID for it?

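from plugins.Plugin import Plugin

import icu  # provided by the PyICU package
import myanmartools


class TagFix_ZawgyiBurmese(Plugin):

    def init(self, logger):
        Plugin.init(self, logger)
        # Probabilistic Zawgyi detector from Google's myanmar-tools.
        self.detector = myanmartools.ZawgyiDetector()
        # ICU's built-in Zawgyi-to-Unicode transliterator.
        self.converter = icu.Transliterator.createInstance('Zawgyi-my')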

    def node(self, data, tags):
        err = []
        for key, value in tags.items():
            if not any(0x1000 <= ord(c) <= 0x109F for c in value):
                continue
            score = self.detector.get_zawgyi_probability(value)
            if score < 0.8:
                continue
            fixed_value = self.converter.transliterate(value)
            if value == fixed_value:
                continue
            err.append({
               "class": ????, "subclass": ????,
               "text": T_("Value contains Zawgyi-encoded Burmese"),
               "fix": {key: fixed_value},
            })
        return err

    def way(self, data, tags, nds):
        return self.node(data, tags)

    def relation(self, data, tags, members):
        return self.node(data, tags)

###########################################################################                              
from plugins.Plugin import TestPluginCommon

class Test(TestPluginCommon):
    def test(self):
        a = TagFix_ZawgyiBurmese(None)
        a.init(None)
        for name in [
                     u"foo",
                     u"",
                     u"ဘားအံ",
                     u"ကျိုက်မရော အဝေးပြေးလမ်း",
                    ]:
            assert not a.node(None, {"name": name}), name
        for name in [
                     u"ေအာင္ခ်မ္းသာလမ္း",
                     u"သင္​းပယ္​",
                    ]:
            self.check_err(a.node(None, {"addr:street": name}), name)
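
The pre-filter in the plugin above skips any value without a character from the Myanmar Unicode block (U+1000 to U+109F), so the detector only runs on strings that could be Burmese at all. As a minimal standalone sketch of that one step (the names `MYANMAR_BLOCK` and `contains_myanmar` are made up for illustration and are not part of Osmose or myanmartools):

```python
# Hypothetical helper, not part of Osmose or myanmartools.
# The Myanmar block spans U+1000..U+109F inclusive.
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def contains_myanmar(text: str) -> bool:
    """Return True if any character of text lies in the Myanmar block."""
    return any(ord(ch) in MYANMAR_BLOCK for ch in text)


if __name__ == "__main__":
    for sample in ["foo", "", "ဘားအံ"]:
        print(repr(sample), contains_myanmar(sample))
```

Since Zawgyi reuses the same codepoints as proper Unicode Burmese, this test cannot tell the two apart; it only avoids calling the probabilistic detector on values that contain no Burmese characters.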

frodrigo commented Feb 3, 2025

Yes. This check could be integrated into Osmose.

Could you please open a PR with this code?

Item 5070 could be fine, with class=50706.

https://osmose.openstreetmap.fr/en/issues/open?item=5070

Please add an only_for to limit where the plugin is used. I see that it is not documented, but look for examples in the code.

Maybe also limit the check to some tags?
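
Limiting the check to some tags could be sketched roughly as below. This is only an illustration: the key list `CHECKED_BASES` and the helper names are assumptions, not taken from Osmose, and the actual set of tags would need to be agreed on in the PR.

```python
# Hypothetical allow-list of tag keys whose values are free-form text.
# The exact keys are an assumption made for this sketch.
CHECKED_BASES = ("name", "alt_name", "official_name", "addr:street", "addr:city")


def should_check(key: str) -> bool:
    """True for an allow-listed key or one of its suffixed variants (e.g. name:my)."""
    return any(key == base or key.startswith(base + ":") for base in CHECKED_BASES)


def candidate_tags(tags: dict) -> dict:
    """Keep only the tags whose values should be passed to the Zawgyi check."""
    return {key: value for key, value in tags.items() if should_check(key)}


if __name__ == "__main__":
    print(candidate_tags({"name": "ဘားအံ", "highway": "residential"}))
```

In the plugin, the loop over `tags.items()` would then iterate over `candidate_tags(tags).items()` instead, so that keys such as `highway` or `ref` never reach the detector.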

brawer added a commit to brawer/osmose-backend that referenced this issue Feb 4, 2025
Zawgyi is an obsolete font encoding that was shoehorned into Unicode,
incompatible with proper Unicode. Structurally, Zawgyi strings look like
Unicode (and can be passed around and stored as UTF-8, UTF-16, etc.),
but when rendered the result looks garbled. Because of the similarity
to proper Unicode, detection of Zawgyi is non-trivial and can only be done
probabilistically. This plugin makes use of Google’s open-source Zawgyi
detector, which uses Markov chains to determine the likelihood of
a Burmese string being Zawgyi-encoded versus proper Unicode. The plugin
suggests fixes by converting the detected problem cases from Zawgyi
to Unicode using the Unicode ICU library.

Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue Feb 4, 2025
Zawgyi is an obsolete font encoding that is incompatible with proper
Unicode. Structurally, Zawgyi strings look like Unicode (and can be
passed around and stored as UTF-8, UTF-16, etc.), but when rendered,
the result gets displayed as garbled characters unless the user happens
to install and use a non-standard font adapted to the hacked encoding.
(The situation is a bit like with ISO 8859 in the 1980s, but worse
because Zawgyi text and fonts pretend to be Unicode.) Because of the
structural similarity to Unicode, detecting Zawgyi is non-trivial
and can only be done probabilistically. As of early 2025, Zawgyi is on
the decline on the general Internet. However, OpenStreetMap still
contains thousands of objects with tag values that are encoded in
Burmese Zawgyi instead of proper Unicode.

This Osmose plugin makes use of Google’s open-source Zawgyi detector,
which uses Markov chains to estimate the likelihood of a Burmese
string being Zawgyi-encoded versus proper Unicode. The Osmose plugin
suggests fixes by converting mis-encoded strings from Zawgyi to
Unicode using the Unicode ICU library, which comes with a built-in
converter for this purpose.

Background: https://en.wikipedia.org/wiki/Zawgyi_font

Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue Feb 4, 2025
Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an
obsolete font encoding that is incompatible with proper Unicode.
Structurally, Zawgyi strings look like Unicode (they can be passed
around and stored as UTF-8, UTF-16, etc.), but when rendered,
the result gets displayed as garbled characters unless the user happens
to install a non-standard font. With a non-standard font, the system
can render Zawgyi, but not properly encoded Unicode strings. Also,
because the Zawgyi encoding abuses codepoints intended for Myanmar’s
minority languages, installing a Zawgyi font breaks the display of
those minority languages.  The situation is a bit like with ISO 8859
in the 1980s, but worse because Zawgyi text and fonts pretend to be
Unicode. Because of the structural similarity to Unicode, detecting
Zawgyi is non-trivial and can only be done probabilistically.  As of
early 2025, Zawgyi is on the decline on the general Internet.
However, OpenStreetMap still contains thousands of objects with tag
values that are encoded in Burmese Zawgyi instead of proper Unicode.

This Osmose plugin makes use of Google’s open-source Zawgyi detector,
which uses Markov chains to estimate the likelihood of a Burmese
string being Zawgyi-encoded versus proper Unicode. The Osmose plugin
suggests fixes by converting mis-encoded strings from Zawgyi to
Unicode using the Unicode ICU library, which comes with a built-in
converter for this purpose.

Fixes osm-fr#2442.
brawer added a commit to brawer/osmose-backend that referenced this issue Feb 4, 2025
Background: [Zawgyi](https://en.wikipedia.org/wiki/Zawgyi_font) is an
obsolete font encoding that is incompatible with proper Unicode.
Structurally, Zawgyi strings look like Unicode (they can be passed
around and stored as UTF-8, UTF-16, etc.), but when rendered, text
gets displayed as garbled characters unless the user happens to have a
non-standard font installed. With this non-standard font, the system
is able to render Zawgyi, but properly encoded Unicode strings look
broken. Also, because the Zawgyi encoding abuses codepoints intended
for Myanmar’s minority languages, installing a Zawgyi font breaks the
display of text in those minority languages.  The situation is a bit like with
ISO 8859 in the 1980s, but worse because Zawgyi text and fonts pretend
to be Unicode. Because of the structural similarity to Unicode,
detecting Zawgyi is non-trivial and can only be done probabilistically.
As of early 2025, Zawgyi is on the decline on the general Internet.
However, OpenStreetMap still contains thousands of objects with tag
values that are encoded in Burmese Zawgyi instead of proper Unicode.

This Osmose plugin makes use of Google’s open-source Zawgyi detector,
which uses Markov chains to estimate the likelihood of a Burmese
string being Zawgyi-encoded versus proper Unicode. The Osmose plugin
suggests fixes by converting mis-encoded strings from Zawgyi to
Unicode using the Unicode ICU library, which comes with a built-in
converter for this purpose.

Fixes osm-fr#2442.