Category guessing leads to bizarre results #756

simoncozens · 2021-12-13T14:28:07Z

Why aren't my marks attaching to my bases? I have a base glyph "brm_RR" and that works fine, and another base glyph "brm_R" and marks don't attach. Turns out it's because:

In [4]: from glyphsLib.glyphdata import get_glyph

In [5]: get_glyph("brm_RR")
Out[5]: Glyph(name='brm_RR', production_name='brm_RR', unicode=None, category=None, subCategory=None, script=None, description=None)

In [6]: get_glyph("brm_R")
Out[6]: Glyph(name='brm_R', production_name='brm_R', unicode=None, category='Letter', subCategory='Ligature', script=None, description=None)

brm_R goes into ligatures not bases in GDEF, so marks aren't attached. But why is brm_R considered a ligature? Because...

glyphsLib/Lib/glyphsLib/glyphdata.py

Lines 245 to 247 in 10c8b1d

    
           # Corner case: when looking at ligatures, names that don't exist in the AGLFN 
        
           # are skipped, so len("acutecomb_o") == 2 but len("dotaccentcomb_o") == 1. 
        
           character = fontTools.agl.toUnicode(base_name)

brm_ is skipped, R is looked up, R is a letter, and so _translate_category is called, and inside _translate_category:

glyphsLib/Lib/glyphsLib/glyphdata.py

Lines 295 to 298 in 10c8b1d

    
           # Exception: Something like "one_two" should be a (_, Ligature), 
        
           # "acutecomb_brevecomb" should however stay (Mark, Nonspacing). 
        
           if "_" in glyph_name and glyphs_category[0] != "Mark": 
        
               return glyphs_category[0], "Ligature"

This is way too much magic.

I literally told Glyphs that it was Category=Letter, but no, glyphsLib thinks it knows better.

The text was updated successfully, but these errors were encountered:

anthrotype · 2021-12-13T14:40:31Z

too much magic indeed.
But I believe that setting the custom glyph category (via alt+cmd+I panel) should override any guessing in by glyphdata. Does it not? That's another bug.

simoncozens · 2021-12-13T14:46:19Z

Only if you set the subcategory as well.

schriftgestalt · 2021-12-13T14:55:13Z

The ligature-"magic" should only be applied when it finds a glyph info for all parts. As brm is not a known glyph name it needs to stop and return nothing.

simoncozens · 2021-12-13T17:20:51Z

I really hoped the response to "this is way too much magic" was not going to be "let's add more magic".

khaledhosny · 2021-12-13T17:41:33Z

Fonts depend on this “magic”, if we were to remove it they will break.

simoncozens · 2021-12-13T20:18:42Z

They’re breaking already, it’s just right now they’re breaking in strange ways instead of straightforward ways.

khaledhosny · 2021-12-13T20:22:34Z

If one has a glyph named alef_lam-ar they want it classified as ligature not base.

madig · 2022-01-12T15:30:55Z

What glyphsLib does does not match what Glyphs.app does. I have in the past tried to make them match but failed, there are quite a few edge cases where Glyphs.app thinks differently. I don't know how to improve the situation.

Maybe a mechanically verified model for glyph names like the Universal Shaping Engine?

schriftgestalt · 2022-01-12T16:11:07Z

Do you have examples of those differences?

Maybe we can have a shared test suite in the GlyphInfo repo?

madig · 2022-01-12T16:35:31Z

I guess I can collect some. Or loop through all known glyph names plus names taken from random .glyphs files in the wild and compare (sub)categories before/after.

Here's a static list of glyph names that I have to manually declare bases for the purpose of GDEF and anchor attachment in a certain project:

KNOWN_BASES = {
    "k_ssa-deva",
    "j_nya-deva",
    "k_ss-deva",
    "k_ss-deva.alt2",
    "k_ss-deva.alt3",
    "k_ss-deva.alt4",
    "k_ss-deva.alt5",
    "k_ss-deva.alt6",
    "k_ss-deva.alt7",
    "j_ny-deva",
    "j_ny-deva.alt2",
    "j_ny-deva.alt3",
    "j_ny-deva.alt4",
    "j_ny-deva.alt5",
    "j_ny-deva.alt6",
    "j_ny-deva.alt7",
    "j_ny-deva.alt8",
    "ng_ya-deva",
    "ch_ya-deva",
    "tt_tta-deva",
    "tt_ttha-deva",
    "tt_ya-deva",
    "tth_ttha-deva",
    "tth_ya-deva",
    "dd_dda-deva",
    "dd_ddha-deva",
    "dd_ya-deva",
    "ddh_ddha-deva",
    "ddh_ya-deva",
    "t_ta-deva",
    "t_ra-deva",
    "d_ga-deva",
    "d_gha-deva",
    "d_da-deva",
    "d_dha-deva",
    "d_dh_ya-deva",
    "d_ba-deva",
    "d_bha-deva",
    "d_ma-deva",
    "d_ya-deva",
    "d_ra-deva",
    "d_va-deva",
    "p_ta-deva",
    "sh_ra-deva",
    "ss_tta-deva",
    "ss_ttha-deva",
    "h_nna-deva",
    "h_na-deva",
    "h_ma-deva",
    "h_ya-deva",
    "h_ra-deva",
    "h_la-deva",
    "h_va-deva",
    "h_ra_uMatra-deva",
    "h_ra_uuMatra-deva",
    "ba-khmer",
    "ba-khmer.post",
    "ba-khmer.post2",
    "ba_aaSign-khmer",
    "ba_aaSign-khmer.post2_",
    "ba_aaSign-khmer.post_",
    "ba_auSign-khmer",
    "ba_auSign-khmer.post2_",
    "ba_auSign-khmer.post_",
    "beikoet-khmer",
    "beiroc-khmer",
    "buonkoet-khmer",
    "buonroc-khmer",
    "ca-khmer",
    "ca_aaSign-khmer",
    "ca_auSign-khmer",
    "cha-khmer",
    "cha_aaSign-khmer",
    "cha_auSign-khmer",
    "cho-khmer",
    "cho-khmer.post",
    "cho-khmer.post2",
    "cho_aaSign-khmer",
    "cho_aaSign-khmer.post2_",
    "cho_aaSign-khmer.post_",
    "cho_auSign-khmer",
    "cho_auSign-khmer.post2_",
    "cho_auSign-khmer.post_",
    "co-khmer",
    "co_aaSign-khmer",
    "co_auSign-khmer",
    "da-khmer",
    "da_aaSign-khmer",
    "da_auSign-khmer",
    "dapBeikoet-khmer",
    "dapBeiroc-khmer",
    "dapBuonkoet-khmer",
    "dapBuonroc-khmer",
    "dapMuoykoet-khmer",
    "dapMuoyroc-khmer",
    "dapPiikoet-khmer",
    "dapPiiroc-khmer",
    "dapPramkoet-khmer",
    "dapPramroc-khmer",
    "dapkoet-khmer",
    "daproc-khmer",
    "do-khmer",
    "do_aaSign-khmer",
    "do_auSign-khmer",
    "dottedCircle",
    "ha-khmer",
    "ha_aaSign-khmer",
    "ha_auSign-khmer",
    "ka-khmer",
    "ka_aaSign-khmer",
    "ka_auSign-khmer",
    "kha-khmer",
    "kha_aaSign-khmer",
    "kha_auSign-khmer",
    "kho-khmer",
    "kho-khmer.post",
    "kho-khmer.post2",
    "kho_aaSign-khmer",
    "kho_aaSign-khmer.post2_",
    "kho_aaSign-khmer.post_",
    "kho_auSign-khmer",
    "kho_auSign-khmer.post2_",
    "kho_auSign-khmer.post_",
    "ko-khmer",
    "ko_aaSign-khmer",
    "ko_auSign-khmer",
    "la-khmer",
    "la_aaSign-khmer",
    "la_auSign-khmer",
    "lo-khmer",
    "lo_aaSign-khmer",
    "lo_auSign-khmer",
    "mo-khmer",
    "mo_aaSign-khmer",
    "mo_auSign-khmer",
    "muoykoet-khmer",
    "muoyroc-khmer",
    "ngo-khmer",
    "ngo_aaSign-khmer",
    "ngo_auSign-khmer",
    "nno-khmer",
    "nno_aaSign-khmer",
    "nno_auSign-khmer",
    "no-khmer",
    "no_aaSign-khmer",
    "no_auSign-khmer",
    "nyo-khmer",
    "nyo-khmer.less",
    "nyo_aaSign-khmer",
    "nyo_aaSign-khmer.less",
    "nyo_auSign-khmer",
    "nyo_auSign-khmer.less",
    "pathamasat-khmer",
    "pha-khmer",
    "pha_aaSign-khmer",
    "pha_auSign-khmer",
    "pho-khmer",
    "pho_aaSign-khmer",
    "pho_auSign-khmer",
    "piikoet-khmer",
    "piiroc-khmer",
    "po-khmer",
    "po_aaSign-khmer",
    "po_auSign-khmer",
    "pramBeikoet-khmer",
    "pramBeiroc-khmer",
    "pramBuonkoet-khmer",
    "pramBuonroc-khmer",
    "pramMuoykoet-khmer",
    "pramMuoyroc-khmer",
    "pramPiikoet-khmer",
    "pramPiiroc-khmer",
    "pramkoet-khmer",
    "pramroc-khmer",
    "qa-khmer",
    "qa_aaSign-khmer",
    "qa_auSign-khmer",
    "ro-khmer",
    "ro_aaSign-khmer",
    "ro_auSign-khmer",
    "sa-khmer",
    "sa-khmer.post",
    "sa_aaSign-khmer",
    "sa_aaSign-khmer.post_",
    "sa_auSign-khmer",
    "sa_auSign-khmer.post_",
    "sha-khmer",
    "sha_aaSign-khmer",
    "sha_auSign-khmer",
    "sso-khmer",
    "sso-khmer.post",
    "sso_aaSign-khmer",
    "sso_aaSign-khmer.post_",
    "sso_auSign-khmer",
    "sso_auSign-khmer.post_",
    "ta-khmer",
    "ta_aaSign-khmer",
    "ta_auSign-khmer",
    "tha-khmer",
    "tha_aaSign-khmer",
    "tha_auSign-khmer",
    "tho-khmer",
    "tho_aaSign-khmer",
    "tho_auSign-khmer",
    "to-khmer",
    "to_aaSign-khmer",
    "to_auSign-khmer",
    "ttha-khmer",
    "ttha_aaSign-khmer",
    "ttha_auSign-khmer",
    "ttho-khmer",
    "ttho-khmer.post",
    "ttho-khmer.post2",
    "ttho_aaSign-khmer",
    "ttho_aaSign-khmer.post2_",
    "ttho_aaSign-khmer.post_",
    "ttho_auSign-khmer",
    "ttho_auSign-khmer.post2_",
    "ttho_auSign-khmer.post_",
    "tuteyasat-khmer",
    "vo-khmer",
    "vo_aaSign-khmer",
    "vo_auSign-khmer",
    "yo-khmer",
    "yo-khmer.post",
    "yo-khmer.post2",
    "yo_aaSign-khmer",
    "yo_aaSign-khmer.post2_",
    "yo_aaSign-khmer.post_",
    "yo_auSign-khmer",
    "yo_auSign-khmer.post2_",
    "yo_auSign-khmer.post_",
}

schriftgestalt · 2022-01-13T09:57:31Z

I can see three groups in the list.

basic letters: (e.g.: ba-khmer) Those should get their info directly form the glyph data.
symbols (e.g.: beiroc-khmer) dito
ligatures: there seem to be two kinds: a Khmer letter + aaSign. And Devanagari conjuncts (halfform + fullform). Both are ligating two glyphs but in some aspects they are not acting as ligatures in OpenType: they have a simple mark to base feature (only one top and bottom anchor instead of one per component). So I tend to come up with another subCategory for them. I think I’ll change it that the Khmer glyphs will be "Composition"s like other precomposed glyphs (base + mark). The Devanagari should be "Conjunct"s.
Does this make sense?

madig · 2022-01-13T11:26:26Z

According to some of our designers that deal in Deva, it makes sense. Internally, we avoid the underscore in conjuncts, so we have names like "KSsa-deva". Wouldn't that be more consistent? no more underscore to disambiguate.

schriftgestalt · 2022-01-13T11:56:17Z

Maybe. But it makes processing the names much more complicated.

madig · 2022-01-13T12:02:02Z

I'm curious, can you please give an example?

schriftgestalt · 2022-01-13T12:08:05Z

Getting from KSsa-deva to k-deva + ssa-deva. This example might be possible but what if there are uppercase in the base names (like in to_auSign-khmer)? And I have the code from processing normal ligatures already.

tallchai · 2022-03-07T00:07:31Z

@schriftgestalt Can you please elaborate on this fix? I am having [a similar issue] (googlefonts/gftools#497).
What works in my case (for *_ra-deva) is to categorize them as Devanagari > Letter > Other and then manually add them to the cjct feature set (under cjct_devanagari_rakar_forms).
Need to follow the same process for some other "conjunct" glyphs like "c_ca-deva", "d_ya-deva" etc. <- These are manually to be added under cjct_devanagari in the cjct feature.

Getting from KSsa-deva to k-deva + ssa-deva. This example might be possible but what if there are uppercase in the base names (like in to_auSign-khmer)? And I have the code from processing normal ligatures already.

schriftgestalt · 2022-03-07T21:40:49Z

Can you be more specific what you like to know?

tallchai · 2022-03-07T23:50:55Z

@schriftgestalt I have a similar issue(s) with certain conjuncts when I try to build the font with gftools.

The marks for multiple conjuncts like tt_ya-deva (categorized as Devanagari > Letter > Ligature) do not render correctly.

If I change the subcategory from Ligature to Other, the cjct feature does not pick up the glyph at all.

For the glyph to actually work, I need to manually edit the cjct feature to include this glyph.

A lot of Devanagari glyphs have this issue including all of the glyphs that end with _ra-deva

I have logged multiple issues regarding this. In the latest issue logged, @simoncozens referred to this issue.
I believe your comment above addresses a way to fix this issue and I am trying to understand it better. :)

Issues logged:
googlefonts/fontmake#703
googlefonts/fontmake#706
googlefonts/gftools#497

Thanks!

schriftgestalt · 2022-03-08T00:01:31Z

I can’t say much about the cjct feature in gftools. Only how it would be done in Glyphs.

simoncozens · 2022-03-26T09:44:37Z

I'm getting really annoyed with this bug. Basically if a glyph name has an underscore in it, mark attachment probably won't work.

simoncozens · 2022-04-06T08:47:03Z

Thinking about this more, the problem is not so much that glyphs are being incorrectly considered ligature glyphs. The problem is more that if they are ligature glyphs, ufo2ft does not perform mark-to-base attachment on them, only mark-to-ligature attachment. So a ligature glyph with a "top" anchor does not get top attachment; only "top_1" would be attached. If the anchor attachment was correct, the GDEF category would be less significant.

That said, it's still obviously wrong to consider brm_RR a ligature if there's no brm and RR glyphs in the font, so I'm also coming around to Georg's way of thinking:

The ligature-"magic" should only be applied when it finds a glyph info for all parts. As brm is not a known glyph name it needs to stop and return nothing.

But the anchor attachment issue is the real cause of pain here.

simoncozens · 2022-04-06T08:48:27Z

However, since the guessing is only heuristic, I am going to strongly recommend (and for Noto insist) that Glyphs sources explicitly set the category and subcategory for all glyphs.

schriftgestalt · 2022-04-06T09:00:03Z

I’m working on this right now and I think I can improve that quite a bit. I’m adding all my test cases from Glyphs and try to match the result as close as possible.

But for names that Glyphs can’t recognize (like brm_RR) it the glyph info has to be set manually to make it work in Glyphs and in glyphsLib.

simoncozens · 2022-04-06T09:04:57Z

Thought: we could look at it the other way around. The only reason the ligature/base difference actually matters (unless we are doing very funky feature code) is for mark attachment. So why can't we just say that if a glyph has ..._1 anchors it's a ligature and if it doesn't, it's a base?

schriftgestalt · 2022-04-06T10:09:29Z

That could work if your font data is all set correctly. One wring "top" anchor could make it difficult to decide. And in Glyphs this info is used for more things (anchors, feature code) so I need to have that work anyway.

schriftgestalt · 2022-04-07T07:37:14Z

Suspect that my algorithm is doing more than needed in glyphsLib. Is there any use to do name normalization? e.g. asking for info for uni00410041 and getting the name "fixed" to A_A? Of cause this is more interesting for names like uni0926094D0927094D0930094D092F.

schriftgestalt · 2023-03-09T10:55:07Z

What is the state of the Glyphs Info changes I did in the context of RTL kerning issue? That code will fix this issue.

anthrotype · 2023-03-09T11:16:07Z

What is the state of the Glyphs Info changes I did in the context of RTL kerning issue?

last time I looked at it, it looked unfinished. It should be rebased and cleaned up

simoncozens · 2023-03-09T11:27:20Z

That code will fix this issue.

It won't. The code is still exactly the same in that branch.

glyphsLib/Lib/glyphsLib/glyphdata.py

Lines 563 to 567 in 5426cce

    
           # Exception: Something like "one_two" should be a (_, Ligature), 
        
           # "acutecomb_brevecomb" should however stay (Mark, Nonspacing). 
        
           if "_" in glyph_name and glyphs_category[0] != "Mark": 
        
               return glyphs_category[0], "Ligature", glyphs_category[2]

I do not want glyphsLib to do magic guessing based on glyph names.

schriftgestalt · 2023-03-09T11:29:46Z

This is all properly implemented in the GlyphsInfo3 branch.

simoncozens · 2023-03-09T11:36:46Z

simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ git checkout GlyphInfo3
Already on 'GlyphInfo3'
Your branch is up to date with 'origin/GlyphInfo3'.
simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ PYTHONPATH=Lib ipython3
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from glyphsLib.glyphdata import get_glyph

In [2]: get_glyph("brm_R")
Out[2]: info:brm_R pro:brm_R sub:Ligature

simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ grep -5 get_glyph Lib/glyphsLib/builder/features.py
            if name and not name.startswith("_"):
                has_attaching_anchor = True

        # First check glyph.lib for category/subCategory overrides. Otherwise,
        # use global values from GlyphData.
        glyphinfo = glyphdata.get_glyph(
            glyph_name, unicodes=[f"{c:04X}" for c in glyph.unicodes]
        )
        category = glyph.lib.get(category_key) or glyphinfo.category
        subCategory = glyph.lib.get(subCategory_key) or glyphinfo.subCategory

schriftgestalt · 2023-03-09T14:43:53Z

This code is only used for fallback when the glyph data didn’t produce anything. I didn't touch that part.
At that point I can’t see how a name with an underscore could be end up here. So line 565 and 566 can be removed.

simoncozens mentioned this issue Feb 17, 2022

Devanagari: Issue with _ra-deva glyphs when built with gftools googlefonts/gftools#497

Closed

This was referenced Mar 24, 2022

Modified U+1D9FF SIGNWRITING HEAD glyphs lack anchors notofonts/sign-writing#6

Closed

Misplaced nuqta in Noto Serif Gujarati notofonts/gujarati#14

Closed

simoncozens mentioned this issue Mar 26, 2022

glyphdata.get_glyph info may not match Glyphs info 100% #507

Open

This was referenced May 13, 2022

Noto Serif Gujarati: Collision of second letter of a vertical consonant conjunct with a vowel sign notofonts/gujarati#16

Closed

Attaching marks to ligatures, but not mark-to-ligature googlefonts/ufo2ft#613

Open

simoncozens mentioned this issue Jan 25, 2023

Positioning of matras after conjuncts in Gujarati Serif notofonts/gujarati#3

Closed

simoncozens mentioned this issue Mar 9, 2023

Glyphs with subcategory Ligature don't get set as ligatures #877

Open

simoncozens mentioned this issue Feb 7, 2025

Martel: Version 1.002 added google/fonts#8989

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Category guessing leads to bizarre results #756

Category guessing leads to bizarre results #756

simoncozens commented Dec 13, 2021 •

edited

Loading

anthrotype commented Dec 13, 2021

simoncozens commented Dec 13, 2021

schriftgestalt commented Dec 13, 2021

simoncozens commented Dec 13, 2021

khaledhosny commented Dec 13, 2021

simoncozens commented Dec 13, 2021

khaledhosny commented Dec 13, 2021

madig commented Jan 12, 2022 •

edited

Loading

schriftgestalt commented Jan 12, 2022 •

edited

Loading

madig commented Jan 12, 2022

schriftgestalt commented Jan 13, 2022

madig commented Jan 13, 2022

schriftgestalt commented Jan 13, 2022

madig commented Jan 13, 2022

schriftgestalt commented Jan 13, 2022

tallchai commented Mar 7, 2022

schriftgestalt commented Mar 7, 2022

tallchai commented Mar 7, 2022

schriftgestalt commented Mar 8, 2022

simoncozens commented Mar 26, 2022

simoncozens commented Apr 6, 2022

simoncozens commented Apr 6, 2022

schriftgestalt commented Apr 6, 2022 •

edited

Loading

simoncozens commented Apr 6, 2022

schriftgestalt commented Apr 6, 2022

schriftgestalt commented Apr 7, 2022

schriftgestalt commented Mar 9, 2023

anthrotype commented Mar 9, 2023

simoncozens commented Mar 9, 2023

schriftgestalt commented Mar 9, 2023

simoncozens commented Mar 9, 2023

schriftgestalt commented Mar 9, 2023

Category guessing leads to bizarre results #756

Category guessing leads to bizarre results #756

Comments

simoncozens commented Dec 13, 2021 • edited Loading

anthrotype commented Dec 13, 2021

simoncozens commented Dec 13, 2021

schriftgestalt commented Dec 13, 2021

simoncozens commented Dec 13, 2021

khaledhosny commented Dec 13, 2021

simoncozens commented Dec 13, 2021

khaledhosny commented Dec 13, 2021

madig commented Jan 12, 2022 • edited Loading

schriftgestalt commented Jan 12, 2022 • edited Loading

madig commented Jan 12, 2022

schriftgestalt commented Jan 13, 2022

madig commented Jan 13, 2022

schriftgestalt commented Jan 13, 2022

madig commented Jan 13, 2022

schriftgestalt commented Jan 13, 2022

tallchai commented Mar 7, 2022

schriftgestalt commented Mar 7, 2022

tallchai commented Mar 7, 2022

schriftgestalt commented Mar 8, 2022

simoncozens commented Mar 26, 2022

simoncozens commented Apr 6, 2022

simoncozens commented Apr 6, 2022

schriftgestalt commented Apr 6, 2022 • edited Loading

simoncozens commented Apr 6, 2022

schriftgestalt commented Apr 6, 2022

schriftgestalt commented Apr 7, 2022

schriftgestalt commented Mar 9, 2023

anthrotype commented Mar 9, 2023

simoncozens commented Mar 9, 2023

schriftgestalt commented Mar 9, 2023

simoncozens commented Mar 9, 2023

schriftgestalt commented Mar 9, 2023

simoncozens commented Dec 13, 2021 •

edited

Loading

madig commented Jan 12, 2022 •

edited

Loading

schriftgestalt commented Jan 12, 2022 •

edited

Loading

schriftgestalt commented Apr 6, 2022 •

edited

Loading