Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Category guessing leads to bizarre results #756

Open
simoncozens opened this issue Dec 13, 2021 · 32 comments
Open

Category guessing leads to bizarre results #756

simoncozens opened this issue Dec 13, 2021 · 32 comments

Comments

@simoncozens
Copy link
Collaborator

simoncozens commented Dec 13, 2021

Why aren't my marks attaching to my bases? I have a base glyph "brm_RR" and that works fine, and another base glyph "brm_R" and marks don't attach. Turns out it's because:

In [4]: from glyphsLib.glyphdata import get_glyph

In [5]: get_glyph("brm_RR")
Out[5]: Glyph(name='brm_RR', production_name='brm_RR', unicode=None, category=None, subCategory=None, script=None, description=None)

In [6]: get_glyph("brm_R")
Out[6]: Glyph(name='brm_R', production_name='brm_R', unicode=None, category='Letter', subCategory='Ligature', script=None, description=None)

brm_R goes into ligatures not bases in GDEF, so marks aren't attached. But why is brm_R considered a ligature? Because...

# Corner case: when looking at ligatures, names that don't exist in the AGLFN
# are skipped, so len("acutecomb_o") == 2 but len("dotaccentcomb_o") == 1.
character = fontTools.agl.toUnicode(base_name)

brm_ is skipped, R is looked up, R is a letter, and so _translate_category is called, and inside _translate_category:

# Exception: Something like "one_two" should be a (_, Ligature),
# "acutecomb_brevecomb" should however stay (Mark, Nonspacing).
if "_" in glyph_name and glyphs_category[0] != "Mark":
return glyphs_category[0], "Ligature"

This is way too much magic.

I literally told Glyphs that it was Category=Letter, but no, glyphsLib thinks it knows better.

@anthrotype
Copy link
Member

too much magic indeed.
But I believe that setting the custom glyph category (via alt+cmd+I panel) should override any guessing in by glyphdata. Does it not? That's another bug.

@simoncozens
Copy link
Collaborator Author

Only if you set the subcategory as well.

@schriftgestalt
Copy link
Collaborator

The ligature-"magic" should only be applied when it finds a glyph info for all parts. As brm is not a known glyph name it needs to stop and return nothing.

@simoncozens
Copy link
Collaborator Author

I really hoped the response to "this is way too much magic" was not going to be "let's add more magic".

@khaledhosny
Copy link
Collaborator

Fonts depend on this “magic”, if we were to remove it they will break.

@simoncozens
Copy link
Collaborator Author

They’re breaking already, it’s just right now they’re breaking in strange ways instead of straightforward ways.

@khaledhosny
Copy link
Collaborator

If one has a glyph named alef_lam-ar they want it classified as ligature not base.

@madig
Copy link
Collaborator

madig commented Jan 12, 2022

What glyphsLib does does not match what Glyphs.app does. I have in the past tried to make them match but failed, there are quite a few edge cases where Glyphs.app thinks differently. I don't know how to improve the situation.

Maybe a mechanically verified model for glyph names like the Universal Shaping Engine?

@schriftgestalt
Copy link
Collaborator

schriftgestalt commented Jan 12, 2022

Do you have examples of those differences?

Maybe we can have a shared test suite in the GlyphInfo repo?

@madig
Copy link
Collaborator

madig commented Jan 12, 2022

I guess I can collect some. Or loop through all known glyph names plus names taken from random .glyphs files in the wild and compare (sub)categories before/after.

Here's a static list of glyph names that I have to manually declare bases for the purpose of GDEF and anchor attachment in a certain project:

KNOWN_BASES = {
    "k_ssa-deva",
    "j_nya-deva",
    "k_ss-deva",
    "k_ss-deva.alt2",
    "k_ss-deva.alt3",
    "k_ss-deva.alt4",
    "k_ss-deva.alt5",
    "k_ss-deva.alt6",
    "k_ss-deva.alt7",
    "j_ny-deva",
    "j_ny-deva.alt2",
    "j_ny-deva.alt3",
    "j_ny-deva.alt4",
    "j_ny-deva.alt5",
    "j_ny-deva.alt6",
    "j_ny-deva.alt7",
    "j_ny-deva.alt8",
    "ng_ya-deva",
    "ch_ya-deva",
    "tt_tta-deva",
    "tt_ttha-deva",
    "tt_ya-deva",
    "tth_ttha-deva",
    "tth_ya-deva",
    "dd_dda-deva",
    "dd_ddha-deva",
    "dd_ya-deva",
    "ddh_ddha-deva",
    "ddh_ya-deva",
    "t_ta-deva",
    "t_ra-deva",
    "d_ga-deva",
    "d_gha-deva",
    "d_da-deva",
    "d_dha-deva",
    "d_dh_ya-deva",
    "d_ba-deva",
    "d_bha-deva",
    "d_ma-deva",
    "d_ya-deva",
    "d_ra-deva",
    "d_va-deva",
    "p_ta-deva",
    "sh_ra-deva",
    "ss_tta-deva",
    "ss_ttha-deva",
    "h_nna-deva",
    "h_na-deva",
    "h_ma-deva",
    "h_ya-deva",
    "h_ra-deva",
    "h_la-deva",
    "h_va-deva",
    "h_ra_uMatra-deva",
    "h_ra_uuMatra-deva",
    "ba-khmer",
    "ba-khmer.post",
    "ba-khmer.post2",
    "ba_aaSign-khmer",
    "ba_aaSign-khmer.post2_",
    "ba_aaSign-khmer.post_",
    "ba_auSign-khmer",
    "ba_auSign-khmer.post2_",
    "ba_auSign-khmer.post_",
    "beikoet-khmer",
    "beiroc-khmer",
    "buonkoet-khmer",
    "buonroc-khmer",
    "ca-khmer",
    "ca_aaSign-khmer",
    "ca_auSign-khmer",
    "cha-khmer",
    "cha_aaSign-khmer",
    "cha_auSign-khmer",
    "cho-khmer",
    "cho-khmer.post",
    "cho-khmer.post2",
    "cho_aaSign-khmer",
    "cho_aaSign-khmer.post2_",
    "cho_aaSign-khmer.post_",
    "cho_auSign-khmer",
    "cho_auSign-khmer.post2_",
    "cho_auSign-khmer.post_",
    "co-khmer",
    "co_aaSign-khmer",
    "co_auSign-khmer",
    "da-khmer",
    "da_aaSign-khmer",
    "da_auSign-khmer",
    "dapBeikoet-khmer",
    "dapBeiroc-khmer",
    "dapBuonkoet-khmer",
    "dapBuonroc-khmer",
    "dapMuoykoet-khmer",
    "dapMuoyroc-khmer",
    "dapPiikoet-khmer",
    "dapPiiroc-khmer",
    "dapPramkoet-khmer",
    "dapPramroc-khmer",
    "dapkoet-khmer",
    "daproc-khmer",
    "do-khmer",
    "do_aaSign-khmer",
    "do_auSign-khmer",
    "dottedCircle",
    "ha-khmer",
    "ha_aaSign-khmer",
    "ha_auSign-khmer",
    "ka-khmer",
    "ka_aaSign-khmer",
    "ka_auSign-khmer",
    "kha-khmer",
    "kha_aaSign-khmer",
    "kha_auSign-khmer",
    "kho-khmer",
    "kho-khmer.post",
    "kho-khmer.post2",
    "kho_aaSign-khmer",
    "kho_aaSign-khmer.post2_",
    "kho_aaSign-khmer.post_",
    "kho_auSign-khmer",
    "kho_auSign-khmer.post2_",
    "kho_auSign-khmer.post_",
    "ko-khmer",
    "ko_aaSign-khmer",
    "ko_auSign-khmer",
    "la-khmer",
    "la_aaSign-khmer",
    "la_auSign-khmer",
    "lo-khmer",
    "lo_aaSign-khmer",
    "lo_auSign-khmer",
    "mo-khmer",
    "mo_aaSign-khmer",
    "mo_auSign-khmer",
    "muoykoet-khmer",
    "muoyroc-khmer",
    "ngo-khmer",
    "ngo_aaSign-khmer",
    "ngo_auSign-khmer",
    "nno-khmer",
    "nno_aaSign-khmer",
    "nno_auSign-khmer",
    "no-khmer",
    "no_aaSign-khmer",
    "no_auSign-khmer",
    "nyo-khmer",
    "nyo-khmer.less",
    "nyo_aaSign-khmer",
    "nyo_aaSign-khmer.less",
    "nyo_auSign-khmer",
    "nyo_auSign-khmer.less",
    "pathamasat-khmer",
    "pha-khmer",
    "pha_aaSign-khmer",
    "pha_auSign-khmer",
    "pho-khmer",
    "pho_aaSign-khmer",
    "pho_auSign-khmer",
    "piikoet-khmer",
    "piiroc-khmer",
    "po-khmer",
    "po_aaSign-khmer",
    "po_auSign-khmer",
    "pramBeikoet-khmer",
    "pramBeiroc-khmer",
    "pramBuonkoet-khmer",
    "pramBuonroc-khmer",
    "pramMuoykoet-khmer",
    "pramMuoyroc-khmer",
    "pramPiikoet-khmer",
    "pramPiiroc-khmer",
    "pramkoet-khmer",
    "pramroc-khmer",
    "qa-khmer",
    "qa_aaSign-khmer",
    "qa_auSign-khmer",
    "ro-khmer",
    "ro_aaSign-khmer",
    "ro_auSign-khmer",
    "sa-khmer",
    "sa-khmer.post",
    "sa_aaSign-khmer",
    "sa_aaSign-khmer.post_",
    "sa_auSign-khmer",
    "sa_auSign-khmer.post_",
    "sha-khmer",
    "sha_aaSign-khmer",
    "sha_auSign-khmer",
    "sso-khmer",
    "sso-khmer.post",
    "sso_aaSign-khmer",
    "sso_aaSign-khmer.post_",
    "sso_auSign-khmer",
    "sso_auSign-khmer.post_",
    "ta-khmer",
    "ta_aaSign-khmer",
    "ta_auSign-khmer",
    "tha-khmer",
    "tha_aaSign-khmer",
    "tha_auSign-khmer",
    "tho-khmer",
    "tho_aaSign-khmer",
    "tho_auSign-khmer",
    "to-khmer",
    "to_aaSign-khmer",
    "to_auSign-khmer",
    "ttha-khmer",
    "ttha_aaSign-khmer",
    "ttha_auSign-khmer",
    "ttho-khmer",
    "ttho-khmer.post",
    "ttho-khmer.post2",
    "ttho_aaSign-khmer",
    "ttho_aaSign-khmer.post2_",
    "ttho_aaSign-khmer.post_",
    "ttho_auSign-khmer",
    "ttho_auSign-khmer.post2_",
    "ttho_auSign-khmer.post_",
    "tuteyasat-khmer",
    "vo-khmer",
    "vo_aaSign-khmer",
    "vo_auSign-khmer",
    "yo-khmer",
    "yo-khmer.post",
    "yo-khmer.post2",
    "yo_aaSign-khmer",
    "yo_aaSign-khmer.post2_",
    "yo_aaSign-khmer.post_",
    "yo_auSign-khmer",
    "yo_auSign-khmer.post2_",
    "yo_auSign-khmer.post_",
}

@schriftgestalt
Copy link
Collaborator

I can see three groups in the list.

  1. basic letters: (e.g.: ba-khmer) Those should get their info directly form the glyph data.
  2. symbols (e.g.: beiroc-khmer) dito
  3. ligatures: there seem to be two kinds: a Khmer letter + aaSign. And Devanagari conjuncts (halfform + fullform). Both are ligating two glyphs but in some aspects they are not acting as ligatures in OpenType: they have a simple mark to base feature (only one top and bottom anchor instead of one per component). So I tend to come up with another subCategory for them. I think I’ll change it that the Khmer glyphs will be "Composition"s like other precomposed glyphs (base + mark). The Devanagari should be "Conjunct"s.
    Does this make sense?

@madig
Copy link
Collaborator

madig commented Jan 13, 2022

According to some of our designers that deal in Deva, it makes sense. Internally, we avoid the underscore in conjuncts, so we have names like "KSsa-deva". Wouldn't that be more consistent? no more underscore to disambiguate.

@schriftgestalt
Copy link
Collaborator

Maybe. But it makes processing the names much more complicated.

@madig
Copy link
Collaborator

madig commented Jan 13, 2022

I'm curious, can you please give an example?

@schriftgestalt
Copy link
Collaborator

Getting from KSsa-deva to k-deva + ssa-deva. This example might be possible but what if there are uppercase in the base names (like in to_auSign-khmer)? And I have the code from processing normal ligatures already.

@tallchai
Copy link

tallchai commented Mar 7, 2022

@schriftgestalt Can you please elaborate on this fix? I am having [a similar issue] (googlefonts/gftools#497).
What works in my case (for *_ra-deva) is to categorize them as Devanagari > Letter > Other and then manually add them to the cjct feature set (under cjct_devanagari_rakar_forms).
Need to follow the same process for some other "conjunct" glyphs like "c_ca-deva", "d_ya-deva" etc. <- These are manually to be added under cjct_devanagari in the cjct feature.

Getting from KSsa-deva to k-deva + ssa-deva. This example might be possible but what if there are uppercase in the base names (like in to_auSign-khmer)? And I have the code from processing normal ligatures already.

@schriftgestalt
Copy link
Collaborator

Can you be more specific what you like to know?

@tallchai
Copy link

tallchai commented Mar 7, 2022

@schriftgestalt I have a similar issue(s) with certain conjuncts when I try to build the font with gftools.

The marks for multiple conjuncts like tt_ya-deva (categorized as Devanagari > Letter > Ligature) do not render correctly.
Screen Shot 2020-11-09 at 7 10 48 PM

If I change the subcategory from Ligature to Other, the cjct feature does not pick up the glyph at all.
Screen Shot 2020-11-09 at 7 08 17 PM

For the glyph to actually work, I need to manually edit the cjct feature to include this glyph.

A lot of Devanagari glyphs have this issue including all of the glyphs that end with _ra-deva
Screen Shot 2022-02-17 at 12 40 52 PM

I have logged multiple issues regarding this. In the latest issue logged, @simoncozens referred to this issue.
I believe your comment above addresses a way to fix this issue and I am trying to understand it better. :)

Issues logged:
googlefonts/fontmake#703
googlefonts/fontmake#706
googlefonts/gftools#497

Thanks!

@schriftgestalt
Copy link
Collaborator

I can’t say much about the cjct feature in gftools. Only how it would be done in Glyphs.

@simoncozens
Copy link
Collaborator Author

I'm getting really annoyed with this bug. Basically if a glyph name has an underscore in it, mark attachment probably won't work.

@simoncozens
Copy link
Collaborator Author

Thinking about this more, the problem is not so much that glyphs are being incorrectly considered ligature glyphs. The problem is more that if they are ligature glyphs, ufo2ft does not perform mark-to-base attachment on them, only mark-to-ligature attachment. So a ligature glyph with a "top" anchor does not get top attachment; only "top_1" would be attached. If the anchor attachment was correct, the GDEF category would be less significant.

That said, it's still obviously wrong to consider brm_RR a ligature if there's no brm and RR glyphs in the font, so I'm also coming around to Georg's way of thinking:

The ligature-"magic" should only be applied when it finds a glyph info for all parts. As brm is not a known glyph name it needs to stop and return nothing.

But the anchor attachment issue is the real cause of pain here.

@simoncozens
Copy link
Collaborator Author

However, since the guessing is only heuristic, I am going to strongly recommend (and for Noto insist) that Glyphs sources explicitly set the category and subcategory for all glyphs.

@schriftgestalt
Copy link
Collaborator

schriftgestalt commented Apr 6, 2022

I’m working on this right now and I think I can improve that quite a bit. I’m adding all my test cases from Glyphs and try to match the result as close as possible.

But for names that Glyphs can’t recognize (like brm_RR) it the glyph info has to be set manually to make it work in Glyphs and in glyphsLib.

@simoncozens
Copy link
Collaborator Author

Thought: we could look at it the other way around. The only reason the ligature/base difference actually matters (unless we are doing very funky feature code) is for mark attachment. So why can't we just say that if a glyph has ..._1 anchors it's a ligature and if it doesn't, it's a base?

@schriftgestalt
Copy link
Collaborator

That could work if your font data is all set correctly. One wring "top" anchor could make it difficult to decide. And in Glyphs this info is used for more things (anchors, feature code) so I need to have that work anyway.

@schriftgestalt
Copy link
Collaborator

Suspect that my algorithm is doing more than needed in glyphsLib. Is there any use to do name normalization? e.g. asking for info for uni00410041 and getting the name "fixed" to A_A? Of cause this is more interesting for names like uni0926094D0927094D0930094D092F.

@schriftgestalt
Copy link
Collaborator

What is the state of the Glyphs Info changes I did in the context of RTL kerning issue? That code will fix this issue.

@anthrotype
Copy link
Member

What is the state of the Glyphs Info changes I did in the context of RTL kerning issue?

last time I looked at it, it looked unfinished. It should be rebased and cleaned up

@simoncozens
Copy link
Collaborator Author

That code will fix this issue.

It won't. The code is still exactly the same in that branch.

# Exception: Something like "one_two" should be a (_, Ligature),
# "acutecomb_brevecomb" should however stay (Mark, Nonspacing).
if "_" in glyph_name and glyphs_category[0] != "Mark":
return glyphs_category[0], "Ligature", glyphs_category[2]

I do not want glyphsLib to do magic guessing based on glyph names.

@schriftgestalt
Copy link
Collaborator

This is all properly implemented in the GlyphsInfo3 branch.

@simoncozens
Copy link
Collaborator Author

simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ git checkout GlyphInfo3
Already on 'GlyphInfo3'
Your branch is up to date with 'origin/GlyphInfo3'.
simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ PYTHONPATH=Lib ipython3
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from glyphsLib.glyphdata import get_glyph

In [2]: get_glyph("brm_R")
Out[2]: info:brm_R pro:brm_R sub:Ligature

simon@Simons-MacBook-Air ~/others-repos/glyphsLib *$ grep -5 get_glyph Lib/glyphsLib/builder/features.py
            if name and not name.startswith("_"):
                has_attaching_anchor = True

        # First check glyph.lib for category/subCategory overrides. Otherwise,
        # use global values from GlyphData.
        glyphinfo = glyphdata.get_glyph(
            glyph_name, unicodes=[f"{c:04X}" for c in glyph.unicodes]
        )
        category = glyph.lib.get(category_key) or glyphinfo.category
        subCategory = glyph.lib.get(subCategory_key) or glyphinfo.subCategory

@schriftgestalt
Copy link
Collaborator

This code is only used for fallback when the glyph data didn’t produce anything. I didn't touch that part.
At that point I can’t see how a name with an underscore could be end up here. So line 565 and 566 can be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants