number suffix type annotations #513

Open
wants to merge 9 commits into
base: main

zkat
Member

@zkat zkat commented Mar 30, 2025

Fixes: #510

@zkat
Member Author

zkat commented Mar 30, 2025

@tabatkins honestly the most heartbreaking thing about all this to me is that I can't have 123u32 and such :( Gotta do 123#u32

@zkat
Member Author

zkat commented Apr 1, 2025

@bgotink does this align with your understanding/experience/what you did in your lib?

@zkat
Member Author

zkat commented Apr 2, 2025

alright, tests added. This is ready for final review.

@zkat zkat requested a review from tabatkins April 3, 2025 03:47
annotation as a "suffix", instead of prepending it between `(` and `)`. This
makes it possible to, for example, write `10px`, `10.5%`, `512GiB`, etc., which
are equivalent to `(px)10`, `(%)10.5`, and `(GiB)512`, respectively.

Contributor

Just for readability purposes, I think it's worth mentioning the # escape hatch here with a reference to the "Explicit Suffix Type Annotation" section, at least when it comes to types like u32. For example, maybe:

To remove ambiguity, some suffixes must be prefixed with #: for example, 10.0u8 is invalid, but 10.0#u8 is. The full list of rules for invalid suffixes is clarified in the "Explicit Suffix Type Annotation" section.
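For readers following along, here is a minimal Python sketch of the equivalence being described: a suffixed number is rewritten into the parenthesized annotation form. The names `NUMBER`, `SUFFIXED`, and `desugar` are hypothetical, and the pattern is a simplification (no exponents, no radix prefixes). It deliberately does not implement the rules that decide which bare suffixes must be rejected (such as `10.0u8` above); those live in the "Explicit Suffix Type Annotation" section.

```python
import re

# Simplified decimal number: optional sign, digits/underscores, optional fraction.
# This is an assumption for illustration, not the spec's grammar.
NUMBER = r"[+-]?[0-9][0-9_]*(?:\.[0-9][0-9_]*)?"

# Either an explicit `#suffix`, or a bare suffix that does not start with
# a digit, `_`, `#`, or `.` (again, a simplification).
SUFFIXED = re.compile(
    rf"(?P<number>{NUMBER})(?:#(?P<explicit>\S+)|(?P<bare>[^\s0-9_#.]\S*))"
)


def desugar(value: str) -> str:
    """Rewrite `10px` or `10.0#u8` into the `(type)number` annotation form."""
    m = SUFFIXED.fullmatch(value)
    if m is None:
        raise ValueError(f"not a suffixed decimal in this sketch: {value!r}")
    suffix = m.group("explicit") or m.group("bare")
    return f"({suffix}){m.group('number')}"


print(desugar("10px"))     # (px)10
print(desugar("10.0#u8"))  # (u8)10.0
```

Note that this sketch would also accept `10.0u8` as a bare suffix, which the suggested wording says must be rejected; a real implementation needs the additional checks from the spec.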

Contributor

I dropped a suggestion for this.

Contributor

Ah, GitHub collapsed it as "outdated" because they're bad

Contributor

@tabatkins tabatkins left a comment

r+ after review of suggested changes

Member

@bgotink bgotink left a comment

This matches what I implemented, apart from one test that appears to be wrong and one mistake on my end: the parser skips certain validations on # suffixes, which makes 123#123 equivalent to ("123")123, and that is wrong.

Member

integer can end on an underscore so this is actually valid and equivalent to (abc)123_

Contributor

correct. this is why we can't start a bare suffix with _, because the syntax would parse differently than intended.
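A small Python sketch of the effect discussed here, assuming the integer production is a digit followed by any run of digits and underscores. The input `123_abc` is an assumption reconstructed from the stated result `(abc)123_`, and `INTEGER` and `split_integer` are hypothetical names.

```python
import re

# Assumed shape of the integer production: a digit, then digits or underscores.
INTEGER = re.compile(r"[0-9][0-9_]*")


def split_integer(text: str) -> tuple[str, str]:
    """Greedily consume the integer part and return (integer, remainder)."""
    m = INTEGER.match(text)
    if m is None:
        raise ValueError(f"does not start with a digit: {text!r}")
    return m.group(), text[m.end():]


# The trailing underscore is taken by the integer, so the suffix is `abc`,
# not `_abc`.
print(split_integer("123_abc"))  # ('123_', 'abc')
```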

@bgotink
Member

bgotink commented Apr 5, 2025

The following tests that previously failed now run successfully:

| test name | test | equivalent document |
| --- | --- | --- |
| bare_ident_numeric_fail.kdl | `node 0n` | `node (n)0` |
| bare_ident_numeric_sign_fail.kdl | `node +0n` | `node (n)+0` |
| illegal_char_in_binary_fail.kdl | `node 0bx01` | `node (bx01)0` |
| multiple_x_in_hex_fail.kdl | `node 0xx10` | `node (xx10)0` |
| no_digits_in_hex_fail.kdl | `node 0x` | `node (x)0` |

@zkat
Member Author

zkat commented Apr 7, 2025

Uggghhhhh. That makes sense. Looks like we’re gonna need to be more specific what order things run in. I have an idea for the grammar.

zkat and others added 5 commits April 17, 2025 11:51
Co-authored-by: Tab Atkins Jr. <[email protected]>
Co-authored-by: Tab Atkins Jr. <[email protected]>
Co-authored-by: Tab Atkins Jr. <[email protected]>
Co-authored-by: Tab Atkins Jr. <[email protected]>
Co-authored-by: Tab Atkins Jr. <[email protected]>
@zkat
Member Author

zkat commented Apr 17, 2025

@bgotink

> The following tests that previously failed now run successfully:

Looking at this...

These should be dropped because they're ok now:

| test name | test | equivalent document |
| --- | --- | --- |
| bare_ident_numeric_fail.kdl | `node 0n` | `node (n)0` |
| bare_ident_numeric_sign_fail.kdl | `node +0n` | `node (n)+0` |

These tests should stay and continue to fail. Once you've hit a 0b/0o/0x number prefix, you SHOULD only be able to parse their related number formats:

| test name | test | equivalent document |
| --- | --- | --- |
| illegal_char_in_binary_fail.kdl | `node 0bx01` | `node (bx01)0` |
| multiple_x_in_hex_fail.kdl | `node 0xx10` | `node (xx10)0` |
| no_digits_in_hex_fail.kdl | `node 0x` | `node (x)0` |
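One way to picture the intended behaviour is a parser that commits as soon as it sees a `0b`/`0o`/`0x` prefix and never falls back to "decimal zero plus suffix". The Python below is only a sketch under that assumption: `RADIX`, `DECIMAL_WITH_SUFFIX`, and `parse_number` are hypothetical names, and the patterns are simplified stand-ins for the real grammar.

```python
import re

# Simplified radix number patterns (assumed for illustration).
RADIX = {
    "0b": re.compile(r"[+-]?0b[01][01_]*"),
    "0o": re.compile(r"[+-]?0o[0-7][0-7_]*"),
    "0x": re.compile(r"[+-]?0x[0-9a-fA-F][0-9a-fA-F_]*"),
}
# Simplified decimal integer with an optional bare suffix.
DECIMAL_WITH_SUFFIX = re.compile(r"([+-]?[0-9][0-9_]*)([^\s0-9_]\S*)?")


def parse_number(token: str):
    """Return (kind, number, suffix); raise ValueError on a syntax error."""
    unsigned = token[1:] if token[:1] in ("+", "-") else token
    prefix = unsigned[:2]
    if prefix in RADIX:
        # Committed: only the matching radix format may follow the prefix.
        if RADIX[prefix].fullmatch(token) is None:
            raise ValueError(f"malformed {prefix} number: {token!r}")
        return ("radix", token, None)
    m = DECIMAL_WITH_SUFFIX.fullmatch(token)
    if m is None:
        raise ValueError(f"not a number: {token!r}")
    return ("decimal", m.group(1), m.group(2))


print(parse_number("0n"))   # ('decimal', '0', 'n')
print(parse_number("+0n"))  # ('decimal', '+0', 'n')
for bad in ("0bx01", "0xx10", "0x"):
    try:
        parse_number(bad)
    except ValueError as err:
        print("rejected:", err)
```

With this ordering, the first two tests parse as suffixed decimals while the last three stay errors, which is the split described above.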

@zkat
Member Author

zkat commented Apr 17, 2025

uggghhhh I understand why those last 3 tests pass now. I want to do something about it, though. I don't like that. I wonder if there's a good way around it.

@zkat
Member Author

zkat commented Apr 17, 2025

@bgotink @tabatkins I've... kind of changed the rules a bit. They're simpler now. And most importantly: we have 0u64 available now!

I checked #510 and I didn't see any discussion about us tackling these rules, because we were focusing so much on what the rules for the suffix should be that we didn't take a step back and think of these numbers as a whole, and how the number syntax in general could be addressed.

Please lmk what you think. I think with these changes, we'll get the results I expected in #513 (comment)

@zkat
Member Author

zkat commented Apr 18, 2025

Some key changes:

  • I reorganized things a bit in general.
  • The grammar itself now blocks simultaneous suffix and prefix annotations
  • The underscore _ initial character restriction was removed: a bare suffix CAN'T start with one because the integer will slurp it up first
  • The complex rules meant to disambiguate from non-decimals were removed. The fact that we only allow bare suffixes on decimals is sufficient to protect us here imo. I think these complex rules formed because things were moving fast and we didn't take a step back and really rethink the implications of only doing decimals. This has been clarified in the spec. With the new rules, there is no dangerous ambiguity, just potential syntax errors on certain zero values. That is notably NOT an unexpected parsing success, which I think is the dangerous bit.
  • The rules around exponential-likes have been changed a bit to guard against small typos if you miss the digit part (so 1e+ is illegal)
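To illustrate the last bullet, here is a rough Python sketch of a guard against an exponent whose digits were left off. The thread only gives `1e+` as an example, so the exact rule here (an `e` or `E`, a sign, and then no digit) is an assumption, and `TRUNCATED_EXPONENT` and `is_truncated_exponent` are hypothetical names.

```python
import re

# Assumed reading of the guard: `e` or `E`, then a sign, then NOT a digit.
TRUNCATED_EXPONENT = re.compile(r"[eE][+-](?![0-9])")


def is_truncated_exponent(suffix: str) -> bool:
    """True if the text after the significand looks like `e+` with no digits."""
    return TRUNCATED_EXPONENT.match(suffix) is not None


# `1e+` splits into significand `1` and trailing text `e+`.
print(is_truncated_exponent("e+"))   # True: reject, the digit is missing
print(is_truncated_exponent("e+5"))  # False: a complete exponent
print(is_truncated_exponent("px"))   # False: an ordinary suffix
```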

@zkat
Member Author

zkat commented Apr 18, 2025

I'm also wondering: could we just drop the exponent restriction (but keep the e+ protection, in case someone fails to write that digit and ends up with an unfortunate parse)?

@mwh

mwh commented Apr 19, 2025

I don't think the specification prose as currently written does require or even allow a trailing _ to be consumed by the number ("digits ... may be separated by _"), but the authoritative grammar does include arbitrary underscores in place of digits anywhere except the beginning.

I think either the description of the grammatical language needs tightening, or the grammar may now allow some undesirable constructions with underscores and suffixes. Specifically, I'm not sure whether * is expected to commit a parser to whatever it finds the first time, or whether it can backtrack to allow later productions to match. Here, it would be backtracking to an underscore after the input didn't match when the underscore was consumed as part of a number.

Consider 12_3,x. The overall match would fail when (digit | '_')* from the integer production consumes _3; does it then backtrack and try consuming less of the input? If it stops short after "12" and leaves "_3,x", the whole input matches suffixed-decimal successfully with significand accepting "12" and bare-type-suffix accepting "_3,x" — so the result would be equivalent to (_3,x)12, though I think it should be an error. 12_3.4 and 12_3.4.5 differ at the same point, or consider 1_234,567. All of these can be produced from the grammar, at least.

I'm not sure whether I am misreading the description of the grammar language. * consumes "as many instances as possible without failing the match" — is that failing the match of the whole input, or does it just mean as many instances as are present at this point in the input and failure isn't really part of it? The comparison to standard regex semantics and the existence of cut points makes me think it can shorten if needed. If it does commit early to the longest sequence it finds, this issue doesn't come up. Otherwise, for integer to definitively slurp the underscore up and make this an error, I think there would need to be a cut point suffixed-decimal := significand ¶ (bare-type-suffix | (exponent? explicit-type-suffix)).

If there is an issue, either banning _ as the initial character of a suffix again or solidifying the grammatical handling of integers could address it. I am not personally a fan of baking parsing and backtracking rules into a grammar and would probably just block _, but there are reasons to have that kind of grammar too, particularly to constrain where an error is detected.

Something shaped like 1_234,567 seems like the primary case where this is realistic and actually matters, or someone trying to write a list 1_234, 5_678, .... I do think the grammar should rule these out, but if I were implementing a parser directly off this grammar right now, I would end up with this backtracking and unknowingly accepting these cases because nothing commits the parse to the path that produces an error. If I built the parser with a lexer in the front, I think the lexer would probably consume the whole number as expected and then I wouldn't encounter the issue. Clearly one of those is wrong, but it's not good for compatibility if both seem reasonable to make.

If nothing else, test cases including both underscores within the number and invalid suffixes will be good to ensure that the incorrect readings are detected.
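The two readings can be compared with a short Python sketch. It assumes, purely for illustration, that a bare suffix may start with an ASCII letter or `_` and may not start with `,`; the real suffix rules are in the PR's grammar. `parse_backtracking` uses a regex, which retries shorter integer matches, while `parse_committed` behaves like a lexer that keeps the longest integer it found. All names are hypothetical.

```python
import re

NUMBER = re.compile(r"[0-9][0-9_]*")
# Assumed suffix shape: starts with a letter or `_`, then any non-space chars.
SUFFIX = re.compile(r"[A-Za-z_]\S*")
WHOLE = re.compile(r"([0-9][0-9_]*)([A-Za-z_]\S*)")


def parse_backtracking(text: str):
    """Regex semantics: the integer may give characters back to the suffix."""
    m = WHOLE.fullmatch(text)
    return (m.group(1), m.group(2)) if m else None


def parse_committed(text: str):
    """Lexer semantics: take the longest integer, then never shorten it."""
    m = NUMBER.match(text)
    if m is None:
        return None
    rest = text[m.end():]
    if SUFFIX.fullmatch(rest) is None:
        return None
    return (m.group(), rest)


for sample in ("12_3,x", "1_234,567", "12px"):
    print(sample, parse_backtracking(sample), parse_committed(sample))
# 12_3,x ('12', '_3,x') None
# 1_234,567 ('1', '_234,567') None
# 12px ('12', 'px') ('12', 'px')
```

Under these assumptions the backtracking reading silently accepts `1_234,567` as `(_234,567)1`, while the committed reading reports an error, which is the compatibility gap described above.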


I also wonder in a similar way about 0xaz. The prose does rule this out. This time it's the grammar semantics of - in significand-initial I'm not certain about: "any digit except something that matches the literal '0x'" seems like it'd be the same as just digit alone, because "0x" is not a digit and is not matched by digit. The intention here is clear, just the formalism may not match up.
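A tiny Python sketch of that last point, assuming `-` is read as plain set difference over what `digit` can match. Removing the two-character string `0x` from a set of single digits removes nothing, whereas a negative lookahead expresses what seems to be the intent. `SIGNIFICAND_INITIAL` and `starts_significand` are hypothetical names, and the lookahead is only one possible formalization.

```python
import re

DIGITS = set("0123456789")

# Read literally as set difference, `digit - '0x'` is the same as `digit`,
# because "0x" is not a member of the set of single digits.
print(DIGITS - {"0x"} == DIGITS)  # True

# One possible way to state the intent: a digit that does not begin
# one of the radix prefixes 0b, 0o, or 0x.
SIGNIFICAND_INITIAL = re.compile(r"(?!0[box])[0-9]")


def starts_significand(text: str) -> bool:
    """True if `text` could begin a decimal significand in this sketch."""
    return SIGNIFICAND_INITIAL.match(text) is not None


print(starts_significand("0xaz"))  # False
print(starts_significand("0u64"))  # True
```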

Successfully merging this pull request may close these issues.

Idea: number suffixes as annotations