CSS Dimensions (and percentages) are pretty straightforward, but also complicated to deal with in Biome/Rowan's token system. Tokens here are all lightweight, direct representations of a string of text, which is great for efficiency, but limits how much information can be stored in them (also a good thing most of the time!). Dimensions and percentages, however, rely on a bit of additional context to be parsed correctly.
A <dimension-token> is a single token that consists of a number immediately followed by an identifier (as shown in the railroad diagram in the link). No whitespace is allowed between the two. The spec also calls out that for a <dimension>, the identifier must be a "unit identifier", but does not define what a unit identifier actually is. For this reason, most parsers seem to treat <dimension> exactly like <dimension-token> (which the spec says is correct) and just allow any identifier, where the result of parsing is either a "regular" dimension for known unit values, or an "unknown" dimension otherwise.
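As a rough illustration of that regular/unknown split, here is a minimal sketch. The `Dimension` enum, the `classify` function, and the short `KNOWN_UNITS` list are hypothetical names made up for this example, not Biome's actual types, and the unit list is deliberately incomplete.

```rust
/// Hypothetical result type; the names are illustrative, not Biome's.
#[derive(Debug, PartialEq)]
enum Dimension<'a> {
    Regular { value: &'a str, unit: &'a str },
    Unknown { value: &'a str, unit: &'a str },
}

/// A tiny, deliberately incomplete list of known units (assumption).
const KNOWN_UNITS: &[&str] = &["px", "em", "rem", "deg", "s", "ms"];

/// Any identifier is accepted as a unit; only the resulting variant differs.
fn classify<'a>(value: &'a str, unit: &'a str) -> Dimension<'a> {
    // CSS units are matched ASCII case-insensitively.
    if KNOWN_UNITS.iter().any(|known| known.eq_ignore_ascii_case(unit)) {
        Dimension::Regular { value, unit }
    } else {
        Dimension::Unknown { value, unit }
    }
}
```

With this shape, `classify("10", "PX")` yields a `Regular` dimension, while `classify("0", "\\0")` yields an `Unknown` one instead of an error.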
While lexing a number and an identifier is easy enough, the restriction on whitespace presents a problem when the rest of the grammar effectively ignores whitespace. One possibility is to use a new LexContext when attempting to lex a dimension versus a number, but then if there's a failure, the lexer has to re-lex the token, and knowing when to attempt a dimension and when to stop at the end of the number is very difficult to thread through without duplicating work. It's also not really correct, either! It would be invalid to parse a number and an identifier with no whitespace between them as anything other than a dimension, because that is the syntax definition.
So we always want to parse the entirety of the dimension token together, but we also still want to know which part of the token is the number and which part is the unit. And this is where it gets complicated. Other parsers handle this by just creating a Dimension token that has multiple fields: one for the number and one for the unit. Servo's rust-cssparser does exactly this, for example. But with Rowan, we are limited to just tracking the raw text of the token and can't add fields to it all willy-nilly.
What do we do then? We cheat a little and use some intermediate representation! This PR introduces a new token type, CSS_DIMENSION_VALUE, that represents a number literal that is immediately followed by an identifier. It does not include the identifier, but just checks that the next bytes of the token stream will be an identifier. By returning this special kind of token, the parser can then distinguish between a plain number and a dimension value, and if it sees a dimension value it can continue to consume the identifier as well and create a CssRegularDimension from it (or CssUnknownDimension for unknown unit values). Before creating the node, it will also re-cast the value token into a CSS_NUMBER_LITERAL, keeping the behavior purely internal and opaque to the resulting syntax tree.
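The parser side of this can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `Kind`, `Token`, `Node`, `KNOWN_UNITS`, and `parse_number_like` are hypothetical stand-ins, and the tokens are assumed to have been produced by the lexer already.

```rust
/// Token kinds, loosely modeled on the names in this PR (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    NumberLiteral,
    /// A number that the lexer saw was immediately followed by an identifier.
    DimensionValue,
    Ident,
}

/// A flat token: a kind plus its raw text, with no extra fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Token<'a> {
    kind: Kind,
    text: &'a str,
}

/// Hypothetical node shapes standing in for the real syntax nodes.
#[derive(Debug, PartialEq)]
enum Node<'a> {
    Number { value: Token<'a> },
    RegularDimension { value: Token<'a>, unit: Token<'a> },
    UnknownDimension { value: Token<'a>, unit: Token<'a> },
}

/// Deliberately incomplete list of known units (assumption).
const KNOWN_UNITS: &[&str] = &["px", "em", "rem", "s", "ms"];

/// Parse one number-like node from the front of `tokens`.
/// Returns the node and how many tokens were consumed.
fn parse_number_like<'a>(tokens: &[Token<'a>]) -> Option<(Node<'a>, usize)> {
    let first = *tokens.first()?;
    match first.kind {
        Kind::NumberLiteral => Some((Node::Number { value: first }, 1)),
        Kind::DimensionValue => {
            // The lexer promised that an identifier follows; consume it as the unit.
            let unit = *tokens.get(1).filter(|t| t.kind == Kind::Ident)?;
            // Re-cast the value so the tree only ever contains number literals.
            let value = Token { kind: Kind::NumberLiteral, ..first };
            let known = KNOWN_UNITS.iter().any(|u| u.eq_ignore_ascii_case(unit.text));
            Some(if known {
                (Node::RegularDimension { value, unit }, 2)
            } else {
                (Node::UnknownDimension { value, unit }, 2)
            })
        }
        Kind::Ident => None,
    }
}
```

The point of the re-cast is that `DimensionValue` never appears in the returned node: consumers of the tree only see a number literal and a unit identifier.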
<percentage> is a very similar token type in the CSS spec: a number followed immediately by a % character. While not technically a dimension, we can treat them the same way, which this PR does by also introducing a CSS_PERCENTAGE_VALUE that performs the same kind of lookahead check as the CSS_DIMENSION_VALUE described above, and again the parser consumes and converts the value as needed.
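On the lexer side, the lookahead that separates the three number-like kinds might look something like the sketch below. It is intentionally simplified: signs and exponents are not handled, the identifier-start test only covers ASCII letters and underscores, and `Kind`, `number_len`, and `lex_number_like` are hypothetical names rather than Biome's lexer API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    NumberLiteral,
    DimensionValue,
    PercentageValue,
}

/// Length of a leading `digits[.digits]` or `.digits` run.
/// Signs and exponents are left out to keep the sketch short.
fn number_len(bytes: &[u8]) -> usize {
    let digits = |from: usize| {
        bytes[from..].iter().take_while(|b| b.is_ascii_digit()).count()
    };
    let int_len = digits(0);
    // Only accept a '.' when at least one digit follows it.
    if bytes.get(int_len) == Some(&b'.') {
        let frac_len = digits(int_len + 1);
        if frac_len > 0 {
            return int_len + 1 + frac_len;
        }
    }
    int_len
}

/// Lex only the numeric text, but pick the kind by peeking at the next byte.
/// The `%` or the unit identifier is left for the next token.
fn lex_number_like(src: &str) -> Option<(Kind, usize)> {
    let bytes = src.as_bytes();
    let len = number_len(bytes);
    if len == 0 {
        return None;
    }
    let kind = match bytes.get(len).copied() {
        Some(b'%') => Kind::PercentageValue,
        // Simplified identifier start; the real rule also covers '-',
        // escapes, and non-ASCII code points.
        Some(b) if b.is_ascii_alphabetic() || b == b'_' => Kind::DimensionValue,
        _ => Kind::NumberLiteral,
    };
    Some((kind, len))
}
```

Note that the returned length never includes the `%` or the unit, so `10 px` (with whitespace) still lexes as a plain number followed by unrelated tokens.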
Other effects
This change touches a surprising amount of code! As an overview of everything that's changed:
Added CssUnknownDimension to handle dimensions with unknown unit types. This is the end-user solution for 📎 Handle and recover from unknown dimensions #1370, where 0\0 was not being parsed as a dimension, even though \0 is a valid identifier.
Added the new token types as described above, and adjusted the lexer to handle them.
Refactored css_dimension to understand and re-cast the dimension value tokens into Dimension nodes as needed.
Refactored the parsing for @keyframes to look for CSS_PERCENTAGE_VALUE tokens rather than CSS_NUMBER_LITERAL, which will also make effective recovery simple.
Refactored parse_pseudo_class_nth to treat the An+B microsyntax as a possible Dimension followed by an offset. This is awkward because the definition for the microsyntax is very specific but also flexible. The result still passes all existing tests, so I believe it is sound.
Updated snapshots for all tests that use Dimensions.
Added tests to cover the usage of the 0\0 hack.
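To see why 0\0 counts as a dimension, it helps to look at the "would the next bytes start an identifier" check. The sketch below follows my reading of the CSS Syntax Level 3 rules, working on raw bytes as an approximation; the function names are hypothetical and this is not the lexer code from this PR.

```rust
/// Can `b` start an identifier? Letters, underscore, or any non-ASCII byte
/// (a byte-level approximation of the spec's code point rule).
fn is_ident_start(b: u8) -> bool {
    b.is_ascii_alphabetic() || b == b'_' || !b.is_ascii()
}

/// A backslash starts a valid escape unless a newline follows it.
fn is_valid_escape(first: Option<u8>, second: Option<u8>) -> bool {
    first == Some(b'\\') && !matches!(second, Some(b'\n' | b'\r' | b'\x0C'))
}

/// Would the bytes right after a number start an identifier?
fn would_start_identifier(rest: &[u8]) -> bool {
    let at = |i: usize| rest.get(i).copied();
    match at(0) {
        // '-' needs an identifier start, another '-', or an escape after it.
        Some(b'-') => {
            matches!(at(1), Some(b) if is_ident_start(b) || b == b'-')
                || is_valid_escape(at(1), at(2))
        }
        Some(b'\\') => is_valid_escape(at(0), at(1)),
        Some(b) => is_ident_start(b),
        None => false,
    }
}
```

For the input `0\0`, the bytes after the number are `\0`, which is a valid escape, so the check succeeds and the whole thing is a dimension whose unit is the identifier `\0`. By contrast, `-1` after a number does not start an identifier.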
Test Plan
All of the snapshots for the relevant features should be updated now, and they all pass. I want to wait on #1371 to get the updated recovery for @keyframes and then update that snapshot too, since it's quite different and will change again with this PR (a good change, but it will probably cause git conflicts and such).
Summary
#268. Closes #1370.
<dimension-token> spec: https://drafts.csswg.org/css-syntax-3/#dimension-token-diagram