Improve C# grammar generator. #73340

Merged on May 9, 2024 (75 commits)

Member

@CyrusNajmabadi commented May 4, 2024

Fleshes out several rules that previously would say "/* see lexical specification */".

Added rules for many token types (identifiers, keywords, modifiers, operators, punctuation, numerics, strings).

@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels May 4, 2024
s_normalizationRegex.Replace(name.EndsWith("Syntax") ? name[..^"Syntax".Length] : name, "_").ToLower(),
ImmutableArray.Create(name));

// Converts a PascalCased name into snake_cased name.
private static readonly Regex s_normalizationRegex = new Regex(
- "(?<=[A-Z])(?=[A-Z][a-z]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z])(?=[^A-Za-z])",
+ "(?<=[A-Z])(?=[A-Z][a-z0-9]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z0-9])(?=[^A-Za-z0-9])",
Member Author

Ensures that utf8 stays together as a single word rather than being split into utf_8.
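To make the effect of the regex change concrete, here is a minimal sketch that runs both patterns from the diff over the same name. The class and method names (`SnakeCaseSketch`, `OldName`, `NewName`) are hypothetical, and `RegexOptions.IgnorePatternWhitespace` is an assumption: the options are not visible in this excerpt, but the spaces around `|` in the pattern would otherwise be matched literally.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical harness; only the two pattern strings come from the diff.
public static class SnakeCaseSketch
{
    // Pattern before the change: a digit counts as a word boundary.
    private static readonly Regex s_oldRegex = new Regex(
        "(?<=[A-Z])(?=[A-Z][a-z]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z])(?=[^A-Za-z])",
        RegexOptions.IgnorePatternWhitespace); // assumed option

    // Pattern after the change: digits are treated like letters.
    private static readonly Regex s_newRegex = new Regex(
        "(?<=[A-Z])(?=[A-Z][a-z0-9]) | (?<=[^A-Z])(?=[A-Z]) | (?<=[A-Za-z0-9])(?=[^A-Za-z0-9])",
        RegexOptions.IgnorePatternWhitespace); // assumed option

    // Strips a trailing "Syntax", inserts "_" at each zero-width match, then lowercases.
    private static string Normalize(Regex regex, string name)
        => regex.Replace(name.EndsWith("Syntax") ? name[..^"Syntax".Length] : name, "_").ToLower();

    public static string OldName(string name) => Normalize(s_oldRegex, name);

    public static string NewName(string name) => Normalize(s_newRegex, name);
}
```

With this sketch, `SnakeCaseSketch.OldName("Utf8StringLiteralToken")` yields `utf_8_string_literal_token`, while `SnakeCaseSketch.NewName("Utf8StringLiteralToken")` yields `utf8_string_literal_token`.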


var seen = new HashSet<string>();

// Define a few major sections to help keep the grammar file naturally grouped.
var majorRules = ImmutableArray.Create(
- "CompilationUnitSyntax", "MemberDeclarationSyntax", "TypeSyntax", "StatementSyntax", "ExpressionSyntax", "XmlNodeSyntax", "StructuredTriviaSyntax");
+ "CompilationUnitSyntax", "MemberDeclarationSyntax", "TypeSyntax", "StatementSyntax", "ExpressionSyntax", "XmlNodeSyntax", "StructuredTriviaSyntax", "Utf8Suffix");
Member Author

Ensures we write out all the utf8 rules before printing out the utf8 suffix.

rules.Add("Utf8StringLiteralToken", [Join(" ", [RuleReference("StringLiteralToken"), RuleReference("Utf8Suffix")])]);
rules.Add("Utf8MultiLineRawStringLiteralToken", [Join(" ", [RuleReference("MultiLineRawStringLiteralToken"), RuleReference("Utf8Suffix")])]);
rules.Add("Utf8SingleLineRawStringLiteralToken", [Join(" ", [RuleReference("SingleLineRawStringLiteralToken"), RuleReference("Utf8Suffix")])]);
rules.Add("Utf8Suffix", [new("'u8'"), new("'U8'")]);
Member Author

Adds a few pseudo rules so that the generated grammar has fewer 'see lexical specification' productions.

@CyrusNajmabadi
Member Author

@333fred ptal :-)

| 'false'
| 'null'
| 'true'
| '__arglist'
Member Author

Caused by the move to case-insensitive sorting.
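A small sketch of why `'__arglist'` moves to the end of the list. It assumes the generator switched from an ordinal comparer to something like `StringComparer.OrdinalIgnoreCase`; the exact comparer is not shown in this excerpt, and `SortingSketch` is a hypothetical name. Under ordinal comparison `_` (0x5F) sorts before lowercase letters; under `OrdinalIgnoreCase` letters are compared as uppercase (0x41 to 0x5A), so `_` sorts after them.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the comparer choice is an assumption for illustration.
public static class SortingSketch
{
    // Returns the items ordered with the given comparer.
    public static string[] Sort(string[] items, StringComparer comparer)
        => items.OrderBy(item => item, comparer).ToArray();
}
```

Sorting `'false'`, `'null'`, `'true'`, `'__arglist'` with `StringComparer.Ordinal` puts `'__arglist'` first; with `StringComparer.OrdinalIgnoreCase` it lands last, matching the order in the hunk above.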

| utf_8_string_literal_token
| utf8_multi_line_raw_string_literal_token
| utf8_single_line_raw_string_literal_token
| utf8_string_literal_token
Member Author

tweaked so that numbers don't start a new 'word' in snake_casing.

base_argument_list
: argument_list
| bracketed_argument_list
syntax_token
Member Author

fleshed out this construct.

IEnumerable<Production> repeat(Production production, int count)
=> Enumerable.Repeat(production, count);

IEnumerable<Production> anyCasing(string value)
Member Author

A helper to produce all casing variations of a piece of text, e.g. anyCasing("ul") is what generates UL/Ul/uL/ul. Useful for not having to spell out every casing variant by hand in several token rules.
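The body of `anyCasing` is not shown in this excerpt, so the following is only a sketch of one way such a helper could work. It returns plain strings instead of the PR's `Production` type, and the name `CasingSketch.AnyCasing` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the PR's anyCasing helper, using strings for simplicity.
public static class CasingSketch
{
    // Builds every upper/lower combination of the input, one character at a time.
    public static IEnumerable<string> AnyCasing(string value)
    {
        IEnumerable<string> results = new[] { "" };
        foreach (var ch in value)
        {
            var lower = char.ToLowerInvariant(ch);
            var upper = char.ToUpperInvariant(ch);

            // Characters without distinct casings (digits, punctuation) contribute one variant.
            var variants = lower == upper ? new[] { lower } : new[] { lower, upper };

            // Extend every prefix built so far with each variant of this character.
            results = results.SelectMany(prefix => variants.Select(v => prefix + v)).ToArray();
        }

        return results;
    }
}
```

For example, `CasingSketch.AnyCasing("ul")` produces `ul`, `uL`, `Ul`, `UL`.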

rules.Add("SingleCharacter", [new("""/* ~['\\\u000D\u000A\u0085\u2028\u2029] anything but ', \\, and new_line_character */""")]);
}

IEnumerable<Production> productionRange(char start, char end)
Member Author

so you can say productionRange('a', 'f') instead of having to spell it out by hand.
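As with `anyCasing`, the body of `productionRange` is not visible here. A minimal sketch of the idea, again returning quoted strings rather than `Production` values (the `RangeSketch` name and the quoting format are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the PR's productionRange helper.
public static class RangeSketch
{
    // Yields one quoted terminal per character in the inclusive range [start, end].
    public static IEnumerable<string> ProductionRange(char start, char end)
    {
        // Iterate with an int so the loop cannot wrap around at char.MaxValue.
        for (int c = start; c <= end; c++)
            yield return $"'{(char)c}'";
    }
}
```

`RangeSketch.ProductionRange('a', 'f')` then yields `'a'` through `'f'`, which is the kind of list a hex-digit rule would otherwise have to spell out.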

@CyrusNajmabadi
Member Author

Gentle poke @333fred. This is just a QoL update for me :)

@jaredpar jaredpar added this to the 17.11 milestone May 9, 2024
@CyrusNajmabadi CyrusNajmabadi merged commit 7b8f135 into dotnet:main May 9, 2024
28 checks passed
@CyrusNajmabadi CyrusNajmabadi deleted the grammarUpdates branch May 9, 2024 22:54
@dotnet-policy-service dotnet-policy-service bot modified the milestones: 17.11, Next May 9, 2024
@Cosifne Cosifne modified the milestones: Next, 17.11 P2 May 28, 2024