Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peggy 3.0.0 reserves lots of keywords which aren't actually reserved in JavaScript #359

Closed
Tracked by #363
nene opened this issue Feb 28, 2023 · 26 comments · Fixed by #360 or #361
Closed
Tracked by #363

Peggy 3.0.0 reserves lots of keywords which aren't actually reserved in JavaScript #359

nene opened this issue Feb 28, 2023 · 26 comments · Fixed by #360 or #361
Labels
blocks-release Blocks an imminent release. High Priorty.

Comments

@nene
Copy link
Contributor

nene commented Feb 28, 2023

Peggy 3.0.0 added a bunch of new reserved keywords which can no more be used as label names:

abstract
boolean
byte
char
double
final
float
goto
int
long
native
short
synchronized
throws
transient
volatile

These are keywords that were reserved in ECMAScript 1..3, but are no more reserved in modern ECMAScript. Does Peggy plan to target these old ECMAScript versions when generating code? It doesn't seem so, because at the same list of reserved keywords it also contains keywords that are only reserved in ECMAScript strict mode - but strict mode was introduced in ECMAScript 5. This makes no sense to me. Peggy should either target modern ECMAScript or the old one... but it can't really do both. And even when Peggy wants to target old ECMAScript, there's no actual problem that the use of these names would cause.

On more practical side. When I read Peggy documentation, I get the impression that I can use as label names all the names that I could normally use as JavaScript variable names. That's easy to remember, as I already know JavaScript very well. But turns out that Peggy also blocks me from using some names that I wouldn't have guessed as being reserved keywords in JavaScript. This behavior breaks my intuitive assumptions.

Additionally this new reserved keywords list contains names that aren't reserved keywords at all:

  • arguments
  • as
  • async
  • eval
  • from
  • get
  • of
  • set

For example: one might think that of is reserved because it's used in for (.. of ...) statement, but in fact one can freely have a variable named of. The following is perfectly valid code:

for (const of of [1,2,3]) {
  console.log(of);
}

The only names out of this list that really shouldn't be allowed are:

  • arguments - Referring to arguments inside a function won't work for variable arguments declared outside of a function.
  • eval - declaring variable named eval is prohibited in strict mode.

But I might be mistaken. Perhaps all these extra keywords have been added to solve some actual problem. I would be interested in learning about that problem.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

You are correct. I added all of those last-minute as I was wrapping up the release, and we didn't have any conversation about it first. If you have time, I would welcome a pull request that modified the list here by commenting out the words that should still work, leaving notes as to why they should not be added back in (so that I don't make the same mistake in the future), and adding tests roughly here to ensure that the code we generate continues to work with those non-reserved words as rule names, rule references, and labels.

@hildjj hildjj added the blocks-release Blocks an imminent release. High Priorty. label Feb 28, 2023
@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

Note: I'll do a 3.0.1 release with just this and whatever else is ready when it's done. Please do not forget to add yourself to the AUTHORS file.

If you don't have time or inclination to send a pull request, I'll get to it eventually.

@reverofevil
Copy link

reverofevil commented Feb 28, 2023

Let me add 2 cents here: peggy doesn't even need any ES keywords. With plugins it's supposed to generate code in other languages. For example, in PHP, that has its own different set of keywords. I think correct approach would be to allow any identifiers that do not conflict with peggy's own grammar, and rename keywords at codegen phase wherever it's actually required.

@Mingun
Copy link
Member

Mingun commented Feb 28, 2023

For that codegen plugins can (and should) change the list of reserved words. By default it's contain JavaScript words.

We do not rename them, because peggy does not analyses the content of code blocks where that identifiers are used. So with renaming we will force users to remember that some identifiers should be written differently in the code block. Ok, now when we have the ability to emit warnings and info messages it would not be such bad UX as before, but from the other hand you always can write a plugin that will do that.

@reverofevil
Copy link

reverofevil commented Feb 28, 2023

If there is a single grammar without any code inserts, it can be compiled into several languages with different peggy plugins. Why would a user be required to rewrite the grammar to suit another target language? Which goal justifies this increase in compiler complexity?

This makes no more sense than to ban certain arithmetic operations in C based on a choice of target platform. Luckily, compilers don't usually do that: the whole idea of compiled language is to give a layer of abstraction around target platform quirks.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

Think of the default list of reserved words coming from the generate-js pass. If you replace that pass, you're likely to need to at least modify the list of reserved words to match whatever you're generating to.

Now, if you're arguing that the generated code should never use the tokens that the grammar writer chose in the generated output, that's interesting. It might make debugging a little more difficult depending on how you mangled those tokens into usable identifiers in the output.

@reverofevil
Copy link

I don't think there is even any need for that. Rule a is compiled into peg$parsea. Then it's exported as an object with {a: ...} field, and any keyword can be used as a field name. If there is actually any place in JS codegen that fails due to keywords as rule names, I'd rather consider it a codegen bug.

@reverofevil
Copy link

reverofevil commented Feb 28, 2023

For keywords as names of partial matches the exception should rather be thrown from generate-js, because it's the only logical place where it can be done. There is no way to rename a variable to salvage generation: rule = let:foo { return let; } will never become valid anyway.

@Mingun
Copy link
Member

Mingun commented Feb 28, 2023

I don't think there is even any need for that.

Rename is required for labels which is used as names of formal parameters in functions (which used to represent actions and semantic predicates). Of course, we can auto-apply some consistent rename rules, so the user would use the other name in action than the label name, but I think this is not obvious, even if the warning message would be printed during such replace. It would be better just rename label in that case:

rule = let:foo {
  // We could automatically rename all invalid identifiers
  // but it would be much simpler and clearly,
  // if the user just name label `let_`
  return let_;
}

The only way, where check for invalid identifiers could be annoying -- when you writes grammar with placeholder action blocks (that do nothing, just {}) and replace that empty functions with the real code on some ahead unknown language. In that case the plugin that does that also can disable the reserved identifiers check.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

We could loosen the reserved word restriction to not make it apply to rule names or rule references, but only to labels. We could also make it more explicit that this list is owned by the generator.

It is a requirement that the user get a good error message at generation time for labels that are reserved words in the generator's target language. The current approach is only one way for that to happen, and I'm open to other ways.

@reverofevil
Copy link

The example shows peggy with semantic actions in JS. It implies the only valid target is JS. So in case JS keyword is used as a label, it means we actually can just consider it a bug. There won't be another target, where let label is fine, because there is only one possible target. There is no need to rename anything here.

Parsers that ask code generators for extra data is an architectural abomination, some kind of colonoscopic dentistry that should never be put in practice. The only place in the code that owns code generation is code generator, and if the chosen code generator is unhappy with an identifier, it should throw an error on its own.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

There's no need to use that sort of rhetoric when we're engaging with your issue constructively. Neither of us wrote this code in the first place.

@reverofevil
Copy link

It didn't look that way to me. I'm sorry.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

Anyway, if you want to put together a patch, I'll take a look at it.

@Mingun
Copy link
Member

Mingun commented Feb 28, 2023

Parsers that ask code generators for extra data is an architectural abomination, some kind of colonoscopic dentistry that should never be put in practice.

Unfortunately, we still not able to report errors in code blocks together with the line numbers. And even if we do that, that will not help us to point user the correct place where a mistake was done. So the black-list is a forced measure that allows us to clearly say what and where the error is.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

I'm going to leave this open until I change the docs. I think we may need a few other issues as a result of this discussion however:

  • Ability to generate good errors from the generation pass. I think this may be part of a larger discussion about how we do generation that I'd like to have.
  • Moving reserved word detection to to the generation pass. This will depend upon the first issue. I'm also worried about existing plugins needing to change how they do reserved word detection, so at least making a good error message when the plugin hasn't been updated yet will be a part of the solution to this issue.

@hildjj
Copy link
Contributor

hildjj commented Feb 28, 2023

Everyone on this thread, please review the changes in #361 to make sure I've adequately captured the conversation.

@reverofevil
Copy link

I didn't realize AST doesn't store any information about location. There's only 4 references to IdentifierName in the grammar, and it would probably be enough to capture location data in every of those 4 AST nodes in a backwards compatible way. Something like

Location = "" {
	const {start, source} = location();
	return console.log({source, ...start});
}

Rule = ... from:Location name:IdentifierName to:Location ...

should capture all the required information, even if somewhat inconsistently.

@hildjj
Copy link
Contributor

hildjj commented Mar 1, 2023

The other issue is that we're currently generating from bytecode, not directly from the AST.

@Mingun
Copy link
Member

Mingun commented Mar 1, 2023

Something like

Location = "" {
	const {start, source} = location();
	return console.log({source, ...start});
}

Rule = ... from:Location name:IdentifierName to:Location ...

should capture all the required information, even if somewhat inconsistently.

That is not required, because IdentifierName is itself the identifier + its location.

peggy/src/parser.pegjs

Lines 319 to 322 in 6344542

IdentifierName "identifier"
= head:IdentifierStart tail:IdentifierPart* {
return [head + tail.join(""), location()];
}

Yes, the problem with moving check to the code generation pass is the absence of that information in the AST. At least that was true at time of writing that check. But since #240 we have locations of all rules in the AST, so it would be easy to also capture labels' locations and move that check.

Some hints for implementing such a check: add an utility method to the asts:

const asts = {
  renameInvalidLabels(ast: peggy.ast.Grammar, reservedWords: string[], session: peggy.Session) { ... }
}

This function may follow the following algorithm for each label:

  1. if label not in the reservedWords, then return
  2. add into reservedWords all label names in the current scope
  3. let i = 1
  4. let newName = label + i
  5. if newName in the reservedWords increase i and go to step 4
  6. add newName to the reservedWords
  7. replace label with newName and emit an info message with information about rename

@hildjj
Copy link
Contributor

hildjj commented Mar 1, 2023

I don't want to rename labels. If I'm understanding the approach correctly, there's no way for the person writing the grammar to predict what the label name will turn out to be in their action or predicate code.

@nene
Copy link
Contributor Author

nene commented Mar 1, 2023

As I understand this whole issue:

Currently Peggy doesn't run any static analysis of the code blocks. To perform any sort of renaming of labels, one would need to also analyze the code blocks and transform the code inside them. Doing such a thing would be a major undertaking. It would also need to be implemented for each language supported in the code blocks, which would become a major hurdle for anyone wanting to implement another target language. A much simpler solution is to just restrict the names of labels.

@reverofevil
Copy link

I think we all agree that renaming labels is a bad idea.

@hildjj
Copy link
Contributor

hildjj commented Mar 1, 2023

To be clear: I think it's an interesting idea, I just don't want to do it. :)

@reverofevil
Copy link

To be clear: I don't think it's interesting, because it's neither required, nor leads to any kind of desired behavior.

@hildjj hildjj mentioned this issue Mar 1, 2023
22 tasks
hildjj added a commit to hildjj/peggy that referenced this issue Mar 5, 2023
* main: (21 commits)
  Update CHANGELOG.md
  Update version number & rebuild
  Update dependencies
  Update test/unit/compiler/passes/report-infinite-repetition.spec.js
  Fixes peggyjs#357.  Do not allow infinite recursion in repetition delimiter.
  Update changelog
  Allow extra semicolons between rules.
  Fix an error in the code generator for "repeated" node
  Update changelog
  Fixes peggyjs#329
  Update changelog
  Fixes peggyjs#359.  Clarifies documentation about reserved words.
  Fix more HTML indentation.
  Test that the generated parser also works without errors
  Remove use of expect.to.not.throw()
  Add Rene Saarsoo to AUTHORS
  Typo in test description
  Add test to ensure special non-reserved keywords are allowed
  Comment out unnecessary reserved words
  Fixes peggyjs#347.  Makes $ invalid as an identifier start character.
  ...
@hildjj
Copy link
Contributor

hildjj commented Mar 8, 2023

The other issue is that we're currently generating from bytecode, not directly from the AST.

See #378

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocks-release Blocks an imminent release. High Priorty.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants