Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(cpp) Support __ keywords #2173

Open
joshgoebel opened this issue Oct 9, 2019 · 24 comments
Open

(cpp) Support __ keywords #2173

joshgoebel opened this issue Oct 9, 2019 · 24 comments
Labels
enhancement An enhancement or new feature good first issue Should be easier for first time contributors help welcome Could use help from community language

Comments

@joshgoebel
Copy link
Member

Stems from PR:
#2020

See the related discussion there.

I'd prefer a separate mode, something like

So someone has to repeat that within every scope in the syntax file, yes? So if KEYWORDS is used 10 times, then they'd need to add that rule to all 50 place also, correct?

It'd be really nice to find a solution to this without a separate mode. This needs a little more though IMHO to see if that's possible before just breaking down and doing that.

@joshgoebel joshgoebel added language enhancement An enhancement or new feature labels Oct 9, 2019
@joshgoebel joshgoebel self-assigned this Oct 9, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 9, 2019

// _Complex, __always_inline, etc.
var UNDERSCORE_KEYWORDS_RE = ["_[A-Z][a-z\\d_]+", "__[a-z\\d_]+"]

var CPP_KEYWORDS = {
   keyword: 'int float while private char catch import module export virtual operator sizeof ' +
   ...
}

hljs.addKeywords(CPP_KEYWORDS["keyword"], UNDERSCORE_KEYWORDS_RE)
// or 
hljs.addKeyword(CPP_KEYWORDS["keyword"], UNDESCORE_KEYWORD_RE, relevance: 1)
hljs.addKeyword(CPP_KEYWORDS["keyword"], DOUBLE_UNDESCORE_KEYWORD_RE, relevance: 2)

I think I'd prefer something like the above (a mini API to let someone add keywords to the actual keywords). The would solve the problem of "not to depend on how keyword flattener works"... since we'd be free to change the keyword string format in the future (but how likely is that really?) just so long as we still supported addKeyword... right now all addKeyword would really do is just add the regex and a space... and it could blow up if (or do nothing) if someone included a space in their regex...

The problem with saying "just use another matcher" is (other than having to change the file perfectly in like 10 places, and then worry about TESTING that those 10 changes all work properly) that matchers actually have very different behavior from keywords... keywords are always the last to match, match the "gaps", etc...

I see lots of advantages to just piggybacking off the existing keyword system.

Thoughts?

@egor-rogov
Copy link
Collaborator

Something like that would be OK, but I think it's a bit complicated than it looks.

all addKeyword would really do is just add the regex and a space

So what about spaces in regex? And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE?
For me it seems that such API should somehow add keywords directly into compiled_keywords structure (bypassing flattener).

@joshgoebel
Copy link
Member Author

So what about spaces in regex?

I addressed that:

and it could blow up if (or do nothing) if someone included a space in their regex...

It could even blow up, which would simply fail to pass tests... etc...

And how about set some relevance to every keyword introduced by UNDERSCORE_KEYWORDS_RE?

Pretty easy to do that with a for loop? These special cases should be one offs, this shouldn't be something needed everyday or very often. Though I'd be fine with the API taking an array as well...

hljs.addKeyword(CPP_KEYWORDS["keyword"], [A_RE, B_RE, C_RE], relevance: 2)

For me it seems that such API should somehow add keywords directly into compiled_keywords structure (bypassing flattener).

Except the compiling happen much much later in the process... this is just the raw data file...

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 11, 2019

What if we just allow it to take a string or array?

var UNDERSCORE_KEYWORDS_RE = [/_[A-Z][a-z\d_]+/, /__[a-z\d_]+/]

var CPP_KEYWORDS = {
   keyword: ['int float while private char catch import module export virtual operator sizeof ' +
      '', ...UNDERSCORE_KEYWORDS_RE]
   ...
}

That's newer syntax which I'd have to walk back, but you get the idea... so:

  • A string is parsed with the flattener logic
  • An array is parsed one item at a time
    • String are treated with flattener
    • Regexes are taken "As they are"

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 11, 2019

I'm worried we may be overthinking this. Let me think and dig into this a bit more. :) Think there may be a few things about how we do keywords I need to bone up on anyways.

@joshgoebel joshgoebel added good first issue Should be easier for first time contributors help welcome Could use help from community labels Oct 11, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 11, 2019

So seems keywords don't really seem to support regex anyways, so that one grammar is mistaken unless I'm reading something wrong. So I think right now the only way to do this is another constant and then repeat it over and over inside contains in each level, just as we repeat keywords. (as @egor-rogov originally suggested) :-)

Tagging this beginner friendly as this should just be a matter of getting it done and writing some tests.

@joshgoebel joshgoebel changed the title Add all __ keywords to cpp in one fell swoop (cpp) Support __ keywords Oct 13, 2019
@joshgoebel joshgoebel removed their assignment Oct 26, 2019
@joshgoebel joshgoebel self-assigned this Feb 6, 2020
@joshgoebel joshgoebel removed their assignment Mar 3, 2020
@joshgoebel
Copy link
Member Author

Now that we have some precedent what about:

keywords: {
  $pattern: ...
  $modes: [
     { begin: /__[a-z\d_]+/, className: "keyword" }
  ],
  builtin: ...,
  literal: ...,
var UNDERSCORE_KEYWORDS_RE = 

As a way of attaching/grouping modes that exist for no other purpose than to match keywords, just when the keyword is a dynamic match (regex) instead of a static list? $modes would simply be copied to the end of whatever mode's contains list...

Just popped into my head and I thought I'd jot it down.

@joshgoebel
Copy link
Member Author

joshgoebel commented May 8, 2020

Or perhaps something like dynamic_pattern would be better, no word list, if the regex matches, it's a keyword, period...

keywords: {
  $dynamic_patterns: {
     [/_{1,2}[a-z\d_]+/, "keyword"]
  ]

Or a new namespace/suffix:

keywords: {
  keyword$all: /_{1,2}[a-z\d_]+/, // className = "keyword"
  builtin$all: /__(.*)__/, // className = "builtin"
  keyword: "if when else done",

@egor-rogov
Copy link
Collaborator

+1 for the last option. Probably even

keywords: {
  $keyword: /_{1,2}[a-z\d_]+/,
  keyword: "if when else done",

@joshgoebel
Copy link
Member Author

joshgoebel commented May 8, 2020

Well, actually the intent when we added $pattern was to reserve the whole $ namespace for keyword configuration, not random values. So in the future you could imagine:

keywords: {
  $pattern: ..., 
  $pcre_regex_lib: true,
  $ignoreCase: true,

So $keyword MAYBE could work but not if you want to be able to name the resulting matches with className... I'm actually coming around to the idea of a _re suffix. What do you think:

keywords: {
  keyword_re: /_{1,2}[a-z\d_]+/, // className = "keyword"
  builtin_re: /__(.*)__/, // className = "builtin"
  keyword: "if when else done",

_re seems unlikely to hit any real name you might want to use, and even if it did these are just internal symbols, you could rename your name slightly to say keyword_regex...

My problem is that I don't think _re screams "wildcard" or 'match all'... we could use * but then we'd have to stringily the keys:

keywords: {
  '*keyword': /_{1,2}[a-z\d_]+/, // className = "keyword"
  '*builtin': /__(.*)__/, // className = "builtin"
  keyword: "if when else done",

@joshgoebel
Copy link
Member Author

joshgoebel commented May 8, 2020

Maybe maybe just seeing the regex next to it makes _re clear enough in its intent? Actually the one worry I have here is that people could imagine they are configuring a pattern per type of keyword - rather than a match all... we could allow that but that isn't what this is. So I kind of like the * a little better I think.

I don't think we're there yet, but I like this avenue of exploration.

@allejo Any thoughts?

@egor-rogov
Copy link
Collaborator

Maybe just seeing the regex next to it makes _re clear enough in its intent?

Yes, it's clear enough for me. I'm fine with keyword_re etc. I don't think * makes the intent clearer.

@joshgoebel
Copy link
Member Author

joshgoebel commented May 9, 2020

I'm going to go back and review some of the grammars that make HEAVY use of modes for keywords and see if there is anything else we should keep in mind while working on this. My one thought so far is we don't have a great way to specify relevance. We could fall back on some helper functions:

import { kw } from "keyword_helpers";

keywords: {
  keyword_re: kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })

Although then I become temped to just make the input processing smarter:

import { kw } from "keyword_helpers";

keywords: {
  // __boo_hoo will be default to 1 relevance, bob 10, suzy 20, other _ identifiers 2
  keyword: [
    "bob|10 suzy|20", 
    /__boo.*__/, 
    kw.matchAll(/_{1,2}[a-z\d_]+/, { relevance: 2 })]

@klmr
Copy link
Contributor

klmr commented Sep 30, 2020

For what it’s worth, double-underscore names are “reserved identifiers” in C++ but they are not keywords, and don’t generally act as such (though some implementations use such identifiers as non-standard language extensions).

When reading C++ code I certainly wouldn’t expect them to be highlighted in the same way as keywords, nor even necessarily builtins (though this is certainly debatable). For one thing, the reason they are reserved is so that standard library implementors can use these names and, consequently, such implementations are peppered with them. When discussing standard library implementation code it would be extremely counter-productive to have them all displayed as keywords. For all intents and purposes they’re regular identifiers, they’re just illegal to use for “normal” users.

Special-casing them is also arbitrary, because this is only part of the actual rule regarding reserved names, which was reproduced incorrectly in the related PR. The actual rule is:

A name is reserved if

  1. it contains a double underscore anywhere
  2. it starts with a single underscore, followed by a capital letter
  3. it starts with a single underscore, in the global namespace

Rule 3 can’t be implemented with lexical analysis alone. My recommendation is to leave all these names alone.

@joshgoebel
Copy link
Member Author

joshgoebel commented Sep 30, 2020

Interesting.

void static delayShort(uint16_t ms) __attribute__ ((noinline));

What is __attribute__ here, something else entirely? My editor (VS CODE) highlights such things and it seems helpful. (though it seems github does not)

Scope: meta.preprocessor.macro.cpp

@joshgoebel
Copy link
Member Author

Rule 3 can’t be implemented with lexical analysis alone.

I'm totally open to just closing this, but also our goal has never been perfection. If we decided this was worthy of highlighting as "meta" something it wouldn't bother me that we got a few edge cases wrong... that's the cost of being a highlighter that doesn't fully understand language context.

@klmr
Copy link
Contributor

klmr commented Oct 1, 2020

What is __attribute__ here, something else entirely?

As far as standard C++ is concerned, nothing.

Leaving the ivory tower for a bit, it’s a widely used compiler extension. I do think highlighting this is useful (and VS Code’s classification seems OK, if not technically correct: it’s not a macro).

The issue with these constructs is that they differ between compilers (GCC and clang know __attribute__ but not __stdcall and MSVC is the other way round, for example). Some of these identifiers are common enough that highlighting them makes sense, but I’d list them explicitly rather than pattern-match against __.*, since only a handful are common. (With the caveat that GCC and clang define a large list of __builtin_.* identifiers, so that should probably be a wildcard).

Unfortunately I’m not aware of a comprehensive list of such identifiers. The closest may be this for MSVC and this for GCC (the clang list probably has lots of overlap here), and it’s not exactly helpful. Maybe we can crib them from other highlighters.

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 1, 2020

The original filed issue was trying to add:

__stdcall __try __finally

since only a handful are common.

Are those three above common? Would you mind taking a stab at what you might consider the "common" list? If it's a short list I'd be happy to just add those manually and mark this done for now. :-)

Do the common ones make sense to rise to "keyword" status, or would you still consider them built-ins?

The issue with these constructs is that they differ between compilers

I'm not sure how that's super relevant? I understand your original objection if they are not technically keywords... but say we decided to highlight them otherwise (say with built_in or some other tag) what would be the harm in just hitting them ALL with a wildcard? Then whatever C++ flavor someone is highlighting the __whatever_thingy would always be highlighted - without us needing to maintain an explicit list. Seems like a win?

@klmr
Copy link
Contributor

klmr commented Oct 1, 2020

I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend. As for the wildcard:

what would be the harm in just hitting them ALL with a wildcard?

The problem with that is that it would highlight standard library implementation code very confusingly. For example, consider (part of) the libc++ implementation of std::copy:

__copy(_Tp* __first, _Tp* __last, _Up* __result)
{
    const size_t __n = static_cast<size_t>(__last - __first);
    if (__n > 0)
        _VSTD::memmove(__result, __first, __n * sizeof(_Up));
    return __result + __n;
}

Highlighting all these regular identifiers as builtins or keywords is just confusing and counter-productive (and no other highlighter that I tried does it, including VS Code and Vim). Admittedly this only affects very specific projects since these identifiers are generally illegal, but discussions of standard library code is a non-negligible part of websites that could potentially use highlight.js, e.g. developer blogs and Stack Overflow.

By contrast, as a C++ developer I don’t really have any expectation to see non-standard extensions treated specially, so if a __thingy fails to be highlighted as a built-in or keyword I probably wouldn’t even notice. In other words: false negatives seem less problematic than false positives to me.

This is supported by the lack of highlighting of these names on GitHub. Even VS Code only highlights __attribute__ under very specific circumstances. In the following, only the first usage is highlighted.

int f()
__attribute__ ((noinline));

int f()
__attribute__((noinline));

@joshgoebel
Copy link
Member Author

Makes sense, guess I've only ever used a few of them (attribute, etc) and they always seemed pretty important to me... :-) and didn't realize the std library using __ so much for variable naming...

I probably won’t have time before the weekend.

No rush. :-) Looking forward to seeing what ya got.

@joshgoebel
Copy link
Member Author

Happy New Year. Still wanting to work on this one?

@klmr
Copy link
Contributor

klmr commented Jan 7, 2021

Definitely. Apologies, I’ve had some other personal stuff interfere. I hope I’ll get some time to work on this soon.

@joshgoebel
Copy link
Member Author

There is also #2954 which means C and C++ can be dealt with as separate things.

@joshgoebel
Copy link
Member Author

joshgoebel commented Apr 20, 2021

I’m happy to collect these lists and submit a PR but I probably won’t have time before the weekend.

Which weekend? ;-) I know, I know, life gets in the way... any chance perhaps you could at least handle the "collect a list" part and post it here or is that actually the hard part in your mind? :-) So far of everyone talking here you seem the most qualified to furnish a pseudo-official list of what "might be a good idea"... once we had a list someone else could come along and do the actual implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement or new feature good first issue Should be easier for first time contributors help welcome Could use help from community language
Projects
None yet
Development

No branches or pull requests

4 participants
@joshgoebel @klmr @egor-rogov and others