feat!: Make any consume a full code point, not a single code unit #424

Merged
merged 4 commits into from
Mar 3, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

### Breaking changes:

- [#424]: `any` now consumes an entire code point (i.e., a full Unicode character), not just a single, 16-bit code unit.
- [55c787b]: The namespace helpers (`namespace`, `extendNamespace`) have been removed. (These were always optional.)
- [bea0be9]: When used as an ES module, the main 'ohm-js' module now has _only_ named exports (i.e., no default export). The same is true for `ohm-js/extras`.
- [#395]: In generated type definitions, action dictionary types now inherit from `BaseActionDict<T>`, a new supertype of `ActionDict<T>`.
18 changes: 18 additions & 0 deletions doc/releases/ohm-js-17.0.md
@@ -2,6 +2,24 @@

## Upgrading

### `any` now consumes a full code point

In JavaScript, a string is a sequence of 16-bit code units. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string '😆' has length 2, but contains a single Unicode code point. Previously, `any` matched a single 16-bit code unit — even if that unit was part of a surrogate pair. In v17, `any` now matches a full Unicode character.

Old behaviour:

```js
const g = ohm.grammar('OneChar { start = any }');
g.match('😆').succeeded(); // false
```

New behaviour (Ohm v17+):

```js
const g = ohm.grammar('OneChar { start = any }');
g.match('😆').succeeded(); // true
```
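
The difference between code units and code points can also be checked in plain JavaScript, without Ohm. The following is a minimal sketch that uses only standard string APIs; the variable names are illustrative and are not part of Ohm.

```js
const s = '😆';

// `length` counts 16-bit code units, so a surrogate pair counts as 2.
const codeUnits = s.length;

// String iteration (used by spread) walks code points, not code units.
const codePoints = [...s].length;

// charCodeAt returns a single code unit: here, the high surrogate of the pair.
const firstUnit = s.charCodeAt(0);

// codePointAt combines the surrogate pair into the full code point.
const firstCodePoint = s.codePointAt(0);

console.log({codeUnits, codePoints, firstUnit, firstCodePoint});
```

Before v17, `any` stepped through input the way `charCodeAt` does (one code unit at a time); from v17 on, it steps the way string iteration does (one code point at a time).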

### Namespace helpers removed

The top-level `namespace` and `extendNamespace` functions have been removed. They were never required — it was always possible to use a plain old object in any API that asked for a namespace.
4 changes: 3 additions & 1 deletion doc/syntax-reference.md
@@ -146,7 +146,9 @@ as well as multiline (`/* */`) comments like:

(See [src/built-in-rules.ohm](https://github.com/harc/ohm/blob/main/packages/ohm-js/src/built-in-rules.ohm).)

-`any`: Matches the next character in the input stream, if one exists.
+`any`: Matches the next Unicode character — i.e., a single code point — in the input stream, if one exists.
+
+**NOTE:** A JavaScript string is a sequence of 16-bit _code units_. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string `'😆'` has length 2, but contains a single Unicode code point. Prior to Ohm v17, `any` always consumed a single 16-bit code unit, rather than a full Unicode character.

`letter`: Matches a single character which is a letter (either uppercase or lowercase).

6 changes: 3 additions & 3 deletions packages/ohm-js/src/pexprs-eval.js
@@ -29,9 +29,9 @@ pexprs.PExpr.prototype.eval = common.abstract('eval'); // function(state) { ...
 pexprs.any.eval = function(state) {
   const {inputStream} = state;
   const origPos = inputStream.pos;
-  const ch = inputStream.next();
-  if (ch) {
-    state.pushBinding(new TerminalNode(ch.length), origPos);
+  const cp = inputStream.nextCodePoint();
+  if (cp !== undefined) {
+    state.pushBinding(new TerminalNode(String.fromCodePoint(cp).length), origPos);
     return true;
   } else {
     state.processFailure(origPos, this);
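
The hunk above swaps a code-unit step (`next`) for a code-point step (`nextCodePoint`). As a rough illustration of the difference, here is a minimal sketch of a stream with both methods. `SketchInputStream` and its internals are hypothetical and written only for this example; this is not Ohm's actual `InputStream` implementation.

```js
// Hypothetical stand-in for an input stream; not Ohm's actual InputStream.
class SketchInputStream {
  constructor(source) {
    this.source = source;
    this.pos = 0;
  }

  // Code-unit step: returns one 16-bit unit, or undefined at end of input.
  next() {
    return this.pos < this.source.length ? this.source[this.pos++] : undefined;
  }

  // Code-point step: returns one code point, or undefined at end of input.
  nextCodePoint() {
    const cp = this.source.codePointAt(this.pos);
    if (cp !== undefined) {
      // Advance by 2 code units for a surrogate pair, otherwise by 1.
      this.pos += String.fromCodePoint(cp).length;
    }
    return cp;
  }
}

const byUnit = new SketchInputStream('😆');
byUnit.next();
const posAfterUnit = byUnit.pos; // Only half of the surrogate pair is consumed.

const byCodePoint = new SketchInputStream('😆');
const cp = byCodePoint.nextCodePoint();
const posAfterCodePoint = byCodePoint.pos; // The whole pair is consumed.

console.log({posAfterUnit, posAfterCodePoint});
```

In the sketch, `String.fromCodePoint(cp).length` gives the width of the consumed character in code units (1 or 2), which is presumably why the diff passes the same expression to `TerminalNode`.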
22 changes: 22 additions & 0 deletions packages/ohm-js/test/test-ohm-syntax.js
@@ -256,6 +256,28 @@ test('ranges w/ code points > 0xFFFF, special cases', t => {
  assertSucceeds(t, g2.match('\u{D83D}x'));
});

test('any consumes an entire code point', t => {
  const g = ohm.grammar('G { start = any any }');
  const re = /../u; // The regex equivalent of `any any`.

  t.is('😇'.length, 2);
  t.is('😇!'.length, 3);
  t.is('😇😇'.length, 4);

  t.is(g.match('😇😇').succeeded(), true);
  t.truthy(re.exec('😇😇'));

  t.is(g.match('😇!').succeeded(), true);
  t.truthy(re.exec('😇!'));

  t.is(g.match('!😇').succeeded(), true);
  t.truthy(re.exec('!😇'));

  t.is('👋🏿'.length, 4); // Skin color modifier is a separate code point.
  t.is(g.match('👋🏿').succeeded(), true);
  t.truthy(re.exec('👋🏿'));
});
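
The test compares `any any` with the regex `/../u`. The following standalone sketch, which does not depend on Ohm, shows what the `u` flag changes for the same inputs. Unlike the unanchored regex in the test, the patterns here are anchored with `^` and `$` so that the whole input must match; the variable names are illustrative.

```js
// Each input is two code points but more than two code units.
const inputs = ['😇😇', '😇!', '!😇', '👋🏿'];

// With the `u` flag, `.` matches a whole code point.
const twoCodePoints = /^..$/u;
// Without it, `.` matches a single 16-bit code unit.
const twoCodeUnits = /^..$/;

const results = inputs.map(s => ({
  input: s,
  length: s.length,
  matchesCodePoints: twoCodePoints.test(s),
  matchesCodeUnits: twoCodeUnits.test(s),
}));

console.log(results);
```

Every input matches the `u`-flagged pattern and none matches the unflagged one, which mirrors the difference between the v17 `any` and the earlier code-unit behaviour.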

describe('alt', test => {
  const m = ohm.grammar('M { altTest = "a" | "b" }');
  const s = m.createSemantics().addAttribute('v', {