RangeError: premature EOF for Unicode character U+FEFF on start #342

kivancguckiran · 2022-06-30T20:20:57Z

Hello,

We've noticed that whenever the Unicode ZWNBSP character(U+FEFF) is received on start of the string of a message, it throws a silent error and omits the character in the decoded part. It seems that, this particular execution of TextDecoder.decode throws the mentioned error:

protobuf-ts/packages/runtime/src/binary-reader.ts

Line 245 in aaa63c7

return this.textDecoder.decode(this.bytes());

I've created a repository to reproduce:

https://github.com/kivancguckiran/premature-eof-protobuf-ts

Outputted the charcodes from the result of the create operation and after fromBinary operation. If the U+FEFF character is in the start, it is ommited from decoded part.

Since it is Zero-Width-No-Break-Space, github preview hides the mentioned unicode character.

This line is:
https://github.com/kivancguckiran/premature-eof-protobuf-ts/blob/main/index.ts#L4

Actually like this:

Thanks in advance.

The text was updated successfully, but these errors were encountered:

jcready · 2022-06-30T21:32:53Z

The text encoding/decoding is entirely up to your JS runtime.

You can run this in your browser or node and see the same result.

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true });
const output = decoder.decode(encoder.encode(input)); // "test\ufeff"

I don't know for certain if this is expected for UTF-8/protobuf or a bug.

jcready · 2022-07-01T00:00:46Z

Oh, duh. It's the BOM.

The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.

Using ignoreBOM: true in the TextDecoder constructor lets the string survives a round-trip:

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
const output = decoder.decode(encoder.encode(input)); // "\ufefftest\ufeff"

kivancguckiran · 2022-07-01T00:20:38Z

Thanks for the heads up! Do you think setting this option to true will merge into main? Since if I have analyzed the code correctly, TextEncoder instantiated on the fly within the library.

protobuf-ts/packages/runtime/src/binary-reader.ts

Line 50 in aaa63c7

this.textDecoder = textDecoder ?? new TextDecoder("utf-8", {

Or is there another solution you are pointing out which I'm not seeing? Like if I can set the TextDecoder used in library somehow.

kivancguckiran · 2022-07-01T08:32:18Z

Thought I can do this:

Test.fromBinary(data, {
  readerFactory: (bytes) => new BinaryReader(bytes, new TextDecoder('utf-8', {
    fatal: true,
    ignoreBOM: true,
  })),
});

But doing this everywhere fromBinary is called seems overkill.

jcready · 2022-07-01T12:08:46Z

I can't seem to find any information about whether or not the BOM should be ignored or not in protobuf strings. They're definitely ignored when protoc is reading a proto file to compile, but as far as the actual runtimes go I see no tests regarding the BOM in string field values.

I think the correct thing would be to update protobuf-ts to use the ignoreBOM setting, but I can't be certain without seeing an existing test or docs. For an immediate workaround what you have is basically what I would've recommended, but you should only need to create the TextDecoder instance once.

// shared-binary-read-options.ts
import { BinaryReader, BinaryReadOptions } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
export const binaryReadOptions: Partial<BinaryReadOptions> = {
    readerFactory: (bytes) => new BinaryReader(bytes, textDecoder)
};

// some-other-file.ts
import { binaryReadOptions } from './shared-binary-read-options';

// Later
Test.fromBinary(data, binaryReadOptions);

I wouldn't recommend the following approach, but you can monkey-patch the BinaryReader prototype so that you can avoid needing to import and pass the options everywhere. Just note that you will need to execute (import) this code once before calling fromBinary() to be effective.

// monkey-patch-protobuf-ts-binary-reader.ts
import { BinaryReader } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });

function monkeyPatchedString(): string {
    return textDecoder.decode(this.bytes());
}

// @ts-ignore
BinaryReader.prototype.string = monkeyPatchedString;

timostamm · 2022-07-07T08:28:06Z

Thanks to both of you for looking into this!

ignoreBOM is an option in Node.js. The docs say:

When true, the TextDecoder will include the byte order mark in the decoded result. When false, the byte order mark will be removed from the output. This option is only used when encoding is 'utf-8', 'utf-16be', or 'utf-16le'. Default: false.

Keeping the byte order mark in the decoded string seems reasonable to me, if only for reproducible encoding roundtrips. If this passes the conformance suite, it should be fine.

However, I'm curious about the error being thrown. @kivancguckiran, can you provide some details? If I run your example, I see that the BOM is stripped, but I don't see an error:

➜  ~ node -v
v16.13.2
➜  ~ npx esbuild --bundle index.ts | node
Result of Test.create()
65279, 116, 101, 115, 116, 65279, 

Result of Test.fromBinary()
116, 101, 115, 116, 65279,

kivancguckiran · 2022-07-07T09:16:32Z

Hello @timostamm!

You are absolutely right. This does not raise an error. I was trying to debug it inside node_modules at that time with protobuf-ts js modules, and my error was I was executing internally this.bytes() twice and that's when I was getting the error in binary-reader.js.

So no errors at all. But yes, the first BOM is omitted as you have noticed.

We are currently wrapping readerFactory like this to workaround the issue:

const decoder = new TextDecoder('utf-8', {
  fatal: true,
  ignoreBOM: true,
});

const deserialized = module?.fromBinary(e.data, {
  readerFactory: (bytes) => new BinaryReader(bytes, decoder),
});

Like suggested by @jcready. It would be nice to see this in main. If it does not cause any issues.

Thanks.

Closes #342

timostamm · 2022-08-17T13:05:26Z

We pass the conformance tests with ignoreBOM: true. Merged the addition in #362.

timostamm · 2022-08-17T13:45:47Z

Released in v2.8.0.

timostamm added a commit that referenced this issue Aug 17, 2022

Ignore byte-order mark

c8865fc

Closes #342

timostamm mentioned this issue Aug 17, 2022

Ignore byte-order mark #362

Merged

timostamm closed this as completed in #362 Aug 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RangeError: premature EOF for Unicode character U+FEFF on start #342

RangeError: premature EOF for Unicode character U+FEFF on start #342

kivancguckiran commented Jun 30, 2022 •

edited

Loading

jcready commented Jun 30, 2022

jcready commented Jul 1, 2022

kivancguckiran commented Jul 1, 2022 •

edited

Loading

kivancguckiran commented Jul 1, 2022

jcready commented Jul 1, 2022 •

edited

Loading

timostamm commented Jul 7, 2022

kivancguckiran commented Jul 7, 2022

timostamm commented Aug 17, 2022

timostamm commented Aug 17, 2022

RangeError: premature EOF for Unicode character U+FEFF on start #342

RangeError: premature EOF for Unicode character U+FEFF on start #342

Comments

kivancguckiran commented Jun 30, 2022 • edited Loading

jcready commented Jun 30, 2022

jcready commented Jul 1, 2022

kivancguckiran commented Jul 1, 2022 • edited Loading

kivancguckiran commented Jul 1, 2022

jcready commented Jul 1, 2022 • edited Loading

timostamm commented Jul 7, 2022

kivancguckiran commented Jul 7, 2022

timostamm commented Aug 17, 2022

timostamm commented Aug 17, 2022

kivancguckiran commented Jun 30, 2022 •

edited

Loading

kivancguckiran commented Jul 1, 2022 •

edited

Loading

jcready commented Jul 1, 2022 •

edited

Loading