Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RangeError: premature EOF for Unicode character U+FEFF on start #342

Closed
kivancguckiran opened this issue Jun 30, 2022 · 9 comments · Fixed by #362
Closed

RangeError: premature EOF for Unicode character U+FEFF on start #342

kivancguckiran opened this issue Jun 30, 2022 · 9 comments · Fixed by #362

Comments

@kivancguckiran
Copy link

kivancguckiran commented Jun 30, 2022

Hello,

We've noticed that whenever the Unicode ZWNBSP character(U+FEFF) is received on start of the string of a message, it throws a silent error and omits the character in the decoded part. It seems that, this particular execution of TextDecoder.decode throws the mentioned error:

return this.textDecoder.decode(this.bytes());

I've created a repository to reproduce:

https://github.com/kivancguckiran/premature-eof-protobuf-ts

Outputted the charcodes from the result of the create operation and after fromBinary operation. If the U+FEFF character is in the start, it is ommited from decoded part.

Since it is Zero-Width-No-Break-Space, github preview hides the mentioned unicode character.

This line is:
https://github.com/kivancguckiran/premature-eof-protobuf-ts/blob/main/index.ts#L4

Actually like this:
resim

Thanks in advance.

@jcready
Copy link
Contributor

jcready commented Jun 30, 2022

The text encoding/decoding is entirely up to your JS runtime.

You can run this in your browser or node and see the same result.

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true });
const output = decoder.decode(encoder.encode(input)); // "test\ufeff"

I don't know for certain if this is expected for UTF-8/protobuf or a bug.

@jcready
Copy link
Contributor

jcready commented Jul 1, 2022

Oh, duh. It's the BOM.

The name BYTE ORDER MARK is an alias for the original character name ZERO WIDTH NO-BREAK SPACE (ZWNBSP). With the introduction of U+2060 WORD JOINER, there's no longer a need to ever use U+FEFF for its ZWNSP effect, so from that point on, and with the availability of a formal alias, the name ZERO WIDTH NO-BREAK SPACE is no longer helpful, and we will use the alias here.

Using ignoreBOM: true in the TextDecoder constructor lets the string survives a round-trip:

const input = 'test'; // "\ufefftest\ufeff"
const encoder = new TextEncoder();
const decoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
const output = decoder.decode(encoder.encode(input)); // "\ufefftest\ufeff"

@kivancguckiran
Copy link
Author

kivancguckiran commented Jul 1, 2022

Thanks for the heads up! Do you think setting this option to true will merge into main? Since if I have analyzed the code correctly, TextEncoder instantiated on the fly within the library.

this.textDecoder = textDecoder ?? new TextDecoder("utf-8", {

Or is there another solution you are pointing out which I'm not seeing? Like if I can set the TextDecoder used in library somehow.

@kivancguckiran
Copy link
Author

Thought I can do this:

Test.fromBinary(data, {
  readerFactory: (bytes) => new BinaryReader(bytes, new TextDecoder('utf-8', {
    fatal: true,
    ignoreBOM: true,
  })),
});

But doing this everywhere fromBinary is called seems overkill.

@jcready
Copy link
Contributor

jcready commented Jul 1, 2022

I can't seem to find any information about whether or not the BOM should be ignored or not in protobuf strings. They're definitely ignored when protoc is reading a proto file to compile, but as far as the actual runtimes go I see no tests regarding the BOM in string field values.

I think the correct thing would be to update protobuf-ts to use the ignoreBOM setting, but I can't be certain without seeing an existing test or docs. For an immediate workaround what you have is basically what I would've recommended, but you should only need to create the TextDecoder instance once.

// shared-binary-read-options.ts
import { BinaryReader, BinaryReadOptions } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });
export const binaryReadOptions: Partial<BinaryReadOptions> = {
    readerFactory: (bytes) => new BinaryReader(bytes, textDecoder)
};
// some-other-file.ts
import { binaryReadOptions } from './shared-binary-read-options';

// Later
Test.fromBinary(data, binaryReadOptions);

I wouldn't recommend the following approach, but you can monkey-patch the BinaryReader prototype so that you can avoid needing to import and pass the options everywhere. Just note that you will need to execute (import) this code once before calling fromBinary() to be effective.

// monkey-patch-protobuf-ts-binary-reader.ts
import { BinaryReader } from '@protobuf-ts/runtime';

const textDecoder = new TextDecoder('utf-8', { fatal: true, ignoreBOM: true });

function monkeyPatchedString(): string {
    return textDecoder.decode(this.bytes());
}

// @ts-ignore
BinaryReader.prototype.string = monkeyPatchedString;

@timostamm
Copy link
Owner

Thanks to both of you for looking into this!

ignoreBOM is an option in Node.js. The docs say:

When true, the TextDecoder will include the byte order mark in the decoded result. When false, the byte order mark will be removed from the output. This option is only used when encoding is 'utf-8', 'utf-16be', or 'utf-16le'. Default: false.

Keeping the byte order mark in the decoded string seems reasonable to me, if only for reproducible encoding roundtrips. If this passes the conformance suite, it should be fine.

However, I'm curious about the error being thrown. @kivancguckiran, can you provide some details? If I run your example, I see that the BOM is stripped, but I don't see an error:

➜  ~ node -v
v16.13.2
➜  ~ npx esbuild --bundle index.ts | node
Result of Test.create()
65279, 116, 101, 115, 116, 65279, 

Result of Test.fromBinary()
116, 101, 115, 116, 65279, 

@kivancguckiran
Copy link
Author

Hello @timostamm!

You are absolutely right. This does not raise an error. I was trying to debug it inside node_modules at that time with protobuf-ts js modules, and my error was I was executing internally this.bytes() twice and that's when I was getting the error in binary-reader.js.

So no errors at all. But yes, the first BOM is omitted as you have noticed.

We are currently wrapping readerFactory like this to workaround the issue:

const decoder = new TextDecoder('utf-8', {
  fatal: true,
  ignoreBOM: true,
});
const deserialized = module?.fromBinary(e.data, {
  readerFactory: (bytes) => new BinaryReader(bytes, decoder),
});

Like suggested by @jcready. It would be nice to see this in main. If it does not cause any issues.

Thanks.

timostamm added a commit that referenced this issue Aug 17, 2022
@timostamm
Copy link
Owner

We pass the conformance tests with ignoreBOM: true. Merged the addition in #362.

@timostamm
Copy link
Owner

Released in v2.8.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants