Use TextEncoder to encode UTF-8 #3461

cag · 2020-04-08T22:59:47Z

Description

Use TextEncoder to encode UTF-8 sequences.

Related: #1610 (Edit: Thought this would fix until @cgewecke noted that getPastEvents uses a different UTF-8 codec)
Related: #3441

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist:

coveralls · 2020-04-08T23:11:21Z

Coverage decreased (-0.05%) to 88.549% when pulling e103736 on cag:use-textencoder into 2a5c5cb on ethereum:1.x.

cgewecke · 2020-04-09T17:44:30Z

packages/web3-utils/src/utils.js

+if (typeof TextEncoder !== 'undefined' && typeof TextDecoder !== 'undefined') {
+    utf8Encoder = new TextEncoder();
+    utf8Decoder = new TextDecoder();
+} else {


Just a review note: have added utf8 tests to the headless browser suite and manually verified code passes through this block. (The browser tests don't get captured in the coverage report).

cgewecke

Thanks @cag!

In #3441 you're hoping to stop getPastEvents from crashing ... is there a way to verify this PR achieves that?

Log decoding is being done here via the Ethers library and reading through the thread in #1610 (esp. this comment) makes me wonder if there isn't something which has to be addressed at that layer.

Do you have a view about this?

(Apologies in advance if I haven't understood this problem correctly.)

cag · 2020-04-09T20:08:07Z

@cgewecke You're right! It does have to be addressed in ethers.js. There's another UTF-8 encoding/decoding implementation in ethers.js as well. There is an option to pass an "ignoreErrors" param to the toUtf8String method in there, but that option's not used. Dunno what to do here now...

cgewecke · 2020-04-09T21:49:27Z

@cag Yes, I'm not sure either.

I'm going to check the Ethers V5 beta's parseLog and see if it handles this case gracefully. I don't see an issue similar to #1610 over there so maybe there's a way around this.

Web3 has Ethers pinned on a very old version:

https://github.com/ethereum/web3.js/blob/bd9cf48b265a8cddff925123817034351929ead0/packages/web3-eth-abi/package.json#L16

nivida · 2020-04-15T01:35:40Z

my two cents: I would prefer to use the ethers utf8 coder.. I already spoke about it with @ricmoo and he has done a great job with it. #usingOfSynergies

ricmoo · 2020-04-15T01:59:51Z

Heya! Just saw this. There are several included replacement strategies included in the ethers UTF8 library. If allowing non-strict parsing, you probably want some way for the user to specify this, since critical security issues can be introduced (as well as lost funds, asset destruction, etc) can occur by fuzzing a string.

Some quick examples:

imagine an ENS-like system. If you use the ignoreError option, I can provide a name which using this software will resolve to another name, so the user thinking they are paying person A is actually using sending it to person B. Allowing overlong UTF8 sequences, things get even more dire.
another ENS example. If someone attempts to purchase a name which their browser has encoded the string using Big5, they would be sending money to the contract, and receiving a name which is just a bunch of random bytes, paying for garbage.
if using the replace strategy, things also become crazy, since there is no longer a one to one mapping and it is up to the browser to indicate the UTF8 replacement characters.

My 2 cents are that, it does indeed suck that strings are so finicky in this environment. But there are good reasons they are. Similar to why we have to use big number, i.e. if you don’t, incorrect results occur, the same is true for strings that aren’t UTF8; if you don’t enforce it incorrect results occur. It just feels like strings aren’t as important, but they are. You won’t want numbers to just “ignore” errors that occurred. You won’t want a Swedish user typing in "1,337" ether to send 1337 ether, you’d want it to fail (barring localized parsing).

One option, might be once proxies are more widely accepted would be to make that property throw on a read, which would allow people to trap it if they can about the value, otherwise it can just silently and safely pass through the cracks. Maybe in the mean time there should be a wrapping object that indicates the code units are invalid? I can add an error type to the UTF8 decoder for the final wrapping to make this easier. Ideas?

cag · 2020-04-15T16:45:11Z

@nivida My only horse in the race is that there is a way for getPastEvents to not crash in these scenarios, or to give a way to recover from errors and continue processing. If that means using ethers.js, that's fine by me.

There's a spec for the codec, so that's what I'll refer to for the rest of this comment, assuming (maybe wrongly?) that this spec is faithfully and securely implemented by the various JS engines.

imagine an ENS-like system. If you use the ignoreError option, I can provide a name which using this software will resolve to another name, so the user thinking they are paying person A is actually using sending it to person B. Allowing overlong UTF8 sequences, things get even more dire.

I should hope people would find bob�.eth at least a little suspicious. Anyway, ignoring errors is not an option. Replacement with � is the default strategy for WhatWG codec, and there are a couple others, including throwing an error. While I'm sure the codec in ethers.js also provide these error strategies, imo the question would then really be about how to design the web3.js API to use an appropriate strategy in the face of an error, assuming the implementations are sound.

another ENS example. If someone attempts to purchase a name which their browser has encoded the string using Big5, they would be sending money to the contract, and receiving a name which is just a bunch of random bytes, paying for garbage.

By default, the decoder will attempt to decode the bytes as UTF-8. There's no format inference in the spec. Also, by default, the encoder will encode code points as UTF-8.

if using the replace strategy, things also become crazy, since there is no longer a one to one mapping and it is up to the browser to indicate the UTF8 replacement characters.

The decoder behavior under error conditions is as follows:

Otherwise, if result is error, switch on mode and run the associated steps:

"replacement"
Push U+FFFD to output.
"html"
Prepend U+0026, U+0023, followed by the shortest sequence of ASCII digits representing result’s code point in base ten, followed by U+003B to input.
"fatal"
Return error.

Anyway, it's up to the browser to implement the spec correctly.

My 2 cents are that, it does indeed suck that strings are so finicky in this environment. But there are good reasons they are. Similar to why we have to use big number, i.e. if you don’t, incorrect results occur, the same is true for strings that aren’t UTF8; if you don’t enforce it incorrect results occur. It just feels like strings aren’t as important, but they are. You won’t want numbers to just “ignore” errors that occurred. You won’t want a Swedish user typing in "1,337" ether to send 1337 ether, you’d want it to fail (barring localized parsing).

I agree that this is something that can affect dapp security.

One option, might be once proxies are more widely accepted would be to make that property throw on a read, which would allow people to trap it if they can about the value, otherwise it can just silently and safely pass through the cracks. Maybe in the mean time there should be a wrapping object that indicates the code units are invalid? I can add an error type to the UTF8 decoder for the final wrapping to make this easier. Ideas?

Referring to getPastEvents, I would propose something like the following:

Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with �, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.

In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.

cag · 2020-04-16T17:38:18Z

I'm closing this, as it is clear to me that these changes won't fix #1610. Probably the design change proposal in the comment above should go in the issue thread instead of here.

ricmoo · 2020-04-24T23:22:08Z

I should hope people would find bob�.eth at least a little suspicious.

I agree, but I think the consumer of events are often scripts, which may not be so discerning. ;)

But yes, the IGNORE, REPLACE and ERROR strategies are all available in ethers. I'm also not against the TextCoders, I just worry that invalid UTF-8 characters can be a signifiant security exploit, some there should be some sort of alarms raised.

imo the question would then really be about how to design the web3.js API to use an appropriate strategy in the face of an error

Basically this, 100%. :)

The decoder behavior under error conditions is as follows:

I'm curious what it does in the face of overlong representations... I'll have to look it up and/or experiment. They greatly impact dapp security, but in most day-to-day lives, shouldn't matter much, so I'm curious what the spec suggests...

cgewecke · 2020-05-04T20:51:34Z

@cag #3490 is a move in the direction of resolving this.

However, in parallel, (to make sure things work) have opened #3497 to reproduce the decoding error and the tests are not failing as expected.

What needs to be done to get getPastEvents to crash?

Use TextEncoder to encode UTF-8

9470831

cgewecke self-assigned this Apr 9, 2020

cgewecke added 1.x 1.0 related issues Bug Addressing a bug labels Apr 9, 2020

Add utf8 tests to headless browser suite

e103736

cgewecke reviewed Apr 9, 2020

View reviewed changes

cgewecke self-requested a review April 9, 2020 17:45

cgewecke reviewed Apr 9, 2020

View reviewed changes

cag closed this Apr 16, 2020

cgewecke mentioned this pull request Apr 24, 2020

Removing ethers from the dependencies #3484

Closed

ryanio mentioned this pull request Apr 28, 2020

Update ABICoder dependency #3490

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use TextEncoder to encode UTF-8 #3461

Use TextEncoder to encode UTF-8 #3461

cag commented Apr 8, 2020 •

edited

Loading

coveralls commented Apr 8, 2020 •

edited

Loading

cgewecke Apr 9, 2020

cgewecke left a comment

cag commented Apr 9, 2020

cgewecke commented Apr 9, 2020

nivida commented Apr 15, 2020

ricmoo commented Apr 15, 2020

cag commented Apr 15, 2020 •

edited

Loading

cag commented Apr 16, 2020

ricmoo commented Apr 24, 2020

cgewecke commented May 4, 2020

Use TextEncoder to encode UTF-8 #3461

Use TextEncoder to encode UTF-8 #3461

Conversation

cag commented Apr 8, 2020 • edited Loading

Description

Type of change

Checklist:

coveralls commented Apr 8, 2020 • edited Loading

cgewecke Apr 9, 2020

Choose a reason for hiding this comment

cgewecke left a comment

Choose a reason for hiding this comment

cag commented Apr 9, 2020

cgewecke commented Apr 9, 2020

nivida commented Apr 15, 2020

ricmoo commented Apr 15, 2020

cag commented Apr 15, 2020 • edited Loading

cag commented Apr 16, 2020

ricmoo commented Apr 24, 2020

cgewecke commented May 4, 2020

cag commented Apr 8, 2020 •

edited

Loading

coveralls commented Apr 8, 2020 •

edited

Loading

cag commented Apr 15, 2020 •

edited

Loading