Skip to content
This repository has been archived by the owner on Mar 5, 2025. It is now read-only.

Use TextEncoder to encode UTF-8 #3461

Closed
wants to merge 2 commits into from
Closed

Use TextEncoder to encode UTF-8 #3461

wants to merge 2 commits into from

Conversation

cag
Copy link

@cag cag commented Apr 8, 2020

Description

Use TextEncoder to encode UTF-8 sequences.

Related: #1610 (Edit: Thought this would fix until @cgewecke noted that getPastEvents uses a different UTF-8 codec)
Related: #3441

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • I have selected the correct base branch.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • Any dependent changes have been merged and published in downstream modules.
  • I ran npm run dtslint with success and extended the tests and types if necessary.
  • I ran npm run test:unit with success.
  • I have executed npm run test:cov and my test cases do cover all lines and branches of the added code.
  • I ran npm run build-all and tested the resulting file/'s from dist folder in a browser.
  • I have updated the CHANGELOG.md file in the root folder.
  • I have tested my code on the live network.

@coveralls
Copy link

coveralls commented Apr 8, 2020

Coverage Status

Coverage decreased (-0.05%) to 88.549% when pulling e103736 on cag:use-textencoder into 2a5c5cb on ethereum:1.x.

@cgewecke cgewecke self-assigned this Apr 9, 2020
@cgewecke cgewecke added 1.x 1.0 related issues Bug Addressing a bug labels Apr 9, 2020
if (typeof TextEncoder !== 'undefined' && typeof TextDecoder !== 'undefined') {
utf8Encoder = new TextEncoder();
utf8Decoder = new TextDecoder();
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a review note: have added utf8 tests to the headless browser suite and manually verified code passes through this block. (The browser tests don't get captured in the coverage report).

@cgewecke cgewecke self-requested a review April 9, 2020 17:45
Copy link
Collaborator

@cgewecke cgewecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cag!

In #3441 you're hoping to stop getPastEvents from crashing ... is there a way to verify this PR achieves that?

Log decoding is being done here via the Ethers library and reading through the thread in #1610 (esp. this comment) makes me wonder if there isn't something which has to be addressed at that layer.

Do you have a view about this?

(Apologies in advance if I haven't understood this problem correctly.)

@cag
Copy link
Author

cag commented Apr 9, 2020

@cgewecke You're right! It does have to be addressed in ethers.js. There's another UTF-8 encoding/decoding implementation in ethers.js as well. There is an option to pass an "ignoreErrors" param to the toUtf8String method in there, but that option's not used. Dunno what to do here now...

@cgewecke
Copy link
Collaborator

cgewecke commented Apr 9, 2020

@cag Yes, I'm not sure either.

I'm going to check the Ethers V5 beta's parseLog and see if it handles this case gracefully. I don't see an issue similar to #1610 over there so maybe there's a way around this.

Web3 has Ethers pinned on a very old version:

https://github.com/ethereum/web3.js/blob/bd9cf48b265a8cddff925123817034351929ead0/packages/web3-eth-abi/package.json#L16

@nivida
Copy link
Contributor

nivida commented Apr 15, 2020

my two cents: I would prefer to use the ethers utf8 coder.. I already spoke about it with @ricmoo and he has done a great job with it. #usingOfSynergies

@ricmoo
Copy link
Contributor

ricmoo commented Apr 15, 2020

Heya! Just saw this. There are several included replacement strategies included in the ethers UTF8 library. If allowing non-strict parsing, you probably want some way for the user to specify this, since critical security issues can be introduced (as well as lost funds, asset destruction, etc) can occur by fuzzing a string.

Some quick examples:

  • imagine an ENS-like system. If you use the ignoreError option, I can provide a name which using this software will resolve to another name, so the user thinking they are paying person A is actually using sending it to person B. Allowing overlong UTF8 sequences, things get even more dire.

  • another ENS example. If someone attempts to purchase a name which their browser has encoded the string using Big5, they would be sending money to the contract, and receiving a name which is just a bunch of random bytes, paying for garbage.

  • if using the replace strategy, things also become crazy, since there is no longer a one to one mapping and it is up to the browser to indicate the UTF8 replacement characters.

My 2 cents are that, it does indeed suck that strings are so finicky in this environment. But there are good reasons they are. Similar to why we have to use big number, i.e. if you don’t, incorrect results occur, the same is true for strings that aren’t UTF8; if you don’t enforce it incorrect results occur. It just feels like strings aren’t as important, but they are. You won’t want numbers to just “ignore” errors that occurred. You won’t want a Swedish user typing in "1,337" ether to send 1337 ether, you’d want it to fail (barring localized parsing).

One option, might be once proxies are more widely accepted would be to make that property throw on a read, which would allow people to trap it if they can about the value, otherwise it can just silently and safely pass through the cracks. Maybe in the mean time there should be a wrapping object that indicates the code units are invalid? I can add an error type to the UTF8 decoder for the final wrapping to make this easier. Ideas?

@cag
Copy link
Author

cag commented Apr 15, 2020

@nivida My only horse in the race is that there is a way for getPastEvents to not crash in these scenarios, or to give a way to recover from errors and continue processing. If that means using ethers.js, that's fine by me.

There's a spec for the codec, so that's what I'll refer to for the rest of this comment, assuming (maybe wrongly?) that this spec is faithfully and securely implemented by the various JS engines.

  • imagine an ENS-like system. If you use the ignoreError option, I can provide a name which using this software will resolve to another name, so the user thinking they are paying person A is actually using sending it to person B. Allowing overlong UTF8 sequences, things get even more dire.

I should hope people would find bob�.eth at least a little suspicious. Anyway, ignoring errors is not an option. Replacement with � is the default strategy for WhatWG codec, and there are a couple others, including throwing an error. While I'm sure the codec in ethers.js also provide these error strategies, imo the question would then really be about how to design the web3.js API to use an appropriate strategy in the face of an error, assuming the implementations are sound.

  • another ENS example. If someone attempts to purchase a name which their browser has encoded the string using Big5, they would be sending money to the contract, and receiving a name which is just a bunch of random bytes, paying for garbage.

By default, the decoder will attempt to decode the bytes as UTF-8. There's no format inference in the spec. Also, by default, the encoder will encode code points as UTF-8.

  • if using the replace strategy, things also become crazy, since there is no longer a one to one mapping and it is up to the browser to indicate the UTF8 replacement characters.

The decoder behavior under error conditions is as follows:

Otherwise, if result is error, switch on mode and run the associated steps:

"replacement"
Push U+FFFD to output.
"html"
Prepend U+0026, U+0023, followed by the shortest sequence of ASCII digits representing result’s code point in base ten, followed by U+003B to input.
"fatal"
Return error.

Anyway, it's up to the browser to implement the spec correctly.


My 2 cents are that, it does indeed suck that strings are so finicky in this environment. But there are good reasons they are. Similar to why we have to use big number, i.e. if you don’t, incorrect results occur, the same is true for strings that aren’t UTF8; if you don’t enforce it incorrect results occur. It just feels like strings aren’t as important, but they are. You won’t want numbers to just “ignore” errors that occurred. You won’t want a Swedish user typing in "1,337" ether to send 1337 ether, you’d want it to fail (barring localized parsing).

I agree that this is something that can affect dapp security.

One option, might be once proxies are more widely accepted would be to make that property throw on a read, which would allow people to trap it if they can about the value, otherwise it can just silently and safely pass through the cracks. Maybe in the mean time there should be a wrapping object that indicates the code units are invalid? I can add an error type to the UTF8 decoder for the final wrapping to make this easier. Ideas?

Referring to getPastEvents, I would propose something like the following:

Add an option, acknowledging that parsing event data might fail due to string processing, maybe called stringParsingErrorMode or something. A parsing error might occur for some non-empty subset of the events being gathered. We have replace which can just replace the bad sequence starts with �, skip which skips that subset and returns the events for which string parsing didn't fail, or error which rejects with the encoding error encountered. It's fine if the default behavior remains error, as long as these other modes are supported.

In particular, skip would be nice for dapps that do expect arbitrary input data from events, but do not want to be DOS'd by a bad event.

@cag
Copy link
Author

cag commented Apr 16, 2020

I'm closing this, as it is clear to me that these changes won't fix #1610. Probably the design change proposal in the comment above should go in the issue thread instead of here.

@ricmoo
Copy link
Contributor

ricmoo commented Apr 24, 2020

I should hope people would find bob�.eth at least a little suspicious.

I agree, but I think the consumer of events are often scripts, which may not be so discerning. ;)

But yes, the IGNORE, REPLACE and ERROR strategies are all available in ethers. I'm also not against the TextCoders, I just worry that invalid UTF-8 characters can be a signifiant security exploit, some there should be some sort of alarms raised.

imo the question would then really be about how to design the web3.js API to use an appropriate strategy in the face of an error

Basically this, 100%. :)

The decoder behavior under error conditions is as follows:

I'm curious what it does in the face of overlong representations... I'll have to look it up and/or experiment. They greatly impact dapp security, but in most day-to-day lives, shouldn't matter much, so I'm curious what the spec suggests...

@ryanio ryanio mentioned this pull request Apr 28, 2020
13 tasks
@cgewecke
Copy link
Collaborator

cgewecke commented May 4, 2020

@cag #3490 is a move in the direction of resolving this.

However, in parallel, (to make sure things work) have opened #3497 to reproduce the decoding error and the tests are not failing as expected.

What needs to be done to get getPastEvents to crash?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
1.x 1.0 related issues Bug Addressing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants