-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid UTF-8 was detected in one or more arguments #4954
Comments
Arguments are text, not binary data. For example on Windows it has to transcode from UTF-16 (as provided by the OS) to UTF-8 (as expected by wasi programs) Additionally null bytes are not allowed in arguments on Windows and Unix as they indicate the end of the argument (on Unix) or the end of the entire commandline (on Windows). |
I believe we have to differentiate between two types of arguments here: arguments to wasmtime itself and arguments to the executed program. The program in its minimal variant will look like this:
Right now wasmtime is forcing UTF-8 arguments on the program itself, which is IMHO quite limiting for the program (if done by design). |
If you want the program to run on Windows, UTF-16 to UTF-8 conversion has to be done one way or another as wasi programs don't know the host architecture. |
Also, command-line arguments are already not suitable for passing arbitrary binary data, because they're passed to C programs as NUL-terminated strings, so they can't contain embedded 0 bytes. If you want to pass arbitrary binary data as arguments to programs today, I suggest using an escape or quoting system, or a format like BASE64, to ensure that all the bytes are preserved. |
I believe there's a bug here where the Additionally the |
I believe we're concentrating on the wrong side of the coin. Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment. The other side of the coin is the forceful conversion of arguments to the UTF-8 encoding, which is a rather large constraint IMHO considering the zoo of character encodings out there. Perhaps the problem with UTF-16 on Windows, which as I understand is the main reason for conversion to UTF-8, can be solved some other way? p.s. There is a relevant discussion on the SG-16 group page: sg16-unicode/sg16#66, it seems they aim to solve the same problem that we're facing? |
On Windows, binary arguments would have to have a multiple-of-two size as the argument list consist of a single 16bit character list. In addition on Windows the program itself it responsible for parsing the argument list from a single string into however many arguments the user intended to parse. This is not the responsibility of the shell like on Unix. This means that if your binary argument contains a space character, it would either cause two arguments to be passed, or you would need to escape it, which doesn't really work well with binary arguments I would think.
How would you suggest making wasi programs agnostic to the character encoding without forcefully converting from and to UTF-8 at the boundary between the wasi program and the runtime? On Windows the A variants of all system api's have a behavior depending on the system character set, which wasi programs can't query and even if it could, now wasi programs would need to ship with huge tables translating between every possible character set and whatever encoding it uses internally (like UTF-8). |
Perhaps a flag to wasmtime, which defaults to UTF-8? Something like:
This was the first thing I was looking for after facing the problem the first time. It might not be perfect, but at least intuitive. :) |
For a truly minimal unit test, could you use the embedding API, instead of the CLI? Command-line arguments involve quite a lot of code, on both the host side and in wasm, even ignoring the character encoding issues. With the embedding API, you can have a function that takes an array of |
I definitely agree with your reasoning. My only comment on that, is that Linux CLI is based on a mature code-base and provides a "battle-proven" solution for almost any upstream/downstream problem, while embedding API almost always lacks an "in-house" solution and requires a third partly library, that is generally either not mature enough or not maintained for years. So there are advantages and disadvantages in both approaches. |
I'm a little unclear about what you're referring to. When I say embedding API, II mean Wasmtime's embedding API: https://docs.wasmtime.dev/lang.html I can also mention that the current direction for WASI is that it will define command-line arguments to be Unicode strings. While it may be useful in some unit-test-like use cases to let users that know how the host environment works to take advantage of it, the goal in many real-world wasm use cases is to have modules that don't know the environment, and don't interact with the "system" character encoding, because such knowledge would make them less portable. This only applies to things which are called "strings", which includes command-line arguments. If you have APIs that take |
Running a WASM file with a binary argument, such as:
or
Aborts execution with the following error:
I see three problems with this:
The text was updated successfully, but these errors were encountered: