Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898
IMO yes, that would be amazing, especially if the error range only highlighted the incorrect portions of the input string. Visually scanning a string can be tricky, and having a more narrow range could allow for a codefix here. |
Yeah, as @baronfel wrote, modifying the error range to be specific would be even better than including the invalid character in the error message.
Yes, but that requires more changes. So I use the same error reporting that is used for other string errors -- which unfortunately spans the complete string (and might lead to multiple errors being reported on the same range:
I'm pretty sure it's fine, but just to be safe, we may want to check that the new changes reflect what the language spec says. If they extend it, it will probably require an RFC.
As far as I see it: This PR doesn't extend the spec in any way but instead fixes things that aren't according to the spec (or are unclear). I went mostly by these two documents: Literals and Strings. But for Byte Chars & Strings these two docs don't really explain what range the values should be. Maybe except by calling them So whatever the upper border should be -- it doesn't really introduce a new change.
Done (for But Note: For chars that are invalid only in a byte string, the error still spans the complete String:
> "foo\U000003C0bar";;
val it: string = "fooπbar"
> "foo\U000003C0bar"B;;
^^^^^^^^^^^^^^^^^^^
stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte
Reason: We only know at the end of the string whether it's a Byte String. At that point we no longer have direct/simple access to the invalid element and its type or location -> we know the value and can validate it -- but don't know exactly where it's located.
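To make the reasoning above concrete, here is a minimal sketch of what a narrower range would need: every lexed element would have to carry its source position until the `B` suffix is seen. The `Element` record and `spansNotSingleByte` are hypothetical names for illustration only; this is not how `lex.fsl` or `LexHelpers.fs` store string contents.

```fsharp
// Hypothetical record: a lexed string element that remembers its source position.
type Element =
    { Value: int     // code point of the element
      Start: int     // 1-based column where the element starts
      Length: int }  // number of source characters the element occupies

// Spans (start column, length) of all elements that do not fit into a single byte.
let spansNotSingleByte (elements: Element list) =
    elements
    |> List.filter (fun e -> e.Value > 255)
    |> List.map (fun e -> e.Start, e.Length)

// Elements of "foo\U000003C0bar"B; column 1 is the opening quote.
let elements =
    [ { Value = int 'f'; Start = 2; Length = 1 }
      { Value = int 'o'; Start = 3; Length = 1 }
      { Value = int 'o'; Start = 4; Length = 1 }
      { Value = 0x3C0; Start = 5; Length = 10 }   // \U000003C0
      { Value = int 'b'; Start = 15; Length = 1 }
      { Value = int 'a'; Start = 16; Length = 1 }
      { Value = int 'r'; Start = 17; Length = 1 } ]

printfn "%A" (spansNotSingleByte elements)   // [(5, 10)]
```

With only the decoded values available (no `Start`/`Length`), the check can still detect the problem but can only point at the whole literal, which matches the behaviour described above.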
Errors reduced to Warnings and messages updated (see #15898 (comment)) Note: In How to proceed with Bytes inside |
Fix dotnet#15867: Add `\U` & `\x` for ASCII Byte
Fix dotnet#15868: Fix: ASCII Byte Decimal notation accepts value `> 127`
Note: Values between `128` and `255` are still valid in Byte String Array (-> several tests in `ByteStrings` fail)
prev:
```fsharp
> "foo\U12345678bar";;
  ^^^^^^^^^^^^^^^^^^
stdin(1,1): error FS1245: \U12345678 is not a valid Unicode character escape sequence
```
now:
```fsharp
> "foo\U12345678bar";;
  ----^^^^^^^^^^
stdin(1,5): error FS1245: \U12345678 is not a valid Unicode character escape sequence
```
Note: In Byte Strings that's only the case for invalid chars (-> invalid in a normal string too), but not for chars that are invalid only inside a byte string:
```fsharp
> "foo\U000003C0bar";;
val it: string = "fooπbar"
> "foo\U000003C0bar"B;;
  ^^^^^^^^^^^^^^^^^^^
stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte
```
Reason: We only know at the end of the string whether it's a Byte String. At that point we no longer have direct/simple access to the invalid element (and its notation) -> we know the value and can validate it -- but don't know exactly where it's located
Change `Error` to `Warning` for cases where current F# succeeds but which became an Error in the previous commits:

* Trigraph ASCII Byte when inside `128..255`:

```fsharp
// prev
> '\973'B;;
  ^^^^^^^
stdin(11,1): error FS1157: This is not a valid byte literal
> '\250'B;;
val it: byte = 250uy

// now
> '\973'B;;
  ^^^^^^^
stdin(2,1): error FS1157: This is not a valid ASCII byte literal. Value must be < 128y.
> '\250'B;;
  ^^^^^^^
stdin(3,1): warning FS1157: This is not a valid ASCII byte literal. Value should be < 128y. Note: In a future F# version this Warning will be promoted to Error!
val it: byte = 250uy
```

* Trigraph string when `> 255`

```fsharp
// prev
> "\937";;
val it: string = "©"

// now
> "\937";;
  -^^^^
stdin(4,2): warning FS1252: '\937' is not a valid character literal. Note: Currently the value is wrapped around byte range to '\169'. In a future F# version this Warning will be promoted to Error!
val it: string = "©"
```

Additionally for these cases: add a note about promotion to Error in a future F# version
-> in `lex.fsl`: `//TODO` with a note on how to promote to `Error`

Note: I changed the message for `lexInvalidByteLiteral` (incl. rename to `lexInvalidAsciiByteLiteral`). `lexInvalidByteLiteral` was previously translated (-> in `FSComp.txt.XXX.xlf` files), but now isn't any more even though most of the message remained the same:
* Prev: `This is not a valid byte literal.`
* Now: `This is not a valid ASCII byte literal. Value should be < 128y.`
>= 256 (more than 1 byte) are already an error

```fsharp
let _ = "ä"B
//      ^^^^
// This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.
// Note: In future F# versions this Warning might be promoted to Error!
```

Note: Issue with trigraph and > 255:

```fsharp
let _ = "\973"B
//       ^^^^
// Warning: '\973' is not a valid character literal.
// Note: Currently the value is wrapped around byte range to '\205' [...]
//
//      ^^^^^^^
// Warning: This byte array literal contains 1 non-ASCII characters.
```

-> Two warnings for the same trigraph:
* Warning for trigraph `\973` > 255. Detected while parsing the trigraph. We don't know yet whether it's a String or a Byte String.
  * warning spans just the trigraph
* Warning for wrapped value `\205` > 127. Detected while checking the Byte String for correct values. We no longer have any info about the value (range & notation).
  * warning spans the full byte string

-> Each warning is emitted in a different parsing step.
-> No easy way to reduce to just one warning.
However: Both warnings are correct and warn about different issues (albeit in the same value)
And: they should get automatically reduced once out-of-trigraph-range gets promoted to error

**Enhancement**: Count number of invalid chars (2-bytes, >128) in byte array

```fsharp
let _ = "ΩäΩüΩ"B
```

* prev: `This byte array literal contains characters that do not encode as a single byte`
* now: `This byte array literal contains 3 characters that do not encode as a single byte` && `This byte array literal contains 2 non-ASCII characters. All characters should be < 128y.`

Note: When checking valid bytes in a Byte String, there's no direct mapping to range and notation any more.
-> cannot highlight the exact byte string part, but only the full byte string.
* Possible Enhancement?: List invalid chars in a certain notation? For example in `\u` notation.
  * Pro:
    * Gives an idea which chars are incorrect
  * Con:
    * Might be difficult to process by the user because the output might be in a different notation than written in the string
    * Might be a long list of invalid chars
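The three trigraph cases discussed in these commit messages (valid ASCII, `128..255`, and `> 255` with wrap-around) can be summarised in a small sketch. `TrigraphCheck` and `classifyTrigraph` are hypothetical names used only for illustration; the actual checks live in the lexer and are split across two steps, as described above.

```fsharp
// Hypothetical classification of a decimal trigraph value such as \097 or \937.
type TrigraphCheck =
    | Ascii of byte      // < 128: valid everywhere
    | NonAscii of byte   // 128..255: fits into a byte, but is not ASCII (warning in byte literals)
    | Wrapped of byte    // > 255: value is wrapped around the byte range (warning)

let classifyTrigraph (n: int) =
    if n < 0 || n > 999 then invalidArg "n" "a trigraph consists of exactly three decimal digits"
    elif n < 128 then Ascii (byte n)
    elif n < 256 then NonAscii (byte n)
    else Wrapped (byte (n % 256))

printfn "%A" (classifyTrigraph 97)    // Ascii 97uy
printfn "%A" (classifyTrigraph 250)   // NonAscii 250uy
printfn "%A" (classifyTrigraph 937)   // Wrapped 169uy
printfn "%A" (classifyTrigraph 973)   // Wrapped 205uy
```

Note that a wrapped value can itself land in `128..255` (as `\973` -> `205` does), which is why a byte string containing such a trigraph triggers both warnings.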
Ok, no answer how to handle So I decided to align it with ASCII char byte: Emits now a warning (NOT error -> compilation still works):
Note: There's a small flaw with trigraphs:
> "\973"B;;
"\973"B;;
-^^^^
stdin(3,2): warning FS1252: '\973' is not a valid character literal.
Note: Currently the value is wrapped around byte range to '\205'. In a future F# version this Warning will be promoted to Error!
"\973"B;;
^^^^^^^
stdin(3,1): warning FS1253: This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.
val it: byte array = [|205uy|]
-> TWO warnings:
These two are checked at different parsing steps with different infos available:
-> two warnings with different ranges. I think that's ok because:
I also updated the error message for invalid byte chars (two bytes long instead of just one) by adding the number of faulty characters:
> "Ω --- Ω --- Ω --- Ω --- Ω"B;;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
error FS1140: This byte array literal contains 5 characters that do not encode as a single byte
(prev was: Note: That's again "highlight full byte string instead of exact occurrence" because the check happens after the string is parsed and we finally know it's a byte string -- but no longer know what exactly is inside that string (without reparsing). Another possible enhancement: List invalid chars in a certain notation (for example in
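As a rough illustration of the counting behind the updated messages, the sketch below counts both kinds of problematic characters in an already decoded string. `countInvalidForByteString` is a hypothetical helper, not the function used in the compiler, and it assumes the characters are precomposed (one UTF-16 code unit each).

```fsharp
// Hypothetical helper: count characters that are problematic in a byte string.
// Returns (characters that do not fit into a single byte, non-ASCII characters in 128..255).
let countInvalidForByteString (s: string) =
    let notSingleByte = s |> Seq.filter (fun c -> int c > 255) |> Seq.length
    let nonAscii = s |> Seq.filter (fun c -> int c >= 128 && int c <= 255) |> Seq.length
    notSingleByte, nonAscii

// "ΩäΩüΩ" written with escapes so the result does not depend on the source file encoding.
let omegaMix = "\u03A9\u00E4\u03A9\u00FC\u03A9"

printfn "%A" (countInvalidForByteString omegaMix)   // (3, 2)
```

Because only the decoded characters are inspected, the counts are available but the positions inside the original literal are not, which is the limitation noted above.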
Tests should be adjusted to the new behaviour (maybe with the exception of some legacy conformance tests? Let's see what the CI results say) Edit: CI error:
I think that's "just" some random error and not related to this PR (also only on "Linux" build :/ ) |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s). |
❗ Release notes required
I think we should get this in. This really just fixes inconsistencies between different notations. Good stuff and solid testing.
Thank you
Co-authored-by: Brian Rourke Boll <[email protected]>
/run xlf |
Co-authored-by: psfinaki <[email protected]>
Fix #15867: Support `\U` & `\x` in char byte
* `'a'B; '\097'B; '\x61'B; '\u0061'B; '\U00000061'B`
* `> 127` (`This is not a valid byte literal`), just like in other notations

Fix #15868: Char byte in decimal notation can be `> 127`
* `'\250'B` -> `error FS1157: This is not a valid byte literal`, just like in other notations

Fix #15869: In String: decimal char can be `> 255` (and gets wrapped to `<256`)
* `"\937"` -> `This is not a valid character literal`
* `"\937"B` -> `This is not a valid character literal`

Note: I reused the error message for chars. That's logically ok, but: the error on a String spans the whole string, not just the invalid part. That's the case for other String errors too -- but the other errors at least mention that it's inside the string (`This byte array literal contains characters that do not encode as a single byte`) or mention the wrong part (`\U1100FFFA is not a valid Unicode character escape sequence`). Should I change the error to include the invalid trigraph?

Edit: Error message now mentions trigraph: `'\937' is not a valid character literal`
Edit: Error span is now limited to just the trigraph (fixed for `\U` too):

Note:
Currently Byte String enforces a char to fit into a single byte -- but does not require the value to be `< 128` like it's the case for Byte Char:

I don't know if that's intended or not:
* `Further check each low byte <= 127` -- but doesn't actually check it. So I guess the limit `<128` is intended and the check is just missing
  fsharp/src/Compiler/SyntaxTree/LexHelpers.fs, Lines 225 to 234 in 97a5b65
* `> 127` -- indicating the full byte range should be possible
  fsharp/src/Compiler/SyntaxTree/LexHelpers.fs, Lines 149 to 154 in 97a5b65
* `< 128` just for trigraphs. That's actually not enforceable with current string parsing: We only know at the end of a string if it's a Byte String or not -- and at this point we don't know if a value was a trigraph or something else. So it's either all must be `< 128` or all can be `> 127`

I've written some tests expecting all values to be `<128`, but have not adjusted the lexer to produce an error when a value is above.
-> There are currently some failing tests in `ByteStrings` (-> reason this PR is marked as Draft)

Depending on what behaviour is wanted I'll either adjust the tests or the lexer