Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Merged
merged 22 commits into from
Aug 20, 2024

Conversation

Booksbaum
Copy link
Contributor

@Booksbaum Booksbaum commented Aug 29, 2023

Fix #15867: Support \U & \x in char byte

  • These now all work: 'a'B; '\097'B; '\x61'B; '\u0061'B; '\U00000061'B
  • Compiler error when > 127 (This is not a valid byte literal), just like in other notations

Fix #15868: Char byte in decimal notation can be > 127

  • Now: '\250'B -> error FS1157: This is not a valid byte literal, just like in other notations

Fix #15869: In String: decimal char can be > 255 (and gets wrapped to <256)

  • Now: "\937" -> This is not a valid character literal
  • Applies to Byte String too: "\937"B -> This is not a valid character literal
  • Note: I reused the error message for chars. Which is logical ok, but: Error on String spans the whole string, not just the invalid part. That's the case for other Strings errors too -- but the other errors at least mention its inside the string (This byte array literal contains characters that do not encode as a single byte) or mention the wrong part (\U1100FFFA is not a valid Unicode character escape sequence). Should I change the error to include the invalid trigraph?
    Edit: Error message now mentions trigraph: '\937' is not a valid character literal
    Edit: Error span is now limited to just the trigraph (fixed for \U too):
    > "foo\937bar";;
      ----^^^^
    stdin(1,5): error FS1252: '\937' is not a valid character literal

Note:
Currently Byte String enforces a char to fit into a single byte -- but does not require value to be < 128 like it's the case for Byte Char:

> 'ú'B;;
  ^^^^
stdin(2,1): error FS1157: This is not a valid byte literal

> "ú"B;;
val it: byte array = [|250uy|]

I don't know if that's intended or not:

  • There's a helper to check if a string is a correct bytes string: Checks every char (2 bytes) is just a single value byte. The comment for the function also says: Further check each low byte <= 127 -- but doesn't actual check it. So I guess limit <128 is indented and the check is just missing
    /// Sanity check that high bytes are zeros. Further check each low byte <= 127
    let stringBufferIsBytes (buf: ByteBuffer) =
    let bytes = buf.AsMemory()
    let mutable ok = true
    for i = 0 to bytes.Length / 2 - 1 do
    if bytes.Span[i * 2 + 1] <> 0uy then
    ok <- false
    ok
  • Another comment for another helper function explicitly mentions byte values > 127 -- indicating full byte range should be possible
    /// When lexing bytearrays we don't expect to see any unicode stuff.
    /// Likewise when lexing string constants we shouldn't see any trigraphs > 127
    /// So to turn the bytes collected in the string buffer back into a bytearray
    /// we just take every second byte we stored. Note all bytes > 127 should have been
    /// stored using addIntChar
    let stringBufferAsBytes (buf: ByteBuffer) =
    • Note: it mentions limit < 128 just for trigraphs. That's actually not enforceable with current string parsing: We only know at the end of a string if it's a Byte String or not -- and at this point we don't know if a value was a trigraph or something else.
      So it's either all must be < 128 or all can be > 127

I've written some tests expecting all values to be <128, but have not adjusted the lexer to produce an error when a value is above.
-> There are currently some failing tests in ByteStrings (-> reason this PR is marked as Draft)

Depending on what behaviour is wanted I'll either adjust the tests or the lexer

@baronfel
Copy link
Member

Should I change the error to include the invalid trigraph?

IMO yes, that would be amazing, especially if the error range only highlighted the incorrect portions of the input string. Visually scanning a string can be tricky, and having a more narrow range could allow for a codefix here.

@Happypig375
Copy link
Member

Yea, as @baronfel wrote, modifying the error range to be specific would be even better than including the invalid character in the error message.

@Booksbaum
Copy link
Contributor Author

modifying the error range to be specific would be even better

yes .. but that requires more change. So I use the same error reporting used for other string errors -- which unfortunately spans the complete string (which might lead to multiple errors reported on same range: "\U1100FFFA-\U1100FAFA-\U1100FAFF" -> three errors spanning the complete string)

@vzarytovskii
Copy link
Member

I'm pretty sure it's fine, but just to be safe, we may want to see that new changes reflect what language spec says. If it extends it, it will probably requite RFC.

@Booksbaum
Copy link
Contributor Author

we may want to see that new changes reflect what language spec says. If it extends it, it will probably requite RFC.

As far as I see it: This PR doesn't extends the specs in any way but instead fixes things that aren't according to specs (or unclear).

I went mostly by this two documents: Literals and Strings
So for example: Limit trigraph to <256 is clearly a fix because docs state: \DDD (where D indicates a decimal digit; range of 000 - 255; for example, \231 = "ç")

But for Byte Chars & Strings these two docs don't really explain what range the values should be. Maybe except by calling them ASCII byte and ASCII string (-> just values < 128)?
So I went with what's currently in code -- which is unfortunately different for byte char (<128) and byte string (<256).

So whatever the upper border should be -- it doesn't really introduce a new change.

@Booksbaum
Copy link
Contributor Author

especially if the error range only highlighted the incorrect portions of the input string.

Done (for \U too)
(was just some incorrect code order. I thought that would be something more involved.)

But Note: For chars only invalid in byte string it's still the complete String:

> "foo\U000003C0bar";;
val it: string = "fooπbar"

> "foo\U000003C0bar"B;;
  ^^^^^^^^^^^^^^^^^^^
stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte

Reason: We only now at end of string if it's a Byte String. At this point we don't have direct/simple access to the invalid element and its type or location any more -> we know the value and can validate it -- but don't know exactly where it's located

src/Compiler/lex.fsl Outdated Show resolved Hide resolved
@Booksbaum
Copy link
Contributor Author

Errors reduced to Warnings and messages updated (see #15898 (comment))

Note: In lex.fs: I marked the locations with Warnings that should later be promoted to Error with // TODO: Promote to Error and noted the necessary steps to produce correct Error





How to proceed with Bytes inside 128..255 range for ASCII Byte Strings ("ú"B = [|250uy|])? (see end of #15898 (comment))
Keep it like it is (-> accepts 0..255 -- like in normal String but unlike ASCII Byte Char which requires < 128y) or produce warning (error later) (like it's done in ASCII Byte: This is not a valid ASCII byte literal. Value should be < 128y.)?

Fix dotnet#15867: Add `\U` & `\x` for ASCII Byte
Fix dotnet#15868: Fix: ASCII Byte Decimal notation accepts value `> 127`
Note: Values between `128` and `255` are still valid in Byte String Array (-> several tests in `ByteStrings` fail)
prev:
```fsharp
> "foo\U12345678bar";;
  ^^^^^^^^^^^^^^^^^^
stdin(1,1): error FS1245: \U12345678 is not a valid Unicode character escape sequence
```

now:
```fsharp
> "foo\U12345678bar";;
  ----^^^^^^^^^^
stdin(1,5): error FS1245: \U12345678 is not a valid Unicode character escape sequence
```

Note: In Byte Strings that's only the case for invalid chars (-> invalid in normal string too), but not for chars invalid only inside byte string:
```fsharp
> "foo\U000003C0bar";;
val it: string = "fooπbar"

> "foo\U000003C0bar"B;;
  ^^^^^^^^^^^^^^^^^^^
stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte
```

Reason: We only now at end of string if it's a Byte String. At this point we don't have direct/simple access to the invalid element (and its notation) any more -> we know the value and can validate it -- but don't know exactly where it's located
Change `Error` to `Warning` for cases current F# succeeds, but are now Error (in prev commits introduced):
* Trigraph ASCII Byte when between inside `128..255`:
  ```fsharp
  // prev
  > '\973'B;;
    ^^^^^^^
  stdin(11,1): error FS1157: This is not a valid byte literal

  > '\250'B;;
  val it: byte = 250uy

  // now
  > '\973'B;;
    ^^^^^^^
  stdin(2,1): error FS1157: This is not a valid ASCII byte literal. Value must be < 128y.

  > '\250'B;;
    ^^^^^^^
  stdin(3,1): warning FS1157: This is not a valid ASCII byte literal. Value should be < 128y.
  Note: In a future F# version this Warning will be promoted to Error!

  val it: byte = 250uy
  ```
* Trigraph string when `> 255`
  ```fsharp
  // prev
  > "\937";;
  val it: string = "©"

  // now
  > "\937";;
    -^^^^
  stdin(4,2): warning FS1252: '\937' is not a valid character literal.
  Note: Currently the value is wrapped around byte range to '\169'. In a future F# version this Warning will be promoted to Error!

  val it: string = "©"
  ```

Additional for these case:
Add Note about Promoting to Error in future F# version
-> in `lex.fsl`: `//TODO` with note how to promote to `Error`

Note:
I changed the message for `lexInvalidByteLiteral` (incl. rename to `lexInvalidAsciiByteLiteral`).
`lexInvalidByteLiteral` was previously translated (-> in `FSComp.txt.XXX.xlf` files), but now isn't any more even though most of the message remained the same:
* Prev: `This is not a valid byte literal.`
* Now: `This is not a valid ASCII byte literal. Value should be < 128y.`
>= 256 (more than 1 byte) are already an error

```fsharp
let _ = "ä"B
//      ^^^^
//      This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.
//      Note: In future F# versions this Warning might be promoted to Error!
```

Note: Issue with trigraph and > 255:
```fsharp
let _ = "\973"B
//       ^^^^
//       Warning: '\973' is not a valid character literal.
//                Note: Currently the value is wrapped around byte range to '\169' [...]
//
//      ^^^^^^^
//      Warning: This byte array literal contains 1 non-ASCII characters.
```
->
Two warnings for same trigraph:
* Warning for trigraph `\973` > 255. Detected while parsing trigraph. Don't know yet if String or Byte String.
  * warning spans just trigraph
* Warning for wrapped value `\169` > 127. Detected while checking Byte String for correct values. Don't have any infos about value (range & notation) any more.
  * warning spans full byte string

-> Each Warning is emitted in a different parsing steps.
-> No easy way to reduce to just one warning.

However: Both warnings are correct and warn about a different issues (albeit in same value)

And: should get automatically reduced once out-of-trigraph-range gets promoted to error

**Enhancement**: Count number of invalid chars (2-bytes, >128) in byte array

```fsharp
let _ = "ΩäΩüΩ"B
```
* prev:
  `This byte array literal contains characters that do not encode as a single byte`
* now:
  `This byte array literal contains 3 characters that do not encode as a single byte`
  &&
  `This byte array literal contains 2 non-ASCII characters. All characters should be < 128y.`

Note: When checking valid bytes in Byte String, there's no direct mapping to range and notation any more.
  -> cannot highlight exact byte string part, but only full byte string.
* Possible Enhancement?: List invalid chars in a certain notation? For example in `\u` notation.
  * Pro:
    * Gives an idea which chars are incorrect
  * Con:
    * Might be difficult to process by user because output might be different notation than written in string
    * Might be a long list with invalid chars
@Booksbaum
Copy link
Contributor Author

Booksbaum commented Oct 8, 2023

Ok, no answer how to handle 128..255 in ASCII Byte String.

So I decided to align it with ASCII char byte: Emits now a warning (NOT error -> compilation still works):

  • prev:
    > "ä"B;;
    val it: byte array = [|228uy|]
  • now:
    > "ä"B;;
      ^^^^
      warning FS1253: This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.
    
    val it: byte array = [|228uy|]

Note: There's a small flaw with trigraphs > 255 & their % 256 > 127:

> "\973"B;;

  "\973"B;;
  -^^^^

stdin(3,2): warning FS1252: '\973' is not a valid character literal.
Note: Currently the value is wrapped around byte range to '\205'. In a future F# version this Warning will be promoted to Error!


  "\973"B;;
  ^^^^^^^

stdin(3,1): warning FS1253: This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.

val it: byte array = [|205uy|]

-> TWO warnings:

  • trigraph > 255 -> warning & wrapping to \205
  • wrapped value > 127 -> warning

These two are checked at different parsing steps with different infos available:

  • > 255: while parsing of trigraph
    • Know trigraph and where, but not if String or Byte String
    • -> warning spans exactly trigraph
  • > 127: while checking valid ASCII Byte in string
    • Know String or Byte String and byte value, but not source notation or location
    • -> warning spans complete Byte String

-> two warnings with different ranges.

I think that's ok because:

  • rare
  • both are correct warnings
  • should automatically get resolved once trigraph > 255 is promoted to error (at least I think)

I also updated the error message for invalid byte char (two-byte long instead of just one) by adding the number of faulty characters:

> "Ω --- Ω --- Ω  --- Ω --- Ω"B;;
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  error FS1140: This byte array literal contains 5 characters that do not encode as a single byte

(prev was: error FS1140: This byte array literal contains characters that do not encode as a single byte -> no count)
should be easier to debug

Note: That's again "highlight full byte string instead of exact occurrence" because check is after string is parsed and we finally know it's a byte string -- but not any more what exactly is inside that string (without reparsing again).
So I think adding a count is a slight additional hint what is wrong while not changing the lexer too much.
(and because I did count already for the > 127 warning, the count for > 255 error is basically available too)

Another possible enhancement: List invalid chars in a certain notations (for example in \u notation)

  • Pros:
    • Lists chars that are incorrect
  • Cons:
    • Might be difficult to process by user because chars in list might be different notation than written inside string
    • Might be a long list with invalid chars

Tests should be adjusted to new behaviour (maybe with exception of some legacy conformance tests? let's see what the the CI result say)

Edit:

CI error:

[xUnit.net 00:07:08.11]     FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols [FAIL]
  Failed FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols [142 ms]
  Error Message:
   Assert.Equal() Failure
Expected: 80
Actual:   62
  Stack Trace:
     at FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols() in /home/vsts/work/1/s/tests/FSharp.Compiler.ComponentTests/FSharpChecker/CommonWorkflows.fs:line 168
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

I think that's "just" some random error and not related to this PR (also only on "Linux" build :/ )

@Booksbaum Booksbaum marked this pull request as ready for review October 8, 2023 16:17
@Booksbaum Booksbaum requested a review from a team as a code owner October 8, 2023 16:17
@psfinaki
Copy link
Member

psfinaki commented Oct 9, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@abonie abonie requested review from KevinRansom and abonie August 12, 2024 17:07
Copy link
Contributor

github-actions bot commented Aug 12, 2024

❗ Release notes required


✅ Found changes and release notes in following paths:

Change path Release notes path Description
src/Compiler docs/release-notes/.FSharp.Compiler.Service/9.0.100.md

src/Compiler/FSComp.txt Outdated Show resolved Hide resolved
Copy link
Member

@psfinaki psfinaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should get this in. This really just fixes inconsistencies between different notations. Good stuff and solid testing.

@T-Gro T-Gro self-assigned this Aug 14, 2024
@T-Gro T-Gro enabled auto-merge (squash) August 15, 2024 08:01
src/Compiler/FSComp.txt Outdated Show resolved Hide resolved
src/Compiler/FSComp.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@edgarfgp edgarfgp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@psfinaki
Copy link
Member

/run xlf

Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
9 participants