Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Booksbaum · 2023-08-29T18:23:30Z

Fix #15867: Support \U & \x in char byte

These now all work: 'a'B; '\097'B; '\x61'B; '\u0061'B; '\U00000061'B
Compiler error when > 127 (This is not a valid byte literal), just like in other notations

Fix #15868: Char byte in decimal notation can be > 127

Now: '\250'B -> error FS1157: This is not a valid byte literal, just like in other notations

Fix #15869: In String: decimal char can be > 255 (and gets wrapped to <256)

Now: "\937" -> This is not a valid character literal
Applies to Byte String too: "\937"B -> This is not a valid character literal
Note: I reused the error message for chars. Which is logical ok, but: Error on String spans the whole string, not just the invalid part. That's the case for other Strings errors too -- but the other errors at least mention its inside the string (This byte array literal contains characters that do not encode as a single byte) or mention the wrong part (\U1100FFFA is not a valid Unicode character escape sequence). Should I change the error to include the invalid trigraph?
Edit: Error message now mentions trigraph: '\937' is not a valid character literal
Edit: Error span is now limited to just the trigraph (fixed for \U too):
```
> "foo\937bar";;
  ----^^^^
stdin(1,5): error FS1252: '\937' is not a valid character literal
```

Note:
Currently Byte String enforces a char to fit into a single byte -- but does not require value to be < 128 like it's the case for Byte Char:

> 'ú'B;;
  ^^^^
stdin(2,1): error FS1157: This is not a valid byte literal

> "ú"B;;
val it: byte array = [|250uy|]

I don't know if that's intended or not:

There's a helper to check if a string is a correct bytes string: Checks every char (2 bytes) is just a single value byte. The comment for the function also says: Further check each low byte <= 127 -- but doesn't actual check it. So I guess limit <128 is indented and the check is just missing

fsharp/src/Compiler/SyntaxTree/LexHelpers.fs

Lines 225 to 234 in 97a5b65

    
           /// Sanity check that high bytes are zeros. Further check each low byte <= 127 
        
           let stringBufferIsBytes (buf: ByteBuffer) = 
        
               let bytes = buf.AsMemory() 
        
               let mutable ok = true 
        
               for i = 0 to bytes.Length / 2 - 1 do 
        
                   if bytes.Span[i * 2 + 1] <> 0uy then 
        
                       ok <- false 
        
               ok

Another comment for another helper function explicitly mentions byte values > 127 -- indicating full byte range should be possible

fsharp/src/Compiler/SyntaxTree/LexHelpers.fs

Lines 149 to 154 in 97a5b65

    
           /// When lexing bytearrays we don't expect to see any unicode stuff. 
        
           /// Likewise when lexing string constants we shouldn't see any trigraphs > 127 
        
           /// So to turn the bytes collected in the string buffer back into a bytearray 
        
           /// we just take every second byte we stored.  Note all bytes > 127 should have been 
        
           /// stored using addIntChar 
        
           let stringBufferAsBytes (buf: ByteBuffer) =

Note: it mentions limit < 128 just for trigraphs. That's actually not enforceable with current string parsing: We only know at the end of a string if it's a Byte String or not -- and at this point we don't know if a value was a trigraph or something else.
So it's either all must be < 128 or all can be > 127

I've written some tests expecting all values to be <128, but have not adjusted the lexer to produce an error when a value is above.
-> There are currently some failing tests in ByteStrings (-> reason this PR is marked as Draft)

Depending on what behaviour is wanted I'll either adjust the tests or the lexer

baronfel · 2023-08-29T18:27:31Z

Should I change the error to include the invalid trigraph?

IMO yes, that would be amazing, especially if the error range only highlighted the incorrect portions of the input string. Visually scanning a string can be tricky, and having a more narrow range could allow for a codefix here.

Happypig375 · 2023-08-29T18:30:19Z

Yea, as @baronfel wrote, modifying the error range to be specific would be even better than including the invalid character in the error message.

Booksbaum · 2023-08-29T18:48:34Z

modifying the error range to be specific would be even better

yes .. but that requires more change. So I use the same error reporting used for other string errors -- which unfortunately spans the complete string (which might lead to multiple errors reported on same range: "\U1100FFFA-\U1100FAFA-\U1100FAFF" -> three errors spanning the complete string)

vzarytovskii · 2023-08-29T19:19:23Z

I'm pretty sure it's fine, but just to be safe, we may want to see that new changes reflect what language spec says. If it extends it, it will probably requite RFC.

Booksbaum · 2023-08-29T19:38:13Z

we may want to see that new changes reflect what language spec says. If it extends it, it will probably requite RFC.

As far as I see it: This PR doesn't extends the specs in any way but instead fixes things that aren't according to specs (or unclear).

I went mostly by this two documents: Literals and Strings
So for example: Limit trigraph to <256 is clearly a fix because docs state: \DDD (where D indicates a decimal digit; range of 000 - 255; for example, \231 = "ç")

But for Byte Chars & Strings these two docs don't really explain what range the values should be. Maybe except by calling them ASCII byte and ASCII string (-> just values < 128)?
So I went with what's currently in code -- which is unfortunately different for byte char (<128) and byte string (<256).

So whatever the upper border should be -- it doesn't really introduce a new change.

Booksbaum · 2023-08-30T12:02:31Z

especially if the error range only highlighted the incorrect portions of the input string.

Done (for \U too)
(was just some incorrect code order. I thought that would be something more involved.)

But Note: For chars only invalid in byte string it's still the complete String:

> "foo\U000003C0bar";;
val it: string = "fooπbar"

> "foo\U000003C0bar"B;;
  ^^^^^^^^^^^^^^^^^^^
stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte

Reason: We only now at end of string if it's a Byte String. At this point we don't have direct/simple access to the invalid element and its type or location any more -> we know the value and can validate it -- but don't know exactly where it's located

src/Compiler/lex.fsl

Booksbaum · 2023-09-07T11:20:34Z

Errors reduced to Warnings and messages updated (see #15898 (comment))

Note: In lex.fs: I marked the locations with Warnings that should later be promoted to Error with // TODO: Promote to Error and noted the necessary steps to produce correct Error

How to proceed with Bytes inside 128..255 range for ASCII Byte Strings ("ú"B = [|250uy|])? (see end of #15898 (comment))
Keep it like it is (-> accepts 0..255 -- like in normal String but unlike ASCII Byte Char which requires < 128y) or produce warning (error later) (like it's done in ASCII Byte: This is not a valid ASCII byte literal. Value should be < 128y.)?

Fix dotnet#15867: Add `\U` & `\x` for ASCII Byte Fix dotnet#15868: Fix: ASCII Byte Decimal notation accepts value `> 127`

Note: Values between `128` and `255` are still valid in Byte String Array (-> several tests in `ByteStrings` fail)

prev: ```fsharp > "foo\U12345678bar";; ^^^^^^^^^^^^^^^^^^ stdin(1,1): error FS1245: \U12345678 is not a valid Unicode character escape sequence ``` now: ```fsharp > "foo\U12345678bar";; ----^^^^^^^^^^ stdin(1,5): error FS1245: \U12345678 is not a valid Unicode character escape sequence ``` Note: In Byte Strings that's only the case for invalid chars (-> invalid in normal string too), but not for chars invalid only inside byte string: ```fsharp > "foo\U000003C0bar";; val it: string = "fooπbar" > "foo\U000003C0bar"B;; ^^^^^^^^^^^^^^^^^^^ stdin(8,1): error FS1140: This byte array literal contains characters that do not encode as a single byte ``` Reason: We only now at end of string if it's a Byte String. At this point we don't have direct/simple access to the invalid element (and its notation) any more -> we know the value and can validate it -- but don't know exactly where it's located

Change `Error` to `Warning` for cases current F# succeeds, but are now Error (in prev commits introduced): * Trigraph ASCII Byte when between inside `128..255`: ```fsharp // prev > '\973'B;; ^^^^^^^ stdin(11,1): error FS1157: This is not a valid byte literal > '\250'B;; val it: byte = 250uy // now > '\973'B;; ^^^^^^^ stdin(2,1): error FS1157: This is not a valid ASCII byte literal. Value must be < 128y. > '\250'B;; ^^^^^^^ stdin(3,1): warning FS1157: This is not a valid ASCII byte literal. Value should be < 128y. Note: In a future F# version this Warning will be promoted to Error! val it: byte = 250uy ``` * Trigraph string when `> 255` ```fsharp // prev > "\937";; val it: string = "©" // now > "\937";; -^^^^ stdin(4,2): warning FS1252: '\937' is not a valid character literal. Note: Currently the value is wrapped around byte range to '\169'. In a future F# version this Warning will be promoted to Error! val it: string = "©" ``` Additional for these case: Add Note about Promoting to Error in future F# version -> in `lex.fsl`: `//TODO` with note how to promote to `Error` Note: I changed the message for `lexInvalidByteLiteral` (incl. rename to `lexInvalidAsciiByteLiteral`). `lexInvalidByteLiteral` was previously translated (-> in `FSComp.txt.XXX.xlf` files), but now isn't any more even though most of the message remained the same: * Prev: `This is not a valid byte literal.` * Now: `This is not a valid ASCII byte literal. Value should be < 128y.`

>= 256 (more than 1 byte) are already an error ```fsharp let _ = "ä"B // ^^^^ // This byte array literal contains 1 non-ASCII characters. All characters should be < 128y. // Note: In future F# versions this Warning might be promoted to Error! ``` Note: Issue with trigraph and > 255: ```fsharp let _ = "\973"B // ^^^^ // Warning: '\973' is not a valid character literal. // Note: Currently the value is wrapped around byte range to '\169' [...] // // ^^^^^^^ // Warning: This byte array literal contains 1 non-ASCII characters. ``` -> Two warnings for same trigraph: * Warning for trigraph `\973` > 255. Detected while parsing trigraph. Don't know yet if String or Byte String. * warning spans just trigraph * Warning for wrapped value `\169` > 127. Detected while checking Byte String for correct values. Don't have any infos about value (range & notation) any more. * warning spans full byte string -> Each Warning is emitted in a different parsing steps. -> No easy way to reduce to just one warning. However: Both warnings are correct and warn about a different issues (albeit in same value) And: should get automatically reduced once out-of-trigraph-range gets promoted to error **Enhancement**: Count number of invalid chars (2-bytes, >128) in byte array ```fsharp let _ = "ΩäΩüΩ"B ``` * prev: `This byte array literal contains characters that do not encode as a single byte` * now: `This byte array literal contains 3 characters that do not encode as a single byte` && `This byte array literal contains 2 non-ASCII characters. All characters should be < 128y.` Note: When checking valid bytes in Byte String, there's no direct mapping to range and notation any more. -> cannot highlight exact byte string part, but only full byte string. * Possible Enhancement?: List invalid chars in a certain notation? For example in `\u` notation. * Pro: * Gives an idea which chars are incorrect * Con: * Might be difficult to process by user because output might be different notation than written in string * Might be a long list with invalid chars

Booksbaum · 2023-10-08T13:45:15Z

Ok, no answer how to handle 128..255 in ASCII Byte String.

So I decided to align it with ASCII char byte: Emits now a warning (NOT error -> compilation still works):

> "ä"B;;
val it: byte array = [|228uy|]

now:

> "ä"B;;
  ^^^^
  warning FS1253: This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.

val it: byte array = [|228uy|]

Note: There's a small flaw with trigraphs > 255 & their % 256 > 127:

> "\973"B;;

  "\973"B;;
  -^^^^

stdin(3,2): warning FS1252: '\973' is not a valid character literal.
Note: Currently the value is wrapped around byte range to '\205'. In a future F# version this Warning will be promoted to Error!


  "\973"B;;
  ^^^^^^^

stdin(3,1): warning FS1253: This byte array literal contains 1 non-ASCII characters. All characters should be < 128y.

val it: byte array = [|205uy|]

-> TWO warnings:

trigraph > 255 -> warning & wrapping to \205
wrapped value > 127 -> warning

These two are checked at different parsing steps with different infos available:

> 255: while parsing of trigraph
- Know trigraph and where, but not if String or Byte String
- -> warning spans exactly trigraph
> 127: while checking valid ASCII Byte in string
- Know String or Byte String and byte value, but not source notation or location
- -> warning spans complete Byte String

-> two warnings with different ranges.

I think that's ok because:

rare
both are correct warnings
should automatically get resolved once trigraph > 255 is promoted to error (at least I think)

I also updated the error message for invalid byte char (two-byte long instead of just one) by adding the number of faulty characters:

> "Ω --- Ω --- Ω  --- Ω --- Ω"B;;
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  error FS1140: This byte array literal contains 5 characters that do not encode as a single byte

(prev was: error FS1140: This byte array literal contains characters that do not encode as a single byte -> no count)
should be easier to debug

Note: That's again "highlight full byte string instead of exact occurrence" because check is after string is parsed and we finally know it's a byte string -- but not any more what exactly is inside that string (without reparsing again).
So I think adding a count is a slight additional hint what is wrong while not changing the lexer too much.
(and because I did count already for the > 127 warning, the count for > 255 error is basically available too)

Another possible enhancement: List invalid chars in a certain notations (for example in \u notation)

Pros:
- Lists chars that are incorrect
Cons:
- Might be difficult to process by user because chars in list might be different notation than written inside string
- Might be a long list with invalid chars

Tests should be adjusted to new behaviour (maybe with exception of some legacy conformance tests? let's see what the the CI result say)

Edit:

CI error:

[xUnit.net 00:07:08.11]     FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols [FAIL]
  Failed FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols [142 ms]
  Error Message:
   Assert.Equal() Failure
Expected: 80
Actual:   62
  Stack Trace:
     at FSharpChecker.CommonWorkflows.GetAllUsesOfAllSymbols() in /home/vsts/work/1/s/tests/FSharp.Compiler.ComponentTests/FSharpChecker/CommonWorkflows.fs:line 168
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

I think that's "just" some random error and not related to this PR (also only on "Linux" build :/ )

psfinaki · 2023-10-09T10:28:53Z

/azp run

azure-pipelines · 2023-10-09T10:29:09Z

Azure Pipelines successfully started running 2 pipeline(s).

github-actions · 2024-08-12T17:08:13Z

❗ Release notes required

✅ Found changes and release notes in following paths:

Change path Release notes path Description

src/Compiler docs/release-notes/.FSharp.Compiler.Service/9.0.100.md

src/Compiler/FSComp.txt

psfinaki

I think we should get this in. This really just fixes inconsistencies between different notations. Good stuff and solid testing.

src/Compiler/FSComp.txt

edgarfgp

Thank you

Co-authored-by: Brian Rourke Boll <[email protected]>

psfinaki · 2024-08-19T15:37:04Z

/run xlf

Co-authored-by: psfinaki <[email protected]>

github-actions · 2024-08-19T15:37:59Z

Ran xlf: https://github.com/dotnet/fsharp/actions/runs/10456450788

vzarytovskii reviewed Sep 5, 2023

View reviewed changes

src/Compiler/lex.fsl Outdated Show resolved Hide resolved

Booksbaum force-pushed the CharByte branch from 5984f26 to f552724 Compare September 7, 2023 11:19

Booksbaum mentioned this pull request Sep 18, 2023

Add CodeActions for Number Constants: Convert between bases, Add digit group separators ionide/FsAutoComplete#1167

Merged

Booksbaum added 6 commits October 8, 2023 11:32

Fix: ASCII byte: Add \U & \x; Limit Dec range

460f4af

Fix dotnet#15867: Add `\U` & `\x` for ASCII Byte Fix dotnet#15868: Fix: ASCII Byte Decimal notation accepts value `> 127`

Fix dotnet#15869: Decimal char in String can be > 255

0f740cb

Note: Values between `128` and `255` are still valid in Byte String Array (-> several tests in `ByteStrings` fail)

Enhancement: Mention invalid trigraph in error for string

3e954ed

Booksbaum force-pushed the CharByte branch from f552724 to 90f0111 Compare October 8, 2023 13:43

Booksbaum added 3 commits October 8, 2023 15:47

formatting

738f563

Adjust test: Error range is now exact char, not complete string

b7052de

Adjust test: Error msg now contains number of invalid chars

1009d6f

Booksbaum marked this pull request as ready for review October 8, 2023 16:17

Booksbaum requested a review from a team as a code owner October 8, 2023 16:17

Merge branch 'main' into CharByte

075dc20

abonie requested review from KevinRansom and abonie August 12, 2024 17:07

Merge branch 'main' into CharByte

8a56520

psfinaki reviewed Aug 14, 2024

View reviewed changes

src/Compiler/FSComp.txt Outdated Show resolved Hide resolved

psfinaki approved these changes Aug 14, 2024

View reviewed changes

T-Gro self-assigned this Aug 14, 2024

T-Gro approved these changes Aug 15, 2024

View reviewed changes

T-Gro enabled auto-merge (squash) August 15, 2024 08:01

brianrourkeboll reviewed Aug 15, 2024

View reviewed changes

src/Compiler/FSComp.txt Outdated Show resolved Hide resolved

src/Compiler/FSComp.txt Outdated Show resolved Hide resolved

edgarfgp approved these changes Aug 17, 2024

View reviewed changes

psfinaki and others added 3 commits August 19, 2024 17:32

Update src/Compiler/FSComp.txt

63c4c9f

Co-authored-by: Brian Rourke Boll <[email protected]>

Update FSComp.txt

84c4dfc

Update FSComp.txt

9c06eb3

Automated command ran: xlf

abf48a6

Co-authored-by: psfinaki <[email protected]>

psfinaki added 7 commits August 19, 2024 17:53

Merge branch 'main' into CharByte

5f41360

Update 9.0.100.md

9038d70

Update CharByteLiterals.fs

84ac97d

up

d4905e1

Merge branch 'main' into CharByte

d2d4b5f

up

babcd57

up

6057116

T-Gro merged commit eeda5fd into dotnet:main Aug 20, 2024
30 checks passed

This was referenced Aug 28, 2024

[Automated] PRs inserted in VS build main-35228.30 #17623

Closed

[Automated] PRs inserted in VS build feature.debugger.main-35228.301 #17632

Closed

dotnet-bot mentioned this pull request Sep 14, 2024

[Automated] PRs inserted in VS build feature.debugger.shadowDebug-35313.170 #17728

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Booksbaum commented Aug 29, 2023 •

edited

Loading

baronfel commented Aug 29, 2023

Happypig375 commented Aug 29, 2023

Booksbaum commented Aug 29, 2023

vzarytovskii commented Aug 29, 2023

Booksbaum commented Aug 29, 2023

Booksbaum commented Aug 30, 2023

Booksbaum commented Sep 7, 2023

Booksbaum commented Oct 8, 2023 •

edited

Loading

psfinaki commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

github-actions bot commented Aug 12, 2024 •

edited

Loading

psfinaki left a comment

edgarfgp left a comment •

edited

Loading

psfinaki commented Aug 19, 2024

github-actions bot commented Aug 19, 2024

	/// Sanity check that high bytes are zeros. Further check each low byte <= 127
	let stringBufferIsBytes (buf: ByteBuffer) =
	let bytes = buf.AsMemory()
	let mutable ok = true

	for i = 0 to bytes.Length / 2 - 1 do
	if bytes.Span[i * 2 + 1] <> 0uy then
	ok <- false

	ok

	/// When lexing bytearrays we don't expect to see any unicode stuff.
	/// Likewise when lexing string constants we shouldn't see any trigraphs > 127
	/// So to turn the bytes collected in the string buffer back into a bytearray
	/// we just take every second byte we stored. Note all bytes > 127 should have been
	/// stored using addIntChar
	let stringBufferAsBytes (buf: ByteBuffer) =

Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Fix #15867, #15868, #15869: Add missing byte chars notations, enforce limits in decimal notation in byte char & string #15898

Conversation

Booksbaum commented Aug 29, 2023 • edited Loading

baronfel commented Aug 29, 2023

Happypig375 commented Aug 29, 2023

Booksbaum commented Aug 29, 2023

vzarytovskii commented Aug 29, 2023

Booksbaum commented Aug 29, 2023

Booksbaum commented Aug 30, 2023

Booksbaum commented Sep 7, 2023

Booksbaum commented Oct 8, 2023 • edited Loading

psfinaki commented Oct 9, 2023

azure-pipelines bot commented Oct 9, 2023

github-actions bot commented Aug 12, 2024 • edited Loading

❗ Release notes required

psfinaki left a comment

Choose a reason for hiding this comment

edgarfgp left a comment • edited Loading

Choose a reason for hiding this comment

psfinaki commented Aug 19, 2024

github-actions bot commented Aug 19, 2024

Booksbaum commented Aug 29, 2023 •

edited

Loading

Booksbaum commented Oct 8, 2023 •

edited

Loading

github-actions bot commented Aug 12, 2024 •

edited

Loading

edgarfgp left a comment •

edited

Loading