Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use Unicode aware API and CurrentCulture in compiler #16066

Closed
wants to merge 10 commits into from

Conversation

OwnageIsMagic
Copy link

@OwnageIsMagic OwnageIsMagic commented Oct 1, 2023

Strictly speaking this is a breaking change, but it's for the good.

// Unfortunately bold six '𝟲' U+1D7F2 is too bold to fit into Char.
// So let's speak some Farsi. Width nine, precision seven.
> printfn "%۹.۷f" 3. ;;









                                                                      3.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
val it: unit = ()

// NOTE: THIS IS NOT FIXED
// U+0B68 Oriya Digit Two
> let two = ୨UL ;;

  let two = ୨UL ;;
  ----------^^^

stdin(202,11): error FS1150: This number is outside the allowable range for 64-bit unsigned integers

Reverts #6524 cc @forki.
Part of #12352 (somewhat)

RegexOptions.ECMAScript

For example, ECMAScript does not support language elements such as the Unicode category or block elements \p and \P. Similarly, the \w element, which matches a word character, is equivalent to the [a-zA-Z_0-9] character class when using ECMAScript and [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}] when using canonical behavior.

Choosing a StringComparison member for your method call

File paths. A non-linguistic identifier, where case is irrelevant. OrdinalIgnoreCase

This PR is not comprehensive (there is at least 1 instance of Char.IsDigit left).
Commits are logically separated, you can check each commit individually.

Fixed behaviour:

> printfn "%۹.⅘f" 3. ;;

  printfn "%۹.⅘f" 3. ;;
  --------^^^^^^^

stdin(1,9): error FS0741: Unable to parse format string 'Bad format specifier: '۹''

@OwnageIsMagic OwnageIsMagic requested a review from a team as a code owner October 1, 2023 12:41
@OwnageIsMagic
Copy link
Author

OwnageIsMagic commented Oct 1, 2023

resolved

2 tests failed. Investigating...

E_MulticasePartialNotAllowed01_fs - --test:ErrorRanges(compilation: E_MulticasePartialNotAllowed01.fs)
Simple - W_BindCaptialIdent_fs - --test:ErrorRanges(compilation: W_BindCaptialIdent.fs)

@@ -55,10 +55,10 @@ type PrimaryAssembly =
static member IsPossiblePrimaryAssembly(fileName: string) =
let name = System.IO.Path.GetFileNameWithoutExtension(fileName)

String.Compare(name, "mscorlib", true) <> 0
Copy link
Author

@OwnageIsMagic OwnageIsMagic Oct 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a bug, so I changed <> to = here. Or I did misunderstood something? This was introduced in #13870

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// We check the exported types of all assemblies, since many may forward System.Object,
// but only check the actual type definitions for specific assemblies that we know
// might actually declare System.Object.
match mdef.Manifest with
| Some manifest when
manifest.ExportedTypes.TryFindByName "System.Object" |> Option.isSome
|| PrimaryAssembly.IsPossiblePrimaryAssembly resolvedAssembly.resolvedPath
&& mdef.TypeDefs.ExistsByName "System.Object"
->
mkRefToILAssembly manifest |> Some
| _ -> None

based on calling context this bug was resulting in making any framework assembly as primary assembly equivalent, this may affect some esoteric .NET implementations.

@OwnageIsMagic
Copy link
Author

@vzarytovskii All tests are passing. I will fix CheckCodeFormatting with the next review feedback commit.

@psfinaki
Copy link
Member

@OwnageIsMagic I would be happy to review this next week - would you mind resolving the conflicts here? If you need any help with that, please let us know :)

@OwnageIsMagic
Copy link
Author

@psfinaki Sorry for delay. Most conflicts are from the similar #16439

@@ -371,7 +371,7 @@ let parseFormatStringInternal
// type checker. They should always have '(...)' after for format string.
let requireAndSkipInterpolationHoleFormat i =
if i < len && fmt[i] = '(' then
let i2 = fmt.IndexOfOrdinal(")", i+1)
let i2 = fmt.IndexOf(')', i+1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ordinal here?

Copy link
Author

@OwnageIsMagic OwnageIsMagic Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

char overloads are always Ordinal. My PR is larger than #16439 (that introduced IndexOfOrdinal ext method) and established usage of char overload.
+ potentially it's faster on netfx (on Core they both delegated to the same Span function)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -3756,7 +3756,7 @@ let writePdb (
let pdbfileInfo = FileInfo(pdbfile).FullName

// If pdbfilepath matches output filepath then error
if String.Compare(outfileInfo, pdbfileInfo, StringComparison.InvariantCulture) = 0 then
if outfileInfo = pdbfileInfo then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the particular motivation for this one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File systems usually doesn't normalize Unicode in paths, InvariantCulture does

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those paths are CLI arguments, the logic to compare them should stay case insensitive like it was, to keep the behavior the same.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@T-Gro it was InvariantCulture, not InvariantCultureIgnoreCase

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@T-Gro do you suggest to actually make it case insensitive?

@@ -925,7 +925,7 @@ module PrintTypes =
if not denv.includeStaticParametersInTypeNames then
None, args
else
let regex = System.Text.RegularExpressions.Regex(@"\`\d+")
let regex = System.Text.RegularExpressions.Regex(@"\`\d+", System.Text.RegularExpressions.RegexOptions.ECMAScript)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen this yet - which of course doesn't mean this is not a valid change - just wondering, is this a recommendation from Microsoft or, in other words, what is it consistent with?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See notes in OP.
By default Regex is Unicode aware, so char classes maps to Unicode categories and allow much more then expected here ASCII chars.

Copy link
Contributor

@brianrourkeboll brianrourkeboll Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a regular expression, this could also just be something like:

let tryGetTyparCount (s: string) =
    let indexOfBacktick = s.LastIndexOf '`'
    if indexOfBacktick >= 0 && indexOfBacktick < s.Length - 1 then
        match Int32.TryParse(s.AsSpan(indexOfBacktick + 1)) with
        | true, genericArgCount -> ValueSome genericArgCount
        | false, _ -> ValueNone
    else
        ValueNone

Copy link
Member

@T-Gro T-Gro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ilwrite behavior to comparing CLI args should not be changed.

Copy link
Contributor

❗ Release notes required

@OwnageIsMagic,

Caution

No release notes found for the changed paths (see table below).

Please make sure to add an entry with an informative description of the change as well as link to this pull request, issue and language suggestion if applicable. Release notes for this repository are based on Keep A Changelog format.

The following format is recommended for this repository:

* <Informative description>. ([PR #XXXXX](https://github.com/dotnet/fsharp/pull/XXXXX))

See examples in the files, listed in the table below or in th full documentation at https://fsharp.github.io/fsharp-compiler-docs/release-notes/About.html.

If you believe that release notes are not necessary for this PR, please add NO_RELEASE_NOTES label to the pull request.

You can open this PR in browser to add release notes: open in github.dev

Change path Release notes path Description
src/FSharp.Core docs/release-notes/.FSharp.Core/8.0.300.md No release notes found or release notes format is not correct
src/Compiler docs/release-notes/.FSharp.Compiler.Service/8.0.300.md No release notes found or release notes format is not correct

@OwnageIsMagic
Copy link
Author

Any help needed?

Copy link
Contributor

github-actions bot commented Mar 7, 2024

Caution

Repository is on lockdown for maintenance, all merges are on hold.

@KevinRansom
Copy link
Member

@OwnageIsMagic --- this PR seems to address a disparate set of issues.

  1. Some assembly name comparison is case sensitive and culture aware. These can and perhaps should be addressed.
  2. File system name comparison uses InvariantCulture. Handling file system comparisons invariant culture, started around about the release of V1.0 of the CLR, or perhaps shortly after, it was so long ago, I forget exactly, one of the big code pushes was to ensure that file system compares happened using Invariant culture to ensure that a Turkish file system would be handled consistently regardless of the UI Culture the application was using. Over the years the comparison APIs have changed, but the basic advice, I believe, has remained invariant culture comparisons for file systems.
  3. Assorted code changes that impact character handling and may be affected by unicode.

In general our test suites are American/English with unicode tests only in the places where we know that unicode handling is impactful; we are not confident that it will catch bugs introduced in PRs such as this, or future regressions we may re-introduce in the future. In order to ensure that the change correctly addresses an observable problem, would it be possible to create a specific test case that currently fails, issue a bug and provide a fix, this will ensure that in the future we don't regress any fixes supplied.

I am closing this PR, but please resubmitted PR's with regression test cases and targeted fixes.

Thanks

Kevin.

@OwnageIsMagic
Copy link
Author

OwnageIsMagic commented May 13, 2024

@KevinRansom Most fixes in this PR are in 3 categories.

1. Micro optimizations like using String.Equals(...) vs String.Compare(...) = 0, using char overload instead of string, etc.
I agree that this part is mostly irrelevant to this PR, but I was checking all instances of string comparison in code base and it just bothered me. Luckily it's in separate commit, so I can easily drop it.
2. Correctness
There is no way to test this (beside linking with custom Unicode lib).
like

         member spec.IsGFormat = 
-            spec.IsDecimalFormat || System.Char.ToLower(spec.TypeChar) = 'g'
+            spec.IsDecimalFormat || spec.TypeChar = 'g' || spec.TypeChar = 'G'

Currently there is no codepoint in Unicode beside 'G' that will produce 'g' after calling ToLower, but it doesn't mean we should use user-culture-Unicode conversion for this.
or

-          Regex(@"^(/|--)test:ErrorRanges$", RegexOptions.Compiled ||| RegexOptions.IgnoreCase)
+          Regex(
+              @"^(/|--)test:ErrorRanges$",
+              RegexOptions.Compiled
+              ||| RegexOptions.IgnoreCase
+              ||| RegexOptions.CultureInvariant
            )

Command line flags are literals and should not be compared with default CurrentCulture. But I can't imagine any test input that will uncover bug with current Unicode version.

3. Actual bugs
Mostly due to wrongly assuming Char.IsDigit only detects ASCII digits.

-            while (Char.IsDigit s.[i]) do
+            while (isDigit s[i]) do
                 let n = int s.[i] - int '0'

This one manifests itself in printf.
Or using Current/Invariant culture for path comparisons.
Most common file systems (I checked NTFS and ext4) do not normalize Unicode:

([char]0x2126).ToString().Equals( ([char]0x3a9).ToString(), 'CurrentCulture')
# True
([char]0x2126).ToString().Equals( ([char]0x3a9).ToString(), 'InvariantCulture')
# True

New-Item -Type File ([char]0x2126).ToString()
New-Item -Type File ([char]0x3a9).ToString()

ls
    Directory: D:\tst

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          14.05.2024     0:26              0 Ω
-a---          14.05.2024     0:25              0

It's 2 different files.


The only place where Unicode aware comparison (InvariantCulture) should be used is for identifiers in user code.
In all other places only Ordinal (and maybe OrdinalIgnoreCase) should be used. And absolutely no place for CurrentCulture as it breaks determinism. You need code analyzer, not regression tests.

@T-Gro
Copy link
Member

T-Gro commented May 14, 2024

I think this is a viable approach.

  1. A micro-optimization PR which does not fix nor change anything, only makes it faster

  2. Bugfixes with added tests (failing before, success after) - also important to keep them around as regression tests, just in case someone would want to go in the opposite direction several years later.

Formatted printing, CLI argument handling as well as file system ops should be coverable with tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants