-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CultureInfo.TextInfo.ListSeparator broken in .Net 5 #43795
Comments
@olmobrutall In .NET 5.0 we have switched to depend on the ICU instead of NLS Win32 APIs. Usually ICU behavior is more correct as it picks its globalization data from CLDR (Unicode Standard). Please have a look at the doc https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu for more details about the change in .NET 5.0. We offer a config switch too to allow apps revert back to old behavior if needed too. the doc has details how you can use the config switch. Another point, Globalization data is not considered constant and can change. It is not right to take any dependency on the Globalization data or assume it will never change. If you see any data doesn't make sense or is not correct, the issue can be raised to the CLDR and can be tracked to be discussed and fixed if it is real issue. Let me know if you have any question. |
Sorry @tarekgh, but to me this surely looks like a bug. runtime/src/libraries/Native/Unix/System.Globalization.Native/pal_localeStringData.c Line 246 in 0e402bc
you can see that it isn't even implemented (falls through to ThousandsSeparator). So either this is a bug in .Net Core or the info is generally not available in ICU, which would then still be a bug (And would not be an issue for CLDR if it is not part of it). As native german I can tell that using . as a list separator does not make any sense. |
After checking our code this would be a .Net 5 blocker for us for the majority of our software products unless we would remove all uses of ListSeparator. A ListSeparator has no connection of any kind (and should not have) to the ThousandsSeparator. |
Thanks @ANahr, nice the you found the code. I've checked .Net Core 3.1 behaviour to see if ListSeparator can be inferred from DecimalsSeparator somehow.
Doesn't look like it can easily be inferred, there are even a few cultures where they use the same ListSeparator and NumberDecimalSeparator. |
@olmobrutall @ANahr thanks for your analysis. To clarify, we have switched to use ICU on Windows. Unfortunately, ICU doesn't have a property matching ListSeparator as Windows designed it. That is why for now we are using the Thousand separator as a fallback for List separator. I agree this is not the best and we need to enhance this experience. We are looking at if we can derive the list separator from ICU pattern separator but need more investigation to ensure if this will be satisfying outcome. @KalleOlaviNiemitalo thankfully linked the other issue tracking the exact problem we are talking about here. as we are in very late stage of 5.0 release, it will be very risky to change that now. For sure we'll address this in the next release.
We always recommend developers to not take any dependency on the Globalization data nor assuming the data is never change. Globalization data can change at anytime. You may look at the Shawn Steele blog who is in Windows team describing that. We always recommend to use Invariant culture for formatting and parsing back any data has Globalization properties. Otherwise, the parsing can break at anytime when Globalization data changes. If it is really an issue for you, you have the option to use the System.Globalization.UseNls config switch to go back old behavior. Or you can customize the used TextInfo to set the ListSeparator to the value that prevent breaking. or do something like teh following extension method: public static string GetListSeparator(this TextInfo textInfo) => string.IsNullOrWhiteSpace(textInfo.ListSeparator) ? ";" : textInfo.ListSeparator; by the way, does your product currently support running on Linux? Last, I want to say the current behavior (which I agree is not the best) is used on Linux for awhile now. I agree with you we should do something here. We are really appreciating your feedback and please let me know if the current plan I mentioned is reasonable for you. |
I have been using ListSeparator for human-readable output, never for parsing. I suspect that hardcoding |
While I have applications running in Docker for production, parsing Csv files is typically done in ETL processes that we run in windows, so it has gone unnoticed for now. About For now I use: public static string GetListSeparator(this CultureInfo culture) => string.IsNullOrWhiteSpace(culture.NumberInfo.NumberDecimalSeparator) == ',' ? ";" : ","; This works for the first 180 cultures (english group) and the this group of 375 (europe group) but leaves the other cultures broken. Can someone from this groups check what excel does when you export to CSV? |
It will be good to fix that now :-)
you can apply any customization logic to
Could you tell more about the broken cases when doing
I am wonder about your scenario now. why you don't use a fixed list separator value (like the one in Invariant culture) across all data which will guarantee will work all the time across all cultures. |
Hi @tarekgh, If you use Excel in Spain or Germany (any most (all?) of the continental Europe) and export to CSV instead of:
you get
Notice how the decimal numbers use I'm sure there are ways to configure Excel to export the english way, but CSV is typically used for one time data-loading scenarios connecting different departments, provided by the customer or other third party company, downloaded from internet, etc.. it is very convenient to be able to set the culture once, and the CSV library switches number format, date format, and separator. Csv.ReadFile<ProductInfoCsv>("productos.csv", culture: CultureInfo.GetCulture("es-ES")).Select(prod => ...) By using the code that I mentioned (ups... corrected now): public static string GetListSeparator(this CultureInfo culture) => culture.NumberInfo.NumberDecimalSeparator == ',' ? ";" : ","; This will work for the European cultures and for the English cultures, but if someone from the second group (for example latin america
(notice the And my simple heuristic won't work, because it will assume that is english style and expect @KalleOlaviNiemitalo I don't think using Remember that in Europe we use Also, checking for |
@olmobrutall your logic is already broken. it is very possible can use a separator which not necessary match what is stored in the culture. Excel settings allow that. I believe to be able to get what Excel using is to call Excel APIs something like The other idea could be, always include some fixed data in the excel sheet which you can derive always derive the separator from. for example, if the headers always constant like |
... it’s not broken. We have used it in hundred of different CSV (mainly Europe and English, this is right). Yes, you can configure excel to do non standard stuff, and if we ever need to parse some heterogeneous CSV like that we can add some configuration for it, but that didn’t happen yet. About checking Application.International(xlListSeparator), we very often parse Csv that are generated in another machine, sometimes the same application parses CSV from different cultures. Maybe we parse the CSV in an machine without excel installed. We just provide some extra configuration to the CSV library, like the encoding, the culture, how many header lines to ignore (typically 0 or 1). You could use some heuristics to try to detect this information, but this is error prone and potentially slow. Excel tries to do it an always gets it wrong :) |
@olmobrutall I cannot think in any solid information more than capturing the data you need in the collected CSV document and then use it to parse it. If you are providing the culture to CSV, why you don't provide Invariant culture all the time to guarantee specific formats? anyway, I don't think anything we do in the .NET here will help you much as the problem looks more specific to how you parse Excel data.
If you can have always some fixed data in the doc, then it shouldn't be error prone and should be very predictable. Considering the discussion here, do you mind we close this issue now as we have some issue tracking fixing |
Why
You don't control the source, for example imagine you want to import the CSV files in this page: https://portalestadistico.com/?pn=portalestadistico&pc=AAA00&idp=10003&idpl=100004&idioma=
I already have a solution that is... Ok(ish) for me. But the current behavior of |
I don't think it is a good idea to throw or mark it obsolete. we are going to fix it in the next releases anyway. Anyone can switch back to the old behavior anyway using the config switch. And the current behavior is already used on Linux for long time now. Thanks for all your feedback and discussion and feel free to ping us if you get any more questions. |
As much as I hate this option but even throwing a PlatformNotSupportedException would be far better than the current situation. I have seen three types of usage in our code and the current situation breaks all of them:
3a) Excel-"Interfacing" through "excel.exe export.csv": This currently works flawlessly because excel uses the same system list separator. This would be very inconvenient to implement yourself because it would require native dependecies to get user locale overrides (which excel also uses). 3b) Excel-Interfacting though COM-Automation: Although Excel offers functions to get internationalization these are known to not work in lots of situations (e.g. multilanguage Excel install but lots more) whereas list separator basically always works (unless VERY stupid user locale overrides, in which Excel itself will not work correctly anyways) Considering the "workarounds":
|
This is the effect of switching to use ICU. You still have the option to go back to the old behavior by using the config switch if you don't like the new behavior. as I mentioned before we'll look enhancing this in next release.
This is not true. running .NET Core on Linux was always has this behavior.
I am not sure how this is going to help at all? if someone run into this exception, what you think how they can handle it? and what about the libraries that call it and you don't have control over it? this will be much worse.
Well you are expecting we .NET handle what to expect from Excel as a separator. as I mentioned before whatever you do in .NET will not make this robust as the design to parse any data from excel using .NET separator is wrong. Out guidelines always to have predictable formats that you ensure can be parsed. Invariant is what gives that. If Invariant cannot be used in such cases, the apps/libs should have a way to ensure what format used in Excel.
I am wondering how did you have running on Linux before which has the exact case we are talking about here?
I wouldn't recommend that either but at least you can add your own fallback mechanism when using it. |
@tarekgh this might be worth adding to https://github.com/dotnet/core/blob/master/release-notes/5.0/5.0-known-issues.md |
@danmosemsft I'll add something there. Thanks! |
@danmosemsft I added Preview 4 section in the doc and listed this separator issue under it. |
I have been looking at CLDR/ICU data to see what separator can be used for different cultures. I found the following which looks more reasonable than what currently NLS is providing. So I want your feedback regarding making .NET 5.0 use such data. Please advise if you see any problem if we do that. All cultures uses
Just for the record, here is what NLS currently returning, I think CLDR data would be better to use instead:
|
Just curious @tarekgh what part of CLDR does that come from? |
Every locale has a properties listed under <listPattern>
<listPatternPart type="start">{0}, {1}</listPatternPart>
<listPatternPart type="middle">{0}, {1}</listPatternPart>
<listPatternPart type="end">{0}, and {1}</listPatternPart>
<listPatternPart type="2">{0} and {1}</listPatternPart>
</listPattern> You can reach the data if you download the CLDR http://unicode.org/Public/cldr/38/cldr-common-38.0.zip then open the What I actually did is I called ICU APIs that format the list and then I extracted the separator for every locale from the API output. let me know if you need more details. |
FWIW, you can also find that data here: https://github.com/unicode-org/icu/tree/master/icu4c/source/data/locales i.e, for en locale: https://github.com/unicode-org/icu/blob/master/icu4c/source/data/locales/en.txt#L2014-L2019 |
@ANahr, is the .NET 5 ListSeparator actually wrong for Excel COM interop? I expect this scenario would always use CultureInfo.CurrentCulture rather than look up a culture by name; then, TextInfo.ListSeparator gets the user override from [HKEY_CURRENT_USER\Control Panel\International] |
@tarekgh what's wrong with Number symbols list element (https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols)? Because for the few European cultures I checked that have ';' as a list separator on Windows, this property is also ';' in CLDR. Also, the list pattern you mention is a generic list separator for itemizing AFAICT, whereas sList/ListSeparator is used to group numbers (where ',' is not possible for European cultures, due to that character already being the decimal separator -- and I can attest to this European usage of ”;” in case of ambiguities in case of writing number sequences and intervals). The fact that all other characters in the Numbers section in the regional settings also correspond to the symbols section from CLDR also points at that particular solution. Yes, ListSeparatorn is wonderfully vague on what kind of lists it is talking about, but judging from the context of the regional settings and its use by lots of software this way (Excel, CSV), I think the list symbol makes much more sense, and is probably mostly backward compatible with NLS data. |
CLDR 38 has a semicolon Anyway, I now think the CLDR number list separator is a better fit for TextInfo.ListSeparator than what one could extract from list patterns, because it does not end with a space. An API for linguistic list formatting using CLDR list patterns could then be added in .NET 6 or 7 as a separate issue. (The country-dependent list separator was apparently added to INT 21H AH=38H in MS-DOS 3.0, and only had room for one byte of text and a null character.) |
@rubenprins @KalleOlaviNiemitalo TextInfo.ListSeparator is not for numbers only. it can separate any kind of data. This is why list pattern would fit better. Also if you look at https://unicode.org/reports/tr35/tr35-numbers.html#Number_Symbols it is clearly says
Which means this number separator is not for linguistic while TextInfo.ListSeprator is a cultural property which should be linguistic. That is exactly match the list pattern in CLDR. As @KalleOlaviNiemitalo indicated, even CLDR number separator will not be compatible with NLS. If really non-linguistic separator needed here, then I would say let it be To reiterate on CSV/Excel, I am still not seeing a way can guarantee ListSeparator will make this scenario reliable. This is wrong usage of ListSeparator and apps should depend on the real Excel doc separator. We'll keep ListSeparator read the user override for the current culture so most of the scenarios (as @KalleOlaviNiemitalo thankfully pointed at) like PoswerShell and COM interop will continue work as it used to be. If people prefer the NLS compatibility, we can keep this but this is the least preferable to me because this is Windows specific and also this data can change at any time too in Windows. I am still thinking the data I mentioned #43795 (comment) would be the most sensible data we can return for this separator. what you think? |
To move forward, I would like to get your feedback which of the following options would be preferable for your scenarios:
As I mentioned before I think the first option make more sense but will still be different than NLS and will not be fully compatible. |
If I didn't hear from you, I'll go with the NLS data to keep the compatibility and in the future releases we can expose new APIs if needed for list patterns. |
Hi @tarekgh. Not sure if you referred to me but NLS looks better to me too. It is possible to deliver this behavior in Unix? Or the behavior will be different in Unix and Windows? |
BTW, I just tested in exporting a CSV with Excel in my wife MacBook Pro configured with German and Spanish cultures, the results are consistent with Excel in Windows:
What about asking Office team what are they using to get the |
The idea is to deliver the behavior in both Windows and Unix. we'll hardcode the data and use it for all platforms. Thanks for trying your scenario on Mac. Usually Office carry some data for that but I'll double check anyway. |
IMHO the most unproblematic choice would be to keep NLS hard-coded Second one would be use CLDR list SYMBOL (e.g see https://github.com/unicode-org/cldr/blob/d1c59aeea6a26cd1018e164e8c470e4873925cac/common/main/en.xml#L4307). CLDR list pattern does not make any sense to me because we look for a symbol, not for a pattern which can be arbitrarily complex. Third one would be hard code something that would universally work (should not be something that EVER collides with decimal, so this cannot be ',') maybe ';', but I didn't check if that collides somewhere. |
That is what we decided to go with. I think this would address all currently broken scenarios. |
This now fixed by the PR #44823 |
As part of updating of updating the Csv class in Signum.Utilities to .Net 5, I've realize that the value for
ListSeparator
has changed for many cultures that use,
instead of.
for decimals separators.netcoreapp3.1
net5 rc2
Both tests in the same machine.
Previously I was using
ListSeparator
as a way to detect the separator that will be used in a Csv file. Is this the intended purpose?The text was updated successfully, but these errors were encountered: