[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303
Comments
Thank you @revans2 for documenting this. I'm a bit surprised that libcudf is returning a
I would expect this to be valid with and without single-quote normalization. Would you please help me understand what I'm missing?
Hello @revans2, escaping of double quotes is expected behaviour when single quote normalization is enabled to handle the case of double quotes being present within single quotes. For example
If there are escaped single quotes within a single quoted string in the input, then the current behaviour of the quote normalization FST is to remove the backslash escapes since we no longer need them. For example
If I understand correctly, the quote normalizer is expected to retain the backslash i.e.
I am sorry. I think I have been moving too quickly for my own good. I should have written some tests as repro cases for this before filing the issue. I think there is a mismatch between the escape handling in the parser and the single quote pre-processing.
The first test passes, which is what Spark would expect to see when single quote support is disabled, but when it is enabled (which is the default in Spark) nothing changes. I think that the problem is that anything in double quotes is not being processed by the quote normalization code. So this was a missed requirement on my part.
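To illustrate why leaving `\'` untouched inside a double-quoted string matters, here is a small sketch that uses Python's standard `json` module as a stand-in for a strict JSON parser. This is an assumption for illustration only: it is neither libcudf's reader nor Spark's parser, and the helper name `parses` is hypothetical.

```python
import json

# A backslash-escaped single quote inside a double-quoted string.
# The raw string keeps the backslash, so the text is: {"a": "it\'s"}
raw = r'{"a": "it\'s"}'


def parses(text: str) -> bool:
    """Return True if a strict JSON parser accepts the text (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(raw))                      # False: \' is not a valid JSON escape
print(parses(raw.replace("\\'", "'")))  # True once the backslash is dropped
```

So if the pre-processing step passes double-quoted strings through unchanged, a strict parser downstream still sees the `\'` sequence and rejects the row.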
…15324)
This PR addresses the inconsistency in how the single quote normalizer processes single quotes within a quoted string. In the current implementation, when an escaped single quote appears within a single-quoted string, the normalizer removes the backslash escape when converting the string to double quotes. However, the normalizer retains the contents of double-quoted strings as-is, i.e. if there are escaped single quotes within a double-quoted string, the backslash character is retained in the output. We address this inconsistency by removing the escape character for single quotes in all double-quoted strings in the output. Tackles #15303 to mimic Spark behavior.
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Elias Stehle (https://github.com/elstehle)
- Vukasin Milovanovic (https://github.com/vuule)
URL: #15324
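The normalization rules discussed in this thread can be sketched in plain Python. This is only a rough model under stated assumptions, not the libcudf FST: the function name `normalize_single_quotes` and the flag `unescape_in_double_quoted` are hypothetical. The flag switches between the earlier behaviour (double-quoted strings passed through as-is) and the behaviour described in the PR (the backslash before a single quote is removed in double-quoted strings too).

```python
def normalize_single_quotes(text: str, unescape_in_double_quoted: bool = True) -> str:
    """Hypothetical stand-in for the quote normalizer; not libcudf's implementation."""
    out = []
    quote = None  # None outside a string, otherwise the opening quote character
    i = 0
    while i < len(text):
        ch = text[i]
        if quote is None:
            if ch in ("'", '"'):
                quote = ch
                out.append('"')  # every string opens with a double quote
            else:
                out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "'" and (quote == "'" or unescape_in_double_quoted):
                out.append("'")  # \' becomes ': the escape is no longer needed
            else:
                out.append(ch + nxt)  # keep any other escape pair untouched
            i += 1  # skip the escaped character
        elif ch == quote:
            quote = None
            out.append('"')  # every string closes with a double quote
        elif ch == '"':
            out.append('\\"')  # bare " inside a single-quoted string gets escaped
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Double quotes inside a single-quoted string are escaped.
print(normalize_single_quotes("""{'a': 'say "hi"'}"""))
# Escaped single quote inside a single-quoted string loses its backslash.
print(normalize_single_quotes(r"{'a': 'it\'s'}"))
# Earlier behaviour: double-quoted content is left as-is, backslash included.
print(normalize_single_quotes(r'{"a": "it\'s"}', unescape_in_double_quoted=False))
# Behaviour described in the PR: the backslash is removed here as well.
print(normalize_single_quotes(r'{"a": "it\'s"}'))
```

With the flag off, the third call returns its input unchanged, which is the inconsistency reported in this issue. With the flag on, both quoting styles produce the same output.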
I believe this is closed by #15324
Is your feature request related to a problem? Please describe.
This is really odd and documented at NVIDIA/spark-rapids#10596
But essentially if we want to enable normalization of single quoted values, we also then want to allow backslash escaping of single quoted values.
I am fine if we end up with a separate config for this, but it does make things a lot more difficult, especially because this ties in with the validation work too:
#15222