[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

revans2 · 2024-03-14T15:56:24Z

Is your feature request related to a problem? Please describe.
This is really odd and documented at NVIDIA/spark-rapids#10596

But essentially if we want to enable normalization of single quoted values, we also then want to allow backslash escaping of single quoted values.

I am fine if we end up with a separate config for this, but it does make thinks a lot more difficult. Especially because this fits in with the validation too

#15222

GregoryKimball · 2024-03-14T21:18:19Z

Thank you @revans2 for documenting this. I'm a bit surprised that libcudf is returning a null for this string.

"""{"data": "\'TESTING\'"}"""

I would expect this to be valid with and without single-quote normalization. Would you please help me understand what I'm missing?

shrshi · 2024-03-15T17:07:41Z

Hello @revans2, escaping of double quotes is expected behaviour when single quote normalization is enabled to handle the case of double quotes being present within single quotes. For example

{'TE"ST'} -> {"TE\"ST"}

If there are escaped single quotes within a single quoted string in the input, then the current behaviour of the quote normalization FST is to remove the backslash escapes since we no longer need them. For example

{'TE\'ST'} -> {"TE'ST"}

If I understand correctly, the quote normalizer is expected to retain the backslash i.e. {'TE\'ST'} -> {"TE\'ST"}
Is this correct?

revans2 · 2024-03-15T20:03:56Z

@shrshi

I am sorry. I think I have been moving too quickly for my own good. I should have written some tests as repro cases for this before filing the issue.

I think there is a mismatch between the escape handling in the parser and the single quote pre-processing.

TEST_F(JsonReaderTest, EscapedSingleQuotes)
{
  const std::string fname = temp_env->get_temp_dir() + "EscapedSingleQuotes.json";
  std::ofstream outfile(fname, std::ofstream::out);
  outfile << "{\"data\": \"\\'TESTING\\'\"}\n{\"data\": \"'TESTING'\"}";
  outfile.close();

  cudf::io::json_reader_options in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
      .lines(true)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .mixed_types_as_string(true)
      .normalize_whitespace(true)
      .keep_quotes(true)
      .normalize_single_quotes(false);

  cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

  EXPECT_EQ(result.tbl->num_columns(), 1);
  EXPECT_EQ(result.tbl->num_rows(), 2);

  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
  EXPECT_EQ(result.metadata.schema_info[0].name, "data");

  cudf::test::print(result.tbl->get_column(0));

  cudf::test::strings_column_wrapper const strings(
      {"", "\"'TESTING'\""},
      {false, true});
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result.tbl->get_column(0), strings);
}

TEST_F(JsonReaderTest, EscapedSingleQuotesNormalize)
{
  const std::string fname = temp_env->get_temp_dir() + "EscapedSingleQuotesNormalize.json";
  std::ofstream outfile(fname, std::ofstream::out);
  outfile << "{\"data\": \"\\'TESTING\\'\"}\n{\"data\": \"'TESTING'\"}";
  outfile.close();

  cudf::io::json_reader_options in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
      .lines(true)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .mixed_types_as_string(true)
      .normalize_whitespace(true)
      .keep_quotes(true)
      .normalize_single_quotes(true);

  cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

  EXPECT_EQ(result.tbl->num_columns(), 1);
  EXPECT_EQ(result.tbl->num_rows(), 2);

  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
  EXPECT_EQ(result.metadata.schema_info[0].name, "data");

  cudf::test::print(result.tbl->get_column(0));

  cudf::test::strings_column_wrapper const strings{"\"'TESTING'\"", "\"'TESTING'\""};
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result.tbl->get_column(0), strings);
}

The first test passes, which is what Spark would expect to see when single quote support is disabled, but when it is enabled (which is the default in Spark) nothing changes. I think that the problem is that anything in double quotes is not being processed by the quote normalization code. {"TE\'ST"} -> {"TE\'ST"}, when what I think I want is {"TE\'ST"} -> {"TE'ST"}.

So this was a missed requirement on my part.

shrshi · 2024-03-15T23:29:58Z

Thank you for the detailed explanation, @revans2.
I've opened #15324 to address this.

…15324) This PR addresses the inconsistency in processing single quotes within a quoted string in the single quote normalizer. In the current implementation, when we have an escaped single quote within a single quoted string, the normalizer removes the backslash escape on converting the string to double quotes. However, the normalizer retains the contents of double quoted strings as-is i.e. if there are escaped single quotes within a double quoted string, the backslash character is retained in the output. We address this inconsistency by removing the escape character for single quotes in all double quoted string in the output. Tackles #15303 to mimic Spark behavior. Authors: - Shruti Shivakumar (https://github.com/shrshi) Approvers: - Bradley Dice (https://github.com/bdice) - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) URL: #15324

GregoryKimball · 2024-03-22T02:59:25Z

I believe this is closed by #15324

revans2 added feature request New feature or request Spark Functionality that helps Spark RAPIDS labels Mar 14, 2024

GregoryKimball mentioned this issue Mar 14, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Closed

GregoryKimball added this to libcudf Mar 14, 2024

GregoryKimball added this to the Nested JSON reader milestone Mar 14, 2024

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Mar 14, 2024

shrshi mentioned this issue Mar 15, 2024

Address inconsistency in single quote normalization in JSON reader #15324

Merged

3 tasks

GregoryKimball closed this as completed Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

revans2 commented Mar 14, 2024 •

edited

Loading

GregoryKimball commented Mar 14, 2024

shrshi commented Mar 15, 2024

revans2 commented Mar 15, 2024

shrshi commented Mar 15, 2024

GregoryKimball commented Mar 22, 2024

[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

Comments

revans2 commented Mar 14, 2024 • edited Loading

GregoryKimball commented Mar 14, 2024

shrshi commented Mar 15, 2024

revans2 commented Mar 15, 2024

shrshi commented Mar 15, 2024

GregoryKimball commented Mar 22, 2024

revans2 commented Mar 14, 2024 •

edited

Loading