
[SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde #30958

Closed
wants to merge 5 commits

Conversation

@AngersZhuuuu (Contributor) commented Dec 29, 2020

What changes were proposed in this pull request?

For the same SQL:

SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t

In Hive:

hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)

In Spark:

Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>

We should keep the behavior the same, so this PR changes the default ROW FORMAT FIELD DELIMIT to \u0001 in no-serde mode.
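
To make the difference concrete, below is a minimal Scala sketch (plain Scala, not the PR's code) of what the script receives under each default: the input row's fields are joined by the field delimiter before being piped to 'cat', and '\u0001' (Control-A) is non-printing, which is why Hive's result above renders as 123\N.

// Illustrative only: how a no-serde TRANSFORM feeds the row to the script.
// The default field delimiter determines what 'cat' echoes back.
val fields = Seq("1", "2", "3")
val hiveLine  = fields.mkString("\u0001") // Hive default delimiter
val sparkLine = fields.mkString("\t")     // Spark 3.1 and earlier
println(hiveLine)  // prints as "123" because Control-A is invisible
println(sparkLine) // prints as "1	2	3"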

In Hive, the default value of the delimiter property is '1', which as a char is '\u0001'. An excerpt of the default table properties:

bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
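
For reference, Hive effectively derives the delimiter byte by parsing that property string as an integer, so '1' becomes the byte 0x01. A small sketch of the same mapping (the helper name is illustrative, not Hive's API):

// Parse a Hive-style delimiter property: "1" -> '\u0001', "9" -> '\t'.
def delimiterFromProperty(value: String): Char = value.toInt.toChar

assert(delimiterFromProperty("1") == '\u0001')
assert(delimiterFromProperty("9") == '\t')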

Why are the changes needed?

Keep the same behavior as Hive.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT
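
(The actual test lives in the PR diff; below is a hypothetical sketch of the kind of check such a UT performs, assuming a Spark SQL test suite where test, sql, checkAnswer, and Row are in scope. The expected row is illustrative, mirroring the Hive result above, and is not copied from the PR.)

// Hypothetical sketch, not the PR's test: with no serde and no explicit
// input delimiter, the line fed to 'cat' is '\u0001'-joined; it contains
// no '&', so the whole echoed line lands in the first output column.
test("SPARK-33930: default no-serde field delimiter is \\u0001") {
  val df = sql(
    """SELECT TRANSFORM(a, b, c)
      |ROW FORMAT DELIMITED
      |USING 'cat'
      |ROW FORMAT DELIMITED
      |FIELDS TERMINATED BY '&'
      |FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t""".stripMargin)
  checkAnswer(df, Row("1\u00012\u00013", null))
}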

github-actions bot added the SQL label Dec 29, 2020
@AngersZhuuuu (Contributor, Author) commented:

FYI @maropu @cloud-fan

@viirya (Member) left a comment:

Not related to the change, but I notice that some contributors use screenshots in the description. I personally don't recommend this approach: the images cannot be indexed or searched, so for problem and fix descriptions, text is more helpful.

Screenshots are usually posted for UI or doc changes so we can verify the UI/doc rendering results.

@AngersZhuuuu (Contributor, Author) commented Dec 29, 2020

> Not related to the change, but I notice that some contributors use screenshots in the description. I personally don't recommend this approach: the images cannot be indexed or searched, so for problem and fix descriptions, text is more helpful.

Yea, thanks for your suggestion. I will update the PR description and pay attention to this problem.
Maybe we should send an email to mention this?

(1, 2, 3),
(2, 3, 4),
(3, 4, 5)
).toDF("a", "b", "c") // Note column d's data type is Decimal(38, 18)
A contributor commented:

where is column d?

@AngersZhuuuu (Contributor, Author) replied:

> where is column d?

Removed this unrelated comment. I copied the code from another UT and forgot to remove the comment.

@viirya (Member) commented Dec 29, 2020

Spark SQL no serde row format field delimit default value is '\u0001' -> Spark SQL no serde row format field delimit default value should be '\u0001'?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value is '\u0001' [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value should be '\u0001'? Dec 29, 2020
@viirya (Member) commented Dec 29, 2020

Maybe "Script Transform default FIELD DELIMIT should be \u0001 for no serde".

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value should be '\u0001'? [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde Dec 29, 2020
@viirya (Member) left a comment:

This also changes current behavior. Shall we update the SQL migration guide too?

@AngersZhuuuu (Contributor, Author) replied:

> This also changes current behavior. Shall we update the SQL migration guide too?

How about the updated migration guide doc now?

github-actions bot added the DOCS label Dec 29, 2020
@@ -30,6 +30,8 @@ license: |

- In Spark 3.2, `ALTER TABLE .. RENAME TO PARTITION` throws `PartitionAlreadyExistsException` instead of `AnalysisException` for tables from Hive external when the target partition already exists.

- In Spark 3.2, script transform default `FIELD DELIMIT` is `\u0001` for no serde mode. In Spark 3.1 or earlier, the default `FIELD DELIMIT` is `\t`.
A member commented:

Do we need the backquotes? Maybe `FIELD DELIMIT` -> FIELD DELIMIT?

@AngersZhuuuu (Contributor, Author) replied:

> Do we need the backquotes? Maybe `FIELD DELIMIT` -> FIELD DELIMIT?

Done
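
One practical note on the migration entry above: users who relied on the old tab default can keep it by spelling the delimiter out instead of relying on the default. A sketch, assuming a SparkSession named spark (not part of this PR's diff):

// Opt back into the pre-3.2 behavior by stating '\t' explicitly on both
// the input and output row formats.
spark.sql(
  """SELECT TRANSFORM(a, b, c)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |USING 'cat'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t""".stripMargin).show()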

@viirya (Member) left a comment:

One minor comment.

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38068/

@SparkQA commented Dec 29, 2020

Test build #133470 has finished for PR 30958 at commit 1812826.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38068/

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38073/

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38078/

@SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38078/

@SparkQA commented Dec 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38073/

@SparkQA commented Dec 29, 2020

Test build #133479 has finished for PR 30958 at commit 3c6a4ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Test build #133484 has finished for PR 30958 at commit 4691cb3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Test build #133489 has finished for PR 30958 at commit 75bfd87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Merged to master.
