
[SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde #30958

Closed
wants to merge 5 commits

Conversation

@AngersZhuuuu (Contributor) commented Dec 29, 2020

What changes were proposed in this pull request?

For the same SQL:

SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t

In Hive:

hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)

In Spark:

Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>

We should keep the behavior the same, so this PR changes the default ROW FORMAT FIELD DELIMIT to \u0001 in no-serde mode.
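
To make the difference concrete, below is a minimal Scala sketch (plain Scala, not the PR's code) of what the script receives under each default: the input row's fields are joined by the field delimiter before being piped to 'cat', and '\u0001' (Control-A) is non-printing, which is why Hive's result above renders as 123\N.

// Illustrative only: how a no-serde TRANSFORM feeds the row to the script.
// The default field delimiter determines what 'cat' echoes back.
val fields = Seq("1", "2", "3")
val hiveLine  = fields.mkString("\u0001") // Hive default delimiter
val sparkLine = fields.mkString("\t")     // Spark 3.1 and earlier
println(hiveLine)  // prints as "123" because Control-A is invisible
println(sparkLine) // prints as "1	2	3"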

In Hive, the default value of the delimiter property is '1', which as a char is '\u0001'. An excerpt of the default table properties:

bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
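
For reference, Hive effectively derives the delimiter byte by parsing that property string as an integer, so '1' becomes the byte 0x01. A small sketch of the same mapping (the helper name is illustrative, not Hive's API):

// Parse a Hive-style delimiter property: "1" -> '\u0001', "9" -> '\t'.
def delimiterFromProperty(value: String): Char = value.toInt.toChar

assert(delimiterFromProperty("1") == '\u0001')
assert(delimiterFromProperty("9") == '\t')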

Why are the changes needed?

Keep the same behavior as Hive.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT
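
(The actual test lives in the PR diff; below is a hypothetical sketch of the kind of check such a UT performs, assuming a Spark SQL test suite where test, sql, checkAnswer, and Row are in scope. The expected row is illustrative, mirroring the Hive result above, and is not copied from the PR.)

// Hypothetical sketch, not the PR's test: with no serde and no explicit
// input delimiter, the line fed to 'cat' is '\u0001'-joined; it contains
// no '&', so the whole echoed line lands in the first output column.
test("SPARK-33930: default no-serde field delimiter is \\u0001") {
  val df = sql(
    """SELECT TRANSFORM(a, b, c)
      |ROW FORMAT DELIMITED
      |USING 'cat'
      |ROW FORMAT DELIMITED
      |FIELDS TERMINATED BY '&'
      |FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t""".stripMargin)
  checkAnswer(df, Row("1\u00012\u00013", null))
}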

github-actions bot added the SQL label Dec 29, 2020
@AngersZhuuuu (Contributor, Author) commented:

FYI @maropu @cloud-fan

@viirya (Member) left a comment:

Not related to the change, but I notice that some contributors use screenshots in the description. I personally don't recommend this approach: the images cannot be indexed or searched, so for problem and fix descriptions, text is more helpful.

Screenshots are usually posted for UI or doc changes so we can verify the UI/doc rendering results.

@AngersZhuuuu (Contributor, Author) commented Dec 29, 2020

> Not related to the change, but I notice that some contributors use screenshots in the description. I personally don't recommend this approach: the images cannot be indexed or searched, so for problem and fix descriptions, text is more helpful.

Yea, thanks for your suggestion. I will update the PR description and pay attention to this problem.
Maybe we should send an email to mention this?

(1, 2, 3),
(2, 3, 4),
(3, 4, 5)
).toDF("a", "b", "c") // Note column d's data type is Decimal(38, 18)
A contributor commented:

where is column d?

@AngersZhuuuu (Contributor, Author) replied:

> where is column d?

Removed this unrelated comment. I copied the code from another UT and forgot to remove the comment.

@viirya (Member) commented Dec 29, 2020

Spark SQL no serde row format field delimit default value is '\u0001' -> Spark SQL no serde row format field delimit default value should be '\u0001'?

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value is '\u0001' [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value should be '\u0001'? Dec 29, 2020
@viirya (Member) commented Dec 29, 2020

Maybe "Script Transform default FIELD DELIMIT should be \u0001 for no serde".

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33930][SQL] Spark SQL no serde row format field delimit default value should be '\u0001'? [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde Dec 29, 2020
@viirya (Member) left a comment:

This also changes current behavior. Shall we update the SQL migration guide too?

@AngersZhuuuu (Contributor, Author) replied:

> This also changes current behavior. Shall we update the SQL migration guide too?

How about the updated migration guide doc now?

github-actions bot added the DOCS label Dec 29, 2020
@@ -30,6 +30,8 @@ license: |

- In Spark 3.2, `ALTER TABLE .. RENAME TO PARTITION` throws `PartitionAlreadyExistsException` instead of `AnalysisException` for tables from Hive external when the target partition already exists.

- In Spark 3.2, script transform default `FIELD DELIMIT` is `\u0001` for no serde mode. In Spark 3.1 or earlier, the default `FIELD DELIMIT` is `\t`.
A member commented:

Do we need the backquotes? Maybe `FIELD DELIMIT` -> FIELD DELIMIT?

@AngersZhuuuu (Contributor, Author) replied:

> Do we need the backquotes? Maybe `FIELD DELIMIT` -> FIELD DELIMIT?

Done
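
One practical note on the migration entry above: users who relied on the old tab default can keep it by spelling the delimiter out instead of relying on the default. A sketch, assuming a SparkSession named spark (not part of this PR's diff):

// Opt back into the pre-3.2 behavior by stating '\t' explicitly on both
// the input and output row formats.
spark.sql(
  """SELECT TRANSFORM(a, b, c)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |USING 'cat'
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t""".stripMargin).show()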

@viirya (Member) left a comment:

One minor comment.

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38068/

@SparkQA commented Dec 29, 2020

Test build #133470 has finished for PR 30958 at commit 1812826.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38068/

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38073/

@SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38078/

@SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38078/

@SparkQA commented Dec 29, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38073/

@SparkQA commented Dec 29, 2020

Test build #133479 has finished for PR 30958 at commit 3c6a4ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Test build #133484 has finished for PR 30958 at commit 4691cb3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 29, 2020

Test build #133489 has finished for PR 30958 at commit 75bfd87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Merged to master.
