Skip to content

Commit

Permalink
[SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \…
Browse files Browse the repository at this point in the history
…u0001 for no serde

### What changes were proposed in this pull request?
For same SQL
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
hive> packet_write_wait: Connection to 10.191.58.100 port 32200: Broken pipe
```

In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep same. Change default ROW FORMAT FIELD DELIMIT to `\u0001`

In hive default value is '1' to char is '\u0001'
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```

### Why are the changes needed?
Keep same behavior with hive

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30958 from AngersZhuuuu/SPARK-33930.

Authored-by: angerszhu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
  • Loading branch information
AngersZhuuuu authored and HyukjinKwon committed Dec 29, 2020
1 parent 872107f commit aadda4b
Show file tree
Hide file tree
Showing 3 changed files with 34 additions and 2 deletions.
2 changes: 2 additions & 0 deletions docs/sql-migration-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ license: |

- In Spark 3.2, `ALTER TABLE .. RENAME TO PARTITION` throws `PartitionAlreadyExistsException` instead of `AnalysisException` for tables from Hive external when the target partition already exists.

- In Spark 3.2, script transform default FIELD DELIMIT is `\u0001` for no serde mode. In Spark 3.1 or earlier, the default FIELD DELIMIT is `\t`.

## Upgrading from Spark SQL 3.0 to 3.1

- In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -335,7 +335,7 @@ case class ScriptTransformationIOSchema(

object ScriptTransformationIOSchema {
val defaultFormat = Map(
("TOK_TABLEROWFORMATFIELD", "\t"),
("TOK_TABLEROWFORMATFIELD", "\u0001"),
("TOK_TABLEROWFORMATLINES", "\n")
)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ import org.scalatest.exceptions.TestFailedException

import org.apache.spark.{SparkException, TaskContext, TestUtils}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression, GenericInternalRow}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
Expand Down Expand Up @@ -123,7 +124,11 @@ abstract class BaseScriptTransformationSuite extends SparkPlanTest with SQLTestU
s"""
|SELECT
|TRANSFORM(a, b, c, d, e)
|USING 'python $scriptFilePath' AS (a, b, c, d, e)
| ROW FORMAT DELIMITED
| FIELDS TERMINATED BY '\t'
| USING 'python $scriptFilePath' AS (a, b, c, d, e)
| ROW FORMAT DELIMITED
| FIELDS TERMINATED BY '\t'
|FROM v
""".stripMargin)

Expand Down Expand Up @@ -440,6 +445,31 @@ abstract class BaseScriptTransformationSuite extends SparkPlanTest with SQLTestU
}
}
}

test("SPARK-33930: Script Transform default FIELD DELIMIT should be \u0001 (no serde)") {
withTempView("v") {
val df = Seq(
(1, 2, 3),
(2, 3, 4),
(3, 4, 5)
).toDF("a", "b", "c")
df.createTempView("v")

checkAnswer(
sql(
s"""
|SELECT TRANSFORM(a, b, c)
| ROW FORMAT DELIMITED
| USING 'cat' AS (a)
| ROW FORMAT DELIMITED
| FIELDS TERMINATED BY '&'
|FROM v
""".stripMargin), identity,
Row("1\u00012\u00013") ::
Row("2\u00013\u00014") ::
Row("3\u00014\u00015") :: Nil)
}
}
}

case class ExceptionInjectingOperator(child: SparkPlan) extends UnaryExecNode {
Expand Down

0 comments on commit aadda4b

Please sign in to comment.