[SPARK-28973][SQL] Add TimeType and support java.time.LocalTime as its external type. #25678
Conversation
Test build #110123 has finished for PR 25678 at commit
@cloud-fan @dongjoon-hyun @wangyum May I ask you to review this PR?
Test build #110125 has finished for PR 25678 at commit
@gatorsmile @rxin @HyukjinKwon Do you have any objections to the new type?
@srowen WDYT of the PR?
It looks pretty thorough to me, and I don't see a problem with supporting it. You're right that the more important question is whether introducing a Catalyst `TimeType` is worth it.
@srowen Thank you for your quick response. Parquet has an appropriate logical type for this: `TIME`.
This is just for my education, but how is `TimeType` mapped to Parquet `TIME` here? Yes, that's good confirmation that this is a common type and that we are using the same semantics as, at least, Parquet. Let's say I wrote to some other system that didn't have such a type, like an RDBMS. It would end up as a long? That wouldn't cause any particular problem, right?
I think in the same way as the existing conversion shown at lines 180 to 182 in 26998b8. The physical Parquet type will be the one defined at lines 390 to 391 in be2238f.
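For context, here is a rough Scala sketch of the kind of case a Parquet schema converter could add for such a type, mirroring how timestamps are annotated; the `timeToParquet` helper and its placement are illustrative assumptions, not code from this PR or from `ParquetSchemaConverter`:

```scala
import org.apache.parquet.schema.{LogicalTypeAnnotation, Type, Types}
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// Hypothetical mapping: a time of day stored as microseconds since midnight
// fits Parquet's INT64 annotated with the TIME(MICROS) logical type.
def timeToParquet(fieldName: String, repetition: Type.Repetition): Type =
  Types.primitive(INT64, repetition)
    .as(LogicalTypeAnnotation.timeType(/* isAdjustedToUTC = */ false, TimeUnit.MICROS))
    .named(fieldName)
```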
Just to be clear, does ParquetSchemaConverter need a new case to handle the new type then, to and from Parquet? Or is it already handled? Likewise, I'm just trying to figure out how, for example, our JDBC support translates this type for an RDBMS that does support a TIME type.
We will need to support the new type in all data sources. In this PR, I just introduce the new type and its external type. To support it in the JDBC data source, for example, we would need to handle it in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (lines 567 to 569 and line 178 in 868e025).
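To make that concrete, a minimal sketch of the kind of JDBC translation that would be needed, assuming the microseconds-since-midnight representation proposed in this PR; the helper functions below are hypothetical and not part of `JdbcUtils`:

```scala
import java.sql.{PreparedStatement, ResultSet}
import java.time.LocalTime

// Reading: convert a JDBC TIME column into microseconds since 00:00:00.000000.
def readTimeColumn(rs: ResultSet, pos: Int): Long = {
  val local: LocalTime = rs.getTime(pos + 1).toLocalTime
  local.toNanoOfDay / 1000L
}

// Writing: turn the internal microseconds back into java.sql.Time for an RDBMS
// that has a TIME column type (note: java.sql.Time keeps only second precision,
// so the fractional part would be lost on this path).
def writeTimeColumn(stmt: PreparedStatement, pos: Int, micros: Long): Unit = {
  val local = LocalTime.ofNanoOfDay(micros * 1000L)
  stmt.setTime(pos + 1, java.sql.Time.valueOf(local))
}
```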
OK, so we need to add those kinds of translations, or else this won't properly convert to types in external systems, and the idea is to follow up with that in another PR? I think that could be fine.
Why do we need a new data type?
Seems like that's answered in the PR description. It aims to follow ANSI, I guess:
@rxin, I see the following reasons to add the new type:
Supporting the Time type is pretty complex and the sizing is pretty big. Can we delay it and do it in the future?
I haven't had a chance to comment yet, but I'd push back against most of the functionality in the umbrella ticket (Postgres compatibility). While I think it's a good idea in general to align with an existing standard when the functionality exists, a blanket "implement some other database's functionality" is a dangerous umbrella, because the context and use cases are very different between Spark and "some other databases". A few reasons:
@rxin how about these real use cases:

Another argument for the

This argument blocks any extension of the Spark SQL type system, actually forever. It seems implementing the
I tend to agree with @rxin here. This is a lot of long-term cost without good user justification. I would be much more sympathetic to an argument that showed lots of use of the
I am closing this PR. @srowen @rxin @marmbrus @gatorsmile @HyukjinKwon Thank you for your comments.
Thanks @MaxGekk. In case somebody else wonders in the future, there's a lot more complexity to adding a new type than what's shown here. For example, we need to add support for Python and R. We also need to support expressions. Those are much larger than this PR itself.
I'm currently using timestamp as a time type to create data for a rush-hour graphic; the date part is always the epoch (1970-01-01), because Spark doesn't support TimeType. My point is that microsecond precision is not important. We could implement it just like timestamp internally, but with the date part always at the epoch. The only requirements are that toString() prints only the time part (HH:mm:ss.SSS) and that it is mappable to java.sql.Time and java.time.LocalTime. That would be enough. Maybe we can even use timestamp directly; only the printable representation must be the time part.
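For anyone reaching for the same workaround, a small sketch of what it looks like in practice; the column names and sample values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("time-of-day").getOrCreate()
import spark.implicits._

// Workaround: keep values as timestamps pinned to the epoch date (1970-01-01)
// and only ever render the time-of-day part.
val events = Seq("1970-01-01 08:15:30.123", "1970-01-01 17:45:00.000")
  .toDF("ts_string")
  .withColumn("time_of_day_ts", to_timestamp($"ts_string"))

// Show only the HH:mm:ss.SSS portion, as described above.
events.select(date_format($"time_of_day_ts", "HH:mm:ss.SSS").as("time_of_day")).show()
```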
@MaxGekk Could supporting the time type be considered again? Iceberg and Parquet support a time type. @rxin Spark 3.2 supports the INTERVAL type; could the Time type be considered as well?
@MaxGekk @rxin I am also looking into this issue of the missing Time type in Spark. I am working with Iceberg tables built on top of Parquet files, which have Time data type support. Spark cannot read from or write to such tables.
Looking forward to this feature, because my Iceberg table needs a time data type.
I sent the SPIP https://issues.apache.org/jira/browse/SPARK-51162 to the dev list for discussion (link). @zeddit @Fokko @tundraraj @redblackcoder @melin @hurelhuyag @younggyuchun Please leave comments in the thread if you are still interested in the feature.
What changes were proposed in this pull request?
Proposed a new type for Catalyst's type system to represent local time. According to the SQL standard, a value of data type `TIME` comprises values of the datetime fields HOUR, MINUTE and SECOND, and is always a valid time of day:
- `HOUR` - hour within day, between 00 and 23
- `MINUTE` - minute within hour, between 00 and 59
- `SECOND` - second and possibly a fraction of a second within the minute; the valid range is 0-59.999999

Internally, the `TIME` type is implemented as Catalyst's `TimeType` and stores the number of microseconds since `00:00:00.000000`.

The Java class `java.time.LocalTime` is supported as the external type for `TimeType`. So, instances of `java.time.LocalTime` can be parallelized and converted to values of `TimeType`, and collected back to instances of `LocalTime` as well. Spark also accepts literals of `TimeType` with values of `LocalTime`.

Created an encoder that serializes instances of the `java.time.LocalTime` class to the internal representation of nullable Catalyst's `TimeType`.
.Why are the changes needed?
- To maintain feature parity with PostgreSQL, which supports the `TIME` type and its constructor `make_time`. The `make_time` function is commented out in `date.sql` at the moment (spark/sql/core/src/test/resources/sql-tests/inputs/pgSQL/date.sql, line 353 in 3a4afce).
- To be compliant with the SQL standard.
Does this PR introduce any user-facing change?
The PR extends existing functionality. So, users can parallelize instances of the `java.time.LocalTime` class and collect them back; a sketch of what that might look like follows below.
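The example itself is not preserved in this capture of the thread; below is a hedged sketch of what the parallelize-and-collect round trip could look like with a Spark build that included this change (the session `spark` and the implicit encoder support are assumptions):

```scala
import java.time.LocalTime
// Assumes a SparkSession `spark` from a build that includes the proposed TimeType.
import spark.implicits._

val times = Seq(LocalTime.of(8, 0), LocalTime.of(12, 30, 15), LocalTime.parse("23:59:59.999999"))

// Parallelize LocalTime values into a Dataset backed by the new type ...
val ds = times.toDS()

// ... and collect them back as java.time.LocalTime instances.
val roundTripped: Array[LocalTime] = ds.collect()
assert(roundTripped.toSeq == times)
```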
How was this patch tested?
- `CatalystTypeConvertersSuite`, to check conversion from/to `java.time.LocalTime`.
- `RowEncoderSuite`.
- A literal of `TimeType` is tested in `LiteralExpressionSuite`.
- `DatasetSuite` and `JavaDatasetSuite`.