Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_sql from SQL Server to Arrow truncates datetime #229

Closed
t-alex-fritz opened this issue Feb 5, 2022 · 6 comments
Closed

read_sql from SQL Server to Arrow truncates datetime #229

t-alex-fritz opened this issue Feb 5, 2022 · 6 comments

Comments

@t-alex-fritz
Copy link

I'm working with a MS SQL Server 2019. When reading a datetime field, read_sql correctly captures it in a pandas dataframe. When I convert that dataframe to an arrow table, datetimes are also retained correctly. However, loading directly to an arrow table truncates the datetime fields to midnight. I'd like to remove the pandas dependency and load directly to an arrow table. Is there a way to do this without truncating the datetime?

Example Code:

import connectorx as cx
import pyarrow as pa
con_string = 'mssql://user:[email protected]%5CG:1439/database'
print('------- Pandas Table -------')
query = 'SELECT top 5 Datum FROM Termine WHERE Datum>getdate()'
pandas_table = cx.read_sql(con_string, query, return_type='pandas')
print(pandas_table)
print('------- Arrow table from pandas -------')
arrow_table_from_pandas = pa.Table.from_pandas(pandas_table)
print(arrow_table_from_pandas)
print('------- Arrow Table -------')
arrow_table = cx.read_sql(con_string, query, return_type='arrow')
print(arrow_table)

Example Output:

------- Pandas Table -------
                Datum
0 2022-02-06 07:30:00
1 2022-02-07 00:00:00
2 2022-02-07 00:00:00
3 2022-02-07 07:00:00
4 2022-02-07 07:30:00
------- Arrow table from pandas -------
pyarrow.Table
Datum: timestamp[ns]
----
Datum: [[2022-02-06 07:30:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 07:00:00.000000000,2022-02-07 07:30:00.000000000]]
------- Arrow Table -------
pyarrow.Table
Datum: date64[ms]
----
Datum: [[2022-02-06,2022-02-07,2022-02-07,2022-02-07,2022-02-07]]
@wangxiaoying
Copy link
Contributor

Hi @t-alex-fritz , there is some issue when we dealing with datetime for arrow and we are currently shifting from arrow to arrow2. Can you set the return_type="arrow2" to see whether it solves the problem?

@t-alex-fritz
Copy link
Author

Hi @wangxiaoying, thank you! This throws a ValueError: arrow2

@wangxiaoying
Copy link
Contributor

@t-alex-fritz , can you update to the latest alpha version 0.2.4a6?

@t-alex-fritz
Copy link
Author

t-alex-fritz commented Feb 6, 2022

On 0.2.4a6 I do get non-truncated timestamp[ns] - that worked! However, pyarrow then fails to write the table to parquet whenever datetimes are involved that aren't rounded.

import connectorx as cx
import pyarrow.parquet as pq
con_string = 'mssql://user:[email protected]%5CG:1439/database'
query = 'SELECT top 5 ModificationDate FROM Termine WHERE Datum>getdate()'
arrow_table = cx.read_sql(con_string, query, return_type='arrow2')
print(arrow_table)
pq.write_table(arrow_table, 'output.parquet')
pyarrow.Table
ModificationDate: timestamp[ns]
----
ModificationDate: [[2021-04-14 00:13:02.216666666,2021-07-22 10:11:13.310000000,2021-08-24 15:59:44.463333333,2021-05-12 00:04:40.973333333,2021-08-24 16:14:56.170000000]]
Traceback (most recent call last):
  File "...", line 7, in <module>
    pq.write_table(arrow_table, 'output.parquet')
  File "...", line 2092, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "...", line 754, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1506, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1618359182216666666

@wangxiaoying
Copy link
Contributor

@t-alex-fritz looks like it is the issue of type conversion of pyarrow. Can you check whether this allow_truncated_timestamps answers your question? https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error

@t-alex-fritz
Copy link
Author

Oh, you are right, my bad! That last part was a pyarrow thing and works now after setting coerce_timestamps and allow_truncated_timestamps. Thanks a lot @wangxiaoying. And great project! Will be using this a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants