read_sql from SQL Server to Arrow truncates datetime #229

t-alex-fritz · 2022-02-05T17:12:20Z

I'm working with MS SQL Server 2019. When reading a datetime field, read_sql correctly captures it in a pandas dataframe. When I convert that dataframe to an arrow table, the datetimes are also retained correctly. However, loading directly to an arrow table truncates the datetime fields to midnight. I'd like to remove the pandas dependency and load directly to an arrow table. Is there a way to do this without truncating the datetime?

Example Code:

import connectorx as cx
import pyarrow as pa
con_string = 'mssql://user:[email protected]%5CG:1439/database'
print('------- Pandas Table -------')
query = 'SELECT top 5 Datum FROM Termine WHERE Datum>getdate()'
pandas_table = cx.read_sql(con_string, query, return_type='pandas')
print(pandas_table)
print('------- Arrow table from pandas -------')
arrow_table_from_pandas = pa.Table.from_pandas(pandas_table)
print(arrow_table_from_pandas)
print('------- Arrow Table -------')
arrow_table = cx.read_sql(con_string, query, return_type='arrow')
print(arrow_table)

Example Output:

------- Pandas Table -------
                Datum
0 2022-02-06 07:30:00
1 2022-02-07 00:00:00
2 2022-02-07 00:00:00
3 2022-02-07 07:00:00
4 2022-02-07 07:30:00
------- Arrow table from pandas -------
pyarrow.Table
Datum: timestamp[ns]
----
Datum: [[2022-02-06 07:30:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 00:00:00.000000000,2022-02-07 07:00:00.000000000,2022-02-07 07:30:00.000000000]]
------- Arrow Table -------
pyarrow.Table
Datum: date64[ms]
----
Datum: [[2022-02-06,2022-02-07,2022-02-07,2022-02-07,2022-02-07]]

wangxiaoying · 2022-02-06T01:32:48Z

Hi @t-alex-fritz, there are some issues with how we handle datetime for arrow, and we are currently moving from arrow to arrow2. Can you set return_type="arrow2" and see whether that solves the problem?

t-alex-fritz · 2022-02-06T01:41:06Z

Hi @wangxiaoying, thank you! This throws a ValueError: arrow2

wangxiaoying · 2022-02-06T01:44:08Z

@t-alex-fritz , can you update to the latest alpha version 0.2.4a6?

t-alex-fritz · 2022-02-06T02:15:45Z

On 0.2.4a6 I do get non-truncated timestamp[ns] - that worked! However, pyarrow then fails to write the table to parquet whenever it contains datetimes that are not whole microseconds.

import connectorx as cx
import pyarrow.parquet as pq
con_string = 'mssql://user:[email protected]%5CG:1439/database'
query = 'SELECT top 5 ModificationDate FROM Termine WHERE Datum>getdate()'
arrow_table = cx.read_sql(con_string, query, return_type='arrow2')
print(arrow_table)
pq.write_table(arrow_table, 'output.parquet')

pyarrow.Table
ModificationDate: timestamp[ns]
----
ModificationDate: [[2021-04-14 00:13:02.216666666,2021-07-22 10:11:13.310000000,2021-08-24 15:59:44.463333333,2021-05-12 00:04:40.973333333,2021-08-24 16:14:56.170000000]]
Traceback (most recent call last):
  File "...", line 7, in <module>
    pq.write_table(arrow_table, 'output.parquet')
  File "...", line 2092, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "...", line 754, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1506, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1618359182216666666
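The value in the error message is a nanosecond count since the Unix epoch. A small standard-library sketch of why casting it to microseconds cannot be exact; the variable names are illustrative only.

```python
from datetime import datetime, timezone

# Nanoseconds since the Unix epoch, copied from the ArrowInvalid message above.
ns_value = 1618359182216666666

# Split into whole seconds and the nanosecond remainder within that second.
seconds, nanos = divmod(ns_value, 1_000_000_000)
when = datetime.fromtimestamp(seconds, tz=timezone.utc)

# A microsecond timestamp can only hold multiples of 1000 ns.
lost_ns = ns_value % 1000
truncated_us = ns_value // 1000

print(when.isoformat(), nanos)  # 2021-04-14T00:13:02+00:00 216666666
print(lost_ns)                  # 666
print(truncated_us)             # 1618359182216666
```

The trailing 666 ns is what would be dropped, which is why the cast is rejected unless truncation is explicitly allowed.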

wangxiaoying · 2022-02-06T02:29:34Z

@t-alex-fritz it looks like this is a type conversion issue in pyarrow. Can you check whether the allow_truncated_timestamps option answers your question? https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error

t-alex-fritz · 2022-02-06T02:40:19Z

Oh, you are right, my bad! That last part was a pyarrow thing and works now after setting coerce_timestamps and allow_truncated_timestamps. Thanks a lot @wangxiaoying. And great project! Will be using this a lot.

wangxiaoying closed this as completed Feb 17, 2022

Kipriz mentioned this issue Apr 3, 2022

Arrow2: No conversion rule from Blob(true) for Arrow2TypeSystem #261

Closed

wangxiaoying mentioned this issue May 6, 2022

Resolve arrow datetime/timestamp related issues #277

Open
