adbc_ingest() is dropping rows in Snowflake #1847
Comments
That's really weird. Can you confirm whether the issue still exists in the newer ADBC driver versions while I take a look and see if anything stands out that could be causing it? I'm guessing there's a race condition somewhere in there.
I reviewed the data differences between the source and the Snowflake table and I don't see any patterns. Large numbers, small numbers, positive/negative/0.0 records are randomly missing. With 95,707,710 out of 98 million rows inserted I see missing data for Jan 18th, 2018, which is one of the first occurrences. Running this a second time, I get 96,017,343 out of 98 million rows inserted and now Jan 18th, 2018 isn't missing any data.
I tried adjusting the settings below and it didn't help. There is most likely a bug with these parameters. If I pass them in as integers I get an error: `ValueError: value must be str or bytes`. If I pass them in as strings it doesn't appear to do anything; the activity logs don't reflect the expected behavior.
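For context, ADBC statement options in the Python dbapi do have to be set as string values, which matches the `ValueError` above. A minimal sketch of how such options are typically set (the option key below is my assumption about one of the ingestion tuning parameters being discussed; check the Snowflake driver docs for the exact names, and the URI/table name are placeholders):

```python
import pyarrow as pa
import adbc_driver_snowflake.dbapi

table = pa.table({"id": [1, 2, 3]})

with adbc_driver_snowflake.dbapi.connect("user:pw@account/db/schema") as conn:
    cur = conn.cursor()
    # Option values must be str (or bytes); passing an int raises
    # "ValueError: value must be str or bytes".
    cur.adbc_statement.set_options(**{
        # Assumed option key, for illustration only; verify against the docs.
        "adbc.snowflake.statement.ingest_writer_concurrency": "1",
    })
    cur.adbc_ingest("TARGET_TABLE", table, mode="append")
    conn.commit()
```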
I'm gonna try to find some time later this week to look at this. In the meantime, @joellubi would you happen to have some time to dig into this?
Sure, taking a look. @davlee1972 I did notice a typo in the option name: I think that by
I haven't been able to reproduce this so far, even with an ingest of this volume. Are there any COPY errors found by running the following query? You may need to add more filters to isolate a particular run.
select * from snowflake.account_usage.copy_history where table_name = '<TABLE_NAME>' AND error_count > 0;
Here's the copy history. I don't see any errors, but there are a ton of 256k empty parquet files being sent. I think I made a typo in the comments above for the RPC parameters, but I'll retest all four RPC parameters again. Trying to eliminate any type of concurrency to debug this. I'm using an X-Small warehouse, so I'm wondering if there might be issues with the concurrency settings if the number of warehouse cores is lower than those settings.
OK, I figured it out. There is a bug with how adbc_ingest is handling batches. I reduced the number of records to 1.1 million rows and the bug still happens. If I read data from a pyarrow dataset of parquet files and try to write it to Snowflake, I get ZERO rows inserted. If I write my data to a single parquet file, reread it, and then try to write it to Snowflake, I get all my rows inserted. On a side note, I'm not sure why these params are STRINGs and not INTs.
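A minimal sketch of the two paths being compared here (the URI, file paths, and table name are placeholders, assuming the adbc_driver_snowflake dbapi):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import adbc_driver_snowflake.dbapi

with adbc_driver_snowflake.dbapi.connect("user:pw@account/db/schema") as conn:
    cur = conn.cursor()

    # Path 1: stream batches straight out of a multi-file parquet dataset.
    # The scan can yield zero-length batches, which later comments tie to the bug.
    dataset = ds.dataset("parquet_dir/", format="parquet")
    reader = pa.RecordBatchReader.from_batches(dataset.schema, dataset.to_batches())
    cur.adbc_ingest("MY_TABLE", reader, mode="append")

    # Path 2: materialize everything into one parquet file, re-read it as a
    # single table, then ingest; this is the path that inserted every row.
    pq.write_table(dataset.to_table(), "single.parquet")
    cur.adbc_ingest("MY_TABLE", pq.read_table("single.parquet"), mode="append")

    conn.commit()
```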
@davlee1972 They are strings and not ints, primarily because the corresponding functions in the Snowflake driver haven't been implemented. Only the default
Interesting, that's extremely odd. Looking at the code you provided screenshots for, the
Thanks @davlee1972, that's a great insight. By setting several of the table's chunks to be empty, I can now reproduce the issue, getting a nondeterministic number of rows copied in each run.
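A rough sketch of building such a table in pyarrow (column names and sizes are made up for illustration):

```python
import pyarrow as pa

# Build a table whose chunk layout interleaves zero-row batches with real ones.
schema = pa.schema([("id", pa.int64()), ("val", pa.float64())])
full = pa.record_batch(
    [pa.array(range(100), type=pa.int64()),
     pa.array([float(i) for i in range(100)], type=pa.float64())],
    schema=schema,
)
empty = full.slice(0, 0)  # zero-row batch with the same schema
table = pa.Table.from_batches([empty, full, empty, full] * 5, schema=schema)
assert table.num_rows == 1000
```

Passing a table shaped like this to `cursor.adbc_ingest()` is what produced the nondeterministic row counts described above.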
@joellubi can you replicate it with pure Go, or only through pyarrow with some table chunks set to be empty? Just trying to narrow down where the issue might be.
@zeroshade Yes, just got a pure Go reproduction with a record reader that produces 1 empty batch and then 10 batches of 100 rows (i.e. expecting 1,000 rows):
No code changes between those three runs. Also, I ran the same ingestion with the Postgres driver from Python and the issue does not reproduce under any conditions. This seems to be specific to the Snowflake driver itself.
Awesome. So now we just gotta figure out if the issue is in the Parquet writer or on Snowflake's side :) If you don't get the time to dig deeper, I should be able to poke at it tomorrow if you can post your repro.
Sure @zeroshade, I ported the repro to a failing test case and pushed it up to #1866.
@joellubi I did a similar investigation proving it. It appears to be caused by the existence of a row group with 0 rows in the file. We can work around it by using
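The comment above is cut off; one client-side workaround (my assumption here, not necessarily the one the comment refers to) is to strip zero-row chunks before handing the data to the driver:

```python
import pyarrow as pa

def drop_empty_chunks(table: pa.Table) -> pa.Table:
    # Rebuild the table without zero-row chunks so the driver never writes
    # an empty parquet row group; table.combine_chunks() has a similar effect.
    batches = [b for b in table.to_batches() if b.num_rows > 0]
    return pa.Table.from_batches(batches, schema=table.schema)
```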
…y batch is present (#1866)

Reproduces and fixes: #1847

Parquet files with empty row groups are valid per the spec, but Snowflake does not currently handle them properly. To mitigate this we buffer writes to the parquet file so that a row group is not written until some amount of data has been received.

The CheckedAllocator was enabled for all tests as part of this fix, which detected a leak in the BufferWriter that was fixed in https://github.com/apache/arrow/pull/41698. There was an unrelated test failure that surfaced once the CheckedAllocator was enabled which had to do with casting decimals of certain precision. The fix is included in this PR as well.
Thanks for merging @zeroshade. Any recommendations on the best way to inform Snowflake of the bug? It's not really related to any of their open source projects, so a GitHub issue doesn't seem appropriate.
I'll reach out to the individuals I've been working with on ADBC stuff and bring it up to them. Thanks!
Just wanted to follow up here: Snowflake is now aware of the issue and was able to replicate it. Hopefully they fix it soon.
…rgetSize on ingestion (#2026)

Fixes: #1997

**Core Changes**
- Change ingestion `writeParquet` function to use unbuffered writer, skipping 0-row records to avoid recurrence of #1847
- Use parquet writer's internal `RowGroupTotalBytesWritten()` method to track output file size in favor of `limitWriter`
- Unit test to validate that file cutoff occurs precisely when expected

**Secondary Changes**
- Bump arrow dependency to `v18` to pull in the changes from [ARROW-43326](apache/arrow#43326)
- Fix flightsql test that depends on hardcoded arrow version
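The change itself lives in the Go driver's parquet-writing path; for readers who want the gist, here is a rough Python analogue of the "skip 0-row records" idea (a sketch assuming pyarrow, not the driver's actual code):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_parquet_skipping_empty_batches(reader: pa.RecordBatchReader, path: str) -> None:
    # Never emit a zero-row row group, since Snowflake's COPY mishandles such files.
    with pq.ParquetWriter(path, reader.schema) as writer:
        for batch in reader:
            if batch.num_rows > 0:
                writer.write_batch(batch)
```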
What happened?
I'm trying to load 98 million rows from a set of CSV files (a 5-year period), but only 95 to 96 million rows are getting inserted into Snowflake using adbc_ingest. The distribution of missing data is pretty random, around ~16k records per day.
I tried passing adbc_ingest() both a pyarrow table and record batches. In both cases rows are being dropped.
Here's a screenshot of my notebook code.

The odd thing is that sometimes it inserts 95 million rows and other times it inserts 96 million rows. The total sum of inserted rows matches what I'm seeing in the Snowflake logs if I add up all the rows created by the COPY INTO SQL commands.
It looks like we're not sending all the batches across the wire.
How can we reproduce the bug?
No response
Environment/Setup
Python 3.9.10 on Red Hat 8 Linux with ADBC drivers 0.10.0.