parallel csv scan #6801
Conversation
Thank you @2010YOUY01 -- I will review this over the coming days. This looks awesome
datafusion/core/Cargo.toml
@@ -84,6 +84,7 @@ parquet = { workspace = true }
percent-encoding = "2.2.0"
pin-project-lite = "^0.2.7"
rand = "0.8"
regex = "1.5.4"
since this is only used for tests, I think it can be put in the dev-dependencies section
I started running some tests with this branch. I will continue tomorrow
This is on my list to review, but I might not have time for it until later in the week (Wednesday) -- I am somewhat occupied with #6800 at the moment
Hi @2010YOUY01 -- I am having trouble reproducing the benchmark results you reported.
Results
Master:
This PR branch:
(I also merged up your branch from master and it still had the same performance)
Methodology:
I tested this branch out using the TPCH SF1 (6M rows, 725 MB) lineitem CSV file (created with the TPC-H data generator):
(arrow_dev) alamb@MacBook-Pro-8:~$ du -h /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
725M /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
(arrow_dev) alamb@MacBook-Pro-8:~$ wc -l /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
6001215 /Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl
And used:
CREATE EXTERNAL TABLE lineitem (
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber INTEGER,
l_quantity DECIMAL(15, 2),
l_extendedprice DECIMAL(15, 2),
l_discount DECIMAL(15, 2),
l_tax DECIMAL(15, 2),
l_returnflag VARCHAR,
l_linestatus VARCHAR,
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct VARCHAR,
l_shipmode VARCHAR,
l_comment VARCHAR,
l_rev VARCHAR,
) STORED AS CSV DELIMITER '|' LOCATION '/Users/alamb/Software/arrow-datafusion/benchmarks/data/lineitem.tbl';
--- Run a query that scans the entire CSV
select count(*) from lineitem where l_quantity < 10;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 1079240 |
+-----------------+
let current_partition_size: usize = 0;

// Partition byte range evenly for all `PartitionedFile`s
let repartitioned_files = flattened_files
One thing I was thinking about for the range-based approach is that it isn't likely to work for streaming compressed files (as you need to decompress the data linearly).
I wonder if you have considered that 🤔
Yes. For now, the physical optimizer will check for that and won't repartition compressed CSV files.
@alamb Thank you for the feedback!
alternative
Main branch:
Thanks @2010YOUY01 -- I am starting to look at this again
Thank you @2010YOUY01 -- I tried this out again and it does indeed go (much!) faster -- 3x faster in my initial testing. 👏
I also reviewed the code, and I found it very easy to read, well structured and well tested. Thank you so much for this and I am sorry for the delay.
I have some thoughts on how to improve the byte range calculations to reduce the number of object store requests. I think this could be done as a follow-on PR as well.
My proposal is to:
- Review the comments on this PR and consider if you want to make any more changes
- Merge this PR
- File tickets for any remaining items
- File tickets for parallel reading of compressed CSV / JSON files (I will do this)
- File ticket for parallel reading of JSON files
Testing Results
This branch now goes about 3x faster (3.9s vs 13.4s) on my test:
$ datafusion-cli -f /tmp/test.sql
DataFusion CLI v27.0.0
0 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 10796819 |
+-----------------+
1 row in set. Query took 3.961 seconds.
main:
$ ~/Software/target-df/release/datafusion-cli -f /tmp/test.sql
DataFusion CLI v27.0.0
0 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 10796819 |
+-----------------+
1 row in set. Query took 13.364 seconds.
Here is the query:
$ cat /tmp/test.sql
CREATE EXTERNAL TABLE lineitem (
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber INTEGER,
l_quantity DECIMAL(15, 2),
l_extendedprice DECIMAL(15, 2),
l_discount DECIMAL(15, 2),
l_tax DECIMAL(15, 2),
l_returnflag VARCHAR,
l_linestatus VARCHAR,
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct VARCHAR,
l_shipmode VARCHAR,
l_comment VARCHAR,
l_rev VARCHAR,
) STORED AS CSV DELIMITER '|' LOCATION '/Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem.tbl';
--- Run a query that scans the entire CSV
select count(*) from lineitem where l_quantity < 10;
Ok(())
}

/// Parappel scan on a csv file with only 1 byte in each line
Suggested change:
- /// Parappel scan on a csv file with only 1 byte in each line
+ /// Parallel scan on a csv file with only 1 byte in each line
Ok(())
}

/// Parappel scan on a csv file with 2 wide rows
Suggested change:
- /// Parappel scan on a csv file with 2 wide rows
+ /// Parallel scan on a csv file with 2 wide rows
if has_ranges {
    return self.clone();
}
let repartitioned_file_groups_option = FileScanConfig::repartition_file_groups(
👍 for this refactor
// Repartition when there is an empty file in file groups
#[tokio::test]
async fn parquet_exec_repartition_empty_files()
maybe as a follow-on, the tests for file repartitioning could be moved to datafusion/core/src/datasource/physical_plan/mod.rs to be in the same module they are testing (as they are now no longer specific to parquet)
--------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
----------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/sqllogictests/test_files/tpch/data/lineitem.tbl]]}, projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate], has_header=false

--------------------CsvExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/core/tests/sqllogictests/test_files/tpch/data/lineitem.tbl:0..18561749], [WORKSPACE_ROOT/datafusion/core/tests/sqllogictests/test_files/tpch/data/lineitem.tbl:18561749..37123498], [WORKSPACE_ROOT/datafusion/core/tests/sqllogictests/test_files/tpch/data/lineitem.tbl:37123498..55685247], [WORKSPACE_ROOT/datafusion/core/tests/sqllogictests/test_files/tpch/data/lineitem.tbl:55685247..74246996]]}, projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate], has_header=false
❤️
if repartition_file_scans && !csv_exec.file_compression_type.is_compressed() {
    let repartitioned_exec_option =
        csv_exec.get_repartitioned(target_partitions, repartition_file_min_size);
    if let Some(repartitioned_exec) = repartitioned_exec_option {
        return Ok(Transformed::Yes(Arc::new(repartitioned_exec)));
    }
I recommend moving the check for compression (csv_exec.file_compression_type.is_compressed()) into CsvExec::get_repartitioned() to keep the logic for handling CSV files together. I don't think this is required, however.
Suggested change:
- if repartition_file_scans && !csv_exec.file_compression_type.is_compressed() {
-     let repartitioned_exec_option =
-         csv_exec.get_repartitioned(target_partitions, repartition_file_min_size);
-     if let Some(repartitioned_exec) = repartitioned_exec_option {
-         return Ok(Transformed::Yes(Arc::new(repartitioned_exec)));
-     }
+ if repartition_file_scans {
+     let repartitioned_exec_option =
+         csv_exec.get_repartitioned(target_partitions, repartition_file_min_size);
+     if let Some(repartitioned_exec) = repartitioned_exec_option {
+         return Ok(Transformed::Yes(Arc::new(repartitioned_exec)));
+     }
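To illustrate this suggestion, here is a rough, hypothetical sketch (simplified struct and field names, not the real CsvExec implementation) of what folding the compression check into the method itself could look like; the actual byte-range splitting is elided:

// Hypothetical stand-in for CsvExec, reduced to the fields relevant here.
#[derive(Clone)]
struct CsvScan {
    is_compressed: bool,
    file_size: u64,
}

impl CsvScan {
    /// Returns a repartitioned copy, or None when repartitioning does not
    /// apply. Keeping the compression check here means callers such as the
    /// Repartition optimizer rule cannot forget it.
    fn get_repartitioned(&self, target_partitions: usize, min_size: u64) -> Option<CsvScan> {
        // Compressed CSV must be decompressed linearly from the start, so it
        // cannot be split by byte range.
        if self.is_compressed {
            return None;
        }
        if target_partitions < 2 || self.file_size < min_size {
            return None;
        }
        // The actual even byte-range split is elided in this sketch.
        Some(self.clone())
    }
}

fn main() {
    let plain = CsvScan { is_compressed: false, file_size: 1_000_000 };
    let gzipped = CsvScan { is_compressed: true, file_size: 1_000_000 };
    assert!(plain.get_repartitioned(4, 1024).is_some());
    assert!(gzipped.get_repartitioned(4, 1024).is_none());
}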
/// If `file_meta.range` is `Some(FileRange {start, end})`, this signifies that the partition
/// corresponds to the byte range [start, end) within the file.
///
/// Note: `start` or `end` might be in the middle of some lines. In such cases, the following rules
As I understand it, this code potentially makes several object store requests to adjust the initial ranges based on where the ends of the CSV lines actually fall:
Initial situation
CSV data with the next newlines (\n) after range marked
┌─────────────┬──┬────────────────────┬──┬────────────────┬──┬────────┬──┬────────┐
│ ... │\n│ ... │\n│ ... │\n│ ... │\n│ ... │
└─────────────┴──┴────────────────────┴──┴────────────────┴──┴────────┴──┴────────┘
▲ ▲ ▲ ▲
└─────────────┬──────┴─────────────────────┴───────────────┘
│
│
Initial file_meta.ranges
This PR
This PR: adjust the ranges prior to IO start, via
object store operations
┌ ─ ─ ─ ─ ─ ─ ─ ─┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
┌─────────────┬──┼────────────────────┬──┬────────────────┬──┬────────┬──┬────────┤
│ ... │\n│ ... │\n│ ... │\n│ ... │\n│ ... │
└─────────────┴──┼────────────────────┴──┴────────────────┴──┴────────┴──┴────────┤
│ │ │ │
─ ─ ─ ─ ─ ─ ─ ─ ┘─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
Partition 0 Partition 1 Partition 2 Partition 3
Read Read Read Read
This design has the nice property that each partition reads exactly the bytes it needs
This design has the downside that it requires several object store reads to find the newlines, and the overhead of each object store operation often is much larger than the overhead of reading extra data.
Alternate idea
Another approach that would reduce the number of object store requests would be to read past the initial range and stop at the next newline \n
like this:
Each partition reads *more* than its assigned range to find
the trailing new line, and ignores everything afterwards
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
┌─────────────┬──┬───────┼────────────┬──┬────────────────┬──┬────────┬──┬────────┐
│ ... │\n│ ... │\n│ ... │\n│ ... │\n│ ... │
└─────────────┴──┴───────┼────────────┴──┴────────────────┴──┴────────┴──┴────────┘
│
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
Partition 0
Read
I think the tricky bit of this design would be to ensure enough extra data was read. Initially, maybe we could just pick something sufficiently large for most files, like 1MB, and error if the next newline can't be found. As a follow-on we could add some fanciness like making another object store request if necessary.
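For concreteness, a minimal sketch of this alternate idea, using hypothetical helper names (not actual DataFusion code): each partition fetches its range plus some padding in a single request, then trims to line boundaries in memory, so no extra object store round trips are needed:

/// Trim an over-read buffer to whole lines. `buf` starts at the partition's
/// nominal start offset and extends some padding past `nominal_len` (the size
/// of the assigned range). Returns None if the padding was too small to
/// contain the trailing newline, in which case a caller could fall back to
/// fetching more bytes.
fn trim_to_lines(
    buf: &[u8],
    nominal_len: usize,
    is_first_partition: bool,
    is_last_partition: bool,
) -> Option<&[u8]> {
    // Skip the partial first line; the previous partition reads it instead.
    let start = if is_first_partition {
        0
    } else {
        buf.iter().position(|&b| b == b'\n')? + 1
    };
    // Read past the nominal end up to (and including) the next newline.
    let end = if is_last_partition {
        buf.len()
    } else {
        nominal_len + buf[nominal_len..].iter().position(|&b| b == b'\n')? + 1
    };
    Some(&buf[start..end])
}

fn main() {
    // "aa\nbbb\ncc\n" split at byte 5, with the first partition over-reading
    // to the end of the file as padding.
    let file: &[u8] = b"aa\nbbb\ncc\n";
    assert_eq!(trim_to_lines(&file[0..10], 5, true, false), Some(&file[0..7])); // "aa\nbbb\n"
    assert_eq!(trim_to_lines(&file[5..10], 5, false, true), Some(&file[7..10])); // "cc\n"
}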
These text charts look awesome 😮 How did you draw that?
I use https://monodraw.helftone.com/ to make the ASCII art -- I haven't found a free alternative yet
Alternate idea
Another approach that would reduce the number of object store requests would be to read past the initial range and stop at the next newline \n like this: (see the diagram above)
I think the tricky bit of this design would be to ensure enough extra data was read. Initially, maybe we could just pick something sufficiently large for most files, like 1MB, and error if the next newline can't be found. As a follow-on we could add some fanciness like making another object store request if necessary.
@alamb
...working on #8502; I noticed the same thing.
So I did some testing.
I issued just a single GetRequest to the object store with an extended end_range (factor 1.2); then looped over the result byte stream and computed the start and end offset in a single pass.
I benchmarked it against 60mil rows of NDJSON - with no difference compared to the other approach. I could not gain any performance. Maybe my implementation is too naive? Or is the overhead for the object store requests too small when working with the local filesystem?
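// POC sketch from the comment above: issue one get request over an extended
// range, then scan the streamed bytes in a single pass, recording in
// start_delta the offset of the first newline after the requested start and
// in end_delta the offset of the first newline past the nominal end.
// `store`, `location`, `options`, `start`, `end`, and `file_size` are assumed
// to be defined in the enclosing function.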
let result = store.get_opts(location, options).await?;
let mut result_stream = result.into_stream();
let mut index = 0;
let mut buffer = Bytes::new();
let mut start_delta = 0;
let mut end_delta = 0;
'outer: loop {
    if buffer.is_empty() {
        match result_stream.next().await.transpose()? {
            Some(bytes) => buffer = bytes,
            None => break,
        }
    }
    for byte in &buffer {
        if *byte == b'\n' {
            if start != 0 && start_delta == 0 {
                start_delta = index;
                if end == file_size {
                    break 'outer;
                }
            }
            if start + index > end {
                end_delta = index;
                break 'outer;
            }
        }
        index += 1;
    }
    buffer.clear();
}
I benchmarked it against 60mil rows of NDJSON - with no difference compared to the other approach. I could not gain any performance. Maybe my implementation is too naive? Or is the overhead for the object store requests too small when working with the local filesystem?
Yes, I think this is the fundamental observation -- multiple requests to an actual remote object store are quite costly (like 10s of ms minimum), and fetching anything less than several MB in one request is likely less efficient than a single large request.
There is more about object storage in Exploiting Cloud Object Storage for High-Performance Analytics, a VLDB 2023 paper about running high-performance analytics on object stores.
@alamb
Thanks for linking the paper.
So I guess the alternate approach is still worth pursuing?
I would proceed by testing the POC against a remote store, and if that looks promising I'd create a separate issue to discuss and refine the approach further.
Sounds like a good plan to me
Thank you for the review ❤️ I will update this PR as per the comments next week
memo for myself: update comments/docs in configurations
Previous review comments should all be addressed in the recent commits; it's ready for review now.
Thanks @2010YOUY01 -- I took the liberty of merging up from main to resolve a conflict on this branch and I plan to merge it when the CI has passed
Thanks again @2010YOUY01
* parallel csv scan
* add max line length
* Update according to review comments
* Update Configuration doc
---------
Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #6325.
Rationale for this change
There are 3 steps to scan a CSV file in parallel (starting from a single-partition plan such as CsvExec: file_groups={1 group: [[a.csv:0..100]]}):
1. A PhysicalOptimizerRule (Repartition) will decide if the CsvExec node can be parallelized; it won't be parallelized if the parent node requires a certain order or has some other conditions.
2. Repartition splits the byte range evenly and stores the partitions in CsvExec.base_config.file_groups; this step doesn't care about how to separate lines correctly (partitions may contain half lines): CsvExec: file_groups={2 groups: [[a.csv:0..50], [a:50..100]]}
3. CsvOpener opens each partition (e.g. [a.csv:0..50]) and deals with partition boundaries so that only complete lines are read (e.g. [a.csv:0..50] -> Lines:1..5).
What changes are included in this PR?
Step 1
The parallel Parquet scan PR #5057 has already done this; for parallel CSV scan, this PR only adds a rule to not repartition if the CSV file is compressed.
Testing - Added a CsvExec case alongside the existing tests for the optimizer rule Repartition on ParquetExec.
Step 2
The parallel Parquet PR also did this, but inside ParquetExec; the byte-range repartitioning logic is now refactored out so that CsvExec can reuse it (a rough sketch of the even split is shown below).
Testing - Unit tests for partitioning the byte range are already well covered by the parallel Parquet PR: https://github.com/apache/arrow-datafusion/blob/e91af991c5ae4a6b4afab2cb1b0c9307a69e4046/datafusion/core/src/datasource/physical_plan/parquet.rs#L1924-L2128
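As a rough illustration of the even byte-range split described in this step (a single file and a hypothetical helper name are assumed; this is not the actual FileScanConfig::repartition_file_groups code):

/// Split a file of `file_size` bytes into `n` contiguous, roughly equal byte
/// ranges. Line boundaries are deliberately ignored at this stage; they are
/// fixed up later by the opener (Step 3).
fn split_byte_range(file_size: u64, n: u64) -> Vec<std::ops::Range<u64>> {
    let chunk = file_size / n;
    (0..n)
        .map(|i| {
            let start = i * chunk;
            // The last partition absorbs any remainder.
            let end = if i + 1 == n { file_size } else { start + chunk };
            start..end
        })
        .collect()
}

fn main() {
    // Mirrors the example above: a.csv:0..100 split into 2 groups.
    assert_eq!(split_byte_range(100, 2), vec![0..50, 50..100]);
}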
Step 3
The logic for handling line boundaries correctly is done in CsvOpener; the rule used is exactly the same as the one described in #6325. Unlike parallel Parquet scan, which can use metadata to decide which row groups to read according to the approximate byte range, CSV scan needs to inspect the file content around partition boundaries to determine the byte range of complete lines in each partition. In the implementation, it finds the offset of the first newline encountered from the approximate partition start or end, calculates the byte range for the complete lines of this partition, and reads that actual byte range from the object store (a simplified sketch of this boundary rule is shown below).
Testing - Added some integration tests in file_format/csv.rs (also manually ran these tests on S3). For more complex query tests, the existing TPCH correctness tests under sqllogictest are run against CSV files scanned with 4 partitions.
Some issues: the extra target_partitions requests used to find line boundaries might be a problem on cloud stores (billed by number of requests). This may be improved by some follow-on PR, but I haven't come up with a good solution yet: the arrow CSV reader can only handle valid CSV files. Maybe add a wrapper on the byte stream fetched from the object store, and let it skip the first line and read until the first newline after the range length?
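A simplified sketch of the boundary rule described above, operating on an in-memory byte slice for clarity (in the actual implementation each probe around a boundary is an object store range request; the function names here are hypothetical):

/// Byte offset just past the first newline at or after `pos`
/// (or the end of the data if there is no further newline).
fn after_next_newline(data: &[u8], pos: usize) -> usize {
    data[pos..]
        .iter()
        .position(|&b| b == b'\n')
        .map(|i| pos + i + 1)
        .unwrap_or(data.len())
}

/// Adjust an approximate byte range [start, end) so that it covers only
/// complete lines: a partition starting mid-line skips forward to the next
/// line start, and every partition reads up to the line boundary that the
/// next partition will skip to, so each line lands in exactly one partition.
fn complete_line_range(data: &[u8], start: usize, end: usize) -> (usize, usize) {
    let new_start = if start == 0 { 0 } else { after_next_newline(data, start) };
    let new_end = if end >= data.len() { data.len() } else { after_next_newline(data, end) };
    (new_start, new_end)
}

fn main() {
    let data: &[u8] = b"aa\nbbb\ncc\n"; // 10 bytes, 3 lines
    // Naive halves 0..5 and 5..10 are adjusted to whole-line ranges.
    assert_eq!(complete_line_range(data, 0, 5), (0, 7)); // "aa\nbbb\n"
    assert_eq!(complete_line_range(data, 5, 10), (7, 10)); // "cc\n"
}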
Are these changes tested?
See the section above.
A very simple benchmark: scan the TPC-H lineitem table (SF1) with a very selective predicate (select * from lineitem where column_1 = 1;). 6 is the number of physical cores on my machine.
Are there any user-facing changes?
No