Update to arrow/parquet 53.0.0, tonic, prost, object_store, pyo3 #12032

Merged (50 commits, Sep 5, 2024)

Commits
f13c8f3  Update prost, prost-derive, pbjson (alamb, Jul 9, 2024)
c51fd50  udpate more (alamb, Jul 9, 2024)
839ba91  Merge remote-tracking branch 'apache/main' into alamb/update_prost (alamb, Jul 10, 2024)
a6cadde  Update datafusion/substrait/Cargo.toml (alamb, Jul 10, 2024)
e34019b  Merge branch 'alamb/update_prost' of github.com:alamb/datafusion into… (alamb, Jul 10, 2024)
7900a07  Update vendored code (alamb, Jul 10, 2024)
5ea5ced  revert upgrade in datafusion-examples until arrow-flight is updated (alamb, Jul 10, 2024)
6694983  Pin to pre-release arrow-rs (alamb, Aug 16, 2024)
4e3f97e  Merge remote-tracking branch 'alamb/alamb/update_prost' into alamb/up… (alamb, Aug 16, 2024)
6cc5db1  update pyo3 (alamb, Aug 16, 2024)
68b6e6c  Update to use new arrow apis (alamb, Aug 16, 2024)
aa310fe  update for new api (alamb, Aug 16, 2024)
f454e89  Update tonic in examples (alamb, Aug 16, 2024)
98bb11a  update prost (alamb, Aug 16, 2024)
5b0fa44  update datafusion-cli/cargo (alamb, Aug 16, 2024)
6f501bc  update test output (alamb, Aug 16, 2024)
3542aea  update (alamb, Aug 16, 2024)
e6416c3  updates (alamb, Aug 16, 2024)
3a06488  updates (alamb, Aug 16, 2024)
5d7b0fe  update math (alamb, Aug 16, 2024)
cb623d9  update more (alamb, Aug 16, 2024)
c2a6bf5  fix scalar tests (alamb, Aug 16, 2024)
1ac1787  Port statistics to use new API (alamb, Aug 16, 2024)
67ad234  factor into a function (alamb, Aug 16, 2024)
5b6498e  update generated files (alamb, Aug 16, 2024)
e77b7df  Update test (alamb, Aug 19, 2024)
d38b99e  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Aug 19, 2024)
a89fa87  add new test (alamb, Aug 19, 2024)
3717c25  update tests (alamb, Aug 19, 2024)
230aeec  tapelo format (alamb, Aug 19, 2024)
ed2b222  Update other tests (alamb, Aug 19, 2024)
b446fcc  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Aug 20, 2024)
b666c31  Update datafusion pin (alamb, Aug 20, 2024)
f2be69f  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Aug 21, 2024)
2062a32  Update for API change (alamb, Aug 21, 2024)
82641d8  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Aug 21, 2024)
c61b499  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Aug 22, 2024)
124efd2  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 1, 2024)
13bb146  Update to arrow 53.0.0 sha (alamb, Sep 1, 2024)
03378ed  Update cli deps (alamb, Sep 1, 2024)
0ede9e4  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 2, 2024)
d43f9dd  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 2, 2024)
15f4c5f  update cargo.lock (alamb, Sep 2, 2024)
fdd6e98  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 2, 2024)
5a35f3c  Update expected output (alamb, Sep 2, 2024)
3d0b99c  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 3, 2024)
c534a29  Remove patch (alamb, Sep 3, 2024)
1dfd713  update datafusion-cli cargo (alamb, Sep 3, 2024)
a2613a6  Pin some aws sdks whose update caused CI failures (alamb, Sep 3, 2024)
0b6b11c  Merge remote-tracking branch 'apache/main' into alamb/update_arrow_53 (alamb, Sep 3, 2024)
17 changes: 17 additions & 0 deletions Cargo.toml
@@ -164,3 +164,20 @@ large_futures = "warn"
[workspace.lints.rust]
unexpected_cfgs = { level = "warn", check-cfg = ["cfg(tarpaulin)"] }
unused_imports = "deny"
+
+
+## Temporary arrow-rs patch until 53.0.0 is released
+
+[patch.crates-io]
+arrow = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-array = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-buffer = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-cast = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-data = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-ipc = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-schema = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-select = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-string = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-ord = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+arrow-flight = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
+parquet = { git = "https://github.com/apache/arrow-rs.git", rev = "042d725888358c73cd2a0d58868ea5c4bad778f7" }
6 changes: 3 additions & 3 deletions datafusion-examples/Cargo.toml
@@ -72,14 +72,14 @@ log = { workspace = true }
mimalloc = { version = "0.1", default-features = false }
num_cpus = { workspace = true }
object_store = { workspace = true, features = ["aws", "http"] }
-prost = { version = "0.12", default-features = false }
-prost-derive = { version = "0.13", default-features = false }
+prost = { version = "0.13.1", default-features = false }
+prost-derive = { version = "0.13.1", default-features = false }
serde = { version = "1.0.136", features = ["derive"] }
serde_json = { workspace = true }
tempfile = { workspace = true }
test-utils = { path = "../test-utils" }
tokio = { workspace = true, features = ["rt-multi-thread", "parking_lot"] }
-tonic = "0.11"
+tonic = "0.12.1"
url = { workspace = true }
uuid = "1.7"

2 changes: 1 addition & 1 deletion datafusion/common/Cargo.toml
@@ -61,7 +61,7 @@ num_cpus = { workspace = true }
object_store = { workspace = true, optional = true }
parquet = { workspace = true, optional = true, default-features = true }
paste = "1.0.15"
-pyo3 = { version = "0.21.0", optional = true }
+pyo3 = { version = "0.22.0", optional = true }
sqlparser = { workspace = true }

[target.'cfg(target_family = "wasm")'.dependencies]
2 changes: 1 addition & 1 deletion datafusion/core/src/datasource/file_format/parquet.rs
@@ -1999,7 +1999,7 @@ mod tests {

// test result in int_col
let int_col_index = page_index.get(4).unwrap();
-let int_col_offset = offset_index.get(4).unwrap();
+let int_col_offset = offset_index.get(4).unwrap().page_locations();

// 325 pages in int_col
assert_eq!(int_col_offset.len(), 325);
@@ -392,13 +392,16 @@ impl<'a> PagesPruningStatistics<'a> {
trace!("No page offsets for row group {row_group_index}, skipping");
return None;
};
-let Some(page_offsets) = row_group_page_offsets.get(parquet_column_index) else {
+let Some(offset_index_metadata) =
+    row_group_page_offsets.get(parquet_column_index)
+else {
trace!(
"No page offsets for column {:?} in row group {row_group_index}, skipping",
converter.arrow_field()
);
return None;
};
+let page_offsets = offset_index_metadata.page_locations();

Some(Self {
row_group_index,
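The two hunks above are the consumer-side change for the parquet 53 offset index: each per-column entry is now an `OffsetIndexMetaData` wrapper, and the raw page locations are reached through its `page_locations()` method. A minimal sketch of the new access pattern (a hypothetical helper assuming a `ParquetMetaData` already loaded from a reader; not code from this PR):

```rust
use parquet::file::metadata::ParquetMetaData;

/// Sketch: count data pages per column via the parquet 53 offset index.
fn count_pages(metadata: &ParquetMetaData) {
    // The offset index is only present if it was read from the file.
    if let Some(offset_index) = metadata.offset_index() {
        for (rg, columns) in offset_index.iter().enumerate() {
            for (col, offset_index_metadata) in columns.iter().enumerate() {
                // New in 53: go through the OffsetIndexMetaData wrapper.
                let page_locations = offset_index_metadata.page_locations();
                println!("row group {rg}, column {col}: {} pages", page_locations.len());
            }
        }
    }
}
```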
@@ -487,11 +487,23 @@ mod tests {
let schema_descr = get_test_schema_descr(vec![field]);
let rgm1 = get_row_group_meta_data(
&schema_descr,
-vec![ParquetStatistics::int32(Some(1), Some(10), None, 0, false)],
+vec![ParquetStatistics::int32(
+    Some(1),
+    Some(10),
+    None,
+    Some(0),
+    false,
+)],

[Review comment from the PR author: Due to apache/arrow-rs#6216 (where null counts are now Option)]
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
-vec![ParquetStatistics::int32(Some(11), Some(20), None, 0, false)],
+vec![ParquetStatistics::int32(
+    Some(11),
+    Some(20),
+    None,
+    Some(0),
+    false,
+)],
);

let metrics = parquet_file_metrics();
@@ -520,11 +532,11 @@
let schema_descr = get_test_schema_descr(vec![field]);
let rgm1 = get_row_group_meta_data(
&schema_descr,
-vec![ParquetStatistics::int32(None, None, None, 0, false)],
+vec![ParquetStatistics::int32(None, None, None, Some(0), false)],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
-vec![ParquetStatistics::int32(Some(11), Some(20), None, 0, false)],
+vec![ParquetStatistics::int32(Some(11), Some(20), None, Some(0), false)],
);
let metrics = parquet_file_metrics();
// missing statistics for first row group mean that the result from the predicate expression
@@ -560,15 +572,15 @@
let rgm1 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(1), Some(10), None, 0, false),
-ParquetStatistics::int32(Some(1), Some(10), None, 0, false),
+ParquetStatistics::int32(Some(1), Some(10), None, Some(0), false),
+ParquetStatistics::int32(Some(1), Some(10), None, Some(0), false),
],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(11), Some(20), None, 0, false),
-ParquetStatistics::int32(Some(11), Some(20), None, 0, false),
+ParquetStatistics::int32(Some(11), Some(20), None, Some(0), false),
+ParquetStatistics::int32(Some(11), Some(20), None, Some(0), false),
],
);

@@ -633,16 +645,16 @@
let rgm1 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(-10), Some(-1), None, 0, false), // c2
-ParquetStatistics::int32(Some(1), Some(10), None, 0, false),
+ParquetStatistics::int32(Some(-10), Some(-1), None, Some(0), false), // c2
+ParquetStatistics::int32(Some(1), Some(10), None, Some(0), false),
],
);
// rg1 has c2 greater than zero, c1 less than zero
let rgm2 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(1), Some(10), None, 0, false),
-ParquetStatistics::int32(Some(-10), Some(-1), None, 0, false),
+ParquetStatistics::int32(Some(1), Some(10), None, Some(0), false),
+ParquetStatistics::int32(Some(-10), Some(-1), None, Some(0), false),
],
);

@@ -669,15 +681,15 @@
let rgm1 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(1), Some(10), None, 0, false),
-ParquetStatistics::boolean(Some(false), Some(true), None, 0, false),
+ParquetStatistics::int32(Some(1), Some(10), None, Some(0), false),
+ParquetStatistics::boolean(Some(false), Some(true), None, Some(0), false),
],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
vec![
-ParquetStatistics::int32(Some(11), Some(20), None, 0, false),
-ParquetStatistics::boolean(Some(false), Some(true), None, 1, false),
+ParquetStatistics::int32(Some(11), Some(20), None, Some(0), false),
+ParquetStatistics::boolean(Some(false), Some(true), None, Some(1), false),
],
);
vec![rgm1, rgm2]
@@ -775,21 +787,21 @@
Some(100),
Some(600),
None,
-0,
+Some(0),
false,
)],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
// [0.1, 0.2]
// c1 > 5, this row group will not be included in the results.
-vec![ParquetStatistics::int32(Some(10), Some(20), None, 0, false)],
+vec![ParquetStatistics::int32(Some(10), Some(20), None, Some(0), false)],
);
let rgm3 = get_row_group_meta_data(
&schema_descr,
// [1, None]
// c1 > 5, this row group can not be filtered out, so will be included in the results.
-vec![ParquetStatistics::int32(Some(100), None, None, 0, false)],
+vec![ParquetStatistics::int32(Some(100), None, None, Some(0), false)],
);
let metrics = parquet_file_metrics();
let mut row_groups = RowGroupAccessPlanFilter::new(ParquetAccessPlan::new_all(3));
@@ -837,27 +849,27 @@
Some(100),
Some(600),
None,
-0,
+Some(0),
false,
)],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
// [10, 20]
// c1 > 5, this row group will be included in the results.
-vec![ParquetStatistics::int32(Some(10), Some(20), None, 0, false)],
+vec![ParquetStatistics::int32(Some(10), Some(20), None, Some(0), false)],
);
let rgm3 = get_row_group_meta_data(
&schema_descr,
// [0, 2]
// c1 > 5, this row group will not be included in the results.
-vec![ParquetStatistics::int32(Some(0), Some(2), None, 0, false)],
+vec![ParquetStatistics::int32(Some(0), Some(2), None, Some(0), false)],
);
let rgm4 = get_row_group_meta_data(
&schema_descr,
// [None, 2]
// c1 > 5, this row group can not be filtered out, so will be included in the results.
-vec![ParquetStatistics::int32(None, Some(2), None, 0, false)],
+vec![ParquetStatistics::int32(None, Some(2), None, Some(0), false)],
);
let metrics = parquet_file_metrics();
let mut row_groups = RowGroupAccessPlanFilter::new(ParquetAccessPlan::new_all(4));
@@ -896,19 +908,19 @@
Some(600),
Some(800),
None,
-0,
+Some(0),
false,
)],
);
let rgm2 = get_row_group_meta_data(
&schema_descr,
// [0.1, 0.2]
-vec![ParquetStatistics::int64(Some(10), Some(20), None, 0, false)],
+vec![ParquetStatistics::int64(Some(10), Some(20), None, Some(0), false)],
);
let rgm3 = get_row_group_meta_data(
&schema_descr,
// [0.1, 0.2]
-vec![ParquetStatistics::int64(None, None, None, 0, false)],
+vec![ParquetStatistics::int64(None, None, None, Some(0), false)],
);
let metrics = parquet_file_metrics();
let mut row_groups = RowGroupAccessPlanFilter::new(ParquetAccessPlan::new_all(3));
@@ -957,7 +969,7 @@
8000i128.to_be_bytes().to_vec(),
))),
None,
-0,
+Some(0),
false,
)],
);
@@ -973,15 +985,15 @@
20000i128.to_be_bytes().to_vec(),
))),
None,
-0,
+Some(0),
false,
)],
);

let rgm3 = get_row_group_meta_data(
&schema_descr,
vec![ParquetStatistics::fixed_len_byte_array(
-None, None, None, 0, false,
+None, None, None, Some(0), false,
)],
);
let metrics = parquet_file_metrics();
@@ -1027,7 +1039,7 @@
// 80.00
Some(ByteArray::from(8000i128.to_be_bytes().to_vec())),
None,
-0,
+Some(0),
false,
)],
);
@@ -1039,13 +1051,13 @@
// 200.00
Some(ByteArray::from(20000i128.to_be_bytes().to_vec())),
None,
-0,
+Some(0),
false,
)],
);
let rgm3 = get_row_group_meta_data(
&schema_descr,
-vec![ParquetStatistics::byte_array(None, None, None, 0, false)],
+vec![ParquetStatistics::byte_array(None, None, None, Some(0), false)],
);
let metrics = parquet_file_metrics();
let mut row_groups = RowGroupAccessPlanFilter::new(ParquetAccessPlan::new_all(3));
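Every `0` to `Some(0)` change in this file follows from the statistics constructors taking the null count as an `Option` in parquet 53 (apache/arrow-rs#6216), so an absent count is no longer conflated with a count of zero. A standalone sketch of the new signature as exercised by the tests above (variable names are illustrative):

```rust
use parquet::file::statistics::Statistics;

fn main() {
    // parquet 53: the fourth argument (the null count) is an Option<u64>.
    // Some(0) asserts "zero nulls were counted" ...
    let known_zero = Statistics::int32(Some(1), Some(10), None, Some(0), false);
    // ... while None means the writer recorded no null count at all.
    let unknown = Statistics::int32(Some(1), Some(10), None, None, false);
    println!("{known_zero:?}\n{unknown:?}");
}
```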
3 changes: 1 addition & 2 deletions datafusion/functions/src/regex/regexpreplace.rs
@@ -401,8 +401,7 @@ fn _regexp_replace_static_pattern_replace<T: OffsetSizeTrait>(
DataType::Utf8View => {
let string_view_array = as_string_view_array(&args[0])?;

-let mut builder = StringViewBuilder::with_capacity(string_view_array.len())
-    .with_block_size(1024 * 1024 * 2);
+let mut builder = StringViewBuilder::with_capacity(string_view_array.len());

[Review comment (Contributor): does that mean the block size is not used in the builder for this case?]
[Reply (Contributor): yes, we don't need it after this PR is merged: apache/arrow-rs#6136. I ran a local benchmark and saw no perf diff.]

for val in string_view_array.iter() {
if let Some(val) = val {
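The thread above refers to arrow 53's `StringViewBuilder` sizing its data blocks automatically by default (apache/arrow-rs#6136), which is why the explicit block size can be dropped here; callers that still want a fixed size use the renamed `with_fixed_block_size`, as in the coalesce_batches.rs hunks below. A minimal sketch of both modes (values are illustrative, not from this PR):

```rust
use arrow::array::StringViewBuilder;

fn main() {
    // Default in arrow 53: block sizes are chosen and grown automatically,
    // so most call sites no longer pick a size up front.
    let mut builder = StringViewBuilder::with_capacity(16);
    builder.append_value("a string long enough to spill into a data buffer");
    let array = builder.finish();
    assert_eq!(array.len(), 1);

    // Opting back into a fixed block size uses the renamed method.
    let _fixed = StringViewBuilder::with_capacity(16).with_fixed_block_size(8192);
}
```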
2 changes: 1 addition & 1 deletion datafusion/physical-expr-common/src/binary_view_map.rs
@@ -149,7 +149,7 @@
output_type,
map: hashbrown::raw::RawTable::with_capacity(INITIAL_MAP_CAPACITY),
map_size: 0,
-builder: GenericByteViewBuilder::new().with_block_size(2 * 1024 * 1024),
+builder: GenericByteViewBuilder::new(),
random_state: RandomState::new(),
hashes_buffer: vec![],
null: None,
5 changes: 3 additions & 2 deletions datafusion/physical-plan/src/coalesce_batches.rs
@@ -535,7 +535,7 @@ fn gc_string_view_batch(batch: &RecordBatch) -> RecordBatch {
// See https://github.com/apache/arrow-rs/issues/6094 for more details.
let mut builder = StringViewBuilder::with_capacity(s.len());
if ideal_buffer_size > 0 {
-builder = builder.with_block_size(ideal_buffer_size as u32);
+builder = builder.with_fixed_block_size(ideal_buffer_size as u32);
}

for v in s.iter() {
@@ -856,7 +856,8 @@
impl StringViewTest {
/// Create a `StringViewArray` with the parameters specified in this struct
fn build(self) -> StringViewArray {
-let mut builder = StringViewBuilder::with_capacity(100).with_block_size(8192);
+let mut builder =
+    StringViewBuilder::with_capacity(100).with_fixed_block_size(8192);
loop {
for &v in self.strings.iter() {
builder.append_option(v);
4 changes: 2 additions & 2 deletions datafusion/proto-common/Cargo.toml
@@ -44,8 +44,8 @@ arrow = { workspace = true }
chrono = { workspace = true }
datafusion-common = { workspace = true }
object_store = { workspace = true }
-pbjson = { version = "0.6.0", optional = true }
-prost = "0.12.0"
+pbjson = { version = "0.7.0", optional = true }
+prost = "0.13.0"
serde = { version = "1.0", optional = true }
serde_json = { workspace = true, optional = true }

4 changes: 2 additions & 2 deletions datafusion/proto-common/gen/Cargo.toml
@@ -34,5 +34,5 @@ workspace = true

[dependencies]
# Pin these dependencies so that the generated output is deterministic
-pbjson-build = "=0.6.2"
-prost-build = "=0.12.6"
+pbjson-build = "=0.7.0"
+prost-build = "=0.13.1"
4 changes: 2 additions & 2 deletions datafusion/proto/Cargo.toml
@@ -52,8 +52,8 @@ datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
datafusion-proto-common = { workspace = true }
object_store = { workspace = true }
-pbjson = { version = "0.6.0", optional = true }
-prost = "0.12.0"
+pbjson = { version = "0.7.0", optional = true }
+prost = "0.13.0"
serde = { version = "1.0", optional = true }
serde_json = { workspace = true, optional = true }

4 changes: 2 additions & 2 deletions datafusion/proto/gen/Cargo.toml
@@ -34,5 +34,5 @@ workspace = true

[dependencies]
# Pin these dependencies so that the generated output is deterministic
-pbjson-build = "=0.6.2"
-prost-build = "=0.12.6"
+pbjson-build = "=0.7.0"
+prost-build = "=0.13.1"