ARROW-8883: [Rust] [Integration] Enable more tests #8200
Conversation
Remember to update the implementation status when done ;-)
Force-pushed 63c6e4c to 24d274a
Thanks a lot for driving this and CCing. This is definitely important. I have myself hit this. As far as I can tell, we have two use-cases of buffer comparison.

Because arrays share buffers, the total size of an array is currently misleading: for example, when the array is computed from a slice of another array. Regardless, I would say that in the vast majority of the use-cases in which we want to compare buffers, we want to compare their contents, irrespective of how they should be deallocated or their capacity. Therefore, I would be happy to have buffer comparison be based on actual content (in bytes). We just need to be careful about bitmaps, for which the comparison should be made in bits.
I agree with your thinking here, especially on equality and arrays with offsets. One of the things we don't test (extensively or at all) is whether we're able to write arrays with unaligned offsets to the IPC format. I'll put in time to work on this whenever I get to v5 of the Arrow format.
I'd also prefer not to compare using buffer capacity, as I'm mainly interested in the buffer contents, especially when dealing with a frozen buffer where we won't be modifying contents. On bitmaps, I'll have a look, but I think that the approach of using the bit iterators should cover this. Apologies again for not having gotten to #8401 yet, I hope you're not in a rush to get it implemented (especially as we're now looking at 3.0 for all new work).
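For what it's worth, a minimal sketch of what bit-level bitmap comparison could look like, using `arrow::util::bit_util::get_bit`. This is illustrative only — the offsets and length are hypothetical parameters, and the actual bit-slicing work compares whole 64-bit words at a time rather than single bits, but the semantics are the same:

```rust
use arrow::util::bit_util::get_bit;

/// Compare `len` bits of two validity bitmaps, starting at the given
/// bit offsets. Comparing raw bytes would be wrong whenever an offset
/// is not a multiple of 8, or when trailing padding bits differ.
fn bitmap_equal(lhs: &[u8], lhs_offset: usize, rhs: &[u8], rhs_offset: usize, len: usize) -> bool {
    (0..len).all(|i| get_bit(lhs, lhs_offset + i) == get_bit(rhs, rhs_offset + i))
}

fn main() {
    // 0b0000_0101 read from bit 0 and 0b0000_1010 read from bit 1
    // agree on two bits: [1, 0] == [1, 0]
    assert!(bitmap_equal(&[0b0000_0101], 0, &[0b0000_1010], 1, 2));
}
```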
Force-pushed 832c187 to d103c3f
I did a quick run-through and left some comments. I am still missing the large bin file.
So far, all changes make sense to me (apart from the comments).
Force-pushed d103c3f to e3b6e42
Force-pushed 5ec1b21 to 4b5dc0d
This is a major refactor of the `equal.rs` module. The rationale for this change is many-fold:

* Currently, array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomic, as the user must "guess" what to downcast for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and some of the tests of the builders.
* The code in array comparison is difficult to follow, given the amount of calls that it performs around offsets.
* The implementation currently (indirectly) uses many of the `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate on and mutate.
* Some code is repeated.

This PR:

1. Adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality depend only on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related APIs) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality statically dispatched, via `match datatype`.
5. DRYs the code around array equality
6. Fixes an error in equality of dictionaries with equal values
7. Adds tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. Splits `equal.rs` into smaller, more manageable files.
9. Removes `ArrayListOps`, since it is no longer needed
10. Moves JSON equality to its own module, for clarity.
11. Removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective widths to datatypes from the specification. This was backported from #8401
13. Adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on in #8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long. It also improves performance by 10-40%.

<details>
<summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend
equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```
</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.

Closes #8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
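A minimal sketch of the shape of changes 1, 2 and 4 above (illustrative only: `boolean_equal` and `utf8_equal` stand in for the per-type helpers, a `data_ref(): &ArrayData` accessor on `Array` is assumed, and the `impl` for the foreign trait object can only live inside the arrow crate itself):

```rust
use arrow::array::{Array, ArrayData};
use arrow::datatypes::DataType;

// Stand-ins for the per-type helpers that the PR splits into small
// modules (e.g. `equal/boolean.rs`, `equal/variable_size.rs`).
fn boolean_equal(_lhs: &ArrayData, _rhs: &ArrayData) -> bool { true }
fn utf8_equal(_lhs: &ArrayData, _rhs: &ArrayData) -> bool { true }

/// Equality decided on `ArrayData` alone: no downcasting to concrete
/// array types, and the dispatch is a static `match` on the datatype.
fn equal(lhs: &ArrayData, rhs: &ArrayData) -> bool {
    lhs.data_type() == rhs.data_type()
        && lhs.len() == rhs.len()
        && match lhs.data_type() {
            DataType::Boolean => boolean_equal(lhs, rhs),
            DataType::Utf8 => utf8_equal(lhs, rhs),
            _ => false, // remaining types elided in this sketch
        }
}

// With equality defined on `ArrayData`, `dyn Array` can be compared
// directly; users write `array1 == array2` instead of downcasting.
impl PartialEq for dyn Array {
    fn eq(&self, other: &Self) -> bool {
        equal(self.data_ref(), other.data_ref())
    }
}
```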
Tested with Rust and Java, but there are Java failures

make ARROW_MAGIC and CONT_MARKER consts

remove unused import
We weren't able to read dictionaries because of legacy issues
expose Field's dictionary fields via methods

some progress on dictionary integration
The intention is not to commit this, but to share with other users who might be interested in helping out with the integration testing. I found the docker-compose process too slow, and it was returning errors on Rust :(
This is to be more in line with integration testing behaviour
After slicing buffers, we can end up with two buffers having the same len and data, but not the same capacity. Such buffers fail the equality test. CC @sunchao @jorgecarleitao as you've worked on this. Perhaps we can find a suitable definition of equality, as we're otherwise failing integration tests because of this.
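A hypothetical reproduction of what I mean (a sketch only — the exact capacities depend on allocation padding, and the `as_slice`/`capacity` accessors are assumptions on my side):

```rust
use arrow::buffer::Buffer;

fn main() {
    // Two buffers with the same len and the same bytes, where one is a
    // slice into a larger allocation: capacity-based equality fails even
    // though the visible contents are identical.
    let a = Buffer::from(&[1u8, 2, 3, 4]);
    let b = Buffer::from(&[9u8, 9, 9, 9, 1, 2, 3, 4]).slice(4);

    assert_eq!(a.len(), b.len());
    assert_eq!(a.as_slice(), b.as_slice()); // same data...
    // ...but a.capacity() and b.capacity() need not be equal.
}
```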
We were only doing basic comparisons of array properties, such as whether data types and lengths are the same. This extends that to test the array data.

fix lint
remove my integration testing helper

rebase, fix failing test
(cherry picked from commit 03b8aa4)
Force-pushed 2e3763f to dde96c6
@jorgecarleitao @carols10cents I'm contemplating closing this and breaking it up into smaller PRs (if this round of changes doesn't pass integration). I'll work on it this week in my evenings (GMT+2) so I can get it out of the way and go back to working on Parquet.
From what I have read in this PR, it looks good so far. Is there anything blocking it?
@jorgecarleitao the actual integration tests not passing are the blocker 😅 There were 10 failures in the last run, all related to logical equality. It's actually odd, because in my local testing I get 120+ failures between Rust and the other languages (except C++, which passes). I'll monitor https://github.com/apache/arrow/pull/8200/checks?check_run_id=1378263093 in the background; if it passes, we can merge this (if you don't have any outstanding queries/concerns). I used your suggested implementation of logical equality (dde96c6)
@jorgecarleitao we're down to 3 failures.

`thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', arrow/src/array/equal/variable_size.rs:31:21`

Looks like we're not comparing zero-length arrays correctly.
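My guess at the class of fix, as a sketch (a hypothetical helper, not the actual patch): a variable-size comparison that indexes into the offsets buffer has to short-circuit the empty case first.

```rust
/// Sketch of comparing the offsets region of two variable-size arrays
/// (e.g. the i32 offsets of a Utf8/Binary array). Indexing `offsets[0]`
/// on an empty array is exactly the kind of access that panicked above,
/// so `len == 0` must be handled before touching the buffers.
fn offsets_equal(lhs_offsets: &[i32], rhs_offsets: &[i32], len: usize) -> bool {
    if len == 0 {
        return true; // nothing to compare; avoids indexing an empty slice
    }
    // an array of `len` values has `len + 1` offsets
    lhs_offsets[..=len] == rhs_offsets[..=len]
}
```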
@jorgecarleitao finally passed! There are still some tests that don't pass, but I've updated the status documentation with them (cc @andygrove as you had opened a JIRA for this)
…info

The purpose of this PR is refactoring `read_dictionary` to only need one kind of `Schema`, which lets us then remove the `find_dictionary_field` function and the `ipc_schema` field on `StreamReader` by adding a way to look up schema fields that use a particular dictionary by ID. I'm also resubmitting a change to the `dict_id`/`dict_is_ordered` methods on `Field`; I had submitted this to @nevi-me to become part of #8200, but it looks like it got lost in a rebase or something? I think it's more correct to only return values if the fields have a dictionary as their datatype.

Closes #8820 from carols10cents/simplify-dictionary-field-lookup

Authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
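A sketch of what such a lookup could look like (a hypothetical free function, not the PR's code; it assumes `Field::dict_id()` returns `Some` only for dictionary-typed fields, per the change described above, and it only scans top-level fields):

```rust
use arrow::datatypes::{Field, Schema};

/// Hypothetical helper: collect the schema fields whose
/// dictionary-encoded data uses the given dictionary id.
fn fields_with_dict_id(schema: &Schema, dict_id: i64) -> Vec<&Field> {
    schema
        .fields()
        .iter()
        .filter(|field| field.dict_id() == Some(dict_id))
        .collect()
}
```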
This enables more integration tests, and makes the necessary changes to make them pass.
Status
Primitive Arrays
These are passing, but I have questions on whether the way I'm comparing them is fine.
Specifically, when reading an array that has 3 values, the `Buffer` is returned aligned to 8 bytes, such that an `i8`'s buffer would have `{ data: [v1, v2, v3, 0, 0, 0, 0, 0] }`. This causes problems, because when we reconstruct the above from JSON, we would have `{ data: [v1, v2, v3] }`.

I have implemented `PartialEq` manually for `ArrayData` at 261c021. I'd appreciate some feedback on whether the approach is fine or not. I deliberately used `assert_eq!(a, b, "helpful message")` to make it a bit easier to track down the differences between objects being compared, as it's otherwise very difficult and tedious without some `log!()` implementation.

@jhorstmann's work on bit slicing has been instrumental in allowing us to compare `ArrayData`, so I'm leveraging that for buffer comparisons. We can, however, end up with a situation where slicing buffers with offsets leads to two buffers having different capacities. I don't think this is a bug.

Due to the above, I've also had to compare `Buffer` by length, and not capacity, at 69f4ef3. @sunchao @jorgecarleitao PTAL and advise. I just made the change so tests could pass, but perhaps we could agree on what constitutes equal buffers (whether we should compare capacity); see the sketch below. I'd prefer to make the above change as a separate PR, so I can address offsets and also add tests for coverage.
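The length-based comparison amounts to something like this sketch (illustrative only; the `byte_width` parameter and the `as_slice` accessor are assumptions on my side):

```rust
use arrow::buffer::Buffer;

/// Compare only the meaningful prefix of two data buffers: `len` values
/// of `byte_width` bytes each. The trailing alignment padding (the zeros
/// in `[v1, v2, v3, 0, 0, 0, 0, 0]` above) is ignored on both sides.
fn values_equal(lhs: &Buffer, rhs: &Buffer, len: usize, byte_width: usize) -> bool {
    let n = len * byte_width;
    lhs.as_slice()[..n] == rhs.as_slice()[..n]
}
```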
Complex Types
Lists and structs currently fail because we can't carry custom names and nullability. For example, if a list's child is a list, such a child could be named anything (I've seen `item` or `list`). We currently have no way of encoding this information, and thus tests fail on schema equality.

I opened https://issues.apache.org/jira/browse/ARROW-10261 to address this. I'll open a PR on the branch that's soon to be renamed from `master` 🤭, as this will affect Parquet and DataFusion.

Dictionaries
We seem to have covered a bit of dictionary support, but because we can't write dictionaries yet, we can't complete the integration testing.

I've opened https://issues.apache.org/jira/browse/ARROW-10268 for any interested takers. I think it should be fairly trivial to implement, but I'm already trying to do too much.

@carols10cents this builds on the work that you did for dictionary support in `arrow::ipc::conversion`, so thanks for that 😃

Metadata
We fail on some metadata tests because `arrow::datatypes::Field` doesn't support metadata. This one's a bit difficult to implement, as `HashMap<String, String>` doesn't support `#[derive(Hash)]` and other requirements. I was thinking that maybe we could use `Vec<(String, String)>` and deal with deduplication of fields, and compatible JSON serde, internally; see the sketch below.

I opened https://issues.apache.org/jira/browse/ARROW-10259 for the above.
Quicker Testing
One of the reasons why I've taken long to work on this was that I struggled a bit with running the integration tests (at least, I had to go into a few dirs and build stuff). `docker-compose --file docker-compose.yml run --rm conda-integration` took long, and I was getting sporadic failures in Rust.

I've temporarily created a script called `integration-testing.sh`, so that whoever pitches in to help out can use it to quickly build the C++, Java and Rust binaries for testing. @pitrou @nealrichardson I'll remove the script and the .dockerignore entry before merging this.

Lastly, Antoine, seeing that we can't complete this in time for 2.0.0, I'm fine with leaving the implementation status fields on Rust as blank for a few more months.

Apologies for the "CC the world" update, but I think this is important enough for us to get more eyes on it.

Closes apache#8200 from nevi-me/rust-integration-more-tests-sep2020

Lead-authored-by: Neville Dipale <[email protected]>
Co-authored-by: Carol (Nichols || Goulding) <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>