forked from rapidsai/cudf
profiling prototype code in tests #9 (Open)
karthikeyann wants to merge 14 commits into branch-24.06 from exp-json_spark_profile
Conversation
…5749) closes rapidsai#15748. The performance implication can be seen in the issue.
Authors: Matthew Roeschke (https://github.com/mroeschke), GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers: GALI PREM SAGAR (https://github.com/galipremsagar), Lawrence Mitchell (https://github.com/wence-)
URL: rapidsai#15749
…s.yaml (rapidsai#15736) Pulls out some changes I noticed while working on rapidsai#15245:
* removes the `host` dependency on `setuptools` for `cudf` and `cudf_kafka` - *they don't need it now that they build with `scikit-build-core`*
* consolidates some redundant blocks in `dependencies.yaml`
Authors: James Lamb (https://github.com/jameslamb)
Approvers: Vyas Ramasubramani (https://github.com/vyasr)
URL: rapidsai#15736
Add stream parameter to public reduction APIs:
- `reduce()`
- `segmented_reduce()`
- `scan()`
- `minmax()`
Reference rapidsai#13744
Authors: Srinivas Yadav (https://github.com/srinivasyadav18), Vukasin Milovanovic (https://github.com/vuule)
Approvers: David Wendt (https://github.com/davidwendt), Muhammad Haseeb (https://github.com/mhaseeb123), Yunsong Wang (https://github.com/PointKernel)
URL: rapidsai#15737
…sai#15735) Fixes rapidsai#15690. There was an issue when computing page row counts/indices at the pass level in the chunked reader. Because we estimate list row counts for pages we have not yet decompressed, this can sometimes lead to estimated row counts that are larger than the actual (known) number of rows for a pass, which caused an out-of-bounds read down the line. We were already handling this at the subpass level, just not at the pass level. Also includes some fixes in debug logging code that is #ifdef'd out.
Authors: https://github.com/nvdbaranec, David Wendt (https://github.com/davidwendt), Vukasin Milovanovic (https://github.com/vuule)
Approvers: David Wendt (https://github.com/davidwendt), Vyas Ramasubramani (https://github.com/vyasr)
URL: rapidsai#15735
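The fix described in this commit amounts to never letting estimated page row counts exceed the known row total for the pass. A minimal Python sketch of that clamping idea (hypothetical names, not actual cudf code):

```python
def clamp_page_row_counts(estimated_counts, pass_row_total):
    """Clamp per-page row-count estimates so their running total never
    exceeds the known number of rows in the pass (sketch only)."""
    clamped = []
    remaining = pass_row_total
    for est in estimated_counts:
        # An estimate for a not-yet-decompressed list page may overshoot;
        # never hand out more rows than the pass actually contains.
        take = min(est, remaining)
        clamped.append(take)
        remaining -= take
    return clamped

# Estimates overshoot the known pass total of 100 rows.
print(clamp_page_row_counts([40, 50, 30], 100))  # [40, 50, 10]
```

Without the clamp, downstream code indexing by the estimated counts would read past the end of the pass, which is the out-of-bounds read the commit fixes.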
…ion objects (rapidsai#15732) Since the C++ layer provides implementations of these, use them rather than redoing an implementation. This avoids things ever getting out of sync.
Authors: Lawrence Mitchell (https://github.com/wence-)
Approvers: Vyas Ramasubramani (https://github.com/vyasr)
URL: rapidsai#15732
Fixes: rapidsai#15742 This PR resolves issues with returning incorrect ranges for `DatetimeIndex.loc` when the index objects are monotonically decreasing. Additionally, it fixes the behavior for all other orderings (i.e., random ordering) too.
Authors: GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers: Matthew Roeschke (https://github.com/mroeschke)
URL: rapidsai#15761
…i#15672) Reduces the runtime of `ParquetChunkedReaderInputLimitTest.List` and `ParquetChunkedReaderInputLimitTest.Mixed`, which together account for about 1/3 of the total time for `PARQUET_TEST`. These two tests produce multi-GB test files that are not strictly necessary for testing the chunked reader, since the chunk sizes are controllable. The changes here reduce the runtime of these two tests by about 1/3 of the original runtime.
Authors: David Wendt (https://github.com/davidwendt)
Approvers: Vukasin Milovanovic (https://github.com/vuule), Paul Mattione (https://github.com/pmattione-nvidia)
URL: rapidsai#15672
Fixes the offsets type for the list column returned by `cudf::strings::split_record` and `cudf::strings::split_record_re` when large strings are enabled. The list column's offsets type must be INT32, so the code was changed to use the appropriate `make_offsets_child_column` utility function. Also added some `is_large_strings_enabled()` checks to the check-overflow gtests; this allows all current gtests to pass when the large-strings support environment variable is set.
Authors: David Wendt (https://github.com/davidwendt)
Approvers: Nghia Truong (https://github.com/ttnghia), Vyas Ramasubramani (https://github.com/vyasr)
URL: rapidsai#15707
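Conceptually, a list column's offsets child is an exclusive prefix sum of the per-row list sizes: `offsets[i]` marks where row i begins and the last entry is the total element count. A short Python sketch of that construction (illustrative only, not the `make_offsets_child_column` utility itself):

```python
def make_offsets(sizes):
    """Exclusive prefix sum over per-row list sizes.
    offsets[i] is where row i begins; the final entry is the
    total number of child elements (sketch only)."""
    offsets = [0]
    for n in sizes:
        offsets.append(offsets[-1] + n)
    return offsets

# Three rows with 3, 0, and 2 elements respectively.
print(make_offsets([3, 0, 2]))  # [0, 3, 3, 5]
```

The commit's point is that these offsets must stay INT32 for list columns even when the large-strings mode widens string offsets, which is why the shared utility function is used instead of ad-hoc code.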
Fixes the check for multibyte characters on large strings columns. The `thrust::count_if` result can exceed the int64 reduce-type maximum, so the logic was recoded as a native kernel. Also added additional tests and fixed subsequent errors where kernels were launched with more than max(size_type) threads.
Authors: David Wendt (https://github.com/davidwendt)
Approvers: Nghia Truong (https://github.com/ttnghia), Karthikeyan (https://github.com/karthikeyann)
URL: rapidsai#15721
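The underlying hazard here is accumulating a count in a fixed-width integer type: once the running total passes the type's maximum, it wraps around to a negative value. A small Python sketch of 32-bit signed wraparound (illustrative of the overflow class of bug, not cudf's kernel):

```python
INT32_MAX = 2**31 - 1

def add_wrapping_i32(a, b):
    """Simulate 32-bit signed integer addition with wraparound,
    as would happen in C/C++ with a 32-bit accumulator."""
    return (a + b + 2**31) % 2**32 - 2**31

# Counting one element past INT32_MAX wraps to a negative total,
# which is why large-strings counts must use a wider type.
print(add_wrapping_i32(INT32_MAX, 1))  # -2147483648
```

The same reasoning applies to kernel launch configuration: a grid sized from an element count larger than max(size_type) must be computed in a wider type, which is the second class of error the commit fixes.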
Fixes rapidsai#15051. Predicate filtering in Parquet did not work while column projection was used; this PR fixes that limitation. With this change, the user can use both column name references and column index references in the filter:
- column name reference: the filter may specify any columns by name, even if they are not present in the column projection.
- column reference (index): the indices used should be the indices of the output columns in the requested order.
This is achieved by extracting column names from the filter and adding them to the output buffers; after predicate filtering is done, these filter-only columns are removed and only the requested columns are returned. The change also reads only the output columns' statistics data instead of all root columns.
Summary of changes:
- `get_column_names_in_expression` extracts the column names in the filter.
- The extra columns in the filter are added to the output buffers during reader initialization (`cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`).
- Instead of extracting statistics data for all root columns, it is extracted only for the output columns, including columns in the filter (`cpp/src/io/parquet/predicate_pushdown.cpp`). To do this, the output column schemas and their dtypes are cached.
- The statistics data extraction code is updated to check for `schema_idx` in the row group metadata.
- There is no need to convert the filter again for all root columns; the passed output-columns reference filter is reused. The rest of the code is the same.
- After the output filter predicate is calculated, the filter-only columns are removed.
- Moved the `named_to_reference_converter` constructor to the cpp file and removed the unused constructor.
- Small include<> cleanup.
Authors: Karthikeyan (https://github.com/karthikeyann), Vukasin Milovanovic (https://github.com/vuule), Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers: Lawrence Mitchell (https://github.com/wence-), Vukasin Milovanovic (https://github.com/vuule), Muhammad Haseeb (https://github.com/mhaseeb123)
URL: rapidsai#15113
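The read-filter-drop mechanism this commit describes can be sketched in plain Python: columns referenced only by the filter are read alongside the projected columns, the predicate is applied, and the filter-only columns are then dropped (hypothetical names, not the cudf implementation):

```python
def read_with_filter(table, projected, filter_col, predicate):
    """Sketch of predicate filtering combined with column projection.

    table:      dict mapping column name -> list of values (the "file").
    projected:  columns the caller asked for.
    filter_col: column referenced by the filter (may not be projected).
    predicate:  function applied to each value of filter_col.
    """
    # Read the projected columns plus any filter-only columns.
    read_cols = list(projected)
    if filter_col not in read_cols:
        read_cols.append(filter_col)

    # Apply the predicate row by row using the filter column.
    keep = [predicate(v) for v in table[filter_col]]

    # Return only the requested columns, dropping filter-only ones.
    return {c: [v for v, k in zip(table[c], keep) if k] for c in projected}

data = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
# Project only "a" while filtering on "b", which is not projected.
print(read_with_filter(data, ["a"], "b", lambda v: v > 15))
# {'a': [2, 3, 4]}
```

In the real reader the same shape holds at a lower level: filter-only columns are materialized into output buffers, statistics are gathered only for output columns (including filter columns), and the extra columns are stripped after the predicate is evaluated.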
PR to profile the `different_get_json_object` spark prototype. Profile posted here: rapidsai#15605 (comment)