profiling prototype code in tests #9

Open
wants to merge 14 commits into base: branch-24.06

Conversation

@karthikeyann (Owner) commented May 8, 2024

PR to profile the different_get_json_object Spark prototype.

The profile is posted here: rapidsai#15605 (comment)

mroeschke and others added 11 commits May 15, 2024 17:26
…5749)

closes rapidsai#15748

The performance implications can be seen in the issue.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#15749
…s.yaml (rapidsai#15736)

Pulls out some changes I noticed while working on rapidsai#15245.

* removes `host` dependency on `setuptools` for `cudf` and `cudf_kafka`
  - *they don't need it now that they build with `scikit-build-core`*
* consolidates some redundant blocks in `dependencies.yaml`

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15736
Add stream parameter to public reduction APIs:

- `reduce()`
- `segmented_reduce()`
- `scan()`
- `minmax()`

Reference rapidsai#13744
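For illustration, a minimal caller-side sketch (not from this PR; the aggregation and output type are arbitrary) of passing an explicit stream to `cudf::reduce()`:

```cpp
// Minimal sketch: passing an explicit stream to cudf::reduce().
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <memory>

std::unique_ptr<cudf::scalar> sum_on_stream(cudf::column_view const& col,
                                            rmm::cuda_stream_view stream)
{
  auto const agg = cudf::make_sum_aggregation<cudf::reduce_aggregation>();
  // The stream argument precedes the (defaulted) memory resource parameter.
  return cudf::reduce(col, *agg, col.type(), stream);
}
```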

Authors:
  - Srinivas Yadav (https://github.com/srinivasyadav18)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#15737
…sai#15735)

Fixes rapidsai#15690

There was an issue when computing page row counts/indices at the pass level in the chunked reader. Because we estimate list row counts for pages we have not yet decompressed, this could sometimes lead to estimated row counts larger than the actual (known) number of rows for a pass, causing an out-of-bounds read down the line. We were already handling this at the subpass level, just not at the pass level.

Also includes some fixes in debug logging code that is #ifdef'd out.
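A hypothetical sketch of the clamping idea only; the names here are made up, and the actual fix lives in the chunked reader's pass-construction internals:

```cpp
// Sketch: an estimated (not yet decompressed) list page may claim more rows
// than the pass actually contains; clamp so downstream indexing stays in bounds.
#include <algorithm>
#include <cstddef>

std::size_t clamp_page_row_estimate(std::size_t estimated_page_rows,
                                    std::size_t pass_total_rows,
                                    std::size_t pass_rows_so_far)
{
  return std::min(estimated_page_rows, pass_total_rows - pass_rows_so_far);
}
```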

Authors:
  - https://github.com/nvdbaranec
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15735
…ion objects (rapidsai#15732)

Since the C++ layer provides implementations of these, use them, rather than redoing an implementation. This avoids things ever getting out of sync.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15732
Fixes: rapidsai#15742 

This PR resolves issues with `DatetimeIndex.loc` returning incorrect ranges when the index is monotonically decreasing. Additionally, it fixes the same problem for all other orderings (e.g., random ordering).

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#15761
…i#15672)

Reduces the runtime of `ParquetChunkedReaderInputLimitTest.List` and `ParquetChunkedReaderInputLimitTest.Mixed`, which together account for about one third of the total time for `PARQUET_TEST`.
These two tests produce multi-GB test files that are not strictly necessary for testing the chunked reader, since the chunk sizes are controllable. The changes here reduce the runtime of these two tests by about one third.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Paul Mattione (https://github.com/pmattione-nvidia)

URL: rapidsai#15672
Fixes the offsets type of the list column returned by `cudf::strings::split_record` and `cudf::strings::split_record_re` when large-strings support is enabled. The list column's offsets type must be INT32; the code now uses the appropriate `make_offsets_child_column` utility function.
Also adds `is_large_strings_enabled()` checks to the check-overflow gtests.
This allows all current gtests to pass when the large-strings support environment variable is set.
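A short usage sketch (input and delimiter are illustrative) of the invariant this change preserves:

```cpp
// Sketch: split_record returns a LIST column whose offsets child is INT32
// even when large-strings support is enabled.
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/split/split.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>

std::unique_ptr<cudf::column> split_on_comma(cudf::strings_column_view const& input)
{
  auto result = cudf::strings::split_record(input, cudf::string_scalar(","));
  // Expected: cudf::type_id::INT32, regardless of the large-strings setting.
  auto const offsets_type = cudf::lists_column_view(result->view()).offsets().type();
  (void)offsets_type;
  return result;
}
```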

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15707
Fixes the check for multibyte characters on a large strings column. The `thrust::count_if` exceeds the int64 reduce type maximum, so the logic was recoded as a native kernel. Also adds additional tests and fixes subsequent errors where kernels were launched with more than `max(size_type)` threads.
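A hedged sketch of the native-kernel approach described above, assuming a grid-stride loop and a 64-bit atomic counter (names are illustrative, not the actual kernel):

```cpp
// Sketch: tally multibyte UTF-8 lead bytes into a 64-bit counter, so neither
// the thread count nor the tally is limited by 32-bit sizes.
#include <cstdint>

__global__ void count_multibyte_chars(char const* d_chars,
                                      std::int64_t num_bytes,
                                      unsigned long long* d_count)
{
  auto const stride = static_cast<std::int64_t>(gridDim.x) * blockDim.x;
  unsigned long long local = 0;
  for (auto idx = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       idx < num_bytes;
       idx += stride) {
    // Any byte with the high bit set belongs to a multibyte UTF-8 sequence.
    local += (static_cast<unsigned char>(d_chars[idx]) & 0x80u) != 0;
  }
  if (local) { atomicAdd(d_count, local); }
}
```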

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#15721
Fixes rapidsai#15051

Predicate filtering in the Parquet reader did not work when column projection was used. This PR fixes that limitation.

With this change, the user can use both column name references and column index references in the filter:
- column name reference: the filter may reference any column by name, even columns not present in the column projection.
- column index reference: the indices refer to the output columns in the requested order.
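For illustration, a hedged sketch (file name, column names, and threshold are made up) of a read where the filter names a column that is absent from the projection:

```cpp
// Sketch: filter on "price" by name while projecting only "id" and "name".
#include <cudf/ast/expressions.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/scalar/scalar.hpp>

cudf::io::table_with_metadata read_filtered()
{
  auto price  = cudf::ast::column_name_reference("price");  // not projected
  auto cutoff = cudf::numeric_scalar<int32_t>(100);
  auto lit    = cudf::ast::literal(cutoff);
  auto filter = cudf::ast::operation(cudf::ast::ast_operator::LESS, price, lit);

  auto opts = cudf::io::parquet_reader_options::builder(
                cudf::io::source_info{"sales.parquet"})
                .columns({"id", "name"})  // projection omits "price"
                .filter(filter)
                .build();
  return cudf::io::read_parquet(opts);
}
```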

This is achieved by extracting the column names from the filter and adding them to the output buffers; after predicate filtering is done, these filter-only columns are removed and only the requested columns are returned.
The change also reads only the output columns' statistics data instead of all root columns'.

Summary of changes:
- `get_column_names_in_expression` extracts the column names used in the filter.
- The extra filter columns are added to the output buffers during reader initialization
  - `cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`
- Statistics data is extracted only for the output columns (including columns in the filter) instead of all root columns
  - `cpp/src/io/parquet/predicate_pushdown.cpp`
  - To do this, the output column schemas and their dtypes are cached.
  - The statistics data extraction code now checks for `schema_idx` in the row group metadata.
  - The filter no longer needs to be converted again for all root columns; the passed output-column reference filter is reused.
  - The rest of the code is unchanged.
- After the output filter predicate is computed, the filter-only columns are removed
- Moved the `named_to_reference_converter` constructor to the .cpp file and removed the unused constructor.
- Small `#include` cleanup

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: rapidsai#15113