profiling prototype code in tests #9

Open
wants to merge 14 commits into base: branch-24.06

Conversation

@karthikeyann (Owner) commented May 8, 2024

PR to profile the different_get_json_object Spark prototype.

The profile is posted here: rapidsai#15605 (comment)

mroeschke and others added 11 commits May 15, 2024 17:26
…5749)

closes rapidsai#15748

The performance implications can be seen in the issue.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#15749
…s.yaml (rapidsai#15736)

Pulls out some changes I noticed while working on rapidsai#15245.

* removes `host` dependency on `setuptools` for `cudf` and `cudf_kafka`
  - *they don't need it now that they build with `scikit-build-core`*
* consolidates some redundant blocks in `dependencies.yaml`

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15736
Add stream parameter to public reduction APIs:

- `reduce()`
- `segmented_reduce()`
- `scan()`
- `minmax()`

Reference rapidsai#13744
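For illustration, a minimal caller-side sketch (not from this PR; the aggregation and output type are arbitrary) of passing an explicit stream to `cudf::reduce()`:

```cpp
// Minimal sketch: passing an explicit stream to cudf::reduce().
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <memory>

std::unique_ptr<cudf::scalar> sum_on_stream(cudf::column_view const& col,
                                            rmm::cuda_stream_view stream)
{
  auto const agg = cudf::make_sum_aggregation<cudf::reduce_aggregation>();
  // The stream argument precedes the (defaulted) memory resource parameter.
  return cudf::reduce(col, *agg, col.type(), stream);
}
```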

Authors:
  - Srinivas Yadav (https://github.com/srinivasyadav18)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#15737
…sai#15735)

Fixes rapidsai#15690

There was an issue when computing page row counts/indices at the pass level in the chunked reader. Because we estimate list row counts for pages we have not yet decompressed, this could sometimes lead to estimated row counts larger than the actual (known) number of rows for a pass, causing an out-of-bounds read down the line. We were already handling this at the subpass level, just not at the pass level.

Also includes some fixes in debug logging code that is #ifdef'd out.
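A hypothetical sketch of the clamping idea only; the names here are made up, and the actual fix lives in the chunked reader's pass-construction internals:

```cpp
// Sketch: an estimated (not yet decompressed) list page may claim more rows
// than the pass actually contains; clamp so downstream indexing stays in bounds.
#include <algorithm>
#include <cstddef>

std::size_t clamp_page_row_estimate(std::size_t estimated_page_rows,
                                    std::size_t pass_total_rows,
                                    std::size_t pass_rows_so_far)
{
  return std::min(estimated_page_rows, pass_total_rows - pass_rows_so_far);
}
```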

Authors:
  - https://github.com/nvdbaranec
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15735
…ion objects (rapidsai#15732)

Since the C++ layer provides implementations of these, use them, rather than redoing an implementation. This avoids things ever getting out of sync.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15732
Fixes: rapidsai#15742 

This PR resolves issues with `DatetimeIndex.loc` returning incorrect ranges when the index is monotonically decreasing. Additionally, it fixes the same problem for all other orderings (e.g., random ordering).

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#15761
…i#15672)

Reduces the runtime of `ParquetChunkedReaderInputLimitTest.List` and `ParquetChunkedReaderInputLimitTest.Mixed`, which together account for about one third of the total time for `PARQUET_TEST`.
These two tests produce multi-GB test files that are not strictly necessary for testing the chunked reader, since the chunk sizes are controllable. The changes here reduce the runtime of these two tests by about one third.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Paul Mattione (https://github.com/pmattione-nvidia)

URL: rapidsai#15672
Fixes the offsets type of the list column returned by `cudf::strings::split_record` and `cudf::strings::split_record_re` when large-strings support is enabled. The list column's offsets type must be INT32; the code now uses the appropriate `make_offsets_child_column` utility function.
Also adds `is_large_strings_enabled()` checks to the check-overflow gtests.
This allows all current gtests to pass when the large-strings support environment variable is set.
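A short usage sketch (input and delimiter are illustrative) of the invariant this change preserves:

```cpp
// Sketch: split_record returns a LIST column whose offsets child is INT32
// even when large-strings support is enabled.
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/split/split.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>

std::unique_ptr<cudf::column> split_on_comma(cudf::strings_column_view const& input)
{
  auto result = cudf::strings::split_record(input, cudf::string_scalar(","));
  // Expected: cudf::type_id::INT32, regardless of the large-strings setting.
  auto const offsets_type = cudf::lists_column_view(result->view()).offsets().type();
  (void)offsets_type;
  return result;
}
```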

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#15707
Fixes the check for multibyte characters on a large strings column. The `thrust::count_if` exceeds the int64 reduce type maximum, so the logic was recoded as a native kernel. Also adds additional tests and fixes subsequent errors where kernels were launched with more than `max(size_type)` threads.
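A hedged sketch of the native-kernel approach described above, assuming a grid-stride loop and a 64-bit atomic counter (names are illustrative, not the actual kernel):

```cpp
// Sketch: tally multibyte UTF-8 lead bytes into a 64-bit counter, so neither
// the thread count nor the tally is limited by 32-bit sizes.
#include <cstdint>

__global__ void count_multibyte_chars(char const* d_chars,
                                      std::int64_t num_bytes,
                                      unsigned long long* d_count)
{
  auto const stride = static_cast<std::int64_t>(gridDim.x) * blockDim.x;
  unsigned long long local = 0;
  for (auto idx = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       idx < num_bytes;
       idx += stride) {
    // Any byte with the high bit set belongs to a multibyte UTF-8 sequence.
    local += (static_cast<unsigned char>(d_chars[idx]) & 0x80u) != 0;
  }
  if (local) { atomicAdd(d_count, local); }
}
```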

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#15721
Fixes rapidsai#15051

Predicate filtering in the Parquet reader did not work when column projection was used. This PR fixes that limitation.

With this change, the user can use both column name references and column index references in the filter:
- column name reference: the filter may reference any column by name, even columns not present in the column projection.
- column index reference: the indices refer to the output columns in the requested order.
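For illustration, a hedged sketch (file name, column names, and threshold are made up) of a read where the filter names a column that is absent from the projection:

```cpp
// Sketch: filter on "price" by name while projecting only "id" and "name".
#include <cudf/ast/expressions.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/scalar/scalar.hpp>

cudf::io::table_with_metadata read_filtered()
{
  auto price  = cudf::ast::column_name_reference("price");  // not projected
  auto cutoff = cudf::numeric_scalar<int32_t>(100);
  auto lit    = cudf::ast::literal(cutoff);
  auto filter = cudf::ast::operation(cudf::ast::ast_operator::LESS, price, lit);

  auto opts = cudf::io::parquet_reader_options::builder(
                cudf::io::source_info{"sales.parquet"})
                .columns({"id", "name"})  // projection omits "price"
                .filter(filter)
                .build();
  return cudf::io::read_parquet(opts);
}
```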

This is achieved by extracting the column names from the filter and adding them to the output buffers; after predicate filtering is done, these filter-only columns are removed and only the requested columns are returned.
The change also reads only the output columns' statistics data instead of all root columns'.

Summary of changes:
- `get_column_names_in_expression` extracts the column names used in the filter.
- The extra filter columns are added to the output buffers during reader initialization
  - `cpp/src/io/parquet/reader_impl_helpers.cpp`, `cpp/src/io/parquet/reader_impl.cpp`
- Statistics data is extracted only for the output columns (including columns in the filter) instead of all root columns
  - `cpp/src/io/parquet/predicate_pushdown.cpp`
  - To do this, the output column schemas and their dtypes are cached.
  - The statistics data extraction code now checks for `schema_idx` in the row group metadata.
  - The filter no longer needs to be converted again for all root columns; the passed output-column reference filter is reused.
  - The rest of the code is unchanged.
- After the output filter predicate is computed, the filter-only columns are removed
- Moved the `named_to_reference_converter` constructor to the .cpp file and removed the unused constructor.
- Small `#include` cleanup

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: rapidsai#15113