
Enable tracing of thread pool tasks using NVTX #630

Merged: 11 commits merged into rapidsai:branch-25.04 on Feb 18, 2025

Conversation

@kingcrimsontianyu (Contributor) commented Feb 9, 2025
This PR implements the basic feature outlined in #631. The two good-to-have items are currently blocked.

@kingcrimsontianyu added labels on Feb 9, 2025: feature request (New feature or request), non-breaking (Introduces a non-breaking change), c++ (Affects the C++ API of KvikIO)
copy-pr-bot commented Feb 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@kingcrimsontianyu changed the title from "[WIP] Enable tracing of thread pool tasks using NVTX" to "Enable tracing of thread pool tasks using NVTX" on Feb 10, 2025
@kingcrimsontianyu (Contributor Author) commented

/ok to test

```cpp
#define KVIKIO_NVTX_FUNC_RANGE_IMPL() NVTX3_FUNC_RANGE_IN(kvikio::libkvikio_domain)

// Implementation of KVIKIO_NVTX_SCOPED_RANGE(...)
#define KVIKIO_NVTX_SCOPED_RANGE_IMPL_3(message, payload_v, color) \
```
@kingcrimsontianyu (Contributor Author):

I'm not sure why the variable has to be named payload_v; naming it payload would cause compile errors. Perhaps a name-lookup related issue.

Reply:

It's probably because of the name nvtx3::payload used in the macro.
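For illustration, a minimal sketch of the suspected mechanism (hypothetical macro; not KvikIO's actual definition): preprocessor substitution is purely textual, so a parameter named `payload` would also replace the `payload` token inside `nvtx3::payload{...}`.

```cpp
// Hypothetical macro illustrating the clash; names are invented for this sketch.
#define BAD_SCOPED_RANGE(message, payload)                       \
  nvtx3::scoped_range _range                                     \
  {                                                              \
    nvtx3::event_attributes { message, nvtx3::payload{payload} } \
  }

// BAD_SCOPED_RANGE("read", nbytes) expands to
//   nvtx3::scoped_range _range{nvtx3::event_attributes{"read", nvtx3::nbytes{nbytes}}};
// which fails to compile because nvtx3::nbytes does not exist. Renaming the
// macro parameter to payload_v leaves the qualified nvtx3::payload untouched.
```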

```diff
@@ -192,7 +194,8 @@ std::future<std::size_t> FileHandle::pread(void* buf,
                                            std::size_t gds_threshold,
                                            bool sync_default_stream)
 {
-  KVIKIO_NVTX_MARKER("FileHandle::pread()", size);
+  auto& [nvtx_color, call_idx] = detail::get_next_color_and_call_idx();
+  KVIKIO_NVTX_SCOPED_RANGE("FileHandle::pread()", size, nvtx_color);
```
@kingcrimsontianyu (Contributor Author):

To be consistent with RemoteHandle, the NVTX marker here is replaced with the scoped range.
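As an illustrative sketch only (assumed helper and domain names; not KvikIO's actual implementation), this is the general shape of pairing each top-level I/O call with a distinct color and wrapping it in an NVTX scoped range via the nvtx3 C++ API, analogous to what the diff above does with detail::get_next_color_and_call_idx():

```cpp
#include <nvtx3/nvtx3.hpp>

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct example_domain {  // hypothetical domain for the sketch
  static constexpr char const* name{"example"};
};

// Hypothetical helper: cycle through a small palette so tasks spawned by the
// same pread()/pwrite() call share one color in the nsys-ui timeline.
inline nvtx3::color next_call_color()
{
  static constexpr std::array<std::uint32_t, 4> palette{
    0xff3db5e8, 0xffe8743d, 0xff3de87a, 0xffe83d9d};
  static std::atomic<std::size_t> call_idx{0};
  return nvtx3::color{palette[call_idx.fetch_add(1) % palette.size()]};
}

void pread_example(std::size_t size)
{
  // One scoped range carrying the message, the payload (requested size),
  // and the per-call color.
  nvtx3::scoped_range_in<example_domain> range{
    nvtx3::event_attributes{"pread", nvtx3::payload{size}, next_call_color()}};
  // ... enqueue chunked read tasks on the thread pool, tagging each with the
  // same color so they group visually with this call.
}
```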

@kingcrimsontianyu marked this pull request as ready for review on February 10, 2025 at 20:16
@kingcrimsontianyu requested review from a team as code owners on February 10, 2025 at 20:16
@kingcrimsontianyu (Contributor Author) commented Feb 11, 2025

The following sample result shows the nsys profile generated using this PR. There are 4 worker threads in the thread pool, and I/O tasks of the same color come from the same pread()/pwrite() call. The payload of each task's time range is a global, incrementing counter.

[screenshot: nsys-ui timeline showing the 4 thread pool workers with color-coded I/O task ranges]

@madsbk (Member) left a comment:

Looks good, I only have a minor suggestion.

@kingcrimsontianyu (Contributor Author)

Performance check

Four I/O benchmarks from libcudf were used to check whether this PR causes a runtime performance regression.

System

Results

  • No significant performance regression has been observed.
  • For the hot file cache cases, the unexpected performance improvement may be a false signal: the tests rerun later likely benefited more from the kernel buffer cache.

parquet_read_io_compression

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.137 | 0.020 | 1.36% | 0.098 | 0.023 | 1.57% |
| cold cache, branch-25.04 | 0.136 | 0.014 | — | 0.096 | 0.006 | — |
| hot cache, This PR | 0.054 | 0.034 | -15.80% | 0.051 | 0.032 | 4.44% |
| hot cache, branch-25.04 | 0.064 | 0.108 | — | 0.049 | 0.042 | — |

orc_read_io_compression

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.140 | 0.026 | 1.13% | 0.127 | 0.032 | 0.61% |
| cold cache, branch-25.04 | 0.139 | 0.012 | — | 0.126 | 0.029 | — |
| hot cache, This PR | 0.064 | 0.042 | -16.38% | 0.061 | 0.036 | -9.75% |
| hot cache, branch-25.04 | 0.076 | 0.031 | — | 0.068 | 0.045 | — |

json_read_io

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.550 | 0.030 | -0.26% | 0.453 | 0.047 | -0.36% |
| cold cache, branch-25.04 | 0.551 | 0.024 | — | 0.455 | 0.045 | — |
| hot cache, This PR | 0.347 | 0.044 | -9.94% | 0.316 | 0.040 | -10.31% |
| hot cache, branch-25.04 | 0.385 | 0.024 | — | 0.352 | 0.033 | — |

csv_read_io

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.350 | 0.034 | -0.10% | 0.297 | 0.029 | -1.67% |
| cold cache, branch-25.04 | 0.350 | 0.030 | — | 0.302 | 0.046 | — |
| hot cache, This PR | 0.229 | 0.055 | -13.62% | 0.212 | 0.021 | -13.58% |
| hot cache, branch-25.04 | 0.265 | 0.029 | — | 0.245 | 0.037 | — |

@madsbk (Member) commented Feb 18, 2025

I have also tested it in a no-CUDA environment with no issues.

@madsbk (Member) commented Feb 18, 2025

/merge

rapids-bot merged commit 096ac0f into rapidsai:branch-25.04 on Feb 18, 2025
61 checks passed
Comment on lines +65 to +69
```cpp
// Rename the worker thread in the thread pool to improve clarity from nsys-ui.
// Note: This NVTX feature is currently not supported by nsys-ui.
thread_local std::once_flag call_once_per_thread;
std::call_once(call_once_per_thread,
               [] { nvtx_manager::rename_current_thread("thread pool"); });
```
Contributor:

Just to note: thread_local compiles into a critical section that blocks other threads (even in the fast-path "already initialised" case, I think). See https://yosefk.com/blog/cxx-thread-local-storage-performance.html
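A hedged sketch of the per-task cost being described (hypothetical task wrapper; not KvikIO's actual code): with a function-local thread_local plus call_once, every task execution pays the TLS access plus an atomic check, even long after the rename has run.

```cpp
#include <mutex>

// Original shape: both costs recur on every task the worker executes.
// thread_local access may route through __tls_get_addr on some platforms,
// and std::call_once performs at least an atomic check per invocation.
void run_task_original()
{
  thread_local std::once_flag flag;
  std::call_once(flag, [] { /* rename current thread for NVTX */ });
  // ... actual task body
}

// Shape suggested below (and adopted in #637): the rename moves into the
// worker's startup hook, so the per-task path carries no extra synchronization.
void run_task_after_init()
{
  // ... actual task body only
}
```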

Member:

I guess we could use thread_pool's thread initialization?

@kingcrimsontianyu (Contributor Author):

That's an interesting pitfall. I've replaced the thread_local call-once section with the worker-thread initialization function in PR #637. Thanks! @wence- @madsbk
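A minimal sketch of that approach, assuming a pool whose constructor takes a per-worker startup function (the worker_pool class here is hypothetical; KvikIO's actual thread pool API may differ): the NVTX rename runs exactly once per worker at startup, keeping the per-task path clean.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical minimal pool: each worker runs worker_init once before it
// would start consuming tasks, which is where the NVTX rename now happens.
class worker_pool {
 public:
  worker_pool(unsigned num_threads, std::function<void()> worker_init)
  {
    for (unsigned i = 0; i < num_threads; ++i) {
      threads_.emplace_back([worker_init] {
        worker_init();  // e.g. nvtx_manager::rename_current_thread("thread pool")
        // ... task-consumption loop would go here
      });
    }
  }
  ~worker_pool()
  {
    for (auto& t : threads_) { t.join(); }
  }

 private:
  std::vector<std::thread> threads_;
};

// Usage sketch:
// worker_pool pool{4, [] { nvtx_manager::rename_current_thread("thread pool"); }};
```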

rapids-bot bot pushed a commit that referenced this pull request Feb 21, 2025
…thread initialization (#637)

This PR makes the following minor fixes:
- Use the correct file permission flags corresponding to the `644` code.
- Use the correct flag for the `cuMemHostAlloc` call.
- For the thread pool, replace the `thread_local` call-once section (which may negatively affect performance; see #630 (comment)) with the more idiomatic worker-thread initialization function.

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #637
@jakirkham (Member):

> I have also tested it in a no-CUDA environment with no issues.

Should we add this test condition to CI?

@madsbk (Member) commented Feb 27, 2025

> > I have also tested it in a no-CUDA environment with no issues.
>
> Should we add this test condition to CI?

Yes, a smoke test of the C++ examples would be good!
