
Enable tracing of thread pool tasks using NVTX #630

Merged: 11 commits merged into rapidsai:branch-25.04 on Feb 18, 2025

Conversation

@kingcrimsontianyu (Contributor) commented Feb 9, 2025
This PR implements the basic feature outlined in #631. The two good-to-have items are currently blocked.

@kingcrimsontianyu added labels on Feb 9, 2025: feature request (New feature or request), non-breaking (Introduces a non-breaking change), c++ (Affects the C++ API of KvikIO)
copy-pr-bot commented Feb 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@kingcrimsontianyu changed the title from "[WIP] Enable tracing of thread pool tasks using NVTX" to "Enable tracing of thread pool tasks using NVTX" on Feb 10, 2025
@kingcrimsontianyu (Contributor Author) commented

/ok to test

```cpp
#define KVIKIO_NVTX_FUNC_RANGE_IMPL() NVTX3_FUNC_RANGE_IN(kvikio::libkvikio_domain)

// Implementation of KVIKIO_NVTX_SCOPED_RANGE(...)
#define KVIKIO_NVTX_SCOPED_RANGE_IMPL_3(message, payload_v, color) \
```
@kingcrimsontianyu (Contributor Author):

I'm not sure why the variable has to be named payload_v; naming it payload would cause compile errors. Perhaps a name-lookup related issue.

Reply:

It's probably because of the name nvtx3::payload used in the macro.
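For illustration, a minimal sketch of the suspected mechanism (hypothetical macro; not KvikIO's actual definition): preprocessor substitution is purely textual, so a parameter named `payload` would also replace the `payload` token inside `nvtx3::payload{...}`.

```cpp
// Hypothetical macro illustrating the clash; names are invented for this sketch.
#define BAD_SCOPED_RANGE(message, payload)                       \
  nvtx3::scoped_range _range                                     \
  {                                                              \
    nvtx3::event_attributes { message, nvtx3::payload{payload} } \
  }

// BAD_SCOPED_RANGE("read", nbytes) expands to
//   nvtx3::scoped_range _range{nvtx3::event_attributes{"read", nvtx3::nbytes{nbytes}}};
// which fails to compile because nvtx3::nbytes does not exist. Renaming the
// macro parameter to payload_v leaves the qualified nvtx3::payload untouched.
```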

```diff
@@ -192,7 +194,8 @@ std::future<std::size_t> FileHandle::pread(void* buf,
                                            std::size_t gds_threshold,
                                            bool sync_default_stream)
 {
-  KVIKIO_NVTX_MARKER("FileHandle::pread()", size);
+  auto& [nvtx_color, call_idx] = detail::get_next_color_and_call_idx();
+  KVIKIO_NVTX_SCOPED_RANGE("FileHandle::pread()", size, nvtx_color);
```
@kingcrimsontianyu (Contributor Author):

To be consistent with RemoteHandle, the NVTX marker here is replaced with the scoped range.
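As an illustrative sketch only (assumed helper and domain names; not KvikIO's actual implementation), this is the general shape of pairing each top-level I/O call with a distinct color and wrapping it in an NVTX scoped range via the nvtx3 C++ API, analogous to what the diff above does with detail::get_next_color_and_call_idx():

```cpp
#include <nvtx3/nvtx3.hpp>

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct example_domain {  // hypothetical domain for the sketch
  static constexpr char const* name{"example"};
};

// Hypothetical helper: cycle through a small palette so tasks spawned by the
// same pread()/pwrite() call share one color in the nsys-ui timeline.
inline nvtx3::color next_call_color()
{
  static constexpr std::array<std::uint32_t, 4> palette{
    0xff3db5e8, 0xffe8743d, 0xff3de87a, 0xffe83d9d};
  static std::atomic<std::size_t> call_idx{0};
  return nvtx3::color{palette[call_idx.fetch_add(1) % palette.size()]};
}

void pread_example(std::size_t size)
{
  // One scoped range carrying the message, the payload (requested size),
  // and the per-call color.
  nvtx3::scoped_range_in<example_domain> range{
    nvtx3::event_attributes{"pread", nvtx3::payload{size}, next_call_color()}};
  // ... enqueue chunked read tasks on the thread pool, tagging each with the
  // same color so they group visually with this call.
}
```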

@kingcrimsontianyu marked this pull request as ready for review on February 10, 2025 at 20:16
@kingcrimsontianyu requested review from a team as code owners on February 10, 2025 at 20:16
@kingcrimsontianyu (Contributor Author) commented Feb 11, 2025

The following sample result shows the nsys profile generated using this PR. There are 4 worker threads in the thread pool, and I/O tasks of the same color come from the same pread()/pwrite() call. The payload of each task's time range is a global, incrementing counter.

[screenshot: nsys-ui timeline showing the 4 thread pool workers with color-coded I/O task ranges]

@madsbk (Member) left a comment:

Looks good, I only have a minor suggestion.

@kingcrimsontianyu (Contributor Author)

Performance check

Four I/O benchmarks from libcudf were used to check whether this PR causes a runtime performance regression.

System

Results

  • No significant performance regression has been observed.
  • For the hot file cache cases, the unexpected performance improvement may be a false signal: the tests rerun later likely benefited more from the kernel buffer cache.

parquet_read_io_compression

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.137 | 0.020 | 1.36% | 0.098 | 0.023 | 1.57% |
| cold cache, branch-25.04 | 0.136 | 0.014 | — | 0.096 | 0.006 | — |
| hot cache, This PR | 0.054 | 0.034 | -15.80% | 0.051 | 0.032 | 4.44% |
| hot cache, branch-25.04 | 0.064 | 0.108 | — | 0.049 | 0.042 | — |

orc_read_io_compression

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.140 | 0.026 | 1.13% | 0.127 | 0.032 | 0.61% |
| cold cache, branch-25.04 | 0.139 | 0.012 | — | 0.126 | 0.029 | — |
| hot cache, This PR | 0.064 | 0.042 | -16.38% | 0.061 | 0.036 | -9.75% |
| hot cache, branch-25.04 | 0.076 | 0.031 | — | 0.068 | 0.045 | — |

json_read_io

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.550 | 0.030 | -0.26% | 0.453 | 0.047 | -0.36% |
| cold cache, branch-25.04 | 0.551 | 0.024 | — | 0.455 | 0.045 | — |
| hot cache, This PR | 0.347 | 0.044 | -9.94% | 0.316 | 0.040 | -10.31% |
| hot cache, branch-25.04 | 0.385 | 0.024 | — | 0.352 | 0.033 | — |

csv_read_io

| Configuration | 2 threads: CPU Time (sec) | 2 threads: Noise | 2 threads: Time increase | 8 threads: CPU Time (sec) | 8 threads: Noise | 8 threads: Time increase |
| --- | --- | --- | --- | --- | --- | --- |
| cold cache, This PR | 0.350 | 0.034 | -0.10% | 0.297 | 0.029 | -1.67% |
| cold cache, branch-25.04 | 0.350 | 0.030 | — | 0.302 | 0.046 | — |
| hot cache, This PR | 0.229 | 0.055 | -13.62% | 0.212 | 0.021 | -13.58% |
| hot cache, branch-25.04 | 0.265 | 0.029 | — | 0.245 | 0.037 | — |

@madsbk (Member) commented Feb 18, 2025

I have also tested it in a no-CUDA environment with no issues.

@madsbk (Member) commented Feb 18, 2025

/merge

rapids-bot merged commit 096ac0f into rapidsai:branch-25.04 on Feb 18, 2025
61 checks passed
Comment on lines +65 to +69
```cpp
// Rename the worker thread in the thread pool to improve clarity from nsys-ui.
// Note: This NVTX feature is currently not supported by nsys-ui.
thread_local std::once_flag call_once_per_thread;
std::call_once(call_once_per_thread,
               [] { nvtx_manager::rename_current_thread("thread pool"); });
```
Contributor:

Just to note: thread_local compiles into a critical section that blocks other threads (even in the fast-path "already initialised" case, I think). See https://yosefk.com/blog/cxx-thread-local-storage-performance.html
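A hedged sketch of the per-task cost being described (hypothetical task wrapper; not KvikIO's actual code): with a function-local thread_local plus call_once, every task execution pays the TLS access plus an atomic check, even long after the rename has run.

```cpp
#include <mutex>

// Original shape: both costs recur on every task the worker executes.
// thread_local access may route through __tls_get_addr on some platforms,
// and std::call_once performs at least an atomic check per invocation.
void run_task_original()
{
  thread_local std::once_flag flag;
  std::call_once(flag, [] { /* rename current thread for NVTX */ });
  // ... actual task body
}

// Shape suggested below (and adopted in #637): the rename moves into the
// worker's startup hook, so the per-task path carries no extra synchronization.
void run_task_after_init()
{
  // ... actual task body only
}
```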

Member:

I guess we could use thread_pool's thread initialization?

@kingcrimsontianyu (Contributor Author):

That's an interesting pitfall. I've replaced the thread_local call-once section with the worker-thread initialization function in PR #637. Thanks! @wence- @madsbk
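A minimal sketch of that approach, assuming a pool whose constructor takes a per-worker startup function (the worker_pool class here is hypothetical; KvikIO's actual thread pool API may differ): the NVTX rename runs exactly once per worker at startup, keeping the per-task path clean.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical minimal pool: each worker runs worker_init once before it
// would start consuming tasks, which is where the NVTX rename now happens.
class worker_pool {
 public:
  worker_pool(unsigned num_threads, std::function<void()> worker_init)
  {
    for (unsigned i = 0; i < num_threads; ++i) {
      threads_.emplace_back([worker_init] {
        worker_init();  // e.g. nvtx_manager::rename_current_thread("thread pool")
        // ... task-consumption loop would go here
      });
    }
  }
  ~worker_pool()
  {
    for (auto& t : threads_) { t.join(); }
  }

 private:
  std::vector<std::thread> threads_;
};

// Usage sketch:
// worker_pool pool{4, [] { nvtx_manager::rename_current_thread("thread pool"); }};
```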

rapids-bot bot pushed a commit that referenced this pull request Feb 21, 2025
…thread initialization (#637)

This PR makes the following minor fixes:
- Use the correct file permission flags corresponding to the `644` code.
- Use the correct flag for the `cuMemHostAlloc` call.
- For the thread pool, replace the `thread_local` call-once section (which may negatively affect performance; see #630 (comment)) with the more idiomatic worker-thread initialization function.

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #637
@jakirkham (Member):

> I have also tested it in a no-CUDA environment with no issues.

Should we add this test condition to CI?

@madsbk (Member) commented Feb 27, 2025

> > I have also tested it in a no-CUDA environment with no issues.
>
> Should we add this test condition to CI?

Yes, a smoke test of the C++ examples would be good!
