-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] Prometheus metric export #134
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! This is an important piece in making Zeus more usable in a real world scenario. I looked over it at a mid- to high-level (not the nitty gritty details yet) and left some comments. Let me know what you think.
Co-authored-by: Jae-Won Chung <[email protected]>
- Changed metric instantiation to accept CPU and GPU indices directly instead of class objects. - Improved multiprocessing logic to address and fix pickle-related errors. - Added consistent handling for sync_execution across begin_window and end_window calls for all metrics. - Centralized bucket range validation and default handling for EnergyHistogram. - Improved error handling and logging for multiprocessing processes. - Standardized Prometheus metric labels (e.g., window and index) across Histogram, Counter, and Gauge. - Updated docstrings for consistency and clarity across all Metric subclasses.
Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.
examples/prometheus/train_single.py
Outdated
print(f"Top-1 accuracy: {acc1}") | ||
|
||
# Allow metrics to capture remaining data before shutting down monitoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are useful. Please bring them back.
examples/prometheus/train_single.py
Outdated
@@ -430,3 +418,4 @@ def accuracy(output, target, topk=(1,)): | |||
|
|||
if __name__ == "__main__": | |||
main() | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline.
tests/test_metric.py
Outdated
gpu_energy={0: 30.0, 1: 35.0, 2: 40.0}, | ||
cpu_energy={0: 20.0, 1: 25.0}, | ||
gpu_energy={0: 50.0, 1: 100.0, 2: 200.0}, | ||
cpu_energy={0: 40.0, 1: 50.0}, | ||
dram_energy={}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If mock CPU 0 supports DRAM energy measurement (in mock_get_cpus
), shouldn't this be something like dram_energy={0: 10.0}
?
The metrics would be expecting the monitor to provide DRAM energy measurements for CPU 0, but if the Measurement object has nothing, shouldn't it raise an error?
zeus/metric.py
Outdated
|
||
Args: | ||
name (str): Name of the measurement window. | ||
sync_execution (bool): Whether to execute synchronously. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. See ZeusMonitor
.
zeus/metric.py
Outdated
@@ -54,6 +73,9 @@ class EnergyHistogram(Metric): | |||
gpu_bucket_range: Histogram buckets for GPU energy. | |||
cpu_bucket_range: Histogram buckets for CPU energy. | |||
dram_bucket_range: Histogram buckets for DRAM energy. | |||
gpu_histograms: A single Prometheus Histogram metric for all GPU energy consumption, indexed by window and GPU index. | |||
cpu_histograms: A single Prometheus Histogram metric for all CPU energy consumption, indexed by window and CPU index. | |||
dram_histograms: A single Prometheus Histogram metric for all DRAM energy consumption, indexed by window and DRAM index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the entire Attributes
section. They're not intended to be public attributes AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For every class.
Add link to the push gateway Co-authored-by: Jae-Won Chung <[email protected]>
Generalize the device as {component} with the note that Gauge only supports GPU Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py
Outdated
self.energy_monitor.begin_window( | ||
f"__EnergyHistogram_{name}", sync_execution=True | ||
) | ||
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution) | |
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution=sync_execution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for end_window.
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py
Outdated
@@ -288,28 +319,36 @@ def begin_window(self, name: str) -> None: | |||
self.update_period, | |||
self.prometheus_url, | |||
self.job, | |||
sync_execution, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is wrong. If sync_execution
is True
, you need to call zeus.utils.framework.sync_execution
on the main thread there the application is running. On the other hand, the power/energy monitor process's ZeusMonitor
should always be invoked with sync_execution=False
.
Read ZeusMonitor
to see how sync_execution
(sometimes a boolean parameter and other times a function in zeus.utils.framework
) is being used.
zeus/metric.py
Outdated
self.window_state[name] = MonitoringProcessState( | ||
queue=self.queue, proc=self.proc | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you putting these in self.queue
and self.proc
??
zeus/metric.py
Outdated
if self.queue is not None: | ||
self.queue.put("stop") | ||
else: | ||
raise RuntimeError("Queue is not initialized") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.queue
can the queue from any random window??? More specifically, it's going to be the queue that belongs to the most recently started window.
This level of quality is completely unacceptable. Please re-check the correctness of every line of code and documentation, and then ask for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments on changes to make. I think they will be more or less straightforward ones. Let's hope this is the final round of change requests. Thanks!
…ng window operations
…ogic and multiprocessing window state management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your work!
This pull request introduces Prometheus-based metric tracking for energy and power usage within the Zeus framework. It includes functionality for monitoring GPU, CPU, and DRAM energy usage via Histograms, Cumulative Counters, and Gauges.
zeus/metric.py
:A new module that introduces
EnergyHistogram
,EnergyCumulativeCounter
, andPowerGauge
classes. These classes enable real-time monitoring of CPU, GPU, and DRAM energy and power consumption by integrating with Prometheus.zeus/prometheus.yml
:Configuration file for setting up Prometheus monitoring.
zeus/docker-compose.yml
:A Docker Compose file for easily setting up Prometheus with the project for local or cloud-based monitoring.
Modified
pyproject.toml
:Added
prometheus-client
as an optional dependency for Prometheus metric integration.