[Feat] Prometheus metric export #134

Merged on Jan 26, 2025 (61 commits)

Conversation

sharonsyh
Collaborator

This pull request introduces Prometheus-based metric tracking for energy and power usage within the Zeus framework. It includes functionality for monitoring GPU, CPU, and DRAM energy usage via Histograms, Cumulative Counters, and Gauges.

  • zeus/metric.py:
    A new module that introduces EnergyHistogram, EnergyCumulativeCounter, and PowerGauge classes. These classes enable real-time monitoring of CPU, GPU, and DRAM energy and power consumption by integrating with Prometheus.

  • zeus/prometheus.yml:
    Configuration file for setting up Prometheus monitoring.

  • zeus/docker-compose.yml:
    A Docker Compose file for easily setting up Prometheus with the project for local or cloud-based monitoring.

  • Modified pyproject.toml:
    Added prometheus-client as an optional dependency for Prometheus metric integration.

@sharonsyh sharonsyh changed the title from "Prometheus Integration" to "Prometheus Integration - Branch Updated" on Oct 18, 2024
Member

@jaywonchung jaywonchung left a comment

Thanks for the great work! This is an important piece in making Zeus more usable in real-world scenarios. I looked over it at a mid- to high-level (not the nitty-gritty details yet) and left some comments. Let me know what you think.

@jaywonchung jaywonchung changed the title from "Prometheus Integration - Branch Updated" to "[Feat] Prometheus metric export" on Nov 13, 2024
Co-authored-by: Jae-Won Chung <[email protected]>
sharonsyh and others added 12 commits November 28, 2024 21:42
- Changed metric instantiation to accept CPU and GPU indices directly instead of class objects.
- Improved multiprocessing logic to address and fix pickle-related errors.
- Added consistent handling for sync_execution across begin_window and end_window calls for all metrics.
- Centralized bucket range validation and default handling for EnergyHistogram.
- Improved error handling and logging for multiprocessing processes.
- Standardized Prometheus metric labels (e.g., window and index) across Histogram, Counter, and Gauge.
- Updated docstrings for consistency and clarity across all Metric subclasses.
Adjust target names to standardize pushgateway references, ensuring consistency with the Docker Compose configuration.
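The commit notes above mention centralized bucket range validation and default handling for EnergyHistogram. A minimal sketch of what such a helper could look like follows. The function name `resolve_bucket_range`, the default boundaries, and the specific checks are assumptions made for illustration and are not taken from the PR.

```python
from typing import Optional, Sequence

# Hypothetical default boundaries in joules, chosen only for illustration.
DEFAULT_BUCKETS = (50.0, 100.0, 200.0, 500.0, 1000.0)


def resolve_bucket_range(
    component: str, bucket_range: Optional[Sequence[float]]
) -> list[float]:
    """Return validated histogram bucket boundaries for one component."""
    # None means "use the defaults"; an explicitly empty range is a caller error.
    if bucket_range is None:
        return list(DEFAULT_BUCKETS)
    buckets = [float(b) for b in bucket_range]
    if not buckets:
        raise ValueError(f"{component} bucket range cannot be empty")
    if any(b <= 0 for b in buckets):
        raise ValueError(f"{component} bucket boundaries must be positive")
    # Equal to its sorted, de-duplicated form only if strictly increasing.
    if buckets != sorted(set(buckets)):
        raise ValueError(f"{component} bucket boundaries must be strictly increasing")
    return buckets
```

Calling one helper for the GPU, CPU, and DRAM ranges keeps the three code paths from drifting apart, which seems to be the point of centralizing the check.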
print(f"Top-1 accuracy: {acc1}")

# Allow metrics to capture remaining data before shutting down monitoring.
Member

These comments are useful. Please bring them back.

@@ -430,3 +418,4 @@ def accuracy(output, target, topk=(1,)):

if __name__ == "__main__":
    main()

Member

Newline.

gpu_energy={0: 30.0, 1: 35.0, 2: 40.0},
cpu_energy={0: 20.0, 1: 25.0},
gpu_energy={0: 50.0, 1: 100.0, 2: 200.0},
cpu_energy={0: 40.0, 1: 50.0},
dram_energy={},
Member

@jaywonchung jaywonchung Dec 10, 2024

If mock CPU 0 supports DRAM energy measurement (in mock_get_cpus), shouldn't this be something like dram_energy={0: 10.0}?

The metrics would be expecting the monitor to provide DRAM energy measurements for CPU 0, but if the Measurement object has nothing, shouldn't it raise an error?
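One way to picture the consistency check this comment asks about is sketched below. `FakeMeasurement` and `check_dram_coverage` are hypothetical stand-ins written for this note; they are not the Zeus `Measurement` class or any function from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMeasurement:
    """Hypothetical stand-in for the measurement object returned by the monitor."""

    gpu_energy: dict[int, float] = field(default_factory=dict)
    cpu_energy: dict[int, float] = field(default_factory=dict)
    dram_energy: dict[int, float] = field(default_factory=dict)


def check_dram_coverage(measurement: FakeMeasurement, dram_capable_cpus: list[int]) -> None:
    """Raise if a CPU that supports DRAM measurement has no DRAM reading."""
    missing = sorted(set(dram_capable_cpus) - set(measurement.dram_energy))
    if missing:
        raise ValueError(f"Missing DRAM energy for CPU indices {missing}")


# With mock CPU 0 reporting DRAM support, this fixture passes the check:
consistent = FakeMeasurement(cpu_energy={0: 40.0, 1: 50.0}, dram_energy={0: 10.0})
check_dram_coverage(consistent, dram_capable_cpus=[0])
```

Under this sketch, a fixture with `dram_energy={}` and a DRAM-capable CPU 0 would raise instead of silently recording nothing.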

zeus/metric.py Outdated

Args:
name (str): Name of the measurement window.
sync_execution (bool): Whether to execute synchronously. Defaults to None.
Member

@jaywonchung jaywonchung Dec 10, 2024

This is wrong. See ZeusMonitor.

zeus/metric.py Outdated
@@ -54,6 +73,9 @@ class EnergyHistogram(Metric):
gpu_bucket_range: Histogram buckets for GPU energy.
cpu_bucket_range: Histogram buckets for CPU energy.
dram_bucket_range: Histogram buckets for DRAM energy.
gpu_histograms: A single Prometheus Histogram metric for all GPU energy consumption, indexed by window and GPU index.
cpu_histograms: A single Prometheus Histogram metric for all CPU energy consumption, indexed by window and CPU index.
dram_histograms: A single Prometheus Histogram metric for all DRAM energy consumption, indexed by window and DRAM index.
Member

Remove the entire Attributes section. They're not intended to be public attributes AFAIK.

Member

For every class.

Add link to the push gateway

Co-authored-by: Jae-Won Chung <[email protected]>
sharonsyh and others added 2 commits December 10, 2024 13:33
Generalize the device as {component} with the note that Gauge only supports GPU

Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py Outdated
self.energy_monitor.begin_window(
f"__EnergyHistogram_{name}", sync_execution=True
)
self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)
Member

Suggested change
- self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution)
+ self.energy_monitor.begin_window(f"__EnergyHistogram_{name}", sync_execution=sync_execution)

Member

Ditto for end_window.
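A small sketch of why passing the flag by keyword is the safer habit, for both calls. `StubMonitor` is a hypothetical stand-in written for this note, and its second boolean parameter `cancel` is an assumption used to show how positional booleans can be mixed up; the real ZeusMonitor signatures should be checked in the source.

```python
class StubMonitor:
    """Hypothetical stand-in; the real ZeusMonitor signatures may differ."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def begin_window(self, key: str, sync_execution: bool = True) -> None:
        self.calls.append(("begin", key, sync_execution))

    def end_window(self, key: str, sync_execution: bool = True, cancel: bool = False) -> None:
        self.calls.append(("end", key, sync_execution, cancel))


def measure(monitor: StubMonitor, name: str, sync_execution: bool) -> None:
    window = f"__EnergyHistogram_{name}"
    # Keyword arguments keep the call correct even if parameters are
    # reordered or new boolean flags are added to the signature later.
    monitor.begin_window(window, sync_execution=sync_execution)
    monitor.end_window(window, sync_execution=sync_execution)
```

With two adjacent boolean parameters, a positional `end_window(window, flag)` still works today, but the reader has to look up the signature to know which flag is being set.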

sharonsyh and others added 4 commits December 10, 2024 13:35
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
zeus/metric.py Outdated
@@ -288,28 +319,36 @@ def begin_window(self, name: str) -> None:
self.update_period,
self.prometheus_url,
self.job,
sync_execution,
Member

@jaywonchung jaywonchung Dec 10, 2024

This is wrong. If sync_execution is True, you need to call zeus.utils.framework.sync_execution on the main thread where the application is running. On the other hand, the power/energy monitor process's ZeusMonitor should always be invoked with sync_execution=False.

Read ZeusMonitor to see how sync_execution (sometimes a boolean parameter and other times a function in zeus.utils.framework) is being used.
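The split described in this comment can be sketched roughly as follows. Everything here is a stand-in: `sync_fn` plays the role of zeus.utils.framework.sync_execution and is injected as a plain callable, `RecordingMonitor` is a hypothetical stub, and the monitor is called inline although in the PR it lives in a separate process.

```python
from typing import Callable, Sequence


class RecordingMonitor:
    """Hypothetical stub for the monitor owned by the monitoring process."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, bool]] = []

    def begin_window(self, key: str, sync_execution: bool = True) -> None:
        self.calls.append((key, sync_execution))


def begin_window_on_main(
    name: str,
    gpu_indices: Sequence[int],
    sync_execution: bool,
    sync_fn: Callable[[list[int]], None],
    monitor: RecordingMonitor,
) -> None:
    # Application side: wait for outstanding device work here, because this
    # is where the work being measured was launched.
    if sync_execution:
        sync_fn(list(gpu_indices))
    # Monitor side: it launched no device work, so it never synchronizes.
    monitor.begin_window(name, sync_execution=False)
```

The boolean is consumed on the application side and is deliberately not forwarded to the monitoring side.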

zeus/metric.py Outdated
Comment on lines 329 to 331
self.window_state[name] = MonitoringProcessState(
    queue=self.queue, proc=self.proc
)
Member

@jaywonchung jaywonchung Dec 10, 2024

Why are you putting these in self.queue and self.proc??

zeus/metric.py Outdated
Comment on lines 344 to 347
if self.queue is not None:
    self.queue.put("stop")
else:
    raise RuntimeError("Queue is not initialized")
Member

@jaywonchung jaywonchung Dec 10, 2024

self.queue can be the queue from any random window??? More specifically, it's going to be the queue that belongs to the most recently started window.

This level of quality is completely unacceptable. Please re-check the correctness of every line of code and documentation, and then ask for review.
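A rough sketch of the per-window bookkeeping this comment points toward: each window's queue and worker are kept in local variables and stored under the window name, so stopping a window always reaches its own worker. The class and field names are hypothetical, and threads stand in for the monitoring processes only to keep the sketch short.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class WindowState:
    """State needed to stop one window's worker (hypothetical name)."""

    stop_queue: queue.Queue
    worker: threading.Thread


def _poll(q: queue.Queue) -> None:
    # Stand-in for the polling loop: block until told to stop.
    while q.get() != "stop":
        pass


class WindowedWorkers:
    def __init__(self) -> None:
        self.window_state: dict[str, WindowState] = {}

    def begin_window(self, name: str) -> None:
        if name in self.window_state:
            raise ValueError(f"Window '{name}' is already running")
        # Local variables, not instance attributes: nothing is shared
        # between windows, so a later window cannot overwrite this one.
        q: queue.Queue = queue.Queue()
        worker = threading.Thread(target=_poll, args=(q,), daemon=True)
        worker.start()
        self.window_state[name] = WindowState(stop_queue=q, worker=worker)

    def end_window(self, name: str) -> None:
        # Look the state up by name so the stop signal reaches the right worker.
        state = self.window_state.pop(name, None)
        if state is None:
            raise ValueError(f"No active window named '{name}'")
        state.stop_queue.put("stop")
        state.worker.join(timeout=5)
```

Ending window "a" while "b" is still open stops only the worker for "a", which is the behavior a shared `self.queue` cannot guarantee.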

Member

@jaywonchung jaywonchung left a comment

Left some comments on changes to make. I think they will be more or less straightforward ones. Let's hope this is the final round of change requests. Thanks!

@jaywonchung jaywonchung linked an issue Dec 19, 2024 that may be closed by this pull request
Member

@jaywonchung jaywonchung left a comment

LGTM. Thanks for your work!

@jaywonchung jaywonchung merged commit 3875315 into master Jan 26, 2025
2 checks passed
@jaywonchung jaywonchung deleted the prometheus-metrics-update branch January 26, 2025 19:40