Fix: Pandas warnings from PowerMonitor (#75)
jaywonchung authored May 8, 2024
1 parent ba34175 commit 585ae2b
Showing 3 changed files with 54 additions and 29 deletions.
34 changes: 19 additions & 15 deletions docs/measure/index.md
@@ -36,9 +36,13 @@ if __name__ == "__main__":
print(f"One step took {avg_time} s and {avg_energy} J on average.")
```

+!!! Tip "[`zeus.monitor.PowerMonitor`][zeus.monitor.power.PowerMonitor]"
+    This monitor spawns a process that polls the instantaneous GPU power consumption API and exposes two methods: [`get_power`][zeus.monitor.power.PowerMonitor.get_power] and [`get_energy`][zeus.monitor.power.PowerMonitor.get_energy].
+    For older GPUs that do not support querying energy directly, [`ZeusMonitor`][zeus.monitor.ZeusMonitor] automatically uses the [`PowerMonitor`][zeus.monitor.power.PowerMonitor] internally.
+
 !!! Warning "Use of global variables on GPUs older than Volta"
     On older GPUs, **you should not** instantiate [`ZeusMonitor`][zeus.monitor.ZeusMonitor] as a global variable without protecting it with `if __name__ == "__main__"`.
-    It's because the energy query API is only available on Volta or newer NVIDIA GPU microarchitectures, and for older GPUs, a separate process that polls the power API has to be spawned.
+    It's because the energy query API is only available on Volta or newer NVIDIA GPU microarchitectures, and for older GPUs, a separate process that polls the power API has to be spawned (i.e., [`PowerMonitor`][zeus.monitor.power.PowerMonitor]).
     In this case, global code that spawns the process should be guarded with `if __name__ == "__main__"`.
     More details in [Python docs](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods){.external}.

@@ -52,8 +56,21 @@ if __name__ == "__main__":

## CLI power and energy monitor

+The energy monitor measures the total energy consumed by the GPU during the lifetime of the monitor process.
+It's a simple wrapper around [`ZeusMonitor`][zeus.monitor.ZeusMonitor].
+
+```console
+$ python -m zeus.monitor energy
+[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
+[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
+[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
+^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
+Total energy (J):
+Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
+```
+
 The power monitor periodically prints out the GPU's power draw.
-It's a simple wrapper around [`PowerMonitor`][zeus.monitor.power.PowerMonitor].
+It's a simple wrapper around [`PowerMonitor`][zeus.monitor.PowerMonitor].

```console
$ python -m zeus.monitor power
@@ -71,16 +88,3 @@ Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```
-
-The energy monitor measures the total energy consumed by the GPU during the lifetime of the monitor process.
-It's a simple wrapper around [`ZeusMonitor`][zeus.monitor.ZeusMonitor].
-
-```console
-$ python -m zeus.monitor energy
-[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
-[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
-[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
-^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
-Total energy (J):
-Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
-```
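
The docs describe `PowerMonitor` as polling instantaneous power and exposing both `get_power` and `get_energy`. As a rough illustration of how energy over a window can be derived from polled power samples, the sketch below integrates `(time, power)` pairs with the trapezoidal rule. `energy_between` is a hypothetical helper, not a Zeus function, and the integration method is an assumption rather than a description of what `get_energy` does internally.

```python
from bisect import bisect_left, bisect_right


def energy_between(samples, start_time, end_time):
    """Estimate energy in joules from (time_s, power_w) samples sorted by time.

    Only samples with start_time <= time <= end_time are used.
    Returns None when fewer than two samples fall inside the window.
    """
    times = [t for t, _ in samples]
    lo = bisect_left(times, start_time)
    hi = bisect_right(times, end_time)
    window = samples[lo:hi]
    if len(window) < 2:
        return None
    energy = 0.0
    # Trapezoidal rule: average power of each adjacent pair times the interval.
    for (t0, p0), (t1, p1) in zip(window, window[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


samples = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0), (3.0, 200.0)]
print(energy_between(samples, 0.0, 3.0))  # 450.0
```

Returning `None` for a window without enough samples mirrors the `dict[int, float] | None` return type that `get_energy` declares, although the exact conditions under which the real method returns `None` are not shown in this commit.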
1 change: 1 addition & 0 deletions zeus/monitor/__init__.py
@@ -21,3 +21,4 @@
 """

 from zeus.monitor.energy import ZeusMonitor, Measurement
+from zeus.monitor.power import PowerMonitor
48 changes: 34 additions & 14 deletions zeus/monitor/power.py
@@ -112,6 +112,13 @@ class PowerMonitor:
     [`ZeusMonitor`][zeus.monitor.ZeusMonitor] for older architecture GPUs that
     do not support the nvmlDeviceGetTotalEnergyConsumption API.

+    !!! Warning
+        Since the monitor spawns a child process, **it should not be instantiated as a global variable**.
+        Python puts a protection to prevent creating a process in global scope.
+        Refer to the "Safe importing of main module" section in the
+        [Python documentation](https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods)
+        for more details.
+
     Attributes:
         gpu_indices (list[int]): Indices of the GPUs to monitor.
         update_period (int): Update period of the power monitor in seconds.
@@ -122,18 +129,17 @@ def __init__(
         self,
         gpu_indices: list[int] | None = None,
         update_period: float | None = None,
+        power_csv_path: str | None = None,
     ) -> None:
         """Initialize the power monitor.

-        Initialization should not be done in global scope due to python's protection.
-        Refer to the "Safe importing of main module" section in
-        https://docs.python.org/3/library/multiprocessing.html for more details.
-
         Args:
             gpu_indices: Indices of the GPUs to monitor. If None, monitor all GPUs.
             update_period: Update period of the power monitor in seconds. If None,
                 infer the update period by max speed polling the power counter for
                 each GPU model.
+            power_csv_path: If given, the power polling process will write measurements
+                to this path. Otherwise, a temporary file will be used.
         """
         if gpu_indices is not None and not gpu_indices:
             raise ValueError("`gpu_indices` must be either `None` or non-empty")
@@ -155,17 +161,19 @@ def __init__(
             update_period = infer_counter_update_period(self.gpu_indices)
         self.update_period = update_period

-        # Create the CSV file for power measurements.
-        power_csv = tempfile.mkstemp(suffix=".csv", text=True)[1]
-        open(power_csv, "w").close()
-        self.power_f = open(power_csv)
+        # Create and open the CSV to record power measurements.
+        if power_csv_path is None:
+            power_csv_path = tempfile.mkstemp(suffix=".csv", text=True)[1]
+        open(power_csv_path, "w").close()
+        self.power_f = open(power_csv_path)
         self.power_df_columns = ["time"] + [f"power{i}" for i in self.gpu_indices]
         self.power_df = pd.DataFrame(columns=self.power_df_columns)

         # Spawn the power polling process.
         atexit.register(self._stop)
         self.process = mp.get_context("spawn").Process(
-            target=_polling_process, args=(self.gpu_indices, power_csv, update_period)
+            target=_polling_process,
+            args=(self.gpu_indices, power_csv_path, update_period),
         )
         self.process.start()
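
The hunk above keeps one read handle (`self.power_f`) open on the CSV while a separate process writes to the same path with line buffering. The standard-library sketch below illustrates why that arrangement lets each read pick up only the rows appended since the previous read. `read_new_rows` is a hypothetical helper, and the writer here runs in the same process purely for illustration; in the commit it is the spawned `_polling_process`.

```python
import csv
import os
import tempfile


def read_new_rows(reader_f):
    """Return the rows appended since the previous call, parsed as floats."""
    # The read handle keeps its offset between calls, so readlines()
    # yields only the lines written after the last read.
    lines = reader_f.readlines()
    return [[float(x) for x in row] for row in csv.reader(lines) if row]


# Create an empty temporary CSV and keep one read handle open on it.
fd, csv_path = tempfile.mkstemp(suffix=".csv", text=True)
os.close(fd)
reader_f = open(csv_path)

# Stand-in for the polling process: a line-buffered writer on the same path.
with open(csv_path, "w", buffering=1) as writer_f:
    first = read_new_rows(reader_f)   # nothing has been written yet
    writer_f.write("1.0,100.0\n")
    writer_f.write("2.0,110.0\n")
    second = read_new_rows(reader_f)  # the two rows above
    writer_f.write("3.0,120.0\n")
    third = read_new_rows(reader_f)   # only the newest row

reader_f.close()
os.remove(csv_path)
print(first, second, third)
```

Line buffering (`buffering=1`) matters here: each completed row is flushed as soon as its newline is written, so the reader does not see half-written rows in this sketch.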

@@ -182,11 +190,23 @@ def _update_df(self) -> None:
         try:
             additional_df = typing.cast(
                 pd.DataFrame,
-                pd.read_csv(self.power_f, header=None, names=self.power_df_columns),  # type: ignore
+                pd.read_csv(self.power_f, header=None, names=self.power_df_columns),
             )
         except pd.errors.EmptyDataError:
             return
-        self.power_df = pd.concat([self.power_df, additional_df], axis=0)
+
+        if additional_df.empty:
+            return
+
+        if self.power_df.empty:
+            self.power_df = additional_df
+        else:
+            self.power_df = pd.concat(
+                [self.power_df, additional_df],
+                axis=0,
+                ignore_index=True,
+                copy=False,
+            )

     def get_energy(self, start_time: float, end_time: float) -> dict[int, float] | None:
         """Get the energy used by the GPUs between two times.
@@ -228,7 +248,7 @@ def get_power(self, time: float | None = None) -> dict[int, float] | None:
             A dictionary mapping GPU indices to the power usage of the GPU at the
             specified time point. GPU indices are from the DL framework's perspective
             after applying `CUDA_VISIBLE_DEVICES`.
-            If there are no power readings, return None.
+            If there are no power readings (e.g., future timestamps), return None.
         """
         self._update_df()
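
The updated docstring says `get_power` returns `None` when there are no power readings, for example for future timestamps. The body of the lookup is not part of this hunk, so the sketch below is only one way such behavior could arise: it assumes readings are sorted by timestamp and that a query selects the first reading taken at or after the requested time. `power_at` and the sample layout are hypothetical.

```python
from bisect import bisect_left


def power_at(samples, time=None):
    """Look up a reading in (time_s, {gpu_index: watts}) samples sorted by time.

    With time=None, return the latest reading. Otherwise return the first
    reading taken at or after `time`. Return None if no reading qualifies.
    """
    if not samples:
        return None
    if time is None:
        return samples[-1][1]
    times = [t for t, _ in samples]
    i = bisect_left(times, time)
    if i == len(samples):
        return None  # `time` is later than every recorded reading
    return samples[i][1]


readings = [(10.0, {0: 100.0}), (11.0, {0: 120.0})]
print(power_at(readings, 12.0))  # None
```

Under this assumed rule, callers need to handle `None` whenever they query a time the poller has not reached yet.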

@@ -250,7 +270,7 @@ def get_power(self, time: float | None = None) -> dict[int, float] | None:

 def _polling_process(
     gpu_indices: list[int],
-    power_csv: str,
+    power_csv_path: str,
     update_period: float,
 ) -> None:
     """Run the power monitor."""
@@ -259,7 +279,7 @@ def _polling_process(
     gpus = get_gpus()

     # Use line buffering.
-    with open(power_csv, "w", buffering=1) as power_f:
+    with open(power_csv_path, "w", buffering=1) as power_f:
         while True:
             power: list[float] = []
             now = time()