Bugfix: Offload of GGML-quantized model in torch.inference_mode()
#7525
Summary
This PR contains a bugfix for an edge case with model unloading (from VRAM to RAM). Thanks to @JPPhoto for finding it.
The bug was triggered under the following conditions:
- A GGML-quantized model is offloaded from VRAM to RAM, which saves its tensors to a state_dict.
- The offload happens inside a torch.inference_mode() context manager.

Offloading the model inside the torch.inference_mode() context manager results in an error when the GGML tensors are saved to a state_dict (roughly the pattern sketched below).
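A rough sketch of the triggering pattern, with placeholder helpers standing in for Invoke's model cache (these names are illustrative, not the actual APIs):

```python
import torch

def load_ggml_model_to_vram():
    """Placeholder: load a GGML-quantized model onto the GPU."""
    ...

def offload_model_to_ram(model):
    """Placeholder: VRAM -> RAM offload, which saves the GGML tensors to a state_dict."""
    ...

# The entire load -> generate -> offload sequence ran inside inference_mode,
# so every tensor created during the load was an "inference tensor".
with torch.inference_mode():
    model = load_ggml_model_to_vram()
    # ... run the generation ...
    offload_model_to_ram(model)  # the error surfaced here, while saving the state_dict
```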
Explanation
From the torch.inference_mode() docs: code run under this mode gets better performance because view tracking and version counter bumps are disabled for tensors created in it. Disabling version counter bumps results in the aforementioned error when saving GGMLTensors to a state_dict.

This incompatibility between GGMLTensors and torch.inference_mode() is likely caused by the custom tensor type implementation. There may very well be a way to get these to cooperate, but for now it is much simpler to remove the torch.inference_mode() contexts.
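As a small illustration of the underlying mechanism (generic PyTorch behavior, not the Invoke code path): tensors created under inference mode have no version counter, so mutating them outside the context fails.

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)  # created as an "inference tensor"

print(t.is_inference())  # True

# In-place updates outside inference mode need a version counter bump,
# which inference tensors do not support, so PyTorch raises a RuntimeError.
try:
    t.add_(1)
except RuntimeError as err:
    print(err)
```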
Note that there are several other uses of torch.inference_mode() in the Invoke codebase, but they are all tight wrappers around the inference forward pass and do not contain the model load/unload process.
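For contrast, a simplified sketch of the two patterns (a toy model, not the Invoke code): a tight inference_mode wrapper around the forward pass is fine, while code that also copies or offloads weights can use torch.no_grad() instead.

```python
import torch

model = torch.nn.Linear(8, 8)   # stand-in for a loaded model
latents = torch.randn(1, 8)

# Fine: inference_mode wraps only the forward pass, so no tensors that need
# to be mutated or saved later are created inside the context.
with torch.inference_mode():
    output = model(latents)

# For code paths that also copy/offload weights, torch.no_grad() still skips
# gradient tracking but does not turn newly created tensors into inference
# tensors, so saving them to a state_dict keeps working.
with torch.no_grad():
    cpu_state = {k: v.to("cpu") for k, v in model.state_dict().items()}
```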
Related Issues / Discussions
Original discussion: https://discord.com/channels/1020123559063990373/1149506274971631688/1326180753159094303
QA Instructions
Find a sequence of operations that triggers the condition. For me, this was:
Tests:
- Changed torch.inference_mode() to torch.no_grad() and compared generation time before and after the change. Before: 50.354s, After: 51.536s.
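A minimal sketch of this kind of timing comparison, with a toy model standing in for the actual generation pipeline (the reported numbers come from a full Invoke generation, not from this script):

```python
import time
import torch

# Toy stand-in for the real model.
model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 2048),
)
x = torch.randn(64, 2048)

def timed(ctx) -> float:
    """Time repeated forward passes under the given context manager."""
    start = time.perf_counter()
    with ctx:
        for _ in range(100):
            model(x)
    return time.perf_counter() - start

print(f"torch.no_grad():        {timed(torch.no_grad()):.3f}s")
print(f"torch.inference_mode(): {timed(torch.inference_mode()):.3f}s")
```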
Checklist
- What's New copy (if doing a release after this PR)