CUDA-graph-compatible releasing and resuming KV cache and model weight memory #2630
Conversation
# Conflicts: # python/sglang/srt/managers/tokenizer_manager.py
@@ -536,6 +542,7 @@ def init_memory_pool(
    max_context_len=self.model_config.context_len + 4,
    device=self.device,
    use_records=False,
    memory_saver_adapter=self.memory_saver_adapter,
Can we do something similar to how we handle get_model, so we do not need to pass memory_saver_adapter as an argument to all kinds of memory pools?
with self.memory_saver_adapter.region():
self.req_to_token_pool = ReqToTokenPool(
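A minimal, self-contained sketch of the suggested pattern, assuming a region() context manager like the one exposed by torch_memory_saver; the adapter and pool classes below are placeholders, not the real SGLang signatures.

```python
from contextlib import contextmanager


class MemorySaverAdapter:
    """Placeholder adapter; the real one wraps torch_memory_saver."""

    @contextmanager
    def region(self):
        # The real adapter tags every CUDA allocation made inside this block
        # so it can later be released and re-created at the same addresses.
        yield


class ReqToTokenPool:
    """Placeholder pool; note it takes no memory-saver argument at all."""

    def __init__(self, size, max_context_len):
        self.req_to_token = [[0] * max_context_len for _ in range(size)]


adapter = MemorySaverAdapter()
with adapter.region():
    # Every pool allocated inside the region is managed by the saver, so the
    # adapter never has to be threaded through the pool constructors.
    req_to_token_pool = ReqToTokenPool(size=8, max_context_len=16)
```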
Co-authored-by: Lianmin Zheng <[email protected]>
This reverts commit b03f558.
Almost there! Just one final comment.
@@ -590,6 +596,7 @@ def init_memory_pool(
    max_context_len=self.model_config.context_len + 4,
    device=self.device,
    use_records=False,
    enable_memory_saver=self.server_args.enable_memory_saver,
See this comment #2630 (comment)
Oops, I misunderstood your question. I am worried maybe not, because in the future we may have some tensors inside the req pool that need to be preserved across a memory release. (torch_memory_saver does not offload things to CPU; it simply throws away the contents in order to be faster.)
But anyway, today it seems to be OK (I will check it), so I can update it if needed. From a quick skim, I am worried that BaseTokenToKVPool.free_slots may be one such tensor that should not be released. Too tired to run any experiment right now, though.
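A hedged sketch of the concern, with a fake saver standing in for torch_memory_saver (the real library frees and re-creates the physical memory; here the effect is only mimicked): bookkeeping such as a free-slot tensor would have to live outside the region, or be rebuilt on resume, because contents inside the region are discarded rather than offloaded.

```python
import torch
from contextlib import contextmanager


class FakeMemorySaver:
    """Stand-in for torch_memory_saver: on pause it drops the contents of the
    tensors registered inside its region (nothing is offloaded to CPU)."""

    def __init__(self):
        self._managed = []

    @contextmanager
    def region(self):
        yield self._managed

    def pause(self):
        # The real saver releases the physical memory; zeroing here just
        # mimics "contents are thrown away, not preserved".
        for t in self._managed:
            t.zero_()


saver = FakeMemorySaver()

# Bookkeeping that must survive a release lives OUTSIDE the region.
free_slots = torch.arange(8)

# Large buffers live INSIDE the region and lose their contents on release.
with saver.region() as managed:
    kv_buffer = torch.randn(8, 4)
    managed.append(kv_buffer)

saver.pause()
assert free_slots.tolist() == list(range(8))  # preserved
assert kv_buffer.abs().sum() == 0             # contents gone
```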
Related: #2542 and #2583
Outdated Content
The test will fail because it currently uses LD_PRELOAD (to intercept and change the logic of cudaMalloc and cudaFree). If the general logic looks good, I will further update this PR to handle this part (e.g. try to specify LD_PRELOAD automatically when creating the backend process).
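A minimal sketch of what "specify LD_PRELOAD automatically" could look like; the library path and launcher below are assumptions for illustration, not the actual change in this PR.

```python
import os
import subprocess

# Hypothetical location of the interception library; the real path depends on
# how torch_memory_saver is built and installed.
MEMORY_SAVER_LIB = "/path/to/libtorch_memory_saver.so"


def launch_backend(cmd: list[str]) -> subprocess.Popen:
    """Start the backend process with LD_PRELOAD set, so cudaMalloc/cudaFree
    are intercepted without the user having to export the variable by hand."""
    env = os.environ.copy()
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{MEMORY_SAVER_LIB}:{existing}" if existing else MEMORY_SAVER_LIB
    return subprocess.Popen(cmd, env=env)
```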
How to execute it
Suppose this branch of SGLang is at /path/to/sglang; then, inside sglang's docker container, execute the following:
Expected results are as follows: x is time, and the red curve is memory consumption. The low memory at the center is caused by temporarily releasing the KV cache memory.
What's changed
Though the PR seems large, most of it is boilerplate.
Core:
- with primary_memory_saver.region(): TokenToKVPool.k_buffers/v_buffers, ModelRunner.model, ReqToTokenPool.req_to_token
- primary_memory_saver.pause() / .resume(): at scheduler.py, in Scheduler.release_gpu_occupation/resume_gpu_occupation (a condensed sketch follows the list below)
Others:
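A condensed sketch of how the core pieces listed above fit together; the method bodies are simplified and the exact signatures in scheduler.py may differ.

```python
class Scheduler:
    """Sketch only: shows where pause()/resume() sit relative to the buffers
    that were allocated inside primary_memory_saver.region()."""

    def __init__(self, memory_saver):
        self.memory_saver = memory_saver  # e.g. the primary torch_memory_saver handle

    def release_gpu_occupation(self):
        # Frees the physical memory behind the KV cache, model weights and
        # req_to_token table, while keeping virtual addresses (and hence
        # captured CUDA graphs) valid.
        self.memory_saver.pause()

    def resume_gpu_occupation(self):
        # Re-materializes the released memory at the same virtual addresses;
        # contents are not restored, so the KV cache starts out empty.
        self.memory_saver.resume()
```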
Checklist