Support FP8-E5M2 KV Cache #2279
Conversation
LGTM, I was wondering about the performance improvement.
And I want to know which one we should use for better precision and performance, E5M2 or E4M3? I guess this may depend on the specific model.
This seriously looks good. Is RTN used for the kv-cache quantization?
It is not limited to Hopper. Volta/Ampere are both OK and have been tested. The fp8 intrinsic will directly use ASM to do the data type conversion on Hopper, while using bit operations on pre-Hopper.
RoundToNearest is not used in this impl. The impl uses the CUDA fp8 intrinsics, such as
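(A small illustration, not the PR's CUDA kernel path: assuming a PyTorch build that ships the experimental float8 dtypes, the E5M2 rounding behavior can be observed directly from Python.)

```python
import torch

# fp16 values round-tripped through FP8-E5M2 to show the precision loss.
x = torch.tensor([0.1, 1.0, 3.14159, 1000.0], dtype=torch.float16)
x_back = x.to(torch.float8_e5m2).to(torch.float16)

print("original  :", x.tolist())
print("via e5m2  :", x_back.tolist())
print("abs error :", (x - x_back).abs().tolist())

# E5M2 keeps fp16's 5 exponent bits but only 2 mantissa bits, so the dynamic
# range matches fp16 while the precision drops sharply.
print(torch.finfo(torch.float8_e5m2))
```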
Below are results tested on A100-40GB.

Offline throughput:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 05:45:54 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 05:46:12 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 05:46:17 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 05:46:17 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 05:46:31 model_runner.py:449] Graph capturing finished in 14 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [22:08<00:00, 1.33s/it]
Throughput: 0.75 requests/s, 1541.35 tokens/s
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_throughput.py --input-len 1024 --output-len 1024 --model /models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(backend='vllm', dataset=None, dtype='auto', enforce_eager=False, hf_max_batch_size=None, input_len=1024, kv_cache_dtype='fp8', max_model_len=None, model='/models/huggingface/LLM/llama-7B-hf/', n=1, num_prompts=1000, output_len=1024, quantization=None, seed=0, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 06:16:00 llm_engine.py:74] Initializing an LLM engine with config: model='/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 06:16:13 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 06:16:21 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 06:16:21 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 06:16:41 model_runner.py:449] Graph capturing finished in 20 secs.
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [15:03<00:00, 1.11it/s]
Throughput: 1.11 requests/s, 2265.89 tokens/s

Latency:

[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype=None, model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:01:41 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=None, seed=0)
INFO 12-29 07:01:53 llm_engine.py:230] # GPU blocks: 2802, # CPU blocks: 512
INFO 12-29 07:01:55 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:01:55 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:02:01 model_runner.py:449] Graph capturing finished in 6 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.78s/it]
Avg latency: 18.779154599333804 seconds
[fp8_cache]root@50c663527862:/zy/github/remote/vllm# python3 benchmarks/benchmark_latency.py --input-len 1024 --output-len 1024 --model /shared/models/huggingface/LLM/llama-7B-hf/ --tokenizer /zy/llama-tokenizer/ --kv-cache-dtype="fp8"
Namespace(batch_size=8, dtype='auto', enforce_eager=False, input_len=1024, kv_cache_dtype='fp8', model='/shared/models/huggingface/LLM/llama-7B-hf/', n=1, num_iters=3, output_len=1024, profile=False, profile_result_dir=None, quantization=None, tensor_parallel_size=1, tokenizer='/zy/llama-tokenizer/', trust_remote_code=False, use_beam_search=False)
INFO 12-29 07:13:48 llm_engine.py:74] Initializing an LLM engine with config: model='/shared/models/huggingface/LLM/llama-7B-hf/', tokenizer='/zy/llama-tokenizer/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, kv_cache_dtype=torch.uint8, seed=0)
INFO 12-29 07:13:55 llm_engine.py:230] # GPU blocks: 5605, # CPU blocks: 1024
INFO 12-29 07:13:57 model_runner.py:403] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-29 07:13:57 model_runner.py:407] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 12-29 07:14:02 model_runner.py:449] Graph capturing finished in 5 secs.
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:52<00:00, 17.37s/it]
Avg latency: 17.37384683514635 seconds

In short: on A100-40GB, the FP8-E5M2 KV cache roughly doubles the number of GPU cache blocks (2802 -> 5605), raises offline throughput from 0.75 to 1.11 requests/s (1541 -> 2266 tokens/s), and reduces average latency from 18.78 s to 17.37 s.
@WoosukKwon @zhuohan123 The PR is ready for review. Could you please take some time to review the code? Thanks a lot.
@seanxcwang Thanks for your feedback. We need to add the torch.uint8 dtype for cache ops (copy, swap). I will fix it ASAP.
Fixed. @seanxcwang could you please use the latest PR to test? Thanks again.
@zhaoyang-star I have used the new PR for testing; no other errors were found.
@zhuohan123 @WoosukKwon The PR is ready for review. Could you please take time to review the code?
I hope it can be merged; it is very useful for large models.
@tjtanaa @hongxiayang We use the CUDA Math API, such as
Some comments below:
- E4M3 is the only FP8 type commonly used (and needed) during inference or in the model forward path; using E5M2 in the forward pass is rare.
- Most FP8 serving and inference setups come with scaled tensor quantization, for either parameters or activations (the KV cache being part of the latter). Saturating to the finite range without scaling isn't common in practice and may incur performance issues in general (see the sketch below for what per-tensor scaling looks like).
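(A minimal sketch of the per-tensor scaled quantization described above, using PyTorch's float8 dtypes. It is an illustration only, not code from this PR, and the helper names are made up.)

```python
import torch

def quantize_per_tensor_fp8(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    """Hypothetical per-tensor scaling: map the tensor's max magnitude onto
    the finite FP8 range, then cast."""
    fp8_max = torch.finfo(fp8_dtype).max
    scale = x.abs().max().float() / fp8_max   # one scale for the whole tensor
    x_fp8 = (x / scale).to(fp8_dtype)
    return x_fp8, scale

def dequantize_per_tensor_fp8(x_fp8, scale, out_dtype=torch.float16):
    return (x_fp8.to(torch.float32) * scale).to(out_dtype)

x = 3.0 * torch.randn(4, 8, dtype=torch.float16)
x_q, s = quantize_per_tensor_fp8(x)
x_dq = dequantize_per_tensor_fp8(x_q, s)
print("max abs error with per-tensor scale:", (x - x_dq).abs().max().item())

# For comparison, a direct unscaled cast to E5M2 (roughly what this PR's
# kernel does, except that the CUDA intrinsic saturates to the finite range):
x_dq_direct = x.to(torch.float8_e5m2).to(torch.float16)
print("max abs error without scaling      :", (x - x_dq_direct).abs().max().item())
```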
@@ -220,6 +220,8 @@ def _paged_attention(
 ) -> torch.Tensor:
     output = torch.empty_like(query)

+    enable_fp8_kv_cache = key_cache.dtype == torch.uint8
+
Would this unnecessarily invalidate 8-bit KV cache formats other than FP8?
+1 Can we get this from model config?
> +1 Can we get this from model config?

Sure, have fixed. Thanks for your review.
vllm/engine/arg_utils.py
Outdated
    type=str,
    choices=['fp8', None],
    default=None,
    help='Data type for kv cache storage.')
Suggested change:
-    help='Data type for kv cache storage.')
+    help='Data type for kv cache storage. If None, will use model data type.')
Fixed.
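(For context, the flag under discussion would be registered roughly as follows. This is a sketch reconstructed from the hunk above rather than the exact PR code, and it uses the help text from the suggestion.)

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--kv-cache-dtype',
    type=str,
    choices=['fp8', None],
    default=None,
    help='Data type for kv cache storage. If None, will use model data type.')

args = parser.parse_args(['--kv-cache-dtype', 'fp8'])
print(args.kv_cache_dtype)  # -> 'fp8'
```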
Thanks! I left two final minor comments (I hope these are really the final comments). Can you merge with the main branch so we can see how CI goes?
vllm/config.py
Outdated
""" | ||
|
||
def __init__( | ||
self, | ||
block_size: int, | ||
gpu_memory_utilization: float, | ||
swap_space: int, | ||
cache_dtype_str: str, |
Let's just call it cache_dtype? The _str suffix seems unnecessary to me.
Sure. Fixed.
vllm/worker/worker.py
Outdated
@@ -36,6 +36,7 @@ def __init__(
     rank: int,
     distributed_init_method: str,
     lora_config: Optional[LoRAConfig] = None,
+    cache_config: Optional[CacheConfig] = None,
This change is weird. Originally we set the cache_config in self.init_cache_engine() (as in line 60 below). This change introduces two cache_config objects, which is super confusing.
The reason we delay the initialization of the cache_config is that cache_config includes the number of KV blocks, which can only be known after memory profiling.
To make things more clear, I think we can just feed in kv_cache_dtype here.
Your suggestion works for me: cache_config: Optional[CacheConfig] = None -> kv_cache_dtype: Optional[str] = "auto".
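(A rough sketch of what the revised signature looks like, with most parameters omitted and the body purely illustrative. Passing just the dtype string keeps the Worker from holding a second CacheConfig, since the real one is only built after memory profiling determines the number of KV blocks.)

```python
from typing import Optional

class Worker:
    """Hypothetical sketch of the revised signature; not the full vLLM Worker."""

    def __init__(
        self,
        rank: int,
        distributed_init_method: str,
        lora_config=None,
        kv_cache_dtype: Optional[str] = "auto",  # "auto" -> follow the model dtype
    ) -> None:
        self.rank = rank
        self.distributed_init_method = distributed_init_method
        self.lora_config = lora_config
        self.kv_cache_dtype = kv_cache_dtype
        self.cache_config = None  # still built later, in init_cache_engine()
```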
    num_layers: int,
    num_heads: int,
    head_size: int,
    cache_dtype: Optional[Union[str, torch.dtype]],
Would it be better to make the type Union[str, torch.dtype] here? Based on the implementation below, if this is None, the first set of if conditions at the beginning of the function will always end with a ValueError, right? So None is not really an option.
Yes. It is better to use Union[str, torch.dtype] for cache_dtype. I will modify it soon.
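(A hypothetical sketch of the kind of resolution function being discussed. The name and the exact string-to-dtype mapping are illustrative; the only details taken from this thread are that "fp8" caches are stored as torch.uint8, "auto" follows the model dtype, and unknown values raise ValueError.)

```python
from typing import Union
import torch

_STR_TO_TORCH_DTYPE = {
    "half": torch.float16,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float": torch.float32,
    "float32": torch.float32,
    # In this PR, FP8-E5M2 values are packed into uint8 storage.
    "fp8": torch.uint8,
}

def resolve_cache_dtype(cache_dtype: Union[str, torch.dtype],
                        model_dtype: torch.dtype = torch.float16) -> torch.dtype:
    """Hypothetical helper: map the cache dtype, given as a string or a
    torch.dtype, to the torch dtype used for KV-cache storage."""
    if isinstance(cache_dtype, torch.dtype):
        return cache_dtype
    if isinstance(cache_dtype, str):
        if cache_dtype == "auto":
            return model_dtype  # follow the model's own dtype
        try:
            return _STR_TO_TORCH_DTYPE[cache_dtype.lower()]
        except KeyError:
            raise ValueError(f"Unknown cache dtype: {cache_dtype!r}") from None
    # None falls through to here, which is why Union[str, torch.dtype]
    # (without Optional) describes the accepted inputs more honestly.
    raise ValueError(f"Unsupported cache dtype specifier: {cache_dtype!r}")
```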
Hi @zhaoyang-star, thanks for the great work! What sampling parameters did you use to get the HumanEval pass@1 score? I recently found I need to set
I used a fine-tuned model based on the open-source WarzardCoder-34B. Sorry, the sampling parameters were not recorded, and I have not evaluated it under greedy sampling.
@zhaoyang-star Is it possible to share the test configuration / parameters for the following table? Thanks.
@HaiShaw Thanks for your attention. The main configuration I used is as follows. Note that the WarzardCoder-34B I used is fine-tuned for internal use, so it may not be possible to open-source it. I have seen your RFC #2461 about FP8 E4M3 with scale factors. It is great work! I think FP8 with a scale factor will achieve a smaller accuracy drop compared to the current implementation in this PR.

{
    "max_tokens": 2048,
    "temperature": 0.2,
    "use_beam_search": false,
    "top_p": 1,
    "top_k": -1,
    "ignore_eos": false,
    "presence_penalty": 1.2,
    "frequency_penalty": 1.0
}
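(To reproduce this with the Python API, the values above map onto vLLM's SamplingParams; the field names match the SamplingParams repr printed in the benchmark logs earlier in this thread.)

```python
from vllm import SamplingParams

# Mirrors the JSON configuration above.
sampling_params = SamplingParams(
    max_tokens=2048,
    temperature=0.2,
    use_beam_search=False,
    top_p=1,
    top_k=-1,
    ignore_eos=False,
    presence_penalty=1.2,
    frequency_penalty=1.0,
)
```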
@zhaoyang-star Thanks for your info on the WarzardCoder-34B testing parameters.
Hello. Does the FP8 KV cache need a calibration dataset? How do I specify this dataset?
@Time-Limit The FP8-E5M2 KV cache in vLLM has no scaling factors, so a calibration dataset is not needed. The docs explain how to use this feature. Please feel free to reach out to me if you run into any trouble.
There is a reference to the quantizer tool in #2461. The short answer is that the quantizer and its utilities enable you to quantize and compute scaling factors over your chosen calibration dataset (e.g. cnnmail, or your domain-specific data).
I want to check: is the KV cache quantization performed in a per-tensor way?
Quantizing the KV cache to FP8 can reduce its memory usage and thereby boost throughput. The implementation uses the FP8 data type for the KV cache and has been tested on A100.
The following test was run with WarzardCoder-34B.
Usage:
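A sketch reconstructed from the benchmark commands earlier in this thread: pass --kv-cache-dtype="fp8" to the benchmark or serving scripts, or set it from Python (assuming the kv_cache_dtype keyword is forwarded to the engine arguments):

```python
from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" stores the KV cache as FP8-E5M2 (packed into uint8);
# by default the cache follows the model's own dtype. The model path here is
# only illustrative.
llm = LLM(model="/models/huggingface/LLM/llama-7B-hf/", kv_cache_dtype="fp8")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```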