Add custom kernel for RMS normalization #16

WoosukKwon · 2023-03-31T01:24:40Z

This PR adds a custom CUDA kernel for RMS normalization, which is used in LLaMA models. The kernel eliminates the inefficient data movement of the current PyTorch implementation.
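For reference, RMS normalization rescales each token's hidden vector `x` (length `hidden_size`) by the reciprocal of its root mean square, then by a learned per-channel weight `w`: `y_i = w_i * x_i / sqrt(mean_j(x_j^2) + eps)`. Unlike LayerNorm, it subtracts no mean and adds no bias. A fused kernel can read each row once, reduce the sum of squares on-chip, and write the normalized row once, whereas a chain of separate PyTorch ops typically writes intermediates (squares, means, scaled values) back to GPU memory. The pure-Python sketch below only pins down the math, one token (row) at a time. It is not the PR's code; the names `rms_norm` and `rms_norm_batch` are hypothetical and the default `eps=1e-6` is an assumption.

```python
import math
from typing import Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Reference RMSNorm for one token: no mean subtraction, no bias.

    eps=1e-6 is an assumed default, not taken from the PR.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must both have length hidden_size")
    # Mean of squares over the hidden dimension (Python floats are doubles).
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale every element by 1/RMS, then by its per-channel weight.
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rms_norm_batch(
    rows: Sequence[Sequence[float]], weight: Sequence[float], eps: float = 1e-6
) -> list[list[float]]:
    """Normalize each row of a [num_tokens, hidden_size] input independently."""
    return [rms_norm(r, weight, eps) for r in rows]
```

With `eps=0` and unit weights, the output of `rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)` has a root mean square of exactly 1, which is a quick sanity check for any kernel implementation.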

Performance (fp16):

| num_tokens | hidden_size | Kernel | PyTorch |
|---:|---:|---:|---:|
| 7 | 1024 | 9 us | 93 us |
| 128 | 1024 | 5 us | 84 us |
| 2048 | 5120 | 60 us | 353 us |
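The numbers above can be turned into speedups, and, under a stated assumption, into an effective memory bandwidth for the kernel. The sketch assumes the fp16 input is read once and the output written once, and ignores the `hidden_size`-long weight vector as negligible; the helper names are hypothetical.

```python
# fp16 latencies in microseconds, copied from the Performance numbers above.
MEASUREMENTS = [
    # (num_tokens, hidden_size, kernel_us, pytorch_us)
    (7, 1024, 9, 93),
    (128, 1024, 5, 84),
    (2048, 5120, 60, 353),
]

BYTES_PER_FP16 = 2


def speedup(kernel_us: float, pytorch_us: float) -> float:
    """How many times faster the custom kernel is than the PyTorch baseline."""
    return pytorch_us / kernel_us


def effective_bandwidth_gbps(num_tokens: int, hidden_size: int, kernel_us: float) -> float:
    """GB/s, ASSUMING one read of the input and one write of the output (weight ignored)."""
    bytes_moved = 2 * num_tokens * hidden_size * BYTES_PER_FP16  # read x + write y
    return bytes_moved / (kernel_us * 1e-6) / 1e9


for n, h, k_us, pt_us in MEASUREMENTS:
    print(
        f"num_tokens={n:5d} hidden_size={h:5d} "
        f"speedup={speedup(k_us, pt_us):4.1f}x "
        f"effective_bw={effective_bandwidth_gbps(n, h, k_us):7.1f} GB/s"
    )
```

For the largest shape this gives about a 5.9x speedup and roughly 699 GB/s. The two smaller shapes move at most about half a megabyte, so their timings are likely dominated by fixed per-launch overheads rather than bandwidth.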

Tested models:

* LLaMA-7B
* LLaMA-13B

Tested GPUs:

* A100

zhuohan123

LGTM


SUMMARY:
- Fix bug whereby 2:4 is not being invoked
- Eschew SparseTensor based implementation

TESTING:
- examples/offline_inference_semi_structured_sparse.py

Co-authored-by: Lucas Wilkinson

WoosukKwon added 4 commits March 31, 2023 01:15

* Add reduction_utils.h (3d17680)
* Add RMS norm kernel (2a3f6eb)
* Add tests for RMS norm kernel (a6b924e)
* Add RMSNorm module (9921699)
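The commits split the work into a reduction helper, the kernel, its tests, and a module wrapper. RMSNorm needs the sum of squares of each row, which the threads of a block must reduce together. A common CUDA pattern, assumed here rather than read from `reduction_utils.h`, is a butterfly reduction inside each 32-lane warp (each step adds the value held by lane `lane ^ offset`, as `__shfl_xor_sync` would), followed by a second warp-level pass over per-warp partial sums staged in shared memory. The Python sketch below simulates that two-level scheme on the host so it can be checked against a plain `sum`; `warp_reduce_sum`, `block_reduce_sum`, `sum_of_squares`, and the 128-thread block size are all hypothetical.

```python
WARP_SIZE = 32


def warp_reduce_sum(lanes: list[float]) -> list[float]:
    """Simulate an XOR-butterfly warp reduction; afterwards every lane holds the warp total."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Lanes exchange simultaneously, so read from a snapshot of the previous step.
        prev = vals[:]
        vals = [prev[i] + prev[i ^ offset] for i in range(WARP_SIZE)]
        offset //= 2
    return vals


def block_reduce_sum(thread_vals: list[float]) -> float:
    """Simulate a two-level block reduction: per-warp butterfly, then one over warp partials."""
    n = len(thread_vals)
    assert n % WARP_SIZE == 0 and n // WARP_SIZE <= WARP_SIZE
    # Level 1: each warp reduces its 32 values; lane 0 would write the result to shared memory.
    shared = [warp_reduce_sum(thread_vals[w:w + WARP_SIZE])[0] for w in range(0, n, WARP_SIZE)]
    # Level 2: the first warp loads the partials (unused lanes padded with 0) and reduces again.
    first_warp = shared + [0.0] * (WARP_SIZE - len(shared))
    return warp_reduce_sum(first_warp)[0]


def sum_of_squares(row: list[float], block_size: int = 128) -> float:
    """Each simulated thread accumulates a strided slice of the row, then the block reduces."""
    partials = [
        sum(row[i] * row[i] for i in range(t, len(row), block_size)) for t in range(block_size)
    ]
    return block_reduce_sum(partials)
```

For example, `sum_of_squares(row)` on an integer-valued row of length 1024 matches `sum(v * v for v in row)` exactly; dividing that result by `hidden_size` gives the mean of squares used in the RMSNorm formula.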

WoosukKwon requested a review from zhuohan123 March 31, 2023 01:25

zhuohan123 approved these changes Mar 31, 2023


zhuohan123 merged commit 09e9245 into main Mar 31, 2023

WoosukKwon deleted the rms-norm branch March 31, 2023 16:51

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Add custom kernel for RMS normalization (vllm-project#16)

3d52973

AdrianAbeyta added a commit to AdrianAbeyta/vllm that referenced this pull request Mar 8, 2024

Merge pull request vllm-project#16 from ROCm/fp8_ingest_stage1_model

7b72159

Generalizing KV scales JSON to updated schema

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Apr 1, 2024

Merge pull request vllm-project#16 from ilya-lavrenov/pa-optimizations

2ce2ff7

Extra optimizations for PA

z103cb added a commit to dtrifiro/vllm that referenced this pull request May 8, 2024

Merge pull request vllm-project#16 from z103cb/update-ibm-main-owners

8a6c9c9

chore: add OWNERS file to ibm_main

fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024

Merge pull request vllm-project#16 from cjatin/bfloat16_fix

24584bc

Fix ambiguous fma call

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

alixiaodi mentioned this pull request Aug 2, 2024

[Bug]: #7072

Closed

SpaceHunterInf mentioned this pull request Sep 30, 2024

[Bug]: Bus error (core dumped) #8974

Closed

