Add support for GPT-NeoX (Pythia) #50
Conversation
This PR adds support for the GPT-NeoX (Pythia) model, which is the backbone of many popular models including Dolly V2, Stable-LM, and Open Assistant.

NOTE: Dolly V2 is not supported by this PR, because it uses bfloat16, which some of our kernels do not support. It will be added in another PR.
Reviewer: LGTM! See comments for more details.
cacheflow/models/gpt_neox.py (Outdated)

    def initialize_dummy_weights(self) -> None:
        for param in self.state_dict().values():
            param.data.uniform_(-0.1, 0.1)
Reviewer: Nit: The U(-0.1, 0.1) initialization will lead to many out-of-range values and NaNs during model execution. Maybe use a smaller range like U(-1e-5, 1e-5)?
Author: The U(-0.1, 0.1) initialization actually works. However, to be cautious, I changed the range to U(-1e-3, 1e-3).
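For clarity, a minimal sketch of the initializer after this change, assuming the agreed U(-1e-3, 1e-3) range (the method itself is the snippet quoted above):

```python
def initialize_dummy_weights(self) -> None:
    # Fill every parameter with small uniform noise; a narrow range keeps
    # dummy forward passes numerically stable (no overflows or NaNs).
    for param in self.state_dict().values():
        param.data.uniform_(-1e-3, 1e-3)
```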
    self.max_position = 8192
    self.tie_word_embeddings = config.tie_word_embeddings

    def get_param_size(self) -> int:
Reviewer: Can we get the parameter size by counting the actual parameters after the model gets initialized? Use some code like the following:

    mem_params = sum(param.nelement() * param.element_size() for param in model.parameters())
Author: Good idea. Let's do that in another PR.
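As context, a self-contained sketch of the suggested measurement; the Hugging Face import and the Pythia checkpoint name are illustrative assumptions, not part of this PR:

```python
from transformers import GPTNeoXForCausalLM

# Load a small Pythia checkpoint and count the memory its parameters
# occupy, following the one-liner suggested in the review.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m")
mem_params = sum(p.nelement() * p.element_size() for p in model.parameters())
print(f"Parameter memory: {mem_params / 1024**2:.1f} MiB")
```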
    dtype_size = get_dtype_size(self.dtype)
    return dtype_size * total

    def get_max_num_gpu_blocks(
    def get_max_act_size(
Reviewer: Similarly, can we profile the actual max activation size by running the model once without any KV cache?
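A rough sketch of what such profiling could look like, assuming a PyTorch model already placed on GPU; the helper name, `model`, and the input shape are placeholders, not APIs from this PR:

```python
import torch

def profile_max_act_size(model: torch.nn.Module, batch_size: int, seq_len: int) -> int:
    # Hypothetical helper: run a single forward pass without any KV cache
    # and report the peak memory used beyond the resident parameters.
    torch.cuda.reset_peak_memory_stats()
    base = torch.cuda.memory_allocated()
    input_ids = torch.zeros(batch_size, seq_len, dtype=torch.long, device="cuda")
    with torch.no_grad():
        model(input_ids)
    return torch.cuda.max_memory_allocated() - base
```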