add meta onDevice support for LLAMA2 #4147
Conversation
Compare 964ff1d to 27e1dbe
@molly-smith - please review this PR when you get a chance.
LGTM
@dc3671 Can you elaborate on the issue you were seeing and for what use case? Maybe share your reproducer script? Meta tensor is not supported with autoTP for any model and is not supported in the llama container.
@molly-smith I think the container is only related to kernel injection? For autoTP, the only thing that matters is that deepspeed makes sure the replaced Linear (or other) modules can find and load the correct checkpoint from the state_dict, which only involves these two places:
So I just added [...]. I'm using this modified Python script for launching: https://github.com/dc3671/intel-extension-for-transformers/blob/llm/examples/huggingface/pytorch/text-generation/inference/run_generation_with_deepspeed.py#L199. If I remove [...]
@molly-smith any update?
@dc3671 I still see [...]
@mrwyattii I updated some details of my script for llama2. It can be run with:

```shell
mpirun -np 4 python -u run_generation_with_deepspeed.py -m /localdisk/llama2 --benchmark
```

I use the following code to load the model with meta tensors first:

```python
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
with deepspeed.OnDevice(dtype=load_dtype, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=load_dtype, trust_remote_code=True)
```

And then I need to pass the checkpoint information to `deepspeed.init_inference`:

```python
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=infer_dtype,
    checkpoint=checkpoints_json if is_meta_support else None,
    **kwargs,
)
```

The original way of getting [...]. I guess maybe you didn't set the `checkpoint` argument in `init_inference`, so it won't load the checkpoint after autoTP.
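For reference, a minimal sketch of how the `checkpoints_json` file passed above might be assembled; the shard-globbing, the `ds_model` type, and the `version` value are assumptions based on later comments in this thread, not part of this PR:

```python
import json
import os
from pathlib import Path

repo_root = "/localdisk/llama2"  # assumption: local directory holding the sharded *.bin files

# Collect the shard filenames and write them into a checkpoints json that
# deepspeed.init_inference(checkpoint=...) can consume.
checkpoint_files = sorted(p.name for p in Path(repo_root).glob("*.bin"))
data = {"type": "ds_model", "checkpoints": checkpoint_files, "version": 1.0}

checkpoints_json = os.path.join(repo_root, "checkpoints.json")
with open(checkpoints_json, "w") as f:
    json.dump(data, f)
```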
@dc3671 Thank you for your patience and detailed response. I was able to recreate your issue and successfully test your changes. Some of us were not aware that meta tensor support was added to AutoTP, but I'm glad to see that it is in fact working. I will merge these changes soon. Thanks again.
Hi, does this merge mean that we can now successfully use [...]? I passed a checkpoints json like this:

```json
{
    "type": "LLAMA",
    "checkpoints": [
        "pytorch_model-00001-of-00015.bin",
        ...
        "pytorch_model-00015-of-00015.bin"
    ]
}
```

but I got [...]
@lashoun This error may be because you are running with kernel_injection=True. KI mode needs changes to the corresponding llama container, which are not included in this PR.
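For anyone hitting the same error, a rough sketch of the non-kernel-injection path discussed here, reusing the variable names from the earlier snippet (a sketch under those assumptions, not a recipe from this PR):

```python
# Keep kernel injection off so AutoTP handles the llama2 module replacement;
# the KI llama container is not covered by this PR.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=infer_dtype,
    base_dir=repo_root,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=False,
)
```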
@dc3671 Hi, sorry to bother you, but I encountered the same issue as #3452. When I load the llama2-13b-hf model normally and enable replace_with_kernel_inject, as follows: [...]

However, when I tried to load it with meta OnDevice, I found that replace_with_kernel_inject does not currently support the llama2 model. Based on this PR, I modified the code and set replace_with_kernel_inject to False, like: [...]

but the output became very anomalous. I suspect there's an issue with how the weights are loaded. Could the 'ds_model' type setting in checkpoint_json be causing incorrect loading? I find that it only supports 'megatronlm', 'ds_model', and 'bloom'. If my understanding is flawed, which type should I use? Any insights would be greatly appreciated. By the way, I'm running multi-node inference: [...]
@ZaVang I'm not that familiar with this part, but I think "ds_model" is OK according to this function (I'm using "bloom"):

```python
class SDLoaderFactory:

    @staticmethod
    def get_sd_loader_json(json_file, checkpoint_engine):
        if isinstance(json_file, str):
            with open(json_file) as f:
                data = json.load(f)
        else:
            assert isinstance(json_file, dict)
            data = json_file

        sd_type = data['type']
        ckpt_list = data['checkpoints']
        version = data['version']
        ckpt_type = data.get('parallelization', 'pp')
        mp_size = data.get('mp_size', 0)
        if sd_type.lower() in ['bloom', 'ds_model']:
            return data
        return SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine, sd_type, version)
```
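As a quick illustration of that branch (a hypothetical snippet, not from the thread): a dict whose type is "ds_model" or "bloom" is returned unchanged, so the "checkpoints" list is consumed directly, while any other type falls through to SDLoaderFactory.get_sd_loader.

```python
# Hypothetical example: a "ds_model" checkpoint dict is returned as-is.
checkpoints_json = {
    "type": "ds_model",
    "checkpoints": ["pytorch_model-00001-of-00015.bin", "pytorch_model-00002-of-00015.bin"],
    "version": 1.0,
}
data = SDLoaderFactory.get_sd_loader_json(checkpoints_json, checkpoint_engine=None)
assert data is checkpoints_json  # no SDLoader object is built for this type
```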
fixed by #4259
Problem

Currently, LlamaRMSNorm inside LLAMA2's LlamaDecoderLayer won't be handled correctly during autoTP weight loading when using deepspeed.OnDevice(device="meta").

Solution

Add a special case to the loading if-clause and abstract it as a method on the Loading class, alongside the Loading.load method, since it is used in two places: one for children inside the autoTP policy module, and one for the outside module. Also, I aligned the way the special case is added for this scenario, using the string class name rather than importing the actual module class inside a try/except clause.

The method name can be changed if there is a better one. @mrwyattii @jeffra Please help review this, thanks~
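To make the last point concrete, a rough sketch of the string-name check described above; the helper name and the exact class-name list are illustrative rather than taken verbatim from the PR:

```python
import torch

def is_load_module(module):
    # Hypothetical helper: match LlamaRMSNorm by its class-name string so that
    # transformers' class does not have to be imported inside a try/except.
    load_layers = (torch.nn.Linear, torch.nn.Embedding, torch.nn.LayerNorm)
    load_layer_names = ["LlamaRMSNorm"]
    return isinstance(module, load_layers) or module._get_name() in load_layer_names
```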