Add chatglm #1478

Merged: 3 commits merged on Dec 9, 2024

mengker33
Contributor

@mengker33 mengker33 commented Nov 12, 2024

What does this PR do?

This PR adds the ChatGLM model (a custom model), covering chatglm2-6b and chatglm3-6b.
An inference test and a pretraining example with its test are also included.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@xuguangxin

@libinta @sywangyi, please help review, thanks.

@emascarenhas
Contributor

@mengker33 ,
Please run "pip install -U ruff; make style" and check for errors.
Also run tests/ci/fast_tests.sh and all the slow tests related to text generation for chatglm that you added, e.g. "GAUDI2_CI=1 RUN_SLOW=1 python -m pytest test_text_generation_example.py", and check that there are no new errors.

Don't you also need to add a test for the language modeling part?

Does this PR need to be included in this release or can it wait for the next release?

@phoenixdna

phoenixdna commented Nov 24, 2024

Hi, I am trying to use a Gaudi card to run inference on ChatGLM3-6B, but I keep hitting the following problem even though I am using PR 1478.
The following is the script I copied from your instructions:

GLM=3 python3 run_generation.py \
--model_name_or_path /data/ZhipuAI/chatglm3-6b \
--use_hpu_graphs \
--use_kv_cache \
--do_sample \
--bf16 \
--trim_logits \
--batch_size 1 \
--max_input_tokens 1024 \
--max_new_tokens 512 \
--reuse_cache \
--use_flash_attention

However, I still got the following errors:

[WARNING|utils.py:225] 2024-11-24 21:54:40,419 >> optimum-habana v1.15.0.dev0 has been validated for SynapseAI v1.18.0 but the driver version is v1.17.0, this could lead to undefined behavior!
/root/habanalabs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/root/habanalabs-venv/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
11/24/2024 21:54:41 - INFO - __main__ - Single-device run.
ChatGLMForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.73it/s]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 28
CPU RAM       : 123577836 KB
------------------------------------------------------------------------------
[WARNING|tokenization_chatglm.py:174] 2024-11-24 21:54:50,850 >> Setting eos_token is not supported, use the default one.
[WARNING|tokenization_chatglm.py:170] 2024-11-24 21:54:50,850 >> Setting pad_token is not supported, use the default one.
[WARNING|tokenization_chatglm.py:166] 2024-11-24 21:54:50,850 >> Setting unk_token is not supported, use the default one.
11/24/2024 21:54:51 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='/data/ZhipuAI/chatglm3-6b', bf16=True, max_new_tokens=512, max_input_tokens=1024, batch_size=1, warmup=3, n_iterations=5, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, do_sample=True, num_beams=1, top_k=None, penalty_alpha=None, trim_logits=True, seed=27, profiling_warmup_steps=0, profiling_steps=0, profiling_record_shapes=False, prompt=None, bad_words=None, force_words=None, assistant_model=None, peft_model=None, num_return_sequences=1, token=None, model_revision='main', attn_softmax_bf16=False, output_dir=None, bucket_size=-1, bucket_internal=False, dataset_max_samples=-1, limit_hpu_graphs=False, show_graphs_count=False, reuse_cache=True, verbose_workers=False, simulate_dyn_prompt=None, reduce_recompile=False, use_flash_attention=True, flash_attention_recompute=False, flash_attention_causal_mask=False, flash_attention_fast_softmax=True, book_source=False, torch_compile=False, ignore_eos=True, temperature=1.0, top_p=1.0, const_serialization_path=None, trust_remote_code=False, parallel_strategy='none', input_embeds=False, run_partial_dataset=False, load_quantized_model_with_autogptq=False, disk_offload=False, load_quantized_model_with_inc=False, local_quantized_inc_model_path=None, quant_config='', world_size=0, global_rank=0)
11/24/2024 21:54:51 - INFO - __main__ - device: hpu, n_hpu: 0, bf16: True
11/24/2024 21:54:51 - INFO - __main__ - Model initialization took 10.588s
11/24/2024 21:54:51 - INFO - __main__ - Graph compilation...
Warming up iteration 1/3
Traceback (most recent call last):
  File "/root/jupyter/optimum-habana/examples/text-generation/run_generation.py", line 758, in <module>
    main()
  File "/root/jupyter/optimum-habana/examples/text-generation/run_generation.py", line 523, in main
    generate(None, args.reduce_recompile)
  File "/root/jupyter/optimum-habana/examples/text-generation/run_generation.py", line 494, in generate
    outputs = model.generate(
  File "/root/habanalabs-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/jupyter/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1008, in generate
    self._prepare_special_tokens(generation_config, kwargs_has_attention_mask, device=device)
  File "/root/habanalabs-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1676, in _prepare_special_tokens
    eos_token_tensor is not None
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].

As I understand it, this PR is exactly for chatglm3-6b, but I don't understand why this happens, even after many attempts. Any suggestions would be appreciated.
With regards.
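
The first warning in the log above says optimum-habana v1.15.0.dev0 was validated for SynapseAI v1.18.0 while the driver is v1.17.0. Nothing in this thread establishes that the mismatch causes the graph compile failure, but it is easy to surface. Below is a minimal sketch that extracts the two versions from such a warning line. The helper name `synapse_version_mismatch` is hypothetical, and the regular expression assumes the exact wording of the warning shown in the log.

```python
import re

# Assumption: the pattern follows the wording of the warning quoted in the log above.
_WARNING_RE = re.compile(
    r"has been validated for SynapseAI v(?P<validated>[\d.]+) "
    r"but the driver version is v(?P<driver>[\d.]+)"
)


def synapse_version_mismatch(log_line):
    """Return (validated, driver) if the line reports differing versions, else None."""
    match = _WARNING_RE.search(log_line)
    if match is None:
        return None
    validated = match.group("validated")
    driver = match.group("driver")
    return (validated, driver) if validated != driver else None


line = (
    "[WARNING|utils.py:225] 2024-11-24 21:54:40,419 >> optimum-habana v1.15.0.dev0 "
    "has been validated for SynapseAI v1.18.0 but the driver version is v1.17.0, "
    "this could lead to undefined behavior!"
)
print(synapse_version_mismatch(line))  # ('1.18.0', '1.17.0')
```

Running this over a saved log before filing a report makes the environment mismatch explicit, whatever the eventual root cause turns out to be.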

@mengker33
Contributor Author

tests/ci/fast_tests.sh

Hi, I have tried with fast_tests.sh and test_text_generation_example.py, and no errors occurred.
I also added tests for the language modeling part.

@mengker33
Contributor Author

Hi, I didn't see any inference or pretraining errors in my local tests. Please check whether your run goes through the correct GLM modeling path in optimum-habana.

@phoenixdna

phoenixdna commented Nov 29, 2024

Thanks for your reply. I downloaded chatglm3-6b from ModelScope. I don't know what you mean by "go through the correct glm modeling path in optimum-habana"; could you kindly explain this?

@mengker33
Contributor Author

You need to check that the model is initialized through optimum/habana/transformers/models/chatglm/modeling_chatglm.py rather than through the modeling code that came with your downloaded checkpoint.

@mengker33 mengker33 force-pushed the chatglm_upstream branch 2 times, most recently from 7e06281 to 6374a63 Compare December 2, 2024 07:03
@emascarenhas
Contributor

@mengker33 ,

I tried to run this test and got an error.
optimum-habana# GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_text_generation_example.py

__________________________________________________________ ERROR collecting tests/test_text_generation_example.py __________________________________________________________
tests/test_text_generation_example.py::test_text_generation_bf16_1x: in "parametrize" the number of names (5):
['model_name', 'batch_size', 'reuse_cache', 'baseline', 'check_output']
must be equal to the number of values (4):
('THUDM/glm-4-9b-chat', 1, True, 105)
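
The collection error above means that one parametrized entry has four values while the test declares five argument names. Below is a minimal sketch of a pre-check that finds such rows without running pytest. The helper `find_arity_mismatches` is hypothetical; the names and the first row come from the error message, and the second row is the ChatGLM entry quoted later in this review.

```python
def find_arity_mismatches(names, rows):
    """Return (index, row) pairs whose length differs from the number of names."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(names)]


names = ["model_name", "batch_size", "reuse_cache", "baseline", "check_output"]
rows = [
    ("THUDM/glm-4-9b-chat", 1, True, 105),       # 4 values, as in the error above
    ("THUDM/chatglm3-6b", 1, True, 150, False),  # 5 values, matches the names
]

for index, row in find_arity_mismatches(names, rows):
    print(f"row {index} has {len(row)} values, expected {len(names)}: {row}")
```

With the data above, only row 0 is reported, which matches the entry named in the pytest error.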

@mengker33
Contributor Author

I think you are using the old version of this PR, please rebase to the latest and try again, thanks!

@mengker33 mengker33 force-pushed the chatglm_upstream branch 2 times, most recently from cb23dfe to dede6fe Compare December 4, 2024 02:55
@emascarenhas
Contributor

@mengker33 ,

I think you are using the old version of this PR, please rebase to the latest and try again, thanks!

Yes. This was the case. I am able to run the examples in the readme successfully after rebasing.
This test command is giving an error.
GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_examples.py -s -v -k chatglm
E FileNotFoundError: [Errno 2] No such file or directory: '/home/optimum-habana/tests/baselines/chatglm3_6b.json'

Is that file required?
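
A missing baseline file like this can be caught before the slow tests start. The sketch below lists expected baseline files that are absent from a directory. The helper `missing_baselines` is hypothetical, and the naming rule (model name with "-" replaced by "_", plus ".json") is an assumption inferred from the single file name in the error above, not a documented convention.

```python
from pathlib import Path


def missing_baselines(baseline_dir, model_names):
    """Return the expected baseline paths that do not exist in baseline_dir."""
    missing = []
    for name in model_names:
        # Assumption: 'chatglm3-6b' maps to 'chatglm3_6b.json', as in the error above.
        candidate = Path(baseline_dir) / (name.replace("-", "_") + ".json")
        if not candidate.is_file():
            missing.append(candidate)
    return missing


# The result depends on the current working directory and checkout.
print(missing_baselines("tests/baselines", ["chatglm3-6b"]))
```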

@phoenixdna

OK, thanks for your reply. I will give it a try.

@mengker33
Contributor Author

GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_examples.py -s -v -k chatglm

Sorry, my bad... I had this baselines/chatglm3_6b.json file locally but forgot to push it to this PR. Really appreciate your test!

@mengker33 mengker33 force-pushed the chatglm_upstream branch 2 times, most recently from 9fe9e0c to 90ea98b Compare December 6, 2024 02:28
Collaborator

@regisss regisss left a comment

Please add this model to the table in the README and in the doc:

("THUDM/chatglm2-6b", 1, True, 150, False),
("THUDM/chatglm3-6b", 1, True, 150, False),
Collaborator

What exactly is the difference between ChatGLM-2 and ChatGLM-3? I'm asking to know whether we really need to test both.

Contributor Author

I don't think there is any functional difference in the modeling code; the only difference lies in the implementation of some customized tokenizer methods. I removed the test for chatglm2.

@mengker33
Copy link
Contributor Author

Please add this model to the table in the README and in the doc:

Done, thanks!

mengker33 and others added 3 commits December 9, 2024 06:31
Including chatglm2-6b and chatglm3-6b.

Co-authored-by: Wei Lin <[email protected]>
Co-authored-by: Jianqian Zhou <[email protected]>
Co-authored-by: Leo Zhao <[email protected]>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss regisss merged commit 1bf9a9a into huggingface:main Dec 9, 2024
4 checks passed
zzhang37 pushed a commit to zzhang37/optimum-habana that referenced this pull request Dec 9, 2024
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Dec 10, 2024
    )
else:
    with ht.sdp_kernel(enable_recompute=flash_attention_recompute):
        if (q_len > 8192 or (q_len >= 6144 and bsz >= 2)) and self.training:
Contributor

@mengker33, hi, just curious: why 6144 here?
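
To make the question concrete, the condition in the quoted line can be evaluated for a few shapes. The sketch below only mirrors that boolean expression. The function name `takes_guarded_branch` is hypothetical, and what the guarded branch does is not shown in the excerpt, so it is not modeled here.

```python
def takes_guarded_branch(q_len, bsz, training):
    """Mirror the quoted condition on sequence length, batch size and training mode."""
    return (q_len > 8192 or (q_len >= 6144 and bsz >= 2)) and training


# Sample (q_len, bsz) pairs around the two thresholds, in training mode.
for q_len, bsz in [(4096, 4), (6144, 1), (6144, 2), (8192, 1), (8193, 1)]:
    print(q_len, bsz, takes_guarded_branch(q_len, bsz, training=True))
```

The 6144 threshold applies only when the batch size is at least 2; with a batch size of 1 the branch is taken only above 8192, and it is never taken outside training.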

Liangyx2 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Jan 20, 2025