Endless inferencing with cpu on DeepSeek-R1-Distill-Qwen-1.5B #1134

Open · basncy opened this issue Feb 12, 2025 · 4 comments
Labels: bug (Something isn't working)

basncy commented Feb 12, 2025

Replace deepseek-ai/DeepSeek-R1 with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B in examples/deepseekr1/main.rs to run the demo on CPU. The example application then gets into an endless loop after the dummy run completes.
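
For reference, the change is a one-line swap of the model ID passed to the example's builder. A minimal sketch, assuming the builder chain matches the shipped example (the .with_paged_attn line appears verbatim later in this thread; the .with_isq choice is reconstructed from the Some(Q4K) ISQ log below):

use mistralrs::{IsqType, PagedAttentionMetaBuilder, TextModelBuilder};

// Sketch of examples/deepseekr1/main.rs: only the model ID changes.
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    .with_isq(IsqType::Q4K) // matches the "Some(Q4K)" in-situ quantization log below
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;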

basncy added the bug label on Feb 12, 2025
EricLBuehler (Owner) commented:

@basncy can you please try what is detailed here:

#1064 (comment)

basncy (Author) commented Feb 13, 2025

Hi EricLBuehler,

I'm not sure if this is related to PagedAttention. PagedAttention is not supported on CPU; do you plan to implement it, even if only for research? Or is there any plan to add a SYCL hardware backend? It looks like a long way to go to support SYCL in Rust.

// TODO: PagedAttention is not supported with CPU for now.

Here is an strace log captured during the inference loop:

...
[pid 370744] 1739433696.342936 <... futex resumed>) = -1 EAGAIN (Resource temporarily unavailable)
[pid 370743] 1739433696.342947 <... sched_yield resumed>) = 0
[pid 370742] 1739433696.342957 sched_yield( <unfinished ...>
[pid 370741] 1739433696.342967 <... sched_yield resumed>) = 0
[pid 370740] 1739433696.342976 <... sched_yield resumed>) = 0
[pid 370739] 1739433696.342985 <... sched_yield resumed>) = 0
[pid 370738] 1739433696.342995 <... sched_yield resumed>) = 0
[pid 370737] 1739433696.343006 <... futex resumed>) = 0
[pid 370736] 1739433696.343016 <... sched_yield resumed>) = 0
[pid 370735] 1739433696.343025 <... sched_yield resumed>) = 0
[pid 370734] 1739433696.343036 <... sched_yield resumed>) = 0
[pid 370733] 1739433696.343050 <... sched_yield resumed>) = 0
[pid 370731] 1739433696.343059 sched_yield( <unfinished ...>
[pid 370746] 1739433696.343085 <... sched_yield resumed>) = 0
[pid 370744] 1739433696.343095 futex(0x652080195a00, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 370743] 1739433696.343108 sched_yield( <unfinished ...>
[pid 370742] 1739433696.343120 <... sched_yield resumed>) = 0
[pid 370741] 1739433696.343131 sched_yield( <unfinished ...>
[pid 370740] 1739433696.343143 sched_yield( <unfinished ...>
[pid 370739] 1739433696.343153 sched_yield( <unfinished ...>
[pid 370738] 1739433696.343163 sched_yield( <unfinished ...>
[pid 370737] 1739433696.343174 futex(0x652080195a88, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 370736] 1739433696.343184 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370735] 1739433696.343195 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370734] 1739433696.343205 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370733] 1739433696.343216 sched_yield( <unfinished ...>
[pid 370732] 1739433696.343227 sched_yield( <unfinished ...>
[pid 370731] 1739433696.343237 <... sched_yield resumed>) = 0
[pid 370746] 1739433696.343255 sched_yield( <unfinished ...>
[pid 370745] 1739433696.343262 <... futex resumed>) = 0
[pid 370744] 1739433696.343268 <... futex resumed>) = 0
[pid 370743] 1739433696.343273 <... sched_yield resumed>) = 0
[pid 370742] 1739433696.343279 sched_yield( <unfinished ...>
[pid 370741] 1739433696.343284 <... sched_yield resumed>) = 0
[pid 370740] 1739433696.343291 <... sched_yield resumed>) = 0
[pid 370739] 1739433696.343296 <... sched_yield resumed>) = 0
[pid 370738] 1739433696.343302 <... sched_yield resumed>) = 0
[pid 370737] 1739433696.343308 <... futex resumed>) = 1
[pid 370733] 1739433696.343314 <... sched_yield resumed>) = 0
[pid 370732] 1739433696.343320 <... sched_yield resumed>) = 0
[pid 370746] 1739433696.343333 <... sched_yield resumed>) = 0
[pid 370745] 1739433696.343339 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370744] 1739433696.343346 sched_yield( <unfinished ...>
[pid 370743] 1739433696.343352 sched_yield( <unfinished ...>
...      

EricLBuehler added a commit that referenced this issue Feb 13, 2025
EricLBuehler (Owner) commented:

Hi @basncy!

Could you please remove the .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())? line entirely to see if that helps, given you are using only the CPU.
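
With that call removed, the chain would look roughly like this (a sketch under the same assumptions as the snippet above; everything except the dropped PagedAttention line is unchanged):

// Sketch: the same builder without PagedAttention, so the CPU-only path
// never touches the PagedAttention scheduler or its KV-cache allocation.
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    .with_isq(IsqType::Q4K)
    .with_logging()
    .build()
    .await?;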

> I'm not sure if this is related to PagedAttention. PagedAttention is not supported on CPU; do you plan to implement it, even if only for research?

I don't think it has to do with PagedAttention, unless the example is unchanged and you are compiling with the metal feature. In that case, a large amount of PagedAttention KV cache would be allocated, as described in #1064.

I think implementing PagedAttention for the CPU would be a relatively low priority for now, as deployment use cases would most likely target a GPU, for which we support both CUDA and Metal.

A SYCL backend might be interesting, but I think implementing WGPU support would have a broader impact, making Vulkan, OpenGL, and other backends available.

basncy (Author) commented Feb 14, 2025

> Could you please remove the .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())? line entirely to see if that helps, given you are using only the CPU.

Same result when debugging with the default options from "Debug example 'deepseekr1'":

2025-02-14T03:02:12.015031Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:12.015176Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:12.864254Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-02-14T03:02:13.206167Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:13.920594Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:14.262308Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-14T03:02:14.263287Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-14T03:02:14.263603Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-02-14T03:02:14.599928Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-14T03:02:14.600075Z  INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-02-14T03:02:14.600090Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-14T03:02:14.600107Z  INFO mistralrs_core::utils::log: Layers 0-27: cpu
2025-02-14T03:02:14.601037Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-14T03:02:14.601088Z  WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-02-14T03:02:14.601151Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 1536, intermediate_size: 8960, num_hidden_layers: 28, num_attention_heads: 12, num_key_value_heads: 2, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: false }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [01:44<00:00, 2010.19it/s]
2025-02-14T03:04:24.078039Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-02-14T03:04:24.078188Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 197 tensors.
2025-02-14T03:04:24.078290Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 16 threads.
2025-02-14T03:06:04.832834Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 197 tensors out of 197 total tensors. Took 100.75s
2025-02-14T03:06:05.639157Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-14T03:06:05.653194Z  INFO mistralrs_core: Beginning dummy run.
2025-02-14T03:06:07.291909Z  INFO mistralrs_core: Dummy run completed in 1.638660426s.


> A SYCL backend might be interesting, but I think implementing WGPU support would have a broader impact, making Vulkan, OpenGL, and other backends available.

Perhaps, but there are still many challenges, as WGPU focuses on rendering at the moment.
