Endless inferencing with cpu on DeepSeek-R1-Distill-Qwen-1.5B #1134

Open · basncy opened this issue Feb 12, 2025 · 4 comments
Labels: bug (Something isn't working)

basncy commented Feb 12, 2025

Replace deepseek-ai/DeepSeek-R1 with deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B in examples/deepseekr1/main.rs to run the demo on CPU. The example application then gets into an endless loop after the dummy run completes.
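
For reference, the change is a one-line swap of the model ID passed to the example's builder. A minimal sketch, assuming the builder chain matches the shipped example (the .with_paged_attn line appears verbatim later in this thread; the .with_isq choice is reconstructed from the Some(Q4K) ISQ log below):

use mistralrs::{IsqType, PagedAttentionMetaBuilder, TextModelBuilder};

// Sketch of examples/deepseekr1/main.rs: only the model ID changes.
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    .with_isq(IsqType::Q4K) // matches the "Some(Q4K)" in-situ quantization log below
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;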

basncy added the bug label on Feb 12, 2025
EricLBuehler (Owner) commented:

@basncy can you please try what is detailed here:

#1064 (comment)

basncy (Author) commented Feb 13, 2025

Hi EricLBuehler,

I'm not sure if this is related to PagedAttention. PagedAttention is not supported on CPU; do you plan to implement it, even if only for research? Or is there any plan to add a SYCL hardware backend? It looks like a long way to go to support SYCL in Rust.

// TODO: PagedAttention is not supported with CPU for now.

Here is an strace log captured during the inference loop:

...
[pid 370744] 1739433696.342936 <... futex resumed>) = -1 EAGAIN (Resource temporarily unavailable)
[pid 370743] 1739433696.342947 <... sched_yield resumed>) = 0
[pid 370742] 1739433696.342957 sched_yield( <unfinished ...>
[pid 370741] 1739433696.342967 <... sched_yield resumed>) = 0
[pid 370740] 1739433696.342976 <... sched_yield resumed>) = 0
[pid 370739] 1739433696.342985 <... sched_yield resumed>) = 0
[pid 370738] 1739433696.342995 <... sched_yield resumed>) = 0
[pid 370737] 1739433696.343006 <... futex resumed>) = 0
[pid 370736] 1739433696.343016 <... sched_yield resumed>) = 0
[pid 370735] 1739433696.343025 <... sched_yield resumed>) = 0
[pid 370734] 1739433696.343036 <... sched_yield resumed>) = 0
[pid 370733] 1739433696.343050 <... sched_yield resumed>) = 0
[pid 370731] 1739433696.343059 sched_yield( <unfinished ...>
[pid 370746] 1739433696.343085 <... sched_yield resumed>) = 0
[pid 370744] 1739433696.343095 futex(0x652080195a00, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 370743] 1739433696.343108 sched_yield( <unfinished ...>
[pid 370742] 1739433696.343120 <... sched_yield resumed>) = 0
[pid 370741] 1739433696.343131 sched_yield( <unfinished ...>
[pid 370740] 1739433696.343143 sched_yield( <unfinished ...>
[pid 370739] 1739433696.343153 sched_yield( <unfinished ...>
[pid 370738] 1739433696.343163 sched_yield( <unfinished ...>
[pid 370737] 1739433696.343174 futex(0x652080195a88, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 370736] 1739433696.343184 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370735] 1739433696.343195 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370734] 1739433696.343205 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370733] 1739433696.343216 sched_yield( <unfinished ...>
[pid 370732] 1739433696.343227 sched_yield( <unfinished ...>
[pid 370731] 1739433696.343237 <... sched_yield resumed>) = 0
[pid 370746] 1739433696.343255 sched_yield( <unfinished ...>
[pid 370745] 1739433696.343262 <... futex resumed>) = 0
[pid 370744] 1739433696.343268 <... futex resumed>) = 0
[pid 370743] 1739433696.343273 <... sched_yield resumed>) = 0
[pid 370742] 1739433696.343279 sched_yield( <unfinished ...>
[pid 370741] 1739433696.343284 <... sched_yield resumed>) = 0
[pid 370740] 1739433696.343291 <... sched_yield resumed>) = 0
[pid 370739] 1739433696.343296 <... sched_yield resumed>) = 0
[pid 370738] 1739433696.343302 <... sched_yield resumed>) = 0
[pid 370737] 1739433696.343308 <... futex resumed>) = 1
[pid 370733] 1739433696.343314 <... sched_yield resumed>) = 0
[pid 370732] 1739433696.343320 <... sched_yield resumed>) = 0
[pid 370746] 1739433696.343333 <... sched_yield resumed>) = 0
[pid 370745] 1739433696.343339 futex(0x652080195a80, FUTEX_WAIT_BITSET_PRIVATE, 2, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 370744] 1739433696.343346 sched_yield( <unfinished ...>
[pid 370743] 1739433696.343352 sched_yield( <unfinished ...>
...      

EricLBuehler added a commit that referenced this issue Feb 13, 2025
EricLBuehler (Owner) commented:

Hi @basncy!

Could you please remove the .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())? line entirely to see if that helps, given you are using only the CPU.
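
With that call removed, the chain would look roughly like this (a sketch under the same assumptions as the snippet above; everything except the dropped PagedAttention line is unchanged):

// Sketch: the same builder without PagedAttention, so the CPU-only path
// never touches the PagedAttention scheduler or its KV-cache allocation.
let model = TextModelBuilder::new("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    .with_isq(IsqType::Q4K)
    .with_logging()
    .build()
    .await?;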

> I'm not sure if this is related to PagedAttention. PagedAttention is not supported on CPU; do you plan to implement it, even if only for research?

I don't think it has to do with PagedAttention, unless the example is unchanged and you are compiling with the metal feature. In that case, a large amount of PagedAttention KV cache would be allocated, as described in #1064.

I think implementing PagedAttention for the CPU would be a relatively low priority for now, as deployment use cases would most likely target a GPU, for which we support both CUDA and Metal.

A SYCL backend might be interesting, but I think implementing WGPU support would have a broader impact, making Vulkan, OpenGL, and other backends available.

basncy (Author) commented Feb 14, 2025

> Could you please remove the .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())? line entirely to see if that helps, given you are using only the CPU.

Same result when debugging with the default options from "Debug example 'deepseekr1'":

2025-02-14T03:02:12.015031Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:12.015176Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:12.864254Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-02-14T03:02:13.206167Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:13.920594Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-14T03:02:14.262308Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-14T03:02:14.263287Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-14T03:02:14.263603Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-02-14T03:02:14.599928Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-14T03:02:14.600075Z  INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-02-14T03:02:14.600090Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-14T03:02:14.600107Z  INFO mistralrs_core::utils::log: Layers 0-27: cpu
2025-02-14T03:02:14.601037Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-02-14T03:02:14.601088Z  WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-02-14T03:02:14.601151Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 1536, intermediate_size: 8960, num_hidden_layers: 28, num_attention_heads: 12, num_key_value_heads: 2, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: false }
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [01:44<00:00, 2010.19it/s]
2025-02-14T03:04:24.078039Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-02-14T03:04:24.078188Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 197 tensors.
2025-02-14T03:04:24.078290Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 16 threads.
2025-02-14T03:06:04.832834Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 197 tensors out of 197 total tensors. Took 100.75s
2025-02-14T03:06:05.639157Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-14T03:06:05.653194Z  INFO mistralrs_core: Beginning dummy run.
2025-02-14T03:06:07.291909Z  INFO mistralrs_core: Dummy run completed in 1.638660426s.


> A SYCL backend might be interesting, but I think implementing WGPU support would have a broader impact, making Vulkan, OpenGL, and other backends available.

Perhaps, but there are still many challenges, as WGPU focuses on rendering at the moment.
