Build Failed on Amazon Linux 2023 + Cuda 12.6 / 12.7 #1126

Open
aminnasiri opened this issue Feb 8, 2025 · 14 comments
Labels: bug (Something isn't working), build (Issues relating to building mistral.rs)

@aminnasiri

Minimum reproducible example

The build is failing on Amazon Linux 2023 with CUDA 12.6 / 12.7.

Error

These are my instructions:

Other information

Please specify:

  • Operating system: Linux (Amazon Linux 2023)

  • GPU or accelerator information

    • nvidia-smi
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
      |  0%   16C    P8              9W /  300W |       4MiB /  23028MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+

      +-----------------------------------------------------------------------------------------+
      | Processes:                                                                              |
      |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
      |        ID   ID                                                               Usage      |
      |=========================================================================================|
      |  No running processes found                                                             |
      +-----------------------------------------------------------------------------------------+

    • nvcc --version
      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2024 NVIDIA Corporation
      Built on Thu_Sep_12_02:18:05_PDT_2024
      Cuda compilation tools, release 12.6, V12.6.77
      Build cuda_12.6.r12.6/compiler.34841621_0

Latest commit or version

I am using tag v0.4.0

@aminnasiri aminnasiri added bug Something isn't working build Issues relating to building mistral.rs labels Feb 8, 2025
@EricLBuehler
Owner

EricLBuehler commented Feb 11, 2025

@aminnasiri can you please use git pull for the latest version and recompile? I merged #1129, which gates our new NCCL support behind a feature flag for build compatibility - that may help.
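
For context, a minimal sketch of how this kind of Cargo feature gating typically looks; the feature name "nccl" and the module below are illustrative assumptions, not the actual mistral.rs layout:

    // Minimal sketch of Cargo feature gating (feature name "nccl" is assumed and
    // must also be declared under [features] in Cargo.toml): the NCCL-dependent
    // code only compiles when built with `--features nccl`, so a default build
    // still succeeds on systems without NCCL installed.
    #[cfg(feature = "nccl")]
    mod nccl_support {
        pub fn init() { /* hypothetical NCCL setup would go here */ }
    }

    #[cfg(not(feature = "nccl"))]
    mod nccl_support {
        pub fn init() { /* no-op stub when the feature is disabled */ }
    }

    fn main() {
        nccl_support::init();
    }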

@aminnasiri
Author

Thanks @EricLBuehler, I pulled the latest code from master, ran cargo clean, and then ran cargo build --release --features cuda, but I am still getting an error.

new-error-gist

@EricLBuehler
Owner

Hi @aminnasiri! Can you please recompile with NVCC_CUDA_FLAGS="-fPIC" (#286)?
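
For reference, a rough sketch of how an environment variable like NVCC_CUDA_FLAGS can be forwarded to nvcc from a build script; this is a hypothetical build.rs (including the kernel path), not the actual mistral.rs build code. Passing -fPIC makes the compiled CUDA objects position-independent so they can be linked into the Rust binary:

    // Hypothetical build.rs sketch: read NVCC_CUDA_FLAGS and append its flags to
    // the nvcc invocation; "kernels/example.cu" is a placeholder path.
    use std::{env, process::Command};

    fn main() {
        let extra_flags = env::var("NVCC_CUDA_FLAGS").unwrap_or_default();
        let status = Command::new("nvcc")
            .args(["-c", "kernels/example.cu", "-o", "example.o"])
            .args(extra_flags.split_whitespace())
            .status()
            .expect("failed to spawn nvcc");
        assert!(status.success(), "nvcc returned a non-zero exit status");
    }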

@aminnasiri
Author

Thanks @EricLBuehler, it built successfully with this command:
NVCC_CUDA_FLAGS="-fPIC" cargo build --release --features "cuda flash-attn"

I do have an issue loading 8B models such as DeepSeek-R1-Distill-Llama-8B (I was able to use the same model with vLLM on the same machine), but smaller models like DeepSeek-R1-Distill-Qwen-1.5B run well.

Do you have any idea?

@EricLBuehler
Owner

EricLBuehler commented Feb 12, 2025

@aminnasiri can you please paste the error/log?

@aminnasiri
Author

aminnasiri commented Feb 13, 2025

Sorry, I forgot to add the error logs, and unfortunately I no longer have them. I ran cargo clean and then rebuilt with this command:
NVCC_CUDA_FLAGS="-fPIC" cargo build --release --features cuda

./target/release/mistralrs-server --isq Q4K --port 1234 plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

Logs:

2025-02-13T00:50:27.207215Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-02-13T00:50:27.207247Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-02-13T00:50:27.207266Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-02-13T00:50:27.207424Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-13T00:50:27.211307Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-13T00:50:27.273773Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-02-13T00:50:27.322628Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-13T00:50:27.420596Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
2025-02-13T00:50:27.487050Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-13T00:50:27.517987Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-02-13T00:50:27.643356Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-13T00:50:27.643451Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `qwen2`
2025-02-13T00:50:27.720703Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-13T00:50:27.720755Z  INFO mistralrs_core::utils::log: Model has 28 repeating layers.
2025-02-13T00:50:27.720780Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-13T00:50:27.720800Z  INFO mistralrs_core::utils::log: Layers 0-27: cuda[0]
2025-02-13T00:50:27.740109Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-02-13T00:50:27.740586Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-13T00:50:27.740625Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 151936, hidden_size: 1536, intermediate_size: 8960, num_hidden_layers: 28, num_attention_heads: 12, num_key_value_heads: 2, max_position_embeddings: 131072, sliding_window: 4096, rope_theta: 10000.0, rms_norm_eps: 1e-6, hidden_act: Silu, use_flash_attn: false, quantization_config: None, tie_word_embeddings: false }
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:26<00:00, 23.75it/s]
2025-02-13T00:50:54.662686Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-02-13T00:50:54.662750Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 197 tensors.
2025-02-13T00:50:54.663521Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 8 threads.
2025-02-13T00:51:03.655131Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 197 tensors out of 197 total tensors. Took 8.99s
2025-02-13T00:51:03.655376Z  INFO mistralrs_core::paged_attention: Allocating 3584 MB for PagedAttention KV cache per GPU
2025-02-13T00:51:03.655396Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 4096 GPU blocks: available context length is 131072 tokens
2025-02-13T00:51:03.955010Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-13T00:51:03.967650Z  INFO mistralrs_server: Model loaded.
2025-02-13T00:51:03.968067Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-02-13T00:51:04.002025Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-02-13T00:51:04.002986Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-02-13T00:51:04.003057Z  INFO mistralrs_core: Beginning dummy run.
2025-02-13T00:51:04.192342Z  INFO mistralrs_core: Dummy run completed in 0.18926775s.
2025-02-13T00:51:04.192681Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.

I then ran this curl command, but after a long wait there was no response:
curl --location 'http://localhost:1234/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      { "role": "system", "content": "You are Mistral.rs, an AI assistant." },
      { "role": "user", "content": "Write a story about Rust error handling." }
    ]
  }'
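
For reference, the same request can be issued from Rust; a minimal sketch assuming the reqwest (blocking + json features) and serde_json crates, with the server listening on localhost:1234 as above:

    // Send the chat-completions request and print the raw response body.
    use serde_json::json;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let body = json!({
            "model": "",
            "messages": [
                { "role": "system", "content": "You are Mistral.rs, an AI assistant." },
                { "role": "user", "content": "Write a story about Rust error handling." }
            ]
        });
        let resp = reqwest::blocking::Client::new()
            .post("http://localhost:1234/v1/chat/completions")
            .json(&body)
            .send()?;
        println!("{}", resp.text()?);
        Ok(())
    }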

@aminnasiri
Author

aminnasiri commented Feb 13, 2025

Then I tried to run it again with the model deepseek-ai/DeepSeek-R1-Distill-Llama-8B:

./target/release/mistralrs-server --isq Q4K --port 1234 plain -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B

Logs:

2025-02-13T00:58:35.895593Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2025-02-13T00:58:35.895637Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-02-13T00:58:35.895657Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-02-13T00:58:35.895806Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
2025-02-13T00:58:35.896016Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
2025-02-13T00:58:35.958574Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-000002.safetensors", "model-00002-of-000002.safetensors"]
2025-02-13T00:58:36.011984Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
2025-02-13T00:58:36.092555Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`
2025-02-13T00:58:36.152973Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 512.
2025-02-13T00:58:36.173063Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-02-13T00:58:36.297264Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-13T00:58:36.297354Z  INFO mistralrs_core::utils::log: Automatic loader type determined to be `llama`
2025-02-13T00:58:36.360908Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-02-13T00:58:36.360953Z  INFO mistralrs_core::utils::log: Model has 32 repeating layers.
2025-02-13T00:58:36.360968Z  INFO mistralrs_core::utils::log: Loading model according to the following repeating layer mappings:
2025-02-13T00:58:36.360988Z  INFO mistralrs_core::utils::log: Layers 0-31: cuda[0]
2025-02-13T00:58:36.380364Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-02-13T00:58:36.380871Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-02-13T00:58:36.380907Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_size: 4096, intermediate_size: 14336, vocab_size: 128256, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, use_flash_attn: false, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 8.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: false }
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [01:02<00:00, 5.39it/s]100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [01:11<00:00, 2.74it/s]
2025-02-13T00:59:48.892817Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to all ranks.
2025-02-13T00:59:48.895080Z  INFO mistralrs_core::pipeline::isq: Applying in-situ quantization into Some(Q4K) to 225 tensors.
2025-02-13T00:59:48.895192Z  INFO mistralrs_core::pipeline::isq: Applying ISQ on 8 threads.
2025-02-13T01:00:25.114210Z  INFO mistralrs_core::pipeline::isq: Applied in-situ quantization into Some(Q4K) to 225 tensors out of 225 total tensors. Took 36.22s
2025-02-13T01:00:25.115408Z  INFO mistralrs_core::paged_attention: Allocating 15054 MB for PagedAttention KV cache per GPU
2025-02-13T01:00:25.115425Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 3763 GPU blocks: available context length is 120416 tokens
2025-02-13T01:00:25.395785Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin▁of▁sentence|>", eos_toks = "<|end▁of▁sentence|>", unk_tok = `None`
2025-02-13T01:00:25.407956Z  INFO mistralrs_server: Model loaded.
2025-02-13T01:00:25.408396Z  INFO mistralrs_core: Enabling GEMM reduced precision in BF16.
2025-02-13T01:00:25.481196Z  INFO mistralrs_core: Enabling GEMM reduced precision in F16.
2025-02-13T01:00:25.483372Z  INFO mistralrs_core::cublaslt: Initialized cuBLASlt handle
2025-02-13T01:00:25.483449Z  INFO mistralrs_core: Beginning dummy run.
2025-02-13T01:00:25.630237Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 1, 128], rhs: [1, 8, 1, 128], op: "reshape" }
2025-02-13T01:00:25.631446Z  INFO mistralrs_core: Dummy run completed in 0.147980138s.
2025-02-13T01:00:25.631754Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 1, 128], rhs: [1, 8, 1, 128], op: "reshape" }
thread '<unnamed>' panicked at mistralrs-core/src/engine/mod.rs:413:25:
called `Result::unwrap()` on an `Err` value: SendError { .. }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-02-13T01:00:25.637284Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2025-02-13T01:00:37.944165Z  WARN mistralrs_core: Engine is dead, rebooting
2025-02-13T01:00:37.944240Z  INFO mistralrs_core: Successfully rebooted engine and updated sender + engine handler
2025-02-13T01:00:37.968853Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 23, 128], rhs: [1, 23, 8, 128], op: "reshape" }
2025-02-13T01:00:37.969791Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 1, 128], rhs: [1, 8, 1, 128], op: "reshape" }
thread '<unnamed>' panicked at mistralrs-core/src/engine/mod.rs:413:25:
called `Result::unwrap()` on an `Err` value: SendError { .. }
2025-02-13T01:00:46.891258Z  WARN mistralrs_core: Engine is dead, rebooting
2025-02-13T01:00:46.891316Z  INFO mistralrs_core: Successfully rebooted engine and updated sender + engine handler
2025-02-13T01:00:46.892958Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 23, 128], rhs: [1, 23, 8, 128], op: "reshape" }
2025-02-13T01:00:46.893892Z ERROR mistralrs_core::engine: step - Model failed with error: ShapeMismatchBinaryOp { lhs: [1, 1, 128], rhs: [1, 8, 1, 128], op: "reshape" }
thread '<unnamed>' panicked at mistralrs-core/src/engine/mod.rs:413:25:
called `Result::unwrap()` on an `Err` value: SendError { .. }

I then ran this curl command:
curl --location 'http://localhost:1234/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      { "role": "system", "content": "You are Mistral.rs, an AI assistant." },
      { "role": "user", "content": "Write a story about Rust error handling." }
    ]
  }'

Response:

{
  "message": "shape mismatch in reshape, lhs: [1, 23, 128], rhs: [1, 23, 8, 128]",
  "partial_response": {
    "id": "0",
    "choices": [
      {
        "finish_reason": "error",
        "index": 0,
        "message": {
          "content": "",
          "role": "assistant",
          "tool_calls": []
        },
        "logprobs": null
      }
    ],
    "created": 1739408446,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "system_fingerprint": "local",
    "object": "chat.completion",
    "usage": {
      "completion_tokens": 0,
      "prompt_tokens": 23,
      "total_tokens": 23,
      "avg_tok_per_sec": null,
      "avg_prompt_tok_per_sec": null,
      "avg_compl_tok_per_sec": null,
      "total_time_sec": 0.0,
      "total_prompt_time_sec": 0.0,
      "total_completion_time_sec": 0.0
    }
  }
}
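
The failing reshape looks like a grouped-query-attention shape issue: the config reports num_key_value_heads: 8 and a head dimension of 128 (hidden_size 4096 / 32 heads), which matches the rhs shape, while the lhs tensor only holds 128 elements. A minimal sketch of the mismatch itself, assuming the candle tensor backend:

    // A [1, 1, 128] tensor has 128 elements and cannot be viewed as
    // [1, 8, 1, 128] (1024 elements), so reshape returns an error.
    use candle_core::{DType, Device, Tensor};

    fn main() -> candle_core::Result<()> {
        let kv = Tensor::zeros((1usize, 1, 128), DType::F32, &Device::Cpu)?;
        let err = kv.reshape((1usize, 8, 1, 128)).unwrap_err();
        println!("{err}"); // shape mismatch in reshape, lhs: [1, 1, 128], rhs: [1, 8, 1, 128]
        Ok(())
    }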

@EricLBuehler
Owner

@aminnasiri this has been fixed; can you please try it again after git pull?

@aminnasiri
Author

@EricLBuehler I am on the master branch at this commit:
commit 323e7cd4e4a5a1f4cb2de68535b9683f6b16fb78 (HEAD -> master, origin/master, origin/HEAD)
Author: Eric Buehler <xxxx>
Date:   Tue Feb 11 20:37:53 2025 -0500

@aminnasiri
Author

@EricLBuehler But those errors that I mentioned are happening on the master branch after a successful build at commit 323e7cd4e4a5a1f4cb2de68535b9683f6b16fb78 (HEAD -> master, origin/master, origin/HEAD), Author: Eric Buehler <xxxx>, Date: Tue Feb 11 20:37:53 2025 -0500.

@EricLBuehler
Owner

@aminnasiri the current latest commit is c9ac321. If running git pull does not work to fetch these latest changes, can you delete & re-clone the repository?

@aminnasiri
Author

@EricLBuehler I did, and I am getting this error:

[gist](https://gist.github.com/aminnasiri/cb3f4aec219aed9eb803648bfaa71930)

@EricLBuehler
Owner

@aminnasiri did you specify the NVCC_CUDA_FLAGS variable (i.e. NVCC_CUDA_FLAGS="-fPIC" cargo build --release --features "cuda flash-attn")?

@aminnasiri
Author

aminnasiri commented Feb 13, 2025

@EricLBuehler
Yes, I tried to build with both commands, and both of them are failing:

  1. NVCC_CUDA_FLAGS="-fPIC" cargo build --release --features cuda
  2. NVCC_CUDA_FLAGS="-fPIC" cargo build --release --features "cuda flash-attn"
