
Optimize CMake multi-core compilation parameter logic #523

Merged
merged 1 commit into kvcache-ai:main on Feb 22, 2025

Conversation

miaooo0000OOOO
Contributor

The following description was AI-generated and human-reviewed.

The changed code selects parallel compilation parameters intelligently, significantly speeding up builds while remaining cross-platform compatible. Behavior in the different scenarios:


1. The user has set the CMAKE_BUILD_PARALLEL_LEVEL environment variable

  • Behavior
    The changed code block is not executed; the user-specified value is used directly.
    Example
    export CMAKE_BUILD_PARALLEL_LEVEL=4
    pip install ktransformers --no-build-isolation
    • CMake uses --parallel=4, fully respecting the user's configuration.

2. The user has not set CMAKE_BUILD_PARALLEL_LEVEL

Scenario 2.1: the self.parallel attribute exists and is valid
  • Behavior
    The value of self.parallel takes priority.
    Example
    If self.parallel = 8, then --parallel=8 is added.
    Typical use
    The user explicitly specified the parallelism through another channel (e.g. a command-line argument).
Scenario 2.2: self.parallel does not exist or is None/0
  • Behavior
    The number of logical CPU cores (hyper-threads included) is detected automatically and used as the parallelism.
    Example
    • 4-core / 8-thread CPU → --parallel=8
    • CPU count undetectable (e.g. on old systems) → fall back to --parallel=1 (safe compilation)
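The priority order above can be sketched in Python. This is a minimal illustration, not the PR's actual code; the function name and the `self_parallel` parameter (standing in for the `self.parallel` attribute) are hypothetical:

```python
import os

def resolve_parallel(self_parallel=None):
    """Pick the CMake parallelism following the priority described above:
    CMAKE_BUILD_PARALLEL_LEVEL env var > self.parallel > CPU count > 1."""
    if os.environ.get("CMAKE_BUILD_PARALLEL_LEVEL"):
        # CMake reads this variable itself, so no --parallel flag is added.
        return None
    if self_parallel:
        return int(self_parallel)
    # os.cpu_count() counts logical cores (hyper-threads included) and
    # may return None on systems where detection fails.
    return os.cpu_count() or 1

# Turn the resolved value into a cmake argument list.
level = resolve_parallel()
build_args = ["cmake", "--build", "."]
if level is not None:
    build_args.append(f"--parallel={level}")
```

When the environment variable is set, the sketch deliberately adds no flag at all, since CMake honors `CMAKE_BUILD_PARALLEL_LEVEL` on its own.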

3. Cross-platform behavior

Linux/macOS (Make/Ninja)
  • Parameter translation
    --parallel=N → the underlying tool receives -jN (both Make and Ninja).
    Effect
    Makes full use of multi-core resources and speeds up compilation.
Windows (MSBuild)
  • Parameter translation
    --parallel=N → MSBuild's /m switch.
    Effect
    Multi-process compilation, avoiding the compatibility issues of -jN.
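As an illustration of the mapping above (CMake performs this translation itself; the helper below is purely hypothetical, keyed on `sys.platform` as a stand-in for the generator choice):

```python
import sys

def native_flags(level: int) -> list:
    """Map CMake's --parallel=N onto the flag the native tool receives."""
    if sys.platform == "win32":
        return ["/m"]            # MSBuild: enable multi-process builds
    return [f"-j{level}"]        # Make and Ninja both accept -jN
```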

4. Edge-case handling

Scenario 4.1: CPU core detection fails
  • Behavior
    The code sets cpu_count to 1 and falls back to single-threaded compilation.
    Example
    Some virtualized environments or old hardware cannot report a CPU count; compilation then proceeds safely.
Scenario 4.2: the user explicitly disables parallelism
  • Behavior
    Set CMAKE_BUILD_PARALLEL_LEVEL=1 or self.parallel=1.
    Effect
    Forces single-threaded compilation, useful for debugging or resource-constrained environments.

Summary

By adapting the multi-core compilation parameters intelligently, this change achieves:

  • Cross-platform compatibility: supports Make/Ninja/MSBuild.
  • User priority: environment variable > self.parallel > automatic detection.
  • Safe fallback: degrades to single-threaded compilation when CPU detection fails.
  • Better performance: makes full use of the hardware and reduces compile time.

@miaooo0000OOOO
Contributor Author

Fixes issue #215.

@Atream
Contributor

Atream commented Feb 22, 2025

Thank you for your contribution, we will test it.

@Atream Atream requested a review from UnicornChan February 22, 2025 09:36
@Atream Atream merged commit 94ab2de into kvcache-ai:main Feb 22, 2025
6 checks passed
@ubergarm
Contributor


@miaooo0000OOOO thanks for the contribution, I just compiled a binary .whl release for ktransformers@94ab2de and updated my guide to show using the new export CMAKE_BUILD_PARALLEL_LEVEL=8 environment variable.

However, in testing I do see 8 processes for cc1/cc1plus early in the build, but later in the build only a single CPU core is running cicc, despite also setting export MAX_JOBS=8.

Here are some screenshots to help show:

[screenshot: correct-eight-threads-compiling]

[screenshot: only-one-core-compiling]

@miaooo0000OOOO
Contributor Author


@ubergarm Thank you for testing. I successfully reproduced the issue you encountered. My current hypothesis is that cicc can only compile in single-core mode.

First, I attempted setting the MAX_JOBS environment variable, but it did not resolve the issue—compilation remained single-core.

Next, I reviewed the NVCC documentation and set the multi-core compilation parameter --threads 8 for NVCC, but the problem persisted (still single-core compilation).
See: NVCC Documentation §4.2.5.8

Finally, I ran pip install . --no-build-isolation -vvv, and the output revealed:

Emitting ninja build file .../ktransformers/build/temp.linux-x86_64-cpython-311/build.ninja...  
Compiling objects...  

This generated a Ninja build file. Upon inspecting build.ninja, only three files require compilation:

build .../ktransformers/build/temp.linux-x86_64-cpython-311/ktransformers/ktransformers_ext/cuda/binding.o: compile .../ktransformers/ktransformers/ktransformers_ext/cuda/binding.cpp  
build .../ktransformers/build/temp.linux-x86_64-cpython-311/ktransformers/ktransformers_ext/cuda/custom_gguf/dequant.o: cuda_compile .../ktransformers/ktransformers/ktransformers_ext/cuda/custom_gguf/dequant.cu  
build .../ktransformers/build/temp.linux-x86_64-cpython-311/ktransformers/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.o: cuda_compile .../ktransformers/ktransformers/ktransformers_ext/cuda/gptq_marlin/gptq_marlin.cu  

I suspect the root cause is the limited number of compilation targets (only 3 files), which inherently restricts multi-core utilization. The build system may not parallelize tasks when the workload is too small to justify thread overhead.
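For context on the --threads experiment above: per the NVCC documentation, --threads only parallelizes the compilation steps within one nvcc invocation, roughly one cicc per -gencode target architecture, so with a single target architecture one cicc still runs alone. A hypothetical sketch of such an invocation (file names and architectures are illustrative):

```python
def nvcc_command(src, obj, threads, archs):
    """Build an nvcc invocation with intra-invocation parallelism.
    --threads N lets nvcc compile the listed -gencode targets in
    parallel; with only one target it cannot occupy more than one core."""
    cmd = ["nvcc", "--threads", str(threads), "-c", src, "-o", obj]
    for arch in archs:
        cmd += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return cmd

# With a single target architecture, --threads has little to parallelize.
cmd = nvcc_command("dequant.cu", "dequant.o", threads=8, archs=["86"])
```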

@Atream
Contributor

Atream commented Feb 23, 2025

During the later stages of compilation, there is a single large file that contains multiple templates. The compilation of this single file cannot be parallelized for now. We will later attempt to split it into multiple separate files.
