[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation #6125
QQ: Will the API server automatically use CPU 30 and 31?
I really appreciate this simplification. However, can we set this env variable internally in vLLM so that users don't have to care about it? I'm asking because it's still not entirely straightforward to me. For example, users may have the following questions:
Yes, CPUs 30 and 31 are reserved for non-OpenMP threads (e.g., Python threads, the asyncio event loop, ...), and are leveraged by the OS scheduler automatically.
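To make the split concrete, here is an illustrative sketch (not vLLM's actual implementation) of partitioning the available cores so the highest-numbered ones are left for non-OpenMP work:

```python
import os


def available_cpus():
    """Cores this process may run on (respects any affinity mask)."""
    try:
        return sorted(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is Linux-only
        return list(range(os.cpu_count() or 1))


def split_cores(cpus, n_reserved=2):
    """Reserve the last n_reserved cores for non-OpenMP threads
    (API server, asyncio event loop); the rest go to inference."""
    if len(cpus) <= n_reserved:
        return list(cpus), []
    return list(cpus[:-n_reserved]), list(cpus[-n_reserved:])


# On a 32-core host this reserves cores 30 and 31, matching the
# configuration discussed above.
inference, reserved = split_cores(list(range(32)))
```

The helper names here are hypothetical; the point is only that the reserved cores are never pinned by OpenMP, so the OS scheduler is free to place the serving threads on them.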
Yes, a fully automatic setting would be the best solution. It requires detecting the topology of CPU cores and memory nodes; we also want to achieve that kind of out-of-the-box usage.

VLLM_CPU_OMP_THREADS_BIND controls the OpenMP thread behavior of model inference, including the thread count, thread affinity (pinning an inference thread to a fixed CPU core), and the memory allocation policy (allocating memory only from the closest memory node). We have added two performance tips about this arg for platforms with hyper-threading or multi-socket configurations. For platforms without hyper-threading or multiple sockets, allocating more CPUs for model inference should, in theory, improve performance.
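As an illustrative sketch (the exact core list depends on your machine's topology, and the value here is only an example), binding inference threads on a 32-core host while leaving cores 30 and 31 free might look like:

```shell
# Illustrative only: pin OpenMP inference threads to cores 0-29 on a
# 32-core host, leaving cores 30 and 31 for the API server and the
# asyncio event loop, as discussed above.
export VLLM_CPU_OMP_THREADS_BIND=0-29
```

The server would then be launched from the same shell so it inherits the variable.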
Hi @bigPYJ1151 and @WoosukKwon, this change missed the +cpu suffix for the torch version, which leads to a build failure of the CPU target. Please take a look at #6931. Thanks.
This also does something more subtle: the required version in requirements-build.txt and pyproject.toml is still 2.3.1, so building this with pip install . (i.e. PEP 517 style builds) results in a broken build. Querying the engine then fails with:

AttributeError: '_OpNamespace' '_C_cache_ops' object has no attribute 'reshape_and_cache'

Full traceback follows: