Batched benchmark script and more detailed benchmark metrics #25
Closed

Conversation

No description provided.
Closing this PR since it's too outdated.
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request on Apr 17, 2024
Disable weight compression on optimum-intel conversion path
z103cb referenced this pull request in z103cb/opendatahub_vllm on May 9, 2024
Cherry-pick of fix commit 6100f4b from ODH: opendatahub-io#17

Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Daniele Trifirò <[email protected]>
dtrifiro added a commit to dtrifiro/vllm that referenced this pull request on May 21, 2024
Ibm main update 2024-05-16
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request on May 31, 2024
Updates to custom PagedAttention for supporting context len up to 32k.
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request on Jun 5, 2024
* Fix setup.py for HPU
* Fix vllm._C import ops -> vllm.hpu import ops
* more of the same thing
* re-add hpex rmsnorm and rope; but rope is crashing
* remove unnecessary comments
* add vllm/hpu files
* add hpu autodetection
* Add HabanaAttention stub
* revert accidental changes
* revert non-habana backend attention changes
* add habana attention/worker/executor, sampling fails now
* Restore unnecessarily changed files
* enable HabanaMemoryProfiler
* Make sampler pass
* restore habana fused rope
* prefill is now working!!!
* fix prefill padding; decode is now working!!!!!
* revert accidental changes
* remove unused stuff in habana_paged_attn.py
* remove diagnostic stuff from llm_engine.py
* use HabanaExecutorAsync in async_llm_engine.py
* add habana copyright headers to habana_*.py files
* fix prefill attention conformance
* minor naming fixes
* remove naive attention from habana_attn (it never worked anyway)
* re-enable profile run
* Add fake HPUGraph support
* add more metrics
* indentation fix
* ~~recipe cache metrics don't work lalalala~~
* i'm done with metrics for now
* fix corner case in which hl-smi is not available but synapse is
* FIXME: temporary setup.py workaround
* WIP: add tensor parallelism stubs
* habana worker cleanup
* tensor parallelism is now working
* remove unused files
* remove unused func
* add hpugraphrunner
* improve hpu layernorm
* Port pipelined PA
* Port context length bucketing
* remove cudagraphrunner from hpu runner
* restore HPUGraphRunner back from FakeHPUGraphRunner
* handle rotary embeddings properly on gaudi3
* oopsie! captured_block_counts was incorrect!
* captured_block_counts.append doesn't do anything
* Restore habana_main KV cache memory layout
* fix memory profiler
* overhaul hpugraph capture
* memory profiling overhaul
* format memory properly in model warmup
* add graph compilation profiler for graph capture phase
* roll back log lvl on graph capture message
* Remove unnecessary view on residual connection in RMSNorm (vllm-project#25)

Co-authored-by: madamczykhabana <[email protected]>
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request on Jun 25, 2024
Add IPEX MOE CPU support
yukavio pushed a commit to yukavio/vllm that referenced this pull request on Jul 3, 2024
njhill pushed a commit to njhill/vllm that referenced this pull request on Nov 6, 2024