Upstream sync 2024 06 08 #288

robertgshaw2-redhat · 2024-06-08T17:05:31Z

Upstream sync 2024 06 08 (#288) - ties to v0.4.3 of vllm-upstream

SUMMARY:

Merge commits from vllm-project@f68470e to vllm-project@1197e02
Our GCP test instances do not have gcc or clang installed. All of the Triton kernels rely on gcc or clang for JIT compilation. I disabled the affected tests for now, but we need to get a compiler installed (cc @andy-neuma). All disabled tests are marked with:

@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")

Cherry-picked the FP8 weight-format changes from @mgoin.

Note that vllm-project@f68470e is NOT included in this merge.
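
The unconditional skip marker in the summary could later be replaced with a conditional one, so the tests re-enable themselves once a compiler is installed. The following is only a minimal sketch, not what this PR does: `find_c_compiler`, `HAS_C_COMPILER`, and `test_triton_kernel` are hypothetical names, and it assumes that finding `gcc` or `clang` on `PATH` is a good enough proxy for what Triton needs at JIT time.

```python
import shutil


def find_c_compiler(candidates=("gcc", "clang"), which=shutil.which):
    """Return the path of the first C compiler found on PATH, or None.

    `which` is injectable so the lookup can be exercised without
    depending on what is installed on the current machine.
    """
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


# Evaluated once at import time of the test module (hypothetical name).
HAS_C_COMPILER = find_c_compiler() is not None

# Hypothetical usage in a test module, replacing the unconditional skip:
#
#   @pytest.mark.skipif(not HAS_C_COMPILER,
#                       reason="C compiler not installed; Triton JIT "
#                              "needs clang or gcc.")
#   def test_triton_kernel():
#       ...
```

With this shape, the skip disappears automatically on instances where gcc or clang is present, and no test files need to be touched again after the automation images are updated.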

COMPARE vs UPSTREAM:

https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-08..vllm-project:vllm:v0.4.3

andy-neuma

thanks

@andy-neuma

Upstream sync 2024 06 11 (#288)

SUMMARY:

* Merge commits from vllm-project@1197e02 to vllm-project@114332b
* Our GCP test instances do not have gcc or clang installed. All of the triton kernels rely on the gcc and clang to generate JITs. These are still disabled (cc @andy-neuma). All are marked with:

```python
@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")
```

Note that vllm-project@1197e02 is NOT included in this merge.

COMPARE vs UPSTREAM:

https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-11..vllm-project:vllm:v0.5.0

---------

Signed-off-by: Ye Cao <[email protected]>
Signed-off-by: kevin <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Ye Cao <[email protected]>
Co-authored-by: Nadav Shmayovits <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Daniil Arapov <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Avinash Raj <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Kaiyang Chen <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Breno Faria <[email protected]>
Co-authored-by: Toshiki Kataoka <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: tomeras91 <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: DriverSong <[email protected]>
Co-authored-by: qiujiawei9 <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Breno Faria <[email protected]>
Co-authored-by: liuyhwangyh <[email protected]>
Co-authored-by: mulin.lyh <[email protected]>
Co-authored-by: Matthew Goldey <[email protected]>
Co-authored-by: Jie Fu (傅杰) <[email protected]>
Co-authored-by: Itay Etelis <[email protected]>
Co-authored-by: limingshu <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Calvinn Ng <[email protected]>
Co-authored-by: team <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Benjamin Kitor <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Bla_ckB <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

alexm-redhat and others added 30 commits June 8, 2024 16:39

[Kernel] Add marlin_24 unit tests (vllm-project#4901)

e69d23b

[Kernel] Add flash-attn back (vllm-project#4907)

81ec16b

[Model] LLaVA model refactor (vllm-project#4910)

5500975

Remove marlin warning (vllm-project#4918)

b913d04

[Misc]: allow user to specify port in distributed setting (vllm-proje…

683a30b

…ct#4914)

[Build/CI] Enabling AMD Entrypoints Test (vllm-project#4834)

c8794c3

Co-authored-by: Alexey Kondratiev <[email protected]>

[Bugfix] Fix dummy weight for fp8 (vllm-project#4916)

5b6a7b5

Allow dummy load format for fp8, torch.uniform_ doesn't support FP8 at the moment Co-authored-by: Mor Zusman <[email protected]>

[Core] Sharded State Loader download from HF (vllm-project#4889)

a5e66c7

[Doc]Add documentation to benchmarking script when running TGI (vllm-…

8a78ed8

…project#4920)

[Core] Fix scheduler considering "no LoRA" as "LoRA" (vllm-project#4897)

6b46dcf

[Model] add rope_scaling support for qwen2 (vllm-project#4930)

907d48a

[Model] Add Phi-2 LoRA support (vllm-project#4886)

11d6f7e

[Docs] Add acknowledgment for sponsors (vllm-project#4925)

5d98989

[CI/Build] Codespell ignore build/ directory (vllm-project#4945)

58a235b

[Bugfix] Fix flag name for max_seq_len_to_capture (vllm-project#4935)

253d8fb

Signed-off-by: kerthcet <[email protected]>

[Bugfix][Kernel] Add head size check for attention backend selection (v…

f744125

…llm-project#4944)

[Frontend] Dynamic RoPE scaling (vllm-project#4638)

c1672a9

[CI/Build] Enforce style for C++ and CUDA code with clang-format (v…

4b6c961

…llm-project#4722)

[misc] remove comments that were supposed to be removed (vllm-project…

4b74974

…#4977)

[Kernel] Fixup for CUTLASS kernels in CUDA graphs (vllm-project#4954)

39c15ee

Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs

[Misc] Load FP8 kv-cache scaling factors from checkpoints (vllm-proje…

2835fc6

…ct#4893) The 2nd PR for vllm-project#4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).

[Model] LoRA gptbigcode implementation (vllm-project#3949)

3db99a6

[Core] Eliminate parallel worker per-step task scheduling overhead (v…

39a0a40

…llm-project#4894)

[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…

847ca88

…Config (vllm-project#4991)

[Misc] Take user preference in attention selector (vllm-project#4960)

c60384c

Marlin 24 prefill performance improvement (about 25% better on averag…

dae5aaf

…e) (vllm-project#4983)

[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…

05a4f64

…ot defined (vllm-project#5009)

[Core][1/N] Support send/recv in PyNCCL Groups (vllm-project#4988)

bf4c411

Signed-off-by: Muralidhar Andoorveedu <[email protected]>

[Kernel] Initial Activation Quantization Support (vllm-project#4525)

c623663

Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>

[Core]: Option To Use Prompt Token Ids Inside Logits Processor (vllm-…

a9ca32d

…project#4985) Co-authored-by: Elisei Smirnov <[email protected]>

robertgshaw2-redhat and others added 24 commits June 9, 2024 12:34

skip blockspase attention

9ed5f76

fix falcon

ec71544

skip sliding window chunked prefill

7381340

skip prefix prefill

c23ca05

skip tensorizer

85512eb

[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input…

0cea2c2

…_scale (vllm-project#5353)

format

31147df

fix issue with internal method

2256610

formatting

01973f5

disabled more kernel tests that use triton

a1a659d

updated cutlass skipping. We need cuda 12.4 in automation

c50784c

trigger kernel tests in automation

99fa9f8

cleanup spurious setup.py change

2ec6643

readded the missing images

0bb099c

multilora inference

198f364

offline inference with prefix

ec0e89a

backend request func

e6f1cbd

benchmark serving

ca8d74a

prod monitoring readme

5335ad9

format

611cfed

fix benchmark issue - internal method changed

73132a5

removed skip for remote push edits

7f5c715

update internal method in benchmark throughput too

437912e

skip triton sampler tests

950981c

andy-neuma self-requested a review June 10, 2024 17:25

andy-neuma approved these changes Jun 10, 2024

andy-neuma merged commit db9ed90 into main Jun 10, 2024
49 of 57 checks passed

robertgshaw2-redhat mentioned this pull request Jun 11, 2024

[Rel Eng] Upstream sync 2024 06 11 #298

Merged

robertgshaw2-redhat mentioned this pull request Jun 12, 2024

Upstream sync 2024 06 12 #302

Merged
