This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

andy/bump main to v0.3.2 #49

Closed
wants to merge 113 commits
Commits (113)
6b7de1a
[ROCm] add support to ROCm 6.0 and MI300 (#2274)
hongxiayang Jan 26, 2024
3a0e1fc
Support for Stable LM 2 (#2598)
dakotamahan-stability Jan 26, 2024
390b495
Don't build punica kernels by default (#2605)
pcmoritz Jan 26, 2024
beb89f6
AWQ: Up to 2.66x higher throughput (#2566)
casper-hansen Jan 27, 2024
220a476
Use head_dim in config if exists (#2622)
xiangxu-google Jan 27, 2024
3801700
Implement custom all reduce kernels (#2192)
hanzhi713 Jan 27, 2024
5f036d2
[Minor] Fix warning on Ray dependencies (#2630)
WoosukKwon Jan 27, 2024
f8ecb84
Speed up Punica compilation (#2632)
WoosukKwon Jan 28, 2024
89be30f
Small async_llm_engine refactor (#2618)
andoorve Jan 28, 2024
7d64841
Update Ray version requirements (#2636)
simon-mo Jan 28, 2024
9090bf0
Support FP8-E5M2 KV Cache (#2279)
zhaoyang-star Jan 29, 2024
b72af8f
Fix error when tp > 1 (#2644)
zhaoyang-star Jan 29, 2024
1b20639
No repeated IPC open (#2642)
hanzhi713 Jan 29, 2024
ea8489f
ROCm: Allow setting compilation target (#2581)
rlrs Jan 29, 2024
5d60def
DeepseekMoE support with Fused MoE kernel (#2453)
zwd003 Jan 30, 2024
ab40644
Fused MOE for Mixtral (#2542)
pcmoritz Jan 30, 2024
d79ced3
Fix 'Actor methods cannot be called directly' when using `--engine-us…
HermitSun Jan 30, 2024
4f65af0
Add swap_blocks unit tests (#2616)
sh1ng Jan 30, 2024
bbe9bd9
[Minor] Fix a small typo (#2672)
pcmoritz Jan 30, 2024
105a40f
[Minor] Fix false warning when TP=1 (#2674)
WoosukKwon Jan 30, 2024
3dad944
Add quantized mixtral support (#2673)
WoosukKwon Jan 31, 2024
1af090b
Bump up version to v0.3.0 (#2656)
zhuohan123 Jan 31, 2024
d69ff0c
Fixes assertion failure in prefix caching: the lora index mapping sho…
sighingnow Jan 31, 2024
c664b0e
fix some bugs (#2689)
zspo Jan 31, 2024
89efcf1
[Minor] Fix test_cache.py CI test failure (#2684)
pcmoritz Jan 31, 2024
d0d93b9
Add unit test for Mixtral MoE layer (#2677)
pcmoritz Jan 31, 2024
93b38be
Refactor Prometheus and Add Request Level Metrics (#2316)
robertgshaw2-redhat Jan 31, 2024
cd9e60c
Add Internlm2 (#2666)
Feb 1, 2024
923797f
Fix compile error when using rocm (#2648)
zhaoyang-star Feb 1, 2024
b9e96b1
fix python 3.8 syntax (#2716)
simon-mo Feb 1, 2024
bb8c697
Update README for meetup slides (#2718)
simon-mo Feb 1, 2024
c410f5d
Use revision when downloading the quantization config file (#2697)
Pernekhan Feb 1, 2024
96b6f47
Remove hardcoded `device="cuda" ` to support more devices (#2503)
jikunshang Feb 1, 2024
0e163fc
Fix default length_penalty to 1.0 (#2667)
zspo Feb 1, 2024
4abf633
Add one example to run batch inference distributed on Ray (#2696)
c21 Feb 2, 2024
5ed704e
docs: fix langchain (#2736)
mspronesti Feb 4, 2024
51cd22c
set&get llm internal tokenizer instead of the TokenizerGroup (#2741)
dancingpipi Feb 4, 2024
5a6c81b
Remove eos tokens from output by default (#2611)
zcnrex Feb 4, 2024
c9b45ad
Require triton >= 2.1.0 (#2746)
whyiug Feb 5, 2024
72d3a30
[Minor] Fix benchmark_latency script (#2765)
WoosukKwon Feb 5, 2024
56f738a
[ROCm] Fix some kernels failed unit tests (#2498)
hongxiayang Feb 5, 2024
b92adec
Set local logging level via env variable (#2774)
gardberg Feb 5, 2024
2ccee3d
[ROCm] Fixup arch checks for ROCM (#2627)
dllehr-amd Feb 5, 2024
f0d4e14
Add fused top-K softmax kernel for MoE (#2769)
WoosukKwon Feb 6, 2024
ed70c70
modelscope: fix issue when model parameter is not a model id but path…
liuyhwangyh Feb 6, 2024
fe6d09a
[Minor] More fix of test_cache.py CI test failure (#2750)
LiuXiaoxuanPKU Feb 6, 2024
c81dddb
[ROCm] Fix build problem resulted from previous commit related to FP8…
hongxiayang Feb 7, 2024
931746b
Add documentation on how to do incremental builds (#2796)
pcmoritz Feb 7, 2024
65b89d1
[Ray] Integration compiled DAG off by default (#2471)
rkooo567 Feb 8, 2024
3711811
Disable custom all reduce by default (#2808)
WoosukKwon Feb 8, 2024
0580aab
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-atte…
hongxiayang Feb 11, 2024
4ca2c35
Add documentation section about LoRA (#2834)
pcmoritz Feb 12, 2024
5638364
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
zcnrex Feb 12, 2024
a4211a4
Serving Benchmark Refactoring (#2433)
ywang96 Feb 13, 2024
f964493
[CI] Ensure documentation build is checked in CI (#2842)
simon-mo Feb 13, 2024
5c976a7
Refactor llama family models (#2637)
esmeetu Feb 13, 2024
ea35600
Revert "Refactor llama family models (#2637)" (#2851)
pcmoritz Feb 13, 2024
a463c33
Use CuPy for CUDA graphs (#2811)
WoosukKwon Feb 13, 2024
317b29d
Remove Yi model definition, please use `LlamaForCausalLM` instead (#2…
pcmoritz Feb 13, 2024
2a543d6
Add LoRA support for Mixtral (#2831)
tterrysun Feb 13, 2024
7eacffd
Migrate InternLMForCausalLM to LlamaForCausalLM (#2860)
pcmoritz Feb 14, 2024
0c48b37
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2…
pcmoritz Feb 14, 2024
7e45107
[Fix] Fix memory profiling when GPU is used by multiple processes (#2…
WoosukKwon Feb 14, 2024
87069cc
Fix docker python version (#2845)
NikolaBorisov Feb 14, 2024
4efbac6
Migrate AquilaForCausalLM to LlamaForCausalLM (#2867)
esmeetu Feb 14, 2024
25e86b6
Don't use cupy NCCL for AMD backends (#2855)
WoosukKwon Feb 14, 2024
31348df
Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880)
pcmoritz Feb 15, 2024
d7afab6
[BugFix] Fix GC bug for `LLM` class (#2882)
WoosukKwon Feb 15, 2024
4f2ad11
Fix DeciLM (#2883)
pcmoritz Feb 15, 2024
5255d99
[ROCm] Dockerfile fix for flash-attention build (#2885)
hongxiayang Feb 15, 2024
64da65b
Prefix Caching- fix t4 triton error (#2517)
caoshiyi Feb 16, 2024
5f08050
Bump up to v0.3.1 (#2887)
WoosukKwon Feb 16, 2024
185b2c2
Defensively copy `sampling_params` (#2881)
njhill Feb 17, 2024
8f36444
multi-LoRA as extra models in OpenAI server (#2775)
jvmncs Feb 17, 2024
786b7f1
Add code-revision config argument for Hugging Face Hub (#2892)
mbm-ai Feb 18, 2024
537c975
[Minor] Small fix to make distributed init logic in worker looks clea…
zhuohan123 Feb 18, 2024
a61f052
[Test] Add basic correctness test (#2908)
zhuohan123 Feb 19, 2024
ab3a5a8
Support OLMo models. (#2832)
Isotr0py Feb 19, 2024
86fd8bb
Add warning to prevent changes to benchmark api server (#2858)
simon-mo Feb 19, 2024
e433c11
Fix `vllm:prompt_tokens_total` metric calculation (#2869)
ronensc Feb 19, 2024
264017a
[ROCm] include gfx908 as supported (#2792)
jamestwhedbee Feb 20, 2024
63e2a64
[FIX] Fix beam search test (#2930)
zhuohan123 Feb 20, 2024
181b27d
Make vLLM logging formatting optional (#2877)
Yard1 Feb 20, 2024
017d9f1
Add metrics to RequestOutput (#2876)
Yard1 Feb 21, 2024
5253eda
Add Gemma model (#2964)
xiangxu-google Feb 21, 2024
c20ecb6
Upgrade transformers to v4.38.0 (#2965)
WoosukKwon Feb 21, 2024
a9c8212
[FIX] Add Gemma model to the doc (#2966)
zhuohan123 Feb 21, 2024
dc903e7
[ROCm] Upgrade transformers to v4.38.0 (#2967)
WoosukKwon Feb 21, 2024
7d2dcce
Support per-request seed (#2514)
njhill Feb 21, 2024
8fbd84b
Bump up version to v0.3.2 (#2968)
zhuohan123 Feb 21, 2024
7c4304b
Add sparsity support based with magic_wand GPU kernels
robertgshaw2-redhat Feb 1, 2024
5344a01
Update README.md
mgoin Feb 2, 2024
81dba47
Semi-structured 2:4 sparsity via SparseSemiStructuredTensor #4
afeldman-nm Feb 2, 2024
cf8eed7
Sparse fused gemm integration (#12)
LucasWilkinson Feb 14, 2024
7527b9c
Abf149/fix semi structured sparse (#16)
afeldman-nm Feb 16, 2024
3c11f56
Enable bfloat16 for sparse_w16a16 (#18)
mgoin Feb 16, 2024
8147811
seed workflow (#19)
andy-neuma Feb 16, 2024
e802bc2
Add bias support for sparse layers (#25)
mgoin Feb 16, 2024
b976653
Use naive decompress for SM<8.0 (#32)
mgoin Feb 21, 2024
78ba5c1
Varun/benchmark workflow (#28)
varun-sundar-rabindranath Feb 21, 2024
fbfd764
initial GHA workflows for "build test" and "remote push" (#27)
andy-neuma Feb 21, 2024
37883e0
Only import magic_wand if sparsity is enabled (#37)
mgoin Feb 21, 2024
acf16bf
manually reverted requirements to match v0.3.2
robertgshaw2-redhat Feb 22, 2024
dbf3cab
Merge branch 'main' into rs/bump-main-to-v0.3.2
robertgshaw2-redhat Feb 22, 2024
0feedf9
reverted requirements
robertgshaw2-redhat Feb 22, 2024
ce8164d
removed duplicate
robertgshaw2-redhat Feb 22, 2024
166c13b
format
robertgshaw2-redhat Feb 22, 2024
1b395b4
added noqa to upstream scripts for linter
robertgshaw2-redhat Feb 22, 2024
8d935be
format
robertgshaw2-redhat Feb 22, 2024
acb8615
Sparsity fix (#40)
robertgshaw2-redhat Feb 22, 2024
4b44479
Rs/marlin downstream v0.3.2 (#43)
robertgshaw2-redhat Feb 22, 2024
9209f15
additional updates to "bump-to-v0.3.2" (#39)
andy-neuma Feb 23, 2024
b1e14c2
move to 4 x gpu
Feb 23, 2024
Files changed
14 changes: 10 additions & 4 deletions .buildkite/run-benchmarks.sh
@@ -6,27 +6,31 @@ set -o pipefail
# cd into parent directory of this file
cd "$(dirname "${BASH_SOURCE[0]}")/.."

(wget && curl) || (apt-get update && apt-get install -y wget curl)
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)

# run benchmarks and upload the result to buildkite
# run python-based benchmarks and upload the result to buildkite
python3 benchmarks/benchmark_latency.py 2>&1 | tee benchmark_latency.txt
bench_latency_exit_code=$?

python3 benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 2>&1 | tee benchmark_throughput.txt
bench_throughput_exit_code=$?

# run server-based benchmarks and upload the result to buildkite
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf &
server_pid=$!
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# wait for server to start, timeout after 600 seconds
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend openai \
--dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
--model meta-llama/Llama-2-7b-chat-hf \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer meta-llama/Llama-2-7b-chat-hf 2>&1 | tee benchmark_serving.txt
--tokenizer meta-llama/Llama-2-7b-chat-hf \
--save-result \
2>&1 | tee benchmark_serving.txt
bench_serving_exit_code=$?
kill $server_pid

@@ -44,7 +48,7 @@ sed -n '$p' benchmark_throughput.txt >> benchmark_results.md # last line
echo "### Serving Benchmarks" >> benchmark_results.md
sed -n '1p' benchmark_serving.txt >> benchmark_results.md # first line
echo "" >> benchmark_results.md
tail -n 5 benchmark_serving.txt >> benchmark_results.md # last 5 lines
tail -n 13 benchmark_serving.txt >> benchmark_results.md # last 13 lines

# upload the results to buildkite
/workspace/buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md
@@ -61,3 +65,5 @@ fi
if [ $bench_serving_exit_code -ne 0 ]; then
exit $bench_serving_exit_code
fi

/workspace/buildkite-agent artifact upload openai-*.json
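
Note: benchmark_serving.py is now run with --save-result, and the final line above uploads the resulting openai-*.json file as a Buildkite artifact. A later pipeline step (not part of this diff; shown only as a hypothetical sketch) could pull that artifact back down for inspection:

# Hypothetical follow-up step -- not in this PR. Downloads the serving-benchmark
# JSON uploaded by run-benchmarks.sh; the glob matches the upload pattern above.
buildkite-agent artifact download "openai-*.json" .
cat openai-*.json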
19 changes: 17 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -11,8 +11,16 @@ steps:
- label: AsyncEngine Test
command: pytest -v -s async_engine

- label: Distributed Test
command: pytest -v -s test_comm_ops.py
- label: Basic Correctness Test
command: pytest -v -s --forked basic_correctness

- label: Distributed Comm Ops Test
command: pytest -v -s --forked test_comm_ops.py
working_dir: "/vllm-workspace/tests/distributed"
num_gpus: 2 # only support 1 or 2 for now.

- label: Distributed Correctness Test
command: pytest -v -s --forked test_basic_distributed_correctness.py
working_dir: "/vllm-workspace/tests/distributed"
num_gpus: 2 # only support 1 or 2 for now.

@@ -49,3 +57,10 @@ steps:
commands:
- pip install aiohttp
- bash run-benchmarks.sh

- label: Documentation Build
working_dir: "/vllm-workspace/docs"
no_gpu: True
commands:
- pip install -r requirements-docs.txt
- SPHINXOPTS=\"-W\" make html
6 changes: 4 additions & 2 deletions .buildkite/test-template.j2
@@ -5,7 +5,7 @@
steps:
- label: ":docker: build image"
commands:
- "docker build --tag {{ docker_image }} --target test --progress plain ."
- "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
- "docker push {{ docker_image }}"
env:
DOCKER_BUILDKIT: "1"
@@ -35,13 +35,15 @@ steps:
- image: "{{ docker_image }}"
command: ["bash"]
args:
- "-c"
- '-c'
- "'cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}'"
{% if not step.no_gpu %}
resources:
requests:
nvidia.com/gpu: "{{ step.num_gpus or default_num_gpu }}"
limits:
nvidia.com/gpu: "{{ step.num_gpus or default_num_gpu }}"
{% endif %}
env:
- name: HF_TOKEN
valueFrom:
2 changes: 0 additions & 2 deletions .github/actions/nm-build-vllm/action.yml
@@ -19,8 +19,6 @@ runs:
steps:
- id: build
run: |
# TODO: this is a hack ... fix it later
# pyenv hardcoded ... python version hardcoded ...
COMMIT=${{ github.sha }}
VENV="${{ inputs.venv }}-${COMMIT:0:7}"
source $(pyenv root)/versions/${{ inputs.python }}/envs/${VENV}/bin/activate
13 changes: 9 additions & 4 deletions .github/actions/nm-set-env/action.yml
@@ -1,21 +1,26 @@
name: set neuralmagic env
description: 'sets environment variables for neuralmagic'
inputs:
hf_home:
hf_token:
description: 'Hugging Face home'
required: true
Gi_per_thread:
description: 'requested GiB to reserve per thread'
required: true
runs:
using: composite
steps:
- run: |
echo "HF_HOME=${HF_HOME_TOKEN}" >> $GITHUB_ENV
echo "TORCH_CUDA_ARCH_LIST=8.0+PTX" >> $GITHUB_ENV
echo "HF_TOKEN=${HF_TOKEN_SECRET}" >> $GITHUB_ENV
NUM_THREADS=$(./.github/scripts/determine-threading -G ${{ inputs.Gi_per_thread }})
echo "MAX_JOBS=${NUM_THREADS}" >> $GITHUB_ENV
echo "VLLM_INSTALL_PUNICA_KERNELS=1" >> $GITHUB_ENV
echo "PYENV_ROOT=/usr/local/apps/pyenv" >> $GITHUB_ENV
echo "XDG_CONFIG_HOME=/usr/local/apps" >> $GITHUB_ENV
WHOAMI=$(whoami)
echo "PATH=/usr/local/apps/pyenv/plugins/pyenv-virtualenv/shims:/usr/local/apps/pyenv/shims:/usr/local/apps/pyenv/bin:/usr/local/apps/nvm/versions/node/v16.20.2/bin:/usr/local/cuda-12.1/bin:/usr/local/cuda-12.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/${WHOAMI}/.local/bin:" >> $GITHUB_ENV
echo "LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64::/usr/local/cuda-12.1/lib64:" >> $GITHUB_ENV
echo "PROJECT_ID=12" >> $GITHUB_ENV
env:
HF_HOME_TOKEN: ${{ inputs.hf_home }}
HF_TOKEN_SECRET: ${{ inputs.hf_token }}
shell: bash
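
Note: the new Gi_per_thread input is passed to ./.github/scripts/determine-threading, whose output sets MAX_JOBS so that parallel compile jobs do not exhaust memory (the same concern build.sh addresses with MAX_JOBS=1). That helper script is not included in this diff; the following is only a rough, hypothetical sketch of the kind of calculation it might perform:

# Hypothetical sketch -- the real determine-threading script is not shown in this PR.
# Assumption: it divides available memory (GiB) by the requested GiB per thread
# and caps the result at the number of CPUs, never going below 1.
GI_PER_THREAD=4   # value passed via -G
AVAIL_GI=$(awk '/MemAvailable/ {print int($2/1024/1024)}' /proc/meminfo)
THREADS=$(( AVAIL_GI / GI_PER_THREAD ))
CPUS=$(nproc)
if [ "${THREADS}" -gt "${CPUS}" ]; then THREADS=${CPUS}; fi
if [ "${THREADS}" -lt 1 ]; then THREADS=1; fi
echo "${THREADS}"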
12 changes: 6 additions & 6 deletions .github/actions/nm-test-vllm/action.yml
@@ -4,8 +4,8 @@ inputs:
test_directory:
description: 'test directory, path is relative to neuralmagic-vllm'
required: true
test_xml:
description: 'filename for xml test results'
test_results:
description: 'top-level directory for xml test results'
required: true
python:
description: 'python version, e.g. 3.10.12'
@@ -22,15 +22,15 @@ runs:
steps:
- id: test
run: |
SUCCESS=0
# TODO: this is a hack ... fix it later
# pyenv hardcoded ... python version hardcoded ...
COMMIT=${{ github.sha }}
VENV="${{ inputs.venv }}-${COMMIT:0:7}"
source $(pyenv root)/versions/${{ inputs.python }}/envs/${VENV}/bin/activate
pip3 install --index-url http://192.168.201.226:8080/ --trusted-host 192.168.201.226 magic-wand
pip3 install -r requirements-dev.txt
pytest --junitxml=${{ inputs.test_xml }} ${{ inputs.test_directory }} || SUCCESS=$?
# run tests via runner script (serially)
SUCCESS=0
./.github/scripts/run-tests -t ${{ inputs.test_directory }} -r ${{ inputs.test_results }} || SUCCESS=$?
echo "was this a SUCCESS? ${SUCCESS}"
echo "status=${SUCCESS}" >> "$GITHUB_OUTPUT"
exit ${SUCCESS}
shell: bash
6 changes: 6 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,6 @@
SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

66 changes: 66 additions & 0 deletions .github/scripts/run-tests
@@ -0,0 +1,66 @@
#!/bin/bash -e

# simple helper script to manage concurrency while running tests

usage() {
echo "Usage: ${0} <options>"
echo
echo " -t - test directory, i.e. location of *.py test files. (default 'tests/')"
echo " -r - desired results base directory. xml results will mirror provided tests directory structure. (default 'test-results/')"
echo " -h - this list of options"
echo
echo "note: all paths are relative to 'neuralmagic-vllm' root"
echo
exit 1
}

TEST_DIR=tests
RESULTS_DIR=test-results

while getopts "ht:r:" OPT; do
case "${OPT}" in
h)
usage
;;
t)
TEST_DIR="${OPTARG}"
;;
r)
RESULTS_DIR="${OPTARG}"
;;
esac
done

# check if variables are valid
if [ -z "${RESULTS_DIR}" ]; then
echo "please set desired results base directory"
usage
fi

if [ -z "${TEST_DIR}" ]; then
echo "please set test directory"
usage
fi

if [ ! -d "${TEST_DIR}" ]; then
echo "specified test directory, '${TEST_DIR}' does not exist ..."
usage
fi

# run tests serially
TESTS_DOT_PY=$(find ${TEST_DIR} -not -name "__init__.py" -name "*.py")
TESTS_TO_RUN=($TESTS_DOT_PY)
SUCCESS=0
for TEST in "${TESTS_TO_RUN[@]}"
do
LOCAL_SUCCESS=0
RESULT_XML=$(echo ${TEST} | sed -e "s/${TEST_DIR}/${RESULTS_DIR}/" | sed -e "s/.py/.xml/")
pytest --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
SUCCESS=$((SUCCESS + LOCAL_SUCCESS))
done

if [ "${SUCCESS}" -eq "0" ]; then
exit 0
else
exit 1
fi
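
This runner script is what the updated nm-test-vllm action invokes. A minimal local invocation from the repository root, assuming the default tests/ and test-results/ layout described in the usage text, would be:

# Run the test files serially, mirroring the call in .github/actions/nm-test-vllm/action.yml.
./.github/scripts/run-tests -t tests -r test-results
# Each tests/<subdir>/test_foo.py produces a matching test-results/<subdir>/test_foo.xml report.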
24 changes: 17 additions & 7 deletions .github/workflows/build-test.yml
@@ -15,6 +15,10 @@ on:
description: "git commit hash or branch name"
type: string
required: true
Gi_per_thread:
description: 'requested GiB to reserve per thread'
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -35,6 +39,10 @@ on:
description: "git commit hash or branch name"
type: string
required: true
Gi_per_thread:
description: 'requested GiB to reserve per thread'
type: string
required: true
python:
description: "python version, e.g. 3.10.12"
type: string
@@ -61,7 +69,8 @@ jobs:
id: setenv
uses: ./.github/actions/nm-set-env/
with:
hf_home: ${{ secrets.NM_HF_HOME }}
hf_token: ${{ secrets.NM_HF_TOKEN }}
Gi_per_thread: ${{ inputs.Gi_per_thread }}

- name: set python
id: set_python
@@ -88,7 +97,7 @@ jobs:
id: build
uses: ./.github/actions/nm-build-vllm/
with:
Gi_per_thread: 1
Gi_per_thread: ${{ inputs.Gi_per_thread }}
python: ${{ inputs.python }}
venv: TEST

@@ -97,7 +106,7 @@ jobs:
uses: ./.github/actions/nm-test-vllm/
with:
test_directory: tests
test_xml: test-results/all_tests.xml
test_results: test-results
python: ${{ inputs.python }}
venv: TEST

@@ -134,12 +143,13 @@ jobs:
TEST_STATUS: ${{ steps.test.outputs.status }}
run: |
echo "checkout status: ${CHECKOUT}"
if [[ "${CHECKOUT}" != *"success"* ]]; then exit 1; fi
if [ ${LINT_STATUS} -ne 0 ]; then exit 1; fi
if [ ${BUILD_STATUS} -ne 0 ]; then exit 1; fi
echo "lint status: ${LINT_STATUS}"
echo "build status: ${BUILD_STATUS}"
if [ ${TEST_STATUS} -ne 0 ]; then exit 1; fi
echo "test status: ${TEST_STATUS}"
if [[ "${CHECKOUT}" != *"success"* ]]; then exit 1; fi
if [ -z "${LINT_STATUS}" ] || [ "${LINT_STATUS}" -ne "0" ]; then exit 1; fi
if [ -z "${BUILD_STATUS}" ] || [ "${BUILD_STATUS}" -ne "0" ]; then exit 1; fi
if [ -z "${TEST_STATUS}" ] || [ "${TEST_STATUS}" -ne "0" ]; then exit 1; fi

- name: complete testmo run
uses: ./.github/actions/nm-testmo-run-complete/
9 changes: 4 additions & 5 deletions .github/workflows/remote-push.yml
@@ -13,18 +13,17 @@ jobs:

# TODO: expand python matrix later, once CI system has
# matured.
# TODO: adjust timeout after we get a bit more experience.
# making it 60 is a bit permissive.

# TODO: enable this later
AWS-AVX2-32G-A10G-24G:
AWS-AVX2-192G-4-A10G-96G:
strategy:
matrix:
python: [3.10.12]
uses: ./.github/workflows/build-test.yml
with:
label: aws-avx2-32G-a10g-24G
timeout: 60
label: aws-avx2-192G-4-a10g-96G
timeout: 180
gitref: '${{ github.ref }}'
Gi_per_thread: 4
python: ${{ matrix.python }}
secrets: inherit
2 changes: 2 additions & 0 deletions .github/workflows/scripts/build.sh
@@ -13,6 +13,8 @@ $python_executable -m pip install -r requirements.txt

# Limit the number of parallel jobs to avoid OOM
export MAX_JOBS=1
# Make sure punica is built for the release (for LoRA)
export VLLM_INSTALL_PUNICA_KERNELS=1

# Build
$python_executable setup.py bdist_wheel --dist-dir=dist
14 changes: 12 additions & 2 deletions Dockerfile
@@ -7,6 +7,12 @@ FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev
RUN apt-get update -y \
&& apt-get install -y python3-pip git

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-12.1/compat/

WORKDIR /workspace

# install build and runtime dependencies
@@ -45,6 +51,8 @@ ENV MAX_JOBS=${max_jobs}
# number of threads used by nvcc
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
# make sure punica kernels are built (for LoRA)
ENV VLLM_INSTALL_PUNICA_KERNELS=1

RUN python3 setup.py build_ext --inplace
#################### EXTENSION Build IMAGE ####################
@@ -67,8 +75,10 @@ RUN --mount=type=cache,target=/root/.cache/pip VLLM_USE_PRECOMPILED=1 pip instal


#################### RUNTIME BASE IMAGE ####################
# use CUDA base as CUDA runtime dependencies are already installed via pip
FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS vllm-base
# We used base cuda image because pytorch installs its own cuda libraries.
# However cupy depends on cuda libraries so we had to switch to the runtime image
# In the future it would be nice to get a container with pytorch and cuda without duplicating cuda
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS vllm-base

# libnccl required for ray
RUN apt-get update -y \