[V1][PoC] Refactor EngineCoreOutputs #12853
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks @bnellnm, I added a few more comments...
vllm/v1/core/scheduler.py (outdated)
output.request_ids.append(req_id)
output.new_token_id_offsets.append(offset)
new_ids = request.output_token_ids[-num_new_tokens:]
output.new_token_ids += new_ids
It would be for another iteration, but I'm thinking here we may want to do this outside of the loop: keep the new token ids as a tensor and either send them as-is and do additional filtering in the front-end process, or do the filtering via tensor slicing/index_select type operations (a rough sketch follows after this list).
This would have the benefits of:
- Eliminating intermediate object creation, which scales with the batch size
- Eliminating serialization overhead - I think if we can keep as much as possible of EngineCoreOutputs in tensor/numpy form, we can transmit the backing buffer(s) as-is ... zmq can read directly from these, and later we could also see if we can use shm
- Moving some work to the front-end process, which we can more easily scale out
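A rough sketch of what the front-end-side slicing could look like (slice_new_tokens and its arguments are hypothetical names, not anything in this PR; it assumes the core ships one flat tensor of new token ids plus per-request counts):

import torch

def slice_new_tokens(new_token_ids: torch.Tensor,
                     new_token_counts: torch.Tensor) -> list:
    """Split one flat tensor of new token ids into per-request views."""
    # Exclusive prefix sum of the counts gives each request's start offset.
    offsets = torch.cumsum(new_token_counts, dim=0)
    starts = torch.cat([offsets.new_zeros(1), offsets[:-1]])
    # narrow() returns views, so no per-request copies are made here.
    return [new_token_ids.narrow(0, int(s), int(n))
            for s, n in zip(starts.tolist(), new_token_counts.tolist())]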
Thanks @bnellnm, I made some more comments inline
if num_requests <= VLLM_V1_OUTPUT_PROC_CHUNK_SIZE:
    num_chunks = 1
    chunk_size = num_requests
    rem = 0
else:
Could we just keep the `else` logic here?
I got a div-by-zero when I tried just the `else` code path, so I left both branches.
Hmm, that should only be possible if `outputs.request_ids` is empty ... I don't remember if that should ever happen, but if it does we would just skip the loop anyhow (unless we need to still update the iteration stats in this case).
for i, req_id in enumerate(
        engine_core_outputs.request_ids[first:last]):
Suggested change:
for req_idx in range(first, last):
    req_id = engine_core_outputs.request_ids[req_idx]
self._num_tokens = len(self.all_token_ids)
self._num_output_tokens = len(self.output_token_ids)
why the changes in this file?
I was doing some profiling and it showed `num_tokens` and `num_output_tokens` being fairly expensive/called often, so I figured I would cache the lengths.
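A minimal standalone sketch of that caching pattern (not the actual Request class; the class and method names here are illustrative): the cached lengths are bumped where tokens are appended, so hot-path callers read an int instead of recomputing len() on every access.

class CachedCounts:
    def __init__(self, prompt_token_ids: list):
        self.prompt_token_ids = prompt_token_ids
        self.output_token_ids: list = []
        self._num_tokens = len(prompt_token_ids)
        self._num_output_tokens = 0

    def append_output_token_ids(self, token_ids: list) -> None:
        # Keep the cached lengths in sync with the underlying lists.
        self.output_token_ids.extend(token_ids)
        self._num_output_tokens += len(token_ids)
        self._num_tokens += len(token_ids)

    @property
    def num_tokens(self) -> int:
        return self._num_tokens

    @property
    def num_output_tokens(self) -> int:
        return self._num_output_tokens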
@@ -34,13 +37,29 @@ def decode(self, obj: Any):
        return self.decoder.decode(obj)


class NumpySerializedRepresentation(msgspec.Struct, gc=False, array_like=True):
I am working on this part in this branch (hopefully it's in a pretty much complete/working state now): https://github.com/njhill/vllm/tree/tensor-nocopy
The idea is for the encoder to collect a list of the buffer references and return it at the end, rather than serializing any of them directly.
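A heavily simplified sketch of that idea (not the code in the linked branch; NumpyPlaceholder and BufferCollectingEncoder are made-up names): the enc_hook swaps each ndarray for a small placeholder and stashes the backing buffer, so the buffers can go out as separate zero-copy zmq frames rather than being serialized inline.

import msgspec
import numpy as np


class NumpyPlaceholder(msgspec.Struct, gc=False, array_like=True):
    dtype: str
    shape: tuple
    index: int  # position of the raw buffer in the out-of-band frame list


class BufferCollectingEncoder:

    def __init__(self):
        self._encoder = msgspec.msgpack.Encoder(enc_hook=self._enc_hook)
        self._buffers: list = []

    def _enc_hook(self, obj):
        if isinstance(obj, np.ndarray):
            # Assumes C-contiguous arrays; .data is a zero-copy memoryview.
            self._buffers.append(obj.data)
            return NumpyPlaceholder(str(obj.dtype), obj.shape,
                                    len(self._buffers) - 1)
        raise NotImplementedError(f"cannot encode {type(obj)}")

    def encode(self, obj) -> list:
        self._buffers = []
        main = self._encoder.encode(obj)
        # First frame is the msgpack payload, the rest are raw array buffers;
        # a zmq socket can send these as one multipart message.
        return [main, *self._buffers]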
num_samples = sampled_token_ids.shape[0]
max_gen_len = sampled_token_ids.shape[1]
Suggested change:
num_samples, max_gen_len = sampled_token_ids.shape
for i, sampled_ids in enumerate(sampled_token_ids):
#draft_token_ids: np.ndarray = np.full((num_samples, max_gen_len), INVALID_TOKEN_ID, dtype=int)

valid_mask = sampled_token_ids != INVALID_TOKEN_ID
Probably we still want to do the mask and sum on GPU and then that list of lengths could be passed in:
valid_mask = sampled_token_ids != INVALID_TOKEN_ID
gen_lens = valid_mask.sum(dim=1).tolist()
This line below you can change to read directly from the nparray:
self.input_batch.token_ids_cpu[i, start_idx:end_idx] = sampled_ids
i.e.
self.input_batch.token_ids_cpu[i, start_idx:end_idx] = sampled_token_ids[i, :gen_lens[i]]
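Pulling those pieces together as one hedged sketch (a standalone function with assumed arguments, not the runner's real code; it assumes valid tokens form a prefix of each row and that INVALID_TOKEN_ID is the sentinel for the rest):

import numpy as np
import torch

INVALID_TOKEN_ID = -1  # assumed sentinel for rejected/padded positions


def write_sampled_tokens(sampled_token_ids: torch.Tensor,
                         token_ids_cpu: np.ndarray,
                         start_indices: list) -> list:
    # Mask and per-row sum happen on the GPU; only the lengths cross over.
    valid_mask = sampled_token_ids != INVALID_TOKEN_ID
    gen_lens = valid_mask.sum(dim=1).tolist()
    sampled_np = sampled_token_ids.cpu().numpy()
    for i, num_gen in enumerate(gen_lens):
        start = start_indices[i]
        # Copy each row's valid prefix directly from the ndarray.
        token_ids_cpu[i, start:start + num_gen] = sampled_np[i, :num_gen]
    return gen_lens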
@@ -1042,8 +1058,11 @@ def generate_draft_token_ids(
            if drafter_output is None or len(drafter_output) == 0:
                draft_token_ids.append([])
            else:
                #assert len(drafter_output) <= max_gen_len
                #draft_token_ids[i] = drafter_output
                draft_token_ids.append(drafter_output.tolist())
We could keep this as a list of ndarrays. Later we could even update the drafter to take an ndarray as input and have it write the tokens into it, so we can build a single 1-dim array with offsets.
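A sketch of the flat-array variant (pack_draft_tokens is a hypothetical helper, not the drafter's real interface): draft tokens live in one 1-D ndarray and each request is addressed by an (offset, count) pair.

import numpy as np


def pack_draft_tokens(drafts: list):
    """Pack a list of per-request draft-token ndarrays into one flat array."""
    counts = np.array([len(d) for d in drafts], dtype=np.int32)
    offsets = np.concatenate(([0], np.cumsum(counts[:-1]))).astype(np.int32)
    flat = np.empty(int(counts.sum()), dtype=np.int32)
    for off, d in zip(offsets, drafts):
        # A drafter that accepted an output buffer could write here directly.
        flat[off:off + len(d)] = d
    return flat, offsets, counts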
new_token_id_offsets: List[int] = []
new_token_id_counts: Optional[List[int]] = None  # ndarray?
Yes keep as array ... and we don't need both offsets and counts right?
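Toy illustration of why only one of the two needs to be carried: the offsets are just an exclusive prefix sum of the counts (and the counts are the diff of the offsets plus the total).

import numpy as np

new_token_id_counts = np.array([2, 1, 3], dtype=np.int32)
new_token_id_offsets = np.concatenate(([0], np.cumsum(new_token_id_counts)[:-1]))
# new_token_id_offsets -> array([0, 2, 3])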
new_token_ids: np.ndarray = np.empty(0, dtype=int)  # Optional?

# req_id -> LogprobsLists
new_logprobs: Dict[str, LogprobsLists] = {}
We should change these to LogprobsTensors too.
Fold EngineCoreOutput fields directly into EngineCoreOutputs so that we don't need to create so many small objects in the scheduler.
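Roughly, the flattened shape being discussed might look like this (a dataclass sketch with illustrative field names; the real struct and its exact fields differ):

from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class EngineCoreOutputsSketch:
    # One entry per request scheduled in this step.
    request_ids: List[str] = field(default_factory=list)
    # All newly sampled token ids in one flat array; the parallel counts say
    # how many belong to each request (offsets can be derived by cumsum).
    new_token_ids: np.ndarray = field(
        default_factory=lambda: np.empty(0, dtype=np.int32))
    new_token_id_counts: List[int] = field(default_factory=list)
    # Sparse per-request extras stay keyed by request id
    # (LogprobsLists/LogprobsTensors in the real code).
    new_logprobs: Dict[str, object] = field(default_factory=dict)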