Fix after #214 #215
Conversation
Appreciate you spotting this important issue and the quick fix!
It would also be nice to record the before/after performance difference for future reference. I think this is an important data point.
vocab_size = request.sampling_params.vocab_size
bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)
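For context, in samplers this allocation is typically followed by a scatter-add that counts how many times each token id occurs, and the counts then feed repetition/frequency penalties. Below is a minimal, self-contained sketch of that pattern; the concrete values and the `scatter_add_` follow-up are assumptions for illustration, not the code in this PR.

```python
import torch

# Hypothetical inputs; token ids built from a Python list default to CPU.
vocab_size = 32000                                   # assumed vocab size
tokens = torch.tensor([1, 5, 5, 7, 31999], dtype=torch.long)

# One extra bin so padding/out-of-range ids can be discarded into it.
bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)

# Count occurrences of each token id; this runs on whatever device
# `tokens` lives on, which is the point discussed below.
bin_counts.scatter_add_(0, tokens, torch.ones_like(tokens))

counts = bin_counts[:vocab_size]                     # drop the overflow bin
print(counts[5])                                     # tensor(2): id 5 occurs twice
```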
Can we clarify where this computation runs? As a follow-up, I want to document where each operation runs and async/parallelize further where we can.
I do not have a good design yet, but I plan to think about it. This is only a quick fix. I do not think this is the only place that should be fixed; we need to rethink the sampler design (see also the comment below).
Oh, I was not suggesting we should do it now, haha. I just wanted to leave a comment noting where this computation runs. device=tokens.device is CPU, I assume?
I think yes, it is the device of prompt_token_ids, and I believe those are on CPU. A comment has been added.
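For reference, a tensor created from a plain Python list of prompt token ids defaults to CPU unless a device is passed explicitly, which is consistent with the comment above. A quick illustrative check (the variable names are assumptions):

```python
import torch

prompt_token_ids = [1, 15043, 29991]                 # hypothetical prompt ids
tokens = torch.tensor(prompt_token_ids, dtype=torch.long)
print(tokens.device)                                 # cpu
```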
The branch was force-pushed from c6503f2 to ae5f2a8.
@elvin-n could you show the results from before the big sampler refactor as a reference data point? I added the latest ones.
I added it to the description of the PR. I should mention that our target is 19000-20000 tokens per second for Mistral-7b.
Thank you guys for spotting this issue and proactively addressing it!
After this fix I decided to double check why the performance regression had been fixed. Of course, I had some ideas that I relied on during the fixing, but there were also doubts that were eventually justified. Using
In all cases I tested processing with a batch (default length = 10). But my measurements showed that it depends linearly on the batch size, so the results are averaged per single request at point 1 (see the sketch below).
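To make the per-request normalization explicit, here is the arithmetic implied above, as a sketch under the stated assumption that batch time scales roughly linearly with batch size (the numbers are made up for illustration):

```python
# Hypothetical measurement for a whole batch.
batch_size = 10
batch_time_s = 1.54            # wall-clock time for the batch
total_tokens = 3876            # tokens produced across all requests

# Normalize to a single request; valid only if scaling is ~linear.
requests_per_s = batch_size / batch_time_s
tokens_per_s = total_tokens / batch_time_s
time_per_request_s = batch_time_s / batch_size

print(f"{requests_per_s:.2f} requests/s, {tokens_per_s:.2f} tokens/s, "
      f"{time_per_request_s * 1000:.1f} ms/request")
```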
After merging #214 a severe performance regression was observed. This PR fixes it; the throughput benchmark shows no performance regression after the fix.
Results of the throughput benchmark (Mistral-7b, input 10):
5e36d1b (last but one commit): Engine Throughput: 6.48 requests/s, 2516.60 tokens/s
7495bd0 (last commit): Engine Throughput: 0.14 requests/s, 53.75 tokens/s
with fix: Engine Throughput: 5.62 requests/s, 2181.77 tokens/s

Results of the throughput benchmark (Mistral-7b, input 1000):
5e36d1b (last but one commit): Engine Throughput: 34.39 requests/s, 13157.05 tokens/s
7495bd0 (last commit): takes too long to measure
with fix: Engine Throughput: 34.48 requests/s, 13193.23 tokens/s