Add support for OpenAI API parallel sampling #640

yichuan520030910320 · 2024-07-17T17:27:13Z

Add support for OpenAI API parallel sampling:

Support request.n > 1 in the OpenAI API: first send one prefill request to increase the cache hit rate, then asynchronously send n decoding requests in parallel.
Not supported when m prompts are passed as a List in the OpenAI API.
Not supported: request.n > 1 while streaming.
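Below is a minimal, self-contained sketch of the prefill-then-parallel-decode flow described above, using asyncio. `FakeEngine`, `generate`, and `parallel_sample` are hypothetical stand-ins, not sglang's actual tokenizer-manager API. The engine only records whether each request would hit a cached prompt. Treating a `max_new_tokens=0` request as "prefill only" is also an assumption made for illustration.

```python
import asyncio
from typing import List, Set


class FakeEngine:
    """Hypothetical stand-in for the generation backend (not sglang's API)."""

    def __init__(self) -> None:
        self.prefix_cache: Set[str] = set()
        self.calls: List[str] = []

    async def generate(self, prompt: str, max_new_tokens: int, seed: int = 0) -> str:
        # Record whether this request would hit a cached prompt, then cache it.
        hit = prompt in self.prefix_cache
        self.prefix_cache.add(prompt)
        self.calls.append(f"{'hit' if hit else 'miss'}:{max_new_tokens}")
        await asyncio.sleep(0)  # yield control, as a real network/GPU call would
        # Assumption: max_new_tokens=0 means "prefill only, return no text".
        return "" if max_new_tokens == 0 else f"sample-{seed}"


async def parallel_sample(engine: FakeEngine, prompt: str, n: int,
                          max_new_tokens: int, stream: bool = False) -> List[str]:
    if stream and n > 1:
        # Mirrors the restriction stated in the PR description.
        raise ValueError("n > 1 is not supported with streaming")
    if n > 1:
        # Step 1: a single prefill-only request warms the cache for the shared prompt.
        await engine.generate(prompt, max_new_tokens=0)
    # Step 2: n decoding requests run concurrently and reuse the cached prompt.
    return list(await asyncio.gather(
        *(engine.generate(prompt, max_new_tokens, seed=i) for i in range(n))
    ))


if __name__ == "__main__":
    engine = FakeEngine()
    outputs = asyncio.run(parallel_sample(engine, "Once upon a time", n=3, max_new_tokens=32))
    print(outputs)       # ['sample-0', 'sample-1', 'sample-2']
    print(engine.calls)  # ['miss:0', 'hit:32', 'hit:32', 'hit:32']
```

The prefill request is awaited before the n decoding requests are launched. As a result, every decoding request finds the prompt already cached, rather than each of them prefilling the same prompt independently.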

Example code:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.8,
    max_tokens=64,
    logprobs=True,
    n=3,
)
print(response)
```

Running `python sglang/examples/usage/openai_parallel_sample.py` produces the following output:

Completion(id='6782cf18dd97421bbb72462eb404d93a', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Bob. Bob was designed to learn and grow, but he was stuck in a rut. He kept repeating the same tasks over and over again,'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' robot named Robby who lived in a big factory with lots of other robots. Robby was very curious and wanted to learn new things, but the other'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=2, logprobs=None, text=' robot named Robby. Robby was different from the other robots, for he had a big dream: to learn how to think and learn like a human')], created=1721238118, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=96, prompt_tokens=30, total_tokens=126))
ChatCompletion(id='4e589636d6bb4cae9ce307f4e0be4203', choices=[Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=0, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_LENGTH: 64', index=1, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília\n\nI hope this helps! Let me know if you need anything else.', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, message=ChatCompletionMessage(content="  Of course, I'd be happy to help! Here are three countries and their capitals:\n\n1. Italy - Rome\n2. Japan - Tokyo\n3. Mexico - Mexico City", role='assistant', function_call=None, tool_calls=None))], created=1721238120, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=155, prompt_tokens=38, total_tokens=193))
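In both responses, the `usage` fields are consistent with counting the shared prompt once and summing completion tokens over all n choices. For the text completion this gives 30 + 3 × 32 = 126, and for the chat completion 38 + 155 = 193. The sketch below spells out that arithmetic. The `Usage` class and the `aggregate_usage` helper are hypothetical and are not part of sglang or the openai client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


def aggregate_usage(prompt_tokens: int, completion_lengths: List[int]) -> Usage:
    # The shared prompt is counted once; completion tokens are summed over all n choices.
    completion = sum(completion_lengths)
    return Usage(prompt_tokens, completion, prompt_tokens + completion)


# Numbers from the text-completion output above: a 30-token prompt and n=3 choices,
# each stopped at max_tokens=32 (finish_reason 'FINISH_LENGTH: 32').
print(aggregate_usage(30, [32, 32, 32]))
# Usage(prompt_tokens=30, completion_tokens=96, total_tokens=126)
```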

cc @Ying1123 @merrymercy @hnyls2002, and thanks to @hnyls2002 for the guidance.


yichuan520030910320 · 2024-07-19T09:15:03Z

It now also supports m prompts passed as a List in the OpenAI API, and each of these m prompts can use parallel sampling.

When we run the following code:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=1,
    temperature=0.8,
    max_tokens=32,
)
print(response)


# Text completion
response = client.completions.create(
    model="default",
    prompt="I am a robot and I want to study like humans. Now let's tell a story. Once upon a time, there was a little",
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)


# Text completion
response = client.completions.create(
    model="default",
    prompt=["The name of the famous soccer player is ", "The capital of US is"],
    n=1,
    temperature=0.8,
    max_tokens=32,
)
print(response)


# Text completion
response = client.completions.create(
    model="default",
    prompt=["The name of the famous soccer player is ", "The capital of US is"],
    n=3,
    temperature=0.8,
    max_tokens=32,
)
print(response)


# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.8,
    max_tokens=64,
    logprobs=True,
    n=4,
)
print(response)
```

the output is:

Completion(id='0e8758bc5e964ffc893c78ec4a805484', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Bob. Bob was different from the other robots because he was curious about the world beyond his factory floor. He wanted to learn how to think like')], created=1721380457, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=30, total_tokens=62))
Completion(id='1ca8cb8cd4b74613858935311ecdbefc', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' robot named Zeta. Zeta lived in a big factory where she helped make other robots. But Zeta had big dreams. She wanted to learn'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' robot named Zeta. Zeta loved to learn and play with his robot friends, but he wanted to learn more. He wanted to learn like humans! So'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=2, logprobs=None, text=' robot named Robby. Robby lived in a big factory where he did lots of work, but he was not happy. He wanted to learn more and be')], created=1721380458, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=96, prompt_tokens=30, total_tokens=126))
Completion(id='b6f62139cb8148d6ac1868a932923790', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' Medi éllo Gór Mahrez. It is a good name for a soccer player because it is unique and easy to pronounce. It is also'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' Washington DC, which stands for District of Columbia. It is located on the East Coast of the United States and is home to many national landmarks and institutions,')], created=1721380459, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=11, total_tokens=75))
Completion(id='a4b18b086b1b4c9f8a6a59317444a53c', choices=[CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=0, logprobs=None, text=' \n\nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States of America is Washington, D.C'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=1, logprobs=None, text=' \nDirections: Read the text and answer the questions that follow.\n\nThe name of the famous soccer player is David Beckham. He'), CompletionChoice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, text='\n\nAnswer:\nThe famous soccer player is Lionel Messi, and the capital of the United States is Washington D.C.'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=3, logprobs=None, text=' __________ \nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States is Washington D.C.'), CompletionChoice(finish_reason='FINISH_LENGTH: 32', index=4, logprobs=None, text='  Washington DC\nSoccer players from around the world travel to Brazil to compete in the FIFA World Cup, the most prestigious international soccer tournament'), CompletionChoice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=5, logprobs=None, text=' \n\nAnswer:\nThe famous soccer player is Lionel Messi.\nThe capital of the United States is Washington D.C.')], created=1721380460, model='default', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=189, prompt_tokens=17, total_tokens=206))
ChatCompletion(id='06ca8f81157c4f1887be08129621f84e', choices=[Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=0, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: Brazil\nCapital: Brasília\n2. Country: China\nCapital: Beijing\n3. Country: Germany\nCapital: Berlin', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=1, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: Japan\nCapital: Tokyo\n3. Country: Brazil\nCapital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_MATCHED_TOKEN: 2', index=2, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: Japan - Capital: Tokyo\n2. Country: France - Capital: Paris\n3. Country: Brazil - Capital: Brasília', role='assistant', function_call=None, tool_calls=None)), Choice(finish_reason='FINISH_LENGTH: 64', index=3, logprobs=None, message=ChatCompletionMessage(content='  Of course! Here are three countries and their capitals:\n\n1. Country: France\nCapital: Paris\n2. Country: China\nCapital: Beijing\n3. Country: Brazil\nCapital: Brasília\n\nI hope that helps! Let me know if you need more', role='assistant', function_call=None, tool_calls=None))], created=1721380462, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=210, prompt_tokens=38, total_tokens=248))
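In the fourth call above, two prompts with n=3 are returned as a single list of six choices (index 0 to 5). The sketch below shows one way to flatten m × n results into that list. It assumes prompt-major ordering, meaning all n samples for the first prompt come first, followed by those for the second. The PR says the output order was aligned with gpt-3.5-turbo-instruct, but the exact layout used here is an assumption. `flatten_choices` and `choice_index` are hypothetical helpers.

```python
from typing import List


def choice_index(prompt_idx: int, sample_idx: int, n: int) -> int:
    # Prompt-major layout (assumed): samples of prompt 0 first, then prompt 1, ...
    return prompt_idx * n + sample_idx


def flatten_choices(per_prompt: List[List[str]]) -> List[dict]:
    """Flatten m prompts x n samples into one OpenAI-style choices list.

    The extra 'prompt_idx'/'sample_idx' keys are for illustration only;
    they are not fields of the OpenAI response schema.
    """
    choices = []
    for prompt_idx, samples in enumerate(per_prompt):
        for sample_idx, text in enumerate(samples):
            choices.append({
                "index": len(choices),
                "prompt_idx": prompt_idx,
                "sample_idx": sample_idx,
                "text": text,
            })
    return choices


results = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]  # m=2 prompts, n=3 samples each
choices = flatten_choices(results)
print([c["index"] for c in choices])  # [0, 1, 2, 3, 4, 5]
print(choices[4]["text"])             # b1
```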

I will also add support for the batch and model APIs in a separate PR.

python/sglang/srt/managers/tokenizer_manager.py

yichuan520030910320 · 2024-07-20T05:35:14Z

I fixed these lines of code.
I also rewrote some code so that the order of the output matches that of gpt-3.5-turbo-instruct.

yichuan520030910320 added 2 commits July 17, 2024 17:22

Add support for OpenAI API parallel sampling

b3868a3

Add support for OpenAI API parallel sampling

ab29158

Ying1123 mentioned this pull request Jul 17, 2024

Development Roadmap (2024 Q3) #634


Add support for OpenAI API parallel sampling and support list prompt as well

1404a99

Ying1123 reviewed Jul 20, 2024

python/sglang/srt/managers/tokenizer_manager.py

fix the comment && adjust the output to align w OAI API

e7869af

merrymercy merged commit 49c5e0e into sgl-project:main Jul 20, 2024
1 check passed
