
[FEAT] JSON constrained support #1125

Merged: 9 commits into sgl-project:main on Aug 26, 2024

Conversation

@havetc (Contributor) commented Aug 16, 2024

Motivation

A lot of LLM APIs (Together AI, Fireworks, Anyscale...) and other engines (vLLM...) support constrained generation with a JSON schema. As outlines is already a dependency of sglang, it is straightforward to extend its usage to directly support JSON schemas in the API.

Modification

Adding a json_schema parameter (in the sampling params for sglang, and as an extra parameter on CompletionRequest/ChatCompletionRequest for the OpenAI-compatible server).
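
For illustration, a minimal sketch (not from this PR) of what a request with this extra parameter could look like against the OpenAI-compatible server; the endpoint URL, model name, and passing the schema as a JSON string are assumptions here:

```python
# Hedged sketch: sending json_schema as an extra body field to the
# OpenAI-compatible server. The URL, model name, and the schema being
# passed as a JSON string are illustrative assumptions.
import json

import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={"json_schema": json.dumps(schema)},  # the new extra parameter
)
print(response.choices[0].message.content)  # output constrained to the schema
```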

Adding a new FSMJsonCache for JSON. It inherits from FSMCache, so it functions the same way, but it additionally stores the regex string produced by outlines' conversion. This regex string is required by the Jump Forward Cache.
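
A minimal sketch of that idea (not the PR's actual code; the import paths and the init_value hook are assumptions about the cache interface):

```python
# Hedged sketch: a JSON variant of the FSM cache that also keeps the
# outlines-converted regex string around, since the Jump Forward Cache
# needs the regex, not just the compiled FSM.
from outlines.fsm.json_schema import build_regex_from_schema  # name varies across outlines versions

from sglang.srt.constrained.fsm_cache import FSMCache  # assumed base class location


class FSMJsonCache(FSMCache):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # json schema string -> regex string, reused by the jump-forward logic
        self.schema_to_regex: dict[str, str] = {}

    def init_value(self, json_schema: str):
        regex = build_regex_from_schema(json_schema)  # costly: can take seconds on CPU
        self.schema_to_regex[json_schema] = regex
        return super().init_value(regex)  # build the FSM from the regex as before
```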

Adding a unit test, and updating the sampling params documentation.

Checklist

  • Before submitting a PR for review, make sure it has passed verification in your local development environment. (Mine is a limited env with only 24 GB VRAM: some out-of-memory failures in the test suite, but no functional errors.)
  • Ensure pre-commit (pre-commit run --all-files) or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests for correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.

@havetc (Contributor, Author) commented Aug 16, 2024

Any idea why the accuracy test is failing? I don't think I've made any changes that could impact accuracy, at least when no json_schema is used.

@Ying1123 Ying1123 self-assigned this Aug 17, 2024
@Ying1123 Ying1123 mentioned this pull request Aug 17, 2024
@zhyncs zhyncs changed the title [FEAT] Json constrained support [FEAT] JSON constrained support Aug 17, 2024
@merrymercy (Contributor) left a review comment

Thanks for the contribution! I left a few comments.

@merrymercy (Contributor) commented

@havetc can you address the comments and rebase?

@havetc (Contributor, Author) commented Aug 26, 2024

@merrymercy

Hello! I'm back from a few days of holiday; I'm on it.

havetc added 7 commits August 26, 2024 16:03
* based on regex constrained generation, with another type of cache added for json (to store the FSM and the regex conversion)
* minor changes to the batch scheduler, to avoid edge cases where it could break
@havetc havetc force-pushed the json-constrained-support branch from e142c61 to 00fd8b4 Compare August 26, 2024 14:36
@merrymercy (Contributor) commented

Please fix the lint error.

@merrymercy merrymercy merged commit 9935f97 into sgl-project:main Aug 26, 2024
8 checks passed
@merrymercy (Contributor) commented

@havetc Thanks for the contribution. It is merged.

@havetc havetc mentioned this pull request Aug 27, 2024
@merrymercy (Contributor) commented

Hi @havetc, actually a better approach is that we do not need to change anything inside the server.
We can just do the JSON schema -> regex conversion in this function: https://github.com/havetc/sglang/blob/5ff25cdf5b1310e83d9e595142b39ae4d7b561e9/python/sglang/srt/sampling/sampling_params.py#L114

Then all the changes in python/sglang/srt/managers/tp_worker.py and python/sglang/srt/constrained/fsm_cache.py would no longer be needed.

Does that make sense? If so, can you simplify your code in another PR?
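
Roughly, a sketch of that suggestion with assumed names (the real SamplingParams has many more fields, and the linked line is its verification method): convert the JSON schema to a regex once, up front, so downstream code only ever sees a regex constraint.

```python
# Hedged sketch of the suggested approach; class shape and method name
# are assumptions for illustration, not the actual sglang code.
from typing import Optional

from outlines.fsm.json_schema import build_regex_from_schema  # path may vary by outlines version


class SamplingParams:
    def __init__(self, regex: Optional[str] = None, json_schema: Optional[str] = None):
        self.regex = regex
        self.json_schema = json_schema

    def verify(self):
        # Fold the JSON constraint into the existing regex path.
        if self.json_schema is not None:
            self.regex = build_regex_from_schema(self.json_schema)
            self.json_schema = None
```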

@havetc (Contributor, Author) commented Aug 28, 2024

@merrymercy But as I understand it, that wouldn't cache the JSON-to-regex compilation? As it is a costly operation that can take several seconds on CPU, it really makes sense to cache it so it isn't redone for every request.

@qeternity (Contributor) commented

Hi all - thanks for this. I noticed that this doesn't actually expose json_schema to the gen function, which we'd love to be able to use. I just wanted to raise this (not sure if another PR was planned to introduce it) and also get it onto a branch that we could use internally.

I opened a PR here: #1254 (there may be other areas that need updating; I just wanted to hack something together quickly so we could experiment).

@havetc (Contributor, Author) commented Aug 28, 2024

@qeternity
Indeed, that seems like a miss. I tried to find where to update the code to add global support for JSON schemas, but as I only ever use sglang as an inference server, I didn't think about the gen function.
Your pull request looks good to me (I haven't tested it, though). One thing that might be worth updating is the JSON examples, which currently use only regex. (Also, feel free to extend the JSON unit test with a gen function test if you want to.)

@qeternity (Contributor) commented

I have been testing for the past few hours, and it's all working well. But @merrymercy will have a better view of whether there are other places that also need to be updated to accept json_schema.

I will have a look at the docs and tests later if I have time.

@merrymercy (Contributor) commented Aug 29, 2024

@havetc I see. In that case, we can also add a cache on the TokenizerManager to cache the JSON -> regex conversion. That way, the conversion even overlaps with GPU computation. I just want to keep the core part in tp_worker as minimal as possible, so we want to remove this duplication in handling both regex and JSON schemas.
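
A minimal sketch of that caching idea (the function name and cache size are illustrative assumptions): memoize the conversion keyed by the schema string, so the expensive compilation runs at most once per distinct schema.

```python
# Hedged sketch: memoize the JSON -> regex conversion in the
# TokenizerManager process so repeated schemas skip recompilation.
import functools

from outlines.fsm.json_schema import build_regex_from_schema  # path may vary by outlines version


@functools.lru_cache(maxsize=128)
def json_schema_to_regex(json_schema: str) -> str:
    # Conversion can take seconds on CPU, so cache by schema string.
    return build_regex_from_schema(json_schema)
```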

qeternity pushed a commit to qeternity/sglang that referenced this pull request Sep 1, 2024