examples : generate JSON according to schema #1887

Merged
merged 5 commits into master from json-schema on Aug 3, 2023

Conversation

@ejones (Collaborator) commented Jun 16, 2023

Adds a Python script that converts a JSON schema into the grammar format from #1773. This allows generating JSON according to a schema, like Jsonformer or OpenAI's function calling.
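
For a sense of the core mapping, here is a simplified sketch of the schema-to-grammar walk (an illustration only, not the actual script; the script in this PR additionally handles `oneOf`, `const`, and `--prop-order`, and the `ws` rule name below stands in for the whitespace rules it emits):

```python
# Simplified illustration of the JSON-schema -> GBNF mapping (not the real
# examples/json-schema-to-grammar.py): one rule per schema node, with
# primitives falling back to shared rules like string/number/boolean.
def schema_to_rules(schema, name, rules):
    t = schema.get("type")
    if t == "object":
        parts = ['"{" ws']
        for i, (key, sub) in enumerate(schema.get("properties", {}).items()):
            sub_name = f"{name}-{key}"
            schema_to_rules(sub, sub_name, rules)
            if i > 0:
                parts.append('"," ws')
            parts.append(f'"\\"{key}\\"" ws ":" ws {sub_name}')
        parts.append('"}" ws')
        rules[name] = " ".join(parts)
    elif t == "array":
        item = f"{name}-item"
        schema_to_rules(schema.get("items", {}), item, rules)
        rules[name] = f'"[" ws ({item} ("," ws {item})*)? "]" ws'
    else:
        rules[name] = {"string": "string", "number": "number",
                       "boolean": "boolean"}.get(t, "value")
    return rules

rules = schema_to_rules(
    {"type": "object", "properties": {"name": {"type": "string"},
                                      "age": {"type": "number"}}},
    "root", {})
for lhs, rhs in rules.items():
    print(f"{lhs} ::= {rhs}")
```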

Examples

Jsonformer Student Example
% cat ../schemas/student.json 
 {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/student.json --prop-order 'is_student,name,age' )" -p 'Hermione Granger '
main: build = 694 (e8259e4)
main: seed  = 1686892597
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


main: grammar:
<0>space_1 ::= <2>[ - ] | 
<9>space ::= <11>space_1 
...

 Hermione Granger { "is_student" : true, "name" : "Hermione", "age" :12, "courses" : [ "muggle studies","history of magic" , "charms","potion" ]} [end of text]

llama_print_timings:        load time =   396.96 ms
llama_print_timings:      sample time =    55.45 ms /    57 runs   (    0.97 ms per token)
llama_print_timings: prompt eval time =   347.81 ms /     6 tokens (   57.97 ms per token)
llama_print_timings:        eval time =  3898.12 ms /    56 runs   (   69.61 ms per token)
llama_print_timings:       total time =  4306.70 ms
Jsonformer car example
% cat ../schemas/car.json 
{"type": "object", "properties": {"car": {"type": "object", "properties": {"make": {"type": "string"}, "model": {"type": "string"}, "year": {"type": "number"}, "colors": {"type": "array", "items": {"type": "string"}}, "features": {"type": "object", "properties": {"audio": {"type": "object", "properties": {"brand": {"type": "string"}, "speakers": {"type": "number"}, "hasBluetooth": {"type": "boolean"}}}, "safety": {"type": "object", "properties": {"airbags": {"type": "number"}, "parkingSensors": {"type": "boolean"}, "laneAssist": {"type": "boolean"}}}, "performance": {"type": "object", "properties": {"engine": {"type": "string"}, "horsepower": {"type": "number"}, "topSpeed": {"type": "number"}}}}}}}, "owner": {"type": "object", "properties": {"firstName": {"type": "string"}, "lastName": {"type": "string"}, "age": {"type": "number"}}}}}
 % ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/car.json --prop-order 'car,make,model,owner,firstName,lastName,age,year' )" -p 'Brought the 97 Civic in '
main: build = 694 (e8259e4)
main: seed  = 1686892847
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


main: grammar:
<0>space_1 ::= <2>[ - ] | 
<9>space ::= <11>space_1 
...

 Brought the 97 Civic in { "car" : { "make" : "Honda", "model" : "Civic", "year" :1997, "colors": [ "Black","Silver","Gray"] , "features":{ "audio": {"brand": "Bose", "hasBluetooth": false, "speakers":10}, "performance":{"engine": "K20A2", "horsepower":230,"topSpeed":185},"safety": {"airbags":10, "laneAssist":false,"parkingSensors":false}} } , "owner" : { "firstName":"Brian","lastName":"O'Donnell" , "age":32} } [end of text]

llama_print_timings:        load time =   324.46 ms
llama_print_timings:      sample time =   196.27 ms /   182 runs   (    1.08 ms per token)
llama_print_timings: prompt eval time =   707.57 ms /    12 tokens (   58.96 ms per token)
llama_print_timings:        eval time = 12594.43 ms /   181 runs   (   69.58 ms per token)
llama_print_timings:       total time = 13515.57 ms
OpenAI-style function calling
% cat ../schemas/functions.json 
{
    "oneOf": [
        {
            "type": "object",
            "properties": {
                "function": {"const": "create_event"},
                "arguments": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "date": {"type": "string"},
                        "time": {"type": "string"}
                    }
                }
            }
        },
        {
            "type": "object",
            "properties": {
                "function": {"const": "search"},
                "arguments": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    }
                }
            }
        }
    ]
}
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/functions.json --prop-order 'function,arguments' )" -p $'Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"\n\nRequest: Call mom at 5pm \nFunction Call: '
main: build = 694 (e8259e4)
main: seed  = 1686893039
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


main: grammar:
<0>space_1 ::= <2>[ - ] | 
<9>space ::= <11>space_1 
<15>0-function ::= ..

 Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"

Request: Call mom at 5pm 
Function Call: {"function":"create_event","arguments":{"date":"2017-11-16T18:00:00+00:00","time":"17:00" , "title":"Call my mom" }}  [end of text]

llama_print_timings:        load time =   302.69 ms
llama_print_timings:      sample time =    63.82 ms /    63 runs   (    1.01 ms per token)
llama_print_timings: prompt eval time =  3517.46 ms /    42 tokens (   83.75 ms per token)
llama_print_timings:        eval time =  4388.51 ms /    62 runs   (   70.78 ms per token)
llama_print_timings:       total time =  7975.77 ms
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/functions.json --prop-order 'function,arguments' )" -p $'Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"\n\nRequest: What meetings are happening this afternoon? \nFunction Call: ' 
main: build = 694 (e8259e4)
...

 Transcript of AI assistant responding to user requests. It uses the APIs "search" and "create_event"

Request: What meetings are happening this afternoon? 
Function Call: { "function": "search", "arguments": { "query": "what meetings are happening today?" } }  [end of text]

llama_print_timings:        load time =   300.87 ms
llama_print_timings:      sample time =    30.92 ms /    32 runs   (    0.97 ms per token)
llama_print_timings: prompt eval time =  3535.50 ms /    44 tokens (   80.35 ms per token)
llama_print_timings:        eval time =  2114.93 ms /    31 runs   (   68.22 ms per token)
llama_print_timings:       total time =  5684.63 ms

@howard0su (Collaborator)

Can we do this logic in C++ so that we can support this in main?

@ejones (Collaborator, Author) commented Jun 18, 2023

Well, this is usable with main (as in the examples) as an input to --grammar. In general, I think it would be more complex to do in C++. And the lack of built-in JSON would be a challenge (I believe C++ examples here using the JSON library have to be left out of make)
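
For anyone who wants to script that step, here is a minimal sketch of driving the preprocessor from Python (the schema path, model path, and prompt are placeholders for your own setup):

```python
import subprocess

# Sketch: generate a grammar from a schema file, then pass it to ./main.
# "schema.json" and the model path are placeholders; adjust to your setup.
grammar = subprocess.check_output(
    ["python3", "examples/json-schema-to-grammar.py", "schema.json"],
    text=True,
)
subprocess.run(
    ["./main", "-m", "models/13B/ggml-model-q4_0.bin",
     "--grammar", grammar, "-p", "Hermione Granger "],
    check=True,
)
```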

@howard0su (Collaborator)

> Well, this is usable with main (as in the examples) as an input to --grammar. In general, I think it would be more complex to do in C++. And the lack of built-in JSON would be a challenge (I believe C++ examples here using the JSON library have to be left out of make)

Agree that C++ may be complex, but JSON is already used in the server example; see examples/server/json.hpp.

@ejones (Collaborator, Author) commented Jun 21, 2023

Yeah, I thought based on the discussion that the JSON dependency meant that server had to be CMake-only and excluded from the Makefile. It does look like it's in the Makefile, although hidden behind a flag, so I might be wrong.

That said, there's still the complexity point. Do you feel that JSON schema support directly in main (vs a preprocessor to main) is sufficiently valuable to warrant the extra effort?

@howard0su (Collaborator)

My main usage today is using an LLM as a backend for functions. In that scenario, I would always prefer that the LLM return a valid JSON string so the result is easy to parse, which makes JSON support very useful.

@mudler mentioned this pull request on Jul 6, 2023
@ejones marked this pull request as ready for review on Jul 24, 2023
@ejones (Collaborator, Author) commented Jul 27, 2023

@slaren or @SlyEcho, are either of you interested in reviewing this?

@SlyEcho (Collaborator) commented Jul 27, 2023

It doesn't seem to match the grammar on json.org; for example, the root can also be an array, a string, or any other value. I'm not sure whether it's possible to transform that grammar into ours; the string escaping is probably the hardest part.

@ejones (Collaborator, Author) commented Jul 28, 2023

There are two separate grammars here - grammars/json.gbnf is a standalone, sample grammar, while examples/json-schema-to-grammar.py stitches a grammar together dynamically based on a schema. I just opted to update the generic JSON grammar in conjunction with this script to bring it up to spec (more on that later).

As for the root type: in grammars/json.gbnf, I tried setting root ::= value, but without any context, the model was likely to just produce e.g., a number and quit. Restricting it to an object seemed to give the best (most interesting) outcome for testing out JSON generation in the general case. For the schema-driven script, I've just pushed a fix to ensure that you can in fact generate from a schema denoting a primitive value, if that is of use to anyone.
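
As a rough illustration (not the script's literal output), something like this should now work for a primitive-value schema:

```python
import json, subprocess, tempfile

# Hypothetical check that a primitive-value schema now converts; the printed
# grammar is expected to reduce root to the shared number rule (the exact
# output format here is an approximation).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"type": "number"}, f)

grammar = subprocess.check_output(
    ["python3", "examples/json-schema-to-grammar.py", f.name], text=True)
print(grammar)
```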

Regarding the JSON spec, for this iteration I carefully followed the syntax on json.org for numbers and strings so it should in fact be compliant. The escaping is indeed there now:

string ::=
  "\"" (
    [^"\\] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws

@ejones (Collaborator, Author) commented Aug 2, 2023

@ggerganov any interest in giving this a quick look?

@ggerganov (Owner) left a comment

Great work!

Thank you again for such cool contributions!

@SlyEcho (Collaborator) commented Aug 2, 2023

I tried the example from https://json-schema.org/learn/getting-started-step-by-step.html and the converter had issues with the `description` and `integer` properties.

But it works otherwise 👍

@ejones (Collaborator, Author) commented Aug 3, 2023

Thanks!

@SlyEcho I added support for `integer`, so that tutorial now runs up to the point where they split up the schemas:

% ./main -m $LLAMA2_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/getting-started-full.json --prop-order 'productName,price,productId,dimensions' )" 

...

 {"productName":"Blu-ray+DVD: The Good Dinosaur","price":10,"productId":452389,"dimensions":{"height":267,"length":152.4,"width":178},"tags":["Blu-ray","Comedy","Drama","Kids \u0026 Family","Sci-Fi \u0026 Fantasy"]} [end of text]

llama_print_timings:        load time =   351.81 ms
llama_print_timings:      sample time =   391.91 ms /   103 runs   (    3.80 ms per token,   262.81 tokens per second)
llama_print_timings: prompt eval time =   114.75 ms /     2 tokens (   57.38 ms per token,    17.43 tokens per second)
llama_print_timings:        eval time =  6675.73 ms /   102 runs   (   65.45 ms per token,    15.28 tokens per second)
llama_print_timings:       total time =  7242.08 ms
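
For context, ../schemas/getting-started-full.json follows the tutorial's product schema; a trimmed approximation is below (field set inferred from the tutorial and the output above, not copied from the actual file):

```python
# Trimmed, approximate version of the json-schema.org "getting started"
# product schema used above; inferred, not copied from the actual file.
getting_started_schema = {
    "type": "object",
    "properties": {
        "productId": {"type": "integer"},
        "productName": {"type": "string"},
        "price": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "dimensions": {
            "type": "object",
            "properties": {
                "length": {"type": "number"},
                "width": {"type": "number"},
                "height": {"type": "number"},
            },
        },
    },
}
```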

@ejones merged commit 8183159 into ggerganov:master on Aug 3, 2023
@ejones deleted the json-schema branch on Aug 14, 2023
lenaxia pushed a commit to lenaxia/home-ops-prod that referenced this pull request Apr 27, 2024
…d grammars by including the `messages` field and adjusting the endpoint to `/v1/chat/completions`.

# Aider chat conversation:

USER: https://localai.io/features/constrained_grammars/:


# ✍️ Constrained grammars

The chat endpoint accepts an additional `grammar` parameter which takes
a [BNF defined
grammar](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form).

This constrains the LLM's output to a user-defined schema, making it possible
to generate `JSON`, `YAML`, and anything else that can be defined with a BNF
grammar.


This feature works only with models compatible with the
[llama.cpp](https://github.com/ggerganov/llama.cpp) backend (see also
[Model compatibility](/model-compatibility/)). For details on how it
works, see the upstream PRs:
<ggerganov/llama.cpp#1773>,
<ggerganov/llama.cpp#1887>

## Setup

Follow the setup instructions from the [LocalAI
functions](/features/openai-functions/) page.

## 💡 Usage example

For example, to constrain the output to either `yes` or `no`:

    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "gpt-4",
      "messages": [{"role": "user", "content": "Do you like apples?"}],
      "grammar": "root ::= (\"yes\" | \"no\")"
    }'


ASSISTANT: Based on the information provided about LocalAI's constrained grammars feature, it seems that the `handleCompletions` function in the `rag-api.go` file needs to be updated to support constrained grammars when making a request to the LocalAI service. The request payload should include a `grammar` field when it is provided in the `CompletionRequest`.

Here is the diff for the necessary changes:

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	embedding, err := getEmbeddings(req.Prompt)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

-	// Generate completion using constrained grammar (if requested)
-	var completion CompletionResponse
-	if req.ConstrainedGrammar != "" {
-		payload := map[string]interface{}{
-			"model":   "gpt-4",
-			"prompt":  req.Prompt,
-			"grammar": req.ConstrainedGrammar,
-			"max_tokens": req.MaxTokens,
-			"temperature": req.Temperature,
-			"top_p": req.TopP,
-		}
-		jsonPayload, err := json.Marshal(payload)
-		if err != nil {
-			http.Error(w, err.Error(), http.StatusInternalServerError)
-			return
-		}
+	// Generate completion using constrained grammar or retrieval-augmented generation (if requested)
+	var completionResp CompletionResponse
+	payload := map[string]interface{}{
+		"model":       "gpt-4",
+		"prompt":      req.Prompt,
+		"max_tokens":  req.MaxTokens,
+		"temperature": req.Temperature,
+		"top_p":       req.TopP,
+	}
+	if req.ConstrainedGrammar != "" {
+		payload["grammar"] = req.ConstrainedGrammar
+	}
+	jsonPayload, err := json.Marshal(payload)
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusInternalServerError)
+		return
+	}

 	localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 	if localAI == "" {
 		localAI = defaultLocalAI
 	}

-	resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
+	resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}
 	defer resp.Body.Close()

 	if resp.StatusCode != http.StatusOK {
 		body, err := io.ReadAll(resp.Body)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 		return
 	}

-	var respBody struct {
-		Result CompletionResponse `json:"result"`
-	}
-	if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
+	if err := json.NewDecoder(resp.Body).Decode(&completionResp); err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

-	completion = respBody.Result
+	jsonResp, err := json.Marshal(completionResp)
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusInternalServerError)
+		return
+	}

-	jsonResp, err := json.Marshal(completion)
-	if err != nil {
-		http.Error(w, err.Error(), http.StatusInternalServerError)
-		return
-	}

 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
 	w.Write(jsonResp)
 }
```

This diff updates the `handleCompletions` function to include the `grammar` field in the request payload if it is provided in the `CompletionRequest`. It also corrects the endpoint to `/completions` as per the LocalAI documentation.

USER: UnifiedDiffNoMatch: hunk failed to apply!

cluster/apps/home/localai/app/config/rag-api.go does not contain lines that match the diff you provided!
Try again.
DO NOT skip blank lines, comments, docstrings, etc!
The diff needs to apply cleanly to the lines in cluster/apps/home/localai/app/config/rag-api.go!

cluster/apps/home/localai/app/config/rag-api.go does not contain these 76 exact lines in a row:
```
func handleCompletions(w http.ResponseWriter, r *http.Request) {
	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
	logRequest(r)

	var req CompletionRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	defer r.Body.Close()

	embedding, err := getEmbeddings(req.Prompt)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Generate completion using constrained grammar (if requested)
	var completion CompletionResponse
	if req.ConstrainedGrammar != "" {
		payload := map[string]interface{}{
			"model":   "gpt-4",
			"prompt":  req.Prompt,
			"grammar": req.ConstrainedGrammar,
			"max_tokens": req.MaxTokens,
			"temperature": req.Temperature,
			"top_p": req.TopP,
		}
		jsonPayload, err := json.Marshal(payload)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

	localAI := os.Getenv("LOCAL_AI_ENDPOINT")
	if localAI == "" {
		localAI = defaultLocalAI
	}

	resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
		return
	}

	var respBody struct {
		Result CompletionResponse `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	completion = respBody.Result

	jsonResp, err := json.Marshal(completion)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	w.Write(jsonResp)
}
```

ASSISTANT: I apologize for the confusion. Let's try again with the correct diff that applies cleanly to the `cluster/apps/home/localai/app/config/rag-api.go` file. I will update the `handleCompletions` function to correctly handle the constrained grammars feature as described in the LocalAI documentation.

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	// Generate completion using constrained grammar (if requested)
 	var completion CompletionResponse
 	if req.ConstrainedGrammar != "" {
 		payload := map[string]interface{}{
 			"model":   "gpt-4",
 			"prompt":  req.Prompt,
-			"grammar": req.ConstrainedGrammar,
+			"grammar": req.ConstrainedGrammar,
 			"max_tokens": req.MaxTokens,
 			"temperature": req.Temperature,
 			"top_p": req.TopP,
 		}
 		jsonPayload, err := json.Marshal(payload)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 		if localAI == "" {
 			localAI = defaultLocalAI
 		}

-		resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
+		resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		defer resp.Body.Close()

 		if resp.StatusCode != http.StatusOK {
 			body, err := io.ReadAll(resp.Body)
 			if err != nil {
 				http.Error(w, err.Error(), http.StatusInternalServerError)
 				return
 			}
 			http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 			return
 		}

 		var respBody struct {
 			Result CompletionResponse `json:"result"`
 		}
 		if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		completion = respBody.Result
 	} else {
 		// Implement retrieval-augmented generation using retrieved data
 		// ...
 	}

 	jsonResp, err := json.Marshal(completion)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

 	w.Header().Set
^C KeyboardInterrupt

USER: https://localai.io/features/constrained_grammars/ (same page content as above, pasted again)

ASSISTANT: To implement the constrained grammars feature as described in the LocalAI documentation, we need to update the `handleCompletions` function in the `rag-api.go` file to correctly handle the `grammar` parameter in the request payload. Here is the diff for the necessary changes:

```diff
--- cluster/apps/home/localai/app/config/rag-api.go
+++ cluster/apps/home/localai/app/config/rag-api.go
@@ ... @@
 func handleCompletions(w http.ResponseWriter, r *http.Request) {
 	atomic.AddUint64(&requestMetrics.CompletionRequests, 1)
 	logRequest(r)

 	var req CompletionRequest
 	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
 		http.Error(w, err.Error(), http.StatusBadRequest)
 		return
 	}
 	defer r.Body.Close()

 	// Generate completion using constrained grammar (if requested)
 	var completion CompletionResponse
 	if req.ConstrainedGrammar != "" {
 		payload := map[string]interface{}{
 			"model":   "gpt-4",
 			"prompt":  req.Prompt,
+			"messages": []map[string]string{
+				{
+					"role":    "user",
+					"content": req.Prompt,
+				},
+			},
 			"max_tokens": req.MaxTokens,
 			"temperature": req.Temperature,
 			"top_p": req.TopP,
+			"grammar": req.ConstrainedGrammar,
 		}
 		jsonPayload, err := json.Marshal(payload)
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		localAI := os.Getenv("LOCAL_AI_ENDPOINT")
 		if localAI == "" {
 			localAI = defaultLocalAI
 		}

-		resp, err := http.Post(localAI+"/completions", "application/json", bytes.NewBuffer(jsonPayload))
+		resp, err := http.Post(localAI+"/v1/chat/completions", "application/json", bytes.NewBuffer(jsonPayload))
 		if err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}
 		defer resp.Body.Close()

 		if resp.StatusCode != http.StatusOK {
 			body, err := io.ReadAll(resp.Body)
 			if err != nil {
 				http.Error(w, err.Error(), http.StatusInternalServerError)
 				return
 			}
 			http.Error(w, fmt.Sprintf("completion request failed with status code: %d: %s", resp.StatusCode, body), resp.StatusCode)
 			return
 		}

 		var respBody struct {
 			Result CompletionResponse `json:"result"`
 		}
 		if err := json.NewDecoder(resp.Body).Decode(&respBody); err != nil {
 			http.Error(w, err.Error(), http.StatusInternalServerError)
 			return
 		}

 		completion = respBody.Result
 	} else {
 		// Implement retrieval-augmented generation using retrieved data
 		// ...
 	}

 	jsonResp, err := json.Marshal(completion)
 	if err != nil {
 		http.Error(w, err.Error(), http.StatusInternalServerError)
 		return
 	}

 	w.Header().Set("Content-Type", "application/json")
 	w.WriteHeader(http.StatusOK)
 	w.Write(jsonResp)
 }
```

This diff adds the `messages` field to the payload as required by the LocalAI API for constrained grammars and updates the endpoint to `/v1/chat/completions`. It also includes the `grammar` field when the `ConstrainedGrammar` is not empty.