Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars #9639

ochafik · 2024-09-25T15:37:26Z

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Which models are supported (in their native style)?

While any model should work (w/ generic fallback using JSON schema constraints), this PR supports the native call style of a few models:

Llama 3.1 / 3.3 (including builtin tools support), Llama 3.2
Functionary v3.1 / v3.2
Hermes 2/3, Qwen 2.5
Mistral Nemo
Firefunction v2
DeepSeek R1 (WIP / seems reluctant to call any tools?)

Show all templates supported by minja and which handler they use

Template	Format
CohereForAI-c4ai-command-r-plus-default.jinja	generic tool calls
CohereForAI-c4ai-command-r-plus-rag.jinja	generic tool calls
CohereForAI-c4ai-command-r-plus-tool_use.jinja	generic tool calls
MiniMaxAI-MiniMax-Text-01.jinja	generic tool calls
NexaAIDev-Octopus-v2.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja	hermes 2 pro tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja	hermes 2 pro tool calls
NousResearch-Hermes-3-Llama-3.1-70B-default.jinja	generic tool calls
NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja	hermes 2 pro tool calls
OrionStarAI-Orion-14B-Chat.jinja	generic tool calls
Qwen-QwQ-32B-Preview.jinja	hermes 2 pro tool calls
Qwen-Qwen2-7B-Instruct.jinja	generic tool calls
Qwen-Qwen2-VL-7B-Instruct.jinja	generic tool calls
Qwen-Qwen2.5-7B-Instruct.jinja	hermes 2 pro tool calls
Qwen-Qwen2.5-Math-7B-Instruct.jinja	hermes 2 pro tool calls
TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja	generic tool calls
abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja	generic tool calls
bofenghuang-vigogne-2-70b-chat.jinja	generic tool calls
databricks-dbrx-instruct.jinja	generic tool calls
deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja	generic tool calls
deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-V2.5.jinja	deepseek r1 tool calls
deepseek-ai-deepseek-coder-33b-instruct.jinja	generic tool calls
google-gemma-2-2b-it.jinja	generic tool calls
google-gemma-7b-it.jinja	generic tool calls
indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja	generic tool calls
mattshumer-Reflection-Llama-3.1-70B.jinja	generic tool calls
meetkai-functionary-medium-v3.2.jinja	functionary v3.2 tool calls
meta-llama-Llama-3.1-8B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
meta-llama-Llama-3.2-3B-Instruct.jinja	llama 3.x tool calls
meta-llama-Llama-3.3-70B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
meta-llama-Meta-Llama-3.1-8B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
microsoft-Phi-3-medium-4k-instruct.jinja	generic tool calls
microsoft-Phi-3-mini-4k-instruct.jinja	generic tool calls
microsoft-Phi-3-small-8k-instruct.jinja	generic tool calls
microsoft-Phi-3.5-mini-instruct.jinja	generic tool calls
microsoft-Phi-3.5-vision-instruct.jinja	generic tool calls
mistralai-Mistral-7B-Instruct-v0.2.jinja	generic tool calls
mistralai-Mistral-Large-Instruct-2407.jinja	mistral nemo tool calls
mistralai-Mistral-Large-Instruct-2411.jinja	generic tool calls
mistralai-Mistral-Nemo-Instruct-2407.jinja	mistral nemo tool calls
mistralai-Mixtral-8x7B-Instruct-v0.1.jinja	generic tool calls
mlabonne-AlphaMonarch-7B.jinja	generic tool calls
nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja	llama 3.x tool calls (w/ builtin tools)
openchat-openchat-3.5-0106.jinja	generic tool calls
teknium-OpenHermes-2.5-Mistral-7B.jinja	generic tool calls

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is in use by inspecting http://localhost:8080/props, and by looking in the logs for the line starting with Chat format:

Any tool_calls field returned by llama-server should always conform to the JSON schema (to the extent that it uses supported features of JSON schemas), so there's no need to use any post-processor.

How to use / test

You can test tool calls as follows:

Get and build this PR's branch

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-call
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server

Run llama-server w/ any model:

# Native support for Llama 3.x, Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x, Firefunction v2...

llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M

llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q4_K_M

llama-server --jinja -fa -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q6_K

llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M

# Native support requires the right template for these GGUFs:

llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
  --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )

llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
  --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-2-Pro-Llama-3-8B )

llama-server --jinja -fa -hf bartowski/firefunction-v2-GGUF -hff firefunction-v2-IQ1_M.gguf \
  --chat-template-file <( python scripts/get_chat_template.py fireworks-ai/firellama-3-firefunction-v2 )

# Generic support, e.g. Phi 3.5, Gemma 2b, but really any model goes

llama-server --jinja -fa -hf bartowski/Phi-3.5-mini-instruct-GGUF:Q4_K_M

llama-server --jinja -fa -hf bartowski/gemma-2-2b-it-GGUF:Q4_K_M

Call the chat completions endpoint (in non-streamed mode) with any OpenAI-compatible library, or plain curl:

curl http://localhost:8080/v1/chat/completions -d '{
  "model": "gpt-3.5-turbo",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The code to run in the ipython interpreter."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Print a hello world message with python."
    }
  ]
}'

It will output something like this (once piped through jq):

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "content": "",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "python",
              "arguments": "{\"code\":\"print('Hello, World!')\"}"
            },
            "id": null
          }
        ],
        "role": "assistant"
      }
    }
  ],
  ...
}
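On the client side, the one non-obvious detail is that `arguments` is itself a JSON-encoded string. A minimal Python sketch of decoding the sample response above follows; `extract_tool_calls` is a hypothetical helper name, not part of this PR or of any library:

```python
import json

# Sample response copied from the output shown above (trimmed to the fields used here).
response = {
    "choices": [
        {
            "finish_reason": "tool_calls",
            "index": 0,
            "message": {
                "content": "",
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "python",
                            "arguments": "{\"code\":\"print('Hello, World!')\"}",
                        },
                        "id": None,
                    }
                ],
                "role": "assistant",
            },
        }
    ]
}


def extract_tool_calls(response):
    """Return a list of (name, arguments_dict) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    # `tool_calls` may be missing or null when the model answered with plain content.
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # `arguments` is a JSON string, so it needs one extra decoding step.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


calls = extract_tool_calls(response)
print(calls)
```

An agent loop would then look up each name in its own table of tools and send the result back in a follow-up message with the tool role.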

I've also created some minimalistic Agent loop code in this Gist: it contains a few python tools & supports running them in a siloed docker container, along with examples (used to be part of this PR).

Background

This PR tackles two main problems related to tool calling:

Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky as in most cases the model may also output normal, unconstrained content (except if "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>" as the leading .* will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I was avoiding this issue in the thoughtful_steps style, but the native tool call styles were still problematic.
- Solved w/ lazy grammars activated by trigger words (similar to stop words, but awaited in the grammar implementation itself). Output is completely unconstrained before triggers, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that).
  - For Llama 3.x (cf. these docs: 1, 2, 3), triggers are
    - <|python_tag|> if any of the builtin tools are detected (wolfram_alpha, brave_search / web_search with query param, code_interpreter with code param); NOT for Llama 3.2
    - {"name": "toolN" (for each toolN in the list of tools in the request)
    - Also just {"name": (needed for very small 1B/3B models which get confused very quickly otherwise), and some other variations (to allow the somewhat popular {"type": "function", "name": ...)
  - For Functionary v3.1, we trigger on <function= and <|python_tag|> (NOTE: seems to work well w/ Llama-3.1-Instruct, e.g. it's on together.ai's docs). Note that <|python_tag|> here introduces freeform Python code, whereas for Llama-3.1-Instruct's template it introduces builtin tool calls in Python syntax. Almost the same, but handled quite differently.
  - For Functionary v3.2, it's >>>toolN\n for each toolN (technically also triggering on toolN\n for the first tool call, there's a todo to avoid spurious matches by forcing a match at the very start)
  - For Hermes Pro (cf. Hermes-Function-Calling repo), the trigger is <tool_call>.
  - For Mistral Nemo, the trigger is the special [TOOL_CALLS] token
  - For DeepSeek R1 and its distills, it's <｜tool▁calls▁begin｜> (Note: DeepSeek-R1 seems more eager to talk than to call tools for now, lemme know if you get it to work)
  - For Firefunction v2, the trigger is functools[
  - For other models ("generic" chat format), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required)
Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.
- Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less so than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
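To make the lazy-grammar idea in the first item concrete, here is a toy Python sketch of the trigger behaviour: text before the earliest trigger word is left alone, and everything from the trigger onwards is what the grammar would constrain. It works on a complete string, whereas the PR wires this into the sampler during generation; the function name and the choice to include the trigger itself in the constrained part are assumptions made for illustration.

```python
def split_at_trigger(output, triggers):
    """Split output into (free_text, constrained_text) at the earliest trigger word.

    Returns (output, None) when none of the trigger words occurs.
    """
    best = None
    for trigger in triggers:
        pos = output.find(trigger)
        # Keep the earliest position at which any trigger starts.
        if pos != -1 and (best is None or pos < best):
            best = pos
    if best is None:
        return output, None
    return output[:best], output[best:]


# Hermes-style trigger, as listed above.
hermes_triggers = ["<tool_call>"]
text = 'Let me check.<tool_call>{"name": "python", "arguments": {"code": "1+1"}}</tool_call>'
content, constrained = split_at_trigger(text, hermes_triggers)
print(repr(content))
print(repr(constrained))
```

With several triggers (e.g. one per tool name for Llama 3.x), the earliest match wins in this sketch, which is also why short triggers can fire spuriously.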

With this intro out of the way, here are the main parts of this PR:

minja.hpp: minimal Jinja templating engine and its tests against actual templates & a few test contexts
- Spun into its own repo: https://github.com/google/minja
- Integrated under --jinja flag in Add Jinja template support #11016
Tool call grammar generation + output parsing logic for 8 different tool call styles (covering most of the popular models, incl. Llama 3.x, Functionary 3, Qwen 2.5, DeepSeek R1, Mistral Nemo...), with a generic fallback.
Lazy grammar wired into the sampler, using a mix of trigger words and trigger tokens to enable the grammar. Trigger tokens are also used to override printability of special tokens, even when the grammar is not lazy (e.g. when "tool_choice": "required" is passed in the request)
Integration with llama-server (full tools & tool_choice support).
- Growing set of tests in examples/server/tests/unit/test_tool_call.py, some of which are skipped by default as they require downloading lots of models (can bulk get them with scripts/fetch_server_test_models.py, then run the slow tests w/ ( cd examples/server/tests && ./tests.sh -m slow -v -x )).
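As an illustration of the output parsing step, here is a rough Python sketch for the Hermes-style `<tool_call>` format. It is not the PR's C++ parser: the function name is hypothetical, and treating the closing tag as optional is an assumption based on the report later in this thread that the tag may not be printed when it is a special token.

```python
import json

TOOL_CALL_OPEN = "<tool_call>"
TOOL_CALL_CLOSE = "</tool_call>"


def parse_hermes_style(output):
    """Split model output into plain content and a list of OpenAI-style tool calls."""
    decoder = json.JSONDecoder()
    content_parts = []
    tool_calls = []
    pos = 0
    while True:
        start = output.find(TOOL_CALL_OPEN, pos)
        if start == -1:
            content_parts.append(output[pos:])
            break
        content_parts.append(output[pos:start])
        # Skip whitespace between the opening tag and the JSON object.
        json_start = start + len(TOOL_CALL_OPEN)
        while json_start < len(output) and output[json_start].isspace():
            json_start += 1
        # raw_decode parses one JSON value and reports where it ended.
        call, pos = decoder.raw_decode(output, json_start)
        tool_calls.append({
            "type": "function",
            "function": {
                "name": call["name"],
                # Re-encode so that `arguments` is a JSON string, as in the API response.
                "arguments": json.dumps(call["arguments"]),
            },
        })
        # Consume trailing whitespace and the closing tag, if it is present.
        while pos < len(output) and output[pos].isspace():
            pos += 1
        if output.startswith(TOOL_CALL_CLOSE, pos):
            pos += len(TOOL_CALL_CLOSE)
    return "".join(content_parts).strip(), tool_calls


sample = '<tool_call>\n{"name": "test", "arguments": {"success": true}}\n'
content, tool_calls = parse_hermes_style(sample)
print(content, tool_calls)
```

Malformed JSON after the opening tag raises `json.JSONDecodeError` here; a real parser would more likely fall back to returning the text as content.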

TODOs

Blocking:

sync: minja #11499 (this PR's diff won't include chat-template.hpp or minja.hpp)
- Ensure tools aren't described twice in the generic handler (now that Minja does it for us)
Add test for lazy grammars (cf. removed test-antiprompts.cpp)
Test parsers on corner case inputs (now they're easier to call w/ an enum) and tighten their implementations
Drop legacy python_code_argument_name in favour of expect_tool_arguments

Nice to haves:

Implement at_first semantics to require trigger word to be at start of output (equiv. to ^ regex behaviour; not using regexes as ^ can't be made to mean "start of entire string" reliably afaict), to reduce spurious triggers w/ Llama 3.x
Document llama3.1 builtin tools schemas
May want to ping owners of models whose GGUFs don't contain the right chat templates + provide them w/ an easy one-liner to surgically edit the GGUF
Warning log when using the generic chat format
Find examples of tool call w/ DeepSeek-R1-Distill-* (ought to work, but proving elusive / just wants to think, think, think)
Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1 and functionary

See draft-times TODOs

Possible follow ups:

Add -hft / --hf_template flag to override the GGUF's chat templates from a HF model repo
Add agent example w/ isolation in c++ or python (see example/agent moved from this PR to that Gist).
Add agent w/ MCP support?
Add tool call loop to the default web chat using Pyodide as a python interpreter?
Add tool call loop to the CLIs?

ochafik · 2024-09-27T06:25:09Z

Apologies for this PR being a moving target.

I've now stabilized things (except older gcc giving me sweats), added tests & included basic usage instructions (w/ a tiny agent helper adapted from #6389) for Llama-3.1-8B-Instruct, Hermes-2-Pro-Llama-3-8B and functionary-small-3.2 (which still needs a bit of work).

rujialiu · 2024-09-29T12:25:32Z

@ochafik Your minja.hpp is cool (I like minimalist things) but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

BTW: My current tool-calling solution is to write dummy functions in python and generate grammar files with pydantic, awkward and ugly. I'll definitely give it a try when you finish this PR. Exciting work!

ochafik · 2024-09-29T21:21:03Z

@ochafik Your minja.hpp is cool (I like minimalist things)

Thanks @rujialiu !

but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features, e.g. NousResearch/Hermes-3-Llama-3.1, Cohere/command-r-plus, meetkai/functionary-medium-v3.2). Filters (w/ the pipe syntax, e.g. {{ range(10) | length }}) and macros are glaring omissions, for instance.

BTW: My current tool-calling solution is to write dummy functions in python and generate grammar files with pydantic, awkward and ugly.

Yeah I'm doing the same, that's why I spent so much energy improving the JSON schema support tbh.

I'll definitely give it a try when you finish this PR. Exciting work!

Hopefully soon! (famous last words haha)

rujialiu · 2024-09-30T07:43:20Z

Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features

Ouch, I was not aware of that. That's crazy. Now I'm really impressed that your little code already supports these. Maybe I should use your minja.hpp in production instead in the future 8-)

Maximilian-Winter · 2024-10-07T16:57:07Z

@ochafik I really like your idea of using lazy grammar, I would love to help you. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.

ochafik · 2024-10-17T18:35:06Z

@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅)

I'd love help on this, anything from just testing out the instructions above, to finding new cool examples / bugs, reporting on any other model's tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably what will need the most work next.

Depending on your timezone, happy to jump into a video chat too :-) (DM on x?)

(Also, llama-cpp-agent looks suuuper cool! 💜)

Maximilian-Winter · 2024-10-18T23:50:52Z

@ochafik Sure, that would be great. I live in Germany. I actually tried to get verified on X by buying premium so I could message you, but I'm still waiting for verification. If you want to reach out to me by email or Discord, feel free! My email is [email protected]

ngxson · 2025-01-30T14:55:06Z

I'm getting this error while running test_tool_call.py btw:

        assert res.status_code == 200, f"Expected status code 200, got {res.status_code}"
        choice = res.body["choices"][0]
        tool_calls = choice["message"].get("tool_calls")
>       assert tool_calls and len(tool_calls) == 1, f'Expected 1 tool call in {choice["message"]}'
E       AssertionError: Expected 1 tool call in {'content': '<tool_call>\n{"name": "test", "arguments": {"success": true}}\n', 'tool_calls': None, 'role': 'assistant'}
E       assert (None)

unit/test_tool_call.py:191: AssertionError
=============================================== short test summary info ================================================
FAILED unit/test_tool_call.py::test_completion_with_required_tool_real_model[tool8-success-bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M-template_override8] - AssertionError: Expected 1 tool call in {'content': '<tool_call>\n{"name": "test", "arguments": {"success": true}}\...

Probably not important to make test_tool_call work right now. What I think is more important is to make all the non-slow tests work before merging this PR.

In the near future, we can have a dedicated CI workflow to run all the slow tests. I can set up a HF space with a T4 or L4 GPU, to be discussed with the HF team.

ochafik · 2025-01-30T15:14:25Z

I'm getting this error while running test_tool_call.py btw:

@ngxson hopefully fixed (slight hack): the bartowski version of the model I switched to is (correctly) marking </tool_call> as a special token.

ochafik · 2025-01-30T19:04:23Z

@ngxson I think this is mergeable once you're happy with it; had to disable the plain non-tools jinja test for now (not critical as i've only introduced it to support tool calls), one of many things to follow up on 😅

ngxson · 2025-01-30T19:08:40Z

That sounds ok to me, let's merge this

m18coppola · 2025-01-30T20:43:46Z

minor fix

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index d1ea343d..8efc18ad 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -1813,7 +1813,7 @@ struct server_context {

         n_ctx = llama_n_ctx(ctx);

-        add_bos_token = llama_vocab_get_add_bos(vocab);
+        add_bos_token = llama_vocab_get_add_bos(vocab) && !params.use_jinja;
         has_eos_token = llama_vocab_eos(vocab) != LLAMA_TOKEN_NULL;

         if (!params_base.speculative.model.empty() || !params_base.speculative.hf_repo.empty()) {

Edit: This fix doesn't seem to work 😞 Nonetheless, it seems the Llama models still have a double bos token while using jinja templates

3Simplex · 2025-01-30T21:43:59Z

Today I am getting odd behavior in the WebUI with DeepSeek-R1-Distill-Llama-8B-Q8_0

main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜>'}}{% endif %}, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>'

ochafik · 2025-01-30T21:49:47Z

Today I am getting odd behavior in the WebUI with DeepSeek-R1-Distill-Llama-8B-Q8_0

main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜>'}}{% endif %}, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>'

Thanks for reporting! Which flags / exact model repo id did you launch with? (There's an interference with --jinja I think)

3Simplex · 2025-01-30T22:35:52Z

Thanks for reporting! Which flags / exact model repo id did you launch with? (There's an interference with --jinja I think)

.\llama-server.exe -m "...\DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf" --port 8082 --jinja -c 30720 -ngl 33 -t 8
unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF

brucepro · 2025-01-31T05:32:05Z

Tested with pydantic-ai, had to modify their _util.py since the schema doesn't seem to set tool_call_id and it gets AssertionError: OpenAI requires tool_call_id to be set: ToolCallPart(tool_name='roll_die', args='{}', tool_call_id=None, part_kind='tool-call')
Here is my test code for anyone that needs it.
import random
from pydantic_ai import Agent, RunContext
from pydantic_ai.models.openai import OpenAIModel

model = OpenAIModel(
    'llama3.3-70B',
    base_url='http://127.0.0.1:8080',
    api_key='123',
)

agent = Agent(
    model,
    deps_type=str,
    system_prompt=(
        "You're a dice game, you should roll the die and see if the number "
        "you get back matches the user's guess. If so, tell them they're a winner. "
        "Use the player's name in the response."
    ),
)


@agent.tool_plain
def roll_die() -> str:
    """Roll a six-sided die and return the result."""
    return str(random.randint(1, 6))


@agent.tool
def get_player_name(ctx: RunContext[str]) -> str:
    """Get the player's name."""
    return ctx.deps


dice_result = agent.run_sync('My guess is 4', deps='Anne')
print(dice_result.data)
#> Congratulations Anne, you guessed correctly! You're a winner!

To work around it, modify guard_tool_call_id in site-packages\pydantic_ai\_utils.py (line 201) to ignore the check. If I get a less hacky fix, I will share it. Full pydantic-ai support is pretty cool.

ochafik · 2025-01-31T09:13:22Z

Tested with pydantic-ai, had to modify their _util.py since the schema doesn't seem to set tool_call_id and it gets AssertionError: OpenAI requires tool_call_id to be set: ToolCallPart(tool_name='roll_die', args='{}', tool_call_id=None, part_kind='tool-call')

Hey @brucepro, thanks for sharing your experimentation!

Only a few models seem to spontaneously generate a tool call id on their own (and use it in their template; mostly models that support parallel tool calls), I'll look into forcing it for the others.

Mistral Nemo is one of them, works without hack rn:

llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L

(incidentally, I'm forcing a tool call id for the generic support when "parallel_tool_calls": true is set in the request, but it's not helping here)
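Until the server sets an id for every model, a client can fill in missing ids itself before handing the message to a stricter library. The sketch below is a client-side workaround only, not the fix planned for this PR; the helper name and the `call_` prefix are made up for illustration.

```python
import copy
import uuid


def with_tool_call_ids(message):
    """Return a copy of an assistant message in which every tool call has an id."""
    patched = copy.deepcopy(message)
    for call in patched.get("tool_calls") or []:
        if not call.get("id"):
            # Hypothetical id format; any unique string should do for a client.
            call["id"] = "call_" + uuid.uuid4().hex
    return patched


message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"type": "function", "function": {"name": "roll_die", "arguments": "{}"}, "id": None}
    ],
}
patched = with_tool_call_ids(message)
print(patched["tool_calls"][0]["id"])
```

The same generated id then has to be reused as `tool_call_id` in the tool-result message sent back to the server.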

ochafik · 2025-01-31T12:06:57Z

@brucepro bunch of fixes on their way: #11539

phpmac · 2025-02-01T09:19:23Z

Really awesome!

Kreijstal · 2025-02-01T19:38:26Z

so uh we can finally make models run code with chat?

github-actions bot added the testing, examples, python, and server labels Sep 25, 2024

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine~~ Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Sep 25, 2024

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine~~ Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 25, 2024

ochafik mentioned this pull request Sep 27, 2024

[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389

Closed

15 tasks

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine~~ Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 28, 2024

github-actions bot added the script label Oct 2, 2024

ochafik changed the title ~~Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine~~ Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine Oct 24, 2024

ochafik added 13 commits October 27, 2024 16:44

nits

ec9f3b1

tool-call: slow tool call integration tests

9a86ea7

space nits

c88095e

tool_call: test no tool call on a real model + rename scenarios

7fde6d0

tool-call: script to prefetch models used in server tests

dd6d024

Update tool_call.feature

168add7

tool-call: add tests: tool_call=none, parallel_tool_calls=true

ec547e4

tool-call: remove duplicate script to fetch templates

b51c71c

agent: simplify syntax (default tools to local w/ default port)

74d71a6

tool-call: use Q4_K_M models

b825440

tool-call: update scripts/fetch_server_test_models.py

aefac1e

tool-call: test Hermes-3-Llama-3.1-8B

64287a3

tool-call: use functionary-small-v3.2-Q8_0.gguf in test (Q4_K_M too…

fa4c111

… dumb for function call)

ochafik and others added 4 commits January 30, 2025 14:10

add llama_sampler_init_grammar_lazy instead of renaming the non-lazy

5a64af6

Format test-chat.cpp

f223df0

log prompt + nits

8205246

test: leave model_hf_file blank

5add261

force printing </tool_call> on hermes 2 model if/as it's a special token

1029ff9

try and avoid weird server test failure (spillage / parallelism betwe…

3bd6abe

…en completion & tool call tests?)

ochafik added the enhancement label Jan 30, 2025

ochafik added 2 commits January 30, 2025 17:43

Disable chat_completion tests of non-tool jinja mode

729d2d3

Fix typo

34f54dd

ochafik added the merge ready label Jan 30, 2025

ochafik merged commit 8b576b6 into ggerganov:master Jan 30, 2025
47 checks passed

ochafik mentioned this pull request Jan 30, 2025

Fix --jinja when there's no tools or schema (typo was forcing JSON) #11531

Merged

matteoserva mentioned this pull request Jan 31, 2025

Misc. bug: llama-server ignores the stop parameter #11538

Closed

ochafik mentioned this pull request Jan 31, 2025

tool-call: fix llama 3.x and functionary 3.2, play nice w/ pydantic_ai package, update readme #11539

Merged

This was referenced Jan 31, 2025

server: fix stop regression #11543

Merged

changelog : llama-server REST API #9291

Open

changelog : libllama API #9289

Open
