server: passkey challenge / self-extend with context shift demo #5832

Merged: 28 commits, Mar 2, 2024

Changes from 10 commits

Commits (28)
73a7e42
server: tests: add models endpoint scenario
phymbert Mar 2, 2024
0f774a8
server: /v1/models add some metadata
phymbert Mar 2, 2024
1780d96
server: tests: add debug field in context before scenario
phymbert Mar 2, 2024
319ded7
server: tests: download model from HF, add batch size
phymbert Mar 2, 2024
18e739d
server: tests: add passkey test
phymbert Mar 2, 2024
ab5b06b
server: logs: do not truncate log values
phymbert Mar 2, 2024
60113da
server: tests: add group attention params
phymbert Mar 2, 2024
616d7e9
server: do not truncate prompt tokens if self-extend through group at…
phymbert Mar 2, 2024
2495f72
server: logs: do not truncate log values
phymbert Mar 2, 2024
af82fb4
server: revert change on slot n_ctx
phymbert Mar 2, 2024
3b8242a
server: tests - missing EOL at EOF
phymbert Mar 2, 2024
ed60b97
server: tests - fix passkey not using pre/suffix
phymbert Mar 2, 2024
cf4c86e
server: tests - passkey - first good working value of nga
phymbert Mar 2, 2024
f8773f7
server: tests - passkey - limit the number of max tokens to predix
phymbert Mar 2, 2024
a80533e
server: tests - passkey - limit the number of max tokens to predix
phymbert Mar 2, 2024
8abf8d3
server: tests: fix server timeout
phymbert Mar 2, 2024
407cc60
server: tests: fix passkey, add doc, fix regex content matching, fix …
phymbert Mar 2, 2024
178b0c6
server: tests: fix regex content matching
phymbert Mar 2, 2024
9ab72d7
server: tests: schedule slow tests on master
phymbert Mar 2, 2024
9fcfa63
server: tests: schedule slow tests on master
phymbert Mar 2, 2024
61b9791
server: metrics: fix when no prompt processed
phymbert Mar 2, 2024
763ae0a
Merge remote-tracking branch 'origin/tests/server/passkey' into tests…
phymbert Mar 2, 2024
830d0ef
server: tests: CI workflow failed on first scenario failed
phymbert Mar 2, 2024
1aa5ad9
server: tests: fix re content
phymbert Mar 2, 2024
c1f66f0
server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
phymbert Mar 2, 2024
2cdd21e
server: tests: increase timeout for completion
phymbert Mar 2, 2024
a6ea725
server: tests: keep only the PHI-2 test
phymbert Mar 2, 2024
0c7f5b2
server: tests: passkey add a negative test
phymbert Mar 2, 2024
6 changes: 0 additions & 6 deletions .github/workflows/server.yml
@@ -70,12 +70,6 @@ jobs:
run: |
pip install -r examples/server/tests/requirements.txt

- name: Download models
id: download_models
run: |
cd examples/server/tests
../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf

- name: Tests
id: server_integration_test
run: |
39 changes: 27 additions & 12 deletions examples/server/server.cpp
@@ -441,8 +441,8 @@ struct llama_server_context
const int ga_w = params.grp_attn_w;

if (ga_n != 1) {
GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
GGML_ASSERT(ga_n > 0 && "ga_n must be positive"); // NOLINT
GGML_ASSERT(ga_w % ga_n == 0 && "ga_w must be a multiple of ga_n"); // NOLINT
//GGML_ASSERT(n_ctx_train % ga_w == 0 && "n_ctx_train must be a multiple of ga_w"); // NOLINT
//GGML_ASSERT(n_ctx >= n_ctx_train * ga_n && "n_ctx must be at least n_ctx_train * ga_n"); // NOLINT

@@ -1709,8 +1709,8 @@ struct llama_server_context
}
slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);

// if input prompt is too big, truncate it
if (slot.n_prompt_tokens >= slot.n_ctx)
// if input prompt is too big, truncate it, if group attention self-extend is disabled
if (slot.ga_n == 1 && slot.n_prompt_tokens >= slot.n_ctx)
{
const int n_left = slot.n_ctx - slot.params.n_keep;
const int n_block_size = n_left / 2;
@@ -1785,9 +1785,11 @@
}

LOG_INFO("slot progression", {
{ "slot_id", slot.id },
{ "task_id", slot.task_id },
{ "n_past", slot.n_past },
{ "slot_id", slot.id },
{ "task_id", slot.task_id },
{ "n_past", slot.n_past },
{ "n_past_se", slot.n_past_se },
{ "ga_i", slot.ga_i },
{ "n_prompt_tokens_processed", slot.n_prompt_tokens_processed }
});
}
@@ -2001,6 +2003,17 @@ struct llama_server_context
LOG_VERBOSE("slots updated", {});
return true;
}

json model_meta() {
return json{
{"vocab_type", llama_vocab_type(model)},
{"n_vocab", llama_n_vocab(model)},
{"n_ctx_train", llama_n_ctx_train(model)},
{"n_embd", llama_n_embd(model)},
{"n_params", llama_model_n_params(model)},
{"size", llama_model_size(model)},
};
}
};

static void server_print_usage(const char *argv0, const gpt_params &params,
@@ -2994,6 +3007,7 @@ int main(int argc, char **argv)
state.store(SERVER_STATE_READY);
LOG_INFO("model loaded", {});
}
const auto model_meta = llama.model_meta();

if (sparams.chat_template.empty()) { // custom chat template is not supplied
// check if the template comes with the model is supported by us
@@ -3143,7 +3157,7 @@ int main(int argc, char **argv)
}
});

svr.Get("/v1/models", [&params](const httplib::Request& req, httplib::Response& res)
svr.Get("/v1/models", [&params, &model_meta](const httplib::Request& req, httplib::Response& res)
{
res.set_header("Access-Control-Allow-Origin", req.get_header_value("Origin"));
std::time_t t = std::time(0);
@@ -3152,10 +3166,11 @@
{"object", "list"},
{"data", {
{
{"id", params.model_alias},
{"object", "model"},
{"created", t},
{"owned_by", "llamacpp"}
{"id", params.model_alias},
{"object", "model"},
{"created", t},
{"owned_by", "llamacpp"},
{"meta", model_meta}
},
}}
};
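
For illustration, a minimal sketch of exercising this change once the branch is built: the binary path, model file, port, and self-extend values below are assumptions, and the metadata values depend on the loaded model.

```shell
# Start the server with group-attention self-extend enabled, so long prompts
# are no longer truncated to the slot context (sketch; the --grp-attn-n /
# --grp-attn-w values and the model path are illustrative assumptions).
./build/bin/server -m models/phi-2.Q4_K_M.gguf \
    -c 8192 --grp-attn-n 4 --grp-attn-w 2048 --port 8080 &

# The OpenAI-compatible models endpoint now exposes the fields built by
# model_meta() under a "meta" key (values shown are illustrative):
curl -s http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"phi-2","object":"model","created":1709337600,
#   "owned_by":"llamacpp","meta":{"vocab_type":1,"n_vocab":51200,
#   "n_ctx_train":2048,"n_embd":2560,"n_params":2779683840,"size":1615568768}}]}
```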
50 changes: 35 additions & 15 deletions examples/server/tests/README.md
@@ -1,47 +1,67 @@
# Server tests

Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
* [issues.feature](./features/issues.feature) Pending issues scenario
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API Key
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
Python-based server test scenarios using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development)
and [behave](https://behave.readthedocs.io/en/latest/):

* [issues.feature](./features/issues.feature) Pending issues scenario
* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API Key
* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...

Tests target GitHub workflows job runners with 4 vCPU.

Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client.
Requests use an [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html) /
[asyncio](https://docs.python.org/fr/3/library/asyncio.html) based HTTP client.

Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`.
Note: If the host machine's inference speed is faster than the GitHub runners', the parallel scenario may fail randomly.
To mitigate this, you can increase the `n_predict` and `kv_size` values.

### Install dependencies

`pip install -r requirements.txt`

### Run tests

1. Build the server

```shell
cd ../../..
mkdir build
cd build
cmake ../
cmake --build . --target server
```
2. download required models:
1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the test: `./tests.sh`

2. Start the test: `./tests.sh`

It's possible to override some scenario step values with environment variables, as shown in the example after the table:
- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080`
- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server`
- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose`
- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format

| variable | description |
|--------------------------|------------------------------------------------------------------------------------------------|
| `PORT` | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
| `LLAMA_SERVER_BIN_PATH` | to change the server binary path, default: `../../../build/bin/server` |
| `DEBUG` | "ON" to enable steps and server verbose mode `--verbose` |
| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format |
| `N_GPU_LAYERS` | number of model layers to offload to VRAM `-ngl --n-gpu-layers` |
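
For example, a minimal sketch combining several of these variables in a single invocation (the port and layer count are arbitrary example values):

```shell
# Run the scenarios against a locally built server binary on a non-default
# port, with verbose step/server output and no GPU offload.
LLAMA_SERVER_BIN_PATH=../../../build/bin/server \
PORT=8888 \
DEBUG=ON \
N_GPU_LAYERS=0 \
./tests.sh
```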

### Run @bug, @wip or @wrong_usage annotated scenario

Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.

- The `@bug` annotation links a scenario to a GitHub issue.
- `@wrong_usage` scenarios show user issues that are actually expected behavior
- `@wip` marks a scenario that is a work in progress
- `@slow` marks a heavy test, disabled by default

To run a scenario annotated with `@bug`, start:
`DEBUG=ON ./tests.sh --no-skipped --tags bug`

```shell
DEBUG=ON ./tests.sh --no-skipped --tags bug
```

After changing logic in `steps.py`, ensure that the `@bug` and `@wrong_usage` scenarios are updated.

```shell
./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile"
```
5 changes: 4 additions & 1 deletion examples/server/tests/features/environment.py
@@ -7,7 +7,10 @@


def before_scenario(context, scenario):
print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
context.debug = 'DEBUG' in os.environ and os.environ['DEBUG'] == 'ON'
if context.debug:
print("DEBUG=ON\n")
print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m\n")
port = 8080
if 'PORT' in os.environ:
port = int(os.environ['PORT'])
5 changes: 3 additions & 2 deletions examples/server/tests/features/parallel.feature
@@ -1,11 +1,12 @@
@llama.cpp
@parallel
Feature: Parallel

Background: Server startup
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model alias tinyllama-2
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And 42 as server seed
And 512 as batch size
And 64 KV cache size
And 2 slots
And embeddings extraction
53 changes: 53 additions & 0 deletions examples/server/tests/features/passkey.feature
@@ -0,0 +1,53 @@
#@llama.cpp
@passkey
@wip
@slow
@bug
Feature: Passkey / Self-extend with context shift

Background: Server startup
Given a server listening on localhost:8080

# Generates a long text of junk and inserts a secret passkey number inside it.
# We process the entire prompt using batches of n_batch and shifting the cache
# when it is full and then we query the LLM for the secret passkey.
# see #3856 and #4810
Scenario Outline: Passkey
Given a model file <hf_file> from HF repo <hf_repo>
And <n_batch> as batch size
And <n_junk> as number of junk
And a self-extend context with a factor of <n_grp>
And <seed> as seed
And a KV cache size based on the model trained context <n_ctx_train> extended by <n_grp> with additional <n_keep> tokens
And <n_slots> slots
And <n_ga> group attention factor to extend context size through self-extend
And <n_ga_w> group attention width to extend context size through self-extend
    # Can be overridden with N_GPU_LAYERS
And <ngl> GPU offloaded layers
Then the server is starting
Then the server is healthy
Given available models
Then model 0 is trained on <n_ctx_train> tokens context
Given a prefix prompt:
"""
here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.
"""
And a passkey prompt template:
"""
The pass key is <passkey> Remember it. <passkey> is the pass key.
"""
And a junk suffix prompt:
"""
The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.
"""
And a suffix prompt:
"""
What is the pass key? The pass key is
"""
Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk
And a completion request with no api error
Then <n_predicted> tokens are predicted matching <re_content>

Examples:
| hf_repo | hf_file | n_ctx_train | ngl | n_batch | n_slots | n_ga | n_ga_w | n_junk | n_grp | i_pos | seed | n_keep | passkey | n_predicted | re_content |
| TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048 | 5 | 512 | 1 | 4 | 2048 | 250 | 4 | 50 | 86 | 32 | 42 | -1 | .*42.* |
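
As a sketch, the scenario outline above can be run on its own using the tag-based invocation documented in the tests README; `--no-skipped` is needed because the feature is also annotated `@wip` and `@slow`, which are excluded by default:

```shell
# Run only the passkey scenarios with verbose output (sketch, assuming the
# tags shown in this feature file).
DEBUG=ON ./tests.sh --no-skipped --tags passkey
```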
3 changes: 2 additions & 1 deletion examples/server/tests/features/security.feature
@@ -1,9 +1,10 @@
@llama.cpp
@security
Feature: Security

Background: Server startup with an api key defined
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And a server api key llama.cpp
Then the server is starting
Then the server is healthy
11 changes: 9 additions & 2 deletions examples/server/tests/features/server.feature
@@ -1,15 +1,17 @@
@llama.cpp
@server
Feature: llama.cpp server

Background: Server startup
Given a server listening on localhost:8080
And a model file stories260K.gguf
And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
And a model alias tinyllama-2
And 42 as server seed
# KV Cache corresponds to the total amount of tokens
# that can be stored across all independent sequences: #4130
# see --ctx-size and #5568
And 32 KV cache size
And 512 as batch size
And 1 slots
And embeddings extraction
And 32 server max tokens to predict
@@ -75,10 +77,15 @@ Feature: llama.cpp server
When an OAI compatible embeddings computation request for multiple inputs
Then embeddings are generated


Scenario: Tokenize / Detokenize
When tokenizing:
"""
What is the capital of France ?
"""
Then tokens can be detokenize

Scenario: Models available
Given available models
Then 1 models are supported
Then model 0 is identified by tinyllama-2
Then model 0 is trained on 128 tokens context