llama : move the sampling API from common into llama lib #5214
Is this meant as a short-term stopgap measure? If we are going to add a new sampling API to llama.cpp, it would be good to do this from the ground up with the possibility of GPU sampling in mind. The implementation is
This change is more relevant for CPU-based sampling. There are many use cases that require managing sampling state (e.g. previously sampled tokens, grammar state, etc.), so it makes sense to add support for that directly in the core library. I haven't thought deeply about GPU sampling support. Wouldn't it make more sense to have a limited number of GPU sampling options (such as greedy and top-k) as part of
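For context, a minimal sketch of the kind of per-sequence state being referred to here; the struct and its field names are illustrative only and not part of the llama.cpp API (in llama.cpp, `llama_token` comes from `llama.h`):

```cpp
#include <cstdint>
#include <vector>

typedef int32_t llama_token; // stand-in for the type defined in llama.h

// Illustrative only: the kind of state a core sampling API would need to own
// per sequence if it is moved out of common and into the library.
struct sampling_state_sketch {
    std::vector<llama_token> prev;    // previously sampled tokens (repetition penalties)
    void *                   grammar; // opaque grammar state, if a grammar is attached

    // sampling parameters that naturally travel with the state
    float   temp;
    int32_t top_k;
    float   top_p;
};
```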
It's clear that some samplers cannot have GPU implementations, but that doesn't mean we need two different APIs for GPU and CPU sampling. We could define a sampler as an abstract object that may or may not contain state, and that may provide a ggml implementation, a CPU implementation, or both. Then we would assemble a pipeline of sampler objects that runs at the end of the model evaluation. If all the samplers in the pipeline have ggml implementations, the whole pipeline can run on the GPU; otherwise at least some parts would still run on the CPU. I think it is mostly a matter of designing a flexible enough interface.
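Purely as an illustration of that idea (none of these types or functions exist in llama.cpp; `ggml_context` and `ggml_tensor` are only forward-declared here):

```cpp
#include <cstdint>
#include <vector>

struct ggml_context; // from ggml.h
struct ggml_tensor;  // from ggml.h

// Hypothetical sampler interface: each sampler may carry private state and may
// provide a ggml (device) implementation, a CPU implementation, or both.
struct sampler_iface {
    // optional: append this sampler's ops to a ggml graph operating on the logits tensor
    ggml_tensor * (*apply_ggml)(void * state, ggml_context * gctx, ggml_tensor * logits);

    // optional: CPU fallback operating on a plain array of logits
    void (*apply_cpu)(void * state, float * logits, int32_t n_vocab);

    void (*free_state)(void * state);
};

struct sampler {
    sampler_iface iface;
    void *        state;
};

// The pipeline runs entirely on the GPU only if every stage has a ggml implementation;
// otherwise the remaining stages fall back to the CPU path.
static bool pipeline_runs_on_gpu(const std::vector<sampler> & pipeline) {
    for (const auto & s : pipeline) {
        if (s.iface.apply_ggml == nullptr) {
            return false;
        }
    }
    return true;
}
```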
Ok, I will give it further thought. One way that comes to mind is something like this:

```c
int32_t llama_decode_with_sampling(
        struct llama_context * ctx,
        struct llama_sampling_context * ctx_s,
        struct llama_batch batch,
        llama_token * result);
```
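A rough sketch of how such a call could sit in a generation loop. `llama_decode_with_sampling` is only the proposal above; `ctx`, `ctx_s`, `first_token`, `n_prompt`, and `n_predict` are assumed to be set up as in the usual llama.cpp examples, and `llama_batch_get_one` is shown with the signature it had at the time of this discussion:

```cpp
llama_token tok    = first_token; // last token of the evaluated prompt
llama_pos   n_past = n_prompt;    // position after the prompt

for (int i = 0; i < n_predict; ++i) {
    // single-token batch for the last sampled token
    llama_batch batch = llama_batch_get_one(&tok, 1, n_past, 0);

    llama_token result;
    if (llama_decode_with_sampling(ctx, ctx_s, batch, &result) != 0) {
        break; // decode/sampling failed
    }

    tok     = result; // feed the sampled token back in on the next iteration
    n_past += 1;
}
```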
It's also important to allow multiple evaluations to be queued together; that's one of the biggest advantages of GPU sampling. That can be done by making
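As an illustration of the queuing point only (none of these functions exist, and this is not a proposed API): if sampling is part of the same graph as the evaluation, the sampled token can stay on the device and feed the next evaluation, so several steps can be queued before a single synchronization with the host:

```cpp
#include <vector>

// Hypothetical calls, for illustration only.
const int n_queued = 8;

for (int i = 0; i < n_queued; ++i) {
    // queue eval + sampling; the sampled token stays on the device and is
    // consumed by the next queued step without a round trip to the host
    queue_decode_with_sampling(ctx, ctx_s);
}

// one host<->device synchronization for the whole chunk of tokens
std::vector<llama_token> results(n_queued);
fetch_queued_samples(ctx_s, results.data(), n_queued);
```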
Yes, this might get tricky when considering multiple sequences in the batch, but it seems doable. Let me know if you have other concerns about merging. If we do that, then the
This issue is stale because it has been open for 30 days with no activity.
Not stale.
This issue was closed because it has been inactive for 14 days since being marked as stale.
So that I better understand this approach: would this involve updating
Resolved via #9294
There is functionality around `llama_sampling_context` that is currently part of `common`. We should move it into `llama`. Pretty much the entire API from `common/sampling.h`, except `llama_sampling_params` and `llama_sampling_sample`, can be integrated into the library.

This would probably also require merging the grammar parser into the `llama` lib implementation.

`llama_sampling_params` and `llama_sampling_sample` will stay in `common`, since they are very example-specific and not general-purpose enough to be merged.
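For reference, this is roughly how the `common/sampling.h` helpers are used by the examples (signatures are approximate for the API as it existed at the time, and `sparams`, `ctx`, and `has_next_token` are assumed from the surrounding example code). The stateful init/accept/free part is what would move into the `llama` library, while `llama_sampling_params` and the `llama_sampling_sample` wrapper would remain in `common`:

```cpp
// approximate usage of the common sampling helpers at the time of this issue
struct llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);

while (has_next_token) {
    // example-level wrapper that applies the full sampler chain: stays in common
    const llama_token id = llama_sampling_sample(ctx_sampling, ctx, /*ctx_cfg=*/nullptr);

    // state update (previous tokens, grammar): candidate for the core library
    llama_sampling_accept(ctx_sampling, ctx, id, /*apply_grammar=*/true);

    // ... decode the next batch with `id`, update has_next_token ...
}

llama_sampling_free(ctx_sampling);
```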