Batched inference with greedy sampling yields different completions #6583

Closed
mbonacci opened this issue Apr 10, 2024 · 8 comments

@mbonacci

Using the batched.cpp example, modified to use greedy sampling, yields different completions (sample output below).
I'm on Windows, with llama.cpp compiled with w64devkit, on a laptop with an RTX 3070.

Correct me if I'm wrong, but sampling with a greedy sampler (i.e. always picking the most likely next token) should always yield the same result for the same prompt and the same model.

Could this be a result of model quantization? (I'm using a Q6_K-quantized Llama-2-chat GGUF and also tried 8-bit.)
Note: llama.cpp was compiled without CUDA, so this all runs on the CPU.

batched ../models/TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q6_K.gguf  "Hello, my name is D" 4 50 0

sequence 0:

Hello, my name is Drew and I'm a 30-year-old man from the United States. I've been interested in Japanese culture for as long as I can remember, and I've been studying the language

sequence 1:

Hello, my name is Drew and I'm a 30-year-old man from the United States. I've been a fan of anime for as long as I can remember, and I've been lucky

sequence 2:

Hello, my name is Drew and I'm a 30-something year old man from the United States. I've been a fan of anime for as long as I can remember, and I've been lucky

sequence 3:

Hello, my name is Drew and I'm a 30-something year old man from the United States. I've been a fan of anime for as long as I can remember, and I've been lucky
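
For reference, the greedy-sampling modification amounts to something like this in the sampling loop of examples/batched/batched.cpp (a minimal sketch against the llama.cpp C API from around this time; the surrounding variable names follow the example and may differ between versions):

```cpp
// examples/batched/batched.cpp (sketch): sample the next token for sequence i
const int n_vocab = llama_n_vocab(model);
float *   logits  = llama_get_logits_ith(ctx, i_batch[i]);

std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    candidates.emplace_back(llama_token_data{ token_id, logits[token_id], 0.0f });
}

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };

// original example: top-k / top-p / temperature sampling
// llama_sample_top_k(ctx, &candidates_p, 40, 1);
// llama_sample_top_p(ctx, &candidates_p, 0.9f, 1);
// llama_sample_temp (ctx, &candidates_p, 0.4f);
// const llama_token new_token_id = llama_sample_token(ctx, &candidates_p);

// greedy modification: always take the highest-logit token
const llama_token new_token_id = llama_sample_token_greedy(ctx, &candidates_p);
```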
@ggerganov
Owner

This is an effect of using the unified KV cache: ggerganov/whisper.cpp#1941 (comment)
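
In other words (paraphrasing, not quoting, the linked explanation): with the unified KV cache, the attention for each sequence is computed over a shared buffer whose size and layout depend on what else is in the cache, so the same tokens can go through slightly different floating-point arithmetic. Because floating-point addition is not associative, logits can differ in the last few bits, a greedy argmax can then flip to a different token, and the completions diverge from that point on. A tiny standalone C++ illustration of the non-associativity (independent of llama.cpp):

```cpp
#include <cstdio>

int main() {
    // Summing the same values in a different order can give a
    // different float result: floating-point addition is not associative.
    float a = 1e8f, b = -1e8f, c = 1.0f;

    float left  = (a + b) + c;  // 1.0
    float right = a + (b + c);  // 0.0 -- the 1.0 is lost when added to -1e8
    printf("%f vs %f\n", left, right);
    return 0;
}
```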

github-actions bot added the stale label May 11, 2024
@MichaelZhangBH

Hi @ggerganov, I saw your comment at #4130:

In order to resolve these, I think we should add a standard attention implementation where each sequence has its own KV cache buffer and the attention is computed separately. This way, users would be able to choose which implementation to use based on their specific use case.

Is there any plan for this implementation? Greedy generations with different outcomes can sometimes be a problem.

github-actions bot removed the stale label May 15, 2024
@ggerganov
Owner

No plan at the moment on my side. I haven't figured out a good way to implement this yet.

github-actions bot added the stale label Jun 17, 2024
@martindevans
Contributor

I've been investigating the performance of models with batched inference. I had expected slightly different results depending on the number of parallel sequences being evaluated (i.e. some small amount of random noise), but instead I've noticed a very distinct downward trend: more sequences lead to lower accuracy on the test set!

Is this expected?

Evaluating against the Google BoolQ dataset: the vertical axis shows accuracy percentage (note it starts at 48%), and the horizontal axis shows the number of sequences (each sequence answering an independent question):

[Figure: accuracy vs. number of parallel sequences]

github-actions bot removed the stale label Jun 24, 2024
@ggerganov
Owner

This is not expected

@martindevans
Contributor

Thanks for confirming that. I'll do some more digging into this to see if I can turn up anything more.

@martindevans
Contributor

I tried running the BoolQ dataset again, but this time asking each question in N parallel sequences.

As far as I can tell this always produces the same answer across all sequences, no matter how many parallel sequences I run (up to 64). There's some variance in accuracy with different sequence counts, but nothing as large as before. This is not what I had expected! Here's what that looks like:

[Figure: accuracy vs. sequence count]

Note that when running this test I made sure that no tokens were shared between sequences in the prompt batch, so each sequence is totally independent.
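
For context, "no tokens shared between sequences" means each sequence carries its own full copy of the prompt tokens under its own seq_id, rather than sharing a common prefix. A sketch of how such a batch can be built, assuming the llama_batch_add helper from common.h as it existed around this time (helper names may differ between versions):

```cpp
// Build a prompt batch where each of n_seq sequences gets its own copy of
// the prompt tokens, so no KV cache entries are shared between sequences.
llama_batch batch = llama_batch_init((int32_t) (n_seq * prompt_tokens.size()), 0, 1);

for (int32_t seq = 0; seq < n_seq; seq++) {
    for (size_t i = 0; i < prompt_tokens.size(); i++) {
        const bool need_logits = (i == prompt_tokens.size() - 1); // logits only for the last prompt token
        llama_batch_add(batch, prompt_tokens[i], (llama_pos) i, { seq }, need_logits);
    }
}

if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode failed\n");
}
```

The batched.cpp example, by contrast, evaluates the prompt once and (if I remember correctly) copies its KV cache entries to the other sequences with llama_kv_cache_seq_cp, which is exactly the shared-prefix case avoided here.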

github-actions bot added the stale label Jul 25, 2024
github-actions bot commented Aug 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Aug 9, 2024