Static lookup decoding #5398

Closed


JohannesGaessler (Collaborator)

I had an idea for an alternative approach to lookup decoding. Instead of looking at the context, I'm extracting token sequences from a static corpus of text and then using the most common sequences in that corpus to construct a draft. It somewhat works, but honestly the results are not very good. With predictions based on the previous 2 tokens, wikitext train as the static text corpus, and a prompt that generates a story, the acceptance rate of the draft is only ~10%.
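For illustration, here is a minimal C++ sketch of the idea described above, assuming a 2-token context. This is not the code on this branch; `llama_token`, `build_cache`, and `draft_tokens` are hypothetical stand-ins:

```cpp
// Minimal sketch: build a table mapping the previous 2 tokens to counts of the
// token that followed them in a static corpus, then draft the most frequent
// continuation. llama_token is a stand-in for the type defined in llama.h.
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

using llama_token = int32_t;

// (token[i-2], token[i-1]) -> observed next tokens with their counts
using ngram_cache = std::map<std::pair<llama_token, llama_token>,
                             std::unordered_map<llama_token, int>>;

// Scan the tokenized corpus once and count which token follows each 2-token context.
static ngram_cache build_cache(const std::vector<llama_token> & corpus) {
    ngram_cache cache;
    for (size_t i = 2; i < corpus.size(); ++i) {
        cache[{corpus[i-2], corpus[i-1]}][corpus[i]]++;
    }
    return cache;
}

// Draft up to n_draft tokens by repeatedly taking the most common continuation
// of the last two tokens; stop early if a context was never seen in the corpus.
static std::vector<llama_token> draft_tokens(const ngram_cache & cache,
                                             llama_token t0, llama_token t1, int n_draft) {
    std::vector<llama_token> draft;
    for (int i = 0; i < n_draft; ++i) {
        const auto it = cache.find({t0, t1});
        if (it == cache.end()) {
            break;
        }
        llama_token best  = -1;
        int         count = 0;
        for (const auto & [tok, cnt] : it->second) {
            if (cnt > count) {
                best  = tok;
                count = cnt;
            }
        }
        draft.push_back(best);
        t0 = t1;
        t1 = best;
    }
    return draft;
}
```

In this sketch the draft simply stops as soon as the current 2-token context was never seen in the corpus, which keeps incorrect drafts short.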

With the commands

export model_name=mixtral_instruct-8x7b && export quantization=q8_0
./lookup-static --model models/opt/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --n-predict 1024 --seed 1337 --draft 1 --color --prompt "[INST] Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of eotional and dramatic language. [/INST]"

I get this output:

[Screenshot: generated output for the story prompt]

It would be great if it turns out I just did the implementation wrong, but my intuition is that language is just not as easily predictable as I had hoped. If there is a desire to turn this into a proper PR I could do it, but I personally believe this has more value as a negative result, i.e. to prevent other devs from wasting their time on this approach.

JohannesGaessler (Collaborator, Author)

I forgot to say: with a prompt like "[INST] Explain to me how a type Ia supernova occurs. [/INST]", which results in an output more similar to the text corpus, the acceptance rate is also only ~10%:

[Screenshot: generated output for the factual prompt]

And for those cases the lookup decoding that is already on master works a lot better (~50% acceptance rate) because there is a lot more repetition.

ggerganov (Member)

Having a custom ngram cache that is dynamically adjusted based on what the user generates locally should significantly improve the acceptance rate. I wrote a bit about this here: #4235
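A hedged sketch of what such a dynamic adjustment could look like, reusing the hypothetical `ngram_cache` and `llama_token` types from the sketch above (this is not the implementation discussed in #4235):

```cpp
// Fold newly generated tokens back into the cache so that locally repeated
// phrasing starts to dominate the counts for its 2-token contexts.
// `ctx` is the full token sequence generated so far; only the newest trigram
// needs to be counted after each accepted token.
static void update_cache(ngram_cache & cache, const std::vector<llama_token> & ctx) {
    if (ctx.size() < 3) {
        return;
    }
    const size_t i = ctx.size() - 1;
    cache[{ctx[i-2], ctx[i-1]}][ctx[i]]++;
}
```

Called after every accepted token, this would gradually bias drafts towards the user's own output.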

JohannesGaessler (Collaborator, Author)

Okay, it seems my implementation had a bug where one of the hashmaps wasn't being updated correctly. With the fix, and with an additional filter that only accepts sequences that have a relative frequency of >= 50%, I get much better results:

[Screenshot: output after the fix and the frequency filter]

With the story prompt I get a ~28% acceptance rate, and ~24% with the factual prompt. This is potentially something that could be workable. The correctly predicted tokens only make up ~5% of the generated text though, so the maximum theoretical speedup is still low. But maybe you could combine this technique with the lookup decoding implementation on master to get more token predictions.
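A rough sketch of such a relative-frequency filter, again using the hypothetical `ngram_cache` from the earlier sketch rather than the actual code on this branch: a token is only drafted if it makes up at least half of all continuations observed after the current 2-token context.

```cpp
// Return true and write the draft token to `out` only if the most common
// continuation of (t0, t1) accounts for >= 50% of all observed continuations.
static bool pick_draft_token(const ngram_cache & cache,
                             llama_token t0, llama_token t1, llama_token & out) {
    const auto it = cache.find({t0, t1});
    if (it == cache.end()) {
        return false;
    }
    int best_count = 0;
    int total      = 0;
    for (const auto & [tok, cnt] : it->second) {
        total += cnt;
        if (cnt > best_count) {
            best_count = cnt;
            out        = tok;
        }
    }
    return 2*best_count >= total;
}
```

Requiring a clear majority like this trades draft length for acceptance rate: fewer tokens get drafted, but the ones that do are much more likely to be accepted.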

JohannesGaessler (Collaborator, Author)

It seems that the size of the text corpus makes a large difference. I used wikitext-103 (~50x larger) instead of wikitext-2, and the results are much better:

[Screenshot: output with wikitext-103 as the corpus]

The acceptance rate has increased to ~50% and ~10% of the final result consists of correctly predicted tokens.

JohannesGaessler (Collaborator, Author)

Obsoleted by #5479.
