Parallel Inferencing? #623
Comments
LLamaSharp intends to be thread-safe, but that's a bit tricky due to some thread-safety issues in llama.cpp itself. At the moment it's set up so there's a global lock taken while inference is running - this means you can dispatch as much work as you like from as many threads as you like (but ultimately, only one inference will run at a time). Hopefully in the near future we'll be able to remove this lock!
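A rough sketch of what that concurrent dispatch can look like, assuming weights and parameters have already been loaded with LLamaWeights.LoadFromFile (as in the examples further down) - exact API shapes may vary between versions:
// Fire several prompts concurrently; the global lock inside LLamaSharp
// serializes the actual inference, but dispatching like this is safe.
var prompts = new[] { "Question: What is 2 + 2?\nAnswer:", "Question: Name a colour.\nAnswer:" };
var results = await Task.WhenAll(prompts.Select(async prompt =>
{
    var executor = new StatelessExecutor(weights, parameters);
    var text = "";
    await foreach (var piece in executor.InferAsync(prompt, new InferenceParams { MaxTokens = 32 }))
        text += piece;
    return text;
}));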
Thanks Martin. Do you know if llama.cpp can support taking in multiple prompts with one call? It looks like the llama.cpp web server can support it, but I'm not sure how stable the feature is.
LLamaSharp has the BatchedExecutor, which is an entirely new executor I've been working on. You can spawn multiple "Conversations", which can all be prompted, and then inference runs for all of them simultaneously.
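A minimal sketch of that pattern (assuming a model and parameters loaded as in the examples later in this thread):
// Several conversations prompted independently, then evaluated together in one batch.
using var executor = new BatchedExecutor(model, parameters);

using var convA = executor.Create();
using var convB = executor.Create();
convA.Prompt("The capital of France is");
convB.Prompt("The tallest mountain on Earth is");

// One call runs inference for every conversation with pending tokens.
await executor.Infer();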
Thanks Martin. I will try it out with the next release.
The …
Cool. I will test it out this week.
Hi, I've been wanting to try this out myself but I can't get the example to work.
using LLama.Batched;
using LLama.Common;
using LLama.Native;
using LLama.Sampling;
using Spectre.Console;
namespace LLama.Examples.Examples;
/// <summary>
/// This demonstrates using a batch to generate two sequences and then using one
/// sequence as the negative guidance ("classifier free guidance") for the other.
/// </summary>
///
public class BatchedExecutorGuidance
{
private const int n_len = 32;
public static async Task Run()
{
string modelPath = @"/mnt/models/openchat-3.5-0106.Q8_0.gguf";
var parameters = new ModelParams(modelPath);
using var model = LLamaWeights.LoadFromFile(parameters);
var positivePrompt = AnsiConsole.Ask("Positive Prompt (or ENTER for default):", "My favourite colour is").Trim();
var negativePrompt = AnsiConsole.Ask("Negative Prompt (or ENTER for default):", "I hate the colour red. My favourite colour is").Trim();
var weight = AnsiConsole.Ask("Guidance Weight (or ENTER for default):", 2.0f);
// Create an executor that can evaluate a batch of conversations together
using var executor = new BatchedExecutor(model, parameters);
// Print some info
var name = executor.Model.Metadata.GetValueOrDefault("general.name", "unknown model name");
Console.WriteLine($"Created executor with model: {name}");
// Load the two prompts into two conversations
// using var guided = executor.Create();
// guided.Prompt(positivePrompt);
using var guided = executor.Prompt(positivePrompt);
// using var guidance = executor.Create();
// guidance.Prompt(negativePrompt);
using var guidance = executor.Prompt(negativePrompt);
// Run inference to evaluate prompts
await AnsiConsole
.Status()
.Spinner(Spinner.Known.Line)
.StartAsync("Evaluating Prompts...", _ => executor.Infer());
// Fork the "guided" conversation. We'll run this one without guidance for comparison
using var unguided = guided.Fork();
// Run inference loop
var unguidedSampler = new GuidedSampler(null, weight);
var unguidedDecoder = new StreamingTokenDecoder(executor.Context);
var guidedSampler = new GuidedSampler(guidance, weight);
var guidedDecoder = new StreamingTokenDecoder(executor.Context);
await AnsiConsole
.Progress()
.StartAsync(async progress =>
{
var reporter = progress.AddTask("Running Inference", maxValue: n_len);
for (var i = 0; i < n_len; i++)
{
if (i != 0)
await executor.Infer();
// Sample from the "unguided" conversation. This is just a conversation using the same prompt, without any
// guidance. This serves as a comparison to show the effect of guidance.
var u = unguidedSampler.Sample(executor.Context.NativeHandle, unguided.Sample().ToArray(), Array.Empty<LLamaToken>());
unguidedDecoder.Add(u);
unguided.Prompt(u);
// Sample from the "guided" conversation. This sampler will internally use the "guidance" conversation
// to steer the conversation. See how this is done in GuidedSampler.ProcessLogits (bottom of this file).
var g = guidedSampler.Sample(executor.Context.NativeHandle, guided.Sample().ToArray(), Array.Empty<LLamaToken>());
guidedDecoder.Add(g);
// Use this token to advance both guided _and_ guidance. Keeping them in sync (except for the initial prompt).
guided.Prompt(g);
guidance.Prompt(g);
// Early exit if we reach the natural end of the guided sentence
if (g == model.EndOfSentenceToken)
break;
// Update progress bar
reporter.Increment(1);
}
});
AnsiConsole.MarkupLine($"[green]Unguided:[/][white]{unguidedDecoder.Read().ReplaceLineEndings(" ")}[/]");
AnsiConsole.MarkupLine($"[green]Guided:[/][white]{guidedDecoder.Read().ReplaceLineEndings(" ")}[/]");
}
private class GuidedSampler(Conversation? guidance, float weight)
: BaseSamplingPipeline
{
public override void Accept(SafeLLamaContextHandle ctx, LLamaToken token)
{
}
public override ISamplingPipeline Clone()
{
throw new NotSupportedException();
}
protected override void ProcessLogits(SafeLLamaContextHandle ctx, Span<float> logits, ReadOnlySpan<LLamaToken> lastTokens)
{
if (guidance == null)
return;
// Get the logits generated by the guidance sequences
var guidanceLogits = guidance.Sample();
// Use those logits to guide this sequence
// NativeApi.llama_sample_apply_guidance(ctx, logits, guidanceLogits, weight);
}
protected override LLamaToken ProcessTokenDataArray(SafeLLamaContextHandle ctx, LLamaTokenDataArray candidates, ReadOnlySpan<LLamaToken> lastTokens)
{
candidates.Temperature(ctx, 0.8f);
candidates.TopK(ctx, 25);
return candidates.SampleToken(ctx);
}
protected override IReadOnlyList<LLamaToken> GetProtectedTokens(SafeLLamaContextHandle ctx)
{
throw new NotImplementedException();
}
}
}
I'm stuck right now because I don't know how to implement GetProtectedTokens. If I'm missing something please let me know.
I'd suggest cloning the master branch and working with that.
I've been experimenting with the BatchedExecutor. When I attempt to create multiple conversations, I get an error as soon as the second one is prompted.
So I guess my actual question is: what parts of the pipeline can actually run in parallel?
The parallelism is built into it - when you call Infer(), inference runs for every conversation that has pending tokens, all in a single batch.
The key for …
I got that figured out after a bit as well @AsakusaRinne, but thanks for the advice nonetheless. I managed to get a working iteration of it and thought I'd share it for people who find this issue as well.
public class MultiChatSession(BatchedExecutor executor)
{
private BatchedExecutor _executor = executor;
private Dictionary<Conversation, DataStream> _conversations = [];
private readonly SemaphoreSlim inferenceLock = new(1, 1);
private bool _inferenceStarted = false;
public async Task<DataStream> Add(string prompt)
{
Console.WriteLine("Attempting to add a conversation, waiting to acquire lock");
await inferenceLock.WaitAsync(); // Acquire lock before adding
try
{
Console.WriteLine("Started adding a conversation");
var dataStream = new DataStream();
var conversation = _executor.Create();
conversation.Prompt(prompt);
_conversations.Add(conversation, dataStream);
Console.WriteLine("A conversation was added");
return dataStream;
}
finally
{
inferenceLock.Release(); // Release lock after adding
Console.WriteLine("Done adding conversation, lock released");
if (!_inferenceStarted)
{
_inferenceStarted = true;
_ = Task.Run(RunInferenceLoopAsync);
}
}
}
private async Task RunInferenceLoopAsync()
{
while(true)
{
Console.WriteLine("Attempting to start inference iteration, waiting to acquire lock");
await inferenceLock.WaitAsync(); // Acquire lock before starting inference
try
{
Console.WriteLine("Inference iteration started");
await _executor.Infer();
var decoder = new StreamingTokenDecoder(_executor.Context);
var ctx = _executor.Context.NativeHandle;
foreach (var kvp in _conversations)
{
using var sampler = new DefaultSamplingPipeline();
var token = sampler.Sample(ctx, kvp.Key.Sample(), Array.Empty<LLamaToken>());
sampler.Accept(ctx, token);
if (token == _executor.Model.EndOfSentenceToken)
{
kvp.Value.MarkAsCompleted();
_conversations.Remove(kvp.Key);
}
else
{
decoder.Add(token);
kvp.Key.Prompt(token);
kvp.Value.AddToken(decoder.Read());
}
}
Console.WriteLine("Inference iteration completed");
}
finally
{
inferenceLock.Release();
Console.WriteLine("Done with inference iteration, lock released");
}
if (_conversations.Count == 0)
{
_inferenceStarted = false;
break;
}
}
}
}
A point I wanted in my application was the ability to add a conversation at any time, and the challenges with that were mostly making sure no two conversations were added at the same time and that no conversation was added during inference. This version is definitely somewhat verbose, but the logging helped me see when the thread was locked; otherwise it'd be really easy to get stuck somewhere. Edit: Made some code changes so the inference loop only runs when there are conversations to be inferred.
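For context, a hypothetical way this class could be consumed - DataStream and its ReadAllAsync here are just placeholders for however the stream exposes its text:
// Hypothetical usage of the MultiChatSession above.
// Assumes `parameters` is a ModelParams pointing at your model file.
using var model = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(model, parameters);
var session = new MultiChatSession(executor);

// Conversations can be added at any time; the semaphore ensures additions
// never overlap with a running inference iteration.
var stream = await session.Add("Write a haiku about spring.");

await foreach (var text in stream.ReadAllAsync())
    Console.Write(text);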
Thanks @WesselvanGils. I had tried the llama.cpp web server with 2 slots and continuous batching, but it seemed to be more than twice as slow as single-prompt processing.
Try the BatchedExecutor demo.
I had taken a look at the BatchedExecutor demo but was confused about how to add different prompts and get separate streaming results for each prompt.
@AshD Basically, every time you create a new conversation with executor.Create() the executor takes note, and when running inference it will infer for every conversation that was created. The example class I made shows this off pretty well, I believe.
The basic flow for the batched executor is:
using var conversation = executor.Create();
conversation.Prompt("Hello AshD");
await executor.Infer();
// Sampling can be done however you like. There's a whole sampling pipeline infrastructure in LLamaSharp you can use. Here's a basic one
var token = new DefaultSamplingPipeline().Sample(executor.Context.NativeHandle, conversation.Sample(), Array.Empty<LLamaToken>());
conversation.Prompt(token);
You can have as many conversations as you like, although of course they do take up extra memory for the KV cache. If you have several conversations with totally unrelated prompts you will get a small speedup. Where it gets really cool is when multiple conversations share a common prefix - for example, if you have a system prompt you can evaluate it once and then fork that conversation, and now you have 2 conversations using exactly the same bit of KV cache (no extra memory consumed, and you didn't have to re-evaluate the entire prompt).
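A small sketch of that fork pattern (the prompts are placeholders, and executor is a BatchedExecutor as above):
// Evaluate a shared system prompt once, then fork so both conversations
// reuse the same KV cache for that prefix.
using var root = executor.Create();
root.Prompt("You are a helpful assistant.");
await executor.Infer();                 // system prompt evaluated exactly once

using var convA = root.Fork();          // shares the evaluated prefix
using var convB = root.Fork();
convA.Prompt("Write a limerick about llamas.");
convB.Prompt("Summarise the rules of chess.");
await executor.Infer();                 // both prompts evaluated in one batch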
Thanks @martindevans and @WesselvanGils. Dumb question, since I have not used the LLamaSharp API at a low level: how do I get the generated tokens in a streaming manner? Is it using the DataStream?
The BatchedExecutor doesn't have anything like that at the moment; it's up to you to sample a token, reprompt the conversation with that token, and detokenize it into text using a StreamingTokenDecoder.
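For example, the detokenize step can look roughly like this, using the StreamingTokenDecoder shown in the examples above (token here is the token just sampled for the conversation):
// Decode a sampled token into text and print whatever is decodable so far.
var decoder = new StreamingTokenDecoder(executor.Context);
decoder.Add(token);
Console.Write(decoder.Read());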
The DataStream object you see is something I just made myself so other parts of the program can consume the data at their own leisure. I can share it if you're interested. But basically, after each inference run you get one token, so in essence you already have a "stream". It's infer -> sample the token -> add it back into the conversation; doing that over and over is how you achieve a "stream" of tokens.
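As a rough idea, a DataStream along those lines could be built on System.Threading.Channels - the method names match the usage in MultiChatSession above, everything else is an assumption:
using System.Collections.Generic;
using System.Threading.Channels;

public class DataStream
{
    private readonly Channel<string> _channel = Channel.CreateUnbounded<string>();

    // Called by the inference loop after each sampled token is decoded.
    public void AddToken(string text) => _channel.Writer.TryWrite(text);

    // Called when the conversation reaches its end-of-sequence token.
    public void MarkAsCompleted() => _channel.Writer.TryComplete();

    // Consumers read the generated text at their own pace.
    public IAsyncEnumerable<string> ReadAllAsync() => _channel.Reader.ReadAllAsync();
}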
I've been playing around with the BatchedExecutor example and noticed that when I set the sampling temperature to 0 (or use GreedySamplingPipeline), the conversations diverge. I've tried a few short initial prompts and sometimes it does not happen (all forks are the same), but it usually does.
Unfortunately I think that's probably because the inference process itself is not completely deterministic.
That doesn't sound right... As I understand it, when sampling with a greedy sampler (which always picks the most probable token), inference should always return the same result for the same prompt.
Yeah you're right, I was wrong there. I'll have a look into this.
I did a test with the batched example from llama.cpp (modified to use greedy sampling) and the results are the same. So I created an issue in llama.cpp (ggerganov/llama.cpp#6583) for this.
Ah interesting, I'll wait and see if someone upstream knows what the issue is. Thanks for looking into that.
Looks like it's expected, according to ggerganov.
I'll close this issue now, since I think the questions have been answered and there hasn't been any activity for a while.
Is there a way to do parallel inferencing with LLamaSharp?
I want to send multiple prompts from different threads to different StatelessExecutor objects to do inference.
Currently, I use a semaphore so that only one StatelessExecutor processes at a time, roughly as sketched below.
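A minimal sketch of that approach (assuming weights and parameters are already loaded; this is illustrative rather than the exact code):
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

static class SerializedInference
{
    // Only one StatelessExecutor runs at a time, guarded by a semaphore.
    private static readonly SemaphoreSlim Gate = new(1, 1);

    public static async Task<string> InferAsync(LLamaWeights weights, ModelParams parameters, string prompt)
    {
        await Gate.WaitAsync();
        try
        {
            var executor = new StatelessExecutor(weights, parameters);
            var output = new StringBuilder();
            await foreach (var piece in executor.InferAsync(prompt))
                output.Append(piece);
            return output.ToString();
        }
        finally
        {
            Gate.Release();
        }
    }
}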
Thanks,
Ash