[Feature] Allow CONCURRENT requests and Multiple Instances management; Add API authentication; and configuration improvements #225

Draft
wants to merge 11 commits into
base: master

CodePothunter

- Implement OpenAI-compatible API key authentication
- Add configuration options for GPU instances, concurrency, and request handling
- Update README with authentication instructions
- Modify configuration and routing to support optional API key verification
- Enhance system information and debug endpoints to expose authentication status
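
To illustrate what optional, OpenAI-compatible API key verification could look like, here is a minimal sketch. It assumes clients send an `Authorization: Bearer <key>` header, as OpenAI clients do. The function name `verify_api_key` and its parameters are hypothetical and are not taken from this PR.

```python
import hmac
from typing import Iterable, Optional


def verify_api_key(
    authorization: Optional[str],
    valid_keys: Iterable[str],
    auth_enabled: bool = True,
) -> bool:
    """Return True if the request may proceed.

    Hypothetical helper: when authentication is disabled every request
    passes; otherwise the header must be "Bearer <key>" with a known key.
    """
    if not auth_enabled:
        return True
    if not authorization:
        return False
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(
        hmac.compare_digest(token.encode(), key.encode()) for key in valid_keys
    )
```

In a FastAPI application a check like this would typically live in a dependency that raises a 401 response when it returns False; that wiring is omitted here.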
@remsky
Owner

remsky commented Mar 7, 2025

This looks great, will take a look through today

- Modify audio chunk concatenation to handle float32 audio data
- Add explicit conversion from float32 to int16 using amplitude scaling
- Remove unnecessary dtype specification in np.concatenate
- Create GPU-specific startup script
- Set environment variables for GPU and project configuration
- Use uv to install GPU extras and run FastAPI server
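
The float32-to-int16 conversion described in the commit notes above can be sketched as follows. The PR works with NumPy arrays (`np.concatenate`); this sketch uses only the standard library so that it runs without dependencies. The scale factor of 32767 and the clipping step are assumptions, not details confirmed by the PR.

```python
from array import array
from typing import Iterable, List

INT16_MAX = 32767  # assumed scale factor for amplitude scaling


def float32_to_int16(samples: Iterable[float]) -> array:
    """Scale float samples in [-1.0, 1.0] to 16-bit signed PCM."""
    pcm = array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range values
        pcm.append(int(round(clipped * INT16_MAX)))
    return pcm


def concatenate_chunks(chunks: List[Iterable[float]]) -> array:
    """Join float chunks first, then convert the result once."""
    joined = array("f")
    for chunk in chunks:
        joined.extend(chunk)
    return float32_to_int16(joined)
```

Converting after concatenation keeps the chunks in floating point until the last step, so the integer conversion happens in exactly one place.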
@fireblade2534
Collaborator

Is there a reason you deleted start-gpu and not start-cpu as well?

@fireblade2534
Collaborator

@CodePothunter

@CodePothunter
Author

@CodePothunter

That was a mistake on my part. I've added it back in my new commit.

Collaborator

fireblade2534 left a comment

As far as I can tell, this PR breaks streaming. I tested by running Test.py, and it produced an empty WAV file (outputstream.wav).

The reason is that when stream=True, the audio conversion functions are not really called.
@CodePothunter
Author

As far as I can tell, this PR breaks streaming. I tested by running Test.py, and it produced an empty WAV file (outputstream.wav).

This is fixed in the most recent commit. The cause was that, with stream=True, the audio conversion functions were not called, unlike in non-streaming mode.

@fireblade2534
Collaborator

When I run the Docker container using the config in docker/gpu with `docker compose up --build`, it doesn't seem to respect env vars, whether I put them in a .env file or add them to the Docker Compose file.

@CodePothunter
Author

When I run the docker container using the config in: docker/gpu by using: docker compose up --build

It doesn't seem to respect env vars when I put them in a .env file or when I add them to the docker compose file

Sorry, I did not consider the docker-related issues.
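
One way to check whether a container actually receives its configuration is to resolve every setting from the environment in a single place and log the result at startup. The sketch below assumes hypothetical variable names (`INSTANCES_PER_GPU`, `MAX_CONCURRENT`, `REQUEST_QUEUE_SIZE`) and hypothetical defaults; the PR's real setting names may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PoolSettings:
    instances_per_gpu: int
    max_concurrent: int
    request_queue_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> PoolSettings:
    """Read pool settings from the environment, falling back to defaults.

    The variable names and default values here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    return PoolSettings(
        instances_per_gpu=int(env.get("INSTANCES_PER_GPU", "4")),
        max_concurrent=int(env.get("MAX_CONCURRENT", "4")),
        request_queue_size=int(env.get("REQUEST_QUEUE_SIZE", "100")),
    )
```

Printing the result of `load_settings()` when the server starts shows immediately whether values from a .env file or from the Compose file reached the process, or whether only the defaults are in effect.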

@fireblade2534
Collaborator

Also, when running it in a Docker container (I haven't tested this outside of one), running Test.py twice to generate four queries causes the container to exit with code 139 (note that I am using the GPU container):

kokoro-tts-1  | 
kokoro-tts-1  | ==========
kokoro-tts-1  | == CUDA ==
kokoro-tts-1  | ==========
kokoro-tts-1  |
kokoro-tts-1  | CUDA Version 12.8.0
kokoro-tts-1  |
kokoro-tts-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
kokoro-tts-1  |
kokoro-tts-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
kokoro-tts-1  | By pulling and using the container, you accept the terms and conditions of this license:
kokoro-tts-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
kokoro-tts-1  |
kokoro-tts-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
kokoro-tts-1  |
kokoro-tts-1  | 2025-03-10 02:48:15.520 | INFO     | __main__:download_model:60 - Model files already exist and are valid
kokoro-tts-1  | INFO:     Started server process [30]
kokoro-tts-1  | INFO:     Waiting for application startup.
kokoro-tts-1  | 02:48:22 AM | INFO     | main:57 | Loading TTS model and voice packs...
kokoro-tts-1  | 02:48:22 AM | INFO     | model_manager:35 | Initializing Kokoro V1 on cuda
kokoro-tts-1  | 02:48:22 AM | DEBUG    | paths:101 | Searching for model in path: /app/api/src/models
kokoro-tts-1  | 02:48:22 AM | INFO     | kokoro_v1:45 | Loading Kokoro model on cuda
kokoro-tts-1  | 02:48:22 AM | INFO     | kokoro_v1:46 | Config path: /app/api/src/models/v1_0/config.json
kokoro-tts-1  | 02:48:22 AM | INFO     | kokoro_v1:47 | Model path: /app/api/src/models/v1_0/kokoro-v1_0.pth
kokoro-tts-1  | /app/.venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py:123: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
kokoro-tts-1  |   warnings.warn(
kokoro-tts-1  | /app/.venv/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
kokoro-tts-1  |   WeightNorm.apply(module, name, dim)
kokoro-tts-1  | 02:48:24 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:24 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:24 AM | DEBUG    | model_manager:74 | Using default voice 'af_heart' for warmup
kokoro-tts-1  | 02:48:24 AM | INFO     | kokoro_v1:73 | Creating new pipeline for language code: a
kokoro-tts-1  | 02:48:24 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Warmup text for initialization.'
kokoro-tts-1  | 02:48:25 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([57600])
kokoro-tts-1  | 02:48:25 AM | INFO     | model_manager:81 | Warmup completed in 3456ms
kokoro-tts-1  | 02:48:25 AM | INFO     | instance_pool:31 | Initialized model instance 0 on GPU 0
kokoro-tts-1  | 02:48:25 AM | INFO     | instance_pool:69 | Successfully initialized instance 0 on GPU 0
kokoro-tts-1  | 02:48:25 AM | INFO     | model_manager:35 | Initializing Kokoro V1 on cuda
kokoro-tts-1  | 02:48:25 AM | DEBUG    | paths:101 | Searching for model in path: /app/api/src/models
kokoro-tts-1  | 02:48:25 AM | INFO     | kokoro_v1:45 | Loading Kokoro model on cuda
kokoro-tts-1  | 02:48:25 AM | INFO     | kokoro_v1:46 | Config path: /app/api/src/models/v1_0/config.json
kokoro-tts-1  | 02:48:25 AM | INFO     | kokoro_v1:47 | Model path: /app/api/src/models/v1_0/kokoro-v1_0.pth
kokoro-tts-1  | 02:48:26 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:26 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:26 AM | DEBUG    | model_manager:74 | Using default voice 'af_heart' for warmup
kokoro-tts-1  | 02:48:26 AM | INFO     | kokoro_v1:73 | Creating new pipeline for language code: a
kokoro-tts-1  | 02:48:27 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Warmup text for initialization.'
kokoro-tts-1  | 02:48:27 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([57600])
kokoro-tts-1  | 02:48:27 AM | INFO     | model_manager:81 | Warmup completed in 1894ms
kokoro-tts-1  | 02:48:27 AM | INFO     | instance_pool:31 | Initialized model instance 1 on GPU 0
kokoro-tts-1  | 02:48:27 AM | INFO     | instance_pool:69 | Successfully initialized instance 1 on GPU 0
kokoro-tts-1  | 02:48:27 AM | INFO     | model_manager:35 | Initializing Kokoro V1 on cuda
kokoro-tts-1  | 02:48:27 AM | DEBUG    | paths:101 | Searching for model in path: /app/api/src/models
kokoro-tts-1  | 02:48:27 AM | INFO     | kokoro_v1:45 | Loading Kokoro model on cuda
kokoro-tts-1  | 02:48:27 AM | INFO     | kokoro_v1:46 | Config path: /app/api/src/models/v1_0/config.json
kokoro-tts-1  | 02:48:27 AM | INFO     | kokoro_v1:47 | Model path: /app/api/src/models/v1_0/kokoro-v1_0.pth
kokoro-tts-1  | 02:48:28 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:28 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:28 AM | DEBUG    | model_manager:74 | Using default voice 'af_heart' for warmup
kokoro-tts-1  | 02:48:28 AM | INFO     | kokoro_v1:73 | Creating new pipeline for language code: a
kokoro-tts-1  | 02:48:29 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Warmup text for initialization.'
kokoro-tts-1  | 02:48:29 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([57600])
kokoro-tts-1  | 02:48:29 AM | INFO     | model_manager:81 | Warmup completed in 1719ms
kokoro-tts-1  | 02:48:29 AM | INFO     | instance_pool:31 | Initialized model instance 2 on GPU 0
kokoro-tts-1  | 02:48:29 AM | INFO     | instance_pool:69 | Successfully initialized instance 2 on GPU 0
kokoro-tts-1  | 02:48:29 AM | INFO     | model_manager:35 | Initializing Kokoro V1 on cuda
kokoro-tts-1  | 02:48:29 AM | DEBUG    | paths:101 | Searching for model in path: /app/api/src/models
kokoro-tts-1  | 02:48:29 AM | INFO     | kokoro_v1:45 | Loading Kokoro model on cuda
kokoro-tts-1  | 02:48:29 AM | INFO     | kokoro_v1:46 | Config path: /app/api/src/models/v1_0/config.json
kokoro-tts-1  | 02:48:29 AM | INFO     | kokoro_v1:47 | Model path: /app/api/src/models/v1_0/kokoro-v1_0.pth
kokoro-tts-1  | 02:48:30 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:30 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:30 AM | DEBUG    | model_manager:74 | Using default voice 'af_heart' for warmup
kokoro-tts-1  | 02:48:30 AM | INFO     | kokoro_v1:73 | Creating new pipeline for language code: a
kokoro-tts-1  | 02:48:30 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Warmup text for initialization.'
kokoro-tts-1  | 02:48:31 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([57600])
kokoro-tts-1  | 02:48:31 AM | INFO     | model_manager:81 | Warmup completed in 1879ms
kokoro-tts-1  | 02:48:31 AM | INFO     | instance_pool:31 | Initialized model instance 3 on GPU 0
kokoro-tts-1  | 02:48:31 AM | INFO     | instance_pool:69 | Successfully initialized instance 3 on GPU 0
kokoro-tts-1  | 02:48:31 AM | INFO     | instance_pool:80 | Successfully initialized 4 instances on GPU 0
kokoro-tts-1  | 02:48:31 AM | INFO     | main:110 |
kokoro-tts-1  | 
kokoro-tts-1  | ░░░░░░░░░░░░░░░░░░░░░░░░
kokoro-tts-1  | 
kokoro-tts-1  |     ╔═╗┌─┐┌─┐┌┬┐
kokoro-tts-1  |     ╠╣ ├─┤└─┐ │
kokoro-tts-1  |     ╚  ┴ ┴└─┘ ┴
kokoro-tts-1  |     ╦╔═┌─┐┬┌─┌─┐
kokoro-tts-1  |     ╠╩╗│ │├┴┐│ │
kokoro-tts-1  |     ╩ ╩└─┘┴ ┴└─┘
kokoro-tts-1  | 
kokoro-tts-1  | ░░░░░░░░░░░░░░░░░░░░░░░░
kokoro-tts-1  | 
kokoro-tts-1  | Model warmed up on cuda:0: kokoro_v1
kokoro-tts-1  | CUDA: True
kokoro-tts-1  | Running 4 instances on GPU 0
kokoro-tts-1  | Max concurrent requests: 4
kokoro-tts-1  | Request queue size: 100
kokoro-tts-1  | Default language code: auto (from voice name: a)
kokoro-tts-1  | 
kokoro-tts-1  | Beta Web Player: http://0.0.0.0:8090/web/
kokoro-tts-1  | or http://localhost:8090/web/
kokoro-tts-1  | ░░░░░░░░░░░░░░░░░░░░░░░░
kokoro-tts-1  | 
kokoro-tts-1  | INFO:     Application startup complete.
kokoro-tts-1  | INFO:     Uvicorn running on http://0.0.0.0:8880 (Press CTRL+C to quit)
kokoro-tts-1  | 02:48:31 AM | INFO     | openai_compatible:72 | Created global TTSService instance
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | INFO     | openai_compatible:215 | Starting audio generation with lang_code: a
kokoro-tts-1  | INFO:     172.18.0.1:52196 - "POST /v1/audio/speech HTTP/1.1" 200 OK
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_heart.pt
kokoro-tts-1  | 02:48:31 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_sky.pt
kokoro-tts-1  | 02:48:31 AM | DEBUG    | tts_service:228 | Combining 2 voice tensors with weights [0.5, 0.5]
kokoro-tts-1  | 02:48:31 AM | DEBUG    | tts_service:236 | Saving combined voice to: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:31 AM | DEBUG    | tts_service:280 | Using voice path: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:31 AM | INFO     | tts_service:284 | Using lang_code 'a' for voice 'af_heart+af_sky' in audio stream
kokoro-tts-1  | 02:48:31 AM | DEBUG    | instance_pool:122 | Processing request on instance 0
kokoro-tts-1  | 02:48:31 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Delving into the Abyss: A Deeper Exploration of Meaning in 5 Seconds of Summer's "Jet Black Heart"
kokoro-tts-1  | 
kokoro-tts-1  | ...'
kokoro-tts-1  | 02:48:31 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([159600])
kokoro-tts-1  | 02:48:32 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([522000])
kokoro-tts-1  | 02:48:32 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([654000])
kokoro-tts-1  | 02:48:32 AM | DEBUG    | instance_pool:137 | Instance 0 is now available
kokoro-tts-1  | 02:48:33 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:33 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:33 AM | INFO     | openai_compatible:215 | Starting audio generation with lang_code: a
kokoro-tts-1  | 02:48:33 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:33 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_heart.pt
kokoro-tts-1  | 02:48:33 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:33 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_sky.pt
kokoro-tts-1  | 02:48:33 AM | DEBUG    | tts_service:228 | Combining 2 voice tensors with weights [0.5, 0.5]
kokoro-tts-1  | 02:48:33 AM | DEBUG    | tts_service:236 | Saving combined voice to: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:33 AM | DEBUG    | tts_service:280 | Using voice path: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:33 AM | INFO     | tts_service:284 | Using lang_code 'a' for voice 'af_heart+af_sky' in audio stream
kokoro-tts-1  | 02:48:33 AM | DEBUG    | instance_pool:122 | Processing request on instance 1
kokoro-tts-1  | 02:48:33 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Delving into the Abyss: A Deeper Exploration of Meaning in 5 Seconds of Summer's "Jet Black Heart"
kokoro-tts-1  | 
kokoro-tts-1  | ...'
kokoro-tts-1  | 02:48:33 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([159600])
kokoro-tts-1  | 02:48:34 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([522000])
kokoro-tts-1  | 02:48:34 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([654000])
kokoro-tts-1  | 02:48:34 AM | DEBUG    | instance_pool:137 | Instance 1 is now available
kokoro-tts-1  | INFO:     172.18.0.1:49830 - "POST /v1/audio/speech HTTP/1.1" 200 OK
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | INFO     | openai_compatible:215 | Starting audio generation with lang_code: a
kokoro-tts-1  | INFO:     172.18.0.1:49836 - "POST /v1/audio/speech HTTP/1.1" 200 OK
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:153 | Scanning for voices in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_heart.pt
kokoro-tts-1  | 02:48:40 AM | DEBUG    | paths:131 | Searching for voice in path: /app/api/src/voices/v1_0
kokoro-tts-1  | 02:48:40 AM | DEBUG    | tts_service:223 | Loading voice tensor from: /app/api/src/voices/v1_0/af_sky.pt
kokoro-tts-1  | 02:48:40 AM | DEBUG    | tts_service:228 | Combining 2 voice tensors with weights [0.5, 0.5]
kokoro-tts-1  | 02:48:40 AM | DEBUG    | tts_service:236 | Saving combined voice to: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:40 AM | DEBUG    | tts_service:280 | Using voice path: /tmp/af_heart+af_sky.pt
kokoro-tts-1  | 02:48:40 AM | INFO     | tts_service:284 | Using lang_code 'a' for voice 'af_heart+af_sky' in audio stream
kokoro-tts-1  | 02:48:40 AM | DEBUG    | instance_pool:122 | Processing request on instance 2
kokoro-tts-1  | 02:48:40 AM | DEBUG    | kokoro_v1:245 | Generating audio for text with lang_code 'a': 'Delving into the Abyss: A Deeper Exploration of Meaning in 5 Seconds of Summer's "Jet Black Heart"
kokoro-tts-1  | 
kokoro-tts-1  | ...'
kokoro-tts-1  | 02:48:40 AM | DEBUG    | kokoro_v1:252 | Got audio chunk with shape: torch.Size([159600])
kokoro-tts-1 exited with code 139
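
Exit code 139 follows the shell convention of 128 plus a signal number, and signal 11 is SIGSEGV on Linux, so the container most likely died from a segmentation fault in native code rather than from a Python exception. A small helper for decoding such codes, with a hypothetical function name, might look like this:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell or container exit code into a readable description."""
    if code > 128:
        signum = code - 128  # shells report "killed by signal N" as 128 + N
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    return f"exited normally with status {code}"
```

For the log above, `describe_exit_code(139)` reports termination by SIGSEGV on Linux.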

…easing audio container

Refactor StreamingAudioWriter to improve audio encoding reliability

- Restructure audio encoding logic for better error handling
- Create a new method `_create_container()` to manage container creation
- Improve handling of different audio formats and encoding scenarios
- Add error logging for audio chunk encoding failures
- Simplify container and stream management in write_chunk method
Collaborator

fireblade2534 left a comment

This PR currently breaks or removes the following features:

  • The web UI does not actually stream or receive any audio
  • All text normalization that was in there is no longer being called
  • There is no option to change the speed, as it is not being passed into the generation system
  • _process_chunk is in tts_service but it never gets called
  • Captioned speech is broken because no timestamps are ever requested
  • Streaming is broken, as only the first chunk of text is returned
  • Trimming audio is always disabled, even though it makes sense for chunks that contain speech
  • smart_split is never called, so I'm not really sure how it is supposed to split text in a sensible way
  • process_text_chunk is never called

Honestly, this PR feels unfinished and untested.
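
For reference, the streaming behavior described in this review (split the text sensibly, then return every chunk rather than only the first) can be sketched as below. This is not the project's `smart_split`; the names `split_into_chunks` and `stream_chunks` are hypothetical, and the sentence-boundary regex is a simplification.

```python
import re
from typing import Iterator, List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept as one oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def stream_chunks(text: str, max_chars: int = 200) -> Iterator[str]:
    """Yield every chunk in order; a real service would synthesize each one."""
    for chunk in split_into_chunks(text, max_chars):
        yield chunk
```

A regression test for the "only the first chunk" bug can simply assert that the number of streamed chunks equals the number of chunks the splitter produced.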

- Update InstancePool to accept and process speed parameter
- Modify TTSService to pass speed to instance pool
- Update Test.py with new port and authentication
- Adjust start-gpu.sh to use port 50888
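
The idea of an instance pool that accepts a speed parameter and forwards it to a free model instance can be sketched with `asyncio.Queue`. The class name `InstancePoolSketch` and the `synthesize` callback signature are hypothetical; the PR's actual `InstancePool` may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical callback: (instance_id, text, speed) -> audio bytes
Synthesize = Callable[[int, str, float], Awaitable[bytes]]


class InstancePoolSketch:
    """Hands each request to a free instance and forwards the speed value."""

    def __init__(self, num_instances: int, synthesize: Synthesize) -> None:
        self._synthesize = synthesize
        self._free: asyncio.Queue = asyncio.Queue()
        for instance_id in range(num_instances):
            self._free.put_nowait(instance_id)

    async def process(self, text: str, speed: float = 1.0) -> bytes:
        instance_id = await self._free.get()  # waits while all instances are busy
        try:
            return await self._synthesize(instance_id, text, speed)
        finally:
            self._free.put_nowait(instance_id)  # always return the instance
```

Releasing the instance in a `finally` block means a failed request cannot permanently remove an instance from the pool. The pool should be created inside the running event loop, for example during application startup.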
@fireblade2534
Collaborator

Why did you change the GPU port to 50888?

fireblade2534 marked this pull request as draft on March 11, 2025, 20:03

3 participants