Stars
Code associated with the paper: CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition.
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
Di♪♪Rhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
[CVPR 2025] Truncated Diffusion Model for Real-Time End-to-End Autonomous Driving
PyTorch implementation of FractalGen https://arxiv.org/abs/2502.17437
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
SlamKit is an open-source toolkit for efficient training of SpeechLMs. It was used for "Slamming: Training a Speech Language Model on One GPU in a Day".
Vector (and Scalar) Quantization, in PyTorch
A low-bitrate single-codebook 16 kHz speech codec based on focal modulation
Examples of using the llasa-tts models locally
Unified automatic quality assessment for speech, music, and sound.
Ola: Pushing the Frontiers of Omni-Modal Language Model
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
GPT-4o-level, real-time spoken dialogue system.
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
A list of tools, papers and code related to Fake Audio Detection.
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
A Unified Library for Parameter-Efficient and Modular Transfer Learning
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
A suite of image and video neural tokenizers
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use".
[PyTorch] Minimal codebase for MusicGen models
Cosmos is a world model development platform that consists of world foundation models, tokenizers and video processing pipeline to accelerate the development of Physical AI at Robotics & AV labs. C…