Starred repositories
A Datacenter-Scale Distributed Inference Serving Framework
NetEase Cloud Music song downloader with full ID3 metadata, e.g. front cover image, artist name, album name, and song title.
💬 MaskLID: Code-Switching Language Identification through Iterative Masking -- ACL 2024
No fortress, purely open ground. OpenManus is Coming.
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
Fully open reproduction of DeepSeek-R1
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Di♪♪Rhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Use any web browser or WebView as GUI, with your preferred language in the backend and modern web technologies in the frontend, all in a lightweight portable library.
An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Wan: Open and Advanced Large-Scale Video Generative Models
An extremely fast Python package and project manager, written in Rust.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, HarmonyOS…
Solve Visual Understanding with Reinforced VLMs
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
[Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Align Anything: Training All-modality Model with Feedback
SkyReels V1: The first and most advanced open-source human-centric video foundation model
[CVPR 2025] Official repository for “MagicArticulate: Make Your 3D Models Articulation-Ready”
Source code for "Taming Visually Guided Sound Generation" (Oral at BMVC 2021)
A simple screen parsing tool towards a pure vision-based GUI agent
OSUM: Open Speech Understanding Model, open-sourced by ASLP@NPU.