
Note
The SDBench code is licensed under the MIT License. However, please note that:
- SpeakerKit CLI and other integrated systems have their own licenses that apply
- The datasets used in this benchmark have their own licenses and usage restrictions (see Diarization Datasets section for details)
SDBench is an open-source benchmarking tool for speaker diarization systems. The primary objective is to promote standardized, reproducible, and continuous evaluation of open-source and proprietary speaker diarization systems across on-device and server-side implementations.
Key features include:
- Simple interface to wrap your diarization, ASR, or ASR + diarization system
- Easily accessible and extensible metrics following `pyannote` standard metric implementations
- Modular and convenient configuration management through `hydra`
- Out-of-the-box Weights & Biases logging
- Availability of 13+ commonly used datasets (original dataset license restrictions apply)
Tip
Want to add your own diarization, ASR, or combined pipeline? Check out our Adding a New Diarization Pipeline section for a step-by-step guide!
Important
Before getting started, please note that some datasets in our Diarization Datasets section require special access or have license restrictions. While we provide dataset preparation utilities in `common/download_dataset`, you'll need to procure the raw data independently for these datasets. See the dataset table for details on access requirements.
- Distribute SpeakerKit CLI for reproduction
- Living Benchmark, running every other month
## Getting Started
In order to get started, first make sure you have `poetry` installed. The official documentation has instructions for how to install the `poetry` CLI.

If you already have `poetry` installed, you can run `make setup` to install the dependencies and set up the environment.

If you use `conda` or `venv` directly to manage your Python environment, you can install poetry with `pip install poetry` and then run `make setup` to install the dependencies.
Example with `conda`:

```bash
conda create -n <your-env-name> python=3.11
conda activate <your-env-name>
pip install poetry
make setup
```
## Diarization Datasets
The benchmark suite uses several speaker diarization datasets that are stored on the HuggingFace Hub. You can find all the datasets used in our evaluation in this collection. The datasets available in the aforementioned collection are:
| Dataset Name | Out-of-the-box | License | How to Access |
|---|---|---|---|
| earnings21 | ✅ | CC BY-SA 4.0 | Provided |
| msdwild | ❌ | MSDWild License Agreement | Use `common/download_dataset.py` script |
| icsi-meetings | ✅ | CC BY 4.0 | Provided |
| aishell-4 | ✅ | CC BY-SA 4.0 | Provided |
| ali-meetings | ✅ | CC BY-SA 4.0 | Provided |
| voxconverse | ✅ | CC BY 4.0 | Provided |
| ava-avd | ✅ | MIT | Provided |
| ami-sdm | ✅ | CC BY 4.0 | Provided |
| ami-ihm | ✅ | CC BY 4.0 | Provided |
| american-life-podcast | ❌ | Not disclosed | Use `common/download_dataset.py` script |
| dihard-III | ❌ | LDC License Agreement | Request access to LDC and use `common/download_dataset.py` script to parse |
| callhome | ❌ | LDC License Agreement | Request access to LDC and use `common/download_dataset.py` script to parse |
| ego-4d | ❌ | Ego4D License Agreement | Request access to Ego4D and use `common/download_dataset.py` script to parse |
Of these datasets, `voxconverse` and `ami` are not offered as download options because they are already available on the HuggingFace Hub, uploaded by the diarizers-community.
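
For datasets marked as provided, you can inspect them directly from the Hub. Below is a minimal sketch using the `datasets` library and the `diarizers-community/voxconverse` repository referenced later in this README; other repository ids in the collection may differ, and gated datasets additionally require an authenticated `HF_TOKEN`.

```python
# Minimal sketch: inspect one of the benchmark datasets hosted on the HuggingFace Hub.
from datasets import load_dataset

voxconverse = load_dataset("diarizers-community/voxconverse", split="test")
print(voxconverse)            # dataset size and column names
print(voxconverse[0].keys())  # fields of a single sample
```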
The benchmark suite supports different types of pipelines (`Diarization`, `ASR`, and `Orchestration`) with varying schema requirements. All datasets must follow a base schema, with additional fields required for specific pipeline types; a concrete example row is sketched after the field lists below.
Base schema fields:
- `audio`: Audio column containing:
  - `array`: Audio waveform as a numpy array of shape `(n_samples,)`
  - `sampling_rate`: Sample rate as an integer
- `timestamps_start`: List of `float` containing start timestamps of segments in seconds
- `timestamps_end`: List of `float` containing end timestamps of segments in seconds
- `speakers`: List of `str` containing speaker IDs for each segment
- `uem_timestamps`: Optional list of tuples `[(start, end), ...]` containing Universal Evaluation Map (UEM) timestamps for evaluation
Additional fields for the ASR pipeline:
- `transcript`: List of strings containing the words in the transcript
- `word_timestamps`: Optional list of tuples `[(start, end), ...]` containing timestamps for each word
- `word_speakers`: Optional list of strings containing speaker IDs for each word
Additional requirements for the Orchestration pipeline:
- All fields from both the Diarization and ASR pipelines are required
- `word_speakers` must be provided if `word_timestamps` is present
- The length of `word_speakers` must match the length of `transcript`
- The length of `word_timestamps` must match the length of `transcript`

For ASR and Orchestration pipelines, if word-level information is provided:
- `word_speakers` and `transcript` must have the same length
- `word_timestamps` and `transcript` must have the same length
- If `word_timestamps` is provided, `word_speakers` must also be provided
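
To make the schema concrete, here is an illustrative sketch of a single dataset row following the base and ASR fields above. All values are fabricated; only the field names and types come from the schema.

```python
import numpy as np

# Illustrative row following the base + ASR schema described above (values are made up).
sample = {
    "audio": {
        "array": np.zeros(16_000 * 5, dtype=np.float32),  # 5 seconds of audio
        "sampling_rate": 16_000,
    },
    # Two diarization segments and their speaker IDs
    "timestamps_start": [0.0, 2.5],
    "timestamps_end": [2.4, 5.0],
    "speakers": ["spk_0", "spk_1"],
    # Optional UEM region covering the whole recording
    "uem_timestamps": [(0.0, 5.0)],
    # ASR / Orchestration fields: word-level lists must match the transcript length
    "transcript": ["hello", "world"],
    "word_timestamps": [(0.1, 0.5), (2.6, 3.0)],
    "word_speakers": ["spk_0", "spk_1"],
}
```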
If you want to reproduce the exact dataset downloads and processing, you can use our dataset downloading scripts. First, make sure you have the required dependencies installed as mentioned in the Getting Started section, and also install the `dataset` dependencies by running `poetry install --with dataset`.

After installing the dependencies, you can run the dataset downloading script at `common/download_dataset.py`. For example, to download the ICSI meetings dataset, you can run:

```bash
poetry run python common/download_dataset.py --dataset icsi-meetings --hf-repo-owner <your-huggingface-username>
```

This will download the dataset, store it locally in the `raw_datasets/icsi-meetings` directory, and upload it to the designated HuggingFace organization at `<your-huggingface-username>/icsi-meetings`. If you only want to download and not push to HuggingFace, you can use the `--generate-only` flag.
For simplicity, if you want to download all the datasets you can run:

```bash
# This will download all the datasets and store them in the raw_datasets directory
# Will not push to HuggingFace
make download-datasets
```
- For datasets requiring Hugging Face access, make sure you have your `HF_TOKEN` environment variable set
- For the `American Life Podcast` dataset, you'll need Kaggle API credentials in `~/.kaggle/kaggle.json`
- For `Callhome` and `Dihard-III`, you need to acquire the datasets from LDC first and then set their paths in the following environment variables:
  - `DIHARD_DATASET_DIR`: if not specified, the directory is assumed to live at `~/third_dihard_challenge_eval/data`
  - `CALLHOME_AUDIO_ROOT`: if not specified, the directory is assumed to live at `~/callhome/nist_recognition_evaluation/r65_8_1/sid00sg1/data`
- The downloaded datasets will be stored in the `raw_datasets` directory (which is gitignored)
## Adding a New Diarization Pipeline
SDBench can be used as a library to evaluate your own diarization, transcription, or orchestration pipelines. The framework supports three types of pipelines:
- Diarization Pipeline: For speaker diarization tasks
- Transcription Pipeline: For ASR/transcription tasks
- Orchestration Pipeline: For combined diarization and transcription tasks
- Create a new Python file (e.g., `my_pipeline.py`) and implement your pipeline:

```python
from typing import Callable

from sdbench.dataset import DiarizationSample
from sdbench.pipeline.base import Pipeline, PipelineType, register_pipeline
from sdbench.pipeline.diarization.common import DiarizationOutput, DiarizationPipelineConfig
from sdbench.pipeline_prediction import DiarizationAnnotation


@register_pipeline
class MyDiarizationPipeline(Pipeline):
    _config_class = MyDiarizationConfig
    pipeline_type = PipelineType.DIARIZATION

    def build_pipeline(self) -> Callable[[dict], dict]:
        # Initialize your model/function and return a callable
        return my_diarizer_function

    def parse_input(self, input_sample: DiarizationSample) -> dict:
        # Convert DiarizationSample to your model's input format
        return {
            "waveform": input_sample.waveform,
            "sample_rate": input_sample.sample_rate,
        }

    def parse_output(self, output: dict) -> DiarizationOutput:
        # Convert your model's output to a DiarizationAnnotation
        # and wrap it in a DiarizationOutput
        return DiarizationOutput(prediction=annotation)
```
- Create a configuration class for your pipeline:

```python
from pydantic import Field

from sdbench.pipeline.diarization.common import DiarizationPipelineConfig


class MyDiarizationConfig(DiarizationPipelineConfig):
    model_path: str = Field(..., description="Path to model weights")
    threshold: float = Field(0.5, description="Detection threshold")
    num_speakers: int | None = Field(None, description="Number of speakers (optional)")
```
- Create a configuration file for your pipeline:

```yaml
# my_pipeline_config.yaml
out_dir: ./my_pipeline_logs
model_path: /path/to/model
threshold: 0.5
num_speakers: null
```
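
If you want to sanity-check the YAML file outside of Hydra, a minimal sketch like the following works, assuming `MyDiarizationConfig` accepts the YAML keys as keyword arguments. Note that SDBench's own runs compose configuration through Hydra, as described in the configuration section below.

```python
# Minimal sketch (not the canonical SDBench flow): load the YAML into the pydantic config.
import yaml

from my_pipeline import MyDiarizationConfig

with open("my_pipeline_config.yaml") as f:
    raw_config = yaml.safe_load(f)

pipeline_config = MyDiarizationConfig(**raw_config)
print(pipeline_config)
```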
- Import your pipeline and create a benchmark configuration:

```python
from sdbench.runner import BenchmarkConfig, BenchmarkRunner, WandbConfig
from sdbench.metric import MetricOptions
from sdbench.dataset import DiarizationDatasetConfig

from my_pipeline import MyDiarizationPipeline, MyDiarizationConfig

# Create pipeline configuration
pipeline_config = MyDiarizationConfig(
    model_path="/path/to/model",
    threshold=0.5,
    num_speakers=None,
    out_dir="./my_pipeline_logs",
)

# Create benchmark configuration
benchmark_config = BenchmarkConfig(
    wandb_config=WandbConfig(
        project_name="my-diarization-benchmark",
        run_name="my-pipeline-evaluation",
        tags=["my-pipeline", "evaluation"],
        wandb_mode="online",  # or "offline" for local testing
    ),
    metrics={
        MetricOptions.DER: {},  # Diarization Error Rate
        MetricOptions.JER: {},  # Jaccard Error Rate
    },
    datasets={
        "voxconverse": DiarizationDatasetConfig(
            dataset_id="diarizers-community/voxconverse",
            split="test",
        )
    },
)

# Create pipeline instance
pipeline = MyDiarizationPipeline(pipeline_config)

# Create and run benchmark
runner = BenchmarkRunner(benchmark_config, [pipeline])
benchmark_result = runner.run()
print(benchmark_result.global_results[0])
```
- For parallel processing, you can configure the number of worker processes in your pipeline config:

```python
pipeline_config = MyDiarizationConfig(
    model_path="/path/to/model",
    threshold=0.5,
    num_speakers=None,
    out_dir="./my_pipeline_logs",
    num_worker_processes=4,   # Number of parallel workers
    per_worker_chunk_size=2,  # Samples per worker
)
```
- To use Weights & Biases for experiment tracking, make sure to:
  - Set up your W&B account and get your API key
  - Make sure you're logged into your W&B account; otherwise run `wandb login`
  - Configure the `wandb_config` in your benchmark configuration
The BenchmarkRunner will automatically:
- Run your pipeline on the specified datasets
- Calculate metrics for each sample
- Aggregate results globally
- Log everything to Weights & Biases (if configured)
- Handle parallel processing if enabled (especially useful for API-based pipelines)
- Generate detailed reports and artifacts
Diarization Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Input parsing should convert `DiarizationSample` to your model's expected format
- Output parsing should return a `DiarizationOutput` with a `prediction` field
Transcription Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Input parsing should convert `DiarizationSample` to your model's expected format
- Output parsing should return a `TranscriptionOutput` with a `prediction` field
Orchestration Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Can either:
  - Implement end-to-end diarization and transcription
  - Use `PostInferenceMergePipeline` to combine separate diarization and transcription pipelines
- Output parsing should return an `OrchestrationOutput` with a `prediction` field and optionally `diarization` and `transcription` results
## Configuration Management
The benchmark suite uses Hydra for configuration management, providing a flexible and modular way to configure evaluation runs. The configuration files are organized in the following structure:

```
config
├── evaluation_config.yaml    # Main evaluation configuration
├── benchmark_config          # Base configurations for benchmarking
│   ├── datasets              # Dataset-specific configs
│   ├── wandb_config          # Weights & Biases logging configs
│   └── base.yaml             # Default benchmark_config used in evaluation_config.yaml
└── pipeline_configs          # Predefined pipeline configurations for ease of use
    ├── my_pipeline
    │   ├── base.yaml         # Default config used in my_pipeline.yaml
    │   └── config
    │       ├── base.yaml     # Default config used in MyPipeline
    │       └── diarization_config
    │           ├── chunking_config            # Defines different useful chunking configurations
    │           ├── cluster_definition         # Defines different useful cluster definitions
    │           ├── speaker_embedder_config    # Defines different useful speaker embedder configurations
    │           ├── speaker_segmenter_config   # Defines different useful speaker segmenter configurations
    │           └── base.yaml                  # Default diarization_config used in evaluation_config.yaml
    ├── my_pipeline.yaml      # Uses MyPipeline as default pipeline
    └── pyannote.yaml         # Defines configuration for PyAnnotePipeline
```
You can easily customize your evaluation runs using Hydra's override syntax. Here are some common usage patterns:

- Selecting Specific Pipelines

```bash
# Run evaluation with only MyPipeline
poetry run python evaluation.py pipeline_configs=my_pipeline
```

- Modifying Pipeline Parameters

You can override specific configuration parameters in two ways:

a. Override by Value:

```bash
# Change the speaker segmenter stride
poetry run python evaluation.py \
    pipeline_configs=my_pipeline \
    pipeline_configs.MyPipeline.config.diarization_config.speaker_segmenter_config.variant_name=stride_2
```

b. Override by Config:

```bash
# Use a predefined speaker segmenter configuration
poetry run python evaluation.py \
    pipeline_configs=my_pipeline \
    pipeline_configs/MyPipeline/config/diarization_config/speaker_segmenter_config=stride_2
```

Note: Use the `-h` flag with any command to see the resulting configuration:

```bash
poetry run python evaluation.py pipeline_configs=my_pipeline -h
```
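
For reference, composing and inspecting such a configuration from Python follows the standard Hydra entry-point pattern. The sketch below is illustrative only and assumes the config names from the directory tree above; the repository's actual `evaluation.py` may be structured differently.

```python
# Illustrative Hydra entry point (assumed layout: config/evaluation_config.yaml).
# The real evaluation.py in the repository may differ.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="config", config_name="evaluation_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Print the fully composed configuration after all command-line overrides are applied
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```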