
Note
The SDBench code is licensed under the MIT License. However, please note that:
- SpeakerKit CLI and other integrated systems have their own licenses that apply
- The datasets used in this benchmark have their own licenses and usage restrictions (see Diarization Datasets section for details)
SDBench is an open-source benchmarking tool for speaker diarization systems. The primary objective is to promote standardized, reproducible, and continuous evaluation of open-source and proprietary speaker diarization systems across on-device and server-side implementations.
Key features include:
- Simple interface to wrap your diarization, ASR, or ASR + diarization system
- Easily accessible and extensible metrics following `pyannote` standard metric implementations
- Modular and convenient configuration management through `hydra`
- Out-of-the-box Weights & Biases logging
- Availability of 13+ commonly used datasets (original dataset license restrictions apply)
Tip
Want to add your own diarization, ASR, or combined pipeline? Check out our Adding a New Diarization Pipeline section for a step-by-step guide!
Important
Before getting started, please note that some datasets in our Diarization Datasets section require special access or have license restrictions. While we provide dataset preparation utilities in `common/download_dataset`, you'll need to procure the raw data independently for these datasets. See the dataset table for details on access requirements.
- Distribute SpeakerKit CLI for reproduction
- Living Benchmark, running every other month
## Getting Started
In order to get started, first make sure you have `poetry` installed. The official documentation has instructions for how to install the `poetry` CLI.

If you already have `poetry` installed, you can run `make setup` to install the dependencies and set up the environment.

If you use `conda` or `venv` directly to manage your Python environment, you can install poetry with `pip install poetry` and then run `make setup` to install the dependencies.
Example with `conda`:

```bash
conda create -n <your-env-name> python=3.11
conda activate <your-env-name>
pip install poetry
make setup
```
## Diarization Datasets
The benchmark suite uses several speaker diarization datasets that are stored on the HuggingFace Hub. You can find all the datasets used in our evaluation in this collection. The datasets available in the aforementioned collection are:
| Dataset Name | Out-of-the-box | License | How to Access |
|---|---|---|---|
| earnings21 | ✅ | CC BY-SA 4.0 | Provided |
| msdwild | ❌ | MSDWild License Agreement | Use `common/download_dataset.py` script |
| icsi-meetings | ✅ | CC BY 4.0 | Provided |
| aishell-4 | ✅ | CC BY-SA 4.0 | Provided |
| ali-meetings | ✅ | CC BY-SA 4.0 | Provided |
| voxconverse | ✅ | CC BY 4.0 | Provided |
| ava-avd | ✅ | MIT | Provided |
| ami-sdm | ✅ | CC BY 4.0 | Provided |
| ami-ihm | ✅ | CC BY 4.0 | Provided |
| american-life-podcast | ❌ | Not disclosed | Use `common/download_dataset.py` script |
| dihard-III | ❌ | LDC License Agreement | Request access to LDC and use `common/download_dataset.py` script to parse |
| callhome | ❌ | LDC License Agreement | Request access to LDC and use `common/download_dataset.py` script to parse |
| ego-4d | ❌ | Ego4D License Agreement | Request access to Ego4D and use `common/download_dataset.py` script to parse |
Of these datasets, `voxconverse` and `ami` are not offered as download options because they are already available on the HuggingFace Hub, uploaded by the diarizers-community.
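
For datasets marked as provided, you can inspect them directly from the Hub. Below is a minimal sketch using the `datasets` library and the `diarizers-community/voxconverse` repository referenced later in this README; other repository ids in the collection may differ, and gated datasets additionally require an authenticated `HF_TOKEN`.

```python
# Minimal sketch: inspect one of the benchmark datasets hosted on the HuggingFace Hub.
from datasets import load_dataset

voxconverse = load_dataset("diarizers-community/voxconverse", split="test")
print(voxconverse)            # dataset size and column names
print(voxconverse[0].keys())  # fields of a single sample
```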
The benchmark suite supports different types of pipelines (`Diarization`, `ASR`, and `Orchestration`) with varying schema requirements. All datasets must follow a base schema, with additional fields required for specific pipeline types; a concrete example row is sketched after the field lists below.
Base schema fields:
- `audio`: Audio column containing:
  - `array`: Audio waveform as a numpy array of shape `(n_samples,)`
  - `sampling_rate`: Sample rate as an integer
- `timestamps_start`: List of `float` containing start timestamps of segments in seconds
- `timestamps_end`: List of `float` containing end timestamps of segments in seconds
- `speakers`: List of `str` containing speaker IDs for each segment
- `uem_timestamps`: Optional list of tuples `[(start, end), ...]` containing Universal Evaluation Map (UEM) timestamps for evaluation
Additional fields for the ASR pipeline:
- `transcript`: List of strings containing the words in the transcript
- `word_timestamps`: Optional list of tuples `[(start, end), ...]` containing timestamps for each word
- `word_speakers`: Optional list of strings containing speaker IDs for each word
Additional requirements for the Orchestration pipeline:
- All fields from both the Diarization and ASR pipelines are required
- `word_speakers` must be provided if `word_timestamps` is present
- The length of `word_speakers` must match the length of `transcript`
- The length of `word_timestamps` must match the length of `transcript`

For ASR and Orchestration pipelines, if word-level information is provided:
- `word_speakers` and `transcript` must have the same length
- `word_timestamps` and `transcript` must have the same length
- If `word_timestamps` is provided, `word_speakers` must also be provided
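
To make the schema concrete, here is an illustrative sketch of a single dataset row following the base and ASR fields above. All values are fabricated; only the field names and types come from the schema.

```python
import numpy as np

# Illustrative row following the base + ASR schema described above (values are made up).
sample = {
    "audio": {
        "array": np.zeros(16_000 * 5, dtype=np.float32),  # 5 seconds of audio
        "sampling_rate": 16_000,
    },
    # Two diarization segments and their speaker IDs
    "timestamps_start": [0.0, 2.5],
    "timestamps_end": [2.4, 5.0],
    "speakers": ["spk_0", "spk_1"],
    # Optional UEM region covering the whole recording
    "uem_timestamps": [(0.0, 5.0)],
    # ASR / Orchestration fields: word-level lists must match the transcript length
    "transcript": ["hello", "world"],
    "word_timestamps": [(0.1, 0.5), (2.6, 3.0)],
    "word_speakers": ["spk_0", "spk_1"],
}
```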
If you want to reproduce the exact dataset downloads and processing, you can use our dataset downloading scripts. First, make sure you have the required dependencies installed as mentioned in the Getting Started section, and also install the `dataset` dependencies by running `poetry install --with dataset`.

After installing the dependencies, you can run the dataset downloading script at `common/download_dataset.py`. For example, to download the ICSI meetings dataset, you can run:

```bash
poetry run python common/download_dataset.py --dataset icsi-meetings --hf-repo-owner <your-huggingface-username>
```

This will download the dataset, store it locally in the `raw_datasets/icsi-meetings` directory, and upload it to the designated HuggingFace organization at `<your-huggingface-username>/icsi-meetings`. If you only want to download and not push to HuggingFace, you can use the `--generate-only` flag.
For simplicity, if you want to download all the datasets you can run:

```bash
# This will download all the datasets and store them in the raw_datasets directory
# Will not push to HuggingFace
make download-datasets
```
- For datasets requiring Hugging Face access, make sure you have your `HF_TOKEN` environment variable set
- For the `American Life Podcast` dataset, you'll need Kaggle API credentials in `~/.kaggle/kaggle.json`
- For `Callhome` and `Dihard-III`, you need to acquire the datasets from LDC first and then set their paths in the following environment variables:
  - `DIHARD_DATASET_DIR`: if not specified, the directory is assumed to live at `~/third_dihard_challenge_eval/data`
  - `CALLHOME_AUDIO_ROOT`: if not specified, the directory is assumed to live at `~/callhome/nist_recognition_evaluation/r65_8_1/sid00sg1/data`
- The downloaded datasets will be stored in the `raw_datasets` directory (which is gitignored)
## Adding a New Diarization Pipeline
SDBench can be used as a library to evaluate your own diarization, transcription, or orchestration pipelines. The framework supports three types of pipelines:
- Diarization Pipeline: For speaker diarization tasks
- Transcription Pipeline: For ASR/transcription tasks
- Orchestration Pipeline: For combined diarization and transcription tasks
- Create a new Python file (e.g., `my_pipeline.py`) and implement your pipeline:

```python
from typing import Callable

from sdbench.dataset import DiarizationSample
from sdbench.pipeline.base import Pipeline, PipelineType, register_pipeline
from sdbench.pipeline.diarization.common import DiarizationOutput, DiarizationPipelineConfig
from sdbench.pipeline_prediction import DiarizationAnnotation


@register_pipeline
class MyDiarizationPipeline(Pipeline):
    _config_class = MyDiarizationConfig
    pipeline_type = PipelineType.DIARIZATION

    def build_pipeline(self) -> Callable[[dict], dict]:
        # Initialize your model/function and return a callable
        return my_diarizer_function

    def parse_input(self, input_sample: DiarizationSample) -> dict:
        # Convert DiarizationSample to your model's input format
        return {
            "waveform": input_sample.waveform,
            "sample_rate": input_sample.sample_rate,
        }

    def parse_output(self, output: dict) -> DiarizationOutput:
        # Convert your model's output to a DiarizationAnnotation
        # and wrap it in a DiarizationOutput
        return DiarizationOutput(prediction=annotation)
```
- Create a configuration class for your pipeline:

```python
from pydantic import Field

from sdbench.pipeline.diarization.common import DiarizationPipelineConfig


class MyDiarizationConfig(DiarizationPipelineConfig):
    model_path: str = Field(..., description="Path to model weights")
    threshold: float = Field(0.5, description="Detection threshold")
    num_speakers: int | None = Field(None, description="Number of speakers (optional)")
```
- Create a configuration file for your pipeline:

```yaml
# my_pipeline_config.yaml
out_dir: ./my_pipeline_logs
model_path: /path/to/model
threshold: 0.5
num_speakers: null
```
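
If you want to sanity-check the YAML file outside of Hydra, a minimal sketch like the following works, assuming `MyDiarizationConfig` accepts the YAML keys as keyword arguments. Note that SDBench's own runs compose configuration through Hydra, as described in the configuration section below.

```python
# Minimal sketch (not the canonical SDBench flow): load the YAML into the pydantic config.
import yaml

from my_pipeline import MyDiarizationConfig

with open("my_pipeline_config.yaml") as f:
    raw_config = yaml.safe_load(f)

pipeline_config = MyDiarizationConfig(**raw_config)
print(pipeline_config)
```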
- Import your pipeline and create a benchmark configuration:

```python
from sdbench.runner import BenchmarkConfig, BenchmarkRunner, WandbConfig
from sdbench.metric import MetricOptions
from sdbench.dataset import DiarizationDatasetConfig

from my_pipeline import MyDiarizationPipeline, MyDiarizationConfig

# Create pipeline configuration
pipeline_config = MyDiarizationConfig(
    model_path="/path/to/model",
    threshold=0.5,
    num_speakers=None,
    out_dir="./my_pipeline_logs",
)

# Create benchmark configuration
benchmark_config = BenchmarkConfig(
    wandb_config=WandbConfig(
        project_name="my-diarization-benchmark",
        run_name="my-pipeline-evaluation",
        tags=["my-pipeline", "evaluation"],
        wandb_mode="online",  # or "offline" for local testing
    ),
    metrics={
        MetricOptions.DER: {},  # Diarization Error Rate
        MetricOptions.JER: {},  # Jaccard Error Rate
    },
    datasets={
        "voxconverse": DiarizationDatasetConfig(
            dataset_id="diarizers-community/voxconverse",
            split="test",
        )
    },
)

# Create pipeline instance
pipeline = MyDiarizationPipeline(pipeline_config)

# Create and run benchmark
runner = BenchmarkRunner(benchmark_config, [pipeline])
benchmark_result = runner.run()
print(benchmark_result.global_results[0])
```
- For parallel processing, you can configure the number of worker processes in your pipeline config:

```python
pipeline_config = MyDiarizationConfig(
    model_path="/path/to/model",
    threshold=0.5,
    num_speakers=None,
    out_dir="./my_pipeline_logs",
    num_worker_processes=4,   # Number of parallel workers
    per_worker_chunk_size=2,  # Samples per worker
)
```
- To use Weights & Biases for experiment tracking, make sure to:
  - Set up your W&B account and get your API key
  - Make sure you're logged into your W&B account; otherwise run `wandb login`
  - Configure the `wandb_config` in your benchmark configuration
The BenchmarkRunner will automatically:
- Run your pipeline on the specified datasets
- Calculate metrics for each sample
- Aggregate results globally
- Log everything to Weights & Biases (if configured)
- Handle parallel processing if enabled (especially useful for API-based pipelines)
- Generate detailed reports and artifacts
Diarization Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Input parsing should convert `DiarizationSample` to your model's expected format
- Output parsing should return a `DiarizationOutput` with a `prediction` field
Transcription Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Input parsing should convert `DiarizationSample` to your model's expected format
- Output parsing should return a `TranscriptionOutput` with a `prediction` field
Orchestration Pipeline:
- Must implement `build_pipeline()`, `parse_input()`, and `parse_output()`
- Can either:
  - Implement end-to-end diarization and transcription
  - Use `PostInferenceMergePipeline` to combine separate diarization and transcription pipelines
- Output parsing should return an `OrchestrationOutput` with a `prediction` field and optionally `diarization` and `transcription` results
## Configuration Management
The benchmark suite uses Hydra for configuration management, providing a flexible and modular way to configure evaluation runs. The configuration files are organized in the following structure:

```
config
├── evaluation_config.yaml    # Main evaluation configuration
├── benchmark_config          # Base configurations for benchmarking
│   ├── datasets              # Dataset-specific configs
│   ├── wandb_config          # Weights & Biases logging configs
│   └── base.yaml             # Default benchmark_config used in evaluation_config.yaml
└── pipeline_configs          # Predefined pipeline configurations for ease of use
    ├── my_pipeline
    │   ├── base.yaml         # Default config used in my_pipeline.yaml
    │   └── config
    │       ├── base.yaml     # Default config used in MyPipeline
    │       └── diarization_config
    │           ├── chunking_config            # Defines different useful chunking configurations
    │           ├── cluster_definition         # Defines different useful cluster definitions
    │           ├── speaker_embedder_config    # Defines different useful speaker embedder configurations
    │           ├── speaker_segmenter_config   # Defines different useful speaker segmenter configurations
    │           └── base.yaml                  # Default diarization_config used in evaluation_config.yaml
    ├── my_pipeline.yaml      # Uses MyPipeline as default pipeline
    └── pyannote.yaml         # Defines configuration for PyAnnotePipeline
```
You can easily customize your evaluation runs using Hydra's override syntax. Here are some common usage patterns:

- Selecting Specific Pipelines

```bash
# Run evaluation with only MyPipeline
poetry run python evaluation.py pipeline_configs=my_pipeline
```

- Modifying Pipeline Parameters

You can override specific configuration parameters in two ways:

a. Override by Value:

```bash
# Change the speaker segmenter stride
poetry run python evaluation.py \
    pipeline_configs=my_pipeline \
    pipeline_configs.MyPipeline.config.diarization_config.speaker_segmenter_config.variant_name=stride_2
```

b. Override by Config:

```bash
# Use a predefined speaker segmenter configuration
poetry run python evaluation.py \
    pipeline_configs=my_pipeline \
    pipeline_configs/MyPipeline/config/diarization_config/speaker_segmenter_config=stride_2
```

Note: Use the `-h` flag with any command to see the resulting configuration:

```bash
poetry run python evaluation.py pipeline_configs=my_pipeline -h
```
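
For reference, composing and inspecting such a configuration from Python follows the standard Hydra entry-point pattern. The sketch below is illustrative only and assumes the config names from the directory tree above; the repository's actual `evaluation.py` may be structured differently.

```python
# Illustrative Hydra entry point (assumed layout: config/evaluation_config.yaml).
# The real evaluation.py in the repository may differ.
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="config", config_name="evaluation_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Print the fully composed configuration after all command-line overrides are applied
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```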