The cleaned Common Voice 10 (test set) that has been checked by a human for Ukrainian 🇺🇦

egorsmkv/cv10-uk-testset-clean

Overview

This repository contains an archive of the Common Voice 10 (CV10) test set for Ukrainian, with the audio files and their transcriptions. Every audio file has been checked by a human to make sure it is correct.

This archive is used to test all ASR models listed here: https://github.com/egorsmkv/speech-recognition-uk

Hugging Face dataset

Usage

Example with datasets:

from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("Yehor/cv10-uk-testset-clean")

print(ds)

for row in ds["train"]:
    audio = row["audio"]

    sampling_rate = audio["sampling_rate"]
    samples = audio["array"]  # decoded waveform samples, not raw bytes
    filename = audio["path"]

    print(len(samples), sampling_rate, filename)
    print(row["duration"], row["transcription"])

    print("---")

Example with polars: https://colab.research.google.com/drive/1upeXw3WbLjK37b1LetpM0HxFXDdOZqSK?usp=sharing
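
Since this archive is used to test ASR models, the `transcription` field can serve as the reference text when scoring a model's output. The sketch below computes word error rate (WER) with a plain word-level edit distance. It is an illustration only: this README does not state which metric or text normalization is used for the listed models, and `word_error_rate` is a hypothetical helper, not part of this repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no extra normalization
    (no lowercasing, no punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy strings, not taken from the dataset: one substitution and one deletion.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

In practice, `reference` would be `row["transcription"]` and `hypothesis` the text produced by the model under test for the same row.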

Google Colabs

Use the following Colab notebooks to see how to download this dataset in Python:

datasets:

polars:

Statistics

Duration statistics

Duration: 4.6 hours

| Metric | Value (seconds) |
| ------ | --------------- |
| mean   | 5.201474        |
| std    | 1.764957        |
| min    | 1.704           |
| 25%    | 3.816           |
| 50%    | 4.896           |
| 75%    | 6.384           |
| max    | 10.536          |
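
The figures above can be recomputed from the `duration` field of each row. Below is a minimal sketch using only the standard library. It assumes durations are in seconds and that quartiles use linear interpolation; `duration_summary` is a hypothetical helper, and the README does not say which tool produced the original numbers.

```python
import statistics


def duration_summary(durations):
    """Summarize clip durations (assumed to be in seconds)."""
    # method="inclusive" interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "total_hours": sum(durations) / 3600,
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),  # sample standard deviation
        "min": min(durations),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(durations),
    }


# Toy values for illustration, not taken from the dataset
summary = duration_summary([2.0, 4.0, 6.0, 8.0])
for name, value in summary.items():
    print(name, round(value, 6))
```

With the dataset loaded as in the Usage section, the input would be something like `ds["train"]["duration"]`.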

Download from GitHub

We recommend using the Hugging Face dataset, but if you need the raw dataset, use:
