Long Tail Multilingual Speech Dataset 423981 Hours 92 Languages

Verbalscripts Transcription LLC

Licensed LLM Data Provider

£950

Name: Long Tail Multilingual Speech Dataset 423981 Hours 92 Languages
Creator: Verbalscripts Transcription LLC
Published: 2026-06-14T19:54:50.686Z
License: https://docs.opendatabay.com/ai-training-and-model-development-licenses/commercial-ai-training-and-fine-tuning-data-license

Audio, Speech & Acoustic Datasets

Tags and Keywords: Longtail, Multilingual, Transcription, AI, ASR, LLM, Datasets, Training, Diarization, Transcript, Metadata, Evaluation, Voice, Audio

About

Overview

The Long Tail Multilingual Speech Dataset provides access to 423,981 available audio hours across 92 long-tail and regional languages. It is designed for AI teams that need broader multilingual coverage beyond the most common global languages.

This dataset is suitable for multilingual ASR systems, language identification models, speech-to-text models, voice AI products, speaker diarization systems, LLM speech workflows, and speech analytics tools.

This listing is a catalog-based commercial data product. The uploaded sample file is only a catalog preview. The listed price represents a starting licensed delivery package and does not cover the full 423,981 hours. Final pricing depends on selected languages, number of hours, enrichment files, delivery format, and licensing scope.

Dataset Contents

Available data may include conversational audio, mixed-down audio, speaker-separated stems where available, transcript and diarization files, word-level transcript data where available, language metadata, regional metadata, gender detection metadata, conversation summaries, sentiment analysis, and supporting metadata.

Files can be delivered in WAV, MP3, JSON, CSV, TXT, ZIP, or a custom format depending on buyer requirements.

Coverage

The dataset covers 92 long-tail and regional languages from the wider Verbalscripts conversational audio catalog. Example languages include Amharic, Hausa, Yoruba, Swahili, Somali, Lingala, Tibetan, Bashkir, Hebrew, Telugu, Sindhi, Myanmar, Faroese, Luxembourgish, Occitan, Belarusian, Afrikaans, Albanian, Armenian, Assamese, Azerbaijani, Basque, Breton, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Georgian, Greek, Gujarati, Icelandic, Irish, Kannada, Kazakh, Khmer, Korean, Kurdish, Lao, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Nepali, Pashto, Persian, Polish, Punjabi, Romanian, Serbian, Sinhala, Slovenian, Tamil, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Zulu, and others.

The dataset is global in scope. Exact language availability, regional distribution, collection period, and enrichment availability vary by selected subset.

Source and Collection Method

The data is sourced through vetted contributors, approved data collection partners, rights-cleared supplier relationships, and language-specific speech-data sourcing programs. Collection methods vary by language and may include contributor-recorded conversational speech, approved collection projects, licensed conversational audio libraries, and regional supplier datasets.

Collection periods vary across the 92 languages and are documented at subset level. Specific collection-period and source-category details can be provided during buyer due diligence.

Consent, Legal Basis, and Rights Chain

The data is offered for approved commercial AI and machine learning workflows where the applicable legal basis, consent basis, licensing arrangement, or rights-holder authorization exists. Depending on the selected subset, rights may be supported by contributor consent, supplier licenses, rights-holder authorization, or project-specific data collection terms.

Documentation covering provenance, source category, collection period, geography, consent or legal basis, permitted use, and licensing chain is available during qualified buyer due diligence or upon request, subject to confidentiality and the exact subset requested.

AI Training Use

The licensed data may be used for low-resource language ASR, multilingual speech recognition, language identification, voice AI expansion, speech-to-text benchmarking, speaker diarization evaluation, multilingual LLM audio workflows, and conversational AI localization, subject to the final license agreement.

The data itself may not be resold, redistributed, published, sublicensed, or shared outside the licensed environment unless expressly permitted in writing.

Delivery

Delivery is handled through secure custom delivery after purchase or after execution of a buyer-specific agreement. Delivery size depends on selected languages, selected hours, audio format, transcript format, and enrichment layers.

Pricing and Delivery Clarification

The listed price of GBP 950 is a starter licensed delivery package covering up to 25 selected audio hours from this catalog, subject to language availability, licensing scope, enrichment requirements, and delivery format.
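The arithmetic behind the starter package can be sketched as follows. This is an illustrative calculation only, not a quote: it assumes the GBP 950 price is flat for any selection of up to 25 hours, which the listing does not state explicitly, and the function names are hypothetical.

```python
from decimal import Decimal

STARTER_PRICE_GBP = Decimal("950")   # listed starter price
STARTER_MAX_HOURS = Decimal("25")    # "up to 25 selected audio hours"
CATALOG_HOURS = Decimal("423981")    # hours advertised for this listing


def effective_rate(hours_selected: Decimal) -> Decimal:
    """GBP per audio hour, assuming the flat starter price buys `hours_selected` hours."""
    if not (0 < hours_selected <= STARTER_MAX_HOURS):
        raise ValueError("starter package covers more than 0 and at most 25 hours")
    return (STARTER_PRICE_GBP / hours_selected).quantize(Decimal("0.01"))


def catalog_share(hours_selected: Decimal) -> Decimal:
    """Fraction of the advertised 423,981 hours that the selection represents."""
    return hours_selected / CATALOG_HOURS


if __name__ == "__main__":
    full = Decimal("25")
    print(effective_rate(full))           # 38.00 GBP per hour at the full 25 hours
    print(catalog_share(full) < Decimal("0.0001"))  # True: well under 0.01% of the catalog
```

At the full 25 hours the starter package works out to GBP 38 per audio hour. That figure should not be extrapolated to larger orders, which are quoted separately.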

This price does not cover any of the full collections shown in the catalog, whether the 515,849-hour, 174,308-hour, 75,807-hour, or 423,981-hour collection.

Larger orders, full-language packages, multi-language packages, and bulk licensing are quoted separately based on selected language, number of audio hours, transcript and diarization requirements, metadata enrichment, audio format, delivery format, and permitted AI/ML usage rights.

The uploaded catalog file is a preview of available coverage only. Full audio data is delivered through secure custom delivery after buyer confirmation, licensing review, and purchase.

Listing Stats

Delivery: Custom, S3
Listed: 14/06/2026
Updated: 16/06/2026
Region: Global
Quality: 5 / 5
