Long Tail Multilingual Speech Dataset: 423,981 Hours, 92 Languages
Audio, Speech & Acoustic Datasets
£950
About
Overview
The Long Tail Multilingual Speech Dataset provides access to 423,981 available audio hours across 92 long-tail and regional languages. It is designed for AI teams that need broader multilingual coverage beyond the most common global languages.
This dataset is suitable for multilingual ASR systems, language identification models, speech-to-text models, voice AI products, speaker diarization systems, LLM speech workflows, and speech analytics tools.
This listing is a catalog-based commercial data product. The uploaded sample file is only a catalog preview. The listed price represents a starting licensed delivery package and does not cover the full 423,981 hours. Final pricing depends on selected languages, number of hours, enrichment files, delivery format, and licensing scope.
Dataset Contents
Available data may include conversational audio, mixed-down audio, speaker-separated stems where available, transcript and diarization files, word-level transcript data where available, language metadata, regional metadata, gender detection metadata, conversation summaries, sentiment analysis, and supporting metadata.
Files can be delivered in WAV, MP3, JSON, CSV, TXT, ZIP, or a custom format depending on buyer requirements.
Coverage
The dataset covers 92 long-tail and regional languages from the wider Verbalscripts conversational audio catalog. Example languages include Amharic, Hausa, Yoruba, Swahili, Somali, Lingala, Tibetan, Bashkir, Hebrew, Telugu, Sindhi, Myanmar, Faroese, Luxembourgish, Occitan, Belarusian, Afrikaans, Albanian, Armenian, Assamese, Azerbaijani, Basque, Breton, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Georgian, Greek, Gujarati, Icelandic, Irish, Kannada, Kazakh, Khmer, Korean, Kurdish, Lao, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Nepali, Pashto, Persian, Polish, Punjabi, Romanian, Serbian, Sinhala, Slovenian, Tamil, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Zulu, and others.
The dataset is global in scope. Exact language availability, regional distribution, collection period, and enrichment availability vary by selected subset.
Source and Collection Method
The data is sourced through vetted contributors, approved data collection partners, rights-cleared supplier relationships, and language-specific speech-data sourcing programs. Collection methods vary by language and may include contributor-recorded conversational speech, approved collection projects, licensed conversational audio libraries, and regional supplier datasets.
Collection periods vary across the 92 languages and are documented at subset level. Specific collection-period and source-category details can be provided during buyer due diligence.
Consent, Legal Basis, and Rights Chain
The data is offered for approved commercial AI and machine learning workflows where the applicable legal basis, consent basis, licensing arrangement, or rights-holder authorization exists. Depending on the selected subset, rights may be supported by contributor consent, supplier licenses, rights-holder authorization, or project-specific data collection terms.
Documentation covering provenance, source category, collection period, geography, consent or legal basis, permitted use, and licensing chain is available during qualified buyer due diligence or upon request, subject to confidentiality and the exact subset requested.
AI Training Use
The licensed data may be used for low-resource language ASR, multilingual speech recognition, language identification, voice AI expansion, speech-to-text benchmarking, speaker diarization evaluation, multilingual LLM audio workflows, and conversational AI localization, subject to the final license agreement.
The data itself may not be resold, redistributed, published, sublicensed, or shared outside the licensed environment unless expressly permitted in writing.
Delivery
Delivery is handled through secure custom delivery after purchase or after execution of a buyer-specific agreement. Delivery size depends on selected languages, selected hours, audio format, transcript format, and enrichment layers.
Pricing and Delivery Clarification
The listed price of GBP 950 is a starter licensed delivery package covering up to 25 selected audio hours from this catalog, subject to language availability, licensing scope, enrichment requirements, and delivery format.
This price does not cover the full available corpus shown in the catalog. That applies to this listing's full 423,981-hour collection and to the other catalog collections of 515,849, 174,308, and 75,807 hours.
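For budgeting, the starter-package figures can be put into per-hour terms. The sketch below is illustrative only and assumes the GBP 950 price is flat regardless of how many of the 25 hours are actually selected, which the listing does not state. The function and constant names are hypothetical. At the full 25 hours the package works out to GBP 38 per audio hour.

```python
# Figures taken from the listing text; the flat-price behaviour is an assumption.
STARTER_PRICE_GBP = 950
STARTER_MAX_HOURS = 25
CATALOG_HOURS = 423_981


def effective_rate_gbp_per_hour(hours_selected: float) -> float:
    """Per-hour cost of the starter package when `hours_selected` hours are taken."""
    if not 0 < hours_selected <= STARTER_MAX_HOURS:
        raise ValueError("starter package covers more than 0 and up to 25 hours")
    return STARTER_PRICE_GBP / hours_selected


def catalog_share(hours_selected: float) -> float:
    """Fraction of the 423,981-hour catalog that the selection represents."""
    return hours_selected / CATALOG_HOURS


if __name__ == "__main__":
    print(f"{effective_rate_gbp_per_hour(25):.2f} GBP/hour at the 25-hour cap")
    print(f"{catalog_share(25):.4%} of the catalog")
```

Actual pricing for anything beyond the starter package is quoted separately, so these numbers should not be extrapolated to larger orders.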
Larger orders, full-language packages, multi-language packages, and bulk licensing are quoted separately based on selected language, number of audio hours, transcript and diarization requirements, metadata enrichment, audio format, delivery format, and permitted AI/ML usage rights.
The uploaded catalog file is a preview of available coverage only. Full audio data is delivered through secure custom delivery after buyer confirmation, licensing review, and purchase.
