African MENA Conversational Audio Dataset 75807 Hours

Verbalscripts Transcription LLC

Licensed LLM Data Provider

£950

Name: African MENA Conversational Audio Dataset 75807 Hours
Creator: Verbalscripts Transcription LLC
Published: 2026-06-14T19:54:50.686Z
License: https://docs.opendatabay.com/ai-training-and-model-development-licenses/commercial-ai-training-and-fine-tuning-data-license

Audio, Speech & Acoustic Datasets

Tags and Keywords

Africa, MENA, Audio, Speech, ASR, Training, AI, LLM, Amharic, Yoruba, Swahili, Somali, Arabic, Diarization, Transcript, Speech-to-Text

About

Overview

The Top 25 Strategic Languages Conversational Audio Pack provides access to 174,308 available audio hours across 25 high-value global and regional languages. This package is designed for AI teams that want strong multilingual speech coverage without licensing the full 100-language catalog.

The dataset is suitable for ASR training, multilingual speech-to-text systems, voice assistant workflows, speaker diarization, language identification, speech analytics, and LLM-based speech applications.

This listing is a catalog-based commercial data product. The uploaded sample file is a preview catalog only. The listed price represents a starting licensed delivery package and does not cover the full 174,308 hours. Final pricing depends on selected languages, number of hours, enrichment files, delivery format, and licensing scope.

Dataset Contents

Available files may include conversational audio, mixed-down audio, speaker-separated stems where available, transcript and diarization data, word-level transcript information where available, language metadata, regional metadata, gender detection metadata, summaries, sentiment analysis, and other structured enrichment fields depending on the selected package.

Delivery can be prepared in WAV, MP3, JSON, CSV, TXT, ZIP, or another agreed buyer-ready structure.

Coverage

This package covers 25 strategic languages from the larger Verbalscripts conversational audio catalog: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, Bosnian, Indonesian, Luxembourgish, Occitan, Slovak, Tibetan, Belarusian, Faroese, Hungarian, Lingala, Norwegian, Sindhi, Telugu, Bashkir, Hebrew, Latin, and Burmese (Myanmar).

The package is global in scope and includes languages associated with markets across Asia, Europe, Africa, the Middle East, and the Americas. Exact regional distribution, collection period, and metadata availability vary by language and subset.

Source and Collection Method

The data is sourced through vetted contributors, approved data collection partners, rights-cleared supplier relationships, and language-specific speech-data sourcing programs. Collection methods vary by language and subset, but may include contributor-recorded conversational audio, licensed conversational speech sources, and approved data collection projects.

The collection period is not identical across all 25 languages. Collection-period information can be provided for the specific language subset requested by the buyer.

Consent, Legal Basis, and Rights Chain

The available data is intended for approved commercial AI and machine learning workflows where the relevant legal basis, licensing basis, or contributor/supplier authorization exists. Depending on the subset, this may include contributor consent, rights-holder authorization, supplier licensing, or project-specific collection agreements.

Subset-level provenance, consent, licensing, permitted-use, and rights-chain documentation is available during qualified buyer due diligence or upon request, subject to confidentiality and the specific data package being licensed.

AI Training Use

The licensed data may be used for ASR training, speech-to-text development, multilingual model evaluation, LLM speech workflows, voice assistant development, diarization, and speech analytics, subject to the final commercial license agreement.

The data itself may not be resold, redistributed, published, sublicensed, or shared outside the buyer’s licensed usage environment unless expressly permitted in writing.

Delivery

Full delivery is handled through secure custom delivery after purchase or after a buyer-specific agreement is completed. Delivery size depends on selected languages, selected hours, audio format, transcript format, and enrichment layers. The uploaded sample file is a catalog preview only.

Pricing and Delivery Clarification

The listed price of GBP 950 is a starter licensed delivery package covering up to 25 selected audio hours from this catalog, subject to language availability, licensing scope, enrichment requirements, and delivery format.

This price does not cover the full available corpus shown in the catalog, including the full 515,849-hour, 174,308-hour, 75,807-hour, or 423,981-hour collections.
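As a rough illustration of the starter package arithmetic, the sketch below works out the effective per-hour rate at the 25-hour cap and the share of the 174,308-hour catalog that the starter package represents. The constant and function names are hypothetical and chosen for this example only. The result is not a quote: larger orders are priced separately, and because the package covers "up to" 25 hours, the effective rate is higher if fewer hours are delivered.

```python
# Figures taken from this listing; names are illustrative assumptions.
STARTER_PRICE_GBP = 950      # listed starter package price
STARTER_HOURS_CAP = 25       # maximum audio hours in the starter package
CATALOG_HOURS = 174_308      # available hours stated in the Overview


def effective_rate_per_hour(price_gbp: float, hours: float) -> float:
    """Return the price per audio hour for a given number of delivered hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return price_gbp / hours


def share_of_catalog(hours: float, catalog_hours: float) -> float:
    """Return the fraction of the catalog covered by the given hours."""
    if catalog_hours <= 0:
        raise ValueError("catalog_hours must be positive")
    return hours / catalog_hours


# Lowest possible per-hour rate, reached only when all 25 hours are delivered.
rate_at_cap = effective_rate_per_hour(STARTER_PRICE_GBP, STARTER_HOURS_CAP)
starter_share = share_of_catalog(STARTER_HOURS_CAP, CATALOG_HOURS)

print(f"Effective rate at the 25-hour cap: GBP {rate_at_cap:.2f} per hour")
print(f"Starter share of the catalog: {starter_share:.4%}")
```

At the cap this works out to GBP 38.00 per audio hour, which should be read as a floor for the starter package rather than a rate that scales to bulk licensing.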

Larger orders, full-language packages, multi-language packages, and bulk licensing are quoted separately based on selected language, number of audio hours, transcript and diarization requirements, metadata enrichment, audio format, delivery format, and permitted AI/ML usage rights.

The uploaded catalog file is a preview of available coverage only. Full audio data is delivered through secure custom delivery after buyer confirmation, licensing review, and purchase.

Listing Stats

Delivery: Custom, S3
Listed: 14/06/2026
Updated: 16/06/2026
Region: Africa
Quality: 5 / 5
