Cambridge Dictionary Text-to-Audio Dataset

Tags and Keywords

Text
NLP
Audio
Pronunciation
Dictionary

About

This dataset provides text-to-audio pairs sourced from the Cambridge Dictionary, with pronunciations in both British and American English accents [1, 2]. It is intended as a resource for speech recognition and speech generation, and as a complementary data source for existing models [2]. The dataset was created by web scraping the Cambridge Dictionary website [2].

Columns

The dataset includes the following columns:

ID: A unique identifier for each entry [1].
word: The word itself [1].
Word: An alternative representation of the word [1].
gb_audio_url: The URL pointing to the audio file for the British English pronunciation [1].
us_audio_url: The URL pointing to the audio file for the American English pronunciation [1].
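
As a minimal sketch of how these columns might be read, the following parses a CSV with the five column names listed above using Python's standard library. The sample rows, the column order, the example.com URLs, and the Entry and load_entries names are illustrative assumptions, not taken from the dataset itself.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Entry:
    """One row of the CSV (hypothetical container, named for illustration)."""
    id: str
    word: str
    word_alt: str       # value of the second, capitalised "Word" column
    gb_audio_url: str
    us_audio_url: str


def load_entries(csv_text: str) -> list[Entry]:
    """Parse CSV text whose header uses the column names from the listing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Entry(
            id=row["ID"],
            word=row["word"],
            word_alt=row["Word"],  # header names are case-sensitive
            gb_audio_url=row["gb_audio_url"],
            us_audio_url=row["us_audio_url"],
        )
        for row in reader
    ]


# Placeholder rows: the values and URLs below are invented for the example.
SAMPLE_CSV = """ID,word,Word,gb_audio_url,us_audio_url
1,apple,Apple,https://example.com/gb/apple.mp3,https://example.com/us/apple.mp3
2,banana,Banana,https://example.com/gb/banana.mp3,https://example.com/us/banana.mp3
"""

entries = load_entries(SAMPLE_CSV)
for entry in entries:
    print(entry.word, entry.gb_audio_url, entry.us_audio_url)
```

Because "word" and "Word" differ only in case, a case-sensitive reader such as csv.DictReader keeps them as separate fields; tools that normalise header names may merge or rename them.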

Distribution

The dataset is provided as a CSV file containing the words and their corresponding audio file URLs [1, 3]. The audio files themselves are in MP3 format [1]. It contains over 37,000 unique words (approximately 37,528 unique values), with audio files corresponding to their pronunciations [2, 4].
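
The figures above can be checked against a downloaded copy with a short script. The sketch below counts unique values in the word column and flags any audio URL whose path does not end in .mp3. The summarize function name and the sample rows (including the deliberate .wav URL) are assumptions made for the example.

```python
import csv
import io
from urllib.parse import urlparse

AUDIO_COLUMNS = ("gb_audio_url", "us_audio_url")


def summarize(csv_text: str) -> tuple[int, list[tuple[str, str]]]:
    """Return (number of unique words, list of (ID, column) with a non-MP3 URL)."""
    unique_words = set()
    non_mp3 = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        unique_words.add(row["word"])
        for column in AUDIO_COLUMNS:
            # Inspect only the URL path so query strings do not affect the check.
            path = urlparse(row[column]).path
            if not path.lower().endswith(".mp3"):
                non_mp3.append((row["ID"], column))
    return len(unique_words), non_mp3


# Invented rows: "apple" appears twice and one URL is not an MP3.
SAMPLE_CSV = """ID,word,Word,gb_audio_url,us_audio_url
1,apple,Apple,https://example.com/gb/apple.mp3,https://example.com/us/apple.mp3
2,pear,Pear,https://example.com/gb/pear.mp3,https://example.com/us/pear.mp3
3,apple,Apple,https://example.com/gb/apple.mp3,https://example.com/us/apple.wav
"""

unique_count, problems = summarize(SAMPLE_CSV)
print("unique words:", unique_count)
print("non-MP3 URLs:", problems)
```

On the real file, the unique count would be expected to land near the 37,528 figure quoted in the listing, though that has not been verified here.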

Usage

This dataset is suited to artificial intelligence and machine learning applications, particularly in natural language processing and audio processing [2]. Specific use cases include:

Developing and training speech recognition systems [2].
Creating text-to-speech (TTS) synthesis applications [2].
Enhancing and supplementing existing language models and speech models with diverse pronunciation data [2].
Linguistic research focusing on accent variations between British and American English.
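
For the speech recognition and TTS use cases above, one common preparation step is to flatten each row into per-accent (text, audio) records. The sketch below shows one way this might look. The build_manifest name, the "en-GB" and "en-US" accent labels, the record field names, and the sample rows are all assumptions, and downloading the audio is left out.

```python
import csv
import io
import json

# Assumed mapping from an accent label to the CSV column holding its audio URL.
ACCENT_COLUMNS = {"en-GB": "gb_audio_url", "en-US": "us_audio_url"}


def build_manifest(csv_text: str) -> list[dict]:
    """Emit one record per (word, accent) pair that has a non-empty audio URL."""
    manifest = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for accent, column in ACCENT_COLUMNS.items():
            url = (row.get(column) or "").strip()
            if url:  # skip accents with no recording for this word
                manifest.append(
                    {"text": row["word"], "accent": accent, "audio_url": url}
                )
    return manifest


# Invented rows: the second word has no American English URL.
SAMPLE_CSV = """ID,word,Word,gb_audio_url,us_audio_url
1,tomato,Tomato,https://example.com/gb/tomato.mp3,https://example.com/us/tomato.mp3
2,schedule,Schedule,https://example.com/gb/schedule.mp3,
"""

manifest = build_manifest(SAMPLE_CSV)
for record in manifest:
    print(json.dumps(record))  # one JSON object per line
```

Keeping the accent as an explicit field lets the same manifest serve accent-conditioned training and side-by-side British and American comparisons.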

Coverage

The dataset is listed with global coverage and includes pronunciations in both British and American English [1, 5]. It includes all words from A-Z found in the Cambridge Dictionary [2]. No time range or demographic scope is specified beyond the two accents.

License

CC0

Who Can Use It

This dataset is valuable for a wide range of users, including:

AI and Machine Learning Developers: For building and refining speech recognition and generation models [2].
Linguists and Researchers: For studying phonetic variations and accent differences.
Educators: For creating pronunciation learning tools.
Data Scientists: For expanding their language and audio datasets.

Dataset Name Suggestions

British and American English Pronunciation Audio
Cambridge Dictionary Text-to-Audio Dataset
UK-US English Pronunciation Library
Spoken English Accents Collection
Multi-Accent English Audio Dictionary

Attributes

Original Data Source: Text-to-Audio Pairs from the Cambridge Dictionary

Listing Stats

Listed: 27/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free