Opendatabay APP

Cambridge Dictionary Text-to-Audio Dataset

Telecommunications & Network Data

Tags and Keywords

Text

NLP

Audio

Pronunciation

Dictionary


Free

About

This dataset provides text-to-audio pairs sourced from the Cambridge Dictionary, with pronunciations in both British and American English accents [1, 2]. It is intended as a resource for speech recognition and speech generation, and as a complementary data source for existing models [2]. The dataset was created by web scraping the Cambridge Dictionary website [2].

Columns

The dataset includes the following columns:
  • ID: A unique identifier for each entry [1].
  • word: The word itself [1].
  • Word: An alternative representation of the word [1].
  • gb_audio_url: The URL pointing to the audio file for the British English pronunciation [1].
  • us_audio_url: The URL pointing to the audio file for the American English pronunciation [1].
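
As a minimal sketch of how the columns above might be read, the following Python snippet parses a small in-memory CSV and checks that the five listed columns are present. The sample rows and the example.com URLs are placeholders invented for illustration, and the helper name load_rows is hypothetical; the actual file contents may differ.

```python
import csv
import io

# Hypothetical sample rows; the URLs are placeholders, not real dataset values.
SAMPLE_CSV = """ID,word,Word,gb_audio_url,us_audio_url
1,apple,apple,https://example.com/gb/apple.mp3,https://example.com/us/apple.mp3
2,banana,banana,https://example.com/gb/banana.mp3,
"""

# Column names as given in the listing; "word" and "Word" differ only by case.
EXPECTED_COLUMNS = ["ID", "word", "Word", "gb_audio_url", "us_audio_url"]


def load_rows(handle):
    """Read CSV rows into dicts, failing early if a listed column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["ID"], row["word"], row["gb_audio_url"], row["us_audio_url"])
```

Because csv.DictReader keys are case-sensitive, the word and Word columns are kept as separate fields. To read a downloaded file instead of the sample, pass an open file handle (opened with newline="") to load_rows.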

Distribution

The dataset is provided as a CSV file containing the words and their corresponding audio file URLs [1, 3]. The audio files themselves are in MP3 format [1]. It contains over 37,000 words (approximately 37,528 unique values), each with audio files for its pronunciations [2, 4].
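
Since the CSV holds URLs rather than the audio itself, the MP3 files have to be fetched separately. The sketch below shows one possible way to count distinct words and save a pronunciation locally using only the Python standard library. The function names, the case-insensitive notion of "unique", and the out/<accent>/<word>.mp3 naming scheme are all assumptions made for illustration, not part of the dataset.

```python
import shutil
import urllib.request
from pathlib import Path


def count_unique_words(words):
    """Count distinct words, ignoring case and surrounding whitespace (an assumption)."""
    return len({w.strip().lower() for w in words if w.strip()})


def local_mp3_path(out_dir, word, accent):
    """Build a destination such as out/gb/apple.mp3 (hypothetical naming scheme)."""
    safe = "".join(ch if ch.isalnum() else "_" for ch in word.lower())
    return Path(out_dir) / accent / f"{safe}.mp3"


def download_mp3(url, dest, timeout=10.0):
    """Stream the audio at `url` into `dest`, creating parent folders as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return dest


print(count_unique_words(["Apple", "apple ", "pear"]))
print(local_mp3_path("out", "ice cream", "gb"))
```

A call such as download_mp3(row["gb_audio_url"], local_mp3_path("out", row["word"], "gb")) would then fetch one file. For bulk downloads it is worth adding retries and a delay between requests.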

Usage

This dataset is suited to artificial intelligence and machine learning applications, particularly in natural language processing and audio processing [2]. Specific use cases include:
  • Developing and training speech recognition systems [2].
  • Creating text-to-speech (TTS) synthesis applications [2].
  • Enhancing and supplementing existing language models and speech models with diverse pronunciation data [2].
  • Linguistic research focusing on accent variations between British and American English.
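
For the speech recognition and TTS use cases above, a common first step is to flatten each row into one record per accent. The following is a rough sketch under stated assumptions: build_manifest and the text/accent/audio_url record layout are hypothetical choices, and the sample rows use placeholder URLs rather than real dataset values.

```python
def build_manifest(rows):
    """Turn dataset rows into one (text, accent, audio_url) record per available accent."""
    manifest = []
    for row in rows:
        for accent, column in (("gb", "gb_audio_url"), ("us", "us_audio_url")):
            url = (row.get(column) or "").strip()
            if url:  # skip accents with no audio URL
                manifest.append({"text": row["word"], "accent": accent, "audio_url": url})
    return manifest


# Placeholder rows for illustration only.
sample_rows = [
    {"ID": "1", "word": "apple",
     "gb_audio_url": "https://example.com/gb/apple.mp3",
     "us_audio_url": "https://example.com/us/apple.mp3"},
    {"ID": "2", "word": "banana",
     "gb_audio_url": "https://example.com/gb/banana.mp3",
     "us_audio_url": ""},
]

manifest = build_manifest(sample_rows)
for record in manifest:
    print(record)
```

Keeping the accent as an explicit field makes it straightforward to train accent-specific models or to compare British and American pronunciations of the same word.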

Coverage

The listing's region is global, and the dataset covers pronunciations in both British and American English [1, 5]. It includes all words from A to Z found in the Cambridge Dictionary [2]. No time range or demographic scope is specified beyond the two accents.

License

CC0

Who Can Use It

This dataset is valuable for a wide range of users, including:
  • AI and Machine Learning Developers: For building and refining speech recognition and generation models [2].
  • Linguists and Researchers: For studying phonetic variations and accent differences.
  • Educators: For creating pronunciation learning tools.
  • Data Scientists: For expanding their language and audio datasets.

Dataset Name Suggestions

  • British and American English Pronunciation Audio
  • Cambridge Dictionary Text-to-Audio Dataset
  • UK-US English Pronunciation Library
  • Spoken English Accents Collection
  • Multi-Accent English Audio Dictionary

Attributes

Listing Stats

VIEWS

0

DOWNLOADS

0

LISTED

27/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
