BERT Model Vocabulary Analysis Dataset
About
This dataset provides a detailed reconstruction of the training data used for the English BERT base uncased model. It is useful for understanding the vocabulary these foundational models were trained on, which subword tokenisation otherwise obscures. By examining the unigrams and their distribution in the original training data, users can assess whether their own data would benefit from fine-tuning an existing BERT model or would require pretraining a model from scratch.
The data was compiled from the BookCorpus dataset and a processed Wikipedia dump from August 2019. Consistent with BERT's tokenisation scheme, all punctuation and stopwords are retained. The original Unicode text was normalised using NFKC, then tokenised with the spaCy English large model, and the total count of each unigram across both corpora was recorded. Unigrams are listed in descending order of frequency for ease of analysis.
Columns
- unigram: Represents a single token from the corpus.
- count: Indicates the frequency count of the corresponding token across the entire dataset.
Although the file carries a .csv extension, column values are separated by tabs.
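A minimal loading sketch in Python with pandas; the file name and the absence of a header row are assumptions, so adjust both to the actual download:

```python
import csv
import pandas as pd

# File name and lack of a header row are assumptions; adjust to the actual download.
df = pd.read_csv(
    "bert_base_uncased_unigrams.csv",
    sep="\t",                    # values are tab-separated despite the .csv extension
    header=None,
    names=["unigram", "count"],
    quoting=csv.QUOTE_NONE,      # tokens may contain quote characters
    keep_default_na=False,       # keep tokens such as "null" or "nan" as strings
)

# Relative frequency of each unigram across the reconstructed corpus.
df["probability"] = df["count"] / df["count"].sum()
print(df.head())
```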
Distribution
The dataset is provided as a single tab-separated file (with a .csv extension). It contains 7,650,688 unique unigrams, ordered from most to least frequent, giving an organised structure for analysis. No row or record count is stated beyond the number of unique unigrams.
Usage
This dataset offers several valuable applications:
- Constructing a probability distribution of data within your specific domain to assess if the BERT base model is sufficiently aligned for your task.
- Analysing the training data of specialised BERT models (e.g., Bio-BERT, Legal-BERT) and quantifying their similarity or difference to BERT base by calculating the Kullback–Leibler divergence for their shared vocabulary.
- Evaluating and identifying important bigrams when used in conjunction with its companion dataset, BERT bigrams.
- Determining the proportion of your data that is out-of-vocabulary (OOV), which is a strong indicator of whether model retraining is needed (see the sketch after this list).
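The KL-divergence and OOV checks above can be sketched as follows. Here `bert_counts` stands for the unigram-to-count mapping loaded from this dataset and `domain_counts` for a Counter built from your own tokenised corpus; the helper functions are illustrative, not part of the dataset:

```python
import math
from collections import Counter
from typing import Dict

def kl_divergence_shared(p_counts: Dict[str, int], q_counts: Dict[str, int]) -> float:
    """KL(P || Q) restricted to the vocabulary shared by both corpora."""
    shared = set(p_counts) & set(q_counts)
    p_total = sum(p_counts[w] for w in shared)
    q_total = sum(q_counts[w] for w in shared)
    return sum(
        (p_counts[w] / p_total) * math.log((p_counts[w] / p_total) / (q_counts[w] / q_total))
        for w in shared
    )

def oov_proportion(domain_counts: Dict[str, int], reference_vocab: set) -> float:
    """Fraction of domain tokens whose unigram never appears in the reference vocabulary."""
    total = sum(domain_counts.values())
    oov = sum(c for w, c in domain_counts.items() if w not in reference_vocab)
    return oov / total

# Toy counts standing in for real corpora.
bert_counts = {"the": 1000, "cell": 10, "protein": 2}
domain_counts = Counter({"the": 300, "protein": 80, "kinase": 40})
print(kl_divergence_shared(domain_counts, bert_counts))
print(oov_proportion(domain_counts, set(bert_counts)))
```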
Coverage
The data originates from the BookCorpus dataset and a processed Wikipedia dump from August 2019. It applies to English language models generally; specific geographical or demographic scopes are not relevant to a unigram frequency dataset. The dataset represents a best-effort reconstruction of the original training data for the English BERT base uncased model, and it may be updated if a Wikipedia data release that more closely approximates the original BERT training data becomes available.
License
CC-BY-SA
Who Can Use It
This dataset is ideal for:
- Machine learning engineers and researchers focused on Natural Language Processing (NLP).
- Developers and practitioners involved in fine-tuning existing BERT models.
- Individuals planning to pretrain new language models from scratch.
- Data scientists interested in corpus linguistics, vocabulary analysis, and understanding large language model foundations.
Dataset Name Suggestions
- BERT English Unigram Frequencies
- BERT Base Uncased Vocabulary Counts
- English BERT Training Data Unigrams
- Unigram Frequencies for BERT Base
- BERT Model Vocabulary Analysis Dataset
Attributes
Original Data Source: BERT English uncased unigrams