BERT Model Vocabulary Analysis Dataset
About
This dataset provides a detailed reconstruction of the training data used for the English BERT base uncased model. It is useful for understanding the vocabulary these foundational models were trained on, which subword tokenisation otherwise obscures. By examining the unigrams and their distribution in the original training data, users can assess whether their own data would benefit from fine-tuning an existing BERT model or would require pretraining a model from scratch.
The data was compiled from the BookCorpus dataset and a processed Wikipedia dump from August 2019. Consistent with BERT's tokenisation scheme, all punctuation and stopwords are retained. The original Unicode text was normalised using NFKC, then tokenised with the spaCy English large model, and the total count of each unigram across both corpora was recorded. Unigrams are listed in descending order of frequency for ease of analysis.
Columns
- unigram: Represents a single token from the corpus.
- count: Indicates the frequency count of the corresponding token across the entire dataset.
Although the file carries a .csv extension, column values are separated by tabs.
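A minimal loading sketch in Python with pandas; the file name and the absence of a header row are assumptions, so adjust both to the actual download:

```python
import csv
import pandas as pd

# File name and lack of a header row are assumptions; adjust to the actual download.
df = pd.read_csv(
    "bert_base_uncased_unigrams.csv",
    sep="\t",                    # values are tab-separated despite the .csv extension
    header=None,
    names=["unigram", "count"],
    quoting=csv.QUOTE_NONE,      # tokens may contain quote characters
    keep_default_na=False,       # keep tokens such as "null" or "nan" as strings
)

# Relative frequency of each unigram across the reconstructed corpus.
df["probability"] = df["count"] / df["count"].sum()
print(df.head())
```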
Distribution
The dataset is provided as a single tab-separated file (with a .csv extension). It contains 7,650,688 unique unigrams, ordered from most to least frequent, giving an organised structure for analysis. No row or record count is stated beyond the number of unique unigrams.
Usage
This dataset offers several valuable applications:
- Constructing a probability distribution of data within your specific domain to assess if the BERT base model is sufficiently aligned for your task.
- Analysing the training data of specialised BERT models (e.g., Bio-BERT, Legal-BERT) and quantifying their similarity or difference to BERT base by calculating the Kullback–Leibler divergence for their shared vocabulary.
- Evaluating and identifying important bigrams when used in conjunction with its companion dataset, BERT bigrams.
- Determining the proportion of your data that is out-of-vocabulary (OOV), which is a strong indicator of whether model retraining is needed (see the sketch after this list).
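The KL-divergence and OOV checks above can be sketched as follows. Here `bert_counts` stands for the unigram-to-count mapping loaded from this dataset and `domain_counts` for a Counter built from your own tokenised corpus; the helper functions are illustrative, not part of the dataset:

```python
import math
from collections import Counter
from typing import Dict

def kl_divergence_shared(p_counts: Dict[str, int], q_counts: Dict[str, int]) -> float:
    """KL(P || Q) restricted to the vocabulary shared by both corpora."""
    shared = set(p_counts) & set(q_counts)
    p_total = sum(p_counts[w] for w in shared)
    q_total = sum(q_counts[w] for w in shared)
    return sum(
        (p_counts[w] / p_total) * math.log((p_counts[w] / p_total) / (q_counts[w] / q_total))
        for w in shared
    )

def oov_proportion(domain_counts: Dict[str, int], reference_vocab: set) -> float:
    """Fraction of domain tokens whose unigram never appears in the reference vocabulary."""
    total = sum(domain_counts.values())
    oov = sum(c for w, c in domain_counts.items() if w not in reference_vocab)
    return oov / total

# Toy counts standing in for real corpora.
bert_counts = {"the": 1000, "cell": 10, "protein": 2}
domain_counts = Counter({"the": 300, "protein": 80, "kinase": 40})
print(kl_divergence_shared(domain_counts, bert_counts))
print(oov_proportion(domain_counts, set(bert_counts)))
```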
Coverage
The data originates from the BookCorpus dataset and a processed Wikipedia dump from August 2019. It applies to English language models generally; specific geographical or demographic scopes are not relevant to a unigram frequency dataset. The dataset represents a best-effort reconstruction of the original training data for the English BERT base uncased model, and it may be updated if a Wikipedia data release that more closely approximates the original BERT training data becomes available.
License
CC-BY-SA
Who Can Use It
This dataset is ideal for:
- Machine learning engineers and researchers focused on Natural Language Processing (NLP).
- Developers and practitioners involved in fine-tuning existing BERT models.
- Individuals planning to pretrain new language models from scratch.
- Data scientists interested in corpus linguistics, vocabulary analysis, and understanding large language model foundations.
Dataset Name Suggestions
- BERT English Unigram Frequencies
- BERT Base Uncased Vocabulary Counts
- English BERT Training Data Unigrams
- Unigram Frequencies for BERT Base
- BERT Model Vocabulary Analysis Dataset
Attributes
Original Data Source: BERT English uncased unigrams