Historical English OCR Correction Dataset
Free
About
This dataset is a preprocessed collection of historical English texts designed for OCR error detection and correction tasks. It pairs the raw text produced by Optical Character Recognition (OCR) systems with aligned ground truth, that is, the corrected gold-standard text. The OCR output often contains inaccuracies such as misrecognised characters and missing words, which makes the dataset well suited to training models that identify and fix these common errors.
Columns
The dataset is structured with three key columns:
- OCR_toInput: This column contains the raw text output directly from the OCR system, which may include various errors.
- OCR_aligned: This column provides the OCR output aligned at character level with the ground truth, which makes it suitable for precise correction work.
- GS_aligned: This column holds the corrected ground-truth (gold-standard) text, the accurate reference for comparison and training.
All three columns contain text data, with 724 unique values each.
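As a minimal sketch of reading the three columns, the example below uses Python's standard csv module and checks that the expected headers are present. The sample row is made up for illustration and stands in for the real CSV file, whose name is not given here; the load_rows helper is a hypothetical name, not part of the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ("OCR_toInput", "OCR_aligned", "GS_aligned")


def load_rows(handle):
    """Read rows from an open CSV handle, keeping only the three expected columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# Made-up one-row sample standing in for the real file (assumption).
SAMPLE_CSV = (
    "OCR_toInput,OCR_aligned,GS_aligned\n"
    "tbe cat aud dog,tbe cat aud dog,the cat and dog\n"
)

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["GS_aligned"])
```

To read the actual file, pass an open file handle (opened with newline="" and a suitable encoding) in place of the io.StringIO object.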
Distribution
The dataset is stored in CSV format and comprises 724 records. The character-level OCR error rate observed in the dataset is approximately 1.79%. Common OCR inaccuracies include "1 → I", "tbe → the", "tho → the", and "aud → and".
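One way a character-level error rate of this kind could be estimated is to compare the aligned columns position by position. The sketch below assumes that OCR_aligned and GS_aligned have equal length in every row, which the description does not confirm, and it is not necessarily how the 1.79% figure was computed. The char_error_rate function and the sample pair are illustrative only.

```python
def char_error_rate(pairs):
    """Fraction of aligned character positions where OCR and ground truth differ.

    Assumes each (ocr_aligned, gs_aligned) pair has equal length; this is an
    assumption about the data, so unequal pairs raise an error instead.
    """
    mismatches = 0
    total = 0
    for ocr, gs in pairs:
        if len(ocr) != len(gs):
            raise ValueError("aligned strings differ in length")
        mismatches += sum(a != b for a, b in zip(ocr, gs))
        total += len(gs)
    return mismatches / total if total else 0.0


# Made-up aligned pair: 2 differing characters out of 15.
sample_pairs = [("tbe cat aud dog", "the cat and dog")]
rate = char_error_rate(sample_pairs)
print(f"{rate:.2%}")  # prints 13.33%
```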
Usage
This dataset is ideal for a variety of applications, including:
- OCR Error Detection & Correction: Directly useful for developing and refining algorithms that identify and rectify errors in OCR-generated text.
- Training Character-Based Machine Translation Models: Can be used to enhance the accuracy of translation models, especially when dealing with noisy or error-prone input.
- Natural Language Processing (NLP) on Historical Texts: Provides a clean and error-corrected source for NLP research and applications involving historical documents.
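For the error detection use case, a simple starting point is to count which OCR words are replaced by which ground-truth words. The sketch below uses difflib.SequenceMatcher from the Python standard library on whitespace-separated words; it is a baseline illustration under that tokenisation assumption, not a method prescribed by the dataset. The word_confusions name and the sample pairs are made up.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_confusions(pairs):
    """Count (ocr_word, gs_word) substitutions across (ocr_text, gs_text) pairs."""
    counts = Counter()
    for ocr_text, gs_text in pairs:
        ocr_words = ocr_text.split()
        gs_words = gs_text.split()
        matcher = SequenceMatcher(None, ocr_words, gs_words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only pair words up when the replaced spans have the same length;
            # insertions, deletions, and uneven replacements are skipped.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(ocr_words[i1:i2], gs_words[j1:j2]))
    return counts


# Made-up pairs for illustration.
sample_pairs = [
    ("tbe cat aud dog", "the cat and dog"),
    ("tbe end", "the end"),
]
confusions = word_confusions(sample_pairs)
print(confusions.most_common(1))  # prints [(('tbe', 'the'), 2)]
```

The resulting counts could seed a lookup table of frequent corrections or serve as a sanity check before training a model.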
Coverage
The dataset consists of historical English texts. Specific time ranges are not detailed, although the focus on historical monographs suggests a broad historical scope. Regional coverage is listed as global, so the data is not tied to a specific geography.
License
CC BY-SA
Who Can Use It
This dataset is primarily intended for:
- Researchers and developers in the fields of Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) who are working on text processing.
- Those focused on improving the accuracy of OCR technology for digitised historical documents.
- Academics and data scientists interested in historical text analysis and the challenges posed by OCR errors.
Dataset Name Suggestions
- Historical English OCR Correction Dataset
- Preprocessed Monograph OCR Error Data
- Aligned OCR Ground Truth for English Texts
- OCR Post-Correction Benchmark Dataset
Attributes
Original Data Source: English Monograph OCR Dataset (Preprocessed) 📄🔍