
Historical English OCR Correction Dataset

E-commerce & Online Transactions

Tags and Keywords

Business

NLP

Text

History

LSTM

Benchmark


Free

About

This dataset is a preprocessed version of historical English texts, designed for OCR error detection and correction tasks. It contains raw text output generated by Optical Character Recognition (OCR) systems alongside its aligned ground truth, which is the corrected, gold-standard text. The OCR output often includes inaccuracies such as misrecognised characters and missing words, which makes the dataset well suited to training models that identify and fix these common OCR errors.

Columns

The dataset is structured with three key columns:
  • OCR_toInput: This column contains the raw text output directly from the OCR system, which may include various errors.
  • OCR_aligned: This column provides the OCR output with character-level alignment, making it suitable for precise correction efforts.
  • GS_aligned: This column holds the corrected ground truth (gold-standard) text, serving as the accurate reference for comparison and training.

All three columns contain text data, with 724 unique values each.
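As a rough illustration of the column layout described above, the following sketch reads a CSV with the three named columns using only the Python standard library. The helper name `load_rows` and the sample row are invented for illustration; the listing does not give a file name, so the example reads from an in-memory stand-in rather than the real file.

```python
import csv
import io

# Column names as given in the listing.
EXPECTED_COLUMNS = ("OCR_toInput", "OCR_aligned", "GS_aligned")


def load_rows(handle):
    """Read all rows from an open CSV handle and check the three columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for the real file (contents invented for illustration).
sample = io.StringIO(
    "OCR_toInput,OCR_aligned,GS_aligned\n"
    "tbe cat aud dog,tbe cat aud dog,the cat and dog\n"
)
rows = load_rows(sample)
print(len(rows), rows[0]["GS_aligned"])
```

With the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. The UTF-8 encoding is an assumption, not something the listing states.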

Distribution

The dataset is stored in CSV format and comprises 724 records. The character-level OCR error rate across the dataset is approximately 1.79%. Common errors, shown as OCR output → correct text, include "1 → I", "tbe → the", "tho → the", and "aud → and".
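One plausible way to arrive at a character-level error rate from aligned text is to count the positions where the OCR string and the gold-standard string differ, then divide by the total number of gold-standard characters. The sketch below does this under the assumption that each `OCR_aligned` / `GS_aligned` pair has the same length after alignment. The listing does not say how the 1.79% figure was computed, so this is not necessarily the method behind it, and the sample pairs are invented.

```python
def count_mismatches(ocr_aligned, gs_aligned):
    """Count positions where two equal-length aligned strings differ."""
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned strings must have the same length")
    return sum(1 for o, g in zip(ocr_aligned, gs_aligned) if o != g)


def corpus_error_rate(pairs):
    """Mismatched characters divided by total gold-standard characters."""
    errors = 0
    total = 0
    for ocr_aligned, gs_aligned in pairs:
        errors += count_mismatches(ocr_aligned, gs_aligned)
        total += len(gs_aligned)
    return errors / total if total else 0.0


# Invented sample pairs: 2 mismatches over 10 characters.
pairs = [("tbe cat", "the cat"), ("aud", "and")]
rate = corpus_error_rate(pairs)
print(f"{rate:.2%}")
```

If the aligned columns use a padding symbol to mark insertions and deletions, those positions would simply count as mismatches here. Whether that matches the intended definition depends on the alignment scheme, which the listing does not describe.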

Usage

This dataset is ideal for a variety of applications, including:
  • OCR Error Detection & Correction: Directly useful for developing and refining algorithms that identify and rectify errors in OCR-generated text.
  • Training Character-Based Machine Translation Models: Can be used to train character-level translation models that are more robust to noisy or error-prone input.
  • Natural Language Processing (NLP) on Historical Texts: Provides a clean and error-corrected source for NLP research and applications involving historical documents.
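For the detection and correction uses listed above, a common first step is to turn each aligned pair into per-character labels and to tally which substitutions occur most often. The following is a minimal sketch of that idea, assuming equal-length aligned strings. The function names `error_mask` and `confusion_counts` are hypothetical, and the sample pairs are invented.

```python
from collections import Counter


def error_mask(ocr_aligned, gs_aligned):
    """Return 1 for each position where OCR differs from the gold standard, else 0."""
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned strings must have the same length")
    return [int(o != g) for o, g in zip(ocr_aligned, gs_aligned)]


def confusion_counts(pairs):
    """Count (ocr_char, gold_char) substitutions across all aligned pairs."""
    counts = Counter()
    for ocr_aligned, gs_aligned in pairs:
        for o, g in zip(ocr_aligned, gs_aligned):
            if o != g:
                counts[(o, g)] += 1
    return counts


# Invented sample pairs for illustration.
pairs = [("tbe", "the"), ("aud", "and"), ("tbe", "the")]
mask = error_mask("tbe", "the")
top = confusion_counts(pairs).most_common(1)
print(mask, top)
```

The mask could serve as a target for an error-detection model, while the confusion counts give a quick view of which character substitutions dominate before any model is trained.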

Coverage

The dataset consists of historical English texts, with a focus on historical monographs. The listing does not specify a time range. Regional coverage is listed as global, so the data is not tied to a specific geography.

License

CC BY-SA

Who Can Use It

This dataset is primarily intended for:
  • Researchers and developers in the fields of Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) who are working on text processing.
  • Those focused on improving the accuracy of OCR technology for digitised historical documents.
  • Academics and data scientists interested in historical text analysis and the challenges posed by OCR errors.

Dataset Name Suggestions

  • Historical English OCR Correction Dataset
  • Preprocessed Monograph OCR Error Data
  • Aligned OCR Ground Truth for English Texts
  • OCR Post-Correction Benchmark Dataset

Attributes

Listing Stats

LISTED

17/06/2025

REGION

GLOBAL

UNIVERSAL DATA QUALITY SCORE (UDQS)

5 / 5

VERSION

1.0
