Opendatabay APP

Bangla Handwritten Text Recognition Dataset

Data Science and Analytics

Tags and Keywords

Bengali, HTR, Handwriting, Image, Recognition

Free

About

This new dataset addresses the scarcity of annotated resources for offline Handwritten Text Recognition (HTR) in Bangla (Bengali), a low-resource language. It consists of full-page handwritten scripts intended for training neural-network HTR models, which can learn to recognize an entire page of handwritten text without requiring pre-segmentation of the image. The dataset supports comparative studies of Image-to-Sequence architectures, with performance evaluated using metrics such as Character Error Rate (CER), Word Error Rate (WER), and Sequence Error Rate (SER). The source texts are varied and include literature such as short stories and detective novels.
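The three metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used by the dataset authors: CER and WER are computed here as Levenshtein edit distance divided by reference length (over characters and whitespace-separated words respectively), and SER is assumed to mean the fraction of sequences that are not reproduced exactly, since the listing does not define it. All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character Error Rate: character edits / reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word Error Rate: word edits / reference words (whitespace tokens)."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)


def ser(refs, hyps):
    """Sequence Error Rate (assumed definition): share of inexact sequences."""
    return sum(1 for r, h in zip(refs, hyps) if r != h) / len(refs)


refs = ["abc def", "xyz"]
hyps = ["abd def", "xyz"]
print(cer(refs, hyps), wer(refs, hyps), ser(refs, hyps))
```

One caveat for Bangla text: Python strings are indexed by Unicode code point, so a conjunct or a consonant with a vowel sign counts as several characters here. A CER over grapheme clusters would need a segmentation step before calling `edit_distance`.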

Columns

The dataset includes metadata detailing the source and characteristics of each script:
  • SN: Serial Number.
  • Label Count: The number of labels associated with the script, ranging from 1 to 111.
  • Filename: The unique file identifier for the scanned image.
  • Username: The anonymised user identifier for the contributor (49 unique users).
  • Age: The age of the contributor, with a mean of 37.2.
  • Gender: The gender of the contributor (Male or Female), with 68% being Male.
  • Occupation: The job or student status of the contributor, where Student makes up 50% and Service - Government Sector makes up 25%.
  • Category: The genre of the text provided, most commonly Literature - Short Story (37%) or Literature - Detective Novel (14%).
  • Char Count: The total character count in the script, ranging from 324 to about 1,480, with a mean of about 1,080.
  • Article link: The source article or text link.
  • Strike: A boolean indicating if strike-throughs are present in the script (True in 50% of records).
  • Bangla - English: A boolean indicating if the script contains a mixture of Bangla and English text (True in 23% of records).
  • Multi - Paragraph: A boolean indicating if the script contains multiple paragraphs (True in 50% of records).

Distribution

The dataset is packaged as Bongabdo_Metadata.csv (18.81 kB), which contains 12 columns and 111 valid records of metadata for the scanned handwritten scripts. The mean character count per script is about 1,080. The dataset is not expected to receive future updates.
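A minimal sketch of reading the metadata with Python's standard `csv` module follows. It assumes the CSV header names match the column labels listed under Columns and that the boolean columns are serialised as the text `True` / `False`; neither is confirmed by the listing. The two sample rows, including their filenames and links, are hypothetical stand-ins for real records.

```python
import csv
import io

# Assumed header names, copied from the column labels in this listing.
BOOL_COLUMNS = ("Strike", "Bangla - English", "Multi - Paragraph")
INT_COLUMNS = ("Age", "Char Count")


def load_metadata(handle):
    """Read metadata rows from an open text file, converting typed columns."""
    rows = []
    for row in csv.DictReader(handle):
        for name in BOOL_COLUMNS:
            row[name] = row[name].strip().lower() == "true"
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


# Hypothetical sample in the assumed layout (not real dataset records).
SAMPLE = """SN,Label Count,Filename,Username,Age,Gender,Occupation,Category,Char Count,Article link,Strike,Bangla - English,Multi - Paragraph
1,1,script_001.jpg,user_01,21,Male,Student,Literature - Short Story,1080,https://example.com/a,True,False,True
2,2,script_002.jpg,user_02,45,Female,Service - Government Sector,Literature - Detective Novel,324,https://example.com/b,False,True,False
"""

rows = load_metadata(io.StringIO(SAMPLE))
mixed = [r["Filename"] for r in rows if r["Bangla - English"]]
print(mixed)  # ['script_002.jpg']
```

For the real file, the same function could be called as `load_metadata(open("Bongabdo_Metadata.csv", newline="", encoding="utf-8"))`; the UTF-8 encoding is also an assumption.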

Usage

This dataset is ideal for:
  • Developing and training Neural Network HTR models that utilize Image-to-Sequence architectures.
  • Conducting comparative studies on HTR settings and hyperparameters for Bengali script recognition.
  • Benchmarking the performance of HTR systems using standard metrics like CER, WER, and SER.
  • Researching methods designed to handle high variability in handwritten styles stemming from diverse demographics.
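When studying variability across handwriting styles, one common precaution is to keep each contributor's scripts entirely in either the training or the test set, so that a model is evaluated on writers it has not seen. The sketch below shows one way to do this using the Username column. It is an illustrative assumption, not a split published with the dataset; the function name, the 20% test fraction, and the demo rows are all hypothetical.

```python
import random


def split_by_contributor(rows, test_fraction=0.2, seed=0):
    """Split metadata rows so that no contributor appears in both subsets."""
    users = sorted({row["Username"] for row in rows})
    random.Random(seed).shuffle(users)  # seeded for a reproducible split
    n_test = max(1, round(len(users) * test_fraction))
    held_out = set(users[:n_test])
    train = [row for row in rows if row["Username"] not in held_out]
    test = [row for row in rows if row["Username"] in held_out]
    return train, test


# Hypothetical rows: five contributors with two scripts each.
demo_rows = [
    {"Filename": f"script_{u}_{k}.jpg", "Username": f"user_{u:02d}"}
    for u in range(1, 6)
    for k in range(2)
]
train_rows, test_rows = split_by_contributor(demo_rows)
print(len(train_rows), len(test_rows))  # 8 2
```

With only 49 contributors and 111 records, any held-out set will be small, so results from a single split should be read with caution.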

Coverage

The data covers Bangla (Bengali) handwriting collected from 49 unique contributors.
  • Demographics: The scripts capture variation across age groups (minimum age 8, maximum age 62), gender (32% Female), and occupation (including students and government service workers).
  • Linguistic Scope: While primarily Bangla, approximately 23% of the scripts contain a mix of both Bangla and English text.
  • Stylistic Variation: The collection includes examples with and without strike-throughs and scripts that contain multiple paragraphs, reflecting real-world document variations.

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

  • Machine Learning Researchers: To develop and refine state-of-the-art models for text recognition in Indic languages.
  • AI Developers: For building applications that require the digitisation of handwritten documents in Bengali.
  • Computational Linguists: To analyse the impact of stylistic and demographic factors on automated recognition accuracy.

Dataset Name Suggestions

  • Bangla Handwritten Text Recognition Dataset
  • Full-page Bengali HTR Scripts
  • Bongabdo HTR Corpus

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 22/11/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
