Bangla Handwritten Text Recognition Dataset
Data Science and Analytics
Tags and Keywords
Trusted By




"No reviews yet"
Free
About
A new dataset designed to address the scarcity of annotated resources for Offline Handwritten Text Recognition (HTR) in the resource-constrained Bangla (Bengali) language. This collection features full-page handwritten scripts intended for training Neural Network-based HTR model architectures. These models are unique in that they can be taught to recognize entire pages of handwritten text without requiring pre-segmentation of the image. The dataset facilitates rigorous comparative study of Image-to-Sequence architectures, allowing researchers to evaluate performance using metrics such as Character Error Rate (CER), Word Error Rate (WER), and Sequence Error Rate (SER). The source material includes varied texts, such as literature and detective novels.
Columns
The dataset includes metadata detailing the source and characteristics of each script:
- SN: Serial Number.
- Label Count: The number of labels associated with the script, ranging from 1 to 111.
- Filename: The unique file identifier for the scanned image.
- Username: The anonymised user identifier for the contributor (49 unique users).
- Age: The age of the contributor, with a mean of 37.2.
- Gender: The gender of the contributor (Male or Female), with 68% being Male.
- Occupation: The job or student status of the contributor, where Student makes up 50% and Service - Government Sector makes up 25%.
- Category: The genre of the text provided, most commonly Literature - Short Story (37%) or Literature - Detective Novel (14%).
- Char Count: The total character count in the script, ranging from 324 to 1.48k, with a mean of 1.08k.
- Article link: The source article or text link.
- Strike: A boolean indicating if strike-throughs are present in the script (True in 50% of records).
- Bangla - English: A boolean indicating if the script contains a mixture of Bangla and English text (True in 23% of records).
- Multi - Paragraph: A boolean indicating if the script contains multiple paragraphs (True in 50% of records).
Distribution
The dataset, packaged as
Bongabdo_Metadata.csv, contains 12 columns and 111 valid records. The file size is 18.81 kB. The data provides detailed metadata accompanying scanned handwritten scripts. This dataset is not expected to receive future updates. The mean character count per script is 1.08 thousand.Usage
This dataset is ideal for:
- Developing and training Neural Network HTR models that utilize Image-to-Sequence architectures.
- Conducting comparative studies on HTR settings and hyperparameters for Bengali script recognition.
- Benchmarking the performance of HTR systems using standard metrics like CER, WER, and SER.
- Researching methods designed to handle high variability in handwritten styles stemming from diverse demographics.
Coverage
The data focuses on Bangla (Bengali) handwriting, collected from a wide variety of contributors. There are 49 unique contributors providing the script images.
- Demographics: The scripts capture variation across age groups (minimum age 8, maximum age 62), gender (32% Female), and occupation (including students and government service workers).
- Linguistic Scope: While primarily Bangla, approximately 23% of the scripts contain a mix of both Bangla and English text.
- Stylistic Variation: The collection includes examples with and without strike-throughs and scripts that contain multiple paragraphs, reflecting real-world document variations.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Machine Learning Researchers: To develop and refine state-of-the-art models for text recognition in Indic languages.
- AI Developers: For building applications that require the digitisation of handwritten documents in Bengali.
- Computational Linguists: To analyse the impact of stylistic and demographic factors on automated recognition accuracy.
Dataset Name Suggestions
- Bangla Handwritten Text Recognition Dataset
- Full-page Bengali HTR Scripts
- Bongabdo HTR Corpus
Attributes
Original Data Source: Bangla Handwritten Text Recognition Dataset
Loading...
