Opendatabay APP

Supermarket Receipt Extraction Dataset

Retail & Consumer Behavior

Tags and Keywords

OCR, Receipts, Retail, Images, Text

Free

About

This dataset is a collection of photographs of grocery store receipts, intended for Optical Character Recognition (OCR) tasks in the retail sector. Each image comes with bounding box annotations that mark specific text segments on the receipt. The segments are categorised into four classes: item, store, date_time, and total, which makes the dataset suited to training and developing models that extract structured information from receipt images.

Columns

The dataset is structured around the detection and categorisation of key text segments found on grocery receipts. The primary extracted text categories, which can be thought of as data points or 'columns' in a structured output, include:
  • store: The name of the grocery store where the receipt was issued.
  • item: Individual items purchased, as listed on the receipt.
  • date_time: The date and time of the transaction.
  • total: The total price indicated on the receipt.

Each image is provided with an XML annotation file detailing the coordinates of the bounding boxes for these detected text elements, along with the extracted text itself.
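
The listing does not document the XML schema, so the following is only a minimal sketch of how the bounding boxes might be read. It assumes a Pascal VOC-style layout (an `object` element holding a `name` and a `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`); those element names, the `Box` type, and the `SAMPLE` document are assumptions for illustration, not taken from the dataset. Where the extracted text is stored in the real files is not stated, so the sketch does not try to read it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# The four classes named in the listing.
CLASSES = {"store", "item", "date_time", "total"}


@dataclass
class Box:
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def parse_annotation(xml_text: str) -> list[Box]:
    """Return the labelled boxes found in one annotation document.

    Assumes Pascal VOC-style element names; adjust them to the real schema.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = (obj.findtext("name") or "").strip()
        bnd = obj.find("bndbox")
        if label not in CLASSES or bnd is None:
            continue  # skip unknown classes and objects without a box
        try:
            coords = [int(float(bnd.findtext(tag)))
                      for tag in ("xmin", "ymin", "xmax", "ymax")]
        except (TypeError, ValueError):
            continue  # skip boxes with missing or non-numeric coordinates
        boxes.append(Box(label, *coords))
    return boxes


# Made-up annotation used only to exercise the parser.
SAMPLE = """
<annotation>
  <object><name>store</name>
    <bndbox><xmin>10</xmin><ymin>5</ymin><xmax>200</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>logo</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>5</xmax><ymax>5</ymax></bndbox>
  </object>
  <object><name>total</name>
    <bndbox><xmin>40</xmin><ymin>300</ymin><xmax>120</xmax><ymax>320</ymax></bndbox>
  </object>
</annotation>
"""

boxes = parse_annotation(SAMPLE)
```

Objects whose label is not one of the four listed classes are dropped here; a real pipeline might prefer to log them instead.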

Distribution

The dataset is supplied as images of receipts, with accompanying XML annotation files that provide bounding box coordinates and detected text. It also includes a receipts.csv file, which suggests that the extracted data is available in tabular form as well. The dataset size for Version 1 is 56.33 MB. The listing does not state the number of images or records; the archive holds the original images in an 'images' folder and the bounding box labels in a 'boxes' folder.
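
As a rough sketch of how the two folders might be lined up, the snippet below pairs each image with the annotation file that shares its base name. The 'images' and 'boxes' folder names come from the listing; the shared-base-name convention, the `.xml` extension, and the helper name `pair_by_stem` are assumptions that should be checked against the extracted archive.

```python
from pathlib import Path
from typing import Iterable


def pair_by_stem(images: Iterable[Path],
                 boxes: Iterable[Path]) -> list[tuple[Path, Path]]:
    """Match each image to the XML file with the same base name."""
    box_by_stem = {p.stem: p for p in boxes if p.suffix.lower() == ".xml"}
    pairs = []
    for img in sorted(images):
        box = box_by_stem.get(img.stem)
        if box is not None:  # images without an annotation are left out
            pairs.append((img, box))
    return pairs


# Hypothetical usage against an extracted archive:
# pairs = pair_by_stem(Path("images").iterdir(), Path("boxes").iterdir())
```

Taking plain iterables of paths rather than folder names keeps the helper easy to try out without the files on disk.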

Usage

This dataset is ideal for various applications, particularly those involving Optical Character Recognition (OCR), text detection, and text recognition from scanned documents. Key use cases include:
  • Developing and refining deep learning models for receipt processing.
  • Automating data extraction from grocery receipts for retail analytics.
  • Building systems for retail store management, such as inventory tracking or expense management.
  • Applications requiring document text recognition and text area detection.
  • Creating tools for image-to-text conversion for consumer goods data.
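
To illustrate the data-extraction use case, here is a minimal sketch that folds labelled text segments into one structured record per receipt. The input shape (a list of `(label, text)` pairs), the function name `to_record`, and the sample strings are hypothetical; only the four class names come from the listing.

```python
from collections import defaultdict


def to_record(segments: list[tuple[str, str]]) -> dict:
    """Group (label, text) pairs into a single receipt record."""
    grouped = defaultdict(list)
    for label, text in segments:
        grouped[label].append(text.strip())
    return {
        # Multi-box store names and timestamps are joined in reading order.
        "store": " ".join(grouped["store"]) or None,
        "date_time": " ".join(grouped["date_time"]) or None,
        # If several totals were detected, keep the last one seen.
        "total": grouped["total"][-1] if grouped["total"] else None,
        "items": grouped["item"],
    }


# Made-up segments, not taken from the dataset.
record = to_record([
    ("store", "Example Mart"),
    ("item", "MILK 2.49"),
    ("item", "BREAD 1.99"),
    ("date_time", "01/02/2020 10:15"),
    ("total", "4.48"),
])
```

Keeping the last detected total is a design guess: receipts often print subtotals above the final amount, but that should be validated on real samples.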

Coverage

The dataset primarily covers grocery store receipts from a variety of retailers, including major chains such as Walmart, Trader Joe's, SPAR, Whole Foods Market, Costco Wholesale, and WinCo Foods, among others. The sample data spans a broad time range, with receipts dated from 2007 to 2023. No specific geographic or demographic scope beyond "various grocery store receipts" is mentioned.

License

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Who Can Use It

This dataset is intended for a range of users involved in data science, machine learning, and retail technology. Examples include:
  • AI/ML Engineers and Data Scientists building and training OCR models for document processing.
  • Retail Businesses aiming to automate receipt data entry for financial tracking or customer insights.
  • Researchers studying text detection, object detection, and image-to-text challenges in varied document layouts.
  • Developers creating applications for expense management, loyalty programmes, or automated checkout systems.

Dataset Name Suggestions

  • Grocery Receipt OCR Dataset
  • Retail Receipts Text Detection
  • Annotated Grocery Receipt Images
  • OCR Receipt Data for Retail
  • Supermarket Receipt Extraction Dataset

Attributes

Listing Stats

VIEWS: 1

DOWNLOADS: 0

LISTED: 12/08/2025

REGION: GLOBAL

UDQS QUALITY (Universal Data Quality Score): 5 / 5

VERSION: 1.0

Download Dataset in ZIP Format