Supermarket Receipt Extraction Dataset
Retail & Consumer Behavior
Free
About
This dataset features a collection of photographs of various grocery store receipts, specifically designed for Optical Character Recognition (OCR) tasks within the retail sector. Each image is accompanied by bounding box annotations, precisely marking specific text segments on the receipts. These text segments are categorised into four distinct classes: item, store, date_time, and total, making it highly valuable for training and developing models focused on extracting structured information from receipt images.
Columns
The dataset is structured around the detection and categorisation of key text segments found on grocery receipts. The primary extracted text categories, which can be thought of as data points or 'columns' in a structured output, include:
- store: The name of the grocery store where the receipt was issued.
- item: Individual items purchased, as listed on the receipt.
- date_time: The date and time of the transaction.
- total: The total price indicated on the receipt.
Each image is provided with an XML annotation file detailing the coordinates of the bounding boxes for these detected text elements, along with the extracted text itself.
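As a rough illustration of how such an annotation file might be read, here is a minimal Python sketch. The XML schema is not documented in this listing, so the Pascal VOC-like layout below (the `object`, `name`, `text`, and `bndbox` elements) is an assumption, and the sample values are made up; adjust the tag names to match the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in a Pascal VOC-like layout (an assumption;
# the real files in the 'boxes' folder may use different tag names).
SAMPLE_XML = """
<annotation>
  <filename>receipt_001.jpg</filename>
  <object>
    <name>store</name>
    <text>SPAR</text>
    <bndbox><xmin>40</xmin><ymin>12</ymin><xmax>210</xmax><ymax>48</ymax></bndbox>
  </object>
  <object>
    <name>date_time</name>
    <text>2023-01-15 14:32</text>
    <bndbox><xmin>40</xmin><ymin>60</ymin><xmax>200</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>item</name>
    <text>MILK 1L 1.29</text>
    <bndbox><xmin>40</xmin><ymin>90</ymin><xmax>200</xmax><ymax>110</ymax></bndbox>
  </object>
  <object>
    <name>total</name>
    <text>1.29</text>
    <bndbox><xmin>40</xmin><ymin>200</ymin><xmax>120</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return one dict per annotated text segment: label, text, bbox."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        bnd = obj.find("bndbox")
        # Coordinates are read as (xmin, ymin, xmax, ymax) in pixels.
        bbox = tuple(
            int(float(bnd.findtext(key)))
            for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append({
            "label": obj.findtext("name"),
            "text": obj.findtext("text", default=""),
            "bbox": bbox,
        })
    return boxes


boxes = parse_annotation(SAMPLE_XML)
for box in boxes:
    print(box["label"], box["bbox"], box["text"])
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the rest of the function unchanged.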
Distribution
The dataset is supplied as images of receipts, with accompanying XML annotation files that provide bounding box coordinates and detected text. It includes a receipts.csv file, indicating that extracted data can also be provided in a tabular format. The dataset size for Version 1 is 56.33 MB. While specific numbers for rows or records are not explicitly available, the structure includes original images in an 'images' folder and bounding box labels in a 'boxes' folder.
Usage
This dataset is ideal for various applications, particularly those involving Optical Character Recognition (OCR), text detection, and text recognition from scanned documents. Key use cases include:
- Developing and refining deep learning models for receipt processing.
- Automating data extraction from grocery receipts for retail analytics.
- Building systems for retail store management, such as inventory tracking or expense management.
- Applications requiring document text recognition and text area detection.
- Creating tools for image-to-text conversion for consumer goods data.
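To make the data extraction use case more concrete, the sketch below groups labelled text segments into one structured record per receipt. The input format (dicts with `label`, `text`, and `bbox` keys) and the helper name `boxes_to_record` are hypothetical, the sample values are invented, and the top-to-bottom ordering is a simple heuristic rather than anything the dataset prescribes.

```python
from collections import defaultdict

# Invented example segments using the four classes from this dataset.
# bbox is assumed to be (xmin, ymin, xmax, ymax) in pixels.
sample_boxes = [
    {"label": "total", "text": "3.48", "bbox": (40, 200, 120, 220)},
    {"label": "item", "text": "BREAD 2.19", "bbox": (40, 120, 200, 140)},
    {"label": "store", "text": "SPAR", "bbox": (40, 12, 210, 48)},
    {"label": "item", "text": "MILK 1L 1.29", "bbox": (40, 90, 200, 110)},
    {"label": "date_time", "text": "2023-01-15 14:32", "bbox": (40, 60, 200, 80)},
]


def boxes_to_record(boxes):
    """Group labelled segments into a single receipt record."""
    by_label = defaultdict(list)
    # Sort by vertical then horizontal position so items keep reading order.
    for box in sorted(boxes, key=lambda b: (b["bbox"][1], b["bbox"][0])):
        by_label[box["label"]].append(box["text"])

    def first(label):
        # Use the topmost segment of a class, or None if the class is absent.
        values = by_label.get(label, [])
        return values[0] if values else None

    return {
        "store": first("store"),
        "date_time": first("date_time"),
        "total": first("total"),
        "items": list(by_label.get("item", [])),
    }


record = boxes_to_record(sample_boxes)
print(record)
```

Records in this shape could then be written out as rows of a table, one per receipt, for downstream retail analytics.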
Coverage
The dataset primarily covers grocery store receipts from a variety of retailers, including major chains such as Walmart, Trader Joe's, SPAR, Whole Foods Market, Costco Wholesale, and WinCo Foods, among others. The sample data indicates a broad time range, with receipts dating from 2007 through to 2023, showcasing a diverse collection over multiple years. No specific geographic or demographic scope beyond "various grocery store receipts" is mentioned.
License
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Who Can Use It
This dataset is intended for a range of users involved in data science, machine learning, and retail technology. Examples include:
- AI/ML Engineers and Data Scientists building and training OCR models for document processing.
- Retail Businesses aiming to automate receipt data entry for financial tracking or customer insights.
- Researchers studying text detection, object detection, and image-to-text challenges in varied document layouts.
- Developers creating applications for expense management, loyalty programmes, or automated checkout systems.
Dataset Name Suggestions
- Grocery Receipt OCR Dataset
- Retail Receipts Text Detection
- Annotated Grocery Receipt Images
- OCR Receipt Data for Retail
- Supermarket Receipt Extraction Dataset
Attributes
Original Data Source: Supermarket Receipt Extraction Dataset