Cafe Sales Data Cleaning Challenge

NLP / Natural Language Processing

Tags and Keywords

Cleaning, Sales, Cafe, Data, Dirty

Free

About

This dataset, titled "Dirty Cafe Sales Dataset", contains 10,000 rows of synthetic sales transaction data from a cafe [1, 2]. It has been deliberately made "dirty" with missing values, inconsistent entries, and other errors [1]. Its primary purpose is to provide a realistic scenario for practising data cleaning, data wrangling, and exploratory data analysis (EDA) [1, 3], which makes it a useful resource for anyone refining their skills in handling real-world data problems [1].

Columns

The dataset comprises 8 columns [2], each detailing aspects of a cafe sales transaction:
  • Transaction ID: A unique identifier for each transaction. This column is always present [2, 4].
  • Item: The name of the item purchased, such as "Coffee" or "Sandwich". This column may contain missing or invalid values like "ERROR" [2, 4].
  • Quantity: The quantity of the item purchased. This column might have missing or invalid entries such as "UNKNOWN" [2, 5].
  • Price Per Unit: The price of a single unit of the item. This column may also contain missing or invalid values [2, 5].
  • Total Spent: The total amount spent on the transaction, calculated as Quantity * Price Per Unit [5, 6].
  • Payment Method: The method used for payment (e.g., "Cash", "Credit Card"). This column can have missing values (e.g., "None") or invalid entries (e.g., "UNKNOWN") [6, 7].
  • Location: The location where the transaction occurred, such as "In-store" or "Takeaway". This column may contain missing or invalid values [6, 7].
  • Transaction Date: The date of the transaction. This column may contain missing or incorrect values [6, 7].
Key data characteristics include:
  • Missing Values: Present in columns like Item, Payment Method, and Location, often represented as None or empty cells [6].
  • Invalid Values: Includes entries such as "ERROR" or "UNKNOWN" to mimic real-world data issues [3].
  • Price Consistency: Each menu item has a fixed price, but individual price cells may have been made missing or incorrect [3].
  • Menu Items: Specific items like Coffee (£2), Tea (£1.5), Sandwich (£4), Salad (£5), Cake (£3), Cookie (£1), Smoothie (£4), and Juice (£3) are included [3].
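
Because each item has a fixed price and Total Spent is Quantity * Price Per Unit, some gaps can be recovered by lookup or arithmetic rather than guessed. The following is a minimal sketch of that idea using only the Python standard library. The helper names (`to_number`, `repair_row`), the exact set of tokens treated as invalid, and the assumption that rows arrive as dictionaries keyed by the column names above are illustrative choices, not part of the dataset.

```python
from typing import Dict, Optional

# Menu prices in pounds, as listed in the dataset description.
MENU_PRICES = {
    "Coffee": 2.0, "Tea": 1.5, "Sandwich": 4.0, "Salad": 5.0,
    "Cake": 3.0, "Cookie": 1.0, "Smoothie": 4.0, "Juice": 3.0,
}

# Tokens treated as "no usable value". This set is an assumption.
INVALID_TOKENS = {"", "ERROR", "UNKNOWN", "None"}


def to_number(raw: Optional[str]) -> Optional[float]:
    """Return the cell as a float, or None if it is missing or invalid."""
    if raw is None or raw.strip() in INVALID_TOKENS:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def repair_row(row: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical helper: fill gaps from the menu and Total = Quantity * Price."""
    raw_item = row.get("Item")
    item = raw_item.strip() if raw_item and raw_item.strip() not in INVALID_TOKENS else None
    qty = to_number(row.get("Quantity"))
    price = to_number(row.get("Price Per Unit"))
    total = to_number(row.get("Total Spent"))

    # A known item fixes the price.
    if price is None and item in MENU_PRICES:
        price = MENU_PRICES[item]
    # Any two of (quantity, price, total) determine the third.
    if price is None and qty and total is not None:
        price = total / qty
    if qty is None and price and total is not None:
        qty = total / price
    if total is None and qty is not None and price is not None:
        total = qty * price
    # Infer the item from the price only when exactly one item has that price.
    if item is None and price is not None:
        matches = [name for name, p in MENU_PRICES.items() if p == price]
        if len(matches) == 1:
            item = matches[0]

    return {**row, "Item": item, "Quantity": qty,
            "Price Per Unit": price, "Total Spent": total}
```

Note that Sandwich and Smoothie share a price (£4), as do Cake and Juice (£3), so a missing Item cannot be inferred from those prices alone; the sketch leaves such cells as None.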

Distribution

The dataset is provided as a CSV file named dirty_cafe_sales.csv [2, 8]. It contains 10,000 rows (records) and 8 columns [1, 2]. The file size is approximately 550.3 KB [4].
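
A quick way to confirm that a downloaded copy matches this description is to count its rows and check its header. The sketch below uses the standard `csv` module; `check_shape` is a hypothetical helper, and it assumes the CSV header spells the column names exactly as they appear in the Columns section.

```python
import csv
from typing import Iterable, List, Tuple

# Column names as given in the listing; the exact header spelling is an assumption.
EXPECTED_COLUMNS = [
    "Transaction ID", "Item", "Quantity", "Price Per Unit",
    "Total Spent", "Payment Method", "Location", "Transaction Date",
]


def check_shape(lines: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (row count, header) and fail if an expected column is absent."""
    reader = csv.DictReader(lines)
    header = list(reader.fieldnames or [])
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    n_rows = sum(1 for _ in reader)  # data rows only, header excluded
    return n_rows, header


# Usage against the real file (the encoding is an assumption):
#   with open("dirty_cafe_sales.csv", newline="", encoding="utf-8") as f:
#       n_rows, header = check_shape(f)
#   # n_rows should be 10000 and len(header) should be 8 if the file matches the listing
```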

Usage

This dataset is ideally suited for:
  • Practising data cleaning techniques, including handling missing values, removing duplicates, and correcting invalid entries [3, 9].
  • Practising exploratory data analysis (EDA) techniques, such as visualisations and summary statistics [9].
  • Performing feature engineering for machine learning workflows [9].
Suggested cleaning steps involve filling missing numeric values with the median or mean, replacing missing categorical values with the mode or "Unknown", and handling invalid entries like "ERROR" and "UNKNOWN" by replacing them with NaN or other appropriate values [9]. Ensuring date consistency and creating new features like 'Day of the Week' or 'Transaction Month' are also recommended [10].
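
The suggested steps can be sketched in a few passes over the rows. This is one possible arrangement under stated assumptions, not a definitive pipeline: the function name `clean`, the set of invalid tokens, the choice of median over mean, and the `%Y-%m-%d` date format (inferred from the sample value "2023-01-01") are all assumptions. Rows are taken to be dictionaries of strings, as produced by `csv.DictReader`.

```python
from collections import Counter
from datetime import datetime
from statistics import median
from typing import Dict, List

INVALID_TOKENS = {"", "ERROR", "UNKNOWN", "None"}   # assumed sentinel set
NUMERIC_COLUMNS = ["Quantity", "Price Per Unit", "Total Spent"]
CATEGORICAL_COLUMNS = ["Item", "Payment Method", "Location"]
DATE_FORMAT = "%Y-%m-%d"                            # assumed date format
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def clean(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    cleaned: List[Dict[str, object]] = []

    # Step 1: turn invalid tokens into None and parse numeric cells.
    for row in rows:
        new: Dict[str, object] = {}
        for key, value in row.items():
            text = value.strip() if value is not None else ""
            new[key] = None if text in INVALID_TOKENS else text
        for col in NUMERIC_COLUMNS:
            try:
                new[col] = float(new[col]) if new.get(col) is not None else None
            except ValueError:
                new[col] = None
        cleaned.append(new)

    # Step 2: fill missing numeric values with the column median.
    for col in NUMERIC_COLUMNS:
        known = [r[col] for r in cleaned if r[col] is not None]
        fill = median(known) if known else None
        for r in cleaned:
            if r[col] is None:
                r[col] = fill

    # Step 3: fill missing categorical values with the mode, else "Unknown".
    for col in CATEGORICAL_COLUMNS:
        known = [r.get(col) for r in cleaned if r.get(col) is not None]
        fill = Counter(known).most_common(1)[0][0] if known else "Unknown"
        for r in cleaned:
            if r.get(col) is None:
                r[col] = fill

    # Step 4: parse dates and derive the two suggested features.
    for r in cleaned:
        try:
            parsed = datetime.strptime(r.get("Transaction Date") or "", DATE_FORMAT)
        except ValueError:
            parsed = None
        r["Transaction Date"] = parsed.date() if parsed else None
        r["Day of the Week"] = DAY_NAMES[parsed.weekday()] if parsed else None
        r["Transaction Month"] = parsed.month if parsed else None

    return cleaned
```

One design caveat: filling Quantity, Price Per Unit, and Total Spent independently with medians can leave rows where Total Spent no longer equals Quantity * Price Per Unit, so it may be preferable to recover values from that relationship first and fall back to the median only for what remains.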

Coverage

This is a synthetic dataset [1], meaning it is artificially generated rather than collected from real-world cafe operations. As such, it has no real geographic, temporal, or demographic coverage. The Transaction Date column, for example, contains values such as "2023-01-01" alongside invalid entries such as "UNKNOWN" [6, 7].

License

CC BY-SA 4.0 License

Who Can Use It

This dataset is intended for users who need to develop or hone their data manipulation and analysis skills. This includes:
  • Data analysts looking to practise real-world data challenges [1, 3].
  • Data scientists seeking to build robust data pipelines and perform feature engineering [9].
  • Students and educators who need training material for data cleaning and EDA methodologies [1].
  • Anyone interested in exploring data visualisation and statistical summarisation techniques [9].

Dataset Name Suggestions

  • Cafe Sales Data Cleaning Challenge
  • Messy Cafe Transactions
  • Cafe Data Quality Exercise
  • Synthetic Cafe Sales for EDA
  • Cafe Sales Dirty Data

Attributes

Listing Stats
  • Views: 0
  • Downloads: 1
  • Listed: 14/07/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0