Cafe Sales Data Cleaning Challenge
NLP / Natural Language Processing
Free
About
This dataset, titled "Dirty Cafe Sales Dataset", contains 10,000 rows of synthetic sales transaction data from a cafe [1, 2]. It has been deliberately engineered to be "dirty" by including missing values, inconsistent data, and various errors [1]. The primary purpose of this dataset is to offer a realistic scenario for practising data cleaning, data wrangling, and exploratory data analysis (EDA) techniques [1, 3]. It serves as an excellent resource for those looking to refine their skills in handling real-world data challenges [1].
Columns
The dataset comprises 8 columns [2], each detailing aspects of a cafe sales transaction:
- Transaction ID: A unique identifier for each transaction, consistently present and unique [2, 4].
- Item: The name of the item purchased, such as "Coffee" or "Sandwich". This column may contain missing or invalid values like "ERROR" [2, 4].
- Quantity: The quantity of the item purchased. This column might have missing or invalid entries such as "UNKNOWN" [2, 5].
- Price Per Unit: The price of a single unit of the item. This column may also contain missing or invalid values [2, 5].
- Total Spent: The total amount spent on the transaction, calculated as Quantity * Price Per Unit [5, 6].
- Payment Method: The method used for payment (e.g., "Cash", "Credit Card"). This column can have missing values (e.g., "None") or invalid entries (e.g., "UNKNOWN") [6, 7].
- Location: The location where the transaction occurred, such as "In-store" or "Takeaway". This column may contain missing or invalid values [6, 7].
- Transaction Date: The date of the transaction. This column may contain missing or incorrect values [6, 7].
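As a rough illustration of how the column layout above might be read with pandas, the sketch below treats the placeholder strings as missing values at load time. The three sample rows, the `TXN_...` ID format, and the helper name `load_sales` are made up for illustration; only the column names and the "ERROR" / "UNKNOWN" / "None" placeholders come from the description.

```python
import io
import pandas as pd

# Hypothetical sample rows in the documented column layout (not real dataset rows).
SAMPLE_CSV = """Transaction ID,Item,Quantity,Price Per Unit,Total Spent,Payment Method,Location,Transaction Date
TXN_0000001,Coffee,2,2.0,4.0,Credit Card,Takeaway,2023-09-08
TXN_0000002,ERROR,UNKNOWN,3.0,,Cash,In-store,2023-05-16
TXN_0000003,Cake,1,,3.0,UNKNOWN,,UNKNOWN
"""

# Placeholder strings that the description says stand in for bad or missing data.
SENTINELS = ["ERROR", "UNKNOWN", "None"]
NUMERIC_COLUMNS = ["Quantity", "Price Per Unit", "Total Spent"]


def load_sales(source):
    """Read the CSV, mapping placeholder strings to NaN (hypothetical helper)."""
    df = pd.read_csv(source, na_values=SENTINELS)
    # Coerce the numeric columns in case any unexpected text remains.
    for col in NUMERIC_COLUMNS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


sales = load_sales(io.StringIO(SAMPLE_CSV))
missing_per_column = sales.isna().sum()
print(missing_per_column)
```

With the real file, `load_sales("dirty_cafe_sales.csv")` would presumably work the same way, assuming the file uses exactly these column headers.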
Key data characteristics include:
- Missing Values: Present in columns like Item, Payment Method, and Location, often represented as None or empty cells [6].
- Invalid Values: Includes entries such as "ERROR" or "UNKNOWN" to mimic real-world data issues [3].
- Price Consistency: Each menu item has a fixed price, although missing or incorrect values may have been introduced into the price data [3].
- Menu Items: Specific items like Coffee (£2), Tea (£1.5), Sandwich (£4), Salad (£5), Cake (£3), Cookie (£1), Smoothie (£4), and Juice (£3) are included [3].
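Because menu prices are fixed and Total Spent is defined as Quantity * Price Per Unit, some missing numeric values can be recovered from the other columns rather than imputed. The following is a minimal sketch of that idea, assuming the placeholder strings have already been converted to NaN and the numeric columns are floats; the function name `fill_from_relationships` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Menu prices as listed in the dataset description.
MENU_PRICES = {
    "Coffee": 2.0, "Tea": 1.5, "Sandwich": 4.0, "Salad": 5.0,
    "Cake": 3.0, "Cookie": 1.0, "Smoothie": 4.0, "Juice": 3.0,
}


def fill_from_relationships(df):
    """Fill gaps using the menu and Total Spent = Quantity * Price Per Unit."""
    out = df.copy()
    # 1. A known item determines its unit price.
    out["Price Per Unit"] = out["Price Per Unit"].fillna(out["Item"].map(MENU_PRICES))
    # 2. Otherwise derive the price from total / quantity when both are present.
    out["Price Per Unit"] = out["Price Per Unit"].fillna(out["Total Spent"] / out["Quantity"])
    # 3. Total from quantity and price.
    out["Total Spent"] = out["Total Spent"].fillna(out["Quantity"] * out["Price Per Unit"])
    # 4. Quantity from total and price.
    out["Quantity"] = out["Quantity"].fillna(out["Total Spent"] / out["Price Per Unit"])
    return out


# Made-up rows, each missing a different value.
demo = pd.DataFrame({
    "Item": ["Coffee", "Salad", None, "Tea"],
    "Quantity": [2.0, np.nan, 3.0, 4.0],
    "Price Per Unit": [np.nan, 5.0, np.nan, 1.5],
    "Total Spent": [4.0, 15.0, 9.0, np.nan],
})
repaired = fill_from_relationships(demo)
print(repaired)
```

The reverse lookup (inferring a missing Item from its price) is only partly possible, since Sandwich and Smoothie share a price of 4 and Cake and Juice share a price of 3.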
Distribution
The dataset is provided as a CSV file named dirty_cafe_sales.csv [2, 8]. It contains 10,000 rows (records) and 8 columns [1, 2]. The file size is approximately 550.3 KB [4].
Usage
This dataset is ideally suited for:
- Practising data cleaning techniques, including handling missing values, removing duplicates, and correcting invalid entries [3, 9].
- Applying exploratory data analysis (EDA) techniques, such as visualisations and summary statistics [9].
- Performing feature engineering for machine learning workflows [9].
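The cleaning techniques in the list above (correcting invalid entries, removing duplicates, handling missing values) could be combined roughly as follows. This is one possible sketch rather than a prescribed method: the helper `basic_clean`, the choice of median and mode as fill values, and the demo rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def basic_clean(df, numeric_cols, categorical_cols):
    """Hypothetical helper: placeholders -> NaN, drop duplicates, then impute."""
    # Turn the placeholder strings into proper missing values, then drop exact duplicates.
    out = df.replace(["ERROR", "UNKNOWN"], np.nan).drop_duplicates()
    for col in numeric_cols:
        # Coerce to numbers and fill gaps with the column median.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        # Fill gaps with the most frequent value, or "Unknown" if the column is empty.
        mode = out[col].mode(dropna=True)
        out[col] = out[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return out


# Made-up rows: one exact duplicate and one row with invalid entries.
demo = pd.DataFrame({
    "Item": ["Coffee", "Coffee", "ERROR", "Tea", "Coffee"],
    "Quantity": ["2", "2", "UNKNOWN", "4", "5"],
})
cleaned = basic_clean(demo, numeric_cols=["Quantity"], categorical_cols=["Item"])
print(cleaned)
```

Whether median or mode imputation is appropriate depends on the analysis; dropping incomplete rows is an equally valid choice for some tasks.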
Suggested cleaning steps involve filling missing numeric values with the median or mean, replacing missing categorical values with the mode or "Unknown", and handling invalid entries like "ERROR" and "UNKNOWN" by replacing them with NaN or other appropriate values [9]. Ensuring date consistency and creating new features like 'Day of the Week' or 'Transaction Month' are also recommended [10].
Coverage
This is a synthetic dataset [1], meaning it is artificially generated rather than collected from real-world cafe operations. As such, it has no real-world geographic, temporal, or demographic coverage. The Transaction Date column, for example, contains values such as "2023-01-01" as well as invalid entries such as "UNKNOWN" [6, 7].
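Since the Transaction Date column mixes valid dates with entries like "UNKNOWN", one way to enforce date consistency and derive calendar features is sketched below. It assumes dates follow the YYYY-MM-DD pattern seen in "2023-01-01"; the helper name `add_date_features` and the second demo date are hypothetical.

```python
import pandas as pd


def add_date_features(df, date_col="Transaction Date"):
    """Parse dates (invalid entries become NaT) and add two calendar features."""
    out = df.copy()
    # Anything that does not match the assumed YYYY-MM-DD format is coerced to NaT.
    out[date_col] = pd.to_datetime(out[date_col], format="%Y-%m-%d", errors="coerce")
    out["Day of the Week"] = out[date_col].dt.day_name()
    out["Transaction Month"] = out[date_col].dt.month
    return out


# "2023-01-01" and "UNKNOWN" appear in the description; the third value is made up.
demo = pd.DataFrame({"Transaction Date": ["2023-01-01", "UNKNOWN", "2023-07-15"]})
dated = add_date_features(demo)
print(dated)
```

Rows whose date could not be parsed keep missing values in both derived columns, so they can be filtered or imputed later as needed.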
License
CC BY-SA 4.0 License
Who Can Use It
This dataset is intended for users who need to develop or hone their data manipulation and analysis skills. This includes:
- Data analysts looking to practise real-world data challenges [1, 3].
- Data scientists seeking to build robust data pipelines and perform feature engineering [9].
- Students and educators for training in data cleaning and EDA methodologies [1].
- Anyone interested in exploring data visualisation and statistical summarisation techniques [9].
Dataset Name Suggestions
- Cafe Sales Data Cleaning Challenge
- Messy Cafe Transactions
- Cafe Data Quality Exercise
- Synthetic Cafe Sales for EDA
- Cafe Sales Dirty Data
Attributes
Original Data Source: Cafe Sales Data Cleaning Challenge