Retail Sales Data Cleaning Challenge
Retail & Consumer Behavior
Free
About
This dataset contains synthetic sales transactions from a retail store, purpose-built for data cleaning challenges. It has 12,575 rows and 11 columns, with missing or invalid values introduced to simulate real-world inconsistencies. It covers eight distinct product categories, each containing 25 unique items with fixed prices. It is suited to practising data cleaning techniques, performing exploratory data analysis, and developing feature engineering skills.
Columns
- Transaction ID: A unique identifier for each transaction, always present and unique.
- Customer ID: A unique identifier for each of the 25 distinct customers, always valid.
- Category: The product category for the purchased item, such as 'Food' or 'Furniture'. There are eight unique categories.
- Item: The specific name of the purchased item. This column may contain missing or 'None' values.
- Price Per Unit: The static price of a single unit of the item. This column may also have missing or 'None' values. Prices range from £5.00 to £41.00.
- Quantity: The number of units of the item purchased. Missing or 'None' values can be found here. Quantities range from 1 to 10.
- Total Spent: The overall amount spent on the transaction, calculated as Quantity multiplied by Price Per Unit. This column may contain missing values. Total amounts range from £5.00 to £410.00.
- Payment Method: The method used for payment, which might include 'Cash' or 'Credit Card'. This column can have missing or invalid entries.
- Location: The place where the transaction occurred, such as 'In-store' or 'Online'. This column may also contain missing or invalid values.
- Transaction Date: The date of the transaction. This field is always present and valid, with dates spanning from 2022-01-01 to 2025-01-18.
- Discount Applied: An indicator of whether a discount was applied to the transaction. Values are 'True' or 'False'; missing entries appear as 'None'.
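Because Total Spent is defined as Quantity multiplied by Price Per Unit, any one of the three values can in principle be recovered when the other two are present. The following is a minimal sketch of that idea using only the Python standard library. The sample rows are made up for illustration, and the assumption that missing cells appear as empty strings or the text 'None' should be checked against the actual file.

```python
import csv
import io

# Assumption: missing cells show up as empty strings or placeholder text.
PLACEHOLDERS = {"", "none", "nan"}

def parse_number(raw):
    """Return a float, or None when the cell is empty or a placeholder."""
    if raw is None or raw.strip().lower() in PLACEHOLDERS:
        return None
    return float(raw)

def fill_from_identity(row):
    """Recover one missing value using Total Spent = Quantity * Price Per Unit."""
    price = parse_number(row.get("Price Per Unit"))
    qty = parse_number(row.get("Quantity"))
    total = parse_number(row.get("Total Spent"))
    if total is None and price is not None and qty is not None:
        total = price * qty
    elif price is None and total is not None and qty:
        price = total / qty
    elif qty is None and total is not None and price:
        qty = total / price
    # Rows with two or more missing values are left as they are.
    return {"Price Per Unit": price, "Quantity": qty, "Total Spent": total}

# Hypothetical sample rows, not taken from the real file.
SAMPLE = """Transaction ID,Price Per Unit,Quantity,Total Spent
TXN_1,5.0,4,
TXN_2,,2,82.0
TXN_3,12.5,,None
"""

repaired = [fill_from_identity(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in repaired:
    print(r)
```

In this sample the first row gains a total, the second gains a price, and the third cannot be repaired this way because two of its three values are missing.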
Distribution
The dataset is provided as a single CSV file named 'retail_store_sales.csv', containing 12,575 rows and 11 columns. The data is entirely synthetic and designed to mimic real-world retail sales, with inconsistencies introduced deliberately.
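A quick first check after downloading is to confirm the row and column counts and see how many cells are missing per column. The sketch below does this with the standard `csv` module. The helper name `summarise`, the sample data, and the set of placeholder strings treated as missing are all illustrative assumptions.

```python
import csv
import io

# Assumption: missing cells show up as empty strings or placeholder text.
PLACEHOLDERS = {"", "none", "nan"}

def summarise(handle):
    """Return (row count, column count, missing cells per column)."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            value = row.get(name)
            if value is None or value.strip().lower() in PLACEHOLDERS:
                missing[name] += 1
    return n_rows, len(columns), missing

# Hypothetical sample standing in for the real file.
SAMPLE = """Transaction ID,Item,Payment Method
TXN_1,Item_A,Cash
TXN_2,,Credit Card
TXN_3,None,
"""

n_rows, n_cols, missing = summarise(io.StringIO(SAMPLE))
print(n_rows, n_cols, missing)
```

To run it against the real file, pass an open file handle instead, for example `with open("retail_store_sales.csv", newline="") as f: summarise(f)`. If the file matches the description above, it should report 12,575 rows and 11 columns.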
Usage
This dataset suits several analytical tasks, including:
- Data Cleaning: Practising tasks such as handling missing values, inferring missing entries, and validating data integrity.
- Exploratory Data Analysis (EDA): Analysing sales trends, evaluating category performance, and understanding customer behaviour.
- Feature Engineering: Developing techniques to create new, insightful variables from existing data.
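As one example of inferring missing entries: since each item has a fixed price, a missing Price Per Unit can often be filled from another row that records the same item. The sketch below builds an item-to-price lookup from complete rows and applies it. The function names and the item names 'Item_A' and 'Item_B' are hypothetical, and the treatment of empty strings and 'None' as missing is an assumption.

```python
# Assumption: missing cells show up as empty strings or placeholder text.
PLACEHOLDERS = {"", "none", "nan"}

def is_missing(value):
    """True when a cell is absent, empty, or a placeholder such as 'None'."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS

def build_price_lookup(rows):
    """Map each item name to its price, using rows where both are present."""
    lookup = {}
    for row in rows:
        item, price = row.get("Item"), row.get("Price Per Unit")
        if is_missing(item) or is_missing(price):
            continue
        lookup.setdefault(item, float(price))
    return lookup

def fill_prices(rows, lookup):
    """Return copies of the rows with prices as floats, filled where possible."""
    filled = []
    for row in rows:
        fixed = dict(row)
        price = fixed.get("Price Per Unit")
        if is_missing(price):
            # Stays None when the item is missing or has no known price.
            fixed["Price Per Unit"] = lookup.get(fixed.get("Item"))
        else:
            fixed["Price Per Unit"] = float(price)
        filled.append(fixed)
    return filled

# Hypothetical rows for illustration only.
rows = [
    {"Item": "Item_A", "Price Per Unit": "5.0"},
    {"Item": "Item_A", "Price Per Unit": ""},
    {"Item": "Item_B", "Price Per Unit": "None"},
    {"Item": "", "Price Per Unit": "12.5"},
]

lookup = build_price_lookup(rows)
prices = [r["Price Per Unit"] for r in fill_prices(rows, lookup)]
print(prices)
```

The same lookup could be inverted to suggest an Item from a known price, but only where that price is unique within its category, so that direction needs more care.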
Coverage
The dataset covers sales transactions from a retail store between 1st January 2022 and 18th January 2025. It involves 25 distinct customer IDs and items from eight product categories. No geographic regions or demographic details are provided beyond the customer IDs.
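The stated coverage can be checked directly against the data. This sketch counts distinct customer IDs and flags any transaction dated outside the documented range. It assumes dates are stored in ISO format (YYYY-MM-DD), and the function name and sample rows are illustrative rather than part of the dataset.

```python
from datetime import date

# Documented coverage of the dataset.
START, END = date(2022, 1, 1), date(2025, 1, 18)

def check_coverage(rows):
    """Return (distinct customer count, IDs of transactions outside the range)."""
    customers = set()
    out_of_range = []
    for row in rows:
        customers.add(row["Customer ID"])
        # Assumption: dates are ISO strings such as '2022-01-01'.
        when = date.fromisoformat(row["Transaction Date"])
        if not START <= when <= END:
            out_of_range.append(row["Transaction ID"])
    return len(customers), out_of_range

# Hypothetical rows for illustration only.
rows = [
    {"Transaction ID": "TXN_1", "Customer ID": "CUST_01", "Transaction Date": "2022-01-01"},
    {"Transaction ID": "TXN_2", "Customer ID": "CUST_02", "Transaction Date": "2025-01-18"},
    {"Transaction ID": "TXN_3", "Customer ID": "CUST_01", "Transaction Date": "2025-02-01"},
]

n_customers, out_of_range = check_coverage(rows)
print(n_customers, out_of_range)
```

On the full file, a result of 25 customers and an empty out-of-range list would agree with the coverage described here.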
License
CC BY-SA 4.0
Who Can Use It
This dataset is ideal for:
- Data Analysts: For practising data manipulation and trend identification.
- Data Scientists: To build and refine models, especially for data pre-processing steps.
- Students and Educators: As a teaching and learning resource for data quality and analytics.
- Business Intelligence Professionals: To understand data challenges in retail operations.
Dataset Name Suggestions
- Retail Sales Data Cleaning Challenge
- Dirty Sales Transactions Dataset
- Simulated Retail Sales for Analytics
- E-commerce Sales Anomaly Data
Attributes
Original Data Source: Retail Sales Data Cleaning Challenge