Hierarchical Amazon Reviews for Classification
Free
About
This dataset is designed for exploring various approaches to hierarchical text classification, specifically using Amazon product reviews. It contains 40,000 training reviews, providing a rich source for developing and evaluating machine learning models focused on categorising products. The classification structure is tiered, featuring 6 top-level categories (level 1), 64 sub-categories (level 2), and 510 granular categories (level 3).
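As a rough sanity check on the three-tier structure, the sketch below counts the distinct class names per level and flags any level 2 name that appears under more than one level 1 parent. The rows are invented for illustration, and the helper names `level_sizes` and `ambiguous_children` are hypothetical. On the full training file one would expect counts of 6, 64 and 510, assuming those figures refer to distinct class names per column.

```python
from collections import defaultdict

LEVELS = ("Cat1", "Cat2", "Cat3")


def level_sizes(rows):
    """Count the distinct class names found at each level."""
    return {level: len({row[level] for row in rows}) for level in LEVELS}


def ambiguous_children(rows, parent="Cat1", child="Cat2"):
    """Return child names that occur under more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {name: sorted(seen) for name, seen in parents.items() if len(seen) > 1}


# Invented rows; only the column names come from the dataset description.
rows = [
    {"Cat1": "pet supplies", "Cat2": "dogs", "Cat3": "toys"},
    {"Cat1": "pet supplies", "Cat2": "accessories", "Cat3": "collars"},
    {"Cat1": "baby products", "Cat2": "accessories", "Cat3": "bibs"},
]

print(level_sizes(rows))
print(ambiguous_children(rows))
```

If a name does turn up under several parents, a classifier that works level by level needs the full path, not the bare name, to identify a class.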
Columns
- productId: A unique identifier for the product.
- Title: The headline or title given to the product review.
- userId: A unique identifier for the user who submitted the review.
- Helpfulness: How many users found the review helpful out of how many voted, given as a fraction such as 1/1 (0/0 means no votes).
- Score: The rating the reviewer gave the product, on a scale of 1 to 5.
- Time: The timestamp of when the review was submitted.
- Text: The main body or content of the product review.
- Cat1: The primary, level 1 class name or category assigned to the product.
- Cat2: The secondary, level 2 class name or sub-category.
- Cat3: The most specific, level 3 class name or sub-sub-category.
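A minimal sketch of reading these columns with Python's standard `csv` module follows. It assumes that Helpfulness is stored as an `x/y` string, that Time holds Unix epoch seconds, and that Score parses as a number. None of these is confirmed by the listing beyond the sample values it quotes. The two sample rows are invented, and `read_reviews` and `parse_helpfulness` are hypothetical helper names.

```python
import csv
import io
from datetime import datetime, timezone

# Column names as listed above; their order in the real file is an assumption.
EXPECTED_COLUMNS = [
    "productId", "Title", "userId", "Helpfulness", "Score",
    "Time", "Text", "Cat1", "Cat2", "Cat3",
]


def parse_helpfulness(value):
    """Turn an 'x/y' string into (helpful_votes, total_votes)."""
    helpful, total = value.split("/")
    return int(helpful), int(total)


def read_reviews(handle):
    """Read review rows from an open CSV file and convert the typed fields."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    reviews = []
    for row in reader:
        row["Helpfulness"] = parse_helpfulness(row["Helpfulness"])
        row["Score"] = float(row["Score"])
        # Assumes Time holds Unix epoch seconds.
        row["Time"] = datetime.fromtimestamp(float(row["Time"]), tz=timezone.utc)
        reviews.append(row)
    return reviews


# Two invented rows standing in for the real training file.
SAMPLE_CSV = (
    "productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3\n"
    "P001,Close shave,U001,1/1,5.0,1344211200,Works well.,"
    "health personal care,personal care,shaving hair removal\n"
    "P002,Sturdy toy,U002,0/0,4.0,1277000639,My dog likes it.,"
    "pet supplies,dogs,toys\n"
)

reviews = read_reviews(io.StringIO(SAMPLE_CSV))
for review in reviews:
    print(review["productId"], review["Helpfulness"], review["Score"], review["Time"].date())
```

For a file on disk, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the training CSV was saved.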
Distribution
The data is typically provided in CSV format. The main training file includes 40,000 Amazon product reviews. Additional files, such as a validation set of 10,000 reviews and a set of 150,000 raw (unlabelled) reviews, are also part of the broader dataset offering.
- Product IDs: Over 20,851 unique product identifiers are present.
- User IDs: There are over 19,598 unique user identifiers.
- Helpfulness: Around 38% of reviews have a 0/0 helpfulness score, while 15% are rated 1/1.
- Score: Review scores range from 1 to 5, with a notable portion (23,362 records) falling into the 4.80-5.00 range.
- Time: Timestamps span a wide period, with a large concentration of reviews (16,331 records) between roughly 1277000639.95 and 1344211200.00. Read as Unix epoch seconds, that is about 20 June 2010 to 6 August 2012 (UTC).
- Level 1 Categories (Cat1): There are 6 distinct top-level categories, including "toys games" (26% of reviews) and "health personal care" (24% of reviews). The full list of level 1 classes includes: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.
- Level 2 Categories (Cat2): There are 64 sub-categories, with examples like "personal care" (7%) and "dogs" (7%).
- Level 3 Categories (Cat3): The most granular level contains 510 unique categories, with "shaving hair removal" being an example.
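The figures above can be reproduced with a few lines of standard-library Python. The sketch below converts the two Time bounds to calendar dates, assuming they are Unix epoch seconds, and computes per-class shares with a `Counter`. The label list is invented and far smaller than the real data, and `to_utc` and `shares` are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def shares(values):
    """Return each distinct value's fraction of the total, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}


# Bounds of the densest Time range quoted above.
start, end = to_utc(1277000639.95), to_utc(1344211200.00)
print(start.date(), "to", end.date())

# Invented Cat1 labels; the real column would come from the training CSV.
toy_cat1 = ["toys games", "toys games", "health personal care", "beauty"]
for label, share in shares(toy_cat1).items():
    print(f"{label}: {share:.0%}")
```

The same `shares` helper works for the Helpfulness strings or the Cat2 and Cat3 columns, since it only counts values.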
Usage
This dataset is ideal for:
- Developing and testing various hierarchical text classification methodologies.
- Training multi-class models, whether through a flat approach (concatenating class names) or a simple hierarchical approach (sequential classification).
- Language model fine-tuning, particularly with the raw, unlabelled review data provided.
- Building systems for sentiment analysis and product categorisation.
- Market research and identifying product trends based on consumer reviews.
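The flat and sequential approaches mentioned in the list can be sketched as follows. This is an illustration under assumptions, not the dataset author's method. The separator, the helper names (`flat_label`, `build_children`, `predict_path`) and the rows are all hypothetical, and `keyword_chooser` is a toy stand-in for whatever trained classifier would be used at each level.

```python
from collections import defaultdict

SEPARATOR = " > "  # assumed delimiter for flat labels; any unused string works


def flat_label(row):
    """Flat approach: join the three class names into one label."""
    return SEPARATOR.join((row["Cat1"], row["Cat2"], row["Cat3"]))


def build_children(rows):
    """Map each partial path to the class names seen directly below it."""
    children = defaultdict(set)
    for row in rows:
        path = ()
        for level in ("Cat1", "Cat2", "Cat3"):
            children[path].add(row[level])
            path += (row[level],)
    return children


def predict_path(text, children, choose):
    """Sequential approach: pick one class per level, restricted to the
    children of the class chosen at the level above."""
    path = ()
    for _ in range(3):
        candidates = sorted(children[path])
        path += (choose(text, path, candidates),)
    return path


def keyword_chooser(text, path, candidates):
    """Toy stand-in for a trained classifier: most shared words wins."""
    words = set(text.lower().split())
    return max(candidates, key=lambda name: len(words & set(name.split())))


# Invented rows; only the column names come from the dataset description.
rows = [
    {"Cat1": "health personal care", "Cat2": "personal care", "Cat3": "shaving hair removal"},
    {"Cat1": "pet supplies", "Cat2": "dogs", "Cat3": "toys"},
]

children = build_children(rows)
print(flat_label(rows[0]))
print(predict_path("a personal shaving kit for hair removal", children, keyword_chooser))
```

The flat approach yields one label per full path, so a single multi-class model can be trained on it. The sequential approach needs a model per level but can never predict a child that does not belong to the chosen parent.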
Coverage
The dataset covers Amazon product reviews globally. The timestamps within the Time column suggest a wide historical range of reviews, although specific human-readable dates are not provided. The data reflects review contributions from a diverse base of Amazon users.
License
CC0
Who Can Use It
- Data Scientists and Machine Learning Engineers focusing on natural language processing (NLP) and classification tasks.
- Academics and Researchers studying hierarchical classification techniques and consumer behaviour.
- Developers looking to fine-tune language models with real-world product review data.
- Businesses aiming to improve product categorisation, sentiment analysis, or customer insight generation.
Dataset Name Suggestions
- Amazon Product Review Hierarchy
- Hierarchical Amazon Reviews for Classification
- Product Review Categorisation Dataset
- Amazon Consumer Review Data
Attributes
Original Data Source: Hierarchical text classification