Opendatabay APP

Hierarchical Amazon Reviews for Classification

Reviews & Ratings

Tags and Keywords

Hierarchical, Reviews, Amazon, Product, Classification, Text

Free

About

This dataset is designed for exploring various approaches to hierarchical text classification, specifically using Amazon product reviews. It contains 40,000 training reviews, providing a rich source for developing and evaluating machine learning models focused on categorising products. The classification structure is tiered, featuring 6 top-level categories (level 1), 64 sub-categories (level 2), and 510 granular categories (level 3).

Columns

  • productId: A unique identifier for the product.
  • Title: The headline or title given to the product review.
  • userId: A unique identifier for the user who submitted the review.
  • Helpfulness: How many users found the review helpful, recorded as a fraction of helpful votes over total votes (for example, 1/1 or 0/0).
  • Score: The rating the reviewer gave the product, typically on a scale of 1 to 5.
  • Time: The timestamp of when the review was submitted.
  • Text: The main body or content of the product review.
  • Cat1: The primary, level 1 class name or category assigned to the product.
  • Cat2: The secondary, level 2 class name or sub-category.
  • Cat3: The most specific, level 3 class name or sub-sub-category.
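
As a minimal sketch of how these columns might be read, the snippet below parses a CSV with the standard library and converts three fields into typed values. The two sample rows are made up for illustration, the category path in each row is hypothetical, and the reading of Helpfulness as "helpful votes/total votes" is an assumption based on the 0/0 and 1/1 values reported in this listing. The helper names `parse_helpfulness` and `load_reviews` are not part of the dataset.

```python
import csv
import io
from datetime import datetime, timezone

# Two made-up rows that reuse the column names listed above.
# Real files may differ in quoting, ordering, or value formats.
SAMPLE_CSV = """productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3
P001,Works well,U001,1/1,5.0,1344211200,Smooth shave.,health personal care,personal care,shaving hair removal
P002,Dog loves it,U002,0/0,4.0,1277000639,Chewed in a day.,pet supplies,dogs,toys
"""


def parse_helpfulness(value):
    """Split 'helpful/total' into two ints (assumed format); 0/0 means no votes."""
    helpful, total = value.split("/")
    return int(helpful), int(total)


def load_reviews(handle):
    """Read reviews from a file-like object and convert the typed columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["Score"] = float(row["Score"])
        row["Helpfulness"] = parse_helpfulness(row["Helpfulness"])
        # Time is treated as Unix epoch seconds and converted to UTC.
        row["Time"] = datetime.fromtimestamp(float(row["Time"]), tz=timezone.utc)
        rows.append(row)
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
for review in reviews:
    print(review["productId"], review["Score"], review["Helpfulness"],
          review["Time"].date(), review["Cat3"])
```

To read an actual file, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample; the encoding is an assumption.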

Distribution

The data is typically provided in CSV format. The main training file contains 40,000 Amazon product reviews. The broader dataset offering also includes a validation set of 10,000 reviews and 150,000 raw, unlabelled reviews.
  • Product IDs: 20,851 unique product identifiers are present.
  • User IDs: There are 19,598 unique user identifiers.
  • Helpfulness: Around 38% of reviews have a 0/0 helpfulness score, while 15% are rated 1/1.
  • Score: Review scores range from 1 to 5. The top histogram bin (4.80-5.00) holds 23,362 records, about 58% of the 40,000 training reviews.
  • Time: Timestamps are Unix epoch seconds and span a wide period. The largest concentration (16,331 records) falls between roughly 1277000639.95 and 1344211200.00, that is, between about June 2010 and August 2012 (UTC).
  • Level 1 Categories (Cat1): There are 6 distinct top-level categories, the largest being "toys games" (26% of reviews) and "health personal care" (24% of reviews). The six level 1 classes are: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.
  • Level 2 Categories (Cat2): There are 64 sub-categories, with examples like "personal care" (7%) and "dogs" (7%).
  • Level 3 Categories (Cat3): The most granular level contains 510 unique categories, with "shaving hair removal" being an example.
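
The figures above can be checked with a few lines of arithmetic. The sketch below converts the two epoch bounds quoted for the Time column into UTC dates, computes the share of the top score bin, and shows one way to derive per-class shares from a list of labels. The `cat1` list is a tiny made-up sample, and `epoch_to_date` and `shares` are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime, timezone


def epoch_to_date(seconds):
    """Convert Unix epoch seconds to an ISO date string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


def shares(labels):
    """Return each label's fraction of the total, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Bounds quoted in the Time bullet above.
print(epoch_to_date(1277000639.95), epoch_to_date(1344211200.00))

# Share of the 40,000 training reviews in the 4.80-5.00 score bin.
top_bin_share = 23362 / 40000
print(f"{top_bin_share:.1%}")

# Made-up Cat1 labels, only to show the calculation.
cat1 = ["toys games", "toys games", "beauty", "pet supplies"]
print(shares(cat1))
```

Applied to the real Cat1 column, `shares` should reproduce percentages such as the 26% and 24% reported above, assuming the file matches this listing.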

Usage

This dataset is ideal for:
  • Developing and testing various hierarchical text classification methodologies.
  • Training multi-class models, whether through a flat approach (concatenating class names) or a simple hierarchical approach (sequential classification).
  • Language model fine-tuning, particularly with the raw, unlabelled review data provided.
  • Building systems for sentiment analysis and product categorisation.
  • Market research and identifying product trends based on consumer reviews.
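
The flat and sequential approaches mentioned above differ mainly in how labels are built and searched. The following is a minimal sketch under stated assumptions: the category paths in `PATHS` are made up for illustration (the listing does not state which sub-categories belong to which parents), and `word_overlap` is a toy scorer standing in for whatever trained classifier would be used at each level. All function names are hypothetical.

```python
from collections import defaultdict

# Made-up (Cat1, Cat2, Cat3) paths; the real parent-child links may differ.
PATHS = [
    ("health personal care", "personal care", "shaving hair removal"),
    ("health personal care", "personal care", "oral care"),
    ("pet supplies", "dogs", "toys"),
]


def flat_label(cat1, cat2, cat3, sep=" > "):
    """Flat approach: one class per full path, built by concatenating names."""
    return sep.join((cat1, cat2, cat3))


def build_children(paths):
    """Map each path prefix to the set of labels allowed at the next level."""
    children = defaultdict(set)
    for cat1, cat2, cat3 in paths:
        children[()].add(cat1)
        children[(cat1,)].add(cat2)
        children[(cat1, cat2)].add(cat3)
    return children


def word_overlap(text, label):
    """Toy scorer: number of words shared by the text and the label."""
    return len(set(text.lower().split()) & set(label.split()))


def predict_path(text, children, score, depth=3):
    """Sequential approach: pick the best child at each level, restricted
    to the children of the label chosen at the previous level."""
    path = ()
    for _ in range(depth):
        candidates = sorted(children[path])  # sorted for deterministic ties
        best = max(candidates, key=lambda label: score(text, label))
        path += (best,)
    return path


flat_labels = [flat_label(*p) for p in PATHS]
children = build_children(PATHS)
predicted = predict_path(
    "great for personal care shaving and hair removal", children, word_overlap
)
print(flat_labels[0])
print(predicted)
```

The flat variant yields a single multi-class problem over full paths, while the sequential variant needs one decision per level but can never predict a child that contradicts its parent.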

Coverage

The dataset covers Amazon product reviews globally. The Time column stores Unix epoch timestamps rather than human-readable dates, and its values span a wide historical range. The data reflects review contributions from a diverse base of Amazon users.

License

CC0

Who Can Use It

  • Data Scientists and Machine Learning Engineers focusing on natural language processing (NLP) and classification tasks.
  • Academics and Researchers studying hierarchical classification techniques and consumer behaviour.
  • Developers looking to fine-tune language models with real-world product review data.
  • Businesses aiming to improve product categorisation, sentiment analysis, or customer insight generation.

Dataset Name Suggestions

  • Amazon Product Review Hierarchy
  • Hierarchical Amazon Reviews for Classification
  • Product Review Categorisation Dataset
  • Amazon Consumer Review Data

Attributes

Original Data Source: Hierarchical text classification

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 08/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0