Online Review Authenticity Dataset
Free
About
This dataset is designed to support research on generating and detecting fake reviews for online products. It comprises approximately 40,000 product reviews, split evenly between about 20,000 authentic, human-written reviews and about 20,000 computer-generated fake reviews. Each record includes the review text, a product category, and a rating, making the dataset a useful resource for developing and testing review-integrity solutions for e-commerce and other online platforms.
Columns
- review dataset: Likely indicates the type or source of the review within the dataset.
- category: Specifies the product category the review belongs to, such as 'Kindle_Store_5' or 'Books_5'.
- rating: The numerical rating given in the review.
- label: A classification label indicating whether a review is original (OR) or computer-generated (CG).
- text_: The actual textual content of the product review.
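As a rough illustration of how these columns might be read, the sketch below parses a small in-memory sample with Python's standard `csv` module. The sample rows are invented for illustration. The header names (`category`, `rating`, `label`, `text_`) follow the list above, but the actual file name and exact header spelling are not stated in this listing, so both are assumptions.

```python
import csv
import io
from collections import Counter

# Invented sample rows. A real file would be opened with something like
# open(path, newline="", encoding="utf-8"), where path is whatever the
# downloaded CSV is called (not specified in this listing).
SAMPLE_CSV = """category,rating,label,text_
Books_5,5.0,OR,"Loved this book, could not put it down."
Kindle_Store_5,4.0,CG,"Great story and great characters, great read."
Books_5,1.0,OR,"The binding fell apart after a week."
"""

EXPECTED_COLUMNS = {"category", "rating", "label", "text_"}
VALID_LABELS = {"OR", "CG"}  # OR = original, CG = computer-generated


def load_reviews(handle):
    """Parse rows, convert rating to float, and reject unknown labels."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label: {row['label']!r}")
        row["rating"] = float(row["rating"])
        rows.append(row)
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
label_counts = Counter(row["label"] for row in reviews)
print(label_counts)
```

Validating the header and label values up front is a design choice here: it surfaces a mismatched export early instead of letting a silent `KeyError` or mislabelled row reach a model later.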
Distribution
The dataset lists 40,412 unique entries, balanced between roughly 20,000 fake and 20,000 real product reviews. Data is typically provided as a CSV file.
The distribution of ratings is as follows:
- 1.00 - 1.20: 2,155 entries
- 2.00 - 2.20: 1,967 entries
- 3.00 - 3.20: 3,786 entries
- 4.00 - 4.20: 7,965 entries
- 4.80 - 5.00: 24,559 entries
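The counts above imply a strongly skewed rating distribution. One quick way to see this is to convert them into shares of the total, as in the sketch below, which uses only the figures listed here and no additional data.

```python
# Entry counts per rating band, copied from the list above.
rating_counts = {
    "1.00 - 1.20": 2155,
    "2.00 - 2.20": 1967,
    "3.00 - 3.20": 3786,
    "4.00 - 4.20": 7965,
    "4.80 - 5.00": 24559,
}

total = sum(rating_counts.values())
shares = {band: count / total for band, count in rating_counts.items()}

for band, share in shares.items():
    print(f"{band}: {share:.1%}")
print(f"total entries across bands: {total}")
```

The bands sum to 40,432 entries, slightly more than the 40,412 unique entries quoted above. The listing does not explain the difference; duplicated rows are one possible cause, but that is an assumption. The top band alone accounts for roughly 61% of entries, which is worth keeping in mind if rating is used as a model feature.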
The dataset categorisation includes:
- Kindle_Store_5: 12%
- Books_5: 11%
- Other: 77% (31,332 entries)
Usage
This dataset is ideal for training machine learning models to identify and flag fraudulent or computer-generated product reviews. It can be utilised for:
- Developing Natural Language Processing (NLP) models for sentiment analysis and text classification.
- Building AI & Machine Learning solutions for fraud detection in online marketplaces.
- Researching the characteristics and patterns of authentic versus fabricated consumer feedback.
- Enhancing the trustworthiness and reliability of online review systems.
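To make the text-classification use case more concrete, here is a minimal baseline sketch: a multinomial naive Bayes classifier with add-one smoothing, written with the Python standard library only. It is not the method used to build or evaluate this dataset, and the four training sentences are invented toy examples. The OR and CG labels follow the column description above; everything else (function names, tokenisation rule) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}      # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training reviews
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    # Tokens never seen in training are skipped.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples, not taken from the dataset.
toy_examples = [
    ("The hinge snapped after two weeks and support never replied.", "OR"),
    ("Bought it for my daughter, fits her desk and the colour matches.", "OR"),
    ("Great product great quality great price I love it great.", "CG"),
    ("I love this product it is great and I love the quality.", "CG"),
]

model = train(toy_examples)
print(predict(model, "great quality I love it"))
print(predict(model, "the hinge snapped after a week"))
```

In practice, a baseline like this would be trained on the `text_` and `label` columns with a held-out test split. A bag-of-words model is only a starting point, since it ignores word order and may not capture the patterns that distinguish generated text.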
Coverage
The dataset is described as having global coverage, so it is not tied to a particular region. The time range of the reviews is not specified, but the data spans multiple product categories and review contexts within e-commerce.
License
CC-BY
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers: To develop and fine-tune models for fake review detection and NLP tasks.
- Researchers: Studying consumer behaviour, online trust, and adversarial attacks in digital platforms.
- E-commerce Businesses: To implement internal systems for maintaining review authenticity and improving customer trust.
- Academics and Students: For educational purposes, projects, and academic studies in AI, NLP, and data science.
Dataset Name Suggestions
- Fake Product Reviews Dataset
- Online Review Authenticity Dataset
- E-commerce Review Integrity Data
- AI Review Detection Dataset
- Customer Review Verification Set
Attributes
Original Data Source: 🚨 Fake Reviews Dataset