NLP Preprocessed Sentiment Dataset
Data Science and Analytics
Free
About
This dataset contains over 241,000 English-language comments gathered from various online platforms. Each comment is annotated with a sentiment label: 0 for negative, 1 for neutral, and 2 for positive. It is intended for training and evaluating multi-class sentiment analysis models on real-world text. The comments have already been preprocessed: they are lowercased and stripped of punctuation, URLs, numbers, and stopwords, so they can be fed directly into Natural Language Processing (NLP) pipelines.
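The exact cleaning code used for this dataset is not published. As a rough illustration of the steps named above, the following Python sketch lowercases text and removes URLs, numbers, punctuation, and stopwords. The regular expressions, the order of the steps, and the tiny `STOPWORDS` set are all assumptions made for the example, not the dataset's actual pipeline.

```python
import re
import string

# Hypothetical, deliberately small stopword list; the list actually used
# to build the dataset is not documented.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "to", "and", "of", "in", "on", "i"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    """Apply the cleaning steps listed in the description, in an assumed order."""
    text = text.lower()                  # lowercase
    text = URL_RE.sub(" ", text)         # drop URLs before punctuation is stripped
    text = NUMBER_RE.sub(" ", text)      # drop digit runs
    text = text.translate(PUNCT_TABLE)   # drop ASCII punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("I LOVED this movie!!! 10/10, see https://example.com/review"))
```

Applying the same function to new text at inference time keeps it roughly consistent with the training data, although results may differ wherever the real stopword list or rules differ from the ones assumed here.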
Columns
- Comment: This column contains the user-generated text content.
- Sentiment: This column provides the corresponding sentiment label for each comment, where 0 denotes Negative, 1 denotes Neutral, and 2 denotes Positive.
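To make the label encoding concrete, here is a minimal Python sketch that maps the integer codes to readable names and rejects anything outside 0, 1, and 2. The names `LABEL_NAMES` and `decode_label` are illustrative and not part of the dataset.

```python
# Label encoding as described in the column list: 0, 1, 2.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def decode_label(raw) -> str:
    """Convert a raw Sentiment value (int or numeric string) to its name."""
    label = int(raw)
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected sentiment label: {raw!r}")
    return LABEL_NAMES[label]


print(decode_label("2"))
```

Failing loudly on unknown codes is a design choice for the sketch: it surfaces malformed rows early instead of letting them reach a training loop.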
Distribution
The dataset comprises over 241,000 records in the two columns described above. The file format is not specified; datasets of this kind are typically distributed in tabular form, often as a CSV file, which makes them straightforward to integrate into machine learning workflows.
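Since the file format is not confirmed, the following is only a sketch of how the data could be read if it is a CSV file with a header row naming the `Comment` and `Sentiment` columns. It uses an in-memory sample in place of the real file, and the sample rows are invented for the example.

```python
import csv
import io
from collections import Counter


def read_rows(handle):
    """Yield (comment, label) pairs from a CSV with Comment and Sentiment columns."""
    reader = csv.DictReader(handle)
    missing = {"Comment", "Sentiment"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    for row in reader:
        yield row["Comment"], int(row["Sentiment"])


# Invented sample standing in for the real file, whose name is not documented.
sample = io.StringIO(
    "Comment,Sentiment\n"
    "loved movie,2\n"
    "terrible service,0\n"
    "package arrived tuesday,1\n"
    "great value,2\n"
)

rows = list(read_rows(sample))
counts = Counter(label for _, label in rows)
print(len(rows), dict(counts))
```

With the real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`. Checking the class counts first is useful because the listing does not say how balanced the three labels are.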
Usage
This dataset is suited to a range of use cases, including:
- Training sentiment classifiers with models such as LSTM, BiLSTM, CNN, BERT, or RoBERTa.
- Evaluating the efficacy of different preprocessing and tokenisation strategies for text data.
- Benchmarking NLP models on multi-class classification tasks.
- Supporting educational projects and research initiatives in the fields of opinion mining or text classification.
- Fine-tuning transformer models on a large and diverse collection of sentiment-annotated text.
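For the benchmarking use case, a three-class task is often scored with macro-averaged F1, which weights each class equally regardless of how many examples it has. The listing does not prescribe a metric, so the choice here is an assumption. A minimal pure-Python sketch:

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples scores 0.0 by convention here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for the example, not taken from the dataset.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

Plain accuracy can look good on an imbalanced dataset even when one class is mostly misclassified, which is why a per-class average is a common companion metric.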
Coverage
Coverage is global but limited to English-language comments. The content is general user-generated text, with no demographic information provided. The listed version is 1.0.
License
CC0
Who Can Use It
This dataset is suitable for individuals and organisations involved in data science and analytics. Intended users include:
- Data Scientists and Machine Learning Engineers for developing and deploying sentiment analysis models.
- Researchers and Academics for studies in NLP, text classification, and opinion mining.
- Students undertaking educational projects in artificial intelligence and machine learning.
Dataset Name Suggestions
- Multi-class Comment Sentiment Data
- User Text Sentiment Collection
- Online Comment Sentiment Analysis Dataset
- English Sentiment Labelled Comments
- Preprocessed Sentiment Dataset
Attributes
Original Data Source: Sentiment Analysis Dataset