GoEmotions Text Emotion Dataset
Data Science and Analytics
About
This dataset is a corpus of 58,009 Reddit comments, each annotated by human raters with one or more of 27 emotion categories or a neutral label. It is suited to multi-class and multi-label emotion classification and to other natural language processing (NLP) tasks.
Columns
- data: The original textual content of the Reddit comment.
- text: The textual content of the Reddit comment, which may be a processed or identical version of the data column.
- id: A unique identifier for each individual Reddit comment.
- author: The username of the Reddit account that posted the comment.
- subreddit: The name of the Reddit community (subreddit) where the comment was published.
- link_id: An identifier for the submission (post) to which the comment is linked.
- parent_id: An identifier for the parent comment or the original submission, indicating its place within a conversation thread.
- created_utc: The creation timestamp of the comment, presented in Unix epoch format.
- rater_id: An identifier for the human annotator who provided the emotion label for the comment.
- example_very_unclear: A boolean flag that indicates whether the example was deemed very unclear during the annotation process.
- admiration: One of the 27 emotion categories assigned to the comment, typically represented as a binary (0 or 1) value. Other emotion categories include amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realisation, relief, remorse, sadness, and surprise, in addition to a Neutral label.
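Because each emotion is stored as its own 0/1 column, a common first step is to collapse those columns into a list of label names per comment. The following is a minimal sketch using only the Python standard library. The CSV excerpt, its row values, and the helper name read_labels are hypothetical, and only a few of the emotion columns are shown.

```python
import csv
import io

# Hypothetical excerpt: made-up rows with only a few of the emotion columns.
SAMPLE_CSV = """id,text,admiration,amusement,gratitude,neutral
a1,Great work!,1,0,0,0
a2,Haha thanks so much,0,1,1,0
a3,The bus leaves at noon,0,0,0,1
"""

# Columns in this excerpt that are not emotion labels.
NON_LABEL_COLUMNS = {"id", "text"}


def read_labels(csv_text):
    """Map each comment id to the list of emotion columns set to 1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    label_columns = [c for c in reader.fieldnames if c not in NON_LABEL_COLUMNS]
    return {
        row["id"]: [c for c in label_columns if row[c] == "1"]
        for row in reader
    }


labels = read_labels(SAMPLE_CSV)
print(labels)
```

With the real file, NON_LABEL_COLUMNS would also need to list the metadata columns described above (author, subreddit, link_id, and so on), assuming the header names match this listing.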
Distribution
The dataset is provided as a CSV file of 42.74 MB containing 58,009 examples. It includes a version filtered by rater agreement, which is divided into training, test, and validation sets:
- Training dataset: 43,410 examples
- Test dataset: 5,427 examples
- Validation dataset: 5,426 examples
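As a quick arithmetic check on the figures above, the short sketch below sums the three splits and computes each split's share. The counts are taken from this listing. Reading the remainder as comments excluded by the rater-agreement filter is an assumption, not something the listing states.

```python
# Split sizes and total as stated in the listing.
SPLITS = {"train": 43_410, "test": 5_427, "validation": 5_426}
TOTAL_EXAMPLES = 58_009

filtered_total = sum(SPLITS.values())
fractions = {
    name: round(count / filtered_total, 3) for name, count in SPLITS.items()
}
# Examples in the full corpus that are not part of any split
# (assumed here to be those dropped by the rater-agreement filter).
outside_splits = TOTAL_EXAMPLES - filtered_total

print(filtered_total)   # 54263
print(fractions)        # {'train': 0.8, 'test': 0.1, 'validation': 0.1}
print(outside_splits)   # 3746
```

The splits therefore follow roughly an 80/10/10 ratio over 54,263 filtered examples.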
Usage
This dataset is ideal for:
- Developing and evaluating emotion classification models.
- Performing sentiment analysis on social media content.
- Conducting research in natural language processing and understanding.
- Facilitating exploratory data analysis of emotional expression on the Reddit platform.
- Aiding the development of AI and large language model (LLM) applications that require emotion detection capabilities.
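For the model evaluation use case, one common choice for data where a comment can carry several labels is micro-averaged F1, which pools true positives, false positives, and false negatives across all comments. The sketch below is a plain-Python illustration. The function name micro_f1 and the toy labels are hypothetical, and the metric is an illustrative choice rather than one prescribed by the dataset.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over per-comment sets of emotion labels."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, predicted_sets):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed by the prediction
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: three comments with gold and predicted label sets.
truth = [{"joy"}, {"anger", "annoyance"}, {"neutral"}]
pred = [{"joy"}, {"anger"}, {"sadness"}]

score = micro_f1(truth, pred)
print(round(score, 4))  # 0.5714
```

Here two labels are correct, one is spurious, and two are missed, giving precision 2/3, recall 1/2, and F1 of 4/7.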
Coverage
- Geographic Scope: The data's scope is global.
- Time Range: Comments included in the dataset were created between approximately 1st January 2019 and 1st February 2019.
- Demographic Scope: As the data originates from Reddit comments, it reflects the diverse range of user demographics present on the platform, although specific demographic breakdowns are not provided.
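The stated time range can be checked against the created_utc column, which holds Unix epoch seconds. The sketch below converts a timestamp to a timezone-aware UTC datetime and tests it against the approximate range given above. The helper names and the example timestamp are hypothetical.

```python
from datetime import datetime, timezone

# Approximate range stated in the listing, interpreted as UTC.
RANGE_START = datetime(2019, 1, 1, tzinfo=timezone.utc)
RANGE_END = datetime(2019, 2, 1, tzinfo=timezone.utc)


def epoch_to_utc(created_utc):
    """Convert Unix epoch seconds (int, float, or numeric string) to UTC."""
    return datetime.fromtimestamp(float(created_utc), tz=timezone.utc)


def in_stated_range(created_utc):
    """True if the timestamp falls within the listing's stated range."""
    return RANGE_START <= epoch_to_utc(created_utc) <= RANGE_END


example = 1547000000  # hypothetical created_utc value
print(epoch_to_utc(example).isoformat())  # 2019-01-09T02:13:20+00:00
print(in_stated_range(example))           # True
```

Passing tz=timezone.utc avoids the local-timezone conversion that datetime.fromtimestamp applies by default.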
License
CC BY-NC-SA.
Who Can Use It
- Data scientists seeking to build and test machine learning models for emotion detection.
- NLP researchers focused on advancements in emotion recognition and textual sentiment.
- Academics engaged in linguistic or social science studies of online communication patterns.
- Developers creating applications for social media monitoring or conversational AI systems.
Dataset Name Suggestions
- GoEmotions Reddit Comments
- Reddit Emotion Corpus
- Social Media Emotion Labels Dataset
- GoEmotions Text Emotion Dataset
Attributes
Original Data Source: GoEmotions