Opendatabay APP

Tweet Sentiment Extraction Pseudo Labels

Social Media and Networking

Tags and Keywords

  • News
  • Social
  • Networks
  • Email
  • Messaging
  • NLP

Free

About

This dataset contains pseudo-labelled tweets curated for the Tweet Sentiment Extraction competition. Each record provides the pre-processed and original tweet text, the tweet's sentiment, and an extracted pseudo-label. It is a resource for developing and evaluating sentiment analysis models on social media content.

Columns

  • textID: Unique identifiers for competition tweet texts.
  • sentiment: The assigned sentiment of the tweet (e.g., positive, negative, neutral).
  • author: The Twitter handle.
  • text: The pre-processed version of the tweet.
  • old_text: The original tweet.
  • aux_id: Auxiliary identifiers for competition tweet texts.
  • new_sentiment: An additional sentiment label.
  • selected_text: The pseudo-label representing the portion of the text that conveys the sentiment.
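
As a minimal sketch of how the columns above might be checked after download, the snippet below reads a CSV with Python's standard csv module and reports any listed column missing from the header. The sample rows are made up for illustration and are not taken from the dataset; load_rows is a hypothetical helper name.

```python
import csv
import io

# Column names as given in the listing above.
EXPECTED_COLUMNS = [
    "textID", "sentiment", "author", "text",
    "old_text", "aux_id", "new_sentiment", "selected_text",
]

def load_rows(fileobj):
    """Read CSV rows as dicts; return (rows, expected columns missing from the header)."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    return list(reader), missing

# Made-up sample standing in for the real file (assumption, not real data).
SAMPLE_CSV = (
    "textID,sentiment,author,text,old_text,aux_id,new_sentiment,selected_text\n"
    "a1,positive,@user_one,i love this song,I LOVE this song!!,101,happy,love\n"
    "a2,neutral,@user_two,going to the shop,Going to the shop,102,,going to the shop\n"
)

rows, missing = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), missing)
```

For the real file, the same function can be called with an open file handle, for example open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.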

Distribution

The dataset is provided as a CSV file with approximately 12,520 records. There are around 12,520 unique textID entries, so textID appears to be unique per record, and 11,599 unique authors. The sentiment column is roughly 35% negative, 34% neutral, and 32% other sentiments (these are rounded figures, which is why they total 101%). In new_sentiment, 69% of values are null and 12% are neutral, with the remainder spread across other labels such as 'happy' and 'good'. The file size is not specified.
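
The percentages quoted here can be recomputed from the file. The following is a rough sketch, assuming the column values have already been read into Python lists; value_shares is a hypothetical helper, and the short lists are illustrative values, not dataset contents.

```python
from collections import Counter

def value_shares(values):
    """Return {value: percentage of rows}, grouping empty or None values as '(null)'."""
    counts = Counter(v if v else "(null)" for v in values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: round(100 * n / total, 1) for k, n in counts.items()}

# Illustrative values only (assumption, not real data).
sentiments = ["negative", "neutral", "positive", "negative"]
new_sentiments = ["", "neutral", "happy", ""]

print(value_shares(sentiments))
print(value_shares(new_sentiments))
```

Grouping empty strings under a single '(null)' key makes the share of missing new_sentiment labels directly comparable with the 69% figure in the listing.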

Usage

This dataset is ideally suited for tasks involving natural language processing (NLP), particularly sentiment analysis and emotion detection in text. It can be used for training machine learning models to identify sentiment, for text extraction tasks, and for research into social media communication patterns. Potential applications include customer feedback analysis, brand monitoring, and public opinion tracking.
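
For the text extraction use case, a common first step is to locate selected_text inside text and to score a predicted span against the pseudo-label. The sketch below is one way to do that; word-level Jaccard overlap is used here as an assumed scoring choice that the listing itself does not specify, and both function names are hypothetical.

```python
def find_span(text, selected_text):
    """Return (start, end) character offsets of selected_text in text, or None if absent."""
    start = text.find(selected_text)
    if start == -1:
        return None
    return start, start + len(selected_text)

def jaccard(a, b):
    """Word-level Jaccard overlap between two strings, case-insensitive."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Illustrative strings only (assumption, not real data).
tweet = "i love this song"
label = "love this"
prediction = "love this song"

print(find_span(tweet, label))
print(jaccard(prediction, label))
```

A None result from find_span flags rows where the pseudo-label is not a literal substring of the text column, which is worth checking before training a span-extraction model.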

Coverage

The dataset is listed with global coverage and no geographic restrictions. It was listed on 24 June 2025, but the time range of the tweet collection is not specified. No demographic scope is specified either.

License

CC0

Who Can Use It

This dataset is beneficial for data scientists, machine learning engineers, and researchers focused on NLP. It is particularly useful for those participating in sentiment extraction competitions or working on projects that require training models to understand and extract sentiment from social media posts. Developers building applications that interpret user sentiment or perform content moderation will also find it valuable.

Dataset Name Suggestions

  • Tweet Sentiment Extraction Pseudo Labels
  • Social Media Sentiment Dataset
  • Twitter Sentiment Analysis Data
  • Pseudo-Labelled Tweet Corpus
  • Tweet Emotion Extraction Dataset

Attributes

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 24/06/2025
  • Region: Global
  • UDQS quality (Universal Data Quality Score): 5 / 5
  • Version: 1.0
