Opendatabay APP

r/worldnews Textual Analysis Dataset

Entertainment & Media Consumption

Tags and Keywords

Internet

NLP

Text

Classification

LSTM

Free

About

This dataset is a collection of Reddit posts from the r/worldnews subreddit, scraped using the Pushshift API. Its primary purpose is to support sentiment analysis of top trending articles. For each post, the body text of the linked article was scraped with newspaper3k, and Named Entity Recognition (NER) was then run on that text with spaCy. The data has been cleaned to remove errors, advertisements, and spam posts.

Columns

  • subreddit: The name of the subreddit, which is consistently r/worldnews.
  • title: The title of the Reddit post.
  • url: The URL attached to the submission.
  • id: A unique identifier for each post. This dataset contains 3652 unique IDs.
  • author: The username of the post's author. For instance, DoremusJessup accounts for 2% of posts, misana123 for 1%, and other authors for 97%.
  • utc_datetime_str: The date and time (in UTC) when the post was submitted. Dates span from 17 February 2023 to 02 March 2023.
  • text_url: The body text scraped from the URL attached to the post. An example value is "Business Of Sports If the only thing you know about sports is who wins and who loses, you are missing the highest stakes action of all. The business owners that power this multibillion dollar industry are changing, and a new era of the business of sports".
  • NER: Named entities extracted from text_url using spaCy. An example value is ['multibillion dollar'].
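
The column layout above can be read with Python's standard library alone. The following is a minimal sketch, not the publisher's loading code. It assumes that utc_datetime_str uses the format "YYYY-MM-DD HH:MM:SS" and that NER is stored as a stringified Python list (as the example value suggests); neither is confirmed by the listing. The helper names parse_row and load_posts and the sample row are hypothetical placeholders.

```python
import ast
import csv
import io
from datetime import datetime, timezone

# Column names as listed in the "Columns" section.
COLUMNS = ["subreddit", "title", "url", "id", "author",
           "utc_datetime_str", "text_url", "NER"]


def parse_row(row, dt_format="%Y-%m-%d %H:%M:%S"):
    """Return a copy of `row` with a typed datetime and a list-valued NER field.

    `dt_format` is an assumption; adjust it to match the actual file.
    """
    parsed = dict(row)
    parsed["utc_datetime"] = datetime.strptime(
        row["utc_datetime_str"], dt_format
    ).replace(tzinfo=timezone.utc)
    try:
        # Assumes NER is a stringified list such as "['multibillion dollar']".
        ner = ast.literal_eval(row["NER"]) if row["NER"] else []
    except (ValueError, SyntaxError):
        ner = []
    parsed["NER"] = list(ner) if isinstance(ner, (list, tuple)) else []
    return parsed


def load_posts(fileobj):
    """Read all rows from an open CSV file object and parse each one."""
    reader = csv.DictReader(fileobj)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [parse_row(r) for r in reader]


# Placeholder row for illustration only; it is not taken from the dataset.
SAMPLE_CSV = (
    "subreddit,title,url,id,author,utc_datetime_str,text_url,NER\n"
    "r/worldnews,Example title,https://example.com/a,abc123,example_user,"
    "2023-02-17 12:00:00,Example body text,\"['multibillion dollar']\"\n"
)

posts = load_posts(io.StringIO(SAMPLE_CSV))
print(posts[0]["utc_datetime"], posts[0]["NER"])
```

To read the real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV was saved.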

Distribution

This dataset comprises 3652 Reddit posts, each with 8 attributes. It is typically available as a single data file, such as a CSV. Record counts for individual columns vary, but every post has a unique ID. The posts span 17 February 2023 to 02 March 2023, and daily post counts vary (e.g., 451 posts on 23-24 February 2023).
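
The figures above (unique IDs and posts per day) can be checked after download. The sketch below shows one way to do so with the standard library. It assumes the same "YYYY-MM-DD HH:MM:SS" timestamp format as a working guess; the function names and sample values are hypothetical placeholders, not data from the file.

```python
from collections import Counter
from datetime import datetime


def daily_counts(datetime_strings, dt_format="%Y-%m-%d %H:%M:%S"):
    """Count posts per calendar day (UTC), returned in date order."""
    days = (datetime.strptime(s, dt_format).date() for s in datetime_strings)
    return dict(sorted(Counter(days).items()))


def duplicate_ids(ids):
    """Return any post IDs that occur more than once, sorted."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Placeholder values for illustration only.
sample_times = ["2023-02-17 08:00:00", "2023-02-17 21:30:00", "2023-03-02 10:15:00"]
sample_ids = ["a1", "b2", "c3"]

counts = daily_counts(sample_times)
for day, n in counts.items():
    print(day.isoformat(), n)
print("duplicate ids:", duplicate_ids(sample_ids))
```

On the full file, an empty result from duplicate_ids and a total of 3652 rows would match the description above.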

Usage

This dataset is ideal for:
  • Performing sentiment analysis on news articles and social media content.
  • Developing and testing Natural Language Processing (NLP) models.
  • Conducting text classification tasks on news headlines and article bodies.
  • Analysing trends and popular topics within world news discussions on Reddit.
  • Research into media consumption patterns and public opinion on global events.
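
As an illustration of the trend-analysis use case, the sketch below ranks named entities by the number of posts that mention them. It assumes each post's NER value has already been parsed into a list of strings. The function name top_entities and the sample entities are hypothetical placeholders, and lowercasing is a simple normalisation choice rather than anything the dataset prescribes.

```python
from collections import Counter


def top_entities(ner_lists, n=10):
    """Return the n entities mentioned in the most posts.

    Each entity is counted at most once per post, so a single long
    article cannot dominate the ranking.
    """
    counter = Counter()
    for entities in ner_lists:
        # Normalise case and whitespace, and drop empty strings.
        counter.update({e.strip().lower() for e in entities if e.strip()})
    return counter.most_common(n)


# Placeholder entity lists for illustration only.
sample = [["Alpha", "Beta"], ["alpha", "Gamma"], ["Beta", "Alpha", "Alpha"]]
print(top_entities(sample, n=2))
```

Counting per post rather than per mention is one design choice among several; raw mention counts may suit other analyses better.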

Coverage

The dataset's geographic scope is global, focusing on world news as presented on the r/worldnews subreddit. The posts span 17 February 2023 to 02 March 2023. Data is available throughout this period, with daily post counts provided.

License

CC0

Who Can Use It

  • Data Scientists and Machine Learning Engineers: For building and refining NLP models, especially for sentiment analysis and text classification.
  • Researchers: Studying global news trends, social media discourse, and public sentiment.
  • Journalists and Media Analysts: For understanding how specific news topics are discussed and perceived online.
  • Developers: Creating applications that leverage social media data for insights or content categorisation.

Dataset Name Suggestions

  • World News Reddit Posts (2023)
  • Global News Reddit Discourse
  • r/worldnews Textual Analysis Dataset
  • Reddit World News Scraped Data
  • World News Article Sentiment Dataset

Attributes

Original Data Source: r/worldnews textual

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 24/06/2025
  • Region: Global
  • Quality (Universal Data Quality Score, UDQS): 5 / 5
  • Version: 1.0
