r/worldnews Textual Analysis Dataset
Entertainment & Media Consumption
Free
About
This dataset is a collection of Reddit posts from the r/worldnews subreddit, collected through the Pushshift API. Its primary purpose is to support sentiment analysis of top trending articles. For each post, the body text of the linked article was scraped with newspaper3k, and Named Entity Recognition (NER) was then run on that text with spaCy. The data has been cleaned to remove errors, advertisements, and spam posts.
Columns
- subreddit: The name of the subreddit, which is consistently r/worldnews.
- title: The title of the Reddit post.
- url: The URL attached to the submission.
- id: A unique identifier for each post. This dataset contains 3652 unique IDs.
- author: The username of the post's author. For instance, DoremusJessup accounts for 2% of posts, misana123 for 1%, and other authors for 97%.
- utc_datetime_str: The date and time (in UTC) when the post was submitted. Dates span from 17 February 2023 to 02 March 2023.
- text_url: The body text scraped from the URL attached to the post. An example value is "Business Of Sports If the only thing you know about sports is who wins and who loses, you are missing the highest stakes action of all. The business owners that power this multibillion dollar industry are changing, and a new era of the business of sports".
- NER: Named Entity Recognition performed on the text_url column using spaCy. An example value is ['multibillion dollar'].
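As a rough illustration of how these columns might be read, the sketch below parses a small in-memory sample with Python's standard library. The sample rows, the timestamp format `%Y-%m-%d %H:%M:%S`, and the assumption that NER is stored as a Python-style list literal are not confirmed by the dataset description; they are inferred from the example values above and may need adjusting.

```python
import ast
import csv
import io
from datetime import datetime, timezone

# Hypothetical two-row sample in the documented column layout; real rows will differ.
SAMPLE_CSV = '''subreddit,title,url,id,author,utc_datetime_str,text_url,NER
worldnews,Example headline,https://example.com/a,abc123,example_user,2023-02-17 08:15:00,Example article body.,"['multibillion dollar']"
worldnews,Another headline,https://example.com/b,def456,other_user,2023-02-18 21:40:30,Another body.,[]
'''

# Assumed timestamp layout; adjust if the file uses a different format.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_row(row):
    """Return a copy of one CSV row with a typed timestamp and entity list."""
    parsed = dict(row)
    parsed["utc_datetime"] = datetime.strptime(
        row["utc_datetime_str"], DATETIME_FORMAT
    ).replace(tzinfo=timezone.utc)
    # Assumption: NER is serialised as a list literal such as "['multibillion dollar']".
    try:
        entities = ast.literal_eval(row["NER"])
    except (ValueError, SyntaxError):
        entities = []
    parsed["NER"] = list(entities) if isinstance(entities, (list, tuple)) else []
    return parsed


def load_posts(handle):
    """Read every row from an open CSV file object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


posts = load_posts(io.StringIO(SAMPLE_CSV))
```

For the real file, the same `load_posts` function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.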
Distribution
This dataset comprises 3652 Reddit posts, each with 8 attributes (the columns listed above). It is typically available as a data file, such as a CSV. Record counts for individual columns vary, but the dataset holds 3652 unique posts. The posts span 17 February 2023 to 02 March 2023, and daily post counts vary (e.g., 451 posts on 23-24 February 2023).
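Daily post counts like the ones mentioned here could be recomputed from the utc_datetime_str column. The following is a minimal sketch that assumes the timestamps use the `%Y-%m-%d %H:%M:%S` layout; the three sample values are invented for illustration and are not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust if the file uses a different format.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def daily_post_counts(timestamps):
    """Map each calendar day (UTC) to the number of posts submitted on it."""
    days = (datetime.strptime(ts, DATETIME_FORMAT).date() for ts in timestamps)
    return dict(sorted(Counter(days).items()))


# Invented sample values standing in for the utc_datetime_str column.
sample = ["2023-02-17 08:15:00", "2023-02-17 23:59:59", "2023-02-18 00:00:01"]
counts = daily_post_counts(sample)

for day, n in counts.items():
    print(day.isoformat(), n)
```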
Usage
This dataset is ideal for:
- Performing sentiment analysis on news articles and social media content.
- Developing and testing Natural Language Processing (NLP) models.
- Conducting text classification tasks on news headlines and article bodies.
- Analysing trends and popular topics within world news discussions on Reddit.
- Research into media consumption patterns and public opinion on global events.
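For the trend and topic analysis mentioned above, one simple starting point is counting how many posts mention each named entity. The sketch below assumes the NER column has already been parsed into a list of strings per post; the helper name `top_entities` and the sample entities are made up for illustration.

```python
from collections import Counter


def top_entities(entity_lists, n=3):
    """Return the n entities that appear in the most posts."""
    counter = Counter()
    for entities in entity_lists:
        # Count each entity once per post so a single long article cannot dominate.
        counter.update({e.strip().lower() for e in entities if e.strip()})
    return counter.most_common(n)


# Made-up entity lists, one per post, standing in for the parsed NER column.
sample = [["Ukraine", "EU"], ["ukraine", "ukraine ", "UN"], ["EU"]]
ranking = top_entities(sample, n=2)
```

Lowercasing and stripping whitespace is a crude normalisation choice; it merges "Ukraine" and "ukraine" but would not merge aliases such as "EU" and "European Union".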
Coverage
The dataset's geographic scope is global, focusing on world news as presented on the r/worldnews subreddit. The time range covered for the posts is from 17 February 2023 to 02 March 2023. Data availability is consistent across this period, with daily post counts provided.
License
CC0
Who Can Use It
- Data Scientists and Machine Learning Engineers: For building and refining NLP models, especially for sentiment analysis and text classification.
- Researchers: Studying global news trends, social media discourse, and public sentiment.
- Journalists and Media Analysts: For understanding how specific news topics are discussed and perceived online.
- Developers: Creating applications that leverage social media data for insights or content categorisation.
Dataset Name Suggestions
- World News Reddit Posts (2023)
- Global News Reddit Discourse
- r/worldnews Textual Analysis Dataset
- Reddit World News Scraped Data
- World News Article Sentiment Dataset
Attributes
Original Data Source: r/worldnews textual