Global News Articles Dataset
Government & Civic Records
Free
About
This dataset contains over 90,000 news articles gathered from various free news APIs, offering a valuable resource for text analysis and natural language processing tasks. It includes articles from over 600 sources across 26 countries, categorised into more than 16 topics. The dataset's primary purpose is to provide rich content for tasks such as article classification and deeper content understanding.
Columns
The dataset features 9 distinct columns, each providing specific details about the news articles:
- id: A unique identifier for each news article.
- title: The headline or title of the news article.
- link: The URL pointing to the original news article.
- source: The domain or website from which the article was published.
- country: The country where the article was published.
- topic: The category or subject of the article.
- language: The language in which the article was published.
- summary: A detailed description or the full content of the article.
- published_date: The date when the article was published.
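As a minimal sketch of how the columns above might be read and checked, the snippet below uses Python's standard `csv` module. The function name `read_articles`, the comma delimiter, and the sample row are assumptions for illustration; the listing does not document the exact file layout.

```python
import csv
import io

# Column names as listed above.
EXPECTED_COLUMNS = [
    "id", "title", "link", "source", "country",
    "topic", "language", "summary", "published_date",
]

def read_articles(handle):
    """Read all rows from an open CSV file, checking the header first."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# In-memory sample standing in for the real file (the row is invented).
sample = io.StringIO(
    "id,title,link,source,country,topic,language,summary,published_date\n"
    "1,Example headline,https://example.com/a,example.com,US,news,en,"
    "Example text,2022-05-26\n"
)
rows = read_articles(sample)
print(rows[0]["title"], rows[0]["topic"])
```

With a real download, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` is whatever name the CSV file has locally.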
Distribution
The data files are typically in CSV format. The dataset comprises over 90,000 articles, each carrying an identifier. The listing reports 36,649 unique article IDs and titles and 35,503 unique links, so the unique counts are lower than the total row count. Key sources include yahoo.com (15%) and indiatimes.com (7%). The main topics are news (67%) and finance (9%). The language column contains a single unique value. The articles were published between 26th May 2022 and 6th June 2022.
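Figures like the unique counts and percentage shares above can be recomputed from the rows themselves. The following is a rough sketch using `collections.Counter`; the helper names `unique_count` and `shares` and the sample rows are invented for illustration and do not come from the dataset.

```python
from collections import Counter

def unique_count(rows, column):
    """Number of distinct values in one column."""
    return len({row[column] for row in rows})

def shares(rows, column, top=2):
    """Most common values of a column with their share of all rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return [(value, n / total) for value, n in counts.most_common(top)]

# Invented rows; the second row repeats the first to mimic a duplicate.
rows = [
    {"id": "1", "link": "https://example.com/a", "source": "example.com", "topic": "news"},
    {"id": "1", "link": "https://example.com/a", "source": "example.com", "topic": "news"},
    {"id": "2", "link": "https://example.org/b", "source": "example.org", "topic": "finance"},
    {"id": "3", "link": "https://example.com/c", "source": "example.com", "topic": "news"},
]

print("rows:", len(rows))
print("unique ids:", unique_count(rows, "id"))
print("unique links:", unique_count(rows, "link"))
print("top sources:", shares(rows, "source"))
print("top topics:", shares(rows, "topic"))
```

Comparing the row count with the unique counts is a quick way to see how many repeated IDs or links a file contains before using it for training.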
Usage
This dataset is ideal for a range of applications, including:
- Natural Language Processing (NLP): Training models for text classification, entity recognition, and sentiment analysis.
- News Aggregation and Recommendation Systems: Developing systems that categorise and suggest news content based on user preferences or trends.
- Journalism and Media Studies: Analysing news coverage patterns, source reliability, and topic distribution across different regions.
- Market Research: Identifying trends and insights from news related to specific industries or events.
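To make the text classification use case concrete, here is a small baseline sketch: a multinomial Naive Bayes classifier with Laplace smoothing, written with the standard library only. It is one possible starting point, not a method prescribed by the dataset. The training texts and the `sport` label are invented; with the real data, the text would presumably be built from `title` and `summary` and the label taken from `topic`.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count documents per topic and words per topic from (text, topic) pairs."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, topic in examples:
        topic_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocabulary.add(word)
    return topic_counts, word_counts, vocabulary

def predict(model, text):
    """Return the topic with the highest log-probability for the text."""
    topic_counts, word_counts, vocabulary = model
    total_docs = sum(topic_counts.values())
    words = [w for w in tokenize(text) if w in vocabulary]  # skip unseen words
    best_topic, best_score = None, -math.inf
    for topic, n_docs in topic_counts.items():
        total_words = sum(word_counts[topic].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in words:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[topic][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Invented training examples standing in for title + summary text.
examples = [
    ("Stocks rally as markets close higher", "finance"),
    ("Central bank raises interest rates", "finance"),
    ("Team wins the championship final", "sport"),
    ("Striker scores twice in cup match", "sport"),
]
model = train(examples)
print(predict(model, "Markets react to interest rates"))
print(predict(model, "Cup final match"))
```

Because the reported topic distribution is dominated by the news category, a held-out evaluation would likely need per-topic metrics rather than overall accuracy alone.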
Coverage
The dataset offers a global geographic scope, featuring articles from 26 different countries and over 600 sources. The primary countries represented are the United States (67%) and India (13%). The time range for the data is from 26th May 2022 to 6th June 2022. There are no specific notes on demographic availability.
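The stated time range can be checked against the `published_date` column. The sketch below assumes each value begins with an ISO-style `YYYY-MM-DD` date, which the listing does not confirm; the helper names and sample rows are hypothetical.

```python
from datetime import date

def parse_day(value):
    """Parse the leading YYYY-MM-DD part of a date string (assumed format)."""
    return date.fromisoformat(value[:10])

def date_range(rows):
    """Earliest and latest publication day found in the rows."""
    days = [parse_day(row["published_date"]) for row in rows]
    return min(days), max(days)

def outside_window(rows, start, end):
    """Rows whose publication day falls outside [start, end]."""
    return [
        row for row in rows
        if not start <= parse_day(row["published_date"]) <= end
    ]

# Invented rows; the last one lies outside the stated window on purpose.
rows = [
    {"id": "1", "country": "US", "published_date": "2022-05-26 10:00:00"},
    {"id": "2", "country": "IN", "published_date": "2022-06-06"},
    {"id": "3", "country": "US", "published_date": "2022-06-07"},
]

first, last = date_range(rows)
print(first, last)
print(outside_window(rows, date(2022, 5, 26), date(2022, 6, 6)))
```

If the real file stores dates in another format, `parse_day` is the only function that would need to change.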
License
CC0
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers: For building and testing NLP models.
- Academic Researchers: For studies in media, communication, and computational linguistics.
- Developers: Creating news-related applications, such as news aggregators or content analysis tools.
- Journalists and Analysts: For conducting deep dives into news trends and public sentiment.
Dataset Name Suggestions
- Global News Articles Dataset
- Daily News Corpus
- Multilingual News Headlines
- Current Events Data Stream
- News Article Text Dataset
Attributes
Original Data Source: News Articles