Headline News Categorisation
Entertainment & Media Consumption
Free
About
This dataset is designed for news topic classification, offering a collection of news article headlines. It serves as a text classification benchmark derived from AG's news corpus, which comprises over 1 million news articles. These articles were gathered from more than 2000 news sources by ComeToMyHead, an academic news search engine, over a period of more than one year, starting from July 2004. The headlines are categorised into four distinct news topics: 'World', 'Sports', 'Business', and 'Sci/Tech', making it suitable for training and evaluating machine learning models for news categorisation.
Columns
- text: This column contains the headline of a news article.
- label: This column indicates the news article topic number. The numeric labels correspond to specific news topics: 0 for 'World', 1 for 'Sports', 2 for 'Business', and 3 for 'Sci/Tech'.
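The column layout above can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the file has a header row naming the columns `text` and `label`; the helper name `read_headlines` and the two sample rows are hypothetical and stand in for the real file.

```python
import csv
import io

# Label mapping as described in the Columns section.
LABEL_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def read_headlines(csv_file):
    """Yield (text, topic_name) pairs from a CSV with 'text' and 'label' columns.

    Assumes a header row with exactly those column names (an assumption,
    not confirmed by the dataset description).
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        yield row["text"], LABEL_NAMES[int(row["label"])]


# Hypothetical two-row sample standing in for the real CSV file.
sample = io.StringIO(
    "text,label\n"
    "Made-up headline about a football final,1\n"
    "Made-up headline about quarterly earnings,2\n"
)
pairs = list(read_headlines(sample))
```

For the real data, the `io.StringIO` sample would be replaced by an opened file object, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV.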
Distribution
The dataset is provided as a CSV file of 28.92 MB. It contains 120,000 headlines, split evenly across the four news topics: 30,000 instances for each of the labels 0, 1, 2, and 3.
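The even split across labels is easy to verify after loading. The sketch below assumes the labels have already been read into a list of integers; the function names `label_counts` and `is_balanced` are hypothetical, and the label list is synthetic, built only to match the stated counts.

```python
from collections import Counter


def label_counts(labels):
    """Count how many instances carry each label."""
    return Counter(int(label) for label in labels)


def is_balanced(counts, expected_labels=(0, 1, 2, 3)):
    """Return True if every expected label occurs the same non-zero number of times."""
    values = [counts.get(label, 0) for label in expected_labels]
    return values[0] > 0 and len(set(values)) == 1


# Synthetic labels matching the stated distribution (30,000 per topic).
labels = [0, 1, 2, 3] * 30_000
counts = label_counts(labels)
balanced = is_balanced(counts)
```

A balanced label distribution means plain accuracy is a reasonable first metric, since a classifier cannot score well by favouring one majority class.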
Usage
This dataset is ideally suited for various applications, including:
- Developing and evaluating text classification models.
- Conducting Natural Language Processing (NLP) tasks, particularly for news content.
- Benchmarking the performance of machine learning algorithms in categorising textual data.
- Building systems for automated news categorisation and content filtering.
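As an illustration of the first use case, here is a minimal sketch of a multinomial Naive Bayes baseline written with the standard library only. It is one possible starting point rather than a recommended method for this dataset. The class name `NaiveBayes`, the whitespace tokenizer, and the four training headlines are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training headlines; labels follow the dataset's numbering
# (1 = Sports, 2 = Business).
train_texts = [
    "team wins cup final",
    "striker scores late goal",
    "shares fall as profits drop",
    "bank raises interest rates",
]
train_labels = [1, 1, 2, 2]

model = NaiveBayes().fit(train_texts, train_labels)
sports_guess = model.predict("late goal wins final")
business_guess = model.predict("bank profits fall")
```

On the full dataset, the same class could be fitted on a training portion of the `text` and `label` columns and scored on a held-out portion to obtain a baseline accuracy.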
Coverage
The dataset's coverage is global. Its headlines are drawn from AG's news corpus, which comprises over 1 million news articles gathered from more than 2000 news sources over a period of more than one year, beginning in July 2004. There are no specific notes on data availability for certain groups or years beyond this general collection period.
License
CC BY-SA
Who Can Use It
This dataset is beneficial for:
- Machine learning engineers and data scientists working on text analytics and classification problems.
- Researchers in the fields of NLP, AI, and information retrieval.
- Developers creating applications that require automated news sorting or content recommendation.
- Academic institutions for educational purposes and research projects involving textual data.
Dataset Name Suggestions
- AG News Topic Classification Dataset
- Headline News Categorisation
- Four-Topic News Dataset
- NLP News Classifier Data
- Global News Headlines Dataset
Attributes
Original Data Source: News Topic Classification