Academic News Search Engine Classification Data
News & Media Articles
Free
About
This collection is a foundational benchmark for multiclass text classification. It is derived from a parent corpus of more than one million news articles, gathered from more than 2,000 global outlets by ComeToMyHead, an academic news search engine that has been operational since July 2004. Researchers refined the data specifically to evaluate deep learning architectures, such as character-level convolutional networks, and it has since become a staple in the natural language processing community.
Columns
- Headline of News Article: The specific title or lead sentence of the news story extracted from the academic search engine.
- News Article Topic Number: A categorical integer label (0, 1, 2, or 3) indicating the thematic classification of the headline.
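As a minimal sketch of how these two columns might be read, the snippet below parses CSV rows into typed records using only the Python standard library. The header names `headline` and `label` are assumptions made for illustration; the listing describes the columns but does not state their exact names in the files, so adjust the parameters to match the actual header row.

```python
import csv
import io
from typing import NamedTuple

# Labels documented for this dataset: 0, 1, 2, or 3.
VALID_LABELS = {0, 1, 2, 3}


class Record(NamedTuple):
    headline: str
    label: int


def read_records(handle, text_col="headline", label_col="label"):
    """Parse CSV rows into Record tuples.

    text_col and label_col are hypothetical header names; pass the
    real ones from the file you downloaded.
    """
    records = []
    for row_no, row in enumerate(csv.DictReader(handle), start=1):
        headline = (row.get(text_col) or "").strip()
        label = int(row[label_col])
        # Reject empty headlines and labels outside the documented set.
        if not headline or label not in VALID_LABELS:
            raise ValueError(f"invalid data row {row_no}: {row!r}")
        records.append(Record(headline, label))
    return records


# In-memory stand-in for an opened CSV file; the rows are invented examples.
sample = io.StringIO(
    "headline,label\n"
    '"Stocks rally as oil prices ease",2\n'
    '"Champions retain title after extra time",1\n'
)
records = read_records(sample)
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where the path and encoding are again assumptions.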
Distribution
The data is delivered in structured CSV format, with separate files for training and testing, such as test_data.csv. The records are reported as 100% valid, with no mismatched or missing entries. While the parent corpus contains over a million articles, these curated subsets are designed for high-speed model benchmarking and validation. The collection is intended to be updated annually.
Usage
This resource is ideal for training and testing NLP models to automatically sort text into thematic categories. It is frequently utilised as a standard for benchmarking text classification algorithms and exploring model explainability. Developers can also use the data to build and refine automated news aggregators or recommendation engines that rely on accurate topic identification.
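To make the classification use case concrete, here is a rough baseline sketch: a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is not the method used by the original researchers, who evaluated character-level convolutional networks; the class name `NaiveBayes`, the tokeniser, and the toy headlines are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {
            label: sum(counts.values())
            for label, counts in self.word_counts.items()
        }
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.class_counts[label] / n_docs)
            denom = self.total_words.get(label, 0) + vocab_size
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy headlines, two per category (0-3); not taken from the dataset.
train_texts = [
    "Leaders meet for peace talks",
    "Election results spark protests in capital",
    "Striker scores twice in cup final",
    "Team wins championship after penalty shootout",
    "Shares fall as profits slump",
    "Bank raises interest rates amid inflation",
    "New chip doubles processor speed",
    "Scientists launch satellite to study climate",
]
train_labels = [0, 0, 1, 1, 2, 2, 3, 3]

model = NaiveBayes().fit(train_texts, train_labels)
sports_guess = model.predict("Striker scores late penalty in final")
business_guess = model.predict("Bank profits slump as rates rise")
```

A baseline like this is mainly useful as a sanity check: a more elaborate model trained on the real training file should comfortably beat it on the test file.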
Coverage
The geographic scope is international, drawing from thousands of diverse news sources gathered during more than a year of activity. The content is classified into four distinct categories: World (0), Sports (1), Business (2), and Science/Technology (3). The records capture a specific historical window of global news starting from July 2004.
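The category mapping above can be written down as a small lookup table. The sketch below also counts how many examples fall into each category, which is a quick way to check class balance; the function name `label_distribution` and the example labels are illustrative only.

```python
from collections import Counter

# Integer label to category name, as listed in the Coverage section.
LABEL_NAMES = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Science/Technology",
}


def label_distribution(labels):
    """Count examples per category name, rejecting undocumented labels."""
    counts = Counter(labels)
    unknown = set(counts) - set(LABEL_NAMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {LABEL_NAMES[k]: counts.get(k, 0) for k in sorted(LABEL_NAMES)}


# Invented labels for illustration; real counts come from the CSV files.
example_labels = [0, 2, 2, 3, 1, 2]
distribution = label_distribution(example_labels)
```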
License
CC BY-SA 4.0
Who Can Use It
Data scientists can leverage the labelled headlines to train robust classification models. Academic researchers may utilise the dataset to provide a standardised comparison for new neural network architectures. Furthermore, software engineers developing content filtering tools can use the data to validate the accuracy of their topic detection systems.
Dataset Name Suggestions
- AG News Topic Classification Benchmark
- Four-Category Global News Headline Corpus
- Academic News Search Engine Classification Data
- NIPS 2015 Text Classification Training Set
- Multiclass News Article Headline Registry
Attributes
Original Data Source: Academic News Search Engine Classification Data
