Opendatabay APP

Academic News Search Engine Classification Data

News & Media Articles

Tags and Keywords

News

Classification

NLP

Headline

Benchmark

Free

About

This collection is drawn from a corpus of over one million news articles gathered from more than 2,000 global outlets, and it serves as a foundational benchmark for multiclass text classification. The corpus originated from ComeToMyHead, an academic news search engine that has been operational since July 2004 and has accumulated a large volume of categorised headlines. Researchers refined the data specifically to evaluate deep learning architectures such as character-level convolutional networks, which has made it a staple in the natural language processing community.

Columns

  • Headline of News Article: The specific title or lead sentence of the news story extracted from the academic search engine.
  • News Article Topic Number: A categorical integer label (0, 1, 2, or 3) indicating the thematic classification of the headline.
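The two columns above can be read with Python's standard library. This is a minimal sketch: the header names `headline` and `label` are assumptions, since the listing describes the columns without giving their exact names in the file, and the sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical header names; check the first line of the actual CSV file.
HEADLINE_COL = "headline"
LABEL_COL = "label"


@dataclass
class Record:
    headline: str
    label: int


def read_records(handle):
    """Parse rows from an open CSV file object into Record instances."""
    reader = csv.DictReader(handle)
    return [Record(row[HEADLINE_COL].strip(), int(row[LABEL_COL])) for row in reader]


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "headline,label\n"
    '"Markets rally, oil slips",2\n'
    "Team wins championship final,1\n"
)
records = read_records(sample)
print(records[0])
```

With a downloaded copy, an open file handle such as `open("test_data.csv", newline="", encoding="utf-8")` would replace the in-memory sample; the UTF-8 encoding is also an assumption.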

Distribution

The data is delivered as structured CSV files, with separate files for training and testing (for example, test_data.csv). The listing reports 100% validity, with no mismatched or missing entries in the records. While the parent corpus contains over a million articles, these curated subsets are designed for fast model benchmarking and validation. The collection is intended to be updated annually.
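The validity claim can be checked on a downloaded copy with a few lines of Python. The sketch below assumes one header row followed by two columns per row (headline, then label); that layout is an assumption, and the sample rows are invented to show what a failing row looks like.

```python
import csv
import io

VALID_LABELS = {0, 1, 2, 3}  # the four topic numbers given in the listing


def parse_label(text):
    """Return the label as an int, or None if it is not one of 0-3."""
    try:
        value = int(text)
    except ValueError:
        return None
    return value if value in VALID_LABELS else None


def find_invalid_rows(handle):
    """Return (row_number, reason) pairs for rows that fail basic checks.

    Row numbers count the header as row 1.
    """
    problems = []
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    for row_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append((row_no, "expected 2 columns"))
        elif not row[0].strip():
            problems.append((row_no, "empty headline"))
        elif parse_label(row[1]) is None:
            problems.append((row_no, "label outside 0-3"))
    return problems


# Invented sample with two deliberately broken rows.
sample = io.StringIO(
    "headline,label\n"
    "Team wins final,1\n"
    ",2\n"
    "New chip announced,7\n"
)
problems = find_invalid_rows(sample)
print(problems)
```

An empty result on the real files would be consistent with the reported 100% validity.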

Usage

This resource is ideal for training and testing NLP models to automatically sort text into thematic categories. It is frequently utilised as a standard for benchmarking text classification algorithms and exploring model explainability. Developers can also use the data to build and refine automated news aggregators or recommendation engines that rely on accurate topic identification.
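As an illustration of the sorting task, here is a small bag-of-words multinomial naive Bayes classifier written with the standard library only. It is a deliberately simple baseline, not the character-level convolutional approach the data was refined for, and the four training headlines are made up rather than taken from the dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and extract runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, headlines, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(headlines, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.class_counts
        }
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / n_docs)  # log prior
            denom = self.total_words[label] + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training headlines, one per topic number (0-3).
train = [
    ("Government signs peace treaty", 0),
    ("Team wins championship final", 1),
    ("Stocks fall as oil prices rise", 2),
    ("New chip speeds up computers", 3),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("Oil prices push stocks lower"))
```

A model like this is mainly useful as a floor against which stronger architectures can be compared on the same training and testing files.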

Coverage

The geographic scope is international, drawing from thousands of diverse news sources gathered during more than a year of activity. The content is classified into four distinct categories: World (0), Sports (1), Business (2), and Science/Technology (3). The records capture a specific historical window of global news starting from July 2004.
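The four-category scheme maps directly onto a lookup table. The sketch below uses the label numbers and names stated above and adds a hypothetical helper, `topic_distribution`, for checking how evenly the classes are represented; the label list passed to it is invented.

```python
from collections import Counter

# Label ids and names as given in the listing.
TOPIC_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Science/Technology"}


def topic_distribution(labels):
    """Return {topic name: share of rows} for a sequence of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {TOPIC_NAMES[k]: counts[k] / total for k in sorted(counts)}


# Invented labels for illustration only.
distribution = topic_distribution([0, 1, 1, 2, 3, 3, 3, 3])
print(distribution)
```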

License

CC BY-SA 4.0

Who Can Use It

Data scientists can leverage the labelled headlines to train robust classification models. Academic researchers may utilise the dataset to provide a standardised comparison for new neural network architectures. Furthermore, software engineers developing content filtering tools can use the data to validate the accuracy of their topic detection systems.
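For validating a topic detection system against the labelled headlines, overall accuracy and per-class recall are a reasonable starting point. This is a minimal sketch with a hypothetical `evaluate` helper and invented label lists; it expects non-empty inputs.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return overall accuracy and per-class recall for paired label lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    accuracy = sum(hits.values()) / len(true_labels)
    recall = {label: hits[label] / totals[label] for label in sorted(totals)}
    return accuracy, recall


# Invented example: one Science/Technology headline misread as Business.
accuracy, recall = evaluate([0, 1, 2, 3, 3], [0, 1, 2, 3, 2])
print(accuracy, recall)
```

Per-class recall shows which of the four topics a system confuses most, something a single accuracy figure can hide.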

Dataset Name Suggestions

  • AG News Topic Classification Benchmark
  • Four-Category Global News Headline Corpus
  • Academic News Search Engine Classification Data
  • NIPS 2015 Text Classification Training Set
  • Multiclass News Article Headline Registry

Attributes

Listing Stats

VIEWS

5

DOWNLOADS

0

LISTED

29/12/2025

REGION

GLOBAL

QUALITY (UNIVERSAL DATA QUALITY SCORE, UDQS)

5 / 5

VERSION

1.0
