German News Article Dataset (GNAD)

Data Science and Analytics

Tags and Keywords

News, Classification, NLP, People and Society, German

About

The 10kGNAD dataset is designed for German topic classification and helps address the scarcity of non-English text classification datasets. It comprises 10,273 German-language news articles from an Austrian online newspaper [1]. The articles are a previously unused portion of the One Million Posts Corpus [1]. Each article is assigned to one of nine topics, with the class label taken from the second part of the original topic path [1]. Author names have been removed so that classifiers cannot rely on them as keywords, and each article's title and text are concatenated into a single text field [1]. The dataset is particularly useful for evaluating classifiers on German, whose grammar differs from English in ways such as heavier inflection and frequent long compound words [1].

Columns

Kategorie (Category): This column contains the class label for each news article, representing one of the nine predefined topics. These labels are derived from the second segment of the original topic path within the One Million Posts Corpus [1]. The train set includes a 'Kategorie' column [2].
Text: This column holds the concatenated content of the news article's title and its full text [1]. The train set includes a 'Text' column [2].
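
As a minimal sketch of how the two columns might be read, the snippet below parses CSV data with a header row containing Kategorie and Text. The comma delimiter, the made-up sample rows, and the helper name read_articles are assumptions for illustration only; the downloaded file's actual name, delimiter, and quoting should be checked before use.

```python
import csv
import io


def read_articles(handle, delimiter=","):
    """Return a list of (label, text) pairs from an open CSV handle.

    Assumes a header row with 'Kategorie' and 'Text' columns, as described
    in the Columns section. The delimiter is an assumption.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = {"Kategorie", "Text"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return [(row["Kategorie"], row["Text"]) for row in reader]


# Made-up rows standing in for the real file (not taken from the dataset).
sample = io.StringIO(
    "Kategorie,Text\n"
    "Web,Neues Smartphone vorgestellt\n"
    "Kultur,Premiere im Theater gefeiert\n"
)
articles = read_articles(sample)
```

For a file on disk, the same helper could be called with a handle opened using encoding="utf-8" and newline="", which is the usual way to open files for the csv module.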

Distribution

The dataset consists of 10,273 German-language news articles divided into nine topics [1]. Like many real-world datasets, 10kGNAD has an unbalanced class distribution [1]. For instance, "Web" is the largest class with 1,678 articles, while "Kultur" is the smallest with 539 articles [1]. Articles in the "Web" class have the fewest words on average, whereas "Kultur" articles have the second-most [1].
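
The figures above imply that the largest class is roughly three times the size of the smallest (1,678 / 539 ≈ 3.1). A sketch of how per-class article counts and average word counts might be recomputed is shown below. The helper name class_stats and the toy rows are hypothetical, and whitespace splitting is only a rough stand-in for whatever word count the original authors used.

```python
from collections import defaultdict


def class_stats(articles):
    """Map each label to (article count, mean whitespace-separated words)."""
    counts = defaultdict(int)
    words = defaultdict(int)
    for label, text in articles:
        counts[label] += 1
        words[label] += len(text.split())
    return {label: (counts[label], words[label] / counts[label]) for label in counts}


# Toy rows for illustration only; these are not real dataset entries.
toy = [
    ("Web", "Kurzer Text"),
    ("Web", "Noch ein kurzer Text"),
    ("Kultur", "Ein etwas längerer Text über Theater"),
]
stats = class_stats(toy)
```

Because the classes are unbalanced, per-class metrics are usually more informative than overall accuracy when evaluating a classifier on this data.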

Usage

This dataset suits several applications in Natural Language Processing (NLP) and data science [1]. Key use cases include:

Training and evaluating German text classifiers for topic classification [1].
Researching the impact of grammatical differences between English and German on classifier effectiveness [1].
Developing and testing NLP models for news categorisation in the German language [1].
Contributing to the development of German-specific datasets for machine learning tasks [1].
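
To make the first use case concrete, here is a small baseline sketch: a multinomial Naive Bayes classifier over character trigrams, written with the standard library only. This is an illustrative choice, not the method used by the dataset's authors; character n-grams are one common way to cope with inflection and long compounds. The class name, the helper name, and the training sentences are all made up for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Lowercase the text, pad with spaces, and return its character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TrigramNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative baseline)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences; not taken from the dataset.
train_texts = [
    "Neues Smartphone mit schneller Software vorgestellt",
    "Software-Update für Smartphones verfügbar",
    "Theaterpremiere im Burgtheater gefeiert",
    "Neue Oper am Theater begeistert Publikum",
]
train_labels = ["Web", "Web", "Kultur", "Kultur"]

model = TrigramNaiveBayes().fit(train_texts, train_labels)
web_guess = model.predict("Smartphone-Software")
kultur_guess = model.predict("Theater Oper")
```

On the real data, the Text column would supply the training texts and Kategorie the labels, with a held-out split used for evaluation.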

Coverage

The dataset covers German-language news articles from a single Austrian online newspaper, spread across nine topics [1]. Its linguistic and topical scope is specific to German-speaking contexts, which matters because German inflection and compounding differ from English [1].

License

CC-BY-NC

Who Can Use It

This dataset is primarily intended for:

Data scientists and machine learning engineers who need German-specific data for text classification models [1].
Researchers in natural language processing and computational linguistics focusing on non-English languages [1].
Academics and students working on projects involving German text analysis, topic modelling, or cross-lingual NLP [1].
Anyone requiring a free, verified dataset for German news categorisation [1].

Dataset Name Suggestions

German News Article Dataset (GNAD)
Austrian News Topic Classification
10kGNAD Corpus
German News Articles by Topic

Attributes

Original Data Source: 10kGNAD

Listing Stats

Listed: 27/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
