German News Article Dataset (GNAD)
Data Science and Analytics
Free
About
The 10kGNAD dataset is built for German topic classification and helps address the scarcity of non-English text classification datasets. It comprises 10,273 German-language news articles from an Austrian online newspaper [1]. The articles are a previously unused portion of the One Million Posts Corpus [1]. Each article belongs to one of nine topics, and its class label is the second segment of the original topic path [1]. Author names have been removed so that classifiers cannot simply key on them, and each article's title and body are concatenated into a single text [1]. The dataset is particularly useful for evaluating classifiers on German, whose grammar differs from English in ways that affect text classification, notably heavier inflection and frequent long compound words [1].
Columns
- Kategorie (Category): The class label of each news article, one of the nine predefined topics. Labels are taken from the second segment of the original topic path in the One Million Posts Corpus [1]. The train set includes this column [2].
- Text: The article's title and full text, concatenated into a single string [1]. The train set includes this column [2].
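The two columns above can be read with Python's standard `csv` module. The sketch below is illustrative only: it assumes a comma-separated file with a header row naming `Kategorie` and `Text`, which the listing does not confirm, and the two sample rows are made-up placeholders rather than real articles.

```python
import csv
import io


def load_articles(handle):
    """Read (label, text) pairs from a CSV handle.

    Assumption: the file has a header row with 'Kategorie' and 'Text'.
    """
    reader = csv.DictReader(handle)
    return [(row["Kategorie"], row["Text"]) for row in reader]


# Placeholder rows standing in for the real file (not actual dataset content).
sample = io.StringIO(
    "Kategorie,Text\n"
    'Web,"Beispieltitel Beispieltext zum Thema Web"\n'
    'Kultur,"Beispieltitel Beispieltext zum Thema Kultur"\n'
)

articles = load_articles(sample)
for label, text in articles:
    print(label, "|", text)
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is whatever name the downloaded file has; if the file uses a different delimiter, pass `delimiter=` to `csv.DictReader`.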
Distribution
The 10,273 articles are spread unevenly across the nine topics; as in many real-world datasets, the classes are not balanced [1]. "Web" is the largest class with 1,678 articles, while "Kultur" is the smallest with 539 [1]. Article length also varies by class: "Web" articles have the fewest words on average, whereas "Kultur" articles have the second-most [1].
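To put the imbalance in numbers, the short sketch below works only from the figures quoted above (10,273 articles, 1,678 in "Web", 539 in "Kultur"); the counts of the other seven classes are not given here and are not guessed. The helper `label_counts` is a hypothetical name for counting labels in a loaded copy of the data.

```python
from collections import Counter

TOTAL_ARTICLES = 10_273
KNOWN_COUNTS = {"Web": 1_678, "Kultur": 539}  # only the two counts stated above


def share(count, total=TOTAL_ARTICLES):
    """Fraction of the whole dataset that a class of `count` articles makes up."""
    return count / total


def label_counts(labels):
    """Count articles per label, largest class first."""
    return Counter(labels).most_common()


# Ratio between the largest and the smallest class.
imbalance_ratio = KNOWN_COUNTS["Web"] / KNOWN_COUNTS["Kultur"]

for name, count in KNOWN_COUNTS.items():
    print(f"{name}: {count} articles, {share(count):.1%} of the dataset")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

"Web" is roughly three times the size of "Kultur", so a per-class metric such as macro-averaged F1 is usually more informative on this data than plain accuracy.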
Usage
This dataset suits a range of applications in Natural Language Processing (NLP) and data science [1]. Key use cases include:
- Training and evaluating German text classifiers for topic classification [1].
- Researching the impact of grammatical differences between English and German on classifier effectiveness [1].
- Developing and testing NLP models for news categorisation in the German language [1].
- Contributing to the development of German-specific datasets for machine learning tasks [1].
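As an illustration of the first use case, here is a minimal topic classifier sketch using only the standard library: a multinomial Naive Bayes model with Laplace smoothing over lowercase word tokens. This is a generic baseline chosen for brevity, not a method prescribed by the dataset's authors, and the four training sentences are invented placeholders rather than dataset content.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase word tokens; \w is Unicode-aware, so umlauts and ß are kept.
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for tok in tokens:
                seen = self.word_counts[label][tok]
                score += math.log((seen + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder sentences, two per class.
train_texts = [
    "Neues Smartphone mit schneller Software vorgestellt",
    "Browser Update schließt Sicherheitslücke",
    "Ausstellung zeigt Gemälde im Museum",
    "Theater feiert Premiere der neuen Oper",
]
train_labels = ["Web", "Web", "Kultur", "Kultur"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("Museum eröffnet Ausstellung")
print(prediction)
```

Because the model matches whole word forms, inflected variants and long compounds count as separate tokens. That is one way the grammatical differences mentioned above can affect a classifier, and it is a reason to compare against subword or character n-gram features.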
Coverage
The dataset covers German-language news articles from an Austrian online newspaper, spanning nine topics [1]. Its linguistic and topical scope is specific to German-speaking contexts, which matters because German's grammatical structures differ from those of English [1].
License
CC-BY-NC
Who Can Use It
This dataset is primarily intended for:
- Data scientists and machine learning engineers who need German-specific data for text classification models [1].
- Researchers in natural language processing and computational linguistics focusing on non-English languages [1].
- Academics and students working on projects involving German text analysis, topic modelling, or cross-lingual NLP [1].
- Anyone requiring a free, verified dataset for German news categorisation [1].
Dataset Name Suggestions
- German News Article Dataset (GNAD)
- Austrian News Topic Classification
- 10kGNAD Corpus
- German News Articles by Topic
Attributes
Original Data Source: 10kGNAD