Amharic News Text Classification Corpus

Category: Social Media and Posts

Tags and Keywords: News, Text, Classification, Amharic, LSTM

About

A collection of Amharic news articles written in the Fidel script, structured as a CSV file containing the article text and the corresponding news category. The dataset also includes a standalone list of Amharic stop words, which a native Amharic speaker curated, edited, and expanded using various academic sources. The dataset builds on research published in 2021 on Amharic news text classification, with later additions that enlarge the corpus. It is primarily designed for text classification and linguistic analysis of Amharic.

Columns

  • article: The textual content of the news story. This column has 60,678 unique news entries.
  • category: The classification assigned to the article, such as 'Sport' or 'Politics'. There are 7 unique classes available for classification purposes.
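As a minimal sketch of how the two columns might be read, the snippet below uses only the Python standard library. It assumes the CSV header uses exactly the names `article` and `category` and that the file is UTF-8 encoded; neither is confirmed by the listing, and the function names are illustrative.

```python
import csv
from collections import Counter


def load_corpus(path, encoding="utf-8"):
    """Return a list of (article, category) pairs read from the CSV at `path`.

    Assumption: the header row contains the columns `article` and `category`.
    Rows with an empty article or category are skipped.
    """
    # News articles can be long, so raise the per-field size limit.
    csv.field_size_limit(10_000_000)
    rows = []
    with open(path, newline="", encoding=encoding) as handle:
        for record in csv.DictReader(handle):
            article = (record.get("article") or "").strip()
            category = (record.get("category") or "").strip()
            if article and category:
                rows.append((article, category))
    return rows


def category_counts(rows):
    """Count how many articles fall into each category."""
    return Counter(category for _, category in rows)
```

Calling `load_corpus("Amharic_corpus_merged_2023-04-16.csv")` and then `category_counts(rows).most_common()` would give a quick view of the class balance before any modelling.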

Distribution

The primary data file, Amharic_corpus_merged_2023-04-16.csv, is a standard CSV file of about 260.01 MB containing roughly 61,900 valid records. A separate stop-word list of 714 unique tokens is also included; it is a general-purpose list for Amharic and is not tailored to this news corpus. Updates are expected annually.
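One way the stop-word list could be applied is sketched below. The listing does not give the stop-word file's name or layout, so the sketch assumes one token per line; the tokenizer is a simple illustrative choice (splitting on whitespace, Ethiopic punctuation, and basic ASCII punctuation), not a method described by the dataset authors.

```python
import re

# Split on whitespace, Ethiopic punctuation (U+1360..U+1368, which includes
# the word space "፡" and the full stop "።"), and basic ASCII punctuation.
_SEPARATORS = re.compile(r"[\s\u1360-\u1368.,;:!?()]+")


def tokenize(text):
    """Split Amharic text into tokens, dropping empty strings."""
    return [token for token in _SEPARATORS.split(text) if token]


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines.

    Assumption: one stop word per line (e.g. an open text file).
    """
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Tokenize `text` and drop every token found in `stop_words`."""
    return [token for token in tokenize(text) if token not in stop_words]
```

Because the list is general rather than corpus-specific, it may be worth checking which of its tokens actually occur in the articles before relying on it.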

Usage

This dataset is suited to developing and testing machine learning models for natural language processing (NLP), particularly news text classification, including neural sequence models such as LSTM networks. It also supports research on low-resource language processing and the development of Amharic text analysis systems.
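Sequence models such as LSTMs usually take fixed-length integer sequences as input. The sketch below shows one common way to prepare them from tokenized articles: build a frequency-ranked vocabulary, then map tokens to IDs with truncation and padding. The vocabulary size, the reserved IDs, and the function names are illustrative assumptions, and the model itself (which would need a deep learning framework) is not shown.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding positions
UNK_ID = 1  # reserved for tokens missing from the vocabulary


def build_vocab(token_lists, max_size=20000, min_count=1):
    """Map the most frequent tokens to integer IDs starting at 2."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocab = {}
    for token, count in counts.most_common(max_size):
        if count < min_count:
            break
        vocab[token] = len(vocab) + 2
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to IDs, truncating or right-padding to `max_len`."""
    ids = [vocab.get(token, UNK_ID) for token in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

For example, `encode(article.split(), vocab, max_len=200)` would yield a list of 200 integers per article; the value 200 is arbitrary here and would normally be chosen from the article-length distribution.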

Coverage

The data covers Amharic news texts classified into seven topics, six of which are itemised here with their text counts from the original sources:

  • Local News (ሀገር አቀፍ ዜና): 20,674
  • Sport (ስፖርት): 10,411
  • Politics (ፖለቲካ): 9,325
  • World News (ዓለም አቀፍ ዜና): 6,543
  • Business (ቢዝነስ): 3,894
  • Entertainment (መዝናኛ): 635

Our own additions comprise 5,276 Business texts and 5,156 Politics texts. The collection was last updated in April 2023.
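The counts above can be tallied to check the overall corpus size and to gauge class imbalance. The sketch below assumes the additions are simply added to the original Business and Politics counts; under that assumption the total comes to 61,914, consistent with the roughly 61.9 thousand records reported. The inverse-frequency weighting is one common heuristic for imbalanced classes, not something prescribed by the dataset.

```python
from collections import Counter

# Counts as stated in the listing.
ORIGINAL = {
    "Local News": 20674,
    "Sport": 10411,
    "Politics": 9325,
    "World News": 6543,
    "Business": 3894,
    "Entertainment": 635,
}
ADDED = {"Business": 5276, "Politics": 5156}


def merged_counts(original, added):
    """Sum per-category counts (assumes additions extend existing classes)."""
    totals = Counter(original)
    totals.update(added)
    return totals


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


totals = merged_counts(ORIGINAL, ADDED)
weights = inverse_frequency_weights(totals)
print(sum(totals.values()))  # 61914
```

With these figures Entertainment receives by far the largest weight, which reflects how small that class is relative to Local News.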

License

CC0: Public Domain

Who Can Use It

Intended users include data scientists focused on classification models, machine learning researchers studying under-resourced languages, academic linguists interested in Amharic morphology and syntax, and developers creating region-specific NLP applications.

Dataset Name Suggestions

  • Amharic News Text Classification Corpus
  • Fidel Script News Articles
  • Amharic Corpus with Stop Words
  • Ethiopian News Classification Data

Attributes

  • Listed: 15/11/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
  • Price: Free
  • Download format: CSV