Amharic News Text Classification Corpus
Free
About
A collection of Amharic news articles written in the Fidel script, stored in a CSV file that pairs each article's text with its news category. The dataset also includes a standalone list of Amharic stop words, which has been curated, edited, and expanded by a native Amharic speaker using various academic sources. The dataset builds on research published in 2021 on Amharic news text classification, with later additions made to enlarge the corpus. This resource is primarily designed for text classification and linguistic analysis of Amharic.
Columns
- article: The textual content of the news story. This column has 60,678 unique news entries.
- category: The classification assigned to the article, such as 'Sport' or 'Politics'. There are 7 unique classes available for classification purposes.
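The two-column layout above can be inspected with a short script. The following is a minimal sketch using only the Python standard library; it assumes the header row uses exactly the names `article` and `category` and that the file is UTF-8 encoded, neither of which is confirmed here. The in-memory sample rows are placeholders, not real corpus entries.

```python
import csv
import io
from collections import Counter


def summarize_corpus(handle):
    """Count rows per category and distinct article texts from an open CSV handle."""
    reader = csv.DictReader(handle)
    category_counts = Counter()
    unique_articles = set()
    for row in reader:
        category_counts[row["category"]] += 1
        unique_articles.add(row["article"])
    return category_counts, len(unique_articles)


# Tiny in-memory stand-in for the real file (placeholder rows, not real data).
sample = io.StringIO(
    "article,category\n"
    "ዜና አንድ,ስፖርት\n"
    "ዜና ሁለት,ፖለቲካ\n"
    "ዜና አንድ,ስፖርት\n"
)
counts, n_unique = summarize_corpus(sample)
print(counts.most_common())
print(n_unique)
```

For the real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to the downloaded CSV.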
Distribution
The primary data file is named Amharic_corpus_merged_2023-04-16.csv and is approximately 260.01 MB in size. The dataset includes 61.9 thousand valid records. The file format is standard CSV. In addition to the corpus, a stop words list containing 714 unique tokens is included. Note that this stop word list is general for the Amharic language and is not tailored specifically to this particular news text corpus. The dataset is expected to receive updates on an annual basis.
Usage
This dataset is suitable for developing and testing machine learning models for natural language processing (NLP), particularly news text classification, including neural approaches such as LSTM networks. It is well suited to research on low-resource language processing and to building Amharic text analysis systems.
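As one possible preprocessing step before training a classifier, the stop word list can be used to filter tokens out of each article. The sketch below is an assumption-laden illustration: it tokenizes on whitespace, treats the Ethiopic punctuation block (U+1361 to U+1368) as separators, and uses a two-word placeholder set rather than the 714-token list shipped with the dataset. The function name `remove_stop_words` is hypothetical.

```python
# Ethiopic punctuation marks (U+1361..U+1368), mapped to spaces before splitting.
ETHIOPIC_PUNCT = "".join(chr(code) for code in range(0x1361, 0x1369))
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in ETHIOPIC_PUNCT})


def remove_stop_words(text, stop_words):
    """Split text on whitespace and drop tokens found in the stop word set."""
    cleaned = text.translate(PUNCT_TO_SPACE)
    return [token for token in cleaned.split() if token not in stop_words]


# Placeholder set for illustration; not taken from the shipped stop word list.
stop_words = {"እና", "ነው"}
tokens = remove_stop_words("ቡና እና ሻይ ነው።", stop_words)
print(tokens)
```

Whitespace tokenization is a simplification. Because the stop word list is general-purpose rather than tuned to this corpus, it may be worth checking its effect on classification accuracy before applying it by default.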
Coverage
The data covers Amharic news texts classified across seven distinct topics. The categories and their text counts from the original sources include:
- Local News (ሀገር አቀፍ ዜና): 20,674
- Sport (ስፖርት): 10,411
- Politics (ፖለቲካ): 9,325
- World News (ዓለም አቀፍ ዜና): 6,543
- Business (ቢዝነስ): 3,894
- Entertainment (መዝናኛ): 635
Our own additions include 5,276 Business texts and 5,156 Politics texts. The collection was last updated in April 2023.
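The counts given in this section can be cross-checked against the record total stated under Distribution. The following sketch simply adds the original counts to the additions. The English labels are used as dictionary keys for readability and may not match the label strings stored in the CSV.

```python
from collections import Counter

# Counts from the original sources, as listed in this section.
original = Counter({
    "Local News": 20674,
    "Sport": 10411,
    "Politics": 9325,
    "World News": 6543,
    "Business": 3894,
    "Entertainment": 635,
})
# Texts added on top of the original sources.
added = Counter({"Business": 5276, "Politics": 5156})

combined = original + added
total = sum(combined.values())

print(combined["Business"], combined["Politics"])
print(total)
```

The combined total is 61,914, which agrees with the "61.9 thousand valid records" figure. The per-class totals also show a strong imbalance (Entertainment has 635 texts against 20,674 for Local News), which is worth accounting for when splitting data or choosing evaluation metrics.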
License
CC0: Public Domain
Who Can Use It
Intended users include data scientists focused on classification models, machine learning researchers studying under-resourced languages, academic linguists interested in Amharic morphology and syntax, and developers creating region-specific NLP applications.
Dataset Name Suggestions
- Amharic News Text Classification Corpus
- Fidel Script News Articles
- Amharic Corpus with Stop Words
- Ethiopian News Classification Data
Attributes
Original Data Source: Amharic News Text Classification Corpus
