Ethiopian Media Classification Data
Data Science and Analytics
Free
About
This dataset contains classified news articles written in Amharic and is intended as a resource for natural language processing research and machine learning model development. It includes the raw article text, headlines, classification categories, and metadata, and has previously been used to establish a baseline for Amharic news text classification models.
Columns
The data includes six primary columns:
- headline: The main title or headline corresponding to the news piece (containing over 50,000 unique titles).
- category: The predefined classification label assigned to the article. There are six unique categories, with ‘ሀገር አቀፍ ዜና’ representing the most frequent category.
- date: The publication date of the news item, spanning approximately ten years.
- views: The recorded view count for the article (mean of 778 views, with high variance and some missing values).
- article: The full body text of the news story itself.
- link: A URL pointing back to the original source location of the news article.
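As a minimal sketch of how these columns might be checked after download, the following uses only the Python standard library. It assumes the header row uses the six column names exactly as listed above and that the file is UTF-8 encoded; neither is confirmed here. The helper names `parse_views` and `summarise` are hypothetical.

```python
import csv
from collections import Counter

# Assumption: the CSV header matches the column names listed above exactly.
EXPECTED_COLUMNS = ["headline", "category", "date", "views", "article", "link"]

# Article bodies can be long, so raise the per-field limit above the csv default.
csv.field_size_limit(10_000_000)


def parse_views(raw):
    """Return the view count as an int, or None when missing or non-numeric."""
    text = (raw or "").strip().replace(",", "")
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        return None


def summarise(path, encoding="utf-8"):
    """Check the header, then return (category counts, mean views, rows missing views)."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        categories = Counter()
        views = []
        missing_views = 0
        for row in reader:
            categories[row["category"]] += 1
            count = parse_views(row["views"])
            if count is None:
                missing_views += 1
            else:
                views.append(count)
    mean_views = sum(views) / len(views) if views else None
    return categories, mean_views, missing_views
```

Calling `summarise("Amharic News Dataset.csv")` on the extracted file should return the per-category counts alongside the mean view count, which can then be compared with the figures quoted above.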
Distribution
The collection is supplied as a large CSV file named Amharic News Dataset.csv, which is approximately 191 MB in size. The file contains approximately 51,500 valid records or rows.
Usage
This collection is suited to developing, training, and testing text classification algorithms for low-resource languages. It can also be used to create machine learning benchmarks, conduct topic modeling, or run sentiment analysis experiments on Amharic-language media content. The compressed data file must be extracted before running any code.
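Because one category is noticeably more frequent than the others, a train/test split that preserves the category proportions may be a sensible starting point for classification experiments. The sketch below is one possible approach using only the standard library, not a prescribed method for this dataset. The function name `stratified_split` and its parameters are hypothetical, and rows are assumed to be dictionaries keyed by column name (for example, as produced by `csv.DictReader`).

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="category", test_fraction=0.2, seed=0):
    """Split rows into (train, test) so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```

Splitting per category rather than over the whole file keeps rare categories represented in the held-out set, which matters when reporting per-class metrics.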
Coverage
The data covers news articles published from 31 July 2011 to 23 January 2021. Content is drawn from Amharic news sources, and the classification categories include broad topics such as Politics and Government.
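The stated coverage window can be used as a sanity check on the date column. The following is a minimal sketch: the on-disk date format is not documented here, so `date_format` is a caller-supplied assumption, and the helper name `in_coverage` is hypothetical.

```python
from datetime import date, datetime

# Coverage window as stated above.
COVERAGE_START = date(2011, 7, 31)
COVERAGE_END = date(2021, 1, 23)


def in_coverage(raw_date, date_format):
    """Return True or False for a parseable date, None when it cannot be parsed."""
    try:
        parsed = datetime.strptime(raw_date.strip(), date_format).date()
    except (ValueError, AttributeError):
        # Unparseable or missing value; the caller decides how to treat it.
        return None
    return COVERAGE_START <= parsed <= COVERAGE_END
```

For instance, if the dates turned out to be stored as ISO strings (an assumption, not something confirmed by this listing), `in_coverage(row["date"], "%Y-%m-%d")` would flag rows that fall outside the documented range.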
License
CC BY-NC-SA 4.0
Who Can Use It
- Researchers and students specialising in NLP for African languages.
- Data scientists aiming to develop highly accurate classification and topic modeling systems.
- Academics seeking to analyse historical media trends and discourse in Ethiopia or Amharic-speaking regions.
Dataset Name Suggestions
- Amharic News Text Classification Corpus
- Ethiopian Media Classification Data
- Amharic NLP Baseline Dataset
Attributes
Original Data Source: Ethiopian Media Classification Data
