Multi-Class Arabic Text Dataset
Data Science and Analytics
Free
About
This dataset contains Arabic texts designed for classification tasks. It captures modern Arabic as it appears in newspaper articles, including alphabetic, numeric, and symbolic words. Its structure supports evaluating the efficiency and robustness of Arabic text classification and document indexing systems.
Columns
While specific column names are not explicitly provided, a typical structure for a classification dataset like this would include:
text: The actual Arabic news article content.
category: The assigned classification label for each article (e.g., sport, politic, culture, economy, diverse).
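As a minimal sketch of how such a file might be read, the following uses Python's standard csv module. The column names text and category are the assumed ones listed above, not confirmed names, and the helper read_articles is hypothetical. An in-memory sample stands in for the real file.

```python
import csv
import io


def read_articles(file_obj, text_col="text", label_col="category"):
    """Return a list of (text, label) pairs from a CSV file object.

    The default column names are assumptions; adjust them to match
    the actual header of the downloaded file.
    """
    reader = csv.DictReader(file_obj)
    return [(row[text_col], row[label_col]) for row in reader]


# In-memory stand-in for the real file. For a file on disk, open it with
# encoding="utf-8" and newline="" so Arabic text and embedded line breaks
# are handled correctly.
sample = io.StringIO(
    "text,category\n"
    "فاز الفريق في المباراة,sport\n"
    "ارتفعت أسعار النفط,economy\n"
)
articles = read_articles(sample)
```

If the real header differs, pass the actual names through text_col and label_col rather than renaming the file's columns.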
Distribution
The dataset comprises 111,728 documents containing a total of 319,254,124 words. The documents are stored as text files and are typically available in CSV format. They are categorised into five distinct classes: sport, politic, culture, economy, and diverse, with the number of documents and words varying across these classes.
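Because the per-class counts are not listed here, it may be worth computing them after download. The sketch below tallies documents and whitespace-separated words per class. The helper class_stats is hypothetical, and splitting on whitespace is an assumption about how words were counted, so totals may not match the published figure exactly.

```python
from collections import defaultdict


def class_stats(articles):
    """Count documents and whitespace-separated words for each label.

    `articles` is an iterable of (text, label) pairs.
    """
    stats = defaultdict(lambda: {"documents": 0, "words": 0})
    for text, label in articles:
        stats[label]["documents"] += 1
        stats[label]["words"] += len(text.split())
    return dict(stats)


# Toy rows used only to show the shape of the result.
toy_stats = class_stats([
    ("فاز الفريق في المباراة", "sport"),
    ("سجل اللاعب هدفا", "sport"),
    ("ارتفعت أسعار النفط", "economy"),
])
```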
Usage
This dataset is ideal for a range of applications, including:
- Developing and testing Arabic text classification models.
- Building robust Arabic document indexing systems.
- Research into modern Arabic language processing.
- Training machine learning models for news categorisation.
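To illustrate the classification use case, here is a minimal multinomial naive Bayes baseline written with the standard library only. It is a sketch under stated assumptions, not a recommended pipeline: the function names train_nb and predict_nb are hypothetical, tokens are split on whitespace with no Arabic-specific normalisation, and the four training sentences are invented toy examples rather than rows from the dataset.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Collect per-class document and word counts from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-probability for `text`."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples, not taken from the dataset.
model = train_nb([
    ("فاز الفريق في المباراة", "sport"),
    ("سجل اللاعب هدفا في المباراة", "sport"),
    ("ارتفعت أسعار النفط في السوق", "economy"),
    ("انخفضت أسعار الأسهم في البورصة", "economy"),
])
prediction = predict_nb(model, "المباراة انتهت بفوز الفريق")
```

On the full dataset, a library implementation with proper tokenisation and a held-out test split would be a more realistic starting point; this version only shows the mechanics.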
Coverage
The dataset focuses on modern Arabic language, sourced from news articles published by three prominent Arabic online newspapers: Assabah, Hespress, and Akhbarona. The content covers five main categories: sport, politic, culture, economy, and diverse.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
This dataset is suitable for:
- NLP Practitioners: For developing and refining Arabic language models.
- Researchers: Studying text classification, natural language understanding, and Arabic linguistics.
- Beginner, Intermediate, and Advanced Data Scientists: Engaged in text mining and machine learning projects.
- Developers: Building applications that require automated categorisation of Arabic news content.
Dataset Name Suggestions
- Modern Arabic News Text Classification
- Arabic News Article Categories
- Multi-Class Arabic Text Dataset
- Arabic Newspaper Content for NLP
- Assabah Hespress Akhbarona Dataset
Attributes
Original Data Source: Multi-Class Arabic Text Dataset