Yektanet Persian Web Text Classification Dataset
Data Science and Analytics
Free
About
The Yektanet Dataset is a collection of real Persian web data, gathered and cleaned by the Yektanet platform. It is intended as an industrial case study for applying machine learning to Natural Language Processing (NLP) [1]. The task it supports is predicting the topic category of a document from its text features, such as the title, description, and full text content [1]. This makes it a useful resource for training and evaluating models for document categorisation and topic prediction [1].
Columns
Each record in the dataset describes one web document through several features [1]. The target variable is the category column, which gives the topic of the content [1]. The remaining features are:
- Description: A description of the document [1].
- Text_content: The complete text content of the document [1].
- Title: The title of the document [1].
- h1 and h2: The content found within the HTML tags h1 and h2, respectively [1].
- URL: The link address associated with the document [1].
- Domain: The domain or website from which the document originates [1].
- Id: The unique identifier for each link [1].
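A minimal sketch of how the columns listed above might be read and checked, using only the Python standard library. The exact spelling and casing of the header names in the real CSV file is an assumption, and the two sample rows are made-up placeholders rather than real records.

```python
import csv
import io

# Column names as listed above; the exact spelling and casing in the real
# CSV header is an assumption and may need adjusting.
EXPECTED_COLUMNS = {
    "Id", "URL", "Domain", "Title", "Description",
    "h1", "h2", "Text_content", "category",
}


def read_rows(handle):
    """Read CSV rows into dicts and fail early if expected columns are missing."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Tiny in-memory stand-in for the real file (made-up rows, not real data).
SAMPLE_CSV = (
    "Id,URL,Domain,Title,Description,h1,h2,Text_content,category\n"
    "1,https://example.com/a,example.com,عنوان اول,توضیح اول,تیتر اول,زیرتیتر اول,متن اول,سلامت\n"
    "2,https://example.com/b,example.com,عنوان دوم,توضیح دوم,تیتر دوم,زیرتیتر دوم,متن دوم,ورزش\n"
)

rows = read_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[0]["category"])
```

To read an actual download, the same function can be given a file opened with `encoding="utf-8"` and `newline=""`; the file name depends on how the data was saved.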
Distribution
The Yektanet dataset contains approximately 5,206 records, a figure derived from the distribution of category labels [1, 2]. Unique-value counts are reported for several columns: ID (4,786), text content (4,720), title (4,614), and description (4,399) [3]. Data files are typically provided in CSV format [4].
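Figures like the unique-value counts and category shares reported for this dataset can be recomputed from the rows themselves. The sketch below assumes each row is a dictionary keyed by column name, with a label column called `category`. The four rows are made-up placeholders, so the printed numbers do not correspond to the real data.

```python
from collections import Counter


def unique_counts(rows, columns):
    """Number of distinct values per column."""
    return {col: len({row[col] for row in rows}) for col in columns}


def category_shares(rows, label="category"):
    """Share of each label as a percentage, rounded to one decimal place."""
    counts = Counter(row[label] for row in rows)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Made-up rows for illustration only.
rows = [
    {"Id": "1", "Title": "a", "category": "سلامت"},
    {"Id": "2", "Title": "a", "category": "سلامت"},
    {"Id": "3", "Title": "b", "category": "ورزش"},
    {"Id": "3", "Title": "c", "category": "اقتصاد"},
]

print(unique_counts(rows, ["Id", "Title"]))
print(category_shares(rows))
```

Comparing the unique count of an identifier column with the total row count is a quick way to check for repeated entries before splitting the data into training and test sets.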
Usage
This dataset is suited to developing and evaluating machine learning models for document categorisation and topic prediction [1]. Typical Natural Language Processing (NLP) applications include:
- Training machine learning models to predict document topics [1].
- Developing text classification systems [1].
- Research into real-world web data analysis [1].
- Exploring feature engineering for NLP tasks [1].
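As one illustration of the topic prediction use case, the sketch below trains a small multinomial Naive Bayes classifier on the combined title, description, and text fields. It is a baseline chosen for brevity, not a method prescribed by the dataset. The field names, the whitespace tokenizer, and the four training rows are all assumptions made for the example. Real Persian text would likely need proper normalisation and tokenization.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer; real Persian text would likely need normalisation."""
    return text.split()


def combine_fields(row, fields=("Title", "Description", "Text_content")):
    """Join the text fields into one string (field names are assumed)."""
    return " ".join(row.get(f, "") or "" for f in fields)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training rows for illustration only.
train = [
    {"Title": "پزشک", "Description": "درمان بیماری",
     "Text_content": "بیمارستان و دارو", "category": "سلامت"},
    {"Title": "تغذیه", "Description": "ویتامین و دارو",
     "Text_content": "درمان سرماخوردگی", "category": "سلامت"},
    {"Title": "فوتبال", "Description": "بازی تیم ملی",
     "Text_content": "گل و مربی", "category": "ورزش"},
    {"Title": "والیبال", "Description": "تیم و مربی",
     "Text_content": "بازی لیگ", "category": "ورزش"},
]

model = NaiveBayes().fit(
    [combine_fields(r) for r in train],
    [r["category"] for r in train],
)
prediction = model.predict("مربی تیم فوتبال")
print(prediction)
```

Concatenating the fields is the simplest feature engineering choice. Weighting the title more heavily, or adding the h1 and h2 columns, are natural variations to compare on a held-out split.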
Coverage
The Yektanet dataset consists of real Persian web data [1]. Its listed region of coverage is global [5]. It spans a range of topics; the largest categories are 'سلامت' (health) at 13% and 'ورزش' (sports) at 11% [3]. No restriction to particular groups or years is stated; the data is described only as a current web collection [1].
License
CC BY
Who Can Use It
The dataset is primarily intended for researchers and practitioners in the fields of machine learning and Natural Language Processing (NLP) [1]. Ideal users include data scientists, AI/ML engineers, academics, and anyone interested in document classification, topic modelling, or working with Persian text data [1].
Dataset Name Suggestions
- Yektanet Persian Web Text Classification Dataset
- Persian Document Topic Prediction Data
- Yektanet NLP Classification Corpus
- Web Text Categorisation Dataset (Persian)
- Yektanet Machine Learning Text Dataset
Attributes
Original Data Source: Yektanet (Dataset for Text Classification)