Yahoo Answers Topic Classification Dataset

Tags and Keywords

Classification

NLP

Multiclass

Text

About

This dataset is built for topic classification and draws on content from Yahoo! Answers. It contains 1,400,000 training samples and 60,000 testing samples spread evenly across 10 main categories. Each entry records a question title, the question content, and the selected best answer, together with the main category it belongs to, which makes the dataset a useful resource for text categorisation tasks.

Columns

Class Index: A numerical identifier ranging from 1 to 10, representing the specific main category for each entry.
Question Title: The original title of the question posted on Yahoo! Answers.
Question Content: The full textual content of the question itself.
Best Answer: The selected best answer content associated with the question.
Note: All text fields are enclosed in double quotes ("), and any double quote inside a field is escaped as two double quotes (""). Newline characters within the text are written as a backslash followed by an "n" character (\n).
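The quoting and newline conventions described above can be handled with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the helper name `read_rows` and the sample row are made up for illustration, and it assumes the files have no header row and hold the four columns in the order listed above.

```python
import csv
import io

def read_rows(handle):
    """Yield (class_index, title, content, answer) tuples from a CSV file object.

    Hypothetical helper: assumes no header row and exactly four columns per row.
    """
    # The default csv dialect already turns "" inside a quoted field into ".
    reader = csv.reader(handle)
    for class_index, title, content, answer in reader:
        # Replace the literal two-character sequence backslash + n with a real newline.
        title, content, answer = (
            field.replace("\\n", "\n") for field in (title, content, answer)
        )
        yield int(class_index), title, content, answer

# Made-up sample row in the described format (not taken from the dataset).
sample = '"3","What is ""pi""?","Asked in class.\\nAny ideas?","About 3.14159."\n'
rows = list(read_rows(io.StringIO(sample)))
```

For a real file, open it with `newline=""` and an explicit encoding before passing it to the helper, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded training or testing file was saved. Note that the replacement also rewrites any genuine backslash followed by "n" in the text, since the format as described does not distinguish the two.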

Distribution

The dataset is provided in comma-separated values (CSV) format, organised into separate training and testing files:

Total training samples: 1,400,000 (each of the 10 classes includes 140,000 training samples).
Total testing samples: 60,000 (each of the 10 classes includes 6,000 testing samples).
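The totals above follow directly from the per-class figures, and the same figures can serve as a sanity check after loading. The sketch below shows the arithmetic and a small balance check; the function name `check_balance` and the constant names are illustrative assumptions, not part of the dataset.

```python
from collections import Counter

NUM_CLASSES = 10
# Per-class sample counts as stated in the listing.
PER_CLASS = {"train": 140_000, "test": 6_000}

# Totals implied by the per-class figures.
train_total = NUM_CLASSES * PER_CLASS["train"]
test_total = NUM_CLASSES * PER_CLASS["test"]

def check_balance(labels, per_class):
    """Return True if each class index 1..10 occurs exactly per_class times
    and no other label value is present."""
    counts = Counter(labels)
    expected_classes = range(1, NUM_CLASSES + 1)
    return len(counts) == NUM_CLASSES and all(
        counts.get(c, 0) == per_class for c in expected_classes
    )
```

Calling `check_balance(train_labels, PER_CLASS["train"])` on the class indices read from the training file should return `True` if the file matches the stated distribution.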

Usage

This dataset suits a range of applications, including:

Developing and evaluating Natural Language Processing (NLP) models specifically for text classification.
Training multiclass classification algorithms to categorise diverse textual data.
Conducting research into text analysis, topic modelling, and content categorisation.
Creating applications that require automated categorisation of user-generated content or articles.
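As an illustration of the multiclass classification use case, here is a minimal sketch of a bag-of-words multinomial Naive Bayes baseline written with the standard library only. It is one possible starting point, not a recommended or reference method for this dataset. The four training texts and the labels 1 and 2 are made-up toy data; the listing does not say which topic each class index stands for.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {
            c: sum(self.word_counts[c].values()) for c in self.class_docs
        }
        self.n_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best, best_score = None, -math.inf
        for c in sorted(self.class_docs):
            score = math.log(self.class_docs[c] / self.n_docs)  # log prior
            denom = self.total_words[c] + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Toy data for illustration only; labels 1 and 2 are arbitrary.
texts = [
    "how do I fix my computer keyboard",
    "my laptop screen is broken",
    "best recipe for chocolate cake",
    "how long to bake bread",
]
labels = [1, 1, 2, 2]
model = NaiveBayes().fit(texts, labels)
```

With the real data, the text passed to `fit` could be the best answer alone or the question title, question content, and best answer joined together; which works better is an empirical question that this sketch does not settle.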

Coverage

The content within this dataset reflects a global scope, originating from the extensive user-generated discussions on Yahoo! Answers. While specific details regarding the geographic or demographic distribution of the original users are not provided, the dataset is broadly applicable to general text classification problems. A specific time range for the data collection is not detailed.

License

The dataset is free and can be used at no financial cost.

Who Can Use It

Data Scientists and Machine Learning Engineers who are focused on building and refining text classification models.
NLP Researchers for academic investigations into text understanding, topic discovery, and language processing.
Students and Educators engaged in learning or teaching about text classification, machine learning, and data analysis.
Software Developers aiming to integrate automated content categorisation features into their platforms or applications.

Attributes

Original Data Source: Yahoo Answers 10 categories for NLP CSV

Note: A direct URL for the original data source is not available within the provided information.

Listing Stats

Listed: 05/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
