Yahoo Answers Topic Classification Dataset
About
This dataset is built for topic classification and draws on content from Yahoo! Answers. It provides training and testing samples spread evenly across 10 main categories. Each entry centres on the best answer to a question, together with the question's main category, which makes the dataset well suited to text categorisation tasks.
Columns
- Class Index: A numerical identifier ranging from 1 to 10, representing the specific main category for each entry.
- Question Title: The original title of the question posted on Yahoo! Answers.
- Question Content: The full textual content of the question itself.
- Best Answer: The selected best answer content associated with the question.
- Note: All text fields are enclosed in double quotes ("), and any double quote inside a field is escaped by doubling it (""). Newline characters within the text are escaped as a backslash followed by an "n" character (\n).
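The quoting rules in the note above can be handled with Python's standard csv module. The following is a minimal sketch, not an official loader: it assumes each file has no header row and exactly four columns in the order listed above, and the sample line is made up for illustration.

```python
import csv
import io


def unescape_newlines(text):
    """Turn the literal two-character sequence backslash + 'n' into a real newline."""
    return text.replace("\\n", "\n")


def read_rows(handle):
    """Yield (class_index, title, content, best_answer) tuples from an open CSV handle.

    Assumption: four columns per line, in the order listed under Columns.
    """
    # csv.reader already reads a doubled quote ("") inside a quoted field
    # as a single literal quote, so no extra handling is needed for that.
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # skip malformed lines rather than guess at their layout
        class_index, title, content, answer = row
        yield (
            int(class_index),
            unescape_newlines(title),
            unescape_newlines(content),
            unescape_newlines(answer),
        )


# Made-up sample line that follows the quoting rules described above.
sample = '"5","What is ""RAM""?","I keep seeing it\\nin ads.","Random access memory."\n'
rows = list(read_rows(io.StringIO(sample)))
print(rows[0])
```

For a real file, an open file handle would replace the io.StringIO object. Opening it with newline="" follows the csv module's own recommendation.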
Distribution
The dataset is provided in comma-separated values (CSV) format, organised into separate training and testing files. It contains a significant number of samples:
- Total training samples: 1,400,000 (each of the 10 classes includes 140,000 training samples).
- Total testing samples: 60,000 (each of the 10 classes includes 6,000 testing samples).
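The totals above follow from the per-class counts, and the same figures can serve as a sanity check after loading. The sketch below is illustrative only: the split names "train" and "test" and the helper check_balance are hypothetical, not part of the dataset.

```python
from collections import Counter

NUM_CLASSES = 10
# Per-class sample counts as stated in the Distribution section.
EXPECTED_PER_CLASS = {"train": 140_000, "test": 6_000}


def check_balance(class_indices, expected_per_class, num_classes=NUM_CLASSES):
    """Return {class_index: actual_count} for every class whose count is off."""
    counts = Counter(class_indices)
    return {
        c: counts[c]
        for c in range(1, num_classes + 1)
        if counts[c] != expected_per_class
    }


# Totals implied by the per-class counts.
totals = {split: NUM_CLASSES * n for split, n in EXPECTED_PER_CLASS.items()}
print(totals)  # {'train': 1400000, 'test': 60000}

# Toy check on made-up labels: three samples for each of the 10 classes.
toy_labels = list(range(1, NUM_CLASSES + 1)) * 3
print(check_balance(toy_labels, expected_per_class=3))  # {}
```

An empty result means every class has the expected number of samples.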
Usage
This dataset suits a range of applications, including:
- Developing and evaluating Natural Language Processing (NLP) models specifically for text classification.
- Training multiclass classification algorithms to categorise diverse textual data.
- Conducting research into text analysis, topic modelling, and content categorisation.
- Creating applications that require automated categorisation of user-generated content or articles.
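As a rough illustration of the multiclass classification use case, the sketch below implements a small multinomial Naive Bayes classifier in pure Python. It is one possible baseline, not a method prescribed by the dataset. The toy texts and the labels 1 and 2 are made up and do not correspond to the dataset's actual class indices.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information and are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data; real inputs would come from the dataset's text columns.
train_texts = [
    "how do I fix my laptop screen",
    "my computer will not boot",
    "best way to train for a marathon",
    "running shoes for long distance",
]
train_labels = [1, 1, 2, 2]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("laptop will not boot"))
print(model.predict("marathon running shoes"))
```

On the full dataset, one option is to concatenate the question title, question content, and best answer into a single text per sample before fitting. A library implementation would likely be faster at that scale.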
Coverage
The content is global in scope, drawn from user-generated discussions on Yahoo! Answers. The geographic and demographic distribution of the original users is not provided, and no time range for the data collection is given. The dataset is nonetheless broadly applicable to general text classification problems.
License
This dataset is free and can be used at no financial cost.
Who Can Use It
- Data Scientists and Machine Learning Engineers who are focused on building and refining text classification models.
- NLP Researchers for academic investigations into text understanding, topic discovery, and language processing.
- Students and Educators engaged in learning or teaching about text classification, machine learning, and data analysis.
- Software Developers aiming to integrate automated content categorisation features into their platforms or applications.
Dataset Name Suggestions
- Yahoo Answers Topic Classification Dataset
- Yahoo Answers NLP Categories
- Multi-Class Yahoo Answers Text Dataset
- Yahoo Answers Best Answer Classification
Attributes
Original Data Source: Yahoo Answers 10 categories for NLP CSV
- Note: A direct URL for the original data source is not available within the provided information.
Dataset Category Suggestions
- AI & Machine Learning
- Natural Language Processing (NLP)
- Text Data
- Classification Datasets
- Open Data
Dataset SEO Keyword Suggestions
NLP, Text, Classification, Training, Categories