20 Newsgroups Text Classification Dataset
Education & Learning Analytics
Free
About
This dataset is a collection of preprocessed text documents for researchers and machine learning practitioners. It provides ready-to-use data for experimenting with machine learning techniques, particularly text classification and text clustering, without extensive initial preprocessing. Basic cleaning has been applied, but the data retains enough structure to support a range of analytical approaches and further customisation. The original text is kept alongside the cleaned version, so the cleaning can be redone or extended.
Columns
- target: This column contains the category labels, representing 20 distinct newsgroup topics.
- text: This column holds the original text extracted from the documents, maintaining its initial formatting.
- text_cleaned: This column provides the preprocessed version of the text, ready for immediate use in machine learning models.
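As a minimal sketch of reading these three columns, the snippet below uses only Python's standard csv module. The function name load_documents and the sample rows are made up for illustration. The sample assumes that target holds topic names, which the listing does not confirm; the real file may store integer codes instead.

```python
import csv
import io

# Column names as given in the listing above.
EXPECTED_COLUMNS = ("target", "text", "text_cleaned")


def load_documents(handle):
    """Read rows from an open CSV file and check the expected columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# Tiny in-memory stand-in for the real file (these rows are invented).
sample = io.StringIO(
    "target,text,text_cleaned\n"
    'sci.space,"The shuttle launch was delayed, again.",shuttle launch delayed\n'
    'rec.autos,"My car will not start!",car start\n'
)
docs = load_documents(sample)
print(len(docs), docs[0]["target"])
```

For the real data, pass an open file handle (opened with newline="" as the csv module recommends) in place of the StringIO object.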
Distribution
The dataset comprises 18,828 documents in a tabular (dataframe-style) structure that can be stored in formats such as CSV. It has three columns and 20 classes, each corresponding to one newsgroup topic.
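A quick sanity check after loading is to count documents per class and compare the totals with the figures above. The sketch below assumes the labels have already been read into a list; the helper name class_distribution and the example labels are hypothetical. The listing does not say whether the classes are balanced; if they were, each would hold roughly 941 documents (18,828 / 20).

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share of all documents)}, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Invented labels standing in for the real target column.
labels = ["sci.space", "rec.autos", "sci.space", "sci.med"]
distribution = class_distribution(labels)
for label, (count, share) in distribution.items():
    print(f"{label}: {count} documents ({share:.0%})")

# On the full dataset one would expect, per the listing:
#   sum(count for count, _ in distribution.values()) == 18828
#   len(distribution) == 20
```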
Usage
This dataset is suited to a variety of applications, including:
- Implementing and testing text classification models to categorise documents into their respective newsgroup topics.
- Developing and evaluating text clustering algorithms to group similar documents together based on their content.
- Exploring and comparing different Natural Language Processing (NLP) techniques for text analysis.
- Experimenting with various preprocessing methods to potentially enhance model prediction capabilities.
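To illustrate the classification use case, here is a minimal multinomial Naive Bayes sketch written with the standard library only, so it runs without extra packages. It is one possible baseline, not a method prescribed by the dataset. The function names and the four training sentences are invented, and the code assumes text_cleaned can be tokenised by splitting on whitespace. In practice a library such as scikit-learn would usually be used instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Collect per-class document counts and word counts from tokenised texts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token not in vocab:
                continue  # skip words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented examples in the style of the text_cleaned column.
train_texts = [
    "shuttle orbit launch nasa",
    "moon orbit rocket",
    "engine car brakes",
    "car tires engine oil",
]
train_labels = ["sci.space", "sci.space", "rec.autos", "rec.autos"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "rocket launch orbit"))
print(predict(model, "car engine"))
```

With the real data, hold out part of the documents for evaluation rather than predicting on the training set.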
Coverage
The dataset covers 20 newsgroup topics, ranging from technical subjects to social discussions:
- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
Some topics are closely related (e.g., rec.sport.baseball and rec.sport.hockey), while others are entirely unrelated (e.g., alt.atheism and misc.forsale).
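One rough way to see which topics are related is to compare the dot-separated parts of their names. The sketch below groups the 20 topics by top-level hierarchy and counts how many leading name components two topics share. The helper names are hypothetical, and a shared prefix is only a proxy for relatedness: soc.religion.christian and talk.religion.misc, for instance, overlap in subject but share no prefix.

```python
from collections import defaultdict

TOPICS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def group_by_prefix(topics, depth=1):
    """Group topic names by their first `depth` dot-separated components."""
    groups = defaultdict(list)
    for topic in topics:
        prefix = ".".join(topic.split(".")[:depth])
        groups[prefix].append(topic)
    return dict(groups)


def shared_prefix_depth(a, b):
    """Count how many leading name components two topics have in common."""
    depth = 0
    for part_a, part_b in zip(a.split("."), b.split(".")):
        if part_a != part_b:
            break
        depth += 1
    return depth


groups = group_by_prefix(TOPICS)
for prefix, members in groups.items():
    print(prefix, len(members))

print(shared_prefix_depth("rec.sport.baseball", "rec.sport.hockey"))
print(shared_prefix_depth("alt.atheism", "misc.forsale"))
```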
License
CC0
Who Can Use It
This dataset is primarily intended for:
- Researchers keen on exploring and refining machine learning techniques, particularly in text analysis.
- Students and educators in the fields of AI, machine learning, and data science looking for a practical, preprocessed dataset for learning and teaching purposes.
- Data scientists and developers interested in building and testing models for text classification and clustering on a well-structured corpus.
Dataset Name Suggestions
- 20 Newsgroups Text Classification Dataset
- Preprocessed Newsgroup Topics
- Machine Learning Text Corpus (20 Topics)
- Cleaned Newsgroup Documents
- NLP Newsgroup Dataset
Attributes
Original Data Source: 20 newsgroup preprocessed