20 Newsgroups Text Classification Dataset

Education & Learning Analytics

Tags and Keywords

Education, Online, Communities, Text, NLP, Multiclass, Classification, Clustering

About

This dataset is a collection of preprocessed text documents from 20 newsgroups, intended for researchers and machine learning practitioners. It provides ready-to-use data for experimenting with machine learning techniques, particularly text classification and text clustering, without extensive initial preprocessing. Basic cleaning has been applied, but the data retains enough structure to support different analytical approaches and further customisation. Because the original text is supplied alongside the cleaned version, alternative preprocessing pipelines can be compared against the provided one.

Columns

target: This column contains the category labels, representing 20 distinct newsgroup topics.
text: This column holds the original text extracted from the documents, maintaining its initial formatting.
text_cleaned: This column provides the preprocessed version of the text, ready for immediate use in machine learning models.
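
As a minimal sketch of working with these columns, the following reads the CSV with Python's standard csv module and checks that the three documented columns are present. The helper name load_rows and the file name in the trailing comment are hypothetical, and the one-row sample is invented purely so the snippet runs without the download.

```python
import csv
import io

# Column names as documented in the listing.
EXPECTED_COLUMNS = ["target", "text", "text_cleaned"]

def load_rows(file_obj):
    """Read the CSV into a list of dicts and check the documented columns exist."""
    reader = csv.DictReader(file_obj)
    present = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# Tiny in-memory stand-in for the real file (invented row, not from the dataset).
sample = io.StringIO(
    "target,text,text_cleaned\n"
    'sci.space,"Launch is set for Monday, weather permitting.",'
    "launch set monday weather permitting\n"
)
rows = load_rows(sample)
print(rows[0]["target"], "|", rows[0]["text_cleaned"])

# With the downloaded file (the file name here is an assumption):
# with open("20_newsgroups.csv", newline="", encoding="utf-8") as f:
#     rows = load_rows(f)
```

If very long posts trigger a field-size error when reading the real file, csv.field_size_limit can be used to raise the cap.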

Distribution

The dataset comprises 18,828 documents in a single table, available for download as a CSV file. It has three columns (target, text and text_cleaned) and 20 classes, each corresponding to one newsgroup topic.
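
The figures above can be checked after loading. This is a rough sketch, assuming the rows are dicts keyed by column name (for example from csv.DictReader); the function names and the three demo rows are illustrative only.

```python
from collections import Counter

def summarise(rows):
    """Return (number of documents, Counter of documents per target label)."""
    counts = Counter(row["target"] for row in rows)
    return sum(counts.values()), counts

def matches_listing(rows, expected_docs=18828, expected_classes=20):
    """True if the row and class counts agree with the figures in the listing."""
    n_docs, counts = summarise(rows)
    return n_docs == expected_docs and len(counts) == expected_classes

# Invented demo rows; the real file should have 18,828 rows and 20 labels.
demo = [
    {"target": "sci.med"},
    {"target": "sci.med"},
    {"target": "rec.autos"},
]
n_docs, counts = summarise(demo)
print(n_docs, counts.most_common())
```

The per-class counts are also a quick way to see whether the classes are roughly balanced before choosing an evaluation metric.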

Usage

This dataset is suited to a range of applications, including:

Implementing and testing text classification models to categorise documents into their respective newsgroup topics.
Developing and evaluating text clustering algorithms to group similar documents together based on their content.
Exploring and comparing different Natural Language Processing (NLP) techniques for text analysis.
Experimenting with various preprocessing methods to potentially enhance model prediction capabilities.
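
To illustrate the classification use case, here is a small bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is a teaching sketch rather than a recommended pipeline: in practice a library such as scikit-learn would be the usual choice. The function names and the four training sentences are invented, and it assumes whitespace tokenisation is adequate for text_cleaned.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: whitespace splitting is enough for already-cleaned text.
    return text.lower().split()

def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # ignore unseen words
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # class prior
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy examples standing in for (text_cleaned, target) pairs.
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the goalie made a save",
    "puck on the ice rink",
]
labels = ["rec.sport.baseball"] * 2 + ["rec.sport.hockey"] * 2
model = train_nb(docs, labels)
print(predict_nb(model, "goalie ice"))
```

On the real data, the same functions could be fitted on a training split of text_cleaned and target and scored on a held-out split.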

Coverage

The dataset covers 20 newsgroup topics, ranging from technical subjects to social discussions:

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

Some topics are closely related (e.g., rec.sport.baseball and rec.sport.hockey), while others are entirely unrelated (e.g., alt.atheism and misc.forsale).
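
One rough way to see which topics are related is to group the names by their top-level prefix. Treating a shared prefix as a proxy for relatedness is an assumption made here for illustration, not something the dataset defines.

```python
from collections import defaultdict

# The 20 topic names as listed above.
TOPICS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]

def group_by_prefix(topics):
    """Map each top-level prefix (the part before the first dot) to its topics."""
    groups = defaultdict(list)
    for name in topics:
        groups[name.split(".")[0]].append(name)
    return dict(groups)

groups = group_by_prefix(TOPICS)
for prefix, names in sorted(groups.items()):
    print(f"{prefix}: {len(names)}")
```

Grouping like this can help when analysing a confusion matrix, since errors between topics under the same prefix are usually less surprising than errors across prefixes.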

License

CC0

Who Can Use It

This dataset is primarily intended for:

Researchers keen on exploring and refining machine learning techniques, particularly in text analysis.
Students and educators in the fields of AI, machine learning, and data science looking for a practical, preprocessed dataset for learning and teaching purposes.
Data scientists and developers interested in building and testing models for text classification and clustering on a well-structured corpus.

Dataset Name Suggestions

20 Newsgroups Text Classification Dataset
Preprocessed Newsgroup Topics
Machine Learning Text Corpus (20 Topics)
Cleaned Newsgroup Documents
NLP Newsgroup Dataset

Attributes

Original Data Source: 20 newsgroup preprocessed

Listing Stats

Listed: 21/06/2025
Region: Global
Quality: 5 / 5
Version: 1.0
Price: Free