20 Newsgroups Text Classification Dataset

Education & Learning Analytics

Tags and Keywords

Education

Online

Communities

Text

NLP

Multiclass

Classification

Clustering

20 Newsgroups Text Classification Dataset on the Opendatabay data marketplace

Free

About

This dataset is a collection of preprocessed text documents for researchers and machine learning practitioners. It provides ready-to-use data for experimenting with machine learning techniques, particularly text classification and text clustering, without extensive initial preprocessing. Basic cleaning has been applied, but the original text is kept alongside the cleaned version, so other analytical approaches and further customisation remain possible.

Columns

  • target: This column contains the category labels, representing 20 distinct newsgroup topics.
  • text: This column holds the original text extracted from the documents, maintaining its initial formatting.
  • text_cleaned: This column provides the preprocessed version of the text, ready for immediate use in machine learning models.
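As a rough illustration of the column layout above, the sketch below reads a CSV with Python's standard `csv` module and checks that the three documented columns are present. The in-memory sample rows, the `read_rows` helper, and the use of topic names (rather than integer codes) in `target` are assumptions made for the example; the real file's name and label encoding are not stated in this listing.

```python
import csv
import io

# Column names as documented in this listing.
EXPECTED_COLUMNS = ["target", "text", "text_cleaned"]


def read_rows(handle):
    """Read CSV rows from an open text handle and verify the documented columns exist.

    `read_rows` is a hypothetical helper for this sketch, not part of the dataset.
    """
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Tiny in-memory stand-in for the real file (invented rows, for illustration only).
sample = io.StringIO(
    "target,text,text_cleaned\n"
    'sci.space,"The shuttle launch was delayed.",shuttle launch delayed\n'
    'rec.autos,"My car will not start!",car start\n'
)

rows = read_rows(sample)
for row in rows:
    print(row["target"], "|", row["text_cleaned"])
```

With the real download, the same helper could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. If a very long document exceeds the `csv` module's default field size limit, `csv.field_size_limit` can raise it.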

Distribution

The dataset comprises 18,828 documents and is structured as a dataframe, suitable for formats such as CSV. It has three columns and is categorised into 20 classes, each corresponding to one newsgroup topic.
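A quick sanity check on a download is to compare the document and class counts with the figures quoted above. The following is a minimal sketch: `summarise_labels` is a hypothetical helper, and the toy label list stands in for the real `target` column.

```python
from collections import Counter


def summarise_labels(labels, expected_documents=18828, expected_classes=20):
    """Count documents per class and compare totals with the listing's figures."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        "documents": total,
        "classes": len(counts),
        "documents_match": total == expected_documents,
        "classes_match": len(counts) == expected_classes,
        # Smallest and largest class sizes hint at how balanced the classes are.
        "smallest_class": min(counts.values(), default=0),
        "largest_class": max(counts.values(), default=0),
    }


# Toy labels for illustration; with the real data, pass the `target` column.
toy_labels = ["sci.space", "sci.space", "rec.autos", "sci.med"]
summary = summarise_labels(toy_labels)
print(summary)
```

On the toy input both match flags are false, which is expected; on the full dataset they should be true if the file agrees with this listing.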

Usage

This dataset is suited to a variety of applications, including:
  • Implementing and testing text classification models to categorise documents into their respective newsgroup topics.
  • Developing and evaluating text clustering algorithms to group similar documents together based on their content.
  • Exploring and comparing different Natural Language Processing (NLP) techniques for text analysis.
  • Experimenting with various preprocessing methods to potentially enhance model prediction capabilities.
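The classification use case above can be sketched as a simple baseline. This example assumes scikit-learn is installed and uses TF-IDF features with logistic regression, which is one common choice and not a method prescribed by the dataset. The toy texts and the `train_classifier` helper are invented for illustration; with the real data, `texts` would come from the `text_cleaned` column and `labels` from `target`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_classifier(texts, labels, test_size=0.25, seed=0):
    """Fit a TF-IDF + logistic regression baseline and return (model, held-out accuracy)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(x_train, y_train)
    return model, model.score(x_test, y_test)


# Toy corpus standing in for `text_cleaned` / `target` (invented examples).
texts = [
    "rocket orbit launch nasa",
    "moon orbit satellite launch",
    "shuttle nasa orbit mission",
    "satellite launch rocket mission",
    "engine car brake tire",
    "car dealer engine oil",
    "tire brake car wheel",
    "oil engine wheel dealer",
]
labels = ["sci.space"] * 4 + ["rec.autos"] * 4

model, accuracy = train_classifier(texts, labels)
print("held-out accuracy:", accuracy)
print("prediction:", model.predict(["nasa rocket mission"])[0])
```

Swapping the final pipeline step for another estimator, or feeding the `text` column through a different preprocessing routine, is one way to compare techniques on the same split.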

Coverage

The dataset covers 20 newsgroup topics, ranging from technical subjects to social discussions:
  • alt.atheism
  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • misc.forsale
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • soc.religion.christian
  • talk.politics.guns
  • talk.politics.mideast
  • talk.politics.misc
  • talk.religion.misc

Some topics are closely related (e.g., rec.sport.baseball and rec.sport.hockey), while others are entirely unrelated (e.g., alt.atheism and misc.forsale).
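One rough way to see which topics are likely to be related is to group them by their top-level newsgroup prefix. The sketch below does this for the 20 topic names listed above. Treating a shared prefix as a proxy for relatedness is an assumption of this example; it misses cross-prefix overlaps such as alt.atheism and talk.religion.misc.

```python
from collections import defaultdict

# The 20 topic names as given in this listing.
TOPICS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def group_by_prefix(topics):
    """Group topic names by the part before the first dot (e.g. 'rec', 'sci')."""
    groups = defaultdict(list)
    for topic in topics:
        prefix = topic.split(".", 1)[0]
        groups[prefix].append(topic)
    return dict(groups)


groups = group_by_prefix(TOPICS)
for prefix in sorted(groups):
    print(prefix, len(groups[prefix]), groups[prefix])
```

Such a grouping can be used to build a coarser classification task, or to check whether a model's confusions fall mostly within a prefix group.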

License

CC0

Who Can Use It

This dataset is primarily intended for:
  • Researchers keen on exploring and refining machine learning techniques, particularly in text analysis.
  • Students and educators in the fields of AI, machine learning, and data science looking for a practical, preprocessed dataset for learning and teaching purposes.
  • Data scientists and developers interested in building and testing models for text classification and clustering on a well-structured corpus.

Dataset Name Suggestions

  • 20 Newsgroups Text Classification Dataset
  • Preprocessed Newsgroup Topics
  • Machine Learning Text Corpus (20 Topics)
  • Cleaned Newsgroup Documents
  • NLP Newsgroup Dataset

Attributes

Original Data Source: 20 newsgroup preprocessed

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 21/06/2025
  • Region: Global
  • UDQS quality score: 5 / 5
  • Version: 1.0
