NLP Processed Topic Classification Data

Data Science and Analytics

Tags and Keywords

  • News
  • Text
  • Classification
  • NLP
  • Multiclass

Free

About

This dataset simplifies topic classification by consolidating text from several source types: news articles, general articles, answers, and comments. The text has been cleaned and then processed with Natural Language Processing (NLP) techniques, including Part-of-Speech (POS) tagging and lemmatisation. It is designed for multi-class topic classification across six themes: Politics, Health, Emotion, Financial, Sport, and Science. Because the text is already processed, it can be used directly for training machine learning models and for text analysis.

Columns

  • label: This column indicates the main topic or category assigned to each text entry. The dataset includes labels such as Politics, Health, Emotion, Financial, Sport, and Science.
  • cleantext: Contains the original text data after initial cleaning steps. These steps include normalisation, removal of punctuation, stop words, HTML tags, special characters, and emojis, as well as fixing contractions.
  • applied NLP text: Features the text after further NLP processing, which involves Part-of-Speech (POS) tagging and lemmatisation. This refined text is ideal for advanced NLP model training.
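The cleaning steps listed for the cleantext column can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used to build the dataset: the function name `clean_text`, the order of the steps, and the tiny `CONTRACTIONS` and `STOP_WORDS` tables are all assumptions made for the example. POS tagging and lemmatisation are left out because they normally rely on a dedicated NLP library.

```python
import html
import re
import string

# Hypothetical, deliberately tiny lookup tables. The lists used to build
# the dataset are not documented in the listing.
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "it's": "it is", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "in", "on"}

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(raw: str) -> str:
    """Apply the cleaning steps named in the column description (assumed order)."""
    text = TAG_RE.sub(" ", raw)              # remove HTML tags
    text = html.unescape(text)               # decode entities such as &amp;
    text = text.lower().replace("\u2019", "'")  # normalise case and curly apostrophes
    for short, full in CONTRACTIONS.items():  # fix contractions
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Crude removal of emojis and special characters: drop everything non-ASCII.
    # Note that this also drops accented letters.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TO_SPACE)    # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)


example = "<p>It's a <b>great</b> win &amp; the team can't stop! \U0001F389</p>"
print(clean_text(example))
```

Contractions are expanded before punctuation is stripped because the apostrophe is itself punctuation; reversing the two steps would leave fragments such as "can t".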

Distribution

The dataset is distributed in CSV format and contains approximately 135,000 unique records. It is split across two main files: one containing the original text data, and a second, named '2CLEAN', which contains the NLP-processed text described here.
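A first step after downloading is usually to check how the records are spread across the six labels. The sketch below counts labels with the standard `csv` module. It assumes the CSV header uses the column names exactly as written in this listing (label, cleantext, applied NLP text); the real header and file name may differ, so the example reads a small invented in-memory sample instead of a real path.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file) -> Counter:
    """Count rows per topic label in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Invented sample rows for illustration only; not taken from the dataset.
SAMPLE = io.StringIO(
    "label,cleantext,applied NLP text\n"
    "Sport,team won final match,team win final match\n"
    "Health,doctors treated patients,doctor treat patient\n"
    "Sport,player scored two goals,player score two goal\n"
)

counts = label_counts(SAMPLE)
for label, n in counts.most_common():
    print(label, n)

# With the downloaded file, the call would look like this (path is hypothetical):
# with open("path/to/processed_file.csv", newline="", encoding="utf-8") as f:
#     counts = label_counts(f)
```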

Usage

This dataset is ideally suited for a range of applications in data science and analytics. Its primary use case is topic classification, enabling users to build and train models that can accurately categorise textual content. Other ideal applications include:
  • Developing Natural Language Processing (NLP) models.
  • Conducting sentiment analysis or emotion detection within text.
  • Building recommendation systems based on content topics.
  • Research in text mining and information retrieval.
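To make the primary use case concrete, here is a minimal sketch of a multi-class topic classifier: a multinomial Naive Bayes model with Laplace smoothing, written with the standard library only so it runs anywhere. The class name `NaiveBayesTopicClassifier` and the toy training rows are invented for illustration. In practice one would train on the processed text column and would likely use an established machine learning library rather than this hand-rolled version.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing strength
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information, so skip them.
        tokens = [t for t in text.split() if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:                        # add log likelihoods
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy rows in the style of the processed text; invented, not from the dataset.
train_texts = [
    "team win final match goal",
    "player score goal season",
    "vaccine reduce infection risk patient",
    "doctor treat patient hospital",
    "market stock price fall investor",
    "bank raise interest rate",
]
train_labels = ["Sport", "Sport", "Health", "Health", "Financial", "Financial"]

model = NaiveBayesTopicClassifier().fit(train_texts, train_labels)
print(model.predict("patient visit doctor"))
```

Working in log space avoids numeric underflow on long texts, and the smoothing term keeps a token that is absent from one class from forcing that class's probability to zero.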

Coverage

The dataset has global coverage: the textual content is not restricted to any specific geographic region. No specific time range or demographic scope is documented for the content.

License

CC0

Who Can Use It

This dataset is beneficial for a wide array of users, particularly those involved in data analysis and machine learning:
  • Data Scientists: For building and evaluating text classification models.
  • Machine Learning Engineers: To develop and fine-tune NLP applications.
  • Researchers: For academic studies on text analysis, natural language understanding, and information extraction.
  • Developers: To integrate topic classification capabilities into their applications.

Dataset Name Suggestions

  • NLP Processed Topic Classification Data
  • Multi-Topic Text Classification Dataset
  • Cleaned Text Dataset for Topic Modelling
  • Applied NLP Text Topics

Attributes

Original Data Source: Topic_classification_dataset

Listing Stats

  • Views: 0
  • Downloads: 0
  • Listed: 22/06/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
