NLP Processed Topic Classification Data
Data Science and Analytics
Free
About
This dataset supports topic classification by consolidating text from several kinds of sources: news articles, general articles, answers, and comments. The text has been cleaned and then processed with Natural Language Processing (NLP) steps so that it can be used directly in analytical work. It is designed for multi-class topic classification across six themes: Politics, Health, Emotion, Financial, Sport, and Science. Because the text is already processed, it is suitable for training machine learning models and for text analysis without a separate cleaning pass.
Columns
- label: The main topic assigned to each text entry. It takes one of six values: Politics, Health, Emotion, Financial, Sport, or Science.
- cleantext: Contains the original text data after initial cleaning steps. These steps include normalisation, removal of punctuation, stop words, HTML tags, special characters, and emojis, as well as fixing contractions.
- applied NLP text: Contains the text after further NLP processing, namely Part-of-Speech (POS) tagging and lemmatisation. This is the version intended for NLP model training.
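The cleaning steps listed for the cleantext column can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used to build the dataset: the function name `clean_text`, the contraction map, and the stop-word list are all hypothetical and deliberately tiny. POS tagging and lemmatisation are left out because they require an NLP library.

```python
import html
import re

# Hypothetical, deliberately small lookup tables (assumptions for illustration).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "in", "on", "for", "this"}

def clean_text(raw: str) -> str:
    """Approximate the cleaning steps described for the cleantext column."""
    text = html.unescape(raw).lower()          # decode entities, normalise case
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = text.replace("’", "'")              # unify curly and straight apostrophes
    for short, full in CONTRACTIONS.items():   # fix contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation, special characters, emojis
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("<p>It's a GREAT win for the team! 🎉</p>"))  # great win team
```

Contractions are expanded before punctuation is stripped so that the apostrophe is still present when the lookup runs.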
Distribution
The dataset is distributed in CSV format and contains approximately 135,000 unique records. It is split across two files: one containing the original text data, and a second, named '2CLEAN', which contains the NLP-processed text that is the focus of this description.
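A quick first check on a file like this is the number of records per label. The sketch below assumes the CSV header uses the column names exactly as listed under Columns (`label`, `cleantext`, `applied NLP text`); the real header spelling may differ. The helper name `label_counts` and the three sample rows are made up so that the example runs without the actual file.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file) -> Counter:
    """Count rows per topic label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)

# Stand-in rows (invented). With the real file, pass
# open(path, newline="", encoding="utf-8") instead of this StringIO.
sample = io.StringIO(
    "label,cleantext,applied NLP text\n"
    "Sport,team won final,team win final\n"
    "Health,doctors recommend sleep,doctor recommend sleep\n"
    "Sport,striker scored twice,striker score twice\n"
)
counts = label_counts(sample)
print(counts.most_common())  # [('Sport', 2), ('Health', 1)]
```

A strongly uneven label distribution would suggest using stratified splits or class weights when training a classifier.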
Usage
The primary use case is topic classification: building and training models that categorise textual content by topic. Other applications include:
- Developing Natural Language Processing (NLP) models.
- Conducting sentiment analysis or emotion detection within text.
- Building recommendation systems based on content topics.
- Research in text mining and information retrieval.
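As an illustration of the primary use case, the sketch below trains a small multinomial Naive Bayes classifier on already-lemmatised text. It is a dependency-free baseline written for this example, not a method associated with the dataset; in practice a library such as scikit-learn would normally be used. The class name `NaiveBayesTopicClassifier` and the six training sentences are invented stand-ins for rows of the processed text column.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no information, so skip them.
        tokens = [t for t in text.split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented examples standing in for (processed text, label) pairs.
texts = [
    "team win match final", "striker score goal match",
    "doctor recommend vaccine", "hospital treat patient vaccine",
    "bank raise interest rate", "market stock rate fall",
]
labels = ["Sport", "Sport", "Health", "Health", "Financial", "Financial"]

model = NaiveBayesTopicClassifier().fit(texts, labels)
print(model.predict("goal final match"))  # Sport
print(model.predict("patient vaccine"))   # Health
```

Because the processed text is already lowercased, stripped of stop words, and lemmatised, plain whitespace tokenisation is enough for a baseline like this.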
Coverage
Coverage is global: the textual content is not restricted to any specific geographic region. No time range or demographic scope is specified for the content.
License
CC0
Who Can Use It
This dataset is aimed at users working in data analysis and machine learning:
- Data Scientists: For building and evaluating text classification models.
- Machine Learning Engineers: To develop and fine-tune NLP applications.
- Researchers: For academic studies on text analysis, natural language understanding, and information extraction.
- Developers: To integrate topic classification capabilities into their applications.
Dataset Name Suggestions
- NLP Processed Topic Classification Data
- Multi-Topic Text Classification Dataset
- Cleaned Text Dataset for Topic Modelling
- Applied NLP Text Topics
Attributes
Original Data Source: Topic_classification_dataset