Cleaned Reddit Depression Data
Mental Health & Wellness
Free
About
This dataset provides cleaned text from Reddit posts, curated for mental health classification. It is designed to support the development of machine learning models that identify and classify content related to depression. The raw data was collected by web scraping several subreddits and was then processed with multiple Natural Language Processing (NLP) techniques to make it clean and usable. All content in the dataset is in English.
Columns
- clean_text: This column contains the processed and cleaned text from Reddit posts. It is the primary input feature for classification tasks.
- is_depression: This column serves as the label for each post. It is a binary indicator, with '1' signifying that the post is classified as relating to depression and '0' indicating it is not. The dataset contains 3,900 instances labelled as 0 (non-depression) and 3,831 instances labelled as 1 (depression).
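As a minimal sketch of how the two columns might be read and the label balance checked, the snippet below uses only the Python standard library. The inline sample rows are made up for illustration and are not taken from the dataset; only the column names clean_text and is_depression come from the description above.

```python
import csv
import io
from collections import Counter


def load_rows(fileobj):
    """Read (clean_text, is_depression) pairs from a CSV file object."""
    reader = csv.DictReader(fileobj)
    return [(row["clean_text"], int(row["is_depression"])) for row in reader]


def label_counts(rows):
    """Count how many rows carry each label value (0 or 1)."""
    return Counter(label for _, label in rows)


# Made-up sample standing in for the real file (illustrative only).
SAMPLE_CSV = """clean_text,is_depression
i feel empty and tired all the time,1
looking for a good pizza recipe,0
cannot get out of bed anymore,1
"""

rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
print("non-depression:", counts[0], "depression:", counts[1])
```

To run this against a downloaded copy, replace the io.StringIO sample with a real file handle, for example open("reddit_depression.csv", newline="", encoding="utf-8"), where the file name is hypothetical.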
Distribution
The dataset comes in tabular form, typically as a CSV file with two columns, clean_text and is_depression. It comprises 7,650 unique records, each representing a single Reddit post with its cleaned text and depression label. The label counts listed under Columns sum to 7,731, so the file may contain some repeated posts in addition to the unique ones; the source does not explain the difference. File size is not stated.
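Since the record count is described in terms of unique records, it can be worth confirming how many rows and how many distinct posts a given copy of the file contains. The following is a small sketch under the assumption that rows have already been loaded as (text, label) pairs; the helper name summarize and the sample rows are hypothetical.

```python
def summarize(rows):
    """Return total rows, distinct texts, and the number of repeated rows."""
    total = len(rows)
    unique = len({text for text, _ in rows})
    return {"rows": total, "unique_texts": unique, "duplicates": total - unique}


# Made-up (text, label) pairs; the last one repeats the first.
sample_rows = [
    ("i feel empty and tired all the time", 1),
    ("looking for a good pizza recipe", 0),
    ("i feel empty and tired all the time", 1),
]

summary = summarize(sample_rows)
print(summary)
```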
Usage
This dataset is suited to text classification and other natural language processing tasks. It can be used to:
- Train and evaluate machine learning models for detecting mental health-related content.
- Develop tools for sentiment analysis or topic modelling within social media data.
- Support research into online discussions about mental well-being and depression.
- Build automated systems for content moderation or early intervention in digital mental health.
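To illustrate the first use case, here is a minimal baseline sketch: a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. It is not a method prescribed by the dataset; the training sentences are made up, the function names are hypothetical, and a real project would more likely use an established library and a proper train/test split.

```python
import math
from collections import Counter


def train_naive_bayes(rows):
    """Fit a multinomial Naive Bayes model on (text, label) pairs.

    Assumes both labels (0 and 1) appear at least once in rows.
    """
    doc_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for text, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label (0 or 1) with the higher log-probability score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        # Log prior from the share of training posts with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training examples in the (clean_text, is_depression) shape.
train = [
    ("i feel hopeless and empty", 1),
    ("so tired of feeling sad every day", 1),
    ("great pizza recipe for dinner", 0),
    ("watching the football game tonight", 0),
]

model = train_naive_bayes(train)
print(predict(model, "feeling sad and hopeless"))
print(predict(model, "pizza for dinner tonight"))
```

A simple baseline like this gives a reference point against which stronger models can be compared.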
Coverage
The geographic scope of the dataset is global, as Reddit content is not restricted to any particular region. All data is in English. Demographic details of the original posters are not included, and the time range of data collection is not specified.
License
CC0
Who Can Use It
This dataset is valuable for a wide range of users, including:
- Data scientists and machine learning engineers who are building and optimising text classification models.
- Researchers in mental health, social sciences, and computational linguistics exploring online discourse.
- Developers creating applications that leverage AI for mental health support or content analysis.
- Academic institutions and students engaged in NLP projects or studies on social media data.
Dataset Name Suggestions
- Reddit Mental Health Posts
- Depression Text Classifier
- Cleaned Reddit Depression Data
- Social Media Mental Health Classification
- NLP Depression Dataset
Attributes
Original Data Source: Depression: Reddit Dataset (Cleaned)