Cleaned Reddit Depression Data

Mental Health & Wellness

Tags and Keywords

  • Text
  • Classification
  • Healthcare
  • NLP

Free

About

This dataset provides cleaned text from Reddit posts, curated for mental health classification. It is designed to support machine learning models that identify and classify depression-related content. The raw data was collected by web scraping various subreddits and then processed with multiple Natural Language Processing (NLP) techniques to clean it for use. All content is in English.

Columns

  • clean_text: This column contains the processed and cleaned text from Reddit posts. It is the primary input feature for classification tasks.
  • is_depression: This column serves as the label for each post. It is a binary indicator, with '1' signifying that the post is classified as relating to depression and '0' indicating it is not. The dataset contains 3,900 instances labelled as 0 (non-depression) and 3,831 instances labelled as 1 (depression).
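
As a minimal sketch of how the two columns might be read, the snippet below parses a small in-memory CSV with Python's standard library and tallies the labels. The sample rows are hypothetical stand-ins, not records from the dataset, and `load_posts` is an illustrative helper name rather than part of any published tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the real file; the actual
# download is described as having the same two columns.
SAMPLE_CSV = """clean_text,is_depression
i feel empty and tired all the time,1
looking for a good pizza recipe,0
cannot get out of bed anymore,1
"""

def load_posts(handle):
    """Read (clean_text, is_depression) rows, casting the label to int."""
    reader = csv.DictReader(handle)
    return [(row["clean_text"], int(row["is_depression"])) for row in reader]

posts = load_posts(io.StringIO(SAMPLE_CSV))

# Count how many posts carry each label (1 = depression, 0 = not).
label_counts = Counter(label for _, label in posts)
print(label_counts)
```

For the downloaded file, the same function should work with a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved; the encoding is an assumption, since the listing does not state it.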

Distribution

The dataset is distributed in tabular form as a CSV file. The listing reports 7,650 unique records (rows), each representing a single Reddit post with its cleaned text and depression label. The file size is not stated. The structure is simple: two columns.
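
The label counts given under Columns (3,900 and 3,831) sum to 7,731, while 7,650 unique records are reported here, and the listing does not explain the difference. A quick check after download may therefore be worthwhile. The sketch below, using hypothetical rows and an illustrative `summarize` helper, counts total rows, distinct texts, and the share of each label.

```python
from collections import Counter

def summarize(rows):
    """Return total rows, distinct texts, and per-label share for (text, label) pairs."""
    total = len(rows)
    unique_texts = len({text for text, _ in rows})
    counts = Counter(label for _, label in rows)
    shares = {label: n / total for label, n in counts.items()} if total else {}
    return {"total": total, "unique_texts": unique_texts, "shares": shares}

# Hypothetical rows, including one repeated text, to show the check.
rows = [
    ("feeling hopeless again", 1),
    ("feeling hopeless again", 1),
    ("weekend hiking photos", 0),
    ("new phone review", 0),
]

summary = summarize(rows)
print(summary)
```

If `total` and `unique_texts` differ on the real file, repeated posts would be one possible reason for the mismatch in the reported figures, though that is only a guess.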

Usage

This dataset suits text classification and natural language processing work. It can be used to:
  • Train and evaluate machine learning models for detecting mental health-related content.
  • Develop tools for sentiment analysis or topic modelling within social media data.
  • Support research into online discussions about mental well-being and depression.
  • Build automated systems for content moderation or early intervention in digital mental health.
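
To illustrate the first use case, here is a minimal baseline sketch: a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. This is one possible starting point, not a method prescribed by the listing; the training rows are hypothetical toy examples, and `train_nb` and `predict_nb` are illustrative names.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    doc_counts = Counter()               # label -> number of posts
    vocab = set()
    for text, label in rows:
        words = text.split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy rows; real training would use the full CSV.
train = [
    ("i feel hopeless and empty", 1),
    ("so tired of feeling worthless", 1),
    ("great pizza recipe for dinner", 0),
    ("my new bike is great", 0),
]

model = train_nb(train)
print(predict_nb(model, "feeling empty and hopeless"))
print(predict_nb(model, "great dinner recipe"))
```

On the real data, a held-out test split would be needed before drawing any conclusion about accuracy, and any model used for moderation or early intervention would need far more careful validation than this sketch provides.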

Coverage

The geographic scope is global, since Reddit source material is not restricted to any region. All text is in English. Demographic details of the original post authors are not included, and the time range of data collection is not stated in the available sources.

License

CC0

Who Can Use It

This dataset is valuable for a wide range of users, including:
  • Data scientists and machine learning engineers who are building and optimising text classification models.
  • Researchers in mental health, social sciences, and computational linguistics exploring online discourse.
  • Developers creating applications that leverage AI for mental health support or content analysis.
  • Academic institutions and students engaged in NLP projects or studies on social media data.

Dataset Name Suggestions

  • Reddit Mental Health Posts
  • Depression Text Classifier
  • Cleaned Reddit Depression Data
  • Social Media Mental Health Classification
  • NLP Depression Dataset

Attributes

Original Data Source: [Depression: Reddit Dataset (Cleaned)]

Listing Stats

  • Views: 1
  • Downloads: 0
  • Listed: 05/06/2025
  • Region: Global
  • UDQS Quality (Universal Data Quality Score): 5 / 5
  • Version: 1.0
