Public Health Diabetes Prediction Data
Public Health & Epidemiology
About
This dataset provides a cleaned and consolidated collection of survey responses from the 2015 Behavioral Risk Factor Surveillance System (BRFSS), conducted by the Centers for Disease Control and Prevention (CDC). It is designed to support the creation of predictive models for diabetes risk. Diabetes is a widespread chronic condition in the United States, affecting millions and imposing a substantial financial burden. It impairs the body's ability to regulate blood glucose levels, potentially leading to a reduced quality of life and lifespan. Complications such as heart disease, vision loss, lower-limb amputation, and kidney disease are associated with persistently high blood sugar levels. Early diagnosis is crucial for implementing lifestyle changes and effective treatments. This dataset is particularly relevant given that millions of Americans have diabetes or prediabetes, with a significant proportion unaware of their risk. The burden of the disease disproportionately affects those of lower socioeconomic status.
Columns
- Diabetes_012: The target variable, indicating diabetes status with three classes: 0 = no diabetes or only during pregnancy, 1 = prediabetes, and 2 = diabetes.
- HighBP: Indicates if an individual has high blood pressure: 0 = no, 1 = yes.
- HighChol: Indicates if an individual has high cholesterol: 0 = no, 1 = yes.
- CholCheck: Records whether an individual had a cholesterol check in the past five years: 0 = no, 1 = yes.
- BMI: Body Mass Index, a measure of body fat based on height and weight.
- Smoker: Indicates if an individual has smoked at least 100 cigarettes in their lifetime (equivalent to five packs): 0 = no, 1 = yes.
- Stroke: Records if an individual has ever been told they had a stroke: 0 = no, 1 = yes.
- HeartDiseaseorAttack: Indicates if an individual has coronary heart disease (CHD) or myocardial infarction (MI): 0 = no, 1 = yes.
- PhysActivity: Records whether an individual engaged in any physical activity in the past 30 days, excluding job-related activities: 0 = no, 1 = yes.
- Fruits: Indicates if an individual consumes fruit one or more times per day: 0 = no, 1 = yes.
- Veggies: Indicates if an individual consumes vegetables one or more times per day: 0 = no, 1 = yes.
- HvyAlcoholConsump: Identifies heavy drinkers (adult men consuming more than 14 drinks per week and adult women consuming more than 7 drinks per week): 0 = no, 1 = yes.
- AnyHealthcare: Records whether an individual has any kind of health care coverage, including health insurance or prepaid plans like HMOs: 0 = no, 1 = yes.
- NoDocbcCost: Indicates if there was a time in the past 12 months when an individual needed to see a doctor but could not due to cost: 0 = no, 1 = yes.
- GenHlth: Self-reported general health status on a scale of 1-5, where 1 = excellent, 2 = very good, 3 = good, 4 = fair, and 5 = poor.
- MentHlth: Number of days in the past 30 days when mental health (including stress, depression, and emotional problems) was not good, ranging from 0 to 30.
- PhysHlth: Number of days in the past 30 days when physical health (including physical illness and injury) was not good, ranging from 0 to 30.
- DiffWalk: Indicates if an individual has serious difficulty walking or climbing stairs: 0 = no, 1 = yes.
- Sex: Sex of the respondent: 0 = female, 1 = male.
- Age: Age category based on a 13-level scale (e.g., 1 = 18-24, 9 = 60-64, 13 = 80 or older).
- Education: Education level on a scale of 1-6 (1 = Never attended school or only kindergarten, 6 = College 4 years or more).
- Income: Income level on a scale of 1-8 (1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more).
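The coded columns above can be turned back into readable labels before analysis. The following is a minimal sketch using only the codes spelled out in this listing for Diabetes_012 and GenHlth. The helper name `decode_row` is hypothetical, and the `int(float(...))` conversion assumes that the CSV may store codes as decimal strings such as "2.0"; adjust it if your copy stores plain integers.

```python
# Code-to-label mappings taken from the column descriptions in this listing.
DIABETES_012_LABELS = {
    0: "no diabetes or only during pregnancy",
    1: "prediabetes",
    2: "diabetes",
}
GENHLTH_LABELS = {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"}

# Columns this sketch knows how to decode (hypothetical helper table).
LABELLED_COLUMNS = {
    "Diabetes_012": DIABETES_012_LABELS,
    "GenHlth": GENHLTH_LABELS,
}


def decode_row(row):
    """Return a copy of `row` with known coded columns replaced by labels.

    `row` is a dict of column name -> raw value, e.g. one csv.DictReader row.
    Columns without a mapping are left untouched.
    """
    decoded = dict(row)
    for column, labels in LABELLED_COLUMNS.items():
        if column in decoded:
            # Assumption: codes may appear as "2", "2.0" or 2.
            decoded[column] = labels[int(float(decoded[column]))]
    return decoded


print(decode_row({"Diabetes_012": "2.0", "GenHlth": "3", "BMI": "28.0"}))
```

Age, Education, and Income are left out of the mapping on purpose, because this listing gives only a few example levels for those scales rather than the full set.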
Distribution
The dataset is primarily available in CSV format and includes three distinct files. The main file, diabetes_012_health_indicators_BRFSS2015.csv, contains 253,680 survey responses with 21 feature variables and a three-class target variable (Diabetes_012). This particular file exhibits class imbalance. Additionally, there is diabetes_binary_5050split_health_indicators_BRFSS2015.csv, a balanced dataset with 70,692 responses and the same 21 feature variables, but with a two-class target variable (Diabetes_binary). The third file, diabetes_binary_health_indicators_BRFSS2015.csv, also contains 253,680 responses with 21 features and a two-class target, but it is not balanced. The diabetes_012_health_indicators_BRFSS2015.csv file itself is 22.74 MB. This data was derived from an original BRFSS 2015 dataset that contained over 400,000 responses and 330 features, which has been cleaned and consolidated for this offering.
Usage
This dataset is ideal for:
- Building machine learning models to predict whether an individual has diabetes based on survey questions.
- Identifying the most predictive risk factors associated with diabetes.
- Developing strategies for early diagnosis and intervention by understanding population health indicators.
- Creating simplified questionnaires (short forms) from the BRFSS data, using feature selection techniques, to accurately predict diabetes risk or high-risk individuals.
- Public health research and policy formulation aimed at mitigating the impact of diabetes.
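As a first step towards the modelling and risk-factor uses listed above, it can help to check the class balance of the target and to compare diabetes prevalence across a binary indicator such as HighBP. The sketch below uses only the Python standard library. The function names `class_counts` and `prevalence_by_flag` are hypothetical, the inline sample rows are made up for illustration and are not real BRFSS data, and the decimal-string format of the values is an assumption.

```python
import csv
import io
from collections import Counter


def class_counts(rows, target="Diabetes_012"):
    """Count how many rows fall into each class of the target column."""
    return Counter(int(float(row[target])) for row in rows)


def prevalence_by_flag(rows, flag, target="Diabetes_012", positive=2):
    """Share of rows whose target equals `positive`, per value of a 0/1 column."""
    totals, hits = Counter(), Counter()
    for row in rows:
        group = int(float(row[flag]))
        totals[group] += 1
        if int(float(row[target])) == positive:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Tiny made-up sample in the same column layout (NOT real BRFSS data).
SAMPLE = """Diabetes_012,HighBP,BMI
0.0,0.0,24.0
2.0,1.0,31.0
0.0,1.0,27.0
1.0,0.0,29.0
2.0,1.0,35.0
0.0,0.0,22.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(class_counts(rows))
print(prevalence_by_flag(rows, "HighBP"))
```

To run the same checks on a downloaded file, replace the `io.StringIO(SAMPLE)` argument with a file opened via `open("diabetes_012_health_indicators_BRFSS2015.csv", newline="")`. A raw prevalence gap like this is only a descriptive screen; it does not adjust for age, income, or other columns, so it should not be read as a measure of predictive importance on its own.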
Coverage
The dataset's scope encompasses health-related behaviours and conditions within the United States, drawing from the 2015 BRFSS survey. It covers a range of demographic factors including age (18 to 80+), education levels, income brackets, and gender. The data reflects the state of health indicators and diabetes prevalence in 2015, making it suitable for analysing risk factors across various demographic groups in the US.
License
CC0: Public Domain
Who Can Use It
This dataset is suitable for:
- Data scientists and machine learning practitioners focusing on health prediction and classification tasks.
- Public health researchers and officials aiming to understand diabetes prevalence, risk factors, and inform public health campaigns or policy decisions.
- Healthcare professionals interested in population-level health trends and identifying high-risk groups for early intervention.
- Students and educators for learning and applying data analysis techniques in a real-world health context.
- Beginners in data science looking for a well-structured public health dataset.
Dataset Name Suggestions
- BRFSS 2015 Diabetes Health Indicators
- US Diabetes Risk Factors Survey Data
- Public Health Diabetes Prediction Data
- CDC BRFSS Diabetes Survey 2015
- Health Indicator Diabetes Dataset
Attributes
Original Data Source: Public Health Diabetes Prediction Data