Opendatabay APP

Smoking and Drinking Prediction Data

Public Health & Epidemiology

Tags and Keywords

Health

Smoking

Drinking

Body

Korea

Classification

Free

About

This dataset, collected from the National Health Insurance Service in Korea, contains various body signal metrics alongside information on smoking and drinking habits. Its primary purpose is to facilitate the analysis of body signals and the classification of individuals as smokers or drinkers. All personal and sensitive information has been excluded, making it suitable for research and machine learning applications aimed at predicting and understanding health behaviours based on physiological data.

Columns

  • Sex: Indicates the biological sex of the individual, with values 'male' and 'female'. Approximately 53% of records are male and 47% are female.
  • Age: Age of the individual in years, rounded up to 5-year intervals. Ages range from 20 to 85 years, with a mean of 47.6 years.
  • Height: Measured in centimetres [cm], rounded up to 5 cm. Heights range from 130 to 190 cm, with a mean height of 162 cm.
  • Weight: Measured in kilograms [kg]. Weights range from 25 to 140 kg, with a mean weight of 63.3 kg.
  • Waistline: Measured in centimetres [cm]. Recorded values range from 8 to 999 cm, with a mean of 81.2 cm. The extremes are physiologically implausible and likely reflect placeholder values or entry errors.
  • Sight_left: Represents left eyesight. Values range from 0.1 to 9.9, with a mean of 0.98.
  • Sight_right: Represents right eyesight. Values range from 0.1 to 9.9, with a mean of 0.98.
  • Hear_left: Indicates left hearing, where 1 is normal and 2 is abnormal. Approximately 96.8% of records show normal hearing.
  • Hear_right: Indicates right hearing, where 1 is normal and 2 is abnormal. Approximately 96.9% of records show normal hearing.
  • SBP: Systolic blood pressure [mmHg]. Readings range from 67 to 273 mmHg, with a mean of 122 mmHg.
  • DBP: Diastolic blood pressure [mmHg]. Readings range from 32 to 185 mmHg, with a mean of 76.1 mmHg.
  • BLDS: Fasting blood glucose, also labelled FSG [mg/dL]. Values range from 25 to 852 mg/dL, with a mean of 100 mg/dL.
  • Tot_chole: Total cholesterol [mg/dL]. Values range from 30 to 2340 mg/dL, with a mean of 196 mg/dL.
  • HDL_chole: HDL cholesterol [mg/dL]. Values range from 1 to 8110 mg/dL, with a mean of 56.9 mg/dL.
  • LDL_chole: LDL cholesterol [mg/dL]. Values range from 1 to 5120 mg/dL, with a mean of 113 mg/dL.
  • Triglyceride: Triglyceride [mg/dL]. Values range from 1 to 9490 mg/dL, with a mean of 132 mg/dL.
  • Hemoglobin: Hemoglobin [g/dL]. Values range from 1 to 25 g/dL, with a mean of 14.2 g/dL.
  • Urine_protein: Protein in urine, categorised from 1 (-) to 6 (+4). The vast majority of records (94.3%) are category 1 (-).
  • Serum_creatinine: Serum (blood) creatinine [mg/dL]. Values range from 0.1 to 98 mg/dL, with a mean of 0.86 mg/dL.
  • SGOT_AST: SGOT (glutamate-oxaloacetate transaminase), also known as AST (aspartate transaminase) [IU/L]. Values range from 1 to 10000 IU/L, with a mean of 26 IU/L.
  • SGOT_ALT: ALT (Alanine transaminase) [IU/L]. Values range from 1 to 7210 IU/L, with a mean of 25.8 IU/L.
  • Gamma_GTP: γ-glutamyl transpeptidase [IU/L]. Values range from 1 to 999 IU/L, with a mean of 37.1 IU/L.
  • SMK_stat_type_cd: Smoking state, categorised as 1 (never), 2 (used to smoke but quit), or 3 (still smoke). Approximately 60.7% are never smokers, 17.6% used to smoke, and 21.6% still smoke.
  • DRK_YN: Drinker or not, a boolean field indicating whether the individual drinks. The classes are nearly balanced, with approximately 50% true (drinker) and 50% false (not a drinker).
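
Several of the ranges listed above include extreme maxima (for example, a waistline of 999 cm or HDL cholesterol of 8110 mg/dL), so a screening step before modelling may be useful. The following is a minimal sketch only: the bounds in PLAUSIBLE_BOUNDS and the helper flag_implausible are hypothetical and are not part of the dataset documentation, and the column names are assumed to match the listing above (the headers in the actual CSV may differ in case or spelling).

```python
# Illustrative plausibility bounds. These thresholds are assumptions chosen
# for the example, not values published with the dataset.
PLAUSIBLE_BOUNDS = {
    "Waistline": (40.0, 200.0),  # cm
    "HDL_chole": (1.0, 300.0),   # mg/dL
}


def flag_implausible(record, bounds=PLAUSIBLE_BOUNDS):
    """Return the sorted names of columns whose values fall outside the bounds.

    `record` is a mapping of column name to a number or numeric string.
    Columns missing from the record are skipped.
    """
    flagged = []
    for column, (low, high) in bounds.items():
        value = record.get(column)
        if value is None:
            continue
        if not (low <= float(value) <= high):
            flagged.append(column)
    return sorted(flagged)


# A made-up record using the maximum waistline and mean HDL reported above.
example = {"Waistline": "999", "HDL_chole": "56.9"}
print(flag_implausible(example))  # ['Waistline']
```

Whether flagged rows should be dropped, capped, or treated as missing depends on the analysis; the sketch only identifies them.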

Distribution

The dataset is provided as a CSV file (smoking_driking_dataset_Ver01.csv) with a size of 109.56 MB. It contains 24 columns and includes over 991,000 valid records.
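
As a quick sanity check after download, the column and record counts stated above can be compared against the file. The sketch below uses only the Python standard library; summarize_csv is a hypothetical helper, and the in-memory sample (including its values) is made up so the example runs without the real file.

```python
import csv
import io

EXPECTED_COLUMNS = 24  # column count stated in the listing


def summarize_csv(handle):
    """Return (header, row_count) for an open CSV text handle."""
    reader = csv.reader(handle)
    header = next(reader)                 # first line holds the column names
    row_count = sum(1 for _ in reader)    # remaining lines are data rows
    return header, row_count


# Tiny made-up stand-in for the real file, for illustration only.
sample = io.StringIO("Sex,Age,DRK_YN\nmale,35,true\nfemale,40,false\n")
header, rows = summarize_csv(sample)
print(len(header), rows)  # 3 2

# With the real file, something like the following should apply:
# with open("smoking_driking_dataset_Ver01.csv", newline="") as f:
#     header, rows = summarize_csv(f)
#     print(len(header) == EXPECTED_COLUMNS, rows)
```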

Usage

This dataset is ideal for:
  • Developing predictive models to classify individuals as smokers or drinkers based on their body signal data.
  • Conducting in-depth analysis of the relationship between various body signals and smoking/drinking habits.
  • Building binary (DRK_YN) or three-class (SMK_stat_type_cd) classification models in the health and medicine domains.
  • Researching public health trends and risk factors related to lifestyle choices.
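
When evaluating classifiers for these tasks, it can help to compare against a majority-class baseline, since the smoking classes are imbalanced. The sketch below is illustrative: majority_baseline is a hypothetical helper, and the label list is synthetic, built to mirror the approximate SMK_stat_type_cd shares given in the Columns section rather than drawn from the actual data.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most_common_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Synthetic labels approximating the listed shares:
# 60.7% never smoked (1), 17.6% quit (2), 21.6% still smoke (3).
labels = [1] * 607 + [2] * 176 + [3] * 216
label, accuracy = majority_baseline(labels)
print(label, round(accuracy, 3))  # 1 0.608
```

A smoking-status model would need to exceed roughly 61% accuracy to improve on always predicting "never", whereas for DRK_YN the equivalent baseline is close to 50%.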

Coverage

The dataset's geographic scope is Korea, as it was collected from the National Health Insurance Service there. The time range of data collection is not specified. The data includes age (20-85 years) and sex (male/female), enabling analysis across these demographic groups.

License

CC BY-NC-SA 4.0

Who Can Use It

This dataset is suitable for:
  • Data Scientists and Machine Learning Engineers: For building and testing classification models to predict health behaviours.
  • Public Health Researchers: To analyse health trends, identify correlations between body signals and lifestyle, and inform public health initiatives.
  • Medical Professionals and Researchers: For studying the impact of smoking and drinking on various physiological markers.
  • Students and Academics: As a practical resource for learning about health data analysis, classification, and predictive modelling.

Dataset Name Suggestions

  • Korean Health Habits and Body Signals Dataset
  • Smoking and Drinking Prediction Data
  • Health Behaviour Classification Dataset
  • Korean Biomedical Lifestyle Data
  • Physiological Markers of Habits

Attributes

Listing Stats

  • Views: 1
  • Downloads: 1
  • Listed: 14/07/2025
  • Region: Global
  • Universal Data Quality Score (UDQS): 5 / 5
  • Version: 1.0
