Smoker Lung Cancer Prediction
Public Health & Epidemiology
Free
About
This dataset provides a focused collection of demographic and medical information for predicting lung cancer in a population of current and former smokers. It represents a subset of data from the US National Lung Screening Trial (NLST), where participants were observed over a seven-year period with annual lung cancer screenings. The data is designed to facilitate the study of relationships between smoking history, individual characteristics, and the incidence and progression of lung cancer. It specifically excludes non-smokers, concentrating on individuals with a history of tobacco use.
Columns
- pid: An anonymous identifier assigned to each participant, ensuring privacy.
- age: The participant's age at the commencement of the trial.
- gender: Indicates the participant's biological sex, categorised as Male or Female.
- race: Describes the racial background of the participant.
- smoker: Denotes the participant's smoking status, either 'Former' (defined as having quit within the last 15 years) or 'Current'.
- days_to_cancer: The number of days from the start of the trial until lung cancer was first detected. Most values in this column are missing because most participants did not develop cancer during the trial.
- stage_of_cancer: Specifies the clinical stage of the cancer when it was first observed. Like days_to_cancer, this column is missing for participants who remained cancer-free.
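Because a missing days_to_cancer value means no cancer was detected, a binary outcome label can be derived from that column. The following is a minimal sketch using only the Python standard library. The sample rows are invented for illustration, and the way missing values are encoded in the real file (empty string, NA, or similar) is an assumption that should be checked against lung_cancer.csv before use.

```python
import csv
import io

# Hypothetical rows that mimic the column layout described above.
# The values are invented for illustration and are NOT taken from the dataset.
SAMPLE_CSV = """pid,age,gender,race,smoker,days_to_cancer,stage_of_cancer
1,70,Male,White,Current,,
2,66,Female,White,Former,455,IA
3,61,Male,Black or African-American,Former,NA,NA
"""

# Assumed encodings of a missing value; adjust after inspecting the real file.
MISSING = {"", "NA", "NaN", "nan"}


def load_rows(handle):
    """Read rows and add a boolean had_cancer label derived from days_to_cancer."""
    rows = []
    for row in csv.DictReader(handle):
        raw = (row.get("days_to_cancer") or "").strip()
        is_missing = raw in MISSING
        row["had_cancer"] = not is_missing
        row["days_to_cancer"] = None if is_missing else int(float(raw))
        row["age"] = int(row["age"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
cases = sum(r["had_cancer"] for r in rows)
print(f"{cases} of {len(rows)} sample rows have a cancer record")
```

To run this against the real file, replace the in-memory sample with an opened file handle, for example `open("lung_cancer.csv", newline="")`.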
Distribution
The dataset is provided in CSV format (specifically, lung_cancer.csv), with a file size of approximately 1.76 MB. It comprises seven distinct columns. The dataset includes data for roughly 53,400 individuals. It is important to note that columns such as days_to_cancer and stage_of_cancer contain relevant data for about 2,000 records, highlighting the prevalence of participants who did not develop cancer during the observation period.
Usage
This dataset is well-suited for a variety of analytical and predictive tasks. It is ideal for developing machine learning models aimed at lung cancer risk prediction. Other applications include:
- Conducting demographic analyses of smokers.
- Investigating the natural history and progression of lung cancer.
- Exploring the influence of demographic factors such as age, gender, and smoking status on cancer incidence.
- Facilitating biostatistical research into cancer outcomes.
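As a starting point for exploring how demographic factors relate to cancer incidence, the crude proportion of participants with a cancer record can be compared across groups. The sketch below assumes each record already carries a boolean had_cancer field (a hypothetical name, derived from whether days_to_cancer is present), and it uses invented records rather than real data. It ignores follow-up time and censoring, so it is a descriptive summary only and not a substitute for survival analysis. Note also that with about 2,000 cancer records among roughly 53,400 participants, the positive class is under 4% of the data, which matters when evaluating any predictive model.

```python
from collections import defaultdict


def incidence_by(records, key):
    """Return {group: share of records in that group with had_cancer True}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [cases, total]
    for rec in records:
        group = rec[key]
        if rec["had_cancer"]:
            counts[group][0] += 1
        counts[group][1] += 1
    return {group: cases / total for group, (cases, total) in counts.items()}


# Invented records for illustration; not drawn from the dataset.
records = [
    {"smoker": "Current", "gender": "Male", "had_cancer": True},
    {"smoker": "Current", "gender": "Female", "had_cancer": False},
    {"smoker": "Former", "gender": "Male", "had_cancer": False},
    {"smoker": "Former", "gender": "Female", "had_cancer": False},
]

by_smoker = incidence_by(records, "smoker")
by_gender = incidence_by(records, "gender")
print(by_smoker)
print(by_gender)
```

The same function works for any categorical column, such as race, or for age after binning it into ranges.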
Coverage
The data originates from the US National Lung Screening Trial (NLST), focusing on participants within the United States. The observation period spans seven years, during which participants underwent annual lung cancer testing. The demographic scope covers a diverse range of ages (from 43 to 79 years old), with a gender distribution of approximately 59% Male and 41% Female. Racially, the cohort is predominantly White (91%), with smaller proportions of Black or African-American and other racial groups. Crucially, the dataset pertains exclusively to current and former smokers.
License
CC0: Public Domain
Who Can Use It
- Medical Researchers: For epidemiological studies, understanding risk factors, and contributing to public health initiatives.
- Data Scientists and Machine Learning Engineers: To build predictive models for disease diagnosis and risk assessment.
- Academics and Students: For educational purposes, research projects, and case studies in health data analytics.
- Public Health Agencies: To inform policy decisions and develop targeted health interventions.
Dataset Name Suggestions
- Smoker Lung Cancer Prediction
- NLST Lung Cancer Risk Factors
- Demographic Lung Cancer Insights
- Smoking History Lung Cancer Data
- Cancer Prediction for Smokers
Attributes
Original Data Source: Smoker Lung Cancer Prediction