Breast Cancer Prediction Dataset
About
This dataset supports the binary classification of breast cancer tumours as malignant (cancerous) or benign (non-cancerous). Breast cancer is a prevalent global health concern: it reportedly affected over 2.1 million people in 2015 and accounts for 25% of all cancer cases. Tumours can often be felt as lumps or seen on an X-ray. The dataset is intended for developing machine learning models that predict the tumour class, with Support Vector Machines (SVMs) named as one suitable approach. Typical work with it includes understanding the data, performing any necessary cleanup, building and fine-tuning classification algorithms, and comparing their evaluation metrics.
Columns
- id: A unique identifier for each record.
- diagnosis: The target variable, indicating the tumour type: 'M' for Malignant or 'B' for Benign.
- radius_mean: The mean radius of the cell nuclei, where radius is the average distance from the centre to points on the perimeter.
- texture_mean: The mean texture, measured as the standard deviation of grey-scale values.
- perimeter_mean: The mean perimeter of the cell nuclei.
- area_mean: The mean area of the cell nuclei.
- smoothness_mean: The mean smoothness, measured as the local variation in radius lengths.
- compactness_mean: The mean compactness, computed as perimeter^2 / area - 1.0.
- concavity_mean: The mean concavity, measured as the severity of concave portions of the contour.
- concave points_mean: The mean number of concave portions of the contour.
- symmetry_mean: The mean value of symmetry.
- fractal_dimension_mean: The mean fractal dimension ("coastline approximation" - 1).
- radius_se: The standard error of the radius.
- texture_se: The standard error of the texture.
- perimeter_se: The standard error of the perimeter.
- area_se: The standard error of the area.
- smoothness_se: The standard error of smoothness.
- compactness_se: The standard error of compactness.
- concavity_se: The standard error of concavity.
- concave points_se: The standard error of concave points.
- symmetry_se: The standard error of symmetry.
- fractal_dimension_se: The standard error of fractal dimension.
- radius_worst: The "worst" value for radius (the mean of the three largest values).
- texture_worst: The "worst" value for texture (the mean of the three largest values).
- perimeter_worst: The "worst" value for perimeter (the mean of the three largest values).
- area_worst: The "worst" value for area (the mean of the three largest values).
- smoothness_worst: The "worst" value for smoothness (the mean of the three largest values).
- compactness_worst: The "worst" value for compactness (the mean of the three largest values).
- concavity_worst: The "worst" value for concavity (the mean of the three largest values).
- concave points_worst: The "worst" value for concave points (the mean of the three largest values).
- symmetry_worst: The "worst" value for symmetry (the mean of the three largest values).
- fractal_dimension_worst: The "worst" value for fractal dimension (the mean of the three largest values).
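The column list follows a regular pattern: ten base measurements, each reported with a mean, se, and worst suffix, plus id and diagnosis. As a minimal sketch of that structure, the snippet below rebuilds the 32 names and checks a CSV header against them. The helper names expected_columns and missing_columns are hypothetical and not part of any library.

```python
# The ten base measurements, spelled as in the column list above
# (note the space in "concave points").
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
SUFFIXES = ["mean", "se", "worst"]


def expected_columns():
    """Return the 32 column names in the order listed above."""
    features = [f"{base}_{suffix}" for suffix in SUFFIXES for base in BASE_FEATURES]
    return ["id", "diagnosis"] + features


def missing_columns(header):
    """List the expected columns that are absent from a CSV header row."""
    present = set(header)
    return [name for name in expected_columns() if name not in present]


columns = expected_columns()
print(len(columns))
```

Passing the first row returned by csv.reader for the downloaded file to missing_columns should yield an empty list if the file matches the description here.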
Distribution
The dataset is provided as a CSV file named breast-cancer.csv, with a size of 124.57 kB. It contains 569 records and 32 columns. All columns are valid, with no mismatched or missing values reported. The diagnosis column, which is the target for classification, shows a distribution of 63% Benign (B) and 37% Malignant (M) tumours.
Usage
This dataset is ideal for:
- Developing and testing machine learning classification models to predict breast cancer type.
- Conducting data exploration and cleanup activities.
- Experimenting with various hyperparameter tuning techniques for classification algorithms.
- Comparing the performance and evaluation metrics of different classification models, such as SVMs.
- Educational purposes in data science and machine learning, particularly in the healthcare domain.
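When comparing evaluation metrics, it helps to know what a trivial model scores on this class balance. The sketch below is one way to compute class shares and basic binary metrics using only the standard library. The class counts (357 benign, 212 malignant) are an assumption taken from the published documentation of the Breast Cancer Wisconsin (Diagnostic) data, since this page gives only percentages. The function names and the "always benign" baseline are hypothetical illustrations.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the records, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def binary_metrics(y_true, y_pred, positive="M"):
    """Accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed class counts (see the note above): 357 benign, 212 malignant.
labels = ["B"] * 357 + ["M"] * 212
shares = class_distribution(labels)

# Hypothetical baseline that predicts benign for every record.
baseline = binary_metrics(labels, ["B"] * len(labels))
print(shares, baseline)
```

The baseline reaches roughly 63% accuracy while detecting no malignant cases at all, so recall on the M class is usually a more informative figure than accuracy when comparing models on this data.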
Coverage
The dataset is also known as the Breast Cancer Wisconsin (Diagnostic) Dataset, a name that points to its origin in Wisconsin. The source material gives no time range or demographic breakdown for the records.
License
CC0: Public Domain
Who Can Use It
This dataset is suitable for:
- Data Scientists and Machine Learning Engineers: For building, training, and evaluating predictive models for breast cancer diagnosis.
- Healthcare Researchers: To explore relationships between tumour characteristics and malignancy, potentially aiding in diagnostic research.
- Students and Educators: As a practical example for learning about binary classification, data preprocessing, and model evaluation in a real-world medical context.
- Developers: For creating diagnostic support systems or applications that require automated tumour classification.
Dataset Name Suggestions
- Breast Cancer Wisconsin Diagnostic Dataset
- Malignant-Benign Tumour Classification Data
- Breast Cancer Prediction Dataset
- Oncology Tumour Data for ML
Attributes
Original Data Source: Breast Cancer Prediction Dataset