Breast Cancer Classification Data

Patient Health Records & Digital Health

Tags and Keywords

Health

Cancer

Healthcare

Regression

Classification

Free

About

This dataset is a cleaned version of the Breast Cancer Wisconsin (Original) dataset, prepared for logistic regression analysis. It contains real patient data and supports classifying the dependent variable as either malignant or benign. It has no missing values, so it is ready for analysis [1].

Columns

The dataset contains 10 columns: 9 independent variables describing cell characteristics and the dependent 'Class' variable. For each column, Label Count reports the number of rows in the lowest and highest value bins:
  • Clump Thickness: Describes the thickness of cell clumps.
    • Label Count: Ranges from 1.00 - 1.90 (139 counts) to 9.10 - 10.00 (69 counts) [2].
    • Mean: 4.44, Std. Deviation: 2.82 [2].
    • Quantiles: Min 1, 25% 2, 50% 4, 75% 6, Max 10 [2].
  • Uniformity of Cell Size: Measures the uniformity in cell size.
    • Label Count: Ranges from 1.00 - 1.90 (373 counts) to 9.10 - 10.00 (67 counts) [2, 3].
    • Mean: 3.15, Std. Deviation: 3.06 [3].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 5, Max 10 [3].
  • Uniformity of Cell Shape: Indicates the uniformity in cell shape.
    • Label Count: Ranges from 1.00 - 1.90 (346 counts) to 9.10 - 10.00 (58 counts) [3].
    • Mean: 3.22, Std. Deviation: 2.99 [3, 4].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 5, Max 10 [4].
  • Marginal Adhesion: Reflects the degree of cell adhesion to each other.
    • Label Count: Ranges from 1.00 - 1.90 (393 counts) to 9.10 - 10.00 (55 counts) [4].
    • Mean: 2.83, Std. Deviation: 2.86 [4].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 4, Max 10 [4].
  • Single Epithelial Cell Size: Size of a single epithelial cell.
    • Label Count: Ranges from 1.00 - 1.90 (44 counts) to 9.10 - 10.00 (31 counts) [4, 5].
    • Mean: 3.23, Std. Deviation: 2.22 [5].
    • Quantiles: Min 1, 25% 2, 50% 2, 75% 4, Max 10 [5].
  • Bare Nuclei: Describes the presence of bare nuclei.
    • Label Count: Ranges from 1.00 - 1.90 (402 counts) to 9.10 - 10.00 (132 counts) [5].
    • Mean: 3.54, Std. Deviation: 3.64 [5].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 6, Max 10 [5].
  • Bland Chromatin: Refers to the chromatin's texture.
    • Label Count: Ranges from 1.00 - 1.90 (150 counts) to 9.10 - 10.00 (20 counts) [6].
    • Mean: 3.45, Std. Deviation: 2.45 [6].
    • Quantiles: Min 1, 25% 2, 50% 3, 75% 5, Max 10 [6].
  • Normal Nucleoli: Indicates the normality of nucleoli.
    • Label Count: Ranges from 1.00 - 1.90 (432 counts) to 9.10 - 10.00 (60 counts) [6, 7].
    • Mean: 2.87, Std. Deviation: 3.05 [7].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 4, Max 10 [7].
  • Mitoses: Counts the number of mitoses.
    • Label Count: Ranges from 1.00 - 1.90 (563 counts) to 9.10 - 10.00 (14 counts) [7].
    • Mean: 1.6, Std. Deviation: 1.73 [7].
    • Quantiles: Min 1, 25% 1, 50% 1, 75% 1, Max 10 [7].
  • Class: The dependent variable, indicating the diagnosis. In the original Wisconsin coding, 2 denotes benign and 4 denotes malignant.
    • Label Count: 2.00 - 2.20 (444 counts) and 3.80 - 4.00 (239 counts) [7].
    • Mean: 2.7, Std. Deviation: 0.95 [8].
    • Quantiles: Min 2, 25% 2, 50% 2, 75% 4, Max 4 [8].
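
The Class statistics above can be cross-checked from the two label counts alone. The following is a minimal sketch, not part of the listing: the names `CLASS_COUNTS`, `summarize`, and `encode_class` are hypothetical, and the 0/1 encoding assumes the original Wisconsin coding in which 2 is benign and 4 is malignant.

```python
import math

# Class label counts as reported in the listing (2 and 4 are the only values).
CLASS_COUNTS = {2: 444, 4: 239}


def summarize(counts):
    """Return (n, mean, population std) for a {value: count} table."""
    n = sum(counts.values())
    mean = sum(value * count for value, count in counts.items()) / n
    variance = sum(count * (value - mean) ** 2 for value, count in counts.items()) / n
    return n, mean, math.sqrt(variance)


def encode_class(label):
    """Map the 2/4 coding to 0/1 (assumption: 4 is the positive, malignant class)."""
    if label not in (2, 4):
        raise ValueError(f"unexpected class label: {label!r}")
    return 1 if label == 4 else 0


n, mean, std = summarize(CLASS_COUNTS)
print(n, round(mean, 2), round(std, 2))  # 683 2.7 0.95
```

The recomputed values agree with the reported mean of 2.7 and standard deviation of 0.95, and the total of 683 matches the number of valid observations.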

Distribution

The dataset is provided as a CSV file of 15.02 kB with 10 columns [2]. The original dataset contained 699 observations [1]; the cleaned version retains 683 valid observations in every column [2-8], a difference of 16 rows, and has no missing values [1].
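
Before modelling, the claims above (10 columns, no missing values) can be verified on the downloaded file. This is a hedged sketch using only the standard library. The header names in `FEATURES` are an assumption taken from the column descriptions in this listing and may differ from the actual CSV header; `validate` is a hypothetical helper.

```python
import csv
import io

# Assumed header names, copied from the column list in this listing.
FEATURES = [
    "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
    "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
    "Bland Chromatin", "Normal Nucleoli", "Mitoses",
]
COLUMNS = FEATURES + ["Class"]


def validate(csv_file):
    """Return the number of data rows; raise ValueError on the first bad row."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        for name in FEATURES:
            value = row[name]
            # Every feature should be an integer score between 1 and 10.
            if value is None or not value.strip().isdigit() or not 1 <= int(value) <= 10:
                raise ValueError(f"line {line_no}: bad value {value!r} in {name!r}")
        if row["Class"] not in ("2", "4"):
            raise ValueError(f"line {line_no}: bad class {row['Class']!r}")
        n_rows += 1
    return n_rows


# Two made-up rows in the expected layout, used only to exercise the check.
sample = io.StringIO(
    ",".join(COLUMNS) + "\n5,1,1,1,2,1,3,1,1,2\n8,10,10,8,7,10,9,7,1,4\n"
)
print(validate(sample))
```

On the real file, opened with `open(path, newline="")`, the function would be expected to return 683 if the listing's description holds.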

Usage

This dataset is well suited to logistic regression analysis, in particular to classifying breast cancer as malignant or benign [1]. It can be used to develop and test predictive models for medical diagnosis.
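
To illustrate the intended use, here is a minimal logistic regression sketch trained by batch gradient descent in plain Python. It is not the listing's method. The toy rows are made up, only two of the nine features are used, and the helper names (`fit_logistic`, `predict`, `scale`) are hypothetical. It assumes the original Wisconsin coding in which 2 is benign and 4 is malignant.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scale(row):
    """Scale 1-10 scores to roughly the 0-1 range."""
    return [value / 10 for value in row]


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, row):
    """Return 4 (malignant) or 2 (benign) for a row of raw 1-10 scores."""
    z = b + sum(wj * xj for wj, xj in zip(w, scale(row)))
    return 4 if sigmoid(z) >= 0.5 else 2


# Made-up toy rows: (Clump Thickness, Uniformity of Cell Size), and Class.
rows = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 7], [10, 10]]
classes = [2, 2, 2, 4, 4, 4]

X = [scale(row) for row in rows]
y = [1 if c == 4 else 0 for c in classes]  # encode 4 as the positive class
w, b = fit_logistic(X, y)
print([predict(w, b, row) for row in rows])
```

For real work, the same fit would normally be done on all nine features with a held-out test set, using an established library rather than a hand-written loop.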

Coverage

The dataset is derived from the Breast Cancer Wisconsin (Original) dataset [1] and contains no missing values [1]. The provided sources give no geographic, time range, or demographic scope details.

License

CC0: Public Domain

Who Can Use It

This dataset is suitable for:
  • Data scientists and machine learning practitioners developing classification models.
  • Researchers in the field of oncology and medical diagnostics.
  • Students learning about logistic regression and binary classification.

Dataset Name Suggestions

  • Breast Cancer Classification Data
  • Wisconsin Breast Cancer Prediction Dataset (Cleaned)
  • Malignant/Benign Breast Cancer Data
  • Logistic Regression Breast Cancer Dataset

Attributes

Original Data Source: Breast Cancer Classification Data

Listing Stats

Views: 0
Downloads: 0
Listed: 08/08/2025
Region: Global
Universal Data Quality Score (UDQS): 5 / 5
Version: 1.0
