Opendatabay APP

Imbalanced Breast Cancer Classification Data

Patient Health Records & Digital Health

Tags and Keywords

Cancer

Classification

Microcalcification

Mammography

Imbalanced

Free

About

This dataset addresses a critical cancer classification challenge involving a severely skewed class distribution. It focuses on detecting breast cancer from radiological scans, specifically targeting clusters of microcalcification, which appear bright on a mammogram. It was created by scanning images, segmenting them into candidate objects, and using computer vision techniques to derive features describing each object. The objective is a binary distinction between Microcalcification (the positive, minority class) and non-Microcalcification (the negative, majority class).

Columns

The dataset contains 7 columns: 6 features derived from segmented objects, plus the outcome variable:
  • Area: The measured area of the object, specified in pixels.
  • Grey Level: The average grey level observed across the object.
  • Gradient Strength: Measures the gradient strength of the pixels located along the object's perimeter.
  • Noise Fluctuation: The root mean square noise fluctuation within the object itself.
  • Contrast: Calculated as the average grey level of the object minus the average grey level of a two-pixel-wide border surrounding it.
  • Shape Descriptor: A low order moment used as a descriptor of the object's shape.
  • Microcalcification: The outcome variable indicating the presence or absence of microcalcification clusters ('1' for presence, '-1' for absence).
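As a minimal sketch of how these columns might be read, the snippet below parses a CSV with the standard library and maps the '1' / '-1' outcome to 1 / 0. The header names (`area`, `grey_level`, and so on) are assumptions made for illustration, since the listing does not state the exact CSV header, and the two sample rows are made up purely to show the layout.

```python
import csv
import io

# Hypothetical header names; the real CSV header may differ.
FEATURES = [
    "area", "grey_level", "gradient_strength",
    "noise_fluctuation", "contrast", "shape_descriptor",
]
TARGET = "microcalcification"


def parse_label(raw):
    """Map the '1' / '-1' outcome (possibly wrapped in quotes) to 1 / 0."""
    value = raw.strip().strip("'\"")
    if value == "1":
        return 1
    if value == "-1":
        return 0
    raise ValueError(f"unexpected label: {raw!r}")


def read_records(handle):
    """Return (feature_rows, labels) from an open CSV file handle."""
    rows, labels = [], []
    for record in csv.DictReader(handle):
        rows.append([float(record[name]) for name in FEATURES])
        labels.append(parse_label(record[TARGET]))
    return rows, labels


# Two made-up rows, only to show the expected layout (not real data).
sample = io.StringIO(
    "area,grey_level,gradient_strength,noise_fluctuation,contrast,"
    "shape_descriptor,microcalcification\n"
    "0.23,5.07,-0.28,0.83,-0.38,0.48,'-1'\n"
    "2.10,1.20,3.40,0.90,1.50,2.00,'1'\n"
)
X, y = read_records(sample)
```

For the downloaded file, `read_records(open(path, newline=""))` would be the equivalent call, with `FEATURES` and `TARGET` adjusted to match the actual header.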

Distribution 📈

The data is provided as a CSV file of approximately 839.19 kB. It contains about 11.2k records, all of them valid: every feature has zero missing or mismatched values. The class distribution is highly imbalanced, which is characteristic of real-world medical diagnostic problems: the non-Microcalcification class constitutes 98% of the records, while the Microcalcification class makes up the remaining 2%.
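To make the 98% / 2% split concrete, the sketch below computes inverse-frequency class weights, `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn uses for `class_weight="balanced"`. The label counts are illustrative round numbers chosen to match the stated percentages, not the dataset's actual counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative 98% / 2% split (assumed round numbers, not the real counts).
labels = [-1] * 9800 + [1] * 200
weights = balanced_class_weights(labels)
```

Under these assumed counts, each positive example is weighted 49 times more heavily than a negative one, so both classes contribute equally to a weighted loss.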

Usage

This resource is suited to building and testing machine learning models intended for clinical diagnostic support. Typical applications include:
  • Developing and benchmarking classification algorithms designed to handle extreme class imbalance.
  • Exploratory Data Analysis (EDA) focused on feature engineering for medical imaging data.
  • Research into computer-aided detection (CAD) systems for breast cancer.
  • Evaluating anomaly detection techniques where the positive case (Microcalcification) is rare.
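When benchmarking on data this skewed, accuracy alone can mislead. The sketch below computes accuracy, precision, recall, and F1 for the positive class and applies them to a trivial baseline that always predicts the negative class. The function name and the 98 / 2 toy labels are illustrative assumptions, not part of the dataset.

```python
def positive_class_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels with a 98% / 2% split; the baseline always predicts "absent".
y_true = [-1] * 98 + [1] * 2
y_pred = [-1] * 100
baseline = positive_class_metrics(y_true, y_pred)
```

Here the baseline reaches 0.98 accuracy while its recall on the Microcalcification class is 0.0, which is why recall, precision, or F1 on the minority class is usually the more informative target.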

Coverage

The dataset's scope is defined by features extracted from segmented candidate objects within mammography scans. The data focuses solely on morphological and intensity characteristics derived via computer vision techniques. The geographic locations and timeframes of the source scans are not specified.

License

CC0: Public Domain

Who Can Use It

This material is appropriate for:
  • Data Science Beginners: To practice fundamental classification modelling and feature analysis.
  • Intermediate Machine Learning Engineers: To experiment with sophisticated techniques (e.g., SMOTE, cost-sensitive learning) necessary for managing highly skewed medical data.
  • Healthcare Technology Researchers: To source validated feature sets for diagnostic tool development.
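For readers unfamiliar with SMOTE, the following is a simplified sketch of its core idea: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. It is not the imbalanced-learn implementation, and the function name, parameters, and toy points are assumptions for illustration only.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up two-feature minority points, purely for illustration.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like(minority, n_new=5)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points never leak into the evaluation data.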

Dataset Name Suggestions

  • Mammography Microcalcification Feature Set
  • Imbalanced Breast Cancer Classification Data
  • Radiological Microcalcification Detection Features
  • Microcalcification Object Segmentation Data

Attributes

Listing Stats

VIEWS

1

DOWNLOADS

0

LISTED

09/11/2025

REGION

GLOBAL

Universal Data Quality Score (UDQS)

5 / 5

VERSION

1.0
