TCGA Brain Glioma Signatures for Classification
Patient Health Records & Digital Health
Free
About
Gliomas represent the most frequently occurring primary tumors found in the brain. These tumors are categorised, or graded, as either Lower-Grade Glioma (LGG) or Glioblastoma Multiforme (GBM) based on specific histological and imaging criteria. Accurate grading is significantly influenced by both clinical factors and molecular or mutation signatures, although molecular testing can often be expensive. This resource features data drawn from the TCGA-LGG and TCGA-GBM brain glioma projects, consolidating information on the 20 most frequently mutated genes alongside 3 crucial clinical features. This aggregation of data supports the prediction task of determining a patient’s glioma grade (LGG or GBM) and is designed to assist in finding the optimal, cost-effective subset of features for grading processes.
Columns
The dataset includes 27 columns, offering a mix of clinical data and genetic mutation status.
- Grade: The dependent variable, indicating the glioma classification (LGG, 58%; GBM, 42%).
- Project: Indicates the source TCGA project (TCGA-LGG or TCGA-GBM).
- Case_ID: A unique identifier for each patient case.
- Gender: Clinical feature (Male, 58%; Female, 42%).
- Age_at_diagnosis: Clinical feature, recorded as text rather than as a numeric value.
- Primary_Diagnosis: Includes specific diagnoses like Glioblastoma (42%) and Astrocytoma, anaplastic (15%).
- Race: Demographic data (White, 89%; Black or African American, 7%).
- Mutation Status Columns (Examples): Twenty molecular features are included, recorded as NOT_MUTATED or MUTATED. Key genes include IDH1 (48% MUTATED), TP53 (41% MUTATED), ATRX (26% MUTATED), PTEN (17% MUTATED), EGFR (13% MUTATED), and CIC (13% MUTATED). Less frequently mutated genes (around 3%) include FAT4, IDH2, and PDGFRA.
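Most of the columns above are categorical text, so they usually need numeric encoding before modelling. The following is a minimal sketch of one way to do that, not the dataset authors' method. The sample rows are made up, the helper names `parse_age` and `encode_row` are hypothetical, the 0/1 assignments are arbitrary choices, and the age format "N years M days" is an assumption that this page does not confirm.

```python
import csv
import io
import re

# Tiny made-up sample in the column layout described above (not real TCGA rows).
SAMPLE_CSV = """Grade,Gender,Age_at_diagnosis,IDH1,TP53
LGG,Male,51 years 108 days,MUTATED,NOT_MUTATED
GBM,Female,63 years 12 days,NOT_MUTATED,MUTATED
"""


def parse_age(text):
    """Convert an age string such as '51 years 108 days' to fractional years.

    The 'N years M days' format is an assumption; returns None when no
    year count is found.
    """
    years = re.search(r"(\d+)\s*year", text)
    days = re.search(r"(\d+)\s*day", text)
    if years is None:
        return None
    extra_days = int(days.group(1)) if days else 0
    return int(years.group(1)) + extra_days / 365.25


def encode_row(row, gene_columns):
    """Map one CSV row to numeric values (the 0/1 coding is an arbitrary choice)."""
    encoded = {
        "Grade": 1 if row["Grade"] == "GBM" else 0,
        "Gender": 1 if row["Gender"] == "Female" else 0,
        "Age_at_diagnosis": parse_age(row["Age_at_diagnosis"]),
    }
    for gene in gene_columns:
        encoded[gene] = 1 if row[gene] == "MUTATED" else 0
    return encoded


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = [encode_row(r, ["IDH1", "TP53"]) for r in reader]
```

To apply the same idea to the real file, the `io.StringIO` sample would be replaced by an open file handle and the gene list extended to all twenty mutation columns.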
Distribution
The data is presented in a tabular format within a CSV file named TCGA_GBM_LGG_Mutations_all.csv, which is approximately 265.05 kB in size. The data contains 862 valid records, covering all 27 columns. For these records, there are no reported mismatched or missing values. The underlying data creation was funded by the NCI via The Cancer Genome Atlas (TCGA) Project.
Usage
This resource is ideally suited for classification tasks aimed at distinguishing between LGG and GBM patients. It is excellent for machine learning researchers working on feature selection, with a primary objective being to identify the most effective subset of mutation genes and clinical features that can reduce diagnostic costs while maintaining high performance. General exploratory data analysis in the field of healthcare and oncology is also a strong use case. Ensembling methods are recommended for model development.
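As a starting point for the feature selection task described above, binary mutation features can be ranked by their information gain with respect to the grade label. This is a minimal sketch of one common ranking technique, not a method prescribed by the dataset. The gene names `GENE_A`, `GENE_B`, and `GENE_C` and all values are made-up toy data, not TCGA results.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Made-up toy data: 1 = MUTATED or GBM, 0 = NOT_MUTATED or LGG.
grades = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "GENE_A": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the two grades perfectly
    "GENE_B": [1, 1, 0, 0, 1, 0, 0, 0],  # weakly informative
    "GENE_C": [1, 0, 1, 0, 1, 0, 1, 0],  # carries no information about grade
}

# Highest information gain first.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], grades),
    reverse=True,
)
```

A ranking like this scores each feature in isolation, so it ignores interactions between genes. A cost-aware selection would typically follow it with a wrapper method, such as forward selection evaluated by cross-validation.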
Coverage
The dataset focuses on clinical and genomic signatures relevant to brain glioma grading, derived specifically from two projects of The Cancer Genome Atlas (TCGA): TCGA-LGG and TCGA-GBM. The scope includes patient demographic information (gender, age at diagnosis, and race). The patient population is predominantly White (89%).
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Machine Learning Researchers and Data Scientists: To develop and validate predictive models for cancer classification.
- Clinical Informaticians: To assess the utility of molecular markers in clinical decision support systems.
- Healthcare Analysts: To perform cost-benefit analysis regarding genetic testing based on feature importance ranking.
- Biomedical Students: For educational projects involving genomics, classification, and feature engineering.
Dataset Name Suggestions
- Glioma Grade Prediction: Clinical and Molecular Factors
- TCGA Brain Glioma Signatures for Classification
- LGG/GBM Mutation and Clinical Features Dataset
Attributes
Original Data Source: TCGA Brain Glioma Signatures for Classification