Clinical Variant Classification Predictor
About
This dataset supports predicting whether a human genetic variant will have conflicting clinical classifications. ClinVar is a public resource of variant annotations, in which clinical laboratories classify variants, usually manually, into categories such as benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. When different laboratories classify the same variant differently, clinicians and researchers can be left unsure how to interpret that variant's impact on a patient's disease.
The problem is framed as a binary classification task in which each record represents one genetic variant. A value of '0' indicates consistent classifications, while '1' signifies conflicting classifications. The dataset has been curated to include only variants with multiple classifications; variants with a single submission were removed from the original ClinVar .vcf file. The dataset is intended to encourage further exploration of machine learning in genomics, particularly the feature engineering needed to predict the target with confidence.
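Because the target is binary, a useful first check is how the two classes are balanced. The following is a minimal sketch using only the Python standard library. It assumes the file has a header row containing a CLASS column with values '0' and '1', as described here. The function name class_balance and the toy rows are illustrative and are not real ClinVar records.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file):
    """Count records per CLASS label ('0' = consistent, '1' = conflicting).

    csv_file is any open text file or file-like object with a header row.
    Returns the label counts and the fraction of conflicting records.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row["CLASS"] for row in reader)
    total = sum(counts.values())
    # Fraction of conflicting variants; 0.0 if the file has no records.
    conflicting_rate = counts.get("1", 0) / total if total else 0.0
    return counts, conflicting_rate


# Toy rows for illustration only; these are not real ClinVar records.
toy_csv = io.StringIO(
    "CHROM,POS,REF,ALT,CLASS\n"
    "1,1000,A,G,0\n"
    "1,2000,C,T,1\n"
    "2,3000,G,A,0\n"
    "X,4000,T,C,0\n"
)
counts, rate = class_balance(toy_csv)
print(dict(counts), rate)
```

With the real file, the same function could be called as class_balance(open("clinvar_conflicting.csv", newline="")). If the classes turn out to be imbalanced, plain accuracy may be a misleading metric for this task.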
Columns
- CHROM: Chromosome on which the variant is located.
- POS: Position of the variant on the chromosome.
- REF: Reference allele.
- ALT: Alternate allele.
- AF_ESP: Allele frequencies sourced from GO-ESP.
- AF_EXAC: Allele frequencies sourced from ExAC.
- AF_TGP: Allele frequencies sourced from the 1000 Genomes Project.
- CLNDISDB: Tag-value pairs providing the disease database name and identifier (e.g., OMIM:NNNNNN).
- CLNDISDBINCL: For included variants, tag-value pairs for disease database name and identifier.
- CLNDN: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB.
- CLNDNINCL: For included variants, ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB.
- CLNHGVS: Top-level (primary assembly, alt, or patch) HGVS expression.
- CLNSIGINCL: Clinical significance for a haplotype or genotype that includes this variant, reported as pairs of VariationID:clinical significance.
- CLNVC: Type of variant (e.g., single_nucleotide_variant, Deletion).
- CLNVI: The variant's clinical sources reported as tag-value pairs of database and variant identifier.
- MC: Comma-separated list of molecular consequence in the format of Sequence Ontology ID|molecular_consequence.
- ORIGIN: Allele origin, which can include values like unknown, germline, somatic, inherited, paternal, maternal, de-novo, biparental, uniparental, not-tested, tested-inconclusive, or other.
- SSR: Variant Suspect Reason Codes, such as unspecified, Paralog, byEST, oldAlign, Para_EST, 1kg_failed, or other.
- CLASS: The target binary class, where '0' signifies no conflicting submissions and '1' signifies conflicting submissions.
- Allele: The variant allele used to calculate the consequence.
- Consequence: The type of consequence, as defined by Ensembl.
- IMPACT: The impact modifier for the consequence type (e.g., MODERATE, LOW).
- SYMBOL: Gene Name.
- Feature_type: Type of feature, typically Transcript, RegulatoryFeature, or MotifFeature.
- Feature: Ensembl stable ID of the feature.
- BIOTYPE: Biotype of the transcript or regulatory feature (e.g., protein_coding).
- EXON: The exon number (out of the total number).
- INTRON: The intron number (out of the total number).
- cDNA_position: Relative position of the base pair in the cDNA sequence.
- CDS_position: Relative position of the base pair in the coding sequence.
- Protein_position: Relative position of the amino acid in the protein.
- Amino_acids: Reference and variant amino acids. Only provided if the variant affects the protein-coding sequence.
- Codons: The alternative codons with the variant base in upper case.
- DISTANCE: Shortest distance from the variant to the transcript.
- STRAND: Defined as + (forward) or - (reverse).
- BAM_EDIT: Indicates success or failure of editing using a BAM file (e.g., OK).
- SIFT: The SIFT prediction and/or score, typically given as prediction(score).
- PolyPhen: The PolyPhen prediction and/or score.
- MOTIF_NAME: The source and identifier of a transcription factor binding profile aligned at this position.
- MOTIF_POS: The relative position of the variation in the aligned TFBP.
- HIGH_INF_POS: A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP).
- MOTIF_SCORE_CHANGE: The difference in motif score between the reference and variant sequences for the TFBP.
- LoFtool: Loss of Function tolerance score for loss of function variants.
- CADD_PHRED: Phred-scaled CADD score.
- CADD_RAW: Score indicating the deleteriousness of variants.
- BLOSUM62: BLOSUM62 score.
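Several of these columns pack more than one value into a single string: SIFT and PolyPhen are described as prediction(score), EXON and INTRON as a number out of a total, and MC as a comma-separated list of Sequence Ontology ID|molecular_consequence pairs. The sketch below shows one way to split them into separate features. The helper names are hypothetical, the "number/total" separator for EXON and INTRON is an assumption based on the description above, and the example strings are illustrative rather than taken from the file.

```python
import re

# Matches strings such as "deleterious(0.02)".
_PRED_SCORE = re.compile(r"^(?P<pred>[^()]+)\((?P<score>[-+]?\d*\.?\d+)\)$")


def parse_prediction(value):
    """Split 'prediction(score)' into (prediction, score).

    Either part may be missing, in which case it is returned as None.
    """
    value = (value or "").strip()
    if not value:
        return None, None
    match = _PRED_SCORE.match(value)
    if match:
        return match.group("pred"), float(match.group("score"))
    try:
        return None, float(value)  # score only
    except ValueError:
        return value, None  # prediction only


def parse_rank(value):
    """Split 'number/total' (EXON or INTRON) into two ints, else (None, None)."""
    parts = (value or "").strip().split("/")
    if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
        return int(parts[0]), int(parts[1])
    return None, None


def parse_mc(value):
    """Split the MC field into a list of (so_id, consequence) pairs."""
    pairs = []
    for item in (value or "").split(","):
        if "|" in item:
            so_id, consequence = item.strip().split("|", 1)
            pairs.append((so_id, consequence))
    return pairs


print(parse_prediction("deleterious(0.02)"))
print(parse_rank("3/12"))
print(parse_mc("SO:0001583|missense_variant,SO:0001627|intron_variant"))
```

Returning None for missing or unparseable parts keeps the helpers usable on sparse columns; how those gaps are then imputed is a separate modeling decision.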
Distribution
The data file is in CSV format and is named clinvar_conflicting.csv, with a size of 30.72 MB. It contains 46 columns. The dataset includes approximately 65.2 thousand records, focusing specifically on genetic variants that have received multiple classifications from different clinical laboratories. Variants with only a single submission have been excluded from this dataset.
Usage
This dataset is ideal for:
- Developing and testing machine learning models to predict conflicting clinical classifications of human genetic variants.
- Applying machine learning techniques to genomics research.
- Feature engineering to improve the assessment of variant classification consistency.
- Applying a trained model to single-submission variants (which are excluded from this dataset) to flag those that may be prone to conflicting classifications in the future.
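As one possible starting point for the feature engineering mentioned above, the sketch below computes the share of conflicting variants within each value of a categorical column such as IMPACT. It uses only the standard library. The function name conflict_rate_by, the "missing" bucket, and the toy rows are assumptions made for illustration; the rows are not real ClinVar records.

```python
from collections import defaultdict


def conflict_rate_by(rows, column):
    """Return {category: fraction of rows with CLASS == 1} for one column.

    rows is an iterable of dicts, for example from csv.DictReader.
    Empty or absent values are grouped under the label "missing".
    """
    totals = defaultdict(int)
    conflicts = defaultdict(int)
    for row in rows:
        key = row.get(column) or "missing"
        totals[key] += 1
        conflicts[key] += int(row["CLASS"])
    return {key: conflicts[key] / totals[key] for key in totals}


# Toy rows for illustration only; these are not real ClinVar records.
toy_rows = [
    {"IMPACT": "MODERATE", "CLASS": "1"},
    {"IMPACT": "MODERATE", "CLASS": "0"},
    {"IMPACT": "LOW", "CLASS": "0"},
    {"IMPACT": "", "CLASS": "1"},
]
rates = conflict_rate_by(toy_rows, "IMPACT")
print(rates)
```

If these rates are used as a model input, they should be computed on the training split only, since deriving them from the full dataset would leak the target into the features.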
Coverage
This dataset covers human genetic variants sourced from ClinVar, a public resource. The raw data was downloaded on Saturday, April 7th, 2018, and no future updates to this specific dataset are expected. The geographic scope is not specified; because ClinVar is an international public database, the data likely reflects submissions from many countries.
License
CC0: Public Domain
Who Can Use It
- Genetics researchers and bioinformaticians looking to understand and mitigate classification discrepancies.
- Data scientists and machine learning engineers interested in applying predictive analytics to complex biological and healthcare data.
- Clinicians who interpret genetic variant classifications and need tools to identify potentially ambiguous cases.
- Academics and students exploring genomics and precision medicine applications using machine learning.
Dataset Name Suggestions
- ClinVar Variant Classification Conflict
- Predicting Conflicting Genetic Classifications
- Human Genome Variant Discrepancy
- Clinical Variant Classification Predictor
Attributes
Original Data Source: Clinical Variant Classification Predictor