COVID-19 B-cell Epitope Prediction Dataset
Public Health & Epidemiology
Free
About
This dataset supports B-cell epitope prediction for vaccine development, with particular relevance to COVID-19. It describes subregions of antigen proteins (epitope regions) that B-cells recognise in order to produce antigen-specific antibodies. Predicting these regions helps in designing vaccines that induce antibody production. The data is laid out simply enough for data analysts to use directly, and it is intended to be useful for medical data analysis beyond COVID-19, including automated and machine learning methods that could accelerate vaccine development.
Columns
- parent_protein_id: A unique identifier for the parent protein.
- protein_seq: The sequence of the parent protein.
- start_position: The starting position of the peptide sequence within the parent protein.
- end_position: The ending position of the peptide sequence within the parent protein.
- peptide_seq: The sequence of the peptide.
- chou_fasman: A peptide feature representing β-turn propensity.
- emini: A peptide feature indicating relative surface accessibility.
- kolaskar_tongaonkar: A peptide feature related to antigenicity.
- parker: A peptide feature indicating hydrophilicity (Parker scale).
- isoelectric_point: A protein feature describing its isoelectric point.
- aromaticity: A protein feature indicating its aromaticity.
- hydrophobicity: A protein feature describing its hydrophobicity.
- stability: A protein feature indicating its stability.
- target: The antibody valence, that is, the label indicating whether a peptide exhibited antibody-inducing activity. It is binary: 1 (Positive) or 0 (Negative).
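The relationship between protein_seq, start_position, end_position, and peptide_seq can be sketched as a simple slice. This is a minimal illustration that assumes positions are 1-based and inclusive; the listing does not state the indexing convention, so check it against real rows before relying on it. The function name and the example row are hypothetical.

```python
def extract_peptide(protein_seq: str, start_position: int, end_position: int) -> str:
    """Slice a peptide out of its parent protein.

    Assumption (not confirmed by the dataset description): positions are
    1-based and inclusive, so position 1 is the first residue.
    """
    if not 1 <= start_position <= end_position <= len(protein_seq):
        raise ValueError("positions fall outside the protein sequence")
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return protein_seq[start_position - 1:end_position]


# Hypothetical row, made up for illustration; not taken from the dataset.
row = {
    "protein_seq": "MKTAYIAKQRQISFVKSHFSRQ",
    "start_position": 5,
    "end_position": 10,
    "peptide_seq": "YIAKQR",
}

peptide = extract_peptide(row["protein_seq"], row["start_position"], row["end_position"])
matches = peptide == row["peptide_seq"]
```

Running the same comparison over every row of a file is a quick way to confirm (or refute) the indexing assumption.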
Distribution
The dataset consists of three CSV files: input_bcell.csv, input_sars.csv, and input_covid.csv.
input_bcell.csv is the primary training data, containing 14,387 rows that represent combinations of 14,362 unique peptides and 757 unique proteins. It includes all 14 columns:
- parent_protein_id: 760 unique IDs.
- protein_seq: 757 unique sequences.
- start_position: Ranges from 1 to 3079, with a mean of 298.
- end_position: Ranges from 6 to 3086, with a mean of 308.
- peptide_seq: 14,362 unique sequences.
- chou_fasman: Ranges from 0.53 to 1.55, with a mean of 0.99.
- emini: Ranges from 0 to 27.2, with a mean of 1.06.
- kolaskar_tongaonkar: Ranges from 0.84 to 1.25, with a mean of 1.02.
- parker: Ranges from -9.03 to 9.12, with a mean of 1.77.
- isoelectric_point: Ranges from 3.69 to 12.2, with a mean of 7.07.
- aromaticity: Ranges from 0 to 0.18, with a mean of 0.08.
- hydrophobicity (protein): Ranges from -1.97 to 1.27, with a mean of -0.41.
- stability: Ranges from 5.45 to 137, with a mean of 43.7.
- target: Binary values (0 for Negative, 1 for Positive), with 10,485 negative records and 3,902 positive records.
input_sars.csv is also part of the main training data, containing 520 rows.
input_covid.csv is the target data and has no label (target) column. All three files contain protein and peptide information.
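The label counts reported for input_bcell.csv (10,485 negative, 3,902 positive) imply a noticeable class imbalance. The short calculation below works through the arithmetic and derives class weights using the common "balanced" heuristic, n_total / (n_classes * n_class). It is only a sketch: the variable and function names are illustrative, and whether to reweight at all is a modelling choice the listing does not prescribe.

```python
# Label counts as reported for input_bcell.csv.
n_negative = 10_485
n_positive = 3_902
n_total = n_negative + n_positive  # 14,387 rows

# Share of rows labelled Positive (target == 1).
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts the majority class;
# any real model should be compared against this floor.
majority_baseline_accuracy = max(n_negative, n_positive) / n_total


def balanced_weight(n_class: int, n_total: int, n_classes: int = 2) -> float:
    """'Balanced' heuristic: rarer classes receive proportionally larger weights."""
    return n_total / (n_classes * n_class)


class_weights = {
    0: balanced_weight(n_negative, n_total),
    1: balanced_weight(n_positive, n_total),
}

print(f"positive rate: {positive_rate:.3f}")
print(f"majority baseline accuracy: {majority_baseline_accuracy:.3f}")
print(f"class weights: {class_weights}")
```

Because roughly 27% of rows are positive, plain accuracy can look good for a model that never predicts an epitope, so metrics that account for imbalance are worth considering.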
Usage
This dataset is ideal for:
- Developing and testing algorithms for B-cell epitope prediction.
- Designing and developing vaccines that aim to induce antigen-specific antibody production.
- Conducting general medical data analysis, particularly in immunology and virology.
- Implementing and evaluating machine learning models for rapid vaccine development.
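When evaluating models on this data, note that many peptides share a parent protein (14,387 rows over roughly 760 protein IDs), so a random row-level split may place peptides from the same protein in both training and validation sets. One possible way to guard against that is to split by parent_protein_id, sketched below with the standard library only. The function name, the 20% validation fraction, and the toy rows are assumptions for illustration, not part of the dataset or a recommended protocol.

```python
import random
from collections import defaultdict


def group_split(rows, group_key="parent_protein_id", valid_fraction=0.2, seed=0):
    """Split rows so that every group (protein) lands entirely on one side."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle group IDs reproducibly, then hold out a fraction of them.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_valid = max(1, round(len(group_ids) * valid_fraction))
    valid_ids = set(group_ids[:n_valid])

    train = [r for r in rows if r[group_key] not in valid_ids]
    valid = [r for r in rows if r[group_key] in valid_ids]
    return train, valid


# Toy rows, made up for illustration: 20 peptides spread over 5 proteins.
rows = [
    {"parent_protein_id": f"P{i % 5}", "peptide_seq": f"PEP{i}", "target": i % 2}
    for i in range(20)
]

train_rows, valid_rows = group_split(rows)
train_ids = {r["parent_protein_id"] for r in train_rows}
valid_ids = {r["parent_protein_id"] for r in valid_rows}
print(len(train_rows), len(valid_rows), train_ids & valid_ids)
```

With real data, the rows could come from csv.DictReader over input_bcell.csv; the split logic stays the same.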
Coverage
The dataset focuses on COVID-19 and SARS B-cell epitope prediction. The data was sourced from the Immune Epitope Database (IEDB) and UniProt. Antibody proteins included are restricted to IgG, the most frequently recorded type in IEDB. The dataset is not expected to be updated frequently.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Data analysts: To explore and build predictive models for epitope identification.
- Vaccine researchers: For insights into antigen-antibody interactions and vaccine design.
- Bioinformaticians: To develop new computational methods for epitope prediction.
- Machine learning practitioners: To apply and advance machine learning techniques in biomedical research.
Dataset Name Suggestions
- COVID-19 B-cell Epitope Prediction Dataset
- SARS-CoV-2 Vaccine Epitope Data
- Immune Epitope Database for Vaccine Development
- Antigen Protein Epitope Prediction Data
- B-cell Antibody Induction Dataset
Attributes
Original Data Source: COVID-19 B-cell Epitope Prediction Dataset