COVID-19 B-cell Epitope Prediction Dataset

Public Health & Epidemiology

Tags and Keywords

B-cell, Epitope, Vaccine, COVID-19, Antibody

About

This dataset supports B-cell epitope prediction for vaccine development, with particular relevance to COVID-19. It describes subregions of antigen proteins (epitope regions) that B-cells recognise in order to produce antigen-specific antibodies. Predicting these regions helps in designing vaccines that aim to induce antibody production. The dataset is structured to be straightforward for data analysts to work with and is intended to be useful for medical data analysis beyond COVID-19. It also supports automated and machine learning methods for accelerating vaccine development.

Columns

parent_protein_id: A unique identifier for the parent protein.
protein_seq: The sequence of the parent protein.
start_position: The starting position of the peptide sequence within the parent protein.
end_position: The ending position of the peptide sequence within the parent protein.
peptide_seq: The sequence of the peptide.
chou_fasman: A peptide feature representing β-turn propensity.
emini: A peptide feature indicating relative surface accessibility.
kolaskar_tongaonkar: A peptide feature related to antigenicity.
parker: A peptide feature indicating hydrophilicity (Parker scale).
isoelectric_point: A protein feature describing its isoelectric point.
aromaticity: A protein feature indicating its aromaticity.
hydrophobicity: A protein feature describing its hydrophobicity.
stability: A protein feature indicating its stability.
target: The antibody valence, a binary label indicating whether the peptide exhibited antibody-inducing activity (1 for Positive, 0 for Negative).
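
As an illustration of how the columns relate to one another, the sketch below checks a CSV header against the column list and verifies that a peptide is the slice of its parent protein given by the two positions. It assumes that the header spelling is `aromaticity` and that positions are 1-based and inclusive; neither is confirmed by this listing. The helper names and the sample row are made up for the example.

```python
import csv
import io

# Column names as listed above; "aromaticity" is the assumed CSV header spelling.
EXPECTED_COLUMNS = [
    "parent_protein_id", "protein_seq", "start_position", "end_position",
    "peptide_seq", "chou_fasman", "emini", "kolaskar_tongaonkar", "parker",
    "isoelectric_point", "aromaticity", "hydrophobicity", "stability", "target",
]


def missing_columns(header, require_target=True):
    """Return the expected column names that are absent from a CSV header."""
    expected = [c for c in EXPECTED_COLUMNS if require_target or c != "target"]
    return [c for c in expected if c not in header]


def peptide_matches_protein(row):
    """Check that peptide_seq equals the protein slice given by the positions.

    Assumption: positions are 1-based and inclusive.
    """
    start = int(row["start_position"])
    end = int(row["end_position"])
    return row["protein_seq"][start - 1:end] == row["peptide_seq"]


# Tiny made-up sample standing in for a real file such as input_bcell.csv.
sample = io.StringIO(
    "parent_protein_id,protein_seq,start_position,end_position,peptide_seq,target\n"
    "P00000,MKTAYIAKQR,3,7,TAYIA,1\n"
)
reader = csv.DictReader(sample)
rows = list(reader)

print(missing_columns(reader.fieldnames))   # feature columns the sample omits
print(peptide_matches_protein(rows[0]))
```

Passing `require_target=False` would suit a file without labels, such as input_covid.csv.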

Distribution

The dataset consists of three CSV files: input_bcell.csv, input_sars.csv, and input_covid.csv.

input_bcell.csv is the primary training data, containing 14,387 rows, representing combinations of 14,362 unique peptides and 757 unique proteins. It includes all 14 columns.
- parent_protein_id: 760 unique IDs.
- protein_seq: 757 unique sequences.
- start_position: Ranges from 1 to 3079, with a mean of 298.
- end_position: Ranges from 6 to 3086, with a mean of 308.
- peptide_seq: 14,362 unique sequences.
- chou_fasman: Ranges from 0.53 to 1.55, with a mean of 0.99.
- emini: Ranges from 0 to 27.2, with a mean of 1.06.
- kolaskar_tongaonkar: Ranges from 0.84 to 1.25, with a mean of 1.02.
- parker: Ranges from -9.03 to 9.12, with a mean of 1.77.
- isoelectric_point: Ranges from 3.69 to 12.2, with a mean of 7.07.
- aromaticity: Ranges from 0 to 0.18, with a mean of 0.08.
- hydrophobicity (protein): Ranges from -1.97 to 1.27, with a mean of -0.41.
- stability: Ranges from 5.45 to 137, with a mean of 43.7.
- target: Binary values (0 for Negative, 1 for Positive), with 10,485 negative records and 3,902 positive records.
input_sars.csv is additional training data, containing 520 rows.
input_covid.csv is the target data for prediction and has no label (target) column. All three files contain protein and peptide information.
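
The label counts reported for input_bcell.csv imply a class imbalance of roughly one positive for every 2.7 negatives. The short calculation below works this out and derives "balanced" class weights, a common (but here only assumed) way to compensate when training a classifier. The variable names are illustrative.

```python
# Label counts reported for input_bcell.csv.
negatives = 10_485
positives = 3_902
total = negatives + positives  # 14,387, matching the reported row count

positive_fraction = positives / total

# "Balanced" weights: total / (number_of_classes * class_count).
weight_negative = total / (2 * negatives)
weight_positive = total / (2 * positives)

print(f"positive fraction: {positive_fraction:.3f}")
print(f"class weights: negative={weight_negative:.3f}, positive={weight_positive:.3f}")
```

With about 27% positive records, plain accuracy can look good for a model that mostly predicts Negative, so metrics that account for imbalance may be more informative.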

Usage

This dataset is ideal for:

Developing and testing algorithms for B-cell epitope prediction.
Designing and developing vaccines that aim to induce antigen-specific antibody production.
Conducting general medical data analysis, particularly in immunology and virology.
Implementing and evaluating machine learning models for rapid vaccine development.
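
When evaluating models on this data, note that input_bcell.csv holds 14,387 rows spread over only 760 protein IDs, so many peptides share a parent protein. A purely random row split could place peptides from the same protein in both training and validation sets. The sketch below shows one possible way to avoid that by assigning whole proteins to a side. The function name, the 20% validation share, and the sample rows are assumptions made for illustration, not part of the dataset.

```python
import hashlib


def assign_fold(parent_protein_id, validation_percent=20):
    """Deterministically map a protein ID to 'train' or 'validation'.

    Hashing the ID keeps every peptide of one protein on the same side.
    """
    digest = hashlib.sha256(parent_protein_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_percent else "train"


# Made-up rows; real rows would come from input_bcell.csv.
rows = [
    {"parent_protein_id": "PROT_A", "peptide_seq": "TAYIA"},
    {"parent_protein_id": "PROT_A", "peptide_seq": "AYIAK"},
    {"parent_protein_id": "PROT_B", "peptide_seq": "GSHMK"},
]

split = {"train": [], "validation": []}
for row in rows:
    split[assign_fold(row["parent_protein_id"])].append(row)

train_ids = {r["parent_protein_id"] for r in split["train"]}
validation_ids = {r["parent_protein_id"] for r in split["validation"]}

# No protein appears on both sides of the split.
print(train_ids.isdisjoint(validation_ids))
```

Because the assignment depends only on the ID, the split is reproducible across runs and stays stable if rows are reordered.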

Coverage

The dataset focuses on COVID-19 and SARS B-cell epitope prediction. The data was sourced from the Immune Epitope Database (IEDB) and UniProt. Antibody proteins included are restricted to IgG, the most frequently recorded type in IEDB. The dataset is not expected to be updated frequently.

License

Attribution 4.0 International (CC BY 4.0)

Who Can Use It

Data analysts: To explore and build predictive models for epitope identification.
Vaccine researchers: For insights into antigen-antibody interactions and vaccine design.
Bioinformaticians: To develop new computational methods for epitope prediction.
Machine learning practitioners: To apply and advance machine learning techniques in biomedical research.

Dataset Name Suggestions

COVID-19 B-cell Epitope Prediction Dataset
SARS-CoV-2 Vaccine Epitope Data
Immune Epitope Database for Vaccine Development
Antigen Protein Epitope Prediction Data
B-cell Antibody Induction Dataset

Attributes

Original Data Source: COVID-19 B-cell Epitope Prediction Dataset

Listing Stats

Listed: 14/07/2025
Region: Global
Quality: 5 / 5
Version: 1.0
