Protein Folding Structure Dataset
Data Science and Analytics
Free
About
This dataset describes physicochemical properties of protein tertiary structure. It was derived from structures produced during the Critical Assessment of Structure Prediction (CASP) experiments, rounds 5 through 9. It covers 45,730 protein decoys, with sizes ranging from 0 to 21 angstroms. The columns are intended as input variables for models that predict or evaluate the quality of protein structures.
Columns
The dataset contains 10 columns, each holding one measurement per decoy:
- RMSD: Size of the residue, as given in the source description. RMSD conventionally abbreviates root-mean-square deviation, and this column is the one reported in angstroms.
- F1: Total surface area.
- F2: Non-polar exposed area.
- F3: Fractional area of exposed non-polar residue.
- F4: Fractional area of exposed non-polar part of residue.
- F5: Molecular-mass-weighted exposed area.
- F6: Average deviation from standard exposed area of residue.
- F7: Euclidean distance.
- F8: Secondary structure penalty.
- F9: Spatial distribution constraints (N,K value).
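As a minimal sketch of how the column layout above might be checked when loading the file, the following uses only the Python standard library. It assumes the CSV header uses exactly the names listed (RMSD, F1 through F9) in that order, which the listing does not confirm. The function name `load_decoys` and the sample rows are made up for illustration.

```python
import csv
import io

# Column names as listed above; assumed (not confirmed) to match the CSV header.
EXPECTED_COLUMNS = ["RMSD"] + [f"F{i}" for i in range(1, 10)]


def load_decoys(lines):
    """Parse CSV text lines into a list of {column: float} records.

    Raises ValueError if the header differs from EXPECTED_COLUMNS.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [{name: float(row[name]) for name in EXPECTED_COLUMNS} for row in reader]


# Two made-up rows standing in for the real file.
SAMPLE = (
    "RMSD,F1,F2,F3,F4,F5,F6,F7,F8,F9\n"
    "1.5,10000.0,3000.0,0.30,100.0,1500000.0,200.0,4000.0,50,30.0\n"
    "12.0,8000.0,2500.0,0.31,90.0,1200000.0,150.0,3500.0,40,35.0\n"
)
records = load_decoys(io.StringIO(SAMPLE))
print(len(records), records[0]["RMSD"])

# With the real file (path assumed to be the working directory):
# with open("protein.csv", newline="") as f:
#     records = load_decoys(f)
```

Failing early on a header mismatch makes a renamed or reordered column visible before any modelling code runs.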
Distribution
The data is provided as a single CSV file, 'protein.csv' (3.53 MB), containing 45,730 records, one per decoy. All 10 columns are reported as fully valid, with no missing or mismatched entries, so the file can be used for modelling without imputation.
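The completeness claim can be verified independently before modelling. The sketch below counts records and, per column, the cells that are empty, non-numeric, or not finite. The function name `count_invalid` and the three-column sample are hypothetical, and the sample deliberately contains bad cells to show what would be flagged.

```python
import csv
import io
import math


def count_invalid(lines):
    """Return (record_count, {column: count of empty, non-numeric, or non-finite cells})."""
    reader = csv.DictReader(lines)
    invalid = {name: 0 for name in (reader.fieldnames or [])}
    n_records = 0
    for row in reader:
        n_records += 1
        for name in invalid:
            try:
                value = float(row.get(name))
            except (TypeError, ValueError):
                # Empty string, missing cell (None), or text that is not a number.
                invalid[name] += 1
                continue
            if math.isnan(value) or math.isinf(value):
                invalid[name] += 1
    return n_records, invalid


# Made-up sample with one bad cell in each column.
SAMPLE = "RMSD,F1,F2\n1.5,100.0,0.3\n,200.0,abc\n2.0,nan,0.4\n"
n_records, invalid = count_invalid(io.StringIO(SAMPLE))
print(n_records, invalid)

# With the real file (path assumed):
# with open("protein.csv", newline="") as f:
#     n_records, invalid = count_invalid(f)
```

If the listing is accurate, running this on the real file would be expected to report 45,730 records and a zero count for every column.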
Usage
Typical applications include:
- Developing and evaluating machine learning models designed for protein tertiary structure quality assessment.
- Benchmarking new algorithms against established metrics in structural bioinformatics.
- Research into the relationships between various molecular properties and protein stability or folding outcomes.
- Feature engineering for advanced predictive analytics in biophysics.
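One possible first step for the relationship and feature-engineering items above is to rank the F columns by the strength of their linear association with RMSD. The sketch below computes Pearson correlation in plain Python. Treating RMSD as the target is an assumption about how the data is used, not something the listing states, and the toy records and the helper names `pearson` and `rank_features` are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; reported as 0.0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(records, target="RMSD"):
    """Return (feature, correlation) pairs sorted by absolute correlation, strongest first."""
    ys = [r[target] for r in records]
    features = [name for name in records[0] if name != target]
    scores = {name: pearson([r[name] for r in records], ys) for name in features}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy records with made-up values; real records would carry F1 through F9.
toy = [
    {"RMSD": 1.0, "F1": 2.0, "F2": 1.0, "F3": 5.0},
    {"RMSD": 2.0, "F1": 4.0, "F2": 3.0, "F3": 5.0},
    {"RMSD": 3.0, "F1": 6.0, "F2": 2.0, "F3": 5.0},
    {"RMSD": 4.0, "F1": 8.0, "F2": 4.0, "F3": 5.0},
]
ranking = rank_features(toy)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Pearson correlation only captures linear relationships, so a low score here does not rule out a feature that matters to a non-linear model.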
Coverage
The data is strictly biophysical, covering molecular decoys generated for CASP rounds 5 through 9, with sizes spanning 0 to 21 angstroms. Geographic and demographic coverage do not apply to this kind of structural biology resource. The collection was compiled and first distributed in 2013.
License
CC BY-NC-SA 4.0
Who Can Use It
The dataset is useful to several groups:
- Data Scientists: For training neural networks, regression models, or classification algorithms on numeric scientific features.
- Structural Biologists: For examining the features that differentiate correctly folded proteins from poor decoys.
- Students and Educators: For practical projects and lessons in machine learning applied to biological data.
- Researchers: For use as a benchmark dataset in academic publications and comparisons.
Dataset Name Suggestions
- Protein Tertiary Structure ML Features
- CASP Decoy Physicochemical Measurements
- Molecular Structure Quality Metrics
- Protein Folding Structure Dataset
Attributes
Original Data Source: Protein Folding Structure Dataset