Tox24 TTR Binding Prediction Data
About
This dataset relates to the Tox24 Challenge, which focuses on predicting chemical binding activity to the target protein Transthyretin (TTR). It offers a practical, real-world example of how machine learning can be used to forecast chemical activity against a specific biological target. The data includes several processed representations of the SMILES notation for the 1512 competition chemicals. The materials also incorporate supplemental tables taken from the article accompanying the challenge, detailing assay reaction components, lists of autofluorescent chemicals, and chemicals excluded from the analysis due to interference.
Columns
The dataset contains eleven columns detailing chemical identity and structural representations derived from SMILES:
- dataset: Defines the designated subset for modelling purposes (e.g., training, blind test, or other).
- Chemical: The systematic name of the chemical compound (with 1512 unique values).
- activity: The primary label, representing the median percentage activity recorded against TTR.
- pubchem_smiles: The SMILES notation retrieved directly from PubChem.
- alogps_smiles: The initial SMILES representation.
- pubchem_smiles_cleaned: The cleaned version of the PubChem SMILES.
- alogps_smiles_cleaned: The cleaned version of the initial SMILES.
- pubchem_smiles_no_iso_atoms: Cleaned PubChem SMILES with isolated atoms removed.
- pubchem_smiles_no_salts: Cleaned PubChem SMILES with salts removed.
- pubchem_smiles_no_iso_atoms_and_dup: Cleaned PubChem SMILES with isolated atoms and duplicate fragments removed.
- alogps_smiles_no_salts: The alogps SMILES with salts removed.
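As a first sanity check, the column list above can be verified against the file header, and the `dataset` and `activity` columns can be summarized. The following is a minimal sketch using only the Python standard library. The helper names `read_rows` and `summarize` are hypothetical, and the set of tokens treated as a missing label is an assumption, since the listing does not say how missing `activity` values are encoded.

```python
import csv
from collections import Counter

# Column names as given in the column list of this listing.
EXPECTED_COLUMNS = [
    "dataset", "Chemical", "activity",
    "pubchem_smiles", "alogps_smiles",
    "pubchem_smiles_cleaned", "alogps_smiles_cleaned",
    "pubchem_smiles_no_iso_atoms", "pubchem_smiles_no_salts",
    "pubchem_smiles_no_iso_atoms_and_dup", "alogps_smiles_no_salts",
]

# Assumption: a missing activity label appears as an empty cell or NA/NaN.
MISSING_TOKENS = {"", "na", "nan"}


def read_rows(handle):
    """Read CSV rows from an open text handle and check the header."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def summarize(rows):
    """Count rows per subset label and rows without an activity label."""
    per_subset = Counter(row["dataset"] for row in rows)
    unlabeled = sum(
        1 for row in rows
        if (row["activity"] or "").strip().lower() in MISSING_TOKENS
    )
    return per_subset, unlabeled


# Example usage (assumes the file is in the working directory):
# with open("all_smiles_data.csv", newline="", encoding="utf-8") as f:
#     per_subset, unlabeled = summarize(read_rows(f))
# print(per_subset, unlabeled)
```

The sketch does not hard-code the subset labels, because the exact strings used in the `dataset` column are not stated here; the counter simply reports whatever labels the file contains.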
Distribution
The core data is contained within the all_smiles_data.csv file, which is approximately 434.41 kB in size. This file is structured as a tabular dataset with 1512 valid records across 11 columns. The data is partitioned for model development, with 67% designated for training, 20% for the blind test set, and 13% classified as 'other' (200 records). It should be noted that the target variable, activity, has 300 missing values, accounting for 20% of the total observations.
Usage
This collection is suited to several scientific and technical applications, including:
- Developing robust machine learning models to predict chemical binding activity, often leveraging algorithms such as XGBoost and LightGBM.
- Supporting fundamental drug design research and the identification of lead compounds.
- Conducting detailed studies on protein-ligand interactions.
- Evaluating the predictive performance of different molecular descriptors derived from various SMILES representations.
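Before comparing descriptors across SMILES representations, it can help to see how much the representations actually differ. The sketch below is one possible way to do that, not a procedure taken from the challenge: for each SMILES column it counts entries with more than one disconnected component (a `.` in the SMILES string) and entries that differ textually from a reference column. The function names are hypothetical, and `rows` is assumed to be a list of dictionaries keyed by column name, such as `csv.DictReader` produces.

```python
# The eight SMILES columns named in the column list of this listing.
SMILES_COLUMNS = [
    "pubchem_smiles", "alogps_smiles",
    "pubchem_smiles_cleaned", "alogps_smiles_cleaned",
    "pubchem_smiles_no_iso_atoms", "pubchem_smiles_no_salts",
    "pubchem_smiles_no_iso_atoms_and_dup", "alogps_smiles_no_salts",
]


def is_multi_fragment(smiles):
    """True if the SMILES string has more than one disconnected component.

    In SMILES, '.' separates components that are not bonded to each other,
    for example a salt and its counter-ion.
    """
    return "." in smiles


def compare_representations(rows, reference="pubchem_smiles"):
    """Per column: count multi-fragment entries and entries that differ
    from the reference column by plain string comparison."""
    report = {}
    for column in SMILES_COLUMNS:
        multi = 0
        differs = 0
        for row in rows:
            value = row[column]
            if is_multi_fragment(value):
                multi += 1
            if value != row[reference]:
                differs += 1
        report[column] = {
            "multi_fragment": multi,
            "differs_from_reference": differs,
        }
    return report
```

Plain string comparison only detects textual differences. Two different SMILES strings can describe the same molecule, so a chemically meaningful comparison would first canonicalize the strings with a cheminformatics toolkit.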
Coverage
The scope of this data is strictly limited to the chemical compounds and their measured binding responses screened during the Tox24 Challenge. This includes chemicals screened using both single concentration and concentration response testing methods. The data focuses solely on chemical properties and toxicological responses related to TTR binding, and therefore contains no explicit geographical or demographic dimensions. The data is static, with an expected update frequency listed as 'Never'.
License
Attribution 4.0 International (CC BY 4.0)
Who Can Use It
- Cheminformatics Researchers: Applying computational methods to analyse chemical structure data.
- Toxicologists and Pharmacologists: Studying toxicity prediction and drug efficacy related to protein binding.
- Data Scientists and Machine Learning Engineers: Building predictive models for biological activity and chemical property forecasting.
Dataset Name Suggestions
- Tox24 TTR Binding Prediction Data
- Transthyretin Chemical Activity Dataset
- SMILES Representations for TTR Modelling
Attributes
Original Data Source: Tox24 TTR Binding Prediction Data
