Global Aqueous Solubility Database
Data Science and Analytics
Free
About
AqSolDB is a curated aqueous solubility dataset created by the Autonomous Energy Materials Discovery [AMD] research group. It comprises aqueous solubility values for 9,982 unique compounds, drawn from nine publicly available aqueous solubility datasets. This openly accessible dataset is the largest of its kind, serving as a valuable reference for measured solubility data and as an enhanced, generalisable training source for building data-driven models. In addition to curated experimental solubility values, it includes relevant topological and physico-chemical 2D descriptors, calculated using RDKit, and validated molecular representations for each compound.
Columns
- ID: The source ID of the compound, with the first letter indicating its origin source.
- Name: The common name of the compound.
- InChI: The IUPAC International Chemical Identifier for the compound.
- InChIKey: A hashed InChI value, providing a compact, fixed-length identifier.
- SMILES: The SMILES notation, representing the compound's molecular structure.
- Solubility: The experimental aqueous solubility value, expressed as LogS (the base-10 logarithm of the solubility in mol/L).
- SD: The standard deviation of the solubility values, applicable when more than one value exists for a compound.
- Ocurrences: The number of times the compound occurs across the source datasets.
- Group: A reliability group classification for the data, with details available in the referenced paper.
- MolWt: The molecular weight of the compound.
- MolLogP: The calculated logarithm of the octanol-water partition coefficient (logP), indicating hydrophobicity.
- MolMR: The molar refractivity of the compound.
- HeavyAtomCount: The total number of non-hydrogen atoms in the compound.
- NumHAcceptors: The count of hydrogen bond acceptor atoms.
- NumHDonors: The count of hydrogen bond donor atoms.
- NumHeteroatoms: The total number of heteroatoms (atoms other than carbon and hydrogen).
- NumRotatableBonds: The number of rotatable bonds in the molecular structure.
- NumValenceElectrons: The total number of valence electrons in the compound.
- NumAromaticRings: The count of aromatic rings.
- NumSaturatedRings: The count of saturated rings.
- NumAliphaticRings: The count of aliphatic rings.
- RingCount: The total number of rings in the compound (aromatic plus aliphatic; saturated rings are a subset of the aliphatic rings).
- TPSA: The Topological Polar Surface Area, a descriptor related to drug absorption.
- LabuteASA: Labute's Approximate Surface Area.
- BalabanJ: Balaban's J Index, a topological index.
- BertzCT: A topological complexity index for the compound.
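The column list above can be turned into a simple schema check. The following is a minimal sketch using only the Python standard library. The function names `read_rows` and `logs_to_g_per_l` are hypothetical helpers introduced for illustration, and the conversion assumes that LogS is the base-10 logarithm of the solubility in mol/L and that MolWt is in g/mol.

```python
import csv
import io

# Column names copied from the list above ("Ocurrences" is spelled as listed).
EXPECTED_COLUMNS = [
    "ID", "Name", "InChI", "InChIKey", "SMILES", "Solubility", "SD",
    "Ocurrences", "Group", "MolWt", "MolLogP", "MolMR", "HeavyAtomCount",
    "NumHAcceptors", "NumHDonors", "NumHeteroatoms", "NumRotatableBonds",
    "NumValenceElectrons", "NumAromaticRings", "NumSaturatedRings",
    "NumAliphaticRings", "RingCount", "TPSA", "LabuteASA", "BalabanJ",
    "BertzCT",
]


def read_rows(handle):
    """Yield one dict per compound after checking header and completeness."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Data starts on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in EXPECTED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty cells in {empty}")
        yield row


def logs_to_g_per_l(log_s, mol_wt):
    """Convert LogS (assumed log10 of mol/L) to g/L, with MolWt in g/mol."""
    return (10.0 ** log_s) * mol_wt


if __name__ == "__main__":
    # Illustrative, made-up row; the values are NOT taken from the dataset.
    sample = dict.fromkeys(EXPECTED_COLUMNS, "1")
    sample.update({"ID": "X-0", "Name": "example",
                   "Solubility": "-2.0", "MolWt": "150.0"})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
    writer.writeheader()
    writer.writerow(sample)
    buffer.seek(0)
    for row in read_rows(buffer):
        grams = logs_to_g_per_l(float(row["Solubility"]), float(row["MolWt"]))
        print(row["Name"], grams)
```

To run the same check on the real file, pass an open file handle instead of the in-memory buffer, for example `open("curated-solubility-dataset.csv", newline="", encoding="utf-8")`; the encoding is an assumption.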
Distribution
The dataset is provided as a CSV file named curated-solubility-dataset.csv, with a file size of 3.75 MB. It contains 9,982 unique compounds, with 26 columns of data. All entries across these columns are validated and complete, with 100% data validity and no missing values for the listed compounds.
Usage
This dataset is ideal for developing and validating predictive models for aqueous solubility, a critical property in chemistry and drug discovery. It can serve as a reliable reference for experimental solubility data, enabling researchers to benchmark their own measurements or predictions. Furthermore, it is an excellent resource for training data-driven models aimed at predicting chemical properties, utilising its diverse set of compounds and associated descriptors.
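As one way to benchmark predictions against the reference values, the sketch below matches compounds by InChIKey and reports the root-mean-square error and mean absolute error in LogS units. It is an illustration rather than a prescribed evaluation protocol: `load_reference` and `benchmark` are hypothetical helper names, and the keys in the usage example are placeholders, not real InChIKeys.

```python
import csv
import math


def load_reference(handle):
    """Map InChIKey -> experimental LogS from an open CSV file handle."""
    return {row["InChIKey"]: float(row["Solubility"])
            for row in csv.DictReader(handle)}


def benchmark(reference, predicted):
    """Compare predicted LogS with reference LogS, matched by InChIKey.

    Both arguments map InChIKey -> LogS. Returns (n, rmse, mae) computed
    over the compounds present in both mappings.
    """
    shared = sorted(reference.keys() & predicted.keys())
    if not shared:
        raise ValueError("no compounds in common")
    errors = [predicted[key] - reference[key] for key in shared]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return len(shared), rmse, mae


if __name__ == "__main__":
    # Placeholder keys and made-up values, for illustration only.
    reference = {"KEY-A": -1.0, "KEY-B": -3.0, "KEY-C": 0.5}
    predicted = {"KEY-A": -1.5, "KEY-B": -2.0, "KEY-D": 1.0}
    n, rmse, mae = benchmark(reference, predicted)
    print(f"n={n} rmse={rmse:.3f} mae={mae:.3f}")
```

Matching on InChIKey rather than on Name avoids mismatches caused by synonyms. Compounds that appear in only one of the two mappings are ignored, so the returned count is worth reporting alongside the error values.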
Coverage
The dataset compiles information from sources published between 1994 and 2014. It does not specify a geographic or demographic scope. It is anticipated to be updated annually.
License
CC0: Public Domain
Who Can Use It
This dataset is suitable for:
- Computational chemists and cheminformaticians building and refining aqueous solubility prediction models.
- Drug discovery scientists requiring reliable solubility data for lead optimisation and candidate selection.
- Materials scientists interested in the solubility of various compounds.
- Data scientists and machine learning engineers looking for a well-curated chemical dataset for predictive modelling and algorithm development.
Dataset Name Suggestions
- AqSolDB: Aqueous Solubility Dataset
- Curated Chemical Solubility Data
- Compound Aqueous Solubility Values
- Global Aqueous Solubility Database
Attributes
Original Data Source: Global Aqueous Solubility Database