Valence Labs and Recursion Launch OpenQDC: The Largest Public Quantum Mechanical Dataset for ML Research
Researchers at Valence Labs, powered by Recursion, have introduced OpenQDC (Open Quantum Data Commons), an extensive, publicly accessible quantum mechanical dataset.
OpenQDC is designed to accelerate machine learning in molecular sciences, particularly in drug discovery and material innovation, by providing standardized, high-quality data.
OpenQDC consolidates nearly 40 quantum mechanical datasets into one unified repository, encompassing 1.5 billion geometries, 70 atom species, and over 250 quantum methods. The datasets cover a wide range of chemical interactions relevant to organic chemistry and molecular dynamics, giving researchers a common basis for developing and comparing machine learning interatomic potentials (MLIPs).
Figure: The OpenQDC hub, showing some of its datasets and its search functionality.
Features
- Standardized Metadata: Each dataset includes essential metadata, such as its energy, distance, and force units and its isolated atom energies, computed to ensure accuracy and consistency.
- Comprehensive Coverage: The data spans small organic molecules, diverse chemical elements, and multiple computational methods.
- ML-Ready Design: Optimized for Python-based ML workflows, datasets are preprocessed for compatibility with frameworks like PyTorch, JAX, and NumPy.
- Effortless Access: With a simple Python API, datasets can be loaded and used with one line of code.
- Performance-Focused: Supports efficient formats like memmap, zarr, and xyz for seamless data manipulation.
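To make the role of the standardized metadata concrete, here is a minimal sketch in plain Python of how isolated atom energies and unit metadata are typically used on a single xyz-format geometry. It does not use the OpenQDC API. The helper names (`parse_xyz`, `atomization_energy`), the isolated atom energies, and the total energy of the water molecule are all hypothetical placeholders chosen for illustration; only the Hartree to kcal/mol conversion factor is a standard value.

```python
# Standard conversion factor: 1 Hartree = 627.509474 kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.509474

# Hypothetical isolated atom energies in Hartree. These are round placeholder
# numbers for illustration only; real values depend on the quantum method.
ISOLATED_ATOM_ENERGY_HARTREE = {"H": -0.5, "C": -37.8, "O": -75.0}


def parse_xyz(text):
    """Parse a single-frame xyz string into (symbols, coordinates)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    symbols, coords = [], []
    # Line 0 holds the atom count and line 1 is a free-form comment.
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(symbols) != n_atoms:
        raise ValueError("xyz header does not match the number of atom lines")
    return symbols, coords


def atomization_energy(total_energy_hartree, symbols):
    """Subtract the summed isolated atom energies from a total energy."""
    reference = sum(ISOLATED_ATOM_ENERGY_HARTREE[s] for s in symbols)
    return total_energy_hartree - reference


WATER_XYZ = """3
water, illustrative geometry in angstrom
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

symbols, coords = parse_xyz(WATER_XYZ)
# -76.4 Hartree is a hypothetical total energy, not a value from any dataset.
e_atomization = atomization_energy(-76.4, symbols)
e_kcal = e_atomization * HARTREE_TO_KCAL_PER_MOL
print(symbols, round(e_atomization, 6), round(e_kcal, 3))
```

Subtracting the isolated atom energies removes the large, method-dependent atomic offsets from the total energy, which is one reason a shared record of those energies and of the units matters when data computed with different quantum methods is combined.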
Applications
OpenQDC is designed to democratize access to quantum mechanical data, addressing a key bottleneck in the development of accurate ML models for molecular simulations. Researchers can use its datasets to advance MLIP algorithm development; explore molecular behaviors such as protein folding, binding mechanisms, and allosteric sites; and accelerate discovery in molecular dynamics.
Some dataset examples:
- ANI-1: A foundational dataset containing 22 million structural conformations of 57,000 small organic molecules composed of C, N, O, and H atoms.
- Alchemy: Features 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.
- QM7X: A collection of 4.2 million conformers of 6,950 organic molecules, each with up to seven heavy atoms.
Availability
OpenQDC is open-source and available for download on GitHub: https://github.com/valence-labs/openQDC. Visit OpenQDC.io for further details.
OpenQDC uses the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license to distribute its datasets, aligning with its mission to democratize access to quantum mechanical data for non-commercial research and development. This licensing choice ensures that researchers can freely use, adapt, and share the datasets while adhering to non-commercial restrictions.