Valence Labs and Recursion Launch OpenQDC: The Largest Public Quantum Mechanical Dataset for ML Research

By Roman Kasianov


  

Researchers at Valence Labs, powered by Recursion, have introduced OpenQDC (Open Quantum Data Commons), an extensive, publicly accessible quantum mechanical dataset.

OpenQDC is designed to accelerate machine learning in molecular sciences, particularly in drug discovery and material innovation, by providing standardized, high-quality data.

OpenQDC consolidates nearly 40 quantum mechanical datasets into one unified repository, encompassing 1.5 billion geometries, 70 atom species, and over 250 quantum methods. The datasets cover a wide range of chemical interactions relevant to organic chemistry and molecular dynamics, enabling precise algorithm development for Machine Learning Interatomic Potentials (MLIP).

[Figure: The OpenQDC hub, showing some of its datasets. The hub includes search functionality.]

Features

  • Standardized Metadata: Each dataset includes essential metadata, such as the units used for energy, distance, and forces, along with isolated atom energies computed to ensure accuracy and consistency.
  • Comprehensive Coverage: The data spans small organic molecules, diverse chemical elements, and multiple computational methods.
  • ML-Ready Design: Optimized for Python-based ML workflows, datasets are preprocessed for compatibility with frameworks like PyTorch, JAX, and NumPy.
  • Effortless Access: With a simple Python API, datasets can be loaded and used with one line of code.
  • Performance-Focused: Supports efficient formats like memmap, zarr, and xyz for seamless data manipulation.
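
To illustrate why the unit and isolated-atom-energy metadata listed above matter, here is a minimal sketch in plain Python. It is not the OpenQDC API: the function names and the numeric energies are hypothetical placeholders chosen for easy arithmetic, and only the hartree-to-kcal/mol conversion factor is a standard constant. The idea is that subtracting per-element reference energies and converting to a common unit puts energies from different sources on a comparable scale.

```python
# Illustrative sketch only; NOT the OpenQDC API. All names are hypothetical.
HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard conversion factor


def atomization_energy(total_energy, atomic_numbers, isolated_atom_energies):
    """Subtract the summed isolated-atom energies from a total energy.

    All energies must be in the same unit (hartree in this sketch).
    """
    reference = sum(isolated_atom_energies[z] for z in atomic_numbers)
    return total_energy - reference


def convert_energy(value, factor=HARTREE_TO_KCAL_PER_MOL):
    """Convert an energy by multiplying with a unit conversion factor."""
    return value * factor


# Placeholder values, not real quantum-chemistry results.
isolated = {1: -0.5, 8: -75.0}  # hartree, keyed by atomic number
water = [8, 1, 1]               # atomic numbers of O, H, H
total = -76.4                   # hartree

e_hartree = atomization_energy(total, water, isolated)
e_kcal = convert_energy(e_hartree)
print(f"{e_hartree:.3f} Ha = {e_kcal:.1f} kcal/mol")
```

In a real workflow the reference energies depend on the quantum method used, which is why storing them alongside each dataset is useful.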

Applications

OpenQDC is designed to democratize access to quantum mechanical data, addressing a key bottleneck in the development of accurate ML models for molecular simulations. Researchers can use its datasets to advance MLIP algorithm development; explore molecular phenomena such as protein folding, binding mechanisms, and allosteric sites; and accelerate discovery in molecular dynamics.

Some dataset examples:

  • ANI-1: A foundational dataset containing 22 million structural conformations of 57,000 small organic molecules composed of C, N, O, and H atoms.
  • Alchemy: Features 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.
  • QM7X: A comprehensive collection of 4.2 million conformers for 6,950 organic molecules, featuring up to seven heavy atoms.
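
As a rough sense of sampling density, the figures quoted above can be turned into an average number of conformations per molecule. The sketch below uses only the counts stated in this article; the dictionary layout and variable names are illustrative. Alchemy is omitted because only its molecule count is given here.

```python
# Counts as quoted in the article: (conformations, molecules).
datasets = {
    "ANI-1": (22_000_000, 57_000),
    "QM7X": (4_200_000, 6_950),
}

# Average conformations per molecule for each dataset.
per_molecule = {
    name: conformations / molecules
    for name, (conformations, molecules) in datasets.items()
}

for name, ratio in per_molecule.items():
    print(f"{name}: about {ratio:.0f} conformations per molecule")
```

By these numbers, ANI-1 averages roughly 386 conformations per molecule and QM7X roughly 604.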

Availability

OpenQDC is open-source and available for download on GitHub: https://github.com/valence-labs/openQDC. Visit OpenQDC.io for further details.

OpenQDC uses the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license to distribute its datasets, aligning with its mission to democratize access to quantum mechanical data for non-commercial research and development. This licensing choice ensures that researchers can freely use, adapt, and share the datasets while adhering to non-commercial restrictions.
