Billions and billions…of molecules?

I’ve written about the CAS Registry – the enormous database of small and large molecules – on several occasions over my quarter of a century in science communication. It usually comes up when they reach a milestone. Indeed, I remember writing about the day they registered their 10 millionth structure; that was in either The Guardian or New Scientist, I don’t remember which, in the early 1990s. I wrote about it much more recently here on the Sciencebase blog, back in September 2009, when they reached 50 million structures. How can there be so many chemicals? Surely we are approaching some kind of limit? Well, no. We are nowhere near.

As Daniel Merkle of the University of Southern Denmark, in Odense, and colleagues point out in a recent issue of the International Journal of Computational Biology and Drug Design, the chemical space of possible molecules is vast, really vast. I just checked CAS and their most recent press release mentioned them passing the 75 million structure landmark in November 2013.

But their homepage mentions 87 million unique organic and inorganic chemical substances, such as alloys, coordination compounds, minerals, mixtures, polymers and salts, and more than 65 million protein sequences. The implication is that other databases hold entries that may well not be represented in the CAS Registry at all. Even so, these tens of millions pale into insignificance when compared with the almost 200 billion possible structures that might be constructed with up to 17 atoms of carbon, nitrogen, oxygen, sulfur and the halogens (fluorine, bromine, chlorine, iodine…).

“The chemical universe of molecules reachable from a set of start compounds by iterative application of a finite number of reactions is vast,” Merkle and colleagues say. They point out that highly sophisticated and efficient exploration strategies are needed to allow chemists to explore this combinatorial complexity in the quest for novel molecules that diverge in structure from the many known compounds and might thus have previously unreported properties or, more critically for organic and medicinal chemists, physiological activity.

The team has now devised a new approach to chemical space exploration based on the structural graph of a molecule: the connectivity and arrangement of its atoms, as drawn in its structural formula. If the atoms are the vertices of the graph, labelled by element, and the chemical bonds holding them together are the “edges”, then a chemical reaction can be described as a transformation of one graph into another. Thus chemical space might be explored in terms of the possible transformations from a starting material to a range of possible products. The rules of the graph grammar encode the reaction mechanisms that give rise to the transformations. Of course, chemical space might be infinite if we allow polymers, where individual molecules, monomeric building blocks, are simply strung together in arbitrary numbers. But, polymers aside, the space remains vast and so efficient methods are needed to map plausible graph transformations and yield a new virtual registry of possible structures that might be accessed by synthetic organic chemistry.
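To make the graph idea a little more concrete, here is a toy sketch in Python. It is emphatically not the team's software, just a minimal illustration under my own assumptions: the starting molecule (acetaldehyde), the single reaction (keto-enol tautomerism and its reverse) and every name in the code (ATOMS, bond, apply_rule, explore) are made up for the example. Atoms are labelled vertices, bonds are edges with a bond order, and a reaction is a rule that swaps one small set of edges for another.

```python
from collections import deque
from itertools import permutations

# Vertices: atoms labelled by element. The labels never change in this toy,
# so a molecule "state" is simply its set of bonds (edges).
# Assumed starting material: acetaldehyde, CH3-CHO, with explicit hydrogens.
ATOMS = ("C", "C", "O", "H", "H", "H", "H")

def bond(i, j, order):
    """An edge stored as (smaller index, larger index, bond order)."""
    return (min(i, j), max(i, j), order)

START = frozenset({
    bond(0, 1, 1),                                 # C-C
    bond(1, 2, 2),                                 # C=O
    bond(0, 3, 1), bond(0, 4, 1), bond(0, 5, 1),   # methyl C-H bonds
    bond(1, 6, 1),                                 # aldehyde C-H
})

# A rule is (pattern labels, bonds before, bonds after); the bond tuples
# refer to positions in the pattern, not to atoms in the molecule.
# Keto to enol: H-C-C=O becomes C=C-O-H.
KETO_TO_ENOL = (
    ("H", "C", "C", "O"),
    ((0, 1, 1), (1, 2, 1), (2, 3, 2)),
    ((1, 2, 2), (2, 3, 1), (3, 0, 1)),
)
# The reverse reaction just swaps the "before" and "after" sides.
ENOL_TO_KETO = (KETO_TO_ENOL[0], KETO_TO_ENOL[2], KETO_TO_ENOL[1])

def apply_rule(rule, bonds):
    """Yield every bond set obtained by applying the rule once."""
    labels, before, after = rule
    # Brute-force matching: try every assignment of atoms to pattern positions.
    for match in permutations(range(len(ATOMS)), len(labels)):
        if any(ATOMS[m] != lab for m, lab in zip(match, labels)):
            continue
        old = {bond(match[i], match[j], o) for i, j, o in before}
        if old <= bonds:  # the "before" side is present in the molecule
            new = {bond(match[i], match[j], o) for i, j, o in after}
            yield frozenset((bonds - old) | new)

def explore(start, rules, max_steps):
    """Breadth-first search of everything reachable in max_steps rewrites."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == max_steps:
            continue
        for rule in rules:
            for product in apply_rule(rule, state):
                if product not in seen:
                    seen.add(product)
                    frontier.append((product, depth + 1))
    return seen

reachable = explore(START, [KETO_TO_ENOL, ENOL_TO_KETO], max_steps=2)
print(len(reachable))  # prints 4
```

The four states are the starting aldehyde plus three enols that differ only in which of the three methyl hydrogens moved to the oxygen. A chemist would call those three the same molecule, vinyl alcohol. This toy cannot tell, because it compares numbered atoms rather than checking whether two graphs are equivalent, and its brute-force matching would grind to a halt on anything much bigger. That is a small taste of why efficient strategies matter once the starting set and the rule book grow.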

The team has demonstrated proof of principle with key examples of complex reaction networks from carbohydrate chemistry and shown that their approach produces a feasible high-level strategy for generating possible new molecules. It might even help chemists get to the 100 million mark in the CAS Registry, although that will still be barely a dent in the billions upon billions* of molecules in chemical space.

Andersen, J.L., Flamm, C., Merkle, D. and Stadler, P.F. (2014) ‘Generic strategies for chemical space exploration’, Int. J. Computational Biology and Drug Design, Vol. 7, Nos. 2/3, pp.225-258.

*With a nod and a wink to the late, great Carl Sagan.