50 Million Chemicals and Counting

UPDATE: Sept 8 – Compound 50m in the CAS Registry is a novel arylmethylidene heterocycle with analgesic properties called (5Z)-5-[(5-fluoro-2-hydroxyphenyl)methylene]-2-(4-methyl-1-piperazinyl)-4(5H)-thiazolone (Registry number 1181081-51-5).

According to an email I received from a CAS spokesman, “The number itself represents an important milestone both for researchers and CAS, but even more significant is the pace of scientific discovery around the world.” Roger Schenck, Manager of the Content Planning Department at CAS, adds that, “More scientific literature is being published and we have noticed an explosive growth of patent literature since 1998 that accounts for the rapid growth of substance information available.”

By contrast, it took 33 years for CAS to register 10 million compounds, a milestone reached in 1990.

It’s intriguing to think that, two decades after I wrote a news item (very early in my career) for one of the chemistry trade magazines on the announcement of that 10 millionth entry, CAS should be recording its 50 millionth substance. Indeed, it’s a mere nine months since it announced the 40 millionth.
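
As a back-of-envelope check on those figures, here is a minimal sketch in Python. It assumes each interval can be treated as a simple linear average (33 years for the first 10 million, about nine months for the most recent 10 million); the function and variable names are mine, not anything from CAS.

```python
# Rough average registration rates implied by the figures quoted in the post.
# Assumption: each interval is treated as a simple linear average.

def per_year(substances: int, years: float) -> float:
    """Average number of substances registered per year."""
    return substances / years

# The first 10 million substances took 33 years.
first_10m = per_year(10_000_000, 33)

# The step from the 40 millionth to the 50 millionth took about nine months.
latest_10m = per_year(10_000_000, 9 / 12)

# How many times faster the recent pace is than the early average.
speedup = latest_10m / first_10m

print(f"First 10 million: about {first_10m:,.0f} per year")
print(f"Latest 10 million: about {latest_10m:,.0f} per year")
print(f"Ratio: roughly {speedup:.0f}x")
```

On those assumptions, the recent pace works out to roughly 44 times the average rate of the registry's first three decades.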

Apparently, the predominant source of this new chemical substance information is the global patent literature. Several years ago, patents accounted for approximately 20 percent of the substance information added to the registry. Today, that number is closer to 70 percent. It was that statement that intrigued me most.

But I wonder… if they’re scraping patents on such a vast scale, is the addition of a few extra million entries actually representative of technological advance? An alternative explanation is that it simply shows how clever patent attorneys are at working with chemists to couch their claims in such imaginative ways as to envelop a whole chemical space in a single sentence.

The increase could be a real indication that researchers are increasingly thinking in terms of monetizing their discoveries, and doing so much earlier in the research process. It could, of course, be due to increasing research around the world, or maybe it’s driven by demand for more advanced electronics and the need for materials for such devices. There are also increasing demands from medical and pharmaceutical research. But could this have led to so many million more compounds?

I’m sure it’s not just CAS running a “stamp collecting” business; there has been research demonstrating molecular diversity in the collection.

Schenck confirmed that molecular diversity is something CAS takes seriously. “In regards to molecular diversity in CAS Registry, CAS scientists recently published an article in the Journal of Organic Chemistry on structural diversity among the 24 million organic substances in Registry at the time and may help to answer in-depth diversity questions,” he says.

He also pointed out that CAS monitors the literature as it is published and selects substances that meet its criteria. To be added, the structure must come from a reputable source, including but not limited to patents, journals, chemical catalogues, and selected substance collections on the web. It has to have been described in largely unambiguous terms, characterized by physical methods or described in a patent document example or claim. It also has to be consistent with the laws of atomic covalent organization.

There are some subtle legislative effects at play too, as Schenck explains:

In the academic community, such activities were greatly enhanced by U.S. legislation passed in 1980, the Bayh-Dole Act, which requires that universities actively seek commercialization for federally-funded research.

The 50-millionth compound will be an interesting milestone. Its identity will not be revealed until tomorrow. It’s probably not going to be a magic bullet for disease or an environmental panacea, but it’s not going to be a trivial compound either. Just how interesting it is will be determined over time; after all, there are few compounds without any intrinsic interest.

It would be a happy coincidence if this 50 millionth entry just happened to be something chemically fascinating, to drive innovation from cancer research and nanotechnology to alternative fuel vehicles, cell phones and more. I suspect it will be a little more mundane, but 50 million entries in any collection is a significant milestone regardless.

Lipkus, A., Yuan, Q., Lucas, K., Funk, S., Bartelt, W., Schenck, R., & Trippe, A. (2008). Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry. The Journal of Organic Chemistry, 73 (12), 4443-4451. DOI: 10.1021/jo8001276