In March 2006, I interviewed PubChem’s Steve Bryant for the Reactive Reports chemistry webzine, and he revealed some of the inner workings and aims of the PubChem chemistry database. Ever since, I’ve been rather curious about the growth of the site and how many scientists are using it. Unfortunately, Bryant tells me, getting a handle on that kind of data is difficult. “It’s a very tricky business to accurately condense all the raw log info on hits and IP addresses into an accurate summary of who’s using a given resource and how,” he explains.

However, there are a few tricks you can use to extract some useful information from the site. There is an easy way to look at the current contents of the databases, for instance. The best approach is to go to the “global query” page and enter “all[filter]” (no quotes) in the search box. This gives a count of the records in each database, e.g. 10,358,219 PubChem compounds, 552 assays, etc. There is also a summary of contributors to PubChem, which lists the numbers of substances or assays by organization.
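
For anyone who would rather script the “all[filter]” count than type it into the search box, here is a minimal Python sketch. It rests on assumptions that are not from Bryant: that NCBI's E-utilities `esearch.fcgi` endpoint accepts the same query, that the PubChem databases are named `pccompound`, `pcsubstance` and `pcassay`, and that the JSON reply carries the total under `esearchresult.count`. The helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and database names (NCBI E-utilities), not taken from the interview.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
PUBCHEM_DBS = ("pccompound", "pcsubstance", "pcassay")


def build_count_url(db, term="all[filter]"):
    """Build an esearch URL that asks only for the number of matching records."""
    params = {"db": db, "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def parse_count(payload):
    """Pull the record count out of an esearch JSON reply (assumed layout)."""
    return int(json.loads(payload)["esearchresult"]["count"])


def fetch_count(db):
    """Query one database over the network and return its record count."""
    with urlopen(build_count_url(db)) as resp:
        return parse_count(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for db in PUBCHEM_DBS:
        print(db, fetch_count(db))
```

Run as a script, it prints one line per database. The numbers will of course differ from the ones quoted above, since the databases keep growing.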

Now, obviously that doesn’t provide usage stats, but it does highlight a newsworthy aspect of developments at PubChem. Over the past year, the number (and diversity) of screening assay results has increased. “We’re now up to over 10 million substance test results (sum of the number of substances tested in each assay, across all assays),” says Bryant. “We’ve also put some work into structure-activity analysis tools. For example, from the first assay answering the all[filter] query (AID 728, Factor XIIa Dose Response Confirmation), try ‘Related BioAssays | Related BioAssays, by Target Similarity’, then ‘Structure Activity Analysis’.”

Bryant points out that this “heatmap” display isn’t useful to all users. However, screeners who want to check on the selectivity of their “hits” are using these tools more and more, he says.

2 thoughts on “PubChem Statistics”

  1. For log analysis, I can very highly recommend Lire (disclaimer: I am a former developer). It lets you define custom analyses and is very flexible in general. The software basically works with an intermediate DLF format, where each column contains a particular bit of information of interest. For example, a log file processor can easily add a column with the type of request, extracted from the URL. Such a processor could also look at the request and extract the CID, possibly by reverse lookup of InChI and SID. A report generator can then do all sorts of analysis and, for example, create a report for ‘Most looked up Compounds per Continent’, or ‘Most Common Set of Five Compounds’ by tracking the click-path people used. All quite trivial in Lire. The good news: Lire is distributed with several Linux distributions. The bad news: development has slowed down somewhat.

  2. I’m interested to know whether there are any available stats on the number of users per day querying PubChem, the number of searches they execute and the types of searches they do. In fact, it would be interesting to know similar stats for CAS SciFinder searches. I have to imagine that the number of queries must be in the tens of thousands per day for SciFinder and thousands per day for PubChem…but these are just perceptions…are there actual numbers anywhere?
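
The log-processing idea in the first comment (pull the CID out of each request, then report the most looked-up compounds) can be sketched in a few lines of plain Python. This is not Lire and does not use its DLF format. It assumes web-server lines in the common log format and assumes the compound ID appears as a `cid=` query parameter; both are guesses about what PubChem's logs look like, and the function names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request field of a common-log-format line (assumed format).
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def extract_cid(log_line):
    """Return the numeric cid= parameter of the requested URL, or None."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group(1)).query)
    values = query.get("cid")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


def most_viewed(log_lines, top=5):
    """Count CID lookups across the log and return the most frequent ones."""
    cids = (extract_cid(line) for line in log_lines)
    return Counter(c for c in cids if c is not None).most_common(top)
```

Passing an open log file to `most_viewed` gives a list of (CID, hits) pairs. A per-continent breakdown, as the comment suggests, would additionally need an IP-to-location lookup, which is left out here.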

