Clide every mountain
by David Bradley
is all about structure and, as readers of these pages well know, there are
countless packages for drawing and editing chemical structures. But, what
if you have a parcel of scanned reprint research papers or a pile of
downloads from which you would like to extract the chemical information?
One way would be to work your way through each paper, redrawing the
structures therein in one of those drawing packages. The problem, of
course, being that printed pictures, embedded gifs and the like carry with
them none of the underlying chemical information essential for creating an
accurate chemical structure with the atoms and bonds and the right places
that can be manipulated as a "real" chemical model albeit on the screen.
Clide - Chemical Literature Data Extraction project - does for chemists and drug discovery researchers what OCR (optical character recognition) does for wordsmiths. It turns flat two-dimensional semantics free representations and turns them into the "real" thing. Take a bitmap image or an Adobe portable document format (PDF) file of that molecular structure, run it through Clide and the product is the more familiar, but crucially manipulable, chemical structure.
The program started out life as a project in the Department of Chemistry at the University of Leeds, UK, and is now a commercially available entity that comes in two flavours. A full version and a Lite version. Clide extracts information from chemical literature then store it in a database. The internal representation of Clide stores ALL information included in the structure in such a way that it can be converted to other formats. The input of the program is a bitmap image (BMP) of a chemical document or PDF file.
The program outputs information as an ASCII file containing the recognised structures, the reactions and any text associated with the image. This is stored in the Clide database. By default, information is held in Clide file format, which contains all the information that has been processed by Clide. However, the information can be exported in other formats such as the commonly used MDL molfile and CambridgeSoft's ChemDraw v 6.0 format, mac ChemDraw is also available as are two formatting and printing systems Postscript and text only. This makes Clide a rather versatile character is these formats can be handled by more sophisticated chemical spreadsheets and databases. They also allow the extracted structures to be easily edited, displayed and converted in a variety of useful ways. A mol file, for instance, can be rendered in three dimensions on screen using MDL's associated Chime program, or its precursor system Rasmol, in a web browser). However, these proprietary formats contain only the raw molecular structure information that their design allows. The molfile contains the connection tables of the structures, for example, so page numbers, titles and associated text and images are lost in converting to this format.
Clide document/image processing takes place in three constituent parts. First, is the physical document structure analysis. This consists of the identification of the connected components of the image, what Clide makes by loading an image and the demarcation of the image into its graphic and textual regions. Bibliographical information (i.e. title and author of a document) is extracted automatically.
Second is the recognition of the primitives. This starts with classification of the connected components into characters, lines and graphics. This is followed by the application of the recognition process to these elements by Clide's sophisticated algorithms. The characters are recognised by the program's OCR module while the graphic lines are recognised by the graphic recognition module.
And, finally, the logical document structure recognition. The page is analysed and chemical structures recognised as logical units. This involves building the connection table of the structure that relates each point and line to an atom and a bond. At this stage, reactions too are recognised so that reactant and product can be differentiated. Generic text is also identified at this stage and checked, or parsed. Lastly, any reaction text is then identified and parsed so that it is not lost into the milieu of the body text but rather retained within any reaction scheme.
The system can also extract generic structure information from graphics and text and either expand or condense such structures before they are entered into a database. The whole process takes seconds per "printed" page on the meanest of PCs (e.g. a 366 MHz PII laptop with 160 MB memory takes less than half a second to process a bitmap image to MDL Molfile format).
It is a fantastic program in concept and I can see it being developed much further for large pharmaceutical company databases and abstracting services once the programmers have ironed out the residual bugs. Clide could also dovetail nicely with Simbiosys' simulated biomolecular systems in which the company provides a software system for rational drug design and the optimization of small, organic therapeutic molecules. One additional area in which Clide could lend a helping hand is in the creation of a truly chemically literate web search engine.
Although the application of chemoinformatics in the drug discovery process has drastically grown in the past two decades, to the point that it is currently being applied at all stages of the process, there is still huge potential for growth. Integrated data storage and retrieval solutions, that interconnect and provide feedback to the various links in the chain are increasingly desirable. Chemically intelligent solutions like Clide will help produce successful drugs faster and with less cost.
Clide is available for the MS Windows 2000/NT 4.0/XP/Me/98/95 platforms.