Site Index

Cope-Chat Cards
Chemical Databases
Extensible Indexing
Data Cartridge
Subgraph Isomorphism
Chemical Awareness
Under the Bonnet
Reaction Databases
Atom-Atom Mapping
Markush Structures
Selected Publications
Contact Details

Home






Valid XHTML 1.0 Strict

Valid CSS!

Subgraph Isomorphism

Introduction

A mathematician could define subgraph isomorphism as the determination of a subgraph of a graph which is isomorphic to another graph. From the chemoinformatic practitioners perspective, specifically relating to 2D chemical structure databases, the definition might be more loosely defined as a substructure search.
A typical substructure search might involve retrieving all compounds from a from a chemical database that look like strychnine. The action of executing a query of this type results in many compounds being included in the database result set, including brucine (the differences highlighted in blue).
Chemical structure of Brucine Chemical structure of Strychnine
Brucine Strychnine
This hypothetical database query is an example of subgraph isomorphism applied to chemoinformatics. The alkaloid brucine contains a wildcard match for strychnine.

Adjacency matrixes

Out of the box relational databases such DB2 and Oracle are not chemically aware. The ANSI recommendations do not describe and the vendor database implementations do not offer enhanced SQL with functionality to perform, for example, chemical substructure searches. For DBMS implementations that require chemical awareness, the void is addressed with third party extensible indexing database components. The COTS products of CambridgeSoft, MDL, Daylight, ChemAxon, Tripos, InfoChem, gNova Scientific Systems and Accelrys all offer a variation on this theme.
Within the extensible indexing component, chemical entities can be represented as adjacency matrices. For example, the vertex-adjacency matrix A(G) of a labelled connected graph G (urea) with 4 vertices (excluding hydrogen) is the 4 (rows) x 4 (columns) diagonal matrix shown.
Adjacency matrix of Urea, the numbers in blue corresponding to the atom numbering in Urea shown to the right Adjacency matrix of Urea
One adjacency matrix for urea
To perform a substructure search, that is to determine whether strychnine contains a substructure with the atom connectivity of urea, the database plug-in responsible for the indexing and manipulation of chemical entities must find an occurrence of the adjacency matrix A(G) in the similarly constructed adjacency matrix for strychnine. The graph atom connectivity observed in urea is not in strychnine; searching a chemical structure database for all compounds containing the connectivity of urea (excluding hydrogen atoms) would not retrieve strychnine in the database result set.
If a chemical structure database were searched for information about chemical entities that contained the atom connectivity exhibited by indole as a substructure, strychnine would be included in the database result set. In this instance, the chemoinformatics practitioner is querying the chemical structure database for all moieties that contain the substructure of indole.
Adjacency matrix of Indole Adjacency matrix of Indole
A(Indole') A(Indole'')
The two adjacency matrices A(Indole') and A(Indole'') both represent adjacency matrices for indole. There are others. The chemically aware indexing component will search the adjacency matrix of strychnine for the occurrence of the adjacency matrix of indole. The connectivity observed for indole and represented in its adjacency matrix A(Indole') is present in the adjacency matrix for strychnine A(Strychnine').
Adjacency matrix of Strychnine
A(Strychnine') - an adjacency matrix for strychnine
When substructure searches are performed, or when key index fields required for SQL predicates as described in the Cope-Chat discussion are required, comparisons of adjacency matrices are performed. For example, to confirm that strychnine possesses the substructure index keys (eg 3° amine, allylic ether, amide, indole, amide, number and size of rings, chiral flag), or to confirm that the substructure of strychnine is contained within brucine.
It should be noted that A(Indole'') is also an adjacency matrix of indole. This adjacency matrix differs from A(Indole') in that the atom numbering in the indole substructure has been varied slightly. In this example the detection of A(Indole'') in A(Strychnine') is not trivial. The atom numbering in A(Strychnine'), A(Indole'), and A(Indole'') is arbitrary. Variation in the numbering of any of the graphs may result in adjacency matrices other than those shown herein. Accordingly the detection of A(Indole'') in A(Strychnine) may require considerable computational resource. Such tasks however need not require expensive resource if sensible algorithms are used (ie not brute force).

Adjacency Matrix Variations

The adjacency matrix is one of a number of graph theoretical matrices that can be used for manipulation of chemical structure data. Other variations useful for chemical structure databases are the distance matrix, or vertex and edge-weighted matrices encompassing the vertex label (the chemical element) and bond type (sp³, sp², and sp, tautomeric, aromatic, ionic etc).

References

The reader is directed to the seminal work of Ullman (Ullman, J. R.; An Algorithm for Subgraph Isomorphism, J. ACM, vol. 23(1) 1976) for further discussion and citation references. At the time of writing (October 2005), Professor Brendan D. McKay claims that nauty is the world's fastest isomorphism testing implementation. Further information including C source code can be found on the nauty website. Readers are also directed to the VFLib Graph Matching Library, a C++ implementation of a number of graph matching algorithms. An implementation of this algorithm is also available in C# from the CodeProject.

Summary

If database indexes were to be implemented for chemical moiety functional groups, as outlined during the discussion of Cope-Chat cards, or comparison of the structure of one graph represented as an adjacency matrix with another within the extensible indexing component, graph and subgraph isomorphism detection would be used. The implementation of these algorithms would be the crux of the chemical structure database extensible indexing component. In Oracle, these components are referred to as Oracle Data Cartridges.
The underlying theory is described in the primary literature. The source code for some implementations is also in the public domain.
COTS products that encapsulate structure and substructure searching within chemical structure database systems are available.

Last updated/modified 10 July 2009