To perform a substructure search, that is to determine whether strychnine contains a substructure with
the atom connectivity of urea, the database plug-in responsible for the indexing and manipulation of
chemical entities must find an occurrence of the adjacency matrix A(G) in the similarly constructed
adjacency matrix for strychnine. The graph atom connectivity observed in urea is not in strychnine;
searching a chemical structure database for all compounds containing the connectivity of urea (excluding
hydrogen atoms) would not retrieve strychnine in the database result set.
If a chemical structure database were searched for information about chemical entities
that contained the atom connectivity exhibited by
indole as a substructure,
strychnine would be included in the database result set. In this instance, the chemoinformatics practitioner
is querying the chemical structure database for all moieties that contain the substructure of indole.
The two adjacency matrices A(Indole') and A(Indole'') both represent adjacency matrices for indole.
There are others. The chemically aware indexing component will search the adjacency matrix of strychnine for
the occurrence of the adjacency matrix of indole. The connectivity observed
for indole and represented in its adjacency matrix A(Indole') is present in the adjacency matrix for
strychnine A(Strychnine').
When substructure searches are performed, or when key index fields required for SQL predicates as described
in the
Cope-Chat discussion are required, comparisons of adjacency matrices
are performed. For example, to confirm that strychnine possesses the substructure index keys (
eg 3° amine,
allylic ether, amide, indole, amide, number and size of rings, chiral flag), or to confirm that the
substructure of strychnine is contained within brucine.
It should be noted that A(Indole'') is also an adjacency matrix of indole. This adjacency matrix differs from
A(Indole') in that the atom numbering in the indole substructure has been varied slightly. In this example
the detection of A(Indole'') in A(Strychnine') is not trivial. The atom numbering in A(Strychnine'), A(Indole'),
and A(Indole'') is arbitrary. Variation in the numbering of any of the graphs may result in adjacency
matrices other than those shown herein. Accordingly the detection of A(Indole'') in A(Strychnine) may require
considerable computational resource. Such tasks however need not require expensive resource
if sensible algorithms are used (
ie not brute force).
Adjacency Matrix Variations
The adjacency matrix is one of a number of graph theoretical matrices that can be used for
manipulation of chemical structure data. Other variations useful for chemical structure databases are the
distance matrix, or vertex and edge-weighted
matrices encompassing the vertex label (the chemical element) and bond type (sp³, sp², and
sp, tautomeric, aromatic, ionic
etc).
References
The reader is directed to the seminal work of
Ullman
(Ullman, J. R.; An Algorithm for Subgraph Isomorphism, J. ACM, vol. 23(1) 1976) for further discussion
and citation references. At the time of writing (October 2005), Professor
Brendan D. McKay
claims that nauty is the world's fastest isomorphism testing implementation. Further information including
C source
code can be found on the
nauty
website. Readers are also directed to the
VFLib Graph Matching Library, a
C++ implementation of a number of
graph matching algorithms. An implementation of this algorithm is also available in
C# from the
CodeProject.
Summary
If database indexes were to be implemented for chemical moiety functional groups, as outlined during the
discussion of
Cope-Chat cards, or comparison of the structure of one graph represented
as an adjacency matrix with another within the extensible indexing component, graph and subgraph isomorphism
detection would be used. The implementation of these algorithms would be the crux of the chemical structure
database extensible indexing component. In Oracle, these components are referred to as Oracle Data Cartridges.
The underlying theory is described in the primary literature. The source code for some
implementations is also in the public domain.
COTS products that encapsulate structure and substructure searching within chemical structure database
systems are available.
Last updated/modified 10 July 2009