Chemical Database Systems
The Cope-Chat approach previously described remains the crux of most
chemoinformatic database systems. The implementation provides a mechanism
to search on chemical (sub)structure or reaction scheme within the database.
How?
The pseudocode example below demonstrates how a simple database table could,
in principle, be queried for the free base of strychnine.
SELECT quantity
FROM tblStores
WHERE quantity>65
AND buildingCode=110
AND chemicalStructure=[structure drawing of strychnine free base]
AND siUnit='g'
AND country='uk'
AND county='oxfordshire'
SQL, the language of the relational database, does not directly offer such
rich or convenient functionality, however. The compound must be preprocessed
and its key attributes extracted or calculated. For strychnine, these details
may include the relative molecular weight (334.42 g/mol), a fingerprint, and
key functional groups or molecular characteristics such as tertiary amine,
amide, levels of unsaturation, etc. Bar the relative molecular weight and
fingerprint, each of these characteristics could correspond to a numbered or
labelled hole along the edges of a Cope-Chat card.
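The preprocessing step described above can be sketched in a few lines of Python. This is a minimal illustration rather than any product's method: the function names and the dictionary layout are assumptions, the molecular weight is summed from rounded standard atomic weights, and the functional-group flags are supplied by hand, whereas a real system would perceive them from the structure itself.

```python
import re

# Rounded standard atomic weights (g/mol); only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C21H22N2O2'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return round(total, 2)


def screening_keys(formula: str, groups: dict) -> dict:
    """Bundle the calculated weight with hand-assigned group flags.

    The flags play the role of the punched holes on a Cope-Chat card.
    """
    keys = {"molecularWeight": molecular_weight(formula)}
    keys.update(groups)
    return keys


# Hypothetical key set for strychnine (C21H22N2O2); flags entered manually.
strychnine_keys = screening_keys(
    "C21H22N2O2", {"tertiaryAmine": 1, "amide": 1, "ester": 0}
)
print(strychnine_keys)
```

Each key in the resulting dictionary would become one column of the table that the SQL query searches.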
For database queries attempting to retrieve information on a discrete
chemical compound, the SQL query above could be rewritten:
SELECT something
FROM tableName
WHERE molecularWeight = 334.42
AND tertiaryAmine=1
AND ester=0
AND weinrebAmide=0
AND amide=1
AND .......
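A runnable version of this kind of screening query can be tried with Python's built-in sqlite3 module. The sketch below rests on several assumptions: the table name tblScreen and its columns are hypothetical, every row other than strychnine is invented purely for illustration, and the exact equality test on molecular weight is replaced by a small tolerance window because comparing floating-point values with = is fragile.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblScreen (
        compoundId      INTEGER PRIMARY KEY,
        name            TEXT,
        molecularWeight REAL,
        tertiaryAmine   INTEGER,
        ester           INTEGER,
        amide           INTEGER
    )
    """
)

# Only the strychnine row reflects a real compound; the rest are made up.
rows = [
    (1, "strychnine", 334.42, 1, 0, 1),
    (2, "hypothetical isomer A", 334.42, 1, 0, 1),
    (3, "hypothetical compound B", 334.42, 0, 1, 0),
    (4, "hypothetical compound C", 180.16, 0, 1, 0),
]
conn.executemany("INSERT INTO tblScreen VALUES (?, ?, ?, ?, ?, ?)", rows)

SCREEN_SQL = """
    SELECT name
    FROM tblScreen
    WHERE molecularWeight BETWEEN ? AND ?
      AND tertiaryAmine = 1
      AND ester = 0
      AND amide = 1
    ORDER BY compoundId
"""

tolerance = 0.005  # assumed window around the target weight
candidates = [
    row[0]
    for row in conn.execute(SCREEN_SQL, (334.42 - tolerance, 334.42 + tolerance))
]
print(candidates)
```

With these invented rows the screen keeps two candidates, since the predicates cannot tell strychnine apart from an isomer carrying the same functional groups.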
Such an SQL query may return a database table row containing information
relevant to the chemical structure of strychnine. However, the result set
may also contain rows for the many other chemical structures that have a
molecular weight of 334.42 g/mol and match the other search predicates
(e.g. the compound has a single tertiary amine and no Weinreb amide
functional groups). Post-query processing is therefore required to ensure
that the intermediate query results truly match the original search
criteria. Generally this is a very CPU-intensive task, so it is necessary
to ensure that the predicates above reduce the candidate result set to
a minimum.
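The verification step can be illustrated on molecules small enough to check by hand. Propan-1-ol and propan-2-ol share the formula C3H8O and each contain one alcohol group, so a screen on molecular weight and group counts would return both. The sketch below assumes a deliberately simple representation (heavy atoms only, single bonds, no stereochemistry) and uses brute-force graph matching, which is only practical for tiny graphs; production systems use far more efficient algorithms.

```python
from itertools import permutations

# Heavy-atom graphs: (atom symbols, bonds as pairs of atom indices).
PROPAN_1_OL = (["C", "C", "C", "O"], {(0, 1), (1, 2), (2, 3)})
PROPAN_2_OL = (["C", "C", "C", "O"], {(0, 1), (1, 2), (1, 3)})


def same_structure(a, b) -> bool:
    """Return True if graphs a and b are identical up to atom numbering."""
    atoms_a, bonds_a = a
    atoms_b, bonds_b = b
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    bonds_b_norm = {frozenset(bond) for bond in bonds_b}
    # perm[i] is the atom of b onto which atom i of a is mapped.
    for perm in permutations(range(len(atoms_b))):
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(len(atoms_a))):
            continue
        mapped = {frozenset((perm[i], perm[j])) for i, j in bonds_a}
        if mapped == bonds_b_norm:
            return True
    return False


def verify(query, candidates):
    """Keep only the screened candidates that truly match the query."""
    return [name for name, graph in candidates if same_structure(query, graph)]


# Candidates as they might come back from the coarse SQL screen.
candidates = [("propan-1-ol", PROPAN_1_OL), ("propan-2-ol", PROPAN_2_OL)]

# Query: propan-1-ol drawn with a different atom numbering (oxygen first).
query = (["O", "C", "C", "C"], {(0, 1), (1, 2), (2, 3)})
hits = verify(query, candidates)
print(hits)
```

The loop over permutations grows factorially with the number of atoms, which shows why the candidate set should be as small as possible before this step runs.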
For wildcard chemical database interrogation such as substructure
searching, the same type of query would be used. However, the
post-processing step would also have to interpret the substructure, open
valency, salts, R groups, etc.
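For substructure searching, the post-processing check changes from "is this the same graph?" to "does the query graph occur inside this graph?". The following sketch makes the same simplifying assumptions as a toy model would (heavy atoms only, single bonds) and does not attempt to handle open valency, salts, or R groups. The function name and data layout are hypothetical.

```python
from itertools import permutations


def has_substructure(query, target) -> bool:
    """Return True if every query atom and bond maps into the target."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    t_bonds_norm = {frozenset(bond) for bond in t_bonds}
    # Try every assignment of distinct target atoms to the query atoms.
    for image in permutations(range(len(t_atoms)), len(q_atoms)):
        if any(q_atoms[i] != t_atoms[image[i]] for i in range(len(q_atoms))):
            continue
        if all(frozenset((image[i], image[j])) in t_bonds_norm for i, j in q_bonds):
            return True
    return False


# Query fragment: an ether linkage, C-O-C.
ETHER_LINK = (["C", "O", "C"], {(0, 1), (1, 2)})

# Two C4H10O isomers that a molecular-weight screen alone cannot separate.
DIETHYL_ETHER = (["C", "C", "O", "C", "C"], {(0, 1), (1, 2), (2, 3), (3, 4)})
BUTAN_1_OL = (["C", "C", "C", "C", "O"], {(0, 1), (1, 2), (2, 3), (3, 4)})

library = [("diethyl ether", DIETHYL_ETHER), ("butan-1-ol", BUTAN_1_OL)]
hits = [name for name, graph in library if has_substructure(ETHER_LINK, graph)]
print(hits)
```

Unlike the exact-match case, extra atoms and bonds in the target are allowed here; only the query's own atoms and bonds must be present.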
COTS chemoinformatics implementations that permit (sub)structure
searching of databases containing chemical information usually
encapsulate these pre-processing steps (to generate the information used
as the Cope-Chat holes) and post-processing steps (to ensure that the
information retrieved truly matches the scientist's search query) within
the database using an extensible indexing mechanism. In IBM DB2, these
database components are known as Extenders; in IBM Informix, they are
known as DataBlades; and in Oracle, they are known as Data Cartridges.
The extensible indexing implementation also hides the additional table(s)
or data structures containing the search criteria extracted from the
chemical moieties being indexed (i.e. the repository containing each
moiety's molecular weight, fingerprints, and functional groups present).
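The idea of a hidden side table can be mimicked, very loosely, with a small wrapper class around SQLite. This is only an analogy and not how Extenders, DataBlades, or Data Cartridges are implemented: the class, table, and method names are all hypothetical, and only the molecular weight is indexed. The caller adds and finds compounds without ever touching the index table.

```python
import re
import sqlite3

# Rounded standard atomic weights (g/mol); only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Sum atomic weights for a simple formula such as 'C2H6O'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return round(sum(ATOMIC_WEIGHTS[s] * int(n or 1) for s, n in pairs), 2)


class CompoundStore:
    """Keeps a private index table in step with the visible compound table."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.executescript(
            """
            CREATE TABLE tblCompound (
                compoundId INTEGER PRIMARY KEY, name TEXT, formula TEXT);
            CREATE TABLE tblCompoundIndex (
                compoundId INTEGER PRIMARY KEY
                    REFERENCES tblCompound(compoundId),
                molecularWeight REAL);
            """
        )

    def add(self, name, formula):
        # Both inserts share one transaction so the index cannot drift.
        with self._conn:
            cur = self._conn.execute(
                "INSERT INTO tblCompound (name, formula) VALUES (?, ?)",
                (name, formula),
            )
            self._conn.execute(
                "INSERT INTO tblCompoundIndex VALUES (?, ?)",
                (cur.lastrowid, molecular_weight(formula)),
            )

    def find_by_weight(self, weight, tol=0.005):
        sql = """
            SELECT c.name
            FROM tblCompound c
            JOIN tblCompoundIndex i ON i.compoundId = c.compoundId
            WHERE i.molecularWeight BETWEEN ? AND ?
            ORDER BY c.compoundId
        """
        return [r[0] for r in self._conn.execute(sql, (weight - tol, weight + tol))]


store = CompoundStore()
store.add("strychnine", "C21H22N2O2")
store.add("ethanol", "C2H6O")
print(store.find_by_weight(334.42))
```

In a real extensible index the database engine itself maintains the side structures, so even ordinary SQL INSERT and SELECT statements benefit without any wrapper code.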
Further information on COTS chemical database systems implemented within an
Oracle Data Cartridge include:
Further information on COTS chemical database systems implemented using technology centered around
databases other than Oracle include: