Chemical Structures in Databases (blog article 2 of 8) : Chemical Structure Databases
The Cope-Chat approach previously described remains the crux of most chemoinformatic database systems. The implementation provides a mechanism to search on chemical (sub)structure or reaction scheme within the database.
The pseudo code example below demonstrates how a simple database table could be queried for the free base of strychnine.
SQL, the language of the relational database, does not directly offer such rich nor convenient functionality however. This chemical compound must be preprocessed and the key attributes extracted or calculated. For strychnine, these details may include the relative molecular weight (334.42g/mol), a fingerprint, and key functional groups/molecular characteristics such as tertiary amine, amide, levels of unsaturation etc. Bar the relative molecular weight and fingerprint, each of these characteristics could correspond to a numbered or labelled hole along the edges of a Cope-Chat card.
For database queries attempting to retrieve information on a discrete chemical compound, the SQL query above could be rewritten:
Such an SQL query may return a database table row for information relevant to the chemical structure of strychnine. However the database result set may also contain rows for the many many other chemical structures that also have a molecular weight of 334.42g/mol, and match the other search predicates (eg. the compound has a single tertiary amine and no Weinreib amide functional groups). Post database query processing is required to ensure that the intermediary query results truly match what the original search criteria. Generally this a very CPU intensive task so it is necessary to ensure the predicates above filter the candidate result set to a minimum.
For wildcard chemical database interrogation such as substructure searching, the same type of query would be used. However the post processing step would have to interpret the substructure, open valency, salts, R groups etc.
COTS chemoinformatics implementations that permit (sub)structure searching of databases containing chemical information usually encapsulate these pre- (to generate the information to use as the Cope-Chat holes) and post- (to ensure that the information queried truly matches the scientists search query) processing steps within the database using an extensible indexing mechanism. In IBM DB2, these database components are known as Extenders, in IBM Informix these components are known as Datablades, and in Oracle they are known as Data Cartridges. The extensible indexing implementation also hides additional table(s) or data structures containing the search criteria extracted from the chemical moieties being indexed (ie. the repository containing the moiety molecular weight, fingerprints, and functional groups present).
Just over a decade ago there were a number of players in the market offering various Oracle Data Cartridges for the purposes of indexing chemical structure information. This included:
- CambridgeSoft, now part of Perkin Elmer
- MDL, then Elsevier MDL, then Symx, then Accelrys, now Dassault Systèmes
Nowadays the market has consolidated somewhat due to organization mergers (both informatics providers and pharmaceutical organisations the unified supplier and informatics products implications thereof), and a general acceptance that indexing chemical structures is not as difficult as the original informatics COTS entrant organisations proffered. Oracle Data Cartridges can be used for more than just Chemical Structures too. For example, C$WILDNA1 is a COTS product for indexing DNA profiles on-a-par with native Oracle data types.
Similarly (Oracle does not hold an exclusivity on extensible database indexing, as highlighted above) COTS chemical database systems implemented using technology centered around databases other than Oracle include:
- gNova Scientific Software
This is the second of eight blog articles with a subtitle of “Chemical Structures In Databases”. The prior article can be located here and the next article in the series of eight can be located here.
— Published by Mike, 12:57:33 06 April 2018 (BST)