Informatics RDBMS storage and manipulation of chemical moieties (blog article 4 of 8)
Over the past three decades, COTS products from major chemoinformatics solution providers have offered a number of technologies for the storage and manipulation of small molecules within informatics RDBMS and related systems. The supported underlying chemical representations have ranged from .mol files (which do represent atom connectivity, valency, charge etc. from a chemical perspective) through to SMILES and SMARTS strings, InChI identifiers, and CML; the latter formats have a more informatics focus, one that a real scientist wouldn’t bother to understand in depth (I appreciate this sentence is written brashly, but nevertheless it remains true; SMILES for example represents chemical bonds as a dollar symbol (quadruple bond), a hash/pound symbol (triple bond), a colon (aromatic bond), and more; it is silly at best, and demonstrably not chemistry/science by any means). Putting aside the hype around SMILES, SMARTS, InChI, etc., and for that matter the perceived authority or standing of the publisher, one of the primary goals of these textual chemical-structure representations was, and is, to define a standardised representation of small molecules for manipulation in informatics systems and for data exchange, and judged on this criterion the end result is a (mixed) success.
Ideally, a database containing small-molecule chemical structure information could be queried using a chemical structure, which is after all the only meaningful representation of a molecule (well OK, it is 2D not 3D, but my point is made). A contrived SQL query is shown below, where the hypothetical database and computer system could handle chemical structures directly; a truly chemically aware database!
Unfortunately, things have been considerably dumbed down to enable an RDBMS to perform chemically aware queries with SQL. A typical example is shown in part below, where the chemical structure is represented as a SMILES string. The key function is smi2cansmi, in this case a DayCart Oracle Database Cartridge operator.
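To make the shape of such a query concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not DayCart and does not reproduce the real smi2cansmi operator or its signature. The function toy_cansmi, the compound table, and its rows are all hypothetical, and the "canonicalisation" is deliberately a toy: it only handles unbranched chains of single-letter atoms with double or triple bonds, where the only two spellings of a molecule are the string and its reverse.

```python
import sqlite3

# Toy subset: single-letter atoms plus double (=) and triple (#) bond symbols.
# Branches, rings, aromatics, charges and two-letter atoms are NOT supported.
ALLOWED = set("BCNOPSFI=#")


def toy_cansmi(smiles):
    """Hypothetical stand-in for a cartridge canonicalisation operator.

    For an unbranched chain, reading the string backwards describes the same
    molecule, so picking the lexicographically smaller spelling gives one
    agreed form for both.
    """
    if not smiles or not set(smiles) <= ALLOWED:
        raise ValueError("outside the toy SMILES subset: %r" % (smiles,))
    return min(smiles, smiles[::-1])


def demo_connection():
    """Build an in-memory database with a hypothetical compound table."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so that SQL can call it by name.
    conn.create_function("toy_cansmi", 1, toy_cansmi)
    conn.execute("CREATE TABLE compound (name TEXT, smiles TEXT)")
    conn.executemany(
        "INSERT INTO compound VALUES (?, ?)",
        [("ethanol", "OCC"), ("acetaldehyde", "CC=O"), ("hydrogen cyanide", "C#N")],
    )
    return conn


def find_by_text(conn, query_smiles):
    """Naive lookup: plain string equality on the stored SMILES."""
    rows = conn.execute(
        "SELECT name FROM compound WHERE smiles = ? ORDER BY name",
        (query_smiles,),
    )
    return [row[0] for row in rows]


def find_exact(conn, query_smiles):
    """Lookup that canonicalises both sides before comparing."""
    rows = conn.execute(
        "SELECT name FROM compound "
        "WHERE toy_cansmi(smiles) = toy_cansmi(?) ORDER BY name",
        (query_smiles,),
    )
    return [row[0] for row in rows]
```

With the demo data, find_by_text(conn, "CCO") finds nothing because ethanol happens to be stored as "OCC", whereas find_exact(conn, "CCO") finds it. A real cartridge has to do far more than reverse a string, but the SQL the developer writes has this same shape.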
CambridgeSoft, MDL, Accelrys, ChemAxon, Daylight, PerkinElmer, and IDBS (some of the current and prior players in this field) offer, or have offered, informatics solutions for manipulating small molecules within an RDBMS environment using Oracle Data Cartridges, with varying degrees of functionality, performance, and success. It should be noted that the initial approach used a .mol or .sd file to represent atom connectivity, valency, isotopic enrichment or depletion, and even R-group wildcards, tautomers and more, all based upon sound chemical concepts. The current approach, by contrast, is to reduce a chemical structure to the lowest common denominator of ease of use in informatics systems: a textual data type such as a VARCHAR, or something that can be easily manipulated by Java or posted over HTTP or a web service, and that cannot be interpreted as a chemical moiety or chemical reaction. Here again is another example where software developers, informatics practitioners, and IT in general have done science, and chemistry specifically, a disservice, albeit with good intentions, by introducing a lossy implementation with deficient technologies that bear little relation to the underlying problem or the science.
Since the implementation of .mol and .sd file content, and of SMILES, SMARTS, InChI etc., there have been a few attempts to represent chemical structures once again with a formal connectivity table, as they exist in matter, and with 3D spatial awareness too. This small and seemingly inactive GitHub project has a couple of interesting community comments from non-commercial contributors to the field (again with reference to CTAB, the atom Connectivity TABle, and to spatial awareness, as above), and moots JSON as an appropriate technology.
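To give a feel for what a connectivity table carried as JSON might look like, here is a minimal sketch in Python. The field names ("atoms", "bonds", "xyz", and so on) are my own assumptions and do not follow the GitHub project mentioned above or any published schema, and the water coordinates are approximate and purely illustrative.

```python
import json


def make_ctab(atoms, bonds):
    """Build a hypothetical JSON-friendly connectivity table.

    atoms: list of (element, x, y, z, formal_charge)
    bonds: list of (from_index, to_index, bond_order)
    """
    return {
        "atoms": [
            {"element": element, "xyz": [x, y, z], "charge": charge}
            for element, x, y, z, charge in atoms
        ],
        "bonds": [
            {"from": i, "to": j, "order": order}
            for i, j, order in bonds
        ],
    }


def neighbours(ctab, index):
    """Return the indices of the atoms bonded to the atom at `index`."""
    found = []
    for bond in ctab["bonds"]:
        if bond["from"] == index:
            found.append(bond["to"])
        elif bond["to"] == index:
            found.append(bond["from"])
    return sorted(found)


# Water, with approximate 3D coordinates in angstroms (illustrative only).
water = make_ctab(
    atoms=[
        ("O", 0.00, 0.00, 0.00, 0),
        ("H", 0.96, 0.00, 0.00, 0),
        ("H", -0.24, 0.93, 0.00, 0),
    ],
    bonds=[(0, 1, 1), (0, 2, 1)],
)

# The table survives a round trip through text, so it can sit in a column
# or travel over a web service while keeping atoms, bonds and geometry.
water_json = json.dumps(water)
restored = json.loads(water_json)
```

Unlike a SMILES string, this form states the atoms, bonds, and coordinates explicitly, so a question such as "which atoms are bonded to the oxygen?" is a direct lookup (neighbours(restored, 0)) rather than a parsing exercise.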
The overarching point is that however the chemical moieties are represented, whether that be silly SMILES strings or something else, under the bonnet and within the RDBMS some chemically aware processing must restore the textual representation to atom connectivity in order to return a valid, chemically aware database result set (this is of course the point; C. A. R. Hoare states it beautifully: “The most important property of a program is whether it accomplishes the intention of its user”). The chemistry/science is rightfully hidden from the DBA/SQL developer, who more often than not is not, and never was, a chemist/scientist. The interface to the science is SQL, it looks simple, and it works, so perhaps these silly representations of chemicals as SMILES, SMARTS, and InChI have achieved one of their intended goals (or, more likely, it was claimed as a goal retrospectively). The question that remains is what chemically aware processing occurs under the bonnet; this is addressed in the next blog article.
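As a rough illustration of what "restoring the textual representation to atom connectivity" involves, here is a minimal sketch in Python that turns a small subset of SMILES into a list of atoms and a list of bonds. It is my own toy and not the processing any commercial cartridge performs. The function name smiles_to_connectivity is hypothetical, and the subset is deliberately narrow: a handful of unbracketed atoms, explicit bond symbols, parenthesised branches, and single-digit ring closures. Aromatic atoms, charges, isotopes, stereochemistry, and implicit hydrogens are all ignored.

```python
# Bond symbols understood by this sketch and the bond order each denotes.
BOND_ORDERS = {"-": 1, "=": 2, "#": 3, "$": 4}

# Two-letter symbols come first so "Cl" is not misread as "C" followed by "l".
ATOMS = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")


def smiles_to_connectivity(smiles):
    """Return (atoms, bonds) for a SMILES string in the toy subset.

    atoms is a list of element symbols; bonds is a list of
    (atom_index, atom_index, bond_order) tuples.
    """
    atoms = []
    bonds = []
    prev = None          # index of the atom the next atom attaches to
    pending = 1          # order of the next bond (single unless stated)
    branch_stack = []    # attachment points saved at each "("
    open_rings = {}      # ring digit -> (atom index, bond order)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDERS:
            pending = BOND_ORDERS[ch]
            i += 1
        elif ch == "(":
            branch_stack.append(prev)
            i += 1
        elif ch == ")":
            if not branch_stack:
                raise ValueError("unmatched ')' in %r" % (smiles,))
            prev = branch_stack.pop()
            i += 1
        elif ch.isdigit():
            if prev is None:
                raise ValueError("ring digit before any atom in %r" % (smiles,))
            if ch in open_rings:
                # Second occurrence of the digit closes the ring.
                start, order = open_rings.pop(ch)
                bonds.append((start, prev, max(order, pending)))
            else:
                open_rings[ch] = (prev, pending)
            pending = 1
            i += 1
        else:
            symbol = next((a for a in ATOMS if smiles.startswith(a, i)), None)
            if symbol is None:
                raise ValueError("unsupported token at position %d in %r" % (i, smiles))
            atoms.append(symbol)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending))
            prev = index
            pending = 1
            i += len(symbol)
    if branch_stack or open_rings:
        raise ValueError("unclosed branch or ring in %r" % (smiles,))
    return atoms, bonds
```

For acetic acid written as "CC(=O)O", this yields four atoms (C, C, O, O) and three bonds, one of them of order 2 between the second carbon and the first oxygen. Even this toy shows that the string only becomes chemistry once something has rebuilt the connectivity from it.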
This is the fourth of eight blog articles with a subtitle of “Chemical Structures In Databases”. The prior article can be located here and the next article in the series of eight here.
— Published by Mike, 22:27:12 28 June 2018 (BST)