Chemoinformatics/Chemistry Data Cartridges
Introduction
Commercial off-the-shelf (COTS) products from Daylight, MDL, CambridgeSoft, ChemAxon, InfoChem, Tripos, Accelrys, and others all offer chemoinformatic solutions implemented as an Oracle Data Cartridge. All encapsulate the functionality for manipulating chemical moieties with SQL-like constructs. Chemical moieties are represented in formats that include SMILES strings, MDL .mol files (Dalby et al, J. Chem. Inf. Comput. Sci, 1992, 32, 244-255), or open source CML (Rzepa et al).
Existing electronic representations of molecule/structure files
.mol files (and .sd files etc)
Molecular Design Limited (MDL) initially published a format for the interchange of chemical structure data (see Dalby et al, J. Chem. Inf. Comput. Sci, 1992, 32, 244-255). All the major chemoinformatic database systems and supporting applications, including chemical drawing tools, support the format. So does the IUPAC InChI identifier API (vide infra).
To support the exchange of chemical structures, reaction schemes, structure and substructure queries, and related information, MDL have published the structure of their internal formats in a very detailed document. Further information on these files, including the format specification, should be sourced directly from MDL.
SMILES strings
SMILES strings are one way of representing chemical moieties. They range from very simple
intuitive strings for structurally simple molecules such as cyclohexane, benzene, methane, and water,
to long incomprehensible collections of characters. SMILES strings are commonly used in
chemoinformatics within many pharmaceutical organisations, chemical partners, and academic
institutions. Further information on the SMILES format, including discussion of the various
versions and of the applications that offer different levels of SMILES support, can be obtained from
Daylight.
The following are two SMILES strings representing the free base of strychnine.
The first string is a representation of strychnine with all stereocentres defined. The
latter has the stereocentres unspecified.
- c1ccc2c(c1)[C@@]13[C@]4(N2C(C[C@]2([C@@]4([C@@]4(\C(=C/CO2)\CN([C@]1(C4)[H])CC3)[H])[H])[H])=O)[H]
- c1ccc2c(c1)C13C4N2C(CC2C4C4\C(=C/CO2)\CN(C1C4)CC3)=O
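The difference between the two strings can be checked mechanically. The following is a minimal sketch, not a SMILES parser: it assumes that tetrahedral stereocentres are written as bracket atoms carrying an @ or @@ mark, and simply counts them with a regular expression. The function name count_chiral_marks is hypothetical and chosen for illustration only.

```python
import re

# Bracket atoms carrying a tetrahedral chirality mark (@ or @@), e.g. [C@@] or [C@H].
# This is a text-level pattern, not a chemical perception algorithm.
CHIRAL_ATOM = re.compile(r"\[[^\]]*@[^\]]*\]")


def count_chiral_marks(smiles: str) -> int:
    """Count bracket atoms that carry an @ or @@ chirality mark."""
    return len(CHIRAL_ATOM.findall(smiles))


# Raw strings are used because SMILES uses backslashes for double-bond geometry.
with_stereo = (
    r"c1ccc2c(c1)[C@@]13[C@]4(N2C(C[C@]2([C@@]4([C@@]4(\C(=C/CO2)"
    r"\CN([C@]1(C4)[H])CC3)[H])[H])[H])=O)[H]"
)
without_stereo = r"c1ccc2c(c1)C13C4N2C(CC2C4C4\C(=C/CO2)\CN(C1C4)CC3)=O"

print(count_chiral_marks(with_stereo))     # 6
print(count_chiral_marks(without_stereo))  # 0
```

Note that both strings still carry the \ and / marks describing the geometry of the double bond; only the tetrahedral centres differ.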
Despite the common use of SMILES strings, they should not be viewed as a standard format for the exchange of chemical information between chemoinformatic applications. The word standard is often used carelessly, both in marketing hype and by IT professionals. Standards are issued by standards authorities such as ANSI, ISO, and/or other authoritative bodies. Daylight claim SMILES strings are a universal nomenclature; Daylight make no claim that SMILES strings are a standard.
IUPAC InChI
IUPAC has recently published both the source code and documentation for the generation of a non-proprietary identifier for chemical substances. This went live in May 2005. The source code builds on both Windows and Linux operating systems and, out of the box, supports both MDL .mol file formats and CML. The IUPAC InChI licence agreement is very liberal, permitting free incorporation of the InChI source code into third party products with the ability to redistribute, including sell, the resulting software. This move by IUPAC represents a significant step toward an open standard, from an authoritative source, for the representation of chemical structures suitable for electronic consumption. The IUPAC identifiers are known as InChI identifiers or InChI strings. Scientific software solution providers have already begun to support InChI identifiers within their products (e.g. ACD Labs). As of September 2005, the author is not aware of any chemoinformatics data cartridges that support InChI identifiers.
The facts are:
- InChI identifiers define a format for the representation of chemical structure data.
- Representation of chemical structure information in formats other than InChI can already be viewed as legacy.
- InChI is non-proprietary.
- IUPAC are the authoritative body in this field.
- IUPAC has no interest in stymieing commercial competitors.
- A full reference implementation, including full source code, is openly available from both Sourceforge and IUPAC, free of charge and free of any fee for redistribution or incorporation into third party products.
- Chemoinformatic support for InChI is growing fast.
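To give a feel for what an InChI string looks like, the sketch below splits the identifier for ethanol into its slash-separated layers. It is an illustration only and does not use the IUPAC reference implementation; the function name split_inchi is hypothetical. The example string uses the "1S" prefix of standard InChI, which comes from later releases than the one described above; earlier strings begin with "InChI=1/". Layers that can repeat within one identifier would overwrite each other in this simple dictionary.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its version, formula, and prefixed layers."""
    prefix, sep, body = inchi.partition("=")
    if prefix != "InChI" or not sep:
        raise ValueError("not an InChI string")
    version, formula, *layers = body.split("/")
    # Each later layer starts with a single lower-case letter naming it,
    # e.g. 'c' for connectivity and 'h' for hydrogen atoms.
    return {
        "version": version,
        "formula": formula,
        "layers": {layer[0]: layer[1:] for layer in layers if layer},
    }


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])      # C2H6O
print(ethanol["layers"]["c"])  # 1-2-3
print(ethanol["layers"]["h"])  # 3H,2H2,1H3
```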
Most major COTS chemical database systems support SMILES strings to some level. Usage examples for
CambridgeSoft, ChemAxon, and Daylight are shown below.
Interrogating a chemical database with SMILES strings and DayCart.
Note: In all examples of cartridge use above, specific reference to qualified cartridge
or database package names has been omitted.
Summary: DBMS queries for chemical/structure information
Similar SQL-like constructs are used by CambridgeSoft's, ChemAxon's, and Daylight's data cartridges. For the most part this similarity is governed by the common use of SQL as the query language, the Oracle Data Cartridge Interface (ODCI) implemented by the cartridge developer based on events/callbacks from the Oracle DBMS, the requirements of the chemist/chemoinformatics application, and the common use of SMILES strings. It should also be noted that this syntax is similar to that of Oracle Text, a component of Oracle provided at no additional charge, implemented as an Oracle Data Cartridge and described previously.
When formats other than SMILES strings are used to represent chemical moieties, the basic
SQL structure is also very similar.
The most important point to note, however, is that the implementation, the under-the-bonnet
chemically aware processing that must go on to retrieve the database result set, is encapsulated
within the cartridge code and therefore hidden from the application programmer and DBA alike.
The interface remains SQL!
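The encapsulation idea can be illustrated without Oracle or any commercial cartridge. The sketch below is an analogy only: it uses SQLite from the Python standard library in place of Oracle, and registers a hypothetical function toy_contains in place of a cartridge operator. The function does plain substring matching on the SMILES text, which is not a real substructure search; a cartridge would perform chemically aware graph matching at that point. The table name, column names, and the "= 1" comparison style are likewise assumptions made for illustration.

```python
import sqlite3


def toy_contains(structure: str, query: str) -> int:
    """Stand-in for a cartridge operator: 1 if `query` occurs in `structure`.

    Plain substring matching on SMILES text is NOT a real substructure
    search; a cartridge would do chemically aware matching here.
    """
    return 1 if query in structure else 0


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("toy_contains", 2, toy_contains)

conn.execute("CREATE TABLE compounds (name TEXT, smiles TEXT)")
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?)",
    [
        ("cyclohexane", "C1CCCCC1"),
        ("benzene", "c1ccccc1"),
        ("toluene", "Cc1ccccc1"),
        ("methane", "C"),
        ("water", "O"),
    ],
)

# The caller sees only SQL; the matching logic is hidden behind the function.
rows = conn.execute(
    "SELECT name FROM compounds "
    "WHERE toy_contains(smiles, ?) = 1 ORDER BY name",
    ("c1ccccc1",),
).fetchall()
print([name for (name,) in rows])  # ['benzene', 'toluene']
```

Replacing the body of toy_contains with a genuine substructure matcher would change the results but not the SQL, which is the point made above.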