Chemoinformatics/Chemistry Data Cartridges

Introduction

Commercial off-the-shelf (COTS) products from Daylight, MDL, CambridgeSoft, ChemAxon, InfoChem, Tripos, Accelrys, and others all offer chemoinformatic solutions implemented as Oracle Data Cartridges. All encapsulate the functionality for manipulating chemical moieties with SQL-like constructs. Chemical moieties are represented in formats that include SMILES strings, MDL .mol files (J. Chem. Inf. Comput. Sci., 1992, 32, 244-255), or open source CML (Rzepa et al).

Existing electronic representations of molecule/structure files

.mol files (and .sd files etc)

Molecular Design Limited (MDL) originally published a format for the interchange of chemical structure data (see Dalby et al, J. Chem. Inf. Comput. Sci., 1992, 32, 244-255). All the major chemoinformatic database systems and supporting applications, including chemical drawing tools, support the format, as does the IUPAC InChI identifier API (vide infra). To support the exchange of chemical structures, reaction schemes, structure and substructure queries, and related information, MDL have published the structure of their internal formats in a very detailed document. Further information on these files, including the specification of the format, should be sourced directly from MDL.

SMILES strings

SMILES strings are one way of representing chemical moieties. They range from very simple, intuitive strings for structurally simple molecules such as cyclohexane, benzene, methane, and water, to long, incomprehensible collections of characters. SMILES strings are commonly used in chemoinformatics within many pharmaceutical organisations, chemical partners, and academic institutions. Further information on the SMILES format, including discussion of the various versions and of the level of SMILES support offered by different applications, can be obtained from Daylight.
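As a minimal sketch of the simple end of that range, the four molecules named above can be held as ordinary character data. The table and column names are hypothetical and the column sizes are arbitrary assumptions; the SMILES values are the conventional ones for these molecules.

```sql
-- Hypothetical demonstration table; names and sizes are illustrative only.
CREATE TABLE demo_molecules (
    name   VARCHAR2(40),
    smiles VARCHAR2(4000)
);

-- Upper case atoms are aliphatic, lower case atoms are aromatic,
-- digits open and close rings, and hydrogens are implicit.
INSERT INTO demo_molecules (name, smiles) VALUES ('cyclohexane', 'C1CCCCC1');
INSERT INTO demo_molecules (name, smiles) VALUES ('benzene',     'c1ccccc1');
INSERT INTO demo_molecules (name, smiles) VALUES ('methane',     'C');
INSERT INTO demo_molecules (name, smiles) VALUES ('water',       'O');
```

Stored this way the strings are plain text as far as the database is concerned; any chemical meaning has to come from cartridge functions applied to them.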

The following are two SMILES strings representing the free base of strychnine. The first string is a representation of strychnine with all stereocentres defined. The latter has the stereocentres unspecified.

Despite their common use, SMILES strings should not be viewed as a standard format for the exchange of chemical information between chemoinformatic applications. The word standard is regularly used carelessly, both in marketing hype and by IT professionals. Standards are issued by standards authorities such as ANSI, ISO, and/or other authoritative bodies. Daylight claim that SMILES strings are a universal nomenclature; Daylight make no claim that SMILES strings are a standard.

IUPAC InChI

IUPAC has recently published both the source code and documentation for the generation of a non-proprietary identifier for chemical substances. This went live in May 2005. The source code builds on both Windows and Linux operating systems and, out of the box, supports both MDL .mol file formats and CML. The IUPAC InChI licence agreement is very liberal, permitting free incorporation of the InChI source code into third-party products with the ability to redistribute, including sell, the resulting software. This move by IUPAC represents a significant step toward an open standard, from an authoritative source, for the representation of chemical structures suitable for electronic consumption. The IUPAC identifiers are known as InChI identifiers or InChI strings. Scientific software solution providers have already begun to support InChI identifiers within their products (e.g. ACD Labs). As of September 2005 the author is not aware of any chemoinformatics data cartridges that support InChI identifiers.

The facts are:

Most major COTS chemical database systems support SMILES strings to some level. Usage examples for CambridgeSoft, ChemAxon, and Daylight are shown below.

Interrogating a chemical database with SMILES strings and DayCart.
 
     SELECT something
       FROM tableName
      WHERE exact
            (
              dbColumn,
              smi2cansmi('c1ccc2c(c1)C13C4N2C(CC2C4C4 etc'),
              0
            ) > 0
 
Interrogating a chemical database with SMILES strings and CambridgeSoft's Oracle Cartridge.
 
     SELECT something
       FROM tableName
      WHERE moleculecontain
            (
              dbColumn,
              'c1ccc2c(c1)C13C4N2C(CC2C4C4 etc',
              '',
              'FULL=YES'
            ) > 0
 
Interrogating a chemical database with SMILES strings and the ChemAxon JChem Cartridge.
 
     SELECT something
       FROM tableName
      WHERE jc_contains
            (
              dbColumn,
              'c1ccc2c(c1)C13C4N2C(CC2C4C4 etc'
            ) > 0
 

Note: In all examples of cartridge use above, specific reference to qualified cartridge or database package names has been omitted.

Summary: DBMS queries for chemical/structure information

Similar SQL-like constructs are used by the CambridgeSoft, ChemAxon, and Daylight data cartridges. For the most part this similarity is governed by the common use of SQL as the query language, the Oracle ODCI interface implemented by the cartridge developer based on events/callbacks from the Oracle DBMS, the requirements of the chemist/chemoinformatics application, and the common use of SMILES strings. It should also be noted that this syntax is similar to that of Oracle Text, a component of Oracle provided at no additional charge, implemented as an Oracle Data Cartridge and described previously.
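For comparison, an Oracle Text query takes much the same shape as the chemical queries: an operator applied to a column inside the WHERE clause and compared against zero. The following is a sketch only. The table and column names are hypothetical, and it assumes an Oracle Text CONTEXT index already exists on the text column.

```sql
-- Hypothetical table and column names. CONTAINS and SCORE are Oracle Text
-- operators; the label 1 ties the SCORE call to this CONTAINS call.
SELECT SCORE(1), title
  FROM documentTable
 WHERE CONTAINS(textColumn, 'strychnine', 1) > 0
 ORDER BY SCORE(1) DESC;
```

As with the chemical cartridges, the indexing and matching machinery behind the operator is invisible in the query itself.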

When using formats to represent chemical moieties other than SMILES strings, the basic SQL structure is also very similar.

The most important point to note, however, is that the implementation, the under-the-bonnet chemically aware processing that must go on to retrieve the database result set, is encapsulated within the cartridge code and therefore hidden from the application programmer and DBA alike. The interface remains SQL!
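The encapsulation can be illustrated with a toy operator. This is a sketch under stated assumptions, not how any of the cartridges above is implemented: every name is hypothetical, and the function body is a plain text match standing in for real substructure searching. A production cartridge would normally also supply an indextype so that the operator can be evaluated through a domain index.

```sql
-- Hypothetical stand-in for the cartridge's chemistry code. A real cartridge
-- would run a substructure search here; this placeholder only tests whether
-- the query text occurs inside the stored text.
CREATE OR REPLACE FUNCTION demo_contains_fn (
    target IN VARCHAR2,
    query  IN VARCHAR2
) RETURN NUMBER IS
BEGIN
    IF INSTR(target, query) > 0 THEN
        RETURN 1;
    END IF;
    RETURN 0;  -- also reached when either argument is NULL
END;
/

-- Expose the function to SQL as an operator (hypothetical name).
CREATE OPERATOR demo_contains
    BINDING (VARCHAR2, VARCHAR2) RETURN NUMBER
    USING demo_contains_fn;

-- The application programmer sees only this.
SELECT something
  FROM tableName
 WHERE demo_contains(dbColumn, 'c1ccccc1') > 0;
```

Replacing the function body with genuinely chemically aware code would leave the final SELECT unchanged, which is the point of the cartridge approach.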

