Traditionally, database indexes, a mechanism for quickly locating data within the database without a significant performance penalty, have been limited to the database's primitive data types. In Oracle these primitives are typically DATEs, NUMBERs, and text types such as VARCHAR2 and CHAR. Little support is provided for other data types. Notably, out of the box, Oracle does not provide the facility to index and query information containing chemical reaction schemes or discrete chemical structures.
An Oracle Data Cartridge, a plugin component conceptually similar to IBM DB2 Extenders or IBM Informix DataBlades, is an extensible indexing mechanism that permits the cartridge developer to extend the DBMS feature set to recognise more complex data types, including chemical structure information. For chemoinformatic applications, the cartridge would, inter alia, extend the DBMS feature set to index and query information containing chemical structures and reaction schemes on a par with other Oracle data types such as VARCHAR2, CHAR, DATE, and NUMBER.
Pharmaceutical partner "Chemical Services plc" has a collection of several hundred thousand internal scientific documents. In-house staff have imported these documents into the database. Three typical document excerpts are shown below.
These documents were then searched with the aim of finding subject matter containing all the words enantiomer, resolution, and tartaric.
A highly inefficient way to search database tables containing this information would be to use an SQL construct of this type:
SELECT whatever
FROM tableContainingDocuments
WHERE dbColumn LIKE '%tartaric%enantiomer%resolution%'
This Oracle SQL query would meet the search criteria when performed against the database containing the documents above, returning only the last of the three in the result set. Despite meeting the requirements, however, the query has severe limitations. These include:
Oracle addresses the types of issues raised above, and many more, in a component known as Oracle Text, which carries no additional licence cost (it was referred to as Intermedia Text or ConText in now-unsupported versions of Oracle). Within Oracle, this component is implemented as a Data Cartridge. Oracle Text is feature rich, and SQL constructs not unlike the one shown below address many of the points raised above.
SELECT whatever
FROM tableContainingDocuments
WHERE CONTAINS(dbColumn,'<query>
<textquery grammar="context">
enantiomer AND resolution AND tartaric
</textquery>
</query>')>0
Notably
In summary, extensible indexing components like Oracle Text allow Oracle to index data types of choice. In the case of Oracle Text, this includes content within Microsoft Word and Adobe PDF documents. The underlying database indexes are constructed through reverse-engineering and lexical analysis of the source document content. When database queries are performed against the document content, the indexes are used to retrieve the result set candidates. This removes the need to perform a full table scan and expensively reverse-engineer each document on the fly.
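As a rough sketch of the index creation step summarised above, the statements below build an Oracle Text index over the document column so that CONTAINS queries can use it. The index name doc_text_idx is a hypothetical name chosen for illustration, and the table and column names are the placeholders used in the earlier queries; storage, filter, and lexer preferences are left at their defaults.

```sql
-- Sketch only: doc_text_idx is a hypothetical index name.
-- Build an Oracle Text (CONTEXT) index over the document column.
CREATE INDEX doc_text_idx
  ON tableContainingDocuments (dbColumn)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- By default a CONTEXT index is not refreshed automatically when rows
-- change, so synchronise it after loading or updating documents.
BEGIN
  CTX_DDL.SYNC_INDEX('doc_text_idx');
END;
/
```

Without such an index in place, a CONTAINS predicate on the column cannot be evaluated, which is why index creation precedes any of the text queries.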
Extensible indexing technology also exists to manipulate discrete chemical structures, substructures, and reaction schemes within the database. Enhancing Oracle via a chemistry-aware data cartridge allows Oracle to manipulate NUMBERs, DATEs, text types such as VARCHAR2 and CHAR, and chemical structures! The interface, as demonstrated above for rich data types such as Microsoft Word and Adobe PDF documents, remains SQL. Examples of such technology are discussed in further detail here.
A superb example of chemical structure searching using an online, publicly accessible system, keyed on chemical structure, is eMolecules (originally known as Chmoogle, but renamed under legal pressure from Google). eMolecules accepts a number of commonly used electronic formats for the representation of chemical structure information, including SMILES strings, CambridgeSoft ChemDraw files, and MDL ISIS/Draw files. At the time of writing (January 2006), eMolecules does not support IUPAC InChI identifiers. Further discussion on these and other technologies can be found here.
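By analogy with the CONTAINS query above, a chemistry-aware cartridge might expose structure searching as an SQL operator. The sketch below is purely illustrative: the operator name substructureMatch, the table tableContainingStructures, and the column structureColumn are all hypothetical and do not refer to any particular product, each of which defines its own operators and return values. The query string is a SMILES representation of tartaric acid (stereochemistry omitted).

```sql
-- Hypothetical operator and object names; a real cartridge defines its own.
-- Find rows whose stored structure contains the tartaric acid skeleton.
SELECT whatever
FROM tableContainingStructures
WHERE substructureMatch(structureColumn, 'OC(=O)C(O)C(O)C(=O)O') = 1
```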
Last modified/updated 16 March 2008