Chemical Structures in Databases (blog article 3 of 8) : Extensible Indexing

Traditionally database indexes, a mechanism to quickly locate data within the database without significant performance penalty, have been limited to database data type primitives. In Oracle these primitives are typically DATEs, data types containing text such such as VARCHAR2 and CHAR, and NUMBERs. Little support for other data types is provided. Notably, out of the box, Oracle does not provide the facility to index and query information containing chemical reaction schemes or discrete chemical structures.

An Oracle Data Cartridge, a plugin component conceptually similar technology to IBM DB2 extenders, or IBM Informix Datablades, is a database extensible indexing technology that permits the cartridge developer to extend the DBMS feature set to recognise more complex data types. This can include chemical structure information. For chemoinformatic applications, the cartridge would inter alia extend the DBMS feature set to index and query information containing chemical structures and reaction schemes on-a-par with other Oracle data types such VARCHAR2, CHAR, DATE, and NUMBER.

Case Study :- Searching Text 1

Pharmaceutical partner “Chemical Services plc” has a collection of several hundred thousand internal scientific documents. In-house staff have imported these documents into the database. Three typical document excerpts are shown below.

  • …Treatment of the 3-ethoxycyclohexenones 13 and 14 with ethynylmagnesium bromide in THF gave the ynenones 15 and 16 in 89 and 91% yield respectively. Conjugate addition of methanol promoted by N-methylmorpholine in benzene gave the…
  • …followed by chromatographic purification over alumina to remove cerium salts afforded homochiral cis-3-isopropyl-6-alkyldiketopiperazines 10a-f in good yield. Hydrolysis of 10a-f, by refluxing in 6 M HCl, afforded a mixture of (S)-valine 5 and the desired (S)-α-amino acids 7a-f. Whilst these α-amino acids 7a-f could be separated from the valine chiral auxiliary via ion exchange chemistry over Dowex 50-XH, this approach proved ….
  • … the solution of amine 16 (6.1g, 324mmol) in methanol (60mL) and D-(+) tartaric acid was heated under reflux at 80°C until homogeneous, allowed to cool to 40°C over 2 hours, and seeded with amine salt 3 (2.4mg). The mixture was then allowed to stand overnight, filtered, and the residue washed with cold methanol (3*10mL) affording the salt 3 (4.5g, 46%). HPLC/UV using the method described above showed the ee of the free base to be greater than 99%. The unnatural enantiomer of 2 was liberated using the procedure outlined for 1 in 87% overall yield for the 3 step process. This method clearly demonstrates the resolution of α-amino derivatives of 4 using D-(+) tartaric acid to afford the unnatural enantiomer of 16 with ee‘s in excess of ….

These documents were then searched with the aim of finding subject matter containing all the words enantiomer, resolution, and tartaric.

A most inefficient way to search database tables containing this information would be to use this type of SQL construct:

SELECT whatever
   FROM tableContainingDocuments
      WHERE dbColumn LIKE '%tartaric%enantiomer%resolution%'

This Oracle SQL query would meet the search criteria requirements when performed against the database containing the documents above, only returning the last of the three in the result set. Despite meeting the search requirements, the query does have severe limitations however. These include:

  • As it stands the SQL query is case sensitive, eg. should the word enantiomer appear in anything but lower case, the document ‘hit’ would not be included in the result set
  • The query is sensitive to the order of the search terms. Specifically if the words tartaric, enantiomer, and resolution did not appear in this precise order in the document excerpt, the document would not be retrieved using this SQL.
  • The query results in the entire database table being scanned. This is not a problem when there are a small number of documents and the document size is small. However the query is not scalable. As the number of documents and/or their size increases, the searching operation will take longer and longer and impact on other processes
  • There is no provision for stemming or other fuzzy searches. The meaning of the SQL is clearly to retrieve documents containing information on amine resolution using a tartrate of some kind. It would not be unreasonable to expect to retrieve documents containing information with the same meaning. For example, to also retrieve documents containing the three words tartaric, enantiomer, and resolve (not resolution).
  • The SQL described above will only work for documents stored as text within the database. There is no provision to query the content of documents stored in more rich or complex formats, eg. Adobe PDF, Microsoft Word, or Microsoft PowerPoint.

Case Study :- Searching Text 2 (Oracle Text, 10g grammar)

Oracle addresses the types of issues raised above, and many more, in a no additional licence cost component known as Oracle Text (referred to as Intermedia Text or ConText in now unsupported versions of Oracle). Within Oracle, this component is implemented as a Data Cartridge. Oracle Text is incredibly feature rich and SQL constructs not unlike that shown below address many of the points raise above.

SELECT whatever
  FROM tableContainingDocuments
    WHERE CONTAINS(dbColumn,'<query>
                               <textquery grammar="context">
                                  enantiomer AND resolution AND tartaric
                               </textquery>
                             </query>') > 0

Notably

  • The order of the words enantiomer, resolution, and tartaric in the query is unimportant
  • Through the use of INSO filtering technology from Stellent Chicago (versions of Oracle < 10g), or technology from Autonomy, Inc. in 11g, Oracle Text supports over 150 document format types including Adobe PDF, various markup languages, and those generated by a number of Microsoft Office applications including Word and RTF
  • This filtering technology is available on Oracle instances that run on Microsoft Windows Intel, Solaris, HP-UX, IBM AIX and others
  • There exists no requirement for a full table scan. The query is quickly and efficiently performed through the use of database indexes
  • DBA and application developer alike do not require knowledge of the inner workings or structure of the underlying indexes to use them. The interface remains SQL
  • The cartridge can be used by any number of applications at the same time. The code necessary to reproduce the indexing functionality need not be rewritten nor included in each database schema.

In summary extensible indexing components like Oracle Text allow Oracle to index data types of choice. In the case of Oracle Text, this includes content within Microsoft Word and Adobe PDF documents. The underlying database indexes are contructed through reverse-engineering and lexical analysis of the source document content. When database queries are performed against the document content, the indexes are used to retrieve the result set
candidates. As described, the necessity to perform a full table scan and expensively reverse-engineer each document on the fly has been obviated.

Extensible Indexing of information containing chemical structures

Extensible indexing technology also exists to manipulate discrete chemical structures, sub structures, and reaction schemes within the database. Enhancing Oracle via a chemistry aware data cartridge allows Oracle to manipulate NUMBERs, character types containing text such as VARCHARs and CHAR, DATEs, and chemical structures! The interface, as demonstrated above for rich data types such as Microsoft Word and Adobe PDF documents remains SQL.

A superb example of chemical structure searching using an online publically accessible system, keyed on chemical structure, is eMolecules (ooriginally known as Chmoogle but they were strong-armed intochanging their name by legal pressure from Google). EMolecules accepts a number of commonly used electronic formats for the representation of chemical structure information including SMILES strings, Cambridgesoft ChemDraw files, and IsisDraw files, and even InChi identifiers. Oracle Data Cartridges can be used for more than just Chemical Structures and documents containing texto. For example, C$WILDNA1 is a COTS product for indexing DNA profiles on-a-par with native Oracle data types.

This is the third of eight blog articles with a subtitle of “Chemical Structures In Databases”. The prior article can be located here and the next article in the series of eight here.

— Published by Mike, 14:58:23 06 April 2018 (BST)

Leave a Reply