Extensible Indexing
Introduction
Traditionally database indexes, a mechanism to quickly locate data
within the database without significant performance penalty, have been limited
to database data type primitives. In
Oracle
these primitives are typically DATEs, data types containing text such such as
VARCHAR2 and CHAR, and NUMBERs. Little support for other data types is provided.
Notably, out of the box, Oracle does not provide
the facility to index and query information containing chemical reaction schemes or
discrete chemical structures.
An Oracle Data Cartridge, a plugin component conceptually
similar technology to
IBM DB2 extenders,
or IBM Informix
Datablades,
is a database extensible indexing technology that permits the cartridge
developer to extend the DBMS feature set to recognise more
complex data types. This can include chemical structure information.
For chemoinformatic applications, the cartridge would
inter alia extend the DBMS feature set to index and query information containing
chemical structures and reaction schemes on a par with other Oracle data types
such VARCHAR2, CHAR, DATE, and NUMBER.
Case Study :- Searching Text 1
Pharmaceutical partner "Chemical Services plc" has a
collection of several hundred thousand internal scientific
documents. In-house staff have imported these documents
into the database. Three typical document excerpts are shown below.
-
...Treatment of the 3-ethoxycyclohexenones 13 and
14 with ethynylmagnesium bromide in THF gave the
ynenones 15 and 16 in 89 and 91% yield respectively. Conjugate
addition of methanol promoted by N-methylmorpholine in
benzene gave the...
-
...followed by chromatographic purification over alumina
to remove cerium salts afforded homochiral
cis-3-isopropyl-6-alkyldiketopiperazines 10a-f in good yield.
Hydrolysis of 10a-f, by refluxing in 6 M HCl, afforded a mixture of
(S)-valine 5 and the desired (S)-α-amino acids 7a-f.
Whilst these α-amino acids 7a-f could be separated from the valine
chiral auxiliary via ion exchange chemistry over Dowex 50-XH, this
approach proved ....
-
... the solution of amine 16 (6.1g, 324mmol) in methanol
(60mL) and D-(+) tartaric acid was heated under reflux at 80°C
until homogeneous, allowed to cool to 40°C over 2 hours, and seeded
with amine salt 3 (2.4mg). The mixture was then allowed to stand
overnight, filtered, and the residue washed with cold methanol (3*10mL) affording
the salt 3 (4.5g, 46%). HPLC/UV using the method described above showed
the ee of the free base to be greater than 99%. The unnatural enantiomer
of 2 was liberated using the procedure outlined for 1 in 87% overall
yield for the 3 step process. This method clearly demonstrates the resolution
of α-amino derivatives of 4 using D-(+) tartaric acid to afford
the unnatural enantiomer of 16 with ee's in excess of ....
These documents were then searched with the aim of finding subject matter
containing all the words enantiomer, resolution, and tartaric.
A most inefficient way to search database tables containing this information
would be to use this type of SQL construct:
SELECT whatever
FROM tableContainingDocuments
WHERE dbColumn LIKE '%tartaric%enantiomer%resolution%'
This Oracle SQL query would meet the search criteria requirements when performed
against the database containing the documents above, only returning the last of the three
in the result set. Despite meeting the search requirements, the query does
have severe limitations however. These include:
-
As it stands the SQL query is case sensitive, eg should the word enantiomer
appear in anything but lower case, the document 'hit' would not be included in the
result set.
-
The query is sensitive to the order of the search terms. Specifically if the words
tartaric, enantiomer, and resolution did not appear in this
precise order in the document excerpt, the document would not be retrieved
using this SQL.
-
The query results in the entire database table being scanned. This is not a problem
when there are a small number of documents and the document size is small. However the
query is not scalable. As the number of documents and/or their size increases, the searching
operation will take longer and longer and impact on other processes.
-
There is no provision for stemming or other fuzzy searches. The meaning of the SQL is
clearly to retrieve documents containing information on amine resolution using a tartrate
of some kind. It would not be unreasonable to expect to retrieve documents
containing information with the same meaning. For example, to also retrieve documents
containing the three words tartaric, enantiomer, and resolve.
-
The SQL described above will only work for documents stored as text within the database. There
is no provision to query the content of documents stored in more rich or complex formats, eg
Adobe PDF, Microsoft Word, or Microsoft PowerPoint.
Case Study :- Searching Text 2 (Oracle Text, 10g grammar)
Oracle addresses the types of issues raised above, and many more, in a no additional licence cost
component known as
Oracle Text
(referred to as Intermedia Text or ConText in now unsupported versions of Oracle). Within Oracle, this
component is implemented as a Data Cartridge. Oracle Text is incredibly feature rich and
SQL constructs not unlike that shown below address many of the points raise above.
SELECT whatever
FROM tableContainingDocuments
WHERE CONTAINS(dbColumn,'<query>
<textquery grammar="context">
enantiomer AND resolution AND tartaric
</textquery>
</query>')>0
Notably
-
The order of the words enantiomer, resolution, and tartaric
in the query is unimportant.
-
Through the use of INSO filtering technology from Stellent Chicago (versions of Oracle ≤ 10g),
or technology from
Autonomy, Inc. in 11g, Oracle Text supports over 150 document
format types including Adobe PDF, various markup languages, and those generated by a
number of Microsoft Office applications including Word and
RTF.
This filtering technology is available on Oracle instances that run on
Microsoft Windows Intel, Solaris, HP-UX, IBM AIX and others.
-
As shown there exists no requirement for a full table scan. The query is quickly and
efficiently performed through the use of database indexes.
-
DBA and application developer alike do not require knowledge of the inner workings
or structure of the underlying indexes to use them. The interface remains SQL.
-
The cartridge can be used by any number of applications at the same time. The code
necessary to reproduce the indexing functionality need not be rewritten nor included
in each database schema.
In summary extensible indexing components like Oracle Text allow Oracle to index
data types of choice. In the case of Oracle Text, this includes content within Microsoft Word
and Adobe PDF documents. The underlying database indexes are contructed through
reverse-engineering and lexical analysis of the source document content. When database queries are
performed against the document content, the indexes are used to retrieve the result set
candidates. As described, the necessity to perform a full table scan and expensively reverse-engineer
each document on the fly has been obviated.
Extensible Indexing of information containing chemical structures
Extensible indexing technology also exists to manipulate discrete chemical structures,
sub structures, and reaction schemes within the database. Enhancing Oracle
via a
chemistry aware data cartridge allows Oracle to manipulate NUMBERs, character types containing
text such as VARCHARs and CHAR, DATEs, and chemical structures! The interface, as demonstrated above for
rich data types such as
Microsoft Word and
Adobe PDF documents remains SQL. Examples of such technology
are discussed in further detail
here.
A superb example of chemical structure searching using an online
publically accessible system, keyed on chemical structure, is
eMolecules (originally known as Chmoogle but strong-armed into
changing their name by legal pressure from
Google). EMolecules accepts a number of commonly
used electronic formats for the representation of chemical structure information including
SMILES strings,
Cambridgesoft ChemDraw files, and
MDL IsisDraw files. At the time of writing (January 2006), Chmoogle
does not support
IUPAC
InChI identifiers. Further discussion on these and other
technologies can be found
here.
Last modified/updated 15 December 2008