This document is intended to explain further the objectives and technical aspects of the QIS. For more questions related to this documentation email: Luis Marenco, Prakash Nadkarni or Tzuuyi Wang.
Query Integrator System (QIS) is a ongoing database mediator framework designed to provide data integration from databases in Bioscience that continuously evolve. This application framework is being build primarily to provide data integration among distributed databases in Neuroscience.
Motivation: This system spawned from primary needs to integrate information from EAV/CR databases such as SenseLab, with other common data source types (Relational, XML and flat files, custom databases) used in Neuroscience.
The primary goal of QIS is to explore database integration mechanisms that will take advantage of the current resources and functionality in an incremental fashion providing users with collection of federated data scattered in multiple data sources.
Following in Road Map, will describe the incremental QIS functionality provided while is built.
The current implementation of QIS uses physical databases and functional structural queries to deliver data from multiple distributed data sources to requesting applications. Current developments will leverage ontologies that will be mapped from physical databases to create a semantic mediated queries. At long range, textual queries engines could use the knowledge in the ontology to facilitate textual interpretation of free text queries. See figure below

|
Each of the QIS servers is intimately related with the type of
functionality provided by each communication unit.
At the Data end, multiple heterogeneous databases (Relational, EAV/CR,
XML, Text, and other) hosted at specific institution connect to a specific Data
Source Server (DSS) serving as a gateway to system. Structured Queries: QIS was initially built to allow data integration using structured queries based on physical descriptions of each database. QIS mediation is currently using this approach but it requires knowledge of each database structure to build queries. Semantic Queries: During the process of building an ontology, the user can identify elements of interest on it, composing appropriate queries in an automated or assisted fashion. The specifications of such queries can be saved for reuse, so that even if there is currently very little data of interest to a specific query within the federation, the same query may return more results when re-run in future, as the contents of the federated databases expand. |
The basic QIS system is composed by three loosely coupled units: Data Source Server (DSS), Integrator Server (IS) and Ontological Server (OS). Inter server communication is XML encoded and HTTP transported to avoid network firewall limitations. For more detailed information check the QIS components page.
Each of the QIS units is based MS Platform and require the following applications: Windows 2000, Internet Information Server, SQL Server or Access database and VB6 runtimes. Specific requirements for each of the QIS units are explained in the QIS components page.
The system is in the process of migration to the MS.NET framework to take advantage of its features and to facilitate the creation of future Java versions.
Like most mediator systems, QIS uses its own query language to query disparate types of data sources. QIS query description is directly derived from SQL-like languages but represented in XML to facilitate legibility, syntax validation and future feature extensibility. In essence, the query is decomposed (“pre-parsed”) into its constituent elements, which are represented in terms of metadata-repository “unique identifiers”. Further, for atomic/column elements in a query, the IS records, in stored form, whether the element is part of the output (i.e., one of the fields to be displayed), whether it is used in the equivalent of a “join” to bridge between two tables, and whether it is part of a query criterion/filter.
Next we give a brief introduction of the QIS query structure by analyzing a QIS query expressed in XML, for this example we use the "getReceptorGeneChromosomeProtein _structure" query from the Membrane property resource database, with explanations of its constituents:
The previous query was automatically generated by the IS query design tool (see snapshots: 1 and 2). In general, you should be using the query design tool to compose your query because it makes the choices applicable to a particular field as pull-down lists, minimizing typing (and typographical errors that would cause the query to fail). Once the query has been determined to give the correct results, you can then save its XML for future use. (The above text is a "friendly" equivalent of the XML. without the XML tags., for the purposes of easier explanation)
The explanation below assumes that you are familiar with the principles of SQL and Boolean searching: if you are not, please study a book on SQL (such as C.J. Date's excellent "Introduction to Database Systems"). Jim Melton's recent book on SQL-99 is also helpful.
First, the "QIS_query" node encloses the query information distributed in the following unique nodes:
info: Query information, name, description, author, (version and date: not shown here)
from: specifies the sets which will be used for the query. Sets are the equivalents of tables in an RDBMS; we use the term "sets" because you may be accessing an XML-based data set instead of tables.
When you pick a set, the query designer automatically generates an alias (id) for the set. In the above example, the alias for the set "Receptor_properties" is "g1". In the SQL that accesses relational tables, Aliases are often necessary because sometimes you can use the same set more than once in the same query (in relational lingo, you perform what is called a "self-join") and you must distinguish between the two instances of this set. Also, where the table/set has a long name, aliases are short-forms that reduce typing.
The set has an internal identifier (gId) and a name (caption). The two are almost identical, except that in the caption, underscores are typically replaced by human-friendly spaces; the caption may also be more (or less) verbose than the internal name.
select: contains one or more "atom" nodes (the
equivalents of columns in an
RDBMS, which obviously must be qualified by the prefix indicating the set) whose
values will be requested.
Each "atom" node contins the
following XML attributes:
"id" an alias for the atom/column, by which the column is referred to elsewhere;
"aId" for atom Id (the unique identifier of the atom within the data source)
"name" is the printed name (caption) for the column . The role of aID and name are similar to gId and name for the set, as described above.
conditions: Uses one or more "cond" nodes to filter
results based on conditions of any atom element. A condition consists of the
atom/column that serves as a constraint, the relational operator used
in the constraint, and the value to which the atom is constrained.
In the above example, there are two conditions: the Subtype column will
be constrained based on two values specified by the user or application. (this
is indicated by the placeholder "?", indicating that this is a parameter
to the query). For example, at runtime, the user may wish to search for alpha-1,
alpha-2, or beta-1 receptor subtypes, and may enter the pattern "alpha%" for the
first condition, and "beta-1" for the second. If you know your SQL, the operator
for string pattern matching is "LIKE".
If you specify multiple conds one after the other, the individual conds,
by default, are combined using the AND Boolean operators. If you wish to combine
them in any other way (e.g., using OR or NOT operators) then the expression
clause (below) is used to build complex Boolean expressions.
expression: Optional. Used to build complex Boolean conditions when you want something other than the default AND behavior. In the above case, we wish to combine the two conditions to return a broader result, using the OR operator.
join: Is used for joining two sets. While the example
here does not show joins, the principle is similar to that in SQL. A join
essentially defines a pair of atoms/columns in two sets and a join operator that
combines them.
The join operator is one of "INNER", "LEFT_OUTER", "RIGHT_OUTER", "FULL_OUTER".
Refer to a SQL text book for explanations of INNER and OUTER joins. The INNER
join is the default. an example: if we had two columns g1.c1 and g2.c4, and we
wish to perform an inner join, we would use the clause:
operator="INNER" attr1='g1.c1'
attr2='g2.c4'.
combine: this lets you combine the results of two or more
sets, where the atoms/columns in these sets are logically identical in type and
number (in the sequential order in which they have been defined in the query).
The operators are "UNION" (set union), "INTERSECT" (set and "MINUS" (set
difference). For example, to return the union of all the rows returned from set
g1 and g2, we would use the clause :
operator="UNION" set1 = "g1" set2="g2".
Yale Center for Medical Informatics. 2004