QIS: A Framework for Biomedical Database Federation

This document explains the objectives and technical aspects of QIS in more detail. For questions about this documentation, email Luis Marenco, Prakash Nadkarni or Tzuuyi Wang.

What is QIS

Query Integrator System (QIS) is an ongoing database mediator framework designed to integrate data from continuously evolving bioscience databases. The framework is being built primarily to provide data integration among distributed databases in neuroscience.

Motivation: The system grew out of the need to integrate information from EAV/CR databases, such as SenseLab, with the other data source types commonly used in neuroscience (relational, XML, flat files and custom databases).

Goals

The primary goal of QIS is to explore database integration mechanisms that build incrementally on existing resources and functionality, giving users a federated collection of data that is scattered across multiple data sources.

The Road Map below describes the functionality that QIS provides incrementally as it is built.

Framework Road Map

The current implementation of QIS uses physical databases and functional structured queries to deliver data from multiple distributed data sources to requesting applications. Current development will leverage ontologies, mapped from the physical databases, to create semantically mediated queries. In the long term, textual query engines could use the knowledge in the ontology to help interpret free-text queries. See figure below.

 

System Architecture Overview

At the outer layer, the three main units involved in communication (Users, Data and Knowledge) are joined at the inner layer by a series of servers that make up QIS. QIS servers communicate with each other over the Internet (blue solid lines), while each Data Source Server (DSS) uses its institutional intranet (red dotted lines) to connect to specific databases.

Each of the QIS servers is closely tied to the functionality provided by one of the communication units. At the Data end, multiple heterogeneous databases (relational, EAV/CR, XML, text and others) hosted at a specific institution connect to a Data Source Server (DSS) that serves as their gateway to the system.

Users can be human or automated agents (e.g., a web server or a desktop application requesting data from QIS) that connect to the system by means of an Integrator Server (IS). This server processes queries that target specific data sources.

Knowledge, as described in standardized ontologies (e.g., UMLS), is correlated to a QIS Ontology Server (OS), where subsets of these ontologies are merged with not yet associated or newly discovered terms found in the federated databases. Metadata and conceptual data elements from multiple data sources are mapped to OS concepts to mediate semantic queries.

Structured Queries: QIS was initially built to allow data integration using structured queries based on the physical description of each database. QIS mediation currently uses this approach, but building queries requires knowledge of each database's structure.

Semantic Queries: During the process of building an ontology, the user can identify elements of interest in it and compose appropriate queries in an automated or assisted fashion. The specifications of such queries can be saved for reuse, so that even if the federation currently holds very little data of interest to a specific query, the same query may return more results when re-run in the future, as the contents of the federated databases expand.
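
The idea of a saved semantic query that returns more results as the federation grows can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the OntologyServer and SavedSemanticQuery classes, the concept name "receptor subtype", the data source id "d99" and its column name are hypothetical and are not part of QIS.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyServer:
    """Toy stand-in for the OS: maps a concept to atoms in federated sources."""
    # concept name -> set of (data source id, atom id); all values hypothetical
    mappings: dict = field(default_factory=dict)

    def map_atom(self, concept: str, ds_id: str, atom_id: str) -> None:
        """Record that an atom (column) in a data source expresses a concept."""
        self.mappings.setdefault(concept, set()).add((ds_id, atom_id))

    def resolve(self, concept: str) -> list:
        """Return every (data source, atom) pair currently mapped to a concept."""
        return sorted(self.mappings.get(concept, set()))


@dataclass(frozen=True)
class SavedSemanticQuery:
    """A stored query specification that refers to a concept, not to a table."""
    name: str
    concept: str

    def targets(self, ontology: OntologyServer) -> list:
        # Targets are looked up at run time, so later mappings are picked up.
        return ontology.resolve(self.concept)


ontology = OntologyServer()
ontology.map_atom("receptor subtype", "d53", "Receptor_properties.Subtype")
saved = SavedSemanticQuery("receptorSubtypes", "receptor subtype")

first_run = saved.targets(ontology)   # one data source is mapped so far

# A new (hypothetical) database joins the federation and is mapped later.
ontology.map_atom("receptor subtype", "d99", "receptors.subtype_name")
second_run = saved.targets(ontology)  # the same saved query now reaches both
```

Because the saved specification names a concept rather than a physical column, it does not need to be edited when a new data source is mapped to that concept.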

System Components

The basic QIS system is composed of three loosely coupled units: Data Source Server (DSS), Integrator Server (IS) and Ontology Server (OS). Inter-server communication is XML-encoded and transported over HTTP to avoid network firewall limitations. For more detailed information, see the QIS components page.
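
As a rough illustration of XML-encoded messages carried over HTTP, the Python sketch below wraps an XML element in an HTTP POST request using only the standard library. The URL, the content type and the message contents are assumptions chosen for the example; the actual QIS endpoints and message format are not described in this document. The request is only constructed, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_request(url: str, message: ET.Element) -> urllib.request.Request:
    """Serialize an XML message and wrap it in an HTTP POST request."""
    body = ET.tostring(message, encoding="utf-8")  # bytes, ready to transmit
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical message and endpoint, for illustration only.
message = ET.Element("QIS_query")
ET.SubElement(message, "expression", {"value": "(n1 or n2)"})
request = build_request("http://dss.example.org/qis", message)
```

Since the message travels as an ordinary HTTP request body, it can pass through firewalls that already allow web traffic, which is the design point made above.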

System Requirements

Each of the QIS units is based on the Microsoft platform and requires the following: Windows 2000, Internet Information Server, a SQL Server or Access database, and the VB6 runtimes. Specific requirements for each of the QIS units are explained in the QIS components page.

The system is being migrated to the Microsoft .NET framework to take advantage of its features and to facilitate the creation of future Java versions.

Query Language

Like most mediator systems, QIS uses its own query language to query disparate types of data sources. QIS query description is directly derived from SQL-like languages but represented in XML to facilitate legibility, syntax validation and future feature extensibility. In essence, the query is decomposed (“pre-parsed”) into its constituent elements, which are represented in terms of metadata-repository “unique identifiers”. Further, for atomic/column elements in a query, the IS records, in stored form, whether the element is part of the output (i.e., one of the fields to be displayed), whether it is used in the equivalent of a “join” to bridge between two tables, and whether it is part of a query criterion/filter.
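
The per-element bookkeeping described above (output, join, filter) might be recorded along the following lines. This is a minimal Python sketch under stated assumptions: the AtomUsage class and record_usage function are hypothetical names, and the identifiers such as "g1.Subtype" are borrowed from the example query in this section purely for illustration. How the IS actually stores this information is not specified here.

```python
from dataclasses import dataclass


@dataclass
class AtomUsage:
    """Roles that one atomic (column) element plays within a query."""
    atom_id: str             # metadata-repository unique identifier (assumed format)
    is_output: bool = False  # displayed as one of the result fields
    is_join: bool = False    # used to bridge between two tables
    is_filter: bool = False  # part of a query criterion


def record_usage(outputs, joins, filters) -> dict:
    """Build one AtomUsage record per atom, flagging every role it plays."""
    usage = {}
    roles = (("is_output", outputs), ("is_join", joins), ("is_filter", filters))
    for role, atom_ids in roles:
        for atom_id in atom_ids:
            entry = usage.setdefault(atom_id, AtomUsage(atom_id))
            setattr(entry, role, True)
    return usage


usage = record_usage(
    outputs=["g1.Subtype", "g1.Gene_chromosome", "g1.structure"],
    joins=[],
    filters=["g1.Subtype"],
)
```

Here "g1.Subtype" ends up flagged as both an output field and a filter, while the other two atoms are outputs only.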

Next, we give a brief introduction to the QIS query structure by analyzing a QIS query expressed in XML. The example is the "getReceptorGeneChromosomeProtein _structure" query from the Membrane Properties Resource database, shown with explanations of its constituents:

QIS_query
  info
    query name= 'getReceptorGeneChromosomeProtein _structure' description= 'Retrieves genes, chromosome location and protein structure of membrane receptors' owner= 'anonymous' database= 'Membrane Properties Resource' serverdsID= 'd53'
  from
    set id= 'g1' gId= 'Receptor_properties' name= 'Receptor Properties' alias= '' version= '1'
  select
    atom id= 'c1' aId= 'g1.Subtype' name= 'Subtype'
    atom id= 'c2' aId= 'g1.Gene_chromosome' name= 'Gene chromosome'
    atom id= 'c3' aId= 'g1.structure' name= 'structure'
  conditions
    cond id= 'n1' aId= 'g1.Subtype' operator= 'LIKE' value= '?'
    cond id= 'n2' aId= 'g1.Subtype' operator= 'LIKE' value= '?'
  expression value= '(n1 or n2)'
  join value= '<<join clause>>'
  combine value= '<<combine clause>>'

The previous query was automatically generated by the IS query design tool (see snapshots: 1 and 2). In general, you should use the query design tool to compose your queries, because it presents the choices applicable to a particular field as pull-down lists, minimizing typing (and the typographical errors that would cause a query to fail). Once the query has been determined to give the correct results, you can save its XML for future use. (The text above is a "friendly" equivalent of the XML, without the XML tags, for easier explanation.)
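
To make the relationship between the friendly form and SQL more concrete, the following Python sketch parses an XML rendering of the from, select, conditions and expression nodes and assembles a comparable SQL string. The XML layout is an assumption (one element per line of the friendly form, with its name/value pairs as attributes); the real QIS schema may differ. The to_sql function is a hypothetical helper and is not how the IS or DSS translate queries.

```python
import re
import xml.etree.ElementTree as ET

# Assumed XML layout derived from the friendly form; not the official schema.
QUERY_XML = """
<QIS_query>
  <from>
    <set id="g1" gId="Receptor_properties" name="Receptor Properties" alias="" version="1"/>
  </from>
  <select>
    <atom id="c1" aId="g1.Subtype" name="Subtype"/>
    <atom id="c2" aId="g1.Gene_chromosome" name="Gene chromosome"/>
    <atom id="c3" aId="g1.structure" name="structure"/>
  </select>
  <conditions>
    <cond id="n1" aId="g1.Subtype" operator="LIKE" value="?"/>
    <cond id="n2" aId="g1.Subtype" operator="LIKE" value="?"/>
  </conditions>
  <expression value="(n1 or n2)"/>
</QIS_query>
"""


def to_sql(xml_text: str) -> str:
    """Assemble an SQL-like string from the assumed XML query layout."""
    root = ET.fromstring(xml_text)
    columns = [atom.get("aId") for atom in root.iter("atom")]
    tables = [f'{s.get("gId")} {s.get("id")}' for s in root.iter("set")]
    conditions = {
        c.get("id"): f'{c.get("aId")} {c.get("operator")} {c.get("value")}'
        for c in root.iter("cond")
    }
    expression = root.find("expression").get("value")
    # Swap each condition id for its SQL text; upper-case Boolean operators.
    where = re.sub(
        r"\b\w+\b",
        lambda m: conditions.get(m.group(0), m.group(0).upper()),
        expression,
    )
    return f"SELECT {', '.join(columns)} FROM {', '.join(tables)} WHERE {where}"


sql = to_sql(QUERY_XML)
```

The "?" values are kept as parameter placeholders, and the expression node decides how the individual conditions are combined, which is why n1 and n2 appear by id there rather than being repeated in full.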

The explanation below assumes that you are familiar with the principles of SQL and Boolean searching. If you are not, please study a book on SQL (such as C.J. Date's excellent "An Introduction to Database Systems"). Jim Melton's recent book on SQL-99 is also helpful.

First, the "QIS_query" node encloses the query information distributed in the following unique nodes:

 

Yale Center for Medical Informatics. 2004