An Introduction to Entity-Attribute-Value Design for Generic Clinical Study Data Management Systems
Prakash Nadkarni MD, Center for Medical Informatics, Yale University Medical School, New Haven, CT
Large-scale Clinical Study Data Management Systems (CSDMSs) allow one to manage any number of clinical studies, each tracking an arbitrary number of clinical parameters in an arbitrary number of patients, within the framework of a database management system (DBMS). (We’ll assume that the DBMS is relational or object-relational: the vast majority of such systems usually are.) Such CSDMSs are termed “generic” in that they deliver this functionality without needing to continually add new tables to the database schema, as new types of clinical parameters need to be recorded across different sub-domains of medicine. The purpose of this tutorial is to show how this is done. We use a production open-source CSDMS, TrialDB, to illustrate different aspects of generic CSDMS creation.
At the outset, we must make clear that creation of a generic CSDMS is not for everybody: it is a major software undertaking, considerable programming and database expertise must be available, and a lot of infrastructure must be built before the benefits of a generic architecture can be realized. Also, we wish to emphasize that building a clinical study database generically is not the only “correct” way, which would imply that using a non-generic architecture is “wrong”. It depends on the task at hand: specifically, the number of studies that must be managed, and their heterogeneity, in terms of the different sub-domains of medicine that they must represent: the greater the heterogeneity, the greater the likelihood of a generic architecture being the “correct” one. For a single study, the generic approach is overkill. (Similarly, when there are a bunch of studies dealing with a single domain, such as cancer, one can get away with a non-generic design. One of the most functional cancer-oriented CSDMSs, built by Colin Begg’s group at Sloan-Kettering, is non-generic.)
Few readers of this tutorial will be building their own CSDMS: they will be interacting with an existing CSDMS that is already in place. We therefore aim to teach the reader what lies under the hood, and what they must know in order to work productively with such systems.
Generic CSDMSs use what is called an Entity-Attribute-Value (EAV) data model for the clinical data. (Some authors use “generic” as a synonym for EAV, but here the phrase “generic” is a bit too generic and non-specific for my liking.) The best way to understand the rationale of EAV design is to understand row modeling (of which EAV is a generalized form). Consider a supermarket database that must manage thousands of products and brands, many of which have a transitory existence. Here, it is intuitively obvious that product names should not be hard-coded as names of columns in tables. Instead, one stores product descriptions in a Products table: purchases/sales of individual items are recorded in other tables as separate rows with a product ID referencing this table. Replace “product” in the supermarket scenario with “clinical parameter” in the CSDMS scenario, and you should get the idea. When you hard-code a clinical parameter such as blood hemoglobin as a column in a table, this is the equivalent of hard-coding “diet Coke” as the name of a column in a purchases or sales table. (This is not always wrong. If you were using a spreadsheet for a small study that tracked only twenty parameters, this is the simplest way to go, just as if you were a dealer for the Coca-Cola company selling soda wholesale to supermarkets or hotels, it might make sense to have columns for each of the dozen brands of soft drink that you are selling when recording the details of a single sale.)
Let us now summarize the principles of EAV design. In EAV design, every fact is conceptually stored in a table, with three sets of columns:
· The Entity: patient ID plus timestamps recording event occurrence.
· The Attribute, also referred to as a Parameter or Question, e.g., “serum Potassium”.
· The Value of the attribute, e.g., “4.5”.
In EAV design, one row stores a single fact. In a conventional table that has one column per attribute, by contrast, one row stores a set of facts. EAV design is appropriate when the number of parameters that potentially apply to an “entity”, e.g., the patient, is vastly more than those that actually apply to an individual entity. CSDMSs, for example, must track thousands of parameters (clinical/ lab/ questionnaire items) across numerous medical specialties, though the vast majority of subjects will be enrolled in only a single study that tracks a much smaller set of parameters.
The above description is somewhat simplified. In reality, the schema is highly normalized. Thus:
· One typically records the Attribute ID rather than its name: the latter is stored, along with other information such as its data type, in a separate Attribute Definitions table. (This is analogous to a Products table.).
· Rather than repeat the patient ID and timestamp fields for every single attribute-pair, one typically has a table that represents clinical event for a patient, with a machine-generated Unique Identifier (UID) called the Event ID. Individual Attribute-Value pairs are then stamped with an Event ID. (Sometimes, as in TrialDB, the Event information is divided into two tables.)
· The simplified example above would imply that all values are stored as strings. With such a design, one would not be able to store values of complex data, such as images or signal data such as EKGs. Further, if we wanted to rapidly identify values of a particular attribute beyond a threshold (e.g., abnormally low hemoglobins), we would not be able to do so. In databases, one enables rapid search by creating an index on the value column: in this case, however, an index on numbers that are stored as strings is of little use. In reality, therefore, rather than have a single table for all values, one creates multiple EAV tables for different data types. TrialDB uses separate tables for values that are integers, real numbers, dates, enumerated (based on a choice set), short strings, long strings (such as arbitrarily-long narrative text) and binary data such as images or signal data.
For those readers who have some exposure to computer-science concepts, EAV is the database analog of a sparse matrix representation. In a sparse matrix, one stores only non-zero values, and in the case of EAV, one only records the parameters that actually apply to the patient (significant positives or negatives whose recording is part of the study protocol). The sparse-matrix analogy is important because the EAV-design philosophy should not be carried to the point of absurdity. Most “EAV” systems actually store some of their data, e.g., patient demographics, conventionally. EAV would be misapplied to standard demographic attributes, which are non-sparse in that they are recorded on every patient.
The existence of heterogeneity complicates the task of query of patient data significantly, for several reasons. For example, the SQL (Structured Query Language) statements required to operate on conventional data are different from those needed to operate on EAV data to return equivalent results. Further, data may be reorganized, with the representation of attributes being switched from conventional to EAV and vice versa, causing existing “stored procedures” that hard-code the access to these attributes to break and require rewriting.
The EAV schema has the advantage of remaining stable when the number of parameters increases continually as knowledge expands, as in medicine. EAV design, however, poses significant software challenges, because end-users think of the data as being conventionally structured, one column per parameter. (This is how the data must be presented when allowing users to edit the contents of individual case report forms.)
Similarly, most analytical programs (such as statistics packages) also require the data in the one column per parameter format. A common exploratory task in epidemiology is identifying associations between clinical parameters, as in the well-known case where the etiology of toxic shock syndrome (staphylococcal) was identified due to its association with menstruating women who were using a particular brand of tampon. Remember that a single EAV table stores highly heterogenous facts- apples, oranges, chop suey and wasabi – all in the same column. It would obviously be impossible to do basic and multivariate statistics with the data arranged this way. To be analyzable, this data must be first rearranged into the format that analytical programs expect. (A similar argument applies to the supermarket scenario, where an analogous operation is association analysis (“market-basket” analysis). Here, tracking which items are purchased together by the same customers allows targeted advertising, or optimal placement of products on adjacent supermarket shelves.)
Therefore, the user interface to a well-designed EAV system must create the illusion of conventional data storage through programming sleight-of-hand. The output of queries, or data extracts destined for analysis, must similarly be formatted conventionally. To put it another way, the EAV model is designed only for flexible storage in a clinical data repository. It was never intended for direct analysis, and the very existence of the EAV structure must be a well-kept secret as far as end-users are concerned.
In EAV systems, the user-interface illusion of conventional data organization is implemented through metadata. While the most restricted definition of metadata treats it as synonymous with “data dictionary”, we use this term in the broader sense of any data that is required by generic code within the application or system to provide functionality, as opposed to attempting to hard-code such functionality in application logic. All DBMSs use built-in metadata – such as table, column and index definitions – for various purposes, such as semantic checking and optimization of SQL queries. However, the metadata of an EAV system, which are designer-supplied, serve additional purposes. For example, knowing where an attribute’s data is stored, based on its data type, helps generate the correct SQL when data is viewed or changed. Typically, EAV metadata also record how information is presented to the user. Thus, for example, an attribute can have a caption that briefly describes its purpose, and may also be associated with a set of choices, constraining input to these choices. Such a set forms the basis of a “combo box” in the user interface. Further, attributes can be grouped into higher-order objects (forms). If we also record the order in which attributes are to be presented, the collective metadata can be used to automatically generate a data-entry/browsing form. This is, in fact, what TrialDB does: its Web-form-generation infrastructure and metadata-editing facilities have been previously described (Prakash Nadkarni et al, JAMIA 2000: Cynthia Brandt et al, Controlled Clinical Trials 2000).
Many relational database purists who are unfamiliar with EAV are instinctively uneasy with EAV because they are under the impression that, by using EAV, they are somehow abandoning the principles of relational design, and that many features, such as referential integrity, which come for free in conventional databases, must be thrown away. For CSDMSs, this is not true, as explained below.
· When using an EAV design for clinical data, one is really simulating a conventional relational design using EAV. Therefore, one must be even more compulsive about applying relational principles. For example, when multiple instances of a particular parameter such as Serum Sodium must be captured creating numerous parameters with names like Sodium1, Sodium2, etc. is a recipe for disaster. When this data is converted to conventional relational form for the purposes of analysis, a properly simulated relational design yields a proper relational design, while a junky simulation yields a junky end-product that is not analyzable because it is not in first normal form.
· The number of metadata tables typically outnumbers the data tables by about 3 to 1. The metadata is organized conventionally.
· In the case of clinical parameters, it is very rarely that one has to worry about referential integrity between parameters, because individual parameters are not related to each other: they are only related to a clinical event (and therefore, to a patient and a study). The tables containing clinical data maintain referential integrity with respect to patients and patient events. (In row-modeled designs such as the supermarket database example mentioned earlier, all the relational integrity checks one could ever desire are in place.)
This part of the tutorial is meant mainly for the techies, who would like to build their own system. (The rest of you can jump to the “Further Resources” section at the end of this paper, or go home.)
We consider the metadata under several categories.
Study-Related Metadata and Locale Metadata:
This part of the schema is shown at the top of figure 1. The Studies table stores a description of each study. A study may be conducted at multiple Study Sites, and may be accessed by multiple users. (A Sites table, which contains details of each site such as site name and postal address, has not been shown.) The Users-Privileges table is a bridge between Studies and Users, recording what each user is permitted to do for a given study (view, edit, add data, create queries, etc.). In addition, certain users may be restricted to viewing patient data from a single Study Site: each Patient demographics record (not shown) is tagged with the ID of the site where the patient belongs.
A Study is typically divided into multiple periods or Study Phases. (Some authors use the term “Study Events” as synonymous with “Study Phases”.) The time difference between phases in terms of offsets from a “time zero” is recorded for each phase (the time units for the offsets being stored in the Studies table to allow computation), and the phases are serially numbered accordingly. Connections between the Study-Related metadata and Attribute Data will be described later. Every patient event is tagged, not only with a time-stamp, but also the Phase ID when it occurred. As already stated, in clinical studies, Phase IDs are more useful than absolute dates when comparing groups of patients, because different subjects are enrolled in the study at different times.
Attributes represented in EAV form:
In the center of the diagram are the tables Clusters (=Forms), Question Groups and Questions (=Attributes).(We use “Cluster” because “Form” is a quasi-reserved word in MS-Access, our original prototyping platform.). A Question Group is a set of questions with which a common time-stamp or pair of time-stamps (representing the chronology of the event occurrence) can be associated. (For example, a clinical chemistry panel of tests has a single time-stamp, the time of blood collection.)
A Form typically has one Question Group, but some may have several. The “bridge” tables Question_Group-Clusters and Question_Group-Questions connect the tables suggested by their names and implement many-to-many relationships. For a single study, however, the relationship between Forms and Question Groups, and between Question Groups and Questions, is strictly one-to-many: in other words, a Question must only occur once in a study, in a single form and group. The “serial number” fields in these bridge tables refer to the order of presentation of the items within the parent item.
Attributes represented in conventional form:
The tables Meta-Classes and Meta-Attributes in the bottom right of the diagram actually serve to document the entire physical schema of TrialDB, with one entry per table in Meta-Classes, and one for each physical field in Meta-Attributes. The Class description has numerous fields (not shown), some of which record whether the class represents meta-data or patient-related data: whether the class represents column-modeled data, special-purpose row-modeled data (e.g., inclusion and exclusion criteria) or general-purpose EAV data, and so on. Only tables that represent patient data in column-modeled or special-purpose row-modeled form are presented in the query interface.
The list of “forms” shown to the user who is browsing a study is actually a union of these tables and the list of Clusters used in this study (recorded in the Studies-Clusters table, see later). The difference between EAV and non-EAV data is therefore deliberately blurred from the user. Every patient-related attribute in the system, whether conventional or EAV-modeled, is identified with a unique Attribute ID.
Many items (EAV-based questions as well as some fields/attributes in the database) are based on “choice sets” (discrete-value groups), displayed in the user interface as pull-down lists. TrialDB stores all choice sets used in the database, and the individual values associated with each choice set, in the two tables Discrete Value Groups and Discrete Values.
Connections between the Study-Related metadata and Attribute metadata:
These are shown in the left-center part of the diagram. The table Study_Phases_Clusters records, for a given study, which Forms are filled in during which period. The table Studies_Clusters records which Forms occur together for a given study: co-occurring Forms have the same Co-Location Number, an integer that is computed by a straightforward algorithm using the data in the Study_Phases-Clusters table. As already stated in the text, co-occurrence information reduces the number of output tables after a query finishes executing.
Though the existence of the bridge tables Question_Group_Clusters and Question_Group_Questions implies a many-to-many relationship between Questions and Groups, and between Groups and Clusters, in reality this relationship is somewhat constrained in TrialDB. Specifically, for a given study, a single question may occur only within a single group, and a single group only within a single cluster.
When generating a Case Report Form for a Cluster, both these tables record, respectively, the order of presentation of individual groups, and the questions within a group. The group itself provides a caption that is displayed in the form.
Questions that do not repeat (i.e., have only a single instance) are generated as form fields. The metadata associated with the question (such as whether it is associated with a choice set, the display height and width, and so on) determine whether it is displayed as a single-line text box, a multi-line text area, or a SELECT list (pull-down menu or list box). There are various other cosmetic customization options that determine, for example, whether the choices in a select list are displayed with the numeric codes corresponding to individual phrases in the list, or the phrases alone. (This can determine ergonomics in terms of ease of heads-down data entry during transcription from a paper form.)
A set of questions that comprise a Repeating Group is displayed by simulating a sub-form (the typical display metaphor for many-to-one relationships). One done this by creating an HTML table that can grow by addition of extra rows, and can shrink by deletion of rows. (This is actually done on the Web client: it is much easier to do this in Internet Explorer than in Netscape.)
Varying degrees of “smarts” can be implemented for robust data entry and validation. Some of this can be effected by checking the data type of each field (and preventing the entry of text into, say, a numeric field). Other types of intelligence require supporting metadata. TrialDB, for example, has fields with computed formulas; fields with pull-down lists whose contents change dynamically based on choice made in a previous pull-down list; fields that dynamically enable or disable based on values entered in previous fields, and so on. The WebEAV paper of Nadkarni et al JAMIA 2000 describes this in more detail.
Figure 1: Layout of (part of) the TrialDB metadata.
Different CSDMSs follow slightly different approaches. In TrialDB, as stated earlier, clinical data is strongly typed. There are therefore multiple attribute-value tables. The patient-event information is actually split into two tables: event_header and event_subheader. The purpose of these two tables is now described.
A clinical event (here, a study event) is a time point where clinical parameters applicable to that event are recorded. Thus, for example, a blood draw may be performed, and a battery of laboratory parameters determined. In this case, there is one instance of a parameter (such as serum Sodium or Bicarbonate) per event. Sometimes, however, there are multiple instances of a particular parameter. For example, a patient may have multiple adverse effects due to the therapy being administered: the number of adverse effects is not, of course, predictable. Some adverse effects may have been transient, others may have ceased after proper treatment, others may persist. In other words, each adverse effect recorded during a study event may have a different date/time of onset and cessation. Further, the name of the adverse effect is not the only parameter recorded: for each adverse effect, one records the severity, whether treatment was needed, the likelihood that the adverse effect was caused by the therapy, and so on. In other words, there is often an entire group of questions with multiple instances per event. (CSDMSs refer to such a group as a repeating group: this is an example of the simulation of a many-to-one relationship.)
The Event_header records information applicable to the event as a whole: the IDs of the patient, study, event (phase), the form, the date of recording, who entered the data, and so on. Event_Subheader records the time-stamps associated with each instance of a parameter. A single Event_Header record may have either a single Event_Subheader (for parameters such as laboratory values, which are recorded once during each event), or many Event_Subheader records (for parameters such as adverse effects).
The layout of the clinical data tables is shown in the diagram below. The EAV_ prefix indicates data-type-specific tables whose data type is indicated by their name. (“Memo” refers to arbitrarily long text, “BLOB” refers to Binary data.)
The transformation of the EAV structured data into a set of conventional relational tables is algorithmically straightforward. Essentially, a single Case Report Form will yield one or more relational tables. If the number of fields on the form is less than those that can be typically accommodated in an RDBMS table (for Microsoft Access, this is 255; for Oracle and SQL Server, this is 1024), and there are no repeating groups, then all the fields in the form become columns of a table. It is possible that a large CRF may need to be divided into multiple tables but this is rare. CRFs with a thousand questions are not ergonomically optimal, and need to be split into smaller CRFs anyway. (Also, with present-day browsers and the amount of RAM on most machines, such large forms can easily overwhelm the capacity of a Web browser, especially on machines using an OS such as Windows 98 or ME: NT or 2000 are much more stable in this regard.)
In the case of repeating groups within a case report form, each repeating group goes into its own table, where the questions in the group become columns in the table. By design, repeating groups contain a modest number of questions, simply because an HTML table does not wrap around, and a 40-column table would be an ergonomic bear to work with.
Generally, it is a good idea to extract the contents of the appropriate metadata (such as the caption and description for each question, and the phrases corresponding to choice sets) along with the data extract as a set of metadata tables. The existence of these tables serves as a “data dictionary” to annotate the data set and assist its interpretation.
Figure 2: The TrialDB clinical data: how EAV data is actually stored.
The TrialDB source and documentation is available at:
Read the file README.TXT and download all the files you need.