Friday, September 11, 2009

Functional Design of an Ontology --- Relationship of the i2b2 ontology to star schema

Recent posts to this blog have discussed ontologies and description logics . As mentioned earlier, the OWL-DL and OWL-Lite sub-languages of the W3C-endorsed Web Ontology Language (OWL) are based on a description logic. A tool for editing and creating ontologies, Protégé, was also described.

In the present post, I’d like to describe the functional design of the ontology used by i2b2, a collection of open-source software tools for the collection and management of project-related clinical research data. That is, this post will present an introduction to what’s under the hood.

Data storage

i2b2 data is stored in a relational database, usually either Oracle or SQL Server and always in a star schema format, a design proposed initially by Ralph Kimball in the 1980s. It is named this because of the appearance of the final database schema diagram that looks like a star (see figure below).

Notes: Ralph Kimball and I were formerly regular contributors to now-defunct DBMS Magazine. A brief introduction to the star schema format is given in the OLAP section of my article Using Neural Networks and OLAP Tools to Make Business Decisions. (See the bibliography at the bottom of this blog)

{click on the images above for larger views}

A star schema contains one fact and many dimension tables. The fact table contains the quantitative or factual data, while the dimension tables contain descriptors that further characterize the facts.

Facts are defined by concept codes and the hierarchical structure of these codes together with their descriptive terms and some other information forms the i2b2 ontology (also called metadata).

i2b2 ontology data may consist of one or many tables. If there is one table, it will contain all the possible data types or categories. The other option is to have one table for each data type. Examples of data types are: diagnoses, procedures, demographics, lab tests, encounters (visits or observations), providers, health history, transfusion data, microbiology data and various types of genetics data. All metadata tables must have the same basic structure.

The structure of the metadata is integral to the visualization of concepts in the i2b2 tools, as well as for querying the data.

In healthcare, a logical fact is an observation on a patient. It is important to note that an observation may not represent the onset or date of the condition or event being described, but instead is simply a recording or a notation of something. For example, the observation of ‘diabetes’ recorded in the database as a ‘fact’ at a particular time does not mean that the condition of diabetes began exactly at that time, only that a diagnosis was recorded at that time (there may be many diagnoses of diabetes for this patient over time).

The fact table contains the basic attributes about the observation, such as the patient and provider numbers, a concept code for the concept observed, a start and end date, and other parameters. In i2b2, as shown in the figure above, the fact table is called observation_fact.

Dimension tables contain further descriptive and analytical information about attributes in the fact table. A dimension table may contain information about how certain data is organized, such as a hierarchy that can be used to categorize or summarize the data. In the i2b2 Data Mart, there are four dimension tables that provide additional information about fields in the fact table: patient_dimension, concept_dimension, visit_dimension, and provider_dimension.


Once a database grows to over 10 million items, the advantages of a star schema can start to take hold. The first consideration is the speed and integrity of the queries. When one exceeds 0.5 billion rows in a database, it becomes important to have the data expressed in very large indexes. Very large indexes are only possible with very large tables. If one has several hundred or thousand tables in a database (easily attained in large transaction systems), one will have at least one index on each table resulting in several hundred or thousand small indexes. Joins between 100‐1000 indexes for each query will result in slow performance (hours), while joins between 3‐4 indexes, even representing 100’s of millions of rows, will be fast (seconds). Furthermore, the integrity of queries in a transactional database is also compromised because queries can often be answered through several paths in a circular manner.

The second consideration is the need for a large analytic database to constantly absorb new data. The database schema does not change as new data sources are added. New data will result in additional rows added to the fact, patient, and visit tables. New concepts and observers will result in new rows added to the concept and provider tables. But new columns and tables do not need to be added for each new data source. This is very useful in large projects where there are many tools depending upon a specific database schema. A strategy where the database grows by adding rows for new data rather than adding new tables and columns allows tools developed to work with one kind of data to also work with a new source of data.

The third advantage of the star schema is the ability to manage the metadata of a large analytic database. Metadata is used to perform queries, and if it is incorrect a query will be profoundly affected. For example, if one wanted to find all the patients with diabetes, but left out one of the codes used to represent diabetes in a database, none of those orphaned patients would be counted. The detection of orphaned concepts is easily achieved in the star schema by, for example, joining the fact table to the concept and provider tables and reporting those fact table concepts and providers left out by the join.

A sample ontology query for diagnosis

To find all the patients that were diagnosed with migraines, use this query:

Select distinct (patient_num)
From observation_fact
Where concept_cd in
(select concept_cd
from concept_dimension
where concept_path like '%Neurologic Disorders (320-389)\(346) Migraine\%')

Note: The material in this post has been taken largely from the following i2b2 pages, which should be consulted for further details: