Monday, August 24, 2009

Reasoning for Ontology Engineering and Usage and The Challenges of Modern Medical Ontologies

In my August 6 post, I briefly introduced the ontology editor Protégé 4.0 with the reasoners FaCT++ (implemented in C++) and Pellet (Java-based). Today's post picks up this story and adds the commercial reasoner RacerPro to the mix. You can -- and I recommend that you do -- download your own copies of the latest versions of these tools. Links that enable you to do so are located at the end of this post.

Protégé 4.0 with three reasoner add-ins

A reasoner is a piece of software able to infer logical consequences from a set of asserted facts or axioms. In the present context, a reasoner makes inferences about classes and individuals in an ontology, tasks that are beyond the Web Ontology Language (OWL) model alone.

Ontologies, as described in prior posts, are formal vocabularies of terms, often shared by a community of users, and, as such, ontologies play an important role in semantic interoperability and Web 3.0. One of the most prominent application areas of ontologies is medicine and the life sciences. For example, the Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT) is a clinical ontology. Another example is the OBO Foundry -- a repository containing about 80 biomedical ontologies.

These ontologies are gradually superseding existing medical classifications and will provide the future platforms for gathering and sharing medical knowledge. Capturing medical records using ontologies will reduce the possibility of data misinterpretation, and will enable information exchange between different applications and institutions.

Medical ontologies are strongly related to description logics (DLs), which provide the formal basis for many ontology languages, most notably the W3C-standardised OWL. All the above-mentioned ontologies are nowadays available in OWL and, therefore, in a description logic. The developers of medical ontologies have recognised the numerous benefits of using DLs, such as the clear and unambiguous semantics for different modelling constructs, the well-understood tradeoffs between expressivity and computational complexity, and the availability of provably correct reasoners and tools (discussion to follow).

The development and application of ontologies crucially depend on reasoning. Ontology classification, i.e., organising classes into a specialisation/generalisation hierarchy, is a reasoning task that plays a major role during ontology development: it provides for the detection of potential modelling errors such as inconsistent class descriptions and missing sub-class relationships. For example, about 180 missing sub-class relationships were detected when the version of SNOMED CT used by the NHS was classified using the DL reasoner FaCT++. Query answering is another reasoning task that is mainly used during ontology-based information retrieval; e.g., in clinical applications query answering might be used to retrieve "all patients that suffer from nut allergies".
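To make the two reasoning tasks above concrete, here is a minimal Python sketch. It is only a toy: the class names, patient names and helper functions are all hypothetical, and the only inference it performs is the transitivity of sub-class relationships. Real DL reasoners such as FaCT++, Pellet and RacerPro handle far richer class descriptions and do not work this way internally.

```python
from itertools import product

# Hypothetical schema-level axioms (the "TBox"): each pair reads
# "the first class is a sub-class of the second".
ASSERTED_SUBCLASS = {
    ("PeanutAllergy", "NutAllergy"),
    ("NutAllergy", "FoodAllergy"),
    ("FoodAllergy", "Allergy"),
}

# Hypothetical facts about individuals (the "ABox").
DIAGNOSES = {
    "patient1": "PeanutAllergy",
    "patient2": "FoodAllergy",
    "patient3": "NutAllergy",
}


def classify(asserted):
    """Return every sub-class pair that follows by transitivity."""
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow safely.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def instances_of(cls, diagnoses, hierarchy):
    """Toy query answering: individuals whose diagnosis is cls or a sub-class of it."""
    return sorted(
        person
        for person, diagnosis in diagnoses.items()
        if diagnosis == cls or (diagnosis, cls) in hierarchy
    )


hierarchy = classify(ASSERTED_SUBCLASS)

# Relationships nobody wrote down but which follow from the axioms,
# loosely analogous to the "missing" sub-class relationships a reasoner reports.
inferred_only = sorted(hierarchy - ASSERTED_SUBCLASS)

# "All patients that suffer from nut allergies"
nut_allergy_patients = instances_of("NutAllergy", DIAGNOSES, hierarchy)

print(inferred_only)
print(nut_allergy_patients)
```

Note that patient1 is returned by the query even though the recorded diagnosis is the more specific PeanutAllergy; the answer depends on the inferred hierarchy, not just on the stored facts.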

Despite the impressive state of the art, modern medical ontologies pose significant challenges to both the theory and practice of DL-based languages. Existing reasoners can efficiently deal with some large ontologies, but many important ontologies are still beyond the reach of available tools (i.e., current reasoners are unable to classify some widely used ontologies).

Applications currently need to work around these limitations, e.g., by using subsets of ontologies that can be successfully processed. For example, the version of GALEN typically used in practice contains only about 20% of the axioms of the full version; this reduces the interaction between concepts and thus makes the ontology "processable". This is, however, highly undesirable in practice, because it reduces coverage, weakens the conceptualisation of the domain and may prevent the detection of modelling errors.

Furthermore, the amount of data used with ontologies can be orders of magnitude larger than the ontology itself. For example, the annotation of patients' medical records in a single hospital can easily produce data consisting of hundreds of millions of facts, and aggregation at a national level might produce billions of facts. Existing reasoners cannot cope with such data volumes, especially not if ontologies such as GALEN and FMA are used as schemata.

Having forewarned you about these limitations, I'd like to recommend the following video on reasoners -- free and commercial.

Some readers of this blog might not be familiar with terms that appear in the video, starting with ABox and TBox.

For those readers especially, the following links to the Protégé, Pellet and Racer sites can be used to install and examine this software (and accompanying documentation) before watching the video.




Then, as you watch the video, you could follow along using your own running code. For some, this will require more than a single session.

Recommended reading - basics of description logics: