Tuesday, October 20, 2009

Collaborative Development of Large, Complex and Evolving Ontologies (e.g., SNOMED CT and GALEN) using a Concurrent Versioning System (CVS)


Prior posts here have talked about ontologies as though they magically appear and seamlessly meet a variety of challenges faced by the developers of computer applications. In this and a subsequent post, I'm going to touch upon several of the difficulties that arise in the creation and use of certain ontologies. What follows are a few words on the use of a Concurrent Versioning System (CVS) for developing ontologies collaboratively. My next post will discuss the gap between the majority of today's ontologies and a real world filled with a good deal of vagueness and uncertainty that these ontologies can't describe all that well.

OWL ontologies are being used in many application domains. In particular, OWL is used extensively in the clinical sciences; prominent examples of OWL ontologies are the National Cancer Institute (NCI) Thesaurus, SNOMED CT, the Gene Ontology (GO), the Foundational Model of Anatomy (FMA), and GALEN.

These ontologies are large and complex; for example, SNOMED CT currently describes more than 350,000 concepts, whereas NCI and GALEN describe around 50,000 concepts. Furthermore, these ontologies are in continuous evolution; for example, the developers of NCI and GO perform approximately 350 additions of new entities and 25 deletions of obsolete entities each month.

Most realistic ontologies, including the ones just mentioned, are being developed collaboratively. The developers of an ontology can be geographically distributed and may contribute in different ways and to different extents. Maintaining such large ontologies in a collaborative way is a highly complex process, which involves tracking and managing the frequent changes to the ontology, reconciling conflicting views of the domain from different developers, minimising the introduction of errors (e.g., ensuring that the ontology does not have unintended logical consequences), and so on.

In this setting, developers need to regularly merge and reconcile their modifications to ensure that the ontology captures a consistent unified view of the domain. Changes performed by different users may, however, conflict in complex ways and lead to errors. These errors may manifest themselves both as structural (i.e., syntactic) mismatches between developers’ ontological descriptions, and as unintended logical consequences.

Tools supporting collaboration should therefore provide means for: (i) keeping track of ontology versions and changes and reverting, if necessary, to a previously agreed upon version, (ii) comparing potentially conflicting versions and identifying conflicting parts, (iii) identifying errors in the reconciled ontology constructed from conflicting versions, and (iv) suggesting possible ways to repair the identified errors with a minimal impact on the ontology.

In software engineering, the Concurrent Versioning paradigm has been very successful for collaboration in large projects. A Concurrent Versioning System (CVS) uses a client-server architecture: a CVS server stores the current version of a project and its change history; CVS clients connect to the server to create (export) a new repository, to check out a copy of the project so that developers can work on their own ‘local’ copy, and later to commit their changes to the server. This allows several developers to make changes to a project concurrently. To keep the system in a consistent state, the server only accepts changes to the latest version of any given project file. Developers should hence use the CVS client to regularly commit their changes and update their local copy with changes made by others. Manual intervention is only needed when a conflict arises between a committed version on the server and a yet-uncommitted local version. Such a conflict is reported whenever the two compared versions of a file are found not to be equivalent.
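The rule that the server only accepts changes made against the latest version can be sketched in a few lines of Python. This is a toy, in-memory model for illustration only, not how any real CVS server is implemented; the names `ToyServer`, `StaleCommitError`, `checkout` and `commit` are hypothetical, and the "file" is just a string.

```python
class StaleCommitError(Exception):
    """Raised when a commit is based on an outdated revision."""


class ToyServer:
    """Hypothetical in-memory stand-in for a server holding one file."""

    def __init__(self, content: str) -> None:
        self.history = [content]  # revision n is stored at index n

    @property
    def latest(self) -> int:
        return len(self.history) - 1

    def checkout(self) -> tuple[int, str]:
        """Return the latest revision number and its content."""
        return self.latest, self.history[-1]

    def commit(self, base_revision: int, content: str) -> int:
        """Accept the change only if it was made against the latest revision."""
        if base_revision != self.latest:
            raise StaleCommitError(
                f"based on revision {base_revision}, latest is {self.latest}"
            )
        self.history.append(content)
        return self.latest


server = ToyServer("Class: Heart")
alice_base, alice_copy = server.checkout()
bob_base, bob_copy = server.checkout()

# Alice commits first, so the server moves on to revision 1.
alice_rev = server.commit(alice_base, alice_copy + "\nClass: Lung")

# Bob's local copy is still based on revision 0, so his commit is
# rejected until he updates and reconciles his changes.
try:
    server.commit(bob_base, bob_copy + "\nClass: Liver")
    bob_rejected = False
except StaleCommitError:
    bob_rejected = True
```

In this sketch Bob would have to check out revision 1, reapply his change, and commit again, which is the "update, then commit" discipline described above.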

Change or conflict detection thus amounts to checking whether two compared versions of a file are ‘equivalent’ according to some chosen notion of equivalence; a change or conflict is flagged when they are not.

A typical CVS treats the files in a software project as ‘ordinary’ text files and hence checking equivalence amounts to determining whether the two versions are syntactically equal (i.e., they contain exactly the same characters in exactly the same order). This notion of equivalence is, however, too strict in the case of ontologies, since OWL files, for example, have a very specific structure and semantics. For example, if two OWL files are identical except that two axioms appear in a different order, the corresponding ontologies should clearly be treated as ‘equivalent’: an ontology contains a set of axioms and hence their order is irrelevant.
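The difference between comparing characters and comparing sets of axioms can be shown with a minimal Python sketch. It assumes, purely for illustration, that each non-empty line of a file holds exactly one axiom; real OWL serializations are not that simple, and the helper `parse_axioms` and the sample axioms are hypothetical.

```python
def parse_axioms(text: str) -> frozenset[str]:
    """Treat each non-empty line as one axiom, ignoring order and repeats."""
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


# Two versions that differ only in the order of their axioms.
version_a = "SubClassOf(Heart Organ)\nSubClassOf(Lung Organ)\n"
version_b = "SubClassOf(Lung Organ)\nSubClassOf(Heart Organ)\n"

same_text = version_a == version_b                              # character-level comparison
same_axioms = parse_axioms(version_a) == parse_axioms(version_b)  # set-level comparison

print(same_text, same_axioms)  # prints: False True
```

A text-based comparison reports a difference here, while the set-based one does not, which is the behaviour one would want for ontologies.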

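Another possibility is to use the notion of logical equivalence, under which two ontologies are equivalent if they have exactly the same logical consequences. This notion of equivalence is, however, too permissive: two logically equivalent ontologies can still differ in ways that matter to their developers, for example in their annotations (which carry no logical meaning) or in the modelling style used to write the axioms.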
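To see how two differently written axioms can nonetheless be logically equivalent, here is a small sketch using a propositional analogue rather than real OWL. It compares "A implies (B and C)" with its contrapositive "(not B or not C) implies not A" by brute-force truth table; an actual ontology would need a description logic reasoner, and the function names are made up for this example.

```python
from itertools import product


def axiom_alpha(a: bool, b: bool, c: bool) -> bool:
    """A implies (B and C)."""
    return (not a) or (b and c)


def axiom_beta(a: bool, b: bool, c: bool) -> bool:
    """(not B or not C) implies not A."""
    return (not ((not b) or (not c))) or (not a)


# Two formulas are logically equivalent if they agree on every assignment.
logically_equivalent = all(
    axiom_alpha(a, b, c) == axiom_beta(a, b, c)
    for a, b, c in product([False, True], repeat=3)
)

print(logically_equivalent)  # prints: True
```

A purely logical comparison would see no difference between these two forms, even though one developer may have deliberately avoided negation and disjunction while another introduced them.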

Therefore, the notion of a conflict should be based on a notion of ontology equivalence ‘in-between’ syntactic equality and logical equivalence.

Conflict resolution is the process of constructing a reconciled ontology from two ontology versions which are in conflict. In a CVS, the conflict resolution functionality is provided by the CVS client.

Conflict resolution in text files is usually performed by first identifying and displaying the conflicting sections in the two files (e.g., a line, or a paragraph) and then manually selecting the desired content.
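The first step, identifying the conflicting lines in two text files, can be sketched with Python's standard `difflib` module. The two file versions and the labels `local` and `server` are invented for this example; selecting the desired content is still left to the developer.

```python
import difflib

# Hypothetical local and server versions of the same file, one line per entry.
local_version = [
    "SubClassOf(Heart Organ)",
    "SubClassOf(Lung Tissue)",
]
server_version = [
    "SubClassOf(Heart Organ)",
    "SubClassOf(Lung Organ)",
]

# Lines starting with '-' appear only in the local version,
# lines starting with '+' appear only in the server version.
diff_lines = list(
    difflib.unified_diff(
        local_version,
        server_version,
        fromfile="local",
        tofile="server",
        lineterm="",
    )
)

for line in diff_lines:
    print(line)
```

Because the comparison is line-based, it would also flag two files whose axioms are merely listed in a different order.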

Errors introduced during reconciliation, such as unintended logical consequences of the reconciled ontology, can be detected using a reasoner, but this too is complicated.

Collaborative Protégé is just one among several recent proposals for facilitating collaboration in ontology engineering tools. Such tools would allow developers to hold discussions, chat, and annotate changes. See the following references for more information on this topic.

Collaborative Protégé online demo http://protegewiki.stanford.edu/index.php/Collaborative_Protege
http://smi-protege.stanford.edu/collab-protege/

Collaborative Ontology Development with Protégé (2009)
http://protege.stanford.edu/conference/2009/slides/CollabProtegeTutorial.pdf

Noy, N.F., Tudorache, T., de Coronado, S., Musen, M.A.: Developing biomedical ontologies collaboratively. In: Proc. of AMIA 2008. (2008)

Noy, N.F., Chugh, A., Liu, W., Musen, M.A.: A framework for ontology evolution collaborative environments. In: Proc. of ISWC. (2006) 544–558

My next post will discuss the need for ontologies that benefit from fuzzy or probability-based logic when a domain has vagueness or uncertainty.