Sunday, September 27, 2009

Watson - An efficient access point to online ontologies - A gateway to the Semantic Web

Next generation semantic applications will be characterized by a large number of sometimes widely-distributed ontologies, some of them constantly evolving. That is, many next-generation semantic applications will rely on ontologies embedded in a network of already existing ontologies. Other semantic applications – e.g. some electronic health records (EHR) – will maintain a single, globally consistent semantic model that serves the needs of application developers and fully integrates a number of pre-existing ontologies.

As the Semantic Web gains momentum, more and more semantic data is becoming available online. Semantic Web applications need an efficient access point to this Semantic Web data. Watson, the main focus of this post, provides such a gateway. Two limited demonstrations of Watson - one video, the other static - are given below.

Overview of Watson Functionalities

The role of a gateway to the Semantic Web is to provide an efficient access point to online ontologies and semantic data. To fulfill this role, such a gateway performs three main tasks:

(1) it collects the available semantic content on the Web,
(2) it analyzes that content to extract useful metadata and indexes, and
(3) it implements efficient query facilities for accessing the data.

Watson provides a variety of access mechanisms, both for human users and software programs. The combination of mechanisms for searching semantic documents (keyword search), retrieving metadata about these documents and querying their content (e.g., through SPARQL) provides all the necessary elements for applications to select and exploit online semantic resources in a lightweight fashion, without having to download the corresponding ontologies.

For an easy-to-follow video demonstration of the Watson plug-in for the NeOn toolkit, click on

and, better still, click one of the Media Player links at this destination.

Note: There is a Watson plug-in for the ontology editor Protégé in the works.

Protégé (see my August 24 post below) is probably the most popular ontology editor available. In addition, its well-established plug-in system facilitates the development of a plug-in using the Watson Web Services and API. To date, however, the Protégé site provides only what it describes as "more a proof of concept or an example than a real plug-in."

NeOn Toolkit

The NeOn architecture for ontology management supports next-generation, semantics-based applications. It is designed in an open and modular way, includes infrastructure services such as a registry and a repository, and supports distributed components for ontology development, reasoning, and collaboration in networked environments.

The NeOn toolkit, the reference implementation of the NeOn architecture, is based on the Eclipse infrastructure.

Ontology Management: Semantic Web, Semantic Web Services, and Business Applications (Springer, 2008)

A static demonstration of the Watson plug-in for the NeOn toolkit

The Watson plug-in allows the user to select entities of the currently edited ontology that he or she would like to inspect, and to automatically trigger queries to Watson as a remote Web service. The results of these queries, i.e., semantic descriptions of the selected entities in online ontologies, are displayed in an additional view that allows further interaction. The figure below provides an example in which the user has selected the concept “human” and triggered a Watson search. The view on the right provides the query results (a list of definitions of the class human found on the Semantic Web) and allows easy integration of the results by simply clicking one of the “add” buttons.

Finally, the core of the plug-in is the component that interacts with the heart of the NeOn toolkit: its datamodel. Statements retrieved by Watson from external ontologies can be integrated into the edited ontology, which requires the plug-in to extend that ontology through the NeOn toolkit's datamodel and data-management component.


An interesting exercise: search on "snomed" in Watson. Clicking the (view as graph) link then produces a dynamic view of the results.

Thursday, September 24, 2009

Querying Semantic Data & Ontology-Assisted Querying of Relational Data --- SQL

My September 11 post discussed the i2b2 suite of applications, which has at its base a collection of database tables, in a star schema format, developed from the ground up to represent ontologies. In the present post, I'll continue this discussion for the case where external ontologies are used. I'll illustrate this latter option with two examples: querying semantic data and ontology-assisted querying of relational data, both using SQL.

Some organizations are using semantic approaches to create an information model (the ontology) based on data schemas taken from a particular organization or industry. Individual application database schemas are mapped to a standard information model in order to make the meaning of the concepts in different, application-specific data schemas explicit and to relate them to each other. The resulting information architecture provides a unified view of the data sources in the organization.

As shown in the figure below, application users can query these semantic (metadata) models, which comprise RDF data or ontologies. Standard ontologies reconcile queries that need access to heterogeneous data sources and application-specific schemas. The result is solutions with the power to address problems such as:

* data integration across a heterogeneous, expanding set of sources,
* tracking provenance information, and
* modeling probabilistic data and schema.

The product focused on in this post, chosen in part by the toss of a coin, is the latest database from Oracle, 11g, rather than a competitor such as SQL Server. Oracle, it should be mentioned, can deploy on any server platform (Unix, Linux, or Windows), whereas Microsoft SQL Server deploys only on Windows Server.

In Oracle 11g, RDF triples based on a graph data model are persisted, indexed, and queried, much like other object-relational data types. I'll have more to say on RDF/OWL data and ontologies in future posts. For now, the links found earlier in this paragraph serve as an introduction.

As shown in this figure, the Oracle 11g database contains semantic data and ontologies (RDF/OWL models), as well as traditional relational data.

The Oracle Database 11g semantic database features enable:

* Storage, Loading, and DML access to RDF/OWL data and ontologies
* Inference using OWL and RDFS semantics and also user-defined rules
* Querying of RDF/OWL data and ontologies using SPARQL-like graph patterns
* Ontology-assisted querying of enterprise (relational) data

Query Semantic Data in Oracle Database

RDF/OWL data can be queried using SQL. The Oracle SEM_MATCH table function, which can be embedded in a SQL query, can search for an arbitrary graph pattern against RDF/OWL models and, optionally, against data inferred using RDFS, OWL, and user-defined rules. The SEM_MATCH function meets most of the requirements identified by the W3C SPARQL standard for graph queries. Virtual models, a view-like feature for combining models and, optionally, their corresponding entailments through a UNION or UNION ALL operation, can also be used in a SEM_MATCH query. New in release 11.2 of the Oracle database, the SPARQL FILTER, UNION, and OPTIONAL keywords are supported in the SEM_MATCH table function.
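To make the shape of such queries concrete, here is a minimal SEM_MATCH sketch in the style of the examples in the Developer's Guide; the model name, rulebase, namespace, and triple patterns are all illustrative, not taken from a real deployment:

```sql
-- Find individuals ?x related to ?y by :grandParentOf, restricted to
-- males, searching the 'family' model plus triples inferred under RDFS.
SELECT x, y
FROM TABLE(SEM_MATCH(
  '(?x :grandParentOf ?y) (?x rdf:type :Male)',  -- SPARQL-like graph pattern
  SEM_Models('family'),                          -- model(s) to search
  SEM_Rulebases('RDFS'),                         -- optional inferencing
  SEM_ALIASES(SEM_ALIAS('', 'http://www.example.org/family/')),
  null));
```

Note that each variable in the graph pattern (?x, ?y) becomes a selectable column in the enclosing SQL query, which is what lets semantic results join freely with ordinary relational data.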


Ontology-assisted Query for Relational Data

Queries can extract more semantically complete results from relational data by associating relational data with ontologies that organize the domain knowledge of the relational data.

As shown in the next example, Oracle 11g performs this task by associating an ontology with the data and using the new SEM_RELATED operator (and optionally its SEM_DISTANCE ancillary operator). The new SEM_INDEXTYPE index type improves performance for semantic queries.
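As a sketch of how SEM_RELATED might appear in practice (the table, column, model, and concept names below are illustrative, not from the figure's example):

```sql
-- Return patients whose recorded diagnosis is, according to the
-- associated ontology, a subclass of the concept 'Migraine'.
SELECT patient_id, diagnosis
FROM patient_data
WHERE SEM_RELATED(
        diagnosis,                                  -- column holding a concept URI
        '<http://www.w3.org/2000/01/rdf-schema#subClassOf>',
        '<http://example.org/medical/Migraine>',
        SEM_Models('medical_ontology'),
        SEM_Rulebases('OWLPRIME')) = 1;
```

The point is that a plain relational query for the literal code 'Migraine' would miss rows tagged with more specific subtypes; the ontology supplies that subclass knowledge at query time.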


For an in-depth treatment of the SEM_MATCH table function, the SEM_RELATED operator, and related topics, consult the Oracle Database Semantic Technologies Developer's Guide.

Native Inferencing using OWL, RDFS, and user-defined rules

In addition to simply storing and querying an ontology, the latest Oracle database can perform a number of other important tasks, including drawing inferences and reasoning. The ability to draw inferences from existing data with the precision and rigor of mathematical logic (e.g., Description Logic) is probably the property that most distinguishes semantic data from other kinds of data. New Oracle Database 11g enhancements include a native inference engine for efficient and scalable inferencing using major subsets of OWL. This OWL inference engine makes the existing native inferencing for RDF, RDFS, and user-defined rules (used for additional specialized inferencing capabilities) more efficient and scalable. Inferencing may also be done using any combination of these entailment regimes. In addition, through the Oracle Jena Adaptor (downloadable from the Oracle Semantic Technologies page), you can integrate with external reasoners such as Pellet (see my August 24 post below for an introduction to Pellet).
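A sketch of how such an entailment might be materialized with the SEM_APIS package; the entailment, model, and rulebase names here are illustrative:

```sql
-- Precompute and store the triples inferred for the 'family' model
-- under Oracle's OWLPRIME subset plus a user-defined rulebase.
BEGIN
  SEM_APIS.CREATE_ENTAILMENT(
    'family_inf',                           -- name for the entailment
    SEM_Models('family'),                   -- model(s) to reason over
    SEM_Rulebases('OWLPRIME', 'family_rb')  -- entailment regimes to combine
  );
END;
/
```

Once the entailment exists, SEM_MATCH queries can include it, so the inferred triples are searched alongside the asserted ones.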

Friday, September 11, 2009

Functional Design of an Ontology --- Relationship of the i2b2 ontology to star schema

Recent posts to this blog have discussed ontologies and description logics. As mentioned earlier, the OWL-DL and OWL-Lite sub-languages of the W3C-endorsed Web Ontology Language (OWL) are based on a description logic. A tool for creating and editing ontologies, Protégé, was also described.

In the present post, I’d like to describe the functional design of the ontology used by i2b2, a collection of open-source software tools for the collection and management of project-related clinical research data. That is, this post will present an introduction to what’s under the hood.

Data storage

i2b2 data is stored in a relational database, usually either Oracle or SQL Server, and always in a star schema format, a design proposed initially by Ralph Kimball in the 1980s. The design is so named because the final database schema diagram looks like a star (see figure below).

Notes: Ralph Kimball and I were formerly regular contributors to now-defunct DBMS Magazine. A brief introduction to the star schema format is given in the OLAP section of my article Using Neural Networks and OLAP Tools to Make Business Decisions. (See the bibliography at the bottom of this blog)


A star schema contains one fact and many dimension tables. The fact table contains the quantitative or factual data, while the dimension tables contain descriptors that further characterize the facts.

Facts are defined by concept codes; the hierarchical structure of these codes, together with their descriptive terms and some other information, forms the i2b2 ontology (also called metadata).

i2b2 ontology data may consist of one or many tables. If there is one table, it will contain all the possible data types or categories. The other option is to have one table for each data type. Examples of data types are: diagnoses, procedures, demographics, lab tests, encounters (visits or observations), providers, health history, transfusion data, microbiology data and various types of genetics data. All metadata tables must have the same basic structure.

The structure of the metadata is integral to the visualization of concepts in the i2b2 tools, as well as for querying the data.

In healthcare, a logical fact is an observation on a patient. It is important to note that an observation may not represent the onset or date of the condition or event being described, but instead is simply a recording or a notation of something. For example, the observation of ‘diabetes’ recorded in the database as a ‘fact’ at a particular time does not mean that the condition of diabetes began exactly at that time, only that a diagnosis was recorded at that time (there may be many diagnoses of diabetes for this patient over time).

The fact table contains the basic attributes about the observation, such as the patient and provider numbers, a concept code for the concept observed, a start and end date, and other parameters. In i2b2, as shown in the figure above, the fact table is called observation_fact.

Dimension tables contain further descriptive and analytical information about attributes in the fact table. A dimension table may contain information about how certain data is organized, such as a hierarchy that can be used to categorize or summarize the data. In the i2b2 Data Mart, there are four dimension tables that provide additional information about fields in the fact table: patient_dimension, concept_dimension, visit_dimension, and provider_dimension.
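In abbreviated DDL, the fact table and one of the dimension tables described above might be sketched as follows; only the columns mentioned in the text are shown, and the actual i2b2 tables carry additional columns:

```sql
-- Fact table: one row per observation on a patient.
CREATE TABLE observation_fact (
  patient_num  NUMBER,         -- link to patient_dimension
  provider_id  VARCHAR2(50),   -- link to provider_dimension
  concept_cd   VARCHAR2(50),   -- link to concept_dimension
  start_date   DATE,
  end_date     DATE
);

-- Dimension table: one row per concept, with its hierarchical path.
CREATE TABLE concept_dimension (
  concept_cd   VARCHAR2(50),
  concept_path VARCHAR2(700),  -- hierarchy used to categorize facts
  name_char    VARCHAR2(2000)  -- descriptive term for the concept
);
```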


Once a database grows to over 10 million items, the advantages of a star schema start to take hold. The first consideration is the speed and integrity of queries. When a database exceeds 0.5 billion rows, it becomes important to have the data expressed in very large indexes, and very large indexes are only possible with very large tables. If one has several hundred or thousand tables in a database (easily attained in large transaction systems), one will have at least one index on each table, resulting in several hundred or thousand small indexes. Joins across 100-1000 indexes for each query will result in slow performance (hours), while joins across 3-4 indexes, even ones representing hundreds of millions of rows, will be fast (seconds). Furthermore, the integrity of queries in a transactional database is also compromised, because queries can often be answered through several paths in a circular manner.

The second consideration is the need for a large analytic database to constantly absorb new data. The database schema does not change as new data sources are added. New data will result in additional rows added to the fact, patient, and visit tables. New concepts and observers will result in new rows added to the concept and provider tables. But new columns and tables do not need to be added for each new data source. This is very useful in large projects where there are many tools depending upon a specific database schema. A strategy where the database grows by adding rows for new data rather than adding new tables and columns allows tools developed to work with one kind of data to also work with a new source of data.
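For example, absorbing a new data source amounts to INSERT statements rather than ALTER TABLE statements; the concept code and path below are purely illustrative:

```sql
-- Register a new concept, then record an observation that uses it;
-- no schema change is required for the new data source.
INSERT INTO concept_dimension (concept_cd, concept_path, name_char)
VALUES ('LAB:HBA1C', '\Labs\Chemistry\Hemoglobin A1c\', 'Hemoglobin A1c');

INSERT INTO observation_fact (patient_num, provider_id, concept_cd, start_date)
VALUES (1001, 'PROV:7', 'LAB:HBA1C', DATE '2009-09-01');
```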

The third advantage of the star schema is the ability to manage the metadata of a large analytic database. Metadata is used to perform queries, and if it is incorrect a query will be profoundly affected. For example, if one wanted to find all the patients with diabetes, but left out one of the codes used to represent diabetes in a database, none of those orphaned patients would be counted. The detection of orphaned concepts is easily achieved in the star schema by, for example, joining the fact table to the concept and provider tables and reporting those fact table concepts and providers left out by the join.
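The orphan check described above can be written directly as an outer join over the i2b2 tables:

```sql
-- Report concept codes that appear in the fact table but have no
-- matching row in the concept dimension (orphaned concepts).
SELECT DISTINCT f.concept_cd
FROM observation_fact f
LEFT OUTER JOIN concept_dimension c
  ON f.concept_cd = c.concept_cd
WHERE c.concept_cd IS NULL;
```

An analogous join against provider_dimension reports orphaned providers.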

A sample ontology query for diagnosis

To find all the patients that were diagnosed with migraines, use this query:

SELECT DISTINCT patient_num
FROM observation_fact
WHERE concept_cd IN
  (SELECT concept_cd
   FROM concept_dimension
   WHERE concept_path LIKE '%Neurologic Disorders (320-389)\(346) Migraine\%')

Note: The material in this post has been taken largely from the following i2b2 pages, which should be consulted for further details:

Wednesday, September 9, 2009

Semantic Interoperability, EHR, etc. are of little use to someone whose claim is denied by his or her insurance company

Founded in 1945, Kaiser Permanente is this nation’s largest not-for-profit health plan, serving more than 8.6 million members, with headquarters in Oakland, California.

This blog has carried a prominently-placed electronic health records (EHR) video outlining some of the excellent information technology work that's being introduced by Kaiser Permanente.

So, given the placement of this video, I feel a responsibility to add here the reality that Kaiser Permanente's technology is only one facet of a system that daily makes decisions about who can and who cannot get health care.

The California Nurses Association/National Nurses Organizing Committee has just released new data revealing that more than one of every five medical claims for insured patients, even those recommended by a patient's physician, is rejected by California's largest private insurers. (Kaiser Permanente Health Plan membership in California is greater than 6 million.)

This is data that the health insurance companies have wanted to hide, and it's just now becoming available. It documents that these insurance companies have denied, in California alone, 45 million claims since 2002. Some denial rates ranged as high as 40 percent (for UnitedHealthcare's PacifiCare), and other giant insurers, such as Blue Cross, Health Net, CIGNA, and Kaiser, were all in the range of 30 percent (Kaiser Permanente's denial rate is 28 percent). This report shows a clear pattern of very high denials by the very insurance companies that people depend upon to ensure that they get the care they need when they need it.

Insurance companies claim a variety of reasons for these denials; in the end, though, it's a war that goes on between the insurance companies and the doctors and hospitals. (Note: Attorney General of California Jerry Brown has announced that he's going to investigate the business practices of these companies and why their denial rates are so high.)

A recent piece in the Los Angeles Times quotes a spokeswoman for the California Association of Health Plans, responding to the data that the California Nurses Association/National Nurses Organizing Committee has just released, saying, “It appears [that] a good deal of the so-called denials are merely paperwork issues.”

It seems to me that even if you put the best face on the California Association of Health Plans' response, what it demonstrates is how much waste (aka administrative overhead) there is in the health insurance industry. It has been suspected for some time now that one-third of every healthcare dollar in the United States goes to waste and to enforcing claims denials.