Friday, May 22, 2009

Critique of OBO Foundry Principles

The OBO Foundry aims to create a suite of orthogonal interoperable reference ontologies in the biomedical domain. They have outlined their principles here: 

http://www.obofoundry.org/wiki/index.php/OBO_Foundry_Principles

In reading these, I found that several of them are poorly expressed as principles of ontology design. Here is a brief critique of some of the contentious points:

"3. The ontologies possesses a unique identifier space within the OBO Foundry. The identifier uniquely and persistently identifies a definition, which itself unambiguous identifies some type of biological entity. The identifier is for the definition: it is NOT the name and it is NOT an identifier for the name.

There are systems that use alphanumeric id's - eg MetaCyc. This should be dis-encouraged, especially as these have semantic content."

This mixes up a number of issues. An identifier is a symbol for an entity, and it should be unique in its lexical space, unlike human-readable names, which are not required to be unique. It therefore doesn’t matter whether the identifier is numeric, alphanumeric or alphabetic, and the latter part of this principle, referring to alphanumeric MetaCyc ids, is pure nonsense. What *matters* is the description of the entity, and that this textual description is arguably unchanging. What does OBOF say about when a description changes by even one word? Should a new identifier be minted? How does one assess whether the previous identifier is in fact compatible with the new one? Should one be directed to use the new identifier, and is it possible that the semantics are *fundamentally* different? These are far more important questions to address.
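To make the identifier/name distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Term` and `Registry` names, the two identifiers, and the two definitions are made up for illustration and are not taken from OBO, MetaCyc or any real ontology. It only shows that uniqueness is a property of the identifier, whatever its format, while labels are free to collide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    identifier: str   # unique key; its format (numeric or not) carries no meaning
    label: str        # human-readable name; not required to be unique
    definition: str   # the textual description that fixes the meaning


class Registry:
    """Hypothetical term store that enforces identifier uniqueness only."""

    def __init__(self):
        self._terms = {}

    def add(self, term):
        # Reject a second term that reuses an existing identifier.
        if term.identifier in self._terms:
            raise ValueError(f"duplicate identifier: {term.identifier}")
        self._terms[term.identifier] = term

    def by_label(self, label):
        # A label lookup can legitimately return several terms.
        return [t for t in self._terms.values() if t.label == label]


registry = Registry()
# Illustrative definitions only; one numeric-style id, one alphanumeric id.
registry.add(Term("EX:0000001", "gene", "A region of DNA that encodes a functional product."))
registry.add(Term("GENE-MOL", "gene", "A unit of heredity inferred from phenotype."))
```

Both terms share the label "gene", yet each is unambiguous through its identifier, and the alphanumeric one works exactly as well as the numeric one. The sketch deliberately says nothing about what should happen when a definition is edited, which is the open question raised above.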

"6. The ontology must be orthogonal to other ontologies already lodged within OBO. For each domain, there should be convergence upon a single reference ontology that is recommended for use by those who wish to become involved with the Foundry initiative"

This is a contestable claim. Since there is no universal agreement on the meaning of many biological terms, a single reference ontology will not necessarily capture the semantics one wants to express. The word "gene" is a case in point, as anyone familiar with its competing definitions can easily demonstrate.

"7. The ontologies include textual definitions for all terms."

Textual descriptions aren't really useful unless they succinctly capture the essence of the entity in question. For instance, the definitions in the OWL version of BFO are incomprehensible to many people (certainly to my undergrad students). In many other cases the textual descriptions can be shown to be either overly vague or constraining in unrealistic ways.

Einstein is credited with saying "Make everything as simple as possible, but not simpler". A good mantra in crafting term descriptions is "Be as accurate as possible, while not adding superfluous information or imposing unnecessary constraints."

"9. The ontology is well documented."

Be more specific: what does "well documented" mean?


"10. The ontology has a plurality of independent users."

This is another unreasonable demand. The defining characteristic should be that for every ontology, there exist requirements (possibly in the form of use cases) that the ontology can be demonstrated to satisfy. Ultimately, an ontology should have demonstrated utility. Paraphrasing "Field of Dreams": if you build it (and have shown it to be useful), they will come.

"11. The ontology will be developed collaboratively with other OBO Foundry members."

A long-standing myth is that ontologies need to be developed collaboratively, but we have found that such an approach is wholly unproductive. What is productive is collecting use cases, undertaking focused development, and conducting a peer review and refinement process in which the needs of the community can be publicly solicited and addressed. This kind of procedure is in place at the W3C, and results in high-quality standards. The OBO Foundry should consider setting up such a facility, with open calls for review across all relevant mailing lists, including quality assessment, additions/removals etc., particularly before things get published as a so-called "standard".