Sunday, September 25, 2011

Provenance: what is it and how should we formalize it?

As a testament to the growing recognition of provenance for (e-)science, I'm glad to see that the W3C incubator group worked hard to think through the issues and make it possible to establish a W3C provenance interchange Working Group.

A good starting point:

"provenance is often represented as metadata, but not all metadata is necessarily provenance"

"Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource."

This, however, does not adequately describe the conditions under which descriptive metadata becomes provenance.

"Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource"

This definition contains elements that are undefined ("record"), uncertain (are processes not also entities?), overly narrow (producing/delivering), and overly broad (influencing).

Of course, I appreciate the difficulty in crafting a good definition, and I understand that this is a definition from which useful work can proceed. Still, I will take the opportunity to express my thoughts on the matter.

I think there are two key aspects to provenance (not unlike what has been suggested here):

1. How did the resource come about? (relates to creation and justification)
 - important for reproducibility (which is an element of science)
 - includes attribution (who created the resource), creation (the process that generated the resource), reproduction (the process by which a copy was made), derivation (the process by which the resource was generated from some other resource, or a portion of one), and versioning (the process of keeping track of sequential derivations)

2. What is the history of the resource (from the point of creation)?
 - important for authenticity
 - includes origin, possession and the acts of transfer

Both have implications for trust and can be used for accountability, among other things.

I find the part on recommendations for a provenance framework quite nice, but I get less excited when I see the collection of "provenance concepts", particularly because we need to simplify the discourse such that we consider:

an event (for 1 above) 
 - participants (and their roles; e.g. agents, targets, products)
 - locations
 - time instants (e.g. action timestamps) and durations (processual attributes)

and a sequence of events (for both 1 and 2 above)

This would certainly help to generate a specification with a minimal set of classes and relations to express this kind of information.
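To make the minimal model above concrete, here is a rough sketch in Python. The class names, role labels and example data are my own invented placeholders for illustration, not a proposed standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Participant:
    name: str
    role: str  # e.g. "agent", "target", "product"

@dataclass
class Event:
    kind: str                       # e.g. "creation", "derivation", "transfer"
    participants: list[Participant]
    location: Optional[str] = None
    timestamp: Optional[datetime] = None   # time instant
    duration: Optional[timedelta] = None   # processual attribute

@dataclass
class ProvenanceRecord:
    """An ordered sequence of events: aspect 1 (how the resource
    came about) and aspect 2 (its subsequent history)."""
    events: list[Event] = field(default_factory=list)

    def record(self, event: Event) -> None:
        self.events.append(event)

    def agents(self) -> list[str]:
        # attribution: everyone who acted on the resource, in order
        return [p.name for e in self.events for p in e.participants
                if p.role == "agent"]

# usage: a creation event followed by a transfer event
prov = ProvenanceRecord()
prov.record(Event("creation",
                  [Participant("alice", "agent"),
                   Participant("dataset-v1", "product")],
                  timestamp=datetime(2011, 9, 25)))
prov.record(Event("transfer",
                  [Participant("alice", "agent"),
                   Participant("bob", "agent")]))
print(prov.agents())  # ['alice', 'alice', 'bob']
```

The point of the sketch is that two classes (an event with participants, locations and times; and an ordered sequence of events) suffice to cover both creation/justification and history/transfer.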

Now, I'm writing this late at night, and I appreciate that I may not have considered all the issues that the provenance group has (along with others who have written on the subject). But perhaps there are still some good discussions to be had with respect to provenance and how we formally represent it, as it is of strategic importance to the HCLSIG in our current and future efforts.

Thursday, September 22, 2011

A letter to gmail: attachments

Dear Gmail team,

First, thank you for making it possible for me to see my unread mail. I sent you this idea some time ago, and I'm glad that you listened.

Second, I'm now somehow at 50% of my allocated capacity, and what I need is a way to filter my mail by attachment (which I can do!), but also to order messages from largest to smallest (which I can't do). Once I can order attachments by size, I can start deleting the big ones and free up more room! YAY!


Tuesday, September 20, 2011

On the topic of ontology evaluation

With only a cursory look at the literature pertaining to the evaluation of ontologies, I already get the feeling that the current measures completely miss the point. The answer doesn't lie in the syntax (format) or structure of the ontology (the number of classes and properties, subsumed classes, axioms, etc.); rather, the effectiveness of an ontology (a representation of knowledge) lies in whether its semantics can be used for some task. So what we really want is to focus on the nature of the task, and on whether the ontology provides a competitive advantage over other technologies.

Off the top of my head, here are some tasks:
- search/browse (most sites using GO, etc.)
- text annotation (GoPubMed)
- data normalization and structured queries (Bio2RDF)
- answering questions that require background knowledge, e.g. across a yeast database
- data integration (heterogeneous types of data; data from different sources, of differing granular detail) (see my translational medicine paper)
- classification, e.g. of domains or chemicals
- prediction, e.g. predicting phenotypes

Perhaps others can suggest some?
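To illustrate the task-based argument above with one of the listed tasks (answering questions that require background knowledge), here is a toy sketch. The class hierarchy and gene annotations are invented examples, but they show where an ontology's subsumption structure beats plain keyword matching:

```python
# A tiny invented subclass hierarchy and annotation set.
subclass_of = {
    "kinase": "enzyme",
    "phosphatase": "enzyme",
    "enzyme": "protein",
}

annotations = {
    "geneA": "kinase",
    "geneB": "phosphatase",
    "geneC": "transporter",
}

def ancestors(term):
    """Walk the subclass_of chain to collect all superclasses of a term."""
    seen = []
    while term in subclass_of:
        term = subclass_of[term]
        seen.append(term)
    return seen

def query(term):
    """Return genes annotated with `term` or with any subclass of it."""
    return sorted(g for g, t in annotations.items()
                  if t == term or term in ancestors(t))

# A keyword search for "enzyme" over the raw annotations finds nothing,
# because no gene is annotated with that exact string; the subsumption
# query finds both the kinase- and phosphatase-annotated genes.
print(query("enzyme"))  # ['geneA', 'geneB']
```

The "competitive advantage" here is exactly the background knowledge: the answer depends on axioms (kinase is-a enzyme) that the data itself never mentions.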

Sunday, September 11, 2011

New Charter for the W3C Health Care and Life Sciences Interest Group

Last week, the World Wide Web Consortium (W3C) approved a new charter for its Health Care and Life Sciences Interest Group (HCLSIG), in which I (Carleton University), Charlie Mead (NIH CBIIT) and Vijay Bulusu (Pfizer) were selected as co-chairs. This new charter directs us to develop, advocate and support the use of Semantic Web technologies for translational medicine and its three enabling domains: life sciences, clinical research and health care. While the core HCLSIG values - simplicity, pragmatism, effectiveness - remain firmly in place, Charlie, Vijay and I hope to make subtle changes to the operational strategy such that our efforts become increasingly recognized as critical in conferences and boardrooms across the globe.

As always, the HCLSIG will create prototype implementations that demonstrate the value of formalizing and sharing knowledge using Semantic Web technologies. We will marshal our efforts towards fulfilling compelling use cases that have intrinsic value not just to W3C members, but ideally to a larger number of outside beneficiaries. Thus, our experts will now develop these use cases such that, a priori, we have a clearer picture of the rationale of the project, its resources, milestones and deliverables, and ultimately which organizations and communities will directly and indirectly benefit. Coupled with an effective dissemination strategy, including leveraging our combined social networks, we hope to maximize the impact of the work of our members in this emerging area of knowledge management.

As part of our dissemination strategy, we also intend to produce more member contributions that describe methods for basic and advanced tasks, in addition to publishing recommendations arising from consensus among our members. Such recommendations will endorse and specify the use of terminological resources in the long-term context of semantic interoperability across the three core domains. Thus, participation in the HCLSIG will be critical for those wanting to advocate RDF representations of data and OWL representations of ontologies for the purposes of semantic annotation and large-scale semantic integration of biomedical data.

With that, we invite non-members to join the W3C and work with our strong complement of experts in what will surely be an exciting and productive time over the next few years for the W3C HCLSIG.

Saturday, September 10, 2011

Bio2RDF: moving forward

Last week we held our first virtual meeting towards re-invigorating the Bio2RDF project with a significantly larger and vested community. From the discussions, we plan to establish three focus groups around:

A. policy (information, governance, sustainability, outreach)
B. technical (architecture, infrastructure and RDFization)
C. social (user experience and social networking)

The next step then is for groups to:
1. identify and certify discussion leads (responsibilities: set meeting times and agenda, facilitate and encourage discussion among members, draft reports)
2. identify additional people to recruit from the wider community who would provide additional expertise (interested, but didn't attend the first meeting? sign up now!)
3. extend and prioritize discussion items (what exactly will this group focus its efforts on in the short and long term)
4. identify and assign bite-sized tasks (so we can get things done one step at a time :)
5. collate results and present to the wider community

I suggest that groups self-organize a first meeting in the next two weeks to deal with items 1-4, and either meet again or use the Google documents to collaboratively report findings.

Finally, I'd like for us to hold another meeting at times that are much more accommodating for Europe + North America ;) Please fill in the Doodle poll.
As always, feel free to contact me if you have any questions, and please sign up to the Bio2RDF mailing list for all future discussions.

Sunday, September 4, 2011

Scientific Publishing: We're not quite there yet...

I recall how discussions with Pascal Hitzler at ISWC2008 in Karlsruhe eventually led to the creation of a new journal, Semantic Web, in 2009. It had most of the elements I wanted:
 1- immediate availability of the manuscript
 2- a transparent review process
 3- a high standard of quality
 4- low cost of publication

The immediate availability of the manuscript is important because publishing in a peer-reviewed journal can often take more than a year, which is problematic when making claims about who did what and when. A transparent review process is important to create an incentive to carefully formulate constructive reviews and to weed out spurious reviews dealt under the veil of anonymity. In addition, we can establish reviewer quality and reliability, which is important when making future review requests. Another aspect of open review is that it enables the acknowledgement of reviewer contributions to improving the quality of the manuscript. In this respect, Semantic Web is stellar: it acknowledges the editors and the reviewers on the front page of the published manuscript.

One concern was whether reviewers would be sufficiently forthcoming about the failings of a paper that it would lead to the strong rejection of the manuscript. From all indications (see the paper that Pascal and Krzysztof wrote), the journal not only rejects a significant number of manuscripts that don't meet expectations for publication, but also does so in a constructive manner, inviting authors to submit a revised manuscript indicating how they have addressed the reviewers' comments. Indeed, this aspect of open review also means that if authors attempt to shop the failed manuscript to other journals, a simple web search should uncover the reviews associated with it. This saves precious editor and reviewer time and really pushes the authors to make substantial changes.

While this is all good, the major point of contention is the business model. From any which way I look at it, I get the feeling that the tables are turned upside down. First you have the authors creating intellectual content, which they submit to a journal, and which is reviewed by people who don't get paid to do so. The journal then turns around and tells the authors that if they want it freely available to the public, then the authors should cough up the money. This is ridiculous. Even a simple advertising-based model could easily make a return on investment for articles of increasing impact. The other aspect is that the publisher needs to cover the costs of a print publication, but really, who reads print? I haven't for over 10 years now! In the case of the journal, they exact a cost for typesetting, but we're used to doing this ourselves for our workshop and conference submissions, so we really don't need people to do this anymore. It would be nice to see a breakdown of costs for digital publishing systems today (perhaps you can point us to one).

So what is the business model of the modern scientific publisher? Well, I think it lies in aggregating and creating new content, whether editorials or comprehensive reviews, which give readers a viewpoint on or summary of where the field is heading. Perhaps with low-cost subscriptions (e.g. $100/yr) or pay-per-view ($5), one could recoup costs if the work is sufficiently meritorious. It's definitely time to think about the next evolution in scientific publishing.