Sunday, September 25, 2011

Provenance: what is it and how should we formalize it?


As a testament to the growing recognition of provenance for (e-)science, I'm glad to see that the W3C incubator group worked hard to think about the issues and make it possible to establish a W3C provenance interchange Working Group.

A good starting point:

"provenance is often represented as metadata, but not all metadata is necessarily provenance"

but
"Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource."

does not adequately describe the conditions under which metadata becomes provenance.

and 
"Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource"

contains elements that are undefined (record), uncertain (are processes not also entities?), narrow (producing/delivering) and broad (influencing).

Of course, I appreciate the difficulty in crafting a good definition, and I understand that this is a definition from which useful work can be achieved.  I will take the opportunity to express my thoughts on the matter.

I think there are two key aspects to provenance (not unlike what is suggested here: http://www.springerlink.com/content/edf0k68ccw3a22hu/):
1. how did the resource come about? (relates to creation and justification)
 - important for reproducibility (which is an element of science)
 - includes attribution (who created the resource), creation (process that generated the resource), reproduction (process in which a copy was made), derivation (process in which the resource was generated from some resource or portion of a resource), versioning (process of keeping count of sequential derivations)

2. what is the history of the resource (from the point of creation)?
 - important for authenticity 
 - includes origin, possession and the acts of transfer

Both have implications for trust, and can be used for accountability, among other things. 

I find this part on recommendations of a provenance framework quite nice:

but get less excited when I see the collection of "provenance concepts"

particularly because we need to simplify the discourse such that we consider

an event (for 1 above) 
 - participants (and their roles; e.g. agents, targets, products)
 - locations
 - time instants (e.g. action timestamps) and durations (processual attributes)

and a sequence of events (for both 1 and 2 above)

This would certainly help to generate a specification with a minimal set of classes and relations to express this kind of information.
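
To make the event-centred view a bit more concrete, here is a minimal sketch in Python of what such a small set of classes might look like. It is only an illustration under my own assumptions: the class names (Participant, Event), the history function, the "recipient" role and the example values (alice, bob, dataset-v1, lab A) are all hypothetical and do not come from any W3C document.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Participant:
    name: str  # who or what took part (hypothetical identifiers)
    role: str  # e.g. "agent", "target", "product"


@dataclass
class Event:
    label: str
    participants: List[Participant] = field(default_factory=list)
    location: Optional[str] = None
    start: Optional[datetime] = None      # time instant (action timestamp)
    duration: Optional[timedelta] = None  # processual attribute

    def with_role(self, role: str) -> List[str]:
        """Names of the participants that play the given role."""
        return [p.name for p in self.participants if p.role == role]


def history(events: List[Event]) -> List[Event]:
    """Order the timestamped events by start time to get a sequence of events."""
    timestamped = [e for e in events if e.start is not None]
    return sorted(timestamped, key=lambda e: e.start)


# Aspect 1: how the resource came about.
creation = Event(
    label="creation",
    participants=[
        Participant("alice", "agent"),
        Participant("dataset-v1", "product"),
    ],
    location="lab A",
    start=datetime(2011, 9, 1, 10, 0),
    duration=timedelta(hours=2),
)

# Aspect 2: what happened to it afterwards (an act of transfer).
transfer = Event(
    label="transfer",
    participants=[
        Participant("alice", "agent"),
        Participant("dataset-v1", "target"),
        Participant("bob", "recipient"),  # assumed role, not in the list above
    ],
    start=datetime(2011, 9, 20, 9, 0),
)

timeline = history([transfer, creation])
print([e.label for e in timeline])
```

With just two classes and one ordering function, attribution becomes a question about who holds the agent role in the creation event, and the history of the resource is simply the ordered list of events it takes part in.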

Now, I'm writing this late at night, and I appreciate that I may not have considered all the issues that the provenance group has (along with others who have written about the subject), but perhaps there are still some good discussions to be had wrt provenance and how we formally represent it, as it is of strategic importance to the HCLSIG in our current and future efforts.
