
Sunday, September 25, 2011

Provenance: what is it and how should we formalize it?


As a testament to the growing recognition of provenance in (e-)science, I'm glad to see that the W3C Provenance Incubator Group worked hard to think through the issues and make it possible to establish a W3C provenance interchange Working Group.

A good starting point:

"provenance is often represented as metadata, but not all metadata is necessarily provenance"

but
"Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource."

does not adequately describe the conditions under which descriptive metadata becomes provenance.

and 
"Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource"

contains elements that are undefined (record), uncertain (are processes not also entities?), narrow (producing/delivering) and broad (influencing).

Of course, I appreciate the difficulty of crafting a good definition, and I understand that this is a definition from which useful work can proceed. Still, I will take the opportunity to express my thoughts on the matter.

I think there are two key aspects to provenance (not unlike what is suggested here: http://www.springerlink.com/content/edf0k68ccw3a22hu/):
1. how did the resource come about? (relates to creation and justification)
 - important for reproducibility (which is an element of science)
 - includes attribution (who created the resource), creation (process that generated the resource), reproduction (process in which a copy was made), derivation (process in which the resource was generated from some resource or portion of a resource), versioning (process of keeping count of sequential derivations)

2. what is the history of the resource (from the point of creation)?
 - important for authenticity 
 - includes origin, possession and the acts of transfer

Both have implications for trust, and can be used for accountability, among other things. 

I find the part on recommendations for a provenance framework quite nice,

but I get less excited when I see the collection of "provenance concepts",

particularly because we need to simplify the discourse such that we consider

an event (for 1 above) 
 - participants (and their roles; e.g. agents, targets, products)
 - locations
 - time instants (e.g. action timestamps) and durations (processual attributes)

and a sequence of events (for both 1 and 2 above)

This would certainly help to generate a specification with a minimal set of classes and relations to express this kind of information.
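
As a rough illustration of how small such a vocabulary could be, here is a minimal sketch in Turtle. The ex: namespace and all class and property names below are hypothetical placeholders, not terms from any existing specification:

# Minimal event-centric provenance sketch (hypothetical ex: vocabulary)
@prefix ex:  <http://example.org/provenance#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# An event that produced a resource (aspect 1: how it came about)
ex:event1 a ex:Event ;
    ex:hasParticipant [ a ex:Agent ; ex:role ex:creator ; ex:agent <http://example.org/people/alice> ] ;
    ex:hasParticipant [ a ex:Product ; ex:resource <http://example.org/data/dataset1> ] ;
    ex:atLocation <http://example.org/places/lab42> ;
    ex:atTime "2011-09-25T23:30:00Z"^^xsd:dateTime .

# A later event in the sequence captures the resource's history (aspect 2)
ex:event2 a ex:Event ;
    ex:follows ex:event1 ;
    ex:hasParticipant [ a ex:Agent ; ex:role ex:newOwner ; ex:agent <http://example.org/people/bob> ] ;
    ex:hasParticipant [ a ex:Target ; ex:resource <http://example.org/data/dataset1> ] ;
    ex:atTime "2011-10-01T09:00:00Z"^^xsd:dateTime .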

Now, I'm writing this late at night, and I appreciate that I may not have considered all the issues that the provenance group (along with others who have written about the subject) has, but perhaps there are still some good discussions to be had with respect to provenance and how we formally represent it, as it is of strategic importance to the HCLSIG in our current and future efforts.

Sunday, September 11, 2011

New Charter for the W3C Health Care and Life Sciences Interest Group

Last week, the World Wide Web Consortium (W3C) approved a new charter for its Health Care and Life Sciences Interest Group (HCLSIG), for which Charlie Mead (NIH CBIIT), Vijay Bulusu (Pfizer) and I (Carleton University) were selected as co-chairs. This new charter directs us to develop, advocate and support the use of Semantic Web technologies for translational medicine and its three enabling domains: life sciences, clinical research and health care. While the core HCLSIG values - simplicity, pragmatism, effectiveness - remain firmly in place, Charlie, Vijay and I hope to make subtle changes to the operational strategy so that our efforts become increasingly recognized as critical in conferences and boardrooms across the globe.

As always, the HCLSIG will create prototype implementations that demonstrate the value of formalizing and sharing knowledge using Semantic Web technologies. We will marshal our efforts towards fulfilling compelling use cases that have intrinsic value not just to W3C members, but ideally to a larger number of outside beneficiaries. Thus, our experts will now develop these use cases such that a priori we have a clearer picture of the rationale of the project, its resources, milestones and deliverables, and ultimately, which organizations and communities will directly and indirectly benefit. Coupled with an effective dissemination strategy, including leveraging our combined social networks, we hope to maximize the impact of the work of our members in this emerging area of knowledge management.

As part of our dissemination strategy, we also intend to produce more member contributions that describe methods for basic and advanced tasks, in addition to publishing recommendations arising from consensus among our members. Such recommendations will endorse and specify the use of terminological resources in the long-term context of semantic interoperability across the three core domains. Thus, participation in the HCLSIG will be critical for those wanting to advocate RDF representations of data and OWL representations of ontologies for the purposes of semantic annotation and large-scale semantic integration of biomedical data.

With that, we invite non-members to join the W3C and work with our strong complement of experts in what will surely be an exciting and productive time over the next few years for the W3C HCLSIG.



Saturday, September 10, 2011

Bio2RDF: moving forward

Last week we held our first virtual meeting towards re-invigorating the Bio2RDF project with a significantly larger and vested community. From these discussions, we plan to establish three focus groups around:

A. policy (information, governance, sustainability, outreach)
B. technical (architecture, infrastructure and RDFization)
C. social (user experience and social networking)

The next step then is for groups to:
1. identify and certify discussion leads (responsibilities: set meeting times and agenda, facilitate and encourage discussion among members, draft reports)
2. identify additional people to recruit from the wider community who would provide additional expertise (interested, but didn't attend the first meeting? sign up now!)
3. extend and prioritize discussion items (what exactly will this group focus its efforts on in the short and long term)
4. identify and assign bite-sized tasks (so we can get things done one step at a time :)
5. collate results and present to the wider community

I suggest that groups self-organize a first meeting in the next two weeks to deal with items 1-4, and either meet again or use the Google documents to collaboratively report findings.

Finally, I'd like us to hold another meeting at times that are more accommodating for both Europe and North America ;)  Please fill in the doodle poll (http://www.doodle.com/fsuz6mgs5cztf2e2)
 
As always, feel free to contact me if you have any questions, and please sign up to the Bio2RDF mailing list for all future discussions.

Friday, June 4, 2010

SADI

(modified from an email that Mark Wilkinson sent)

SADI is a very lightweight "standard" (a set of best practices, really) for modeling and providing Web Services. It uses standards from the W3C Semantic Web initiative - in particular, it uses OWL for types and RDF for instance data.

SADI is used to expose "resources" to the world in a manner that can be discovered automatically, and accessed automatically, by software. Those resources might be data inside databases (where SADI replaces the traditional Web Query page), or they might be analytical algorithms that consume data, chug away on it, and return output data. In both cases, the interfaces are structurally identical, so from the perspective of the client software, it doesn't have to know or care whether it is trying to get data out of a database or out of an analytical tool - the question/query structure is the same, and moreover, it is completely predictable.
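
To make the idea of a structurally identical interface concrete, here is a rough sketch of what a SADI-style exchange might look like. The ex: namespace, class names and property names are hypothetical illustrations, not entries from the actual SADI registry:

# Hypothetical input: RDF describing an instance of the service's OWL input class
@prefix ex: <http://example.org/sadi-demo#> .

<http://example.org/genes/BRCA1> a ex:Gene ;
    ex:hasSymbol "BRCA1" .

# Hypothetical output: the same instance, now decorated with the
# properties promised by the service's OWL output class
<http://example.org/genes/BRCA1> a ex:AnnotatedGene ;
    ex:transcribedInto <http://example.org/transcripts/BRCA1-201> .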

This is critical advantage #1 for SADI over traditional Web Services frameworks - in traditional XML-based Web Services, you still must code your client software to access each service, since the service interfaces cannot be interpreted by the machine. In SADI, we can design ONE piece of software to access all resources exposed as SADI services - "one ring to rule them all!". (and we already have several different "rings" that expose SADI data in different ways)

Critical advantage #2 is a bit more obscure and hard to describe, but is likely to be the more important in the long-run. In SADI, data is "grounded" in explicit semantics. This means that all data in SADI carries with it information about what TYPE of data it is, and how that data relates to other data (e.g. genes transcribed into transcripts translated into proteins which regulate genes: Gene, Transcript, Protein are all data types, and "transcribed", "translated", "regulate" are relationships between them). With this explicit (and extensive!) grounding in semantics, we can start asking our machines to do a lot of the interpretation for us. For example, "what gene regulates gene X" is a nonsensical question biologically, but it's a question that biologists ask all the time! With a solid grounding in semantics, the machine would be able to follow the logical pathway above and say "well, to answer that question, I am going to have to go through transcripts and proteins to get there" and then automatically construct the pipeline of services that get to the answer. This is just one example of how Semantics can be used to facilitate question-answering.
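
As a sketch of the kind of question-answering this enables, here is how the "what gene regulates gene X" question could be expanded along the biologically sensible path using SPARQL 1.1 property paths (the ex: predicates and ex:GeneX are hypothetical, and SADI's own machinery composes services rather than running a single query, but the logical chain is the same):

# gene -> transcribed into transcript -> translated into protein -> regulates gene X
PREFIX ex: <http://example.org/sadi-demo#>

SELECT ?gene
WHERE {
  ?gene ex:transcribedInto/ex:translatedInto/ex:regulates ex:GeneX .
}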

There are several tutorials available.

For an overview of what it can do: http://www.slideshare.net/markmoby/sadi-swsip-09

Then go to http://sadiframework.org to find the more specific tutorials on how to deploy services.

The current list of available services is at http://sadiframework.org/registry/services/ and that list will be growing rapidly over the next year (we have committed to having at least 400 more services, but I suspect that we'll go far beyond that number!)


Friday, May 7, 2010

Getting SNORQL to work with Virtuoso

SNORQL is an AJAX SPARQL browser that makes it easy to i) see if your queries work and ii) navigate your linked data. SNORQL comes packaged with D2R Server, but one has to make a few modifications for it to work when it is installed in a directory or on a port different from that of the SPARQL endpoint.

Do the following to make SNORQL work with a locally hosted endpoint:
1. Download and install SNORQL. SNORQL comes as part of the D2R Server distribution; download it and extract the snorql folder from the webapps directory into some folder on your host, preferably one that is already accessible by the web server (e.g. in the htdocs directory). If you want to put SNORQL in a different folder, add an entry like the following to your httpd.conf file.

Alias /snorql /usr/var/snorql
<Directory "/usr/var/snorql">
Options None
AllowOverride None
Order allow,deny
Allow from all
</Directory>

2. Configure the Apache server as a proxy to the endpoint. If the port of the Apache server and that of the endpoint differ, you need to make them appear the same, otherwise the browser's same-origin policy will block the AJAX requests. Add this to your httpd.conf file:

ProxyRequests Off
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPass /sparql http://localhost:8890/sparql
ProxyPassReverse /sparql http://localhost:8890/sparql


3. Configure SNORQL to use the endpoint.
Edit the snorql.js file and replace

this._endpoint = document.location.href.match(/^([^?]*)snorql\//)[1] + 'sparql';

with

this._endpoint = 'http://localhost/sparql';


4. Open your browser to the SNORQL URL (e.g. http://localhost/snorql) to query and navigate the results :-)
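
To confirm that everything is wired up, a simple query such as this one (pasted into the SNORQL query box) should return a few triples from whatever data your Virtuoso endpoint holds:

# List a handful of triples to verify that the endpoint and proxy are working
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10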