Scientific Ramblings: 2011

Friday, October 14, 2011

Google Maps: break in the road bug!

Dear Google Maps,
I'm trying to plan a trip [1] in and around the Atacama Desert while I'm here in Chile, but have found an interesting bug. Basically, there is a break in the road (between A and B) which prevents Maps from using that road. To reproduce, see http://g.co/maps/zwq8y

There's also a break on the road directly east of that break point.

I'd report the problem to you, if only you had the "Report a Problem" link anywhere on this page, despite your claims [2] to the contrary.

Best,

m.

[1] http://g.co/maps/3zwqt
[2] http://maps.google.com/support/bin/answer.py?hl=en&answer=162873&topic=1687362

Sunday, September 25, 2011

Provenance: what is it and how should we formalize it?

As a testament to the growing recognition of provenance for (e-)science, i'm glad to see that the W3C incubator group worked hard to think about the issues and make it possible to establish a W3C provenance interchange Working Group.

a good starting point:

"provenance is often represented as metadata, but not all metadata is necessarily provenance"

http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#Provenance_and_Metadata

but

"Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource."

does not provide adequate description for identifying the conditions.

and

"Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource"

contains elements that are undefined (record), uncertain (are processes not also entities?), narrow (producing/delivering) and broad (influencing).

Of course, I appreciate the difficulty in crafting a good definition, and I understand that this is a definition from which useful work can be achieved. I will take the opportunity to express my thoughts on the matter.

i think there are two key aspects to provenance (not unlike what is suggested here: http://www.springerlink.com/content/edf0k68ccw3a22hu/)

1. how did the resource come about? (relates to creation and justification)

- important for reproducibility (which is an element of science)

- includes attribution (who created the resource), creation (process that generated the resource), reproduction (process in which a copy was made), derivation (process in which the resource was generated from some resource or portion of a resource), versioning (process of keeping count of sequential derivations)

2. what is the history of the resource (from the point of creation)

- important for authenticity

- includes origin, possession and the acts of transfer

Both have implications for trust, and can be used for accountability, among other things.

I find this part on recommendations of a provenance framework quite nice:

http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#A_Roadmap_for_Provenance_on_the_Web

but get less excited when i see the collection of "provenance concepts"

http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/#Recommendations (section 4)

particularly because we need to simplify the discourse such that we consider

an event (for 1 above)

- participants (and their roles; e.g. agents, targets, products)

- locations

- time instants (e.g. action timestamps) and durations (processual attributes)

and a sequence of events (for both 1 and 2 above)

this would certainly help to generate a specification with a minimal set of classes and relations to express this kind of information.
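To make the event-centric view above a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not anything the W3C group has proposed: the names Participant, Event, events_for and creators, as well as the example resources ("dataset-v1", "alice", and so on), are all made up for this post. An event carries participants with roles, an optional location, a time instant and a duration; the history of a resource is just the time-ordered sequence of events it takes part in.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Participant:
    """Something taking part in an event (hypothetical class for this sketch)."""
    name: str
    role: str  # e.g. "agent", "target", "product"


@dataclass
class Event:
    """One event: participants, location, time instant and duration."""
    label: str
    start: datetime
    participants: List[Participant] = field(default_factory=list)
    location: Optional[str] = None
    duration: timedelta = timedelta(0)

    def with_role(self, role: str) -> List[str]:
        """Names of the participants that play the given role."""
        return [p.name for p in self.participants if p.role == role]


def events_for(resource: str, events: List[Event]) -> List[Event]:
    """Sequence of events involving the resource, ordered by start time."""
    involved = [e for e in events
                if any(p.name == resource for p in e.participants)]
    return sorted(involved, key=lambda e: e.start)


def creators(resource: str, events: List[Event]) -> List[str]:
    """Agents of every event in which the resource is a product (aspect 1)."""
    names: List[str] = []
    for e in events_for(resource, events):
        if resource in e.with_role("product"):
            names.extend(e.with_role("agent"))
    return names


# Illustrative data only: a creation, a derivation and a transfer.
creation = Event("creation", datetime(2011, 9, 1, 10, 0),
                 [Participant("alice", "agent"),
                  Participant("dataset-v1", "product")],
                 location="Ottawa", duration=timedelta(hours=2))
derivation = Event("derivation", datetime(2011, 9, 10, 14, 0),
                   [Participant("bob", "agent"),
                    Participant("dataset-v1", "target"),
                    Participant("dataset-v2", "product")])
transfer = Event("transfer", datetime(2011, 9, 20, 9, 0),
                 [Participant("alice", "agent"),
                  Participant("dataset-v1", "target")])

log = [transfer, derivation, creation]  # deliberately out of order

print([e.label for e in events_for("dataset-v1", log)])
print(creators("dataset-v2", log))
```

With only two classes and a role label, the same structure answers both questions: creators covers "how did the resource come about?", and events_for covers "what is its history?". Whether roles should be free-text labels or proper classes is exactly the kind of thing a real specification would need to settle.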

Now, I'm writing this late at night, and I appreciate that I may not have considered all the issues that the provenance group has (along with others who have written about the subject), but perhaps there are still some good discussions to be had about provenance and how we formally represent it, as it is of strategic importance to the HCLSIG in our current and future efforts.

Thursday, September 22, 2011

A letter to gmail: attachments

Dear Gmail team.

First, thank you for making it possible for me to see my unread mail - i sent you this idea some time ago, and i'm glad that you listened.

Second, I'm now somehow at 50% of my allocated capacity, and what i need is a way to filter my mails by attachment (which i can do!), but also order them from largest to smallest (which I can't do). Once i can order attachments by size, i can start deleting the big ones and free up more room! YAY!

m.

Tuesday, September 20, 2011

On the topic of ontology evaluation

With only a cursory look at the literature pertaining to the evaluation of ontologies, I already get the feeling that the current measures completely miss the point. The answer doesn't lie in the syntax (format) or structure of the ontology (the number of classes and properties, subsumed classes, axioms, etc.). Rather, the effectiveness of an ontology (a representation of knowledge) lies in whether its semantics can be used for some task. So what we really want is to focus on the nature of the task, and on whether an ontology provides a competitive advantage over other technologies.

Off the top of my head, here are some tasks:

- search/browse (most sites using GO, etc)

- text annotation (gopubmed)

- data normalization and structured queries - bio2rdf

- answering questions that require background knowledge e.g. across a yeast database.

- data integration (heterogeneous types of data; data from different sources, of differing granular detail) (see my translational medicine paper)

- classification e.g. domains or chemicals

- prediction e.g. predicting phenotypes

Perhaps others can suggest some?

Sunday, September 11, 2011

New Charter for the W3C Health Care and Life Sciences Interest Group

Last week, the World Wide Web Consortium (W3C) approved a new charter for its Health Care and Life Sciences Interest Group (HCLSIG), in which I (Carleton University) along with Charlie Mead (NIH CBIIT) and Vijay Bulusu (Pfizer) were selected as co-chairs. This new charter directs us to develop, advocate and support the use of Semantic Web technologies for translational medicine and its three enabling domains: life sciences, clinical research and health care. While the core HCLSIG values - simplicity, pragmatism, effectiveness - remain firmly in place, Charlie, Vijay and I hope to make subtle changes to the operational strategy such that our efforts become increasingly recognized as critical in conferences and boardrooms across the globe.

As always, the HCLSIG will create prototype implementations that demonstrate the value of formalizing and sharing knowledge using Semantic Web technologies. We will marshal our efforts towards fulfilling compelling use cases that have intrinsic value not just to W3C members, but ideally to a larger number of outside beneficiaries. Thus, our experts will now develop these use cases such that a priori we have a clearer picture of the rationale of the project, its resources, milestones and deliverables, and ultimately, which organizations and communities will directly and indirectly benefit. Coupled with an effective dissemination strategy, including leveraging our combined social networks, we hope to maximize the impact of the work of our members in this emerging area of knowledge management.

As part of our dissemination strategy, we also intend to produce more member contributions that describe methods for basic and advanced tasks, in addition to publishing recommendations arising from consensus among our members. Such recommendations will endorse and specify the use of terminological resources in the long-term context of semantic interoperability across the three core domains. Thus, participation in the HCLSIG will be critical for those wanting to advocate RDF representations of data and OWL representations of ontologies for the purposes of semantic annotation and large-scale semantic integration of biomedical data.

With that, we invite non-members to join the W3C and work with our strong complement of experts in what will surely be an exciting and productive time over the next few years for the W3C HCLSIG.

Saturday, September 10, 2011

Bio2RDF: moving forward

Last week we held our first virtual meeting towards re-invigorating the Bio2RDF project with a significantly larger and more invested community. From discussions, we plan to establish 3 focus groups around:

A. policy (information, governance, sustainability, outreach)

B. technical (architecture, infrastructure and RDFization)

C. social (user experience and social networking)

The next step then is for groups to:

1. identify and certify discussion leads (responsibilities: set meeting times and agenda, facilitate and encourage discussion among members, draft reports)

2. identify additional people to recruit from the wider community that would provide additional expertise (interested, but didn't attend the first? sign up now !)

3. extend and prioritize discussion items (what exactly will this group focus its efforts on in the short and long term)

4. identify and assign bite-sized tasks (so we can get things done one step at a time :)

5. collate results and present to the wider community

I suggest that groups self-organize a first meeting in the next two weeks to deal with items 1-4, and either meet again or use the Google documents to collaboratively report findings.

Finally, I'd like for us to hold another meeting with times that are much more accommodating for Europe + North America ;) Please fill out the Doodle poll (http://www.doodle.com/fsuz6mgs5cztf2e2)

As always, feel free to contact me if you have any questions, and please sign up to the Bio2RDF mailing list for all future discussions.

Sunday, September 4, 2011

Scientific Publishing: We're not quite there yet...

I recall how discussions with Pascal Hitzler at ISWC2008 in Karlsruhe eventually led to the creation of a new journal: Semantic Web in 2009. It had most of the elements I wanted:

1- immediate availability of the manuscript

2- a transparent review process

3- a high standard of quality.

4- low cost of publication

The immediate availability of the manuscript is important because publishing in a peer-reviewed journal can often take more than 1 year, which is problematic when making claims about who did what and when. A transparent review process is important to create an incentive to carefully formulate constructive reviews and weed out spurious reviews dealt under the veil of anonymity. In addition, we can establish reviewer quality and reliability - important when making future review requests. Another aspect of open review is that it enables the acknowledgement of reviewer contributions to improving the quality of the manuscript. In this respect, Semantic Web is stellar - it acknowledges the editors and the reviewers on the front page of the published manuscript.

One concern was whether reviewers would be sufficiently forthcoming about the failings of a paper such that it leads to the strong rejection of the manuscript. From all indications (see the paper that Pascal and Krzysztof wrote http://t.co/twDASAN9), the journal not only rejects a significant number of manuscripts that don't meet expectations for publication, but also does so in a constructive manner such that it invites authors to submit a revised manuscript indicating how they have addressed the reviewers' comments. Indeed, this aspect of open review also means that if authors attempt to shop the failed manuscript to other journals, a simple web search should uncover the reviews associated with it. This saves precious editor and reviewer time and really pushes the authors to make substantial changes.

While this is all good, the major point of contention is the business model. From any which way I look at it, I get the feeling that the tables are turned upside down. First you have the authors creating intellectual content, which they submit to a journal, and which is reviewed by people who don't get paid to do so. The journal then turns around and tells the authors that if they want it freely available to the public, then the authors should cough up the money. This is ridiculous. Even a simple advertising-based model could easily make a return on investment for articles of increasing impact. The other aspect is that the publisher needs to cover the costs of a print publication, but really, who reads print? I haven't for over 10 years now! In the case of the Journal, they exact a cost for typesetting, but we're used to doing this ourselves for our workshop and conference submissions - so we really don't need people to do this anymore. It would be nice to see a breakdown of costs for digital publishing systems today (perhaps you can point us to one).

So what is the business model of the modern scientific publisher? Well, I think it lies in aggregating and creating new content, whether editorials or comprehensive reviews, which give readers a viewpoint or summary of where the field is heading. Perhaps with low-cost subscriptions (e.g. $100/yr) or pay-per-view ($5), one could recoup costs if the work is sufficiently meritorious. It's definitely time to think about the next evolution in scientific publishing.

Wednesday, July 13, 2011

Sabbatical Interview

Earlier this year I was interviewed about my sabbatical plans, and we did a photo shoot as well. Have a look at the article, which includes my rabbits featured on the monitor screen :)

http://cualumni.carleton.ca/magazine/summer-2011/parting-shots/

Monday, July 4, 2011

Breaking Down the Sabbatical

It's hard to believe that it's already been 7 years since I started at Carleton University. I remember thinking about going on sabbatical - a major feature of becoming a university professor. Well, now the time has come, and along with the proposal, I've scheduled quite a bit of travel to visit with colleagues and to take some well-deserved holiday time. So here's the tentative schedule:

July 3-14: Cambridge, UK (EBI - Dietrich Rebholz-Schuhmann, University of Cambridge - Robert Hoehndorf)
July 14-18: Vienna, Austria (Bio-ontologies, ISMB Tutorial)
July 18-August 7: Malta+Italy (Valletta, Catania, Salerno, Pompeii, Rome, Florence, Venice, Bologna, Pisa)
August 7-10: Finland (Tampere, Helsinki)
August 10-18: Iceland
August 18-29: Kyoto + Tokyo, Japan (Biohackathon 2011)
August 29-September 3: Madrid (Ontology Engineering Group : Alexander De Leon)
September 3-7: Heidelberg, Germany (COMBINE)
September 7-12: Ottawa (defense: Leonid Chepelev -> success!)
September 13-17: Nancy, France (INRIA/LORIA: Adrien Coulet)
September 19: Volendam (OpenPHACTS/Gen2Phen meeting on open data)
September 17-October 1: Paris, St Malo, Bordeaux, London (travel with parents)
October 1-11: Ottawa, Canada (defense: Natalia Villanueva-Rosales -> success!)
October 11-December 1: Concepcion, Chile (Universidad de Concepcion : Leo Ferres)
December 1-December 23: Santiago, San Pedro de Atacama, Mendoza, Buenos Aires, Colonia, Punta del Diablo and Montevideo
December 24-January 31: Toronto, Ottawa
February - March: Stanford University (Nigam Shah, Mark Musen)
April - June : India, Nepal, Bhutan, Thailand, Singapore ?

Sabbatical 2011-2012: Formalizing Scientific Discourse

Objective: The goal of my research program is to enable biologists to compose and evaluate scientific hypotheses using a diverse set of informational sources (ontology, database, text, equations, and web services). The purpose of my 2011-2012 sabbatical is to develop expertise in formalizing scientific discourse, with a particular focus on formalizing textual descriptions and mathematical equations such that they interoperate with knowledge represented in databases and structured documents. In particular, I am interested in using high quality facts derived from text and dynamic computation from formalized equations to answer questions and provide evidence for scientific hypotheses.

Background: Advancing knowledge in the life sciences involves experimentally testing hypotheses and interpreting the results based on prior scientific work. In generating a valid hypothesis, biologists face the overwhelming challenge of collecting, evaluating and integrating large and increasing amounts of different kinds of information about organisms, cells, genes and proteins from thousands of articles, hundreds of databases and dozens of tools. A biologist’s ability to efficiently construct and evaluate a hypothesis over current knowledge requires that i) knowledge, data and hypotheses are formally represented so they may be reasoned about, ii) adequate software exists to manage and query formal knowledge, and iii) data can be obtained by searching relevant databases and invoking the right analytical tools. The inability to efficiently discover relevant information can negatively impact scientific research directions and proposed activities. Methods for facilitating the construction and evaluation of hypotheses against the current state of knowledge could translate into greater scientific insight and increased productivity. Innovative approaches for knowledge discovery could be applied to data on the emerging Semantic Web and be transformative on a global scale.

Proposed Activities
1. Text to triples: The purpose of this leg of the sabbatical is to gain an understanding of the current state of the art in natural language processing and develop skill in producing high quality triples from parsing scientific text.
Time Frame: July 2011-September 2011
Location: European Bioinformatics Institute, Hinxton, UK.
Host: Dr. Rebholz-Schuhmann

2. Formalizing equations: The purpose of this leg of the sabbatical is to investigate the ontology of equations, represent scientific equations using Semantic Web technologies (principally the Rule Interchange Format), and implement semantic web services that serve to compute over formalized scientific equations.
Time Frame: October 2011-December 2011
Location: Universidad de Concepcion, Concepcion, Chile.
Host: Dr. Leo Ferres

3. Formalizing Research Hypotheses: The purpose of this leg of the sabbatical is to explore the formalization of hypotheses concerning disease. Specifically, I will extract meaningful facts from AlzForum and integrate these with resources from the National Center for Biomedical Ontology (NCBO) and Bio2RDF, our large-scale Semantic Web project.

Time Frame: January 2012-March 2012
Location: Stanford University, Palo Alto, California, USA.
Host: Dr. Mark Musen

4. Integrated Framework for Knowledge Discovery: The last leg of the sabbatical will be focused on maximizing interoperability between text, equations, ontologies and database-derived facts. I will use SADI, our Semantic Web services framework, towards achieving this objective.
Time Frame: April 2012-June 2012
Location: India, Thailand, Singapore

Scientific Value and Broader Beneficial Impacts
The development and application of efficient strategies for knowledge discovery is a major goal in bioinformatics. My research into new strategies for the representation and evaluation of scientific hypotheses using ontologies, scientific text, data and bioinformatic services will create a novel platform that will significantly contribute to scientific productivity and ultimately improve our understanding of biology. The proposed sabbatical will provide me with new skills that will be used to train a future generation of young scientists in the areas of formal knowledge representation, text mining and the Semantic Web. Ultimately, it is expected that the sabbatical will cultivate new partnerships with leading scientists and open new doors to work with industry and government agencies.