Friday, June 4, 2010

SADI

(modified from an email that Mark Wilkinson sent)

SADI (Semantic Automated Discovery and Integration) is a very lightweight "standard" (a set of best practices, really) for modeling and providing Web Services. It uses standards from the W3C Semantic Web initiative - in particular, OWL for types and RDF for instance data.

SADI is used to expose "resources" to the world in a manner that can be discovered and accessed automatically by software. Those resources might be data inside databases (where SADI replaces the traditional Web query page), or they might be analytical algorithms that consume data, chug away on it, and return output data. In both cases, the interfaces are structurally identical, so client software doesn't have to know or care whether it is getting data out of a database or out of an analytical tool - the question/query structure is the same, and moreover, it is completely predictable.
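
To make this concrete, here is a minimal sketch of how a SADI service is invoked over plain HTTP - in Python with the third-party requests library, and with an invented service URL standing in for a real one from the registry:

import requests  # third-party HTTP library (pip install requests)

# Hypothetical service URL -- any real SADI service is called the same way.
SERVICE = "http://example.org/sadi/getProteinsByGene"

# GET on the service URL returns its interface description as RDF, declaring
# the OWL classes of the input it consumes and the output it produces.
description = requests.get(SERVICE, headers={"Accept": "application/rdf+xml"})
print(description.text)

# POSTing RDF instance data typed with the input class invokes the service;
# the response is RDF describing the same instances, decorated with the
# service's output properties.
input_rdf = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.org/genes/BRCA1"/>
</rdf:RDF>"""
output = requests.post(SERVICE, data=input_rdf,
                       headers={"Content-Type": "application/rdf+xml",
                                "Accept": "application/rdf+xml"})
print(output.text)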

This is critical advantage #1 for SADI over traditional Web Services frameworks: with traditional XML-based Web Services, you must still code your client software separately for each service, since the service interfaces cannot be interpreted by the machine. In SADI, we can design ONE piece of software to access all resources exposed as SADI services - "one ring to rule them all!" (and we already have several different "rings" that expose SADI data in different ways).

Critical advantage #2 is a bit more obscure and harder to describe, but is likely to be the more important in the long run. In SADI, data is "grounded" in explicit semantics. This means that all data in SADI carries with it information about what TYPE of data it is, and how that data relates to other data (e.g. genes are transcribed into transcripts, which are translated into proteins, which regulate genes: Gene, Transcript, and Protein are all data types, and "transcribed", "translated", and "regulates" are relationships between them). With this explicit (and extensive!) grounding in semantics, we can start asking our machines to do a lot of the interpretation for us. For example, "what gene regulates gene X" is a nonsensical question biologically - genes do not regulate genes directly - but it's a question that biologists ask all the time! With a solid grounding in semantics, the machine would be able to follow the logical pathway above and say "well, to answer that question, I am going to have to go through transcripts and proteins to get there", and then automatically construct the pipeline of services that gets to the answer. This is just one example of how semantics can be used to facilitate question-answering.
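
As a toy illustration of that traversal - in Python with rdflib, using vocabulary URIs and property names invented for this example - a single SPARQL 1.1 property-path query lets the machine walk the transcript/protein chain:

from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/bio#")  # made-up vocabulary for the example
g = Graph()

# Gene A is transcribed into a transcript, which is translated into a
# protein, and that protein regulates gene X.
g.add((EX.geneA, RDF.type, EX.Gene))
g.add((EX.geneA, EX.transcribedInto, EX.transcript1))
g.add((EX.transcript1, RDF.type, EX.Transcript))
g.add((EX.transcript1, EX.translatedInto, EX.protein1))
g.add((EX.protein1, RDF.type, EX.Protein))
g.add((EX.protein1, EX.regulates, EX.geneX))

# "What gene regulates gene X?" has no direct answer in the graph, but the
# property path below follows transcribedInto, then translatedInto, then
# regulates -- exactly the reasoning chain described above.
query = """
PREFIX ex: <http://example.org/bio#>
SELECT ?gene WHERE {
  ?gene ex:transcribedInto/ex:translatedInto/ex:regulates ex:geneX .
}
"""
for row in g.query(query):
    print(row.gene)  # http://example.org/bio#geneA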

There are several tutorials available.

For an overview of what SADI can do: http://www.slideshare.net/markmoby/sadi-swsip-09

Then go to http://sadiframework.org for the more specific tutorials on how to deploy services.

The current list of available services is at http://sadiframework.org/registry/services/, and that list will be growing rapidly over the next year (we have committed to at least 400 more services, but I suspect we'll go far beyond that number!)


Friday, May 7, 2010

Getting SNORQL to work with Virtuoso

SNORQL is an AJAX SPARQL browser that makes it easy to i) see whether your queries work and ii) navigate your linked data. SNORQL comes packaged with D2R Server, but a few modifications are needed to make it work when it is installed in a directory or on a port different from the SPARQL endpoint.

Do the following to make SNORQL work with an endpoint hosted on your server:
1. Download and install SNORQL. SNORQL comes as part of the D2R distribution. Download this and extract the snorql folder from the webapps directory into some folder on your host, preferably one that is already accessible by the web server (e.g. in the htdocs directory). If you want to put SNORQL in a folder other than that, add an entry like this to your httpd.conf file:

Alias /snorql /usr/var/snorql
<Directory "/usr/var/snorql">
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

2. Configure the Apache server as a proxy to the endpoint. If Apache and the endpoint listen on different ports, the browser's same-origin policy will block SNORQL's AJAX requests, so you need to make the two appear to come from the same host and port. Add this to your httpd.conf file (mod_proxy and mod_proxy_http must be enabled):

ProxyRequests Off
<Proxy *>
    Order deny,allow
    Allow from all
</Proxy>
ProxyPass /sparql http://localhost:8890/sparql
ProxyPassReverse /sparql http://localhost:8890/sparql
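
Before configuring SNORQL itself, it is worth checking that the proxy actually forwards requests. A minimal sanity check from Python's standard library (this assumes Apache is listening on port 80; the format parameter is Virtuoso's way of selecting a result serialization):

import urllib.parse
import urllib.request

# Ask the proxied endpoint for a few triples; if mod_proxy is working, this
# behaves exactly like querying http://localhost:8890/sparql directly.
params = urllib.parse.urlencode({
    "query": "SELECT * WHERE { ?s ?p ?o } LIMIT 5",
    "format": "application/sparql-results+json",
})
with urllib.request.urlopen("http://localhost/sparql?" + params) as response:
    print(response.read().decode("utf-8"))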


3. Configure SNORQL to use the endpoint.
Edit the snorql.js file and replace

this._endpoint = document.location.href.match(/^([^?]*)snorql\//)[1] + 'sparql';

with

this._endpoint = 'http://localhost/sparql';


4. Open your browser to the SNORQL URL (e.g. http://localhost/snorql) to query and navigate the results :-)

Tuesday, April 27, 2010

Compute Canada and the future of HPC

Compute Canada is hosting a series of town hall meetings to discuss the future of high performance computing in Canada. Here are some thoughts:

In order to increase Canada’s HPC capability and make it more relevant to today’s scientific computing needs, Compute Canada will have to embrace new computing models.

The next generation in computing is cloud computing. Compute Canada should embrace this model as part of its service offering, so that researchers can use the cloud across Compute Canada infrastructure. Importantly, it must be possible for researchers to grow their cloud from local private clouds (we already have one set up in our lab) into the Compute Canada cloud, and ultimately into commercial clouds (such as Amazon EC2) if necessary. Compute Canada also needs to invest in data storage, and create the means by which such storage may be accessed using data access standards (e.g. the Amazon S3 API) and provisioned through networks (e.g. CANARIE <-> university <-> commodity networks). Compute Canada should endeavor to use open standards and ensure interoperability for any deployment.

A major issue with Canada’s HPC centers is that we cannot currently (AFAIK) host public services on them. As a bioinformatician, it would be invaluable to have Compute Canada host a server using my image, thereby ensuring scalable capacity and continuity for the software tool (as reported in publications). Importantly, some of our services require on-the-fly compute resources to accomplish their task, and it would be ideal if we could set up asynchronous services that use cloud facilities to compute, and then give users a link to where they can find their results (stored on an S3-compatible store). Being able to do this would be a game changer for bioinformatics (and, I suspect, for other computing fields), and would create a new paradigm for open service provisioning across the world.
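
To sketch what the tail end of such an asynchronous service might look like - the bucket, endpoint, and key names here are all invented, and boto3 is used only as a convenient client for the S3 API:

import boto3  # AWS SDK for Python; also talks to S3-compatible stores

# Point the client at a hypothetical S3-compatible store rather than Amazon.
s3 = boto3.client("s3", endpoint_url="https://storage.example.org")

def finish_job(job_id: str, result_bytes: bytes) -> str:
    """Store a finished job's output and return a time-limited link for the user."""
    key = "results/%s.rdf" % job_id
    s3.put_object(Bucket="bioinformatics-results", Key=key, Body=result_bytes)
    # Hand back a presigned URL so the user fetches the result directly
    # from storage instead of from the compute node.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "bioinformatics-results", "Key": key},
        ExpiresIn=7 * 24 * 3600,  # link valid for one week
    )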

While full-fledged cloud computing software is currently available as both open source and commercial solutions, this would present an opportunity to acquire such software 'en masse' to get started, while also creating new capacity for developing cloud computing solutions. I can imagine training my 4th-year project/honours students to develop bioinformatics software using Compute Canada’s compute/storage resources – but not until we see cloud computing.