Scientific Ramblings: bioinformatics

Saturday, September 10, 2011

Bio2RDF: moving forward

Last week we held our first virtual meeting towards reinvigorating the Bio2RDF project with a significantly larger and more invested community. Following those discussions, we plan to establish three focus groups around:

A. policy (information, governance, sustainability, outreach)

B. technical (architecture, infrastructure and RDFization)

C. social (user experience and social networking)

The next step then is for groups to:

1. identify and certify discussion leads (responsibilities: set meeting times and agenda, facilitate and encourage discussion among members, draft reports)

2. identify additional people to recruit from the wider community who would provide additional expertise (interested, but didn't attend the first meeting? sign up now!)

3. extend and prioritize discussion items (what exactly will this group focus its efforts on in the short and long term)

4. identify and assign bite-sized tasks (so we can get things done one step at a time :)

5. collate results and present to the wider community

I suggest that groups self-organize a first meeting in the next two weeks to deal with items 1-4, and either meet again or use the Google documents to collaboratively report findings.

Finally, I'd like for us to hold another meeting at a time that is much more accommodating for Europe + North America ;) Please fill out the Doodle poll (http://www.doodle.com/fsuz6mgs5cztf2e2).

As always, feel free to contact me if you have any questions, and please sign up to the Bio2RDF mailing list for all future discussions.

Monday, July 4, 2011

Sabbatical 2011-2012: Formalizing Scientific Discourse

Objective: The goal of my research program is to enable biologists to compose and evaluate scientific hypotheses using a diverse set of informational sources (ontology, database, text, equations, and web services). The purpose of my 2011-2012 sabbatical is to develop expertise in formalizing scientific discourse, with a particular focus on formalizing textual descriptions and mathematical equations such that they interoperate with knowledge represented in databases and structured documents. In particular, I am interested in using high quality facts derived from text and dynamic computation from formalized equations to answer questions and provide evidence for scientific hypotheses.

Background: Advancing knowledge in the life sciences involves experimentally testing hypotheses and interpreting the results based on prior scientific work. In generating a valid hypothesis, biologists face the overwhelming challenge of collecting, evaluating and integrating large and increasing amounts of different kinds of information about organisms, cells, genes and proteins from thousands of articles, hundreds of databases and dozens of tools. A biologist’s ability to efficiently construct and evaluate a hypothesis over current knowledge requires that i) knowledge, data and hypotheses are formally represented so they may be reasoned about, ii) adequate software exists to manage and query formal knowledge, and iii) data can be obtained by searching relevant databases and invoking the right analytical tools. The inability to efficiently discover relevant information can negatively impact scientific research directions and proposed activities. Methods for facilitating the construction and evaluation of hypotheses against the current state of knowledge could translate into greater scientific insight and increased productivity. Innovative approaches for knowledge discovery could be applied to data on the emerging Semantic Web and be transformative on a global scale.

Proposed Activities
1. Text to triples: The purpose of this leg of the sabbatical is to gain an understanding of the current state of the art in natural language processing and develop skill in producing high quality triples from parsing scientific text.
Time Frame: July 2011-September 2011
Location: European Bioinformatics Institute, Hinxton, UK.
Host: Dr. Rebholz-Schuhmann

2. Formalizing equations: The purpose of this leg of the sabbatical is to investigate the ontology of equations, represent scientific equations using Semantic Web technologies (principally the Rule Interchange Format), and implement semantic web services that serve to compute over formalized scientific equations.
Time Frame: October 2011-December 2011
Location: Universidad de Concepcion, Concepcion, Chile.
Host: Dr. Leo Ferres

3. Formalizing Research Hypotheses: The purpose of this leg of the sabbatical is to explore the formalization of hypotheses concerning disease. Specifically, I will extract meaningful facts from AlzForum and integrate these with resources from the National Center for Biomedical Ontology (NCBO) and Bio2RDF, our large scale Semantic Web project.
Time Frame: January 2012-March 2012
Location: Stanford University, Palo Alto, California, USA.
Host: Dr. Mark Musen

4. Integrated Framework for Knowledge Discovery: The last leg of the sabbatical will focus on maximizing interoperability between text, equations, ontologies and database-derived facts. I will use SADI, our Semantic Web services framework, towards achieving this objective.
Time Frame: April 2012-June 2012
Location: India, Thailand, Singapore

Scientific Value and Broader Beneficial Impacts
The development and application of efficient strategies for knowledge discovery is a major goal in bioinformatics. My research into new strategies for the representation and evaluation of scientific hypotheses using ontologies, scientific text, data and bioinformatic services will create a novel platform that will significantly contribute to scientific productivity and ultimately improve our understanding of biology. The proposed sabbatical will provide me with new skills that will be used to train a future generation of young scientists in the areas of formal knowledge representation, text mining and the Semantic Web. Ultimately, it is expected that the sabbatical will cultivate new partnerships with leading scientists and open new doors to work with industry and government agencies.

Tuesday, April 27, 2010

Compute Canada and the future of HPC computing

Compute Canada is hosting a series of town hall meetings to discuss the future of high performance computing in Canada. Here are some thoughts:

In order to increase Canada’s HPC capability and make it more relevant for today’s scientific computing needs, Compute Canada will have to embrace new computing models.

The next generation in computing is cloud computing. Compute Canada should embrace this model as part of its service offering such that researchers can use the cloud across Compute Canada infrastructure. Importantly, it must be possible for researchers to grow their cloud from local private clouds (we already have one set up in our lab), into the Compute Canada cloud and ultimately into commercial clouds (such as Amazon EC2), if necessary. Compute Canada also needs to invest in data storage, and create the means by which such storage may be accessed using data access standards (e.g. Amazon S3) and provisioned through networks (e.g. CANARIE <-> university <-> commodity networks). Compute Canada should endeavor to use open standards and ensure interoperability for any deployment.

A major issue with Canada’s HPC centers is that we cannot currently (AFAIK) host public services on them. As a bioinformatician, it would be invaluable for me to have Compute Canada host a server using my image, thereby ensuring scalable capacity and continuity for the software tool (as reported in publications). Importantly, some of our services require on-the-fly compute resources to accomplish their task, and it would be ideal if we could set up asynchronous services that use cloud facilities to compute, and then provide a link to users as to where they can find their results (stored on an S3-compatible store). Being able to do this would be a game changer for bioinformatics (and I suspect in other computing fields), and would create a new paradigm for open service provisioning across the world.

Full-fledged cloud computing software is currently available as both open source and commercial solutions. Adopting it would give Compute Canada an opportunity to acquire such software 'en masse' to get started, while also creating new capacity in developing cloud computing solutions. I can imagine training my 4th year project/honours students to develop bioinformatic software using Compute Canada’s compute/storage resources, but not until we see cloud computing.