Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Upcoming Conferences

A Global Community

  • ISCB Student Council

    dedicated to facilitating development for students and young researchers

  • Affiliated Groups

The ISCB Affiliates program is designed to forge links between ISCB and regional non-profit membership groups, centers, institutes and networks that involve researchers from various institutions and/or organizations within a defined geographic region and are involved in the advancement of bioinformatics. Such groups hold regular meetings, either in person or online, and have an organizing body in the form of a board of directors or steering committee. If you are interested in affiliating your regional membership group, center, institute or network with ISCB, please review these guidelines (.pdf) and submit your application using the online ISCB Affiliated Group Application form. Exploratory questions about the appropriateness of a potential future affiliation are also welcome and may be directed to Diane E. Kovats, ISCB Executive Director.

  • Communities of Special Interest

    topically-focused collaborative communities
     

     

  • ISCBconnect

    open dialogue and collaboration to solve problems and identify opportunities

  • ISCB Member Directory

    connect with ISCB worldwide

  • ISCB Innovation Forum

    a unique opportunity for industry

Professional Development, Training and Education

ISCBintel and Achievements

Conference on Semantics in Healthcare and Life Sciences

Organizing Committee

Conference Co-Chairs

  • Jonas S. Almeida, University of Alabama at Birmingham, United States
  • Ted Slater, OpenBEL Consortium, Cambridge, United States

Organizing Committee

  • Jonas S. Almeida, University of Alabama at Birmingham, United States
  • Mike Bevil, Merck & Co., West Point, United States
  • Chris Baker, University of New Brunswick, Saint John, Canada
  • Anita de Waard, Elsevier Labs, Jericho, United States
  • Ted Slater, OpenBEL Consortium, United States
  • Andrea Splendiani, Rothamsted Research, Harpenden, United Kingdom
  • Bryn Williams-Jones, Connected Discovery, United Kingdom

Logistical Organizers

  • Diane E. Kovats, ISCB Executive Director, La Jolla, United States
  • Steven Leard, ISCB Conferences Director, Edmonton, Canada
  • Stacy Slagor, ISCB Director of Corporate Relations and Development, La Jolla, United States



Conference on Semantics in Healthcare and Life Sciences

Sponsor Opportunities

Updated May 29, 2012
Gold: $20,000
  1. Three (3) complimentary conference registrations
  2. Three (3) sponsor/VIP dinner invitations
  3. One (1) Exhibitor Showcase Display space
  4. One (1) 20 minute tech-talk (scheduled by the organizers)
  5. Logo recognition with hyperlink on home page of conference web site
  6. Logo recognition during conference opening session
  7. Logo recognition on conference sponsor signage*
  8. Gold Sponsor recognition with organization logo and
    100 word description in conference program
  9. Full-page black and white advertisement in conference program
  10. Option to provide company brochure/flyer for placement
    in delegate packet

Silver: $15,000
  1. Two (2) complimentary conference registrations
  2. Two (2) sponsor/VIP dinner invitations
  3. One (1) Exhibitor Showcase Display space
  4. One (1) 20 minute tech-talk (scheduled by the organizers)
  5. Logo recognition with hyperlink on sponsors page of conference
    web site
  6. Logo recognition during conference opening session
  7. Logo recognition on conference sponsor signage*
  8. Silver Sponsor recognition with organization logo and
    50 word description in conference program
  9. Half-page black and white advertisement in conference program

Bronze: $10,000
  1. One (1) complimentary conference registration
  2. One (1) sponsor/VIP dinner invitation
  3. One (1) Exhibitor Showcase Display space
  4. One (1) 20 minute tech-talk (scheduled by the organizers)
  5. Company name recognition with hyperlink on sponsors page of conference web site
  6. Logo recognition during conference opening session
  7. Logo recognition on conference sponsor signage*
  8. Bronze Sponsor recognition with organization name and
    company URL in conference program
  9. Quarter-page black and white advertisement in conference program

Reception Sponsor - $5000
  1. One (1) complimentary conference registration
  2. One (1) sponsor/VIP dinner invitation
  3. Company name recognition with hyperlink on sponsors page of conference web site
  4. Logo recognition during conference opening session
  5. Logo recognition on conference sponsor signage*
  6. Sponsor recognition with organization name and company URL in conference program
  7. Quarter-page black and white advertisement in conference program

Keynote Sponsor - $2500
  1. One (1) complimentary conference registration
  2. One (1) sponsor/VIP dinner invitation
  3. Company name recognition with hyperlink on sponsors page of conference web site
  4. Logo recognition during conference opening session
  5. Logo recognition on conference sponsor signage*
  6. Sponsor recognition with organization name and company URL in conference program

Coffee Break Sponsor - $1750
  1. One (1) complimentary conference registration
  2. One (1) sponsor/VIP dinner invitation
  3. Company name recognition with hyperlink on sponsors page of conference web site
  4. Logo recognition during conference opening session
  5. Logo recognition on conference sponsor signage*
  6. Sponsor recognition with organization name and company URL in conference program

Tech Talks - $2000
Tech Talks are opportunities to showcase products and services to conference delegates within a defined conference session track. The cost for a Tech Talk is $2000 and includes one complimentary conference registration for the presenter.

Tech Talks are 20 minutes in length, allowing for a 15-minute presentation and up to 5 minutes for Q&A. These presentations are designed to allow organizations to create awareness of new technologies, services, etc.

Requests for Tech Talks are reviewed for approval by the Organizing Committee. Space is limited and it may not be possible to accept all requests.

Exhibitor Showcase
Not-for-Profit Organization: $1500.00

For-Profit Organization: $2500.00

CSHALS 2013 offers organizations an opportunity to showcase products and services as part of the conference exhibitor showcase. A limited number of spaces are available on a first come, first served basis. The exhibitor showcase includes the following:
  • Conference Registration for one representative
  • Exhibit showcase space with a 6 ft table. Please note the showcase is designed for pop-up displays (approximately 8 ft wide) or table-top exhibits.

Student Travel Fellowship - $1000
CSHALS also hopes to expand participation by students and junior researchers through the creation of a travel fellowship program. If your company is able to sponsor one or more student travel fellowships to CSHALS, please contact ISCB.

Contributions of $1000 and above will be recognized with a hyperlinked company logo on the conference website.

To confirm your CSHALS participation please contact:

Stacy Slagor
Director of Corporate Relations and Development
International Society for Computational Biology
9500 Gilman Drive, MC 0505
La Jolla, CA, USA  92093-0505
Telephone: +1-858-630-5339

For Conference Support or Logistical Questions contact:

Steven Leard
ISCB Conferences Director 
Phone: +1-780-414-1663


* Logo recognition on conference sponsor signage will be proportionately sized for each sponsorship level, with Gold Sponsor logos being the largest and appearing at the top of the signage.


Conference on Semantics in Healthcare and Life Sciences

Conference Sponsors

Updated Feb 19, 2013


GENERAL SPONSORS:



COFFEE BREAK SPONSOR:



Conference on Semantics in Healthcare and Life Sciences

Tutorial (hands-on)

Updated March 06, 2013

Items marked (pdf) or (web site) have presentation slides or other resources available.


*Please note that the tutorial is included in the registration fee. Please confirm during registration that you will attend.

SADI Tutorial System Requirements:

The following MUST be installed PRIOR to the tutorial:

Participants will receive a fully configured virtual machine image loaded with open source software for practical SADI exercises (writing SADI services and running SPARQL queries).

Participants must have 10GB of free space on their hard drive to uncompress the VM image.

Install VMware Player. To test the installation of VMware Player, use this small image (http://www.trendsignals.net/vm/ubuntu1004t/).

SADI Workshop participants who use Apple Computers:

There is no version of VMware Player for OS X. Instead, VMware Fusion is available for the Mac.

The 30-day trial can be downloaded here:

https://my.vmware.com/web/vmware/evalcenter?p=vmware-fusion5

The price of the full version is $49.99.

The tutorial presenters have tested the Ubuntu VM image on a Mac (Mac OS 10.8.2) with a licensed full version of VMware Fusion. All programs worked correctly; the only caveat is that the virtual machine's memory must be set to 1 GB or more in the VM settings.

We strongly recommend downloading and testing your VMware Player/Fusion with the small image (http://www.trendsignals.net/vm/ubuntu1004t/).



Semantic Automated Discovery and Integration (SADI) Web Services Tutorial
Wednesday, February 27
1:00 p.m. - 5:00 p.m.

Mark Wilkinson: Presentation slides (web site)

Artjom Klein:
- Introduction (pdf)
- Writing Service Demo - Part 1 (pdf)
- Writing Service Demo - Part 2 (pdf)
- Semantic Benchmarking Infrastructure for Text Mining (pdf)

The SADI tutorial will begin with an overview of the project, including the history and "philosophy" behind it, and how it fits into the broader Semantic Web and Linked Data ecosystem. We will then discuss what a SADI service "is", and the specific design patterns advocated by SADI. After seeing a few examples of pipelining services together using the SADI SPARQL client ("SHARE"), attendees will be invited to compose their own ad hoc pipelines, based on a simple template we will provide. The hands-on component will then continue, with workshop participants being shown how to extend the template by building services of their own. In particular, participants will write and deploy two additional services that can be fed by data from existing services. Finally, participants will add these new services into the SPARQL queries they wrote earlier.
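
To give a flavor of what the hands-on session builds toward, here is a minimal sketch of the kind of query one might pose through SHARE. The predicates are hypothetical stand-ins invented for illustration; in practice SHARE inspects the triple patterns and invokes whichever SADI services can generate the matching predicates.

  # Hypothetical SHARE query: each triple pattern can be satisfied by a
  # SADI service that generates that predicate on demand.
  PREFIX ex: <http://example.org/sadi-demo/>
  SELECT ?protein ?mass
  WHERE {
    ?protein ex:isEncodedBy <http://example.org/gene/BRCA2> .  # resolved by one service
    ?protein ex:hasMolecularWeight ?mass .                     # resolved by another
  }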

Tutorial Attendee Profile:

The first half of the tutorial will be accessible to newcomers learning the principles underlying the SADI framework. For the hands-on part of the tutorial, attendees are expected to be comfortable working in a Linux environment (editor, terminal), and to have working knowledge of Java, RDF/OWL, SPARQL (highly desired), Web Services (HTTP, Tomcat), and the Protege Ontology Editor. Knowledge of Description Logics is optional.


Biographies of SADI Instructors

Dr. Mark Wilkinson
Polytechnic University of Madrid, and University of British Columbia
www.hli.ubc.ca/research/PIs/Wilkinson.html

Dr Mark Wilkinson is an Isaac Peral Distinguished Researcher at the Center for Plant Biotechnology and Genomics, Polytechnic University of Madrid, and Adjunct Professor of Medical Genetics in the Faculty of Medicine, University of British Columbia. His research over the past decade has focused on the problem of enabling bio/medical researchers to execute their own data manipulation and analysis. Much of this work has involved the application of "semantics" to both scientific data and the analytical services that operate on them. He was founder of the BioMoby Semantic Web Service interoperability framework which, at its peak, included more than 1600 bioinformatics data-sources and tools. More recently he created SADI - a set of design practices that enhance interoperability between semantic Web services - and SHARE, a novel query resolver that combines a workflow engine with a logical reasoner to dynamically generate, and execute, a data retrieval and analysis workflow in response to questions posed by the user. His work on supporting multi-lingual end-user query construction in SHARE won first prize for "Best Demonstration of Semantic Technology" at the Semantic Web Applications and Tools for Life Sciences 2010 meeting in Berlin. He is on the Editorial Board of the Journal of Biomedical Semantics, and is an invited expert for the W3C's Semantic Web in Health Care and Life Sciences Interest Group. Mark has a Ph.D. in Botany from the University of British Columbia.

Artjom Klein
Research Scientist, University of New Brunswick, Canada
www.linkedin.com/in/artjomklein

Artjom Klein is a computational linguist with a Master's degree from the University of Heidelberg, Germany. He has many years of hands-on experience with Semantic Web techniques. Recent projects involved the provision of natural language query interfaces to semantic knowledge bases and the deployment of text mining services in the SADI framework. Artjom was a core developer on the C-BRASS project (Canadian Bioinformatics Resources as Semantic Services), deploying more than 200 SADI services in Dr Chris Baker's lab at UNB, Saint John.


Semantics 101 Tutorial

Wednesday, February 27
3:00 p.m. - 5:00 p.m.

This two-hour tutorial will present a practical introduction to Semantic Web technologies. We will motivate the introduction with use cases and examples, and we will also attempt to dispel some of the myths and hype around the Semantic Web. The tutorial will cover the following topics:

• What is the Semantic Web? What are Semantic Web technologies? What's the difference?

• How does the Semantic Web relate to other technologies such as semantic search or text analytics/NLP?

• How are some life sciences organizations using Semantic Web technologies today?

The tutorial will also introduce the following Semantic Web technologies (a minimal sketch combining several of them follows the list):

  • RDF -- the foundational data model of the Semantic Web
  • SPARQL -- the query language of the Semantic Web
  • RDFS & OWL -- schema and ontology language for describing data and reasoning on the Semantic Web
  • Linked Data -- best practices for publishing data in an easily consumable manner
  • RDFa -- for embedding data in Web pages and XML documents
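
As a minimal sketch of how several of these pieces fit together (the vocabulary below is invented purely for illustration), one can publish a few RDF triples with SPARQL Update and then ask a question of them with a SPARQL query:

  # Load two facts into a store (SPARQL Update request).
  PREFIX ex: <http://example.org/vocab/>
  INSERT DATA {
    ex:aspirin  a ex:Drug ;
                ex:treats ex:Headache .
  }

  # Ask which drugs treat headache (a separate SPARQL query request).
  PREFIX ex: <http://example.org/vocab/>
  SELECT ?drug
  WHERE { ?drug a ex:Drug ; ex:treats ex:Headache . }

RDFS and OWL would then let a reasoner infer additional answers; for example, if the ontology states ex:Drug rdfs:subClassOf ex:Substance, anything declared an ex:Drug is also returned in queries for ex:Substance.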

The tutorial is aimed both at people who are brand new to the Semantic Web and at those who have some experience but are looking for a broader understanding of the state of the Semantic Web in 2013 and how its various pieces fit together. The tutorial draws on highly regarded educational material from Semantic University.

Biography of Semantics 101 Instructor

Lee Feigenbaum (@LeeFeigenbaum) is a co-founder of Cambridge Semantics, where he serves as VP of Marketing and Technology. Lee brings over a decade of experience with Semantic Web technologies to this role, helping to direct the development of the Anzo product suite to solve customers' ever-changing and diverse data challenges. Prior to co-founding Cambridge Semantics, Lee spent over five years as an engineer with IBM's Advanced Internet Technology Group. There, Lee helped architect and develop successive iterations of a semantic application architecture, culminating in the open-source release of the IBM Semantic Layered Research Platform.

Lee is an active member of the W3C Semantic Web standards community, and currently serves as the Co-Chair of the W3C's SPARQL Working Group. Lee authored a 2007 Scientific American article on the Semantic Web, and writes regularly about Semantic Web technologies at his blog, TechnicaLee Speaking. Lee is also a co-creator and editor-in-chief of Semantic University.



Conference on Semantics in Healthcare and Life Sciences

Tech Talks

Updated February 19, 2013

Tech Talks showcase products and services of relevance to the CSHALS audience. Each Tech Talk is 20 minutes in length (a 15-minute presentation and up to 5 minutes for Q&A) and is designed to allow organizations to create awareness of new technologies, services, etc., in an informational presentation format.

For organizations interested in presenting a Tech Talk, please see our Sponsor Opportunities page for further information.


TECH TALK 1
Thursday – February 28, 2013

11:20 am - 11:40 am

Linking Data with Agile Text Mining

Presenter: David Milward, Chief Technology Officer, Linguamatics


Much of the knowledge we have resides in unstructured text. How can we exploit this to make connections and create new knowledge?

For some years, ontology-based text mining has been used to connect information from different documents to generate new hypotheses, for example by finding indirect relationships from a compound to a disease via an interaction with a gene. Different terminologies or ontologies can be exploited to bridge different communities, such as clinical and scientific research. We will discuss similarities with, and differences from, Semantic Web approaches.

Finally, we will show how we can export unstructured data in a structured format, whether RDF or BEL, to integrate unstructured and structured data. We will also discuss the consequences of extracting relationships followed by curation versus direct extraction of hypotheses.
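
As a rough sketch of the indirect relationships described above: once mined relationships are exported as RDF, the compound-gene-disease chain becomes a simple graph query. The predicates here are illustrative placeholders, not Linguamatics' actual output vocabulary:

  PREFIX ex: <http://example.org/textmining/>
  SELECT ?compound ?gene ?disease
  WHERE {
    ?compound ex:interactsWith  ?gene .     # relationship mined from one set of documents
    ?gene     ex:associatedWith ?disease .  # relationship mined from another
  }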

Biography: David Milward is CTO of Linguamatics and has over 20 years' experience of product development, consultancy and research in natural language processing. After receiving a PhD from the University of Cambridge, he was a researcher and lecturer at the University of Edinburgh. He has published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics. He is a pioneer of interactive text mining, and a founder of Linguamatics.


TECH TALK 2
Thursday – February 28, 2013

11:45 am - 12:05 pm

Semantic Indexing of Unstructured Documents Using Taxonomies and Ontologies

Presenter: Jans Aasman, Franz Inc., Oakland, CA, US

Life science companies and healthcare organizations use RDF/SKOS/OWL-based vocabularies, thesauri, taxonomies and ontologies to organize enterprise knowledge. There are many ways to use these technologies, but one that is gaining momentum is to semantically index unstructured documents through ontologies and taxonomies.

In this talk we will demonstrate two projects in which we use a combination of SKOS/OWL-based taxonomies and ontologies, entity extraction, fast text search and an RDF triplestore to create a semantic retrieval engine for unstructured documents.

Biography: Jans Aasman started his career as an experimental and cognitive psychologist, earning his PhD in cognitive science with a detailed model of car driver behavior using Lisp and Soar. He has spent most of his professional life in telecommunications research, specializing in intelligent user interfaces and applied artificial intelligence projects. From 1995 to 2004, he was also a part-time professor in the Industrial Design department of the Technical University of Delft. Jans is currently the CEO of Franz Inc., the leading supplier of commercial, persistent, and scalable RDF database products that provide the storage layer for powerful reasoning and ontology modeling capabilities for Semantic Web applications.



TECH TALK 3
Friday – March 01, 2013
2:20 pm - 2:40 pm

Enabling Drug Discovery Applications Through a Linked Data Platform


Presenter: Alasdair J G Gray, University of Manchester, UK

We present the Open PHACTS linked data platform that is being developed to support a wide range of novel drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies.

The discovery of new medicines requires pharmacologists to interact with a number of data sources, ranging from data on chemical compounds to their interactions with targets. The linked data platform provides an integrated view over data retrieved from several complementary, but overlapping, data sources.

Key features of the Open PHACTS linked data platform are:
1) Domain-specific API making drug discovery linked data available for a diverse range of applications without requiring application developers to become knowledgeable about semantic web standards such as SPARQL;
2) Just-in-time identity resolution and alignment across datasets enabling a variety of entry points to the data and ultimately to support different integrated views of the data;
3) Centrally cached copies of public datasets to support interactive response times for user-facing applications.

The Open PHACTS platform is hosted by OpenLink using the Virtuoso triplestore. This enables us to provide the security and privacy guarantees required by pharmaceutical companies. We have recently begun beta testing of the platform with our associated partners and anticipate a full public roll-out later in 2013.

The utility of the linked data platform is demonstrated by the variety of drug discovery applications being built to access the integrated data.

Biography: Alasdair is a researcher in the myGrid team at the University of Manchester. He is currently working on the Open PHACTS project, which is building an Open Pharmacological Space to integrate drug discovery data. Alasdair gained his PhD from Heriot-Watt University, Edinburgh. He has spent the last 10 years working on novel knowledge management projects investigating issues in relating data sets. www.cs.man.ac.uk/~graya/



TECH TALK 4
Friday – March 01, 2013
2:45 pm – 3:05 pm

Practical Usage of Linked Data and Semantic Annotations by the Enterprise


Presenter: Vassil Momtchev, Group leader, Ontotext, Bulgaria

Linked data and ontology-driven text processing (also known as semantic annotation) are becoming mainstream technologies. Although their benefits are well understood, it is difficult to point to a significant number of established semantic systems used in production in the life sciences and healthcare domain.

In this talk, we will present Ontotext's solution implemented on top of native RDF infrastructure capable of efficiently combining semantic annotations with a large repository of bio-medical linked data. We will demonstrate how to implement semantic document searches that disambiguate concepts according to their context and browse documents using semantic annotations. Both background and extracted information are modelled as RDF and further exposed as linked data that can be indexed via a powerful search interface included as part of the system. The search interface allows indexing of locally processed data and data exposed via remote SPARQL endpoints.

The solution is built on a public linked data service called Linked Life Data, which integrates more than 25 popular life sciences and biomedical data sources in an RDF warehouse of more than 10 billion statements, all accessible via a single SPARQL endpoint and updated regularly.
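
As an illustration of the kind of exploratory access such an endpoint offers (the resource URI below is a placeholder; the actual identifiers and vocabulary depend on the Linked Life Data service), a first query might simply list everything asserted about a single resource:

  # List all statements about one resource, as a starting point for exploration.
  SELECT ?predicate ?object
  WHERE {
    <http://example.org/resource/imatinib> ?predicate ?object .
  }
  LIMIT 25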

Biography: Vassil Momtchev is a board member of Ontotext and a passionate software engineer with over 12 years' experience in the development of large-scale knowledge management solutions for the life sciences, pharmaceutical and biotechnology industries. He joined Ontotext in 2005 and has coordinated several European-funded research projects in the areas of knowledge representation, reasoning and life sciences. He has practical experience in product development, software architectures and research in linked data, RDF, natural language processing and semantic databases.



Conference on Semantics in Healthcare and Life Sciences

Poster Presentations

Updated March 06, 2013

Items marked (pdf) have presentation slides or other resources available.


POSTER DETAILS:

Posters will be on display throughout the conference, beginning 5 p.m. Wednesday, February 27, with a special reception for poster authors to present their work to conference delegates:

  • Poster Set-up: Wednesday – February 27, 4:00 p.m. – 5:00 p.m.
  • Poster Reception: Wednesday – February 27, 5:00 p.m. – 6:30 p.m.

When preparing and designing your poster, please note that it should be no larger than 44 inches wide by 44 inches high (there are two posters per side).

Posters must be removed between 1:30 p.m. and 3:00 p.m. on Friday, March 01.


Poster 01:
Data Modeling and Machine Learning Approaches to Clustering and Identification of Bacterial Strains in Complex Mixtures by Genome Sequence Scanning

Presenter: Deepak Ropireddy
PathoGenetix
Woburn, MA, United States

Abstract: Bacterial contaminants, as a cause of some food-borne illness, pose a major challenge for the food industry in the timely detection and isolation of the source pathogen. Genome sequence scanning (GSS), developed by PathoGenetix, is a single-molecule DNA analysis technique aimed at rapid identification of bacterial strains in complex biological mixtures. The detection and classification of single molecules and the identification of bacterial strains entail computationally intensive steps of data analysis, data modeling, and machine learning using scientific software developed in-house.

Method: Initially, fluorescent traces of individual DNA molecules are acquired from the biological sample and analyzed for signal intensity. The statistics of all measured DNA molecules are evaluated by custom data analysis software. Next, these optical traces are passed to a data modeling and classification tool capable of running in parallel and in a distributed computing architecture. In the classification process, the experimental signals of single molecules are statistically modeled by distributions of photons along the DNA restriction fragment based on Poisson and Gamma distributions. The experimental trace signals are compared to a target database containing averaged template patterns for specific restriction fragments from multiple target organisms. The template patterns are generated either by theoretical calculations, based on known sequences, or experimentally produced by GSS analysis and clustering of molecules from isolates. For each individual trace we compute a set of distances from database targets, where distance is defined as the negative logarithm of the probability that the trace could originate from a specific target fragment. We quantify the confidence in classification of a single trace to a specific target by the log-likelihood value, computed as the difference of two distances: the one between this trace and the target to which it has been classified, and the other between this trace and its next closest target.

Results: This computational modeling methodology yielded robust results for detection and typing of multiple serovars and strains of Escherichia coli and Salmonella in complex biological mixtures. In identifying closely related strains of these species, a hierarchical clustering algorithm (UPGMA: Unweighted Paired Group Method with Arithmetic Mean) is applied to group detected organisms. The grouping is based on preliminary analysis of similarities between the GSS template patterns for different microorganisms. Subsequently, the potentially detected organisms are sorted in decreasing order of the detected fraction of the total length of their fragments. Additionally, the clustering algorithm is applied to generate phylogenetic trees to compare closely related strains of Escherichia coli, Salmonella enterica and other species.

Conclusions: The ability of GSS to model single DNA molecule traces and attribute them to specific organisms, in conjunction with genome-based unsupervised classification using a hierarchical clustering approach, is the basis of a robust technology for the confident detection and identification of pathogenic strains. This modeling platform is used further to generate user-based knowledge of closely related strains through phylogenetic trees and other quantitative measures. We intend to incorporate current semantic and ontological technologies to build a knowledge tool for recent food outbreaks.



Poster 02:
HYDRA: A Commercial Query Engine for SADI Semantic Web Services (pdf)


Presenters: Christopher J. O. Baker, CEO
Alexandre Riazanov, CTO
IPSNP Computing Inc.
Saint John, NB, Canada

Abstract: IPSNP Computing Inc., based in Saint John, Canada, was set up to commercialize prior university-based research on data federation and semantic querying with SADI. The core technology is a high-performance query engine (working title HYDRA) operating on networks of SADI services representing various distributed resources. HYDRA will be packaged and licensed as two products: an intuitive end user-oriented querying and data browsing tool, including a software-as-a-service edition, and an OEM-oriented Java toolkit. IPSNP will target the Bioinformatics and Clinical Intelligence markets and, later, other verticals requiring self-service ad hoc federated querying.


Poster 03:
Automatic Generation of SADI Semantic Web Services from Declarative Descriptions (pdf)


Presenter: Mohammad Sadnan Al Manir
University of New Brunswick
Saint John, Canada

Abstract: Most modern organizations use relational databases to store and retrieve large amounts of data. Tables in these databases are structured and connected by complex schemas; therefore, advanced data access and retrieval requires people with profound knowledge of the SQL query language. But many scenarios require ad hoc querying of relational data by non-technical users who have little knowledge of the database schema or SQL. Semantic querying is proposed as a solution to this problem, based on the automatic application of domain knowledge written in the form of ontological axioms and rules. The axioms are used to map concrete data into virtual models based on RDF[S] and OWL. Queries can then be formulated by end users in the terminology of their domains without any knowledge of either SQL or the underlying table structure. Web services with such semantic querying capability will be very useful in today's Web-based environment, and research work in introducing such Semantic Web services will strengthen the ongoing efforts of the Semantic Web community. Here we propose an architecture which generates SADI-based Semantic Web services automatically for the semantic querying of relational data. Thus, instead of writing SADI Web services manually, which is labor-intensive and error-prone, they are generated automatically from declarative descriptions. Access to databases is implemented by semantic mappings using an expressive Positional-Slotted Object-Applicative (PSOA) Web rule language combining Datalog RuleML and W3C RIF. Such semantic mappings can support end users in any environment requiring semantic querying of large relational databases. The architecture is novel in comparison to currently available approaches, and its implementation can be used to perform knowledge discovery across large volumes of relational data.
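
A minimal sketch of the idea, with an invented vocabulary: the end user (or a tool acting on their behalf) writes a query purely in domain terms, and the generated SADI services, together with the ontological axioms and rules, take care of mapping it onto the underlying relational tables:

  # Hypothetical end-user query in domain terminology; no SQL and no
  # knowledge of the table structure is required.
  PREFIX ex: <http://example.org/hospital/>
  SELECT ?patient ?date
  WHERE {
    ?patient a ex:Patient ;
             ex:hasDiagnosis ex:Diabetes ;
             ex:admittedOn   ?date .
  }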



Poster 04:
A Novel Semantic Approach to Information Flow Modeling in Big Pharma (pdf)


Presenter: Kelly Clark
Merck
Boston, MA, United States

Abstract: The information landscape within "Big Pharma" is a complex array of heterogeneous data, disparate applications and siloed repositories. This creates a myriad of challenges for scientists and knowledge workers at all levels, and does not provide for the optimal use (or reuse) of the data that are generated -- a problem that informaticists within Merck are actively trying to address. Understanding how data, information, and knowledge are leveraged across the R&D landscape is not a trivial task. However, making technology and informatics investment decisions without a meaningful understanding of these domain areas and the actual state of information flow across research activities is risky and costly, and often results in less than optimal IT solutions. Typical analysis efforts to uncover and document information flows through the R&D pipeline are, themselves, complex and inefficient. The information is often represented within "single-use artifacts" (usually Word, Excel, and/or Visio documents), which reflect the analysts' own interpretation of the current state in a variety of ways that are then subject to different interpretations. We have developed a method called Semantic Information Flow Modeling (sIFM) that allows multiple analysts to work individually to unambiguously model, analyze, and communicate information and knowledge flows across Merck. The resulting model is a graph-based knowledgebase that can be rendered in RDF and used to help better inform informatics and technology investment decisions by elucidating targeted opportunities where technology can improve the state of information architecture and search.



Poster 05:
Yes, We Can! Lessons from Using Linked Open Data (LOD) and Public Ontologies to Contextualize and Enrich Experimental Data (pdf)


Presenter: Erich Gombocz
IO Informatics, Inc.
Berkeley, CA, United States

Abstract: Semantic W3C standards provide a framework for the creation of knowledgebases that are extensible, coherent, and interoperable, and on which interactive analytics systems can be developed. A growing number of knowledgebases are being built on these standards -- in particular as Linked Open Data (LOD) resources. The availability of LOD resources has received increasing attention and use in industry and academia. Using LOD resources to provide value to industry is challenging, however, and early expectations have not always been met: issues often arise from the alignment of public and experimental corporate standards, from inconsistent namespace policies, and from the use of internal, non-formal application ontologies. Often the reliability of resources is problematic, from service levels of LOD resources and/or SPARQL endpoints to URI persistence. Furthermore, more and more "open data" are closed for commercial use, and there are serious funding concerns related to government grant-backed resources. With these challenges, can Semantic Web technologies provide value to industry today? We make the case that, yes, this can be done and is being done now. We demonstrate a use case of successful contextualization and enrichment of internal experimental datasets with public resources, thanks to outstanding examples of LOD such as UniProt, DrugBank, Diseasome, SIDER, Reactome, and ChEMBL, as well as ontology collections and annotation services from NCBO's BioPortal. We show how, starting with semantically integrated experimental results from multi-year toxicology studies on different -OMICS platforms, a knowledgebase can be built that integrates and harmonizes such information, and enriches it with public data from UniProt, DrugBank, Diseasome, SIDER, Reactome, and NCBI BioSystems. The resulting knowledgebase facilitates toxicity assessment in drug development at the pre-clinical trial stage. It also provides models for classification of toxicity types (hepatotoxicity, nephrotoxicity, toxicity based on drug residues) and offers better a priori determination of adverse effects of drug combinations. Not only have we been able to correlate responses across unrelated studies with different experimental models, but also to validate system changes associated with known toxicity mechanisms such as oxidative stress, liver function and ketoacidosis. Since observations from multi-modal OMICs experiments can result from the same perturbation but represent very different biological processes, and because pharmacodynamic correlations are not necessarily functionally linked within the biological network and genetic and metabolic changes may occur prior to pathological changes, enrichment with LOD resources led to the discovery of new pharmacodynamically and biologically linked pathway dependencies. As LOD resources mature, more reliable information is becoming publicly available to enrich experimental data with computable descriptions of biological systems in ways never anticipated before, which ultimately helps in understanding the experiments' results. The time and money saved by such an approach has a large socio-economic impact for drug companies and healthcare. As a community, we need to establish business models, through cooperation between industry and academic institutions, that support the maintenance and extension of invaluable public LOD resources. Their effective use in enriching toxicology data exemplifies the success of using Semantic Web technologies to contextualize experimental, internal, external, clinical and public data towards a faster, better understanding of biological systems and more effective outcomes in healthcare.



Poster 06:
DistilBio: Semantic Web and Data Integration Platform for the Life Sciences (pdf)


Presenter: Ramkumar Nandakumar
Metaome Science Informatics (P) Ltd.
Bangalore, India

Abstract: While the number of linked data resources continues to grow, there is a need to build applications that allow users easy access to open-ended exploration across a multi-dimensional data space. This would include intuitive interfaces to build powerful queries that could span several data sources without the end user needing to know SPARQL, the underlying data models or the location of the data.

Method: The platform consists of 3 main subsystems (query engine, web application and autosuggest) and a caching layer. A Service Oriented Architecture has been followed in its design. Services generally interact in a stateless manner using JSON over HTTP. The “query engine” serves as the interface to the Virtuoso database layer for the rest of the system. At its core it is a dynamic, generic query builder for SPARQL. The engine is almost entirely data driven - namely, it obtains most of its information for SPARQL generation directly from the OWL Ontology defined in Virtuoso, and the rest from the contents of the input query. The “web application” features an advanced querying interface utilizing both textual and graphical input elements in a complementary manner. The results are presented as a set of interconnected facets that map intuitively to the user’s query. The autosuggest textbox is the key entry point for the end user and provides instantaneous access to more than 100 million indexed biological terms. Going beyond the plain suggestion of words provided by search engines in general, this interface detects phrases as logical groups and provides context sensitive suggestions. Text by its nature is linear, 1-dimensional. The query canvas lets the user treat the query as a two dimensional graph. This bridges the semantic web’s technical world of graphs and triples with the user’s notion of biological entities and connections.

Results: The interface allows users to retrieve linked data and, most importantly, does so without loss of expressivity and at scale. There is a clear abstraction between the data layer and the query engine, allowing seamless updates to both data and the ontology. The data store currently houses nearly one billion triples across several biologically relevant databases (http://distilbio.com/help#data) and most normal queries return results in real time.

Conclusions: With the DistilBio interface we have successfully allowed users to build fairly expressive queries and browse the results without needing any understanding of the underlying semantic technologies. Additional work in the future would include improved filtering capabilities, extending support for the full range of SPARQL operations and better ways to display provenance. DistilBio is available at www.distilbio.com; for use cases, view the demo videos at http://distilbio.com/demo.



Poster 07:
Semantic Approaches to Clinical Trial Harmonization (pdf)


Presenter: Simon Rakov
AstraZeneca
Waltham, MA, United States

Abstract: Objectives and Motivation: We have combined clinical trial metadata from internal, vendor and public sources, for example ClinicalTrials.gov, FDA, and WHO, to facilitate in-licensing questions. We use semantic technology to assemble data, map it to public ontologies, query, and explore results. This approach facilitates decision-making for drug discovery and drug repositioning. This work describes how we implemented the system, the tools and processes that we used, and lessons learned along the way. This project extends work initiated in 2009, when we created a custom object-relational mapping system termed "Cortex." It was written in Perl and had its own SPARQL-like custom query language. We set out to improve upon Cortex by migrating to semantic technologies. The intent was to reduce maintenance effort and onboarding cost, and to improve interoperability with public -- and other collaborator -- data.

Method: The current methodology starts with XML feed data from the disparate sources needed to answer business questions; it ends with harmonized RDF controlled via standard vocabularies. We transform and enrich the data sources through a series of three named RDF graphs: staging, source, and question-ready RDF. We query the harmonized data using a UI built within the Callimachus RDF templating engine. The user enters terms to complete a question that is transformed into SPARQL. Answers are returned for further analysis using an Exhibit-based JavaScript framework that displays a faceted view.

Results: Business scout teams use this project and the underlying knowledgebase to find clinical trials with interesting drug assets. We find many positive aspects to the semantic technology approach. We can take advantage of the semantic community and standard technologies to reduce custom code and maintenance effort. We turn around complex query requests quickly. We load and transform data faster, de-duplicate data more easily, and query against more data sets, with different shapes and sizes. Named graphs help us partition data; PROV-O makes it easier to implement provenance audit control. SPARQL is straightforward to query with and can transform data that both incorporates vocabularies and provides multiple data shapes corresponding to business needs. On the downside, we struggled with the enterprise-readiness of some of the semantic toolkit. SPARQL has no native way to write functions; certain simple extensions are problematic. Triplestores have no common API. Our initial triplestore has no audit log, so we needed custom provenance to identify named graphs properly. SPARQL wasn't consistently performant. There is a learning curve with these tools, and integrating them into our build systems requires additional effort.

Conclusions: We have discovered benefits and drawbacks, good and bad practices, for working with semantic technologies. In general, from our experience, we recommend the following: 1) use SPARQL for ETL, while recognizing that there are some things it cannot do easily; 2) use named graphs to partition content and transformation; 3) use PROV-O for provenance on named graphs; 4) account for the semantic learning curve; 5) anticipate immaturity in the product set; 6) use URIs to de-duplicate data; 7) take advantage of standard vocabularies to combine your data with other linked data.
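
As a sketch of recommendations 1-3 combined (the graph URIs and timestamp are illustrative, not the system's actual ones), a single SPARQL Update can copy triples from a source graph into a question-ready named graph while recording PROV-O provenance for the derived graph:

  PREFIX prov: <http://www.w3.org/ns/prov#>
  PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
  INSERT {
    # ETL step: copy the triples into the question-ready named graph.
    GRAPH <http://example.org/graphs/question-ready> { ?s ?p ?o }
    # Record where the derived graph came from and when it was built.
    GRAPH <http://example.org/graphs/provenance> {
      <http://example.org/graphs/question-ready>
          prov:wasDerivedFrom <http://example.org/graphs/source> ;
          prov:generatedAtTime "2013-02-19T00:00:00Z"^^xsd:dateTime .
    }
  }
  WHERE {
    GRAPH <http://example.org/graphs/source> { ?s ?p ?o }
  }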



Poster 08:
An Interoperable Suite of BioNLP Web Services based on the SADI Framework (pdf)


Presenter: Syed Ahmad Chan Bukhari
University of New Brunswick
Saint John, Canada

Abstract: The scientific literature is considered to be the most up-to-date source of information for biologists and is a fundamental resource for knowledge discovery, experimental design and systems biology analysis. Biomedical text mining (also known as BioNLP) provides information extraction solutions to facilitate the needs of scientists. In many use case scenarios the integration of not just one but several text mining tools is required, and the outputs of these tools must be consolidated. The outputs might be used for different purposes, including summarization, comparative evaluation of results, and cross-validation against existing data. Consequently, the interoperability of tools and the format of the output are of critical importance. Since almost all BioNLP tools produce XML or tab-delimited output (with different schemas), integration of tools and consolidation of results requires some programming work. We propose a programming-free and installation-free scientific text processing system to annotate and extract biological information from textual data, based on a BioNLP SADI framework. The proposed mechanism directly addresses the centralized weak-binding issues of biological NLP pipelines by introducing an ontology-based and linked-data-aware biological text annotation scheme. We developed BioNLP SADI semantic web services for a mutation finder, a drug extractor and a drug-drug interaction extractor to achieve interoperability among biological data outputs. Around this we created a web-based platform for general biologists and bioinformatics application developers, through which they can get enhanced access to biological information from the literature with minimal effort. Project Page: https://code.google.com/p/bionlp-sadi/



Poster 09:
Active PURLs for Clinical Study Aggregation
(pdf)

Presenters:
David Wood
3 Round Stones Inc.
Fredericksburg, VA, United States

Tom Plasterer
AstraZeneca
Waltham, MA, United States

Abstract: A challenge common to many pharmaceutical companies is to more closely relate the detailed outcomes of clinical trials to their own research. This is made difficult by both the distributed nature of the companies and the distributed nature of clinical trial information. Simply finding information related to clinical studies within a large pharmaceutical company can be challenging. Today's pharmaceutical companies are worldwide, highly distributed organizations. Developers creating an application to support researchers may never meet their stakeholders, nor even necessarily know who they are. This situation mirrors the World Wide Web, where it is not generally possible to determine who the user base is nor which features or offerings they might wish to have. Traditional enterprise software development methodologies do not take into account scenarios where stakeholders are not known in advance. Linked Data techniques can help to address both the availability of clinical trial information and provide a means to build effective information systems using it. Linked Data techniques were developed for the Web and allow for "cooperation without coordination". Publishers of data provide necessary context to allow for use by (possibly unknown) third parties in other portions of a distributed enterprise. Users of Linked Data can combine information from multiple sources and even publish the results of their analyses using the same Linked Data techniques. Subsequent publication can create a virtuous circle of positive feedback, allowing researchers, informaticists and support staff to collaboratively and distributively build a reusable knowledge base. 3 Round Stones and a pharmaceutical company created a system to allow coordinated views of distributed clinical trial information. The system extended the Callimachus Project, an Open Source management system for Linked Data. Persistent URLs, or PURLs, were used to provide globally unique and resolvable identifiers for each clinical study. The PURL concept was extended to enable PURLs to have multiple targets and for the results of each target to undergo arbitrary transformation. PURLs which have such capabilities are called Active PURLs. Information sources relevant to clinical studies were identified, regardless of whether their location was internal or external to the pharmaceutical company's network. Active PURLs were used to resolve data sources having HTTP endpoints capable of returning XML or textual results. Each information source is dynamically transformed into Resource Description Framework (RDF) formats and all sources' results then merged into a single, temporary graph of RDF data. Information is rendered to end users as coordinated HTML descriptions regarding each clinical trial using the Callimachus template engine. Machine-readable versions of the data are also available. The pharmaceutical company has a means to view coordinated clinical trial information across internal and external sources and is moving it toward production use. We showed that a Linked Data approach to distributed information retrieval works for clinical trial information and demonstrated the benefits of cooperation without coordination for typical bioinformatics challenges.


Poster 10:
Validation of a Comprehensive NGS-based Cancer Genomic Assay for Clinical Use


Presenter: Eric Neumann
Foundation Medicine
Cambridge, MA, United States

Abstract: Molecular diagnostics are increasing in importance to clinical oncology, as the number of therapies targeting specific molecular alterations and pathways in cancer grows. This trend has led to a proliferation of single-target biomarker assays, which are constrained by scarce tissue material and restricted in the breadth of genomic alterations assessed. To overcome these limitations, we developed a CLIA certified, pan solid tumor, next-generation sequencing (NGS) based test that enables comprehensive identification of clinically actionable genomic alterations present in routine FFPE specimens. The test uses minimal (≥50ng) DNA to achieve >500X average unique sequence coverage across 3,240 exons and 37 intronic intervals in 189 cancer genes, permitting identification of single-base substitutions, small insertions and deletions (indels), copy number alterations, and selected gene fusions, even when present in a minor fraction of input cells. To support clinical adoption, we conducted a series of experiments to validate test performance for substitution and indel mutations.



Poster 11:
Mapping Scientific Narratives (pdf)


Presenter: Robert Malouf
San Diego State University
CA, United States

Abstract: The scientific literature in any (sub-)domain constitutes a kind of ongoing narrative constructed jointly by a community of researchers. And, as in any community, its members develop a specialized vocabulary among themselves which may be somewhat opaque to outsiders. This goes beyond the use of technical terminology and biomedical jargon (Jordan 2005) -- researchers investigating, say, the use of monoclonal antibodies to treat psoriasis will develop specialized habits of language use, and the narrower the subfield, the subtler the linguistic distinctions. Understanding these differences is vital for accessing the scientific narrative as an outsider via information retrieval or text mining systems, or even to contribute to it as a researcher entering a new subfield. Using the tools of corpus linguistics and computational lexicography, we can analyze large quantities (on the order of hundreds of millions of words) of domain-specific text. One primary tool of corpus linguistics is the concordancer, a system which allows the analyst quick access to individual examples of words in use. This can reveal surprising patterns -- for example, in papers on multiple sclerosis, verbs like "gain" and "increase" occur with undesirable direct objects like "disability" or "disease load", while in the asthma literature "gain" is more likely to occur with "control", a desirable outcome. This can also allow us to extract technical terms (e.g., in the literature on anticoagulants, we often find "burst of thrombin" but almost never "thrombin burst"). Going beyond simple word counts, analysis via pointwise mutual information, an information-theoretic measure of association (Church and Hanks 1990, Evert 2007), finds collocation patterns which occur with greater than expected frequency given the frequencies of the individual words. When combined with deep syntactic analysis, these measures of association facilitate automatic extraction of a domain-specific thesaurus (Lin 1998, Curran and Moens 2002). The synonym sets provide a high-level overview of the way that language is being used in a narrowly focused corpus, which in turn can help the analyst find differences in word usage between that domain and biomedical literature in general. For example, in literature on multiple sclerosis, a close synonym of "damage" is "demyelination", while in a corpus of papers on diabetes a synonym of "damage" is "neuropathy". Finally, broader semantic patterns of word meanings and language use can be found using vector space analysis and non-negative matrix factorization (Deerwester et al. 1990, Pauca 2004, Turney 2010, Utsumi 2010). This technique maps words and texts into a kind of semantic space. The distance between two words in this semantic space is a measure of the similarity of the contexts in which the two words tend to occur, and the structure of the semantic space provides a basis for comparing the development of word meanings across domains and across time. In this poster presentation, we will describe the use of this computational toolkit in more detail and present real-world case studies of its application to problems in accessing scientific narratives.



Poster 12:
A Patient Centered Infrastructure (pdf)


Presenter: Christian Seebode
ORTEC medical
Berlin, Germany

Abstract: In order to give patients the possibility to participate more in healthcare delivery and to take responsibility for their actions, patients need access to information, communication and educational services. We present a Patient Centered Infrastructure which supports a patient centered process: an information cycle in which patients improve their health literacy iteratively. Patients have to become consumers, mostly consumers of health information but also of health services. At the same time the patient participates in healthcare delivery and performs actions that correspond to his level of health literacy. While participating and learning, patients also become a source of health information. Patients and their knowledge are the most undervalued resource in healthcare delivery. This means that patients have to adopt a new role model too. Patients learn to demand and consume health information and improve their situation to achieve better outcomes. The main condition for active participation is open and transparent communication. The Patient Centered Infrastructure supports information, communication and education in order to improve the level of health literacy in patients, such that they may understand and consciously decide what is best for them. The Patient Centered Infrastructure is a collection of federated services which resides on top of a common information model for patients and supports the patient centered process. The patient centered process represents a cycle of steps that patients perform in order to participate in healthcare delivery and to take control of their personal health situation, and consists of the following steps:

  • Store (Health Record Management): Patients assemble a personal profile which contains a personal health record and a knowledge base. This may contain input from other systems, from the patient himself, or a result from a previous cycle.
  • Retrieve (Information Retrieval): Patients seek information based on the information and knowledge they possess. Sources of information may be the Web, case databases, literature, or the information from EPRs or HIS. Even associated clinical text may be a source of information by means of linguistic analysis and text mining.
  • Gain Insight (Knowledge Management): Knowledge management is supported by formal representations of knowledge, e.g. ontologies. Ontologies represent domain knowledge and are defined by experts or by patients who contribute their own knowledge base and align it.
  • Learn (Education): Patients learn and build knowledge from the information they retrieved. Patients are educated by learning from information or from others. Patients are able to share this knowledge with others because of the common information model. Learning curves are represented by comparing different versions of the knowledge base.
  • Act (Medical Services): Patients use medical services and participate according to their level of health literacy. Medical services are offered offline or online. Information plays a vital role for medical services too. Patients get decision support by consuming second-opinion services or from the specific patient community that is associated with a service. Patients may document their personal level of health literacy by giving fine-grained access to their personal knowledge base in order to receive personalized support.

In cooperation with DFKI - Deutsches Forschungszentrum für Künstliche Intelligenz.

[TOP]


Poster 13:
The ISA Infrastructure for the Biosciences: from Data Curation at Source to the Linked Data Cloud (pdf)


Alejandra Gonzalez-Beltran
Oxford e-Research Centre, University of Oxford
Oxfordshire, United Kingdom

Abstract: Experimental metadata is crucial for the ability to share, compare, reproduce, and reuse data produced by biological experiments. The ISAtab format -- a tabular format based on the concepts of Investigation/Study/Assay (ISA) -- was designed to support the annotation and management of experimental data at source, with a focus on multi-omics experiments. The format is accompanied by a set of open-source tools that facilitate compliance with existing checklists and ontologies, production of ISAtab metadata, validation, conversion to other formats, and submission to public repositories, among other things. The ISAtab format, together with the tools, allows for the syntactic interoperability of the data and supports the ISA commons, a growing community of international users and public or internal resources powered by one or more components of the ISA metadata tracking framework. The underlying semantics of the ISAtab format is currently left to the interpretation of biologists and/or curators. While this interpretation is assisted by the ontology-based annotations that can be included in ISAtab files, it is currently not possible to have this information processed by machines, as in the semantic web/linked data approach. In this presentation, we will introduce our ongoing isa2owl effort to transform ISAtab files into an RDF/OWL-based (Resource Description Framework/Web Ontology Language) representation, supporting semantic interoperability between ISAtab datasets. By using a semantic framework, we aim to: 1. make the ISAtab semantics explicit and machine-processable; 2. exploit the existing ontology-based annotations; 3. augment annotations over the native ISA syntax constructs with new elements anchored in a semantic model extending the Ontology for Biomedical Investigations (OBI); 4. facilitate the understanding and semantic querying of the experimental design; 5. facilitate data integration, knowledge discovery and reasoning over ISAtab metadata and associated data. The software architecture of the isa2owl component is engineered to support multiple mappings between the ISA syntax and semantic models. Given a specific mapping, a converter takes ISAtab datasets and produces OWL ontologies, whose TBoxes are given by the mapping and whose ABoxes are the ISAtab elements or elements derived from them. These derived elements result from the analysis of the experimental workflow, as represented in the ISAtab format and the associated graph representation. The implementation relies on the OWLAPI. As a proof of concept, we have performed a mapping between the ISA syntax and a set of interoperable ontologies anchored in the Basic Formal Ontology (BFO) version 1. These ontologies are part of the Open Biological and Biomedical Ontologies (OBO) Foundry and include OBI, the Information Artifact Ontology (IAO) and the Relations Ontology (RO). We will show how this isa2owl transformation allows users to perform richer queries over the experimental data, to link to external resources available in the linked data cloud, and to support knowledge discovery.
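
To make the TBox/ABox split concrete, a minimal sketch (ours, not the isa2owl code) using Python's rdflib follows; the namespace, class and property names are hypothetical stand-ins for the OBI-anchored model.

    from rdflib import Graph, Namespace, Literal, RDF, RDFS

    EX = Namespace("http://example.org/isa2owl/")   # hypothetical namespace
    g = Graph()

    # TBox: declare classes for mapped ISA concepts (a real mapping would
    # anchor these in OBI/IAO rather than a local namespace).
    for cls in ("Study", "Assay", "Sample"):
        g.add((EX[cls], RDF.type, RDFS.Class))

    # ABox: one ISAtab-like row, rendered as linked individuals.
    row = {"study": "BII-S-1", "assay": "transcription profiling", "sample": "S1"}
    study, assay, sample = EX[row["study"]], EX["assay1"], EX[row["sample"]]
    g.add((study, RDF.type, EX.Study))
    g.add((assay, RDF.type, EX.Assay))
    g.add((sample, RDF.type, EX.Sample))
    g.add((study, EX.hasAssay, assay))       # hypothetical property
    g.add((assay, EX.hasInput, sample))      # hypothetical property
    g.add((assay, RDFS.label, Literal(row["assay"])))

    print(g.serialize(format="turtle"))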

[TOP]


Poster 14:
Bio2RDF Linked Data Experience, the Lessons Learned Since 2006 (pdf)


Presenter: François Belleau
Centre de recherche du CHUQ, Laval University
Québec, Canada

Abstract: Since 2006 the Bio2RDF project (http://bio2rdf.org), hosted jointly by the Centre de recherche du CHUQ of Laval University and Carleton University, has aimed to transform silos of life science data made publicly available by data providers like KEGG, UniProt, NCBI, EBI and many more into a globally distributed network of linked data for biological knowledge discovery, available to the life science community. Using semantic data integration techniques and the Virtuoso triplestore software, Bio2RDF seamlessly integrates diverse biological data and enables powerful SPARQL services across its globally distributed knowledge bases. Online since 2006, this very early Linked Data project has evolved and inspired many others. This talk will recall Bio2RDF's main design steps over time but, more importantly, will explain the design decisions that, we think, made it successful. Having been part of the Linked Data space since the beginning, we have been in a good position to observe the evolution and adoption of semantic web technologies by the life science community. Now, with so many mature projects online, we propose to look back, so we can share our experiences building Bio2RDF with the CSHALS community. The methods used to produce, publish and consume Bio2RDF linked data will be presented. The pipeline used to transform public data to RDF and its design principles will be explained, as will the way the Openlink Virtuoso server, an open source project, is configured and used. We will also propose guidelines for publishing RDF within the bioinformatics Linked Data cloud. Finally, we will show different ways to consume this data using URIs, SPARQL queries and semantic web software such as RelFinder and the Virtuoso Facet browser. Bio2RDF remains the most diverse and integrated linked data space available, but we now observe an important shift, with data providers starting to expose their own datasets as RDF or, even better, as SPARQL endpoints. Looking back at the evolution of Bio2RDF, we will share the lessons we have learned about semantic web projects: which ideas were good, and which were not. To conclude, we will present our vision, still to be fulfilled, of what Linked Data could become in the near future, so that the data integration problem, so pervasive in the life sciences, benefits from mature Semantic Web technologies and helps researchers in their daily discovery work.
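
As an illustration of the consumption side, here is a minimal sketch (ours) of querying a Bio2RDF SPARQL endpoint with Python's SPARQLWrapper; the endpoint URL and example resource follow the public Bio2RDF conventions, but their current availability is an assumption.

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://bio2rdf.org/sparql")  # assumed public endpoint
    endpoint.setQuery("""
        SELECT ?p ?o WHERE {
            <http://bio2rdf.org/geneid:4157> ?p ?o .   # example Bio2RDF IRI
        } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])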

[TOP]


Poster 15:
Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data


Presenter: Michel Dumontier
Carleton University
Ontario, Canada

Abstract: Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences. Here, we present the second release of the Bio2RDF project, which features up-to-date open-source scripts, IRI normalization through a common dataset registry, dataset provenance, data metrics, public SPARQL endpoints, and compressed RDF files and full-text-indexed Virtuoso triple stores for download. Methods: Bio2RDF defines a set of simple conventions to create RDF(S)-compatible Linked Data from a diverse set of heterogeneously formatted sources obtained from multiple data providers. We have consolidated and updated the Bio2RDF scripts into a single GitHub repository (http://github.com/bio2rdf/bio2rdf-scripts), which facilitates collaborative development through issue tracking, forking and pull requests. The scripts are released under an MIT license, making them available for any use (including commercial), modification or redistribution. Provenance regarding when and how the data were generated is provided using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Additional scripts were developed to compute dataset-dataset links and summaries of dataset composition and connectivity. Results: Nineteen datasets, including 5 new datasets and 3 aggregate datasets, are now being offered as part of Bio2RDF Release 2. Use of a common registry ensures that all Bio2RDF datasets adhere to strict syntactic IRI patterns, thereby increasing the quality of generated links over previously suggested patterns. Quantitative metrics are now computed for each dataset, ranging from elementary information such as the number of triples to a more sophisticated graph of the relations between types. While these metrics provide an important overview of dataset contents, they are also used to assist in SPARQL query formulation and to monitor changes to datasets over time. Pre-computation of these summaries frees up computational resources for more interesting scientific queries and also enables tracking of dataset changes over time, which will help in making projections about hardware and software requirements. We demonstrate how multiple open source tools can be used to visualize and explore Bio2RDF data, as well as how dataset metrics may be used to assist querying. Conclusions: Bio2RDF Release 2 marks an important milestone for this open source project, in that it was fully transferred to a new team and development paradigm. Adoption of GitHub as a code development platform makes it easier for new parties to contribute and get feedback on RDF converters, and makes it possible for new converters to be added automatically to the growing Bio2RDF network. Over the next year we hope to offer bi-annual releases that adhere to formalized development and release protocols.
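
The registry-driven IRI normalization can be pictured with a minimal sketch (ours, with a toy registry rather than the actual Bio2RDF registry): every item receives a syntactically uniform IRI of the form http://bio2rdf.org/namespace:identifier.

    # A toy registry mapping the aliases a source might use to one preferred prefix.
    PREFERRED_PREFIX = {"omim": "omim", "OMIM": "omim", "MIM": "omim"}

    def normalize_iri(namespace, identifier):
        """Return a Bio2RDF-style IRI: http://bio2rdf.org/<namespace>:<identifier>."""
        ns = PREFERRED_PREFIX.get(namespace)
        if ns is None:
            raise ValueError("namespace %r is not in the registry" % namespace)
        return "http://bio2rdf.org/%s:%s" % (ns, identifier)

    assert normalize_iri("MIM", "104300") == "http://bio2rdf.org/omim:104300"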

[TOP]


Poster 16:
Towards a Seizures and Epilepsies Ontology (pdf)


Presenter: Robert Yao
Arizona State University/Mayo
Scottsdale, United States

Abstract: Introduction: In 2012, the Institute of Medicine recognized a significant problem in epilepsy knowledge, care, and education in its report, “Epilepsy Across the Spectrum.” Physicians often disagree and are inconsistent when making an epilepsy diagnosis (1) because various limitations in their understanding of seizures and epilepsy have contributed to the lack of a clear diagnostic method. Such limitations include structural flaws (or lack of relationships) in the current knowledge representation model, terms that are ambiguous and inconsistently used, too much dependence on expert opinion, and too little on evidence (2-6). Historically, most epilepsies were identified by grouping seizure or epilepsy names and then sorting them based on what was perceived as the defining characteristic. If the wrong defining characteristic was chosen, epilepsies were often misclassified, and thus misdiagnosed. When certain observable symptoms or properties were unknown or missing, it was not possible to identify the seizure or epilepsy. As a first step towards diagnostic clarity, the IOM has recommended the validation and implementation of standard definitions and criteria for epilepsy case ascertainment. The epilepsy domain has been calling for a new evidence-based epilepsy model that incorporates the latest knowledge to classify and relate seizure types and to define specific epilepsy syndromes (5,7-9). In response to the need for better diagnosis and to the call for a new evidence-based epilepsy model, we propose an ontology-based knowledge representation to aid in improving the diagnosis and management of epilepsy. Methods: Design and build an ontologic knowledge representation of the epilepsy domain: 1) define the ontology domain and scope; 2) review existing ontologies; 3) select an upper ontology; 4) create classes, properties, and relationships; 5) create a conceptual model (using concept maps); 6) create a scoring heuristic to suggest a differential diagnosis of seizure types and epilepsy syndromes. Results: To address ambiguous and inconsistently used terms, a concept-based approach was taken. Relationships between concepts were then defined, creating an ontology for seizure types and epilepsy syndromes. Applying this concept-based process resulted in a tree that models all possible epilepsies in one concept map (not shown). Figure 1 depicts the seizure aura sub-branch of the overall Epilepsies Ontology. Discussion: The ontology created is an important first step in providing standard definitions for seizure types and specific epilepsy syndromes. We are currently working on a peer-reviewed, evidence-based validation of the ontology. Furthermore, a reasoning heuristic based on the ontology is being developed to evaluate the implementation of diagnostic criteria for seizure type and epilepsy syndrome.
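
A scoring heuristic of the kind named in methods step 6 might look like the following minimal sketch (ours, not the authors' heuristic); syndrome names, criteria and weights are purely illustrative.

    SYNDROME_CRITERIA = {
        "Syndrome A": {"visual aura": 2.0, "focal onset": 1.0},
        "Syndrome B": {"myoclonic jerks": 2.0, "morning onset": 1.0},
    }

    def differential(observed):
        # Rank candidate syndromes by the total weight of matched criteria.
        scores = {
            name: sum(w for criterion, w in criteria.items() if criterion in observed)
            for name, criteria in SYNDROME_CRITERIA.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(differential({"visual aura", "focal onset"}))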

[TOP]


Poster 17:
Biotea


Presenter: Alexander Garcia
Florida State University
Tallahassee, United States

Abstract: In this poster, we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We aim at delivering interoperable, interlinked, and self-describing documents in the biomedical domain. We applied our approach to the full-text, open-access subset of PubMed Central.

Methods: We use BIBO, DCMI Terms, and the Provenance Ontology to model the bibliographic metadata. BIBO provides classes and properties to represent citations and bibliographic references; it is used to model documents and citations in RDF or to classify documents within a hierarchy. Dublin Core offers a domain-independent vocabulary to represent metadata; this vocabulary aims to facilitate cross-resource exploration. In order to identify biological terms, we use two text-mining tools: Whatizit and the NCBO Annotator. Both tools are based on dictionaries and string matching. By doing so, relevant biological identifiers such as UniProt accessions as well as ChEBI and GO identifiers are added. We are working with more than 20 biomedical ontologies. The main input to our process is the XML offered by PMC for open-access articles. We use JAXB to programmatically process this XML and RDFReactor to map the ontologies to Java classes. The output of the process comprises three RDF files: the article itself as well as the annotations from the NCBO Annotator and Whatizit. The article is modeled as a bibo:Document; whenever possible, a more specific class is also added, e.g., bibo:AcademicArticle for research articles. Publisher metadata is also modeled with BIBO, including the publisher name, the International Standard Serial Number, volume, issue, and starting and ending pages. Authors are modeled as foaf:Person and grouped in a bibo:authorList. The abstract and sections are modeled as doco:Section with a cnt:chars containing the actual text with formatting omitted. Well-known identifiers such as PubMed and DOI are included in the output; thus, it is possible to track the original source of the article. The same principle is also applied to the references. References are modeled as bibo:Document; the relations used are bibo:cites and bibo:citedBy, and references are available at both the document and the section level. For incomplete references, e.g., "Allen, F. H. (2002). ActaCryst. B58, 380-388" in PMC:2971765, it is possible to use services such as Mendeley, CrossRef, and eFetch to complete the information so that the title and identifiers can be added.
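
A minimal sketch (ours, not the Biotea pipeline) of the kind of RDF this modeling yields, using Python's rdflib; the IRIs are illustrative, and bibo:pmid is used on the assumption that a PubMed identifier property fits the model.

    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import RDF, FOAF, DCTERMS

    BIBO = Namespace("http://purl.org/ontology/bibo/")
    g = Graph()

    article = URIRef("http://example.org/article/PMC2971765")  # illustrative IRI
    g.add((article, RDF.type, BIBO.AcademicArticle))
    g.add((article, DCTERMS.title, Literal("An example open-access article")))
    g.add((article, BIBO.pmid, Literal("12345678")))  # assumed PubMed id property

    author = URIRef("http://example.org/person/doe-j")
    g.add((author, RDF.type, FOAF.Person))
    g.add((author, FOAF.name, Literal("J. Doe")))
    g.add((article, DCTERMS.creator, author))

    reference = URIRef("http://example.org/ref/1")
    g.add((reference, RDF.type, BIBO.Document))
    g.add((article, BIBO.cites, reference))   # document-level citation link

    print(g.serialize(format="turtle"))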

Results: We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/.

Conclusions: The semantic processing of biomedical literature presented in this paper embeds documents within the Web of Data and facilitates the execution of concept-based queries against the entire digital library. Our approach delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of biomedical documents. Our model delivers a semantically rich and highly interconnected dataset with self-describing content so that software can make effective use of it.

[TOP]


Poster 18:
A Semantic Portal for Treatment Response Analysis in Major Depressive Disorder

Presenters: Joanne S. Luciano, Brendan Ashby, Yuezhang Xiao
Rensselaer Polytechnic Institute
Troy, NY, United States

Abstract: The World Health Organization (WHO) reports that Major Depressive Disorder (MDD) affects more than 350 million people and is a significant contributor to the global burden of disease. It is the leading cause of disability in the U.S. for ages 15-44. This poster will present a semantically enabled web portal that supports treatment response pattern analysis of data from clinical depression studies conducted at major depression research facilities. Using the Luciano Model, the resulting response pattern visualizations provide the patient and clinician with detailed information about the expected response to a treatment, thus supporting clinical decision making and increasing patient engagement. Currently, treatment selection remains trial and error, and patient engagement for any pharmaceutical is nonexistent. Further, the individual patient's response pattern can be monitored more closely by both patient and clinician, enabling earlier intervention when the patient's response differs from what is expected for that treatment. The aim of this work is to improve treatment selection and to provide information that enables earlier intervention when necessary, in order to prevent unnecessary suffering, suicide, and costs.

[TOP]


Poster 19:
Primary Immunodeficiency Disease (PID) PhenomeR - An Integrated Web-based Ontology Resource Towards the Establishment of a PID E-clinical Decision Support System

Presenter: Sujatha Mohan, Ph.D.
Research Center for Allergy and Immunology (RCAI)
The Institute of Physical and Chemical Research (RIKEN)
Yokohama city, Kanagawa, Japan

Abstract: Primary immunodeficiency diseases (PIDs) are genetic disorders causing abnormalities in the development, maintenance and functioning of the immune system, manifested by increased susceptibility to infections and autoimmune disorders. To date, more than 250 PIDs have been reported, most of which are rare and infrequently encountered. The patients diagnosed with a given PID condition are often scattered all over the world, and knowledge about these diseases is hindered by the lack of a unified representation of PID information, especially one linking genotype and phenotype data, which requires regular concerted efforts and community participation. Earlier, we developed an open-access integrated molecular database of PIDs named "Resource of Asian Primary Immunodeficiency Diseases - RAPID" (http://rapid.rcai.riken.jp/RAPID); at present it comprises a total of 263 PIDs and 242 genes, of which 232 genes are reported with over 5039 unique disease-causing mutations obtained from over 1823 PubMed citations. We hereby introduce a newly developed PID ontology browser for the systematic integration and analysis of PID phenotypes with the genotype data from RAPID. Towards this end, we have developed a user-friendly interface named "PID PhenomeR", which serves as a standardized phenotype ontology resource presenting the ontology class structures and entities of all phenotypic terms observed in PID patients from RAPID, in standardized file formats - Web Ontology Language (OWL) and Resource Description Framework (RDF) - using semantic web technology. PID PhenomeR consists of 1466 standardized PID terms, classified under 24 semantic types and 29 categories as of December 2012. The standardization of PID phenotype terms for the addition of new terms is in progress, using a semi-automated process that includes a logic-based assessment method. In essence, PID PhenomeR serves as an active integrated platform for PID phenotype data, wherein the generated semantic framework is implemented in an integrated knowledge-base query interface, i.e. a SPARQL Protocol and RDF Query Language (SPARQL) endpoint, for establishing a well-informed PID e-clinical decision support system.

Database URL: http://rapid.rcai.riken.jp/ontology/v1.0/phenomer.php

Keywords: Semantic web technology, Ontology, Genotype, Phenotype, Mutation, SPARQL

[TOP]


Poster 20:
Connecting Linked Data (pdf)

Presenter: Nadia Anwar
General Bioinformatics
Reading, United Kingdom

Abstract: Initially, data in linked data clouds were brought together under the maxim "Messy is Good". The idea was that if you put your data in RDF, it can be used, re-purposed and integrated. This maxim was used to encourage people to expose their data as RDF. Now that we are all convinced, and there is a lot of RDF available to us, some of it messy, it is time to tidy up the mess. We aim to show why "messy" is now problematic and to describe how a little house-keeping makes the linked data we have better. In our experience, messy or “good enough” was a workable starting point; however, we now find that many practical uses of the data originally exposed as RDF are very difficult. Even the simplest of SPARQL queries on public resources can be very unintuitive. To demonstrate our point, we show through a small and a large example how adding just a few inferences makes public RDF data easier to query with SPARQL. The first, small, example uses two RDF datasets for the model organism Drosophila melanogaster. FlyAtlas is a tissue-specific expression database produced by Julian Dow at the University of Glasgow. The expression profiles were designed to reveal the differences in expression between very different tissues, for example from the brain to the hindgut. This is an incredible resource, available alongside other fly resources in RDF at openflydata.org. Gene set enrichment is now a standard tool used to understand such expression data; an analogous query is “are there any tissue-specific pathways in the hindgut?”. In theory, this question can be answered with a SPARQL query connecting FlyAtlas to FlyCyc, a database of fly pathways; however, this is not easy. The two graphs, FlyAtlas and FlyCyc, are actually quite difficult to traverse. However, with the addition of some triples through very simple inferences, utilizing, for example, transitive properties, class subsumption and simple CONSTRUCT queries, the graphs become much easier to query. In the larger example, we have a pharma client with a large set of linked data, mainly from the public domain. In this linked data cloud there are some 30 databases using about 6 different RDFS/OWL schemas. The concepts are diverse, from Proteins to Pathways to Clinical Trials. Practically, the queries through this data are complex, and as the cloud of data has grown, queries have become more and more cumbersome. In this larger data set, we present some of the typical queries performed over this data and show just how difficult some of these apparently simple queries can be. We describe some actions that, with the addition of some semantics, unify concepts across the multiple schemas in the linked data. We show some of the gains in query understandability and performance achieved through the addition of these ontology statements.
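
The house-keeping described, materializing a few inferred triples so that later queries become simple, can be sketched with Python's rdflib (our illustration, not the authors' data):

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix ex: <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:Hindgut rdfs:subClassOf ex:Tissue .
    ex:sample1 a ex:Hindgut .
    """, format="turtle")

    # Materialize subclass-based type inferences with a CONSTRUCT query.
    inferred = g.query("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        CONSTRUCT { ?x a ?super }
        WHERE { ?x a ?sub . ?sub rdfs:subClassOf+ ?super }
    """)
    for triple in inferred:
        g.add(triple)

    # After materialization, a query for ex:Tissue instances also finds sample1.
    for (x,) in g.query("SELECT ?x WHERE { ?x a <http://example.org/Tissue> }"):
        print(x)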

[TOP]


Poster 21:
A Clinical Information Management Platform Using Semantic Technologies

Presenters:
Mathias Ort
Christian Seebode
ORTEC Medical
Berlin, Germany

Abstract: Medical procedures generate a vast amount of data from various sources. Efficient and comprehensive integration and exploitation of these data will be one of the success factors for improving health care delivery to the individual patient while at the same time making health care services more cost-effective. In order to support effective mining, selection and presentation of medical data for clinical or patient-centered use cases, both text data and structured clinical data from Health Information Systems (HIS) have to be enriched with semantic meta-information and have to be available at any point along the data value chain. We present a platform which combines an approach to semantic extraction of medical information from clinical free-text documents with the processing of structured information from HIS records. The information extraction uses a fine-grained linguistic analysis and maps the preprocessed terms to the concepts of domain-specific ontologies. These domain ontologies comprise knowledge from various sources, including expert knowledge and knowledge from public medical ontologies and taxonomies. The processes of ontology engineering and rule generation are supported by a semantic workbench that enables an interactive identification of those linguistic terms in clinical texts that denote relevant concepts. This supports incremental refinement of the semantic information extraction. Facts extracted from both clinical free texts and structured sources represent chunks of knowledge. They are stored in a Clinical Data Repository (CDR) using a common document-oriented storage model, which takes advantage of an application-agnostic format in order to support different use cases. It furthermore supports version control of facts, reflecting the evolution of information. Enrichment algorithms aggregate further information by generating statistical information, search indexes, or decision recommendations. The CDR generally separates processes of information generation from processes of information processing or consumption, and thus supports smart partitioning of data for scalable application architectures. The applications hosted on the platform retrieve facts from the CDR by subscribing to the event stream it provides. The first applications implemented on top of the platform support specific scenarios of clinical research, like recruiting patients for clinical trials, answering feasibility studies, or aggregating data for epidemiological studies. Further applications address patient-centered use cases like second opinions or dialogue support. The web-based application StudyMatcher maps study criteria to a list of cases and their medical facts. Trial teams may define study criteria in interaction with the knowledge resources, and the application automatically generates a list of candidate cases. Since the user interface links the facts extracted by the system to the original sources (e.g. the clinical documentation), users are able to check with low effort whether or not a fact has been recognized correctly by the system and matched correctly with the given criteria. This strategy of combining automatic and supervised fact generation promises to be a reasonable approach to improving the semantic exploitation of data. The platform and applications are developed in cooperation with Europe's leading healthcare providers Charité and Vivantes and will be rolled out in January 2013.
In cooperation with DFKI - Deutsches Forschungszentrum für Künstliche Intelligenz.
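
The decoupling of fact generation from fact consumption via the CDR event stream can be pictured with a minimal sketch (ours; the class, topic and fact names are hypothetical):

    from collections import defaultdict

    class ClinicalDataRepository:
        def __init__(self):
            self.facts = []
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            # Applications register interest in a topic of the event stream.
            self.subscribers[topic].append(callback)

        def store(self, topic, fact):
            # Storing a fact pushes an event to every subscribed consumer.
            self.facts.append((topic, fact))
            for callback in self.subscribers[topic]:
                callback(fact)

    cdr = ClinicalDataRepository()
    cdr.subscribe("diagnosis", lambda f: print("StudyMatcher sees:", f))
    cdr.store("diagnosis", {"patient": "p1", "code": "ICD-10 E11"})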

[TOP]


Poster 22:
Data Modeling and Machine Learning Approaches to Clustering and Identification of Bacterial Strains in Complex Mixtures by Genome Sequence Scanning (pdf)

Presenters:
Deepak Ropireddy
Gene Malkin
Mikhail M Safranovitch
Douglas B. Cameron
John Harris

Abstract: Genome Sequence Scanning (GSS), developed by PathoGenetix, is a single-molecule DNA analysis technique aimed at the rapid identification of bacterial contaminants in complex biological mixtures and food samples. The detection and classification of single molecules and the identification of bacterial strains entail computationally intensive data analysis, modeling, and machine learning steps applied to fluorescent traces of individual DNA molecules acquired from biological samples.

The basic methodology begins by obtaining fluorescent traces of individual DNA molecules from the biological sample; the signal intensity and statistics of all measured DNA molecules are then evaluated by custom data analysis software. In the next step, the experimental signals of single molecules are statistically modeled by the distribution of photons along the DNA restriction fragment, based on Poisson and Gamma distributions. The experimental trace signals are compared to a target database containing averaged template patterns for specific restriction fragments from multiple target organisms. The template patterns are generated either by theoretical calculations based on known sequences, or experimentally, by GSS analysis and clustering of molecules from isolates [1].

This computational modeling methodology yielded robust results for the detection and typing of multiple serovars and strains of Escherichia coli and Salmonella in complex biological mixtures. To identify closely related strains of these species, a hierarchical clustering algorithm (UPGMA: Unweighted Pair Group Method with Arithmetic Mean) is applied to group detected organisms and to generate phylogenetic trees comparing closely related strains of Escherichia coli, Salmonella enterica and other species. The ability of GSS to model single DNA molecule traces and attribute them to specific organisms, in conjunction with genome-based unsupervised classification using hierarchical clustering, is the basis of a robust technology for the confident detection and identification of pathogenic strains.
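
For readers unfamiliar with UPGMA, the following minimal sketch (ours, not PathoGenetix code) shows the clustering step with scipy, whose 'average' linkage method implements UPGMA; the strain names and distance matrix are illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    strains = ["E. coli A", "E. coli B", "Salmonella X", "Salmonella Y"]
    # Illustrative pairwise distances between strain fingerprints.
    dist = np.array([
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.8],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.8, 0.2, 0.0],
    ])
    tree = linkage(squareform(dist), method="average")   # 'average' = UPGMA
    labels = fcluster(tree, t=0.5, criterion="distance")
    print(dict(zip(strains, labels)))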


[TOP]

Conference on Semantics in Healthcare and Life Sciences (CSHALS)

Presenters

Updated March 07, 2013



Connecting Vocabularies in Linked Data

Nadia Anwar
General Bioinformatics
Reading, United Kingdom

Presentation (pdf)

Abstract: Initially, data in linked data clouds were brought together under the maxim "Messy is Good". The idea was that if you put your data in RDF, it can be used, re-purposed and integrated. This maxim was used to encourage people to expose their data as RDF. Now that we are all convinced, and there is a lot of RDF available to us, some of it messy, it is time to tidy up the mess. We aim to show why "messy" is now problematic and to describe how a little house-keeping makes the linked data we have better. In our experience, messy or “good enough” was a workable starting point; however, we now find that many practical uses of the data originally exposed as RDF are very difficult. Even the simplest of SPARQL queries on public resources can be very unintuitive. To demonstrate our point, we show through a small and a large example how adding just a few inferences makes public RDF data easier to query with SPARQL. The first, small, example uses two RDF datasets for the model organism Drosophila melanogaster. FlyAtlas is a tissue-specific expression database produced by Julian Dow at the University of Glasgow. The expression profiles were designed to reveal the differences in expression between very different tissues, for example from the brain to the hindgut. This is an incredible resource, available alongside other fly resources in RDF at openflydata.org. Gene set enrichment is now a standard tool used to understand such expression data; an analogous query is “are there any tissue-specific pathways in the hindgut?”. In theory, this question can be answered with a SPARQL query connecting FlyAtlas to FlyCyc, a database of fly pathways; however, this is not easy. The two graphs, FlyAtlas and FlyCyc, are actually quite difficult to traverse. However, with the addition of some triples through very simple inferences, utilizing, for example, transitive properties, class subsumption and simple CONSTRUCT queries, the graphs become much easier to query. In the larger example, we have a pharma client with a large set of linked data, mainly from the public domain. In this linked data cloud there are some 30 databases using about 6 different RDFS/OWL schemas. The concepts are diverse, from Proteins to Pathways to Clinical Trials. Practically, the queries through this data are complex, and as the cloud of data has grown, queries have become more and more cumbersome. In this larger data set, we present some of the typical queries performed over this data and show just how difficult some of these apparently simple queries can be. We describe some actions that, with the addition of some semantics, unify concepts across the multiple schemas in the linked data. We show some of the gains in query understandability and performance achieved through the addition of these ontology statements.

[top]


Producing, Publishing and Consuming Linked Data: Three Lessons from the Bio2RDF Project

François Belleau
Centre de recherche du CHUQ, Laval University
Québec, Canada

Abstract: Since 2006 the Bio2RDF project (http://bio2rdf.org), hosted jointly by the Centre de recherche du CHUQ of Laval University and Carleton University, has aimed to transform silos of life science data made publicly available by data providers like KEGG, UniProt, NCBI, EBI and many more into a globally distributed network of linked data for biological knowledge discovery, available to the life science community. Using semantic data integration techniques and the Virtuoso triplestore software, Bio2RDF seamlessly integrates diverse biological data and enables powerful SPARQL services across its globally distributed knowledge bases. Online since 2006, this very early Linked Data project has evolved and inspired many others. This talk will recall Bio2RDF's main design steps over time but, more importantly, will explain the design decisions that, we think, made it successful. Having been part of the Linked Data space since the beginning, we have been in a good position to observe the evolution and adoption of semantic web technologies by the life science community. Now, with so many mature projects online, we propose to look back, so we can share our experiences building Bio2RDF with the CSHALS community. The methods used to produce, publish and consume Bio2RDF linked data will be presented. The pipeline used to transform public data to RDF and its design principles will be explained, as will the way the Openlink Virtuoso server, an open source project, is configured and used. We will also propose guidelines for publishing RDF within the bioinformatics Linked Data cloud. Finally, we will show different ways to consume this data using URIs, SPARQL queries and semantic web software such as RelFinder and the Virtuoso Facet browser. Bio2RDF remains the most diverse and integrated linked data space available, but we now observe an important shift, with data providers starting to expose their own datasets as RDF or, even better, as SPARQL endpoints. Looking back at the evolution of Bio2RDF, we will share the lessons we have learned about semantic web projects: which ideas were good, and which were not. To conclude, we will present our vision, still to be fulfilled, of what Linked Data could become in the near future, so that the data integration problem, so pervasive in the life sciences, benefits from mature Semantic Web technologies and helps researchers in their daily discovery work.

[top]


Domeo Web Annotation Tool: Linking Science and Semantics through Annotation

Paolo Ciccarese
Mass General Hospital and Harvard Medical School
Boston, United States

Abstract: Background. Annotation is a fundamental activity in clinical and biomedical research, as well as in scholarship in general. Through annotation we can associate a commentary or formal judgment (a textual comment, revision, citation, classification, or other related object) with targets such as text, images, video and database records. Annotation can be created for personal use, as in note-taking and personal classification of documents and document content, or it can be addressed to an audience beyond its creator, as in shared commentary on documents, reviewing, citation, and tagging. While various annotation tools exist, we currently lack a comprehensive framework for creating, aggregating and sharing annotation in an open architecture. An open approach enables user engagement through the applications they prefer for performing a specific task. Method. In order to facilitate the social creation and sharing of annotation on digital resources, we developed the Annotation Ontology (AO) and the Domeo Web Annotation Toolkit. AO is an ontology in OWL-DL for annotating documents on the web. AO supports both human and algorithmic content annotation. It enables "stand-off", or independent, metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated without being under the update control of the annotator. AO provides a provenance model to support versioning, and a set model for specifying groups and containers of annotation. Domeo is a browser-based annotation tool that enables users to visually and efficiently create, save, version and share AO-based "stand-off" annotation on HTML or XML documents. Domeo supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control. Several use cases were incrementally implemented by the toolkit. These use cases in biomedical communications include personal note-taking, group document annotation, semantic tagging (through biomedical ontologies), claim-evidence-context extraction (through the SWAN ontology model), reagent tagging (through the Antibody Registry), and curation of text-mining results from entity extraction algorithms such as the NCBO Annotator Web Service. Results. Domeo has been deployed as part of the NIH Neuroscience Information Framework (NIF), in the private network of a major pharmaceutical company, and in a (currently) limited-access public version on the Cloud. Researchers may request access to the public alpha build of Domeo Version 2. Domeo is open source software, licensed under the Apache 2.0 open source license. Conclusions. The success of the first version of the Domeo annotation tool motivated the development of the second version of the product, which is open source and includes new features such as annotation of images and of multiple targets in the same document. The new version of the tool will also support the emerging Open Annotation model provided by the W3C Open Annotation Community Group. The Open Annotation model, which began as the merge of the Annotation Ontology and the Open Annotation Collaboration model, is now a self-standing initiative that we expect to have a great impact on the world of annotation.
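
A minimal sketch (ours, not Domeo code) of a "stand-off" annotation in RDF, using the W3C Open Annotation vocabulary mentioned above; all IRIs and the selected span are illustrative.

    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import RDF

    OA = Namespace("http://www.w3.org/ns/oa#")
    g = Graph()

    anno = URIRef("http://example.org/anno/1")       # illustrative IRIs
    body = URIRef("http://example.org/note/1")
    target = URIRef("http://example.org/target/1")
    selector = URIRef("http://example.org/selector/1")
    doc = URIRef("http://example.org/doc/paper.html")

    g.add((anno, RDF.type, OA.Annotation))
    g.add((anno, OA.hasBody, body))                  # the commentary resource
    g.add((anno, OA.hasTarget, target))
    g.add((target, OA.hasSource, doc))               # the unmodified document
    g.add((target, OA.hasSelector, selector))
    g.add((selector, RDF.type, OA.TextQuoteSelector))
    g.add((selector, OA.exact, Literal("amyloid beta")))  # the anchored span

    print(g.serialize(format="turtle"))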

[top]


Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data

Michel Dumontier
Carleton University
Ottawa, Canada

Abstract: Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences. Here, we present the second release of the Bio2RDF project, which features up-to-date open-source scripts, IRI normalization through a common dataset registry, dataset provenance, data metrics, public SPARQL endpoints, and compressed RDF files and full-text-indexed Virtuoso triple stores for download. Methods: Bio2RDF defines a set of simple conventions to create RDF(S)-compatible Linked Data from a diverse set of heterogeneously formatted sources obtained from multiple data providers. We have consolidated and updated the Bio2RDF scripts into a single GitHub repository (http://github.com/bio2rdf/bio2rdf-scripts), which facilitates collaborative development through issue tracking, forking and pull requests. The scripts are released under an MIT license, making them available for any use (including commercial), modification or redistribution. Provenance regarding when and how the data were generated is provided using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Additional scripts were developed to compute dataset-dataset links and summaries of dataset composition and connectivity. Results: Nineteen datasets, including 5 new datasets and 3 aggregate datasets, are now being offered as part of Bio2RDF Release 2. Use of a common registry ensures that all Bio2RDF datasets adhere to strict syntactic IRI patterns, thereby increasing the quality of generated links over previously suggested patterns. Quantitative metrics are now computed for each dataset, ranging from elementary information such as the number of triples to a more sophisticated graph of the relations between types. While these metrics provide an important overview of dataset contents, they are also used to assist in SPARQL query formulation and to monitor changes to datasets over time. Pre-computation of these summaries frees up computational resources for more interesting scientific queries and also enables tracking of dataset changes over time, which will help in making projections about hardware and software requirements. We demonstrate how multiple open source tools can be used to visualize and explore Bio2RDF data, as well as how dataset metrics may be used to assist querying. Conclusions: Bio2RDF Release 2 marks an important milestone for this open source project, in that it was fully transferred to a new team and development paradigm. Adoption of GitHub as a code development platform makes it easier for new parties to contribute and get feedback on RDF converters, and makes it possible for new converters to be added automatically to the growing Bio2RDF network. Over the next year we hope to offer bi-annual releases that adhere to formalized development and release protocols.

[top]


Yes, We Can!  
Lessons from Using Linked Open Data (LOD) and Public Ontologies to Contextualize and Enrich Experimental Data

Presentation (pdf)

Presenter:
Erich A. Gombocz
IO Informatics, Inc., Berkeley, CA, USA

Co-Authors:
Andrea Splendiani
IO Informatics, Inc., London, UK

Mark A. Musen
Stanford Center for Biomedical Informatics Research (BMIR), Stanford, CA, USA

Robert A. Stanley
IO Informatics, Inc., Berkeley, CA, USA

Jason A. Eshleman
IO Informatics, Inc., Berkeley, CA, USA

Abstract: Semantic W3C standards provide a framework for the creation of knowledge bases that are extensible, coherent, and interoperable, and on which interactive analytics systems can be developed. An ever-growing number of knowledge bases are being built on these standards, in particular as Linked Open Data (LOD) resources. The availability of LOD resources has received increasing attention and use in industry and academia.

Using LOD resources to provide value to industry is challenging, however, and early expectations have not always been met: issues often arise from the alignment of public and corporate experimental standards, from inconsistent namespace policies, and from the use of internal, non-formal application ontologies. Often the reliability of resources is problematic, from the service levels of LOD resources and/or SPARQL endpoints to URI persistence. Furthermore, more and more "open" data are closed for commercial use, and there are serious funding concerns related to government grant-backed resources.

With these challenges, can Semantic Web technologies provide value to Industry today?
We make the case that, yes, this can be done, and is being done now.

We demonstrate a use case of successful contextualization and enrichment of internal experimental datasets with public resources, thanks to outstanding examples of LOD such as UniProt, Drugbank, Diseasome, SIDER, Reactome, and ChEMBL, as well as ontology collections and annotation services from NCBO’s BioPortal.

We show how, starting with semantically integrated experimental results from multi-year toxicology studies performed on different platforms (gene expression and metabolic profiling), a knowledge base can be built that integrates and harmonizes such information and enriches it with public data from UniProt, Drugbank, Diseasome, SIDER, Reactome, and NCBI Biosystems. The resulting knowledge base facilitates toxicity assessment in drug development at the pre-clinical trial stage. It also provides models for the classification of toxicity types (hepatotoxicity, nephrotoxicity, toxicity based on drug residues) and offers better a priori determination of adverse effects of drug combinations. In this specific use case, we were not only able to correlate responses across unrelated studies with different experimental models, but also to validate system changes associated with known common toxicity mechanisms such as oxidative stress (glutathione metabolism), liver function (bile acids and the urea cycle) and ketoacidosis. Experimental observations from multi-modal -omics data can result from the same perturbation yet represent very different biological processes; pharmacodynamic correlations are not necessarily functionally linked within the biological network; and genetic and metabolic changes may occur at lower doses and prior to pathological changes. For these reasons, enrichment with LOD resources offers new insights into mechanisms, and it led to the discovery of new pharmacodynamically and biologically linked pathway dependencies.
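
The enrichment pattern described can be sketched as a lookup against a public SPARQL endpoint (our illustration; the UniProt endpoint URL and example accession are assumptions, not part of the authors' study):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Enrich an internally measured protein with public UniProt annotation.
    uniprot = SPARQLWrapper("https://sparql.uniprot.org/sparql")
    uniprot.setQuery("""
        PREFIX up: <http://purl.uniprot.org/core/>
        SELECT ?name WHERE {
            <http://purl.uniprot.org/uniprot/P04637> up:recommendedName ?rn .
            ?rn up:fullName ?name .
        }
    """)
    uniprot.setReturnFormat(JSON)
    for row in uniprot.query().convert()["results"]["bindings"]:
        print(row["name"]["value"])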

As LOD resources mature, more reliable information is becoming publicly available that can enrich experimental data with computable descriptions of biological systems in ways never anticipated before, and that ultimately helps in understanding experimental results. The time and money saved by such an approach have enormous socio-economic benefits for drug companies and healthcare alike.

As a community, we need to establish business models through cooperation between industry and academic institutions that support the maintenance and extension of invaluable public LOD resources. Their effective use in enriching toxicology data exemplifies the success of using Semantic Web technologies to contextualize experimental, internal, external, clinical and public data towards faster and better understanding of biological systems and, as such, more effective outcomes in health and quality of life for all of us.


[top]


The Disease and Clinical Measurements Knowledgebase: A Case Study of Data-driven Ontology Class Evaluation and Enrichment

Nophar Geifman
Ben Gurion University, Dep. of Microbiology, Immunology and Genetics, Faculty of Health Sciences and The National Institute for Biotechnology in the Negev
Be'er Sheva, Israel

Presentation (pdf)

Abstract: Laboratory tests such as standard blood tests are commonly used for the diagnosis of disease or the selection of additional diagnostic procedures. For most blood or urine analytes, abnormal values (e.g. elevated serum creatinine levels) are strongly indicative of pathological states (e.g. muscle destruction). However, abnormal values are associated with pathological states in a many-to-many relationship: an abnormal value can be associated with several pathologies and vice versa. Despite this being common knowledge, it appears that a freely available formal knowledge structure holding these complex interrelationships does not exist. Furthermore, evaluation of ontologies is vital for the maturation of the Semantic Web, and data-driven ontology evaluation is an accepted approach; it also has the capacity to grow the knowledge structures. Since an existing ontology is unlikely to capture all possible aspects of classification, new, additional classifications may help enrich the ontology to better capture the knowledge domain. Methods and Results: As an extension to the Age-Phenotype Knowledgebase, the Disease and Clinical Measurements Knowledgebase has been developed to capture the multiple associations between diseases and clinical diagnostic tests. The knowledgebase comprises a relational database and formal ontologies such as the Disease Ontology and the Clinical Measurements Ontology. The use of ontologies as part of the knowledge model provides a standardized, unambiguous method for describing entities captured from various data sources. In addition, ontology use allows complex queries to be conducted by abstraction to higher-order concepts. The knowledgebase was initially populated with disease-analyte relationships extracted from textbooks. Added to these were disease-analyte relationships inferred from MeSH term co-occurrence in PubMed abstracts. Over two million PubMed abstracts were obtained, and for each abstract the associated MeSH terms were captured. Abstracts were scanned for the co-occurrence of blood analyte-related MeSH terms and terms from the Disease Ontology. Clustering of these co-occurrences generated 67 disease clusters which share a similar pattern of blood analyte associations as captured in the literature. For example, a cluster containing the diseases 'obesity', 'hypertriglyceridemia', 'atherosclerosis' and 'lipodystrophy' was characterised by a high association with blood glucose, triglycerides and cholesterol. A comparison of the disease clusters to disease classes in DO revealed numerous overlaps; many clusters were found to contain diseases which were classified together in the ontology, thus validating these classifications. On the other hand, several clusters did not completely overlap with DO classes, suggesting new classifications which could be added to the ontology in order to enrich the knowledge it captures. Conclusions: This work provides both an example of the incorporation of ontologies within a knowledgebase and a use case for data-driven ontology class enrichment and evaluation. Ontology evaluation and enrichment have the potential to bring significant quality improvements to ontologies and classifications, particularly where they are automatically generated. While the work presented here currently focuses on blood analytes, it could easily be extended to include other clinical diagnostic measurements and symptoms, as well as additional sources of data.
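
The co-occurrence step described above can be sketched in a few lines (ours, not the authors' pipeline); the MeSH term sets and vocabularies are illustrative.

    from collections import Counter
    from itertools import product

    abstracts_mesh = [  # illustrative MeSH term sets, one per abstract
        {"Obesity", "Blood Glucose", "Triglycerides"},
        {"Obesity", "Cholesterol"},
        {"Atherosclerosis", "Cholesterol", "Triglycerides"},
    ]
    diseases = {"Obesity", "Atherosclerosis"}
    analytes = {"Blood Glucose", "Triglycerides", "Cholesterol"}

    # Count disease-analyte pairs that share an abstract's annotations;
    # the resulting matrix is what a clustering step would consume.
    cooc = Counter()
    for terms in abstracts_mesh:
        for d, a in product(terms & diseases, terms & analytes):
            cooc[(d, a)] += 1

    for (d, a), n in cooc.most_common():
        print("%s ~ %s: %d" % (d, a, n))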

[top]


The ISA Infrastructure for the Biosciences: from Data Curation at Source to the Linked Data Cloud

Alejandra Gonzalez-Beltran
University of Oxford
Oxford, United Kingdom

Philippe Rocca-Serra, Eamonn Maguire, Susanna-Assunta Sansone
Oxford e-Research Centre, University of Oxford
Oxford, United Kingdom

Presentation (pdf)

Abstract: Experimental metadata is crucial for the ability to share, compare, reproduce, and reuse data produced by biological experiments. The ISAtab format -- a tabular format based on the concepts of Investigation/Study/Assay (ISA) -- was designed to support the annotation and management of experimental data at source, with a focus on multi-omics experiments. The format is accompanied by a set of open-source tools that facilitate compliance with existing checklists and ontologies, production of ISAtab metadata, validation, conversion to other formats, and submission to public repositories, among other things. The ISAtab format, together with the tools, allows for the syntactic interoperability of the data and supports the ISA commons, a growing community of international users and public or internal resources powered by one or more components of the ISA metadata tracking framework. The underlying semantics of the ISAtab format is currently left to the interpretation of biologists and/or curators. While this interpretation is assisted by the ontology-based annotations that can be included in ISAtab files, it is currently not possible to have this information processed by machines, as in the semantic web/linked data approach. In this presentation, we will introduce our ongoing isa2owl effort to transform ISAtab files into an RDF/OWL-based (Resource Description Framework/Web Ontology Language) representation, supporting semantic interoperability between ISAtab datasets. By using a semantic framework, we aim to: 1. make the ISAtab semantics explicit and machine-processable; 2. exploit the existing ontology-based annotations; 3. augment annotations over the native ISA syntax constructs with new elements anchored in a semantic model extending the Ontology for Biomedical Investigations (OBI); 4. facilitate the understanding and semantic querying of the experimental design; 5. facilitate data integration, knowledge discovery and reasoning over ISAtab metadata and associated data. The software architecture of the isa2owl component is engineered to support multiple mappings between the ISA syntax and semantic models. Given a specific mapping, a converter takes ISAtab datasets and produces OWL ontologies, whose TBoxes are given by the mapping and whose ABoxes are the ISAtab elements or elements derived from them. These derived elements result from the analysis of the experimental workflow, as represented in the ISAtab format and the associated graph representation. The implementation relies on the OWLAPI. As a proof of concept, we have performed a mapping between the ISA syntax and a set of interoperable ontologies anchored in the Basic Formal Ontology (BFO) version 1. These ontologies are part of the Open Biological and Biomedical Ontologies (OBO) Foundry and include OBI, the Information Artifact Ontology (IAO) and the Relations Ontology (RO). We will show how this isa2owl transformation allows users to perform richer queries over the experimental data, to link to external resources available in the linked data cloud, and to support knowledge discovery.

[top]


Enabling Australia Wide Use of SNOMED CT

David Hansen
CSIRO ICT Centre
Brisbane, Australia

Presentation (pdf)

Abstract: Australia is standardizing on SNOMED CT as the preferred clinical terminology for use in electronic health and medical records. Existing electronic systems, and the legacy terminologies and vocabularies they contain, represent a substantial investment in health information collections. The Snapper SNOMED CT mapping tool has been made available to all public and private organizations in Australia to aid the transition from these existing vocabularies to SNOMED CT. Method: The Snapper toolkit is based on the snorocket classifier, a fast classification and subsumption engine for EL+ ontologies. Snapper provides a user-friendly interface for creating mappings from an existing termset to concepts in the SNOMED CT ontology. Advanced features make use of the description logic foundation of SNOMED CT, including the ability to create post-coordinated expressions and to classify an expression into the correct position in the hierarchy in real time. The use of Snapper allows users themselves to transition to SNOMED CT, adopt this standard clinical terminology, and preserve the value of their existing health information collections. Additionally, many electronic health record suppliers are using Snapper and investigating the use of a cloud-based terminology server running snorocket to perform subsumption queries. Results: Mapping legacy termsets to SNOMED CT eases the adoption path, allowing a migration from existing terms to SNOMED CT. Mappings created so far include standard content for general practices, surgeries, emergency departments, and community health and pharmacy systems. The lessons from creating these maps concern: the intended use of the maps; whether or not all legacy termsets deserve migration; the identification and management of legacy termset content and its preparation for mapping; whether or not SNOMED CT expressions or extensions will be necessary; and how these can be maintained and deployed. Beyond mapping, there have been examples of the use of some of Snapper's advanced features. The Australian Medicines Terminology (AMT) is a catalogue of medicinal products and substances in use in Australia. Snapper and snorocket have been used to map AMT to the Substance hierarchy of SNOMED CT, thereby making drug classes and subsumption available to AMT; the snorocket classifier can then be used to produce an integrated and fully-classified extension. Snapper has also been used to develop Reference Sets (RefSets) of SNOMED CT content suited for data retrieval and queries. Public health biosurveillance users required RefSet content to analyse live data feeds, encoded in SNOMED CT, relevant to patient presentations; RefSets developed by Snapper users are capable of detecting ‘signal’ cases of avian, swine and other forms of influenza in a patient population. Conclusion: The uses of Snapper and snorocket have shown the importance of a classification engine in the creation of maps and RefSets of SNOMED CT content. Participants will learn about the use of mapping to integrate related SNOMED CT-based terminologies and the role and value of classification and classifiers in such a process, and will gain insight into post-coordination and subsumption-based queries.
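
A minimal sketch (ours, not Snapper or snorocket) of the subsumption queries the abstract relies on: with precomputed is-a edges, "find all influenza cases" reduces to a reachability test in the concept hierarchy. The concept names are illustrative, not real SNOMED CT codes.

    IS_A = {  # child -> parents in a toy concept hierarchy
        "Swine influenza": {"Influenza"},
        "Avian influenza": {"Influenza"},
        "Influenza": {"Respiratory infection"},
    }

    def subsumed_by(concept, ancestor):
        """Walk up the is-a edges to test whether ancestor subsumes concept."""
        stack = [concept]
        while stack:
            c = stack.pop()
            if c == ancestor:
                return True
            stack.extend(IS_A.get(c, ()))
        return False

    cases = ["Swine influenza", "Asthma", "Avian influenza"]
    print([c for c in cases if subsumed_by(c, "Influenza")])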

[top]


Semantic Benchmarking Infrastructure for Text Mining: Leverage of Corpora, Ontologies and SPARQL to Evaluate Mutation Text Mining Systems

Artjom Klein
University of New Brunswick
Saint John, Canada

Abstract: (1) Objectives and motivation. In biomedical text mining, the development of robust pipelines, the publication of results and the running of comparative evaluations are greatly hindered by the lack of adequate benchmarking facilities. Benchmarks - annotated corpora - are usually designed and created for specific tasks and come with hard-coded evaluation metrics. Comparative evaluations between tools, and evaluation of these tools on different gold standard data sets, are important for performance verification and adoption, but they are hindered by the diversity and heterogeneity of the formats and annotation schemas of corpora and systems. Well-known text mining frameworks such as UIMA and GATE include functionality for integrating and evaluating text mining tools based on hard-coded evaluation metrics. Unlike these approaches, we leverage semantic technologies to provide flexible, ad-hoc authoring of evaluation metrics. We report on a centralized, community-oriented annotation and benchmarking infrastructure to support the development, testing and comparative evaluation of text mining systems, and we have deployed this infrastructure to evaluate the performance of mutation text mining systems. (2) Method. The design of the infrastructure is based on semantic standards: RDF is used to represent the annotations, an OWL ontology provides an extensible schema for the data, and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyse system results. The core infrastructure comprises: 1) third-party upper-level ontologies to model annotations and text structure, 2) a domain ontology for modelling domain-specific annotations, 3) SPARQL queries for performance metrics computation, and 4) a sizeable collection of manually curated documents that can minimally support mutation grounding and mutation impact extraction. The diversity of freely available RDF/OWL tools enables out-of-the-box use of the annotation data for corpus search and analysis, system testing and evaluation. (3) Results. We developed the Mutation Impact Extraction Ontology (MIEO) as a domain ontology to model extracted mutation impact related information. We seeded the infrastructure with several corpora (242 documents in total) supporting at least two mutation text mining tasks: mutation grounding to proteins, and extraction of mutation impacts on the molecular functions of proteins; we also developed SPARQL queries to calculate the relevant metrics. To facilitate a preliminary evaluation of our infrastructure for comparative evaluation, we integrated freely available mutation impact extraction systems into the infrastructure and developed a set of SPARQL queries to perform cross-evaluation on the available mutation impact corpora. (4) Conclusions. We present an evaluation system for benchmarking and comparative evaluation of mutation text mining systems, designed for use by BioNLP developers, biomedical corpus curators, and bio-database curators. Corpora and text mining outputs modelled in terms of the underlying ontologies can be readily integrated into the infrastructure for benchmarking and evaluation. The generic nature of the solution makes it flexible, easily extendable, re-usable and adoptable for new domains. The flexibility of SPARQL allows ad-hoc search and analysis of corpora and the implementation of evaluation metrics without requiring programming skills. To our knowledge, this is the first benchmarking infrastructure of its kind for mutation text mining.
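A minimal sketch of the SPARQL-based metric idea follows: gold and system annotations live in one RDF graph, and precision and recall fall out of set-intersection queries. The tiny schema and mutation strings here are hypothetical placeholders, not the actual MIEO.

```python
# Sketch under an invented schema; the real infrastructure uses MIEO.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/eval#> .
ex:doc1 ex:goldMutation "E545K" ; ex:sysMutation "E545K", "H1047R" .
ex:doc2 ex:goldMutation "V600E" .
""", format="turtle")

PFX = "PREFIX ex: <http://example.org/eval#> "
# True positives: mutations annotated by both gold standard and system.
tp = len(list(g.query(PFX + "SELECT ?d ?m WHERE { ?d ex:goldMutation ?m . ?d ex:sysMutation ?m . }")))
gold = len(list(g.query(PFX + "SELECT ?d ?m WHERE { ?d ex:goldMutation ?m }")))
sys_ = len(list(g.query(PFX + "SELECT ?d ?m WHERE { ?d ex:sysMutation ?m }")))
print(f"precision={tp/sys_:.2f} recall={tp/gold:.2f}")  # precision=0.50 recall=0.50
```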

[top]


Building the App Store for Health and Discovery

Kenneth Mandl
Harvard Medical School/Boston Children's Hospital
Boston, United States

Presentation (pdf)

Abstract: Most vendor electronic health record (EHR) products are architected monolithically, making modification difficult for hospitals and physician practices. An alternative approach is to reimagine EHRs as iPhone-like platforms that support substitutable, app-based functionality. Substitutability is the capability inherent in a system of replacing one application with another of similar functionality. It requires that the purchaser of an app can replace one application with another without being technically expert, without re-engineering the other applications they are using, and without having to consult or seek the assistance of the vendors of previously or currently installed applications. A deep commitment to substitutability enforces key properties of a health information technology ecosystem: because an app can be readily discarded, the consumer or purchaser of these applications is empowered to define what constitutes value in information technology, and apps necessarily compete with each other, promoting progress and adaptability. The Substitutable Medical Applications, Reusable Technologies (SMART) Platforms project seeks to develop a health information technology platform with substitutable apps constructed around core services. It is funded by a $15M grant from the Office of the National Coordinator for Health Information Technology's Strategic Health IT Advanced Research Projects (SHARP) Program. All SMART standards are open and the core software is open source. The goal of SMART is to create a common platform supporting an "app store for health" as an approach to drive down healthcare costs, support standards evolution, accommodate differences in care workflow, foster competition in the market, and accelerate innovation. The SMART project promotes substitutability through an application programming interface (API) that a wide variety of health technology platforms can adopt as part of a "container" built around their existing systems, providing read-only access to the underlying data model, together with a software development toolkit for readily creating apps. SMART containers are health IT systems that have implemented the SMART API or a portion of it; containers marshal data sources and present them consistently across the SMART API. SMART applications consume the API and are substitutable. SMART has sparked an ecosystem of app developers and attracted existing health information technology platforms to adopt the SMART API, including traditional, open-source, and next-generation EHRs, patient-facing platforms and health information exchanges. SMART-enabled platforms to date include the Cerner EMR, the WorldVistA EMR, the OpenMRS EMR, the i2b2 analytic platform, and the Indivo X personal health record. The SMART team is working with the Mirth Corporation to SMART-enable the HealthBridge and Redwood MedNet Health Information Exchanges. We have demonstrated that a single SMART app can run, unmodified, in all of these environments, as long as the underlying platform collects the required data types. Going forward, we seek to design approaches that enable the nimble customization of health IT for the clinical and translational enterprises.
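The substitutability argument can be made concrete with a small sketch: an app that consumes only a container's read-only API can be swapped, or run against a different container, without touching anything else. The endpoint path, JSON shape and container URL below are hypothetical illustrations, not the actual SMART API.

```python
# Hypothetical SMART-like container call; not the real SMART API surface.
import json
from urllib.request import urlopen

CONTAINER = "https://container.example.org"   # hypothetical container base URL

def list_medications(patient_id: str) -> list[str]:
    """Fetch a patient's medication names through the container's read-only API."""
    with urlopen(f"{CONTAINER}/records/{patient_id}/medications") as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload["medications"]]

# Any app written against this one read-only call runs unmodified on every
# container exposing the same API, provided it collects the required data types.
```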

[top]


CDISC2RDF - Make Clinical Data Standards Linkable, Computable and Queryable

Charles Mead
Octo Consulting Group
Washington, United States

Eric Prud'hommeaux
W3C
Cambridge, United States

Presentation (pdf)

Abstract: Clinical data standards have been identified as one of five initial areas by TransCelerate BioPharma, the non-profit organization formed by ten leading pharmaceutical companies to accelerate the development of new medicines. The European Medicines Agency (EMA) is developing a policy on the proactive publication of clinical-trial data in the interests of public health, including clear and understandable clinical data formats. The FDA has a long-held goal of making better use of submitted clinical trial data, and pharmaceutical companies have attempted to use submission standards to create study repositories. Exploiting Semantic Web technologies stands to simplify the interpretation of individual studies and improve cross-study integration. Method: The CDISC2RDF initiative applies semantic web standards and linked data principles to clinical data standards from CDISC (Clinical Data Interchange Standards Consortium). This has been proposed by early adopters at AstraZeneca and Roche as a way to make clinical data standards linkable, computable and queryable, beyond today's disconnected PDF and Excel files. CDISC2RDF is a cross-pharma, pre-competitive project involving Roche, AstraZeneca, TopQuadrant, the Free University of Amsterdam and W3C HCLS. Results: This presentation will describe the results from phase 1 (standards as-is), covering standards for submission (SDTM), analysis (ADaM) and data capture (CDASH) structures and terminologies, and will discuss ideas for the next two phases: 2) standards-in-context, and 3) interoperability across standards and the data collected using them.
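As a hint of what "linkable, computable and queryable" buys over a PDF, here is a minimal rdflib sketch: an SDTM-style domain and its variables expressed as triples and retrieved with SPARQL. The mini-vocabulary is invented for the example and is not the published CDISC2RDF schema.

```python
# Hypothetical mini-vocabulary; the real project defines its own schemas.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix cd: <http://example.org/cdisc#> .
cd:DM a cd:Domain ;
      cd:label "Demographics" ;
      cd:hasVariable cd:DM_AGE, cd:DM_SEX .
cd:DM_AGE cd:label "Age" ; cd:datatype "Num" .
cd:DM_SEX cd:label "Sex" ; cd:datatype "Char" .
""", format="turtle")

q = """
PREFIX cd: <http://example.org/cdisc#>
SELECT ?var ?label WHERE {
  cd:DM cd:hasVariable ?var .
  ?var cd:label ?label .
}
"""
for var, label in g.query(q):
    print(var, label)   # each variable of the DM domain with its label
```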

[top]


Building a Knowledge Base for Cancer Genomics

Eric Neumann, Alex Parker, Rachel Erlich
Foundation Medicine

Abstract:  Next-generation sequencing (NGS) is becoming an increasingly important part of healthcare, providing new genomic insights into diseases and their treatments. Therapies for cancer in particular will benefit greatly, since cancer arises from a series of genetic alterations affecting cellular proliferation and survival—routine genomic testing will guide more effective treatments.  To this end, Foundation Medicine® (FMI) has developed FoundationOne™, a comprehensive cancer genomic profiling test based on next generation sequencing.

Growing knowledge and clinical application of cancer genomics has changed the oncology landscape in recent years, enabling therapeutic options that specifically target the genomic drivers of a patient’s unique cancer. While this approach may offer more efficient and less toxic options than traditional chemotherapy, physicians need to be able to match each patient with the right drug for their unique cancer, which requires a comprehensive genomic profile of the patient’s tumor and an expansive knowledge of cancer genomics.

FMI has made this complex information readily available for any clinical practice. Using highly sensitive and accurate next-generation sequencing on small amounts of routine FFPE cancer tissue, the FoundationOne assay interrogates the entire coding sequence of hundreds of tumor genes known to be rearranged or altered in cancer, based on recent scientific and clinical literature. Genomic alterations are matched to relevant targeted therapies, either approved or in clinical trials, that could be a rational choice for the patient based on the genomic profile of their cancer. This information is reported to the patient's physician via the FoundationOne Interactive Cancer Explorer, the company's online reporting platform, usually within three weeks.

To support personalized cancer treatments based on FoundationOne, FMI is compiling the world’s most comprehensive cancer genomic alteration knowledgebase (KB). These data can be further analyzed for the combinations of mutations and cancers that best respond to specific drugs. The compilation of thousands of observed cancer genomic alterations for each of the cancers is being linked using RDF to published results, molecular cancer databases, and clinical trials, creating a densely interconnected knowledge base. The KB serves multiple applications internally as well as externally, taking full advantage of the power and flexibility of Linked Semantic Data. The associated molecular and clinical knowledge will enable oncologists and researchers to probe deeper into the mechanisms behind each patient’s cancer and potential cancer treatments, promising to dramatically accelerate improvement of existing therapies and the discovery of new ones.

Over time, FMI will continue to expand and analyze the Cancer Genomic Knowledge Base, merging in new forms of information that scientists and clinicians see as key to understanding cancer. The use of RDF standards and Linked Semantic Data protocols gives us the ability to grow and maintain the KB with new insights, some of which will be inferred directly from it.
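An illustrative fragment of the linked-data pattern described above follows: a genomic alteration connected to a therapy and a trial as RDF, then traversed with SPARQL. The predicates and the trial identifier are hypothetical, not FMI's actual KB schema.

```python
# Invented schema for illustration; not the FoundationOne knowledge base.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix kb: <http://example.org/kb#> .
kb:EGFR_L858R kb:foundIn kb:NSCLC ;
              kb:sensitizesTo kb:erlotinib .
kb:erlotinib  kb:evaluatedIn kb:NCT00000000 .
""", format="turtle")

q = """
PREFIX kb: <http://example.org/kb#>
SELECT ?drug ?trial WHERE {
  kb:EGFR_L858R kb:sensitizesTo ?drug .
  ?drug kb:evaluatedIn ?trial .
}
"""
for drug, trial in g.query(q):
    print(drug, trial)   # therapy and trial reachable from the alteration
```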

[top]


Semantically Enabling Genetic Medicine to Facilitate Patients and Guidelines Matching and Enhanced Clinical Decision Support

Matthias Samwald
Medical University of Vienna
Vienna, Austria

Abstract: The delivery of genomic medicine requires an integrated view of molecular and genetic profiles coupled with actionable clinical guidelines that are based on formalisms and definitions such as the star allele nomenclature. However, the identification of new variants can change these definitions and impact guidelines for patient treatment. We present a system that makes use of semantic technologies, such as an OWL 2-based ontology and automated reasoning, for 1) providing a simple and concise formalism for representing allele and phenotype definitions, 2) detecting inconsistencies in definitions, 3) automatically assigning alleles and phenotypes to patients, and 4) matching patients to clinically appropriate pharmacogenetic guidelines and clinical decision support messages. This development is coordinated through the Health Care and Life Sciences Interest Group of the World Wide Web Consortium (W3C). Method: We created an expressive OWL 2 ontology by automatically extracting or manually curating data from dbSNP, clinically relevant polymorphisms and allele definitions from PharmGKB, clinically relevant polymorphisms from the OMIM database, the Human Cytochrome P450 Allele Nomenclature Database, guidelines issued by the Clinical Pharmacogenetics Implementation Consortium (CPIC) and the Royal Dutch Pharmacogenetics Working Group, FDA product labels, and other relevant data sources. We used highly scalable OWL 2 reasoners (e.g., TrOWL) for analysing the aggregated data and for classifying genetic profiles. Results: We demonstrate how our approach can be used to identify errors and inconsistencies in primary datasets, as well as to infer alleles and phenotypes and match clinical guidelines from the genetic profiles of patients. Conclusion: We invite stakeholders in clinical genetics to participate in the further development and application of the formalism and system we have developed, with the potential goal of establishing it as an open standard for clinical genetics.
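The automatic assignment of phenotypes from genetic profiles can be sketched with owlready2: a defined class captures an allele-based phenotype, and the reasoner classifies a patient profile under it. The class names and the defining variant are simplified, hypothetical stand-ins for the actual W3C group ontology, and sync_reasoner() invokes the bundled HermiT reasoner, which needs a Java runtime.

```python
# Minimal owlready2 sketch; names and axioms are illustrative only.
from owlready2 import Thing, get_ontology, sync_reasoner

onto = get_ontology("http://example.org/pgx.owl")  # hypothetical IRI

with onto:
    class Variant(Thing): pass
    class CYP2C19_star2(Variant): pass             # simplified allele marker
    class has_variant(Thing >> Variant): pass
    class GenotypeProfile(Thing): pass
    class PoorMetabolizerProfile(GenotypeProfile):
        # Defined class: any profile carrying the *2 marker is classified here.
        equivalent_to = [GenotypeProfile & has_variant.some(CYP2C19_star2)]

    v = CYP2C19_star2("variant_star2")
    patient = GenotypeProfile("patient1", has_variant=[v])

sync_reasoner()        # HermiT reclassifies classes and individuals
print(patient.is_a)    # now includes PoorMetabolizerProfile
```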

[top]


A Clinical Information Management Platform Using Semantic Technologies

Christian Seebode
ORTEC Medical
Berlin, Germany

Presentation (pdf)

Abstract: Medical procedures generate a vast amount of data from various sources. Efficient and comprehensive integration and exploitation of these data will be one of the success factors for improving health care delivery to the individual patient while making health care services more cost-effective at the same time. In order to support effective mining, selection and presentation of medical data for clinical or patient-centered use cases, both free-text data and structured clinical data from Health Information Systems (HIS) have to be enriched with semantic meta-information and have to be available at any point along the data value chain. We present a platform which combines semantic extraction of medical information from clinical free-text documents with the processing of structured information from HIS records. The information extraction uses a fine-grained linguistic analysis and maps the preprocessed terms to the concepts of domain-specific ontologies. These domain ontologies comprise knowledge from various sources, including expert knowledge and knowledge from public medical ontologies and taxonomies. The processes of ontology engineering and rule generation are supported by a semantic workbench that enables interactive identification of the linguistic terms in clinical texts that denote relevant concepts, supporting incremental refinement of the semantic information extraction. Facts extracted from both clinical free texts and structured sources represent chunks of knowledge. They are stored in a Clinical Data Repository (CDR) using a common document-oriented storage model, which takes advantage of an application-agnostic format in order to support different use cases; it also supports version control of facts, reflecting the evolution of information. Enrichment algorithms aggregate further information by generating statistical information, search indexes, or decision recommendations. The CDR generally separates the processes of information generation from those of information processing or consumption, and thus supports smart partitioning of data for scalable application architectures. Applications hosted on the platform retrieve facts from the CDR by subscribing to the event stream the CDR provides. The first applications implemented on top of the platform support specific scenarios of clinical research, such as recruiting patients for clinical trials, answering feasibility studies, or aggregating data for epidemiological studies. Further applications address patient-centered use cases such as second opinion or dialogue support. The web-based application StudyMatcher maps study criteria to a list of cases and their medical facts: trial teams define study criteria in interaction with the knowledge resources, and the application automatically generates a list of candidate cases. Since the user interface links the facts extracted by the system to their original sources (e.g., the clinical documentation), users can check with little effort whether a fact has been recognized correctly by the system and matched correctly with the given criteria. This strategy of combining automatic and supervised fact generation promises to be a reasonable approach to improving the semantic exploitation of data. The platform and applications are developed in cooperation with Europe's leading healthcare providers Charité and Vivantes and will be rolled out in January 2013.
In cooperation with DFKI - Deutsches Forschungszentrum für Künstliche Intelligenz (German Research Center for Artificial Intelligence).
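The StudyMatcher idea (study criteria evaluated against per-case facts held in the CDR) can be sketched in a few lines of Python. The fact layout and criteria encoding below are invented purely for illustration and do not reflect the platform's actual storage model.

```python
# Toy sketch of criteria-to-case matching; all data structures hypothetical.
cases = {
    "case-001": {"diagnosis": {"diabetes mellitus"}, "age": 54},
    "case-002": {"diagnosis": {"hypertension"}, "age": 71},
}

# Study inclusion criteria expressed as predicates over a case's fact record.
criteria = [
    lambda facts: "diabetes mellitus" in facts["diagnosis"],
    lambda facts: 18 <= facts["age"] <= 65,
]

# A case becomes a candidate when it satisfies every criterion.
candidates = [cid for cid, facts in cases.items()
              if all(check(facts) for check in criteria)]
print(candidates)  # ['case-001']
```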


[top]

Conference on Semantics in Healthcare and Life Sciences

Discussion Questions

Please check back for updates to this page.
