Conference on Semantics in Healthcare and Life Sciences (CSHALS)


Updated March 07, 2013

Connecting Vocabularies in Linked Data

Nadia Anwar
General Bioinformatics
Reading, United Kingdom

Presentation (pdf)

Abstract: Initially, data in linked data clouds were brought together with the maxim "Messy is Good". The idea was that if you put your data in RDF, it can be used, re-purposed and integrated. This maxim was used to encourage people to expose their data as RDF. Now that we are all convinced, and there is a lot of RDF available to us, some of it messy, it is time to tidy up the mess. We aim to show why "messy" is now problematic and describe how a little house-keeping makes the linked data we have better. In our experience, messy or "good enough" was the starting point; however, we now find that many practical uses of the data originally exposed as RDF are very difficult. Even the simplest of SPARQL queries on public resources can be very unintuitive. To demonstrate our point, we show through a small and a large example how adding just a few inferences makes public RDF data easier to query with SPARQL. The first, small example uses two RDF datasets of the model organism Drosophila melanogaster. FlyAtlas is a tissue-specific expression database produced by Julian Dow at the University of Glasgow. The expression profiles were designed to reveal the differences in expression in very different tissues, for example, from the brain to the hindgut. This is an incredible resource, available in RDF alongside other fly resources. Gene set enrichment is now a standard tool used to understand such expression data; an analogous query is "are there any tissue-specific pathways in the hindgut?". In theory, this question can be answered with a SPARQL query connecting FlyAtlas to FlyCyc, a database of fly pathways; however, this is not easy. The two graphs, FlyAtlas and FlyCyc, are actually quite difficult to traverse. However, some triples can be added through very simple inferences, utilizing, for example, transitive properties, class subsumption and simple CONSTRUCT queries. With the addition of these extra triples, the graphs are much easier to query.
In a larger example, we have a pharma client with a large set of linked data, mainly from the public domain. In this linked data cloud there are some 30 databases using about 6 different RDFS/OWL schemas. The concepts are diverse, from Protein to Pathways to Clinical Trials. Practically, the queries through this data are complex, and as the cloud of data has grown, queries have become more and more cumbersome. In this larger data set, we exemplify some of the typical queries performed through this data, and just how difficult some of these simple queries can be. We describe some actions that unify concepts within the multiple schemas in the linked data with the addition of some semantics. We show some of the query gains achieved in understandability and performance through the addition of these ontology statements.
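
As a rough illustration of the kind of inference the abstract mentions, the sketch below materializes the extra triples implied by a transitive property, so that a later query needs only a single hop. It is not the authors' method: the triples, the `ex:` prefixes and the `materialize_transitive` helper are all hypothetical, and plain Python tuples stand in for an RDF store.

```python
from collections import defaultdict


def materialize_transitive(triples, prop):
    """Return the input triples plus every (s, prop, o) implied by
    treating `prop` as a transitive property."""
    successors = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            successors[s].add(o)

    inferred = set(triples)
    for start in list(successors):
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            # Every node reachable from `start` becomes a direct triple.
            inferred.add((start, prop, node))
            stack.extend(successors.get(node, ()))
    return inferred


# Hypothetical tissue hierarchy and one expression fact.
triples = {
    ("ex:hindgut", "ex:partOf", "ex:gut"),
    ("ex:gut", "ex:partOf", "ex:digestiveSystem"),
    ("ex:gene1", "ex:expressedIn", "ex:hindgut"),
}
closed = materialize_transitive(triples, "ex:partOf")
```

After this step the hindgut is directly linked to the digestive system, so a query about that system no longer has to walk the hierarchy itself.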


Producing, Publishing and Consuming Linked Data: Three Lessons from the Bio2RDF Project

François Belleau
Centre de recherche du CHUQ, Laval University
Québec, Canada

Abstract: Since 2006, the Bio2RDF project, hosted jointly by the Centre de recherche du CHUQ of Laval University and Carleton University, has aimed to transform silos of life science data made publicly available by data providers such as KEGG, UniProt, NCBI, EBI and many more into a globally distributed network of linked data for biological knowledge discovery, available to the life science community. Using semantic data integration techniques and the Virtuoso triplestore software, Bio2RDF seamlessly integrates diverse biological data and enables powerful SPARQL services across its globally distributed knowledge bases. Online since 2006, this very early Linked Data project has evolved and inspired many others. This talk will recall Bio2RDF's main design steps over time, but more importantly, we will explain the design decisions that, we think, made it successful. Being part of the Linked Data space since the beginning, we have been in a prime position to observe the evolution and adoption of semantic web technologies by the life science community. Now, with so many mature projects online, we propose to look back so that we can share our own experiences building Bio2RDF with the CSHALS community. The method used to produce, publish and consume Bio2RDF linked data will be presented. The pipeline used to transform public data to RDF and its design principles will be explained. The way the OpenLink Virtuoso server, an open source project, is configured and used will be explained. We will also propose guidelines to follow when publishing RDF within the bioinformatics Linked Data cloud. Finally, we will show different ways to consume this data using URIs, SPARQL queries and semantic web software such as RelFinder and the Virtuoso Facet browser. As a result, Bio2RDF is still the most diverse and integrated linked data space available. But now we observe an important shift, with data providers starting to expose their own datasets using RDF or, even better, SPARQL endpoints.
Looking back at the evolution of Bio2RDF, we will share lessons we have learned about semantic web project design: which ideas were good, and which were not. To conclude, we will present our vision, still to be fulfilled, of what Linked Data could be in the near future, so that the data integration problem, so present in the life sciences, benefits from mature Semantic Web technologies to help researchers do their daily discovery work.


Domeo Web Annotation Tool: Linking Science and Semantics through Annotation

Paolo Ciccarese
Mass General Hospital and Harvard Medical School
Boston, United States

Abstract: Background. Annotation is a fundamental activity in clinical and biomedical research as well as scholarship in general. Through annotation we can associate a commentary or formal judgment (textual comment, revision, citation, classification, or other related object) with targets such as text, images, video and database records. Annotation can be created for personal use, as in note-taking and personal classification of documents and document content. Or it can be addressed to an audience beyond its creator, as in shared commentary on documents, reviewing, citation, and tagging. While various annotation tools exist, we currently lack a comprehensive framework for creating, aggregating and sharing annotation in an open architecture. An open approach enables user engagement through the applications they prefer for performing the specific task. Method. In order to facilitate the social creation and sharing of annotation on digital resources we developed the Annotation Ontology (AO) and the Domeo Web Annotation Toolkit. AO is an ontology in OWL-DL for annotating documents on the web. AO supports both human and algorithmic content annotation. It enables “stand-off” or independent metadata anchored to specific positions in a web document by any one of several methods. In AO, the document may be annotated but is not required to be under update control of the annotator. AO provides a provenance model to support versioning, and a set model for specifying groups and containers of annotation. Domeo is a browser-based annotation tool that enables users to visually and efficiently create, save, version and share AO-based “stand-off” annotation on HTML or XML documents. Domeo supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control. Several use cases were incrementally implemented by the toolkit.
These use cases in biomedical communications include personal note-taking, group document annotation, semantic tagging (through biomedical ontologies), claim-evidence-context extraction (through the SWAN ontology model), reagent tagging (through antibodyregistry.org) and curation of text-mining results from entity extraction algorithms such as the NCBO Annotator Web Service. Results. Domeo has been deployed as part of the NIH Neuroscience Information Framework (NIF); in the private network of a major pharmaceutical company; and in a (currently) limited-access public version on the Cloud. Researchers may request access to the public alpha build of Domeo Version 2. Domeo is open source software, licensed under the Apache 2.0 open source license. Conclusions. The success of the first version of the Domeo annotation tool motivated the development of the second version of the product, which is open source and includes new features such as annotation of images and of multiple targets in the same document. The new version of the tool will also support the emerging Open Annotation Model provided by the W3C Open Annotation Community Group. The Open Annotation model, which began as the merger of the Annotation Ontology and the Open Annotation Collaboration model, is now a self-standing initiative that we expect to have great impact in the world of annotation.


Bio2RDF Release 2: Improved coverage, Interoperability and Provenance of Life Science Linked Data

Michel Dumontier
Carleton University
Ottawa, Canada

Abstract: Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences. Here, we present the second release of the Bio2RDF project, which features up-to-date, open-source scripts, IRI normalization through a common dataset registry, dataset provenance, data metrics, public SPARQL endpoints, and compressed RDF files and full-text-indexed Virtuoso triple stores for download. Methods: Bio2RDF defines a set of simple conventions to create RDF(S)-compatible Linked Data from a diverse set of heterogeneously formatted sources obtained from multiple data providers. We have consolidated and updated the Bio2RDF scripts into a single GitHub repository, which facilitates collaborative development through issue tracking, forking and pull requests. The scripts are released under an MIT license, making them available for any use (including commercial), modification or redistribution. Provenance regarding when and how the data were generated is provided using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Additional scripts were developed to compute dataset-dataset links and summaries of dataset composition and connectivity. Results: Nineteen datasets, including 5 new datasets and 3 aggregate datasets, are now being offered as part of Bio2RDF Release 2. Use of a common registry ensures that all Bio2RDF datasets adhere to strict syntactic IRI patterns, thereby increasing the quality of generated links over previously suggested patterns. Quantitative metrics are now computed for each dataset and range from elementary information, such as the number of triples, to a more sophisticated graph of the relations between types. While these metrics provide an important overview of dataset contents, they are also used to assist in SPARQL query formulation and to monitor changes to datasets over time.
Pre-computation of these summaries frees up computational resources for more interesting scientific queries and also enables tracking of dataset changes over time, which will help make projections about hardware and software requirements. We demonstrate how multiple open source tools can be used to visualize and explore Bio2RDF data, as well as how dataset metrics may be used to assist querying. Conclusions: Bio2RDF Release 2 marks an important milestone for this open source project, in that it was fully transferred to a new team and development paradigm. Adoption of GitHub as a code development platform makes it easier for new parties to contribute and get feedback on RDF converters, and makes it possible for those converters to be added automatically to the growing Bio2RDF network. Over the next year we hope to offer bi-annual releases that adhere to formalized development and release protocols.
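
The dataset metrics described above, from a simple triple count to a graph of relations between types, can be pictured with a small sketch. This is not the Bio2RDF scripts: the `dataset_metrics` function, the `ex:` identifiers and the tuple representation of triples are assumptions made purely for illustration.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def dataset_metrics(triples):
    """Compute a triple count, a distinct-subject count, and counts of
    (subject type, predicate, object type) links for a list of triples."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    type_links = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Count one link per combination of subject type and object type.
        for s_type in types.get(s, ()):
            for o_type in types.get(o, ()):
                type_links[(s_type, p, o_type)] += 1

    return {
        "triples": len(triples),
        "subjects": len({s for s, _, _ in triples}),
        "type_links": dict(type_links),
    }


# Hypothetical mini-dataset: one drug targeting two proteins.
triples = [
    ("ex:drug1", RDF_TYPE, "ex:Drug"),
    ("ex:prot1", RDF_TYPE, "ex:Protein"),
    ("ex:prot2", RDF_TYPE, "ex:Protein"),
    ("ex:drug1", "ex:targets", "ex:prot1"),
    ("ex:drug1", "ex:targets", "ex:prot2"),
]
metrics = dataset_metrics(triples)
```

A summary like `type_links` tells a query author which predicates connect which types before any query is written, which is the sense in which precomputed metrics can assist SPARQL formulation.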


Yes, We Can!  
Lessons from Using Linked Open Data (LOD) and Public Ontologies to Contextualize and Enrich Experimental Data

Presentation (pdf)

Erich A. Gombocz
IO Informatics, Inc., Berkeley, CA, USA

Andrea Splendiani

IO Informatics, Inc., London, UK

Mark A. Musen
Stanford Center for Biomedical Informatics Research (BMIR), Stanford, CA, USA

Robert A. Stanley
IO Informatics, Inc., Berkeley, CA, USA

Jason A. Eshleman
IO Informatics, Inc., Berkeley, CA, USA

Abstract: Semantic W3C standards provide a framework for the creation of knowledge bases that are extensible, coherent, interoperable, and on which interactive analytics systems can be developed. An ever-growing number of knowledge bases are being built on these standards, in particular as Linked Open Data (LOD) resources. The availability of LOD resources has received increasing attention and use in industry and academia.

Using LOD resources to provide value to industry is challenging, however, and early expectations have not always been met:  issues often arise from the alignment of public and experimental corporate standards, from inconsistent namespace policies, and  from the use of internal, non-formal application ontologies. Often the reliability of resources is problematic, from service levels of LOD resources and/or SPARQL endpoints to URI persistence. Furthermore, more and more “Open data” are closed for commercial use, and there are serious funding concerns related to government grant-backed resources.

With these challenges, can Semantic Web technologies provide value to Industry today?
We make the case that, yes, this can be done and is the case now.

We demonstrate a use case of successful contextualization and enrichment of internal experimental datasets with public resources, thanks to outstanding examples of LOD such as UniProt, Drugbank, Diseasome, SIDER, Reactome, and ChEMBL, as well as ontology collections and annotation services from NCBO’s BioPortal.

We show how, starting with semantically integrated experimental results from multi-year toxicology studies performed on different platforms (gene expression and metabolic profiling), a knowledge base can be built that integrates and harmonizes such information, and enriches it with public data from UniProt, Drugbank, Diseasome, SIDER, Reactome, and NCBI Biosystems. The resulting knowledge base facilitates toxicity assessment in drug development at the pre-clinical trial stage. It also provides models for classification of toxicity types (hepatotoxicity, nephrotoxicity, toxicity based on drug residues) and offers better a priori determination of adverse effects of drug combinations. In this specific use case, we were not only able to correlate responses across unrelated studies with different experimental models, but also to validate system changes associated with known common toxicity mechanisms such as oxidative stress (glutathione metabolism), liver function (bile acid and urea cycle) and ketoacidosis. Experimental observations from multi-modal -omics data can result from the same perturbation but represent very different biological processes; pharmacodynamic correlations are not necessarily functionally linked within the biological network; and genetic and metabolic changes may occur at lower doses and prior to pathological changes. For these reasons, enrichment with LOD resources offers new insights into mechanisms and has led to the discovery of new pharmacodynamically and biologically linked pathway dependencies.

As LOD resources mature, more reliable information is becoming publicly available that can enrich experimental data with computable descriptions of biological systems in ways never anticipated before and that ultimately help in understanding the experiments' results. The time and money saved from such an approach has enormous socio-economic benefits for drug companies and healthcare alike.

As a community, we need to establish business models through cooperation between industry and academic institutions that support the maintenance and extension of invaluable public LOD resources. Their effective use in enriching toxicology data exemplifies the success of using Semantic Web technologies to contextualize experimental, internal, external, clinical and public data towards faster and better understanding of biological systems and, as such, more effective outcomes in health and quality of life for all of us.



The Disease and Clinical Measurements Knowledgebase: A Case Study of Data-driven Ontology Class Evaluation and Enrichment

Nophar Geifman
Ben Gurion University, Dep. of Microbiology, Immunology and Genetics, Faculty of Health Sciences and The National Institute for Biotechnology in the Negev
Be'er Sheva, Israel

Presentation (pdf)

Abstract: Laboratory tests such as standard blood tests are commonly used for diagnosis of disease or selection of additional diagnostic procedures. For most blood or urine analytes, abnormal values (e.g. elevated serum creatinine levels) are strongly indicative of pathological states (e.g. muscle destruction). However, abnormal values are associated with pathological states in a many-to-many relationship; an abnormal value can be associated with several pathologies and vice versa. Despite this being common knowledge, it appears that a freely available formal knowledge structure holding these complex interrelationships does not exist. Furthermore, evaluation of ontologies is vital for the maturation of the Semantic Web, and data-driven ontology evaluation is an accepted approach. It also has the facility to grow the knowledge structures. Since an existing ontology is unlikely to capture all possible aspects of classification, new, additional classification may help enrich the ontology to better capture the knowledge domain. Methods and Results: As an extension to the Age-Phenotype Knowledgebase, the Disease and Clinical Measurements Knowledgebase has been developed to capture the multiple associations between disease and clinical diagnostic tests. The knowledgebase comprises a relational database and formal ontologies such as the Disease Ontology and the Clinical Measurements Ontology. The use of ontologies as part of the knowledge model provides a standardized, unambiguous method to describe entities captured from various data sources. In addition, ontology use allows complex queries to be conducted by abstraction to higher order concepts. The knowledgebase was initially populated with disease-analyte relationships extracted from textbooks. Added to these were disease-analyte relationships inferred from MeSH term co-occurrence in PubMed abstracts. Over two million PubMed abstracts were obtained, and for each abstract the associated MeSH terms were captured.
Abstracts were scanned for the co-occurrence of blood-analyte-related MeSH terms and terms from the Disease Ontology (DO). Clustering of these co-occurrences generated 67 disease clusters which share a similar pattern of blood analyte association as captured in the literature. For example, a cluster containing the diseases 'obesity', 'hypertriglyceridemia', 'atherosclerosis' and 'lipodystrophy' was characterised by a high association with blood glucose, triglycerides and cholesterol. A comparison of the disease clusters to disease classes in DO revealed numerous overlaps; many clusters were found to contain diseases which were classified together in the ontology, thus validating these classifications. On the other hand, several clusters did not completely overlap with DO classes, thus suggesting new classifications which could be added to the ontology in order to enrich the knowledge it captures. Conclusions: This work provides both an example of the incorporation of ontologies within a knowledgebase and a use case for data-driven ontology class enrichment and evaluation. Ontology evaluation and enrichment have the potential to make significant quality improvements in ontologies or classifications, particularly where they are automatically generated. While the work presented here is currently focused on blood analytes, it could easily be extended to include other clinical diagnostic measurements and symptoms and additional sources of data.
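
The co-occurrence step can be sketched in a few lines. The example below only counts disease-analyte pairs that appear together in the same abstract; it does not reproduce the clustering, and the `cooccurrence` function, the term sets and the four toy "abstracts" are hypothetical stand-ins rather than data from the study.

```python
from collections import Counter
from itertools import product


def cooccurrence(abstract_terms, diseases, analytes):
    """Count how often each (disease, analyte) pair is annotated on the
    same abstract. `abstract_terms` is an iterable of term sets."""
    counts = Counter()
    for terms in abstract_terms:
        found_diseases = terms & diseases
        found_analytes = terms & analytes
        counts.update(product(found_diseases, found_analytes))
    return counts


# Hypothetical term vocabularies and per-abstract term sets.
diseases = {"obesity", "atherosclerosis"}
analytes = {"triglycerides", "cholesterol", "blood glucose"}
abstracts = [
    {"obesity", "triglycerides", "cholesterol"},
    {"obesity", "blood glucose"},
    {"atherosclerosis", "cholesterol"},
    {"obesity", "cholesterol"},
]
counts = cooccurrence(abstracts, diseases, analytes)
```

Each disease then has a profile of analyte counts, and diseases with similar profiles are the natural candidates for grouping into a cluster.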


The ISA Infrastructure for the Biosciences: from Data Curation at Source to the Linked Data Cloud

Alejandra Gonzalez-Beltran
University of Oxford
Oxford, United Kingdom

Philippe Rocca-Serra, Eamonn Maguire, Susanna-Assunta Sansone
Oxford e-Research Centre, University of Oxford
Oxford, United Kingdom

Presentation (pdf)

Abstract: Experimental metadata is crucial for the ability to share, compare, reproduce, and reuse data produced by biological experiments. The ISAtab format, a tabular format based on the concepts of Investigation/Study/Assay (ISA), was designed to support the annotation and management of experimental data at source, with a focus on multi-omics experiments. The format is accompanied by a set of open-source tools that facilitate compliance with existing checklists and ontologies, production of ISAtab metadata, validation, conversion to other formats, and submission to public repositories, among other things. The ISAtab format together with the tools allows for the syntactic interoperability of the data and supports the ISA commons, a growing community of international users and public or internal resources powered by one or more components of the ISA metadata tracking framework. The underlying semantics of the ISAtab format is currently left to the interpretation of biologists and/or curators. While this interpretation is assisted by the ontology-based annotations that can be included in the ISAtab files, it is currently not possible to have this information processed by machines, as in the semantic web/linked data approach. In this presentation, we will introduce our ongoing isa2owl effort to transform ISAtab files into an RDF/OWL-based (Resource Description Framework/Web Ontology Language) representation, supporting semantic interoperability between ISAtab datasets. By using a semantic framework, we aim to: (1) make the ISAtab semantics explicit and machine-processable; (2) exploit the existing ontology-based annotations; (3) augment annotations over the native ISA syntax constructs with new elements anchored in a semantic model extending the Ontology for Biomedical Investigations (OBI); (4) facilitate the understanding and semantic querying of the experimental design; and (5) facilitate data integration, knowledge discovery and reasoning over ISAtab metadata and associated data.
The software architecture of the isa2owl component is engineered to support multiple mappings between the ISA syntax and semantic models. Given a specific mapping, a converter takes ISAtab datasets and produces OWL ontologies, whose TBoxes are given by the mapping and whose ABoxes are the ISAtab elements or derived ones. These derived elements result from the analysis of the experimental workflow, as represented in the ISAtab format and the associated graph representation. The implementation relies on the OWL API. As a proof of concept, we have performed a mapping between the ISA syntax and a set of interoperable ontologies anchored in the Basic Formal Ontology (BFO) version 1. These ontologies are part of the Open Biological and Biomedical Ontologies (OBO) Foundry and include OBI, the Information Artifact Ontology (IAO) and the Relations Ontology (RO). We will show how this isa2owl transformation allows users to perform richer queries over the experimental data, to link to external resources available in the linked data cloud, and to support knowledge discovery.


Enabling Australia Wide Use of SNOMED CT

David Hansen
Brisbane, Australia

Presentation (pdf)

Abstract: Australia is standardizing on SNOMED CT as the preferred clinical terminology for use in Electronic Health and Medical Records. Existing electronic systems and the legacy terminologies and vocabularies they contain represent a substantial investment in health information collections. The Snapper SNOMED CT mapping tool has been made available to all public and private companies in Australia to aid in the transition from these existing vocabularies to SNOMED CT. Method: The Snapper toolkit is based on the snorocket classifier, a fast classification and subsumption engine for EL+ ontologies. Snapper provides a user-friendly interface for creating mappings from an existing termset to concepts in the SNOMED CT ontology. Advanced features make use of the description logic foundation of SNOMED CT. Features include the ability to create post-coordinated expressions and classify the expression into the correct position in the hierarchy in real time. The use of Snapper allows users themselves to transition to SNOMED CT, adopt this standard clinical terminology, and preserve the value of their existing health information collections. Additionally, many electronic health record suppliers are using Snapper and investigating the use of a cloud-based terminology server that uses snorocket to perform subsumption queries. Results: Mapping legacy termsets to SNOMED CT eases the adoption path, allowing users to migrate from their existing terms to SNOMED CT. Mappings created include standard content for General Practices, Surgeries, Emergency Departments, and Community Health and Pharmacy systems. The lessons from creating these maps concern: the intended use of maps; whether or not all legacy termsets are deserving of migration; identification and management of legacy termset content and preparation for mapping; whether or not SNOMED CT expressions or extensions will be necessary; and how these can be maintained or deployed.
As well as these mapping uses, there have been examples of the use of some of the advanced features of Snapper. The Australian Medicines Terminology (AMT) is a catalogue of medicinal products and substances in use in Australia. Snapper and snorocket have been used to map AMT to the Substance hierarchy of SNOMED CT and thereby make drug classes and subsumption available to AMT. The snorocket classifier can then be used to produce an integrated and fully-classified extension. Snapper has also been used to develop Reference Sets (RefSets) of SNOMED CT content, suited for data retrieval and queries. Public health biosurveillance users required RefSet content to analyse live data feeds, encoded in SNOMED CT, relevant to patient presentations. RefSets developed by Snapper users are capable of detecting ‘signal’ cases of avian, swine and other forms of influenza in a patient population. Conclusion: The uses of Snapper and snorocket have shown the importance of a classification engine to the creation of maps and RefSets of SNOMED CT content. Participants will learn about the use of mapping to integrate related SNOMED CT-based terminologies, the role and value of classification and classifiers in such a process, and gain insight into post coordination and subsumption-based queries.


Semantic Benchmarking Infrastructure for Text Mining: Leverage of Corpora, Ontologies and SPARQL to Evaluate Mutation Text Mining Systems

Artjom Klein
University of New Brunswick
Saint John, Canada

Abstract: (1). Objectives and motivation. In biomedical text mining, the development of robust pipelines, publication of results and the running of comparative evaluations are greatly hindered by the lack of adequate benchmarking facilities. Benchmarks (annotated corpora) are usually designed and created for specific tasks and provide implemented hard-coded evaluation metrics. Comparative evaluations between tools and evaluation of these tools on different gold standard data sets are important for performance verification and adoption. Evaluations are hindered by the diversity and heterogeneity of formats and annotation schemas of corpora and systems. Well-known text mining frameworks such as UIMA and GATE include functionality for integration and evaluation of text mining tools based on hard-coded evaluation metrics. Unlike these approaches, we leverage semantic technologies to provide flexible and ad-hoc authoring of evaluation metrics. We report on a centralized community-oriented annotation and benchmarking infrastructure to support development, testing and comparative evaluation of text mining systems. We have deployed this infrastructure to evaluate the performance of mutation text mining systems. (2). Method. The design of the infrastructure is based on semantic standards, where RDF is used to represent the annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases programming is not needed to analyse system results. The core infrastructure comprises: 1) third-party upper-level ontologies to model annotations and text structure, 2) a domain ontology for modelling domain-specific annotations, 3) SPARQL queries for performance metrics computation, and 4) a sizeable collection of manually curated documents that can minimally support mutation grounding and mutation impact extraction.
The diversity of freely available RDF/OWL tools enables out-of-the-box use of the annotation data for corpus search and analysis, system testing and evaluation. (3). Results. We developed the Mutation Impact Extraction Ontology (MIEO) as a domain ontology to model extracted mutation-impact-related information. We seeded the infrastructure with several corpora (242 documents in total) supporting at least two mutation text mining tasks: mutation grounding to proteins and extraction of mutation impacts on molecular functions of proteins; and developed SPARQL queries to perform calculations of relevant metrics. To facilitate a preliminary evaluation of our infrastructure for comparative evaluation, we integrated the freely available mutation impact extraction systems into the infrastructure, and developed a set of SPARQL queries to perform cross-evaluation on available mutation impact corpora. (4). Conclusions. We present an evaluation system for benchmarking and comparative evaluation of mutation text mining systems designed for use by BioNLP developers, biomedical corpus curators, and bio database curators. Corpora and text mining outputs modelled in terms of underlying ontologies can be readily integrated into the infrastructure for benchmarking and evaluation. The generic nature of the solution makes it flexible, easily extendable, re-usable and adoptable for new domains. Flexible SPARQL allows ad-hoc search and analysis of corpora and the implementation of evaluation metrics without requiring programming skills. This is the primary example of benchmarking infrastructure for mutation text mining.
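
The infrastructure computes its metrics in SPARQL; the Python sketch below only illustrates the underlying set arithmetic such a query would perform when comparing system output against a gold standard. The `evaluate` function and the (document, mutation, protein) tuples are hypothetical and are not taken from the corpora described.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, F1) for two sets of annotations,
    counting exact matches as true positives."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mutation-grounding annotations: (document, mutation, protein).
gold = {
    ("doc1", "A123T", "P12345"),
    ("doc1", "G45D", "P12345"),
    ("doc2", "L10P", "Q99999"),
}
predicted = {
    ("doc1", "A123T", "P12345"),
    ("doc2", "L10P", "Q99999"),
    ("doc2", "K7R", "Q99999"),
    ("doc3", "R5H", "O00001"),
}
precision, recall, f1 = evaluate(gold, predicted)
```

Because annotations are compared as plain sets, swapping in a different matching rule (for example, ignoring the protein identifier) only changes how the tuples are built, not the metric code.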


Building the App Store for Health and Discovery

Kenneth Mandl
Harvard Medical School/Boston Children's Hospital
Boston, United States

Presentation (pdf)

Abstract: Most vendor electronic health record (EHR) products are architected monolithically, making modification difficult for hospitals and physician practices. An alternative approach is to reimagine EHRs as iPhone-like platforms that support substitutable apps-based functionality. Substitutability is the capability inherent in a system of replacing one application with another of similar functionality. Substitutability requires that the purchaser of an app can replace one application with another without being technically expert, without re-engineering the other applications they are using, and without having to consult or require the assistance of any vendor of previously installed or currently installed applications. A deep commitment to substitutability enforces key properties of a health information technology ecosystem. Because an app can be readily discarded, the consumer or purchaser of these applications is empowered to define what constitutes value in information technology. Apps necessarily compete with each other, promoting progress and adaptability. The Substitutable Medical Applications, Reusable Technologies (SMART) Platforms project seeks to develop a health information technology platform with substitutable apps constructed around core services. It is funded by a $15M grant from the Office of the National Coordinator for Health Information Technology’s Strategic Health IT Advanced Research Projects (SHARP) Program. All SMART standards are open and the core software is open source. The goal of SMART is to create a common platform to support an “app store for health” as an approach to drive down healthcare costs, support standards evolution, accommodate differences in care workflow, foster competition in the market, and accelerate innovation. 
The SMART project focuses on promoting substitutability through an application programming interface (API) that can be adopted as part of a “container” built around a wide variety of health technology platforms, providing read-only access to the underlying data model and a software development toolkit to readily create apps. SMART containers are health IT systems that have implemented the SMART API or a portion of it. Containers marshal data sources and present them consistently across the SMART API. SMART applications consume the API and are substitutable. SMART has sparked an ecosystem of app developers and attracted existing health information technology platforms to adopt the SMART API, including traditional, open source, and next-generation EHRs, patient-facing platforms and health information exchanges. SMART-enabled platforms to date include the Cerner EMR, the WorldVista EMR, the OpenMRS EMR, the i2b2 analytic platform, and the Indivo X personal health record. The SMART team is working with the Mirth Corporation to SMART-enable the HealthBridge and Redwood MedNet Health Information Exchanges. We have demonstrated that a single SMART app can run, unmodified, in all of these environments, as long as the underlying platform collects the required data types. Going forward, we seek to design approaches to enable the nimble customization of health IT for the clinical and translational enterprises.
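
The container idea can be sketched in a few lines of Python. This is not the SMART API: the SmartContainer interface, its get_medications method, the two toy containers and the medication names are hypothetical and exist only to show how one app can run unmodified on platforms that store their data differently.

```python
from typing import Dict, List, Protocol, Tuple


class SmartContainer(Protocol):
    """Hypothetical stand-in for the read-only API a container exposes."""

    def get_medications(self, patient_id: str) -> List[str]: ...


class EhrContainer:
    """Toy container backed by a dict that maps patients to medication lists."""

    def __init__(self, records: Dict[str, List[str]]) -> None:
        self._records = records

    def get_medications(self, patient_id: str) -> List[str]:
        return list(self._records.get(patient_id, []))


class PhrContainer:
    """Toy container backed by a flat list of (patient, drug) rows."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows

    def get_medications(self, patient_id: str) -> List[str]:
        return [drug for pid, drug in self._rows if pid == patient_id]


def medication_list_app(container: SmartContainer, patient_id: str) -> str:
    """An 'app' that talks only to the shared API, never to the storage behind it."""
    meds = sorted(container.get_medications(patient_id))
    return f"{patient_id}: " + (", ".join(meds) if meds else "no medications")


# The same app function runs against both containers without modification.
ehr = EhrContainer({"p1": ["drug-b", "drug-a"]})
phr = PhrContainer([("p1", "drug-a"), ("p1", "drug-b"), ("p2", "drug-c")])
ehr_view = medication_list_app(ehr, "p1")
phr_view = medication_list_app(phr, "p1")
```

The app depends only on the interface, so swapping the container, or swapping the app for another one that consumes the same interface, requires no change on the other side.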


CDISC2RDF - Make Clinical Data Standards Linkable, Computable and Queryable

Charles Mead
Octo Consulting Group
Washington, United States

Eric Prud'hommeaux
Cambridge, United States

Presentation (pdf)

Abstract: Clinical data standards have been identified as one of five initial areas by TransCelerate BioPharma, the non-profit organization formed by ten leading pharmaceutical companies to accelerate the development of new medicines. The European Medicines Agency (EMA) is developing a policy on the proactive publication of clinical-trial data in the interests of public health, including clear and understandable clinical data formats. The FDA has a long-held goal of making better use of submitted clinical trial data. Pharmaceutical companies have attempted to use submission standards to create study repositories. Exploiting Semantic Web technologies stands to simplify the interpretation of individual studies and improve cross-study integration. Method: The CDISC2RDF initiative applies semantic web standards and linked data principles to clinical data standards from CDISC (Clinical Data Interchange Standards Consortium). This has been proposed by early adopters in AstraZeneca and Roche as a way to make clinical data standards linkable, computable and queryable beyond today’s disconnected PDF and Excel files. CDISC2RDF is a cross-pharma pre-competitive project with Roche, AstraZeneca, TopQuadrant, Free University of Amsterdam and W3C HCLS. Results: This presentation will describe the results from phase 1) Standards-as-is, covering standards for submission (SDTM), analysis (ADaM) and data capture (CDASH) structures and terminologies. It will also discuss ideas for the next two phases: 2) Standards-in-context, and 3) Interoperability across standards and the data collected using them.


Building a Knowledge Base for Cancer Genomics

Eric Neumann, Alex Parker, Rachel Erlich
Foundation Medicine

Abstract:  Next-generation sequencing (NGS) is becoming an increasingly important part of healthcare, providing new genomic insights into diseases and their treatments. Therapies for cancer in particular will benefit greatly, since cancer arises from a series of genetic alterations affecting cellular proliferation and survival—routine genomic testing will guide more effective treatments.  To this end, Foundation Medicine® (FMI) has developed FoundationOne™, a comprehensive cancer genomic profiling test based on next generation sequencing.

Growing knowledge and clinical application of cancer genomics has changed the oncology landscape in recent years, enabling therapeutic options that specifically target the genomic drivers of a patient’s unique cancer. While this approach may offer more efficient and less toxic options than traditional chemotherapy, physicians need to be able to match each patient with the right drug for their unique cancer, which requires a comprehensive genomic profile of the patient’s tumor and an expansive knowledge of cancer genomics.

FMI has made this complex information readily available for any clinical practice. Using highly sensitive and accurate next-generation sequencing on small amounts of routine FFPE cancer tissue, the FoundationOne assay interrogates the entire coding sequence of hundreds of tumor genes known to be rearranged or altered in cancer, based on recent scientific and clinical literature. Genomic alterations are matched to relevant targeted therapies, either approved or in clinical trials, that could be a rational choice for the patient based on the genomic profile of their cancer. This information is reported to the patient’s physician, usually within three weeks, via the FoundationOne Interactive Cancer Explorer, the company’s online reporting platform.

To support personalized cancer treatments based on FoundationOne, FMI is compiling the world’s most comprehensive cancer genomic alteration knowledgebase (KB). These data can be further analyzed for the combinations of mutations and cancers that best respond to specific drugs. The compilation of thousands of observed cancer genomic alterations for each of the cancers is being linked using RDF to published results, molecular cancer databases, and clinical trials, creating a densely interconnected knowledge base. The KB serves multiple applications internally as well as externally, taking full advantage of the power and flexibility of Linked Semantic Data. The associated molecular and clinical knowledge will enable oncologists and researchers to probe deeper into the mechanisms behind each patient’s cancer and potential cancer treatments, promising to dramatically accelerate improvement of existing therapies and the discovery of new ones.

Over time, FMI will continue to expand and analyze the Cancer Genomic Knowledge Base, merging new forms of information that scientists and clinicians see as key for understanding cancer. The use of RDF standards and Linked Semantic Data protocols gives us the ability to grow and maintain the KB with new insights, some of which will be inferred directly from it.


Semantically Enabling Genetic Medicine to Facilitate Patients and Guidelines Matching and Enhanced Clinical Decision Support

Matthias Samwald
Medical University of Vienna
Vienna, Austria

Abstract: The delivery of genomic medicine requires an integrated view of molecular and genetic profiles coupled with actionable clinical guidelines that are based on formalisms and definitions such as the star allele nomenclature. However, the identification of new variants can change these definitions and impact guidelines for patient treatment. We present a system that makes use of semantic technologies, such as an OWL 2-based ontology and automated reasoning, for 1) providing a simple and concise formalism for representing allele and phenotype definitions, 2) detecting inconsistencies in definitions, 3) automatically assigning alleles and phenotypes to patients and 4) matching patients to clinically appropriate pharmacogenetic guidelines and clinical decision support messages. This development is coordinated through the Health Care and Life Science Interest Group of the World Wide Web Consortium (W3C). Method: We created an expressive OWL 2 ontology by automatically extracting or manually curating data from dbSNP, clinically relevant polymorphisms and allele definitions from PharmGKB, clinically relevant polymorphisms from the OMIM database, the Human Cytochrome P450 nomenclature database, guidelines issued by the Clinical Pharmacogenetics Implementation Consortium (CPIC) and the Royal Dutch Pharmacogenetics Working Group, FDA product labels and other relevant data sources. We used highly scalable OWL 2 reasoners (e.g., TrOWL) for analysing the aggregated data and for classifying genetic profiles. Results: We demonstrate how our approach can be used for identifying errors and inconsistencies in primary datasets, as well as for inferring alleles, phenotypes and matching clinical guidelines from the genetic profiles of patients. Conclusion: We invite stakeholders in clinical genetics to participate in the further development and application of the formalism and system we developed, with the potential goal of establishing it as an open standard for clinical genetics.
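
The four tasks listed in the abstract can be pictured with a deliberately simplified sketch. It uses plain Python set logic, not OWL 2 reasoning, and it ignores real-world complications such as phasing and diplotypes. Every allele name, variant identifier, phenotype label and message below is invented for illustration and does not come from PharmGKB, CPIC or any other source named above.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# All identifiers below are made up for illustration.
ALLELE_DEFINITIONS: Dict[str, FrozenSet[str]] = {
    "GENE1*2": frozenset({"var_a"}),
    "GENE1*3": frozenset({"var_a", "var_b"}),
    "GENE1*4": frozenset({"var_c"}),
}
ALLELE_PHENOTYPE: Dict[str, str] = {
    "GENE1*2": "phenotype-intermediate",
    "GENE1*3": "phenotype-poor",
    "GENE1*4": "phenotype-poor",
}
GUIDELINE_MESSAGES: Dict[str, str] = {
    "phenotype-intermediate": "hypothetical message M1",
    "phenotype-poor": "hypothetical message M2",
}


def find_duplicate_definitions(
    definitions: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, str]]:
    """Report pairs of alleles defined by the same variant set (an inconsistency)."""
    names = sorted(definitions)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if definitions[a] == definitions[b]]


def assign_alleles(patient_variants: Set[str],
                   definitions: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return the most specific alleles whose defining variants the patient carries."""
    matched = [name for name, needed in definitions.items()
               if needed <= patient_variants]
    # Drop an allele when another matched allele has a strictly larger definition.
    return sorted(name for name in matched
                  if not any(definitions[name] < definitions[other]
                             for other in matched))


def match_guidelines(alleles: List[str]) -> List[str]:
    """Look up the messages attached to the phenotypes of the assigned alleles."""
    phenotypes = {ALLELE_PHENOTYPE[allele] for allele in alleles}
    return sorted(GUIDELINE_MESSAGES[p] for p in phenotypes)


patient_variants = {"var_a", "var_b"}
alleles = assign_alleles(patient_variants, ALLELE_DEFINITIONS)
messages = match_guidelines(alleles)
# Adding a new allele with an already-used definition triggers the consistency check.
duplicates = find_duplicate_definitions(
    {**ALLELE_DEFINITIONS, "GENE1*5": frozenset({"var_c"})}
)
```

In the system described in the abstract these steps are handled by an OWL 2 reasoner classifying patient profiles against the ontology. The sketch hard-codes the corresponding checks only to make the data flow visible.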


A Clinical Information Management Platform Using Semantic Technologies

Christian Seebode
ORTEC Medical
Berlin, Germany

Presentation (pdf)

Abstract: Medical procedures generate a vast amount of data from various sources. An efficient and comprehensive integration and exploitation of these data will be one of the success factors for improving health care delivery to the individual patient, making health care services more cost-effective at the same time. In order to support effective mining, selection and presentation of medical data for clinical or patient-centered use cases, either text data or structured clinical data from Health Information Systems (HIS) has to be enriched with semantic meta-information and has to be available at any point during the data value chain. We present a platform which combines an approach to semantic extraction of medical information from clinical free-text documents with the processing of structured information from HIS records. The information extraction uses a fine-grained linguistic analysis and maps the preprocessed terms to the concepts of domain-specific ontologies. These domain ontologies comprise knowledge from various sources, including expert knowledge and knowledge from public medical ontologies and taxonomies. The processes of ontology engineering and rule generation are supported by a semantic workbench that enables an interactive identification of those linguistic terms in clinical texts that denote relevant concepts. This supports incremental refinement of semantic information extraction. Facts extracted from both clinical free texts and structured sources represent chunks of knowledge. They are stored in a Clinical Data Repository (CDR) using a common document-oriented storage model, which takes advantage of an application-agnostic format in order to support different use cases. It furthermore supports version control of facts, reflecting the evolution of information. 
Enrichment algorithms aggregate further information by generating statistical information, search indexes, or decision recommendations. The CDR generally separates processes of information generation from processes of information processing or consumption, and thus supports smart partitioning of data for scalable application architectures. The applications hosted on the platform retrieve facts from the CDR by subscribing to the event stream provided by the CDR. The first applications implemented on top of that platform support specific scenarios of clinical research, like recruiting patients for clinical trials, answering feasibility studies, or aggregating data for epidemiological studies. Further applications address patient-centered use cases like second opinion or dialogue support. The web-based application StudyMatcher maps study criteria to a list of cases and their medical facts. Trial teams may define study criteria in interaction with the knowledge resources. The application automatically generates a list of candidate cases. Since the user interface links the facts extracted by the system to the original sources (e.g. the clinical documentation), users are able to check with low effort whether or not a fact has been recognized correctly by the system and matched correctly with the given criteria. This strategy of combining automatic and supervised fact generation promises to be a reasonable approach to improving the semantic exploitation of data. Platform and applications are developed in cooperation with Europe's leading healthcare providers Charité and Vivantes and will be rolled out in January 2013. The work is carried out in cooperation with DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz).
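
The repository and subscription pattern described in this abstract can be sketched as follows. This is a minimal in-memory illustration and not the ORTEC platform: the class names, method names, concept labels and source identifiers are hypothetical, and exact string matching stands in for whatever criteria logic StudyMatcher really uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    case_id: str
    concept: str   # ontology concept the fact was mapped to
    value: str
    source: str    # pointer back to the originating document or record
    version: int


class ClinicalDataRepository:
    """Keeps every version of a fact and notifies subscribers of new ones."""

    def __init__(self) -> None:
        self._history: Dict[Tuple[str, str], List[Fact]] = {}
        self._subscribers: List[Callable[[Fact], None]] = []

    def subscribe(self, handler: Callable[[Fact], None]) -> None:
        self._subscribers.append(handler)

    def store(self, case_id: str, concept: str, value: str, source: str) -> Fact:
        versions = self._history.setdefault((case_id, concept), [])
        fact = Fact(case_id, concept, value, source, version=len(versions) + 1)
        versions.append(fact)
        for handler in self._subscribers:   # the "event stream"
            handler(fact)
        return fact

    def history(self, case_id: str, concept: str) -> List[Fact]:
        return list(self._history.get((case_id, concept), []))


class StudyMatcherSketch:
    """Consumes fact events and lists cases whose latest facts meet all criteria."""

    def __init__(self, criteria: Dict[str, str]) -> None:
        self._criteria = criteria
        self._latest: Dict[str, Dict[str, Fact]] = {}

    def on_fact(self, fact: Fact) -> None:
        self._latest.setdefault(fact.case_id, {})[fact.concept] = fact

    def candidates(self) -> Dict[str, List[str]]:
        """Map each matching case to the sources that back the matched facts."""
        result: Dict[str, List[str]] = {}
        for case_id, facts in self._latest.items():
            if all(c in facts and facts[c].value == v
                   for c, v in self._criteria.items()):
                result[case_id] = sorted(facts[c].source for c in self._criteria)
        return result


cdr = ClinicalDataRepository()
matcher = StudyMatcherSketch({"diagnosis": "condition-x", "smoker": "no"})
cdr.subscribe(matcher.on_fact)
cdr.store("case1", "diagnosis", "condition-x", "letter-17")
cdr.store("case1", "smoker", "yes", "record-3")
cdr.store("case1", "smoker", "no", "letter-18")   # newer version replaces the old view
cdr.store("case2", "diagnosis", "condition-y", "letter-40")
candidates = matcher.candidates()
smoker_versions = [f.version for f in cdr.history("case1", "smoker")]
```

Returning the source identifiers alongside each candidate mirrors the abstract's point that users can trace a matched fact back to the original documentation and check it.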