Conference on Semantics in Healthcare and Life Sciences (CSHALS)

Presenters

Updated February 7, 2011

Semantic Infrastructure for Automated Small Molecule Classification and Data Mining for Lipidomics

Christopher Baker
University of New Brunswick
Saint John, CA

Presentation Abstract: Background: The development of high-throughput experimentation and combinatorial chemistry has led to rapid growth in the number of biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. At the same time, efforts to annotate, classify, analyze, and link these chemical entities to disparate data sources have largely remained in the hands of human curators using manual or semi-automated protocols. Since a chemical's function is often closely linked to its structure and, concomitantly, to its position within a chemical ontology, the accurate classification and annotation of chemical entities is of primary importance in understanding their functionality as well as the full spectrum of potential applications. Unfortunately, neither the expressivity of formal ontologies nor the potential of Semantic Web Technologies (SWT) to integrate disparate computational services has been fully exploited within the lipidomics and metabolomics communities. Results: As part of a case study in the utility of SWT for chemical classification, we have developed a prototype framework for automated lipid classification and annotation. This framework comprises the following components. The first is a formal lipid ontology developed in OWL-DL, based in part on the lipid class hierarchy from the LIPIDMAPS database and relevant literature. The Lipid Ontology [ICBO2009] relies on structural features of small molecules to formally describe lipid classes. The second is a set of federated Semantic Web services, deployed within the SADI framework, that invoke the automated logical classification task. The first service, a structural annotation service, detects and enumerates relevant chemical subgraphs in a given input chemical graph. The second, a classifier service, assigns chemical entities to appropriate ontology classes by reasoning over the class descriptions in the ontology and checking them against the set of chemical subgraphs provided by the structural annotation service. We illustrate the utility of these core services with the use case of eicosanoid classification and combine them with additional SADI services linking the annotated lipids to related proteins found in the biomedical literature or in public databases. Using these services, we further contrast the performance of automated eicosanoid classification with existing lipid nomenclature systems and curated lipid databases, and reflect on the contribution of our methodology in the context of high-throughput lipidomics. Conclusions: The prototype Semantic Web service framework we have developed is capable of accurate automatic classification of lipids and of integrating information on given chemical entities from relevant databases. The services we provide within this framework can also be reused in other contexts and adapted to diverse lipidomics computational workflows. We conclude that SWT can provide an accurate and versatile means of classification and annotation of chemical entities.
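
To make the annotate-then-classify pattern concrete, here is a minimal Python sketch, assuming RDKit for subgraph detection. The SMARTS patterns, class definitions, and input SMILES are illustrative stand-ins, not the actual Lipid Ontology axioms or SADI service interfaces.

    # Minimal sketch of the two-service pattern described above. Set
    # inclusion stands in for OWL-DL reasoning over class descriptions;
    # all patterns and class names are illustrative.
    from rdkit import Chem

    # Hypothetical structural features, detected as chemical subgraphs.
    FEATURES = {
        "has_carboxyl_group": Chem.MolFromSmarts("C(=O)[OH]"),
        "has_hydroxyl_group": Chem.MolFromSmarts("[CX4][OH]"),
    }

    # Hypothetical class definitions: a class applies when all of its
    # required features are present.
    CLASSES = {
        "FattyAcid": {"has_carboxyl_group"},
        "HydroxyFattyAcid": {"has_carboxyl_group", "has_hydroxyl_group"},
    }

    def annotate(smiles):
        """Structural annotation service: enumerate matched subgraphs."""
        mol = Chem.MolFromSmiles(smiles)
        return {name for name, patt in FEATURES.items()
                if mol.HasSubstructMatch(patt)}

    def classify(features):
        """Classifier service: assign every class whose definition holds."""
        return [cls for cls, required in CLASSES.items()
                if required <= features]

    features = annotate("OC(CCCCCCCC(O)=O)CC=C")  # toy hydroxy fatty acid
    print(classify(features))  # -> ['FattyAcid', 'HydroxyFattyAcid']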


The Phenoscape Knowledgebase: Linking Evolutionary Diversity to Genetic Data Using Phenotype Ontologies

James Balhoff
National Evolutionary Synthesis Center
Durham, US

Presentation Abstract: Objectives and motivation: Phenotypic differences among species have long been systematically itemized and described by biologists in the process of investigating phylogenetic relationships and trait evolution. Traditionally, these descriptions have been expressed in natural language within the context of individual journal publications or monographs. As such, this rich store of phenotype data has been largely unavailable for statistical and computational comparisons across studies or for integration with other biological knowledge. We have created the Phenoscape Knowledgebase, which consists of a database and web application (http://kb.phenoscape.org/). The database combines ontologically annotated phenotypic character data for a large and diverse group of fishes with phenotypic annotations from the ZFIN model organism database. The web application provides query and browsing interfaces that allow users to exploit the logical framework provided by the ontologies which underpin the data. Method: We used OBD ("Ontology-based Database") to store phenotypic data from ~50 phylogenetic publications as statements using terms from ten different OBO ontologies. The phenotypic data, taxa, and specimens in these published data sets were annotated with ontology terms using our curation application, Phenex. In this process, free-text phenotype descriptions were converted to semantic representations using an Entity-Quality (EQ) model, combining terms from separate anatomical and qualitative ontologies. The ontologies and annotated data sets, along with EQ phenotype annotations for zebrafish genes exported from the ZFIN database, were loaded into OBD using its own triple-based schema. We used the SQL-based OBD reasoner to pre-compute inferred statements and add them to the Knowledgebase. We developed a web services API providing access to the Knowledgebase using the Restlet Java framework. We also developed a Ruby on Rails-based end-user web interface, which allows biologists to query the Knowledgebase, accessing the data via these public web services. Results: The Phenoscape Knowledgebase integrates over 500,000 asserted phenotype statements, concerning ~2500 fish species, with over 20,000 phenotype statements linked to over 3700 zebrafish genes. Users can discover fish species matching arbitrary phenotypic profiles, expressed as queries that make use of the hierarchical nature of the anatomical, qualitative, and taxonomic ontologies. Moreover, genes influencing these phenotypes can be simultaneously returned. Users can also visualize the structure and explore term definitions of the included ontologies. The Knowledgebase has been used to investigate patterns of anatomical coverage within published phylogenetic characters, as well as to generate hypotheses for candidate genes underlying evolutionary losses of both scales and skeletal elements. Conclusion: Ontological annotations of free-text phenotypic data, built with shared community-driven ontologies, constitute a powerful resource when aggregated within a database system that makes full use of the semantic framework provided by those ontologies. For the first time, scientists can search phenotypic content from dozens of phylogenetic publications, querying across anatomical, qualitative, and taxonomic axes.
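
As a rough illustration of the EQ model and hierarchy-aware querying, the following Python sketch uses a toy is_a hierarchy and invented annotations; the real Knowledgebase reasons over full OBO ontologies.

    # Toy sketch of Entity-Quality (EQ) annotation and hierarchy-aware
    # matching. Term names and is_a links are illustrative only.
    IS_A = {
        "dorsal fin": "fin",
        "anal fin": "fin",
        "absent": "quality",
        "reduced": "quality",
    }

    def ancestors(term):
        """Return the term and all of its is_a ancestors."""
        result = {term}
        while term in IS_A:
            term = IS_A[term]
            result.add(term)
        return result

    # EQ phenotype annotations: (taxon, entity term, quality term).
    ANNOTATIONS = [
        ("Danio rerio", "dorsal fin", "absent"),
        ("Polyodon spathula", "anal fin", "reduced"),
    ]

    def query(entity, quality):
        """Find taxa annotated at or below the query terms."""
        return [taxon for taxon, e, q in ANNOTATIONS
                if entity in ancestors(e) and quality in ancestors(q)]

    # A query at the level of 'fin' matches both specific annotations.
    print(query("fin", "quality"))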


Identifying Unexpected Associations in Integrated Biomedical Data Sets: Novel Navigation, Analysis & Visualization Interaction Patterns for Semantic TripleStores

Christopher Bouton
Entagen, LLC
Newburyport, US

Presentation Abstract: A promise of semantic technologies is the facile integration of large quantities of disparate data. In the biomedical research and development (R&D) sector, this type of integration is essential for the potential identification of connections across entity domains (e.g., compounds to targets, targets to indications, pathways to indications). However, the vast majority of data currently utilized in biomedical R&D settings is not integrated in ways that make it possible for researchers to intuitively navigate, analyze, and visualize these types of interconnections. Using the Linking Open Drug Data (http://esw.w3.org/HCLSIG/LODD) data sets, we have been experimenting with novel forms of biomedical data integration, navigation, analysis, and visualization through the development of a web-based rich internet application (RIA). An essential goal of this work is the creation of user interface paradigms that enable "bench" researchers to intuitively identify unexpected associations which may drive their research forward through the iterative process of effective hypothesis generation and subsequent testing.


Semantic Repository of Genomics Experiments

Sudeshna Das
Mass. General Hospital, Harvard Medical School
Cambridge, US

Presentation Abstract: Objectives: Genome-wide experiments are routinely conducted to study gene expression, DNA-protein binding, and epigenetic status. The importance of structured meta-data for these experiments, for integration and reuse, is widely recognized. For this purpose, first the MIAME standard was developed for microarrays, and recently the ISA-TAB format was published as a generalized format for experiments employing omics technologies. Several MIAME-compliant repositories exist for genomics data, notably ArrayExpress and GEO. However, these are not yet widely available as Linked Data compliant with standard biomedical ontologies such as the MAGE Ontology (MO), the Ontology for Biomedical Investigations (OBI), and the Experimental Factor Ontology (EFO). Researchers need friendly, useful, and reusable software environments that can automatically produce such Linked Data. Method: We have developed reusable software to build semantic repositories of genomics experiments. Our software is based on the open-source content-management system Drupal (www.drupal.org). The primary content type is an experiment, which has a title, researcher, and design details, and comprises one or more bioassays. An experiment can be linked to publication(s). Bioassays are processes that have biomaterials and technologies as participants and data files as output. The main classes are mapped to MO and OBI. Biomaterials have various characteristics such as organism, disease state, and cell type. These characteristics are mapped to existing published biomedical concepts. The data is entered in a structured format, eliminating the need for future curation. We then use RDF modules in Drupal to produce Linked Data and a SPARQL endpoint. Results: We have developed two repositories using this software in separate domains. One is a repository for hematopoietic stem cell data (http://bloodprogram.hsci.harvard.edu). It contains over 100 microarray, transcription factor binding, and histone modification experiments. The majority of the data is from microarray experiments performed on model organisms (mouse and zebrafish) and encompasses various cell types and disease states. The cell types were mapped to the Cell Ontology (CL). The other repository comprises microarray profiles from Parkinson's disease patients (http://pdexpression.org/). The disease subtypes and subject tissues were mapped to standard terms with the help of the NCBO (National Center for Biomedical Ontology) annotation tool. The use of standard terminologies to describe the biomaterials allows interoperability with other repositories. However, finding the most appropriate mappings remains a challenge. For example, when mapping the cell types, quite a few entries were missing, whereas "Parkinson's Disease" was found in over 20 systems. Addressing these issues is as much a social process as a technological one. Conclusions: The main benefit of our software is the ability to create Linked Data at the time of data entry, eliminating the need for later curation. For each domain we can deploy an instance of the software that is pre-populated with relevant terms (mapped to existing terminologies) from that field. As more communities begin to adopt such reusable infrastructure and make Linked Data available, we will begin to address the integration challenge currently posed to biomedical researchers.
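
A minimal rdflib sketch of how an experiment and one of its bioassays might be published as Linked Data is shown below; all URIs and property names are hypothetical stand-ins, not the repository's actual MO/OBI-mapped vocabulary.

    # Hypothetical Linked Data rendering of the content model described
    # above: an experiment that comprises a bioassay with a biomaterial
    # and an output data file.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS

    EX = Namespace("http://example.org/repo/")        # hypothetical base URI
    OBI = Namespace("http://purl.obolibrary.org/obo/OBI_")

    g = Graph()
    g.bind("ex", EX)

    exp = EX["experiment/42"]
    assay = EX["bioassay/42-1"]

    g.add((exp, RDF.type, OBI["0000066"]))            # an OBI class (ID illustrative)
    g.add((exp, RDFS.label, Literal("HSC microarray time course")))
    g.add((exp, EX.hasBioassay, assay))               # hypothetical property
    g.add((assay, EX.usesOrganism, Literal("Mus musculus")))
    g.add((assay, EX.hasOutputFile, Literal("GSM0000.CEL")))

    print(g.serialize(format="turtle"))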


Rendering Medical Documents in RDF: Strategies and Gotchas

John F. Madden
Duke University
Durham, US

Presentation Abstract: Clinical medical records consist of documents such as laboratory reports, physicians' progress notes, admission summaries, etc. They often contain a mixture of full-sentence, natural language text and "bullet-point" or form-like content.

Non-explicit knowledge (the document's purpose, genre, temporal context, the author's background knowledge, etc.), as well as references to external assertions found in other documents, heavily conditions the meaning of such documents. Rendering the content of such a document in RDF is therefore a complex act of interpretation, akin to translation. There is no single "correct" RDF rendering.

We will examine sample medical documents and study some possible renderings into RDF/OWL, with the purpose of highlighting common challenges including the following:

  • dealing with anaphora, i.e., candidate triples whose appropriate subject is ambiguous or multiple ("sodium 142 mM": Whose sodium? The patient's? The patient's serum? The sample of the patient's serum delivered to the laboratory? etc.?); see the sketch after this list
  • instances versus classes, especially when using legacy vocabularies ("Jim has influenza": Does Jim have SNOMED-influenza, or does he have an instance of SNOMED-influenza?)
  • dealing with references to assertions in other documents ("My colleague Dr. Smith diagnosed pneumonia last week": Is pneumonia the relevant fact, or is the diagnosis of pneumonia the relevant fact? How do I represent the difference?)
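
As a small illustration of the first challenge, the Python sketch below (using rdflib, with entirely hypothetical URIs) renders "sodium 142 mM" in two of the possible ways; the choice between them is an interpretive decision, not a mechanical one.

    # Two alternative RDF renderings of "sodium 142 mM", sketched with
    # rdflib. All URIs are hypothetical; the point is that choosing the
    # subject of the triple is an act of interpretation.
    from rdflib import Graph, Namespace, Literal, BNode, RDF

    EX = Namespace("http://example.org/ehr/")  # hypothetical vocabulary
    g = Graph()

    # Rendering 1: the sodium value is predicated directly of the patient.
    g.add((EX.patient1, EX.sodium_mM, Literal(142)))

    # Rendering 2: the value belongs to a measurement of a serum sample,
    # which keeps the question "whose sodium?" explicitly answerable.
    obs = BNode()
    g.add((obs, RDF.type, EX.SodiumMeasurement))
    g.add((obs, EX.specimen, EX.serumSample7))
    g.add((obs, EX.value_mM, Literal(142)))
    g.add((EX.serumSample7, EX.drawnFrom, EX.patient1))

    print(g.serialize(format="turtle"))
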

Advancing Regulatory Science for Public Health – An FDA Perspective

Vicki Seyfert-Margolis
US Food and Drug Administration

Presentation Abstract: For breakthroughs in science and technology to reach their full potential, FDA must play an increasingly integral role as an agency dedicated not just to ensuring safe and effective products, but also to promoting public health and participating more actively in the scientific research enterprise directed towards new treatments and interventions. We must also modernize our evaluation and approval processes to ensure that innovative products reach the patients who need them, when they need them. These new scientific tools, technologies, and approaches form the bridge to critical 21st-century advances in public health. They form what we call regulatory science: the science of developing new tools, standards, and approaches to assess the safety, efficacy, quality, and performance of FDA-regulated products.


NoSQL: New Possibilities for Distributed Scientific Data Management, Workflow and Collaboration

Mike Miller
Assistant Research Professor, U. Washington
Founder/Chief Scientist, Cloudant Inc.
Boston, USA

Presentation Abstract: Inspired by new problems (exploding sensor data, complex workflows, geo-distribution, etc.), there has been a dramatic renaissance of alternatives to classic relational database management systems. We briefly review these "NoSQL" implementations, including key/value stores, big tables, document stores, and graph stores. Next we focus on specific qualities that enable new possibilities for scientific data management, processing, and analysis, in particular: flexibility, scalability, expressiveness, REST interfaces, concurrency, replication, and cloud hosting. Finally, we discuss relevant applications in the physical and biological sciences.
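
As a toy illustration of two of these models, the pure-Python sketch below contrasts an opaque key/value lookup with a document-store style map/reduce view (in the spirit of CouchDB views); the documents and functions are invented for the example and imply no particular product's API.

    # Key/value store idiom: opaque values, retrieval by key only.
    kv = {"run:2011-02-01": b"...raw sensor blob..."}
    print(kv["run:2011-02-01"])

    # Document store idiom: schema-flexible documents, queried through a
    # map/reduce view (emit key/value pairs per document, then reduce the
    # values that share a key).
    docs = [
        {"sensor": "A", "day": "2011-02-01", "reading": 3},
        {"sensor": "A", "day": "2011-02-02", "reading": 4},
        {"sensor": "B", "day": "2011-02-01", "reading": 1},
    ]

    def map_fn(doc):
        yield doc["sensor"], doc["reading"]

    def reduce_fn(values):
        return sum(values)

    view = {}
    for doc in docs:
        for key, value in map_fn(doc):
            view.setdefault(key, []).append(value)

    print({key: reduce_fn(values) for key, values in view.items()})
    # -> {'A': 7, 'B': 1}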


PharmaConnect: Development of an Integrated Knowledge Platform by Extracting, Integrating and Analyzing Information to Support Systematic, Evidence-Based Decision Making in R&D

Sherri Matis-Mitchell
AstraZeneca Pharmaceuticals
Wilmington, US

Presentation Abstract: The Knowledge Engineering initiative within AstraZeneca has recently delivered the first version of a knowledgebase that integrates internal and external evidence for connections between key concepts such as targets, pathways, compounds, diseases, and preclinical and clinical outcomes, drawn from the Chemistry, Competitive, Disease, and Safety Intelligence workstreams. This talk will describe the system, its architecture, and its development; demonstrate the impact of this new platform with specific examples; and discuss lessons learned during its development. We will also detail linkages to additional data sources and systems, as well as plans for the future.


Conceptual Interoperability and Biomedical Data

James McCusker
Rensselaer Polytechnic Institute
Troy, US

Presentation Abstract: Computable semantic interoperability among domain models in biomedicine, as well as interoperability with cross-cutting models, has become a major concern in biomedical research. The National Cancer Institute Center for Biomedical Informatics and Information Technology (NCI CBIIT) has begun the next phase of developing caBIG semantic interoperability through the adoption of layered semantics and data models. We discuss a possible mapping between conceptual and logical models. This mapping technique leverages OWL annotation capabilities paired with SKOS representations of existing biomedical ontologies. We show how this technique might provide interoperability among domain and cross-cutting models in caBIG and in other semantic environments. We demonstrate three capabilities that this mapping provides: conversion between domain models and cross-cutting models, conversion between domain models, and domain-model-agnostic queries across multiple models. We discuss the application of this technique to the existing caBIG semantics, the proposed caBIG semantics, and the interoperability of biomedical data through the proposed translational research provenance vision.
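
A minimal sketch of that pairing, using rdflib with hypothetical model URIs, is shown below; in practice the annotations would live on the published OWL models themselves.

    # Minimal rdflib sketch of the mapping technique described above: an
    # annotation pairs a domain-model class with a SKOS concept from a
    # cross-cutting model. All model URIs are hypothetical stand-ins.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, OWL, SKOS

    DOMAIN = Namespace("http://example.org/domain-model#")  # hypothetical
    CROSS = Namespace("http://example.org/cross-cutting#")  # hypothetical

    g = Graph()
    g.add((DOMAIN.Specimen, RDF.type, OWL.Class))
    g.add((CROSS.Biospecimen, RDF.type, SKOS.Concept))

    # The mapping annotation that ties the two models together.
    g.add((DOMAIN.Specimen, SKOS.exactMatch, CROSS.Biospecimen))

    # A model-agnostic query can follow skos:exactMatch links to translate
    # between domain models and the shared concepts.
    for s, _, o in g.triples((None, SKOS.exactMatch, None)):
        print(s, "maps to", o)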


Semantic Analysis and Visualization of Clinical Data

Eric Neumann
Clinical Semantics
Bedford, US

Presentation Abstract: Biomedical data generation is continuously growing in both size and complexity. Clinical study data is complicated by the fact that new forms of associated data are continuously created as technologies emerge, including biomarkers, pathway (mechanistic) knowledge, assay platforms, and model systems. W3C semantic standards such as RDF and OWL have been around for several years, but most informatics specialists are unsure where they can be applied effectively. Semantically Linked Data (SLD) can significantly change the organization and re-use of data without requiring a concomitant investment in data systems. SLD is especially well suited to handling information extracted from the literature and relating it to structured data, even when that data resides in other data systems.


BEL (Biological Expression Language): Using Causal Relationships to Represent Scientific Findings in Molecular Biology in Support of Applications

Dexter Pratt
Selventa
Cambridge, US

Presentation Abstract: The intent of scientific publication is to share knowledge. To do this effectively, scientific documents should be accessible to semantically enabled applications, with critical information encoded in a computationally accessible knowledge representation. This presentation describes the knowledge representation language BEL (Biological Expression Language), a language designed to pragmatically represent scientific findings in molecular biology as causal relationships. BEL was designed to capture knowledge about biological scientific findings, as well as their contexts, in a user-friendly, intuitive way. Findings are encoded as experimentally demonstrated causal relationships, further annotated with information describing biological context, experimental methodology, literature source, and curation process. Biological models appropriate for a given analysis or application can be created in a knowledge assembly process in which BEL-encoded findings are integrated, selected, transformed, and augmented by inference. Knowledge can be selected based on the provenance and biological context information associated with each finding, enabling a strategy in which knowledge capture is well separated from the design of useful models. Each relationship in a BEL-derived model can be justified by reference to its supporting findings. BEL closely links the represented knowledge to measurable quantities by focusing the ontology on terms denoting abundances and activities of entities at the molecular scale, facilitating the use of BEL-derived models in the interpretation of experimental data sets. BEL terms can be defined by reference to external vocabularies or ontologies, thereby supporting the integration of knowledge from multiple sources. Following eight years of development and proprietary use, BEL has proven to be an intuitive and effective language for scientists, supporting the creation of a large knowledgebase used in the interpretation of 'omics data sets via causal relationship-based analytics. BEL and supporting tools are now being made publicly available to the research community through the introduction of the BEL Web Portal™, which provides public access to BEL language specifications, documentation, knowledge representation examples, and BEL software tools.
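
To make the representation concrete, the following Python sketch parses two BEL-style causal statements into triples; the statements illustrate the general syntax and are not findings drawn from the knowledgebase described here.

    # Toy parser for BEL-style causal statements. Terms are functions such
    # as p() for protein abundance or a() for chemical abundance, applied
    # to a namespace-qualified entity; the statements are illustrative.
    import re

    statements = [
        "p(HGNC:TNF) increases p(HGNC:IL6)",
        "a(CHEBI:aspirin) decreases p(HGNC:PTGS2)",
    ]

    # One term, a causal relation, another term.
    PATTERN = re.compile(
        r"(\w+\([\w:]+\))\s+(increases|decreases)\s+(\w+\([\w:]+\))")

    for s in statements:
        subject, relation, obj = PATTERN.match(s).groups()
        print((subject, relation, obj))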


Using SWObjects to Create and Query RDF Views

Eric Prud'hommeaux
W3C
Cambridge, US

Presentation Abstract: SPARQL CONSTRUCTs, RIF, and other rule forms allow us to trivially tailor views of RDF data from sources like Turtle files, GRDDL'd XML documents, RDF databases, or conventional relational databases. Views over databases are especially practical if they can be virtual, that is, if SPARQL queries over the virtual graph are mechanically transformed into SPARQL queries for RDF databases or SQL queries for conventional relational data (e.g., an Employees table and an Address table).
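
As a rough illustration (with rdflib and a hypothetical vocabulary), the sketch below materializes such a view with a SPARQL CONSTRUCT; SWObjects would instead rewrite queries against the view at query time, keeping it virtual.

    # Relational-shaped triples (rows from hypothetical Employees and
    # Address tables) re-expressed in a target vocabulary via CONSTRUCT.
    from rdflib import Graph, Namespace, Literal

    EX = Namespace("http://example.org/db/")   # hypothetical source vocab
    V = Namespace("http://example.org/view/")  # hypothetical view vocab

    g = Graph()
    g.add((EX.emp1, EX.name, Literal("Ada")))
    g.add((EX.emp1, EX.addr, EX.addr9))
    g.add((EX.addr9, EX.city, Literal("Boston")))

    view = g.query("""
        PREFIX ex: <http://example.org/db/>
        PREFIX v:  <http://example.org/view/>
        CONSTRUCT { ?e v:livesIn ?city }
        WHERE     { ?e ex:addr ?a . ?a ex:city ?city }
    """)

    for triple in view:
        print(triple)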

This talk will discuss the utility of such an architecture, including efficient access to RDBs and pipelines of transformation services supported by parties other than the custodians of the final data resources. Real-world examples will include using the SWObjects toolbox to view the Gene Ontology as BioPAX and to ask questions that unify across the UniProt and GO databases.


Semantics-enabled Proactive and Targeted Dissemination of New Medical Knowledge

Lakshmish Ramaswamy
University of Georgia
Athens, US

Presentation Abstract: The body of knowledge in the field of medicine is expanding at a tremendous pace. The number of citations in MEDLINE grew by more than 700,000 in 2009 and is expected to grow by 1 million this year. This includes the discovery of new drugs, previously unknown reactions to existing drugs, and new treatments. Some of these discoveries are so important that they have to reach end-practitioners quickly so that they can act upon the new knowledge, possibly by altering the course of treatment of relevant patients. Typically, medical knowledge dissemination occurs through channels such as conferences, medical journals, and memos. In the past decade, the Web has, to some extent, revolutionized medical knowledge dissemination by providing advanced search capabilities. However, this mode of knowledge dissemination is passive, and it has significant limitations. First, it requires doctors to periodically search the online databases, which places an additional burden on them. Second, the time lag for a doctor to become aware of new research depends upon how often she searches the online databases. Third, even after a doctor becomes aware of certain medical information, it takes additional time for her to search through patient records to identify the patients to whom the information is relevant. These limitations highlight the need for a proactive medical information dissemination paradigm. Our vision is to design and develop a semantics-enabled framework for proactive and targeted dissemination of new medical knowledge. We believe that such a system has to achieve two major design goals. First, in order to prevent information overload, information dissemination has to be targeted, in the sense that a doctor should receive alerts about new discoveries only if the information is likely to be relevant to one or more of her patients. Second, the additional workload on the doctor for participating in the system should be minimal; in other words, the system should function based upon the information that is recorded during the examination and treatment of patients. Towards achieving these two goals, our main idea is to utilize patients' electronic medical records (EMRs) to identify information in scientific articles, memos, etc. that is relevant to a particular patient, and to alert her doctor accordingly. Several research challenges have to be addressed in order to make such a system efficient and scalable: (1) EMRs, research publications (from PubMed, etc.), and memos from organizations such as the CDC and FDA have to be automatically annotated with medical-ontology-based, semantics-rich metadata; (2) novel, semantics-driven algorithms for retrieving, filtering, and ranking information relevant to a particular EMR have to be designed; (3) in the interest of system scalability, techniques to cluster EMRs based upon their semantic similarity need to be utilized; and (4) the system has to be continuously tuned based upon explicit and implicit feedback from its users to maintain and improve its effectiveness. In this talk, we will motivate this work through real-world examples. We will elaborate upon the above challenges, discuss our ideas towards addressing them, and present an architecture for our framework.
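
As a toy illustration of the second challenge, the Python sketch below scores an article's relevance to an EMR by expanding both annotation sets through a miniature is_a hierarchy; the terms are invented and no real medical ontology is implied.

    # Semantics-driven relevance: an article is relevant to an EMR when
    # their ontology annotations overlap after ancestor expansion.
    IS_A = {
        "type 2 diabetes": "diabetes mellitus",
        "type 1 diabetes": "diabetes mellitus",
        "metformin": "antidiabetic agent",
    }

    def expand(terms):
        """Close a set of annotation terms under the is_a hierarchy."""
        closed = set(terms)
        for t in terms:
            while t in IS_A:
                t = IS_A[t]
                closed.add(t)
        return closed

    def relevance(emr_terms, article_terms):
        """Jaccard overlap of ancestor-expanded annotation sets."""
        a, b = expand(emr_terms), expand(article_terms)
        return len(a & b) / len(a | b)

    emr = {"type 2 diabetes", "metformin"}
    article = {"diabetes mellitus", "antidiabetic agent"}
    print(relevance(emr, article))  # 0.5: both parent concepts are shared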


Publishers' Content Linked with Bioinformatics Data Resources: Working Towards Brokering Standards in the SESL Pilot Project

Dietrich Rebholz-Schuhmann
European Bioinformatics Institute
Hinxton, UK

Presentation Abstract: The SESL pilot project explores the technical feasibility of federated querying across full-text literature and bioinformatics databases. Five life science and pharmaceutical companies have collaborated with four publishers and the Rebholz group (EMBL-EBI) to extract selected data from bioinformatics databases (UniProt, OMIM, and ArrayExpress) and the full-text literature, with a focus on human diseases related to Type 2 diabetes mellitus. Gene-to-disease assertions are delivered through a single point of query to the scientist users.

The pilot implements the integration of content from public resources and of information extracted from the scientific literature into a shared infrastructure based on Semantic Web technology. The SPARQL endpoint is hosted at the EBI and can be accessed remotely through SPARQL queries, a browser-based graphical user interface, or a SOAP Web services client. The project delivers a preliminary set of standards describing the minimal infrastructure necessary to support a biology brokering service, along with a prototype instance of that infrastructure as a public demonstrator.
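
Remote access of the kind described could look roughly like the following SPARQLWrapper sketch; the endpoint URL and vocabulary are placeholders, not the actual SESL service details.

    # Sketch of remote access to a SPARQL endpoint for gene-to-disease
    # assertions. The endpoint and vocabulary are placeholders only.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://example.org/sesl/sparql")  # placeholder
    sparql.setQuery("""
        PREFIX ex: <http://example.org/sesl/>
        SELECT ?gene ?assertion
        WHERE {
            ?assertion ex:aboutGene ?gene ;
                       ex:aboutDisease ex:Type2DiabetesMellitus .
        }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["gene"]["value"], row["assertion"]["value"])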


Semantic Representation of Events in the Pharmaceutical Industry

Martin Romacker, Samuel Läubli & Marc Bux
Novartis Pharma AG, NIBR-IT

Presentation Abstract: Data feeds from commercial content providers contain information highly relevant to pharmaceutical research. Processing and normalizing the data for in-depth analysis plays an important role in areas like competitive intelligence, strategic alliances or modeling and simulation.

Unfortunately, the data is not easy to integrate or to syndicate semantically. The standard mode of knowledge transfer, XML files, clearly lacks semantics. Additionally, many facts are locked in natural language statements instead of being accessible in a machine-readable and semantically valid representation. The challenge is even larger when content needs to be combined from different feeds. Heterogeneous naming, different semantic typing, and different content structures prevent users from fully exploiting the rich knowledge contained in the feeds.

At NIBR-IT, we have implemented an automatic pipeline to process and normalize company names. In parallel, we have created an NLP pipeline which is able to derive facts from statements about phase transitions, mergers and acquisitions, or licensing events. In doing so, we transform natural language statements into a normalized semantic representation that uses a Neo-Davidsonian style of notation. The different event types are captured in a high-level OWL ontology. With this representation in place, it is now possible to ask queries like "What are the licensing events where Novartis gave a license to any company?" (see the sketch below). The company-centric events are complemented by knowledge about indications, products, and other semantic types. A secondary aspect of this project is to demonstrate to the content providers that it may be worthwhile to move to a semantically richer and computer-accessible way of delivering data.
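
The sketch below gives a rough rdflib rendering of such a Neo-Davidsonian event, with hypothetical ontology terms: the event is an individual, and each participant attaches to it through a role property, which is what makes role-sensitive queries possible.

    # Neo-Davidsonian rendering of "Company X licensed compound C to
    # Novartis in 2010". Ontology terms are hypothetical stand-ins.
    from rdflib import Graph, Namespace, Literal, RDF

    EV = Namespace("http://example.org/events#")  # hypothetical ontology

    g = Graph()
    e = EV.event42

    g.add((e, RDF.type, EV.LicensingEvent))
    g.add((e, EV.licensor, EV.CompanyX))
    g.add((e, EV.licensee, EV.Novartis))
    g.add((e, EV.subjectMatter, EV.CompoundC))
    g.add((e, EV.year, Literal(2010)))

    # "Licensing events where Novartis gave a license" is a simple
    # pattern over the role properties.
    q = g.query("""
        PREFIX ev: <http://example.org/events#>
        SELECT ?e WHERE { ?e a ev:LicensingEvent ; ev:licensor ev:Novartis }
    """)
    print(len(q))  # 0: here Novartis is the licensee, not the licensor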

In the presentation, we will first outline the business rationale behind our project. In the second part, we will give an overview of how we process and normalize the free-text sentences and will explain the Semantic Web approach we have taken to represent the data.


NCBO Resource Index: Ontology-Based Search and Mining of Biomedical Resources

Nigam H. Shah
Stanford School of Medicine
Stanford, US

Presentation Abstract: The volume of publicly available data in biomedicine is constantly increasing. However, this data is stored in different formats on different platforms. Integrating this data will enable us to accelerate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology, we have developed the Resource Index: a growing, large-scale index of more than twenty diverse biomedical resources. The resources include heterogeneous data from a variety of repositories maintained by different researchers from around the world. Furthermore, we use a set of 200 publicly available ontologies, also contributed by researchers in various domains, to annotate and aggregate these resource descriptions. We use the semantics that the ontologies encode, such as the properties of classes, the class hierarchies, and the mappings between ontologies, to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics under the hood.
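
A toy sketch of the kind of ontology-aware search this enables appears below; the class hierarchy, mappings, and resource annotations are invented for the example.

    # A query term is expanded to its subclasses and to mapped terms in
    # other ontologies before matching resource annotations, so no exact
    # string match is required. All terms here are illustrative.
    SUBCLASSES = {"neoplasm": ["melanoma", "carcinoma"]}
    MAPPINGS = {"neoplasm": ["tumor"]}  # cross-ontology mappings

    RESOURCE_ANNOTATIONS = {
        "GEO:GSE100": ["melanoma", "homo sapiens"],
        "PDB:1ABC": ["tumor"],
        "GEO:GSE200": ["asthma"],
    }

    def expand_query(term):
        """Expand a term with its subclasses (transitively) and mappings."""
        expanded = {term}
        for sub in SUBCLASSES.get(term, []):
            expanded |= expand_query(sub)
        expanded.update(MAPPINGS.get(term, []))
        return expanded

    def search(term):
        terms = expand_query(term)
        return [res for res, anns in RESOURCE_ANNOTATIONS.items()
                if terms & set(anns)]

    print(search("neoplasm"))  # -> ['GEO:GSE100', 'PDB:1ABC']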


Fueling Knowledge Federation Using Terminological Services

Therese Vachon
Novartis Pharma AG
Basel, CH

Presentation Abstract: Knowledge proliferation and data silos are well-known buzzwords that characterize the way data is produced and stored in the pharmaceutical industry. Most efforts in knowledge mining try to make the knowledge buried in applications and databases accessible. These efforts are both expensive and tedious. Additionally, not all knowledge can be recovered, as the stored information tends to be ambiguous and incomplete. At Novartis, we have been working on a principled approach to overcome these shortcomings. The basic idea is to create a federation layer based on well-controlled terminologies, aiming at a uniform wording within and across data repositories. Thus, we have been collecting and defining meaningful atomic units (basic concepts) together with their lexical representations (terms) in a knowledge integration framework. Within that framework we maintain a number of terminologies (covering, for example, indication, company, target, gene, and assay method). The terminologies are organized as taxonomies and complemented by referential knowledge, so-called cross references: pointers which link out to other repositories. One of our objectives is to stay compatible with the major resources of the open biomedical community. With regard to coverage, our terminologies focus on the terms that are really relevant to research at NIBR. Cross referencing is a powerful but formally simple means of linking out to other knowledge repositories to get access to additional information. Our methods for maintaining and enhancing the different terminologies in our framework depend mainly on the concept type. We have different levels of automatic generation versus intellectual curation of the content related to indications, companies, or genes. We believe that for each of these concept types there is an optimal balance between automation and curation, the former being prone to errors and the latter being time-consuming and therefore expensive. Furthermore, we intend to make the maintenance process more and more a collaborative task in which scientists can access, review, and modify the content according to their role profile. An important success factor for the widespread usage of terminologies is to bring them seamlessly to the point of usage. Consequently, we have implemented a service layer providing SOAP and JSON Web Services as well as a REST API (see the sketch below). Importantly, users have access to the knowledge without slowing their work and without having to leave the active application. The increasing usage of these services, both in number of applications and in number of calls, clearly demonstrates the importance of this flexible integration layer. It is important to mention that for some of the concept types we have reached a critical mass in usage which allows us to run queries across systems or provide concept-centric views joining internal and external data. As we demonstrate the benefits of using our resources, more and more people and organizations are buying in. In our oral presentation, we would like to give an overview of our approach to "Terminology Management" and illustrate how we represent knowledge (terminological and referential). Finally, some use cases will demonstrate how the services are applied.
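
The sketch below shows, in miniature and with invented entries, the kind of lookup such a terminology service performs: free text is normalized to a concept whose cross references link out to other repositories.

    # Toy terminology lookup. The concept IDs, terms, and cross references
    # are illustrative only, not entries from the NIBR terminologies.
    TERMINOLOGY = {
        "IND:0001": {
            "preferred": "Parkinson disease",
            "synonyms": ["Parkinson's disease", "paralysis agitans"],
            "xrefs": ["MeSH:D010300", "OMIM:168600"],  # links out
        },
    }

    # Index every lexical representation back to its concept.
    INDEX = {}
    for cid, entry in TERMINOLOGY.items():
        for term in [entry["preferred"], *entry["synonyms"]]:
            INDEX[term.lower()] = cid

    def resolve(text):
        """Normalize free text to a concept and its cross references,
        roughly what the JSON/REST service layer would return."""
        cid = INDEX.get(text.lower())
        if cid is None:
            return None
        entry = TERMINOLOGY[cid]
        return {"id": cid, "preferred": entry["preferred"],
                "xrefs": entry["xrefs"]}

    print(resolve("paralysis agitans"))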
