Conference on Semantics in Healthcare and Life Sciences (CSHALS)

Presenters

(updated March 8, 2010)


Talapady Bhat
NIST
Gaithersburg, MD, USA

Presentation Title:
Chem-BLAST: A Rule-based Method to Develop Advanced Structural Ontologies for Chemical Bioinformatics

Presentation slides - .pdf: click here

Presentation Abstract:
Today’s chemical bioinformatics community must interact with a variety of standalone information applications and ontologies. This fragmentation creates the need to define and develop stringent, rule-based ontologies for information processing and sharing. The Chemical Block Layered Alignment of Substructure Technique (Chem-BLAST) first recursively dissects chemical structures into blocks of substructures, using rules that operate on atomic connectivity, and then aligns them against one another to develop first a chemical Resource Description Framework (RDF) and then chemical ontologies in the form of a “tree” made up of “hubs and spokes”. The technique has been applied to (a) both 2-D and 3-D structural data for AIDS (http://bioinfo.nist.gov/SemanticWeb_pr2d/chemblast.do) and (b) ~60,000 structures from the PDB, now available from the RCSB/PDB Web site (www.rcsb.org/pdb/explore/externalReferences.do?structureId=3GGT), with advanced features at http://xpdb.nist.gov/chemblast/pdb.html. A full description of Chem-BLAST, along with recent results and illustrations, including those for approximately a million compounds from PubChem, will be presented. Methods to dynamically intersect the PDB and PubChem using the ontology will be illustrated. These features may be used by independently maintained resources to develop linked data and the Semantic Web.
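
To make the “hub-and-spoke” organization concrete, here is a minimal sketch that records a toy dissection as RDF with rdflib and queries one hub; the namespace, predicate names (hasSubstructure, refinedFrom), and substructure identifiers are hypothetical placeholders, not the actual Chem-BLAST vocabulary.

```python
# Minimal sketch, assuming a hypothetical vocabulary (not the real Chem-BLAST terms):
# each substructure "block" becomes a hub node; structures and more specific blocks
# hang off it as spokes.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/chemblast/")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Toy dissection output: structure -> ordered blocks, generic to specific
dissections = {
    "ligand_AB1": ["phenyl", "phenyl_sulfonamide"],
    "ligand_AB2": ["phenyl", "phenyl_carboxamide"],
}

for structure, blocks in dissections.items():
    # chain the blocks into the "tree": each block refines the previous hub
    for parent, child in zip(blocks, blocks[1:]):
        g.add((EX[child], EX.refinedFrom, EX[parent]))
    # attach the structure to its most specific block
    g.add((EX[structure], EX.hasSubstructure, EX[blocks[-1]]))

# Which structures share the "phenyl" hub? Walk one level of the tree.
q = """
PREFIX ex: <http://example.org/chemblast/>
SELECT ?structure WHERE {
  ?structure ex:hasSubstructure ?block .
  ?block ex:refinedFrom ex:phenyl .
}
"""
for row in g.query(q):
    print(row.structure)
```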

[Return to Full Agenda]


Erik Bierwagen
Genentech, Inc.
San Francisco, CA, USA

Presentation Title:
Laying a Foundation with Animal Data

Presentation Abstract:
Animal assay data, especially data from mutant animals, are very expensive to generate. The holy grail for any company doing animal research would be an electronic system that allows analysis, searching, and ultimately computation on such data. However, a framework must be available to allow the capture and management of key data in a granular format. At Genentech, we are working on a system, DIVOS, that will allow us to capture detailed study designs and atomic data for all animal studies; in addition, DIVOS will allow us to leverage the detailed data set we have been generating through our mutant mouse colony management system (CMS). We will present the current state of our systems and our future plans. Additionally, we will highlight the specific work we have done with the Translational Oncology group at Genentech.

[Return to Full Agenda]


Vijay Bulusu
Pfizer Inc.
Groton, CT, USA

Presentation Title:
Data Integration and Knowledge Sharing Using Semantic Technologies

Presentation Abstract:

This talk reviews a pilot that applied semantic technologies to solve several tough data integration challenges and meet high-ROI use cases under a short timeline. In 2009, Pfizer Global R&D engaged leading semantic technology vendors for a pilot on linking data stored in multiple repositories using imprecise connections. High-ROI use cases were identified that served to test challenging data integration requirements under a project timeline that would be difficult or impossible to meet with traditional relational or object-oriented technologies. Based on the data integration requirements, initial estimates for the pilot timeline were in the 4-6 month range. Franz, Inc. and IO Informatics worked together under a joint venture agreement, with Franz providing the semantic database and IO Informatics providing the data integration and application layers. By applying semantic technology to integrate datasets "on the fly", the pilot was completed within a six-week timeframe.

Use Case I: Compound Purity Verification

Core technical requirements included integrating analytical testing data from a Laboratory Information Management System (LIMS) and a Chromatography Data System (CDS). Final results for impurities (specified and unspecified) in clinical batches (stored in the LIMS) had to be matched with the corresponding raw datasets (stored in the CDS) in order to confirm the integration of the peak in the CDS and the calculations of standards and samples that produce the final result in the LIMS. Challenges overcome included a lack of consistent common identifiers and the need to search and federate results in a user-friendly manner. Historically, responding to requests from regulatory agencies for such confirmations has involved undue time and cost, taking multiple staff days and at times weeks to integrate the experimental data required for validation. The semantic application applies data linking and SPARQL-based pattern matching to bring back federated and ranked results in seconds.
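
Below is a minimal sketch of the kind of SPARQL-based pattern matching described above, assuming a toy RDF rendering of LIMS and CDS records linked only by shared batch and analyte values; the namespace, predicates, and identifiers are hypothetical, not Pfizer's actual data model.

```python
# Minimal sketch, assuming a hypothetical data model in which LIMS results and CDS
# raw datasets were loaded as RDF and linked by shared batch/analyte properties
# rather than by a precise common identifier.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/pilot/")
g = Graph()
g.bind("ex", EX)

# Toy data: one LIMS final result and one CDS raw dataset for the same batch/analyte
g.add((EX.limsResult42, EX.batch, Literal("CB-2009-017")))
g.add((EX.limsResult42, EX.analyte, Literal("impurity A")))
g.add((EX.limsResult42, EX.finalPurity, Literal(99.2)))

g.add((EX.cdsRun7, EX.batch, Literal("CB-2009-017")))
g.add((EX.cdsRun7, EX.analyte, Literal("impurity A")))
g.add((EX.cdsRun7, EX.peakArea, Literal(153204)))

# SPARQL pattern matching: pair each final result with the raw run that shares
# the same batch and analyte, without any pre-existing cross-system key.
q = """
PREFIX ex: <http://example.org/pilot/>
SELECT ?result ?run ?purity ?area WHERE {
  ?result ex:batch ?b ; ex:analyte ?a ; ex:finalPurity ?purity .
  ?run    ex:batch ?b ; ex:analyte ?a ; ex:peakArea   ?area .
}
"""
for row in g.query(q):
    print(row.result, "matches", row.run)
```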

Use Case II: Improving Drug Product Stability

Core technical requirements included integrating analytical testing and formulation composition data from two separate repositories. Analytical and Formulation datasets were stored in silos with no common identifiers between them. The challenge in this use case was to link product formulation composition data with stability data under various experimental conditions to identify hidden relationships between the use of excipients and the resulting stability of the drug product. Based on historical data, scientists can prepare better formulations that are more stable over longer periods of time.

Conclusions:

  • Rapid integration of new data becomes possible, even in contexts where precise common identifiers may be missing
  • Rapid and iterative creation of rich linkages across data sources provides a far more useful and flexible data integration method, with corresponding benefits for searching, exploration, and reporting
  • User-friendly creation and application of meaningful search patterns becomes possible, using both properties and relationships

[Return to Full Agenda]


Carolyn R. Cho
TerrVerdae Bioworks
Beverly, MA, USA

Presentation Title:
Integrating Experimental, Text Mining, and Analytical Prediction Methods in Drug Discovery

Presentation Abstract:

Understanding and applying context specificity to molecular function is crucial not only to generating a hypothesis regarding a gene’s function in disease but, more importantly, to predicting how that function may be perturbed to induce a therapeutic effect. Genome-wide association studies have drawn new associations between genes and frank type II diabetes (T2D). Although the function of most of these genes is known or postulated, their role in the context of T2D is poorly understood.

We developed and tested hypotheses on the functional role of T2D-associated genes using a combination of text mining, statistical modeling and experiments in human cell-based models. This analysis demonstrates the effective integration of methods that improve our ability to consolidate our current knowledge and increase the effectiveness of therapeutic target discovery.

[Return to Full Agenda]


Nikolai Daraselia
Ariadne
Rockville, MD, USA

Presentation Title:

Semantic Indexing of Biomedical Texts

Presentation slides - .pdf: click here

Presentation Abstract:

Motivation: The past decade of biomedical research has produced overwhelming amounts of biological information conveyed in hundreds of thousands of scientific publications and web resources. Effective access to the relevant literature has become a critical issue. The bulk of biological knowledge is formed by information about biological relationships (or associations) between conceptual entities (proteins, small molecules, cellular processes, diseases, etc.). Traditional keyword-based document retrieval systems (e.g., PubMed) historically dominate the area of biomedical information search. They are easy to use, simple, and fast. However, existing search systems are poorly suited for finding biological relationships in text: they consider each document to be a “bag of words” and return a collection of documents matching the user's query that must then be read by the user.

Method: We have developed a novel natural language processing algorithm (ConceptScan) which decomposes sentences into a set of primitive semantic relationships. We show that, in combination with entity detection, it can be used as a text indexing engine in a search system that allows finding very specific biological and clinical information. Our approach relies on the following key ideas:

  1. Identifying named entities (proteins, small molecules, diseases, cellular processes, etc.) in text using an established dictionary-based approach.
  2. Decomposing sentences into a set of domain-independent binary semantic relationships (triplets). Each relationship is represented by a left and a right noun phrase (NP) connected by a single “relationship” word (usually a verb). An example of a simple semantic triplet is “p53 activates apoptosis”. The decomposition algorithm is purely linguistic and does not involve any domain-specific knowledge; it can consequently be applied to any corpus of documents.
  3. Using the extracted relationships to index the documents semantically. More specifically, creating separate word-to-triplet indexes for words (and named entities) found in the left NP, the right NP, and the “relationship” word of each triplet. Such indexing allows the user to find words or entities that are in a certain (or any) semantic relationship with each other.
  4. Using a structured query, in the same form as a semantic triplet, to search against the triplets created for the indexed corpora and find precise answers to the question (a minimal indexing sketch follows this list).
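
A minimal sketch of the indexing idea in Python, using toy triplets; it illustrates the separate left/relationship/right indexes described above and is not the ConceptScan implementation.

```python
# Minimal sketch of triplet-based indexing over toy data; not the ConceptScan code.
from collections import defaultdict

# (left NP, relationship word, right NP) triplets extracted from sentences
triplets = [
    ("p53", "activates", "apoptosis"),
    ("BCL2", "inhibits", "apoptosis"),
    ("p53", "regulates", "cell cycle"),
]

# Separate word-to-triplet indexes for each slot of the triplet
left_idx, rel_idx, right_idx = defaultdict(set), defaultdict(set), defaultdict(set)
for i, (left, rel, right) in enumerate(triplets):
    left_idx[left].add(i)
    rel_idx[rel].add(i)
    right_idx[right].add(i)

def search(left=None, rel=None, right=None):
    """Answer a structured query expressed as a (possibly partial) triplet."""
    hits = set(range(len(triplets)))
    if left:
        hits &= left_idx.get(left, set())
    if rel:
        hits &= rel_idx.get(rel, set())
    if right:
        hits &= right_idx.get(right, set())
    return [triplets[i] for i in sorted(hits)]

# "What activates apoptosis?" -> [('p53', 'activates', 'apoptosis')]
print(search(rel="activates", right="apoptosis"))
```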

The main advantage of a semantically indexed search system is that it can find semantic relationships expressed in the sentences of a document rather than words scattered throughout the document itself. Combined with named entity recognition, semantic indexing is a powerful tool for precise searching for biomedical relationships.

Results: We have applied ConceptScan to the 2008 PubMed to extract more than 70,500,000 semantic triplets. We will present the technical description of the ConceptScan algorithm and our text indexing approach, the results of processing PubMed with ConceptScan, and use cases demonstrating the value of relationship-based indexing and searching for finding biological and clinical information.

Conclusions:
The developed algorithm and indexing approach can be used to build a semantic relationship-based text indexing and search system suited for finding precise answers to specific biological and clinical questions.

[Return to Full Agenda]


Anita de Waard
Elsevier Labs
Amsterdam, The Netherlands

Presentation Title:

Knowledge Access Through Hypothesis Identification

Presentation slides - .pdf: click here

Presentation Abstract:

To access the knowledge within scientific publications, it is imperative that we move beyond entity (and even relationship) extraction to a fuller format that represents the argumentation, and links to research data, contained within a collection of papers. Currently, several initiatives exist that aim to represent, and to a certain extent extract, such hypotheses, with relationships to each other and to experimental evidence. Five of these initiatives have gathered under the moniker 'HypER' (Hypotheses, Evidence and Relationships) and are currently working with the W3C Health Care and Life Sciences interest group on a number of efforts:

  • Developing a model for a 'rhetorical abstract', i.e. an abstract that shows the main argumentative elements of a paper (one possible representation is sketched after this list)
  • Developing a single set of discourse relations, merging input from a number of proposed relationship ontologies
  • Performing some initial experiments on identifying these hypotheses automatically
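
Purely as an illustration of what a machine-readable rhetorical abstract might look like, here is a toy Python data structure; the field names, statement types, and relation labels are hypothetical and do not reflect the model the HypER group is developing.

```python
# Illustrative only: a toy data structure for a "rhetorical abstract"; field names,
# statement types, and relation labels are hypothetical, not the HypER model.
from dataclasses import dataclass, field

@dataclass
class Statement:
    text: str
    kind: str                      # e.g. "hypothesis", "evidence", "claim"

@dataclass
class RhetoricalAbstract:
    paper: str
    statements: list = field(default_factory=list)
    relations: list = field(default_factory=list)   # (from_index, relation, to_index)

ra = RhetoricalAbstract(paper="doi:10.0000/example")
ra.statements.append(Statement("Protein X phosphorylates Y under hypoxia.", "hypothesis"))
ra.statements.append(Statement("Phospho-Y increases in hypoxic cell lysates.", "evidence"))
ra.relations.append((1, "supports", 0))              # the evidence supports the hypothesis
print(ra)
```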

We will discuss these efforts, and briefly expand on some current work to create a theoretical discourse model for publications in the life sciences that explains the use of verb tense in biology.

[Return to Full Agenda]


Detlef Grittner
Sohard Software GmbH
Fuerth, Germany

Presentation Title:

Using SPARQL to Access Data Stores in Healthcare

Presentation slides - .pdf: click here

Presentation Abstract:

For the implementation of Semantic Web technology in healthcare, it is necessary to access the already existing IT infrastructure and data stores, a problem related to the theme of clinical harmonization. A very important standard in healthcare is DICOM (Digital Imaging and Communications in Medicine), which deals with imaging for diagnostics and with workflows for medical devices. PACS (Picture Archiving and Communication System) servers provide access to the data through a network protocol standardized by DICOM. There are plans to store semantic information directly in the DICOM data and to reuse the existing infrastructure for storage and transmission. This leaves open the question of how a semantic application should access these data, including querying with SPARQL, inferencing with ontologies encoded in OWL, and federation with other semantic data sources. In principle, there are two different approaches: either the database of the PACS is completely converted into RDF, or SPARQL queries are converted into the query protocols defined by DICOM.

Methods and Tools: The prototype introduced in this presentation is called SeDI (Semantic DICOM) and uses Jena as its RDF framework, Joseki for implementing a SPARQL endpoint, and Pellet as its reasoner. SeDI uses dcm4chee as the PACS, which has no built-in support for semantic applications. With SeDI it is possible to use SPARQL for querying a PACS, because SeDI transforms a SPARQL query directly into a DICOM C-FIND or C-MOVE request as appropriate and converts the result from the DICOM protocol into a SPARQL result set. SeDI allows semantic applications to directly access a PACS without completely exporting the existing data of the PACS to a triple store. The DICOM data model is encoded in an ontology, on which SeDI depends to a large extent for transforming the query and adding semantically meaningful information to the query result.
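
A minimal sketch of the query-translation idea, assuming a toy mapping from a hypothetical DICOM ontology's predicates to DICOM attribute keywords; it only builds the C-FIND query identifier and is not the SeDI implementation.

```python
# Minimal sketch of the query-translation idea, not the SeDI implementation.
# Predicate-to-DICOM-attribute mappings and the dcm: ontology namespace are hypothetical.
DICOM_ATTR = {                      # hypothetical dcm: predicate -> DICOM keyword
    "dcm:patientName": "PatientName",
    "dcm:modality":    "Modality",
    "dcm:studyDate":   "StudyDate",
}

def sparql_bgp_to_cfind(bgp):
    """Turn a basic graph pattern (subject, predicate, object) list into a
    C-FIND query identifier: bound literals become match keys, variables become
    requested return keys (empty values in DICOM query/retrieve)."""
    identifier = {}
    for _s, predicate, obj in bgp:
        keyword = DICOM_ATTR[predicate]
        identifier[keyword] = "" if obj.startswith("?") else obj
    return identifier

# SELECT ?name WHERE { ?study dcm:modality "MR" ; dcm:studyDate "20100115" ; dcm:patientName ?name }
bgp = [
    ("?study", "dcm:modality", "MR"),
    ("?study", "dcm:studyDate", "20100115"),
    ("?study", "dcm:patientName", "?name"),
]
print(sparql_bgp_to_cfind(bgp))
# {'Modality': 'MR', 'StudyDate': '20100115', 'PatientName': ''}
# A real endpoint would send this identifier in a DICOM C-FIND request and map the
# returned attributes back into SPARQL result bindings.
```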

Results: The main advantage of this approach is that the existing IT infrastructure remains unaltered, since SeDI uses the already existing DICOM protocol access to the systems. A semantic application can access the DICOM data as if the PACS were a SPARQL endpoint, use inferencing on the data, and federate it with other data sources. Furthermore, there is no problem with synchronizing the original data source with a triple store, because the original data source is accessed directly and there is no need for an export to a triple store. A drawback of this approach is that queries perform less well than queries against a real triple store, and that an ontology containing all DICOM concepts is rather difficult to model.

Conclusions: For the future, a standardized DICOM ontology is desirable so that queries can use concepts from defined namespaces; this could even become part of the DICOM standard itself. Furthermore, unobtrusive access to legacy databases containing data not encoded in RDF seems to be the natural way to evolve the Semantic Web. SeDI proves this concept for medical data encoded in DICOM by implementing a seamless integration of a PACS as a SPARQL endpoint in a Semantic Web application.

[Return to Full Agenda]


Josh Levy
Research Triangle Institute
Research Triangle Park, NC, USA

Presentation Title:
Ontology-supported Searches for the PhenX Toolkit

Presentation slides - .pdf: click here

Presentation Abstract:

The goal of PhenX (consensus measures for Phenotypes and eXposures) is to facilitate cross-study analysis of GWAS by providing a toolkit of standard, well-established measures relevant to biomedical research. The challenge is to design an information retrieval system that will make the toolkit measures accessible to investigators who come from a variety of scientific disciplines. We envision a toolkit that will spark creative thinking as well as facilitate collaboration and trans-disciplinary expansion of population-based studies. Presentation of the inter-relatedness of measures across the toolkit can be effectively addressed by implementing a backbone ontology. Organizing PhenX measures into a biomedical ontology provides a framework for recognizing aliases without having to resort to undesirable and less-productive full-text searching. The ontology also makes it possible to provide search results even when there are no “direct hits”: the results of such a "traversal search" could point to related PhenX measures by traveling down an acyclic graph to children, or laterally to siblings. The ontology also supports an intuitive tree-like interface for browsing groups and sub-groups of PhenX measures such as Risk Factors, Physical Assessments, and Sources. These improved search and browse capabilities will make the PhenX toolkit more accessible and more valuable to a variety of users and encourage widespread incorporation of PhenX measures in population-based studies.
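
A minimal sketch of such a traversal search over a toy measure hierarchy follows; the group and measure names are placeholders, not actual PhenX toolkit content.

```python
# Minimal sketch of a "traversal search" over a toy measure hierarchy; the group
# and measure names are hypothetical, not actual PhenX toolkit content.
TREE = {                                  # parent -> children (acyclic)
    "Physical Assessments": ["Blood Pressure", "Body Mass Index"],
    "Risk Factors": ["Tobacco Use", "Alcohol Use"],
}
PARENT = {child: parent for parent, kids in TREE.items() for child in kids}

def traversal_search(term):
    """If there is no direct hit, suggest children of the term (downward traversal)
    and siblings of the term (lateral traversal)."""
    children = TREE.get(term, [])
    siblings = [m for m in TREE.get(PARENT.get(term), []) if m != term]
    return {"children": children, "siblings": siblings}

print(traversal_search("Risk Factors"))    # no direct measure hit -> its children
print(traversal_search("Blood Pressure"))  # -> its siblings under Physical Assessments
```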

[Return to Full Agenda]


James McCusker
Rensselaer Polytechnic Institute
Troy, NY, USA

Presentation Title:

Representing Microarray Experiment Metadata Using Provenance Models

Presentation slides - .pdf: click here

Presentation Abstract:

MAGE (MicroArray and Gene Expression) representations are primarily representations of workflow: a process was used to derive biomaterial A from biomaterial B. This structure is ideally suited to representation using provenance models such as OPM (Open Provenance Model) and PML (Proof Markup Language). We demonstrate methods and tools, MAGE2OPM and MAGE2PML, to convert RDF representations of MAGE graphs to OPM and PML, respectively. We analyze the coverage of representation for each model. We argue that provenance models are sufficient to represent biomedical experimental metadata in general, and may provide a useful point of reference for unifying information from multiple systems, including biospecimen management, Laboratory Information Management Systems (LIMS), and computational workflow automation tools. This unification results in a complete, computationally useful derivational picture of biomedical experimental data.
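
A minimal sketch of the underlying idea follows, recording a single MAGE-style derivation as OPM-like "used" and "wasGeneratedBy" statements with rdflib; the namespaces and resource names are illustrative and do not reflect the MAGE2OPM output format.

```python
# Minimal sketch of the MAGE-to-provenance idea: a biomaterial derivation recorded
# as workflow becomes OPM-style "used"/"wasGeneratedBy" statements. The namespaces
# and resource names are illustrative, not the MAGE2OPM output format.
from rdflib import Graph, Namespace

OPM = Namespace("http://example.org/opm#")    # stand-in for an OPM vocabulary
LAB = Namespace("http://example.org/lab/")

g = Graph()
# MAGE-style fact: a labeling protocol derived labeled extract B from total RNA A
g.add((LAB.labeling_run_1, OPM.used,           LAB.biomaterial_A))
g.add((LAB.biomaterial_B,  OPM.wasGeneratedBy, LAB.labeling_run_1))
# A convenient shortcut edge often asserted alongside the workflow steps
g.add((LAB.biomaterial_B,  OPM.wasDerivedFrom, LAB.biomaterial_A))

for s, p, o in g:
    print(s, p, o)
```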

[Return to Full Agenda]


Deborah McGuinness
Rensselaer Polytechnic Institute
Troy, NY, USA

Presentation Title:
owl:sameAs Considered Harmful to Provenance

Presentation slides - .pdf: click here

Presentation Abstract:

GOTO was once a standard operation in most computer programming languages. Edsger Dijkstra argued in 1968 that GOTO is a low-level operation that is not appropriate for higher-level programming languages, and advocated structured programming in its place. Arguably, owl:sameAs in its current usage may be poised to go through a similar discussion and transformation period. In biomedical research, the provenance of the information gathered is nearly as important as, and sometimes even more important than, the information itself. owl:sameAs allows someone to state that two separate descriptions really refer to the same entity. Currently, that means that operational systems merge the descriptions and, at the same time, merge the provenance information, thus losing the ability to retrieve where each individual description came from. This merging of provenance can be problematic or even catastrophic in biomedical applications that demand access to provenance information. Based on our knowledge of data integration issues in biomedicine, we give examples as use cases of this issue in biospecimen management and experimental metadata representations. We suggest that systems using any construct like owl:sameAs must provide an option to preserve the provenance of the entities in question and of the ground assertions related to them.
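
One way to keep per-source provenance while still asserting identity is to keep each description in its own named graph; the sketch below illustrates that pattern with rdflib as an illustration, not the mechanism the authors propose, and all URIs are placeholders.

```python
# Minimal sketch: per-source named graphs preserve where each description came from,
# even after an owl:sameAs link is asserted. All URIs are illustrative placeholders.
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import OWL

ds = Dataset()
EX = Namespace("http://example.org/")
SRC_A = URIRef("http://example.org/graphs/biobank")
SRC_B = URIRef("http://example.org/graphs/lims")

a = ds.graph(SRC_A)
b = ds.graph(SRC_B)
a.add((EX.sample123, EX.tissue, EX.liver))
b.add((EX.specimen_987, EX.storageTemp, EX.minus80C))

# The identity link lives in its own graph, so neither source description is rewritten
links = ds.graph(URIRef("http://example.org/graphs/links"))
links.add((EX.sample123, OWL.sameAs, EX.specimen_987))

# Provenance survives: every quad still carries its source graph
for s, p, o, g in ds.quads((None, None, None, None)):
    print(g, "says", s, p, o)
```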

[Return to Full Agenda]


Jean Peccoud
Virginia Tech
Blacksburg, VA, USA

Presentation Title:

Developing Semantic Models of DNA Sequences Using Attribute Grammars

Presentation Abstract:

Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating a set of parts to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes, beyond the gene-centric methods used in genetics. One possible approach to this problem is to extend the linguistic metaphor used to formulate the central dogma. The notions of genetic code, transcription, and translation are derived from a linguistic representation of biological sequences. Several authors have modeled the structure of various types of biological sequences using syntactic models. However, these structural models have not yet been complemented by formal semantic models expressing the sequence function. The Sequence Ontology and the Gene Regulation Ontology represent attempts to associate semantic values with biological sequences. Their controlled vocabularies can be used by software applications to manage knowledge. However, the semantics derived from these ontologies is a semantics of the sequence annotation, not of the sequences themselves. Developing semantic models of DNA sequences represents a new frontier in biomedical semantics.

Method

Attribute grammars are used in computer science to translate the text of a program's source code into the computational operations it represents. By associating attributes with parts, modifying the values of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. These capabilities are illustrated by simple example grammars, implemented in a Prolog program, expressing how gene expression rates depend upon single or multiple parts.
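
A minimal sketch of the attribute-grammar idea, written in Python rather than the Prolog the authors use: parts carry attributes, and a rule over a sequence of parts synthesizes an expression rate. The part names, attribute values, and rate formula are made up for illustration.

```python
# Minimal sketch of the attribute-grammar idea in Python (the authors use Prolog):
# parts carry attributes, and a rule over the sequence of parts synthesizes a
# model-level attribute (an expression rate). All part names and numbers are made up.
PART_ATTRIBUTES = {
    "pTac":  {"type": "promoter", "strength": 0.8},
    "rbs34": {"type": "rbs",      "efficiency": 0.6},
    "gfp":   {"type": "cds",      "product": "GFP"},
    "term1": {"type": "terminator"},
}

def expression_rate(design, k_max=100.0):
    """Synthesized attribute: a promoter -> rbs -> cds -> terminator cassette
    expresses its product at k_max * promoter strength * RBS efficiency."""
    attrs = [PART_ATTRIBUTES[p] for p in design]
    types = [a["type"] for a in attrs]
    if types != ["promoter", "rbs", "cds", "terminator"]:
        raise ValueError("design does not match the expression-cassette rule")
    promoter, rbs, cds, _ = attrs
    return {cds["product"]: k_max * promoter["strength"] * rbs["efficiency"]}

print(expression_rate(["pTac", "rbs34", "gfp", "term1"]))   # {'GFP': 48.0}
```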

Results

To demonstrate the potential use of a semantic model to search for a desirable behavior in a large genetic design space, we have generated all 41,472 possible DNA sequences having the same structure as previously described switches. All sequences were translated into separate model files, and a script was developed to perform a bistability analysis of each model. Parameters of the semantic model were obtained by qualitatively matching the experimental results of the six previously published switches. It takes only minutes to generate the 41,472 sequences and translate them into SBML files; hence, the computational cost of this step is negligible compared to the time required by the simulation of the SBML files.

Conclusions

Attribute grammars represent a flexible framework connecting DNA sequences with models of the biological function they encode. This formalism is also expected to provide a solid foundation for the development of computer assisted design applications for synthetic biology. A challenge for the synthetic biology community will be to reconcile the attribute grammar approach to developing semantic models of DNA sequences with the semantic models that can be derived from carefully crafted ontologies.

[Return to Full Agenda]


Elgar Pichler
Waltham, MA, USA

Presentation Title:
The Translational Medicine Ontology: A Small Compass for Navigating a Large Sea of Biomedical Data

Presentation slides - .pdf: click here

Presentation Abstract:
The Translational Medicine Ontology (TMO) is a bridging ontology to span domain ontologies which describe diverse areas of translational medicine such as hypothesis management, discovery research, drug development, formulation, clinical research, and clinical practice.

Such an ontology is needed if separate data sets from genomics, chemistry, and medicine are to be integrated so that they can be browsed, queried, analyzed, and presented collectively. Unfortunately, at present only very few ontologies exist that attempt to bridge these domains. A more holistic view of the available data in these areas is necessary to successfully address questions in personalized medicine and tailored drug development.

The TMO was developed by members of the Translational Medicine Ontology track of the World Wide Web Consortium’s Health Care and Life Sciences Interest Group to help provide such a view.

After careful consideration of the roles that individuals in health care and life sciences can assume in a translational medicine setting, the questions relating to each role and the relevant concepts in these questions were identified. The identified concepts were then included in the ontology after aligning them to the Basic Formal Ontology (BFO) and mapping them to classes in several domain ontologies where possible.

To demonstrate the value of the TMO in integrating relevant data sets and enabling a more informative exploration of the data, a use case was developed in which a physician tries to answer questions focused on disease diagnosis, drug treatment options, and relevant clinical trials. For that use case a number of data sources containing information on pharmacogenomics, diseases, drugs, and side effects were aggregated. The combined data were then used in combination with the TMO to find answers to the physician's questions.
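
A minimal sketch of the kind of cross-domain question such an aggregation can answer follows, using rdflib; the class and property names are hypothetical placeholders rather than actual TMO terms, and the facts are invented.

```python
# Minimal sketch of a cross-domain query over aggregated toy data; the property
# names are hypothetical placeholders, not actual TMO terms.
from rdflib import Graph, Namespace

TM = Namespace("http://example.org/tmo/")
g = Graph()
g.bind("tm", TM)

# Toy aggregated facts from pharmacogenomics, drug, and trial sources
g.add((TM.patient1, TM.hasGenotype, TM.CYP2C19_poor_metabolizer))
g.add((TM.clopidogrel, TM.reducedEfficacyIn, TM.CYP2C19_poor_metabolizer))
g.add((TM.trial_NCT000, TM.studiesAlternativeTo, TM.clopidogrel))

q = """
PREFIX tm: <http://example.org/tmo/>
SELECT ?drug ?trial WHERE {
  tm:patient1 tm:hasGenotype ?g .
  ?drug tm:reducedEfficacyIn ?g .
  ?trial tm:studiesAlternativeTo ?drug .
}
"""
for row in g.query(q):
    print("Flag", row.drug, "and consider", row.trial)
```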

This presentation will give an overview of the work of the TMO group. In particular, the TMO development method and the application of the TMO in a clinical use case will be discussed. Problems and strengths of the TMO methodology will be highlighted in the hope that the TMO can serve as a model for similar ontology developments and data integration efforts.

[Return to Full Agenda]


Raul Rodriguez-Esteban
Boehringer Ingelheim
Ridgefield, CT, USA

Presentation Title:
From Genetics to Network to Target Discovery

Presentation slides - .pdf: click here

Presentation Abstract:
Closing the gap between genetics, molecular biology, and disease phenotypes requires the integration of disparate sources of information, both structured (e.g., databases) and unstructured (e.g., text accessed through text mining). Because there are multiple ways in which information can be represented, and multiple probabilistic models that can underpin it, flexible methods of information management are necessary in order to develop applications that respond to the changing requirements of a project. In particular, the task of target discovery requires a continuous cycle of selection and triage by committee that is amenable to Semantic Web applications. I will discuss examples that use information from genetics or text-mined protein networks as a starting point, leading to candidate targets, and how this information can be deployed in a pharmaceutical organization to allow stakeholders to share, modify, and visualize the results for themselves.

[Return to Full Agenda]


Tim Schultz
Johnson & Johnson
USA

Presentation Title:

Creating a Linked Data Architecture for Neuroscience

Presentation slides - .pdf: click here

Presentation Abstract:
The Food and Drug Administration issued a monumental report entitled Challenge and Opportunity on the Critical Path to New Medical Products in March 2004. Today this report serves as a roadmap for expediting the discovery, development, and commercialization of novel life-saving and life-enhancing pharmaceutical products. Computer modeling techniques, biomarkers, and clinical trial endpoints are examples of tools employed along the "critical path". The strategic utilization of such tools in the diagnosis and assessment of potential treatments for neurological disorders such as Alzheimer’s disease has led to tremendous medical gains. However, with such advances come great challenges in developing and implementing effective strategies to manage the corresponding data.

The volume of biomedical data generated from Phase 0 through Phase 3 is enormous. Managing vast quantities of such information is further complicated by the diverse nature of both data type and origin. The fact that much of the data currently originates through third-party collaborations (e.g., academia, in-licensing partnerships, and contract research organizations) presents unique challenges. Consequently, it is no longer possible to manage data in a traditional ad hoc manner. It is essential that a company develop and maintain guidelines for how data should be managed and represented, thereby making the task of introducing additional data more straightforward. These guidelines should have common attributes. The data environment should be flexible and provide straightforward integration of additional internal and external data sources. It must be interoperable and thus have established agreements over data formats, protocols, terminology, semantic interpretation, access, and authentication. The platform must be reusable, thereby supporting easy access to biomedical data endpoints. It must be context-rich, retaining domain-specific bias while remaining conducive to integration. Finally, the data environment should be interlinked so that relationships between disparate pieces of data become readily apparent.

Ultimately, the goal of developing a linked data architecture in the discovery, development and commercialization of novel therapeutic agents is to provide a platform for cultivating an environment where it is more straightforward to manage, integrate, and access data in a meaningful way across the enterprise. It should enable information to flow smoothly across community, application, and enterprise boundaries.

The goal of this presentation is to provide an overview of preliminary research within Johnson & Johnson Pharmaceutical Research & Development pertaining to Linked Data and applied to the domain of Alzheimer’s disease research. It will describe how we have implemented a system to catalog existing data inside of Johnson & Johnson and how we have built a Linked Data framework that surfaces the catalog of data through SPARQL endpoints. Details will be provided on the rich visualization tool that is used for interrogating the data.
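
As a minimal sketch of how a client might consume such a catalog through a SPARQL endpoint, the example below uses the SPARQLWrapper library; the endpoint URL, graph layout, and predicate names are hypothetical placeholders.

```python
# Minimal sketch of querying a catalog exposed as a SPARQL endpoint; the endpoint
# URL and vocabulary are hypothetical placeholders, not the actual J&J system.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")   # stand-in endpoint
endpoint.setQuery("""
    PREFIX ex: <http://example.org/catalog/>
    SELECT ?dataset ?study WHERE {
      ?dataset ex:aboutDisease ex:AlzheimersDisease ;
               ex:fromStudy ?study .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["dataset"]["value"], binding["study"]["value"])
```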

[Return to Full Agenda]


Susie Stephens
Johnson & Johnson
USA

Presentation Title:
W3C Semantic Web for Health Care and Life Sciences Interest Group: A Pre-competitive Environment for Collaboration

Presentation slides - .pdf: click here

[Return to Full Agenda]


John Timm
IBM
San Jose, CA, USA

Presentation Title:

Methodology for Standards-Based Biomedical and Healthcare Data Instance Generation

Presentation slides - .pdf: click here

Presentation Abstract:
Our objective is to provide a methodology for biomedical and healthcare data instance generation that enables semantic interoperability and leverages the strengths of several different standards and technologies to bridge the inherent knowledge gap between the clinical domain expert and the healthcare IT domain expert. The clinical domain expert is comfortable with medical terminologies such as SNOMED-CT and visual ontology editors such as Protégé. The healthcare IT expert is familiar with object-oriented analysis and design, modeling languages such as the Unified Modeling Language (UML), and XML-based standards for healthcare interoperability such as the HL7 Clinical Document Architecture (CDA). CDA is defined at a high level of abstraction and must be further constrained for specific use cases and application scenarios. Constraints are defined using templates and published in an implementation guide (IG).

Starting with an implementation guide, the healthcare IT domain expert creates a model of CDA templates using UML and OCL. This template model is transformed into an OWL ontology based on rules from the Ontology Definition Metamodel (ODM). The generated ontology is used to capture a set of variables, their relationships, and additional constraints on property values. This core ontology serves as a target for data harmonization across all data sources. The clinical domain expert is responsible for mapping data source variables (cohort ontology) to the core ontology. Data extracted from the various data sources at run-time is converted to OWL individuals that conform to the cohort ontology. Using the mappings specified by the clinical domain expert, these individuals are converted to OWL individuals that conform to the core ontology. An instance generation engine receives the core OWL data graph and generates standard CDA-based XML instances that conform to the template model.
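
A minimal sketch of the individual-conversion step described above follows: cohort-ontology individuals are rewritten as core-ontology individuals according to expert-specified mappings. The namespaces, class names, and property names are hypothetical, not the actual PHCR ontologies, and rdflib stands in for the Jena API used by the authors.

```python
# Minimal sketch of mapping cohort-ontology individuals to core-ontology individuals.
# Namespaces, class names, and properties are hypothetical; rdflib stands in for Jena.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

COHORT = Namespace("http://example.org/cohort#")
CORE = Namespace("http://example.org/core#")

# Mappings the clinical domain expert would specify (cohort term -> core term)
CLASS_MAP = {COHORT.TBObservation: CORE.TuberculosisCondition}
PROP_MAP = {COHORT.obsDate: CORE.effectiveTime, COHORT.obsValue: CORE.value}

cohort_data = Graph()
cohort_data.add((COHORT.obs1, RDF.type, COHORT.TBObservation))
cohort_data.add((COHORT.obs1, COHORT.obsDate, Literal("2010-01-15")))
cohort_data.add((COHORT.obs1, COHORT.obsValue, Literal("positive")))

core_data = Graph()
for s, p, o in cohort_data:
    if p == RDF.type and o in CLASS_MAP:
        core_data.add((s, RDF.type, CLASS_MAP[o]))
    elif p in PROP_MAP:
        core_data.add((s, PROP_MAP[p], o))

for triple in core_data:
    print(triple)   # individuals now conform to the core ontology's vocabulary
```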

We used the HL7 Public Health Case Report (PHCR) to validate our approach. From the PHCR IG, we created a template model for constraining the general CDA to the PHCR with tools available from the Model-Driven Health Tools (MDHT) project in Open Health Tools (OHT). We also created the PHCR Tuberculosis model. A core ontology was generated from these template models using the Eclipse UML2 API and the Jena API. Protégé was used to create a cohort ontology for a mock data source containing clinical data of tuberculosis observations. Mappings between the cohort ontology and the generated core ontology were also created using Protégé. The Jena API was used to convert individuals that conform to the cohort ontology into individuals that conform to the core ontology. These individuals were used in conjunction with the runtime API generated from the PHCR TB template model to produce CDA XML instances. Increasing requirements to implement IG-based data exchanges have highlighted the need for expertly tailored tooling, established shared core ontologies, mapping processes, and validation technologies. Our approach to semantic data instance generation, based on information models serving as a common language, holds promise for providing low-cost and easy-to-use tools for improved interoperability in these environments.

[Return to Full Agenda]


Simon N. Twigger
Medical College of Wisconsin
Milwaukee, WI, USA

Presentation Title:
Semantic Web Approaches to Candidate Gene Identification

Presentation slides - .pdf: click here

Presentation Abstract:
"Are any of these genes associated with my disease or phenotype? Is this candidate gene expressed in my tissue of interest?" These are examples of common questions asked virtually every day by scientists attempting to identify genes contributing to human disease. Model organism databases such as the Rat Genome Database (RGD) curate published data related to these questions, but there is much more information available than can be manually curated. Much of this information is being deposited into large-scale data repositories, but extracting usable information and knowledge from this stored data is a challenging problem. The goal of our project is twofold: 1) to explore the use of ontologies and the National Center for Biomedical Ontology's Web service technologies to annotate large-scale repositories such as NCBI's Gene Expression Omnibus (GEO), and 2) to build tools that enable researchers to use the resulting annotations to further their studies of the genetic causes of disease. I will present our experiences using ontologies to augment the GEO database and the steps we are taking to integrate these results with Semantic Web technologies to accelerate candidate gene discovery.

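As a toy illustration of how such annotations could be put to work, the sketch below answers "is this candidate gene expressed in my tissue of interest?" against made-up sample annotations; the sample IDs, ontology term IDs, and expression calls are placeholders, not GEO or RGD content.

```python
# Toy illustration: ontology annotations on expression samples used to answer
# "is this candidate gene expressed in my tissue of interest?". Sample IDs,
# ontology term IDs, and expression calls are placeholders, not GEO/RGD content.
SAMPLE_ANNOTATIONS = {            # sample -> ontology terms annotated to it
    "GSM_A": {"TISSUE:kidney"},
    "GSM_B": {"TISSUE:heart"},
}
EXPRESSED_GENES = {               # sample -> genes called "expressed" in it
    "GSM_A": {"Ace", "Ren1"},
    "GSM_B": {"Myh6"},
}

def expressed_in(gene, tissue_term):
    """Return samples annotated with the tissue term in which the gene is expressed."""
    return [s for s, terms in SAMPLE_ANNOTATIONS.items()
            if tissue_term in terms and gene in EXPRESSED_GENES.get(s, set())]

print(expressed_in("Ace", "TISSUE:kidney"))   # ['GSM_A']
```
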
[Return to Full Agenda]


Dennis Underwood
CSO, Praxeon, Inc.
Cambridge, MA, USA

Presentation Title:
Using Semantic Methods to Access Medical Information for Healthcare Professionals and Their Patients

Presentation Abstract:
By 2006 we had created 3 million times the amount of digital information contained in all the books ever written (almost 150 exabytes), and this is projected to grow to 1,500 exabytes by 2010. This growth in information is mirrored in medicine and the life sciences: currently (2009) there are 19 million citations in PubMed from MEDLINE and life science journals, and growth in PubMed is projected to exceed 1 million citations per year by 2010. This growth is likely dwarfed by the number of online articles from other sources, such as news channels, regulatory bodies, institutes and hospitals, blogs, and health and medical communities, which continue to grow exponentially. The challenge for healthcare professionals and their patients is to find high-quality, best-evidence, actionable medical information at the time of need, personalized to their needs. Our approach to this challenge is as follows: provide access to the highest quality medical information using sophisticated medical ontologies and semantic methodologies; personalize according to medical interests and personal medical records; classify hits in well-defined ways for different audiences; and provide community functionality that ranks and prioritizes information and allows comments and criticisms.

Our first demonstration is providing medical information to healthcare professionals through a web-based portal (www.curbside.md). Physicians can ask sophisticated questions such as: “I have a patient with micro-albumin in their urine. Is using losartan effective in preventing the progression of diabetic kidney disease? Is it safe in someone with a history of allergies to ACE inhibitors?” The results of such a query are drawn from sources such as the peer-reviewed medical literature, the FDA, the National Guidelines Clearinghouse, leading professional societies, Cochrane Reviews, ACP Journal Club, and ongoing clinical trials databases. The results are functionally organized for physicians: differential diagnosis, epidemiology, case reports, treatment, systematic reviews, etc. Various tools provide the ability to go directly to the full article, focus on any particular article, email results, and so on. The semantic methodology used is semantic fingerprinting of the query and searching across the corpus of medical information (semantically indexed) for relevant articles; our approach pinpoints the particular paragraphs that are most relevant to the query. These tools are used freely by over 20 health advocacy organizations and by healthcare professionals.
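
Purely as an illustration of how concept-based matching of queries to paragraphs can work in general, here is a toy overlap-ranking sketch; it is not a description of the semantic fingerprinting method actually used, and the concept labels and paragraph IDs are invented.

```python
# Illustrative only: concept-set overlap between a query and indexed paragraphs.
# Not Praxeon's actual method; concept IDs and paragraph IDs are placeholders.
QUERY_CONCEPTS = {"microalbuminuria", "losartan", "diabetic_nephropathy"}

PARAGRAPH_INDEX = {   # paragraph id -> concepts found in that paragraph
    "pmid123:para4": {"losartan", "diabetic_nephropathy", "renal_protection"},
    "pmid456:para2": {"ace_inhibitor", "allergy"},
}

def rank_paragraphs(query_concepts, index):
    """Rank indexed paragraphs by overlap with the query's concept fingerprint."""
    scored = [(len(query_concepts & concepts), pid)
              for pid, concepts in index.items()]
    return [pid for score, pid in sorted(scored, reverse=True) if score > 0]

print(rank_paragraphs(QUERY_CONCEPTS, PARAGRAPH_INDEX))   # best-matching paragraphs first
```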

Patients are interested in quality healthcare information that is in a language they can understand and that is focused on their medical needs, conditions, and treatments. Our approach to meeting these needs is to use a patient’s medical and demographic profile to direct news and alerts to them from the best medical and health channels, including blogs and discussions in healthcare communities. The web-based portal (www.mydailyapple.com) enables them to enter a medical profile and receive information via email, RSS feeds, or through a personalized medical page. Again, this uses semantic methodologies to match the profiles to the medical information being published. This is freely available to patients and is part of Google Health.

[Return to Full Agenda]


Lei Xie
San Diego Supercomputer Center
La Jolla, CA, USA

Presentation Title:
Knowledge Discovery of Protein-Ligand Interaction Networks

Presentation slides - .pdf: click here

Presentation Abstract:
Network pharmacology, which focuses on designing multi-target drugs, has emerged as a new paradigm in drug discovery. Success in network pharmacology depends on understanding genome-wide protein-ligand interactions on multiple scales. We have developed a knowledge discovery platform, PLINA, which applies knowledge engineering techniques to predict protein-ligand interaction networks and to correlate protein-ligand interactions with clinical outcomes. PLINA consists of three main components: an annotated corpus for training and evaluating text mining algorithms and developing a protein-ligand interaction ontology, an RDF/OWL Protein-Ligand Interaction Modeling Ontology (PLIMO) to support quantitative molecular modeling and text mining, and a software system to recognize and classify protein-ligand interactions from sentences in the biomedical literature.

An annotated corpus is critical for the development of reliable text mining techniques. We have created a Protein-Ligand Interaction Corpus (PLIC). PLIC consists of 2,000 MEDLINE abstracts that include both drug and target names found in DrugBank (www.drugbank.ca). These abstracts are segmented into sentences, and each sentence is further parsed with part-of-speech tagging. In addition, chemical names, protein names, and terms that describe drug-target interactions are annotated. The annotated protein-ligand interaction corpus is a valuable resource for the training and evaluation of text mining components such as named entity recognition.

We have developed an RDF/OWL Protein-Ligand Interaction Modeling Ontology (PLIMO) to support quantitative modeling of protein-ligand interactions. PLIMO is designed for maximum reuse of existing ontologies (e.g., ChEBI, GO, Protein Feature, BioPAX). In the latest version of PLIMO, there are 334 concepts and 33 relationships. By integrating PLIMO with other biological ontologies, e.g., for phenotypes and diseases, it is possible to represent knowledge at multiple scales, from molecular interactions to cellular functions to disease and clinical outcomes.

A software system has been implemented that is able to acquire explicit relations between proteins and chemicals from MEDLINE abstracts. Given a chemical or protein name, the system first extracts all related abstracts using information retrieval techniques. Second, sentences that include the given name are parsed. Third, other protein or chemical names in the sentence are recognized using a hidden Markov model trained on PLIC. Finally, the relationship between the chemical and the protein is identified based on PLIMO.
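
A minimal sketch of that four-step pipeline on toy data follows; the dictionary lookup below stands in for the HMM-based recognizer trained on PLIC, and all abstracts, names, and relation words are invented.

```python
# Minimal sketch of the retrieval -> sentence selection -> NER -> relation step;
# the dictionary lookup stands in for the HMM trained on PLIC, and all abstracts,
# names, and relation words are made-up examples.
ABSTRACTS = {
    "PMID1": "Imatinib inhibits ABL1 kinase activity in CML cells.",
    "PMID2": "Aspirin reduces fever in most patients.",
}
CHEMICALS = {"imatinib", "aspirin"}
PROTEINS = {"abl1"}
RELATION_WORDS = {"inhibits", "activates", "binds"}   # stand-in for PLIMO relations

def find_interactions(query_name):
    hits = []
    for pmid, text in ABSTRACTS.items():                       # 1. retrieval
        for sentence in text.split("."):                       # 2. sentence selection
            tokens = sentence.lower().split()
            if query_name.lower() not in tokens:
                continue
            chems = [t for t in tokens if t in CHEMICALS]      # 3. "NER"
            prots = [t for t in tokens if t in PROTEINS]
            rels = [t for t in tokens if t in RELATION_WORDS]  # 4. relation typing
            if chems and prots and rels:
                hits.append((pmid, chems[0], rels[0], prots[0]))
    return hits

print(find_interactions("imatinib"))   # [('PMID1', 'imatinib', 'inhibits', 'abl1')]
```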

Combined with other bioinformatics techniques, we have successfully applied PLINA to repurpose existing drugs to treat different diseases, to elucidate molecular mechanisms of drug side effects, and to fish out the molecular targets of in vivo screens. In summary, a knowledge engineering technique that allows us to predict protein-ligand interactions on multiple scales is a valuable tool for drug discovery and translational medicine.


[Return to Full Agenda]