Conference on Semantics in Healthcare and Life Sciences (CSHALS)


Updated February 01, 2012


Target Identification Using an Integrated Subset of the Yeast Interactome with Chemical Genomic Data in RDF

Nadia Anwar
General Bioinformatics
Reading, United Kingdom

Abstract: Semantic web technologies provide a well-established, efficient and cost-effective data integration strategy. We demonstrate the advantages these technologies offer in target discovery by combining experimental evidence from several target discovery technologies using the linked data paradigm. Drug discovery methods such as drug-induced HaploInsufficiency Profiling (HIP) are here combined with other chemical genomic data and genetic interaction networks to improve the sensitivity and specificity of target identification. We demonstrate the value of this integration using the yeast interactome with complementary experimental evidence within a fungal target discovery pipeline for triaging of hit lists, target identification and target deconvolution. Genetic interaction profiles generated from yeast deletion strains, using methods such as SGA, describe the relationships between genes. These profiles, integrated into a network of genetic interactions, are used to uncover and predict the functions of uncharacterized genes. Chemical genomic data describe the influence of small molecules on biological systems and are used to characterize the effect of compounds at the cellular level. Chemical genomic methods including HaploInsufficiency Profiling (HIP), Homozygous Profiling (HOP) and Multi-copy Suppression Profiling (MSP) are commonly used together to overcome the limitations of individual technologies. For example, HIP is used to identify small molecule (drug) targets; however, HIP is limited to molecules that inhibit cell growth and will fail to identify targets with functional paralogs. While HIP identifies direct targets, HOP is especially useful for providing insight into potential drug interactions. A combination of these approaches provides a more complete view, specifically identifying both on-target and off-target effects.
Since HIP, HOP and MSP are based on the same principles, a combined approach, although more time consuming and expensive, delivers more comprehensive data. An alternative combined approach is based on the idea of re-using established genomic scale data. Costanzo et al. re-use their clustered genetic interactions (GI) by correlating these with chemical-genetic interactions (CGI) from HOP. They successfully re-use their GI and CGI data and demonstrate that their combined profiles are complementary to HIP. Following Costanzo et al., we used semantic technologies to integrate the genetic interaction profiles described in their paper with a test set of compounds assayed using HIP. Specifically, we have followed the linked data approach using clustered networks of GI and CGI in yeast, layered over chemical genomic experiments. We show that this method is not only a cost-effective integration strategy for these data but also simplifies the discovery of the target as well as relevant interactions. Creating a foundational resource of these data in this fashion allows new experimental results and clusters to be layered into the network efficiently, moving from boutique, case-by-case integration to a scalable and robust integration resource. We demonstrate how this integrated data set can be used to identify profiles for compounds of interest and their targets, and to aid the visualization of the target network proximal to a compound’s immediate target. Finally, we demonstrate how providing such a comprehensive view of the data eases the investigation of the compound’s mechanism of action.


Intelligent Surveillance of Health Care-associated Infections with SADI Semantic Web Services

Christopher Baker
University of New Brunswick
Saint John, Canada

Abstract: 1. Objectives and Motivation: Clinical Intelligence (CI) tools support data analysis for the purposes of clinical research, surveillance and rational health care management. Ad-hoc querying of clinical data is one desirable type of functionality. Since most of the data is currently stored in relational form, ad-hoc querying is problematic, as it requires specialized technical skills and knowledge of particular data schemas. A possible solution is semantic querying, where the user formulates queries in terms of domain ontologies that are much easier to navigate and comprehend than data schemas. Existing approaches to semantic querying of relational data, based on declarative semantic mappings from data schemas to ontologies, cannot cope with situations in which some computation is required at query time. We report preliminary progress on a project dedicated to the use of SADI Semantic Web services [1] for semantic querying of clinical data for the surveillance of hospital-acquired infections (HAI) [2]. 2. Method: We implement semantic access to a relational database (RDB) by using an ontology for HAI and modeling the RDB in it. The modeling is implemented by SADI Semantic Web services that can be automatically discovered and invoked based on the needs of a particular query. The main services draw data from the RDB, but services bringing data from external resources are also used. Users formulate SPARQL queries using primitives from the ontology and execute them via a SADI query engine. The querying can be both ad-hoc and self-service because the users need not know RDB programming. 3. Results: To test our approach in a CI scenario dedicated to the surveillance of HAI, we are prototyping a SADI-based infrastructure for semantic querying of The Ottawa Hospital data warehouse (see, e.g., [3]). Our infrastructure includes an ontology defining concepts suitable for reasoning about hospital-acquired infections and a number of SADI services on the data warehouse.
To test the infrastructure, we write SPARQL queries representing questions an HAI surveillance professional would like to ask, such as "Which patients were diagnosed with SSI while they were taking corticosteroids?" or "How many diabetic patients were diagnosed with SSI?". To facilitate the temporal comparisons required by many competency questions, we created a time ontology and wrote a set of SADI services implementing temporal reasoning. 4. Conclusions: The main conclusion from our work on semantic querying so far is that the use of SADI services via a SPARQL interface is a viable general direction. Our approach will add to the pool of existing practical methods for semantic querying of RDBs, at least in CI. 5. References: [1] M. D. Wilkinson, B. Vandervalk, and L. McCarthy. SADI Semantic Web Services - 'cause you can't always GET what you want! Proceedings of the IEEE APSCC 2009, Singapore, 2009. [2] A. Shaban-Nejad, G.W. Rose, A. Okhmatovskaia, A. Riazanov, C.J. Baker, R. Tamblyn, A.J. Forster, and D.L. Buckeridge. Knowledge-based surveillance for preventing postoperative surgical site infection. Proceedings of MIE, Oslo, Norway, 2011. [3] G.W. Rose. Use of an Electronic Data Warehouse to Enhance Cardiac Surgical Site Infection Surveillance at a Large Canadian Centre. MSc thesis, University of Ottawa.
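
The kind of temporal comparison behind a question such as "diagnosed with SSI while taking corticosteroids" can be illustrated with a minimal Python sketch. This is not the SADI services or the time ontology described in the abstract; the `Interval` type, the function names and the dates are hypothetical and only show the interval logic such a service would need to evaluate.

```python
from datetime import date
from typing import NamedTuple


class Interval(NamedTuple):
    """A closed date interval (hypothetical helper type, not from the abstract)."""
    start: date
    end: date


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two closed intervals share at least one day."""
    return a.start <= b.end and b.start <= a.end


def diagnosed_during(diagnosis_day: date, exposure: Interval) -> bool:
    """Return True if the diagnosis date falls inside the exposure interval."""
    return exposure.start <= diagnosis_day <= exposure.end


if __name__ == "__main__":
    # Invented example data: a drug course and a diagnosis date.
    course = Interval(date(2011, 3, 1), date(2011, 3, 14))
    print(diagnosed_during(date(2011, 3, 10), course))
```

In a service-based setting, a check like `diagnosed_during` would be computed on demand at query time, which is exactly the situation the abstract says static schema-to-ontology mappings cannot handle.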


Dynamic Enhancement of Drug Product Labels Through Semantic Web Technologies

Richard Boyce
University of Pittsburgh
Pittsburgh, United States

Abstract: FDA-approved drug product labeling (package insert or PI) is a major source of information intended to help clinicians prescribe drugs in a safe and effective manner. Unfortunately, drug PIs have been identified as often lagging behind the drug knowledge expressed in the scientific literature, especially when it has been several years since a drug was released to the market. Out-of-date or incomplete PI information can increase the risk of otherwise preventable adverse drug events. This can occur directly if the PI fails to provide information that is needed for safe dosing or to properly manage drugs known to interact. Clinicians might also be indirectly affected if they depend on third party drug information sources, and these sources fail to add information that is available in the scientific literature but not present in the PI. We are creating a Linked Data store that will enable the drug PI to be expanded as new information becomes available in the scientific literature. The goal of the Linked Data store will be to provide clinicians, patients, and the maintainers of drug information resources with the most complete and up-to-date information on particular claims made within a PI. We are focusing on 25 currently-marketed psychotropic medications (nine antipsychotics, twelve antidepressants, and four sedative hypnotics). To construct this Linked Data repository, we aim to use Natural Language Processing (NLP) technologies to identify core claims in the scientific literature and various web-based data sources that pertain to pharmacokinetic drug-drug interactions, age-related changes in clearance, metabolic clearance pathways, and genetic polymorphisms that can affect metabolism. This work aligns with the CSHALS themes "Linked Data", "Text Analysis, NLP, Question Answering", "Data Modeling: Ontologies, Taxonomies", and "Clinical Applications."
Method: We will identify the core rhetorical components of the content sources using a basic Scientific Discourse ontology constructed from (and compatible with) biomedical discourse ontologies (i.e., SWAN, OAC and AO) and discourse annotation metadata (specifically CoreSC). The ensuing discourse annotations will distinguish between facts, hypotheses, and evidence statements, and will be automatically recognised in text following an information extraction approach similar to conceptualisation zoning. The expected result is a Linked Open Data node, a triple store and a SPARQL endpoint available for use by different patient, clinician, and pharmacoepidemiology-centered data sources. Human-readable summaries will also be generated to expand on existing PI information. Results: While we are in the early planning phases of the project, we have built a prototype system that demonstrates the concept by identifying how claims on metabolic clearance and drug-drug interactions could be updated in two drug PIs with evidence from the scientific literature. Conclusions: We envision using the resulting Linked Data store as the back end for a system that provides pharmacokinetic information on age-related clearance changes, metabolic clearance pathways, pharmacokinetic drug-drug interactions, and genetic polymorphisms. After developing a demonstrator for the 25 psychotropics, we anticipate that it will be feasible to subsequently deploy our system for any given drug.


E-Diary Data Collection in Neurology and Psychiatry: Computational Achievements and Challenges

Ron Calvanio
Massachusetts General Hospital & Harvard Medical School
Cambridge, United States

Additional Authors: F. Buonanno, MD

Dr. Calvanio will present e-diary data recorded by patients undergoing outpatient treatment in a neurology clinic at the Massachusetts General Hospital. Patients had, or were suspected of having, a sudden onset disorder: a stroke, a traumatic brain injury, etc. Patient symptom complaints were: sensory or motor spells, headaches, emotional outbursts, fatigue, sleep disturbances, cognitive lapses, or odd behavior. Personalized e-diaries were designed to identify routine events that may have influenced symptom expression. Identification of symptom influences was then used to resolve diagnostic issues and to enhance treatment outcomes. Dr. Calvanio will show: 1) how e-diary data reveal symptom influence patterns, many of which patients are not aware of; 2) how identification of these influences improves care; 3) what the computational challenges are in data coding, data analysis, and data pattern representation.


Chem2Bio2RDF: Linked Open Data for Drug Discovery

Bin Chen
Indiana University
Bloomington, United States

Abstract: A critical barrier in current drug discovery is the inability to utilize public datasets in an integrated fashion to fully understand the actions of drugs and chemical compounds on biological systems. There is a need not only for a resource to intelligently integrate the heterogeneous datasets pertaining to compounds, drugs, targets, genes, diseases, and drug side effects now available, but also for robust, effective network data mining algorithms that can be applied to such integrative data sets to extract important biological relationships. In this talk, we discuss (i) the creation of Chem2Bio2RDF for drug discovery data, integrating chemical compounds, protein targets, genes, metabolic pathways, diseases and side effects using Semantic Web technologies, and (ii) the development of innovative data mining algorithms to facilitate drug discovery. Chem2Bio2RDF incorporates 25 public datasets related to systems chemical biology, grouped into 6 domains: chemical (PubChem Compound, ChEBI, PDB Ligand); chemogenomics (KEGG Ligand, CTD Chemical, BindingDB, MATADOR, PubChem BioAssay, QSAR, TTD, DrugBank, ChEMBL, Binding MOAD, PDSP, PharmGKB); biological (UNIPROT, HGNC, PDB, GI); systems (KEGG Pathway, Reactome, PPI, DIP); phenotypes (OMIM, Diseasome, SIDER, CTD diseases); and literature (MEDLINE/PubMed). The number of RDF triples is approximately 110 million. We developed a domain ontology (called Chem2Bio2OWL) to better integrate these 25 datasets. The primary classes of this ontology are SmallMolecule, MacroMolecule, Disease, SideEffect, Pathway, BioAssay, Literature and Interaction, based partially on the BioPAX classes. The primary classes were further refined in accordance with the current instance data structure. We proposed and tested several graph mining and machine learning algorithms (e.g., Bio-LDA, path finding, subgraph mining and diversity ranking) on the generated Chem2Bio2RDF linked open dataset to facilitate drug discovery.
We found that our Bio-LDA model, which uses bio-terms, journal information and word information to characterize each topic, provides a better representation of topics than the simple LDA model, which can provide only the word representation. Rosiglitazone is one of several thiazolidinediones on the market for diabetes. Our path finding algorithm presents the set of most informative and diverse associations between the drug and its potential side effects, which shows different causes of the hepatitis side effect. Our constraint-based subgraph and diversity ranking algorithm can detect the inhibition of Catechol O-methyltransferase (COMT) in Parkinson's disease. By combining information from DrugBank, PubChem and UniProt, we can find information regarding the gene that Tolcapone and Entacapone target, its name, the protein it encodes, PubMed articles related to their interaction with COMT, and the structure of the protein they target. In this talk, we demonstrate the potential of data mining and graph mining algorithms to identify hidden associations that could provide valuable directions for further exploration at the experimental level. In the future, we will focus on using the identified associations and paths existing between various bio terms to predict the potential connection of other unknown bio terms.
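
Path finding between a drug and a side effect over linked data can be sketched, under simplifying assumptions, as a bounded breadth-first enumeration of simple paths over subject-predicate-object triples. The Python below is only an illustration: it does not rank paths by informativeness or diversity as the authors' algorithm does, and the node identifiers and predicates are invented placeholders rather than Chem2Bio2RDF data.

```python
from collections import deque


def find_paths(triples, source, target, max_edges=4):
    """Enumerate simple paths (shortest first) linking source to target.

    `triples` is an iterable of (subject, predicate, object) tuples; links are
    treated as undirected and predicates are ignored for connectivity.
    """
    neighbours = {}
    for subj, _pred, obj in triples:
        for a, b in ((subj, obj), (obj, subj)):
            adjacent = neighbours.setdefault(a, [])
            if b not in adjacent:
                adjacent.append(b)

    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_edges:
            continue  # do not extend paths beyond the edge budget
        for nxt in neighbours.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [nxt])
    return paths


if __name__ == "__main__":
    # Placeholder triples; identifiers are hypothetical.
    toy = [
        ("drug:D1", "binds", "protein:P1"),
        ("protein:P1", "participatesIn", "pathway:W1"),
        ("pathway:W1", "associatedWith", "sideEffect:S1"),
        ("drug:D1", "binds", "protein:P2"),
        ("protein:P2", "associatedWith", "sideEffect:S1"),
    ]
    for p in find_paths(toy, "drug:D1", "sideEffect:S1"):
        print(" -> ".join(p))
```

Each returned path is one candidate explanation of how the drug could be connected to the side effect; a ranking step would then decide which of them are worth showing.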


Adverse Events Following Immunization: Standardization, Automatic Case Classification and Signal Detection

Mélanie Courtot
British Columbia Cancer Research Centre
Vancouver, Canada

Additional Authors:
Ryan R. Brinkman
BC Cancer Agency, Vancouver, BC, Canada
Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Alan Ruttenberg
University at Buffalo, Buffalo, NY, USA

Abstract: Analysis of spontaneous reports of Adverse Events Following Immunization (AEFIs) is an important way to identify potential problems in vaccine safety and efficacy and to summarize experience for dissemination to health care authorities. However, current reporting methods are not sufficiently controlled. While there is general adoption of the Medical Dictionary for Regulatory Activities (MedDRA) in the reporting systems we consider, definitions are not provided for MedDRA terms, reports are not annotated in a consistent manner, annotators differ in experience, and annotation is done either at entry time or post hoc. Sometimes, only the final adverse event code is saved, discarding evidence supporting the diagnosis. Because of these practices, interpretation of such spontaneous reports is tedious, costly and time consuming. The Adverse Event Reporting Ontology (AERO) we are building plays a role in increasing accuracy and quality of reporting, ultimately improving response time to adverse event signals. Methods: In order to address these deficiencies, we work with the Brighton Collaboration, which has done extensive work towards standardization of case definitions and diagnostic criteria for vaccine adverse events. Based on our initial results with AERO, a working group has been established within the Brighton network, including representation from the Public Health Agency of Canada (PHAC) and the US Food and Drug Administration (FDA), to incorporate logical representations of Brighton case definitions into AERO, with the aim of increasing quality and accuracy of AEFI reporting. As an example, only 9% of the Vaccine Adverse Event Reporting System (VAERS) anaphylaxis reports following H1N1 vaccination in early 2010 were correctly annotated with the MedDRA anaphylaxis term.
Working within the framework being established by the Open Biological and Biomedical Ontologies (OBO) Foundry, the Adverse Event Reporting Ontology (AERO) first documents assessments of relevant signs and symptoms textually. These elements of AEFI reports are then logically defined by being positioned in a hierarchy and related to each other in a way that supports computing an overall diagnosis. Our system allows automatic inference of a diagnosis according to the Brighton criteria based on the evidence encoded in the MedDRA annotations. As an additional test of our approach, we will also attempt to parse the textual section of VAERS reports and annotate them with AERO terms, with the aim of using the logic encoded in AERO to determine diagnoses as defined in the Brighton guidelines. Results: Our approach allows us to unambiguously refer to a specific set of carefully defined signs and symptoms at the time of data entry, as well as to an overall diagnosis that remains linked to its associated signs and symptoms. The adverse event diagnosis is formally expressed, making it amenable to further querying, for example for statistical analysis ("what percentage of patients presented with motor manifestations?"), and at different levels of granularity. Finally, by enabling automatic processing of adverse event reports, we will decrease the time and money needed for their evaluation. This may allow earlier detection of adverse event signals in the datasets, and trigger a warning for experts to investigate further.


Executing Semantics Across Documents: Bringing Science Into Context

Anita de Waard
Disruptive Technology Director, Elsevier Labs
Utrecht, Netherlands

Abstract: In my presentation I will show how semantic technologies and Linked Data are forming the backbone of a new form of science publishing, where a paper is presented within three types of context. The first type of context is that of the research process. As we find better ways to integrate research data, executable components and workflow representations with the scientific narrative, we hope to add richness, depth and accountability to publications and improve the reader’s ability to evaluate and replicate the findings. The second type of context is that of the specific features of the object of study, such as patient characteristics for clinical reports, species and subspecies for animal studies, or other experimental parameters such as instrumentation details. The third type of context we wish to make accessible to the reader is the knowledge preceding and succeeding a given paper. By identifying the key claims the authors make and linking them to their supporting evidence both within and across papers, we hope to build an infrastructure that will enable more straightforward ways of assessing trust and validity when evaluating new information. I will demonstrate these principles with three use cases that we are working on together with our academic collaborators, pertaining to clinical guidelines, drug-drug interactions, and neuroscientific knowledge integration.

Some related references:

  • Bourne P, Clark T, Dale R, de Waard A, Herman I, Hovy E and Shotton D, on behalf of the Force11 community (2011). Force11 White Paper: Improving the Future of Research Communication and e-Scholarship. 27 October 2011. Available from
  • de Waard, A. (2010). ‘The Future of the Journal? Integrating research data with scientific discourse’, LOGOS: The Journal of the World Book Community, Volume 21, Numbers 1-2, pp. 7-11.
  • de Waard, A. (2010). From Proteins to Fairytales: Directions in Semantic Publishing. IEEE Intelligent Systems 25(2): 83-88.
  • de Waard, A., Buckingham Shum, S., Carusi, A., Park, J., Samwald, M. and Sándor, Á. (2009). Hypotheses, Evidence and Relationships: The HypER Approach for Representing Scientific Knowledge Claims. Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), co-located with the 8th International Semantic Web Conference (ISWC-2009).


The VIVO Ontology: Enabling Networking of Scientists

Ying Ding
Indiana University
Bloomington, United States

Abstract: VIVO, funded by NIH, utilizes Semantic Web technologies to model scientists and provides federated search to enhance the discovery of researchers and collaborators across disciplines and organizations. The VIVO ontology is designed with a focus on modeling scientists, publications, resources, grants, locations, and services. It incorporates classes from popular ontologies, such as BIBO, Dublin Core, Event, FOAF, geopolitical, and SKOS. VIVO data is annotated based on the VIVO ontology to semantically represent and integrate information about faculty research (i.e., educational background, publications, expertise, grants), teaching (i.e., courses, seminars, training), and service (i.e., organizing conferences, editorial boards, other community services). The VIVO ontology has been adopted nationally and internationally, and enables national and international federated search for finding experts. VIVO is an open source Semantic Web application that, when populated with researcher interests, activities, and accomplishments, enables discovery of research and scholarship across disciplines and organizations. The VIVO core ontology models the academic community in order to provide a consistent and connected perspective on the research community to various stakeholders, including students, administrative and service officials, prospective faculty, donors, funding agencies, and the public. The major impetus for NIH to fund the VIVO effort is to "develop, enhance, or extend infrastructure for connecting people and resources to facilitate national discovery of individuals and of scientific resources by scientists and students to encourage interdisciplinary collaboration and scientific exchange".
The application is in use at the seven institutions of the NIH VIVO project and has been or will be adopted by several other universities (e.g., Harvard University) and organizations in the USA (e.g., the United States Department of Agriculture), and by several universities or institutions in Australia and China (e.g., Queensland University of Technology, Chinese National Academy of Sciences) (Gewin, 2009). More specifically, VIVO can support discovering potential collaborators with complementary expertise or skills, suggesting appropriate courses, programs, and faculty members according to students’ interests, and facilitating research currency, maintenance and communication. For example, a Computer-Aided Drug Discovery (CADD) group may want to find and team up with a computer specialist and a group using in vivo experiments in drug discovery. If the VIVO core ontology is implemented in this hypothetical situation, the group leader can search across experts in computer science and molecular biology. In this paper, we present a relatively comprehensive discussion of the development of the VIVO core ontology, including the latest updates.


sdlink: An Integrated System for Linking Biological and Biomedical Semantic Data

Alexandre Francisco
Technical University of Lisbon
Lisboa, Portugal

Abstract: Nowadays, with the decreasing cost and increasing availability of high-throughput technologies, an enormous amount of biological and biomedical data is becoming available. Such data is usually represented and stored in different formats and platforms, most of the time offline and not standardized. The automatic integration of data from different databases suffers from several caveats, the most notable being the lack of interfaces for automatic querying and for running integration and analysis tools. In order to solve some of these issues, semantic technologies have been proposed and used with great success. In this work we propose an integrated environment for querying, retrieving and analyzing linked data, suitable for users unfamiliar with such technologies, solving an issue that has been hindering a more general adoption of semantic methodologies in biology and biomedicine. Method: The new sdlink system (http:/ assumes that data is annotated following a given ontology and provides data views, including graphical representations, and a friendly querying interface. The querying interface was developed to be used by users unfamiliar with semantic technologies, who can for instance define a query by means of a simple point-and-click interface, which is then translated to SPARQL. The sdlink system uses Virtuoso OSE as the underlying triplestore. To address user concerns with respect to security and privacy, the system supports user/project access control, based on OpenID for authentication and FOAF+WAC for authorization. The system is being used by two FP7 European projects, with good results in both scalability and usability by non-expert users. We have also made available a public project for evaluation purposes (http:/ Results: Our results were twofold.
First, through the development and deployment of sdlink, we were able to use semantic technologies and linked data on two large projects where most people were unaware of these technologies or of any reason to use them. The main contribution was an interface that simultaneously allows users to retrieve and query linked data, and does not lose expressiveness, efficiency or scalability. In particular, the system adapts automatically to ontology changes and data transformations, depending only on the update of the underlying ontologies. The projects where the system was tested deal with heterogeneous data, including sequence data and experiment results, produced by several teams and work packages, that in the end became integrated, browsable and queryable. The data stored comprises about one million triples, which can be queried in less than one second for most usual queries. Conclusions: The development of sdlink, and its deployment in a real scenario, allowed us to draw conclusions about the importance and usefulness of semantic technologies, namely for domain representation and data integration. More importantly, it was possible to show that, by developing suitable interfaces, any user can benefit from such technologies. Currently, the unfriendliness of most semantic technologies, in particular in the fields of biology and biomedicine, has hindered the adoption of these technologies. The sdlink system is proposed to overcome this problem and to bring semantic technologies and linked data to a broader audience.
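
Translating point-and-click selections into SPARQL can be pictured with a small Python sketch. It is not sdlink's implementation; the function name, the selection format and the `http://example.org/` IRIs are assumptions made purely for illustration. The idea is that the interface collects a class and a list of property/value choices, and a helper turns them into triple patterns.

```python
def build_sparql(class_iri, selections, limit=100):
    """Build a SELECT query from a chosen class and (property IRI, value) pairs.

    Values are emitted as plain string literals; backslashes and double quotes
    are escaped so that a user-typed value cannot break out of the literal.
    """
    lines = ["SELECT ?item WHERE {", f"  ?item a <{class_iri}> ."]
    for prop_iri, value in selections:
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?item <{prop_iri}> "{literal}" .')
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical ontology terms standing in for what a user clicked on.
    query = build_sparql(
        "http://example.org/onto#Experiment",
        [("http://example.org/onto#organism", "Saccharomyces cerevisiae")],
        limit=10,
    )
    print(query)
```

Because the query is generated from ontology terms rather than hand-written, the same helper keeps working when the ontology gains new classes or properties, which mirrors the adaptability the abstract claims for the interface.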


Exploitation of Semantic Methods to Cluster Pharmacovigilance Terms

Natalia Grabar
Universite Lille
Villeneuve d’Ascq, France

Abstract: Pharmacovigilance activity is related to the collection, analysis and prevention of adverse drug reactions (ADRs) likely to be caused by drugs. This activity relies on case reporting to the pharmacovigilance authorities and pharmaceutical industries. Before their inclusion in pharmacovigilance databases, the ADRs of these case reports are coded with terms from dedicated terminologies, such as MedDRA. The analysis of the collected ADRs is related to the safety surveillance within these databases. It relies on the identification of relations between a drug and an ADR. It has been observed that some {drug, ADR} pairs are not flagged when they should be. The main cause is that MedDRA is a fine-grained terminology and that the encoding of the adverse reactions with MedDRA may have an impact on the signal dissolution: similar and close ADRs may be encoded with different terms, so that during the analysis of the databases they remain isolated and the safety risk may be under-estimated. METHODS. We propose to exploit semantic resources and methods provided by Natural Language Processing and by Computer Science for automatic generation of clusters of MedDRA terms that have close semantic and clinical meaning. We exploit the ontological resource ontoEIM and MedDRA terms. The Standardised MedDRA Queries (SMQs) are exploited as the gold standard. Among the methods, we use semantic distance approaches, lexically-based methods for detection of hierarchical and synonymy relations between terms, as well as several clustering methods. The obtained clusters of terms are compared with the existing SMQs, both hierarchical and non-hierarchical. The results are evaluated with three metrics: precision, recall and F-measure. Results are evaluated quantitatively (against the gold standard) and qualitatively (by medical and pharmacovigilance experts).

RESULTS. Various factors have been tested: exploitation of formal definitions, several semantic distance approaches, weighting of the semantic axes within formal definitions, clustering methods, comparison and combination of semantic distance and lexical methods. We obtain results which indicate that the generated clusters can assist the creation of new SMQs or the hierarchical structuring of terms within SMQs. Depending on the SMQs, we obtain interesting results with the semantic distance approach (precision between 36% and 87%, recall between 15% and 77%) and with the lexical approach (precision between 10% and 92%, recall between 3% and 33%). Moreover, these two methods provide complementary results. Indeed, safety topics are better modeled with one or the other of the methods. For instance, the generation of the Agranulocytosis cluster has poor results with the semantic distance approach: the relevant terms are spread within the ontoEIM resource. However, this grouping performs well with the lexical method: the relevant terms have semantic similarities which can be detected at the lexical level. CONCLUSION. The experiments performed indicate that it is possible to generate meaningful clusters of terms on new safety topics in order to assist the creation of new SMQs. The same methods can also be used for the refinement of the hierarchical structure of the existing SMQs.
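
For reference, the three evaluation metrics named above can be computed for one generated cluster against one gold-standard term set as in the following Python sketch. The term identifiers are placeholders, not MedDRA terms, and the balanced F-measure (harmonic mean of precision and recall) is assumed, since the abstract does not state which weighting was used.

```python
def evaluate_cluster(cluster, gold):
    """Return (precision, recall, f_measure) of a term cluster against a gold set."""
    cluster, gold = set(cluster), set(gold)
    true_positives = len(cluster & gold)
    precision = true_positives / len(cluster) if cluster else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumption).
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    generated = {"term_a", "term_b", "term_c", "term_d"}
    gold_standard = {"term_b", "term_c", "term_d", "term_e", "term_f", "term_g"}
    p, r, f = evaluate_cluster(generated, gold_standard)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy case three of the four clustered terms belong to the gold set, so precision is 0.75, while only three of the six gold terms are recovered, so recall is 0.50.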


The Biospecimen Repository as Library: How HeLa is like Moby Dick

James McCusker
Rensselaer Polytechnic Institute
Troy, United States

Abstract: Provenance-oriented data models are becoming critical for fostering interoperability among scientific workflow systems. Tools to manage laboratory systems and biorepositories record the actions of people and equipment in order to keep track of exactly what has happened to experimental artifacts. We explore the similarities between the library science standard Functional Requirements for Bibliographic Records (FRBR) and requirements for biospecimen management in research settings. Abstractive provenance, or the ability to describe entities and their history at multiple levels of abstraction, is used in FRBR to describe the relationship between a particular copy of a book and the concept of that book. A similar treatment can describe the relationship between a cell line, various physical colonies of cells from that cell line, and the originating organisms. We propose a similar standard, Functional Requirements for Biological Resources (FRBioR), to describe those requirements, and an ontology that integrates with the W3C draft PROV provenance model and ontology.


Using Ontologies in the Age-Phenome Knowledge-base (APK)

Eitan Rubin
Ben Gurion University
Beer Sheva, Israel

Abstract: The importance of age in biomedical research and clinical care has resulted in an abundance of publications linking age and phenotypes. However, these data are organized such that searching for age-phenotype relationships is prohibitively difficult. Recently, we described the Age-Phenome Knowledge-base (APK), a computational platform for storage and retrieval of information concerning age-related phenotypic patterns. Here we present and discuss the incorporation and use of ontologies and standardized vocabularies in the APK. Methods and results: The Age-Phenome Knowledge-base contains evidence, such as scientific publications and clinical data analysis, connecting specific ages or age groups and phenotypes such as diseases. It makes extensive use of ontologies and fixed vocabularies in order to describe ages, diseases and other forms of phenotypes. Ages and age groups are described using the Age Ontology, a simple ontology developed for this purpose and based on the description of age-ranges in the Medical Subject Headings (MeSH). The Disease Ontology (DO) is used in APK to represent diseases while other forms of phenotypes are described by a subset of the Unified Medical Language System (UMLS) Metathesaurus. Complex searches are made possible by abstracting over the Age Ontology and the Disease Ontology's hierarchical structures. Conclusions: APK provides an example of how ontologies can be used in rapid development of new knowledge models. It makes integral use of ontologies and vocabularies to represent diseases and age groups in a standard, unambiguous way. Furthermore, the use of ontologies allows abstraction, which in turn makes it easy to develop/conduct complex queries.
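The idea of searching by abstracting over an ontology hierarchy can be sketched as follows. This is a minimal Python illustration, not the APK implementation: the is-a edges, the evidence records and the function names are all hypothetical and are not taken from the Age Ontology, MeSH or the Disease Ontology.

```python
from collections import defaultdict

# Hypothetical is-a edges (child, parent); not taken from the Age Ontology or DO.
IS_A = [
    ("infant", "child"),
    ("adolescent", "child"),
    ("child", "human age group"),
    ("adult", "human age group"),
]

# Hypothetical evidence records: (age term, phenotype term, source id).
EVIDENCE = [
    ("infant", "asthma", "rec-1"),
    ("adolescent", "asthma", "rec-2"),
    ("adult", "asthma", "rec-3"),
]


def descendants(term, edges):
    """Return the term plus every term reachable below it through is-a edges."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def search(age_term, phenotype, evidence, edges):
    """Find evidence annotated with the age term or any more specific age term."""
    ages = descendants(age_term, edges)
    return sorted(src for age, pheno, src in evidence if age in ages and pheno == phenotype)


print(search("child", "asthma", EVIDENCE, IS_A))
```

A query posed at a general term is expanded to all of its more specific terms before matching, so a record annotated with a narrow age group is still found by a broad query.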


Using Linked Open Data to Inform the Drug Discovery Process

James Snowden
UCB Celltech
Slough, United Kingdom

Abstract: Treating disease, and identifying new targets through which the symptoms or causes of disease can be treated, is one of the cornerstones of drug discovery research in the pharmaceutical sector. Whilst much of the information for these areas is available, it is distributed across many systems, both internal and external. The main obstacle to gathering the required information is therefore one of time and resource. In response to this, at UCB we have developed the Target Information Platform (TIP) and Disease Information Platform (DIP) systems to collate key information relating to targets and diseases respectively, and to make this available in a single portal for easy access by our scientists. This approach is underpinned by the capabilities provided by semantic technology, and in particular Linked Open Data (LOD), which allows complete querying of available data sources in a quick and automated manner. The public LOD system, which comprises SPARQL endpoints over key biological data sources, is queried using SPARQLMotion scripts through the TopBraid Composer system. Each script takes in a single data item (UniProt ID for a target, disease name for a disease), which is used to pull data from an initial endpoint (UniProt / Diseasome). The results are parsed and, where relevant, additional calls are made out to endpoints for other data sources. The end result is an RDF data package that collates relevant information from multiple sources in the public domain. Additionally, for DIP, literature, patent and omics data are queried and stored in a triplestore. Web pages are generated from this information and provided to the scientists. The key benefits derived from this approach so far have been speed, completeness of data searching and increased availability of target / disease information. A target search that may have taken 2-3 days' work for 1-2 scientists can now be done in 5-10 minutes.
This frees up scientist time, provides target information faster and allows many more targets to be queried. The data searching is done in a standardised manner, which removes human error and returns more consistent data for targets and diseases. Finally, providing the returned information in a central portal means that scientists always know where to go to access it. All of these benefits are in some way related to the semantic / LOD approach used. The disadvantages of this approach are mostly technical: endpoint uptime / availability and the updating of information within the endpoints. This work has demonstrated that it is possible to utilise the public LOD framework in an automated manner that exemplifies the linked data principle of starting from a single point of information to gather detailed data. It has returned information on concepts of vital importance to drug discovery, has helped to optimise this process at UCB, and has demonstrated practical utility for semantics and LOD.


Domain Knowledge and Provenance-Integrated Knowledge Organization System Represented with RDFS and SPARQL

Young Soo Song
University of Alabama at Birmingham
Birmingham, United States

Abstract: Objectives and Motivation: Although semantic web technologies are expanding as a framework for constructing knowledge organization systems (KOS), their adoption will be limited without control of data flow based on rules and consensus. We have previously addressed this problem by defining a Markov process for user operators associated with a KOS, S3DB. The idea was that by annotating existing assertions with the domain-neutral S3DB tags, the user-operator states describing the provenance could then be tracked by a parallel algebraic process. That solution includes a mechanism for resolving both the merging and the migration of multiple, often conflicting, provenance. This mechanism is currently supported by an open source prototype with a SPARQL endpoint and a query language, S3QL. Although the core concept of S3DB includes both a domain knowledge model and a provenance model, and its implementation is currently used in several institutions, the two models are loosely coupled because the domain knowledge model was expressed as RDFS and the provenance model as numeric computation. In this study we seek to bridge the logic and algebraic representations by describing user operators as an RDFS model such that the integrated representation can be resolved by a SPARQL 1.1 engine. Method: The semantics of the S3DB provenance model were thoroughly analyzed and represented as a semantically equivalent SPARQL query. While the S3DB domain knowledge model is a pure RDFS model, its provenance model is a numerical model, which receives arguments from asserted RDF triples and produces outputs as inferred triples through successive procedures. In this model, asserted triples represent assigned user-operator relationships between users and entities, and new relationships between them are inferred through numerical procedures. Semantically reinforced SPARQL 1.1 can simulate these numerical procedures.
In particular, propagation of user operators corresponds to SPARQL designed with property paths, and merging to SPARQL designed with aggregation functions. These fragments can be assembled into a single SPARQL query containing a subquery. Although merging was performed during propagation of user operators in the original numerical model, our SPARQL model performs merging after propagation is completed, producing the same results without being affected by the order of the procedures. Results: As a proof of concept, our integrated model was implemented with ARQ version 2.8.8, although any triple store or application supporting SPARQL 1.1 should deliver the same results. The provenance model was tested on data from The Cancer Genome Atlas (TCGA), containing microarray and sequencing data for over 500 cancer patients. Verifying the validity of our model requires only three steps: 1) installing ARQ, or an equivalent application supporting SPARQL 1.1; 2) importing the knowledge model and the test TCGA data; and 3) executing the query representing our provenance model. Effective user operators inferred from the query can be stored in namespaces separate from the assigned user operators. Conclusions: We argue that, as a consequence, complex provenance scenarios can be accommodated by data stores equipped with a SPARQL endpoint. This result signifies that the proposed solution can be handled in a scalable and distributed manner by regular triple stores.
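The "propagate first, then merge" ordering described in this abstract can be illustrated without a triple store. The Python sketch below is only an analogy under stated assumptions: the containment edges, the numeric operator levels and the use of max as the merge function are hypothetical stand-ins, and the actual S3DB operator algebra and SPARQL query are not reproduced here.

```python
from collections import defaultdict

# Hypothetical containment edges (parent, child) and assigned operators
# (user, entity, level). None of these values come from S3DB or TCGA.
CONTAINS = [("project", "collection"), ("collection", "item"), ("project", "rule")]
ASSIGNED = [("alice", "project", 1), ("alice", "collection", 2), ("bob", "collection", 1)]


def reachable(start, edges):
    """Entities reachable from start via zero or more edges (like a `contains*` property path)."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen, stack = {start}, [start]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def effective_operators(assigned, edges, merge=max):
    """Propagate every assignment first, then merge per (user, entity) group."""
    propagated = defaultdict(list)
    for user, entity, level in assigned:
        for target in reachable(entity, edges):
            propagated[(user, target)].append(level)
    # Merging happens only after propagation, analogous to GROUP BY plus an aggregate.
    return {key: merge(levels) for key, levels in propagated.items()}


effective = effective_operators(ASSIGNED, CONTAINS)
print(effective[("alice", "item")])
```

Because every assignment is fully propagated before any merging takes place, the result does not depend on the order in which assignments are processed, which mirrors the order-independence the abstract reports for its SPARQL model.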


Spo: An Ontology for Describing Host-pathogen Interactions Inherent to Streptococcus Pneumoniae Infections

Cátia Vaz
Poly Institute of Lisbon
Lisbon, Portugal

Abstract: Over the past twenty years, the study of infection has tended to consider individual virulence factors or host factors. The Pneumopath project, an FP7 European research project, has the objective of studying the host-pathogen interactions during infection by Streptococcus pneumoniae and finding new targets for diagnosis and treatment. This research aims to identify the most important and consistently involved host and pneumococcal factors, in contrast to previous approaches, where factors were studied in isolation. The transmission of Streptococcus pneumoniae to a new host can result in asymptomatic colonization or progress to invasive disease. The infection can be determined by multiple attributes of both host and pathogen, so it is important to take into account the epidemiological and genomic characterization of pneumococcal strains, the results from experiments that evaluate host or pneumococcal responses to infection or to different environmental challenges, and the results from experiments that identify host genetic susceptibility factors. In this work we propose Spo, an ontology developed in the context of the Pneumopath project, which provides terms and semantic constructs for annotating all aspects of host-pneumococcal interactions. Method: The data considered include the characterization of pneumococcal strains, typing information, and data from in vitro and in vivo experiments with animals and cell models, relevant for identifying new targets to combat pneumococcal diseases. Some of these data are scattered across numerous information systems and repositories, each with its own terminologies, identifier schemes, and data formats. The need to share such data brings challenges for both data management and annotation, such as the need for a common understanding of the concepts that describe host-pneumococcal interactions.
Thus, semantic annotation and interoperability become an absolute necessity for the integration of such diverse biomolecular data. Moreover, given the heterogeneous environment inherent to the project, the ontology construction took into consideration contributions from all partners, leading to a well-grounded set of concepts and annotations. Results: Spo provides a framework to represent the host-pneumococcal interactions mentioned above, and is flexible enough to accommodate the rapid changes and advancement of research and to achieve data interoperability and interchange. This has only been possible because of Semantic Web recommended practices for clearly specifying names for things and relationships, and for expressing data using standardized and well-specified knowledge representation languages. The ontology, described in OWL Lite v1.0, includes 36 classes, 24 object properties and 43 data properties. Conclusion: The main contribution of this work is not only Spo itself, but also the approach and methodology for its construction in the context of a large research project in which many participants were not familiar with semantic technologies. The proposed ontology not only describes knowledge in this field, but also allows existing knowledge to be validated and aggregated, which is essential for data integration. Furthermore, the ability to accurately describe host-pneumococcal interactions through Spo has facilitated the implementation of information systems capable of coping with heterogeneous types of data and, by using well-known semantic technologies, has allowed users to query data and discover new knowledge.


Annotation Analysis for Testing Drug Safety Signals

Trish Whetzel
Stanford University
Stanford, United States

Abstract: Introduction: Changes in biomedical science, public policy, and electronic health record (EHR) adoption have converged recently to enable a transformation in health care. While analyzing structured EHR data has proven useful in different contexts, the true richness and complexity of health records (roughly 80 percent) lies within the free-text clinical notes, and it is crucial to develop methods to test for drug safety signals throughout the EHR. Using ontology-based approaches, we computed the risk of having a myocardial infarction (MI) on taking Vioxx for rheumatoid arthritis (RA) using the annotations created on the textual notes for over 1 million patients in the Stanford Clinical Data Warehouse (STRIDE). Methods: Based on the NCBO Annotator Web service, we created a standalone NCBO Annotator Workflow that is highly optimized for both time and space. The workflow was extended to incorporate negation detection and the concept recognizer Unitex, and it uses ontologies from BioPortal. To reproduce the risk of MI following Vioxx treatment, we identified patients in STRIDE who have RA, are taking Vioxx, and then suffer MI. To identify patients with RA and MI, we scanned through structured data of 25 million coded ICD9 diagnoses for codes beginning with the ICD9 codes for RA and MI. We also scanned through the normalized annotations of the unstructured data to look for non-negated mentions of MI and RA. We denote the first occurrence or mention of the condition as t0(RA) and t0(MI). We did not have access to the structured medication data; therefore, we relied upon annotations derived from the textual notes to identify patients taking Vioxx. We scanned through the normalized annotations of the unstructured data to look for non-negated mentions of Vioxx or rofecoxib. We denote the first occurrence or mention of the drug as t0(Vioxx).
Results: The Annotator Workflow was optimized in both time and space and processed 9.5 million patient notes in 7 hours using 4.5 GB of disk space. From the observed patient counts, we constructed a contingency table and obtained a reporting odds ratio (ROR) of 2.058 with a confidence interval (CI) of [1.804, 2.349] and a proportional reporting ratio (PRR) of 1.828 with a CI of [1.645, 2.032]. The uncorrected χ² statistic was significant, with a p-value < 10^-7. In comparison, without the unstructured data and using only the ICD9 coded data, the results were more ambiguous: ROR=1.524 with CI=[0.872, 2.666], PRR=1.508 with CI=[0.8768, 2.594], and χ²=0.06816. Conclusions: We have significantly scaled the NCBO Annotator Workflow to computationally annotate the free-text narrative of over 9.5 million reports from STRIDE. Our results demonstrate that unstructured data in the EHR provide a viable source for testing drug safety signals using annotations created from the textual notes. Our analysis recapitulated the latent Vioxx risk signal and found that the risk is far more perceptible when ontology-based analysis methods for unstructured data in the EHR are used than when coded data are used alone.
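For readers unfamiliar with the two disproportionality measures, the following Python sketch shows how an ROR and a PRR are commonly derived from a 2x2 contingency table. The abstract does not state which interval formula was used, so the log-scale normal approximation here is an assumption, and the counts are hypothetical and unrelated to the STRIDE data.

```python
import math


def ratio_with_ci(estimate, se_log, z=1.96):
    """Return (estimate, lower, upper) using a normal approximation on the log scale."""
    log_est = math.log(estimate)
    return estimate, math.exp(log_est - z * se_log), math.exp(log_est + z * se_log)


def reporting_odds_ratio(a, b, c, d):
    """ROR = (a/b) / (c/d) for a 2x2 table.

    a: exposed with event, b: exposed without event,
    c: unexposed with event, d: unexposed without event.
    """
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return ratio_with_ci((a * d) / (b * c), se)


def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a/(a+b)] / [c/(c+d)] for the same 2x2 table."""
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return ratio_with_ci((a / (a + b)) / (c / (c + d)), se)


# Hypothetical counts, for illustration only (not the study's data).
ror = reporting_odds_ratio(20, 80, 10, 90)
prr = proportional_reporting_ratio(20, 80, 10, 90)
print("ROR estimate, lower, upper:", ror)
print("PRR estimate, lower, upper:", prr)
```

An interval whose lower bound lies above 1 suggests a signal, as with the combined-data result reported above; an interval that spans 1, as with the coded-data-only result, does not.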