Conference on Semantics in Healthcare and Life Sciences
Updated March 06, 2013
Indicates that presentation slides or other resources are available.
Posters will be on display throughout the conference, beginning 5 p.m. Wednesday, February 27, with a special reception for poster authors to present their work to conference delegates:
- Poster Set-up: Wednesday – February 27, 4:00 p.m. – 5:00 p.m.
- Poster Reception: Wednesday – February 27, 5:00 p.m. – 6:30 p.m.
When preparing and designing your poster please note that it should be no larger than 44 inches wide by 44 inches high (there are two posters per side).
Posters must be removed between 1:30 p.m. and 3:00 p.m., Friday, March 01.
Data Modeling and Machine Learning Approaches to Clustering and Identification of Bacterial Strains in Complex Mixtures by Genome Sequence Scanning
Presenter: Deepak Ropireddy
Woburn, MA, United States
Abstract: Bacterial contaminants, as a cause of some food-borne illness, pose a major challenge for the food industry in the timely detection and isolation of the source pathogen. Genome sequence scanning (GSS), developed by PathoGenetix, is a single-molecule DNA analysis technique aimed at rapid identification of bacterial strains in complex biological mixtures. The detection and classification of single molecules and the identification of bacterial strains entail computationally intensive steps of data analysis, data modeling, and machine learning using scientific software developed in-house. Method: Initially, fluorescent traces of individual DNA molecules are acquired from the biological sample and analyzed for signal intensity. The statistics of all measured DNA molecules are evaluated by custom data analysis software. Next, these optical traces are passed to a data modeling and classification tool capable of running in parallel and on distributed computing architectures. In the classification process, the experimental signals of single molecules are statistically modeled by distributions of photons along the DNA restriction fragment based on Poisson and Gamma distributions. The experimental trace signals are compared to a target database containing averaged template patterns for specific restriction fragments from multiple target organisms. The template patterns are either generated by theoretical calculations based on known sequences or produced experimentally by GSS analysis and clustering of molecules from isolates. For each individual trace we compute a set of distances from database targets, where distance is defined as the negative logarithm of the probability that the trace could originate from a specific target fragment.
We quantify the confidence in classifying a single trace to a specific target by a log-likelihood value computed as the difference of two distances: the distance between the trace and the target to which it has been classified, and the distance between the trace and its next closest target. Results: This computational modeling methodology yielded robust results for detection and typing of multiple serovars and strains of Escherichia coli and Salmonella in complex biological mixtures. To identify closely related strains of these species, a hierarchical clustering algorithm (UPGMA: Unweighted Pair Group Method with Arithmetic Mean) is applied to group detected organisms. The grouping is based on a preliminary analysis of similarities between the GSS template patterns for different microorganisms. Subsequently, the potentially detected organisms are sorted in decreasing order of the detected fraction of the total length of their fragments. Additionally, the clustering algorithm is applied to generate phylogenetic trees to compare closely related strains of Escherichia coli, Salmonella enterica and other species. Conclusions: The ability of GSS to model single DNA molecule traces and attribute them to specific organisms, in conjunction with genome-based unsupervised classification using hierarchical clustering, is the basis of a robust technology for the confident detection and identification of pathogenic strains. This modeling platform is further used to generate user-based knowledge of closely related strains through phylogenetic trees and other quantitative measures. We intend to incorporate current semantic and ontological technologies to build a knowledge tool for recent food outbreaks.
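The distance and confidence measures described in this abstract can be illustrated with a minimal sketch. It assumes a pure Poisson photon model per position (the abstract also mentions Gamma distributions, which are omitted here), and the template values, target names, and the sign convention for the confidence (next closest distance minus best distance) are illustrative assumptions rather than the authors' implementation.

```python
import math

def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k photons given mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def distance(trace, template):
    """Negative log-probability that `trace` originates from `template`."""
    return -sum(poisson_log_pmf(k, lam) for k, lam in zip(trace, template))

def classify(trace, templates):
    """Return (best target, confidence).

    Confidence is assumed here to be the gap between the distance to the
    next closest target and the distance to the best target.
    """
    ranked = sorted((distance(trace, t), name) for name, t in templates.items())
    (d_best, best), (d_next, _) = ranked[0], ranked[1]
    return best, d_next - d_best

# Hypothetical averaged template patterns (mean photon counts per position).
templates = {
    "target_A": [5.0, 20.0, 5.0, 10.0],
    "target_B": [10.0, 5.0, 20.0, 5.0],
    "target_C": [8.0, 8.0, 8.0, 8.0],
}
trace = [4, 22, 6, 9]  # hypothetical measured photon counts
best, confidence = classify(trace, templates)
print(best, round(confidence, 2))
```

A larger gap means the trace is much better explained by its assigned target than by any alternative; a gap near zero flags an ambiguous classification.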
HYDRA: A Commercial Query Engine for SADI Semantic Web Services (pdf)
Presenters: Christopher J. O. Baker, CEO
Alexandre Riazanov, CTO
IPSNP Computing Inc.
Saint John, NB, Canada
Abstract: IPSNP Computing Inc., based in Saint John, Canada, was set up to commercialize prior university-based research on data federation and semantic querying with SADI. The core technology is a high-performance query engine (working title HYDRA) operating on networks of SADI services representing various distributed resources. HYDRA will be packaged and licensed as two products: an intuitive, end-user-oriented querying and data browsing tool, including a software-as-a-service edition, and an OEM-oriented Java toolkit. IPSNP will target Bioinformatics and Clinical Intelligence markets and, later, other verticals requiring self-service ad hoc federated querying.
Automatic Generation of SADI Semantic Web Services from Declarative Descriptions (pdf)
Presenter: Mohammad Sadnan Al Manir
University of New Brunswick
Saint John, Canada
Abstract: Most modern organizations use relational databases to store and retrieve large amounts of data. Tables in these databases are structured and connected by complex schemas, so advanced data access and retrieval require people with in-depth knowledge of SQL. Yet many scenarios call for ad hoc querying of relational data by non-technical users who know little about the database schema or SQL. Semantic querying is proposed as a solution to this problem. It is based on the automatic application of domain knowledge written in the form of ontological axioms and rules. The axioms are used to map concrete data into virtual models based on RDF[S] and OWL. Queries can then be formulated by end users in the terminology of their domains, without any knowledge of either SQL or the underlying table structure. Web services with such semantic querying capability will be very useful in today's Web-based environment, and research on introducing Semantic Web services will strengthen the ongoing efforts of the Semantic Web community. Here we propose an architecture which automatically generates SADI-based Semantic Web services for the semantic querying of relational data. Thus, instead of writing SADI Web services manually, which is labor-intensive and error-prone, we envision generating them automatically from declarative descriptions. Access to databases is implemented by semantic mappings using an expressive Positional-Slotted Object-Applicative (PSOA) Web rule language combining Datalog RuleML and W3C RIF. Such semantic mappings can support end users in any environment requiring semantic querying of large relational databases. The architecture is novel in comparison to the currently available approaches, and its implementation can be used to perform knowledge discovery across large volumes of relational data.
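The idea of a declarative mapping from tables to a domain-level model can be sketched as follows. This is only an illustration: the mapping is a plain Python dictionary rather than PSOA RuleML, and the table, column, class, and property names are all hypothetical.

```python
import sqlite3

# Hypothetical declarative mapping:
# table -> (class, key column, {column: domain property}).
MAPPING = {
    "pat": ("Patient", "pid", {"nm": "hasName", "dx": "hasDiagnosis"}),
}

def rows_to_triples(conn, mapping):
    """Expose relational rows as (subject, predicate, object) triples."""
    triples = []
    for table, (cls, key, columns) in mapping.items():
        cur = conn.execute(f"SELECT * FROM {table}")
        names = [d[0] for d in cur.description]
        for row in cur:
            record = dict(zip(names, row))
            subject = f"{cls}/{record[key]}"
            triples.append((subject, "rdf:type", cls))
            for column, prop in columns.items():
                triples.append((subject, prop, record[column]))
    return triples

def ask(triples, prop, value):
    """Answer a query phrased in domain terms, without SQL or schema knowledge."""
    return sorted(s for s, p, o in triples if p == prop and o == value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pat (pid INTEGER, nm TEXT, dx TEXT)")
conn.executemany(
    "INSERT INTO pat VALUES (?, ?, ?)",
    [(1, "Ann", "asthma"), (2, "Bob", "diabetes"), (3, "Cy", "asthma")],
)
triples = rows_to_triples(conn, MAPPING)
print(ask(triples, "hasDiagnosis", "asthma"))  # ['Patient/1', 'Patient/3']
```

The end user asks about `hasDiagnosis`, a domain term, and never sees the cryptic table and column names `pat` and `dx`; only the mapping author needs to know the schema.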
A Novel Semantic Approach to Information Flow Modeling in Big Pharma (pdf)
Presenter: Kelly Clark
Boston, MA, United States
Abstract: The information landscape within "Big Pharma" is a complex array of heterogeneous data, disparate applications and siloed repositories. This creates a myriad of challenges for scientists and knowledge workers at all levels, and does not provide the optimal use (or reuse) of the data that are generated, a problem that informaticists within Merck are actively trying to address. Understanding how data, information, and knowledge are leveraged across the R&D landscape is not a trivial task. However, making technology and informatics investment decisions without a meaningful understanding of these domain areas and the actual state of information flow across the research activities is risky, costly, and often results in less than optimal IT solutions. Typical analysis efforts to uncover and document information flows through the R&D pipeline are, themselves, complex and inefficient. The information is often represented within "single-use artifacts" (usually Word, Excel, and/or Visio documents), which reflect the analysts' own interpretation of the current state in a variety of ways that are then subject to different interpretations. We have developed a method called Semantic Information Flow Modeling (sIFM) that allows multiple analysts to work individually to unambiguously model, analyze, and communicate information and knowledge flows across Merck. The resulting model is a graph-based knowledgebase that can be rendered in RDF and used to better inform informatics and technology investment decisions by elucidating targeted opportunities where technology can improve the state of information architecture and search.
Yes, We Can! Lessons from Using Linked Open Data (LOD) and Public Ontologies to Contextualize and Enrich Experimental Data (pdf)
Presenter: Erich Gombocz
IO Informatics, Inc.
Berkeley, CA, United States
Abstract: Semantic W3C standards provide a framework for the creation of knowledgebases that are extensible, coherent, interoperable, and on which interactive analytics systems can be developed. A growing number of knowledgebases are being built on these standards, in particular as Linked Open Data (LOD) resources. The availability of LOD resources has received increasing attention and use in industry and academia. Using LOD resources to provide value to industry is challenging, however, and early expectations have not always been met: issues often arise from the alignment of public and experimental corporate standards, from inconsistent namespace policies, and from the use of internal, non-formal application ontologies. Often the reliability of resources is problematic, from service levels of LOD resources and/or SPARQL endpoints to URI persistence. Furthermore, more and more “Open data” are closed for commercial use, and there are serious funding concerns related to government grant-backed resources. With these challenges, can Semantic Web technologies provide value to industry today? We make the case that, yes, this can be done and is the case now. We demonstrate a use case of successful contextualization and enrichment of internal experimental datasets with public resources, thanks to outstanding examples of LOD such as UniProt, Drugbank, Diseasome, SIDER, Reactome, and ChEMBL, as well as ontology collections and annotation services from NCBO’s BioPortal. We show how, starting with semantically integrated experimental results from multi-year toxicology studies on different -OMICS platforms, a knowledgebase can be built that integrates and harmonizes such information, and enriches it with public data from UniProt, Drugbank, Diseasome, SIDER, Reactome, and NCBI Biosystems. The resulting knowledgebase facilitates toxicity assessment in drug development at the pre-clinical trial stage.
It also provides models for classification of toxicity types (hepatotoxicity, nephrotoxicity, toxicity based on drug residues) and offers better a priori determination of adverse effects of drug combinations. Not only have we been able to correlate responses across unrelated studies with different experimental models, but we have also been able to validate system changes associated with known toxicity mechanisms such as oxidative stress, liver function and ketoacidosis. Observations from multi-modal OMICs experiments can result from the same perturbation yet represent very different biological processes; pharmacodynamic correlations are not necessarily functionally linked within the biological network; and genetic and metabolic changes may occur prior to pathological changes. For these reasons, enrichment with LOD resources led to the discovery of new pharmacodynamically and biologically linked pathway dependencies. As LOD resources mature, more reliable information is becoming publicly available to enrich experimental data with computable descriptions of biological systems in ways never anticipated before, which ultimately helps in understanding experimental results. The time and money saved by such an approach have a large socio-economic impact for drug companies and healthcare. As a community, we need to establish business models through cooperation between industry and academic institutions that support the maintenance and extension of invaluable public LOD resources. Their effective use in enriching toxicology data exemplifies the success of using Semantic Web technologies to contextualize experimental, internal, external, clinical and public data towards faster, better understanding of biological systems, and more effective outcomes in healthcare.
DistilBio: Semantic Web and Data Integration Platform for the Life Sciences (pdf)
Presenter: Ramkumar Nandakumar
Metaome Science Informatics (P) Ltd.
Abstract: While the number of linked data resources continues to grow, there is a need to build applications that allow users easy access to open-ended exploration across a multi-dimensional data space. This would include intuitive interfaces to build powerful queries that could span several data sources without the end user needing to know SPARQL, the underlying data models or the location of the data. Method: The platform consists of three main subsystems (query engine, web application and autosuggest) and a caching layer. A Service Oriented Architecture has been followed in its design. Services generally interact in a stateless manner using JSON over HTTP. The “query engine” serves as the interface to the Virtuoso database layer for the rest of the system. At its core it is a dynamic, generic query builder for SPARQL. The engine is almost entirely data driven: it obtains most of its information for SPARQL generation directly from the OWL Ontology defined in Virtuoso, and the rest from the contents of the input query. The “web application” features an advanced querying interface utilizing both textual and graphical input elements in a complementary manner. The results are presented as a set of interconnected facets that map intuitively to the user’s query. The “autosuggest” textbox is the key entry point for the end user and provides instantaneous access to more than 100 million indexed biological terms. Going beyond the plain suggestion of words provided by search engines in general, this interface detects phrases as logical groups and provides context-sensitive suggestions. Text is by nature linear and one-dimensional. The query canvas lets the user treat the query as a two-dimensional graph. This bridges the semantic web’s technical world of graphs and triples with the user’s notion of biological entities and connections. Results: The interface allows users to retrieve linked data and, most importantly, does so without loss of expressivity and at scale.
There is a clear abstraction between the data layer and the query engine, allowing seamless updates to both the data and the ontology. The data store currently houses nearly one billion triples across several biologically relevant databases (http://distilbio.com/help#data) and most normal queries return results in real time. Conclusions: With the DistilBio interface we have successfully allowed users to build fairly expressive queries and browse the results without needing any understanding of the underlying semantic technologies. Future work will include improved filtering capabilities, support for the full range of SPARQL operations and better ways to display provenance. DistilBio is available at www.distilbio.com; for use cases, view the demo videos at http://distilbio.com/demo.
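A generic query builder of the kind described, one that turns a user's query graph into SPARQL, might look roughly like the sketch below. It is not DistilBio's engine: the function name, the input structure, and the example.org IRIs are assumptions made for illustration, and a real engine would draw class and property information from the ontology rather than from hard-coded arguments.

```python
def build_sparql(nodes, edges, limit=100):
    """Build a SELECT query from a query graph.

    nodes: {variable name: class IRI}
    edges: [(subject variable, predicate IRI, object variable)]
    """
    # One type constraint per node, then one triple pattern per edge.
    patterns = [f"?{var} a <{cls}> ." for var, cls in nodes.items()]
    patterns += [f"?{s} <{p}> ?{o} ." for s, p, o in edges]
    variables = " ".join(f"?{var}" for var in nodes)
    body = "\n  ".join(patterns)
    return f"SELECT DISTINCT {variables}\nWHERE {{\n  {body}\n}}\nLIMIT {limit}"

# Hypothetical two-node query graph: genes associated with diseases.
nodes = {
    "gene": "http://example.org/Gene",
    "disease": "http://example.org/Disease",
}
edges = [("gene", "http://example.org/associatedWith", "disease")]
query = build_sparql(nodes, edges)
print(query)
```

Each node drawn on a query canvas becomes a typed variable and each connection becomes a triple pattern, which is why the user never has to write SPARQL directly.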
Semantic Approaches to Clinical Trial Harmonization (pdf)
Presenter: Simon Rakov
Waltham, MA, United States
An Interoperable Suite of BioNLP Web Services based on the SADI Framework (pdf)
Presenter: Syed Ahmad Chan Bukhari
University of New Brunswick
Saint John, Canada
Abstract: The scientific literature is considered to be the most up-to-date source of information for biologists and is a fundamental resource for knowledge discovery, experimental design and systems biology analysis. Biomedical text mining (also known as BioNLP) provides information extraction solutions to meet the needs of scientists. In many use case scenarios the integration of not just one but several text mining tools is required, and the outputs of these tools must be consolidated. The outputs might be used for different purposes, including summarization, comparative evaluation of results, and cross-validation against existing data. Consequently, the interoperability of tools and the format of the output are of critical importance. Since almost all BioNLP tools produce XML or tab-delimited output (with different schemas), integration of tools and consolidation of results requires some programming work. We propose a programming-free and installation-free scientific text processing system, based on a BioNLP SADI framework, to annotate and extract biological information from textual data. The proposed mechanism directly addresses the centralized weak-binding issues of biological NLP pipelines by introducing an ontology-based and linked-data-aware biological text annotation scheme. We developed BioNLP SADI semantic web services for a Mutation Finder, a Drug Extractor, and a Drug-Drug Interaction Extractor to achieve interoperability among biological data outputs. Around these we created a web-based platform through which general biologists and bioinformatics application developers can get enhanced access to biological information from the literature with minimal effort. Project Page: https://code.google.com/p/bionlp-sadi/
Active PURLs for Clinical Study Aggregation (pdf)
3 Round Stones Inc.
Fredericksburg, VA, United States
Waltham, MA, United States
Abstract: A challenge common to many pharmaceutical companies is to more closely relate the detailed outcomes of clinical trials to their own research. This is made difficult by both the distributed nature of the companies and the distributed nature of clinical trial information. Simply finding information related to clinical studies within a large pharmaceutical company can be challenging. Today's pharmaceutical companies are worldwide, highly distributed organizations. Developers creating an application to support researchers may never meet their stakeholders, nor even necessarily know who they are. This situation mirrors the World Wide Web, where it is not generally possible to determine who the user base is nor which features or offerings they might wish to have. Traditional enterprise software development methodologies do not take into account scenarios where stakeholders are not known in advance. Linked Data techniques can help to address both the availability of clinical trial information and provide a means to build effective information systems using it. Linked Data techniques were developed for the Web and allow for "cooperation without coordination". Publishers of data provide necessary context to allow for use by (possibly unknown) third parties in other portions of a distributed enterprise. Users of Linked Data can combine information from multiple sources and even publish the results of their analyses using the same Linked Data techniques. Subsequent publication can create a virtuous circle of positive feedback, allowing researchers, informaticists and support staff to collaboratively and distributively build a reusable knowledge base. 3 Round Stones and a pharmaceutical company created a system to allow coordinated views of distributed clinical trial information. The system extended the Callimachus Project, an Open Source management system for Linked Data. 
Persistent URLs, or PURLs, were used to provide globally unique and resolvable identifiers for each clinical study. The PURL concept was extended to enable PURLs to have multiple targets and for the results of each target to undergo arbitrary transformation. PURLs which have such capabilities are called Active PURLs. Information sources relevant to clinical studies were identified, regardless of whether their location was internal or external to the pharmaceutical company's network. Active PURLs were used to resolve data sources having HTTP endpoints capable of returning XML or textual results. Each information source is dynamically transformed into Resource Description Framework (RDF) formats and all sources' results then merged into a single, temporary graph of RDF data. Information is rendered to end users as coordinated HTML descriptions regarding each clinical trial using the Callimachus template engine. Machine-readable versions of the data are also available. The pharmaceutical company has a means to view coordinated clinical trial information across internal and external sources and is moving it toward production use. We showed that a Linked Data approach to distributed information retrieval works for clinical trial information and demonstrated the benefits of cooperation without coordination for typical bioinformatics challenges.
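The Active PURL idea described in this abstract (one identifier, several targets, a transformation per target, and a merged temporary graph) can be sketched as below. This is not the Callimachus implementation: the URLs are hypothetical, HTTP is replaced by canned responses, and plain tuples stand in for RDF triples.

```python
import xml.etree.ElementTree as ET

def xml_to_triples(study_uri, payload):
    """Turn each child element of an XML document into one triple."""
    root = ET.fromstring(payload)
    return {(study_uri, child.tag, (child.text or "").strip()) for child in root}

def text_to_triples(study_uri, payload):
    """Turn 'key: value' lines of a text document into triples."""
    triples = set()
    for line in payload.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            triples.add((study_uri, key.strip(), value.strip()))
    return triples

def resolve_active_purl(study_uri, targets, fetch):
    """Fetch every target, transform each result, and merge into one graph."""
    graph = set()
    for url, transform in targets:
        graph |= transform(study_uri, fetch(url))
    return graph

# Stand-in for HTTP: canned responses keyed by hypothetical URLs.
RESPONSES = {
    "http://internal.example/study/42.xml":
        "<study><phase>II</phase><status>open</status></study>",
    "http://external.example/trials/42.txt":
        "sponsor: Example Pharma\nstatus: open",
}
targets = [
    ("http://internal.example/study/42.xml", xml_to_triples),
    ("http://external.example/trials/42.txt", text_to_triples),
]
graph = resolve_active_purl(
    "http://purl.example/study/42", targets, RESPONSES.__getitem__
)
for triple in sorted(graph):
    print(triple)
```

Because the merged graph is a set, a fact reported by both the internal and the external source (here, the study status) appears only once in the coordinated view.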
Validation of a Comprehensive NGS-based Cancer Genomic Assay for Clinical Use
Presenter: Eric Neumann
Cambridge, MA, United States
Abstract: Molecular diagnostics are increasing in importance to clinical oncology, as the number of therapies targeting specific molecular alterations and pathways in cancer grows. This trend has led to a proliferation of single-target biomarker assays, which are constrained by scarce tissue material and restricted in the breadth of genomic alterations assessed. To overcome these limitations, we developed a CLIA certified, pan solid tumor, next-generation sequencing (NGS) based test that enables comprehensive identification of clinically actionable genomic alterations present in routine FFPE specimens. The test uses minimal (≥50ng) DNA to achieve >500X average unique sequence coverage across 3,240 exons and 37 intronic intervals in 189 cancer genes, permitting identification of single-base substitutions, small insertions and deletions (indels), copy number alterations, and selected gene fusions, even when present in a minor fraction of input cells. To support clinical adoption, we conducted a series of experiments to validate test performance for substitution and indel mutations.
Mapping Scientific Narratives (pdf)
Presenter: Robert Malouf
San Diego State University
CA, United States
Abstract: The scientific literature in any (sub-)domain constitutes a kind of ongoing narrative constructed jointly by a community of researchers. And, as in any community, its members develop a specialized vocabulary among themselves which may be somewhat opaque to outsiders. This goes beyond the use of technical terminology and biomedical jargon (Jordan 2005): researchers investigating, say, the use of monoclonal antibodies to treat psoriasis will develop specialized habits of language use, and the narrower the subfield, the subtler the linguistic distinctions. Understanding these differences is vital for accessing the scientific narrative as an outsider via information retrieval or text mining systems, or even for contributing to it as a researcher entering a new subfield. Using the tools of corpus linguistics and computational lexicography, we can analyze large quantities (on the order of hundreds of millions of words) of domain-specific text. One primary tool of corpus linguistics is the concordancer, a system which allows the analyst quick access to individual examples of words in use. This can reveal surprising patterns: for example, in papers on multiple sclerosis, verbs like "gain" and "increase" occur with undesirable direct objects like "disability" or "disease load", while in the asthma literature "gain" is more likely to occur with "control", a desirable outcome. This can also allow us to extract technical terms (e.g., in the literature on anticoagulants, we often find "burst of thrombin" but almost never "thrombin burst"). Going beyond simple word counts, analysis via pointwise mutual information, an information-theoretic measure of association (Church and Hanks 1990, Evert 2007), finds collocation patterns which occur with greater than expected frequency given the frequencies of the individual words.
When combined with deep syntactic analysis, these measures of association facilitate automatic extraction of a domain-specific thesaurus (Lin 1998, Curran and Moens 2002). The synonym sets provide a high-level overview of the way that language is being used in a narrowly focused corpus which in turn can help the analyst find differences in word usage between that domain and biomedical literature in general. For example, in literature on multiple sclerosis, a close synonym of "damage" is "demyelination", while in a corpus of papers on diabetes a synonym of "damage" is "neuropathy". Finally, broader semantic patterns of word meanings and language use can be found using vector space analysis and non-negative matrix factorization (Deerwester et al. 1990, Pauca 2004, Turney 2010, Utsumi 2010). This technique maps words and texts into a kind of semantic space. The distance between two words in this semantic space is a measure of the similarity of the contexts in which the two words tend to occur, and the structure of the semantic space provides a basis for comparing the development of word meanings across domains and across time. In this poster presentation, we will describe the use of this computational toolkit in more detail and present real-world case studies of its application to problems in accessing scientific narratives.
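Pointwise mutual information, mentioned above as a way to find collocations, compares how often two words occur together with how often they would if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below computes it for adjacent word pairs in a toy corpus; the sentences are invented, and the simple relative-frequency estimates are an assumption (real studies use far larger corpora and often smoothed or significance-corrected variants).

```python
import math
from collections import Counter

def pmi_table(sentences):
    """PMI for adjacent word pairs: log2(P(x, y) / (P(x) * P(y)))."""
    words, pairs = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        words.update(tokens)
        pairs.update(zip(tokens, tokens[1:]))
    n_words, n_pairs = sum(words.values()), sum(pairs.values())
    return {
        (x, y): math.log2(
            (count / n_pairs) / ((words[x] / n_words) * (words[y] / n_words))
        )
        for (x, y), count in pairs.items()
    }

# Invented toy corpus; real analyses use millions of words.
corpus = [
    "burst of thrombin was observed",
    "a burst of thrombin was measured",
    "the dose of heparin was reduced",
]
scores = pmi_table(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

Pairs that never occur, such as ("thrombin", "burst") in this toy corpus, simply have no entry, which mirrors the observation that one phrasing can be idiomatic in a subfield while its apparent paraphrase is almost never used.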
A Patient Centered Infrastructure (pdf)
Presenter: Christian Seebode
Abstract: In order to give patients a possibility to participate more in healthcare delivery and to take responsibility for their actions, patients need access to information, communication and educational services. We present a Patient Centered Infrastructure which supports a patient centered process: an information cycle in which patients improve their health literacy in an iterative way. Patients have to become consumers, mostly of health information but also of health services. At the same time, the patient participates in healthcare delivery and performs actions that correspond to his level of health literacy. While participating and learning, patients also become a source of health information. Patients and their knowledge are the most undervalued resource in healthcare delivery. This means that patients have to adopt a new role model too. Patients learn to demand and consume health information and improve their situation to achieve better outcomes. The main condition for active participation is open and transparent communication. The Patient Centered Information infrastructure supports information, communication and education in order to improve the level of health literacy in patients such that they may understand and consciously decide what is best for them. The Patient Centered Infrastructure is a collection of federated services which resides on top of a common information model for patients and supports the patient centered process. The Patient Centered process represents a cycle of steps that patients perform in order to participate in healthcare delivery and to take control of their personal health situation. It consists of the following steps:
- Store (Health Record Management): Patients assemble a personal profile which contains a personal health record and a knowledge base. This may contain input from other systems, from the patient himself, or a result from a previous cycle.
- Retrieve (Information Retrieval): Patients seek information based on the information and knowledge they possess. Sources of information may be the Web, case databases, literature or the information from EPRs or HIS. Even associated clinical text may be a source of information by means of linguistic analysis and text mining.
- Gain Insight (Knowledge Management): Knowledge management is supported by formal representations of knowledge, e.g. ontologies. Ontologies represent domain knowledge and are defined by experts or by patients who contribute their own knowledge base and align it.
- Learn (Education): Patients learn and build knowledge from the information they retrieved. Patients are educated by learning from information or from others. Patients are able to share this knowledge with others because of the common information model. Learning curves are represented by comparing different versions of the knowledge base.
- Act (Medical Services): Patients use medical services and participate according to their level of health literacy. Medical services are offered offline or online. Information plays a vital role for medical services too. Patients get decision support by consuming second opinion services or from the specific patient community that is associated with a service. Patients may document their personal level of health literacy by giving fine-grained access to their personal knowledge base in order to receive personalized support.
In cooperation with DFKI - Deutsches Forschungszentrum für Künstliche Intelligenz.
The ISA Infrastructure for the Biosciences: from Data Curation at Source to the Linked Data Cloud (pdf)
Oxford e-Research Centre, University of Oxford
Oxfordshire, United Kingdom
Abstract: Experimental metadata is crucial for the ability to share, compare, reproduce, and reuse data produced by biological experiments. The ISAtab format, a tabular format based on the concepts of Investigation/Study/Assay (ISA), was designed to support the annotation and management of experimental data at source, with a focus on multi-omics experiments. The format is accompanied by a set of open-source tools that facilitate, among other things, compliance with existing checklists and ontologies, production of ISAtab metadata, validation, conversion to other formats, and submission to public repositories. The ISAtab format together with the tools allows for the syntactic interoperability of the data and supports the ISA commons, a growing community of international users and public or internal resources powered by one or more components of the ISA metadata tracking framework. The underlying semantics of the ISAtab format is currently left to the interpretation of biologists and/or curators. While this interpretation is assisted by the ontology-based annotations that can be included in the ISAtab files, it is currently not possible to have this information processed by machines, as in the semantic web/linked data approach. In this presentation, we will introduce our ongoing isa2owl effort to transform ISAtab files into an RDF/OWL-based (Resource Description Framework/Web Ontology Language) representation, supporting semantic interoperability between ISAtab datasets. By using a semantic framework, we aim at: 1. making the ISAtab semantics explicit and machine-processable; 2. exploiting the existing ontology-based annotations; 3. augmenting annotations over the native ISA syntax constructs with new elements anchored in a semantic model extending the Ontology for Biomedical Investigations (OBI); 4. facilitating the understanding and semantic querying of the experimental design; 5. facilitating data integration, knowledge discovery and reasoning over ISAtab metadata and associated data.
The software architecture of the isa2owl component is engineered to support multiple mappings between the ISA syntax and semantic models. Given a specific mapping, a converter takes ISAtab datasets and produces OWL ontologies, whose TBoxes are given by the mapping and whose ABoxes are the ISAtab elements or elements derived from them. These derived elements result from the analysis of the experimental workflow, as represented in the ISAtab format and the associated graph representation. The implementation relies on the OWL API. As a proof of concept, we have performed a mapping between the ISA syntax and a set of interoperable ontologies anchored in the Basic Formal Ontology (BFO) version 1. These ontologies are part of the Open Biological and Biomedical Ontologies (OBO) Foundry and include OBI, the Information Artifact Ontology (IAO) and the Relations Ontology (RO). We will show how this isa2owl transformation allows users to perform richer queries over the experimental data, to link to external resources available in the linked data cloud, and to support knowledge discovery.
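The separation between a mapping and a converter, as described for isa2owl, can be illustrated with a small sketch. This is not the isa2owl code (which the abstract says relies on the OWL API): the column subset is simplified, the `ex:` predicates are hypothetical placeholders rather than OBI terms, and plain tuples stand in for RDF triples.

```python
import csv
import io

# Hypothetical mapping from column headers to predicates (not real OBI IRIs).
# Swapping this dictionary changes the target semantic model, not the converter.
COLUMN_TO_PREDICATE = {
    "Characteristics[organism]": "ex:hasOrganism",
    "Sample Name": "ex:derivesSample",
}

def isatab_to_triples(text, subject_column="Source Name"):
    """Convert tab-delimited rows into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        subject = f"ex:{row[subject_column]}"
        triples.append((subject, "rdf:type", "ex:Source"))
        for column, predicate in COLUMN_TO_PREDICATE.items():
            if row.get(column):
                triples.append((subject, predicate, row[column]))
    return triples

# Simplified, invented study table in tab-delimited form.
study = (
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "source1\tHomo sapiens\tsample1\n"
    "source2\tMus musculus\tsample2\n"
)
triples = isatab_to_triples(study)
for triple in triples:
    print(triple)
```

Once the rows are expressed as triples, questions such as "which samples derive from human sources" become graph queries instead of spreadsheet lookups.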
Bio2RDF Linked Data Experience, the Lessons Learned Since 2006 (pdf)
Presenter: François Belleau
Centre de recherche du CHUQ, Laval University
Abstract: Since 2006, the Bio2RDF project (http://bio2rdf.org), hosted jointly by the Centre de recherche du CHUQ of Laval University and Carleton University, has aimed to transform silos of life science data made publicly available by data providers such as KEGG, UniProt, NCBI, EBI and many more into a globally distributed network of linked data for biological knowledge discovery, available to the life science community. Using semantic data integration techniques and the Virtuoso triplestore software, Bio2RDF seamlessly integrates diverse biological data and enables powerful SPARQL services across its globally distributed knowledge bases. Online since 2006, this very early Linked Data project has evolved and inspired many others. This talk will recall the main design steps of Bio2RDF over time but, more importantly, will explain the design decisions that, we think, made it successful. Being part of the Linked Data space since the beginning, we have been well placed to observe the evolution and adoption of semantic web technologies by the life science community. Now, with so many mature projects online, we propose to look back and share our own experiences building Bio2RDF with the CSHALS community. The method used to produce, publish and consume Bio2RDF linked data will be presented. The pipeline used to transform public data to RDF, and its design principles, will be explained. The way the OpenLink Virtuoso server, an open source project, is configured and used will be explained. We will also propose guidelines to follow when publishing RDF within the bioinformatics Linked Data cloud. Finally, we will show different ways to consume this data using URIs, SPARQL queries and semantic web software such as RelFinder and the Virtuoso Facet browser. As a result, Bio2RDF is still the most diverse and integrated linked data space available. But we now observe an important shift, with data providers starting to expose their own datasets using RDF or, even better, SPARQL endpoints.
Looking back at the evolution of Bio2RDF, we will share the lessons we have learned about designing semantic web projects: which ideas were good, and which were not. To conclude, we will present our vision, still to be fulfilled, of what Linked Data could be in the near future, so that the data integration problem, so present in the life sciences, benefits from mature Semantic Web technologies and researchers are helped in their daily discovery work.
Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data
Presenter: Michel Dumontier
Abstract: Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of Linked Data for the Life Sciences. Here, we present the second release of the Bio2RDF project, which features up-to-date, open-source scripts, IRI normalization through a common dataset registry, dataset provenance, data metrics, public SPARQL endpoints, and compressed RDF files and full-text-indexed Virtuoso triple stores for download. Methods: Bio2RDF defines a set of simple conventions to create RDF(S)-compatible Linked Data from a diverse set of heterogeneously formatted sources obtained from multiple data providers. We have consolidated and updated the Bio2RDF scripts into a single GitHub repository (http://github.com/bio2rdf/bio2rdf-scripts), which facilitates collaborative development through issue tracking, forking and pull requests. The scripts are released under an MIT license, making them available for any use (including commercial), modification or redistribution. Provenance regarding when and how the data were generated is provided using the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV) and the Dublin Core vocabulary. Additional scripts were developed to compute dataset-dataset links and summaries of dataset composition and connectivity. Results: Nineteen datasets, including 5 new datasets and 3 aggregate datasets, are now being offered as part of Bio2RDF Release 2. Use of a common registry ensures that all Bio2RDF datasets adhere to strict syntactic IRI patterns, thereby increasing the quality of generated links over previously suggested patterns. Quantitative metrics are now computed for each dataset, ranging from elementary information such as the number of triples to a more sophisticated graph of the relations between types. While these metrics provide an important overview of dataset contents, they are also used to assist in SPARQL query formulation and to monitor changes to datasets over time.
Pre-computation of these summaries frees up computational resources for more interesting scientific queries and also enables tracking of dataset changes over time, which will help in making projections about hardware and software requirements. We demonstrate how multiple open source tools can be used to visualize and explore Bio2RDF data, as well as how dataset metrics may be used to assist querying. Conclusions: Bio2RDF Release 2 marks an important milestone for this open source project, in that it was fully transferred to a new team and development paradigm. Adoption of GitHub as a code development platform makes it easier for new parties to contribute and get feedback on RDF converters, and makes it possible for those converters to be added automatically to the growing Bio2RDF network. Over the next year we hope to offer bi-annual releases that adhere to formalized development and release protocols.
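Two of the ideas in this abstract, registry-based IRI normalization and pre-computed dataset metrics, can be sketched in a few lines. This is an illustration under stated assumptions, not the Bio2RDF scripts: the `REGISTRY` entries, the `ex:` terms and the sample identifiers are made up for the example, and the IRI form `http://bio2rdf.org/prefix:identifier` is assumed here as the target pattern.

```python
from collections import Counter

# Hypothetical registry: maps prefix variants found in source data to a
# single preferred prefix. Entries are illustrative only.
REGISTRY = {
    "uniprot": "uniprot",
    "uniprotkb": "uniprot",
    "geneid": "geneid",
    "ncbigene": "geneid",
}

BASE = "http://bio2rdf.org/"  # assumed base for normalized IRIs


def normalize_iri(prefix, identifier):
    """Build one canonical IRI regardless of which prefix variant was used."""
    key = prefix.strip().lower()
    if key not in REGISTRY:
        raise KeyError(f"prefix not in registry: {prefix}")
    return f"{BASE}{REGISTRY[key]}:{identifier.strip()}"


def dataset_metrics(triples):
    """Summarize a dataset: triple count, instances per type, predicate usage."""
    types = Counter(o for _, p, o in triples if p == "rdf:type")
    predicates = Counter(p for _, p, _ in triples)
    return {
        "triples": len(triples),
        "types": dict(types),
        "predicates": dict(predicates),
    }


# Illustrative triples; the classes and the "encodes" relation are placeholders.
sample = [
    (normalize_iri("UniProtKB", "P12345"), "rdf:type", "ex:Protein"),
    (normalize_iri("geneid", "1234"), "rdf:type", "ex:Gene"),
    (normalize_iri("ncbigene", "1234"), "ex:encodes", normalize_iri("uniprot", "P12345")),
]
metrics = dataset_metrics(sample)
print(metrics)
```

Because `ncbigene` and `geneid` resolve to the same IRI, the gene typed in the second triple is the same node that appears as the subject of the third, which is the kind of link quality a shared registry is meant to protect. Storing `metrics` alongside each release would also allow changes between releases to be compared without re-querying the data.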
Towards a Seizures and Epilepsies Ontology (pdf)
Presenter: Robert Yao
Arizona State University/Mayo
Scottsdale, United States
Abstract: Introduction: In 2012, the Institute of Medicine recognized a significant problem in epilepsy knowledge, care, and education in its report, “Epilepsy Across the Spectrum.” Physicians often disagree and are inconsistent when making an epilepsy diagnosis (1) because various limitations in their understanding of seizures and epilepsy have contributed to the lack of a clear diagnostic method. Such limitations include structural flaws (or lack of relationships) in the current knowledge representation model, terms that are ambiguous and inconsistently used, and too much dependence on expert opinion and too little on evidence (2-6). To date, most epilepsies have been identified by grouping seizure or epilepsy names and then sorting them based on what is perceived as the defining characteristic. If the wrong defining characteristic was chosen, epilepsies were often misclassified, and thus misdiagnosed. When certain observable symptoms or properties were unknown or missing, it was not possible to identify the seizure or epilepsy. As a first step towards diagnostic clarity, the IOM has recommended the validation and implementation of standard definitions and criteria for epilepsy case ascertainment. The epilepsy domain has been calling for a new evidence-based epilepsy model that incorporates the latest knowledge to classify and relate seizure types and define specific epilepsy syndromes (5,7-9). In response to the need for better diagnosis and to the call for a new evidence-based epilepsy model, we propose an ontology-based knowledge representation to aid in the improvement of the diagnosis and management of epilepsy.
Methods: Design and build an ontological knowledge representation of the epilepsy domain: 1) Define the ontology domain and scope; 2) Review existing ontologies; 3) Select an upper ontology; 4) Create classes, properties, and relationships; 5) Create a conceptual model (using concept maps); 6) Create a scoring heuristic to suggest a differential diagnosis of seizure types and epilepsy syndromes. Results: To address ambiguous and inconsistently used terms, a concept-based approach was taken. Relationships between concepts were then defined, creating an ontology for seizure types and epilepsy syndromes. A concept map constructed through this process resulted in a tree that models all possible epilepsies in one map (not shown). Figure 1 depicts the seizure aura sub-branch of the overall Epilepsies Ontology. Discussion: The ontology takes an important first step in providing a standard definition for a seizure type and a specific seizure syndrome. We are currently working on a peer-reviewed, evidence-based validation of the ontology. Furthermore, a reasoning heuristic based on the ontology is being developed to evaluate the implementation of diagnostic criteria for seizure type and epilepsy syndrome.
Presenter: Alexander Garcia
Florida State University
Tallahassee, United States
Abstract: In this poster, we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We aim at delivering interoperable, interlinked, and self-describing documents in the biomedical domain. We applied our approach to the full-text, open-access subset of PubMed Central.
Methods: We use BIBO, DCMI Terms, and the Provenance Ontology to model the bibliographic metadata. BIBO provides classes and properties to represent citations and bibliographic references; BIBO is used to model documents and citations in RDF or to classify documents within a hierarchy. Dublin Core offers a domain-independent vocabulary to represent metadata; such vocabulary aims to facilitate cross-resource exploration. In order to identify biological terms, we use two text-mining tools: Whatizit and the NCBO Annotator. Both tools are based on dictionaries and string matching. In this way, relevant biological identifiers such as UniProt accessions as well as ChEBI and GO identifiers are added. We are working with more than 20 biomedical ontologies. The main input for our process is the XML offered by PMC for open-access articles. We use JAXB to programmatically process this XML and RDFReactor to map the ontologies to Java classes. The output of the process comprises three RDF files: the article itself as well as the annotations from NCBO Annotator and Whatizit. The article is modeled as bibo:Document; whenever it is possible, a more accurate class is also added, e.g., bibo:AcademicArticle for research articles. Publisher metadata is also modeled with BIBO, including publisher name, the International Standard Serial Number, volume, issue, and starting and ending pages. Authors are modeled as foaf:Person and grouped as a bibo:authorList. Abstract and sections are modeled as a doco:Section with a cnt:chars containing the actual text with formatting omitted. Well-known identifiers such as PubMed and DOI are included in the output; thus, it is possible to track the original source of the article. The same principle is also applied to the references. The references are modeled as bibo:Document; the relations used are bibo:cites and bibo:citedBy. References are available at both the document and the section level. For incomplete references, e.g., "Allen, F. H. (2002). ActaCryst. B58, 380-388" in PMC:2971765, it is possible to use services such as Mendeley, CrossRef, and eFetch in order to complete the information so title and identifiers can be added.
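The document model in the Methods can be sketched as a small function that emits triples for one article. This is a simplified illustration in Python, not the JAXB/RDFReactor pipeline described above: the `ex:` node names, the input dictionary layout and the sample article are hypothetical, only classes and relations named in the Methods are used (plus `dcterms:title` and `foaf:name` for readability), and the author list, sections and identifiers are left out.

```python
def article_triples(article):
    """Return (subject, predicate, object) triples for one article dict.

    The dict layout (pmcid, title, authors, references, is_research_article)
    is a hypothetical input format chosen for this sketch.
    """
    doc = "ex:pmc_" + article["pmcid"]
    title = article["title"]
    triples = [
        (doc, "rdf:type", "bibo:Document"),
        (doc, "dcterms:title", '"' + title + '"'),
    ]
    # Add the more specific class when it is known.
    if article.get("is_research_article"):
        triples.append((doc, "rdf:type", "bibo:AcademicArticle"))
    # Each author becomes a foaf:Person node.
    for position, name in enumerate(article["authors"], start=1):
        person = doc + "_author_" + str(position)
        triples.append((person, "rdf:type", "foaf:Person"))
        triples.append((person, "foaf:name", '"' + name + '"'))
    # References are documents too, linked in both directions.
    for ref_id in article["references"]:
        ref = "ex:ref_" + ref_id
        triples.append((ref, "rdf:type", "bibo:Document"))
        triples.append((doc, "bibo:cites", ref))
        triples.append((ref, "bibo:citedBy", doc))
    return triples


# Made-up article used only to exercise the function.
example = {
    "pmcid": "0000000",
    "title": "Example article",
    "authors": ["A. Author", "B. Author"],
    "references": ["r1"],
    "is_research_article": True,
}
triples = article_triples(example)
for s, p, o in triples:
    print(s, p, o, ".")
```

Emitting both `bibo:cites` and `bibo:citedBy` doubles the citation triples but lets a query start from either end of the link without relying on inference.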
Results: We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/.
Conclusions: The semantic processing of biomedical literature presented in this paper embeds documents within the Web of Data and facilitates the execution of concept-based queries against the entire digital library. Our approach delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of biomedical documents. Our model delivers a semantically rich and highly interconnected dataset with self-describing content so that software can make effective use of it.
A Semantic Portal for Treatment Response Analysis in Major Depressive Disorder
Presenters: Joanne S. Luciano, Brendan Ashby, Yuezhang Xiao
Rensselaer Polytechnic Institute
Troy, NY, United States
Abstract: The World Health Organization (WHO) reports that Major Depressive Disorder (MDD) affects more than 350 million people and is a significant contributor to the global burden of disease. It is the leading cause of disability in the U.S. for ages 15-44. This poster will present a semantically enabled web-portal that enables treatment response pattern analysis from clinical depression data studies conducted at major depression research facilities. Using the Luciano Model, the resultant response pattern visualizations provide patient and clinician with detailed information about the expected response to the treatment, thus supporting clinical decision making and increasing patient engagement. Currently treatment selection remains trial and error and patient engagement for any pharmaceutical is nonexistent. Further, the individual patient's response pattern can be monitored more closely by both patient and clinician enabling earlier intervention when the patient's response is different from what is expected for that treatment. The aim of this work is to improve the selection of the treatment and provide information that enables earlier intervention when necessary in order to prevent unnecessary suffering, suicide, and costs.
Primary Immunodeficiency Disease (PID) PhenomeR - An Integrated Web-based Ontology Resource Towards Establishment of PID E-clinical Decision Support System
Presenter: Sujatha Mohan, Ph.D.
Research Center for Allergy and Immunology (RCAI)
The Institute of Physical and Chemical Research (RIKEN)
Yokohama city, Kanagawa, Japan
Abstract: Primary immunodeficiency diseases (PIDs) are genetic disorders causing abnormalities in the development, maintenance and functioning of the immune system, manifested by increased susceptibility to infections and autoimmune disorders. To date, more than 250 PIDs have been reported, most of which are rare. Patients diagnosed with a given PID are often scattered all over the world, and knowledge about these diseases is hindered by the lack of a unified representation of PID information, especially one linking genotype and phenotype data, which requires regular concerted efforts and community participation. Earlier, we developed an open access integrated molecular database of PIDs named "Resource of Asian Primary Immunodeficiency Diseases - RAPID" (http://rapid.rcai.riken.jp/RAPID); at present it comprises a total of 263 PIDs and 242 genes, of which 232 genes are reported with over 5039 unique disease-causing mutations obtained from over 1823 PubMed citations. We hereby introduce a newly developed PID ontology browser for systematic integration and analysis of PID phenotype data with the genotype data from RAPID. Towards this end, we have developed a user-friendly interface named "PID PhenomeR", which serves as a standardized phenotype ontology resource presenting the ontology class structures and entities of all phenotypic terms observed in PID patients from RAPID in standardized file formats, Web Ontology Language (OWL) and Resource Description Framework (RDF), using semantic web technology. As of December 2012, PID PhenomeR consists of 1466 standardized PID terms that are classified under 24 semantic types and 29 categories. The standardization of PID phenotype terms for the addition of new terms is in progress, using a unique semi-automated process that includes a logic-based assessment method.
In essence, PID PhenomeR serves as an active integrated platform for PID phenotype data, wherein the generated semantic framework is implemented in the integrated knowledge-base query interface, i.e., a SPARQL Protocol and RDF Query Language (SPARQL) endpoint, for establishing a well-informed PID e-clinical decision support system.
Database URL: http://rapid.rcai.riken.jp/ontology/v1.0/phenomer.php
Keywords: Semantic web technology, Ontology, Genotype, Phenotype, Mutation, SPARQL
Connecting Linked Data (pdf)
Presenter: Nadia Anwar
Reading, United Kingdom
Abstract: Initially, data in linked data clouds were brought together with the maxim "Messy is Good". The idea was that if you put your data in RDF, it can be used, re-purposed and integrated. This maxim was used to encourage people to expose their data as RDF. Now that we are all convinced, and there is a lot of RDF available to us, some of it messy, it is time to tidy up the mess. We aim to show why "messy" is now problematic and describe how a little house-keeping makes the linked data we have better. In our experience, messy or “good enough” was the starting point; however, we now find that many practical uses of the data originally exposed as RDF are very difficult. Even the simplest of SPARQL queries on public resources can be very unintuitive. To demonstrate our point, we show through a small and a large example how adding just a few inferences makes public RDF data easier to query with SPARQL. The first, small example uses two RDF datasets for the model organism Drosophila melanogaster. FlyAtlas is a tissue-specific expression database produced by Julian Dow at the University of Glasgow. The expression profiles were designed to reveal the differences in expression in very different tissues, for example, from the brain to the hindgut. This is an incredible resource, available alongside other fly resources in RDF at openflydata.org. Gene set enrichment is now a standard tool used to understand such expression data; an analogous query is “are there any tissue-specific pathways in the hindgut?”. In theory, this question can be answered with a SPARQL query connecting FlyAtlas to FlyCyc, a database of fly pathways; however, this is not easy. The two graphs, FlyAtlas and FlyCyc, are actually quite difficult to traverse. However, with the addition of some triples through very simple inferences, utilizing, for example, transitive properties, class subsumption and simple CONSTRUCT queries, the graphs become much easier to query.
In a larger example, we have a pharma client with a large set of linked data, mainly from the public domain. In this linked data cloud there are some 30 databases using about 6 different RDFS/OWL schemas. The concepts are diverse, from Protein to Pathways to Clinical Trials. In practice, the queries through this data are complex, and as the cloud of data has grown, queries have become more and more cumbersome. In this larger data set, we present some of the typical queries performed through this data and show just how difficult some of these simple queries can be. We describe some actions that unify concepts within the multiple schemas in the linked data through the addition of some semantics. We show some of the query gains achieved in understandability and performance through the addition of these ontology statements.
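The kind of "very simple inference" mentioned in the abstract can be shown on a toy graph. The sketch below is a plain-Python stand-in for what a reasoner or a set of CONSTRUCT queries would do; the `ex:` terms and the sample triples are hypothetical and are not taken from FlyAtlas, FlyCyc or the client data. It materializes two rules: chaining of transitive properties, and propagation of `rdf:type` up `rdfs:subClassOf`.

```python
def infer(triples, transitive=("rdfs:subClassOf",)):
    """Return the input triples plus those entailed by two simple rules.

    Rule 1: (a p b) and (b p c) entail (a p c) for each transitive property p.
    Rule 2: (x rdf:type C) and (C rdfs:subClassOf D) entail (x rdf:type D).
    """
    inferred = set(triples)
    while True:
        new = set()
        for a, p, b in inferred:
            for c, q, d in inferred:
                if b != c:
                    continue
                if p == q and p in transitive:
                    new.add((a, p, d))
                if p == "rdf:type" and q == "rdfs:subClassOf":
                    new.add((a, "rdf:type", d))
        if new <= inferred:
            # Nothing new was derived, so the fixpoint has been reached.
            return inferred
        inferred |= new


# Toy graph with placeholder terms.
graph = {
    ("ex:geneA", "rdf:type", "ex:HindgutExpressedGene"),
    ("ex:HindgutExpressedGene", "rdfs:subClassOf", "ex:TissueExpressedGene"),
    ("ex:TissueExpressedGene", "rdfs:subClassOf", "ex:Gene"),
    ("ex:reaction1", "ex:partOf", "ex:pathway1"),
    ("ex:pathway1", "ex:partOf", "ex:superPathway1"),
}
closed = infer(graph, transitive=("rdfs:subClassOf", "ex:partOf"))
for triple in sorted(closed - graph):
    print("added:", " ".join(triple))
```

With the extra triples in place, a query for every `ex:Gene` or for everything that is `ex:partOf` a super-pathway becomes a single triple pattern instead of a chain of joins, at the cost of storing more triples.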
A Clinical Information Management Platform Using Semantic Technologies
Abstract: Medical procedures generate a vast amount of data from various sources. An efficient and comprehensive integration and exploitation of these data will be one of the success factors for improving health care delivery to the individual patient while making health care services more cost-effective. In order to support effective mining, selection and presentation of medical data for clinical or patient-centered use cases, both text data and structured clinical data from Health Information Systems (HIS) have to be enriched with semantic meta-information and have to be available at any point along the data value chain. We present a platform which combines an approach to semantic extraction of medical information from clinical free-text documents with the processing of structured information from HIS records. The information extraction uses a fine-grained linguistic analysis and maps the preprocessed terms to the concepts of domain-specific ontologies. These domain ontologies comprise knowledge from various sources, including expert knowledge and knowledge from public medical ontologies and taxonomies. The processes of ontology engineering and rule generation are supported by a semantic workbench that enables interactive identification of those linguistic terms in clinical texts that denote relevant concepts. This supports incremental refinement of semantic information extraction. Facts extracted from both clinical free texts and structured sources represent chunks of knowledge. They are stored in a Clinical Data Repository (CDR) using a common document-oriented storage model, which takes advantage of an application-agnostic format in order to support different use cases. It furthermore supports version control of facts, reflecting the evolution of information.
Enrichment algorithms aggregate further information by generating statistical information, search indexes, or decision recommendations. The CDR generally separates processes of information generation from processes of information processing or consumption, and thus supports smart partitioning of data for scalable application architectures. The applications hosted on the platform retrieve facts from the CDR by subscribing to the event stream provided by the CDR. The first applications implemented on top of the platform support specific scenarios of clinical research, such as recruiting patients for clinical trials, answering feasibility studies, or aggregating data for epidemiological studies. Further applications address patient-centered use cases such as second opinion or dialogue support. The web-based application StudyMatcher maps study criteria to a list of cases and their medical facts. Trial teams may define study criteria in interaction with the knowledge resources. The application automatically generates a list of candidate cases. Since the user interface links the facts extracted by the system to the original sources (e.g. the clinical documentation), users are able to check with little effort whether or not a fact has been recognized correctly by the system and matched correctly with the given criteria. This strategy of combining automatic and supervised fact generation promises to be a reasonable approach to improving the semantic exploitation of data. Platform and applications are developed in cooperation with Europe's leading healthcare providers Charité and Vivantes and will be rolled out in January 2013. In cooperation with DFKI - Deutsches Forschungszentrum für Künstliche Intelligenz.
Data Modeling and Machine Learning Approaches to Clustering and Identification of Bacterial Strains in Complex Mixtures by Genome Sequence Scanning (pdf)
Mikhail M Safranovitch
Douglas B. Cameron
Abstract: Genome Sequence Scanning (GSS), developed by PathoGenetix, is a single-molecule DNA analysis technique aimed at rapid identification of bacterial contaminants in complex biological mixtures and food samples. The detection and classification of single molecules and the identification of bacterial strains entail computationally intensive steps of data analysis, modeling, and machine learning on fluorescent traces of individual DNA molecules acquired from biological samples.
The basic methodology consists of first obtaining fluorescent traces of individual DNA molecules from the biological sample; their signal intensity and the statistics of all measured DNA molecules are then evaluated by custom data analysis software. In the next step, the experimental signals of single molecules are statistically modeled by the distribution of photons along the DNA restriction fragment, based on Poisson and Gamma distributions. The experimental trace signals are compared to a target database containing averaged template patterns for specific restriction fragments from multiple target organisms. The template patterns are generated either by theoretical calculations, based on known sequences, or experimentally, by GSS analysis and clustering of molecules from isolates.
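The comparison of a trace against template patterns can be sketched with a negative log-probability distance. This is a deliberate simplification and not the in-house software: it assumes each position along the fragment holds an independent Poisson-distributed photon count whose mean is the template value, it leaves out the Gamma component mentioned above, and the template names and numbers are invented for illustration.

```python
import math


def poisson_distance(trace, template):
    """Negative log-probability of observed photon counts under a template.

    trace: observed counts per position (non-negative integers).
    template: expected counts per position (positive numbers).
    Assumes independent Poisson counts; smaller values mean a better match.
    """
    if len(trace) != len(template):
        raise ValueError("trace and template must have the same length")
    total = 0.0
    for count, mean in zip(trace, template):
        if mean <= 0:
            raise ValueError("template means must be positive")
        # log P(count | mean) = count*log(mean) - mean - log(count!)
        total -= count * math.log(mean) - mean - math.lgamma(count + 1)
    return total


def classify(trace, templates):
    """Return the closest template name and all distances."""
    distances = {name: poisson_distance(trace, t) for name, t in templates.items()}
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical templates and trace, for illustration only.
templates = {
    "fragment_A": [5.0, 20.0, 5.0, 20.0],
    "fragment_B": [20.0, 5.0, 20.0, 5.0],
}
trace = [4, 22, 6, 18]
best, distances = classify(trace, templates)
print(best, distances)
```

Working in log space keeps the score numerically stable for long traces, where the product of many small probabilities would underflow.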
This computational modeling methodology yielded robust results for the detection and typing of multiple serovars and strains of Escherichia coli and Salmonella in complex biological mixtures. To identify closely related strains of these species, a hierarchical clustering algorithm (UPGMA: Unweighted Pair Group Method with Arithmetic Mean) is applied to group detected organisms. This clustering algorithm is also applied to generate phylogenetic trees to compare closely related strains of Escherichia coli, Salmonella enterica and other species. The ability of GSS to model single DNA molecule traces and attribute them to specific organisms, in conjunction with genome-based unsupervised classification using a hierarchical clustering approach, is the basis of a robust technology for confident detection and identification of pathogenic strains.
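For readers unfamiliar with UPGMA, a compact sketch of the textbook algorithm follows. It is not the PathoGenetix implementation; the strain labels and the distance values are hypothetical. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the size-weighted average of the two previous distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster items with UPGMA.

    names: list of leaf labels.
    dist: dict mapping (a, b) label pairs to distances (one entry per pair).
    Returns (tree, merges): tree is a nested tuple of labels, and merges
    lists (left, right, height) in the order the clusters were joined.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for c in sizes:
            if c == a or c == b:
                continue
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
        merges.append((a, b, height))
    (tree,) = sizes
    return tree, merges


# Hypothetical pairwise distances between four strains.
names = ["A", "B", "C", "D"]
dist = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, merges = upgma(names, dist)
print(tree)
for left, right, height in merges:
    print(left, right, height)
```

The nested tuple is a rooted tree in which closely related strains share the deepest branches, and the merge heights give the branch points of the corresponding dendrogram.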