C-SHALS 2009


Poster:	Structural database using Semantic Web Concepts to Support Structure-Based Drug Design for AIDS
Presenter(s):	T. N. Bhat, NIST

Poster:	Exposing The Cancer Genome Atlas (TCGA) as a SPARQL Endpoint
Presenter(s):	Helena F. Deus, The University of Texas M. D. Anderson Cancer Center

Poster:	BioProspecting the Bibliome: Discovering Potentially Novel Cancer Biomarkers
Presenter(s):	Peter Elkin, Mount Sinai Medical Center

Poster:	Building Innovation Networks: Applying Semantic Technology in the Life Sciences
Presenter(s):	Marc Hadfield, Alitora Systems, Inc.

Poster:	Semantic Integration in Biomedical Networks
Presenter(s):	Chun-Hsi Huang, University of Connecticut

ABSTRACTS

Poster Presenter: T. N. Bhat, NIST, bhat@nist.gov

Poster Title: Structural database using Semantic Web Concepts to Support Structure-Based Drug Design for AIDS

Abstract: The HIV structural databases (HIVSDB, http://bioinfo.nist.gov/SemanticWeb_pr2d/chemblast.do, http://chemdb2.niaid.nih.gov) distribute one of the largest comprehensive collections of structural, biological and pre-clinical data on inhibitors, drug leads and clinical drugs for AIDS. These databases contain information on several thousand biologically active compounds from all classes (HIV PR, RT, CCR5, Integrase) of FDA-approved drugs. Efficient yet user-friendly data management systems that support state-of-the-art annotation, visualization and query capabilities are crucial for the effective use of data in fragment-based structural pharmacology and rational drug design. The Semantic Web is the vision of the World Wide Web Consortium for enabling seamless integration of electronic data for data mining and knowledge generation across the Web.

A robust and functionally relevant ontology plays a critical role in developing the data elements for a Semantic Web. This presentation will illustrate how Semantic Web concepts are used for novel annotation, data integration, storage, and query to manage and display structural (fragments, 2-D images and text-based), biological, and pre-clinical data. One of the techniques developed, ChemBLAST (Prasanna, Vondrasek et al. 2006), allows rapid comparison of structural fragments using the semantics commonly used in the drug discovery process. At present, the majority of the data in HIVSDB are obtained by us by sifting through publications. Our intention is to seek greater participation from the community through deposition of data in HIVSDB at the time of publication.

Prasanna, M. D., J. Vondrasek, et al. (2006). "Chemical compound navigator: a web-based chem-BLAST, chemical taxonomy-based search engine for browsing compounds." Proteins 63(4): 907-17.

Poster Presenter: Helena F. Deus, The University of Texas M. D. Anderson Cancer Center, mhdeus@mdanderson.org

Poster Title: Exposing The Cancer Genome Atlas (TCGA) as a SPARQL Endpoint

Abstract: Automated discovery of candidate biomarkers from multiple databases has been a central challenge in the Life Sciences in general and in the study of systemic processes such as cancer biology in particular. The maturation of Semantic Web technologies offers solutions to this problem by allowing the query to be defined by the domains of discourse where the answer to the query is sought. A specific example of this challenge is found in The Cancer Genome Atlas initiative (TCGA, http://cancergenome.nih.gov/), which produces a large-scale repository of high-throughput molecular biology data generated and processed at 5 academic facilities across the USA [1, 2]. The heterogeneity of domains (genomic, transcriptomic, epigenetic, proteomic, clinical and demographic, etc.) and the heterogeneity of methodologies within each domain will be further compounded by the expansion of TCGA as an international initiative led by The International Genomics Consortium (IGC) in collaboration with the Translational Genomics Research Institute (TGen). Currently, the TCGA data is aggregated at a single point of access charged with providing syntactic interoperability to all of the data: the TCGA portal.

Using The Cancer Genome Atlas as a case study, and the S3DB (www.s3db.org, [3, 4]) distributed semantic data service application as the engine of integration, we developed a computational domain representation for the TCGA data in order to integrate the clinical and molecular information and expose it through Web Services. Specifically, this novel resource allows information retrieval through the SPARQL module available at any S3DB node deployment. Since the handling of sensitive and proprietary data is always a concern in translational studies in biomedicine, the ontology itself includes the management of user permissions on individual data elements. The architecture for this novel resource is thought to provide a template for web-based solutions that bridge between data silos within a domain of knowledge and between the bench and the clinical point of care.
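As a rough illustration of what retrieval through a SPARQL endpoint can look like, the sketch below builds a SPARQL Protocol GET request URL, in which the query text travels in the standard "query" parameter. The endpoint address, the "ex:" vocabulary and the property names are hypothetical placeholders; the abstract does not give the actual S3DB address or the TCGA domain model, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; a real S3DB deployment publishes its own address.
ENDPOINT = "http://example.org/s3db/sparql"

# Hypothetical vocabulary (ex:hasDiagnosis, ex:hasSample); the real TCGA
# domain representation is not described in the abstract.
QUERY = """PREFIX ex: <http://example.org/tcga#>
SELECT ?patient ?sample
WHERE {
  ?patient ex:hasDiagnosis "glioblastoma" .
  ?patient ex:hasSample ?sample .
}
LIMIT 10"""


def build_request_url(endpoint: str, query: str) -> str:
    """Return a SPARQL Protocol GET URL carrying the query in 'query'."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

A client would then issue an HTTP GET on the resulting URL; how permissions on individual data elements are conveyed to the endpoint is specific to S3DB and is not modeled here.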

References:

[1] Cancer Genome Atlas Research Network, "Comprehensive genomic characterization defines human glioblastoma genes and core pathways," Nature, vol. 455, pp. 1061-8, Oct 23 2008.

[2] L. Chin et al., "Translating insights from the cancer genome into clinical practice," Nature, vol. 452, pp. 553-63, Apr 3 2008.

[3] J. S. Almeida et al., "Data integration gets 'Sloppy'," Nat Biotechnol, vol. 24, pp. 1070-1, Sep 2006.

[4] H. F. Deus et al., "A Semantic Web Management Model for Integrative Biomedical Informatics," PLoS ONE, vol. 3, p. e2946, 2008.

Poster Presenter: Peter Elkin, Mount Sinai Medical Center, peter.elkin@mssm.edu

Poster Title: BioProspecting the Bibliome: Discovering Potentially Novel Cancer Biomarkers

Abstract: Using SNOMED-CT and HUGO as ontologies, 27,000 Web-accessible NEJM articles were mined for potentially novel biomarkers using natural language processing (NLP). Articles containing an association between a gene and a metabolic function were linked with articles associating the same metabolic function with a disease; when no single article associated the same gene with that disease, the gene-disease link was treated as a potentially novel association.
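The linking step described above can be sketched as a join over two association tables followed by a filter on directly reported pairs. This is a minimal sketch on made-up data, assuming the NLP stage has already produced (article, gene, function), (article, function, disease) and (article, gene, disease) tuples; the identifiers and the function name novel_candidates are illustrative, not the authors' implementation.

```python
from collections import defaultdict

# Made-up associations standing in for NLP output.
gene_function = [("a1", "GENE_A", "glycolysis"), ("a2", "GENE_B", "apoptosis")]
function_disease = [("a3", "glycolysis", "disease_X"), ("a4", "apoptosis", "disease_Y")]
# Gene-disease pairs already stated together in a single article.
gene_disease = [("a5", "GENE_B", "disease_Y")]


def novel_candidates(gf, fd, gd):
    """Join gene-function with function-disease; drop pairs already reported."""
    known = {(gene, disease) for _, gene, disease in gd}
    diseases_by_function = defaultdict(set)
    for article, function, disease in fd:
        diseases_by_function[function].add((disease, article))
    candidates = []
    for article_gf, gene, function in gf:
        for disease, article_fd in sorted(diseases_by_function.get(function, ())):
            if (gene, disease) not in known:
                candidates.append((gene, function, disease, article_gf, article_fd))
    return candidates


for row in novel_candidates(gene_function, function_disease, gene_disease):
    print(row)
```

Each candidate keeps the two supporting article identifiers so that a reviewer can trace the indirect link back to its sources.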

Poster Presenter: Marc Hadfield, Alitora Systems, Inc., marc@alitora.com

Poster Title: Building Innovation Networks: Applying Semantic Technology in the Life Sciences

Abstract: To be successful, Life Science professionals must utilize the perfect storm of heterogeneous data - genomic research, clinical studies, health care records, patents, industry news, market research, and government policy.

Semantic Technology provides a means of linking related heterogeneous data assets together using Information Extraction and other Data Mining techniques aligned using Ontologies. People too are an important part of the Innovation Network: much of an organization's data is locked in the brains of its experts.

In this presentation, we will explore specific tools and techniques to build Innovation Networks using the Memomics application as a Case Study. A demonstration of the Memomics Innovation Network will combine data assets from heterogeneous sources with expert users. Users of the Innovation Network will be able to collaborate with other users, sharing knowledge and expertise.

The Memomics Innovation Network uses Semantic Technology to expose data assets to semantic searches, and associates discovered knowledge with the users that may be most interested in it.

Issues to be explored include:

Accuracy of Semantic data, and Errors
Semantic Searches
Matching users to knowledge
Utilizing Multiple Ontologies
Collaboration across Organizations
Security

Poster Presenter: Chun-Hsi Huang, University of Connecticut, huang@cse.uconn.edu

Poster Title: Semantic Integration in Biomedical Networks

Abstract: A semantic network is a conceptual model for knowledge representation in which the knowledge entities are represented by nodes (or vertices), while the edges (or arcs) are the relations between entities. The semantic network is an effective tool, serving as the backbone knowledge representation system for genomic, clinical and medical data. These knowledge bases are usually stored at geographically distributed locations, which highlights the importance of an efficient distributed semantic network system enabling distributed knowledge extraction and inference. Note that the semantic network is a key component of the Unified Medical Language System (UMLS) project initiated in 1986 by the U.S. National Library of Medicine (NLM). The goal of the UMLS is to facilitate associative retrieval and integration of biomedical information so that researchers and health professionals can use such information from different (readable) sources. The UMLS project consists of three core components: (1) the Metathesaurus, which provides a common structure for more than 95 source biomedical vocabularies and is organized by concept, where a concept is a cluster of terms (e.g., synonyms, lexical variants, and translations) with the same meaning; (2) the Semantic Network, which categorizes these concepts by semantic types and relationships; and (3) the SPECIALIST lexicon and associated lexical tools, containing over 30,000 English words, including various biomedical terminologies. Information for each entry, including base form, spelling variants, syntactic category, inflectional variation of nouns and conjugation of verbs, is used by the lexical tools. The 2002 version of the Metathesaurus contains 871,584 concepts named by 2.1 million terms. It also includes inter-concept relationships across multiple vocabularies, concept categorization, and information on concept co-occurrence in MEDLINE.
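The node-and-edge model described above can be illustrated with a small in-memory structure in which typed edges are stored per (source, relation) pair and relations are inherited along "isa" links. This is a minimal sketch: the class name SemanticNetwork and the toy nodes and relations are illustrative assumptions and are not taken from actual UMLS content.

```python
from collections import defaultdict


class SemanticNetwork:
    """Directed graph with labeled edges: (source, relation) -> set of targets."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def ancestors(self, node):
        """All nodes reachable from `node` by following 'isa' edges."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for parent in self._edges.get((current, "isa"), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def related(self, node, relation):
        """Targets of `relation` for `node`, including those inherited via 'isa'."""
        targets = set()
        for candidate in {node} | self.ancestors(node):
            targets |= self._edges.get((candidate, relation), set())
        return targets


# Toy content for illustration only.
net = SemanticNetwork()
net.add("HIV", "isa", "Virus")
net.add("Virus", "isa", "Organism")
net.add("Virus", "causes", "Disease")

print(sorted(net.ancestors("HIV")))
print(net.related("HIV", "causes"))
```

Here the "causes" relation is stated once on "Virus" and is found for "HIV" through inheritance, which is the kind of simple inference a semantic network query can perform.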

Application areas in biomedicine include epidemiological studies and medical imaging, which produce tremendous amounts of data that are usually geographically distributed among hospitals, clinics, research labs, and radiology centers. For research, training or clinical purposes, physicians and researchers often need to consult and analyze data from distributed sites. Thus, an infrastructure supporting on-demand and automated information extraction and reasoning will provide significant convenience.

This research work involves several tasks, including (1) the development of a distributed semantic network system, based on a task-based and message-driven model to exploit both task and data parallelism while processing queries; (2) the parallelization of the inference engine to speed up query processing; and (3) automated data migration among the distributed knowledge bases to maximize the storage utilization rate. The current test-bed infrastructure for this project is a campus-wide computational and data Grid. Participating sites of this infrastructure include the Schools of Engineering and Medicine at the University of Connecticut. Note that the Grid represents a rapidly emerging and expanding technology that allows geographically distributed resources (CPU cycles, data storage, sensors, visualization devices, and a wide variety of internet-ready instruments), which are under distinct control, to be linked together in a transparent fashion. The aggregate computing power, data storage, and network bandwidth, as well as the user friendliness, have rendered the Grid a promising infrastructure in support of automated processing of distributed information.
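The data-parallel side of query processing over distributed knowledge bases can be sketched as one task per site followed by a merge of the partial results. In this minimal sketch, threads stand in for remote Grid sites and each knowledge base is a list of made-up triples; the site names, triples and function names are illustrative assumptions, not the project's message-driven system.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up knowledge base partitions, one per (hypothetical) site.
SITES = {
    "engineering": [("gene_A", "associated_with", "disease_X"),
                    ("gene_B", "isa", "gene")],
    "medicine": [("gene_C", "associated_with", "disease_X"),
                 ("disease_X", "isa", "disease")],
}


def match_local(triples, pattern):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


def distributed_query(sites, pattern):
    """Run one matching task per site in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {name: pool.submit(match_local, triples, pattern)
                   for name, triples in sites.items()}
        merged = []
        for name in sorted(futures):  # fixed order keeps the output deterministic
            merged.extend(futures[name].result())
    return merged


hits = distributed_query(SITES, (None, "associated_with", "disease_X"))
print(hits)
```

In a real deployment each task would be dispatched as a message to the site holding the partition, so that only the matching triples travel over the network.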
