Oral Presentations


OP01:
Exploiting spatial and temporal variation in epigenomic landscapes

Subject: Gene Regulation & Transcriptomics

Presenting Author: Kyoung-Jae Won, University of Pennsylvania

Abstract:
Accumulation of cell-type specific epigenomic data provides an unprecedented opportunity to study cell-type specific gene regulation on a genome-wide scale. Comparative approaches have been applied to study cell-type specific gene regulation. Though successful, most comparative studies were applied to a limited number of epigenomic marks or samples. Systematic comparative approaches that fully exploit the multi-dimensional epigenomic landscape across multiple samples (or time scales) are required. However, exploring the full range of the genome in a multi-dimensional space and performing subsequent analysis is computationally challenging. Computationally cost-efficient methods are needed as the number of available datasets increases. We propose a novel approach, called SeqW, to handle various kinds of histone modification data and perform comparative analysis across time or samples. To perform comparative analysis in a time-efficient manner, we applied a filtering method based on a special wavelet transformation. The wavelet filter efficiently reduced the data size without losing information about the intensity of histone modification data. By applying SeqW to a time-course epigenomic dataset collected during adipogenesis, we clustered coding as well as non-coding genes according to their epigenetic changes across time. Unique to SeqW is the power to exploit spatial changes in epigenetic composition. SeqW captures H3K36me3 in the gene-body region and H3K27me3 spread widely across the upstream promoter region, and classifies genes accordingly. We observed that genes with H3K27me3 in upstream promoter regions were slightly less expressed than genes without H3K27me3, suggesting a new functional role of H3K27me3. We also found that genes with H3K27ac were usually highly expressed.
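
The abstract does not say which wavelet SeqW uses. As a rough illustration of how a wavelet filter can shrink a signal while keeping its intensity, the sketch below assumes a plain Haar transform and a made-up coverage track; `haar_approximation` and `coverage` are hypothetical names, not part of SeqW.

```python
import numpy as np

def haar_approximation(signal, levels):
    """Keep only the Haar approximation band, rescaled to local means.

    Hypothetical helper for illustration; not the SeqW implementation.
    Each level halves the number of points.
    """
    x = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if x.size % 2:                 # pad odd-length input by repeating the last value
            x = np.append(x, x[-1])
        x = (x[0::2] + x[1::2]) / 2.0  # pairwise mean (Haar approximation up to a constant factor)
    return x

# Toy histone-mark coverage track (assumed values).
coverage = np.array([0, 0, 4, 4, 8, 8, 2, 2], dtype=float)
reduced = haar_approximation(coverage, levels=2)  # 8 points reduced to 2
```

Because each step stores pairwise means, the reduced track has the same average intensity as the input whenever no padding is needed, which is the property the abstract emphasizes.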


OP02:
Comparison of human LCL to post-mortem brain eQTLs: Effects of tissue and power

Subject: Gene Regulation & Transcriptomics

Presenting Author: Benjamin Keller, Eastern Michigan University

Co-Author(s):
Margit Burmeister, University of Michigan

Abstract:
To assess the regulatory effect of genotyped SNPs in different tissues, we compared eQTLs from lymphoblastoid cell lines (LCLs) and post-mortem brain. Such comparisons are critical to our ability to use readily available tissue to predict how SNPs affect regulation in complex genetic disease. We identified eQTLs in LCLs and tested for replication in at least one of 10 brain regions in genes expressed in both tissues. In each tissue, associations with false discovery rate q-value ≤ 0.05 were considered significant. In LCLs, 314 genes were found to have significant associations involving 5123 SNPs (minor allele frequency, MAF < 5%) at q ≤ 0.05. After accounting for probe annotation, tissue-specific gene expression, and MAF, and discarding SNP-containing probes that might confound expression measurements, 200 genes remained. Replication at q ≤ 0.05 finds 78 genes; relaxing the replication criterion to a nominal p ≤ 0.05 gives 89 genes. A total of 198 associations involving 33 genes were clearly not replicated even under these relaxed conditions, and hence are hypothesized to be LCL-specific eQTLs. Since most eQTLs replicate at a relaxed p-value threshold, our data suggest that statistical power is the primary driver of differences between tissues. Interestingly, a significant number of q ≤ 0.05 replicates show an inversion of the direction of the association effect, suggesting that those SNPs may bind regulatory signals that differ in enhancing vs. repressing properties between LCLs and brain tissue. We are currently investigating whether such regulatory factors, e.g. transcription factors, binding near a SNP can be identified.
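
The difference between the q ≤ 0.05 criterion and the relaxed nominal p ≤ 0.05 criterion can be illustrated with a small sketch. The abstract does not state how q-values were computed; the code below assumes the Benjamini-Hochberg adjustment, and the p-values are invented for illustration.

```python
import numpy as np

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (an assumed stand-in for the study's q-values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q

pvals = [0.001, 0.008, 0.039, 0.041, 0.20]   # hypothetical association p-values
q = bh_qvalues(pvals)
significant = q <= 0.05                      # strict criterion
nominal = np.asarray(pvals) <= 0.05          # relaxed criterion
```

In this toy input, two associations pass the q-value threshold while four pass the nominal threshold, mirroring how relaxing the criterion recovers additional replications.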


OP03:
Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis

Subject: Gene Regulation & Transcriptomics

Presenting Author: Jeff Klomp, Van Andel Research Institute

Co-Author(s):
Kyle Furge, Van Andel Research Institute

Abstract:
High-throughput methods that ascribe a cellular or physiological function for each gene product are useful to understand the roles of genes that have not been extensively characterized by molecular or genetic approaches. One method to infer gene function is a "guilt-by-association" approach, in which the expression pattern of a poorly characterized gene is shown to co-vary with the expression of better-characterized genes. The function of the poorly characterized gene is inferred from the known function(s) of the well-described genes. For example, genes that are co-regulated with the prostate-specific antigen (PSA/KLK3) have been implicated as additional diagnostic or therapeutic targets for prostate cancer (Walker et al. 1999). While examining the expression characteristics of several poorly characterized genes, we noted that we could associate each of the genes with a cellular phenotype by correlating expression changes with gene set enrichment scores from individual samples. We evaluate the effectiveness of this approach (correlating a gene’s expression value with a gene set enrichment statistic) using a modest-sized gene expression data set (expO) and a compendium of gene expression phenotypes (MSigDB v3.0). Our results show that this approach is generally useful for identifying and verifying functions of disparate sets of genes. In evaluating the model, we also found that 4% of the genome encodes genes that are associated with small molecule and small peptide signal transduction, implicating a large number of genes in both internal and external environmental sensing.
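
The core computation described here, correlating each gene's expression with per-sample gene set enrichment scores, can be sketched as follows. The abstract does not specify the correlation measure or the enrichment statistic; this sketch assumes Pearson correlation and takes the enrichment scores as given. The function name and the toy matrices are hypothetical.

```python
import numpy as np

def gene_to_geneset_correlation(expression, enrichment):
    """Pearson correlation of every gene with every gene set across samples.

    expression: genes x samples matrix
    enrichment: gene sets x samples matrix of per-sample enrichment scores
    Rows with zero variance would produce NaN and should be filtered beforehand.
    """
    e = np.asarray(expression, dtype=float)
    s = np.asarray(enrichment, dtype=float)
    e = e - e.mean(axis=1, keepdims=True)
    s = s - s.mean(axis=1, keepdims=True)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    return e @ s.T  # genes x gene sets

# Two genes and one gene set measured in four samples (invented values).
expression = [[1, 2, 3, 4],
              [4, 3, 2, 1]]
enrichment = [[2, 4, 6, 8]]
corr = gene_to_geneset_correlation(expression, enrichment)
```

A gene would then be matched to the gene sets with which it shows the strongest correlation; any cutoff for "strongest" is a further choice not given in the abstract.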


OP04:
Characteristics and Significance of Intergenic PolyA RNA Transcription in Arabidopsis thaliana

Subject: Gene Regulation & Transcriptomics

Presenting Author: Gaurav Moghe, Michigan State University

Co-Author(s):
Melissa Lehti-Shiu, Michigan State University
Alex Seddon, Michigan State University
Shan Yin, Michigan State University
Yani Chen, Michigan State University
Federica Brandizzi, Michigan State University
Piyada Juntawong, University of California - Riverside
Julia Bailey-Serres, University of California - Riverside
Shin-Han Shiu, Michigan State University

Abstract:
The Arabidopsis thaliana genome has over 27,000 protein-coding genes and is the best-annotated plant genome. However, recent transcriptome sequencing suggests the presence of several novel intergenic polyA transcripts. It is not clear whether these transcripts can be translated and whether they represent functional genes. In this study, we first assessed the extent of intergenic polyA transcription using eight mRNA-Seq datasets and found that Intergenic Transcribed Fragments (ITFs), while ranging from hundreds to thousands across all datasets, occupy only a tenth of the intergenic space. We assessed the potential functionality of ITFs based on breadth and level of expression, association with the ribosomal machinery, and primary sequence conservation. Most ITFs were identified as short, lowly expressed, dataset-specific transcripts lying close to annotated genes. Analyses of translatome and proteome datasets indicated that ~35% of ITFs were likely translated. However, ITFs closer to genes were significantly more likely to be ribosome-associated, suggesting that they may be part of annotated transcriptional units. The sequence-level conservation of ITFs was assessed based on comparisons between A. thaliana and A. lyrata and among 80 accessions of A. thaliana. We found that only ~15% of the ITFs are under significant purifying selection. Overall, our comprehensive analyses of the A. thaliana polyA transcriptome reveal that, despite the prevalence of ITFs, most do not display evidence of purifying selection. Thus, we cannot rule out the possibility that they are products of spurious transcription. Nonetheless, these apparently neutrally evolving ITFs may underlie an important mechanism in creating evolutionary novelty.


OP05:
Cis-regulatory code of stress responsive gene expression in plants

Subject: Gene Regulation & Transcriptomics

Presenting Author: Shin-Han Shiu, Michigan State University

Co-Author(s):
Alexander Seddon, Michigan State University
Cheng Zou, Chinese Academy of Agricultural Sciences

Abstract:
Environmental stress leads to significant changes in gene expression that are central to plant survival. Although there are well-studied examples of a few plant cis-regulatory elements (CREs) that function in stress regulation, the plant stress cis-regulatory code, i.e., how CREs work independently and/or in concert to specify stress-responsive transcription, is mostly unknown. We identified a large number of putative CREs (pCREs) through analysis of the transcriptional response of Arabidopsis thaliana above-ground tissues to multiple stress conditions. Surprisingly, biotic and abiotic responses are mostly mediated by distinct pCRE superfamilies. In addition, using machine learning approaches, we uncovered cis-regulatory codes specifying how pCRE presence and absence, combinatorial relationships, location, and copy number can be used to predict stress-responsive expression. Using salt stress response as an example, we showed that a cis-regulatory code based on above-ground tissue expression can be used to predict response in roots and, most importantly, in rice, a plant species that diverged from A. thaliana ~150 million years ago. The discovery of these cis-regulatory rules significantly advances our understanding of plant stress transcriptional response. In addition, our ability to apply cis-regulatory logic across tissue types and species highlights the robustness of the regulatory rules and their utility in translational research.


OP06:
Splice Variants Detection Using RNA-Seq Assembly and Digital Normalization

Subject: Gene Regulation & Transcriptomics

Presenting Author: Likit Preeyanon, Michigan State University

Co-Author(s):
C. Titus Brown, Michigan State University
Hans Cheng, USDA, ADOL, ARS, Michigan State University
Jerry B. Dodgson, Michigan State University

Abstract:
Recently, RNA sequencing (RNA-Seq) based on Next Generation Sequencing (NGS) technology has been successfully used to study alternative splicing in humans and mice. The methods used in these analyses rely solely on high-quality gene models. Consequently, they are not suitable for other organisms lacking high-quality gene annotations. To overcome this problem, other methods have been developed; for example, one method does not rely on an existing annotation but instead constructs gene models from sequence reads that are mapped to the genome. However, that method is limited to sequence reads that map to the genome, and its gene models are built from computational predictions.

We have developed a pipeline, based on an assembly approach, that builds gene models and identifies alternative isoforms from RNA-Seq data. We have been using this method to study alternative splicing in chicken lines 6 and 7, which are resistant and susceptible, respectively, to Marek’s disease (MD). The method identified many novel genes and isoforms that are not included in existing gene models. The pipeline does not rely on existing gene annotations; therefore, it can be applied to study alternative splicing in any organism. Moreover, we are developing a technique that intelligently removes a significant amount of RNA-Seq data and sequencing errors to facilitate the assembly process.


OP07:
Inference of recurrent 3D RNA motifs from sequence

Subject: Protein Structure & Function

Presenting Author: Craig Zirbel, Bowling Green State University

Co-Author(s):
Anton Petrov, Bowling Green State University
James Roll, Bowling Green State University
Neocles Leontis, Bowling Green State University

Abstract:
Correct prediction of RNA structure from sequence is an unsolved problem in bioinformatics. An important sub-goal is the inference of the 3D structures of recurrent hairpin and internal loops. Such motifs can play architectural roles, serve to anchor RNA tertiary interactions, or provide binding sites for proteins. To establish the sequence variation of recurrent motifs, all hairpin and internal loops from a non-redundant (NR) set of RNA 3D structures are extracted and clustered into geometrically similar families. Probabilistic models for sequence variability are constructed for each motif using hybrid Stochastic Context-Free Grammar/Markov Random Field (SCFG/MRF) models and parameterized by all motif instances and knowledge of substitution patterns for non-Watson-Crick basepairs. SCFG techniques can account for nested pairs and insertions, while MRF ideas can handle non-nested interactions, including base triples. Given the sequence of a hairpin or internal loop from a secondary structure as input, each SCFG/MRF model calculates the probability that the sequence variant would occur. If the score is in the same range as that of sequences known to form the 3D structure, we infer that the new sequence forms the same 3D structure. This approach correctly infers the 3D structures of nearly all structured internal loops when using sequences from 3D structures as input. Often, a single sequence is enough to correctly infer 3D structure. Probabilistic models for 3D motifs from structurally conserved regions of ribosomal RNA were validated by scoring sequence variants from multiple sequence alignments that are different from those used to construct the models.


OP08:
Macromolecular structure modeling and electron microscopy fitting using 3D Zernike descriptors

Subject: Protein Structure & Function

Presenting Author: Daisuke Kihara, Purdue University

Co-Author(s):
Juan Esquivel-Rodríguez, Purdue University

Abstract:
Protein complexes play crucial roles in cell functions such as transport, signal transduction and gene regulation. Computational methods to model the atomic structure of these molecules have been developed, but they are mainly focused on monomeric forms or pairwise interactions between proteins. We developed two methods that explore the next stage in protein structure modeling, namely creating models for larger multimeric complexes. Multi-LZerD, our multiple-protein docking protocol, assembles pairwise docking models of component proteins. For each pair of proteins, over 50,000 pairwise models are computed, which are then combined by a genetic algorithm. Our results show that Multi-LZerD can successfully model hetero-multimeric structures. It is particularly successful at generating models from the unbound form of the components. The majority of models obtained are near-native, with a root mean square deviation from the native of 2.5 Å or less. EMLZerD fits high-resolution structures into low-resolution electron microscopy density maps using Multi-LZerD models. A target EM map is compared against several hundred multimeric models using the 3D Zernike descriptor (3DZD), a mathematical series expansion of three-dimensional functions that is effective at encoding protein shapes. The 3DZD provides a unified representation of the surface shape of models and EM maps that allows a convenient, fast quantitative comparison. In a benchmark composed of 19 multimeric complexes with density maps at 10 Å and 15 Å resolution, EMLZerD was able to identify near-native complex structures for 14 cases and medium-range models for the remaining cases.


OP09:
Viral Capsid Proteins are Segregated in Structural Fold Space

Subject: Protein Structure & Function

Presenting Author: Shanshan Cheng, University of Michigan, Ann Arbor

Co-Author(s):
Charles Brooks, University of Michigan

Abstract:
Viral capsid proteins assemble into large, symmetrical architectures that are not found in complexes formed by their cellular counterparts. Given the prevalence of the signature jelly-roll topology in viral capsid proteins, we are interested in whether these functionally unique capsid proteins are also structurally unique in terms of folds. To explore this question, we applied structure-alignment based clustering to all protein chains in VIPERDB, filtered at 50% sequence identity, to identify distinct capsid folds, and compared the cluster centroids with a non-redundant subset of protein domains in the SCOP database, not including the viral capsid entries. This comparison, using Template Modeling (TM)-score, identified 345 structural “relatives” of capsid proteins from the non-capsid set, covering altogether 16 folds following the definition in SCOP. The statistical significance of the 16 folds shared by two sets of the same size, estimated from 10,000 permutation tests, is 0.0056, which is an upper bound on the p-value. We thus conclude that viral capsid proteins are segregated in structural fold space. Our result provides novel insight into how structural folds of capsid proteins, as opposed to their surface chemistry, might be constrained during evolution by the requirements of the assembled cage-like architecture. Importantly, our work also highlights a guiding principle for virus-based nanoplatform design in a wide range of biomedical applications and materials science.
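
A permutation test of this kind can be sketched as below. The abstract does not describe the exact resampling scheme, so this sketch makes several assumptions: fold labels from the two sets are pooled and reshuffled into two groups of the original sizes, the statistic is the number of folds the groups share, and the p-value uses the conservative (hits + 1) / (permutations + 1) estimator. All names and the toy labels are hypothetical.

```python
import random

def shared_folds(set_a, set_b):
    """Number of distinct fold labels present in both groups."""
    return len(set(set_a) & set(set_b))

def permutation_pvalue(folds_a, folds_b, n_perm=10_000, seed=0):
    """Estimate how often a random split shares as few folds as the observed split."""
    rng = random.Random(seed)
    observed = shared_folds(folds_a, folds_b)
    pooled = list(folds_a) + list(folds_b)
    n_a = len(folds_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if shared_folds(pooled[:n_a], pooled[n_a:]) <= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy fold labels: the two groups share only the fold "s".
capsid = ["a1", "a1", "a2", "a2", "a3", "a3", "s"]
cellular = ["b1", "b1", "b2", "b2", "b3", "b3", "s"]
observed, p_value = permutation_pvalue(capsid, cellular)
```

A small p-value here means random splits rarely share as few folds as the real one, which is the sense in which the two sets would be called segregated.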


OP10:
Structure to function analysis of the Renin-Angiotensin system

Subject: Protein Structure & Function

Presenting Author: Jeremy Prokop, The University of Akron

Co-Author(s):
Fabiano Araujo, UFMG
Robson Santos, UFMG
Almir Martins, UFMG
Amy Milsted, The University of Akron

Abstract:
The renin-angiotensin system is targeted for treatment in numerous diseases, from cardiovascular disease to cancer. The system comprises a protein (angiotensinogen) that is cleaved into peptides (angiotensin peptides) of various sizes by several enzymes (renin, ACE and ACE2). These peptides then bind to and activate several receptors (AT1, AT2, and MAS). By compiling all known protein structures of the molecules involved (modeling those structures that are unknown) and combining them with sequences from multiple species, we identified key residues sharing properties at each step of enzyme and receptor binding. These key amino acids were confirmed against published mutagenesis and functional experiments. Additional molecular dynamics simulations and docking site predictions of peptides reveal the importance of serines and threonines in the binding process that allows fragments to bind to and be removed from the active sites of enzymes and to orient properly in the activating receptors. Finally, modeling approaches reveal for the first time the possible significance of the pro-peptide domain of renin (cleaved off during renin activation), which may play an important role in blood pressure regulation and cellular remodeling. This raises a new concern about renin blocker treatments for hypertension: these agents increase pro-renin plasma concentrations, resulting in increased levels of the pro-peptide domain. Studies such as this show the importance of applying bioinformatics to a molecular system and of identifying ways to analyze a system from the bottom up without over-interpreting the significance of individual components.


OP11:
DominoQuery, a research-friendly deployable query environment

Subject: Databases & Ontologies

Presenting Author: Rajesh Cherukuri, Case Western Reserve University

Co-Author(s):
Joe Teagno, Case Western Reserve University

Abstract:
We introduce DominoQuery, a research tool that takes well-structured XML documents capturing the details of a clinical study and allows a researcher to query the data set quickly, efficiently, and accurately. DominoQuery achieves these goals by using a data store with text-mining capabilities, dubbed DominoStore, and a web-based query tool we call XQRuby.

DominoStore was developed to enable researchers to easily access and manage their document store. It allows users to search their data with ease and does not require prior experience in managing or using a database. DominoStore is a Java application that lets the user add documents to the document store through a graphical user interface. It then automatically indexes each document and runs kernel-based clustering text mining to determine possible queries, with the aim of speeding up the user's queries. DominoStore has one drawback: although the user can enter a query, the result is an XML blob.

The query tool employs XQRuby, a Domain-Specific Language (DSL) created for DominoQuery, which allows the query interface to connect to the document store. XQRuby serves as a bridge between the application's native code, written in Ruby, and the queries into the DominoStore, executed in XQuery. XQRuby takes full responsibility for coordinating incoming queries from the Ruby code and leveraging the caching and acceleration enhancements provided by the DominoStore. Taken together, DominoQuery lets a researcher forgo tedious software configuration and parse through their data efficiently.


OP12:
Computational characterization of moonlighting proteins using Gene Ontology annotations

Subject: Databases & Ontologies

Presenting Author: Ishita Khan, Purdue University

Co-Author(s):
Daisuke Kihara, Purdue University

Abstract:
Advancements in function prediction algorithms are enabling large-scale computational elucidation of annotations for newly sequenced genomes. With the increase in the number of functionally well-characterized proteins, it has been observed that many proteins are involved in more than one cellular function. These proteins, known as moonlighting proteins, show varied functional behavior depending on the cell type, localization in the cell, oligomerization, multiple binding sites, etc. As an example, the human ATF2 protein acts as a transcription factor and is also involved in the DNA damage response. Here, we have developed a computational framework for investigating and identifying moonlighting proteins on a genome scale. First, from many examples of known moonlighting proteins taken from the literature, we analyzed the functional diversity of the proteins in terms of Gene Ontology. From this observation we established a definition of moonlighting proteins that can be used for automatic identification of such proteins. Second, we examined how well current function prediction methods identify diverse functions in moonlighting proteins. Lastly, we identified moonlighting proteins in several genome sequences of model organisms. Genome-scale computational identification of moonlighting proteins can provide significant insight into the roles played by these multifunctional proteins on the cellular landscape, thereby aiding the development of better function prediction algorithms.


OP13:
The Echinoderm Tree Of Life Project

Subject: Evolution & Comparative Genomics

Presenting Author: Daniel Janies, Ohio State University

Abstract:
A multidisciplinary team of biologists has recently assembled to study echinoderm phylogeny. The team has an award from the National Science Foundation program Assembling the Tree of Life. The five living classes of echinoderms are Asteroidea, Crinoidea, Echinoidea, Holothuroidea, and Ophiuroidea. However, these classes represent only a shadow of the full morphological disparity and diversity of Lower Paleozoic echinoderms, which include as many as 21 classes. Phylogenetic analysis of living echinoderms remains challenging. Moreover, a complete echinoderm evolutionary tree will have to incorporate all echinoderm lineages and key outgroups to link echinoderms into the broader tree of life. The project is organized into the following working groups: genomics, morphology, informatics, and outreach. Our goals include genomic sampling of numerous exemplars among the five living echinoderm classes and integration of genomic and morphologic data of living echinoderms. I will focus on the progress of the molecular and informatics groups.


OP14:
Phylogenetic incongruence in E. coli O104:H4: Understanding the evolutionary relationships of emerging pathogens in the face of homologous recombination

Subject: Evolution & Comparative Genomics

Presenting Author: Weilong Hao, Wayne State University

Co-Author(s):
Vanessa Allen, Public Health Ontario
Heather Kent, Public Health Agency of Canada
Natalie Knox, Public Health Agency of Canada
Philip Mabon, Public Health Agency of Canada
Frances Jamieson, Public Health Ontario
Donald Low, Public Health Ontario
David Alexander, Public Health Ontario

Abstract:
Escherichia coli O104:H4 is an emerging pathogen which, during the spring and summer of 2011, was responsible for a widespread outbreak that killed at least 46 people and sickened more than 3,900. Traditional phenotypic and genotypic assays permit identification and classification of bacterial pathogens, but cannot accurately resolve relationships among genotypically similar but pathotypically different isolates. To understand the evolutionary origins of E. coli O104:H4, we sequenced two strains isolated in Ontario, Canada. One was epidemiologically linked to the 2011 outbreak, and the second, an unrelated isolate, was obtained in 2010. Multilocus sequence typing (MLST) analysis indicated that both isolates are of the same sequence type (ST678), but genome sequencing revealed differences in chromosomal and plasmid content. Through comprehensive and careful phylogenetic analysis of the O104:H4 ST678 genomes, we identified 167 genes in three gene clusters that have undergone homologous recombination with distantly related E. coli strains. These recombination events have resulted in unexpectedly high sequence diversity within the same sequence type. Failure to recognize or adjust for homologous recombination can result in phylogenetic incongruence. Understanding the extent of homologous recombination among different strains of the same sequence type may explain the pathotypic differences between the ON2010 and ON2011 strains and help shed new light on the emergence of this new pathogen.


OP15:
PhyloPTE/Peacefield – Phylogenetic Reconstruction used to improve the power of GWAS

Subject: Evolution & Comparative Genomics

Presenting Author: Samuel Handelman, Ohio State University

Co-Author(s):
Joseph Verducci, The Ohio State University
Wolfgang Sadee, Ohio State University Medical School
Daniel Janies, Ohio State University Medical School
Jacob Aaronson, Ohio State University Medical School
Rebecca Jackson, Ohio State University Medical School

Abstract:
When performing Genome Wide Association Studies (GWAS), population structure can be a significant confounder. For a GWAS on any given population, the first-order effects of the population structure can be corrected by segregating the population by ethnicity. However, this does not entirely correct for the biases that the population structure will introduce. These biases can be expected to be particularly severe when patient outcomes are influenced by hereditary non-genetic effects, such as geography or lifestyle, and when developing multiple small-effect models on GWAS data.

We present our method, PhyloPTE/Peacefield (Phylogeny with Path To Event on People), to perform GWAS while adjusting for all structure observed in a population. We compare our method to several alternative approaches in terms of the biological consistency of the SNPs identified, using the dmGWAS tool developed by the Zhao group at Vanderbilt. We will also present an evaluation of this figure of merit as a means of assessing the utility of different approaches for identifying relevant SNPs. We have applied this measure to evaluate the performance of PhyloPTE/Peacefield, with promising results.

This work was supported in part by: U01GM092655.


OP16:
Metabolic Network Analysis of Apicomplexan Parasites to Identify Novel Drug Targets

Subject: Evolution & Comparative Genomics

Presenting Author: Stacy Hung, University of Toronto

Co-Author(s):
James Wasmuth, Hospital for Sick Children
Michael Grigg, National Institutes of Health
John Parkinson, University of Toronto

Abstract:
We are interested in studying the metabolic networks of apicomplexan parasites, which include Plasmodium falciparum, the causative agent of the most severe forms of malaria, and Toxoplasma gondii, which is responsible for food-borne illnesses that are health threats in HIV+/AIDS and immunocompromised populations. By applying systems-based methods in the context of biochemical pathways, we can better understand the metabolic potential of apicomplexans, enabling the identification of viable enzyme drug targets. We have accurately reconstructed the networks for 14 apicomplexans, and comparative analyses have confirmed the presence of a highly conserved ‘core’ of enzymes along with those that are lineage-specific, suggesting these parasites have evolved different strategies for performing similar metabolic activities. Furthermore, candidate enzymes of therapeutic interest have been identified in the pantothenate biosynthesis pathway, which we are characterizing through gene knockout studies in Toxoplasma. To examine the in vivo landscapes of parasite metabolism, we have obtained high-quality RNA-Seq datasets, providing deep coverage of blood stages for P. falciparum and metabolically active stages of Toxoplasma and the closely related Neospora. By overlaying expression data onto the network, we can apply comparative transcriptomics to highlight conserved expression patterns and differentially expressed pathways that might explain the ability of these parasites to survive in such a wide range of hosts. These findings provide insight into the metabolic adaptations of apicomplexans. With an improved metabolic reconstruction for apicomplexans, we believe more meaningful system-based studies can be performed that generate real, testable hypotheses to help focus future drug-discovery programs.


OP17:
Tracking protein length changes reveals pseudogenization in action

Subject: Population Genomics

Presenting Author: G. Golding, McMaster University

Abstract:
A method for identifying and mapping fusion and fission events onto a phylogeny was developed and applied to Bacillaceae genomes. In contrast to previous studies across longer evolutionary time scales, we found that gene fission is more common than fusion in bacterial genomes. Fusion and fission events are generally rare across the Bacillaceae and are genome-specific, but we unexpectedly uncovered a large number of fissions specific to the genetically monomorphic Bacillus anthracis lineage. Our results suggest that the B. anthracis lineage may be under an accelerated rate of gene fragmentation, which is a common evolutionary trend found in lineages that have recently become host-restricted. We hypothesize that other bacteria evolving under similar conditions would also exhibit detectable fission patterns. Salmonella enterica and Yersinia pestis genomes are used to confirm this hypothesis.


OP18:
Gene Expression Games: A Case Study of the Integration between Game Programming and Bioinformatics Education

Subject: other

Presenting Author: Yunkai Liu, Gannon University

Co-Author(s):
Mary Vagula, Gannon University
Weifeng Xu, Gannon University
Tao Ding, Gannon University

Abstract:
Bioinformatics currently plays an important role across the biosciences. Despite its increasing importance, bioinformatics education at the undergraduate level is still undervalued. The major challenge is that few successful strategies exist to integrate bioinformatics into the life sciences and to present bioinformatics as an interesting problem domain for computer scientists.
This paper presents a complete case study of a hybrid approach that systematically combines a game application-driven approach with genetics teaching on gene expression in the undergraduate bioinformatics curriculum. The case study consists of 1) proposing a new curriculum design process, 2) identifying a set of principles and practices on gene expression and software engineering, 3) proposing a semester-long game project following the design process of software engineering, 4) integrating the principles and practices of gene expression into the game development process, and 5) delivering the principles and practices of gene expression to students during game development. The results of the case study, including analysis of the related project documentation and students’ feedback, indicate that adopting the game application-driven approach motivates students to learn in teams, helps transfer bioinformatics concepts effectively between instructors and students, and facilitates achieving the student learning objectives.


OP19:
Analyses of Cancer Driver Gene Signaling Pathway Networks Using Within-Species Network Alignments

Subject: Protein Interactions & Molecular Networks

Presenting Author: Gurkan Bebek, Case Western Reserve University

Co-Author(s):
George Linderman, Case Western Reserve University
Mehmet Koyuturk, Case Western Reserve University
Mark Chance, Case Western Reserve University

Abstract:
Recent advances in “-omics” techniques have led to the discovery of cancer driver genes (CAN-genes) that partake in carcinogenesis when mutated. However, the majority of cancer patients do not actually exhibit mutations in all of these genes. Based on this observation, we hypothesize that it is the driver genes' synergistic activity that drives carcinogenesis, as opposed to a mutation in a single CAN-gene. We also hypothesize that different combinations of mutations would lead to similar carcinogenic phenotypes. Hence, we have applied a network approach to investigate CAN-genes, in order to identify the relationships between these genes and reveal common patterns of functional relationships that underlie similar phenotypic outcomes.
We present a network-alignment based approach to understand the relationships between CAN-genes. Using colorectal cancer as a specific application, we utilize a novel within-species network alignment algorithm. Next, we perform hierarchical clustering of the alignment results to group CAN-genes by the similarity of their associated networks. Finally, we integrate these clusters with independently observed somatic mutations across 94 patients, and find that mutations of CAN-genes in highly similar subnetworks are generally mutually exclusive (Mantel test; p=0.4).
This result shows that throughout the carcinogenic process, where multiple mutations are observed, mutations in only one of the synergistically similar CAN-genes may be sufficient to advance tumor growth in colorectal cancer. This validates our framework as an effective network-based approach to understanding the relationships between CAN-genes. Using this framework, we will improve our understanding of the tumorigenesis timeline and further improve diagnostic approaches.


OP20:
Identifying class-specific protein subnetworks for multi-class phenotypes

Subject: Protein Interactions & Molecular Networks

Presenting Author: Mehmet Erten, Case Western Reserve University

Co-Author(s):
Salim A. Chowdhury, Carnegie Mellon University
Xiaowei Guan, Case Western Reserve University
Rod K. Nibbe, Case Western Reserve University
Jill S. Barnholtz-Sloan, Case Western Reserve University
Mark Chance, Case Western Reserve University
Mehmet Koyuturk, Case Western Reserve University

Abstract:
In recent years, many algorithms have been developed for network-based analysis of differential gene expression in complex diseases. These algorithms use protein-protein interaction (PPI) networks as an integrative framework and identify subnetworks that are coordinately dysregulated in the phenotype of interest.

While such dysregulated subnetworks have demonstrated significant improvement over individual gene markers for classifying phenotype, the current state of the art in dysregulated subnetwork discovery is almost exclusively limited to binary phenotype classes. However, many clinical applications require identification of molecular markers for multiple classes (e.g., Dukes' four-stage classification of colorectal carcinoma).

We consider the problem of discovering groups of genes whose expression signatures can discriminate multiple phenotype classes. We consider two alternative formulations of this problem: (i) an all-vs-all approach that aims to discover subnetworks distinguishing all classes, and (ii) a one-vs-all approach that aims to discover subnetworks distinguishing each class from the rest of the classes. For the one-vs-all formulation, we develop a set-cover based algorithm, which aims to identify groups of genes such that at least one gene in the group exhibits differential expression in the target class.

We test the proposed algorithms in the context of predicting stages of colorectal cancer. Our results show that the set-cover based algorithm identifying stage-specific subnetworks outperforms the all-vs-all approaches in classification. We also investigate the merits of utilizing PPI networks in the search for multiple markers, and show that, with correct parameter settings, network-guided search improves performance. Furthermore, we show that assessing statistical significance when selecting features greatly improves classification performance.
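
The set-cover idea behind the one-vs-all formulation can be illustrated with a generic greedy heuristic. This is only a minimal sketch, not the authors' algorithm: it assumes each gene "covers" the target-class samples in which it is differentially expressed, and the names `greedy_set_cover`, `universe`, and `covers` are hypothetical.

```python
def greedy_set_cover(universe, covers):
    """Greedily pick genes until every sample in `universe` is covered.

    universe: set of sample ids in the target class (assumed input).
    covers:   dict mapping gene -> set of samples the gene covers (assumed input).
    Returns (chosen genes in selection order, samples left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered and covers:
        # Gene covering the most still-uncovered samples; ties go to the
        # alphabetically first gene because max() keeps the first maximum.
        best = max(sorted(covers), key=lambda g: len(covers[g] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # no remaining gene covers any uncovered sample
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy example: five samples, four candidate genes.
samples = {1, 2, 3, 4, 5}
gene_cover = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1}}
print(greedy_set_cover(samples, gene_cover))  # (['A', 'C'], set())
```

The greedy rule is a standard approximation for set cover; a network-guided variant would additionally restrict candidate genes to neighbors in the PPI network, which is not shown here.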


OP21:
Identify Condition Specific Gene Co-expression Networks

Subject: Protein Interactions & Molecular Networks

Presenting Author: Kun Huang, The Ohio State University

Co-Author(s):
Vikram Kalluru, The Ohio State University
Raghu Machiraju, The Ohio State University

Abstract:
Gene co-expression network analysis is widely adopted in biomedical research with many applications. Recently, condition-specific co-expression networks have been identified as potential disease biomarkers. We present a method for identifying condition-specific co-expression networks. Our method includes four major steps: 1. For a gene expression dataset with multiple samples, compute the Pearson correlation coefficient (PCC) between every pair of genes in the specific condition and then apply our recently developed weighted graph quasi-clique mining algorithm, eQCM, to identify tightly co-expressed gene networks. 2. Across multiple conditions, compute the Expected Conditional F-statistic (ECF-statistic) between every gene pair, which is a metric evaluating the change of PCC between different conditions. 3. Identify high-ECF gene pairs as those with ECF-statistics among the top 5% of all gene pairs. 4. Apply chi-square tests to every previously identified co-expression network to determine the ones that are enriched with high-ECF gene pairs; the Bonferroni method is applied to correct for multiple tests. We applied this method to a breast cancer gene expression dataset (NCBI-GEO GDS2250) including control, non-basal-like, and basal-like human breast cancer samples. For the non-basal-like subtype, 171 networks were identified using the eQCM algorithm, of which 88 show enrichment of high-ECF gene pairs (p < 0.05/171), suggesting extensive disruption of transcription programs in the other (i.e., the basal-like) type of breast cancer. For the basal-like subtype, 23 out of 99 total networks were identified. These are the subtype-specific networks for further analysis, including gene ontology/pathway enrichment analysis and correlation with survival times.
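
Three of the building blocks named above (pairwise PCC, selecting the top 5% of gene pairs by a score, and the Bonferroni threshold 0.05/171) can be sketched in a few lines. This is an illustrative sketch only: the ECF-statistic and eQCM are not defined in the abstract, so `top_fraction` takes any per-pair score, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_pcc(expr):
    """expr: dict gene -> expression values; returns {(gene_a, gene_b): PCC}."""
    return {(g, h): pearson(expr[g], expr[h])
            for g, h in combinations(sorted(expr), 2)}


def top_fraction(scores, fraction=0.05):
    """Keys whose score ranks in the top `fraction` (at least one key)."""
    k = max(1, int(len(scores) * fraction))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


expr = {"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]}
print(pairwise_pcc(expr))
print(bonferroni_alpha(0.05, 171))  # threshold for 171 networks
```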


OP22:
Improved Search Strategies for Fitting Rate Parameters to Viral Assembly Models

Subject: Protein Interactions & Molecular Networks

Presenting Author: Lu Xie, Carnegie Mellon University

Co-Author(s):
Gregory Smith, Carnegie Mellon University
Xian Feng, Carnegie Mellon University
Russell Schwartz, Carnegie Mellon University

Abstract:
Viral capsid assembly has been studied by researchers from various disciplines of biology, computer science, physics and mathematics due to its value as a model of complex self-assembly in general as well as its medical importance. Theoretical and simulation studies have played a key role in this work. While such efforts have been valuable in analyzing spaces of theoretically possible pathways or assembly methods, they have so far been limited in their ability to make predictions about specific viruses because of the difficulty of determining the detailed interaction parameters needed to instantiate simulations or theoretical models. In prior work, we addressed this question by using data fitting methods to learn reaction rate parameters for individual protein-protein interactions from the results of light scattering experiments that track bulk assembly progress of in vitro capsid models. In the present work, we have sought to improve on these prior approaches through a variety of strategies for reducing the degrees of freedom of the parameter space in order to allow more precise fits. Specific search strategies include novel approaches for grouping parameters and simultaneous fitting of data from multiple experimental conditions to a single physical model. We demonstrate our methods on three capsid systems: human papillomavirus (HPV), hepatitis B virus (HBV), and cowpea chlorotic mottle virus (CCMV), with the resulting fits suggesting a diversity of assembly mechanisms.


OP23:
Towards Improved Sequence-based Prediction of Protein-Protein Interaction Sites

Subject: Protein Interactions & Molecular Networks

Presenting Author: Alexey Porollo, University of Cincinnati

Abstract:
Next generation sequencing provides deep insights into individual genomes and transcriptomes. At the same time, it calls for better annotation tools for protein sequences that would allow researchers to collate genomic variants with functional alterations. Mapping of protein-protein interaction sites on a protein sequence combined with sequenced mutations will greatly facilitate development of ‘molecular diagnostics’ towards personalized medicine. In the search for better discriminating characteristics, this study presents a set of novel features that may be used in improved sequence-based methods for prediction of protein-protein interaction sites. A new reduced alphabet of amino acids is proposed based on chemical similarity of amino acid side chains. Shannon entropy computed for this alphabet identifies a subset of residues that can be discriminated using predicted relative solvent accessibility (RSA) combined with its confidence factors. In particular, among more conserved residues with predicted low RSA values, interaction sites have a lower prediction confidence for solvent accessibility. Predicted RSA and corresponding confidence factors were obtained using SABLE. The SPPIDER training and control sets were used for training and validation. The prediction model is based on a single neural network (NN) with 54 input nodes (6 features over a sliding window of 9 residues). 10-fold cross-validation on the training set and the best performing NN from cross-validation applied to the control set yielded Matthews correlation coefficients of 0.22 and 0.19, respectively. The results show improvement over other published sequence-based methods and serve as a basis for prediction models using novel features of protein interaction sites.
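
Two quantities mentioned above, Shannon entropy over a reduced amino acid alphabet and the Matthews correlation coefficient, can be sketched as follows. The grouping in `REDUCED` is a made-up illustration based loosely on side-chain chemistry; the abstract does not list the proposed alphabet, so it should not be read as the author's.

```python
from collections import Counter
from math import log2, sqrt

# Hypothetical reduced alphabet (NOT the one proposed in the abstract).
REDUCED = {
    "aliphatic": "AVLIMC", "aromatic": "FWYH", "polar": "STNQ",
    "positive": "KR", "negative": "DE", "special": "GP",
}
AA_TO_GROUP = {aa: group for group, aas in REDUCED.items() for aa in aas}


def column_entropy(column):
    """Shannon entropy in bits of one alignment column over the reduced alphabet.

    Characters outside the 20 standard amino acids (e.g. gaps) are ignored.
    """
    groups = [AA_TO_GROUP[aa] for aa in column if aa in AA_TO_GROUP]
    if not groups:
        return 0.0
    n = len(groups)
    return -sum((c / n) * log2(c / n) for c in Counter(groups).values())


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(column_entropy("AVLI"))  # one group only, so the entropy is zero
print(column_entropy("AVKR"))  # two equally frequent groups, so one bit
```

A column that varies only within one chemical group scores as fully conserved under the reduced alphabet, which is the point of reducing it before computing entropy.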


OP24:
Exploiting metatranscriptomics for functional interrogation of microbiomes

Subject: Metagenomics

Presenting Author: John Parkinson, Hospital for Sick Children

Co-Author(s):
Xuejian Xiong, Hospital for Sick Children
Daniel Frank, University of Colorado
Charles Robertson, University of Colorado
Stacy Hung, Hospital for Sick Children
Janet Markel, Hospital for Sick Children
Angelo Canty, McMaster University
Philippe Poussier, University of Toronto
Jayne Danska, Hospital for Sick Children

Abstract:
The emerging science of metagenomics is transforming our understanding of the relationships of microbes with their environments. Moving beyond cataloguing the organisms and genes present, metatranscriptomics offers the exciting prospect of providing a more mechanistic understanding of these relationships. Exploiting metatranscriptomic data from microbiomes of increasing complexity, generated using the Illumina Solexa platform, we are developing novel software pipelines to process and interpret these datasets from a systems perspective. Two of the more significant challenges to address in these types of analyses concern the vast genetic diversity inherent in environmental samples and an associated lack of reference genomes. In initial studies of a limited bacterial community derived from mouse cecal flushes, we demonstrated that even in the absence of a complete set of reference genomes, 76 bp reads may be usefully employed for metatranscriptomic analyses. Key to these analyses is adopting a gene family framework whereby transcripts may be grouped on the basis of common functions. Protein-protein interaction and other systems datasets then provide scaffolds on which these data may be integrated and subsequently interpreted. Here I will discuss our current efforts in applying our analytical framework to microbiomes of increasing complexity. Current efforts are focused on: 1) the use of paired end read sequence data; 2) optimizing existing short read assembly algorithms to improve annotation of metatranscriptomic datasets; 3) the use of more sensitive sequence similarity search algorithms to identify potential homologs; and 4) the creation of a robust statistical framework to compare across samples.


OP25:
A novel method to culture microbial communities reveals ammonium induced alterations in a marine environment and highlights the need to consider analysis-dependent bias in metagenomic studies.

Subject: Metagenomics

Presenting Author: Kevin Keegan, Argonne National Lab

Co-Author(s):
Jack Gilbert, Argonne National Lab
Folker Meyer, Argonne National Lab

Abstract:
A microbial community sampled from the English Channel (E1 sampling station) was cultured with a novel technique, micro-drop encapsulation. Sampled microbial communities were incubated successfully for a period of three months in the laboratory. Metagenomic DNA from multiple biological and technical replicate samples was extracted and underwent 16S ribosomal subunit based amplicon sequencing (454 technology). Sequence data were processed via MG-RAST to produce taxonomic abundance profiles that underwent a series of comparative analyses. In contrast to a recent report (Zhou 2011), we find 16S NGS amplicon data to exhibit highly reproducible results; however, the degree to which reproducibility is observed depends largely on key factors related to the analysis of the data.


OP26:
Effect of Sequencing Depth in RNA-sequencing from Different Platforms

Subject: Sequence Analysis

Presenting Author: Kun Huang, The Ohio State University

Co-Author(s):
Amy Webb, Ohio State University
Jeffrey Parvin, Ohio State University
Gulcin Ozer, Ohio State University

Abstract:
Next Generation Sequencing technologies have enabled researchers to generate genome-wide data of high quality and quantity. Their transcriptomics application, RNA-sequencing (RNA-seq), is widely and increasingly used to quantify gene expression with high resolution. As the technology improves, a single experiment can yield higher throughput and thus more coverage. However, the effect of sequencing depth on transcript detection in different technologies is not well studied. In this research, we evaluated 11 RNA-seq experiments on prefrontal cortex. Eight of these experiments were generated by ABI SOLiD and three of them were repeated using Illumina GA II. First, we evaluated the effect of sequencing depth on detection of different transcript biotypes. We observed that at around 30 million reads both Illumina and SOLiD data reach saturation in detection of new protein-coding transcripts. On the other hand, detection of the rest of the transcript biotypes (e.g. pseudogenes, miRNA, lincRNA, snoRNA, etc.) continues to increase with the sequencing depth. SOLiD technology was able to detect slightly more transcripts than Illumina in both cases. Next, we evaluated the relationship between sequencing depth and the length of the detected transcripts. We observed that the median length of the detected transcripts drops dramatically as the sequencing depth increases for all transcript biotypes. At around 30 million reads, the median length of the detected transcripts saturates. In this case, both Illumina and SOLiD data performed similarly. This study reports a simultaneous evaluation of Illumina and SOLiD sequencing platforms for the effect of sequencing depth on detection of different transcript biotypes and their length.
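
A saturation analysis of this kind is commonly done by subsampling mapped reads at increasing depths and counting the transcripts detected at each depth. The sketch below illustrates that idea only; it assumes a transcript counts as "detected" once a single read maps to it, and the function name and input layout are hypothetical rather than taken from the study.

```python
import random


def saturation_curve(read_to_transcript, depths, seed=0):
    """Count distinct transcripts detected at each subsampling depth.

    read_to_transcript: one transcript id per mapped read (assumed input).
    depths: increasing numbers of reads to subsample.
    A single shuffled order is reused, so each depth extends the previous
    subsample and the curve cannot decrease.
    """
    rng = random.Random(seed)
    shuffled = list(read_to_transcript)
    rng.shuffle(shuffled)
    return [(d, len(set(shuffled[:d]))) for d in depths]


# Toy data: two abundant transcripts and one rare one.
reads = ["t1"] * 50 + ["t2"] * 30 + ["t3"] * 2
for depth, detected in saturation_curve(reads, [10, 40, 82]):
    print(depth, detected)
```

The curve flattens once additional reads mostly hit transcripts that were already seen; rare biotypes keep the curve rising for longer.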


OP27:
Efficient Detection and Correction of Sequencing Errors Using K-Bounded Suffix Trees

Subject: Sequence Analysis

Presenting Author: Daniel Savel, Case Western Reserve University

Co-Author(s):
Mehmet Koyutürk, Case Western Reserve University
Thomas LaFramboise, Case Western Reserve University
Wojciech Szpankowski, Purdue University
Ananth Grama, Purdue University

Abstract:
Next generation sequencing technologies produce large quantities of short reads with an error rate that adversely affects effective use of these reads. One of the primary uses of sequencing data is de novo genome assembly, which is complicated and can be obfuscated by sequencing errors. Unfortunately, the first time the assembly process is done, no reliable reference sequence is available to compare against; therefore error detection and correction can only use the set of available reads as a reference. State-of-the-art methods for error detection and correction utilize the frequencies of the substrings of the reads, based on the principle that low-frequency substrings may point to sequencing errors. One of the main limiting factors of the error correction methods is the amount of memory required to perform the detection and correction procedures as the set of reads from the sequencer is typically very large and thus the set of substrings of all those reads is also very large. Existing methods typically store comprehensive sets of the correction unit using an efficient data structure, either k-mers (using hash tables) or suffixes (using suffix trees). However, in these data structures, each error manifests itself multiple times, causing redundancy. Here, we propose a method for leveraging the level of memory reduction against accuracy and relating it to remaining error manifestations. Our experimental results show that better performance and accuracy in error correction can be achieved by reducing the amount of data stored in the data structure.
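
The frequency principle and the redundancy it causes can be seen in a small k-mer example. This is a generic hash-table sketch, not the k-bounded suffix tree method of the abstract; the function names and the `min_count` threshold are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in `read` seen fewer than `min_count` times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


# Five error-free copies and one read with a single substitution at index 4.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = kmer_counts(reads, 4)
print(suspicious_positions("ACGTTCGT", counts, 4))  # [1, 2, 3, 4]
```

The single wrong base shows up in four different low-frequency 4-mers, which is the redundancy the abstract refers to: one error occupies up to k entries in the data structure.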


OP28:
A Multispecies Polyadenylation Site Model

Subject: Sequence Analysis

Presenting Author: Eric Ho, Rutgers University-New Brunswick

Co-Author(s):
Samuel Gunderson, Rutgers University-New Brunswick
Siobain Duffy, Rutgers University-New Brunswick

Abstract:
Polyadenylation occurs in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5’-capping. We are interested in the evolution of polyadenylation sites (PAS) in diverse species and DNA viruses. Even though most mammalian PAS contain a highly conserved hexanucleotide in the upstream region, namely the canonical poly(A) signal, and a far less conserved U/GU-rich sequence in the downstream region, there are many exceptions. Furthermore, PAS in other species, such as plants and invertebrates, deviate strongly from this genomic structure, making the construction of a general PAS recognition model challenging. We surveyed nine PAS prediction methods published between 1999 and 2011. All of them exploit the skewed nucleotide profile across the PAS and the highly conserved poly(A) signal as the primary features for recognition. The number of features utilized by these methods is usually large (from 15 to 274), which contributes to the curse of dimensionality. Here we propose a PAS model that employs minimal features to capture the essence of PAS and yet produces better prediction accuracy across diverse species. Our model utilizes three di- or trinucleotide profiles and the predicted nucleosome occupancy in the region 300 nucleotides upstream and downstream of the PAS. We validated our model using two machine learning methods, viz. logistic regression and linear discriminant analysis. Results showed that the model achieves 85-92% sensitivity and 85-93% specificity in human, chicken, C. elegans, and Arabidopsis thaliana. Applying our PAS model across species can shed light on the evolution of PAS.
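
A positional di- or trinucleotide profile of the kind mentioned above can be computed from sequences aligned on the PAS. The abstract does not say which three profiles the model uses, so the sketch below is a generic illustration with a hypothetical function name and toy sequences.

```python
def kmer_profile(seqs, kmer):
    """Fraction of sequences that have `kmer` starting at each position.

    seqs: sequences aligned on the same anchor (e.g. the cleavage site);
    positions beyond the shortest sequence are ignored.
    """
    length = min(len(s) for s in seqs)
    k = len(kmer)
    return [sum(s[i:i + k] == kmer for s in seqs) / len(seqs)
            for i in range(length - k + 1)]


# Toy aligned windows; real windows would span hundreds of nucleotides.
windows = ["AATAAA", "AATAAA", "GGTAAA", "AACCCC"]
print(kmer_profile(windows, "AA"))  # [0.75, 0.0, 0.0, 0.75, 0.75]
```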


OP29:
A Probabilistic Approach to Identify Tandem Duplications using Paired-end Reads

Subject: Sequence Analysis

Presenting Author: Gokhan Yavas, Case Western Reserve University

Co-Author(s):
Mehmet Koyuturk, Case Western Reserve University
Thomas LaFramboise, Case Western Reserve University

Abstract:
Next generation sequencing (NGS) technology has enabled the computational biology community to develop novel algorithms to analyze the DNA of a donor genome and identify various structural variants beyond single nucleotide polymorphisms (SNPs). Among single-end and paired-end sequencing protocols, the latter has proved to provide invaluable information for variant characterization. In this manuscript, we propose a method that utilizes the data obtained from paired-end sequencing of a donor genome to detect tandem duplications. To characterize the structural differences between a donor and the reference genome, we first align the sequenced reads to the human reference genome. In the case of structural variants, the reads obtained from the donor are mapped to the reference genome with unexpected discrepancies. In the case of a tandem duplication, the discordance is observed as an aberrant orientation of the ends of the read. We use these discordantly mapped reads along with the fragment size distribution to generate a set of possible start and end base position pairs for the duplication using a geometric approach. The next step is to compute a probability value for each of these possible coordinate pairs. We also determine the set of coordinates for the tandem duplication that will include the real breakpoint with a user-defined confidence level. We evaluated our algorithm on simulated data as well as real data from a cancer cell line, demonstrating that our method performs well in terms of precision and recall. Moreover, we have verified the existence of several tandem duplications identified by our method using wet-lab techniques.
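
The orientation signal described above can be illustrated with a simple read-pair classifier. This is a simplified sketch, not the probabilistic method of the abstract: it assumes a standard forward/reverse (inward-facing) paired-end library, uses leftmost mapping coordinates only, and the category names are made up for illustration.

```python
def classify_pair(pos1, strand1, pos2, strand2, min_insert, max_insert):
    """Classify a read pair mapped to the same chromosome.

    pos1, pos2: leftmost mapping coordinates of the two ends.
    strand1, strand2: "+" or "-".
    min_insert, max_insert: accepted range of the distance between ends.
    """
    if strand1 == strand2:
        return "same_strand"      # orientation typical of inversions
    fwd = pos1 if strand1 == "+" else pos2
    rev = pos2 if strand1 == "+" else pos1
    if rev < fwd:
        return "everted"          # ends face outward: tandem duplication signal
    span = rev - fwd
    if span > max_insert:
        return "too_far"          # deletion-like
    if span < min_insert:
        return "too_close"        # insertion-like
    return "concordant"


print(classify_pair(100, "+", 400, "-", 200, 500))  # concordant
print(classify_pair(400, "+", 100, "-", 200, 500))  # everted
```

Everted pairs arise when a fragment spans the junction between two copies of the duplicated segment, so their coordinates bound the candidate breakpoints.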


OP30:
Accurate non-coding RNA structure prediction with pseudoknot using an extended chain algorithm

Subject: Sequence Analysis

Presenting Author: Prapaporn Techa-angkoon, Michigan State University

Co-Author(s):
Jikai Lei, Michigan State University
Yanni Sun, Michigan State University

Abstract:
Non-coding RNAs (ncRNAs) function directly as RNA molecules without being translated into proteins. Recent research shows that ncRNAs are numerous and play important roles in many molecular biological activities. The pseudoknot is an essential feature in many ncRNA families. Predicting secondary structures of ncRNAs with pseudoknots has become an important goal for bioinformatics. Existing pseudoknot structure prediction tools use either a single ncRNA sequence or a multiple sequence alignment as input. Single-sequence prediction tools usually overpredict base pairs that do not exist in the true structure. For tools based on multiple sequence alignment, the quality of the predicted structure is unsatisfactory if the sequence similarity is low and the alignment quality is unreliable. So there is a great need for tools that can accurately predict secondary structures while controlling the false positive rate.

Here we present a new ncRNA structure prediction method that can handle simple H-type pseudoknots. The method takes two homologous ncRNA sequences as input and predicts their secondary structures. It uses an extended two-dimensional chain algorithm and incorporates free energy, sequence similarity, and structure similarity information. We tested the method using a dataset we constructed from Rfam10 seed sequences and compared the performance of our tool with state-of-the-art ncRNA pseudoknot structure prediction tools such as DotKnot. The experimental results show that our tool achieves better sensitivity and reduces false positive base pairs.


OP31:
Accurate Estimation of Short Read Mapping Quality for Next Generation Genome Sequencing

Subject: Sequence Analysis

Presenting Author: Matt Ruffalo, Case Western Reserve University

Co-Author(s):
Mehmet Koyuturk, Case Western Reserve University
Soumya Ray, Case Western Reserve University
Thomas LaFramboise, Case Western Reserve University

Abstract:
Motivation:
Several software tools specialize in the alignment of short next generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment -- in principle this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy and the qualities of many mappings are underestimated, encouraging the researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single-nucleotide and structural), and such mappings are important in accurately identifying genomic variants.

Results:
We develop a machine learning tool, LoQuM, to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM utilizes statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches, and deletions, mapping quality score returned by the alignment tool, if available, number of mappings) as features for classification, and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality. We perform comprehensive computational experiments on the human genome to identify the most informative features and find that rate of degradation in base quality and number of mappings are most informative of actual mapping quality. Our results also show that LoQuM can "resurrect" many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers.
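
How a logistic model's output could be turned into a mapping quality score can be sketched as follows. This is not LoQuM's code: the feature names, weights, and bias are invented for illustration, and the Phred scaling Q = -10 log10(P(error)) is the usual convention for mapping qualities rather than something stated in the abstract.

```python
from math import exp, log10


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def correct_probability(features, weights, bias):
    """P(mapping is correct) under a logistic model with the given weights."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return logistic(z)


def phred(p_correct, cap=60):
    """Phred-scaled quality, -10 * log10(P(error)), rounded and capped."""
    p_err = 1.0 - p_correct
    if p_err <= 0.0:
        return cap
    return min(cap, round(-10.0 * log10(p_err)))


# Hypothetical weights; a real model would learn them from simulated reads.
weights = {"mismatches": -0.8, "num_mappings": -1.5, "mean_base_quality": 0.2}
read = {"mismatches": 1, "num_mappings": 1, "mean_base_quality": 30}
p = correct_probability(read, weights, bias=0.0)
print(p, phred(p))
```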


OP32:
Active Feature Acquisition for Protein-Protein Interaction Prediction

Subject: Algorithm Development and Machine Learning

Presenting Author: Madhavi Ganapathiraju, University of Pittsburgh

Co-Author(s):
Mohamed Thahir, University of Pittsburgh
Tarun Sharma, Carnegie Mellon University

Abstract:
Machine learning approaches to predicting protein-protein interactions (PPIs) use biological features of proteins to classify whether a protein pair is interacting or not. However, such features are not known for most proteins. Carrying out wet-lab experiments to determine all such unavailable features (‘missing features’) is infeasible, as each experiment requires human expertise, time, high-end equipment and other resources. We propose an active feature acquisition (AFA) strategy to guide which of these missing features should be obtained experimentally so as to improve classifier performance. The AFA strategy has not previously been used in the domain of PPI prediction. The only approach previously developed for other domains considers every possible combination of instance, feature and feature value and computes which combination gives the best accuracy for that batch; this approach is not scalable for the domain of PPI prediction. We present a heuristic method that does not require retraining to calculate the utility of acquiring a missing feature. It takes into account the change in belief of the classification model induced by the acquisition of the feature under consideration. Our method achieves the highest possible F-score with as few as 40% of the missing features acquired, compared to random selection of features for acquisition, and is computationally very efficient compared to previous AFA strategies. By analyzing the features acquired by the algorithm, we find that the biological process feature is more relevant than the molecular function feature, which in turn is more useful than the subcellular localization feature for predicting protein-protein interactions.
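
One way to read "change in belief without retraining" is to score a missing feature by how much the current model's prediction is expected to move once the feature is filled in. The sketch below is only that interpretation, not the authors' heuristic; `toy_predict`, its weights, and the value probabilities are all invented for illustration.

```python
def expected_belief_change(predict, instance, feature, value_probs):
    """Expected absolute change in predicted probability if `feature` is acquired.

    predict: callable taking a dict of known features, returning P(interacting).
    value_probs: dict mapping each possible value -> its assumed probability.
    The model is only queried, never retrained.
    """
    base = predict(instance)
    total = 0.0
    for value, prob in value_probs.items():
        filled = dict(instance)
        filled[feature] = value
        total += prob * abs(predict(filled) - base)
    return total


def rank_missing_features(predict, instance, candidates):
    """Order candidate features by decreasing expected belief change."""
    return sorted(candidates, key=lambda f: -expected_belief_change(
        predict, instance, f, candidates[f]))


# Toy stand-in for a trained classifier; missing features contribute nothing.
TOY_WEIGHTS = {"process": 0.2, "function": 0.15, "location": 0.1}


def toy_predict(instance):
    return 0.5 + sum(w * (2 * instance[f] - 1)
                     for f, w in TOY_WEIGHTS.items() if f in instance)


coin = {0: 0.5, 1: 0.5}
print(rank_missing_features(toy_predict, {}, {f: coin for f in TOY_WEIGHTS}))
```

Because each candidate costs only a handful of model evaluations, this kind of scoring avoids enumerating every instance, feature, and value combination with retraining.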


OP33:
A Regression-based Approach for Predicting Imputation Quality

Subject: Algorithm Development and Machine Learning

Presenting Author: Chun-Nan Hsu, University of Southern California

Co-Author(s):
Yi-Hung Huang, Academia Sinica
John Rice, Academia Sinica
Scott Saccone, Academia Sinica
José Luis Ambite, Academia Sinica
Yigal Arens, Academia Sinica
Jay Tischfield, Academia Sinica

Abstract:
Traditionally, the data from a genome wide association study are often collected at different times or on different platforms. As a result, various imputation algorithms (IMPUTE, BEAGLE, and MACH) have been proposed to predict the individual genotypes at un-typed markers. Because these imputation algorithms are already used in practice, we would like to know how to measure the imputation quality for a particular single nucleotide polymorphism (SNP) imputed by these algorithms. Once an imputation quality measure has been established, bioinformatics scientists can pay more attention to poorly imputed SNPs in the chip data integration process.
Recently, a statistics-based method for evaluating imputation reliability was proposed, named the imputation quality score (IQS). However, the true genotypes need to be known before the IQS can be evaluated, so its authors developed another method that requires additional statistical information to solve this problem.
In contrast, we propose a regression-based approach for this purpose. Modern machine learning algorithms can be computed efficiently with a large number of features. For this reason, any related information can be used in constructing the regression model. Some features have been shown in previous research to be correlated with SNP genotype imputation, such as minor allele frequency, recombination hotspot information, and the posterior probability output by the imputation software. Our results reveal that the IQS for each imputed SNP can be correctly predicted from these features by a nu-SVM regression model with an RBF kernel.


OP34:
Learning Differential Gene Expression Signatures from Personalized High Throughput Screening

Subject: Algorithm Development and Machine Learning

Presenting Author: Alfred Hero, University of Michigan

Co-Author(s):
Tzu-Yu Liu, University of Michigan
Ami Wiesel, the Hebrew University of Jerusalem
Christopher Woods, Duke University
Aimee Zaas, Duke University
Geoffrey Ginsburg, Duke University

Abstract:
In reference-based classification, the task is to correctly predict the label by using serially or spatially diversified samples. By using such a fixed reference, irrelevant patient variations can be controlled to enhance our ability to evaluate positive or negative response to drug treatment, or to classify a disease based on clinical-molecular data from multiple time points or multiple tissues. As the data dimension increases, variable selection becomes increasingly important in these problems. This is especially the case for the serially sampled reference-based classification problem, as variable dimensions increase linearly in the number of references. Hence it is important to understand which variables are strongly relevant to the classification task, and how they evolve over temporally or spatially different samples. We show that parameter estimation with shrinkage can be cast as a problem of structured variable selection, where the structure is specified by the classes and blocks defining the sampling patterns. A convex optimization method is developed to solve for the optimal classifier function and select the relevant variables simultaneously. This optimization is implemented by variable splitting and augmented Lagrangian methods. We apply our algorithm to predictive health problems. The results show that adding a reference sample of gene microarray data taken under normal conditions greatly improves classification of health status from gene expression. Our method is able to greatly reduce the number of features and pick out immune genes that mediate the response to a viral (H3N2) pathogen and are predictive of severe symptomatic illness.


OP35:
Image-Derived Models of Protein Subcellular Location Dynamics

Subject: Bioimaging

Presenting Author: Devin Sullivan, Carnegie Mellon-University of Pittsburgh

Co-Author(s):
Gregory R. Johnson, Carnegie Mellon University
Robert F. Murphy, Carnegie Mellon University

Abstract:
The subcellular location pattern of a protein can be learned from fluorescence images, providing key insight into protein function. This classification problem has been well studied for static images, but the dynamic response of a protein to stimuli may be an important part of its function. Since live-cell fluorescence microscopy can be used to directly observe these responses, we have developed and evaluated a range of machine learning methods to capture them. These provide significant new results, many of which are not apparent on visual inspection. This new knowledge can be used to better characterize protein and perturbagen function, leading to a better understanding of cellular behavior and how it is regulated.


OP36:
Predicting and Testing Cell Signaling Pathway Requirements in Breast Cancer

Subject: Disease Models & Epidemiology

Presenting Author: Eran Andrechek, Michigan State University

Co-Author(s):
Daniel Hollern, Michigan State University
Inez Yuwanita, Michigan State University
Danille Barnes, Michigan State University

Abstract:
Breast cancer is a heterogeneous disease with key differences apparent in the morphology and gene expression patterns within and between individual breast cancers. This heterogeneity presents a critical obstacle to successful treatment of the disease. To better understand the heterogeneity in activation of signaling pathways in breast cancer, we have employed training data to generate pathway signatures using a Bayesian Factor Regression Modeling (BFRM) approach. Using these signatures in conjunction with a database we assembled of a large number of mouse model breast cancers with unique initiating oncogenic events, we have predicted roles for a number of key cell signaling pathways in specific tumor types. For instance, we have predicted a role for the E2F transcription factors in mouse models overexpressing Myc, Neu(HER2), and PyMT. Subsequent tests using Gene Set Enrichment Analysis (GSEA) have confirmed our predictions. However, to directly test these bioinformatic predictions, we have integrated our genomic approach with traditional tests using mouse model systems, interbreeding the models overexpressing Myc, Neu(HER2), and PyMT with knockout mice for the individual E2F transcription factors. This has revealed that the E2F transcription factors play key roles in tumors initiated by Myc, Neu(HER2), and PyMT, with effects on tumor latency, growth rate, apoptosis, and metastasis that are unique to each of the initiating oncogenes. Gene expression from the resulting tumors is being analyzed to determine how E2F targets are differentially regulated and cause these differences.


OP37:
Network-mapping proteomics data analysis for identifying colorectal cancer biomarker candidates

Subject: other

Presenting Author: Xiaogang Wu, Indiana University

Co-Author(s):
Xiaogang Wu, Indiana University
Hui Huang, Indiana University
Madhankumar Sonachalam, Indiana University
Ragini Pandey, Indiana University
Karl MacDorman, Indiana University
Jake Chen, Indiana University

Abstract:
In this work we present a novel approach for LC-MS proteomics data analysis based on pathway cross-talk network modeling and disease-specific protein-protein interaction (PPI) network mapping. We demonstrate how to apply our network-mapping approach to identifying colorectal cancer (CRC) biomarker candidates by using the proteomics data from cceHUB, which contains blood samples (Normal: 80, PolyP: 72, CRC: 42) scanned using LC-MS. Our approach comprises the following twelve steps: visualize each LC-MS spectrum as a heat map; calibrate the mass/charge location of peaks locally; extract informative spectra from the quality-improved heat map images; align peaks for a peptide mass fingerprint (PMF) and visualize them in 3D; identify proteins based on matched PMFs using the Mascot platform; enrich the identified proteins in the global pathway data; select seed genes from curated gene signature data; build a CRC-specific PPI network by expanding those seed genes in global PPI data; validate the CRC-specific PPI network by using microarray data; map the enriched pathways onto the CRC-specific PPI network by using pathway cross-talk networks; automatically search the literature supporting the identified proteins associated with CRC; and build an integrated pathway/network model by using a multi-layer modeling technique. From the analysis results on the cceHUB data, we find that both NNMT and BRAF are detected frequently, which implies that the lesser-known NNMT and the well-known BRAF could both be used as biomarker candidates in blood tests for further differential proteomics analysis. We also build an integrated pathway to interpret the relationship between NNMT, BRAF, and oncogenes, such as TP53, MDM2, and CD44.
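
The seed-expansion step (building a disease-specific PPI network by expanding seed genes in global PPI data) might look roughly like the sketch below. It assumes a one-hop expansion that keeps every interaction touching a seed; the abstract does not specify the expansion rule. The function name and the placeholder node names (S1, P1, and so on) are hypothetical and do not represent real interactions.

```python
def expand_seeds(edges, seeds):
    """Keep every interaction that touches at least one seed (one-hop expansion).

    edges : iterable of (protein_a, protein_b) pairs from a global PPI table
    seeds : iterable of seed gene/protein names
    Returns the node set and edge list of the expanded subnetwork.
    """
    seeds = set(seeds)
    kept = [(a, b) for a, b in edges if a in seeds or b in seeds]
    nodes = seeds | {n for edge in kept for n in edge}
    return nodes, kept

# Placeholder global PPI edges; names are illustrative only.
global_ppi = [("S1", "P1"), ("P1", "P2"), ("S2", "P3"), ("P4", "P5")]
nodes, kept = expand_seeds(global_ppi, seeds=["S1", "S2"])
print(sorted(nodes))
print(kept)
```

Only the direct partners of the seeds are pulled in here; a wider neighborhood or a confidence cutoff on interactions would be a different, equally plausible reading of the step.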


OP38:
The Art and Science of Cancer Classification: the View from 2012

Subject: Disease Models & Epidemiology

Presenting Author: Jun Li, University of Michigan

Abstract:
Today, the same tumor samples are often analyzed simultaneously for genomic, epigenomic, transcriptomic, and other -omics profiles. Such unprecedented abundance of information offers a new opportunity for integrated analysis. What has usually happened, however, is Divide-and-Not-Conquer: different working groups separately analyze different data types and, not surprisingly, bring conflicting views regarding the nature and number of subtypes for the same cohort. Another practice is to use only one data type, such as gene expression, to classify a given tumor collection, and then to take these classes as given when retrospectively annotating the between-class differences in other data types. In our group we have wrestled with these issues constantly, and I will share some of the lessons learned from analyzing the glioblastoma datasets from The Cancer Genome Atlas Project. We were able to estimate within-tumor heterogeneity in terms of the mixing of euploid and aneuploid cells, and have shown its importance in class discovery. As self-aggregating algorithms (hierarchical or k-means clustering) will always produce a desired number of clusters, we have shown that some popular indices of cluster stability are inherently inflated. In reaching across data types, our inner artist relies on the sense of parsimony and interpretability, while our inner scientist frets over cross-validation, model identifiability, probabilistic inference of component cell types, and relevance to clinical outcome. Our efforts uncovered common pitfalls in currently adopted approaches, and led to exciting new insights into glioblastoma evolution and subtype diversity.
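
The point that k-means returns the requested number of clusters whether or not the data contain any can be seen in a toy example. The sketch below is not the speaker's analysis: it runs a minimal one-dimensional Lloyd's algorithm (function name, data, and starting centers are all illustrative assumptions) on evenly spaced points with no cluster structure.

```python
def kmeans_1d(data, centers, max_iter=100):
    """Minimal Lloyd's algorithm in one dimension."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (ties go to the lower index).
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each center to the mean of its members; keep it if empty.
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

# Evenly spaced points: there is no cluster structure to discover.
data = [float(i) for i in range(30)]
labels, centers = kmeans_1d(data, centers=[data[0], data[15], data[29]])
n_clusters = len(set(labels))
print(n_clusters, centers)
```

The algorithm still partitions the line into three tidy segments, so the existence of a k-cluster solution is not by itself evidence that k subtypes are present.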

