Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

#ISMB2016

Proceedings Track Presentations

Highlights, Late Breaking Research and Proceedings Track presentations will be presented by Theme.
Presenters' names in bold (for updates and changes, email steven@iscb.org)

Attention Conference Presenters: please review the Speaker Information Page.

TP002: DFLpred: High throughput prediction of disordered flexible linker regions in protein sequences
Date: Sunday, July 10 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Topic: PROTEINS
  • Fanchi Meng, University of Alberta, Canada
  • Lukasz Kurgan, Virginia Commonwealth University, United States

Area Session Chair: Lenore Cowen

Motivation: Disordered flexible linkers (DFLs) are disordered regions that serve as flexible linkers/spacers in multi-domain proteins or between structured constituents in domains. They are different from flexible linkers/residues since they are disordered and longer. Availability of experimentally annotated DFLs provides an opportunity to build high-throughput computational predictors of these regions from protein sequences. To date, there are no computational methods that directly predict DFLs and they can be found only indirectly by filtering predicted flexible residues with predictions of disorder.
Results: We conceptualized, developed and empirically assessed a first-of-its-kind sequence-based predictor of DFLs, DFLpred. This method outputs the propensity to form DFLs for each residue in the input sequence. DFLpred uses a small set of empirically selected features that quantify propensities to form certain secondary structures, disordered regions and structured regions, which are processed by a fast linear model. Our high-throughput predictor can be used on the whole-proteome scale; it needs less than 1 hour to predict an entire proteome on a single CPU. When assessed on an independent test dataset with low sequence-identity proteins, it achieves an area under the ROC curve (AUC) of 0.715 and outperforms existing alternatives that include methods for the prediction of flexible linkers, flexible residues, intrinsically disordered residues, and various combinations of these methods. Prediction on the complete human proteome reveals that about 10% of proteins have a large content of over 30% DFL residues. We also estimate that about 6000 DFL regions are long, with 30 or more consecutive residues.
Availability: http://biomine.ece.ualberta.ca/DFLpred/.
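As a rough illustration of the evaluation described above, the following sketch scores residues with a linear model and computes the AUC as the probability that a positive residue outranks a negative one. The feature values, weights and labels are invented for this example and are not DFLpred's actual features or coefficients.

```python
from typing import Sequence


def linear_propensity(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float = 0.0) -> float:
    """Per-residue score as a weighted sum of features (hypothetical weights)."""
    return bias + sum(f * w for f, w in zip(features, weights))


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy residues: (disorder propensity, helix propensity), both invented.
residue_features = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.4, 0.6)]
weights = (1.0, -1.0)   # assumption: disorder raises the score, helix lowers it
labels = [1, 1, 0, 0]   # 1 = annotated DFL residue in this toy example
scores = [linear_propensity(f, weights) for f in residue_features]
print(roc_auc(scores, labels))
```

The pairwise loop is quadratic and only meant for small examples; a rank-based implementation would be the usual choice at proteome scale.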

TP010: Analysis of aggregated cell-cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells
Date: Sunday, July 10 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Topic: DISEASE / SYSTEMS
  • Alfred Schissler, Lussier Lab, United States
  • Qike Li, The University of Arizona, United States
  • James Chen, The Ohio State University, United States
  • Colleen Kenost, The University of Arizona, United States
  • Ikbel Achour, The University of Arizona, United States
  • D. Dean Billheimer, The University of Arizona, United States
  • Haiquan Li, University of Arizona, United States
  • Walter W. Piegorsch, University of Arizona Center for Biomedical Informatics and Biostatistics, United States
  • Yves Lussier, University of Arizona, United States

Area Session Chair: Ioannis Xenarios

Motivation: As ‘omics’ biotechnologies accelerate the capability to contrast a myriad of molecular measurements from a single cell, they also exacerbate current analytical limitations for detecting meaningful single-cell dysregulations. Moreover, mRNA expression alone lacks functional interpretation, limiting opportunities for translation of single-cell transcriptomic insights to precision medicine. Lastly, most single-cell RNA-sequencing analytic approaches are not designed to investigate small populations of cells such as circulating tumor cells shed from solid tumors and isolated from patient blood samples.
Results: In response to these characteristics and limitations in current single-cell RNA-sequencing methodology, we introduce an analytic framework that models transcriptome dynamics through the analysis of aggregated cell-cell statistical distances within biomolecular pathways. Cell-cell statistical distances are calculated from pathway mRNA fold changes between two cells. Within an elaborate case study of circulating tumor cells derived from prostate cancer patients, we develop analytic methods of aggregated distances to identify five differentially expressed pathways associated with therapeutic resistance. Our aggregation analyses perform comparably to Gene Set Enrichment Analysis (GSEA) and better than differentially expressed genes followed by gene set enrichment. However, these methods were not designed to inform on differential pathway expression for a single cell. As such, our framework culminates with the novel aggregation method, cell-centric statistics (CCS). CCS quantifies the effect size and significance of differentially expressed pathways for a single cell of interest. Improved rose plots of differentially expressed pathways in each cell highlight the utility of CCS for therapeutic decision-making.
Availability: http://www.lussierlab.org/publications/CCS/

TP015: A novel algorithm for calling mRNA m6A peaks by modeling biological variances in MeRIP-seq data
Date: Sunday, July 10 12:00 pm - 12:20 pm
Room: Northern Hemisphere E1/E2
Topic: GENES
  • Xiaodong Cui, UTSA, United States
  • Jia Meng, Xi'an Jiaotong-Liverpool University, China
  • Shaowu Zhang, Northwestern Polytechnical University, China
  • Yidong Chen, UTHSCSA, United States
  • Yufei Huang, UTSA, United States

Area Session Chair: Alex Bateman

Motivation: N6-methyl-adenosine (m6A) is the most prevalent mRNA methylation, and precise prediction of its mRNA location is important for understanding its function. A recent sequencing technology, known as Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq), has been developed for transcriptome-wide profiling of m6A. We previously developed a peak calling algorithm called exomePeak. However, exomePeak over-simplifies data characteristics and ignores the reads’ variances among replicates or reads dependency across a site region. To further improve performance, a new model is needed to address these important issues of MeRIP-seq data.
Results: We propose a novel, graphical model-based peak calling method, MeTPeak, for transcriptome-wide detection of m6A sites from MeRIP-seq data. MeTPeak explicitly models the read count of an m6A site and introduces a hierarchical layer of Beta variables to capture the variances and a Hidden Markov model (HMM) to characterize the reads dependency across a site. In addition, we developed a constrained Newton’s method and designed a log-barrier function to compute analytically intractable, positively constrained Beta parameters. We applied our algorithm to simulated and real biological datasets and demonstrated significant improvement in detection performance and robustness over exomePeak. Prediction results on publicly available MeRIP-seq datasets are also validated and shown to be able to recapitulate the known patterns of m6A, further validating the improved performance of MeTPeak.
Availability: The package ‘MeTPeak’ is implemented in R and C++, and additional details are available at https://github.com/compgenomics/MeTPeak
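To make the idea of a Beta layer over read counts concrete, the sketch below evaluates a beta-binomial log-probability, a standard way to model a count whose underlying proportion is itself Beta-distributed. The parameterization used here (k immunoprecipitated reads out of n total, with shape parameters a and b) is an assumption for illustration and is not taken from the MeTPeak package.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k: int, n: int, a: float, b: float) -> float:
    """log P(K = k) when K ~ Binomial(n, p) and p ~ Beta(a, b).

    k: reads in the enriched sample (assumed meaning), n: total reads.
    a, b: positive Beta shape parameters; larger a + b means less
    replicate-to-replicate variance around the mean a / (a + b).
    """
    if a <= 0 or b <= 0:
        raise ValueError("Beta parameters must be positive")
    if not 0 <= k <= n:
        return float("-inf")
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


# Same mean proportion (0.5), different dispersion: the over-dispersed
# model assigns more probability to an extreme count.
print(beta_binomial_logpmf(18, 20, 1.0, 1.0))
print(beta_binomial_logpmf(18, 20, 50.0, 50.0))
```

The positivity check on a and b mirrors the constraint that the abstract handles with a log-barrier function during optimization; the optimization itself is not shown.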

TP016: DrugE-Rank: Improving Drug-Target Interaction Prediction of New Candidate Drugs or Targets by Ensemble Learning to Rank
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere A1/A2
Topic: DISEASE / DATA
  • Qing-Jun Yuan, Fudan University, China
  • Junning Gao, FDU, China
  • Dongliang Wu, Fudan University, China
  • Shihua Zhang, University of Southern California, United States
  • Hiroshi Mamitsuka, Kyoto University, Japan
  • Shanfeng Zhu, Fudan University, China

Area Session Chair: Ioannis Xenarios

Motivation: Identifying drug-target interaction is an important task in drug discovery. To reduce heavy time and financial cost in experimental identification of drug-target interaction, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug-target interactions of new drugs or new targets.

Methods: Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank (LTR) is the most powerful known technique among the feature-based methods, while similarity-based methods are well accepted because they connect the chemical and genomic spaces, represented by drug and target similarities, respectively. We propose a new method, DrugE-Rank, that improves prediction performance by combining the advantages of the two types of methods. That is, DrugE-Rank uses LTR, for which multiple well-known similarity-based methods can be used as components of ensemble learning.

Results: The performance of DrugE-Rank was thoroughly examined in three main experiments, using data from DrugBank: 1) cross-validation on FDA (US Food and Drug Administration) approved drugs before March 2014, 2) independent test on FDA approved drugs after March 2014, and 3) independent test on FDA experimental drugs. Experimental results show that DrugE-Rank outperformed competing methods significantly, especially achieving more than 30% improvement in AUPR (Area Under the Precision-Recall curve) for FDA approved new drugs and FDA experimental drugs.

TP018: RNAiFold2T: Constraint Programming design of thermo-IRES switches
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere E1/E2
Topic: GENES
  • Juan Antonio Garcia-Martin, Department of Biology, Boston College, United States
  • Ivan Dotu, Research Programme on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Spain
  • Javier Fernandez-Chamorro, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Gloria Lozano, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Jorge Ramajo, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Encarnacion Martinez-Salas, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Peter Clote, Department of Biology, Boston College, United States

Area Session Chair: Alex Bateman

Motivation: RNA thermometers (RNATs) are cis-regulatory elements that change secondary structure upon temperature shift. Often involved in the regulation of heat shock, cold shock and virulence genes, RNATs constitute an interesting potential resource in synthetic biology, where engineered RNATs could prove to be useful tools in biosensors and conditional gene regulation.
Results: Solving the 2-temperature inverse folding problem is critical for RNAT engineering. Here we introduce RNAiFold2T, the first Constraint Programming (CP) and Large Neighborhood Search (LNS) algorithms to solve this problem. Benchmarking tests of RNAiFold2T against existing inverse folding programs (adaptive walk and genetic algorithm) show that our software generates two orders of magnitude more solutions, thus allowing ample exploration of the space of solutions. Subsequently, solutions can be prioritized by computing various measures, including probability of target structure in the ensemble, melting temperature, etc. Using this strategy, we rationally designed two thermosensor internal ribosome entry site (thermo-IRES) elements, whose normalized cap-independent translation efficiency is approximately 50% greater at 42 °C than at 30 °C, when tested in reticulocyte lysates. Translation efficiency is lower than that of the wild-type IRES element, which on the other hand is fully resistant to temperature shift-up. This appears to be the first purely computational design of functional RNA thermoswitches, and certainly the first purely computational design of functional thermo-IRES elements.
Availability: RNAiFold2T is publicly available as part of the new release RNAiFold3.0 at https://github.com/clotelab/RNAiFold and http://bioinformatics.bc.edu/clotelab/RNAiFold; the latter also provides a web server. The software is written in C++ and uses the OR-Tools CP search engine.
Contact: clote@bc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

TP030: CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction
Date: Sunday, July 10 3:30 pm - 3:50 pm
Room: Northern Hemisphere E1/E2
Topic: PROTEINS
  • Xuefeng Cui, KAUST, Saudi Arabia
  • Zhiwu Lu, Renmin University, China
  • Sheng Wang, Toyota Technological Institute at Chicago, United States
  • Jingyan Wang, KAUST, Saudi Arabia
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia

Area Session Chair: Jianlin Cheng

Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step towards predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading, and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.

Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.

Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8,332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

TP031: Reconstructing the temporal progression of HIV-1 immune response pathways
Date: Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Topic: SYSTEMS / DISEASE
  • Siddhartha Jain, Carnegie Mellon University, United States
  • Joel Arrais, Universidade de Aveiro, IEETA, Portugal
  • Narasimhan J. Venkatachari, University of Pittsburgh, United States
  • Velpandi Ayyavoo, University of Pittsburgh, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States

Area Session Chair: Hagit Shatkay

We present TimePath, a new method that integrates time series and static datasets to reconstruct dynamic models of host response to stimulus. TimePath uses an Integer Programming formulation to select a subset of pathways that, together, explain the observed dynamic responses. Applying TimePath to study human response to HIV-1 led to accurate reconstruction of several known regulatory and signaling pathways and to novel mechanistic insights. We experimentally validated several of TimePath's predictions, highlighting the usefulness of temporal models.

TP033: Ensemble-Based Evaluation for Protein Structure Models
Date: Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere E1/E2
Topic: PROTEINS
  • Michal Jamroz, Warsaw University, Poland
  • Andrzej Kolinski, Warsaw University, Poland
  • Daisuke Kihara, Purdue University, United States

Area Session Chair: Jianlin Cheng

Motivation: Comparing protein tertiary structures is a fundamental procedure in structural biology and protein bioinformatics. Structure comparison is particularly important for evaluating computational protein structure models. Most model evaluation methods perform rigid body superimposition of a structure model onto its crystal structure and measure the difference between corresponding residue or atom positions. However, these methods neglect the intrinsic flexibility of proteins by treating the native structure as a rigid molecule. Since different parts of proteins have different levels of flexibility (for example, exposed loop regions are usually more flexible than the core region of a protein structure), disagreement of a model with the native structure needs to be evaluated differently depending on the flexibility of residues in the protein.
Results: We propose a score named FlexScore for comparing protein structures that considers the flexibility of each residue in the native state of proteins. Flexibility information may be extracted from experiments such as NMR or from molecular dynamics simulation. FlexScore considers an ensemble of conformations of a protein described as a multivariate Gaussian distribution of atomic displacements and compares a query computational model to the ensemble. We compare FlexScore with other commonly used structure similarity scores over various examples. FlexScore agrees with experts’ intuitive assessment of computational models and provides information on the practical usefulness of models.
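The idea of weighting model errors by native flexibility can be sketched as follows. This is a deliberately simplified stand-in, not the published FlexScore formula: it assumes the ensemble conformations and the model are already superimposed, uses one isotropic variance per atom instead of a full multivariate Gaussian, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def flexibility_weighted_deviation(ensemble: Sequence[Sequence[Coord]],
                                   model: Sequence[Coord],
                                   min_var: float = 1e-6) -> float:
    """Mean squared deviation of the model from the ensemble mean,
    with each atom's deviation divided by that atom's positional
    variance in the ensemble (so flexible atoms are penalized less).
    """
    n_conf = len(ensemble)
    n_atoms = len(model)
    if n_conf == 0 or any(len(conf) != n_atoms for conf in ensemble):
        raise ValueError("ensemble and model must have matching atom counts")
    total = 0.0
    for i in range(n_atoms):
        # Mean position of atom i over the ensemble.
        mean = [sum(conf[i][d] for conf in ensemble) / n_conf for d in range(3)]
        # Isotropic variance: mean squared distance from the mean position.
        var = sum(sum((conf[i][d] - mean[d]) ** 2 for d in range(3))
                  for conf in ensemble) / n_conf
        sq_dev = sum((model[i][d] - mean[d]) ** 2 for d in range(3))
        total += sq_dev / max(var, min_var)  # min_var avoids division by zero
    return total / n_atoms


# Toy ensemble: atom 0 is nearly rigid, atom 1 is flexible.
ensemble = [[(-0.1, 0.0, 0.0), (-1.0, 0.0, 0.0)],
            [(0.1, 0.0, 0.0), (1.0, 0.0, 0.0)]]
error_in_rigid_atom = [(0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_in_flexible_atom = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(flexibility_weighted_deviation(ensemble, error_in_rigid_atom))
print(flexibility_weighted_deviation(ensemble, error_in_flexible_atom))
```

In this toy case the same 0.5 displacement is penalized far more heavily when it falls on the rigid atom, which is the behavior the abstract argues a flexibility-aware score should have.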

TP038: Convolutional neural network architectures for predicting DNA-protein binding
Date: Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Topic: DATA / PROTEINS
  • Haoyang Zeng, Massachusetts Institute of Technology, United States
  • Matthew Edwards, MIT, United States
  • Ge Liu, MIT, United States
  • David Gifford, MIT, United States

Area Session Chair: Bruno Gaeta

Convolutional neural networks (CNN) have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. Yet inappropriate CNN architectures can yield poorer performance than simpler models. Thus an in-depth understanding of how to match CNN architecture to a given task is needed to fully harness the power of CNNs for computational biology applications. We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth, and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology.
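For readers unfamiliar with how a convolutional kernel acts as a motif detector, here is a minimal sketch of a single kernel scanned over a one-hot encoded DNA sequence, followed by ReLU and global max pooling. The kernel values and the example motif are invented for illustration; a trained network would learn many such kernels, and this is not the architecture evaluated in the presentation.

```python
from typing import List, Sequence

ALPHABET = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a list of 4-element indicator vectors."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv_max_pool(seq: str,
                  kernel: Sequence[Sequence[float]],
                  bias: float = 0.0) -> float:
    """Slide one kernel (width x 4) along the sequence, apply ReLU,
    and return the maximum activation over all positions."""
    x = one_hot(seq)
    width = len(kernel)
    if len(x) < width:
        raise ValueError("sequence is shorter than the kernel")
    activations = []
    for start in range(len(x) - width + 1):
        score = bias
        for j in range(width):
            score += sum(xi * kj for xi, kj in zip(x[start + j], kernel[j]))
        activations.append(max(0.0, score))  # ReLU
    return max(activations)  # global max pooling


# Hypothetical kernel that rewards each base of the motif "TGA" with 1.0.
motif_kernel = one_hot("TGA")
print(conv_max_pool("ACTGAC", motif_kernel))
print(conv_max_pool("AAAAAA", motif_kernel))
```

Global max pooling makes the output independent of where the motif occurs, which is one of the pooling design choices the abstract refers to.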

TP039: What Time is It? Deep Learning Approaches for Circadian Rhythms
Date: Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Topic: SYSTEMS / GENES
  • Forest Agostinelli, University of California-Irvine, United States
  • Nicholas Ceglia, University of California-Irvine, United States
  • Babak Shahbaba, University of California-Irvine, United States
  • Paolo Sassone-Corsi, University of California-Irvine, United States
  • Pierre Baldi, University of California-Irvine, United States

Area Session Chair: Nicola Mulder

Motivation: Circadian rhythms date back to the origins of life, are found in virtually every species and every cell, and play fundamental roles in functions ranging from metabolism to cognition. Modern high-throughput technologies allow the measurement of concentrations of transcripts, metabolites, and other species along the circadian cycle, creating novel computational challenges and opportunities, including the problems of inferring whether a given species oscillates in circadian fashion or not, and inferring the time at which a set of measurements was taken.

Results: We first curate several large synthetic and biological time series data sets containing labels for both periodic and aperiodic signals. We then use deep learning methods to develop and train BIO_CYCLE, a system to robustly estimate which signals are periodic in high-throughput circadian experiments, producing estimates of amplitudes, periods, phases, as well as several statistical significance measures. Using the curated data, BIO_CYCLE is compared to other approaches and shown to achieve state-of-the-art performance across multiple metrics. We then use deep learning methods to develop and train BIO_CLOCK to robustly estimate the time at which a particular single-time-point transcriptomic experiment was carried out. In most cases, BIO_CLOCK can reliably predict time, within approximately one hour, using the expression levels of only a small number of core clock genes. BIO_CLOCK is shown to work reasonably well across tissue types, and often with only small degradation across conditions. BIO_CLOCK is used to annotate most mouse experiments found in the GEO database with an inferred time stamp.

Availability: All data and software are publicly available on the CircadiOmics web portal: circadiomics.igb.uci.edu/.

TP040: phRAIDER: Pattern-Hunter Based Rapid Ab Initio Detection of Elementary Repeats
Date: Monday, July 11 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Charlotte Schaeffer, Miami University, United States
  • Nathan Figueroa, Miami University, United States
  • Xiaolin Liu, Miami University (Ohio), United States
  • John Karro, Miami University (Ohio), United States

Area Session Chair: Yana Bromberg

Motivation: Transposable elements (TEs) and repetitive DNA make up a sizable fraction of eukaryotic genomes, and their annotation is crucial to the study of the structure, organization, and evolution of any newly sequenced genome. While RepeatMasker and nHMMER are useful for identifying these repeats, they require a pre-compiled repeat library, which is not always available. De novo tools such as Recon, RepeatScout, or RepeatGluer serve to identify TEs purely from sequence content, but are either limited by runtimes that prohibit whole-genome use or degrade in quality in the presence of substitutions that disrupt the sequence patterns.

Results: phRAIDER is a de novo transposable element tool that addresses the issue of runtime without sacrificing sensitivity, as compared to competing tools. The underlying model is a new definition of elementary repeats that incorporates the PatternHunter spaced seed model, allowing for greater sensitivity in the presence of genomic substitutions. As compared to the premier tool in the literature, RepeatScout, phRAIDER shows an average 10x speedup on any single human chromosome and has the ability to process the whole human genome in just over three hours. Here we present the tool, the theoretical model underlying the tool, and the results demonstrating its effectiveness.

Availability: phRAIDER is an open source tool available from https://github.com/karroje/phRAIDER.
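The spaced seed idea mentioned above can be illustrated with a short sketch: a seed is a string of 1s (positions that must match) and 0s (positions that may differ), so two windows can still be grouped together when a substitution falls on a 0. The seed "1101" and the helper names below are chosen for illustration only and are not taken from phRAIDER or PatternHunter.

```python
from collections import defaultdict
from typing import Dict, List


def spaced_key(window: str, seed: str) -> str:
    """Mask the don't-care positions of a window with '*'."""
    if len(window) != len(seed):
        raise ValueError("window and seed must have the same length")
    return "".join(c if s == "1" else "*" for c, s in zip(window, seed))


def spaced_seed_match(a: str, b: str, seed: str) -> bool:
    """True if a and b agree at every position where the seed has a '1'."""
    return spaced_key(a, seed) == spaced_key(b, seed)


def seed_hits(seq: str, seed: str) -> Dict[str, List[int]]:
    """Group window start positions by their masked key, keeping only
    keys that occur more than once (candidate repeated elements)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(seq) - len(seed) + 1):
        groups[spaced_key(seq[start:start + len(seed)], seed)].append(start)
    return {key: pos for key, pos in groups.items() if len(pos) > 1}


# "ACGT" and "ACTT" differ only at the seed's don't-care position.
print(spaced_seed_match("ACGT", "ACTT", "1101"))
print(seed_hits("ACGTACTT", "1101"))
```

A contiguous seed ("1111") would miss the second copy in this toy sequence, which is the sensitivity gain the abstract attributes to the spaced seed model.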

TP041: RCK: accurate and efficient inference of sequence- and structure-based protein-RNA binding models from RNAcompete data
Date: Monday, July 11 10:30 am - 10:50 am
Room: Northern Hemisphere A3/A4
Topic: DATA / GENES
  • Yaron Orenstein, MIT, United States
  • Yuhao Wang, MIT, United States
  • Bonnie Berger, MIT, United States

Area Session Chair: Bruno Gaeta

Motivation: Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein-RNA binding. Among them, RNAcompete provides binding affinities of a specific protein to more than 240,000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state-of-the-art in sequence models, Deepbind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but was outperformed by GraphProt. Unfortunately, GraphProt cannot detect structural preferences from RNAcompete data due to the unstructured nature of the data, as noted by its developers.
Results: We develop RCK, an efficient, scalable algorithm to infer sequence and structure preferences based on a new k-mer model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and Deepbind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is also faster and uses less memory, which enables scalability. While currently on par with existing methods in in vivo binding prediction on a small scale test, we demonstrate that RCK will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale.
Availability: Software and models are freely available at http://groups.csail.mit.edu/cb/rck/.
Contact: bab@mit.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

TP046: Read-Based Phasing of Related Individuals
Date: Monday, July 11 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / SYSTEMS
  • Shilpa Garg, MPI-INF, Germany
  • Marcel Martin, Science for Life Laboratory, Sweden
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Area Session Chair: Yana Bromberg

Motivation: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information - reads and pedigree - has the potential to deliver results better than each individually.
Results: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2x for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15x coverage per individual.
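The genetic (pedigree) side of the approach can be illustrated for a single biallelic variant in a trio. The sketch below checks Mendelian consistency and reports the child's phase when the parents' genotypes determine it uniquely; it is a toy illustration with hypothetical function names, not the fixed-parameter algorithm described above. Genotypes are encoded as the number of alternative alleles (0, 1 or 2).

```python
from itertools import product
from typing import Optional, Set, Tuple


def transmissible_alleles(genotype: int) -> Set[int]:
    """Alleles (0 = reference, 1 = alternative) a parent can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def child_phasings(mother: int, father: int, child: int) -> Set[Tuple[int, int]]:
    """All (maternal allele, paternal allele) pairs that explain the child."""
    return {(m, f)
            for m, f in product(transmissible_alleles(mother),
                                transmissible_alleles(father))
            if m + f == child}


def mendelian_consistent(mother: int, father: int, child: int) -> bool:
    """True if the trio genotypes obey Mendelian inheritance."""
    return bool(child_phasings(mother, father, child))


def phase_child_from_trio(mother: int, father: int,
                          child: int) -> Optional[Tuple[int, int]]:
    """Phase of the child if uniquely determined by the pedigree, else None."""
    options = child_phasings(mother, father, child)
    return next(iter(options)) if len(options) == 1 else None


print(phase_child_from_trio(0, 1, 1))  # father must have passed the alternative allele
print(phase_child_from_trio(1, 1, 1))  # all heterozygous: pedigree alone is ambiguous
```

The all-heterozygous case is where genotypes alone cannot decide the phase, and where reads spanning neighboring variants supply the missing information.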

TP048: Novel Applications of Multi-task Learning and Multiple Output Regression to Multiple Genetic Trait Prediction
Date: Monday, July 11 11:40 am - 12:00 pm
Room: Northern Hemisphere E1/E2
Topic: GENES / DATA
  • Dan He, IBM T.J. Watson, United States
  • Laxmi Parida, IBM T J Watson Research Center, United States

Area Session Chair: Nicola Mulder

Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models. In many cases, for the same set of samples and markers, multiple traits are observed. Some of these traits might be correlated with each other. Therefore, modeling all the multiple traits together may improve the prediction accuracy. In this work, we view the multi-trait prediction problem from a machine learning angle: as either a multi-task learning problem or a multiple output regression problem, depending on whether different traits share the same genotype matrix or not. We then adapted multi-task learning algorithms and multiple output regression algorithms to solve the multi-trait prediction problem. We proposed a few strategies to improve the least square error of the prediction from these algorithms. Our experiments show that modeling multiple traits together could improve the prediction accuracy for correlated traits.

TP049: An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree
Date: Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / SYSTEMS
  • Yufeng Wu, Computer Science and Engineering Department, University of Connecticut, United States

Area Session Chair: Yana Bromberg

Motivation: A gene tree represents the evolutionary history of gene lineages that originate from multiple related populations. Under the multispecies coalescent model, lineages may coalesce outside the species (population) boundary. Given a species tree (with branch lengths), the gene tree probability is the probability of observing a specific gene tree topology under the multispecies coalescent model. There are two existing algorithms for computing the exact gene tree probability. The first algorithm is due to Degnan and Salter (2005), where they enumerate all the so-called coalescent histories for the given species tree and the gene tree topology. Their algorithm runs in exponential time in the number of gene lineages in general. The second algorithm is the STELLS algorithm (2012), which is usually faster but also runs in exponential time in almost all cases.

Results: In this paper, we present a new algorithm, called
CompactCH, for computing the exact gene tree probability. This new
algorithm is based on the notion of compact coalescent histories:
multiple coalescent histories are represented by a single compact
coalescent history. The key advantage of our new algorithm is that it
runs in polynomial time in the number of gene lineages if the number
of populations is fixed to be a constant. The new algorithm is more
efficient than the STELLS algorithm both in theory and in practice
when the number of populations is small and there are multiple
gene lineages from each population. As an application, we show
that CompactCH can be applied in the inference of population tree
(i.e. the population divergence history) from population haplotypes.
Simulation results show that the CompactCH algorithm enables
efficient and accurate inference of population trees with many more
haplotypes than a previous approach.

Availability: The CompactCH algorithm is implemented in the
STELLS software package, which is available for download at
http://www.engr.uconn.edu/~ywu/STELLS.html.

Contact: ywu@engr.uconn.edu

TP051: A Network-driven Approach for Genome-wide Association Mapping
Date: Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere E1/E2
Topic: GENES / DISEASE
  • Seunghak Lee, Carnegie Mellon University, United States
  • Soonho Kong, Carnegie Mellon University, United States
  • Eric Xing, Carnegie Mellon University, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Motivation:

It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and complex underlying mechanisms involved in associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms.

Results:

In this paper, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits, or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control, taking advantage of gene expression data. Furthermore, we applied NETAM to late-onset Alzheimer's disease data and identified 477 significant path associations, among which we analyzed paths related to beta-amyloid, estrogen, and nicotine pathways. We also provide hypothetical biological pathways to explain our findings.

TP055: DeepMeSH: Deep Semantic Representation for Improving Large-scale MeSH Indexing
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere BCD
Topic: DATA
  • Shengwen Peng, Fudan University, China
  • Ronghui You, Fudan University, China
  • Hongning Wang, Department of Computer Science at University of Virginia, United States
  • Chengxiang Zhai, UIUC, United States
  • Hiroshi Mamitsuka, Kyoto University, Japan
  • Shanfeng Zhu, Fudan University, China

Area Session Chair: Russell Schwartz

Presentation Overview:

Motivation:
Medical Subject Headings (MeSH) indexing, the task of assigning a
set of MeSH main headings to citations, is crucial for many
important tasks in biomedical text mining and information retrieval.
Large-scale MeSH indexing has two challenging aspects: the citation side and
the MeSH side.
For the citation side, all existing methods, including Medical Text
Indexer (MTI) by NLM (National Library of Medicine) and the
state-of-the-art method, MeSHLabeler, deal with text by bag-of-words,
which cannot capture semantic and context-dependent information well.

Methods: We propose DeepMeSH, which incorporates deep semantic
information for large-scale MeSH indexing.
It addresses the challenges on both the citation and MeSH sides.
The citation side challenge is solved by a new deep semantic representation,
D2V-TFIDF, which concatenates both sparse and dense semantic representations.
The MeSH side challenge is solved by using the 'learning to rank' framework of
MeSHLabeler, which integrates various types of evidence generated from
the new semantic representation.
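The idea of concatenating a sparse and a dense representation can be sketched as follows. This is a minimal illustration, not the DeepMeSH implementation: the smoothed IDF formula, the unit-length normalization of each part, and the hard-coded dense vector (standing in for a learned document embedding) are all assumptions made here.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus, vocab):
    """Sparse bag-of-words part: term frequency times a smoothed IDF."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    vec = []
    for term in vocab:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vec.append(counts[term] * idf)
    return vec

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def concat_sparse_dense(sparse_vec, dense_vec):
    """Concatenate the two parts after normalizing each to unit length."""
    return l2_normalize(sparse_vec) + l2_normalize(dense_vec)

corpus = [["gene", "disease"], ["gene", "protein"]]
vocab = ["disease", "gene", "protein"]
dense = [0.5, -0.5]  # hypothetical document embedding
combined = concat_sparse_dense(tfidf_vector(corpus[0], corpus, vocab), dense)
```

Normalizing each part before concatenation keeps one representation from dominating the other purely because of scale; whether the original method does this is not stated in the abstract.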

Results:
DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218
of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge
data with 6,000 citations.
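For readers unfamiliar with the metric, the micro-averaged F-measure pools true positives, false positives and false negatives over all citations before computing precision and recall. The sketch below uses toy label sets and an illustrative function name; it is not the BioASQ evaluation code.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F-measure over per-citation label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # headings correctly assigned
        fp += len(p - g)   # headings assigned but not in the gold set
        fn += len(g - p)   # gold headings that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"A", "B"}, {"C"}]
predicted = [{"A"}, {"C", "D"}]
score = micro_f_measure(gold, predicted)
```

The relative improvements quoted above follow from the ratios of the scores: 0.6323 / 0.6218 is about 1.02 and 0.6323 / 0.5637 is about 1.12.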

TP057: A Cross-Species Bi-Clustering Approach to Identifying Conserved Co-regulated Genes
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Topic: GENES / SYSTEMS
  • Jiangwen Sun, University of Connecticut, United States
  • Zongliang Jiang, University of Connecticut, United States
  • X Cindy Tian, University of Connecticut, United States
  • Jinbo Bi, University of Connecticut, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:

Motivation: A growing number of studies have explored the process of pre-implantation embryonic development of multiple mammalian species. However, the conservation and variation among different species in their developmental programming are poorly defined due to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and subsequently computes the overlaps of clusters identified from different species to reveal common subgroups. This approach is ineffective at dealing with the noise in the expression data introduced by the complicated procedures used to quantify gene expression. Furthermore, due to the sequential nature of the approach, the gene clusters identified in the first step may have little overlap among different species in the second step, making it difficult to detect conserved co-regulated genes.

Results: We propose a cross-species bi-clustering approach which first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes that are characterized by their expression patterns over the developmental stages of each species as columns. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes a data matrix into a product of a column vector and a row vector where the column vector is a consistent indicator across the matrices (species) to identify the same gene cluster and the row vector specifies for each species the developmental stages in which the clustered genes are co-regulated. An efficient optimization algorithm has been developed with convergence analysis. This approach was first validated on synthetic data and compared to the two-step method and several recent joint clustering methods. We then applied this approach to two real world datasets of gene expression during the pre-implantation embryonic development of human and mouse. Co-regulated genes consistent between the human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between human and mouse embryos.
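The idea of a rank-one factorization with a gene vector shared across species can be sketched with a simple alternating scheme. This is only an illustration under assumptions made here (power-iteration style updates, an optional soft threshold for sparsity, illustrative function names); the authors' actual objective and optimization algorithm are in the mvbc package and are not reproduced.

```python
import math

def mat_vec(X, v):
    """Compute X v for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in X]

def mat_t_vec(X, u):
    """Compute X^T u for a matrix stored as a list of rows."""
    return [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(len(X[0]))]

def soft_threshold(vec, lam):
    """Shrink entries toward zero; entries with |x| <= lam become zero."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def joint_rank_one(mats, lam=0.0, iters=50):
    """Shared gene vector u and one stage vector per species matrix."""
    n_genes = len(mats[0])
    u = normalize([1.0] * n_genes)
    for _ in range(iters):
        acc = [0.0] * n_genes
        for X in mats:
            v = mat_t_vec(X, u)                  # stage loadings for this species
            for i, val in enumerate(mat_vec(X, v)):
                acc[i] += val                    # pool evidence across species
        u = normalize(soft_threshold(acc, lam))
    return u, [mat_t_vec(X, u) for X in mats]

human = [[1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
mouse = [[3.0, 1.0, 1.0], [3.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
u, stage_vectors = joint_rank_one([human, mouse])
```

In this toy input the first two genes share a pattern in both matrices, so the shared vector u is non-zero only for them, while each species keeps its own stage vector of the appropriate length.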

Availability: The R package containing the implementation of the proposed method in C++ is available at: https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/.

TP060: Genome assembly from synthetic long read clouds
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Volodymyr Kuleshov, Stanford University, United States
  • Michael Snyder, Stanford University, United States
  • Serafim Batzoglou, Stanford University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview:

Motivation: Despite rapid progress in sequencing technology, assembling de-novo the genomes of new species as well as reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and the inability of current assembly paradigms to cope with combinations of short and long reads.
Results: Here, we introduce Architect, a new de-novo scaffolder aimed at synthetic long read technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead it assembles genomes directly from the SLR’s underlying short reads, which we refer to as read clouds. This enables a 4 to 20 fold reduction in sequencing requirements and a five-fold increase in assembly contiguity on both genomic and metagenomic datasets relative to state-of-the-art assembly strategies aimed directly at fully-subassembled long reads.

TP061: Structure-Based Prediction of Transcription Factor Binding Specificity using an Integrative Energy Function
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS
  • Alvin Farrel, University of North Carolina at Charlotte, United States
  • Jonathan Murphy, University of North Carolina at Charlotte, United States
  • Jun-Tao Guo, University of North Carolina at Charlotte, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:

Transcription factors (TFs) regulate gene expression through binding to specific target DNA sites. Accurate annotation of transcription factor binding sites (TFBSs) at genome scale represents an essential step toward our understanding of gene regulation networks. In this paper, we present a structure-based method for computational prediction of TFBSs using a novel, integrative energy function. The new energy function combines a multibody knowledge-based potential and two atomic energy terms (hydrogen bond and π-interaction) that might not be accurately captured by the knowledge-based potential due to the mean force nature and low count problem. We applied the new energy function to the TFBS prediction using a non-redundant dataset that consists of transcription factors from 12 different families. Our results show that the new integrative energy function improves the prediction accuracy over the knowledge-based, statistical potentials, especially for homeodomain transcription factors, the second largest TF family in mammals.

TP063: Jumping across biomedical contexts using compressive data fusion
Date: Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere BCD
Topic: DATA / DISEASE
  • Marinka Zitnik, Stanford University, United States
  • Blaz Zupan, University of Ljubljana, Slovenia

Area Session Chair: Russell Schwartz

Presentation Overview:

Motivation:
The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling.

Results:
We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous data sets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.
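Growing a size-k module by submodular optimization is commonly done greedily: at each step, add the candidate with the largest marginal gain. For monotone submodular objectives under a cardinality constraint, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimum. The sketch below uses a plain coverage objective and made-up gene and target names as stand-ins; Medusa's actual objective is derived from collective matrix factorization and is not reproduced here.

```python
def greedy_module(coverage, k):
    """Greedily pick up to k candidates maximizing the number of covered targets.

    coverage maps each candidate to the set of targets it is linked to
    (a hypothetical stand-in for a relevance objective).
    """
    chosen = []
    covered = set()
    for _ in range(min(k, len(coverage))):
        remaining = sorted(c for c in coverage if c not in chosen)
        # Marginal gain: how many new targets this candidate would add.
        best = max(remaining, key=lambda c: len(coverage[c] - covered))
        if not coverage[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

coverage = {"g1": {1, 2, 3}, "g2": {3, 4}, "g3": {1, 2}, "g4": {5}}
module, covered = greedy_module(coverage, 2)
```

Candidates are sorted before each selection so that ties are broken deterministically by name.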

TP068: deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / DATA
  • Bo Liu, Center for Bioinformatics, Harbin Institute of Technology, China
  • Dixian Zhu, Center for Bioinformatics, Harbin Institute of Technology, China
  • Yadong Wang, Center for Bioinformatics, Harbin Institute of Technology, China

Area Session Chair: Pedja Radivojac

Presentation Overview:

Motivation: With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to organize and index many assembled genomes well to promote future genomics studies. Burrows-Wheeler Transform (BWT) is an important data structure for genome indexing which has many fundamental applications; however, it is still non-trivial to construct the BWT for a large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches cannot support scalable parallel computing well due to their incremental nature, which is a bottleneck in utilizing modern computers to accelerate BWT construction.
Results: We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. deBWT represents and organizes the suffixes of the input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes advantage of the de Bruijn graph to facilitate the comparison between suffixes with long common prefixes, which breaks the bottleneck of BWT construction for repetitive genomic sequences. Meanwhile, deBWT also utilizes the structure of the de Bruijn graph to reduce unnecessary comparisons between suffixes. The benchmarking suggests that deBWT is efficient and scalable for constructing the BWT of large datasets by parallel computing. It is well-suited to index many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.
Availability: deBWT is implemented in C; the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT
Contact: ydwang@hit.edu.cn
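For orientation, the transform itself can be defined in a few lines by sorting all rotations of the text. This textbook construction is shown only to make the object concrete; it is not deBWT's method, and it is far too slow and memory-hungry for genome-scale input, which is exactly the problem this abstract addresses. The "$" sentinel is a conventional choice assumed here.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive construction)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Invert the BWT by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]

transformed = bwt("banana")          # "annb$aa"
restored = inverse_bwt(transformed)  # "banana"
```

Repetitive inputs share long prefixes among their suffixes, which makes the comparisons in the sort expensive.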

TP069: Finding correct protein-protein docking models using ProQDock
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS
  • Sankar Basu, Linköping University, Sweden
  • Bjorn Wallner, Linköping University, Sweden

Area Session Chair: Reinhard Schneider

Presentation Overview:

Motivation: Protein-protein interactions are key in virtually all biological processes. For a detailed understanding of the biological processes, the structure of the protein complex is essential. Given the current experimental techniques for structure determination, the vast majority of all protein complexes will never be solved by experimental techniques. In the absence of experimental data, computational docking methods can be used to predict the structure of the protein complex. A common strategy is to generate many alternative docking solutions (atomic models) and then use a scoring function to select the best. The success of the computational docking technique is, to a large degree, dependent on the ability of the scoring function to accurately rank and score the many alternative docking models.
Results: Here, we present ProQDock, a scoring function that predicts the absolute quality of a docking model measured by a novel protein docking quality score (DockQ). ProQDock uses support vector machines trained to predict the quality of protein docking models using features that can be calculated from the docking model itself. By combining different types of features describing both the protein-protein interface and the overall physical chemistry, it was possible to improve the correlation with DockQ from 0.25 for the best individual feature (EC) to 0.49 for the final version of ProQDock. ProQDock performed better than the state-of-the-art methods ZRANK and ZRANK2 in terms of correlations, ranking and finding correct models on an independent test set. Finally, we also demonstrate that it is possible to combine ProQDock with ZRANK and ZRANK2 to improve performance even further.

TP072: Compacting de Bruijn graphs from sequencing data quickly and in low memory
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / DATA
  • Rayan Chikhi, CNRS, France
  • Antoine Limasset, IRISA, France
  • Paul Medvedev, Pennsylvania State University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview:

As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.

We present an algorithm and a tool BCALM 2 for the compaction of de Bruijn graphs. BCALM 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, BCALM 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied BCALM 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, BCALM 2 is at least an order of magnitude more efficient than other available methods.
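The minimizer idea behind the input distribution can be sketched briefly: each k-mer is assigned to a bucket named after its smallest m-mer, so k-mers that overlap heavily tend to land in the same bucket and buckets can be processed separately. The sketch assumes plain lexicographic ordering of m-mers and ignores reverse complements; BCALM 2's actual ordering and partitioning details are not described in this abstract, and the function names are illustrative.

```python
from collections import defaultdict

def minimizer(kmer, m):
    """Lexicographically smallest substring of length m within a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def partition_kmers(reads, k, m):
    """Group the distinct k-mers of the reads into buckets keyed by minimizer."""
    buckets = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[minimizer(kmer, m)].add(kmer)
    return buckets

buckets = partition_kmers(["GATTACA"], k=5, m=3)
```

In this toy read, the consecutive k-mers GATTA and ATTAC share the minimizer ATT and fall into the same bucket, while TTACA goes to the bucket for ACA.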

TP074: Influence maximization in time bounded network identifies transcription factors regulating perturbed pathways
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere E1/E2
Topic: GENES
  • Kyuri Jo, Seoul National University, Korea, Republic of
  • Inuk Jung, Seoul National University, Korea, Republic of
  • Ji Hwan Moon, Seoul National University, Korea, Republic of
  • Sun Kim, Seoul National University, Korea, Republic of

Area Session Chair: Judith Blake

Presentation Overview:

To understand the dynamic nature of the biological process, it is crucial to identify perturbed pathways in an altered environment and also to infer regulators that trigger the response. Current time-series analysis methods, however, are not powerful enough to identify perturbed pathways and regulators simultaneously. Widely used methods include methods to determine gene sets such as differentially expressed genes or gene clusters, and these gene sets need to be further interpreted in terms of biological pathways using other tools. Most pathway analysis methods are not designed for time series data and they do not consider gene-gene influence on the time dimension. In this paper, we propose a novel time-series analysis method, TimeTP, for determining transcription factors regulating pathway perturbation, which narrows the focus to perturbed sub-pathways and utilizes the gene regulatory network and protein-protein interaction network to locate transcription factors triggering the perturbation. TimeTP first identifies perturbed sub-pathways that propagate the expression changes over time. Starting points of the perturbed sub-pathways are mapped into the network and the most influential transcription factors are determined by an influence maximization technique. The analysis result is visually summarized in a TF-Pathway map in time clock. TimeTP was applied to a PIK3CA knock-in dataset and found significant sub-pathways and their regulators relevant to the PIP3 signaling pathway.

TP077: An Integer Programming Framework for Inferring Disease Complexes from Network Data
Date: Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS / DISEASE
  • Arnon Mazza, Tel Aviv University, Israel
  • Konrad Klockmeier, Max Delbrück Center for Molecular Medicine, Germany
  • Erich Wanker, Max Delbrück Center for Molecular Medicine, Germany
  • Roded Sharan, School of Computer Science, Tel Aviv University, Israel

Area Session Chair: Reinhard Schneider

Presentation Overview:

Unraveling the molecular mechanisms that underlie disease calls for methods that go beyond the identification of single causal genes to inferring larger protein assemblies that take part in the disease process. Here we develop an exact, integer-programming-based method for associating protein complexes with disease. Our approach scores proteins based on their proximity in a protein-protein interaction network to a prior set that is known to be relevant for the studied disease. These scores are combined with interaction information to infer densely interacting protein complexes that are potentially disease-associated. We show that our method outperforms previous ones and leads to predictions that are well supported by current experimental data and literature knowledge.

TP082: RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes
Date: Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Avi Srivastava, Stony Brook University, United States
  • Hirak Sarkar, Stony Brook University, United States
  • Nitish Gupta, Stony Brook University, United States
  • Rob Patro, Stony Brook University, United States

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis.

Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap - our tool implementing quasi-mapping - is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we employ to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically-meaningful groups.
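The distinction between reporting loci of origin and computing base-to-base alignments can be illustrated with a deliberately simplified sketch: index every k-mer of every transcript, then report the transcripts shared by all indexed k-mers of a read. This is not RapMap's algorithm or data structures (the abstract only says it uses several efficient data structures); the exact-match k-mer dictionary, the intersection rule and all names below are assumptions for illustration.

```python
from collections import defaultdict

def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_transcripts(read, index, k):
    """Transcripts compatible with every indexed k-mer of the read.

    No alignment is computed; k-mers absent from the index are skipped.
    """
    hits = None
    for i in range(len(read) - k + 1):
        names = index.get(read[i:i + k])
        if not names:
            continue
        hits = set(names) if hits is None else hits & names
    return hits or set()

transcripts = {"t1": "ACGTACGTGG", "t2": "TTACGTGGCC"}
index = build_index(transcripts, k=4)
multi = candidate_transcripts("ACGTGG", index, k=4)
unique = candidate_transcripts("GTACG", index, k=4)
```

The first read lies in sequence shared by both toy transcripts, so both are reported as candidate origins; the second read contains a k-mer found only in t1.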

Availability: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap.

Contact: rob.patro@cs.stonybrook.edu

TP083: A convex optimization approach for identification of human tissue-specific interactomes
Date: Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / DISEASE
  • Shahin Mohammadi, Purdue University, United States
  • Ananth Grama, Department of Computer Science, Purdue University, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

Motivation: Analysis of organism-specific interactomes has yielded novel insights into cellular function and coordination, understanding of pathology, and identification of markers and drug targets. Genes, however, can exhibit varying levels of cell-type specificity in their expression, and their coordinated expression manifests in tissue-specific function and pathology. Tissue-specific/selective interaction mechanisms have significant applications in drug discovery, as they are more likely to reveal drug targets. Furthermore, tissue-specific transcription factors (tsTFs) are significantly implicated in human disease, including cancers. Finally, disease genes and protein complexes have the tendency to be differentially expressed in tissues in which defects cause pathology. These observations motivate the construction of refined tissue-specific interactomes from organism-specific interactomes.

Results: We present a novel technique for constructing human tissue-specific interactomes. Using a variety of validation tests (ESEA, GO Enrichment, Disease-Gene Subnetwork Compactness), we show that our proposed approach significantly outperforms state-of-the-art techniques. Finally, using case studies of Alzheimer's and Parkinson's diseases, we show that tissue-specific interactomes derived from our study can be used to construct pathways implicated in pathology and demonstrate the use of these pathways in identifying novel targets.

Availability: http://www.cs.purdue.edu/homes/mohammas/projects/ActPro.html

TP088: SHARAKU: An algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing
Date: Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Mariko Tsuchiya, Keio University, Japan
  • Kojiro Amano, Keio University, Japan
  • Masaya Abe, Keio University, Japan
  • Misato Seki, Keio University, Japan
  • Sumitaka Hase, Keio University, Japan
  • Kengo Sato, Keio University, Japan
  • Yasubumi Sakakibara, Keio University, Japan

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns, called read mapping profiles, can be detected; these are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs.
Results: We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5’-end processing and 3’-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain.

TP091: Analysis of differential splicing suggests different modes of short-term splicing regulation
Date: Tuesday, July 12 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Hande Topa, Aalto University, Finland
  • Antti Honkela, University of Helsinki, Finland

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined in order to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-seq time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels.
Results: Using estrogen receptor alpha signalling response as a model system, our Gaussian process (GP)-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change while others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment.
Availability: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/.

TP092: Prediction of Ribosome Footprint Profile Shapes from Transcript Sequences
Date: Tuesday, July 12 12:00 pm - 12:20 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / GENES
  • Tzu-Yu Liu, University of Pennsylvania, United States
  • Yun S. Song, University of California, Berkeley, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

Motivation: Ribosome profiling is a useful technique for studying translational dynamics and quantifying protein synthesis. Applications of this technique have shown that ribosomes are not uniformly distributed along mRNA transcripts. Understanding how each transcript-specific distribution arises is important for unraveling the translation mechanism.

Results: Here, we apply kernel smoothing to construct predictive features and build a sparse model to predict the shape of ribosome footprint profiles from transcript sequences alone. Our results on Saccharomyces cerevisiae data show that the marginal ribosome densities can be predicted with high accuracy. The proposed novel method has a wide range of applications, including inferring isoform-specific ribosome footprints, designing transcripts with fast translation speeds, and discovering unknown modulation during translation.
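Kernel smoothing of a per-position signal can be sketched as a weighted average of neighbouring positions. The Gaussian kernel, the bandwidth value and the toy signal below are assumptions for illustration; the abstract does not specify which kernel, bandwidth or sequence-derived signals the authors smooth.

```python
import math

def gaussian_smooth(values, bandwidth):
    """Kernel-weighted average of a 1D signal using a Gaussian kernel.

    Weights are renormalized at every position, so a constant signal
    is returned unchanged, including at the boundaries.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

signal = [0.0, 0.0, 10.0, 0.0, 0.0]  # hypothetical per-codon signal
profile = gaussian_smooth(signal, bandwidth=1.0)
```

A larger bandwidth spreads each position's contribution over more neighbours and gives a flatter profile; a smaller one keeps the features closer to the raw signal.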

TP096: Comparative Analyses of Population-scale Phenomic Data in Electronic Medical Records Reveal Race-specific Disease Networks
Date: Tuesday, July 12 12:20 pm - 12:40 pm
Room: Northern Hemisphere E1/E2
Topic: DISEASE / SYSTEMS
  • Benjamin S. Glicksberg, Icahn School of Medicine at Mount Sinai, United States
  • Li Li, Icahn School of Medicine at Mount Sinai, United States
  • Marcus A. Badgeley, Icahn School of Medicine at Mount Sinai, United States
  • Khader Shameer, Icahn School of Medicine at Mount Sinai, United States
  • Roman Kosoy, Icahn School of Medicine at Mount Sinai, United States
  • Noam D. Beckmann, Icahn School of Medicine at Mount Sinai, United States
  • Nam Pho, Harvard Medical School, United States
  • Joerg Hakenberg, Icahn School of Medicine at Mount Sinai, United States
  • Meng Ma, Icahn School of Medicine at Mount Sinai, United States
  • Kristin L. Ayers, Icahn School of Medicine at Mount Sinai, United States
  • Gabriel E. Hoffman, Icahn School of Medicine at Mount Sinai, United States
  • Shuyu Dan Li, Icahn School of Medicine at Mount Sinai, United States
  • Eric E. Schadt, Icahn School of Medicine at Mount Sinai, United States
  • Chirag J. Patel, Harvard Medical School, United States
  • Rong Chen, Icahn School of Medicine at Mount Sinai, United States
  • Joel T. Dudley, Icahn School of Medicine at Mount Sinai, United States

Area Session Chair: Yves Moreau

Presentation Overview:

Motivation: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most of the current human phenomics research is based primarily on European populations; hence it is an important challenge to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on the healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA), and Hispanic/Latino (HL).
Results: We compared susceptibility profiles and temporal connectivity patterns for 1,988 diseases and 37,282 disease pairs represented in a clinical population of 1,025,573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure, and underlying disease connections between EA, AA, and HL populations. We found 2,158 significantly comorbid diseases for the EA cohort, 3,265 for AA, and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key “hub” diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease co-morbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes.

TP097: Using genomic annotations increases statistical power to detect eGenes
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Dat Duong, UCLA, United States
  • Jennifer Zou, UCLA, United States
  • Farhad Hormozdiari, School of Computing Science, UCLA, United States
  • Jae Hoon Sul, Brigham and Women's Hospital, Boston, United States
  • Jason Ernst, UCLA, United States
  • Buhm Han, Asan Institute for Life Sciences, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
  • Eleazar Eskin, University of California, Los Angeles, United States

Area Session Chair: Janet Kelso

Presentation Overview: Show

Expression quantitative trait loci (eQTL) are genetic variants
that affect gene expression. In eQTL studies, one important task
is to find eGenes, that is, genes whose expression is associated with at
least one eQTL. The standard statistical method to determine if a
gene is an eGene requires association testing at all nearby variants
and the permutation test to correct for multiple testing. The standard
method however does not consider genomic annotation of the
variants. In practice, variants near gene transcription start sites or
certain histone modifications are likely to regulate gene expression.
In this paper, we introduce a novel eGene detection method that
considers this empirical evidence and thereby increases the statistical
power. We applied our method to the liver Genotype-Tissue Expression
(GTEx) data using distance from transcription start sites, DNase
hypersensitivity sites, and six histone modifications as the genomic
annotations for the variants. Each of these annotations helped us
detect more candidate eGenes. Distance from transcription start
site appears to be the most important annotation; specifically, using
this annotation, our method discovered 50% more candidate eGenes
than the standard permutation method.
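
The standard procedure described in this abstract (test every nearby variant, then correct for multiple testing by permutation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes squared Pearson correlation as the per-variant association statistic, and the optional `weights` argument is a hypothetical stand-in showing one simple way annotation-derived weights could enter the test.

```python
import random

def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return sxy * sxy / (sxx * syy)

def egene_pvalue(variants, expression, n_perm=1000, weights=None, seed=0):
    """Permutation p-value for a single gene.

    variants:   list of per-variant genotype lists (one value per sample)
    expression: expression level of the gene in each sample
    weights:    hypothetical per-variant weights (e.g. derived from
                annotations); all ones reproduces the unweighted test
    """
    rng = random.Random(seed)
    w = weights if weights is not None else [1.0] * len(variants)

    def max_stat(y):
        # Gene-level statistic: strongest (weighted) variant association.
        return max(wi * pearson_r2(g, y) for wi, g in zip(w, variants))

    observed = max_stat(expression)
    shuffled = list(expression)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression link
        if max_stat(shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because the maximum is taken over all variants inside each permutation, the resulting p-value already accounts for testing many variants per gene.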

TP098: Simultaneous prediction of enzyme orthologs from chemical transformation patterns for de novo metabolic pathway reconstruction
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / PROTEINS
  • Yasuo Tabei, Japan Science and Technology Agency, Japan
  • Yoshihiro Yamanishi, Kyushu University, Japan
  • Masaaki Kotera, Tokyo Institute of Technology, Japan

Area Session Chair: Trey Ideker

Presentation Overview: Show

Motivation:
Metabolic pathways are an important class of molecular networks consisting of compounds, enzymes, and their interactions.
The understanding of global metabolic pathways is extremely important for various applications in ecology and pharmacology.
However, large parts of metabolic pathways remain unknown, and most organism-specific pathways contain many missing enzymes.
Results:
In this study we propose a novel method to predict the enzyme orthologs that catalyze the putative reactions to facilitate the de novo reconstruction of metabolic pathways from metabolome-scale compound sets.
The algorithm detects the chemical transformation patterns of substrate-product pairs using chemical graph alignments, and constructs a set of enzyme-specific classifiers to simultaneously predict all the enzyme orthologs that could catalyze the putative reactions of the substrate-product pairs in the joint learning framework.
The originality of the method lies in its ability to make predictions for thousands of enzyme orthologs simultaneously, as well as its extraction of enzyme-specific chemical transformation patterns of substrate-product pairs.
We demonstrate the usefulness of the proposed method by applying it to some ten thousands of metabolic compounds,
and analyze the extracted chemical transformation patterns that provide insights into the characteristics and specificities of enzymes.
The proposed method will open the door to both primary (central) and secondary metabolism in genomics research,
increasing research productivity to tackle a wide variety of environmental and public health matters.

TP099: Classifying and Segmenting Microscopy Images with Deep Multiple Instance Learning
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Topic: DATA
  • Oren Kraus, University of Toronto, Canada
  • Lei Jimmy Ba, University of Toronto, Canada
  • Brendan Frey, University of Toronto, Canada

Area Session Chair: Curtis Huttenhower

Presentation Overview: Show

Motivation: High content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centred object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations.
Results: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND MIL pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps.
Availability: We will make our implementation and training data available for the final version of the manuscript.
Contact: oren.kraus@mail.utoronto.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
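
The idea of a MIL pooling function that is robust to outliers can be illustrated with a small sketch. The abstract does not state the functional form, so the formula below (a rescaled sigmoid of the mean instance probability) and the parameter names `a` (slope) and `b` (activation threshold) are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def noisy_and_pool(instance_probs, a=10.0, b=0.5):
    """Pool per-instance probabilities into one bag-level probability.

    The bag activates only when the mean instance probability exceeds
    the threshold b, so a few outlier instances cannot trigger it.
    The output is rescaled so mean 0 maps to 0 and mean 1 maps to 1.
    """
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)           # value at mean_p = 0
    high = sigmoid(a * (1.0 - b))   # value at mean_p = 1
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)
```

For example, `noisy_and_pool([1.0] + [0.0] * 9)` stays close to zero because only one of ten instances is positive, whereas max pooling over the same instances would return 1.0.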

TP101: Fast metabolite identification with Input Output Kernel Regression
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / DATA
  • Céline Brouard, Aalto University, Finland
  • Huibin Shen, Aalto University, Finland
  • Kai Dührkop, Friedrich-Schiller-University Jena, Germany
  • Florence d'Alché-Buc, Télécom ParisTech/Institut Mines-Télécom, France
  • Sebastian Böcker, Friedrich Schiller University Jena, Germany
  • Juho Rousu, Aalto University, Finland

Area Session Chair: Trey Ideker

Presentation Overview: Show

An important problem in metabolomics is identifying metabolites from tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprints and matching these fingerprints against existing databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach.
We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the input and output structures using an operator-valued kernel on the input side and an output kernel on the output side. The mapping between the two structured sets is approximated by learning a function with values in the feature space associated with the output kernel and solving a pre-image problem for the prediction step. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods.
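
The two steps named above (regression into the output feature space, then a pre-image search) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: the operator-valued input kernel is taken to be a scalar Gaussian kernel times the identity, the output kernel is linear on fingerprint vectors, and the pre-image problem is solved by scanning a finite candidate list. All function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

def iokr_predict(X_train, Y_train, x_new, candidates, lam=1e-3, gamma=1.0):
    """Return the index of the candidate output closest to the prediction.

    X_train:    (n, d) input feature vectors (stand-in for spectra)
    Y_train:    (n, m) output feature vectors (stand-in for fingerprints)
    candidates: (c, m) output vectors to choose among
    """
    n = len(X_train)
    K = gaussian_kernel(X_train, X_train, gamma)
    k_new = gaussian_kernel(X_train, x_new[None, :], gamma)[:, 0]
    # Kernel ridge regression: the prediction in output feature space is
    # h(x) = sum_i coeffs[i] * psi(y_i).
    coeffs = np.linalg.solve(K + lam * np.eye(n), k_new)
    # Pre-image step: minimise ||psi(y) - h(x)||^2 over the candidates.
    # The term ||h(x)||^2 is constant across candidates and is dropped.
    cross = (candidates @ Y_train.T) @ coeffs
    self_sim = np.einsum("ij,ij->i", candidates, candidates)
    return int(np.argmin(self_sim - 2.0 * cross))
```

Training reduces to a single linear solve and scoring each candidate needs only output-kernel evaluations, which gives some intuition for why this family of methods can be fast.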

TP102: PHOCOS: Inferring Multi-Feature Phenotypic Crosstalk Networks
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Topic: DATA
  • Yue Deng, School of Pharmacy, UCSF, United States
  • Steven Altschuler, School of Pharmacy, UCSF, United States
  • Lani Wu, School of Pharmacy, UCSF, United States

Area Session Chair: Curtis Huttenhower

Presentation Overview: Show

Motivation: Quantification of cellular changes to perturbations can provide a powerful approach to infer crosstalk among molecular components in biological networks. Existing crosstalk inference methods conduct network-structure learning based on a single phenotypic feature (e.g. abundance) of a biomarker. These approaches are insufficient for analyzing perturbation data that can contain information about multiple features (e.g. abundance, activity or localization) of each biomarker.
Results: We propose a computational framework for inferring phenotypic crosstalk (PHOCOS) that is suitable for high-content microscopy or other modalities that capture multiple phenotypes per biomarker. PHOCOS uses a robust graph-learning paradigm to predict direct effects from potential indirect effects and identify errors due to noise or missing links. The result is a multi-feature, sparse network that parsimoniously captures direct and strong interactions across phenotypic attributes of multiple biomarkers. We use simulated and biological data to demonstrate the ability of PHOCOS to recover multi-attribute crosstalk networks from cellular perturbation assays.

TP103: Data-driven mechanistic analysis method to reveal dynamically evolving regulatory networks
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Topic: GENES / SYSTEMS
  • Jukka Intosalmi, Aalto University, Finland
  • Kari Nousiainen, Aalto University, Finland
  • Helena Ahlfors, The Babraham Institute, United Kingdom
  • Harri Lähdesmäki, Aalto University, Finland

Area Session Chair: Janet Kelso

Presentation Overview: Show

Mechanistic models based on ordinary differential equations provide powerful and accurate means to describe the dynamics of molecular machinery which orchestrates gene regulation. When combined with appropriate statistical techniques, mechanistic models can be calibrated using experimental data and, in many cases, also the model structure can be inferred from time-course measurements. However, existing mechanistic models are limited in the sense that they rely on the assumption of static network structure and cannot be applied when transient phenomena affect, or rewire, the network structure. In the context of gene regulatory network inference, network rewiring results from the net impact of possible unobserved transient phenomena such as changes in signaling pathway activities or epigenome, which are generally difficult, but important, to account for.

We introduce a novel method that can be used to infer dynamically evolving regulatory networks from time-course data. Our method is based on the notion that all mechanistic ordinary differential equation models can be coupled with a latent process that approximates the network structure rewiring process. We illustrate the performance of the method using simulated data and, further, we apply the method to study the regulatory interactions during T helper 17 cell differentiation using time-course RNA sequencing data. The computational experiments with the real data show that our method is capable of capturing the experimentally verified rewiring effects of the core Th17 regulatory network. We predict Th17 lineage specific subnetworks that are activated sequentially and control the differentiation process in an overlapping manner.

TP104: Faster and More Accurate Graphical Model Identification of Tandem Mass Spectra using Trellises
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Topic: PROTEINS
  • Shengjie Wang, University of Washington, United States
  • John Halloran, University of Washington, United States
  • Jeff Bilmes, University of Washington, United States
  • William Stafford Noble, University of Washington, United States

Area Session Chair: Trey Ideker

Presentation Overview: Show

Tandem mass spectrometry (MS/MS) is the dominant high-throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each possible candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid redundantly recomputing common sub-computations among different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speed-up afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved peptide identification accuracy.
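
The benefit of representing candidates jointly can be illustrated with a much simpler structure than the trellis described above. The sketch below uses a prefix tree so that the cumulative mass of each shared peptide prefix is computed once rather than once per candidate. It is a simplified analogue for illustration only, not the authors' trellis or their dynamic Bayesian network; the function names are hypothetical, and the mass table lists approximate monoisotopic residue masses for a handful of amino acids.

```python
# Approximate monoisotopic residue masses in daltons (small subset).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "E": 129.04259, "K": 128.09496, "R": 156.10111,
}

def build_prefix_tree(peptides):
    """Nested dicts keyed by residue; shared prefixes share nodes."""
    root = {}
    for peptide in peptides:
        node = root
        for residue in peptide:
            node = node.setdefault(residue, {})
    return root

def prefix_masses(peptides):
    """Map every distinct prefix to its cumulative residue mass.

    Each tree node is visited exactly once, so a prefix shared by many
    candidate peptides costs a single addition.
    """
    masses = {}

    def walk(node, prefix, mass):
        for residue, child in node.items():
            new_mass = mass + RESIDUE_MASS[residue]
            masses[prefix + residue] = new_mass
            walk(child, prefix + residue, new_mass)

    walk(build_prefix_tree(peptides), "", 0.0)
    return masses
```

For the candidates `GASK`, `GASR` and `GAEK`, scoring each peptide separately would perform 12 additions, while the shared tree performs 7, one per distinct prefix.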

TP106: A novel method for discovering local spatial clusters of genomic regions with functional relationships from DNA contact maps
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: Northern Hemisphere A1/A2
Topic: GENES
  • Xihao Hu, The Chinese University of Hong Kong, Hong Kong
  • Christina Huan Shi, The Chinese University of Hong Kong, Hong Kong
  • Kevin Yip, The Chinese University of Hong Kong, Hong Kong

Area Session Chair: Janet Kelso

Presentation Overview: Show

Motivation: The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps.

Results: Here we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that highly resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally-related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell type specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains (TADs), and further define local sub-domains within each domain.

Availability: The source code of BNMF and tutorials on how to use the software to extract local clusters from contact maps are available at http://yiplab.cse.cuhk.edu.hk/bnmf/ .

TP107: BioASF: A Framework for Automatically Generating Executable Pathway Models Specified in BioPAX
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS / DATA
  • Reza Haydarlou, VU University Amsterdam, Netherlands
  • Annika Jacobsen, VU University Amsterdam, Netherlands
  • Nicola Bonzanni, VU University Amsterdam, Netherlands
  • K. Anton Feenstra, VU University Amsterdam, Netherlands
  • Sanne Abeln, VU University, Netherlands
  • Jaap Heringa, VU University Amsterdam, Netherlands

Area Session Chair: Trey Ideker

Presentation Overview: Show

Motivation: Biological pathways play a key role in most cellular functions.
To better understand these functions, diverse computational
and cell biology researchers use biological pathway data for various
analysis and modeling purposes. For specifying these biological pathways,
a community of researchers has defined BioPAX and provided
various tools for creating, validating, and visualizing BioPAX models.
However, a generic software framework for simulating BioPAX models
is missing. Here, we attempt to fill this gap by introducing a generic
simulation framework for BioPAX. The framework explicitly separates
the execution model from the model structure as provided by BioPAX,
with the advantage that the modelling process becomes more reproducible
and intrinsically more modular; this ensures natural biological
constraints are satisfied upon execution. The framework is based
on the principles of discrete event systems and multi-agent systems,
and is capable of automatically generating a hierarchical multi-agent
system for a given BioPAX model.
Results: To demonstrate the applicability of the framework, we
simulated two types of biological network models: a gene regulatory
network modeling the haematopoietic stem cell regulators and a
signal transduction network modeling the Wnt/β-catenin signaling
pathway. We observed that the results of the simulations performed
using our framework were entirely consistent with the simulation
results reported by the researchers who developed the original
models in a proprietary language.
Availability and Implementation: The framework, implemented in
Java, is open source and its source code, documentation, and tutorial
are available at http://www.ibi.vu.nl/programs/BioASF.
Contact: j.heringa@vu.nl

TP111: Linear effects models of signaling pathways from combinatorial perturbation data
Date: Tuesday, July 12 4:10 pm - 4:30 pm
Room: Northern Hemisphere A3/A4
Topic: SYSTEMS
  • Ewa Szczurek, University of Warsaw, Poland
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Area Session Chair: Trey Ideker

Presentation Overview: Show

Motivation: Perturbations constitute the central means to study signaling pathways. Interrupting
components of the pathway and analyzing observed effects of those interruptions can give insight into
unknown connections within the signaling pathway itself, as well as the link from the pathway to the effects. Different pathway components may have different individual contributions to the measured perturbation effects, such as gene expression changes. Those effects will be observed in combination when the pathway components are perturbed. Extant approaches focus either on the reconstruction of pathway structure or on resolving how the pathway components control the downstream effects.
Results: Here, we propose a linear effects model, which can be applied to infer both from combinatorial
perturbation data. We use simulated data to demonstrate the accuracy of learning the pathway structure
as well as estimation of the individual contributions of pathway components to the perturbation effects.
The practical utility of our approach is illustrated by an application to perturbations of the mitogen-activated protein kinase pathway in Saccharomyces cerevisiae.
Availability: lem is available as an R package at http://www.mimuw.edu.pl/~szczurek/lem
Contact: niko.beerenwinkel@bsse.ethz.ch
Supplementary information: Supplementary data are available at Bioinformatics online.