Abstracts - Other Presentations

Nonlinear Mixed-Effects Modeling of Proteomics Antibody-Based Multiplex Assays: A Bioinformatics Post-Hoc Approach to Improve Signal-to-Noise Ratios
Date: Tuesday, July 25 8:30 am - 8:45 am
Room: Hall 1B
  • Julián Candia, National Institutes of Health, Bethesda, MD, United States
  • Angelique Biancotto, National Institutes of Health, Bethesda, MD, United States

We developed a bioinformatics pipeline for data quality control, calibration and analysis of proteomics microsphere-based multiplex assays. The first step is a dip test analysis that flags non-unimodal bead-level distributions per well and analyte. Then, nonlinear mixed-effects models of standard curves are generated recursively, in which outliers are identified and corrected by an analysis of residuals. After obtaining a convergent model of standard curves, quality control bridge probes are calibrated to assess batch and plate effects. Finally, the standard curve model is applied to calibrate donor samples. This bioinformatics post-hoc approach is aimed at improving the quality of data typically characterized by poor signal-to-noise ratios. Although this pipeline is here applied to Luminex datasets, a similar framework could be implemented to analyze Mesoscale, SOMAscan and other analyte-based multiplex assays.
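
The calibration step can be illustrated with a minimal sketch. It assumes a four-parameter logistic (4PL) standard curve, a common choice for immunoassays that the abstract does not confirm. Nothing is fitted here: the parameter values, the function names `four_pl`, `invert_four_pl` and `flag_outliers`, and the residual threshold are all hypothetical, and the mixed-effects structure (plate- or batch-level random effects) is left out.

```python
import math


def four_pl(conc, bottom, top, ec50, hill):
    """Four-parameter logistic: predicted signal at a given concentration (> 0)."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


def invert_four_pl(signal, bottom, top, ec50, hill):
    """Back-calculate a concentration from a signal strictly between bottom and top."""
    if not bottom < signal < top:
        raise ValueError("signal outside the dynamic range of the curve")
    return ec50 / ((top - bottom) / (signal - bottom) - 1.0) ** (1.0 / hill)


def flag_outliers(standards, params, max_abs_log_residual=0.5):
    """Return indices of standards whose log10 residual exceeds the threshold."""
    flagged = []
    for i, (conc, signal) in enumerate(standards):
        residual = math.log10(signal) - math.log10(four_pl(conc, *params))
        if abs(residual) > max_abs_log_residual:
            flagged.append(i)
    return flagged


# Hypothetical curve parameters: bottom, top, EC50, Hill slope.
params = (10.0, 10000.0, 100.0, 1.2)
standards = [(c, four_pl(c, *params)) for c in (1.0, 10.0, 100.0, 1000.0)]
standards[3] = (1000.0, standards[3][1] * 10.0)  # simulate one aberrant standard

print(flag_outliers(standards, params))      # indices of suspect standards
print(invert_four_pl(5005.0, *params))       # calibrated concentration of a sample
```

In a recursive scheme like the one described, flagged standards would be corrected or dropped and the curve refitted until the set of outliers stops changing.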

Chemical biology libraries: probing the proteome with small molecules
Date: Tuesday, July 25 8:45 am - 9:00 am
Room: Hall 1B
  • Anne Wassermann, Merck & Co Inc, Kenilworth, NJ, United States

In the early 2000s, the scientific discipline of chemical genetics was born, seeking to elucidate the proteome with small molecules. This talk will highlight how we are designing and analyzing biologically annotated libraries at MSD. We will describe how we are aggregating bioactivity data from both internal and external sources to algorithmically identify the best available small molecule probes for each target. We will then describe the statistical tests and in-house designed visualization tools that we are using to relate small molecule bioactivities in phenotypic screens to putative targets and protein biological functions. Common pitfalls in the design and analysis of biologically annotated libraries will be discussed. Lastly, we will discuss how we are seeking to close the gap and identify small molecule probes for those proteins for which no low molecular weight ligand is currently known, thereby aiming to further map and characterize the druggable proteome.

Ultra-sensitive n-plexed protein quantification by a model-based reconstruction method
Date: Tuesday, July 25 9:00 am - 9:15 am
Room: Hall 1B
  • Kyowon Jeong, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Yeon Choi, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Joon-Won Lee, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Sangtae Kim, Pacific Northwest National Laboratory, United States
  • Jae Hun Jung, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Young-Suk Lee, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Kwang Pyo Kim, Department of Applied Chemistry, Kyung Hee University, South Korea
  • V. Narry Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Jong-Seo Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea

Isotopic labeling-based protein quantification offers advantages over other approaches, such as accurate quantity ratios and reduced technical bias. However, conventional isotopic labeling schemes (e.g., SILAC) have limited multiplexity (≤3-plex). Although some attempts have been made to increase multiplexity, they require either ultrahigh-resolution instruments or complicated/expensive labeling schemes. Also, quantification was rarely evaluated thoroughly, or the number of proteins quantified in all labels was often insufficient for most applications.
We present a model-based reconstruction method called EPIQ (Epic Protein Integrative Quantification) that enables ultra-sensitive n-plexed protein quantification. EPIQ allows deuterium-based isotopic labeling and small mass differences between labels (≥2 Da). Such labels make the XICs (eXtracted-Ion Chromatograms) from distinct labels hard to separate: they have different retention times (due to the deuterium effect) and mutual interferences (due to overlapping isotope clusters).
EPIQ is based on a generative model that describes how the observed XICs are generated given the labeled peptide ions of the same species. The model assumes the observed XICs are generated by superimposing signal components (XICs from labeled peptide ions) and noise components (coelution or flat intensity noise). Given an identified PSM (Peptide-Spectrum Match), it predicts the retention time, isotope distribution, and XIC shapes of the labeled peptide ions. By integrating these predictions, the signal and noise components in the generative model are predicted. Then EPIQ reconstructs the observation using the predicted components. As a result, it successfully separates XICs from distinct labels and performs accurate quantification with a low limit of quantification (LOQ).
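
The superposition idea can be sketched in a few lines. This is not EPIQ itself: it assumes just two labels, Gaussian peak shapes, no noise component, and plain least squares, and the names `gaussian_peak` and `decompose_two_components` are hypothetical. It only shows why predicted component shapes make overlapping XICs separable.

```python
import math


def gaussian_peak(times, center, width):
    """Hypothetical predicted XIC shape: a unit-height Gaussian elution profile."""
    return [math.exp(-0.5 * ((t - center) / width) ** 2) for t in times]


def decompose_two_components(observed, template_a, template_b):
    """Least-squares amplitudes (a, b) with observed ~ a*template_a + b*template_b."""
    saa = sum(x * x for x in template_a)
    sbb = sum(y * y for y in template_b)
    sab = sum(x * y for x, y in zip(template_a, template_b))
    soa = sum(o * x for o, x in zip(observed, template_a))
    sob = sum(o * y for o, y in zip(observed, template_b))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("templates are collinear; amplitudes are not identifiable")
    # Closed-form solution of the 2x2 normal equations.
    a = (soa * sbb - sob * sab) / det
    b = (sob * saa - soa * sab) / det
    return a, b


times = [i * 0.1 for i in range(100)]
label_1 = gaussian_peak(times, 5.0, 0.4)
label_2 = gaussian_peak(times, 4.8, 0.4)  # shifted retention time, heavy overlap
observed = [3.0 * x + 1.0 * y for x, y in zip(label_1, label_2)]

a, b = decompose_two_components(observed, label_1, label_2)
print(round(a, 6), round(b, 6))
```

The recovered amplitudes play the role of per-label quantities; a fuller model would add noise templates and constrain amplitudes to be non-negative.
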
To test the quantification performance of EPIQ, we developed deuterium-based 6-plexed labeling. A labeled, unfractionated HeLa sample with a ratio of 30:20:10:1:5:10 was subjected to LC-MS/MS (Q-Exactive). EPIQ reported ~3,000 proteins with a median quantity ratio of 30.4:21.3:10:1.1:4.1:10.1. In ~70% of the cases, the ratios (to the first label) fell within a 2-fold change of the input ratio.
To benchmark against other state-of-the-art tools, we adopted 13C-based 3-plexed labeling. A sample with a known ratio (HeLa, 1:10:20) and a biological sample (Xenopus early embryo) were analyzed by EPIQ and other tools. While all reported comparable numbers of proteins, the ratios from tools other than EPIQ were severely biased, especially for low-abundance peptides. These results demonstrate that EPIQ achieves a lower LOQ than other tools.
As EPIQ allows higher multiplexity, we are currently developing further chemical/metabolic labeling schemes (≥8-plex). EPIQ could facilitate various biological applications (e.g., cell dynamics studies or sensitive detection of differentially expressed proteins).

A Drug-Centric View of Drug Development: How Drugs Spread from Disease to Disease
Date: Tuesday, July 25 10:00 am - 10:15 am
Room: Hall 1B
  • Raul Rodriguez-Esteban, F. Hoffmann-La Roche Ltd., Switzerland

Drugs are often seen as ancillary to the purpose of fighting diseases. Here an alternative view is proposed in which they occupy a spearheading role. In this view, drugs are technologies with an inherent therapeutic potential. Once created, they can spread from disease to disease independently of the drug creator’s original intentions. Through the analysis of extensive literature and clinical trial records, it can be observed that successful drugs follow a life cycle in which they are studied at an increasing rate, and for the treatment of an increasing number of diseases, leading to clinical advancement. Such initial growth, following a power law on average, has a degree of momentum, but eventually decelerates, leading to stagnation and decay. A network model can describe the propagation of drugs from disease to disease in which diseases communicate with each other by receiving and sending drugs. Within this model, some diseases appear more prone to influence other diseases than be influenced, and vice versa. Diseases can also be organized into a drug-centric disease taxonomy based on the drugs that each adopts. This taxonomy reflects not only biological similarities across diseases, but also the level of differentiation of existing therapies. In sum, this study shows that drugs can become contagious technologies playing a driving role in the fight against disease. By better understanding such dynamics, pharmaceutical developers may be able to manage drug projects more effectively.

On the feasibility of mining CD8+ T-cell receptor patterns underlying immunogenic peptide recognition
Date: Tuesday, July 25 10:15 am - 10:30 am
Room: Hall 1B
  • Nicolas De Neuter, University of Antwerp, Belgium
  • Wout Bittremieux, University of Antwerp, Belgium
  • Charlie Beirnaert, University of Antwerp, Belgium
  • Bart Cuypers, University of Antwerp, Belgium
  • Aida Mrzic, University of Antwerp, Belgium
  • Pieter Moris, University of Antwerp, Belgium
  • Arvid Suls, University of Antwerp, Belgium
  • Viggo Van Tendeloo, University of Antwerp, Belgium
  • Benson Ogunjimi, University of Antwerp, Belgium
  • Kris Laukens, University of Antwerp, Belgium
  • Pieter Meysman, University of Antwerp, Belgium

Current T-cell epitope prediction tools are a valuable resource in designing targeted immunogenicity experiments. They typically focus on, and can accurately predict, peptide binding and presentation by major histocompatibility complex (MHC) molecules on the surface of antigen-presenting cells. However, recognition of the peptide-MHC complex by a T-cell receptor is often not included in these tools. We developed a classification approach based on random forest classifiers to predict recognition of a peptide by a T-cell and discover patterns that contribute to recognition. We considered two approaches to solve this problem: (1) distinguishing between two sets of T-cell receptors that each bind to a known peptide and (2) retrieving T-cell receptors that bind to a given peptide from a large pool of T-cell receptors. Evaluation of the models on two HIV-1, B*08-restricted epitopes reveals good performance and hints towards structural CDR3 features that can determine peptide immunogenicity. These results are of particular importance as they show that predicting T-cell epitopes and T-cell epitope recognition from sequence data is feasible. In addition, the validity of our models not only serves as a proof of concept for the prediction of immunogenic T-cell epitopes but also paves the way for more general and higher-performing models.

Immunoinformatics and Molecular Modeling exploration of T-cell Epitope-based Cancer Immunotherapy
Date: Tuesday, July 25 10:30 am - 10:45 am
Room: Hall 1B
  • Seema Mishra, University of Hyderabad, India

Cancer immunotherapy is a natural way of directing the body's own immune system towards cancer cell destruction for prevention or cure. Immunotherapy has shown considerable potential and promise as opposed to chemotherapy and radiotherapy, which have relatively low success rates. Peptide vaccines in the form of T- and B-cell epitopes, adoptive transfer of T cells, and vaccination using dendritic cells are some of the approaches currently being explored. In view of the time (10-20 years), cost (USD 200-900 million) and labor involved in experimental vaccine discovery and development, immunoinformatics tools have been pivotal in speeding up the process and generating novel hypotheses and insights. It is mandatory to use a tumor-specific antigen (TSA) or tumor-associated antigen (TAA) for eliciting an effective immune response. Further, an optimal level of immunogenicity is critical. Systems biology based approaches, exploring networks and pathways, can predict novel TAAs. During an immunotherapy regimen, it is usually observed that while immune responses are well-mounted, clinical response occurs only in a subset of patients. In this context, approaches using the HLA ligandome as well as multi-modal immunotherapy can be explored towards widespread clinical response. In this regard, our studies using an onco-fetal antigen, Placental Alkaline Phosphatase, and a key overexpressed chaperone protein involved in the Unfolded Protein Response which promotes chemoresistance in cancer cells, Glucose-regulated Protein-78, will be presented. Computational prediction of cytotoxic T lymphocyte (CTL) and helper T lymphocyte (HTL) epitopes and the HLA ligandome will be discussed. Specifically, the generation of promiscuous T cell epitopes binding to several HLA alleles, and their sequence and structural correspondence, will be discussed.
Structural basis of HLA ligandome action using molecular modeling as deduced from HLA-peptide epitope docking, binding and interactions will also be presented. Systems Biology approaches using protein interaction networks are currently being explored in our lab. Future work involving experimental validation will be considered.

Single-cell RNA sequencing identifies novel roles and interacting partners of APE1 in Pancreatic Ductal Adenocarcinoma Cells
Date: Tuesday, July 25 10:45 am - 11:00 am
Room: Hall 1B
  • Nadia Atallah, Purdue University, United States
  • Emery Goossens, Purdue University, Department of Statistics, United States
  • Fenil Shah, Indiana University School of Medicine, Department of Pediatrics, Wells Center for Pediatric Research, United States
  • Mark Kelley, Indiana University School of Medicine, Department of Pediatrics, Wells Center for Pediatric Research, Department of Pharmacology and Toxicology, United States
  • Melissa Fishel, Indiana University School of Medicine, Department of Pediatrics, Wells Center for Pediatric Research, Department of Pharmacology and Toxicology, United States

Apurinic/apyrimidinic endonuclease/redox factor-1 (APE1/Ref-1 or APE1) is a multifunctional protein involved in repairing DNA damage via endonuclease activity in base excision repair and in redox signaling control of transcription factors such as HIF-1 (hypoxia-inducible factor-1), STAT3, NF-κB, and others. APE1 is overexpressed in several cancers, including pancreatic ductal adenocarcinoma (PDAC). APE1 overexpression in cancer is associated with resistance to radiation and chemotherapy as well as increased tumor cell migration, proliferation, and survival. Deciphering APE1’s role in cell survival, hypoxia signaling, and resistance to chemotherapy is complicated by the fact that APE1 is essential for cell viability and by its dual functionality in DNA repair and in redox regulation of transcription factors. Due to an inability to generate a stable APE1-knockout cell line and the incomplete, transient nature of APE1 siRNA knockdown, the use of bulk RNA-seq would lead to difficulty in conclusively defining a comprehensive list of genes regulated by APE1. In this study, single-cell RNA sequencing was utilized to compare the transcriptomes of siAPE1 and scrambled control cells under normal oxygen conditions and under hypoxia. Low passage patient-derived PDAC cells were used to investigate and characterize APE1 function under these conditions. Cell cycle-related genes were identified and used to determine a correction factor for the expression of other genes and fit a latent variable model accounting for treatment and control covariates using the R package scLVM. The R package BPSC, which models gene expression counts using the beta-Poisson distribution, was then used to test for differential expression. Overall, under normal O2, 1,950 genes were differentially expressed between the siAPE1 knockdown and control cells using a false discovery rate cutoff of 5%.
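
To make the 5% false discovery rate cutoff concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption, and the function name and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which hypotheses are rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= k/m * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values from a differential expression test.
gene_p_values = [0.03, 0.001, 0.035, 0.5]
calls = benjamini_hochberg(gene_p_values, fdr=0.05)
print(calls)
```

Because the procedure is step-up, a gene whose p-value misses its own rank threshold can still be called significant when a higher-ranked p-value passes, as happens for the first gene here.
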
Additional analyses were performed to fully take advantage of the power of single-cell sequencing, including an analysis using a statistical model that split all cells into three categories: scrambled control, cells transfected with siAPE1 that retained some expression of APE1 transcript (siAPE1-non zero), and cells transfected with siAPE1 that gave undetectable APE1 (siAPE1-zero) which identified 2,837 differentially expressed genes. A pathway analysis identified numerous pathways influenced by APE1 knockdown including mTOR, mitochondrial dysfunction, and the apoptosis signaling pathway. Biological validation was performed, including qRT-PCR validation of several genes in PDAC cells and in various other patient-derived tumor cells. Using data from TCGA, the clinical relevance of the DEGs was assessed by fitting a Cox proportional hazards model. Of the DEGs, ~16% of the genes overlapping between this single-cell sequencing study and previous bulk RNA-Seq datasets included in TCGA were found to be clinically relevant to pancreatic cancer. Thus the current scRNA-seq study both contains overlap with genes already found to be clinically relevant and also provides new putative APE1 interacting partners as well as potentially novel mechanisms through which APE1 acts in the cell. This study has identified novel roles for APE1 in the cell and has utilized the power of single-cell to identify well-established as well as new, putative partners in the APE1 interactome. This study paves the way for future experiments aimed at identifying novel combination therapies for PDAC.

Misleading arrows? Fitness landscapes and cancer progression models
Date: Tuesday, July 25 11:00 am - 11:15 am
Room: Hall 1B
  • Ramon Diaz-Uriarte, Dept. Biochemistry, Universidad Autonoma de Madrid, Instituto de Investigaciones Biomedicas “Alberto Sols” (UAM-CSIC), Spain

We would expect a correspondence between the Directed Acyclic Graphs
(DAGs) inferred from cancer progression models and fitness landscapes as
the first implicitly model the epistatic interactions that shape the
fitness landscapes. I will show, however, that there is a many-to-many
mapping between fitness landscapes and DAGs inferred from cancer
progression models. Using simulated data under different fitness
landscapes I show this many-to-many mapping as the result of two different
phenomena: a one-DAG-to-many-landscapes, where the same DAG is often
observed in very different landscapes, and a one-landscape-to-many-DAGs,
where the same landscape can produce very different DAGs, with the
amount of variation affected by mutation rate, sampling regime, and
fitness landscape features. I then illustrate this problem using a
pancreatic cancer data set: genotype frequencies similar to the
empirically observed one can be obtained under fitness landscapes that are
widely different from each other and that produce data that lead to the
inference of DAGs that are also very different among themselves.
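
As a small illustration of what a progression DAG encodes, the sketch below enumerates the genotypes a DAG permits. It assumes a conjunctive reading (a mutation can appear only once all of its parents are present), which holds for some progression models but is not stated in the abstract; the gene names and the function `allowed_genotypes` are hypothetical.

```python
from itertools import combinations


def allowed_genotypes(genes, parents):
    """Genotypes (frozensets) in which every mutated gene has all its DAG parents mutated."""
    allowed = []
    for size in range(len(genes) + 1):
        for combo in combinations(genes, size):
            genotype = frozenset(combo)
            if all(set(parents.get(g, ())) <= genotype for g in genotype):
                allowed.append(genotype)
    return allowed


genes = ["A", "B", "C"]
fork = {"B": ["A"], "C": ["A"]}    # A precedes both B and C
chain = {"B": ["A"], "C": ["B"]}   # A precedes B, which precedes C

fork_genotypes = allowed_genotypes(genes, fork)
chain_genotypes = allowed_genotypes(genes, chain)
print(len(fork_genotypes), len(chain_genotypes))
```

A DAG only separates permitted from forbidden genotypes, whereas a fitness landscape assigns a fitness to every genotype, so many landscapes can be compatible with the same set of permitted genotypes.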

Novelty and significance:

This paper explicitly establishes a connection between the inference of
cancer progression models and the fitness landscapes that condition that
cancer progression. It shows there is a many-to-many mapping between them,
even under large sample sizes, that can limit their interpretability.

Reconstructing tumour evolution from single-cell sequencing data
Date: Tuesday, July 25 11:15 am - 11:30 am
Room: Hall 1B
  • Katharina Jahn, ETH Zurich, Switzerland
  • Jack Kuipers, ETH Zurich, D-BSSE, Computational Biology Group, Switzerland
  • Ben Raphael, Princeton University, United States
  • Niko Beerenwinkel, ETH Zurich, Switzerland

The mutational heterogeneity observed within tumours is a key obstacle to the development of efficient cancer therapies. A thorough understanding of subclonal tumour composition and the underlying mutational history is essential to open up the design of treatments tailored to individual patients. Recent advances in next-generation sequencing offer the possibility to analyse the evolutionary history of tumours at an unprecedented resolution, by sequencing single cells. This development poses a number of statistical challenges such as elevated noise rates due to allelic drop out, missing data and contamination with doublet samples.

We present SCITE, our probabilistic approach for reconstructing tumour mutation histories from single-cell sequencing data[1], with a focus on two recent extensions: the explicit modelling of doublet samples and a rigorous statistical test to identify the presence of parallel mutations and mutational loss[2].
Looking at several single-cell sequencing datasets from various tumour types, we find strong evidence that such recurrent mutational hits of the same genomic site occur more frequently than would be expected by chance. This result casts severe doubt on the adequacy of the infinite sites assumption, which is typically made in current state-of-the-art models for reconstructing mutation histories of tumours from single-cell as well as bulk sequencing data.
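
For intuition about the infinite sites assumption, the sketch below applies the classical pairwise compatibility check to a noise-free binary cell-by-mutation matrix: with a known unmutated ancestor, the sets of cells carrying any two mutations must be nested or disjoint. This is a textbook check, not the statistical test of [2], and the function name and matrices are hypothetical.

```python
from itertools import combinations


def infinite_sites_conflicts(matrix):
    """Return pairs of mutation columns whose carrier sets overlap without nesting."""
    n_mut = len(matrix[0])
    carriers = [
        {cell for cell, row in enumerate(matrix) if row[j] == 1}
        for j in range(n_mut)
    ]
    conflicts = []
    for i, j in combinations(range(n_mut), 2):
        a, b = carriers[i], carriers[j]
        # Shared cells plus cells private to each mutation cannot fit one tree
        # unless a mutation arose twice or was lost.
        if a & b and a - b and b - a:
            conflicts.append((i, j))
    return conflicts


# Rows are cells, columns are mutations (1 = mutation observed).
compatible = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
incompatible = [[1, 1], [1, 0], [0, 1]]

print(infinite_sites_conflicts(compatible))
print(infinite_sites_conflicts(incompatible))
```

With real single-cell data, allelic dropout and doublets also create such conflicts, which is why a probabilistic model and a formal test are needed before concluding that a site was hit more than once.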

[1] Jahn, K., Kuipers, J., and Beerenwinkel, N., 2016. Tree inference for single-cell data. Genome Biology, 17:86.
[2] Kuipers, J., Jahn, K., Raphael, B., and Beerenwinkel, N., 2017. A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution. bioRxiv 094722; doi: https://doi.org/10.1101/094722

Impact of tissue architecture on the nature and predictability of tumour evolution
Date: Tuesday, July 25 11:30 am - 11:45 am
Room: Hall 1B
  • Robert Noble, Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  • John Burley, Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States
  • Michael Hochberg, Institut des Sciences de l’Evolution, University of Montpellier, France

Intra-tumour genetic heterogeneity is a product of evolution in spatially structured populations of cells. Whereas genetic heterogeneity has been proposed as a prognostic biomarker in cancer, its spatially dynamic nature makes accurate prediction of tumour progression challenging. We use a novel computational model of cell proliferation, competition, mutation and migration to assess when and how genetic diversity is predictive of tumour growth and evolution. We characterize how tissue architecture (cell-cell competition and cell migration) influences the potential for subclonal population growth, the prevalence of clonal sweeps, and the resulting pattern of intra-tumour heterogeneity. We further compare the accuracy of cancer growth forecasts generated using different virtual biopsy sampling strategies, in different tissue types, and when cancer evolution is characteristically neutral or non-neutral. We thus determine the conditions under which genetic diversity is most predictive of future tumour states. Our findings help explain the multiformity of tumour evolution and contribute to establishing a theoretical foundation for predictive oncology.

The spatio-temporal evolutionary dynamics of lymph node spread in early breast cancer
Date: Tuesday, July 25 11:45 am - 12:00 pm
Room: Hall 1B
  • Alexandra Vatsiou, ICR, United Kingdom
  • Inma Spiteri, ICR, United Kingdom
  • Gaia Schiavon, ICR, United Kingdom
  • Peter Barry, Royal Marsden Hospital, London, United Kingdom
  • Andrea Sottoriva, ICR, United Kingdom

The most significant prognostic factor in breast cancer is lymph node involvement. This stage between localised and systemic disease is therefore key to understanding breast cancer progression. However, our knowledge of the evolution of lymph node involvement remains limited, as most currently available data derive from primary tumours. Here, in 11 treatment-naïve, node-positive breast cancer patients without clinical evidence of distant metastasis, we investigated lymph node evolution using spatial multi-region sequencing of primary and lymph node deposits and circulating tumour DNA (ctDNA) genomic profiling of matched longitudinal plasma samples. We found pervasive intra-tumour heterogeneity between primary and lymph node deposits. Phylogenomic analysis revealed that a significant proportion of lymph nodes were infiltrated early in the evolution of the malignancy, diverging from a cancer cell lineage in the primary. Moreover, somatic mutations private to the lymph nodes were found to be predominant with respect to mutations private to the primary in longitudinal ctDNA taken peri-operatively, suggesting a higher propensity of the unique nodal clones to be represented in the plasma. The APOBEC signature was also more prevalent in the lymph nodes than in the primary. This study sheds light on the most crucial evolutionary step in the natural history of breast cancer metastasis.

Leveraging heterogeneity in public data to reduce bias and increase accuracy of cell-mixture deconvolution
Date: Tuesday, July 25 12:15 pm - 12:30 pm
Room: Hall 1B
  • Francesco Vallania, Stanford University, United States
  • Andrew Tam, Stanford University, United States
  • Shane Lofgren, Stanford University, United States
  • Steven Schaffert, Stanford University, United States
  • Erika Bongen, Stanford University, United States
  • Michael Alonso, Stanford University, United States
  • Mark Davis, Stanford University, United States
  • Ed Engleman, Stanford University, United States
  • Purvesh Khatri, Stanford University, United States

Cell-mixture deconvolution is an established in silico approach to quantify cell subset proportions directly from bulk gene expression of mixed-cell samples(1) using a reference expression matrix, called basis matrix(2). Because existing matrices are built on limited expression data, we hypothesized that virtually all deconvolution methods are affected by the microarray platform used to measure expression and by the sample disease state, severely limiting their applicability.
To test this, we deconvolved 1,071 transcriptomes of human peripheral blood mononuclear cells (PBMCs) across eight microarray platforms from two manufacturers. We used LM22(3) and IRIS(2), two established blood cell basis matrices, with four deconvolution algorithms (linear regression, quadratic programming, robust regression, and support vector regression)(2-4). We quantified deconvolution accuracy by goodness of fit(3), which correlates measured expression with reconstituted expression from the basis matrix. We observed significant platform bias across all methods for both IRIS (p = 2.27e-9) and LM22 (p = 4.79e-4).
To reduce bias, we present immunoStates, a new blood cell matrix created from 6,240 transcriptomes of 20 sorted human immune cells measured over 42 different platforms. Unlike previous matrices, immunoStates showed no significant platform bias (p = 0.28). Notably, we observed high concordance across all methods when using the same basis matrix (r = 0.973 ± 0.004) whereas the opposite was not true (r = 0.356 ± 0.060), suggesting that the choice of matrix is more important than the method used for deconvolution.
Unlike IRIS and LM22, which only used samples from healthy controls, immunoStates incorporated disease samples as well. We therefore hypothesized that immunoStates would have higher accuracy than IRIS and LM22 in diseased samples. To test this, we used a compendium of 4,113 expression profiles of healthy and cancer samples of blood and solid tissue origin(5). We compared goodness of fit of blood samples to tissue biopsies. As blood is expected to have higher goodness of fit than solid tissue, we asked whether this was consistent across healthy and diseased samples. While LM22 and IRIS accurately distinguished blood from tissue-derived samples in healthy samples (AUC: 98% LM22; 95% IRIS), they failed to do so in cancerous samples (AUC: 54% LM22; 70% IRIS). In contrast, immunoStates accurately separated blood from biopsies independently of disease state (AUC: 96.3% healthy; 90% cancer) across all methods.
Finally, we compared cell proportion estimates to either FACS or complete blood counts. We deconvolved 5 datasets consisting of 402 samples with paired gene expression and cell count data. immunoStates had significantly higher correlation between estimated and measured cellular proportions (r = 0.78 ± 0.03) compared to LM22 (r = 0.56 ± 0.06, p = 5.8e-4) and IRIS (r = 0.15 ± 0.07, p = 4.77e-5). Interestingly, no deconvolution method performed better than others across all matrices and cohorts.
Collectively, our results demonstrate that a basis matrix has significant effects on the accuracy of in silico deconvolution. We further demonstrated that biological and technical heterogeneity present in public data can be leveraged to create a basis matrix that is not affected by biological, technical, and methodological biases.
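
The two quantities at the heart of this comparison, estimated proportions and goodness of fit, can be sketched as follows. The example uses ordinary least squares (the simplest of the four methods named above) without a non-negativity constraint, and a made-up 4-gene by 3-cell-type basis matrix; all function names are hypothetical and none of this reproduces immunoStates, LM22 or IRIS.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def deconvolve(basis, mixture):
    """Least-squares cell-type weights such that mixture ~ basis . weights."""
    k = len(basis[0])
    gram = [[sum(row[i] * row[j] for row in basis) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * m for row, m in zip(basis, mixture)) for i in range(k)]
    return solve_linear_system(gram, rhs)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def goodness_of_fit(basis, mixture, weights):
    """Correlation between measured and reconstituted expression."""
    reconstituted = [sum(w * v for w, v in zip(weights, row)) for row in basis]
    return pearson(mixture, reconstituted)


# Rows are genes, columns are cell types (made-up expression values).
basis = [[10.0, 1.0, 2.0], [1.0, 8.0, 1.0], [2.0, 1.0, 9.0], [5.0, 5.0, 1.0]]
true_weights = [0.5, 0.3, 0.2]
mixture = [sum(w * v for w, v in zip(true_weights, row)) for row in basis]

weights = deconvolve(basis, mixture)
fit = goodness_of_fit(basis, mixture, weights)
print([round(w, 6) for w in weights], round(fit, 6))
```

On noise-free synthetic data the weights are recovered and the fit is essentially perfect; a basis matrix that mismatches the platform or disease state of the mixture would show up as a lower goodness of fit.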

1. Shen-Orr SS, Gaujoux R. Computational deconvolution: extracting cell type-specific information from heterogeneous samples. Current Opinion in Immunology. 2013 Oct;25(5):571–8.

2. Abbas AR, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data. Genes Immun. 2005 Jun;6(4):319–31.

3. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Meth. Nature Publishing Group; 2015 Mar 30;12(5):453–7.

4. Gong T, et al. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS ONE. 2011;6:e27156.

5. Lukk M, Kapushesky M, Nikkilä J, Parkinson H, Goncalves A, Huber W, et al. A global map of human gene expression. Nat Biotechnol. 2010 Apr;28(4):322–4.

Digital assisted curation to the rescue of traditional literature curation for life-science databases
Date: Tuesday, July 25 2:15 pm - 2:30 pm
Room: Hall 1B
  • Fabio Rinaldi, IFI, University of Zurich, Switzerland
  • Socorro Gama, UNAM, Mexico
  • Yalbi Itzel Balderas-Martínez, Facultad de Ciencias, UNAM, Mexico
  • Oscar Lithgow, UNAM, Mexico
  • Hilda Solano Lira, UNAM, Mexico
  • Mishael Sánchez-Pérez, Program on Computational Genomics Center for Genomic Sciences, UNAM, Mexico
  • Alejandra Lopez-Fuentes, Center for Genomic Sciences - UNAM, Mexico
  • Luis Muñiz-Rascado, RegulonDB, Mexico
  • Cecilia Ishida, UNAM, Mexico
  • Carlos-Francisco Méndez-Cruz, Center for Genomic Sciences, National Autonomous University of Mexico (UNAM), Mexico
  • Alberto Santos-Zavaleta, Centro de Ciencias Genómicas (CCG) UNAM, Mexico
  • Julio Collado-Vides, Center for Genomic Sciences, UNAM, Mexico

Faced with decreasing financial resources, life science databases
struggle to keep pace with the constantly increasing amount of
published results. Traditional approaches based on careful human
review of published papers guarantee a high quality of database
entries, and cannot be easily replaced by automated technologies, but
are very slow and not cost-effective. There are several
technologies derived from the field of natural language processing
that promise to support the search for information in textual
resources. These technologies have been applied to the scientific
literature for many years. In particular in the life science domain
several community-organized evaluation campaigns carried out
in the past few years have shown a steady improvement in the
capabilities of these systems. However, there is still widespread
skepticism about the possibility of using such tools in a curation
pipeline.

We argue that although text mining tools on their own would not be
easily usable in a curation pipeline, their integration in a
supportive environment can lead to a remarkable increase in efficiency
of the curation process, and we prove this point through recent
digital assisted curation experiments in the context of an established
bacterial database and a recently initiated curation effort for an
important human disease.

Model tree by ensemble of piecewise linear models and its application to QSAR modeling
Date: Tuesday, July 25 2:30 pm - 2:45 pm
Room: Hall 1B
  • Jonathan Silva, King's College London, United Kingdom
  • Lazaros Papageorgiou, University College London, United Kingdom
  • Sophia Tsoka, King's College London, United Kingdom
  • George Papadatos, GlaxoSmithKline, United Kingdom

Presentation Overview: Show

Ensemble methods, such as random forest, gradient boosting or a simple average of decision trees, are frequently used in QSAR modeling. By combining multiple predictive classification or regression trees to produce a single prediction, these algorithms often generalise better than a single model. However, these meta-models are not directly interpretable and cannot be used to design new compounds in the inverse QSAR problem.

Inspired by bootstrapping strategies and the need for interpretability in QSAR models, we have developed an algorithm that builds a single predictive model tree (with linear equations on the leaves) from a collection of regression models, instead of taking an average of predictions. We use a piecewise linear regression algorithm to identify breakpoints

A systematic identification of species-specific protein succinylation sites using joint element features information
Date: Tuesday, July 25 2:45 pm - 3:00 pm
Room: Hall 1B
  • Mehedi Hasan, The Chinese University of Hong Kong, Hong Kong
  • Yong Cao, Harbin Institute of Technology Shenzhen Graduate School, China
  • Dianjing Guo, The Chinese University of Hong Kong, Hong Kong

Presentation Overview: Show

Lysine succinylation is a post-translational modification that plays significant roles in many cellular processes. Accurate identification of succinylation sites can facilitate the understanding of the molecular mechanism of succinylation. However, even in well-studied systems, the majority of succinylation sites remain undetected because the traditional experimental identification approaches are often costly, time-consuming, and laborious. An in silico approach, on the other hand, is a potential alternative strategy to predict succinylation substrates. In this paper, a novel computational predictor, SuccinSite2.0, was developed for predicting generic and species-specific protein succinylation sites by combining sequence evolutionary information and orthogonal binary features. A random forest classifier was trained with these features to build the prediction model. The proposed SuccinSite2.0 predictor achieved an improvement over the existing generic succinylation site prediction tools on a complementary, independent and new dataset. Furthermore, we also analyzed the important features that make visible contributions to species-specific and cross-species prediction of protein succinylation sites. The proposed predictor is anticipated to be a powerful computational resource for succinylation site prediction. The integrated species-specific online tool interface is publicly accessible (http://biocomputer.bio.cuhk.edu.hk/SuccinSite2.0).
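
As an illustration of the "orthogonal binary features" mentioned above, the following minimal Python sketch one-hot encodes a sequence window centred on each lysine. It is not the SuccinSite2.0 implementation: the window size (`flank=3`), the padding symbol `X`, the function names and the toy sequence are all assumptions chosen for the example, and the evolutionary features and the random forest training step are omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def lysine_windows(sequence, flank=3, pad="X"):
    """Yield (position, window) for every lysine (K), padding the sequence ends."""
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Position i in the sequence is position i + flank in the padded string.
            yield i, padded[i : i + 2 * flank + 1]


def orthogonal_binary(window):
    """Encode each residue as a 20-bit one-hot vector; pad/unknown residues are all zeros."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


# Hypothetical toy sequence with lysines at positions 1 and 7 (0-based).
features = {pos: orthogonal_binary(win) for pos, win in lysine_windows("MKTAYIAKQR")}
```

Each window of 7 residues becomes a vector of 140 binary values; vectors of this kind could then be passed, together with other features, to a classifier of the reader's choice.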

Using Ancestral Sequence Reconstruction to Characterize an Allosteric Bi-Enzyme Complex
Date: Tuesday, July 25 3:00 pm - 3:15 pm
Room: Hall 1B
  • Kristina Heyn, University of Regensburg, Germany
  • Alexandra Holinski, University of Regensburg, Germany
  • Rainer Merkl, University of Regensburg, Germany
  • Reinhard Sterner, University of Regensburg, Germany

Presentation Overview: Show

Ancestral sequence reconstruction (ASR) is the inference of primordial amino acid sequences from contemporary ones with the help of a phylogenetic tree [1]. Extant sequences occupy the leaves of this tree and the sequences corresponding to the internal nodes and the root are intermediates. The in silico and biochemical characterization of these intermediates can help to elucidate the structural determinants of a whole protein family.
Imidazole glycerol phosphate synthase is a bi-enzyme complex consisting of the cyclase subunit HisF and the glutaminase subunit HisH [2]. We have used ASR and protein design to unravel the structural basis of the interaction between HisH and HisF.
To this end, we compared the binding of a given HisH protein to i) a modern HisF enzyme (a leaf), to ii) evolutionary intermediate HisF proteins (internal nodes), and to iii) HisF from the last universal common ancestor (LUCA-HisF, root) that differ in the composition of their interfaces [3].
The in silico analyses of these interfaces made clear that one residue is key to the binding affinity of the HisH-HisF complex. The predicted effect on complex stability induced by the reciprocal exchange of the corresponding interface residues was confirmed by means of biochemical studies [4]. Thus, we could demonstrate that a combination of in silico methods and wet-lab experiments allowed us to identify an interface hot-spot in a straightforward manner.

[1] Merkl, R. & Sterner, R. (2016) Ancestral protein reconstruction: techniques and applications. Biological Chemistry 397, 1–21.
[2] Beismann-Driemeyer, S. & Sterner, R. (2001). Imidazole glycerol phosphate synthase from Thermotoga maritima: Quaternary structure, steady-state kinetics and reaction mechanism of the bi-enzyme complex. J. Biol. Chem. 276, 20387-20396.
[3] Reisinger, B., Sperl, J., Holinski, A., Schmid, V., Rajendran, C., Carstensen, L., Schlee, S., Blanquart, S., Merkl, R. & Sterner, R. (2014). Evidence for the existence of elaborate enzyme complexes in the Paleoarchean era. J. Am. Chem. Soc. 136, 122−129.
[4] Holinski, A., Heyn, K., Merkl, R., & Sterner, R. (2017). Combining ancestral sequence reconstruction with protein design to identify an interface hotspot in a key metabolic enzyme complex. Proteins: Structure, Function, and Bioinformatics. 85, 312-321.

Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus
Date: Tuesday, July 25 3:30 pm - 3:45 pm
Room: Hall 1B
  • Yuki Yoshida, Keio University, Japan
  • Georgios Koutsovoulos, Edinburgh University, United Kingdom
  • Dominik Laetsch, Edinburgh University, United Kingdom
  • Lewis Stevens, Edinburgh University, United Kingdom
  • Sujai Kumar, Edinburgh University, United Kingdom
  • Daiki Horikawa, Keio University, Japan
  • Kyoko Ishino, Keio University, Japan
  • Shiori Komine, Keio University, Japan
  • Takekazu Kunieda, Department of Biological Sciences, Graduate School of Science, University of Tokyo, Japan
  • Masaru Tomita, Keio University, Japan
  • Mark Blaxter, Edinburgh University, United Kingdom
  • Kazuharu Arakawa, Keio University, Japan

Presentation Overview: Show

Tardigrada, a phylum of meiofaunal organisms, have been at the center of discussions of the evolution of Metazoa, the role of horizontal gene transfer (HGT) in animal evolution, and the biology of survival in extreme environments. Tardigrada are placed as sisters to Arthropoda and Onychophora in the superphylum Panarthropoda by morphological analyses, but many molecular phylogenies fail to recover this relationship. This tension between molecular and morphological understanding may be very revealing of the mode and patterns of evolution of major groups. Meiofaunal species have been reported to have elevated levels of HGT events, but how important this is in evolution, particularly in the evolution of extremophile physiology, is unclear. Limno-terrestrial tardigrades display extreme cryptobiotic abilities, including anhydrobiosis and cryobiosis. These extremophile behaviors challenge understanding of normal, aqueous physiology: how does a multicellular organism avoid lethal cellular collapse in the absence of liquid water? To address these questions, we resequenced and reassembled the genome of Hypsibius dujardini, a limno-terrestrial tardigrade that can undergo anhydrobiosis only after extensive pre-exposure to drying conditions, and compared it to the genome of Ramazzottius varieornatus, a related species with tolerance to rapid desiccation. Contrasting gene expression responses to desiccation were observed: major transcriptional changes in H. dujardini but limited regulation in R. varieornatus. We identified a limited number of horizontally transferred genes. Whole-genome molecular phylogenies supported a Tardigrada+Nematoda relationship over Tardigrada+Arthropoda, but rare genomic changes tended to support Tardigrada+Arthropoda.

Sequencing 1,000 spiders to elucidate the design mechanisms of spider silk proteins
Date: Tuesday, July 25 3:45 pm - 4:00 pm
Room: Hall 1B
  • Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan
  • Nobuaki Kono, Institute for Advanced Biosciences, Keio University, Japan
  • Masayuki Fujiwara, Institute for Advanced Biosciences, Keio University, Japan
  • Hiroyuki Nakamura, Spiber Inc., Japan
  • Rintaro Ohtoshi, Spiber Inc., Japan
  • Masaru Tomita, Institute for Advanced Biosciences, Keio University, Japan

Presentation Overview: Show

Spider silks are protein materials that exhibit extraordinary mechanical properties including high tensile strength, extensibility, and exceptional toughness, with potential applications in industry as renewable materials. Spiders produce seven types of silks, each composed of specific proteins called spidroins and exhibiting a diverse range of mechanical properties, the majority of which are known to be monophyletic. Therefore, if we can obtain a multitude of spidroin sequences along with the mechanical properties of the corresponding silks from diverse samples across the order Araneae, we can potentially identify quantitative links between sequence motifs and physical properties. To this end, we have sequenced 1,000 spiders using a de novo transcriptome sequencing and assembly strategy. Spider silk genes, however, are extremely long (>10 kbp) and highly repetitive, which makes them impossible to sequence in full length using short reads alone, so we combine multiple sequencing technologies, including PacBio and Oxford Nanopore long reads with Illumina short reads, to overcome this challenge. In this talk, we will discuss the progress of the project and our findings regarding the link between spidroin sequences and their mechanical properties.

How physical interactions between architectural proteins and DNA shape the three dimensional structure of Human genome
Date: Tuesday, July 25 4:00 pm - 4:15 pm
Room: Hall 1B
  • Przemyslaw Szalaj, Centre of New Technologies, University of Warsaw, Poland
  • Zhonghui Tang, Jackson Laboratory for Genomic Medicine, United States
  • Paul Michalski, Jackson Laboratory for Genomic Medicine, United States
  • Michal Pietal, Centre of New Technologies, University of Warsaw, Poland
  • Michal Sadowski, Centre of New Technologies, University of Warsaw, Poland
  • Oskar Luo, Jackson Laboratory for Genomic Medicine, United States
  • Yijun Ruan, Jackson Laboratory for Genomic Medicine, United States
  • Dariusz Plewczynski, Centre of New Technologies, University of Warsaw, Poland

Presentation Overview: Show

ChIA-PET is a high-throughput mapping technology that reveals long-range chromatin interactions and provides insights into the basic principles of spatial genome organization and gene regulation mediated by specific protein factors. Recently, we showed that a single ChIA-PET experiment provides information at all genomic scales of interest, from the high-resolution locations of binding sites and enriched chromatin interactions mediated by specific protein factors, to the low-resolution non-enriched interactions that reflect topological neighborhoods of higher-order chromosome folding. This multilevel nature of ChIA-PET data offers an opportunity to use multiscale 3D models to study structural-functional relationships at multiple length scales, but doing so requires a structural modeling platform. Here we report the development of 3D-GNOME (3-Dimensional GeNOme Modeling Engine), a complete computational pipeline for 3D simulation using ChIA-PET data. 3D-GNOME consists of three integrated components: a graph-distance-based heatmap normalization tool, a 3D modeling platform, and an interactive 3D visualization tool. Using ChIA-PET and Hi-C data derived from human B-lymphocytes, we demonstrate the effectiveness of 3D-GNOME in building 3D genome models at multiple levels, including the entire genome, individual chromosomes, and specific segments at megabase (Mb) and kilobase (kb) resolutions of single average and ensemble structures. Further incorporation of CTCF-motif orientation and high-resolution looping patterns into the 3D simulation increased the reliability of the resulting biologically plausible topological structures. Our work provides a high-resolution, multi-scale computational model of chromatin looping architecture in the human genome.

Reconstructing metastatic seeding patterns of human cancers
Date: Tuesday, July 25 12:00 pm - 12:15 pm
Room: Hall 1B
  • Johannes G. Reiter, Harvard University, United States
  • Alvin P. Makohon-Moore, Memorial Sloan Kettering Cancer Center, United States
  • Jeffrey M. Gerold, Harvard University, United States
  • Ivana Bozic, Harvard University, United States
  • Krishnendu Chatterjee, IST Austria, Austria
  • Christine A. Iacobuzio-Donahue, Memorial Sloan Kettering Cancer Center, United States
  • Bert Vogelstein, Johns Hopkins University School of Medicine, United States
  • Martin A. Nowak, Harvard University, United States

Presentation Overview: Show

Reconstructing the evolutionary history of metastases is critical for understanding their basic biological principles and has profound clinical implications. Genome-wide sequencing data has enabled modern phylogenomic methods to accurately dissect subclones and their phylogenies from noisy and impure bulk tumor samples at unprecedented depth. However, existing methods are not designed to infer metastatic seeding patterns. Here we develop a tool, called Treeomics, to reconstruct the phylogeny of metastases and map subclones to their anatomic locations. Treeomics infers comprehensive seeding patterns for colorectal, endometrial, pancreatic, and prostate cancers. Moreover, Treeomics correctly disambiguates true seeding patterns from sequencing artifacts; 7% of variants were misclassified by conventional statistical methods. These artifacts can skew phylogenies by creating illusory tumor heterogeneity among distinct samples. In silico benchmarking on simulated tumor phylogenies based on established mathematical models of cancer evolution across a wide range of sample purities (15-95%) and sequencing depths (25‑800x) demonstrates the accuracy of Treeomics compared to existing methods.

PubSqueezer: a Text-Mining Tool to Discover Connections in Scientific Literature
Date: Tuesday, July 25 2:00 pm - 2:15 pm
Room: Hall 1B
  • Alberto Calderone, Bioinformatics and Computational Biology Unit Molecular Genetics Laboratory - University of Rome Tor Vergata, Italy
  • Elisa Micarelli, Bioinformatics and Computational Biology Unit Molecular Genetics Laboratory - University of Rome Tor Vergata, Italy
  • Gianni Cesareni, Molecular Genetics Laboratory - University of Rome Tor Vergata, Italy

Presentation Overview: Show

Alberto Calderone (1,2), Elisa Micarelli (1,2), Gianni Cesareni (2)

1. Bioinformatics and Computational Biology Unit. Molecular Genetics Laboratory. University of Rome Tor Vergata
2. Department of Biology, University of Rome Tor Vergata, Rome, Italy
contact: sinnefa@gmail.com

The scientific literature is the entry point to the current understanding of any research topic. Reading several articles about one topic is essential to make mental connections and come up with new hypotheses that are not yet explicitly reported in the scientific literature. This process is not easily carried out by any single scientist reading articles in his/her field of interest. Automatic text analysis can therefore facilitate this and other analyses by speeding up tasks such as keyword identification. Furthermore, the literature is also the first resource for biological database curators (Hirschman et al., 2012), who can benefit significantly from automatic approaches.
Text mining (TM) is the process of extracting information from documents by identifying text patterns via statistical and computational approaches. Some TM tools, such as SciLite (Venkatesan et al., 2016), aim at keyword highlighting. While keyword identification is useful for spotting important words and phrases when reading articles, much information is scattered across many articles and can often only be derived by mental abstraction. To support this process, we developed PubSqueezer, a tool that analyzes and integrates multiple articles in order to extract information that is not explicitly written.

First, we used PubSqueezer to assess how well an explicitly written topic can be derived from a given set of articles. To this end, we automatically downloaded publications about several different pathways and, using KEGG (Kanehisa and Goto, 2000) as a reference, we evaluated how well PubSqueezer was able to abstract each pathway.
Secondly, we investigated whether PubSqueezer was able to abstract implicit information. We selected two linked biological processes: Alzheimer’s Disease and Lipid Metabolism (Calderone et al., 2016). After downloading papers about Alzheimer’s Disease and filtering out lipid-related publications, we confirmed that Lipid Metabolism was still derivable from the remaining publications.
Lastly, by intersecting the results obtained from the analysis of articles describing two different topics, we noticed that it is also possible to make plausible connections that are not yet explicitly reported in any paper.
These preliminary results suggest that by processing a collection of articles together it is possible to automatically describe their main content, thus facilitating information acquisition and, at the same time, offering new insights.

Publication abstracts and titles are retrieved from PubMed according to user-defined criteria. These documents are preprocessed to remove unnecessary terms, then lemmatized and tokenized. Next, PubSqueezer searches the downloaded articles for a user-defined list of words, which are then ranked and analyzed. For the use cases reported here, we used a dictionary of gene names as the list of terms, and we performed a Gene Set Enrichment Analysis to identify Gene Ontology terms that are specifically enriched in the top-ranking genes.
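
The term search and ranking step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not PubSqueezer's code: the stopword list, the ranking criterion (number of abstracts mentioning a term), the function names and the toy abstracts are all hypothetical, and the PubMed retrieval, lemmatization and enrichment analysis are left out.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "of", "and", "in", "is", "a", "to", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def rank_terms(abstracts, dictionary):
    """Rank dictionary terms by the number of abstracts that mention them."""
    wanted = {term.lower() for term in dictionary}
    counts = Counter()
    for text in abstracts:
        # Count each term at most once per abstract.
        counts.update(set(tokenize(text)) & wanted)
    return counts.most_common()


# Hypothetical toy inputs.
abstracts = [
    "APOE variants alter lipid transport in the brain.",
    "APP processing and APOE interact in amyloid clearance.",
    "APOE genotype modifies tau pathology.",
]
ranking = rank_terms(abstracts, ["APOE", "APP", "MAPT"])
```

Terms that never occur (here `MAPT`) do not appear in the ranking; the top-ranked terms would be the input to a downstream enrichment analysis.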


Calderone, A., Formenti, M., Aprea, F., Papa, M., Alberghina, L., Colangelo, A.M., and Bertolazzi, P. (2016). Comparing Alzheimer’s and Parkinson’s diseases networks using graph communities structure. BMC Syst. Biol. 10, 25.
Hirschman, L., Burns, G.A.P.C., Krallinger, M., Arighi, C., Cohen, K.B., Valencia, A., Wu, C.H., Chatr-Aryamontri, A., Dowell, K.G., Huala, E., et al. (2012). Text mining for the biocuration workflow. Database 2012, bas020.
Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30.
Venkatesan, A., Kim, J.-H., Talo, F., Ide-Smith, M., Gobeill, J., Carter, J., Batista-Navarro, R., Ananiadou, S., Ruch, P., McEntyre, J., et al. (2016). SciLite: a platform for displaying text-mined annotations as a means to link research articles with biological data. Wellcome Open Res. 1, 25.

Molecular biology of body-size variation: from evolution to human disease
Date: Tuesday, July 25 3:15 pm - 3:30 pm
Room: Hall 1B
  • Maria Chikina, University of Pittsburgh, United States
  • Nathan Clark, University of Pittsburgh, United States
  • Amanda Kowalczyk, University of Pittsburgh, United States

Presentation Overview: Show

Mammals differ in body size and longevity by several orders of magnitude, and this variation has important molecular consequences that are relevant to understanding human disease. For example, as large and long-lived animals undergo many more cell divisions yet do not have higher cancer rates, they must possess molecular adaptations that reduce the carcinogenesis rate. Using our recently developed method of correlating evolutionary rates with phenotypes, we develop a pipeline to systematically analyze how changes in body mass alter the evolutionary constraint of individual genes. We find that body mass explains a large fraction of evolutionary rate variation. Moreover, genes most associated with body mass are strongly enriched in several pathways with direct relevance to human disease, such as cell cycle, IGF signaling, lipid metabolism and blood clotting. Our method provides a unique evolutionary view of gene function and could be used to gain insight into complex multi-tissue processes.
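
The general idea of correlating evolutionary rates with a phenotype can be illustrated with a small Python sketch. The abstract does not specify the authors' statistic, so a plain Pearson correlation is used here as a stand-in; the per-branch values, gene names and function name are hypothetical, and real analyses would also need to account for phylogenetic structure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical toy data: change in log body mass along five branches of a tree,
# and each gene's relative evolutionary rate on the same branches.
mass_change = [-1.0, -0.5, 0.0, 0.5, 1.0]
relative_rate = {
    "geneA": [0.8, 0.4, 0.1, -0.3, -0.9],  # slows down as body mass increases
    "geneB": [0.1, -0.2, 0.3, -0.1, 0.0],  # no clear relationship
}

scores = {gene: pearson(mass_change, rates) for gene, rates in relative_rate.items()}
# Genes with the strongest association (positive or negative) come first.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
```

In this toy example `geneA` shows a strong negative correlation, meaning stronger constraint on branches where body mass increases, and would rank ahead of `geneB` for pathway enrichment.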