Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Accepted Posters

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.


Track: Machine Learning Systems Biology (MLSB)

Session B-190: Pioneering topological methods for network-based drug-target prediction by exploiting a brain-network self-organization theory
COSI: MLSB
  • Claudio Durán, BIOTEC, Germany
  • Simone Daminelli, BIOTEC, Germany
  • Josephine Thomas, BIOTEC, Germany
  • Joachim Haupt, BIOTEC, Germany
  • Michael Schroeder, BIOTEC, Germany
  • Carlo Vittorio Cannistraci, BIOTEC, Germany

Short Abstract: The bipartite network representation of the drug-target interactions (DTIs) in a biosystem enhances understanding of the drugs' multifaceted modes of action, suggests therapeutic switching for approved drugs and unveils possible side effects. Since experimental testing of DTIs is costly and time consuming, computational predictors are of great aid. Here, for the first time, state-of-the-art supervised DTI predictors custom-made in network biology were compared, using standard and innovative validation frameworks, with unsupervised, purely topology-based models designed for general-purpose link prediction in bipartite networks. Surprisingly, our results show that the bipartite topology alone, if adequately exploited by means of the recently proposed local-community-paradigm (LCP) theory (initially detected in brain-network topological self-organization and afterward generalized to any complex network), is able to suggest highly reliable predictions, with performance comparable to the state-of-the-art supervised methods that exploit additional (nontopological, for instance biochemical) drug-target interaction knowledge. Furthermore, a detailed analysis of the novel predictions revealed that each class of methods prioritizes distinct true interactions, hence combining methodologies based on diverse principles represents a promising strategy to improve drug-target discovery. To conclude, this study promotes the power of bioinspired computing, demonstrating that simple unsupervised rules inspired by principles of topological self-organization and adaptiveness arising during learning in living intelligent systems (like the brain) can match the performance of complicated algorithms based on advanced, supervised and knowledge-based engineering.
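
As a rough illustration of what purely topological link prediction in a bipartite network can look like, the sketch below scores each unobserved drug-target pair by the number of length-3 paths connecting it. This is a generic heuristic chosen for brevity, not the LCP-based indices of the abstract (which are not specified here); the function name `three_path_scores` and the toy edge list are hypothetical.

```python
from itertools import product


def three_path_scores(edges):
    """Score every unobserved drug-target pair by its number of length-3 paths.

    Generic topological heuristic for illustration only; it is NOT the
    LCP-based index described in the abstract.
    """
    drug_nbrs, target_nbrs = {}, {}
    for drug, target in edges:
        drug_nbrs.setdefault(drug, set()).add(target)
        target_nbrs.setdefault(target, set()).add(drug)

    scores = {}
    for drug, target in product(sorted(drug_nbrs), sorted(target_nbrs)):
        if target in drug_nbrs[drug]:
            continue  # already a known interaction, nothing to predict
        count = 0
        # Count paths drug - t2 - d2 - target that pass through a second drug d2.
        for t2 in drug_nbrs[drug]:
            for d2 in target_nbrs[t2]:
                if d2 != drug and target in drug_nbrs[d2]:
                    count += 1
        scores[(drug, target)] = count
    return scores


# Hypothetical toy network of known drug-target interactions.
edges = [("d1", "t1"), ("d1", "t2"), ("d2", "t1"), ("d2", "t3"), ("d3", "t3")]
scores = three_path_scores(edges)
ranked = sorted(scores.items(), key=lambda item: -item[1])
```

Pairs with higher counts would be proposed first as candidate interactions; a supervised predictor would instead add drug or target features to this purely structural signal.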

Session B-192: Kernelized Rank Learning for Personalized Drug Recommendation
COSI: MLSB
  • Xiao He, ETH Zurich, Switzerland
  • Lukas Folkman, ETH Zurich, Switzerland
  • Karsten Borgwardt, ETH Zurich, Switzerland

Short Abstract: Large-scale screenings of cancer cell lines with detailed genomic profiles against libraries of pharmacological compounds are currently being performed in order to gain a better understanding of the genetic component of drug response and to enhance our ability to predict drug sensitivity from genetic profiles. These screens differ from the clinical setting in which (1) medical records only contain the response of a patient to very few drugs, and in which (2) selecting the most promising out of all therapies is more important than accurately predicting the sensitivity to the given drug. Current regression models for drug sensitivity prediction fail to account for these two properties. We present a machine learning approach, named Kernelized Rank Learning (KRL), that ranks drugs based on their predicted effect per patient, circumventing the difficult problem of precisely predicting the sensitivity to the given drug. Our approach outperforms several state-of-the-art predictors in drug recommendation, particularly in a clinically-relevant case where few training data are available.

Session B-194: MEMNAR: Finding Mutually Exclusive Mutation Sets through Negative Association Rule Mining
COSI: MLSB
  • Iman Deznabi, Bilkent University, Turkey
  • Ahmet Alparslan Celik, Bilkent University, Turkey
  • Oznur Tastan, Bilkent University, Turkey

Short Abstract: It has been reported in multiple cancers that certain sets of gene mutations tend not to occur concurrently in the same patient. This mutual exclusivity pattern hints at a functional relation and can help uncover cancer-driver alterations. We address the problem of discovering mutually exclusive mutation gene sets through mining negative association rules. Our proposed algorithm, MEMNAR, efficiently mines for negative association rules in patient mutation data and constructs mutually exclusive gene sets based on these extracted rules with high accuracy. We also define and detect more complex mutual exclusivity patterns that have not been addressed in earlier approaches. Evaluations on simulated data sets demonstrate that MEMNAR can discover mutually exclusive gene sets faster and with improved accuracy compared to the state-of-the-art methods. When we apply MEMNAR to breast cancer data, we identify several mutually exclusive gene sets that are biologically relevant, some of which have not been reported in the literature.
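
To make the notion of a negative association rule concrete, the sketch below computes the support and confidence of a rule of the form "gene A mutated implies gene B not mutated" over a set of patients. It only illustrates the rule statistics, not the MEMNAR mining procedure itself; the function name and the patient data are hypothetical.

```python
def negative_rule_stats(mutations, gene_a, gene_b):
    """Support and confidence of the negative rule gene_a -> NOT gene_b.

    `mutations` is a list with one set of mutated genes per patient.
    """
    n_patients = len(mutations)
    with_a = [genes for genes in mutations if gene_a in genes]
    a_without_b = [genes for genes in with_a if gene_b not in genes]
    support = len(a_without_b) / n_patients
    confidence = len(a_without_b) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical mutation calls for six patients.
patients = [
    {"TP53"},
    {"TP53"},
    {"PIK3CA"},
    {"PIK3CA"},
    {"TP53", "GATA3"},
    set(),
]
support, confidence = negative_rule_stats(patients, "TP53", "PIK3CA")
```

In this toy data TP53 and PIK3CA never co-occur, so the rule has confidence 1.0; a mutually exclusive gene set would be assembled from rules that hold strongly in both directions.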

Session B-196: A scalable algorithm for calibrating mathematical models of biochemical pathways using steady state perturbation response data
COSI: MLSB
  • Tapesh Santra, Systems Biology Ireland, University College Dublin, Ireland

Short Abstract: Advances in array-based omic technology have made it possible to simultaneously measure steady-state responses of several components of biochemical pathways to multiple external perturbations. Such data are typically used to reconstruct topological models of these pathways, but are rarely used for calibrating more detailed ordinary differential equation (ODE) based mathematical models designed to simulate their dynamics. This is because existing calibration algorithms require simulating all perturbation experiments for several sets of parameter values, and then choosing the set of values that provides the closest match between experimental and simulated perturbation response data. This is a highly computation-intensive process and is sometimes infeasible due to a lack of sufficient knowledge of the experimental perturbations. Here, I propose an algorithm which can calibrate ODE models of biochemical pathways using steady-state perturbation responses (SSPRs), without explicitly simulating perturbation experiments during the calibration process. Consequently, it does not require detailed knowledge of the mechanisms or effects of experimental perturbations. It is also highly parallelizable and therefore scalable. I have demonstrated the capabilities and shortcomings of this algorithm using both simulated and real perturbation responses of the Mitogen Activated Protein Kinase (MAPK) pathway.

Session B-198: PAC Learning of Thomas Regulatory Networks from Time-Series Data
COSI: MLSB
  • Arthur Carcano, ENS, France
  • François Fages, Inria Saclay-Ile de France, France
  • Jérémy Grignard, Inria Saclay-Ile de France, France
  • Sylvain Soliman, Inria Saclay-Ile de France, France

Short Abstract: Automating the process of model building from experimental data is a very desirable goal to palliate the lack of modellers for many applications. However, despite the spectacular progress of machine learning techniques in data analytics, classification, clustering and prediction making, learning dynamical models from time-series data is still challenging. In this paper we investigate the use of the Probably Approximately Correct (PAC) learning framework of Leslie Valiant as a method for the automated discovery of influence models of biochemical processes from Boolean and stochastic traces. We show that Thomas’ Boolean influence systems can be naturally represented by k-CNF formulae and learned from time-series data with a quasi-linear number of Boolean activation samples per species, and that positive Boolean influence systems can be represented by monotone DNF formulae and learned actively with both activation samples and oracle calls. We evaluate the performance of this approach on a model of T-lymphocyte differentiation, with and without prior knowledge, and discuss its merits as well as its limitations with respect to realistic experiments.
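
The k-CNF learning step mentioned above can be sketched with Valiant's classic elimination procedure: start from every clause of at most k literals and discard each clause that a positive sample falsifies. The code is a minimal illustration under the assumption that positive samples are Boolean states in which a species was observed to activate; the function names and the toy target formula are hypothetical and not taken from the authors' implementation.

```python
from itertools import combinations, product


def learn_k_cnf(positive_samples, n_vars, k):
    """Return all clauses of at most k literals satisfied by every positive sample.

    A literal is (variable index, required value); a clause is a tuple of
    literals, read as their disjunction. The hypothesis is the conjunction
    of the returned clauses.
    """
    clauses = set()
    for size in range(1, k + 1):
        for variables in combinations(range(n_vars), size):
            for values in product([False, True], repeat=size):
                clauses.add(tuple(zip(variables, values)))
    # Elimination: drop every clause that some positive sample falsifies.
    for sample in positive_samples:
        clauses = {
            clause for clause in clauses
            if any(sample[var] == value for var, value in clause)
        }
    return clauses


def evaluate(clauses, sample):
    """Evaluate the conjunction of clauses on one Boolean state."""
    return all(
        any(sample[var] == value for var, value in clause)
        for clause in clauses
    )


# Hypothetical target: activation iff x0 AND (x1 OR x2), a 2-CNF formula.
positives = [(True, True, False), (True, False, True), (True, True, True)]
hypothesis = learn_k_cnf(positives, n_vars=3, k=2)
```

The number of candidate clauses grows as O(n^k), which is why k is kept small; the learned hypothesis never accepts a state that the true k-CNF target rejects, and it converges to the target as more positive samples are seen.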

Session B-200: Partially Ordered Expression Features Improves Survival Prediction in Cancer
COSI: MLSB
  • Mustafa Buyukozkan, Institute of Computational Biology Helmholtz Zentrum Muenchen, Germany
  • Halil Ibrahim Kuru, Bilkent University, Turkey
  • Oznur Tastan, Bilkent University, Turkey

Short Abstract: Predicting the survival of cancer patients is critical for choosing patient-specific treatment strategies. Survival prediction has traditionally been based on clinical or pathological factors such as patient age and tumor stage. With the availability of high-throughput data, expression quantities are also incorporated in the models. The survival models that are built with molecular expression profiles rely on the individual expression quantities of the molecules in the tumors. However, in the cell, molecules interact with each other, and in cancer these interactions are dysregulated in various ways. A better representation of the molecular abundance that accounts for these dysregulations has the potential to increase the predictive performance of survival models and help reach biomarkers that go beyond individual molecules. To reach results that are biologically relevant and readily interpretable, we suggest using partial ordering of the expression quantities in lieu of individual expression values. In this work, we focused on protein expression data as it is more stable; however, the same framework is applicable to other molecular types as well. We built random survival forest (RSF) models with partial order features of protein expression data and compared them with models trained with individual protein expression features in 8 different cancer types. The results demonstrate that partial order features have better predictive performance in the majority of the cancers. Accounting for order dysregulation of proteins unveils predictive features with direct relevance to the biological mechanism of cancer.
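
One simple way to turn expression values into order-based features is to record, for each pair of molecules, which one is more abundant in a given sample. The sketch below assumes this pairwise encoding, which may differ from the exact partial-order representation used by the authors; the function name, protein names and values are hypothetical.

```python
from itertools import combinations


def pairwise_order_features(expression):
    """Encode one sample as binary features 'A>B' for every pair of molecules.

    `expression` maps a molecule name to its measured abundance in one sample.
    """
    names = sorted(expression)
    return {
        f"{a}>{b}": int(expression[a] > expression[b])
        for a, b in combinations(names, 2)
    }


# Hypothetical protein abundances for one tumor sample.
sample = {"AKT": 1.2, "EGFR": 0.4, "PTEN": 2.0}
features = pairwise_order_features(sample)
# features == {"AKT>EGFR": 1, "AKT>PTEN": 0, "EGFR>PTEN": 0}
```

Because each feature depends only on the relative order of two measurements, it is unchanged by any monotone rescaling of a sample, which is one reason order-based features can be easier to interpret than raw values.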

Session B-202: Fast Imputation of Summary Statistics Based on Local LD Structure
COSI: MLSB
  • Matteo Togninalli, ETH Zürich, Switzerland
  • Damian Roqueiro, ETH Zürich, Switzerland
  • Karsten Borgwardt, ETH Zürich, Switzerland

Short Abstract: GWAS meta-analyses have become the predominant way to improve the statistical power of association studies. Prior to conducting a meta-analysis, researchers often rely on genotype imputation to obtain a perfect overlap of the studied SNPs. This procedure is subject to limitations, and thus imputation from summary statistics has emerged as a promising alternative because i) it does not rely on the original genotype data and ii) its computational burden is lower than that of genotype imputation. Here we present a novel summary statistics imputation method that relies on local linkage disequilibrium structure to compute the imputed values. Our method runs faster than competing techniques while matching their accuracy.
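
For orientation, a widely used formulation of summary statistics imputation estimates the z-score of an untyped SNP as a weighted combination of nearby typed z-scores, with weights derived from the local LD (correlation) matrix: z_u = Σ_ut (Σ_tt + λI)^-1 z_t. The sketch below implements that generic formula in plain Python; it is an assumption that the poster's method resembles it, and the function names, the ridge parameter and the toy LD values are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def impute_z(ld_typed, ld_untyped_typed, z_typed, ridge=0.1):
    """Impute one untyped SNP's z-score from typed SNPs in the same LD window.

    ld_typed:         LD matrix among typed SNPs (Sigma_tt)
    ld_untyped_typed: LD between the untyped SNP and each typed SNP (Sigma_ut)
    ridge:            regularisation added to the diagonal of Sigma_tt
    """
    n = len(z_typed)
    regularised = [
        [ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    weighted = solve(regularised, z_typed)  # (Sigma_tt + ridge * I)^-1 z_t
    return sum(r * w for r, w in zip(ld_untyped_typed, weighted))


# Hypothetical window with two typed SNPs and one untyped SNP.
z_imputed = impute_z([[1.0, 0.5], [0.5, 1.0]], [0.8, 0.4], [2.0, 1.0], ridge=0.0)
```

Restricting the computation to a local window keeps Σ_tt small, which is where the speed advantage over genotype-level imputation comes from.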

Session B-204: Gaussian processes for identifying branching dynamics in single cell data
COSI: MLSB
  • Alexis Boukouvalas, University of Manchester, United Kingdom
  • James Hensman, Prowler.io, United Kingdom
  • Magnus Rattray, University of Manchester, United Kingdom

Short Abstract: Single cell gene quantification allows for the analysis of heterogeneous cell populations and the analysis of the whole transcriptome without the need for a priori gene target selection. Identifying branching dynamics in cell populations undergoing differentiation is computationally challenging due to lack of time course data and high technical and biological noise. We develop the branching Gaussian process (BGP), a non-parametric flexible model that is able to robustly identify branching dynamics on an individual gene level whilst also providing an uncertainty estimate of the branching times.

Session B-206: Exploiting the Structure of Random Forest for the Detection of Epistatic Interactions
COSI: MLSB
  • Corinna Lewis Schmalohr, Quantitative Molecular Biology, Center for Molecular Medicine Cologne (CMMC), University of Cologne, Germany
  • Jan Großbach, University of Cologne, Germany
  • Andreas Beyer, University of Cologne, Germany
  • Mathieu Clément-Ziza, CMMC, University of Cologne, Germany

Short Abstract: Epistasis (non-additive genetic interaction) is one possible cause for the 'missing heritability' that is observed for many complex traits. However, there is a lack of sensitive methods for the detection of epistasis. We propose four approaches for the detection of epistasis, which are based on the machine learning algorithm Random Forest: the split asymmetry, the selection asymmetry, the paired selection frequency, and an ensemble method that combines the first three approaches. We assess the performance of these methods on simulated and real data, comparing them to the commonly used exhaustive pair-wise ANOVA approach. Our scores perform better than ANOVA on both simulated and real data and we discuss possible reasons for the performance differences. This work contributes to the long-standing problem of extracting information about the underlying model from a Random Forest.

Session B-208: Deep50: Web service for multi-task protein-ligand interaction prediction
COSI: MLSB
  • Adam Arany, University of Leuven, Belgium
  • Jaak Simm, University of Leuven, Belgium
  • Yves Moreau, University of Leuven, Belgium

Short Abstract: The prediction of drug and protein interactions is crucial for the development of new drugs. We designed and developed a machine learning model and a coupled web service infrastructure to serve queries about compound activities. The service can handle hundreds of queries in real time. We evaluated the predictive performance of the underlying model, obtaining a competitive mean AUC of 0.930 over different proteins.

Session B-210: Scaling up probabilistic pseudotime estimation with the GPLVM
COSI: MLSB
  • Sumon Ahmed, University of Manchester, United Kingdom
  • Alexis Boukouvalas, University of Manchester, United Kingdom
  • Magnus Rattray, University of Manchester, United Kingdom

Short Abstract: The analysis of single cell genomics data promises to reveal novel states of complex biological processes, but is challenging due to inherent biological and technical noise. We propose a probabilistic approach based on a sparse variational Bayesian Gaussian process latent variable model (GPLVM) to perform robust pseudotime estimation whilst allowing for the incorporation of prior information such as cell capture times. The model converges an order of magnitude faster than existing methods whilst achieving similar levels of estimation accuracy. We demonstrate the flexibility of our approach by extending the model to higher-dimensional latent spaces that can be used to simultaneously infer pseudotime and branching structure.

Session B-212: Smart systems for model exploration with application in computational systems biology
COSI: MLSB
  • Fredrik Wrede, Uppsala University, Sweden
  • Andreas Hellander, Uppsala University, Sweden

Short Abstract: Large computational experiments involving parameter sweep applications (PSAs), where a simulator acts much like a "black box", can be used for e.g. robustness analysis of underlying models, uncertainty quantification, computational design, and model exploration. However, for complex models involving many parameters and with little a priori knowledge, such sweeps can become massive. This makes it impossible to manually analyze the results and organize the data. We propose a workflow for smart and efficient scientific discovery in parameter sweep studies associated with artificial temporal data simulated from applications in systems biology.

Session B-214: TOPSPIN: a novel algorithm to predict treatment specific survival in cancer
COSI: MLSB
  • Joske Ubels, UMC Utrecht, Netherlands
  • Erik Van Beers, SkylineDx, Netherlands
  • Pieter Sonneveld, Erasmus MC Cancer Institute, Netherlands
  • Martin van Vliet, SkylineDx, Netherlands
  • Jeroen de Ridder, UMC Utrecht, Netherlands

Short Abstract: Introduction. It is increasingly recognized that the successful treatment of cancer is hampered by genetic heterogeneity of the disease and that a more personalized approach is needed. Differences in the genetic makeup between tumors can result in a different response to treatment (Burrell et al., 2013). As a result, despite the wide range of efficient cancer treatments available (Block et al., 2015), many therapies only benefit a minority of the patients that receive them, while they are associated with very serious side effects. Therefore, there is a great clinical need for tools to predict, at the moment of diagnosis, which patient will benefit most from which treatment. This requires the discovery of markers, like gene expression signatures, that are informative about treatment response. The first gene expression signature that was shown to successfully predict prognosis in cancer was a 70-gene breast cancer signature (Van ‘t Veer et al., 2002). In recent years, many more gene expression signatures that can distinguish molecular subtypes of cancer or predict a favourable prognosis have been published, for a wide variety of cancers. However, by definition, prognostic signatures predict survival irrespective of the treatment given. To aid treatment decisions we need to discover a predictive signature, i.e. a marker that can predict survival depending on which treatment is given. Building a classifier for predictive purposes poses a unique challenge, which is not encountered when inferring prognostic classifiers. Most methods for defining a prognostic classifier rely on a supervised learning approach. In these methods a label is defined for each patient based on their survival or some other outcome measure, like the risk of experiencing a relapse. The training procedure then focuses on predicting these labels as accurately as possible to ultimately produce a classifier that can predict outcome for a new patient.
However, when building a predictive classifier we aim to predict whether a patient will have a better prognosis when given a certain treatment of interest as compared to a different treatment. This means that labels defined solely on survival data will be inadequate, since it is impossible to know whether a patient would have had a different outcome when given an alternative treatment. A patient with a favourable outcome when given the treatment of interest may have responded as well to any other treatment. Conversely, a patient may have a shorter than average survival even when treated with the optimal treatment; any other treatment would have resulted in an even worse prognosis. The absence of predefined labels makes existing methods for building gene expression signatures unsuitable for this problem and thus a novel approach is needed. To address this challenge we introduce a new algorithm, TOPSPIN (Treatment Outcome Prediction using Similarity between PatIeNts), that derives a classifier able to distinguish a subset of patients with improved treatment outcome from the treatment of interest, but not the comparator treatment. Uniquely, TOPSPIN integrates the process of defining labels and building a classifier, eliminating the need to predefine labels based on survival alone. The fundamental idea of our approach is that we can estimate a patient’s treatment benefit by comparing their survival to that of a set of genetically similar patients who received the comparator treatment. Patients with a large survival difference can then act as prototype patients: new patients with a similar gene expression profile should also benefit from receiving the treatment of interest. These prototype patients are simultaneously used to define the classifier and the labels. In this work we focus on multiple myeloma (MM), which is a clonal B-cell malignancy that is characterized by abnormal proliferation of plasma cells in both the bone marrow and the extramedullary sites.
Median survival is 5 years (Howlader, 2016). In the last two decades many novel therapies have been introduced for MM, resulting in improved survival. However, response rates remain low and there is no clear indication of what determines treatment response. This is complicated by the fact that MM is very heterogeneous, both between and within patients (Lohr et al., 2014). Especially in MM, predictive signatures could be of great benefit.

Methods. Data. We pooled gene expression and survival data from three phase III trials: Total Therapy 2 (TT2, GSE2658), Total Therapy 3 (TT3, GSE2658) and HOVON-65/GMMG-HD4 (H65, GSE19784). In our analyses of the pooled data two treatment arms were considered: a bortezomib arm, which comprises the PAD arm from H65 and TT3, and a non-bortezomib arm, which comprises the VAD arm from H65 and TT2. Combined, these datasets include 910 patients, of whom 407 received bortezomib and 503 did not. We split the dataset into a training set (n = 606) and a test set (n = 304). This test set is not used at any point in the training procedure and acts as an independent validation set to assess the performance of the final classifier. Progression free survival (PFS) was used as the outcome measure.

Algorithm. TOPSPIN aims to predict whether or not a patient benefits from a certain treatment of interest based on the gene expression profile of the patient. In order to train this classifier, we split the training set into three equal folds (A, B and C). We first define a ranked list of prototype patients on fold A (Step 1) that exhibit a better than expected prognosis compared to a set of genetically similar patients that received the opposite treatment. In Step 2, a decision boundary around a selection of prototype patients is determined on fold B. Patients who lie within this decision boundary are expected to show a favourable outcome when receiving bortezomib and make up class F.
All other patients are considered class N and are not expected to benefit from receiving bortezomib. Because it is a priori unknown based on which genes patient similarity should be defined, Steps 1 and 2 are performed for a large number of functionally coherent gene sets obtained from the Gene Ontology annotation, yielding one classifier per gene set. Steps 1 and 2 are repeated k times for all n gene sets, which ultimately results in n * k classifiers. In an approach based on the boosting principle, the individual classifiers are combined to construct a more robust final classifier. These classifiers are applied separately to the samples in fold C, which act as out-of-bag (OOB) samples. Since across the repeats all samples are included in fold C, this will give an independent classification per gene set for all patients included in the training dataset. The performance of a classifier is defined by the Hazard Ratio (HR) found between the two treatment arms within class F. Since not all the trained classifiers will be equally successful in identifying the subset of patients that benefits from the treatment of interest, a threshold S is set in Step 3, which determines which classifiers will participate in the final classifier: a classifier is included only if its performance is below a certain HR. We base this threshold on a 50/50 mixture of the performances obtained on fold B and fold C, the OOB samples. This defines a binary vector x for each patient, where $x_s$ has a value of 1 if the patient belongs to class F according to the s-th classifier and 0 otherwise. A classification score is defined for a patient i based on x:

$\text{Classification Score}_i = \frac{\sum_{1}^{s} x_s}{s}$

with s being the number of classifiers contributing to the final classifier. On this score a threshold T is set, which determines whether a patient is expected to benefit from the treatment of interest. These steps are described in more detail below.
Step 1 - Prototype ranking on fold A. For each patient receiving the treatment of interest, the treatment benefit is defined as

$\Delta PFS_i = \frac{1}{n} \sum_{j \in O} (PFS_i - PFS_j)$,

where O is the set of the n most similar patients (based on Euclidean distance) that did not receive the treatment of interest. We use n = 10. ΔPFS is only calculated for neighbor pairs where it is clear which patient experienced an event first; if both are censored, ΔPFS is not computed. To correct for the fact that a patient with a long survival time will, on average, have a large ΔPFS irrespective of its relative treatment benefit compared to genetically similar patients, we define the z-normalized zPFS score as

$zPFS_i = \frac{\Delta PFS_i - \mu(RPFS_i)}{\sigma(RPFS_i)}$,

where RPFS is a distribution of 1000 random ΔPFS scores, obtained by calculating ΔPFS for randomly chosen sets O, i.e. determining treatment benefit with respect to random patients instead of genetically similar patients. Based on the zPFS score all patients in fold A that were given the treatment of interest can be ranked.

Step 2 - Classifier definition on fold B. Classifier Q is defined by a subset of z top-ranked prototypes along with a decision boundary defined in terms of the Euclidean distance γ around a prototype. A patient is classified as class F when it lies within γ of any of the top z prototypes. The optimal values for z and γ are those resulting in the lowest Hazard Ratio (HR) in class F (the patient group in which the treatment of interest should have a better survival). We additionally constrain z and γ such that class F comprises at least 20% of the dataset. The number of prototypes was restricted to 3 to prevent defining an extremely complicated classifier. The search grid for parameter γ was made dependent on the local density of the neighbors, and consisted of the sorted list of Euclidean distances between the prototype and its neighbors.
The optimal z and γ combination is chosen so that the HR in class F is minimal, with a preference for a HR associated with a p-value < 0.05.

Step 3 - Set thresholds S and T. A threshold S, which determines which classifiers are included in the final classifier, is optimized. Any classifier that resulted in a HR higher than S is excluded, with the options for S ranging from 1 to 0.3 in steps of 0.025. To utilize the information gained in fold C, but prevent overtraining, the performance used is alternately the HR found on fold B and the HR found on fold C. For each possible threshold S, a threshold T is also optimized. This threshold T is set on the Classification Score to define class F in the final classifier. The combination of S and T that leads to the HR associated with the lowest p-value in class F, given that class F comprises at least 20% of the dataset, is chosen.

Results. The optimal threshold S found was 0.45, with a threshold T of 0.3. With this threshold S, in total 14 150 classifiers were included in the final classifier. Applying the final classifier resulted in a HR of 0.43 ($p = 1 \times 10^{-4}$) in the training set within class F, based on the classification of the OOB samples. More importantly, when applied to the independent test set, a HR of 0.5 (p = 0.04) between the two treatment arms was found, demonstrating TOPSPIN's ability to identify the subset of patients benefitting from bortezomib. It is important to note that in class N HRs of 0.9 and 0.97 were found in the training and test set, respectively. These patients did not experience any benefit from receiving bortezomib and could thus possibly have been spared the treatment and side effects.

Conclusion. Here we have demonstrated TOPSPIN's ability to identify a subset of MM patients that benefit from the proteasome inhibitor bortezomib. TOPSPIN is however not specific to MM and can be used on any dataset with two randomized treatment arms and a continuous outcome measure.
Considering the often low response rates combined with the serious side effects of current cancer therapies, TOPSPIN therefore offers an important step towards realistic personalization of cancer medicine.

References.
Block, K. I., et al. (2015). Designing a broad-spectrum integrative approach for cancer prevention and treatment. Seminars in Cancer Biology, 35, S276–S304. doi:10.1016/j.semcancer.2015.09.007
Burrell, R. A., et al. (2013). The causes and consequences of genetic. Nature, 501, 338–345. doi:10.1038/nature12625
Howlader, N., et al. (2016). SEER Cancer Statistics Review, 1975-2013. In National Cancer Institute. Bethesda, MD. Retrieved from http://seer.cancer.gov/csr/1975_2013/
Lohr, J. G., et al. (2014). Widespread genetic heterogeneity in multiple myeloma: Implications for targeted therapy. Cancer Cell, 25, 91–101. doi:10.1016/j.ccr.2013.12.015
Van ’t Veer, L. J., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530–536. Retrieved from http://dx.doi.org/10.1038/415530a
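
The prototype ranking of Step 1 and the final classification score in the TOPSPIN abstract can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes every PFS time is an observed event (the censoring rule described in Step 1 is ignored), and the patient records, field names and function names are hypothetical.

```python
import math
import random
import statistics


def delta_pfs(pfs_i, comparator_pfs):
    """Mean PFS difference between patient i and a set O of comparator patients."""
    return sum(pfs_i - pfs_j for pfs_j in comparator_pfs) / len(comparator_pfs)


def nearest_comparators(profile_i, comparators, n):
    """Pick the n comparator-arm patients closest in Euclidean distance."""
    ranked = sorted(comparators, key=lambda c: math.dist(profile_i, c["profile"]))
    return ranked[:n]


def z_pfs(patient, comparators, n=10, n_random=1000, seed=0):
    """z-normalise delta-PFS against delta-PFS values from random comparator sets.

    Simplification: censoring is ignored, unlike in the abstract.
    """
    rng = random.Random(seed)
    neighbours = nearest_comparators(patient["profile"], comparators, n)
    observed = delta_pfs(patient["pfs"], [c["pfs"] for c in neighbours])
    random_scores = []
    for _ in range(n_random):
        random_set = rng.sample(comparators, n)
        random_scores.append(delta_pfs(patient["pfs"], [c["pfs"] for c in random_set]))
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)  # assumed non-zero for real data
    return (observed - mu) / sigma


def classification_score(votes):
    """Fraction of contributing classifiers that assign the patient to class F."""
    return sum(votes) / len(votes)


# Hypothetical one-gene profiles: comparators near the patient have short PFS.
comparators = [{"profile": [float(i)], "pfs": float(i)} for i in range(1, 21)]
patient = {"profile": [0.0], "pfs": 30.0}
score = z_pfs(patient, comparators, n=3, n_random=200)
```

A positive zPFS means the patient did better relative to genetically similar comparator patients than relative to random ones, which is the property used to rank prototypes; the classification score is then compared against the threshold T.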

Session B-216: Knowledge Driven Graph Evolution (KDGE)
COSI: MLSB
  • Federico Tomasi, DIBRIS - Università degli Studi di Genova, Italy
  • Margherita Squillario, DIBRIS-University of Genoa, Italy
  • Annalisa Barla, DIBRIS - Università di Genova, Italy

Short Abstract: Understanding complex biological phenomena is a difficult task, considering the interplay among molecular variables. Common approaches rely on data to build a robust statistical model. Usually, a set of variables able to characterise the biological conditions under analysis is identified, and then enrichment analysis is used a posteriori in order to give a functional assessment of the selected variables. In this case, prior knowledge is used to validate the result instead of leading the analysis from the start. In this work, we make explicit use of prior knowledge, using signalling pathways during the learning phase. Afterwards, a network (or graph) of inferred pairwise interactions is extracted using mutual information scores in the different biological conditions, to assess how the interactions between biological variables evolve.

Session B-218: FastCMH: Genome-wide genetic heterogeneity discovery with categorical covariates
COSI: MLSB
  • Felipe Llinares-Lopez, ETH Zurich, Switzerland
  • Laetitia Papaxanthos, ETH Zurich, Switzerland
  • Dean Bodenham, ETH Zurich, Switzerland
  • Damian Roqueiro, ETH Zurich, Switzerland
  • COPDGene Investigators, COPDGene, USA
  • Karsten Borgwardt, ETH Zurich, Switzerland

Short Abstract: Univariate Genome-Wide Association Studies (GWASs) are a widely used approach to retrieve Single-Nucleotide Polymorphisms (SNPs) significantly associated with a phenotype of interest, such as the presence or absence of a disease, eye color or the size of bones. However, the phenotypic variance explained by findings of univariate GWASs is typically considerably lower than the estimated heritability of the corresponding trait, leading to the well-known "missing heritability" problem. A partial explanation of this phenomenon is the existence of genetic heterogeneity, by which several SNPs have a common but weak effect on the phenotype of interest. In our recently published paper Genome-wide genetic heterogeneity discovery with categorical covariates (Llinares-Lopez, 2017), we propose FastCMH, an algorithm that detects statistically significant associations between genomic regions and the phenotype of interest, while correcting for categorical confounders. Our method uses a state-of-the-art pattern mining approach to test all genomic regions, without imposing a maximum region length, while being computationally efficient and showing high statistical power.
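
The name FastCMH suggests the Cochran-Mantel-Haenszel (CMH) test, the standard way to test association in 2x2 tables while stratifying by a categorical covariate. The sketch below computes the textbook CMH statistic without continuity correction for one genomic region; it does not reproduce the pattern-mining search over regions, and the function name and the counts are hypothetical.

```python
def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel chi-square statistic (no continuity correction).

    `tables` holds one (a, b, c, d) tuple per covariate stratum, read as the
    2x2 table [[a, b], [c, d]] with rows = region variant present/absent and
    columns = case/control.
    """
    diff, variance = 0.0, 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n  # observed minus expected count of cell a
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff * diff / variance


# Hypothetical counts for one region in two covariate strata.
statistic = cmh_statistic([(10, 5, 5, 10), (8, 4, 4, 8)])
```

Under the null hypothesis the statistic follows a chi-square distribution with one degree of freedom, so a p-value can be obtained from any chi-square survival function.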

Session B-220: easyGWAS: A cloud-based platform for comparing the results of genome-wide association studies
COSI: MLSB
  • Dominik Grimm, ETH Zürich, Switzerland
  • Damian Roqueiro, ETH Zürich, Switzerland
  • Matteo Togninalli, ETH Zürich, Switzerland
  • easyGWAS Consortium, ETH Zürich, Switzerland
  • Karsten Borgwardt, ETH Zürich, Switzerland

Short Abstract: We will present different use cases to illustrate the full functionality of easyGWAS. Firstly, users will be guided through steps to upload genotype and phenotype data onto the platform. Secondly, different analysis methods will be presented. Finally, the results of the analyses will be visualised, compared and shared with others.

Session B-222: Ask the doctor - Improving drug sensitivity predictions through active expert knowledge elicitation
COSI: MLSB
  • Iiris Sundin, Aalto University, Finland
  • Tomi Peltola, Aalto University, Finland
  • Muntasir Mamun Majumder, Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Finland
  • Pedram Daee, Aalto University, Finland
  • Marta Soare, Aalto University, Finland
  • Homayun Afrabandpey, Aalto University, Finland
  • Caroline Heckman, Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Finland
  • Samuel Kaski, Aalto University, Finland
  • Pekka Marttinen, Aalto University, Finland

Short Abstract: Predicting the efficacy of a drug for a given individual, using high-dimensional genomic measurements, is at the core of precision medicine. However, identifying features on which to base the predictions remains a challenge, especially when the sample size is small. Incorporating expert knowledge is a promising way to improve a prediction model, but collecting such knowledge is laborious for the expert if the number of candidate features is very large. We introduce a probabilistic model that can incorporate expert feedback about the impact of genomic measurements on the sensitivity of a cancer cell to a given drug. We also present two methods to intelligently collect this feedback from the expert, using experimental design and multi-armed bandit models. In a multiple myeloma blood cancer data set (n=51), expert knowledge decreased the prediction error by 8%. Furthermore, the intelligent approaches can be used to reduce the workload of feedback collection to less than 30% on average, compared to a naive approach.

Session B-224: Integrative concurrent analysis of multiple biological datasets by HALS-based multi-relational NMF
COSI: MLSB
  • Oliver Mueller-Stricker, University Medicine Greifswald, Germany
  • Lars Kaderali, University Medicine Greifswald, Germany

Short Abstract: Non-negative matrix factorization (NMF) has proven useful for the analysis of biological and epidemiological datasets, e.g. gene expression, RNAi or GWAS data. For example, Kim and Tidor [1] applied NMF to cluster genes and to predict functional relationships in yeast, and Brunet et al. [2] used NMF to reduce the dimensionality of expression data from thousands of genes to a handful of metagenes. Hutchins et al. [3] described a novel approach to the characterization of putative regulatory sequence motifs based on NMF. However, most approaches utilizing NMF focus on a single dataset, although it is often desirable to analyze multiple interrelated datasets concurrently instead of analyzing each one separately. Only a few approaches to this task have been proposed in the literature. For example, Wang et al. [4] introduced a technique for the prediction of protein–protein interactions from multimodal biological data sources. Gligorijević et al. [5] presented a methodology for the discovery of driver genes by a holistic analysis of patient SNP data, demographic data and gene-gene interaction data. These applications of NMF build on an approach presented in 2008 by Wang et al. [6], which consists of including multiple relation matrices in the objective function of the NMF problem, using matrix tri-factorization by introducing an additional matrix that relates the two factor matrices to each other, and adding a graph Laplacian so that intra-type data, e.g. protein-protein interaction data, can be included. While the methods described above have been shown to yield superior results compared to other methods in the field, all of these algorithms rely on standard multiplicative updates, which often converge slowly [7, 8]. Here, we propose a novel method for the fast integrative and concurrent analysis of interrelated datasets.
For this purpose we adapt and extend the well-known Hierarchical Alternating Least Squares (HALS) algorithm for NMF [9, 10]. Our contribution is threefold. First, to be able to represent the contributions of single factor combinations, we adapt HALS for matrix tri-factorization. As stated above, an additional matrix has to be inserted in the formulation. Second, we extend HALS to handle the factorization of multiple relations concurrently. This way it is possible to integrate data from multiple sources as well as prior knowledge directly into one single analysis. Third, we introduce the option of weighted input into our algorithm to enable the analysis of incomplete input data, which is a common case in biological data. Here, instead of imputing unobserved input values or representing them as zero, a weight mask is introduced into the methodology. We show the behavior of our proposed method by applying it to multi-relational biological data, e.g. for the stratification of cancer subtypes by patient mutation profiles and expression data of ovarian, uterine and lung cancer cohorts from The Cancer Genome Atlas as well as additional molecular network data, and by comparing it to current state-of-the-art NMF analysis algorithms.

References
[1] P. M. Kim and B. Tidor, “Subsystem Identification Through Dimensionality Reduction of Large-Scale Gene Expression Data,” Genome Res., vol. 13, no. 7, pp. 1706–1718, Jul. 2003.
[2] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proc. Natl. Acad. Sci., vol. 101, no. 12, pp. 4164–4169, Mar. 2004.
[3] L. N. Hutchins, S. M. Murphy, P. Singh, and J. H. Graber, “Position-dependent motif characterization using non-negative matrix factorization,” Bioinformatics, vol. 24, no. 23, pp. 2684–2690, Dec. 2008.
[4] H. Wang, H. Huang, C. Ding, and F. Nie, “Predicting Protein–Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization,” J. Comput. Biol., vol. 20, no. 4, pp. 344–358, Mar. 2013.
[5] V. Gligorijević, N. Malod-Dognin, and N. Pržulj, “Patient-specific data fusion for cancer stratification and personalised treatment,” in Biocomputing 2016, World Scientific, 2015, pp. 321–332.
[6] F. Wang, T. Li, and C. Zhang, “Semi-Supervised Clustering via Matrix Factorization,” in Proceedings of the 2008 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2008, pp. 1–12.
[7] H. Kim and H. Park, “Nonnegative Matrix Factorization Based on Alternating Nonnegativity Constrained Least Squares and Active Set Method,” SIAM J. Matrix Anal. Appl., vol. 30, no. 2, pp. 713–730, Jan. 2008.
[8] J. Kim and H. Park, “Fast Nonnegative Matrix Factorization: An Active-Set-Like Method and Comparisons,” SIAM J. Sci. Comput., vol. 33, no. 6, pp. 3261–3281, Jan. 2011.
[9] A. Cichocki and A.-H. Phan, “Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E92–A, no. 3, pp. 708–721, Mar. 2009.
[10] A. Cichocki, R. Zdunek, and S. Amari, “Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization,” in Independent Component Analysis and Signal Separation, M. E. Davies, C. J. James, S. A. Abdallah, and M. D. Plumbley, Eds. Springer Berlin Heidelberg, 2007, pp. 169–176.
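
For readers unfamiliar with HALS, the following is a minimal sketch of the plain two-factor variant (X ≈ WH) that the abstract takes as its starting point. It does not implement the tri-factorization, multi-relational or weighted extensions proposed in the poster; the function name, iteration count and the small positive floor `eps` are illustrative assumptions.

```python
import numpy as np

def hals_nmf(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Basic HALS for non-negative matrix factorization, X ~ W @ H.

    Each sweep updates one row of H (then one column of W) at a time with a
    closed-form non-negative least-squares step, keeping the others fixed.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # Update rows of H with W fixed.
        WtX, WtW = W.T @ X, W.T @ W
        for k in range(rank):
            step = (WtX[k] - WtW[k] @ H) / max(WtW[k, k], eps)
            H[k] = np.maximum(eps, H[k] + step)
        # Update columns of W with H fixed.
        XHt, HHt = X @ H.T, H @ H.T
        for k in range(rank):
            step = (XHt[:, k] - W @ HHt[:, k]) / max(HHt[k, k], eps)
            W[:, k] = np.maximum(eps, W[:, k] + step)
    return W, H
```

Unlike multiplicative updates, each HALS step solves its one-component subproblem exactly, which is the usual explanation for its faster convergence in practice.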

Session B-226: Robustness of modeling-based experiment retrieval to differences in measurement and preprocessing techniques
COSI: MLSB
  • Pradeep Eranti, Aalto University, Finland
  • Paul Blomstedt, Aalto University, Finland
  • Samuel Kaski, Aalto University, Finland

Short Abstract: With the rapid progress in high-throughput measurement technologies, public repositories such as ArrayExpress and Gene Expression Omnibus continue to grow enormously with freely available gene expression datasets. This rapid growth makes it challenging to retrieve the datasets relevant to a researcher's question. To overcome problems related to inaccurate or missing metadata, the task of querying a database of experiments using measurement data, instead of metadata, has recently received increased attention in the literature. Earlier studies of such methods have been conducted with a subset of gene expression experiments generated using one array design or with one dedicated preprocessing technique. However, public repositories store experiments which have been conducted on various array designs and generated with various preprocessing techniques. In our recent study, we evaluated the robustness of a recently introduced model-distance-based method, together with two other content-based retrieval methods available in the literature, to differences in preprocessing techniques (RMA, MAS5) and array designs (Affymetrix's Human Gene Array and Human Genome Arrays). The methods were (i) a model-distance-based method which represents each experiment as a probabilistic model and uses the distance between the models as a measure of relevance, (ii) a likelihood-based method which is similar to the previous one but uses the likelihood of each model as the measure of relevance, and (iii) a non-probabilistic method based on correlations between differential expression profiles. The experimental evaluations demonstrate that the model-distance-based method is tolerant to differences in preprocessing techniques and outperforms the remaining two methods. Furthermore, the model-distance-based method is tolerant to differences among Human Genome arrays, whereas the results for differences between Human Genome arrays and the Human Gene Array are inconclusive.
The model-distance-based method facilitates the retrieval of gene expression datasets exhibiting similar expression patterns in different conditions, across platforms and preprocessing techniques, as long as sufficient amounts of data are available. While the method has currently only been tested on microarray gene expression experiments, there is ongoing work to extend it into a general-purpose retrieval scheme for other experiment types. It thus helps make maximal use of the data available in public repositories to discover hidden patterns in biological mechanisms.

Session B-228: Enlightening discriminative network functional modules behind Principal Component Analysis separation in differential-omic science studies
COSI: MLSB
  • Sara Ciucci, Biotechnology Center (BIOTEC), Technische Universität Dresden; Lipotype GmbH, Germany
  • Yan Ge, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Claudio Duran, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Alessandra Palladini, Biotechnology Center (BIOTEC), Technische Universität Dresden; DZD Paul Langerhans Institute, Technische Universität Dresden; Lipotype GmbH, Germany
  • Víctor Jiménez Jiménez, Fundación Centro Nacional de Investigaciones Cardiovasculares Carlos III, Spain
  • Luisa María Martínez Sánchez, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Yuting Wang, Center for Regenerative Therapies Dresden (CRTD), Technische Universität Dresden; MPI of Molecular Cell Biology and Genetics, Germany
  • Susanne Sales, MPI of Molecular Cell Biology and Genetics, Germany
  • Andrej Shevchenko, MPI of Molecular Cell Biology and Genetics, Germany
  • Steven W. Poser, Department of Internal Medicine III, University Hospital Carl Gustav Carus at the Technische Universität Dresden, Germany
  • Maik Herbig, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Oliver Otto, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Andreas Androutsellis-Theotokis, Center for Regenerative Therapies Dresden (CRTD); University Hospital Carl Gustav Carus at the Technische Universität Dresden; Centre for Biomolecular Sciences, School of Medicine, University of Nottin, Germany
  • Jochen Guck, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany
  • Mathias J. Gerl, Lipotype GmbH, Germany
  • Carlo Vittorio Cannistraci, Biotechnology Center (BIOTEC), Technische Universität Dresden, Germany

Short Abstract: Omic science is rapidly growing, and one of the most widely employed techniques to explore differential patterns in omic datasets is principal component analysis (PCA). However, a method to reveal the network of omic features that contribute most to the sample separation obtained by PCA is missing. An alternative is to build correlation networks between univariately-selected significant omic features, but this neglects the multivariate unsupervised feature compression responsible for the PCA sample segregation. Biologists and medical researchers often prefer effective methods that offer an immediate interpretation to complicated algorithms that in principle promise an improvement but in practice are difficult to apply and interpret. Here we present PC-corr: a simple algorithm that associates a discriminative network of features with any PCA segregation. Such a network can be inspected in search of functional modules useful in the definition of combinatorial and multiscale biomarkers from multifaceted omic data in systems and precision biomedicine. We offer proofs of PC-corr efficacy on lipidomic, metagenomic, developmental genomic, population genetic, cancer promoteromic and cancer stem-cell mechanomic data. Finally, PC-corr is a general functional network inference approach that can be easily adopted for big data exploration in computer science and analysis of complex systems in physics.

