Presentation Overview:
Cancer progression is an evolutionary process characterized by the accumulation of genetic alterations that drive tumor growth, clinical progression, and the development of drug resistance. We discuss how to reconstruct the evolutionary history of a tumor from single-cell sequencing data and present probabilistic models and efficient inference algorithms for mutation calling and for learning tumor phylogenies from mutation and copy number data. We present methods for integrating single-cell DNA and RNA data obtained from tumor biopsies and for detecting common patterns of tumor evolution among patients, including recurring evolutionary trajectories and clonally exclusive mutations.
Presentation Overview:
Improvements in single cell RNA-seq technologies mean that studies measuring multiple experimental conditions, such as time series, have become more common. At present, few computational methods exist to infer time series-specific transcriptome changes, and such studies have therefore typically used unsupervised pseudotime methods. While these methods identify cell subpopulations and the transitions between them, they are not appropriate for identifying the genes which vary coherently along the time series. In addition, the orderings they estimate are based only on the major sources of variation in the data, which may not correspond to the processes related to the time labels.
We introduce psupertime, a supervised pseudotime approach based on a regression model that explicitly uses time-series labels as input. It identifies genes that vary coherently along a time series, in addition to pseudotime values for individual cells, and a classifier that can be used to estimate labels for new data with unknown or differing labels. We show that psupertime outperforms benchmark classifiers in terms of identifying time-varying genes and provides better individual cell orderings than popular unsupervised pseudotime techniques. psupertime is applicable to any single-cell RNA-seq dataset with sequential labels (principally time series, but also drug dosage and disease progression, for example) and provides a fast, interpretable tool for targeted identification of genes varying along specific biological processes.
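The core idea, an L1-penalized regression on ordinal time labels whose linear predictor doubles as a pseudotime, can be sketched on synthetic data. This is a toy illustration, not psupertime's actual implementation: psupertime fits a proper cumulative-link ordinal model, whereas here each ordinal cut point is approximated by a separate thresholded logistic regression and the coefficients are averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# synthetic data: 300 cells x 50 genes, three ordered time labels (0 < 1 < 2);
# only the first five genes increase coherently with time
labels = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 50))
X[:, :5] += 1.5 * labels[:, None]

# ordinal-style regression via cumulative thresholds: fit "label > k" for each
# cut point with an L1 penalty, then share one sparse weight vector across cuts
coefs = []
for k in (0, 1):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X, (labels > k).astype(int))
    coefs.append(clf.coef_.ravel())
w = np.mean(coefs, axis=0)

pseudotime = X @ w                      # one continuous ordering per cell
time_varying = np.flatnonzero(w != 0)   # genes selected by the L1 penalty
```

The L1 penalty is what yields the short list of coherently time-varying genes, while the projection `X @ w` provides the per-cell ordering.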
Presentation Overview:
Single-cell RNA-seq (scRNA-seq) technologies are often used to study changes in the molecular states that underlie cellular phenotypes. However, the transcript counts from scRNA-seq are obtained from each cell at a single instant in time, and a single snapshot often does not provide sufficient information to understand the dynamic molecular processes the cell is undergoing. To address this challenge, we have developed a neural ordinary differential equation-based method, RNAForecaster, for predicting RNA expression states in single cells for multiple future time steps. We demonstrated that in 111 simulated single-cell expression datasets, RNAForecaster can accurately predict expression states up to two hundred simulated time steps downstream of the data it was trained on. Additionally, using metabolic labeling transcriptomic profiling data from human RPE cells, RNAForecaster was able to predict cellular progression through the cell cycle, with the predicted expression changes aligning highly significantly with continuous measures of the cell cycle over a three-day prediction period. Thus, RNAForecaster enables predictions of future expression states in biological systems over short to medium time periods, which has the potential to yield significant insight into transcriptional dynamics.
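The underlying mechanics, learning a derivative function dx/dt = f(x) and integrating it forward from one observed snapshot, can be sketched as follows. In RNAForecaster f is a trained neural network; in this toy sketch a fixed, mostly-decaying linear map stands in for it, and a simple Euler integrator stands in for the ODE solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5

# stand-in for the trained derivative network dx/dt = f(x); the decaying
# linear dynamics here are purely illustrative, not learned from data
A = -0.1 * np.eye(n_genes) + 0.02 * rng.normal(size=(n_genes, n_genes))

def dx_dt(x):
    return A @ x

def forecast(x0, n_steps, dt=0.1):
    """Predict future expression states by Euler-integrating the learned
    dynamics forward from a single observed expression snapshot x0."""
    states = [x0.copy()]
    x = x0.copy()
    for _ in range(n_steps):
        x = x + dt * dx_dt(x)
        states.append(x.copy())
    return np.stack(states)

trajectory = forecast(rng.exponential(size=n_genes), n_steps=200)
```

Each row of `trajectory` is a predicted expression state one time step further downstream, mirroring the multi-step forecasts described above.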
Presentation Overview:
The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, underscoring the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes.
In this work, we adapt the transformer neural network architecture to operate on methylation matrices by combining axial attention with sliding-window self-attention. The resulting CpG Transformer achieves state-of-the-art performance on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget.
Presentation Overview:
Factor analysis is a widely used method for dimensionality reduction of datasets in molecular biology and has recently been adapted to spatial data. However, existing methods assume (count) matrices as input and are therefore not directly applicable to single molecule resolved data, which increasingly arise in the field of spatial transcriptomics and provide insight into the subcellular localization of individual RNA molecules. To address this, we propose FISHFactor, a probabilistic factor model that combines the benefits of spatial, non-negative factor analysis with a Poisson point process likelihood to explicitly model and account for the nature of single molecule resolved data. Given data from multiple segmented cells, FISHFactor infers a weight matrix that is shared across cells, while factors remain independent. This allows a consistent interpretation of factors and clustering of cells based on factor activities.
Using simulated data, we show that our approach leads to a more accurate estimate of the latent structure compared to methods that rely on aggregating information by spatial binning. We demonstrate on a real data set of cultured mouse embryonic fibroblast cells that FISHFactor identifies major subcellular expression patterns and accurately recovers known spatial gene clusters.
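The data model FISHFactor builds on, a Poisson point process whose intensity is a non-negative combination of spatial factors, can be illustrated generatively. This sketch samples molecule locations by thinning; the two blob-shaped factors, their weights, and the rates are invented for illustration and are not taken from the paper.

```python
import numpy as np

def sample_point_process(intensity, lam_max, rng):
    """Sample molecule locations in the unit square from a Poisson point
    process by thinning: propose from a homogeneous process with rate
    lam_max, keep each point with probability intensity(s) / lam_max."""
    n = rng.poisson(lam_max)              # homogeneous proposal on [0, 1]^2
    pts = rng.uniform(size=(n, 2))
    keep = rng.uniform(size=n) < intensity(pts) / lam_max
    return pts[keep]

# toy intensity for one gene in one cell: a non-negative mix of two spatial
# factors, lambda(s) = w1 * f1(s) + w2 * f2(s)
f1 = lambda p: np.exp(-8.0 * np.sum((p - 0.25) ** 2, axis=1))  # blob near (0.25, 0.25)
f2 = lambda p: np.exp(-8.0 * np.sum((p - 0.75) ** 2, axis=1))  # blob near (0.75, 0.75)
intensity = lambda p: 120.0 * f1(p) + 10.0 * f2(p)

rng = np.random.default_rng(5)
molecules = sample_point_process(intensity, lam_max=130.0, rng=rng)
```

FISHFactor works in the opposite direction: given observed `molecules` for many genes and cells, it infers the shared weights and per-cell factors without ever binning the points into a count matrix.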
Presentation Overview:
We present Full Annotation Using Shape-constrained Trees (FAUST), a recently published method for unbiased discovery and annotation of cellular phenotypes in high-dimensional cytometry datasets (https://doi.org/10.1016/j.patter.2021.100372). Analyzing such datasets is challenging as they can consist of hundreds of samples jointly profiling many millions of cells. To address these challenges, our method combines novel approaches for variable scoring, variable selection, clustering, cluster-matching, and feature selection [Fig 1]. We discuss how this allows FAUST to process samples independently of one another without sub-sampling and to match clusters on the basis of computationally derived phenotypes. We show how FAUST can automatically adapt to technical effects such as batching, as well as to sample-to-sample variability that manifests as changes in the location and scale of the expression distributions. We demonstrate through simulation studies that FAUST can resolve phenotypes in the data whose abundances vary by orders of magnitude. We also show that FAUST outperforms other methods on the FlowCAP-IV benchmark dataset. Finally, we discuss how FAUST has been used to analyze multiple high-dimensional cytometry datasets, including data generated in a Merkel cell carcinoma anti-PD-1 clinical trial. An open-source implementation of the FAUST method is available online (https://github.com/RGLab/faust).
Presentation Overview:
Cell images contain a vast amount of quantifiable information about the status of the cell: for example, whether it is diseased, whether it is responding to a drug treatment, or whether a pathway has been disrupted by a genetic mutation. We extract hundreds of features of cells from images. Just like transcriptional profiling, the similarities and differences in the patterns of extracted features reveal connections among diseases, drugs, and genes.
We are harvesting similarities in image-based profiles to identify, at a single-cell level, how diseases, drugs, and genes affect cells, which can uncover small molecules’ mechanism of action, discover gene functions, predict assay outcomes, discover disease-associated phenotypes, identify the functional impact of disease-associated alleles, and find novel therapeutic candidates. As part of the JUMP-Cell Painting Consortium (Joint Undertaking for Morphological Profiling-Cell Painting) we are aiming to establish experimental and computational best practices for image-based profiling (https://jump-cellpainting.broadinstitute.org/results) and produce the world’s largest public Cell Painting gene/compound image resource, with 140,000 perturbations in five replicates, to be released November 2022. With these data and new technologies like Pooled Cell Painting, we hope to bring drug discovery-accelerating applications to practice.
Presentation Overview:
Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time consuming, and low throughput. As a result, carcinogenicity information is fairly limited and building data-driven models with good prediction accuracy remains a major challenge. In this work, we propose CONCERTO, a deep learning model that uses a graph transformer in conjunction with a molecular fingerprint representation for carcinogenicity prediction from molecular structure. Special efforts have been made to overcome the data size constraint, such as enriching the training data with more informative labels, multi-round pre-training on related but lower quality mutagenicity data, and transfer learning from a large self-supervised model. Extensive experiments demonstrate that our model performs well and can generalize to external validation sets. CONCERTO could be useful for guiding future carcinogenicity experiments and provide insight into the molecular basis of carcinogenicity.
Presentation Overview:
The cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets, but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescence images in the Human Protein Atlas (HPA) with affinity purifications in BioPlex to create a unified hierarchical map of human cell architecture. Integration is achieved by configuring each approach as a general measure of protein distance, then calibrating the two measures using machine learning. The map, called the Multi-Scale Integrated Cell (MuSIC 1.0), resolves 69 subcellular systems of which approximately half are undocumented. Accordingly we perform 134 additional affinity purifications, validating subunit associations for the majority of systems. The map reveals a pre-ribosomal RNA processing assembly and accessory factors, which we show govern rRNA maturation, and functional roles for SRRM1 and FAM120C in chromatin and RPS3A in splicing. By integration across scales, MuSIC increases the resolution of imaging while giving protein interactions a spatial dimension, paving the way to incorporate diverse types of data in proteome-wide cell maps.
Presentation Overview:
Highly multiplexed imaging technologies such as Imaging Mass Cytometry (IMC) enable the quantification of the expression of up to 40 proteins in tissue sections. Data preprocessing pipelines subsequently segment images into single cells, recording each cell's average expression profile along with spatial characteristics. However, segmentation remains error-prone: doublets, areas erroneously segmented as a single cell but in fact composed of more than one "true" cell, result in cells with implausible protein co-expression combinations, confounding the interpretation of important cellular populations across tissues.
While doublets have been discussed extensively in the context of single-cell RNA-sequencing analysis, no clustering method for IMC accounts for such segmentation errors. We therefore introduce SegmentaTion AwaRe cLusterING (STARLING), a probabilistic model tailored for IMC that clusters cells while explicitly allowing for doublets resulting from mis-segmentation. To benchmark STARLING against existing methods, we develop a novel evaluation scheme that penalizes clusters with biologically implausible marker co-expression combinations. Finally, we generate IMC data of human tonsil (a densely packed secondary lymphoid organ) and demonstrate that the cellular states captured by STARLING identify known cell types that are not visible with other methods and are important for understanding the dynamics of the immune response.
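The key modeling idea, that a doublet's expression looks like a blend of two clusters' profiles, can be sketched with a simple distance-based assignment. STARLING does this probabilistically with a full likelihood; here plain Euclidean distance and equal-weight averaging stand in for that, purely for illustration.

```python
import numpy as np

def assign_with_doublets(cell, cluster_means):
    """Assign a segmented cell to a cluster or flag it as a likely doublet.

    A doublet's expression is modeled as the average of two clusters'
    profiles; the cell receives whichever explanation (singlet k, or
    doublet (k, l)) lies closest to it in expression space.
    """
    K = len(cluster_means)
    best = ("singlet", 0, np.inf)
    for k in range(K):                         # singlet explanations
        d = np.linalg.norm(cell - cluster_means[k])
        if d < best[2]:
            best = ("singlet", k, d)
    for k in range(K):                         # doublet explanations
        for l in range(k + 1, K):
            mid = 0.5 * (cluster_means[k] + cluster_means[l])
            d = np.linalg.norm(cell - mid)
            if d < best[2]:
                best = ("doublet", (k, l), d)
    return best[:2]

# two toy cell types with mutually exclusive markers
means = [np.array([5.0, 0.0]), np.array([0.0, 5.0])]
```

A cell co-expressing both markers at half intensity is exactly what mis-segmentation produces, and the doublet branch absorbs it instead of forcing it into a spurious "double-positive" cluster.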
Presentation Overview:
Population-scale genomic sequencing provides novel opportunities to survey the effect of rare variants on phenotypes. Existing methods for rare-variant-association studies (RVASs), such as burden or variance-component tests, make strong assumptions about which variants exhibit phenotypic effects, limiting their efficacy.
Here, we propose DeepRVAT (Deep Rare Variant Association Testing), a data-driven framework that uses deep neural networks to learn a flexible rare variant aggregation function. Specifically, we build on DeepSet networks to efficiently model variant effects and interactions. Compared to existing methods, DeepRVAT (1) learns variant effects without strong filtering or specifying a kernel, (2) models nonlinear and epistatic effects, (3) efficiently incorporates dozens of multi-modal variant annotations, (4) provides trait-specific burden scores, and (5) utilizes GPUs for biobank-scale analyses.
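The DeepSet construction behind this, encoding each variant independently, pooling with a permutation-invariant sum, and scoring the pooled representation, can be sketched in a few lines. The weights below are random stand-ins for the trained phi and rho networks, and the layer sizes are assumptions, not DeepRVAT's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# random weights standing in for trained networks; shapes are illustrative
W_phi = rng.normal(size=(8, 16))   # per-variant encoder phi: 8 annotations -> 16
W_rho = rng.normal(size=(16,))     # aggregation head rho: pooled 16 -> scalar

def gene_burden(variant_annotations):
    """DeepSet-style burden score: encode each rare variant, sum-pool, score.

    variant_annotations: (n_variants, 8) array of multi-modal annotations.
    Sum pooling makes the score invariant to variant order and handles any
    number of rare variants per gene.
    """
    h = np.maximum(variant_annotations @ W_phi, 0.0)   # phi with ReLU
    pooled = h.sum(axis=0)                             # permutation-invariant
    return float(pooled @ W_rho)                       # rho -> burden score

variants = rng.normal(size=(5, 8))
score = gene_burden(variants)
```

The permutation invariance of the sum is what lets a single learned aggregation function replace the hand-specified kernels and variant filters of classical burden and variance-component tests.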
We apply DeepRVAT to multiple phenotypes on 167,000 whole-exome-sequenced samples from UK Biobank. Compared with previous state-of-the-art methods, we obtain significantly increased power (e.g., 29 vs. 15 genes associated with human height; FDR < 0.05), while maintaining statistical calibration. Furthermore, we validate our results by multiple means (e.g., enrichment analysis, comparison to larger studies) to ensure the biological plausibility of our associations. Collectively, our results demonstrate increased power and robustness for studying gene-trait associations using rare variants.
Presentation Overview:
Most disease-associated genetic variants fall outside of protein-coding DNA and are often enriched in regulatory elements associated with DNA binding proteins known as transcription factors (TFs). Computational methods are largely used to predict TF binding sites (TFBS) as the experimental characterization of most human TFs is intractable due to technical limitations. Instead, the most popular approaches use TF motifs and chromatin accessibility data to predict TF binding. Here, we present “maxATAC” a suite of deep neural network models for genome-wide TFBS prediction from the assay for transposase accessible chromatin (ATAC-seq) in any cell type, with models available for 127 human TFs. We demonstrate maxATAC’s capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci. We analyzed activated T cells isolated from patients with atopic dermatitis and their age-matched controls. Patient-specific ATAC-seq signal and DNA sequence were used as input for maxATAC to predict the binding of 103 nominally expressed TFs. We predicted increased binding of several TFs relevant to T cells, including MYB and FOXP1, in patients with atopic dermatitis. These results illustrate the utility of maxATAC models for exploring potential regulators of cellular biology and the effects of genetic variants on genomic regulation.
Presentation Overview:
Analyses of disease progression patterns in multimorbid patients typically try to find systematic patterns of risk factors, diseases, and complications. Such analyses are complicated by the fact that certain risk factors can also present as complications, thus representing "promiscuous" diseases that appear in quite different contexts. Another problem is that similar outcomes can be caused by different mechanisms, i.e. mixed etiologies, which can be difficult to disentangle longitudinally. The talk will discuss approaches to patient stratification in such situations, including cases where cohort multi-omics data are available; these can potentially reveal patient-level disease characteristics and individualized responses to treatment. We developed a deep learning-based framework, Multi-Omics Variational autoEncoders, to integrate such data and applied it to a cohort of 789 patients with newly diagnosed type 2 diabetes (T2D). These patients had clinical measurements and drug use recorded in electronic case report forms. We show how we can capture relevant clinical patterns and identify associations between the medication data and the multi-omics data. For the 20 most prevalent T2D drugs, we identified significant multi-omics associations, which we use to extract both known and novel biological associations between drugs and multi-omics features (genomics, transcriptomics, proteomics, metabolomics, and microbiomes, as well as data from diet questionnaires and clinical measurements). To properly disentangle molecular mechanisms and drug responses, we propose that future studies explore the multi-omics space in depth rather than focusing on single data types, as is often done.
Presentation Overview:
Breast cancer develops in breast tissue and, after skin cancer, is the most commonly diagnosed cancer in women in the United States. Given that an early diagnosis is imperative to prevent breast cancer progression, many machine learning models have automated the histopathological classification of the different types of carcinomas. However, many of them do not scale to large datasets. In this study, we propose the novel Primal-Dual Multi-Instance Support Vector Machine (pdMISVM) to determine which tissue segments in an image exhibit an indication of an abnormality. We also derive an efficient optimization approach for the proposed method by bypassing the quadratic programming and least-squares problems that are commonly employed to optimize Support Vector Machine (SVM) models in multi-instance learning. The proposed method is scalable to large datasets and computationally efficient. We applied our method to the public BreaKHis dataset and achieved promising prediction performance and scalability for histopathological classification.
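The multi-instance view itself, an image as a bag of tissue-segment instances that is flagged abnormal if any one segment crosses the decision boundary, can be sketched with a generic max-scoring rule. This illustrates the standard MIL prediction rule only, not the pdMISVM optimization; the weights and toy segments below are invented.

```python
import numpy as np

def bag_prediction(instances, w, b):
    """Score a histopathology image (bag) from its tissue segments (instances).

    In multi-instance learning a bag is labeled abnormal if its
    highest-scoring instance is positive; the argmax identifies the tissue
    segment responsible for the call, which is what localizes the abnormality.
    """
    scores = instances @ w + b
    witness = int(np.argmax(scores))
    return (1 if scores[witness] > 0 else -1), witness

# toy 2-feature instances with a hand-picked separating direction
w, b = np.array([1.0, -1.0]), 0.0
benign = np.array([[-2.0, 1.0], [-1.0, 0.5]])   # every segment scores negative
tumor = np.array([[-2.0, 1.0], [3.0, 0.0]])     # one abnormal segment suffices
```

Returning the witness index alongside the label is what lets a MIL classifier point at the specific tissue segment exhibiting the abnormality.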
Presentation Overview:
Understanding the molecular alterations that allow transitioning from local to invasive disease is essential to improve cancer treatments. Investigating tumor progression has, however, been hampered by substantial intra- and intertumor heterogeneity. Here, we introduce a multifocal cohort of locally advanced prostate cancer, where several primary lesions and metastatic lymph nodes per patient have been transcriptomically profiled. Modeling pathway activity using a centroid-based approach allows tracing cancer progression from primary to metastatic tissue, highlighting how mainly invasion-related processes are altered. Moreover, using an external dataset we demonstrate that in lymph node positive primary tumors the activity of certain signatures is altered to resemble metastatic lymph nodes. We use this observation to identify the most likely seeding primary lesion in each patient of our cohort. The predicted seeding lesions agree well with seeding lesions estimated from RNA-seq derived somatic variants and are enriched in the invasive PAM50 Luminal B subtype. Finally, we leverage the unique design of our cohort and develop a new testing procedure to identify molecular processes that characterize the seeding lesions. Importantly, the poor correspondence between a lesion’s Gleason score and seeding status suggests that multifocal designs will be pivotal for the study of invasive disease.
Presentation Overview:
To pave the road towards precision medicine, patients with similar biology should be grouped into cancer subtypes. Utilizing high-dimensional multiomics data generated from cancer tissues, integrative computational approaches have been developed to uncover cancer subtypes. Recently, Graph Neural Networks have emerged that utilize node features and associations simultaneously on graph-structured data. Addressing limitations of existing tools in leveraging these architectures, we developed SUPREME, an integrative approach that comprehensively analyzes multiomics data and patient associations with multiplex graph convolutions. Unlike existing tools, SUPREME generates patient similarity networks and obtains embeddings from each using Graph Convolutional Network models, thereby utilizing all multiomics data. SUPREME also integrates the embeddings with raw features to capture local and global structure simultaneously, and it treats each combination of embeddings as a separate task, making it interpretable with respect to the networks and features utilized. On TCGA, METABRIC, and combined datasets, SUPREME significantly and consistently outperformed seven supervised methods. SUPREME-inferred subtypes consistently had significant survival differences, mostly more significant than those between the ground-truth (PAM50) subtypes, and outperformed all nine methods. Our findings suggest that, by properly utilizing multiple data types and associations, SUPREME could demystify the subtype characteristics that cause significant survival differences and could improve the ground truth, which depends mainly on one data type.
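A single graph-convolution step on a patient similarity network, the building block such embeddings are made of, can be sketched with the standard Kipf-Welling renormalization. This is the generic GCN layer, not SUPREME's specific architecture; the toy adjacency and feature sizes are invented.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer (renormalization trick of Kipf & Welling).

    A: (n, n) patient similarity adjacency, X: (n, d) patient features,
    W: (d, h) trainable weights. Each patient's embedding mixes its own
    features with those of similar patients.
    """
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)            # ReLU activation

# toy run: 4 patients, 3 raw features, embedding size 2; patient 3 is isolated
rng = np.random.default_rng(3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
H = gcn_layer(A, X, W)
```

Running one such layer per similarity network (one per omics type) and concatenating the resulting embeddings is the multiplex-convolution idea the abstract describes.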
Presentation Overview:
Polypharmacy presents a major medical challenge for the world's ageing population due to novel drugs used in combination, the high cost of clinical trials, and limited studies on side-effect mechanisms. For problems like polypharmacy side effects, a predictor that outputs correct side effects is far from enough. A question of greater concern is: can the predictor help us figure out the cause of a predicted side effect? In this paper, we give a positive answer by proposing APRILE, a graph learning framework for discovering the mechanisms behind drug side effects. Given a predictor, APRILE can generate hypothetical mechanisms by identifying the optimal set of drug targets and protein interactions of greatest importance to the predictor's prediction. In the existing literature, we found ample evidence for mechanisms generated by APRILE, including drug hypersensitivity leading to anxiety and paradoxical drug action leading to panic disorder. We wrapped our framework into a Python package and developed a web application offering mechanistic explanations for 34 million POSE predictions, together with common mechanistic hypotheses for 472 diseases, 485 symptoms, 9 mental disorders, and 20 disease categories, to make it more accessible to the biomedical research community.
Presentation Overview:
Accurate and robust prediction of patient-specific responses to a new compound is critical to personalized drug discovery and development. However, patient data are often too scarce to train a generalized machine learning model. Although many methods have been developed to utilize cell line screens for predicting clinical responses, their performance is unreliable due to data heterogeneity and distribution shift. We have developed a novel Context-aware Deconfounding Autoencoder (CODE-AE) that can extract intrinsic biological signals masked by context-specific patterns and confounding factors. Extensive comparative studies demonstrated that CODE-AE effectively alleviates the out-of-distribution problem for model generalization and significantly improves accuracy and robustness over state-of-the-art methods in predicting patient-specific in vivo drug responses purely from in vitro compound screens. Using CODE-AE, we screened 59 drugs for 9,808 cancer patients. Our results are consistent with existing clinical observations, suggesting the potential of CODE-AE in developing personalized anti-cancer therapies and drug-response biomarkers.
Presentation Overview:
In the first part of the talk, I will go over a number of research projects from my lab on the topic of explainable AI applied to biomedical problems, exemplifying how explainable AI addresses new scientific questions, makes new biological discoveries from data, informs clinical decisions, and even opens new research directions in biomedicine. In the second part of the talk, I will show that explainable AI needs to evolve and improve to solve real-world problems in computational biology and medicine, taking a deep dive into our cancer pharmacogenomics project led by our Ph.D. student Joseph Janizek in collaboration with Prof. Kamila Naxerova at Harvard Medical School.
Presentation Overview:
Estimating the effects of interventions on patient outcomes is one of the key aspects of personalized medicine. Their inference is often challenged by the fact that the training data comprise only the outcome for the administered treatment, and not for alternative treatments (the so-called counterfactual outcomes). Several methods have been suggested for this scenario based on observational data, i.e. data where the intervention was not applied randomly, for both continuous and binary outcome variables. However, patient outcome is often recorded as time-to-event data, comprising right-censored event times if an event does not occur within the observation period. Despite their enormous importance, time-to-event data are rarely used for treatment optimization.
We suggest an approach named BITES (Balanced Individual Treatment Effect for Survival data), which combines a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network; i.e. we regularize differences between treated and non-treated patients using Integral Probability Metrics (IPM). We show in simulation studies that this approach outperforms the state of the art. Further, we demonstrate in an application to a cohort of breast cancer patients that hormone treatment can be optimized based on six routine parameters. We successfully validated this finding in an independent cohort. We provide BITES as an easy-to-use Python implementation including scheduled hyper-parameter optimization.
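The shape of such an objective, treatment-specific Cox partial likelihoods plus an IPM penalty on the learned representations, can be sketched as follows. The linear-kernel MMD used here is one simple member of the IPM family, the Breslow-style likelihood ignores tied event times, and the weighting `alpha` is an assumed hyperparameter; none of this reproduces BITES's exact implementation.

```python
import numpy as np

def cox_neg_partial_loglik(risk, time, event):
    """Negative Cox partial log-likelihood (no tie correction, for clarity)."""
    order = np.argsort(-time)                  # sort by descending event time
    risk, event = risk[order], event[order]
    log_risk_set = np.log(np.cumsum(np.exp(risk)))
    return -np.sum((risk - log_risk_set) * event)

def mmd_linear(z_a, z_b):
    """Linear-kernel MMD between two groups of representations (a simple IPM)."""
    return float(np.sum((z_a.mean(axis=0) - z_b.mean(axis=0)) ** 2))

def balanced_survival_loss(risk, time, event, z, treated, alpha=1.0):
    """Treatment-specific Cox losses plus an IPM term that penalizes
    distributional differences between treated and non-treated embeddings."""
    loss = 0.0
    for t in (0, 1):
        m = treated == t
        loss += cox_neg_partial_loglik(risk[m], time[m], event[m])
    return loss + alpha * mmd_linear(z[treated == 1], z[treated == 0])
```

Minimizing the IPM term pushes the network toward representations in which treated and non-treated patients are comparable, which is what makes the estimated individual treatment effects trustworthy under observational treatment assignment.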
Presentation Overview:
Motivation: Predicting side effects of drug-drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data is sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances.
Result: We propose SPARSE, which encodes the DDI hypergraph and drug features into latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity through a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive-performance advantage of SPARSE over cutting-edge competing methods. Latent feature analysis of the top unknown predictions by SPARSE also demonstrated the interpretability advantage conferred by the model sparsity.
Presentation Overview:
The improved therapeutic outcomes and reduced adverse effects of synergistic drug combinations have turned them into the standard of care in many cancer types. Given the wealth of relevant information provided by high-throughput screening studies, the costly experimental design of these combinations can now be guided by advanced computational tools. Thus, we present MARSY, a deep multitask learning method that predicts the level of synergism between drug pairs tested on cancer cell lines. Using gene expression to characterize cancer cell lines and induced signature profiles to represent drug pairs, MARSY learns a distinct set of embeddings to obtain multiple views of the features. Specifically, a representation of the entire combination and a representation of the drug pair are learned in parallel. These representations are then fed to a multitask network that predicts the synergy score of the drug combination alongside single-drug responses. A thorough evaluation of MARSY revealed its superior performance compared to various state-of-the-art and traditional computational methods. A detailed analysis of the design choices of our framework demonstrated the predictive contribution of the embeddings learned by this model. Additionally, we predicted 116,348 new synergy scores, which revealed new insights regarding the impact of drug combinations in cancer cell lines.
Presentation Overview:
Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
Presentation Overview:
Motivation: Understanding the mechanisms underlying T cell receptor (TCR) binding is of fundamental importance to understanding adaptive immune responses. A better understanding of the biochemical rules governing TCR binding can be used, for example, to guide the design of more powerful and safer T cell-based therapies. Advances in repertoire sequencing technologies have made millions of TCR sequences available. Data abundance has, in turn, fueled the development of many computational models to predict the binding properties of TCRs from their sequences. Unfortunately, while many of these works have made great strides towards predicting TCR specificity using machine learning, the black-box nature of these models has resulted in a limited understanding of the rules that govern the binding of a TCR and an epitope. Results: We present an easy-to-use and customizable pipeline, DECODE, to extract the binding rules from any black-box model designed to predict TCR-epitope binding. DECODE offers a range of analytical and visualization tools to guide the user in the extraction of such rules. We demonstrate our pipeline on a recently published TCR binding prediction model, TITAN, and show how to use the provided metrics to assess the quality of the computed rules.
In conclusion, DECODE can lead to a better understanding of the sequence motifs that underlie TCR binding. Our pipeline can facilitate the investigation of current immunotherapeutic challenges, such as cross-reactive events due to off-target TCR binding.
Presentation Overview:
Motivation: Dataset sizes in computational biology have increased drastically with the help of improved data collection tools and growing patient cohorts. Previous kernel-based machine learning algorithms proposed for increased interpretability start to fail with large sample sizes, owing to their lack of scalability. To overcome this problem, we propose a fast and efficient multiple kernel learning (MKL) algorithm, particularly suited to large-scale data, that integrates kernel approximation and group Lasso formulations into a conjoint model. Our method extracts significant and meaningful information from the genomic data while conjointly learning a model for out-of-sample prediction. It scales with increasing sample size by approximating rather than calculating distinct kernel matrices.
Results: To test our computational framework, Multiple Approximate Kernel Learning (MAKL), we ran experiments on three cancer datasets and showed that MAKL outperforms the baseline algorithm while using only a small fraction of the input features. We also report the selection frequencies of the approximated kernel matrices associated with feature subsets (i.e., gene sets/pathways), which indicate their relevance for the given classification task. Given its scalability and the highly correlated structure of genomic datasets, our fast and interpretable MKL algorithm, which produces sparse solutions, is promising for computational biology applications and can be used to discover new biomarkers and inform new therapeutic guidelines.
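As a rough illustration of the two ingredients named above (not the MAKL implementation itself), the sketch below approximates one RBF kernel per feature group with random Fourier features and then applies a group-Lasso penalty via proximal gradient descent, so that entire approximate kernels (feature groups, standing in for gene sets) are selected or dropped together. All data, group sizes, and hyperparameters here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(x, n_components, gamma, rng):
    """Random Fourier features approximating an RBF kernel (Rahimi & Recht),
    with columns standardized for the optimizer."""
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(x.shape[1], n_components))
    b = rng.uniform(0, 2 * np.pi, size=n_components)
    Z = np.sqrt(2.0 / n_components) * np.cos(x @ W + b)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

def group_lasso(Z_groups, y, lam, lr=0.01, n_iter=500):
    """Proximal gradient for squared loss + group-Lasso penalty: entire
    feature groups (approximate kernels) are kept or zeroed together."""
    Z = np.hstack(Z_groups)
    sizes = [g.shape[1] for g in Z_groups]
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        w -= lr * Z.T @ (Z @ w - y) / len(y)        # gradient step
        start = 0
        for s in sizes:                              # proximal shrinkage step
            g = w[start:start + s]
            norm = np.linalg.norm(g)
            scale = max(0.0, 1.0 - lr * lam * np.sqrt(s) / norm) if norm > 0 else 0.0
            w[start:start + s] = scale * g
            start += s
    return w

# Two "gene sets"; only the first one generates the label signal.
X1 = rng.normal(size=(500, 1))
X2 = rng.normal(size=(500, 1))
Z1 = rff(X1, 50, 1.0, rng)
Z2 = rff(X2, 50, 1.0, rng)
y = 2.0 * Z1[:, 0] + 0.1 * rng.normal(size=500)

w = group_lasso([Z1, Z2], y, lam=0.21)
norms = [np.linalg.norm(w[:50]), np.linalg.norm(w[50:])]
print(norms)  # the uninformative kernel's whole group is driven to zero
```

The per-group weight norms play the role of the selection frequencies mentioned above: a group whose norm is exactly zero contributes nothing to prediction, so the surviving groups mark the relevant feature subsets.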
Presentation Overview: Show
A challenge with using modern machine learning methods in practice is that, frequently, their learned logic for transforming input features into output predictions is opaque and difficult for humans to understand. In-silico saturation mutagenesis (ISM) attempts to explain this logic on a per-example basis by applying the model to a biological sequence as well as to all possible single mutations of that sequence, and comparing the model outputs. However, when a model contains convolution operations, this procedure performs a significant amount of redundant computation due to the convolution's limited receptive field.
We propose a method, named Yuzu, that speeds up ISM using two ideas: (1) Yuzu operates on the difference in layer outputs between the mutated sequences and the original sequence, which is sparse when the layer is a convolution, and (2) Yuzu uses the principles of compressed sensing to compress these sparse deltas into a compact set of probes that convolutions can efficiently operate on while eliminating redundant computation. Together, these properties enable Yuzu to achieve speedups of over three orders of magnitude for individual convolutions and around one order of magnitude for several published model architectures on both a CPU and GPU.
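Idea (1), the sparsity of the layer-output deltas, is easy to verify on a toy convolution (this is an illustration of the property Yuzu exploits, not Yuzu's implementation): a single-position mutation can only change outputs whose receptive field covers the mutated position.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D cross-correlation: x is an (L, A) one-hot sequence,
    w is a (K, A) filter."""
    L, K = len(x), len(w)
    return np.array([np.sum(x[i:i + K] * w) for i in range(L - K + 1)])

rng = np.random.default_rng(0)
L, A, K = 100, 4, 7
x = np.eye(A)[rng.integers(0, A, size=L)]        # one-hot encoded sequence
w = rng.normal(size=(K, A))                      # a random convolutional filter

mut = x.copy()
mut[40] = np.eye(A)[(np.argmax(x[40]) + 1) % A]  # one single-position mutation

delta = conv1d(mut, w) - conv1d(x, w)
changed = np.nonzero(delta)[0]
print(len(delta), len(changed))  # 94 output positions, but at most K = 7 differ
```

Because only a K-wide window of the delta can be nonzero, most of a naive ISM pass recomputes values that are guaranteed unchanged; compressing these sparse deltas is what makes the large speedups possible.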
Presentation Overview: Show
A major challenge in characterizing intrinsically disordered regions (IDRs), which are widespread in the proteome but relatively poorly understood, is the identification of the molecular features that mediate their functions, such as short motifs, amino acid repeats, and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call “reverse homology”, exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homologue from another set of IDRs otherwise sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
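The reverse-homology objective can be sketched numerically (with a synthetic stand-in for the neural encoder, since the actual architecture is not reproduced here): pool the query homologues into one embedding, score candidates by similarity, and apply a softmax cross-entropy that should rank the held-out homologue above IDRs sampled randomly from the proteome.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

def family(center, n, noise, rng):
    """Synthetic stand-in for encoder outputs: homologous IDRs map near a
    shared 'conserved feature' direction, plus evolutionary noise."""
    return center + noise * rng.normal(size=(n, dim))

def reverse_homology_loss(query_set, held_out, distractors):
    """Contrastive objective: from the query homologues, pick the held-out
    homologue out of a candidate set of random proteome samples."""
    query = query_set.mean(axis=0)                  # pooled family embedding
    cands = np.vstack([held_out[None, :], distractors])
    scores = cands @ query                          # similarity logits
    logp = scores - np.log(np.sum(np.exp(scores)))  # log-softmax
    return -logp[0], int(np.argmax(scores))         # loss, predicted index

center = rng.normal(size=dim)
homologs = family(center, 9, 0.3, rng)
query_set, held_out = homologs[:8], homologs[8]
distractors = rng.normal(size=(15, dim))            # random proteome IDRs

loss, pred = reverse_homology_loss(query_set, held_out, distractors)
print(pred)  # index 0 is the true held-out homologue
```

In training, gradients of this loss flow back into the encoder, which is what forces it to represent exactly the features that are conserved across homologues, the features later read out as motifs, repeats, or bulk properties.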