The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences and identifying genomic locations for integration, with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential for enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate novel 200 bp regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences on key metrics to ensure they retain properties of endogenous sequences, including binding specificity, composition, accessibility, and regulatory potential. We assess transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and the capacity of sequences generated by DNA-Diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for a regulatory-modulation approach to mammalian synthetic biology and precision gene therapy.
Deep learning models can accurately map regulatory DNA to genome-wide profiles of cellular processes. However, the influence of dataset and model design choices on the robustness and reliability of sequence features learned by these models remains unexplored. We demonstrate that contemporary design choices result in models that learn spurious sequence features that violate biologically valid causal interpretation. This phenomenon, which we term “feature leakage”, afflicts several state-of-the-art deep learning models of regulatory DNA such as DeepSEA, Enformer, and scBasset. We identify two key design choices that result in feature leakage: biased selection of training genomic loci, and joint training of multiple tasks using canonical multi-task architectures and loss functions. Biased locus selection, such as training on the union of regulatory events (e.g., DNase-seq peaks) from multiple cellular contexts, creates an artificial enrichment and depletion of motifs of key lineage-defining TFs that are mutually exclusive across cell contexts, resulting in spurious leakage of non-causal features across tasks. Orthogonally, conventional multi-tasking architectures result in entangled shared representations that are unable to separate individual predictive sequence features. To mitigate feature leakage, we propose training single-task models using background locus selection strategies that are not biased towards regulatory events in specific cell contexts.
The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200-nucleotide cores function as non-cell-type-specific ‘on switches’, providing expression levels similar to those of their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we developed sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines and highlights how large-scale functional measurements can be leveraged to dissect the regulatory grammar.
DNA Zoo is a large-scale project that has used Hi-C measurements to create draft genomic assemblies for over 500 species and has collected ATAC-seq datasets for a subset of 100 of those species. Hi-C and ATAC-seq data provide complementary views of chromatin organization, with Hi-C identifying pairwise genomic interactions and ATAC-seq measuring local chromatin accessibility. However, collecting both Hi-C and ATAC-seq data for all species is impractical due to cost and sample availability constraints. To address this challenge, we propose a deep tensor factorization model called Chrome-Zoo that can translate between ATAC-seq and Hi-C in species where only one of these assays is available. We address the challenges associated with handling multiple genomes by training an additional model that converts nucleotide sequences to learned genomic embeddings that are consistent across species. Using these learned genome embeddings, we trained Chrome-Zoo on 18 species with available genome assemblies, Hi-C and ATAC-seq datasets. We show that our model can successfully translate between Hi-C and ATAC-seq in new species at coarse (100 kb) and fine (1–10 kb) resolutions.
The DNA sequence determinants of mammalian Pol II transcription initiation remain incompletely understood. Although we've identified overrepresented motifs in promoters, a third of human promoters contain no known motifs; in promoters with known motifs, how sequence translates into TSS positioning and promoter activity is poorly characterized. We know even less about initiation at enhancers. To address these knowledge gaps, we trained a deep learning model, ProCapNet, to predict transcription initiation, measured genome-wide at base-resolution by PRO-cap, from DNA sequence. ProCapNet accurately predicts exact TSS locations and initiation rate consistently across promoter classes and at enhancers. We next applied a model interpretation framework to identify a high-sensitivity collection of motifs predictive of transcription initiation. Then, to dissect how these motifs modulate initiation, we performed systematic in silico mutational experiments. Results suggest nuanced epistasis: motifs play specialized roles, dependent on other nearby motifs. For multiple motifs, we identified a novel secondary function as direct initiation sites. We quantified the contribution of motifs to TSS positioning and initiation rate, finding motif-specific positioning signatures that suggest a general rule of redistribution. Finally, we compared the sequence determinants of initiation in promoters vs. enhancers; results support a unified model of cis-regulatory syntax for transcription initiation.
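The systematic in silico mutational experiments mentioned above follow a standard pattern: substitute each base of a sequence and record how the model's prediction changes. The sketch below illustrates that pattern with a toy motif-counting scorer standing in for a trained model such as ProCapNet; the scorer, the sequence, and the function names are illustrative assumptions, not the authors' code.

```python
# In silico saturation mutagenesis sketch: for each position in a sequence,
# substitute every alternative base and record the change in a model score.
# A toy scorer (count of 'TATA' matches) stands in for a trained model.

def toy_score(seq):
    """Toy stand-in for a trained initiation model: counts 'TATA' matches."""
    return sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "TATA")

def ism(seq, score_fn):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    ref = score_fn(seq)
    effects = {}
    for pos, base in enumerate(seq):
        for alt in "ACGT":
            if alt == base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - ref
    return effects

effects = ism("GGTATAGG", toy_score)
# Disrupting the TATA at positions 2-5 lowers the score; most other edits are neutral.
```

In practice the scorer would be a trained network and the effect map would be aggregated across many loci, but the mutate-and-rescore loop is the same.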
Epigenetic modifications are dynamic mechanisms involved in the regulation of gene expression. Unlike the DNA sequence, epigenetic patterns vary not only between individuals, but also between different cell types within an individual. Epigenetic changes are reversible and thus promising therapeutic targets for precision medicine. However, mapping efforts to determine an individual’s cell-type-specific epigenome are constrained by experimental costs and tissue accessibility. We developed eDICE, a deep-learning model that employs attention mechanisms to impute epigenomic tracks. eDICE is trained to reconstruct masked epigenomic tracks within sets of epigenomic measurements derived from large-scale mapping efforts. By learning to encode the epigenomic signal at a particular genomic position into factorised representations of the epigenomic state of each profiled cell type as well as the local activity profile of each epigenomic assay, eDICE is able to generate genome-wide imputations for the signal tracks of assays in cell types in which measurements are currently unavailable. We demonstrate improved performance relative to previous imputation methods on the reference Roadmap epigenomes, and additionally show that eDICE is able to predict individual-specific epigenetic patterns in unobserved tissues when trained on individual-specific epigenomes from ENTEx.
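The intuition behind factorised cell-type and assay representations can be sketched with a much simpler model than eDICE's attention network: if each measurement approximately factorises into a cell factor times an assay factor (a deliberate rank-1 simplification, assumed here for illustration only), a missing (cell type, assay) entry follows from a cross-ratio of observed entries.

```python
# Imputation under a toy factorised signal model: assume each (cell type,
# assay) measurement is approximately cell_factor * assay_factor. Under that
# rank-1 assumption, a missing entry follows from a cross-ratio of observed
# entries sharing its row and column. This is a conceptual stand-in, not eDICE.

def impute(signal, cell, assay):
    """Impute signal[cell][assay] from observed entries; None if no reference exists."""
    for ref_cell in signal:
        if ref_cell == cell:
            continue
        row_c, row_r = signal[cell], signal[ref_cell]
        for ref_assay in row_r:
            if ref_assay != assay and ref_assay in row_c and assay in row_r:
                # the cell/ref_cell ratio is shared across assays in a rank-1 model
                return row_c[ref_assay] * row_r[assay] / row_r[ref_assay]
    return None

# Toy rank-1 data: cell factors {A: 1, B: 3}, assay factors {x: 2, y: 5}.
signal = {"A": {"x": 2.0, "y": 5.0}, "B": {"x": 6.0}}
print(impute(signal, "B", "y"))  # 15.0 = 6.0 * 5.0 / 2.0
```

Real epigenomic signal is not rank-1, which is why eDICE learns richer, attention-based factor representations, but the imputation-by-factorisation logic is the same.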
Abstract: In this talk we show how to learn the underlying geometry of data using heat diffusion, which can represent each point by a distribution of transition probabilities that are proportional to manifold distance. This distribution renders point cloud data into a statistical manifold. Then we show how to derive low dimensional visualizations and embeddings of such data using divergences between such datapoint transition probabilities. I will then cover recent work which learns a continuous model of such a statistical manifold using a neural network which is then used to learn the infinitesimal analog of such a divergence: a Fisher information metric. Next we show how to compare many such distributions using multiscale diffusion distances for optimal transport. Then we move from static to dynamic optimal transport using neural ODEs in order to learn dynamic trajectories from static snapshot data—a key problem in inference from single cell data. Finally, we will show how to characterize and classify dynamic regimes by combining geometry with topology. Throughout the talk, we present examples of such techniques being applied to massively high throughput and high dimensional datasets from biology and medicine including in cancer and neurodegeneration.
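The first step of the construction described above, turning each point into a distribution of transition probabilities, can be sketched in a few lines: build a Gaussian (heat-kernel) affinity from pairwise distances and row-normalise it. The bandwidth `sigma` and the toy points are illustrative choices.

```python
# Minimal diffusion-operator sketch: Gaussian heat-kernel affinities from
# pairwise Euclidean distances, row-normalised into transition probabilities.

import math

def transition_matrix(points, sigma=1.0):
    """Row-stochastic diffusion operator from a list of coordinate tuples."""
    n = len(points)
    dist2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
              for j in range(n)] for i in range(n)]
    affinity = [[math.exp(-d / (2 * sigma ** 2)) for d in row] for row in dist2]
    return [[a / sum(row) for a in row] for row in affinity]

P = transition_matrix([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], sigma=0.5)
# Each row of P is one point's transition distribution; nearby points
# receive most of the probability mass, encoding manifold proximity.
```

Divergences between these rows (e.g., between `P[i]` and `P[j]`) are then what drive the visualisations and embeddings mentioned in the talk.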
The increasing availability of high-throughput omics data opens the door to a new medicine centered on individual patients. Precision medicine relies on exploiting these high-throughput data with machine-learning models, especially those based on deep-learning approaches, to improve diagnosis. Due to the high-dimensional, small-sample nature of omics data, current deep-learning models end up with many parameters and have to be fitted with a limited training set. Furthermore, interactions between molecular entities inside an omics profile are not patient-specific but are the same for all patients.
In this article, we propose AttOmics, a new deep-learning architecture based on the self-attention mechanism. First, we decompose each omics profile into a set of groups, where each group contains related features. Then, by applying the self-attention mechanism to the set of groups, we can capture the different interactions specific to a patient. The results of different experiments carried out in this paper show that our model can accurately predict the phenotype of a patient with fewer parameters than deep neural networks. Visualizing the attention maps can provide new insights into the essential groups for a particular phenotype.
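The group-then-attend idea above can be illustrated with a from-scratch toy: summarise each feature group as a vector, then let scaled dot-product self-attention mix information across groups. This is a minimal sketch of the mechanism, not the AttOmics implementation; the group vectors are assumed inputs, and no learned projections are included.

```python
# Toy self-attention over feature groups: each omics profile is split into
# groups of related features, each group is summarised by a vector, and
# scaled dot-product self-attention mixes information across groups.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(groups):
    """groups: list of equal-length vectors, one per feature group.
    Uses the group vectors themselves as queries, keys, and values."""
    d = len(groups[0])
    out = []
    for q in groups:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in groups]
        weights = softmax(scores)  # one row of the attention map
        out.append([sum(w * v[i] for w, v in zip(weights, groups))
                    for i in range(d)])
    return out

mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a full model the attention weights (`weights` above) are what would be visualised as attention maps to highlight which groups matter for a patient's phenotype.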
Single-cell RNA-sequencing technologies have greatly enhanced our understanding of heterogeneous cell populations and underlying regulatory processes. However, structural (spatial or temporal) relations between cells are lost during cell dissociation. These relations are crucial for identifying associated biological processes. Many existing tissue-reconstruction algorithms use prior information about subsets of genes that are informative with respect to the structure or process to be reconstructed. When such information is not available, and in the general case when the input genes code for multiple processes and are susceptible to noise, biological reconstruction is often computationally challenging.
We propose an algorithm that iteratively identifies manifold-informative genes, using existing reconstruction algorithms for single-cell RNA-seq data as a subroutine. We show that our algorithm improves the quality of tissue reconstruction for diverse synthetic and real scRNA-seq data, including data from the mammalian intestinal epithelium and liver lobules.
The profiling of multiple molecular layers from the same set of cells has recently become possible. There is thus a growing need for multi-view learning methods able to jointly analyze such data. We here present Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization (NMF) and Optimal Transport (OT), enhancing at the same time the clustering performance and interpretability of integrative NMF. We apply Mowgli to multiple paired single-cell multi-omics data profiled with 10X Multiome, CITE-seq, and TEA-seq. Our in-depth benchmark demonstrates that Mowgli’s performance is competitive with the state-of-the-art in cell clustering and superior to the state-of-the-art when considering biological interpretability. Mowgli is implemented as a Python package seamlessly integrated within the scverse ecosystem and is available at http://github.com/cantinilab/mowgli.
Single cell transcriptomics is revolutionizing our understanding of the cellular complexity of tissues at a systems level. As cells are classified into clusters based on similar gene expression profiles, there is a need to identify cell type-specific biomarkers to reliably identify and match cells of the same type in new experiments. We’ve developed an algorithm – NS-Forest – that leverages the explainable characteristics of random forest machine learning to capture the most informative gene expression feature combinations that maximize cell type classification accuracy. Applied to several human reference datasets from brain, lung, and kidney, NS-Forest selects on average ~2.5 marker genes per cell type for optimal classification. These cell type marker genes can be used as targets for spatial transcriptomics cell localization, as definitional characteristics for semantic cell type representation in the Provisional Cell Ontology (https://bioportal.bioontology.org/ontologies/PCL), and as a reduced feature space for cell type matching between datasets. Using NS-Forest markers and the multivariate statistical graph algorithm – FR-Match – to compare human middle temporal gyrus and primary motor cortex, we find that the majority of GABAergic inhibitory neuron and glial cell types are well conserved across cortical brain regions, whereas the glutamatergic excitatory neuron types are region specific.
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal implications spanning drug development and manufacturing, plastic degradation, and carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. In addition to discussing this topic, I will also give an overview of the different ways in which machine learning is being developed and applied to protein engineering.
There exists a range of different quantification frameworks to estimate the synergistic effect of drug combinations. The diversity and disagreement in estimates make it challenging to determine which combinations from a large drug screen should be taken forward. Furthermore, the lack of accurate uncertainty quantification for those estimates precludes the choice of optimal drug combinations based on the most favourable synergistic effect. In this work, we propose SynBa, a flexible Bayesian approach to estimate the uncertainty of the synergistic efficacy and potency of drug combinations, so that actionable decisions can be derived from the model outputs. The actionability is enabled by incorporating the Hill equation into SynBa, so that the parameters representing the potency and the efficacy can be preserved. Existing knowledge may be conveniently inserted due to the flexibility of the prior, as shown by the empirical Beta prior defined for the normalised maximal inhibition. Through experiments on large combination screenings and comparison against benchmark methods, we show that SynBa provides improved accuracy of dose-response predictions and better-calibrated uncertainty estimation for the parameters and the predictions.
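The Hill equation underlying SynBa's interpretable parameterisation can be written out directly: the response at dose d is governed by the baseline and maximal effects (efficacy), the EC50 (potency), and the Hill slope. The parameter names below follow common convention; SynBa's exact parameterisation may differ.

```python
# The Hill dose-response equation: E(d) = E0 + (Emax - E0) * d^h / (EC50^h + d^h).
# E0/Emax capture efficacy, EC50 captures potency, h is the Hill slope.

def hill(d, E0, Emax, EC50, h):
    """Response at dose d under the Hill equation."""
    return E0 + (Emax - E0) * d ** h / (EC50 ** h + d ** h)

# At d == EC50 the response sits exactly halfway between E0 and Emax, which is
# what makes EC50 a directly interpretable potency parameter worth preserving.
print(hill(2.0, 0.0, 1.0, 2.0, 1.5))  # 0.5
```

Because these parameters each carry a pharmacological meaning, placing priors on them (as SynBa does) yields uncertainty estimates that map directly onto actionable quantities.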
Motivation: Recent advances in spatial proteomics technologies have enabled the profiling of dozens of proteins in thousands of single cells in situ. This has created the opportunity to move beyond quantifying the composition of cell types in tissue, and instead probe the spatial relationships between cells. However, current methods for clustering data from these assays only consider the expression values of cells and ignore the spatial context. Furthermore, existing approaches do not account for prior information about the expected cell populations in a sample.
Results: To address these shortcomings, we developed SpatialSort, a spatially aware Bayesian clustering approach that allows for the incorporation of prior biological knowledge. Our method is able to account for the affinities of cells of different types to neighbor in space, and by incorporating prior information about expected cell populations, it is able to simultaneously improve clustering accuracy and perform automated annotation of clusters. Using synthetic and real data, we show that by using spatial and prior information SpatialSort improves clustering accuracy. We also demonstrate how SpatialSort can perform label transfer between spatial and non-spatial modalities through the analysis of a real-world diffuse large B-cell lymphoma dataset.
Assay for Transposase Accessible Chromatin by sequencing (ATAC-seq) provides an accurate way to depict the chromatin regulatory state and altered mechanisms guiding gene expression in disease. However, bulk sequencing entangles information from different cell types and obscures cellular heterogeneity. Here, we develop and validate Cellformer, a novel deep learning method that deconvolutes bulk ATAC-seq into cell type-specific expression across the whole genome. Cellformer enhances bulk ATAC-seq resolution and enables efficient cell type-specific open chromatin profiling of large cohorts at low cost. Applied to 191 bulk samples from 3 brain regions, Cellformer identifies cell type-specific gene regulatory mechanisms and putative mediators involved in resilience to Alzheimer’s disease (RAD), an uncommon group of cognitively healthy individuals that harbor a high pathological load of Alzheimer’s disease (AD). Cell type-resolved chromatin profiling unveils cell type-specific pathways and nominates potential epigenetic mediators underlying RAD that may illuminate therapeutic opportunities to limit the cognitive impact of this highly prevalent yet incurable disease. Cellformer has been made freely and publicly available to advance analysis of high-throughput bulk ATAC-seq in future investigations.
Antigen presentation on MHC Class II (pMHCII presentation) plays an essential role in the adaptive immune response to extracellular pathogens and cancerous cells. But it can also reduce the efficacy of large-molecule drugs by triggering an anti-drug response. Significant progress has been made in pMHCII presentation modeling due to the collection of large-scale pMHC mass spectrometry datasets (ligandomes) and advances in machine learning. Here, we develop graph-pMHC, a graph neural network approach to predict pMHCII presentation. We derive adjacency matrices for pMHCII using Alphafold2-multimer, and address the peptide-MHC binding groove alignment problem with a simple graph enumeration strategy. We demonstrate that graph-pMHC dramatically outperforms methods with suboptimal inductive biases, such as the multilayer-perceptron-based NetMHCIIpan-4.0 (+22.84% average precision). Finally, we create an antibody drug immunogenicity dataset from clinical trial data, and develop a method for measuring anti-antibody immunogenicity risk using pMHCII presentation models. In comparison with BioPhi’s Sapiens score, a deep learning based measure of the humanness of an antibody drug, our strategy achieves a 7.14% ROC AUC improvement in predicting antibody drug immunogenicity.
Utilizing AI-driven approaches for drug–target interaction (DTI) prediction requires large volumes of training data, which are not available for the majority of target proteins. In this study, we investigate the use of deep transfer learning for the prediction of interactions between drug candidate compounds and understudied target proteins with scarce training data. The idea here is to first train a deep neural network classifier with a generalized source training dataset of large size and then reuse this pre-trained neural network as an initial configuration for re-training/fine-tuning purposes with a small-sized specialized target training dataset. To explore this idea, we selected six protein families that have critical importance in biomedicine: kinases, G-protein-coupled receptors (GPCRs), ion channels, nuclear receptors, proteases, and transporters. The protein families of transporters and nuclear receptors were individually set as the target datasets, while the other five families were used as the source datasets. Several size-based target family training datasets were formed in a controlled manner. Here we present a disciplined evaluation by pre-training a feed-forward neural network with source training datasets and applying different modes of transfer learning from the pre-trained source network to a target dataset. The performance of deep transfer learning is evaluated and compared with that of training the same deep neural network from scratch. We found that when the training dataset is smaller than 100 compounds, transfer learning yields significantly better performance compared to training the system from scratch, suggesting an advantage to using transfer learning to predict binders to under-studied targets.
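The pre-train/fine-tune mechanism described above, reusing trained weights as the initial configuration for a small target task, can be sketched with a tiny logistic-regression "network" in place of a deep model. The data, learning rate, and epoch counts below are invented for illustration.

```python
# Transfer-learning pattern in miniature: train on a large source task, then
# continue training from those weights on a small target task (fine-tuning),
# instead of restarting from a random or zero initialization.

import math

def train(data, weights, lr=0.5, epochs=200):
    """SGD on a logistic unit; `weights` is the starting point, so passing
    pre-trained weights implements fine-tuning."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def accuracy(data, w):
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x, _ in data]
    return sum(p == y for p, (_, y) in zip(preds, data)) / len(data)

# A larger "source" task and a tiny "target" task sharing the same decision rule
# (label 1 when the second feature is positive; first feature is a bias term).
source = [((1.0, 1.0), 1), ((1.0, -1.0), 0), ((1.0, 0.8), 1), ((1.0, -0.6), 0)]
target = [((1.0, 0.5), 1), ((1.0, -0.5), 0)]

w_pre = train(source, [0.0, 0.0])          # pre-training on the source task
w_ft = train(target, w_pre, epochs=10)     # fine-tuning: reuse w_pre as init
```

With only two target examples, the fine-tuned model inherits the source task's decision rule rather than having to rediscover it, which is the advantage the study quantifies for under-studied protein families.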
Motivation: The coronavirus disease 2019 (COVID-19) remains a global public health emergency. Although people, especially those with underlying health conditions, could benefit from several approved COVID-19 therapeutics, the development of effective antiviral COVID-19 drugs is still a very urgent problem. Accurate and robust drug response prediction to a new chemical compound is critical for discovering safe and effective COVID-19 therapeutics.
Results: In this study, we propose DeepCoVDR, a novel COVID-19 drug response prediction method based on deep transfer learning with graph transformer and cross-attention. First, we adopt a graph transformer and feed-forward neural network to mine the drug and cell line information. Then, we use a cross-attention module that calculates the interaction between the drug and cell line. After that, DeepCoVDR combines drug and cell line representation and their interaction features to predict drug response. To solve the problem of SARS-CoV-2 data scarcity, we apply transfer learning and use the SARS-CoV-2 dataset to fine-tune the model pre-trained on the cancer dataset. The experiments of regression and classification show that DeepCoVDR outperforms baseline methods. We also evaluate DeepCoVDR on the cancer dataset, and the results indicate that our approach has high performance compared with other state-of-the-art methods. Moreover, we use DeepCoVDR to predict COVID-19 drugs from FDA-approved drugs and demonstrate the effectiveness of DeepCoVDR in identifying novel COVID-19 drugs.
Protein-ligand binding affinity prediction is an important task in drug design and development. Cross-modal attention mechanisms have become a core component of deep learning models due to the significance of model explainability. Non-covalent interactions, one of the key chemical aspects of this task, should be incorporated into the protein-ligand attention mechanism. We propose ArkDTA, a novel deep neural architecture for explainable binding affinity prediction guided by non-covalent interactions. Experimental results show that ArkDTA achieves predictive performance comparable to current state-of-the-art models while significantly improving model explainability. Qualitative investigation into our novel attention mechanism reveals that ArkDTA can identify potential regions for non-covalent interactions between candidate drug compounds and target proteins, and can guide the internal operations of the model in a more interpretable and domain-aware manner. ArkDTA is available at https://github.com/dmis-lab/ArkDTA.
The expensive and time-consuming nature of drug discovery necessitates the incorporation of innovative computational techniques into research and development pipelines. Representation learning has emerged as a promising solution for creating compact and informative numerical representations of molecules that can be utilized effectively in subsequent prediction tasks. However, current methods suffer from robustness and validity issues, primarily as a result of the input encoding or algorithms employed. In this study, we introduce SELFormer, a transformer-based chemical language model that takes as input SELFIES, a 100% valid, compact, and expressive notation. SELFormer is pre-trained on two million drug-like compounds in the ChEMBL database and evaluated on various molecular property prediction tasks. SELFormer demonstrated superior performance in predicting the aqueous solubility of molecules and adverse drug reactions compared to existing graph learning-based methods and SMILES-based chemical language models, while producing comparable results for other tasks. We shared SELFormer as a programmatic tool, along with its datasets and pre-trained models. Overall, our research demonstrates the benefits of combining a valid and expressive molecular notation with the appropriate deep learning architecture in chemical language modeling, thereby opening up new possibilities for discovering and designing novel drug candidates.
Discovering novel drug candidate molecules is one of the most fundamental and critical steps in drug development. Generative models offer high potential for designing de novo molecules; however, to be useful in real-life drug development pipelines, these models should be able to design target-specific molecules. In this study, we propose a novel generative system, DrugGEN, for de novo design of drug candidate molecules that interact with selected target proteins. The proposed system represents compounds and protein structures as graphs and processes them via two serially connected generative adversarial networks comprising graph transformers. The system is trained using two million compounds from ChEMBL and target-specific bioactive molecules, to design effective and specific inhibitory molecules against the AKT1 protein. DrugGEN performs competitively with, or better than, other methods on fundamental benchmarks. To assess the target-specific generation performance, we conducted further in silico analysis with molecular docking. The results indicate that the de novo molecules have high potential for interacting with the AKT1 protein structure at the level of its native ligand. DrugGEN can be used to design novel and effective target-specific drug candidate molecules for any druggable protein, given the target features and a dataset of known bioactive molecules.
Deep neural networks are increasingly used to analyze biological sequences, including DNA, RNA and proteins, leading to promising applications in annotation, classification, structure prediction or generation. While the architectures of deep neural networks for biosequences have been so far largely borrowed from the field of natural language processing, I will discuss in this presentation some specificities of biosequences that deserve specific methodological developments, in particular 1) how to transform a biosequence into a sequence of tokens, 2) how to incorporate some known symmetries of biosequences in the architecture of the model, and 3) how to solve tasks which are specific to biosequences such as learning to align.
Motivation: The size of available omics datasets is steadily increasing with technological advancement in recent years. While this increase in sample size can be used to improve the performance of relevant prediction tasks in healthcare, models that are optimized for large datasets usually operate as black boxes. In high-stakes scenarios, such as healthcare, using a black-box model poses safety and security issues. Without an explanation of the molecular factors and phenotypes that affected the prediction, healthcare providers are left with no choice but to blindly trust the models. We propose a new type of artificial neural network, named Convolutional Omics Kernel Network (COmic). By combining convolutional kernel networks with pathway-induced kernels, our method enables robust and interpretable end-to-end learning on omics datasets ranging in size from a few hundred to several hundreds of thousands of samples. Furthermore, COmic can be easily adapted to utilize multi-omics data.
Results: We evaluated the performance capabilities of COmic on six different breast cancer cohorts. Additionally, we trained COmic models on multi-omics data using the METABRIC cohort. Our models performed either better or similar to competitors on both tasks. We show how the use of pathway-induced Laplacian kernels opens the black-box nature of neural networks and results in intrinsically interpretable models that eliminate the need for post-hoc explanation models.
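The idea of a pathway-induced kernel can be sketched concretely: restrict a base kernel to each pathway's features and sum over pathways, so the model's similarity structure follows prior biological grouping. The Gaussian base kernel and the toy pathway map below are illustrative assumptions, not COmic's exact construction.

```python
# Pathway-induced kernel sketch: similarity between two omics profiles is a
# sum of per-pathway Gaussian kernels, so each pathway contributes a separate,
# interpretable term to the overall similarity.

import math

def pathway_kernel(x, y, pathways, gamma=1.0):
    """`pathways` maps a pathway name to the feature indices it contains."""
    total = 0.0
    for idx in pathways.values():
        d2 = sum((x[i] - y[i]) ** 2 for i in idx)
        total += math.exp(-gamma * d2)
    return total

pathways = {"apoptosis": [0, 1], "cell_cycle": [2, 3]}
x = [0.1, 0.2, 0.9, 0.4]
y = [0.1, 0.2, 0.1, 0.8]
# Identical apoptosis features contribute a full 1.0; the divergent
# cell-cycle features contribute less, making the decomposition readable.
print(pathway_kernel(x, x, pathways))  # 2.0 (each pathway kernel equals 1 at x == x)
```

Because the kernel decomposes additively over pathways, each pathway's contribution to a prediction can be read off directly, which is the source of the intrinsic interpretability the abstract describes.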
The ability to modulate pathogenic proteins represents a powerful treatment strategy for diseases. Unfortunately, many proteins are considered “undruggable” by small molecules, and are often intrinsically disordered, precluding the use of structure-based tools for binder design. To address these challenges, we have developed a suite of algorithms that enable the design of target-specific peptides via protein language model embeddings, without the requirement of 3D structures. First, we train a model that leverages ESM-2 embeddings to efficiently select high-affinity peptides from natural protein interaction interfaces. We experimentally fuse model-derived peptides to E3 ubiquitin ligases and identify candidates exhibiting robust degradation of undruggable targets in human cells. Next, we develop a high-accuracy discriminator, based on the CLIP architecture, to prioritize and screen peptides with selectivity to a specified target protein. As input to the discriminator, we create a Gaussian diffusion generator to sample an ESM-2-based latent space, fine-tuned on experimentally validated peptide sequences. Finally, to enable de novo generation of binding peptides, we train an instance of GPT-2 with protein interacting sequences to enable peptide generation conditioned on target sequence. Our model demonstrates low perplexities across both existing and generated peptide sequences. Together, our work lays the foundation for programmable protein targeting applications.
BACKGROUND: Deep learning is powerful, but interpretability remains a challenge. A unique approach for interpretability builds on biological knowledge to construct the computational graph of a neural network such that hidden nodes represent biological entities (e.g., pathways). After training, such “biology-inspired” neural networks reveal biological pathways involved in a given process (e.g., cancer).
MOTIVATION: Biology-inspired models provide an unprecedented ability to interpret hidden nodes, in contrast to the common approaches that interpret input features. However, critical elements of interpretability remain unsolved. First, the random initialization of weights limits the robustness of interpretations. Second, biases in biological knowledge favor highly connected hidden nodes. Yet, despite their critical relevance, robustness and network biases are largely unstudied.
METHODS: We developed methods to assess and control robustness and network biases, and validated them in state-of-the-art biology-inspired models to evaluate their impact on interpretations.
RESULTS: We demonstrate that controlling both robustness and biases is required for reliable interpretability. We find that the impact of robustness and biases on interpretations depends on the difficulty of the prediction task, and we identify which network biases most strongly affect interpretations. Together, these results reveal critical elements of interpretability that may be relevant beyond the special case of biology-inspired deep learning.
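The defining ingredient of a biology-inspired network is a connectivity mask derived from biological annotations, so that each hidden node corresponds to one pathway. A minimal sketch, with a hypothetical 3-pathway-by-6-gene membership matrix standing in for real annotations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gene-to-pathway membership mask (3 pathways x 6 genes).
# A weight may be nonzero only where mask == 1, so each hidden node
# aggregates only the genes annotated to its pathway.
mask = np.array([[1, 1, 0, 0, 0, 0],
                 [0, 0, 1, 1, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)

# Random initialization (the robustness issue above: different seeds
# can yield different interpretations), masked to the pathway structure.
W = rng.normal(size=mask.shape) * mask

def pathway_layer(x):
    """Hidden activations: one interpretable node per pathway
    (ReLU over the pathway-masked linear map)."""
    return np.maximum(0.0, (W * mask) @ x)
```

Note also how the mask encodes the connectivity bias the abstract warns about: pathway 2 has three member genes while pathways 0 and 1 have two, so its node receives more inputs and can appear spuriously important.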
Presentation Overview: Show
Identifying protein-protein interactions (PPIs) is crucial for deciphering biological pathways and their dysregulation. Numerous prediction methods have been developed as a cheap alternative to biological experiments, reporting phenomenal accuracy estimates. While most methods rely exclusively on sequence information, PPIs occur in 3D space. As predicting protein structure from sequence is an infamously complex problem, the almost perfect reported performances for PPI prediction seem dubious. We systematically investigated the extent to which reproducible deep learning models depend on data leakage, sequence similarities, and node degree information, and compared them to basic machine learning models. We found that overlaps between training and test sets resulting from random splitting lead to strongly overestimated performances, giving a false impression of the field. In this setting, models learn solely from sequence similarities and node degrees. When data leakage is avoided by minimizing sequence similarities between training and test sets, performances become random, leaving this research field wide open.
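The leakage-avoiding split described above can be sketched as cluster-then-split: group sequences by pairwise identity, then keep whole clusters on one side of the partition so no test sequence is similar to any training sequence. The greedy single-linkage clustering and the per-position identity measure are simplifications; real pipelines typically use tools such as CD-HIT or MMseqs2.

```python
def identity(a, b):
    """Crude per-position identity over the shorter sequence."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_clusters(seqs, thresh=0.4):
    """Greedy single-linkage clustering: a sequence joins the first
    cluster containing any member above the identity threshold."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(identity(s, m) >= thresh for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def cluster_split(seqs, thresh=0.4, test_frac=0.3):
    """Assign whole clusters to the test set until the target fraction
    is reached, so no test sequence exceeds the identity threshold
    with any training sequence."""
    test, train = [], []
    for c in greedy_clusters(seqs, thresh):
        (test if len(test) < test_frac * len(seqs) else train).extend(c)
    return train, test
```

Random splitting, by contrast, would scatter near-duplicate sequences across both sides, which is exactly the leakage that inflates the reported accuracies.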
Presentation Overview: Show
G-quadruplexes are non-B-DNA structures that form in the genome by Hoogsteen bonds between guanines in single or multiple DNA strands. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules. We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on millions of human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method for the task, achieved a Pearson correlation of over 0.8. Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Finally, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation by unique visualization of the principles learned by the model.
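The convolutional core of such a model can be illustrated with a toy example: one-hot encode the DNA, slide a filter across it, and max-pool the activations. Here the single filter is hand-set to fire on runs of three guanines (the building block of G-quadruplex motifs) as a stand-in for the many filters G4mismatch learns from data; it is not the trained model.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA string as a (len, 4) matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, idx[b]] = 1.0
    return x

# Hand-set convolutional filter firing on GGG runs: a toy stand-in
# for the motif detectors a trained network learns.
g_run = np.zeros((3, 4))
g_run[:, BASES.index("G")] = 1.0

def conv_score(seq, kernel=g_run):
    """Max-pooled 1D convolution of the filter over the one-hot sequence."""
    x = one_hot(seq)
    k = kernel.shape[0]
    acts = [float((x[i:i + k] * kernel).sum()) for i in range(len(seq) - k + 1)]
    return max(acts)
```

Visualizing which learned filters drive high predicted mismatch scores is what enables the mechanistic interpretation mentioned in the final sentence.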