Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Posters - Schedules

Poster presentations at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations through the conference platform until October 31, 2021. Q&A opportunities are available through a chat function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:

View Posters By Category

Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data
  • Shai He, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States
  • Aaron Schein, Data Science Institute, Columbia University, United States
  • Vishal Sarsani, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States
  • Patrick Flaherty, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States

Short Abstract: There is often extensive genetic heterogeneity in a tumor as evidenced by single-cell and bulk DNA sequencing data. Understanding the genetic composition of a tumor is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment. Thus, we propose a Bayesian nonparametric hierarchical Dirichlet process mixture model to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data.

Inference with our model provides estimates of the subpopulation genotypes and the distribution over subpopulations in each sample. We represent the model as a Gamma-Poisson hierarchical model and derive a fast Gibbs sampling algorithm with analytical sampling steps based on this representation using the augment-and-marginalize method. This representation and inference algorithm can also be used with other hierarchical Dirichlet process prior models to derive a fast Gibbs sampler.

Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analyses of a real acute lymphoblastic leukemia sequencing dataset show that our model improves upon state-of-the-art bioinformatic methods. Interpreting the results of our model on this real dataset reveals co-mutated loci across samples.

A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes
  • Arjun Krishnan, Michigan State University, United States
  • Chris Mancuso, Michigan State University, United States
  • Jacob Canfield, Michigan State University, United States
  • Deepak Singla, NTU Singapore, Singapore

Short Abstract: While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
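
The core idea of SampleLASSO, fitting a sparse regression for each target sample with training samples as the predictors and carrying the learned sample weights over to the unmeasured genes, can be illustrated with a minimal sketch. This is a toy coordinate-descent LASSO on assumed example data, not the authors' implementation; the function names (`lasso_cd`, `impute_sample`) are hypothetical.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used in coordinate-descent LASSO."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_cd(A, y, lam=0.01, n_iter=100):
    """Minimize (1/2n)||y - Aw||^2 + lam*||w||_1 by coordinate descent.
    A: rows = measured genes of the target sample, cols = training samples."""
    n, p = len(A), len(A[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's own contribution removed
            r = [y[i] - sum(A[i][k] * w[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(A[i][j] * r[i] for i in range(n)) / n
            z = sum(A[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z else 0.0
    return w

def impute_sample(train_measured, train_unmeasured, target_measured, lam=0.01):
    """Fit the target's measured genes on the training samples, then apply
    the learned sample weights to the training values of unmeasured genes."""
    w = lasso_cd(train_measured, target_measured, lam)
    return [sum(row[j] * w[j] for j in range(len(w)))
            for row in train_unmeasured]
```

On a toy matrix where the target matches one training sample exactly, the fit recovers a weight near 1 on that sample (slightly shrunk by the L1 penalty) and 0 elsewhere, so the imputed values track that sample's unmeasured genes.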

A machine learning approach to identify novel antifungal targets in Candida albicans
  • Xiang Zhang, University of Minnesota, United States
  • Ci Fu, University of Toronto, Canada
  • Amanda Veri, University of Toronto, Canada
  • Kali Iyer, University of Toronto, Canada
  • Emma Lash, University of Toronto, Canada
  • Alice Xue, University of Toronto, Canada
  • Nicole Revie, University of Toronto, Canada
  • Elizabeth Polvi, University of Toronto, Canada
  • Sean Liston, University of Toronto, Canada
  • Benjamin Vandersluis, University of Minnesota, United States
  • Charles Boone, University of Toronto, Canada
  • Teresa O'Meara, University of Michigan, United States
  • Matthew O'Meara, University of Michigan, United States
  • Nicole Robbins, University of Toronto, Canada
  • Leah Cowen, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States

Short Abstract: Candida albicans is an opportunistic fungal pathogen that can cause deadly infections in humans. Understanding which genes are essential for growth of this organism would provide opportunities for developing more effective therapeutics. Unlike in the model yeast Saccharomyces cerevisiae, construction of mutants is considerably more laborious in C. albicans. To prioritize efforts for mutant construction and identification of essential genes, we built a random forest-based machine learning model, using a previously constructed set of 2,327 C. albicans GRACE (gene replacement and conditional expression) strains as a basis for training. We identified several relevant features contributing unique information to the predictions. Through cross-validation analysis of our random forest model, we estimated an AUC of 0.92 and an average precision of 0.77. Given these strong results, we prioritized the construction of an additional set of >800 strains and discovered essential genes at a rate of ~64% amongst these new predictions, relative to an expected background rate of essentiality of ~20%. Our machine learning approach is an effective strategy for efficient discovery of essential genes, and a similar approach may also be useful in other species.

A novel information-theoretic approach to improving training efficiency of clinical diagnostic classifiers
  • Amitesh Pratap, Inflammatix, United Kingdom
  • Michael Mayhew, Inflammatix, United Kingdom

Short Abstract: Successful identification of an acute infection using patients’ gene-expression data with machine learning classifiers is made complicated by training on incomplete and noisy multi-cohort gene expression datasets. The classifier performance is adversely affected by the introduction of noisy data points during training; therefore, a quantitative assessment of information content of the candidate datasets can lead to more efficient learning. We adapt the Bayesian Optimal Experimental Design (BOED) framework to improve classifier performance and its generalisation by sequentially training on datasets that have the highest information content. An objective function is used to rank the datasets according to their predicted Expected Information Gain (EIG) value for a chosen Gaussian Process (GP) classifier. We find that the classifier training performed on datasets with high EIG values gives the largest increases in the classifier accuracy. An online learning algorithm is proposed where the gene-expression datasets are used sequentially for GP classifier training in order of their EIG values. Furthermore, the predicted EIG values for unsampled regions of gene-expression dataspace can be used to guide future clinical data acquisition for improving classifier accuracy.
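
The sequential selection loop, ranking candidate datasets by predicted informativeness and training on the best first, can be illustrated with a much simpler proxy than the BOED/EIG machinery: mean predictive entropy of the current classifier on each candidate dataset. This sketch is an assumption-laden stand-in (the actual method computes EIG under a Gaussian Process classifier), and all names are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_datasets(candidate_predictions):
    """Rank candidate datasets by mean predictive entropy of the current
    classifier, highest (most informative under this proxy) first.
    candidate_predictions: {dataset_name: [class-probability vectors]}"""
    score = {name: sum(predictive_entropy(p) for p in preds) / len(preds)
             for name, preds in candidate_predictions.items()}
    return sorted(score, key=score.get, reverse=True)
```

A cohort on which the classifier is uncertain (near-uniform predictions) is ranked ahead of one it already classifies confidently, mirroring the intuition that high-EIG datasets give the largest accuracy gains.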

Adverse Event Prediction and Drug Repositioning through a Relational Graph Convolutional Neural Network Approach
  • Holger Fröhlich, Fraunhofer SCAI, Germany
  • Daniel Domingo-Fernández, Fraunhofer SCAI, Germany
  • Sophia Krix, Fraunhofer SCAI, Germany
  • Lauren Nicole DeLong, Fraunhofer SCAI, Germany
  • Marc Jacobs, Fraunhofer SCAI, Germany
  • Sheraz Gul, Fraunhofer ITMP, Germany
  • Andrea Zaliani, Fraunhofer ITMP, Germany

Short Abstract: Novel medications often fail during expensive and time-consuming late-stage clinical trials due to unforeseen adverse drug reactions (ADRs). Previous computational solutions to avoid these failures often have one of two aims: proactively predicting ADRs, or repurposing established medications which have already passed clinical trials.
We developed an approach using the framework of relational graph convolutional neural networks (R-GCNs). Our approach learns relationships between drugs, proteins and conditions in a heterogeneous multi-relational knowledge graph and allows us to perform two tasks: i) predicting ADRs and ii) drug repositioning. A specific aspect of our method is the integration of multi-modal data (chemical structure, protein sequence, gene expression, microscopy images, representations of electronic health records) to capture relevant biomedical metadata. We compared our method to several state-of-the-art representation learning approaches for both ADR prediction and drug repositioning.
Overall, our method can predict novel indications for drug repurposing candidates as well as potential ADRs.

An Active Learning Workflow for 3D Morphological Analysis of Bioimages
  • Ziji Zhang, Stony Brook University, United States
  • Qinyun Zhao, Stony Brook University, United States
  • Hong Wang, Stony Brook University, United States
  • Rebecca Adikes, Stony Brook University, United States
  • Peng Zhang, Stony Brook University, United States
  • Benjamin Martin, Stony Brook University, United States
  • David Matus, Stony Brook University, United States
  • Yuefan Deng, Stony Brook University, United States

Short Abstract: Manual segmentation and quantification of complex 3D fluorescence images is a limiting factor in data analysis for many biologists. Here, we develop a morphological analysis workflow to automatically reconstruct 3D cell models with accurate geometrical quantifications and measurements by combining a semi-unsupervised segmentation system and a Delaunay triangulation mesh generator. We applied our workflow to C. elegans sex myoblasts, an in vivo model of a cell that migrates during larval development and differentiates into the adult uterine and vulval muscles. Thus, the conventional raw confocal images, which are sparse and noisy, are made useful for accurately understanding the migration and differentiation of mesodermal precursor cells. We also improve upon existing supervised machine learning approaches that heavily rely on massive, hard-to-get manual annotations. Our active-learning-based segmentation system for single z-stack images combines multiple agent networks and fuses their pseudo-label masks, which are then stacked and extracted to generate 3D cell surfaces. The workflow, applied to high-resolution time-lapse confocal images, allowed for the quantification of numerous filopodia and cell protrusions. Finally, to test our workflow, we analyzed additional live imaging data from the C. elegans anchor cell (AC) and from a different model system, the developing zebrafish tailbud.

An Explainable Multi-Modal Neural Network Architecture for Predicting Epilepsy Comorbidities Based on Administrative Claims Data
  • Thomas Linden, University of Bonn, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Holger Fröhlich, Fraunhofer Institute for Algorithms and Scientific Computing (FHG), Germany
  • Johann de Jong, UCB Pharma S. A., Germany
  • Chao Lu, UCB Pharma S. A., United States
  • Kathrin Haeffs, UCB Pharma S. A., Germany
  • Victor Kiri, UCB Pharma S. A., United States

Short Abstract: Epilepsy is a complex brain disorder characterized by repetitive seizure events. Epilepsy patients often suffer from various and severe physical and psychological comorbidities (e.g. anxiety, migraine, stroke). While general comorbidity prevalences and incidences can be estimated from epidemiological data, such an approach does not take into account that actual patient-specific risks can depend on various individual factors, including medication. This motivates the development of a machine learning approach for predicting the risk of future comorbidities for the individual epilepsy patient.

In this work we use inpatient and outpatient administrative health claims data of around 19,500 US epilepsy patients. We propose a dedicated multi-modal neural network architecture, DeepLORI, to predict the time-dependent risk of six common comorbidities of epilepsy patients. We demonstrate superior performance of DeepLORI in a comparison with several existing methods. Moreover, we show that DeepLORI-based predictions can be interpreted on the level of individual patients. Using a game-theoretic approach, we identify relevant features in DeepLORI models and demonstrate that model predictions are explainable in the light of existing knowledge about the disease. Finally, we validate the model on independent data from around 97,000 patients, showing good generalization and stable prediction performance over time.

Anti-cancer drug response prediction on locally connected gene expression manifolds leveraging kernelized feature selection
  • Tamas Madl, Amazon Web Services, Netherlands
  • Matthew Howard, Amazon Web Services, United Kingdom
  • Alessandro Riccombeni, Amazon Web Services, United Kingdom

Short Abstract: Selecting the most effective drug for individual patients is an important task in precision medicine and one that machine learning-based approaches may be able to expedite. However, drug response prediction using machine learning presents substantial challenges, such as the ‘curse’ of dimensionality and lack of well-defined distance metrics defining similar patients or cell lines.
We have constructed a recommender system that generates drug response predictions as weighted averages of the response values of similar cell lines. We tackle high dimensionality by leveraging HSIC Lasso (Climente-González et al., 2019) to identify maximally relevant and minimally redundant genes. To represent similarity, we leverage proximity on an approximated Riemannian manifold, and compare this approach with simpler similarity metrics such as correlation. We evaluate the proposed architecture on the Genomics of Drug Sensitivity in Cancer database.
The resulting model outperforms recommender algorithms based on traditional similarity metrics (Suphavilai et al., 2020) by a significant margin (accuracy/drug: 81% vs. 77%; correlation with predicted drug response: 0.76 vs. 0.65). This substantiates the effectiveness of similarity metrics on locally connected manifold approximations derived from dimensionality-reduced gene expression data, and takes a step towards making recommender-based drug response prediction models accurate enough to be relevant for clinical practice and patient benefit.
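
The recommender's core step, predicting a drug response as a similarity-weighted average over the most similar cell lines, can be sketched as below. Pearson correlation stands in here for the manifold-based similarity of the full method; the data and function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def predict_response(target_expr, train_exprs, train_responses, k=2):
    """Similarity-weighted average of drug responses over the k most
    similar training cell lines (non-positive similarities get zero weight)."""
    sims = sorted(((pearson(target_expr, e), r)
                   for e, r in zip(train_exprs, train_responses)),
                  reverse=True)[:k]
    wsum = sum(max(s, 0.0) for s, _ in sims)
    if wsum == 0.0:  # no similar neighbours: fall back to the mean response
        return sum(train_responses) / len(train_responses)
    return sum(max(s, 0.0) * r for s, r in sims) / wsum
```

Swapping `pearson` for a distance on a learned manifold embedding, as the abstract describes, changes only the similarity function, not the weighting scheme.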

Application of Deep Learning in Target Identification Through Determining Mechanism of Action given Cellular Signature Data
  • Omar Abul-Hassan, Ocean Lakes High School, United States

Short Abstract: Current techniques for target identification in drug development are costly, lengthy, and carry a high degree of uncertainty about whether a drug will succeed in modulating a target. To address this, we explored the ability of computational approaches to leverage cellular signature data for in-silico target identification. The NIH LINCS program compiled and publicized data consisting of cell viability measurements and high-throughput gene-expression drug and target screens (measured by the L1000 assay) for ~5,000 small molecules and their respective mechanisms of action (MoA). This study uses an ensemble of transfer learning and a novel neural network to map this cellular signature data to the mechanism of action of a compound. Validating on a held-out subset of our training data, our approach achieved an AUC (area under the receiver operating characteristic curve) of 91.51%. Our approach was also shown to significantly outperform other machine learning approaches (p<0.05) and demonstrates advantages over previously published in-silico techniques. Overall, the algorithm provides a reliable framework for expedited drug target identification.

Assessment of a Novel Feature Selection Algorithm for Virus-Host Protein-Protein Interaction.
  • Ahmad Hassan Ibrahim, Mugla Sitki Kocman University, Turkey
  • Erdem Turk, Mugla Sitki Kocman University, Turkey
  • Betul Asiye Karpuzcu, Mugla Sitki Kocman University, Turkey
  • Selahattin Aksoy, Mugla Sitki Kocman University, Turkey
  • Onur Can Karabulut, Mugla Sitki Kocman University, Turkey
  • Baris Suzek, Mugla Sitki Kocman University, Turkey

Short Abstract: The development of machine learning-based virus-host protein-protein interaction (VHPPI) predictors is an active research area.

Here, we assessed the impact of a feature selection algorithm on the performance of machine learning-based VHPPI prediction. Typically, these VHPPI predictors use tripeptide frequencies of host and virus proteins computed on reduced amino acid alphabets. In this study, we used a 7-letter amino acid alphabet and created a total of 686 (7^3 + 7^3; host + virus) features using tripeptide frequencies normalized by unit normalization. Assuming VHPPIs are mediated by tripeptides from virus and host proteins sharing similar distributions, we computed the Pearson Correlation Coefficient (PCC) between each host and virus feature (343 × 343 pairs) and filtered down the pairs using various thresholds. The selected features were used to train Random Forest Classifiers (RFCs).

We conducted experiments using three training sets (each 18,120 positive/181,200 negative) and three independent test sets (each 4,533 positive/45,330 negative) provided by HVPPI. For PCC thresholds of 0.08-0.13, the number of features ranged from 203/186 (host/virus) down to 24/20. Although feature selection achieved a ~93.5% reduction in features, the RFC's sensitivity and specificity on the independent test sets remained the same: 0.84 ± 0.02 and 0.83 ± 0.02, respectively. Our models outperformed the alternative tools HVPPI, Hopitor, and DeNovo on all the independent test sets regardless of the choice of PCC threshold.
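
The correlation-based filter described above can be sketched as follows. A tiny feature dictionary and an arbitrary threshold stand in for the full 343 × 343 tripeptide-frequency comparison, and the use of absolute PCC (rather than signed) is an assumption of this sketch.

```python
def pcc(x, y):
    """Pearson correlation coefficient between two value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def select_feature_pairs(host_feats, virus_feats, threshold):
    """Keep (host, virus) tripeptide-feature pairs whose frequency profiles
    across interacting protein pairs correlate above the threshold."""
    return [(h, v)
            for h, hx in host_feats.items()
            for v, vx in virus_feats.items()
            if abs(pcc(hx, vx)) >= threshold]
```

Only the surviving feature pairs would then be fed to the downstream classifier, which is how a large reduction in features can leave performance unchanged if the discarded pairs carried little signal.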

BABEL enables cross-modality translation between multiomic profiles at single-cell resolution
  • Kevin Wu, Stanford University, United States
  • Kathryn Yost, Stanford University, United States
  • Howard Chang, Stanford University, United States
  • James Zou, Stanford University, United States

Short Abstract: Simultaneous profiling of multi-omic modalities within single cells is a grand challenge for biology. While there have been impressive technical innovations demonstrating feasibility of co-measurement technologies, widespread application of joint profiling methods is challenging due to experimental complexity, noise, and cost. Here, we introduce BABEL, a deep learning method that enables a multi-omic view of single cells given only a single measured modality. Leveraging a series of interoperable neural networks, BABEL can predict single-cell expression from single-cell chromatin accessibility, and vice versa, after training on relevant data. BABEL is robust across varying noise profiles, and across diverse biological contexts not seen during training. For example, BABEL can generate single-cell expression profiles from patient-derived basal cell carcinoma (BCC) chromatin accessibility data, enabling insights into transcriptomic states despite never being trained on BCC data. BABEL’s predictions are even comparable to analyses of empirical BCC scRNA-seq data. We further show that BABEL can be easily extended to incorporate additional single-cell data modalities such as protein expression, thus enabling unified translation across chromatin, RNA, and protein states. BABEL offers a powerful approach for multi-omic data exploration and hypothesis generation.

Benchmarking atlas-level data integration in single-cell genomics
  • Fabian Theis, Helmholtz Center Munich, Germany
  • Malte Luecken, Helmholtz Center Munich, Germany
  • Maren Büttner, Helmholtz Center Munich, Germany
  • Kridsadakorn Chaichoompu, Helmholtz Center Munich, Germany
  • Anna Danese, Helmholtz Center Munich, Germany
  • Marta Interlandi, University of Münster, Germany
  • Michaela Mueller, Helmholtz Center Munich, Germany
  • Daniel Strobl, Helmholtz Center Munich, Germany
  • Luke Zappia, Helmholtz Center Munich, Germany
  • Martin Dugas, University of Münster, Germany
  • Maria Colomé-Tatché, Helmholtz Center Munich, Germany

Short Abstract: Cell atlases often include samples that span locations, labs, and conditions leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. Choosing a data integration method is a challenge due to the difficulty of defining integration success. Here, we benchmark 68 integration approaches on 85 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, representing >1.2 million cells distributed in 13 atlas-level integration tasks. Our integration tasks span several common sources of variation such as individuals, species, and experimental labs. We evaluate methods according to scalability, usability, and their ability to remove batch effects while retaining biological variation.
Using 14 evaluation metrics, we find that highly variable gene selection improves data integration performance, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, Scanorama, scANVI, and scGen perform well, particularly on complex integration tasks; Seurat v3 and Harmony perform well on simpler tasks with distinct biological signals; and scATAC-seq integration performance is strongly affected by choice of feature space. Our reproducible Python module and benchmarking pipeline can be used to identify optimal data integration methods for new data, benchmark new methods, and guide method development.

Celloscope: a probabilistic model for cell type deconvolution in spatial transcriptomics data
  • Agnieszka Geras, Warsaw University of Technology, University of Warsaw, Poland
  • Shadi Darvish Shafighi, University of Warsaw, Poland
  • Kacper Domżał, University of Warsaw, Poland
  • Igor Filipiuk, University of Warsaw, Poland
  • Łukasz Rączkowski, University of Warsaw, Poland
  • Hosein Toosi, Royal Institute of Technology, Sweden
  • Leszek Kaczmarek, Nencki Institute of Experimental Biology of the Polish Academy of Sciences, Poland
  • Łukasz Koperski, Medical University of Warsaw, Poland
  • Jens Lagergren, Royal Institute of Technology, Sweden
  • Dominika Nowis, Medical University of Warsaw, Poland
  • Ewa Szczurek, University of Warsaw, Poland

Short Abstract: Spatial transcriptomics (ST) makes it possible not only to measure the level of gene activity, but also to map this activity spatially. This is because, unlike single-cell RNA-seq experiments, ST retains information on cells' positions within the tissue. However, ST spots contain multiple cells, so the observed signal inevitably reflects mixtures of cells of different types. To deconvolute these mixtures and infer the spatial cell-type composition, various methods combining the two complementary technologies, ST and single-cell RNA-seq, have been proposed. Unfortunately, these methods require both types of data and may be prone to bias due to platform-specific effects, such as sequencing depth. To address these issues, we present an approach that does not require single-cell data but instead relies on additional prior knowledge of marker genes. Our novel probabilistic model for cell type deconvolution in ST data, called Celloscope, was applied to mouse brain data and successfully indicated brain structures and spatially distinguished between the two main neuron types: inhibitory and excitatory.

Computational characterization of thermostability in fungi and predicting thermostability using machine learning approaches
  • Sankar Mahesh R, SASTRA Deemed to be University, India
  • Ragothaman Yennamalli, SASTRA Deemed to be University, India

Short Abstract: Fungi are generally mesophilic in nature; however, thermophilic/thermostable fungi also exist. Various molecular factors have been proposed to cause thermostability, such as salt bridges and side-chain–side-chain hydrogen bonds, but these factors cannot be generalized for all fungi. Identifying the factors imparting thermostability can reveal how fungal thermophilic proteins gain it. We curated a dataset of 14 thermophilic fungi and their evolutionarily closer mesophiles. Initially, the proteome data of Rhizopus microsporus and its evolutionarily related mesophile Mucor circinelloides were analyzed. Using eggNOG, we classified the proteomes into COGs. Excluding COGs R and S, we extracted sequence features using Protr. Currently, we find that in COGs A, B, K, T, and U the amino acids Lys, Arg, Asp, Glu, Phe, Tyr, and Trp are more highly represented in the thermophile than in the mesophile. Analyzing the features with an ensemble feature selection tool, we selected feature sets scoring above a threshold of 0.6 (on a scale of 0.0 to 1.0). These include amino acid composition, pseudo amino acid composition, quasi-sequence-order descriptors, conjoint triad, dipeptide composition, and Geary, Moran, and normalized Moreau-Broto autocorrelation. Both supervised and unsupervised machine learning approaches are being trained to derive a model that differentiates a fungal sequence as thermophilic or mesophilic.
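
Two of the simplest feature families mentioned above, amino acid composition and dipeptide composition, can be computed as in this minimal sketch. Protr provides these (and the more elaborate descriptors) in R; the Python function names here are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair among the n-1 dipeptides,
    over all 400 possible pairs."""
    n = len(seq) - 1
    comp = {a + b: 0.0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    for i in range(n):
        comp[seq[i:i + 2]] += 1.0 / n
    return comp
```

Feature vectors like these, computed per protein and aggregated per COG, are the kind of input the ensemble feature selection step would score against the 0.6 threshold.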

Concatenated Convolutional Neural Network and Transformer-Encoder to Diagnose Cancerous Thyroid Nodules in Ultrasound Cine Images
  • Tara Kapoor, Palo Alto High School, United States
  • Rikiya Yamashita, Stanford University, United States

Short Abstract: Incidental thyroid nodule detection has increased notably in recent years. Biopsy is invasive and expensive, and risk stratification of nodule ultrasounds is done manually by radiologists through the Thyroid Imaging, Reporting & Data System (TI-RADS), motivating the need for effective computerized diagnosis solutions.

I hypothesized that deep-learning could achieve superior classification performance in sensitivity, specificity, and AUROC compared to TI-RADS using ultrasound cine-clip images.

I developed a novel deep learning-based algorithm, trained on ultrasound cine-clip images with prospectively-collected biopsy data as ground truths. The algorithm consists of a MobileNet-v2 convolutional-neural-network (CNN) for feature extraction with a bi-layer Transformer-Encoder network (each with self-attention and feedforward sub-layers).

I addressed two main technical challenges: extreme class-imbalance (benign/malignant patient ratio=175/17) and processing patient-wide cine-clip images’ sequentiality. For the former, I applied weighted oversampling and back-propagated focal loss (alpha=0.9, gamma=2.4). For the latter, I stacked CNN inputs (adjacent or equally-spaced frames), then channeled extracted feature vectors into the Transformer-Encoder for patient-wide attention-mechanism processing.
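
The focal loss used against the class imbalance, with the alpha and gamma values quoted above, down-weights easy, well-classified examples so the rare malignant class dominates training. A per-example sketch for binary labels (a hedged illustration of the standard formula, not the author's exact training code):

```python
import math

def focal_loss(p, y, alpha=0.9, gamma=2.4):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class y."""
    pt = p if y == 1 else 1.0 - p
    at = alpha if y == 1 else 1.0 - alpha
    return -at * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))
```

With gamma=0 and alpha=1 this reduces to ordinary cross-entropy; larger gamma suppresses the loss contribution of confident correct predictions, and alpha near 1 up-weights the positive (malignant) class.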

My deep-learning techniques, the adjacent-frame (AUROC=0.867) and equally-spaced-frame (AUROC=0.858) models, outperformed TI-RADS scoring (AUROC=0.798). The CNN+Transformer model can classify cine-clips as a pre-screening tool, or be used jointly with radiologists for improved prediction.

Correcting gradient-based attribution methods for neural networks in genomics
  • Peter Koo, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
  • Antonio Majdandzic, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States

Short Abstract: Deep neural networks (DNNs) have been applied successfully to many regulatory genomics tasks, such as predicting the binding strength between proteins and DNA. To explain DNN predictions, post hoc attribution methods are employed to provide importance scores for each nucleotide in a given sequence, often revealing motif-like representations that are important for model predictions. Among attribution methods, we find that those based on gradients are often employed in a naïve way, not respecting the categorical simplex constraints of the 1-hot encoded sequence, which can lead to a significant source of error for the importance of nucleotides. This can affect the efficacy of downstream analyses that use attribution scores, such as scoring the effects of disease-associated variants. Here we derive a simple correction for gradient-based attribution scores and demonstrate its effectiveness using synthetic and in vivo sequences across different networks and attribution methods. We find that our correction consistently leads to a small, but significant improvement in attribution scores for motif positions and reduces spurious attributions at other positions. While not intended to improve the classification performance, this correction provides a consistent improvement in interpretability, leading toward better transparency of a network’s decision-making process, providing clearer insights into the underlying biology.
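
One correction consistent with the simplex constraint described above is to remove the component of the gradient that points off the probability simplex, i.e. subtract the per-position mean across the four nucleotide channels so each position's attribution scores sum to zero. The sketch below illustrates that idea; it is not necessarily the authors' exact derivation, and the function name is hypothetical.

```python
def correct_attribution(grads):
    """grads: one list of 4 channel gradients (A, C, G, T) per position.
    Returns gradients with the per-position mean across channels removed,
    so each corrected position sums to zero (lies within the simplex)."""
    corrected = []
    for g in grads:
        mu = sum(g) / len(g)
        corrected.append([x - mu for x in g])
    return corrected
```

Applied to a saliency map, this leaves relative nucleotide importances at each position intact while discarding the shared offset that the 1-hot encoding can never realize.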

Deep Generative Model for Flow Cytometry Data Integration
  • Mike Phuycharoen, The University of Manchester, United Kingdom
  • Verena Kaestele, The University of Manchester, United Kingdom
  • Thomas Williams, The University of Manchester, United Kingdom
  • Lijing Lin, The University of Manchester, United Kingdom
  • Tracy Hussell, The University of Manchester, United Kingdom
  • John Grainger, The University of Manchester, United Kingdom
  • Magnus Rattray, The University of Manchester, United Kingdom

Short Abstract: The COVID-19 pandemic accelerated the need for large-scale immune profiling and analysis of collected patient data. Flow cytometry experiments can be used to analyze the composition of blood samples, but often result in heterogeneous datasets due to the changing availability of markers and other technical batch effects. To enable scalable flow data integration, we created a variational neural network model for simultaneous generative embedding, bias correction, and semi-supervised cell-type prediction. The model explicitly merges subspaces defined by markers shared across heterogeneous panels. Latent batch correction is performed before any downstream tasks in order to avoid integrating panel-specific batch effects. To provide consistent control sets for normalization, we resample and balance the data distribution using partially trained classifiers as training progresses. Finally, we perform latent space clustering and visualize the combined space in 2D. Our model allows for sequential and parallel training, and scales well to millions of cells. Model selection is automated using a Bayesian optimizer, which determines global as well as task-specific hyper-parameters such as subnetwork size and individual training frequencies. We apply the model to identify cell populations enriched in COVID-19 cases from several hospitals in our region and to estimate correlations with disease trajectory.

Deep learning-based detection and analysis of chest X-rays to prognosticate the type of respiratory tract infection
  • Saksham Saxena, NSIT DELHI UNIV NEW DELHI, India
  • Namrata Swain, NSIT DELHI UNIV NEW DELHI, India
  • Akanksha Kulshreshtha, Netaji Subhas University of Technology, New Delhi, India

Short Abstract: Machine Learning (ML) based methods have shown unparalleled success and are a powerful approach for accurate analysis of medical images; the classification of images with highly similar features is a typical application of machine learning-based image analysis. This paper describes a learning strategy based on convolutional neural networks to identify X-ray images indicating various respiratory diseases such as Pneumonia, TB, Pneumothorax, and COVID-19. To optimise data augmentation, parameters such as rescaling, horizontal and vertical flip, and zoom augmentation are varied and transformed. The pooling layers, dense layers, learning rate, and epochs are carefully adjusted when building the CNN model in order to achieve the model's optimum efficiency.
17,272 chest X-ray scan samples were obtained from the Kaggle datasets, the National Institutes of Health dataset, and the Stanford ML Group (CheXpert database) to assess the model's results, with 15,257 used for training and 2,015 for validation.
The proposed CNN model executes efficiently and evinces the following performance parameters:
(specificity_at_sensitivity:0.9996, sensitivity_at_specificity:0.9983, accuracy: 0.9269, val_specificity_at_sensitivity: 0.9937, val_sensitivity_at_specificity: 0.9727 etc.)
These findings show that deep learning neural networks can be used to detect COVID-19, Pneumonia, and other respiratory tract infections.
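The augmentation parameters mentioned above (rescaling, flips, zoom) can be sketched in a framework-agnostic way; the function name and values here are illustrative, not the authors' configuration:

```python
import numpy as np

def augment(img, rng, zoom_max=0.2):
    """Toy augmentation: random horizontal/vertical flips and a central
    zoom-crop, standing in for rescale/flip/zoom transformations."""
    if rng.random() < 0.5:
        img = img[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]          # vertical flip
    z = rng.uniform(0, zoom_max)    # crop fraction for the "zoom"
    h, w = img.shape
    dh, dw = int(h * z / 2), int(w * z / 2)
    img = img[dh:h - dh, dw:w - dw]
    return img / 255.0              # rescale intensities to [0, 1]

rng = np.random.default_rng(0)
xray = np.arange(64 * 64, dtype=float).reshape(64, 64) % 256  # fake image
out = augment(xray, rng)
```
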

Deep Learning Framework for Predicting Phenotype from TCR-β Immune Repertoire
  • Ahmed Metwally, Illumina AI Lab, Illumina Inc., United States
  • Kyle Farh, Illumina AI Lab, Illumina Inc., United States

Short Abstract: The architecture of the T-cell receptor (TCR) repertoire largely contributes to the performance of the adaptive immune response against viral or bacterial infection. Each human has about 10^7–10^8 unique TCRs. The diversity of the TCR repertoire is primarily affected by age, HLA genetic variabilities, and prior exposure to viral or bacterial infections. With the advent of immune sequencing, whether bulk or single-cell RNAseq, the TCR repertoire can be characterized and used in predicting disease prognosis. Immune repertoire classification can be seen as a multiple instance learning problem with extremely low witness rates (~0.01%), and the overlap of immune repertoires of different individuals is low. These properties hinder the development of end-to-end deep learning frameworks for classifying individuals' phenotypes based on their TCR repertoire. This work presents our new framework that combines statistical and deep learning components to predict individuals' phenotypes based on their TCR-β immune repertoire. We applied the proposed framework to a dataset of 641 TCR-β immune repertoires (289 CMV+ and 352 CMV-). Our performance evaluation shows robust performance in classifying CMV+ subjects (auROC = 0.95). The developed framework can be applied to other diseases such as infectious diseases, cancers, and autoimmune diseases.
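The low-witness-rate multiple instance learning setting can be illustrated with a toy simulation (max-pooling stands in for a learned aggregation, and all numbers are illustrative, not from the authors' framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(positive, n=10_000, witness_rate=1e-4):
    """Simulate per-receptor scores for one repertoire (bag). In a positive
    bag only ~witness_rate of instances carry signal (scores shifted up)."""
    scores = rng.normal(size=n)
    if positive:
        k = max(1, int(n * witness_rate))
        scores[:k] += 10.0          # the few phenotype-associated receptors
    return scores

def mil_predict(bag, threshold=6.0):
    """Max-pooling MIL: the bag label comes from its best-scoring instance."""
    return bag.max() > threshold

pos, neg = make_bag(True), make_bag(False)
```

With a witness rate of 1e-4, only 1 of 10,000 instances is informative, which is why naive per-bag averaging would wash out the signal and max-style pooling is commonly used.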

Deep Learning in Automated Breast Cancer Diagnosis by Learning the Breast Histology from Microscopy Images
  • Qiangqiang Gu, Mayo Clinic, United States
  • Naresh Prodduturi, Mayo Clinic, United States
  • Steven N Hart, Mayo Clinic, United States

Short Abstract: Breast cancer is one of the most common cancers in women. However, the concordance rate of breast cancer diagnosis from histology slides is unacceptably low. Classifying normal versus tumor breast tissues from breast histology microscopy images is an ideal case to use for deep learning and could help to more reproducibly diagnose breast cancer.

We tested the accuracy of tumor versus normal classification using the BreAst Cancer Histology dataset. We first tested the patch-level classification accuracy for 16 combinations of non-specialized models, data preprocessing techniques, and hyperparameter configurations, and chose the model with the highest patch-level accuracy. Then we computed the slide-level accuracy of the selected models and compared them with 26 hyperparameter sets of a pathology-specific attention based multiple-instance learning model.

Two generic models (One-Shot Learning and the DenseNet201 with highly tuned parameters) achieved 94% slide-level validation accuracy compared to only 88% for the pathology-specific model.

The combination of image data preprocessing and hyperparameter configurations has a direct impact on the performance of image classifiers. To identify a well-performing model to classify tumor versus normal breast histology, researchers should not only focus on developing novel models, since hyperparameter tuning for existing methods can also achieve high prediction accuracy.

Deep multitask learning of gene risk for comorbid neurodevelopmental disorders
  • A. Ercument Cicek, Bilkent University, Turkey
  • Ilayda Beyreli, Bilkent University, Turkey
  • Oguzhan Karakahya, Bilkent University, Turkey

Short Abstract: Autism Spectrum Disorder (ASD) and Intellectual Disability (ID) are comorbid neurodevelopmental disorders with complex genetic architectures. Despite large-scale sequencing studies, only a fraction of the risk genes has been identified for both. Here, we present a novel network-based gene risk prioritization algorithm named DeepND that performs cross-disorder analysis to improve prediction power by exploiting the comorbidity of ASD and ID via multitask learning. Our model leverages information from gene co-expression networks that model human brain development using graph convolutional neural networks and learns which spatio-temporal neurodevelopmental windows are important for disorder etiologies. We show that our approach substantially improves the state-of-the-art prediction power in both single-disorder and cross-disorder settings. DeepND identifies the prefrontal cortex brain region and the early-mid fetal period as the highest neurodevelopmental risk window for both ASD and ID. Finally, we investigate frequent ASD- and ID-associated copy number variation regions and confident false findings to suggest several novel susceptibility gene candidates. DeepND can be generalized to analyze any combination of comorbid disorders and is released at github.com/ciceklab/deepnd.

Deep-learning based tumor microenvironment segmentation is predictive of tumor mutations and patient survival in lung cancer
  • Łukasz Rączkowski, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland
  • Iwona Paśnik, Department of Clinical Pathomorphology, Medical University of Lublin, Poland
  • Michał Kukiełka, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland
  • Marcin Nicoś, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Magdalena Budzińska, Ardigen, Poland
  • Tomasz Kucharczyk, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Justyna Szumiło, Department of Clinical Pathomorphology, Medical University of Lublin, Poland
  • Paweł Krawczyk, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Nicola Crosetto, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Sweden
  • Ewa Szczurek, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland

Short Abstract: The importance of the tumor microenvironment has been studied extensively in recent years. While the immune microenvironment specifically has been the focus of research, a more general approach has not yet been attempted. In this work we show that there is a relationship between general microenvironment composition and both gene mutations and patient survival in non-small cell lung cancer. We introduce a new training dataset for tissue classification in lung cancer, LubLung. With this, we train an accurate and reliable deep learning model, ARA-CNN, and use it to segment tissue slides from TCGA. These segmented slides are used to compute novel spatial metrics and are utilised in the tasks of gene mutation classification and survival prediction. We show that there are gene mutations in lung cancer that can be predicted based on tissue prevalence and tumor microenvironment quantified from images. It is likely that similar relationships can be found in other cancer types. In addition, the tumor microenvironment is also highly important in survival analysis. We observe a clear link between the tumor neighbourhood structure and patient survival.

DEPICTION: an interpretability toolbox for computational biologists
  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • An-Phi Nguyen, IBM Research Europe, ETH Zurich, Switzerland

Short Abstract: Thanks to their outstanding performance, deep learning models have often been the method of choice in life sciences research. Despite their results, however, their decision process often remains obscure. The black-box nature of these models greatly limits their usage in high-stakes scenarios such as healthcare or slows down further scientific discovery. Arguably, these are some of the reasons for the recent surge of interest in the field of Interpretable Machine Learning. As a consequence, computational biologists are now presented with an overwhelmingly high number of interpretability methods to start unraveling the decision-making rules of an opaque model. In this work, we introduce DEPICTION, a toolbox designed to help computational biologists make their machine learning models more interpretable. DEPICTION provides a unified interface for the most well-established interpretability methods, which can be included in existing pipelines with little change to the code. Further, DEPICTION facilitates the comparison of different interpretability techniques. As a proof of concept, we apply DEPICTION to the analysis of a single-cell proteomic dataset in which different immune cell populations were profiled. In all cases, DEPICTION identified the most informative markers to distinguish each cellular population, shedding light on the molecular phenotypes that characterise the immune system.

Design of peptides with desirable activity assisted by machine learning and data mining strategies
  • Francisca Rodríguez-Cabello, Centre for Biotechnology and Bioengineering, Chile
  • David Medina-Ortiz, Centre for Biotechnology and Bioengineering, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Chile

Short Abstract: Peptides are molecules composed mainly of amino acids linked together by peptide bonds. They have different properties and characteristics that make them unique structures and provide various advantages in biotechnology and bioengineering. One of the main problems of working with this type of structure lies in the high economic cost of synthesis, so it is necessary to create and implement computational methods that facilitate the design of peptides with desirable activity. Strategies such as directed evolution or rational design are now being combined with computational methods based on artificial intelligence to improve their performance. However, implementing predictive models is an arduous and complex task: as generality is sought, effectiveness is lost, since specific methods are implemented for particular tasks. Based on an analytical approach and using statistical methods, bioinformatics approaches, different encoding strategies, and data mining techniques, we have developed a system that proposes varied activity-definition rules for particular peptide sequences using the information in our recently built peptide database. Employing these rules and combining them with varied optimization algorithms, we propose designing and implementing an automated system to generate peptide sequences with desirable chemical properties.

Development of an objective proteomic indicator of trauma severity
  • Sara Masarone, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, and The Alan Turing Institute, United Kingdom
  • Gerard Hernandez, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Jennifer Ross, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Jason Pott, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Karim Brohi, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Michael R. Barnes, Centre for Translational Bioinformatics, The William Harvey Research Institute, Queen Mary University of London, United Kingdom
  • Daniel J. Pennington, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom

Short Abstract: Trauma is one of the leading causes of death worldwide, causing more fatalities than HIV and Tuberculosis combined. Quickly and correctly assessing trauma severity is essential, both to improve survival and to minimise adverse long-term outcomes. Although numerous scoring systems have been devised to categorise trauma severity and predict mortality, most are subjective and based solely on observable patient characteristics. To address this, we have developed a machine learning pipeline to reliably categorise patient severity in the hyperacute window (<2hr post-trauma) using an ensemble-based classifier trained on proteomic data. This allowed us to identify eight proteins (CPLX1/2, GSN, IFI16, CHGB, CCL22, ADSSL1, PLG) that together achieved an AUC of 0.93 (5-fold CV) for correctly identifying critical patients, outperforming current severity indicators. These proteins are involved in platelet degranulation, dissolution of fibrin clots, and a variety of signalling pathways. Patients who were assigned a higher probability of being critical were also more likely to require ventilation, longer hospitalisation and more transfusions. One of the identified biomarkers, Gelsolin (GSN), is a protein that has recently been associated with poor prognosis in a series of other health conditions, supporting its use as a general indicator of adverse clinical outcomes in human disease.

DIVERSE: Bayesian Data IntegratiVE learning for precise drug ResponSE prediction
  • Betul Guvenc Paltun, Aalto University, Finland
  • Samuel Kaski, Aalto University, Finland
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan

Short Abstract: Detecting predictive biomarkers from multi-omics data is important for precision medicine, to improve diagnostics of complex diseases, and for better treatments. This needs substantial experimental efforts that are made difficult by the heterogeneity of cell lines and huge cost. An effective solution is to build a computational model over the diverse omics data, including genomic, molecular, and environmental information. However, choosing informative and reliable data sources from among the different types of data is a challenging problem. We propose DIVERSE, a framework of Bayesian importance-weighted tri- and bi-matrix factorization (DIVERSE3 or DIVERSE2) to predict drug responses from data of cell lines, drugs, and gene interactions. DIVERSE integrates the data sources systematically, in a step-wise manner, examining the importance of each added data set in turn. More specifically, we sequentially integrate five different data sets, which have not all been combined in earlier bioinformatic methods for predicting drug responses. Empirical experiments show that DIVERSE clearly outperformed five other methods including three state-of-the-art approaches, under cross-validation, particularly in out-of-matrix prediction, which is closer to the setting of real use cases and more challenging than simpler in-matrix prediction. Additionally, case studies for discovering new drugs further confirmed the performance advantage of DIVERSE.
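The bi-matrix factorization at the core of such approaches can be sketched with a toy, non-Bayesian alternating least-squares version (names and dimensions are illustrative; DIVERSE's importance-weighted Bayesian formulation is more involved):

```python
import numpy as np

def factorize(R, rank=2, iters=5, seed=0):
    """Toy bi-matrix factorization R ~ U @ V.T via alternating least
    squares -- a stand-in for the factorization models named above."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(R.shape[1], rank))
    for _ in range(iters):
        U = R @ V @ np.linalg.pinv(V.T @ V)    # least-squares update of U
        V = R.T @ U @ np.linalg.pinv(U.T @ U)  # least-squares update of V
    return U, V

# Toy "cell line x drug" response matrix with exact rank-2 structure
rng = np.random.default_rng(1)
R = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
U, V = factorize(R)
err = np.linalg.norm(U @ V.T - R) / np.linalg.norm(R)  # relative error
```

On an exactly low-rank matrix, ALS recovers the factorization almost immediately; real response matrices are noisy and only approximately low-rank.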

Early detection of breast and prostate cancer risk based on routine check-up data using machine learning and survival analysis
  • Ron Shamir, Tel-Aviv University, Israel
  • Dan Coster, Tel-Aviv University, Israel
  • Eyal Fisher, University of Cambridge, United Kingdom
  • Shani Shenhar-Tsarfaty, Tel-Aviv University, Israel
  • Tehillah Menes, Tel-Aviv University, Israel
  • Shlomo Berliner, Tel-Aviv University, Israel
  • Ori Rogowski, Tel-Aviv University, Israel
  • David Zeltser, Tel-Aviv University, Israel
  • Itzhak Shapira, Tel-Aviv University, Israel
  • Eran Halperin, University of California, Los Angeles, California, United States
  • Saharon Rosset, Tel-Aviv University, Israel
  • Malka Gorfine, Tel-Aviv University, Israel

Short Abstract: Cancer is a leading cause of death, and early cancer detection can affect prognosis and increase treatment effectiveness. Towards this challenge, we asked the following research question: Is medical data on healthy, undiagnosed individuals predictive of their risk to develop cancer later?
We analyzed electronic medical records of 20,000 healthy individuals who underwent routine checkups at the Tel-Aviv Medical Center Inflammation Survey between 2001 and 2017. Those records encompass more than 600 parameters per visit, including laboratory tests, vital signs, medical history, medication profile, etc. We identified those who developed cancer later using the Israeli National Cancer Registry.
We developed a novel ensemble method for risk prediction of multivariate time series data using a random forest (RF) model of survival trees for left-truncated and right-censored (LTRC) data. Our method uses an adapted version of the log-rank score as a splitting criterion.
Our method predicted future prostate cancer and breast cancer six months before diagnosis with an area under the ROC curve of 0.62±0.05 and 0.6±0.03, respectively. Performance was better than that of prior predictors. Our model was able to detect individuals who were not detected by typical screening tests such as mammography and clinical breast examination for breast cancer.
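The log-rank splitting idea can be sketched for the simpler right-censored case (this is the standard two-sample log-rank statistic, not the authors' adapted left-truncation-aware score):

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-sample log-rank statistic for right-censored data.
    time:  event/censoring times
    event: 1 if the event was observed, 0 if censored
    group: 0/1 membership (e.g. left/right child of a candidate split)
    """
    time, event, group = map(np.asarray, (time, event, group))
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):        # distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()   # total events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        obs_minus_exp += d1 - d * n1 / n         # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp / np.sqrt(var) if var > 0 else 0.0

# Clearly separated groups should yield a large statistic,
# making this split attractive for a survival tree.
time = [1, 2, 3, 4, 10, 11, 12, 13]
event = [1, 1, 1, 1, 1, 1, 1, 1]
group = [1, 1, 1, 1, 0, 0, 0, 0]
z = logrank_statistic(time, event, group)
```
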

Elastic net outperforms other machine learning models in gene expression prediction across global ancestries
  • Paul Okoro, Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States
  • Ryan Schubert, Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, United States
  • Xiuqing Guo, The Lundquist Institute and Department of Pediatrics at Harbor-UCLA Medical Center, Torrance, CA, United States
  • Craig Johnson, Department of Biostatistics, University of Washington, Seattle, WA, United States
  • Jerome Rotter, The Lundquist Institute and Department of Pediatrics at Harbor-UCLA Medical Center, Torrance, CA, United States
  • Ina Hoeschele, Fralin Life Sciences Institute, Virginia Tech, Blacksburg, VA, United States
  • Yongmei Liu, Department of Medicine, Duke University School of Medicine, Durham, NC, United States
  • Hae Kyung Im, Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, United States
  • Amy Luke, Parkinson School of Health Sciences and Public Health, Loyola University Chicago, Maywood, IL, United States
  • Lara Dugas, Parkinson School of Health Sciences and Public Health, Loyola University Chicago, Maywood, IL, United States
  • Heather Wheeler, Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States

Short Abstract: Transcriptome prediction methods have become popular in complex trait mapping, and most models have been trained in European populations using parametric linear models like elastic net (EN). We built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and compared their performance to EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis, comprising individuals of African, Hispanic, and European ancestries, and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study, comprising individuals of African ancestries. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN), we found that RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across populations. We show that including RF prediction models in PrediXcan revealed potential gene associations missed by EN models. Therefore, by integrating other ML modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.

ENNGene: an Easy Neural Network model building tool for Genomics
  • Panagiotis Alexiou, Central European Institute of Technology (CEITEC), Czechia
  • Eliska Chalupova, Masaryk University, Czechia
  • Ondrej Vaculik, Masaryk University, Czechia
  • Filip Josefov, Masaryk University, Czechia
  • Jakub Polacek, Masaryk University, Czechia
  • Tomas Majtner, Central European Institute of Technology (CEITEC), Czechia

Short Abstract: Here we present ENNGene (an Easy Neural Network model building tool for Genomics), an application that simplifies the local training of custom Convolutional Neural Network models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple streams of input information, including sequence, evolutionary conservation, and secondary structure, and performs the needed preprocessing steps, allowing simple input such as genomic coordinates. The network architecture can be customized by the user, ranging from the number and types of layers (convolutional, GRU, LSTM, or dense layers) to the precise set-up of each layer, e.g. a dropout rate. ENNGene then deals with all steps of training and evaluation of the model, exporting useful metrics such as multi-class AUC-ROC or precision-recall curve plots, as well as TensorBoard log files. To facilitate interpretation of the predicted results, we deploy the Integrated Gradients method, providing the user with a graphical representation of the attribution level of each nucleotide. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching state-of-the-art performance, while improving the performance on several of the proteins by including the evolutionary conservation score and tuning the network architecture for each protein.
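Integrated Gradients itself is simple to sketch for any differentiable model (a toy analytic gradient function stands in for a trained network here; this is an illustration of the method, not ENNGene's code):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Integrated Gradients: attribution_i = (x - baseline)_i times the
    average gradient along the straight path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps       # midpoint rule
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model f(x) = sum(x**2); its gradient 2x is known analytically
f = lambda x: np.sum(x**2)
grad_fn = lambda x: 2 * x
x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(grad_fn, x, baseline)
```

The completeness axiom (attributions sum to f(x) - f(baseline)) holds exactly for this toy model, which is a useful sanity check in practice.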

Ensemble of deep and shallow graph convolutional networks for identifying disease-gene association
  • Giltae Song, Pusan National University, South Korea
  • Donghyun Son, Pusan National University, South Korea

Short Abstract: Identifying disease-gene associations is essential for discovering disease mechanisms and developing therapeutic drugs. Although experimental studies have been conducted for decades, they are too slow and costly to fill in all the association relationships between diseases and genes.
In this study, we develop a deep learning approach based on graph convolutional networks for determining associations between diseases and genes. We use OMIM (Online Mendelian Inheritance in Man) data, which includes experimentally validated disease-gene association information. Since too few association relationships are available in these data to train a machine learning model, we add other auxiliary data sources to increase the volume of training data. We apply a multi-layered Graph Convolutional Network model to capture unknown nonlinear associations between diseases and genes. To avoid information loss in deep layers, we combine a model with shallow layers and another model with deep layers. We evaluate our ensemble approach using OMIM data. Our results improve on the performance of other existing methods. We believe that our graph convolutional network model can discover novel biomarker genes for diseases such as cancer.
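A single graph-convolution propagation step, as used in such models, can be sketched as follows (the standard Kipf-Welling rule; the authors' exact architecture and ensembling are not reproduced here):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy disease-gene graph: 4 nodes, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

shallow = gcn_layer(A, H, W)                # one layer ("shallow" branch)
W2 = rng.normal(size=(2, 2))
deep = gcn_layer(A, shallow, W2)            # stacked layers ("deep" branch)
```

An ensemble in this spirit would combine the outputs of the shallow and deep branches, e.g. by averaging their predictions.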

Evaluation Of Convolutional Neural Networks Containing Interactions Between Genomic Motifs
  • Bernhard Renard, Hasso-Plattner-Institute, Germany
  • Marta Lemanczyk, Hasso-Plattner-Institute, Germany

Short Abstract: To obtain new insights into biological mechanisms, one common approach is to interpret Convolutional Neural Networks (CNNs) by identifying patterns in genomic sequences. Attribution methods assign importance scores to each input feature and can uncover the relevance of single positions in the input sequence to the model's prediction. These scores are created independently of other positions, but many features in biological sequences are in relationships with others, forming motifs with specific functions. Moreover, groups of such locally dependent features can be part of a higher-order interaction representing a regulatory logic between motifs hidden in deeper layers. Interactions could lead to noisy or missing importance scores that do not capture the underlying ground truth in an understandable manner.
We define interaction logic concepts in biological sequences and generate data sets that capture these concepts. Then, we investigate whether current post-hoc attribution methods are capable of capturing motifs when the input sequences contain interactions like the previously defined interaction concepts.

Evaluation of Dimension Reduction Methods for Transfer Learning
  • Vishal H. Oza, The University of Alabama at Birmingham, United States
  • Brittany N. Lasseigne, The University of Alabama at Birmingham, United States
  • Jennifer L. Fisher, The University of Alabama at Birmingham, United States
  • Elizabeth Ramsey, The University of Alabama at Birmingham, United States

Short Abstract: Even though eighty percent of rare diseases have a genetic component, ninety-five percent of rare diseases do not have a molecular target for therapy. One reason for this problem is that rare diseases have scarce datasets (datasets with a low number of samples) which makes it difficult to develop accurate statistical and machine learning models for identifying drug targets and drug repurposing candidates. However, with transfer learning, we can identify patterns from larger, statistically powered datasets and apply them to a rare disease scarce dataset, for example, to identify pathomechanisms or drug targets. The objective of this study is to evaluate different dimension reduction methods, a required step in transfer learning. We applied several dimension reduction methods including principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF) to identify gene expression patterns from Recount2 (a large and publicly available gene expression data set) and transferred these patterns to Glioblastoma Multiforme (GBM) gene expression profiles and gene expression profiles from cell lines perturbed by temozolomide, the standard chemotherapy for GBM. Further, we evaluated the performance of these dimension reduction methods to identify gene expression patterns with potential clinical impact.
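As an illustration of the transfer step, patterns learned by PCA on a large reference can be projected onto a small target dataset (random matrices stand in for Recount2 and GBM data; only PCA of the methods listed is sketched):

```python
import numpy as np

# Learn principal components ("expression patterns") on a large reference
# matrix, then transfer them to a small target dataset by projection.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))   # stand-in: samples x genes
target = rng.normal(size=(10, 50))       # stand-in: small disease cohort

# PCA via SVD on the centered reference
mean = reference.mean(axis=0)
U, S, Vt = np.linalg.svd(reference - mean, full_matrices=False)
components = Vt[:10]                     # top 10 gene-space loadings

# Transfer: project target samples onto the reference patterns
target_scores = (target - mean) @ components.T
```

The target samples are now described in the reference pattern space, where downstream models can be trained despite the small target sample size.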

Evolutionary-based variational generative models for biological sequences
  • Amine Remita, Université du Québec à Montréal, Canada
  • Abdoulaye Baniré Diallo, Université du Québec à Montréal, Canada

Short Abstract: Generative frameworks designed for genomics data are emerging as powerful approaches to study complex phenomena in biology, including protein functions and structures, single-cell RNA-seq analyses and phylogenetic-based studies. However, for studying molecular evolutionary processes, most deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as is done within the Bayesian phylogenetic inference framework. Here we propose a variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as the generalized time reversible model. The architecture of our method consists of a set of deep variational encoders that infer the parameters of evolutionary-latent-variable distributions and allow sampling, and a generative model that computes probability transition matrices from sampled latent variables and generates a distribution of sequence alignments from reconstructed ancestral states. We train the model via a low-variance variational objective function and a site-wise stochastic gradient ascent algorithm. Experimentally, we show the effectiveness and efficiency of the method on synthetic sequence alignments simulated with several evolutionary schemas and on real aligned virus DNA sequences.
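As background, the generalized time-reversible (GTR) model mentioned above yields transition probabilities P(t) = exp(Qt); a minimal sketch with illustrative frequencies and exchangeabilities (not the authors' learned parameters):

```python
import numpy as np

def gtr_rate_matrix(pi, rates):
    """Build a GTR rate matrix Q from stationary frequencies pi (A,C,G,T)
    and 6 exchangeabilities (AC, AG, AT, CG, CT, GT)."""
    Q = np.zeros((4, 4))
    idx = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    for r, (i, j) in zip(rates, idx):
        Q[i, j] = r * pi[j]
        Q[j, i] = r * pi[i]
    np.fill_diagonal(Q, -Q.sum(axis=1))      # rows of Q sum to zero
    return Q

def transition_matrix(Q, t):
    """P(t) = expm(Q t) via eigendecomposition (Q is diagonalizable
    for reversible models)."""
    w, V = np.linalg.eig(Q)
    return (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)).real

pi = np.array([0.3, 0.2, 0.2, 0.3])          # illustrative frequencies
rates = np.array([1.0, 2.0, 1.0, 1.0, 2.0, 1.0])
Q = gtr_rate_matrix(pi, rates)
P = transition_matrix(Q, t=0.5)
```

Two standard sanity checks: each row of P(t) is a probability distribution, and pi is stationary under P(t) by time reversibility.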

Exploring the kinome using graph representation learning
  • Sachin Gavali, University of Delaware, United States
  • Karen Ross, Georgetown University Medical Center, United States
  • Chuming Chen, University of Delaware, United States
  • Julie Cowart, University of Delaware, United States
  • Cathy Wu, University of Delaware, United States

Short Abstract: The human kinome contains a vast network of interacting kinases and phosphorylation substrates. Some of these kinases are very well studied and have proven useful as therapeutic targets, but many are poorly understood and their biological roles unknown. In this work, we use the latest advancements in graph-based machine learning methods to explore the biological roles of these understudied kinases. We use the post-translational modification data from iPTMnet to build a kinase-substrate interaction network, and enrich this network using Gene Ontology functional annotations to provide a biological context to these interactions. We then use the node2vec algorithm to learn vector representations of the kinases and substrates in this network, and use these representations to predict novel interactions for understudied kinases using a Random Forest model. We also perform a bioinformatics analysis of the predicted interactions to understand the biological roles of understudied kinases. For two of the understudied kinases, Q9UEE5 (STK17A) and Q9H422 (HIPK3), we were able to ascertain through functional enrichment analysis of their predicted interaction partners that they play important roles in cancer activity and in mediating an inflammatory immune response.
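The node2vec step can be sketched as a biased second-order random walk (an unweighted toy graph here; the embeddings would then be learned from many such walks, which is not shown):

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=0.5, seed=0):
    """One biased node2vec random walk over an adjacency-list graph.
    p penalizes returning to the previous node; q < 1 favors exploring
    nodes farther from the previous node (DFS-like behavior)."""
    rnd = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break                              # dead end
        if len(walk) == 1:
            walk.append(rnd.choice(nbrs))      # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)        # return to previous node
            elif x in adj[prev]:
                weights.append(1.0)            # stays at distance 1
            else:
                weights.append(1.0 / q)        # explores outward
        walk.append(rnd.choices(nbrs, weights=weights)[0])
    return walk

# Toy kinase-substrate graph as adjacency lists (node ids are arbitrary)
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walk = node2vec_walk(adj, start=0, length=10)
```
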

Feature selection with VAE for scRNA-seq analysis
  • Toshiya Tanaka, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan

Short Abstract: In recent years, single-cell RNA sequencing (scRNA-seq) technologies have made rapid progress. They have provided many valuable insights into biological systems, such as characterizing cell populations and unraveling complex cellular processes. Analyzing scRNA-seq data requires a number of procedures such as data normalization, feature selection and dimension reduction for visualization and clustering. In addition, their hyperparameters need to be set properly.

To address this problem, we propose a method based on a variational autoencoder (VAE). By using raw count data as input, VAE captures higher-order dependencies of gene expressions in end-to-end training. VAE usually uses a Gaussian distribution as the prior probability distribution in the latent space. However, it cannot represent multiple modes of different cell types in scRNA-seq data. Therefore, we use the Gaussian-mixture model instead of the Gaussian distribution. It allows us to represent the latent space more flexibly. If the VAE captures the relevant features of the scRNA-seq data, each cell type will form a group as a subpopulation within the overall Gaussian-mixture model population. We show the result of clustering cells with latent representations and how VAE selects the relevant features.
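The Gaussian-mixture prior can be sketched as a log-density computed with log-sum-exp over components (shapes and parameters are illustrative, not the trained model's):

```python
import numpy as np

def log_gmm_prior(z, means, log_var, weights):
    """Log-density of a diagonal Gaussian-mixture prior at latent points z.
    z: (n, d), means: (k, d), log_var: (k, d), weights: (k,)."""
    n, d = z.shape
    diff = z[:, None, :] - means[None, :, :]              # (n, k, d)
    log_comp = -0.5 * (d * np.log(2 * np.pi)
                       + log_var.sum(axis=1)[None, :]
                       + (diff**2 / np.exp(log_var)[None, :, :]).sum(axis=2))
    log_w = np.log(weights)[None, :]
    # log-sum-exp over components for numerical stability
    m = (log_w + log_comp).max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_w + log_comp - m)
                       .sum(axis=1, keepdims=True))).squeeze(1)

# Two well-separated components standing in for two cell types
means = np.array([[0.0, 0.0], [5.0, 5.0]])
log_var = np.zeros((2, 2))
weights = np.array([0.5, 0.5])
z = np.array([[0.1, -0.1], [4.9, 5.2]])
logp = log_gmm_prior(z, means, log_var, weights)
```

Points near either mode get comparable prior density, which is what lets each cell type occupy its own mixture component in the latent space.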

Flimma: A federated and privacy-preserving tool for differential gene expression analysis
  • Jan Baumbach, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  • Markus List, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • David Blumenthal, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Olga Zolotareva, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Reza Nasirigerdeh, AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  • Julian Matschinske, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Reihaneh Torkzadehmahani, AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  • Mohammad Bakhtiari, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  • Julian Späth, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Tobias Frisch, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  • Amir Abbasinejad, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Paolo Tieri, CNR National Research Council, IAC Institute for Applied Computing, Rome, Italy
  • Nina Wenke, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany

Short Abstract: Aggregating clinical transcriptomics data across hospitals can increase sensitivity and robustness of differential gene expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, if class labels or confounders are inhomogeneously distributed between cohorts, results may differ significantly from those of aggregated analyses.
The dilemma between accuracy and privacy may be resolved by employing privacy-aware techniques such as federated learning (FL) or secure multi-party computation (SMPC). Here we present Flimma, a novel privacy-preserving implementation of a popular differential gene expression analysis workflow, limma voom. Flimma is based on HyFed (github.com/tum-aimed/hyfed), a hybrid federated framework that enables participants to hide the real values of their local models from the server while preserving the utility of the global model. Unlike limma voom, Flimma preserves the privacy of the data, as only highly noisy model parameters are shared with the server. Flimma is user-friendly and publicly available at exbio.wzw.tum.de/flimma/, including tutorials and documentation. A full version of this article is available at arxiv.org/abs/2010.16403.
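The masking idea can be sketched in a few lines. This is a simplified illustration of additive masking in the spirit of hybrid federated frameworks, not the actual HyFed or Flimma protocol: each client sends a noisy share of its local statistic to the aggregation server and the noise itself to a separate party, so neither party alone sees the true local value, yet the noise cancels in the aggregate.

```python
import random

def mask_local_value(value, rng):
    """Split a client's local statistic into a noisy share (for the
    aggregation server) and the noise itself (for a separate compensator).
    Simplified sketch, not the actual HyFed protocol."""
    noise = rng.uniform(-1e6, 1e6)
    return value + noise, noise

rng = random.Random(42)
local_means = [3.2, 2.9, 3.5]  # e.g. per-hospital mean expression of one gene
server_shares, comp_noises = zip(*(mask_local_value(v, rng) for v in local_means))

# The server and the compensator each aggregate only what they received...
server_sum = sum(server_shares)
noise_sum = sum(comp_noises)
# ...and only the combined, noise-cancelled global statistic is revealed.
global_mean = (server_sum - noise_sum) / len(local_means)
print(round(global_mean, 6))
```

Neither the server (which sees only `server_shares`) nor the compensator (which sees only `comp_noises`) can recover an individual hospital's value.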

G2PDeep: a web-based deep-learning framework for quantitative phenotype prediction and discovery of genomic markers
  • Shuai Zeng, University of Missouri-Columbia, United States
  • Ziting Mao, University of Missouri-Columbia, United States
  • Yijie Ren, University of Missouri-Columbia, United States
  • Duolin Wang, University of Missouri-Columbia, United States
  • Dong Xu, University of Missouri-Columbia, United States
  • Trupti Joshi, University of Missouri-Columbia, United States

Short Abstract: G2PDeep is an open-access web server that provides a deep-learning framework for quantitative phenotype prediction and discovery of genomic markers. It uses zygosity or single nucleotide polymorphism (SNP) information from plants and animals as input to predict quantitative phenotypes of interest and genomic markers associated with those phenotypes. It provides a one-stop-shop platform for researchers to create deep-learning models through an interactive web interface and train these models with uploaded data, using high-performance computing resources at the back end. G2PDeep also provides a series of informative interfaces to monitor the training process and compare performance among the trained models. The trained models can then be deployed automatically. The quantitative phenotype and genomic markers are predicted using a user-selected trained model, and the results are visualized. Our state-of-the-art model has been benchmarked by other researchers and demonstrated competitive performance in quantitative phenotype prediction. In addition, the server integrates the soybean nested association mapping (SoyNAM) dataset with five phenotypes: grain yield, height, moisture, oil, and protein. A publicly available dataset for seed protein and oil content has also been integrated into the server. The G2PDeep server is publicly available at g2pdeep.org. The Python-based deep-learning model is available at github.com/shuaizengMU/G2PDeep_model.

Gene Network Connectivity Conveys Robustness in Gene Expression across Individuals, Cell types and Species
  • Amirreza Shaeiri, EMBL-Heidelberg, Iran
  • Olga Sigalova, EMBL-Heidelberg, Germany
  • Judith Zaugg, EMBL-Heidelberg, Germany

Short Abstract: One of the remarkable properties of living systems is their robustness against various sources of variation. A growing body of work has studied this topic from the viewpoint of gene expression programs, under dissimilar names. These efforts, while shedding great light on the subject, have different limitations: they lack a comprehensive and comparative approach to studying variation across tissues and cell lines; the connectivity of genes through co-expression or gene regulatory networks is often not taken into account; and a comparative analysis of gene expression variation across different species is missing. In this work, we aimed to address these challenges by analyzing a variety of datasets. We present extensive evidence for the important role of networks in controlling variation. To the best of our knowledge, we build the most inclusive set of gene-specific regulatory features. Moreover, we predict our statistics of interest based on (1) only the genome sequence, (2) only the features, and (3) the features together with the network, utilizing a variety of methods. We further observe that expression variation is conserved between more closely related species in matching tissues.

High-dimensional multi-trait GWAS by reverse prediction of genotypes
  • Muhammad Ammar Malik, University of Bergen, Norway
  • Adriaan-Alexander Ludl, University of Bergen, Norway
  • Tom Michoel, University of Bergen, Norway

Short Abstract: Reverse linear regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a way to extend multi-trait genome-wide association studies (GWAS) to high-dimensional settings where the number of traits exceeds the number of samples.
Here we demonstrate that all multi-trait GWAS methods can be written as reverse genotype prediction methods. We analyzed linear and non-linear machine learning methods for multi-trait GWAS. Using genotypes, gene expression data, and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains, we found that genotype prediction accuracy varies across variants but does not correlate with a variant's overall effect on gene expression. Moreover, feature coefficients correlated with the association strength between variants and individual traits and were predictive of true trans-eQTL target genes. Feature selection allowed us to distinguish between genomic regions of high and low transcriptional activity in random forest models, but not in ridge regression or SVM models. In summary, feature coefficients of models predicting genotype from high-dimensional traits identify biologically relevant variant-trait associations, but comparing the relative importance of variants across such models in a GWAS-like manner using a single test statistic remains an open challenge.
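As a toy illustration of the reverse-regression setup (a generic ridge sketch, not the authors' pipeline), the genotype of one variant is regressed on many traits, and the resulting feature coefficients indicate which traits are associated with that variant. The data below are made up.

```python
def ridge_coefficients(X, y, lam=1.0):
    """Ridge solution beta = (X^T X + lam*I)^{-1} X^T y via Gaussian
    elimination. In reverse-GWAS, y is the genotype of one variant and
    the columns of X are traits (e.g. expression levels of many genes)."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):  # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

# Toy example: trait 0 tracks the genotype, trait 1 is noise.
X = [[1.0, 0.3], [1.1, -0.2], [-0.9, 0.1], [-1.0, -0.4], [0.95, 0.0]]
y = [1, 1, 0, 0, 1]  # genotype coded 0/1
beta = ridge_coefficients(X, y, lam=0.1)
print(beta)
```

The coefficient on the genotype-tracking trait dominates, mirroring the abstract's observation that feature coefficients reflect variant-trait association strength.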

HydrAMP: a deep generative model for antimicrobial peptide discovery
  • Paulina Szymczak, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Marcin Możejko, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Tomasz Grzegorzek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Marta Bauer, Medical University of Gdańsk, Poland
  • Wojciech Kamysz, Medical University of Gdańsk, Poland
  • Michał Michalski, The Centre of New Technologies, University of Warsaw, Poland
  • Damian Neubauer, Medical University of Gdańsk, Poland
  • Piotr Setny, The Centre of New Technologies, University of Warsaw, Poland
  • Jacek Sroka, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Ewa Szczurek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland

Short Abstract: The development of resistance to conventional antibiotics in pathogenic bacteria poses a global health hazard. Antimicrobial peptides (AMPs) are an emerging group of compounds with the potential to become a new generation of antibiotics. Deep learning methods are widely used by wet-laboratory researchers to screen for the most promising candidates. We propose HydrAMP, a generative model based on a semi-supervised variational autoencoder that can generate new AMPs, improve existing ones, and perform analogue discovery. Novel features of our approach include non-iterative training, parameter-regulated model creativity, generation of more diverse peptides, and disentanglement of the latent space from the conditions. Our model enables fast and efficient discovery of peptides with the desired biological activity. The peptides generated by HydrAMP are similar to known AMPs in terms of physicochemical properties. Wet-lab validation confirmed that HydrAMP can find potent analogues, as demonstrated for Pexiganan, for which we obtained a new, more active analogue. The proposed model may contribute to the fight against the antibiotic-resistance crisis.

Identifying cellular cancer mechanisms through pathway-driven data integration
  • Noel Malod-Dognin, Barcelona Supercomputing Center, Spain
  • Natasa Przulj, ICREA; Barcelona Supercomputing Center; University College London, Spain
  • Sam Windels, Barcelona Supercomputing Center; University College London, Spain

Short Abstract: Motivation:
Cancer is a genetic disease in which mutations of cancer driver genes induce a functional reorganisation of the cell by reprogramming existing cellular pathways. Many approaches have therefore been suggested to predict cancer-affected pathways, typically based on how strongly they are perturbed by differentially expressed genes. However, we observed that cancer driver genes perform hub roles in the communication between pathways. It is therefore likely that it is not primarily the pathways themselves that are perturbed, but rather their pathway-pathway interactions and, with them, their mutual functional relationships. We thus aim to identify cancer pathways and cancer genes based on their functional relationships and how these change in cancer.
To learn an embedding space that captures the functional organisation of pathways in the cell, we present a pathway-driven non-negative matrix tri-factorisation model (P-NMTF), which simultaneously decomposes a list of sub-adjacency matrices encoding how each pathway interacts within the cell. To predict cancer pathways and cancer genes, we define an NMTF centrality and a moving distance, which respectively allow us to compute the functional importance of a pathway or gene in the cell and how strongly its functional relationships are disrupted in cancer.

Identifying Cross-Cancer Similar Patients via a Semi-Supervised Deep Clustering Approach
  • Oznur Tastan, Sabanci University, Turkey
  • Duygu Ay, Sabancı University, Turkey

Short Abstract: With the characterization of cancer tumors at the molecular level, there have been reports of patients being similar despite being diagnosed with different cancer types. Motivated by these observations, we aim to discover cross-cancer patients, which we define as patients whose tumors are more similar to tumors diagnosed as another cancer type. We develop DeepCrossCancer to identify cross-cancer patients who consistently co-cluster with patients from another cancer type. The input to DeepCrossCancer is the transcriptomic profiles of the patient tumors and the age and sex of the patients. To solve the clustering problem, we use a semi-supervised deep learning-based clustering method in which the clustering task is supervised by cancer type labels and the survival times of the patients. Applying the method to patient data from nine different cancers, we discover 20 cross-cancer patients that consistently co-cluster. By analyzing the predictive genes of the cross-cancer patients and other genomic information available for the patients, such as somatic mutations and copy number variations, we identify striking genomic similarities across these patients. The detection of cross-cancer patients opens up possibilities for transferring clinical decisions across patients at the single-patient level.

Inferring developmental trajectories and optimized dimension reduction from temporal single-cell RNA-sequencing data
  • Maren Hackenberg, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany
  • Harald Binder, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany

Short Abstract: Single-cell RNA-sequencing data from multiple time points promises insights into mechanisms controlling differentiation and cell fate decisions at the level of individual cells. Yet, at each time point, a different, heterogeneous sample of cells from diverse types and developmental stages is obtained, complicating the identification of specific developmental trajectories across multiple time points.
To address this challenge, we propose a modeling approach that integrates neural network-based dimension reduction with inference of the temporal dynamics. More specifically, a low-dimensional, latent representation of gene expression is obtained with an autoencoder. In the latent space, we describe trajectories by alternating between assigning cells into groups based on the current dynamic model prediction, and optimizing the model parameters by matching the predicted and true distributions in each group using a quantile-based loss function.
Based on simulated data, we show that this approach allows for inferring distinct developmental trajectories despite the lack of one-to-one correspondence between cells at different time points. Jointly optimizing the neural network for dimension reduction and the dynamic model allows for learning an improved low-dimensional representation specifically adapted to the underlying dynamics. We additionally present an application to single-cell RNA-sequencing data from several time points during mouse cortical differentiation.

Interpretable deep recommender system model for prediction of kinase inhibitor efficacy across cancer cell lines
  • Krzysztof Koras, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
  • Dilafruz Juraeva, Merck KGaA, Translational Medicine, Department of Bioinformatics, Germany
  • Ewa Kizling, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
  • Eike Staub, Merck KGaA, Translational Medicine, Department of Bioinformatics, Germany
  • Ewa Szczurek, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

Short Abstract: Computational models for drug sensitivity prediction have the potential to significantly improve personalized cancer medicine. Drug sensitivity assays, combined with profiling of cancer cell lines and drugs, are becoming increasingly available for training such models. Multiple methods have been proposed for predicting drug sensitivity from cancer cell line features, some in a multi-task fashion. So far, no such model has leveraged drug inhibition profiles. Importantly, multi-task models require a tailored approach to model interpretability. In this work, we develop DEERS, a neural network recommender system for kinase inhibitor sensitivity prediction. The model utilizes molecular features of the cell lines and kinase inhibition profiles of the drugs. DEERS incorporates two autoencoders to project cell line and drug features into 10-dimensional hidden representations and a feed-forward neural network to combine them into a response prediction. We propose a novel interpretability approach that, in addition to the set of modeled features, also considers genes and processes outside of this set. Our approach outperforms simpler matrix factorization models, achieving R=0.82 correlation between true and predicted responses for unseen cell lines. The interpretability analysis identifies 67 biological processes that drive cell line sensitivity to particular compounds. Detailed case studies are shown for PHA-793887, XMD14-99, and Dabrafenib.

Interpretable T-cell receptor binding prediction using Feature-wise Additive Networks
  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • An-Phi Nguyen, IBM Research Europe, ETH Zurich, Switzerland

Short Abstract: Linear models are often regarded as the canonical example of an interpretable method. Arguably, one of the main reasons is the possibility of reasoning about each feature separately from the others. However, linear models are limited in that they cannot approximate more complex functions. In this work, we introduce feature-wise additive neural networks: each feature is processed separately from the others, and a latent representation is produced for each of them. Subsequently, all the feature-wise representations are simply summed and finally passed through a final network for classification. We show that the feature-wise and additive aspects of our model greatly increase its interpretability compared to other deep learning models, while still retaining the capability of approximating complex functions. This claim is supported by our experiments on a T-cell receptor binding prediction task. While motivated by interpretability, our model can also be useful for multimodal problems (e.g., prediction using diverse clinical information) and problems with missing data (with no need for imputation with arbitrary values).

Learning Interpretable Pharmacophoric Representations to Improve Deep Generative Models for Fragment Elaboration
  • Charlotte M. Deane, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Thomas E. Hadfield, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Fergus Imrie, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Torsten Schindler, F. Hoffman-La Roche AG, Switzerland
  • William Pitt, UCB Pharma, United Kingdom
  • Andy Merritt, LifeArc, United Kingdom
  • Kristian Birchall, LifeArc, United Kingdom
  • Garrett M. Morris, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom

Short Abstract: Despite recent interest in deep generative models for scaffold elaboration, their applicability to fragment-to-lead campaigns has been limited. This is in part because they are currently unable to account for local protein structure when proposing molecules and are not designed to pursue specific design hypotheses. We propose a novel method for fragment elaboration, STRIFE, which uses Fragment Hotspot Maps to extract meaningful and interpretable information from the protein and uses it to generate elaborations with complementary pharmacophores. We demonstrate substantial improvements over an existing, structure-unaware fragment elaboration model on a large-scale test set and show that our approach allows a significant degree of control over the elaborations produced. On a challenging case study derived from the literature, despite our model being provided with only the fragment and the associated protein structure, a number of the molecules ranked highly by STRIFE closely matched the design hypothesis specified by the authors.

  • Adrian J. Green, NC State University, United States
  • Jhuma Das, The University of North Carolina at Chapel Hill, United States
  • Meenal Chaudhari, North Carolina A&T State University, United States
  • Lisa Truong, Oregon State University, United States
  • Robyn L. Tanguay, Oregon State University, United States
  • Martin Mohlenkamp, Ohio University, United States
  • David Reif, NC State University, United States

Short Abstract: There are currently 85,000 chemicals registered with the US Environmental Protection Agency, but only a small fraction have measured toxicological data. To address this gap, high-throughput screening (HTS) and computational methods are vital. As part of one such HTS effort, embryonic zebrafish were used to examine a suite of morphological and mortality endpoints for over 1,000 chemicals found in the ToxCast library. We hypothesized that a conditional generative adversarial network (GAN-ZT) or deep neural network (Go-ZT) could efficiently predict toxic outcomes of untested chemicals. We converted the 3D chemical structural information into a weighted set of points and used the in vivo toxicity data to train the two models. Our results showed that Go-ZT significantly outperformed GAN-ZT, support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) models in cross-validation and when tested against an external test dataset. By combining Go-ZT and GAN-ZT, our consensus model improved the Kappa and area under the receiver operating characteristic curve to 0.673 and 0.837, respectively. Considering their potential use as prescreening tools, these models could provide in vivo toxicity predictions and insight into the hundreds of thousands of untested chemicals and help prioritize compounds for HT testing.

Light Attention Predicts Protein Location from the Language of Life
  • Hannes Stärk, Department of Informatics, Technical University of Munich, Germany
  • Christian Dallago, Department of Informatics, Technical University of Munich, Germany
  • Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
  • Burkhard Rost, Department of Informatics, Technical University of Munich, Germany

Short Abstract: Although knowing where a protein functions in a cell is important for characterizing biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information, which is resource-expensive to generate. We showcase the use of embeddings from protein language models for competitive localization prediction without relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax-weighted aggregation mechanism with linear complexity in sequence length, referred to as light attention (LA). The method significantly outperformed the state of the art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web service and as a stand-alone application at embed.protein.properties.
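The softmax-weighted aggregation at the heart of light attention can be sketched as follows. This is an illustrative simplification: in the actual LA architecture the attention scores and values come from 1D convolutions over the residue embeddings, whereas here both are supplied directly.

```python
import math

def light_attention_pool(embeddings, attn_scores):
    """Softmax-weighted aggregation of per-residue embeddings into one
    fixed-size protein vector; cost is linear in sequence length."""
    m = max(attn_scores)  # subtract max for numerical stability
    exp_s = [math.exp(s - m) for s in attn_scores]
    z = sum(exp_s)
    weights = [e / z for e in exp_s]
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# Three residues with 2-d embeddings; the second residue gets a high score.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = light_attention_pool(emb, [0.0, 5.0, 0.0])
print(pooled)
```

Because the pooled vector has fixed size regardless of sequence length, a small feed-forward classifier on top suffices for the localization classes.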

Manifold-based gene density estimates reveal immune signaling in meningioma tumors
  • Aarthi Venkat, Yale University, United States
  • Danielle Miyagishima, Yale University, United States
  • Alexander Tong, Yale University, United States
  • Murat Günel, Yale University, United States
  • Smita Krishnaswamy, Yale University, United States

Short Abstract: Single-cell sequencing analysis often includes clustering cells and identifying differentially expressed genes (DEGs) between clusters. However, the number of clusters, clustering algorithm, and choice of hyperparameters can have a large effect on downstream analyses and biological interpretation.

To address these difficulties, we present our method, which leverages manifold learning to identify DEGs in a cluster-independent way. We begin by modeling the cellular state space as a manifold. We then calculate a kernel density estimate (KDE) of each gene over the graph, using a form of KDE generalized to smooth manifolds. Finally, we compute the L1 distance between each gene's density distribution and the uniform distribution. By ranking genes based on this L1 distance, we identify genes with highly localized expression along the manifold.
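The ranking idea can be illustrated with a crude stand-in for the manifold KDE (not the authors' operator): smooth each gene's signal over a cell-cell graph with a lazy random-walk diffusion, then score genes by the L1 distance of the smoothed density from uniform. The toy graph and signals below are made up.

```python
def smooth_density(adj, indicator, steps=10, alpha=0.5):
    """Lazy random-walk smoothing of a per-cell gene signal over a
    cell-cell graph; a simple stand-in for the heat-kernel KDE."""
    f = list(indicator)
    n = len(f)
    for _ in range(steps):
        f = [alpha * f[i] + (1 - alpha) * sum(f[j] for j in adj[i]) / len(adj[i])
             for i in range(n)]
    total = sum(f)
    return [v / total for v in f]  # normalize to a density over cells

def l1_to_uniform(density):
    """L1 distance to the uniform distribution; large = localized gene."""
    n = len(density)
    return sum(abs(p - 1.0 / n) for p in density)

# Path graph of 6 cells; gene A is localized to one end, gene B is ubiquitous.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
gene_a = smooth_density(adj, [1, 1, 0, 0, 0, 0])
gene_b = smooth_density(adj, [1, 1, 1, 1, 1, 1])
print(l1_to_uniform(gene_a), l1_to_uniform(gene_b))
```

The localized gene scores high while the ubiquitously expressed gene scores near zero, so ranking by this distance surfaces localized expression without any clustering step.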

We demonstrate the utility of this approach on spatial and RNA sequencing of meningioma tumors from nine patients. Our method identifies enrichment of critical immune signaling DEGs in NF2-mut versus NF2-wt tumors, corroborating a link between NF2 loss and immune infiltration. Within NF2-mut tumors, our approach discovers cell-cell communication networks by identifying downstream targets colocalized with enriched ligands in the spatial profile. Together, these results show that our method enables cluster-independent exploration of the immune infiltrate during brain tumorigenesis.

Mapping cell structure across scales by fusing protein images and interactions
  • Yue Qin, University of California San Diego, United States
  • Casper Winsnes, KTH-Royal Institute of Technology, Sweden
  • Edward Huttlin, Harvard Medical School, United States
  • Fan Zheng, University of California San Diego, United States
  • Wei Ouyang, KTH-Royal Institute of Technology, Sweden
  • Jisoo Park, University of California San Diego, United States
  • Adriana Pitea, University of California San Diego, United States
  • Jason Kreisberg, University of California San Diego, United States
  • Steven Gygi, Harvard Medical School, United States
  • Wade Harper, Harvard Medical School, United States
  • Jianzhu Ma, Purdue University, United States
  • Emma Lundberg, KTH-Royal Institute of Technology, Sweden
  • Trey Ideker, University of California San Diego, United States

Short Abstract: The eukaryotic cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescent images in the Human Protein Atlas with ongoing affinity purification experiments from the BioPlex resource to create a unified hierarchical map of eukaryotic cell architecture. Integration involves configuring each approach to produce a general measure of protein distance, then calibrating the two measures using machine learning. The evolving map, called the Multi-Scale Integrated Cell (MuSIC 1.0), currently resolves 69 subcellular systems of which approximately half are undocumented. Based on these findings we perform 134 additional affinity purifications, validating close subunit associations for the majority of systems. The map elucidates roles for poorly characterized proteins; identifies new protein assemblies in ribosomal biogenesis and RNA splicing; and reveals crosstalk between cytoplasmic and mitochondrial ribosomal proteins. By integration across scales, MuSIC substantially increases the mapping resolution obtained from imaging while giving protein interactions a spatial dimension, paving the way to incorporate many molecular data types in proteome-wide maps of cells.

MatchMaker: A Deep Learning Framework for Drug Synergy Prediction
  • A. Ercument Cicek, Bilkent University, Turkey
  • Halil İbrahim Kuru, Bilkent University, Turkey
  • Oznur Tastan, Sabanci University, Turkey

Short Abstract: Drug combination therapies are commonly used for the treatment of complex diseases such as cancer due to increased efficacy and reduced side effects. However, experimentally validating all possible combinations for synergistic interaction, even with high-throughput screens, is intractable due to the vast combinatorial search space. Computational techniques are therefore used to reduce the number of combinations to be evaluated experimentally by prioritizing promising candidates. We present MatchMaker, a deep neural network-based drug synergy prediction algorithm that predicts synergy scores for a pair of drugs using the drugs' chemical structures and gene expression profiles of untreated cell lines as input. The model contains three neural subnetworks: two subnetworks learn representations of the two drugs separately, conditioned on the gene expression of the given cell line; the outputs of these two subnetworks are then fed into a third subnetwork that predicts the Loewe synergy score of the pair. We train MatchMaker on the DrugComb dataset, which contains 286,421 examples. MatchMaker yields improvements of up to ~15% in correlation and ~33% in mean squared error (MSE) over the next best method, DeepSynergy.

ME-VAE: Multi-Encoder Variational AutoEncoder for Controlling Multiple Transformational Features in Single Cell Image Analysis
  • Luke Ternes, Oregon Health & Science University, United States
  • Mark Dane, Oregon Health & Science University, United States
  • Marilyne Labrie, Oregon Health & Science University, United States
  • Gordon Mills, Oregon Health & Science University, United States
  • Joe Gray, Oregon Health & Science University, United States
  • Laura Heiser, Oregon Health & Science University, United States
  • Young Hwan Chang, Oregon Health & Science University, United States

Short Abstract: See attached long abstract

Measuring hidden phenotype: Quantifying the shape of barley seeds using the Euler Characteristic Transform
  • Erik Amezquita, Michigan State University, United States
  • Michelle Quigley, Michigan State University, United States
  • Tim Ophelders, TU Eindhoven, Netherlands
  • Jacob Landis, Cornell University, United States
  • Daniel Koenig, University of California Riverside, United States
  • Daniel H. Chitwood, Michigan State University, United States
  • Elizabeth Munch, Michigan State University, United States

Short Abstract: Shape plays a fundamental role in biology. Traditional phenotypic analysis methods measure some features but fail to capture the information embedded in shape comprehensively. To extract, compare, and analyze this information in a robust and concise way, we turn to Topological Data Analysis (TDA), specifically the Euler Characteristic Transform (ECT). TDA measures shape comprehensively using descriptors grounded in algebraic topology. To study its use, we compute both traditional and topological shape descriptors to quantify the morphology of 3,121 barley seeds scanned with X-ray computed tomography (CT). The ECT produces vectors that can be thought of as shape signatures for each barley seed. Using these vectors, we successfully train a support vector machine (SVM) to classify 28 different accessions of barley based solely on the 3D shape of their grains. We observe that combining traditional and topological descriptors classifies barley seeds to their correct accession better than using traditional descriptors alone. This improvement suggests that TDA is a powerful complement to traditional morphometrics, describing shape nuances that are otherwise ignored. TDA can quantify aspects of phenotype that have remained "hidden", enabling the reconstruction of objects based on their topological signatures.
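For intuition, here is a 2D toy version of the ECT (the actual work uses 3D CT voxels): for each direction, compute the Euler characteristic of the sublevel set of the height function at a sequence of thresholds; concatenating these curves over many directions gives the shape signature. The Euler characteristic of a set of filled unit pixels is vertices minus edges plus faces of the induced cubical complex.

```python
def euler_characteristic(pixels):
    """Euler characteristic (V - E + F) of a set of filled unit pixels,
    treated as a cubical complex with shared edges and corners."""
    verts, edges = set(), set()
    for (x, y) in pixels:
        for v in [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]:
            verts.add(v)
        edges.update([((x, y), (x + 1, y)), ((x, y + 1), (x + 1, y + 1)),
                      ((x, y), (x, y + 1)), ((x + 1, y), (x + 1, y + 1))])
    return len(verts) - len(edges) + len(pixels)

def ect(pixels, direction, thresholds):
    """Euler Characteristic Transform in one direction: EC of the
    sublevel set of the height function <pixel, direction> at each t."""
    dx, dy = direction
    return [euler_characteristic([p for p in pixels if p[0] * dx + p[1] * dy <= t])
            for t in thresholds]

# A 3x3 square of pixels with the center removed: an annulus, chi = 0.
ring = [(x, y) for x in range(3) for y in range(3) if (x, y) != (1, 1)]
print(euler_characteristic(ring))        # -> 0
print(ect(ring, (1, 0), [-1, 0, 1, 2]))  # EC as the shape is swept left to right
```

The hole only closes once the final column enters the sublevel set, which is exactly the kind of "hidden" topological information the ECT records and traditional descriptors miss.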

Mining Antimicrobial Resistance Genes from a Salmonella enterica pan-genome using A Cross-Validated Feature Selection (CVFS) Approach
  • Ming-Ren Yang, Taipei Medical University, Taiwan
  • Yu-Wei Wu, Taipei Medical University, Taiwan

Short Abstract: Understanding genes and their underlying mechanisms is critical in deciphering how antimicrobial-resistant (AMR) bacteria withstand the detrimental effects of antibiotic drugs. Current AMR databases, however, may not be comprehensive, since new mechanisms are continuously discovered. It is thus critical to expand the potential AMR gene repertoire for more accurate inference of AMR strains.

We developed a Cross-Validated Feature Selection (CVFS) approach for robustly mining genes related to AMR activities in an unbiased manner. The core idea behind the CVFS approach is interrogating features among randomly split, non-overlapping sub-datasets to ensure the representativeness of the features. The interrogation process is repeated several times to minimize random effects, and only features selected by most of the repeated runs are chosen as the final feature set. By testing this idea on a Salmonella enterica pan-genome dataset, we show that this approach extracts the most representative features (genes selected by the CVFS approach; CVFS-genes), which predict AMR activities very well, indicating a strong association between the CVFS-genes and AMR activities. Functional analysis demonstrates that the majority of these genes encode hypothetical proteins (i.e., with unknown functional roles), highlighting the potential of CVFS-genes to significantly expand antimicrobial resistance databases.
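
The split-interrogate-vote loop described above can be sketched as follows; this is a minimal illustration with a stand-in scoring function and hypothetical names, not the authors' code:

```python
import random
from collections import Counter

def select_features(X, y, k):
    """Score features by absolute difference in class means (a simple
    stand-in for the interrogation step) and keep the top k."""
    def score(j):
        pos = [x[j] for x, label in zip(X, y) if label == 1]
        neg = [x[j] for x, label in zip(X, y) if label == 0]
        if not pos or not neg:
            return 0.0
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return set(sorted(range(len(X[0])), key=score, reverse=True)[:k])

def cvfs(X, y, k=1, n_splits=3, n_repeats=20, seed=0):
    """Cross-Validated Feature Selection sketch: repeatedly split the
    samples into non-overlapping subsets, keep only features selected in
    every subset, and return features chosen in a majority of repeats."""
    rng = random.Random(seed)
    votes, idx = Counter(), list(range(len(X)))
    for _ in range(n_repeats):
        rng.shuffle(idx)
        chosen = None
        for s in range(n_splits):
            part = idx[s::n_splits]  # non-overlapping sub-dataset
            feats = select_features([X[i] for i in part],
                                    [y[i] for i in part], k)
            chosen = feats if chosen is None else chosen & feats
        votes.update(chosen)
    return {f for f, v in votes.items() if v > n_repeats // 2}
```

In the actual study, the samples are Salmonella enterica genomes, the features are pan-genome genes, and the selection criterion is more elaborate than this toy score.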

Multiple Instance Learning to support tumor area detection in histopathological scans
  • Sylwia Szymanska, Silesian University of Technology, Poland
  • Joanna Polanska, Silesian University of Technology, Poland

Short Abstract: Breast cancer is one of the most common causes of death among women. The acquisition of an annotated dataset is still a time-consuming process; therefore, the use of computer-aided diagnosis (CAD) for automatic classification of histopathological images can improve the analysis process.
Our goal was to investigate selected machine learning algorithms on datasets consisting of images of invasive ductal carcinoma (IDC) breast cancer, from which 277,524 slices were extracted (198,738 negative and 78,786 positive). The determined features and/or interactions were later used to analyse the precision of tumour area detection.
Analysis was based on three Multiple Instance Learning (MIL) methods: APR MIL, Citation-kNN MIL, and MILBoost. The data were grouped into labelled bags, which allows the use of less descriptive annotations of instances in datasets. We compared the results from three different datasets: the first containing features, the second interactions, and the third both features and interactions. Indicators calculated from the confusion matrix showed that the best detection of tumour areas was achieved using the set containing both features and interactions. The highest accuracy was obtained by MILBoost (96-95%), followed by Citation-kNN MIL (81-70%) and APR MIL (66-60%).
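
The bag construction common to all three MIL methods can be sketched as follows (a generic illustration with hypothetical names, not the authors' code): instances are grouped into labelled bags, and under the standard MIL assumption a bag is positive if any of its instances is.

```python
def make_bags(instances, labels, bag_size):
    """Group instance-level data into labelled bags; a bag inherits a
    positive label if any member instance is positive."""
    bags = []
    for s in range(0, len(instances), bag_size):
        chunk = instances[s:s + bag_size]
        bags.append((chunk, int(any(labels[s:s + bag_size]))))
    return bags

def predict_bag(instances, instance_model):
    """Bag-level prediction: positive iff at least one instance is
    predicted positive by the instance-level model."""
    return int(any(instance_model(x) for x in instances))
```

This is why MIL needs only bag-level annotations: the slice-level labels inside a bag never have to be specified individually.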

Multitask group Lasso for Genome-Wide Association Studies in admixed populations
  • Asma Nouira, MINES PARISTECH - Institut Curie - INSERM, France
  • Chloé-Agathe Azencott, MINES PARISTECH - Institut Curie - INSERM, France

Short Abstract: Population stratification refers to the presence of differences in allele frequencies between subpopulations within samples, due to different ancestry. It is one of the major challenges in Genome-Wide Association Studies (GWAS), as it increases type I error. An additional issue in GWAS is the presence of correlation between SNPs, or Linkage Disequilibrium (LD). To account for LD, we consider associations at the level of LD-groups (groups of correlated SNPs) rather than at the individual SNP level. In this contribution, we introduce a multitask group Lasso for feature selection, where each task corresponds to a subpopulation and each feature corresponds to an LD-group. Our algorithm selects either LD-groups shared across all tasks or population-specific LD-groups. We incorporate stability selection to improve the stability of sparsity-enforcing penalties, and we use safe screening rules to provide a significant speed-up and scale the algorithm to GWAS data. To our knowledge, this is the first framework for admixed populations that combines feature selection, stability selection, and safe screening rules at the LD-group level. We show that our approach outperforms all standard methods on a simulated dataset and on two real cancer datasets.
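
A standard building block of solvers for group-Lasso-type penalties like the one described is the group soft-thresholding (proximal) operator, which shrinks, and possibly zeroes out, an entire group, here an LD-group, at once. A minimal numpy sketch (names ours, not the authors' solver):

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of the group Lasso penalty for one group:
    shrink the whole coefficient block toward zero by its L2 norm,
    setting it exactly to zero when the norm falls below lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1 - lam / norm) * v

def prox_group_lasso(beta, groups, lam):
    """Apply group-wise shrinkage; `groups` lists index arrays, one per
    LD-group of correlated SNPs."""
    out = np.array(beta, dtype=float)
    for g in groups:
        out[g] = group_soft_threshold(out[g], lam)
    return out
```

Because whole groups are zeroed together, the fitted model selects or discards entire LD-groups rather than individual SNPs, which is exactly the granularity the abstract argues for.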

Non-negative matrix factorization of multi-species single-cell RNA-seq data
  • William Stafford Noble, University of Washington, United States
  • Xinxian Deng, University of Washington, United States
  • Christine Disteche, University of Washington, United States
  • Mu Yang, University of Washington, United States
  • Giancarlo Bonora, University of Washington, United States
  • Jacob Schreiber, Stanford University, United States

Short Abstract: Analyzing transcriptomic data collected from multiple species can be useful for studying evolutionary changes between genes in different species. When considering a single species, a powerful tool for collecting such data is high-throughput single-cell sequencing technology, which allows profiling of hundreds of thousands of cells to enable the exploration of expression differences among them. However, cross-species data require establishing a common set of genes, which can be challenging given the existence of orthologous and paralogous genes. Single-cell datasets also raise issues such as data sparsity and batch effects. We propose to use an extension of non-negative matrix factorization (NMF), based on a deep neural network, to compare expression of orthologous X-linked genes to that of their corresponding autosomal genes across species. Our data, represented as a 4D tensor, is a single-cell RNA-seq dataset that consists of multiple species and cell types. We extract information from the tensor by using deep NMF to induce latent factors corresponding to genes, cells, species, and cell types. We demonstrate that, using this approach, cells from different species and cell types can be jointly embedded in a latent space, which should facilitate cross-species and cross-cell-type expression comparison of orthologs.

On the estimation of epigenetic energy landscapes from nanopore sequencing data
  • Jordi Abante, Johns Hopkins University, United States
  • Sandeep Kambhampati, Johns Hopkins University, United States
  • Andrew P. Feinberg, Johns Hopkins University, United States
  • John Goutsias, Johns Hopkins University, United States

Short Abstract: High-throughput third-generation sequencing devices, such as nanopore sequencing, can generate long reads spanning thousands of bases. This new technology offers the possibility of considering a wide range of epigenetic modifications and provides the capability to interrogate previously inaccessible regions of the genome, such as highly repetitive regions, and perform comprehensive allele-specific methylation analysis, among other applications. It is well-known, however, that detection of DNA methylation from nanopore data results in a substantially reduced per-read accuracy compared to bisulfite sequencing, due to noise introduced by the sequencer and its underlying pore chemistry. Therefore, new methods must be developed for reliable modeling and analysis of DNA methylation landscapes using nanopore sequencing data. Here we introduce such a method and, by using simulations, we provide evidence of its superiority to the state-of-the-art. The proposed approach establishes a solid foundation for developing a comprehensive framework for the statistical analysis of DNA methylation and possibly of other epigenetic marks using nanopore sequencing data and potential energy landscapes.

OncoML: A Multi-omics-based Machine Learning Approach for Targeted Cancer Drug Prediction
  • Darsh Mandera, Independent, United States

Short Abstract: The current approach to cancer treatment is a one-size-fits-all approach that fails to account for tumor heterogeneity, resulting in 75% ineffectiveness of cancer treatment. Recent research has focused on drug prediction by applying machine learning to genetic mutations or using microRNA (miRNA), a key biomarker of cancer. Although these approaches demonstrate improved potential for targeted drug prediction, they present some limitations. Gene mutations have been shown to account for only a small subset of candidate biomarkers, and while miRNA-based gene expression is regarded as offering more predictive modalities, both can be complemented by the integration and analysis of a multi-omic view of cancer. In this research, a machine learning model was trained and tested with over 80% of cancer types using gene mutation, miRNA, and drug response data from The Cancer Genome Atlas. Feature selection using ExtraTreesClassifier identified 945 gene mutations and miRNAs as key features out of over 18,000 features. The model was tested with multiple machine learning classifiers, including DecisionTree, K-NearestNeighbors, and Ensemble Learning-based approaches (AdaBoostClassifier and OneVsRestClassifier). OneVsRestClassifier, when combined with cross-validation, outperformed the other approaches and can predict drugs for cancer patients based on their gene mutations and miRNA data with an accuracy of 83%.

PandoraGAN: Generating antiviral peptides using Generative Adversarial Network
  • Shraddha Surana, ThoughtWorks technologies, India
  • Pooja Arora, ThoughtWorks technologies, India
  • Divye Singh, ThoughtWorks technologies, India
  • Deepti Sahasrabuddhe, ThoughtWorks technologies, India
  • Jayaraman Valadi, ThoughtWorks technologies, India

Short Abstract: The continuous increase in pathogenic viruses and the intensive laboratory research required to develop novel antiviral therapies pose challenges for cost- and time-efficient drug design. This accelerates the search for alternative drug candidates and contributes to the recent rise in research on antiviral peptides against many viruses. With limited information regarding these peptides and their activity, modifying an existing peptide backbone or developing a novel peptide is a very time-consuming and tedious process. Advanced deep learning approaches such as generative adversarial networks (GANs) can help wet-lab scientists screen potential antiviral candidates of interest and expedite the initial stage of peptide drug development. To our knowledge, this is the first use of GAN models for antiviral peptides across the viral spectrum.

Results: In this study, we develop PandoraGAN, which utilizes a GAN to design bioactive antiviral peptides. Available antiviral peptide data were manually curated to prepare a training dataset that includes peptides with lower IC50 values. We further validated the generated sequences by comparing the physico-chemical properties of the generated antiviral peptides with the training data.

Pathogenic potential prediction of novel fungal DNA using ResNets based on a newly curated fungi-host database
  • Jakub M. Bartoszewicz, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Ferdous Nasri, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany

Short Abstract: Novel fungal disease outbreaks are happening more often now than ever, and they are hard to predict and diagnose. Currently, a multi-drug resistant fungal species, Candida auris, is emerging worldwide with a mortality rate of 30-60%. Pathogenic fungi are under-researched, while they are constantly evolving to survive in new environments and hosts as their natural habitats are destroyed by climate change. At the same time, genetic sequencing technologies have advanced rapidly, opening the door for computational approaches to forecast pathogenic potential.

We construct the most comprehensive database, to our knowledge, of pathogenic fungal species and their hosts by mining literature, books, and existing databases. This database consists of 982 pathogenic fungal species, including 471 species with publicly available genomic data. Furthermore, we present a classifier based on Residual Neural Networks (ResNets) to predict the pathogenic potential of novel fungi directly from unannotated DNA sequences. We compare our model with the Basic Local Alignment Search Tool (BLAST) and achieve higher prediction accuracy. We believe our work provides valuable information and opens the door for future research in fungal pathology.

PENGUINN: Precise Exploration of Nuclear G-Quadruplexes Using Interpretable Neural Network
  • Eva Klimentová, Faculty of Informatics, Masaryk University, Czechia
  • Jakub Poláček, Faculty of Informatics, Masaryk University, Czechia
  • Petr Šimeček, Central European Institute of Technology (CEITEC), Masaryk University, Czechia
  • Panagiotis Alexiou, Central European Institute of Technology (CEITEC), Masaryk University, Czechia

Short Abstract: G-quadruplexes (G4s) are a class of stable nucleic acid secondary structures that are known to play a role in a wide spectrum of genomic functions, such as DNA replication and transcription. The classical understanding of G4 structure points to four variable-length guanine strands joined by variable-length nucleotide stretches. G4 immunoprecipitation and sequencing experiments have produced a large number of highly probable G4-forming genomic sequences. The expense and technical difficulty of experimental techniques highlight the need for computational approaches to G4 identification. Here, we present PENGUINN, a machine learning method based on convolutional neural networks, that learns the characteristics of G4 sequences and accurately predicts G4s, outperforming state-of-the-art methods. We provide both a standalone implementation of the trained model and a web application that can be used to evaluate sequences for their G4 potential.
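
The classical G4 pattern described above ("four variable-length guanine strands joined by variable-length nucleotide stretches") is often approximated by a simple regular expression. The sketch below shows that common heuristic baseline, not PENGUINN's neural network:

```python
import re

# Canonical G4 heuristic: four runs of >=3 guanines separated by
# loops of 1-7 arbitrary nucleotides.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_g4(seq):
    """Return (start, end) spans of candidate canonical G4s in seq."""
    return [m.span() for m in G4_PATTERN.finditer(seq.upper())]
```

Pattern-based rules of this kind miss non-canonical G4s (e.g. bulged or long-loop variants), which is part of the motivation for learning G4 characteristics from sequencing data instead.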

Predicting Condition-Specific Gene Expression in Fungi
  • Ananthan Nambiar, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States
  • Veronika Dubinkina, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States
  • Simon Liu, Department of Computer Science, University of Illinois at Urbana-Champaign, United States
  • Sergei Maslov, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States

Short Abstract: Bioengineered fungi can be used as ‘microbial factories’ to manufacture biofuels and other chemicals through processes that are economically and ecologically sustainable. To do so, it is important to understand the organism’s condition-specific transcriptional regulation of genes. Here we propose a neural network, FUngal pRomoter sequence to conditiOn specific expRession (FUROR), that takes as inputs the promoter sequence of a gene and the expression of transcription factors (TFs) to predict the condition-specific expression of the gene as High, Mean, or Low. We tested FUROR on N. crassa, for which it achieved an accuracy of 0.58 on a test set of randomly withheld elements from the gene-condition matrix and 0.48 on withheld conditions. Both of these results are significantly better than random assignment of classes. The neural network was also interpreted for the de novo discovery of TF-DNA binding motifs. This indicates that FUROR could be used for the discovery of gene regulatory networks of non-model fungal species. Currently, we are extending the model to other species, including I. orientalis.

Predicting Protein Relative Solvent Accessible Surface Using Graph Neural Networks
  • Gonzalo Rubio, Genomica Bioinformatics Ltd., Canada
  • German Novakovsky, Genomica Bioinformatics Ltd., Canada
  • Ophir Greif, Genomica Bioinformatics Ltd., Canada
  • Eldad Haber, Genomica Bioinformatics Ltd., Canada

Short Abstract: Proteins are complex molecules that have vital roles in virtually every biological function. Understanding their properties, including local structure, surface accessibility, and interactions is key to developing therapeutics and understanding their overall function.

In this work, we develop techniques to assess the relative solvent accessible surface area (rASA). If the 3D structure of a protein is known, the rASA can be computed easily. In most cases, however, the structure is unknown, and the rASA is difficult to assess.

To do so, we use graph neural networks accompanied by a data preprocessing pipeline that uses homologous proteins to generate graph properties. We use the ProteinNet database, which was generated for the task of training protein folding algorithms. This allowed us to train our network on roughly 40K proteins with known structures, accompanied by their Multiple Sequence Alignments (MSAs) and Position-Specific Scoring Matrices (PSSMs). The MSAs provide an estimate of the contact maps, which are used in the graph network to explore long-range relationships between residues. As we show in our numerical experiments, we significantly improve over the state of the art on some of the latest datasets.

Prediction of Antiviral Activity in Peptides from Venomous Animals Using an Ensemble Deep Learning Model
  • Caio Fontes de Castro, Institute of Mathematics and Statistics, University of São Paulo, Brazil
  • Fernanda Midori Abukawa, Laboratory of Applied Toxinology, Butantan Institute, Brazil
  • Milton Yutaka Nishiyama Junior, Laboratory of Applied Toxinology, Butantan Institute, Brazil

Short Abstract: Antiviral Peptides (AVPs), a subset of Antimicrobial Peptides (AMPs), have been studied as alternatives to traditional antiviral drugs. However, the identification of promising bioactive peptides, especially from venomous animals, is a challenge, given that the development and pre-clinical testing of a single peptide can be expensive and show low effectiveness. This has motivated the development of many machine learning and deep learning models that attempt to predict AVPs based on amino acid sequence and derived features. We have developed an ensemble model, based on both a Random Forest (RF) model and an LSTM network, that combines physicochemical and sequence information to predict AVPs. To train the model, we gathered experimentally validated AMPs from seven databases, resulting in 15,650 AMPs and 2,818 validated AVPs. We tested our model with several feature sets as input for the RF; the best-performing models utilized PCP descriptors for the RF and achieved 10-fold cross-validation mean values of 96% accuracy and 0.98 AUC in the AVP prediction task, and 96% accuracy and 0.99 AUC in the general AMP prediction task. Next, toxin proteins from Arachnida species will be used in an attempt to identify novel AVPs and AMPs.

Prediction of Drug-Kinase Binding Affinities with Focus on Conserved Protein Kinase Domain
  • Davor Oršolić, Ruđer Bošković Institute, Croatia
  • Bono Lučić, Ruđer Bošković Institute, Croatia
  • Višnja Stepanić, Ruđer Bošković Institute, Croatia
  • Tomislav Šmuc, Ruđer Bošković Institute, Croatia

Short Abstract: Protein kinases, important signaling proteins whose deregulation or over-expression contributes to many diseases, are among the most common pharmacological targets for novel drug discovery.
Previous approaches implemented on benchmark datasets show poor performance in more rigorous test settings containing previously unseen small compounds or protein kinase targets, limiting their real-world application. For this reason, we extended the compound space of putative protein kinase inhibitors by combining several publicly available databases with popular benchmark datasets. In addition, we limited the size of the input information by considering only compounds with small molecular weight (< 900 Da) and learning representations of the conserved protein kinase domain sequence only, instead of using whole sequences, since most kinase inhibitors bind to the ATP binding site or to other catalytic subunits in its proximity.
Our predictive methodology relied on an ensemble approach, XGBoost trained on fingerprint-based and sequence-based similarity features, as a baseline, and on graph convolutional networks (GCNs) as a more advanced representation-learning method. To assess the uncertainty of model predictions, we defined a structure-based applicability domain focused on the density of compound space in the training set.

Prediction of Human Pancreas Cell Types via ConvNet on Two-dimensional Mapping of Single-cell RNA-seq Data
  • Akram Vasighizaker, University of Windsor, Canada
  • Li Zhou, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada

Short Abstract: Single-cell RNA sequencing (scRNA-seq) technology has been widely applied in biological studies such as drug discovery. Prior to in-depth investigations of the functionality of single cells for pathological goals, identification of cell types is an essential step that can be sped up using computational methods. Recently, supervised learning methods have been developed to automatically identify cell types, but due to the lack of sufficient annotated datasets, these methods have not been commonly used in scRNA-seq studies. Classification methods can take advantage of feature selection techniques to improve cell type prediction while identifying the most informative genes among the high number of genes in high-dimensional scRNA-seq datasets. In this regard, we introduce a combination of two powerful techniques, representation learning and unsupervised feature selection, to achieve automatic cell type identification in two steps. An average prediction accuracy of 98% was obtained on six different cell types in a Human Pancreas scRNA-seq dataset. In addition, we found that 11 out of 13 selected genes are biologically related to two cell types in the Human Pancreas, which confirms the effectiveness of the proposed approach.

Probeset selection for targeted spatial transcriptomics
  • Fabian Theis, Helmholtz Center Munich, Germany
  • Malte Lücken, Helmholtz Center Munich, Germany
  • Louis Kümmerle, Helmholtz Center Munich, Germany
  • Lukas Heumos, Helmholtz Center Munich, Germany

Short Abstract: Spatial transcriptomics is an emerging technology that enables cellular variation to be placed into a spatial context. This has produced novel insights into tissue heterogeneity under healthy and diseased conditions. High resolution and cost-efficient methods require a pre-selection of genes that will be measured. Here we present a method to select probe sets for spatial transcriptomics and benchmark this approach and others in a custom pipeline. Our method optionally incorporates prior knowledge and accounts for experimental constraints.

ProteinBERT: A universal protein language & function model
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Dan Ofer, The Hebrew University of Jerusalem, Israel

Short Abstract: Self-supervised deep learning is a powerful approach for sequence modeling in natural language and potentially biological sequences. However, existing models (e.g. BERT) and pre-training methods are designed for natural languages, not protein sequences. Protein-specific models and pre-training tasks are necessary to better capture the information within protein sequences. Here, we introduce a novel deep-learning model based on BERT, called ProteinBERT. We present a pre-training scheme that consists of masked language modeling combined with a protein-specific pre-training task of predicting Gene Ontology (GO) functions. We introduce novel architectural elements that make the model more efficient and flexible with very large sequence lengths and multiple inputs. We obtain state-of-the-art results despite using a far smaller model than other deep learning models, and show major benefits from pre-training. Code and models are available.

RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data
  • Thibaud Godon, Laval University, Canada
  • Pier-Luc Plante, Laval University, Canada
  • Baptiste Bauvin, Laval University, Canada
  • Élina Francovic-Fontaine, Laval University, Canada
  • Alexandre Drouin, Element AI, a ServiceNow company; Laval University, Canada
  • François Laviolette, Laval University, Canada
  • Jacques Corbeil, Laval University, Canada

Short Abstract: Recent metabolomics measurement devices, such as mass spectrometers, produce extremely high-dimensional data. Together with small sample sizes, this setting is known as the fat-data (or p >> n) problem. Biomarker discovery in this configuration is a challenge: classical statistical methods fail, and common Machine Learning (ML) algorithms produce models too complex to be interpretable. ML algorithms that rely on sparsity to predict phenotypes using very few covariates have been shown to thrive in this setting. While sparsity helps to avoid overfitting, it also leads to concise models that are easier to interpret for biomarker discovery.

The Set Covering Machine (SCM) algorithm produces sparse models based on simple decision rules. Recent work has applied SCMs to the genotype-to-phenotype prediction of antibiotic resistance and achieved state-of-the-art accuracy. To adapt this approach to metabolomics (fat) data, we developed a bootstrap aggregation of SCM models: RandomSCM.

We explored applications of RandomSCM beyond genotype-to-phenotype prediction by applying it to five metabolomics datasets. Prediction performance is at the state-of-the-art level. Furthermore, studying the decision rules in RandomSCM revealed valid biomarkers of the phenotypes. These results demonstrate the high potential of the RandomSCM algorithm for biomarker discovery in the omics sciences.
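
For illustration, the greedy rule selection at the core of a single SCM (RandomSCM aggregates many such models over bootstrap samples) can be sketched as follows; the toy rule set, trade-off parameter p, and function names are ours, not the authors' implementation:

```python
def greedy_scm_conjunction(rules, X, y, p=1.0, max_rules=3):
    """Greedy Set Covering Machine sketch (conjunction case): repeatedly
    add the boolean rule that excludes the most remaining negative
    examples, with penalty p on positives it also excludes."""
    neg = {i for i, label in enumerate(y) if label == 0}
    pos = {i for i, label in enumerate(y) if label == 1}
    model = []
    while neg and len(model) < max_rules:
        def utility(rule):
            covered = sum(1 for i in neg if not rule(X[i]))  # negatives removed
            errors = sum(1 for i in pos if not rule(X[i]))   # positives lost
            return covered - p * errors
        best = max(rules, key=utility)
        if utility(best) <= 0:
            break
        # keep only examples still classified positive by the conjunction
        neg = {i for i in neg if best(X[i])}
        pos = {i for i in pos if best(X[i])}
        model.append(best)
    return model

def predict(model, x):
    """A conjunction predicts positive only if every rule fires."""
    return int(all(rule(x) for rule in model))
```

The sparsity cap (`max_rules`) is what keeps the resulting models small enough to read off candidate biomarkers directly from the selected rules.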

Real-time pathogenicity prediction during sequencing of novel viruses and bacteria
  • Jakub Bartoszewicz, Hasso Plattner Institute, Germany
  • Ulrich Genske, Hasso Plattner Institute, Germany
  • Bernhard Renard, Hasso Plattner Institute, Germany

Short Abstract: Novel pathogens evolve quickly and may emerge rapidly, causing dangerous outbreaks or even global pandemics. Next-generation sequencing is the state-of-the-art in open-view pathogen detection, and one of the few methods available at the earliest stages of an epidemic, even when the biological threat is unknown. Analyzing the samples as the sequencer is running can greatly reduce the turnaround time, but existing tools rely on close matches to lists of known pathogens and perform poorly on novel species. Machine learning approaches can predict if single reads originate from more distant, unknown pathogens, but require relatively long input sequences and processed data from a finished run. We train ResNets to classify raw, incomplete Illumina and Nanopore reads and integrate our models with HiLive2, a real-time Illumina mapper. This approach outperforms alternatives based on machine learning and sequence alignment on simulated and real data, including SARS-CoV-2 sequencing runs. After just 50 Illumina cycles, we observe an 80-fold sensitivity increase compared to real-time mapping. The first 250 bp of Nanopore reads, corresponding to 0.5 s of sequencing time, are enough to yield predictions more accurate than mapping the finished long reads. The approach could also be used for screening synthetic sequences against biosecurity threats.

Representation learning of genomic sequence motifs via information maximization
  • Peter Koo, Cold Spring Harbor Laboratory, United States
  • Nicholas Lee, Cold Spring Harbor Laboratory, United States

Short Abstract: Convolutional neural networks have been applied in supervised learning to a variety of computational genomics problems, taking DNA sequences as inputs and predicting regulatory functions as outputs. Despite their strong performance, these supervised learning approaches have limitations: supervised models commonly learn only the sequence features that immediately help accurate prediction of the regulatory function outputs, ignoring other features present within the input sequences. Additionally, the genomic features learned by supervised models are often very basic sequence features (e.g., GC content).

Here we present Genomic Representations with Information Maximization (GRIM), an unsupervised learning method based on the Infomax principle that enables more comprehensive identification of whole sequence motifs. We demonstrate that GRIM is able to discover motifs known to be present in genomic sequences but which are not detectable using supervised methods. We also demonstrate the efficacy of the representations of genomic sequences learned by GRIM by showing that relatively simple models trained on these representations can approach the performance of more complex, fully supervised models trained on raw genomic sequences. We further demonstrate the utility of GRIM in analyzing several in vivo genomic datasets, illuminating use cases for our method.

Scaled Bernoulli Mixture Model for Clustering of Single-cell ATAC-seq Data
  • Mudassar Iqbal, University of Manchester, United Kingdom
  • Syed Murtuza Baker, University of Manchester, United Kingdom
  • Adam Farooq, Aston University, United Kingdom
  • Magnus Rattray, University of Manchester, United Kingdom

Short Abstract: Single-cell and single-nucleus ATAC-seq methods are increasingly employed for studying chromatin accessibility. Due to technical issues in the sequencing protocols, there may be large differences in sequencing depth across cells. This can strongly impact the downstream clustering analysis and commonly employed approaches can produce clusters defined by these sequencing artefacts rather than by the underlying biology. This warrants the development of a clustering approach for binary ATAC-seq that is capable of dealing with differences in sequencing depth.
We develop a binary mixture model in which the underlying Bernoulli distribution is modified with an additional cell-specific parameter modelling sequencing depth. We develop a bespoke Expectation-Maximisation-based inference method combined with a model-based feature-selection approach and a cluster splitting/merging heuristic to improve performance. Our method robustly identifies clusters and informative open chromatin features, and can automatically detect the number of clusters. We validate our method on synthetic data and apply it to publicly available real single-cell ATAC-seq datasets. We compare against a standard Bernoulli mixture model and a state-of-the-art clustering method, and show that our method is tolerant to variation in sequencing depth and provides biologically meaningful clustering.
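
For illustration, EM inference for a plain Bernoulli mixture, i.e. the baseline model without the cell-specific depth parameter or the splitting/merging heuristic described above, can be sketched as follows (all names are ours):

```python
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=50, seed=0):
    """EM for a k-component Bernoulli mixture over binary accessibility
    profiles X (cells x peaks). Returns mixture weights pi, per-cluster
    open probabilities theta, and responsibilities r."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixture weights
    theta = rng.uniform(0.25, 0.75, (k, d))    # per-cluster open probabilities
    for _ in range(n_iter):
        # E-step: log responsibilities, kept in log space for stability
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and open probabilities
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r
```

Because theta is shared across all cells in a cluster, a low-depth cell with few observed open peaks can be pulled into the wrong cluster; the poster's cell-specific depth parameter is designed to absorb exactly that effect.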

Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities
  • Rohit Singh, Massachusetts Institute of Technology, United States
  • Brian Hie, Massachusetts Institute of Technology, United States
  • Ashwin Narayan, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

Short Abstract: A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

scJoint: transfer learning for data integration of atlas-scale single-cell RNA-seq and ATAC-seq
  • Yingxin Lin, The University of Sydney, Australia
  • Tung-Yu Wu, Stanford University, United States
  • Sheng Wan, National Chiao Tung University, Taiwan
  • Jean Yang, The University of Sydney, Australia
  • Wing Wong, Stanford University, United States
  • Rachel Wang, The University of Sydney, Australia

Short Abstract: Single-cell multi-omics data continues to grow at an unprecedented pace, and effectively integrating different modalities holds the promise of better characterization of cell identities. Although a number of methods have demonstrated promising results in integrating multiple modalities from the same tissue, the complexity and scale of data compositions typically present in cell atlases still pose a significant challenge for existing methods. Here we present scJoint, a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. Using multiple atlas datasets and a biologically varying multi-modal dataset, we demonstrate that scJoint is computationally efficient and consistently achieves significantly higher cell type label accuracy than existing methods while providing meaningful joint visualizations. This suggests scJoint is effective in overcoming the heterogeneity of different modalities towards a more comprehensive understanding of cellular phenotypes.

Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics
  • Anshul Kundaje, Stanford University, United States
  • Eva Prakash, Stanford University, United States
  • Avanti Shrikumar, Stanford University, United States

Short Abstract: Interpretation of neural networks trained on regulatory sequence data is an important problem in computational genomics. Several methods such as in-silico mutagenesis, Grad-CAM, DeepLIFT, and Integrated Gradients have been developed to explain such networks. However, the limitations that arise when applying these methods to genomic data are not completely understood. While simulated datasets with known ground-truth DNA motifs can be used to test whether a given interpretability method can accurately recover the motifs, such simulations do not reflect the complexity of real biological data.

In this work, we propose a systematic pipeline for designing simulated datasets to mirror the complexity of a given experimental dataset. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments. We use these simulated datasets to quantify the performance and identify pitfalls of different interpretation methods based on how well they can recover the ground-truth motifs. We further explore the impact of user-defined settings on the interpretation methods, and find that some commonly-used settings from the computer vision literature are not always a good choice for genomics. Based on our analysis, we suggest some best practices for practitioners interested in applying these model interpretation methods to their own genomic datasets.
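The basic building block of such simulations is planting a known ground-truth motif in a subset of random sequences. The sketch below shows only this planting step; the authors' pipeline additionally matches the statistics of a real experimental dataset, which this toy version does not attempt. The motif `TGACTCA` (an AP-1-like consensus) is chosen here purely for illustration.

```python
import numpy as np

def simulate_motif_dataset(n_seqs=200, seq_len=100, motif="TGACTCA", seed=0):
    """Generate a toy labelled dataset with a known ground-truth motif
    planted in the positive sequences at a random position.

    Returns (sequences, labels), where label 1 means the motif was planted.
    """
    rng = np.random.default_rng(seed)
    bases = np.array(list("ACGT"))
    seqs, labels = [], []
    for i in range(n_seqs):
        seq = list(rng.choice(bases, size=seq_len))   # random background
        label = i % 2                                 # alternate pos/neg
        if label:
            start = rng.integers(0, seq_len - len(motif))
            seq[start:start + len(motif)] = list(motif)
        seqs.append("".join(seq))
        labels.append(label)
    return seqs, labels
```

With the motif positions known, one can score an interpretation method by how much attribution it places inside the planted motif versus the background, which is the evaluation strategy the abstract describes.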

Towards Multimodal Transformers Trained on Biomedical Text and Knowledge Graphs
  • Helena Balabin, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Charles Tapley Hoyt, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • Benjamin M Gyori, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • John Bachman, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • Colin Birkenbihl, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Alpha Tom Kodamullil, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Martin Hofmann-Apitius, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Daniel Domingo-Fernández, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany

Short Abstract: Although Transformer-based Language Models (LMs) can learn contextualized representations of language, they have difficulties in representing factual knowledge. Many proposed solutions are based on integrating relational facts in the form of Knowledge Graph (KG) triples into LMs, which has proven to significantly improve such contextualized representations, specifically in the biomedical domain. However, a major limitation of these approaches is their dependence on entity linking (i.e., the process of aligning text tokens and KG entities). To overcome this impediment, we propose a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (STonKGs), which can operate on unaligned pairs of text sequences and KG triples. STonKGs is a large-scale Transformer-based model trained on several million text-triple pairs from PubMed, assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA). First, we adapt the output of node2vec to represent KG triples as sequential input data. This, in conjunction with text tokens, is used as the input to our model, which then uses multi-modal attention to learn rich interdependencies between text tokens and KG entities. By evaluating STonKGs on various fine-tuning tasks and comparing it to an NLP- and a KG-baseline, we empirically validate the added value of our knowledge integration method.

Transcriptomic deconvolution of neuroendocrine neoplasms predicts clinically relevant characteristics via training on data of healthy origin
  • Raik Otto, Humboldt-Universität zu Berlin, Germany
  • Katharina Detjen, Charité Berlin, Germany
  • Pamela Riemer, Laboratory of Molecular Tumorpathology, Institute of Pathology, University Medicine Charite, Germany
  • Carsten Groetzinger, Virchow-Klinikum, University Medicine Charite, Germany
  • Guido Rindi, Università Cattolica del Sacro Cuore, Italy
  • Bertram Wiedenmann, Charité Berlin, Germany
  • Christine Sers, Laboratory of Molecular Tumorpathology, Institute of Pathology, University Medicine Charite, Germany
  • Ulf Leser, Institut für Informatik, Humboldt-Universität zu Berlin, Germany

Short Abstract: Comprehensive training of Machine-Learning models is frequently not possible for rare and diverse cancer types such as pancreatic neuroendocrine neoplasms (panNENs). We report on a novel data-augmentation technique which substitutes neoplastic training data with data of healthy origin based on a transcriptomic deconvolution algorithm. The output of the deconvolution is subsequently utilized as training data for Machine-Learning models, which in turn predict clinical characteristics of panNENs. Deconvolution-trained models efficiently predict the neoplastic grading and disease-related patient survival, and can differentiate between the neuroendocrine tumor and carcinoma subtypes, while achieving the same prediction accuracy as a baseline model trained on neoplastic expression data and on panNENs classified by the gold-standard Ki-67 biomarker. The method thus complements the clinical characterization of panNENs and of rare cancer types in general.
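The abstract does not detail the deconvolution algorithm itself, but the generic form of transcriptomic deconvolution — estimating the composition of a bulk profile from reference signatures of healthy origin — can be sketched with non-negative least squares. This is a standard textbook formulation, not the authors' specific method.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk, reference):
    """Estimate cell-type fractions in a bulk expression profile by
    non-negative least squares against reference profiles of healthy
    origin, then normalise the coefficients to fractions.

    bulk      : (genes,) bulk expression vector
    reference : (genes, cell_types) signature matrix
    """
    coef, _ = nnls(reference, bulk)
    total = coef.sum()
    return coef / total if total > 0 else coef
```

In the workflow the abstract describes, outputs of such a deconvolution step (rather than scarce neoplastic samples) would then serve as training data for the downstream predictive models.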

Using topic modeling to detect cellular crosstalk in scRNA-seq
  • Alexandrina Pancheva, University of Glasgow, United Kingdom
  • Helen Wheadon, University of Glasgow, United Kingdom
  • Simon Rogers, University of Glasgow, United Kingdom
  • Thomas Otto, University of Glasgow, United Kingdom

Short Abstract: Cell-cell interactions are vital for numerous biological processes including development, differentiation, and response to inflammation. Currently, most methods for studying interactions at the scRNA-seq level are based on curated databases of ligands and receptors. Whilst useful, such methods are limited by current biological knowledge. Recent advances in single-cell protocols have allowed physically interacting cells to be captured, enabling complementary methods for studying interactions that do not rely on prior information. We introduce a new method, based on Latent Dirichlet Allocation (LDA), for detecting genes whose expression changes as a result of interaction in such datasets. We validate our method on synthetic data before applying our approach to two datasets of physically interacting cells, allowing us to identify genes that change as a result of interaction. For each dataset we produce a ranking of genes that change in subpopulations of the interacting cells. Lastly, we apply our method to a dataset generated by a standard droplet-based protocol not designed to capture interacting cells, and discuss its suitability for analysing interaction. We are able to rank genes that change as a result of interaction without relying on prior clustering and the generation of synthetic reference profiles, as current methods do.
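The LDA step at the heart of such an approach treats each cell as a "document" and each gene as a "word". The sketch below fits topics to a cells-by-genes count matrix with scikit-learn; the poster's downstream ranking of interaction-driven genes is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_expression_topics(counts, n_topics=2, seed=0):
    """Fit LDA topics to a cells-by-genes count matrix, treating cells as
    documents and genes as words.

    Returns (cell_topics, gene_topics): per-cell topic proportions and
    per-topic gene distributions (rows sum to 1).
    """
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    cell_topics = lda.fit_transform(counts)          # (cells, topics)
    gene_topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return cell_topics, gene_topics
```

Genes loading heavily on a topic that is enriched in interacting-cell doublets would be the candidates for interaction-driven expression changes in a workflow of this kind.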

