Posters - Schedules
Poster presentations at GLBIO 2021 will be presented virtually. Authors will pre-record their poster talk (5-7
minutes) and upload it to the virtual conference platform site, along with a PDF of their poster, beginning July 19
and no later than July 23. All registered conference participants will have access to the posters and presentations
through the conference platform, with content remaining available until June 30, 2021. There are Q&A opportunities through a chat
function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.
Information on preparing your poster and poster talk is available at: https://www.iscb.org/glbio2021-general-info/glbio2021-presenter-info#poster
Ideally authors should be available for interactive chat during the times noted below:
Short Abstract: Languages can be classified as tonal or non-tonal. Tonal languages are those in which intonation changes the meaning of words. Although there is no concrete explanation for the development of tonal languages, there is evidence of genetic influence. However, which genes are implicated in this linguistic feature has not been fully elucidated. The aim of this work is to identify candidate genes for tonal language development. Genes associated with “speech” and “hearing” were searched in the Human Phenotype Ontology. We retrieved a total of 245 genes, which served as the input for the network analysis. We used STRING v.11.0 to generate the network, with the following parameters: interaction score 0.4, first and second shell = 100. We excluded the text-mining and neighborhood interaction sources. The next steps were performed in the Cytoscape v.3.8 environment. We used cytoHubba v.0.1 with the MCC algorithm for the hub gene analysis. The gene set analysis was performed with ClueGO v.2.5.7, using the hypergeometric test and retaining only pathways with p<0.05. The initial network had 357 nodes and 4391 edges. The top 20 ranked genes participated in similar pathways, such as GTPase activity, spliceosome, and RNA processing. Gene enrichment yielded 85 pathways: 34.12% in the ClinVar human disease category, 38.82% in KEGG pathways, and 27.06% in Reactome. Several pathways are related to hearing- and speech-related diseases, such as Prader-Willi syndrome, de Lange syndrome, and hereditary hearing loss and deafness. The present analysis gives us clues about important genes related to hearing and speech, and we hypothesize that these genes may also play a role in tonal language development. Moreover, this candidate gene analysis may help us understand the evolutionary role of human languages.
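The hub-gene step above can be illustrated with a minimal sketch: ranking the nodes of a toy interaction network by degree. This is a simplified stand-in (Cytohubba's MCC scores maximal cliques, not plain degree), and the gene names and edges below are hypothetical, for illustration only.

```python
from collections import defaultdict

def degree_ranking(edges):
    """Rank nodes of an undirected interaction network by degree.

    Simplified stand-in for hub scoring: MCC counts maximal cliques,
    whereas this uses plain degree centrality.
    """
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree, key=degree.get, reverse=True)

# Hypothetical genes and edges, for illustration only
edges = [("GNAS", "SNRPB"), ("GNAS", "EFTUD2"), ("GNAS", "POLR2A"),
         ("SNRPB", "EFTUD2"), ("POLR2A", "EFTUD2")]
print(degree_ranking(edges)[:2])
```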
Short Abstract: Diffuse Large B-Cell Lymphoma (DLBCL) is a type of aggressive Non-Hodgkin Lymphoma, a subcategory of lymphoma cancers that originates in the lymph nodes. DLBCL affects the B-lymphocytes, cells of the immune system that create antibodies to fight disease. DLBCL makes up 25% of all new Non-Hodgkin Lymphoma cases, with 18,000 people diagnosed every year. However, only about 2 out of 3 patients can be successfully treated with chemotherapy, so finding more effective ways to limit the proliferation of the cancer is imperative. Under the lens of graph theory, we constructed a Protein-Protein Interaction (PPI) network model of DLBCL arising from Differentially Expressed Genes (DEGs). From this, we can examine which proteins contribute the most to the proliferation of DLBCL. Because there are no existing graph-theoretic models for lymphoma, this analysis provides a unique take on the protein interactions that drive DLBCL proliferation. We compiled data on 4356 DEGs from The Cancer Genome Atlas (TCGA) and 13651 DEGs from the Gene Expression Omnibus (GEO), yielding 1794 common DEGs. We used KEGG and STRING to compile data on the specific interaction networks of these DEGs. We utilized Cytoscape (v 3.8.0) to construct a full PPI network with 1794 proteins and 29087 interactions. Finally, we computed the betweenness and PageRank centrality of each protein to pinpoint the most influential proteins in the system. Cross-comparing both centrality measures and each protein's function yields TP53, RPS27A, HSP90AA1, FN1, and CDK1 as the proteins with the highest impact in our PPI network. We thoroughly investigate these proteins and their direct links to confirm their significance in the system, indicating that these proteins are likely the backbone of DLBCL proliferation. With this data, scientists will be better equipped to inhibit DLBCL by targeting the proteins we identified.
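As an illustration of the centrality step (not the study's pipeline, which ran on 1794 proteins in Cytoscape), PageRank can be computed by power iteration in plain Python. The toy PPI neighborhood below is hypothetical; only the protein names come from the abstract.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an undirected graph given as
    {node: [neighbors]}; returns {node: score}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Each node receives an equal share of each neighbor's rank
        rank = {v: (1 - damping) / n
                   + damping * sum(rank[u] / len(adj[u])
                                   for u in adj if v in adj[u])
                for v in nodes}
    return rank

# Hypothetical PPI neighborhood; edges are illustrative only
adj = {
    "TP53":     ["CDK1", "FN1", "RPS27A", "HSP90AA1"],
    "CDK1":     ["TP53", "FN1"],
    "FN1":      ["TP53", "CDK1"],
    "RPS27A":   ["TP53"],
    "HSP90AA1": ["TP53"],
}
scores = pagerank(adj)
print(max(scores, key=scores.get))
```

As expected, the most connected node accumulates the highest score, which is the intuition behind using centrality to pinpoint influential proteins.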
Short Abstract: A major concern with any virus such as COVID-19 is how fast it initially spreads. In early stages, the number of infected individuals grows exponentially if the basic reproduction number (the number of secondary infections caused by a single infected individual introduced into a fully susceptible population for the duration of the disease) is greater than one. Conversely, the pandemic is considered under control when the basic reproduction number is below one. Both these regimes emerge from a simple first-order approximation of the evolution curve of the pandemic around its initial time. For example, one can Taylor-approximate to first order the solution of the SIR (Susceptible-Infected-Removed) compartmental model, but similar insight can be drawn from many other models as well. If the basic reproduction number is approximately one, the growth might not be exactly exponential anymore, and the approximation just mentioned is inconclusive unless carried to a higher order. In this latter scenario, the higher-order terms become fundamental and will be the main contributors to the evolution of the disease.
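For intuition, the near-exponential early regime described above can be reproduced with a simple Euler integration of the SIR model. The parameters below (beta = 0.5, gamma = 0.2, so R0 = 2.5) are illustrative choices, not estimates from this work.

```python
def sir_euler(beta, gamma, i0=1e-4, days=30, dt=0.01):
    """Euler-integrate the SIR model; returns the infected fraction
    at the end of each day."""
    s, i = 1.0 - i0, i0
    traj = []
    for _ in range(days):
        for _ in range(int(1 / dt)):
            ds = -beta * s * i
            di = beta * s * i - gamma * i
            s += ds * dt
            i += di * dt
        traj.append(i)
    return traj

# Illustrative parameters: R0 = beta/gamma = 2.5 > 1
traj = sir_euler(beta=0.5, gamma=0.2)
# While s is still close to 1, the day-over-day growth factor is nearly
# constant: the exponential regime of the first-order approximation
ratios = [traj[d + 1] / traj[d] for d in range(5)]
print(ratios)
```

Swapping the parameters so that beta < gamma (R0 < 1) makes the trajectory decay instead, the "under control" regime.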
The basic reproduction number depends on the contact rate and so space-time interactions must be taken into account to have a more precise description of the diffusion of the disease. Note that lockdowns, social distancing, and air travel restriction were the most common non-pharmaceutical interventions implemented by governments to try to counteract the coronavirus. Therefore, any accurate model describing the dynamics of COVID-19 must take into account a spatial component. Of course, how COVID-19 spreads is important for multiple public health reasons, such as avoiding strain on the healthcare system. Having a clear picture of the spatial and temporal characteristics of the transmission of COVID-19 during the first stages of the pandemic can also potentially help predict the dynamics of early stages of subsequent waves or novel mutations.
In this work, we are interested in understanding whether higher-order terms play a role in the spatio-temporal diffusion of COVID-19. We model the number of cases and deaths on a logarithmic scale as a linear combination of functions, each a product of two functions (one of time only and one of space only) that may be random, deterministic, or nonlinear, capturing increasing model complexity.
For the sake of this presentation, we concentrate on the case study of Ohio in the first three weeks following the first COVID-19 case in each county. In our analysis, we rescale the time variable of each county relative to the first case of COVID-19 in Ohio in order to account for the time lag between different counties. We model the evolution with a Bayesian hierarchical Besag-York-Mollié model, modified to allow for nonlinear terms.
Our analysis suggests that the linear model is preferable in most of the regimes on which we tested our models. This has positive implications for decision makers, as nonlinear terms are a source of irregularity with respect to initial data. Irregularity can cause uncertainty even in the short-term evolution of a pandemic, which in turn challenges public health policies. Although higher-order terms still play an important role in understanding the dynamics of COVID-19 when the basic reproduction number is around one, we did not find strong evidence that nonlinearities played an important role in the initial phases of the pandemic in our setting (Ohio had a basic reproduction number above one).
Our study is ongoing, and we are adding more complex models and more general situations (e.g., all US counties) to our analysis, all of which will be reported in our talk.
Short Abstract: Around the world, coral reefs serve as an invaluable resource for more than 500 million people, acting as sources of food, income, and coastal protection. However, despite their overwhelming importance, the overharvesting of corals through coral mining has contributed greatly to their depletion, with over 400,000 pieces of live coral mined and exported each year from the U.S. alone. With so little enforcement currently limiting the practice, threatened and endangered coral species face significant danger of extinction.
The purpose of this research was to train image-classification machine learning models (specifically convolutional neural networks, or CNNs) to identify threatened coral species. A total of 4 CNNs were constructed using transfer learning (a process in which existing complex machine learning models are repurposed for a different task), with each one trained on a different dataset consisting of either curated or crowdsourced images. The three curated datasets (EILAT, RSMAS, MLC2008) consisted of coral images based on structure, texture, or colonies, respectively. This variety gave their corresponding CNNs distinct coral classification abilities. When tested, each model achieved high accuracy when classifying coral into genera or species, with the EILAT model reaching 91%, the RSMAS model reaching 96%, and the MLC2008 model reaching 92%. In addition, a new dataset consisting wholly of crowdsourced images (1,372 in total) of 7 threatened and endangered coral species was compiled. The CNN trained on this crowdsourced dataset achieved an accuracy of 72%; while lower than the previous models, we expect this number to increase as more images are crowdsourced and the CNN learns to account for the larger variation inherent in the dataset.
Ultimately, the 4 models trained in this research project demonstrate the capability of machine learning to identify corals based on structure, texture, and colony features. Future work includes validating the models against collections of new images, toward the ultimate goal of facilitating reductions in coral mining and increasing coral preservation awareness by giving a larger audience the means to identify threatened and endangered coral species.
Short Abstract: Candida albicans is an opportunistic fungal pathogen that can cause deadly infections in humans, especially in immunocompromised individuals. Understanding which genes are essential for growth of this organism would provide opportunities for developing more effective therapeutics. Unlike in the model yeast Saccharomyces cerevisiae, mutant construction is considerably more laborious in C. albicans. To prioritize efforts for mutant construction and identification of essential genes, we built a random forest-based machine learning model, leveraging a set of 2,327 previously constructed C. albicans GRACE (gene replacement and conditional expression) strains as a basis for training. We identified several relevant features, including average gene expression level and variance across a large compendium of conditions, degree of co-expression, codon adaptation index (CAI), the number of SNPs per nucleotide for each gene across a set of sequenced C. albicans strains, the presence of an essential ortholog in S. cerevisiae, and the presence of a duplicated set of paralogs in S. cerevisiae that exhibit a synthetic sick/lethal genetic interaction. We additionally incorporated six features from a recent transposon mutagenesis (TnSeq) study in a stable haploid genetic background.
Through cross-validation analysis of our random forest model, we estimated an AUC of 0.92 and an average precision of 0.77. Given these strong results, we used this approach to prioritize the construction of an additional set of >800 strains. We discovered essential genes at a rate of ~64% among these new predictions, relative to an expected background rate of essentiality of ~20%. Our machine learning approach is an effective strategy for efficient discovery of essential genes, and a similar approach may also be useful in other species.
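The AUC used to evaluate such a classifier can be computed directly from predicted scores via the rank-sum formulation; below is a minimal, framework-free sketch with toy labels and scores (not the study's data or model).

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive outranks a randomly chosen negative,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: 1 = essential gene, 0 = non-essential
labels = [1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
print(roc_auc(labels, scores))  # 11 of 12 positive/negative pairs ranked correctly
```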
Short Abstract: Glycosphingolipids such as α- and β-glucosylceramides (GlcCers) and α- and β-galactosylceramides (GalCers) are stereoisomers differentially synthesized by gut bacteria and their mammalian hosts in response to environmental insult. Thus, lipidomic assessment of α- and β-GlcCers and α- and β-GalCers is crucial for biomarker discovery and pathomechanistic studies. However, simultaneous quantification of these stereoisomeric lipids is difficult due to their virtually identical structures. Differential mobility spectrometry (DMS), used as an orthogonal separation to high-performance liquid chromatography in electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS), can discriminate stereoisomeric lipids through gas-phase interactions. The development of an LC-ESI-DMS-MS/MS method for lipidomic analysis demands intensive manual optimization of DMS parameters and relies exclusively on the availability of synthetic lipid standards; where synthetic standards do not exist, method development is not possible. We describe here a supervised learning approach that predicts instrument responses to different lipid stereoisomers at various machine parameter settings. This in silico optimization replaces manual, empirical parameter optimization and eliminates the dependency on available synthetic lipid standards. Our approach promises to greatly accelerate the development of assays for the detection of lipid stereoisomers in biological samples.
Short Abstract: In the past few decades, many statistical methods have been developed to identify rare variants associated with complex traits or diseases. Recently, rare variant association studies with multiple phenotypes have drawn much attention because association signals can be boosted when rare variants are related to more than one phenotype. Most existing statistical methods to identify rare variants associated with multiple phenotypes are based on a group test, where a gene or a genetic region is tested one at a time. However, these methods are not designed to locate individual rare variants within a gene or a genetic region. In this work, we propose a unified standardized selection probability to locate rare variants associated with highly correlated multiple phenotypes. The proposed method combines the weighted selection probability of elastic-net with the selection probability of mgaussian according to z-scores. We then select top-ranked rare variants that have relatively large z-scores. In our simulation study, we demonstrate that the proposed method outperforms existing selection methods in terms of true positive rate when phenotype outcomes are highly correlated with each other. We also applied the proposed method to our wild bean data set, which consists of 10,783 rare variants and 13 correlated amino acids.
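The core idea of a selection probability (the fraction of random subsamples in which a variant is selected) can be sketched as follows. This is a simplified stand-in: it ranks variants by absolute correlation with a single phenotype instead of fitting elastic-net or mgaussian models, and the data are simulated.

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def selection_probability(X, y, k=2, n_subsamples=200, frac=0.5, seed=0):
    """Fraction of random subsamples in which each variant ranks in the
    top k by |correlation| with the phenotype (a stand-in for the
    model-based selection probabilities used in the study)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), int(frac * n))
        scores = [abs(corr([X[i][j] for i in idx], [y[i] for i in idx]))
                  for j in range(p)]
        for j in sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]:
            counts[j] += 1
    return [c / n_subsamples for c in counts]

# Simulated genotypes: variant 0 drives the phenotype, variants 1-3 are noise
rng = random.Random(1)
X = [[rng.random() for _ in range(4)] for _ in range(60)]
y = [row[0] * 2 + rng.gauss(0, 0.1) for row in X]
probs = selection_probability(X, y)
print(probs)
```

The causal variant is selected in essentially every subsample, while the noise variants split the remaining top-k slots, which is the signal a standardized selection probability exploits.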
Short Abstract: An essential survival skill of bacteria is quick adaptability to environmental shifts through gene expression regulation [1]. In Escherichia coli, σ factors are responsible for simultaneous, large-scale regulation of many genes and consequent strong phenotypic alterations, such as growth phase adaptation [2], [3]. Here we study the specificity of σ factors.
We use RNA-seq and flow-cytometry to characterize and compare, as a function of the promoter sequence, the changes in the single-cell statistics of protein numbers of genes whose promoters have a preference for σ70, for σ38, or for both σ70 and σ38, when cells shift from the exponential to the stationary growth phase.
We started by selecting genes known to have a preference for both σ70 and σ38 and found a statistically significant association with biological processes related to respiration, suggesting that these genes are selectively advantageous. Next, we performed flow-cytometry to characterize their single-cell statistics of protein numbers, prior to and after a shift in growth phase. For comparison, we obtained the same data from genes with a preference for only one of the two σ factors. As a control, RNA-seq was performed in the two growth phases.
From the data, we found a significant correlation between promoter sequences and fold changes due to the growth phase shift. Finally, we propose a generalized analytical model of the dynamical changes of any gene as cells shift from the exponential to the stationary phase.
Overall, we suggest that promoter preference for σ factors is a ‘near-continuous’, rather than categorical, sequence-dependent feature. We expect our model to be useful in the future design of genetic circuits robust to shifts in cell growth phase.
[1] R. Phillips, N. M. Belliveau, G. Chure, H. G. Garcia, M. Razo-Mejia, and C. Scholes, “Figure 1 Theory Meets Figure 2 Experiments in the Study of Gene Expression,” Annu. Rev. Biophys., vol. 48, no. 1, pp. 121–163, May 2019, doi: 10.1146/annurev-biophys-052118-115525.
[2] V. K. Kandavalli, H. Tran, and A. S. Ribeiro, “Effects of σ factor competition are promoter initiation kinetics dependent,” Biochim. Biophys. Acta - Gene Regul. Mech., vol. 1859, no. 10, pp. 1281–1288, 2016, doi: 10.1016/j.bbagrm.2016.07.011.
[3] B. K. Cho, D. Kim, E. M. Knight, K. Zengler, and B. O. Palsson, “Genome-scale reconstruction of the sigma factor network in Escherichia coli: Topology and functional states,” BMC Biol., vol. 12, pp. 1–11, 2014, doi: 10.1186/1741-7007-12-4.
Short Abstract: The aryl hydrocarbon receptor (AhR) is an inducible transcription factor with various exogenous ligands, including the potent environmental contaminant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and other similar chemicals (dioxins). Dioxin-mediated toxicity is achieved through AhR activation and binding to DNA, chiefly at the canonical 5'-GCGTG-3' binding motif known as the Dioxin Response Element (DRE). Importantly, in vivo AhR binding in human tissues is highly dynamic and tissue-specific. Approximately 50 percent of experimentally verified binding sites do not contain a DRE, and a great number of otherwise accessible DREs are not bound by AhR. Identifying the drivers and determinants of AhR binding, especially those underlying tissue specificity, is crucial for understanding downstream gene regulatory effects and potential adverse health outcomes such as liver toxicity and immune suppression. Our goal was to develop interpretable predictive machine learning models of DRE-centered AhR binding with the aim of accurate cross-tissue prediction. To this end, we applied XGBoost, a supervised machine learning architecture, to predict bound DREs by investigating non-linear effects of combining features representing 1) the DNA sequence immediately flanking the DRE, 2) local chromatin context, such as DNase-seq, histone mark, and transcription factor (TF) ChIP-seq signals, and 3) DRE proximity to gene promoters. In this study, we predicted binding of exogenously induced AhR in MCF-7 human breast cancer cells (45 minutes or 24 hours of TCDD treatment) and in human primary hepatocytes (24 hours of TCDD treatment), as well as of non-induced AhR in HepG2 hepatocellular carcinoma cells. Our results demonstrate highly accurate and robust models of within-tissue binding, verified for generalization through 5-fold cross-validation.
Our results further identify several specific transcription factors and histone marks as highly predictive of AhR binding within and across the MCF-7 and HepG2 cell lines. Additionally, we show that tissue-specific AhR binding appears to be driven by a complex interplay of DNA flanking sequence and local chromatin context features, as well as other tissue-specific mechanisms.
Short Abstract: BACKGROUND. Exercise has known metabolic benefits for normal-weight and overweight populations (Booth, Roberts, and Laye 2012; Kelly, Kelly, and Kelly 2020). While most studies of overweight groups have focused on sedentary individuals or prolonged exercise interventions, metabolic changes in trained overweight runners (OWT) remain understudied.
METHODS. Trained runners (n=25) were divided according to BMI into OWT and normal-weight (NWT) groups. Blood samples were collected prior to and immediately following a 90-minute treadmill run at 60% VO2max for untargeted metabolomics. A liquid chromatography, high-resolution mass spectrometry platform was used in positive and negative ionization modes to maximize metabolome coverage. The platform revealed 680 altered metabolites in total, forming four unique metabolic profiles: 1) NWT: pre- versus post-exercise, 2) OWT: pre- versus post-exercise, 3) Pre: NWT versus OWT, and 4) Post: NWT versus OWT. Principal component analysis (PCA) was used to identify metabolites that contributed most to variation within each group.
RESULTS. Hydrophilic interaction chromatography (HILIC) and reverse-phase chromatography (RP) methods were used to measure small molecules in pre-exercise and post-exercise serum of OWT and NWT. Exercise revealed latent differences between the OWT and NWT groups. The adenine catabolism intermediates inosine and hypoxanthine contributed most to variation due to BMI in pre- and post-exercise profiles. The NWT and OWT groups showed unique metabolic profiles due to exercise, with 152 overlapping putative metabolites. MS/MS fragmentation analysis and PCA identified free fatty acids, acylcarnitines, the ketone body β-hydroxybutyrate (βOHB), and the novel lipid class fatty acid esters of hydroxy fatty acids (FAHFAs) among the highest contributors to variation due to exercise in both NWT and OWT. Validation of FAHFAs on our platform confirmed that endogenous signals are linear and reproducible using the HILIC method. An RP method was used to observe changes in the relative abundance of PAHSA and OAHSA regioisomers. A subset of study participant samples was used to measure cytokine abundance pre- and post-exercise. Significant increases in IL-6, MCP-1, and IL-10 were observed, independent of BMI. PCA of NWT and OWT metabolic profiles with cytokine abundance revealed a greater association between IL-6, MCP-1, and the metabolic profile of the OWT group.
CONCLUSIONS. This study is, to our knowledge, the first to identify increasing FAHFA abundance with acute running in trained groups and to do so using a HILIC LC-MS/MS platform. FAHFAs have been identified as anti-inflammatory, anti-diabetic lipids synthesized by adipose tissue (Yore et al. 2014). Research indicates species-specific functions (Brejchova et al. 2020; Kolar et al. 2019; Dongoran et al. 2020). The OWT group is a subset of the metabolically healthy obese population, which has a greater risk of developing cardiometabolic pathologies (Smith, Mittendorfer, and Klein 2019; Bluher 2010). The contribution of FAHFAs to acute exercise metabolism may shed light on mechanisms of BMI-related exercise adaptations.
Short Abstract: Background: Predicting non-coding RNA structures is crucial to understanding their mechanisms of action. Comparative approaches have contributed significantly to improving the prediction of conserved RNA structures of homologous RNA sequences. Computational methods that rely on comparative approaches mainly exploit multiple sequence and/or structure alignments to improve the accuracy of the prediction of conserved RNA structures. The align-and-fold methods, which simultaneously compute an optimal sequence alignment and a conserved secondary structure, generally perform better than the align-then-fold and fold-then-align strategies, which first solve one of the problems and then use the solution as a proxy for solving the second. The fourth strategy, named align-free-fold, consists of predicting the conserved RNA structure without relying on an alignment step. The align-free-fold strategy has the advantage of being generally faster than alignment-based strategies.
Method: We have developed aliFreeFoldMulti, an extension of the aliFreeFold algorithm, which predicts a representative secondary structure for a set of homologous RNAs by using a vectorial representation of their samples of sub-optimal structures and a machine learning approach. aliFreeFoldMulti improves on aliFreeFold by computing the conserved structure for each sequence of the input family. Four strategies were developed in aliFreeFoldMulti to infer the secondary structures of all homologous sequences of a family.
Results: To assess the accuracy and efficiency of aliFreeFoldMulti's predictions, we compared it to a selection of leading RNA-structure prediction methods. Five methods representative of the different RNA folding approaches were selected for the comparison. The results show that aliFreeFoldMulti provides the best balance of prediction accuracy and time efficiency: it achieves the lowest computing times and the highest maximum accuracy scores. We present aliFreeFoldMulti as an illustration of the potential of alignment-free approaches to provide fast and accurate RNA-structure prediction methods.
 M-A. Bossanyi, V. Carpentier, J-P. S. Glouzon, A. Ouangraoua, Y. Anselmetti. (2020). aliFreeFoldMulti: alignment-free method to predict secondary structures of multiple RNA homologs. NAR Genomics and Bioinformatics, 2(4), lqaa086.
Short Abstract: Single-cell RNA-sequencing allows us to measure gene expression levels in thousands of individual cells from a heterogeneous tissue sample simultaneously. After sequencing, cells with similar expression profiles can be clustered together, representing groups of distinct cell types. While clustering the cells is relatively straightforward, deciding which cell type each cluster represents is challenging. Researchers often look for enriched expression of known cell-type marker genes within a cell-type cluster in order to make these assignments, but curating lists of marker genes from the scientific literature is time-consuming and can lead to inconsistent marker gene lists from different research groups, hindering reproducibility. We hypothesize that natural language processing (NLP) can be used to identify useful markers for thousands of cell types in an unbiased manner.
To test this hypothesis, we leveraged millions of PubMed abstracts to train a word2vec model to generate numerical vector representations of ~15k ENSEMBL genes and each of the thousands of cell types described in the Cell Ontology. We then used cosine similarity to quantify the relationships between gene vectors and cell-type vectors, giving us a score for each gene/cell-type pair. We predicted that genes with high similarity to one cell type, but not to other cell types, would make useful cell-type-specific marker genes, so we took advantage of the hierarchical relationships between thousands of cell types recorded in the Cell Ontology and normalized the cosine similarity scores among groups of related cell types. Importantly, we grouped the cell types at three different resolutions, normalizing across all cell types (lowest resolution), “sibling” cell types with the same parent term (highest resolution), or “extended family” cell types separated by a discrete number of nodes in the ontology (medium resolution). We reasoned that marker genes derived from high-resolution normalization would better discriminate closely related cell types present in a typical single-cell RNA-seq experiment.
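The scoring scheme described above (cosine similarity, then normalization within a group of related cell types) can be sketched in plain Python. The vectors below are tiny hypothetical embeddings, not actual word2vec output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def normalize_within_group(scores):
    """Z-score a gene's similarity scores across a group of related cell
    types, so the gene stands out only where it is unusually similar."""
    mean = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    return [(s - mean) / sd for s in scores]

# Hypothetical 3-dimensional embeddings (real word2vec vectors are larger)
gene = [0.9, 0.1, 0.2]
t_cell = [0.8, 0.2, 0.1]
b_cell = [0.1, 0.9, 0.3]
raw = [cosine(gene, t_cell), cosine(gene, b_cell)]
print(normalize_within_group(raw))
```

After normalization, the gene's score is positive only for the cell type it resembles more than its siblings, which is the behavior the resolution-dependent normalization exploits.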
We quantified the ability of the raw cosine similarity and normalized scores to predict the correct cell-type-specific marker genes using hand-curated markers for 120 cell types in the CellMarker database as a gold standard. We found no significant difference between auPRCs calculated with the raw cosine similarity and auPRCs calculated using normalized scores at any resolution. However, because these marker genes will be used to discriminate between different cell types, we also calculated auPRC for off-target cell types, i.e. using scores from cell-type A to predict marker genes for cell-type B, reasoning that effective marker gene scores would have higher auPRC when predicting markers for the correct cell type than when predicting markers for an off-target cell type. The median difference between on- and off-target auPRCs was significantly higher for normalized scores at all three resolutions than raw cosine similarity, confirming our hypothesis that normalizing cosine similarity across cell types makes marker gene scores more cell-type-specific.
Next, we assessed whether NLP-derived marker genes could be used to accurately predict cell-type labels given transcriptional profiles of individual cells. We used SCINA, an expectation-maximization algorithm that assigns cell-type labels to single-cell RNA-seq data given known cell-type markers, to predict peripheral blood mononuclear cell (PBMC) labels. Using the top-ranked markers from the raw and normalized similarity scores, we found that the medium-resolution, “extended family” normalization most accurately predicted the correct labels. Importantly, both the raw and “extended family” marker genes outperformed hand-curated marker genes from the CellMarker database in this task. Thus, this work provides a proof of principle that NLP approaches can be used to create unbiased lists of cell-type-specific marker genes useful for annotating single-cell RNA-seq data.
Short Abstract: We know that homologous proteins share similar sequence, structure, and function, and several measures have been developed to evaluate these similarities. However, we also know that in their native state, proteins are highly flexible and exist not in just one structure but in a multitude of conformations forming a set called an ensemble. Although knowledge of the ensemble is necessary to fully understand protein function, the relationship between evolutionary distance and similarity of conformational ensembles of homologous proteins has not been systematically evaluated. To compare conformational ensembles, we first need a metric for the differences between pairs or groups of conformations. In this study, we developed such a metric by calculating the difference distance map (DDM) between two conformations and then comparing pairs of DDMs using the correlations between them. This allows us not only to recognize but also to quantify the similarity of the conformational differences between two pairs of conformations. We used this approach to compare the differences between pairs of alternative conformations of homologous proteins in the Protein Data Bank (PDB). We first identified clusters of similar conformations from the multiple coordinate sets of individual proteins in the PDB and then compared DDMs between pairs of representative conformations of different, but homologous, proteins. We show that, as expected, homologous proteins display similar conformational changes, and that this similarity correlates with their sequence identity. This result can be used to improve the relevance of homology modeling in analyzing protein function, where we can model not only individual structures but also functionally relevant structural changes of the targets.
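A minimal sketch of the DDM idea, using a toy three-atom "protein" in 2D (the study works with full PDB coordinates and correlates DDMs from different homologs):

```python
import math

def distance_map(coords):
    """All pairwise atom distances for one conformation."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

def ddm(conf_a, conf_b):
    """Difference distance map between two conformations of one protein."""
    da, db = distance_map(conf_a), distance_map(conf_b)
    n = len(conf_a)
    return [[da[i][j] - db[i][j] for j in range(n)] for i in range(n)]

def pearson(x, y):
    """Pearson correlation between two flattened DDMs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy 3-atom "protein": open vs. closed conformations (hypothetical coords)
open_conf = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
closed_conf = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.8)]
m = ddm(open_conf, closed_conf)
flat = [v for row in m for v in row]
print(pearson(flat, flat))  # identical DDMs correlate perfectly
```

In the actual comparison, the two flattened DDMs would come from different homologous proteins; a high correlation indicates that they undergo similar conformational changes.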
Short Abstract: Querying new information from knowledge sources in general, and published literature in particular, aims to provide precise and quick answers to questions raised about a system under study. However, due to the vast amount and the inconsistency of the published literature, extracting useful information is becoming challenging. In biology in particular, the amount of data is constantly growing, further augmenting the issues of data inconsistency and fragmentation. Moreover, with machine collection and extraction of information, automated and accurate query-answering techniques are sorely needed; such techniques will impact the efficiency, breadth, depth, and accuracy of the extracted information. In this work, we present ACCORDION (Automated Clustering Conditional On Relating Data of Interactions tO a Network), a novel methodology and tool to enable efficient answering of biological questions by assembling new, or expanding existing, models using published literature. Our approach integrates clustering with simulation and formal analysis to allow for an automated iterative process of assembling, testing, and selecting the most relevant models. Specifically, clustering identifies connected groups of newly extracted network elements from the literature, while simulation and formal analysis select those clusters of elements that provide the best model extensions for satisfying pre-defined system properties.
The inputs for ACCORDION are a baseline model (few seed components and their interactions) and a machine reading output (from a machine reading engine). The first step is creating the machine reading output, which includes new event information from literature (events represent interactions between biochemical entities, such as increase or decrease of amount or activity), followed by filtering, scoring and classifying these events, all of which are fully automated. Once the machine reading output is created, the three main steps within ACCORDION are performed: (1) fully automated clustering of new events, (2) fully automated assembly of the clustered event data into models, and (3) fully automated selection of the most suitable and useful events. We use the Markov Clustering algorithm (MCL) [2], an unsupervised graph clustering algorithm, to group events obtained from literature by machine reading. Once we generate clusters, we find return paths that start in the baseline model, go through one or more clusters, and end in the baseline model. The baseline model and the clusters on such a return path form a candidate model. Finally, we generate an executable model (in which discrete variables represent states of model elements, and each element can have a state transition function) for each candidate model, and use the DiSH simulator [3] to obtain dynamic traces of the baseline model and the candidate models. We also test all models on a set of predefined properties, using a statistical model checking approach [4]. The candidate model that satisfies most or all of the properties is selected as the final extended model.
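The return-path idea can be illustrated with a small sketch: treating extracted events as directed edges, keep only the clusters that lie on a walk leaving and re-entering the baseline model. This is a simplified breadth-first heuristic over hypothetical inputs, not ACCORDION's actual implementation.

```python
from collections import defaultdict, deque

def return_path_clusters(edges, baseline, cluster_of):
    """Find clusters lying on a path that starts and ends in the baseline
    model. `edges` are directed (src, dst) interactions, `baseline` is the
    set of baseline elements, and `cluster_of` maps each new element to its
    cluster id (all inputs are hypothetical simplifications)."""
    adj = defaultdict(list)
    for s, t in edges:
        adj[s].append(t)
    selected = set()
    for start in baseline:
        # BFS from each baseline node; record clusters visited on any walk
        # that re-enters the baseline model (each new node expanded once).
        seen = set()
        queue = deque([(start, frozenset())])
        while queue:
            node, clusters = queue.popleft()
            for nxt in adj[node]:
                if nxt in baseline:
                    if clusters:  # non-trivial return path found
                        selected |= clusters
                elif nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, clusters | {cluster_of[nxt]}))
    return selected
```

The clusters returned, together with the baseline model, would form candidate models for downstream simulation and model checking.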
To evaluate our approach, we compare the outcomes of ACCORDION with three previously published manually created models, namely a naïve T cell differentiation model, a T cell large granular lymphocyte leukemia model and a pancreatic cancer cell model. Experimental results reveal considerable improvements of our approach over other related methods. Moreover, using ACCORDION, we are able to efficiently find the best set of extensions to reproduce the manually created models. Besides demonstrating automated reconstruction of a model that was previously built manually, ACCORDION can assemble multiple models that satisfy desired system properties. As such, it replaces a large number of tedious or even impractical manual experiments and guides alternative hypotheses and interventions in biological systems.
[1] Y. Ahmed, et al., arXiv, 2020.
[2] S. van Dongen, PhD thesis, 2000.
[3] K. Sayed, et al., Winter Simul. Conf., 2018.
[4] S. Kumar-Jha, et al., CMSB, 2005.
Short Abstract: Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a discriminative, supervised machine learning-based tool for annotating cells against the graph-structured Cell Ontology. The Cell Ontology provides a comprehensive hierarchy of animal cell types encoded as a directed acyclic graph (DAG). Framing the cell type classification task as hierarchical classification against the Cell Ontology offers a number of advantages over flat classification. First, this DAG provides a rich source of prior knowledge for the cell type classification task that remains unused in flat classification. In addition, if the algorithm is uncertain about which specific cell type the cell may be, the use of a hierarchy allows the algorithm to place a cell internally within the graph rather than at a leaf node. Thus, for cells whose specific cell types are absent from the training set, a classifier that uses a hierarchy is capable of providing more informative output than simply claiming that the cell is “uncertain.”
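As an illustration of why hierarchical output is more informative, the sketch below places a cell at the deepest ontology term whose score, and all of whose ancestors' scores, clear a threshold. It assumes a tree-shaped `parents` map for brevity (the Cell Ontology is a DAG) and is not CellO's actual decision rule.

```python
def place_in_ontology(probs, parents, threshold=0.5):
    """Place a cell at the most specific ontology term whose score and all
    ancestor scores pass the threshold; when leaf predictions are
    uncertain, this naturally falls back to an internal term.
    `parents` maps each term to its parent (roots are absent)."""
    def ancestors_ok(term):
        p = parents.get(term)
        while p is not None:
            if probs.get(p, 0.0) < threshold:
                return False
            p = parents.get(p)
        return True

    def depth(term):
        d = 0
        p = parents.get(term)
        while p is not None:
            d += 1
            p = parents.get(p)
        return d

    candidates = [t for t, s in probs.items()
                  if s >= threshold and ancestors_ok(t)]
    return max(candidates, key=depth, default=None)
```

With an uncertain leaf score, the cell lands on the internal "lymphocyte" term instead of being labeled "uncertain".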
CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. By pre-training the classifier on a comprehensive training set, CellO arrives ready-to-run on diverse tissue and cell types. More specifically, CellO’s training set comprises 4,293 bulk RNA-seq samples from 264 studies. These samples are labeled with 310 cell type terms, of which 113 are the most specific cell types in our data set (i.e., no sample in our data was labeled with a descendant cell type term in the Cell Ontology’s DAG). These cell types are diverse, spanning multiple stages of development and differentiation. To the best of our knowledge, this dataset is the largest and most diverse set of bulk RNA-seq samples derived from only primary cells.
Finally, CellO makes extensive use of linear models, which are particularly amenable to interpretation. To enable their interpretation, we present a web-based tool, the CellO Viewer, for exploring the cell type expression signals uncovered by the models (uwgraphics.github.io/CellOViewer/). Specifically, the CellO Viewer enables the exploration and comparison of the coefficients within each cell type’s linear models. The tool supports two modes of operation: a cell-centric mode and a gene-centric mode. In the cell-centric mode the user can select cell types via a graphical display of the Cell Ontology in order to view and compare the most important genes for distinguishing those cell types. In the gene-centric view, the user can select genes and explore which cell types these genes are most important for distinguishing from the remaining cell types.
Bernstein, M.N., Ma, J., Gleicher, M., Dewey, C.N. (2021). CellO: Comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience, 24(1), 101913.
Short Abstract: Small nucleolar RNAs (snoRNAs) are a conserved family of non-coding RNAs involved in ribosome biogenesis. Most snoRNAs are embedded within introns of host genes, suggesting a joint regulation of their expression. SnoRNAs guide modifications on the target RNA to which they bind, their canonical targets being ribosomal RNA. However, recent reports suggest a broader range of snoRNA targets (e.g. mRNAs, tRNAs, etc.) and functions (regulation of splicing, polyadenylation, etc.). To further characterize snoRNA abundance patterns and their underlying determinants, we thus sequenced RNA from seven healthy human tissues (breast, ovary, prostate, testis, liver, brain and skeletal muscle) using a low structure bias RNA-Seq approach (TGIRT-Seq). We find that expressed snoRNAs can be categorized into two abundance classes that greatly differ in their embedding preferences, function and conservation level: 390 snoRNAs are uniformly expressed across tissues whereas 85 snoRNAs are enriched in brain or reproductive tissues. Strikingly, we observe that most uniformly expressed snoRNAs do not correlate with the expression of their host gene, whereas conversely, tissue-enriched snoRNAs are tightly coupled to their host gene expression. We uncover that the host gene function and architecture play a central role in the regulation of snoRNA abundance. We demonstrate that the presence of a dual-initiation promoter within a host gene facilitates the uncoupling of snoRNA and host gene expression through the recruitment of the nonsense-mediated decay machinery, which degrades the host gene transcript but not the snoRNA.
Furthermore, we find that a host gene's function is tightly linked to how its abundance correlates with that of its embedded snoRNA: host genes coding for ribosomal proteins are highly correlated with the abundance of their embedded snoRNA, whereas host genes involved in the regulation of RNA splicing, processing and binding show strong anticorrelation with their embedded snoRNA abundance. Altogether, our results indicate that snoRNAs are not a mere group of ubiquitous housekeeping genes, but also include highly regulated and specialized RNAs, thereby meeting the different functional needs of the human tissues.
Short Abstract: Background: Single-cell RNA-seq (scRNA-seq) enables the profiling of genome-wide gene expression at the single-cell level and in so doing facilitates insight into cellular heterogeneity within a tissue. This is especially important in cancer, where tumor and tumor microenvironment heterogeneity directly impact development, maintenance, and progression of disease. While publicly available scRNA-seq cancer data sets offer an unprecedented opportunity to better understand the mechanisms underlying tumor progression, metastasis, drug resistance, and immune evasion, much of the available information has been underutilized, in part, due to the lack of tools available for aggregating and analyzing these data. Furthermore, while a few web-based tools for analyzing scRNA-seq data are available, they are not designed specifically for cancer research or do not easily enable exploration of existing public data sets.
Results: We present CHARacterizing Tumor Subpopulations (CHARTS), a web application for exploring publicly available scRNA-seq cancer data sets in the NCBI’s Gene Expression Omnibus. More specifically, CHARTS enables the exploration of individual gene expression, cell type, malignancy-status, differentially expressed genes, and gene set enrichment results in subpopulations of cells across tumors and data sets. Along with the web application, we also make available the backend computational pipeline that was used to produce the analyses that are available for exploration in the web application. CHARTS currently enables exploration of 198 tumors across 15 cancer types, and data is being continually added.
Conclusion: CHARTS is an easy-to-use, comprehensive platform for exploring single-cell subpopulations within tumors across the ever-growing collection of public scRNA-seq cancer data sets. CHARTS is freely available at charts.morgridge.org.
Short Abstract: In biology, researchers often study complex systems whose components interact dynamically, with the goals of confirming hypotheses, answering questions, or testing treatments. Computational modeling allows for integrating knowledge from experts, databases, scientific articles, or experimental observations in order to achieve these goals.
The commonly used modeling approaches can be categorized into event-based, agent-based, and element-based approaches. An event is defined as a reaction process involving one or more components; in event-based modeling, the system is fully represented by a set of events. Under this approach, researchers focus on concentration levels rather than individual behaviors, and its apparent limitation is that only the average behavior of the system is studied. In agent-based approaches, the system is modeled as a collection of autonomous decision-making individuals called agents, and the model represents dynamic systems in a manner that permits them to evolve through agent interactions. A drawback of agent-based approaches is that selecting an adequate number of parameters to include in the model can be challenging, both on the practical and the theoretical level. In the middle ground between event-based and agent-based approaches, the element-based approach emerges to balance model simplicity and accuracy. This approach studies each element (i.e., the whole population of the same component) as a unit; elements change state over time as determined by their regulatory functions. Given an element update scheme, element-based approaches can execute simulations over time. The widely used element-based modeling methods include Boolean Networks (BNs), Probabilistic Boolean Networks (PBNs), and Bayesian Networks (ByNs), all of which update each modeled component as a whole population according to a given update scheme.
Different from the existing element-based modeling approaches, we describe here a new element-based modeling approach that configures hybrid models and executes simulations over time, relying on a granular computing approach and a range of different element update functions. We study the relationships between the BN, PBN, ByN, and our hybrid modeling approach, with a particular focus on the central role of PBNs and their links to the others. We find, via mathematical induction, that there is always one Bayesian conditional probability table (together they form a network) for a PBN at a single element; however, there are multiple functional realizations for a Bayesian conditional probability table, and we provide a detailed procedure for obtaining one such PBN. Moreover, the hybrid modeling approach, with a specific simulation update scheme, can be shown to be a special case of a dependent Probabilistic Network, with the Boolean version of the hybrid modeling approach being equivalent to a dependent PBN accordingly. Within the conversions, the advantages and limits of the different approaches under different application scenarios are investigated from the perspectives of time complexity in construction, state transition graph analysis, etc. We show that the computational complexity of obtaining the state transition matrix under the hybrid modeling approach is reduced to O(n*2^n), compared to O(n*2^(2n)) for the PBN approach and an even larger cost for the ByN approach.
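The O(n*2^n) scale can be seen directly in a toy synchronous Boolean network: building the state transition map visits each of the 2^n global states and evaluates n update functions per state. The deterministic update functions below are made up for illustration; a PBN would additionally enumerate combinations of candidate functions, pushing the cost toward O(n*2^(2n)).

```python
def transition_map(update_fns, n):
    """Deterministic Boolean network: for each of the 2**n global states,
    apply every element's update function once (synchronous update).
    Building the full map costs O(n * 2**n)."""
    table = {}
    for s in range(2 ** n):
        # Decode integer s into a tuple of n Boolean element states.
        state = tuple((s >> i) & 1 for i in range(n))
        nxt = tuple(f(state) for f in update_fns)
        table[state] = nxt
    return table
```

The returned table is exactly the state transition graph used in attractor and reachability analysis.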
A detailed example biological system using these approaches is then studied to illustrate how common knowledge of the same system is differently represented and how stochasticity is incorporated.
Short Abstract: The model diatom Phaeodactylum tricornutum is an emerging platform for synthetic biology applications, including production of biofuels, chemicals, and even proteins, because it grows quickly, it is inexpensive, and recently developed genetic tools enable production. However, the current reference genome is incomplete and does not represent the diploid structure of the genome. Here, we improved the current reference genome by using the Oxford Nanopore sequencing platform to create a telomere-to-telomere assembly. We extracted high molecular weight DNA by grinding in liquid nitrogen, followed by phenol-chloroform extraction. We achieved a read N50 of 35 kilobases, with the longest reads of about 300 kilobases. Initial assembly was performed using minimap2 and miniasm. We observed that many chromosomes were incorrectly assembled near telomeres. To correct these regions, we developed a network graph approach to extract and correct each unique telomere. Briefly, all long reads containing the known telomeric repeat were extracted and mapped in all-vs-all mode. We then created a network graph using each read name as a vertex, with edges created between pairs of reads that overlapped with query coverage > 95%. We identified 88 unique and 7 duplicated telomere clusters, which were used to manually correct assembly errors near the ends of chromosomes. We found that this genome assembly comprises 25 telomere-to-telomere chromosomes, rather than the previously proposed 33. We confirmed the structure of each chromosome by developing a software tool that checks contiguity by creating a minimum tiling overlapping read map with ultra-long reads. In the future, these data will be used to resolve the diploid telomere-to-telomere genome assembly, as well as provide methylation data for additional analysis.
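The telomere clustering step can be sketched as connected-component extraction on the read-overlap graph. The input format here, read pairs with a query-coverage fraction, is a hypothetical pre-parsed form of the all-vs-all mapping output, not the authors' actual code.

```python
from collections import defaultdict

def telomere_clusters(overlaps, min_cov=0.95):
    """Group telomere-containing reads into clusters: reads are vertices,
    and an edge joins two reads whose all-vs-all overlap covers >95% of
    the query. Connected components then correspond to unique telomeres.
    `overlaps` is a list of (read_a, read_b, query_coverage) tuples."""
    adj = defaultdict(set)
    reads = set()
    for a, b, cov in overlaps:
        reads.update((a, b))
        if cov > min_cov:
            adj[a].add(b)
            adj[b].add(a)
    clusters, seen = [], set()
    for r in sorted(reads):
        if r in seen:
            continue
        stack, comp = [r], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

Each returned component is one candidate telomere whose member reads can then be inspected manually.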
Short Abstract: Major depressive disorder (MDD) is a serious mental health disorder that affects millions of people in the U.S. Many patients who experience MDD are resistant to available antidepressant treatments and there is a pressing need to develop novel therapies for this debilitating disorder. DrugFindR is an R Shiny application created to identify candidate repurposable drugs from the Library of Integrated Network-Based Cellular Signatures (LINCS). Here, we utilize DrugFindR to identify potential new candidate drugs for the treatment of MDD.
For this analysis, gene signatures are generated from publicly available RNAseq datasets obtained from postmortem brain tissue from subjects with MDD and non-psychiatrically ill controls. Simultaneously, LINCS signatures for known antidepressants are identified. Finally, using signature-based connectivity analysis, DrugFindR identifies chemical perturbagens that are highly discordant with the disease gene signatures (i.e. reverse the MDD gene signatures) and highly concordant with known antidepressant signatures. Additionally, we examine MDD disease signatures from male and female subjects to identify sex-specific candidate repurposable drugs for the treatment of MDD.
This study utilizes the extensive repository of chemical perturbagen signatures available in LINCS to identify novel candidate drugs that may be explored for the treatment of MDD in male and female subjects. This study addresses a pressing need in neuropsychiatric research. Future studies will examine the utility of the identified drugs in vivo.
Short Abstract: COVID-19 emerged in Wuhan (China) at the end of 2019 and was declared a pandemic by the World Health Organization in March 2020. With more than 2.5 million deaths worldwide as of late February 2021, COVID-19 has been a defining health crisis and has impacted people’s everyday lives in countless ways. One of the most noteworthy circumstances of the COVID-19 outbreak in the United States was the closure of virtually all schools throughout the country. Since their closure, one of the most pressing issues pertaining to COVID-19 is how to properly reopen schools without sparking a surge in cases throughout the community. Currently, the situation is highly heterogeneous, with even nearby schools adopting alternative strategies. Notably, prolonged school closure has been shown to negatively affect the student learning experience and to contribute to serious mental health problems, such as anxiety and depression.
Much about the dynamics of the spread of infectious diseases such as COVID-19 can be analyzed by means of statistical and mathematical models. In this work, we concentrate on two distinct models with the intent of capturing important factors in the diffusion of the coronavirus in Indiana’s secondary school system.
In the first model, we analyze the number of cases in each school, subdividing schools by county. The distribution of the number of cases in schools within a given county is modeled with a conditional Gaussian distribution; namely, for each county, we model the distribution of cases per school with a Gaussian distribution with county-dependent mean and variance. An interesting fact emerged from the analysis: the conditional mean of school cases per county scales linearly with the county's case count. This happens in a strikingly hierarchical way, with the number of cases in students, teachers, and the full population in each county differing by an approximately constant factor. This has potentially important public-policy consequences, including the possibility of concentrating testing in schools and using the scaling factor to estimate the incidence of COVID-19 in the full population.
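Fitting the conditional Gaussian amounts to estimating a separate mean and variance per county. A minimal sketch, assuming the data are already grouped as county -> per-school case counts (a hypothetical layout):

```python
from statistics import mean, pvariance

def county_gaussians(cases_by_county):
    """Fit the conditional Gaussian: each county's per-school case counts
    get their own mean and variance (county-dependent parameters).
    Input maps county name -> list of per-school case counts."""
    return {county: (mean(v), pvariance(v))
            for county, v in cases_by_county.items() if v}
```

The fitted county means could then be regressed against county-level case counts to recover the linear scaling described above.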
The second model is a compartmental model with age structure (4 compartments of young interacting with 4 compartments of adults) that aims at describing how non-trivial the dynamics of the disease can be and what unexpected consequences there might be. Simulations of our models, with parameters in line with those of Indiana, showed that even if adults keep their contact with other adults to a minimum and follow other proper protocols, transmission from the young can be extremely detrimental to the more at-risk population. This shows that optimal school reopening strategies can potentially benefit not only the school population, but the entire community.
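A stripped-down version of the age-structured dynamics, with two groups instead of 4+4 compartments and toy parameters, already shows the spillover effect: even with zero adult-to-adult contact, infection among the young depletes adult susceptibles. This is an illustrative sketch, not the model from the abstract.

```python
def age_structured_sir(contact, gamma, days, s0, i0, dt=0.01):
    """Minimal two-group (young/adult) SIR with Euler integration.
    contact[a][b] is the transmission rate from group b onto group a;
    gamma is the recovery rate. All parameter values are toy choices."""
    s, i = list(s0), list(i0)
    for _ in range(int(days / dt)):
        # Force of infection on each group from both groups' prevalence.
        force = [sum(contact[a][b] * i[b] for b in range(2)) for a in range(2)]
        for a in range(2):
            new_inf = force[a] * s[a] * dt
            s[a] -= new_inf
            i[a] += new_inf - gamma * i[a] * dt
    return s, i
```

Running it with `contact[1][1] = 0` (no adult-adult transmission) still drives adult infections via the cross term `contact[1][0]`.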
Taken in conjunction, these results underline once more the importance of adopting proper school reopening strategies and how they relate to the diffusion of the coronavirus outside the school environment. The diffusion of the coronavirus among the school population has the potential not only to be a strong determinant of the health of the more at-risk population, such as the elderly and the sick, but also to be a proxy for the incidence of COVID-19 in the community.
We plan to extend our study beyond Indiana to all US counties and validate the accuracy and deductions of our models. We will talk about these extensions and our updated results in our GLBIO presentation.
Short Abstract: Polyomaviruses are the smallest known double-stranded DNA viruses and are abundant in the human body. The polyomaviruses JC virus (JCPyV) and BK virus (BKPyV) are common viruses in the human urinary tract. Prior studies estimate JCPyV infects between 20-80% of older adults and BKPyV infects between 65-90% of individuals by age 10. These two viruses encode the same six genes and share 75% nucleotide sequence identity across their genomes. While prior urinary virome studies have repeatedly reported the presence of JCPyV, we were interested in seeing how JCPyV prevalence compares to that of BKPyV. We retrieved all publicly available whole-genome sequencing reads from urinary microbiome and virome studies (n=165). Raw reads were first mapped to the JCPyV RefSeq. We found that 59 of the 165 samples had reads mapping to the JCPyV genome. Upon further investigation, we found that in 56 of these samples the reads mapped to a particular gene/domain with high coverage. When these 56 data sets were mapped to the BKPyV RefSeq, uniform coverage across the genome was observed. Thus, we conclude that JCPyV is present in only 2% (n=3) of the samples tested, significantly less than BKPyV (34%). Furthermore, this study highlights the need for coverage-based analyses to distinguish between closely related species within a community.
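The coverage-based distinction rests on breadth of coverage rather than depth: reads stacked on a single gene cover few genome positions even at high depth. A minimal sketch, with alignments given as hypothetical (start, end) intervals:

```python
def coverage_breadth(genome_len, alignments):
    """Fraction of genome positions covered by at least one read.
    Reads piling up on one gene give high depth but low breadth, the
    signature used here to flag spurious JCPyV calls. `alignments` is a
    list of (start, end) intervals (0-based, end-exclusive)."""
    covered = [False] * genome_len
    for start, end in alignments:
        for pos in range(max(0, start), min(genome_len, end)):
            covered[pos] = True
    return sum(covered) / genome_len
```

A sample with many reads but breadth near the length of a single gene would be reassigned by remapping to the close relative's reference.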
Short Abstract: Accurate modeling of kinase networks is an increasingly relevant research topic due to growing interest in kinase-kinase interactions. An example of such a modeling system is KINNET, an R package developed by Ali Imami that uses Bayesian Networks to model these interactions using kinomic data. However, the number of potential networks increases exponentially with the number of kinases. As such, it is important to develop a method to select the optimal kinase when presented with a choice between multiple kinases with similar scores. Here, a ranking system is presented that weights kinases based on two parameters: the number of interactions a kinase participates in and the knowledge base available for that particular kinase.
This two-parameter approach aims to balance the complexity that a kinase adds to the network with how much information is available, allowing for the selection or rejection of potential nodes depending on the needs of the user. In addition, the ability to switch between “dark” and “light” kinases, which are those with a lack or abundance of information respectively, has also been developed, which may enable the discovery of novel kinase interactions.
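One way to realize the two-parameter weighting is a simple linear score over interaction count and knowledge availability; flipping the sign of the knowledge weight switches the ranking from "light" toward "dark" kinases. The scoring function and the example kinase values below are illustrative assumptions, not KINNET's implementation.

```python
def rank_kinases(kinases, w_deg=0.5, w_knowledge=0.5):
    """Two-parameter kinase ranking sketch: weight each candidate by its
    interaction count (network complexity it would add) and by how much
    published knowledge exists for it. `kinases` maps name ->
    (n_interactions, knowledge_score)."""
    def score(name):
        deg, knowledge = kinases[name]
        return w_deg * deg + w_knowledge * knowledge
    return sorted(kinases, key=score, reverse=True)
```

Tuning `w_deg` and `w_knowledge` lets the user accept or reject candidate nodes according to their needs.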
Short Abstract: Genetic interactions can be important for predicting individual phenotypes. Due to statistical power issues, the systematic study of genetic interactions from GWAS with traditional approaches has been challenging. We recently published a new computational approach, called BridGE [1], for detecting genetic interactions from GWAS data. We showed that by leveraging the expected local network structure between sets of functionally related genes we are able to systematically discover disease-associated genetic interactions from different GWAS cohorts.
One powerful application of genetic interactions, demonstrated by extensive reverse genetic studies in model organisms, is to use the pattern of genetic interactions for a gene of interest to identify other functionally related genes. Gene pairs that exhibit high profile similarity often correspond to genes in common pathways or protein complexes. To date, this profile similarity approach has been applied to genetic interactions derived from reverse genetic screens, but not to genetic interactions derived from population genetics. The aim of our current research is to develop a method for constructing functional gene networks based on genetic interaction profile similarity, using genetic interactions derived from population genetic studies. Several issues first need to be addressed to extract meaningful functional patterns from profile similarities, including population structure, the sparsity of profiles, and linkage disequilibrium. We use different machine learning tools (clustering and PCA decomposition) to address these issues. We summarize progress in benchmarking functional networks derived from this approach and highlight several important caveats to extending profile similarity analyses to population genetic data.
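The profile-similarity step can be sketched as pairwise cosine similarity over genetic-interaction profiles; the gene -> score-vector layout here is a hypothetical simplification of the population-derived profiles described above, and the gene names are placeholders.

```python
import math

def profile_similarity(profiles):
    """Pairwise cosine similarity of genetic-interaction profiles; highly
    similar profiles suggest genes in a common pathway or complex.
    `profiles` maps gene -> list of interaction scores against a common
    panel of partner genes."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    genes = sorted(profiles)
    return {(g, h): cos(profiles[g], profiles[h])
            for i, g in enumerate(genes) for h in genes[i + 1:]}
```

Thresholding the similarity scores yields the edges of a candidate functional gene network.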
[1] Fang, G., Wang, W., Paunic, V., et al. Discovering genetic interactions bridging pathways in genome-wide association studies. Nat Commun 10, 4274 (2019).
Short Abstract: Next-generation sequencing technologies have revolutionized the field of genetics and helped usher in today’s genomics era. Continued advances in the underlying sequencing technology and decreased costs have empowered researchers to generate massive amounts of next-generation sequencing data, particularly for whole-genome sequencing (WGS) efforts in mammalian species where genome sizes often span 2+ billion bases. WGS projects may range from single individuals with a phenotype of interest to entire cohorts, thus broadening our ability to probe biological questions at both genome-wide and population-level scales. This revolution has been exceedingly fruitful in veterinary medicine and in particular for dogs, with research groups across the globe actively generating WGS data across many canine breeds. However, working with WGS data in an effective and efficient way is not always straightforward. The processing of WGS data is often challenging and inaccessible to many researchers, computationally expensive, and relatively non-standard. To address the latter, the Broad Institute has for years been at the vanguard of developing open-source tools (the Genome Analysis Toolkit – GATK), analysis pipelines, and best practices geared towards human WGS projects. Yet reconfiguring GATK pipelines for non-human WGS samples and generating sample-specific input can be a daunting and potentially error-prone endeavor when scaled to dozens or hundreds of samples. To address these hurdles, we developed a canine-specific wrapper around a standard pipeline to encourage the best practices championed by GATK while lowering the barrier to entry often faced by researchers seeking to analyze WGS data.
Our tool generates all required input, including extracting flow cell and lane information to enable processing of samples split across flow cells or lanes, to run a complete analysis from raw FASTQs through genomic variant call format files on HPC architecture employing commonly used schedulers. In addition, we developed a pipeline for joint genotyping across multiple samples, and then extended this pipeline to enable the identification of novel variants from specific samples. This pipeline extension has been highly successful in identifying likely causal variants when applied to a number of different canine monogenic disorders. Additionally, the fast processing times for our pipeline make using WGS as a clinical/diagnostic test a possibility that we are actively exploring. Finally, although developed for canine WGS data, our processing pipeline can also be modified with little bioinformatics knowledge to work effectively in other veterinary species.
Short Abstract: The COVID-19 pandemic has highlighted the importance of being able to rapidly and accurately identify potential drug treatments for emerging diseases. The development of in silico drug repurposing tools is complemented by the increased availability of Next-Generation Sequencing transcriptomic datasets, which allow us to identify significant changes in gene expression associated with a given disease state.
Here we introduce DrugFindR, a novel method for rapid screening of small molecules for activity against a given disease transcriptomic signature. Using the DrugFindR application, users can upload and analyze their own disease transcriptomic signatures and identify candidate small molecules that are discordant with, or “reverse,” the disease signature. DrugFindR utilizes the Library of Integrated Network-based Cellular Signatures (LINCS) as the source of small molecule signatures (>20,000 small molecule signatures available). Novel candidate drugs are those whose signatures are 1) concordant (similar) with the signatures of known drug treatments for the disease of interest and 2) the most likely to reverse the disease signature. This two-pronged approach allows for more accurate filtering of the candidate drug list. DrugFindR can be applied to rapidly screen candidate drugs in silico for both new and existing disease conditions.
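The two-pronged filter reduces, at its core, to scoring each candidate signature for concordance against known-drug signatures and discordance against the disease signature. A toy sign-agreement score (not the LINCS connectivity score DrugFindR actually uses) illustrates the idea; the gene names and values are made up.

```python
def concordance(sig_a, sig_b):
    """Toy concordance score between two gene signatures (gene -> log
    fold-change): +1 means fully concordant, -1 fully discordant (the
    second signature "reverses" the first). Only the signs over shared
    genes are compared."""
    shared = set(sig_a) & set(sig_b)
    if not shared:
        return 0.0
    agree = sum(1 if sig_a[g] * sig_b[g] > 0 else -1 for g in shared)
    return agree / len(shared)
```

A candidate would pass the filter when it scores near -1 against the disease signature and near +1 against known-treatment signatures.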
We have applied this method to identify potential treatments for COVID-19 and were able to identify 17 candidates, including 10 with known antiviral properties and 7 that are currently in clinical trials as COVID-19 therapies.
Short Abstract: Systemic infections, especially in patients with chronic diseases, can result in sepsis: an explosive, uncoordinated immune response that can lead to multisystem organ failure with a high mortality rate. Sepsis survivors and non-survivors oftentimes have similar clinical phenotypes or sepsis biomarker expression upon diagnosis, suggesting that the dynamics of sepsis in the critical early stage may have an impact on these opposite outcomes. To investigate this, we designed a within-subject study of patients with systemic gram-negative bacterial sepsis with surviving and fatal outcomes and performed single-cell transcriptomic analyses of peripheral blood mononuclear cells (PBMC) collected during the critical period from sepsis recognition to 6 hours later. We observed that the largest sepsis-induced expression changes over time in surviving versus fatal sepsis were in CD14+ monocytes, including gene signatures previously reported for sepsis outcomes. We further identify changes in the metabolic pathways of both monocytes and platelets, the emergence of erythroid precursors, and T cell exhaustion signatures, with the most extreme differences occurring between the non-sepsis control and the sepsis non-survivor. Our single-cell observations are consistent with trends from public datasets but also reveal specific effects in individual immune cell populations, which change within hours. In conclusion, this pilot study provides the first single-cell results with a repeated measures design in sepsis to analyze the temporal changes in immune cell population behavior in surviving or fatal sepsis. These findings indicate that tracking temporal expression changes in specific cell types could lead to more accurate predictions of sepsis outcomes. We also identify molecular pathways that could be therapeutically controlled to improve the sepsis trajectory toward better outcomes.
Short Abstract: Proteins are the principal biological macromolecules of life. Understanding protein function is critical for a number of applications. The function of a protein is closely linked to its structure, so being able to correctly classify protein structure is important. This is the task of protein structural classification (PSC). As protein structure is in turn related to amino acid sequence, PSC was originally done using sequence data. Later, using direct 3-dimensional (3D) structural data was found to improve on sequence data for PSC. We recently proposed PSC methods based on protein structural networks (PSNs), which outperformed PSC done with traditional sequence and 3D data.
PSNs are networks that model a protein structure by representing the protein amino acids as nodes and connecting with an edge any two nodes whose amino acids are spatially close to each other. In previous studies, by modeling proteins as PSNs we were able to leverage features that describe network topology for the task of PSC.
The previous studies modeled protein 3D structures as static PSNs. By a static PSN, we mean that we model the entire native 3D structure of a protein using a single PSN. However, proteins are not created all at once as whole native structures. Instead, a protein is built by adding amino acids one at a time to the protein chain. The amino acids in this chain are capable of interacting with, and even partially folding with, those amino acids that have already been added to it. In this study we propose a novel dynamic PSN to naively mimic this process. Dynamic PSNs are composed of a number of static networks called snapshots. Each snapshot uses a progressively larger section of the native 3D structure in its construction, starting at the N-terminus and eventually extending to a final snapshot that uses the entire structure.
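As a toy illustration of this snapshot construction (not the study's actual parameters), the sketch below builds a static PSN from C-alpha coordinates with a hypothetical 8 Å contact cutoff and grows the snapshots from the N-terminus:

```python
import math

def build_psn(coords, cutoff=8.0):
    """Static PSN: one node per residue, an edge between two residues whose
    C-alpha atoms lie within `cutoff` angstroms (illustrative value)."""
    n = len(coords)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges

def dynamic_psn(coords, n_snapshots):
    """Dynamic PSN: snapshot k is built from the first k/n fraction of the
    chain, naively mimicking growth from the N-terminus."""
    n = len(coords)
    snapshots = []
    for k in range(1, n_snapshots + 1):
        prefix = coords[: max(1, round(n * k / n_snapshots))]
        snapshots.append(build_psn(prefix))
    return snapshots

# Toy chain of 10 residues spaced 3.8 A apart along x (hypothetical data)
coords = [(3.8 * i, 0.0, 0.0) for i in range(10)]
snaps = dynamic_psn(coords, 4)
print([len(s) for s in snaps])  # edge counts grow with each snapshot
```

The final snapshot is, by construction, the static PSN of the whole structure, so the static representation is recovered as a special case.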
We extract features from these dynamic PSNs using the well-known concept of dynamic graphlets. A graphlet is an induced, connected subgraph of a network. A dynamic graphlet captures the changes in edges of these subgraphs across snapshots. We then use these dynamic graphlet-based features for the task of PSC.
For our study we use 35 datasets of proteins with CATH or SCOP structural labels. CATH and SCOP feature 4 levels of structural organization, and each dataset covers labels at one of these levels. For each dataset we construct dynamic PSNs. We then extract dynamic graphlet features from these PSNs to use in the task of PSC. An off-the-shelf logistic regression model is used for all structure prediction. We compare our approach to other recent state-of-the-art approaches. We find that our dynamic PSNs outperform existing state-of-the-art PSN-based PSC at all but the highest levels of structural classification. Specifically, in our preliminary data, dynamic PSNs outperformed or matched static PSNs for 19 of 25 lower-level datasets. For higher-level datasets, dynamic PSNs outperformed all static PSN features except the most recent for 9 of 10 datasets and performed the best for 1 of 10 datasets.
Short Abstract: The succinate dehydrogenase (SDH) enzyme, also known as complex II, is an integral component of the mitochondrial respiratory chain. Complex II couples the oxidation of succinate to fumarate in the Krebs cycle with electron transfer to the terminal acceptor ubiquinone in the electron transport chain (ETC).
Various dysfunctions in the ETC may lead to a wide range of disorders and even cause cancer. As SDH plays a crucial role in electron transfer, alterations in its structure may affect cell function. Recently, mutations in genes encoding subunits B, C and D of SDH have been found to be associated with paraganglioma, a type of cancer that can develop in carotid bodies.
Nowadays, next-generation sequencing of tumors can identify novel and as-yet unstudied mutations with possible causative links to paraganglioma. Experimental study of each mutation is a time-consuming process. Alternatively, the effects of mutations on protein structure may be elucidated using computer modeling.
Recently, a set of mutant variants was obtained from patients' tissue samples. In this study, we focused on the nonsynonymous variants R242C (SDHB), R70S (SDHD), and G75D (SDHC). We performed a molecular dynamics (MD) investigation and found that two of these three variants cause drastic changes. We observed that both mutations (R242C and G75D) result in ubiquinone (UQ) destabilization, while R242C also leads to iron-sulfur cluster breakdown.
MD analysis of the third mutation (R70S) did not reveal any significant changes in protein structure.
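Structural-stability comparisons of this kind ultimately rest on deviations between conformations, commonly quantified as RMSD; a minimal numpy sketch with made-up coordinates (and no superposition/fitting step, which a real MD analysis would include) is:

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two matched coordinate sets (N x 3).
    Assumes the structures are already superimposed; no fitting is performed."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

# Hypothetical reference vs. perturbed coordinates for 4 atoms
ref = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
mut = ref + 0.5  # uniform 0.5 A shift along each axis
print(round(rmsd(ref, mut), 3))
```

In practice, tools such as MDAnalysis or GROMACS utilities compute this over whole trajectories after least-squares fitting.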
Considering the observed effects, we assume that the R242C and G75D mutations may disrupt the respiratory chain, accompanied by the accumulation of reactive oxygen species. Thus these mutations (R242C, G75D) can be pathogenic and may cause paragangliomas.
Molecular modeling experiments were carried out using the equipment of the shared research facilities of HPC computing resources at Lomonosov Moscow State University supported by the project RFMEFI62117X0011.
1. Tracey T.J. et al. (2018) Neuronal Lipid Metabolism: Multiple Pathways Driving Functional Outcomes in Health and Disease // Front. Mol. Neurosci. Vol. 11. P. 10.
2. Brière J.-J. et al. (2005) Succinate dehydrogenase deficiency in human // Cell. Mol. Life Sci. Vol. 62, № 19-20. P. 2317–2324.
3. Wieneke J.A., Smith A. (2009) Paraganglioma: carotid body tumor // Head Neck Pathol. Vol. 3, № 4. P. 303–306.
Short Abstract: The novel Coronavirus (2019-nCoV) has caused and is continuing to cause thousands of worldwide deaths along with severe economic damage. The lack of efficient drug candidate screening methodologies delayed the identification of potential small-molecule treatment candidates for the outbreak. This research aimed to address this issue and design a comprehensive drug candidate screening methodology utilizing advancements in bioinformatics and artificial intelligence (AI) that could be applied to the 2019-nCoV main protease as well as primary components of future disease outbreaks.
In order to be a potential drug candidate truly worth screening in a laboratory, a compound should have two main characteristics: (1) be structurally similar to other semi-effective treatments and (2) have a good predicted binding energy/interaction likelihood with the drug target (in this case the 2019-nCoV main protease).
The first phase of research was to screen a dataset of roughly 10 million stock compounds against three molecules with moderate activity against the main protease: Remdesivir, Quercetin, and Epigallocatechin gallate. Compounds with desirable preliminary characteristics advanced to the second research phase.
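The structural-similarity filter in this first phase can be illustrated with the Tanimoto coefficient on fingerprint bit sets; the "fingerprints" and the 0.5 cutoff below are toy assumptions, not the study's actual screen (a real pipeline would compute fingerprints with a cheminformatics toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprint bit sets:
    |intersection| / |union|."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy "fingerprints": indices of on-bits (hypothetical, not real chemistry)
reference = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 12, 15},  # very similar to the reference
    "cmpd_B": {1, 4, 20, 21},        # partly similar
    "cmpd_C": {30, 31, 32},          # dissimilar
}
hits = {name: round(tanimoto(reference, fp), 3) for name, fp in library.items()}
shortlist = [name for name, score in hits.items() if score >= 0.5]
print(hits, shortlist)
```

Compounds passing such a threshold against the three reference molecules would advance to the next phase.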
In the second phase of research, an algorithm using AI-based neural network architectures was implemented to analyze past statistical protein-compound interactions and identify new compounds with key chemical properties.
The final phase of the research involved a prepared and refined molecular docking trial with binding energy calculations generated by Schrödinger's GLIDE software. Of the 10 million original compounds, 5 compounds with substantial binding affinity were identified across the three research phases. One of these 5 compounds, when tested by Dr. Brian Kraemer's lab at the University of Washington, was a dose-response hit, inhibiting the main protease as predicted computationally.
In conclusion, this research has led to the identification of an effective inhibitor of the 2019-nCoV protease using a newly-designed screening methodology. The computational methodology utilized, particularly the strategic use of new algorithms and software, has the potential to be applied for future pandemics, accelerating the identification of life-saving treatments.
Short Abstract: Pathway analysis is an important tool in biomedical research that aids our understanding of gene expression data gathered from omic approaches. The method involves finding statistical enrichment of a user-provided gene list in gene sets associated with well-defined biological pathways. Currently, the method faces three major hurdles: first, redundancy within the gene sets; second, finding the bigger picture amongst the overwhelming amount of data; and third, a lack of methods to harmonize ontologies from different sources, for instance Reactome, KEGG, Gene Ontology, and other similar repositories of gene sets. To overcome these hurdles, we developed a text-mining-based approach that builds upon Bidirectional Encoder Representations from Transformers (BERT), a natural language processing platform, and relies on the definition of a given ontology/pathway to create a database of embeddings (learned representations of the definitions). Definitions from the user-provided list of GO terms are matched against the embedding database, and a theme representing the list is generated using a weighting factor, term frequency-inverse document frequency (tf-idf), to filter the most important words. Our approach expands upon other methods relying on semantic similarity and outperforms other available tools, including the industry-standard parent-child analysis, on several benchmarks involving time to perform the analysis, reproducibility, and coverage of results. An R package named pathwayHUNTER is under development.
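The tf-idf weighting step can be sketched as follows; the toy "definitions" and whitespace tokenizer below are illustrative assumptions, not pathwayHUNTER's implementation:

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf weight per (document, term): term frequency within the document
    times log(N / document frequency across all documents)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

# Toy GO-style definitions, pre-tokenized by whitespace (hypothetical)
docs = [
    "regulation of immune response".split(),
    "regulation of cell cycle".split(),
    "immune cell activation".split(),
]
weights = tfidf(docs)
top = max(weights[0], key=weights[0].get)  # most informative word of doc 0
print(top)
```

Common words shared across definitions (like "regulation" or "of") are down-weighted, so the highest-weighted words characterize what is distinctive about each matched definition.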
Short Abstract: Proteases are extensively used in the food industry for their ability to break down proteins. Breaking proteins down to small peptides can be associated with loss and gain of chemical and biological properties, and it is important to understand these changes. Mass spectrometry (MS) peptidomics offers a way to characterise the pattern and progress of proteolysis in food processing and in other biological contexts, such as digestion and apoptosis. A variety of tools are available for the prediction of endoproteolytic events based on motifs in the cutting site, but to date exoproteolytic prediction and estimation tools are rare. As both exo- and endoproteases work together in the digestion environment, there is also a great need in this field to estimate the relative proportions of these two distinctive digestion processes in a given sample. To determine patterns of proteolytic activity, we propose a staircase pattern-finding algorithm for the identification of exoprotease-linked overlapping peptide sets from MS data. These are combined with predicted endoprotease activities to provide a relative quantification of the indicative proportions of different proteolytic processes. Since separate enzymatic activities perform aminopeptidase- and carboxypeptidase-based exoproteolysis, patterns were estimated for each separately. We are developing a Python-based software package that may be readily applied to MS output tables and peptide sequence lists in order to estimate parameters of different proteolysis processes in a given sample, and to allow such processes to be contrasted across samples.
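The staircase idea can be illustrated as chains of peptides that each differ from the next by one terminal residue; the sketch below handles only N-terminal (aminopeptidase) ladders on toy sequences and omits intensities and C-terminal ladders, unlike the actual algorithm:

```python
def staircase_sets(peptides):
    """Group peptides into aminopeptidase-style 'staircases': chains in which
    each peptide is the previous one with its N-terminal residue removed.
    Simplified sketch; real MS data would also need C-terminal ladders and
    abundance information."""
    pepset = set(peptides)
    ladders = []
    for p in peptides:
        # only start a ladder at peptides that are not themselves trimmed forms
        if any(q.endswith(p) and len(q) == len(p) + 1 for q in pepset):
            continue
        ladder = [p]
        while ladder[-1][1:] in pepset:
            ladder.append(ladder[-1][1:])
        if len(ladder) > 1:
            ladders.append(ladder)
    return ladders

peptides = ["PEPTIDE", "EPTIDE", "PTIDE", "GLYCINE"]
print(staircase_sets(peptides))
```

Counting peptides inside versus outside such ladders gives a rough handle on the relative share of exo- versus endoproteolytic products in a sample.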
Short Abstract: Genotypes of large polymorphic inversions can be inferred from single nucleotide polymorphism (SNP) data. Recombination is repressed in inversion regions, allowing alleles to remain private to a single inversion orientation. For large inversions, there are sufficient numbers of private alleles such that samples segregate according to inversion genotypes.
In principal component analysis (PCA), samples form groups according to their inversion genotypes. Clustering algorithms can be used to determine the distinct groups and their members. Depending on the particular data set, however, inversions cause different patterns. For example, D. melanogaster samples form three distinct, circular clusters with only a handful of outliers when analyzing SNPs on the 2L chromosome arm, while An. gambiae samples separate by their 2Rb inversion genotypes around the origin, but the clusters are not clearly defined since the intra- and inter-cluster variances are nearly equal.
Clustering algorithms use a range of heuristics to infer cluster membership. For example, k-means partitions the global space into regions of approximately equal size, while DBSCAN uses the local topology of points to infer clusters. The differences in choices make each clustering algorithm suitable to a different type of data. For example, k-means is most appropriate for evenly sized, circular clusters, while DBSCAN works on complex shapes with uneven sizes as long as inter-cluster distances are less than intra-cluster distances.
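A minimal k-means sketch on synthetic PC-score clusters of the kind described above (made-up data; seeded here with one point per expected cluster for a deterministic demo, whereas library implementations use multiple random restarts):

```python
import numpy as np

def kmeans(points, init, iters=20):
    """Minimal Lloyd's k-means: alternate assigning each point to its nearest
    centroid and recomputing centroids. Sketch only; a real analysis would use
    a library implementation (e.g. scikit-learn) with multiple restarts."""
    centroids = points[list(init)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Synthetic PC1/PC2 scores: three tight, circular clusters, one per
# hypothetical inversion genotype (20 samples each)
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
labels = kmeans(pts, init=(0, 20, 40))
print(np.bincount(labels))
```

On well-separated circular clusters like these, k-means recovers the groups; on the overlapping, origin-centered patterns described above, a density-based method such as DBSCAN would be a better fit.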
Our aim was to characterize the different cluster patterns caused by large polymorphic inversions and determine which clustering algorithm is best suited for each type of pattern. We performed PCA of SNPs from chromosome arms of An. gambiae, An. coluzzii, and D. melanogaster. We developed a set of descriptive criteria which we applied to characterize the observed patterns for each inversion. We then applied several clustering algorithms to the data sets and evaluated their predictions against the known genotypes for the samples. We use our characterizations and clustering results to recommend the most appropriate clustering algorithms for each type of pattern.
Short Abstract: FcgRIIA (CD32a), a low-affinity receptor for the Fc region of immunoglobulin G (IgG), is expressed on the surface of several immune cells and is uniquely expressed on platelets. Known as the only Fc receptor expressed on platelets, and previously thought to be limited in expression to humans, it has recently been brought to the forefront for its role in heparin-induced thrombocytopenia and in systemic thrombosis resulting from anti-CD40L therapy. More recent studies of FcgRIIA have shown that it is not unique to humans; therefore, studying which organisms express FcgRIIA could be important for designing effective preclinical trials. It was thus necessary to identify which organisms, other than humans, express FcgRIIA on immune cells and platelets. Using a consensus sequence of human FcgRIIA, the sequence was analyzed with NCBI protein BLAST to gain a data set of "hits" in other organisms with defined similarity or identity to FcgRIIA. Similar sequences were then aligned using ClustalX2 to locate areas of strong similarity or where single nucleotide polymorphisms were present. These alignments, visually showing the percent identity, allowed for further comparisons and creation of phylogenetic trees using MEGA. Moreover, heatmap analysis of similarity showed which organisms expressed the highest levels of similarity in defined areas of the protein compared to humans. Using these approaches, we discovered that many organisms show similarity in the conserved immunoglobulin superfamily domains in the extracellular portion of FcgRIIA, although typically the similarity was in adhesion receptors and not Fc receptors. However, only two organisms, chimpanzees and orangutans, shared 100% similarity in the cytoplasmic domain, including the critical ITAM signaling domain. We've previously defined the importance of the ITAM sequence of FcgRIIA for efficiency of phagocytosis and phagolysosome fusion.
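The percent-identity comparison underlying such heatmaps can be sketched on toy aligned fragments (hypothetical sequences, not real FcgRIIA alignments):

```python
def percent_identity(a, b):
    """Percent identity between two equal-length aligned sequences,
    skipping columns where both sequences have a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

# Toy aligned fragments (made up; not real FcgRIIA or ITAM sequences)
aln = {
    "human":      "YMTLNPRAPTD",
    "chimpanzee": "YMTLNPRAPTD",
    "mouse":      "YMTLKPRN-TD",
}
matrix = {
    (s1, s2): round(percent_identity(a, b), 1)
    for s1, a in aln.items() for s2, b in aln.items() if s1 < s2
}
print(matrix)
```

Computing this over each defined region of the alignment, rather than the full length, yields the per-domain similarity values that the heatmap visualizes.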
In summary, we've identified chimpanzees and orangutans as the organisms most similar to humans with respect to FcgRIIA. Having identified these organisms, future studies involving anti-platelet biologics can use a more effective model for preclinical trials.
Short Abstract: Purpose: Transcriptome meta-analysis identifies genes, signaling pathways and biomarkers, revealing novel pathways and genes with greater accuracy. Due to the availability of microarray and transcriptome high-throughput sequencing data, it is expected that meta-analysis of different datasets under the same disease and control conditions allows identification of potential pathways for drug repurposing. Time series high-throughput gene expression profiles help with understanding the dynamics of gene-gene interactions, such as time-delayed gene regulation patterns in organisms. To identify temporal patterns in publicly available data, temporal studies need to be indexed appropriately. However, a simple keyword search of abstracts from, for example, NCBI GEO will only result in a large number of false positives. To explore the potential of eliminating false positives and negatives by text mining approaches, we conducted a case study.
Procedures: In our case study we started with a paper on rat eye injury, identified key regulatory pathways and searched in PubMed to identify papers with those pathways. We used a query refinement method developed by our group and randomly selected 40% of the papers out of 1226 original research results on rats with “Time Factors” and “Gene Expression Profiling” as MeSH terms. Those were manually screened for time-series expression profiles and temporal pathway analysis resulting in 68 distinct papers. We recorded all significant gene changes and associated pathways reported in those papers.
Outcome: Although tissue responses to injury and stress are complex and multi-faceted, we tried to identify commonalities across the selected experiments, for example, the involvement of growth factor pathways in all 68 studies, apoptosis signaling in 25, cell cycle regulation in 19, and MAPK signaling in 13. While most of the studies unsurprisingly focused on injuries, we also identified unrelated topics like methamphetamine addiction. We were not able to identify the 68 papers based on the abstract alone, which could potentially be automated through text mining approaches, but had to consult the papers themselves. That was especially true when we tried to identify time-delayed gene regulation patterns in organisms.
Impact: Reduction of false positives and negatives through querying, and potentially text mining, the abstracts alone is not sufficient. However, our research can help scientists efficiently identify time series scientific data online.
Short Abstract: Whole cell models (WCMs) are a new class of cellular mathematical models that combine all known mechanistic processes for an organism. These multi-scale models incorporate a diverse array of chemical pathways, cellular phenomena, and computational methods. Until now, only one truly computational WCM of a biosystem, the Mycoplasma genitalium organism, has been developed. Simulating WCMs would allow us to address many challenging questions, such as the understanding of interactions and coupling between pathways, the examination of system properties, and the identification of gaps in our biological knowledge. Unfortunately, efficient and rapid simulation of WCMs is stymied by a lack of data on chemical kinetics, of implementation-independent representations of WCMs, and of an efficient parallelizable simulator that can run using high performance computing (HPC) resources. To demonstrate that WCMs can be efficiently solved on HPC resources, we build upon a recently published adaptation of M. genitalium into a graph model that was published without kinetic rate information. In this work we detail our progress toward developing an HPC-based WCM platform, as well as our protocol for parameterizing the graph model of M. genitalium in order to test our simulation tool and investigate its performance with representative workloads.
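Reaction networks of this kind are often simulated stochastically; as a toy stand-in for the graph model (not the M. genitalium network), a minimal Gillespie stochastic simulation algorithm on a one-species birth-death process looks like:

```python
import random

def gillespie(x, reactions, t_end, seed=0):
    """Minimal Gillespie SSA. `reactions` is a list of
    (propensity_function, state_change) pairs; `x` is the species count
    vector. Toy sketch only, far from an HPC-scale WCM simulator."""
    rnd = random.Random(seed)
    t, trace = 0.0, [(0.0, list(x))]
    while t < t_end:
        props = [rate(x) for rate, _ in reactions]
        total = sum(props)
        if total == 0:
            break
        t += rnd.expovariate(total)          # time to next reaction
        r = rnd.uniform(0, total)            # choose which reaction fires
        acc = 0.0
        for (rate, change), p in zip(reactions, props):
            acc += p
            if r <= acc:
                x = [xi + dxi for xi, dxi in zip(x, change)]
                break
        trace.append((t, list(x)))
    return trace

# Hypothetical mRNA birth-death process: synthesis 2.0/s, decay 0.1/s per copy
reactions = [
    (lambda s: 2.0,        [+1]),  # zeroth-order production
    (lambda s: 0.1 * s[0], [-1]),  # first-order degradation
]
trace = gillespie([0], reactions, t_end=100.0)
print(trace[-1])
```

The per-reaction kinetic rates are exactly the quantities missing from the published graph model, which is why parameterization is a prerequisite for simulation.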
1. Karr, J.R., Sanghvi, J.C., Macklin, D.N., Gutschow, M.V., Jacobs, J.M., Bolival Jr, B., Assad-Garcia, N., Glass, J.I. and Covert, M.W. (2012) A whole-cell computational model predicts phenotype from genotype. Cell, 150(2), pp. 389-401.
2. Burke, P.E., Claudia, B.D.L., Costa, L.D.F. and Quiles, M.G. (2020) A biochemical network modeling of a whole-cell. Scientific Reports, 10(1), pp. 1-14.
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-ABS-820025
Short Abstract: For optimal pancreatic cancer treatment, early and accurate diagnosis is vital. Blood-derived biomarkers and genetic predispositions can contribute to early diagnosis, but they often have limited accuracy or applicability. Here, we seek to exploit the synergy between both approaches by combining the biomarker CA19-9 with novel genetic variants. We aim to use deep sequencing and deep learning to improve differentiating resectable pancreatic ductal adenocarcinoma from chronic pancreatitis and to estimate survival.
We obtained samples of nucleated cells from the peripheral blood of around 270 patients suffering from resectable pancreatic ductal adenocarcinoma (rPDAC), non-resectable pancreatic cancer (nrPC), or chronic pancreatitis (CP). We sequenced mRNA with high coverage and reduced millions of raw variants to hundreds of high-quality, significant genetic variants. Together with CA19-9 values, these served as input to deep learning models that separate cancer from chronic pancreatitis.
Our deep learning models achieved an area under the curve (AUC) of 92% or better. In particular, differentiating resectable PDAC from pancreatitis can be solved with an AUC of 96%. Moreover, we identified genetic variants to estimate survival in rPDAC patients.
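The reported AUC metric can be computed directly from the rank statistics of model scores; a minimal sketch with made-up scores (not the study's data):

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability that a random positive scores higher than a random
    negative, counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy model outputs: 1 = rPDAC, 0 = chronic pancreatitis (hypothetical values)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc(labels, scores))
```

An AUC of 0.96, as reported for rPDAC versus pancreatitis, means a randomly chosen cancer sample outranks a randomly chosen pancreatitis sample 96% of the time.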
Overall, we show that the blood transcriptome harbours genetic variants, which can substantially improve non-invasive clinical diagnosis and patient stratification in pancreatic cancer using standard laboratory practices.
Short Abstract: Four Lactobacillus species tend to dominate the female urogenital microbiota of young healthy women: L. iners, L. crispatus, L. jensenii, and L. gasseri. Studies have associated probiotic influences with L. gasseri. Here we present the results of our comparative genomics study of L. gasseri and its sister taxon L. paragasseri, which is also found in the urogenital microbiota. To date, 68 genomes/assemblies are publicly available for these two species. Fifteen of these genomes are from urinary isolates. Other genomes include fecal, milk, and vaginal isolates. Our analysis identified several genomes classified as L. gasseri that were in fact members of L. paragasseri, confirmed by core genome phylogenomics and average nucleotide identity (ANI) analysis. Niche specialization was not detected. In other words, we did not identify any gene acquisitions specific to genomes of isolates from the urinary tract or from the vagina. After reclassifying these genomes, we found that urinary L. paragasseri strains encode the bacteriocin gasserin while L. gasseri strains do not. Bacteriocins are antimicrobial peptides and may provide L. paragasseri with the probiotic qualities previously associated with L. gasseri. Given the discrepancies found in the identification of isolates of these two species, future studies are needed to ascertain whether L. gasseri or L. paragasseri is in fact associated with probiotic benefits for the urinary tract.
Short Abstract: Natural control of HIV-1 infection in elite controllers (EC) is a characteristic of <1% of HIV-1-infected individuals. Challenges in elucidating the control mechanism have been attributed to considerable heterogeneity in the EC group, both between the sexes and in how the EC phenotype is maintained. We therefore sought to identify signaling pathways associated with the EC phenotype by proteomic (all male) and transcriptomic (male and female) analyses, and by integrative proteo-transcriptomic analysis in the male cohort. We found that HIF signaling and glycolysis are specific traits of the EC phenotype. Within these pathways, ENO1 was upregulated in EC at the protein level irrespective of sex. We also performed targeted transcriptomic analysis and identified dysregulation of HIF target genes as a specific feature of the EC group. We did not detect any difference in the protein levels of HIF-1α in total PBMCs between EC and HIV-1-negative individuals (HC). However, a higher proportion of HIF-1α and HIF-1β in the nuclei of CD4+ and CD8+ T cells was observed in male EC, indicating increased activation of the HIF signaling pathway, while females exhibited increased Akt phosphorylation. Furthermore, intracellular glucose levels were elevated in EC even as the surface expression of the metabolite transporters Glut1 and MCT-1 was decreased on lymphocytes, irrespective of sex. This indicates that the EC group has a unique metabolic uptake and flux profile. Combined, our data indicate how HIF signaling and glycolysis may contribute to the natural control of HIV-1 infection.
Short Abstract: Since the outbreak of the COVID-19 pandemic, several studies have concentrated on the biological and epidemiological factors governing COVID-19 transmission, while a few others have investigated the potential impact of socio-economic characteristics on the extent of COVID-19 diffusion in the population, especially those socio-economic determinants predating the pandemic. Societal and economic factors can be of critical importance for the accuracy of models of the outbreak because of the economic and health impacts of the drastic measures that have been put in place in an effort to slow the spread of the disease (e.g., social distancing, quarantine, lockdowns, testing, and reallocation of hospital resources).
In this work, we take a reverse perspective and analyze how socio-economic determinants predating the pandemic relate to the number of reported cases, deaths, and the deaths/cases ratio of COVID-19 early in the pandemic via machine learning methods. Our focus is on understanding the connection between epidemiological variables of the COVID-19 pandemic and the (i) level of health care infrastructure, (ii) general health of the population, (iii) economic factors, (iv) demographic structure, (v) environmental health, (vi) societal characteristics, and (vii) religious characteristics of a country. We hypothesize that different countries have different specific socio-economic features, and that government measures must therefore be smart and heterogeneous across countries and across resources to be effective against the incidence of the disease.
Using epidemiological data (number of cases, number of deaths, and number of deaths per number of cases) as our outcome variables (Y), socio-economic determinants as our predictors (X), and a geographical weighting matrix (A) consisting of pairwise distances between capital cities worldwide, we analyze data for 199 countries with 32 interpretable models, including (i) regression models, and (ii) variable selection through LASSO. Then, we build a Signed Importance Index (SII) to help focus only on those findings that are common to a majority, thereby reducing the sensitivity of our findings to the limitations of any single model considered.
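The abstract does not give the SII formula; one plausible sketch of its purpose, aggregating coefficient signs across models with made-up coefficients (a hypothetical reconstruction, not the published index), is:

```python
def signed_importance(coef_by_model):
    """Toy Signed Importance Index: for each predictor, average the sign of
    its coefficient across models and keep only predictors whose sign is
    consistent in a majority of models. Hypothetical reconstruction of the
    abstract's SII, not the published formula."""
    n_models = len(coef_by_model)
    predictors = {p for coefs in coef_by_model for p in coefs}
    index = {}
    for p in predictors:
        signs = [(c > 0) - (c < 0)
                 for coefs in coef_by_model
                 for c in [coefs.get(p, 0.0)] if c != 0.0]
        score = sum(signs) / n_models
        if abs(score) > 0.5:  # a majority of models agree on the sign
            index[p] = round(score, 2)
    return index

# Made-up coefficients from three of the 32 models (names are illustrative)
models = [
    {"hospital_beds": -0.8, "gdp": 0.3, "tourism": 0.5},
    {"hospital_beds": -0.5, "tourism": 0.4},
    {"hospital_beds": -0.2, "gdp": -0.1, "tourism": 0.6},
]
print(signed_importance(models))
```

Predictors with inconsistent signs across models (here, "gdp") drop out, which is the stated goal: keeping only findings common to a majority of models.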
Our analysis determines that the socio-economic status of a country follows some sort of Action-Reaction Principle, as it is not only heavily influenced by the pandemic (we did not address this in the study, but it is an established fact in the literature), but it is also a distinct factor of the early pandemic and of course must be taken into consideration by governments. Our results suggest that governments might need to allocate healthcare resources heterogeneously, with a possible benefit in decentralizing healthcare in the set of possible initial measures. This could be a problem for developing countries, where the means are limited. Countries with more economic equity among their citizens seemed less hit by COVID-19, possibly indicating the importance of having a minimal baseline assistance across the whole population of a country as a condition for rapid response. The reduced degree of mobility across countries, for example the degree to which tourism is constrained, had a positive effect early on in reducing the number of cases, deaths, and death rate per cases. However, there is an indication that a smart and alternating policy could lead to further containment of the disease. Furthermore, our analysis highlighted the benefit of informing the population for government measures to be more effective. Together, our results seem to indicate that blanket policies are sub-optimal and government measures related to healthcare and immigration have the potential to both help and damage the population, as, if not appropriately taken, they can lead to an increase or reduced decrease of COVID-19 cases, deaths, and deaths/cases rate.
Short Abstract: Understanding how organ-specific endothelial cell (EC) transcriptomic signatures change over time in response to a systemic disease or injury is key to designing new targeted therapies for vascular disease. Existing computational tools to analyze longitudinal transcriptomic data do not adequately classify distinct temporal patterns, and many existing approaches also do not take the temporal sequence into account when identifying dynamic differentially expressed genes (DDEGs).
We developed the R package TrendCatcher, which applies a constant negative binomial (NB) model to estimate the baseline fluctuation confidence interval, along with a smoothing spline ANOVA model to estimate the dynamic signal at each non-baseline time point. Combined with a break-point searching strategy, TrendCatcher achieved higher accuracy in predicting dynamic genes on simulated data compared to existing algorithms such as DESeq2+spline and ImpulseDE2. TrendCatcher has an accuracy of 78.8% for linear trajectories, 74.5% for impulse-shaped trajectories, and 98.0% for trajectories with 2 to 3 break points. To evaluate the false positive rate, we tested it using embedded constant trajectories; the false positive rate is only 3.5%.
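TrendCatcher's core idea of testing each time point against a baseline band and summarizing the trajectory as a break-point pattern can be caricatured in a few lines; the fixed band and counts below are toy stand-ins for the fitted NB/spline model:

```python
def trend_string(counts, baseline_low, baseline_high):
    """Encode a time course as a trend pattern: 'up'/'down' when a point
    leaves the baseline confidence band, 'stable' otherwise. A toy stand-in
    for TrendCatcher's NB baseline test, which estimates the band from data."""
    calls = []
    for c in counts:
        if c > baseline_high:
            calls.append("up")
        elif c < baseline_low:
            calls.append("down")
        else:
            calls.append("stable")
    # collapse consecutive repeats into the trajectory's break-point pattern
    pattern = [calls[0]]
    for c in calls[1:]:
        if c != pattern[-1]:
            pattern.append(c)
    return "-".join(pattern)

# Hypothetical normalized counts at 6 time points; toy baseline band [80, 120]
counts = [100, 95, 310, 400, 150, 60]
print(trend_string(counts, 80, 120))
```

Genes sharing the same trend string can then be grouped into the master temporal patterns used to compare organs.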
Interestingly, we found the accuracy of the other two methods dropped significantly when more complicated longitudinal trajectories were embedded. We then applied TrendCatcher to a real-world dataset in which RNA-Seq was performed on blood vessel endothelial cells from the heart, brain and lungs at distinct time points after severe systemic inflammatory injury. TrendCatcher identified 2,047, 2,044 and 2,632 inflammatory response dynamic genes in the brain, heart, and lung endothelium, respectively. Importantly, TrendCatcher also identified distinct time courses of DDEGs in each organ, suggesting that the lung endothelium may show much earlier vulnerability to systemic inflammation than the brain endothelium.
In conclusion, we report a novel tool for the identification of dynamic differentially expressed genes (DDEGs) in temporal RNA-seq analysis that can provide novel biological insights into the time courses of transcriptional stress responses.
Short Abstract: Introduction: Prematurity is the foremost cause of death in children under 5 years. Genetics contributes to 25-40% of all preterm births (PTB), yet we still need to identify specific targets for intervention based on new genetic pathways. Objective: To identify potential therapeutic targets, and corresponding protein cavities and their binding interactions with intervening compounds, to manage the challenge of PTB. Methodology: On the basis of a literature survey, we searched for 20 genes coding 55 PTB proteins in the National Center for Biotechnology Information (NCBI) PubMed database (January 2009 to April 2020). Single nucleotide polymorphisms (SNPs) of the concerned genes were extracted from ENSEMBL, and exonic (non-synonymous) variants were selected. Several in-silico downstream protein functional effect prediction algorithms were used to identify damaging variants. Rare coding variants were selected with an allele frequency of ≤1% in the 1000 Genomes Project, further supported by South Asian (SAS) population frequency from the Allele Frequency Aggregator (ALFA), and expression status within the female reproductive tissues from the GTEx database. Structural protein identification (homology modeling) of CNN1 isoform 2 and a blind docking approach were used to explore the binding cavities and molecular interactions with progesterone (as various synthetic derivatives are commonly used to manage PTB), ranked with energetic estimations. Binding cavities were further validated through the COACH meta-server. PTB-related drugs were screened and shortlisted from the DrugBank database through target identification and Lipinski's rule of 5. Five screened PTB-related drugs were selected for binding energy estimations with the CNN1 isoform 2 protein. Finally, the energy-based binding outputs were evaluated in terms of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties by performing in-silico pharmacokinetics profiling of the PTB-related drugs.
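The rare-variant selection step (exonic, non-synonymous, allele frequency ≤1%) can be sketched as a simple filter; the record fields and values below are hypothetical, not actual ENSEMBL/ALFA output:

```python
def select_rare_coding(variants, max_af=0.01):
    """Keep exonic, non-synonymous variants with allele frequency <= max_af,
    mirroring the 1% rarity cutoff described in the abstract. Variant records
    are hypothetical dictionaries, not a real annotation format."""
    return [
        v for v in variants
        if v["consequence"] == "missense" and v["af_1000g"] <= max_af
    ]

# Made-up variant records with 1000 Genomes allele frequencies
variants = [
    {"id": "rs001", "consequence": "missense",   "af_1000g": 0.004},
    {"id": "rs002", "consequence": "missense",   "af_1000g": 0.23},
    {"id": "rs003", "consequence": "synonymous", "af_1000g": 0.002},
    {"id": "rs004", "consequence": "missense",   "af_1000g": 0.01},
]
print([v["id"] for v in select_rare_coding(variants)])
```

In the actual workflow, surviving variants were additionally cross-checked against SAS population frequencies and GTEx tissue expression before structural analysis.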
Findings: Four out of 20 genes were identified with 74 rare pathogenic variants. Of these, three genes have partially resolved protein structures with limited coverage. We focused on CNN1 isoform 2, the gene most highly expressed in female reproductive organs, with a maximum structural coverage of ~41% to its template (PDB ID: 1WYP). The modeled structure revealed 5 binding cavities with progesterone. Molecular interactions of CNN1 with progesterone were investigated manually, by LigPlot 2D analysis, and by 3D visual inspection. The molecular estimation of drug-based target inhibition showed the high binding affinity of hydroxyprogesterone caproate with CNN1 isoform 2, i.e., -11.14 kcal/mol with an inhibition constant (Ki) of 69 nM. Conclusion: The calponin-1 gene and its molecular interaction analysis could serve as intervention targets for the prevention of PTB.
Short Abstract: Introduction:
Acute myeloid leukemia (AML) research has identified some recurrent mutations, such as KMT2A gene rearrangements, which are present in ~65% of childhood AML samples, but the development of the disease is still not well understood. Research shows that rearrangement of the KMT2A gene alone can trigger AML in our models, but this is not the case in patients, who carry recurrent secondary mutations. Our hypothesis is that by comparing the effect of therapeutic compounds between models and patients, we will be able to demonstrate the impact of recurrent secondary mutations despite patient heterogeneity.
Methods and results:
We performed a high-throughput chemical screening of 12,000 compounds at a single dose. The compounds will be tested on 50 samples from patients and from model leukemias carrying different KMT2A gene rearrangements, for which we will also have expression and exome data. We have already identified compounds that specifically inhibit the proliferation of patient leukemia cells and could be used in therapy, as well as compounds with effects on different KMT2A fusion subgroups.
This work will advance our understanding of the biology of acute myeloid leukemia and may also help identify new compounds for therapeutic use in patients, as well as support the development of clinical diagnostics.
Short Abstract: One of mankind’s greatest challenges is fighting a never-ending war against pathogens. Pathogenic bacteria encode genes for a myriad of mechanisms to cause disease in human hosts. Antibiotics, a relatively new weapon against microbes, have proven successful for decades. These drugs frequently disrupt a bacterium’s ability to replicate its DNA or perform vital cell processes, or weaken or destroy its cell wall. However, bacteria such as Escherichia coli do not remain idle: an increasing number of bacterial strains have gained resistance to our antibiotics. This growing antibiotic resistance phenomenon is exacerbated by both frequent misuse and recurrent overuse of drugs, which create strong evolutionary selective pressure and produce increasingly resistant strains of bacteria that require even stronger medication. This project therefore aims to predict a bacterial strain’s resistance or susceptibility to antibiotics by analyzing the presence or absence of genes in each phenotypic group. Our data comprised 66 E. coli samples collected from the urinary tracts of patients. These samples were tested for resistance to the drugs amoxicillin, ciprofloxacin, cefpodoxime, fosfomycin, and sulfamethoxazole-trimethoprim. Using these phenotypic groups, we compared the genetic content of organisms across groups to determine which genes were unique to the resistant phenotype. The genes unique to the resistant phenotype were then clustered using USEARCH. Members of each cluster were reciprocally BLASTed against one another to determine similarity. In addition, the clusters were BLASTed against the susceptible group to ensure they were entirely distinct from any gene in the antibiotic-susceptible phenotype. Thus, the outcome of our work is a method for determining which antibiotics a given strain is resistant to based purely on genetic content.
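The presence/absence comparison described above can be sketched with simple set operations: collect genes seen in any resistant strain but in no susceptible strain, then call a new strain resistant if it carries any of them. Strain IDs and gene names below are hypothetical.

```python
# Sketch of the presence/absence logic: genes unique to the resistant
# phenotype become markers. Strain IDs and gene names are hypothetical.

resistant_strains = {
    "ec01": {"blaTEM", "sul1", "gyrA"},
    "ec02": {"blaTEM", "gyrA"},
}
susceptible_strains = {
    "ec03": {"gyrA"},
    "ec04": {"gyrA", "recA"},
}

# Genes present in at least one resistant strain but in no susceptible strain
resistant_pool = set().union(*resistant_strains.values())
susceptible_pool = set().union(*susceptible_strains.values())
resistance_markers = resistant_pool - susceptible_pool

def predict_resistant(genes, markers=resistance_markers):
    """Predict resistance if the strain carries any marker gene."""
    return bool(genes & markers)
```

In practice the marker sets would be built per drug from the clustered, reciprocally BLASTed gene groups rather than from raw gene names.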
Short Abstract: Genome-scale CRISPR loss-of-function screens are an increasingly popular experimental platform for investigating potential genetic interactions and druggable targets. While the first generation of single-targeting CRISPR-Cas9 screens only allowed for the targeting of a single genomic location per guide construct, next generation combinatorial CRISPR screens allow for the concurrent, systematic perturbation of multiple genomic locations across the genome. Combinatorial screens are typically enabled by the simultaneous use of alternative Cas enzymes such as Cas12a, allowing for the joint targeting of distinct genomic regions.
Recently, we developed a novel CRISPR screening system named CHyMErA that enables both the direct measurement of genetic interactions between two genes and the deletion of sizeable genomic fragments, by pairing a Cas9 guide with one or more Cas12a guides expressed from the same hybrid guide RNA (Gonatopoulos-Pournatzis et al. 2020). We employed CHyMErA screens to identify genetic interactions between roughly 600 pairs of paralogous human genes, and additionally demonstrated that combinatorial screens outperform single-targeting screens by inducing multiple cuts in genes targeted for knockout. To account for differences introduced by guide orientation (whether the first guide along a construct targets gene A and the second targets gene B, or vice versa), we developed a novel scoring method that scores each orientation separately before aggregating results across orientations.
To extend this scoring method and enable the accurate and reproducible scoring of combinatorial CRISPR screens in general, we present a novel computational workflow named Orthrus. The key features of Orthrus include flexible scoring appropriate for multiple experimental platforms, a large variety of automatically generated QC plots that help users diagnose potential issues with their screens, and several convenient interfaces that enable either concise, quick scoring or fine-grained control over all steps of the scoring pipeline. We apply Orthrus to score both previously published CHyMErA screening data and data from a separate experimental platform that employs multiplexed Cas12a guides, and show that Orthrus successfully diagnoses and accounts for screen quality issues and accurately quantifies genetic interactions between human paralogous genes. The Orthrus scoring workflow is available as an open-source R package at github.com/csbio/orthrus.
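As a rough illustration of orientation-separate scoring (a simplified sketch, not the actual Orthrus implementation), each orientation can be scored as a log2 fold-change of end-point versus starting read counts, and the per-orientation scores then averaged:

```python
# Simplified sketch of orientation-aware scoring (not the Orthrus code):
# score each guide orientation separately, then aggregate by averaging.
import math

def log2_fc(t_end, t_start, pseudocount=0.5):
    """Log2 fold-change of read counts with a pseudocount for stability."""
    return math.log2((t_end + pseudocount) / (t_start + pseudocount))

def score_pair(counts):
    """counts maps orientation -> (start_reads, end_reads) for one gene pair."""
    per_orientation = [log2_fc(end, start) for start, end in counts.values()]
    return sum(per_orientation) / len(per_orientation)

# Hypothetical read counts for the two orientations of a pair targeting A and B
counts = {"A_B": (200, 50), "B_A": (180, 60)}
```

A dropout (negative score) in both orientations, as in this toy example, is more convincing evidence of a fitness effect than a dropout in only one.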
Short Abstract: Mucopolysaccharidoses (MPS) are storage diseases characterized by defects in the activity of lysosomal hydrolases. Patients present progressive multisystemic involvement of several tissues and organs. Neurological impairment is caused by secondary alterations due to the accumulation of substrates, through mechanisms that are not well elucidated. This work aimed to identify hub genes using different topological algorithms; we also evaluated the performance of local- and global-based methods on the different networks. We retrieved transcriptomic data from MPS with neurological impairment from GEO (GSE111906 for MPS I, GSE95224 for MPS II, GSE23075 for MPS IIIB, GSE15758 for MPS IIIB, and GSE76283 for MPS VII). We constructed the networks using the STRING database v.11, and hub gene analysis was performed with Cytohubba v.0.1 in Cytoscape v.3.8. We used all the local-based methods (MCC, DMNC, MNC, and Degree) and all the global-based methods (EPC, BottleNeck, EcCentricity, Closeness, Radiality, Betweenness, and Stress), calculating the scores of the top 10 nodes. We checked the first-stage nodes and the expanded subnetwork. In the hub gene analysis, GSE15758 showed 46 unique genes, GSE23075 45, GSE76283 44, GSE111906 43, and GSE95224 39. A Venn diagram shows 2 genes shared between GSE95224 and GSE76283 (Penk and C3ar1), 1 gene between GSE15758 and GSE76283 (Ptpn6), and 1 gene between GSE95224 and GSE15758 (Plcb1). Considering the 55 different pairwise combinations of the algorithms, we calculated the percentage of genes shared in each combination. Only two combinations shared the same 10 genes (GSE23075: EPC versus MCC, and GSE23075: MNC versus Degree). For the best performance, we recommend choosing a hybrid combination: one local-based and one global-based method. Our results show the importance of testing different algorithms when searching for hub genes in network analysis.
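The pairwise-overlap calculation across ranking algorithms can be sketched as follows; the top-10 gene lists here are toy data, not the actual GEO results:

```python
# Sketch of the pairwise comparison: percentage of shared genes between the
# top-10 lists of each pair of ranking algorithms. Gene lists are toy data.
from itertools import combinations

top10 = {  # hypothetical top-10 hub gene lists per algorithm
    "MCC":    {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"},
    "EPC":    {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"},
    "Degree": {"g1", "g2", "g11", "g12", "g13", "g14", "g15", "g16", "g17", "g18"},
}

overlap = {  # (algorithm A, algorithm B) -> percent of shared top-10 genes
    (a, b): 100 * len(top10[a] & top10[b]) / 10
    for a, b in combinations(sorted(top10), 2)
}
```

In the toy data, EPC and MCC agree completely (100%), mirroring the kind of full agreement reported for GSE23075, while Degree shares only 20% with either.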
Short Abstract: The bacterial adaptive immune system CRISPR has shown promise as a tool for genetic engineering and, more recently, as a next-generation antimicrobial agent. To perform these functions, the CRISPR-Cas9 nuclease is targeted to a site by a single guide RNA molecule (sgRNA), where it introduces a double-strand DNA break. The success of this tool is limited by the wide variation in activity across targeting sgRNA sequences. Over the past decade there have been several data-driven attempts to predict the on- and off-target activity of sgRNAs, primarily in eukaryotic cells. However, predictions from both eukaryotic and prokaryotic models failed to correlate with the results of our Salmonella enterica CRISPR-Cas9 sgRNA cleavage activity testing, a clear issue with current model generalization. Additionally, other models of sgRNA activity prediction focus only on on-target activity and do not take into account sgRNA toxicity or other undesirable factors such as off-target activity.
Our model utilizes de novo time-series and optical-density CRISPR-Cas9 sgRNA cleavage activity and toxicity data in E. coli. The sgRNA sequences are transformed by one-hot encoding, and their associated cleavage efficiencies are normalized using the CoDaSeq Aitchison mean. These data are fed into a two-stage model combining a convolutional neural network with a long short-term memory recurrent neural network, providing a prediction of CRISPR-Cas9 activity for a specified sgRNA sequence. With the integration of off-target cleavage information, sgRNA-target thermodynamic stability data, and further hyperparameter tuning, our model should additionally be able to distinguish desirable sgRNAs from sgRNAs with both high on- and off-target activity. We intend to benchmark our model against prior prokaryotic models and demonstrate its ability to generalize across various bacterial CRISPR-Cas9 sgRNA activity datasets.
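The one-hot encoding step mentioned above can be sketched in a few lines; this covers only the input transformation, not the CNN-LSTM model itself:

```python
# One-hot encoding of a DNA/sgRNA sequence: each base becomes a 4-element
# indicator vector over the alphabet A, C, G, T.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element indicator vectors."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

encoded = one_hot("ACGT")  # identity-like pattern, one row per base
```

A real pipeline would stack these per-sequence matrices into a tensor of shape (n_sequences, guide_length, 4) before feeding them to the network.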
Short Abstract: RNA editing is a post-transcriptional modification that serves to increase transcriptome diversity. This includes adenosine-to-inosine (A-to-I) deamination catalyzed by adenosine deaminase acting on RNA (ADAR) and cytosine-to-uracil (C-to-U) deamination catalyzed by apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like (APOBEC) enzymes. Some of these enzymes, including APOBEC3A and the ADAR p150 isoform, are induced by interferons during viral infection and play a role in the antiviral immune response by editing viral genomes, including in SARS-CoV-2 infection. One challenge in elucidating RNA editing patterns, however, is linking changes in the expression of these enzymes to changes in editing patterns in the host transcriptome. This is important for better understanding the effects of viral infection: for example, ADAR p150 (which normally regulates neural signaling and plasticity) is activated during viral infections, including SARS-CoV-2, and over- or under-editing of host transcripts due to infection may contribute to the observed neurological complications. We use the publicly available RNA-seq dataset PRJNA625518 of Calu-3 lung adenocarcinoma cells infected with SARS-CoV-1 or SARS-CoV-2 to examine editing levels. Overall ADAR (but not APOBEC) editing levels increased in both viral infections, with SARS-CoV-2 showing higher levels than SARS-CoV-1, including in immune cytokine signaling transcripts. Notably, some other sites were under-edited in SARS-CoV-2 infection, including in pathways related to gene expression. Interestingly, there was no significant correlation between ADAR or APOBEC expression levels and overall editing, indicating that editing is nuanced and may be regulated at the level of individual transcripts or sites, or that some of the editing burden shifts from the host to the viral genome.
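Per-site editing levels of the kind analyzed here are commonly computed as the fraction of reads carrying the edited base at each site; a minimal sketch with hypothetical read counts:

```python
# Minimal sketch of a per-site editing-level calculation: the fraction of
# reads carrying the edited base (e.g., G at an A-to-I site, read as A->G).
# Read counts are hypothetical.

def editing_level(edited_reads, total_reads):
    """Editing level at one site; zero-depth sites return 0.0."""
    return edited_reads / total_reads if total_reads else 0.0

sites = {  # site -> (edited_reads, total_reads)
    "chr1:1000": (12, 100),
    "chr2:5000": (0, 40),
}
levels = {site: editing_level(e, t) for site, (e, t) in sites.items()}
```

Comparing such per-site levels between infected and mock conditions, rather than only enzyme expression, is what allows over- and under-edited transcripts to be distinguished.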
Short Abstract: This project integrates all available studies relating breast cancer to the mammary microbiome to (1) reassess the original findings in light of advances in this rapidly progressing field and (2) combine all available data in a large meta-analysis to identify general trends and specific differences across patient cohorts and studies. To reassess the original findings of each study, the 16S rRNA sequence data are retrieved from SRA and analyzed with a bioinformatics pipeline. The pipeline uses an amplicon sequence variant (ASV) approach, as ASVs provide greater resolution and specificity and fewer false positives than the mixture of approaches used in the original studies (Prodan et al. 2020. PLoS One 15:e0227434). The results from the pipeline are being compared with the findings of the original studies. To identify specific trends and differences across patient cohorts and studies, I am currently compiling the results from each of the available studies into a single dataset to assess trends in microbial community interactions with breast cancer across studies. As different patient cohorts, 16S rRNA variable regions, and experimental protocols were used across the studies of interest, I will statistically account for these covariates in my model. The final ASV tables from the pipeline analysis of each study will be merged and assessed using multivariate analysis to identify trends that, once the covariates are accounted for, remain significantly associated with the disease (Goodrich et al. 2014. Cell 158:250-262).
Short Abstract: Understanding complex biological systems and diseases requires an investigative look at different molecular domains. Researchers now have access to transcriptomics, genomics, and proteomics data to investigate the underlying systems biology of cellular processes and disease signatures, and there is a major effort to leverage these omics datasets in multi-omics integration analyses. However, little effort has gone into integrating broad-based protein kinase activity, also called the active kinome, with gene expression datasets. Such integration is critical to move the field forward, since gene expression data are limited in their ability to reflect the functional state of a biological system; changes in post-translational modification, such as phosphorylation, can have a profound effect on cellular functions without any change in the expression levels of the signaling molecules in the pathway. Given the importance of the active kinome in signal transduction and its role in many cellular and disease processes, we developed a pipeline, named "KinoGrate", to integrate active kinome and gene expression signatures using a network-based algorithm. The prize-collecting Steiner forest (PCSF) algorithm is used to generate and identify biological subnetworks that characterize disease signatures and highlight potential novel drug targets. We used our software to model networks of patient-derived pancreatic ductal adenocarcinoma cell lines by integrating active kinome and gene expression signatures. The results highlighted known pancreatic cancer (PC) targets such as growth factor receptor-bound protein 2 (GRB2), Jun proto-oncogene (JUN), and β-catenin (CTNNB1), as well as potential novel targets such as peroxisome proliferator-activated receptor delta (PPARD) and REST corepressor 1 (RCOR1). Our approach demonstrated its ability to highlight known PC targets while also yielding new leads that can be further investigated.
Short Abstract: 16S rRNA gene sequencing of clinically uninfected hip and knee implants has revealed polymicrobial populations. However, previous studies assessed 16S rRNA gene sequencing as a technique for the diagnosis of periprosthetic joint infections, leaving the microbiota of presumed aseptic hip and knee implants largely unstudied. These communities of microorganisms might play important roles in aspects of host health, such as aseptic loosening. Therefore, this study sought to characterize the bacterial composition of presumed aseptic joint implant microbiota using next generation 16S rRNA gene sequencing, and it evaluated this method for future investigations.
A total of 252 samples were collected from the implants of 41 patients undergoing total hip or knee arthroplasty revision for presumed aseptic failure. DNA was extracted, and amplicons of the V4 region of the 16S rRNA gene were sequenced. Sequencing data were analyzed and compared with ancillary specific PCR and microbiological culture. Computational tools (SourceTracker and decontam) were used to detect and compensate for environmental and processing contaminants.
Microbial diversity of patient samples was higher than that of open-air controls, and differentially abundant taxa were detected between these conditions, likely reflecting a true microbiota present in clinically uninfected joint implants. However, novel positive-control-associated artifacts and DNA extraction methodology significantly affected sequencing results. In addition, sequencing failed to identify Cutibacterium acnes in culture- and PCR-positive samples. These challenges limited characterization of bacteria in presumed aseptic implants, but genera were identified for further investigation.
In all, we demonstrate that there is likely a microbiota present in clinically uninfected joint implants, and that methods other than 16S rRNA gene sequencing may be ideal for its characterization. This work has illuminated the importance of further study of microbiota of clinically uninfected joint implants with novel molecular and computational tools to further eliminate contaminants and artifacts that arise in low bacterial abundance samples.
Short Abstract: Metagenomic experiments characterize the microbial community of an environment using high-throughput DNA sequencing. These experiments profile both the presence and abundance of microbes by mapping sequencing reads to a database of known genomes. Previously, high-throughput 16S rRNA gene amplicon sequencing (16S) was primarily used to characterize only the bacteria present, often resolving taxonomy at the genus level. As sequencing costs have fallen, higher-resolution whole-genome shotgun (WGS) sequencing has become widely available; it captures all domains of organisms present and can resolve taxonomy at the species to strain level. However, many microbial reference genomes are mapped erroneously because of genes shared between the sampled sequences and genomes in the database. Filtering out these false-positive genomes has typically been done by dropping species with low prevalence and abundance, potentially erasing rare species and jeopardizing the recall of these profiles. Identifying all the true species present in a metagenomic dataset is essential for generating accurate taxonomic features from these high-resolution WGS metagenomic datasets. We propose a machine learning pipeline to accurately predict the species present in a WGS metagenomic experiment. Our engineered features include coverage, taxonomic, and abundance statistics of the query and reference sequences. We rigorously trained and verified our approach using precision-recall curves and F1 scores on mock communities with known members, comparing it to the traditional filtering method. We also rarefied the samples to show viability at many sequencing depths. Finally, we show the efficacy of our approach by applying it to several WGS metagenomic datasets.
These results will allow microbiome experiments to confidently identify the presence of species in metagenomic experiments, allowing these high-resolution WGS datasets to be more widely accessible and useful.
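The precision/recall/F1 evaluation against mock communities can be sketched as follows; the species sets are toy data, not actual mock-community members:

```python
# Sketch of the evaluation described above: precision, recall, and F1 of a
# predicted species set against a mock community with known members.

def prf1(predicted, truth):
    """Return (precision, recall, F1) for set-valued predictions."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {"E. coli", "S. aureus", "P. aeruginosa", "L. monocytogenes"}
predicted = {"E. coli", "S. aureus", "B. subtilis"}  # one false positive
precision, recall, f1 = prf1(predicted, truth)
```

Sweeping a filtering threshold and recomputing these values at each point yields the precision-recall curves used to compare the pipeline with abundance-based filtering.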
Short Abstract: Recent advances in microarray and sequencing technology have provided large-scale DNA methylation data sets that can be used for understanding the molecular pathways involved in cancer development. However, there is a lack of principled, computationally tractable statistical models for identifying recurring methylation patterns and the samples associated with those patterns. We introduce a novel Poisson-Beta Bayesian statistical model whose generative process is inspired by the true data generation mechanism. We show that although the model is non-conjugate, the complete conditional distribution of the latent Poisson counts is a form of the Bessel distribution and the marginal likelihood is the doubly non-central Beta distribution. The model allows the number of sample clusters to differ from the number of feature clusters (genes or CpG islands), enabling a flexible representation of the data. We derive an elegant and efficient auxiliary-variable Gibbs sampler for posterior inference. We show that this model is competitive with current methods (NMF, PCA) in terms of a weighted metric of stability and reconstruction error, as well as out-of-sample prediction accuracy. This model is more broadly useful for representation learning in real-valued data bounded by the unit interval, which includes many types of genomic data sets.
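A loose sketch of a Poisson-Beta-style generative process may help fix ideas. The model's exact parameterization is not given in the abstract, so the thinning construction below (a Beta-distributed methylation propensity thinning a Poisson read count) is an assumed illustration, not the authors' model:

```python
# Assumed illustration of a Poisson-Beta generative process for methylation
# values bounded by the unit interval: a Beta-distributed latent level thins
# a Poisson read depth. This is NOT the exact model from the abstract.
import math
import random

random.seed(0)

def poisson(lam):
    """Sample a Poisson count via Knuth's algorithm (fine for modest lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def sample_site(alpha, beta, mean_depth):
    """Generate a methylation fraction in [0, 1] for one site."""
    level = random.betavariate(alpha, beta)   # latent methylation level
    depth = poisson(mean_depth)               # latent Poisson read count
    methylated = sum(random.random() < level for _ in range(depth))
    return methylated / depth if depth else 0.0
```

Sampling from such a generative process is useful for sanity-checking an inference algorithm: a Gibbs sampler run on simulated data should recover the simulation parameters.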
Short Abstract: The majority of disease-associated genetic polymorphisms fall outside protein-coding sequences. Causal noncoding variants are thought to contribute to disease phenotypes by altering transcription factor (TF) binding and gene expression in specific cell types. The Assay for Transposase-Accessible Chromatin (ATAC-seq) opens new opportunities to understand gene regulation and noncoding genetic variants. This easy-to-use, popular technique provides high-resolution chromatin accessibility with low sample input requirements. Because TF binding is affected by chromatin accessibility (and vice versa), ATAC-seq is valuable for the prediction of TF binding, especially for rare cell types in vivo, where scarce sample material precludes direct measurement of TF binding (e.g., by ChIP-seq). In 2017, the “ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge” established two top-performing TF-binding prediction algorithms that vastly improved performance over simple motif scanning methods (median AUPR 0.4 versus 0.1). While ATAC-seq (and now scATAC-seq) datasets grow exponentially, sub-optimal TF motif scanning is still commonly used for TF binding prediction, likely due to the limited TF coverage of existing models (top-performing models cover 12-31 TFs, versus ~1200 human TFs with motifs in CIS-BP).
Here, we present “maxATAC”, a suite of user-friendly deep convolutional neural network (CNN) models for inferring TF binding from ATAC-seq experiments. Using a novel natural language processing algorithm, we identified existing data and generated new data to build state-of-the-art supervised learning models, trained on ATAC-seq and ChIP-seq data, for >50 TFs. Computational speed (even for predictions from trained models) is another barrier to widespread adoption of top-performing methods; thus, we propose novel strategies to build efficient methods with little cost in accuracy. We explore new CNN architectures, training routines, and the utility of hybrid modeling with DNase-seq to expand the coverage of TF models.
Following benchmarking, we apply maxATAC to ATAC-seq of T-cell activation time courses in naïve and memory human T cells. We integrate the dynamic in-silico TF binding predictions with genetic variants associated with inflammatory diseases, identifying TFs whose predicted binding significantly overlaps disease-associated genetic variants. Our results nominate TFs that might connect underlying genetic variability to altered gene regulation in T cells.
Short Abstract: The adaptation of intracellular pathogens to host-derived stresses depends on resistance and persistence, key contributors to virulence. Metabolic adaptation to the host cell is important for Chlamydia trachomatis (Ct). We studied flux differences in Ct using comprehensive pathway modeling based on proteome and qRT-PCR data, systematically comparing the comparatively inert infectious elementary body (EB) and the actively replicating reticulate body (RB) with a genome-scale metabolic model of 321 metabolites and 277 reactions. This yielded 84 extreme pathways based on a published proteomics dataset at three different time points of infection.
We further developed a model focusing on host-pathogen interactions, which involve not only metabolic but also signaling pathways. This simulation-based study covers various transcriptome and proteome datasets at several time points. The dynamics of proteins involved in glycolysis, apoptosis, and inflammation can reveal the strategies the bacterium uses to escape the host immune response.
Short Abstract: EB1 is a protein that tracks growing microtubule plus-ends, manifesting as a comet-like structure slightly behind the microtubule tip (‘EB1 tip tracking’). EB1 localizes proteins to growing microtubule plus-ends that normally have little or no affinity for microtubules, and thus EB1 plays a central role in multiple cellular processes. The canonical model for EB1 tip tracking suggests that EB1 binds a complete pocket between four GTP-tubulin heterodimers and then dissociates as GTP hydrolyzes to GDP. Recent work, however, suggests that EB1 may preferentially bind incomplete binding pockets, in which fewer than four tubulin heterodimers comprise the pocket. In this work, EB1 showed a dramatically increased binding rate onto incomplete binding pockets (edge sites) as compared to complete binding pockets (lattice sites), which present greater steric hindrance to binding. This finding suggests that edge sites could play a vital role in EB1’s ability to tip track microtubules. Thus, to examine the degree to which edge sites affect EB1 tip tracking, we created a stochastic model that allowed us to determine the importance of edge binding for EB1 tip tracking. Our model predicts that without edge binding, EB1 does not properly tip track. Furthermore, cell-free work demonstrates that in the presence of Eribulin, less Mal3 (yeast EB1) accumulates at the plus end of dynamic microtubules. We propose that EB1 binding to GTP edge sites plays an important role in proper EB1 accumulation at the microtubule plus end.
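The contrast between edge-site and lattice-site binding can be illustrated with a toy per-time-step Bernoulli simulation; the on-rates below are illustrative placeholders, not fitted parameters from the actual model:

```python
# Toy stochastic sketch of edge vs. lattice binding: EB1 arrivals at a site
# are simulated as Bernoulli events per time step, with a higher on-rate at
# edge sites. Rate values are illustrative only, not model parameters.
import random

random.seed(42)

def simulate_arrivals(on_rate, steps=10000, dt=0.001):
    """Count binding events over `steps` time steps of length `dt` seconds."""
    return sum(random.random() < on_rate * dt for _ in range(steps))

edge_events = simulate_arrivals(on_rate=50.0)    # fast binding at edge sites
lattice_events = simulate_arrivals(on_rate=5.0)  # slower at complete pockets
```

Even this crude sketch reproduces the qualitative point: over the same interval, edge sites accumulate roughly tenfold more binding events than lattice sites.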
Short Abstract: Molecular evolution and phylogeny can provide key insights into pathogenic protein families. Studying how these proteins evolve across bacterial lineages can help identify lineage-specific and pathogen-specific signatures and variants, and consequently their functions. We have developed a streamlined computational approach for the molecular evolution and phylogeny of target proteins that is widely applicable across protein and pathogen families. The approach starts with one or more query proteins whose homologs are identified with the BLAST+ suite and MMseqs2. These proteins and their homologs are then run through InterProScan and RPS-BLAST to characterize their domain architecture, i.e., the constituent domains that comprise each protein. Finally, custom R scripts generate the phyletic spreads and all other data summarization and visualization. In addition, we have developed the MolEvolvR web app, which enables biologists to run our entire molecular evolution and phylogeny approach on their data by simply uploading a list of query proteins. The web app accepts inputs in multiple formats: protein/domain sequences, multi-protein operons/homologous proteins, or motif/domain scans. Depending on the input, MolEvolvR returns the complete set of homologs, the phylogenetic tree, domain architectures, and common partner domains. Users can obtain graphical summaries that include multiple sequence alignments and phylogenetic trees; domain architectures; domain proximity networks; and phyletic spreads, co-occurrence patterns, and relative occurrences across lineages. Thus, the web app provides an easy-to-use interface for a wide range of analyses, from homology searches and phylogeny to domain architectures. Researchers can also use the app for data summarization and dynamic visualization. MolEvolvR will be a powerful, easy-to-use tool that accelerates the characterization of proteins.
The web app can be accessed at jravilab.org/molevolvr. Soon, it will also be available as an R package for use by computational biologists.
Short Abstract: The functional core of a bioinformatics pipeline is usually an elegant series of data transformations. But there is a flaw in the system that complicates both pipeline and tool development: the passing of untyped textual data between subcomponents. Under the textual paradigm, every tool must parse incoming text, untangle the metadata and data (for example, FASTA headers and sequences), operate on them, then re-entangle the results in the output. The complexity of propagating metadata through the pipeline leads to error-prone and unmaintainable workarounds. Furthermore, the lack of agreement on data formats requires tools either to support many formats and variations or to rely on their users to provide correct input. Pipelines written within the context of a single programming language and bioinformatics library (e.g., BioPerl, Biopython, or Bioconductor) are an improvement, but limit the user base to one language. As a tentative solution, we offer morloc, a compiler that performs function composition between languages through type-directed generation of interoperability code. In morloc, subcomponents are source functions and their inputs/outputs are typed data structures in memory. Every function is described by a pair of parallel type signatures: one specifies the function's language-specific data structures in memory, and the other specifies the general form of the data. This allows functions across languages to be indexed, tested, benchmarked, and packaged within one ecosystem. The compiler is open source and publicly available at github.com/morloc-project/morloc.
Short Abstract: 1. Overview of effector-triggered immunity and temporal transcriptome data
Plants rely on innate immunity to defend themselves from pathogen attack. Effector-triggered immunity (ETI) is one major mode of plant immunity. Pathogens deliver effectors into the plant cell to manipulate plant functions. Some effectors are recognized by the plant as signals of pathogen attack, resulting in induction of ETI. For example, the Pseudomonas syringae effector AvrRpt2 is recognized in the model plant Arabidopsis. Multiple layers of defense response, including massive transcriptome reprogramming, are induced during ETI. To uncover the mechanism regulating the transcriptome reprogramming, several studies have generated high-resolution temporal transcriptome data during ETI in Arabidopsis.
2. Limitations of the current approaches
General-purpose unsupervised approaches like clustering are often applied to gene expression data in order to reveal gene groups that may be functionally related based on guilt-by-association principle. While conventional clustering approaches can capture groups of genes responding similarly and describe their general temporal patterns, they are not able to provide detailed dynamic information that may reveal underlying mechanisms governing the response.
3. Our novel Multi-compartment models (MCMs)
In this study, we focused our analysis on an RNA-seq dataset characterizing transcriptome dynamics in Arabidopsis upon challenge with an ETI-inducing P. syringae pv. tomato (Pst) strain expressing AvrRpt2 (Pst AvrRpt2). We selected approximately 3000 genes that respond significantly at one or more of seven time points after inoculation with Pst AvrRpt2 compared to mock inoculation. Interestingly, initial analysis of this dataset revealed transcriptome-wide “double-peak” time-course patterns. We further aimed to quantitatively describe the double-peak phenomenon and study the underlying mechanisms. To this end, we developed a novel computational approach based on a general mathematical framework called Multi-compartment models (MCMs), which can be applied to model the detailed dynamics of response peaks in temporal transcript data. Specifically, MCMs interpret a gene’s transcript response as a combination of two single-peak responses, named the first peak and second peak, respectively.
To model the transcripts’ responses for a single dataset, MCMs generate a chain of compartments with single-peak shape but different peak times. The transcript levels of each gene are fit with the analytical solution of an ordinary differential equation using a pair of compartments as input along with first-order decay. Naturally, this approach can enable the decomposition of temporal responses into two responses capturing sequential signaling events.
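The compartment-chain idea can be sketched in a few lines. Below is a minimal illustration, not the authors' implementation: an impulse driving a chain of identical first-order compartments produces Erlang-shaped single peaks, and a transcript's profile is modeled as a weighted sum of an early and a late peak (all names and parameter values here are illustrative):

```python
import math

def compartment_peak(t, n, k):
    # n-th compartment of a linear chain of first-order compartments
    # driven by an impulse at t = 0: a single-peak (Erlang-shaped)
    # curve that peaks at t = n / k
    return (k * t) ** n * math.exp(-k * t) / math.factorial(n)

def transcript_response(t, a1, n1, a2, n2, k):
    # hypothetical two-peak decomposition of one gene's transcript
    # level: an early peak plus a late peak, each with weight a_i
    return a1 * compartment_peak(t, n1, k) + a2 * compartment_peak(t, n2, k)
```

Fitting then amounts to choosing the weights and peak-position parameters that best reproduce each gene's observed time course.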
4. Discoveries and potential biological insights
Our MCM modeling framework successfully fit transcript response profiles for approximately 1800 genes, capturing a majority of the genes with strong responses to the pathogen treatment. These MCM model fits for each transcript supported a number of downstream analyses of the transcript response. Surprisingly, we found that the timing and amplitude of the second-peak response echo the dynamics of the first-peak response for the majority of genes. Permutation analysis confirmed a stronger-than-expected similarity between the late and early peak dynamics (p < 0.001).
With a low inoculation dose, only a limited proportion of the plant cells had direct contact with bacteria and received AvrRpt2 (cell population 1). Surrounding cells (cell population 2) responded indirectly. Since the population 1 cells undergo programmed cell death by 12 hours, cell population 2 must mostly account for the second-peak response. Our unpublished results show that the cell population 1 response closely resembles the first-peak response. It is very likely that the first and second peaks largely represent responses of cell populations 1 and 2, respectively. We also identified a small number of genes whose transcript responses were distinct between the two peaks and will study whether this minor difference can explain the distinct cellular responses of the two populations.
Short Abstract: Background: Alzheimer’s disease (AD) is a complex heterogeneous progressive disorder and is the leading cause of dementia. A comprehensive understanding of molecular pathogenesis is the key to early detection of multi-factorial AD dementia and to develop targeted interventions.
Data & Approach: To elucidate the mechanisms of AD dementia pathogenesis, we will use genetic (genome-wide SNP data), blood-based epigenetic (genome-wide DNA methylation (DNAm) data), and transcriptomic (RNA-Seq) data collected among 4000 participants of the Health and Retirement Study (HRS) 2016 sub-study Harmonized Cognitive Assessment Protocol (HCAP). We will develop a deep probabilistic generative model (e.g., a variational autoencoder, VAE) with existing biological knowledge as constraints along with enforced candidate features for the unsupervised phase and a classification layer for the supervised phase. Currently, we are developing a framework to incorporate high-dimensional omics data into the VAE prediction model. We will further implement an “opening the black box” approach, analyzing the weights captured by the VAE’s decoder, to identify the important features contributing to the latent dimensions of the VAE for biological interpretability.
Results and Conclusion: As this project is in the early stages of implementation, we will demonstrate preliminary results from a DNAm-based VAE model to predict AD dementia. We hypothesize that the VAE model with a biological interaction network as constraints will achieve better accuracy for dementia classification than traditional dimension-reduction and classifier approaches in omics analysis.
Short Abstract: Background: It is now well established that alternative splicing is a ubiquitous process in eukaryotic organisms that allows multiple distinct spliced transcripts to be produced from the same gene. Yet, existing gene tree reconstruction methods make use of a single reference transcript per gene to reconstruct gene family evolution and infer gene orthology relationships. The study of the evolution of sets of transcripts within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, a natural generalization of pairwise spliced alignment is multiple spliced alignment, which is an alignment of a set of spliced RNA sequences with a set of unspliced genomic sequences.
Methods: We have developed a collection of algorithms to compute multiple spliced alignments, infer splicing orthology relationships, and reconstruct transcript and gene evolutionary histories. The new spliced alignment algorithms account for sequence similarity and splicing structure of the input sequences. A multiple spliced alignment is obtained by combining several pairwise spliced alignments between spliced transcript sequences and unspliced genes of a gene family, using a progressive multiple sequence alignment strategy. Splicing orthology relationships, gene tree and transcript tree are computed based on the multiple spliced alignment of all genes and transcripts of a gene family.
Results: The application of the algorithms to real vertebrate and simulated data shows that the new method provides a good balance between accuracy and execution time compared to existing spliced alignment algorithms. Moreover, we show that it is robust and performs well for various levels of similarity between input sequences, thanks to the use of the splicing structure of the input sequences. The application also illustrates the usefulness of our methods for computing gene family super-structures, identifying splicing ortholog groups and conserved structural features, and improving gene model annotations.
 S. Jammali, J-D. Aguilar, E. Kuitche, A. Ouangraoua. (2019). SplicedFamAlign: CDS-to-gene spliced alignment and identification of transcript orthology groups. BMC bioinformatics, 20(3), 133.
 E. Kuitche, S. Jammali, and A. Ouangraoua. (2019). SimSpliceEvol: alternative splicing-aware simulation of biological sequence evolution. BMC bioinformatics, 20(20), 1-13.
Short Abstract: To effectively combat antibiotic resistance, a detailed understanding of the processes driving the emergence of resistance is vital. When quinolones were introduced in the 1960s, it took over a decade to identify their targets and mechanism of action and to unveil mutations in gyrA and parC as a cause of quinolone resistance.
Here, we show how a hypothesis-free, high-throughput approach based on deep sequencing and a genome-wide association study (GWAS) of resistance can pinpoint resistance-conferring mutations. We develop a tailored bacterial GWAS model, which takes population stratification into account. To maximise the diversity of genomes, we apply the model to E. coli from wastewater, which combines environmental, industrial, and human sources.
From around 100 wastewater E. coli genomes, we identified over 200,000 high-quality genomic variants. Among these variants, we found 13 that correlate strongly with quinolone resistance. Three of them are the known quinolone resistance-conferring mutations in gyrA and parC. The other ten variants appear in new candidate resistance genes, including the biofilm dispersal gene bdcA and valS, which plays a key role in translation. Both processes can be connected to resistance formation. In bdcA, the mutation is in proximity to the active site and could hence impact the gene product's efficiency. The gene valS harbours synonymous mutations, which may have an indirect effect on valS function and resistance.
In summary, we demonstrate that GWAS effectively and comprehensively identifies resistance mutations without a priori knowledge of targets and mode of action, using data from a single site only. The approach recovers gyrA and parC mutations as the main sources of quinolone resistance, complemented by novel candidate resistance mutations in bdcA and valS. More studies are needed to substantiate the connection and to validate the mechanism of action.
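As a toy illustration of the association step (not the stratification-aware model described above), each variant can be scored against the resistance phenotype with a Pearson chi-square on a 2x2 contingency table; the variant names below are hypothetical:

```python
def chi2_2x2(a, b, c, d):
    # Pearson chi-square statistic for a 2x2 table of variant vs. phenotype:
    #                resistant   susceptible
    #   variant          a            b
    #   wild type        c            d
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def rank_variants(tables):
    # rank variants by the strength of their association with resistance
    return sorted(tables, key=lambda v: chi2_2x2(*tables[v]), reverse=True)
```

A real bacterial GWAS additionally corrects for population structure (e.g. via phylogeny or principal components), which this sketch omits.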
Short Abstract: The advent of single cell sequencing has provided the ability to define data-driven cell types at unprecedented resolution. However, these experiments produce datasets that share neither identical instances nor identical features, and current methods operate only on features jointly shared between the datasets. Thus, these methods discard important unshared features, such as intergenic epigenomic information and unshared genes.
The critical need to include unshared features in single-cell integration analyses motivated us to extend our previously published LIGER framework. We present UINMF, a novel nonnegative matrix factorization algorithm that allows the inclusion of both shared and unshared features when integrating datasets. The key innovation of UINMF is the introduction of a U-matrix to the iNMF objective function, incorporating features that belong to only one or a subset of the datasets when estimating metagenes and cell factor loadings. To solve the resulting optimization problem, we derived a block coordinate descent algorithm.
We demonstrate the advantages of our method using four real datasets. First, we validate our results using ground-truth cell correspondences from simultaneous scRNA and snATAC data. UINMF consistently places the snATAC cells in closer proximity to their corresponding scRNA profiles, confirming that incorporating intergenic information from snATAC data leads to more accurate clustering. In two separate analyses, we demonstrate that using U to include unshared genes improves cell type resolution when integrating scRNA data and targeted spatial transcriptomic data. This allows the identification of more distinct cellular profiles within a spatial context. Additionally, we illustrate the benefit of using UINMF to integrate dual-omic SNARE-seq and spatial transcriptomic datasets. Including unshared genes, as well as chromatin accessibility, within the U matrix allows improved sensitivity in cell type labeling. As multi-omic sequencing efforts and novel spatial transcriptomic methods develop, we anticipate that UINMF will improve single-cell data integration across a range of data types.
Short Abstract: One of the effects of the ongoing COVID-19 pandemic is an overburden on the health care system. This not
only limits the medical resources available to COVID-19 patients, but also leads to other poor health outcomes. One
category of ongoing stress to healthcare systems is the availability of intensive care for the extremely sick. It is
likely that an important factor in recovery rate is that the number of patients that can receive high-quality care
stays below the number of beds available. ICU care depends not only on the number of beds, but also on the availability
of ICU staff, with increased rates of mortality among patients in strained ICUs. As regular ICU beds are filled,
hospital units that typically deal with less acute patients must be repurposed as intensive care, with the
associated reallocation of staff who may not have ICU training to intensive or progressive care units. The
availability of ICU care may be further lowered as healthcare providers themselves become affected as they or
their family members become sick or housebound. ICU load has been shown to be positively correlated with
an increase in mortality counts over the next 7 days. This increase in COVID-19 mortality goes beyond a
simple increase in the number of patients. ICU strain has been shown to increase the risk of mortality for
patients relative to unstrained times. Headline numbers typically report only the number of ICU beds that are
used or available; this does not give a baseline to compare against. The quality and methodology of these
reports varies: these numbers do not account for the ICU level, and some measures of available capacity only
count the number of beds available, while others specify the number of beds that can be staffed. It is important
to distinguish between and account for the different reporting criteria in any model of these data. Tracking
these reporting factors as well as the human factors can result in overly complex models. Additionally, it is not
necessarily easy to determine a baseline for ICU availability in a particular region. This number varies within
each hospital, making the simpler-to-model distinction between regular ICU beds and additional ICU beds
added for the pandemic difficult to constrain.
Our work examines healthcare burden due to COVID-19 by means of computational methods and techniques
from mathematical epidemiology. In particular, we will discuss several compartmental models which account for
hospitalizations. We will discuss the basic Susceptible-Exposed-Infectious-Recovered (SEIR) model and build
on this for our model to account for hospitals over capacity. This is implemented by adding terms for ICU
hospitalizations and ICU capacity to the SEIR model to account for hospital over-burden. Adding
compartments increases the model complexity and so determining the optimal compartmental structure is a
challenging problem which requires the development of appropriate machine learning methodologies. The
optimal model would be the simplest one able to capture qualitatively significant healthcare overburden effects.
We illustrate with real-data examples, specifically those from US states such as those in the Midwest.
The model can be used for risk adjustment for hospital overload. A model for ICU over-capacity can be used to
assist in identifying hospitals or regions that are over functional capacity when high-quality data are not yet
available or all parameters are not tracked. Our study builds on previous research extending SEIR models to
account for hospitalizations, quarantines, and contact reduction strategies.
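The kind of extension described above can be sketched as follows; this is a minimal illustration with purely illustrative parameter values, not the authors' model. An SEIR system gains an ICU compartment whose mortality rate rises once occupancy exceeds capacity, integrated with a forward-Euler step:

```python
def simulate(beta, sigma, gamma, h_rate, icu_cap, days, dt=0.1):
    # S-E-I-R plus an ICU compartment C and cumulative deaths D
    # (all as fractions of the population); the mortality rate mu
    # rises when ICU occupancy C exceeds the capacity icu_cap
    S, E, I, C, R, D = 0.99, 0.0, 0.01, 0.0, 0.0, 0.0
    for _ in range(int(days / dt)):
        strain = max(0.0, C - icu_cap)            # over-capacity load
        mu = 0.01 + 0.05 * strain / max(C, 1e-9)  # strained ICUs are deadlier
        new_inf = beta * S * I
        dS = -new_inf
        dE = new_inf - sigma * E
        dI = sigma * E - gamma * I
        dC = h_rate * gamma * I - (0.1 + mu) * C  # 0.1 = ICU discharge rate
        dR = (1 - h_rate) * gamma * I + 0.1 * C
        dD = mu * C
        S += dS * dt
        E += dE * dt
        I += dI * dt
        C += dC * dt
        R += dR * dt
        D += dD * dt
    return S, E, I, C, R, D
```

Because every outflow reappears as an inflow elsewhere, the six compartments always sum to one, and lowering icu_cap raises cumulative mortality, qualitatively reproducing the strain effect discussed above.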
Short Abstract: For over a year now the world has been trying to cope with COVID-19. Several non-pharmaceutical
interventions have been implemented to try to mitigate the dramatic outcomes of the pandemic. Studies have
shown the effectiveness of lockdowns, social distancing, and mask usage as measures aimed at stopping or at
least slowing down the transmission of the coronavirus. The current 2-meter distancing policy is antiquated and
based on outdated science. It does not take into consideration important factors such as the environment and
the physics of disease transmission (e.g., temperature, air humidity, coughing/talking/normal breathing, and many
others) and other determinants, and this calls for a deeper assessment of optimal social distancing. Although
there are many studies which show the shape and the reach of breath in different environments from a
transversal perspective, much less is known about the shape of the breath cloud/ social distancing area in the
direction orthogonal to the plane surface. The understanding of this would be informative for what concerns
social distance measures, both outdoor and indoor. It is known that without specific additional restrictive
measures, closed spaces are more dangerous than open spaces as they favour the concentration of air and
the possibility of contact between individuals. It is, then, important to understand how many people can safely
be present in closed spaces and optimize their disposition so as to minimize coronavirus transmission.
Our approach is to connect the problem with optimal packing inside polygons, which is a classical problem of
integral geometry and requires sophisticated tools from real analysis, topology, optimization, and geometry.
The idea is to model the air cloud around the head of an individual with an egg-shaped figure with sizes given
by the max distance found in the literature that air can reach by breathing/coughing with/without mask and use
this as a static version of the problem of minimizing breath-interactions and thereby disease transmission.
Although this is a very simplified picture of the dynamics, it can still give an insight into the optimal position of
people in a closed space and possible strategies for capping the capacity of rooms. We believe that this
approach is new, challenging, and promising. We will consider rooms of polygonal shape (square, rectangle,
trapezoid, semicircle) and egg-shaped figures with different combinations of axes lengths reflecting as closely
as possible the physics of COVID-19 transmission. We will further compare the results of the case in which
masks are used (shorter major egg axis) vs the case in which masks are not used (longer major egg axis). We
will pay particular attention to the real-data case of classroom spaces, as optimal school-reopening strategies
have been shown to be critical for contrasting COVID-19 transmission and those strategies cannot prescind
from an optimal classroom maximal capacity assessment. Our analysis provides results about the maximum
number of people to fit into a room while still complying with social distancing requirements and their optimal
transmission-minimizing disposition. Other socio-economic applications are to restaurant capacities and sport
events, cases where the geometry of the domain is more complicated (non-convex) and requires a more
refined analysis.
Our work not only addresses a problem that is important from the biological/epidemiological perspective; it is also of
independent mathematical interest. The literature on sphere-packing and covering problems is indeed broad
and vibrant, having produced beautiful mathematical results and remained a hot research area for over a hundred years. As far as
we know, our work is the first instance in which those techniques are applied to solve a problem in epidemiology.
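As a first, deliberately crude baseline for the capacity question (not the integral-geometry machinery described above), one can lower-bound room capacity by tiling the floor with axis-aligned bounding boxes of the egg-shaped breath clouds; the semi-axis values used below are illustrative:

```python
import math

def lattice_capacity(width, depth, a, b):
    # lower bound on the number of people fitting in a width x depth
    # rectangular room when each person occupies the bounding box of
    # an egg-shaped breath cloud with semi-axes a (major) and b (minor)
    return math.floor(width / (2 * a)) * math.floor(depth / (2 * b))
```

Already this baseline reproduces the qualitative mask effect: shrinking the major axis (mask on) raises the admissible headcount, and any genuine optimal-packing result can only improve on the lattice count.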
Short Abstract: Screening drug compounds against a collection of defined gene mutants can identify mutations that sensitize or suppress a drug’s effect. These chemical-genetic interaction screens can be performed in human cell lines using a pooled lentiviral CRISPR-Cas9 approach, allowing for interrogation of many single-gene knockout backgrounds and assessing their differential sensitivity or resistance to drugs. While genome-wide chemical-genetic screens (typical size: ~70,000 sgRNAs targeting ~18,000 genes) can inform candidate compounds for drug development, many labs do not have the resources to perform large-scale screens for more than a few compounds. Our pilot screen with Bortezomib (a proteasome inhibitor) shows that a small targeted CRISPR library (~3,000 sgRNAs targeting ~1,000 genes) screened at higher sequencing depth (1000X) can 1) recover biological information at a higher signal to noise ratio compared to genome-wide screens, and 2) reduce resource costs, which allows for higher throughput screening.
Quantification of the phenotypic readout (cell viability) from these screens can be translated to chemical-genetic interaction (CGI) profiles, or a “fingerprint” indicative of a compound’s mode-of-action. Our group has developed a novel R software package for scoring chemical-genetic interactions, ranking candidate drug targets, and predicting compound mode-of-action. Chemical-genetic profiles are analogous to genetic interaction (GI) profiles, which represent differential sensitivity/resistance of genetic perturbations to a second genetic perturbation rather than a compound. To predict the genetic target(s) of known or uncharacterized bioactive compounds, we leverage the property that a compound’s CGI profile will be similar to the GI profile of its target. We plan to integrate chemical-genetic interaction profiles with genetic interaction profiles produced by other ongoing efforts to improve our drug target prediction approach. The combination of our targeted screening approach and novel computational framework will provide a more feasible and scalable platform for drug discovery.
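The target-prediction principle can be sketched directly: rank genes by the similarity of their GI profiles to the compound's CGI profile. This is a hypothetical cosine-based scoring, not the actual implementation of the R package described above:

```python
def cosine(x, y):
    # cosine similarity between two interaction profiles
    num = sum(a * b for a, b in zip(x, y))
    den = (sum(a * a for a in x) * sum(b * b for b in y)) ** 0.5
    return num / den if den else 0.0

def predict_targets(cgi_profile, gi_profiles):
    # rank candidate target genes: a compound's chemical-genetic
    # interaction profile should resemble the genetic-interaction
    # profile of its target
    return sorted(gi_profiles,
                  key=lambda g: cosine(cgi_profile, gi_profiles[g]),
                  reverse=True)
```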
Short Abstract: Synergistic drug combinations allow lower doses of each constituent drug, reducing adverse reactions and the development of drug resistance. However, it is simply not feasible to sufficiently test every combination of drugs for a given illness to determine promising synergistic combinations. Since there is a limited amount of time and resources available for testing of synergistic combinations, a model that can identify synergistic combinations from a limited subset of all available combinations could accelerate the development of therapies. By applying the low-rank matrix completion algorithm Probabilistic Matrix Factorization (PMF), it may be possible to identify synergistic combinations from partial information about the drug interactions. Here, we use PMF to predict the efficacy of two-drug combinations using the NCI ALMANAC, a robust collection of pairwise drug combinations of 104 FDA-approved anticancer drugs against 60 common cancer cell lines. We find that PMF is able to predict drug combination efficacy with high accuracy from a limited set of combinations and is robust to changes in the individual training data. We also explore the effect of seed graph topology on PMF performance and investigate how the prediction accuracy changes with the individual efficacy of the drugs. Based on our results, we suggest a strategy for efficiently inferring synergistic drug combinations from a carefully selected parsimonious set of experiments.
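A minimal PMF-style sketch (with hypothetical hyperparameters, not the study's implementation): stochastic gradient descent fits low-rank factors on the observed drug-pair entries only, after which the factorization predicts the held-out combinations. The Gaussian priors of PMF reduce here to simple L2 regularization:

```python
import random

def pmf(obs, n_rows, n_cols, rank=2, steps=2000, lr=0.05, reg=0.01):
    # factor a partially observed matrix M ~ U V^T by SGD over the
    # observed entries only; obs maps (row, col) -> observed value
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(steps):
        for (i, j), m in obs.items():
            err = m - sum(U[i][k] * V[j][k] for k in range(rank))
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V
```

Predicting an unobserved pair is then just the dot product of the corresponding row and column factors.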
Short Abstract: High-throughput sequencing technologies now enable the measurement of gene expression, DNA methylation, and chromatin accessibility at the single-cell level. Integration of multiple single-cell datasets is crucial to the analysis of cell identity. Previously, we developed LIGER (Linked Inference of Genomic Experimental Relationships), an R package that employs integrative non-negative matrix factorization to identify shared and dataset-specific factors of cellular variation. These factors then provide a principled and quantitative definition of cellular identity. Here, we present PyLiger, a Python implementation of LIGER, which enhances the processing efficiency of LIGER. The functionality provided by PyLiger includes integration of multiple single-cell datasets, clustering, visualization, and differential expression testing. We confirmed that the results from PyLiger are identical (to within numerical precision) to those from the LIGER R package. Additionally, the PyLiger functions are 1.5 to 5 times faster than their R counterparts. In particular, the most time-consuming step—matrix factorization—is approximately 3 times faster in Python than in our previous R implementation. We also redesigned the structure of the Liger class to smoothly interface with the widely used AnnData format. Cell factor loadings, shared metagenes, and dataset-specific metagenes are stored as annotations in AnnData objects. The use of the AnnData format also facilitates interoperability with existing single-cell analysis tools such as Scanpy and scVelo. Furthermore, PyLiger uses the HDF5 file format for on-demand loading of datasets stored on disk, providing significant memory savings. In summary, PyLiger is a Python implementation of LIGER offering faster performance, an optional on-disk format, and convenient interoperability with other single-cell analysis tools.
Short Abstract: The acquisition of resistance by tumors to targeted molecular therapies remains one of the greatest challenges in precision medicine approaches to cancer treatment. Evolutionary techniques leveraging modern sequencing technologies, such as phylogenetic analysis and the thoughtful calculation of selection intensities, are well suited to offer guiding insights in describing and, ultimately, overcoming therapeutic resistance.
Here, we quantify the relative importance of mutations occurring during EGFR-targeted therapy in a cohort of lung adenocarcinoma (LUAD) patients. We use clinical records to calibrate phylogenetic chronograms, identify variants occurring on branches of the tumor phylogeny coinciding with targeted EGFR therapy, and calculate the evolutionarily derived cancer effect size for those variants. Further, we compare these results to other cancer contexts to reveal the strikingly high effect size of the EGFR T790M mutation, demonstrating the utility of a phylogenetically informed evolutionary approach with a first-of-its-kind finding. Lastly, we perform ancestral state reconstruction and mutational signature analysis to provide further insight into mutational processes and how they shift with therapeutic exposure.
Short Abstract: The basis of several recent methods for drug repurposing is the key principle that an efficacious drug will reverse the disease molecular ‘signature’ with minimal side-effects. This principle was defined and popularized by the influential ‘connectivity map’ study in 2006 regarding reversal relationships between disease- and drug-induced gene expression profiles, quantified by a disease-drug ‘connectivity score.’ Over the past 15 years, several studies have proposed variations in calculating connectivity scores towards improving accuracy and robustness in light of massive growth in reference drug profiles. However, these variations have been formulated inconsistently using various notations and terminologies even though various scores are based on a common set of conceptual and statistical ideas. Here, we present a systematic reconciliation of multiple disease-drug similarity metrics (ES, css, Sum, Cosine, XSum, XCor, XSpe, XCos, EWCos) and connectivity scores (CS, RGES, NCS, WCS, Tau, CSS, EMUDRA) by defining them using consistent notation and terminology. In addition to providing clarity and deeper insights, this coherent definition of connectivity scores and their relationships provides a unified scheme that newer methods can adopt, enabling the computational drug-development community to compare and investigate different approaches easily. To facilitate the continuous and transparent integration of newer methods, this resource will be available as a live document (jravilab.github.io/connectivity_scores) coupled with a GitHub repository (github.com/JRaviLab/connectivity_scores) that any researcher can build on and push changes to. We are now developing a computational drug repurposing approach that reconciles these multiple connectivity scores to predict drug candidates against infectious diseases.
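To make the shared statistical core concrete, here are minimal sketches of two of the reconciled scores: a cosine similarity and an XSum-style extreme-gene sum. These are simplified from the published definitions, and strongly negative values indicate reversal (a candidate drug):

```python
def cosine_score(disease, drug):
    # cosine similarity between disease and drug differential-expression
    # vectors over a common gene set
    num = sum(d * g for d, g in zip(disease, drug))
    nd = sum(d * d for d in disease) ** 0.5
    ng = sum(g * g for g in drug) ** 0.5
    return num / (nd * ng) if nd and ng else 0.0

def xsum_score(disease_up, disease_down, drug_profile, topn=2):
    # XSum-style score: restrict the drug profile to its topn most
    # extreme genes, then sum their values over the disease up-set
    # minus the sum over the disease down-set
    extreme = dict(sorted(drug_profile.items(),
                          key=lambda kv: abs(kv[1]), reverse=True)[:topn])
    up = sum(extreme.get(g, 0.0) for g in disease_up)
    down = sum(extreme.get(g, 0.0) for g in disease_down)
    return up - down
```

Seen side by side like this, the scores differ mainly in which genes they retain and how they weight them, which is exactly the kind of relationship the reconciliation above makes explicit.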
Short Abstract: Background. A pangenome is the collection of all genes found in a set of related genomes. For microbes, these genomes are often different strains of the same species, and the pangenome offers a means to compare gene content variation with differences in phenotypes, ecology, and phylogenetic relatedness. Though most frequently applied to bacteria, there is growing interest in adapting pangenome analysis to bacteriophages. Working with phage genomes presents new challenges, however. First, most phage families are under-sampled, and homologous genes in related viruses can be difficult to identify. Second, homing endonucleases and intron-like sequences may be present, resulting in fragmented gene calls. Each of these issues can reduce the accuracy of standard pangenome analysis tools.
Methods. We developed an R pipeline called Rephine.r that takes as input the gene clusters produced by an initial pangenomics workflow. Rephine.r then proceeds in two primary steps. First, it searches for distant homologs separated into different gene families using Hidden Markov Models. Significant hits are used to merge families into larger clusters. Next, Rephine.r identifies three common causes of fragmented gene calls: 1) indels creating early stop codons and new start codons; 2) interruption by a selfish genetic element; and 3) splitting at the ends of the reported genome. Fragmented genes are then fused for the purpose of inferring the single-copy core genome and phylogenetic tree.
Results. We applied Rephine.r to three well-studied phage groups: the Tevenvirinae (e.g. T4), the Studiervirinae (e.g. T7), and the Pbunaviruses (e.g. PB1). In each case, Rephine.r increased the size of the single-copy core genome and the overall bootstrap support of the phylogeny. We expect corrections to fragmented gene calls will also be informative beyond pangenomes, with potential applications to prophage prediction and genome assembly. The Rephine.r pipeline is provided through GitHub as a single script for automated analysis and with standalone functions and a walkthrough for researchers with specific use cases for each type of correction.
Short Abstract: Mucopolysaccharidoses (MPS) are lysosomal storage diseases caused by the deficiency of enzymes essential for the metabolism of glycosaminoglycans (GAG). To the best of our knowledge, the MPSBase
Short Abstract: MicroRNAs are short non-coding RNAs of 19-22 nucleotides that mediate gene silencing by targeting mRNA transcripts that contain reverse complementary sequences of the seed region (nucleotides 2-8 of the mature miRNA). MicroRNAs have important roles in many biological processes, and their expression profiling can be routinely carried out using reference mature sequences. Currently, more than 50,000 miRNA-sequencing datasets from at least 60 different species are publicly available on the Sequence Read Archive. The prediction of novel miRNA genes, however, generally requires the availability of genomic sequences in order to assess important properties such as the characteristic hairpin-shaped secondary structure. Although sequencing costs have decreased over the last years, many important species still lack a high-quality genome assembly. We implemented miRNAgFree, a reference-independent algorithm that allows miRNA prediction by exploiting the highly conserved seed region to aggregate reads into miRNA-like clusters. Candidate miRNAs are also required to meet characteristic biogenesis features that can be assessed without genomic sequences, such as 5’ processing homogeneity. We evaluated its performance on different sequencing datasets from several Homo sapiens and Mus musculus tissues by comparing the candidates to well-established reference mature miRNA sequences from miRGeneDB. Using 18 different tissues, miRNAgFree correctly predicted 249 and 259 miRNAs from human and mouse, respectively, with precisions ranging from 76 to 87%. A considerable additional portion was found to be at least partially correct. Overall, 90-100% of the most expressed predictions for each species (first quartile) corresponded to bona fide miRNAs, indicating a good performance of this approach.
Furthermore, we found that just 18 tissues were needed to recover ~25% of the miRNA complement with high specificity, which suggests that miRNA-seq data alone can be sufficient to reconstruct a relevant portion of a species’ miRNAome.
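The seed-clustering step can be sketched as follows (an illustrative simplification, not the miRNAgFree implementation): reads are grouped by nucleotides 2-8, and each cluster's 5’ processing homogeneity is the fraction of reads sharing the dominant 5’ prefix:

```python
def cluster_by_seed(read_counts):
    # group reads into miRNA-like clusters by their seed
    # (nucleotides 2-8 of the putative mature sequence)
    clusters = {}
    for read, count in read_counts.items():
        clusters.setdefault(read[1:8], {})[read] = count
    return clusters

def five_prime_homogeneity(cluster, k=10):
    # fraction of reads whose 5' end (first k nt) matches the
    # cluster's dominant prefix; bona fide miRNAs are processed
    # precisely and therefore score close to 1
    prefixes = {}
    for read, count in cluster.items():
        prefixes[read[:k]] = prefixes.get(read[:k], 0) + count
    return max(prefixes.values()) / sum(cluster.values())
```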
Short Abstract: Conjugation enables the exchange of genetic elements throughout environments, including the human gut microbiome. Conjugative elements can carry and transfer clinically relevant metabolic pathways, which makes precise identification of these systems in metagenomic samples clinically important. We employed two separate methods to identify conjugative systems in the human gut microbiome. We metagenomically assembled 101 samples from two cohorts and identified conjugative systems by employing profile hidden Markov models. We also annotated a human gut microbiome reference genome set using the UniRef90 database to identify potential conjugative systems and quantified those systems across 785 samples from 8 cohorts. We first show that conjugative systems exhibit strong population- and age-level stratification. Additionally, we find that the total relative abundance of all conjugative systems present in a sample is not an informative metric, regardless of the method used to identify the systems. Finally, we demonstrate that the majority of assembled conjugative systems are not included within metagenomic bins, and that only a small proportion of the binned conjugative systems are included in "high-quality" metagenomic bins. Our findings highlight that conjugative systems differ between adult North Americans and a cohort of North American pre-term infants, revealing a potential use as an age-related biomarker. Furthermore, conjugative systems can distinguish between other geographically based cohorts. Our findings emphasize the need to identify and analyze conjugative systems outside of standard metagenomic binning pipelines. We suggest that analysis of type IV conjugative systems be added in parallel to current metagenomic analysis approaches, as they contain information that could explain differences between cohorts beyond those we investigated.
Short Abstract: Phytophthora is a genus of oomycetes that cause extensive crop damage and economic loss. Specific proteins from the host and pathogen facilitate their interaction and mediate a multifaceted infection mechanism. In this study, we utilized published protein-protein interactions between Phytophthora and its hosts to develop a predictive model. We applied supervised learning algorithms, support vector machines (SVM) and ensemble methods, to predict interactions. Features of the host and pathogen proteins, such as amino acid composition, dipeptide composition, pseudo amino acid composition, amphiphilic pseudo amino acid composition, C/T/D, conjoint triads, autocorrelation, sequence order coupling number, and quasi-sequence order descriptors, were used to develop a binary classification model of whether two proteins interact. The relative importance of the different protein features in the training model was also evaluated. SVM with a radial kernel had an accuracy of 75%. The bagging algorithm random forest showed an accuracy of 84.6%. A GLM ensemble of k-nearest neighbors (KNN), SVM (radial), rpart, and random forest gave an accuracy of 70.1%. The model can be trained with more experimentally validated interactions to improve its accuracy. Furthermore, we constructed a protein-protein interaction network of host and pathogen proteins to depict the interactions operating during pathogenesis and evaluated the network topology. The results of the study may be taken forward for experimental validation.
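As an illustration of the simplest descriptor class listed above, here is a minimal sketch of computing amino acid composition features for a host-pathogen protein pair. The sequences are hypothetical placeholders, and the actual study combined many more feature classes before training its models:

```python
# Sketch: 20-dimensional amino acid composition (AAC) feature vector,
# one of the descriptor classes used for host-pathogen interaction models.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = sequence.upper()
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def pair_features(host_seq, pathogen_seq):
    """Concatenate host and pathogen AAC vectors into one 40-dim example."""
    return aac(host_seq) + aac(pathogen_seq)

# Hypothetical toy sequences; real inputs would be full protein sequences.
features = pair_features("MKTAYIAKQR", "GAVLIPFMWS")
```

Feature vectors built this way (one per candidate host-pathogen pair) would then be fed to an SVM or ensemble classifier along with interact/do-not-interact labels.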
Short Abstract: Acute myeloid leukemia (AML) is an aggressive malignancy. Despite early responses to standard chemotherapy, 70-80% of patients relapse with treatment-resistant disease and die of relapsed disease. Leukemia stem cells (LSCs), the subpopulation of leukemia cells with self-renewal capacity, can recapitulate leukemia and cause relapse. Current chemotherapies and targeted therapies do not target LSCs, explaining the relapses in this disease. Our goal is to identify the molecular mechanisms of self-renewal in LSCs in order to define therapeutic targets that prevent AML relapse. Our group previously demonstrated that activated NRAS (NRASG12V) mediates self-renewal in the LSC-enriched subpopulation of AML in the Mll-AF9/NRASG12V transgenic mouse model of AML (Sachs et al., Blood 2014; Kim et al., Blood 2009). Next, we performed single-cell RNA-sequencing (scRNA-seq) analysis to identify the gene expression profile of self-renewal in this mouse model. We validated the functional effects of this gene expression profile using in vivo mouse reconstitution assays (Sachs K. et al., Cancer Research 2019). Next, we investigated whether primary human AML stem cells express the murine-derived single-cell self-renewal gene expression profile. We designed a new computational method called Single-cell Correlation Analysis (SCA) to directly quantitate the expression level of a gene expression profile within a single-cell gene expression dataset. We used SCA to analyze scRNA-seq data from primary human AML LSC samples that our group generated and sixteen previously published primary human AML samples (van Galen et al., Cell 2019). Using this approach, we demonstrated that most of the human AML samples we tested harbor cells that express the murine single-cell self-renewal gene expression profile (0-11.5% with an FDR q < 0.1, 0-19.8% with an FDR q < 0.25).
Since we previously validated the functional relevance of this profile in vivo, these analyses allow us to identify the putative self-renewing LSCs at the single-cell level and interrogate their gene expression profiles further (in our ongoing work).
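The scoring idea behind a correlation-based approach like SCA can be sketched by correlating each cell's expression over the signature genes with the signature profile itself. This is a simplified illustration, not the published SCA method (which also calibrates FDR values); all gene profiles and cell values below are hypothetical:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def score_cells(cells, signature):
    """Score each cell by correlating its expression over the signature
    genes with the signature profile (higher = more signature-like)."""
    return [pearson(cell, signature) for cell in cells]

signature = [2.0, 1.0, 0.0]        # hypothetical self-renewal profile
cells = [[4.0, 2.0, 0.0],          # tracks the profile -> high score
         [0.0, 1.0, 2.0]]          # anti-correlated -> low score
scores = score_cells(cells, signature)
```

Cells scoring above a calibrated cutoff would then be flagged as candidate self-renewing LSCs for further interrogation.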
Short Abstract: Genome-wide perturbation screens enable a systematic investigation of biological systems. Recently, the maturation of the CRISPR/Cas9 technology has enabled genome-wide knockout screens in human cell lines. However, relatively little work has been done to fully characterize the reproducibility of differential effects (e.g. genetic interactions) in genome-wide CRISPR screens.
Evaluating the quality of genome-wide screens is straightforward when a ground-truth gold standard is available. For example, for genome-wide single-knockout screens, an essentiality-based gold standard supports a simple quality analysis: counting the number of essential genes identified by a specific screen. However, such gold standards are frequently unavailable or imperfect (e.g. GO-based co-annotation standards) for most CRISPR/Cas9 screen applications.
We introduce a new computational tool, JEDER (Joint Estimation of Data and Error Rates), to evaluate experimental screens without an external gold standard. Given a set of replicate genetic screens, JEDER uses a Markov chain Monte Carlo (MCMC)-based approach to compute maximum likelihood estimates of the false positive rate (FPR) and false negative rate (FNR), along with a consensus profile that accounts for these error rates and all replicate observations. JEDER then binarizes the consensus profile (treating pairs with a posterior probability > 0.5 as positive examples) and uses it as a standard to provide detailed benchmarking of screen quality as a function of different data thresholds (e.g. effect size, statistical significance).
To estimate the reproducibility of single mutant effects from CRISPR/Cas9 screens, we conducted 21 independent single knockout replicate screens in the human HAP1 cell line. Focusing on genes exhibiting strong dropouts, we applied JEDER to jointly estimate a consensus set of essential genes and the FPR and FNR. We then evaluated single mutant effect measurements at a range of effect size thresholds and calculated the corresponding precision-recall characteristics. To estimate the reproducibility of differential effects (e.g. genetic interactions), we also performed several double mutant knockout screens, each one repeated at least four different times with independent biological replicates in HAP1 cells. JEDER was then applied to these replicated screens, enabling precise estimates of false negative and false positive rates for both negative and positive genetic interactions. We evaluated differential effects at a range of effect size and statistical significance thresholds. Our analysis suggests that reproducing differential effects is considerably more challenging than reproducing single mutant effects.
JEDER provides the first robust estimates of reproducibility of double mutant genetic interactions, and importantly, highlights the substantially increased difficulty of scoring differential effects based on CRISPR/Cas9 screens as compared to single mutant effects. We expect JEDER to generalize to many other genomic/proteomic data settings to enable precise and accurate estimates of reproducibility without relying on external gold standards.
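The consensus/binarization step can be illustrated with a toy posterior calculation. This is a deliberate simplification: the error rates are fixed, assumed inputs here, whereas JEDER estimates the FPR and FNR jointly with the consensus via MCMC. All numbers below are hypothetical:

```python
def posterior_positive(obs, prior=0.5, fpr=0.05, fnr=0.2):
    """P(true hit | replicate calls), assuming known, independent error rates.
    JEDER itself infers these rates from the data; the fixed values here
    are illustrative assumptions only."""
    like_pos = prior         # accumulate P(obs | true hit) * prior
    like_neg = 1.0 - prior   # accumulate P(obs | no hit) * (1 - prior)
    for o in obs:
        like_pos *= (1.0 - fnr) if o else fnr
        like_neg *= fpr if o else (1.0 - fpr)
    return like_pos / (like_pos + like_neg)

def consensus(profiles, threshold=0.5, **kw):
    """Binarize each element of a consensus profile at posterior > threshold."""
    return [posterior_positive(obs, **kw) > threshold for obs in profiles]

calls = consensus([[1, 1, 1, 0],   # observed in 3 of 4 replicates
                   [0, 0, 1, 0]])  # observed in 1 of 4 replicates
```

With these assumed rates, the element seen in three of four replicates is called a consensus positive, while the element seen only once is not.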
Short Abstract: Taxonomy is essential for the classification of sequences, helping researchers understand and communicate biological information. The exponential growth of biological sequences, driven by a wide range of high-throughput experimental methods and computational efforts, has also meant an increase in their biodiversity. PIN2DB (pinir.ncl.res.in/), a dedicated biological database of the potato inhibitor type-II protease inhibitor (Pin-II type PI) family, covers functional annotations of each available Pin-II type PI sequence. As captured in the database, Pin-II PIs are widely distributed across species, with multiple levels of hierarchy and complex relationships. Interactive visualization can aid in exploring and discerning meaning from such complex data. To meet this requirement, we have developed Taxon-TreeViewer, an interactive visualization tool to explore the diversity of Pin-II PI sequences in PIN2DB. Taxon-TreeViewer has a responsive web design and renders an interactive tree, laid out as a node-link diagram, that displays the hierarchical relationships across the Pin-II PI taxonomy. The visualization uses "circles" as marks to encode "taxon" nodes and "lines" as marks to encode connections between parent and child. The "color" attribute of the circle mark encodes collapsed versus expanded state: collapsed nodes with children are drawn as circles with an orange fill, while fully expanded and leaf nodes are drawn as circles with a white fill. In addition, the vertical spatial position attribute encodes tree depth. Each node label displays the name of the taxon and the number of Pin-II PIs belonging to it. When the tree is first rendered in the browser, it is displayed in the vertical orientation up to a depth of 11. Users can expand or collapse any node by clicking it.
The visualization also lets users orient the tree either horizontally or vertically. To further improve legibility, Taxon-TreeViewer lets users select the root taxon node of the tree. Taxon-TreeViewer is implemented in JavaScript and makes use of D3's (d3js.org/) hierarchical layouts. It is derived from Mike Bostock's Collapsible Tree example (observablehq.com/@d3/collapsible-tree). In the future we plan to display the distribution of more sequence features across species using overlays on the mouse-over event of a taxon node. We are also working on a generic open-source variant of Taxon-TreeViewer that could be easily implemented for any biological database with a taxonomic classification.
Visualization URL: pinir.ncl.res.in/Taxon-TreeViewer/
Short Abstract: Autism is a serious developmental spectrum disorder with a prevalence of 1 in 54 people, according to the new CDC statistics from last year. Despite this very high incidence and a rise in prevalence each year, less than 10% of autism cases arise from an identifiable cause, partly because the causes are poorly understood. The consensus in the field is that the contributing causes are both genetic and environmental. One environmental factor that has been linked to autism is pyrethroid pesticides. Recent studies have suggested that despite pyrethroid pesticides' supposed safety, developmental exposure can increase autism risk. A recent epidemiology study, the CHARGE study, showed that pregnant women living near areas where pyrethroids were used were at a significantly increased risk of their unborn child later being diagnosed with autism. To further evaluate the developmental effects of pyrethroid exposure, experiments were performed using two independent cohorts of mouse dams fed pyrethroid pesticide (or vehicle) during pregnancy and lactation. The offspring were raised to adulthood and euthanized for tissue collection. Split-sample transcriptomics, kinomics, and metabolomics analyses will be performed on the collected brain tissue. Multiomics integration and other bioinformatic analyses will be performed on the generated datasets to gain a more comprehensive understanding of the multi-modal biophenotype induced by the exposure. Results of each analysis and the multiomics integration will be discussed in the context of the contribution of developmental pesticide exposure to autism risk.
Short Abstract: Oral mucositis (OM) is a deleterious side effect of radiotherapy targeting the head and neck. The severe ulcers that form can result in increased hospitalizations and cessation of cancer treatment. Effective therapies without side effects are lacking for OM. In order to develop more successful treatments, a better understanding of the pathophysiology of OM is necessary. Following damage, matrix metalloproteinases (MMPs) are essential for tissue remodeling and leukocyte trafficking. However, if expressed in excess, MMPs can cause disproportionate inflammation and impede healing. Because the roles of MMPs during OM are not completely understood, we assessed expression of MMPs during peak damage. RNA-seq was performed comparing sham to irradiated mice; a total of 1,892 genes were differentially expressed, with 407 genes significantly upregulated and 271 genes significantly downregulated in the irradiated tongue tissue. Of note, MMP10, 1a/b, 8, 13, 12, and 27 showed a ≥2 log2 fold change and a P-value ≤0.05. Expression of these genes was verified by qPCR. In all, radiation induces transcription of MMPs that may contribute to the pathology of OM. Future studies will investigate the kinetics of MMP gene expression to better understand the role of each MMP during the course of OM and to inform drug development.
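The selection step described above can be sketched as a simple threshold filter on log2 fold change and p-value. The gene names and values below are hypothetical illustrations, not the study's actual results:

```python
def select_de_genes(results, lfc_cutoff=2.0, p_cutoff=0.05):
    """Keep genes with |log2 fold change| >= lfc_cutoff and p-value <= p_cutoff,
    split into up- and down-regulated lists."""
    up, down = [], []
    for gene, log2fc, pval in results:
        if pval <= p_cutoff and abs(log2fc) >= lfc_cutoff:
            (up if log2fc > 0 else down).append(gene)
    return up, down

# Hypothetical (gene, log2FC, p-value) tuples, not actual study values.
results = [("Mmp10", 4.1, 0.001),   # strong induction -> kept, up
           ("Mmp13", 2.5, 0.010),
           ("GeneX", 0.3, 0.600),   # no change -> dropped
           ("GeneY", -2.2, 0.020)]  # repressed -> kept, down
up, down = select_de_genes(results)
```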
Short Abstract: The availability, affordability, and pervasiveness of mobile and wearable devices are at an all-time high. New applications are developed to increase the functionality and usefulness of wearable devices in order to enhance and improve quality-of-life areas such as communications, workplace productivity, electronic commerce, personal fitness, and healthcare. Simultaneously, security breaches, including sophisticated hacking methods, ransomware, malware, and phishing attacks, have reached alarming levels. In most incidents, identity theft is the outcome: Personally Identifiable Information is compromised, including login credentials, credit card information, and healthcare records. In spite of the availability of tools to protect our records, such as multi-factor authentication, possession, or inherence protocols, the threat remains persistent.
In this presentation, I share the results of a unique longitudinal field study that focused on workplace acceptance of ECG-based wearable authentication as a platform to mitigate unauthorized access. This was done through the study of two elements: establishing a model of endogenous and exogenous variables using structural equation modeling, and then validating the model's Importance-Performance Map Analysis as a key performance indicator that businesses can leverage in their adoption or deployment decisions.
Short Abstract: COVID-19 hit the world in December 2019 and remains arguably the biggest public health concern. The first case was identified in Wuhan, China, after which the virus spread throughout Europe and the US, leading to an ongoing pandemic, as officially declared by the WHO in March 2020. Since the very first stages of the pandemic, the global research community has mobilized to study the evolution of COVID-19 and understand its virology, pathophysiology, and epidemiology. The complexity of the problem requires the development of new methodologies and the formation of large collaborative teams across disciplines.
Our team joined this interdisciplinary research effort with the interest of understanding the dynamics of the
disease from a machine learning perspective. We want to understand the time evolution of COVID-19 and in
particular its changes with respect to non-pharmaceutical interventions, its comparison to seasonal influenza,
and its spatial differences with specific attention to the situation in the US by means of data-driven
mathematical modeling and topological tools.
Our first assessment concerns the relationship between qualitative changes in the curves of cases and deaths and non-pharmaceutical interventions, such as lockdowns, social distancing, and face mask usage. We attacked this problem by means of multivariate change point analysis, a statistical tool developed to detect changes in the distributional properties of a stochastic process. The homology groups of the sublevel sets of the first difference of the time series of cases give information about the instants at which the COVID-19 dynamics change. Paired with the timeline of government interventions, we can determine the lag between a measure's implementation and its effect on COVID-19. At the moment, our analysis is purely associative and not causal, although causality is surely a direction of crucial importance that must be pursued.
Our second assessment concerns the comparison between COVID-19 and influenza, which early in the pandemic were believed by some to be very similar, as both are contagious respiratory diseases caused by a virus and have overlapping symptoms. The analysis will be performed by developing algebraic topological tools (e.g. persistent homology) typical of topological data analysis. We will test the hypothesis that the persistence landscapes of COVID-19 and influenza (i.e. the shapes of their time trajectories) are similar using US data.
Finally, we will use the features extracted by persistent homology to cluster trajectories of COVID-19 cases and deaths, with the goal of identifying similarities in the evolution of the disease among counties and states in the US. This analysis is important for multiple reasons. For example, topological features of a time series are invariant with respect to time translations, so if two counties are determined to have similar time series, the evolution of the county hit earlier by COVID-19 can inform public policies for the similar county where the timeline of the disease is delayed. This analysis can also potentially uncover socio-economic factors related to the disease, as similar regions tend to cluster together.
The vastness of the problem, the wide applicability of the techniques, and the unfortunate fact that the COVID-19 pandemic is still ongoing make this a continuing project. In our presentation, we will discuss our updated results, incorporating the most recent data available close to the conference. We decided not to report any preliminary results in this abstract, as the situation may change by the time of the conference and that would risk making our current results outdated.
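As a sketch of the topological ingredient above: for a 1-D series, the 0-dimensional persistent homology of the sublevel sets can be computed with a simple merge procedure that pairs each local minimum (component birth) with the value at which its component merges into an older one (death). This is a minimal stand-in for the actual analysis, which uses multivariate change point methods and dedicated TDA software; the series below is hypothetical:

```python
def sublevel_persistence(values):
    """0-dimensional persistence pairs (birth, death) for the sublevel-set
    filtration of a 1-D series. The oldest component never dies (death = inf)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent, birth = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i], birth[i] = i, values[i]
        for j in (i - 1, i + 1):           # merge with processed neighbours
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # the younger component (larger birth value) dies here
                    old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
                    if birth[young] < values[i]:     # skip zero-persistence pairs
                        pairs.append((birth[young], values[i]))
                    parent[young] = old
    for r in {find(i) for i in parent}:
        pairs.append((birth[r], float("inf")))       # essential class
    return sorted(pairs)

series = [5, 2, 4, 1, 6, 3, 7]                        # hypothetical case counts
diffs = [b - a for a, b in zip(series, series[1:])]   # first differences
pairs = sublevel_persistence(diffs)
```

Long-lived pairs (large death minus birth) flag robust changes in the first-difference landscape; short-lived pairs correspond to noise.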
Short Abstract: Colorectal cancer is one of the most common human cancers. Studying multi-omics data such as transcriptomics and proteomics could be essential to discovering signature pathways in colorectal cancer progression. Here, we conducted transcriptomics and proteomics analyses of two independent colorectal cancer studies, each containing tumor and adjacent normal samples, to identify differences in signature pathways of cancer progression at the transcript and protein levels. As a result, we found that proteomics can provide pathway information additional to transcriptomics data. Focal adhesion pathway information related to actin cytoskeleton regulation and pathways in cancer was preserved at the protein level. Meanwhile, the oxidative phosphorylation pathway, which was highly correlated with neurodegenerative disease pathways (Alzheimer's disease, Parkinson's disease, Huntington's disease), was preserved at both the transcript and protein levels. This study also suggests that oxidative phosphorylation could be a potential therapeutic target in colorectal cancer.
Short Abstract: Single-cell RNA sequencing is used to identify the cellular composition of tissue samples. Cells with similar gene expression profiles are clustered and each cluster is assumed to represent a distinct cell type, but the specific cell type represented by each cluster is not known a priori. Many cell type cluster annotation methods use labeled reference cell atlases or manually-curated lists of marker genes to infer which cell type a new cluster represents, but these methods can introduce study and selection bias based on what the reference atlases or marker gene lists include. In addition, these methods are often incapable of annotating cells to cell types not found in the reference atlas or the list of marker genes, so full coverage of as many cell types as possible is critical for the accurate annotation of single-cell data. Here, we propose an approach that uses natural language processing of millions of PubMed abstracts to associate potential marker genes with all of the thousands of cell types described in the Uberon Ontology automatically and without bias.
First, we create numerical representations of genes and cell types by embedding them in a shared high-dimensional space based on the text of over 17 million biomedical abstracts in PubMed and the curated hierarchical relationships between cell types in the Uberon Ontology. We then train a multi-label deep neural network that takes as input a gene embedding vector and outputs a vector with length equal to the number of cell types in the Uberon Ontology, where each output element is a score indicating the association between the input gene and that element’s corresponding cell type. Our neural network uses a modified loss function that does not calculate loss for any output element that is unlabeled, allowing us to make predictions for novel gene/cell type pairs that are unseen in our training data.
Our cross-validation results show that our method can extend known marker gene lists to encompass novel gene/cell type relationships, even when the gene has not been previously studied in the context of our marker gene training data. Due to our integration of the Uberon Ontology structure, our method can also create entirely new marker gene lists for any cell type present in the Uberon Ontology, even when the cell type is not present in our training data.
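The modified loss described above, which skips unlabeled outputs, can be sketched as a masked binary cross-entropy. This is an illustrative plain-Python stand-in (the actual model is a multi-label deep neural network over embedding vectors), and the prediction/label values are hypothetical:

```python
import math

def masked_bce_loss(predictions, labels):
    """Binary cross-entropy averaged over labeled outputs only.
    labels: 1 = known marker, 0 = known non-marker, None = unlabeled.
    Unlabeled elements contribute nothing, so the network is free to make
    novel predictions for gene/cell-type pairs unseen in training."""
    total, count = 0.0, 0
    for p, y in zip(predictions, labels):
        if y is None:          # masked: no loss for unlabeled pairs
            continue
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count

# Hypothetical scores for one gene across four cell types, two unlabeled.
loss = masked_bce_loss([0.9, 0.2, 0.5, 0.5], [1, 0, None, None])
```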
Short Abstract: Human pluripotent stem cells (PSCs) are characterized by their self-renewal capacity and their ability to differentiate into every cell type in the body. Deciphering the molecular factors involved in stem cell fate decisions is crucial to understanding human development. Transcription factors (TFs) are known to play key roles in maintaining the stem cell state and in differentiation to each of the three germ layers. In recent years, long intergenic non-coding RNAs (lincRNAs) have emerged as regulators of pluripotency and differentiation. To study the role of lincRNAs in stem cell fate, we generated RNA-seq data from eleven time points during directed differentiation to cardiomyocytes. To elucidate the role of novel lincRNAs, we constructed a network of highly correlated lincRNAs. This network is enriched with lincRNAs whose functions have been experimentally verified. We then clustered all the lincRNAs with known markers of pluripotency and differentiation, which revealed over 500 differentially expressed non-coding RNAs putatively involved in different steps of cardiac differentiation. Furthermore, we explored the expression of alternatively spliced isoforms of TFs and lincRNAs throughout the experiment. We discovered dozens of TF and lincRNA genes whose alternative transcripts show distinct patterns of expression during pluripotency and differentiation, suggesting an important role for alternative RNA processing of coding and non-coding RNAs in early human development.
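Building a network of highly correlated genes, as described above, can be sketched by drawing edges where the absolute Pearson correlation across time points exceeds a threshold. The gene names, expression values, and 0.9 cutoff are all hypothetical illustrations:

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(expr, threshold=0.9):
    """Edges between genes whose expression across time points is highly
    correlated (positively or negatively)."""
    return [(g1, g2) for g1, g2 in combinations(sorted(expr), 2)
            if abs(pearson(expr[g1], expr[g2])) >= threshold]

expr = {"lincA": [1, 2, 3, 4],   # rises during differentiation (hypothetical)
        "lincB": [2, 4, 6, 8],   # co-expressed with lincA
        "lincC": [5, 1, 4, 2]}   # uncorrelated
edges = correlation_network(expr)
```

In the actual analysis, edges to lincRNAs with experimentally verified functions help transfer putative roles to uncharacterized neighbors.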
Short Abstract: The microbiome of the human body has been shown to have profound effects on physiological regulation and the development of disease. In particular, the microbiome of the human gut has been linked to the pathogenesis of metabolic diseases such as obesity, diabetes mellitus, and inflammatory bowel disease. However, association analyses based on statistical modeling of microbiome data have continued to be a challenge due to inherent noise, the complexity of the data, and the high cost of collecting a large number of samples. Recently, machine learning models have been advocated as a data-driven approach for the prediction of the host phenotype. However, given the relatively small size of microbiome datasets, it is often the case that datasets have far more features than samples, which can quickly lead to overfitting.
To address these challenges, we employed a deep learning framework to construct a data-driven simulation of microbiome data using conditional generative adversarial networks. Conditional generative adversarial networks train two neural network models against each other while leveraging side information to allow a single generative model across multiple conditions. We use the generative model to produce larger simulated datasets that are representative of the original dataset for further downstream analyses.
In our study, we used a dataset from a cohort of 121 patients with inflammatory bowel disease and 34 healthy patients, containing 195 microbial OTUs aggregated at the species level, and show that the conditional generative adversarial network can generate samples representative of the original data based on alpha and beta diversity metrics. Using 10-fold cross-validation, we augmented the dataset with 20,000 synthetic samples to train logistic regression and neural network models for the task of disease prediction and saw an increase in the area under the receiver operating characteristic curve from 0.778 to 0.849 using logistic regression and from 0.847 to 0.889 using neural networks, compared to using only the original data. In addition, we used the synthetic samples generated from this cohort to predict the disease status of a different external cohort of patients with inflammatory bowel disease and saw an increase from 0.734 to 0.832 using logistic regression and from 0.794 to 0.849 using neural networks.
Using two different cohorts of subjects with inflammatory bowel disease, we demonstrate that the synthetic samples generated by conditional generative adversarial networks are similar to the original data in both alpha and beta diversity metrics. In addition, we show that augmenting the training set with a large number of synthetic samples can improve the performance of logistic regression and neural network models in predicting host phenotype. Further results and details can be found in our previously published work.
1. Franzosa EA, Sirota-Madi A, Avila-Pacheco J, Fornelos N, Haiser HJ, Reinker S, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019;4(2):293-305. Epub 2018/12/10. doi: 10.1038/s41564-018-0306-4. PubMed PMID: 30531976.
2. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol. 2016;12(7):e1004977. Epub 2016/07/12. doi: 10.1371/journal.pcbi.1004977. PubMed PMID: 27400279; PubMed Central PMCID: PMCPMC4939962.
3. Mirza M and Osindero S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 2014.
4. Reiman D, Dai Y. Using Conditional Generative Adversarial Networks to Boost the Performance of Machine Learning in Microbiome Datasets. bioRxiv. 2020:2020.05.18.102814. doi: 10.1101/2020.05.18.102814.
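One of the sanity checks above, comparing alpha diversity between real and synthetic samples, can be sketched with the Shannon index. The abundance vectors below are hypothetical; the study compared full cohorts and also used beta diversity metrics:

```python
import math

def shannon(abundances):
    """Shannon alpha diversity of one sample's taxon abundance vector."""
    total = sum(abundances)
    props = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in props)

def mean_shannon(samples):
    """Average Shannon diversity across a set of samples."""
    return sum(shannon(s) for s in samples) / len(samples)

real = [[10, 10, 10, 10], [8, 12, 9, 11]]        # hypothetical real samples
synthetic = [[9, 11, 10, 10], [10, 9, 12, 9]]    # hypothetical generated samples
gap = abs(mean_shannon(real) - mean_shannon(synthetic))
```

A small gap between the real and synthetic means is one (necessary but not sufficient) indication that the generator preserves community structure.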
Short Abstract: Long QT syndrome (LQTS) is a potentially fatal condition found in roughly 1 of every 2,500 individuals. Missense mutations in the hERG potassium channel, which plays a key role in the repolarization of the heart, are implicated in LQTS. Genetic screening techniques are traditionally used by clinicians to identify gene variants that could lead to LQTS. However, functional characterization of novel gene variants, also known as variants of unknown significance (VUS), is generally infeasible. To predict which VUS may lead to LQTS, we designed a machine learning-based computational protocol. A descriptive matrix comprising both computational and experimental features for 205 characterized hERG variants was developed. The computationally derived features include scores from tools such as FoldX as well as molecular simulations. We trained three machine learning algorithms, decision trees, random forests, and support vector machines, via supervised learning. All were validated against test data and yielded accuracies approaching 80%. Among the features, FoldX scores proved to be the best descriptor for classifying the variants.
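The finding that a single stability score is the most informative descriptor can be illustrated with a one-threshold decision stump. This is a hedged sketch, not the study's protocol (which trained full decision trees, random forests, and SVMs on many features); the scores and labels below are hypothetical:

```python
def best_stump(scores, labels):
    """Find the score threshold that best separates pathogenic (1) from
    benign (0) variants, predicting 1 when score >= threshold."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical FoldX-style destabilization scores (higher = more destabilizing)
# and variant labels; not actual study data.
scores = [0.2, 0.5, 1.8, 2.4, 3.1, 0.9]
labels = [0,   0,   1,   1,   1,   0]
threshold, accuracy = best_stump(scores, labels)
```

When one feature separates classes this cleanly, even a stump performs well, which is consistent with a stability score dominating feature-importance rankings in tree-based models.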