Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


General Sessions

Schedule subject to change.
All times in Central Daylight Time (CDT)
Tuesday, May 11th
Keynote: Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks
  • Lorin Crawford

A consistent theme of the work done in the Crawford Lab is to take modern
computational approaches and develop theory that enables their interpretations to be
related back to classical genomic principles. The central aim of this talk is to address
variable selection and interpretability questions in nonlinear regression models (e.g.,
neural networks). Motivated by statistical genetics, where interactions are of particular
interest, we introduce novel, interpretable, and computationally efficient ways to
summarize the relative importance of genetic variants contributing to broad-sense
heritability and phenotypic variation.

Characterizing the regulation of the human snoRNome
  • Étienne Fafard-Couture, Université de Sherbrooke, Canada
  • Sonia Couture, Université de Sherbrooke, Canada
  • Sherif Abou-Elela, Université de Sherbrooke, Canada
  • Michelle S Scott, Université de Sherbrooke, Canada

Small nucleolar RNAs (snoRNAs) are a conserved family of non-coding RNAs primarily known to be involved in ribosome biogenesis. Most snoRNAs are embedded within introns of either protein-coding or non-coding host genes, suggesting a joint regulation of their expression [1]. Depending on their structure, snoRNAs can guide the 2’-O-methylation or the pseudouridylation of the target RNA to which they bind [2]. Their canonical targets comprise ribosomal RNA and small nuclear RNAs, although recent reports suggest a broader range of snoRNA targets (e.g. mRNAs, tRNAs, etc.) [3]. These studies underline that snoRNA functions are not limited to the regulation of ribosome assembly, but also include the modulation of splicing, polyadenylation and chromatin remodeling, thereby highlighting snoRNAs as master regulators of gene expression [3]. To further understand how snoRNAs modulate the human transcriptome, we employed a low structure bias RNA-Seq approach that combines a thermostable reverse transcriptase with a bioinformatic pipeline designed specifically for snoRNA quantification (TGIRT-Seq) that we recently optimized [4]. We sequenced ribodepleted RNA from seven healthy human tissues (breast, ovary, prostate, testis, liver, brain and skeletal muscle) in order to characterize snoRNA abundance patterns and their underlying determinants.
We find that snoRNAs represent on average the second most abundant RNA biotype after tRNAs across all the tissue samples. Interestingly, expressed snoRNAs can be categorized into two abundance classes that greatly differ in their embedding preferences, function and conservation level: 390 snoRNAs are uniformly expressed across tissues whereas 85 snoRNAs are enriched in brain or reproductive tissues. Uniformly expressed snoRNAs are mostly encoded within protein-coding genes and target mainly ribosomal RNA, whereas tissue-enriched snoRNAs are mostly embedded within non-coding genes and lack canonical targets. Strikingly, we observe that most uniformly expressed snoRNAs do not correlate or are even anticorrelated with the expression of their host gene, whereas tissue-enriched snoRNAs are tightly coupled to their host gene expression. We uncover that the host gene function and architecture play a central role in the regulation of snoRNA abundance. Indeed, we demonstrate that the presence of a dual-initiation promoter within a host gene facilitates the uncoupling of snoRNA and host gene expression. Depending on the transcription initiation type, the generated host gene transcript will either be functional or highly targeted by the nonsense-mediated decay machinery, whereas a functional snoRNA will be generated in both cases, thereby leading to complex abundance relationships. Furthermore, we find that a host gene's function is tightly linked to the correlation of its abundance with that of its embedded snoRNA: host genes coding for ribosomal proteins are highly correlated with the abundance of their embedded snoRNA whereas host genes involved in the regulation of RNA splicing, processing and binding show strong anticorrelation with their embedded snoRNA abundance.
Altogether, our results indicate that snoRNAs are not a mere group of ubiquitous housekeeping genes, but also include highly regulated and specialized RNAs. The complex relationships between snoRNAs and host genes underline the various ways by which the snoRNome meets the functional needs of the different human tissues.
1. Dieci et al. Eukaryotic snoRNAs: A paradigm for gene expression flexibility. Genomics. 2009.
2. Bratkovic et al. Functional diversity of small nucleolar RNAs. Nucleic Acids Research. 2020.
3. Bergeron et al. Small nucleolar RNAs: continuing identification of novel members and increasing diversity of their molecular mechanisms of action. Biochemical Society Transactions. 2020.
4. Boivin et al. Simultaneous sequencing of coding and noncoding RNA reveals a human transcriptome dominated by a small number of highly expressed noncoding genes. RNA. 2018.
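
The coupling between a snoRNA and its host gene described in this abstract can be illustrated with a plain Pearson correlation across tissues. The sketch below is only an illustration: the abundance values are made up, the variable names are hypothetical, and the abstract does not state which correlation measure the authors used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical abundance values in seven tissues (illustrative numbers only).
sno_abundance = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
coupled_host = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
anticorrelated_host = [22.0, 20.0, 18.0, 16.0, 14.0, 12.0, 10.0]

print(pearson(sno_abundance, coupled_host))
print(pearson(sno_abundance, anticorrelated_host))
```

A coefficient near +1 corresponds to the tight coupling reported for tissue-enriched snoRNAs, and a coefficient near -1 to the anticorrelation reported for some uniformly expressed ones.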

Mapping RNA editing patterns during coronavirus infection using Guttman scaling
  • Noel-Marie Plonski, Kent State University, United States
  • Caroline Nitirahardjo, Kent State University, United States
  • Chiara De Arcangelis, Kent State University, United States
  • Jacqlyn Caspers, Kent State University, United States
  • Violet Goldlinger, Kent State University, United States
  • Eric Takacs, Kent State University, United States
  • Richard Meindl, Kent State University, United States
  • Helen Piontkivska, Kent State University, United States

Background: SARS-CoV-2, an ssRNA virus in the same family as SARS-CoV, is responsible for causing COVID-19, a respiratory disorder with symptoms ranging from mild flu-like symptoms to severe respiratory distress and organ failure. There is evidence of increased expression of inflammatory cytokines and other interferon-stimulated genes (ISGs) due to innate immune activation. Many of these cytokines have been linked with severity of symptoms and clinical outcomes of infected patients. However, relatively little is known about the expression of one particular ISG, adenosine deaminase acting on RNA (ADAR), more specifically the isoform ADAR1p150, which in addition to its key role in transcriptome diversity is responsible for aiding in fighting off viral infections by editing viral RNA. RNA editing patterns in the RNA of both SARS-CoV-1 and SARS-CoV-2 are significantly altered during infection and seem to influence viral evolution. Although ADAR editing has been examined from the viral side, the effects of altered ADAR editing patterns in the host transcriptome have not been explored. We hypothesize that changes to ADAR editing patterns, driven by differential expression of ADAR1p150 during viral infection, can influence structure and function of proteins derived from edited transcripts, leading to clinically relevant consequences. These profiles can in turn help explain underlying molecular mechanisms of the broad spectrum of symptoms seen in patients with COVID-19 including dry cough, asthma-like inflammation, mild to severe respiratory distress, anosmia, and gastrointestinal distress.
Methods: Here we use a publicly available RNA sequencing dataset with intestinal (Caco2), lung epithelial (Calu3) and lymph node (H1299) cell lines infected with SARS-CoV and SARS-CoV-2 to map differential ADAR editing patterns during infection. We used our previously developed RNA-seq pipeline AIDD (Automated Isoform Diversity Detector) to build immune gene expression profiles and to map ADAR editing landscapes using machine learning approaches, including Guttman scaling and random forest analysis.
Results: There is significantly higher expression of the ADAR1p150 isoform in the lung epithelial cell line, but not in the intestinal cell line or the lymph node cell line, in SARS-CoV-2 infection. Additionally, in SARS-CoV-2, ADAR1p150 is moderately correlated with editing events at ADAR editing sites that have a high or moderate impact on protein structure and function in the lung epithelial cell line, while being negatively correlated in the intestinal cell line. This indicates that increased ADAR1p150 expression driven by the infection may, at least in part, be driving increases in global ADAR editing. SARS-CoV-1-derived data show similar patterns, although with a correlation between editing and ADAR expression. Because these editing sites are predicted to have an effect on proteins’ structure or function, dysregulation of editing at these targets can potentially contribute to molecular mechanisms behind observed clinical symptoms.
Conclusions: Dynamic changes to protein structure and function such as those caused by dysregulation of ADAR editing can help shed light on the pathogenesis of COVID-19 and may provide novel investigative avenues for prognostic, diagnostic and therapeutic options.
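
Guttman scaling, named in the Methods above, asks whether binary items form a cumulative hierarchy. One common summary is the coefficient of reproducibility. The sketch below is a minimal illustration under stated assumptions: rows stand for samples and columns for hypothetical editing sites (1 = edited), errors are counted against the ideal cumulative pattern for each row's total score, and none of it is taken from the AIDD pipeline.

```python
def coefficient_of_reproducibility(matrix):
    """Return 1 - errors / (rows * items) for a binary response matrix.

    Items are ordered from most to least frequently positive. For a row
    with total score s, the ideal Guttman pattern is positive on the s
    most frequent items and negative elsewhere; every cell that deviates
    from that ideal pattern counts as one error.
    """
    n_rows = len(matrix)
    n_items = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_items)]
    # Stable sort: ties keep their original column order.
    order = sorted(range(n_items), key=lambda j: -totals[j])
    errors = 0
    for row in matrix:
        score = sum(row)
        for rank, j in enumerate(order):
            ideal = 1 if rank < score else 0
            if row[j] != ideal:
                errors += 1
    return 1 - errors / (n_rows * n_items)


# Hypothetical data: 4 samples x 3 editing sites.
perfect_scale = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
one_deviant_row = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]

cr_perfect = coefficient_of_reproducibility(perfect_scale)
cr_deviant = coefficient_of_reproducibility(one_deviant_row)
print(cr_perfect, cr_deviant)
```

Values close to 1 indicate that the items behave like a cumulative scale; the last row of the second matrix breaks the hierarchy and lowers the coefficient.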

Multiple spliced alignment for reconstructing the evolution of alternative transcripts
  • Abigail Djossou, University of Sherbrooke, Canada
  • Safa Jammali, University of Sherbrooke, Canada
  • Aida Ouangraoua, University of Sherbrooke, Canada

Background: It is now well established that alternative splicing is a ubiquitous process in eukaryotic organisms that allows multiple distinct spliced transcripts to be produced from the same gene. Yet, existing gene tree reconstruction methods make use of a single reference transcript per gene to reconstruct gene family evolution and infer gene orthology relationships. The study of the evolution of sets of transcripts within a gene family is still in its infancy. One prerequisite for this study is the availability of methods to compare sets of transcripts while accounting for their splicing structure. In this context, a natural generalization of pairwise spliced alignment is multiple spliced alignment, that is, an alignment of a set of spliced RNA sequences with a set of unspliced genomic sequences.

Methods: We have developed a collection of algorithms to compute multiple spliced alignments, infer splicing orthology relationships, and reconstruct transcript and gene evolutionary histories. The new spliced alignment algorithms account for sequence similarity and splicing structure of the input sequences. A multiple spliced alignment is obtained by combining several pairwise spliced alignments between spliced transcript sequences and unspliced genes of a gene family, using a progressive multiple sequence alignment strategy. Splicing orthology relationships, gene tree and transcript tree are computed based on the multiple spliced alignment of all genes and transcripts of a gene family.

Results: The application of the algorithms to real vertebrate and simulated data shows that the new method provides a good balance between accuracy and execution time, compared to existing spliced alignment algorithms. Moreover, we show that it is robust and performs well for various levels of similarity between input sequences, thanks to the use of the splicing structure of input sequences. The application also illustrates the usefulness of our methods to compute gene family super-structures, to identify splicing ortholog groups and conserved structural features, as well as to improve gene model annotations.

[1] S. Jammali, J-D. Aguilar, E. Kuitche, A. Ouangraoua. (2019). SplicedFamAlign: CDS-to-gene spliced alignment and identification of transcript orthology groups. BMC bioinformatics, 20(3), 133.

[2] E. Kuitche, S. Jammali, and A. Ouangraoua. (2019). SimSpliceEvol: alternative splicing-aware simulation of biological sequence evolution. BMC bioinformatics, 20(20), 1-13.

A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data
  • Vishal Sarsani, University of Massachusetts Amherst, United States
  • Shai He, University of Massachusetts Amherst, United States
  • Patrick Flaherty, University of Massachusetts, Amherst, United States
  • Aaron Schein, Columbia University, United States

There are distinguishing features or "hallmarks" of cancer that are found across tumors, individuals, and types of cancer, and these hallmarks can be driven by specific genetic mutations. Yet, within a single tumor there is often extensive genetic heterogeneity as evidenced by single-cell and bulk DNA sequencing data. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. Understanding the genetic composition of the tumor at the time of treatment is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment.

We propose a Bayesian nonparametric hierarchical Dirichlet process mixture model for combining information from bulk and single-cell next-generation DNA sequencing data from multiple samples and from multiple individuals. This hierarchical Dirichlet process mixture model has tunable hyperparameters that control the a priori concentration of the subpopulation distribution for each sample; these hyperparameters can be estimated in an empirical Bayes setting or set directly when the concentration is known, for example, when the sample is from a single cell. The hierarchical structure models the nested sampling structure in real NGS datasets that arises from drawing multiple bulk and single-cell biopsies from multiple individuals. Inference with our model provides estimates of the subpopulation genotypes and the distribution over subpopulations in each sample. We represent the model as a Gamma-Poisson hierarchical model and we derive a fast Gibbs sampling algorithm based on this representation using the augment-and-marginalize method. This representation and inference algorithm are generalizable to other models that make use of a hierarchical Dirichlet process prior and can be employed to derive a fast Gibbs sampler with analytical sampling steps for other models.
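
A standard identity that makes Gamma-based representations of Dirichlet-type priors possible is that independent Gamma draws, once normalized, follow a Dirichlet distribution. The sketch below shows only that general identity with hypothetical concentration values; it is not the authors' model or their Gibbs sampler.

```python
import random


def dirichlet_from_gammas(concentrations, rng):
    """Draw Dirichlet-distributed weights by normalizing Gamma draws.

    If g_k ~ Gamma(shape=a_k, scale=1) independently, then the vector
    g / sum(g) is distributed as Dirichlet(a_1, ..., a_K).
    """
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical concentrations for three subpopulations in one sample.
weights = dirichlet_from_gammas([2.0, 1.0, 0.5], rng)
print(weights)
```

Smaller concentration values push most of the mass onto a few components, which is the behavior one would want for a sample that is known to contain a single subpopulation.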

Because our inference algorithm produces samples from the full posterior distribution, the estimates of the subpopulation genotypes and of the distribution of subpopulations in individual samples come with rigorous Bayesian uncertainty quantification. Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analysis of a real acute lymphoblastic leukemia sequencing dataset shows that our model improves upon state-of-the-art bioinformatic methods. An interpretation of the results of our model on this real dataset reveals co-mutated loci across samples.

A preliminary version of this work is available on bioRxiv https://doi.org/10.1101/2020.11.10.330183 and has been accepted for publication in the Annals of Applied Statistics.

Genetic variants improve the diagnosis of pancreatic cancer
  • Ali Al-Fatlawi, TU Dresden, Germany
  • Michael Schroeder, TU Dresden, Germany

For optimal pancreatic cancer treatment, early and accurate diagnosis is vital. Blood-derived biomarkers and genetic predispositions can contribute to early diagnosis, but they often have limited accuracy or applicability. Here, we seek to exploit the synergy between both approaches by combining the biomarker CA19-9 with novel genetic variants. We aim to use deep sequencing and deep learning to improve differentiating resectable pancreatic ductal adenocarcinoma from chronic pancreatitis and to estimate survival.
We obtained samples of nucleated cells found in peripheral blood from around 270 patients suffering from resectable pancreatic ductal adenocarcinoma (rPDAC), non-resectable pancreatic cancer (nrPC), or chronic pancreatitis (CP). We sequenced mRNA with high coverage and reduced millions of raw variants to hundreds of high-quality, significant genetic variants. Together with CA19-9 values, these served as input to deep learning models that separate cancer from chronic pancreatitis.
Our deep learning models achieved an area under the curve (AUC) of 92% or better. In particular, differentiating resectable PDAC from pancreatitis can be solved with an AUC of 96%. Moreover, we identified genetic variants to estimate survival in rPDAC patients.
Overall, we show that the blood transcriptome harbours genetic variants, which can substantially improve non-invasive clinical diagnosis and patient stratification in pancreatic cancer using standard laboratory practices.
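
The area under the ROC curve reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way on made-up labels and scores; the data and names are hypothetical and unrelated to the study's models.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = cancer, 0 = chronic pancreatitis.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.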

Characterizing the genome-wide associations of somatic mutations and chromatin accessibility in cancer
  • Oliver Ocsenas, University of Toronto, Department of Medical Biophysics; Ontario Institute for Cancer Research, Canada
  • Juri Reimand, University of Toronto, Department of Medical Biophysics; Ontario Institute for Cancer Research, Canada

Cancer is a disease caused by somatic mutations. Driver mutations are positively selected for and confer oncogenic fitness to cells. However, most mutations are neutral passengers and occur as a result of mutational processes. Mutations are not distributed randomly across the genome. At the megabase-scale, somatic mutation frequencies are negatively associated with chromatin accessibility. This is most likely due to increased activity of repair pathways in transcribed and earlier-replicating regions. Mutational processes also generate characteristic mutational signatures that are identified through linear combinations of probabilities of mutation types. We can use these signatures to understand which mutational processes are associated with specific mutations in cancer genomes.

Chromatin accessibility is known to be associated with megabase-scale variation in mutation frequency. Earlier studies used the chromatin states of cell lines and normal tissues to predict mutation rates in cancer. They then used the most predictive chromatin states to infer cancer cell-of-origin and mutational timing. However, the association between primary tumor chromatin state and cancer mutation rates, as well as the association between chromatin accessibility and mutational processes remain uncharacterized. Here we performed a systematic analysis of 2517 cancer whole genomes with 23+ million single nucleotide variants (SNVs) and 677 chromatin accessibility profiles to investigate the associations of the chromatin landscape and mutational processes in cancer. Using a random forest machine-learning approach, we evaluated chromatin accessibility profiles of tumor and normal tissues as predictors of mutation rates in 25 cancer types, pan-cancer, and known mutational signatures.
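
The response variable in such an analysis is a regional mutation count. A minimal sketch of how SNVs could be binned into fixed-size genomic windows is shown below; the 1 Mb window size is an assumption based on the phrase "megabase-scale", the variant list is invented, and the random forest step itself is not shown.

```python
from collections import Counter

WINDOW_SIZE = 1_000_000  # assumed 1 Mb windows ("megabase-scale")


def count_per_window(variants, window_size=WINDOW_SIZE):
    """Count variants per (chromosome, window index).

    Positions are treated as 1-based, so positions 1..window_size fall
    into window 0, the next window_size positions into window 1, etc.
    """
    counts = Counter()
    for chrom, pos in variants:
        counts[(chrom, (pos - 1) // window_size)] += 1
    return counts


# Hypothetical SNV positions.
variants = [
    ("chr1", 1),
    ("chr1", 1_000_000),
    ("chr1", 1_000_001),
    ("chr2", 2_500_000),
]
window_counts = count_per_window(variants)
print(dict(window_counts))
```

Each window's count could then be paired with the chromatin accessibility signal averaged over the same window to form one training example.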

We first evaluated whether tumor or normal tissue chromatin accessibility profiles were stronger predictors of regional variation in somatic mutation rates. We trained separate models on either normal tissue or primary tumor chromatin tracks to predict mutation rates in 25 cancer types and the pan-cancer dataset. Mutation frequencies were more accurately predicted by chromatin accessibility profiles of primary tumors than normal cells in most cancer types. Since chromatin accessibility is associated with passenger mutations, a stronger association with tumor-specific chromatin accessibility suggests that most passenger mutations occurred after the cells had acquired a cancer-specific chromatin landscape. We examined the most predictive chromatin tracks for 12 cancer types for which we had chromatin tracks for corresponding tumor and normal tissue. As expected, several cancer types showed chromatin accessibility profiles of matching cancers as the strongest predictors of mutation rates. A few exceptions were apparent: mutations in melanomas, glioblastomas, leukemias and lymphomas were better predicted by chromatin accessibility of matching normal tissues. This result suggests an earlier mutational timing in these four cancer types. We also analyzed the associations of mutational signatures and the chromatin landscape. We found that chromatin state was an accurate predictor of mutation rates associated with carcinogens, while the mutations of endogenous mutational processes were less associated. Signatures of unknown etiology were also highly associated with chromatin accessibility, potentially indicating previously unknown carcinogenic signatures. This is most evident in signature 17, thought to be caused by acid reflux-related oxidative damage in stomach and esophageal adenocarcinoma. Lastly, we examined genomic regions which had an abundance of mutations that were not predicted by our model. These ‘high-error’ windows were significantly associated with known cancer genes and biological processes.

Our integrative analysis of mutations and chromatin state provides insight into tumor evolution and genetic heterogeneity. Specifically, we provide evidence relating to mutational timing and tissue of origin in 12 cancer types and multiple mutational signatures. Additionally, the association between chromatin accessibility and regional mutation frequencies can be used to refine cancer driver discovery methods. Associations of chromatin and regional mutagenesis can also be used to improve models predicting tumor tissue of origin based on their somatic regional mutation rates.

EDI Panel Introduction

Identifying regulators of an age-associated immunosuppressive CD4+ T cell population using gene regulatory network inference from single-cell genomics
  • Emily Miraldi, Cincinnati Children's Hospital Medical Center, United States
  • Joseph Wayman, Cincinnati Children's Hospital Medical Center, United States
  • Maha Almanan, Cincinnati Children's Hospital Medical Center, United States
  • Alyssa Thomas, Cincinnati Children's Hospital Medical Center, United States
  • Claire Chougnet, Cincinnati Children's Hospital Medical Center, United States
  • David Hildeman, Cincinnati Children's Hospital Medical Center, United States

Aging is associated with dampened immune responses to vaccines. Recently, we discovered an immunosuppressive population of IL-10-producing CD4+ T cells bearing markers of follicular helper cells [1]. These so-called "Tfh10" cells accumulate dramatically in aged mice and humans. We also found that suppression of IL-10 signaling restored vaccine responses in aged mice.

In this study, we combined single-cell (sc)RNA-seq and scATAC-seq profiling with gene regulatory network (GRN) modeling to assess mechanisms controlling the function and homeostasis of memory CD4+ T cell subsets in young and aged mice, with a focus on Tfh10. Clustering of scRNA-seq and scATAC-seq cells revealed six major populations: Th1, Tfh, Treg, Tfh10 and cytotoxic CD4+ T cells. As expected, Tfh10 were the major producers of IL-10 in aged mice. Leveraging scRNA-seq and scATAC-seq, we used our block-sparse multi-task GRN inference algorithm to simultaneously construct GRNs for both aged and young CD4+ T cells, where scATAC-seq was integrated as an age-specific prior of TF-gene interactions (based on inferred enhancer-promoter interactions and predicted TF binding).

We identified "core" TFs controlling cell-type-specific gene expression in young and old T cell subsets. Clustering cell types based on core TFs showed that Tfh10 cells most closely resemble Tfh and regulatory T cells, share some circuits with Th1, but also include unique regulators. Our data also suggest that Tfh10 accrual is controlled via repression of TFs that promote alternative subsets. Further, aligning our data with age-matched whole spleen scRNA-seq from the Tabula Muris Senis Atlas [2] showed that intercellular interactions driving Tfh10 include LIGHT and CD6-ALCAM signaling. Overall, using GRN inference from multimodal single-cell data, we elucidate age-associated regulatory mechanisms of an immunosuppressive CD4+ T cell population.

[1] Almanan et al. (2020) Science Advances
[2] Tabula Muris Consortium (2020) Nature

Data-driven biological network alignment that uses topological, sequence, and functional information
  • Shawn Gu, University of Notre Dame, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Many proteins' functional annotations are missing. Uncovering these annotations is often done by transferring functional knowledge between proteins across species. One common technique to do so is sequence alignment. However, sequence similarity between proteins does not automatically mean they are functionally related (perform similar functions), and sequence dissimilarity does not automatically mean they are functionally unrelated (do not perform similar functions). This discrepancy between sequence similarity and functional relatedness likely exists because proteins' functions are not determined solely by their sequences. Rather, cellular functioning is carried out by proteins interacting with each other in complex networked ways. So, we argue that functional prediction across species can be improved by considering protein-protein interactions (PPIs) in addition to sequence information.

Proteins and their PPIs can be modeled as PPI networks, where nodes are proteins and edges are PPIs. Then, PPI networks between species can be compared with network alignment (NA). Typically, NA aims to find a node mapping between the networks that uncovers regions of high network topological (and often sequence) similarity. Then, analogous to sequence alignment, NA can be used to transfer functional knowledge between conserved (aligned) PPI network, rather than just sequence, regions of the species.

Traditionally, NA assumes that topological similarity (isomorphic-like matching) between network regions means that the regions are functionally related. However, we argue that this is not a reasonable assumption. For one, noise in current PPI network data is an issue. With many missing and spurious PPIs, mismatches between proteins’ topological similarity and their functional relatedness are likely. For example, suppose that a set of three proteins that are all linked to each other via PPIs (i.e., a triangle) is in reality fully evolutionarily conserved (i.e., functionally related) between two species. In this scenario, an NA method would find that the two triangles in the two species are topologically similar, and (correctly) infer that they are functionally related. But say that, due to data noise, one of the three PPIs that actually exists in reality is missing in exactly one of the two species’ current PPI networks. Then, it is a 3-node path in that species that should be aligned to a triangle in another species in order to identify the functional match. That is, the functionally related regions are now topologically dissimilar due to the data noise.
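
The triangle example above can be made concrete by comparing degree sequences, a very simple topological summary. The sketch below uses hypothetical protein names and is not how any NA method, including TARA, measures topological similarity.

```python
def degree_sequence(edges):
    """Return the node degrees of an undirected edge list, largest first."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values(), reverse=True)


# Species A: all three PPIs of the conserved triangle are observed.
species_a = [("a1", "a2"), ("a2", "a3"), ("a1", "a3")]

# Species B: the same triangle exists in reality ...
species_b_true = [("b1", "b2"), ("b2", "b3"), ("b1", "b3")]
# ... but one PPI is missing from the current (noisy) network data.
species_b_observed = [("b1", "b2"), ("b2", "b3")]

print(degree_sequence(species_a))
print(degree_sequence(species_b_true))
print(degree_sequence(species_b_observed))
```

The true networks have identical degree sequences, whereas the observed network of species B looks like a 3-node path, so a method that rewards only topological similarity would score this functionally conserved region as a mismatch.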

Second, even when PPI network data become complete, the traditional assumption of topological similarity is unlikely to hold due to biological variation between species. Namely, molecular evolutionary events such as gene duplication, deletion, or mutation may cause PPI network topology to differ across species’ evolutionarily conserved network regions. Even for sequence alignments, pairwise sequence identity as low as 30% is sufficient to indicate evolutionary conservation (i.e., homology) for 90% of all protein pairs. So, one can perhaps expect evolutionarily conserved PPI networks of different species to be as topologically dissimilar.

So, we redefined NA as a data-driven framework called TARA, which learns from network and protein functional data what kind of topological relatedness (rather than similarity) between proteins corresponds to their functional relatedness. TARA used topological information (within each network) but not sequence information (between proteins across networks). Yet, TARA yielded higher protein functional prediction accuracy than existing NA methods, even those that used both topological and sequence information.

So, we propose TARA++ that is also data-driven, like TARA and unlike other existing methods, but that uses across-network sequence information on top of within-network topological information, unlike TARA. To deal with the within-and-across-network analysis, we adapt social network embedding to the problem of biological NA. TARA++ outperforms existing methods in terms of protein functional prediction accuracy.

Streamlining signaling pathway reconstruction
  • Chris S Magnano, University of Wisconsin-Madison, United States
  • Tobias Rubel, Reed College, United States
  • Adam Shedivy, University of Wisconsin-Madison, United States
  • Pramesh Singh, Reed College, United States
  • Anna Ritz, Reed College, United States
  • Anthony Gitter, University of Wisconsin-Madison; Morgridge Institute for Research, United States

Transcriptomic, proteomic, and other high-throughput assays are routinely applied to characterize cellular states. These data are most powerful when they can be analyzed within the context of a biological system, and network-based pathway reconstruction algorithms provide one popular strategy for attaining this systems-level understanding. Pathway reconstruction algorithms connect genes and proteins of interest in the context of a general protein-protein interaction network in order to characterize a cellular response. This technique has been central in important applications such as identifying mutated genes in cancer, recovering differentially expressed subnetworks, and expanding pathways beyond their limited representations in pathway databases [1,2].

Although dozens of graph algorithms have been applied to pathway reconstruction tasks, practical challenges have suppressed their broader adoption. Each individual method has its own input and output file formats, installation process, and user-specified parameters. Different algorithms employ varied objective functions and optimization strategies, and recognizing which method is appropriate for a particular dataset and how to set its unique parameters requires domain expertise in pathway reconstruction. As a consequence, it can be cumbersome to use a single pathway reconstruction algorithm correctly, let alone try multiple methods to empirically assess which is best for the dataset at hand.

We present a unified conceptual framework and workflow for pathway reconstruction. Our framework defines a single common workflow for pathway reconstruction and wraps popular pathway reconstruction algorithms to fit within that workflow. The workflow is implemented using Snakemake, allowing users to easily run multiple pathway reconstruction algorithms on multiple datasets with different hyperparameters on local, cluster-based, and cloud-based platforms. Inspired by related projects for single-cell transcriptomics [3,4], we provide Docker images for pathway reconstruction algorithms to sidestep installation challenges. Our framework specifies a straightforward universal input format for omic and network data. Similarly, a universal output format for predicted pathways supports supplementing pathway reconstruction with downstream analyses [5] such as network visualization, gene set enrichment, and comparison with pathway databases. The modular design allows third-party contributors to add new pathway reconstruction algorithms by providing a suitable Docker image and translation functions for the input data and output predictions.
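
The idea of wrapping several algorithms behind one input and output format can be sketched as a small registry of functions that share a signature. Everything in the sketch is hypothetical: the two "algorithms" are toy stand-ins rather than real pathway reconstruction methods, and the names and data structures are not taken from the framework's code, Snakemake rules, or Docker images.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]
# Common signature: (interaction network, nodes of interest) -> predicted edges.
Algorithm = Callable[[List[Edge], Set[str]], List[Edge]]


def induced_subnetwork(network: List[Edge], nodes: Set[str]) -> List[Edge]:
    """Toy stand-in: keep edges whose two endpoints are both of interest."""
    return [(u, v) for u, v in network if u in nodes and v in nodes]


def one_hop_expansion(network: List[Edge], nodes: Set[str]) -> List[Edge]:
    """Toy stand-in: keep edges touching at least one node of interest."""
    return [(u, v) for u, v in network if u in nodes or v in nodes]


# Registry of wrapped algorithms; a new method is added with one entry.
ALGORITHMS: Dict[str, Algorithm] = {
    "induced": induced_subnetwork,
    "one_hop": one_hop_expansion,
}


def run_all(network: List[Edge], nodes: Set[str]) -> Dict[str, List[Edge]]:
    """Run every registered algorithm on the same input, same output type."""
    return {name: algo(network, nodes) for name, algo in ALGORITHMS.items()}


network = [("A", "B"), ("B", "C"), ("C", "D")]
results = run_all(network, {"A", "B"})
print(results)
```

Because every wrapped method returns the same edge-list type, downstream steps such as visualization or enrichment can be written once and applied to all outputs.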

Our pathway reconstruction framework streamlines omic data analysis and integration; researchers can easily explore different types of pathway reconstruction algorithms and select those that best address their biological questions. The Docker-based system also makes many pathway reconstruction algorithms more accessible to users without the expertise for complicated installations. In addition, the framework will encourage developing new approaches for pathway reconstruction. The ability to compare multiple algorithms directly can reveal gaps in the current state-of-the-art methods and facilitate objective benchmarking of new algorithms. Our platform connects the computational and molecular systems biology communities through the common goal of pathway reconstruction and analysis.

An early prototype is available from https://github.com/Reed-CompBio/pathway-reconstruction-enhancer.

[1] Ritz et al. NPJ Syst Biol Appl. 2016 Mar 3;2:16002. doi: 10.1038/npjsba.2016.2.
[2] Köksal et al. Cell Rep. 2018 Sep 25;24(13):3607-3618. doi: 10.1016/j.celrep.2018.08.085.
[3] Pratapa et al. Nat Methods. 2020 Feb;17(2):147-154. doi: 10.1038/s41592-019-0690-6.
[4] Saelens et al. Nat Biotechnol. 2019 May;37(5):547-554. doi: 10.1038/s41587-019-0071-9.
[5] Rubel and Ritz ACM-BCB 2020. 2020 Sep; doi: 10.1145/3388440.3412411.

Predicting chemical mode-of-action through targeted CRISPR-Cas9 chemical-genetic screens
  • Kevin Lin, University of Minnesota, United States
  • Maximilian Billmann, University of Minnesota, United States
  • Henry Ward, University of Minnesota, United States
  • Ya-Chu Chang, University of Minnesota, United States
  • Anja-Katrin Bielinsky, University of Minnesota, United States
  • Chad Myers, University of Minnesota, United States

Presentation Overview: Show

Screening drug compounds against a collection of defined gene mutants can identify mutations that sensitize or suppress a drug’s effect. These chemical-genetic interaction screens can be performed in human cell lines using a pooled lentiviral CRISPR-Cas9 approach, allowing for interrogation of many single-gene knockout backgrounds and assessing their differential sensitivity or resistance to drugs. While genome-wide chemical-genetic screens (typical size: ~70,000 sgRNAs targeting ~18,000 genes) can inform candidate compounds for drug development, many labs do not have the resources to perform large-scale screens for more than a few compounds. Our pilot screen with Bortezomib (a proteasome inhibitor) shows that a small targeted CRISPR library (~3,000 sgRNAs targeting ~1,000 genes) screened at higher sequencing depth (1000X) can 1) recover biological information at a higher signal-to-noise ratio compared to genome-wide screens, and 2) reduce resource costs, which allows for higher-throughput screening.

Quantification of the phenotypic readout (cell viability) from these screens can be translated to chemical-genetic interaction (CGI) profiles, or a “fingerprint” indicative of a compound’s mode-of-action. Our group has developed a novel R software package for scoring chemical-genetic interactions, ranking candidate drug targets, and predicting compound mode-of-action. Chemical-genetic profiles are analogous to genetic interaction (GI) profiles, which represent differential sensitivity/resistance of genetic perturbations to a second genetic perturbation rather than a compound. To predict genetic target(s) of known or uncharacterized bioactive compounds, we leverage the property that a compound’s CGI profile will be similar to the GI profile of its target. We plan to integrate chemical-genetic interaction profiles with genetic interaction profiles produced by other ongoing efforts to improve our drug target prediction approach. The combination of our targeted screening approach and novel computational framework will provide a more feasible and scalable platform for drug discovery.
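The target-prediction principle above (a compound's CGI profile resembles the GI profile of its target) can be sketched in a few lines of Python. This is not the authors' R package: the use of Pearson correlation as the similarity measure, the function names, and the toy profiles are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no similarity signal
    return cov / sqrt(vx * vy)


def rank_targets(cgi_profile, gi_profiles):
    """Rank candidate targets by similarity of their GI profile to the CGI profile.

    cgi_profile: dict library gene -> interaction score for one compound
    gi_profiles: dict candidate target -> dict library gene -> GI score
    """
    genes = sorted(cgi_profile)
    compound_vec = [cgi_profile[g] for g in genes]
    scores = {}
    for target, gi in gi_profiles.items():
        # Genes missing from a GI profile are treated as "no interaction" (0.0).
        gi_vec = [gi.get(g, 0.0) for g in genes]
        scores[target] = pearson(compound_vec, gi_vec)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores over a four-gene library; all names are placeholders.
cgi = {"g1": -2.0, "g2": 0.1, "g3": 1.5, "g4": -0.3}
gi = {
    "targetA": {"g1": -1.8, "g2": 0.0, "g3": 1.2, "g4": -0.2},
    "targetB": {"g1": 0.5, "g2": -1.0, "g3": -0.7, "g4": 0.9},
}
ranked = rank_targets(cgi, gi)
```

In this toy case `targetA` ranks first because its GI profile tracks the compound's CGI profile almost exactly, while `targetB` is anticorrelated.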

AcrFinder and AcrDB: genome mining tools for anti-CRISPR operons in prokaryotes and their viruses
  • Yanbin Yin, University of Nebraska - Lincoln, United States

Presentation Overview: Show

Anti-CRISPR (Acr) proteins encoded by (pro)phages/(pro)viruses have great potential to enable more controllable genome editing. The reason is that these short Acr proteins (most < 150 aa) are made by phages/viruses and other mobile genetic elements to inhibit the CRISPR-Cas systems of their prokaryotic hosts for successful invasion and survival. However, genome mining for new Acr proteins is challenging due to the lack of a conserved functional domain and the low sequence similarity among experimentally characterized Acr proteins. We will introduce two bioinformatics tools: a web server (http://bcb.unl.edu/AcrFinder) and an online database (http://bcb.unl.edu/AcrDB) that combine three well-accepted ideas used by previous experimental studies to pre-screen genomic data for Acr candidates. These ideas include homology search, guilt-by-association (GBA), and CRISPR-Cas self-targeting spacers.

CRISPR-Cas is an anti-viral mechanism of prokaryotes that has been widely adopted for genome editing. To make CRISPR-Cas genome editing more controllable and safer to use, anti-CRISPR proteins have been recently exploited to prevent excessive/prolonged Cas nuclease cleavage. Anti-CRISPR (Acr) proteins are encoded by (pro)phages/(pro)viruses, and have the ability to inhibit their host’s CRISPR-Cas systems. We have built an online database AcrDB (http://bcb.unl.edu/AcrDB) by scanning ~19,000 genomes of prokaryotes and viruses with AcrFinder, a recently developed Acr-Aca (Acr-associated regulator) operon prediction program. Proteins in Acr-Aca operons were further processed by two machine learning-based programs (AcRanker and PaCRISPR) to obtain numerical scores/ranks. Compared to other anti-CRISPR databases, AcrDB has the following unique features: (i) It is a genome-scale database with the largest collection of data (39,799 Acr-Aca operons containing Aca or Acr homologs); (ii) It offers a user-friendly web interface with various functions for browsing, graphically viewing, searching, and batch downloading Acr-Aca operons; (iii) It focuses on the genomic context of Acr and Aca candidates instead of individual Acr protein family; and (iv) It collects data with three independent programs each having a unique data mining algorithm for cross validation. AcrDB will be a valuable resource to the anti-CRISPR research community.

AcrFinder has been published in NAR 2020 web server issue: https://academic.oup.com/nar/article/48/W1/W358/5836766

AcrDB has been published in NAR 2021 database issue: https://academic.oup.com/nar/article/49/D1/D622/5929236

PlasLR Enables Adaptation of Plasmid Prediction for Error-Prone Long Reads
  • Anuradha Wickramarachchi, Australian National University, Australia
  • Vijini Mallawaarachchi, Australian National University, Australia
  • Lianrong Pu, Fuzhou University, China
  • Yu Lin, Australian National University, Australia

Presentation Overview: Show

Plasmids are extra-chromosomal genetic elements commonly found in bacterial cells that support many functional aspects including environmental adaptations. The identification of these genetic elements is vital for the further study of function and behaviour of the organisms. However, it is challenging to separate these small sequences from longer chromosomes within a given species. Machine learning approaches have been successfully developed to classify assembled contigs into two classes (plasmids and chromosomes). However, such tools are not designed to directly perform classification on long and error-prone reads, which have been gaining popularity in genomics studies. Assembling complete plasmids is still challenging for many long-read assemblers with a mixed input of long and error-prone reads from plasmids and chromosomes.
In this paper, we present PlasLR, a tool that adapts existing plasmid detection approaches to directly classify long and error-prone reads. PlasLR makes use of both the composition and coverage information of long and error-prone reads.
We evaluate PlasLR on multiple simulated and real long-read datasets with varying compositions of plasmids and chromosomes.
Our experiments demonstrate that PlasLR substantially improves the accuracy of plasmid detection on top of the state-of-the-art plasmid detection tools.
Moreover, we show that using PlasLR before long-read assembly helps to enhance the assembly quality, especially on recovering more plasmids in metagenomic datasets.
The source code is freely available at https://github.com/anuradhawick/PlasLR.

Wednesday, May 12th
Keynote: Methodological advancements to improve metagenomics for surveillance of antimicrobial resistance
  • Noelle Noyes

Presentation Overview: Show

Antimicrobial resistance (AMR) is a global public health concern with complex microbial ecological and evolutionary underpinnings. Metagenomics enables comprehensive profiling of antimicrobial resistance genes (the resistome), allowing us to characterize the microbiome-wide processes that support development and persistence of AMR. Metagenomics also has the potential to support improved surveillance and tracking of AMR. However, these advanced applications rely on highly resolved and contextualized resistome data, which is not fully supported by current molecular and bioinformatic approaches. In this talk, I will detail methods that we are developing to generate more accurate and useful resistome data for AMR surveillance.

Recovering hidden gradients in microbiome diversity analyses.
  • Susan Hoops, University of Minnesota, United States
  • Dan Knights, University of Minnesota, United States

Presentation Overview: Show

Ecological changes in a variety of ecosystems are driven by underlying geographical or geochemical gradients. These gradients could be variations in factors such as soil pH, age, changing temperature, and exposure to sunlight, affecting communities ranging from rainforest plant populations to the complex mixtures of microbes living inside us. Datasets describing ecosystem variation often grow to thousands of dimensions to fully capture ecological complexity. Humans struggle to conceptualize beyond three or four dimensions and computational algorithms scale poorly with large volumes of high-dimensional data. To overcome these conceptual and computational challenges, researchers use dimensionality reduction methods to reduce the number of dimensions in the data while maintaining feature relationships. Current methods are well known to produce geometric abnormalities such as the problematic “Arch effect” (Gauch, 1982) or more severe “horseshoe effect” (Kendall, 1971) in the presence of environmental gradients, thus limiting detection, visualization, and statistical testing of underlying gradients.
Ordination techniques for subsequent reduction and visualization are often held culpable, but are unreliable in eliminating arch or horseshoe artifacts (Kuczynski, 2010). Numerous proposed solutions exist in the ecological community literature, but they are inherently flawed in assuming only a single underlying gradient or in depending on pairwise sample distance metrics. In reality, community distance metrics reach a ceiling when communities in a dataset are maximally distinct. In order to represent these maximal distances, ordination analyses create an arch following the bounds of the ordination space, thus producing the arch and horseshoe geometries. We propose a technique for resolving underlying gradients in such datasets by trusting local distances. This method creates an undirected graph trusting only local distance measures, then reassigns pairwise community distances with the shortest paths through the resulting graph. Our technique evades the ceiling observed in distance metrics, thus allowing for deeper investigation of community relationships and characteristics while remaining true to observed ecological distances. The adjusted distances are compatible with any ordination technique, allowing for further optimization. The resolved gradients enable the use of more powerful statistical analyses such as linear regression, allowing for better detection of underlying drivers of microbial community composition in these complex ecosystems.
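The graph-based adjustment described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the fixed `radius` cutoff for deciding which distances count as "local", the Floyd-Warshall shortest-path routine, and the toy distance matrix are all assumptions.

```python
import math


def local_graph_distances(dist, radius):
    """Replace pairwise distances with shortest paths through a local graph.

    dist:   symmetric matrix (list of lists) of observed community distances
    radius: only distances <= radius are trusted and kept as graph edges
    """
    n = len(dist)
    # Keep trusted local distances as edge weights; all other pairs start unconnected.
    d = [
        [0.0 if i == j else (dist[i][j] if dist[i][j] <= radius else math.inf)
         for j in range(n)]
        for i in range(n)
    ]
    # Floyd-Warshall: all-pairs shortest paths through the trusted edges.
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            if dik == math.inf:
                continue
            for j in range(n):
                candidate = dik + d[k][j]
                if candidate < d[i][j]:
                    d[i][j] = candidate
    return d


# Toy gradient: five samples at positions 0..4 whose observed distance
# saturates at 1.0, mimicking a metric that hits its ceiling.
observed = [[min(1.0, 0.5 * abs(i - j)) for j in range(5)] for i in range(5)]
adjusted = local_graph_distances(observed, radius=0.5)
```

In the toy example the two end samples have an observed distance of 1.0 (the ceiling), while the adjusted distance is 2.0, the sum of the four trusted steps along the gradient. Pairs that remain unconnected at the chosen radius keep an infinite distance, so the cutoff has to be large enough to keep the graph connected.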

Association rule mining to investigate microbiome patterns: guidelines and user-friendly tools for real applications
  • Giulia Agostinetto, University of Milano-Bicocca, Department of Biotechnology and Biosciences, Milan, Italy, Italy
  • Anna Sandionigi, Quantia Consulting srl, Milan, Italy, Italy
  • Antonia Bruno, University of Milano-Bicocca, Department of Biotechnology and Biosciences, Milan, Italy, Italy
  • Maurizio Casiraghi, University of Milano-Bicocca, Department of Biotechnology and Biosciences, Milan, Italy, Italy
  • Dario Pescini, University of Milano-Bicocca, Department of Statistics and Quantitative Methods, Milan, Italy, Italy

Presentation Overview: Show

Studying microbiome patterns is now a hot topic in different fields of application. In particular, the use of machine learning techniques is increasing in microbiome studies, providing deep insights into microbial community composition [1]. In this context, in order to explore microbial patterns from 16S rDNA metabarcoding data, we developed fimTool, an easy-to-use standalone tool based on the Association Rule Mining (ARM) technique [2], a supervised machine learning procedure to calculate patterns (in this work, intended as groups of species) from co-occurrences and association rules between them.
fimTool is a user-friendly Python tool built on the PyFIM library [2]. It includes an interactive dashboard that integrates metadata information to visualize sample composition and diversity analysis. It provides different algorithms and a wide range of parameters (called interest measures). Besides the most commonly used measures, namely support (frequency of a pattern or a rule in the dataset), confidence (how often a rule has been found to be true in the dataset) and lift (a measure of dependence between the antecedent and the consequent portions of a rule), we integrated cross-support and all-confidence metrics to evaluate spurious patterns [3]. In order to show the applicability of ARM to microbiome datasets, we tested the technique on simulated and real datasets. In detail, we simulated matrices varying in dimensions, density and pattern correlation. Then, real microbiome studies varying in biome origin were selected for analysis. Furthermore, maximal and closed patterns were also calculated: the first is a pattern that is not included in any bigger frequent pattern (also called a superset), and the second is a pattern whose support differs from that of each of its immediate supersets. As the method can produce huge amounts of data, we also analysed memory usage and calculation time.
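The interest measures named above have simple definitions that a short, self-contained Python sketch can make concrete. It does not use fimTool or PyFIM; the function names and the four toy samples are assumptions made for illustration, and each sample is treated as the set of taxa detected in it.

```python
def support(itemset, transactions):
    """Fraction of samples that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent occurs among samples containing the antecedent.

    Assumes the antecedent occurs in at least one sample.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent.

    Values near 1 suggest independence; values above 1, positive association.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


def all_confidence(itemset, transactions):
    """Support of the itemset divided by the largest single-item support."""
    return support(itemset, transactions) / max(
        support({item}, transactions) for item in itemset
    )


# Four toy samples; "A", "B" and "C" stand in for taxa.
samples = [
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A"}),
    frozenset({"B", "C"}),
]
supp_ab = support({"A", "B"}, samples)        # 2 of 4 samples
conf_a_b = confidence({"A"}, {"B"}, samples)  # 0.5 / 0.75
lift_a_b = lift({"A"}, {"B"}, samples)        # (0.5 / 0.75) / 0.75
allconf_ab = all_confidence({"A", "B"}, samples)
```

Here the rule A -> B has a lift below 1, meaning that A and B co-occur slightly less often than expected if they were independent.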
As expected, our results showed that the dimensions and density of the matrices affect pattern and rule reconstruction, increasing their number but also calculation time and memory usage. Cross-support and all-confidence drastically reduced the number of itemsets and rules, and also affected their length. Closed and maximal patterns resulted in faster calculation and a reduction of information. Real datasets were also tested by dividing the input dataset through metadata filtering, in order to reduce spurious associations. This led to different pattern compositions and a reduction of computational time and memory usage. Overall, the work resulted in a technical evaluation of ARM with the idea of applying it to microbiome data, providing guidelines for its application.
In conclusion, we built fimTool as a user-friendly instrument suited to applying ARM to microbiome data and visualizing microbial patterns interactively. As this is a preliminary assessment, a comparison or an integration with previous studies [4] may be necessary in the future. In general, we think that ARM can unveil new patterns and be extended to the exploration of microbial interactions [5]. Finally, the tool supports the use of ARM on metabarcoding projects of any kind, allowing users to explore biodiversity patterns in very different research fields related to biodiversity monitoring and ecology.

[1] Ghannam, R. B., & Techtmann, S. M. (2021). Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring. Computational and Structural Biotechnology Journal.
[2] Borgelt, C. (2012). Frequent item set mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6), 437-456.
[3] Hahsler, M., Chelluboina, S., Hornik, K., & Buchta, C. (2011). The arules R-package ecosystem: analyzing interesting patterns from large transaction data sets. The Journal of Machine Learning Research, 12, 2021-2025.
[4] Tandon, D., Haque, M. M., & Mande, S. S. (2016). Inferring intra-community microbial interaction patterns from metagenomic datasets using associative rule mining techniques. PloS one, 11(4), e0154493.
[5] Faust, K., & Raes, J. (2012). Microbial interactions: from networks to models. Nature Reviews Microbiology, 10(8), 538-550.

Is Gut Metabolome More Informative than Microbiome in Predicting Host Phenotype?
  • Yang Dai, University of Illinois at Chicago, United States
  • Derek Reiman, University of Illinois at Chicago, United States
  • Tina Khajeh, Not Applicable, Iran

Presentation Overview: Show

The human gut is inhabited by a complex and metabolically active microbial ecosystem. Changes in the composition of microbial taxa in the gut are linked to various diseases such as inflammatory bowel disease (IBD), type 2 diabetes, and obesity. While many studies have focused on the effect of individual microbial taxa on human disease, the joint study of microbiome and metabolome has been suggested as the most promising approach to understanding host-microbiome interactions and identifying biomarkers for disease prediction [1]. Machine learning models have been developed for host phenotype prediction using either gut microbiome or metabolomic profiles. Growing evidence suggests that metabolomic profiles are more predictive than microbiome profiles, implying that the overall metabolic potential is more informative. However, it remains unclear whether the combined microbiome and metabolomic profiles can provide additional information to boost predictive power.

We develop several multi-layer perceptron neural networks (MLPNNs) for host phenotype prediction based on different types of input data. Specifically, we train MLPNNs using: 1) microbiome taxa abundance profiles, 2) metabolomic feature abundance profiles, and 3) the combined profiles. For each type of data, we further compare the performance using latent profiles derived from autoencoders. Additionally, we train MLPNNs using the information from microbe and metabolite modules derived from MiMeNet [2], which infers interactions between microbes and metabolites based on the microbiome and metabolome. Since each microbial module shares interaction patterns with some metabolomic modules, we use the average of module feature values as inputs to MLPNNs as a functional structure approach. The module-based input could be considered as a form of reduced dimension features as well.

Using data from a published study of concurrent profiling of microbiome and metabolome in two IBD cohorts [3], we trained these models for IBD prediction. Using a 10-fold cross-validation procedure, our network architecture was selected from the following hyperparameters: 1-3 hidden layers; 16, 32, 64, 128, or 512 nodes; 10 values evenly spaced on a log scale between 0.0001 and 0.1 for the L2 regularization parameter; and dropout rates of 0.1, 0.3, or 0.5. We evaluated the trained models on the external test set. Our experiment revealed that, compared to the models using the microbiome, the models trained on the metabolome always obtain better performance, with areas under the receiver operating characteristic curve (AUCs) of 0.855, 0.894, and 0.867 for metabolomic features, the latent profile from the autoencoder, and the MiMeNet-derived metabolite modules, respectively. The models trained on the combined profiles further improved the AUCs to 0.901 and 0.906 when using the original features and the latent profiles of the autoencoder, respectively. We confirmed the significance of the differences in AUC among all models using Kruskal-Wallis tests.
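The hyperparameter space listed above can be written out explicitly. The sketch below only enumerates candidate settings. It assumes a full Cartesian grid, which the abstract does not state, and the variable names are placeholders.

```python
from itertools import product

hidden_layers = [1, 2, 3]
nodes_per_layer = [16, 32, 64, 128, 512]
# Ten values evenly spaced on a log10 scale from 1e-4 to 1e-1:
# the exponent moves from -4 to -1 in nine equal steps.
l2_values = [10 ** (-4 + 3 * i / 9) for i in range(10)]
dropout_rates = [0.1, 0.3, 0.5]

# Assumption: every combination is a candidate (full grid search).
grid = [
    {"hidden_layers": h, "nodes": n, "l2": l2, "dropout": d}
    for h, n, l2, d in product(hidden_layers, nodes_per_layer, l2_values, dropout_rates)
]
```

Under that assumption there are 3 x 5 x 10 x 3 = 450 candidate settings, each of which would be scored by 10-fold cross-validation before the selected model is evaluated on the external test set.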

In conclusion, we show that the metabolome is more predictive than the microbiome, and that the combined microbiome and metabolome profiles can further improve the performance of IBD prediction. In addition, we show that the autoencoder generates low-dimensional profiles that contribute to improved performance and robust models.

1. Integrative, H. M. P., Proctor, L. M., Creasy, H. H., Fettweis, J. M., Lloyd-Price, J., Mahurkar, A., ... & Huttenhower, C. (2019). The integrative human microbiome project. Nature, 569(7758), 641-648.

2. Reiman, D., Layden, B. T., & Dai, Y. (2020). MiMeNet: Exploring Microbiome-Metabolome Relationships using Neural Networks. bioRxiv

3. Franzosa, E. A., Sirota-Madi, A., Avila-Pacheco, J., Fornelos, N., Haiser, H. J., Reinker, S., ... & Xavier, R. J. (2019). Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature microbiology, 4(2), 293-305.

Multi-omic association studies identify novel genes and proteins regulating cellular sensitivity to chemotherapy in diverse populations.
  • Ashley Mulford, Loyola University Chicago, United States
  • Claudia Wing, University of Chicago, United States
  • Ryan Schubert, Loyola University Chicago, United States
  • Ani Manichaikul, University of Virginia, United States
  • Hae Kyung Im, University of Chicago, United States
  • M Eileen Dolan, University of Chicago, United States
  • Heather E Wheeler, Loyola University Chicago, United States

Presentation Overview: Show

On behalf of the TOPMed Consortium:
The development of effective treatments is vital in the fight against cancer, the second leading cause of death globally. Most cancer chemotherapeutic agents are ineffective in a subset of patients; thus, it is important to consider the role of genetic variation in drug response. One useful model to determine how genetic variation contributes to differing drug cytotoxicity is HapMap lymphoblastoid cell lines (LCLs).
In our study, LCLs from 1000 Genomes Project populations of diverse ancestries were previously treated with increasing concentrations of eight chemotherapeutic drugs: cytarabine arabinoside, capecitabine, carboplatin, cisplatin, daunorubicin, etoposide, paclitaxel, and pemetrexed. Cell growth inhibition was measured at each dose after 72 hours of drug exposure. We used either half-maximal inhibitory concentration (IC50) or area under the dose-response curve (AUC) as our phenotype for each drug; all phenotypic data were rank-normalized for use in subsequent analyses. Depending on drug, populations analyzed included up to 168, 177, or 90 individuals with European (CEU), Yoruba (YRI), or East Asian (ASN) ancestries, respectively. Including diverse populations is vital to advancing our understanding of the genetic factors impacting the effectiveness of treatments, as some variants are unique to specific ancestral populations, and some ancestral populations, particularly those of African ancestries, contain greater genetic variation than more widely studied populations of European ancestries.
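Rank normalization of a phenotype such as IC50 is commonly done with a rank-based inverse normal transform, sketched below using only the Python standard library. The abstract does not say which variant was used, so the `(rank - 0.5) / n` convention, the average-rank handling of ties, and the toy values are assumptions.

```python
from statistics import NormalDist


def rank_normalize(values, offset=0.5):
    """Map values to standard-normal quantiles according to their ranks.

    Ties receive the average of the ranks they span. The quantile for
    rank r out of n values is taken at (r - offset) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in ranks]


# Toy IC50 values for five cell lines (made-up numbers).
ic50 = [12.0, 3.5, 40.2, 7.7, 150.0]
z = rank_normalize(ic50)
```

The transform keeps the ordering of the cell lines but removes the influence of extreme values such as the 150.0 above, which is why it is a common preprocessing step before linear (mixed) model association tests.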
We performed genome- and transcriptome-wide association studies (GWAS/TWAS) and protein-based association studies (PAS) within each population and in all three populations combined. We conducted GWAS using GEMMA, a software toolkit for fast application of linear mixed models (LMMs) that accounts for relatedness among individuals, because the CEU and YRI ancestral populations contain parent-child trios. Additionally, we performed genotypic principal component analysis to account for population stratification when combining all populations. We conducted TWAS and PAS using PrediXcan and GEMMA. We used PrediXcan, which utilizes prediction models, to calculate predicted gene expression and protein levels based on genotypic data. We then used GEMMA to identify associations between the predicted levels derived by PrediXcan and chemotherapy-induced cytotoxicity. When conducting TWAS, we used the previously trained tissue-based GTEx (Genotype-Tissue Expression) Project version 7 and population-based MESA (Multi-ethnic Study of Atherosclerosis) prediction models available in PredictDB.
In order to conduct PAS, we trained population-based prediction models using genotype and plasma protein data from an aptamer-based assay of 1335 proteins from individuals of African (AFA, n=183), European (EUR, n=416), Chinese (CHN, n=71), and Hispanic/Latino (HIS, n=301) ancestries in the TOPMed (Trans-omics for Precision Medicine) MESA multi-omics pilot study. We used cross-validated elastic net regularization (alpha mixing parameter=0.5) with genetic variants within 1Mb of the gene encoding each protein as predictors for protein levels. When performing PAS, we utilized all protein models with Spearman correlation > 0.1 between predicted and observed levels. Thus, depending on population, we tested between 253 and 416 proteins across all models for association with chemotherapy-induced cytotoxicity.
Through these multi-omics association studies, we identified twelve SNPs, two genes, and seven proteins significantly associated with cellular sensitivity to chemotherapeutic drugs within and across diverse populations after Bonferroni correction for multiple testing. The TWAS we performed found that increased STARD5 predicted gene expression associates with increased cellular sensitivity to etoposide in all populations combined (P=8.49e-08). Functional studies in A549, a lung cancer cell line, revealed that knockdown of STARD5 expression results in decreased sensitivity to etoposide following exposure for 72 hours (P=0.033) and 96 hours (P=0.0001). By identifying variants, transcripts, and proteins associated with cytotoxicity across diverse ancestral populations, we strive to understand the various factors impacting the effectiveness of chemotherapy drugs and contribute to the development of future precision cancer treatment.

Nonnegative matrix factorization integrates single-cell multi-omic datasets with partially overlapping features
  • Joshua D. Welch, University of Michigan, United States
  • April R. Kriebel, University of Michigan, United States

Presentation Overview: Show

Multicellular organisms contain an incredibly diverse array of cells, with each type of cell presenting a distinct size, shape, and function. Each cell can be defined by examining a cell’s epigenetic and genetic expression, as well as its location within a tissue. Using these features of cellular identity, researchers can determine reference cellular profiles that facilitate the understanding of cellular function, as well as the source of aberration in diseased cell states. However, defining cellular profiles necessitates the measurement of each feature of cellular identity. While current single-cell sequencing techniques are capable of measuring several attributes of cellular identity at once, they are not yet able to simultaneously measure all features of interest within the same single cell. To address this limitation, many computational tools have been developed to enable dataset integration.
Common dataset integration tools, such as LIGER, Seurat, and Harmony, integrate datasets across shared features. Although effective for many tasks, these methods require that all input matrices contain a common set of genes that are measured in all datasets. Thus, these methods cannot incorporate features unique to one or more datasets, and must discard unshared features of interest, such as intergenic epigenomic information and unshared genes.
The critical need to include unshared features in single-cell integration analyses motivated us to extend our previous LIGER approach. We developed UINMF, a novel nonnegative matrix factorization algorithm that allows the inclusion of both shared and unshared features. The key innovation of UINMF is the introduction of a metagene matrix for unshared features to the iNMF objective function, incorporating features that belong to only one or a subset of the datasets. Previously, dataset integration using the iNMF algorithm operated on features common to all datasets. By including an unshared metagene matrix, U, UINMF can integrate data matrices with neither the same number of features (e.g., genes, peaks, or bins) nor the same number of observations (cells). Furthermore, UINMF does not require any information about the correspondence between shared and unshared features, such as links between genes and intergenic peaks. By incorporating unshared features, UINMF fully utilizes the available data when estimating metagenes and matrix factors, significantly improving sensitivity for resolving cellular distinctions.
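One plausible way to write the extended objective, assuming the standard iNMF form with features in rows and cells in columns, is sketched below. Only the unshared metagene matrix U is named in the abstract. The symbols E_i (shared-feature data), P_i (unshared-feature data), W (shared metagenes), V_i (dataset-specific metagenes), H_i (cell factor loadings), and the penalty weight lambda are notation assumed here for illustration, and the published formulation may differ in detail.

```latex
\min_{W,\,V_i,\,U_i,\,H_i \,\ge\, 0}\;
\sum_{i=1}^{N}
\left\lVert
\begin{bmatrix} E_i \\ P_i \end{bmatrix}
-
\left(
\begin{bmatrix} W \\ 0 \end{bmatrix}
+
\begin{bmatrix} V_i \\ U_i \end{bmatrix}
\right) H_i
\right\rVert_F^2
\;+\;
\lambda \sum_{i=1}^{N}
\left\lVert
\begin{bmatrix} V_i \\ U_i \end{bmatrix} H_i
\right\rVert_F^2
```

In this form the zero block under W expresses that unshared features have no shared metagene component, so they influence the factorization only through U_i and the common loadings H_i. Setting P_i and U_i to empty matrices recovers an iNMF-style objective over shared features alone.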
We demonstrate the utility of the UINMF algorithm using four datasets. First, we validate our results using ground truth cell correspondences from simultaneous scRNA and snATAC data. UINMF consistently places the snATAC profiles in closer proximity to their corresponding scRNA profiles, confirming that incorporating intergenic information from snATAC data leads to more accurate clustering. In two separate analyses, we demonstrate that using U to include unshared genes improves cell type resolution when integrating scRNA data and targeted spatial transcriptomic data with few genes. This allows for the identification of more distinct cellular profiles within a spatial context. Additionally, we illustrate the benefit of using UINMF to integrate dual-omic SNARE-seq and spatial transcriptomic datasets. Including unshared genes, as well as chromatin accessibility features, within the U matrix allows improved sensitivity in cell type labeling. As multi-omic sequencing efforts and novel spatial transcriptomic methods develop, we anticipate that UINMF will improve single-cell data integration across a range of data types.

Mapping the regulatory landscape of auditory hair cells from single-cell multi-omics data
  • Shuze Wang, University of Michigan, United States
  • Joerg Waldhaus, University of Michigan, United States
  • Jie Liu, University of Michigan, United States
  • Mary Lee, University of Michigan, United States
  • Scott Jones, University of Michigan, United States

Presentation Overview: Show

The function of auditory hair cells is to transduce sound to the brain. In mammals, these cells reside together with supporting cells in the sensory epithelium of the cochlea, called the organ of Corti. To establish the organ’s delicate function during development and differentiation, spatiotemporal gene expression is strictly controlled by chromatin accessibility and cell type specific transcription factors (TFs), jointly representing the regulatory landscape. Previously, bulk-sequencing technology and cellular heterogeneity obscured investigations on the interplay between transcription factors and chromatin accessibility in inner ear development.

To study the formation of the regulatory landscape in hair cells, we collected single-cell chromatin accessibility profiles accompanied by single-cell RNA data from genetically labeled murine hair cells and supporting cells after birth using a combination of Fgfr3-iCre, Ai14, Atoh1-GFP-alleles, and flow sorting. After cluster identification and cell-type annotation, we concentrated on identifying TFs that drive the differentiation and maturation of hair cells. Using an integrative approach, we predicted cell type specific activating and repressing functions of developmental transcription factors by considering the association between TF gene expression and TF motif accessibility for each individual TF. A previously published algorithm has been adopted for validation. Additionally, we featured some genes to further confirm the TF function by comparing the mRNA expression, chromatin accessibility, and footprints between hair cell and supporting cell clusters.
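The activator/repressor call described above rests on the association between a TF's expression and the accessibility of its motif. The sketch below illustrates that reasoning with a Pearson correlation across cell clusters. The correlation measure, the 0.5 cutoff, the function names, and the toy values are assumptions and do not reproduce the authors' pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def classify_tf(expression, motif_accessibility, threshold=0.5):
    """Label a TF from expression and motif accessibility across clusters.

    Positive association: motif sites open where the TF is expressed
    (activator-like). Negative association: sites close where the TF is
    expressed (repressor-like). Weak association: left unclassified.
    """
    r = pearson(expression, motif_accessibility)
    if r >= threshold:
        return "activator"
    if r <= -threshold:
        return "repressor"
    return "unclassified"


# Toy per-cluster averages for one TF (made-up numbers).
tf_expression = [0.1, 0.5, 1.2, 2.0]
label_up = classify_tf(tf_expression, [0.2, 0.4, 1.0, 1.9])
label_down = classify_tf(tf_expression, [1.8, 1.1, 0.6, 0.1])
label_flat = classify_tf(tf_expression, [1.0, 0.2, 1.1, 0.3])
```

With these toy values the three calls yield an activator, a repressor, and an unclassified TF, respectively.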

To identify TFs that control hair cell and supporting cell differentiation, we reconstructed gene regulatory networks for hair cells and supporting cells by integrating gene expression and chromatin accessibility datasets. A 3-step pipeline was developed to predict TF-specific regulons. A regulon is a group of genes that are regulated as a unit. Using this comparative approach, 20 hair cell specific activators and repressors, including putative downstream target genes, were identified. To validate our approach, we compared the inferred regulons with previously published ChIP-seq data. Also, we predicted the cis-regulatory landscape of feature genes with identified putative binding sites for hair cell and supporting cell clusters, separately. Overall, our approach resolved TFs contributing to the regulatory landscape of hair cell and supporting cell differentiation. Clustering of target genes resolved groups of related transcription factors and was utilized to infer their developmental functions.

Furthermore, the heterogeneity in the single-cell data allowed us to reconstruct the anatomical position of cells from both transcriptional and chromatin accessibility profiles. Development of the organ of Corti proceeds in gradients. Using differentially expressed features, we projected hair cells onto a 1-dimensional spatial map resolving the relative tonotopic positions at single-cell resolution. As an example, Pkhd1l1 shows a graded pattern: it is more accessible and more highly expressed in the basal than in the apical compartment. The gradient has been validated by RNAscope.

To further understand chromatin dynamics during hair cell differentiation, we inferred transcriptional as well as chromatin accessibility trajectories of hair cells, indicating that gradual changes in the chromatin accessibility landscape were lagging behind the transcriptional identity of hair cells along the organ’s longitudinal axis.

Overall, this study provides a strategy to spatially reconstruct the formation of a lineage-specific regulatory landscape using an integrative single-cell multi-omics approach. Furthermore, we provide a developmental context for 20 transcription factors to be implemented in future direct programming approaches.

EDI Panel Discussion
Describing `demeter`: using the Euler characteristic to quantify the shape and biology
  • Erik Amezquita, Michigan State University, United States
  • Michelle Quigley, Michigan State University, United States
  • Tim Ophelders, TU Eindhoven, Netherlands
  • Jacob Landis, Cornell University, United States
  • Daniel Koenig, University of California Riverside, United States
  • Elizabeth Munch, Michigan State University, United States
  • Daniel H. Chitwood, Michigan State University, United States

Presentation Overview: Show

Shape is data and data is shape. Biologists are accustomed to thinking about how the shapes of biomolecules, cells, tissues, and organisms arise from the effects of genetics, development, and the environment. Traditionally, biologists use morphometrics to compare and describe shapes. The shape of leaves and fruits is quantified based on homologous landmarks (similar features due to shared ancestry) or on harmonic series from a Fourier decomposition of their closed contour. While these methods are useful for comparing many shapes in nature, they cannot always be used: there may be no homologous points between samples, or a harmonic decomposition of a shape may not be appropriate. Topological data analysis (TDA) offers a more comprehensive, versatile way to quantify plant morphology. TDA uses principles from algebraic topology to comprehensively measure shape in datasets[^1], revealing morphological features not obvious to the naked eye. In particular, Euler characteristic curves (ECCs)[^2] serve as a succinct, computationally feasible topological signature that allows downstream statistical analyses. For example, ECCs have been successfully used to determine the genetic basis of leaf shape in apple[^3], tomato[^4], and cranberry[^5].

Here we present `demeter`, a Python package to quickly compute the ECC of any given grayscale image in linear time with respect to its number of pixels. With `demeter` we provide all the necessary tools to explore the ECC, whose computation can be thought of as a two-step procedure. First, we give an image an adequate topological framework, a dual cubical complex in this case. Second, we assign a number to every pixel using a fixed real-valued function, known as a filter function. We provide a set of different filter functions that can highlight diverse features in the images, such as Gaussian density, eccentricity, or grayscale intensity. The `demeter` workflow can readily take either 2D pixel-based or 3D voxel-based images, use filter functions other than the ones provided with minimal wrangling, and overall benefit from the standard Python ecosystem.
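The two-step procedure can be illustrated with a minimal pure-Python sketch. This is not the `demeter` API: the function name, the use of grayscale intensity as the filter, and the restriction to non-negative integer intensities are assumptions made for illustration. Pixels are treated as vertices, adjacent pixel pairs as edges, and 2x2 pixel blocks as squares, and each cell is assumed to enter the complex at the largest intensity among its pixels. Bucketing cells by that value and taking a running total keeps the work linear in the number of pixels.

```python
def euler_characteristic_curve(image, max_value):
    """Euler characteristic curve of a 2D integer grayscale image.

    Pixels are vertices, adjacent pixel pairs are edges, and 2x2 pixel
    blocks are squares. Each cell enters the complex at the largest
    intensity among its pixels (sublevel-set filtration).
    Returns chi at every threshold 0..max_value.
    """
    rows, cols = len(image), len(image[0])
    # change[t] = net contribution (vertices - edges + squares) of the
    # cells that first appear at threshold t.
    change = [0] * (max_value + 1)
    for r in range(rows):
        for c in range(cols):
            v = image[r][c]
            change[v] += 1  # vertex
            if c + 1 < cols:  # horizontal edge
                change[max(v, image[r][c + 1])] -= 1
            if r + 1 < rows:  # vertical edge
                change[max(v, image[r + 1][c])] -= 1
            if r + 1 < rows and c + 1 < cols:  # 2x2 square
                change[max(v, image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1])] += 1
    # A running total gives chi at every threshold in a single pass.
    curve, chi = [], 0
    for delta in change:
        chi += delta
        curve.append(chi)
    return curve


# Hypothetical toy image: a ring of dark pixels around one bright pixel.
ring = [[0, 0, 0],
        [0, 5, 0],
        [0, 0, 0]]
ecc = euler_characteristic_curve(ring, max_value=5)
```

At threshold 0 the dark ring forms a loop (8 vertices, 8 edges, no squares), so the Euler characteristic is 0; once the centre pixel enters at threshold 5 the loop is filled and the value becomes 1.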

In particular, we can use `demeter` to quantify image morphology with the Euler Characteristic Transform (ECT)[^2], which consists of a concatenation of several ECCs based on directional filters. We favor the use of the ECT for two reasons. First, the ECT can be computed in linear time with respect to the number of image voxels, which is convenient if we deal with thousands of extremely high-resolution 3D images. Second, the ECT effectively summarizes all the morphological information of 3D shapes in general[^2].

This introduction to the Euler characteristic and `demeter` is tailored for a biology audience. The working examples consist of both 2D and 3D scans of diverse plant tissues. The material will be available on both GitHub and Binder, so the audience will be able to run the Jupyter notebooks and code using only their web browser, with no prior software requirements.

[^1]: Lum et al. (2013). Extracting insights from the shape of complex data using topology. doi:10.1038/srep01236

[^2]: Turner et al. (2014). Persistent homology transform for modeling shapes and surfaces. doi:10.1093/imaiai/iau011

[^3]: Li et al. (2018). The persistent homology mathematical framework provides enhanced genotype-to-phenotype associations for plant morphology. doi:10.1104/pp.18.00104

[^4]: Migicovsky et al. (2019). Rootstock effects on scion phenotypes in a Chambourcin experimental vineyard. doi:10.1038/s41438-019-0146-2

[^5]: Diaz-Garcia et al. (2018). Image-based phenotyping for identification of QTL determining fruit shape and size in american cranberry (Vaccinium macrocarpon L.). doi:10.7717/peerj.5461

Accurate and rapid prediction of first-line tuberculosis drug resistance using traditional machine learning algorithms and CNN
  • Xingyan Kuang, The University of Chicago, United States
  • Fan Wang, The University of Chicago, United States
  • Kyle Hernandez, The University of Chicago, United States
  • Zhenyu Zhang, The University of Chicago, United States
  • Robert Grossman, The University of Chicago, United States

Presentation Overview: Show

Antimicrobial resistant (AMR) infection is one of the major threats to human health. Effective and timely antibiotic treatment depends on accurate and rapid in silico AMR predictions. Existing knowledge-based AMR prediction tools using bacterial genomic sequencing data often achieve varying results: high accuracy on some antibiotics but relatively low accuracy on others.

To prepare the training data, we downloaded the whole-genome sequencing (WGS) data for 10,575 Mycobacterium tuberculosis (MTB) isolates from the SRA database and corresponding lineage and phenotype data from PATRIC. All the data were collected and shared by the CRyPTIC Consortium from 16 countries across six continents. Our training data consist of AMR-associated genes and single nucleotide polymorphisms (SNPs) detected from WGS data, microbial lineage information, and associated phenotypes of resistance or susceptibility to the first-line anti-MTB drugs isoniazid, rifampicin, ethambutol, and pyrazinamide. Instead of using only well-studied resistance-causing SNPs, we selected as SNP features both known AMR-associated variants and novel variants in coding regions that were detected by ARIBA in at least one isolate. Consequently, we selected 282 genetic features. As a quality control of variant calling, we generated a phylogenetic tree based on the called variants from each isolate with known lineage information, and we found that isolates of the same lineage clearly clustered together. We then trained 12 binary classifiers of AMR status, one for each combination of the four drugs and three machine learning (ML) algorithms: Logistic Regression (LR), Random Forest (RF) and a customized 1D CNN. RF classifiers were trained with 1,000 estimators, while L1 regularization was applied to LR to reduce overfitting. Because deep learning algorithms require more computational power, our CNN models were trained on the top 34 (rifampicin), 107 (ethambutol), 52 (isoniazid) and 95 (pyrazinamide) drug-specific features, which contributed 98% of the feature importance for AMR prediction. Then, we built a multi-input CNN architecture that took N inputs of 4 x 21 matrices representing N selected SNP features into the first layer. Each 4 x 21 matrix consisted of normalized DNA base counts for each locus within a 21 base reference sequence window centered on the focal SNP. We calculated normalized counts based on BAM files generated by ARIBA. Our convolutional architecture consists of two 1D convolutional layers followed by a flattening layer for each SNP input. It then concatenates the N flattening layers with the inputs of AMR-associated gene presence and lineage features. Finally, we added three fully connected layers to complete the deep neural network architecture.
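The 4 x 21 input encoding can be sketched as follows. The abstract does not specify how the counts were normalized, so this example assumes each column is divided by the read depth at that position. The function name, the row order A, C, G, T, and the pileup values are hypothetical.

```python
BASES = "ACGT"
WINDOW = 21  # window length stated in the abstract


def base_count_matrix(counts_per_position):
    """Turn per-position base counts into a 4 x 21 matrix of fractions.

    counts_per_position: list of 21 dicts mapping base -> read count.
    Returns 4 rows (A, C, G, T), each holding 21 column values.
    Assumption: counts are normalized by the depth at each position.
    """
    if len(counts_per_position) != WINDOW:
        raise ValueError("expected %d positions" % WINDOW)
    matrix = [[0.0] * WINDOW for _ in BASES]
    for col, counts in enumerate(counts_per_position):
        depth = sum(counts.get(base, 0) for base in BASES)
        if depth == 0:
            continue  # no coverage: leave the column at zero
        for row, base in enumerate(BASES):
            matrix[row][col] = counts.get(base, 0) / depth
    return matrix


# Hypothetical pileup: "A" at every position, with a mixed A/G signal
# at the focal SNP in the centre of the window (index 10).
pileup = [{"A": 20} for _ in range(WINDOW)]
pileup[10] = {"A": 5, "G": 15}
matrix = base_count_matrix(pileup)
```

Each such matrix would then be one of the N inputs to the per-SNP convolutional branch described above.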

Results & Conclusions
Using 10-fold cross-validation, we found that our methods resulted in a significant increase in accuracy for drugs that previously had relatively poor prediction from the state-of-the-art knowledge-based method Mykrobe (76.3% to 90.0% for ethambutol, 92.5% to 94.3% for rifampicin, and 87.3% to 90.1% for pyrazinamide), while performing as well as Mykrobe for isoniazid (96.2%). The 1D-CNN algorithm did not outperform the traditional ML methods LR and RF, and it requires more intensive computing resources. Finally, we automated the whole process from data collection to model training and evaluation into a flexible pipeline that can be easily updated with new strains or used to train AMR prediction models for different antibiotics and other bacteria. Given the availability of WGS data and lineage information for MTB, our ML models can classify MTB resistance against four first-line anti-TB drugs with relatively high accuracy, requiring only the computational resources of a standard laptop.

Whole-blood methylation as a prognostic biomarker in multiple sclerosis
  • Maria Pia Campagna, Monash University, Australia
  • Alexandre Xaiver, University of Newcastle, Australia
  • Jeannette Lechner-Scott, University of Newcastle, Australia
  • Helmut Butzkueven, Monash University, Australia
  • Rodney Lea, University of Newcastle, Australia
  • Vilija Jokubaitis, Monash University, Australia

Presentation Overview: Show

Multiple sclerosis (MS) is a chronic autoimmune disease characterised by neuroinflammation and neurodegeneration. It is the leading cause of neurological disability in young people, affecting over two million globally. Prognostication at diagnosis remains a challenge for clinicians, which makes it difficult to personalise and optimise care. Therefore, prognostic biomarkers are critically needed to improve MS care and outcomes.

Published studies aiming to identify genetic predictors of disease severity have demonstrated that genetic variants incompletely explain the variation in disease severity in the cohorts studied. This suggests that interactions between genetic and environmental factors most likely regulate disease severity. DNA methylation is an epigenetic mechanism known to be impacted by a range of environmental factors, and to date, no epigenetic predictors of disease severity have been identified in MS. We hypothesised that DNA methylation is associated with disease severity and could therefore be harnessed as a prognostic biomarker.

Cohort: We included 235 Australian females with relapse-onset MS; 119 with mild disease and 116 with severe disease. The Age-related Multiple Sclerosis Severity (ARMSS) Score was used to determine severity, with longitudinal outcomes data from MSBase Registry.

Sample preparation: DNA was extracted from whole-blood samples, then bisulfite converted using EZ-96 DNA Methylation Kit given sufficient DNA concentration (>11ng/uL) and quality (DIN>7).

Data pre-processing: Methylation was measured using Illumina Infinium EPIC arrays, and data pre-processing was completed using the Chip Analysis Methylation Pipeline (ChAMP) package in R. After standard quality control filtering, 747,969 of 866,554 (86.3%) probes remained. Data was normalised using the beta-mixture quantile (BMIQ) method, and batch effects were corrected for using Singular Value Decomposition (SVD) and ComBat.

Differential methylation analysis: A false discovery rate (FDR) of 0.05 was used to assess significance. For association testing, mild and severe samples were paired by age at blood sampling. Linear models in ChAMP were used to identify differentially methylated positions (DMPs) in whole blood, defined as CpG sites with significantly different methylation levels between groups. A differentially methylated region (DMR) was defined as two DMPs with the same direction of effect within 1000 bp. Gene-set enrichment analysis (GSEA) was conducted with the GOmeth function of the missMethyl package.
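The DMR definition above can be restated as a short Python sketch. It is a literal reading of the stated rule (two neighbouring DMPs, same direction of effect, within 1000 bp) and is not the ChAMP implementation, which may differ. The function name and the example positions and effect sizes are hypothetical.

```python
def find_dmrs(dmps, max_gap=1000):
    """Pair neighbouring DMPs into regions.

    dmps: iterable of (chromosome, position, effect_size) tuples.
    Returns (chromosome, start, end, mean_effect) for each pair of
    adjacent DMPs on the same chromosome that lie within max_gap bp
    of each other and share the sign of their effect.
    """
    regions = []
    ordered = sorted(dmps)  # by chromosome, then position
    for (chrom_a, pos_a, eff_a), (chrom_b, pos_b, eff_b) in zip(ordered, ordered[1:]):
        same_chrom = chrom_a == chrom_b
        close = pos_b - pos_a <= max_gap
        same_direction = (eff_a > 0) == (eff_b > 0)
        if same_chrom and close and same_direction:
            regions.append((chrom_a, pos_a, pos_b, (eff_a + eff_b) / 2))
    return regions


# Hypothetical DMPs; only the first two satisfy all three conditions.
example_dmps = [
    ("chr1", 1000, 0.04),
    ("chr1", 1500, 0.06),
    ("chr1", 1800, -0.03),   # opposite direction to its neighbour
    ("chr2", 1900, -0.05),   # different chromosome
    ("chr2", 5000, -0.02),   # more than 1000 bp from its neighbour
]
dmrs = find_dmrs(example_dmps)
```

Here `dmrs` contains a single region on chr1 spanning positions 1000 to 1500.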

Downstream analyses: Cell-type specific associations were identified using the EpiDISH and cellDMC packages. Sensitivity analyses of 10 demographic, clinical and data availability characteristics were performed on paired samples, using correlation or ANOVA tests.

Predictive modelling: An elastic net regression model was trained on 164 samples using the GLMnet and caret packages, and tested on the remaining 71 samples.

We identified associations between whole-blood methylation and disease severity in multiple sclerosis. 548 DMPs with an effect size >1% were identified. In the severe group, 230 DMPs were hypermethylated and 318 were hypomethylated. Two DMRs were identified, each containing two DMPs: Chr15:57664490-57664704 (mean effect size=0.09) and Chr11:94886261-94886708 (mean effect size=-0.02). GSEA of DMPs revealed no significantly enriched processes or pathways. Sensitivity analyses demonstrated no major effects of demographic, clinical and data availability characteristics on differential methylation between groups. We further identified no differences in cell type proportions, or cell-specific DMPs, between groups, suggesting that differential methylation between groups is independent of cell type.

Finally, we showed that whole-blood methylation can accurately predict categorical disease severity. The final elastic net regression model used methylation levels at 890 CpG sites to predict disease severity with an AUC of 0.91.

We demonstrated differences in whole-blood DNA methylation between females with mild and severe relapse-onset MS; however, mechanistic insight requires additional transcriptomic and proteomic studies. Nevertheless, these differences can be used to predict disease severity, showing that DNA methylation may have clinical utility as a prognostic biomarker in MS.

Cell Type Specific Binding Preferences of Transcription Factors
  • Aseel Awdeh, University of Ottawa and Ottawa Hospital Research Institute, Canada
  • Marcel Turcotte, University of Ottawa, Canada
  • Theodore J. Perkins, Ottawa Hospital Research Institute and University of Ottawa, Canada

Presentation Overview: Show

Non-protein coding regions constitute approximately 98% of the human genome. These regions contain complex and precise instructions that regulate gene expression. Through differential transcription factor (TF) binding and gene regulation, different cell types express different genes, allowing one genome to give rise to the diversity of cell types and tissues. The limited number of TFs expressed in a large number of cell types and tissues may result in the same TF being expressed in several cell types and tissues [1]. However, the same TF may have varying binding affinities or sequence specificities across these multiple cell types [1,2]. Indeed, several studies have shown the cell type specific nature of TF binding [1-4].

There are many factors that cause cell type specificity in TF binding. With the exception of pioneer transcription factors [5], TF binding usually occurs in more accessible regions along the genome. Thus, cell type specific chromatin accessibility is one mechanism directing TFs to different parts of the genome [1,4,6]. Post-translational modifications, complexing with other proteins, and alternative splicing can alter the conformation of a TF protein, thus changing its DNA-binding preferences [1,2,4,7,8]. Other events at regulatory sites, such as cooperative binding or, conversely, steric hindrance, can also influence binding [9]. In this work, we propose a deep learning-based approach to quantify the degree to which a TF shows cell-type specific binding, and in particular binding that manifests through a DNA sequence preference or signature, in contrast to binding differences due solely to changes in chromatin accessibility without any change in DNA binding preference.

Many studies have used deep learning in genomics to predict TF-DNA binding sites along the genome [10-14], or to predict chromatin accessible regions along the genome [15,16]. Other approaches have explored the integration of multiple datasets, such as DNase-seq for chromatin accessibility and RNA-seq for gene expression, to increase their understanding of the dynamics of transcriptional regulation [17,18].

We build upon these previous deep learning studies in our own work. The network we use is similar to that used in DeepBind, augmented to differentiate and quantify cell type-general versus cell-type specific binding. Our problem formulation is akin to FactorNET's target problem: predicting binding in unmeasured (or held out) cell lines. However, we focus on differences in DNA signatures of binding, rather than differences (or similarities) in the binding sites per se. Our training procedure is inspired by that of ChromDragoNN but uses a direct encoding of cell type in place of gene expression values. To better explain the differences in the DNA binding signatures, we also account for chromatin accessibility in our model. Our work is similar in intent to a recent study of the differential and cooperative binding nature of the two TFs, MEIS and HOXA2, in three mouse tissues [19], but differs in the approach we employ and is much larger in scale. We conduct a large-scale investigation of 174 TF and antibody combinations across various cell types, identifying TFs that show a difference in DNA-binding preference across multiple cell lines, and quantifying their degree of cell type specificity. We show the cell type specific nature of various TFs at varying degrees of specificity across several cell lines.


Accurate network-based gene classification with ultrafast context-specific node embedding
  • Arjun Krishnan, Michigan State University, United States
  • Renming Liu, Michigan State University, United States
  • Matthew Hirn, Michigan State University, United States

Presentation Overview: Show

Genome-scale gene interaction networks are powerful models of functional relationships between tens of thousands of genes in complex organisms. Numerous studies have established how these networks can be leveraged to classify experimentally un(der)-characterized genes to specific biological processes, traits, and diseases. We and others have previously shown that low-dimensional representations (or embeddings) of nodes in large networks can be highly beneficial for network-based gene classification. Since each node’s embedding vector concisely captures its network connectivity, node embeddings can be conveniently used as feature vectors in any ML algorithm to learn/predict node properties or links. One of the earlier node embedding methods that continues to show good performance in various node classification tasks, especially on biological networks, is a random-walk based approach called node2vec. Recent studies on the task of network-based gene classification have shown that node2vec achieves the best performance among the state-of-the-art embedding methods for gene classification, and that using embeddings generated by node2vec achieves prediction performance comparable to the state-of-the-art label propagation methods.

However, despite its popularity, the original node2vec software implementations (written in Python and C++) have significant bottlenecks that prevent node2vec from being used seamlessly on all current biological networks. First, due to inefficient memory usage and data structures, they do not scale to the large and dense networks produced by integrating several data sources on a genome scale (17–26k nodes and 3–300 million edges). Next, the embarrassingly parallel precomputations of calculating transition probabilities and generating random walks are not parallelized in the original software. Finally, the original implementations only support integer-type node identifiers (IDs), making it inconvenient to work with molecular networks typically available in databases where nodes may have non-integer IDs.

Here, we present PecanPy, an efficient Python implementation of node2vec that is parallelized, memory efficient, and accelerated using Numba with a cache-optimized data structure. We have extensively benchmarked our software using networks from the original node2vec study and multiple additional large biological networks. These analyses demonstrate that PecanPy efficiently generates high-quality node embeddings for networks at multiple scales including large (>800k nodes) and dense (fully connected network of 26k nodes) networks that the original implementations failed to execute.
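For readers unfamiliar with node2vec, the second-order biased random walk at its core can be sketched in plain Python. This is an unoptimized illustration for unweighted graphs and does not reflect PecanPy's API or data structures. The function name, the toy network, and the parameter values are hypothetical. The return parameter `p` and in-out parameter `q` follow the standard node2vec definition.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One second-order biased random walk on an unweighted graph.

    graph: dict mapping node ID -> set of neighbour IDs. IDs may be
    strings, but must be mutually comparable so they can be sorted.
    p: return parameter; q: in-out parameter, as in node2vec.
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph[current])  # sorted for reproducibility
        if not neighbours:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay near the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical toy network with string gene IDs.
toy = {
    "TP53": {"MDM2", "ATM"},
    "MDM2": {"TP53", "ATM"},
    "ATM": {"TP53", "MDM2", "CHEK2"},
    "CHEK2": {"ATM"},
}
walk = node2vec_walk(toy, "TP53", length=10, p=0.5, q=2.0,
                     rng=random.Random(0))
```

Many such walks, started from every node, are then fed to a skip-gram model to obtain the embeddings. Because each walk is independent of the others, walk generation is the step that lends itself to parallelization.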

One of the challenges with network-based gene classification is that the gene networks immediately available from databases do not capture the context-specific gene dynamics and interaction rewiring that accompany variation in biological factors such as tissues/cell types, environmental/treatment conditions, and disease status. One of the routine, efficient, and signal-rich techniques for capturing a molecular snapshot of cells in a specific context is genome-scale gene-expression profiling. Therefore, it would be extremely beneficial if biologists could “contextualize” network-based gene classification algorithms rapidly on-the-fly by supplying a gene expression dataset that best captures the context of their interest and get back function, trait, or disease gene predictions specific to their context. We are leveraging the ultrafast node embedding in PecanPy to explore the creation of context-specific node embeddings from static gene networks.

PecanPy is freely available at https://github.com/krishnanlab/PecanPy, can be easily installed via the pip package-management system (https://pypi.org/project/pecanpy/), and has been confirmed to work on a variety of networks, weighted or unweighted, with a wide range of sizes and densities. Therefore, it can find broad utility beyond biology.

Thursday, May 13th
  • Michael Osterholm
A Bayesian framework for estimating the risk ratio of hospitalization for people with comorbidity infected by SARS-CoV-2 virus
  • Xiang Gao, Loyola University Chicago, United States
  • Qunfeng Dong, Loyola University Chicago, United States

Presentation Overview: Show

Estimating the hospitalization risk for people with comorbidities infected by the SARS-CoV-2 virus is important for developing public health policies and guidance. Traditional biostatistical methods for risk estimations require: (i) the number of infected people who were not hospitalized, which may be severely undercounted since many infected people were not tested; (ii) comorbidity information for people not hospitalized, which may not always be readily available. We aim to overcome these limitations by developing a Bayesian approach to estimate the risk ratio of hospitalization for COVID-19 patients with comorbidities.

Materials and Methods
We derived a Bayesian approach to estimate the posterior distribution of the risk ratio using the observed frequency of comorbidities in COVID-19 patients in hospitals and the prevalence of comorbidities in the general population. We applied our approach to 2 large-scale datasets in the United States: 2491 patients in the COVID-NET, and 5700 patients in New York hospitals.
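To make the idea concrete: by Bayes' rule, the risk ratio P(H | C) / P(H | not C) equals the odds of comorbidity among hospitalized patients divided by the odds of comorbidity in the population, so it can be estimated without counting non-hospitalized infections. The sketch below illustrates this under stated assumptions and may differ from the published derivation. It assumes a Beta(1, 1) prior on P(C | H), a fixed and known prevalence, and that prevalence among infected people equals that of the general population. All function names and numbers are hypothetical.

```python
import random


def risk_ratio(p_comorbid_given_hosp, prevalence):
    """RR = P(H | C) / P(H | not C), rewritten with Bayes' rule."""
    odds_hosp = p_comorbid_given_hosp / (1.0 - p_comorbid_given_hosp)
    odds_pop = prevalence / (1.0 - prevalence)
    return odds_hosp / odds_pop


def posterior_risk_ratio(n_hospitalized, n_with_comorbidity, prevalence,
                         draws=5000, seed=0):
    """Sample the posterior of the risk ratio.

    Assumes a Beta(1, 1) prior on P(C | H) and a fixed, known prevalence.
    Returns (median, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    alpha = 1 + n_with_comorbidity
    beta = 1 + n_hospitalized - n_with_comorbidity
    samples = sorted(
        risk_ratio(rng.betavariate(alpha, beta), prevalence)
        for _ in range(draws)
    )
    return (samples[draws // 2],
            samples[int(0.025 * draws)],
            samples[int(0.975 * draws)])


# Hypothetical numbers, not taken from COVID-NET or the New York data.
median, low, high = posterior_risk_ratio(1000, 400, prevalence=0.2)
```

With these made-up inputs, 40% of hospitalized patients have the comorbidity against a 20% population prevalence, giving a point estimate of (0.4/0.6)/(0.2/0.8), roughly 2.67, and the sampled interval `(low, high)` brackets that value.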

Our results consistently indicated that cardiovascular diseases carried the highest hospitalization risk for COVID-19 patients, followed by diabetes, chronic respiratory disease, hypertension, and obesity, respectively.

Our approach only needs (i) the number of hospitalized COVID-19 patients and their comorbidity information, which can be reliably obtained using hospital records, and (ii) the prevalence of the comorbidity of interest in the general population, which is regularly documented by public health agencies for common medical conditions.

We developed a novel Bayesian approach to estimate the hospitalization risk for people with comorbidities infected with the SARS-CoV-2 virus.

This work has already been published. Xiang Gao and Qunfeng Dong (2020) A Bayesian Framework for Estimating the Risk Ratio of Hospitalization for People with Comorbidity Infected by the SARS-CoV-2 Virus. Journal of the American Medical Informatics Association, 28 Sept 2020, ocaa246, doi:10.1093/jamia/ocaa246 https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocaa246/5912530

Druggability and genetic variability of SARS-CoV-2 Pocketome
  • Setayesh Yazdani, Structural Genomics Consortium, Canada
  • Nicola De Maio, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Yining Ding, Structural Genomics Consortium, China
  • Nick Goldman, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Matthieu Schapira, Structural Genomics Consortium, Canada

Presentation Overview: Show

The successful development of vaccines is expected to put an end to the COVID-19 pandemic. The disease will probably revert to an endemic status and re-emerge regularly, following seasonal and geographical patterns that are as yet unknown. In the absence of effective treatment, COVID-19 will remain a severe and global disease burden. Compounding this threat is the near certainty that novel coronaviruses with pandemic potential will emerge in years to come. Pan-coronavirus drugs (agents active against both SARS-CoV-2 and other coronaviruses) would address both threats. One strategy to develop such broad-spectrum inhibitors is to pharmacologically target binding sites on SARS-CoV-2 proteins that are highly conserved in other known coronaviruses. The assumption is that any selective pressure to keep a site conserved across past viruses will apply to future ones. Here, we systematically mapped druggable binding pockets on the experimental structures of fifteen SARS-CoV-2 proteins and analyzed their variation across hundreds of alpha- and betacoronaviruses and across thousands of samples from COVID-19 patients. We present the data on a public web portal (thesgc.org/SARSCoV2_pocketome) where users can analyze our results in proteome-wide plots and matrices and interactively navigate individual protein structures and the genetic variability of drug binding pockets in 3D.

The Phosphorylation Model of SARS-CoV-2 Nucleocapsid Protein
  • Tomer M Yaron, Weill Cornell Medicine, United States
  • Brook E Heaton, Duke University School of Medicine, United States
  • Tyler M Levy, Cell Signaling Technology, Inc., United States
  • Jared L Johnson, Weill Cornell Medicine, United States
  • Tristan X Jordan, Icahn School of Medicine at Mount Sinai, United States
  • Elena Piskounova, Weill Cornell Medicine, United States
  • Benjamin R Tenoever, Icahn School of Medicine at Mount Sinai, United States
  • John Blenis, Weill Cornell Medicine, United States
  • Nicholas S Heaton, Duke University School of Medicine, United States
  • Lewis C Cantley, Weill Cornell Medicine, United States

Presentation Overview: Show

While vaccines are vital for preventing COVID-19 infections, it is critical to develop new therapies to treat patients who become infected. Pharmacological targeting of a host factor required for viral replication can suppress viral spread with a low probability of viral mutation leading to resistance. In particular, host kinases are highly druggable targets and a number of conserved coronavirus proteins, notably the nucleoprotein (N), require phosphorylation for full functionality.

The ability of protein kinases to phosphorylate substrates is strongly dependent on the serine/threonine phosphoacceptor’s surrounding amino acid sequence. The majority of human kinases investigated show distinct preferences for or against amino acids surrounding their phosphoacceptor. This is collectively referred to as their substrate motifs and is useful for identifying biological substrates.

To obtain the substrate motif of a kinase, our laboratory and others have developed an unbiased approach using combinatorial peptide substrate libraries. We recently completed the characterization of almost the entire human kinome (300 serine/threonine kinases and 80 tyrosine kinases) and developed a computational prediction system, The Kinase Library, which takes the sequence around a given phosphorylation site as input and, for every phosphorylation site, provides a favorability score for each characterized kinase.

In addition, based on the site-specific predictions of The Kinase Library, we developed a kinase-enrichment tool for analyzing high-throughput phosphoproteomics datasets, in order to identify specific kinases and pathways that are upregulated or downregulated in different conditions.

To understand how targeting kinases could be used to compromise viral replication, we used a combination of phosphoproteomics and bioinformatics as well as genetic and pharmacological kinase inhibition to define the enzymes important for SARS-CoV-2 N protein phosphorylation and viral replication. Specifically, we used The Kinase Library to define not only which host kinases can phosphorylate the N protein, but also the order of action and the specific phosphorylation sites of each kinase, and thereby to reconstruct the entire phosphorylation model of the SARS-CoV-2 Nucleocapsid.

From these data, we propose a model whereby SRPK1/2 initiates phosphorylation of the N protein, which primes for further phosphorylation by GSK-3A/B and CK1 to achieve extensive phosphorylation of the N protein SR-rich domain. Importantly, we were able to leverage our data to identify an FDA-approved kinase inhibitor, Alectinib, that suppresses N phosphorylation by SRPK1/2 and limits SARS-CoV-2 replication. Together, these data suggest that repurposing or developing novel host-kinase directed therapies may be an efficacious strategy to prevent or treat COVID-19 and other coronavirus-mediated diseases.

This study demonstrates the computational reconstruction of a complex sequence-to-phenotype path in a novel emerging organism (SARS-CoV-2), using unbiased experimental data and mathematical modeling tools. It is also the first time we illustrate the application of The Kinase Library system on a real-life biological system. The epidemiological relevance and experimental validation of the predicted model exhibit the power of biochemistry-oriented computational models in biological sciences.

MetaProClust-MS1 can cluster metaproteomes using MS1 profiling only
  • Caitlin Simopoulos, University of Ottawa, Canada
  • Zhibin Ning, University of Ottawa, Canada
  • Leyuan Li, University of Ottawa, Canada
  • Mona Khamis, University of Ottawa, Canada
  • Xu Zhang, University of Ottawa, Canada
  • Mathieu Lavallée-Adam, University of Ottawa, Canada
  • Daniel Figeys, University of Ottawa, Canada

Presentation Overview: Show

The human gut microbiota is composed of bacteria, viruses, fungi and archaea that inhabit the digestive tract. These microbes play essential roles in the health and well-being of humans, for example, by fermenting dietary fibres into short chain fatty acids, synthesizing vitamins and contributing to the immune system. Therefore, it’s no surprise that the composition and dysfunction of the gut microbiome is also associated with many diseases, such as inflammatory bowel disease, diabetes, as well as cardiovascular disease and mental health disorders. Recently, studies have shown that human-targeted drugs may significantly disrupt the gut microbiota, and thus there is an increased need to characterize how xenobiotics may be affecting microbiomes and consequently human health.

Metaproteomics explores the composition and function of microbial communities and has previously been used to investigate how xenobiotics can affect the human gut microbiome. However, acquiring data by tandem mass spectrometry, the traditional approach to metaproteomics, is time consuming and resource intensive. As metaproteomic experiments become larger in scale, for example in microbiome-drug screens, the resources required for a study also increase. To mitigate this challenge, we present MetaProClust-MS1, a computational framework for microbiome screening that reduces the time required for data acquisition by mass spectrometry.

MetaProClust-MS1 performs a clustering analysis of samples processed using MS1-only mass spectrometry. In this pipeline, peptide features quantified from 15 minute mass spectrometry gradients are first matched across all mass spectrometry runs from a set of experiments. The approach preprocesses the metaproteomics data and performs matrix decomposition using a robust independent component analysis (ICA). K-medoids clustering is then used to further reduce feature dimensionality. Eigenfeature values are computed from peptide intensity values and used as a summary statistic to calculate correlations with microbiome treatments for treatment clustering. The modular, platform-independent code is implemented in R and Python and can be run completely within RStudio.
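The eigenfeature step can be illustrated with a short Python sketch. This is not the MetaProClust-MS1 code: it assumes that an eigenfeature is the first right singular vector of a row-standardized feature-module matrix (the construction commonly used for "eigengenes"), and the function name, toy intensities and treatment vector are hypothetical.

```python
import numpy as np

def eigenfeature(module: np.ndarray) -> np.ndarray:
    """Summarize a (features x samples) intensity matrix as one value per sample.

    Assumption: the eigenfeature is the first right singular vector of the
    row-standardized matrix; the abstract does not give the exact definition.
    """
    centered = module - module.mean(axis=1, keepdims=True)
    scale = centered.std(axis=1, keepdims=True)
    scale[scale == 0] = 1.0                      # avoid division by zero for flat features
    z = centered / scale
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    eig = vt[0]
    # The sign of a singular vector is arbitrary; orient it along the mean profile.
    if np.dot(eig, z.mean(axis=0)) < 0:
        eig = -eig
    return eig

rng = np.random.default_rng(0)
# Hypothetical design: three control samples followed by three treated samples.
treatment = np.array([0, 0, 0, 1, 1, 1], dtype=float)
# Toy module: 20 peptide features whose intensity rises with treatment, plus noise.
module = 2.0 * treatment + rng.normal(scale=0.3, size=(20, 6))

eig = eigenfeature(module)
r = np.corrcoef(eig, treatment)[0, 1]            # correlation used for treatment clustering
print(f"eigenfeature-treatment correlation: {r:.3f}")
```

One correlation per module and treatment gives a small matrix that can then be clustered, which is far cheaper than clustering on the individual peptide features.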

In a proof-of-concept study, we tested MetaProClust-MS1 on a gut microbiome treated with five drugs with known effects at three different concentrations. The samples were analyzed twice: once with short 15 minute MS1-only mass spectrometry gradients and again by a 60 minute tandem mass spectrometry approach. We compared the clusters identified by the framework using both datasets and found that MetaProClust-MS1 identified robust microbiome shifts caused by xenobiotics. A cophenetic correlation coefficient indicated the drug treatment clusters identified using both datasets were also significantly correlated (r = 0.625). Our findings indicate that MetaProClust-MS1 is able to rapidly screen microbiomes using only short MS1 profiles and identifies treatment clusters very similar to those identified using a traditional MS/MS approach.
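A comparison of this kind can be sketched by correlating the cophenetic distances of two dendrograms built on the same treatments. The sketch below is an illustration under assumptions not stated in the abstract: Euclidean distances, average linkage, and synthetic profiles standing in for the MS1-only and MS/MS results. All names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

def cophenetic_correlation(profiles_a, profiles_b, method="average"):
    """Pearson correlation between the cophenetic distances of two dendrograms.

    Both inputs are (treatments x features) arrays with rows in the same order.
    """
    coph_a = cophenet(linkage(pdist(profiles_a), method=method))
    coph_b = cophenet(linkage(pdist(profiles_b), method=method))
    return float(np.corrcoef(coph_a, coph_b)[0, 1])

rng = np.random.default_rng(7)
# Toy data: 15 treatments forming three groups, measured by two acquisition methods.
centers = np.repeat(np.array([0.0, 5.0, 10.0]), 5)[:, None]
ms1_profiles = centers + rng.normal(scale=0.5, size=(15, 4))
msms_profiles = ms1_profiles + rng.normal(scale=0.2, size=(15, 4))

r = cophenetic_correlation(ms1_profiles, msms_profiles)
print(f"cophenetic correlation between dendrograms: {r:.3f}")
```

Because cophenetic distances within one tree are not independent of each other, a permutation test over treatment labels is a safer way to attach a p-value than the standard Pearson test.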

We developed MetaProClust-MS1 for rapid metaproteomic screens to prioritize samples or treatments of interest for deep metaproteomic analysis using lengthier tandem mass spectrometry experiments. However, MetaProClust-MS1 is not limited to metaproteome screens. For example, the inclusion of feature modules identified by K-medoids clustering also allows for feature module exploration and potential MS1-only biomarker discovery. We anticipate that MetaProClust-MS1 could be extended for protein identification through MS1 features using an MS1-only search workflow. In its current implementation, MetaProClust-MS1 will be especially useful in large-scale metaproteomic screens or in clinical settings where rapid results are required.

Analysis of Targeted Proteomics data across different datasets with reference characteristics
  • Dries Heylen, VITO NV, Belgium
  • Murih Pusparum, VITO, Belgium
  • Valentino Donofrio, UHasselt, Belgium
  • Inge Gyssens, Radboud University, Belgium
  • Dirk Valkenborg, UHasselt, Belgium
  • Gökhan Ertaylan, VITO, Belgium

Presentation Overview:

Background: Early diagnosis, or the detection of signals, markers, and patterns that indicate harmful events before a disease can develop, is crucial in many diseases. For infectious diseases, characterizing the inflammatory response can support rapid clinical assessment and early diagnosis of infection. For chronic diseases, characterizing disease-associated parameters can help represent a disease as an individual-specific health continuum. The objective of this study is to develop an approach and a pipeline for analyzing high-dimensional, multi-batch proteomics datasets, keeping the control dataset as the reference for parameter range estimations, in order to discover relationships across and within these data types and to identify relevant patterns.
Methods: In a pilot project, I AM Frontier (IAF), cross-omics data (more than 1000 proteins and more than 250 metabolites) were provided by VITO health. In an attempt to test the central hypotheses of precision medicine, VITO collected blood, urine, stool, activity measurements and anthropometric information from a small cohort of 30 healthy but “at risk” (45-60 years old) individuals on a longitudinal basis for 13 months. External datasets were included to complement the reference data with robust disease endpoints and to validate our findings with a sepsis dataset consisting of patients with infections of different aetiology. For targeted proteomics, samples were analyzed using 92-plex proteomics panels (including an inflammation panel) based on a proximity extension assay (PEA) with oligonucleotide-labelled antibody probe pairs (OLINK, Uppsala, Sweden). Unsupervised differential expression analysis using hierarchical clustering, k-means clustering and PCA was performed in R.
Supervised differential expression analyses using Welch’s t-test and elastic net regression were performed to confirm the unsupervised analyses. A complete normalization and bridging workflow for multi-batch proteomics experiments across cohorts was applied.
Results: We established a workflow that uses IAF data as a reference dataset for analyzing targeted proteomics datasets. Rank-based inverse normal transformation of the data, followed by cosine similarity calculations on the OLINK pooled plasma samples, proved to be a robust method for comparing this type of cross-omics, multi-batch data across cohorts.
In a 92-plex proteomics dataset of 406 sepsis patients, unsupervised hierarchical clustering revealed that the inflammatory response is more strongly related to disease severity than to aetiology or site of infection. An influenza subgroup showed clearly distinct inflammatory protein profiles compared with other infections causing sepsis.
Conclusions: We built a cross-study integration workflow for targeted proteomics (OLINK) that uses a uniquely available time-series dataset from IAF as the reference for determining individual variation per parameter. The proposed workflow was validated in an independent sepsis dataset. Several differentially expressed inflammatory proteins were identified that could be used as biomarkers for sepsis. A promising methodology and the necessary data are in place to analyze disease profiles of additional (chronic) diseases across cohorts in a search for biomolecular markers. The protein ranges established by our workflow can be of value for outcome prediction, patient monitoring, and directing further diagnostics.
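The bridging check described in the Results (a rank-based inverse normal transformation followed by cosine similarity of pooled plasma samples) can be sketched in Python as follows. This is an illustration only: the abstract does not state how the transformation is applied, so the sketch assumes it is applied per protein across the samples of each batch, uses the Blom offset c = 3/8, and works on synthetic data in which row 0 plays the role of the pooled plasma sample. All names are hypothetical.

```python
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform (Blom offset c = 3/8, an assumed choice)."""
    values = np.asarray(values, dtype=float)
    ranks = rankdata(values)                     # ties receive average ranks
    return norm.ppf((ranks - c) / (values.size - 2.0 * c + 1.0))

def cosine_similarity(u, v):
    """Cosine of the angle between two protein profiles."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
n_samples, n_proteins = 12, 92                   # 92-plex panel; row 0 = pooled plasma
batch_a = rng.lognormal(size=(n_samples, n_proteins))
# Simulated batch effect: a protein-specific positive scale and a shift.
scale = rng.uniform(0.5, 2.0, size=n_proteins)
shift = rng.uniform(-0.2, 0.2, size=n_proteins)
batch_b = batch_a * scale + shift

# Transform each protein (column) within its own batch.
int_a = np.apply_along_axis(inverse_normal_transform, 0, batch_a)
int_b = np.apply_along_axis(inverse_normal_transform, 0, batch_b)

similarity = cosine_similarity(int_a[0], int_b[0])
print(f"pooled-sample cosine similarity after transformation: {similarity:.3f}")
```

Because the transformation depends only on ranks, any monotone batch effect on a protein disappears, which is why the pooled sample looks the same in both simulated batches.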

Accurate construction of eukaryotic gene structures using FINDER – a fully automated gene annotator
  • Sagnik Banerjee, Iowa State University, United States
  • Priyanka Bhandary, Iowa State University, United States
  • Margaret Woodhouse, Iowa State University, United States
  • Taner Sen, Iowa State University, United States
  • Roger Wise, USDA-ARS, Iowa State University, United States
  • Carson Andorf, Iowa State University, United States

Presentation Overview:

Reduction in sequencing costs has enabled both large and small labs to undertake whole genome expression studies to better understand their respective biosystems. These studies have been extended to non-model organisms for which a comprehensive gene annotation is often lacking. The absence of high-quality and complete gene model sets hinders a holistic understanding of the biosystem. Proper construction of gene structures including correct identification of transcription start sites is pivotal to downstream applications (e.g., promoter mining, protein interaction study, etc.) making gene annotation extremely relevant.

Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of expression data. The presence of transposable elements and sequence repeats in eukaryotic genomes adds to this complexity, as do overlapping genes and genes that produce numerous transcripts. Currently available software annotates genomes by relying on full-length cDNA or on a database of splice junctions, which makes it susceptible to errors in the input. We present the computational pipeline FINDER, which automates downloading of expression data from NCBI, optimizes read alignment, assembles transcripts and performs gene prediction. FINDER is optimized to map reads so as to capture all biologically relevant alignments, with special attention to micro-exons (exons with a maximum length of 50 nucleotides). Assembly is conducted by PsiCLASS, a meta-assembler that can generate a consensus transcriptome from multiple RNA-Seq alignments. We configured FINDER to detect merged genes on the same strand using changepoint detection. FINDER further reports transcripts and recognizes genes that are expressed under specific conditions. It integrates prediction results from BRAKER2 with assemblies constructed from expression data to approach the goal of exhaustive genome annotation. FINDER accurately reconstructed 19,328 transcripts in Arabidopsis thaliana, about 4,500 more transcripts than BRAKER2, MAKER2 and PASA. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision.

We used Arabidopsis thaliana to test FINDER since it is a model organism with one of the best gene annotations. PsiCLASS reported a transcript F1 score of 40.78 – higher than all other assemblers. Some transcript assemblies generated by PsiCLASS represent merged genes and exhibit exon extension for overlapping genes on the opposite strand. FINDER utilizes signals present in expression profiles and implements changepoint detection to split up merged gene models. This improved the transcript F1 score from 40.78 to 45.94. Finally, gene models predicted by BRAKER2 that have high similarity with proteins and do not overlap with the assembled transcripts are included in the final annotations boosting the F1 score to 47.37.
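The idea of splitting a merged gene model with changepoint detection can be illustrated with a minimal Python sketch. It is not FINDER's implementation, which the abstract does not describe: it uses a generic least-squares search for a single changepoint in per-base read coverage, and the function name, the `min_size` parameter and the toy coverage profile are hypothetical.

```python
import numpy as np

def best_changepoint(coverage, min_size=5):
    """Return (index, gain) of the single split that best separates two coverage levels.

    The split minimizes the summed squared deviation of each segment from its own
    mean; gain is the reduction in squared error relative to not splitting.
    """
    cov = np.asarray(coverage, dtype=float)
    n = cov.size
    if n < 2 * min_size:
        return None, 0.0
    csum = np.concatenate(([0.0], np.cumsum(cov)))
    csum2 = np.concatenate(([0.0], np.cumsum(cov ** 2)))

    def sse(i, j):
        # Squared error of cov[i:j] around its mean, computed from prefix sums.
        s = csum[j] - csum[i]
        return (csum2[j] - csum2[i]) - s * s / (j - i)

    total = sse(0, n)
    best_k, best_cost = None, total
    for k in range(min_size, n - min_size + 1):
        cost = sse(0, k) + sse(k, n)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, total - best_cost

rng = np.random.default_rng(1)
# Toy merged model: a highly expressed gene (100 bases) followed by a weakly
# expressed neighbour (80 bases) assembled into one transcript.
coverage = np.concatenate([np.full(100, 50.0), np.full(80, 5.0)])
coverage += rng.normal(scale=2.0, size=coverage.size)

k, gain = best_changepoint(coverage)
print(f"suggested split position: {k}, squared-error reduction: {gain:.1f}")
```

In practice a threshold on the gain (or a penalized criterion) would be needed to decide whether a split is supported at all, and multiple changepoints would require a recursive or dynamic-programming search.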

With a wide variety of available evidence data for annotation, researchers often struggle to manage and optimize their usage. Several gene annotation tools also require users to perform complicated configurations without providing substantial guidance. FINDER makes the task of gene annotation easy for bench scientists by automating the entire process from RNA-Seq data processing to gene prediction. Since FINDER does not assume the ploidy or the nucleotide composition of a genome, it can be applied to derive gene structures for a wide range of species, including non-model organisms. FINDER constructs gene models primarily from RNA-Seq data and is therefore capable of constructing tissue- and/or condition-specific isoforms, which would have been impossible to obtain from ESTs only. FINDER surpasses the performance of existing software applications by utilizing read coverage information to fine-tune gene model boundaries. Instead of removing low-quality transcripts, FINDER flags them as low confidence, giving users the choice of using them as they see fit. We are confident that FINDER will pave the way for better gene structure annotation in the future.

  • Lukasz Jaroszewski, Biosciences Division, University of California Riverside School of Medicine, United States
  • Adam Godzik, Biosciences Division, University of California Riverside School of Medicine, United States
  • Mallika Iyer, Sanford Burnham Prebys Medical Discovery Institute, United States
  • Zhanwen Li, Biosciences Division, University of California Riverside School of Medicine, United States
  • Mayya Sedova, Biosciences Division, University of California Riverside School of Medicine, United States

Presentation Overview:

It has long been observed that homologous proteins, which can be recognized by sequence similarity, have similar structures (Chothia and Lesk 1986) and often (but not always) have similar functions. This follows naturally from the sequence-structure-function paradigm and forms the basis of most structural and functional predictions and of annotations in protein databases (Consortium 2019; Orry and Abagyan 2012; Yang and Zhang 2015). However, the paradigm is often interpreted as proteins having a single, unique structure, whereas it is now widely accepted that proteins are highly flexible and exist in a multitude of conformations that form a set called a conformational ensemble (Frauenfelder, Sligar, and Wolynes 1991). Transitions between these conformations occur via motions on a range of time and length scales and have been shown to be intrinsic to the function of many proteins (Eisenmesser et al. 2005; Henzler-Wildman et al. 2007). Therefore, knowledge of the conformational ensemble and movements of a protein is essential to understand its function. Studies have shown that the lowest-frequency normal modes, which represent functionally relevant large-scale movements of a protein, are similar for homologous proteins (Keskin, Jernigan, and Bahar 2000; Maguid, Fernandez-Alberti, and Echave 2008; Zen et al. 2008). However, the relationship between evolutionary distance and the similarity of these movements has not been systematically examined.
Here, we have analyzed such large-scale movements using experimentally solved crystal structures deposited in the Protein Data Bank (PDB) (Berman et al. 2000). Although any single X-ray experiment provides only a single snapshot of the protein structure, many proteins have multiple coordinate sets deposited in the PDB that represent different conformational and functional states of the protein (Burra et al. 2009; Kosloff and Kolodny 2008). The analysis of such sets can be used to indirectly study the various conformational states sampled in a protein’s conformational ensemble and the large-scale movements that occur between them. However, there is no established, systematic method to do so and therefore in this study, we developed a method for this purpose. We first characterized the conformational ensembles of individual proteins based on coordinate sets representing their different conformational states. Then, for each protein with two conformational states, we calculated the difference distance map (DDM) representing the difference/conformational change between them. These were then compared for pairs of homologous proteins, and the correlation between them was calculated to assess their similarity.
Our results showed that the similarity in conformational movements of homologous proteins increases with increasing sequence identity. This has significant implications for structure-function studies in general, as it suggests that the sequence-structure-function paradigm can be extended to include conformational ensembles and the large-scale movements that occur between different conformational states.
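The difference distance map (DDM) comparison described above can be sketched in Python. This is a minimal illustration, not the authors' method in detail: it assumes one coordinate per residue (for example Cα atoms), that the homologs have already been reduced to the same number of aligned residues, and that similarity is measured as the Pearson correlation over the upper triangles of the two maps. The toy "hinge" proteins and all names are hypothetical.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues, shape (n, n)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def difference_distance_map(state_a, state_b):
    """DDM: change in every inter-residue distance between two conformations."""
    return distance_matrix(state_b) - distance_matrix(state_a)

def ddm_correlation(ddm_1, ddm_2):
    """Pearson correlation of two DDMs over their upper triangles (aligned residues)."""
    iu = np.triu_indices_from(ddm_1, k=1)
    return float(np.corrcoef(ddm_1[iu], ddm_2[iu])[0, 1])

rng = np.random.default_rng(3)
n_res = 40
domain_2 = np.arange(n_res) >= 20                # residues in the moving domain

# Protein 1: the second domain shifts between the closed and open states.
closed_1 = rng.normal(scale=10.0, size=(n_res, 3))
open_1 = closed_1.copy()
open_1[domain_2] += np.array([5.0, 0.0, 0.0])

# Toy homolog: slightly perturbed structure with a similar domain movement.
closed_2 = closed_1 + rng.normal(scale=0.3, size=(n_res, 3))
open_2 = closed_2.copy()
open_2[domain_2] += np.array([4.5, 0.5, 0.0])

ddm_1 = difference_distance_map(closed_1, open_1)
ddm_2 = difference_distance_map(closed_2, open_2)
print(f"DDM correlation between homologs: {ddm_correlation(ddm_1, ddm_2):.3f}")
```

A useful property of DDMs is that they depend only on internal distances, so the two conformations do not need to be superimposed before comparison.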

Using computational molecular evolution and phylogeny to study pathogenic proteins
  • Janani Ravi, Michigan State University, United States
  • Samuel Chen, Michigan State University, United States
  • Lauren M Sosinski, Michigan State University, United States
  • John Johnston, Oakland University, United States

Presentation Overview:

Background: Studying bacterial physiology, adaptation, and pathogenicity through the lens of evolution requires delineating the phylogenetic history of bacterial proteins and genomes. Moreover, this history is best delineated at the levels of sequence, structure, and function, complemented by comparative pangenomics. However, there are currently no unified frameworks that experimentalists can use to seamlessly perform all these analyses for their proteins of interest. Using the existing software and tools requires numerous skills and resources: installing individual tools; downloading and preprocessing massive datasets; matching disparate input-output data formats across different tools; computational resources for efficient analyses; computational know-how to run command-line tools; and intuitive result summarization and visualization.

Approach: To address this challenge, we have developed a computational framework for comprehensive evolutionary analysis that systematically integrates multiple data sources for gleaning sequence-structure-function relationships. Our framework goes beyond simple sequence comparisons by delving into constituent domains, domain architectures, phyletic spreads, and tracing the evolution across the tree of life. By adding a critical step of comparative pangenomics to these analyses, we can pinpoint molecular and genomic features that are unique to bacterial groups of interest (e.g., specific pathogens), which can then help prioritize candidate molecular targets, even in poorly characterized bacterial genomes.

Results: We have developed a computational approach for molecular evolution and phylogeny with which researchers can perform homology searches across the tree of life and reconstruct domain architectures. It characterizes the input proteins and each of their homologs by combining: sequence alignment and clustering algorithms for domain detection; profile matching against protein domain/orthology databases; and prediction algorithms for signal peptides, transmembrane regions, cellular localization, and secondary/tertiary structures. We are adding a layer of comparative pangenomics to identify unique features for bacterial groups of interest (e.g., pathogens vs nonpathogens). An instance of the web-app, applied to study a large number of Psp stress response proteins (present across the tree of life), can be found here: jravilab.shinyapps.io/psp-evolution. The web-app described here is a general web-server that researchers can use to study any protein(s) of interest. To demonstrate the versatility of this framework, we are currently applying it to zoonotic pathogens causing severe and chronic pathologies in humans and animals, e.g., host-specific factors in Nontuberculous Mycobacteria, nutrient acquisition systems in Staphylococcus aureus, and surface layer proteins in Bacillus anthracis. We have implemented this computational approach for molecular evolution and phylogeny (MolEvolvR) as open-source software (R-package) and a streamlined easy-to-use web application (jraviilab.org/molevolvr) that will enable us and other researchers to prioritize candidate genetic factors in their application of interest for experimental validation.

Keynote: Predicting the evolution of syntenies - An algorithmic overview
  • Nadia El-Mabrouk

Presentation Overview:

Syntenies are genomic segments of consecutive genes identified by a certain conservation in gene content and order. Regardless of how they are identified, the goal is to characterize homologous genomic regions, i.e. regions deriving from a common ancestral region. Most algorithmic studies for inferring the evolutionary history that has led from the ancestral segment to the extant ones focus on inferring the rearrangement scenarios explaining their disruption in gene order. However, syntenies also evolve through other events that modify their gene content, such as duplications, losses or horizontal transfers. While the reconciliation approach between a gene tree and a species tree addresses the problem of inferring such events for single genes, few efforts have been dedicated to generalizing it to segmental events and to syntenies. In this presentation, I will review the main algorithmic methods for inferring ancestral syntenies and focus on those integrating both gene orders and gene trees.

International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
