Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide



COSI Track Presentations

Schedule subject to change
Tuesday, July 23rd
10:20 AM-11:00 AM
The human gut microbiome and its clinical relevance
Room: Delhi (Ground Floor)
  • Peer Bork, EMBL Heidelberg, Germany
11:00 AM-11:10 AM
Learning accurate representations of microbe-metabolite interactions
Room: Delhi (Ground Floor)
  • James Morton, University of California San Diego, United States
  • Louis-Félix Nothias-Scaglia, University of California San Diego, United States
  • James Foulds, University of Maryland, Baltimore, United States
  • Robert Quinn, Michigan State University, United States
  • Tami Swenson, Lawrence Berkeley National Laboratory, United States
  • Marc Van Goethem, Lawrence Berkeley National Laboratory, United States
  • Trent Northen, Lawrence Berkeley National Laboratory, United States
  • Michelle Badri, New York University, United States
  • Richard Bonneau, Simons Foundation, United States
  • Se Jin Song, University of California San Diego, United States
  • Rob Knight, University of California San Diego, United States
  • Alexander A. Aksenov, University of California, Los Angeles, United States
  • Pieter C. Dorrestein, Department of Chemistry and Biochemistry, UCSD, United States

Presentation Overview:

Integrating multi-omics datasets is critical for microbiome research but is complicated by multiple statistical challenges that can confound traditional correlation techniques. We address this problem by using neural networks to estimate the conditional probability that each molecule is present given the presence of each specific microbe. Using known medical (cystic fibrosis lung) and environmental (desert biological soil crust wetting) examples, we show that the method recovers microbe-metabolite relationships, and we demonstrate how it can discover relationships between microbially produced metabolites and a high-fat diet in a complex biological system.
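
To make the estimated quantity concrete, the following is a minimal sketch of the conditional probability P(molecule present | microbe present), computed here from plain co-occurrence counts. It is a simple counting baseline and not the neural network approach described in the abstract; the function name and the toy sample names are hypothetical.

```python
from collections import defaultdict

def conditional_presence(samples):
    """Estimate P(metabolite present | microbe present) from co-occurrence counts.

    `samples` is a list of (microbes, metabolites) pairs, where each element
    is the set of names observed in one sample.
    """
    microbe_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for microbes, metabolites in samples:
        for microbe in microbes:
            microbe_counts[microbe] += 1
            for metabolite in metabolites:
                pair_counts[(microbe, metabolite)] += 1
    # Divide each joint count by the number of samples containing the microbe.
    return {
        pair: count / microbe_counts[pair[0]]
        for pair, count in pair_counts.items()
    }

# Hypothetical toy data: three samples with made-up names.
samples = [
    ({"microbe_A"}, {"metabolite_1", "metabolite_2"}),
    ({"microbe_A", "microbe_B"}, {"metabolite_1"}),
    ({"microbe_B"}, set()),
]
probs = conditional_presence(samples)
for pair in sorted(probs):
    print(pair, probs[pair])
```

Raw counts like these are exactly what compositional and sparsity effects can distort, which is the motivation the abstract gives for a learned model.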

11:10 AM-11:20 AM
Efficient mobile taxonomic classification for nanopore data using a multilevel approach
Room: Delhi (Ground Floor)
  • Daniel H. Huson, Algorithms in Bioinformatics, Center for Bioinformatics, University of Tübingen, Germany
  • Caner Bagci, University of Tuebingen, Germany
  • Benjamin Albrecht, Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, Germany

Presentation Overview:

The portable ONT (Oxford Nanopore Technologies) MinION has great potential for studying microbial communities directly in the field, but this requires software that copes with limited internet access and with computational power restricted to a laptop. Here, we present an approach that meets those requirements. Our method is designed to run on a laptop (without internet access) and aims to identify the bacterial species present with high accuracy.

Our analysis uses LAST to perform a two-tiered translated alignment against the bacterial part of the NCBI RefSeq database. Species identification is then performed using a "gene-sequence graph" that represents genes and their adjacency along reads.

To illustrate our approach we use published long-read sequencing data derived from a commercially-available mock community containing eight equally distributed bacteria and successfully analyze this data on a laptop with 8 cores and 64GB of memory.

The proposed method performs an accurate real-time taxonomic classification using the comprehensive RefSeq database on a laptop. Thus, in combination with a portable sequencing device (such as an ONT MinION and MinIT), our approach opens up new possibilities for studying microbial communities directly in the field.

11:20 AM-11:40 AM
Proceedings Presentation: TADA: Phylogenetic augmentation of microbiome samples enhances phenotype classification
Room: Delhi (Ground Floor)
  • Erfan Sayyari, University of California San Diego, United States
  • Siavash Mirarab, University of California San Diego, United States
  • Ban Kawas, IBM, United States

Presentation Overview:

Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data is high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data is often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation, which consists of building synthetic samples and adding them to the training data, is a technique that has proved helpful for many machine learning tasks.
Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes.

11:40 AM-12:00 PM
MetaRefSGB: a scalable framework to organize genomes from metagenomes and their annotations into species-level genome bins
Room: Delhi (Ground Floor)
  • Fabio Cumbo, Department CIBIO, University of Trento, Trento, Italy
  • Francesco Asnicar, Department CIBIO, University of Trento, Trento, Italy
  • Francesco Beghini, Department CIBIO, University of Trento, Trento, Italy
  • Nicolai Karcher, Department CIBIO, University of Trento, Trento, Italy
  • Paolo Manghi, Department CIBIO, University of Trento, Trento, Italy
  • Serena Manara, Department CIBIO, University of Trento, Trento, Italy
  • Edoardo Pasolli, Department of Agricultural Sciences, University of Naples Federico II, Naples, Italy
  • Nicola Segata, Department CIBIO, University of Trento, Trento, Italy

Presentation Overview:

Metagenomic assembly is a powerful means to reconstruct draft genomes of microbes present in a microbial community. When applied at a very large scale (>10,000 samples), it greatly expands the diversity of the microbial tree of life, adding thousands of species that have no genomes from isolate sequencing. However, there is currently a lack of systems able to integrate, organize, and annotate the huge number (>200,000) of available metagenome-assembled genomes (MAGs). We present MetaRefSGB, a scalable open-source framework able to automatically organize and annotate MAGs into species-level genome bins (SGBs). The framework is based on our recently developed strategy to cluster MAGs into known and previously unknown species (kSGBs and uSGBs, respectively). We highlight the potential of this approach for characterizing the human microbiome by analyzing the 4,930 human-associated SGBs comprising 154,723 MAGs assembled from >9,500 metagenomes. The framework is built to allow continuous updates, including from external users, and it is currently being expanded with >3,500 metagenomes sampled from the food chain and from populations of non-human primates. The framework provides easy access to a very large number of MAGs, and can thus support current and future assembly-free and assembly-based metagenomic investigations.

12:00 PM-12:20 PM
Proceedings Presentation: Learning a Mixture of Microbial Networks Using Minorization-Maximization
Room: Delhi (Ground Floor)
  • Sahar Tavakoli, University of Central Florida, United States
  • Shibu Yooseph, University of Central Florida, United States

Presentation Overview:

The interactions among the constituent members of a microbial community play a major role in determining the overall behavior of the community and the abundance levels of its members. These interactions can be modeled using a network whose nodes represent microbial taxa and edges represent pairwise interactions. A microbial network is typically constructed from a sample-taxa count matrix that is obtained by sequencing multiple biological samples and identifying taxa counts. From large-scale microbiome studies, it is evident that microbial community compositions and interactions are impacted by environmental and/or host factors. Thus, it is not unreasonable to expect that a sample-taxa matrix generated as part of a large study involving multiple environmental or clinical parameters can be associated with more than one microbial network. However, to our knowledge, microbial network inference methods proposed thus far assume that the sample-taxa matrix is associated with a single network.
We present a mixture model framework to address the scenario when the sample-taxa matrix is associated with K microbial networks. This count matrix is modeled using a mixture of K Multivariate Poisson Log-Normal distributions and parameters are estimated using a maximum likelihood framework. Our parameter estimation algorithm is based on the Minorization-Maximization principle combined with gradient ascent and block updates. Synthetic datasets were generated to assess the performance of our approach on absolute count data, compositional data, and normalized data. We also addressed the recovery of sparse networks based on an l1-penalty model.
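
As a sketch of the model being fitted, the mixture density for one sample's count vector can be written as follows, assuming the standard multivariate Poisson log-normal parameterization. The symbols $\pi_k$, $\mu_k$, $\Sigma_k$, $z$, and $p$ are notation assumed here and do not appear in the abstract.

```latex
% x = (x_1, ..., x_p): taxa counts for one sample
% z = (z_1, ..., z_p): latent log-abundances
p(x) = \sum_{k=1}^{K} \pi_k \int_{\mathbb{R}^p}
       \left[ \prod_{j=1}^{p} \frac{e^{z_j x_j}\, e^{-e^{z_j}}}{x_j!} \right]
       \mathcal{N}(z \mid \mu_k, \Sigma_k)\, dz,
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0
```

Under this reading, each bracketed factor is a Poisson probability with rate $e^{z_j}$. One common choice, assumed here rather than stated in the abstract, is to read the network of component $k$ from the sparsity pattern of the precision matrix $\Sigma_k^{-1}$, with the l1 penalty encouraging zero entries.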

12:20 PM-12:30 PM
Bacterial lineage identification from multi-strain sequencing data
Room: Delhi (Ground Floor)
  • Tommi Mäklin, University of Helsinki, Finland
  • Teemu Kallonen, University of Oslo, Norway
  • Jukka Corander, University of Oslo, Norway
  • Antti Honkela, University of Helsinki, Finland

Presentation Overview:

Bacteria belonging to the same species can be incredibly diverse, and analysis of bacterial populations at the subspecies level of strains or lineages is necessary for understanding their properties. In this work, we present mSWEEP, a method for identifying bacterial lineages from mixed sequencing data. mSWEEP uses fast probabilistic inference of a model of pseudoalignments of reads to a large collection of bacterial reference sequences that have been grouped into biologically meaningful lineages, combined with confidence analysis of the identified lineages. We verify experimentally that mSWEEP provides a massive increase in lineage identification accuracy compared to existing methods. We also apply mSWEEP to the analysis of E. coli from pre/post-treatment diarrhoea samples, observing a clear shift from clinical and diarrhoea-attributed lineages to lineages associated with antibiotic resistance post-treatment.

12:30 PM-12:40 PM
Metaepigenomic analysis reveals the unexplored diversity of DNA methylation in an environmental prokaryotic community.
Room: Delhi (Ground Floor)
  • Wataru Iwasaki, The University of Tokyo, Japan

Presentation Overview:

DNA methylation plays important roles in prokaryotes, and their genomic landscapes (prokaryotic epigenomes) have recently begun to be disclosed. However, our knowledge of prokaryotic methylation systems is focused on those of culturable microbes, which are rare in nature. Here, we used single-molecule real-time and circular consensus sequencing techniques to reveal the 'metaepigenomes' of a microbial community in the largest lake in Japan, Lake Biwa. We reconstructed 19 draft genomes from diverse bacterial and archaeal groups, most of which are yet to be cultured. The analysis of DNA chemical modifications in those genomes revealed 22 methylated motifs, nine of which were novel. We identified methyltransferase genes likely responsible for methylation of the novel motifs, and confirmed the catalytic specificities of four of them via transformation experiments using synthetic genes. Our study highlights metaepigenomics as a powerful approach for identification of the vast unexplored variety of prokaryotic DNA methylation systems in nature.

2:00 PM-2:40 PM
From precision microbial genomics to precision medicine
Room: Delhi (Ground Floor)
  • Ami Bhatt, Stanford Medical School and Stanford University, United States
2:40 PM-3:00 PM
Proceedings Presentation: Large Scale Microbiome Profiling in the Cloud
Room: Delhi (Ground Floor)
  • Camilo Valdes, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
  • Vitalii Stebliankin, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
  • Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States

Presentation Overview:

Bacterial metagenomics profiling for whole metagenome sequencing (WGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient, and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. Flint runs on Amazon's Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds — an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.

Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. Supplementary materials and data are available at http://biorg.cs.fiu.edu.

3:00 PM-3:20 PM
Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold.
Room: Delhi (Ground Floor)
  • Martin Steinegger, Johns Hopkins University, United States
  • Milot Mirdita, Max-Planck Institute for biophysical Chemistry, Germany
  • Johannes Soeding, MPI BPC, Germany

Presentation Overview:

The open-source, de novo protein-level assembler Plass (https://plass.mmseqs.com) assembles six-frame-translated sequencing reads into protein sequences. It recovers 2 to 10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs, the largest free collections of protein sequences: 2 billion sequences from 640 soil samples (SRC) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (MERC).
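
For readers unfamiliar with six-frame translation, here is a minimal sketch of the idea: a read is translated in three reading frames on each strand. This is an illustration using the standard genetic code, not Plass's implementation; all function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translate(read):
    """Map frame labels (+1..+3, -1..-3) to the translated protein strings."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translate("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Stop codons appear as `*` in the output; a real assembler would additionally split or filter fragments at stops, which this sketch leaves out.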

3:20 PM-3:30 PM
Massive-scale structure and function predictions of human gut microbiome proteins for metagenomic applications
Room: Delhi (Ground Floor)
  • Rob Knight, University of California San Diego, United States
  • Tomasz Kosciolek, Małopolska Centre of Biotechnology, Poland
  • Douglas Renfrew, Flatiron Institute, United States
  • Tommi Vatanen, Harvard University, United States
  • Julia Koehler Leman, Flatiron Institute, United States
  • Ramnik Xavier, Harvard University, United States
  • Vladimir Gligorijević, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, United States

Presentation Overview:

The human gut microbiome is estimated to harbor over 2 million unique protein-coding genes. Only a fraction of them are experimentally annotated; the rest require computational predictions. Community-wide experiments such as CAFA show that homology-based function annotation approaches fall short and that more sophisticated approaches are required. We use protein families to predict residue-residue contacts and use them as constraints for de novo structure prediction. The predictions are carried out using the World Community Grid Microbiome Immunity Project (https://www.worldcommunitygrid.org/research/mip1/overview.do), to which anyone can donate spare computational time. In the project's first 1.5 years, we have generated over 160,000 unique structural models, each representing a different gene family, effectively doubling the number of available protein structures. These 3D structural models, rather than sequences, then serve as inputs for a deep learning-based function prediction method we developed. This approach enables us to achieve state-of-the-art accuracy in predicting Gene Ontology terms. We are now in a position to functionally annotate microbial genomes and metagenomes with higher coverage and accuracy. We may also start addressing microbe-microbe and host-microbiome protein-protein interactions to determine the mechanisms of microbiota-induced immune response.

3:30 PM-3:40 PM
Unraveling the Role of Microbial Dark Matter in Extreme Environmental Networks
Room: Delhi (Ground Floor)
  • Tatyana Zamkovaya, University of Florida, United States
  • Ana Conesa, University of Florida, United States

Presentation Overview:

Though Microbial Dark Matter (MDM) has been uncovered in a wide range of habitats, few studies have explored beyond abundance and distribution patterns, leaving the ecological role of MDM a mystery. To understand the potential ecological contributions of MDM, it is essential to first understand how these unknown species impact neighboring microbes and their respective environment. Here, we establish a method to predict the ecological significance of MDM using microbial correlation networks of four extreme aquatic environmental categories (Hot Springs, Hypersaline, Deep Sea, and Polar), compiled from 45 publicly available 16S rRNA studies comprising 1086 environmental samples. Networks were constructed including and excluding MDM at multiple taxonomic levels for each of the four environments. Network centrality measures were used to compare the networks quantitatively. Because the closeness and betweenness centralities of other microbes change significantly in the absence of MDM in the Deep Sea, Polar, and Hypersaline communities, MDM appear to play necessary ecological roles. Interestingly, microbial taxa were shown to predominantly occur as hubs across all environments. We show that MDM, by their interactions with other microbes, are integral, highly adapted to extreme environments, and can be used to detect novel genes and pathways of adaptation.
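
The include/exclude comparison can be illustrated with a small sketch that computes the closeness centrality of one taxon before and after removing a single unknown node. The five-node network, the node names, and the unnormalized closeness definition used here are assumptions for illustration, not the authors' pipeline.

```python
from collections import deque

def closeness_centrality(graph, node):
    """Closeness of `node`: reachable nodes divided by summed shortest-path lengths."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over an unweighted graph
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0

def remove_node(graph, removed):
    """Return a copy of `graph` without `removed` and its edges."""
    return {
        node: {n for n in neighbors if n != removed}
        for node, neighbors in graph.items()
        if node != removed
    }

# Hypothetical network: "unknown_1" stands in for an MDM taxon that
# provides a shortcut between taxon_A and taxon_B.
network = {
    "taxon_A": {"unknown_1", "taxon_C"},
    "unknown_1": {"taxon_A", "taxon_B"},
    "taxon_B": {"unknown_1", "taxon_D"},
    "taxon_C": {"taxon_A", "taxon_D"},
    "taxon_D": {"taxon_C", "taxon_B"},
}
with_mdm = closeness_centrality(network, "taxon_A")
without_mdm = closeness_centrality(remove_node(network, "unknown_1"), "taxon_A")
print(with_mdm, without_mdm)
```

In this toy case the closeness of `taxon_A` drops once the unknown node is removed, which is the kind of shift the abstract interprets as evidence of an ecological role.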

3:40 PM-3:50 PM
Inferring microbial co-occurrence networks from 16S data: A systematic evaluation
Room: Delhi (Ground Floor)
  • Dileep Kishore, Boston University, United States
  • Gabriel Birzu, Boston University, United States
  • Zhenjun Hu, Boston University, United States
  • Kirill Korolev, Boston University, United States
  • Daniel Segrè, Boston University, United States

Presentation Overview:

Microbes tend to organize into communities consisting of hundreds of species involved in complex interactions with each other. 16S ribosomal RNA (16S rRNA) gene profiling provides snapshots that reveal the phylogenies and abundance distributions of these microbial communities. These snapshots, when collected from multiple samples, have the potential to reveal which microbes co-occur, providing a glimpse into the network of inter-dependencies underlying these communities. The inference of networks from 16S data is prone to statistical artifacts, but the extent to which the different steps in the workflow affect the resultant network is still unclear.
In this study, we perform a meticulous analysis of each step of a pipeline that processes 16S sequencing data into a network of microbial associations. Through this process, we determine the tools and parameters that generate the most accurate and robust co-occurrence networks. Ultimately, we develop a standardized pipeline that follows these default tools and parameters, but that can also help explore the outcome of any other combination of choices. We envisage that this standard pipeline for processing 16S sequencing data into networks of microbial co-occurrences could be used for integrating multiple data-sets, and for generating comparative analyses and consensus networks useful for detecting disease-related patterns.

3:50 PM-4:00 PM
Phylogenetic Tree-based Microbiome Association Test
Room: Delhi (Ground Floor)
  • Kangjin Kim, School of Public Health, Seoul National University, South Korea
  • Sungho Won, School of Public Health, Seoul National University, South Korea
  • Sang-Chul Park, Institute, South Korea
  • Jaehyun Park, Interdisciplinary Program for Bioinformatics, College of Natural Science, Seoul National University, Seoul, South Korea

Presentation Overview:

Motivation: Microbial ecological patterns exhibit high inter-subject variation, with few operational taxonomic units (OTUs) for each species. To overcome these issues, non-parametric approaches, such as the Wilcoxon rank-sum test, have often been used. However, these approaches only utilize the ranks of observed relative abundances, leading to information loss, and are associated with high false-negative rates. In this article, we propose a phylogenetic tree-based microbiome association test (TMAT) to analyze the associations between microbiome OTU abundances and disease phenotypes. Phylogenetic trees illustrate patterns of similarity among different OTUs, and TMAT provides an efficient method for utilizing such information for association analyses. The proposed TMAT provides test statistics for each node, which are combined to identify mutations associated with host diseases.
Results: Statistical power estimates of TMAT were compared with existing methods using extensive simulations based on real absolute abundances. Simulation studies showed that TMAT preserves the nominal type-1 error rate, and estimates of its statistical power generally outperformed existing methods with regard to the considered scenarios. Furthermore, TMAT can be used to detect phylogenetic mutations associated with host diseases, providing more in-depth insight into bacterial pathology.
Availability: TMAT is implemented as an R package. Detailed information is available at http://healthstat.snu.ac.kr/software/tmat.

4:40 PM-5:00 PM
Update on the CAMI 2 Challenge
Room: Delhi (Ground Floor)
  • Alexander Sczyrba, Bielefeld University, Germany
5:00 PM-5:20 PM
CAMI future challenges
Room: Delhi (Ground Floor)
  • Alexander Sczyrba, Bielefeld University, Germany
5:20 PM-5:30 PM
Assessing taxonomic metagenome profilers with OPAL
Room: Delhi (Ground Floor)
  • Fernando Meyer, Helmholtz Centre for Infection Research, Germany
  • Andreas Bremges, Helmholtz Centre for Infection Research, Germany
  • Peter Belmann, Helmholtz Centre for Infection Research, Germany
  • Stefan Janssen, Helmholtz Centre for Infection Research, Germany
  • David Koslicki, Oregon State University, United States
  • Alice C. McHardy, Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Germany

Presentation Overview:

Taxonomic metagenome profilers predict the presence and relative abundance of microorganisms from shotgun sequence samples of DNA isolated directly from a microbial community. Over the past years, there has been an explosive growth of software and algorithms for this task, resulting in a need for more systematic comparisons of these methods using relevant performance criteria. We present the Open-community Profiling Assessment tooL (OPAL), a software package implementing commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment of Metagenome Interpretation (CAMI), together with convenient visualizations. OPAL implements simple presence-absence metrics and more sophisticated comparative metrics such as UniFrac and diversity metrics, as well as run time and memory efficiency measurements. All results are output in plots, tables, and in an HTML page, which contains an interactive performance ranking allowing users to dynamically rank taxonomic profilers based on the combination of metrics of their choice. OPAL thus facilitates in-depth performance comparisons, as well as the development of new methods and data analysis workflows. To demonstrate the application, we compare seven profilers on benchmark datasets of CAMI and the Human Microbiome Project. OPAL is freely available at https://github.com/CAMI-challenge/OPAL.
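
As an illustration of what presence-absence and abundance-based metrics measure, the sketch below compares a predicted profile against a ground-truth profile. It is not OPAL's code or API; the function names and the toy profiles are hypothetical, and profiles are assumed to be dictionaries mapping a taxon name to its relative abundance.

```python
def presence_absence_metrics(truth, predicted):
    """Precision, recall, and F1 over the sets of taxa with nonzero abundance."""
    true_taxa = {taxon for taxon, abundance in truth.items() if abundance > 0}
    pred_taxa = {taxon for taxon, abundance in predicted.items() if abundance > 0}
    true_positives = len(true_taxa & pred_taxa)
    precision = true_positives / len(pred_taxa) if pred_taxa else 0.0
    recall = true_positives / len(true_taxa) if true_taxa else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def l1_norm_error(truth, predicted):
    """Sum of absolute abundance differences over all taxa in either profile."""
    taxa = set(truth) | set(predicted)
    return sum(abs(truth.get(t, 0.0) - predicted.get(t, 0.0)) for t in taxa)

# Hypothetical profiles: the profiler misses taxon_C and reports a spurious taxon_D.
truth = {"taxon_A": 0.5, "taxon_B": 0.25, "taxon_C": 0.25}
predicted = {"taxon_A": 0.5, "taxon_B": 0.25, "taxon_D": 0.25}
metrics = presence_absence_metrics(truth, predicted)
error = l1_norm_error(truth, predicted)
print(metrics, error)
```

Presence-absence metrics ignore how wrong the abundances are, while the L1 error ignores which taxa caused it, which is one reason a benchmark reports both kinds.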

5:30 PM-5:40 PM
Assessment of metagenomic assemblers based on hybrid reads of real and simulated metagenomic sequences
Room: Delhi (Ground Floor)
  • Ziye Wang, School of Mathematical Sciences, Fudan University, China
  • Ying Wang, Department of Automation, Xiamen University, China
  • Jed Fuhrman, Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, United States
  • Fengzhu Sun, Department of Biological Sciences, University of Southern California, United States
  • Shanfeng Zhu, Fudan University, China

Presentation Overview:

Understanding the power and limitations of various read assembly programs in practice is important for researchers choosing which programs to use in their investigations. Many studies evaluating different assembly programs used either simulated metagenomes or real metagenomes with unknown genome compositions. However, simulated datasets may not reflect the real complexities of metagenomic samples, and the estimated assembly accuracy could be misleading due to the unknown genomes in real metagenomes. Therefore, hybrid strategies are required to evaluate the various read assemblers for metagenomic studies. In this paper, we benchmark metagenomic read assemblers by mixing reads from real metagenomic datasets with reads from known genomes and evaluating the integrity, contiguity and accuracy of the assembly using the reads from the known genomes. We selected MEGAHIT, MetaSPAdes, IDBA-UD and Faucet for evaluation. We show the strengths and weaknesses of these assemblers in terms of integrity, contiguity and accuracy. Overall, MetaSPAdes performs best in integrity and contiguity at the species level, followed by MEGAHIT. Faucet performs best in terms of accuracy at the cost of the worst integrity and contiguity, especially at low sequencing depth. MetaSPAdes has the overall best performance at the strain level. MEGAHIT is the most efficient in our experiments.
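
Contiguity is often summarized with the N50 statistic. The abstract does not say which contiguity measure the authors used, so the following is only a sketch of that common choice, with hypothetical contig lengths.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths in base pairs (total 300).
contigs = [100, 80, 50, 30, 20, 20]
print(n50(contigs))
```

A higher N50 means the same total sequence is spread over fewer, longer contigs; it says nothing about whether those contigs are correct, which is why accuracy is assessed separately.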

5:40 PM-5:50 PM
MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics
Room: Delhi (Ground Floor)
  • Eli Levy Karin, Max-Planck Institute for biophysical Chemistry, Germany
  • Milot Mirdita, Max-Planck Institute for biophysical Chemistry, Germany
  • Johannes Soeding, MPI BPC, Germany

Presentation Overview:

Metagenomics is revolutionizing the study of microbes by allowing us to investigate the 99% of uncultivatable organisms through direct sequencing. Unicellular eukaryotes play essential roles in microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts and parasites. Investigating their roles is therefore of great interest to ecology, biotechnology, medicine, and evolution. However, the generally lower sequencing coverage, their complex gene architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics.
MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk’s power to discover novel eukaryotic proteins in large-scale metagenomics data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted millions of protein-coding genes in < 60 hours on twenty 16-core servers, most of which are diverged from known proteins and originate from sparsely sampled eukaryotic supergroups.
MetaEuk is an open-source (GPLv3) software: https://metaeuk.soedinglab.org.

5:50 PM-6:00 PM
DeepBGC: Applying Deep Learning Algorithms to Biosynthetic Gene Cluster Identification
Room: Delhi (Ground Floor)
  • Christopher Woelk, MERCK EXPLORATORY SCIENCE CENTER, United States
  • Geoffrey Hannigan, MERCK EXPLORATORY SCIENCE CENTER, United States
  • Andrej Palicka, MSD Czech Republic, Czechia
  • Jindrich Soukup, MSD Czech Republic, Czechia
  • Ondrej Klempir, MSD Czech Republic, Czechia
  • Lena Rampula, MSD Czech Republic, Czechia
  • Jindrich Durcak, MSD Czech Republic, Czechia
  • Michael Wurst, MSD Czech Republic, Czechia
  • Jakub Kotowski, MSD Czech Republic, Czechia
  • Dan Chang, Merck & Co. Inc., United States
  • Grazia Piizzi, MERCK EXPLORATORY SCIENCE CENTER, United States
  • David Prihoda, MSD Czech Republic, Czechia
  • Bitton Danny, MSD Czech Republic, Czechia

Presentation Overview:

A major challenge for microbiome research is the functional interpretation of whole genome shotgun sequencing data derived from bacteria. Functional interpretation may be achieved by identifying biosynthetic gene clusters (BGCs), which are co-localized groups of genes that encode a biosynthetic pathway capable of producing functional metabolites. DeepBGC is a bidirectional long short-term memory (BiLSTM) recurrent neural network that was developed to identify BGCs using a training data set of previously published BGCs and artificially constructed non-BGCs, followed by parameter tuning using nine bacterial genomes with embedded BGCs. The genes comprising BGCs were annotated for protein family (Pfam) domains, and a novel algorithm, pfam2vec, then converted these domains to numeric vectors for DeepBGC input. DeepBGC outperformed Hidden Markov Models (i.e. ClusterFinder) when evaluated on a hold-out validation data set of six bacterial genomes annotated for BGCs (AUC 0.923 vs. 0.847). DeepBGC also demonstrated superior performance when predicting BGCs of classes removed from the training data (e.g. RiPPs, AUC 0.910 vs. 0.738). In conclusion, DeepBGC is distributed under an MIT open source license and facilitates functional interpretation of metagenomic experiments, as well as the identification of new natural products that represent drug candidates with antimicrobial or anticancer properties.