Posters - Schedules


Session A poster display: Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Session B poster display: Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT

Session A Poster Set-up and Dismantle
Session A posters set up: Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT
Session A posters dismantle: Tuesday, July 12 at 6:00 PM CDT

Session B Poster Set-up and Dismantle
Session B posters set up: Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT
Session B posters dismantle: Thursday, July 14 at 2:00 PM CDT
Virtual: BioArchLinux: bioinformatics community with Arch Linux
COSI: BOSC
  • Guoyi Zhang, University of Nottingham, Nottingham, United Kingdom
  • Michael Elliot, University of Groningen, Groningen, Netherlands
  • Yujin Hu, Shenzhen Research Institute of Big Data, Shenzhen, China
  • Viktor Drobot, Pharm InnTech LLC, Moscow, Russia
  • Jens Staal, Ghent University, Ghent, Belgium
  • Yun Yi, Beijing, China


Presentation Overview:

BioArchLinux is a community for biologists and a repository of bioinformatics software for Arch Linux and Arch Linux-based distributions. The repository is maintained using the Lilac Python tool and archrepo2, which can auto-update each package's version based on upstream releases and auto-generate the repository database. At present, the BioArchLinux repository contains more than 3,700 bioinformatics packages, including their dependencies.
BioArchLinux, in line with the rolling release philosophy of Arch Linux, always provides the latest versions of open-source packages, thus providing more accurate results for biological research. It is unnecessary for users to manually locate the package version upstream, download new packages from a webpage or even compile packages from source code. Furthermore, the repository provides, where possible, a DOI in each package description. This makes it easy for users to find out more about each package’s use and methodology, and to quickly identify appropriate citations when preparing publications.
Since December 2021, a small team of contributors from five countries has maintained and continually expanded the repository.

Virtual: Bioconductor toolchain for reproducible bioinformatics pipelines using Rcwl and RcwlPipelines
COSI: BOSC
  • Qiang Hu, Roswell Park Comprehensive Cancer Center, United States
  • Alan Hutson, Roswell Park Comprehensive Cancer Center, United States
  • Song Liu, Roswell Park Comprehensive Cancer Center, United States
  • Martin Morgan, Roswell Park, United States
  • Qian Liu, Roswell Park Comprehensive Cancer Center, United States


Presentation Overview:

The Common Workflow Language (CWL) is used to provide portable and reproducible data analysis workflows across different tools and computing environments. We have developed Rcwl, an R interface to CWL, to provide easier development, use and maintenance of CWL pipelines from within R. We have also collected nearly 200 pre-built tools and pipelines in RcwlPipelines, ready to be queried and used by researchers in their own data analysis. A single-cell RNA sequencing preprocessing pipeline using STARsolo for alignment and quantification and DropletUtils for filtering empty droplets from the raw gene-barcode matrix will demonstrate the use of the software.

Virtual: CaDrA: A Containerized and Cloud-Deployable Open-Source Software Package for Candidate Driver Analysis of Multi-Omics Data
COSI: BOSC
  • Reina Chau, Boston University, United States
  • Katia Bulekova, Boston University, United States
  • Vinay Kartha, Harvard University, United States
  • Stefano Monti, Boston University, United States


Presentation Overview:

We have previously developed an integrative omics analysis methodology called CaDrA (Candidate Driver Analysis), which implements a step-wise rank-based heuristic search approach to identify functionally relevant subsets of genomic features (mutations, copy-number alterations, methylation marks, etc.) that, together, are maximally associated with a specific outcome (phenotype) of interest [1].

In its original implementation, the methodology was available only as a tarball, had scant documentation, and lacked a CRAN/Bioconductor-compliant structure. Finally, the original approach supported only Kolmogorov-Smirnov (KS) and Wilcoxon-based feature ranking.

We have now extended the functionalities of the method, including feature scoring by mutual information based on the REVEALER score [2], and have optimized, hardened, and deployed it as an open-source R package developed according to best software engineering practices and design principles. The package is being thoroughly documented, and a simple R Shiny interface is being designed to make the tool “biologist-friendly”. Further, the package is being “containerized” via Docker to make it cloud-ready and automatically compatible with various high-performance computing (HPC) environments. A well-documented, easily installable, and portable tool will facilitate its adoption by the research community.
We will present results of its application to simulated and real cancer multi-omics data from the TCGA consortium and in-house generated data.

Virtual: Cloud compatible pipeline for nanopore long-read sequencing data consensus DNA methylation detection
COSI: BOSC
  • Yang Liu, The Jackson Laboratory for Genomic Medicine, CT, USA
  • Ziwei Pan, The Jackson Laboratory for Genomic Medicine, CT, USA; UCONN Health, CT, USA
  • Christina Chatzipantsiou, Lifebit Biotech LTD., London, United Kingdom
  • Thatcher Slocum, The Jackson Laboratory for Genomic Medicine, CT, USA
  • Lasya Karuturi, The Jackson Laboratory for Genomic Medicine, CT, USA
  • Sheng Li, The Jackson Laboratory for Genomic Medicine, CT, USA; UCONN Health, CT, USA; University of Connecticut, Storrs, CT, USA


Presentation Overview:

Nanopore long-read sequencing has expanded the capacity for long-range, single-base, single-molecule DNA methylation (DNAme) detection. Previously, we benchmarked and ranked seven computational tools for DNAme detection. Long-read sequencing analysis for mammalian genomes also requires far more computing resources than next-generation sequencing, so scalability and reproducibility are critical in pipeline design. Currently, no pipeline integrates and automates DNA methylation detection for nanopore sequencing using cloud computing. To address these issues, we developed NANOME, the first Nextflow-based container environment (Docker and Singularity) for consensus DNA methylation detection in nanopore long-read sequencing using XGBoost, a gradient boosting algorithm. The pipeline supports tera-base-scale data analysis with a single command. Furthermore, the consensus model improves DNA methylation detection by 3-5% in F1 score at single-molecule resolution and by 9-13% in mean squared error (MSE) at single-base resolution. NANOME is an open-source, reproducible, end-to-end pipeline for whole-genome DNAme detection and is compatible with multiple HPC clusters and cloud OS platforms. A web-based interface is available through Lifebit CloudOS for cloud-based analysis, monitoring processes, and visualizing results from a single command line. NANOME is a useful and complete step forward for DNA methylation detection and long-range epigenetic phasing.
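
To make the consensus idea concrete, here is a minimal sketch on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost model NANOME actually trains; the per-tool scores and labels below are simulated, not NANOME's code or data.

```python
# Illustrative consensus sketch: train a gradient-boosting model over
# per-tool methylation scores (scikit-learn standing in for XGBoost).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
truth = rng.integers(0, 2, size=n)        # 1 = methylated, 0 = unmethylated
# Synthetic per-read scores from two hypothetical callers, correlated with truth.
scores = np.column_stack([
    truth + rng.normal(0, 0.8, size=n),
    truth + rng.normal(0, 1.0, size=n),
])

X_train, X_test, y_train, y_test = train_test_split(scores, truth, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("consensus F1:", round(f1_score(y_test, model.predict(X_test)), 3))
```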

Virtual: Effectively developing, distributing, and managing a Python package with a large community
COSI: BOSC
  • João Mc Teixeira, Utrecht University, Netherlands
  • João Pglm Rodrigues, Schrödinger, United States
  • Alexandre Mjj Bonvin, Utrecht University, Netherlands


Presentation Overview:

The advent of massive online platforms for code hosting and version control opened the door to a cultural leap in developing open-source software, ushering in an era of liberal contributions and inclusive meritocracy. This new way of working together permeated all fields of software development, and scientific software was no exception. However, hosting research software on a massive online code-hosting platform is not enough to create a healthy community around the code. Moreover, the increasing maintenance burden resulting from ineffective management can halt a project's development, discourage contributions from other developers, and alienate interested users. We present a consolidated strategy to effectively develop, organize, publish, and manage an open-source Python library with a rich community of users and contributors. We created a template repository on GitHub to document the strategies we have adopted. There, you will find explanations and live examples of how to organize the repository's folders and files, how to set up automated unit testing, how to write and build the documentation from code, how to assemble solid continuous integration workflows using GitHub Actions, and how to streamline version bumping and packaging. The repository is available at: https://github.com/joaomcteixeira/python-project-skeleton.
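
As a flavor of the automated unit testing the template wires into continuous integration, here is a minimal pytest-style example; the `slugify` function is hypothetical and exists only for illustration.

```python
# Minimal pytest-style unit test, of the kind the template runs in CI.
# `slugify` is a hypothetical helper used only for illustration.
import re

def slugify(text: str) -> str:
    """Lowercase a string and replace non-alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def test_slugify_basic():
    assert slugify("Hello, World!") == "hello-world"

def test_slugify_idempotent():
    assert slugify(slugify("Hello, World!")) == slugify("Hello, World!")
```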

Virtual: Evaluation of batch effect correction methods for space biology RNA sequencing data
COSI: BOSC
  • Lauren Sanders, NASA Ames, BMSIS, United States
  • Finsam Samson, Stanford University, United States
  • Hamed Chok, N/A, United States
  • Sylvain Costes, NASA Ames, United States
  • Amanda Saravia-Butler, KBR, NASA Ames Research Center, Moffett Field, CA, United States


Presentation Overview:

RNA sequencing (RNA-seq) data from space biology experiments yield insights into the effects of spaceflight on living systems. However, sample numbers from spaceflight studies are low due to limited crew availability, hardware, and space. To increase statistical power, individual spaceflight RNA-seq datasets are often aggregated. This can introduce technical variation or "batch effects", which can be due to differences in sample handling, sample processing, and sequencing platforms.

In this study, we used 7 mouse liver RNA-seq datasets from NASA GeneLab to evaluate 5 common batch effect correction tools (ComBat and ComBat_seq from the sva R package, and Median Polish, Empirical Bayes, and ANOVA from the MBatch R package). We quantitatively evaluated the performance of these tools in the spaceflight context using differential gene expression analysis, BatchQC, principal component analysis, log fold change correlation, and dispersion separability criterion. We developed a standardized approach to identify the optimal correction algorithm and correction variable by geometrically probing the space of all allowable scoring functions to yield an aggregate volume-based scoring measure.
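
A minimal sketch (synthetic data, not the study's code) of how PCA can expose an additive batch effect and how a simple per-batch mean-centering correction reduces it:

```python
# Synthetic example: inspect a batch effect with PCA before and after a
# naive per-batch mean-centering correction (one of many possible approaches).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_per_batch, n_genes = 20, 500
batches = np.repeat([0, 1, 2], n_per_batch)
expr = rng.normal(0, 1, size=(batches.size, n_genes))
expr += batches[:, None] * 2.0                     # additive batch shift

def pc1_batch_spread(x):
    pcs = PCA(n_components=2).fit_transform(x)
    # Spread of batch centroids on PC1: large => PC1 is dominated by batch.
    return np.ptp([pcs[batches == b, 0].mean() for b in np.unique(batches)])

corrected = expr - np.vstack([expr[batches == b].mean(0) for b in batches])
print("before:", pc1_batch_spread(expr))
print("after: ", pc1_batch_spread(corrected))
```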

Finally, we describe the GeneLab multi-study visualization and analysis user portal which incorporates these scoring metrics to allow users to evaluate and select an optimal batch correction method and variable.

Virtual: FASEB DataWorks!: Building a culture of data sharing and reuse in biological and biomedical research
COSI: BOSC
  • Yvette Seger, Federation of American Societies for Experimental Biology (FASEB), United States
  • Emily Ruff, Federation of American Societies for Experimental Biology (FASEB), United States


Presentation Overview:

FASEB DataWorks! is pioneering a community-centered, interdisciplinary approach to advance data sharing and reuse in the biological and biomedical community. While interest in data sharing and reuse has been sparked by the forthcoming NIH Data Management and Sharing Policy (NOT-OD-21-013, January 2023), DataWorks! strives to move from compliance to culture change by providing the skills, support, space, and incentives for the research community to leverage the transformational power of data sharing and reuse on research discovery.
DataWorks! Salons, one of the four programs of the DataWorks! initiative, are community-driven conversation spaces where researchers and practitioners can exchange ideas around data sharing and reuse in biological and biomedical research. More than 300 researchers, librarians, and practitioners have participated in these conversations and shared their ideas around integrating data sharing and reuse into research contexts since the program launched in October 2021.
FASEB is championing an audacious vision for data sharing and reuse and we are excited to share our successes and learnings at BOSC.

Virtual: INTERPIN: a database for INtrinsic transcription TERminator hairPINs in bacteria
COSI: BOSC
  • Swati Gupta, IISc, India


Presentation Overview:

The prediction of transcription termination sites is important for various in vitro and in vivo experiments, including drug targeting and the annotation of genes and operons. Although prediction software exists, it is either biased toward the organisms selected for prediction or requires additional experimental data, which is complex and time-consuming to obtain at large scale. We developed the INTERPIN (INtrinsic TERmination hairPIN) database of intrinsic transcription terminator predictions in bacterial genomes. It covers 12,745 bacteria and is the largest collection of terminators to date. We have introduced cluster hairpins [1]: groups of contiguous hairpins located fewer than 14 bases from each other that cause transcription termination.
The database provides information about predicted cluster and single hairpins, along with raw files of predicted termination sites. The IGV genome viewer has been integrated to allow visualization of hairpin and operon predictions across the whole genome. Secondary and tertiary structures of any selected hairpin can be viewed and analyzed using the tools RNAComposer and iCN3D. A detailed help page explains all available features with the help of examples.

[1] S. Gupta and D. Pal, “Clusters of hairpins induce intrinsic transcription termination in bacteria,” Sci Rep, vol. 11, no. 1, p. 16194, Aug. 2021, doi: 10.1038/s41598-021-95435-3.
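
A concept sketch of the cluster-hairpin definition above, grouping predicted hairpin positions that lie fewer than 14 bases apart; the positions are hypothetical and this is not INTERPIN's code:

```python
# Concept sketch of the "cluster hairpin" definition: group predicted
# hairpins whose positions lie < 14 bases from the previous hairpin.
def cluster_hairpins(positions, max_gap=14):
    """Group sorted hairpin positions into clusters separated by < max_gap bases."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Hypothetical positions: 100 and 110 form a cluster, 500 is a single hairpin.
print(cluster_hairpins([110, 100, 500]))   # [[100, 110], [500]]
```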

Virtual: Introducing the National Research Data Infrastructure for the Research of Microbiota (NFDI4Microbiota)
COSI: BOSC
  • Barbara Götz, ZB MED - Information Centre for Life Sciences, Germany
  • Alexander Sczyrba, Bielefeld University, Germany
  • Jens Stoye, Bielefeld University, Germany
  • Peer Bork, European Molecular Biology Laboratory (EMBL), Germany
  • Manja Marz, Friedrich Schiller University Jena, Germany
  • Ulisses Nunes da Rocha, Helmholtz Centre for Environmental Research (UFZ), Germany
  • Alexander Goesmann, Justus-Liebig-University Gießen, Germany
  • Jörg Overmann, Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Germany
  • Anke Becker, Philipps-Universität Marburg, Germany
  • Thomas Clavel, RWTH Aachen University, Germany
  • Alice C. McHardy, Helmholtz Centre for Infection Research, Germany
  • Konrad U. Förstner, ZB MED - Information Centre for Life Sciences, Germany


Presentation Overview:

Microbes have a strong influence on every aspect of human life. Driven by technological advances that allow, e.g., high-throughput molecular characterization of microbial species and communities, microbiological research helps to address global health threats such as viral pandemics. These recent advances result in the generation of large data sets, yet the use and re-use of this data has so far failed to exploit its potential. NFDI4Microbiota started its activities in October 2021 together with several other NFDI (National Research Data Infrastructure) consortia in Germany. It consists of ten well-established partner institutions and is supported by five professional societies and more than 50 participants. It aims to facilitate the digital transformation of the microbiological community; the mission of the consortium is thus to support the microbiology community with access to data, analysis services, and (meta)data standards. In addition, training and community-engagement activities are offered. Furthermore, a cloud-based system will be created to make the storage, integration, and analysis of microbial data, especially omics data, consistent, reproducible, and accessible. Thereby, NFDI4Microbiota will promote the FAIR (Findable, Accessible, Interoperable and Re-usable) principles and Open Science.

Virtual: NASA GeneLab RNASeq Consensus Pipeline: A Nextflow Implementation
COSI: BOSC
  • Jonathan Oribello, NASA Ames Research Center, GeneLab, Blue Marble Space Institute of Science, United States
  • Amanda Saravia-Butler, NASA Ames Research Center, GeneLab, KBR, United States
  • Samrawit Gebre, NASA Ames Research Center, GeneLab, United States
  • Jonathan Galazka, NASA Ames Research Center, GeneLab, United States
  • Sylvain Costes, NASA Ames Research Center, GeneLab, United States


Presentation Overview:

The NASA GeneLab project (genelab.nasa.gov) seeks to accelerate space biology research through cataloging and democratizing omics data. Since raw omics data is largely inaccessible to non-bioinformaticians, GeneLab works with the scientific community to develop standard processing pipelines to generate and publish processed data. Unlike raw data, processed data has greater immediate value to a wide range of users with varying technical backgrounds and computational capabilities. Standardizing processing workflows is essential to match the pace of raw data generation, ensure reproducibility, and enable standardized processed data for comparison across datasets.
Previously, GeneLab developed a standardized pipeline for processing RNAseq data, referred to as the ‘GeneLab RNAseq Consensus Pipeline (RCP)’, in collaboration with GeneLab’s Analysis Working Groups. The work presented here is a Nextflow implementation of GeneLab’s RCP that automates and accelerates data processing of RNAseq datasets hosted on GeneLab. In addition to the core data processing, the workflow also includes staging of GeneLab raw data and a robust verification and validation (V&V) program that runs after each processing step to identify errors in real time, stop additional downstream computation, and preserve computational resources. The workflow, including the staging and V&V functionality, is open source for others to reuse and modify at https://github.com/nasa/GeneLab_Data_Processing/tree/master/RNAseq.

Virtual: Natural Language Processing to reveal genomic patterns involved in sepsis
COSI: BOSC
  • Tyrone Chen, Monash University, Australia
  • Sonika Tyagi, Monash University, Australia


Presentation Overview:

Sepsis, a bloodstream infection, has a high mortality rate and is a common comorbidity in cancer. Rising antimicrobial resistance (AMR) globally amplifies this medical condition. Recently, the rapid growth in genotyping and multi-omics datasets has provided an opportunity to investigate AMR at a higher resolution. We therefore investigated available heterogeneous datasets to study mechanisms of AMR.

A major obstacle to a unified analysis of diverse datasets is their distinct formats. To overcome this, we are refining a universal, annotation-less framework based on our previous work, which showed that it is viable to use genomic data as FASTQ sequences, k-mers, or abundance matrices to recover unannotated biomarkers and other signals of interest.

We performed an integrative analysis on matched transcriptome, proteome and metabolome profiles of sepsis-causing bacteria. Next, we recoded the genome using natural language processing techniques. Formulating this as a classification problem, we intercepted classifier weights to highlight interesting regions in genomic sequence. Overlaying this regulatory information on the functional omics layers recovered informative pathogenicity markers. Future work involves confirming these results using independent datasets. Scaling this method will allow us to identify comprehensive functional and regulatory signatures driving AMR in sepsis, and is applicable to other biological systems.
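
A small sketch of the recoding step, assuming the common k-mer "word" representation (not the authors' code): DNA sequences become sentences of overlapping k-mers that standard NLP vectorizers can consume.

```python
# Sketch of k-mer recoding: represent DNA sequences as overlapping k-mer
# "words" so standard NLP/classification tooling can operate on them.
from sklearn.feature_extraction.text import CountVectorizer

def to_kmer_sentence(seq, k=4):
    """Represent a DNA sequence as a space-separated string of overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = ["ATGCGTACGT", "ATGCGTTTTT"]          # toy sequences
docs = [to_kmer_sentence(s) for s in seqs]
X = CountVectorizer().fit_transform(docs)    # k-mer abundance matrix
print(X.toarray())
```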

Virtual: NCBI Datasets: making genomic data and metadata more accessible
COSI: BOSC
  • Mirian T. N. Tsuchiya, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Nuala O'Leary, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Eric Cox, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • J. Bradley Holmes, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Peter Meric, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Greg Schuler, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Robert Falk, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • William Anderson, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Xuan Zhang, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Vichet Hem, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Laurie Breen, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Jonathan Cothran, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Wes Ulm, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Wratko Hlavina, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Anne Ketter, National Center for Biotechnology Information (NCBI/NLM/NIH), United States
  • Valerie A. Schneider, National Center for Biotechnology Information (NCBI/NLM/NIH), United States


Presentation Overview:

NCBI Datasets is a new resource from the National Center for Biotechnology Information (NCBI) that facilitates and integrates access to genomic data and metadata as comprehensive data packages. Data packages include one or more metadata files as JSON or JSON Lines plus sequence (DNA, RNA, CDS, and protein) and annotation (GFF3) data files. Users can retrieve data through one of the following interfaces: website, command-line tool, Python package, and REST API. The NCBI Datasets command-line tool provides access to four main data types: genomes, genes, orthologs, and viruses (currently restricted to SARS-CoV-2), which users can request by accession number, taxon, or NCBI Gene ID. NCBI Datasets has a companion tool called dataformat that converts JSON Lines data reports to TSV or Excel format for easier visualization of the metadata. By facilitating data access and delivery, NCBI Datasets contributes to the NIH Comparative Genome Resource (CGR) mission, which aims to maximize the impact of eukaryotic genomic resources on biological and medical research.
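
A sketch of driving the two command-line tools from Python; the flags shown follow NCBI's documented examples as we understand them and should be treated as assumptions, as should the example accession.

```python
# Sketch: call the `datasets` and `dataformat` CLIs from Python.
# Flag names are assumptions based on NCBI's documentation and may change.
import subprocess

accession = "GCF_000001405.40"   # GRCh38 reference assembly, as an example
subprocess.run(
    ["datasets", "download", "genome", "accession", accession,
     "--include", "genome,gff3"],
    check=True,
)
# Convert the JSON Lines assembly report inside the package to TSV.
subprocess.run(
    ["dataformat", "tsv", "genome", "--package", "ncbi_dataset.zip"],
    check=True,
)
```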

Virtual: ProFeatX: a parallelized protein feature extraction suite for machine learning
COSI: BOSC
  • David Guevara, Utah State University, United States
  • Rakesh Kaundal, Utah State University, United States


Presentation Overview:

Most machine learning methods require large amounts of data, with every data point represented by a vector of the same size. However, proteins are amino acid sequences of variable length, which makes it essential to extract a fixed number of features from each protein before it can be used as input. There are numerous methods to achieve this, but only a few tools let researchers encode their proteins using multiple methods without having to use different programs or, in many cases, implement the algorithms themselves. In this work, we created ProFeatX, a tool that offers 50 diverse encodings to extract protein features efficiently. The software is implemented as a web server and can also be downloaded as a stand-alone version for local installation on desktops and high-performance computing clusters.

Availability and implementation
ProFeatX is implemented in C++ and available on GitHub at https://github.com/Davegb/profeatx. The web server is available at http://bioinfo.usu.edu/encoder.
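
For a sense of what a fixed-length encoding looks like, here is a concept sketch of one classic method, amino acid composition (AAC); this illustrates the idea rather than ProFeatX's own implementation.

```python
# Concept sketch of amino acid composition (AAC): map a variable-length
# protein to a fixed 20-dimensional frequency vector suitable for ML input.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Return the relative frequency of each of the 20 amino acids in seq."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

print(aac("MKTAYIAKQR"))   # 20 numbers summing to ~1.0
```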

Virtual: Scaling up GeneSeqToFamily to handle pan-transcriptomes and millions of genes
COSI: BOSC
  • Anil S. Thanki, Earlham Institute, United Kingdom
  • Nicola Soranzo, Earlham Institute, United Kingdom
  • Robert P. Davey, Earlham Institute, United Kingdom
  • Wilfried Haerty, Earlham Institute, United Kingdom


Presentation Overview:

The phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of gene families and characterize the selective pressures driving their evolution. We developed GeneSeqToFamily, a Galaxy workflow to identify gene families. Galaxy allows fully transparent and reproducible analyses.

The release of very large and complex data sets (in numbers of genes and species) from large genome sequencing initiatives requires improved pipelines to enable the analysis of such datasets. We ported GeneSeqToFamily to the HPC environment using Snakemake, scaling up the pipeline and adding new features such as an algorithm to split large gene trees, homology inference, and statistical report generation.

We applied these developments to 10 wheat transcriptomes encompassing over 1.4 million genes, identified 161,344 gene families, and expanded the analysis by adding diploid and tetraploid wheat varieties to study gene family dynamics.

GeneSeqToFamily is species-agnostic, and we are expanding this work to the 240+ species of the Zoonomia Project. The workflow is open source, available to install from the Galaxy ToolShed, and usable for free on UseGalaxy.eu. Source code for the Galaxy workflow and the Snakemake version is available from GitHub.

Virtual: seqSight: jointly profile microbial strains, genes, and biosynthetic gene clusters from metagenomics data
COSI: BOSC
  • Xinyang Zhang, The George Washington University, United States
  • Tyson Dawson, The George Washington University, United States
  • Keith Crandall, The George Washington University, United States
  • Ali Rahnavard, The George Washington University, United States


Presentation Overview:

Next-generation, high-throughput DNA sequencing (NGS) has enabled high-resolution metagenomic analysis of microbial communities. Bayesian reassignment-based tools for taxonomically profiling metagenomic samples have proven more accurate than their peers at characterizing microbial diversity, with relatively high true-positive and low false-negative rates. Despite this accuracy, a lack of downstream analysis functionality in such tools limits researchers' ability to generate strain-level taxonomic abundance profiles, perform gene family analyses, identify biosynthetic gene clusters, profile the host transcriptome, and visualize detailed reports of their findings. We present seqSight, an integrated bioinformatics pipeline that combines a Bayesian reassignment mapping strategy with powerful downstream functions for generating taxonomic abundance profiles, conducting gene family enrichment analyses, identifying gene clusters that produce metabolites, and visualizing data. Our approach will integrate existing discovery strategies and produce reproducible analysis pipelines that facilitate multi-omic investigation of microbial data.
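
A concept sketch (not seqSight's implementation) of EM-style Bayesian reassignment: ambiguous reads are fractionally assigned to candidate taxa in proportion to current abundance estimates, which are then re-estimated until convergence.

```python
# Concept sketch of EM-style read reassignment for multi-mapping reads.
def reassign(read_hits, n_iter=50):
    """read_hits: one list of candidate taxa per read."""
    taxa = {t for hits in read_hits for t in hits}
    abundance = {t: 1.0 / len(taxa) for t in taxa}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in taxa}
        for hits in read_hits:
            total = sum(abundance[t] for t in hits)
            for t in hits:                       # E-step: fractional assignment
                counts[t] += abundance[t] / total
        n_reads = len(read_hits)
        abundance = {t: c / n_reads for t, c in counts.items()}  # M-step
    return abundance

# Toy reads: the unique hit to "A" pulls ambiguous A/B reads toward "A".
print(reassign([["A"], ["A", "B"], ["A", "B"], ["B", "C"]]))
```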

Virtual: The OntoDev Suite of Ontology and Data Integration Tools
COSI: BOSC
  • Rebecca C. Jackson, Bend Informatics LLC, United States
  • James A. Overton, Knocean, Inc., Canada


Presentation Overview:

The OntoDev Suite (https://ontodev.com, https://github.com/ontodev) brings together modular open-source libraries and applications for ontology development and scientific data integration, with special emphasis on open science and the Open Biological and Biomedical Ontologies (OBO) community. The suite builds on the success of ROBOT to include data cleaning, ontology-driven validation, development and curation workflows, and more. We strive to make small, focused tools that work well together, but also work well with other best-in-class software, languages, and platforms. In this talk we present an overview of the suite, design principles, and future plans.

S-001: Rigorous benchmarking of HLA callers for RNA-sequencing data
COSI: BOSC
  • Ram Ayyala, Department of Translational Biomedical Informatics, Keck School of Medicine, University of Southern California, United States
  • Dottie Yu, Department of Quantitative and Computational Biology, University of Southern California, United States
  • Sergey Knyazev, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Serghei Mangul, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States


Presentation Overview:

Precise identification of alleles of the human leukocyte antigen (HLA) region of the human genome is crucial for various clinical and research applications. However, HLA typing remains challenging due to the highly polymorphic nature of the HLA loci. With Next-Generation Sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types, particularly from RNA sequencing (RNA-seq) data. Despite this development, there remains a lack of comprehensive and systematic benchmarking of RNA-seq HLA callers using large-scale and realistic gold standards. To address this limitation, we rigorously compared the performance of all 12 currently published HLA callers across 30 pairwise combinations of HLA caller and reference, based on six gold-standard datasets spanning 650 RNA-seq samples. In each case, we measured each caller's accuracy, defined as the percentage of correctly predicted alleles, at both two- and four-digit resolution. We then reported the HLA genes and alleles most prone to misprediction. Furthermore, we evaluated each caller's runtime and CPU and memory usage. This study offers crucial information for researchers and clinicians regarding appropriate choices of methods for conducting an HLA analysis.
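
A minimal sketch of the accuracy metric at two- and four-digit resolution, using hypothetical alleles; this illustrates the metric, not the study's evaluation code.

```python
# Sketch of HLA-typing accuracy: fraction of predicted alleles matching the
# gold standard when both are truncated to a given field resolution.
def truncate(allele: str, fields: int) -> str:
    """'A*01:01:01' -> 'A*01' (fields=1) or 'A*01:01' (fields=2)."""
    name, _, rest = allele.partition("*")
    return name + "*" + ":".join(rest.split(":")[:fields])

def accuracy(predicted, truth, fields=2):
    matches = sum(truncate(p, fields) == truncate(t, fields)
                  for p, t in zip(predicted, truth))
    return matches / len(truth)

pred = ["A*01:01:01", "B*08:02"]   # hypothetical calls
gold = ["A*01:01:02", "B*08:01"]   # hypothetical truth
print(accuracy(pred, gold, fields=1))   # 1.0 at two-digit resolution
print(accuracy(pred, gold, fields=2))   # 0.5 at four-digit resolution
```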

S-002: Annotating VCFs using the GA4GH Variation Representation Specification
COSI: BOSC
  • Wesley Goar, Nationwide Children's Hospital, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Kori Kuzma, Nationwide Children's Hospital, United States
  • Kathryn Stahl, Nationwide Children's Hospital, United States
  • Kyle Ferriter, Broad Institute, United States
  • Andreas Prlic, Invitae, United States
  • Lawrence Babb, Broad Institute, United States
  • Reece Hart, MyOme, United States
  • Alex Wagner, Nationwide Children's Hospital, United States


Presentation Overview:

Variant interpretation is a challenging and time-intensive endeavor due in part to the many ways that the same variation is represented across different genomic data resources. When annotating a molecular variant, it is important to ensure that all relevant information for clinical interpretation of a specific variant is interrogated; failure to do so may lead to inaccurate, inconsistent, or incomplete interpretations of variants in a clinical setting. The Variation Representation Specification (VRS; vrs.ga4gh.org) is a variation data exchange standard from the Global Alliance for Genomics and Health (GA4GH; ga4gh.org) designed to improve interoperability between genomic knowledge resources. We have developed open-source software (https://github.com/ga4gh/vrs-python) to annotate alleles within VCF files with normalized VRS identifiers and to generate VRS identifier-to-object mappings for reverse lookup. We demonstrate the application of this tool to 7,829,075 alleles from the Genome in a Bottle dataset, renormalizing and annotating an average of 1,361 alleles per second. We also describe a Docker service for applying this tool in cloud-based workflows. This work is a significant advancement in the suite of tools necessary for high-throughput and high-accuracy genomic knowledge annotation of variants.
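
A sketch of building an allele and computing its identifier with VRS-Python, following the library's published VRS 1.x notebook examples; class and field names may differ between releases, and the sequence digest below is a placeholder.

```python
# Sketch based on vrs-python's VRS 1.x examples; exact model classes and
# required fields may differ in other releases of the library.
from ga4gh.core import ga4gh_identify
from ga4gh.vrs import models

allele = models.Allele(
    location=models.SequenceLocation(
        sequence_id="ga4gh:SQ.0000000000000000000000000000000X",  # placeholder digest
        interval=models.SequenceInterval(
            start=models.Number(value=140753335, type="Number"),
            end=models.Number(value=140753336, type="Number"),
            type="SequenceInterval",
        ),
        type="SequenceLocation",
    ),
    state=models.LiteralSequenceExpression(sequence="T",
                                           type="LiteralSequenceExpression"),
    type="Allele",
)
print(ga4gh_identify(allele))   # globally unique computed identifier, "ga4gh:VA...."
```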

S-003: Identifying Features that Maximize Engagement in Crowdsourced Challenges
COSI: BOSC
  • Rongrong Chai, Sage Bionetworks, United States
  • Verena Chung, Sage Bionetworks, United States
  • Thomas Yu, Sage Bionetworks, United States
  • Amber Nelson, Sage Bionetworks, United States
  • Ezekiel J. Maier, Booz Allen Hamilton (precisionFDA), United States
  • Julie Bletz, Sage Bionetworks, United States
  • Jake Albrecht, Sage Bionetworks, United States
  • Jineta Banerjee, Sage Bionetworks, United States


Presentation Overview:

The popularity of machine learning is steadily growing in the biomedical sciences, facilitated by the exponential growth in high-throughput data-generation technologies. The complexity of biomedical data mandates specialized algorithm development. Independent benchmarking and open distribution of algorithms are key to preventing developer bias, poor generalizability, and low reusability.

Crowdsourced competitions, or Challenges, provide a compelling mechanism for independent benchmarking. Organizing and running effective challenges with biomedical data requires domain experts, sponsors, data contributors, technical experts, and others. To optimize challenges and ensure maximal impact in computational biology, we must identify key features that correlate with high participant engagement and innovative development of generalizable algorithms. In this study, we examined metadata from multiple challenge platforms to identify features that correlate strongly with impactful challenges. Metrics associated with participant engagement, number of participants, and number of final submissions were used to define challenge effectiveness. Unsupervised clustering was performed on variables including prizes (monetary, publication, or speaking incentives), submission format, challenge sponsor, data contributor, data types, and others. Our preliminary analysis of DREAM (Dialogue on Reverse Engineering Assessment and Methods) challenges suggests that the nature of the organizers and the type of biomedical data correlate strongly with high participant engagement and novel algorithm generation.
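
A minimal sketch of the unsupervised step on synthetic metadata (not the study's data or code), one-hot encoding categorical challenge features before clustering:

```python
# Sketch: one-hot encode categorical challenge metadata, then cluster.
import pandas as pd
from sklearn.cluster import KMeans

challenges = pd.DataFrame({            # hypothetical challenge metadata
    "prize":      ["monetary", "publication", "monetary", "speaking"],
    "data_type":  ["imaging", "genomic", "genomic", "imaging"],
    "submission": ["model", "predictions", "predictions", "model"],
})
X = pd.get_dummies(challenges)         # categorical -> indicator columns
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                          # cluster assignment per challenge
```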

S-004: Translating Human Readable Variation Descriptions to Unique Computable Variations with the Variation Normalizer
COSI: BOSC
  • Kori Kuzma, Nationwide Children's Hospital, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Jiachen Liu, Dana-Farber Cancer Institute, United States
  • Kyle Ferriter, Broad Institute of MIT and Harvard, United States
  • Adam Coffman, Washington University School of Medicine, United States
  • Obi Griffith, Washington University School of Medicine, United States
  • Malachi Griffith, Washington University School of Medicine, United States
  • Jason Walker, Washington University School of Medicine, United States
  • Lawrence Babb, Broad Institute of MIT and Harvard, United States
  • Xuelu Liu, Dana-Farber Cancer Institute, United States
  • Alex Wagner, Nationwide Children's Hospital, United States


Presentation Overview:

Translating between different formats and centralized authoritative identifiers for a single variation can make it difficult to exchange variation data between systems and can create errors or inefficiencies in the evaluation of variation in clinical settings.

These issues are addressed through the use of the Variation Representation Specification (VRS; vrs.ga4gh.org) and VRS Added Tools for Interoperable Loquacious Exchange (VRSATILE; vrsatile.readthedocs.io), computational exchange standards from GA4GH (ga4gh.org). The VRS models, globally unique computed identifiers, and fully justified normalization algorithm provide a standard, computable mechanism for precise and unambiguous representation of molecular and systemic variation. The VRSATILE framework supports the practical exchange of VRS variation data.

The Variation Normalizer (https://github.com/cancervariants/variation-normalization), an open-source Python package and REST API, translates human-readable variation descriptions to VRS- and VRSATILE-compatible representations. It supports free text, gnomAD VCF, and HGVS expressions on protein, coding DNA, or genomic DNA reference sequences. For example, when querying “BRAF Val640Glu” or “NP_004324.2:p.Val600Glu”, the service returns the same VRS variation, since each represents the same variation. The output also includes additional data about the variation, including the gene context, structural variant type, and reference allele sequence.
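
A sketch of querying a running Variation Normalizer service from Python; the base URL, endpoint path, and parameter name are assumptions based on the public instance's documentation and may differ.

```python
# Sketch: query a Variation Normalizer REST service. The endpoint path and
# query parameter below are assumptions; consult the service's API docs.
import requests

BASE = "https://normalize.cancervariants.org/variation"   # assumed public instance
resp = requests.get(f"{BASE}/normalize", params={"q": "BRAF V600E"})
resp.raise_for_status()
print(resp.json())   # VRSATILE-compatible variation descriptor
```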

S-005: FUSOR and the VICC Fusion Curation Interface: Tools for the Structured Representation of Gene Fusions
COSI: BOSC
  • Jeremy Arbesfeld, The Ohio State University, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Kori Kuzma, Nationwide Children's Hospital, United States
  • Colin O'Sullivan, Nationwide Children's Hospital, United States
  • Stephanie LaHaye, Tempus Labs, United States
  • James Fitch, Nationwide Children's Hospital, United States
  • Andrea Sboner, Weill Cornell Medicine, United States
  • Scott Myrand, Thermo Fisher Scientific, United States
  • Ioannis Vlachos, Harvard Medical School, United States
  • Peter White, Nationwide Children's Hospital, United States
  • Alex Wagner, Nationwide Children's Hospital, United States


Presentation Overview:

Due to inconsistencies in the ways gene fusions are described in clinical practice and the biomedical literature, the ability to integrate the functional and clinical significance of fusion events into patient care is limited. With the aim of developing recommendations for the standardized representation of gene fusions, experts from the Variant Interpretation for Cancer Consortium (VICC), in consultation with members of the Cancer Genomics Consortium, ClinGen, and the CAP/ACMG Joint Cytogenetics Committee, collaborated to create both a unified framework for the description of fusion events and a set of associated computational tools.

We will present FUSOR (Fusion Object Representation), a Python-based software development kit of modeling and validation tools for the computational representation of fusion events. We demonstrate FUSOR’s ability to convert output from commonly-used fusion detection algorithms such as CICERO and JAFFA into a machine-readable format. We also introduce the VICC Fusion Curation Interface, a web-based service that allows users to translate data describing fusion events into syntax that aligns with the proposed VICC gene fusion nomenclature system. These tools address the need for services that automate the curation of gene fusions, improving precision in the computational translation of gene fusions to clinical care.
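
As a concept sketch of structured fusion representation (deliberately not FUSOR's actual API), a fusion can be modeled as ordered transcript segments with basic validation; the exon numbers below are illustrative only.

```python
# Concept sketch: a fusion as an ordered list of 5'->3' transcript segments.
from dataclasses import dataclass

@dataclass
class TranscriptSegment:
    gene: str
    transcript: str
    exon_start: int
    exon_end: int

@dataclass
class GeneFusion:
    segments: list      # ordered 5' -> 3' components

    def __post_init__(self):
        if len(self.segments) < 2:
            raise ValueError("a fusion needs at least two components")

# Illustrative EWSR1::FLI1 example; exon boundaries are hypothetical.
ewsr1_fli1 = GeneFusion([
    TranscriptSegment("EWSR1", "NM_005243.4", 1, 7),
    TranscriptSegment("FLI1", "NM_002017.4", 6, 9),
])
print(ewsr1_fli1)
```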

S-006: pyTCR: a comprehensive and scalable platform for TCR-Seq data analysis to facilitate reproducibility and rigor of immunogenomics research
COSI: BOSC
  • Kerui Peng, University of Southern California, United States
  • Jaden Moore, Orange Coast College, United States
  • Serghei Mangul, University of Southern California, United States


Presentation Overview:

T cell receptor (TCR) studies have grown exponentially with advances in T cell receptor repertoire sequencing (TCR-Seq) techniques. Analyzing TCR-Seq data requires computational skills to run analysis tools; however, biomedical researchers with limited computational backgrounds face multiple obstacles to properly and efficiently utilizing bioinformatics tools for analyzing TCR-Seq data. Here we report pyTCR, a computational platform for comprehensive and scalable TCR-Seq data analysis that facilitates the reproducibility and rigor of immunogenomics research through transparency and version control. Our platform offers functionality including TCR metric calculations, statistical analysis, and customizable visualizations, which is richer than existing tools, and it enables effective and flexible analysis of large-scale TCR-Seq data. Additionally, the platform is based on computational notebooks, which have proven to be user-friendly and suitable for researchers with limited computational skills. We expect the computational notebook-based platform to be adopted by the broad biomedical community, as it carries benefits that are superior or comparable to R packages.
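
A concept sketch (not pyTCR's code) of two common repertoire metrics, clonotype frequency and Shannon diversity, computed with pandas on hypothetical CDR3 sequences:

```python
# Sketch of basic TCR repertoire metrics from a table of reads.
import math
import pandas as pd

reads = pd.DataFrame({"cdr3": ["CASSLGTDTQYF", "CASSLGTDTQYF", "CASRPGLAGGRPEQYF"]})
freqs = reads["cdr3"].value_counts(normalize=True)   # clonotype frequencies
shannon = -sum(p * math.log(p) for p in freqs)       # Shannon diversity
print(freqs.to_dict(), round(shannon, 3))
```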

S-007: CWL toolkit for single-cell sequencing data analysis
COSI: BOSC
  • Michael Kotliar, Cincinnati Children's Hospital Medical Center, United States
  • Andrey Kartashov, Cincinnati Children's Hospital Medical Center, United States
  • Artem Barski, Cincinnati Children's Hospital Medical Center, United States


Presentation Overview:

Poor reproducibility and limited portability of data analysis methods and techniques present critical challenges to the progress of data-intensive biomedical research. These become particularly important in single-cell sequencing data analysis. To overcome these problems we develop all our pipelines using the Common Workflow Language (CWL), an open standard that describes tools and pipelines as YAML or JSON structured linked-data documents. CWL supports execution in an isolated runtime environment (Docker/Singularity), thus guaranteeing reproducibility of results.
Here we present a set of CWL tools for the analysis of scRNA, scATAC, and Multiome sequencing data. Together they cover the most common tasks in this research area, including removal of low-quality cells, data integration, batch effect correction, clustering, and cell type assignment. These tools can be chained together to form various workflows for each particular research application. As an example of a successful workflow application, we analyzed scRNA-Seq data from control and Nsdhl-deficient mice to study the influence of tumor-intrinsic cholesterol biosynthesis on pancreatic carcinogenesis (Surumbayeva, Kotliar et al., 2021). Having all of the pipelines in the CWL format allows us to run them in any CWL-based execution environment, optionally scaling the processing from a single computer to an HPC cluster.

S-008: Yevis: System to support building a workflow registry with automated quality control
COSI: BOSC
  • Hirotaka Suetake, The University of Tokyo, Japan
  • Tazro Ohta, Database Center for Life Science, Japan


Presentation Overview:

As workflow systems become more popular, many efforts have been made to share workflows, for example workflow-sharing registries such as WorkflowHub and nf-core, and standard protocols for sharing such as the GA4GH TRS API. These workflow registries collect generic workflows and maintain them through community efforts. However, because community resources are limited, there is less incentive to collect and maintain specific workflows, for example workflows for specific species. In addition, the expert community with domain knowledge of those workflows may lack the human resources and engineering skills to build its own workflow registry. Therefore, sustainable development and sharing of all workflows with quality control remain challenging. To make it easier to build and maintain workflow registries, we developed Yevis, a system for building a workflow registry that automatically maintains its workflows. Yevis uses only the GitHub and Zenodo services and provides a GA4GH TRS-compatible API and web UI. As a use case, we also built the DDBJ workflow registry, which contains workflows that can be executed in the DDBJ WES. We believe that if each community follows a standard definition such as the TRS API and provides well-maintained workflows, users will benefit from the diversity.
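
A sketch of consuming such a registry from Python: the /tools route comes from the GA4GH TRS v2 specification, while the base URL below is a placeholder, not a real Yevis deployment.

```python
# Sketch: list workflows from a GA4GH TRS v2-compatible registry.
import requests

BASE = "https://example.org/ga4gh/trs/v2"   # placeholder registry URL
tools = requests.get(f"{BASE}/tools").json()
for tool in tools:
    # Field names follow the TRS v2 Tool model; .get() tolerates variants.
    print(tool.get("id"), tool.get("name"))
```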

S-009: Creating Cloud-based Notebooks for Exploring the GA4GH VRS Standard
COSI: BOSC
  • Kathryn Stahl, Nationwide Children's Hospital, United States
  • Wesley Goar, Nationwide Children's Hospital, United States
  • Kori Kuzma, Nationwide Children's Hospital, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Kyle Ferriter, Broad Institute, United States
  • Andreas Prlić, Invitae, United States
  • Lawrence Babb, Broad Institute, United States
  • Reece Hart, MyOme, United States
  • Alex Wagner, Nationwide Children's Hospital, United States


Presentation Overview:

The Variation Representation Specification (VRS; vrs.ga4gh.org) is a standard of the Global Alliance for Genomics and Health (GA4GH; ga4gh.org) for the computable and expressive exchange of biomolecular variation. VRS-Python (github.com/ga4gh/vrs-python) is an open-source Python package that implements the models defined in VRS, provides an algorithm for generating globally unique computed identifiers, and supports translation of VRS variation models to and from other common variant representations such as HGVS, VCF, and SPDI.

As VRS is a recently published standard, reference implementations and computational notebooks serve as useful introductory tools for learning the specification through hands-on application. The VRS-Python repository provides several Jupyter notebooks demonstrating the features and capabilities of the package (github.com/ga4gh/vrs-python/tree/main/notebooks).

To reduce barriers to entry for VRS, we have developed cloud-based VRS-Python notebooks to educate potential adopters. These notebooks are publicly accessible, require no local installation, and walk the user through constructing variants and translating them to VRS from other variation formats through a web browser interface. Users can easily run the notebooks and write their own VRS-Python code as they explore. The notebooks are accessible online at go.osu.edu/vrs-cloud-nb.

S-010: wdl2cwl: Converting WDL workflows to CWL
COSI: BOSC
  • Dennis Chukwunta, Imo State University, Nigeria
  • Dinithi Wickramaratne, University of Colombo, Sri Lanka
  • Bruno Kinoshita, Curii Corporation, New Zealand
  • Michael Crusoe, ELIXIR-NL; VU Amsterdam; CWL Project, Germany


Presentation Overview:

Computational analysis workflows are descriptions that link together steps to enable abstraction, scaling, automation, and provenance features. Workflow Description Language (WDL) and Common Workflow Language (CWL) are high-level workflow coordination languages that can define workflows made up of command-line tools.

WDL workflows are not executable by themselves and require an execution engine (primarily Cromwell on the Terra platform). In contrast to WDL, CWL can be run on a larger number of systems. However, some workflows that are important in bioinformatics are maintained only in WDL. For example, the GATK workflows used to analyze high-throughput sequencing data for variant discovery are only available from the GATK maintainers in the WDL format.

Therefore, the wdl2cwl converter was created to efficiently convert workflows written in WDL to equivalent versions using the CWL standard. wdl2cwl can help WDL users avoid platform dependency issues, such as trying to reproduce a WDL workflow on a platform that only supports CWL, and can help analysts by eliminating the need to learn a new workflow language before adapting a WDL workflow.
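
A sketch of invoking the converter from Python; the command-line interface shown (one WDL file in, CWL on stdout) is an assumption based on the project's README, and the input filename is hypothetical.

```python
# Sketch: shell out to the wdl2cwl command-line entry point.
# CLI behavior (CWL written to stdout) is an assumption; check the README.
import subprocess

result = subprocess.run(
    ["wdl2cwl", "haplotypecaller.wdl"],   # hypothetical input workflow
    capture_output=True, text=True, check=True,
)
with open("haplotypecaller.cwl", "w") as fh:
    fh.write(result.stdout)               # converted CWL document
```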

S-011: Using JBrowse 2 plugins to visualize genomics and other biological data
COSI: BOSC
  • Caroline Bridge, Ontario Institute for Cancer Research, Canada
  • Rob Buels, University of California, Berkeley, United States
  • Colin Diesh, University of California, Berkeley, United States
  • Garrett Stevens, University of California, Berkeley, United States
  • Peter Xie, University of California, Berkeley, United States
  • Teresa Martinez, University of California, Berkeley, United States
  • Elliot Hershberg, University of California, Berkeley, United States
  • Junjun Zhang, Ontario Institute for Cancer Research, Canada
  • Shihab Dider, University of California, Berkeley, United States
  • Scott Cain, Ontario Institute for Cancer Research, United States
  • Robin Haw, Ontario Institute for Cancer Research, Canada
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Ian Holmes, University of California, Berkeley, United States


Presentation Overview:

Genome browsers are essential tools for integrating, exploring, and visualizing data from large genomics datasets. JBrowse 2 is a modern genome browser written in JavaScript with novel features for visualizing structural variants, syntenic alignments, and other biological data. To facilitate academic collaboration and innovation, JBrowse 2 has been designed with a pluggable interface that permits the development of features tailored to the specific needs of an individual or organization. We launched the JBrowse 2 Plugin Store to provide community-developed plugins to all users and to expose crucial feature sets to researchers. The plugin system spans all aspects of the JBrowse 2 application and enables the development of new track types (e.g. Manhattan plots, Hi-C data, splice junctions), data adapters (e.g. API endpoint adapters for the NCI GDC and the ICGC), and views (e.g. dot plots, multiple sequence alignments, and ideograms). Plugin development is made simple by the availability of a plugin template, numerous existing contributions to reference, and extensive documentation on the JBrowse 2 website to assist developers in getting their plugins running. Here, we present the capabilities of some JBrowse 2 plugins, describe usage scenarios for biologists and bioinformaticians, and outline JBrowse 2's design patterns and tools for software developers.

S-012: Containers for increased accessibility and reproducibility of FreeSurfer's infant pipeline
COSI: BOSC
  • Paul Wighton, Martinos Center for Biomedical Imaging at MGH, United States
  • Nathan Xi Ngo, Martinos Center for Biomedical Imaging at MGH, United States
  • Maitreyee Mangesh Kulkarni, Martinos Center for Biomedical Imaging at MGH, United States
  • Andrew Hoopes, Martinos Center for Biomedical Imaging at MGH, United States
  • Andre van der Kouwe, Martinos Center for Biomedical Imaging at MGH, United States
  • Lilla Zöllei, Martinos Center for Biomedical Imaging at MGH, United States


Presentation Overview:

FreeSurfer is a set of freely available, open-source algorithms for the structural and functional analysis of MRI neuroimaging data. FreeSurfer's infant analysis pipeline provides capabilities for both volumetric and surface-based morphological analysis of subjects aged 0 to 24 months. This has, to date, been an under-represented population amongst the major neuroimaging software packages.

To increase the accessibility of FreeSurfer's infant pipeline and to enhance its reproducibility, we have developed a set of container-based methods to build and execute the pipeline as well as visually inspect its results. We do so by leveraging the existing tools Neurodocker and Neurodesktop.

Neurodocker is a ""command-line program that generates custom Dockerfiles and Singularity recipes for neuroimaging"". We have extended Neurodocker to support creating containers from source. This facilitates the creation of a continuous integration workflow, allowing every change to the source code to be automatically tested.

The outputs of FreeSurfer's infant pipeline can be visually inspected using FreeSurfer’s visualization tool FreeView. We use Neurodesktop for a convenient container-based way to access FreeView via a web browser. Neurodesktop is a “plug-and-play, browser-accessible, containerized data analysis environment” for neuroimaging.

S-013: Gender-based disparities and biases in science: observational study of a virtual conference
COSI: BOSC
  • Junhanlu Zhang, Institut Pasteur, France
  • Rachel Torchet, Institut Pasteur, France
  • Hanna Julienne, Institut Pasteur, France


Presentation Overview:

Most scientists would agree that success in science should be determined solely by merit. However, success in STEM fields is still profoundly influenced by other factors such as race, gender, and socioeconomic status. Numerous studies have documented gender bias throughout the publication process: women publish less than men, are less likely to be in the first position among authors who contributed equally, and are cited less than men. Gender disparities are also noticeable in less externally constrained behaviours, such as the number of questions asked at scientific conferences.

As an interdisciplinary team with diverse skillsets (anthropology, statistics, and UX design), we observed question-asking behaviour by gender during the 2021 JOBIM virtual conference. We gathered quantitative and qualitative data including a registration survey with detailed demographic information, a post-conference survey on question-asking motivations, live observations, and in-depth interviews of participants. Quantitative analysis highlighted several new findings, such as an important fraction of the audience identifying as LGBTQIA+ and increased attendance by women at virtual JOBIM conferences compared with in-person conferences. Notably, the observations revealed a persisting underrepresentation of questions asked by women. Interviews of participants highlighted several barriers to oral expression encountered by gender minorities in STEM.

S-014: GenePlexus: A web server and Python package for gene discovery using network-based machine learning
COSI: BOSC
  • Christopher Mancuso, Michigan State University, United States
  • Renming Liu, Michigan State University, United States
  • Patrick Bills, Michigan State University, United States
  • Douglas Krum, Michigan State University, United States
  • Jacob Newsted, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States


Presentation Overview:

Biomedical researchers take advantage of high-throughput, high-coverage technologies to routinely generate sets of genes of interest across a wide range of biological conditions. Although these technologies have directly shed light on the molecular underpinnings of various biological processes and diseases, the list of genes from any individual experiment is often noisy and incomplete. Additionally, interpreting these lists of genes can be challenging in terms of how they are related to each other and to other genes in the genome. In this work, we present open-source software, as both a web server (https://www.geneplexus.net/) and Python package (https://pypi.org/project/geneplexus/), that allows a researcher to utilize a powerful, network-based machine learning method to gain insights into their gene set of interest and additional functionally similar genes. Once a user uploads their own set of genes and chooses among a number of different network representations, GenePlexus provides predictions of how associated every gene in the network is with the input set. The web server and Python package also provide interpretability through network visualization and comparison to other machine learning models trained on thousands of known process/pathway and disease gene sets.
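
A concept sketch of network-based gene prioritization in the same spirit (deliberately not the geneplexus package's API), using a toy random network and scikit-learn:

```python
# Concept sketch: train a model on network adjacency rows, with the user's
# gene set as positives, then rank all genes by predicted association.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_genes = 100
adjacency = (rng.random((n_genes, n_genes)) < 0.05).astype(float)
adjacency = np.maximum(adjacency, adjacency.T)       # symmetric toy network

input_set = [0, 1, 2, 3, 4]                          # user's genes of interest
y = np.zeros(n_genes)
y[input_set] = 1
model = LogisticRegression(max_iter=1000).fit(adjacency, y)
scores = model.predict_proba(adjacency)[:, 1]        # association with input set
print(np.argsort(scores)[::-1][:10])                 # top-ranked genes
```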

S-015: ElasticBLAST: accelerating alignments in the cloud
COSI: BOSC
  • Christiam Camacho, National Library of Medicine, NIH, United States
  • Grzegorz Boratyn, National Library of Medicine, NIH, United States
  • Victor Joukov, National Library of Medicine, NIH, United States
  • Thomas Madden, National Library of Medicine, NIH, United States


Presentation Overview:

The cloud is an appealing platform for bioinformaticians. They can bring up compute infrastructure on demand and at the scale they require, and they can access voluminous amounts of open-access data hosted in cloud environments. To make use of the cloud, it is valuable to have cloud-native implementations of important bioinformatics packages.

We discuss ElasticBLAST, a cloud-native package that produces alignments with the Basic Local Alignment Search Tool (BLAST). Built on top of the stand-alone BLAST+ command-line package, ElasticBLAST works with a range of query inputs, handling anything from a few to millions of query sequences. ElasticBLAST instantiates cloud instances, dispatches work to them, and deletes resources when it is done. ElasticBLAST is supported on Amazon Web Services (AWS) and Google Cloud Platform (GCP). We discuss the implementation of ElasticBLAST, show usage examples, and discuss performance.
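
A sketch of a submission invoked from Python; the flags follow the documented `elastic-blast submit` usage as we understand it, exact option names may evolve, and the results bucket is hypothetical.

```python
# Sketch: submit an ElasticBLAST search. Requires elastic-blast installed
# and cloud credentials configured; flag names are assumptions.
import subprocess

subprocess.run(
    ["elastic-blast", "submit",
     "--program", "blastp",
     "--db", "nr",
     "--query", "queries.fasta",
     "--results", "s3://my-bucket/elastic-blast-results",  # hypothetical bucket
     "--num-nodes", "4"],
    check=True,
)
```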

In the last year, several updates have been made to ElasticBLAST. First, it automatically selects an instance type for the BLAST runs based on the database. Second, it shuts down cloud resources at the end of the run. Finally, ElasticBLAST throughput has been improved.

The source code and documentation for ElasticBLAST are freely available on GitHub.

S-016: Making biomedical research software FAIR with FAIRshare
COSI: BOSC
  • Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States


Presentation Overview: Show

We present here FAIRshare, an open-source and free (MIT license) cross-platform desktop application that helps biomedical researchers make their research software Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the FAIR for Research Software (FAIR4RS) guiding principles. The FAIR4RS principles, established by the FAIR4RS Working Group, provide a foundation for optimizing the reusability of research software and encourage open science. The FAIR4RS principles are, however, aspirational. Practical guidelines that biomedical researchers can easily follow to make their research software FAIR are still lacking. To fill this gap, we established the first minimal and actionable step-by-step guidelines for researchers to make their biomedical research software FAIR. We designate these guidelines the FAIR Biomedical Research Software (FAIR-BioRS) guidelines. FAIRshare walks users step by step through implementing the FAIR-BioRS guidelines for their research software (including metadata such as codemeta.json, choosing a license, preferably open source, sharing on a suitable repository such as Zenodo, etc.). The process is streamlined through an intuitive graphical user interface and automation that minimize researchers' time and effort. We believe that the FAIR-BioRS guidelines and FAIRshare will empower and encourage biomedical researchers to adopt FAIR and open practices for their research software.
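
To give a flavor of the software metadata involved, here is a minimal codemeta.json written by hand from Python. This is illustrative only, not FAIRshare's actual output; the tool name, description, repository URL, and author are hypothetical, and a real record would carry more fields.

    import json

    # A minimal CodeMeta record for a hypothetical research tool.
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": "mytool",                                        # hypothetical
        "description": "Example biomedical analysis tool.",      # hypothetical
        "license": "https://spdx.org/licenses/MIT",
        "codeRepository": "https://github.com/example/mytool",   # hypothetical
        "author": [{"@type": "Person", "givenName": "Ada", "familyName": "Example"}],
    }

    with open("codemeta.json", "w") as fh:
        json.dump(codemeta, fh, indent=2)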

S-017: HistoLens identifies distinct patterns of podocyte injury in HIV-transgenic mice
COSI: BOSC
  • Samuel Border, The State University of New York at Buffalo, United States
  • Teruhiko Yoshida, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Jeffrey Kopp, National Institute of Diabetes and Digestive and Kidney Diseases, United States
  • Avi Rosenberg, Johns Hopkins Medical Institute, United States
  • Brandon Ginley, The State University of New York at Buffalo, United States
  • Pinaki Sarder, The State University of New York at Buffalo, United States


Presentation Overview: Show

HIV-associated nephropathy (HIVAN) is a major cause of end-stage renal disease (ESRD), particularly for individuals of African descent. Histological features of HIVAN include mesangial cell proliferation, collapsing focal segmental glomerulosclerosis, and podocyte injury. Podocytes, terminally differentiated epithelial cells typically found on the outer surfaces of glomerular capillaries, play a critical role in the filtration of proteins from the bloodstream. Assessing the extent of podocyte injury across several whole slide images (WSIs) is an exceptionally difficult task, even for experienced pathologists. Computational approaches that automate diagnosis have been developed and perform competitively with human observers; however, they are largely unable to provide meaningful interpretation of the factors that contribute to their decisions. HistoLens, a graphical user interface (GUI), enables pathologists to extract a large number of quantitative image features, tied to specific biological sub-compartments, from annotated WSIs. Using HistoLens, we identified key underlying factors associated with podocytes in Tg26-transgenic (HIVAN) compared with wild-type (WT) mice, including podocyte thickness, distribution, and color heterogeneity. Using these features, we designed a model to classify glomeruli that achieved an accuracy of 73% using only a subset of podocyte features.
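
The downstream classification step can be pictured as ordinary supervised learning on the exported feature table. The sketch below is illustrative only, not the authors' model; the feature names and the CSV file are hypothetical stand-ins for a HistoLens export.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical per-glomerulus feature table exported from a GUI like
    # HistoLens: podocyte feature columns plus a Tg26-vs-WT label column.
    df = pd.read_csv("glomerulus_features.csv")  # hypothetical file
    X = df[["podocyte_thickness", "podocyte_distribution", "color_heterogeneity"]]
    y = df["is_tg26"]

    # Hold out a test split and fit a simple interpretable classifier.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))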

S-018: KG-OBO: Open Bio-Ontologies in Knowledge Graph Form
COSI: BOSC
  • Justin Reese, Berkeley Bioinformatics Open-source Projects, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Chris Mungall, Berkeley Bioinformatics Open-source Projects, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Harry Caufield, Berkeley Bioinformatics Open-source Projects, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States


Presentation Overview: Show

Knowledge graphs (KGs) are representations of entities and their multifaceted relationships. An ongoing challenge in learning from KGs in biology and biomedicine is bridging the gap between real-world observations and conceptual knowledge. Though numerous bio-ontologies address this need, none can be directly added to a KG without significant effort.

Past efforts in aligning instance data to ontologies led to the creation of the OBO Foundry, an open resource for standardized biological ontologies. We developed KG-OBO to allow the community to rapidly integrate OBO Foundry ontologies with biological KGs. KG-OBO translates OBOs into easily parsed KGX TSV graphs aligned with the Biolink Model, then uploads all graphs to a public repository. Users may merge one or more ontology graphs as needed; for example, combining CHEBI with a KG of protein-chemical interactions allows chemicals to be grouped hierarchically. The added context can also provide further training input for graph machine learning models.

The KG-OBO code, graphs, and infrastructure drive a community of knowledge engineers seeking answers to biomedical questions in KGs, including the broader OBO community. We anticipate that continued interest in learning from KGs will require easy access to the comprehensive knowledge within bio-ontologies, and KG-OBO fills this need.
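
Because KGX TSV graphs are plain tab-separated node and edge tables, merging an ontology into an existing KG can be as simple as concatenating data frames. A minimal sketch follows; the file names are hypothetical, and the columns shown are the usual KGX core edge columns rather than the full format.

    import pandas as pd

    # KGX TSV graphs ship as node and edge tables; core edge columns are
    # subject / predicate / object (node tables carry id / category / name).
    kg_edges = pd.read_csv("my_kg_edges.tsv", sep="\t")         # hypothetical file
    chebi_edges = pd.read_csv("chebi_kgx_edges.tsv", sep="\t")  # hypothetical file

    # Naive merge: stack the edge tables and drop exact duplicates.
    merged = pd.concat([kg_edges, chebi_edges], ignore_index=True)
    merged = merged.drop_duplicates(subset=["subject", "predicate", "object"])
    merged.to_csv("merged_edges.tsv", sep="\t", index=False)

    # With the ontology's subclass edges in place, chemicals in the KG can be
    # grouped by walking up the CHEBI hierarchy.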

S-019: Scientific Workflow and Data Management with the Arvados Platform
COSI: BOSC
  • Peter Amstutz, Curii Corporation, United States
  • Tom Clegg, Curii Corporation, United States
  • Lucas Di Pentima, Curii Corporation, United States
  • Stephen Smith, Curii Corporation, United States
  • Ward Vandewege, Curii Corporation, United States
  • Sarah Zaranek, Curii Corporation, United States
  • Alexander Sasha Wait Zaranek, Curii Corporation, United States


Presentation Overview: Show

Reproducibility benefits greatly from robust workflow management. The open-source platform Arvados integrates a data management system called “Keep” and a compute management system called “Crunch”, creating a unified environment in which to store and organize data and to run Common Workflow Language workflows on that data. Arvados is multi-user and multi-platform, running on various cloud and high-performance computing environments. Arvados management features include the ability to (1) identify the origin and verify the content of every dataset, track every workflow run, and reliably reproduce any output; (2) organize and search for datasets using metadata; (3) securely and selectively share data and workflows; (4) efficiently manage data (minimizing storage costs); and (5) efficiently rerun workflows (minimizing time and compute costs).

S-020: A cloud based international community effort for reproducible benchmarking of genomic tools
COSI: BOSC
  • Matthew Gazzara, University of Pennsylvania, United States
  • Farica Zhuang, University of Pennsylvania, United States
  • Yoseph Barash, University of Pennsylvania, United States


Presentation Overview: Show

The number of published bioinformatics tools, each with its own strengths and weaknesses, is constantly growing. Given the sheer number of available tools, it is difficult to assess which tool is best for a specific purpose. While benchmarking papers are published to address this need, such papers are in many cases hard to reproduce and extend. Here, we introduce an open-source framework that (1) automates the execution of the evaluated tools for the convenience of users, (2) benchmarks the tools on various metrics, and (3) visualizes the results on the OpenEBench online platform so that tools can be compared easily. This framework was developed as part of APAeval, an international community effort to benchmark RNA-seq-based alternative polyadenylation (APA) analysis tools. APA tools are assessed on their ability to identify, quantify, and calculate differential usage of APA sites. The framework is not constrained to APA tools, but can also be applied to automate and benchmark other bioinformatics tools. We hope that the framework we have created can serve as an inspiration for open-source, community efforts to automate and benchmark tools and so facilitate a range of downstream analyses.

S-021: Deciphering a mystery workflow written in WDL
COSI: BOSC
  • Geraldine Van der Auwera, Broad Institute, United States


Presentation Overview: Show

This talk presents a practical methodology for elucidating the structure and function of a workflow written in the Workflow Description Language (WDL), a domain-specific language for describing data processing and analysis workflows.

WDL is increasingly used by large consortia in genomics and related fields for creating standardized workflows that are portable across execution platforms. Bioinformaticians are likely to encounter WDL workflows that they will need to either apply to their own data or reimplement in their preferred language.

Deciphering real-world workflows benefits from a systematic approach rather than a linear read-through of the code. We present such an approach, intended to help bioinformaticians efficiently interpret and, if necessary, reverse-engineer existing WDL workflows. We demonstrate the method on a real-world WDL workflow, deconstructing it systematically in order to understand (1) what the workflow is meant to achieve; (2) how it is structured; and (3) which key functional patterns are involved.

The main take-home from this talk will be the methodology itself, which can be adapted to other scenarios. As secondary benefits, the audience will gain some familiarity with WDL syntax and with interesting functional patterns of the language.

S-022: The GA4GH Phenopacket schema: A computable representation of clinical data for precision medicine
COSI: BOSC
  • Julius Jacobsen, Queen Mary University of London, United Kingdom
  • Monica Munoz-Torres, University of Colorado Anschutz Medical Campus, United States
  • Peter Robinson, The Jackson Laboratory, United States


Presentation Overview: Show

Despite great strides in the development and wide acceptance of standards for exchanging structured information about genomic variants, there is no corresponding standard for exchanging phenotypic data, and this has impeded the sharing of phenotypic information for computational analysis. Here, we introduce the Global Alliance for Genomics and Health (GA4GH) Phenopacket schema, which supports exchange of computable longitudinal case-level phenotypic information for diagnosis and research of all types of disease including Mendelian and complex genetic diseases, cancer, and infectious diseases. To support translational research, diagnostics, and personalized healthcare, phenopackets are designed to be used across a comprehensive landscape of applications including biobanks, databases and registries, clinical information systems such as Electronic Health Records, genomic matchmaking, diagnostic laboratories, and computational tools. The Phenopacket schema is a freely available, community-driven standard that streamlines exchange and systematic use of phenotypic data and will facilitate sophisticated computational analysis of both clinical and genomic information to help improve our understanding of diseases and our ability to manage them.
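
As a concrete illustration of the schema's JSON rendering, a minimal hand-written phenopacket might look like the following. This is heavily abbreviated; a real phenopacket carries fuller metadata, and the identifiers here, apart from the HPO term, are hypothetical.

    import json

    # A minimal Phenopacket (v2-style): one subject, one phenotypic feature.
    phenopacket = {
        "id": "example-phenopacket-1",          # hypothetical identifier
        "subject": {
            "id": "patient-1",                  # hypothetical identifier
            "timeAtLastEncounter": {"age": {"iso8601duration": "P33Y"}},
        },
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}}
        ],
        "metaData": {
            "created": "2022-07-13T00:00:00Z",
            "phenopacketSchemaVersion": "2.0",
            "resources": [
                {
                    "id": "hp",
                    "name": "human phenotype ontology",
                    "namespacePrefix": "HP",
                    "url": "http://purl.obolibrary.org/obo/hp.owl",
                    "version": "2022-06-11",    # hypothetical release version
                    "iriPrefix": "http://purl.obolibrary.org/obo/HP_",
                }
            ],
        },
    }

    print(json.dumps(phenopacket, indent=2))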

S-023: NIAID funded open-source workspaces and data visualization tools enable efficient analyses and interpretation of SARS-CoV-2 data
COSI: BOSC
  • Wiriya Rutvisuttinunt, NIAID, United States
  • Inka Sastalla, NIAID, United States
  • Punam Mathur, NIAID, United States
  • Reed Shabman, NIAID, United States
  • Liliana Brown, NIAID, United States


Presentation Overview: Show

During the COVID-19 pandemic, an unprecedented volume and variety of genomic data and scientific discoveries helped address the urgent need to develop molecular and diagnostic assays, therapeutics, and vaccines against SARS-CoV-2. These data were located across numerous repositories, leading to a need for dedicated platforms and dashboards that aggregate, synthesize, and offer analytical tools. As the pandemic progressed, additional data on SARS-CoV-2 variants proved essential for reassessing the effectiveness of countermeasures. The Office of Genomics and Advanced Technologies (OGAT) at the Division of Microbiology and Infectious Diseases (DMID), National Institute of Allergy and Infectious Diseases (NIAID) coordinated and supported efforts across computational biology and bioinformatics groups (i) to unify COVID-19 and SARS-CoV-2 epidemiology, genomic data, structural modeling, published research, and other resources; (ii) to provide interactive visualization; (iii) to offer downloadable raw datasets for downstream analyses; and (iv) to enable the use of computational biology methods to extract further knowledge from integrated datasets. Here we report the gaps, challenges, and data resources made available through OGAT efforts to combat SARS-CoV-2, which serve as a blueprint for enabling rapid responses to other emerging pathogens with the potential to cause human disease.

S-024: Decomprolute: A benchmarking platform designed for proteomic based tumor deconvolution
COSI: BOSC
  • Song Feng, Pacific Northwest National Laboratory, United States
  • Anna Calinawan, Mount Sinai School of Medicine, United States
  • Michele Ceccarelli, University of Naples Federico II, Italy
  • Pietro Pugliese, University of Sannio, Italy
  • Francesca Petralia, Mount Sinai School of Medicine, United States
  • Pei Wang, Mount Sinai School of Medicine, United States
  • Sara Gosline, Pacific Northwest National Laboratory, United States


Presentation Overview: Show

Tumor deconvolution algorithms have become a reliable way to disentangle the diverse cell types that comprise solid tumors. The development and benchmarking of these algorithms have been enabled by gold-standard datasets and platforms that allow facile comparison of tools on similar data. To date, however, these platforms and standards are geared toward measuring deconvolution of bulk gene expression data, not proteomics data. In this work, we leverage the concerted effort of the Clinical Proteomics Tumor Analysis Consortium, which has collected matched mRNA expression and proteomics data across thousands of tumors, to build a fully open, containerized benchmarking platform for proteomic tumor deconvolution called Decomprolute. Decomprolute is a Common Workflow Language framework that leverages Docker to compare tumor deconvolution algorithms across multi-omic datasets. We describe how it can be used to measure and compare existing tumor deconvolution algorithms on proteomics as well as simulated data, and we give examples of how others developing their own tools can use the framework to assess the performance of new algorithms.
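
Since Decomprolute's workflows are expressed in CWL, they can in principle be executed with any CWL runner. The sketch below drives the reference runner cwltool from Python; the workflow and input file names are hypothetical placeholders, not Decomprolute's actual entry points.

    # pip install cwltool   (Docker must also be available for containerized steps)
    import subprocess

    # Run a CWL workflow with the reference runner; each step's tool executes
    # in its own Docker container, which keeps the comparison reproducible.
    subprocess.run(
        [
            "cwltool",
            "--outdir", "results",
            "deconvolution_benchmark.cwl",  # hypothetical workflow file
            "inputs.yml",                   # hypothetical job inputs
        ],
        check=True,
    )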

S-025: Web scraping pilot study for SARS-CoV-2 variants of concern dashboards
COSI: BOSC
  • Wiriya Rutvisuttinunt, National Institute of Allergy and Infectious Diseases, United States
  • Liliana Brown, National Institute of Allergy and Infectious Diseases, United States
  • Lisa Mayer, National Institute of Allergy and Infectious Diseases, United States
  • Steve Tsang, National Institute of Allergy and Infectious Diseases, United States
  • Jane Lockmuller, National Institute of Allergy and Infectious Diseases, United States


Presentation Overview: Show

Tracking SARS-CoV-2 variants and mutations is essential to inform the development of medical countermeasures. In response, many dashboards emerged that publish aggregated variant data based on independent analyses, each using its own metrics and visualizations. To leverage knowledge across dashboards and prioritize SARS-CoV-2 variants with high public health impact, we developed a pipeline that automates the collection of data on variants of concern (VOC), variants of interest (VOI), and variants under monitoring (VUM) from relevant dashboards and generates a consensus, using web scraping with Python Selenium and Beautiful Soup followed by visualization in R. Additionally, we used the FAIR Data Principles as criteria to track the openness of each dashboard's data. From June 1 through September 9, 2021, we monitored twelve variant-reporting websites and successfully scraped three dashboards (25%). The lists of top variants of concern agreed across these dashboards, highlighting the variants posing the greatest threat. The nine other websites (75%) had structures inaccessible to the web scraping pipeline. Challenges included limited programmatically accessible data, difficulty finding documentation, and frequent changes to website structure. Overall, all dashboards provided visual variant summaries; however, expanding the websites' machine-readability and documentation would strengthen their impact by improving interoperability and reusability.
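
The core scraping pattern looks roughly like this minimal sketch, which is not the authors' pipeline; the dashboard URL and CSS selectors are hypothetical, since each real dashboard requires its own selectors.

    # pip install selenium beautifulsoup4   (a browser driver must be available)
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup

    # Render the page in a headless browser so JavaScript-built tables exist
    # in the DOM before parsing.
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    driver.get("https://example.org/voc-dashboard")  # hypothetical dashboard URL
    html = driver.page_source
    driver.quit()

    # Parse the rendered HTML and pull variant names out of a table.
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select("table.variants tbody tr"):  # hypothetical selector
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells[0])  # e.g. the variant (Pango lineage) name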

S-026: Bioinformatics Hub of Kenya Initiative: Connecting bioinformatics researchers and students in East Africa
COSI: BOSC
  • Festus Nyasimi, University of Chicago, United States
  • Michael Landi, International Institute of Tropical Agriculture (IITA), Kenya
  • Pauline Karega, International Centre for Insect Physiology and Ecology (ICIPE), Kenya
  • Margaret Wanjiku, Bioinformatics Hub of Kenya Initiative (BHKI), Kenya
  • David Kiragu, Institute of Primate Research (IPR), Kenya


Presentation Overview: Show

In recent years there has been a large uptake of bioinformatics projects within the region, but only a few skilled individuals are available to perform the data analysis these projects require. The result is the outsourcing of bioinformatics services, which does not feed skills back into the community. The Bioinformatics Hub of Kenya Initiative (BHKI - www.bhki.org) is a community-based organization whose main goal is training and empowering students and researchers in the region. It also serves as a platform for networking and for building collaboration between researchers.

We provide different models of learning and training through workshops, hackathons, and conferences. Our approach involves training participants in core skills, assigning them mini-projects to consolidate those skills, and finally having them present their results. We have collaborated with the Carpentries organization to provide a platform where trained members can learn to be instructors and lead workshops.

So far we have organized multiple workshops, one train-the-trainer session through the Carpentries, and one conference, all with good representation across the region and across career levels. We are now planning university and high-school outreach to further raise awareness of bioinformatics and of working on open-source projects.

S-027: TADA - Targeted Amplicon Diversity Analysis
COSI: BOSC
  • Christopher Fields, High Performance Computing in Biology, University of Illinois at Urbana-Champaign, United States
  • Nicola Mulder, Computational Biology (CBIO), University of Cape Town, South Africa
  • Gerrit Botha, Computational Biology (CBIO), University of Cape Town, South Africa
  • Katie Lennard, Computational Biology (CBIO), University of Cape Town, South Africa
  • Jessica Holmes, High Performance Computing in Biology, University of Illinois at Urbana-Champaign, United States
  • Lindsay Clark, High Performance Computing in Biology, University of Illinois at Urbana-Champaign, United States
  • Gloria Rendon, High Performance Computing in Biology, University of Illinois at Urbana-Champaign, United States


Presentation Overview: Show

Targeted analysis of loci in environmental data, such as bacterial/archaeal 16S rRNA, has evolved towards higher-throughput and increasingly sensitive methods, enabling analysis of thousands of samples across multiple amplicons and genes with higher taxonomic resolution. We developed TADA (Targeted Amplicon Diversity Analysis) to address these demands as part of our H3ABioNet collaboration (Mulder et al., 2016). TADA is implemented in Nextflow (Di Tommaso et al., 2017) following nf-core's overall workflow design (Ewels et al., 2020). TADA derives amplicon sequence variants (ASVs) using DADA2 (Callahan et al., 2016), with key steps implemented directly in R via templates to take advantage of Nextflow's capabilities for parallel processing and portability. TADA can process data from multiple genic regions (including 16S, ITS, COI, and custom amplicons) and has been tested with Illumina and long-read sequence data, including sample data from both Loop Genomics and Shoreline Biosciences. Thousands of samples can readily be processed within a standard cluster environment or on cloud platforms. Workflow outputs can be used within multiple analysis frameworks, including R/Bioconductor, QIIME2, and mothur. Past versions of TADA have been successfully used in training for 16S analyses, including workshops focused on developing bioinformatics capacity within African institutions (Ras et al., 2021).
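
Like other Nextflow pipelines, TADA is launched with a single nextflow run command. The sketch below drives such a launch from Python; the pipeline handle, profile, and parameters are hypothetical placeholders, since the real options are defined by the TADA documentation.

    import subprocess

    # Launch a Nextflow pipeline: -profile selects a packaged execution/config
    # profile (e.g. Docker-based), and --<param> flags are pipeline parameters.
    subprocess.run(
        [
            "nextflow", "run", "ORG/TADA",   # hypothetical pipeline handle
            "-profile", "docker",
            "--input", "samplesheet.csv",    # hypothetical parameter and file
            "--outdir", "results",           # hypothetical parameter
        ],
        check=True,
    )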