Posters - Schedules
Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7
minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19
and no later than July 23. All registered conference participants will have access to the posters and presentations
during the conference, and to the content until October 31, 2021. Q&A is available through a chat
function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.
Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters
Ideally authors should be available for interactive chat during the times noted below:
Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
Short Abstract: Benchmarking is a central piece of method development as it identifies 1) the most suitable tools in different analytical settings, 2) the most critical pitfalls of an analysis, 3) the areas requiring further improvements from the community. Currently, evaluations are made in fits and starts and any method developed in the future will make current benchmarks incomplete or even deprecated. Furthermore, dependency of benchmarking conclusions on method parameter settings constitutes an additional burden for the scientific community that struggles to find a consensus approach. To overcome these limitations we propose a modular and continuous community benchmarking framework on RENKU, a free open-source analytical platform that integrates version control, CI/CD and containerization. The main asset of RENKU is a knowledge graph to track input, code and output of workflows, allowing seamless extensions and updates as new data or methods arise. This provides a light-weight entry point for the scientific community to engage and extend a benchmark while delivering the most up-to-date recommendations. As a proof of concept, we demonstrate the potential application of such a framework in the field of single-cell RNA sequencing, where the fast pace of new method development illustrates the outlined limitations of the current benchmarking approaches.
Short Abstract: The GFF3 format is a common, flexible tab-delimited format representing the structure and function of genes or other mapped features. However, with increasing re-use of annotation data, this flexibility has become an obstacle for standardized downstream processing. Common software packages that export annotations in GFF3 format model the same data and metadata in different notations, which puts the burden on end-users to interpret the data model.
The AgBioData consortium is a group of genomics, genetics and breeding databases and partners working towards shared practices and standards. Providing concrete guidelines for generating GFF3, and creating a standard representation of the most common biological data types would provide a major increase in efficiency for AgBioData databases and the genomics research community that use the GFF3 format in their daily operations.
The AgBioData GFF3 working group has developed recommendations to solve common problems in the GFF3 format. We suggest improvements for each of the GFF3 fields, as well as the special cases of modeling functional annotations, and standard protein-coding genes. We welcome discussion of these recommendations from the larger community.
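To make the GFF3 layout discussed above concrete, here is a minimal Python sketch that parses one feature line into its nine tab-delimited columns and splits the attributes column into tag=value pairs. It is an illustration only, not part of the AgBioData recommendations; the function name `parse_gff3_line` and the example record (source, ID, and Name values) are hypothetical.

```python
from urllib.parse import unquote

# The nine tab-delimited GFF3 columns, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated tag=value pairs; a value may be a
    # comma-separated list, and reserved characters are percent-encoded.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical example record, for illustration only.
example = "chr1\texample_source\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=hypothetical%20gene"
gene = parse_gff3_line(example)
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["Name"])
```

Because every exporter is free to choose its own tags and value conventions in column 9, a parser like this can read the file but cannot by itself interpret the data model, which is the gap that shared guidelines aim to close.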
Short Abstract: Modern biological data are high-dimensional, feature-rich and generated by diverse technologies. As a result, they are represented in different formats, including oligonucleotide sequences, genome coverage plots and data matrices. Because they contain independent layers of quantitative and qualitative information, these data types are often investigated separately. While the complexity of biological systems requires a multi-omics perspective to obtain insight, merging multi-omics data remains challenging. To analyse multi-omics data efficiently, we therefore require a strategy that is capable of representing (1) quantitative and qualitative data (2) at a low level while being (3) compatible with diverse biological data types. To address these three challenges, we explored the possibility of analysing low-level sequence data directly as a unifying factor. Using this technique, we recovered informative sequence-based markers. Crucially, we demonstrated that this scales to multi-omics data. While our preliminary analyses were conducted on quantitative data only, the approach is theoretically generalisable to qualitative data such as genomic shape or mutations. Currently, no known strategy combines these three key attributes. We are working on extending the approach to qualitative data types to obtain and further explore the statistical association between low-level features.
Short Abstract: The development of next-generation sequencing techniques has facilitated research on the adaptive immune receptor repertoire (AIRR), which provides an in-depth understanding of T cell and B cell genetics, clonotypes, and phenotypes, and helps the scientific community better understand the etiology and pathology of immune-related diseases.
Most AIRR sequencing studies have an unknown ancestry distribution, although it is crucial to understand the disparities in immune responses between diverse ancestries. Genetic differences across ancestries potentially lead to distinct phenotypes in immune responses and immunities. Examining the ancestry distribution in immunogenetics studies guides the understanding of diverse immune responses and immunities across diverse populations. In this study, we investigate open T cell receptor sequencing (TCR-seq) studies to estimate the extent of ancestry availability and the ancestry distribution. The distribution is 84.08% European ancestry, followed by 9.01% Asian ancestry, 4.04% African ancestry, and 2.87% other ancestries. The ancestry distribution across TCR-seq studies is therefore highly disproportionate and predominantly focused on European ancestries.
Short Abstract: Project Website: bioinfolinux.readthedocs.io/
Source Code: gitlab.com/vimalkvn/bioinfolinux
Licence: GPL v3.0
The BioinfoLinux project provides a preconfigured virtual machine (VM) for teaching and learning Bioinformatics. It is aimed at students and researchers in the field of biological sciences, who are unfamiliar with Linux distributions and software installation methods.
With BioinfoLinux, users can access a running Linux desktop in three steps:
1. Install VirtualBox.
2. Download and import VM.
3. Start VM.
In contrast, it will take additional steps for new users to get started with a regular Linux distribution like Ubuntu, including the creation of a new virtual machine and installing and configuring the distribution. To use additional software like Conda or Galaxy, they will need to consult the respective project's documentation.
Once logged in to BioinfoLinux, users can access documentation available on the desktop. It includes instructions on getting started and using included software.
BioinfoLinux is an open-source project and the source code is available online. It is defined as Ansible roles (YAML format) and can be customized further by adding or removing roles. The virtual machine image is built using Vagrant. Users will be able to contribute new Ansible roles and documentation on installing software, which will be useful for the wider scientific community.
Short Abstract: Data commons are software platforms that combine data, cloud computing infrastructure, and computational tools to create a resource for the managing, analyzing, and harmonizing of biomedical data. Arvados is an open source platform for managing, processing, and sharing genomic and other large scientific and biomedical data. The key capabilities of Arvados give users the ability to manage storage and compute at scale, and to integrate those capabilities with their existing infrastructure and applications using Arvados APIs. Several technical requirements have been suggested for a data commons ranging from metadata to security and compliance. We will go through these technical requirements and show that Arvados is well designed to fulfill the technical criteria for a data commons. We will demonstrate a prototype federated data commons we have created using the two clusters that we run for the Arvados Playground, a free-to-use installation of Arvados for evaluation and trial use.
Short Abstract: Cadmus is an open-source system, developed in Python, to generate biomedical text corpora from published literature. The difficulty of obtaining such datasets has been a major impediment to methodological developments in biomedical NLP and has hindered the extraction of invaluable biomedical knowledge from the published literature. The Cadmus system is composed of three main steps: query and metadata collection, document retrieval, and parsing and collation of the resulting text into a single data repository. The system is open-source and flexible, retrieving open-access (OA) articles and also those from publishers that the user, or their host institution, has permission to access. Cadmus retrieves and processes all available document formats and standardises their extracted content into plain text alongside article metadata. The retrieval rates of Cadmus vary depending on the query and licensing status. We present examples of data retrieval for four gene-based literature queries available purely through open access (OA, without subscription) and with additional institutional access (University of Edinburgh, subscription). The average retrieval success rate was 89.03% with subscription access and 59.17% with OA only. Cadmus facilitates straightforward access to full-text literature articles and structures them to allow knowledge extraction via natural language processing (NLP) and text-mining methods.
Short Abstract: The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is developed by a multi-vendor working group consisting of organizations and individuals aiming to enable scientists to share data analysis workflows. Embracing CWL as a workflow standard enables reproducible research. This ensures that scientific work can be understood, independently verified, and built upon in future work. Improvements to CWL benefit multiple organizations and workflow systems that use the CWL standard.
There is now a new stable release of the CWL standards, version 1.2. We will describe the new features in the standards, give examples of using these new features, and explain why they were added. Some highlights of CWL v1.2 include conditional execution of workflow steps, abstract operations, and absolute paths for container inputs. CWL v1.2 also contains 20 cleanups and clarifications of corner cases in the specifications.
Short Abstract: Galaxy, a free and open-source platform for web-based data integration and analysis, has been widely used by the life science research community. In rice research, we developed a federated instance called RiceGalaxy (galaxy.irri.org) to meet the needs of the rice research community for a data analysis platform. However, its use has now extended beyond rice, and it has been adopted for other crops as well. The Excellence in Breeding (EIB) Program (excellenceinbreeding.org/) aims to fast-track modernization of breeding programs in developing regions through tools and resources, shared services and best breeding practices. We developed a demo instance (cropgalaxy.excellenceinbreeding.org/) that housed open-source tools for genomic selection, imputation and GWAS. These two instances have been combined, resulting in CropGalaxy (cropgalaxy.excellenceinbreeding.org/), which currently contains tools and workflows for genome-wide association studies (GWAS), genomic selection (GS), imputation, and genomics data analysis. The datasets for the 3000 Rice Genomes (3K RG) project are also available publicly in the data library.
Short Abstract: Metabolic networks analysis relies heavily on computational geometry and interweaves with constraint-based approaches.
We present dingo, a complete Python toolkit that supports the fundamental methods for the analysis of metabolic networks.
dingo serves the ambition of providing to the community a novel and highly efficient geometric method to sample from the flux space of a metabolic network.
This translates to a novel algorithm for sampling polytopes in high dimensions where the crux of the novelty and efficiency is the unification of rounding (of the polytope) and sampling in one pass and the computation of both upon termination.
dingo samples from the most complicated human metabolic network accessible today, Recon3D,
which corresponds to a polytope of dimension 5,335, in less than 30 hours on a personal computer.
To our knowledge, such a functionality is out of reach for existing software.
A variety of random walks as well as Flux Balance and Flux Variability Analyses are also supported.
dingo exploits the C++ functionality of volesti, an open-source software library that implements high-dimensional MCMC sampling and volume approximation algorithms. Both packages are under the umbrella of the GeomScale project, a NumFOCUS-affiliated project.
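As a rough illustration of what sampling a polytope with a random walk means, the Python sketch below runs a plain hit-and-run walk on a toy two-dimensional polytope written as {x : A x <= b}. This is a standard textbook walk and an assumption on our part; it is not dingo's algorithm, it does not use the dingo or volesti APIs, and it ignores the equality constraints and rounding step that a real flux space requires. The function name `hit_and_run` and the toy constraints are hypothetical.

```python
import random


def hit_and_run(A, b, x0, n_samples, seed=0):
    """Sample points from the bounded polytope {x : A x <= b} with a hit-and-run walk."""
    rng = random.Random(seed)
    x = list(x0)  # must start strictly inside the polytope
    samples = []
    for _ in range(n_samples):
        # Gaussian components give a direction that is uniform over all angles.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        t_min, t_max = float("-inf"), float("inf")
        # Intersect the line x + t*d with every half-space a . x <= bound.
        for row, bound in zip(A, b):
            a_dot_d = sum(a * di for a, di in zip(row, d))
            slack = bound - sum(a * xi for a, xi in zip(row, x))
            if a_dot_d > 1e-12:
                t_max = min(t_max, slack / a_dot_d)
            elif a_dot_d < -1e-12:
                t_min = max(t_min, slack / a_dot_d)
        # Jump to a uniformly chosen point on the chord inside the polytope.
        t = rng.uniform(t_min, t_max)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples


# Toy polytope: two hypothetical fluxes, each bounded between 0 and 1.
A = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [1, 1, 0, 0]
samples = hit_and_run(A, b, x0=[0.5, 0.5], n_samples=1000)
print(len(samples), "points sampled; first point:", samples[0])
```

Every step costs one pass over all constraints, so the work per sample grows with both dimension and constraint count, which gives a sense of why sampling a polytope of dimension 5,335 is demanding.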
Short Abstract: Dockstore is an open source platform for sharing bioinformatics tools and workflows using popular descriptor languages such as the Common Workflow Language (CWL), Workflow Description Language (WDL), Nextflow, and Galaxy. Dockstore aims at making workflows reproducible and runnable in any environment that supports Docker. Here, we highlight new features that the Dockstore team has been working on since our last talk in 2019.
To better support the lifecycle of a workflow, Dockstore has added support for GitHub apps, allowing for automated updates of workflow content. Selected versions of a workflow can be snapshotted and exported to Zenodo to mint DOIs for publication. Improvements have also been made to the platform's cloud security, in line with NIH recommendations, to keep workflows safe.
As examples of usability improvements, Dockstore has revamped organizations and home pages highlighting updated content for logged-in users while also adding notifications for events.
Finally, users can use the launch-with feature to run workflows on newly added platforms such as CGC, Cavatica, AnVIL, Nextflow Tower, BioData Catalyst, and Galaxy. ORCID support has been added to help identify users on the site. Support has been added for WDL 1.0, CWL 1.1, Nextflow’s DSL2, and GA4GH’s TRS V2 standard.
Short Abstract: EDAM is an ontology widely used to describe bioinformatics resources (tools, databases, workflows, training, etc.). However, the exploration of the ontology itself by users who might not be ontology experts remains sub-optimal, with most of the usual interfaces being either too complex, too slow, or too generic.
Here we present EDAM Browser, which lets users explore EDAM with an interface tailored to its structure and properties. The features of this ontology browser are focused on the needs of EDAM's end-users, who search and explore the ontology to find or annotate bioinformatics resources.
A unique feature of EDAM Browser is the visualization of all "positions" of a selected concept in the full ontology "tree" (DAG), with all paths to the root highlighted. One of EDAM Browser's main added values is the aggregated search across various registries (e.g. bio.tools, TeSS, BioSphere), and its speed.
The lightweight architecture makes it easy to download and run EDAM Browser on any server or personal computer, either as a local HTML file or on a web server. EDAM Browser has also been used for other ontologies, and making the reuse smoother and more usable is part of the ongoing work. Contributions are welcome!
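The idea of showing every "position" of a concept in a DAG can be sketched in a few lines of Python: a concept with several parents has several paths to the root, and all of them are enumerated. This is an illustrative sketch only, not EDAM Browser's implementation; the function name `paths_to_root` and the toy hierarchy are hypothetical and are not real EDAM content.

```python
def paths_to_root(concept, parents):
    """Return every path from `concept` up to a root of the DAG, as lists of names."""
    direct_parents = parents.get(concept, [])
    if not direct_parents:
        return [[concept]]  # a concept without parents is a root
    paths = []
    for parent in direct_parents:
        for tail in paths_to_root(parent, parents):
            paths.append([concept] + tail)
    return paths


# Hypothetical toy hierarchy mapping each concept to its direct parents.
toy_parents = {
    "Sequence alignment": ["Alignment", "Sequence analysis"],
    "Alignment": ["Operation"],
    "Sequence analysis": ["Analysis"],
    "Analysis": ["Operation"],
    "Operation": [],
}

all_paths = paths_to_root("Sequence alignment", toy_parents)
for path in all_paths:
    # Print from the root down to the selected concept.
    print(" > ".join(reversed(path)))
```

In a tree each concept would have exactly one such path; the multiple paths are what distinguish a DAG and what a tailored browser needs to display.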
Short Abstract: EDAM is an ontology of data analysis and data management in life sciences. The structure of EDAM is relatively simple, divided into 4 main sub-ontologies:
- Topic
- Operation
- Data (incl. Identifier)
- Format
EDAM is used in numerous resources such as bio.tools, Galaxy, Debian, or the ELIXIR Europe training portal TeSS. Thanks to the annotations with EDAM, computational tools, workflows, data, and educational materials are easier to find, compare, choose, and integrate.
EDAM contributes to open science by allowing semantic annotation of processed data, thus making the data more understandable, findable, and comparable. EDAM and its applications lower the barrier and effort for scientists and citizens alike, towards doing scientific research in a more open, reliable, and inclusive way.
We are currently working on the next major, "stable" release 1.26, with minor "unstable" versions published in the meantime. The main developments since 2020 include the addition of the essential concepts of data management and open science, improved automated validation, and especially reworking the contribution processes towards more inclusion and engagement with communities of scientific experts, software engineers, and volunteers.
Short Abstract: With the advent of faster and cheaper sequencing technologies, there exist ever larger amounts of sequence data to analyze. The Basic Local Alignment Search Tool (BLAST) is a core component of sequence analysis, though analyzing large volumes of sequence data remains a challenge, primarily because of computational demands. In order to leverage the offerings of cloud service providers to address the increased demands of today’s sequence analysis, we developed ElasticBLAST.
ElasticBLAST is a cloud-based implementation of BLAST designed to speed up searching and aligning large amounts of nucleotide and protein sequences against NCBI or user-provided BLAST databases. In keeping with an open-source philosophy, the source code and documentation for ElasticBLAST are freely available on GitHub. The sequence databases that ElasticBLAST operates on are also freely available on the National Center for Biotechnology Information (NCBI) FTP site, the Registry of Open Data at AWS, and the Google Cloud Platform.
Short Abstract: GEMmaker is a Nextflow workflow for gene expression quantification of Illumina RNA-seq datasets. GEMmaker creates a Gene Expression Matrix (GEM) using popular quantification tools such as Hisat2, Salmon or Kallisto. GEM construction is a prerequisite for differential gene expression (DEG) and gene co-expression network (GCN) analysis. Other RNA-seq workflows currently exist, but GEMmaker is unique as it is intended for scaling to hundreds to thousands of RNA-seq samples. It does so by automatically cleaning intermediate files and processing samples in batches to avoid exceeding system resources. Cleaning and batch processing are novel features of GEMmaker that are not provided by base Nextflow or other RNA-seq workflows.
GEMmaker is fully nf-core compatible ensuring community standards in workflow design and sharing are met. As such, GEMmaker is containerized and can be executed on a local computer, on institutional high-performance compute clusters or on the cloud without the need for installing software other than Nextflow. GEMmaker can use local data and can automatically download datasets from NCBI’s SRA repository. Source Code and in-depth usage documentation for GEMmaker can be found at github.com/SystemsGenetics/GEMmaker. GEMmaker is available under the MIT License.
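The batch-and-clean strategy described above can be illustrated with a small Python sketch: samples are processed a few at a time, and each batch's intermediate files are deleted before the next batch starts, so disk usage is bounded by the batch size rather than by the total number of samples. GEMmaker itself implements this in Nextflow; the code below is only a conceptual stand-in, and the function names, file names, and run accessions are hypothetical.

```python
import os
import tempfile


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(sample_ids, batch_size, work_dir):
    """Process samples batch by batch, deleting intermediates after each batch."""
    results = {}
    peak_files = 0
    for batch in batches(sample_ids, batch_size):
        intermediates = []
        for sample in batch:
            # Stand-in for a large intermediate such as an alignment file.
            path = os.path.join(work_dir, f"{sample}.intermediate")
            with open(path, "w") as handle:
                handle.write(sample)
            intermediates.append(path)
            # Stand-in for the small per-sample result kept for the matrix.
            results[sample] = len(sample)
        peak_files = max(peak_files, len(os.listdir(work_dir)))
        # Clean up before the next batch so storage never accumulates.
        for path in intermediates:
            os.remove(path)
    return results, peak_files


samples = [f"SRR{n:07d}" for n in range(1, 11)]  # hypothetical run accessions
with tempfile.TemporaryDirectory() as tmp:
    results, peak_files = process_in_batches(samples, batch_size=3, work_dir=tmp)
    leftover = os.listdir(tmp)
print(len(results), "samples processed; peak intermediate files:", peak_files)
```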
Short Abstract: We developed a Python package, GSForge, that assists in collating and comparing genomic feature selections from any source or method. Available methods for selecting genomic features include literature mining, differential gene expression (DEG), machine learning techniques and modules from gene interaction networks. Each method applies some measure to produce a selection, or set, of features underlying some experimental variable or outcome. It is increasingly common to use multiple feature selection tools and then compare, combine and evaluate the various selections. Collating and tracking sets from disparate methods can be tedious. To assist in this work, GSForge provides a Python Application Programming Interface (API) with data structures designed to store and organize feature data and ancillary metadata. Additionally, GSForge has functions for subsetting data based on set membership(s) and set operations thereof, as well as visualizations to interpret results.
We provide a walk-through of examining an Oryza sativa RNA-Seq data set of moderate size: 475 samples and 66338 genes. Multiple feature selection techniques are demonstrated, as well as methods for comparing between and within the selected sets.
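The kind of set comparison described above can be sketched with Python's built-in sets. This is not the GSForge API; it only illustrates the underlying set operations, and the selection names and gene identifiers are hypothetical.

```python
# Hypothetical gene selections produced by three different methods.
selections = {
    "differential_expression": {"gene_a", "gene_b", "gene_c", "gene_d"},
    "machine_learning": {"gene_b", "gene_c", "gene_e"},
    "literature": {"gene_c", "gene_f"},
}

# Genes chosen by every method, and genes chosen by at least one method.
consensus = set.intersection(*selections.values())
union = set.union(*selections.values())


def jaccard(left, right):
    """Jaccard similarity: shared members divided by all members."""
    return len(left & right) / len(left | right)


# Genes that only a single method selected.
unique_to = {}
for name, genes in selections.items():
    others = set()
    for other_name, other_genes in selections.items():
        if other_name != name:
            others |= other_genes
    unique_to[name] = genes - others

similarity = jaccard(selections["differential_expression"],
                     selections["machine_learning"])
print(sorted(consensus), len(union), similarity)
```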
Short Abstract: Since 2006, the year 23andMe was founded, the leading drivers of the direct-to-consumer genetic testing market have been ancestry prediction and DNA relative matching services.
Despite the availability of scientific literature on methods for determining kinship and the accessibility of relevant tools, assembling a pipeline that operates stably on real-world genotypic data requires significant research and development resources.
Currently, there is no open-source, end-to-end solution for relatedness detection in genomic data that is fast, reliable and accurate for both close and distant degrees of kinship, combines the processing steps needed to work on real data, and is ready for production integration.
To address this, we developed the GRAPE pipeline, which combines data preprocessing, IBD segment recovery, masking of artifactual IBD segments, and accurate relationship estimation. We demonstrate the pipeline's accuracy on a simulated population and provide a rationale for the reliability of our approach.
The project uses software development best practices, as well as GA4GH standards and tools.
We believe that software freedom is essential to advance efficiency, interoperability, and speed of innovation.
GRAPE is a free and open-source project under the GPLv3 license.
Short Abstract: The Global Alliance for Genomics and Health (GA4GH) Cloud Workstream has created the Workflow Execution Service (WES) and Tool Registry Service (TRS) standard definitions for executing and sharing workflows. Based on our experience in developing WES, the current TRS definition lacks information such as workflow attachments (e.g., configuration files and database files) and workflow parameter templates (e.g., required inputs and their type information). As a result, workflows cannot be executed even if the TRS URL is specified. Also, there are existing TRSs (e.g., Dockstore and BioContainers) that use GitHub as the registry entity. Here, we propose a TRS publication protocol that combines GitHub (file hosting, user authentication, and version management), GitHub Actions (continuous testing, workflow analysis), and GitHub Pages (REST API hosting). This allows users to retrieve the information needed for workflow execution from the GitHub repository hosting the workflow documents via TRS definitions.
Short Abstract: WGS and WES are the most popular NGS methodologies currently used to detect rare and common genetic variants of clinical significance. Analyzing WGS/WES data for variant discovery and genotyping relies on a combination of different data analytic applications. Although several bioinformatics pipelines have been developed, finding and interpreting genetic variants in a timely manner remains challenging for diagnostic laboratories and clinicians. In this study, we evaluate the current state of the solutions available for identifying variants, alleles, and haplotypes. Within this scope, we consulted peer-reviewed literature published in the last 10 years, focusing on the bioinformatics applications proposed to process WGS/WES data and to support downstream analysis for gene-variant discovery, annotation, prediction, and interpretation. We present our findings, which include but are not limited to the set of operations, workflow, data handling, involved tools, technologies and algorithms, and limitations of the assessed applications. To address some of the current big data analytics challenges and further advance the field, we have developed a new genomics application (JWES), which we also discuss, validate, and compare with existing solutions. JWES is a cross-platform, user-friendly, product-line application comprising three modules: data processing, storage, and visualization.
Short Abstract: Julia is a general purpose programming language designed for simplifying and accelerating numerical analysis and computational science. Thanks to the efficiency of the high-level program representation and the tools available in the ecosystem for numerical physics and mathematics, the language is well suited for work on biochemical reaction systems.
We describe a set of three open source software packages that provide the systems biology community access to the high performance tools in Julia: Catalyst.jl, SBML.jl and CellMLToolkit.jl. Catalyst.jl is a domain specific language to create dynamic models of chemical reaction networks in Julia de novo. SBML.jl and CellMLToolkit.jl import SBML and CellML models, which are XML-based community standards for exchange and storage of biological models. Importing the data into Julia enables leveraging the available functionality for simulating and analyzing these models.
Importantly, the new library implementations provide us with major gains of efficiency and program simplicity, allowing rapid development and large-scale data processing that has not been feasible before. For example, we demonstrate that combined community models consisting of tens of millions total metabolic reactions may be constructed interactively on off-the-shelf hardware, and show the new applications of SciML on the imported datasets.
Short Abstract: JBrowse 2 is a new web-based genome browser that offers multiple modalities of data visualization. It is written using React and TypeScript, and can show multiple views on the same screen. Third-party developers can create plugins for JBrowse 2 that add new track types, data adapters, and even new view types. We have specialized features for comparative genomics, such as dotplot and linear views of genome comparisons. We also have specialized features for structural variant analysis, including long read vs reference comparisons, and showing read support for SVs by connecting split reads across breakpoint junctions. Users of the web-based JBrowse 2 can share their entire session (e.g. all the settings, views, and even extra tracks that they have added) with other users via short shareable URLs. We are continuing to add new features, and recently released high-quality SVG export and specialized visualization of insertions. We are working to develop further platforms such as JBrowse 2 Desktop, an Electron-packaged version of JBrowse 2.
Short Abstract: Microorganisms (microbes) are incredibly diverse, spanning all major divisions of life, and represent the greatest fraction of known species. A vast amount of knowledge about microbes is available in the literature, across experimental datasets, and in established data resources. While the genomic and biochemical pathway data about microbes is well-structured and annotated using standard ontologies, broader information about microbes and their ecological traits is not. We created the KG-Microbe (github.com/Knowledge-Graph-Hub/kg-microbe) resource in order to extract and integrate diverse knowledge about microbes from a variety of structured and unstructured sources. Initially, we are harmonizing and linking prokaryotic data for phenotypic traits, taxonomy, functions, chemicals, and environment descriptors, to construct a knowledge graph with over 266,000 entities linked by 432,000 relations. The effort is supported by a knowledge graph construction platform (KG-Hub) for rapid development of knowledge graphs using available data, knowledge modeling principles, and software tools. KG-Microbe is a microbe-centric Knowledge Graph (KG) to support tasks such as querying and graph link prediction in many use cases including microbiology, biomedicine, and the environment. KG-Microbe fulfills a need for standardized and linked microbial data, allowing the broader community to contribute, query, and enrich analyses and algorithms.
Short Abstract: NIH programs LINCS, “Library of Integrated Network-based Cellular Signatures”, and IDG, “Illuminating the Druggable Genome”, have generated rich open access datasets for the study of the molecular basis of health and disease. LINCS provides experimental genomic and transcriptomic evidence. IDG provides curated knowledge for illumination and prioritization of novel protein drug target hypotheses. Together, these resources can support a powerful new approach to identifying novel drug targets for complex diseases. Integrating LINCS and IDG, we built the Knowledge Graph Analytics Platform (KGAP) for identification and prioritization of drug target hypotheses, via open source package kgap_lincs-idg. We investigated results for Parkinson’s Disease (PD), which inflicts severe harm on human health and resists traditional approaches. Approved drug indications from IDG’s DrugCentral were starting points for evidence paths exploring chemogenomic space via LINCS expression signatures for associated genes, evaluated as targets by integration with IDG. The KGAP scoring function was validated against genes associated with PD with published mechanism-of-action gold standard elucidation. IDG was used to rank and filter KGAP results for novel PD targets, and manual validation. KGAP thereby empowers the identification and prioritization of novel drug targets, for complex diseases such as PD.
Short Abstract: BioCompute Object (BCO) and Common Workflow Language (CWL) standards contribute to reproducibility in the life sciences by providing approaches to communicate and portably execute computational workflows. We demonstrate the utility of combining both standards by extending our publicly available BCO app to generate a biocompute object with a BCO Execution Domain extension that documents the information required for execution. For our example, we use a publicly available workflow to prepare data for variant calling and annotation. The workflow is the public component of an in-progress scientific study that aims to use genetic information to guide cardiovascular disease treatment. We specifically show that the BCO app can generate a BCO from task and workflow information located on the Cancer Genomics Cloud (CGC) with sufficient information to execute. We then use a custom script that pulls the execution information from the BCO execution domain allowing for local execution. We include expected results within the BCO error domain, allowing bioinformaticians to verify results from the ported workflow. We expect that our approach to integrate BioCompute Objects and CWL within our BioCompute App will enhance workflow portability and use across computational platforms.
Short Abstract: It is common practice to analyze miRNA PCR Array data using commercially available software, mostly due to its convenience and ease of use. Here we propose a new tool, miRkit, an open-source framework written in R that allows for a comprehensive analysis of miRNAs. By leveraging publicly available methods for the analysis of expression profiles, the proposed toolset covers the full analysis from raw RT-PCR data to functional analysis of targeted genes, and is specifically designed to support the popular miScript miRNA PCR Array (Qiagen) technology. The overall workflow implemented in miRkit (i) performs quality control of the samples and data normalization, (ii) identifies significant differences in the expression profiles of miRNAs (limma package), and (iii) links the significant miRNAs with the targeted genes (multiMiR package) and biological processes using KEGG and GO enrichment analysis (enrichR package). The whole process is supplemented by the corresponding visualizations, which are automatically generated and saved as independent figures. The implementation of miRkit supports fast execution with low memory requirements and produces results similar to those of the commercially available software. miRkit is licensed under the MIT License and is freely available through GitHub (github.com/BiodataAnalysisGroup/mirkit).
Short Abstract: Project website: newbies-in-bioinformatics.github.io/Newbies-in-bioinformatics/
Source code: github.com/Newbies-in-bioinformatics/Newbies-in-bioinformatics
License: Creative Commons Zero v1.0 Universal
Authors: Anshika Sah¹, Yo Yehudi²
¹Institute of Home Economics, ²Open Life Science
Motivation: People from biological science backgrounds often find it difficult to use computational tools and data analytics for their research. Not having a community to discuss foundational concepts often leads to demotivation and may result in newcomers doubting their self-worth or ultimately abandoning bioinformatics entirely. To make bioinformatics a more inclusive field, we should normalize not knowing introductory concepts, and asking for help.
About the project: This site is designed to promote bioinformatics amongst beginners.
Beginners who are trying to learn bioinformatics are encouraged to share the challenges and biases they have faced or are facing while learning bioinformatics and to discuss these issues with the experts.
Experts in bioinformatics and computational biology are invited to share their insights into the field, clear up beginners' doubts, and share their journey towards bioinformatics to motivate learners.
Getting involved: To accelerate the growth of the community and contribute to newbies in bioinformatics, as an expert, intermediate, or a newbie, get in touch at email@example.com.
Short Abstract: Circular RNAs (circRNAs) are a class of non-coding RNA that has gained attention due to their unique covalently closed loop structure, which confers resistance to RNase degradation, their tissue-specific expression, and their abundant expression in saliva, blood, plasma and exosomes. These characteristics make circRNAs ideal candidates as both diagnostic and prognostic biomarkers in a clinical setting, facilitating the use of non-invasive liquid biopsies for the detection and monitoring of disease status. Furthermore, circRNAs titrate microRNAs (miRNAs) via miRNA response elements within their mature spliced sequence, making the study of circRNAs within the competing endogenous RNA network at the systems level highly pertinent in modelling disease progression.
We present nf-core/circrna, a multi-functional, automated high-throughput pipeline implemented in Nextflow that fully characterises circRNAs in RNA-Seq datasets via three analysis modules: (i) circRNA quantification, robust filtering and annotation, (ii) miRNA target prediction, and (iii) differential expression analysis between pathological conditions, complete with statistical results and diagnostic and expression plots. nf-core/circrna has been developed within the nf-core framework, ensuring robust portability across POSIX compute environments, minimal installation requirements via containerisation, comprehensive documentation and maintenance support. Source code, documentation and installation instructions are freely available at github.com/nf-core/circrna and nf-co.re/circrna.
Short Abstract: Molecular biology has become increasingly data-driven in the last decades. To facilitate writing analysis software for computational molecular biology, the open-source Python library Biotite has been created: It provides flexible building blocks for common tasks in sequence and macromolecular data analysis in the spirit of Biopython. Biotite comprises functionalities for the bioinformatics workflow from fetching and reading files to data analysis and manipulation. A high computational efficiency of the package is achieved via C-extensions and extensive use of NumPy. With Python being an easy-to-learn programming language, Biotite aims to address a large variety of potential users: from beginners, who can use it to automate data analysis and to prepare the input data for a program, to experienced developers, who create bioinformatics software. Since the initial presentation at BOSC 2019 a lot of new functionalities were added to Biotite. Hence, we would like to take a quick glance at major features that were added since then:
- A modular system for heuristic sequence alignments and read mappings
- Base pair measurement and nucleic acid secondary structure analysis
- Molecular visualization via a seamless interface to PyMOL
- Docking with Autodock Vina
Short Abstract: The largest U.S. public sequence database, Sequence Read Archive (SRA), managed by the National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), is now available on Google Cloud Platform (GCP) and Amazon Web Services (AWS).
14 petabytes of SRA sequence data can be located through either BigQuery (GCP) or Athena (AWS) interactive query services. Updated daily, SRA cloud metadata contains all submitted experimental metadata as well as normalized and calculated NCBI metadata.
Newer NCBI-derived data types, compressed SRA Aligned Read Format (SARFs) which contain preassembled contigs, reads mapped back to the contigs and variant calls, provide more starting points for analysis. Searchable SARFs metadata has associated statistics (e.g., contig length), k-mer based taxonomies, and annotations. Currently, SARFs are generated for a subset of submitted Illumina runs with at least 100 k-mer-based hits for SARS-CoV. This service will hopefully expand to all submitted data starting in 2022.
Cloud-based SRA and SARFs can be searched, transferred and used by researchers for little or no cost to speed up data-driven discovery on the cloud. NCBI’s SRA cloud resources are supported by NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
Short Abstract: Open Life Science (OLS) (openlifesci.org) is a 16-week mentoring and training program that helps researchers become ambassadors of open science (OS), with three key attributes. First, sharing basics and skills needed to create, lead and sustain an OS project. Second, connecting participants from diverse backgrounds through mentorship and peer-based learning. Third, empowering participants to become influential local OS ambassadors.
OLS and similar platforms are instrumental not just in spreading awareness of OS, but also in internalising its core values: knowledge exchange and networking, open leadership, and empowering others. These are needed to inspire a cultural change in communities. Particularly in low- and middle-income countries (LMiCs), OLS has helped boost awareness by supporting participants in applying OS principles to their projects.
As a case study, in OLS-2, I led a project to create a single bioinformatics pipeline as a command-line tool. My mentor encouraged me to join Galaxy admin training, which inspired me to work with a supervisor to deploy the Galaxy platform at our institute. I have been working to enhance understanding of Open Science in my university, which has inspired more students to formally learn about and apply open science in their work.
Short Abstract: Motivation. Modern methods of whole transcriptome sequencing accurately recover nucleotide sequences of RNA molecules present in cells and allow for determining their quantitative abundances. The coding potential of such molecules can be estimated using open reading frames (ORF) finding algorithms, implemented in a number of software packages. However, these algorithms show somewhat limited accuracy, are intended for single-molecule analysis and do not allow selecting proper ORFs in the case of long mRNAs containing multiple ORF candidates.
Results. We developed a computational approach, corresponding machine learning model and a package, dedicated to automatic identification of the ORFs in large sets of human mRNA molecules. It is based on vectorization of nucleotide sequences into features, followed by classification using a random forest. The predictive model was validated on sets of human mRNA molecules from the NCBI RefSeq and Ensembl databases and demonstrated almost 95% accuracy in detecting true ORFs. The developed methods and pre-trained classification model were implemented in a powerful ORFhunteR computational tool that performs an automatic identification of true ORFs among a large set of human mRNA molecules.
Availability. ORFhunteR is available at GitHub repository (github.com/rfctbio-bsu/ORFhunteR), from Bioconductor and as a web application (orfhunter.bsu.by).
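The two steps named above (enumerating ORF candidates, then vectorizing each candidate into features for a classifier) can be sketched roughly as follows. This is an illustrative sketch only, not ORFhunteR's implementation: the function names `find_orfs` and `kmer_features`, the `min_codons` cutoff, and the use of k-mer frequencies as features are all assumptions made for the example, and the classification step is omitted.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for every ATG...stop ORF in the three forward frames.

    Hypothetical helper: `min_codons` is an illustrative length cutoff.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def kmer_features(orf_seq, k=3):
    """Turn a candidate ORF into relative k-mer frequencies (an assumed feature choice)."""
    counts = Counter(orf_seq[i:i + k] for i in range(len(orf_seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


mrna = "CCATGAAATAGGG"
for start, end, frame in find_orfs(mrna):
    print(start, end, frame, kmer_features(mrna[start:end], k=1))
```

A long mRNA typically yields several candidates per frame, which is why a downstream classifier is needed to pick the true ORF.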
Short Abstract: Initial screening of peptides with desired functions demands lengthy and expensive preliminary research. Therefore, engaging computational methods such as sequence analysis, biomolecular 3D structural analysis, and machine learning have been proven very effective in preliminary peptide library analysis and screening. However, in the absence of user-friendly software, these techniques demand programming expertise from the end-users to utilize advanced algorithms pipeline and workflows. Therefore, it is hard for an end user to leverage these techniques in their research with ease. To address the above challenges, we have built peptide design and analysis under Galaxy package (PDAUG), a Galaxy-based python powered toolset, which includes workflows and datasets for rapid in-silico peptide library analysis. PDAUG includes 24 GUI-based tools for peptide library generation, data visualization, sequence retrieval, instant inbuilt sequence data access, peptide feature calculation, and machine learning modeling. In addition to this, PDAUG leverages the power of the Galaxy platform and facilitates researchers to combine PDAUG with hundreds of compatible existing Galaxy tools. A workflow predicting anticancer properties of peptides demonstrates its usability and applications. PDAUG can be accessed as a pre-built Docker image or directly installed from the Galaxy ToolShed.
Short Abstract: The Progenetix oncogenomics resource provides cancer genome profiling data and biomedical annotations as well as provenance data of cancer studies. More than 115k genomic copy number variation (CNV) profiles from over 700 cancer types empower oncogenomic analyses exceeding individual studies and diagnostic concepts and provide reference data for clinical diagnostics.
Progenetix has been utilized to develop data schemas for the Global Alliance for Genomics and Health and the European bioinformatics initiative ELIXIR. The Progenetix Beacon+ service has been instrumental in developing features of the Beacon protocol such as structural variant queries, data "handover" or the use of CURIEs in queries. Cancer annotations from Progenetix have informed conceptual requirements and mappings for GA4GH metadata concepts and schemas.
All data from Progenetix can be freely accessed through the website as well as APIs for data retrieval and image generation. Data APIs implement the Beacon v2 standard for queries and REST paths for genomic variants or biosample information. Additional services include annotated cancer genome profiling publications, ontology mapping services (e.g. NCIT - ICD-O, UBERON) and other utilities.
Here we demonstrate an open genomic reference resource around emerging GA4GH standards and how its data can be retrieved for different information and analysis scenarios.
Short Abstract: Sequencing viral genomes and collecting associated contextual data is key to the SARS-CoV-2 pandemic public health response. However, variation in contextual data across health regions often results in data sets that are small and/or difficult to compare due to variations in data structure, format, and values. To resolve this issue and improve virus surveillance in Canada, the Canadian COVID Genomics Network (CanCOGeN) VirusSeq has developed a data specification, fine tuned for SARS-CoV-2 and public health, that incorporates existing standards. Working openly and collaboratively with Open Biological and Biomedical Ontologies Foundry (OBOF) domain ontologies, specification fields and picklist values are being integrated into the open-source OBOF Genomic Epidemiology Ontology (GenEpiO) to provide an interoperable and controlled vocabulary that is responsive to the dynamic needs of different stakeholders. The specification contains 133 fields, structured according to the open Linked data Modeling Language (LinkML) classes and relations, and 1043 terms which we obtain from or collaborate on with OBOF domain ontologies. We seek to integrate terminology that both reflects the needs of COVID-19 surveillance and interoperable data sharing, while simultaneously creating a semantic model with nuanced query capabilities for downstream tool development.
Short Abstract: Reactome is a free, open-source, curated, and peer-reviewed knowledgebase of biomolecular pathways. As of the current release, its holdings comprise 13,732 human reactions organized into 2,516 pathways involving 11,073 proteins, 1,865 small molecules, and 415 drugs, with the support of 33,453 literature references. These molecular entity and event annotations are systematically represented as an ordered network of molecular reactions, forming a graph of biological knowledge. Reactome has migrated from a relational database to a Neo4j graph database for efficient storage, processing, and access with these highly interconnected data. We developed an interface for interacting between R and the Reactome graph database, minimizing the usage of Neo4j Cypher. The ReactomeGraph4R package retrieves data with a network structure from the graph database. It contains functions that permit complex tasks, such as i) finding hierarchical data of an instance; ii) getting the entire Reaction context using preceding or following relationships; and iii) displaying network data. Output data is formatted as R objects to assist with downstream analyses. We also developed a second R package, ReactomeContentService4R, to wrap the RESTful API of the Content Service. These packages will be available through Bioconductor and will fulfill the needs of Reactome pathway- and network-based tools for analyzing multiple omics datasets.
Short Abstract: Recent advances in cloud computing and container technologies create unique opportunities to provision computationally demanding, technically complex software for general scientific use. PGAP is a prokaryotic gene annotation engine that NCBI developed and uses for continuous update and growth of the RefSeq dataset, which currently contains more than 220,000 prokaryotic assemblies. It has been available as a stand-alone tool for anyone to use on their own data for over two years. The Read assembly and Annotation Pipeline Tool (RAPT) is a new tool that combines the SKESA short read assembler and PGAP, enabling assembly of bacterial and archaeal reads sequenced from isolates and annotation of the genomes produced. The RAPT project at NCBI employs cloud and container approaches to make PGAP available to the scientific community in new ways. Standalone RAPT can be installed on any user-owned resource that meets the hardware requirements. Our GCP RAPT version is available in the Google Cloud Shell environment without installation; it rents a computer that is billed to the user. Finally, the RAPT pilot service is accessed through a website and runs on GCP resources billed to NCBI. Our intent is to empower the prokaryotes community with convenient access to this trusted prokaryotic genome annotation resource.
Short Abstract: Characterizing the resistome is crucial for prevention and mitigation of emerging antibiotic resistance threats to animal and human health. We define the resistome to be the collection of antimicrobial resistance (AMR) genes and their precursors in microorganisms. These genes may be found embedded in a bacterial chromosome or plasmid. Reads2Resistome is a tool which allows researchers to assemble and annotate bacterial genomes using long or short read sequencing technologies, or both in a hybrid approach, with focus on the resistome characterization. Using a massively parallel analysis pipeline, Reads2Resistome performs assembly and annotation with the goal of producing an accurate and comprehensive description of a bacterial genome and AMR and virulence genes’ content. Reads2Resistome is freely available as an open-source package under the MIT license, and can be downloaded via GitHub (github.com/BioRRW/Reads2Resistome). Key features of the Reads2Resistome pipeline include quality control of input sequencing reads, genome assembly, genome annotation and AMR gene, virulence gene, and prophage characterization. Based on the case study we observed hybrid assembly, although the most time-intensive assembly method, produces a highly contiguous genome assembly with robust gene annotation, prophage identification, and resistome characterization compared to short read alone and long read alone assemblies.
Short Abstract: We present recent improvements to Magic-BLAST, an RNA-seq mapper that aligns long as well as short reads to target sequences in a BLAST database. The new features include accessing Sequence Read Archive (SRA) reads from cloud buckets, reporting National Center for Biotechnology Information (NCBI) taxonomy IDs, and reporting unaligned reads in a separate file.
A powerful Magic-BLAST feature is that it can access reads from SRA by accession, making pipelines with Magic-BLAST more reusable. NCBI hosts SRA data at cloud providers, supporting large scale hyper-parallel data analyses. Magic-BLAST 1.6.0 downloads SRA data from the cloud. Benefits include uninterrupted downloads, faster download speeds, and huge bandwidth for parallel downloads to multiple running processes or cloud instances.
Magic-BLAST provides NCBI taxonomy IDs in the SAM report. Using taxonomy IDs allows unambiguous mapping to taxonomic nodes. This is especially useful for metagenomic projects.
Magic-BLAST can report unaligned reads in a separate file in SAM, tabular, or FASTA formats. In this fashion Magic-BLAST can, for example, be used as a filter for contamination screening or for separating virus and host reads from a next-generation sequencing run.
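For readers who work with a combined SAM report instead, the same aligned/unaligned separation can be sketched in a few lines of Python. This is a downstream illustration, not Magic-BLAST code or one of its options; the helper name `split_sam` and the toy records are made up for the example. It relies only on the SAM convention that bit 0x4 of the FLAG column marks an unmapped read.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_sam(lines):
    """Split SAM text lines into (headers, aligned, unaligned) lists.

    Hypothetical helper for illustration; it only inspects the FLAG column.
    """
    headers, aligned, unaligned = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)  # header lines are kept separately
            continue
        flag = int(line.split("\t")[1])  # column 2 is FLAG
        if flag & UNMAPPED_FLAG:
            unaligned.append(line)
        else:
            aligned.append(line)
    return headers, aligned, unaligned


# Toy records (made-up read and reference names).
sam_lines = [
    "@HD\tVN:1.6",
    "read1\t0\tref1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
headers, aligned, unaligned = split_sam(sam_lines)
print(len(headers), len(aligned), len(unaligned))
```

In a contamination screen, the `unaligned` list would hold the reads that did not match the screening database.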
This research was supported by the Intramural Research Program of the National Library of Medicine at the NIH.
Short Abstract: The SNP-seek Project is a SNP genotyping database initiated in 2014 with approximately 60 billion SNP genotypes generated from the 3000 Rice Genome Project. Since the last update in 2017, SNP-seek has been continuously developed through the addition of new datasets, features, visualizations, web services, use cases and tools to support the science community and collaborators in their research. The integration of Galaxy, an open-source web-based platform for data integration and analysis, is the latest significant enhancement in SNP-seek: users can directly export output files or search results into Galaxy and use its available tools for further analysis. To support collaboration and expansion of the project, its architecture is currently being redesigned so that the front-end and back-end are flexible and reusable. Developers now have easier control of the user interface and can load appropriate libraries and set URLs of embedded external sites and tools such as JBrowse, Galaxy, R, among others via configuration files. SNP-seek is now available in Docker, which permits users to set up and utilize a local copy of the project.
Short Abstract: Slivka is a Python 3 server framework designed to fill a highly specific niche: effortless exposure of command-line programs as RESTful JSON services. It achieves this by providing endpoints for discovery of available services, transfer of input data, validation and submission, monitoring, and results retrieval. Services are defined in YAML, which can include semantic annotation. Results from one service can be passed as input to another without the need to download them first. Executables can be configured to run locally or via an Altair Grid Engine managed cluster. New executors can be easily added to enable re-use of existing middleware.
Slivka-bio is a conda-based Slivka application for bioinformatics that is the successor to JABAWS (www.compbio.dundee.ac.uk/jabaws), a Java SOAP web services system. Both are provided by the Dundee Resource for Protein Sequence Analysis and Structure Prediction (www.compbio.dundee.ac.uk/drsasp) with support from the UKRI’s BBSRC.
On installation, Slivka-bio provides a ‘nearly’ ready-to-start system including all necessary third-party biological sequence analysis tools. Tools can be executed remotely with the help of libraries in Python and Java. A graphical interface will also feature in the next major release of Jalview (www.jalview.org) and a command-line client is in development. See www.compbio.dundee.ac.uk/slivka for tutorials and more information.
Short Abstract: Data Coordinating Centers (DCC) are becoming an integral part of large-scale biomedical research projects. A DCC is typically tasked with developing solutions to manage high volumes of biomedical data. The DCC part of the Human Tumor Atlas Network (HTAN) — an NCI-funded Cancer Moonshot initiative — develops infrastructure to coordinate data generated from over 30 cutting-edge assays, spanning multiple data modalities and 10 research institutions.
Streamlining defining metadata standards, data annotation, (meta)data compliance checks, and tracking provenance in HTAN is fundamental for releasing Findable, Accessible, Interoperable and Reusable (FAIR) data.
The schematic package developed by the HTAN DCC enables these FAIR-data use cases. Schematic (Schema Engine for Manifest-based Ingress and Curation) provides:
• User-friendly interface for developing interoperable, standardized data models
• Services for generating data model compliant (meta)data submission spreadsheets
• Asset store interfaces associating metadata with data on various cloud platforms
Examples of schematic integrations in HTAN are: HTAN Data Curator (R Shiny application facilitating submission of standardized (meta)data); the HTAN Data Portal (Next.js application previewing all metadata across 100TBs of data). Schematic provides business and data model logic for these two services.
Schematic is distributed as an open-source Python package, registered on PyPI as schematicpy.
Short Abstract: Genotype-guided therapy promotes drug efficacy and safety. However, accurately calling star alleles (haplotypes) in cytochrome P450 (CYP) genes, which encode over 80% of drug-metabolising enzymes, is challenging. Notably, CYP2D6, CYP2B6 and CYP2A6, which have neighbouring pseudogenes, present short-read alignment difficulties, high polymorphism and complex structural variations. We present StellarPGx, a Nextflow pipeline for accurately genotyping CYP genes by leveraging genome graph-based variant detection and combinatorial star allele assignments. StellarPGx has been validated using 109 whole genome sequence samples for which the Genetic Testing Reference Material Coordination Program (GeT-RM) has recently provided consensus truth CYP2D6 alleles. StellarPGx had the highest CYP2D6 genotype concordance (99%) to GeT-RM compared to existing callers, namely Cyrius (98%), Aldy (82%) and Stargazer (84%). The implementation of StellarPGx using Nextflow, Docker and Singularity facilitates its portability, reproducibility and scalability on various user platforms. StellarPGx is publicly available from github.com/SBIMB/StellarPGx.
Short Abstract: All eukaryotes express tens of thousands of long non-coding RNAs (lncRNAs) that are often associated with gene expression regulation either in cis or trans. While the function and biological relevance of a few lncRNAs have now been well described, for the vast majority much remains to be learned. LncRNA annotation and curation relies on de novo transcript annotation and coding potential assessment, mostly achieved by fragmented pieces of code and ad hoc scripts, which can result in untrackable and non-reproducible workflows or pipelines.
The release of extensive RNA-Seq data across wheat cultivars and Triticeae species offers a unique opportunity to comprehensively annotate lncRNAs in Triticum aestivum and assess their variation in expression across tissues and accessions.
We, therefore, developed a suite of open-source reproducible and species agnostic pipelines to identify and characterise lncRNAs. The pipelines are developed utilising the Snakemake workflow management system and conda virtual environments, which makes them platform-independent.
We applied these pipelines to large (17Gb, ~450,000 genes) and complex (hexaploid) genomes, and annotated more than 180,000 high-confidence novel intergenic lncRNA genes using 148 RNA libraries across 6 different tissue samples of 9 wheat cultivars. This pipeline also includes a collection of scripts that allows the analysis of lncRNAs expression patterns across tissues and accessions. An exemplar of these enables the annotation and analysis of homeologous lncRNA found across the three wheat sub-genomes.
The pipeline is available on GitHub (github.com/TGAC/lncRNA-analysis) under the MIT license.
Short Abstract: The Gene Ontology (GO) is a cornerstone of genomics research that drives discoveries through knowledge-informed computational analysis of biological data from large-scale assays. Key to this success is how the GO can be used to support hypotheses or conclusions about the biology or evolution of a study system by identifying annotated functions that are overrepresented in subsets of genes of interest. Graphical visualizations of such GO term enrichment results are critical to aid interpretation and avoid biases by presenting researchers with intuitive visual data summaries. Currently there is a lack of standalone open-source software solutions that facilitate explorations of key features of multiple lists of GO terms. To address this we developed GO-Figure!, an open-source Python software for producing user-customisable semantic similarity scatterplots of redundancy-reduced GO term lists. The lists are simplified by grouping terms with similar functions using their information contents and semantic similarities, with user-control over grouping thresholds. Representatives are selected for plotting in two-dimensional semantic space where similar terms are placed closer to each other, with an array of user-customisable graphical attributes. GO-Figure! offers a simple solution for command-line plotting of informative summary visualizations of lists of GO terms, designed for exploratory data analyses and dataset comparisons.
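The redundancy-reduction idea described above (grouping similar terms under a representative, with a user-set threshold) can be illustrated with a simple greedy scheme. This is a simplified sketch and not necessarily the algorithm GO-Figure! uses: the processing order (highest information content first), the function name `reduce_terms`, and the toy term names, information contents and similarity values are all assumptions for the example.

```python
def reduce_terms(terms, info_content, similarity, threshold=0.5):
    """Greedily group terms under representatives.

    Terms are visited from highest to lowest information content (an assumed
    ordering); each term joins the first representative whose similarity to it
    reaches `threshold`, otherwise it becomes a new representative.
    """
    groups = {}
    for term in sorted(terms, key=lambda t: info_content[t], reverse=True):
        for rep in groups:
            if similarity(term, rep) >= threshold:
                groups[rep].append(term)
                break
        else:
            groups[term] = [term]  # no similar representative found
    return groups


# Toy inputs: placeholder term names and made-up scores.
ic = {"term_a": 3.0, "term_b": 2.0, "term_c": 1.0}
pair_sim = {
    frozenset(["term_a", "term_b"]): 0.8,
    frozenset(["term_a", "term_c"]): 0.1,
    frozenset(["term_b", "term_c"]): 0.2,
}


def sim(x, y):
    return pair_sim[frozenset([x, y])]


print(reduce_terms(ic.keys(), ic, sim, threshold=0.5))
```

Raising the threshold keeps more terms as separate representatives; lowering it merges more aggressively, which mirrors the user control over grouping mentioned in the abstract.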
Short Abstract: Interactive notebook systems have made significant strides toward realizing the potential of reproducible research, providing environments where users can interleave descriptive text, mathematics, images, and executable code into a “live” sharable, publishable “research narrative.” However, many of these systems require knowledge of a programming language and are therefore out of the reach of non-programming investigators. Even for those with programming experience, many tools and resources are difficult to incorporate into the notebook format. To address this gap, we have developed the GenePattern Notebook environment, which extends the popular Jupyter Notebook system to interoperate with the GenePattern platform for integrative genomics, making the hundreds of bioinformatics analysis methods in GenePattern available within notebooks. The GenePattern Notebook environment provides a free online workspace (notebook.genepattern.org) where investigators can create, share, run, and publish their own notebooks. A library of featured notebooks provides common bioinformatics workflows that users can copy and adapt to their own analyses. This talk will describe how the GenePattern Notebook environment extends the analytical and reproducible research capabilities of Jupyter Notebook and GenePattern and will discuss novel features and recent additions to the environment, including new bioinformatics tools, capabilities for ease of use, and migration to the JupyterLab interface.
Short Abstract: Adoption of genomics in biomedical research and healthcare is increasing rapidly due to the transformative power of precision medicine. However, file-based tools and methodologies cannot scale with predicted growth in genome numbers. The outcome is siloed genomic datasets that are not utilised to their full extent.
Our solution is OpenCGA, a ground-breaking variant store that brings the “big data” stack to genomics; a normalised “VCF database” that is rapid, scalable and secure. Users can de-duplicate and merge VCFs from multiple samples from any genotyping assay (WGS, WXS, panels, microarray etc). Records are linked to corresponding sample and clinical metadata and unique variants are annotated against the latest reference information.
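As a toy illustration of what de-duplicating and merging per-sample variant calls means, the sketch below collapses records from several samples into one entry per unique variant, keyed by chromosome, position, reference and alternate allele, with genotypes stored per sample. It is not OpenCGA's storage model or API; the function name `merge_variants`, the record layout and the sample data are assumptions for the example.

```python
def merge_variants(samples):
    """Merge per-sample variant records into one entry per unique variant.

    `samples` maps a sample name to a list of (chrom, pos, ref, alt, genotype)
    tuples. The result maps (chrom, pos, ref, alt) to {sample: genotype}.
    """
    merged = {}
    for sample, records in samples.items():
        for chrom, pos, ref, alt, genotype in records:
            key = (chrom, pos, ref, alt)  # identity of the variant
            merged.setdefault(key, {})[sample] = genotype
    return merged


# Made-up calls for two samples sharing one variant.
calls = {
    "sample1": [("1", 1000, "A", "G", "0/1"), ("2", 500, "C", "T", "1/1")],
    "sample2": [("1", 1000, "A", "G", "1/1")],
}
for variant, genotypes in sorted(merge_variants(calls).items()):
    print(variant, genotypes)
```

Because each unique variant is stored once, annotation only has to be computed per variant and not per sample.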
Web services, accessed via pyopencga, opencgaR, a command line interface, and a web user interface, support flexible, real-time genotype-phenotype queries and asynchronous jobs. Common genomic standards such as GA4GH Beacon and htsget are implemented.
OpenCGA is used by several of the world’s most challenging genomics initiatives. Development, led by the University of Cambridge and Genomics England, has attracted 37 contributors to date. The software is supported by a growing community of users and we would be delighted for more to join us!
Short Abstract: Transcriptome annotation is a complex process that requires the integration of multiple databases using several computational tools. The annotation process is an important step in developing an understanding of the biological complexity of an organism. The advances in next-generation sequencing technologies and the decrease in the cost of sequencing a complete transcriptome are driving a new era in which annotation is increasingly important and very productive, especially for unannotated organisms. Cloud computing offers a reduced cost of compute resources that makes them accessible for large computational biology experiments such as transcriptome annotations. Here, we describe a complete methodology for the annotation of plant transcriptomes in the cloud, including two examples for the organisms Opuntia streptacantha and Physalis peruviana. In addition, we include a comparative study of multiple BLAST sequence alignments using two public cloud providers: AWS and GCP. We demonstrate that the public cloud providers are a practical alternative for executing advanced computational biology experiments at quite low cost. Using our cloud recipes, a transcriptome with ~500,000 transcripts can be annotated in less than 2 hours at a computing cost of about USD 200-250.
Short Abstract: Tripal (tripal.info) is a framework for construction of online biological community-oriented databases. It is specifically tailored for storing, displaying, searching and sharing genetics, genomics, breeding and ancillary data. It is in use around the world in multiple independent installations housing data for thousands of species. Its purpose is to reduce the resources needed to construct such online repositories, to do so using community standards, and to meet FAIR data principles. Tripal currently provides tools that allow site developers to easily import biological data in standard and custom flat file formats, create custom data pages and search tools, and share all data using RESTful web services built from controlled vocabularies. Tripal provides a documented API allowing for precise customization which can be shared as extension modules. There are currently over 40 community-contributed extension modules available for others to enhance their sites. Tripal is governed by an active community of developers and stakeholders serving in advisory and project management committees.
Short Abstract: Sending analysis workflows to data will be essential to cope with growing amounts of genomic data in healthcare and may increase efficiency and data security. The Global Alliance for Genomics and Health (GA4GH) defines several standards for processing data in a cloud framework. Specifically, the GA4GH Workflow Execution Service (WES) specifies an interface for transmitting workflows and their configuration to a computing platform. Here, we present the open-source software WESkit, which implements the GA4GH WES standard for Nextflow and Snakemake. We develop WESkit for big data throughput, high stability, security, HPC support, and extensibility for further workflow systems. WESkit implements asynchronous and distributed task execution, extensive logging, automated testing, and a user management system. The software is used to process whole-genome cancer sequencing data at the Deutsches Krebsforschungszentrum Heidelberg and the Charité Universitätsmedizin Berlin and will serve as a general interface between data management software and workflow management systems that run their jobs in high-throughput computing or cloud environments. The WESkit repository contains comprehensive documentation and configurable scripts, which start the required software stack, including the web application, Keycloak, Redis, MongoDB, and Celery on a Docker swarm cluster.
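To make the idea of transmitting a workflow and its configuration more concrete, the sketch below assembles the form fields of a run request as understood here from the GA4GH WES 1.0 specification (`workflow_url`, `workflow_type`, `workflow_type_version`, and `workflow_params` as a JSON string). It only builds the request and does not contact any server. The helper name `build_run_request`, the workflow file name, the type string and version, and the parameter values are hypothetical; check a given WES server's service info for the types and versions it actually accepts.

```python
import json

# Path of the run-submission endpoint relative to the server root,
# following the WES 1.0 base path (the server itself is not specified here).
RUNS_PATH = "/ga4gh/wes/v1/runs"


def build_run_request(workflow_url, workflow_type, workflow_type_version, params):
    """Return the form fields for a WES run submission (illustrative helper)."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # WES expects the workflow parameters serialized as a JSON string.
        "workflow_params": json.dumps(params),
    }


# Hypothetical example values.
request_fields = build_run_request(
    workflow_url="Snakefile",
    workflow_type="SMK",
    workflow_type_version="6.0",
    params={"sample": "example.fastq"},
)
print("POST", RUNS_PATH)
print(request_fields)
```

A client would send these fields as multipart form data to the endpoint and then poll the returned run identifier for status.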
Short Abstract: Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but significantly lag behind in supporting component reuse and sharing, which discourages the community from adopting the widely practiced Don’t Repeat Yourself (DRY) principle.
To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (www.icgc-argo.org) has adopted a modular approach in which a set of "best practice" genome analysis workflows has been encapsulated in well-defined packages, which are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows which extensively reuse component packages. This flexible architecture makes it possible for anyone in the bioinformatics community to reuse the packages within their own workflows.
Recently, we have developed an open-source command-line interface (CLI) tool called WorkFlow Package Manager (WFPM) CLI that provides assistance throughout the entire workflow development lifecycle to promote best-practice software design and maintenance. With a highly streamlined process and automation in template code generation, continuous integration testing and releasing, WFPM CLI significantly lowers the barriers for users to develop standard packages. WFPM CLI source code is available at: github.com/icgc-argo/wfpm
Short Abstract: The WorkflowHub (workflowhub.eu) is a new FAIR workflow registry sponsored by the European RI Cluster EOSC-Life and the European Research Infrastructure ELIXIR. It is workflow management system agnostic: workflows may remain in their native repositories in their native forms. As workflows are multi-component objects, including example and test data, they are packaged, registered, downloaded and exchanged as workflow-centric Research Objects using the RO-Crate specification, making the Hub an implementation of FAIR Digital Object principles. A schema.org-based Bioschemas profile describes the metadata about a workflow, and the encouraged use of the Common Workflow Language provides a canonical description of the workflow itself. Workflow management systems such as Galaxy, Nextflow, and Snakemake are working with the Hub to seamlessly and automatically support object packaging, registration and exchange. The WorkflowHub provides features such as community spaces, collections, versioning and snapshots, and contributor credit. It supports community registry standards and services such as GA4GH TRS and ELIXIR-AAI, and integrates with the LifeMonitor workflow testing service.
The WorkflowHub Club open community co-develops the Hub. Beta-released in Sept 2020, the Hub now holds nearly 100 workflows, including 36 curated COVID-19 workflows. It is a listed resource of the European COVID19 Data Portal (www.covid19dataportal.org/).
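The RO-Crate packaging mentioned in the WorkflowHub abstract can be sketched as follows. This is a simplified illustration of an `ro-crate-metadata.json` document in the style of RO-Crate 1.1 with a workflow as its main entity; the file name `main.nf`, the workflow name and the helper function are hypothetical, and the result is not a complete or validated crate as the Hub would produce it.

```python
import json


def workflow_crate_metadata(workflow_file, name, language_name):
    """Build a minimal RO-Crate metadata document describing one workflow."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: identifies this file and the crate root.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: the crate itself, pointing at its main workflow.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "mainEntity": {"@id": workflow_file},
                "hasPart": [{"@id": workflow_file}],
            },
            {   # The workflow file, typed as a computational workflow.
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "programmingLanguage": {"@id": "#language"},
            },
            {   # The workflow language, referenced by the workflow entity.
                "@id": "#language",
                "@type": "ComputerLanguage",
                "name": language_name,
            },
        ],
    }


# Hypothetical example values.
crate = workflow_crate_metadata("main.nf", "Example workflow", "Nextflow")
print(json.dumps(crate, indent=2))
```

Because the workflow file is only referenced by identifier, the same metadata shape applies whichever workflow management system the file belongs to.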