Presentation Overview
Life science is becoming increasingly interdisciplinary, making career-spanning learning critical. New methods and deeper research questions require scientists and educators to traverse widening skill gaps to remain competent. Short-format training (SFT), including workshops, bootcamps, and short courses, is a solution many turn to. However, the effectiveness of SFT is lower than many realize: it is often delivered without sufficient grounding in evidence-based pedagogy, and systemic inequity limits the inclusion of all learners.
I will describe the work of an international group of scientist-educators to develop a new construct — the “Bicycle Principles”. The Principles assemble education science and collective experience into a framework for improving SFT through two cyclic (hence bi-cycle) and iterative processes:
“Core Principles” (“Best Evidence”, “Catalytic”, “Effective”, and “Inclusion”) apply to all SFT and are grounded in research and work in education, diversity, equity, and inclusion.
“Community Principles” (“Reach”, “Scale”, and “Sustain”) apply when SFT is organized by groups to achieve the objectives of a community (e.g., science discipline, institution, career stage).
Community refinement, adoption, and adaptation of the Bicycle Principles will help accelerate scientific progress by making SFT more effective, inclusive, and career-spanning for all.
Presentation Overview
Most scientists would agree that success in science should be determined solely by merit. However, success in STEM fields is still profoundly influenced by other factors such as race, gender, and socioeconomic status. Numerous studies have documented gender bias throughout the publication process: women publish less than men, are less likely to be in the first position among authors who contributed equally, and are cited less than men. Gender disparities are also noticeable in less externally constrained behaviours, such as the number of questions asked at scientific conferences.
As an interdisciplinary team with diverse skill sets (anthropology, statistics, and UX design), we observed question-asking behaviour by gender during the 2021 JOBIM virtual conference. We gathered quantitative and qualitative data, including a registration survey with detailed demographic information, a post-conference survey on question-asking motivations, live observations, and in-depth interviews of participants. Quantitative analysis highlighted several new findings, such as a substantial fraction of the audience identifying as LGBTQIA+ and increased attendance by women at the virtual JOBIM conference compared with in-person conferences. Notably, the observations revealed a persisting underrepresentation of questions asked by women. Interviews of participants highlighted several barriers to oral expression encountered by gender minorities in STEM.
Presentation Overview
The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures presented unique challenges to collaborative science. We applied the open-source, open-publishing software Manubot to the challenge of synthesizing the COVID-19 literature. We invited contributors to summarize and critique papers and draft a review manuscript. Manubot rendered the manuscript into several output formats, immediately updating the online text in response to new content.
We added features to Manubot to handle challenges posed by COVID-19. We developed continuous integration workflows to retrieve up-to-date data from online sources nightly, regenerating some of the manuscript’s figures, tables, and statistics as the data changed. We added functionality to compile citation information not only from references such as papers, preprints, and websites, but also from unique identifiers such as clinical trial IDs. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated nearly 2,000 sources and developed seven literature reviews. This project illustrates the power of open publishing to empower biological collaborations even in the face of unique challenges.
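As an illustration of the nightly-update pattern described above, a minimal sketch of such a refresh step might look like the following; the data URL, column names, and output paths are hypothetical, and the actual workflows live in the project's continuous integration configuration.

```python
# Minimal sketch of a nightly data-refresh step (the data source, columns,
# and paths are hypothetical; the real logic lives in the project's CI).
import json
import pandas as pd
import matplotlib.pyplot as plt

DATA_URL = "https://example.org/covid19/daily-stats.csv"  # hypothetical

def refresh_outputs() -> None:
    # Retrieve the latest upstream data; under CI this runs on a schedule.
    df = pd.read_csv(DATA_URL, parse_dates=["date"])

    # Regenerate a manuscript figure from the fresh data.
    fig, ax = plt.subplots()
    ax.plot(df["date"], df["cases"])
    ax.set_xlabel("Date")
    ax.set_ylabel("Reported cases")
    fig.savefig("content/images/daily-cases.svg")

    # Export updated statistics for substitution into the manuscript text.
    with open("content/stats.json", "w") as fh:
        json.dump({"latest_cases": int(df["cases"].iloc[-1])}, fh)

if __name__ == "__main__":
    refresh_outputs()
```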
Presentation Overview
Bioinformatics projects pose many challenges, and overcoming those challenges means progress, with rewards along the way. However, the real exploration is not limited to the data analysis itself. There are better ways to share the intermediate work: which method you chose, how you carried out the analysis, and what happened when the outcome was unexpected. These mid-step efforts can also become bioinformatics publications. Here, as an early-career researcher working on different bioinformatics projects, I will share my experience with real protocol cases, such as using a pipeline for minimizing redundancy and complexity in large phylogenetic datasets, executing a tool to merge and minimize “bad words” from BLAST hits against multiple eukaryotic gene annotation databases, and running a web server for identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes.
Presentation Overview
The cloud is an appealing platform for bioinformaticians. They can bring up compute infrastructure on demand and at the scale they require, and they can access voluminous amounts of open-access data hosted in cloud environments. To make use of the cloud, it is valuable to have cloud-native implementations of important bioinformatics packages.
We discuss ElasticBLAST, a cloud-native package to produce alignments with the Basic Local Alignment Search Tool (BLAST). Built on top of the stand-alone BLAST+ command-line package, ElasticBLAST works with a range of query inputs, handling anything from a few to millions of query sequences. ElasticBLAST instantiates cloud instances, dispatches work to them, and deletes resources when it is done. ElasticBLAST is supported on Amazon Web Services (AWS) and Google Cloud Platform (GCP). We discuss the implementation of ElasticBLAST, show usage examples, and discuss performance.
In the last year, several updates have been made to ElasticBLAST. First, it automatically selects an instance type for the BLAST runs based on the database. Second, it shuts down cloud resources at the end of the run. Finally, ElasticBLAST's throughput has been improved.
The source code and documentation for ElasticBLAST are freely available on GitHub.
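To give a feel for the workflow described above, here is a hedged sketch of driving ElasticBLAST from Python; the subcommands follow the documented CLI (submit, status, delete), but the exact flags and the results bucket below are assumptions to verify against the current documentation.

```python
# Illustrative only: flag names should be checked against the ElasticBLAST
# documentation; the results bucket is hypothetical.
import subprocess

RESULTS = "s3://my-bucket/elastic-blast-results"  # hypothetical bucket

# Submit a search; ElasticBLAST provisions instances and splits the queries.
subprocess.run(
    ["elastic-blast", "submit",
     "--program", "blastn",
     "--db", "nt",
     "--query", "queries.fa",
     "--results", RESULTS],
    check=True,
)

# Check on progress, and clean up any remaining cloud resources when done.
subprocess.run(["elastic-blast", "status", "--results", RESULTS], check=True)
subprocess.run(["elastic-blast", "delete", "--results", RESULTS], check=True)
```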
Presentation Overview
Due to inconsistencies in the ways in which gene fusions are described in clinical practice and biomedical literature, the ability to successfully integrate the functional and clinical significance of fusion events in informing patient care is limited. With the aim of developing recommendations for the standardized representation of gene fusions, experts from the Variant Interpretation for Cancer Consortium (VICC), in consultation with members of Cancer Genomics Consortium, ClinGen, and the CAP/ACMG Joint Cytogenetics Committee, collaborated to create both a unified framework for the description of fusion events and a set of associated computational tools.
We will present FUSOR (Fusion Object Representation), a Python-based software development kit of modeling and validation tools for the computational representation of fusion events. We demonstrate FUSOR’s ability to convert output from commonly-used fusion detection algorithms such as CICERO and JAFFA into a machine-readable format. We also introduce the VICC Fusion Curation Interface, a web-based service that allows users to translate data describing fusion events into syntax that aligns with the proposed VICC gene fusion nomenclature system. These tools address the need for services that automate the curation of gene fusions, improving precision in the computational translation of gene fusions to clinical care.
Presentation Overview
Biomedical researchers take advantage of high-throughput, high-coverage technologies to routinely generate sets of genes of interest across a wide range of biological conditions. Although these technologies have directly shed light on the molecular underpinnings of various biological processes and diseases, the list of genes from any individual experiment is often noisy and incomplete. Additionally, interpreting these lists of genes can be challenging in terms of how they are related to each other and to other genes in the genome. In this work, we present open source software, as both a web server (https://www.geneplexus.net/) and Python package (https://pypi.org/project/geneplexus/), that allows a researcher to utilize a powerful, network-based machine learning method to gain insights into their gene set of interest and additional functionally similar genes. Once a user uploads their own set of genes and chooses between a number of different network representations, GenePlexus provides predictions of how associated every gene in the network is to the input set. The web server and Python package also provide interpretability through network visualization and comparison to other machine learning models trained on thousands of known process/pathway and disease gene sets.
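As a rough sketch of how the Python package might be used (the class, argument, and method names here are taken from our reading of the package documentation and should be treated as assumptions; consult the PyPI page for the current interface):

```python
# Hypothetical usage sketch; names and return values may differ between
# versions of the geneplexus package -- consult the documentation.
import geneplexus

genes = ["CASP3", "CASP7", "CASP8"]  # an illustrative input gene set

gp = geneplexus.GenePlexus(
    file_loc="data",      # local cache for the backing network data
    net_type="BioGRID",   # chosen network representation
    features="Embedding",
    gsc="GO",
)
gp.load_genes(genes)

# Train the network-based model and score every gene in the network
# for association with the input set.
mdl_weights, df_probs, avgps = gp.fit_and_predict()
print(df_probs.head())
```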
Presentation Overview
https://aiqc.readthedocs.io
In working with pharma to analyze the UKBB & TCGA for the genomic drivers of disease, I was frustrated that association studies were still the primary method of analysis. So I built AIQC as an open source, high-level framework to make rigorous deep learning protocols accessible to the research community.
Since then, AIQC has been sponsored by the Python Software Foundation and has collaborated with organizations such as Boston Children’s Hospital, analyzing the proteomics of neurodegenerative diseases, and Dana-Farber, delineating disease subtypes of bladder cancer.
AIQC is the only open-source experiment tracker that simplifies the end-to-end machine learning lifecycle: dataset registration, data preparation, batched neural network tuning, automatic generation of performance metrics & plots, permutation-based feature importance, inference, and decoding.
The purpose of this talk is to demonstrate how researchers can use the AIQC framework to easily integrate deep learning into a wide array of bioinformatics use cases:
Talk agenda (3-5 min per topic):
- Bioinformatics challenges & how deep learning can help solve them.
- Deep learning analyses (multi-omic, longitudinal, disease subtyping).
- Machine learning challenges that AIQC solves.
- Demo: TCGA tumor classification using gene expression.
- Demo: Histological detection of brain tumors.
- Demo: MAPK structural compound selection criteria.
Presentation Overview
RNA sequencing (RNA-seq) data from space biology experiments yield insights into the effects of spaceflight on living systems. However, sample numbers from spaceflight studies are low due to limited crew availability, hardware, and space. To increase statistical power, individual spaceflight RNA-seq datasets are often aggregated. This can introduce technical variation, or “batch effects”, due to differences in sample handling, sample processing, and sequencing platforms.
In this study, we used 7 mouse liver RNA-seq datasets from NASA GeneLab to evaluate 5 common batch effect correction tools (ComBat and ComBat_seq from the sva R package, and Median Polish, Empirical Bayes, and ANOVA from the MBatch R package). We quantitatively evaluated the performance of these tools in the spaceflight context using differential gene expression analysis, BatchQC, principal component analysis, log fold change correlation, and dispersion separability criterion. We developed a standardized approach to identify the optimal correction algorithm and correction variable by geometrically probing the space of all allowable scoring functions to yield an aggregate volume-based scoring measure.
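As a simplified illustration of the PCA-based part of this evaluation (not the authors' exact scoring pipeline), one can project the expression matrix onto principal components and quantify how separable the batches remain:

```python
# Minimal sketch: quantify residual batch separation after correction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def batch_separability(expr: np.ndarray, batches: np.ndarray) -> float:
    """expr: samples x genes matrix; batches: one batch label per sample.
    Lower values after correction suggest reduced batch effects."""
    pcs = PCA(n_components=2).fit_transform(expr)
    return silhouette_score(pcs, batches)

# Hypothetical usage with uncorrected and corrected matrices:
# before = batch_separability(raw_counts, batch_labels)
# after = batch_separability(combat_counts, batch_labels)
```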
Finally, we describe the GeneLab multi-study visualization and analysis user portal which incorporates these scoring metrics to allow users to evaluate and select an optimal batch correction method and variable.
Presentation Overview
The prediction of transcription termination sites is important for various in vitro and in vivo experiments, including drug targeting and the annotation of genes and operons. Although prediction software exists, it is either biased toward the organisms selected for prediction or requires additional experimental data, which is complex and time-consuming at large scale. We developed the INTERPIN (INtrinsic TERmination hairPIN) database for intrinsic transcription terminator predictions in bacterial genomes. It covers 12,745 bacteria and is the largest collection of terminators to date. We have introduced cluster hairpins [1], groups of contiguous hairpins <14 bases from each other that cause transcription termination.
The database provides information about predicted cluster and single hairpins, along with raw termination-site prediction files. The IGV genome viewer has been integrated to allow visualization of hairpin and operon predictions across the whole genome. Secondary and tertiary structures of any selected hairpin can be viewed and analyzed using the tools RNAComposer and iCn3D. A detailed help page explains all the available features with the help of examples.
[1] S. Gupta and D. Pal, “Clusters of hairpins induce intrinsic transcription termination in bacteria,” Sci Rep, vol. 11, no. 1, p. 16194, Aug. 2021, doi: 10.1038/s41598-021-95435-3.
Presentation Overview
FreeSurfer is a set of freely available, open-source algorithms for the structural and functional analysis of MRI neuroimaging data. FreeSurfer's infant analysis pipeline provides capabilities for both volumetric and surface-based morphological analysis of subjects aged 0 to 24 months. To date, this population has been under-served by the major neuroimaging software packages.
To increase the accessibility of FreeSurfer's infant pipeline and to enhance its reproducibility, we have developed a set of container-based methods to build and execute the pipeline as well as visually inspect its results. We do so by leveraging the existing tools Neurodocker and Neurodesktop.
Neurodocker is a “command-line program that generates custom Dockerfiles and Singularity recipes for neuroimaging”. We have extended Neurodocker to support creating containers from source. This facilitates the creation of a continuous integration workflow, allowing every change to the source code to be automatically tested.
The outputs of FreeSurfer's infant pipeline can be visually inspected using FreeSurfer’s visualization tool FreeView. We use Neurodesktop for a convenient container-based way to access FreeView via a web browser. Neurodesktop is a “plug-and-play, browser-accessible, containerized data analysis environment” for neuroimaging.
Presentation Overview
JBrowse 2 is a new genome browser with unique visualization capabilities, including synteny views, circular views, whole-genome overviews, and structural variant focused features. We present a case study of using JBrowse on WormBase.org, the C. elegans model organism database. We cover the steps involved in migrating datasets from GBrowse, JBrowse 1, and GBrowse_syn into the JBrowse 2 configuration. We look at the administrative and configuration options used for making track visualizations, and show how it can be set up on cloud storage as a static site. Finally, we cover how JBrowse can be integrated with feature pages by embedding JBrowse 2’s genome browser on the page, or by statically generating SVG images of genome regions. In addition, we will provide a general overview of the JBrowse 2 platform, and show how plugins can be used to extend core JBrowse features to add custom track types, data adapters, and entirely custom views. JBrowse 2 has seen increasing growth since its release in 2020, with many optimizations, features, and improvements made in response to feedback from users. JBrowse and WormBase are open source, with JBrowse licensed under the Apache 2.0 license and WormBase under the MIT license.
Presentation Overview
Genome browsers are essential tools for integrating, exploring, and visualizing data from large genomics datasets. JBrowse 2 is a modern genome browser written in JavaScript with novel features for visualizing structural variants, syntenic alignments, and other biological data. To facilitate academic collaboration and innovation, JBrowse 2 has been designed with a pluggable interface that permits the development of features tailored to the specific needs of an individual or organization. We launched the JBrowse 2 Plugin Store to provide community-developed plugins to all users and to expose crucial feature sets to researchers. The plugin system spans all aspects of the JBrowse 2 application and enables the development of new track types (e.g. Manhattan plots, Hi-C data, splice junctions), data adapters (e.g. API endpoint adapters for the NCI GDC and the ICGC), and views (e.g. dot plots, multiple sequence alignments, and ideograms). Plugin development is made simple with the availability of a plugin template, numerous existing contributions to reference, and extensive documentation on the JBrowse 2 website to assist developers in getting their plugin running. Here, we present the capabilities of some JBrowse 2 plugins, describe usage scenarios for biologists and bioinformaticians, and outline JBrowse 2’s design patterns and tools for software developers.
Presentation Overview
To fulfill the goal of making biomedical data more accessible and reusable, web Application Programming Interfaces (APIs) have become a common protocol for disseminating knowledge sources. Serving as an annotation-as-a-service web API, MyVariant.info provides a high-performance, scalable interface for querying variant annotation information. Built upon the BioThings SDK with a cloud-based infrastructure, MyVariant.info inherits the capability of automated integration of variant annotations from disparate data sources, making it a handy tool for both individual researchers and application developers to simplify annotation retrieval and identifier mapping. With the latest release in Feb. 2022, MyVariant.info has brought the number of hg19 variant annotation documents to 1.42 billion, and that of hg38 to 1.45 billion, from 19 and 6 data sources respectively. Even at this scale, MyVariant.info serves 2.6 million monthly requests from more than 2,000 unique IP addresses. MyVariant.info is built in Python and accessible at https://myvariant.info/. Its source code is hosted at its GitHub repository, https://github.com/biothings/myvariant.info, under the Apache License, Version 2.0.
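For example, annotations can be retrieved with a few lines of Python against the documented v1 endpoints (the variant and rsid below are illustrative):

```python
# Query MyVariant.info's documented v1 endpoints using the requests library.
import requests

# Fetch the annotation document for a single variant, identified by HGVS.
doc = requests.get(
    "https://myvariant.info/v1/variant/chr7:g.140453134T>C"
).json()

# Search across annotations with a query string, e.g. by rsid.
hits = requests.get(
    "https://myvariant.info/v1/query", params={"q": "rs58991260"}
).json()["hits"]
```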
Presentation Overview
Despite great strides in the development and wide acceptance of standards for exchanging structured information about genomic variants, there is no corresponding standard for exchanging phenotypic data, and this has impeded the sharing of phenotypic information for computational analysis. Here, we introduce the Global Alliance for Genomics and Health (GA4GH) Phenopacket schema, which supports exchange of computable longitudinal case-level phenotypic information for diagnosis and research of all types of disease including Mendelian and complex genetic diseases, cancer, and infectious diseases. To support translational research, diagnostics, and personalized healthcare, phenopackets are designed to be used across a comprehensive landscape of applications including biobanks, databases and registries, clinical information systems such as Electronic Health Records, genomic matchmaking, diagnostic laboratories, and computational tools. The Phenopacket schema is a freely available, community-driven standard that streamlines exchange and systematic use of phenotypic data and will facilitate sophisticated computational analysis of both clinical and genomic information to help improve our understanding of diseases and our ability to manage them.
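To make the schema concrete, here is a hand-written sketch of a minimal v2 phenopacket as a Python dictionary (identifiers and dates are hypothetical; production phenopackets are typically built with the schema's generated classes and carry fuller metadata):

```python
# Minimal illustrative phenopacket (v2); IDs and dates are hypothetical.
phenopacket = {
    "id": "example-phenopacket-1",
    "subject": {"id": "patient-1", "sex": "FEMALE"},
    "phenotypicFeatures": [
        # Observed phenotype, encoded as an HPO ontology class.
        {"type": {"id": "HP:0001250", "label": "Seizure"}}
    ],
    "metaData": {
        "created": "2022-07-01T00:00:00Z",
        "createdBy": "example-curator",
        "phenopacketSchemaVersion": "2.0",
        "resources": [
            {"id": "hp",
             "name": "human phenotype ontology",
             "url": "http://purl.obolibrary.org/obo/hp.owl",
             "version": "2022-06-11",
             "namespacePrefix": "HP",
             "iriPrefix": "http://purl.obolibrary.org/obo/HP_"}
        ],
    },
}
```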
Presentation Overview
We present FAIRshare, a free and open-source (MIT license) cross-platform desktop application that helps biomedical researchers make their research software Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the FAIR for Research Software (FAIR4RS) guiding principles. The FAIR4RS principles, established by the FAIR4RS Working Group, provide a foundation for optimizing the reusability of research software and encourage open science. The FAIR4RS principles are, however, aspirational: practical guidelines that biomedical researchers can easily follow to make their research software FAIR are still lacking. To fill this gap, we established the first minimal and actionable step-by-step guidelines for researchers to make their biomedical research software FAIR, which we designate the FAIR Biomedical Research Software (FAIR-BioRS) guidelines. FAIRshare walks users step by step through implementing the FAIR-BioRS guidelines for their research software (including metadata such as codemeta.json, choosing a license, preferably open source, and sharing on a suitable repository such as Zenodo). The process is streamlined through an intuitive graphical user interface and automation so as to minimize researchers’ time and effort. We believe that the FAIR-BioRS guidelines and FAIRshare will empower and encourage biomedical researchers to adopt FAIR and open practices for their research software.
Presentation Overview
The Variation Representation Specification (VRS; vrs.ga4gh.org) is a standard of the Global Alliance for Genomics and Health (GA4GH; ga4gh.org) for the computable and expressive exchange of biomolecular variation. VRS-Python (github.com/ga4gh/vrs-python) is an open-source Python package that implements the models defined in VRS, provides an algorithm for generating globally unique computed identifiers, and supports translation of VRS variation models to and from other common variant representations such as HGVS, VCF, and SPDI.
As a recently published standard, reference implementations and computational notebooks serve as useful introductory tools for learning the specification in hands-on application. The VRS-Python repository provides several Jupyter Notebooks demonstrating the features and capabilities of the VRS-Python python package (github.com/ga4gh/vrs-python/tree/main/notebooks).
To reduce barriers to entry for VRS, we have developed cloud-based VRS-Python notebooks to educate potential adopters. These notebooks are publicly accessible, require no local installation, and walk the user through the process of constructing variants and translating them to VRS from other variation formats, all through a web browser interface. Users can easily run the notebooks and write their own VRS-Python code as they explore. The notebooks are accessible online at go.osu.edu/vrs-cloud-nb.
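The notebooks center on flows like the following sketch, which translates an HGVS string into a VRS Allele and derives its computed identifier (the SeqRepo URL below is hypothetical, and the names should be checked against the notebooks):

```python
# Sketch based on the vrs-python notebooks; the SeqRepo REST URL is
# hypothetical and must point at a running instance.
from ga4gh.core import ga4gh_identify
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy
from ga4gh.vrs.extras.translator import Translator

dp = SeqRepoRESTDataProxy(base_url="http://localhost:5000/seqrepo")
tlr = Translator(data_proxy=dp)

# Translate an HGVS expression into the VRS model ...
allele = tlr.translate_from("NC_000013.11:g.32936732G>C", "hgvs")
# ... and compute its globally unique identifier.
print(ga4gh_identify(allele))
```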
Presentation Overview
As workflow systems become more popular, many efforts have been made to share workflows, for example workflow-sharing registries such as WorkflowHub and nf-core, and standard protocols for sharing such as the GA4GH TRS API. These workflow registries collect generic workflows and maintain them through community efforts. However, because community resources are limited, there is less incentive to collect and maintain specific workflows, for example workflows for specific species. In addition, the expert community with domain knowledge of those workflows may lack the human resources and engineering skills to build its own workflow registry. Therefore, sustainable development and sharing of all workflows with quality control remain challenging. To make it easier to build and maintain workflow registries, we developed Yevis, a system for building a workflow registry that automatically maintains workflows. Yevis uses only the GitHub and Zenodo services and provides a GA4GH TRS-compatible API and web UI. As a use case, we also built the DDBJ workflow registry, which contains workflows that can be executed in the DDBJ WES. We believe that if each community follows a standard definition such as the TRS API and provides well-maintained workflows, users will benefit from this diversity.
Presentation Overview
Reproducibility benefits greatly from robust workflow management. The open-source platform Arvados integrates a data management system called “Keep” and a compute management system called “Crunch”, creating a unified environment to store and organize data and to run Common Workflow Language workflows on that data. Arvados is multi-user and multi-platform, running on various cloud and high-performance computing environments. Arvados' management features include the ability to (1) identify the origin and verify the content of every dataset, track every workflow run, and reliably reproduce any output; (2) organize and search for datasets using metadata; (3) securely and selectively share data and workflows; (4) efficiently manage data (minimizing storage costs); and (5) efficiently rerun workflows (minimizing time and compute costs).
Presentation Overview
Relationship estimation between pairs of individuals has both scientific and commercial applications. For example, GWAS may suffer from high rates of false-positive results due to unrecognized population structure. Accurate relationship classification is also required in genetic linkage analysis to identify disease-associated loci. Additionally, DNA relative matching services are among the leading drivers of the direct-to-consumer genetic testing market.
Despite the availability of scientific and research information on methods for determining kinship, and the accessibility of relevant tools, assembling a pipeline that operates stably on real-world genotype data requires significant research and development resources. Currently, there is no open-source, production-ready solution for relatedness detection in genomic data that is fast, reliable, and accurate for both close and distant degrees of relatedness, and that combines all the processing steps necessary to work on real-world data.
To address this, we developed GRAPE: Genomic RelAtedness detection PipelinE. It combines data preprocessing, identity-by-descent (IBD) segment detection, and accurate relationship estimation. The project uses software development best practices, as well as GA4GH standards and tools.
We believe that software freedom is essential to advance efficiency, interoperability, and speed of innovation.
GRAPE is a free and open-source project under GPLv3 license.
Presentation Overview
Reproducible research is a keystone of modern scientific work. In data engineering for data science, it presents a challenge on two sides. Data is often retrieved from public and proprietary sources that are continuously updated, requiring special attention to ensure that, for reproducibility purposes, exactly the same data sources are used. On the other hand, data processing involves heavy floating-point computations that are very sensitive to the exact computational environment. Reproducibility challenges are exacerbated when the research involves healthcare data, which is inherently confidential and cannot be shared publicly to ensure reproducibility. We propose a solution based on sharing infrastructure instead, giving institutions the option to reproduce each other's results on their own data in compliance with their own data usage agreements. Infrastructure as Code (IaC) provides a handy way to ensure that the infrastructure is identical during data processing.
Here we present a data platform based on a combination of an IaC approach and CWL. Besides tools written in widely used languages such as Python, C/C++ and Java, it also supports tools written in R and PL/pgSQL, making it, to the best of our knowledge, one of the first deployment-ready platforms appropriate for ETL/ELT pipelines.
Presentation Overview
Poor reproducibility and limited portability of data analysis methods and techniques present critical challenges to the progress of data-intensive biomedical research. These become particularly important for single-cell sequencing data analysis. To overcome these problems, we develop all our pipelines using the Common Workflow Language (CWL), an open standard that describes tools and pipelines as YAML- or JSON-structured linked data documents. CWL supports execution in an isolated runtime environment (Docker/Singularity), thus guaranteeing reproducibility of results.
Here we present a set of CWL tools for the analysis of scRNA-seq, scATAC-seq, and Multiome sequencing data. Altogether, they cover the most common tasks in this research area, including removal of low-quality cells, data integration, batch effect correction, clustering, and cell type assignment. These tools can be chained together to form various workflows for each particular research application. As an example of a successful workflow application, we analyzed scRNA-seq data from control and Nsdhl-deficient mice to study the influence of tumor-intrinsic cholesterol biosynthesis on pancreatic carcinogenesis (Surumbayeva, Kotliar et al., 2021). Having all of the pipelines in the CWL format allows us to run them in any CWL-based execution environment, optionally scaling the processing from a single computer to an HPC cluster.
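Because each tool is a standard CWL document, it can be executed locally with a reference runner such as cwltool; a sketch (file names below are hypothetical):

```python
# Run one of the CWL tools with the cwltool reference runner;
# the tool and job file names are hypothetical.
import subprocess

subprocess.run(
    ["cwltool",
     "sc_rna_filter.cwl",   # CWL description of a single analysis step
     "job_inputs.yml"],     # YAML document binding the tool's inputs
    check=True,
)
```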
Presentation Overview
The number of published bioinformatics tools, each with their own strengths and weaknesses, is constantly growing. Given the sheer number of available tools, it is difficult to assess which tool is best for a specific purpose. While benchmarking papers are published to address this need, such papers are in many cases hard to reproduce and extend. Here, we introduce an open-source framework that (1) automates the workflows of the evaluated tools for the convenience of users, (2) benchmarks the tools based on various metrics, and (3) visualizes the results on the OpenEBench online platform for easy comparison of tools. This framework was developed as part of APAeval, an international community effort to benchmark RNA-seq-based alternative polyadenylation (APA) analysis tools. APA tools are assessed based on their ability to identify, quantify, and calculate differential usage of APA sites. The framework is not constrained to APA tools, but can also be applied to automate and benchmark other bioinformatics tools. We hope that the framework we have created can be an inspiration for an open-source, community effort to automate and benchmark tools to facilitate the various downstream analyses.
Presentation Overview
This talk presents a practical methodology for elucidating the structure and function of a workflow written in the Workflow Description Language (WDL), a domain-specific language for describing data processing and analysis workflows.
WDL is increasingly used by large consortia in genomics and related fields for creating standardized workflows that are portable across execution platforms. Bioinformaticians are likely to encounter WDL workflows that they will need to either apply to their own data or reimplement in their preferred language.
Deciphering real-world workflows benefits from a systematic approach rather than attempting to read through the code linearly. We present a systematic approach intended to help bioinformaticians efficiently interpret and, if necessary, reverse-engineer existing WDL workflows. We demonstrate the method on a real-world WDL workflow, deconstructing it systematically in order to understand (1) what the workflow is meant to achieve; (2) how it is structured; and (3) which key functional patterns are involved.
The main take-home from this talk will be the methodology itself, which can be adapted to other scenarios. As secondary benefits, the audience will gain some familiarity with WDL syntax and with interesting functional patterns of the language.
Presentation Overview
Computational analysis workflows are descriptions that link together steps to enable abstraction, scaling, automation, and provenance features. Workflow Description Language (WDL) and Common Workflow Language (CWL) are high-level workflow coordination languages that can define workflows made up of command-line tools.
WDL workflows are not executable by themselves and require an execution engine (primarily Cromwell on the Terra platform). In contrast to WDL, CWL can be run on a larger number of systems. However, some workflows that are important in bioinformatics are only maintained in WDL; for example, the GATK workflows used to analyze high-throughput sequencing data for variant discovery are only available from the GATK maintainers in the WDL format.
Therefore, the wdl2cwl converter was created to efficiently convert workflows written in WDL into equivalent versions using the CWL standard.
wdl2cwl can help WDL users to avoid platform dependency issues, such as trying to reproduce a WDL workflow on a platform that only supports CWL. It can help analysts by eliminating the difficulty of learning a new workflow language before they adapt a WDL workflow.
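A conversion might look like the following sketch, assuming the package installs a wdl2cwl command-line entry point that emits CWL (check the project README for exact usage):

```python
# Hedged sketch: invoke the converter on a WDL file and capture the CWL.
# The exact CLI behavior is an assumption; see the wdl2cwl README.
import subprocess

with open("workflow.cwl", "w") as out:
    subprocess.run(["wdl2cwl", "workflow.wdl"], stdout=out, check=True)
```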
Presentation Overview
Addressing complex scientific challenges requires a roadmap of data from diverse sources, organisms, contexts, formats, and granularities. Building a coherent, holistic view of the data landscape to address any given problem is non-trivial. Often, in the aggregation process, many of the original connections within the data are lost, and it is difficult to make new (inferred) connections. Novel data integration strategies that leverage semantic technologies such as ontologies, knowledge graphs, and common modeling strategies can help span disciplinary boundaries. However, it takes people too: robust interdisciplinary collaboration and improved data licensing and access can advance progress and innovation, turbo-boosting the open data highway.
Presentation Overview
The OntoDev Suite (https://ontodev.com, https://github.com/ontodev) brings together modular open-source libraries and applications for ontology development and scientific data integration, with special emphasis on open science and the Open Biological and Biomedical Ontologies (OBO) community. The suite builds on the success of ROBOT to include data cleaning, ontology-driven validation, development and curation workflows, and more. We strive to make small, focused tools that work well together, but also work well with other best-in-class software, languages, and platforms. In this talk we present an overview of the suite, its design principles, and future plans.
Presentation Overview
We present RPhenoscape, a package for the R programming language that provides convenient and robust access to the ontologies and ontology-linked morphological trait data (phenotypes) within the Phenoscape Knowledgebase (KB), as well as to several algorithms for computing with the semantics of traits based on formal logic reasoning. A major aim of the package is to enable the computational integration of trait semantics into evolutionary models for comparative trait analysis, which have traditionally treated traits simply as independent characters and character states. To this end, RPhenoscape provides access to the computational inference of presence/absence trait matrices, the presence/absence-based inference of trait dependence, evidence-based mutual trait compatibility/exclusivity, and a variety of semantic similarity metrics for phenotypes. RPhenoscape is currently in the last steps of a new major release series, which adds some of the features presented here and, once complete, will be made available on the Comprehensive R Archive Network.
Presentation Overview
Knowledge graphs (KGs) are representations of entities and their multifaceted relationships. An ongoing challenge in learning from KGs in biology and biomedicine is in bridging the gap between real-world observations and conceptual knowledge. Though numerous bio-ontologies address this need, none may be directly added to a KG without significant effort.
Past efforts to align instance data with ontologies led to the creation of the OBO Foundry, an open resource for standardized biological ontologies. We developed KG-OBO to allow the community to rapidly integrate OBO Foundry ontologies with biological KGs. KG-OBO translates OBO ontologies into easily parsed KGX TSV graphs aligned with the Biolink Model, then uploads all graphs to a public repository. Users may merge one or more ontology graphs as needed; e.g., combining CHEBI with a KG of protein-chemical interactions allows chemicals to be grouped hierarchically. The added context can also provide further training input for graph machine learning models.
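Because KGX TSV serializations are plain node and edge tables, a produced ontology graph can be inspected with ordinary tooling; a sketch (file names hypothetical):

```python
# Inspect a KG-OBO graph: KGX TSV is a pair of node and edge tables.
# File names are hypothetical.
import pandas as pd

nodes = pd.read_csv("chebi_kgx/chebi_nodes.tsv", sep="\t")
edges = pd.read_csv("chebi_kgx/chebi_edges.tsv", sep="\t")

# Nodes carry Biolink categories; edges are subject-predicate-object rows.
print(nodes[["id", "category", "name"]].head())
print(edges[["subject", "predicate", "object"]].head())
```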
The KG-OBO code, graphs, and infrastructure drive a community of knowledge engineers seeking answers to biomedical questions in KGs, including the broader OBO community. We anticipate that continued interest in learning from KGs will require easy access to the comprehensive knowledge within bio-ontologies, and KG-OBO fills this need.
Presentation Overview
Today’s international corporations such as BASF, a leading company in the crop protection industry, produce and consume more and more data that are often fragmented and accessible only through Web APIs. In addition, part of the proprietary and public data of BASF’s interest is stored in triple stores and accessible with the SPARQL query language. Homogenizing the data access modes and the underlying semantics of the data, without modifying or replicating the original data sources, has become an important requirement for achieving data integration and interoperability. In this work, we propose a federated data integration architecture within an industrial setting that relies on an ontology-based data access method. Our performance evaluation in terms of query response time showed that most queries can be answered in under 1 second.
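As an illustration of the SPARQL side of such an architecture, a federated endpoint can be queried from Python with SPARQLWrapper (the endpoint URL and vocabulary below are hypothetical):

```python
# Query a SPARQL endpoint; the endpoint URL and predicates are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://sparql.example.org/federation")
sparql.setQuery("""
    PREFIX ex: <https://example.org/cropprotection#>
    SELECT ?compound ?target
    WHERE { ?compound ex:inhibits ?target . }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["compound"]["value"], row["target"]["value"])
```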
Presentation Overview
Building open source software is a great start, but to maximize the impact, it’s also necessary to put effort into maintaining it. Similarly, open source / open science communities don’t just need to be built; they also need to be maintained and expanded. We’ve seen increasing calls for inclusion and diversity, but once you’ve reached out to a new community, attracted new contributors, or recruited a new team member, how do you go beyond surface-level changes and achieve meaningful and sustainable inclusion? What difficult work, self-education, and research is needed to make more diverse groups thrive?
This panel and audience discussion will focus on what needs to come after initial steps to diversify. Wherever you are on your path to building inclusion in your open-source project, we invite you to offer your questions and thoughts on how we can do better. Diversity and inclusion isn’t achieved in a single success, but rather by maintaining continued success.