

BOSC: Bioinformatics Open Source Conference

COSI Track Presentations

Attention presenters: please review the Speaker Information Page.
Schedule subject to change
Wednesday, July 24th
10:15 AM-10:25 AM
Opening Remarks - BOSC
Room: Delhi (Ground Floor)
  • Nomi Harris, Lawrence Berkeley Labs, United States
10:25 AM-10:33 AM
The Open Bioinformatics Foundation
Room: Delhi (Ground Floor)
  • Heather Wiencko, Hosted Graphite, Ireland
10:33 AM-10:40 AM
Google Summer of Code 2018
Room: Delhi (Ground Floor)
  • Kai Blin, Technical University of Denmark, Denmark
10:40 AM-11:00 AM
elPrep 4: A multi-threaded tool for sequence analysis
Room: Delhi (Ground Floor)
  • Charlotte Herzeel, Imec, Belgium
  • Pascal Costanza, Imec, Belgium

Presentation Overview:

We present elPrep, an open-source framework written in the Go programming language for processing sequencing alignment map files. elPrep is designed as a drop-in replacement for functionality provided by packages such as SAMtools, Picard, and GATK, focusing on improving computational performance while maintaining a modular and extensible implementation. Our latest release, elPrep 4, includes multiple new features that allow us to process all of the preparation steps defined by the GATK Best Practices pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these functions in elPrep 4 reproduce the same outcomes as their counterparts in GATK/SAMtools/Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework to greatly improve runtime and resource use. Compared to GATK 4, elPrep executes the preparation steps of the GATK Best Practices between 7.5x and 13x faster on WGS and WES data, respectively, while using fewer compute resources. elPrep 4 achieves these speedups while working with community-driven standards such as SAM/BAM/CRAM/VCF/BED, in contrast to a growing number of closed-source tools that define their own formats to speed up NGS analysis.
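
The preparation steps listed above are, at heart, per-record transformations over alignment files; the engineering challenge elPrep addresses is running them in parallel in a single pass. As a rough illustration only (not elPrep's actual algorithm), a naive duplicate-marking pass over toy read records might look like:

```python
# Toy sketch of duplicate marking: reads sharing the same (reference,
# position, strand) key are considered duplicates, and only the
# highest-quality read of each group stays unmarked. This is a
# hypothetical simplification of what tools like Picard/elPrep do.
def mark_duplicates(reads):
    best = {}
    for r in reads:
        key = (r["rname"], r["pos"], r["strand"])
        if key not in best or r["qual"] > best[key]["qual"]:
            best[key] = r
    for r in reads:
        key = (r["rname"], r["pos"], r["strand"])
        r["duplicate"] = r is not best[key]
    return reads

reads = [
    {"rname": "chr1", "pos": 100, "strand": "+", "qual": 30},
    {"rname": "chr1", "pos": 100, "strand": "+", "qual": 60},
    {"rname": "chr1", "pos": 200, "strand": "-", "qual": 40},
]
marked = mark_duplicates(reads)
print([r["duplicate"] for r in marked])  # → [True, False, False]
```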

11:00 AM-11:05 AM
Variant Transforms and BigQuery: Large scale data analytics in the cloud
Room: Delhi (Ground Floor)
  • Andrew Moschetti, Google Cloud, United States

Presentation Overview:

Variant Transforms is an open source tool developed by Google Cloud to load variants from VCF files into BigQuery. BigQuery is a highly scalable and fully managed data warehouse provided as part of Google Cloud Platform (GCP). With Variant Transforms, users can load data from VCF files into BigQuery and write SQL queries that use the power of BigQuery to process large amounts of data in seconds. Variant Transforms allows users to merge data from millions of samples to facilitate easy and fast analysis. Joining with phenotypic, clinical, and other omics data allows users to optimize their research workloads.

Numerous improvements to Variant Transforms over the past year have increased the performance and added new use cases. Variant Transforms is able to import billions of records (terabyte to petabyte scale data) to a single table. Support for VEP annotation was added, and recently received a 10x speed improvement. Partitioning and clustering of data in BigQuery along with schema optimization results in faster queries that process less data. Importing gVCFs allows for joint genotyping across these large sample sets. Export back to VCF is supported to allow users to export a cohort from BigQuery for analysis with other existing tools.
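
The rows Variant Transforms loads are nested records derived from VCF lines. A sketch of that flattening in Python, with column names that are illustrative assumptions rather than the tool's exact BigQuery schema:

```python
def vcf_line_to_row(line, sample_names):
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...
    f = line.rstrip("\n").split("\t")
    fmt_keys = f[8].split(":")
    calls = []
    for name, sample in zip(sample_names, f[9:]):
        values = dict(zip(fmt_keys, sample.split(":")))
        genotype = [int(a) for a in values["GT"].replace("|", "/").split("/")]
        calls.append({"name": name, "genotype": genotype})
    return {
        "reference_name": f[0],
        "start_position": int(f[1]) - 1,  # 0-based start, BigQuery-style
        "reference_bases": f[3],
        "alternate_bases": f[4].split(","),
        "call": calls,  # nested repeated record: one entry per sample
    }

row = vcf_line_to_row("chr1\t101\trs1\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20",
                      ["sample1"])
print(row["start_position"], row["call"][0]["genotype"])  # → 100 [0, 1]
```

Nesting the per-sample calls inside each variant row is what lets a single SQL query scan millions of samples without joins.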

11:05 AM-11:10 AM
Forome Anfisa – an Open Source Variant Interpretation Tool
Room: Delhi (Ground Floor)
  • Michael Bouzinier, Division of Genetics, Brigham and Women's Hospital, United States
  • Sergey Trifonov, Forome Association, Russia
  • Joel Krier, Division of Genetics, Brigham and Women's Hospital, United States
  • Dmitry Etin, Forome Association, Austria
  • Dimitri Olchanyi, Forome Association, United States
  • Alexey Kargalov, Forome Association, Russia
  • Arezou Ghazani, Division of Genetics, Brigham and Women's Hospital, United States
  • Shamil Sunyaev, Department of Biomedical Informatics, Harvard Medical School, United States

Presentation Overview:

Whole exome and whole genome sequencing are being rapidly adopted in the healthcare industry, making their way into routine clinical practice.
Most variant interpretation tools are built to work with domain-based clinical guidelines approved by the ACMG and other responsible bodies, focusing on minimizing the number of variants for manual review by clinical staff.
Forome Anfisa is a collaborative variant annotation, interpretation, and curation tool, an organic part of the integrated clinical research program developed by the Forome Platform team. Initially built as part of the Brigham Genomics Medicine Program and used for a hearing loss project (SEQaBOO), Forome Anfisa has now been re-architected for whole exome and genome cases, enabling smooth work with huge volumes of data and giving clinicians a way to cope with millions of genetic variants meaningfully.
By introducing the first fully open source variant management toolset aimed at both clinical and research communities, we provide a way to seamlessly transform research workflows into clinical guidelines, thus speeding up the adoption of WGS/WES into clinical practice.
It offers collaboration by design, allowing users in different roles to interact and to share findings on a particular patient case or even a single variant.

11:10 AM-11:15 AM
Biotite: A comprehensive and efficient computational molecular biology library in Python
Room: Delhi (Ground Floor)
  • Patrick Kunzmann, TU Darmstadt, Germany
  • Kay Hamacher, TU Darmstadt, Germany

Presentation Overview:

Modern molecular biology is creating an increasing amount of sequence and macromolecular structure data. Analyzing these data can be tedious work: although a large variety of software is available, each program is usually made for a very specific task. Therefore, a combination of multiple programs is usually required to reach the desired goal, often rendering the data processing inflexible and even inefficient. These limitations can be overcome by shifting the workflow to a comprehensive computational biology library in an easy-to-learn scripting language like Python.

For this purpose, the open source Python package Biotite was created. It is a modern and comprehensive computational molecular biology library in the spirit of Biopython. As Biotite stores sequence and structure data internally as NumPy ndarrays, most operations have a high performance. On top of data analysis, modification and visualization capabilities, Biotite can be used to download data from biological databases or interface external software in a seamless manner.
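
A quick illustration of why ndarray-backed storage pays off, using plain NumPy rather than Biotite's own API: with a sequence stored as a byte array, computing GC content becomes a single vectorized comparison instead of a Python-level loop.

```python
import numpy as np

# Store a nucleotide sequence as a uint8 ndarray, the way an
# ndarray-backed library can, and compute GC content vectorized.
seq = np.frombuffer(b"ACGTGGCCAT", dtype=np.uint8)
gc = float(np.isin(seq, np.frombuffer(b"GC", dtype=np.uint8)).mean())
print(gc)  # → 0.6
```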

11:15 AM-11:20 AM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
11:20 AM-11:25 AM
Portable Pipeline for Whole Exome and Genome Sequencing
Room: Delhi (Ground Floor)
  • Timur Isaev, DBMI Harvard Medical School, United States
  • Patrick Magee, DNAStack, Canada
  • Heather Ward, DNAStack, Canada
  • Andrey Kokorev, Forome Association, Russia
  • Arezou Ghazani, Brigham and Women's Hospital, Division of Genetics, United States
  • Michael Bouzinier, Division of Genetics, Brigham and Women's Hospital, United States
  • Joel Krier, Division of Genetics, Brigham and Women's Hospital, United States

Presentation Overview:

Genomic sequencing is an important part of modern healthcare, and bioinformatics pipelines are now an essential part of the diagnostic process. Portable pipelines can be easily shared between different teams and address important issues such as the reproducibility of the results.

A pipeline for processing Whole Genome and Whole Exome cases was developed at the Sunyaev Lab at Harvard Medical School and Brigham & Women's Hospital and is being used by several clinical institutions. Following the GATK Best Practices and using GATK 3.x for calling common variants, the pipeline also uses a set of rare-variant callers to look for uncommon variants. Finally, the pipeline annotates variants to be loaded into variant curation tools.

We present a WDL version of the original pipeline, achieving greater portability and providing reproducibility of results in various environments. Using GATK4, it takes raw FASTQ sequencing data and produces a VCF file ready to be ingested by tools such as xBrowse or the Forome Anfisa annotation and variant curation suite.

We have tested the workflow in various computational environments supported by Cromwell. By porting the end-to-end pipeline to WDL, we hope to help promote Cromwell to the broader bioinformatics community.

11:25 AM-11:30 AM
Epiviz File Server - Query, Compute and Interactive Exploration of data from Indexed Genomic Files
Room: Delhi (Ground Floor)
  • Jayaram Kancherla, University of Maryland, United States
  • Yifan Yang, University of Maryland, United States
  • Hector Corrada Bravo, University of Maryland, College Park, United States

Presentation Overview:

Epiviz is an interactive and integrative web application for visual analysis and exploration of functional genomic datasets. We currently support two ways in which data can be provided to Epiviz: 1) using the Epivizr R/Bioconductor package, users can interactively visualize and explore genomic data loaded in R; 2) a MySQL database which stores each genomic dataset as a table. Genomic data repositories like ENCODE, Roadmap Epigenomics, etc. provide public access to large amounts of genomic data as files. Researchers often download a subset of data from these repositories and perform their analysis. As these repositories become larger, researchers face bottlenecks in their analysis: increasing data sizes require longer times to download, pre-process, and load files into a database to run queries efficiently. Based on the concepts of the NoDB paradigm, we developed the Epiviz file server, a data query system on indexed genomic files. Using the library, users are able to visually explore and transform data from publicly hosted files. We support various indexed genomic file formats: BigBed, BigWig, HDF5, and any format that can be indexed using tabix. Once the data files are defined, users can also define transformations on these data files using numpy functions.
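
The "numpy transformations" idea can be sketched in a few lines; the function below is a hypothetical stand-in for a range query against an indexed file, not the Epiviz File Server's actual interface:

```python
import numpy as np

# Mock of a range query: per-base coverage values for a requested
# region, with an optional user-defined numpy transform (here,
# log2 of coverage + 1) applied at query time.
def query(region_values, transform=None):
    values = np.asarray(region_values, dtype=float)
    return transform(values) if transform is not None else values

cov = [3, 1, 0, 7]
out = query(cov, transform=lambda v: np.log2(v + 1))
print(out.round(2))  # → [2. 1. 0. 3.]
```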

11:30 AM-11:35 AM
What does 1.0 take? MISO LIMS after 9 years of development
Room: Delhi (Ground Floor)
  • Heather Armstrong, Ontario Institute for Cancer Research, Canada
  • Andre Masella, Ontario Institute for Cancer Research, Canada
  • Morgan Taschuk, Ontario Institute for Cancer Research, Canada
  • Dillan Cooke, Ontario Institute for Cancer Research, Canada
  • Alexis Varsava, Ontario Institute for Cancer Research, Canada
  • Lars Jorgensen, Ontario Institute for Cancer Research, Canada

Presentation Overview:

MISO is a laboratory information management system designed for eukaryotic sequencing operations. It supports genomic, exomic, transcriptomic, methyl-omic, and ChIP-seq protocols; long reads and short reads; and microarrays. MISO’s goals are to allow laboratory technicians to record their work accurately with a minimum of data entry overhead, and to ensure the associated metadata is valid and structured enough to use for automation and other downstream applications. MISO incorporates a wide feature set useful for both large and small facilities to track their lab workflows in great detail.

Since last presented at BOSC 2016, MISO has matured and stabilized to support production use in a large sequencing facility. MISO now supports new instruments like the Illumina NovaSeq, 10X Chromium, and Oxford Nanopore PromethION; has added more extensive location tracking; has an improved UI that simplifies data entry; performs better overall; and has extensive documentation in the form of a new user manual and walkthroughs. Recently we have improved installation, administration, and maintenance through Docker containers and compose files. We have developed other applications that interact with MISO to facilitate laboratory functions like billing, reporting, and analysis. After 8 years of development, we are preparing a 1.0 release for late 2019.

11:35 AM-11:40 AM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
11:40 AM-12:00 PM
BioLink Model - standardizing knowledge graphs and making them interoperable
Room: Delhi (Ground Floor)
  • Deepak Unni, Lawrence Berkeley National Laboratory, United States
  • Nomi Harris, Lawrence Berkeley Labs, United States

Presentation Overview:

Biological Knowledge Graphs (KGs) are an emerging way of connecting together and reasoning about entities such as genes, conditions, chemicals, pathways, tissues, and so on. However, as yet there is no agreed upon standard schema or data model for how such graphs should be constructed, resulting in siloed efforts. Creating such a standard is important but challenging due to the complexity of biology and the diversity of use cases.

The BioLink Model (BLMod) is a top-level ontology and data model that aims to represent biological knowledge. It defines entities (e.g., gene, protein, disease, chemical substance) and enumerates associations between these entities (e.g., gene-to-disease association). Each entity type is defined as a class, and each class has mappings to other ontologies (e.g., OBO, SIO, or WikiData) that enable modeling across ontologies. BLMod aims to serve as a way of standardizing how entities (nodes) and associations (edges) between these entities are represented. BLMod treats associations as first-class entities, which enables expressive modeling of data and the ability to add additional properties that define the relationship (e.g., qualifiers, provenance, etc.). The model itself is agnostic to the technology used to build a KG.
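
Treating an association as a first-class object might look like the following sketch; the class and field names are illustrative, not the model's exact slots:

```python
from dataclasses import dataclass, field

# The edge itself is an object that carries qualifiers and provenance,
# as the BioLink Model allows; names here are illustrative only.
@dataclass
class GeneToDiseaseAssociation:
    subject: str          # gene CURIE
    predicate: str
    object: str           # disease CURIE
    qualifiers: list = field(default_factory=list)
    provenance: dict = field(default_factory=dict)

assoc = GeneToDiseaseAssociation(
    subject="HGNC:1100",
    predicate="biolink:gene_associated_with_condition",
    object="MONDO:0007254",
    provenance={"publication": "PMID:7545954"},
)
print(assoc.subject, "->", assoc.object)
```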

12:00 PM-12:05 PM
pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive
Room: Delhi (Ground Floor)
  • Saket Choudhary, University of Southern California, United States

Presentation Overview:

NCBI’s Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. However, methods to programmatically access this data are limited.

We introduce pysradb, a Python package that provides a collection of command-line methods to query and download metadata and data from SRA, utilizing the curated metadata database available through the SRAdb project.

The pysradb package builds upon the principles of SRAdb, providing a simple and user-friendly command-line interface for querying metadata and downloading datasets from SRA. It obviates the need for the user to be familiar with any programming language for querying and downloading datasets from SRA. Additionally, it provides utility functions that help a user perform the more granular queries that are often required when dealing with multiple datasets at large scale. By enabling both metadata search and download operations at the command line, pysradb aims to bridge the gap in seamlessly retrieving public sequencing datasets and the associated metadata.

pysradb is written in Python and is developed on GitHub under the open-source BSD 3-Clause License. Each sub-command of pysradb contains a self-contained help string that describes its purpose and usage, with additional documentation available on the project’s website.
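
Under the SRAdb approach, metadata queries reduce to SQL against a curated database. A self-contained sketch using an in-memory SQLite stand-in (the table and column names here are invented for illustration, not SRAdb's real schema):

```python
import sqlite3

# Build a tiny stand-in for the SRA metadata database and run the
# kind of query a metadata tool issues under the hood.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sra (study_accession TEXT, run_accession TEXT,"
    " library_strategy TEXT)"
)
db.executemany("INSERT INTO sra VALUES (?, ?, ?)", [
    ("SRP000001", "SRR000001", "RNA-Seq"),
    ("SRP000001", "SRR000002", "RNA-Seq"),
    ("SRP000002", "SRR000003", "WGS"),
])
rows = db.execute(
    "SELECT run_accession FROM sra WHERE study_accession = ?",
    ("SRP000001",),
).fetchall()
print([r[0] for r in rows])  # → ['SRR000001', 'SRR000002']
```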

12:05 PM-12:10 PM
Disq, a library for manipulating bioinformatics sequencing formats in Apache Spark.
Room: Delhi (Ground Floor)
  • Michael Heuer, RISE Lab, University of California Berkeley, United States
  • Tom White, Independent consultant, working with Broad Institute and Mount Sinai School of Medicine, United Kingdom
  • Louis Bergelson, Data Sciences Platform Group/GATK, Broad Institute, United States
  • Chris Norman, Data Sciences Platform Group/GATK, Broad Institute, United States
  • Ryan Williams, Mount Sinai School of Medicine, United States

Presentation Overview:

Disq is a library for manipulating bioinformatics sequencing formats in Apache Spark. Disq grew out of, and was heavily inspired by, Hadoop-BAM and Spark-BAM. The Disq project originated from a GitHub issue thread, developed through discussion at the BOSC CollaborationFest, and kicked off as a new project with collaborators from the ADAM, Hadoop-BAM, htsjdk, GATK, Spark-BAM, and ViraPipe projects. In this talk we will discuss some of the challenges of reading and writing formats like BAM, CRAM, SAM, and VCF in parallel. We'll also look at how Disq has been incorporated into common genomics pipelines in ADAM and GATK for improved speed and accuracy.
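
One of those parallelism challenges is that arbitrary byte-range splits rarely land on record boundaries. A simplified text analogue in Python (real BAM splitting operates on compressed BGZF blocks, not newline-delimited records):

```python
# Each reader must advance its start offset to the next record
# boundary (here, a newline) so no record is read twice or torn
# between two workers.
def aligned_splits(data, num_splits):
    size = len(data)
    offsets = [i * size // num_splits for i in range(num_splits)]
    aligned = [0]
    for off in offsets[1:]:
        nl = data.find(b"\n", off)
        aligned.append(size if nl == -1 else nl + 1)
    aligned.append(size)
    return [(aligned[i], aligned[i + 1]) for i in range(num_splits)]

records = b"read1\nread2\nread3\nread4\n"
splits = aligned_splits(records, 2)
parts = [records[a:b] for a, b in splits]
print(parts)
```

Reassembling the parts recovers the original stream exactly, which is the invariant a parallel reader has to preserve.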

12:10 PM-12:15 PM
A toolkit for semantic markup, exploration, comparison and merging of metadata models expressed as JSON-Schemas
Room: Delhi (Ground Floor)
  • Dominique Batista, University of Oxford e-Research Centre, United Kingdom
  • Alejandra Gonzalez-Beltran, University of Oxford e-Research Centre, United Kingdom
  • Susanna-Assunta Sansone, University of Oxford, United Kingdom
  • Philippe Rocca-Serra, Oxford e-Research Center, United Kingdom

Presentation Overview:

We introduce a set of tools aimed at assisting knowledge engineers who rely on JSON schema technology to define semantic anchoring of schema elements, compare and combine those with other elements or existing schemas, as well as visualize and present such schemas and their comparisons in an aesthetically pleasing interface. The core Python library provides functions to support the semantic annotation of JSON schema: more specifically, given a set of schemas and vocabularies, the tool will generate the required JSON-LD context files. The library also offers a comparison algorithm that makes use of the aforementioned semantic annotations to compare sets of schemas. Built on top of this functionality, a merge function enables developers to combine components defined by existing data representation standards and make providers’ content compatible with several standards at a time. In addition, the suite also contains two client-side web applications used as visualisation tools. The first one resolves a set of schemas and presents the properties of each element. The second one reads comparison files created by the Python library and outputs pairwise comparison reports, with dedicated visual cues to home in on problematic elements. Ongoing efforts are under way to integrate the tools with FAIRsharing services.
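
At its simplest, generating a JSON-LD context from an annotated schema is a mapping from property names to ontology IRIs. A minimal sketch with an invented schema and vocabulary (the actual library handles much more, such as nested schemas and resolution):

```python
import json

# Given a JSON schema's properties and a term-to-IRI vocabulary,
# emit a minimal JSON-LD context. Schema and vocabulary are invented
# for illustration.
def make_context(schema, vocabulary):
    ctx = {
        prop: vocabulary[prop]
        for prop in schema.get("properties", {})
        if prop in vocabulary
    }
    return {"@context": ctx}

schema = {"properties": {"name": {"type": "string"},
                         "taxon": {"type": "string"}}}
vocab = {"name": "http://schema.org/name",
         "taxon": "http://purl.obolibrary.org/obo/NCBITaxon_"}
context = make_context(schema, vocab)
print(json.dumps(context, sort_keys=True))
```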

12:15 PM-12:20 PM
A lightweight approach to research object data packaging
Room: Delhi (Ground Floor)
  • Eoghan Ó Carragáin, University College Cork, Ireland
  • Peter Sefton, University of Technology, Sydney, Australia
  • Stian Soiland-Reyes, The University of Manchester, United Kingdom
  • Carole Goble, The University of Manchester, United Kingdom

Presentation Overview:

A Research Object (RO) provides a machine-readable mechanism to communicate the diverse set of digital and real-world resources that contribute to an item of research. The aim of an RO is to move beyond the traditional academic publication as a static PDF and instead provide a complete and structured archive of the items (such as people, organisations, funding, equipment, software, etc.) that contributed to the research outcome, including their identifiers, provenance, relations, and annotations.

This is of particular importance as all domains of research and science are increasingly relying on computational analysis, yet we are facing a reproducibility crisis because key components are often not sufficiently tracked, archived or reported.

Here we propose Research Object Crate (or RO-Crate for short), an emerging lightweight approach to packaging research data with their structured metadata, rephrasing the Research Object model as schema.org annotations to formalize a JSON-LD format that can be used independently of infrastructure, e.g. in GitHub or Zenodo archives. RO-Crate can be extended for domain-specific descriptions, aiming at a wide variety of applications and repositories to encourage FAIR sharing of reproducible datasets and analytical methods.
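
A minimal RO-Crate metadata file built as plain JSON-LD might look like the following; the shape follows the emerging specification, though exact context URLs and profile details may differ from the final release:

```python
import json

# Minimal RO-Crate metadata: schema.org-flavoured JSON-LD with a
# metadata descriptor pointing at a root Dataset entity that lists
# the packaged files. File names are illustrative.
crate = {
    "@context": "https://w3id.org/ro/crate/1.0/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.jsonld",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example analysis",
            "hasPart": [{"@id": "results.csv"}],
        },
        {"@id": "results.csv", "@type": "File"},
    ],
}
print(json.dumps(crate, indent=2)[:60])
```

Because this is plain JSON-LD, the same file works unchanged whether the crate sits in a GitHub repository, a Zenodo deposit, or on a local disk.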

12:20 PM-12:25 PM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
2:00 PM-3:00 PM
BOSC Keynote: Building infrastructure for responsible open science in Africa
Room: Delhi (Ground Floor)
  • Nicola Mulder, University of Cape Town, South Africa
3:00 PM-3:20 PM
The (Re)usable Data Project
Room: Delhi (Ground Floor)
  • Seth Carbon, LBNL, United States

Presentation Overview:

The goal of the (Re)usable Data Project (RDP, http://reusabledata.org) is to draw attention to the licensing issues that make the reuse of valuable biomedical data challenging and complicated. The RDP is meant to provide an explorable resource that looks at some of the issues around the reuse of scientific data and open a conversation about how to deal with them. The RDP exists as an element in the environment of the FAIR Data Principles and the FAIR-TLC evaluation framework; it is focused on licensing issues and seeks to draw attention to the pervasiveness of current practice failures and their effects.

As the current project centerpiece, the RDP has put together a rubric that attempts to objectively evaluate resource licensing and basic data accessibility from the perspective of reuse on a linear scale. The RDP has processed and evaluated about sixty resources and their licenses with the rubric; the results indicate that there are ongoing issues with how resources license and present their data.

We hope that the efforts of the RDP will encourage the community to work together and generate discussions that help improve licensing practices, facilitating broad and long-term access to reusable scientific resources.

3:20 PM-3:40 PM
The FAIR data principles and their practical implementation in InterMine
Room: Delhi (Ground Floor)
  • Daniela Butano, University of Cambridge - Department of Genetics, United Kingdom
  • Justin Clark-Casey, University of Cambridge, United Kingdom
  • Josh Heimbach, University of Cambridge, United Kingdom
  • Rachel Lyne, Department of Genetics and Cambridge Systems Biology Centre, University of Cambridge, United Kingdom
  • Julie Sullivan, Department of Genetics and Cambridge Systems Biology Centre, University of Cambridge, United Kingdom
  • Yo Yehudi, University of Cambridge, United Kingdom
  • Gos Micklem, University of Cambridge, United Kingdom
  • Sergio Contrino, University of Cambridge, United Kingdom

Presentation Overview:

The FAIR Data Principles are a set of guidelines which aim to make data findable, accessible, interoperable and reusable. The principles are gaining traction, especially in the life sciences. We will present our experience of the practical implementation of the FAIR principles in InterMine, a platform to integrate and access life sciences data. We will cover topics such as the design of persistent URLs, standards for embedding data descriptions into web pages, describing data with ontologies, and data licences.

3:40 PM-3:45 PM
GA4GH: Developing Open Standards for Responsible Data Sharing
Room: Delhi (Ground Floor)
  • Rishi Nag, GA4GH, United Kingdom

Presentation Overview:

The Global Alliance for Genomics and Health (GA4GH) is creating frameworks and standards to enable the responsible, voluntary, and secure sharing of genomic and health-related data. GA4GH specifications include file formats such as CRAM and BCF, cloud-compute-enabling standards such as the Workflow Execution Service (WES), and a path to match patients' consent for data use with researchers' areas of interest. This talk will introduce these and future specifications and the community-based process in which they are developed, and show how you can contribute to this process.

3:45 PM-3:50 PM
The Commons Alliance: Building cloud-based infrastructure to support biomedical research in Data STAGE and AnVIL
Room: Delhi (Ground Floor)
  • Brian O'Connor, University of California, Santa Cruz, United States
  • Robert Carroll, Vanderbilt University, United States
  • Robert Grossman, University of Chicago, United States
  • Benedict Paten, University of California, Santa Cruz, United States
  • Anthony Philippakis, the Broad Institute, United States

Presentation Overview:

Modern biomedical research datasets -- derived from diverse technologies such as genome sequencing, gene-expression analysis, proteomics, and imaging assays -- consist of copious amounts of data that many researchers struggle to leverage. While the ability to generate data is a massive opportunity for the research community, the infrastructure and skill set required to handle terabytes - even petabytes - of data are relatively rare. Here we present the work of the Commons Alliance, a collaboration between the Broad Institute, UCSC, the University of Chicago, and Vanderbilt to build cloud-based infrastructure and services for the biomedical research community. These components include the Gen3 platform for core data and authentication/authorization services, the Terra workspace platform for batch and interactive analysis, and the Dockstore registry for workflow and tool sharing. We will describe these components in more detail and show how they are being evolved, improved, and expanded through real-world use in the NHLBI Data STAGE and NHGRI AnVIL projects.

3:50 PM-3:55 PM
Fake it 'til You Make It: Open Source Tool for Synthetic Data Generation to Support Reproducible Genomic Analyses
Room: Delhi (Ground Floor)
  • Adelaide Rhodes, Harvard University, United States
  • Matthieu J. Miossec, Center for Bioinformatics and Integrative Biology, Faculty of Biological Sciences, Universidad Andrés Bello, Santiago, Chile
  • Geraldine Van der Auwera, Harvard University, United States

Presentation Overview:

The lack of readily accessible large scale public genomic data sets currently limits the reproducibility of published biomedical research to a subset of authorized users. Tool developers, educators, journal editors and researchers alike are affected by the lack of open access genomic datasets appropriate for reproducing biologically meaningful analysis at scale. We will present a prototype pipeline that promotes reproducible analysis by making it easy to generate publicly shareable custom synthetic datasets. The prototype workflow links existing tools into a consolidated community resource for generating synthetic data cheaply and efficiently. We will demonstrate how to use this workflow on Broad Institute's open access Terra platform, to reproduce someone else's analysis and make your own work reproducible. The workflow, as written, is portable to any cloud platform that runs the Cromwell Engine, an open source scientific Workflow Management System.

3:55 PM-4:00 PM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
4:40 PM-5:40 PM
Room: Delhi (Ground Floor)
Thursday, July 25th
8:30 AM-8:40 AM
BOSC announcements
Room: Delhi (Ground Floor)
8:40 AM-8:45 AM
Archaeopteryx.js: Web-based Visualization and Exploration of Annotated Phylogenetic Trees (JavaScript)
Room: Delhi (Ground Floor)
  • Yun Zhang, J. Craig Venter Institute, United States
  • Christian Zmasek, J. Craig Venter Institute, United States
  • Richard H. Scheuermann, J. Craig Venter Institute, United States

Presentation Overview:

We developed Archaeopteryx.js for the web-based visualization and exploration of annotated phylogenetic trees. Archaeopteryx.js is written entirely in JavaScript and can be incorporated effortlessly into any website.
Archaeopteryx.js has been designed to be both powerful and user-friendly by providing features such as user-selectable data display (e.g. branch lengths, support values, taxonomic and sequence information), intuitive zooming and panning, tools for organizing trees, sophisticated and flexible search functions (including regular expressions), and download/export of trees in a variety of formats. We focused on making Archaeopteryx.js suitable for analyzing large and complex gene trees. For this purpose, Archaeopteryx.js provides means for the visualization of gene duplications, automated sub-tree collapsing (such as collapsing by node depth or shared features), and selection and display of sub-trees. In addition, metadata (such as temporal and geographic information) and sequence-specific data (amino acid or nucleotide residue at selected alignment positions) can be visualized as label and node colors, node shapes, and sizes.
Archaeopteryx.js is currently used in the Virus Pathogen Resource (ViPR; www.viprbrc.org) and the Influenza Research Database (IRD; www.fludb.org) for visualization.
We are committed to continuously supporting and improving Archaeopteryx.js, which is freely available under an open source license at
https://www.npmjs.com/package/archaeopteryx and at https://github.com/cmzmasek/archaeopteryx-js

8:45 AM-8:50 AM
Sequenceserver: a modern graphical user interface for custom BLAST databases
Room: Delhi (Ground Floor)
  • Anurag Priyam, Queen Mary University of London, United Kingdom
  • Yannick Wurm, Queen Mary University of London, United Kingdom

Presentation Overview:

The advances in DNA sequencing technologies have created many opportunities for novel research that requires comparing newly obtained and previously known sequences. This is commonly done with BLAST, either as part of an automated pipeline, or by visually inspecting the alignments and associated metadata. We previously reported Sequenceserver to facilitate the latter. Our software enables a user to rapidly set up a BLAST server on custom datasets and presents an interface that is modern-looking and intuitive to use. However, interpretation of BLAST results can be further simplified using visualisations.
We have integrated three existing visualisations into Sequenceserver with the aim of facilitating comparative analysis of sequences. First, we provide a Circos plot to rapidly check for conserved synteny, identify duplication and translocation events, or visualise transposon activity. Second, we provide a histogram of the lengths of all hits of a query, to quickly reveal whether the length of a predicted protein sequence matches that of its homologs. Finally, for each query-hit pair, the relative length and position of matching regions are shown. This is helpful to identify large insertion or deletion events between two genomic sequences, can reveal putative exon shuffling, and helps confirm a priori knowledge of intron lengths.

8:50 AM-8:55 AM
Parallel, Scalable Single-cell Data Analysis
Room: Delhi (Ground Floor)
  • Ryan Williams, Mount Sinai School of Medicine, United States
  • Tom White, Mount Sinai School of Medicine, United Kingdom
  • Uri Laserson, Mount Sinai School of Medicine, United States

Presentation Overview:

Single-cell sequencing generates a new kind of genomic data, promising to revolutionize understanding of the fundamental units of life. The Human Cell Atlas is a multi-year, multi-institution effort to develop and standardize methods for generating and processing this data, which poses interesting storage and compute challenges.

I'll talk about recent work parallelizing analysis of single-cell data using a variety of distributed backends (Apache Spark, Dask, Pywren, Apache Beam). I'll also discuss the Zarr format for storing and working with N-dimensional arrays, which several scientific domains have recently gravitated toward in response to challenges using HDF5 in parallel and in the cloud.
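
Zarr's key idea is storing an N-dimensional array as independently addressable chunks, so parallel workers and cloud object stores touch disjoint pieces. A toy stand-in for that layout in plain NumPy (not Zarr's API):

```python
import numpy as np

# Split a 2-D array into fixed-size chunks keyed by grid coordinates,
# the core layout behind Zarr's parallel- and cloud-friendly storage.
def to_chunks(arr, cs):
    return {
        (i // cs, j // cs): arr[i:i + cs, j:j + cs]
        for i in range(0, arr.shape[0], cs)
        for j in range(0, arr.shape[1], cs)
    }

a = np.arange(16).reshape(4, 4)
chunks = to_chunks(a, 2)
print(sorted(chunks))           # → [(0, 0), (0, 1), (1, 0), (1, 1)]
print(chunks[(1, 1)].tolist())  # → [[10, 11], [14, 15]]
```

In Zarr proper, each such chunk becomes one compressed object (a file or cloud blob), which is what sidesteps the parallel-write limitations of monolithic HDF5 files.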

8:55 AM-9:00 AM
Q&A for late-breaking lightning talks (BOSC)
Room: Delhi (Ground Floor)
9:00 AM-9:05 AM
RAWG: RNA-Seq Analysis Workflow Generator
Room: Delhi (Ground Floor)
  • Alessandro Pio Greco, Imperial College London, United Kingdom
  • Patrick Hedley-Miller, Imperial College London, United Kingdom
  • Filipe Jesus, Imperial College London, United Kingdom
  • Zeyu Yang, Imperial College London, United Kingdom

Presentation Overview: Show

Motivation: RNA sequencing (RNA-Seq) is becoming the gold standard for analysing gene expression in biological samples. With ever-increasing sequencing speed, the amount of data produced is growing exponentially, which imposes new challenges on bioinformatics data analysis. Many analysis pipelines have been developed, enabling standardisation and automation of RNA-Seq data analysis. However, most pipelines rely on a command-line interface, which can be difficult for end-users to learn and use, and different methodologies can lead to wide variation in differential analysis results.

Result: We present a complete RNA-Seq data analysis framework with an emphasis on ease of use. Our framework uses the Common Workflow Language (CWL), a contemporary workflow description standard, as the foundation of its analysis pipelines. A website, based on the Django framework, lets users upload data and submit analyses with a few clicks. A variety of tools are included in RAWG, so researchers have the freedom to pick which analysis tool to use; RAWG can run multiple pipelines in one session and compare their results. RAWG is a flexible data analysis platform that is not only easy to use but also improves data reproducibility and provenance in biological science.
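A CWL tool description is a small YAML document. The hypothetical example below wraps `wc -l` to show the shape of a CommandLineTool; the field names follow the CWL v1.0 specification, but the tool itself is invented for illustration:

```yaml
# Minimal, hypothetical CWL tool: count lines in an input file.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: line_count.txt
```

A workflow generator like RAWG composes many such tool descriptions into a `class: Workflow` document and hands the result to a CWL runner such as cwltool.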

9:05 AM-9:10 AM
SAPPORO: workflow management system that supports continuous testing of workflows
Room: Delhi (Ground Floor)
  • Hirotaka Suetake, The University of Tokyo, Japan
  • Tazro Ohta, DBCLS, Japan

Presentation Overview: Show

Sharing personal genome data is critical to advance medical research. However, sharing data that includes personally identifiable information requires ethical reviews, which usually take time, and often comes with limitations on the computational resources researchers can use. To allow researchers to analyze such controlled-access data efficiently, the DNA Data Bank of Japan (DDBJ) developed a new workflow execution system called SAPPORO (Figure 1). We designed the system to allow users to execute workflows on controlled-access data without touching the data directly. Users select a workflow on SAPPORO's web interface to run it on a node for personal genome data analysis in the DDBJ's high-performance computing (HPC) platform. The system supports the Common Workflow Language (CWL) as its primary format for describing workflows; it can therefore import workflows developed by different institutes as long as they are described in CWL [1]. We implemented the workflow run service component following the Workflow Execution Service API standard developed by the Cloud working group of the Global Alliance for Genomics and Health (GA4GH) [2]. This highly flexible and portable system can be an essential module for data and workflow sharing in biomedical research.

9:10 AM-9:15 AM
Lazy representation and analysis of very large genomic data resources in R / Bioconductor
Room: Delhi (Ground Floor)
  • Qian Liu, Roswell Park Comprehensive Cancer Center, United States
  • Hervé Pagès, Fred Hutchinson Cancer Research Center, United States
  • Martin Morgan, Roswell Park Comprehensive Cancer Center, United States

Presentation Overview: Show

Motivation: With the increasing challenges in understanding very large and complex genomic data, there is urgent demand for computational and bioinformatics tools that efficiently translate the data into clinically important insights. Methods: R / Bioconductor provides robust, scalable software infrastructure and interoperable statistical methods to help tackle these challenges. DelayedArray was previously developed to represent very large genomic data sets (e.g., count data from scRNA-seq); it allows users to perform common array operations without loading the data into memory. We and others have extended DelayedArray to different backends for scalable computation, such as the Hierarchical Data Format (HDF) and the Genomic Data Structure (GDS). DelayedDataFrame was developed for lazy representation of sample metadata (e.g., the clinical characteristics of samples). VariantExperiment is a lightweight container of lazy data structures representing both assay and annotation data for a complete experiment. Results: These data structures provide rich semantics for data operations with familiar paradigms such as “matrix” and “data.frame”, while requiring minimal memory and little effort from bioinformaticians to learn new techniques. Conclusion: These data structures considerably improve the acquisition, management, analysis, and dissemination of big genomic datasets, benefiting the broad community of bioinformatics software developers and domain-specific cancer researchers.

9:15 AM-9:20 AM
Q&A for late-breaking lightning talks (BOSC)
Room: Delhi (Ground Floor)
9:20 AM-9:25 AM
The Monarch Initiative: Closing the knowledge gap with semantics-based tools
Room: Delhi (Ground Floor)
  • Monica C Munoz-Torres, Oregon State University, United States
  • Melissa Haendel, Oregon State University, United States
  • Peter Robinson, The Jackson Laboratory for Genomic Medicine, United States
  • David Osumi-Sutherland, European Bioinformatics Institute, United Kingdom
  • Damian Smedley, Genomics England, United Kingdom
  • Julius Jacobsen, Queen Mary University of London, United Kingdom
  • Sebastian Köhler, Charité Universitätsmedizin, Germany
  • Julie McMurry, Oregon State University, United States
  • The Members of the Monarch Initiative, Monarch Initiative, United States

Presentation Overview: Show

The Monarch Initiative is a consortium that seeks to bridge the gap between basic and applied research by developing tools that connect data across these fields using semantics-based analysis. Its mission is to create methods and tools for exploring the relationships between genotype, environment, and phenotype across the tree of life, deeply leveraging semantic relationships between biological concepts using ontologies. These tools include, among many others, Exomiser, which evaluates variants based on predicted pathogenicity. The goal is to enable complex queries over diverse data and reveal the unknown. With the semantic tools available at www.monarchinitiative.org, researchers, clinicians, and the general public can gather, collate, and unify disease information across human, model organism, non-model organism, and veterinary species into a single platform. Monarch defines phenotypic profiles, or sets of phenotypic terms, associated with a disease or genotype and recorded using a suite of phenotype vocabularies (such as the Human Phenotype Ontology and the Mondo Ontology). Our niche is computational reasoning to enable phenotype comparison both within and across species. Such explorations aim to improve mechanistic discovery and disease diagnosis. We deeply integrate biological information using semantics, leveraging phenotypes to bridge the knowledge gap.

9:25 AM-9:30 AM
DAISY: a tool for the accountability of Biomedical Research Data under the GDPR.
Room: Delhi (Ground Floor)
  • Pinar Alper, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
  • Valentin Grouès, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Germany
  • Sandrine Munoz, University of Luxembourg, Luxembourg
  • Yohan Jarosz, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
  • Kavita Rege, Luxembourg Centre For Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
  • Venkata Pardhasaradhi Satagopam, Luxembourg Centre For Systems Biomedicine (LCSB), ELIXIR Luxembourg, University of Luxembourg, Luxembourg
  • Christophe Trefois, ELIXIR Luxembourg, Luxembourg
  • Jacek Lebioda, University of Luxembourg, Luxembourg
  • Regina Becker, Luxembourg Centre of Systems Biomedicine, Luxembourg
  • Reinhard Schneider, LCSB, Luxembourg

Presentation Overview: Show

The GDPR requires documentation of any processing of personal data, including data used for research, and requires institutions to be prepared to provide information to data subjects. Institutions must therefore perform a data mapping exercise and keep meticulous track of all data processing. While there is no formal guidance on how data mapping should be done, we are seeing the emergence of commercial "GDPR data mapping" tools, and academic institutions are creating registers with them. When it comes to mapping data in biomedical research, we observe that commercial tools may fall short, as they do not capture the complex, project-based, collaborative nature of research, which gives rise to many different scenarios.

In this poster we describe a Data Information System (DAISY), our data mapping tool, which is specifically tailored for biomedical research institutions and meets the record keeping and accountability obligations of the GDPR. DAISY is open-source and is actively being used at the Luxembourg Centre for Systems Biomedicine and the ELIXIR-Luxembourg data hub.

9:30 AM-9:35 AM
Q&A for late-breaking lightning talks (BOSC)
Room: Delhi (Ground Floor)
10:20 AM-10:40 AM
Dockstore: Enhancing a community platform for sharing cloud-agnostic research tools
Room: Delhi (Ground Floor)
  • Denis Yuen, Ontario Institute for Cancer Research, Canada
  • Louise Cabansay, UC Santa Cruz Genomics Institute, United States
  • Charles Overbeck, UC Santa Cruz Genomics Institute, United States
  • Andrew Duncan, Ontario Institute for Cancer Research, Canada
  • Gary Luu, Ontario Institute for Cancer Research, Canada
  • Walt Shands, UC Santa Cruz Genomics Institute, United States
  • Natalie Perez, UC Santa Cruz Genomics Institute, United States
  • David Steinberg, UC Santa Cruz Genomics Institute, United States
  • Cricket Sloan, UC Santa Cruz Genomics Institute, United States
  • Brian O’Connor, UC Santa Cruz Genomics Institute, United States
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada

Presentation Overview: Show

Modern biomedical research analyzes exponentially growing datasets and involves collaborations that commonly span multiple institutions with highly heterogeneous computing environments. The scale and complexity of these efforts have driven a rethink of bioinformatics infrastructure to leverage big data and cloud technologies that increase the mobility, interoperability, and reproducibility of research. To address these goals, we expanded integrations and features within Dockstore, our open-source platform for sharing Docker-based resources that lets bioinformaticians bring tools and workflows together in a centralized location. By packaging software into portable containers and using popular descriptor languages such as the Common Workflow Language (CWL), Workflow Description Language (WDL), and Nextflow, Dockstore standardizes computational analysis, making workflows reproducible and runnable in any environment that supports Docker. Dockstore now supports workflow hosting directly on Dockstore.org along with external technologies like GitHub, Bitbucket, Quay.io, and Docker Hub. Our launch-with integration allows deployment to a growing variety of cloud platforms, including FireCloud, DNAnexus, and DNAstack. Usability improvements include a better display of versioning, validation with checker workflows, and community-provided DAG visualizations. Furthermore, new collaboration features allow permissions-based sharing and enable groups to create organization pages that describe and highlight collections of workflows and tools.

10:40 AM-11:00 AM
Bioconductor with Containers: Past, Present, and Future
Room: Delhi (Ground Floor)
  • Nitesh Turaga, Bioconductor / Roswell Park, United States
  • Martin Morgan, Roswell Park Comprehensive Cancer Center, United States

Presentation Overview: Show

Bioconductor provides a range of options that allow users and developers to perform their analyses using containers, removing the burden of installing dependencies and system requirements. Containers allow software to be used in a reproducible, reliable, and isolated fashion. Bioconductor also produces containers that replicate its build machines, allowing developers contributing to the project to test their changes more aggressively before they are put into production. Containers have become a common way to publish the "environment" used to perform a data analysis. Bioconductor containers are also useful for cloud- and web-based analysis platforms; they can be used on high-performance clusters and with Kubernetes for ad hoc cluster formation. This talk will compare how these containers were developed and used in the past with how Bioconductor develops and uses them now. A particular innovation is installation via localization of pre-compiled packages. Usage will be discussed from the points of view of both the user and the developer, including the different flavors of containers Bioconductor provides and how best to use them to make software development and data analysis reproducible and "machine-independent".

11:00 AM-11:15 AM
Mini-Break - BOSC
Room: Delhi (Ground Floor)
11:15 AM-11:35 AM
OpenEBench: the ELIXIR platform for benchmarking
Room: Delhi (Ground Floor)
  • Adrian Altenhoff, ETH Zurich, Switzerland
  • Josep Ll Gelpí, Dept. Bioquimica i Biologia Molecular. Univ. Barcelona, Spain
  • Christophe Dessimoz, University of Lausanne, Switzerland
  • Salvador Capella-Gutiérrez, Barcelona Supercomputing Center (BSC), Spain
  • Vicky Sundesha, Barcelona Supercomputing Center (BSC), Spain
  • Javier Garrayo, Barcelona Supercomputing Center (BSC), Spain
  • Laia Codó, Barcelona Supercomputing Center (BSC), Spain
  • Dmitry Repchevsky, Barcelona Supercomputing Center (BSC), Spain
  • Eva Martin del Pico, Barcelona Supercomputing Center (BSC), Spain
  • Víctor Fernández-Rodríguez, Barcelona Supercomputing Center (BSC), Spain
  • Eduard Porta-Pardo, Barcelona Supercomputing Center (BSC), Spain
  • Analia Lourenco, Universidade de Vigo, Spain
  • Isabel Cuesta, Instituto de Salud Carlos III (ISCIII), Spain
  • Sara Monzon, Instituto de Salud Carlos III (ISCIII), Spain
  • Alfonso Valencia, Barcelona Supercomputing Centre BSC, Spain
  • Juergen Haas, SIB Swiss Institute of Bioinformatics & University of Basel, Switzerland

Presentation Overview: Show

Benchmarking in the bioinformatics context aims to compare the performance of bioinformatics operations under controlled conditions. It encompasses both the technical performance of individual tools, servers, and workflows, including software quality metrics, and their scientific performance in answering predefined challenges, with reference datasets and metrics defined by scientific communities. In the context of the ELIXIR project, we have developed the OpenEBench platform, which aims at transparent performance comparisons.

We will present the current implementation of OpenEBench. It covers scientific benchmarking data from a number of communities, with expert- and non-expert-oriented visualization of benchmarking results, and the assessment of quality metrics and availability of bioinformatics tools. We consider three levels of operation: level 1, already established, relates to data from existing benchmarking communities, provided via the OpenEBench API; level 2 (in beta state) is based on computing benchmarking metrics within the platform; while level 3 extends the existing OpenEBench platform to execute benchmarkable workflows (provided as software containers) using controlled conditions to ensure an unbiased technical and scientific assessment. Overall, OpenEBench provides an integrated platform to orchestrate benchmarking activities, from the deposition of reference data to executing software tools, to providing results employing metrics defined by scientific communities.

11:35 AM-11:55 AM
ELIXIR Europe on the Road to Sustainable Research Software
Room: Delhi (Ground Floor)
  • Jennifer Harrow, ELIXIR, United Kingdom
  • Rafael C Jimenez, ELIXIR, United Kingdom
  • Fotis Psomopoulos, INAB|CERTH, Greece
  • Dimitrios Bampalikis, Systems developer NBISweden, Uppsala University, Sweden
  • Allegra Via, IBPM-CNR, Italy
  • Mateusz Kuzak, Dutch Techcentre for Life Sciences, ELIXIR-Netherlands, Netherlands

Presentation Overview: Show

ELIXIR is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure, which makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR's activities are divided into five areas, known as “platforms”: Data, Tools, Interoperability, Compute, and Training. The ELIXIR Tools Platform works to improve the discovery, quality, and sustainability of software resources. The Software Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, promoting, and measuring information standards and best practices applied to the software development life cycle. We have published four simple recommendations (4OSS) to encourage best practices in research software, as well as the Top 10 metrics for life science software good practices. The next step is to adopt, promote, and recognise these information standards and best practices by developing comprehensive guidelines for software curation and through workshops that train researchers and developers in the adoption of software best practices.

11:55 AM-12:15 PM
The Kipoi repository: accelerating the community exchange and reuse of predictive models for genomics
Room: Delhi (Ground Floor)
  • Julien Gagneur, Technical University of Munich, Germany
  • Ziga Avsec, Technical University of Munich, Germany
  • Roman Kreuzhuber, EMBL-European Bioinformatics Institute, United Kingdom
  • Johnny Israeli, Stanford University, United States
  • Nancy Xu, Stanford University, United States
  • Jun Cheng, Technical University of Munich / QBM Graduate School, Germany
  • Avanti Shrikumar, Stanford University, United States
  • Abhimanyu Banerjee, Stanford University, United States
  • Thorsten Beier, DKFZ, Germany
  • Lara Urban, EMBL-European Bioinformatics Institute, United Kingdom
  • Oliver Stegle, EMBL-European Bioinformatics Institute, Germany
  • Anshul Kundaje, Stanford University, United States

Presentation Overview: Show

Machine learning models trained on large-scale genomics datasets hold the promise to be major drivers for genome science. However, lack of standards and limited centralized access to trained models have hampered their practical impact. To address this, we present Kipoi, an initiative to define standards and to foster reuse of trained models in genomics. The Kipoi repository currently hosts over 2,000 trained models from 21 model groups that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard enables automated software installation and provides unified interfaces to apply models and interpret their outputs. Use cases include model benchmarking, variant effect prediction, transfer learning, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and extend models developed by the community, Kipoi will foster the dissemination and use of ML models in genomics.
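The value of a unified model interface can be sketched with a few lines of stdlib Python. The method name `predict_on_batch` mirrors Kipoi's documented calling convention, but the registry and models below are a standalone toy, not Kipoi code:

```python
# Toy model registry with a Kipoi-style unified interface: every model,
# whatever its internals, is fetched by name and applied the same way.
_REGISTRY = {}

def register_model(name):
    def deco(cls):
        _REGISTRY[name] = cls
        return cls
    return deco

def get_model(name):
    """Fetch a registered model by name and instantiate it."""
    return _REGISTRY[name]()

@register_model("toy/gc_content")
class GCContent:
    """Predicts the GC fraction of each input DNA sequence."""
    def predict_on_batch(self, seqs):
        return [sum(base in "GC" for base in s) / len(s) for s in seqs]

@register_model("toy/length")
class SeqLength:
    """Predicts the length of each input sequence."""
    def predict_on_batch(self, seqs):
        return [len(s) for s in seqs]

# Same calling convention regardless of the underlying model:
model = get_model("toy/gc_content")
print(model.predict_on_batch(["ACGT", "GGCC"]))  # [0.5, 1.0]
```

Because every model exposes the same entry points, downstream tooling (benchmarking, variant effect prediction, pipelines) can be written once against the interface rather than once per model.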

12:15 PM-12:20 PM
A method for systematically generating explorable visualization design spaces
Room: Delhi (Ground Floor)
  • Anamaria Crisan, The University of British Columbia, Canada
  • Jennifer Gardy, British Columbia Centre for Disease Control, Canada
  • Tamara Munzner, The University of British Columbia, Canada

Presentation Overview: Show

Stakeholders within public health can use genomic data analyses to enact policies, yet they face many challenges in using and interpreting the results. Data visualization is an emergent solution to these interpretability challenges, but a robust approach for creating data visualizations has been absent. We have developed a systematic method for generating an explorable visualization design space, which catalogues the visualizations existing within the infectious disease genomic epidemiology literature. We applied our method to a document corpus of approximately 18,000 articles, from which we sampled 204 articles and approximately 800 figures for analysis. We analyzed these figures and created a set of taxonomic codes along three descriptive axes of visualization design: chart types within the visualization, chart combinations, and chart enhancements. We refer to the collective complement of derived taxonomic codes as GEViT (Genomic Epidemiology Visualization Typology). To operationalize GEViT and the results of the literature analysis, we have created a browsable image gallery (http://gevit.net) that allows an individual to explore the many complex types of data visualizations (i.e., the visualization design space). Our analysis of the visualization design space through GEViT also revealed a number of data visualization challenges within infectious disease genomic epidemiology that future bioinformatics work should address.

2:00 PM-2:20 PM
snakePipes enable flexible, scalable and integrative epigenomic analysis
Room: Delhi (Ground Floor)
  • Vivek Bhardwaj, MPI-IE, Germany
  • Steffen Heyne, MPI-IE, Germany
  • Katarzyna Sikora, MPI-IE, Germany
  • Michael Rauer, MPI-IE, Germany
  • Fabian Kilpert, Institute of Neurogenetics and Cardiogenetics, University of Luebeck, Germany
  • Andreas Richter, Genedata AG, Switzerland
  • Devon Ryan, MPI-IE, Germany
  • Leily Rabbani, Max Planck Institute of Immunobiology and Epigenetics Freiburg, Germany
  • Thomas Manke, Max Planck Institute of Immunobiology and Epigenetics Freiburg, Germany

Presentation Overview: Show

The scale and diversity of epigenomics data have been rapidly increasing, and ever more studies now present analyses of data from multiple epigenomic techniques. Performing such integrative analysis is time-consuming, especially for exploratory research, since there are currently no pipelines available that allow fast processing of datasets from multiple epigenomic assays while also allowing flexibility in running or upgrading the workflows. Here we present a solution to this problem: snakePipes, which can process and perform downstream analysis of data from all common epigenomic techniques (ChIP-seq, RNA-seq, Bisulfite-seq, ATAC-seq, Hi-C and single-cell RNA-seq) in a single package. We demonstrate how snakePipes can simplify integrative analysis by reproducing and extending the results from a recently published large-scale epigenomics study with a few simple commands. snakePipes are available under an open-source license at https://github.com/maxplanck-ie/snakepipes.

2:20 PM-2:40 PM
nf-core: Community built bioinformatics pipelines
Room: Delhi (Ground Floor)
  • Phil Ewels, Science for Life Laboratory (SciLifeLab), Department of Biochemistry and Biophysics, Sweden
  • Alexander Peltzer, Quantitative Biology Center (QBiC) Tübingen, Germany
  • Sven Fillinger, Quantitative Biology Center (QBiC) Tübingen, Germany
  • Johannes Alneberg, Science for Life Laboratory (SciLifeLab), Department of Biochemistry and Biophysics, Sweden
  • Harshil Patel, Bioinformatics and Biostatistics, The Francis Crick Institute, United Kingdom
  • Andreas Wilm, A*STAR Genome Institute of Singapore, Bioinformatics Core Unit, Singapore
  • Maxime Ulysse Garcia, Science for Life Laboratory (SciLifeLab), Department of Biochemistry and Biophysics, Sweden
  • Paolo DiTommaso, Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Spain
  • Sven Nahnsen, Quantitative Biology Center (QBiC) Tübingen, Germany

Presentation Overview: Show

The standardization, portability, and reproducibility of analysis pipelines is a well-known problem within the bioinformatics community. In the past, bioinformatics analysis pipelines were often designed to work on-premise, deeply integrated into the local infrastructure with a customized architecture. Because of this tight coupling of software to its surrounding environment, the resulting pipelines offered poor portability and reproducibility. Nextflow is a workflow system that makes analyses reusable, portable, and reproducible, with built-in support for most computational infrastructures as well as container and packaging technologies such as Docker, Singularity, and Conda. nf-core is a community effort to implement and collect Nextflow pipelines built on community best practices and tools. The guidelines and templates provided by the nf-core community, along with detailed documentation, enable users to add new workflows and get started with Nextflow seamlessly. The outcome is a set of high-quality bioinformatics pipelines with common usage patterns and robust community support. Our primary goal is to provide a community-driven platform for a high-quality set of reproducible bioinformatics pipelines that researchers can use across various institutions and research facilities.

2:40 PM-3:00 PM
NGLess: a domain-specific language for NGS analysis (the NG-meta-profiler case study)
Room: Delhi (Ground Floor)
  • Luis Pedro Coelho, Fudan University, China
  • Renato Alves, European Molecular Biology Laboratory, Germany
  • Paulo Monteiro, Inesc-id, Portugal
  • Jaime Huerta-Cepas, Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid, Spain
  • A.T. Freitas, Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), Portugal
  • Peer Bork, embl, Germany

Presentation Overview: Show

Linking different tools is an integral part of bioinformatics, and both software build tools (such as GNU Make) and workflow engines have been employed to tackle this problem, but neither is specific to bioinformatics. We present an alternative approach, namely a domain-specific language (named NGLess) for defining next-generation sequencing (NGS) processing pipelines, and show its advantages over the use of generic tools.

With NGLess, the user works with abstractions that are closer to the problem domain, which enables a better experience in building correct and reproducible pipelines. For example, NGLess contains built-in types such as ShortReadSet, which corresponds to a collection of FastQ files on disk. This enables the user to link tools that output compatible files, with the conversion between them performed automatically: a tool that outputs a SAM file can be linked to a tool that consumes a BAM file, as both are of the same type, and NGLess will automatically insert any necessary file conversions.
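The automatic-conversion idea can be sketched with a small, stdlib-only Python toy inspired by the behaviour described above (this is an illustration of the technique, not NGLess internals): when a step's output type differs from the next step's input type but a converter between the two is registered, the linker splices the conversion in.

```python
# Toy pipeline linker: if step A emits type X and step B consumes type Y,
# a registered X -> Y converter is spliced in automatically; otherwise
# linking fails before anything runs.
converters = {}  # (from_type, to_type) -> conversion function

def converter(src, dst):
    def deco(fn):
        converters[(src, dst)] = fn
        return fn
    return deco

@converter("SAM", "BAM")
def sam_to_bam(sam):
    return f"bam({sam})"  # stand-in for a real SAM -> BAM conversion

def link(steps):
    """steps: list of (function, input_type, output_type) tuples."""
    plan, prev_out = [], None
    for fn, in_type, out_type in steps:
        if prev_out is not None and prev_out != in_type:
            conv = converters.get((prev_out, in_type))
            if conv is None:
                raise TypeError(f"no converter from {prev_out} to {in_type}")
            plan.append(conv)  # splice in the automatic conversion
        plan.append(fn)
        prev_out = out_type
    def run(value):
        for step in plan:
            value = step(value)
        return value
    return run

# A SAM-producing mapper linked directly to a BAM-consuming counter:
pipeline = link([
    (lambda reads: f"alignments({reads})", "FASTQ", "SAM"),
    (lambda bam: f"counts({bam})", "BAM", "COUNTS"),
])
print(pipeline("sample.fq"))  # counts(bam(alignments(sample.fq)))
```

The key property, as in NGLess, is that type mismatches without a known conversion are caught at pipeline-construction time rather than hours into a run.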

Using NGLess 1.0, we implemented NG-meta-profiler, a tool for producing taxonomic and functional profiles from metagenomes. Through the use of NGLess, NG-meta-profiler is significantly faster than the alternatives.

NGLess and NG-meta-profiler are available from https://ngless.embl.de/

3:00 PM-3:05 PM
Benten: An experimental language server for the Common Workflow Language
Room: Delhi (Ground Floor)
  • Kaushik Ghose, Seven Bridges Genomics, United States

Presentation Overview: Show

Many experienced Common Workflow Language (CWL) users are comfortable creating tools and workflows "by hand" in a plain text editor. For sufficiently complex workflows, however, navigating and editing the resulting document and subdocuments can become tedious, and keeping track of the bigger picture (which components have been added, which connections have been set) can become hard.

Benten is a language server component that assists CWL development in a code editor by providing auto-complete suggestions and document outlines. It has been built and tested with VS Code but can be used with any editor/IDE that implements the language server protocol.
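The Language Server Protocol that Benten implements is editor-agnostic because messages travel as JSON-RPC payloads behind a simple byte-length framing. The framing details below come from the LSP specification; the request content is an invented example:

```python
import io
import json

def frame(message: dict) -> bytes:
    """Encode a JSON-RPC message with the LSP Content-Length framing."""
    body = json.dumps(message).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

def read_message(stream) -> dict:
    """Decode one framed message from a byte stream."""
    header = b""
    while not header.endswith(b"\r\n\r\n"):
        header += stream.read(1)
    length = int(header.split(b"Content-Length: ")[1].split(b"\r\n")[0])
    return json.loads(stream.read(length))

# Example: a completion request like an editor might send to Benten.
request = {"jsonrpc": "2.0", "id": 1, "method": "textDocument/completion",
           "params": {"textDocument": {"uri": "file:///wf.cwl"}}}
wire = frame(request)
decoded = read_message(io.BytesIO(wire))
print(decoded["method"])  # textDocument/completion
```

Any editor that speaks this wire format can drive the same server, which is why one Benten process serves VS Code and other IDEs alike.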

3:05 PM-3:10 PM
Janis: An open source tool to machine generate type-safe CWL and WDL workflows
Room: Delhi (Ground Floor)
  • Richard Lupat, Peter MacCallum Cancer Centre, Australia
  • Michael Franklin, The University of Melbourne, Australia
  • Bernard Pope, The University of Melbourne, Australia
  • Daniel Park, The University of Melbourne, Australia
  • Evan Thomas, Walter and Eliza Hall Institute, Australia
  • Tony Papenfuss, Peter MacCallum Cancer Centre; Walter and Eliza Hall Institute, Australia
  • Jason Li, Peter MacCallum Cancer Centre, Australia

Presentation Overview: Show

There has been ongoing effort to improve the reproducibility and portability of bioinformatics pipelines. Projects such as the Common Workflow Language (CWL) aim to standardise workflow specifications. However, there is often debate over whether to adopt CWL, the Workflow Description Language (WDL), or other competing standards. CWL provides easy-to-parse specifications and is supported by multiple engines but is considered more difficult to write, while WDL offers more features but is tightly coupled to Cromwell.

To address this, we have created Janis, a tool designed to assist in building standardised workflows via a translation mechanism that generates validated workflow specifications (CWL, WDL, or both). These translated workflows can be shared and executed using any workflow engine that supports the selected specifications. Janis also offers input and output type checking during workflow construction to enforce the input requirements of executed tools, which is important for tools that require secondary files. Through Janis, we have developed cancer variant-calling pipelines that are functional across the HPC environments of three different Australian research institutes. In future work, we may extend Janis to support additional output formats. We believe that the extra abstraction Janis provides is a powerful way to write portable bioinformatics pipeline specifications.

3:10 PM-3:15 PM
Collecting runtime metrics of genome analysis workflows by CWL-metrics
Room: Delhi (Ground Floor)
  • Tomoya Tanjo, National Institute of Informatics, Japan
  • Osamu Ogasawara, DDBJ Center, National Institute of Genetics, Japan
  • Tazro Ohta, DBCLS, Japan

Presentation Overview: Show

The portability of computational data analysis environments has been greatly improved by container virtualization technologies such as Docker and by workflow description frameworks such as the Common Workflow Language (CWL). To deploy an environment for data analysis workflows, researchers must select a computational platform appropriate for the given application. To help estimate the computational resources, such as CPU and memory, required by workflow execution, we developed CWL-metrics, a utility for cwltool that collects runtime metrics of CWL workflows. The system summarizes the resource usage of each workflow step together with the input parameters and information about the host machine. We demonstrate a performance comparison of RNA-Seq quantification workflows using metrics captured by CWL-metrics; the analysis, recorded in a Jupyter Notebook, is published on GitHub. The system is used in the new pipeline system being deployed on the high-performance computing platform of the DDBJ, where the collected metrics help the platform's administrators. The metrics captured by CWL-metrics are also being used in the development of a resource prediction algorithm. We will also present progress on the new version of CWL-metrics currently under development, which will broaden the coverage of supported workflow runners and container technologies.
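The kind of per-step accounting described above can be mocked up with the standard library alone. The real CWL-metrics hooks into cwltool and container runtimes; the step runner below is an invented, stdlib-only analogue that records wall time and peak Python memory per step:

```python
import time
import tracemalloc

def run_with_metrics(steps, data):
    """Run a list of (name, function) steps, recording wall-clock time and
    peak Python-allocated memory for each step -- a toy analogue of
    per-step workflow metrics collection."""
    metrics = []
    for name, fn in steps:
        tracemalloc.start()
        t0 = time.perf_counter()
        data = fn(data)  # each step consumes the previous step's output
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        metrics.append({"step": name, "seconds": elapsed, "peak_bytes": peak})
    return data, metrics

# A tiny three-step "workflow" with per-step accounting:
result, metrics = run_with_metrics(
    [("generate", lambda _: list(range(100_000))),
     ("square",   lambda xs: [x * x for x in xs]),
     ("total",    lambda xs: sum(xs))],
    None)
for m in metrics:
    print(f"{m['step']:9s} {m['seconds']:.4f}s peak={m['peak_bytes']} B")
```

Summaries like these, recorded per step with the input parameters, are what let an administrator (or a prediction algorithm) pick an appropriately sized machine for the next run.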

3:15 PM-3:20 PM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
3:30 PM-3:50 PM
Inclusiveness in Open Science Communities
Room: Delhi (Ground Floor)
  • Malvika Sharan, EMBL, Heidelberg, Germany
  • Toby Hodges, EMBL, Heidelberg, Germany

Presentation Overview: Show

Bio-IT (https://bio-it.embl.de) at the European Molecular Biology Laboratory (EMBL) (https://embl.org) is a community-driven initiative established in 2010 to support the development and technical capacity of its diverse bio-computational community. Currently, roughly 50% of EMBL's ~600 researchers devote ≥50% of their time to computational work. The Bio-IT community at EMBL has grown organically, aiming to address the various computational needs of research on campus.

As community coordinators of Bio-IT, we support our members by running training events on computing skills, developing and maintaining resources for reproducible science, adopting best practices in our workflows, and creating diverse opportunities for open discussion, participation, and networking. EMBL is a member of de.NBI, the German Network for Bioinformatics Infrastructure (ELIXIR Germany), which allows Bio-IT to disseminate its resources to other ELIXIR member states. Additionally, we collaborate with other Open Science communities, share resources, and bring valuable aspects of these larger and more diverse communities into Bio-IT.

In these efforts, I work at the intersection of community building, bio-computation, and the inclusion of underrepresented groups in STEM. In my talk, I will highlight the importance of inclusiveness in open science communities and share some of the lessons learned while putting these principles into practice in my work.

3:50 PM-4:10 PM
ECRcentral: An open-source platform to bring early-career researchers and funding opportunities together
Room: Delhi (Ground Floor)
  • Aziz Khan, NCMM, University of Oslo, Norway

Presentation Overview: Show

For early-career researchers (ECRs), securing funding for their research ideas is becoming increasingly competitive, and there is growing pressure in all disciplines to obtain grants. Although there is a plethora of funding opportunities for postdoctoral scientists and other ECRs, there has been no central platform to systematically search for such opportunities or to get professional feedback on a proposal. With a group of eLife Ambassadors, we developed ECRcentral (ecrcentral.org), a funding database and open forum for the ECR community. The platform is open to everyone and currently contains 700 funding schemes across a wide range of scientific disciplines, 100 travel grants, and a diverse range of useful resources. In the first two months after its release, approximately 500 ECRs joined the community. The platform is built with open-source technology, with all source code and related content openly available through our GitHub repository (github.com/ecrcentral). ECRcentral aims to bring ECRs and funding resources together, to facilitate discussion of those opportunities, to share experiences, and to create impact through community engagement. We strongly believe that this resource will be highly valuable for ECRs and the scientific community at large.

4:10 PM-4:15 PM
The Data Carpentry Genomics Curriculum: Overview and Impact
Room: Delhi (Ground Floor)
  • François Michonneau, The Carpentries, United States
  • Jason Williams, Cold Spring Harbor Laboratory, United States

Presentation Overview: Show

The Carpentries builds global capacity for conducting efficient, open, and reproducible research. We train and foster an inclusive, diverse community of learners and instructors that promotes the importance of software and data in research. We collaboratively develop open lessons that we deliver using evidence-based teaching practices. Data Carpentry's two-day hands-on workshops teach data skills through domain-specific lessons centered around a dataset. Our Genomics lessons cover core skills, from data and project organization to analysis and visualization. These workshops are well received, with a median recommendation score of 96%, and surveys show that learners report significant gains in confidence in applying these approaches to their work.
In this curriculum, we explore:

- How to structure and organize data, metadata and analysis files
- How to use shell commands to automate tasks
- How to use command-line tools to analyze genomic data
- How to work with cloud computing resources
- How to use R for data analysis and visualization

We present an overview of this curriculum, the community model of lesson maintenance, and the impacts measured on the learners’ skills and confidence. Awareness and engagement with these teaching materials scale impact and prepare more people to work effectively and reproducibly with genomic data.

4:15 PM-4:20 PM
Impact of The African Genomic Medicine Training Initiative: a Community-Driven Genomic Medicine Competency-Based Training Model for Nurses in Africa
Room: Delhi (Ground Floor)
  • Paballo Chauke, University of Cape Town, South Africa
  • Victoria Nembaware, University of Cape Town, South Africa
  • Nicola Mulder, University of Cape Town, South Africa

Presentation Overview: Show

The potential of Genomic Medicine to improve the quality of healthcare at both the population and individual level is well established; however, the adoption of available genetic and genomic evidence into clinical practice is limited. Widespread uptake largely depends on the task-shifting of Genomic Medicine to key healthcare professionals such as nurses, which could be promoted through professional development courses. Globally, trainers and training initiatives in Genomic Medicine are limited, and in resource-limited settings such as Africa, logistical and institutional challenges threaten to thwart large-scale training programmes. The African Genomic Medicine Training (AGMT) Initiative was created in response to these needs. It aims to establish sustainable Genomic Medicine training initiatives for healthcare professionals and the public in Africa. This work describes the AGMT and reports on a recently piloted strategy to design and implement an accredited, competency- and community-based distance learning course for nurses across 11 African countries. The model takes advantage of existing consortia to create a pool of trainers and adapts evidence-based approaches to guide curriculum and content development. Existing curricula were reviewed and adapted to suit the African context, and accreditation was obtained from university and health professional bodies. A toolkit is proposed to help guide adoption of the AGMT distance-learning model.

4:20 PM-4:25 PM
Biopython Project Update 2019
Room: Delhi (Ground Floor)
  • Peter Cock, The James Hutton Institute, United Kingdom

Presentation Overview: Show

Biopython is a long-running collaboration that provides a freely available Python library for biological computation. We summarise recent project news, and look ahead.

Releases 1.73 (December 2018) and 1.74 (expected May/June 2019) include improvements to various modules of the code base and continue the effort to document our public APIs, which we expect to complete this year. We have also focused on improving test coverage, currently at 85% (excluding online tests). Tests and Python PEP8/PEP257 style compliance are checked with continuous integration on Linux (TravisCI) and Windows (AppVeyor). We are considering adopting the Python code-formatting tool 'black' to reduce the human time spent writing compliant code.

In 2017 we started transitioning from our liberal but unique Biopython License Agreement to the similar but very widely used 3-Clause BSD License. Already half the files in the main library have been dual licensed after reviewing authorship, and all new contributions are dual licensed.

Finally, in the last year, Biopython had 32 named contributors, including 14 newcomers, reflecting our policy of encouraging even small contributions. We expect to reach 250 contributors by our 20th Birthday in August 2019.

4:25 PM-4:30 PM
Q&A for lightning talks (BOSC)
Room: Delhi (Ground Floor)
4:30 PM-4:35 PM
Introducing CoFest 2019 - the post-BOSC Collaboration Festival
Room: Delhi (Ground Floor)
  • Alexander Peltzer, Quantitative Biology Center (QBiC) Tübingen, Germany
  • Michael Heuer, RISE Lab, University of California Berkeley, United States
  • Peter Cock, The James Hutton Institute, United Kingdom

Presentation Overview: Show

In conjunction with the Bioinformatics Open Source Conference (BOSC),
the Open Bioinformatics Foundation (OBF) runs a welcoming,
self-organizing, non-competitive, and highly productive collaborative
event called the CollaborationFest, or CoFest.

Everyone is welcome to attend. We will have a mix of developers,
users, trainers, and researchers, from newcomers to experienced
bioinformaticians, and everything in between.

Attendees will self-organize into working groups based on shared
interests like programming languages, open source projects, or
biological questions.

CollaborationFest is not a competition; there are no prizes. Rather,
its goals are to grow and foster the contributor community for open
source bioinformatics projects, and to extend, enhance, and otherwise
improve open-source bioinformatics code and non-code artefacts, such
as documentation and training materials.

This will be the 10th such event since 2010; the first editions were
called the Coding Festival, or CodeFest. Communities such as Galaxy, Common
Workflow Language, Nextflow, and others have found CollaborationFest a
fun, rewarding, and highly productive experience. As in previous years, a
summary of the event will be included in the BOSC meeting report.

CollaborationFest will take place the two days after BOSC, July 26-27,
at The Swiss Innovation Hub for Personalized Medicine in Basel,
Switzerland. Registration is free; sponsorships offset the costs of
the venue, coffee, and snacks.

4:35 PM-4:40 PM
Closing Remarks from BOSC
Room: Delhi (Ground Floor)
  • Nomi Harris, Lawrence Berkeley Labs, United States