BOSC COSI Track Presentations

BOSC Introduction and welcome
Date: Saturday, July 22
Time: 10:00 AM - 10:10 AM
Room: Meeting Hall IV
  • Nomi Harris, Berkeley, United States

Presentation Overview:

A welcome and overview of the 18th Bioinformatics Open Source Conference
(BOSC). The authoritative version of the BOSC 2017 schedule is online
at https://www.open-bio.org/wiki/BOSC_2017_Schedule, which will include
any last-minute changes.

The Open Bioinformatics Foundation
Date: Saturday, July 22
Time: 10:10 AM - 10:20 AM
Room: Meeting Hall IV
  • Hilmar Lapp, Duke, United States

Presentation Overview:

Current president Hilmar Lapp will introduce the Open Bioinformatics Foundation
(OBF, http://www.open-bio.org), which organises the Bioinformatics Open Source
Conference (BOSC), and report on its recent activities, including the one-year
anniversary of the OBF Travel Fellowship scheme.

OBF in the Google Summer of Code. Wrapping up 2016 and presenting the 2017 projects
Date: Saturday, July 22
Time: 10:20 AM - 10:30 AM
Room: Meeting Hall IV
  • Kai Blin, Technical University of Denmark, Denmark

Presentation Overview:

Google’s Summer of Code program [1] is focused on introducing students to open source software development. Students are paired with mentors from participating organisations and earn a stipend while spending their summer semester break gaining exposure to real-world software development practices. Over the past years, the Open Bioinformatics Foundation has participated in the Google Summer of Code six times.
In 2016, the Open Bioinformatics Foundation acted as an umbrella organisation for seven projects from the open source bioinformatics community, and seven students successfully finished the program [2]. In 2017, OBF is an umbrella for six open source bioinformatics projects. At the time of writing, the students for 2017 have not been selected yet, but at the time of BOSC, the projects will be well under way.
This talk will present an overview of the projects hosted under the OBF umbrella in last year’s round of Google Summer of Code, as well as present the projects in the current round.

Codefest 2017 summary
Date: Saturday, July 22
Time: 10:30 AM - 10:45 AM
Room: Meeting Hall IV
  • Brad Chapman, Harvard, United States

Presentation Overview:

Each year preceding BOSC, the OBF hosts a free and open "Codefest", an
informal, bioinformatics-developer-focused gathering. This is an opportunity
for anyone interested in open science, biology and programming to meet,
discuss and work collaboratively. This talk will summarise what was
achieved at this year's Codefest, which was kindly hosted by Brmlab,
a non-profit, community-run hackerspace in Prague.

Rabix Executor: an open-source executor supporting recomputability and interoperability of workflow descriptions
Date: Saturday, July 22
Time: 10:45 AM - 10:56 AM
Room: Meeting Hall IV
  • Adrian Sharma, Seven Bridges, United States
  • Brandi Davis-Dusenbery, Seven Bridges, United States
  • Gaurav Kaushik, Seven Bridges, United States
  • Janko Simonovic, Seven Bridges, Serbia
  • Luka Stojanovic, Seven Bridges, Serbia
  • Nebojsa Tijanic, Seven Bridges, Serbia
  • Sinisa Ivkovic, Seven Bridges, Serbia

Presentation Overview:

Project Website: http://rabix.io
Source Code: https://github.com/rabix/bunny
License: Apache 2.0 license
Biomedical data has become increasingly easy to generate in large quantities and the methods used to analyze it have proliferated rapidly. Reproducible methods are required to learn from large volumes of data and to reliably disseminate findings and methods within the scientific community. Workflow specifications and execution engines provide a framework with which to perform a sequence of analyses and help address issues with reproducibility and portability across environments. One such specification is the Common Workflow Language (CWL), an emerging standard which provides a robust and flexible framework for describing data analysis tools and workflows. CWL requires executors or workflow engines to interpret the specification and turn tools and workflows into computational jobs, as well as provide additional tooling such as error logging, file organization, and job scheduling optimizations to allow for easy computing on large volumes of data. To this end, we have developed the Rabix Executor, an open-source workflow engine that improves reproducibility through reusability and interoperability of workflow descriptions. We define five major components of the Rabix Executor -- frontend, bindings, engine, queue, and backend -- each of which is abstracted from the other to maintain a modular design so that components can be used as needed. Developers are able to design custom frontends (e.g. a custom graphical user interface or command line interface), build bindings for the engine to parse a specific set of workflow languages, employ a specific queuing or scheduling protocol of their choice, and submit computational jobs to different backends (e.g. Amazon Web Services, a high-performance computing cluster, a local machine).
For workflow decomposition, the Rabix Executor employs an abstract software model which was defined to be a superset of all known workflow languages; this enables the use of different workflow languages and versions. To demonstrate this, the Rabix Executor can run all drafts and versions of the Common Workflow Language, including CWL v1.0 workflows composed of tools written in previous versions of CWL. To our knowledge, Rabix Executor is the only CWL implementation which maintains full backwards compatibility of tools and workflows, removing the need for refactoring existing bioinformatics applications. In scaling benchmarks, the Rabix Executor is capable of running tens of thousands of concurrent jobs from a batch of whole genome sequencing workflows. The modular and abstracted design of the Rabix Executor is intended to allow for reproducible bioinformatics analysis on various infrastructures and continual support of a growing library of bioinformatics tools and workflows.

Rabix Composer: an open-source integrated development environment for the Common Workflow Language
Date: Saturday, July 22
Time: 10:56 AM - 11:05 AM
Room: Meeting Hall IV
  • Adrian Sharma, Seven Bridges, United States
  • Boban Pajic, Seven Bridges, Serbia
  • Brandi Davis-Dusenbery, Seven Bridges, United States
  • Gaurav Kaushik, Seven Bridges, United States
  • Ivan Batic, Seven Bridges, Serbia
  • Luka Stojanovic, Seven Bridges, Serbia
  • Maja Nedeljkovic, Seven Bridges, Serbia
  • Marijan Lekic, Seven Bridges, Serbia
  • Nebojsa Tijanic, Seven Bridges, Serbia

Presentation Overview:

Project Website: http://rabix.io
Source Code: https://github.com/rabix/cottontail-frontend
License: Apache 2.0 license
The Common Workflow Language (CWL) is an emerging standard for describing data analysis workflows which allows for portability between compute environments, improved reproducibility, and the ability to add custom extensions to fit institutions’ needs. The robust and flexible framework provided by CWL has led to its adoption by the National Cancer Institute (NCI) Cancer Cloud Pilots, the NCI Genomic Data Commons, and academic and industrial organizations worldwide. Over time, the CWL community has worked hard to improve the CWL syntax to make it easy to read, easy to parse, and comprehensive in the scope of workflow parameters and behaviors it captures. A trade-off of this approach, however, is that complex bioinformatics workflows may consist of dozens of tools and hundreds of parameters which can be time-consuming to describe manually in CWL; a whole genome sequencing workflow may be hundreds of lines of CWL code alone.
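To give a sense of what such a description involves, a minimal CWL CommandLineTool (an illustrative toy, not an example from the talk) can be sketched from Python and emitted as YAML, assuming PyYAML is available:

import yaml  # PyYAML, assumed to be installed

# A toy CWL CommandLineTool that wraps `wc -l` over a single input file.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    "inputs": {
        "input_file": {"type": "File", "inputBinding": {"position": 1}},
    },
    "outputs": {
        "line_count": {"type": "stdout"},
    },
    "stdout": "line_count.txt",
}

print(yaml.safe_dump(tool, default_flow_style=False))

Even this trivial wrapper takes a dozen lines of YAML, which is why a graphical editor becomes attractive once a workflow grows to dozens of tools and hundreds of parameters.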
To support the CWL community, we’ve created the Rabix Composer, a standalone integrated development environment which provides rich visual and text-based editors for CWL. Our vision for the Rabix Composer is to enable rapid workflow composition and testing, provide version control and the ability to add documentation, share tools easily with online platforms and developers, and allow integration with online services such as GitHub. The Rabix Composer was designed by integrating more than 500 pieces of feedback from Seven Bridges users regarding our previous software development kit for CWL.
The Rabix Composer is part of the Rabix project (​http://rabix.io​), an open-source effort to provide tooling for the bioinformatics developer community. The Rabix project includes the Rabix Executor, a workflow engine that executes CWL descriptions and their associated Docker containers locally on a laptop, HPC, or on multiple cloud environments such as the Seven Bridges Platform. Using the Rabix Executor in combination with the Composer enables developers to create, run and debug bioinformatics applications locally before scaling. Together, these technologies enhance data analysis reproducibility, simplify software sharing/publication, and reduce the friction when designing a new workflow or replicating a scientific finding.

CWL-svg: an open-source workflow visualization library for the Common Workflow Language
Date: Saturday, July 22
Time: 11:05 AM - 11:07 AM
Room: Meeting Hall IV
  • Adrian Sharma, Seven Bridges, United States
  • Boban Pajic, Seven Bridges, Serbia
  • Brandi Davis-Dusenbery, Seven Bridges, United States
  • Gaurav Kaushik, Seven Bridges, United States
  • Ivan Batic, Seven Bridges, Serbia
  • Maja Nedeljkovic, Seven Bridges, Serbia

Presentation Overview:

Project Website: http://rabix.io
Source Code: https://github.com/rabix/cwl-svg
License: Apache 2.0 license
As the Common Workflow Language (CWL) becomes more widely adopted among the bioinformatics community, the volume and complexity of publicly available CWL has increased. The flexibility and portability of CWL encourages developers to tackle difficult pipelines and edge-cases, enabling them to describe intricate processes, which can be executed in multiple environments. However, complex workflows can be challenging for users to interpret. As CWL syntax has matured, the syntactic shortcuts added in recent versions to make the language easier to write can simultaneously make it more difficult to interpret. As a result of these combined aspects of CWL, some workflows, such as BCBio (bcbio-nextgen), can have hundreds of lines of code. Understandably, these workflows can be difficult to understand and debug without external visualization tools. As part of the Rabix Composer, an integrated development environment for CWL, we developed an open-source, workflow visualization library called CWL-svg. The CWL-svg library takes a CWL description of a workflow and creates a scalable vector graphics (SVG) image to provide a visual representation for more intuitive user interactivity. CWL-svg can be used either as a standalone library, which renders SVG files from CWL, or as part of a larger interface. Our goal with CWL-svg was to create a visualization library which would most clearly represent the relevant parts of the CWL description to the user. We incorporated user feedback gleaned from our previous iteration of our CWL Workflow Editor to create a more intuitive and informative user interface. Implementation of the CWL-svg library allows users to select nodes within the workflow (e.g. a tool or file) to highlight immediate connections, rearrange nodes, and use fluid zoom resizing to make workflow details visible at all resolutions. The library also implements an auto-align algorithm which untangles complicated workflows in a visually pleasing arrangement. These design details, combined with meticulous optimizations and attention to efficiency, make CWL-svg ideal for handling complex bioinformatics workflows.

CWL-ts: an open-source TypeScript library for building developer tools for the Common Workflow Language
Date: Saturday, July 22
Time: 11:07 AM - 11:11 AM
Room: Meeting Hall IV
  • Adrian Sharma, Seven Bridges, United States
  • Brandi Davis-Dusenbery, Seven Bridges, United States
  • Gaurav Kaushik, Seven Bridges, United States
  • Ivan Batic, Seven Bridges, Serbia
  • Luka Stojanovic, Seven Bridges, Serbia
  • Maja Nedeljkovic, Seven Bridges, Serbia

Presentation Overview:

Project Website: http://rabix.io
Source Code: https://github.com/rabix/cwl-ts
License: Apache 2.0 license
The Common Workflow Language (CWL) is an emerging standard for workflow description, and its adoption is rapidly growing. However, few tools for working with CWL exist, and those that do are command line based, which can be difficult to approach for new users. In order to grow the CWL community, facilitate on-boarding of new users, and entice institutions to transition to the new standard, CWL requires better developer tools. Yet, building these GUI-rich developer tools takes both time and resources. Fortunately, domain-specific developer tools tend to have common reusable pieces that can help save time during the development process. For example, different user interfaces could be built using the same underlying domain-related logic. With this idea in mind, we developed a data model library called CWL-ts. CWL-ts is a utility library for creating developer tools and user interfaces that work with CWL. CWL-ts has three main functions: abstracting the version of CWL (currently supporting v1.0 and a Seven Bridges flavor of draft-2), providing an API for manipulating the CWL document, and validating the document. The library is written in TypeScript but can be used as a plain JavaScript or Node.js library. For this reason, CWL-ts can be used in a variety of different applications and platforms. Likewise, TypeScript provides the added benefit of types to JavaScript, which is especially useful in the context of CWL. CWL-ts is already being used as the domain backbone of the Rabix Composer, an integrated development environment for the Common Workflow Language. Our goal is to grow the utility of CWL-ts as CWL develops further, bolstering it with test coverage and full conformance to the CWL specification, and offering it to the community as a starting point for creating GUI-rich developer tools.

The GA4GH Tool Registry Service (TRS) and Dockstore - Year One
Date: Saturday, July 22
Time: 11:15 AM - 11:22 AM
Room: Meeting Hall IV
  • Andrew Duncan, OICR, Canada
  • Brian O'Connor, UCSC, United States
  • Denis Yuen, OICR, Canada
  • Gary Luu, OICR, Canada
  • Han Cao, OICR, Canada
  • Janice Patricia, OICR, Canada
  • Lincoln Stein, OICR, Canada
  • Solomon Shorser, OICR, Canada
  • Vincent Chung, OICR, Canada
  • Vincent Ferretti, OICR, Canada
  • Xiang Liu, OICR, Canada

Presentation Overview:

Project Website: https://dockstore.org
Source Code: https://github.com/ga4gh/dockstore/
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0.html)
Workflows written for the PCAWG (Pan-Cancer Analysis of Whole Genomes) study created a challenge for the cloud projects team at OICR and our collaborators due to the highly heterogeneous nature of our fourteen computing environments (cloud and HPC, commercial and academic, geographically distributed). We met the challenge by distributing our workflows in Docker containers described by a proprietary descriptor. As we wrapped up, we realized that this approach could be of use to others, so we adopted CWL (Common Workflow Language) descriptors and split out the Dockstore project as its own open-source website and associated utilities. This project reached a 1.0 milestone in September 2016.
Tools registered in Dockstore are encouraged to include open-source build instructions for the Docker image, to pull open-source binaries and/or source code into the image, to be built on a publicly visible third-party service, and to be accompanied by a programmatic descriptor of the tool including metadata. In practice, this has meant Dockerfiles and Common Workflow Language (CWL) or Workflow Description Language (WDL) files checked into public GitHub repositories, built on Quay.io, and indexed on Dockstore.org.
We have also donated a Tool Registry Service (TRS) ( ​https://goo.gl/zfhR4q​ ) API to the Global Alliance for Genomics and Health (GA4GH) in the hope that like-minded groups can implement it in order to facilitate the sharing of Dockerised tools in the sciences. Currently, Dockstore indexes workflows for PCAWG, the workflows for the GA4GH-DREAM challenge, the UCSC Genomics Institute, and more while welcoming contributions from all bioinformaticians.
In this talk we also present our lessons learned creating CWL+Docker images, open-source and commercial platforms for running tools and workflows on Dockstore, and interoperability challenges when converting tools between tool registries. We also highlight new Dockstore features such as test parameters, pluggable file provisioning, private tool support, and workflow visualization.
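As a rough illustration of what a TRS client looks like, the Python sketch below lists registered tools. The base URL is an assumption based on the public Dockstore deployment, and the query parameter and response fields are illustrative rather than a definitive rendering of the API version discussed in this talk.

import requests

# Assumed TRS endpoint exposed by the public Dockstore deployment (version paths may differ).
TRS_BASE = "https://dockstore.org/api/api/ga4gh/v1"

resp = requests.get(f"{TRS_BASE}/tools", params={"name": "pcawg"})
resp.raise_for_status()

for tool in resp.json():
    # 'id' and 'toolname' are typical fields in TRS tool records; treat them as illustrative.
    print(tool.get("id"), "-", tool.get("toolname"))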

The GA4GH Task Execution System (TES) and Funnel Server
Date: Saturday, July 22
Time: 11:22 AM - 11:28 AM
Room: Meeting Hall IV
  • Adam Struck, Oregon Health and Science University, United States
  • Alex Buchanan, Oregon Health and Science University, United States
  • Brian O'Connor, University of California, Santa Cruz, United States
  • Kyle Ellrott, OHSU, United States

Presentation Overview:

Project Website: https://github.com/ga4gh/task-execution-schemas
Source Code: https://github.com/ohsu-comp-bio/funnel
License: MIT License
The standardization of workflow and tool definitions and the containerization of software have provided a way to deploy reproducible computing in a truly portable fashion. The standardization effort is hampered by the large number of infrastructural backends available. These infrastructural backends range from queuing systems, such as SLURM, GridEngine or Condor, to cloud-based solutions backed by Amazon, Microsoft Azure, Google Compute or OpenStack. When building a workflow engine, developers are burdened with writing a different backend for each of the infrastructures they want to support. This means that we now have portable code but not portable deployment systems.
The Global Alliance for Genomics and Health (GA4GH) Data Working Group is an international consortium that seeks to provide common APIs that enable genomic computing. They already have defined APIs for querying genomic read and variant data. Expanding on that effort, the Container and Workflows group is working to design a number of APIs to enable interoperable exchange of reproducible workflows and tools. The Task Execution Schema (TES) is designed to be an API that will allow users and developers a common method for issuing batch jobs to compute environments. In principle, this API is very similar to the AWS Batch API or the Google Genomics Pipeline API, but the Task Execution Schema has been designed to be platform agnostic, so that the user can deploy the same tasks in multiple environments.
TES is designed to be simple and robust. Users create a JSON message that describes the required input files to be staged, the container images, the commands to be run, as well as the result files and directories that need to be copied out at the end of the run. The user then submits the task request via HTTP POST. The status of submitted tasks, including logging, can be tracked via the HTTP-based API, so client libraries are extremely lightweight. We have validated that the TES API can be used as a backend for several existing workflow engines, including the Seven Bridges CWL engine (Bunny), the Broad Institute's WDL engine (Cromwell), and Galaxy.
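A minimal sketch of such a submission from Python follows; the endpoint path and field names track a later published revision of the TES schema (and a locally running server such as Funnel), so treat them as an illustration rather than the exact draft presented here.

import requests

# Illustrative TES task: stage one input, run a containerized command, copy one result back.
task = {
    "name": "md5-example",
    "inputs": [{"url": "s3://example-bucket/sample.bam", "path": "/data/sample.bam"}],
    "outputs": [{"url": "s3://example-bucket/sample.bam.md5", "path": "/data/sample.bam.md5"}],
    "executors": [{
        "image": "ubuntu:16.04",
        "command": ["/bin/sh", "-c", "md5sum /data/sample.bam > /data/sample.bam.md5"],
    }],
}

# Submit the task to a TES server over HTTP POST and poll its state.
resp = requests.post("http://localhost:8000/v1/tasks", json=task)
resp.raise_for_status()
task_id = resp.json()["id"]

state = requests.get(f"http://localhost:8000/v1/tasks/{task_id}").json().get("state")
print(task_id, state)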
As a specification, any vendor has the opportunity to provide a TES-compliant service. The first platform to provide the TES API is called Funnel. Funnel is an open source project from Oregon Health and Science University, built in collaboration with Intel, and is designed to execute tasks across multiple backends, including Condor, OpenStack and Google Compute Engine. With the TES API, users can quickly move between environments and service providers. We expect that the number of supported backends and servers that provide the TES API will grow very quickly.

The GA4GH Workflow Execution Schema (WES)
Date: Saturday, July 22
Time: 11:28 AM - 11:34 AM
Room: Meeting Hall IV
  • Brian O'Connor, University of California, Santa Cruz, United States
  • Jeff Gentry, Broad Institute, United States
  • Peter Amstutz, Curoverse, United States

Presentation Overview:

Project Website: http://ga4gh.org/#/cwf-team
Source Code: https://github.com/ga4gh/workflow-execution-schemas
License: Apache 2.0
The Workflow Execution Schema (WES) is a lightweight HTTP REST API defining how an application can submit requests to workflow execution systems in a standardized way. It is being developed by the Containers and Workflows task team of the Data Working Group (DWG) of the Global Alliance for Genomics and Health (GA4GH), which is dedicated to defining and promoting technical standards to improve portability and interoperability of data analysis workflows in Life Sciences.
A researcher wishing to perform analysis on a data set collected by another institution often faces both technical and regulatory hurdles that make it difficult to simply obtain a copy of the data to analyze locally. Federated computing has the promise to reduce these barriers to science by making it possible to bring the analysis to the data.
Workflow execution engines (such as Arvados, Toil, Rabix, Cromwell, Consonance, etc.) will support this API to facilitate running federated analyses on multiple providers. A client using WES submits a description of the workflow to execute along with a set of input parameters. Workflows may be written using the Common Workflow Language (CWL), the Workflow Description Language (WDL), or other workflow languages supported by the target platform. Input parameters to the workflow execution are described using a JSON object.
The client may then monitor execution of the workflow to get status, errors, and logs. When a workflow completes successfully, output parameters to the workflow execution are also described using a JSON object. References to input and output files are described using URIs.
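A hedged sketch of that interaction pattern in Python is shown below; the endpoint paths and form fields follow a later published revision of the schema and a hypothetical WES deployment, so they illustrate the idea rather than the exact draft under discussion.

import requests

WES_BASE = "https://wes.example.org/ga4gh/wes/v1"  # hypothetical WES deployment

# Submit a CWL workflow by URL together with a JSON object of input parameters.
run = requests.post(
    f"{WES_BASE}/runs",
    data={
        "workflow_type": "CWL",
        "workflow_type_version": "v1.0",
        "workflow_url": "https://raw.githubusercontent.com/example/repo/master/workflow.cwl",
        "workflow_params": '{"input_file": {"class": "File", "path": "s3://bucket/sample.bam"}}',
    },
).json()

# Poll the run for status; a finished run also exposes its outputs as a JSON object.
status = requests.get(f"{WES_BASE}/runs/{run['run_id']}/status").json()
print(run["run_id"], status.get("state"))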
We present an update on development of WES and outline the power of standardized workflow execution through projects like the GA4GH/Dream Infrastructure Challenge. We believe adoption of WES will enable researchers to more easily perform analysis on restricted data, provide a common target for front-end user interfaces that submit workflows, and support interoperability among workflows executing on different providers or written in different languages through the use of a common API for invocation.

The GA4GH/DREAM Infrastructure Challenges
Date: Saturday, July 22
Time: 11:34 AM - 11:41 AM
Room: Meeting Hall IV
  • Brad Chapman, Harvard, United States
  • Brian O'Connor, UCSC, United States
  • Christopher Ketchum, UCSC, United States
  • Denis Yuen, OICR, Canada
  • Geraldine Auwera, the Broad, United States
  • J. Strattan, Stanford, United States
  • James Eddy, Sage Bionetworks, United States
  • Jeff Gentry, Broad, United States
  • Jeremiah Savage, University of Chicago, United States
  • Joseph Shands, UCSC, United States
  • Justin Guinney, Sage Bionetworks, United States
  • Kyle Ellrott, OHSU, United States
  • Peter Amstutz, Curoverse, United States
  • Umberto Ravaioli, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

Project Website: http://synapse.org/GA4GHToolExecutionChallenge
Source Code: https://github.com/ga4gh/dockstore/ and various repositories for each workflow/tool used in the challenge
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0.html)
Genomic datasets continue to grow in size and complexity and this hampers the ability to move data to compute environments for analysis. As a result, there is an increasing need to distribute algorithms across multiple clouds and regions and to perform analysis “in place” with compute environments that are either close by, or integrated with, storage solutions. The value, and necessity, of a distributed compute model, where algorithms are containerized, sent to remote clouds, and data transfer is minimized, has inspired the work of several groups to tackle the need for standards in this field. Three highly active and related containerized tool/workflow groups are promoting standards and technologies in this space: the GA4GH Containers and Workflows Task Team, the NCI Containers and Workflows Interest Group, and the NIH Commons Framework Working Group on Workflow Sharing and Docker Registry. The GA4GH group has spent the last year creating standards for tool and workflow sharing (the Tool Registry Service API standard), task execution on clouds (the Task Execution Service API standard), and workflow execution on clouds (the Workflow Execution Service API standard), along with high-quality implementations for each (including Dockstore.org, Funnel, and Toil respectively).
To demonstrate the utility of these standards and implementations concretely, the GA4GH group and Sage Bionetworks have organized a series of “infrastructure challenges”. The Phase 1 challenge (http://synapse.org/GA4GHToolExecutionChallenge) showed that a Docker-based tool can be shared through the GA4GH Tool Registry Service (TRS) and executed in a wide variety of systems, producing the same result. 35 groups completed this challenge successfully, representing a wide range of execution platforms, from open-source systems like Cromwell and Toil to commercial offerings like Curoverse, DNAnexus, and Seven Bridges Genomics. Here we present the next phase of the challenge, Phase 2, which looks to use realistic workflows (from groups such as the Broad, Genomic Data Commons, bcbio, and ENCODE) and to leverage the DREAM Challenge infrastructure for organization and community outreach. Participants that are successful in the challenge will show that complex, real-world genomics workflows can be run successfully in different systems and produce the same results. The challenges described here, and future challenges beyond Phase 2, provide a foundation for future large-scale projects, such as ICGCMed, that will depend on analyses running reliably and reproducibly across multiple clouds in a highly decentralized and distributed fashion.

Workflows interoperability with Nextflow and Common WL
Date: Saturday, July 22
Time: 11:45 AM - 11:50 AM
Room: Meeting Hall IV
  • Cedric Notredame, Center for Genomic Regulation (CRG), France
  • Evan Floden, Center for Genomic Regulation (CRG), New Zealand
  • Kevin Sayers, Center for Genomic Regulation (CRG), United States
  • Maria Chatzou, Center for Genomic Regulation (CRG), Greece
  • Paolo Tommaso, Center for Genomic Regulation (CRG), Italy

Presentation Overview:

Project Website: http://www.nextflow.io
Source Code: https://github.com/nextflow-io/nextflow
License: GPLv3
Reproducibility has become one of biology’s most pressing issues. This impasse has been fuelled by the combined reliance on increasingly complex data analysis methods and the exponential growth of biological datasets. When considering the installation, deployment and maintenance of bioinformatic pipelines, an even more challenging picture emerges due to the lack of community standards. Moreover, the effect of limited standards on reproducibility is amplified by the very diverse range of computational platforms and configurations on which these applications are expected to be applied (workstations, clusters, HPC, clouds, etc.).
Nextflow [1] is a pipeline orchestration tool that has been designed to address exactly these issues. It provides a domain-specific language (DSL) which streamlines the writing of complex distributed computational pipelines in a portable and replicable manner. It allows the seamless parallelization and deployment of any existing application with minimal development and maintenance overhead, irrespective of the original programming language. The built-in support for software containers guarantees numerical stability and enables truly replicable computational workflows across multiple execution platforms.
This talk will introduce Nextflow support for the Common Workflow Language [2]. CWL is a community-driven specification for describing analysis workflows and tools in a portable manner that is being used by a number of institutions. The presentation will discuss how Nextflow can be used as the execution engine for workflows defined using the CWL specification, the benefits and disadvantages of this approach, and the current limitations and open challenges of this project.
As a proof of concept, we will show how community-developed CWL pipelines can be deployed using the Nextflow execution runtime.
References
1. Nextflow enables reproducible computational workflows (Nature Biotech, April 2017, doi:10.1038/nbt.3820)
2. Common WL, https://dx.doi.org/10.6084/m9.figshare.3115156.v2

CWL Viewer: The Common Workflow Language Viewer
Date: Saturday, July 22
Time: 11:50 AM - 11:55 AM
Room: Meeting Hall IV
  • Carole Goble, The University of Manchester, United Kingdom
  • Mark Robinson, The University of Manchester, United Kingdom
  • Michael Crusoe, Common Workflow Language project, Lithuania
  • Stian Soiland-Reyes, The University of Manchester, United Kingdom

Presentation Overview:

Project Website: https://view.commonwl.org/
Source Code: https://github.com/common-workflow-language/cwlviewer
License: CWL Viewer is licensed under the terms of the Apache License, Version 2.0, see https://www.apache.org/licenses/LICENSE-2.0
Abstract
The Common Workflow Language (CWL) project emerged from the BOSC 2014 Codefest as a grassroots, multi-vendor working group to tackle the portability of data analysis workflows. Its specification for describing workflows and command line tools aims to make them portable and scalable across a variety of computing platforms.
At its heart, CWL is a set of structured text files (YAML) with various extensibility points to the format. However, the CWL syntax and multi-file collections are not conducive to workflow browsing, exchange and understanding: for this we need a visualization suite.
CWL Viewer is a richly featured CWL visualization suite that graphically presents and lists the details of CWL workflows with their inputs, outputs and steps. It also packages the CWL files into a downloadable Research Object Bundle including attribution, versioning and dependency metadata in the manifest, allowing it to be easily shared. The tool operates over any workflow held in a GitHub repository. Other features include: path visualization from parent and child nodes; support for nested workflows; workflow graph download in a range of image formats; a gallery of previously submitted workflows; and support for private git repositories and public GitHub, including live updates over versioned workflows.
The CWL Viewer is the de facto CWL visualization suite and has been enthusiastically received by the CWL community.
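Pointing the viewer at a workflow is typically a matter of constructing a permalink from the GitHub location of the CWL file; the URL pattern in the Python sketch below is inferred from the public instance and should be read as an assumption rather than a documented API.

# Build a CWL Viewer permalink for a CWL file hosted on GitHub.
# The path scheme is inferred from https://view.commonwl.org and is an assumption.
def cwl_viewer_url(org, repo, branch, path):
    return (f"https://view.commonwl.org/workflows/"
            f"github.com/{org}/{repo}/blob/{branch}/{path}")

print(cwl_viewer_url("common-workflow-language", "workflows",
                     "master", "workflows/lobSTR/lobSTR-workflow.cwl"))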

Screw: tools for building reproducible single-cell epigenomics workflows
Date: Saturday, July 22
Time: 11:55 AM - 12:00 PM
Room: Meeting Hall IV
  • Alexander Goncearenco, National Institutes of Health, United States
  • Aly Karsan, Michael Smith Genome Sciences Centre, Canada
  • Azhar Khandekar, National Institutes of Health, United States
  • Benjamin Busby, National Institutes of Health, United States
  • Benjamin Decato, University of Southern California, United States
  • Kieran O'Neill, BC Cancer Agency, Canada

Presentation Overview:

DNA methylation is a heritable epigenetic mark that shows a strong correlation with transcriptional activity. The gold standard for detecting DNA methylation is whole genome bisulfite sequencing (WGBS). Recently, WGBS has been performed successfully on single cells (SC-WGBS) [1]. The resulting data represents a fundamental shift in the capacity to measure and interpret DNA methylation, especially in rare cell types and contexts where subtle cell-to-cell heterogeneity is crucial, such as in stem cells or cancer. However, SC-WGBS comes with unique technical challenges which require new analysis techniques to address. Furthermore, although some software tools have been published, and several existing studies have tended to use similar methods, no standardized pipeline for the analysis of SC-WGBS yet exists.
Simultaneously, there has been a drive within bioinformatics towards improved reproducibility. Textual descriptions of bioinformatic analyses are deeply inadequate, and often require “forensic bioinformatics” to reproduce [2]. Recreating the exact results of a study requires not only the exact code, but also the exact software (down to version, compilation options, etc). Common Workflow Language (CWL) provides a framework for specifying complete workflows, while Docker allows for bundling of the exact software and auxiliary data used in an analysis within a container that can be executed anywhere. Together, these have the potential, via repositories such as Dockstore [3], to enable completely reproducible bioinformatics research.
Here we present Screw (Single Cell Reproducible Epigenomics Workflow). Screw is a collection of standard tools and workflows for analysing SC-WGBS data, implemented in CWL, with an accompanying Docker image. Screw provides the parts for constructing fully-reproducible SC-WGBS analyses. Tools provided include quality control visualization, clustering and visualisation of cells by pairwise dissimilarity measures, construction of recapitulated-bulk methylomes from single cells of the same lineage, generation of bigWig methylation tracks for downstream visualization, and wrappers around published tools such as DeepCpG [4] and LOLA [5]. Screw has the added benefit that CWL's compatibility with interactive GUI-based workflow tools such as Galaxy can lower the barriers to use for less-technical wet lab biologist users.
CWL sources for Screw are available under the MIT license at https://github.com/Epigenomics-Screw/Screw. Tools and workflows are available from Dockstore under the Epigenomics-Screw namespace, for example https://dockstore.org/workflows/Epigenomics-Screw/Screw/screw-preprocess
1. Schwartzman, Tanay (2015) Nature Reviews Genetics 16:716–26.
2. Gentleman (2005) Statistical Applications in Genetics and Molecular Biology. doi: 10.2202/1544-6115.1034
3. O’Connor et al. (2017) F1000Research 6:52.
4. Angermueller, Lee, Reik, Stegle (2016) bioRxiv 055715.
5. Sheffield, Bock (2015) Bioinformatics 32:587–589.

BioThings Explorer: Utilizing JSON-LD for Linking Biological APIs to Facilitate Knowledge Discovery
Date: Saturday, July 22
Time: 12:03 PM - 12:08 PM
Room: Meeting Hall IV
  • Chunlei Wu, The Scripps Research Institute, United States
  • Cyrus Afrasiabi, The Scripps Research Institute, United States
  • Jiwen Xin, The Scripps Research Institute, United States
  • Sebastien Lelong, The Scripps Research Institute, United States

Presentation Overview:

Project Website: http://biothings.io/explorer
Source Code: https://github.com/biothings/biothings_explorer_web
License: Apache License 2.0

RESTful APIs have been widely used to distribute biological data, and many popular biological APIs, such as MyGene.info, MyVariant.info, DrugBank, Reactome, WikiPathways and Ensembl, adopt JSON as their primary data format. These disparate resources feature diverse types of biological entities, e.g. variants, genes, proteins, pathways, drugs, symptoms, and diseases. The integration of these API resources would greatly facilitate scientific domains such as translational medicine, where multiple types of biological entities are involved, often from different resources.

To fulfill the task of integrating API resources, we have designed a workflow pattern using semantic web technologies. This workflow pattern uses JSON-LD, a W3C standard for representing Linked Data. In our proposal, each API specifies a JSON-LD context file, which provides a Uniform Resource Identifier (URI) mapping for each input/output type. In addition, an API registry system is created, where API metadata, such as query syntax and input/output types, is collected, allowing API calls to be generated automatically. By utilizing this workflow, we are able to link different API resources through the input/output types which they share in common. For example, MyGene.info adopts Entrez Gene ID as its input type, which is also one of the output types for MyVariant.info. Thus, data in these two APIs can be linked together through Entrez Gene ID.
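The linking idea can be sketched with two plain HTTP calls from Python. This is an illustration of identifier-based joining, not the BioThings Explorer implementation itself; in particular, the MyVariant query field used for the join is an assumption chosen for the example.

import requests

ENTREZ_GENE_ID = "1017"  # CDK2, used purely as an example

# MyGene.info: fetch annotation for the gene by its Entrez Gene ID.
gene = requests.get(f"https://mygene.info/v3/gene/{ENTREZ_GENE_ID}",
                    params={"fields": "symbol,name"}).json()
print(gene.get("symbol"), "-", gene.get("name"))

# MyVariant.info: retrieve variants annotated to the same gene, joining on the symbol.
# The query field below (dbnsfp.genename) is one possible join point and is assumed here.
variants = requests.get("https://myvariant.info/v1/query",
                        params={"q": f"dbnsfp.genename:{gene['symbol']}", "size": 5}).json()
for hit in variants.get("hits", []):
    print(hit["_id"])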

Following this workflow, we have developed a Python package as well as a web visualization interface named ‘BioThings Explorer’ using Cytoscape.js. These tools empower users to explore the relationship between different biological entities through the vast amount of biological data provided by various API resources in a visually organized manner. For example, users could easily explore all biological pathways in which a rare Mendelian disease candidate gene is involved, and then find all genes as well as chemical compounds which could regulate these biological pathways (IPython Notebook Demo: https://goo.gl/sx34T2), thus providing potential treatment options.

Discovery and visualisation of homologous genes and gene families using Galaxy
Date: Saturday, July 22
Time: 12:08 PM - 12:13 PM
Room: Meeting Hall IV
  • Anil Thanki, Earlham Institute, United Kingdom
  • Nicola Soranzo, Earlham Institute, United Kingdom
  • Robert Davey, Earlham Institute, United Kingdom
  • Wilfried Haerty, Earlham Institute, United Kingdom

Presentation Overview:

Source Code: https://github.com/TGAC/earlham-galaxytools
License: MIT License
The phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of gene families and plays a vital role in finding ancestral gene duplication events as well as identifying regions that are under positive selection within species.
The Ensembl GeneTrees pipeline generates gene trees based on coding sequences and provides details about exon conservation, and is used in the Ensembl Compara project to discover homologous gene families. Since expertise is required to configure and run the pipeline via the command-line, we created GeneSeqToFamily, an open-source Galaxy workflow based on Ensembl GeneTrees. GeneSeqToFamily helps users to run potentially large-scale gene family analyses without requiring the command-line while still allowing tool parameters, configurations, and the tools themselves to be modified.
At present, we are using this workflow on a set of vertebrate genomes (human, dog, chicken, kangaroo, macropod, opossum, mouse, platypus, and Tasmanian devil), with some analyses comprising more than 13,000 gene families. Gene families discovered with GeneSeqToFamily can be visualised using the Aequatus.js interactive tool, integrated within Galaxy as a visualisation plugin.
Handling this large number of input datasets became problematic for both Galaxy itself and certain tools such as T-Coffee, which adversely affected memory allocation and file I/O processes. We have made modifications to both the T-Coffee wrapper scripts and low-level changes to the Galaxy framework as a whole to help address these issues.
We are also working on integrating protein domain information from SMART (a Simple Modular Architecture Research Tool) to complement discovered gene families, as well as the incorporation of PantherDB into the workflow for validation of families.

YAMP : Yet Another Metagenomic Pipeline
Date: Saturday, July 22
Time: 12:13 PM - 12:18 PM
Room: Meeting Hall IV
  • Alessia Visconti, King's College London, United Kingdom
  • Mario Falchi, King's College London, United Kingdom
  • Tiphaine Martin, King's College London, United Kingdom

Presentation Overview:

URL: https://github.com/alesssia/YAMP
License: GNU GPL 3
Thanks to the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on microorganisms (bacteria, archaea, microbial eukaryotes, fungi, and viruses) and on their connections with human health and diseases has surged, and, consequently, a plethora of approaches and software has been made available for their study, making it difficult to select the best methods and tools.
Here we present Yet Another Metagenomic Pipeline (YAMP), which, starting from the raw sequencing data and with a strong focus on quality control (QC), takes data processing through to functional annotation within hours. Specifically, the QC (performed by means of several tools from the BBMap suite [1]) allows de-duplication, trimming, and decontamination of metagenomics (and metatranscriptomics) sequences, and each of these steps is accompanied by a visualisation of the data quality. The QC is followed by multiple steps aiming at characterising the taxonomic and functional diversity of the microbial community. Namely, taxonomic binning and profiling are performed by means of MetaPhlAn2 [2], which uses clade-specific markers to both detect the organisms present in a microbiome sample and estimate their relative abundance. The functional capabilities of the microbiome community are currently assessed by the HUMAnN2 pipeline [3], which first stratifies the community into known and unclassified organisms using the MetaPhlAn2 results and the ChocoPhlAn pan-genome database, and then combines these results with those obtained through an organism-agnostic search on the UniRef proteomic database. The next YAMP release, currently under development, will also support MOCAT2 [4] and an optimised version of the HUMAnN2 pipeline. QIIME [5] is used to evaluate multiple diversity measures.
YAMP is built on Nextflow [6], a framework based on the dataflow programming model, which allows writing workflows that are highly parallel, easily portable (including on distributed systems), and very flexible and customisable, characteristics which have been inherited by YAMP. Users can decide the flow of their analyses, for instance limiting them to the QC or starting from already QC-ed sequences. New modules can be added easily and the existing ones can be customised, even though we have already provided default parameters derived from our own experience. While YAMP was developed to be routinely used in clinical research, expert bioinformaticians will appreciate its flexibility and modularisation. YAMP is accompanied by a Docker container [7], which saves users the hassle of installing the required software while increasing, at the same time, the reproducibility of the YAMP results.
References
1. https://sourceforge.net/projects/bbmap
2. Truong, D.T., et al. Metaphlan2 for enhanced metagenomic taxonomic profiling. Nature methods
12(10), 902–903 (2015)
3. https://bitbucket.org/biobakery/humann2
4. Kultima, J.R., et al. MOCAT2: A metagenomic assembly, annotation and profiling framework.
Bioinformatics 32(16), 2520–2523 (2016).
5. Caporaso, J.G., et al. QIIME allows analysis of high-throughput community sequencing data. Nature
Methods 7(5), 335–336 (2010)
6. Di Tommaso, P., et al. Nextflow enables reproducible computational workflows. Nature Biotechnology
35, 316–319 (2017)
7. https://www.docker.com

MultiQC: Visualising results from common bioinformatics tools
Date: Saturday, July 22
Time: 2:00 PM - 2:18 PM
Room: Meeting Hall IV
  • Max Kaeller, Science for Life Laboratory, Sweden
  • Philip Ewels, Science for Life Laboratory, Sweden

Presentation Overview:

Project Website: http://multiqc.info/
Source Code: https://github.com/ewels/MultiQC
License: GNU GPLv3
A typical bioinformatics analysis pipeline run can generate hundreds, if not thousands, of files. To properly assess the results of the pipeline, log files from each tool for each sample should be checked. This can be an impossible task, leading to cherry-picking of logs and problems progressing through to later stages of analysis.
MultiQC is an open-source Python package that parses files from nearly 40 different bioinformatics tools. It creates a stand-alone HTML report visualising key metrics, allowing fast and accurate evaluation of analysis results. It is now in its second year, with an increasingly large user community.
At its core, MultiQC works by searching supplied directories for log files that it recognises. It parses these and produces a report with interactive plots and tables describing key metrics. Parsed data is saved to machine readable files for simple downstream processing. Much of the behaviour of MultiQC can be customised through configuration files, plus the code is structured in such a way that it is easy to write plugins to extend the core functionality.
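In day-to-day use the entry point is the command line; a minimal invocation, wrapped in Python here only for consistency with the other sketches in this programme, scans a results directory and writes a single HTML report. The flags shown are standard documented options rather than anything specific to this talk.

import subprocess

# Scan an analysis directory for recognised log files and write one HTML report.
# -n sets the report name, -o the output directory.
subprocess.run(
    ["multiqc", "results/", "-n", "project_report", "-o", "multiqc_output"],
    check=True,
)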
Here, I will describe how to get the best out of MultiQC, including how to customise reports and overcome common problems. I will describe how plugins can be used to pull sample metadata from LIMS software and how results can be pushed to a database for long term trend visualisation.

NGL - a molecular graphics library for the web
Date: Saturday, July 22
Time: 2:18 PM - 2:23 PM
Room: Meeting Hall IV
  • Alexander Rose, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego Rutgers, United States
  • Stephen Burley, RCSB Protein Data Bank, San Diego Supercomputer Center, UC San Diego Rutgers, The State University of New Jersey, United States

Presentation Overview:

Source Code & Project Website: https://github.com/arose/ngl/
License: MIT license
Interactive visualization and annotation of large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate. Integrative/Hybrid approaches are increasingly being used to determine 3D structures of biological macromolecules by combining information from X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy with data from diverse complementary experimental and computational methods. The wealth, size and complexity of available and future structures make scalable visualization and annotation solutions more important than ever. The web can provide easy access to the resulting visualizations for all interested parties, including colleagues, collaborators, reviewers and educators. Herein, we utilize the web-based NGL library to provide 3D visualization of experimental and computational data integrated with general molecular graphics.
The NGL library has a versatile API to control every aspect from data loading, to processing, to rendering. A distinguishing feature of NGL is its scalability to systems with a million atoms and more. Further, the library supports many file formats for small molecules, macromolecular structures, molecular dynamics trajectories, and maps for crystallographic, microscopy and general-purpose volumetric data. Annotations can be loaded from text, JSON, MessagePack or XML files. A wide array of customizable representations is available. Molecular data can be displayed as balls, sticks, cartoons, surfaces and labels or with specialized representations such as hyperballs and ropes. Volumetric data can be rendered as isosurfaces, point clouds or volume slices. Additional file parsers and data representations can be added through a plugin system.
The NGL library allows developers to create custom visualization solutions for specialized or novel 3D data derived from bioinformatics calculations and biophysical/biochemical experiments. The resulting interactive visualizations enable spatial understanding and exploratory analyses. Furthermore, these web-based tools simplify data exchange and foster collaborative analysis.

GRAPHSPACE: Stimulating interdisciplinary collaborations in network biology
Date: Saturday, July 22
Time: 2:23 PM - 2:28 PM
Room: Meeting Hall IV
  • Aditya Bharadwaj, Virginia Tech, United States
  • Allison Tegge, Virginia Tech, United States
  • Anna Ritz, Reed College, United States
  • Christopher Poirel, RedOwl Analytics, United States
  • Divit Singh, Virginia Tech, United States
  • Jean Peccoud, Colorado State University, United States
  • John Tyson, Virginia Tech, United States
  • Kurt Luther, Virginia Tech, United States
  • Neil Adames, Colorado State University, United States
  • Pavel Kraikivski, Virginia Tech, United States
  • Shiv Kale, Virginia Tech, United States
  • T. Murali, Virginia Tech, United States

Presentation Overview:

Project Website: http://graphspace.org
Source Code: https://github.com/Murali-group/GraphSpace
License: GNU General Public License v3
Computational analysis of molecular interaction networks has become pervasive in systems biology. Despite the existence of several software systems and interfaces to analyze and view networks, interdisciplinary research teams in network biology face several challenges in sharing, exploring, and interpreting computed networks in their collaborations.
GRAPHSPACE is a web-based system that provides a rich set of user-friendly features designed to stimulate and enhance network-based collaboration:
• Users can upload richly-annotated networks, irrespective of the algorithms or software used to generate them. GRAPHSPACE networks follow the JSON format supported by Cytoscape.js [1]. Users of Cytoscape [3] can export their networks and upload them directly into GRAPHSPACE.
• Users can create private groups, invite other users to join groups, and share networks with groups.
• A user may search for networks with a specific property or that contain a specific node or collection of nodes.
• A powerful layout editor allows users to efficiently modify node positions, edit node and edge styles, save new layouts, and share them with other users.
• Researchers may make networks public and provide a persistent URL in a publication, enabling other researchers to explore these networks.
• A comprehensive RESTful API streamlines programmatic access to GRAPHSPACE features.
• A Python module called graphspace_python allows a user to rapidly construct a graph, set visual styles of nodes and edges, and then upload the graph, all within tens of lines of code (see the sketch below). It is very easy to integrate this script into a user's software pipeline.
Currently, GraphSpace supports more than 100 users who have stored more than 21,000 graphs (most of them private) containing a total of over 1.4 million nodes and 3.8 million edges. Conceptually, GRAPHSPACE serves as a bridge between visualization and analysis of individual networks supported by systems such as Cytoscape [3] and the network indexing capabilities of NDEx [2]. We anticipate that GRAPHSPACE will find wide use in network biology projects and will assist in accelerating all aspects of collaborations among computational biologists and experimentalists, including preliminary investigations, manuscript development, and dissemination of research.
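A rough sketch of the graphspace_python usage pattern referred to above follows; the class and method names are taken from the module's documented quickstart and may differ slightly between releases, and the account credentials are placeholders.

from graphspace_python.api.client import GraphSpace
from graphspace_python.graphs.classes.gsgraph import GSGraph

# Placeholder credentials for a GraphSpace account.
graphspace = GraphSpace("user@example.org", "password")

# Build a tiny two-node network with explicit visual styles.
G = GSGraph()
G.set_name("Toy signalling path")
G.add_node("A", label="Receptor")
G.add_node_style("A", shape="ellipse", color="#ACCE9A", width=90, height=45)
G.add_node("B", label="Kinase")
G.add_node_style("B", shape="rectangle", color="#F8CE8A", width=90, height=45)
G.add_edge("A", "B", directed=True)
G.add_edge_style("A", "B", directed=True)

# Upload the graph to the user's account, where it can be shared with a group.
graphspace.post_graph(G)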
References
[1] M. Franz, C. T. Lopes, G. Huck, Y. Dong, O. Sumer, and G. D. Bader. Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics, 32:309–311, Sep 2015.
[2] D. Pratt, J. Chen, D. Welker, R. Rivas, R. Pillich, V. Rynkov, K. Ono, C. Miello, L. Hicks, S. Szalma, A. Stojmirovic, R. Dobrin, M. Braxenthaler, J. Kuentzer, B. Demchak, and T. Ideker. NDEx, the Network Data Exchange. Cell Syst, 1(4):302–305, Oct 2015.
[3] M. E. Smoot, K. Ono, J. Ruscheinski, P. L. Wang, and T. Ideker. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics, 27(3):431–432, Feb 2011.

Efficient detection of well-hopping duplicate reads on Illumina patterned flowcells
Date: Saturday, July 22
Time: 2:28 PM - 2:33 PM
Room: Meeting Hall IV
  • Donald Dunbar, Edinburgh Genomics, United Kingdom
  • Judith Risse, Wageningen University and Research, Netherlands
  • Karim Gharbi, Edinburgh Genomics, United Kingdom
  • Timothy Booth, Edinburgh Genomics, United Kingdom

Presentation Overview:

Source Code: https://github.com/EdinburghGenomics/well_duplicates
License: BSD 2-clause "Simplified" License
Introduction
Duplication of fragments on the Illumina HiSeq 4000 and HiSeq X sequencing machines can occur when DNA leaks into adjacent empty wells on the patterned flow cell tile (the figure is a schematic representation of a small part of a flow cell tile, and the arrows represent the spread of well-hopping fragments). This results in a pattern of localised repeats similar to the related issue of optical duplicate reads seen in earlier MiSeq and HiSeq flowcell types. Different names have been proposed for this, but we refer to these duplicates as “well duplicates” [1]. Detecting and minimising technical duplicates in DNA sequencing is important as these constitute wasted sequencing capacity due to under-loading and may also lead to erroneous results where the number of reads needs to be quantified, such as in RNA-seq. Until now, well duplicates could only be detected after mapping to a reference genome.
Materials & Methods
Our Python duplicate scanner incorporates both the detection algorithm and a new library to efficiently read the binary BCL files produced off the sequencer. The approach is to sample a representative number of centre wells from each tile (marked C in the diagram) and to scan the vicinity of each for duplicate sequences, allowing for a Levenshtein edit distance of 2 by default. To validate the results we analysed a phiX dataset with different loading concentrations on each lane and compared our results to those of the de facto standard tool, Picard MarkDuplicates [2]. Our raw results are fundamentally different in the sense that Picard ‘knows’ all duplicates and can then report the optical fraction, while our tool reports the duplicate fraction over all sampled wells. Therefore we apply a scaling factor between 0.5 and 1 based on estimated cluster size to obtain a value directly comparable with Picard.
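The core idea (sample centre wells, then look for near-identical reads in the surrounding wells) can be sketched in a few lines of Python. This toy version works on an in-memory mapping of well coordinates to sequences rather than on raw BCL files, and the neighbourhood radius is illustrative; only the edit-distance threshold of 2 mirrors the description above.

def edit_distance(a, b):
    """Levenshtein distance via simple dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def well_duplicate_fraction(wells, centres, radius=2, max_edits=2):
    """wells maps (x, y) coordinates to sequences; centres are the sampled centre wells."""
    duplicated = 0
    for x, y in centres:
        seq = wells[(x, y)]
        neighbours = ((x + dx, y + dy)
                      for dx in range(-radius, radius + 1)
                      for dy in range(-radius, radius + 1)
                      if (dx, dy) != (0, 0))
        if any(n in wells and edit_distance(seq, wells[n]) <= max_edits
               for n in neighbours):
            duplicated += 1
    return duplicated / len(centres)

# Toy example: two of the three sampled centres have a near-identical neighbour.
wells = {(0, 0): "ACGTACGT", (0, 1): "ACGTACGA",
         (5, 5): "TTTTCCCC", (9, 9): "GGGGAAAA", (9, 8): "GGGGAAAT"}
print(well_duplicate_fraction(wells, centres=[(0, 0), (5, 5), (9, 9)]))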
Discussion
Our tool is suitable for early-stage QC of new Illumina runs since it does not depend on demultiplexing or mapping of the reads to a reference, runs in minutes, and can be run as soon as the required number of cycles are complete. Our approach demonstrably compares well with Picard without requiring a mapping step, and can produce Picard-equivalent metrics after scaling. The raw values from our scanner are themselves informative and consistent and are being used successfully by Edinburgh Genomics to monitor sequencing run quality.
References
1. https://github.com/samtools/hts-specs/issues/121
2. https://broadinstitute.github.io/picard

An ensemble approach for gene set testing analysis with reporting capabilities
Date: Saturday, July 22
Time: 2:36 PM - 2:41 PM
Room: Meeting Hall IV
  • Matthew Ritchie, Walter and Eliza Hall Institute of Medical Research, Australia
  • Milica Ng, CSL Limited, Australia
  • Monther Alhamdoosh, CSL Limited, Australia

Presentation Overview:

Project Website: http://bioconductor.org/packages/EGSEA/
Source Code: https://github.com/Bioconductor-mirror/EGSEA/
License: GPL-2
Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular data set. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. The ensemble of gene set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA's gene set database contains around 25,000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse data sets and, based on biologists' feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes.
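The ensemble idea can be illustrated with a toy rank-aggregation sketch: each method ranks the gene sets, and a collective score is formed from the per-method ranks. The average-rank combination below is a generic illustration of the approach, not EGSEA's exact statistic, and the p-values are invented.

# Toy per-method gene set p-values from three of the kinds of methods an ensemble might combine.
pvalues = {
    "camera": {"OXPHOS": 0.002, "Apoptosis": 0.040, "Cell cycle": 0.300},
    "roast":  {"OXPHOS": 0.010, "Apoptosis": 0.020, "Cell cycle": 0.250},
    "gsva":   {"OXPHOS": 0.005, "Apoptosis": 0.090, "Cell cycle": 0.150},
}

def average_rank(per_method):
    """Rank gene sets within each method (smallest p-value first), then average the ranks."""
    totals = {}
    for scores in per_method.values():
        for rank, gene_set in enumerate(sorted(scores, key=scores.get), 1):
            totals[gene_set] = totals.get(gene_set, 0) + rank
    return {gs: total / len(per_method) for gs, total in totals.items()}

for gene_set, score in sorted(average_rank(pvalues).items(), key=lambda kv: kv[1]):
    print(f"{gene_set}: mean rank {score:.2f}")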
References
Alhamdoosh, M., Ng, M., Wilson, N.J., Sheridan, J.M., Huynh, H., Wilson, M.J., & Ritchie, M.E. (2017). Combining multiple tools outperforms individual methods in gene set enrichment analyses. Bioinformatics, 33 (3).

OpenMS 2.0: a flexible open-source software platform for mass spectrometry data analysis
Date: Saturday, July 22
Time: 2:41 PM - 2:46 PM
Room: Meeting Hall IV
  • Julianus Pfeuffer, Freie Universitaet Berlin, Germany
  • Knut Reinert, Freie Universitaet Berlin, Germany
  • Oliver Kohlbacher, University of Tuebingen, Germany
  • Timo Sachsenberg, University of Tuebingen, Germany

Presentation Overview: Show

Project Website: http://www.openms.org Source Code: https://github.com/OpenMS License: BSD 3-Clause
Abstract
High-throughput mass spectrometry has become a versatile technique to tackle a large range of questions in the life sciences. Being able to quantify diverse classes of biomolecules opens the way for improved disease diagnostics, elucidation of molecular structure, and functional and phenotypic studies. OpenMS is an open-source software package spanning the whole range from algorithms, to libraries, to scripting, to tools, and powerful workflows to transform mass spectrometric data into biological knowledge. OpenMS provides a framework consisting of core data structures, key algorithms, and I/O functions for C++ developers who intend to implement novel algorithms. These data structures and algorithms are also exposed as Python modules for rapid software prototyping. They also form the basis for more than 185 distinct tools providing specific functionality to process mass spectrometric data. A common interface and the use of open, standardized data formats enable chaining these tools into complex, custom-tailored analysis workflows. The project is supported by a lively community of developers worldwide.
Figure 1: Inspecting raw mass spectra with the OpenMS visualization tool TOPPView.
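As a taste of the Python bindings mentioned above, the sketch below loads an mzML file with pyOpenMS and inspects its MS1 spectra; the file name is a placeholder and minor API details may differ between pyOpenMS versions.

    import pyopenms

    exp = pyopenms.MSExperiment()
    pyopenms.MzMLFile().load("example.mzML", exp)   # "example.mzML" is a placeholder

    # Keep only MS1 spectra and report the size of the first one
    ms1 = [s for s in exp.getSpectra() if s.getMSLevel() == 1]
    print(len(ms1), "MS1 spectra")
    print("peaks in first MS1 spectrum:", ms1[0].size())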
OpenMS greatly benefits from a range of open-source projects by wrapping their functionality. To package these third-party tools into a coherent solution, we deploy OpenMS on all major operating systems. We provide self-contained, stand-alone installers for command-line execution in cluster environments, as well as plugins for integration into workflow systems like KNIME. KNIME is an open-source, industry-supported workflow and integration platform. In addition to the OpenMS plugin, hundreds of third-party plugins provide analysis tools for advanced statistics, integration of scripting languages, machine learning, database connectors or automated internet queries. This enables users to perform mass spectrometry data processing as well as downstream statistical data analysis in a single workflow. KNIME allows storing the full analysis workflow including all parameters and scripts. Parts of the workflow can be reused in other projects, and complete analysis tasks become reproducible.
Apart from development, the team also provides free consulting and regular training events to users and developers alike.

Interoperable, collaborative multi-platform variant calling with bcbio
Date: Saturday, July 22
Time: 2:46 PM - 2:51 PM
Room: Meeting Hall IV
  • Brad Chapman, Harvard Chan School, Bioinformatics Core, United States

Presentation Overview: Show

Availability https://github.com/chapmanb/bcbio-nextgen
Documentation https://bcbio-nextgen.readthedocs.org/en/latest/contents/cwl.html
License MIT
Authors: Brad Chapman, Rory Kirchner, Lorena Pantano, Shannan Ho Sui, Oliver Hofmann
Affiliations: Harvard Chan School Bioinformatics Core (http://bioinformatics.sph.harvard.edu/); The University of Melbourne Centre for Cancer Research (https://umccr.github.io/)
Contact: bchapman@hsph.harvard.edu
bcbio (https://github.com/chapmanb/bcbio-nextgen) is an open, community effort to develop validated and scalable variant calling, RNA-seq and small RNA analyses. Last year at BOSC, we discussed our work to port bcbio’s internal workflow representation to use the community developed Common Workflow Language (CWL: http://www.commonwl.org/). This transition removed barriers that prevented bcbio interoperability.
The practical benefit of changing to standardized workflow definitions is that bcbio works on multiple heterogeneous platforms. Using CWL, bcbio runs with Curoverse’s Arvados (https://arvados.org/), UCSC’s Toil (http://toil.ucsc-cgl.org/) and Seven Bridges’ rabix bunny (http://rabix.io/). In addition, conversion to the Workflow Description Language (WDL: https://software.broadinstitute.org/wdl/) provides in-progress support for Broad’s Cromwell (https://github.com/broadinstitute/cromwell) and DNAnexus’s APIs (https://github.com/dnanexus-rnd/dxWDL). There is also ongoing work with other communities actively developing CWL integration, including Nextflow (https://github.com/nextflow-io/cwl2nxf) and Galaxy (https://github.com/common-workflow-language/galaxy).
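A hypothetical sketch of what this interoperability looks like in practice: the same bcbio-exported CWL workflow (file names are placeholders) handed to two different CWL-conformant engines.

    import subprocess

    workflow = "bcbio-run/main-germline.cwl"        # hypothetical exported workflow
    job = "bcbio-run/main-germline-samples.json"    # hypothetical job/inputs file

    # Reference CWL runner
    subprocess.run(["cwltool", workflow, job], check=True)

    # UCSC Toil: same workflow and inputs, different execution engine
    subprocess.run(["toil-cwl-runner", workflow, job], check=True)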
Widespread bcbio interoperability allows running in many computational environments without the overhead of maintaining bcbio-specific integrations. Users can run locally or on high performance computing clusters with schedulers like SLURM, SGE and PBSPro. In addition, CWL-enabled runners work across the three major cloud providers: Amazon Web Services, Google Compute Engine and Microsoft Azure. Commercial platforms like Curoverse, Seven Bridges and DNAnexus enable clinical labs to run in controlled environments. The key component of this diverse support is collaboration through the CWL standard. This demonstrates the importance of community standard development, especially in research environments where it is typically difficult to fund maintenance of large-scale infrastructure development.
The talk will discuss the practicalities of adjusting bcbio to use CWL and WDL. We balance infrastructure work for the transition to CWL with continued improvement of workflows and community support. Testing and documentation of bcbio is more complex since we validate workflows, like germline and somatic variant calling, in many environments. This requires coordination between groups with different focus as platforms, analyses and standards develop. High level collaboration is increasingly important as we do more complex science, and we’ll describe the role of the open bioinformatics community in enabling it.

Gene Set Variation Analysis in cBioPortal
Date: Saturday, July 22
Time: 2:51 PM - 2:56 PM
Room: Meeting Hall IV
  • Fedde Schaeffer, The Hyve, Netherlands
  • Kees Bochove, The Hyve, Netherlands
  • Oleguer Casals, The Hyve, Netherlands
  • Pieter Lukasse, The Hyve, Netherlands
  • Sander Tan, The Hyve, Netherlands
  • Sjoerd Hagen, The Hyve, Netherlands

Presentation Overview: Show

Project Website​: ​http://www.cbioportal.org Source Code​: ​https://github.com/cbioportal License​: GNU Affero General Public License v3
Abstract
cBioPortal is an open source application for interactive analysis and visualization of large scale cancer genomics datasets, originally developed by Memorial Sloan Kettering Cancer Center, New York, and since 2015 developed by a larger community including The Hyve, The Netherlands. Here we present the recent development work we (The Hyve) carried out to support Gene Set Variation Analysis (GSVA) in cBioPortal across its different views. We show how the new GSVA data is displayed in the oncoprint, with support for hierarchical clustering, and how it can be used in a variety of other plots and analyses.
This functionality is new in cBioPortal and could be useful to any user of this popular open source cancer genomics platform. It enables a new dimension of exploratory analysis in cBioPortal, allowing researchers to search for and find patterns at the level of molecular processes, in which multiple genes are known to work in concert, instead of exploring the data at the level of individual genes.
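For intuition, per-sample gene-set scores of the kind shown in the oncoprint can be approximated very roughly as follows (an illustrative stand-in only; this is not the GSVA algorithm, and the matrix and gene indices are invented):

    import numpy as np

    np.random.seed(0)
    expr = np.random.rand(1000, 50)          # genes x samples expression matrix (made up)
    gene_set = [10, 42, 137, 500]            # row indices of the set's member genes (hypothetical)

    # z-score each gene across samples, then average the member genes per sample
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    set_scores = z[gene_set].mean(axis=0)    # one score per sample, ready for oncoprint-style display
    print(set_scores.shape)                  # (50,)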
In our presentation we aim to share the details of this important update to the cBioPortal platform, also highlighting the way we worked with a large Pharma customer and other cBioPortal developers to implement this new feature.

The backbone of research reproducibility: sustainable and flexible tool deployment
Date: Saturday, July 22
Time: 3:00 PM - 3:18 PM
Room: Meeting Hall IV
  • Bjoern Gruening, Uni-Freiburg, Germany
  • Johannes Koester, Centrum Wiskunde & Informatica, Netherlands
  • John Chilton, Galaxy Project, United States
  • Ryan Dale, National Institutes of Health, United States
  • Yasset Perez-Riverol, EBI, United Kingdom

Presentation Overview: Show

Project: bioconda.github.io & biocontainers.pro Contact: bjoern.gruening@gmail.com Code: github.com/bioconda/bioconda-recipes License: The MIT License (MIT)
A massive amount of diverse data is generated in biomedical research. To manage it and extract useful information, bioinformatics tools and software must be developed. The development of a tool should always follow a similar process. First, to address a scientific question or need, source code is developed, which can be distributed as is. To help deployment and ease of use, the code is packaged in various package formats. The tool using the code is then deployed and used. Ideally, documentation, training and support are also provided to guide users, illustrate the solution, and advertise it.
This process, from development to support, is the golden path to developing a good tool. But many bioinformatics tools suffer from deployment and sustainability issues. What bioinformatician has not dealt with missing tool dependencies, or an older version of a tool that could not be installed for various reasons? Deployment and sustainability of tools are therefore a major threat to productivity and reproducibility in science.
For deployment, we need a package manager that is OS- and programming language-agnostic, as bioinformatics tools are developed in many languages and may be intended to run on every major operating system, including enterprise ones. Moreover, all available packages should be permanently cached so that they remain reachable, enabling reproducibility in the long term.
Here we describe a community effort to create a flexible, scalable and sustainable system to fix the tool deployment problem once and for all. Bioconda is a platform for distribution of bioinformatics software using Conda, an open source package manager developed by ContinuumIO which is independent of any programming language and OS. Installation of Conda packages is fast and robust. No root privileges are required, and multiple versions of every software package can be installed and managed in parallel. Supported by extensive documentation, writing a Conda package is very simple, which eases contribution. Thanks to its large and fast-growing community, more than 2,000 bioinformatics packages have been developed in the 18 months the project has existed. These packages are stored long-term in a public repository (Cargo Port, the distribution center of the Galaxy Project), resolving the sustainability issue. Moreover, a technique called layer donning has recently been introduced to build containers automatically and very efficiently for all Conda packages. These are automatically contributed to the BioContainers repository.
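In practice the deployment model is a one-line, user-space installation; the sketch below (shell commands wrapped in Python purely for illustration, with an arbitrary environment name and pinned version) creates an isolated environment holding a Bioconda-provided tool without root privileges.

    import subprocess

    # Create a user-owned environment and install a pinned samtools from the bioconda channel
    subprocess.run(
        ["conda", "create", "-y", "-n", "mapping",
         "-c", "bioconda", "-c", "conda-forge", "samtools=1.3.1"],
        check=True,
    )
    # The environment is then activated in the shell, e.g. `source activate mapping`.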
Development of Conda packages through the Bioconda community eases the packaging and deployment of any bioinformatics tool. The interface with Cargo Port enables sustainability by mirroring all sources. Automatically building efficient Linux containers adds an even higher layer of abstraction and isolation from the base system. Thanks to these collaborative projects, their communities and their collaborations, tools can be easily packaged and deployed, and will always be available to help biomedical research.

Reproducible bioinformatics software with GNU Guix
Date: Saturday, July 22
Time: 3:18 PM - 3:23 PM
Room: Meeting Hall IV
  • Ben Woodcroft, UQ, Australia
  • Pjotr Prins, University Medical Center Utrecht, Netherlands
  • Ricardo Wurmus, Max-Delbrueck-Centrum fuer Molekulare Medizin (MDC), Germany

Presentation Overview: Show

Website: https://www.gnu.org/software/guix/packages/ Repository: https://git.savannah.gnu.org/cgit/guix.git/ License: GPL3
Anyone who has been bitten by dependencies and would like a fully reproducible software stack should take note. Through GNU Guix we lost the fear of combining computer languages and binary deployment because all dependencies, including command line invocations of tools, are guaranteed to work.
In this talk I will share our great experience of packaging, deploying, publishing and distributing, via GNU Guix, a complex web service with hundreds of dependencies that runs on multiple servers under http://genenetwork.org/. With GeneNetwork we are creating an environment in which people can do genetics on their laptop through a front-end API, e.g., for R and Python, or the browser, using content-addressable storage, such as Arvados Keep, and reproducible software deployment with GNU Guix, for analysis through reproducible pipelines, such as PBS and CWL. We are even using GNU Guix to deploy pipelines on the ORNL Beacon supercomputer.
HPC environments, and especially supercomputing, have their own bag of challenges when it comes to software deployment. As scientists we often do not get root access, which means that we either depend on what software is available or we build software in a dedicated directory using tools such as Brew or Conda, or even from source. Unfortunately these solutions depend on already installed tools from an underlying distribution, often proprietary or dated compilers, and, for example, modules. Any binary that gets produced, therefore, tends to be totally unique, both in the generated binary and its set of dependencies. This is bad. Bad for troubleshooting and bad for pursuing reproducible science.
With GNU Guix we have packaged more than 300 R packages, 400 Python packages, 500 Perl packages and 150 Ruby packages, including some 200 specific bioinformatics packages with some rather difficult-to-package tools, such as Sambamba. Thanks to the GNU Guix community it is the largest ongoing bioinformatics packaging attempt next to Debian BioMed and Bioconda.
I will discuss the work on GNU Guix ‘channels’, reproducible build systems and non-root installations, and moving forward on putting Guix in containers and using workflow engines, so that jobs can run on distributed systems, such as Arvados. In this talk I will explain how GNU Guix differs from distributions, such as Debian BioMed, how it can happily be deployed on any existing distribution, why it does not actually need containers, and how it can be part of Bioconda.

Reproducible and user-controlled software management in HPC with GNU Guix
Date: Saturday, July 22
Time: 3:23 PM - 3:28 PM
Room: Meeting Hall IV
  • Altuna Akalin, Max Delbrueck Center for Molecular Medicine, Germany
  • Ricardo Wurmus, Max Delbrueck Center for Molecular Medicine, Germany

Presentation Overview: Show

Website: https://gnu.org/software/guix
Repository: https://git.savannah.gnu.org/cgit/guix.git License: GNU General Public License version 3 (or later)
Reproducibility is the cornerstone of science, and there is growing awareness among researchers of the problem of reproducibility in the field of bioinformatics research [1]. Computational research fields such as bioinformatics crucially depend on the reproducibility of software packages and the computational pipelines that are made up of them. Unfortunately, traditional software deployment methods generally do not take reproducibility into account.
On high-performance computing (HPC) systems two conflicting approaches to managing software collide: system administrators manage these large systems in a highly conservative manner, whereas the researchers using these systems may require up-to-date tool chains as well as libraries and scientific software in countless variations. Users often fall back to ad-hoc software deployment to satisfy their immediate requirements. As a result, HPC system users often have no guarantee that they will be able to reproduce results at a later point in time, even on the same system, and they have little hope of being able to reproduce the same software environment elsewhere.
We present GNU Guix and the functional package management paradigm and show how it can improve reproducibility and sharing among researchers [2]. Functional package management differs from other software management methodologies in that reproducibility is a primary goal. With GNU Guix users can freely customize their own independent software profiles, recreate workflow-specific application environments, and publish a package set to enable others to reproduce a particular workflow, without having to abandon package management or sharing. Profiles can be rolled back or upgraded at will by the user, independent from system administrator-managed packages.
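As a small illustration of such user-controlled profiles (a sketch with a hypothetical profile path and package names, wrapped in Python only for consistency with the other examples; in practice these are shell commands):

    import subprocess

    profile = "/home/alice/.guix-profiles/rnaseq"   # hypothetical per-project, user-owned profile

    # Install an exact tool set into the profile, no root privileges required
    subprocess.run(["guix", "package", "-p", profile, "-i", "samtools", "bwa"], check=True)

    # Later: undo the last transaction if results no longer reproduce
    subprocess.run(["guix", "package", "-p", profile, "--roll-back"], check=True)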
We will introduce functional package management with GNU Guix, demonstrate some of the benefits it enables for research, such as reproducible software deployment, workflow-specific profiles, and user-managed environments, and share our experiences with using GNU Guix for bioinformatics research at the Max Delbrück Center. We will also compare the properties and guarantees of functional package management with the properties of other application deployment tools such as Docker or Conda.
References
[1] Peng, R.: Reproducible Research in Computational Science. Science, 02 Dec 2011: Vol. 334, Issue 6060, pp. 1226-1227. DOI: 10.1126/science.1213847
[2] Courtès, L., Wurmus, R.: Reproducible and User-Controlled Software Environments in HPC with Guix. In: Hunold, Sascha, et al., eds. Euro-Par 2015: Parallel Processing Workshops: Euro-Par 2015 International Workshops, Vienna, Austria, August 24-25, 2015, Revised Selected Papers. Vol. 9523. Springer, 2015.

A Ubiquitous Approach to Reproducible Bioinformatics across Computational Platforms
Date: Saturday, July 22
Time: 3:28 PM - 3:33 PM
Room: Meeting Hall IV
  • Bjoern Gruening, University of Freiburg, Germany
  • Johannes Koester, Centrum Wiskunde & Informatica, Netherlands
  • John Chilton, Galaxy Project, United States
  • Marius Beek, Institut Curie, France

Presentation Overview: Show

Project: ​http://galaxyproject.org/ Code: ​http://github.com/galaxyproject/galaxy License: Academic Free License version 3.0
Reproducible data analysis requires reproducible software installation. There are many approaches to reproducible software “installation” – DebianMed, Docker, EasyInstall, homebrew-science, software modules, and others. Many work well in cloud and container-enabled environments – where the researcher has full control of a virtual machine or container host and may choose whatever software installation mechanism makes sense. However, these same approaches are less appropriate at high performance computing (HPC) centers, where large centralized resources mean such freedom is unavailable. On the other hand, the HPC-centric approaches do not provide options such as ready-to-run software containers ideal for the cloud. Furthermore, some approaches are built to work with command-line scripting while others are built for specific computational platforms or deployment technologies such as the Galaxy Tool Shed or Dockstore. Here we outline an approach that covers all of these scenarios with a great deal of flexibility – allowing for the execution of the same binaries regardless of which technologies are selected.
Bioconda is a set of recipes for the popular binary package manager Conda containing thousands of bioinformatics and general purpose scientific software recipes. The BioContainers project builds environments using these packages for technologies such as Docker and rkt - with over 2,000 containers already available. We will highlight how these projects have provided powerful new features to Snakemake and Galaxy and how they can be used to improve Common Workflow Language (CWL) tools.
We will demonstrate how to annotate Snakemake, Galaxy, and CWL tools and workflows to utilize Bioconda packages - allowing the same packages and (for compiled languages) binaries across different platforms. We will show that leveraging Bioconda increased the reproducibility of Snakemake workflows, vastly improved the development and deployment experience in Galaxy over its custom built Tool Shed package manager, and can provide a path forward for reproducibility with CWL in environments that are not Docker enabled.
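For example, in Snakemake this annotation is a single directive on a rule; the sketch below is hypothetical (the rule, paths and environment file are invented) but shows the pattern: when run with `snakemake --use-conda`, the referenced Bioconda environment is built automatically and the same pinned binaries are used wherever the workflow runs.

    rule sort_bam:
        input:
            "mapped/{sample}.bam"
        output:
            "sorted/{sample}.bam"
        conda:
            "envs/samtools.yaml"    # hypothetical env file pinning e.g. samtools=1.3.1 from bioconda
        shell:
            "samtools sort -o {output} {input}"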
For Galaxy and CWL, we will show that these same annotations can allow the platform to directly find or build BioContainers for tool execution. We will argue these containers are going to be superior to custom built ones in most cases - they are very small, automatically built (no development, annotation, registration, etc. are required by the artifact creator), and allow for the same packages and binaries to be used inside and outside of a container.

Revitalizing a classic bioinformatics tool using modern technologies: the case of the Cytoscape Project
Date: Saturday, July 22
Time: 3:36 PM - 3:41 PM
Room: Meeting Hall IV
  • Barry Demchak, UC San Diego, United States
  • Eric Sage, University of California, San Diego Trey Ideker Lab, United States
  • Keiichiro Ono, University of California, San Diego Trey Ideker Lab, United States

Presentation Overview: Show

Project Website: http://www.cytoscape.org/ Source Code: https://github.com/cytoscape/ License: LGPL/MIT License
Abstract
To date, widely used bioinformatics tools are often based on long-established programming systems that are unable to effectively leverage or interoperate with new tools developed on modern programming systems running on modern web platforms. Abandoning and rewriting existing tools represents a risky and large investment of both calendar and monetary resources. In this presentation, we will discuss new approaches that enable small bioinformatics software developer teams to economically construct bridges between existing applications and modern web-based tools. We will highlight the construction and use of portable and reusable UI components and computational services.
For our case study, we use Cytoscape, the de facto standard network visualization platform in biology. Its first version was released in 2002, and it is still actively developed by the Cytoscape Consortium. It is a desktop Java application with a plugin architecture, and third-party developers have released over 300 Java-based apps – these apps represent a large and valuable ecosystem that cannot be abandoned even if Cytoscape itself were to be redeveloped (e.g., as a microservice architecture or single-page web application). Additionally, because Java is a relatively closed programming system, there is no easy way for the Cytoscape core or apps to integrate emerging web technology that implements complex user interfaces or data visualization modules. This hurdle is often insurmountable for open source software (OSS) projects (especially in bioinformatics) that have limited resources.
We will discuss how we design, implement and integrate reusable JavaScript-based UI components (called cyWidgets) into Java-based Cytoscape apps, and how we build, deploy and access non-Java services across the web.

The SPOT ontology toolkit : semantics as a service
Date: Saturday, July 22
Time: 3:41 PM - 3:46 PM
Room: Meeting Hall IV
  • Helen Parkinson, EMBL-EBI, United Kingdom
  • Olga Vrousgou, EMBL-EBI, United Kingdom
  • Simon Jupp, EMBL-EBI, United Kingdom
  • Thomas Liener, EMBL-EBI, United Kingdom
  • Tony Burdett, EMBL-EBI, United Kingdom

Presentation Overview: Show

Project Website​: ​https://ebispot.github.io​/
Web Pages and Source Code​:
Ontology Lookup Service: http://www.ebi.ac.uk/ols, https://github.com/EBISPOT/OLS
Zooma: http://www.ebi.ac.uk/spot/zooma, https://github.com/EBISPOT/zooma
OxO: http://www.ebi.ac.uk/spot/oxo, https://github.com/EBISPOT/OLS-mapping-service
Webulous: https://www.ebi.ac.uk/efo/webulous, https://github.com/EBISPOT/webulous
BioSolr: https://ebispot.github.io//BioSolr, https://github.com/EBISPOT/BioSolr
License: http://www.apache.org/licenses/LICENSE-2.0
Mailing List​: ​ontology-tools-support@ebi.ac.uk
Annotating data with ontologies adds value and results in data that is more readily interoperable, discoverable and reusable. Working with multiple ontologies and their nuances creates overhead that can be readily addressed with tooling that encapsulates common use-cases. Certain activities that are common to ontology-aided data curation include querying ontologies, mapping data to and between ontologies, and creating or requesting new ontology terms. The Samples Phenotypes and Ontologies Team (SPOT) work with multiple databases to develop tools that support high-throughput, semi-automated data curation with ontologies. These tools form a toolkit of open-source software that can be deployed and run anywhere, and can be configured to work with data and ontologies from any domain.
The SPOT ontology toolkit aims to reduce the barriers to entry in the annotation of data with ontologies. The Ontology Lookup Service (OLS) provides a repository for accessing and visualising multiple ontologies. Zooma provides a repository of curated annotation knowledge that can be used to predict ontology annotations with a measure of confidence, enabling the automated curation of unseen data with ontologies. OxO provides access to curated ontology cross-references and provides services for mapping from one ontology standard to another. Webulous supports ontology development from spreadsheets through a Google Sheets add-on. OLS, Zooma, OxO and Webulous come with integrated RESTful APIs that allow for scalable programmatic access and the development of data curation pipelines. Finally, BioSolr, an ontology expansion plugin for Solr and Elasticsearch, demonstrates how advanced ontology-powered search can be achieved over data enriched with ontology annotation.
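As an example of the programmatic access these REST APIs offer, the following sketch queries the public OLS search endpoint for a term label; the exact endpoint, parameters and response fields are assumptions based on the public OLS service and may differ.

    import requests

    resp = requests.get(
        "https://www.ebi.ac.uk/ols/api/search",
        params={"q": "lung adenocarcinoma", "ontology": "efo"},
    )
    resp.raise_for_status()
    for doc in resp.json()["response"]["docs"][:5]:
        print(doc.get("obo_id"), doc.get("label"))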
OLS, Zooma, OxO and Webulous combined address a specific need for a suite of tools that can automate more of the process of data curation with ontologies. Here we will present this toolkit, along with a suggested ontology annotation workflow designed to lower the cost of life sciences data curation.

Biopython Project Update 2017
Date: Saturday, July 22
Time: 3:46 PM - 3:51 PM
Room: Meeting Hall IV
  • Christian Brueffer, Lund University, Sweden
  • Peter Cock, James Hutton Institute, United Kingdom
  • Sourav Singh, Vishwakarma Institute of Information Technology, India

Presentation Overview: Show

Website: http://biopython.org
Repository: https://github.com/biopython/biopython
License: Biopython License Agreement (BSD like, see http://www.biopython.org/DIST/LICENSE)
The Biopython Project is a long-running distributed collaborative effort, supported by the Open Bioinformatics Foundation, which develops a freely available Python library for biological computation [1].
We present here details of the latest Biopython release – version 1.69. This represents eight months of contributions, and a record 49 named contributors including 28 newcomers, which reflects our policy of trying to encourage even small contributions.
Biopython 1.69 represents the start of our re-licensing plan, to transition away from our liberal but unique Biopython License Agreement to the similar but very widely used 3-Clause BSD License. We are reviewing the code base authorship file-by-file, in order to gradually dual license the entire project.
New features include: a new parser for the ExPASy Cellosaurus cell line database, catalogue and ontology; Bio.AlignIO now supports the UCSC Multiple Alignment Format (MAF), including indexed access to large files using SQLite3; Bio.SearchIO.AbiIO can now parse FSA files; an extended Bio.Affy module supporting version 4 of the Affymetrix CEL format; updated Uniprot parsers to support the “submittedName” XML element and features with unknown locations; better handling of internal node comments in the NEXUS parser to improve compatibility with tools such as BEAST TreeAnnotator; an update to Bio.Restriction to include the REBASE February 2017 restriction enzyme list; updated Bio.SeqIO parsers for GenBank, EMBL, and IMGT that now record the molecule type from the LOCUS/ID line in the record.annotations dictionary and can cope with more format variations; Bio.PDB.PDBList now can download PDBx/mmCif (new default), PDB (old default), PDBML/XML and mmtf format protein structures; enhanced PyPy support by taking advantage of NumPy and compiling most of the Biopython C code modules; the Bio.Seq module now offers a complement function for consistency and a SeqFeature object’s qualifiers attribute is now an explicitly ordered dictionary.
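Two of these additions in a short sketch (the GenBank file name is a placeholder):

    from Bio import SeqIO
    from Bio.Seq import complement

    # Molecule type is now recorded from the LOCUS line into the annotations dictionary
    record = next(SeqIO.parse("example.gbk", "genbank"))
    print(record.annotations.get("molecule_type"))   # e.g. "DNA"

    # Module-level complement, added for consistency with reverse_complement
    print(complement("ACGTN"))                        # "TGCAN"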
Additionally we fixed miscellaneous bugs, enhanced our test suite and continued our efforts to abide by the PEP8 and PEP257 coding style guidelines, which is now checked automatically with GitHub-integrated continuous integration testing using TravisCI. Current efforts include improving the unit test coverage, which is easily viewed online at CodeCov.io.
We are currently preparing a new release – version 1.70 – that will feature an extended Bio.AlignIO module that supports Mauve’s eXtended Multi-FastA (XMFA) file format.
References
[1] Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M.J. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. doi:10.1093/bioinformatics/btp163

Open Sourcing Ourselves
Date: Saturday, July 22
Time: 4:30 PM - 5:30 PM
Room: Meeting Hall IV
  • Madeleine Ball, Open Humans Foundation, United States

Presentation Overview: Show

Madeleine Ball is Executive Director of Open Humans Foundation and co-founder of Open Humans, a nonprofit project enabling individuals to engage with studies and share data.

Open Humans is an open source online platform and community, created through funding support from the Robert Wood Johnson Foundation and the John S. and James L. Knight Foundation. Open Humans enables members to connect data from a variety of sources, including genome, microbiome, activity tracking, and GPS data – and then invites members to share their data with projects and work with research studies. By using an individual-centered approach, Open Humans enables new research opportunities, including: data sharing by individuals, cohort sharing across studies, anonymous engagement with studies, and citizen-led projects.

Dr. Ball is also supported by a Shuttleworth Foundation Fellowship, supporting her continued vision for new approaches to openness in human health research and data sharing. Previously, Dr. Ball was Director of Research at the Harvard Personal Genome Project.

BeerDeCoded: exploring the beer metagenome
Date: Sunday, July 23
Time: 10:05 AM - 10:23 AM
Room: Meeting Hall IV
  • Gianpaolo Rando, Hackuarium, Swiss Decode, Switzerland
  • Jonathan Sobel, Hackuarium, UNIL, Switzerland
  • Luc Henry, Hackuarium, EPFL, Switzerland

Presentation Overview: Show

BeerDeCoded is a project carried out by members of Hackuarium, a Swiss not-for-profit association that supports unconventional research ideas and promotes the public understanding of science by making laboratory tools more accessible. The goal of BeerDeCoded is to extend scientific knowledge about beer, while discussing issues related to personal genomics, food technology, and their role in society with the general public. Two years ago, a crowdfunding campaign provided funding for the first stage of the project. Reaching out through this channel also allowed us to collect 120 beer samples from 20 countries.
We have now obtained the metagenomic profiles for 39 of these beers using a targeted approach (ITS). We have demonstrated that it is possible to extract DNA directly from bottled beer using low cost methodologies available to citizen scientists. DNA sequenced from these samples identified a variety of wild yeast species in commercial beers. For example, some brews contain traces of more than 30 different fungal species. Brewing is a complex process and it is unclear if the beer microbiome can directly affect the final beer taste or texture and how it could be controlled to create beers with a specific character.
To answer these questions, we are collecting information about the brewing process to correlate the metagenomic profiles with metadata related to the brewing process. For instance, using a hierarchical clustering approach, we built a proximity tree of the different beers. This analysis revealed that two stouts brewed in the same city have a similar wild yeast content. However, there is currently limited access to information about the recipe of commercial beers. A notable exception is BrewDog, a craft brewery from Scotland that recently released all its brewing recipes. We would like to use this resource and pair it with a new protocol based on a portable DNA sequencer to build the proof of concept for a beer metagenome analysis pipeline that could be used in high schools, citizen science laboratories, craft breweries or industrial plants.
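The clustering step mentioned above can be sketched roughly as follows (illustrative only; the abundance values, beer labels and choice of distance metric are invented, not the project's actual data or pipeline):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    # rows = beers, columns = fungal species, values = relative ITS read abundance (made up)
    profiles = np.array([
        [0.70, 0.25, 0.05, 0.00],   # stout_A
        [0.68, 0.27, 0.05, 0.00],   # stout_B
        [0.10, 0.05, 0.60, 0.25],   # lambic_C
    ])
    tree = linkage(pdist(profiles, metric="braycurtis"), method="average")
    print(tree)   # merge history; scipy.cluster.hierarchy.dendrogram can draw the proximity tree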
Altogether, we demonstrated that coupling simple laboratory procedures with DNA sequencing and metagenomic analysis can allow the detection of the microbial content in commercial beer and can easily trigger a tripartite conversation between the general public, the scientists and professional brewers. The next step is to set up an open repository where these parties can add the metagenomic profiles of their favourite beers, expanding the database and allowing researchers to test new analyses, brewers to try new recipes and the general public to discover more science and more beers.

Supporting curation communities & collecting technical dividends
Date: Sunday, July 23
Time: 10:23 AM - 10:41 AM
Room: Meeting Hall IV
  • Christine Elsik, Division of Animal and Plant Sciences, University of Missouri-Columbia, United States
  • Deepak Unni, Division of Animal and Plant Sciences, University of Missouri-Columbia, United States
  • Eric Yao, University of California, Berkeley, United States
  • Ian Holmes, Berkeley, United States
  • Monica Munoz-Torres, Lawrence Berkeley National Laboratory, United States
  • Nathan Dunn, Lawrence Berkeley National Laboratory, United States
  • Seth Carbon, Lawrence Berkeley National Laboratory, United States
  • Suzanna Lewis, Lawrence Berkeley National Laboratory, United States

Presentation Overview: Show

Project Website: http://genomearchitect.org/ Source Code: https://github.com/GMOD/Apollo License: Berkeley Software Distribution (BSD-3).
Scientific research is inherently a collaborative task; for curation of genome annotations, this means a dialog within and between communities of researchers to reach a shared understanding of the biology underlying the sequences. Here we describe our experience developing software to support an increasing number of genome curation communities, emphasizing the collection of technical contributions as unexpected dividends in the process.
Annotation tools help improve curation quality, and in many research groups, they constitute a means of introducing human curation to the annotation process for the first time. To facilitate the dialog between and within communities of curators, our team has developed a number of web-based tools, among them Apollo, a widely adopted genome annotation editor designed to support structural curation of gene models in a collaborative, real-time fashion. Apollo, built on top of JBrowse (http://jbrowse.org/), presents a set of tooling for data extraction and integration that allows users to generate analysis-ready data and progress reports; it also offers a secure web-service API to programmatically perform annotations, to integrate Apollo with workflow tools (e.g. Galaxy https://galaxyproject.org/, via Docker https://www.docker.com/), etc.
We have dedicated considerable efforts to spread the word about Apollo (i.e. talks and workshops to train the community on installation and use), and to continuously make improvements to support the needs of the growing research community that forms our user base. It is currently used in more than one hundred genome annotation projects around the world and across the tree of life. Recently, these improvements have included technical contributions from developers outside our core group, ranging from productive discussions, to organizing public hackathons, to welcoming code contributions to Apollo’s main development branch. This means that a portion of our target communities both have an interest in creating an ideal curation environment for their research objectives using Apollo, and are choosing to prioritize investing their resources to achieve it. It is also evidence that our efforts to support curators with an open-source, community-based bioinformatics tool have been so far successful. And as a result, we are receiving extensive technical contributions from these communities, an unexpected return on these efforts that both supports our mission and improves the workflow for our community of users.

Journal of Open Source Software (JOSS)
Date: Sunday, July 23
Time: 10:41 AM - 10:59 AM
Room: Meeting Hall IV
  • Arfon Smith, NASA, United States
  • Pjotr Prins, UMC Utrecht, Netherlands
  • Roman Valls, Center for Cancer Research, University of Melbourne, Australia
  • Vignesh Sekar, NIIT University, India

Presentation Overview: Show

Journal of Open Source Software (JOSS)
JOSS is a free open-access journal for software developers in academic research. JOSS reviewed and published 100 papers in its first year and the list of authors, reviewers and editors is growing.
Publishing a paper with JOSS gives you a citeable publication for your software project. JOSS provides a Crossref Digital Object Identifier (DOI) and is indexed on Google Scholar. With enough bioinformatics papers we aim to get JOSS automatically indexed on PubMed.
As a journal we believe that the source code is the story. When the source code is published under an Open Source Initiative (OSI) approved license, anyone can access and use it. If you have already written all the source code, documentation and tests, why would you need to write another full-length paper?
With JOSS, the write-up can be a short piece of text (up to one page) with references, see for example the GeneNetwork paper:
http://joss.theoj.org/papers/10.21105/joss.00025
The review process uses the GitHub issue tracker. Review is transparent, open and designed to be collaborative between the authors, reviewers, and editor. The full workflow is explained at
https://doi.org/10.6084/m9.figshare.4688911
Note: JOSS runs on GitHub and is an OSI affiliate.
Alongside JOSS, we will also discuss ongoing work on a new Journal of Open Data. This journal will allow people to publish their data for free in a content-addressable public space and get a DOI. Not only will this data be citeable, it will also be FAIR and help reproducible analysis. Metadata can be mined through SPARQL and JSON filters.

Building an open, collaborative, online infrastructure for bioinformatics training
Date: Sunday, July 23
Time: 10:59 AM - 11:17 AM
Room: Meeting Hall IV
  • Berenice Batut, University of Freiburg, Germany
  • Bjoern Gruening, University of Freiburg, Germany
  • Dave Clements, Johns Hopkins University, United States
  • Galaxy Network, Galaxy Training Network, United States

Presentation Overview: Show

Project Website​: ​http://galaxyproject.github.io/training-material/ Source Code​: ​https://github.com/galaxyproject/training-material/​ License​: Creative Commons Attribution 4.0 International License
Abstract
With the advent of high-throughput platforms, analysis of data in life science is highly linked to the use of bioinformatics tools, resources, and high-performance computing. However, the scientists who generate the data often do not have the knowledge required to be fully conversant with such analyses. To involve them in their own data analysis, these scientists must acquire bioinformatics vocabulary and skills through training.
Unfortunately, data analysis and training in it are particularly challenging for those without a computational background. The Galaxy framework addresses this problem by offering a web-based, intuitive and accessible user interface to numerous bioinformatics tools. It enables sophisticated bioinformatic analysis without requiring life scientists to learn programming, command line interfaces, or systems administration.
Training based on Galaxy generates significant interest and the number of such events is ever growing, with more than 70 events in 2016. To federate these events and the people involved, the Galaxy Training Network (GTN) was created in 2014. GTN now has 32 member groups and almost 100 individual members.
Recently, GTN set up a new open, collaborative, online model for delivering high-quality bioinformatics training material: http://galaxyproject.github.io/training-material. Each of the current 12 topics provides tutorials with hands-on exercises, slides and interactive tours. Tours are a new way to go through an entire analysis, step by step inside Galaxy, in an interactive and explorative way. All material is openly reviewed, and iteratively developed in one central repository by 40 contributors. Content is written in Markdown and, similarly to Software/Data Carpentry, the model separates presentation from content. This makes community contributions easier and enables bulk updates and style changes independently of the training content, which is beneficial for the sustainability of this project. In addition, the technological infrastructure needed to teach each topic is also described, with an exhaustive list of the needed tools. The data (citable via DOI) required for the hands-on exercises, time and resource estimates, and flavored Galaxy Docker images are also provided.
This material is automatically propagated to Elixir’s TeSS portal. With this community effort, the GTN offers an open, collaborative, FAIR and up-to-date infrastructure for delivering high-quality bioinformatics training for scientists.

Software and social strategies for community sourced biological networks and ontologies
Date: Sunday, July 23
Time: 11:17 AM - 11:22 AM
Room: Meeting Hall IV
  • Dexter Pratt, UCSD, United States

Presentation Overview: Show

Project Website: http://www.ndexbio.org
Source Code: https://github.com/ndexbio
License: BSD 3-Clause (http://www.home.ndexbio.org/disclaimer-license/)
We present community-sourcing social strategies and supporting software infrastructure to promote the creation, maintenance, and dissemination of computable biological networks and ontologies, coupled with practical, ready-to-use facilities in the Network Data Exchange (NDEx) framework. Network resources of many types, from pathway models authored by experts to data-driven ontologies and patient similarity networks, are valuable to researchers both as human-readable references and as input to analyses in applications and scripts. There are, however, significant challenges in providing high quality network resources: curation efforts are expensive and difficult to sustain; data-driven networks may be isolated as supplemental information to publications without support for revision and reuse. Moreover, network resources are not typically subject to peer review: even when a network is associated with a peer-reviewed publication, its content is unlikely to be reviewed in detail. To address these challenges, we present a strategy to create incentives for sustainable community-sourced development by (1) enabling organizations such as research consortia or editorial boards of subject matter experts to publish reviewed collections of networks, (2) facilitating workflows integrating network data with academic publication, and (3) preserving author attribution in derivative works. The effort required by participants is reduced by targeted infrastructure and interfaces incorporated in the NDEx framework for sharing, accessing, and publishing networks. The use of standards by authors is facilitated by streamlined annotation interfaces in NDEx and encouraged by rewards such as preferential search rankings for well-annotated networks. Reproducible science is promoted by enabling researchers to readily disseminate reusable networks, accessible via stable identifiers and discoverable via multi-parameter search. Finally, we make a call to action for researchers to participate in creating diverse, high-quality, and sustainable resources of biological networks and ontologies as authors, reviewers, community organizers, and thought leaders.

Distance-based, online bioinformatics training in Africa: the H3ABioNet experience
Date: Sunday, July 23
Time: 11:22 AM - 11:27 AM
Room: Meeting Hall IV
  • Ahmed Alzohairy, Zagazig University, Egypt
  • Amel Ghouila, Institute Pasteur de Tunis, South Africa
  • Colleen Saunders, University of the Western Cape, South Africa
  • David Judge, Independent Bioinformatics Training Specialist, United Kingdom
  • Deogratius Ssemwanga, Uganda Virus Research Institute, Uganda
  • Fatma Guerfali, Institute Pasteur de Tunis, South Africa
  • Jean-Baka Entfellner, University of the Western Cape, South Africa
  • Jonathan Kayondo, Uganda Virus Research Institute, Uganda
  • Kim Gurwitz, H3ABioNet; University of Cape Town, South Africa
  • Nicola Mulder, University of Cape Town, South Africa
  • Pedro Fernandes, Instituto Gulbenkian de Ciência, Portugal
  • Rehab Ahmed, University of Khartoum/Future University of Sudan, Sudan
  • Ruben Cloete, University of the Western Cape, South Africa
  • Samson Salifu, Kwame Nkrumah University of Science and Technology, Ghana
  • Shaun Aron, University of the Witwatersrand, South Africa
  • Sumir Panji, H3ABioNet; University of Cape Town, South Africa
  • Suresh Maslamoney, University of Cape Town, South Africa

Presentation Overview: Show

Project Website: http://training.h3abionet.org/IBT_2016/
Source Code: NA
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (see https://creativecommons.org/licenses/by-nc-sa/4.0/ )
A distance-based Introduction to Bioinformatics course is run by H3ABioNet - a pan-African bioinformatics network for H3Africa. In its second iteration this year, the free, 3-month, skills-based course teaches the basics of various bioinformatics analyses. It makes use of multiple education delivery methods, namely face-to-face learning, distance learning, and open educational resources, in order to increase access across Africa. Local classrooms - 27 classrooms hosting roughly 600 participants in total, across 12 African countries (in 2017) - are attended face-to-face for additional support, where participants interact with volunteer teaching assistants and with peers. During these face-to-face sessions, classrooms watch open access, pre-recorded and downloaded lecture recordings prepared by expert African bioinformatics trainers. Classrooms also sign in to a virtual classroom that links them all to each other and to the trainer during biweekly contact sessions. Further, course participants and volunteer local staff engage via online forums hosted on the course management platform. Additional features of the course this year, developed out of an extensive review of last year’s iteration, include: staff training at local classrooms; promoting previous year attendees as trainee teaching assistants; consolidation sessions; and encouraging engagement within and across classrooms as well as with local bioinformatics communities. The unique course design ensures that many common challenges in Africa, namely access to bioinformatics expertise, access to bioinformatics training, and Internet access instability, are not barriers to accessing education. Although developed for a resource-limited setting, the learning model employed could easily be adapted to other settings.

Recent object formation in the core of Galaxy
Date: Sunday, July 23
Time: 11:30 AM - 11:35 AM
Room: Meeting Hall IV
  • Martin Cech, Galaxy Team, Penn State, United States

Presentation Overview: Show

Code​: github.com/galaxyproject/galaxy
Project​: galaxyproject.org
License​: Academic Free License version 3.0
Galaxy pursues accessibility, transparency and reproducibility in data-intensive science. It is a web-based framework that automatically tracks analyses and enables researchers to use advanced computational tools and resources while focusing on the research questions rather than the supporting compute infrastructure. In this report we highlight recently added and enhanced features and discuss future directions. The most prominent recent developments are:
Conda for dependency resolution. Galaxy has deprecated its own solution (Tool Shed package recipes) for software package management and embraced the Conda manager for these purposes. This makes tool dependencies more reliable and stable, and they are also easier to test and faster to develop.
Interactive Environments integrated into the Galaxy interface are powerful tools for interactive ad hoc analyses, as demonstrated in a recent paper (doi.org/10.1101/075457). In addition to the Jupyter notebook and RStudio, the Galaxy community has expanded the number of available IEs to include Phinch for metagenomic data visualization, Ethercalc for tabular and csv data, and Neo4j for graph database manipulation.
Galaxy​ ​Webhooks​ are a system for adding plugins allowing for customization of individual instances. Webhooks are admin-enabled and often community-contributed pieces of code that alter the interface and offer extra features. Popular webhooks will likely become part of the core UI in the future.
Dataset Collections​ are becoming increasingly powerful - you can now directly create them when uploading datasets, flatten and import them to data libraries, and use many new and improved collection operations tools. Various toolkits were enhanced to handle collections natively.
Other notable advancements: You can start up an independent chat server and connect it to Galaxy, enabling users to share and collaborate without leaving the analysis interface. Galaxy also supports compressed FASTQ formats, allowing users to save storage and remove unnecessary steps from workflows. Tool cache and ‘hot reload’ functionalities have also been added, enabling administrators to update tools without a server restart. The tool cache has also made Galaxy startup much faster, especially for instances with many tools. Histories now allow for drag&drop of datasets as tool inputs and can also propagate dataset tags through tool executions. In the last year, 3 new IUC members joined our ranks, totalling 15 now. They handled 401 pull requests in the last 12 months and added numerous contributions to Bioconda and other connected projects.

Reproducibility of computational workflows is automated using continuous analysis
Date: Sunday, July 23
Time: 11:35 AM - 11:40 AM
Room: Meeting Hall IV
  • Brett Beaulieu-Jones, University of Pennsylvania, United States
  • Casey Greene, University of Pennsylvania, United States

Presentation Overview: Show

Project Website: http://www.nature.com/nbt/journal/v35/n4/full/nbt.3780.html Source Code: https://github.com/greenelab/continuous_analysis/
License: BSD 3-clause - https://github.com/greenelab/continuous_analysis/blob/master/LICENSE
Reproducibility is central to science. The ability to trust, validate, and extend previous works drives scientific advancement. In a recent survey conducted by Nature, ninety percent of researchers acknowledged a “reproducibility crisis”. Concerns about reproducibility are beginning to change funding and publishing decisions – in short, science that lacks rigor can impact you immediately. Funders are moving towards early release of research products – data, software, protocols, preprints – and publishers are emphasizing data availability, statistical rigor and so on. This talk will highlight the impact of the “reproducibility crisis” on different stakeholders and suggest potential routes forward, including “Continuous Analysis”.
Reproducing computational biology experiments, which are scripted, should be straightforward. But reproducing such work remains challenging and time consuming. In the ideal world, we would be able to quickly and easily rewind to the precise computing environment where results were generated. We would then be able to reproduce the original analysis or perform new analyses. We introduce a process termed "continuous analysis" which provides inherent reproducibility to computational research at a minimal cost to the researcher. Continuous analysis combines Docker, a container service akin to virtual machines, with continuous integration, a popular software development technique, to automatically re-run computational analysis whenever relevant changes are made to the source code. This allows results to be reproduced quickly, accurately and without needing to contact the original authors. Continuous analysis also provides an audit trail for analyses that use data with sharing restrictions. This allows reviewers, editors, and readers to verify reproducibility without manually downloading and rerunning any code.

Full-stack genomics pipelining with GATK4 + WDL + Cromwell
Date: Sunday, July 23
Time: 11:40 AM - 11:45 AM
Room: Meeting Hall IV
  • Geraldine Auwera, Broad Institute, United States
  • Jeff Gentry, Broad Institute, United States
  • Kate Voss, Broad Institute, United States

Presentation Overview: Show

Project Websites: http://software.broadinstitute.org/gatk http://software.broadinstitute.org/wdl
Source Code: https://github.com/broadinstitute/gatk/ https://github.com/broadinstitute/cromwell https://github.com/broadinstitute/wdl
License: BSD 3-clause (see https://github.com/broadinstitute/gatk/blob/master/LICENSE.TXT)
GATK4 is the new major version of the Genome Analysis Toolkit (GATK), one of the most widely used software toolkits for germline short variant discovery and genotyping in whole genome and exome data. For genomics analysts, this new version greatly expands the toolkit's scope of action within the variant discovery space and provides substantial performance improvements with the aim of shortening runtimes and reducing cost of analysis. But it also offers significant new advantages for developers, including a completely redesigned and streamlined engine that provides more flexibility, is easier to develop against, and supports new technologies such as Apache Spark and cloud platform functionalities (e.g. direct access to files in Google Cloud Storage).
WDL and Cromwell are a Workflow Description Language and a workflow execution engine, respectively. The imperative that drives WDL’s development is to make authoring analysis workflows more accessible to analysts and biomedical scientists, while leaving as much as possible of any runtime complexity involved to the execution engine.
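To illustrate that division of labour, a hypothetical launch of a WDL workflow through Cromwell might look like the sketch below (the jar, workflow and inputs file names are placeholders, and the exact command-line form depends on the Cromwell version):

    import subprocess

    # Cromwell's `run` mode takes the WDL workflow plus a JSON file of inputs;
    # scheduling and backend details are handled by the engine, not the workflow author.
    subprocess.run(
        ["java", "-jar", "cromwell.jar", "run",
         "haplotypecaller.wdl", "haplotypecaller.inputs.json"],
        check=True,
    )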
GATK4, WDL and Cromwell are all developed by the Data Sciences Platform (DSP) at the Broad Institute and released under a BSD 3-clause license. For more information on GATK’s recent licensing change, please see https://software.broadinstitute.org/gatk/blog?id=9645.
Taken together, these three components constitute a pipelining solution that is purposely integrated from the ground up, although they can each be used independently and in combination with other packages through features that maximize interoperability. This principle of integration applies equally to development, to deployment in production at the Broad, and to support provided to the external community.

ToolDog - generating tool descriptors from the ELIXIR tool registry
Date: Sunday, July 23
Time: 11:45 AM - 11:50 AM
Room: Meeting Hall IV
  • Hedi Peterson, Institute of Computer Science, University of Tartu, Estonia
  • Herve Menager, Bioinformatics and Biostatistics HUB, C3BI, Institut Pasteur, France
  • Ivan Kuzmin, Institute of Computer Science, University of Tartu, Estonia
  • Jon Ison, Center for Biological Sequence Analysis Department of Systems Biology, Technical University of Denmark, Denmark
  • Kenzo-Hugo Hillion, Bioinformatics and Biostatistics HUB, C3BI, Institut Pasteur, France

Presentation Overview: Show

Over the last few years, the use of bioinformatics tools has been eased by the use of workbench systems such as Galaxy [1] or frameworks that use the Common Workflow Language (CWL) [2]. Still, the integration of these resources in such environments remains a cumbersome, time-consuming and error-prone process. A major consequence is the incomplete description of tools, which are often missing information such as some parameters, a description, or other metadata.
ToolDog (Tool DescriptiOn Generator) is the main component of the Workbench Integration Enabler service of the ELIXIR bio.tools registry [3, 4]. Its goal is to guide the integration of tools into workbench environments. To do so, ToolDog is divided into two main parts: the first analyses the source code of the bioinformatics software with language-dedicated tools and generates a Galaxy XML or CWL tool description; the second annotates the generated tool description using metadata provided by bio.tools (a sketch of fetching such registry metadata follows the references below). The annotator can also be used on its own to enrich existing tool descriptions with missing metadata, such as the recently developed EDAM annotations.
References:
[1] Enis Afgan et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Research (2016) doi: 10.1093/nar/gkw343
[2] Amstutz, Peter et al. (2016): Common Workflow Language, v1.0. figshare. https://doi.org/10.6084/m9.figshare.3115156.v2. Retrieved 15:37, Mar 09, 2017 (GMT)
[3] Jon Ison et al. Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Research, 44(D1):D38–D47, January 2016. ISSN 0305-1048, 1362-4962. doi: 10.1093/nar/gkv1116.
[4] https://bio.tools
Project Website: https://github.com/bio-tools/tooldog/
Source Code: https://github.com/bio-tools/tooldog/
License: MIT
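To illustrate the kind of registry metadata ToolDog's annotation step draws on, the sketch below fetches a bio.tools entry over HTTP. The endpoint form and the example identifier are assumptions made for illustration; consult the bio.tools API documentation for the authoritative interface.

    import requests

    def fetch_biotools_entry(tool_id):
        # Request the registry entry as JSON (endpoint form assumed for illustration).
        resp = requests.get(f"https://bio.tools/api/tool/{tool_id}",
                            params={"format": "json"})
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        entry = fetch_biotools_entry("signalp")   # example identifier, assumed
        # Fields such as the description and EDAM terms are the sort of metadata
        # an annotator could copy into a Galaxy XML or CWL tool description.
        print(entry.get("name"), "-", str(entry.get("description", ""))[:80])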

BioThings SDK: a toolkit for building high-performance data APIs in biology
Date: Sunday, July 23
Time: 11:50 AM - 11:55 AM
Room: Meeting Hall IV
  • Andrew Su, The Scripps Research Institute, United States
  • Chunlei Wu, The Scripps Research Institute, United States
  • Cyrus Afrasiabi, The Scripps Research Institute, United States
  • Ginger Tsueng, The Scripps Research Institute, United States
  • Jiwen Xin, The Scripps Research Institute, United States
  • Sebastien Lelong, The Scripps Research Institute, United States

Presentation Overview: Show

Project Website: http://biothings.io
Source Code: https://github.com/biothings/biothings.api License: Apache License v2
Biological knowledge and web and cloud technology are advancing in parallel, and the latest computational technology allows us to modernize the way we collect, organize and disseminate that growing knowledge. Building web-based APIs (Application Programming Interfaces) has become an increasingly popular way to access data simply and reliably. Compared to traditional raw flat-file downloads, web APIs allow users to request specific data, such as a list of genes of interest, without having to download the entire data file. Web APIs typically return data in a format conforming to common standards (e.g. JSON or XML), which means developers can spend less time wrangling data and more time on analysis and discovery.
We previously developed two high-performance and scalable web APIs for gene and genetic variant annotations, accessible at MyGene.info and MyVariant.info. These two APIs are a tangible implementation of our expertise and collectively serve over 6 million requests every month from thousands of unique users. Crucially, the underlying design and implementation of these systems are not specific to genes or variants, but can easily be adapted to other biomedical data types such as drugs, diseases, pathways, species, genomes, domains and interactions; collectively, we refer to them as "BioThings".
The BioThings SDK is a generalized software development kit for building the same kind of high-performance APIs for other BioThings data types. It enables other developers to build their own BioThings API for their specific data types. Users can take advantage of the abstracted technical layers built into the SDK and produce a high-performance API that follows best practices and community standards. We also adopted JSON-LD (JSON for Linked Data) as a mechanism to form semantic connections between different data types, so that the set of BioThings APIs forms a network of linked, programmatically accessible biological knowledge. The BioThings SDK is now the common backend of both the MyGene.info and MyVariant.info APIs, and was also used to create two new members of the BioThings API family: one for chemical compound and drug data (http://c.biothings.io) and the other for taxonomy data (http://t.biothings.io). A generic BioThings Python client (https://github.com/biothings/biothings_client.py) has also been created to provide a ready-to-use Python client for any BioThings API without the need for any extra code, while remaining extensible for data-type-specific customization.
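For readers unfamiliar with these APIs, the following minimal sketch queries MyGene.info (one of the BioThings APIs) from Python; the gene symbol and the fields requested are arbitrary examples.

    import requests

    def query_gene(symbol, species="human"):
        # The /v3/query endpoint returns matching gene hits as JSON.
        resp = requests.get(
            "https://mygene.info/v3/query",
            params={"q": f"symbol:{symbol}", "species": species,
                    "fields": "symbol,name,entrezgene"},
        )
        resp.raise_for_status()
        return resp.json().get("hits", [])

    if __name__ == "__main__":
        for hit in query_gene("CDK2"):
            print(hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))

The same request pattern works against other BioThings APIs such as MyVariant.info, which is the kind of uniformity the SDK aims to provide.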

Integrating cloud storage providers for genomic analyses
Date: Sunday, July 23
Time: 12:00 PM - 12:05 PM
Room: Meeting Hall IV
  • Helga Thorvaldsdottir, The Broad Institute of MIT and Harvard, United States
  • Jill Mesirov, University of California San Diego, United States
  • Marco Ocana, The Broad Institute of MIT and Harvard, United States
  • Michael Reich, University of California San Diego, United States
  • Ted Liefeld, University of California San Diego, United States

Presentation Overview: Show

Project Website: http://www.genomespace.org/
Source Code: https://bitbucket.org/GenomeSpace/combined
License: LGPL version 2.1 (http://www.opensource.org/licenses/lgpl-2.1.php)
Cloud storage is being used ever more frequently for genomic analysis due to its extreme scalability and constantly declining costs. Numerous cloud storage providers have been used in projects, including Google, Amazon AWS and Dropbox, as well as hybrid clouds built with tools such as OpenStack Swift. Existing genomic analysis tools offer disparate support for these cloud storage platforms; some, but not all, run on one of the common cloud providers. To improve analysis throughput we would like the data and the analysis tools to be co-located as closely as possible or, failing that, for there to be high-bandwidth transfer between the location of the data and the location of the analysis.
To address this problem, and to avoid the m×n integrations that would otherwise be needed to connect every storage provider with every analysis provider, we have developed GenomeSpace (www.genomespace.org), a free and open source cloud-based environment that provides interoperability between best-of-breed computational tools. GenomeSpace includes an interface (API) for analysis tools to securely access data in the cloud, as well as a platform that makes it easy to add additional tools and cloud storage providers. The GenomeSpace FileSystemInterface and StorageSpec APIs provide a structured way for users to connect their cloud storage provider accounts to their GenomeSpace account and make their contents available to all GenomeSpace tools through the GenomeSpace DataManager RESTful API. The FileSystemInterface uses familiar file metaphors, but works equally well with object stores such as Amazon S3 and OpenStack Swift.
In this talk we will describe the details of the data management architecture and interfaces in GenomeSpace that facilitate the connection of multiple cloud storage systems with a large and growing collection of analysis tools.

Fighting Superbugs with Open Source Software
Date: Sunday, July 23
Time: 12:10 PM - 12:15 PM
Room: Meeting Hall IV
  • Kai Blin, Technical University of Denmark, Denmark
  • Marnix Medema, Wageningen University, Netherlands
  • Sang Lee, Korea Advanced Institute of Science and Technology, South Korea
  • Tilmann Weber, Technical University of Denmark, Denmark

Presentation Overview: Show

Source Code: https://github.com/antismash/antismash.git
License: GNU Affero General Public License (AGPL) version 3.0
Antibiotics are one of the most important discoveries in medical history. They form the foundation of many other fields of modern medicine, from cancer treatments to transplantation medicine. Unfortunately, antibiotics are liable to misuse, and have been widely misused, giving rise to an ever-growing number of resistant bacteria, often called superbugs. Medical professionals all over the world are increasingly encountering superbug infections that are untreatable by any available antibiotics. New classes of antibiotics that can sidestep the common resistance mechanisms are desperately needed. Unfortunately, the pipeline for discovering new antibiotics has all but dried up. From 1935 until 1968, 14 new classes of antibiotics were discovered. Since then, only five further classes have been added to our arsenal. The superbugs seem to be winning the arms race.
Fortunately, every cloud has a silver lining. About 70% of clinically used antibiotics are produced by a group of bacteria, the actinomycetes. With the recent surge in genome sequencing technology, it is becoming clear that many actinomycetes - as well as other bacteria and fungi - carry a large, untapped reservoir of further potential antibiotics. To assist lab scientists in discovering these new antibiotics while lowering the re-discovery rate of already known substances, new search strategies are needed.
antiSMASH is a fully Open Source tool to assist life scientists in the discovery of new drug leads. Since its initial release in 2011, it has become one of the most popular tools in the area of antibiotics discovery. Standing on the shoulders of giants, antiSMASH in turn leverages many Open Source life science tools to do its job, providing state-of-the-art bioinformatics analyses via an accessible, easy-to-use web interface.
This talk presents how Open Source software powers the hunt for new antibiotics to fight the rising threat of superbugs.

Users, Communication, and a Light Application-Level API: A Request for Comments
Date: Sunday, July 23
Time: 12:15 PM - 12:20 PM
Room: Meeting Hall IV
  • Seth Carbon, Lawrence Berkeley National Laboratory, United States

Presentation Overview: Show

Website: https://github.com/kltm/lala Repository: https://github.com/kltm/lala License: CC-BY 4.0
Much thought and ink have been devoted to sharing information between resources, from the use of common identifiers to shared data stores and APIs. We would like to explore an area that has not, to the authors' knowledge, been as well explored: the sharing of simple information directly between web-based applications, without a non-application central server or API as an intermediary.
More concretely, the use cases we wish to look at are how to manage user authentication and authorization, and how to obtain small packets of specific information from an external resource that has its own associated web application. As a way to explore this space, we have created applications that accomplish this by passing tokens and round-tripping a packet of information (acting as a "black box" or "piggy bank") through the external application using basic HTTP methods, allowing an easy, high-level "federation". In an environment where much effort is being spent on the creation of rich web applications tailored to the habits of specific groups of scientists and curators, users and engineers would have much to gain by being able to reuse functionality from external applications as-is in their workflows and stacks, with the added benefit of driving traffic to external web applications that implement such a common specification.
Through these explorations, we have decided that a path forward should capture the following qualities:
• Easy to implement: complexity can be a barrier to adoption and implementation
• Basic HTTP tooling: any system should have easy access to the tools necessary
• "Stateless": simplifying debugging and implementation
• Minimal need for initial coordination: beyond what data is to be returned, the external application does not need to understand the transiting packet
• No need for the calling application to coordinate changes to its own API after initial coordination: as the calling application is responsible for decoding the information it initially sent and the location it is sent to, major API changes can occur without the need to coordinate with any other resource
• Ability to perform operations "remotely" while still logged in to the calling application, without the need to coordinate cross-site logins: this can be done by placing an authorization token in the transiting packet
Taking inspiration from the methods used by Galaxy [1] to pull data in from external applications, we’d like to discuss this potential protocol and get feedback from the community about general patterns for passing users and relatively simple data directly between applications. Variations of this proposal have been implemented or explored in applications such as: Noctua, Textpresso Central, AmiGO, and PubAnnotation.
References
[1] The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Research 44(W1): W3-W10, doi:10.1093/nar/gkw343
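The abstract above deliberately leaves the wire format open; purely as a strawman for the requested discussion, the calling-application side of such a round trip might look like the sketch below, in which the transiting packet is an opaque base64-encoded blob that the external application is expected to hand back untouched alongside its answer. Every URL, parameter name and packet field here is hypothetical.

    import base64
    import json
    from urllib.parse import urlencode

    def make_packet(return_url, auth_token, request_id):
        # Everything the caller needs to resume later travels inside the packet,
        # so the external application never has to understand its contents.
        blob = json.dumps({"return": return_url, "token": auth_token, "id": request_id})
        return base64.urlsafe_b64encode(blob.encode()).decode()

    def outgoing_link(external_app, packet):
        # The user is sent to the external application with the packet attached.
        return f"{external_app}?{urlencode({'packet': packet})}"

    def decode_returned(packet):
        # On return, only the caller decodes the packet it originally built.
        return json.loads(base64.urlsafe_b64decode(packet.encode()))

    if __name__ == "__main__":
        pkt = make_packet("https://caller.example/landing", "token-123", "42")
        print(outgoing_link("https://external.example/annotate", pkt))
        print(decode_returned(pkt))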

RADAR-CNS - Research Infrastructure for processing wearable data to improve health
Date: Sunday, July 23
Time: 2:00 PM - 2:18 PM
Room: Meeting Hall IV
  • Amos Folarin, King's College London, United Kingdom
  • Francesco Nobilia, King's College London, United Kingdom
  • Herculano Campos, Goldenarm, United States
  • Irina Pulyakhina, The Hyve, Netherlands
  • Joris Borgdorff, The Hyve, Netherlands
  • Julia Kurps, The Hyve, Netherlands
  • Mark Begale, Vibrent Health, United States
  • Matthias Duempelmann, University Hospital Freiburg, Germany
  • Maxim Moinat, The Hyve, Netherlands
  • Maximilian Kerz, King's College London, United Kingdom
  • Nivethika Mahasivam, The Hyve, Netherlands
  • Richard Dobson, King's College London, United Kingdom

Presentation Overview: Show

Project Website: http://www.radar-cns.org/
Source Code: https://github.com/RADAR-CNS
License: Apache 2.0
Remote Assessment of Disease And Relapse – Central Nervous System (RADAR-CNS) is an innovative collaborative research project to evaluate the potential of wearable devices and smartphone technology to improve quality of life for people with epilepsy, multiple sclerosis (MS) and major depressive disorder (MDD).
RADAR-CNS is built upon close collaboration between patient organizations, clinical partners, research institutes and industry to develop new strategies for the treatment of patients with brain disorders. The aim of RADAR-CNS is to evaluate how best to leverage innovative technologies such as wearable devices and smartphone-based applications for remote measurement and potential relapse prediction. Together with our data processing partners, The Hyve is building an open source infrastructure to capture, process, manage and analyse data from wearable devices and to facilitate integration with data from multiple other sources, such as clinical and -omics data. Our clinical partners will use this data processing pipeline in multiple clinical trials. Sustainability is a focal point during the development of the RADAR platform: we are developing a generic platform that is not limited to brain disorder applications but will be applicable to subsequent RADAR projects such as RADAR Diabetes. Furthermore, we are actively fostering a thriving open source community around the RADAR platform to ensure the longevity of the research infrastructure and to encourage cross-infrastructure efforts.
The goal of the pilot study is to evaluate wearable device data for passive remote measurement of patients in a hospital epilepsy monitoring unit (EMU). We developed an Android application that captures wearable data via a Bluetooth connection and streams it to an internal hospital server. Integration of SDKs for multiple wearable devices allows us to investigate and compare the quality of data collected from different devices (e.g., Empatica E4 and Angel Sensor) as well as device specifications such as battery life and signal stability/range. Besides data quality assessment, we will investigate whether data from wearable devices has potential for seizure detection. Parallel video and EEG monitoring in the monitoring units of specialized epilepsy centers is used as the gold standard for evaluating seizure detection. We will investigate the potential of wearable devices as clinically valuable alternatives to complement or even replace hospital-based technologies; a prerequisite for ambulatory passive remote measurement of patients in their home environment.
The RADAR-CNS platform is adding an additional dimension to the research infrastructure for personalised medicine and health by allowing new ways to leverage innovative technologies for better data integration.

Using Wikidata as an open, community-maintained database of biomedical knowledge
Date: Sunday, July 23
Time: 2:18 PM - 2:36 PM
Room: Meeting Hall IV
  • Andra Waagmeester, Micelio, Belgium
  • Andrew Su, The Scripps Research Institute, United States
  • Benjamin Good, The Scripps Research Institute, United States
  • Elvira Mitraka, University of Maryland, Baltimore, United States
  • Gregory Stupp, The Scripps Research Institute, United States
  • Julia Turner, The Scripps Research Institute, United States
  • Lynn Schriml, University of Maryland, Baltimore, United States
  • Matthew Jacobson, University of British Columbia, Canada
  • Nuria Queralt-Rosinach, The Scripps Research Institute, United States
  • Paul Pavlidis, University of British Columbia, Canada
  • Sebastian Burgstaller-Muehlbacher, The Scripps Research Institute, United States
  • Timothy Putman, The Scripps Research Institute, United States

Presentation Overview: Show

Project Website: https://www.wikidata.org/wiki/User:ProteinBoxBot
Source Code: https://github.com/SuLab/GeneWikiCentral (and repos linked therein) License: MIT
The sum total of biomedical knowledge is accumulating at an explosive rate. One metric of this rapid progress is the exponential growth in the biomedical literature. There are now over 1.2 million new articles published every year, averaging one new article every 26 seconds. Unfortunately, the entirety of that knowledge is not easily accessible. In most cases, biomedical knowledge is locked away in free-text research articles, which are very difficult to use for querying and computation. In some cases, that knowledge has been deposited in structured databases, but even then the fragmented landscape of such databases is a barrier to knowledge integration.
Here, we describe the use of Wikidata (wikidata.org) as an open,
community-maintained biomedical knowledge base. Wikidata is a sister
project of Wikipedia -- both are run by the Wikimedia Foundation and
both are entirely populated by community contributions. In contrast to
Wikipedia’s emphasis on encyclopedic content, Wikidata focuses on
organizing structured knowledge. Wikidata currently contains 140
million individual statements on over 25 million items encoded by 1.8
billion triples. Within that set, we have seeded Wikidata with data on key biomedical entities, including genes, proteins, diseases, drugs, genetic variants, and microbes. To ensure source databases are properly credited, we have implemented a standardized model for referencing and attribution. These data, combined with other data sets imported by the broader Wikidata community, enable powerful integrative queries that span multiple domain areas via the Wikidata SPARQL endpoint.
The emphasis of this abstract is on Wikidata as a resource for biomedical Open Data that can serve as a foundation for other bioinformatics applications and analyses. The code developed to execute this project is also available as Open Source software. This suite of code includes modules for populating Wikidata, for automatically synchronizing with source databases, and for creating domain-specific applications to engage specific user communities.
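As an example of the kind of integrative query the SPARQL endpoint supports (not taken from the abstract), the sketch below retrieves a handful of human genes. The property and item identifiers used here (P351 for Entrez Gene ID, P703 for "found in taxon", Q15978631 for Homo sapiens) are given from memory and should be verified against Wikidata before use.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    QUERY = """
    SELECT ?gene ?geneLabel ?entrez WHERE {
      ?gene wdt:P351 ?entrez ;          # has an Entrez Gene ID
            wdt:P703 wd:Q15978631 .     # found in taxon: Homo sapiens
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """

    def run_query(query):
        # The endpoint returns standard SPARQL JSON results when asked for JSON.
        resp = requests.get(ENDPOINT, params={"query": query, "format": "json"})
        resp.raise_for_status()
        return resp.json()["results"]["bindings"]

    if __name__ == "__main__":
        for row in run_query(QUERY):
            print(row["entrez"]["value"], row["geneLabel"]["value"])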

Emerging public databases of clinical genetic test results: Implications for large scale deployment of precision medicine
Date: Sunday, July 23
Time: 2:36 PM - 2:41 PM
Room: Meeting Hall IV
  • Benedict Paten, University of California, Santa Cruz, United States
  • Can Zhang, University of California, Santa Cruz, United States
  • David Haussler, University of California, Santa Cruz, United States
  • Melissa Cline, University of California, Santa Cruz, United States
  • Robert Nussbaum, Invitae and University of California, San Francisco, United States
  • Scott Topper, Invitae, United States
  • Shan Yang, Invitae, United States
  • Stephen Lincoln, Invitae, United States
  • Yuya Kobayashi, Invitae, United States

Presentation Overview: Show

Implementing precision medicine requires that reliable and consistent clinical interpretations of molecular test results be available to physicians. In germline genetic testing, the process of determining which variants in a patient are pathogenic (disease causing) and which are benign is crucial to the clinical utility of these tests. This classification process involves expert review of complex and sometimes conflicting evidence, after which some variants must be considered of uncertain significance. To promote quality control, collaboration and consensus building, many (but not all) clinical laboratories contribute classifications to open, public databases, notably ClinVar.
We examined 74,065 classifications of 27,224 clinically observed genetic variants in ClinVar and found inter-laboratory concordance to be high (96.7%), although this varied considerably among specialties: cancer genes had the highest concordance (98.5%), with cardiology (94.2%) and metabolic disease (95.1%) the lowest. Unsurprisingly, data from research labs were six times more likely to disagree with clinical test results, and old classifications often disagreed with newer ones. More interestingly, evidence supporting pathogenicity appears to be more consistently used than evidence against pathogenicity (data to be shown). Low-penetrance genes, which confer a modest risk of disease, were seven times more likely to harbor disagreements than high-penetrance genes. Technically challenging variant types (e.g. large indels and mutations in low-complexity or highly homologous sequences) were under-represented in ClinVar, even though independent data from an 80,000-patient cohort show that 9% of pathogenic variants in patients are of these types.
We further examined variants in BRCA1 and BRCA2, genes which can confer a substantial lifetime risk of breast, ovarian, and other cancers. While BRCA1/2 testing is common, it remains controversial because (a) proposals to implement population-scale BRCA1/2 testing as part of routine physical exams have been raised, (b) irreversible preventive options (e.g. prophylactic surgery) are offered to otherwise healthy individuals found to carry germline mutations, and (c) the largest laboratory with data to inform such testing (Myriad) refuses to share that data or provide it for detailed community review. We analyzed physician-provided Myriad results along with those from other ClinVar laboratories, finding high agreement (98.5%) in a data set representing over 20,000 tested patients. Moreover, the few discordant variants were all very rare, suggesting that 99.8% of patients would receive concordant tests from any of these labs in routine practice. This confirms a similar result of ours in a prospective 1000-patient clinical trial.
Open data sharing via ClinVar is a powerful mechanism which has already helped to uncover critical factors which must be addressed as new precision medicine approaches evolve in various medical specialties. ClinVar also provides a common mechanism to drive consensus and evaluate inter-laboratory performance, as exemplified by this study.
Note: All data from these studies are publicly available without restriction for analysis by the community. Some of the results described in this abstract are currently in press. Prior published work mentioned includes Lincoln et al. J Mol Diag 2015 and Yang et al, PSB 2016.

Discovering datasets with DATS in DataMed
Date: Sunday, July 23
Time: 2:41 PM - 2:46 PM
Room: Meeting Hall IV
  • Alejandra Gonzalez-Beltran, University of Oxford, United Kingdom
  • Philippe Rocca-Serra, University of Oxford, United Kingdom

Presentation Overview: Show

Project Website: http://biocaddie.org/
Source Code: http://github.com/biocaddie/WG3-MetadataSpecifications
License: CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0/)
Among the research outputs of the life sciences, datasets are fast becoming a new currency for evaluating research. As they underlie publications, accessing, mining and integrating them is the way forward for verifying and reproducing results, as well as for propelling new discoveries. Thus, traditional literature searches via PubMed need to be complemented by the capability to find and access the related datasets, distributed through an array of generalist or highly specialized data repositories.
DataMed, a prototype funded by the National Institutes of Health's Big Data to Knowledge initiative, provides a data discovery index, currently indexing over 60 data repositories covering a wide range of data types and granularity scales across the life science domains. Behind the DataMed index, a model known as the DAta Tag Suite (DATS) [1] provides the constructs allowing harmonization of dataset descriptions across repositories. Akin to the Journal Article Tag Suite (JATS) used in PubMed, the DATS model enables the submission of dataset metadata to DataMed. DATS is divided into two layers: DATS core, whose elements are generic, applicable to any type of dataset from any topic and not specific to the bio-domain; and extended DATS, which accommodates more advanced use cases and more finely grained needs befitting the biomedical domain.
Designed and developed in an entirely open and collaborative fashion, DATS considered existing, well-established schemas and models designed for describing datasets and/or used in biodata repositories. In addition, search use cases were compiled. The DATS model is expressed as JSON schemas with schema.org JSON-LD context files. The schema.org model is widely used by major search engines such as Google, Microsoft, Yahoo and Yandex, and therefore caters for relevant indexing use cases. Our mapping of DATS into schema.org identified gaps and has contributed to its extension to better support biomedical use cases. Work is ongoing to refine and improve ways to implement DATS and optimize information indexing in DataMed.
[1] Preprint of a manuscript accepted for publication in 2017: https://doi.org/10.1101/103143
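Because DATS is distributed as JSON schemas, a repository can check its dataset descriptions mechanically before submission to DataMed. The sketch below shows one way to do this with the standard jsonschema package; the file names are placeholders, and the schemas themselves live in the project repository linked above.

    import json
    from jsonschema import validate   # pip install jsonschema

    def validate_record(record_path, schema_path):
        with open(record_path) as r, open(schema_path) as s:
            record, schema = json.load(r), json.load(s)
        # Raises jsonschema.ValidationError if the record does not conform.
        validate(instance=record, schema=schema)
        return True

    if __name__ == "__main__":
        # Placeholder names: a dataset description and the DATS dataset schema.
        if validate_record("my_dataset.json", "dataset_schema.json"):
            print("Record conforms to the schema.")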

Bioschemas for life science data
Date: Sunday, July 23
Time: 2:46 PM - 2:51 PM
Room: Meeting Hall IV
  • Alasdair Gray, Heriot Watt University / ELIXIR-UK, United Kingdom
  • Carole Goble, The University of Manchester / ELIXIR-UK, United Kingdom
  • Giuseppe Profiti, Università di Bologna / ELIXIR-Italy, Italy
  • Niall Beard, The University of Manchester / ELIXIR-UK, United Kingdom
  • Norman Morrison, ELIXIR Hub, United Kingdom
  • Rafael Jimenez, ELIXIR Hub, United Kingdom

Presentation Overview: Show

Project Website: http://bioschemas.org/
Source Code: https://github.com/BioSchemas/bioschemas
License: Creative Commons Attribution-ShareAlike License (version 3.0)
Schema.org provides a way to add semantic markup to web pages. It describes 'types' of information, which then have 'properties'. For example, 'Event' is a type that has properties like 'startDate', 'endDate' and 'description'. Bioschemas aims to improve data interoperability in the life sciences by encouraging people in the life sciences to use schema.org markup. This structured information then makes it easier to discover, collate and analyse distributed data. Bioschemas reuses and extends schema.org in a number of ways: it defines a minimum information model for each datatype being described, using as few concepts as possible and adding new properties only where necessary, and it introduces cardinalities and controlled vocabularies. The main outcome of Bioschemas is a collection of specifications that provide guidelines for a more consistent adoption of schema.org markup within the life sciences, serving the "Findable" part of the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
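To make the markup concrete, here is a small sketch (not from the abstract) that emits schema.org Event markup as JSON-LD, using the properties named above; the event details are invented, and a Bioschemas profile would additionally constrain which properties are required and what values they may take.

    import json

    event = {
        "@context": "http://schema.org",
        "@type": "Event",
        "name": "Introduction to Genome Assembly",    # invented training event
        "description": "A one-day hands-on workshop.",
        "startDate": "2017-09-18",
        "endDate": "2017-09-18",
        "location": {"@type": "Place", "name": "Example Institute"},
    }

    # Such a record is typically embedded in the page inside a
    # <script type="application/ld+json"> element so that aggregators and
    # general search engines can harvest it.
    print(json.dumps(event, indent=2))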
In 2016, Bioschemas was successfully piloted with training materials and events, enabling the EU ELIXIR Research Infrastructure Training Portal (TeSS) to rapidly and simply harvest metadata from community sites. Encouraged by this, in March 2017 we launched a 12-month project to pilot Bioschemas for data repositories and datasets. Specifically, we are working on:
• General descriptions for datasets and data repositories
• Specific descriptions for prioritised datatypes: Samples, Human Beacons, Plant phenotypes and Protein annotations
• Facilitating discovery by registries and data aggregators, and by general search engines
• Facilitating tool development for annotation and validation of compliant resources
All work is grounded in describing real data resources for real use cases; to this end, large and small datasets are part of the project: Pfam, InterPro, PDBe, UniProt, BRENDA, EGA, COPaKB, and Gene3D. Participating data aggregators include InterMine, BioSamples and OmicsDI. Registries include Identifiers.org, DataMed, Biosharing and the Beacon Network. Bioschemas operates as an open community initiative, sponsored by the EU ELIXIR Research Infrastructure and supported by the NIH BD2K programme and Google.

Introducing the Brassica Information Portal: Towards integrating genotypic and phenotypic Brassica crop data
Date: Sunday, July 23
Time: 2:51 PM - 2:56 PM
Room: Meeting Hall IV
  • Annemarie Eckes, Earlham Institute, Norwich Research Park, Norwich, United Kingdom
  • Carlos Horro, Earlham Institute, Norwich Research Park, Norwich, United Kingdom
  • Graham King, Southern Cross Plant Science, Southern Cross University, Lismore, Australia
  • John Hancock, Earlham Institute, Norwich Research Park, Norwich, United Kingdom
  • Judith Irwin, John Innes Centre, Norwich Research Park, Norwich, United Kingdom
  • Piotr Nowakowski, Academic Computer Centre CYFRONET, AGH University of Science and Technology, Krakow, Poland
  • Rachel Wells, John Innes Centre, Norwich Research Park, Norwich, United Kingdom
  • Sarah Dyer, NIAB, Huntingdon Road, Cambridge, United Kingdom
  • Tomasz Gubała, Academic Computer Centre CYFRONET, AGH University of Science and Technology, Krakow, Poland
  • Tomasz Szymczyszyn, Earlham Institute, Norwich Research Park, Norwich, United Kingdom
  • Wiktor Jurkowski, Earlham Institute, Norwich Research Park, Norwich, United Kingdom

Presentation Overview: Show

A new community resource, the Brassica Information Portal (BIP), provides an open access web repository for Brassica phenotyping data (https://bip.earlham.ac.uk). It facilitates crop improvement by filling the gap for standardised trait data and is already integrated with tools to perform on-the-fly phenotype-genotype association analysis online.
In collaboration with the UK Brassica Research Community, we have built a tool that aims to assist global research on Brassica crops: users can store and publish their own study results in the Portal thanks to advanced data submission capabilities, while user-friendly interfaces and programmatic download mechanisms further facilitate work with the data. This makes the database content readily comparable and cross-linkable to that of other tools and resources for further downstream analysis. Integrated with the GWAS analysis tool GWASSER, BIP now enables users to perform simple GWAS analyses on selected data.
We also aim to implement community-wide phenotypic trait definition standards to encourage trait data re-use. To ensure comparability, we are creating Brassica trait ontology terms based on the Crop Ontology model, suitable for association analysis. This creates the opportunity to carry out meta-analyses on datasets generated across multiple studies. To make BIP data more comparable and reusable, we are collaborating with ELIXIR to implement MIAPPE standards in the BIP schema. To make the data more interoperable and accessible, we are working on becoming BrAPI (Breeding API) compliant, as well as on extending our own BIP API.
There are huge benefits to making organised trait data available and accessible to the Brassica community, and we hope that it becomes common practice to store Brassica trait data systematically and share it with the wider research community, as has become routine for sequence, transcript, metabolite and protein structure data.
We will present the resource and a GWAS use case to demonstrate the value and utility of BIP, both for single research groups and for the global Brassica research and breeding community.
Source code: https://github.com/TGAC/brassica
Open Source License: GNU General Public License 3.0

BOSC Panel: Open Data - Standards, Opportunities and Challenges
Date: Sunday, July 23
Time: 3:00 PM - 4:00 PM
Room: Meeting Hall IV
  • Andrew Su, The Scripps Research Institute, United States
  • Carole Goble, The University of Manchester, United Kingdom
  • Madeleine Ball, Open Humans Foundation, United States
  • Monica Munoz-Torres, Lawrence Berkeley National Laboratory, United States
  • Nick Loman, University of Birmingham, UK

Presentation Overview: Show

Every year, BOSC includes a panel discussion that offers attendees the chance to engage in conversation with the panelists and each other. This year, our panel discussion will focus on open data: how it can help to catalyze scientific discovery and preserve knowledge, standards for sharing data, and some of the challenges involved. The panelists will be Nick Loman, Madeleine Ball, Carole Goble and Andrew Su; the moderator will be Monica Munoz-Torres.

Open data meets ubiquitous sequencing: challenges and opportunities
Date: Sunday, July 23
Time: 4:30 PM - 5:30 PM
Room: Meeting Hall IV
  • Nick Loman, University of Birmingham, UK

Presentation Overview: Show

Nick Loman is known as a vocal proponent of open genomic data in healthcare. A Professor of Microbial Genomics and Bioinformatics at the University of Birmingham, Dr. Loman explores the use of cutting-edge genomics and metagenomics approaches to human pathogens. He promotes the use of open data to facilitate the surveillance and treatment of infectious disease.

Dr. Loman helped establish real-time genomic surveillance of Ebola in Guinea and Zika in Brazil (via the ZiBRA project, which states that "Data will be subject to open release as it is generated"). In another recent project, real-time genomic data was used to analyze a small outbreak of Salmonella enteritidis in the UK. Through this sharing of genomic datasets, researchers were able to confirm that the cases were linked to a larger, national-scale outbreak. Dr. Loman is one of the authors of Poretools, and he regularly shares cutting-edge Nanopore data and protocols for using it. In collaboration with Lex Nederbragt, Dr. Loman is developing an open-source repository of sequencing and bioinformatics benchmarking datasets called Seqbench.