View Posters By Category
Session A: (July 22 and July 23)
Session B: (July 24 and July 25)
Presentation Schedule for July 22, 6:00 pm – 8:00 pm
Presentation Schedule for July 23, 6:00 pm – 8:00 pm
Presentation Schedule for July 24, 6:00 pm – 8:00 pm
Session A Poster Set-up and Dismantle
Session B Poster Set-up and Dismantle
Short Abstract: NCBI's Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. However, methods to programmatically access these data are limited. We introduce pysradb, a Python package that provides a collection of command-line methods to query and download metadata and data from SRA, utilizing the curated metadata database available through the SRAdb project. pysradb builds upon the principles of SRAdb, providing a simple and user-friendly command-line interface for querying metadata and downloading datasets from SRA. It obviates the need for the user to be familiar with any programming language when querying and downloading datasets from SRA. Additionally, it provides utility functions that help a user perform the more granular queries often required when dealing with multiple datasets at large scale. By enabling both metadata search and download operations at the command line, pysradb aims to bridge the gap in seamlessly retrieving public sequencing datasets and the associated metadata. pysradb is written in Python and is developed on GitHub under the open-source BSD 3-Clause License. Each sub-command of pysradb contains a self-contained help string that describes its purpose and usage examples, with additional documentation available on the project's website.
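As an illustration of the command-line interface described above, a typical session might look like the following sketch (the accession `SRP000001` is only a placeholder, and exact flags should be checked against pysradb's built-in help strings):

```
# Fetch the metadata table for a given SRA study accession
pysradb metadata SRP000001

# Download the sequencing data associated with the same study
pysradb download -p SRP000001
```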
Short Abstract: The scale and diversity of epigenomics data have been rapidly increasing, and ever more studies now present analyses of data from multiple epigenomic techniques. Performing such integrative analysis is time-consuming, especially for exploratory research, since there are currently no pipelines available that allow fast processing of datasets from multiple epigenomic assays while also allowing flexibility in running or upgrading the workflows. Here we present a solution to this problem: snakePipes, which can process and perform downstream analysis of data from all common epigenomic techniques (ChIP-seq, RNA-seq, Bisulfite-seq, ATAC-seq, Hi-C and single-cell RNA-seq) in a single package. We demonstrate how snakePipes can simplify integrative analysis by reproducing and extending the results from a recently published large-scale epigenomics study with a few simple commands. snakePipes is available under an open-source license at https://github.com/maxplanck-ie/snakepipes.
Short Abstract: The FAIR Data Principles are a set of guidelines which aim to make data findable, accessible, interoperable and reusable. The principles are gaining traction, especially in the life sciences. We will present our experience of the practical implementation of the FAIR principles in InterMine, a platform to integrate and access life sciences data. We will cover topics such as the design of persistent URLs, standards for embedding data descriptions into web pages, describing data with ontologies, and data licences.
Short Abstract: The portability of computational data analysis environments has been greatly improved by container virtualization technologies such as Docker and by workflow description frameworks such as the Common Workflow Language (CWL). To deploy an environment for data analysis workflows, researchers must select a computational platform appropriate for the given application. To provide information for estimating the computational resources, such as CPU or memory, required by workflow execution, we developed CWL-metrics, a utility system for cwltool that collects runtime metrics of CWL workflows. The system summarizes the resource usage of each workflow step together with the input parameters and information about the host machine. We demonstrate a performance comparison of RNA-Seq quantification workflows using metrics captured by CWL-metrics; the comparison analysis results, recorded in a Jupyter Notebook, are published on GitHub. The system is used by the new pipeline system being deployed on the high-performance computing platform of the DDBJ, where the collected metrics help the platform's administrators. The metrics captured by CWL-metrics are also being used in the development of a resource prediction algorithm. We will also present progress on the new version of CWL-metrics, currently under development, which will increase the coverage of supported workflow runners and container technologies.
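For context, the kind of workflow step that CWL-metrics profiles is described by a small CWL document. The following minimal `CommandLineTool` is an illustrative sketch (not taken from the CWL-metrics project) that wraps `wc -l` to count lines in an input file:

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
arguments: ["-l"]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  line_count:
    type: stdout
stdout: counts.txt
```

Running such a tool with cwltool under CWL-metrics would then record the CPU and memory usage of this step alongside its input parameters.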
Short Abstract: The Global Alliance for Genomics and Health (GA4GH) is creating frameworks and standards to enable the responsible, voluntary, and secure sharing of genomic and health-related data. GA4GH specifications include file formats such as CRAM and BCF, cloud-compute-enabling standards such as the Workflow Execution Service (WES), and a path to match patients' consent for data use with researchers' areas of interest. This talk will introduce these and future specifications, describe the community-based process in which they are developed, and show how you can contribute to this process.
Short Abstract: ELIXIR is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, training materials, cloud storage, and supercomputers. One of the goals of ELIXIR is to coordinate these resources so that they form a single infrastructure. This infrastructure makes it easier for scientists to find and share data, exchange expertise, and agree on best practices. ELIXIR's activities are divided into five areas, known as "platforms": Data, Tools, Interoperability, Compute, and Training. The ELIXIR Tools Platform works to improve the discovery, quality, and sustainability of software resources. The Software Best Practices task of the Tools Platform aims to raise the quality and sustainability of research software by producing, adopting, promoting, and measuring information standards and best practices applied to the software development life cycle. We have published four simple recommendations (4OSS) to encourage best practices in research software, as well as the Top 10 metrics for life science software good practices. The next step is to adopt, promote, and recognise these information standards and best practices by developing comprehensive guidelines for software curation and through workshops that train researchers and developers in the adoption of software best practices.
Short Abstract: We present the work of the Cellular Genetics Informatics team at the Wellcome Sanger Institute, UK. Our team provides efficient access to cutting-edge analysis methods for the Cellular Genetics programme, focusing on the development and operation of pipelines, tools, and infrastructure for data analysis that support the programme's research goals. For these purposes, we developed a reproducible and battle-tested Kubernetes-on-OpenStack setup using Kubespray, with the primary use cases of running Nextflow pipelines (custom RNA-seq and nf-core pipelines) and hosting a multiuser JupyterHub server with a custom image, together with other web applications. It integrates the iRODS data management platform and the Lustre and GlusterFS filesystems. Nextflow pipeline performance has been benchmarked on Kubernetes on OpenStack versus the LSF scheduler on a high-performance cluster. We also present an open-source web application for running Nextflow pipelines on OpenStack. It allows users to start pipelines by uploading data and filling in input parameters without any Nextflow knowledge, to be notified once an analysis is finished, and to get the results uploaded to S3. Although it was written for the OpenStack use case, it allows plugging in a different backend for other cloud providers, such as GCE or AWS.
Short Abstract: Stakeholders within public health can use genomic data analyses to enact policies, yet they face many challenges in using and interpreting the results. Data visualization is an emergent solution to address interpretability challenges, but a robust approach for creating data visualizations has been absent. We have developed a systematic method for generating an explorable visualization design space, which catalogues visualizations existing within the infectious disease genomic epidemiology literature. We applied our method to a document corpus of approximately 18,000 articles, from which we sampled 204 articles and approximately 800 figures for analysis. We analyzed these figures and created a set of taxonomic codes along three descriptive axes of visualization design: chart types within the visualization, chart combinations, and chart enhancements. We refer to the collective complement of derived taxonomic codes as GEViT (Genomic Epidemiology Visualization Typology). To operationalize GEViT and the results of the literature analysis, we have created a browsable image gallery (http://gevit.net) that allows an individual to explore the myriad complex types of data visualizations (i.e., the visualization design space). Our analysis of the visualization design space through GEViT also revealed a number of data visualization challenges within infectious disease genomic epidemiology that future bioinformatics work should address.
Short Abstract: Modern molecular biology is creating an increasing amount of sequence and macromolecular structure data. Analyzing these data can be tedious work: although a large variety of software is available for the task, each program is usually made for a very specific purpose. Therefore, a combination of multiple programs is usually required to reach the desired goal, often rendering the data processing inflexible and even inefficient. These limitations can be overcome by shifting the workflow to a comprehensive computational biology library in an easy-to-learn scripting language like Python. For this purpose, the open-source Python package Biotite was created. It is a modern and comprehensive computational molecular biology library in the spirit of Biopython. As Biotite stores sequence and structure data internally as NumPy ndarrays, most operations have high performance. On top of data analysis, modification, and visualization capabilities, Biotite can be used to download data from biological databases or to interface external software in a seamless manner.
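To illustrate the design idea mentioned above (sequences held internally as arrays of integer codes, which Biotite vectorizes with NumPy ndarrays), here is a deliberately simplified, dependency-free sketch; it is not Biotite's actual API:

```python
# Sketch of the "sequence as integer code array" idea (not Biotite's API).
# Biotite performs the equivalent operations on NumPy ndarrays, so they
# run vectorized rather than element by element.

ALPHABET = "ACGT"                                # nucleotide alphabet
CODE = {sym: i for i, sym in enumerate(ALPHABET)}

def encode(sequence):
    """Map a symbol string to a list of integer codes."""
    return [CODE[sym] for sym in sequence]

def decode(codes):
    """Map integer codes back to a symbol string."""
    return "".join(ALPHABET[c] for c in codes)

# Complementing becomes a simple permutation of the code space: A<->T, C<->G
COMPLEMENT_TABLE = [3, 2, 1, 0]

def complement(codes):
    return [COMPLEMENT_TABLE[c] for c in codes]

codes = encode("ACGT")
print(decode(complement(codes)))  # -> TGCA
```

Because the symbols are reduced to small integers, operations like complementing, alignment scoring, or filtering reduce to table lookups and array arithmetic, which is where the NumPy-backed implementation gains its speed.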
Short Abstract: Today there is a plethora of open-source Workflow Management Systems (WMSs) aiming at organizing research, automating analysis, discovering valuable Research Objects (ROs), and, ultimately, battling the reproducibility crisis. Despite the technical maturity of these tools and the efforts of communities like the OBF to spread their use, we can roughly estimate that less than 1% of published analyses that combine multiple ROs use any WMS. Moreover, there are still important features that are at best partially supported by existing WMSs, limiting their applicability. In brief, these features are: (i) a steep learning curve for non-IT experts, (ii) custom Domain Specific Languages, (iii) the requirement for local installation, (iv) inability to cooperate with other WMSs, (v) lack of rewards for scientists who add content, (vi) lack of a single, browsable repository with easily downloadable and executable ROs that also contains usage statistics and resource requirements, and (vii) inability to rate and comment on existing ROs. To remedy these issues, we present the first version of OpenBio-C, an online WMS, workflow composer, RO repository, and Q&A site targeting all science enthusiasts. It requires no IT knowledge and supports RO import and export from a variety of existing WMSs.
Short Abstract: The Common Workflow Language (CWL) makes it possible to wrap and link up bioinformatics software in a standardized and portable way. However, setting up and operating a CWL-based workflow management system can be a labor-intensive challenge for data-driven laboratories. To this end, we developed CWLab: a framework for simplified, graphical deployment of CWL. CWLab allows life science researchers of all levels of computational proficiency to create, execute, and monitor jobs for CWL-wrapped tools and workflows. Input parameters for large sample batches are specified using a simple HTML form and are automatically validated. The integrated web server allows remote control of execution on HPC clusters as well as on single workstations. Moreover, automatic infrastructure provisioning for OpenStack-based clouds is being implemented. CWLab can also be used as a local desktop application that supports Linux, macOS, and Windows by leveraging Docker containerization. Our Python-based framework is easy to set up and, via a flexible API, can be integrated with any CWL runner and adapted to custom software environments. With CWLab, we would like to hide the complexity of workflow management so that scientific users can focus on their data analyses. This might promote the adoption of CWL in multi-professional life science laboratories.
Short Abstract: The pdb-tools are a collection of Python scripts for working with molecular structure data in the Protein Data Bank (PDB) format. The tools allow users to easily and efficiently edit and validate PDB files as well as convert coordinate data to and from the now-standard mmCIF format. Moreover, their simple and consistent command-line interface makes them particularly suitable for non-expert users. All tools are implemented in Python, without external dependencies, and are freely available under the open-source Apache License at https://github.com/haddocking/pdb-tools and on PyPI.
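Because each tool reads from standard input and writes to standard output, they compose naturally in Unix pipelines. A typical invocation might look like the following sketch (tool names and flags should be checked against the project documentation; `1brs` is just an example PDB identifier):

```
# Fetch a structure, keep only chain A, drop heteroatoms, and tidy the result
pdb_fetch 1brs | pdb_selchain -A | pdb_delhetatm | pdb_tidy > 1brs_A.pdb
```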
Short Abstract: Linking different tools is an integral part of bioinformatics, and both software build tools (such as GNU Make) and workflow engines have been employed to tackle this problem, but neither is specific to bioinformatics. We present an alternative approach, namely a domain-specific language (named NGLess) for defining next-generation sequencing (NGS) processing pipelines, and show its advantages over the use of generic tools. With NGLess, the user works with abstractions that are closer to the problem domain, which enables a better user experience in building correct and reproducible pipelines. For example, NGLess contains built-in types such as ShortReadSet, which corresponds to a collection of FastQ files on disk. This enables the user to link tools that output compatible files, with the conversion between them being automatic (a tool that outputs a SAM file can be linked to a tool that consumes a BAM file, as both of these are of the same type and NGLess will automatically insert any necessary file conversions). Using NGLess 1.0, we implemented NG-meta-profiler, a tool for producing taxonomic and functional profiles from metagenomes. Through the use of NGLess, NG-meta-profiler is significantly faster than the alternatives. NGLess and NG-meta-profiler are available from https://ngless.embl.de/
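To give a flavour of the domain-specific language described above, an NGLess script might look roughly like the following (a hedged sketch based on the project's documented style; exact function names, thresholds, and reference identifiers should be checked against https://ngless.embl.de/):

```
ngless "1.0"

# Read a FastQ file from disk (a ShortReadSet value)
input = fastq('sample.fq.gz')

# Quality-trim reads and discard those that become too short
input = preprocess(input) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

# Map against a reference; any needed file-format conversions are
# inserted automatically by NGLess
mapped = map(input, reference='hg19')
write(mapped, ofile='sample.bam')
```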
Short Abstract: Multi-omics analysis incorporating multiple types of omics data has become important in complex disease studies. Many multi-omics analysis methods have been developed. However, only a few tools for simulating multi-omics data are available. We developed OmicsSIMLA, which simulates genetic data including SNPs and CNVs, methylation data based on bisulphite sequencing, RNA-seq and normalized protein expression data. OmicsSIMLA also simulates meQTLs (SNPs influencing methylation), eQTLs (SNPs influencing gene expression), and eQTM (methylation influencing gene expression). Furthermore, a disease model can be specified to model the relationships between the multi-omics data and the disease status. We used OmicsSIMLA to simulate a multi-omics dataset with a scale similar to the ovarian cancer data from the TCGA project. The simulated data included 2,884 focal CNVs, 2,753 CpGs on chromosome 1, gene expression levels for 12,004 genes, and protein expression levels for 200 genes in 500 samples with short-term and long-term survival. A neural network-based multi-omics analysis method was applied to the real and simulated ovarian cancer data, and similar results such as the classification rate were observed. The results demonstrated that OmicsSIMLA can simulate realistic multi-omics data and will be useful to generate benchmark datasets for comparisons among multi-omics analysis methods.
Short Abstract: Whole exome and whole genome sequencing are being rapidly adopted in the healthcare industry, making their way into routine clinical practice. Most variant interpretation tools are built to work with domain-based clinical guidelines, approved by the ACMG and other responsible bodies, focusing on minimizing the number of variants for manual review by clinical staff. Forome Anfisa is a collaborative variant annotation, interpretation, and curation tool, an organic part of the integrated clinical research program developed by the Forome Platform team. Initially built as part of the Brigham Genomics Medicine Program and used for a hearing loss project (SEQaBOO), Forome Anfisa has now been re-architected for whole exome and genome cases, enabling smooth work with huge volumes of data and giving clinicians a way to cope with millions of genetic variants meaningfully. Introducing the first fully open-source variant management toolset aimed at both clinical and research communities, we provide a way to seamlessly transform research workflows into clinical guidelines, thus speeding up the adoption of WGS/WES into clinical practice. The toolset offers collaboration by design, allowing users in different roles to interact and to share findings on a particular patient case or even a single variant.
Short Abstract: Here we use the open source KNIME Analytics Platform to create a machine learning model that learns disease names in biomedical literature. The model has two inputs: an initial list of disease names and the documents. Our goal is to create a model that can tag disease names that are part of our input as well as novel disease names. Hence, one important aspect of this project is that our model should be able to autonomously detect disease names that were not part of the training. To do this, we automatically extract openly available abstracts from PubMed and use these documents to train our model, starting with an initial list of disease names. We then evaluate the resulting model using documents that were not part of the training and achieve a precision of 0.966, a recall of 0.917, and an F1 score of 0.941. Additionally, we test whether the model can extract new information and interactively inspect the diseases that co-occur in the same documents using a network visualization approach. Moreover, we explore genetic information associated with these diseases using publicly available information from the bioinformatics resource Ensembl BioMart.
Short Abstract: The relationship between parents and their offspring is fundamentally anchored in the genome: we inherit one copy of each chromosome from our mother and another copy from our father. These two versions of the genome are similar but not identical. Reconstructing these two individual copies (called haplotypes) can offer important insights in areas as diverse as population genetics, evolutionary genetics, and personalized medicine. We present WhatsHap, a production-ready bioinformatics software suite for highly accurate haplotyping based on the latest sequencing technologies. WhatsHap has a large user base (>250 downloads/week), is under active development, and comes with a multitude of features, including the use of pedigree information, the ability to determine genotypes, and application in de novo assembly settings. WhatsHap has been designed as an easy-to-use application, sporting comprehensive documentation and a continuously tested code base to ensure reliability. WhatsHap is freely available under the terms of the MIT license at bitbucket.org/whatshap
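As an illustration of typical usage, phasing the variants in a VCF using aligned reads might look like the following sketch (the file names are placeholders, and flags should be checked against the WhatsHap documentation):

```
# Phase variants in input.vcf using reads from input.bam
whatshap phase -o phased.vcf --reference reference.fasta input.vcf input.bam
```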
Short Abstract: Data-driven science is facilitated by datasets that comply with the FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles, which maximize the value of the data. While the social sciences and geosciences have long recognized that Trustworthy Data Repositories (TDRs) are critical components for enabling data to be FAIR on a sustainable basis, the concepts of trustworthiness and the means to assess TDRs are new to some biomedical fields. In an effort to make standards easier to understand and apply, we recently proposed five TRUST Principles (Transparency, Responsibility, User Community, Sustainability, and Technology; https://bit.ly/2Ih7g8F). These principles define the criteria for a repository to be considered trustworthy: the repository must have transparent policies, organizational resources, and personnel with the expertise to provide sustainable operations and secure technologies for their communities. ImmPort, a repository of data from immunology research projects and clinical trials funded by the National Institute of Allergy and Infectious Diseases, NIH, has been certified by CoreTrustSeal (CTS) as a Trustworthy Data Repository. ImmPort's submission of "Assessment Information" to CTS for certification is publicly available through the ImmPort website and provides an example of how to implement the TRUST Principles and make data FAIR.
Short Abstract: Antibiotics are one of the most important discoveries in medical history. They form the foundation of many other fields of modern medicine, from cancer treatments to transplantation medicine. About 70% of the clinically used antibiotics are produced by a group of bacteria, the actinomycetes. With the recent surge in genome sequencing technology, it is becoming clear that many actinomycetes - as well as other bacteria and fungi - carry a large, untapped reservoir of further potential antibiotics. Genome mining can be used to assist life scientists in the discovery of new drug leads. One such tool is the open-source software antiSMASH. Since its initial release in 2011, it has become one of the most popular tools in the area of antibiotics discovery, combining comprehensive analyses with an easy-to-use web UI. We have recently released version 5 of antiSMASH, which, in addition to adding new features, migrated the whole code base to Python 3. This poster will present some of the challenges encountered when porting a large legacy code base to a new language and how we solved those challenges in our case.
Short Abstract: DNA methylation is an epigenetic modification associated with transcriptional regulation and the establishment of cellular identity. Whole genome bisulfite sequencing (WGBS) has become the gold standard for measuring DNA methylation. WGBS data processing often results in bedGraph files containing methylation and coverage statistics. Downstream analysis requires summarization of these files into methylation/coverage matrices, whose dimensions rapidly increase with the number of samples. However, currently available tools are limited by file format specifications and speed/memory requirements. To overcome these limitations, we have developed an R package called methrix, which provides a fast and efficient solution for processing WGBS data. The core functionality of methrix includes a universal bedGraph reader that handles missing reference CpG sites and annotates and collapses strands, while being fast and memory efficient. The methrix object is an extension of the Bioconductor SummarizedExperiment class, thereby inheriting its core modules. Additionally, several methods are offered for downstream processing, including functions for data visualization and interactive HTML report generation. Methrix interacts with the popular bsseq package, thereby providing faster pre-processing of data for existing DNA methylation analysis pipelines. In conclusion, methrix addresses the existing limitations by offering a resource-efficient way of analysing WGBS data.
Short Abstract: There has been an ongoing effort to improve the reproducibility and portability of bioinformatics pipelines. Projects such as the Common Workflow Language (CWL) aim to standardise workflow specifications. However, there is often debate on whether to adopt CWL, the Workflow Definition Language (WDL), or other competing standards. CWL provides easy-to-parse specifications and is supported by multiple engines but is considered more difficult to write, while WDL offers more features but is tightly coupled to Cromwell. To address this, we have created Janis, a tool designed to assist in building standardised workflows via a translation mechanism that generates validated workflow specifications (CWL, WDL, or both). These translated workflows can be shared and executed using any workflow engine that supports the selected specifications. Janis also offers input and output type checking during workflow construction to enforce the input requirements of executed tools, which is important for tools that require secondary files. Through Janis, we have developed cancer variant-calling pipelines that are functional across the HPC environments at three different Australian research institutes. In future work, we may extend Janis to support additional output formats. We believe that the extra abstraction provided by Janis offers a powerful way to write portable bioinformatics pipeline specifications.
Short Abstract: As sequencing runs become faster and instruments can be run more frequently, data analysis can similarly be made more efficient by automating run monitoring. We have developed an application called Run Scanner to monitor sequencer run output directories and process run metadata (information about the run that excludes sequence data) from Illumina, PacBio, and Oxford Nanopore instruments. The run metadata is presented on a web server in both user-readable and machine-readable ways. Basic run metadata is presented in a standardized way for all modern sequencing platforms. Additional per-cycle metrics, which complement Illumina's BaseSpace tools, are added to Illumina runs for all instruments from the HiSeq 2000 to the NovaSeq 6000. In addition to being accessible to users, Run Scanner data can also be queried by a variety of software consumers. Run Scanner's data can enhance lab workflows by updating information in a LIMS, can be sent to an ETL data integration system to be queried for reports, and can provide valuable information to automated bioinformatics pipelines. Run Scanner has decreased our lab's workload, increased our reporting capabilities, and dramatically decreased our time from run completion to analysis initiation. It is open source and freely available online: https://github.com/miso-lims/runscanner
Short Abstract: Ada is a performant and highly configurable system for the secure integration, visualization, and collaborative analysis of heterogeneous data sets, primarily targeting clinical and experimental sources. Ada's main features include a convenient web UI for interactive data set exploration and filtering, and configurable views with widgets presenting various statistical results, such as distributions, scatter plots, correlations, independence tests, and box plots. The platform offers several types of data set imports and transformations, as well as an industry-level machine learning module powered by the scalable Spark ML library, which provides many classification, regression, clustering, and time-series processing routines at one's fingertips. Furthermore, Ada facilitates robust access control through LDAP authentication and in-house user management with fine-grained permissions. The main instance of Ada has served as a key infrastructural backbone of the NCER-PD project (https://ada.parkinson.lu), which focuses on improving the diagnosis and stratification of Parkinson's disease by combining detailed clinical and molecular data of patients to develop novel disease biomarker signatures, mainly within Luxembourg. Ada is an open-source project with a website available at https://ada-discovery.org.
Short Abstract: The lack of readily accessible large scale public genomic data sets currently limits the reproducibility of published biomedical research to a subset of authorized users. Tool developers, educators, journal editors and researchers alike are affected by the lack of open access genomic datasets appropriate for reproducing biologically meaningful analysis at scale. We will present a prototype pipeline that promotes reproducible analysis by making it easy to generate publicly shareable custom synthetic datasets. The prototype workflow links existing tools into a consolidated community resource for generating synthetic data cheaply and efficiently. We will demonstrate how to use this workflow on Broad Institute's open access Terra platform, to reproduce someone else's analysis and make your own work reproducible. The workflow, as written, is portable to any cloud platform that runs the Cromwell Engine, an open source scientific Workflow Management System.
Short Abstract: Machine learning models trained on large-scale genomics datasets hold the promise to be major drivers for genome science. However, a lack of standards and limited centralized access to trained models have hampered their practical impact. To address this, we present Kipoi, an initiative to define standards and to foster the reuse of trained models in genomics. The Kipoi repository currently hosts over 2,000 trained models from 21 model groups that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard enables automated software installation and provides unified interfaces to apply models and interpret their outputs. Use cases include model benchmarking, variant effect prediction, transfer learning, and building new models from existing ones. By providing a unified framework to archive, share, access, use, and extend models developed by the community, Kipoi will foster the dissemination and use of ML models in genomics.
Short Abstract: Modern biomedical research analyzes exponentially growing datasets and involves cross collaborations that commonly span multiple institutions with highly heterogeneous computing environments. The scale and complexity of these efforts has driven a rethink of bioinformatics infrastructure to leverage big data and cloud technologies that increase the mobility, interoperability, and reproducibility of research. To address these goals, we expanded integrations and features within Dockstore, our open source platform for sharing Docker-based resources that allow bioinformaticians to bring together tools and workflows into a centralized location. By packaging software into portable containers and utilizing popular descriptor languages such as the Common Workflow Language (CWL), Workflow Description Language (WDL), and Nextflow, Dockstore standardizes computational analysis, making workflows reproducible and runnable in any environment that supports Docker. Dockstore now supports workflow hosting directly on Dockstore.org along with external technologies like GitHub, Bitbucket, Quay.io, and Docker Hub. Our launch-with integration allows deployment to a growing variety of cloud platforms including FireCloud, DNAnexus, and DNAstack. Usability improvements include a better display of versioning, validation with checker workflows, and community-provided DAG visualizations. Furthermore, new collaboration features allow for permissions-based sharing and enable groups to create organization pages to describe and highlight collections of workflows and tools.
Short Abstract: Disq is a library for manipulating bioinformatics sequencing formats in Apache Spark. Disq grew out of, and was heavily inspired by, Hadoop-BAM and Spark-BAM. The Disq project originated from a GitHub issue thread, developed through discussion at the BOSC CollaborationFest, and kicked off as a new project with collaborators from the ADAM, Hadoop-BAM, htsjdk, GATK, Spark-BAM, and ViraPipe projects. In this talk we will discuss some of the challenges of reading and writing formats like BAM, CRAM, SAM, and VCF in parallel. We'll also look at how Disq has been incorporated into common genomics pipelines in ADAM and GATK for improved speed and accuracy.
Short Abstract: The Carpentries builds global capacity for conducting efficient, open, and reproducible research. We train and foster an inclusive, diverse community of learners and instructors that promotes the importance of software and data in research. We collaboratively develop open lessons that we deliver using evidence-based teaching practices. Data Carpentry's two-day hands-on workshops teach data skills through domain-specific lessons centered around a dataset. Our Genomics lessons focus on core skills, from data and project organization to analysis and visualization. These workshops are well received, with a median recommendation of 96%, and surveys show that learners report significant confidence gains in using these approaches in their work. In this curriculum, we explore:
- How to structure and organize data, metadata, and analysis files
- How to use shell commands to automate tasks
- How to use command-line tools to analyze genomic data
- How to work with cloud computing resources
- How to use R for data analysis and visualization
We present an overview of this curriculum, the community model of lesson maintenance, and the impacts measured on learners' skills and confidence. Awareness of and engagement with these teaching materials scale our impact and prepare more people to work effectively and reproducibly with genomic data.
Short Abstract: Epiviz is an interactive and integrative web application for visual analysis and exploration of functional genomic datasets. We currently support two ways of providing data to Epiviz: 1) the Epivizr R/Bioconductor package, which lets users interactively visualize and explore genomic data loaded in R, and 2) a MySQL database that stores each genomic dataset as a table. Genomic data repositories like ENCODE and Roadmap Epigenomics provide public access to large amounts of genomic data as files. Researchers often download a subset of data from these repositories and perform their analysis; as these repositories grow, this becomes a bottleneck, because larger datasets take longer to download, pre-process, and load into a database before queries can run efficiently. Based on the concepts of the NoDB paradigm, we developed the Epiviz file server, a data query system over indexed genomic files. Using the library, users can visually explore and transform data from publicly hosted files. We support various indexed genomic file formats, including BigBed, BigWig, HDF5, and any format that can be indexed using tabix. Once the data files are defined, users can also define transformations on these data files using numpy functions.
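To illustrate the kind of numpy-based transformation the abstract describes, here is a minimal sketch that combines two coverage tracks over the same genomic bins. The function name and the log2-ratio transformation are illustrative assumptions, not part of the Epiviz file server's API.

```python
import numpy as np

def log2_ratio(treatment, control, pseudocount=1.0):
    """Hypothetical track transformation: log2 ratio of two coverage
    tracks binned over the same genomic intervals. A pseudocount
    avoids division by zero in empty bins."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    return np.log2((t + pseudocount) / (c + pseudocount))

# Toy coverage values for four genomic bins
ratio = log2_ratio([4, 8, 2, 0], [2, 2, 2, 2])
```

In the file-server setting, `treatment` and `control` would be arrays sliced out of indexed files (e.g. BigWig) for the queried region rather than in-memory lists.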
Short Abstract: A Research Object (RO) provides a machine-readable mechanism to communicate the diverse set of digital and real-world resources that contribute to an item of research. The aim of an RO is to evolve beyond the traditional academic publication as a static PDF and instead provide a complete and structured archive of the items (such as people, organisations, funding, equipment, and software) that contributed to the research outcome, including their identifiers, provenance, relations, and annotations. This is of particular importance as all domains of research and science increasingly rely on computational analysis, yet we face a reproducibility crisis because key components are often not sufficiently tracked, archived, or reported. Here we propose Research Object Crate (RO-Crate for short), an emerging lightweight approach to packaging research data with structured metadata. It rephrases the Research Object model as schema.org annotations to formalize a JSON-LD format that can be used independently of infrastructure, e.g. in GitHub or Zenodo archives. RO-Crate can be extended with domain-specific descriptions, aiming at a wide variety of applications and repositories, to encourage FAIR sharing of reproducible datasets and analytical methods.
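The JSON-LD shape the abstract describes can be sketched in a few lines: a metadata file entity pointing at a root Dataset, which lists its data entities. This follows the general pattern of the early RO-Crate drafts; the exact context URL, file name, and properties evolved with the specification, and all names and values below are illustrative.

```python
import json

# Minimal RO-Crate-style metadata sketch (illustrative, not normative):
# a CreativeWork describing the crate, a root Dataset, and one File.
crate = {
    "@context": "https://w3id.org/ro/crate/1.0/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.jsonld",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example analysis",
            "hasPart": [{"@id": "results/table.csv"}],
        },
        {
            "@id": "results/table.csv",
            "@type": "File",
            "name": "Summary table",
        },
    ],
}

metadata = json.dumps(crate, indent=2)
```

Because the annotations are plain schema.org JSON-LD, the resulting file can sit alongside the data in a GitHub repository or Zenodo deposit with no supporting infrastructure.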
Short Abstract: The use of ontologies in the biomedical field has been increasing exponentially over the last two decades. For historical and technical reasons, however, two languages have coexisted: OWL, a W3C specification that is part of the Semantic Web technologies, and OBO, a provisional standard put together by the OBO consortium from already existing ontologies. The lack of a specification for the OBO language is a source of conversion issues and hinders the integration of ontologies across the two languages: OBO ontologies are harder to reason with, while OWL ontologies are more complicated to use for simple analyses. We developed the fastobo library as an implementation of OBO format version 1.4, which restricts the OBO syntax to improve compatibility with OWL. Using a strict parser allowed us to find issues in existing ontologies such as the Environment Ontology and the Plant Ontology, and helped amend the currently unreleased specification of the format. We also developed Python bindings with the goal of enabling developers and data analysts to use OBO ontologies in their work more easily. This tool provides a solid base for the development of a semantically correct translator between OBO and OWL documents.
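For readers unfamiliar with the OBO flat-file format, the stanza structure that fastobo parses can be illustrated with a deliberately naive toy parser. This is not fastobo's API or its grammar; fastobo implements the full, much stricter OBO 1.4 syntax, and the ENVO term below is just an example entry.

```python
def parse_obo_terms(text):
    """Toy parser collecting tag-value pairs from [Term] stanzas.
    Illustrates the flat-file shape only; real OBO parsing (escapes,
    qualifiers, trailing comments, typed values) is far stricter."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line.startswith("["):
            current = None  # other stanza types ([Typedef], ...) ignored
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current.setdefault(key.strip(), []).append(value.strip())
    return terms

doc = """\
format-version: 1.4

[Term]
id: ENVO:00000001
name: bedding-plane cavity

[Typedef]
id: part_of
"""
terms = parse_obo_terms(doc)
```

The ambiguities a loose parser like this silently accepts (stray tags, malformed identifiers) are exactly the kind of issues a strict OBO 1.4 implementation surfaces in existing ontologies.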
Short Abstract: For early-career researchers (ECRs), getting funding for their research ideas is becoming more and more competitive, and there is growing pressure in all disciplines to obtain grants. Although there is a plethora of funding opportunities for postdoctoral scientists and other ECRs, there has been no central platform to systematically search for such opportunities or to get professional feedback on a proposal. With a group of eLife Ambassadors, we developed ECRcentral (ecrcentral.org), a funding database and open forum for the ECR community. The platform is open to everyone and currently contains 700 funding schemes across a wide range of scientific disciplines, 100 travel grants, and a diverse range of useful resources. In the first two months after its release, approximately 500 ECRs joined this community. The platform is developed using open-source technology, with all the source code and related content made openly available through our GitHub repository (github.com/ecrcentral). ECRcentral aims to bring ECRs and funding resources together, to facilitate discussions about those opportunities, share experiences, and create impact through community engagement. We strongly believe that this resource will be highly valuable for ECRs and the scientific community at large.
Short Abstract: The potential of Genomic Medicine to improve the quality of healthcare at both the population and individual level is well established; however, adoption of available genetic and genomics evidence into clinical practice is limited. Widespread uptake largely depends on task-shifting Genomic Medicine to key healthcare professionals such as nurses, which could be promoted through professional development courses. Globally, trainers and training initiatives in Genomic Medicine are limited, and in resource-limited settings such as Africa, logistical and institutional challenges threaten to thwart large-scale training programmes. The African Genomic Medicine Training (AGMT) Initiative was created in response to these needs. It aims to establish sustainable Genomic Medicine training initiatives for healthcare professionals and the public in Africa. This work describes the AGMT and reports on a recently piloted strategy to design and implement an accredited, competency- and community-based distance learning course for nurses across 11 African countries. This model takes advantage of existing consortia to create a pool of trainers and adapts evidence-based approaches to guide curriculum and content development. Existing curricula were reviewed and adapted to suit the African context. Accreditation was obtained from university and health professional bodies. A toolkit is proposed to help guide adoption of the AGMT distance-learning model.
Short Abstract: Single-cell sequencing generates a new kind of genomic data, promising to revolutionize our understanding of the fundamental units of life. The Human Cell Atlas is a multi-year, multi-institution effort to develop and standardize methods for generating and processing these data, which pose interesting storage and compute challenges. I'll talk about recent work parallelizing analysis of single-cell data using a variety of distributed backends (Apache Spark, Dask, Pywren, Apache Beam). I'll also discuss the Zarr format for storing and working with N-dimensional arrays, which several scientific domains have recently gravitated toward in response to challenges using HDF5 in parallel and in the cloud.
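The core idea behind Zarr's parallel-friendliness is that an N-dimensional array is stored as many small, independently readable chunks. The sketch below illustrates only that chunking idea with plain numpy; real Zarr additionally handles compression, a small JSON metadata file, and storage backends such as cloud object stores, and the matrix here is a toy stand-in for a cells-by-genes count matrix.

```python
import numpy as np

def to_chunks(array, chunk):
    """Split a 2-D array into the independently storable pieces a
    Zarr-like chunked layout would write. Each chunk can then be
    read or written in parallel by a different worker."""
    chunks = {}
    rows, cols = array.shape
    for i in range(0, rows, chunk[0]):
        for j in range(0, cols, chunk[1]):
            key = (i // chunk[0], j // chunk[1])
            chunks[key] = array[i:i + chunk[0], j:j + chunk[1]]
    return chunks

# Toy 4-cells x 6-genes matrix, stored as 2x3-element chunks
cells_by_genes = np.arange(24).reshape(4, 6)
chunks = to_chunks(cells_by_genes, (2, 3))
```

Because chunks never overlap, distributed backends like Spark, Dask, or Beam can assign them to workers without coordination, which is the property that makes this layout easier to use in parallel than a single monolithic HDF5 file.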
Short Abstract: Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences, incorporating the remaining ones following the order imposed by a guide-tree. We developed and validated on protein sequences a regressive algorithm that works the other way around, aligning first the most dissimilar sequences. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. This computation is also more efficient, as it uses a divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. As a consequence, the regressive algorithm puts an end to a recurrent dilemma between the use of slow/accurate or fast/approximate methods. It will enable the full exploitation of extremely large genomic datasets.
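The regressive decomposition can be sketched with a toy recursion over a guide tree: align one representative per subtree first (the most dissimilar sequences), then recurse into the subtrees, stitching each sub-alignment to its parent through the shared representative. This is only an illustration of the decomposition order under simplifying assumptions (the representative is just the first leaf of a subtree); the published method chooses representatives and delegates each sub-problem to a third-party aligner.

```python
def regressive_decomposition(tree, max_size=2):
    """Toy sketch of a regressive split. `tree` is a nested-tuple guide
    tree whose leaves are sequence names. Returns sub-alignment jobs:
    the first contains the most dissimilar sequences (top-level
    representatives); each later job shares a representative with an
    earlier one, which is how the pieces are stitched together."""
    def leaves(t):
        return [t] if isinstance(t, str) else [x for c in t for x in leaves(c)]
    def representative(t):
        return t if isinstance(t, str) else representative(t[0])
    jobs, queue = [], [tree]
    while queue:
        node = queue.pop(0)
        if len(leaves(node)) <= max_size:
            jobs.append(leaves(node))       # small enough: align directly
        else:
            jobs.append([representative(c) for c in node])
            queue.extend(list(node))        # recurse into each subtree
    return jobs

tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
jobs = regressive_decomposition(tree)
```

Since each job stays at most `max_size` sequences, the per-job cost of the underlying aligner is bounded, which is what lets the overall scheme run in linear time regardless of the third-party method's complexity.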
Short Abstract: MIToS is a Julia package for analyzing protein sequences and structures, with a main focus on coevolutionary analysis. However, its utilities go beyond the calculation of covariation scores in multiple sequence alignments. MIToS is a flexible suite that has been used to measure residue conservation, to handle protein structures in homology modelling and molecular dynamics pipelines, to perform structural alignment of tertiary and quaternary structures, and more. MIToS gives users access to the power of Julia, a high-level programming language for scientific computing with close-to-C performance. It is open source and freely available on GitHub under the MIT license.
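The covariation scores mentioned above are typically built on mutual information between alignment columns. The following is a toy Python illustration of that raw calculation (MIToS itself is a Julia package and applies further refinements, such as sequence weighting and average-product correction, on top of raw MI).

```python
from collections import Counter
from math import log2

def column_mi(msa, i, j):
    """Raw mutual information (in bits) between columns i and j of an
    MSA given as a list of equal-length sequences. Illustrative only:
    coevolution packages add clustering/weighting and corrections."""
    n = len(msa)
    pi = Counter(s[i] for s in msa)
    pj = Counter(s[j] for s in msa)
    pij = Counter((s[i], s[j]) for s in msa)
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

msa = ["AR", "AR", "GK", "GK"]  # the two columns covary perfectly
mi = column_mi(msa, 0, 1)       # 1 bit for two equiprobable paired states
```

Perfectly covarying columns with two equiprobable states give 1 bit, while independent columns give 0, which is the signal coevolutionary analyses look for.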
Short Abstract: Dense genetic data are now common due to the accessibility of next-generation sequencing technologies. Genetic analysis software has been developed mostly for population-based studies; however, recognition of the importance of rare variants has made pedigree-based studies common again. We previously developed a pedigree-based analysis pipeline (PBAP) v.1, which allows users to perform several procedures for pedigree-based genetic analysis, including file manipulation, selection of a subset of markers from a dense panel, pedigree structure validation, and sampling of inheritance vectors (IVs), i.e., the flow of founder alleles in a pedigree. Here, we describe a second version of PBAP with new features that include setting up files to use IVs for downstream analyses. PBAP v.2 accesses programs to implement the following analyses: a) parametric linkage analysis allowing modification of marker allele frequencies for admixed populations, b) variance components linkage analysis, c) family-based genotype imputation, and d) genotype-based kinship estimation with the option to use external allele frequencies. PBAP v.2 users may also calculate pairwise kinship coefficients based on the sampled IVs and visualize spacing of the sub-selected panel of markers. All these features extend the capabilities of PBAP v.2 and give users more options to maximize the use of their data for family-based analyses.
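For context on the kinship coefficients mentioned above, here is the textbook pedigree recursion for the expected kinship phi(a, b). This is a standard illustration, not PBAP's own implementation, which estimates realized kinship from sampled inheritance vectors and genotypes.

```python
def kinship(a, b, parents):
    """Classic pedigree kinship coefficient phi(a, b).
    `parents` maps an individual to (father, mother); founders are
    absent from the map and assumed unrelated and non-inbred.
    Illustrative only; PBAP v.2 works from sampled IVs instead."""
    def depth(i):
        if i not in parents:
            return 0
        f, m = parents[i]
        return 1 + max(depth(f), depth(m))
    def phi(a, b):
        if a == b:
            if a not in parents:
                return 0.5
            f, m = parents[a]
            return 0.5 * (1 + phi(f, m))  # self-kinship, inbreeding-aware
        if depth(a) < depth(b):
            a, b = b, a                   # recurse through the later generation
        if a not in parents:
            return 0.0                    # two distinct founders
        f, m = parents[a]
        return 0.5 * (phi(f, b) + phi(m, b))
    return phi(a, b)

# Nuclear family: founders F and M with children C1 and C2
ped = {"C1": ("F", "M"), "C2": ("F", "M")}
```

On this pedigree the recursion gives the familiar values: 1/2 for self-kinship, and 1/4 for both parent-offspring and full-sibling pairs.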
Short Abstract: Motivation: RNA sequencing (RNA-Seq) is becoming the gold standard for analysing gene expression in biological samples. With ever-increasing sequencing speed, the amount of data produced is growing exponentially, which imposes new challenges on bioinformatics data analysis. Many analysis pipelines have been developed, enabling standardisation and automation of RNA-Seq data analysis. However, most pipelines rely on a command-line interface, which can be difficult for end-users to learn and use, and different methodologies can lead to wide variation in differential analysis results. Result: We present RAWG, a complete RNA-Seq data analysis framework with an emphasis on ease of use. Our framework uses the contemporary workflow description standard, the Common Workflow Language (CWL), as the foundation of its analysis pipelines. A website, based on the Django framework, is designed to let users upload data and submit analyses with a few clicks. A variety of tools are encompassed in RAWG so that researchers have the freedom to pick which analysis tool to use. RAWG can run multiple pipelines in one session and compare the results from different pipelines. RAWG is a flexible data analysis platform that is not only easy to use but also improves data reproducibility and provenance in biological science.
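For readers unfamiliar with CWL, a pipeline built on it is composed of small tool descriptors like the one below. This is a generic, minimal CWL v1.0 CommandLineTool of the kind such a pipeline step might wrap; the wrapped tool and file names are illustrative, not RAWG's own descriptors.

```yaml
# Minimal CWL CommandLineTool wrapping `samtools flagstat`
# (illustrative example of a pipeline building block).
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [samtools, flagstat]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
outputs:
  stats:
    type: stdout
stdout: flagstat.txt
```

Because each step is described declaratively like this, the same pipeline definition can be executed by any CWL-conformant engine, which is what makes a web front-end over interchangeable pipelines feasible.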
Short Abstract: Advancement in global health over the last half-century has been significant and can mostly be attributed to the discovery of new drugs by the pharmaceutical industry. However, the patent paradigm that rewards investment in research and development, and drives this effort, is not sufficient for the discovery of new antimicrobials. Given the growing problem of antimicrobial resistance, alternative models of drug discovery are needed. OSDD (Open Source Drug Discovery) is an alternative innovation model in which a distributed community of researchers works toward common goals, sharing data and resources. An example is the Connect to Decode project, which utilized the potential of crowdsourcing to generate the most comprehensive systems-level models of Mycobacterium tuberculosis (Mtb) (PMID: 22808064), a repository of anti-TB compounds (PMID: 29785561), and more. Published estimates suggest that this innovative approach packed nearly 300 person-years of work into 4 months. More recently, the community also generated a repository of potential anti-Nipah compounds, demonstrating the power of collective open efforts during outbreaks (http://bioinfo.imtech.res.in/anshu/nipah). I would like to apply similar principles of open data sharing to tools and methods developed in the field. Attending BOSC will provide the right platform to build on these interactions and move forward in this direction.
Short Abstract: MISO is a laboratory information management system designed for eukaryotic sequencing operations. It supports genomic, exomic, transcriptomic, methyl-omic, and ChIP-seq protocols; long reads and short reads; and microarrays. MISO’s goals are to allow laboratory technicians to record their work accurately with a minimum of data-entry overhead, and to ensure the associated metadata is valid and structured enough to use for automation and other downstream applications. MISO incorporates a wide feature set useful for both large and small facilities to track their lab workflows in great detail. Since it was last presented at BOSC 2016, MISO has matured and stabilized to support production use in a large sequencing facility. MISO now supports new instruments such as the Illumina NovaSeq, 10X Chromium, and Oxford Nanopore PromethION; has added more extensive location tracking; has improved its UI to simplify data entry; has improved overall performance; and has gained extensive documentation in the form of a new user manual and walkthroughs. Recently we have improved installation, administration, and maintenance through Docker containers and compose files. We have also developed other applications that interact with MISO to facilitate laboratory functions like billing, reporting, and analysis. After 8 years of development, we are preparing a 1.0 release for late 2019.
Short Abstract: The Findable, Accessible, Interoperable and Reusable (FAIR) principles provide a set of minimum elements required for effective management of digital resources. Although the FAIR principles are meant to apply to any digital resource, datasets were mostly in mind when the principles were initially published in 2016. Some of the main aspects, particularly within the Findability and Accessibility scope, are indeed directly applicable across digital resources, for instance persistent and global identification, licensing, and longevity-commitment policies. However, when used to model research software, i.e., scripts, packages, and applications, principles related to Interoperability and Reusability require further discussion, understanding, and agreement. Here we present ten recommendations for making research software FAIRer. We take into account elements that will make it easier for others to (re)use the software, such as functionality, input, output, citation, and documentation. We also touch on software development best practices, as they cover aspects such as versioning and dependency management. With this effort we at ELIXIR Europe aim to contribute to and promote the discussion around FAIRness for research software.
Short Abstract: Sharing personal genome data is critical to advancing medical research. However, sharing data that includes personally identifiable information requires ethical reviews, which usually take time, and often comes with limits on the computational resources researchers can use. To allow researchers to analyze such controlled-access data efficiently, the DNA Data Bank of Japan (DDBJ) developed a new workflow execution system called SAPPORO (Figure 1). We designed the system to allow users to execute workflows on controlled-access data without touching the data directly. Users select a workflow in SAPPORO's web interface to run it on a node for personal genome data analysis in DDBJ's high-performance computing (HPC) platform. The system supports the Common Workflow Language (CWL) as its primary format for describing workflows; it can therefore import workflows developed by different institutes as long as they are described in CWL. We implemented the workflow run service component following the Workflow Execution Service (WES) API standard developed by the Cloud working group of the Global Alliance for Genomics and Health (GA4GH). This highly flexible and portable system can be an essential module for data and workflow sharing in biomedical research.
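To make the WES standard concrete, here is a sketch of the fields a GA4GH WES run submission (`POST /runs`) carries. The workflow URL, input parameters, and tags below are illustrative placeholders, and SAPPORO's web interface constructs such requests on the user's behalf rather than exposing them directly.

```python
import json

# Sketch of a GA4GH WES run-submission payload. Per the WES spec,
# `workflow_params` and `tags` are themselves JSON-encoded strings.
run_request = {
    "workflow_type": "CWL",
    "workflow_type_version": "v1.0",
    "workflow_url": "https://example.org/workflows/qc.cwl",  # placeholder
    "workflow_params": json.dumps({
        "fastq": {"class": "File", "path": "sample_R1.fastq.gz"}
    }),
    "tags": json.dumps({"project": "controlled-access-demo"}),
}
```

Because any WES-conformant server accepts this shape, a workflow published by one institute can be submitted unchanged to another institute's execution service.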
Short Abstract: Motivation: With the increasing challenges of understanding very large and complex genomic data, there is urgent demand for computational and bioinformatics tools that efficiently translate the data into clinically important insights. Methods: R/Bioconductor provides robust, scalable software infrastructure and interoperable statistical methods to help tackle these challenges. Previously, DelayedArray was developed to represent very big genomic datasets (e.g., count data from scRNA-seq); it allows users to perform common array operations without loading the data into memory. We and others have extended DelayedArray to different backends for scalable computation, such as the Hierarchical Data Format (HDF) and the Genomic Data Structure (GDS). DelayedDataFrame was developed for lazy representation of sample metadata (e.g., the clinical characteristics of samples). VariantExperiment is a lightweight container of lazy data structures representing both assay and annotation data for a complete experiment. Results: These data structures provide rich semantics for data operations with familiar paradigms such as “matrix” and “data.frame”, while keeping memory use minimal and requiring bioinformaticians to learn few new techniques. Conclusion: These data structures considerably improve the acquisition, management, analysis, and dissemination of big genomic datasets, and benefit the broad community of bioinformatics software developers and domain-specific cancer researchers.
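The deferred-evaluation idea behind DelayedArray can be illustrated with a minimal Python analogy: operations are recorded rather than computed, and nothing touches the data until an explicit realization step. This is a conceptual sketch only; Bioconductor's DelayedArray is an R class that additionally delegates storage to on-disk backends such as HDF5 or GDS.

```python
class Delayed:
    """Minimal analogy of DelayedArray-style laziness: map() queues
    an element-wise operation; realize() loads the data and applies
    the queued operations in order."""
    def __init__(self, source, ops=()):
        self.source = source      # stand-in for an on-disk dataset
        self.ops = list(ops)      # queued element-wise operations
    def map(self, fn):
        # returns a new lazy object; nothing is computed here
        return Delayed(self.source, self.ops + [fn])
    def realize(self):
        data = list(self.source)  # only now is the data "loaded"
        for fn in self.ops:
            data = [fn(x) for x in data]
        return data

counts = Delayed(range(5)).map(lambda x: x + 1).map(lambda x: x * 10)
result = counts.realize()  # materializes [10, 20, 30, 40, 50]
```

Keeping the operation queue separate from the data is what lets the same user-facing "matrix" semantics work over datasets far larger than memory.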
Short Abstract: Scientific progress hinges on our ability to share tools and build on each other's work. Increased adoption of best practices like version control and containers has moved us forward in recent years, but there are still gaps between handing over code and another person being able to run it correctly. We are challenging researchers to create open workspaces showcasing their work in a fully reproducible form on the Terra platform (https://www.terra.bio). Terra is an open source environment that integrates code, data, and execution on top of Google Cloud, with free credits to cover compute and storage costs. Workspaces will be evaluated on how they maximize reproducibility (“can I run it?”), customizability (“can I tweak it?”) and documentation (“do I understand what's going on?”), with coolness of the science as a tiebreaker. The Grand Prize is an all-expenses-paid trip to Basel to attend BOSC and ISMB, along with additional free credits. The winner will present their workspace at BOSC, highlighting its features from the standpoint of reproducibility and FAIR principles. Through this contest (https://terra.bio/tosc), we aim to illustrate principles of Open Science and motivate researchers to go the extra mile to make their work reusable by the community.
Short Abstract: Advances in DNA sequencing technologies have created many opportunities for novel research that requires comparing newly obtained and previously known sequences. This is commonly done with BLAST, either as part of an automated pipeline, or by visually inspecting the alignments and associated metadata. We previously reported Sequenceserver to facilitate the latter. Our software enables a user to rapidly set up a BLAST server on custom datasets and presents a modern-looking, intuitive interface. However, interpretation of BLAST results can be further simplified using visualisations. We have integrated three existing visualisations into Sequenceserver with the aim of facilitating comparative analysis of sequences. First, we provide a Circos plot to rapidly check for conserved synteny, identify duplications and translocation events, or visualise transposon activity. Second, we provide a histogram of the lengths of all hits of a query, which quickly reveals whether the length of a predicted protein sequence matches that of its homologs. Finally, for each query-hit pair, the relative length and position of matching regions are shown. This is helpful for identifying large insertion or deletion events between two genomic sequences, can reveal putative exon shuffling, and helps confirm a priori knowledge of intron lengths.
Short Abstract: EDAM is an ontology of well established, familiar concepts that are prevalent within bioinformatics, and life science data analysis in general. The scope of EDAM includes types of data and data identifiers, data formats, operations, and topics. EDAM has a relatively simple structure, and comprises a set of concepts with terms, synonyms, definitions, relations, links, and some additional information (especially for data formats). EDAM is developed in a participatory and transparent fashion, within a broad community of contributors. The development of EDAM is coordinated with the development and curation of tools registries (especially https://bio.tools); training materials registries (especially https://tess.elixir-europe.org); with the development and packaging of open-source bioinformatics software (especially the Debian Med and Bio-Linux community, http://debian.org/devel/debian-med); the Common Workflow Language (https://www.commonwl.org); and other related initiatives. These include development of Galaxy (https://usegalaxy.org) and collaborations with specialised networks of experts, such as within the work on EDAM-bioimaging, the extension of EDAM towards bioimage informatics. EDAM can thus function as a common terminology when sharing and exchanging information about tools, workflows, or training materials. A new version 1.22 has been released recently.
Short Abstract: Sequencing large molecules of DNA has drastically improved the contiguity of genome sequence assemblies. Long-read sequencing, however, has reduced sequence fidelity compared to short-read sequencing and is currently more expensive. Linked-read sequencing from the 10x Genomics Chromium platform combines the benefits of large DNA molecules with the sequence fidelity and cost of short-read sequencing. Our tool, Physlr, constructs a physical map of large DNA molecules from linked reads without first assembling those reads. A barcode-overlap graph is constructed, where each edge represents two barcodes sharing minimizer k-mers. The underlying molecule-overlap graph is reconstructed from the barcode-overlap graph by identifying k-clique communities, where each community originates from a single DNA molecule. The physical map is a set of contigs, where each contig is an ordered list of barcodes. The scaffolds of an existing assembly may be ordered and oriented using the physical map. We constructed a physical map of the 1.34 Gbp zebrafish (Danio rerio) genome. A Supernova assembly was scaffolded by mapping it to this physical map, improving the NG50 from 4.8 Mbp to 9.1 Mbp. Physlr can employ multiple libraries of linked reads, necessary for genomes larger than mammalian genomes, such as conifer genomes, which can exceed 20 Gbp. https://github.com/bcgsc/physlr
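The barcode-overlap graph construction can be sketched as a set-intersection step: connect two barcodes when they share enough minimizer k-mers. In the toy below the minimizer sets are given directly and the threshold is arbitrary; Physlr extracts minimizers from the linked reads of each barcode and applies its own filtering.

```python
from itertools import combinations

def barcode_overlap_graph(minimizers, min_shared=2):
    """Sketch of the first Physlr step: an edge joins two barcodes
    sharing at least `min_shared` minimizer k-mers. `minimizers` maps
    barcode -> set of minimizers; returns {(barcode_a, barcode_b): n}."""
    edges = {}
    for a, b in combinations(sorted(minimizers), 2):
        shared = len(minimizers[a] & minimizers[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

minimizers = {
    "BX1": {"m1", "m2", "m3"},
    "BX2": {"m2", "m3", "m4"},   # overlaps BX1: likely the same molecule
    "BX3": {"m7", "m8"},         # no overlap with BX1 or BX2
}
graph = barcode_overlap_graph(minimizers)
```

Since a 10x barcode can tag reads from several distinct molecules, the subsequent k-clique community step is what splits each barcode node into its underlying molecules.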
Short Abstract: The Monarch Initiative is a consortium that seeks to bridge the space between basic and applied research, developing tools that facilitate connecting data across these fields using semantics-based analysis. The mission of the Monarch Initiative is to create methods and tools that allow exploration of the relationships between genotype, environment, and phenotype across the tree of life, deeply leveraging semantic relationships between biological concepts using ontologies. These tools include, among many others, Exomiser, which evaluates variants based on predicted pathogenicity. The goal is to enable complex queries over diverse data and reveal the unknown. With the semantic tools available at www.monarchinitiative.org, researchers, clinicians, and the general public can gather, collate, and unify disease information across human, model organisms, non-model organisms, and veterinary species into a single platform. Monarch defines phenotypic profiles, or sets of phenotypic terms, which are associated with a disease or genotype and recorded using a suite of phenotype vocabularies (such as the Human Phenotype Ontology and the Mondo Ontology). Our niche is computational reasoning to enable phenotype comparison both within and across species. Such explorations aim to improve mechanistic discovery and disease diagnosis. We deeply integrate biological information using semantics, leveraging phenotypes to bridge the knowledge gap.
Short Abstract: The GDPR requires documentation of any processing of personal data, including data used for research, and requires being prepared to provide information to data subjects. For institutions this means performing a data mapping exercise and keeping meticulous track of all data processing. While there is no formal guidance on how data mapping should be done, we are seeing the emergence of commercial "GDPR data mapping" tools and academic institutions creating registers with those tools. When it comes to mapping data in biomedical research, we observe that commercial tools may fall short, as they do not capture the complex project-based, collaborative nature of research, which leads to many different scenarios. In this poster we describe the Data Information System (DAISY), our data mapping tool, which is specifically tailored for biomedical research institutions and meets the record-keeping and accountability obligations of the GDPR. DAISY is open source and is actively used at the Luxembourg Centre for Systems Biomedicine and the ELIXIR-Luxembourg data hub.
Short Abstract: Benchmarking in the bioinformatics context aims to compare the performance of bioinformatics operations under controlled conditions. Benchmarking encompasses the technical performance of individual tools, servers, and workflows, including software quality metrics, as well as their scientific performance on predefined challenges (reference datasets and metrics) defined by scientific communities. In the context of the ELIXIR project, we have developed the OpenEBench platform, aiming at transparent performance comparisons. We will present the current implementation of OpenEBench. It covers scientific benchmarking data from a number of communities, with expert- and non-expert-oriented visualization of benchmarking results, and assessment of the quality metrics and availability of bioinformatics tools. We consider three levels of operation: level 1, already established, relates to data from existing benchmarking communities, provided via the OpenEBench API; level 2 (in beta) computes benchmarking metrics within the platform; and level 3 extends the platform to execute benchmarkable workflows (provided as software containers) under controlled conditions to ensure an unbiased technical and scientific assessment. Overall, OpenEBench provides an integrated platform to orchestrate benchmarking activities, from the deposition of reference data, to executing software tools, to providing results using metrics defined by scientific communities.