

Accepted Posters

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Track: Bioinformatics Open Source Conference (BOSC)

Session A-107: GeoDiver: Differential Gene Expression Analysis & Gene-Set Analysis for GEO Datasets
  • Ismail Moghul, UCL Cancer Institute, United Kingdom
  • Suresh Hewapathirana, Queen Mary University of London, United Kingdom
  • Nazrath Nawaz, Queen Mary University of London, United Kingdom
  • Anisatu Rashid, Queen Mary University of London, United Kingdom
  • Marian Priebe, Queen Mary University of London, United Kingdom
  • Bruno Vieira, Queen Mary University of London, United Kingdom
  • Fabrizio Smeraldi, Queen Mary University of London, United Kingdom
  • Conrad Bessant, Queen Mary University of London, United Kingdom

Short Abstract: The growing genomic data repository GEO (Gene Expression Omnibus) has become a popular public database for microarray data from gene expression experiments. GEO contains both curated and non-curated gene expression profile datasets. GeoDiver is an online web application for performing Differential Gene Expression Analysis (DGEA) and Gene Set Analysis (GSA, using the Generally Applicable Gene-set Enrichment method, GAGE) on GEO datasets. GeoDiver allows researchers to take full advantage of the growing GEO resource and perform powerful analyses without downloading or installing additional software, learning command-line skills or having prior programming knowledge. Users can easily run both DGEA and GSA within a few minutes of accessing the web application. For users who are familiar with gene-expression data, the option to adapt the advanced parameters of the analyses mirrors the flexibility one might expect from writing a custom analysis script. The output includes numerous high-quality interactive graphics, allowing users to explore and examine complex datasets instantly and understand the data they have analysed. Furthermore, the results can be reviewed at a later date and shared with collaborators. Other graphics, such as the heatmaps, are high-resolution, information-rich and can be easily exported for use in publications. GeoDiver is therefore designed not only to facilitate the analysis of gene-expression data but also to ensure that users can fully explore the results of their analysis. This is important, as the ability to use such powerful analytical tools has to be paired with the corresponding ability to interpret the output. Availability: GeoDiver is freely available online at http://www.geodiver.co.uk. The source code is available on GitHub (https://github.com/GeoDiver/GeoDiver) and a Docker image is available for easy installation.

Session A-108: An open source framework for reliable and reproducible analysis pipelines
  • Alberto Riva, Bioinformatics Core, ICBR, University of Florida, United States
  • Richard L Bennett, UF Health Cancer Center, University of Florida, United States
  • Jonathan D Licht, UF Health Cancer Center, University of Florida, United States

Short Abstract: Actor is an open-source, object-oriented Python framework for the development of computational pipelines, specifically designed for NGS data analysis in a cluster computing environment. Pipelines are built out of reusable objects implementing analysis steps (e.g. short-read alignment, transcript quantification) and are controlled by simple configuration files detailing the experimental setup, the input data, and the steps to be performed. Pipeline execution is performed by a Director object that, given an Actor object representing the pipeline, performs the necessary setup and executes all required steps. Each step is represented by a Line object, which provides standard methods for Setup, Verification, Pre-Execution, Execution, Post-Execution, and Reporting. Thanks to this standard API, steps can be freely combined: for example, changing the short-read aligner from Bowtie to STAR only requires swapping one Line object for another in the pipeline definition. Actor is able to handle any number of experimental conditions, biological replicates, and technical replicates, easily supporting complex experimental designs with no changes to the pipeline structure. Actor automatically handles submission and management of jobs to the cluster, ensuring proper job sequencing and coordination. Finally, Actor pipelines automatically generate an HTML report of their execution: within a standard template (customizable by specializing the Actor object), each step may add one or more sections containing text, tables, figures, and links to downloadable files. We describe the Actor framework, outline pipelines developed with it (including RNA-seq, ChIP-seq, ATAC-seq, methylation analysis, variant discovery, genome annotation) and discuss its advantages for analysis reliability and reproducibility.
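The Director/Line pattern described in the abstract can be sketched in miniature. This is a hypothetical Python sketch: class and method names mirror the abstract's description (Director, Line, standard step methods), not Actor's actual source.

```python
# Hypothetical sketch of the pipeline pattern described in the abstract.
# Names mirror the description (Director, Line, standard step methods),
# not the real Actor codebase.

class Line:
    """One pipeline step exposing the standard method slots."""
    name = "line"

    def setup(self, config): pass
    def verify(self, config): return True
    def execute(self, config): raise NotImplementedError
    def report(self, config): return f"<h2>{self.name}</h2>"

class BowtieAlign(Line):
    name = "bowtie-align"
    def execute(self, config):
        config["bam"] = config["reads"] + ".bowtie.bam"

class StarAlign(Line):
    name = "star-align"
    def execute(self, config):
        config["bam"] = config["reads"] + ".star.bam"

class Director:
    """Runs each Line's standard methods in order, collecting the report."""
    def __init__(self, lines):
        self.lines = lines

    def run(self, config):
        sections = []
        for line in self.lines:
            line.setup(config)
            if line.verify(config):
                line.execute(config)
                sections.append(line.report(config))
        return "\n".join(sections)

# Swapping aligners amounts to swapping one Line object, as the abstract notes:
pipeline = [StarAlign()]
config = {"reads": "sample1.fastq"}
report = Director(pipeline).run(config)
```

Because every step implements the same slots, the Director never needs to know which concrete tool a Line wraps.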

Session A-109: Workflow for processing standard bioinformatics formats with sciClone to infer tumor heterogeneity
  • Uros Sipetic, Seven Bridges Genomics, Serbia

Short Abstract: Tumor heterogeneity refers to tumors exhibiting distinct cellular morphological and phenotypic properties, which can affect disease progression and response. Next-generation sequencing results from bulk whole-exome or whole-genome samples can be used to infer tumor structure (heterogeneity) with a few different methods. One of these methods, sciClone, was developed for detecting tumor subclones by clustering Variant Allele Frequencies (VAFs) of somatic mutations in copy-number-neutral regions. We present a portable and reproducible Common Workflow Language (CWL) implementation of a sciClone-based tumor heterogeneity workflow. The results of sciClone can be fed directly into additional software such as ClonEvol and Fishplot, which produce further visualizations, e.g. phylogenetic trees showing clonal evolution. The main idea of the workflow was to build a pipeline that can process standard bioinformatics file types (VCF, BAM) and produce a comprehensive set of outputs describing tumor heterogeneity. Additionally, we designed the workflow to be robust regardless of the number of samples provided.
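As a toy illustration of the core idea, clustering variant allele frequencies to separate subclones, here is a stdlib-only 1-D k-means sketch. sciClone itself fits a Bayesian mixture model in R; nothing below reflects its actual implementation, and the VAF values are invented.

```python
# Toy 1-D k-means over variant allele frequencies (VAFs).
# Illustrative only: sciClone uses a Bayesian mixture model, not k-means.

def kmeans_1d(values, k, iters=50):
    # Spread initial centroids across the observed range.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented VAFs: a clonal cluster near 0.5 and a subclone near 0.2
# (copy-number-neutral sites assumed, as sciClone requires).
vafs = [0.48, 0.51, 0.50, 0.47, 0.52, 0.19, 0.22, 0.21, 0.18, 0.20]
centroids, clusters = kmeans_1d(vafs, k=2)
```

The recovered cluster centres correspond to the cellular fractions of the inferred subclones, which is what downstream tools like ClonEvol consume.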

Session A-110: Microsatellite instability profiling of TCGA colorectal adenocarcinomas using a Common Workflow Language pipeline
  • Nikola Tesic, Seven Bridges Genomics, Serbia
  • Marko Kalinic, Seven Bridges Genomics, Serbia

Short Abstract: Microsatellite instability (MSI) is an important factor in classifying certain cancer types: it has been shown to be associated with prognosis and can be used as a biomarker for treatment selection. Here, we present a portable and reproducible Common Workflow Language (CWL) implementation of the MSIsensor tool and the lobSTR toolkit, show how to use them, and describe how lobSTR detects and suppresses PCR noise. Finally, we present results obtained using these tools on more than 600 TCGA colorectal adenocarcinoma cases, along with our further analysis and interpretation of the results. Of the 592 cases with an MSI classification on the GDC portal, MSIsensor predicted the MSI status correctly for 580 (F-score 0.93) using tumor and normal samples, while lobSTR enabled us to correctly predict the MSI status in 584 cases (F-score 0.96) using tumor and normal samples, and in 581 cases (F-score 0.94) using tumor only.

Session A-111: GeneValidator: identify problems with protein-coding gene predictions
  • Monica-Andreea Drăgan, Department of Computer Science, ETH Zürich, Switzerland
  • Ismail Moghul, School of Biological and Chemical Sciences, Queen Mary University of London, United Kingdom
  • Anurag Priyam, School of Biological and Chemical Sciences, Queen Mary University of London, United Kingdom
  • Claudio Bustos, Departamento de Psiquiatría y Salud Mental, University of Concepción, Chile
  • Yannick Wurm, School of Biological and Chemical Sciences, Queen Mary University of London, United Kingdom

Short Abstract: Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging: even the best gene prediction algorithms make substantial errors that can jeopardize subsequent analyses. Many predicted genes must therefore be visually inspected and manually curated, a time-consuming process. We developed GeneValidator to automatically identify problematic gene predictions and to aid manual curation. For each gene, GeneValidator performs multiple analyses based on comparisons to gene sequences from large databases. The resulting report identifies problematic gene predictions and includes several statistics and graphs for each prediction. These can be used to eliminate problematic gene models from a set of annotations, to compare two sets of annotations, or to guide manual curation efforts. GeneValidator thus accelerates and enhances the work of researchers and biocurators who need accurate gene predictions from newly sequenced genomes. GeneValidator can be used through a web interface or on the command line. It is open source (AGPL) and available at https://wurmlab.github.io/tools/genevalidator.

Session A-112: Somatic Variant Calling Benchmarking
  • Sanja Mijalkovic, Seven Bridges Genomics, Serbia
  • Milan Domazet, Seven Bridges Genomics, Serbia

Short Abstract: The increasing number of sequenced cancer genomes since the completion of the Human Genome Project, together with the importance of correctly identifying somatic mutations, which can influence treatment or prognosis, is driving the development of novel somatic variant calling tools (somatic callers). The lack of a best-practice algorithm for identifying somatic variants, however, requires constant testing, comparison and benchmarking of these tools, and the absence of a truth set further hinders their evaluation. By comparing widely used open-source somatic callers, such as Strelka, VarDict, VarScan2, Seurat and LoFreq, on in-house generated synthetic data, we found complex dependencies of somatic caller parameters on coverage depth, allele frequency, variant type, and detection goals. Next, we normalised and filtered the output data so that it could be appropriately compared to the truth set. The benchmarking results were automatically and efficiently structured and stored. All of the tools used for the analysis have been implemented in Common Workflow Language, which makes them portable and reproducible.
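The scoring step described above, normalising calls and comparing them to a truth set, can be sketched as follows. This is a simplified, hypothetical example: real benchmarking additionally left-aligns indels, splits multi-allelic sites, and so on, and the variant tuples here are invented.

```python
# Sketch: score a somatic caller's output against a truth set.
# Variants are normalised to (chrom, pos, ref, alt) tuples; a real
# pipeline also left-aligns indels and splits multi-allelic records.

def score_calls(called, truth):
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and in truth
    fp = len(called - truth)   # false positives: called but absent
    fn = len(truth - called)   # false negatives: missed truth variants
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Invented example: four truth variants, three calls (one spurious).
truth = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"),
         ("chr2", 50, "C", "A"), ("chr2", 80, "T", "G")]
called = [("chr1", 100, "A", "T"), ("chr2", 50, "C", "A"),
          ("chr2", 99, "G", "A")]
metrics = score_calls(called, truth)
```

Structuring the result as a flat dictionary per caller and parameter set is what makes it easy to store and compare runs automatically, as the abstract describes.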

Session A-113: SIAMCAT - Statistical Inference of Associations between Microbial Communities And host phenoTypes.
  • Konrad Zych, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany
  • Nicolai Karcher, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany
  • Paul Igor Costea, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany
  • Shinichi Sunagawa, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany
  • Peer Bork, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany
  • Georg Zeller, Structural and Computational Biology Unit, European Molecular Biology Laboratory, Germany

Short Abstract: The promise of elucidating associations between the microbiota and their host, with diagnostic and therapeutic potential, is fueling metagenomics research. However, due to the complexity and volume of profiling data, as well as a lack of integrated approaches and tools, robust associations remain elusive. Here, we describe SIAMCAT, an R package for statistical inference of associations between the microbiome and host phenotypes. Ideally, such associations are described by quantitative models able to predict host status from microbiome composition. SIAMCAT can efficiently do so for data from thousands of microbial taxa, gene families, or metabolic pathways over hundreds of samples. Its core workflow is based on penalized logistic regression, and additional machine learning approaches are easy to implement through the mlr meta-package. SIAMCAT has a modular structure and comes in three flavors: (1) an R package, (2) command-line R scripts, and (3) a Galaxy workflow, supporting users with various levels of experience. It is part of the EMBL microbiome analysis toolkit and uses Bioconductor data structures for interoperability with other packages, so it can be broadly applied to new data sets and easily integrated into larger analysis pipelines. Moreover, SIAMCAT produces graphical (diagnostic) output for convenient assessment of the quality of the input data and the resulting associations. SIAMCAT has been successfully applied to discover microbial marker species for colorectal cancer detection. Here we show its versatility in identifying various associations between the microbiota and environmental (host) factors. Overall, SIAMCAT offers a complete and robust solution for statistical inference of associations in microbiome data.
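SIAMCAT's core model, penalized logistic regression predicting host status from microbial abundances, can be illustrated with a small stdlib-only sketch. SIAMCAT itself is an R package built on mlr; the gradient-descent fit below is purely didactic and the abundance values are made up.

```python
import math

# Didactic L2-penalised logistic regression: predict host status
# (1 = case, 0 = control) from two microbial taxon abundances.
# SIAMCAT fits this kind of model via mlr in R; this sketch only
# illustrates the principle, not its implementation.

def fit(X, y, lam=0.01, lr=0.5, iters=2000):
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(iters):
        gw = [lam * wj for wj in w]  # gradient of the L2 penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            logit = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1 / (1 + math.exp(-logit)) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0

# Made-up relative abundances of two marker taxa per sample.
X = [[0.30, 0.05], [0.28, 0.07], [0.33, 0.04],   # cases
     [0.05, 0.30], [0.08, 0.28], [0.04, 0.31]]   # controls
y = [1, 1, 1, 0, 0, 0]
w, b = fit(X, y)
preds = [predict(w, b, xi) for xi in X]
```

The L2 penalty (`lam`) shrinks coefficients toward zero, which is what keeps such models stable when there are thousands of taxa and only hundreds of samples.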

Session A-114: Aztec: Automated Biomedical Tool Index with Improved Information Retrieval System
  • Wei Wang, UCLA, United States
  • Yichao Zhou, UCLA, United States
  • Patrick Tan, UCLA, United States
  • Vincent Kyi, UCLA, United States
  • Xinxin Huang, UCLA, United States
  • Chelsea Ju, UCLA, United States
  • Justin Wood, UCLA, United States
  • Peipei Ping, UCLA, United States

Short Abstract: Advances in bioinformatics, especially in genomics, have greatly accelerated as powerful computational systems and new distributed platforms have enabled rapid processing of massive datasets. Accordingly, this has led to an explosion of public software, databases, and other resources for biological discovery. Modern biomedical research requires the comprehension and integration of multiple types of data and tools; specifically, understanding biomedical phenotypes requires analysis of molecular data (genomics, proteomics), imaging data (sonography, computed tomography (CT)), and textual data (case reports, electronic health records). Researchers need software tools in a common platform that can analyze and integrate data from each of these domains. However, many existing resource and tool repositories are narrowly focused, fragmented, or scattered and do not support the multidimensional data analysis required for advancing precision medicine. There is no unified platform collecting software applications across disciplines, leaving many users unable to locate the tools critical to their research. To manage this influx of digital research objects (tools and databases), the FORCE11 community put forth a set of guiding principles to make data Findable, Accessible, Interoperable, and Reusable (FAIR) (Wilkinson et al., 2016). While these issues have been addressed in the data domain, they are equally important to, but thus far neglected by, the biomedical informatics software domain. Here, we introduce Aztec (A to Z technology; https://aztec.bio), an open-source online software discovery platform that empowers biomedical researchers to find the software tools they need. Aztec currently hosts over 10,000 resources spanning 17 domains, including imaging, gene ontology, text mining, data visualization, and various omic analyses (genomics, proteomics, and metabolomics).
Aztec serves as a common portal for biomedical software across all of these domains, enabling modern multi-dimensional characterization of biomedical processes.

Session A-115: Building a local bioinformatics community: challenges and efforts
  • Malvika Sharan, EMBL, Heidelberg, Germany
  • Toby Hodges, EMBL, Heidelberg, Germany
  • Julia Ritzerfeld, EMBL, Heidelberg, Germany
  • Georg Zeller, EMBL, Heidelberg, Germany

Short Abstract: The German Network for Bioinformatics Infrastructure (de.NBI) is a national network funded by the Federal Ministry of Education and Research. de.NBI provides bioinformatics services to life-science and biomedicine researchers and at the same time constitutes the German ELIXIR node. Through one of its service centers, the Heidelberg Center for Human Bioinformatics (HD-HuB), we host a variety of bio-computational resources and offer scientists courses and training opportunities on a wide range of topics. Additionally, we organize events that provide a platform for researchers in the Heidelberg area to network with each other and exchange knowledge. HD-HuB unites the bioinformatics expertise of three distinguished research institutions: the European Molecular Biology Laboratory (EMBL), the German Cancer Research Center (DKFZ), and Heidelberg University, and facilitates an environment for establishing an effective and efficient bioinformatics community in Heidelberg. EMBL Heidelberg provides a local bioinformatics support platform, Bio-IT, to its ~600 life scientists across >50 groups. EMBL has reported a consistent increase in the fraction of scientists who devote ≥50% of their time to computational research, growing from 39% to 46% between 2009 and 2015. Bio-IT is an initiative established to support the development and technical capability of this diverse bio-computational community. The community has grown organically to tackle challenges together, providing training and support for scientific computing, creating and maintaining tools and resources for reproducible science, and providing opportunities for discussion and collaboration. Here, we share some lessons learned while establishing these communities and discuss the effort required to maintain them thereafter.

Session A-116: CHARME - Harmonising standardisation strategies to increase efficiency and competitiveness of European life-science research
  • Aleksandra Gruca, EU COST Action CHARME CA15110, Institute of Informatics, Silesian University of Technology, Poland

Short Abstract: In the last few years, issues related to reproducibility in life sciences and biomedical research have gained considerable interest and traction in the scientific community. They are not limited to laboratory procedures and protocols but also concern the computational analysis of experimental data. Computational findings often cannot be reproduced because of problems with software versioning, packaging, installation and execution, and because of a lack of proper documentation. By enabling the reuse of research assets, research can become more efficient and economical. This can be achieved by applying standards and Standard Operating Procedures. Standards are thus important drivers in the life sciences because they guarantee that data become accessible, shareable and comparable. Several initiatives have launched the development and implementation of standards; unfortunately, these efforts remain fragmented and disconnected. CHARME is an EU COST Action focused on problems of standardisation in systems biology and computational modelling. Currently 29 countries are involved in this non-commercial initiative, making it one of the most widely recognized of all COST Actions. The main objective of CHARME is to create a common platform for sustainable and efficient cooperation on standardisation, bridging and combining the fragmented areas in the development of new norms and standards. This will be done by identifying needs and gaps, and by teaming up with other initiatives to avoid duplication and overlap of standardisation activities. As a result, a common, long-term strategy will be developed to successfully assimilate standards into the daily workflow. Finally, CHARME will raise awareness of the need for standards and will provide common ground for researchers from academia, industry and multinational organizations.

Session A-117: High Content Screening data storage and analysis platform - An open source solution
  • Vincenzo Belcastro, Philip Morris International, Switzerland
  • Stephane Cano, Philip Morris International, Switzerland
  • Diego Marescotti, Philip Morris International, Switzerland
  • Carine Poussin, Philip Morris International, Switzerland
  • Ignacio Gonzales-Suarez, Philip Morris International, Switzerland
  • Florian Martin, Philip Morris International, Switzerland
  • Filipe Bonjour, Philip Morris International, Switzerland
  • Nikolai Ivanov, Philip Morris International, Switzerland
  • Julia Hoeng, Philip Morris International, Switzerland

Short Abstract: High Content Screening (HCS) is a powerful tool that can quantify a large number of biological readouts in a short period of time. At Philip Morris International (PMI), HCS is routinely used as part of the toxicological risk assessment of products with the potential to present less risk of harm than continued smoking. Each HCS-based assay is a complex, multi-step procedure that starts with an adequate experimental design, followed by the data generation phase (exposure, image acquisition and quantification) and finally data analysis and reporting. Testing various endpoints in combination with multiple experimental conditions can easily result in hundreds of plates that need to be managed, so a self-contained platform becomes necessary to ease the process. Here, we present an open-source platform that implements a Graphical User Interface (GUI) to manage the entire HCS procedure. The platform includes (i) a GUI to design an HCS experiment; (ii) automatic distribution of samples on plates; (iii) automatic fetching of raw image quantifications (the platform implements a web-service layer that allows multiple, possibly proprietary, systems to be easily integrated); and (iv) data analysis functionalities via the pmitcpl R package, including dose-response fitting and best-fit selection, various statistics (e.g., Min Effective Concentration and Max Tolerated Dose) and historical data analysis. The HCS data storage and analysis platform represents an all-in-one, open-source solution for HCS experimental processes. The platform is suitable for the overall management (storage, processing and reporting) of High Content Screening experimental data in biomedical research, thereby increasing the consistency, reliability, flexibility, speed of interpretation and traceability of the HCS data flow.

Session A-118: Expression Atlas: Exploring gene expression results across species under different biological conditions
  • Laura Huerta, EMBL-EBI, United Kingdom
  • Elisabet Barrera, EMBL-EBI, United Kingdom
  • Wojciech Bazant, EMBL-EBI, United Kingdom
  • Nuno A. Fonseca, EMBL-EBI, United Kingdom
  • Anja Fullgrabe, EMBL-EBI, United Kingdom
  • Maria Keays, EMBL-EBI, United Kingdom
  • Suhaib Mohammed, EMBL-EBI, United Kingdom
  • Alfonso Munoz-Pomer Fuentes, EMBL-EBI, United Kingdom
  • Amy Tang, EMBL-EBI, United Kingdom
  • Irene Papatheodorou, EMBL-EBI, United Kingdom
  • Robert Petryszak, EMBL-EBI, United Kingdom
  • Ugis Sarkans, EMBL-EBI, United Kingdom
  • Alvis Brazma, EMBL-EBI, United Kingdom

Short Abstract: Expression Atlas (https://www.ebi.ac.uk/gxa) is a database and web service at EMBL-EBI that curates, re-analyses and displays gene expression data across species and biological conditions such as different tissues, cell types, developmental stages and diseases. Currently, it provides gene expression results for more than 3,000 experiments (based on microarray and RNA sequencing) from 40 different organisms, including metazoans and plants. Expression profiles of tissues from the Human Protein Atlas, GTEx and FANTOM5, and of cancer cell lines from the ENCODE, CCLE and Genentech datasets, can be explored in Expression Atlas. All experiments within Expression Atlas are manually curated to accurately represent the gene expression data and are annotated with Experimental Factor Ontology (EFO) terms. The use of EFO annotations allows efficient search via ontology-driven query expansion and facilitates data integration. All RNA-sequencing experiments are analysed in a uniform way using iRAP, our standardised pipeline. It is possible to search for datasets and download them into R for further analysis via the ‘ExpressionAtlas’ R package. Expression Atlas visualises gene expression results using heatmaps showing gene expression levels across different biological conditions. Novel analyses and visualisations include exploration of gene co-expression across tissues, cell lines or other conditions within a baseline experiment, and enrichment analysis of user-provided gene sets in a given organism against differentially expressed genes in all available comparisons. Finally, our baseline gene expression results in different tissues are integrated into other specialised resources, contributing to the understanding of pathways (Reactome, http://www.reactome.org/) and disease targets (the Target Validation Platform by Open Targets, https://www.opentargets.org/).

Session A-119: Microservices in data, code, and project management.
  • Jorge Boucas, Max Planck Institute for Biology of Ageing, Germany

Short Abstract: In a world of ever-increasing data, the interplay of data, code, and project management has become crucial for successfully and agilely turning data into meaning. While several proficient tools have emerged in recent years for each of these individual tasks, many still lack a unified tool. Short of developing and deploying a unified tool from scratch, combining existing tools in a microservices architecture is within anyone's reach, allows refactoring and is infrastructure independent. bit is a git-based tool for the management of code and data. It uses git for code versioning and, for example, ownCloud for storing and exchanging data. It avoids the costs of data versioning by logging in git the changes made on the data storage platform (e.g. ownCloud). Data can be transferred over curl to avoid the risks of mounted drives and the need for FTP access. Every time data is uploaded to ownCloud, the upload is registered on the respective wiki on GitHub together with a hyperlink to the respective folder. Project metadata and data can therefore be kept in different instances. Using rsync in the backend, bit allows seamless multiple rsync calls for efficient synchronisation of projects between different HPCs or local machines. It is primarily developed for multiple analysts working on shared projects on an HPC cluster, but it can easily be used on local, cloud or non-HPC machines without changes to the code.
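The logging pattern described, registering each data upload on a git-tracked wiki page with a hyperlink to the storage folder, might look roughly like this. This is a hypothetical sketch; bit's actual commands and wiki format are not specified in the abstract, and the URL is invented.

```python
import datetime

# Hypothetical sketch of the described pattern: instead of versioning
# the data itself, append one Markdown line per upload to a wiki page
# that lives in git, linking to the folder on the storage platform
# (e.g. ownCloud). The git history of the wiki page, not of the data,
# records what changed and when.

def log_upload(wiki_text, project, filename, folder_url, when=None):
    when = when or datetime.date.today().isoformat()
    entry = (f"- {when} **{project}**: uploaded `{filename}` "
             f"([folder]({folder_url}))")
    return wiki_text.rstrip("\n") + "\n" + entry + "\n"

wiki = "# Data log\n"
wiki = log_upload(wiki, "proj-A", "counts.tsv",
                  "https://owncloud.example.org/f/1234",
                  when="2017-07-01")
```

Committing this wiki text to git gives cheap, human-readable provenance for data that is too large to version directly.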

Session A-120: Gene Set Variation Analysis in cBioPortal
  • Pieter Lukasse
  • Kees van Bochove
Session A-121: Databases to support reanalysis of public high-throughput DNA sequencing data
  • Tazro Ohta
Session A-122: BioThings SDK: a toolkit for building high-performance data APIs in biology
  • Chunlei Wu
Session A-123: Reproducing computational experiments in situ as an interactive figure in a journal article.
  • Evanthia Kaimaklioti
Session A-124: CWL Viewer: The Common Workflow Language Viewer
  • Stian Soiland-Reyes
Session A-125: Sequana: a set of flexible genomic pipelines for processing and reporting NGS analysis
  • Thomas Cokelaer
Session A-126: NGI-RNAseq - a best practice analysis pipeline in Nextflow
  • Rickard Hammarén
Session A-127: PhyloProfile: an interactive and dynamic visualization tool for multi-layered phylogenetic profiles
  • Ngoc-Vinh Tran
Session A-128: CueSea: quality control tool for Illumina genotyping microarray data, with correction on intensity, clusterization and biological specificity.
  • Daria Iakovishina
Session A-129: Bio::DB::HTS – accessing HTSlib from Perl
  • Rishi Nag
Session A-130: NGLESS: A tool for perfectly understandable and reproducible metagenomics pipelines based on a domain-specific language
  • Luis Coelho
Session A-131: CGP as a Service (CGPaaS) - From data submission to results using your web-browser
  • Keiran Raine, Wellcome Trust Sanger Institute, United Kingdom
  • Adam Butler, http://keiranmraine.github.io, United Kingdom
  • Peter Clapham, Wellcome Trust Sanger Institute, United Kingdom
  • Jon Teague, Wellcome Trust Sanger Institute, United Kingdom
  • Peter Campbell, Wellcome Trust Sanger Institute, United Kingdom

Short Abstract: The Cancer Genome Project (CGP) has been heavily involved in the ICGC PanCancer Analysis of Whole Genomes (PCAWG) project to characterise 2,800 cancers, providing one of the three core somatic calling pipelines. As part of this effort we have produced a codebase that is entirely portable, open source, and available as a PCAWG workflow on www.dockstore.org. This workflow generates copy-number, substitution, insertion, deletion and structural variant results optimised for tumour-normal paired somatic NGS data. CGP are now looking to provide an updated version of this workflow within a cloud-enabled framework. One of the key issues facing investigators working with large sequence data is the difficulty of transferring large datasets without installing dedicated software. To address this, we plan to implement an in-browser, drag-and-drop process for data submission and retrieval. Following successful validation of the data, mapping and analysis through the standard Whole Genome Sequence (WGS) somatic calling pipeline will be triggered. Here we present the current state of this work along with the road map for the next two years of development.

Session A-132: Large-scale genotypic and phenotypic data support for Tripal: Chado optimization by utilizing modern PostgreSQL functionality
  • Lacey-Anne Sanderson, University of Saskatchewan, Canada
  • Reynold Tan, University of Saskatchewan, Canada
  • Carolyn Caron, University of Saskatchewan, Canada
  • Kirstin Bett, University of Saskatchewan, Canada

Short Abstract: Present-day breeding programs demand evaluation of large-scale phenotypic and genotypic data to assist with selections. We have extended Tripal, an open-source toolkit for developing biological data web portals, to manage and display large-scale genotypic and phenotypic data. Tripal utilizes the GMOD Chado database schema to achieve flexible, ontology-driven storage in PostgreSQL. The ND Genotypes module (https://github.com/UofS-Pulse-Binfo/nd_genotypes) supports loading a variety of file formats, including the common Variant Call Format (VCF). The Chado schema was extended with an associative table connecting each genotype call with the marker and stock assayed. Metadata, including quality metrics, is supported as a JSONB array for maximum flexibility. Use of materialized views and indexing in PostgreSQL improves performance such that summary charts and tables can be generated dynamically. Researchers may download the data for further analysis in a choice of software-friendly formats. Similarly, the Analyzed Phenotypes module (https://github.com/UofS-Pulse-Binfo/analyzedphenotypes) alters the Chado schema through a single associative phenotype table and dynamically produces phenotype distribution charts so that users can compare phenotypic measurements between site-years. At present, both modules have been tested with 5 billion data points with no noticeable reduction in performance. The utility and performance of these extension modules have been benchmarked on our own breeder-focused web portal for legumes, KnowPulse (http://knowpulse.usask.ca).
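The schema extension described, an associative table linking each genotype call to the marker and stock assayed with flexible JSON metadata, can be sketched in miniature using SQLite as a stand-in for PostgreSQL (which would use a JSONB column). Table and column names here are illustrative, not the module's actual DDL.

```python
import json
import sqlite3

# Miniature stand-in for the described Chado extension: one associative
# row per genotype call, linking marker and stock, with call metadata
# (e.g. quality metrics) stored as JSON. PostgreSQL would use JSONB;
# SQLite stores it as TEXT here. All names are illustrative only.

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE genotype_call (
        call_id   INTEGER PRIMARY KEY,
        marker_id INTEGER NOT NULL,   -- feature (marker) assayed
        stock_id  INTEGER NOT NULL,   -- germplasm line assayed
        genotype  TEXT NOT NULL,      -- e.g. 'A/T'
        metadata  TEXT                -- JSON: quality metrics etc.
    )
""")
con.execute(
    "INSERT INTO genotype_call (marker_id, stock_id, genotype, metadata) "
    "VALUES (?, ?, ?, ?)",
    (101, 55, "A/T", json.dumps({"DP": 34, "GQ": 99})),
)
row = con.execute(
    "SELECT genotype, metadata FROM genotype_call WHERE marker_id = 101"
).fetchone()
genotype, metadata = row[0], json.loads(row[1])
```

Keeping quality metrics in a schemaless JSON column is what lets the module accept arbitrary per-call annotations (depth, genotype quality, caller-specific fields) without further schema changes.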

Session A-133: GenePattern Notebooks: An integrative analytical environment for genomics research
  • Michael Reich, University of California, San Diego, USA
  • Thorin Tabor, Broad Institute of MIT and Harvard, USA
  • Helga Thorvaldsdóttir, Broad Institute, USA
  • Barbara Hill, Broad Institute, USA
  • Ted Liefeld, UCSD, USA
  • Jill Mesirov, UC San Diego, USA
  • Pablo Tamayo, UC San Diego Moores Cancer Center, USA

Short Abstract: As the availability of genetic and genomic data and analysis tools from large-scale cancer initiatives continues to increase, the need has become more urgent for a software environment that supports the entire “idea to dissemination” cycle of an integrative cancer genomics analysis. Such a system would need to provide access to a large number of analysis tools without the need for programming, be sufficiently flexible to accommodate the practices of non-programming biologists as well as experienced bioinformaticians, and provide a way for researchers to encapsulate their work into a single “executable document” that includes not only the analytical workflow but also the associated descriptive text, graphics, and supporting research. To address these needs, we have developed GenePattern Notebook, based on the GenePattern environment for integrative genomics and the Jupyter Notebook system. GenePattern Notebook presents a familiar lab notebook format that allows researchers to build a record of their work by creating “cells” containing text, graphics, or executable analyses. Researchers add, delete, and modify cells as the research evolves. When an analysis is ready for publication, the same document that was used in the design and analysis phases becomes a research narrative that interleaves text, graphics, data, and executable analyses, serving as the complete, reproducible, in silico methods section for a publication.
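The notebook-of-cells model that GenePattern Notebook inherits from Jupyter can be illustrated with a minimal sketch that assembles a valid nbformat-v4 document by hand. The cell contents and the `run_analysis` call are hypothetical placeholders, not GenePattern's actual API.

```python
import json

def markdown_cell(text):
    # A narrative cell: descriptive text, figures, references.
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(source):
    # An executable cell: analysis code that lives alongside the text.
    return {"cell_type": "code", "metadata": {}, "execution_count": None,
            "outputs": [], "source": source}

# A minimal nbformat-v4 notebook: interleaved narrative and analysis,
# which is what makes the document both readable and re-runnable.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3",
                                "display_name": "Python 3"}},
    "cells": [
        markdown_cell("## Differential expression analysis\n"
                      "Why we ran this comparison, and what to expect."),
        code_cell("# hypothetical analysis step\n"
                  "results = run_analysis(counts)"),
    ],
}

with open("analysis.ipynb", "w") as fh:
    json.dump(notebook, fh, indent=1)
```

Because the whole record is one JSON document, the same file serves as the working lab notebook during analysis and, unchanged, as the shareable "executable document" at publication time.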

Session A-134: BioContainers for supercomputers: 2,000+ accessible, discoverable Singularity apps
  • John Fonner
Session A-135: Collaborative Open Plant Omics: A platform for “FAIR” data for plant science
  • Felix Shaw
Session A-136: CWL+Research Object == Complete Provenance
  • Farah Khan
Session A-137: Workflow-ready bioinformatics packages for Debian-based distributions and this Linux distribution’s infrastructure for low-friction reproducible research
  • Contributors To Debian Med, Debian Linux Community, Malta
  • Steffen Möller, Debian Linux Community, Germany

Short Abstract: Debian is one of the first GNU/Linux distributions and has always invited users from all technical and scientific backgrounds to contribute. For over 15 years, with contributors from all around the globe, the Debian Med special interest group has organised the packaging of platforms and toolkits for computational biology and medical informatics. Together with the Debian Science and Debichem groups, additional packages of shared interest with neighbouring disciplines are maintained. Since Debian automatically builds all software from source code, it allows bioinformatics workflows to become consistently auto-deployable across platforms, e.g. 32bit ARM7, 64bit ARM8, x86, IBM POWER, or mainframes. It is also offered as a base for all prominent virtualisation technologies and cloud infrastructure service providers. Compiler flags for Debian builds are strict, as expected for security-sensitive parts of a Linux distribution. Fixes to the source tree are communicated back to the authors. Builds are checked for consistency by the http://reproducible-builds.org initiative, and automated testing has become an integral part of Debian packaging, as shown on http://ci.debian.net. The packages in Debian are in their latest versions, or otherwise updated quickly upon request. Similarly important, http://snapshot.debian.org hosts packages from the past to facilitate assessments of scientific advancements or to consistently continue or reproduce older studies. Requests for new packages are typically how the number of contributors grows - we actively help with our expertise and warmly welcome yours - be it with your package, your cloud template, your workflow, or your participation in our next meeting.

Session A-138: Forever in BlueGenes: a next-generation genomic data interface powered by InterMine
  • Yo Yehudi
Session A-139: GRADitude: A computational tool for the analysis of Grad-seq data
  • Silvia Di Giorgio
Session A-140: Enabling the optimization of open-source biological computational tools with scripting languages
  • Stefan POPA
Session A-141: Protein Inpainter: a Message-Passing-based Predictor using Spark GraphX
  • Rabie Saidi
Session A-142: Reproducible and user-controlled software management in HPC with GNU Guix
  • Ricardo Wurmus
Session A-143: An ensemble approach for gene set testing analysis with reporting capabilities
  • Monther Alhamdoosh
Session A-144: RADAR-CNS - Research Infrastructure for processing wearable data to improve health
  • Nivethika Mahasivam
Session A-145: Workflows interoperability with Nextflow and Common WL
  • Kevin Sayers
Session A-146: Bioschemas for life science data
  • Carole Goble
Session A-147: The GA4GH Tool Registry Service (TRS) and Dockstore - Year One
  • Denis Yuen
Session A-148: Introducing the Brassica Information Portal: Towards integrating genotypic and phenotypic Brassica crop data
  • Annemarie Eckes
Session A-149: Discovery and visualisation of homologous genes and gene families using Galaxy
  • Anil S. Thanki
Session A-150: The SPOT ontology toolkit : semantics as a service
  • Olga Vrousgou
Session A-151: GRAPHSPACE: Stimulating interdisciplinary collaborations in network biology
  • Aditya Bharadwaj
Session A-152: Revitalizing a classic bioinformatics tool using modern technologies: the case of the Cytoscape Project
  • Keiichiro Ono
Session A-153: Screw: tools for building reproducible single-cell epigenomics workflows
  • Kieran O'Neill
Session A-154: Emerging public databases of clinical genetic test results: Implications for large scale deployment of precision medicine
  • Stephen Lincoln
Session A-155: NGL – a molecular graphics library for the web
  • Alexander S Rose
Session A-156: ToolDog - generating tool descriptors from the ELIXIR tool registry
  • Kenzo-Hugo Hillion
Session A-157: Integrating cloud storage providers for genomic analyses
  • Ted Liefeld
Session A-158: ASpedia: alternative splicing encyclopedia to explore comprehensive evidence
  • Daejin Hyung, Research Institute National Cancer Center, Republic of Korea, South Korea
  • Soo Young Cho, Research Institute National Cancer Center, Republic of Korea, South Korea
  • Charny Park, Research Institute National Cancer Center, Republic of Korea, South Korea

Short Abstract: Motivation: The advent of mRNA high-throughput sequencing has facilitated the development of various alternative splicing (AS) analysis methods. Differential AS analysis methods pursue precise mathematical quantification models without considering biological function, and identifying functional evidence with which to evaluate AS regions remains an open issue. Exon inclusion or exclusion is known to affect protein domains, miRNA binding sites, and other functional regions. Analysis results containing numerous differential AS candidates require genomic annotation and objective biological evidence. We therefore developed ASpedia, an exploration system for AS. Results: We developed a database of comprehensive evidence for AS that encompasses multiple protein features, transcriptional profiles including RNA-binding proteins, miRNA binding sites, repeat regions, genomic evolutionary scores, splicing-site variants, and more. Human AS regions, classified into five types, were identified from hg19 Ensembl and RefSeq annotations. The transcription profile comprises exon inclusion/exclusion across 26 tissues and the binding regions of 199 RNA-binding proteins; the data set was collected from RNA-Seq (E-MTAB-1733) and the ENCODE project. To build the protein database, we converted protein positions to genomic coordinates and scanned AS inclusion or exclusion regions for matches with protein positions. The protein context consists of protein domains, isoform-specific protein-protein interactions, enzyme classification, and post-translational modification sites. DNA evidence was collected from exonic/intronic evolutionary conservation scores, splicing-site variants, NMD variants including novel point mutations, repeat regions, and miRNA binding sites. Finally, users can query by gene symbol, or upload a large-scale NGS analysis result file, through a user-friendly web interface. ASpedia supplies both text-based annotation results and a visualization for each AS case.
We plan to expand the database to reinforce the transcriptional evidence with analyses of additional RNA-seq datasets.
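The protein-to-genome coordinate conversion mentioned in the abstract can be sketched as below. This is an illustrative, plus-strand-only implementation with made-up exon coordinates, not ASpedia's actual code.

```python
def protein_pos_to_genomic(residue, cds_exons):
    """Map a 1-based protein residue to the genomic coordinates of its
    codon, given plus-strand CDS exons as (start, end) 1-based inclusive
    intervals listed 5'->3'. Returns a list of genomic intervals, since
    a codon may be split across an exon junction."""
    # 0-based CDS offsets covered by this residue's codon
    targets = set(range((residue - 1) * 3, (residue - 1) * 3 + 3))
    coords = []
    offset = 0
    for start, end in cds_exons:
        for pos in range(start, end + 1):
            if offset in targets:
                coords.append(pos)
            offset += 1
    # collapse consecutive genomic positions into intervals
    intervals = []
    for pos in sorted(coords):
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], pos)
        else:
            intervals.append((pos, pos))
    return intervals

# Hypothetical CDS: residue 2's codon straddles the exon junction, so
# it maps to two genomic intervals.
# protein_pos_to_genomic(2, [(100, 104), (200, 210)])
#   -> [(103, 104), (200, 200)]
```

Once each residue is anchored to genomic intervals like this, overlapping them with AS inclusion/exclusion regions reduces to a standard interval-intersection step.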

Session A-159: Full-stack genomics pipelining with GATK4 + WDL + Cromwell
  • Kate Voss
Session A-160: BioXSD | BioJSON | BioYAML - Towards unified formats for sequences, alignments, features, and annotations
  • Matúš Kalaš
Session A-161: EDAM - The ontology of bioinformatics operations, types of data, topics, and data formats (2017 update)
  • Matúš Kalaš
  • Hervé Ménager
