Track: Bioinformatics Open Source Conference (BOSC)
Short Abstract: The growing genomic data repository GEO (Gene Expression Omnibus) has become a popular public database for microarray data from gene expression experiments. GEO contains curated as well as non-curated gene expression profile datasets. GeoDiver is an online web application for performing Differential Gene Expression Analysis (DGEA) and Generally Applicable Gene-set Enrichment Analysis (GAGE) on GEO datasets. GeoDiver allows researchers to take full advantage of the growing GEO resource and perform powerful analyses without needing to download or install additional software, learn command-line skills, or have prior programming knowledge. Users can easily run both DGEA and GAGE within a few minutes of accessing the web application. For users who are familiar with gene-expression data, the option to adapt the advanced parameters of the analyses mirrors the flexibility that one might expect from writing a custom analysis script. The output includes numerous high-quality interactive graphics, allowing users to instantly explore and examine complex datasets and understand the data they have analysed. Furthermore, the results can be reviewed at a later date and shared with collaborators. Other graphics, such as the heatmaps, are high-resolution, information-rich, and can be easily exported for use in publications. GeoDiver is therefore designed not only to facilitate the analysis of gene-expression data but also to ensure that users can fully explore the results of their analysis. This is important because the ability to use such powerful analytical tools must be paired with the corresponding ability to interpret the output. Availability: GeoDiver is freely available online at http://www.geodiver.co.uk. The source code is available on GitHub: https://github.com/GeoDiver/GeoDiver and a Docker image is available for easy installation.
Short Abstract: Actor is an open-source, object-oriented Python framework for the development of computational pipelines, specifically designed for NGS data analysis in a cluster computing environment. Pipelines are built out of reusable objects implementing analysis steps (e.g. short-read alignment, transcript quantification) and are controlled by simple configuration files detailing the experimental setup, the input data, and the steps to be performed. Pipeline execution is performed by a Director object that, given an Actor object representing the pipeline, performs the necessary setup and executes all required steps. Each step is represented by a Line object, which provides standard methods for Setup, Verification, Pre-Execution, Execution, Post-Execution, and Reporting. Thanks to this standard API, steps can be freely combined: for example, changing the short-read aligner from Bowtie to STAR only requires swapping one Line object for another in the pipeline definition. Actor can handle any number of experimental conditions, biological replicates, and technical replicates, easily supporting complex experimental designs with no changes to the pipeline structure. Actor automatically handles the submission and management of jobs on the cluster, ensuring proper job sequencing and coordination. Finally, Actor pipelines automatically generate an HTML report of their execution: within a standard template (customizable by specializing the Actor object), each step may add one or more sections containing text, tables, figures, and links to downloadable files. We describe the Actor framework, outline pipelines developed with it (including RNA-seq, ChIP-seq, ATAC-seq, methylation analysis, variant discovery, and genome annotation), and discuss its advantages for analysis reliability and reproducibility.
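The Director/Line pattern described above can be sketched as follows. This is a hypothetical Python illustration, not Actor's actual code: class names follow the abstract, but the snake_case method names, the `conf` dictionary, and the stand-in `execute` bodies are assumptions.

```python
class Line:
    """One pipeline step exposing the standard lifecycle API from the abstract."""
    name = "line"

    def setup(self, conf): pass
    def verify(self, conf): return True
    def pre_execute(self, conf): pass
    def execute(self, conf): raise NotImplementedError
    def post_execute(self, conf): pass
    def report(self, conf): return f"<h2>{self.name}</h2>"

class BowtieLine(Line):
    name = "bowtie-align"
    def execute(self, conf):
        # stand-in for submitting a Bowtie job to the cluster
        conf["bam"] = f"{conf['sample']}.bowtie.bam"

class StarLine(Line):
    name = "star-align"
    def execute(self, conf):
        conf["bam"] = f"{conf['sample']}.star.bam"

class Director:
    """Runs each Line through its lifecycle in order and collects the HTML report."""
    def __init__(self, lines):
        self.lines = lines

    def run(self, conf):
        sections = []
        for line in self.lines:
            line.setup(conf)
            if not line.verify(conf):
                continue
            line.pre_execute(conf)
            line.execute(conf)
            line.post_execute(conf)
            sections.append(line.report(conf))
        return "\n".join(sections)

# Swapping aligners is just swapping one Line object for another:
conf = {"sample": "S1"}
html = Director([StarLine()]).run(conf)
```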
Short Abstract: Tumor heterogeneity refers to the notion that tumors show distinct cellular morphological and phenotypic properties, which can affect disease progression and response. Next-generation sequencing results from bulk whole-exome or whole-genome samples can be used to infer tumor structure (heterogeneity) with several methods. One of these methods, sciClone, was developed for detecting tumor subclones by clustering Variant Allele Frequencies (VAFs) of somatic mutations in copy-number-neutral regions. We present a portable and reproducible Common Workflow Language (CWL) implementation of a sciClone-based tumor heterogeneity workflow. The results of sciClone can be fed directly into additional software such as ClonEvol and Fishplot, which produce further visualizations, e.g. phylogenetic trees showing clonal evolution. The workflow is designed to process standard bioinformatics file types (VCF, BAM) and produce a comprehensive set of outputs describing tumor heterogeneity. Additionally, we designed the workflow to be robust regardless of the number of samples provided.
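The clustering idea behind sciClone can be illustrated with a toy example. sciClone itself fits a Bayesian mixture model to VAFs; the simple 1-D k-means below, run on simulated VAFs, is only a conceptual sketch of grouping variants into subclones, not the actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated VAFs: a founding clone near 0.5 and a subclone near 0.2
vafs = np.concatenate([rng.normal(0.5, 0.03, 40), rng.normal(0.2, 0.03, 30)])

def kmeans_1d(x, k=2, iters=50):
    """Cluster 1-D values into k groups (illustrative stand-in for a mixture model)."""
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() for j in range(k)])
    return centers, labels

centers, labels = kmeans_1d(vafs)
print(sorted(centers))  # cluster centers suggest the two clonal populations
```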
Short Abstract: Microsatellite instability (MSI) is an important factor for classifying certain cancer types; it has been shown to be associated with prognosis and can be used as a biomarker for treatment selection. Here, we present a portable and reproducible Common Workflow Language (CWL) implementation of the MSIsensor tool and the lobSTR toolkit, demonstrate how to use MSIsensor and lobSTR, and show how lobSTR detects and suppresses PCR noise. Finally, we present results obtained using these tools on more than 600 TCGA colorectal adenocarcinoma cases, together with our further analysis and interpretation of the results. Out of 592 cases with an MSI classification on the GDC portal, MSIsensor predicted the MSI status correctly for 580 (F-score 0.93) using tumor and normal samples, while lobSTR enabled us to correctly predict the MSI status in 584 cases (F-score 0.96) using tumor and normal samples, and in 581 cases (F-score 0.94) using tumor samples only.
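The F-scores quoted above combine the precision and recall of the MSI predictions. A minimal sketch of the metric (the toy labels and the `MSI-H` label name are illustrative, not the TCGA data):

```python
def f_score(truth, predicted, positive="MSI-H"):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one true positive, one false negative, one false positive
truth     = ["MSI-H", "MSI-H", "MSS", "MSS", "MSS"]
predicted = ["MSI-H", "MSS",   "MSS", "MSS", "MSI-H"]
print(f_score(truth, predicted))  # 0.5
```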
Short Abstract: Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging: even the best gene prediction algorithms make substantial errors that can jeopardize subsequent analyses. Therefore, many predicted genes must be visually inspected and manually curated, a time-consuming process. We developed GeneValidator to automatically identify problematic gene predictions and to aid manual curation. For each gene, GeneValidator performs multiple analyses based on comparisons to gene sequences from large databases. The resulting report identifies problematic gene predictions and includes several statistics and graphs for each prediction. These can be used to eliminate problematic gene models from a set of annotations, to compare two sets of annotations, or to guide manual curation efforts. GeneValidator thus accelerates and enhances the work of researchers and biocurators who need accurate gene predictions from newly sequenced genomes. GeneValidator can be used through a web interface or on the command line. It is open source (AGPL) and available at https://wurmlab.github.io/tools/genevalidator.
Short Abstract: The increasing number of sequenced cancer genomes since the completion of the Human Genome Project, and the importance of correctly identifying somatic mutations, which can influence treatment or prognosis, are driving the development of novel somatic variant calling tools (somatic callers). The lack of a best-practice algorithm for identifying somatic variants, however, necessitates constant testing, comparison and benchmarking of these tools. The absence of a truth set further hinders evaluation efforts. By comparing widely used open-source somatic callers, such as Strelka, VarDict, VarScan2, Seurat and LoFreq, through analysis of in-house-generated synthetic data, we found complex dependencies of somatic caller parameters on coverage depth, allele frequency, variant type, and detection goals. Next, we normalised and filtered the output data so that it could be appropriately compared to the truth set. The resulting benchmarking results were automatically and efficiently structured and stored. All of the tools used for the analysis have been implemented in Common Workflow Language, which makes them portable and reproducible.
Short Abstract: The promise of elucidating associations between the microbiota and their host, with diagnostic and therapeutic potential, is fueling metagenomics research. However, due to the complexity and volume of profiling data, as well as a lack of integrated approaches and tools, robust associations remain elusive. Here, we describe SIAMCAT, an R package for statistical inference of associations between the microbiome and host phenotypes. Ideally, such associations are described by quantitative models able to predict host status from microbiome composition. SIAMCAT can efficiently do so for data from thousands of microbial taxa, gene families, or metabolic pathways over hundreds of samples. Its core workflow is based on penalized logistic regression, though it is easy to implement additional machine learning approaches through the mlr meta-package. SIAMCAT has a modular structure and three flavors: (1) the R package, (2) command-line R scripts, and (3) a Galaxy workflow, supporting users with various levels of experience. It is part of the EMBL microbiome analysis toolkit and uses Bioconductor data structures for interoperability with other packages, so it can be broadly applied to new data sets and easily integrated into larger analysis pipelines. Moreover, SIAMCAT produces graphical (diagnostic) output for convenient assessment of the quality of the input data and resulting associations. SIAMCAT was successfully applied to discover microbial marker species for colorectal cancer detection. Here we show its versatility in identifying various associations between microbiota and environmental (host) factors. Overall, SIAMCAT offers a complete and robust solution for statistical inference of associations in microbiome data.
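The core modelling idea, predicting host status from microbial abundance profiles with penalized logistic regression, can be sketched conceptually. SIAMCAT itself is an R package built on mlr; the NumPy-only sketch below, with simulated abundances and a hand-rolled L2-penalized fit, only illustrates the technique, not SIAMCAT's API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_taxa = 200, 50
# Simulated relative abundances (rows sum to 1, as in microbiome profiles)
X = rng.dirichlet(np.ones(n_taxa), size=n_samples)
# Hypothetical host status driven by two "marker" taxa
y = (X[:, 0] > X[:, 1]).astype(float)

# Standardise features so gradient descent behaves uniformly across taxa
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def fit_penalized_logreg(X, y, lam=0.01, lr=0.5, epochs=2000):
    """Logistic regression with an L2 penalty, fitted by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / len(y) + lam * w  # penalty shrinks weights
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_penalized_logreg(Xs, y)
pred = (1.0 / (1.0 + np.exp(-(Xs @ w + b)))) > 0.5
acc = np.mean(pred == y)
```

The penalty term is what makes high-dimensional fits (many more taxa than samples) stable; SIAMCAT additionally wraps this in cross-validation and diagnostic plotting.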
Short Abstract: Advances in bioinformatics, especially in the field of genomics, have greatly accelerated as the development of powerful computational systems and new distributed platforms enabled rapid processing of massive datasets. Accordingly, this has led to an explosion of public software, databases, and other resources for biological discovery. Modern biomedical research requires the comprehension and integration of multiple types of data and tools; specifically, understanding biomedical phenotypes requires analysis of molecular data (genomics, proteomics), imaging data (sonography; computed tomography, CT), and textual data (case reports, electronic health records). Researchers require software tools in a common platform that can analyze and integrate data from each of these domains. However, many existing resource and tool repositories are narrowly focused, fragmented, or scattered, and do not support the multidimensional data analysis required for advancing precision medicine. There is no unified platform collecting software applications across disciplines, leaving many users unable to locate the tools critical to their research. To manage this influx of digital research objects (tools and databases), the FORCE11 community put forth a set of guiding principles to make data Findable, Accessible, Interoperable, and Reusable (FAIR) (Wilkinson et al., 2016). While these issues have been addressed in the data domain, they are equally important to, but thus far neglected by, the biomedical informatics software domain. Here, we introduce Aztec (A to Z technology; https://aztec.bio), an open-source online software discovery platform that empowers biomedical researchers to find the software tools they need. Aztec currently hosts over 10,000 resources, spanning 17 domains including imaging, gene ontology, text mining, data visualization, and various omic analyses (e.g., genomics, proteomics, and metabolomics).
Aztec serves as a common portal for biomedical software across all of these domains, enabling modern multi-dimensional characterization of biomedical processes.
Short Abstract: The German Network for Bioinformatics Infrastructure (de.NBI) is a national network funded by the Federal Ministry of Education and Research. de.NBI provides bioinformatics services to life-science and biomedical researchers and constitutes the German ELIXIR node. Through one of its service centers, the Heidelberg Center for Human Bioinformatics (HD-HuB), we host a variety of bio-computational resources and offer courses and training opportunities to scientists on a wide range of topics. Additionally, we organize events that provide a platform for researchers in the Heidelberg area to network with each other and exchange knowledge. HD-HuB unites the bioinformatics expertise of three distinguished research institutions: the European Molecular Biology Laboratory (EMBL), the German Cancer Research Center (DKFZ), and Heidelberg University. It fosters an environment for establishing an effective and efficient bioinformatics community in Heidelberg. EMBL Heidelberg provides a local bioinformatics support platform, Bio-IT, to its ~600 life scientists across >50 groups. EMBL has reported a consistent increase in the fraction of scientists who devote ≥50% of their time to computational research, growing from 39% to 46% between 2009 and 2015. Bio-IT is an initiative established to support the development and technical capability of this diverse bio-computational community. The community has grown organically to tackle challenges together, providing training and support for scientific computing, creating and maintaining tools and resources for reproducible science, and providing opportunities for discussion and collaboration. Here, we share some lessons learned while establishing these communities and discuss the effort required to maintain them thereafter.
Short Abstract: In the last few years, issues related to reproducibility in the life sciences and biomedical research have gained considerable interest and traction in the scientific community. They are not limited to laboratory procedures and protocols but also concern the computational analysis of experimental data. Computational findings often cannot be reproduced because of problems with software versioning, packaging, installation and execution, and because of a lack of proper documentation. By enabling the reuse of research assets, research can become more efficient and economical. This can be achieved by applying standards and standard operating procedures. Standards are thus important drivers in the life sciences because they guarantee that data become accessible, shareable and comparable. Several initiatives have launched the development and implementation of standards; unfortunately, these remain fragmented and disconnected. CHARME is the EU COST Action focused on problems of standardisation in systems biology and computational modelling. Currently, 29 countries are involved in this non-commercial initiative, making it one of the most widely recognized actions among all COST initiatives. The main objective of CHARME is to create a common platform for sustainable and efficient cooperation on standardisation. CHARME aims to bridge and combine the fragmented areas in the development of new norms and standards. This will be done by identifying needs and gaps, and by teaming up with other initiatives to avoid duplication and overlap of standardisation activities. As a result, a common, long-term strategy will be developed to successfully assimilate standards into the daily workflow. Finally, CHARME will raise awareness of the need for standards and will provide common ground for researchers from academia, industry and multinational organisations.
Short Abstract: High Content Screening (HCS) is a powerful tool that can quantify a large number of biological readouts in a short period of time. At Philip Morris International (PMI), HCS is routinely used as part of the toxicological risk assessment of products with the potential to present less risk of harm than continued smoking. Each HCS-based assay is a complex, multi-step procedure that starts with an adequate experimental design, followed by the data generation phase (exposure, image acquisition and quantification), and finally data analysis and reporting. Testing various endpoints in combination with multiple experimental conditions can easily result in hundreds of plates that need to be managed; a self-contained platform therefore becomes necessary to ease the process. Here, we present an open-source platform that implements a Graphical User Interface (GUI) to manage the entire HCS procedure. The platform includes (i) a GUI to design an HCS experiment; (ii) automatic distribution of samples on plates; (iii) automatic fetching of raw image quantifications (the platform implements a web-service layer that allows multiple, possibly proprietary, systems to be easily integrated); and (iv) data analysis functionality via the pmitcpl R package, including multiple dose-response fitting and best-fit selection, various statistics (e.g., Minimum Effective Concentration and Maximum Tolerated Dose) and historical data analysis. The HCS data storage and analysis platform represents an all-in-one, open-source solution for HCS experimental processes. It is suitable for the overall management (storage, processing and reporting) of High Content Screening experimental data in biomedical research, thereby increasing the consistency, reliability, flexibility, speed of interpretation and traceability of the HCS data flow.
Short Abstract: Expression Atlas (https://www.ebi.ac.uk/gxa) is a database and web service at EMBL-EBI that curates, re-analyses and displays gene expression data across species and biological conditions such as different tissues, cell types, developmental stages and diseases. Currently, it provides gene expression results for more than 3,000 experiments (based on microarray and RNA-sequencing) from 40 different organisms, including metazoans and plants. Expression profiles of tissues from the Human Protein Atlas, GTEx and FANTOM5, and of cancer cell lines from the ENCODE, CCLE and Genentech datasets, can be explored in Expression Atlas. All experiments within Expression Atlas are manually curated to accurately represent gene expression data and annotated with Experimental Factor Ontology (EFO) terms. The use of EFO annotations allows efficient search via ontology-driven query expansion and facilitates data integration. All RNA-sequencing experiments are analysed in a uniform way using iRAP, our standardised pipeline. It is possible to search and download datasets into R for further analysis via the R package ‘ExpressionAtlas’. Expression Atlas visualises gene expression results using heatmaps showing gene expression levels across different biological conditions. Novel analyses and visualisations include: exploration of gene co-expression across tissues, cell lines or other conditions within a baseline experiment, and enrichment analysis of user-provided sets of genes in a given organism against differentially expressed genes in all available comparisons. Finally, our baseline gene expression results in different tissues are integrated within other specialised resources to contribute to the understanding of pathways (Reactome, http://www.reactome.org/) and disease targets (the Target Validation Platform by Open Targets, https://www.opentargets.org/).
Short Abstract: In a world of ever-increasing data, the interplay of data, code, and project management has become crucial for successfully and agilely turning data into meaning. While several capable tools have emerged over recent years for each of these individual tasks, many still lack a unified tool. Avoiding the costs of developing and deploying a unified tool, using existing tools in a microservices architecture is within anyone's reach, allows refactoring, and is infrastructure independent. bit is a git-based tool for the management of code and data. It uses git for code versioning and, for example, ownCloud for storing and exchanging data. It avoids the costs of data versioning by logging in git the changes made on the data storage platform (e.g. ownCloud). Data can be transferred over curl to avoid the risks of mounted drives and the need for FTP access. Every time data is uploaded to ownCloud, the upload is registered on the project's wiki on GitHub together with a hyperlink to the respective folder. Project metadata and data can therefore be kept in different instances. Using rsync in the backend, bit allows seamless multiple rsync calls for efficient synchronisation of projects between different HPCs or local machines. It is primarily developed for multiple analysts working on shared projects on an HPC cluster, but it can easily be used on local, cloud or non-HPC machines without changes to the code.
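The upload-logging idea above can be sketched as follows. This is a hypothetical illustration, not bit's actual interface: the function name, log format, and wiki page layout are assumptions; the only grounded part is the pattern of appending a hyperlinked entry to a git-tracked page and committing it.

```python
import datetime
import pathlib
import subprocess

def register_upload(wiki_repo, project, folder_url):
    """Append a log entry linking to the uploaded folder, then commit it.

    wiki_repo is a local clone of the (e.g. GitHub) wiki; folder_url points
    at the folder on the data storage platform (e.g. ownCloud).
    """
    page = pathlib.Path(wiki_repo) / f"{project}.md"
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with page.open("a") as fh:
        fh.write(f"- {stamp}: data uploaded, see [{project}]({folder_url})\n")
    subprocess.run(["git", "-C", wiki_repo, "add", page.name], check=True)
    subprocess.run(
        ["git", "-C", wiki_repo, "commit", "-m", f"Log data upload for {project}"],
        check=True,
    )
```

Because only the log entry is versioned, the git history stays tiny while the (large) data itself lives solely on the storage platform.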
Short Abstract: The Cancer Genome Project (CGP) has been heavily involved in the work of the ICGC PanCancer Analysis of Whole Genomes (PCAWG) project to characterise 2,800 cancers. CGP provided one of the three core somatic calling pipelines. As part of this effort we have successfully produced a codebase that is entirely portable, open source, and available as a PCAWG workflow on www.dockstore.org. This workflow generates copy-number, substitution, insertion, deletion and structural variant results optimised for tumour-normal paired somatic NGS data. CGP is now looking to provide an updated version of this workflow within a cloud-enabled framework. One of the key issues facing investigators working with large sequence data is the difficulty of transferring large datasets without installing dedicated software. To address this issue we plan to implement an in-browser, drag-and-drop process for data submission and retrieval. Following successful validation of the data, mapping and analysis through the standard Whole Genome Sequence (WGS) somatic calling pipeline will be triggered. Here we present the current state of this work along with the road-map for the next two years of development.
Short Abstract: Present-day breeding programs demand evaluation of large-scale phenotypic and genotypic data to assist with selections. We have extended Tripal, an open-source toolkit for developing biological data web portals, to manage and display large-scale genotypic and phenotypic data. Tripal utilizes the GMOD Chado database schema to achieve flexible, ontology-driven storage in PostgreSQL. The ND Genotypes module (https://github.com/UofS-Pulse-Binfo/nd_genotypes) supports loading of a variety of file formats, including the common Variant Call Format (VCF). The Chado schema was extended with an associative table connecting each genotype call with the marker and stock assayed. Metadata, including quality metrics, is stored as a JSONB array for maximum flexibility. Use of materialized views and indexing in PostgreSQL improves performance such that summary charts and tables can be generated dynamically. Researchers may download the data for further analysis in a choice of software-friendly formats. Similarly, the Analyzed Phenotypes module (https://github.com/UofS-Pulse-Binfo/analyzedphenotypes) alters the Chado schema through a single associative phenotype table and dynamically produces phenotype distribution charts for users to compare phenotypic measurements between site-years. At present, both modules have been tested with 5 billion data points with no noticeable reduction in performance. The utility and performance of these extension modules have been benchmarked on our own breeder-focused web portal for legumes, KnowPulse (http://knowpulse.usask.ca).
Short Abstract: As the availability of genetic and genomic data and analysis tools from large-scale cancer initiatives continues to increase, the need has become more urgent for a software environment that supports the entire “idea to dissemination” cycle of an integrative cancer genomics analysis. Such a system would need to provide access to a large number of analysis tools without the need for programming, be sufficiently flexible to accommodate the practices of non-programming biologists as well as experienced bioinformaticians, and would provide a way for researchers to encapsulate their work into a single “executable document” that included not only the analytical workflow but also the associated descriptive text, graphics, and supporting research. To address these needs, we have developed GenePattern Notebook, based on the GenePattern environment for integrative genomics and the Jupyter Notebook system. GenePattern Notebook presents a familiar lab notebook format that allows researchers to build a record of their work by creating “cells” containing text, graphics, or executable analyses. Researchers add, delete, and modify cells as the research evolves. When an analysis is ready for publication, the same document that was used in the design and analysis phases becomes a research narrative that interleaves text, graphics, data, and executable analyses, serving as the complete, reproducible, in silico methods section for a publication.
Short Abstract: Debian is one of the first GNU/Linux distributions and has always invited users from all technical and scientific backgrounds to contribute. For over 15 years, with contributors from all around the globe, the Debian Med special interest group has organised the packaging of platforms and toolkits for computational biology and medical informatics. Together with the Debian Science and Debichem groups, additional packages of shared interest with neighbouring disciplines are maintained. Since Debian automatically builds all software from source code, it allows bioinformatics workflows to become consistently auto-deployable across platforms, e.g. 32-bit ARMv7, 64-bit ARMv8, x86, IBM POWER, or mainframes. It is also offered as a base for all prominent virtualisation technologies and cloud infrastructure service providers. Compiler flags for Debian builds are strict, as expected for security-sensitive parts of a Linux distribution. Fixes to the source tree are communicated back to the authors. Builds are checked for consistency by the http://reproducible-builds.org initiative, and automated testing has become an integral part of Debian packaging, as shown on http://ci.debian.net. The packages in Debian are at their latest versions, or otherwise updated quickly upon request. Similarly important, http://snapshot.debian.org hosts packages from the past to facilitate assessments of scientific advancements or to consistently continue or reproduce older studies. Requests for new packages are typically how the number of contributors grows: we actively help with our expertise and welcome yours, be it with your package, your cloud template, your workflow, or your participation in our next meeting.
Short Abstract: Motivation: The advent of mRNA high-throughput sequencing has facilitated the development of various alternative splicing (AS) analysis methods. Differential AS analysis methods pursue precise mathematical quantification models without considering biological function, and identifying functional evidence with which to evaluate AS regions remains an open issue. Exon inclusion or exclusion is already known to affect protein domains, miRNA binding sites, and other functional regions. Analysis results containing large numbers of differential AS candidates require genomic annotation and objective biological evidence. We therefore developed ASpeida, an exploration system for AS. Results: We built a database of comprehensive evidence for AS that encompasses multiple protein features, transcriptional profiles including RNA-binding proteins, miRNA binding sites, repeat regions, genomic evolutionary scores, splice-site variants, and more. Human AS regions, classified into five types, were identified from hg19 Ensembl and RefSeq. The transcription profile comprises exon inclusion/exclusion across 26 tissues and the binding regions of 199 RNA-binding proteins; these data were collected from RNA-Seq (E-MTAB-1733) and the ENCODE project. To build the protein database, we converted protein positions to genomic coordinates and scanned AS inclusion or exclusion regions for matching protein positions. The protein context consists of protein domains, isoform-specific protein-protein interactions, enzyme classifications, and post-translational modification sites. DNA evidence was collected from exonic/intronic evolutionary conservation scores, splice-site variants, NMD variants including novel point mutations, repeat regions, and miRNA binding sites. Finally, users can query with a gene symbol or a large-scale NGS analysis result file via a user-friendly web interface. ASpeida supplies both text-based annotation results and a visualization for each AS case. We plan to expand the database to reinforce transcriptional evidence from analyses of various RNA-seq datasets.