
BOSC COSI

Attention Presenters - please review the Speaker Information Page available here
Schedule subject to change
All times listed are in UTC
Thursday, July 29th
11:00-11:20
Opening remarks (Harris, Cock, Sponsors)
Format: Pre-recorded with live Q&A

Moderator(s): Nomi Harris

11:00-11:05
Opening remarks
Format: Pre-recorded with live Q&A

Moderator(s): Nomi Harris

  • Nomi Harris
11:05-11:10
What is the Open Bioinformatics Foundation?
Format: Pre-recorded with live Q&A

Moderator(s): Nomi Harris

  • Peter Cock
11:10-11:20
Platinum & Gold Sponsor videos
Format: Pre-recorded with live Q&A

Moderator(s): Nomi Harris

11:20-12:00
BOSC Keynote: Significant heterogeneities: Ecology’s emergence as open and synthetic science
Format: Live-stream

Moderator(s): Jason Williams

  • Christie Bahlai, Kent State University, USA

Presentation Overview:

Ecology has undergone several major cultural shifts in the past century. Eager to define the field as a ‘hard’ science distinct from natural history, early 20th century ecologists rigorously tested theory through intensive, controlled experiments, often working in relative isolation. The environmental movement of the latter 20th century again redefined the science of ecology in terms of connectivity and an appreciation of scale, fostering collaborations and infrastructure. Now, the open data revolution is changing how scientists approach explaining and predicting the behavior of ecological systems. Furthermore, these open and synthetic approaches create opportunities to incorporate diverse perspectives, data, and engagement in data-intensive ecology. However, simply sharing data cannot overcome a century of ecologists working in siloes. Not only are data collected in different places and across time subject to environmental variability, but differences in how observations are made, as a result of human choices about how to measure, how to record, and how (and if!) to share information, can dramatically impact our ability to understand ecological patterns. In this talk, I will explore how ecology has shifted from a ‘lone wolf’ science to a distributed, collectivist endeavor, and how technology and culture intersect to shape both scientific approaches and career paths.

Christie Bahlai, PhD, is a computational ecologist in the Department of Biological Sciences at Kent State University and a former Mozilla Fellow. She uses approaches from data science to help solve problems in conservation, sustainability, and ecosystem management, partnering with conservation and tech nonprofits. Her current research focuses on developing tools to support information synthesis in temporal ecology. Dr. Bahlai has strong interests in social justice in science, and believes that directly addressing diversity issues through technology and culture change benefits both scientists and science.

12:00-12:20
Lightning Talks Standards and Practices for Open Science (Psomopoulos, Deshpande, Selby)
Format: Pre-recorded with live Q&A

Moderator(s): Jason Williams

12:00-12:05
The ELIXIR Software Management Plan
Format: Pre-recorded with live Q&A

Moderator(s): Jason Williams

  • Fotis Psomopoulos, Institute of Applied Biosciences, Centre for Research and Technology Hellas / ELIXIR-GR, Greece
  • Renato Alves, European Molecular Biology Laboratory / ELIXIR-DE/de.NBI, Germany
  • Dimitrios Bampalikis, National Bioinformatics Infrastructure Sweden / ELIXIR-SE, Sweden
  • Leyla Jael Castro, ZB MED Information Centre for Life Sciences, Germany
  • José M. Fernández, Barcelona Supercomputing Center / ELIXIR-ES, Spain
  • Jennifer Harrow, ELIXIR-Hub, United Kingdom
  • Eva Martín del Pico, Barcelona Supercomputing Center / ELIXIR-ES, Spain
  • Allegra Via, IBPM-CNR, c/o Department of Biochemical Sciences, Italy

Presentation Overview:

Data Management Plans (DMPs) are now considered a key element of Open Science. They describe the data management life cycle for the data to be collected, processed and/or generated within the lifetime of a particular project or activity. A Software Management Plan (SMP) plays the same role for software; beyond its management perspective, the main advantage of an SMP is that it provides clear context to the software being developed and raises awareness. Although a few SMPs are already available, most require significant technical knowledge to use effectively. ELIXIR has developed a low-barrier SMP, specifically tailored for life science researchers and fully aligned with the FAIR Research Software principles. Starting from the Four Recommendations for Open Source Software, the SMP was iteratively refined by surveying the practices of the community and incorporating the received feedback. Currently available as a survey, future plans for the ELIXIR SMP include a human- and machine-readable version that can be automatically queried and connected to relevant tools and metrics within the ELIXIR Tools ecosystem and, hopefully, beyond.

12:05-12:10
Low availability of code and high availability of raw omics data accompanying biomedical studies
Format: Pre-recorded with live Q&A

Moderator(s): Jason Williams

  • Dhrithi Deshpande, University of Southern California, United States
  • Ruiwei Guo, University of Southern California, Los Angeles, United States
  • Aditya Sarkar, IIT Mandi, India
  • Serghei Mangul, University of California, Los Angeles, United States

Presentation Overview:

In biomedical research, it is imperative not only to publish a detailed description of the study design, methodology, results and interpretation, but also to make all the biomedical data and code used for scientific analyses sharable, well documented and reproducible. Analytical code and data availability is consequential for ensuring scientific transparency and reproducibility. However, raw data alone is not sufficient to make scientific analyses reproducible. We have reviewed code and data availability for articles published in ten prominent biomedical journals between 2016 and 2020. Our current results indicate that while the majority of articles comply with the data sharing policies of journals, most are not accompanied by code: 98.5% of the research papers share data, whereas only a meagre 12% share the code used for their analysis. For those papers that do share code, we intend to corroborate whether the code is usable and reproducible. Code availability has increased over the years, but not sufficiently. Code sharing supports reproducibility of scientific analyses and transparency. We hope our results will help researchers and journals adopt best practices to ensure scientific transparency and reproducibility.

12:10-12:15
BrAPI: a standard API specification for plant breeding data
Format: Pre-recorded with live Q&A

Moderator(s): Jason Williams

  • Peter Selby, Cornell University, United States

Presentation Overview:

Project Website: https://brapi.org/
Source Code: https://github.com/plantbreeding/
License: MIT License
Modern plant breeding research requires a large amount of data to function effectively. Data repositories are improving in their ability to store this data, but there is a growing need for interoperability between disparate data sources and applications. The Breeding Application Programming Interface (BrAPI) project offers a solution to this problem with a standardized RESTful web service API specification. This specification provides a standard data model for the plant breeding domain, plus a well-defined set of methods for interacting with the data. The goal of the project is to promote interoperability, data sharing, and code sharing across organizations that produce and consume data in this domain. BrAPI is a community-built project, and that community is well established and continuously growing. The standard is built based on concrete use cases to solve real interoperability challenges faced by the community. Beyond the core standard, the community has built a variety of tools and resources to help build and test implementations of the specification. The community is also constantly producing new BrAPI compliant applications, analysis tools, and visualizations that will work with any BrAPI data source.
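As a concrete illustration of the specification's REST conventions, the sketch below builds a paginated BrAPI v2 request URL in plain Python. The `page` and `pageSize` query parameters follow the published BrAPI specification; the base URL, helper name, and example resource are illustrative assumptions, not part of the BrAPI project itself.

```python
from urllib.parse import urlencode

def brapi_url(base: str, resource: str, page: int = 0, page_size: int = 10) -> str:
    """Build a paginated BrAPI v2 request URL (hypothetical helper)."""
    query = urlencode({"page": page, "pageSize": page_size})
    return f"{base.rstrip('/')}/brapi/v2/{resource}?{query}"

# Example: request page 2 of germplasm records, 50 per page.
url = brapi_url("https://test-server.example.org", "germplasm", page=2, page_size=50)
print(url)
# A real client would then GET this URL and read result.data plus
# metadata.pagination from the JSON response.
```

Because every BrAPI-compliant server exposes the same paths and pagination model, a thin helper like this works unchanged against any implementation.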

12:40-13:00
The GenePattern Notebook Environment
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Thorin Tabor, Broad Institute of MIT and Harvard, United States
  • Michael Reich, University of California San Diego, United States
  • Ted Liefeld, UCSD, United States
  • Barbara Hill, Harvard University, United States
  • Helga Thorvaldsdóttir, Harvard University, United States
  • Jill Mesirov, University of California San Diego, United States

Presentation Overview:

Interactive notebook systems have made significant strides toward realizing the potential of reproducible research, providing environments where users can interleave descriptive text, mathematics, images, and executable code into a “live” sharable, publishable “research narrative.” However, many of these systems require knowledge of a programming language and are therefore out of the reach of non-programming investigators. Even for those with programming experience, many tools and resources are difficult to incorporate into the notebook format. To address this gap, we have developed the GenePattern Notebook environment, which extends the popular Jupyter Notebook system to interoperate with the GenePattern platform for integrative genomics, making its hundreds of bioinformatics analysis methods available within notebooks. The GenePattern Notebook environment provides a free online workspace (notebook.genepattern.org) where investigators can create, share, run, and publish their own notebooks. A library of featured notebooks provides common bioinformatics workflows that users can copy and adapt to their own analyses. This talk will describe how the GenePattern Notebook environment extends the analytical and reproducible research capabilities of Jupyter Notebook and GenePattern and will discuss novel features and recent additions to the environment, including new bioinformatics tools, capabilities for ease of use, and migration to the JupyterLab interface.

13:00-13:20
Lightning Talks Tools for Open Science (Patil, Campbell, Ramnath)
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

13:00-13:05
Standardizing biomedical metadata curation using schematic
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Sujay Patil, Sage Bionetworks, United States
  • Bruno Grande, Sage Bionetworks, United States
  • James Eddy, Sage Bionetworks, United States
  • Yooree Chae, Sage Bionetworks, United States
  • Milen Nikolov, Sage Bionetworks, United States

Presentation Overview:

Data Coordinating Centers (DCCs) are becoming an integral part of large-scale biomedical research projects. A DCC is typically tasked with developing solutions to manage high volumes of biomedical data. The DCC of the Human Tumor Atlas Network (HTAN) — an NCI-funded Cancer Moonshot initiative — develops infrastructure to coordinate data generated from over 30 cutting-edge assays, spanning multiple data modalities and 10 research institutions.

Streamlining the definition of metadata standards, data annotation, (meta)data compliance checks, and provenance tracking in HTAN is fundamental to releasing Findable, Accessible, Interoperable and Reusable (FAIR) data.

The schematic package developed by the HTAN DCC enables these FAIR-data use cases. Schematic (Schema Engine for Manifest-based Ingress and Curation) provides:
• User-friendly interface for developing interoperable, standardized data models
• Services for generating data model compliant (meta)data submission spreadsheets
• Asset store interfaces associating metadata with data on various cloud platforms

Examples of schematic integrations in HTAN include the HTAN Data Curator (an R Shiny application facilitating submission of standardized (meta)data) and the HTAN Data Portal (a Next.js application previewing all metadata across 100 TB of data). Schematic provides the business and data model logic for these two services.

Schematic is distributed as an open-source Python package, registered on PyPI as schematicpy.
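To make the compliance-check idea concrete, here is a minimal pure-Python sketch of manifest-style metadata validation. The field names, required set, and allowed values are invented for this example; schematic's actual data models and API differ.

```python
# Hypothetical data model for the example: required manifest fields and
# a controlled vocabulary for one of them.
REQUIRED = {"sample_id", "assay", "organ"}
ALLOWED_ASSAYS = {"scRNA-seq", "WES", "imaging"}

def validate_row(row: dict) -> list:
    """Return a list of compliance errors for one manifest row (empty = valid)."""
    errors = []
    missing = REQUIRED - row.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if row.get("assay") not in ALLOWED_ASSAYS:
        errors.append(f"unknown assay: {row.get('assay')}")
    return errors

print(validate_row({"sample_id": "HTA1_1", "assay": "WES", "organ": "lung"}))  # []
print(validate_row({"sample_id": "HTA1_2", "assay": "qPCR"}))  # two errors
```

In schematic, the equivalent rules are derived from a shared, interoperable data model rather than hard-coded, which is what lets one schema drive both spreadsheet generation and submission-time checks.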

13:05-13:10
Cadmus: a pipeline for biomedical full-text retrieval
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Jamie Campbell, The University of Edinburgh, United Kingdom
  • Antoine Lain, School of Informatics, The University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom
  • David Fitzpatrick, MRC Human Genetics Unit, IGMM, University of Edinburgh, Western General Hospital Campus, EH4 2XU, United Kingdom
  • T. Ian Simpson, School of Informatics, The University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom

Presentation Overview:

Cadmus is an open-source system, developed in Python, to generate biomedical text corpora from published literature. The difficulty of obtaining such datasets has been a major impediment to methodological developments in biomedical NLP and has hindered the extraction of invaluable biomedical knowledge from the published literature. The Cadmus system is composed of three main steps: query & metadata collection, document retrieval, and parsing & collation of the resulting text into a single data repository. The system is open-source and flexible, retrieving open-access (OA) articles as well as those that the user, or their host institution, has permission to access. Cadmus retrieves and processes all available document formats and standardises their extracted content into plain text alongside article metadata. The retrieval rates of Cadmus vary depending on the query and licensing status. We present examples of data retrieval for four gene-based literature queries, available purely through open access (OA, without subscription) and with additional institutional access (University of Edinburgh, subscription). The average retrieval success rate was 89.03% with subscription access and 59.17% with OA only. Cadmus facilitates straightforward access to full-text literature articles and structures them to allow knowledge extraction via natural language processing (NLP) and text-mining methods.

13:10-13:15
Tripal creates online biological, community-based web portals for research and analysis
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Risharde Ramnath, University of Connecticut, United States
  • Lacey-Anne Sanderson, University of Saskatchewan, Canada
  • Sean Buehler, University of Connecticut, United States
  • Josh Burns, Washington State University, United States
  • Katheryn Buble, Washington State University, United States
  • Douglas Senalik, USDA-ARS Vegetable Crops Research Unit, United States
  • Peter Richter, University of Connecticut, United States
  • Carolyn Caron, University of Saskatchewan, Canada
  • Abdullah Almsaeed, University of Tennessee, United States
  • Bradford Condon, University of Tennessee, United States
  • Meg Staton, University of Tennessee, United States
  • Dorrie Main, Washington State University, United States
  • Kirstin Bett, University of Saskatchewan, Canada
  • Jill Wegrzyn, University of Connecticut, United States
  • Stephen Ficklin, Washington State University, United States

Presentation Overview:

Tripal (https://tripal.info) is a framework for the construction of online, community-oriented biological databases. It is specifically tailored for storing, displaying, searching and sharing genetics, genomics, breeding and ancillary data. It is in use around the world in multiple independent installations housing data for thousands of species. Its purpose is to reduce the resources needed to construct such online repositories, using community standards and meeting FAIR data principles. Tripal currently provides tools that allow site developers to easily import biological data in standard and custom flat file formats, create custom data pages and search tools, and share all data using RESTful web services built from controlled vocabularies. Tripal provides a documented API allowing for precise customization, which can be shared as extension modules. There are currently over 40 community-contributed extension modules available for others to enhance their sites. Tripal is governed by an active community of developers and stakeholders serving in advisory and project management committees.

13:20-13:40
Building a Federated Data Commons with Arvados
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Peter Amstutz, Curii Corporation, United States
  • Tom Clegg, Curii Corporation, United States
  • Lucas Di Pentima, Curii Corporation, United States
  • Ward Vandewege, Curii Corporation, United States
  • Alexander Sasha Wait Zaranek, Curii Corporation, United States
  • Sarah Zaranek, Curii Corporation, United States

Presentation Overview:

Data commons are software platforms that combine data, cloud computing infrastructure, and computational tools to create a resource for the managing, analyzing, and harmonizing of biomedical data. Arvados is an open source platform for managing, processing, and sharing genomic and other large scientific and biomedical data. The key capabilities of Arvados give users the ability to manage storage and compute at scale, and to integrate those capabilities with their existing infrastructure and applications using Arvados APIs. Several technical requirements have been suggested for a data commons ranging from metadata to security and compliance. We will go through these technical requirements and show that Arvados is well designed to fulfill the technical criteria for a data commons. We will demonstrate a prototype federated data commons we have created using the two clusters that we run for the Arvados Playground, a free-to-use installation of Arvados for evaluation and trial use.

13:40-14:00
The OpenCGA genome-optimised data store: accessing a hundred thousand genomes, a billion variants, and a hundred trillion genotypes in real-time.
Format: Pre-recorded with live Q&A

Moderator(s): Malvika Sharan

  • Ignacio Medina, University of Cambridge, United Kingdom
  • Laura Lopez-Real, Zetta Genomics Ltd., United Kingdom
  • Jacobo Coll, Zetta Genomics Ltd., United Kingdom
  • Pedro Furio, Zetta Genomics Ltd., United Kingdom
  • Will Spooner, Zetta Genomics Ltd., United Kingdom

Presentation Overview:

Adoption of genomics in biomedical research and healthcare is increasing rapidly due to the transformative power of precision medicine. However, file-based tools and methodologies cannot scale with predicted growth in genome numbers. The outcome is siloed genomic datasets that are not utilised to their full extent.

Our solution is OpenCGA, a ground-breaking variant store that brings the “big data” stack to genomics; a normalised “VCF database” that is rapid, scalable and secure. Users can de-duplicate and merge VCFs from multiple samples from any genotyping assay (WGS, WXS, panels, microarray, etc.). Records are linked to corresponding sample and clinical metadata, and unique variants are annotated against the latest reference information.

Web services, accessed via pyopencga, opencgaR, a command line interface, and a web user interface, support flexible, real-time genotype-phenotype queries and asynchronous jobs. Common genomic standards such as GA4GH Beacon and htsget are implemented.

OpenCGA is used by several of the world’s most challenging genomics initiatives. Development, led by the University of Cambridge and Genomics England, has attracted 37 contributors to date. The software is supported by a growing community of users and we would be delighted for more to join us!

14:20-14:30
Introduction to BOSC/Function joint session
Format: Live-stream

Moderator(s): Iddo Friedberg

  • Iddo Friedberg
14:30-14:40
Completing the functional human proteome together!
Format: Pre-recorded with live Q&A

Moderator(s): Iddo Friedberg

  • Monique Zahn, SIB Swiss Institute of Bioinformatics, Switzerland
  • Paula Duek, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland
  • Camille Mary, University of Geneva, Switzerland
  • Amos Bairoch, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland
  • Lydie Lane, University of Geneva and SIB Swiss Institute of Bioinformatics, Switzerland

Presentation Overview:

Research on the human proteome reached a milestone last year with >90% of predicted human proteins detected, according to the HUPO Human Proteome Project. Its neXt-CP50 project aims to characterize 50 of the 1273 proteins with evidence at the protein level (PE1) but no experimentally determined function. In order to support this project and the scientific community in its efforts to complete the human functional proteome, neXtProt has begun to host protein function predictions. The CC BY 4.0 license that applies to the data in neXtProt will also apply to these predictions to promote their reuse. The submitter(s) can remain anonymous or have their ORCID(s) linked to the prediction to give them credit. Predictions for 7 entries are now in neXtProt - for an example, see https://www.nextprot.org/entry/NX_Q6P2H8/function-predictions. These predictions were obtained in the framework of the Fonctionathon course for undergraduates given at the University of Geneva in 2020. We are calling on the community to propose functional predictions, resulting from manual analysis of the available data and literature, for the proteins with no known function. This approach will complement the Critical Assessment of protein Function Annotation algorithms (CAFA) experiment, which uses computational methods to predict protein function.

14:40-15:20
BOSC/Function Keynote: Open approaches to advance data-intensive biomedicine
Format: Live-stream

Moderator(s): Iddo Friedberg

  • Lara Mangravite, Sage Bionetworks, USA

Presentation Overview:

Data for health research is all around us. In the last decade, we have moved from a paradigm where research data is collected in a research clinic to a paradigm where research data may stem from anywhere – including our visits to the doctor and our daily interactions with technology. These information streams offer tremendous opportunity to advance research in areas from public health to precision medicine. They can also be extremely intrusive – requiring us to evolve the ways in which we collect, manage, and analyze research data. As always, the translation of science into medicine requires robust and reproducible outcomes with clear actionable consequence. Here, we will discuss approaches to apply open science principles – transparency, reproducibility, and independent contribution – to meet the evolving needs of data-intensive biomedical research.

Lara Mangravite, PhD, is president of Sage Bionetworks, a non-profit research organization that focuses on open practices to advance biomedicine through data-driven science and digital research. Recognizing that all research is limited by restrictions placed on the distribution of information, Sage works closely with institutes, foundations, and research communities to redefine how complex biological data is gathered, shared and used. By improving information flow and research practices, Sage seeks to enable research outcomes of sufficient confidence to support translation. Dr. Mangravite obtained a PhD in pharmaceutical chemistry from the University of California, San Francisco, and completed a postdoctoral fellowship in cardiovascular pharmacogenomics at the Children’s Hospital Oakland Research Institute.

Friday, July 30th
11:00-11:40
BOSC Keynote: Contribution du mouvement maker dans le domaine de la biotechnologie en Afrique: Une perspective de la science ouverte (Contribution of the maker movement to biotechnology in Africa: An open science perspective)
Format: Pre-recorded with live Q&A

Moderator(s): Hervé Ménager

  • Thomas Hervé Mboa Nkoudou, University of Ottawa and Mboalab, Canada/Cameroon

Presentation Overview:

The maker movement is a community-based movement driven by a common understanding that democratizing access to tools and technologies will revolutionize the distribution of material goods and disrupt existing socio-economic systems. With the Internet, the maker movement is reinforced by trends toward openness or, better, open science. Indeed, nowadays, information circulates instantaneously from one end of the world to the other, offering the possibility to exchange, share and contribute to the enrichment of knowledge, with implications in the fields of health, environment, education, etc. In this presentation, I will show evidence of the contribution of the maker movement to the democratisation of biotechnology in Africa. I will also discuss the local realities that need to be considered in order to ensure the success of such an undertaking. To do this, I will address the following points:

The dynamics of the circulation of biotechnology knowledge (protocols, data, design, etc.) in makerspaces;
The presentation of concrete initiatives of Biomakerspaces in Africa, as well as the impact that these innovation spaces have in their immediate environment;
Obstacles to the implementation of such initiatives and possible solutions to overcome them.

Thomas Hervé Mboa Nkoudou, PhD, is a researcher in Open Science and Science Communication. With a background in biochemistry, he works to promote DIYbio and democratize biotechnology in Africa, and aims to help create a more inclusive, data-driven research community. He founded the Mboalab innovation lab in Cameroon, and co-leads the African Institute of Open Science and Hardware. Dr. Mboa is also the International President of APSOHA, the Association for the Promotion of Open Science in Haiti and Africa. He is currently a postdoc at the University of Ottawa in Canada, working with the Open African Innovation Research Partnership. Dr. Mboa was featured in a recent article in Nature about early-career researchers who are leaders.
Note: This talk will be delivered in French with English subtitles.

11:40-12:00
Lightning Talks Analysis Tools A (Gandham, Pierce-Ward, Herzeel)
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

11:40-11:45
GATK for Microbes
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • Bhanu Gandham, Broad Institute of MIT and Harvard, United States
  • Andrea Haessly, Broad Institute of MIT and Harvard, United States
  • James Emery, Broad Institute of MIT and Harvard, United States

Presentation Overview:

The detection of mutations in microbial genomes is essential to understanding drug resistance, immune evasion, and other epidemiological characteristics of infectious disease. In an effort to leverage the algorithms that have already become a standard for human genomic data processing, thousands of researchers have applied the Broad’s Genome Analysis Toolkit (GATK) to microbial variant discovery.

However, this human-focused software may not currently provide the best results for all pathogens. To provide the bacterial research community with robust variant calling methods - funded by the Chan Zuckerberg Initiative’s EOSS program - we optimized the GATK to call short variants on bacterial genomic datasets. We developed an automated workflow that calls high-quality filtered variants against a single reference using short-read sequencing data. We optimized for the circular structure of some microbial genomes, varying read depths, and other sequencing and mapping errors typical of microbial data, resulting in improved sensitivity and precision. We are actively working on expanding the GATK’s usability to other microbes - viruses, fungi, and protozoans. To encourage community adoption, we have made our scalable, reproducible, runtime- and cost-efficient workflow publicly available, along with technical documentation and test datasets, in Terra, a cloud-native platform developed at the Broad Institute.

11:45-11:50
Sourmash protein k-mer sketches for large-scale sequence comparisons
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • N. Tessa Pierce-Ward, University of California, Davis, United States
  • Luiz Irber, University of California, Davis, United States
  • Olga Botvinnik, Chan Zuckerberg Biohub, United States
  • Taylor Reiter, University of California, Davis, United States
  • C. Titus Brown, University of California, Davis, United States

Presentation Overview:

Alignment-free methods for estimating sequence similarity have become critical for scaling sequence comparisons such as taxonomic classification and phylogenetic analysis to large-scale datasets. The majority of alignment-free methods rely upon exact matching of DNA k-mers: nucleotide subsequences of length k that can be counted and compared across datasets, with or without use of subsampling methods such as MinHash. sourmash is an open-source command line tool and Python library for sketching nucleotide and amino acid k-mers for sequence analysis. Sourmash implements Scaled MinHash, a MinHash variant that selects a chosen fraction of k-mers, rather than a chosen number of k-mers, as a representative subset. In addition to Jaccard Index comparisons between sketches, this k-mer sampling approach enables Containment estimation, which improves comparisons between datasets of different sizes.
As DNA k-mer methods rely on exact sequence matches, they can suffer from limited sensitivity when comparing highly polymorphic sequences or classifying organisms from groups that are not well represented in reference databases. Here, we demonstrate the utility of sourmash Scaled MinHash protein k-mer sketches for sequence comparisons across larger evolutionary distances, including high-level taxonomic classification using sourmash gather and alignment-free estimation of Average Amino Acid Identity (AAI).
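The Scaled MinHash subsampling described above can be sketched in a few lines of plain Python (an illustration of the idea, not sourmash's implementation): a k-mer is kept whenever its hash falls below max_hash / scaled, so a sketch retains roughly 1/scaled of all distinct k-mers while still supporting Jaccard and containment estimates.

```python
import hashlib

MAX_HASH = 2**64  # hash values are treated as 64-bit integers

def kmer_hash(kmer: str) -> int:
    # Any well-mixed 64-bit hash works; SHA-1 truncated to 8 bytes is used
    # here only for determinism in this sketch.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def scaled_minhash(seq: str, k: int = 4, scaled: int = 2) -> set:
    """Keep hashes below MAX_HASH/scaled, i.e. ~1/scaled of distinct k-mers."""
    threshold = MAX_HASH // scaled
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {h for h in map(kmer_hash, kmers) if h < threshold}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(a: set, b: set) -> float:
    # Asymmetric: fraction of a's sketch found in b's sketch.
    return len(a & b) / len(a) if a else 0.0

s1 = scaled_minhash("ACGTACGTTTGACGTTGCAA")
s2 = scaled_minhash("ACGTACGTTTGACGTAAAAA")
print(jaccard(s1, s2), containment(s1, s2))
```

Because the keep/discard decision depends only on each k-mer's hash value, sketches of two datasets are directly comparable, and containment remains meaningful even when the datasets differ greatly in size.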

11:50-11:55
elPrep 5: A multi-threaded tool for sequence analysis
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • Charlotte Herzeel, imec, Belgium
  • Pascal Costanza, imec, Belgium

Presentation Overview:

We present elPrep 5, the latest release of the elPrep framework for analyzing sequence alignment/map files. The main new feature of elPrep 5 is the introduction of variant calling. elPrep 5 can now execute the full pipeline described by the GATK best practices for variant calling, consisting of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces BAM and VCF outputs identical to those of GATK 4, while parallelizing and merging the computation of the pipeline steps to significantly speed up the runtime. elPrep 5 speeds up the variant calling pipeline by a factor of 8-16x compared to GATK 4 on both whole-exome and whole-genome data, without requiring specialized or proprietary accelerator hardware. elPrep 5 is developed as an open-source project on GitHub and is designed to work with community-defined standards and file formats for NGS analysis. elPrep has a strong community of users among hospitals, researchers, and companies who praise elPrep for its ease of use: elPrep is distributed as a single stand-alone binary, allowing easy installation, and has a simple user interface where a full variant calling pipeline can be expressed as a single command-line invocation.

12:00-12:20
Lightning Talks Analysis Tools B (Kunzmann, Twesigomwe, Gatzen)
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

12:00-12:05
Novelties in Biotite: A Python library for computational molecular biology
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • Patrick Kunzmann, Technical University Darmstadt, Germany
  • Tom David Müller, Technical University Darmstadt, Germany
  • Kay Hamacher, Technical University Darmstadt, Germany

Presentation Overview: Show

Molecular biology has become increasingly data-driven in the last decades. To facilitate writing analysis software for computational molecular biology, the open-source Python library Biotite was created: it provides flexible building blocks for common tasks in sequence and macromolecular data analysis in the spirit of Biopython. Biotite covers the bioinformatics workflow from fetching and reading files to data analysis and manipulation. High computational efficiency is achieved via C extensions and extensive use of NumPy. With Python being an easy-to-learn programming language, Biotite aims to address a large variety of potential users: from beginners, who can use it to automate data analysis and to prepare the input data for a program, to experienced developers, who create bioinformatics software. Since the initial presentation at BOSC 2019, many new functionalities have been added to Biotite. Hence, we would like to take a quick glance at the major features added since then:

- A modular system for heuristic sequence alignments and read mappings
- Base pair measurement and nucleic acid secondary structure analysis
- Molecular visualization via a seamless interface to PyMOL
- Docking with Autodock Vina

12:05-12:10
StellarPGx: A Nextflow Pipeline for Calling Star Alleles in Cytochrome P450 Genes
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • David Twesigomwe, University of the Witwatersrand, Johannesburg, South Africa
  • Britt Drögemöller, University of Manitoba, Winnipeg, Manitoba, Canada
  • Galen Wright, University of Manitoba, Winnipeg, Manitoba, Canada
  • Azra Siddiqui, University of the Witwatersrand, Johannesburg, South Africa
  • Jorge Da Rocha, University of the Witwatersrand, Johannesburg, South Africa
  • Zane Lombard, University of the Witwatersrand, Johannesburg, South Africa
  • Scott Hazelhurst, University of the Witwatersrand, Johannesburg, South Africa

Presentation Overview: Show

Genotype-guided therapy promotes drug efficacy and safety. However, accurately calling star alleles (haplotypes) in cytochrome P450 (CYP) genes, which encode over 80% of drug-metabolising enzymes, is challenging. Notably, CYP2D6, CYP2B6 and CYP2A6, which have neighbouring pseudogenes, present short-read alignment difficulties, high polymorphism and complex structural variation. We present StellarPGx, a Nextflow pipeline for accurately genotyping CYP genes by leveraging genome graph-based variant detection and combinatorial star allele assignment. StellarPGx has been validated using 109 whole genome sequence samples for which the Genetic Testing Reference Material Coordination Program (GeT-RM) has recently provided consensus truth CYP2D6 alleles. StellarPGx achieved the highest CYP2D6 genotype concordance with GeT-RM (99%) compared to existing callers, namely Cyrius (98%), Aldy (82%) and Stargazer (84%). The implementation of StellarPGx using Nextflow, Docker and Singularity facilitates its portability, reproducibility and scalability on various user platforms. StellarPGx is publicly available from https://github.com/SBIMB/StellarPGx.
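
The combinatorial assignment step can be sketched, under heavy simplification, as a search over allele pairs whose defining variants jointly explain the observed calls. The allele definitions and variant names below are invented for illustration; StellarPGx additionally relies on graph-based variant detection, zygosity, and structural variation, none of which this toy captures.

```python
from itertools import combinations_with_replacement

# Hypothetical star-allele definitions: allele name -> set of core variants.
STAR_ALLELES = {
    "*1": frozenset(),                              # reference allele
    "*2": frozenset({"2851C>T", "4181G>C"}),
    "*4": frozenset({"1847G>A"}),
}

def call_diplotype(observed, definitions=STAR_ALLELES):
    """Return all allele pairs whose combined core variants exactly
    match the observed variant set.  Real callers further use phasing
    and zygosity to disambiguate candidates such as *1/*4 vs *4/*4."""
    observed = frozenset(observed)
    hits = []
    for a1, a2 in combinations_with_replacement(sorted(definitions), 2):
        if definitions[a1] | definitions[a2] == observed:
            hits.append(f"{a1}/{a2}")
    return hits
```

Even this toy shows why the problem is combinatorial: several diplotypes can explain the same unphased variant set, so additional evidence is needed to pick one.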

12:10-12:15
Evaluating functional equivalence between variant calling pipelines
Format: Pre-recorded with live Q&A

Moderator(s): Jessica Maia

  • Michael Gatzen, Broad Institute of MIT and Harvard, Germany
  • Geraldine Van der Auwera, Broad Institute of MIT and Harvard, United States

Presentation Overview: Show

Imagine you have two different methods that purport to do the same thing. How do you determine, not which one is better, but whether their outputs are similar enough that you could use them interchangeably, and combine outputs for further analysis, without risking batch effects?

We were recently presented with the challenge of evaluating functional equivalence between two variant calling pipelines. We found that the standard "Jaccard test" method sometimes gave the appearance of functional equivalence in cases where there were non-trivial differences in the quality of some of the calls. To solve this, we developed a new approach that evaluates F1 scores (the harmonic mean of precision and recall) across a realistic range of QUAL thresholds, which allows us to complement the Jaccard concordance analysis and make a more reliable assessment of functional equivalence between two pipelines.

We demonstrate the use of this method using an open source WDL pipeline that automatically performs and matches all necessary analyses.

This method can be generalized for evaluating different tools or configurations for a wide range of genomics use cases, in order to mitigate the risk of batch effects when switching from one configuration to another.
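
A minimal stdlib sketch of the two metrics described above (not the authors' WDL pipeline): site-level Jaccard concordance between two call sets, and an F1 curve swept across QUAL thresholds against a truth set.

```python
def jaccard(calls_a, calls_b):
    """Site-level concordance: fraction of variant sites shared by both call sets."""
    a, b = set(calls_a), set(calls_b)
    return len(a & b) / len(a | b)

def f1_at_threshold(calls, truth, min_qual):
    """F1 of one call set (site -> QUAL) against a truth set,
    keeping only calls at or above the QUAL threshold."""
    kept = {site for site, qual in calls.items() if qual >= min_qual}
    tp = len(kept & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def f1_curve(calls, truth, thresholds):
    """Sweep QUAL thresholds: two functionally equivalent pipelines
    should produce closely matching curves, not merely similar site sets."""
    return [f1_at_threshold(calls, truth, t) for t in thresholds]
```

Comparing whole curves rather than a single overlap number is what exposes pipelines that call the same sites but assign them systematically different qualities.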

12:40-13:00
Introducing WARP: A collection of cloud-optimized workflows for biological data processing and reproducible analysis
Format: Pre-recorded with live Q&A

Moderator(s): Chris Fields

  • Geraldine Van der Auwera, Broad Institute, United States
  • Kylee Degatano, Broad Institute, United States
  • George Grant, Broad Institute, United States
  • Farzaneh Khajouei, Broad Institute, United States
  • Elizabeth Kiernan, Broad Institute, United States
  • Kishori Konwar, Broad Institute, United States
  • Nikelle Petrillo, Broad Institute, United States
  • Jessica Way, Broad Institute, United States

Presentation Overview: Show

The Broad Institute Data Sciences Platform has recently released WDL Analysis Research Pipelines (WARP), a brand new, public GitHub repository of cloud-optimized WDL workflows that are used in production at the Broad Institute and in large collaborative efforts such as gnomAD, All of Us, the Human Cell Atlas, and the BRAIN Initiative. In WARP, we are deeply committed to scientific reproducibility, provenance, and transparency, so every workflow is released in GitHub with a semantic version number and release notes. They are all open source under a BSD 3-Clause license, and they call only open-source tools published in public Docker images. We’ve registered each of the workflows in Dockstore so they can be easily imported into Terra, a scalable cloud platform supporting biomedical research. Our hope is that the release of WARP will foster a community around the use, improvement, and contribution of production-quality bioinformatics cloud pipelines. Learn more at https://broadinstitute.github.io/warp/

13:00-13:20
WFPM: a novel WorkFlow Package Manager to enable collaborative workflow development via reusable and shareable packages
Format: Pre-recorded with live Q&A

Moderator(s): Chris Fields

  • Junjun Zhang, Ontario Institute for Cancer Research, Canada
  • Linda Xiang, Ontario Institute for Cancer Research, Canada
  • Christina Yung, Ontario Institute for Cancer Research, Canada
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • ICGC-ARGO Data Coordination and Management Working Group, International Cancer Genome Consortium, Canada

Presentation Overview: Show

Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but significantly lag behind in supporting component reuse and sharing, which discourages the community from adopting the widely practiced Don’t Repeat Yourself (DRY) principle.

To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which "best practice" genome analysis workflows are encapsulated in well-defined packages that are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows that extensively reuse component packages. This flexible architecture makes it possible for anyone in the bioinformatics community to reuse the packages within their own workflows.

Recently, we have developed an open-source command line interface (CLI) tool called WorkFlow Package Manager (WFPM) CLI that provides assistance throughout the entire workflow development lifecycle to promote best-practice software design and maintenance. With a highly streamlined process and automated template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barrier for users to develop standard packages. WFPM CLI source code is available at: https://github.com/icgc-argo/wfpm

13:20-13:40
Evolution of the Nextflow workflow management system
Format: Live-stream

Moderator(s): Chris Fields

  • Evan Floden, Seqera Labs, Spain
  • Paolo Di Tommaso, Seqera Labs, Spain
  • Rob Lalonde, Seqera Labs, Canada
  • Kevin Sayers, Seqera Labs, United States

Presentation Overview: Show

Nextflow has continued to add innovative features to facilitate bioinformatics analyses at any scale. In addition to supporting HPC schedulers, Kubernetes, AWS, and GCP, the latest Nextflow version also integrates Azure Batch as a managed cloud batch execution service. The built-in support for all major cloud providers extends and guarantees the portability of Nextflow workflows across different platforms. The Nextflow Kubernetes executor has also been introduced.

A key development of the project has been the addition of a plugin system. This provides a dynamic way to add support for new features into Nextflow ensuring that the rapidly evolving needs of bioinformatics developers can be met and opens the door to community plugins to further improve and extend the core functionality.

At BOSC 2020, the DSL2 revision to the language was introduced. This has seen widespread adoption over the past year with developers embracing the new workflow model leading to more reusable components and lowering development times. The modularization has also resulted in the ability to more easily test components promoting more robust and reproducible workflows.

The maturation of the Nextflow ecosystem is a testament to the strength of open-source software in enabling collaboration and reproducible research in science.

13:40-13:50
Sapporo: an implementation of GA4GH Workflow Execution Service standard to bridge the different workflow language communities
Format: Pre-recorded with live Q&A

Moderator(s): Chris Fields

  • Hirotaka Suetake, The University of Tokyo, Japan
  • Tazro Ohta, Database Center for Life Science, Japan
  • Manabu Ishii, Genome Analytics Japan, Japan
  • Tomoya Tanjo, National Institute of Genetics, Japan

Presentation Overview: Show

Many workflow languages have been developed and released as open-source software, and workflows are shared within their respective language communities. Users choose a language according to their use case, their computing environment, or their preference for the language itself. However, learning different concepts, installing a runtime, and building an environment make it difficult to use workflows written in a language other than the one in daily use. In this presentation, we introduce the new version of the Sapporo service, which now provides a common interface for executing multiple workflow languages. The current implementation allows users to execute six workflow runners that support the following workflow languages: Common Workflow Language (CWL), Workflow Description Language (WDL), Nextflow, and Snakemake. The Sapporo service depends only on Docker, which makes it fairly easy to introduce to any environment. For WES to provide a fully equivalent interface to different workflow languages, however, the languages need to share features such as a schema for input parameters and their types. At BOSC, where many developers of workflow languages and their users gather, we would like to discuss what is needed for better sharing of workflows among the communities.

13:50-14:00
Dockstore - 2021 update
Format: Pre-recorded with live Q&A

Moderator(s): Chris Fields

  • Denis Yuen, Ontario Institute for Cancer Research, Canada
  • Gary Luu, Ontario Institute for Cancer Research, Canada
  • Charles Overbeck, UC Santa Cruz Genomics Institute, United States
  • Natalie Perez, UC Santa Cruz Genomics Institute, United States
  • Walter Shands, UC Santa Cruz Genomics Institute, United States
  • David Steinberg, UC Santa Cruz Genomics Institute, United States
  • Brian O'Connor, Broad Institute, United States
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Ash O'Farrell, UC Santa Cruz Genomics Institute, United States
  • Avani Khadilkar, UC Santa Cruz Genomics Institute, United States
  • Chaz Reid, UC Santa Cruz Genomics Institute, United States
  • Elizabeth Sheets, UC Santa Cruz Genomics Institute, United States
  • Gregory Hogue, Ontario Institute for Cancer Research, Canada
  • Kathy Tran, Ontario Institute for Cancer Research, Canada
  • Benedict Paten, UC Santa Cruz Genomics Institute, United States

Presentation Overview: Show

Dockstore is an open source platform for sharing bioinformatics tools and workflows using popular descriptor languages such as the Common Workflow Language (CWL), Workflow Description Language (WDL), Nextflow, and Galaxy. Dockstore aims to make workflows reproducible and runnable in any environment that supports Docker. Here, we highlight new features that the Dockstore team has been working on since our last talk in 2019.

For better support of the workflow lifecycle, Dockstore has added support for GitHub apps, allowing automated updates of workflow content. Selected versions of a workflow can be snapshotted and exported to Zenodo to mint DOIs for publication. Improvements have also been made to the platform's cloud security, in line with NIH recommendations, to keep workflows safe.

As examples of usability improvements, Dockstore has revamped organization and home pages, highlighting updated content for logged-in users, and added notifications for events.

Finally, users can use the launch-with feature to run workflows on newly added platforms such as CGC, Cavatica, AnVIL, Nextflow Tower, BioData Catalyst, and Galaxy. ORCID support has been added to help identify users on the site. Support has been added for WDL 1.0, CWL 1.1, Nextflow’s DSL2, and GA4GH’s TRS V2 standard.

14:20-14:40
Lightning Talks Visualization (Reijnders, Diesh, Cain)
Format: Pre-recorded with live Q&A

Moderator(s): Karsten Hokamp

14:20-14:25
Summary Visualizations of Gene Ontology Terms With GO-Figure!
Format: Pre-recorded with live Q&A

Moderator(s): Karsten Hokamp

  • Maarten Reijnders, Department of Ecology and Evolution, University of Lausanne, Switzerland
  • Robert Waterhouse, University of Lausanne, Switzerland

Presentation Overview: Show

The Gene Ontology (GO) is a cornerstone of genomics research that drives discoveries through knowledge-informed computational analysis of biological data from large-scale assays. Key to this success is how the GO can be used to support hypotheses or conclusions about the biology or evolution of a study system by identifying annotated functions that are overrepresented in subsets of genes of interest. Graphical visualizations of such GO term enrichment results are critical to aid interpretation and avoid biases by presenting researchers with intuitive visual data summaries. Currently there is a lack of standalone open-source software solutions that facilitate explorations of key features of multiple lists of GO terms. To address this, we developed GO-Figure!, open-source Python software for producing user-customisable semantic similarity scatterplots of redundancy-reduced GO term lists. The lists are simplified by grouping terms with similar functions using their information content and semantic similarities, with user control over grouping thresholds. Representatives are selected for plotting in two-dimensional semantic space where similar terms are placed closer to each other, with an array of user-customisable graphical attributes. GO-Figure! offers a simple solution for command-line plotting of informative summary visualizations of lists of GO terms, designed for exploratory data analyses and dataset comparisons.
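
The redundancy-reduction step can be illustrated with a toy greedy scheme (our simplification, not GO-Figure!'s exact algorithm): a term joins an existing group when its semantic similarity to the group's representative exceeds a user-set threshold, and the highest-information-content member represents each group. The similarity measure below, a Jaccard overlap of ancestor sets, is a stand-in for the information-content-based measures GO-Figure! actually uses.

```python
def ancestor_jaccard(ancestors):
    """Toy semantic similarity: Jaccard overlap of two terms' ancestor sets."""
    def sim(a, b):
        return len(ancestors[a] & ancestors[b]) / len(ancestors[a] | ancestors[b])
    return sim

def reduce_terms(terms, similarity, ic, threshold=0.5):
    """Greedy redundancy reduction: visit terms in decreasing information
    content (IC), merging each into the first group whose representative is
    similar enough; return one representative per group."""
    groups = []  # each group is a list of term IDs; groups[i][0] is its representative
    for term in sorted(terms, key=lambda t: ic[t], reverse=True):
        for group in groups:
            if similarity(term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return [g[0] for g in groups]
```

Raising the threshold yields more, finer-grained groups; lowering it collapses the list further, which is the user-controlled trade-off the abstract describes.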

14:25-14:30
JBrowse 2: A data visualization platform with special features for comparative genomics and structural variant visualization
Format: Pre-recorded with live Q&A

Moderator(s): Karsten Hokamp

  • Colin Diesh, University of California, Berkeley, United States
  • Robert Buels, University of California, Berkeley, United States
  • Garrett Stevens, University of California, Berkeley, United States
  • Peter Xie, University of California, Berkeley, United States
  • Teresa Martinez, University of California, Berkeley, United States
  • Elliot Hershberg, University of California, Berkeley, United States
  • Andrew Duncan, Ontario Institute for Cancer Research, United States
  • Gregory Hogue, Ontario Institute for Cancer Research, Canada
  • Shihab Dider, University of California, Berkeley, United States
  • Eric Yao, University of California, Berkeley, United States
  • Robin Haw, Ontario Institute for Cancer Research, Canada
  • Scott Cain, Ontario Institute for Cancer Research, United States
  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Ian Holmes, University of California, Berkeley, United States

Presentation Overview: Show

JBrowse 2 is a new web-based genome browser that offers multiple modalities of data visualization. It is written using React and TypeScript, and can show multiple views on the same screen. Third party developers can create plugins for JBrowse 2 that add new track types, data adapters, and even new view types. We have specialized features for comparative genomics, such as dotplot and linear views of genome comparisons. We also have specialized features for structural variant analysis, including long read vs reference comparisons, and showing read support for SVs by connecting split reads across breakpoint junctions. Users of the web-based JBrowse 2 can share their entire session (e.g. all the settings, views, and even extra tracks that they have added) with other users with short shareable URLs. We are continuing to add new features, and recently released high quality SVG export and specialized visualization of insertions. We are also working on further platforms such as JBrowse 2 Desktop, an Electron-packaged version of JBrowse 2.

14:30-14:35
Using Docker to make JBrowse deployment easier
Format: Pre-recorded with live Q&A

Moderator(s): Karsten Hokamp

  • Lincoln Stein, Ontario Institute for Cancer Research, Canada
  • Colin Diesh, University of California, Berkeley, United States
  • Scott Cain, Ontario Institute for Cancer Research, United States
  • Ian Holmes, University of California, Berkeley, United States
  • Sibyl Gao, Ontario Institute for Cancer Research, Canada
  • Adam Wright, Ontario Institute for Cancer Research, Canada
  • Todd Harris, Ontario Institute for Cancer Research, United States
  • Paulo Nuin, Ontario Institute for Cancer Research, Canada
  • Olin Blogett, Jackson Laboratory, United States
  • Nathan Dunn, Lawrence Berkeley National Lab, United States

Presentation Overview: Show

Docker is a widely used open source container platform for the development, distribution and deployment of applications. Here we present two Docker containers developed to ease the process of JBrowse data processing and server deployment, along with several implementation examples. The first container automates the processing of GFF into a format digestible for JBrowse. It has tools to fetch GFF-based genome annotations from publicly available sites, process them into either tabix-indexed GFF or nested containment list (NCList) JSON, and then place the resulting data into an Amazon AWS S3 bucket. The second container creates a very lightweight (typically <200MB) nginx server for JBrowse that can be configured to use the data in the AWS S3 bucket. These base containers can be pulled from Docker Hub (http://hub.docker.com/) for inclusion in application-specific containers to complete data processing and server creation. We present use cases for these containers in which they are used to process GFF and to instantiate JBrowse servers for several large projects including the Alliance of Genome Resources, WormBase and ZFIN. Provisioning these tools with Docker makes it easier to automate data processing via tools like Ansible and increases their portability.

14:40-14:50
Ersilia: a hub of open source drug discovery models for global health
Format: Pre-recorded with live Q&A

Moderator(s): Moni Munoz-Torres

  • Gemma Turon, Ersilia Open Source Initiative, United Kingdom
  • Edoardo Gaude, Ersilia Open Source Initiative, United Kingdom
  • Miquel Duran-Frigola, Ersilia Open Source Initiative, United Kingdom

Presentation Overview: Show

Computational methods hold the promise to revolutionize the drug discovery field, capitalizing on the vast amount of experimental data accumulated over the years and the latest advances in artificial intelligence and machine learning (AI/ML). Unfortunately, due to the lack of a unified framework for effective dissemination, AI/ML tools remain inaccessible to the broad scientific community, and their use is not yet integrated into day-to-day biomedical research. We have created the Ersilia Model Hub, a platform hosting ready-to-use AI/ML models to help non-expert researchers identify drug candidates for orphan diseases, design molecules de novo, understand mechanisms of action or anticipate adverse side-effects. The ultimate goal of Ersilia is to lower the barrier to drug discovery, encouraging academic groups and enterprises to pursue the development of new medicines following the principles of Open Science. Ersilia incorporates both models existing in the literature and a battery of in-house models focused on diseases that are currently neglected by the pharmaceutical industry due to estimated low returns. Ersilia’s core technology is the Chemical Checker (Duran-Frigola et al, Nature Biotechnology, 2020), a resource that embeds small molecule information in a unified vectorial format, providing an excellent basis for efficient modelling.

14:50-15:00
Knowledge graph analytics platform combining LINCS and IDG for drug target illumination featuring preliminary results for Parkinson's Disease
Format: Pre-recorded with live Q&A

Moderator(s): Moni Munoz-Torres

  • Jeremy Yang, Indiana University, United States
  • Christopher Gessner, Indiana University, United States
  • Joel Duerksen, Data2Discovery, Inc., United States
  • Daniel Biber, Data2Discovery, Inc., United States
  • Jessica Binder, University of New Mexico, United States
  • Murat Ozturk, Data2Discovery, Inc., United States
  • Brian Foote, Data2Discovery, Inc., United States
  • Robin McEntire, Data2Discovery, Inc., United States
  • Kyle Stirling, Data2Discovery, Inc., United States
  • Ying Ding, University of Texas, Austin, United States
  • David Wild, Indiana University, United States

Presentation Overview: Show

NIH programs LINCS, “Library of Integrated Network-based Cellular Signatures”, and IDG, “Illuminating the Druggable Genome”, have generated rich open access datasets for the study of the molecular basis of health and disease. LINCS provides experimental genomic and transcriptomic evidence. IDG provides curated knowledge for illumination and prioritization of novel protein drug target hypotheses. Together, these resources can support a powerful new approach to identifying novel drug targets for complex diseases. Integrating LINCS and IDG, we built the Knowledge Graph Analytics Platform (KGAP) for identification and prioritization of drug target hypotheses, via the open source package kgap_lincs-idg. We investigated results for Parkinson’s Disease (PD), which inflicts severe harm on human health and resists traditional approaches. Approved drug indications from IDG’s DrugCentral were starting points for evidence paths exploring chemogenomic space via LINCS expression signatures for associated genes, evaluated as targets by integration with IDG. The KGAP scoring function was validated against a gold standard of PD-associated genes with published mechanism-of-action elucidation. IDG was used to rank and filter KGAP results for novel PD targets and for manual validation. KGAP thereby empowers the identification and prioritization of novel drug targets for complex diseases such as PD.

15:00-15:20
Robust variant interpretation in precision oncology using a graph knowledge base
Format: Pre-recorded with live Q&A

Moderator(s): Moni Munoz-Torres

  • Caralyn Reisle, Canada's Michael Smith Genome Sciences Centre, Canada
  • Laura Williamson, Canada's Michael Smith Genome Sciences Centre, Canada
  • Erin Pleasance, Canada's Michael Smith Genome Sciences Centre, Canada
  • Dustin Bleile, Canada's Michael Smith Genome Sciences Centre, Canada
  • Anna Davies, Canada's Michael Smith Genome Sciences Centre, Canada
  • Brayden Pellegrini, Canada's Michael Smith Genome Sciences Centre, Canada
  • Karen Mungall, Canada's Michael Smith Genome Sciences Centre, Canada
  • Eric Chuah, Canada's Michael Smith Genome Sciences Centre, Canada
  • Martin Krzywinski, Canada's Michael Smith Genome Sciences Centre, Canada
  • Raphael Matiello Pletz, Canada's Michael Smith Genome Sciences Centre, Canada
  • Jacky Li, Canada's Michael Smith Genome Sciences Centre, Canada
  • Ross Stevenson, Canada's Michael Smith Genome Sciences Centre, Canada
  • Hansen Wong, Canada's Michael Smith Genome Sciences Centre, Canada
  • Abbey Reisle, Canada's Michael Smith Genome Sciences Centre, Canada
  • Matthew Douglas, Canada's Michael Smith Genome Sciences Centre, Canada
  • Eleanor Lewis, Canada's Michael Smith Genome Sciences Centre, Canada
  • Melika Bonakdar, Canada's Michael Smith Genome Sciences Centre, Canada
  • Jessica Nelson, Canada's Michael Smith Genome Sciences Centre, Canada
  • Cameron Grisdale, Canada's Michael Smith Genome Sciences Centre, Canada
  • Ana Fisic, Department of Medical Oncology, BC Cancer, Canada
  • Teresa Mitchell, Department of Medical Oncology, BC Cancer, Canada
  • Daniel Renouf, Department of Medical Oncology, BC Cancer; Pancreas Centre BC, Canada
  • Stephen Yip, Department of Pathology and Laboratory Medicine, Faculty of Medicine, University of British Columbia, Canada
  • Janessa Laskin, Department of Medical Oncology, BC Cancer, Canada
  • Marco Marra, Canada's Michael Smith Genome Sciences Centre; Department of Medical Genetics, University of British Columbia, Canada
  • Steven Jones, Canada's Michael Smith Genome Sciences Centre; Department of Medical Genetics, University of British Columbia, Canada

Presentation Overview: Show

Manual interpretation of variants remains a bottleneck in precision oncology. Molecular data generated from comprehensive sequencing of cancer samples continues to increase in scale and complexity, resulting in an infeasible burden of interpretation for the reviewing analyst. Furthermore, the automation of variant annotation is often complicated by the resolution of alias terms, equivalent notation, and differing levels of specificity. Ontologies have been created to resolve a number of these issues, and while most knowledge bases incorporate ontologies as controlled vocabulary, few leverage the relational structure of the ontology itself. To address this and bring precision oncology to scale, we have created a graph knowledge base (GraphKB) as part of our platform for oncogenomic reporting and interpretation (PORI). GraphKB stores not only the ontology terms but also their relations, which are leveraged in our matching algorithm designed to robustly annotate patient variants. The real-world applicability of GraphKB has been demonstrated through its use in whole genome and transcriptome analysis for hundreds of patients as part of several tumour profiling studies, as well as against external datasets such as The Cancer Genome Atlas (TCGA).


