Posters - Schedules

Poster session times:
Session A: Monday, July 24, between 18:00 CEST and 19:00 CEST
Session B: Tuesday, July 25, between 18:00 CEST and 19:00 CEST
Session C: Wednesday, July 26, between 18:00 CEST and 19:00 CEST

Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 24, between 08:00 CEST and 08:45 CEST
Session A Posters dismantle:
Monday, July 24, at 19:00 CEST

Session B Poster Set-up and Dismantle
Session B Posters set up:
Tuesday, July 25, between 08:00 CEST and 08:45 CEST
Session B Posters dismantle:
Tuesday, July 25, at 19:00 CEST

Session C Poster Set-up and Dismantle
Session C Posters set up:
Wednesday, July 26, between 08:00 CEST and 08:45 CEST
Session C Posters dismantle:
Wednesday, July 26, at 19:00 CEST
Virtual
A-075: Reproducible models in Systems Biology are higher cited
Track: BOSC
  • Sebastian Höpfl, Institute for Stochastics and Applications (ISA), Germany
  • Nicole Radde, Institute for Stochastics and Applications (ISA), Germany


Presentation Overview:

The Systems Biology community was among the first scientific communities to recognize the need for reproducible models. Still, many published models are not reproducible. Tiwari et al. recently classified 328 published models by their reproducibility and found that only about every second model was directly reproducible. We use this classification to analyze whether reproducible models in Systems Biology have a higher impact in terms of citation numbers. For the analysis, we use Bayesian estimation, which provides complete distributional information for group means and standard deviations; outliers are handled via a non-central t-distribution. Beginning about ten years after the introduction of SBML (2013), reproducibility gained broad awareness in the Systems Biology community. Since then, reproducible models have received significantly more citations than non-reproducible ones. Our analysis shows 95% credibility for higher citation rates of reproducible models for 2013-2020 and for all investigated sub-periods between 2013 and 2020.
Further, normalization to the journal impact factor showed that journals of all ranks could profit from enforcing reproducibility. In conclusion, reproducible models offer long-term citation benefits for journals and individual researchers. The higher citation count provides evidence for increased reuse of these models and could promote progress in Systems Biology.
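
As a hedged sketch of the statistical approach (not the authors' code), a BEST-style Bayesian comparison of two groups of citation counts can be written with PyMC; a shifted Student-t likelihood stands in for the outlier-robust t-distribution described above, and the counts below are invented.

import numpy as np
import pymc as pm

# Invented yearly citation counts for reproducible vs. non-reproducible models.
reproducible = np.array([12, 30, 8, 45, 22, 17, 60, 25], dtype=float)
non_reproducible = np.array([5, 9, 14, 3, 20, 7, 11, 4], dtype=float)

with pm.Model():
    # Group means and standard deviations with weakly informative priors.
    mu = pm.Normal("mu", mu=0, sigma=100, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=50, shape=2)
    # Shared degrees-of-freedom parameter; low values give heavy tails (outlier robustness).
    nu = pm.Exponential("nu", lam=1 / 30) + 1

    pm.StudentT("obs_a", nu=nu, mu=mu[0], sigma=sigma[0], observed=reproducible)
    pm.StudentT("obs_b", nu=nu, mu=mu[1], sigma=sigma[1], observed=non_reproducible)

    # Posterior distribution of the difference in group means.
    pm.Deterministic("mean_diff", mu[0] - mu[1])
    idata = pm.sample(2000, tune=1000, chains=4, random_seed=0)

diff = idata.posterior["mean_diff"].values.ravel()
print(f"P(reproducible models cited more) = {(diff > 0).mean():.3f}")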

A-076: FOSS-based best-practice genomics workflows, GxP-compliant data analysis, and open science at Merck
Track: BOSC
  • Sven-Eric Schelhorn, Merck Healthcare KGaA, Germany
  • Anna Coenen-Stass, Merck Healthcare KGaA, Germany
  • Dmitriy Drichel, Drichel Analytics, Germany
  • Thomas Grombacher, Merck Healthcare KGaA, Germany
  • Stefan Pinkert, Merck Healthcare KGaA, Germany
  • Alex Rolfe, EMD Serono Inc., United States
  • Jing Yang, Merck Healthcare KGaA, Germany
  • Sheng Zhao, Merck Healthcare KGaA, Germany


Presentation Overview:

Large biopharmaceutical organizations widely use free and open-source software (FOSS)-based genomics workflows. However, sharing practical approaches for analyzing genomics data from pre-clinical disease models and human patients is uncommon in industry, which runs counter to open science principles and makes it difficult to improve specialized workflows.

Here, we present FOSS-based, best-practice genomics workflows developed at Merck for clinical use that promote transparency, reproducibility, and GxP/GDPR regulatory compliance. These workflows cover both RNA-Seq and paired/tumor-only/targeted DNA-Seq assays as well as specialized tasks such as tumor mutational burden detection, patient-derived xenograft disambiguation, and neo-epitope calling. Furthermore, we discuss how automatic provenance tracking, biomedical ontologies, and versioned metadata specifications aid in the FAIRification of CDISC-like result data sets.

In addition, we detail approaches for ensuring long-term reproducibility of results by employing containerization as well as by conducting benchmarks and regression tests using public community resources. Recently, our regression tests identified a severe bug in Mutect2, affecting hundreds of organizations. Lastly, we highlight how a petabyte-scale, community-built data lake for both preclinical and patient results data, as well as an R-based environment usable by all Merck scientists and our academic collaborators, enable the construction of rich web apps and support open science at Merck.

A-077: Advancing FAIR meta-analyses of nucleotide sequence data with q2-fondue
Track: BOSC
  • Anja Adamov, ETH Zurich, Switzerland
  • Michal Ziemski, ETH Zurich, Switzerland
  • Lina Kim, ETH Zurich, Switzerland
  • Lena Flörl, ETH Zurich, Switzerland
  • Nicholas Bokulich, ETH Zurich, Switzerland


Presentation Overview:

The growing availability of public nucleotide sequencing data enables meta-analysis studies that expand our knowledge of the microbiome, revealing consistent insights into the diversity of microbial communities and their interactions with hosts and environments. For the resulting scientific findings to be reliable, analytical workflows must be reproducible and transparent. However, raw data inputs are typically generated through manual data retrieval steps, hindering reproducibility and causing research bottlenecks. To address these challenges, we developed q2-fondue (Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere), an open-source Python package that streamlines data acquisition, management, and meta-analysis of nucleotide sequence data and metadata.

q2-fondue adheres to the FAIR principles, promoting data findability, accessibility, interoperability, and reusability. It simplifies the acquisition of sequence (meta)data from the NCBI Sequence Read Archive while providing full provenance tracking from download to final visualization. Through its integration in the widely used QIIME 2 ecosystem, q2-fondue enables researchers to utilize other plugins for comprehensive data analysis. We demonstrate the package’s effectiveness through an example meta-analysis of marker gene sequencing studies.

To guarantee consistent functionality, q2-fondue receives ongoing maintenance. Overall, q2-fondue promises to accelerate novel discoveries by improving scalability, accessibility, and reproducibility in a diverse array of meta-analysis studies.

A-078: An Open Source Platform for Scalable Genomics Data Infrastructures
Track: BOSC
  • Mitchell Shiell, Ontario Institute of Cancer Research (OICR), Canada
  • Jon Eubank, Ontario Institute of Cancer Research (OICR), Canada
  • Justin Richardson, Ontario Institute of Cancer Research (OICR), Canada
  • Brandon Chan, Ontario Institute of Cancer Research (OICR), Canada
  • Puneet Bajwa, Ontario Institute of Cancer Research (OICR), Canada
  • Robin Haw, Ontario Institute of Cancer Research (OICR), Canada
  • Christina Yung, Ontario Institute of Cancer Research (OICR), Canada
  • Lincoln Stein, Ontario Institute of Cancer Research (OICR), Canada
  • Melanie Courtot, Ontario Institute of Cancer Research (OICR), Canada
  • Overture Team, Ontario Institute of Cancer Research (OICR), Canada


Presentation Overview:

Data repositories are essential resources for accelerating scientific discoveries over unified genomics datasets. Unfortunately, building and maintaining them is resource-intensive, requiring the combined expertise of software engineers, cloud infrastructure specialists, and bioinformaticians. Overture addresses this with an extensible open-source platform of modular components designed to be assembled into scalable genomics data infrastructures.

The five core microservices of Overture work in concert to create scalable data commons for filtering, querying and collaborating over large datasets. Ego handles authentication and authorization, alongside an administrative Ego UI component. Score is our file transfer and object storage microservice with integrated SAMtools capabilities. Metadata management is handled by Song, which tracks and validates file metadata across distributed servers and against user-defined schemas. Maestro indexes Song repositories into a single Elasticsearch index, which the Arranger Search API then consumes and exposes through its pre-built, configurable UI components.

Overture's core capabilities were initially informed by our experiences working on the International Cancer Genome Consortium (ICGC) and the NCI Genomic Data Commons Data Portal. Today, Overture powers and informs all our projects, most notably ICGC Accelerating Research in Genomic Oncology (ICGC-ARGO), VirusSeq, and the Ontario Hereditary Cancer Research Network.

A-079: Ten lessons learned on improving the open data reusability of bioinformatics knowledge bases
Track: BOSC
  • Tarcisio Mendes de Farias, SIB Swiss Institute of Bioinformatics, Switzerland
  • Julien Wollbrett, SIB Swiss Institute of Bioinformatics, Switzerland
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland
  • Frederic Bastian, SIB Swiss Institute of Bioinformatics, Switzerland


Presentation Overview:

Background: Enhancing the interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize open data reusability and thus increase their utility, such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once it is (re)used; without interoperability, this utility lies dormant. Results: In this talk, we discuss several approaches to boost interoperability depending on the interoperable parts. The findings are driven by several real-world scenario examples, mostly implemented by Bgee, a well-established gene expression database. Moreover, we discuss ten general lessons learnt. These lessons can be applied in the context of any bioinformatics knowledge base to foster open data reusability. Conclusions: This work provides pragmatic methods and transferable skills to promote the reusability of bioinformatics knowledge bases by focusing on interoperability.
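
One concrete form of interoperability for knowledge bases like Bgee is a public SPARQL endpoint. The sketch below issues a deliberately generic query with SPARQLWrapper; the endpoint URL is an assumption based on Bgee's public documentation and should be verified before use.

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint URL; check the Bgee documentation for the current SPARQL endpoint.
endpoint = SPARQLWrapper("https://www.bgee.org/sparql/")
endpoint.setReturnFormat(JSON)

# A generic query: list the ten most frequent predicates, i.e. which vocabularies
# (the interoperable parts) the knowledge base actually exposes.
endpoint.setQuery("""
    SELECT ?p (COUNT(*) AS ?n)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?n)
    LIMIT 10
""")

for row in endpoint.queryAndConvert()["results"]["bindings"]:
    print(row["p"]["value"], row["n"]["value"])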

A-080: A standardized Nanopore sequencing processing pipeline in nextflow
Track: BOSC
  • Yuk Kei Wan, Genome Institute of Singapore, Singapore
  • Christopher Hakkaart, Seqera Labs, Spain
  • Ying Chen, Genome Institute of Singapore, Singapore
  • Jonathan Göke, Genome Institute of Singapore, Singapore


Presentation Overview:

Nanopore sequencing has enabled DNA variant detection in hard-to-access regions, transcript discovery, and RNA modification profiling, which are not possible with short-read technologies. Still, turning raw Nanopore sequencing data into meaningful results involves the same steps with commonly used software. Here we present nf-core/nanoseq, a community-developed, standardized nf-core pipeline in Nextflow, requiring one command to perform quality checks, alignment, coverage track creation, DNA variant detection, transcript reconstruction and quantification, differential expression analyses, RNA fusion detection, and RNA modification detection. nf-core/nanoseq utilizes the DSL2 version of Nextflow, where each process pulls either a Docker or Singularity container for the software needed, removing the requirement to install individual tools. nf-core/nanoseq accepts various input formats (fastq, fast5, and bam) and allows users to customize the pipeline to their needs by choosing specific steps to run and the software of their preference. Together, the nf-core/nanoseq pipeline greatly simplifies the processing of Nanopore long-read DNA and RNA-Seq data. nf-core/nanoseq is available at https://github.com/nf-core/nanoseq.
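
As a hedged illustration of the one-command interface, the sketch below launches the pipeline programmatically via Python's subprocess. The option names follow nf-core conventions and the pipeline's public documentation (assumed here) and should be checked against the release in use.

import subprocess

# Assumed options; verify against the nf-core/nanoseq release you actually run.
cmd = [
    "nextflow", "run", "nf-core/nanoseq",
    "-r", "3.1.0",                  # pin a released version for reproducibility (assumed tag)
    "-profile", "docker",           # or "singularity"; each process pulls its own container
    "--input", "samplesheet.csv",   # sample sheet describing the fastq/fast5/bam inputs
    "--protocol", "cDNA",           # assumed choices: DNA, cDNA, or directRNA
    "--outdir", "results",
]
subprocess.run(cmd, check=True)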

A-081: 10 simple rules for building FAIR workflows
Track: BOSC
  • Casper de Visser, Radboudumc, Netherlands
  • Lennart Johansson, University Medical Center Groningen, Netherlands
  • Purva Kulkarni, Radboud University Medical Center, Netherlands
  • Hailiang Mei, LUMC, Netherlands
  • Pieter Neerincx, UMCG, Netherlands
  • Joeri van der Velde, University Medical Center Groningen, Netherlands
  • Peter Horvatovich, University of Groningen, Netherlands
  • Alain J. van Gool, Radboudumc, Netherlands
  • Morris Swertz, University Medical Center Groningen, Netherlands
  • Peter-Bram 't Hoen, Radboud University Medical Center, Netherlands
  • Anna Niehues, Radboud University Medical Center, Netherlands


Presentation Overview:

Research data is accumulating rapidly, and with it the challenge of irreproducible science. As a consequence, implementation of high-quality management of scientific data has become a global priority. The FAIR (Findable, Accessible, Interoperable and Reusable) principles provide practical guidelines for maximizing the value of research data; however, processing data using workflows - systematic executions of a series of computational tools - is equally important for good data management. The FAIR principles have recently been adapted to Research Software (FAIR4RS Principles) to promote the reproducibility and reusability of any type of research software. Here we propose a set of 10 simple rules, drafted by experienced workflow developers, that will help researchers to apply the FAIR4RS principles to workflows. The rules have been arranged according to the FAIR acronym, clarifying the purpose of each rule with respect to the FAIR4RS principles. Altogether, these rules can be seen as practical guidelines for workflow developers who aim to contribute to more reproducible and sustainable computational science, aiming to positively impact the open science and FAIR community.

A-082: Tonkaz: A workflow reproducibility scale for automatic validation of biological interpretation results
Track: BOSC
  • Hirotaka Suetake, Sator, Inc., Japan
  • Tazro Ohta, Institute for Advanced Academic Research, Chiba University, Japan


Presentation Overview:

Reproducibility of data analysis workflows is a key concern in bioinformatics. While recent advancements in computing technologies, such as virtualization, have facilitated easier reproduction of workflow execution, assessing the reproducibility of results remains a challenge. Specifically, there is a lack of standardized methods for verifying whether the biological interpretation of reproduced results is consistent across different executions.

To address this issue, we propose a new metric: a reproducibility scale for workflow execution results. This metric evaluates the reproducibility of results based on biological feature values, such as the number of reads, mapping rate, and variant frequency, representing the biological interpretation of the data. We have developed a prototype system that automatically evaluates the reproducibility of results using the proposed metric, streamlining the evaluation process.

To demonstrate the effectiveness of our approach, we conducted experiments using workflows employed by researchers in real-life research projects, as well as common use cases frequently encountered in bioinformatics. Our approach enables the automatic evaluation of result reproducibility on a fine-grained scale, promoting a more nuanced perspective on reproducibility. This shift from a binary view of identical or non-identical results to a graduated scale facilitates more informed discussions and decision-making in the bioinformatics field.
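
The core idea of grading reproducibility by biological feature values rather than bitwise identity can be illustrated with a small sketch (not Tonkaz itself): feature values from two executions are compared within tolerances and summarized as a score.

# Illustrative only: compare feature values from two workflow runs within relative
# tolerances and report the fraction of features that agree (a crude reproducibility scale).
run_a = {"total_reads": 1_000_000, "mapping_rate": 0.962, "variants_called": 5123}
run_b = {"total_reads": 1_000_250, "mapping_rate": 0.958, "variants_called": 5200}

tolerances = {"total_reads": 0.01, "mapping_rate": 0.01, "variants_called": 0.05}

def agrees(a: float, b: float, rel_tol: float) -> bool:
    """True if a and b differ by at most rel_tol relative to their magnitude."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-12)

matches = {k: agrees(run_a[k], run_b[k], tolerances[k]) for k in run_a}
score = sum(matches.values()) / len(matches)
print(matches)
print(f"reproducibility score: {score:.2f}")  # 1.0 = all features agree within tolerance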

A-083: Accessible and scalable pipelines for fast and easy (foodborne) pathogens detection and tracking
Track: BOSC
  • Engy Nasr, Freiburg Galaxy Team, Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University, Germany
  • Anna Henger, Biolytix AG, Switzerland
  • Tobias Schindler, amplytico gmbh, Switzerland
  • Paul Zierep, Freiburg Galaxy Team, Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University, Germany
  • Bérénice Batut, University of Freiburg, Germany


Presentation Overview:

Food pathogen contamination affects around 600 million people a year.
During a foodborne outbreak investigation, microbiological analysis of the potentially responsible food is performed to detect the responsible pathogens and identify the contamination source. Traditional methods require targeted pathogen isolation, which is time-consuming and not always straightforward or successful. Metagenomics could solve this issue by giving an overview of the genomic composition (food, microbial community, and any possible pathogens) without prior isolation or limitation to targeted genes.
Metagenomics combined with Nanopore sequencing makes the identification of pathogens quicker, easier, more accessible, and more practical. But processing such data is complex, especially with the lack of accessible, easy-to-use, and openly available pipelines.

To solve this issue, we have implemented FAIR Galaxy-based workflows, which integrate state-of-the-art tools, visualizations, and reports for the detection and tracking of pathogens from any - not only food - metagenomics ONT sample.
The workflows were tested on (1) spiked food with different Salmonella strains at different concentrations, and (2) samples collected from humans, meat, and chicken in Palestine containing Campylobacter.

The workflows are freely available on the European Galaxy server as well as on the EOSC-Life WorkflowHub. They are documented as an e-learning tutorial available on the Galaxy Training Network.

A-084: Open Life Science: A mentoring & training virtual program for Open Science ambassadors
Track: BOSC
  • Yo Yehudi, Open Life Science, United Kingdom
  • Pradeep Eranti, Open Life Science, France
  • Sara El-Gebali, FAIRPoint & Open Life Science, Sweden
  • Bérénice Batut, University of Freiburg & Open Life Science, Germany


Presentation Overview:

Open Life Science (OLS) offers 16-week training and mentoring virtual programs to empower researchers and their teams to lead open research projects in their respective domains and become open science ambassadors for their communities.
With the combination of practical training and 1:1 support from our mentors, we guide participants to reflect on and apply open practices in the context of the socio-cultural environment where they conduct their research. Participants join the program with a project. They attend cohort calls to learn about open science practices and frameworks to apply open science skills, including FAIR research principles and equitable and inclusive community practices. In mentoring calls every alternating week, they reflect on their progress and identify the next steps. To strengthen their skills, graduates can re-join subsequent cohorts as mentors, call facilitators, or experts. We offer micro-grants, live transcription, and other resources to make the program inclusive and accessible.
Since 2020, we have run 7 cohorts and trained >300 participants from 6 continents, across 50+ low- and middle-income countries (LMICs) and high-income countries (HICs), with the help of 120+ mentors and 150+ experts.
This talk will describe the various aspects of the program and highlight all open, reusable, and FAIR materials.

A-085: A scalable database structure and processing pipeline for genomic surveillance
Track: BOSC
  • Kunaphas Kongkitimanon, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany
  • Ferdous Nasri, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany
  • Alice Wittig, Bioinformatics and Systems Biology, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
  • Jorge Sanchez Cortes, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany
  • Anna-Juliane Schmachtenberg, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany
  • Bernhard Renard, Data Analytics & Computational Statistics, Hasso Plattner Institute, University of Potsdam, Germany
  • Stephan Fuchs, Bioinformatics and Systems Biology, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany


Presentation Overview:

The emergence of new diseases raises the need for real-time genomic surveillance. Meeting this need requires a flexible and scalable framework that efficiently handles large amounts of genomic data.
In this project, we use monkeypox (Mpox) as an example and introduce MpoxSonar, an efficient software tool that can assist researchers in managing and analyzing very large amounts of genomic data effectively. MpoxSonar uses Python and MariaDB, which are simple to deploy at any research facility.
The tool provides scalable alignment, variant-calling features to create mutation profiles, and search functionality to quickly access sample information from the database. We embed many other useful features, e.g., generating VCF files for integration with other analysis tools.
MpoxSonar automatically downloads all Mpox genomes and metadata daily from the NCBI, performs pairwise alignments with multiple references and generates mutation profiles. The results are stored in a database, which is the interface to MpoxRadar (doi: 10.1101/2023.02.03.526935).
With MpoxSonar, we create a disease-unspecific pipeline and database structure to enable researchers to rapidly set up and use a genomic surveillance system for emerging diseases. It is available for download at github.com/rki-mf1/MpoxSonar.

A-086: Oligo Designer Toolsuite – lightweight development of custom oligo design pipelines
Track: BOSC
  • Lisa Barros de Andrade E Sousa, Helmholtz AI, Germany
  • Isra Mekki, Helmholtz AI, Germany
  • Marie Piraud, Helmholtz AI, Germany


Presentation Overview:

Oligonucleotides are short, synthetic strands of DNA or RNA with many application areas, ranging from research to disease diagnosis or therapeutics. Based on the intended application and experimental design, researchers have to customize the length, sequence composition, and thermodynamic properties of the designed oligos. Various tools exist that provide customized oligo sequences depending on the area of application. Although most of these tools use the same basic processing steps, each newly developed tool usually uses its own implementation and different versions of package dependencies for those steps. As a consequence, the comparability of tools is hampered, and both the maintenance of existing tools and the development of new ones are slowed down, because developers do not have a common resource for those basic functionalities. We tackle this issue with our open-source Oligo Designer Toolsuite, a collection of modules that provide all basic oligo design functionalities with a common underlying data structure within a flexible Python framework. This framework allows lightweight development of custom oligo design pipelines and was successfully applied to develop custom probe design pipelines for SCRINSHOT, SeqFISH+ and MERFISH protocols, which are embedded in the probe set and gene panel selection pipeline Spapros (https://doi.org/10.1101/2022.08.16.504115).
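
For illustration, the kind of basic processing step that such pipelines share can be sketched in plain Python: filtering candidate oligos by length, GC content, and a rough melting temperature. This is not the Oligo Designer Toolsuite API; the thresholds and the Wallace-rule Tm are illustrative assumptions.

# Illustrative oligo-filtering sketch (not the Toolsuite's API).
def gc_content(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq: str) -> float:
    """Rough melting temperature via the Wallace rule (adequate for short oligos)."""
    seq = seq.upper()
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

def passes_filters(seq, length=(18, 25), gc=(0.40, 0.60), tm=(55.0, 65.0)):
    return (length[0] <= len(seq) <= length[1]
            and gc[0] <= gc_content(seq) <= gc[1]
            and tm[0] <= wallace_tm(seq) <= tm[1])

candidates = ["ATGCGTACGTTAGCCTAGCG", "ATATATATATATATATAT", "GGCGGCGGCCGCGGCGCGGC"]
print([c for c in candidates if passes_filters(c)])  # keeps only the balanced candidate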

A-087: Transforming unstructured biomedical texts with large language models
Track: BOSC
  • J. Harry Caufield, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Harshad Hegde, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Vincent Emonet, Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, Netherlands
  • Nomi Harris, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Marcin Joachimiak, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Nicolas Matentzoglu, Semanticly Ltd, Athens, Greece
  • Hyeongsik Kim, Robert Bosch LLC, Sunnyvale, CA 94085, United States
  • Sierra Moxon, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Justin Reese, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
  • Melissa Haendel, Anschutz Medical Campus, University of Colorado, Aurora, CO 80217, United States
  • Peter Robinson, The Jackson Laboratory, Bar Harbor, ME 04609, United States
  • Christopher Mungall, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States


Presentation Overview:

Creating biological knowledge bases and ontologies relies on time-consuming curation. Newly-emerging approaches driven by artificial intelligence and natural language processing can assist curators in populating these knowledge bases, but current approaches rely on extensive training data and are unable to populate arbitrarily complex nested knowledge schemas. We have developed Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning (ZSL) and general-purpose query answering from flexible prompts and return information conforming to a schema. Given a user-defined schema and an input text, SPIRES recursively queries GPT-3+ to obtain responses matching the schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for all matched elements. SPIRES may be applied to varied tasks, including extraction of cellular signaling pathways, disease treatments, drug mechanisms, and chemical to disease causation graphs. This approach offers easy customization, flexibility, and the ability to perform new tasks in the absence of any additional training data. SPIRES supports a strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM.
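
A minimal sketch of the general pattern (zero-shot, schema-guided extraction with an instruction-following LLM), not the SPIRES implementation: the model name and the OpenAI client usage are assumptions, and a real system would recurse into nested schema fields and ground each value to ontology identifiers.

import json
from openai import OpenAI  # assumes the openai>=1.0 client and an API key in the environment

# A tiny user-defined schema: the prompt asks the model to return JSON that conforms to it.
schema = {
    "disease": "string",
    "treatments": ["string"],   # list of drug or therapy names
}

text = ("Metformin and lifestyle modification are first-line options "
        "for the management of type 2 diabetes.")

prompt = (
    "Extract information from the text below and answer ONLY with JSON matching this schema:\n"
    f"{json.dumps(schema)}\n\nText:\n{text}"
)

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; SPIRES targets GPT-3+ class models
    messages=[{"role": "user", "content": prompt}],
)
# Parsing may fail if the model adds extra text; a production system would validate and retry.
extracted = json.loads(resp.choices[0].message.content)
print(extracted)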

A-088: FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis
Track: BOSC
  • Xiaofeng Liao, RadboudUMC, Netherlands
  • Anna Niehues, Radboudumc, Netherlands
  • Casper de Visser, Radboudumc, Netherlands
  • Junda Huang, Radboudumc, Netherlands
  • Cenna Doornbos, Radboudumc, Netherlands
  • Thomas Ederveen, Radboudumc, Netherlands
  • Purva Kulkarni, Radboudumc, Netherlands
  • Joeri van der Velde, University Medical Center Groningen, Netherlands
  • Morris Swertz, University Medical Center Groningen, Netherlands
  • Martin Brandt, SURF, Netherlands
  • Alain van Gool, Radboudumc, Netherlands
  • Peter-Bram 't Hoen, Radboudumc, Netherlands


Presentation Overview:

We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. A significant fraction of these data may be misused to de-anonymize and (re-)identify individuals. Hence, most data is kept in secure and protected silos. Therefore, it remains a challenge to reuse these data without infringing the privacy of the individuals from which the data were derived. Federated analysis of FAIR data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.

The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles and helps researchers to create FAIR data and metadata, facilitate reuse of their data, and make their data analysis workflows transparent.

Considering the privacy-sensitive nature of -omics data, the FDCube also provides a secure data analysis environment to address ethical, societal, and legal issues.

A-089: Building and Sustaining a Community of Computational Biologists at EMBL through Open-Source Tools and Four Pillars: Training, Community, Infrastructure, and Information
Track: BOSC
  • Lisanna Paladin, EMBL - European Molecular Biology Laboratory, Germany
  • Renato Alves, EMBL - European Molecular Biology Laboratory, Germany


Presentation Overview:

Especially in the interdisciplinary field of computational biology, collaboration is crucial. The Bio-IT project at the European Molecular Biology Laboratory (EMBL) established a community of bioinformaticians through training, community building, infrastructure, and information support. The project is managed by staff, providing a model for how institutional support can empower community-driven initiatives in science.
Training is an essential component of the Bio-IT project. Researchers learn from their peers through open-source materials and practical workshops, run in a hybrid format to allow participation across all six EMBL sites in Europe. But training alone doesn't change the culture: it is coupled with community-building initiatives such as meetups, coding clubs, an internal Mattermost chat, and a "grassroots table" of internal experts. Infrastructure support includes coding tools and a GitLab server. The project provides newcomer guides, intranet links, and a blog to ensure access to comprehensive information.
The Bio-IT project provides a model for research institutes looking to develop similar programs to support internal field-specific communities. We will highlight the benefits and challenges of implementing this model, and we hope to spark a little revolution in the way scientists approach community and in the way institutions recognise the value of professionals contributing to it.

A-090: Functional approaches focused on Gene Regulatory Networks for integrating bulk datasets: a case study exploiting the versatility of Shiny apps
Track: BOSC
  • Eleanor Williams, Cambridge Stem Cell Institute, University of Cambridge, United Kingdom
  • Cameron Crawford, Cambridge Stem Cell Institute, University of Cambridge, United Kingdom
  • Ilias Moutsopoulos, Cambridge Stem Cell Institute, University of Cambridge, United Kingdom
  • Irina Mohorianu, University of Cambridge, United Kingdom


Presentation Overview:

Bulk sequencing experiments on single/multiple modalities are central in biomedical research for generating hypotheses. The plethora of existing tools often necessitates bespoke analyses, expensive and often inflexible commercially available solutions, or the handling of disconnected pipeline components that require conversions between outputs and accepted inputs.

To facilitate easy, interactive exploration of bulk datasets, while also generating sharable outputs, we created bulkAnalyseR, an R package which takes expression matrices and metadata tables (with preprocessing functions available) to create shiny apps that combine multiple state-of-the-art analysis and visualisation approaches. This pipeline allows users to interact with panels, explore quality control measures, perform differential expression and investigate biological interpretations with enrichment analysis and expression pattern identification. A distinctive feature of bulkAnalyseR is the inference and cross-comparison of gene regulatory networks (GRNs), either on single modalities or integrating multiple modalities with cis-, trans- and customised regulatory networks; using this feature, the dynamics of GRNs can be studied in terms of changes in topology and strength of interactions. This setup encourages thorough, deeper and systematic data mining, while also promoting reproducible analyses.

To further enhance the flexibility of bulkAnalyseR, we subsequently created an add-on batch correction (coRRectoR), based on functional, covariation-focused, alignment of signal across batches.

A-091: EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), a tool for the improvement of eukaryotic structural and functional genome annotation
Track: BOSC
  • Cynthia Webster, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Karl Fetter, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Sumaira Zaman, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Vidya Vuruputoor, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Akriti Bhattarai, Ecology and Evolutionary Biology Department, University of Connecticut, United States
  • Jill Wegrzyn, Ecology and Evolutionary Biology Department, University of Connecticut, United States


Presentation Overview:

The emergence of affordable high-throughput sequencing technologies has increased both the number and quality of eukaryotic genomes. Although reference genomes and their associated contiguity are increasingly accessible, an efficient and accurate workflow for the structural annotation of protein-coding genes remains a challenge. Existing programs struggle with predicting less common gene structures (long introns, micro-exons), finding the preferred translation initiation site (TIS) location, and distinguishing pseudogenes. We present EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs), an open-source genome annotation tool that leverages machine learning, RNA folding, and functional annotations to enhance gene prediction accuracy (https://gitlab.com/PlantGenomicsLab/easel). EASEL works by aligning high-throughput short-read data (RNA-Seq) and assembling putative transcripts via StringTie2 and PsiCLASS. Frames are subsequently predicted through TransDecoder using a gene family database (EggNOG), and Expressed Sequence Tag (EST) and protein hints are generated. Each gene model is independently used to train AUGUSTUS, and the resulting predictions are combined into a single gene set using AGAT. Implicated gene structures are filtered by primary and secondary features (molecular weight, GC content, free energy, etc.) with a clade-specific random forest model and then functionally annotated with EnTAP. This results in a full-scale workflow that balances efficiency and accuracy to generate high-quality genome annotations.

A-092: The Linked data Modeling Language (LinkML): a general-purpose data modeling framework
Track: BOSC
  • Sierra Moxon, LBNL, United States
  • Harold Solbrig, solbrig informatics, United States
  • Deepak Unni, Swiss Institute of Bioinformatics, Switzerland
  • Mark Miller, LBNL, United States
  • Patrick Kalita, LBNL, United States
  • Sujay Patil, LBNL, United States
  • Kevin Schaper, Anschutz Medical Campus, University of Colorado, United States
  • Tim Putman, Anschutz Medical Campus, University of Colorado, United States
  • Corey Cox, Anschutz Medical Campus, University of Colorado, United States
  • Harshad Hegde, LBNL, United States
  • J. Harry Caufield, LBNL, United States
  • Justin Reese, LBNL, United States
  • Melissa Haendel, Anschutz Medical Campus, University of Colorado, United States
  • Christopher J. Mungall, LBNL, United States


Presentation Overview:

The Linked data Modeling Language (https://linkml.io) is a data modeling framework that provides a flexible yet expressive standard for describing many kinds of data models, from value sets and flat, checklist-style standards to complex normalized data structures that use polymorphism and inheritance. It is purposefully designed so that software engineers and subject matter experts can communicate effectively in the same language, while also providing the semantic underpinnings to make data conforming to LinkML schemas easier to understand and reuse computationally. The LinkML framework includes tools to serialize data models in many formats, including but not limited to JSON Schema, OWL, SQL DDL, and Python Pydantic classes. It also includes tools to convert instance data between different model serializations (LinkML runtime), convert schemas from one framework to another (LinkML convert), validate data against a LinkML schema (LinkML validate), retrieve model metadata (LinkML schemaview), bootstrap a LinkML schema from another framework (LinkML schema automator), and auto-generate documentation and schema diagrams. LinkML is an open, extensible modeling framework that allows computers and people to work cooperatively, and it makes it easy to model, validate, and distribute data that is reusable and interoperable.
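
A minimal sketch of defining a tiny, invented schema and introspecting it with the linkml-runtime SchemaView, which is assumed here to accept a schema given as a YAML string.

from linkml_runtime import SchemaView

# A tiny, invented LinkML schema (YAML) describing a Person class with two attributes.
schema_yaml = """
id: https://example.org/person-schema
name: person-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
classes:
  Person:
    attributes:
      name:
        required: true
      age:
        range: integer
"""

# SchemaView gives programmatic access to the model (classes, slots, inheritance, ...).
sv = SchemaView(schema_yaml)
print(list(sv.all_classes()))                               # class names, e.g. 'Person'
print([s.name for s in sv.class_induced_slots("Person")])   # ['name', 'age']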

A-093: Interactive analysis of single-cell data using flexible workflows with singleCellTK and singleCellTKPlus
Track: BOSC
  • Amulya Shastry, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Joshua David Campbell, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Yichen Wang, Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Irzam Sarfraz, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Rui Hong, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Yusuke Koga, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Vidya Akavoor, Rafik B. Hariri Institute for Computing and Computational Science and Engineering, Boston, MA, United States
  • Xinyun Cao, Rafik B. Hariri Institute for Computing and Computational Science and Engineering, Boston, MA, USA, United States
  • Salam Alabdullatif, Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Nida Pervaiz, Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Syed Ali Zaib, Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Zhe Wang, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States
  • Frederick Jansen, Rafik B. Hariri Institute for Computing and Computational Science and Engineering, Boston, MA, USA, United States
  • Masanao Yajima, Department of Mathematics and Statistics, Boston University, Boston, MA, USA, United States
  • W. Evan Johnson, Bioinformatics Program and Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA, United States


Presentation Overview:

Single cell genomic technologies have advanced our understanding of complex living systems by providing the ability to profile molecular features at the cellular level. With an influx of single cell datasets, there has been a subsequent increase in the number of analytical tools focused on elucidating information from these datasets. These analysis tools may require different input formats, have different dependencies, and may be spread out across programming environments. To overcome these challenges, we previously developed an R/Bioconductor package called singleCellTK, which provides a user-friendly Shiny app for seamless integration of popular tools and workflows across programming environments. However, some popular tools are not available on standard R repositories such as CRAN or Bioconductor and are only available on GitHub. To overcome this issue, we have developed a novel package called singleCellTKPlus that supports tools only available on GitHub. This package will supplement the comprehensive list of tools and workflows already available in the singleCellTK package with the opportunity to perform additional tasks such as cell type deconvolution with MuSiC or cell type prediction with CellAssign.

A-094: The Research Software Ecosystem: an open software metadata commons
Track: BOSC
  • Kota Miura, Bioimage Analysis & Research, Japan
  • Herve Menager, Institut Pasteur, Université Paris Cité, France
  • Matúš Kalaš, Computational Biology Unit, Department of Informatics, University of Bergen, Norway
  • Oleg Zharkov, University of Freiburg, Germany
  • Federico Zambelli, Department of Biosciences, University of Milan, Milano, Italy
  • Dmitri Repchevsky, Barcelona Supercomputing Center (BSC), Spain
  • Jonathan Tedds, ELIXIR Europe, United Kingdom
  • Manthos Pithoulias, ELIXIR Europe, United Kingdom
  • Hedi Peterson, University of Tartu, Estonia
  • Perrine Paul-Gilloteaux, Institut du Thorax, University of Nantes, France
  • Stuart Owen, The University of Manchester, United Kingdom
  • Steffen Möller, Rostock University Medical Center, Germany
  • Hans Ienasescu, Technical University of Denmark, Denmark
  • Steven Manos, Australian BioCommons, Australia
  • Jennifer Harrow, ELIXIR, United Kingdom
  • Josep Ll Gelpi, Dept. Bioquimica i Biologia Molecular. Univ. Barcelona, Spain
  • Johan Gustafsson, Australian Biocommons, Australia
  • Bjoern Gruening, Uni-Freiburg, Germany
  • Carole Goble, The University of Manchester, United Kingdom
  • Alban Gaignard, Institut du Thorax, University of Nantes, France
  • José Mª Fernández, Barcelona Supercomputing Center (BSC), Spain
  • Frederik Coppens, Ghent University, Belgium
  • Salvador Capella-Gutiérrez, Barcelona Supercomputing Center (BSC), Spain


Presentation Overview:

Research software is a critical component of computational research. Being able to discover, understand and adequately utilize software is essential. Many existing services facilitate these tasks, all of them relying heavily on software metadata. The continued upkeep of such complex and large sets of metadata requires multiple curation efforts, and the resulting metadata is often sparse and inconsistent.
The Research Software Ecosystem (RSEc) aims to act as a proxy to maintain and preserve high-quality metadata for describing research software. These metadata are retrieved and synchronized with many major software-related services, within and beyond the ELIXIR Tools Platform. The EDAM ontology enables the semantic description of the scientific function of the described software.
The RSEc central repository is a GitHub repository that aggregates software metadata, mostly related to the life sciences, spanning multiple aspects of software discovery, evaluation, deployment and execution. The aggregation of metadata in a centralized, open, and version-controlled repository enables the cross-linking of services, the validation and enrichment of software metadata, the development of new services, and the analysis of these metadata.
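
The kind of software metadata the RSEc aggregates can be inspected directly from one of its sources; the sketch below retrieves a bio.tools record and prints its EDAM annotations. The endpoint pattern and field names follow the public bio.tools API documentation and are assumptions here, not part of the RSEc codebase.

import requests

# Fetch one tool's metadata record from bio.tools (assumed endpoint pattern).
tool_id = "samtools"
resp = requests.get(f"https://bio.tools/api/tool/{tool_id}/?format=json", timeout=60)
resp.raise_for_status()
record = resp.json()

print(record.get("name"))
# EDAM topic annotations: what the software is about.
for topic in record.get("topic", []):
    print(" topic:", topic.get("term"), topic.get("uri"))
# EDAM operation annotations: the scientific functions the tool performs.
for func in record.get("function", []):
    for op in func.get("operation", []):
        print(" operation:", op.get("term"), op.get("uri"))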

A-095: The SPHN Semantic Interoperability Framework: From clinical routine data to FAIR research data
Track: BOSC
  • Vasundra Touré, Swiss Institute of Bioinformatics SIB, Switzerland
  • Deepak Unni, Swiss Institute of Bioinformatics SIB, Switzerland
  • Sabine Österle, Swiss Institute of Bioinformatics SIB, Switzerland
  • Katrin Crameri, Swiss Institute of Bioinformatics SIB, Switzerland


Presentation Overview:

The Swiss Personalized Health Network (SPHN) is an initiative funded by the Swiss government for building a nationwide infrastructure for sharing clinical and health-related data in a secure and FAIR (Findable, Accessible, Interoperable, Reusable) manner. One goal is to ensure that data coming from different sources is interoperable between stakeholders. The priority was to develop a purpose-independent description of existing knowledge rather than relying on existing data models which are focused on specific use cases.

Together with partners at the university hospitals, we have developed the SPHN Semantic Interoperability Framework, which encompasses:
- semantics definition for data standardization
- data format specifications for data exchange
- software tools to support data providers and users
- training for facilitating knowledge sharing with stakeholders

Well-defined concepts connected to machine-readable semantic standards (e.g., SNOMED CT and LOINC) function as reusable universal building blocks that can be connected with each other to represent information. By adopting semantic web technologies, we have built a specific schema to encode the semantic concepts with given rules and conventions.

This framework is implemented in all Swiss university hospitals and forms the basis for future data-driven research projects with clinical and other health-related data.
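
The pattern of encoding a clinical fact as triples that reference a machine-readable terminology can be sketched with rdflib; the namespaces and property names below are illustrative inventions, not the SPHN schema.

from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

# Illustrative namespaces only; the real SPHN schema defines its own concepts and IRIs.
EX = Namespace("https://example.org/patient-data/")
SNOMED = Namespace("http://snomed.info/id/")

g = Graph()
g.bind("ex", EX)
g.bind("snomed", SNOMED)

# One "building block": a diagnosis linked to a machine-readable SNOMED CT concept
# (73211009 = diabetes mellitus) and to the patient it belongs to.
g.add((EX.diagnosis1, RDF.type, EX.Diagnosis))
g.add((EX.diagnosis1, EX.hasCode, SNOMED["73211009"]))
g.add((EX.diagnosis1, EX.recordedOn, Literal("2023-07-24", datatype=XSD.date)))
g.add((EX.patient1, EX.hasDiagnosis, EX.diagnosis1))

print(g.serialize(format="turtle"))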

A-096: EDAM - The data analysis and management ontology (update 2023)
Track: BOSC
  • Matúš Kalaš, University of Bergen, Norway
  • Hervé Ménager, Institut Pasteur, Paris, France
  • The Global Community of Contributors to EDAM


Presentation Overview:

EDAM is an ontology of data analysis and data management in bio- and other sciences, including science-based applications. It comprises concepts related to analysis, modelling, and data life-cycle. Targeting usability by diverse users, the structure of EDAM is relatively simple, divided into 4 main sections: topics, operations, data, and formats.

EDAM is used, for example, in Bio.tools, Galaxy, CWL, Debian, BioSimulators, FAIRsharing, and the ELIXIR training portal TeSS. Thanks to the annotations with EDAM, tools, workflows, standards, data, and learning materials are easier to find, compare, and choose.

EDAM contributes to open science by supplying concepts for annotation of research outputs (such as processed data), making them more findable, understandable, and comparable. EDAM and its applications help lower the barrier and effort for scientists - professional and “citizen” - towards doing scientific research in a more open, inclusive way.

Updates in 2023 include:

- Substantial work on continuously improving the coverage of interdisciplinary applications relevant in the global context and global problems (EDAM Geo)
- Generative ML for synthetic data; mass spectrometry imaging; and cytometry concepts refined (in EDAM Bioimaging)
- Improvements to data management and data handling concepts
- Specialised validation tooling (Caséologue)
- Numerous new contributors

A-097: Domain Specific Language and variables for systematic approach to genetic variant curation and interpretation
Track: BOSC
  • Marina Pozhidaeva, Deggendorf Institute of Technology, Forome Association, MA, USA, Germany
  • Dmitry Etin, Forome Association, MA, USA; Deggendorf Institute of Technology; Oracle Corporation, Austria
  • Gennadii Zakharov, Quanotri, Georgia
  • Michael Bouzinier, Forome Association, MA, USA; Oracle Corporation; Harvard University, United States


Presentation Overview:

Many clinicians and researchers agree that genome sequencing should become part of routine clinical practice, eventually becoming available in community hospitals and remote places. One of the roadblocks slowing this journey is the caution with which insurance companies and governments are reimbursing these services. One reason for this caution is a gap in the standards for demonstrating the clinical utility of genome sequencing based on the current clinical genetics guidelines. To facilitate higher evidence-based medicine standards, we need a systematic and structured approach to developing, maintaining, and publishing these guidelines. We believe that defining a weakly typed Domain Specific Language (DSL) for expressing genetic variant curation and interpretation rules in the context of provenance, confidence, and biological evidence of the data would be a step forward in this direction. The types of DSL variables are based on the scale and resolution, knowledge domain, and method by which the technical or biological annotation corresponding to a variable has been produced.

We will present the prototype of such a DSL and annotation classification system used as part of our AnFiSA software tool. We will provide examples of the variant curation rules written in the DSL and demonstrate how they work on real-life data.
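
As a purely hypothetical illustration of rule-style variant curation (the actual DSL syntax is defined by AnFiSA and is not reproduced here), a rule can be viewed as a named predicate over typed annotation variables:

# Hypothetical illustration only, not the AnFiSA DSL.
variant = {
    "gnomad_af": 0.00002,          # population allele frequency (biological, cohort-scale)
    "consequence": "stop_gained",  # predicted molecular consequence (transcript-scale)
    "clinvar_significance": None,  # curated clinical assertion, if any
    "read_depth": 54,              # technical annotation from the sequencing run
}

rules = {
    "is_rare": lambda v: v["gnomad_af"] is not None and v["gnomad_af"] < 1e-4,
    "is_lof": lambda v: v["consequence"] in {"stop_gained", "frameshift_variant"},
    "is_well_covered": lambda v: v["read_depth"] >= 20,
}

fired = {name: rule(variant) for name, rule in rules.items()}
print(fired)  # {'is_rare': True, 'is_lof': True, 'is_well_covered': True}
# A real DSL additionally attaches provenance and confidence to each variable and lets
# guideline authors publish, version, and systematically re-run such rule sets.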

A-098: SWIPE: Open source infrastructure as code for running WDL workflows at low cost
Track: BOSC
  • Todd Morse, The Chan Zuckerberg Foundation, United States
  • Ryan Lim, The Chan Zuckerberg Foundation, United States
  • Jessica Gadling, The Chan Zuckerberg Foundation, United States


Presentation Overview:

SWIPE packages the cloud infrastructure used by Chan Zuckerberg Infectious Disease to run our bioinformatics WDL workflows so our architecture can be used by others. To make the infrastructure portable, we used Terraform to define our infrastructure as code. The architecture is well tuned for our use case: running the same pipelines with some common inputs, while scaling to meet highly variable demand for pipeline runs, at low cost. This is achieved by using AWS Batch to quickly scale up and down, high-bandwidth and high disk I/O instances to quickly load and read large file inputs, a caching strategy to re-use common inputs between pipeline runs, and AWS Spot Instances to lower costs. Leveraging Spot Instances can be a challenge because workloads may be interrupted, requiring workloads run on Spot Instances to implement handling for recovery. SWIPE leverages the WDL workflow definitions to automatically resume at the last completed step, freeing workflow developers from needing to implement their own recovery logic. Coordination is handled by AWS Step Functions, a serverless workflow orchestrator. This allows SWIPE to run a workflow with an AWS API call and scale down to zero when not in use.
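
Since coordination is handled by AWS Step Functions, starting a run is a single AWS API call; a minimal sketch with boto3 is shown below. The state machine ARN and the input field names are placeholders and assumptions, not SWIPE's documented interface.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: use the state machine created by the deployed terraform stack.
state_machine_arn = "arn:aws:states:us-west-2:123456789012:stateMachine:swipe-example"

# Hypothetical input document; the actual field names are defined by the WDL workflow
# and the SWIPE deployment configuration.
run_input = {
    "RUN_WDL_URI": "s3://my-bucket/workflows/consensus-genome.wdl",
    "Input": {"Run": {"fastqs_0": "s3://my-bucket/samples/sample1_R1.fastq.gz"}},
    "OutputPrefix": "s3://my-bucket/results/sample1/",
}

response = sfn.start_execution(
    stateMachineArn=state_machine_arn,
    name="sample1-run-001",          # must be unique per execution
    input=json.dumps(run_input),
)
print(response["executionArn"])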

A-099: Identifying integrated multi-omics biomarkers to build a sepsis detection model using machine learning
Track: BOSC
  • Tyrone Chen, Monash University, Australia
  • Anton Y. Peleg, Monash University, Australia
  • Sonika Tyagi, Royal Melbourne Institute of Technology, Australia


Presentation Overview:

Sepsis, which is responsible for 20% of global deaths, has prompted the development of data-driven methods to improve its detection, resolution, prognosis, and treatment. Existing methods detect biomarkers through a single assay, but face challenges in capturing the complex combination of biomolecular interactions involved in gene regulation and expression. To address this, we developed a multi-omics approach that assays cellular levels of various molecules to obtain a standardised sepsis diagnosis and prognosis model. Fast and accurate sepsis detection will supplement clinical treatment of sepsis at all levels, leading to a personalised medicine approach.

We first obtained multi-omics data through our partnership with an Australian National Framework project covering multiple bacterial strains associated with sepsis and non-sepsis conditions. Genomics, transcriptomics, proteomics and metabolomics data are available. Using these as input data, we split our workflow into two parts. In the first stage, we convert (1) the regulatory omics data into a generic data representation. In the second stage, we integrate (2) the functional omics component with the regulatory omics component. We hypothesised that the regulatory omics signatures drive the functional signature. Our method is both species- and data-agnostic, and is publicly available as a conda package.
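
A generic sketch of the integration idea (not the authors' pipeline) with scikit-learn, assuming feature matrices from two omics layers for the same samples: scale each layer, concatenate, and fit a classifier for the sepsis label. The data below are simulated.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 60

# Simulated feature matrices for the same samples from two omics layers
# (e.g., transcript abundances and metabolite intensities); real data replaces these.
transcriptomics = rng.normal(size=(n_samples, 200))
metabolomics = rng.normal(size=(n_samples, 50))
labels = rng.integers(0, 2, size=n_samples)  # 1 = sepsis, 0 = non-sepsis

# Simple early integration: scale each layer, then concatenate the features.
X = np.hstack([StandardScaler().fit_transform(transcriptomics),
               StandardScaler().fit_transform(metabolomics)])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, labels, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {scores.mean():.2f} +/- {scores.std():.2f}")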

A-100: chromTools complete: a tool for chromatin state annotation validation
Track: BOSC
  • Jessica Shields, University of Exeter, United Kingdom
  • Eilis Hannon, University of Exeter, United Kingdom


Presentation Overview:

Chromatin state annotation algorithms are a powerful tool to understand regulatory activity across the genome. Using multiple different epigenetic data types as input, these methods segment and annotate the genome based on the combination of epigenetic marks found at each genomic position, assigning a chromatin state label to each segment. However, there are currently no generally agreed-upon statistics to determine the quality of a particular annotation. While tools have been developed to evaluate the characteristics of an annotation, the accuracy of the derived chromatin state labels is also reliant upon a robust input dataset and is affected by the degree to which the input dataset represents a complete set of marks across the genome.
To address this issue, we have developed a simple validation tool, chromTools complete. The tool implements a subsampling algorithm and subsequent binarisation of the data, based on ChromHMM's binarisation step. The output saturation plots give an estimate of the number of reads required to approach the complete sampling of the mark. Accordingly, chromTools complete serves as a fast and effective tool to characterize the reliability of the input dataset upon which a chromatin state annotation is based.
Availability and implementation: chromTools complete is free and open source software, available under a BSD-3-Clause license.
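
The underlying idea, subsampling reads at increasing fractions and tracking how many genomic bins carry signal, can be sketched as follows (illustrative only, not the chromTools implementation):

import numpy as np

rng = np.random.default_rng(42)

# Simulated read positions (bin indices) for one epigenetic mark; real data would come
# from an aligned BAM/BED file binned at a fixed window size.
n_bins = 100_000
reads = rng.choice(n_bins, size=2_000_000)

for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
    sample = rng.choice(reads, size=int(frac * reads.size), replace=False)
    covered = np.unique(sample).size
    print(f"{frac:>4.0%} of reads -> {covered:,} / {n_bins:,} bins with signal")
# If the number of covered bins is still rising steeply at 100%, the mark is
# under-sampled and the derived chromatin-state labels should be treated with caution.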

A-101: nf-core/sarek: a pipeline for efficient germline, tumor-only, and somatic analysis of NGS data on different compute infrastructures
Track: BOSC
  • Friederike Hanssen, Quantitative Biology Center, Tübingen; Biomedical Data Science, Department for Computer Science, University of Tübingen, Germany
  • Maxime Ulysse Garcia, Seqera Labs, Barcelona, Sweden
  • Lasse Folkersen, Nucleus Genomics, Department of Science, New York, United States
  • Susanne Jodoin, Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Germany
  • Anders Sune Pedersen, Danish National Genome Center, Copenhagen, Denmark
  • Edmund Miller, Department of Biological Sciences, The University of Texas at Dallas, Richardson, United States
  • Francesco Lescai, Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
  • Nick Smith, German Human Genome-Phenome Archive; Technical University of Munich, Germany
  • Oskar Wacker, Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Germany
  • Nf-Core Community, nf-core, Sweden
  • Gisela Gabernet, Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Germany
  • Sven Nahnsen, Quantitative Biology Center, Tübingen; Biomedical Data Science, Department for Computer Science, University of Tübingen, Germany


Presentation Overview:

Variant calling studies often include large donor cohorts with dataset sizes varying widely for panel, whole-exome, and whole-genome-sequencing data. High-throughput, efficient, and reproducible software pipelines are needed to ensure homogeneous processing across different compute infrastructures with affordable resource usage. We present nf-core/sarek 3.0 (https://github.com/nf-core/sarek), a pipeline for exploring single-nucleotide variants, structural variation, microsatellite instability, and copy-number alterations of germline, tumor-only, and tumor-normal pairs. The pipeline is part of the nf-core project, which provides peer-reviewed, reproducible, scalable, portable, and well-documented open-source Nextflow pipelines (https://nf-co.re/).
During the recent re-implementation, we reduced compute resource usage and improved data flow, shortening turn-around times; this allowed us to reduce costs on commercial clouds by more than a third and facilitates the integration of publicly hosted data from repositories with in-house patient cohorts.
Other improvements include modularization of processes, which facilitates code maintainability and customization on the user side, and a broader repertoire of available tools.
We will showcase the technical developments in several projects from cancer research. We have processed 160 WGS germline samples to detect pre-leukemic changes, as well as 100 cholangiocarcinoma and 20 colorectal carcinoma panels for investigating the relationship of genomic variation to drug responsiveness.
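
A typical launch of the pipeline follows standard nf-core command-line conventions. The sketch below wraps the Nextflow CLI from Python; it assumes Nextflow and a container engine are installed, the sample sheet path is hypothetical, and the chosen tools and release tag should be checked against the pipeline documentation.

    # Illustrative sketch: launching nf-core/sarek via the Nextflow CLI from Python.
    # Assumes Nextflow and Docker are installed; "samplesheet.csv" is a hypothetical
    # sample sheet following the pipeline's documented input format.
    import subprocess

    cmd = [
        "nextflow", "run", "nf-core/sarek", "-r", "3.0",   # pipeline version; adjust as needed
        "-profile", "docker",
        "--input", "samplesheet.csv",
        "--outdir", "results",
        "--tools", "strelka,manta",                        # variant callers to run
    ]
    subprocess.run(cmd, check=True)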

A-102: Synthetic genomics data generation and evaluation for the use case of benchmarking somatic variant calling algorithms
Track: BOSC
  • Styliani-Christina Fragkouli, Institute of Applied Biosciences, Centre of Research and Technology Hellas, Greece
  • Andreas Agathangelidis, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Fotis Psomopoulos, Institute of Applied Biosciences, Centre of Research and Technology Hellas, Greece


Presentation Overview: Show

Somatic variant calling algorithms are widely used to detect genomic alterations associated with cancer. However, evaluating the performance of these algorithms can be challenging due to the lack of high-quality ground truth datasets. To address this issue, we developed a synthetic genomics data generation and evaluation framework for benchmarking somatic variant calling algorithms. We generated synthetic datasets based on data from the TP53 gene, using the NEAT simulator. We then thoroughly evaluated the performance of GATK-Mutect2 on these datasets, and compared the results to the “golden” files produced by NEAT that contain the true variations. Our results demonstrate that the synthetic datasets generated using our framework can accurately capture the complexity and diversity of real cancer genome data. Moreover, the synthetic datasets provide an excellent ground truth for evaluating the performance of somatic variant calling algorithms. Our framework provides a valuable resource for testing the performance of somatic variant callers, enabling researchers to evaluate and improve the accuracy of these algorithms for cancer genomics applications.
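
At its core, benchmarking against the NEAT "golden" files amounts to comparing called variant records with the known truth set. The sketch below is illustrative only: it treats variants as (chrom, pos, ref, alt) tuples read from plain-text VCFs at hypothetical paths and reports precision and recall.

    # Illustrative sketch: precision/recall of a caller's VCF against a truth ("golden") VCF.
    # Paths are hypothetical; both files are assumed to be uncompressed, plain-text VCFs.
    def load_variants(vcf_path):
        variants = set()
        with open(vcf_path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                chrom, pos, _id, ref, alt = line.split("\t")[:5]
                variants.add((chrom, int(pos), ref, alt))
        return variants

    truth = load_variants("golden.vcf")
    called = load_variants("mutect2_calls.vcf")

    tp = len(truth & called)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    print(f"precision={precision:.3f} recall={recall:.3f}")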

A-103: BioThings Explorer: a query engine for a federated knowledge graph of biomedical APIs
Track: BOSC
  • Jackson Callaghan, Scripps Research, United States
  • Colleen Xu, Scripps Research, United States
  • Jiwen Xin, Scripps Research, United States
  • Marco Cano, Scripps Research, United States
  • Eric Zhou, Scripps Research, United States
  • Rohan Juneja, Scripps Research, United States
  • Yao Yao, Scripps Research, United States
  • Madhumita Narayan, Scripps Research, United States
  • Kristina Hanspers, Gladstone Institutes, United States
  • Ayushi Agrawal, Gladstone Institutes, United States
  • Alexander Pico, Gladstone Institutes, United States
  • Chunlei Wu, Scripps Research, United States
  • Andrew Su, Scripps Research, United States


Presentation Overview: Show

Knowledge graphs are an increasingly common data structure for representing biomedical information. They can easily represent heterogeneous types of information, and many algorithms and tools exist for operating on them. Biomedical knowledge graphs have been used in a variety of applications, including drug repurposing, identification of drug targets, prediction of drug side effects, and clinical decision support. Typically, such graphs are constructed as a single structural entity by centralizing and integrating data from multiple disparate sources. We present BioThings Explorer, an application that can query a virtual, federated knowledge graph representing the aggregated information of many disparate biomedical web services. BioThings Explorer leverages semantically precise annotations of the inputs and outputs for each resource, and automates the chaining of web service calls to execute multi-step graph queries. Because there is no large, centralized knowledge graph to maintain, BioThings Explorer is distributed as a lightweight application that dynamically retrieves information at query time. More information can be found at https://explorer.biothings.io, and code is available at https://github.com/biothings/biothings_explorer.
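
BioThings Explorer exposes its query interface following the Translator Reasoner API (TRAPI) convention. The sketch below is a minimal one-hop query; the endpoint URL and payload shape are assumptions based on that convention rather than confirmed specifics, so consult the project documentation for the actual URL and supported predicates.

    # Illustrative sketch: a one-hop TRAPI-style query ("which chemicals affect gene NCBIGene:3845?").
    # The endpoint path is an assumption based on the TRAPI convention.
    import requests

    query = {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": ["NCBIGene:3845"], "categories": ["biolink:Gene"]},
                    "n1": {"categories": ["biolink:ChemicalEntity"]},
                },
                "edges": {
                    "e01": {"subject": "n1", "object": "n0",
                            "predicates": ["biolink:affects"]},
                },
            }
        }
    }
    resp = requests.post("https://explorer.biothings.io/v1/query", json=query, timeout=300)
    resp.raise_for_status()
    message = resp.json()["message"]
    print(len((message.get("knowledge_graph") or {}).get("edges", {})), "edges returned")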

A-104: OMEinfo: global geographic metadata for -omics experiments
Track: BOSC
  • Matthew Crown, Northumbria University, United Kingdom
  • Matthew Bashton, Northumbria University, United Kingdom


Presentation Overview: Show

Microbiome classification studies increasingly associate geographical features like rurality and climate with microbiomes. However, microbiologists and bioinformaticians often struggle to access and integrate rich geographical metadata from sources such as GeoTIFFs. Inconsistent definitions of rurality, for example, can hinder cross-study comparisons. To address this, we present OMEinfo, a Python-based tool for automated retrieval of consistent geographical metadata from user-provided location data. OMEinfo leverages open data sources such as the Global Human Settlement Layer, Köppen-Geiger climate classification models, and the Open-Data Inventory for Anthropogenic Carbon dioxide, to ensure metadata accuracy and provenance.

OMEinfo's Dash application enables users to visualise their sample metadata on an interactive map and to investigate the spatial distribution of metadata features; this is complemented by numerical data visualisations for analysing patterns and trends in the geographical data before further analysis. The tool is available as a Docker container, providing a portable, lightweight solution for researchers. Through its standardised metadata retrieval approach and incorporation of FAIR and Open data principles, OMEinfo promotes reproducibility and consistency in microbiome metadata. As the field continues to explore the relationship between microbiomes and geographical features, tools like OMEinfo will prove vital in developing a robust, accurate, and interconnected understanding of these interactions.
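
The underlying operation of pulling a raster value (for example a settlement-layer class or climate zone) at a sample's coordinates can be sketched with rasterio. This is a conceptual illustration, not OMEinfo's implementation; the GeoTIFF path is hypothetical and the raster is assumed to use longitude/latitude coordinates.

    # Illustrative sketch: sampling a GeoTIFF at sample coordinates (not OMEinfo code).
    # Requires the rasterio package; "ghsl_smod.tif" is a hypothetical raster assumed to be
    # in EPSG:4326, e.g. a Global Human Settlement Layer degree-of-urbanisation grid.
    import rasterio

    samples = {"sample_01": (-1.61, 54.97), "sample_02": (13.40, 52.52)}  # (lon, lat)

    with rasterio.open("ghsl_smod.tif") as src:
        coords = list(samples.values())
        for name, value in zip(samples, src.sample(coords)):
            print(name, int(value[0]))  # raster class at that location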

A-105: Open Science, Open Data - Resources and Methods of the Kidney Precision Medicine Project
Track: BOSC
  • H Ascani, University of Michigan - Ann Arbor, United States
  • Kidney Precision Medicine Project (KPMP), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) - U24 DK114886, United States
  • Felix Eichinger, University of Michigan, United States


Presentation Overview: Show

Open-source data and tools are key elements of FAIR principles, engaging a wide spectrum of communities, democratizing scientific efforts, and increasing transparency in research. They inspire generation of diverse analytic methods, highlighting varied aspects of public data and creating benchmarks to validate computational techniques. The push toward Precision Medicine drove the creation of the Kidney Precision Medicine Project (KPMP) and the Kidney Tissue Atlas, an outreach and data sharing tool. Extensive datasets of clinical, pathology, and molecular data, generated from study participant tissue, are shared publicly and analyzed using state-of-the-art technologies, leading to the development of next-generation software tools to visualize and share results. The Atlas Data Repository ensures raw and processed data from KPMP biopsies are publicly available and searchable via multi-level filters, with over 13,000 downloads to date. Personal, identifiable data are controlled but also accessible. The Atlas Explorer engine enables searches for markers or cell types of interest, and includes summary visualizations for snRNA-seq, scRNA-seq, and regional transcriptomics. By employing user-centric web design and personas, the public KPMP database empowers advanced clinical phenotyping and pathomic, transcriptomic, proteomic, epigenomic, or metabolomic interrogation of kidney biopsy samples. The KPMP’s commitment to dataset integration and open science contributes to advances in patient care.

A-106: RiboSeq.Org Data Portal: unified access to processed and standardised ribosome profiling data and metainformation.
Track: BOSC
  • Jack Tierney, University College Cork, Ireland
  • Michał Świrski, University of Warsaw, Poland
  • Håkon Tjeldnes, University of Bergen, Norway
  • Anmol Kiran, University College Cork, Ireland
  • Gionmattia Carancini, University College Cork, Ireland
  • Alla Fedorova, University College Cork, Ireland
  • Stephen Kiniry, EIRNA Bio, Ireland
  • Audrey Michel, EIRNA Bio, Ireland
  • Eivind Valen, University of Bergen, Norway
  • Pavel Baranov, University College Cork, Ireland


Presentation Overview: Show

RiboSeq.Org is an online ecosystem of computational resources and datasets for the study of translation. It features tools for processing, analysing, and visualising high-throughput data from ribosome profiling (Ribo-seq) and related assays, currently comprising RiboGalaxy, GWIPS-viz, Trips-Viz, and RiboCrypt. The latter three provide a wide range of functionalities for visualising and interrogating reference-aligned Ribo-seq data; however, each tool uses independently processed data that intersects on only a fraction of all publicly available data. The exponential growth in publicly available data is causing this fraction to diminish.

This approach leads to redundancy in data processing steps and in efforts to address inconsistencies in available metadata. To address these issues and to make preprocessed datasets and their accompanying metadata freely available, we established the RiboSeq.Org Data Portal. In addition to metadata gleaned from public repositories, we also provide quality reports that assess triplet periodicity, functional region occupancy, sequence biases, and more. The searchable interface allows users to choose datasets for data mining experiments and visualise them within RiboSeq.Org or download the processed data for offline analysis. We envision this framework can be applied to other methods in the future, facilitating data re-usability and reproducibility of results.
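
One of the reported quality metrics, triplet periodicity, boils down to tallying ribosome-protected-fragment 5′ ends by reading-frame position. The sketch below is illustrative (not the portal's code) and uses toy offsets relative to an annotated CDS start.

    # Illustrative sketch: triplet periodicity from read 5' end positions relative to a CDS start.
    # `offsets` would normally come from aligned Ribo-seq reads; here it is toy data.
    from collections import Counter

    offsets = [0, 3, 6, 7, 9, 12, 12, 13, 15, 18, 20, 21, 24, 27, 30]  # nt from CDS start
    frame_counts = Counter(off % 3 for off in offsets)

    total = sum(frame_counts.values())
    for frame in (0, 1, 2):
        frac = frame_counts.get(frame, 0) / total
        print(f"frame {frame}: {frac:.2f}")
    # Strong enrichment of one frame over the other two indicates good triplet periodicity.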

A-107: JBrowse 2: a modular genome browser with views of synteny and structural variation
Track: BOSC
  • Colin Diesh, University of California, Berkeley, United States
  • Garrett Stevens, University of California, Berkeley, United States
  • Peter Xie, University of California, Berkeley, United States
  • Teresa De Jesus Martinez, University of California, Berkeley, United States
  • Elliot Hershberg, University of California, Berkeley, United States
  • Angel Leung, University of California, Berkeley, United States
  • Emma Guo, University of California, Berkeley, United States
  • Shihab Dider, University of California, Berkeley, United States
  • Junjun Zhang, OICR, Canada
  • Caroline Bridge, OICR, Canada
  • Gregory Hogue, OICR, Canada
  • Andrew Duncan, OICR, Canada
  • Scott Cain, OICR, Canada
  • Robert Buels, University of California, Berkeley, United States
  • Lincoln Stein, OICR, Canada
  • Ian Holmes, University of California, Berkeley, United States


Presentation Overview: Show

Genome browsers are commonly used in bioinformatics for interactive visualization of genomic datasets. However, most genome browsers are only capable of showing data from a single genome at a single locus. We created JBrowse 2 to enable the visualization of multiple related genomes with built-in synteny views, and to show complex structural variants that can span multiple genomic loci. JBrowse 2 can be extensively customized via configuration or extended via plugins to address the needs of its diverse user base. Recent improvements to the JBrowse 2 core codebase include: the addition of multi-wiggle tracks, which can show many different quantitative signals in a compact format; the ability to launch a synteny view from a regular genome browser view; and new visualization modalities for alignment tracks. We will also demonstrate new plugins to map genome coordinates onto 3-D protein structures and to explore splicing with the isoform inspector. JBrowse 2 is built on a modern web application stack using React and TypeScript. It runs as a web app, a desktop app, or as components installable via NPM. JBrowse 2 also has Jupyter Notebook integration via jbrowse_jupyter (on PyPI) and R/Shiny integration via JBrowseR (on CRAN).
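
For notebook use, the jbrowse_jupyter package can stand up a browser view in a few lines. The sketch below follows the create()/launch() pattern described in its documentation; the function names and defaults are stated from memory and should be verified against the package docs.

    # Illustrative sketch: embedding a JBrowse 2 linear genome view in a Jupyter notebook.
    # Requires the jbrowse_jupyter package (pip install jbrowse-jupyter); API names follow
    # its documented create()/launch() pattern and may differ between versions.
    from jbrowse_jupyter import create, launch

    config = create("LGV", genome="hg38")   # linear genome view preconfigured with hg38
    launch(config.get_config())             # renders the browser component in the notebook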

A-108: Automated production engine to decode the tree of life
Track: BOSC
  • Priyanka Surana, Wellcome Sanger Institute, United Kingdom
  • Darwin Tree of Life Consortium, Darwin Tree of Life Consortium, United Kingdom


Presentation Overview: Show

Darwin Tree of Life, a collaboration between 10 organisational partners, aims to capture the biodiversity on the islands of Britain and Ireland through genomics. To analyse this diversity of life, we are building a series of production-grade workflows that take the raw data from the sequencing machines to (1) assemble, decontaminate, and curate the genome, (2) create automated, standardised genome publications, and (3) run comparative genomics analyses. Here, we showcase how data flows from the sequencing machines to our pipelines and through them to public archives, where all our data is made available rapidly and without embargo. Next, our released data is downloaded back to our servers, where it is processed into standardised genome publications. Finally, we share our roadmap for the next phase, which involves making our pipelines sustainable using green coding principles. All our pipelines are developed using open-source principles with nf-core templates and tools to ensure they meet the highest community standards. We are one of several initiatives working towards the goal of sequencing all complex life on Earth. This should aid conservation, help us understand the interconnectedness of all life, and support a new economy based on biological materials.

A-109: SpatialData: a FAIR framework for multimodal spatial omics
Track: BOSC
  • Luca Marconato, European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
  • Giovanni Palla, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Kevin Akira Yamauchi, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
  • Isaac Virshup, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Elyas Heidari, Division of Computational Genomics and System Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  • Tim Treis, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Marcella Toth, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Rahul Shrestha, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Harald Voehringer, European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
  • Josh Moore, German BioImaging - Society for Microscopy and Image Analysis e.V., Konstanz, Germany
  • Fabian Theis, Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  • Oliver Stegle, European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany


Presentation Overview: Show

The abundance of spatial profiling methods comes with a heterogeneity of file formats, which creates barriers in everyday analyses. Tasks like data loading, processing, and visualization require ad-hoc conversions, which limit the ability to repurpose existing computational methods for new spatial technologies. Furthermore, the recent trend toward spatial multimodal profiling increases the complexity of file formats and analyses.
To address these problems, we established a collaboration that unites research scientists from the single-cell and imaging communities, and together we are developing a FAIR, language-agnostic, cloud-ready file storage format that extends OME-NGFF, an open format already established for large bioimaging datasets.
We are also creating a Python package, SpatialData, which provides in-memory representations of the data and implements general data manipulation operations, abstracted to work interchangeably across multi-modal spatial omics datasets. Finally, we are developing a Napari plugin, which enables interactive visualization and annotation of spatial multi-omics datasets.
We believe that our computational solutions for data storage, manipulation and visualization will greatly reduce the technical overhead when dealing with spatial multi-modal datasets, will allow for easier repurposing of computational methods, and consequently will allow new unexplored spatial data analysis landscapes.
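
In practice, a stored dataset is a single Zarr store that the Python package reads back into its in-memory elements. The following is a minimal sketch, assuming the spatialdata package's read_zarr entry point and a hypothetical store path; element attribute names may vary between releases.

    # Illustrative sketch: loading a SpatialData Zarr store and inspecting its elements.
    # Assumes the spatialdata package; "visium_dataset.zarr" is a hypothetical store path.
    import spatialdata as sd

    sdata = sd.read_zarr("visium_dataset.zarr")
    print(sdata)                                    # summarises the stored elements
    print(list(sdata.images), list(sdata.shapes))   # element names by type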

A-110: BioThings SDK for building a knowledge base API ecosystem in the context of the Biomedical Translator Program
Track: BOSC
  • Yao Yao, The Scripps Research Institute, United States
  • Jason Lin, The Scripps Research Institute, United States
  • Everaldo Rodolpho, The Scripps Research Institute, United States
  • Marco Alvarado Cano, The Scripps Research Institute, United States
  • Nichollette Acosta, The Scripps Research Institute, United States
  • Ginger Tsueng, The Scripps Research Institute, United States
  • Andrew I. Su, The Scripps Research Institute, United States
  • Chunlei Wu, The Scripps Research Institute, United States


Presentation Overview: Show

Building web-based APIs (Application Programming Interfaces) has become an increasingly popular way to access biomedical data and knowledge, thanks to their flexibility, simplicity, and reliability compared with traditional flat-file downloads. Our team has previously developed a set of high-performance, scalable biomedical knowledge APIs, called “BioThings APIs”, which are now widely used by the community. In this abstract, we focus on the underlying toolkit used to build these APIs and how it can help specific research communities build their own API-centered knowledge base ecosystems. This toolkit, BioThings SDK (https://biothings.io), is a generalized software development kit that lets developers build, update, and deploy high-performance APIs from any knowledge source and biomedical data type. Users can take advantage of the abstracted technical layers built into the SDK and produce a high-performance API that follows best practices and community standards. In the NCATS-funded Biomedical Translator program, BioThings SDK is used by multiple teams to create dozens of new Knowledge Provider APIs. These APIs in turn contribute to a large-scale, sustainable, and interoperable API-based knowledge base ecosystem. We believe the use case from the Translator program is also applicable to other domain-specific research communities that face the same knowledge sharing, integration, and long-term maintenance challenges.
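
The flavour of API that the SDK produces can be seen in the existing BioThings APIs; for example, MyGene.info (one of the original BioThings APIs) answers simple REST queries such as the one below. This is an illustrative query against that public service, with field names following its documented query syntax.

    # Illustrative sketch: querying MyGene.info, an existing BioThings API, to show the
    # REST style that BioThings SDK-built APIs expose.
    import requests

    resp = requests.get(
        "https://mygene.info/v3/query",
        params={"q": "symbol:CDK2", "species": "human", "fields": "symbol,name,entrezgene"},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]:
        print(hit.get("entrezgene"), hit.get("symbol"), hit.get("name"))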

A-111: FAIR-BioRS: Actionable guidelines for making biomedical research software FAIR
Track: BOSC
  • Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
  • Hervé Ménager, Institut Pasteur, Université Paris Cité, France
  • Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States


Presentation Overview: Show

We present the first actionable guidelines for making biomedical research software Findable, Accessible, Interoperable, and Reusable (FAIR) in line with the FAIR principles for Research Software (FAIR4RS principles). The FAIR4RS principles are the outcome of a large-scale global initiative to adapt the FAIR data principles to research software. They provide a framework for optimizing the reusability of research software and encourage open science. The FAIR4RS principles are, however, aspirational. Practical guidelines that biomedical researchers can easily follow for making their research software FAIR are lacking. To fill this gap, we established the first minimal and actionable guidelines that researchers can follow to easily make their biomedical research software FAIR. We designate these guidelines as the FAIR Biomedical Research Software (FAIR-BioRS) guidelines. The guidelines provide actionable step-by-step instructions that clearly specify relevant standards, best practices, metadata, and sharing platforms to use. We believe that the FAIR-BioRS guidelines will empower and encourage biomedical researchers to adopt FAIR and open practices for their research software. We present here our approach to establishing these guidelines, summarize their major evolution through community feedback since the first version was presented at BOSC 2022, and explain how the community can benefit from and contribute to them.

A-112: higlass-python: A Programmable Genome Browser for Linked Interactive Visualization and Exploration of Genomic Data
Track: BOSC
  • Trevor Manz, Harvard Medical School, United States
  • Vedaz Yilmaz, UMass Chan Medical School, United States
  • Nils Gehlenborg, Harvard University, United States
  • Nezar Abdennur, UMass Chan Medical School, United States


Presentation Overview: Show

HiGlass is a web-based viewer for 1D and 2D genomic datasets, providing smooth navigation and flexible view configurations. We developed higlass-python, a software development kit (SDK) to expand HiGlass capabilities and integrate it with the scientific Python ecosystem. This SDK enables custom HiGlass-powered applications, dashboards, and rapid genomic data exploration in computational notebooks, such as Jupyter Notebooks.

Higlass-python offers a toolkit for synchronizing genomic representations between HiGlass and third-party visualizations that present alternative representations of genomic loci. This feature empowers computational researchers to extend the genome browser using Python scripts.

Integrating traditional genome browser views and dynamically linked alternative views, higlass-python facilitates more complete exploration and analysis of genomic data. The toolkit provides building blocks for two-way data binding between HiGlass and other visualizations using single-locus mapping information.

We utilized higlass-python in Jupyter Notebooks to examine latent embedding spaces of 3D genomic contact frequency profiles within functional and epigenomic contexts. Dimensionality-reduced contact profile representations are embedded in 2D visualizations, with each point representing a genomic locus or interval. Our SDK coordinates multiple views of individual loci with HiGlass, synchronizing selections across all views and allowing data loading, exploration, export, and analysis within the same computational environment.
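
A minimal notebook usage sketch, assuming the current higlass-python API (hg.remote / .track() / hg.view) and a public example tileset hosted on higlass.io; exact names may vary between releases.

    # Illustrative sketch: displaying a remote Hi-C matrix with higlass-python in a notebook.
    # API names (remote, track, view) follow the current higlass-python documentation and
    # may differ in older releases; the tileset UID is a public example on higlass.io.
    import higlass as hg

    tileset = hg.remote(
        uid="CQMd6V_cRw6iCI_-Unl3PQ",            # public example Hi-C tileset
        server="https://higlass.io/api/v1",
    )
    view = hg.view(tileset.track("heatmap"), width=6)
    view  # in a notebook, the last expression renders the interactive widget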

A-113: Standards to Connect Biomedical and Behavioral Research to Artificial Intelligence in the Bridge2AI Program
Track: BOSC
  • Sarah Gehrke, University of Colorado Anschutz Medical Campus, United States
  • Monica Munoz-Torres, University of Colorado Anschutz Medical Campus, United States
  • Alex Wagner, Nationwide Children's Hospital, United States
  • Christopher Mungall, Lawrence Berkeley National Laboratory, United States
  • Melissa Haendel, University of Colorado Anschutz Medical Campus, United States
  • Christopher Chute, Johns Hopkins University, United States
  • Jessica Mitchell, Johns Hopkins University, United States
  • Amy Heiser, Sage Bionetworks, United States
  • Nomi Harris, Lawrence Berkeley National Laboratory, United States
  • Harry Caufield, Lawrence Berkeley National Laboratory, United States
  • James Stevenson, Nationwide Children's Hospital, United States
  • Justin Reese, Lawrence Berkeley National Laboratory, United States
  • Milen Nikolov, Sage Bionetworks, United States
  • Corey Cox, University of Colorado Anschutz Medical Campus, United States
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Wesley Goar, Nationwide Children's Hospital, United States
  • James Eddy, Sage Bionetworks, United States


Presentation Overview: Show

The Bridge to Artificial Intelligence (Bridge2AI) Program aims to facilitate the use of AI in biomedical and behavioral research by generating new, ethically-sourced data and promoting a culture of ethical consideration. The Bridge2AI Integration, Dissemination, and Evaluation (Bridge) Center is tasked with ensuring that these goals are met throughout the data lifecycle. The current lack of standardized data is hindering progress in this field. To remediate this, the Standards Core of the Bridge Center is developing best practices for data collection, deposition, quality assurance, query, dissemination, and integration, with the aim of ensuring that the standards implemented in the quest for generating AI-ready data are generalizable and useful to the wider scientific community. By promoting open source and collaborative development of best practices and norms, and by developing methods to build and extend standards to address key data linkage and integration use cases for AI, the Bridge2AI Standards Core aims to provide researchers with access to new resources for discovery and innovation. Learn more at https://bridge2ai.org/standards-core

A-114: Faster evaluation of CRISPR guide RNAs across entire genomes
Track: BOSC
  • Carl Schmitz, Queensland University of Technology, Australia
  • Jacob Bradford, Queensland University of Technology, Australia
  • Dimitri Perrin, Queensland University of Technology, Australia


Presentation Overview: Show

The design of CRISPR-Cas9 guide RNAs is not trivial. In particular, evaluating the risk of off-target modifications is computationally expensive: a brute-force approach would require comparing each candidate guide with every possible CRISPR target site in the genome. In a mammalian genome, this means hundreds of millions of comparisons for each guide. We have previously introduced Crackling, a gRNA design tool that relies on Inverted Signature Slice Lists (ISSL) to accelerate off-target scoring by considering only sites that share a partial match (a slice) with the candidate guide. This produced an order-of-magnitude speed-up whilst still maintaining scoring accuracy. Here, we present a complete reimplementation of Crackling in C++ and discuss further improvements. Using longer slices we perform fewer comparisons, and we show it is possible to construct a collection of slices that still preserves an exact off-target score. We have benchmarked two ISSL configurations with the new version of Crackling and report a 15-22 times speed-up over the default ISSL configuration. We also show that, using memory-mapped files, this can be achieved without any significant increase in memory usage. CracklingPlusPlus is available at https://github.com/bmds-lab/CracklingPlusPlus under the Berkeley Software Distribution (BSD) 3-Clause license.
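
The slice idea can be illustrated with a toy index: each target site is keyed by the exact sequence at each slice position, and a candidate guide is only compared against sites that share at least one slice. This is an illustrative sketch of the concept, not the CracklingPlusPlus implementation; slice width and sequences are toy values.

    # Illustrative sketch of slice-based candidate filtering (not the ISSL/Crackling code).
    # Sites sharing no exact slice with the guide are skipped entirely, so the expensive
    # off-target scoring only runs on a small candidate subset.
    from collections import defaultdict

    SLICE_LEN = 5  # toy slice width; real configurations use longer slices over 20-mers

    def slices(seq):
        # positional, non-overlapping slices: (start, subsequence)
        return [(i, seq[i:i + SLICE_LEN]) for i in range(0, len(seq) - SLICE_LEN + 1, SLICE_LEN)]

    def build_index(target_sites):
        index = defaultdict(set)
        for site in target_sites:
            for key in slices(site):
                index[key].add(site)
        return index

    def candidates(guide, index):
        hits = set()
        for key in slices(guide):
            hits |= index.get(key, set())
        return hits  # only these sites would be passed to the full off-target scorer

    sites = ["ACGTGACGTGACGTGACGTG", "TTTTGACGTGAAAAACCCCC", "GGGGGGGGGGGGGGGGGGGG"]
    index = build_index(sites)
    print(candidates("ACGTGTTTTTACGTGTTTTT", index))  # only the first site is scored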

A-115: Unleashing Pfam annotations through the InterPro website
Track: BOSC
  • Typhaine Paysan-Lafosse, EMBL-EBI, United Kingdom
  • Matthias Blum, EMBL-EBI, United Kingdom
  • Gustavo A Salazar, EMBL-EBI, United Kingdom
  • Alex Bateman, EMBL-EBI, United Kingdom


Presentation Overview: Show

Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. It is part of the InterPro member database consortium, alongside twelve other resources. InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. The Pfam database and its website have been a reference in the world of protein classification for the scientific community for over 20 years. However, the Pfam website codebase carried a lot of technical debt and had become increasingly difficult to maintain. Furthermore, Pfam annotations have been integrated into the InterPro database for a number of years and are accessible through the InterPro website. Hence, we took the decision to retire the Pfam website progressively, starting in summer 2022. Being aware that the Pfam website was widely used by scientists, we have worked hard to integrate its key functionalities into the InterPro website to ease the transition. Here we report how Pfam annotations can be visualised in the InterPro website for different use cases and the benefits of combining other member database annotations with Pfam.
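
Programmatic access to Pfam entries now goes through the InterPro REST API. The sketch below fetches metadata for a single Pfam family; the endpoint pattern follows the public InterPro API documentation but should be verified there.

    # Illustrative sketch: fetching metadata for a Pfam family through the InterPro REST API.
    # The endpoint pattern follows the public InterPro API; PF00069 is the protein kinase domain.
    import requests

    resp = requests.get(
        "https://www.ebi.ac.uk/interpro/api/entry/pfam/PF00069",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    meta = resp.json()["metadata"]
    print(meta["accession"], meta.get("name"))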

A-116: The Phenopacket Generator: A Tool for Encoding Standardized Rare Disease Patient Phenotypes
Track: BOSC
  • Julie McMurry, Monarch Initiative, United States
  • Baha El Kassaby, Jackson Laboratories, United States
  • Daniel Danis, Jackson Laboratories, United States
  • Michael Gargano, Jackson Laboratories, United States
  • Camille Liedtka, Jackson Laboratories, United States
  • Beth Sundberg, Jackson Laboratories, United States
  • Sejal Desai, Jackson Laboratories, United States
  • Julius O. B. Jacobsen, Jackson Laboratories, United States
  • Benjamin Coleman, Jackson Laboratories, United States
  • Monica Munoz-Torres, University of Colorado Anschutz Medical Campus, United States
  • Melissa Haendel, University of Colorado Anschutz Medical Campus, United States
  • Peter Robinson, Jackson Laboratories, United States


Presentation Overview: Show

Standardized data formats are ubiquitous in biomedicine. BED, CRAM, and VCF have enabled genomics to advance rapidly. At last year’s BOSC, we debuted the Phenopacket Schema, a GA4GH standard for representing patient case-level phenotypes (a critical piece for accurate variant prioritization). These include fine-grained details such as age of onset, severity, and absence. The Phenopacket Schema also provides other data structures that support exchange of computable information for diagnosis, medical actions, and research of all types of disease, including Mendelian and complex genetic diseases, cancer, and infectious diseases.

Phenopackets are designed to be used across a comprehensive landscape of applications including biobanks, databases and registries, clinical information systems such as Electronic Health Records, genomic matchmaking, diagnostic laboratories, and computational tools. The newly-developed Phenopacket Generator is a single-page application enabling anyone in these contexts to easily create and validate a Phenopacket. It leverages other tools such as Fenominal (https://monarch-initiative.github.io/fenominal/), a text-mining library for disease and phenotype concepts, and Phenopacket-tools, a library for working with Phenopackets, and it draws upon other open standards such as the Human Phenotype Ontology, the Medical Action Ontology, and the Mondo disease ontology.
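
The generated artefact is a small JSON document conforming to the GA4GH Phenopacket Schema v2. Below is a minimal, hand-written illustration; the patient identifiers, onset, and created date are hypothetical, and only a subset of the metadata normally required (e.g., ontology resource declarations) is shown.

    # Illustrative sketch: a minimal Phenopacket-Schema-v2-style JSON document, written by
    # hand for illustration (patient id, onset, and created date are hypothetical).
    import json

    phenopacket = {
        "id": "example-case-1",
        "subject": {"id": "patient-1", "sex": "FEMALE"},
        "phenotypicFeatures": [
            {
                "type": {"id": "HP:0001250", "label": "Seizure"},
                "onset": {"age": {"iso8601duration": "P2Y6M"}},
            },
            {
                "type": {"id": "HP:0001263", "label": "Global developmental delay"},
                "excluded": False,
            },
        ],
        "metaData": {
            "created": "2023-05-01T00:00:00Z",
            "createdBy": "phenopacket-generator-example",
            "phenopacketSchemaVersion": "2.0",
        },
    }
    print(json.dumps(phenopacket, indent=2))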

A-117: A nextflow workflow of peptide sequence selection for targeted proteomic mass spectrometry
Track: BOSC
  • Sylvere Bastien, CIRI, France
  • Pauline François, CIRI, France
  • Sara Moussadeq, CIRI, France
  • Iulia Macavei, ISA, France
  • Karen Moreau, CIRI, France
  • Jerôme Lemoine, Institut des Sciences Analytiques de Lyon, France
  • Francois Vandenesch, Hospices Civils de Lyon - UCBL-1 - CNR Staph, France


Presentation Overview: Show

Targeted proteomics quantifies specific proteins of interest by detecting short peptide sequences. The selection of these sequences is a crucial step, since it must take into account all the variability of the sequence while excluding duplicated motifs arising from repeated sequences or duplication events.
We developed a nextflow pipeline to determine a minimum list of peptide sequences to detect any variant of a protein of interest from a FASTA amino acid database. This pipeline comprises several steps, including (i) building a database of non-redundant proteins, (ii) filtering the proteins using several alignment tools, (iii) selecting the variants to be retained based on their frequency, (iv) simulating in vitro trypsin hydrolysis, and (v) establishing a minimum list of peptide sequences covering the whole allelic diversity. An additional validation step is performed by aligning the selected sequences to the initial database and generating several graphs and tables.
This pipeline was used to select peptide sequences for the quantification of 41 Staphylococcus aureus virulence proteins in a targeted proteomics approach using high-throughput mass spectrometry. This approach is useful to accurately detect any variant of a specific protein and can be applied to other proteins in different organisms.
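
Step (iv), the simulated trypsin hydrolysis, typically applies the classic cleavage rule (cut after K or R, but not before P). The sketch below shows that rule only, on a toy sequence; the actual pipeline step may use a dedicated digestion tool.

    # Illustrative sketch of in-silico trypsin digestion (cleave after K or R, not before P).
    # This is the textbook rule only; the toy protein sequence is hypothetical.
    import re

    def trypsin_digest(protein, min_len=6):
        # split after K or R unless the next residue is P, then filter short peptides
        peptides = re.split(r"(?<=[KR])(?!P)", protein)
        return [p for p in peptides if len(p) >= min_len]

    toy_protein = "MKKTAFTLLLFIALTLTTSPLVNGSEKSEEINEKDLRKKSELQGTALGNLKQIYYYNEK"
    print(trypsin_digest(toy_protein))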