Session A: July 21, 2025, 10:00-11:20 and 16:00-16:40
Session B: July 22, 2025, 10:00-11:20 and 16:00-16:40
Session A posters: set up Monday, July 21, 08:00-08:40; dismantle Tuesday, July 22, at 18:00
Session B posters: set up Monday, July 21, 08:00-08:40; dismantle Tuesday, July 22, at 18:00
Session C: July 23, 2025, 10:00-11:20 and 16:00-16:40
Session D: July 24, 2025, 10:00-11:20 and 13:00-14:00
Session C posters: set up Wednesday, July 23, 08:00-08:40; dismantle Thursday, July 24, at 16:00
Session D posters: set up Wednesday, July 23, 08:00-08:40; dismantle Thursday, July 24, at 16:00
Results
B-087: Codefair: Easily Make Biomedical Research Software Findable, Accessible, Interoperable, Reusable (FAIR)
Track: BOSC: Bioinformatics Open Source Conference
- Dorian Portillo, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
- Sanjay Soundarajan, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
- Bhavesh Patel, FAIR Data Innovations Hub, California Medical Innovations Institute, United States
Presentation Overview:
Codefair is your personal assistant to make research software Findable, Accessible, Interoperable, and Reusable (FAIR). The FAIR-BioRS guidelines provide actionable steps to make software FAIR, but complying with them can still be time-consuming and difficult. Codefair is designed to automate the process, letting researchers concentrate on the goals of their software. It is developed as a free and open source (MIT license) GitHub app. Once installed, Codefair seamlessly integrates into your existing workflow by monitoring your repository within GitHub using tools like Probot, providing feedback via a GitHub issue, allowing users to address the issues through user-friendly interfaces, and even submitting pull requests automatically to maintain continuous FAIR compliance. With thorough documentation and a design focused on ease of use, Codefair is accessible to all. When introduced at BOSC 2024 in an early development phase, Codefair primarily assisted researchers in adding essential metadata elements such as license, CITATION.cff, and codemeta.json. Since then, it has expanded to support streamlined archival on Zenodo and automated validation of Common Workflow Language (CWL) files. A new web dashboard enables easy assessment and management of FAIR compliance across your GitHub repositories. Usage of Codefair has also increased: it is now installed on over 400 GitHub repositories. By alleviating the time and effort needed for FAIR compliance, Codefair encourages biomedical researchers to embrace FAIR and open practices. In this presentation, we detail our development approach, outline current and planned features, summarize updates since BOSC 2024, and invite the community to explore and contribute to Codefair.
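The kind of check Codefair automates can be sketched as a simple scan for the metadata files the FAIR-BioRS guidelines call for. The file list and helper below are illustrative only, not Codefair's actual rule set:

```python
from pathlib import Path

# Metadata files associated with FAIR-BioRS compliance; this list is an
# illustrative assumption, not Codefair's real checklist.
EXPECTED_FILES = ["LICENSE", "CITATION.cff", "codemeta.json"]

def missing_fair_metadata(repo_dir):
    """Return the expected metadata files absent from a repository checkout."""
    repo = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (repo / name).is_file()]
```

A bot like Codefair would run a check of this kind on each push and open or update an issue listing what is missing.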
B-089: megSAP - a diagnostics grade long-read pipeline
Track: BOSC: Bioinformatics Open Source Conference
- Marc Sturm, Institute of Medical Genetics and Applied Genomics, University Hospital Tübingen, Germany
- Stephan Ossowski, Institute of Medical Genetics and Applied Genomics, University Hospital Tübingen, Germany
- Tobias Haack, Institute of Medical Genetics and Applied Genomics, University Hospital Tübingen, Germany
Presentation Overview:
Short-read genome sequencing (SR-GS) has limitations, especially in homologous and low-complexity regions. Long-read genome sequencing (LR-GS) can overcome many of these limitations by resolving complex genomic regions, by facilitating haplotype phasing and by providing methylation state in addition to base sequence. Thus, LR-GS has the potential to increase the diagnostic yield in rare diseases (still more than 50% of cases are unsolved).
As part of the German long-read initiative (lonGERr) and the European Long Read Innovation Network (ELRIN), we have added support for Nanopore long-read data to megSAP, our open-source data analysis pipeline for medical genetics (https://github.com/imgag/megSAP). We initially sequenced multiple GIAB reference samples and a small cohort of difficult cases featuring causal repeat expansions, causal variants in duplicated genes, and causal complex structural variants, to benchmark the detection and clinical interpretation of variants using LR-GS data. Our LR-GS analysis pipeline is now accredited for rare disease diagnostics by the German Accreditation Body (DAkkS).
Here we present implementation details of our open-source pipeline for LR-GS data analysis. Additionally, we present benchmark results and compare them to the performance of SR-GS. Finally, we showcase a few interesting cases solved with LR-GS.
B-091: The WorkflowHub: A FAIR Registry for computational workflows
Track: BOSC: Bioinformatics Open Source Conference
- Finn Bacall, The University of Manchester, United Kingdom
- Stian Soiland-Reyes, The University of Manchester, United Kingdom
- Stuart Owen, The University of Manchester, United Kingdom
- Nick Juty, The University of Manchester, United Kingdom
- Johan Gustafsson, Australian BioCommons, Australia
- Frederick Coppens, VIB-UGent Center for Plant Systems Biology, Belgium
- Carole Goble, The University of Manchester, United Kingdom
Presentation Overview:
WorkflowHub is a free, open-source registry built to support the sharing and discovery of computational workflows, standard operating procedures, and their associated research assets. It boosts collaboration and scientific transparency by making workflows easier to co-create, discover, cite, and share, helping research become more reproducible and aligned with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.
Supporting widely adopted standards such as the Common Workflow Language, RO-Crate, Bioschemas, FAIR Signposting, and GA4GH’s TRS API, WorkflowHub integrates smoothly with other platforms and tools. Its new RO-Crate submission API allows research communities to programmatically create and update workflow entries, streamlining catalog maintenance.
WorkflowHub connects with external services to enhance metadata. For example, it integrates with BBMRI-ERIC's LifeMonitor to assess workflow health and availability. It can also be used to connect workflows with the software they depend upon, for instance the ELIXIR (Life Science community) tools catalogue bio.tools.
The platform is agnostic to workflow language and format, and is intended to span research domains; it currently hosts over 817 workflows in 35 formats, with more than 300 teams contributing across disciplines like life sciences, health, chemistry, biodiversity, climate science, and astronomy. Users can register 'draft' workflows to co-develop through team collaboration, which may itself be cross-institutional. Finalised workflows can be assigned (free) DOIs, with versioning, enabling proper citation.
WorkflowHub serves a truly international community, and is coordinated by ELIXIR in partnership with the Australian BioCommons. Its development is shaped by its users through regular community events, and through biweekly WorkflowHub Club calls.
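Because WorkflowHub exposes a GA4GH TRS API, registered workflows can in principle be addressed through standard TRS endpoints. The sketch below builds such a URL; the base path is an assumption for illustration, while the `/tools/{id}` layout comes from the TRS v2 specification:

```python
from urllib.parse import quote

# Assumed deployment path; check the WorkflowHub documentation for the
# actual TRS base URL.
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"

def trs_tool_url(tool_id):
    """URL for a single tool record under the GA4GH TRS v2 API."""
    return f"{TRS_BASE}/tools/{quote(str(tool_id), safe='')}"
```

A TRS client would GET this URL and receive a JSON tool description, from which workflow versions and descriptor files can be resolved.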
B-093: Advancing Pediatric and Longitudinal DNA Methylation Studies with CellsPickMe, an Integrated Blood Cell Deconvolution Method
Track: BOSC: Bioinformatics Open Source Conference
- Maggie Fu, University of British Columbia, Canada
- Karlie Edwards, University of British Columbia, Canada
- Erick Navarro-Delgado, University of British Columbia, Canada
- Sarah Merrill, University of Massachusetts Lowell, United States
- John Holloway, University of Southampton, United Kingdom
- Stuart Turvey, University of British Columbia, Canada
- Michael Kobor, University of British Columbia, Canada
Presentation Overview:
Prospective birth cohorts offer the potential to interrogate the relation between early life environment and embedded biological processes such as DNA methylation (DNAme). These association studies are frequently conducted in the context of blood, a heterogeneous tissue composed of diverse cell types. Accounting for this cellular heterogeneity across samples is essential, as it is a main contributor to inter-individual DNAme variation. Integrated blood cell deconvolution of pediatric and longitudinal birth cohorts poses a major challenge, as existing methods fail to account for the distinct cell population shift between birth and adolescence. We critically evaluated the reference-based deconvolution procedure and optimized its prediction accuracy for longitudinal birth cohorts using DNAme data from the Canadian Healthy Infant Longitudinal Development (CHILD) cohort. The optimized algorithm, CellsPickMe, integrates cord and adult references and picks DNAme features for each population of cells with machine learning algorithms. It demonstrated improved deconvolution accuracy in cord, pediatric, and adult blood samples compared to existing benchmark methods. CellsPickMe supports blood cell deconvolution across early developmental periods under a single framework, enabling cross-time-point integration of longitudinal DNAme studies. Given the increased resolution of cell populations predicted by CellsPickMe, this R package empowers researchers to explore immune system dynamics using DNAme data in population studies across the life course.
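Reference-based deconvolution of the kind CellsPickMe optimizes models a bulk methylation profile as a weighted mixture of cell-type reference profiles. A minimal sketch, using plain least squares with clipping and renormalization, far simpler than CellsPickMe's machine-learning feature selection:

```python
import numpy as np

def deconvolve(bulk, reference):
    """Estimate cell-type proportions for one bulk sample.

    bulk:      (n_cpgs,) methylation beta values of the mixed sample
    reference: (n_cpgs, n_celltypes) mean profile per purified cell type
    Solves ordinary least squares, then clips negatives and renormalizes
    so proportions sum to one (a crude stand-in for a constrained fit).
    """
    coefs, *_ = np.linalg.lstsq(reference, bulk, rcond=None)
    coefs = np.clip(coefs, 0.0, None)
    return coefs / coefs.sum()
```

Real methods additionally select informative CpG features per cell population, which is exactly where CellsPickMe's improvements lie.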
B-095: A Study on the Design of Access Control Mechanisms for NGS-Based LIMS
Track: BOSC: Bioinformatics Open Source Conference
- Il-Kwon Lim, Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, South Korea
- Jinseon Yoo, Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, South Korea
- Soobok Joe, Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, South Korea
- Boram Kang, Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, South Korea
- Yeyoung Yoon, Korea Bioinformation Center, Korea Research Institute of Bioscience & Biotechnology, South Korea
Presentation Overview:
A Laboratory Information Management System (LIMS) is a software solution designed to efficiently manage the vast amount of data generated in laboratories. Currently, LIMS platforms are also being utilized in workflows for NGS-based genomic and transcriptomic sequencing and analysis. These systems are rapidly evolving in response to increasing demands for mobile accessibility, cloud technologies, artificial intelligence integration, and management of diverse high-throughput analytical instruments. Accordingly, the need for strengthened user authentication mechanisms has become critical to ensure secure software operation. This study reviews recent trends in access control technologies for software solutions and analyzes the current state and challenges of LIMS used in NGS-based sequencing and analysis. Furthermore, it explores and proposes a secure user access control framework optimized for LIMS by leveraging open-source solutions.
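An access control framework of the sort the study proposes can be illustrated with a minimal role-based scheme. The roles and permission names below are hypothetical, chosen only to show the pattern:

```python
# Minimal role-based access control (RBAC) sketch for a LIMS; the roles
# and permissions are illustrative, not those proposed in the study.
ROLE_PERMISSIONS = {
    "admin":    {"run_create", "run_read", "run_delete", "user_manage"},
    "analyst":  {"run_create", "run_read"},
    "reviewer": {"run_read"},
}

def is_allowed(role, permission):
    """True if the given role grants the requested permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

A production system would layer authentication (e.g. via an open-source identity provider) on top of such a permission check.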
B-097: Standardized Differential Expression Analysis for Complex Bulk RNA-seq Designs with multiconDErich
Track: BOSC: Bioinformatics Open Source Conference
- Christian Heyer, Biomed. Informatics, Data Mining and Analytics, University of Augsburg; Faculty of Biosciences, Heidelberg University, Germany
- Ashik Ahmed Abdul Pari, European Center for Angioscience Heidelberg University; Division of Vascular Oncology and Metastasis, DKFZ Heidelberg, Germany
- Matthias Schlesner, Biomedical Informatics, Data Mining and Data Analytics, University of Augsburg, 86159 Augsburg, Germany, Germany
Presentation Overview:
Despite the rise of single-cell and spatial transcriptomics, bulk RNA-seq remains the most scalable and statistically powerful approach for many biological studies, particularly those involving multifactorial designs or limited input material. With growing experimental complexity and throughput, there is a pressing need for standardized, reproducible workflows that support interaction modeling and handle batch effects robustly.
multiconDErich is a Snakemake-based workflow for bulk RNA-seq analysis in complex study designs. The pipeline requires only a count matrix, metadata, and a configuration file. It supports both pairwise differential expression via DESeq2 and generalized linear mixed modeling (GLMM) using glmmSeq. Mixed models increase sensitivity in unbalanced designs and enable modeling of repeated measures or nested batch effects. Gene set enrichment via GSEA and decoupleR provides interpretable, pathway-level insight.
To demonstrate the workflow’s utility, we analyzed RNA-seq data from FACS-sorted endothelial cells collected across multiple experimental batches, modeling the effects of aging and tumor presence in the mouse lung. GLMM enabled robust identification of interaction effects and consistent aging-associated signatures across batches. Aging was associated with upregulation of interferon response genes and downregulation of extracellular matrix components, reflecting transcriptomic shifts in endothelial cell state that were reproducible across both healthy and tumor-bearing conditions.
All steps are modular, transparent, and fully reproducible. The workflow is openly available on GitHub with comprehensive documentation, and is deployable via Conda or Docker through Snakemake. It is designed for researchers seeking robust and standardized RNA-seq analysis in high-throughput, multifactorial studies.
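The interaction modeling the workflow supports rests on design matrices like the following sketch for two binary factors. The helper is hypothetical, not part of multiconDErich:

```python
import numpy as np

def design_matrix(condition, batch):
    """Design matrix with intercept, two binary factors, and their interaction.

    condition, batch: sequences of 0/1 codes per sample. A nonzero
    coefficient fitted on the last column indicates a
    condition-by-batch interaction effect.
    """
    condition = np.asarray(condition, dtype=float)
    batch = np.asarray(batch, dtype=float)
    intercept = np.ones_like(condition)
    return np.column_stack([intercept, condition, batch, condition * batch])
```

DESeq2 and glmmSeq build matrices of this shape internally from a model formula; mixed models additionally add random-effect terms for repeated measures.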
B-099: Enhancing Computational Systems Biology Through Crowdsourced Efforts
Track: BOSC: Bioinformatics Open Source Conference
- Gaia Andreoletti, Sage Bionetworks, United States
- Verena Chung, Sage Bionetworks, United States
- Thomas Schaffter, Sage Bionetworks, United States
- Serghei Mangul, Sage Bionetworks, United States; University of Suceava, Romania
- Susheel Varma, Sage Bionetworks, United States
Presentation Overview:
Rigorous assessment and benchmarking via crowdsourcing are essential in advancing computational systems biology by promoting unbiased evaluations and fostering innovation. However, the field often encounters the "self-assessment trap," where researchers compare their methods to others, leading to biased comparisons that frequently, if unintentionally, favor their own algorithms. This bias, driven by selective reporting or narrow evaluation criteria, compromises the reliability of results and hinders unbiased scientific advancement. To address this issue, community-driven challenges like DREAM (Dialogue on Reverse Engineering Assessment and Methods) and CASP (Critical Assessment of Structure Prediction) have become essential for promoting unbiased evaluations and fostering reproducible research. These initiatives emphasize transparent evaluations using diverse metrics and independent datasets to ensure robust method development. By adopting continuous assessment and benchmarking practices, the community can strengthen computational reliability, advancing innovation and improving predictive healthcare.
Since 2006, Sage Bionetworks has been instrumental in organizing DREAM Challenges, leveraging its Synapse platform to facilitate global collaboration and data sharing. For example, the Microbiome Preterm Birth DREAM Challenge invited participants to predict clinical outcomes from microbiome data using open-source models shared on Synapse. Prediction of PTB (<32 weeks) reached a best AUROC of 0.69-0.70, likely owing to stronger microbial signals in severe cases. The challenge generated reproducible insights with real-world applications in precision medicine.
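The AUROC metric used to score such challenge submissions can be computed directly from the rank-sum (Mann-Whitney) identity; a minimal sketch:

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum identity.

    labels: 0/1 class labels; scores: predicted probabilities or scores.
    AUROC equals the probability that a random positive outranks a
    random negative, with ties counted as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Challenge platforms typically complement AUROC with independent test sets and multiple metrics, exactly to avoid the self-assessment trap described above.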
Through initiatives like DREAM and a broader commitment to open science, Sage Bionetworks is tackling key challenges in computational biology—including data fragmentation, reproducibility gaps, and evaluation bias—while advancing discovery and promoting ethical, transparent research standards.
B-101: EBI Search: Providing discovery tools for biological metadata
Track: BOSC: Bioinformatics Open Source Conference
- Matthew Pearce, EMBL, United Kingdom
- Prasad Basutkar, EMBL, United Kingdom
- Renato Juaçaba Neto, EMBL, United Kingdom
- Vijay Venkatesh Subramoniam, EMBL, United Kingdom
- Rose Neis, EMBL, United Kingdom
- Iva Tutis, EMBL, United Kingdom
- Dalya Al-Shahrabi, EMBL, United Kingdom
- Henning Hermjakob, EMBL, United Kingdom
Presentation Overview:
The data resources provided by the European Bioinformatics Institute (EMBL-EBI) cover major areas of biological and biomedical research, giving free and open access to users ranging from expert to casual. The EBI Search engine ('EBI Search') provides unified metadata search across these resources: full-text search over more than 6.5 billion data items, accessed through a user-friendly website and an OpenAPI-compliant programmatic interface.
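The programmatic interface can be queried over REST; the sketch below only builds a query URL. The base path and parameter names are assumptions for illustration and should be checked against the EBI Search documentation:

```python
from urllib.parse import urlencode

# Assumed REST base path for illustration; verify against EBI Search docs.
EBI_SEARCH_BASE = "https://www.ebi.ac.uk/ebisearch/ws/rest"

def search_url(domain, query, size=10):
    """Build a query URL against one EBI Search domain (e.g. 'uniprot')."""
    params = urlencode({"query": query, "size": size, "format": "json"})
    return f"{EBI_SEARCH_BASE}/{domain}?{params}"
```

A client would fetch this URL and page through JSON hit lists, each hit carrying the source resource's identifiers and metadata fields.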
B-103: nf-core/tumourevo: a reproducible open-source pipeline for tumour evolution analysis from Whole-Genome Sequencing data
Track: BOSC: Bioinformatics Open Source Conference
- Katsiaryna Davydzenka, International School of Advanced Studies (SISSA), Italy
- Lucrezia Valeriani, University of Trieste, Italy
- Elena Buscaroli, University of Trieste, Italy
- Giorgia Gandolfi, University of Trieste, Italy
- Virginia Anna Gazziero, University of Trieste, Italy
- Brandon Taylor Hastings, University of Trieste, Italy
- Azad Sadr Haghighi, University of Trieste, Italy
- Giovanni Santacatterina, University of Trieste, Italy
- Riccardo Bergamin, University of Trieste, Italy
- Alice Antonello, University of Trieste, Italy
- Rodolfo Tolloi, Area Science Park, Italy
- Salvatore Milite, Human Technopole, Italy
- Davide Rambaldi, Human Technopole, Italy
- Guido Sanguinetti, International School of Advanced studies (SISSA), Italy
- Giovanni Tonon, IRCCS San Raffaele Scientific Institute-Vita-Salute San Raffaele University, Italy
- Anna Kabanova, Toscana Life Sciences Foundation, Italy
- Leonardo Egidi, University of Trieste, Italy
- Alessio Ansuini, Area Science Park, Italy
- Alberto Cazzaniga, Area Science Park, Italy
- Alberto Casagrande, University of Udine, Italy
- Nicola Calonaci, University of Trieste, Italy
- Giulio Caravagna, University of Trieste, Italy
Presentation Overview:
The rapid advancement of whole-genome sequencing (WGS) has transformed cancer genomics, enabling high-resolution profiling of somatic alterations and new insights into tumour evolution. Studying cancer through an evolutionary lens is essential for understanding tumour heterogeneity, subclonal dynamics, and therapeutic resistance. While pipelines for detecting somatic variants are now well-established, tools for inferring tumour evolution from WGS data remain limited.
We present nf-core/tumourevo, an open-source, modular, and fully reproducible pipeline for analysing tumour evolution from next-generation sequencing data. Developed within the nf-core ecosystem using Nextflow, the pipeline integrates leading tools for variant annotation (VEP), driver gene identification, copy number alteration (CNA) quality control (CNAqc), subclonal deconvolution (PyClone-VI, VIBER, MOBSTER), and mutational signature analysis (SigProfiler, SparseSignatures).
nf-core/tumourevo supports both single-sample and multi-sample datasets, enabling analysis of tumour evolution in spatially or temporally resolved cohorts. Its containerized and portable design ensures reproducibility across high-performance computing and cloud environments, making it suitable for both research and clinical applications.
We applied the pipeline to a longitudinal WGS dataset of colorectal cancer, successfully reconstructing clonal architectures and tracking subclonal dynamics over time. We also benchmarked its accuracy using a Simulated Cohort of Universal Tumours (SCOUT), created with our in-house cancer evolution simulator.
By automating complex evolutionary analyses within a reproducible framework, nf-core/tumourevo empowers researchers to explore tumour evolution at scale, advancing biological discovery and supporting precision oncology efforts.
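The core idea behind subclonal deconvolution is that clonal and subclonal mutations separate by variant allele frequency (VAF). The toy classifier below illustrates this with a fixed cutoff; the mixture models the pipeline wraps (PyClone-VI, VIBER, MOBSTER) are far more sophisticated:

```python
def classify_mutations(vafs, purity=1.0):
    """Label diploid heterozygous mutations as clonal or subclonal.

    A clonal het mutation in a diploid region is expected at
    VAF ~= purity / 2; mutations well below that are called subclonal.
    The 0.75 margin is an arbitrary illustrative choice.
    """
    expected = purity / 2.0
    cutoff = 0.75 * expected
    return ["clonal" if v >= cutoff else "subclonal" for v in vafs]
```

Real tools replace the hard cutoff with probabilistic cluster assignments and correct VAFs for local copy number, which is what the CNAqc step supports.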
B-105: cellSight: Enhanced Single-Cell Analysis Platform and Comprehensive Cell Communication
Track: BOSC: Bioinformatics Open Source Conference
- Ranojoy Chatterjee, Computational Biology Institute, Dept. of Biostatistics and Bioinformatics, Milken Institute School of Public Health, United States
- Ali Rahnavard, Computational Biology Institute, Dept. of Biostatistics and Bioinformatics, Milken Institute School of Public Health, United States
- Brett Shook, Department of Biochemistry and Molecular Medicine, School of Medicine and Health Sciences, GWU, United States
- Chiraag Gohel, Computational Biology Institute, Dept. of Biostatistics and Bioinformatics, Milken Institute School of Public Health, United States
Presentation Overview:
cellSight is an innovative computational platform that facilitates and simplifies the analysis of single-cell RNA sequencing data. By combining high-throughput sequencing techniques with advanced analytical pipelines, it addresses fundamental problems in single-cell genomics research, such as cell-type clustering, feature extraction, and data normalization. We build upon the intrinsic strengths of cellSight by incorporating Graph Attention Networks (GATs) to infer ligand-receptor interactions between cells, a considerable advancement over conventional approaches. It uses attention mechanisms to model complex interactions within cellular communication networks by incorporating expression data and structural information about interacting molecules. The implementation better separates cell populations based on combined transcriptional and spatial signatures. The platform's automated pipeline, spanning data normalization, integration, clustering, differential expression analysis, trajectory inference, and pathway enrichment, notably decreases analysis time and human bias. By automating computational tasks, cellSight enables researchers to focus on biological interpretation and hypothesis generation instead of manual data processing. Performance measures validate effective discrimination of distinct cell populations while granting researchers greater insight into tissue architecture and cellular dynamics in both normal and disease conditions.
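The attention mechanism underlying GATs normalizes raw pairwise scores over each node's neighbours with a softmax; a minimal numpy sketch, not cellSight's implementation:

```python
import numpy as np

def attention_weights(scores, neighbors):
    """Softmax-normalize raw pairwise scores over each cell's neighbours.

    scores:    (n, n) array of raw compatibility scores (e.g. from
               ligand-receptor expression), purely illustrative here
    neighbors: dict mapping cell index -> list of neighbour indices
    Returns a dict of per-cell attention distributions summing to one.
    """
    out = {}
    for i, nbrs in neighbors.items():
        raw = np.array([scores[i, j] for j in nbrs], dtype=float)
        e = np.exp(raw - raw.max())  # numerically stable softmax
        out[i] = dict(zip(nbrs, e / e.sum()))
    return out
```

In a full GAT the raw scores are themselves learned from node features; the softmax step shown here is what makes the resulting interaction weights comparable across cells.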
B-107: Applying Topic Modeling Methods to the Promoting Healthy Aging through Semantic Enrichment of Solitude Research Ontology (PHASES)
Track: BOSC: Bioinformatics Open Source Conference
- Damayanthi Jesudas Beera, University of Florida, United States
- William Duncan, University of Florida, United States
Presentation Overview:
Building an ontology requires substantial time and effort from its developers and subject matter experts. Natural Language Processing (NLP) techniques can reduce this effort by identifying the core concepts of a domain. Here we present our work on using NLP to assist in the development of the Promoting Healthy Aging through Semantic Enrichment of Solitude Research Ontology (PHASES). As a starting point, we are using topic modeling and term frequency-inverse document frequency (TF-IDF) to analyze abstracts related to solitude and gerotranscendence. Twenty of these abstracts serve as the dataset for performing TF-IDF and Latent Dirichlet Allocation (LDA) to discover important words and topics in the corpus. TF-IDF evaluates the importance of a word in an abstract relative to the corpus and gives each word a weighted score. LDA assigns a probability distribution over topics, and over words within a topic, across the corpus. Guided (or seeded) LDA is a semi-supervised learning technique that enhances LDA results by increasing the weights of important terms. We first tested standard LDA to obtain perplexity, topic variance, and coherence scores. We then tested guided LDA using seed words identified by subject matter experts and by TF-IDF. With these seed words, guided LDA models assigned higher probabilities to domain-specific words than standard LDA and outperformed it in topic modeling. These results show a promising approach for more efficient ontology development.
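The TF-IDF weighting described above can be sketched in a few lines; this uses raw term counts and the log(N/df) inverse-document-frequency form, one of several common variants:

```python
import math

def tf_idf(docs):
    """Per-document TF-IDF scores for a corpus of tokenized documents.

    docs: list of documents, each a list of tokens.
    Weight of a term in a document = raw count * log(N / document frequency).
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scored = []
    for doc in docs:
        weights = {}
        for term in set(doc):
            weights[term] = doc.count(term) * math.log(n / df[term])
        scored.append(weights)
    return scored
```

Terms occurring in every abstract score zero, while corpus-rare terms are up-weighted; the top-weighted terms then serve as candidate seed words for guided LDA.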
B-109: Integrating Temporal-Related Dental Visit Metadata and Its Impact on Dental Care-Related Fear and/or Anxiety into the Ontology of Dental-Care Related Fear and Anxiety
Track: BOSC: Bioinformatics Open Source Conference
- Elena Milivinti, Department of Philosophy, University at Buffalo, United States
- Alexander Diehl, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, United States
- Finn Wilson, College of Dentistry, University of Florida, United States
- Ram Challa, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, United States
- William Duncan, College of Dentistry, University of Florida, United States
Presentation Overview:
Dental care-related fear and anxiety (DFA) manifests differently across the temporal progression of the dental care experience. We extend the Oral Health and Disease Ontology (OHD) to represent DFA within a temporal framework that captures the evolution of patient anxiety through three critical phases: prior to dental visit (anticipatory concerns), during dental visit (procedure-specific fears), and after dental visit (recovery anxieties). Our approach integrates multiple validated assessment instruments including the Dental Fear Survey (DFS), Modified Dental Anxiety Scale (MDAS), and Index of Dental Anxiety and Fear (IDAF-4C+). Using a "bottom-up" methodology, we mapped assessment items to appropriate temporal phases and formalized the relationships through a hybrid approach combining OHD and the Linked Data Modeling Language (LinkML).
The resulting ontology defines subclasses representing phase-specific aspects of DFA such as appointment avoidance (prior), fear of specific procedures (during), and concerns about lasting pain (after). Properties capture anxiety intensity, triggers, and coping mechanisms within each temporal context. This structure enables researchers to query relationships between specific anxieties and their temporal manifestations.
The framework provides advantages for both research and clinical practice: it facilitates integration of diverse DFA data sources, enables phase-specific analysis of patient experiences, and supports the development of targeted interventions at critical temporal junctures. By understanding how dental fear evolves throughout the patient journey, clinicians can better address anxiety at its most impactful points, potentially improving treatment outcomes and patient experiences. Our ontological framework is publicly available with MIT licensing to support collaborative refinement and adoption across the dental research community.
B-111: A Modular Nextflow Toolkit for Scalable and Interpretable Whole-Slide Image Segmentation
Track: BOSC: Bioinformatics Open Source Conference
- Dmytro Horyslavets, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine; Kyiv Academic University, Kyiv, Ukraine, Ukraine
- Oleksandr Skorobohatov, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine, Ukraine
- Maksym Chernyshev, Kyiv Academic University, Kyiv, Ukraine, Ukraine
- Yaroslav Khokhlov, Kyiv Academic University, Kyiv, Ukraine, Ukraine
- Pavlo Areshkov, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine, Ukraine
- Tetiana Yeskova, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine, Ukraine
- Mykhailo Koreshkov, Kyiv Academic University, Kyiv, Ukraine, Ukraine
- Ben Woodhams, Wellcome Sanger Institute, Hinxton, UK; EMBL-EBI, Cambridge, UK, United Kingdom
- Tong Li, Wellcome Sanger Institute, Hinxton, UK, United Kingdom
- Peter Clapham, Wellcome Sanger Institute, Hinxton, UK, United Kingdom
- Virginie Uhlmann, BioVisionCenter, University of Zurich, Zurich, Switzerland; EMBL-EBI, Cambridge, UK, Switzerland
- Alina Frolova, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine; Kyiv Academic University, Kyiv, Ukraine, Ukraine
- Omer Bayraktar, Wellcome Sanger Institute, Hinxton, UK, United Kingdom
Presentation Overview:
Semantic segmentation of large-scale images with low-contrast boundaries can be a challenging task for modern deep-learning techniques, lacking interpretability and being computationally expensive. Classical computer vision methods are still relevant but often require complex parameter tuning, particularly in unsupervised settings. To address this we propose a flexible, modular, and reproducible Nextflow-based imaging tool, fastlbp-nextflow. Using a highly parallel and memory-efficient implementation of the multi-radii Local Binary Patterns (LBP) descriptor, the tool allows for seamless integration of alternative feature extraction and downstream analysis methods. It offers multiple execution modes, facilitating interactive parameter exploration and efficient dataset processing. Each stage of the pre-constructed workflows is implemented as a standalone sub-workflow, which makes the tool modular and easily extensible. We illustrate the practicality of fastlbp-nextflow across different medical imaging datasets, showcasing its ability to discriminate subtle tissue structures in glioblastoma whole-slide images, identify vasculature in retinal scans, and quantify fat cells in pancreatic histology. Using LBP features in both supervised and unsupervised scenarios yielded decent segmentation accuracy while featuring increased interpretability and faster turnaround time compared to deep-learning approaches. For instance, processing ten whole-slide images of about 18000x15000 pixels with the LBP -> UMAP -> HDBSCAN pipeline took around ten minutes on a multi-core CPU setup. With its ease of deployment, visual reporting system, and support for scalable environments like IBM Spectrum LSF, fastlbp-nextflow aims to streamline large-scale imaging workflows and foster reproducible research. The project is publicly available at https://github.com/imbg-ua/fastlbp-nextflow.
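The basic single-radius LBP descriptor underlying the tool can be sketched as follows; fastlbp-nextflow's actual implementation is a heavily optimized, parallel, multi-radius variant:

```python
import numpy as np

def lbp_codes(img):
    """Classic 8-neighbour Local Binary Pattern codes for interior pixels.

    Each neighbour whose intensity is >= the centre pixel contributes
    one bit; the resulting 0-255 code describes local texture and is
    invariant to monotonic intensity changes.
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    # clockwise from the top-left neighbour; bit order is a convention
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h - 2, w - 2), dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        nbr = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (nbr >= center).astype(int) << bit
    return codes
```

Histograms of such codes over image patches form the feature vectors that downstream steps (e.g. UMAP and HDBSCAN, as in the abstract's pipeline) cluster into tissue classes.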
B-113: Enhancing Sustainability in Workflow Execution: Improving Job Caching and Meta-Scheduling in Galaxy
Track: BOSC: Bioinformatics Open Source Conference
- Nicola Soranzo, Earlham Institute, United Kingdom
- Marius van den Beek, Sci-Scale, Belgium
- Paul De Geest, VIB, Belgium
- Sanjay Kumar Srikakulam, Albert-Ludwigs-Universität Freiburg, Germany
- Björn Grüning, Albert-Ludwigs-Universität Freiburg, Germany
Presentation Overview:
Galaxy is an open-source platform for accessible, reproducible, and transparent computational research, widely used in bioinformatics and beyond. As data volumes and compute demands grow, improving the platform’s environmental sustainability is increasingly important. This work focuses on two complementary strategies: reducing redundant computation through improved job result reuse, and enabling environmentally-aware job scheduling.
We significantly enhanced Galaxy’s existing job cache feature, which reuses the results of previous identical computations to avoid unnecessary reprocessing. Previously, reuse was limited by strict matching based on internal dataset identifiers. We extended the system to also match input datasets by their content hash, allowing cache hits even when a file is uploaded multiple times. We also broadened the scope of cache searches to include public Galaxy datasets. These improvements make job reuse more effective, particularly in training settings where identical analyses are run repeatedly by many users.
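Content-hash-based cache matching of the kind described can be sketched as follows; the key layout is illustrative, not Galaxy's internal scheme:

```python
import hashlib

def dataset_key(path):
    """Content hash of an input dataset, so identical uploads can share
    cache hits regardless of their internal dataset identifiers."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def job_key(tool_id, tool_version, input_paths, params):
    """Cache key for a job: tool identity plus content hashes of inputs
    and sorted parameters; equal keys mean results can be reused."""
    parts = [tool_id, tool_version]
    parts += sorted(dataset_key(p) for p in input_paths)
    parts += [f"{k}={params[k]}" for k in sorted(params)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Two users uploading byte-identical files and running the same tool with the same parameters produce the same key, which is the scenario the training-setting improvements target.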
In parallel, we improved Galaxy’s scheduling by integrating TPV Broker, a standalone service that ranks compute endpoints based on real-time metrics from the remote job execution endpoints (called Pulsar). Usage data is pushed via RabbitMQ to InfluxDB, allowing jobs to be routed to more energy-efficient destinations. This promotes greener execution strategies without requiring public access to the compute nodes.
Our improvements lay the groundwork for a greener Galaxy infrastructure by embedding computation reuse and resource impact awareness into both the scheduling logic and the user interface. These enhancements support environmentally-responsible computational workflows at scale, aligning Galaxy with broader sustainability goals in open science.
B-115: Go with the (Next)flow: Retrobiosynthesis
Track: BOSC: Bioinformatics Open Source Conference
- Emre Taha Çevik, Gebze Technical University, Turkey
- Gülce Çelen, Yildiz Technical University, Turkey
- Nilay Yönet, Istanbul Kent University, Turkey
- Kübra Narcı, The German Human Genome-Phenome Archive, Germany
Presentation Overview: Show
Scientific workflows in synthetic biology and metabolic engineering increasingly rely on automation to enhance reliability, efficiency, and flexibility. Nextflow is a code-centric workflow management system built for scalable, reproducible, and robust data analysis. Its scriptable domain-specific language and container-native architecture enable the integration of existing pipelines written in any scripting language without reimplementation, while its data streaming feature allows efficient and scalable workflow design. Despite these strengths, Nextflow's adoption in synthetic biology is still limited.
One of the essential approaches in synthetic biology is retrobiosynthesis. Retrosynthesis algorithms enable the identification of biosynthetic pathways by working backward from target compounds to precursor metabolites available in the host organism. This approach facilitates the exploration of known and novel routes and supports pathway feasibility assessment, enzyme selection, and ranking based on thermodynamics, flux analysis, and host compatibility. Advances in these algorithms have accelerated the systematic engineering of microbial strains for natural and synthetic product biosynthesis.
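The backward-search idea can be sketched with a toy rule set. All compound names and rules below are invented for illustration; real retrosynthesis tools operate on reaction rules (e.g. SMARTS) over molecular structures, not string identifiers.

```python
from collections import deque

# hypothetical retro-rules: each product maps to alternative precursor sets
rules = {
    "target": [["intermediate1", "cofactor"], ["intermediate2"]],
    "intermediate1": [["glucose"]],
    "intermediate2": [["pyruvate", "atp"]],
}
# metabolites assumed to be natively available in the host organism
host_metabolome = {"glucose", "pyruvate", "atp", "cofactor"}

def retro_search(target, max_depth=5):
    """Breadth-first backward search: expand compounds via retro-rules until
    every remaining compound is available in the host; return the feasible
    pathways as lists of (product, precursors) steps."""
    pathways = []
    queue = deque([([target], [])])       # (unresolved compounds, steps so far)
    while queue:
        todo, steps = queue.popleft()
        if len(steps) > max_depth:
            continue                      # prune overly long routes
        unresolved = [c for c in todo if c not in host_metabolome]
        if not unresolved:
            pathways.append(steps)        # all leaves available in the host
            continue
        c = unresolved[0]
        for precursors in rules.get(c, []):
            rest = [x for x in todo if x != c]
            queue.append((rest + precursors, steps + [(c, tuple(precursors))]))
    return pathways

pathways = retro_search("target")   # two alternative routes in this toy set
```

Each returned pathway is a candidate for the downstream ranking steps the abstract mentions (thermodynamics, flux analysis, host compatibility).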
To meet the demand for automated and standardized pathway design, several open-source tools such as RetroPath2.0, BioNavi, and RetroBioCat have been developed. However, these tools are limited by reduced parameter flexibility, tool integration, and runtime control due to their GUI-centered design. Nextflow addresses these challenges by enabling modular, scalable, and reproducible pipelines with complex branching logic and containerized multi-tool integration.
This work aims to develop and improve a Nextflow pipeline for retrobiosynthesis. We expect this pipeline to constitute a base for future computational workflows to extend synthetic biology tools for pathway design and strain engineering.
B-117: DOME Registry - Supporting ML transparency and reproducibility in the life sciences
Track: BOSC: Bioinformatics Open Source Conference
- Gavin Farrell, University of Padova, Italy
- Omar Attafi, University of Padova, Italy
- Silvio Tosatto, University of Padova, Italy
Presentation Overview: Show
The adoption of machine learning (ML) methods in the life sciences has been transformative, solving landmark challenges such as accurate protein structure prediction, improving bioimaging diagnostics and accelerating drug discovery. However, ML publications face a reuse and reproducibility crisis: authors are publishing ML methods that lack the core information needed to transfer value back to the reader. Links to code, data and models are commonly absent, eroding trust in the methods.
In response, ELIXIR Europe developed a practical checklist of recommendations covering four key aspects of ML methods for disclosure: data, optimisation, model and evaluation. These are now known collectively as the DOME Recommendations, published in Nature Methods by Walsh et al. in 2021. Building on this successful first step towards addressing the ML publishing crisis, ELIXIR has developed a technological solution to support the implementation of the DOME Recommendations: the DOME Registry, published in GigaScience by Attafi et al. in late 2024.
This talk will cover the DOME Registry technology which serves as a curated database of ML methods for life science publications by allowing researchers to annotate and share their methods. The service can also be adopted by publishers during their ML publishing workflow to increase a publication’s transparency and reproducibility. An overview of the next steps for the DOME Registry will also be provided - considering new ML ontologies, metadata formats and integrations building towards a stronger ML ecosystem for the life sciences.
B-119: PheBee: A Graph-Based System for Scalable, Traceable, and Semantically Aware Phenotyping
Track: BOSC: Bioinformatics Open Source Conference
- David Gordon, Office of Data Sciences at Nationwide Children's Hospital, United States
- Max Homilius, Office of Data Sciences at Nationwide Children's Hospital, United States
- Austin Antoniou, Office of Data Sciences at Nationwide Children's Hospital, United States
- Connor Grannis, Office of Data Sciences at Nationwide Children's Hospital, United States
- Grant Lammi, Office of Data Sciences at Nationwide Children's Hospital, United States
- Adam Herman, Office of Data Sciences at Nationwide Children's Hospital, United States
- Ashley Kubatko, Office of Data Sciences at Nationwide Children's Hospital, United States
- Peter White, Office of Data Sciences at Nationwide Children's Hospital, United States
Presentation Overview: Show
The association of phenotypes and disease diagnoses is a cornerstone of clinical care and biomedical research. Significant work has gone into standardizing these concepts in ontologies like the Human Phenotype Ontology and Mondo, and into developing interoperability standards such as Phenopackets. However, managing subject-term associations in a traceable and scalable way that enables semantic queries and bridges clinical and research efforts remains a significant challenge.
PheBee is an open-source tool designed to address this challenge by using a graph-based approach to organize and explore data. It allows users to perform powerful, meaning-based searches and supports standardized data exchange through Phenopackets. The system is easy to deploy and share thanks to reproducible setup templates.
The graph model underlying PheBee captures subject-term associations along with their provenance and modifiers. Queries leverage ontology structure to traverse semantic term relationships. Terms can be linked at the patient, encounter, or note level, supporting temporal and contextual pattern analysis. PheBee accommodates both manually assigned and computationally derived phenotypes, enabling use across diverse pipelines. When integrated downstream of natural language processing pipelines, PheBee maintains traceability from extracted terms to the original clinical text, enabling high-throughput, auditable term capture.
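A simplified, dictionary-based sketch of the kind of semantic query such a graph model enables: a subject annotated with a specific term is found by queries on any of that term's ancestors. The term IDs, ontology fragment, and association records below are all invented for illustration and do not reflect PheBee's actual schema.

```python
# toy HPO-like ontology fragment: child term -> parent terms (hypothetical IDs)
ontology = {
    "HP:seizure_focal": ["HP:seizure"],
    "HP:seizure_absence": ["HP:seizure"],
    "HP:seizure": ["HP:neuro_abnormality"],
}

# subject-term associations carrying provenance, as in a simplified graph model
associations = [
    {"subject": "pt1", "term": "HP:seizure_focal", "source": "note-17"},
    {"subject": "pt2", "term": "HP:seizure_absence", "source": "nlp-pipeline"},
    {"subject": "pt3", "term": "HP:short_stature", "source": "note-3"},
]

def ancestors(term):
    """All terms reachable by walking parent links, including the term itself."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(ontology.get(t, []))
    return seen

def subjects_with(term):
    """Semantic query: subjects annotated with `term` or any descendant of it."""
    return sorted({a["subject"] for a in associations if term in ancestors(a["term"])})

hits = subjects_with("HP:seizure")   # matches both focal and absence seizures
```

Because each association record keeps its `source`, every query result remains traceable back to the note or pipeline that produced the term.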
PheBee is currently being piloted in internal translational research projects supporting phenotype-driven pediatric care. Its graph foundation also empowers future feature development, such as natural language querying using retrieval augmented generation or genomic data integration to identify subjects with variants in phenotypically relevant genes.
PheBee advances open science in biomedical research and clinical support by promoting structured, traceable phenotype data.
B-121: NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data
Track: BOSC: Bioinformatics Open Source Conference
- Fabian Woller, Biomedical Network Science Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
- Lis Arend, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Germany
- Christian Fuchsberger, Institute for Biomedicine, Eurac Research, Italy
- Markus List, Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Germany
- David B. Blumenthal, Biomedical Network Science Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Presentation Overview: Show
Existing Python libraries and tools lack the ability to efficiently run statistical tests (such as Pearson correlation, ANOVA, or the Mann-Whitney U test) on large datasets in the presence of missing values. This becomes an issue as soon as constraints on runtime and memory availability are essential considerations for a use case. Relevant research areas where such limitations arise include interactive tools and databases for exploratory analysis of large mixed-type data. At the same time, biomedical analyses of such large datasets (e.g., population cohorts or electronic health record data) have until today mostly investigated statistical associations between specific variables (e.g., correlations between measurements such as body mass index and blood pressure). However, the rapidly growing popularity of systems approaches in biomedicine makes it increasingly relevant to efficiently compute statistical associations for all available pairs of variables in a dataset.
To address this problem, we present the Python tool NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing on mixed-type datasets in the presence of missing values. We assess NApy's efficiency, with respect to both runtime and memory consumption, on simulated as well as real-world input data originating from a population cohort study. We show that NApy outperforms competing Python tools and baseline implementations with naïve Python-based parallelization by orders of magnitude, enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
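A naive NumPy baseline for the kind of computation NApy accelerates: pairwise Pearson correlation where each variable pair uses only its jointly observed (non-missing) rows. NApy's actual Numba/C++/OpenMP backend is far faster; this sketch only pins down the reference semantics of pairwise missing-data handling.

```python
import numpy as np

def pairwise_pearson_nan(X):
    """Pairwise Pearson correlation over the columns of X. For each pair
    (i, j), only rows where both variables are observed (non-NaN) are used."""
    n_vars = X.shape[1]
    R = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i, n_vars):
            mask = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if mask.sum() >= 2:
                xi, xj = X[mask, i], X[mask, j]
                if xi.std() > 0 and xj.std() > 0:
                    R[i, j] = R[j, i] = np.corrcoef(xi, xj)[0, 1]
    return R

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.01, size=100)  # near-perfect correlation
X[rng.random(X.shape) < 0.2] = np.nan                     # 20% missing at random
R = pairwise_pearson_nan(X)
```

The quadratic loop over variable pairs is exactly what dominates runtime on wide datasets and what NApy parallelizes.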
B-123: BFVD – A release of 351k viral protein structure predictions
Track: BOSC: Bioinformatics Open Source Conference
- Rachel Seongeun Kim, Seoul National University, Korea, Republic of
- Eli Levy Karin, ELKMO, Denmark
- Milot Mirdita, Seoul National University, Korea, Republic of
- Rayan Chikhi, Institut Pasteur, France
- Martin Steinegger, Seoul National University, Korea, Republic of
Presentation Overview: Show
While the AlphaFold Protein Structure Database (AFDB) is the largest resource of accurately predicted structures – covering 214 million UniProt entries with taxonomic labels – it excludes viral sequences, limiting its utility for virology. To fill this gap, we present the Big Fantastic Virus Database (BFVD), a repository of 351,242 protein structures predicted using ColabFold on viral sequence representatives of the UniRef30 clusters. By augmenting homology searches with Logan’s petabase-scale SRA assemblies and applying 12-recycle refinement, we further enhanced the confidence scores of 41% of BFVD entries.
BFVD serves as an essential, viral-focused expansion to existing protein structure repositories. Over 62% of its entries show no or low structural similarity to the PDB and AFDB, underscoring the novelty of its content. Notably, BFVD enables identification of a substantial fraction of bacteriophage proteins, which remain uncharacterized at the sequence level, by matching them to similar structures. In this respect, BFVD is on par with the AFDB, despite holding nearly three orders of magnitude fewer structures. Freely downloadable at bfvd.steineggerlab.workers.dev and explorable via Foldseek with UniProt labels at bfvd.foldseek.com, BFVD offers new opportunities for advanced viral research.
B-125: Voyager-SDK: integrating and automating pipeline runs using Voyager-SDK and Voyager platform
Track: BOSC: Bioinformatics Open Source Conference
- Sinisa Ivkovic, Memorial Sloan Kettering Cancer Center, United States
- Christopher Allan Bolipata, Memorial Sloan Kettering Cancer Center, United States
- Nikhil Kumar, Memorial Sloan Kettering Cancer Center, United States
- Eric Buehler, Memorial Sloan Kettering Cancer Center, United States
- Danielle Pankey, Memorial Sloan Kettering Cancer Center, United States
- Adrian Fraiha, Memorial Sloan Kettering Cancer Center, United States
- Mark Donoghue, Memorial Sloan Kettering Cancer Center, United States
- Nicholas Socci, Memorial Sloan Kettering Cancer Center, United States
- Ronak Shah, Memorial Sloan Kettering Cancer Center, United States
- David B. Solit, Memorial Sloan Kettering Cancer Center, United States
Presentation Overview: Show
At Memorial Sloan Kettering Cancer Center (MSKCC), we developed Voyager, a platform to automate the execution of computational pipelines built using the community standards Common Workflow Language (CWL) and Nextflow. Voyager streamlines the orchestration and monitoring of pipelines across various compute environments. By leveraging the nf-core input schema for Nextflow pipelines and the CWL schema, Voyager abstracts input handling across both technologies, enabling seamless integration and execution regardless of the underlying workflow engine.
To enable broader adoption and community contribution, we are introducing the Voyager SDK—a toolkit that empowers developers to integrate their pipelines into the platform via modular components called Operators. As the number of pipelines in our organization grew, it became increasingly important to decouple the logic of these Operators from the core Voyager codebase. Operators encapsulate pipeline-specific logic and metadata, providing a structured interface to the Voyager engine. By externalizing this logic through the SDK, we enable independent development, promote extensibility and portability, and empower developers to onboard new pipelines without modifying the platform itself.
This talk will present the architecture of the Voyager platform, demonstrate how the SDK supports the creation and testing of Operators, and discuss how open standards and open-source tooling have been central to our development strategy. We will also share lessons learned from building infrastructure that balances institutional requirements with community best practices.
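Since the Voyager SDK's real API is not shown in the abstract, the following sketch only illustrates the general Operator idea it describes: a small plug-in class that maps platform job requests onto one pipeline's inputs and post-processes its outputs, so pipeline-specific logic stays out of the core engine. All class names, method signatures, and field names here are hypothetical.

```python
from abc import ABC, abstractmethod

class Operator(ABC):
    """Hypothetical Operator interface: adapts one pipeline to a generic
    workflow engine, keeping pipeline-specific logic out of the platform."""
    pipeline_name: str

    @abstractmethod
    def build_inputs(self, request: dict) -> dict:
        """Map a platform job request onto the pipeline's input schema."""

    @abstractmethod
    def on_complete(self, outputs: dict) -> dict:
        """Post-process pipeline outputs into platform-level metadata."""

class DemoQcOperator(Operator):
    pipeline_name = "demo-qc"   # hypothetical pipeline identifier

    def build_inputs(self, request):
        # translate the generic request into this pipeline's expected inputs
        return {"samples": request["sample_ids"],
                "genome": request.get("genome", "GRCh38")}

    def on_complete(self, outputs):
        # normalize pipeline outputs for the platform's metadata store
        return {"status": "done", "reports": outputs.get("reports", [])}

op = DemoQcOperator()
inputs = op.build_inputs({"sample_ids": ["S1", "S2"]})
```

Because each Operator lives behind a fixed interface like this, new pipelines can be onboarded and tested independently of the engine that schedules them.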
B-127: The ELITE Portal: A FAIR Data Resource For Healthy Aging Over The Life Span
Track: BOSC: Bioinformatics Open Source Conference
- Milan Vu, Sage Bionetworks, United States
- Tanveer Talukdar, Sage Bionetworks, United States
- Amelia Kallaher, Sage Bionetworks, United States
- Melissa Klein, Sage Bionetworks, United States
- Natosha Edmonds, Sage Bionetworks, United Kingdom
- Jessica Malenfant, Sage Bionetworks, United States
- Christine Suver, Sage Bionetworks, United States
- Laura Heath, Sage Bionetworks, United States
- Alberto Pepe, Sage Bionetworks, United States
- Luca Foschini, Sage Bionetworks, United States
- Solly Sieberts, Sage Bionetworks, United States
- Susheel Varma, Sage Bionetworks, United Kingdom
Presentation Overview: Show
Exceptional longevity (EL) is a rare phenotype characterized by an extended health span and sustained physiological function. Various domain-specific factors contribute to EL, influencing the maintenance of key physiological systems (e.g., respiratory, cardiovascular, immune) and functional domains (e.g., mobility, cognition). Studying the impacts of protective genetic variants and cellular mechanisms associated with EL facilitates the identification of novel therapeutic targets that replicate their beneficial effects. The Exceptional Longevity Translational Resources (ELITE) Portal (eliteportal.synapse.org) is a new, open-access repository for disseminating data and other research resources from translational longevity research projects. The portal supports diverse data types including genetic, transcriptomic, epigenetic, proteomic, metabolomic, and phenotypic data from longitudinal human cohort studies and cross-species comparative biology studies of tens to hundreds of nonhuman species; data from longevity-enhancing intervention studies in mouse and cell models; access to web applications and software tools to support exploration of EL-related research outcomes; and a catalog of publications associated with the National Institute on Aging (NIA)-funded translational longevity research projects. The portal also integrates with the external Trusted Research Environment (TRE) CAVATICA and is poised to support future integrations with additional data resources like Terra. All resources hosted in the ELITE Portal are distributed under FAIR (Findable, Accessible, Interoperable, and Reusable) principles. The ELITE Portal is funded by NIA grants 5U24AG078753 and 2U24AG061340.
B-129: Walkthrough of GA4GH standards and the interoperability they provide for genomic data implementations
Track: BOSC: Bioinformatics Open Source Conference
- Jimmy Payyappilly, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Dashrath Chauhan, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Sasha Siegel, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Andrew D Yates, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
- Chen Chen, Ontario Institute for Cancer Research, Canada
- Deeptha Srirangam, Broad Institute of MIT and Harvard, United States
Presentation Overview: Show
The sharing of genomic and health-related data for biomedical research is of key importance in ensuring continued progress in our understanding of human health and wellbeing. In this journey, bioinformatics and genomics continue to be closely coupled. To further expand the benefits of research, the Global Alliance for Genomics and Health (GA4GH) builds the foundation for broad and responsible use of genomic data by setting standards and framing policies, guided by the Universal Declaration of Human Rights. As with any data, interoperability between open-source systems processing genomic data is vital: when systems are based on standards, interactions between technical ecosystems become easier because there is a common framework for interacting and requesting resources. In this talk, we present two GA4GH open-source standards whose reference implementations exhibit interoperability between standards. Through this session, we will showcase use cases showing how these standards support data science for genomics research and ensure easy discoverability of data across the globe.
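The abstract does not name the two standards it will present; as one concrete example of the interoperability GA4GH standards provide, the Data Repository Service (DRS) specification defines how a hostname-based `drs://` URI maps deterministically to an HTTPS request URL, so any compliant client can fetch object metadata from any compliant server:

```python
from urllib.parse import urlparse

def drs_to_https(drs_uri: str) -> str:
    """Translate a hostname-based DRS URI into the HTTPS request URL defined
    by the GA4GH Data Repository Service (DRS) specification:
    drs://<hostname>/<id> -> https://<hostname>/ga4gh/drs/v1/objects/<id>"""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError("not a DRS URI")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"

# a hypothetical object ID on a hypothetical host, for illustration only
url = drs_to_https("drs://example.org/314159")
```

The fixed `/ga4gh/drs/v1/objects/` path prefix is what lets independently developed systems interoperate without prior coordination.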