Session Recaps
BOSC
This was the 25th anniversary of the Bioinformatics Open Source Conference (BOSC; open-bio.org/events/bosc), which started at ISMB 2000 in San Diego. As usual, BOSC began with a welcome from chair Nomi Harris, followed by an overview of BOSC’s parent organization, the Open Bioinformatics Foundation, by OBF Treasurer Heather Wiencko. The opening session also included a plug for the post-BOSC CollaborationFest (https://www.open-bio.org/events/bosc-2024/obf-bosc-collaborationfest-2024/), a free, collaborative work event (including but not limited to hacking) hosted by the nearby Université du Québec à Montréal (UQAM). All are welcome to register, whether or not you are attending BOSC.
BOSC’s packed first-day schedule included sessions focusing on various topics in open science and open source bioinformatics, including Data Analysis, Open Data, Visualization, and Developer Tools and Libraries. The first keynote was delivered by Mélanie Courtot; her topic was "The Data Shows We Need Better Data". Dr. Courtot began with a brief overview of her career, with the underlying message that changes bring opportunity. She then discussed how we need better standards and metadata in order to make sense of the data and share it internationally to address global challenges. She noted that there is an extensive ecosystem of open data, open standards, and open source software that researchers can leverage to help free more time to focus on the interesting science. Most of us are aware of the FAIR (Findable/Accessible/Interoperable/Reusable) data principles; Dr. Courtot went beyond FAIR to discuss the TRUE principles (Tracked, Reasonable, Understandable, Ethical) that are particularly important for preparing AI-Ready data.
The second day of BOSC 2024 opened with a session covering a topic close to our hearts, “Standards and frameworks for open science.” Afterward, keynote speaker Andrew Su discussed "Open Data, Knowledge Graphs, and Large Language Models." Dr. Su asked, "Have LLMs obviated the need for structured knowledge?" (Spoiler alert: No!) He discussed ways to reduce hallucination using Retrieval-Augmented Generation (RAG) and tool augmentation, as well as benchmarks for evaluating AI-generated answers and explanations. Dr. Su then led participants (both in person and virtual) in several episodic future thinking exercises, with scenarios designed to probe how our community feels about the future of AI/LLMs in biomedical informatics. The results, in the form of word clouds and polls, revealed a mix of excitement and concern.
BOSC ended with a panel, “Open Source AI/ML: A Game Changer for Bioinformatics?”, with panelists Larry Hunter, Thomas Hervé Mboa Nkoudou, Mélanie Courtot, and Andrew Su. Addressing whether we should switch to using open models, Larry Hunter answered emphatically, “Absolutely.” “It’s difficult to investigate sources of bias in the training data,” he pointed out, “if you can’t see the data.” Not only do we not know what’s in closed models, but we can be pretty sure they’re not acting in our best interests as researchers since the payoff for commercial AI is more targeted advertising (for example, advertisers pay to have their content used in the training data), which is no way to do science. There was a spirited discussion about data privacy vs. openness, with Mélanie Courtot pointing out that we don’t yet have a clear understanding of the benefit vs. harm that may be incurred by sharing data such as personal medical information. Andrew Su noted that bioinformaticians who were surveyed varied widely in their level of comfort with sharing their health information with an LLM tool. Moderator Monica Munoz-Torres closed by asking the panelists, “Do you think open-source models are a game changer for bioinformatics?” Panelist Thomas Mboa did not hesitate to answer “Yes!”
Thank you to everyone who helped make BOSC 2024 a success. We hope to see you again in 2025!
CAMDA
Wenzhong Xiao, Director of the Immuno-Metabolic Computational Center of Harvard Medical School, introduced the CAMDA Competition Series in its 2-day conference track at ISMB 2024, giving an overview and insights from a historical perspective.
Cathy Lozupone, University of Colorado, kicked off the microbiome session to a packed house, illustrating how metabolic network modelling of time series allows predictions about successional turnover and discussing possible mechanisms reflecting adaptation to oxidative stresses. Her studies demonstrated the relationship between microbiomes in diseased adults and those of infants, highlighting parallels to processes that occur in primary versus secondary ecological succession, where the absence of a complex community of healthy gut commensals allows colonization by opportunistic, early-succession-adapted organisms that undergo an ordered turnover of membership. Coupling co-occurrence patterns and longitudinal analyses of dense time-series data with genomic and metabolic network interrogations to explore underlying drivers of microbial cooperation and competition, Dr. Lozupone and her team have generated hypotheses regarding important interactions that occur during succession and subsequently tested them successfully in humanized mice.
Then Kinga Zielińska, Małopolska Centre of Biotechnology Krakow, introduced the dataset and Health Index underlying this year’s CAMDA microbiome data analysis challenge. She presented a baseline analysis demonstrating that a novel Health Index introduced at CAMDA, which focuses on microbiome functions rather than taxonomies, is more sensitive in detecting differences between healthy controls and a large variety of diseases, while also being robust to sequencing depth.
The subsequent session showcased advances by the different teams tackling the CAMDA microbiome challenge. First up, Nelly Selem Mojica, Centro de Ciencias Matemáticas UNAM, presented an approach for Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing. She introduced a novel Random Forest GMHI index that integrates both taxonomic and functional data from the microbiome, establishing a more robust and accurate framework for predicting health outcomes.
Using Gradient Boosting to Predict Health States from Composition and Function of the Gut Microbiome, Patrick Smyth, National Microbiology Laboratory Canada, constructed a Gradient Boost Health Index from gut Microbiome data (GBHIM), showing improved performance over existing indices like the Gut Microbiome Health Index (GMHI) across various validation folds and data sets, highlighting its potential for accurate health state predictions.
Finally, Zuzanna Karwowska, Małopolska Centre of Biotechnology, presented a microbiome time series analysis revealing predictable patterns of change. Despite high volatility, the human gut microbiome shows stable features over time, and changes can be predicted based solely on previous states. She presented a statistical characterization of the unique temporal behaviors of individual bacterial species. Furthermore, she identified distinct longitudinal regimes in which bacteria exhibit specific patterns of behavior. Cluster analysis identified groups of bacteria that exhibit coordinated fluctuations over time. These findings contribute to our understanding of the dynamic nature of the gut microbiome and its potential implications for human health.
Closing the CAMDA microbiome session, Jesse Shapiro, McGill University, gave an overview of challenges and promising approaches for predictions in microbiome science. Discussing an analysis of several distinct biomes, he noted that the limited size of microbiome datasets still makes predictions further into the future a hard challenge, and that more complex outcomes remain difficult to model. Expert domain knowledge is thus particularly crucial in navigating and exploring microbial datasets.
In the last session of the day, a set of complementary talks covered topics as diverse as Security Vulnerabilities of Portable Sequencing Devices (Carson Stillman, U Florida), Genomic Epidemiology of Giardia intestinalis (Miguel Prieto, Simon Fraser), Inverted Repeats in Viral Genomes at a Large Scale (Madhavi Ganapathiraju, Carnegie Mellon Qatar), and the Integration of Spatial Transcriptomics into Multimodal Imaging of Skin Aging (Christina Bauer, Medical University Vienna).
In his early-morning keynote kicking off the second day of CAMDA, Andrey Rzhetsky (U Chicago) presented an analysis of hundreds of millions of electronic health records that conclusively demonstrated the massive effects of air quality on a range of diseases, including depression and bipolar disorder. Strikingly, while access to national data remains challenging and requires separate analyses in collaboration with scientists in each country, data from US health systems and insurance is compiled and brokered commercially.
Joaquin Dopazo, Director of the Computational Medicine Platform at the Health Ministry of Andalusia, then discussed the challenges and opportunities in the analysis of electronic Health Records and introduced the CAMDA Clinical Health Record Challenge of tracing diabetes patient disease trajectories.
The following session explored first analyses of the challenge data. Daniel Santana-Quinteros, Universidad Nacional Autónoma de México, discussed results from cluster analyses in the context of diagnosing and managing Type 2 diabetes, with better predictive models supporting more personalized and proactive healthcare interventions. Daniel Voskergian, Al-Quds University, reported a novel approach to feature engineering, combining XGB feature selection with various supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Trees, to develop predictive models for four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. Both teams highlighted a need for extended health records and independent validation cohorts.
Paweł Łabaj, Małopolska Centre of Biotechnology, introduced the CAMDA challenge on Antimicrobial Resistance (AMR) contributed by the team of Leonid Chindelevitch, Imperial College London. The following session reported exploratory work on the data. Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research, compared GWAS results to The Comprehensive Antibiotic Resistance Database. Interestingly, phylogenetic scores showed better performance. Jaime Salvador López Viveros, CCM UNAM Mexico, reported a comparison of various preprocessing, dimensionality reduction, and modelling approaches applied to diverse subsets of the data. The best predictions were obtained from AMR gene counts by L1-regularized logistic regression.
The data challenge sessions were then complemented by a discussion of open issues in the benchmarking of single cell data clustering by Owen Visser, University of Florida, and an ISMB proceedings contribution by Dexiong Chen, Max Planck Institute of Biochemistry, who introduced a sparse, interpretable, and optimized maximum mean discrepancy test (SpInOpt-MMD) for two-sample testing and feature selection in the same experiment. SpInOpt-MMD performed well on a variety of data types even on small cohorts, outperforming other methods such as SHapley Additive exPlanations and univariate association analysis.
Delegates then voted to select the best presentations for the CAMDA 2024 Awards:
1. Patrick Smyth, National Microbiology Laboratory Canada, for ‘Using Gradient Boosting to Predict Health States from Composition and Function of the Gut Microbiome’
2. Zuzanna Karwowska, Małopolska Centre of Biotechnology Kraków, for ‘Microbiome time series data reveal predictable patterns of change’
3. Nelly Selem Mojica, Centro de Ciencias Matemáticas UNAM Mexico, for ‘Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing’
Honorable mentions went to Jaime Salvador López Viveros, CCM UNAM Mexico (‘Machine learning models for AMR prediction’) and Owen Visser, University of Florida (‘Measures for the Evaluation of Clustering Methods on Single Cell Data’).
The CAMDA 2025 Challenges will be advertised in the coming months, and we look forward to welcoming submissions until May 2025 and to seeing you all at ISMB in Liverpool, celebrating a quarter of a century of open-ended data analysis competitions at the cutting edge of complex big data in the life sciences.
Digital Agriculture
The Digital Agriculture Open Science session kicked off with a keynote by Etienne Lord from Agriculture and Agri-Food Canada. Dr. Lord talked about current and new developments in Digital Agriculture, and the implications of deep learning and robotics in this new data science. He highlighted the exciting applications of digital instruments and analysis methods that are emerging in agriculture.
The session had presentations on video analyses for animal welfare, single-plant omics, precision dairy farming, smart aquaculture, profitability maps for precision mapping, and salt-tolerant protein classification with natural language models.
Furthermore, the audience engaged in a discussion over current and future challenges in digital agriculture, focusing on automation, robotics and protocol standardization.
EvolCompGen
The EvolCompGen COSI was split across Days 4 and 5. We hosted a total of 30 talks, including 4 proceedings, ~40 posters, and 1 panel discussion to conclude our session. Researchers from around the world introduced several innovative methods and applications across evolutionary biology and comparative genomics. These included algorithms for solving problems in ncRNA families with novel distance metrics and dynamics of miRNAs, a progressive supertree algorithm for inferring transcript phylogenies, and a new representation for phylogenetic trees improving efficiency and comparison metrics. Other contributions encompassed automated pipelines for species tree inference from raw genome assemblies, alternative tests for molecular adaptation across genomes, and protein structure-based classifications enhancing orthology inference. Novel tools were also developed for inferring mitochondrial clones, cell lineage trees and modeling heteroplasmy, detecting genetic overlap in cancer progression, and reconstructing evolutionary histories using synteny and species trees. Machine learning approaches and scalable algorithms further improved phylogenetic tree reconstructions and comparative genomics, contributing significantly to the field. We also heard about exciting applications in tumor evolution, pseudogenes and plasmid mobility, antibiotic resistance, and host specificity. Our final session hosted a panel discussion to highlight the ongoing debates, open questions in the field, and cutting-edge methods that could seamlessly cross-pollinate across fields. How do we bring together the best of all worlds, spanning big data to ML to deep evolutionary insights?
Overall, our sessions were very well attended (in person and via Juno), with ample time after each presentation to discuss a few enlightening questions. We also enjoyed an evening out with a pleasant dinner with 40+ COSI members at "Les 3 Brasseurs" on a lively street in Old Montreal. We thank the organizing committee and all the in-person and virtual attendees for their patience and participation as we conclude our 2024 EvolCompGen program! Stay tuned for announcements on the (non-proceedings) talk and poster winners. We also welcome you to join our EvolCompGen COSI community and the 2025 planning/organizing committee via Twitter/X: @EvolComp, Website: evolcompgen.org, or Slack: bit.ly/join_evolcompgen. We look forward to seeing you in Liverpool in 2025!
SysMod
-The 2024 edition featured 2 keynote talks, 7 regular talks, 2 lightning talks, and 25 posters, with more than 200 attendees participating both in person and virtually.
-We opened our session with our first keynote speaker, Prof. Nathan Price from Thorne HealthTech and the Buck Institute for Research on Aging, who discussed the role of the microbiome in predicting the onset of disease conditions such as obesity, diabetes, and neurodegenerative diseases.
-The 9 talks demonstrated the synergy between systems biology and bioinformatic approaches, featuring computational approaches ranging across agent-based, Boolean, dynamic, network-based, and non-linear differential equation models, and highlighting the relevance of modeling using single-cell data and spatial transcriptomics.
-We closed our session with our second keynote speaker, Prof. Melissa Kemp from Georgia Tech, who discussed a critical role of intercellular transport, adhesion, and cell cycle asynchrony in the propagation of dynamic patterning in engineered cells.
-We awarded three posters that addressed challenges in the systems biology community: using constraint-based metabolic modeling to identify metabolic changes in disease and infection, and developing a computational platform to simulate cell state transitions in single-cell RNA-seq data.
TransMed
We had a very exciting day of talks spanning very diverse areas of biology and biomedicine. The day started with a keynote by Prof. Heidi Rehm, who gave a very insightful account of strategies to identify genetic drivers of rare diseases as well as building innovative approaches to global data sharing through initiatives like AnVIL and the Global Alliance for Genomics and Health. Prof. Rehm also addressed novel approaches to support genetics and genomics in medical practice.
Our second keynote speaker, Dr. Quaid Morris, gave a brilliant talk describing GDD-ENS, a highly accurate cancer type classifier deployed at the Memorial Sloan Kettering Cancer Center based on inputs derived from an FDA-approved, and routinely applied, targeted DNA sequencing panel called MSK-IMPACT. Dr. Morris discussed the functionality, the successes, and some areas of improvement for GDD-ENS as well as the ongoing efforts to generalize GDD-ENS to other targeted cancer gene panels. In the second part of his talk, Dr. Morris introduced a new framework for more interpretable mutational signatures which can be linked with pathway activity, thereby augmenting our understanding of cancer evolution.
The TransMed meeting included a series of excellent talks selected from abstracts, covering newly developed statistical and AI methods as well as applications of existing computational methods to diverse biomedical datasets for the purpose of disease diagnosis, biomarker identification, cell state identification, mapping spatial landscapes, understanding treatment resistance, or clinical trial matching. This year, we introduced the Poster Flash Talk session, where 8 selected posters were presented briefly before the poster session during lunch. This was a very dynamic session, and we thank all speakers for respecting the time allocated.
As highlighted by all of today’s speakers, we face increasing analytical and computing challenges when integrating increasingly complex datasets and translating them to the clinic, and there is a clear opportunity for computational, statistical, and AI methods to transform the field of translational medicine in the coming years.