Schedule subject to change
All times listed are in JST
Wednesday, October 23rd
9:50-10:30
Opening Ceremony
Room: Large Theatre
Format: In person

Moderator(s): Susumu Goto



10:30-11:30
Invited Presentation: Keynote 1 - Prediction of cancer-associated metabolites and their application for drug targeting system
Room: Large Theatre
Format: In person

Moderator(s): Susumu Goto


Authors List:

  • Hyun Kim

Presentation Overview:

The utilization of bio big data through computational models can help predict biomarkers and drug targets for a range of diseases. In the case of cancer, a substantial amount of bio big data has been deposited, including patient-specific omics data (e.g., RNA-seq data) and medical data (e.g., survival data). In this talk, I will elaborate on a computational workflow in which transcriptome and mutation data from cancer patients, along with genome-scale metabolic models (GEMs), have been used to predict potential biomarkers known as oncometabolites. Oncometabolites exhibit pro-oncogenic functions when they accumulate abnormally in cancer cells, and they are generated upon mutations in a metabolic gene. A GEM is a computational model that allows prediction of metabolic reaction fluxes at the genome scale. I will also showcase the application of this computational workflow to predict drug targets effective for high-risk bladder cancer patients who show poor prognosis. The predicted drug targets were validated using in vitro and in vivo studies. Ongoing efforts in generating and applying meaningful bio big data, alongside the proper use of computational models, will revolutionize our approaches to addressing medical problems.

13:00-13:15
hoodscanR: profiling single-cell neighborhoods in spatial transcriptomics data
Confirmed Presenter: Ning Liu, University of Adelaide, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Ning Liu, University of Adelaide, Australia
  • Dharmesh Bhuva, University of Adelaide, Australia
  • Arutha Kulasinghe, University of Queensland, Australia
  • Chin Wee Tan, Walter and Eliza Hall Institute of Medical Research, Australia
  • Jose Polo, University of Adelaide, Australia
  • Melissa Davis, University of Adelaide, Australia

Presentation Overview:

Spatial transcriptomics reveals the complex spatial architecture of tissues, but current analytical tools often fall short of fully utilizing this rich information. Understanding cellular neighborhoods is important for investigating biological processes and disease mechanisms, yet existing methods struggle to detect mixed cellular neighborhoods and provide detailed single cell-level neighborhood profiles.

To address these limitations, we developed hoodscanR, a Bioconductor R package for comprehensive neighborhood analysis in spatial transcriptomics data. hoodscanR identifies cellular neighborhoods using an efficient k-nearest neighbor search, generates cell-level neighborhood annotations, and quantifies neighborhood complexity using entropy and perplexity metrics. Applying hoodscanR to breast cancer and lung cancer datasets from different technology platforms, we demonstrate its ability to detect complex mixed neighborhoods and identify nuanced spatial patterns. Furthermore, hoodscanR enables neighborhood-based differential expression analysis, revealing transcriptional changes driven by the spatial composition of the tumor microenvironment.

In conclusion, hoodscanR is a powerful software package to explore cellular neighborhoods within spatial transcriptomics datasets. By allowing researchers to investigate complex tissue biology with greater precision, hoodscanR has the potential to accelerate discoveries in cancer research, immunology, and other biomedical fields.
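
As a rough illustration of the entropy and perplexity metrics mentioned above, the following Python sketch scores how mixed a single cell's neighborhood is. It assumes a simplified profile built from plain counts of neighbor cell types; the function names are hypothetical and this is not hoodscanR's API or its actual computation.

```python
import math
from collections import Counter


def neighborhood_profile(neighbor_types):
    """Return cell-type proportions among one cell's nearest neighbors.

    Assumption: the profile is a plain count-based proportion, used here
    only to illustrate the metrics.
    """
    counts = Counter(neighbor_types)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


def entropy(profile):
    """Shannon entropy (in bits) of a neighborhood profile."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)


def perplexity(profile):
    """Perplexity: the effective number of cell types in the neighborhood."""
    return 2 ** entropy(profile)


# Hypothetical neighborhoods of eight neighbors each.
pure = neighborhood_profile(["tumor"] * 8)
mixed = neighborhood_profile(["tumor"] * 4 + ["T cell"] * 4)
```

A neighborhood made of a single cell type has zero entropy and a perplexity of 1, while an even two-type mixture has an entropy of 1 bit and a perplexity of 2, so higher values flag mixed neighborhoods.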

13:15-13:30
DeepSpaceDB: a spatial transcriptomics atlas for interactive in-depth analysis of tissues and tissue microenvironments
Confirmed Presenter: Alexis Vandenbon, Kyoto University, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Alexis Vandenbon, Kyoto University, Japan

Presentation Overview:

Spatial transcriptomics enables researchers to study the spatial organization of cells within tissues, and its relation to cellular function, gene expression patterns, and disease states. However, this technology requires considerable financial resources and bioinformatics experience. Here, we present DeepSpaceDB, a spatial transcriptomics atlas that enables in-depth exploration of public spatial data. At present, DeepSpaceDB contains more than 1,000 Visium datasets. DeepSpaceDB allows users to inspect the quality of each sample compared to other samples in the database. The roughly 2.2 million spots of all samples were clustered by similarity of their gene expression patterns into manually annotated clusters (e.g., spots of the hippocampus, tumor spots, etc.), facilitating the interpretation of structures within tissues. Furthermore, DeepSpaceDB allows users to compare the gene expression of interactively selected sets of spots, and query spots can be used to search the entire database for similar spots. Finally, the database provides spatially variable genes and biological pathways, as well as predicted cell type compositions of the spots in all samples. We believe that DeepSpaceDB will be useful for generating new hypotheses, and for supporting the analysis of newly obtained spatial datasets. DeepSpaceDB is still being developed, but a public version is available at www.DeepSpaceDB.com.

13:30-13:45
STAIG: Spatial Transcriptomics Analysis via Image-Aided Graph Contrastive Learning for Domain Exploration and Alignment-Free Integration
Confirmed Presenter: Yitao Yang, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Yitao Yang, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Japan
  • Cui Yang, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Japan
  • Xin Zeng, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Japan
  • Yubo Zhang, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Japan
  • Martin Loza, Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
  • Sung-Joon Park, Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
  • Kenta Nakai, Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan

Presentation Overview:

Spatial transcriptomics is an essential application for investigating cellular structures and interactions, requiring multi-modal information to precisely study the spatial domains. Here, we propose STAIG, a novel deep-learning model that integrates gene expression, spatial coordinates, and histological images using graph contrastive learning coupled with high-performance feature extraction. STAIG can integrate tissue slices without pre-alignment while removing batch effects. Moreover, it is designed to accept data acquired from various platforms with or without histological images. By performing extensive benchmarks, we demonstrate the capability of STAIG in recognizing spatial regions with high precision and uncovering new insights into tumor microenvironments, which suggests promising potential in deciphering spatial biological intricacies.

13:45-14:00
Inferring Supplementary Epigenomic Information of Spatial Transcriptomic Data Using Single-cell Multi-omics Reference
Confirmed Presenter: Weihang Zhang, University of Tokyo, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Weihang Zhang, University of Tokyo, Japan
  • Kenta Nakai, University of Tokyo, Japan

Presentation Overview:

Spatial transcriptomic (ST) technologies, while transformative for mapping gene expression across tissue landscapes, often lack integrated epigenomic data, limiting comprehensive studies of spatial gene regulation. To bridge this gap, we developed a computational tool that leverages both paired (simultaneous sequencing of multi-omics from the same cells) and unpaired (independent sequencing of multi-omics from different cells) single-cell multi-omics data to infer the spatial epigenomic components of existing ST datasets.

This tool constructs three types of graphs to model relationships within and between data modalities, employing a self-supervised heterogeneous graph neural network and a semi-supervised MLP-decoder to predict chromatin accessibility. Validated across multiple datasets, including human hippocampal and mouse brain samples, our method demonstrated strong correlation with known spatial epigenomic data and accurately enriched key transcription factor motifs.

This approach not only enhances the resolution of spatial multi-modal data but also sets the future stage for its application in unraveling tumor microenvironment complexities, promising significant advancements in precision oncology.

14:00-14:15
CelFiE-ISH: a probabilistic model for multi-cell type deconvolution from single-molecule DNA methylation haplotypes
Confirmed Presenter: Irene Unterman, The Hebrew University, Israel

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Irene Unterman, The Hebrew University, Israel
  • Dana Avrahami, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Israel
  • Efrat Katsman, The Hebrew University of Jerusalem, Israel
  • Timothy J. Triche Jr., Van Andel Research Institute, Grand Rapids, MI, United States
  • Benjamin Glaser, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Israel
  • Benjamin P. Berman, The Hebrew University of Jerusalem, Israel

Presentation Overview:

Deconvolution methods infer quantitative cell type estimates from bulk measurement of mixed samples including blood and tissue. DNA methylation sequencing measures multiple CpGs per read, but few existing deconvolution methods leverage this within-read information. We develop CelFiE-ISH, which extends an existing method (CelFiE) to use within-read haplotype information. CelFiE-ISH outperforms CelFiE and other existing methods, achieving 30% better accuracy and more sensitive detection of rare cell types. We also demonstrate the importance of marker selection and of tailoring markers for haplotype-aware methods. While here we use gold-standard short-read sequencing data, haplotype-aware methods will be well-suited for long-read sequencing.

14:15-14:30
Comprehensive Noise Reduction in Single-Cell Data with the RECODE Platform
Confirmed Presenter: Yusuke Imoto, Kyoto University, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Chris Tan


Authors List:

  • Yusuke Imoto, Kyoto University, Japan

Presentation Overview:

Single-cell sequencing technologies, capable of revealing intricate biological details at individual cell levels, often face issues like technical noise and batch effects that obscure biological signals. We introduce integrative RECODE (iRECODE), a comprehensive method derived from the RECODE platform, specifically designed to tackle these challenges using high-dimensional statistical techniques. iRECODE not only efficiently reduces both technical and batch noise, but its application extends beyond RNA-sequencing to include single-cell Hi-C and spatial transcriptomics, significantly enhancing accuracy and computational efficiency. This advancement in RECODE technology offers a robust solution for denoising single-cell sequencing data, thereby broadening our understanding of complex biological systems across multiple genomic dimensions.

14:50-15:05
MapBatch: Conservative batch normalization of single cell RNA-sequencing data using deep learning
Confirmed Presenter: Chern Han Yong, National University of Singapore, Singapore

Room: Large Theatre
Format: In Person

Moderator(s): Jiangning Song


Authors List:

  • Chern Han Yong, National University of Singapore, Singapore
  • Shawn Hoon, Institute of Molecular and Cell Biology, A*STAR, Singapore, Singapore
  • Jonathan Scolnick, National University of Singapore, Singapore
  • Limsoon Wong, National University of Singapore, Singapore

Presentation Overview:

Single-cell RNA sequencing data from multiple samples is often combined and analyzed to reveal mechanisms behind diseases and normal biological processes. However, reducing technical artefacts between samples (i.e., batch normalization) frequently dampens biological signal, obscuring rare cell populations which may be erroneously grouped with other cell types. There is a need for conservative batch normalization which maintains biological signal to aid the detection of potentially important rare cell populations.

We present MapBatch, a deep-learning batch normalization tool based on two principles: an autoencoder trained with a single sample learns the underlying gene expression structure of cell types without batch effect; and an ensemble model combines multiple autoencoders, allowing the use of multiple samples for training. Samples are projected into the batch-free biological subspaces of the training samples, in which even subtle differences between cell subpopulations can be discerned.

On a PBMC dataset, MapBatch normalization successfully delineated and maintained the differences between synthetically-perturbed rare cell subpopulations, while other commonly-used normalization methods obscured these subpopulations. We applied MapBatch to a publicly-available COVID dataset, and derived and validated subpopulations of Natural Killer cells and megakaryocytes associated with disease severity and worsening condition.

15:05-15:20
Bento enables consistent detection of subcellular domains from spatial transcriptomics data
Room: Large Theatre
Format: In person

Moderator(s): Jiangning Song


Authors List:

  • Clarence Mah, University of California San Diego, United States
  • Noorsher Ahmed, UC San Diego, United States
  • Hannah Carter, University of California San Diego, United States
  • Gene Yeo, University of California San Diego, United States

Presentation Overview:

The intricate dance of countless molecules within a cell dictates biological function. While variation in gene expression is commonly measured as a proxy for cellular activity, novel spatial transcriptomics technologies enable systematic study of how the spatial context of gene expression plays a role. We present Bento, a Python toolkit for analyzing RNA organization from single-molecule resolution spatial transcriptomics data. Bento performs fundamental analyses to quantify RNA organization using molecular coordinates and segmentation boundaries as input. To discern distinct subcellular domains, we propose RNAflux, an unsupervised embedding method generalizable across cell types in heterogeneous tissues. Identifying these domains enables quantifying transcript composition and transcript colocalization. RNAflux consistently identifies components of the nucleus and cytoplasm, without prior knowledge, across diverse cell types and conditions. We demonstrate Bento's versatility across common real-world use cases, including cell cultures, drug perturbation experiments, and cancer tissue sections. For example, we find RNA localization changes upon treatment of cardiomyocytes with doxorubicin: depletion of RBM20 mRNA from the endoplasmic reticulum and nuclear retention, suggesting a potential mechanism for doxorubicin-induced cardiomyopathy. Bento adheres to FAIR data principles and integrates seamlessly with the open-source Scverse ecosystem, facilitating interoperability with other single-cell analysis tools like Scanpy, Squidpy, and SpatialData.

15:20-15:35
Overlap-Aware Cell Tracker for Fluorescence Live-Cell Microscopy
Confirmed Presenter: Kenji Fujimoto, Graduate School of Medicine, Hirosaki University, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Jiangning Song


Authors List:

  • Kenji Fujimoto, Graduate School of Medicine, Hirosaki University, Japan
  • Shigeto Seno, Graduate School of Information Science and Technology, Osaka University, Japan
  • Hironori Shigeta, Graduate School of Information Science and Technology, Osaka University, Japan
  • Yutaka Uchida, Graduate School of Medicine, Osaka University, Japan
  • Masaru Ishii, Graduate School of Medicine, Osaka University, Japan
  • Yoshinori Tamada, Graduate School of Medicine, Hirosaki University, Japan
  • Hideo Matsuda, Graduate School of Information Science and Technology, Osaka University, Japan

Presentation Overview:

Cell migration plays an essential role in various biological processes such as development, immune responses, and wound healing. Tracking individual cells to extract their moving trajectories is crucial for understanding cell migration. Recent cell tracking studies have mainly applied general multiple object tracking (MOT) methods. However, MOT-based methods often fail to track cells that overlap along the depth direction. This is because each cell is visualized as a group of light spots, and it is difficult to distinguish between multiple overlapped cells and a single cell. To overcome this limitation, we propose a novel cell tracking method, Overlap-Aware Cell Tracker (OACT). OACT first performs regular MOT and detects overlap and reappearance of cells. Then, it estimates the matches between pre-overlap and post-reappearance cells based on deep metric learning. Consequently, OACT continuously tracks cells from the pre-overlap frame to the post-reappearance frame. To evaluate OACT, we applied it to time-lapse image sequences of leukocytes obtained through 2-photon excitation microscopy. As a result, OACT succeeded in continuously tracking overlapped cells and quantitatively outperformed the existing MOT-based method. We consider that the ability of OACT to continuously track individual cells contributes to the analysis of long-term changes in the behavior of migrating cells.

15:35-15:50
Single-cell survival analysis using a deep generative model
Confirmed Presenter: Chikara Mizukoshi, Nagoya University Hospital, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Jiangning Song


Authors List:

  • Chikara Mizukoshi, Nagoya University Hospital, Japan
  • Yasuhiro Kojima, Laboratory of Computational Life Science, National Cancer Center Research Institute, Japan
  • Shuto Hayashi, Division of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Japan
  • Ko Abe, Division of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Japan
  • Teppei Shimamura, Division of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Japan

Presentation Overview:

Single-cell RNA sequencing (scRNA-seq) reveals cellular heterogeneity, but methods to quantitatively link this to clinical statistics were lacking. We developed a novel approach using a deep generative model to estimate each cell's contribution to patient survival time. Our method combines single-cell deconvolution of bulk RNA-seq data with a Cox proportional hazards model, regressing patient survival time using cell state abundance. This enables estimation of which cells contribute to better or worse prognosis, providing insights into the relationship between cellular heterogeneity and patient outcomes.
We verified this method using simulation data and applied it to The Cancer Genome Atlas (TCGA) dataset. Simulations showed single-cell-level deconvolution outperformed cluster-level deconvolution in estimating cell state influence on survival time. In TCGA data, our method demonstrated generalization performance for untrained bulk RNA-seq data.
We identified cell populations and specific genes associated with high or low risk. Genes negatively correlated with risk were enriched with Gene Ontology terms involved in immune cell activation. Additionally, using VISIUM data, we mapped hazards spatially, visualizing regions involved in prognosis. This method allows analysis of cellular heterogeneity's contribution to survival time, offering a new approach to understand the complex interplay between cellular diversity and clinical outcomes in various diseases.
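
To make the Cox proportional hazards component concrete, here is a minimal Python sketch of the standard Cox partial log-likelihood. It assumes each patient already has a scalar risk score (for example, a hypothetical weighted sum of cell state abundances) and that there are no tied event times. It illustrates the textbook model only, not the deep generative model or the deconvolution step described in the abstract.

```python
import math


def cox_partial_log_likelihood(times, events, risk_scores):
    """Cox partial log-likelihood, assuming no tied event times.

    times: observed follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    risk_scores: log relative hazard per patient (hypothetical input)
    """
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at time t_i.
        log_denominator = math.log(
            sum(math.exp(risk_scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        )
        total += risk_scores[i] - log_denominator
    return total


# Hypothetical cohort: higher scores for patients who die earlier fit better.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
concordant = cox_partial_log_likelihood(times, events, [1.0, 0.0, -1.0])
discordant = cox_partial_log_likelihood(times, events, [-1.0, 0.0, 1.0])
```

Fitting the model means choosing the parameters behind the risk scores so that this quantity is maximized. In the toy cohort, the scores that rank early deaths as high risk give the larger value.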

15:50-16:05
Single-cell multi-omics reveals the regulatory networks in neuroblastoma cell differentiation
Confirmed Presenter: Kai Jie Hu, School of Medicine, National Yang Ming Chiao Tung University, Taipei 112, Taiwan

Room: Large Theatre
Format: In Person

Moderator(s): Jiangning Song


Authors List:

  • Kai Jie Hu, School of Medicine, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
  • Tzu Yang Tseng, Department of Life Science, National Taiwan University, Taiwan
  • Chiao Hui Hsieh, Department of Life Science, National Taiwan University, Taiwan
  • Chia Lang Hsu, Department of Medical Research, National Taiwan University Hospital, Taiwan
  • Hsuan Cheng Huang, Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
  • Hsueh Fen Juan, Department of Life Science, National Taiwan University, Taipei 106, Taiwan

Presentation Overview:

Neuroblastoma (NB) is a common tumor type in children, originating from neuroblasts of neural crest progenitor cells. Studies have suggested that inducing differentiation in NB cells can effectively inhibit cancer growth, achieving disease control. However, the molecular mechanisms governing NB cell fate reprogramming remain unclear. To elucidate these mechanisms, we treated NB cells with all-trans retinoic acid (ATRA), a differentiation-inducing drug which effectively inhibits cancer growth and shows promising clinical effects. We investigated the morphogenic differentiation effect and conducted single-cell RNA and ATAC sequencing on NB cell lines following ATRA treatment. Additionally, we performed differentially expressed gene analysis, pathway enrichment, and transcription factor (TF) prediction. Here, we identified several TFs involved in NB differentiation, including KLF2 and KLF6. In MYCN non-amplified NB, KLF2 activation induces the MT2A and IGFBP4 genes. Conversely, in MYCN amplified NB, KLF6 induces the IFI16 and GAL genes. These pathways exhibit opposite regulation based on MYCN status, but both promote differentiation by activating HOX genes, morphogens, and tumor suppressors. Our study sheds light on the mechanisms of NB differentiation and cell fate reprogramming triggered by ATRA, providing insights into potential targets for treating NB.

16:05-16:20
Single-cell network-based approach to investigate intratumoral heterogeneity in progression of head and neck cancer
Confirmed Presenter: Wen-Yao Lee, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan

Room: Large Theatre
Format: In Person

Moderator(s): Jiangning Song


Authors List:

  • Wen-Yao Lee, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan
  • Ting-Yi Hao, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan
  • Ting-Yu Yeh, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan
  • Chun-Yu Lin, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Taiwan

Presentation Overview:

Intratumoral heterogeneity and genetic complexity complicate treatment strategies; for example, the standard treatments may fail for certain subpopulations with drug resistance. Single-cell RNA sequencing (scRNA-seq) technology offers a promising way to explore intratumoral heterogeneity at single-cell resolution. However, differential expression analyses in scRNA-seq data often fail to capture gene-gene associations crucial for understanding dynamic biological changes between cells. Recently, several methods to infer single-cell networks (SCNs) have been proposed, but they face high false positive rates or computational complexity. To address these issues, we present the Chi-square Estimate Of Single cEll Network (CHOSEN) method to infer SCNs from scRNA-seq data by considering cellular heterogeneity quantified by transcriptomic correlation. Experiments show CHOSEN outperforms previous methods in identifying cell types and displaying close-to-real network topologies. Applying CHOSEN to scRNA-seq data from three primary tumors in head and neck cancer patients, we identified four de-differentiated subpopulations of malignant epithelial cells linked to epithelial-mesenchymal transition (EMT), with the most de-differentiated one showing high expression of partial EMT and stemness signatures. CHOSEN networks also suggest ACSL4 and GCLM as key gene regulators in the ferroptosis pathway. In summary, CHOSEN provides a new opportunity to study biological systems at the single-cell level from a network perspective.

Thursday, October 24th
9:30-10:30
Invited Presentation: Keynote 2 - How far are we from designing an antibody therapeutic on a computer?
Confirmed Presenter: Charlotte Deane, University of Oxford, United Kingdom

Room: Large Theatre
Format: In Person

Moderator(s): Alex Bateman


Authors List:

  • Charlotte Deane, University of Oxford, United Kingdom

Presentation Overview:

Antibodies play a key role in the immune system and our response to vaccines, and have shown great promise as biotherapeutics. The development of new biotherapeutics typically takes many years and requires over $1bn in investment. Computational methods, and in particular machine learning, have shown great promise for increasing the speed and reducing the cost of biotherapeutic development. In this talk I will describe some of the novel computational tools and databases we are pioneering in biotherapeutics, from accurate rapid structure prediction to the prediction of their affinity and binding, looking at both their promise and limitations.

10:45-11:00
PGxQA: A Resource for Evaluating LLM Performance for Pharmacogenomic Q&A Tasks
Confirmed Presenter: Karl Keat, University of Pennsylvania, United States

Room: Large Theatre
Format: In Person

Moderator(s): Kana Shimizu


Authors List:

  • Karl Keat, University of Pennsylvania, United States
  • Rasika Venkatesh, University of Pennsylvania, United States
  • Yidi Huang, University of Pennsylvania, United States
  • Rachit Kumar, University of Pennsylvania, United States
  • Sony Tuteja, University of Pennsylvania, United States
  • Katrin Sangkuhl, Stanford University, United States
  • Binglan Li, Stanford University, United States
  • Li Gong, Stanford University, United States
  • Michelle Whirl-Carrillo, Stanford University, United States
  • Teri Klein, Stanford University, United States
  • Marylyn Ritchie, University of Pennsylvania, United States
  • Dokyoon Kim, University of Pennsylvania, United States

Presentation Overview:

Pharmacogenetics represents one of the most promising areas of precision medicine, with several guidelines for genetics-guided treatment ready to be used in the clinic. Despite this, implementation has been slow, with few health systems incorporating the technology into clinical care. One major barrier to uptake is the lack of education and awareness of pharmacogenetics among clinicians and patients. The introduction of large language models (LLMs) like GPT-4 has introduced the possibility of medical chatbots that deliver timely information to clinicians, patients, and researchers with a simple interface. Although state-of-the-art LLMs have shown impressive performance at advanced tasks like medical licensing exams, they still often provide false information, which is particularly hazardous in a clinical context. To quantify the extent of this issue, we developed a series of automated and expert-scored tests to evaluate the performance of chatbots in answering pharmacogenetics questions from the perspective of clinicians, patients, and researchers. We applied this benchmark to state-of-the-art LLMs and showed that newer models like GPT-4o greatly outperform their predecessors, but still fall short of the standards for clinical use. Our benchmark will be a valuable public resource for subsequent developments in this space as we work towards better clinical AI for pharmacogenetics.

11:00-11:15
Pathogenicity scoring of genetic variants through federated learning across independent institutions reaches comparable or superior performance than their centralized-data model counterparts
Confirmed Presenter: Antonio Rausell, Université Paris Cité, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, Paris, F-75006, France

Room: Large Theatre
Format: In Person

Moderator(s): Kana Shimizu


Authors List:

  • Nigreisy Montalvo, Université Paris Cité, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, Paris, F-75006, France
  • Francisco Requena Sanchez, Weill Cornell Medicine, Institute for Computational Biomedicine, New York, United States
  • Antonio Rausell, Université Paris Cité, INSERM UMR1163, Imagine Institute, Clinical Bioinformatics Laboratory, Paris, F-75006, France

Presentation Overview:

The clinical assessment of genetic variants often requires the use of bioinformatics scores. Supervised machine-learning models trained on publicly available sets of pathogenic and benign variants have proven valuable for variant prioritization. Yet, large collections of variants generated at hospitals and research institutions remain inaccessible for machine-learning purposes because of privacy and legal constraints. Federated learning (FL) algorithms have recently been developed, enabling multiple institutions to collaboratively train models without sharing their local datasets. Here we evaluated the capacity of FL strategies to classify pathogenic and benign variants as compared to the individual data owners or to the centralized-data model counterparts. FL scenarios across real-world institutions were mimicked by taking advantage of the variant submitter information in the ClinVar database. Specific models for deletion Copy Number Variants as well as for coding and non-coding Single Nucleotide Variants were investigated. Our results showed that FL models reached performance competitive with or superior to traditional centralized learning, with the FedProx optimization strategy and lack of batch normalization generally leading to the best results. In addition, we show that federated models are more robust than centralized models to the dropout of individual institutions. Our study highlights the benefits of collaborative machine-learning strategies for clinical variant assessment.
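
The FedProx idea named above can be sketched in a few lines of Python. This toy example assumes a one-parameter linear model, squared-error loss, and two hypothetical institutions. Each site trains locally with a proximal penalty (mu/2)(w - w_global)^2 that keeps its model close to the shared one, and only model parameters, never data, are sent back for averaging. The function names, data, and hyperparameters are illustrative assumptions, not the study's setup.

```python
def local_update(w_global, data, mu=0.1, lr=0.1, steps=20):
    """FedProx-style local training for a 1-D linear model y = w * x."""
    w = w_global
    for _ in range(steps):
        # Gradient of the mean squared error on this site's private data.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Gradient of the proximal term (mu / 2) * (w - w_global) ** 2.
        grad += mu * (w - w_global)
        w -= lr * grad
    return w


def federated_round(w_global, institutions, mu=0.1):
    """One communication round: local updates, then a size-weighted average."""
    total = sum(len(data) for data in institutions)
    return sum(
        len(data) * local_update(w_global, data, mu) for data in institutions
    ) / total


# Hypothetical private datasets; both follow y = 2x.
institutions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0)],
]

w = 0.0
for _ in range(30):
    w = federated_round(w, institutions)
```

Setting mu to 0 in this sketch reduces the local step to plain federated averaging. The proximal term matters most when sites hold heterogeneous data and their local models would otherwise drift apart.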

11:15-11:30
Deep learning-based design of receptor selective cell-penetrating peptides
Confirmed Presenter: Iori Yamahata, Nagoya University Graduate School of Medicine, Aichi, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Kana Shimizu


Authors List:

  • Iori Yamahata, Nagoya University Graduate School of Medicine, Aichi, Japan
  • Shuto Hayashi, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
  • Jun Koseki, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
  • Teppei Shimamura, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan

Presentation Overview:

Recent drug development has focused on intracellular targets, but delivering large molecules like proteins and nucleic acids across cell membranes is challenging. Cell-penetrating peptides (CPPs) have shown promise as delivery tools, but their lack of cell selectivity has hindered clinical applications.
This study presents a novel two-step approach for designing highly selective de novo CPP sequences. The first step involves constructing a deep generative model based on EvoDiff, adapted using Low-Rank Adaptation (LoRA) and trained on 1,082 known CPPs. This model generates candidate CPP sequences.
The second step is an optimization cycle to improve selectivity. It focuses on receptor-mediated internalization, evaluating each sequence's binding energy with target receptors through MD simulations. Bayesian optimization is then applied to identify highly selective CPPs.
As a proof of concept, the method was applied to CXCR4 and NRP1 receptors, which are involved in CPP internalization. The approach successfully identified promising CPP sequences with high selectivity from one million candidates.
This method offers a new way to design CPPs with enhanced selectivity, potentially advancing their clinical applications in drug delivery.

11:30-11:45
A Machine Learning Approach for the Identification of Multi-Omic Signatures of Drug Response in Cancer Cells
Confirmed Presenter: Priya Ramarao-Milne, CSIRO, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Kana Shimizu


Authors List: Show

  • Priya Ramarao-Milne, CSIRO, Australia
  • Roc Reguant, CSIRO, Australia
  • Julika Wenzel, CSIRO, Australia
  • Lujain Elazab, CSIRO, Australia
  • Hawlader Al-Mamun, CSIRO, Australia
  • Rob Dunne, CSIRO, Australia
  • Qing Zhong, ProCan, Australia
  • Roger Reddel, ProCan, Australia
  • Natalie Twine, CSIRO, Australia
  • Denis Bauer, CSIRO, Australia

Presentation Overview: Show

Proteomics is emerging as a promising frontier in cancer. Recently, the world’s largest pan-cancer proteomic dataset was published, representing a rich resource for biomarker discovery. Rapid expansion of multi-omic datasets holds bright prospects for deriving insights into drug response mechanisms. However, there is a lack of adequate tools to harness such multifaceted data. Previous studies have correlated single proteins in cell line proteomic data with drug susceptibility. Due to high computational demand, identifying pair-wise and higher-order interactions synergistically modulating drug susceptibility is beyond the scope of current methods. We introduce a novel machine learning method to identify higher-order proteomic synergies underlying drug response.
We uncover “global” baseline signatures predicting drug susceptibility that appear across all drug classes and “local” signatures that exclusively predict susceptibility to specific drug classes. We replicate 183 synergies in an independent dataset, identifying EGFR as a recurrent “hub” protein central to interactions involved in sensitivity to tyrosine kinase inhibitors (TKIs). Conversely, vimentin was identified as a resistance biomarker for TKIs, aligning with studies showing that epithelial to mesenchymal transition and gefitinib resistance are associated with increased vimentin expression.
Our findings contribute towards leveraging ‘omic data to guide cancer precision medicine, leading to more effective cancer treatments.

11:45-12:00
LegNet allows for state-of-the-art prediction of activity and rational design of eukaryotic regulatory regions
Confirmed Presenter: Dmitry Penzar, AIRI, Moscow, Russia

Room: Large Theatre
Format: Live Stream

Moderator(s): Kana Shimizu


Authors List: Show

  • Dmitry Penzar, AIRI, Moscow, Russia
  • Daria Nogina, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
  • Elizaveta Aristova, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
  • Arsenii Zinkevich, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
  • Ivan Kulakovskiy, Institute of Protein Research, Pushchino, Russia

Presentation Overview: Show

We describe a new convolutional neural network, LegNet, inspired by EfficientNetV2, one of the most parameter-efficient and accurate methods for image classification. Problem-oriented training schemes and advanced architecture allow LegNet to achieve state-of-the-art performance in a wide range of machine learning applications to eukaryotic regulatory genomics, from prediction of promoter activity in massively parallel reporter assays to assessing single-nucleotide variant effects. LegNet is applicable to sequences of different lengths and for different organisms, and allows for using a diffusion approach for the rational design of regulatory sequences.

12:00-12:15
Sequence design for high-performance artificial spider silk
Confirmed Presenter: Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Kana Shimizu


Authors List: Show

  • Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan

Presentation Overview: Show

Spider silk exhibits unique mechanical properties like high tensile strength, elongation, and toughness, unmatched by industrial materials. Each of the 50,000+ spider species produces diverse silk types optimized for their ecology, all derived from monophyletic spidroin proteins. The variety of spider silks reflects an evolutionary solution space of mechanical properties, such as strength, elasticity, underwater usage, and adhesiveness. Orb-weaving spiders' dragline silk, notable for its 1 GPa strength, 30% breaking strain, and 130-200 MJ/m³ toughness, also has an undesirable supercontraction trait, which shrinks silk by up to 60% when wet. Protein engineering aims to mitigate this through sequence modifications. To explore spider silk's potential, an international consortium collected and analyzed silk from 1,000 spider species, sequencing genes and measuring properties like toughness, strength, elongation, crystallinity, birefringence, and supercontraction. The data, available in the Spider Silkome Database, revealed key components and motifs affecting silk properties, including the overlooked spidroin MaSp3 and a protein named SpiCE, enhancing dragline silk toughness. Identified motifs correlating with mechanical properties were tested in artificial silks. This bioinformatic and in vitro approach aims to deepen the understanding of genotype-phenotype relationships in spider silks and protein materials.

13:45-14:45
Invited Presentation: Keynote 3 - Evolution of SARS-CoV-2 and beyond
Confirmed Presenter: Kei Sato

Room: Large Theatre
Format: In Person

Moderator(s): Kenta Nakai


Authors List: Show

  • Kei Sato

Presentation Overview: Show

SARS-CoV-2, the causative agent of COVID-19, emerged at the end of 2019. During its global spread over the past 3 years, SARS-CoV-2 has diversified extensively, and the resulting variants have been regarded as potential threats to human society. To elucidate the virological characteristics of newly emerging SARS-CoV-2 variants in real time, I launched a consortium called “The Genotype to Phenotype Japan (G2P-Japan)” in January 2021. Together with the colleagues who joined the G2P-Japan consortium, we have revealed the virological characteristics of SARS-CoV-2 variants. In May 2023, WHO declared the end of the Public Health Emergency of International Concern (PHEIC) for COVID-19. However, “the next pandemic” will come in the future, and we need to gather the wisdom learned from the COVID-19 pandemic to prepare for future pandemics. Here, I will talk about our findings on SARS-CoV-2 variants and future perspectives on combating the outbreaks and pandemics that will happen in the future.

15:05-15:20
Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures
Confirmed Presenter: Xueyi Dong, The Walter and Eliza Hall Institute of Medical Research, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Xueyi Dong, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Mei Du, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Quentin Gouil, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Luyi Tian, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Jafar Jabbari, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Rory Bowden, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Pedro Baldoni, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Yunshun Chen, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Gordon Smyth, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Shanika Amarasinghe, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Charity Law, The Walter and Eliza Hall Institute of Medical Research, Australia
  • Matthew Ritchie, The Walter and Eliza Hall Institute of Medical Research, Australia

Presentation Overview: Show

The lack of benchmark datasets with inbuilt ground-truth makes it challenging to compare the performance of existing long-read isoform detection and differential expression analysis workflows. Here, we present a benchmark experiment using two human lung adenocarcinoma cell lines that were each profiled in triplicate together with synthetic, spliced, spike-in RNAs (“sequins”). Samples were deeply sequenced on both Illumina short-read and Oxford Nanopore Technologies long-read platforms. Alongside the ground-truth available via the sequins, we created in silico mixture samples to allow performance assessment in the absence of true positives or true negatives. Our results show that StringTie2 and bambu outperformed the other tools among the 6 isoform detection tools tested, and that DESeq2, edgeR and limma-voom were best among the 5 differential transcript expression tools tested. There was no clear front-runner for differential transcript usage analysis among the 5 tools compared, which suggests that further methods development is needed for this application.

15:20-15:35
ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data
Confirmed Presenter: Yuan Gao, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, China

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Yuan Gao, Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation, China

Presentation Overview: Show

Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms (e.g. Oxford Nanopore Technologies) poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. We also found that while short-read data (e.g. Illumina) may overestimate PSI/PI values for exon skipping or intron retention events in regions with complex alternative splicing patterns, using ESPRESSO to process long-read RNA-seq data enables accurate quantification of all types of alternative splicing events. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
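
For readers unfamiliar with the PSI values mentioned above, the sketch below computes percent spliced in for a single exon skipping event as the fraction of informative reads that support inclusion. The function name and the read counts are hypothetical, and this is the textbook ratio rather than ESPRESSO's own quantification, which the abstract does not detail.

```python
from typing import Optional


def percent_spliced_in(inclusion_reads: int,
                       exclusion_reads: int) -> Optional[float]:
    """Return inclusion / (inclusion + exclusion), or None with no reads."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        # No informative reads: PSI is undefined rather than zero.
        return None
    return inclusion_reads / total


# Hypothetical counts: 30 reads include the exon, 10 reads skip it.
psi = percent_spliced_in(30, 10)
```

With full-length long reads each read can be assigned to the inclusion or skipping isoform directly, whereas short-read pipelines typically infer these counts from junction-spanning reads.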

15:35-15:50
A novel pipeline for human LINE-1 insertion identification based on outer-Cas9 nanopore sequencing technology
Confirmed Presenter: Xinyi Liu, Department of Computational Biology and Medical Sciences, University of Tokyo, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Xinyi Liu, Department of Computational Biology and Medical Sciences, University of Tokyo, Japan
  • Satomi Mitsuhashi, Department of Neurology, St. Marianna University School of Medicine, Japan
  • Martin C. Frith, Department of Computational Biology and Medical Sciences, University of Tokyo, Japan

Presentation Overview: Show

Transposable elements (TEs) are notable repetitive DNA sequences. Long Interspersed Elements (LINEs) are a major class of TEs, with the L1 subfamily constituting ~17% of the human genome. Although most L1s have lost their mobility during evolution, a young subfamily, L1 Homo sapiens (L1Hs), remains active and can retrotranspose to new genomic positions, potentially leading to human genetic disorders and diseases. Several methods have been developed for detecting these insertions using various sequencing technologies, but most require whole-genome sequencing data and are time-consuming. Here, we designed an intuitive and fast method to identify non-reference L1Hs insertions. We adapted standard Cas9 sequencing into “outer-Cas9 seq” to locate the L1Hs target sites and enrich the nearby reads. A novel computational pipeline was developed to identify coverage peaks and infer non-reference L1Hs insertion sites. The outer-Cas9 sequencing method was applied to the GM24385 cell line. Using ~53 Gb of long-read nanopore sequencing data, we found 45 non-reference L1 insertions of 4 different types across the genome. We tracked the source L1 via 3’-transduction, and in one case neither the source nor the target L1 sequence was in the reference genome, yet our method identified both.
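
The coverage peak step mentioned above can be pictured with a very small sketch: scan a per-base depth array and report the intervals where depth stays at or above a cutoff. The function name, the fixed threshold, and the toy depth values are assumptions for illustration; the abstract does not describe how the actual pipeline calls peaks.

```python
from typing import List, Tuple


def coverage_peaks(coverage: List[int],
                   threshold: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals with depth >= threshold."""
    peaks = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= threshold and start is None:
            start = pos                      # a peak opens here
        elif depth < threshold and start is not None:
            peaks.append((start, pos))       # the peak closed before pos
            start = None
    if start is not None:
        peaks.append((start, len(coverage)))  # peak runs to the array end
    return peaks


# Toy per-base depths (hypothetical), with a cutoff of 5 reads.
peaks = coverage_peaks([0, 1, 5, 6, 2, 0, 7, 8], threshold=5)
```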

15:50-16:05
MAssively-Parallel Flow cytometry Xplorer (MAPFX): A Toolbox for Analysing Data from the Massively-Parallel Cytometry Experiments
Confirmed Presenter: Hsiao-Chi Liao, The University of Melbourne, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Hsiao-Chi Liao, The University of Melbourne, Australia
  • Terry Speed, Walter and Eliza Hall Institute of Medical Research, Australia
  • Davis McCarthy, St. Vincent’s Institute of Medical Research, Australia
  • Agus Salim, The University of Melbourne, Australia

Presentation Overview: Show

Massively-parallel cytometry (MPC) experiments allow cost-effective quantification of more than 200 surface proteins at single-cell resolution. By leveraging the MPC experiments, the Infinity Flow (Inflow) protocol was developed to measure highly informative ‘backbone’ markers on all cells in all wells in three plates, along with well-specific exploratory ‘infinity’ markers. Backbone markers are used to impute the infinity markers on cells in all other wells using machine learning methods. This protocol offers unprecedented opportunities for more comprehensive analysis of proteomics data. However, some aspects of the protocol can be improved, including methods for background correction and removal of unwanted variation. Here, we propose MAPFX as an alternative tool that optimally removes the unwanted variation due to factors that range from electronic baseline restoration and fluorescence compensation to technical variation such as well effects, followed by imputation and other statistical analyses. Unique features of our approach include performing background correction prior to imputation and removing unwanted variation from the data at the cell-level, while explicitly accounting for the potential association between biology and unwanted factors. We benchmark our pipeline against alternative pipelines and demonstrate that our approach is better at preserving biological signals, removing unwanted variation, and imputing unmeasured infinity markers.

16:05-16:20
Biochemical-free enrichment or depletion of RNA classes in real-time during direct RNA sequencing with RISER
Confirmed Presenter: Alexandra Sneddon, Australian National University, EMBL Australia, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Alexandra Sneddon, Australian National University, EMBL Australia, Australia
  • Agin Ravindran, Australian National University, EMBL Australia, Australia
  • Somasundhari Shanmuganandam, Australian National University, Australia
  • Madhu Kanchi, Australian National University, Australia
  • Nadine Hein, Australian National University, Australia
  • Simon Jiang, Australian National University, Australia
  • Nikolay Shirokikh, Australian National University, Australia
  • Eduardo Eyras, Australian National University, EMBL Australia, Australia

Presentation Overview: Show

The heterogeneous composition of cellular transcriptomes poses a major challenge for detecting weakly expressed RNA classes, as they can be obscured by abundant RNAs. Although biochemical protocols can enrich or deplete specified RNAs, they are time-consuming, expensive and can compromise RNA integrity. Here we introduce RISER, a biochemical-free technology for the real-time enrichment or depletion of RNA classes. RISER performs selective rejection of molecules during direct RNA sequencing by identifying RNA classes directly from nanopore signals with deep learning and communicating with the sequencing hardware in real time. By targeting the dominant messenger and mitochondrial RNA classes for depletion, RISER reduces their respective read counts by more than 85%, resulting in an increase in sequencing depth of 47% on average for long non-coding RNAs. We also apply RISER for the depletion of globin mRNA in whole blood, achieving a decrease in globin reads by more than 90% as well as an increase in non-globin reads by 16% on average. Furthermore, using a GPU or a CPU, RISER is faster than GPU-accelerated basecalling and mapping. RISER’s modular and retrainable software and intuitive command-line interface allow easy adaptation to other RNA classes. RISER is available at https://github.com/comprna/riser.

16:20-16:35
Flexiplex: a versatile demultiplexer and search tool for omics data
Confirmed Presenter: Nadia Davidson, Walter and Eliza Hall Institute, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Tingwen Chen


Authors List: Show

  • Oliver Cheng, Walter and Eliza Hall Institute, Australia
  • Min Hao Ling, A*STAR, Singapore
  • Changqing Wang, Walter and Eliza Hall Institute, Australia
  • Chyn Chua, Walter and Eliza Hall Institute, Australia
  • Shuyi Wu, Walter and Eliza Hall Institute, Australia
  • Matthew Ritchie, Walter and Eliza Hall Institute, Australia
  • Jonathan Göke, A*STAR, Singapore
  • Noorul Amin, Walter and Eliza Hall Institute, Australia
  • Nadia Davidson, Walter and Eliza Hall Institute, Australia

Presentation Overview: Show

The process of analyzing high throughput sequencing data often requires the identification and extraction of specific target sequences, such as cellular barcodes and UMIs in single-cell data, and specific genetic variants for genotyping. Existing tools which perform these functions are often task-specific, not tolerant to noise in the sequencing data, or lack usability, such as being challenging to install and slow to run.

Here we present Flexiplex, a powerful new sequence searching and demultiplexing tool for omics data, which is fast, flexible and easy to use. We benchmarked Flexiplex against leading demultiplexing tools for long-read single-cell transcriptomics and found that it consistently provided the highest accuracy in barcode assignment. Moreover, when cellular barcodes were unknown, it was able to discover the barcodes significantly faster than competing tools. We also demonstrate that Flexiplex is an effective tool for genotyping cells when SNPs and other genomic alterations are searched, achieving an excellent balance between sensitivity and speed. As a demonstration of its utility for unraveling cancer clonality, we show how Flexiplex was used to identify cell populations that have relapse-associated alterations in acute myeloid leukemia. Flexiplex is available at https://davidsongroup.github.io/flexiplex/.

16:50-17:30
Panel: JSBi General Assembly
Room: Large Theatre
Format: In person



Friday, October 25th
9:30-10:30
Invited Presentation: Keynote 4 - Using bioinformatics to identify new regulatory mechanisms related to RNA m6A modification
Room: Large Theatre
Format: In person

Moderator(s): Bruno Gaeta


Authors List: Show

  • Xiujie Wang

Presentation Overview: Show

The application of various types of omics studies has produced a huge amount of data, which has enabled researchers to identify new regulatory mechanisms using bioinformatics. By integrating bioinformatic analysis with experimental studies, we focused on deciphering the regulatory mechanisms and biological functions of N6-methyladenosine (m6A) modification, which is the most abundant modification on mRNAs. We discovered that the selectivity of m6A modification sites is partially regulated by miRNAs via sequence pairing, revealing a novel function of miRNAs in regulating mRNA epigenetic modification. We also systematically characterized the distribution features and dynamic changes of m6A modification sites in the mouse brain during the learning process. Intriguingly, we found that the formation of m6A modification can enhance the efficacy of hippocampus-dependent memory consolidation by facilitating the translation of early-response genes, yet excessive training can compensate for the function of m6A in regulating long-term memory formation. We also revealed novel functions of an m6A reader protein, YTHDC1, in regulating embryonic brain development.

10:45-11:00
AI Transforming Protein Family Classification
Confirmed Presenter: Alex Bateman, EMBL-EBI, United Kingdom

Room: Large Theatre
Format: In Person

Moderator(s): Shinya Ikematsu


Authors List: Show

  • Alex Bateman, EMBL-EBI, United Kingdom

Presentation Overview: Show

We are living through a revolution in AI approaches, which is transforming molecular biology and computational biology. I will discuss how the advent of high-accuracy structural models has had a large impact on our ability to completely and accurately classify protein domains in the Pfam and InterPro databases. I will also talk about how Deep Learning models such as ProtENN, developed by Google Research, have expanded our ability to find distant homologues for known protein families. I will argue that these models represent the most significant change in protein classification in three decades. Even more recently we have seen the arrival of Large Language Models such as ChatGPT, which may now enable us to develop high-throughput tools for annotating proteins, protein families and non-coding RNAs, if only we can stop them hallucinating! I will talk about our efforts to harness these models to write accurate and verifiable annotation at scale.

11:00-11:15
Do Protein Language Models Learn Phylogeny?
Confirmed Presenter: Sanjana Tule, The University of Queensland, Australia

Room: Large Theatre
Format: In Person

Moderator(s): Shinya Ikematsu


Authors List: Show

  • Sanjana Tule, The University of Queensland, Australia
  • Gabriel Foley, The University of Queensland, Australia
  • Mikael Boden, The University of Queensland, Australia

Presentation Overview: Show

Deep machine learning uncovers evolutionary relationships directly from protein sequences, internalizing concepts inherent to classical phylogenetic tree inference. We bridge these paradigms by evaluating protein-based language models (pLMs) for discerning phylogenetic relationships without explicit training. We assess ESM2, ProtTrans, and MSA-Transformer against classical methods, considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model outperforms others in recovering phylogenetic relationships in both low- and high-gap settings. pLMs align with conventional methods, particularly in families with fewer indels, highlighting indels as a key differentiator. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons broadly recapitulate phylogenetic distances, and these polysemantic neurons are shared among homologous families. ESM2 shows promise as a complementary tool for remote homologs with complex histories of insertions and deletions.

11:15-11:30
Progress and Pitfalls in Genome-Based Machine Learning Models for Zoonotic Virus Prediction
Confirmed Presenter: Junna Kawasaki, Chiba University, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Shinya Ikematsu


Authors List: Show

  • Junna Kawasaki, Chiba University, Japan
  • Tadaki Suzuki, National Institute of Infectious Diseases, Japan
  • Michiaki Hamada, Waseda University, Japan

Presentation Overview: Show

Machine learning models have been deployed to assess the zoonotic spillover risk of viruses by identifying their human infectivity potential. However, the lack of comprehensive datasets poses a major challenge, limiting the range of viruses for which predictions can be made. Our study addressed this limitation through two key strategies: constructing expansive datasets across 26 viral families and developing new models leveraging large language models pre-trained on extensive nucleotide sequences. These approaches substantially boosted our model performance, particularly for segmented RNA viruses, which are involved in severe zoonoses but have been overlooked due to limited data availability. Furthermore, models trained on data up to 2018 demonstrated robust generalization capability for viruses emerging post-2018. Nonetheless, we also found remaining challenges in flagging the zoonotic potential of specific viral lineages, including SARS-CoV-2. Our study provides a comprehensive benchmark for viral infectivity prediction models and highlights the unresolved issues in fully leveraging machine learning to prepare for upcoming zoonotic threats.

11:30-11:45
Extraction of cellular function knowledge from literature using large language models
Confirmed Presenter: Haruka Ozaki, University of Tsukuba, Japan

Room: Large Theatre
Format: In Person

Moderator(s): Shinya Ikematsu


Authors List: Show

  • Haruka Ozaki, University of Tsukuba, Japan
  • Ryota Yamada, fuku, Inc., Japan
  • Shinya Nakata, University of Tsukuba, Japan
  • Kazuya Miyanishi, University of Tsukuba, Japan
  • Ami Kaneko, University of Tsukuba, Japan
  • Haruto Ijiri, University of Tsukuba, Japan
  • Yoshihiko Sakaguchi, University of Tsukuba, Japan

Presentation Overview: Show

Multicellular organisms consist of diverse cell types with various functions and phenotypes (e.g., cell cycle, size, electrophysiological properties). This knowledge is scattered across literature and not systematically organized, making it difficult to access and integrate with cell subtypes identified through single-cell RNA sequencing and spatial transcriptomics. Large language models (LLMs) have enabled attempts to ‘generate’ descriptions of cellular functions, but hallucination issues require human verification of generated descriptions.
Here, we aim to extract knowledge of cellular functions from literature using LLMs, defining this knowledge as a graph: cell types are subjects, biological phenomena are objects, and their relationships (positive, negative, neutral) are predicates. Three types of prompts were designed for the LLM, and zero-shot, one-shot, and few-shot prompting were evaluated.
As a proof of concept, we extracted cell functions from neuroscience literature. Five experts annotated 100 abstracts from a PubMed Central subset to create an evaluation dataset. The extraction performance was assessed using graph edit distance (GED) between the ground truth and the extracted graphs. Using OpenAI's GPT-4 Turbo, the baseline GED of 10.7 improved to 5.5 with few-shot prompting. We will discuss handling variations in cell type and biological phenomena terms and the extension to full-text articles.
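
To make the graph-based evaluation above concrete, the following sketch scores an extracted graph against a ground-truth graph when both are given as sets of (cell type, relation, phenomenon) triples. It is a simplified, label-matched variant of graph edit distance with unit costs: nodes are matched by exact label instead of searching over node mappings, so it can differ from the GED values reported in the abstract. The function name and the example triples are hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (cell type, relation, biological phenomenon)


def label_matched_ged(truth: Set[Triple], predicted: Set[Triple]) -> int:
    """Count node and edge edits needed to turn `predicted` into `truth`."""
    def nodes(triples: Set[Triple]) -> Set[str]:
        return {t[0] for t in triples} | {t[2] for t in triples}

    def edges(triples: Set[Triple]) -> Dict[Tuple[str, str], str]:
        return {(s, o): r for s, r, o in triples}

    # Each node present in only one graph costs one insertion or deletion.
    cost = len(nodes(truth) ^ nodes(predicted))
    e_true, e_pred = edges(truth), edges(predicted)
    for key in e_true.keys() | e_pred.keys():
        if key not in e_true or key not in e_pred:
            cost += 1  # edge insertion or deletion
        elif e_true[key] != e_pred[key]:
            cost += 1  # relation label substitution
    return cost


# Hypothetical toy graphs, for illustration only.
truth = {("astrocyte", "positive", "glutamate uptake"),
         ("microglia", "positive", "phagocytosis")}
predicted = {("astrocyte", "negative", "glutamate uptake"),
             ("microglia", "positive", "synaptic pruning")}
distance = label_matched_ged(truth, predicted)
```

Because exact label matching treats synonymous terms as different nodes, a score like this is sensitive to the term variation in cell type and phenomenon names that the abstract raises as an open issue.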

11:45-12:30
Panel: Joint Special Session for Asia & Pacific Bioinformatics Societies
Room: Large Theatre
Format: In person

Moderator(s): Asif Khan



12:30-13:00
Closing Ceremony
Room: Large Theatre
Format: In person

Moderator(s): Kiyoko Kinoshita

