Invited Presentation: Deep LTMLE: Scalable Causal Survival Analysis with Transformer
Confirmed Presenter: Toru Shirakawa, Osaka University Graduate School of Medicine, Japan
Room: 520a
Format: In Person
Moderator(s): Haluk Resat
Authors List: Show
- Toru Shirakawa, Osaka University Graduate School of Medicine, Japan
- Maya Petersen, University of California, Berkeley, US
- Sky Qiu, University of California, Berkeley, US
- Yi Li, University of California, Berkeley, US
- Yuxuan Li, Columbia University, US
- Yulun Wu, University of California, Berkeley, US
- Ryo Kawasaki, Osaka University Graduate School of Medicine, Japan
- Hiroyasu Iso, National Center for Global Health and Medicine, Japan
- Mark van der Laan, University of California, Berkeley, US
Presentation Overview: Show
Causal inference under dynamic interventions from longitudinal data with high-dimensional variables, such as omics and images that potentially vary over time, is a central problem in precision medicine. We developed and implemented Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to this problem (Shirakawa et al. 2024). Following the roadmap of causal inference (Petersen and van der Laan 2014), Deep LTMLE provides an efficient estimator for a class of functionals identified through the g-formula (Robins 1986) in continuous time with a monitoring process, right-censoring, and competing risks. Our architecture uses a Transformer to handle long-range dependencies and heterogeneous variables. The method builds on the theory of causal survival analysis (Rytgaard et al. 2022) combined with a scalable deep neural network architecture, bridging traditional biomedical statistics and emerging methods in data science. Because it can incorporate high-dimensional variables such as omics, text, images, and video, the method can integrate data from molecular biology and clinical practice to evaluate their causal impact on clinically significant events such as patient survival, fostering both translational and reverse-translational research. Within the framework of targeted learning (van der Laan and Rose 2011, 2018), we corrected the bias commonly associated with machine learning algorithms and built an asymptotically efficient estimator. In a simulation with simple synthetic data, Deep LTMLE demonstrated statistical performance comparable to, and computational performance superior to, an asymptotically efficient estimator: LTMLE with a super learner combining multiple machine learning algorithms. As the complexity of the synthetic data and the length of the time horizon increased, Deep LTMLE tended to outperform LTMLE.
Furthermore, Deep LTMLE is implemented in Python and scales with additional computational resources such as graphics processing units (GPUs). We will demonstrate an application of Deep LTMLE to real-world data.
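To make the g-formula cited above concrete, here is a minimal, hedged sketch (not the authors' code): a nonparametric plug-in g-formula estimate for a static "always treat" intervention at two time points, computed by stratified empirical means on synthetic binary data. Deep LTMLE replaces these strata-level estimates with a Transformer and adds a targeting step, neither of which is reproduced here; all variable names and the data-generating process are invented for illustration.

```python
import random

random.seed(0)

def simulate(n):
    """One record per subject: (W, A0, L1, A1, Y), all binary."""
    data = []
    for _ in range(n):
        w = int(random.random() < 0.5)
        a0 = int(random.random() < (0.7 if w else 0.3))          # confounded treatment
        l1 = int(random.random() < (0.2 + 0.3 * w + 0.3 * a0))   # time-varying covariate
        a1 = int(random.random() < (0.8 if l1 else 0.4))         # A1 responds to L1
        y = int(random.random() < (0.1 + 0.2 * w + 0.2 * a0 + 0.2 * l1 + 0.2 * a1))
        data.append((w, a0, l1, a1, y))
    return data

def mean(xs):
    return sum(xs) / len(xs) if xs else 0.0

def g_formula_always_treat(data):
    """Plug-in g-formula: sum_w P(w) * sum_l1 P(l1 | w, A0=1) * E[Y | w, 1, l1, 1]."""
    est = 0.0
    for w in (0, 1):
        p_w = mean([1 if r[0] == w else 0 for r in data])
        treated = [r for r in data if r[0] == w and r[1] == 1]
        for l1 in (0, 1):
            p_l1 = mean([1 if r[2] == l1 else 0 for r in treated])
            cell = [r[4] for r in treated if r[2] == l1 and r[3] == 1]
            est += p_w * p_l1 * mean(cell)
    return est

data = simulate(20_000)
print(round(g_formula_always_treat(data), 3))
```

The stratified means play the role of the sequential outcome regressions; in high dimensions the strata become empty, which is exactly the regime where Deep LTMLE's neural network regressions are needed.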
Invited Presentation: Wearable Biosensing to Predict Imminent Aggressive Behavior in Psychiatric Inpatient Youths with Autism
Confirmed Presenter: Matthew Goodwin, Northeastern University, USA
Room: 520a
Format: In Person
Moderator(s): Haluk Resat
Authors List: Show
- Matthew Goodwin, Northeastern University, USA
Presentation Overview: Show
Aggressive behavior is a prevalent and challenging issue in individuals with autism, especially for those who have limited verbal ability and intellectual impairments. This presentation investigates whether changes in peripheral physiology recorded by a wearable biosensor and machine learning can be used to predict imminent aggressive behavior before it occurs in inpatient youths with autism from 4 primary care psychiatric inpatient hospitals. Research staff performed live behavioral coding of aggressive behavior while 70 inpatient study participants wore a commercially available biosensor that recorded peripheral physiological signals (cardiovascular activity, electrodermal activity, and motion). Logistic regression, support vector machines, neural networks, and domain adaptation were used to analyze time-series features extracted from biosensor data. Area under the receiver operating characteristic curve (AUROC) values were used to evaluate the performance of population- and person-dependent models. A total of 429 naturalistic observational coding sessions were recorded, totaling 497 hours, wherein 6665 aggressive behaviors were documented, including self-injury (3983 behaviors [59.8%]), emotion dysregulation (2063 behaviors [31.0%]), and aggression toward others (619 behaviors [9.3%]). Logistic regression was the best-performing overall classifier across all experiments; for example, it predicted aggressive behavior 3 minutes before onset with a mean AUROC of 0.80 (95% CI, 0.79-0.81). Further research will explore clinical implications and the potential for personalized interventions.
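As a hedged illustration of the evaluation metric above (not the authors' pipeline), the following computes AUROC from scratch as the probability that a randomly chosen positive example outranks a randomly chosen negative one, with ties counted as half; the toy scores and labels are invented.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank interpretation:
    P(score of a random positive > score of a random negative), ties = 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: label 1 = a window preceding aggression, 0 = a calm window.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]  # hypothetical classifier risk scores
print(auroc(scores, labels))
```

A reported AUROC of 0.80 thus means an 80% chance that the model ranks a true pre-aggression window above a calm one.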
Invited Presentation: Improving the speed, scalability, and interoperability of core C++ modules for Stan - a tool for Bayesian modeling and statistical inference
Confirmed Presenter: Mitzi Morris, Columbia University, United States of America
Room: 520a
Format: In Person
Moderator(s): Haluk Resat
Authors List: Show
- Mitzi Morris, Columbia University, United States of America
Presentation Overview: Show
Software Engineering for Research Software
Stan is a tool for Bayesian modeling and statistical inference. In September 2021, the Stan project received an award under NIH program NOT-OD-21-091 to improve the speed, scalability, and interoperability of its core C++ modules. The Stan developer process was used for this work and for subsequent development initiatives.
Reproducible science requires reliable tools. Rapid science requires tools that are easy to learn and use. Reliability is enforced through design, code review, and extensive test suites. Good documentation supports learnability and usability. These activities inform and reinforce one another; writing good tests leads to more informative docs and vice versa.
These activities require input from researchers, developers, and end-users. The Stan project has been very much a collaboration between computer scientists and applied statisticians. This talk will examine the Stan developer community as well as the Stan developer process.
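For readers unfamiliar with Stan, here is a minimal, hypothetical Stan program (not from the talk) estimating the bias of a coin; a Stan model declares its data, parameters, and log-density, and is compiled to C++ (the core modules this project improves) for gradient-based inference.

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;   // observed binary outcomes
}
parameters {
  real<lower=0, upper=1> theta;       // probability of success
}
model {
  theta ~ beta(1, 1);                 // uniform prior
  y ~ bernoulli(theta);               // likelihood
}
```

Even a model this small exercises the math library, autodiff, and sampler that the funded C++ work targets.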
Invited Presentation: LinkML: A FAIR data modeling framework for the biosciences and beyond
Confirmed Presenter: Nomi L Harris, Lawrence Berkeley National Laboratory, United States
Room: 520a
Format: In Person
Moderator(s): Haluk Resat
Authors List: Show
- Sierra AT Moxon, Lawrence Berkeley National Laboratory, United States
- Nomi L Harris, Lawrence Berkeley National Laboratory, United States
- Matthew Brush, University of North Carolina at Chapel Hill, United States
- Melissa A Haendel, University of North Carolina at Chapel Hill, United States
- Christopher J Mungall, Lawrence Berkeley National Laboratory, United States
Presentation Overview: Show
Open science depends on open data. LinkML (Linked data Modeling Language; linkml.io) is an open, extensible modeling framework that makes it easy to model, validate, and distribute reusable, interoperable data.
The quantity and variety of data being generated in scientific fields is increasing rapidly, but is often captured in unstructured, unstandardized formats like publications, lab notebooks, or spreadsheets. Many data standards are defined in isolation, causing siloization; lack of data harmonization limits reusability and cross-disciplinary applications. A confusing landscape of schemas, standards, and tools leaves researchers struggling with collecting, managing, and analyzing data.
LinkML addresses these issues, weaving together elements of the Semantic Web with aspects of conventional modeling languages to provide a pragmatic way to work with a broad range of data types, maximizing interoperability and computability across sources and domains. LinkML supports all steps of the data analysis workflow: data generation, submission, cleaning, annotation, integration, and dissemination. It enables even non-developers to create data models that are understandable and usable across the layers from data stores to user interfaces, reducing translation issues and increasing efficiency.
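To make the modeling workflow concrete, here is a minimal, hypothetical LinkML schema (all class, slot, and URI names are invented for illustration; see linkml.io for the authoritative syntax). LinkML schemas are written in YAML:

```yaml
id: https://example.org/person-schema   # hypothetical schema URI
name: person-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Person:
    description: A minimal example entity.
    slots:
      - id
      - name
      - age

slots:
  id:
    identifier: true
  name:
    required: true
  age:
    range: integer
    minimum_value: 0
```

From a schema like this, LinkML tooling can validate data and generate artifacts such as JSON Schema, SQL DDL, and Python classes, which is how one model serves every layer from data store to user interface.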
Projects across the biomedical spectrum and beyond are using LinkML to model their data, including the NCATS Biomedical Data Translator, Alliance of Genome Resources, Bridge2AI, Neurodata Without Borders, Reactome, Critical Path Institute, iSample, National Microbiome Data Collaborative, Center for Cancer Data Harmonization, INCLUDE project, Open Microscopy Environment, and Genomics Standards Consortium.
Ultimately, LinkML democratizes data, helping to bridge the gap between people of diverse expertise and enabling a shared language with which to express the critically important blueprints of each project’s data collection.
This work is supported in part by an NIH supplement under NOT-OD-22-068, and by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract number DE-AC02-05CH11231.
Invited Presentation: Leveraging Language Models for Enhanced Biocuration and User Interaction in Reactome: A Pathway Towards Community-Driven Knowledge Enrichment
Confirmed Presenter: Nancy Li
Room: 520a
Format: In Person
Moderator(s): Haluk Resat
Authors List: Show
Presentation Overview: Show
The Reactome Pathway Knowledgebase, supported by NIH NHGRI and ODSS, stands as a cornerstone database renowned for its meticulous human curation practices. Reactome is currently the most comprehensive open-source human biological pathway knowledgebase. Its curation process faces inherent challenges in handling the vastness and complexity of biological data: human-driven curation struggles with scale and efficiency. We are therefore developing a web-based curation tool that incorporates large language model (LLM) technology, guided by Reactome’s data schema and curation requirements, to significantly improve steps that are major bottlenecks in the curation workflow. To do so, we adopted retrieval-augmented generation (RAG) and developed an API to associate previously unannotated genes with Reactome pathways. Leveraging our prior work with the NIH IDG program, the API can find potential pathways for a query gene, search PubMed for supporting literature evidence, create text summaries, and extract functional relationships between the query gene and biological concepts.
In addition, Reactome is developing a conversational chatbot that helps users better comprehend Reactome content. The chatbot is designed to provide a natural, interactive user experience and more intuitive navigation through Reactome's extensive database. This interface will allow users to query complex pathway information and receive rich, informative responses that encourage deeper engagement with the knowledgebase. Future work involves integrating LLMs into the chatbot for data analysis, empowering users with diverse technical backgrounds to perform sophisticated analyses using Reactome's data and tools. Integrating multi-source data retrieval and gene analysis tools is expected to enhance the platform's utility and interactivity, streamlining the user experience and facilitating the exploration and understanding of complex biological datasets. Our LLM-focused approaches will improve user engagement and also lay the groundwork for improved curation workflows within Reactome, potentially opening a path to community curation.
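The retrieval half of a RAG pipeline like the one described can be sketched as follows; this is a hedged toy (the corpus, scoring function, and prompt format are invented, and no real Reactome API or LLM is called). Relevant passages are retrieved and packed into a prompt that a language model would then answer from.

```python
import re

def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, passage):
    """Crude relevance: count of shared tokens (real systems use embeddings)."""
    return len(tokens(query) & tokens(passage))

def retrieve(query, corpus, k=2):
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query, corpus):
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Invented stand-ins for curated pathway summaries.
corpus = [
    "TP53 regulates the cell cycle and apoptosis pathways.",
    "Glycolysis converts glucose to pyruvate.",
    "Apoptosis pathways are triggered by DNA damage signals.",
]
print(build_prompt("Which pathways involve apoptosis?", corpus))
```

Grounding the model's answer in retrieved curated text is what lets such a chatbot stay faithful to the knowledgebase rather than free-associating.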
Panel: Discussion
Room: 520a
Format: In person
Moderator(s): Haluk Resat
Authors List: Show