
This special track, sponsored by the US National Institutes of Health Office of Data Science Strategy (ODSS), convenes talks from funded projects in technology, infrastructure, research software, and related applications, as well as their associated ethical considerations, to advance health research. Learn more about these program initiatives at datascience.nih.gov.

Session 1 co-chairs: Jeremias Sulam & Banky Olatosi
Session 2 co-chairs: Nomi Harris & Yanbin Yin
Session 3 co-chairs: Vida Abedi & Ansaf Salleb-Aouissi

Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-10:50
Opening Remarks
Confirmed Presenter: Yanli Wang, Office of Data Science Strategy, United States

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Yanli Wang, Office of Data Science Strategy, United States
10:50-11:10
Session: Ethics and Equity for AI and Computational Research
Invited Presentation: AI/ML to Identify and Stratify Non-Clinical Factors Contributing to Cancer Health Disparity in Rural Appalachia
Confirmed Presenter: Aisha Montgomery

Room: 520a
Format: Live Stream

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Aisha Montgomery

Presentation Overview:

Introduction: In the medically underserved area of rural Appalachia, cancer mortality rates are 32% higher than the US average. Social determinants of health (SDOH) are known barriers to cancer care in rural regions; however, their importance is not well demonstrated in AI/ML.

Background: In our pilot study, cancer registry data were used to build an ML model to predict 5-year colorectal cancer (CRC) survival in Appalachians. Analyses showed that the model was less accurate in predicting survival in Appalachian vs. non-Appalachian patients. Limited data from underserved populations in public datasets may increase bias, and using only clinical factors may reduce the predictive performance of AI/ML methods. These findings led to the current project.

Methods: We hypothesized that SDOH factors were important to cancer survival in Appalachia and SDOH data features would improve ML model performance. A combined EHR dataset was created from community-based cancer centers in Appalachia which included both clinical and SDOH data features. SDOH features were stratified and added to the ML model to evaluate their effect.

Results: Patients had an average age of 67±13.2 years and were 49% female and 66% rural. Stratification identified marital, employment, and insurance status as the SDOH features with the highest impact on model output. Combining clinical and SDOH features in the ML model increased the area under the receiver operating characteristic curve (0.791) compared with using clinical (0.758) or SDOH (0.662) features alone.

Discussion: These findings demonstrate the importance of SDOH factors for health outcomes in an underserved population. Further, the data methods highlight the need for diverse, community-based EHR datasets in AI/ML research. We built an ML model that can be used to help reduce cancer-related health disparities within rural and other medically underserved populations. Expansion of the current work will contribute to best practices for creating diverse, representative clinical and SDOH datasets to improve AI/ML-based outcomes.
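
The feature-set comparison reported above can be illustrated in outline with a standard ML toolkit. The sketch below is illustrative only and is not the authors' pipeline; the data are synthetic and the column names are hypothetical placeholders.

```python
# Illustrative sketch only (not the authors' pipeline): compare AUROC for
# clinical-only, SDOH-only, and combined feature sets on synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(67, 13, n),                 # hypothetical clinical feature
    "stage": rng.integers(1, 5, n),               # hypothetical clinical feature
    "marital_status": rng.integers(0, 2, n),      # hypothetical SDOH feature
    "employment_status": rng.integers(0, 2, n),   # hypothetical SDOH feature
    "insurance_status": rng.integers(0, 3, n),    # hypothetical SDOH feature
})
# Synthetic outcome influenced by both clinical and SDOH features.
logit = -0.03 * (df["age"] - 67) - 0.5 * (df["stage"] - 2) + 0.6 * df["insurance_status"]
df["survived_5yr"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clinical = ["age", "stage"]
sdoh = ["marital_status", "employment_status", "insurance_status"]
for name, cols in [("clinical", clinical), ("SDOH", sdoh), ("combined", clinical + sdoh)]:
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], df["survived_5yr"],
                                              test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print(name, round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```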

Invited Presentation: Estimating and Controlling for Fairness in Radiology with Missing Sensitive Information
Confirmed Presenter: Jeremias Sulam, Johns Hopkins University, United States

Room: 520a
Format: In Person

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Beepul Bharti, Johns Hopkins University, United States
  • Paul Yi, University of Maryland, Baltimore, United States
  • Jeremias Sulam, Johns Hopkins University, United States

Presentation Overview:

As the use of machine learning models in real-world, high-stakes decision settings continues to grow, it is highly important that we are able to audit and control for any potential fairness violations these models may exhibit towards certain groups. For example, in automated screening protocols in radiology, one may wish to certify that a predictor achieves comparable performance for different demographic groups. To do so, one naturally requires access to the sensitive attributes, such as demographics, biological sex, or other potentially sensitive features, that determine group membership. Unfortunately, in many settings this information is unavailable, either because of inadequacies of existing datasets or because of legal and privacy constraints. In this presentation, we will focus on the well-known equalized odds (EOD) definition of fairness. In a setting without sensitive attributes, we will show how to provide tight and computable upper bounds for the EOD violation of a predictor, thus being able to guarantee fairness in such missing-data scenarios. Second, we demonstrate how one can provably control the worst-case EOD with a new post-processing correction method. Our results characterize when directly controlling for EOD with respect to the predicted sensitive attributes is, and is not, optimal for controlling worst-case EOD. Our results hold under assumptions that are milder than in previous works, and we illustrate these results with experiments on synthetic and real datasets, including chest radiographs.
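
For orientation, the equalized odds (EOD) violation referenced here is the largest gap in group-conditional true-positive or false-positive rates. The sketch below computes that empirical quantity for two groups on toy data; it illustrates the definition only and is not the authors' bound estimator or post-processing method.

```python
# Empirical equalized-odds (EOD) violation between two groups: the larger of
# the TPR gap and the FPR gap. Toy data only; not the presented method.
import numpy as np

def rate(y_true, y_pred, label):
    """P(Y_hat = 1 | Y = label) within one group."""
    mask = (y_true == label)
    return y_pred[mask].mean() if mask.any() else np.nan

def eod_violation(y_true, y_pred, group):
    """Max over labels {0,1} of the absolute rate gap between groups 0 and 1."""
    gaps = []
    for label in (0, 1):   # label 0 -> FPR gap, label 1 -> TPR gap
        r0 = rate(y_true[group == 0], y_pred[group == 0], label)
        r1 = rate(y_true[group == 1], y_pred[group == 1], label)
        gaps.append(abs(r0 - r1))
    return max(gaps)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                            # synthetic labels
yhat = (y + rng.integers(0, 2, 1000) > 0).astype(int)   # synthetic predictions
g = rng.integers(0, 2, 1000)                            # synthetic group membership
print(f"EOD violation: {eod_violation(y, yhat, g):.3f}")
```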

11:10-11:30
Session: Ethics and Equity for AI and Computational Research
Invited Presentation: Ethical Development of Imaging Biomarkers for Colorectal Cancer
Confirmed Presenter: Rina Khan, Queen's University, Canada

Room: 520a
Format: In Person

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Rina Khan, Queen's University, Canada
  • Amber Simpson, Queen's University, Canada
  • Catherine Stinson, Queen's University, Canada
  • Vannessa Ferguson, York University, Canada
  • Annabelle Suave, Queen's University, Canada
Invited Presentation: Examining how social and behavioral determinants affect the prevalence, severity, and outcomes of Long-COVID-19 and health disparity
Confirmed Presenter: Deborah Mudali

Room: 520a
Format: Live Stream

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Deborah Mudali

Presentation Overview:

This research examines how social and behavioral determinants affect the prevalence, severity, and outcomes of Long-COVID-19, and their role in worsening health disparities among affected populations. Previous studies have shown that Long-COVID-19 disproportionately impacts various demographic groups, deepening existing disparities. By applying advanced computational techniques, this study aims to identify relevant variables and analyze their associations with Long-COVID-19 using statistical, machine learning, and deep learning models. Using OCHIN datasets, the study focuses on socioeconomic status, race/ethnicity, education, healthcare access, and substance abuse.
The methods involve data collection and preprocessing, feature selection using R and Python tools, and model development employing logistic regression, decision trees, random forests, gradient boosting, neural networks, and principal component regression. The models are evaluated and validated using metrics like accuracy, precision, recall, and F1-score, with cross-validation techniques ensuring generalizability and robustness. The analysis aims to uncover complex relationships between the social determinants and Long-COVID-19 outcomes, contributing to understanding health disparities.
Preliminary results indicate significant disparities in Long-COVID-19 outcomes based on demographic factors. The bimodal distribution in the density plot suggests generational differences in diagnosis, while boxplots indicate age-related trends within racial groups. Scatter plots reveal age-related patterns relevant to Long-COVID incidence or severity. Bar plots highlight unequal representation of racial groups, pointing to potential disparities in healthcare access or exposure risk. These visualizations suggest that certain racial and demographic groups may be disproportionately affected by Long-COVID-19, underscoring the need for targeted interventions.
This study aims to provide insights into health disparities associated with Long-COVID-19 to promote equitable healthcare strategies and reduce disparities across diverse populations. By leveraging advanced analytical techniques, it seeks to inform public health policies and resource allocation, improving healthcare outcomes for all demographic groups affected by Long-COVID-19.
Keywords: Long-COVID-19; Social and Behavioral Determinants; Health Disparity; Machine Learning
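
As a rough illustration of the model-development and evaluation step described in the abstract (not the study's code; features and data are synthetic stand-ins), several classifiers can be compared under cross-validation with the metrics named:

```python
# Illustrative sketch only: cross-validated comparison of classifiers using
# accuracy, precision, recall, and F1. Synthetic data; hypothetical features.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
n = 1500
X = pd.DataFrame({
    "age": rng.normal(50, 18, n),
    "education_years": rng.integers(8, 21, n),
    "insured": rng.integers(0, 2, n),
    "substance_use": rng.integers(0, 2, n),
})
# Synthetic binary Long-COVID outcome loosely tied to two of the features.
y = (rng.random(n) < 1 / (1 + np.exp(-(0.02 * (X["age"] - 50) - 0.8 * X["insured"])))).astype(int)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "precision", "recall", "f1"])
    print(name, {m: round(cv[f"test_{m}"].mean(), 3)
                 for m in ("accuracy", "precision", "recall", "f1")})
```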

11:30-11:50
Session: Ethics and Equity for AI and Computational Research
Invited Presentation: An Ethical Framework-Guided Tool for Assessing Bias in EHR-Based Big Data Studies
Confirmed Presenter: Bankole Olatosi, University of South Carolina, USA

Room: 520a
Format: In Person

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Bankole Olatosi, University of South Carolina, USA
  • Shan Qiao, University of South Carolina, USA
  • George Khushf, University of South Carolina, USA
  • Jiajia Zhang, University of South Carolina, USA
  • Xiaoming Li, University of South Carolina, USA

Presentation Overview:

Background: Current literature describes bias in the use of electronic health records (EHR) for data science research. An important but under-researched ethical issue is the risk of potential biases prevalent in healthcare datasets (e.g., EHR data) during the data curation, acquisition, and processing cycles. To advance our understanding of bias and equity issues in data science applied to EHR data, we developed an ethical framework to guide data scientists using EHR data. This framework articulates the ethically meaningful biases in large EHR data studies affecting People with HIV (PWH), how these biases intersect to reproduce bias in HIV-related studies, and the strategies needed to break this vicious cycle.
Method: The ethical framework was developed through an iterative process comprising literature/policy review, content analysis, and interdisciplinary dialogue and discussion. Throughout all iterative cycles we engaged data curators, end-user researchers, healthcare workers, government agencies, and patient representatives in various formats, including in-depth interviews with 20 key stakeholders, a conference panel discussion, and a charrette workshop.
Results: The draft ethical framework was designed to be fair and transparent. It covers meaningful biases spanning statistical/computational biases, social biases (e.g., interpersonal, institutional, and structural bias), and representativeness biases (e.g., groups underrepresented in EHR data due to care access, affordability, availability, and acceptability).
Conclusion: The developed framework illustrates the actions and steps that healthcare providers, health systems, data scientists, and researchers can collectively take to reduce opportunities that cumulatively work to produce and reproduce biases within EHR data and the resulting data science research products/interventions. Interdisciplinary collaboration within the public health research area and intersectional efforts across government and the healthcare system in policies, capacity building, and patient engagement/involvement are needed to manage and address the biases and protect patients from the threats of unfairness and inequality during data science research.

Invited Presentation: Modern, Intuitive Tools for Managing AI/ML Data in Health Equity-Focused Multiomic and Population Studies
Confirmed Presenter: Victor Nwankwo, University of Illinois College of Medicine and the AI.Health4All Center for Health Equity using Machine Learning and Artificial Intelligence, USA

Room: 520a
Format: In Person

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Victor Nwankwo, University of Illinois College of Medicine and the AI.Health4All Center for Health Equity using Machine Learning and Artificial Intelligence, USA

Presentation Overview:

The evolving use of AI and ML in multiomic and population-wide studies presents researchers with the challenge of managing vast and complex datasets. This necessitates modern tools that streamline data management and enhance interpretability. We model a scalable framework for AI/ML-driven studies that integrates heterogeneous data across various study types, emphasizing translational strategies for health equity. This pipeline incorporates robust information visualization technology packages, offering a zoomable interface to coherently display diverse data types of varied scopes. By utilizing dynamic visualization frameworks and real-time interactive tools, our solution addresses critical gaps in data exploration and decision-making processes. This integration fosters a cohesive understanding of complex research datasets, facilitates hypothesis generation, and accelerates clinical and translational research outcomes, particularly those aimed at reducing health disparities.

11:50-12:00
Session: Ethics and Equity for AI and Computational Research
Invited Presentation: Use Explainable AI to Improve the Trust of and Detect the Bias of AI Models
Confirmed Presenter: Senait Tekle, George Washington University, USA

Room: 520a
Format: In Person

Moderator(s): Emrin Horgusluoglu


Authors List:

  • Senait Tekle, George Washington University, USA
  • Melissa M. Goldstein, George Washington University, USA
  • Stuart J. Nelson, George Washington University, USA
  • Ali Ahmed, George Washington University, Washington DC VA Medical Center, Georgetown University, USA

Presentation Overview:

Background: Artificial Intelligence (AI) presents promising advancements to improve healthcare outcomes, yet it also raises new ethical concerns. AI systems trained on biased data can perpetuate discrimination against disadvantaged patients based on factors such as race, gender, or socioeconomic status. Using Explainable AI to describe AI models is expected to enable detection and correction of biases, thereby enhancing confidence in, and the trustworthiness of, AI models.

Objectives: To use Explainable AI to improve trust in AI models and detect bias.

Methods: Guided by the National Institute of Standards and Technology (NIST) framework on trustworthy AI, we conducted virtual interviews with clinicians, patients, IT and ethics experts, healthcare administrators, and policymakers. Questions included inquiries regarding participants’ general understanding of AI, perceptions of bias, levels of trust, and familiarity and thoughts on Explainable AI.

Results: Study participants (N=17) were 53% White, 35% Asian, and 12% African American; 41% were female and 59% male. Participants felt Explainable AI provides valuable assistance, giving them a deeper understanding of the decision-making process and boosting confidence in the fairness and reliability of the AI system's output. Many respondents emphasized the importance of understanding the reasoning behind an AI system's decision-making process, particularly in clinical decision-making settings, and believed that transparency and comprehensibility are crucial for building trust and confidence in AI. Many also emphasized the need for clear explanations of how AI arrives at its decisions.

Conclusions: Our findings underscore the importance of transparency and comprehensibility in AI systems, emphasizing their role in building trust and confidence among users. Additionally, they highlight the critical need for Explainable AI methods, particularly in sensitive domains like clinical decision-making, to ensure accountability and mitigate biases.

Significance: Our study highlights the importance of ethical data science and the role of Explainable AI in enhancing trust, transparency, and detecting bias within AI systems, particularly in high-impact domains.

12:00-12:20
Session: Ethics and Equity for AI and Computational Research
Panel: Discussion
Room: 520a
Format: In person

Moderator(s): Emrin Horgusluoglu


Authors List:

14:20-15:00
Session: Sustainable Research Software and Tools in the Cloud and Beyond
Invited Presentation: Cloud exploration and AI/ML-readiness of CAZyme annotation in human gut microbiome
Confirmed Presenter: Yanbin Yin

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Yanbin Yin

Presentation Overview:

We developed dbCAN as a software system in 2012 and actively maintain it for automated CAZyme (carbohydrate-active enzyme) annotation. With an R01 award, we developed dbCAN-seq and dbCAN3, two new web-based bioinformatics tools that enable dietary fiber utilization prediction from any human gut microbiome sequencing data submitted as queries, to assist the development of personalized nutrition. Funded by the NIH-ODSS cloud exploration grant, we deployed the dbCAN3 web server on Amazon Web Services (AWS), which offers competitive performance, especially when scaling up to handle more job submissions; we also compared the AWS solution with our on-premises solution, a standalone desktop server. Funded by the NIH-ODSS AI/ML-readiness grant, we converted ~250k CAZyme gene clusters (CGCs) of dbCAN-seq, identified from ~10k metagenome-assembled genomes (MAGs), into ML/AI-ready vector representations using word2vec. These unsupervised CGC data (without substrate labels) are further used to generate embeddings of the supervised data (401 CGCs with substrate labels). A Recurrent Neural Network (RNN)-based multiclass classification model is built to allow prediction of glycan substrates for CGCs. Overall, our CAZyme bioinformatics research has benefited from NIH-ODSS support, which brought us one step closer to microbiome-based personalized recommendation of dietary fiber intake.
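
A compact sketch of the two steps described above: learning unsupervised word2vec embeddings over CGCs treated as sequences of CAZyme family tokens, then applying a small GRU-based multiclass classifier to the labeled subset. The tokens, dimensions, and class count are hypothetical; this is not the dbCAN-seq implementation.

```python
# Illustrative sketch (not the dbCAN-seq code).
# Step 1: unsupervised word2vec embeddings; each CGC is a "sentence" of
#         CAZyme family tokens. Step 2: a small GRU multiclass classifier.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

unlabeled_cgcs = [["GH13", "GT2", "CBM48"], ["GH43", "GH10", "CE1"]]  # toy CGCs
w2v = Word2Vec(sentences=unlabeled_cgcs, vector_size=64, window=5,
               min_count=1, sg=1, epochs=10)

def embed(cgc):
    """Map a CGC (list of family tokens) to a (seq_len, 64) tensor."""
    return torch.tensor(np.stack([w2v.wv[t] for t in cgc]), dtype=torch.float32)

class CGCSubstrateClassifier(nn.Module):
    """GRU over token embeddings, followed by a linear substrate head."""
    def __init__(self, emb_dim=64, hidden=128, n_substrates=20):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_substrates)

    def forward(self, x):              # x: (batch, seq_len, emb_dim)
        _, h = self.rnn(x)
        return self.head(h[-1])        # logits over glycan substrate classes

model = CGCSubstrateClassifier()
logits = model(embed(unlabeled_cgcs[0]).unsqueeze(0))
print(logits.shape)                    # torch.Size([1, 20])
```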

Invited Presentation: PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical Knowledge
Confirmed Presenter: Robert Leaman

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Robert Leaman

Presentation Overview:

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is an advanced biomedical literature resource featuring search capabilities enabled with state-of-the-art AI methods. It extracts six key types of biomedical entities (genes, diseases, chemicals, genetic variants, species, and cell lines) and 12 common types of relationships between entities.
PubTator 3.0’s online interface facilitates literature exploration through semantic, relation, keyword, and Boolean queries. Search results are prioritized based on the depth of the relationship between the query terms in each article and the importance of the section where these matches appear. PubTator 3.0 provides a single, unified search system across both abstracts and full text, comprising approximately 36 million PubMed abstracts and over 6 million full-text articles from the PMC Open Access Subset.
PubTator 3.0 utilizes deep-learning transformer models for named entity recognition and relation extraction. These models, AIONER and BioREx, were recently developed with computational resource support from ODSS. PubTator 3.0 currently contains 1.6 billion entity annotations and 33 million relationship annotations, with new articles added weekly, and offers programmatic access through its API and bulk download.
Compared to its predecessor, PubTator 3.0 exhibits enhanced entity recognition and normalization performance. Its new relation extraction feature shows substantially higher performance than previous state-of-the-art systems. PubTator 3.0 retrieves a greater number of articles for entity pair queries than either PubMed or Google Scholar, with higher precision in the top 20 results. Integrating ChatGPT (GPT-4) with the PubTator APIs dramatically improves the factuality and verifiability of its responses.
Previous versions of PubTator have supported a wide range of research applications, fulfilling over one billion API requests. With an improved and expanded set of features and tools, PubTator 3.0 is designed to allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
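
Programmatic access of the kind mentioned above can be exercised with a plain HTTP client. The base URL and endpoint below follow common PubTator API conventions but are assumptions; consult the PubTator 3.0 documentation for the authoritative paths and parameters.

```python
# Minimal sketch of programmatic access. The base URL, endpoint, and parameter
# names are assumptions based on PubTator API conventions, not verified docs.
import requests

BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api"   # assumed base URL

def fetch_annotations(pmids):
    """Retrieve entity and relation annotations for a list of PMIDs (BioC JSON)."""
    resp = requests.get(f"{BASE}/publications/export/biocjson",
                        params={"pmids": ",".join(map(str, pmids))},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()

data = fetch_annotations([36540970])   # arbitrary example PMID
print(type(data))                      # inspect the returned BioC JSON structure
```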

Invited Presentation: ToxPipe: Harnessing AI and Cloud Computing for Toxicological Data Exploration and Interpretation
Confirmed Presenter: Trey Saddler, Division of Translational Toxicology, NIEHS, US NIH, United States

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Trey Saddler, Division of Translational Toxicology, NIEHS, US NIH, United States
  • Parker Combs, Division of Translational Toxicology, NIEHS, US NIH, United States
  • Charles Schmitt, Office of Data Science, NIEHS, US NIH, United States
  • David Reif, Division of Translational Toxicology, NIEHS, US NIH, United States
  • Scott Auerbach, Division of Translational Toxicology, NIEHS, US NIH, United States

Presentation Overview:

ToxPipe is an innovative platform that harnesses expert-entrained AI and cloud computing to revolutionize the exploration and interpretation of diverse toxicological data streams. By leveraging large language models (LLMs), AI automation platforms like Auto-GPT, and hybrid deployment approaches, ToxPipe enables semi-autonomous data integration, toxicological characterization, and knowledge discovery through natural language instructions.

The platform offers a user-friendly web application and API, allowing scientists and toxicologists to interact with AI agents and explore various toxicologically relevant datasets. ToxPipe's open-source codebase provides opportunities for customization and integration with additional data sources.

We present the design, implementation, and evaluation of ToxPipe’s outputs, showcasing its capabilities in tasks such as gene expression interpretation, toxicological data summarization, and exploration of disparate datasets through text-to-SQL queries, among other capabilities. Performance is evaluated using objective statistical metrics compared to expert assessment.

ToxPipe incorporates advanced tools and libraries, including promptfoo and ragas for automated evaluation of LLM prompts and retrieval-augmented generation (RAG) pipelines, and LangChain and LlamaIndex for agent orchestration and data ingestion. Documents such as National Toxicology Program Technical Reports are parsed and converted to vector embeddings, enabling semantic similarity search for user and agent queries. ToxPipe also leverages ChemBioTox, a Postgres database that compiles toxicologically relevant information from a variety of publicly available data sources and QSAR models for over 1 million chemicals and their metabolites, to provide expert-curated annotations.
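
The semantic similarity search described above (documents chunked, embedded as vectors, and ranked against a query) follows a generic retrieval pattern. In the sketch below, TF-IDF stands in for the embedding model; it is not ToxPipe's implementation.

```python
# Generic sketch of the retrieval step in a RAG pipeline: chunks are embedded
# as vectors and ranked by cosine similarity to the query. TF-IDF stands in
# for the embedding model used in practice.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [
    "Technical report excerpt: hepatocellular hypertrophy observed in the liver at high doses.",
    "Technical report excerpt: no treatment-related kidney lesions were identified.",
    "Technical report excerpt: body weight gain was reduced in exposed animals.",
]
query = "liver toxicity findings"

vectorizer = TfidfVectorizer()
chunk_vecs = vectorizer.fit_transform(chunks)     # (n_chunks, vocab), L2-normalized
query_vec = vectorizer.transform([query])

sims = (chunk_vecs @ query_vec.T).toarray().ravel()   # cosine similarity
for i in np.argsort(sims)[::-1][:2]:                  # top-2 chunks for the query
    print(f"{sims[i]:.2f}  {chunks[i]}")
```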

Through expert entrainment, ToxPipe generates context to guide users in arriving at integrative interpretation from diverse data streams. The ToxPipe interface offers ready access to powerful agentic discovery tools and is poised to significantly accelerate toxicological research and discovery.

Panel: Discussion
Room: 520a
Format: In person

Moderator(s): Haluk Resat


Authors List:

15:00-16:00
Session: Sustainable Research Software and Tools in the Cloud and Beyond
Invited Presentation: Deep LTMLE: Scalable Causal Survival Analysis with Transformer
Confirmed Presenter: Toru Shirakawa, Osaka University Graduate School of Medicine, Japan

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Toru Shirakawa, Osaka University Graduate School of Medicine, Japan
  • Maya Petersen, University of California, Berkeley, US
  • Sky Qiu, University of California, Berkeley, US
  • Yi Li, University of California, Berkeley, US
  • Yuxuan Li, Columbia University, US
  • Yulun Wu, University of California, Berkeley, US
  • Ryo Kawasaki, Osaka University Graduate School of Medicine, Japan
  • Hiroyasu Iso, National Center for Global Medicine, Japan
  • Mark van der Laan, University of California, Berkeley, US

Presentation Overview:

Causal inference under dynamic interventions from longitudinal data with high-dimensional variables, such as omics and images that potentially vary across time, is a central problem in precision medicine. We developed and implemented Deep Longitudinal Targeted Minimum Loss-based Estimation (Deep LTMLE), a novel approach to address this problem (Shirakawa et al. 2024). Following the roadmap of causal inference (Petersen and van der Laan 2014), Deep LTMLE provides an efficient estimator for a class of functionals identified through the g-formula (Robins 1986) in continuous time with a monitoring process, right-censoring, and competing risks. Our architecture uses a Transformer to handle long-range dependencies and heterogeneous variables. The method is based on the theory of causal survival analysis (Rytgaard et al. 2022) combined with a scalable deep neural network architecture, bridging traditional biomedical statistics and emerging methods in data science. Our method can incorporate high-dimensional variables such as omics, texts, images, and videos, and can thus integrate data from molecular biology and clinical practice to evaluate their causal impacts on clinically significant events such as patient survival. This feature would foster both translational and reverse-translational research. Within the framework of targeted learning (van der Laan and Rose 2011, 2018), we corrected the bias commonly associated with machine learning algorithms and built an asymptotically efficient estimator. In a simulation with simple synthetic data, Deep LTMLE demonstrated statistical performance comparable to, and computational performance superior to, an asymptotically efficient estimator, LTMLE with a super learner of multiple machine learning algorithms. As the complexity of the synthetic data and the length of the time horizon increase, Deep LTMLE tended to outperform LTMLE. Furthermore, Deep LTMLE is implemented in Python and is scalable with additional computational resources such as graphics processing units (GPUs). We will demonstrate an application of Deep LTMLE with real-world data.
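
For orientation, the class of functionals mentioned here is identified through the longitudinal g-formula. A standard simplified discrete-time form (following Robins 1986), with treatment history \bar A, covariate history \bar L, and outcome Y, is:

```latex
% Simplified discrete-time longitudinal g-formula: the counterfactual outcome
% mean under a fixed treatment sequence \bar a (Robins 1986).
\mathbb{E}\left[ Y_{\bar a} \right]
  = \sum_{\bar l}
    \mathbb{E}\left[ Y \mid \bar A = \bar a,\ \bar L = \bar l \right]
    \prod_{t=0}^{T} P\left( L_t = l_t \mid \bar A_{t-1} = \bar a_{t-1},\ \bar L_{t-1} = \bar l_{t-1} \right)
```

The continuous-time formulation with a monitoring process, right-censoring, and competing risks used by Deep LTMLE generalizes this expression; the display above is only the simplified discrete-time case.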

Invited Presentation: Wearable Biosensing to Predict Imminent Aggressive Behavior in Psychiatric Inpatient Youths with Autism
Confirmed Presenter: Matthew Goodwin, Northeastern University, USA

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Matthew Goodwin, Northeastern University, USA

Presentation Overview:

Aggressive behavior is a prevalent and challenging issue in individuals with autism, especially for those who have limited verbal ability and intellectual impairments. This presentation investigates whether changes in peripheral physiology recorded by a wearable biosensor and machine learning can be used to predict imminent aggressive behavior before it occurs in inpatient youths with autism from 4 primary care psychiatric inpatient hospitals. Research staff performed live behavioral coding of aggressive behavior while 70 inpatient study participants wore a commercially available biosensor that recorded peripheral physiological signals (cardiovascular activity, electrodermal activity, and motion). Logistic regression, support vector machines, neural networks, and domain adaptation were used to analyze time-series features extracted from biosensor data. Area under the receiver operating characteristic curve (AUROC) values were used to evaluate the performance of population- and person-dependent models. A total of 429 naturalistic observational coding sessions were recorded, totaling 497 hours, wherein 6665 aggressive behaviors were documented, including self-injury (3983 behaviors [59.8%]), emotion dysregulation (2063 behaviors [31.0%]), and aggression toward others (619 behaviors [9.3%]). Logistic regression was the best-performing overall classifier across all experiments; for example, it predicted aggressive behavior 3 minutes before onset with a mean AUROC of 0.80 (95% CI, 0.79-0.81). Further research will explore clinical implications and the potential for personalized interventions.
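
As a generic illustration of the analysis pattern described (features extracted from sliding windows of a physiological signal, a classifier trained on them, and performance summarized by AUROC), on synthetic data rather than the study's biosensor recordings:

```python
# Generic illustration: windowed features from a synthetic physiological signal
# feeding a logistic-regression classifier evaluated with AUROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
signal = rng.normal(size=50_000)                            # e.g., electrodermal activity samples
labels = (rng.random(size=50_000) < 0.001).astype(int)      # sparse synthetic "aggression" events

def window_features(x, y, win=240):
    """Mean, std, and slope per non-overlapping window, with the window's label."""
    feats, targs = [], []
    for start in range(0, len(x) - win, win):
        w = x[start:start + win]
        slope = np.polyfit(np.arange(win), w, 1)[0]
        feats.append([w.mean(), w.std(), slope])
        targs.append(int(y[start:start + win].max()))
    return np.array(feats), np.array(targs)

X, y = window_features(signal, labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"AUROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.2f}")
```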

Invited Presentation: Improve speed, scalability, and interoperability of core C++ modules for Stan - a tool for Bayesian modeling and statistical inference
Confirmed Presenter: Mitzi Morris, Columbia University, United States of America

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Mitzi Morris, Columbia University, United States of America

Presentation Overview:

Software Engineering for Research Software

Stan is a tool for Bayesian modeling and statistical inference. In September 2021, the Stan project received an award under program NOT-OD-21-091 to improve the speed, scalability, and interoperability of its core C++ modules. The Stan developer process was used for this work and for subsequent development initiatives.

Reproducible science requires reliable tools. Rapid science requires tools which are easy to learn and use. Reliability is enforced through design, code review, and extensive test suites. Good documentation supports learnability and usability. These activities inform and reinforce one another; writing good tests leads to more informative docs and vice versa.

These activities require input from researchers, developers, and end-users. The Stan project has been very much a collaboration between computer scientists and applied statisticians. This talk will examine the Stan developer community as well as the Stan developer process.

Invited Presentation: LinkML: A FAIR data modeling framework for the biosciences and beyond
Confirmed Presenter: Nomi L Harris, Lawrence Berkeley National Laboratory, United States

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Sierra AT Moxon, Lawrence Berkeley National Laboratory, United States
  • Nomi L Harris, Lawrence Berkeley National Laboratory, United States
  • Matthew Brush, University of North Carolina at Chapel Hill, United States
  • Melissa A Haendel, University of North Carolina at Chapel Hill, United States
  • Christopher J Mungall, Lawrence Berkeley National Laboratory, United States

Presentation Overview:

Open science depends on open data. LinkML (Linked data Modeling Language; linkml.io) is an open, extensible modeling framework that makes it easy to model, validate, and distribute reusable, interoperable data.
The quantity and variety of data being generated in scientific fields is increasing rapidly, but is often captured in unstructured, unstandardized formats like publications, lab notebooks, or spreadsheets. Many data standards are defined in isolation, causing siloization; lack of data harmonization limits reusability and cross-disciplinary applications. A confusing landscape of schemas, standards, and tools leaves researchers struggling with collecting, managing, and analyzing data.
LinkML addresses these issues, weaving together elements of the Semantic Web with aspects of conventional modeling languages to provide a pragmatic way to work with a broad range of data types, maximizing interoperability and computability across sources and domains. LinkML supports all steps of the data analysis workflow: data generation, submission, cleaning, annotation, integration, and dissemination. It enables even non-developers to create data models that are understandable and usable across the layers from data stores to user interfaces, reducing translation issues and increasing efficiency.
Projects across the biomedical spectrum and beyond are using LinkML to model their data, including the NCATS Biomedical Data Translator, Alliance of Genome Resources, Bridge2AI, Neurodata Without Borders, Reactome, Critical Path Institute, iSample, National Microbiome Data Collaborative, Center for Cancer Data Harmonization, INCLUDE project, Open Microscopy Environment, and Genomics Standards Consortium.
Ultimately, LinkML democratizes data, helping to bridge the gap between people of diverse expertise and enabling a shared language with which to express the critically important blueprints of each project’s data collection.
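
To make the notion of a data model concrete, the following is a tiny, hypothetical LinkML-style schema expressed as YAML embedded in Python; it is illustrative only, and the actual metamodel, generators, and validators are documented at linkml.io.

```python
# A tiny, hypothetical LinkML-style schema shown as YAML text for illustration.
# See linkml.io for the real metamodel and tooling; this is not a project schema.
import yaml

schema_yaml = """
id: https://example.org/sample-schema
name: sample-schema
prefixes:
  linkml: https://w3id.org/linkml/
default_range: string
classes:
  Sample:
    description: A biological sample in a hypothetical study
    attributes:
      sample_id:
        identifier: true
      organism:
      collection_date:
        range: date
"""

schema = yaml.safe_load(schema_yaml)
print(sorted(schema["classes"]["Sample"]["attributes"]))
```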

This work is supported in part by an NIH supplement under NOT-OD-22-068, and by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract number DE-AC02-05CH11231.

Invited Presentation: Leveraging Language Models for Enhanced Biocuration and User Interaction in Reactome: A Pathway Towards Community-Driven Knowledge Enrichment
Confirmed Presenter: Nancy Li

Room: 520a
Format: In Person

Moderator(s): Haluk Resat


Authors List:

  • Nancy Li

Presentation Overview:

The Reactome Pathway Knowledgebase, supported by NIH NHGRI and ODSS, stands as a cornerstone database renowned for its meticulous human curation practices. Reactome is currently the most comprehensive open-source human biological pathway knowledgebase. Its curation process faces inherent challenges in handling the vastness and complexity of biological data, and human-driven curation struggles with scale and efficiency. Therefore, we are developing a web-based curation tool that incorporates Large Language Model (LLM) technology, guided by Reactome's data schema and curation requirements, to significantly improve steps that are major bottlenecks in the curation workflow. To do so, we adopted retrieval-augmented generation (RAG) and developed an API to associate previously unannotated genes with Reactome pathways. Leveraging our prior work with the NIH IDG program, the API can find potential pathways for a query gene, search PubMed for supporting literature evidence, create text summaries, and extract functional relationships between the query gene and biological concepts.

In addition, Reactome is developing a conversational chatbot that facilitates user interactions and improves comprehension of Reactome content. The chatbot is designed to provide a natural, interactive user experience and more intuitive navigation through Reactome's extensive database. This interface will allow users to query complex pathway information and receive rich, informative responses that encourage deeper engagement with the knowledgebase. Future endeavors involve integrating LLMs into the chatbot for data analysis, empowering users with diverse technical backgrounds to perform sophisticated analyses using Reactome. Integration of multi-source data retrieval systems and the incorporation of gene analysis tools are expected to enhance the platform's utility and interactivity, thereby streamlining the user experience and facilitating the exploration and understanding of complex biological datasets. Our LLM-focused approaches will improve user engagement and also lay the groundwork for improved curation workflows within Reactome, potentially offering a path toward community curation practices.

Panel: Discussion
Room: 520a
Format: In person

Moderator(s): Haluk Resat


Authors List:

16:40-17:00
Session: Preparing for the Future: AI Data Readiness and Smart Health Solutions
Invited Presentation: Multi-Context Graph Neural Networks for Enhanced Multivariate Time-Series Analysis in Healthcare
Confirmed Presenter: Luciano Nocera, University of Southern California, United States

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Luciano Nocera, University of Southern California, United States
  • Arash Hajisafi, University of Southern California, United States
  • Maria Despoina Siampou, University of Southern California, United States
  • Bita Azarijoo, University of Southern California, United States
  • Cyrus Shahabi, University of Southern California, United States

Presentation Overview:

Effective multivariate time-series (MTS) analysis in healthcare is crucial for various medical tasks and requires capturing complex inter-variable relationships accurately. Previous methods often fail to model these relationships adequately, leading to poor predictions, especially when training data size is limited. Our research introduces a series of Graph Neural Networks (GNNs) designed to overcome these limitations by explicitly modeling MTS as graphs, where each variable is a node connected by edges representing multi-context inter-variable relationships.

Our first work, Busyness Graph Neural Network (BysGNN), is a temporal GNN that integrates semantic, temporal, spatial, and taxonomic data to model interactions between Points of Interest (POIs). This model was particularly effective during the COVID-19 pandemic for forecasting POI visits to set occupancy restrictions in urban settings, demonstrating substantial improvements over previous forecasting models by capturing complex multi-context correlations.

Building upon BysGNN, we developed NeuroGNN, which extends the graph-based approach to EEG data analysis. NeuroGNN dynamically constructs graphs that reflect the evolving relationships between EEG electrodes and associated brain regions, significantly enhancing seizure detection and classification. The model’s ability to incorporate diverse contextual data improves its capability to classify rare seizure types with limited samples, addressing a significant challenge in medical diagnostics.

Our latest work, WaveGNN, further extends the application of GNNs by addressing irregular MTS data common in healthcare (e.g., unaligned and incomplete measurements of vital signs). WaveGNN integrates additional data modalities, including Electronic Health Record (EHR) notes, to build a robust graph representation that maintains accuracy nearly equivalent to scenarios with complete data, even in the presence of significant data gaps. This substantially improves over existing methods, which typically suffer performance degradation under similar conditions.
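
A toy sketch of the idea shared by these models: each variable of a multivariate time series becomes a graph node, edges are weighted by inter-variable relationships (here, correlation), and node features are updated by neighborhood aggregation. This is illustrative only, not BysGNN, NeuroGNN, or WaveGNN.

```python
# Toy illustration: variables of a multivariate time series become graph nodes,
# edges are weighted by correlation, and node features are updated by one round
# of neighborhood aggregation (message passing).
import numpy as np

rng = np.random.default_rng(0)
mts = rng.normal(size=(500, 8))                 # 500 time steps, 8 variables

# Adjacency from absolute Pearson correlation, sparsified, no self-loops.
adj = np.abs(np.corrcoef(mts.T))
np.fill_diagonal(adj, 0.0)
adj[adj < 0.1] = 0.0

# Node features: per-variable summary statistics over a recent window.
window = mts[-100:]
node_feats = np.stack([window.mean(axis=0), window.std(axis=0)], axis=1)  # (8, 2)

# One message-passing step: degree-normalized neighbor average combined with
# the node's own features through random projections.
deg = adj.sum(axis=1, keepdims=True) + 1e-12
messages = (adj / deg) @ node_feats
W_self, W_msg = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
updated = np.tanh(node_feats @ W_self + messages @ W_msg)                 # (8, 4)
print(updated.shape)
```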

Invited Presentation: Clustering-Informed Shared-Structure Variational Autoencoder for Missing Data Imputation in Large-Scale Healthcare Data
Confirmed Presenter: Yasin Khadem Charvadeh, Memorial Sloan Kettering Cancer Center, USA

Room: 520a
Format: Live Stream

Moderator(s): Yanli Wang


Authors List:

  • Yasin Khadem Charvadeh, Memorial Sloan Kettering Cancer Center, USA
  • Kenneth Seier, Memorial Sloan Kettering Cancer Center, USA
  • Katherine S. Panageas, Memorial Sloan Kettering Cancer Center, USA
  • Mithat Gönen, Memorial Sloan Kettering Cancer Center, USA
  • Yuan Chen, Memorial Sloan Kettering Cancer Center, USA

Presentation Overview:

Despite advancements in managing healthcare data, missing data challenges persist in Electronic Health Records (EHR) and patient-reported health data, compromising their usability in various healthcare analytics, including telehealth. As Artificial Intelligence (AI) modeling techniques evolve, conventional and contemporary methods for handling missing data encounter notable limitations that hinder their effectiveness. Established methods such as Multiple Imputation by Chained Equations (MICE), MissForest, and Generative Adversarial Imputation Nets (GAIN) demonstrate limitations in handling the complexities inherent in healthcare data. These challenges involve capturing complex non-linear relationships, extended computation times, and constraints in addressing various types of missing data mechanisms. In response, we propose a novel model building on the Variational Autoencoder (VAE) architecture, a powerful generative model using Bayesian neural networks. Our proposed method differs from existing VAE-based imputation strategies by providing a robust framework specifically designed for handling missing values within healthcare data. This framework can effectively accommodate various missing data mechanisms, including missing not at random (MNAR). By identifying missing data patterns and leveraging shared structures across VAEs for different patterns, our model effectively captures complex associations, thus enhancing generalizability and learning efficiency. Through comprehensive simulation studies, we showcase the adaptability of our approach across different missing mechanisms, demonstrating its superiority over traditional and popular imputation methods in terms of imputation accuracy. We apply our proposed method to EHR data from patients diagnosed with early-stage breast cancer who are at high risk of recurrence after surgery at Memorial Sloan Kettering Cancer Center, specifically among those treated with the FDA-approved drug abemaciclib, which necessitates routine blood monitoring at specific time intervals due to potential side effects. Given the variability in clinical practices for ordering these tests, our approach aims to mitigate the impact of missing data on patient health monitoring and subsequent analyses.
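
A minimal sketch of the general idea behind VAE-based imputation (reconstruct observed entries, let the decoder fill in the missing ones). It is a generic baseline on synthetic data, not the authors' clustering-informed, shared-structure model.

```python
# Generic VAE-imputation sketch (not the presented shared-structure model):
# train on observed entries only, then fill missing entries from the decoder.
import torch
import torch.nn as nn

class ImputationVAE(nn.Module):
    def __init__(self, d_in, d_latent=8, d_hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_latent)
        self.logvar = nn.Linear(d_hidden, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in))

    def forward(self, x, mask):
        h = self.enc(x * mask)                      # zero-fill missing entries
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def loss_fn(recon, x, mask, mu, logvar):
    recon_loss = (((recon - x) ** 2) * mask).sum() / mask.sum()   # observed entries only
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + 1e-3 * kld

x = torch.randn(256, 20)                            # toy data matrix
mask = (torch.rand_like(x) > 0.3).float()           # 1 = observed, 0 = missing
model = ImputationVAE(d_in=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    recon, mu, logvar = model(x, mask)
    loss = loss_fn(recon, x, mask, mu, logvar)
    opt.zero_grad(); loss.backward(); opt.step()

imputed = x * mask + model(x, mask)[0].detach() * (1 - mask)
print(imputed.shape)
```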

17:00-17:20
Session: Preparing for the Future: AI Data Readiness and Smart Health Solutions
Invited Presentation: SCH: Predicting and Preventing Adverse Pregnancy Outcomes in Nulliparous Women
Confirmed Presenter: Ansaf Salleb-Aouissi, Columbia University, United States

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Adam Lin, Columbia University, United States
  • Andrea Clark-Sevilla, Columbia University, United States
  • Raiyan Rashid Khan, Columbia University, United States
  • Mahdi Arab, Hunter College, United States
  • Daniel Mallia, Hunter College, United States
  • Alisa Leshchenko, Hunter College, United States
  • Adam Catto, Hunter College, United States
  • Cassandra Marcussen, Columbia University, United States
  • Anastasia Dmitrienko, Columbia University, United States
  • Owen Kunhardt, Hunter College, United States
  • Nicolae Lari, Columbia University, United States
  • Irene Tang, Columbia University, United States
  • Qi Yan, Columbia University, United States
  • Itsik Pe'er, Columbia University, United States
  • Ronald Wapner, Columbia University, United States
  • Anita Raja, Hunter College, United States
  • Ansaf Salleb-Aouissi, Columbia University, United States

Presentation Overview:

Adverse pregnancy outcomes (APOs), including preterm birth (PTB) and preeclampsia (PE), have been exceedingly challenging problems in obstetrics, predominantly due to the inherent complexity of their multi-factorial etiologies and the lack of approaches capable of integrating and interpreting large multi-modal data. A particularly challenging population in which to determine APO risk is first-time mothers (nulliparous women), due to the lack of prior pregnancy history.

We report on our work done over the last few years as part of our funded project through the NSF-NIH Smart and Connected Health program, entitled SCH: Prediction of Preterm Birth in Nulliparous Women (R01LM013327).
We use data from the nuMoM2b study, conducted at eight clinical sites across the United States between October 2010 and May 2014. Information on treatment, psychological and physiological measures, medical history, ultrasound, activity, toxicology, and family history was included. In the context of predicting and preventing PTB and PE, our recent work addresses:

1. Exploring data and algorithms for building longitudinal models for addressing APOs using privileged information, that is data available at training time but not at inference time;
2. Combining genetic factors with other clinical factors to determine risk;
3. Novel machine learning methods for handling missing and imbalanced data in the health cohorts;
4. An interactive tool that facilitates longitudinal analysis of the risk factors associated with PTB, using CDC data from 1968 to 2021, and how they have varied over the years, as well as a second tool for searching and visualizing genetic associations of pregnancy traits;
5. A methodology for identifying and mitigating biases in our models.

Results from this work won the 2022 NIH-NICHD Decoding Maternal Morbidity Data Challenge Award and have been peer-reviewed and published in several journal publications. We plan to present highlights of our methods and results at the session.

Invited Presentation: SCH: Graph-based Spatial Transcriptomics Computational Methods in Kidney Diseases
Confirmed Presenter: Juexin Wang 

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Juexin Wang 

Presentation Overview:


17:20-17:40
Session: Preparing for the Future: AI Data Readiness and Smart Health Solutions
Invited Presentation: AI/ML Ready mHealth and Wearable Data for Dyadic HCT
Confirmed Presenter: Bengie L Ortiz, University of Michigan, USA

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Rajnish Kumar, University of Michigan, USA
  • Charles Ziegenbein, Peraton Labs, USA
  • Bengie L Ortiz, University of Michigan, USA
  • Vibhuti Gupta, Meharry Medical College, USA
  • Xiao Cao, University of Michigan, USA
  • Aditya Jalin, University of Michigan, USA
  • Sung Choi, University of Michigan, USA

Presentation Overview:

Hematopoietic cell transplantation (HCT) is a potent form of immunotherapy for high-risk blood diseases. Given the high risk associated with HCT, a dedicated caregiver is necessary and expected for at least the first 100 days post transplant; however, HCT caregivers experience adverse physical and mental health during this period. Dyadic methods to identify the physical and mental health status of both caregivers and patients can potentially improve the intervention and support needed to improve the quality of life of both patients and caregivers. The goal of this research is to build a high-quality, comprehensive, standardized, AI/ML-ready, and clinically meaningful mHealth dataset of HCT patients and their caregivers to develop novel interventions in HCT using mHealth and wearables. We have developed a novel preprocessing pipeline to build the AI/ML-ready mHealth and wearable data, which includes data extraction, data standardization, data cleaning, feature extraction, and data integration. The data were collected at Michigan Medicine from an existing mHealth randomized clinical trial running from September 2020 to July 2023, for a total of 323 subjects (166 adult patients and caregivers). The data consist of physiological variables captured from Fitbit, such as heart rate, steps, and sleep, as well as survey, patient-reported outcome, and mood data from the mobile application. Demographically, caregivers are 69.6% female and 28% male, whereas HCT patients are 66% male and 34% female; in terms of race, 88.3% of caregivers and 86.8% of patients are white. There are a total of 417,003,290 step, 3,075,687 sleep, 381,164 mood, and 86,057,538 heart rate observations in the available data. Overall, 45.67% of the data are duplicates, with the largest numbers of duplicate values in the steps and heart rate data. The generated AI/ML-ready mHealth data are unique in the HCT domain and will help the research community validate novel hypotheses in HCT research.
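
As an illustration of the deduplication and standardization steps mentioned above, applied to a synthetic wearable heart-rate stream (column names are stand-ins, not the project's pipeline):

```python
# Illustrative cleaning/standardization step for a wearable heart-rate stream;
# data, columns, and sampling choices are synthetic stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts = pd.date_range("2021-01-01", periods=5000, freq="10s")
hr = pd.DataFrame({
    "subject_id": rng.integers(1, 4, len(ts)),
    "timestamp": ts,
    "heart_rate": rng.normal(75, 10, len(ts)).round(),
})
hr = pd.concat([hr, hr.sample(500, random_state=0)])        # inject duplicate records

# Deduplicate, sort, and standardize to one observation per subject per minute.
hr = (hr.drop_duplicates(subset=["subject_id", "timestamp"])
        .sort_values(["subject_id", "timestamp"]))
hr_minute = (hr.set_index("timestamp")
               .groupby("subject_id")["heart_rate"]
               .resample("1min").mean()
               .reset_index())
print(hr_minute.head())
```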

Invited Presentation: Enhancing the AI-readiness of gnomAD with GA4GH Genomic Knowledge Standards
Confirmed Presenter: Alex Wagner, Nationwide Children's Hospital, USA

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Larry Babb, Broad Institute of MIT and Harvard, USA
  • Wesley Goar, Nationwide Children's Hospital, USA
  • Kyle Ferriter, Broad Institute of MIT and Harvard, USA
  • Daniel Marten, Broad Institute of MIT and Harvard, USA
  • Kori Kuzma, Nationwide Children's Hospital, USA
  • Phil Darnowsky, Broad Institute of MIT and Harvard, USA
  • Matthew Solomonson, Broad Institute of MIT and Harvard, USA
  • Kristen Laricchia, Broad Institute of MIT and Harvard, USA
  • Katherine Chao, Broad Institute of MIT and Harvard, USA
  • Heidi Rehm, Broad Institute of MIT and Harvard, USA
  • Alex Wagner, Nationwide Children's Hospital, USA

Presentation Overview:

The clinical interpretation of genomes is a labor-intensive process that remains a barrier to scalable genomic medicine. Efforts to relieve this “interpretation bottleneck” have resulted in the development of clinical classification guidelines and databases for genomic variants in Mendelian diseases and cancers. AI-augmented genome interpretation systems are one solution for scaling the interpretation process, and they rely upon clinical interpretation frameworks defined by expert communities (e.g., ClinGen). Development of such interpretation systems will benefit from aggregation and collation of evidence in a computationally described, AI-ready state.
The NIH-supported Genome Aggregation Database (gnomAD) is currently the largest and most widely used public resource for population allele frequency data. These data are commonly used by variant interpretation frameworks as strong evidence against variant causality, making gnomAD a highly impactful resource for filtering out variants that are unlikely to be causative for Mendelian diseases or cancer development. The importance and scale of the gnomAD population allele frequency data to clinical interpretation systems make this resource an ideal candidate for AI-ready data.
Under the auspices of the Global Alliance for Genomics and Health (GA4GH), we designed and applied standard data models for cohort allele frequency evidence in collaboration with the broader genomic knowledge community. We normalized the ~1.89 billion alleles of the gnomAD resource following the conventions of the GA4GH Variation Representation Specification (VRS), providing globally unique computed identifiers that are accessible on the gnomAD Hail platform and the associated gnomAD Hail utilities. We also designed a GA4GH draft standard for cohort allele frequency data and built a gnomAD API to apply these standardized data in genomic interpretation support systems. We conclude with an overview of this effort in the context of interoperability with other genomic evidence repositories using GA4GH genomic knowledge standards.
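
The globally unique computed identifiers mentioned above follow the GA4GH scheme of a truncated SHA-512 digest over a canonical serialization. The sketch below shows only that digest step, over a placeholder serialization; production use should rely on the GA4GH VRS reference implementation rather than hand-rolled code.

```python
# Digest step of a GA4GH-style computed identifier (truncated SHA-512,
# base64url). The serialized string here is a placeholder, not a canonical
# VRS serialization; use the GA4GH VRS reference library in practice.
import base64
import hashlib

def sha512t24u(blob: bytes) -> str:
    """Base64url-encoded first 24 bytes of the SHA-512 digest."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

serialized_allele = b'{"location":"...","state":{"sequence":"T"}}'  # placeholder only
print(f"ga4gh:VA.{sha512t24u(serialized_allele)}")
```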

17:40-18:00
Session: Preparing for the Future: AI Data Readiness and Smart Health Solutions
Invited Presentation: Enhancing Imputation for Clinical Research: The Path for a Flexible Toolkit
Confirmed Presenter: Vida Abedi, Penn State University, USA

Room: 520a
Format: In Person

Moderator(s): Yanli Wang


Authors List:

  • Vida Abedi, Penn State University, USA
  • Alireza Vafaei Sadr, Penn State University, USA
  • Vernon Chinchilli, Penn State University, USA

Presentation Overview:

Background: Missing data in clinical research restricts robust analysis and AI/ML model training. This project addresses this challenge by presenting a Python package for efficient and intelligent missing value imputation, designed specifically for clinical research data. Method: The algorithm provides a

Closing Remarks
Room: 520a
Format: In person

Moderator(s): Yanli Wang


Authors List:

  • Yanli Wang