The NCPI effort started in the fall of 2019 with the goal of establishing and implementing guidelines and technical standards to empower a trans-NIH, federated FAIR data ecosystem. The Systems Interoperation Working Group is a key part of this effort and focuses on putting these approaches and standards into practice, allowing researchers to work across participating cloud platforms. In this talk we examine the intersection of researcher use cases for distributed cloud-based analysis and the data standards that enable this vision, then dive into an example that leverages the Systems Interoperation work as a whole. Along the way we will explore some key technical standards that have been critical to our progress and examine our next steps as we push the envelope of interoperability.
11:33-11:56
Clinical and Phenotypic Data Interoperability using FHIR in NCPI
The increasing ability to analyze and integrate data across research studies in cloud-based platforms creates a corresponding need for clinical data interoperability. One of the major gaps in making clinical and phenotype data available to the research community is the lack of universal standards for representing and transmitting these data. FHIR has emerged as the core interoperability standard for healthcare data and provides a practical framework for implementation and interchange that is extensible to research data. The use of FHIR gives research platforms an implementation framework for interoperation. It also bridges data generated in clinical systems and in research systems, which have traditionally been siloed, a divide that reduces the capacity for translational impact. We will present work on the NCPI FHIR Implementation Guide, including practical examples of representing research phenotypes across several NIH programs. We will further discuss how FHIR can be a substrate for an ecosystem of tools and downstream analytics.
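As a rough illustration of the kind of representation the talk describes, a research phenotype can be expressed as a FHIR Observation resource. This is a minimal sketch, not an example drawn from the NCPI FHIR Implementation Guide itself; the choice of coding systems, codes, and the participant reference are illustrative assumptions.

```python
import json

# Hedged sketch: a phenotype ("Seizure", HPO term HP:0001250) recorded as a
# FHIR Observation. All identifiers and codings here are illustrative.
phenotype_observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            # Human Phenotype Ontology terms are commonly used for phenotypes
            "system": "http://purl.obolibrary.org/obo/hp.owl",
            "code": "HP:0001250",
            "display": "Seizure",
        }]
    },
    # Hypothetical study-participant reference
    "subject": {"reference": "Patient/example-participant"},
}

# FHIR resources serialize directly to JSON for interchange between platforms.
print(json.dumps(phenotype_observation, indent=2))
```

Because the resource is plain JSON, any FHIR-aware research platform can exchange and validate it against the implementation guide's profiles.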
11:56-12:20
Modeling the computing requirements and costs for genomics analysis in the cloud
Cost is one of the largest barriers to migrating biomedical analyses into the cloud. Researchers currently have limited information about the expected costs of running analysis tools, which complicates budgeting and prevents many researchers from adopting cloud solutions. In addition, software developers may not focus on optimizing costs for cloud environments, which increases expense even when relatively simple optimizations are available. Addressing these needs, we are profiling and analyzing the cloud costs of many of the most widely used tools and workflows in genomics. To identify these workflows, we have mined the historic usage data on the global usegalaxy.* Galaxy servers, as Galaxy is one of the most popular community resources available. We are now measuring their computing requirements and costs when running with inputs of varying sizes and complexities. From these data, we aim to develop a predictive model and API that can estimate the costs of running these analyses on each of the NCPI cloud platforms. Our goal is to inform investigators of the anticipated costs for their research and to reduce costs by informing software developers of the tools that most urgently need optimization.
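To make the idea of a predictive cost model concrete, the sketch below shows a toy linear estimate of cloud cost from measured resource requirements and per-platform unit prices. This is not the project's actual model or API; the function, field names, and prices are all illustrative assumptions.

```python
# Toy cost model: total cost is resource usage times per-unit platform price.
def estimate_cost(cpu_hours, gb_ram_hours, gb_storage, prices):
    """Estimate the cloud cost of one workflow run (illustrative only)."""
    return (cpu_hours * prices["cpu_hour"]
            + gb_ram_hours * prices["ram_gb_hour"]
            + gb_storage * prices["storage_gb_month"])

# Hypothetical unit prices for one cloud platform (USD)
example_prices = {
    "cpu_hour": 0.048,
    "ram_gb_hour": 0.005,
    "storage_gb_month": 0.02,
}

# A workflow profiled at 100 CPU-hours, 400 GB-hours of RAM, 50 GB of storage
print(round(estimate_cost(100, 400, 50, example_prices), 2))
```

In practice such a model would be fit to the profiling data gathered from the usegalaxy.* servers and parameterized per NCPI platform, but the interface (measured requirements in, estimated dollars out) would look broadly similar.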
An overview of NIH activities to support a FAIR data repository ecosystem, followed by moderation of the panel of speakers scheduled in this session.
12:45-13:00
FAIR Research - What is in it for the Researchers?
In this short presentation, Mark Hahnel, founder and CEO of the data publishing platform Figshare will talk about the unexpected benefits in complying with policies to make datasets and code openly available - from the perspective of the researcher.
Mark will talk about how certain researchers have taken advantage of disseminating all of their research outputs to receive credit for the full breadth of their work. In the competitive academic landscape, not every output can have huge impact, but there are simple practices that can help research move faster while complying with funder mandates and ensuring maximum dissemination of your own research.
Sage Bionetworks developed Synapse in 2012 as a general-purpose data repository for cloud-based distribution of data under FAIR principles. Although the system is freely available for any use, data sharing and secondary use are most active where a research community, or their funder, has created a domain-specific space. Here we will discuss the technical, social, and cultural issues that make domain-specific data repositories so successful in catalyzing secondary research, and consider how we might leverage these to promote the use of general infrastructure.
This talk will focus on how both generalist and domain-specific repositories can best support FAIR data principles. Focus will be given to the role of community groups such as FORCE11 in the adoption of best practices by data repositories and data publishers.
13:30-13:45
Bridging from Researcher Data Management to ELIXIR Archives in the RDM Lifecycle
ELIXIR is the pan-national European Research Infrastructure for Life Science data, whose 23 national nodes and the EBI coordinate the development and long-term sustainability of domain public databases. FAIR services, policies, and curation approaches aim to build a FAIR connected data ecosystem of trusted domain repositories, from ENA, HPA, and EGA to specialised resources like CorkOakDB and PIPPA for plant phenotypes. But this is only one part of the data landscape and often the end of data’s journey. The nodes support research projects to operate “FAIR data first”, working with institutional and national platforms that are often generic or designed for project-based data management. We need to bridge between project-based and community-based data management, and support researchers across their whole RDM lifecycle, navigating the complexity of this ecosystem. The ELIXIR-CONVERGE project and its flagship RDMkit toolkit (https://rdmkit.elixir-europe.org) aim to do just that.
13:45-14:00
An Introduction to ICPSR: A Place to Discover and Access Social and Behavioral Science Data for Secondary Analysis
Over the past 60 years, the Inter-university Consortium for Political and Social Research (ICPSR) has successfully coordinated the research needs for over 800 universities and research institutes. This work has led to the expanded use of secondary data, the development of innovative lines of research, and the training of multiple generations of scholars across the fields of social and behavioral sciences. This talk will highlight some of the resources and support that we provide to data users in discovering, accessing, and analyzing existing data as well as how we support data providers to share their data responsibly and demonstrate impact of their data. Finally, this talk will address new directions for ICPSR given the evolving data landscape and analytic needs of data users.
14:20-15:20
Session III (Panel): Diversity in Data Science Training and Research
Format: Live-stream
Moderator(s): Karol Watson & Susan Gregurick
Panel
Format: Live-stream
Moderator(s): Karol Watson & Susan Gregurick
Marcela Nava, University of Texas at Arlington
Joshua Pritchett, Google Cloud
Jennifer Wagner, Geisinger Center for Translational Bioethics and Health Care Policy
Omolola Ogunyemi, Charles R. Drew University of Medicine and Science
In this opening introduction, Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health, will share the NIH’s vision for a modernized, integrated FAIR biomedical data ecosystem and the strategic roadmap to achieve this vision. Dr. Gregurick will highlight projects being implemented by team members across the NIH’s 27 institutes and centers and will describe ways that industry, academia, and other communities can help NIH enable a FAIR data ecosystem.
11:00-12:20
Session IV: Open Research Software
Format: Live-stream
Moderator(s): Heidi Sofia
11:05-11:20
Manual Brain Segmentation Workflows Using the SPINE Virtual Laboratory: from Desktop to Cloud
Format: Live-stream
Moderator(s): Heidi Sofia
Alfredo Morales Pinzon, Brigham and Women's Hospital & Harvard Medical School, USA
Manual segmentation of brain structures on magnetic resonance images (MRI) requires brain anatomical knowledge, well-defined procedures, and tailored neuroimaging segmentation tools in order to enable high reproducibility of results, especially in projects with large data sets requiring segmentations from multiple annotators. Desktop-based solutions work well for single-rater segmentations but are not tailored to collaborative work. For instance, images and segmentations have to be sent back and forth among the various actors of a project, and changes in segmentation guidelines are difficult to share and implement. In this project we aim to translate the core functionalities of a neuroimaging module developed in a desktop solution into a web-based editor, and to codify manual segmentation processes into computerized workflows. A JavaScript Object Notation (JSON) schema is proposed to describe segmentation tools (e.g., configuration of viewers and segmentation widgets) and workflows, allowing the research community to easily edit and share workflows. The web-based editor, as well as the computerized workflows, are integrated in the SPINE Virtual Laboratory, which allows for centralized access control and execution of workflows while maintaining the provenance of the data and results. The SPINE Virtual Laboratory is being tested in the cloud, increasing the scalability of the solution by enabling cloud workflow execution and connection to cloud-based open science data repositories.
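To illustrate the kind of JSON workflow description the abstract proposes, the sketch below encodes a hypothetical manual-segmentation workflow as a Python dictionary and serializes it to JSON. The field names (viewers, widgets, steps) and the workflow content are illustrative assumptions, not the actual SPINE schema.

```python
import json

# Hedged sketch of a JSON-describable segmentation workflow; all field
# names and values are hypothetical, for illustration only.
workflow = {
    "name": "hippocampus-manual-segmentation",
    "tool": {
        # Viewer configuration an editor could render from
        "viewers": [{"orientation": "coronal", "layout": "single"}],
        # Segmentation widgets and the labels annotators may assign
        "widgets": [{"type": "brush",
                     "labels": ["left-hippocampus", "right-hippocampus"]}],
    },
    # Ordered steps codifying the segmentation guideline for every annotator
    "steps": [
        {"id": 1, "instruction": "Outline the left hippocampus on each slice."},
        {"id": 2, "instruction": "Outline the right hippocampus on each slice."},
        {"id": 3, "instruction": "Submit the segmentation for review."},
    ],
}

# Serializing to JSON makes the workflow easy to edit, version, and share.
print(json.dumps(workflow, indent=2))
```

Because the description is plain JSON, a guideline change becomes a small, reviewable edit that every annotator receives, rather than instructions passed around by hand.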
11:20-11:35
UR_EAR - A Web App supporting computational models for auditory-nerve and midbrain responses
As technology evolves, EHRs have increasingly adopted a platform role, enabling third-party applications to perform limited functions within the system of record. Interoperable, standards-based information systems that facilitate the secure but agile exchange of data are critical to contemporary healthcare delivery. Cloud-based services are widely used in the non-healthcare domain. EHR-integrated digital and mobile health applications must include expanded use of the SMART on FHIR standard and cloud-based architecture to allow for scalability and ease of installation across healthcare systems and care settings. This presentation will share the scalability and cloud-readiness software development journey (an NIH ODSS supplement grant) of an NCI-funded cancer-prevention digital health navigation application (mPATH). The Administrative Supplement activities port and translate the highly efficacious mPATH mobile health platform to a robust and scalable software infrastructure, improving interoperability (SMART on FHIR) and migrating to a cloud computing model, which will greatly increase the impact of the platform and contribute to the literature on open-standards-based software development.
11:50-12:05
Human AI Loop in Cloud for Distributed Computation
Modern medicine is in the process of entering the big data revolution, generating large-volume, multi-modal, multi-scale, and multi-omics data. This progress has opened up new opportunities for computational scientists to discover previously undiscoverable statistical biomarkers in the data. The ideal end users of the developed computational tools will use them to ask important biological questions, but often do not have a background in computational science. This has further driven computational data science to ensure that the developed computational tools are accessible to any end user. In this talk, we will shed light on this direction and discuss an integrative tool, HAIL (Human AI Loop) in Cloud, for open data science. This tool is developed in conjunction with HistomicsUI, an open-source whole-slide image viewer and digital data archival system. We have integrated our HAIL tool as an end-user plugin for conducting detection, segmentation, and quantification of structures from very large tissue images, as well as fusion of multi-modal data, particularly fusion of spatial molecular and image data. HAIL allows an end user to actively interact with the system to tune model training using their biological prior knowledge. We will also show results on how the tool can be used to conduct multi-site data analysis, eliminating the need to share data containing protected health information outside the host institution. We will conclude by discussing the need to generate reference datasets, as well as the importance of integrating various types of data ontology with the system, for reproducible assessment of independent datasets.
12:05-12:20
Software and Science at the Open Force Field Initiative
The Open Force Field Initiative aims to publish sustainable software and reproducible research. To this end, we have experimented with several approaches to software development and data management, many of which have become standard practice in the Initiative. For example, to make our research reproducible, we tie data artifacts to GitHub releases that contain complete copies of source code, input data files, and exact conda environments. To make the software sustainable, our packages are templated by the MolSSI cookiecutter, which allows us to standardize infrastructure with other packages in our field and lowers the barrier for contributors. While some of these practices increase upfront project costs, they reduce the personnel time required to onboard new contributors and researchers, leading to a model that facilitates collaboration. This talk will review many of the specific approaches to open source science that Open Force Field has tried, and the degrees to which they have been successful.
12:40-14:00
Session V: Reproducibility & Re-use
Format: Live-stream
Moderator(s): Alex Bui
12:40-12:44
Introduction and Welcome
Format: Live-stream
Moderator(s): Alex Bui
Alex Bui
12:44-13:03
A Framework for Making Predictive Models Useful in Medicine
In this session we will explore strategies for, and issues involved in, bringing Artificial Intelligence (AI) technologies to the clinic, safely and ethically. We will discuss the different use cases for AI in personalizing diagnosis, prognosis and treatment recommendation. The session will review the use of routinely collected data on millions of individuals to provide on-demand evidence in those situations where good evidence is lacking and introduce a framework for analyzing the utility of model-guided workflows in healthcare.
13:03-13:22
Veridical Data Science for precision medicine: subgroup discovery through staDISC
Format: Live-stream
Moderator(s): Alex Bui
Bin Yu
13:22-13:41
PREMIERE: A community-driven platform for reproducibility and reuse in biomedical predictive modeling
Format: Live-stream
Moderator(s): Alex Bui
Anders O Garlid
13:41-14:00
Omics Indexing and Standards
Moderator(s): Alex Bui
Yasset Perez-Riverol
14:20-15:20
Session VI (Panel): Bridging International Efforts in Data Science
Moderator(s): Rolf Apweiler
Panel
Moderator(s): Rolf Apweiler
Michele Ramsay
Claudia Medeiros
Chuck Cook
Griffin M. Weber, Harvard Medical School, United States