Much information in global health is organized in siloed repositories, and global health datasets are relatively small compared to genomics or proteomics datasets. The data problem in global health can therefore be considered a small-data problem on a big scale. In 2013 we released the first version of Project Tycho to disseminate disease surveillance data reported by health agencies in the United States between 1888 and 2014. Over the past 3.5 years, more than 3,500 users have registered to use Project Tycho, and over 40 creative works, including 20 peer-reviewed papers, have been published using Project Tycho data. We have now released Project Tycho 2.0, which aims to represent global health information in a way that better complies with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. We re-represented all of our US data, as well as information about dengue fever for 99 countries, in a standard data format, using standard ontologies and vocabularies where possible. We also created rich metadata in DataCite XML and Data Tag Suite (DATS) JSON formats. With Project Tycho 2.0, we aim to improve the integration and machine-interpretability of global health data so that new discoveries can truly be made across all scales in biology, from the molecule to the global population.
10:40 AM-11:00 AM
Integrating Heterogeneous Predictive Models using Reinforcement Learning
Room: Columbus AB
Ana Stanescu, University of West Georgia, United States
The application of systems biology and machine learning approaches to large volumes and varieties of biomedical data often yields predictive models that can potentially transform data into knowledge. However, it is not always obvious which techniques and/or datasets are most appropriate for specific problems, calling for alternatives such as building heterogeneous ensembles capable of incorporating the inherent variety and complementarity of the many possible models. Yet the problem of systematically constructing these ensembles from a large number and variety of base models/predictors is computationally and mathematically challenging. We developed novel algorithms for this problem that operate within a Reinforcement Learning (RL) framework to search the large space of all possible ensembles that can be generated from an initial set of base predictors. RL offers a more systematic alternative to the conventional ad hoc methods of choosing base predictors for the final ensemble, and has the potential to derive optimal solutions to the problem. For the sample problem of splice site identification, our algorithms yielded effective ensembles that perform competitively with ones consisting of all the base predictors. Furthermore, the ensembles utilized a substantially smaller subset of the base predictors, potentially aiding the ensembles’ reverse engineering and eventual interpretation.
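The idea of an RL-style search over ensembles can be illustrated with a toy sketch. This is not the authors’ algorithm: the base predictors are synthetic, the reward is simply the validation accuracy of a majority-vote ensemble, and the policy is a plain epsilon-greedy choice over which predictor to add next.

```python
# Toy sketch (NOT the authors' implementation): an epsilon-greedy
# reinforcement-learning-style search over ensembles of base predictors.
# State = the set of predictors chosen so far; action = add one predictor;
# reward = validation accuracy of the majority-vote ensemble.
import random

random.seed(0)

N_VAL = 200
y_true = [random.randint(0, 1) for _ in range(N_VAL)]

def make_predictor(accuracy):
    """Synthetic base predictor: correct with the given probability."""
    return [y if random.random() < accuracy else 1 - y for y in y_true]

# An initial pool of base predictors of varying quality (all invented).
pool = [make_predictor(a) for a in (0.55, 0.60, 0.65, 0.70, 0.72, 0.58)]

def reward(subset):
    """Validation accuracy of the majority-vote ensemble over `subset`."""
    if not subset:
        return 0.0
    correct = 0
    for i, y in enumerate(y_true):
        votes = sum(pool[j][i] for j in subset)
        pred = 1 if votes * 2 >= len(subset) else 0
        correct += pred == y
    return correct / N_VAL

def epsilon_greedy_search(n_episodes=50, epsilon=0.2):
    """Grow an ensemble one predictor at a time, choosing epsilon-greedily
    over the reward of adding each remaining predictor; keep the best
    subset seen across all episodes."""
    best_subset, best_r = set(), 0.0
    for _ in range(n_episodes):
        subset = set()
        while len(subset) < len(pool):
            candidates = [j for j in range(len(pool)) if j not in subset]
            if random.random() < epsilon:
                action = random.choice(candidates)      # explore
            else:                                       # exploit
                action = max(candidates, key=lambda j: reward(subset | {j}))
            subset.add(action)
            r = reward(subset)
            if r > best_r:
                best_subset, best_r = set(subset), r
    return best_subset, best_r

subset, acc = epsilon_greedy_search()
print(sorted(subset), round(acc, 3))
```

The point of the sketch is the abstract’s claim in miniature: the best ensemble found is often a strict subset of the pool, not the full pool.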
11:00 AM-11:20 AM
Quantification of Private Information Leakage and Privacy-Preserving File Formats for Functional Genomics Data
Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but represent universal statements about disease and developmental stages. On the other hand, by virtue of the experimental procedures, the reads they produce are tagged with small bits of patients' variant information, which poses privacy challenges for data sharing. By examining the “data exhaust” from transcriptome analysis, one can infer sensitive, identity-revealing information. However, there is great desire to share the data as broadly as possible. Therefore, there is a need to quantify the amount of sensitive information leaked at every step of the data exhaust. Here we developed information-theory-based measures to quantify private information leakage at various stages of functional genomics data processing. We found that noisy variant calls, while not yielding useful genotypes, can serve as strong quasi-identifiers for re-identification through linking attacks. We then focused on how quantifications of expression levels can potentially reveal sensitive information about the subject studied, and how one can take steps to protect patient anonymity.
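To make the notion of an information-theoretic leakage measure concrete, here is a hedged toy model (our own illustration, not the authors’ method): the mutual information, in bits, between a true genotype and a noisy variant call, under assumed Hardy–Weinberg genotype frequencies and a symmetric call-error rate.

```python
# Toy leakage measure (illustrative only): I(G; C) between true genotype G
# in {0, 1, 2} and a noisy call C that is correct with probability
# 1 - error and otherwise uniform over the other two genotypes.
import math
from itertools import product

def mutual_information(p_geno, error):
    """Mutual information in bits between genotype and noisy call."""
    def p_call_given_geno(c, g):
        return 1 - error if c == g else error / 2
    p_call = {c: sum(p_geno[g] * p_call_given_geno(c, g) for g in p_geno)
              for c in p_geno}
    mi = 0.0
    for g, c in product(p_geno, repeat=2):
        p_joint = p_geno[g] * p_call_given_geno(c, g)
        if p_joint > 0:
            mi += p_joint * math.log2(p_joint / (p_geno[g] * p_call[c]))
    return mi

# Assumed Hardy-Weinberg genotype frequencies for a minor-allele
# frequency of 0.3 (an illustrative value, not from the talk).
maf = 0.3
p_geno = {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}

for err in (0.0, 0.1, 0.3):
    print(f"error={err:.1f}  leakage={mutual_information(p_geno, err):.3f} bits")
```

With zero call error the leakage equals the genotype entropy; noisier calls leak fewer bits per variant, yet even quite noisy calls leak a nonzero amount, which is why many such calls can still act as quasi-identifiers in a linking attack.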
11:40 AM-12:00 PM
Principal Component Region Set Analysis: Facilitating Interpretation of PCA Dimensions for DNA Methylation Data
Room: Columbus AB
John Lawson, University of Virginia, United States
Principal component analysis (PCA) is a widely used technique for dimensionality reduction and visualization in genomics, where the number of dimensions can be thousands or even hundreds of thousands. However, since each principal component (PC) is a linear combination of original dimensions, the meaning of the new dimensions can be hard to interpret. For PCA of DNA methylation data, the cytosines which are the original dimensions may not have a clear biological annotation, further hindering interpretation. Currently, there is a lack of methods for interpreting PCs of DNA methylation data. We present a method which annotates PCs using sets of genomic regions corresponding to a given biological annotation, such as transcription factor binding or histone modifications. We tested the method on DNA methylation data from breast cancer, confirming known associations, and data from the rare childhood cancer Ewing sarcoma, discovering novel associations. Our method is computationally efficient, scales well with increasing number of samples, and will fit well into existing analysis workflows. This method will be broadly useful to help researchers understand variation in DNA methylation among samples.
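A minimal sketch of the idea behind annotating a PC with a region set follows. The data and the scoring statistic are both invented for illustration (the published method’s exact statistic may differ): cytosine loadings for one PC are compared between positions inside and outside a hypothetical region set, such as binding sites for one transcription factor.

```python
# Illustrative sketch (invented data and scoring, not the published method):
# annotate a principal component by comparing the PC loadings of cytosines
# inside a genomic region set against those outside it.
import random

random.seed(1)

# Hypothetical region set on one chromosome (e.g. TF binding sites).
regions = [(1_000, 2_000), (5_000, 6_000)]

def in_regions(pos, regions):
    return any(start <= pos < end for start, end in regions)

# Invented cytosines: (position, PC loading), with loadings inflated
# inside the region set so the association is visible.
cytosines = [(pos, (0.8 if in_regions(pos, regions) else 0.1) * random.random())
             for pos in range(0, 10_000, 50)]

def region_set_score(cytosines, regions):
    """Mean |loading| inside the region set minus mean |loading| outside.
    A large positive score suggests the PC captures methylation variation
    concentrated at these regions."""
    inside = [abs(l) for pos, l in cytosines if in_regions(pos, regions)]
    outside = [abs(l) for pos, l in cytosines if not in_regions(pos, regions)]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

score = region_set_score(cytosines, regions)
print(round(score, 3))
```

Repeating this score across many annotated region sets (different transcription factors, histone marks) yields a ranked, biologically interpretable description of each PC.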
12:00 PM-12:20 PM
MetaSRA: Normalized Human Sample-Specific Metadata for the Sequence Read Archive
Room: Columbus AB
Matthew Bernstein, University of Wisconsin-Madison, United States
Motivation: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized description. As a result, the metadata include many synonyms, spelling variants, and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues, and cell types present in the SRA.
Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata. Our normalized metadata schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads.
Statement of significance: MetaSRA provides normalized sample-specific metadata for the SRA, enabling more effective queries of SRA metadata and large-scale meta-analyses.
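The three normalization tasks named in the Results — ontology-term mapping, sample typing, and real-valued property extraction — can be illustrated with a deliberately simplified sketch. This is not the MetaSRA pipeline: the synonym dictionary is a tiny hypothetical fragment, and the property extractor handles only an age-in-years pattern.

```python
# Simplified illustration (NOT the MetaSRA pipeline) of metadata
# normalization: map free-text sample attributes to ontology terms via an
# assumed synonym dictionary, and extract a real-valued property (age)
# with a regular expression.
import re

# Hypothetical fragment of an ontology synonym dictionary.
SYNONYMS = {
    "hepg2": "CVCL_0027",        # Cellosaurus: Hep-G2 cell line
    "hep g2": "CVCL_0027",
    "liver": "UBERON:0002107",
    "hepatic": "UBERON:0002107",
}

# Matches e.g. "45 years", "45 yrs", "45 y".
AGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:years?|yrs?|y)\b", re.IGNORECASE)

def normalize_sample(attributes):
    """Map raw key-value sample metadata to ontology terms and
    real-valued properties."""
    terms, properties = set(), {}
    for value in attributes.values():
        text = value.lower()
        for synonym, term in SYNONYMS.items():
            if synonym in text:
                terms.add(term)
        m = AGE_RE.search(text)
        if m:
            properties["age_years"] = float(m.group(1))
    return terms, properties

# Invented raw SRA-style attributes with a spelling variant ("hepatic").
raw = {"cell line": "HepG2", "tissue": "hepatic", "donor": "45 years old male"}
terms, props = normalize_sample(raw)
print(sorted(terms), props)
```

The real pipeline replaces the dictionary lookup with automated mapping against full biomedical ontologies and adds a learned sample-type classifier, but the input/output shape — free text in, ontology terms and numeric properties out — is the same.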
12:20 PM-12:40 PM
Integrating Knowledge-Guided Analysis into Novel Genomic Data Ecosystems using FAIR Principles
Room: Columbus AB
Charles Blatti, University of Illinois at Urbana-Champaign, United States
Genomic data analysis ecosystems that incorporate the FAIR principles (Findable, Accessible, Interoperable, and Reusable) have the potential to accelerate biomedical research and discovery. These guidelines have informed the separate designs of the NCI-funded Seven Bridges Cancer Genomic Cloud (CGC) platform for accessing and investigating large genomic datasets and the NIH-BD2K Knowledge Engine for Genomics (KnowEnG) platform for using advanced knowledge-guided analysis tools that leverage public domain information on gene annotations and interactions. In this work, we demonstrate our FAIR-based approach to facilitate sophisticated analyses that leverage the strengths of these two platforms.
We published the KnowEnG analysis tools as CGC Public Apps utilizing two recent technologies, software containerization via Docker and tool specification through the Common Workflow Language. Using our apps, we recreated within the CGC a recently published analysis mapping oesophageal carcinoma samples from The Cancer Genome Atlas to molecular subtypes and identified genes differentiating these subtypes. We also employed KnowEnG’s Gene Set Characterization pipeline that integrates prior knowledge on gene interactions to identify novel pathways and biological processes involved in each subtype. Our efforts showcase a FAIR ecosystem for generating reproducible and sharable analytical workflows in cloud-based environments that facilitate rapid experimentation with large datasets and advanced analytics.
12:40 PM-2:00 PM
Lunch Break
Building the FAIR Data Ecosystem for Discovery to Health
2:00 PM-2:15 PM
Practical Strategies towards Making Biomedical Research Data more FAIR
Room: Columbus AB
Avi Ma'Ayan, Icahn School of Medicine at Mount Sinai, United States
2:15 PM-2:30 PM
NIH Data Sharing Policies
Room: Columbus AB
Dina Paltoo, NIH Office of Science Policy
2:30 PM-3:00 PM
Privacy-Preserving Techniques for Analyzing and Sharing Biomedical Data
Room: Columbus AB
Haixu Tang, Indiana University Bloomington, United States
3:00 PM-3:15 PM
SmartAPI: Building a FAIR API Ecosystem for Biomedical Knowledge
Room: Columbus AB
Chunlei Wu, The Scripps Research Institute (TSRI), United States
3:15 PM-3:30 PM
Implementing FAIR Principles on Protected Health Information
Room: Columbus AB
Tim Clark, University of Virginia, United States
3:30 PM-4:00 PM
Towards a Data Discovery Index: Lessons Learned by the bioCADDIE Consortium
Room: Columbus AB
Lucila Ohno-Machado, University of California, San Diego, United States
4:00 PM-4:40 PM
Coffee Break
Machine Learning Approaches to Enable Biomedical Discoveries
4:40 PM-4:55 PM
Automated Cohort Retrieval from EEG Medical Records
Room: Columbus AB
Joseph Picone, Temple University, United States
4:55 PM-5:10 PM
Big Data to Knowledge: Integrative Literature Mining and Knowledge Networks for Drug Analytics in Precision Medicine
Room: Columbus AB
Cathy Wu, University of Delaware, United States
5:10 PM-5:25 PM
Learning from Text: Translating Clinical Case Reports into Structured Knowledge
Room: Columbus AB
Wei Wang, HeartBD2K, UCLA, United States
5:25 PM-5:40 PM
The Role of Prior Knowledge in Machine Learning and Biomedical Data Science
Room: Columbus AB
Larry Hunter, CU-Denver Anschutz Medical Campus, United States
5:40 PM-6:00 PM
Causal Network Discovery from Biomedical Data
Room: Columbus AB
Greg Cooper, University of Pittsburgh, United States
Biomedical Data Science In Action
Sunday, July 8th
10:15 AM-10:45 AM
NIH Introduction
Room: Columbus AB
10:45 AM-11:05 AM
PanCancer Analysis of Whole Genomes using Multi-Cloud Strategy
Room: Columbus AB
Christina Yung, University of Chicago, United States
11:05 AM-11:25 AM
Wikidata for Biomedical Knowledge Integration and Curation
Room: Columbus AB
Greg Stupp, The Scripps Research Institute (TSRI), United States
11:25 AM-11:55 AM
Toward the FAIRness of Data Science Training Resources
Room: Columbus AB
Jack Van Horn, University of Southern California, United States
11:55 AM-12:15 PM
The Future is Now! Engaging Biomedical Data Scientists in the 21st Century
Room: Columbus AB
Ben Busby, NCBI, United States
12:15 PM-12:40 PM
Panel
Room: Columbus AB
12:40 PM-2:00 PM
Lunch Break
BD2K Power Tools: Moving to the Cloud with Industrial Strength Data
2:00 PM-2:20 PM
Scaling Analysis on the Cloud
Room: Columbus AB
Brian D. O’Connor, University of California, United States
2:20 PM-2:40 PM
KnowEnG: A Cloud-based Framework for Genomics Data Analysis
Room: Columbus AB
Saurabh Sinha, University of Illinois at Urbana-Champaign, United States
2:40 PM-3:00 PM
The Three Faces of Genomic Data Compression
Room: Columbus AB
Jianhao Peng, University of Illinois at Urbana-Champaign, United States
3:00 PM-3:20 PM
Cloud Computing Alone Will Not Make Experimental Data FAIR. We Need Better Metadata First.
Room: Columbus AB
Mark Musen, Stanford University, United States
3:20 PM-4:00 PM
Panel
Room: Columbus AB
4:00 PM-4:40 PM
Coffee Break
BD2K Data Visualization Tools & Future Directions
4:40 PM-6:00 PM
BD2K Data Visualization Tools & Future Directions
Room: Columbus AB
Nils Gehlenborg, Harvard Medical School, United States
Griffin M. Weber, Harvard Medical School, United States