
BD2K

COSI Track Presentations

Schedule subject to change
BD2K Young PI Session
Saturday, July 7th
10:15 AM-10:40 AM
Project Tycho 2.0: A New Repository for the Integration and Reuse of Global Health Data
Room: Columbus AB
  • Wilbert van Panhuis, University of Pittsburgh, United States

Presentation Overview:

Much information in global health is organized in siloed repositories, and global health datasets are relatively small compared to genomics or proteomics datasets. The data problem in global health could be considered a small data problem on a big scale. In 2013, we released the first version of Project Tycho to disseminate disease surveillance data reported by health agencies in the United States between 1888 and 2014. Over the past 3.5 years, more than 3,500 users have registered to use Project Tycho, and over 40 creative works, including 20 peer-reviewed papers, have been published that used Project Tycho data. We have now released Project Tycho 2.0, which aims to represent global health information in a way that is more compliant with the FAIR principles (Findable, Accessible, Interoperable, and Reusable). We re-represented all of our US data, plus information about dengue fever for 99 countries, in a standard data format, using standard ontologies and vocabularies where possible. We also created rich metadata in DataCite XML and Data Tag Suite (DATS) JSON formats. With Project Tycho 2.0, we aim to improve the integration and machine-interpretability of global health data so that new discoveries can truly be made across all scales in biology, from the molecule to the global population.
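
For readers unfamiliar with the Data Tag Suite, the sketch below shows roughly what a DATS-style JSON dataset record can look like, assembled in Python. The title, description, and landing page are hypothetical placeholders for illustration, not actual Project Tycho 2.0 metadata.

    import json

    # Minimal sketch of a DATS-style dataset record (field values are illustrative;
    # the actual Project Tycho 2.0 records may differ in structure and detail).
    dataset = {
        "title": "Counts of a notifiable disease reported in the United States",  # hypothetical
        "description": "Weekly case counts compiled from public health reports.",
        "types": [{"information": {"value": "disease surveillance data"}}],
        "distributions": [{
            "access": {"landingPage": "https://example.org/tycho"},  # placeholder URL
            "formats": ["CSV"],
        }],
    }

    print(json.dumps(dataset, indent=2))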

10:40 AM-11:00 AM
Integrating Heterogeneous Predictive Models using Reinforcement Learning
Room: Columbus AB
  • Ana Stanescu, University of West Georgia, United States

Presentation Overview:

The application of systems biology and machine learning approaches to the large amounts and variety of biomedical data often yields predictive models that can potentially transform data into knowledge. However, it is not always obvious which techniques and/or datasets are most appropriate for a specific problem, calling for alternatives such as building heterogeneous ensembles capable of incorporating the inherent variety and complementarity of the many possible models. Yet systematically constructing these ensembles from a large number and variety of base models/predictors is computationally and mathematically challenging. We developed novel algorithms for this problem that operate within a Reinforcement Learning (RL) framework to search the large space of all possible ensembles that can be generated from an initial set of base predictors. RL offers a more systematic alternative to conventional ad-hoc methods of choosing which base predictors to include in the final ensemble, and has the potential to derive optimal solutions to the problem. For the sample problem of splice site identification, our algorithms yielded effective ensembles that perform competitively with ensembles consisting of all the base predictors. Furthermore, our ensembles utilized a substantially smaller subset of the base predictors, potentially aiding the reverse engineering and eventual interpretation of the ensembles.
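
The abstract does not spell out the algorithmic details, so the following is only a toy sketch of the general idea of treating ensemble construction as a sequential decision problem: an epsilon-greedy search that adds base predictors one at a time and is rewarded by validation accuracy. The synthetic "predictors", reward definition, and value-update rule are illustrative simplifications, not the authors' RL algorithms.

    import random
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for base predictors: each is a vector of scores on a shared
    # validation set (not the splice-site models from the talk).
    y_val = rng.integers(0, 2, size=200)
    base_preds = [np.clip(y_val + rng.normal(0, s, size=200), 0, 1)
                  for s in (0.3, 0.5, 0.8, 1.0, 1.2)]

    def reward(subset):
        """Validation accuracy of the unweighted average of the chosen predictors."""
        if not subset:
            return 0.0
        avg = np.mean([base_preds[i] for i in subset], axis=0)
        return float(np.mean((avg > 0.5) == y_val))

    def epsilon_greedy_ensemble(episodes=200, epsilon=0.2):
        """Grow an ensemble one predictor at a time, epsilon-greedy over value estimates."""
        q = {}                                   # value estimates keyed by (state, action)
        best_subset, best_r = set(), 0.0
        for _ in range(episodes):
            subset = set()
            while len(subset) < len(base_preds):
                actions = [i for i in range(len(base_preds)) if i not in subset] + ["stop"]
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: q.get((frozenset(subset), a), 0.0))
                if action == "stop":
                    break
                state = frozenset(subset)
                subset.add(action)
                r = reward(subset)               # immediate reward: accuracy of the new ensemble
                q[(state, action)] = q.get((state, action), 0.0) + 0.5 * (r - q.get((state, action), 0.0))
                if r > best_r:
                    best_subset, best_r = set(subset), r
        return sorted(best_subset), round(best_r, 3)

    print(epsilon_greedy_ensemble())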

11:00 AM-11:20 AM
Quantification of Private Information Leakage and Privacy-Preserving File Formats for Functional Genomics Data
Room: Columbus AB
  • Gamze Gursoy, Yale University, United States

Presentation Overview:

Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but represent universal statements about disease and developmental stages. On the other hand, by virtue of the experimental procedures, the reads from these experiments are tagged with small bits of the patients' variant information, which presents privacy challenges for data sharing. By looking at the “data exhaust” from transcriptome analysis, one can infer sensitive and revealing information about the individual. However, there is a strong desire to share the data as broadly as possible. Therefore, there is a need to quantify the amount of sensitive information leaked at every step of the data exhaust. Here, we developed information theory-based measures to quantify private information leakage at various stages of functional genomics data processing. We found that noisy variant calls, while not useful as genotypes, can serve as strong quasi-identifiers for re-identification through linking attacks. We then focused on how quantifications of expression levels can potentially reveal sensitive information about the subject studied, and on how one can take steps to protect patient anonymity.
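
As a rough illustration of an information-theoretic leakage measure (not the authors' actual metrics), the sketch below estimates the mutual information, in bits, between true genotypes and noisy genotype calls on a synthetic cohort; such per-variant estimates are one simplified way to reason about quasi-identifier strength.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic example: true genotypes (0/1/2) at one variant across a cohort, and
    # noisy genotype calls derived from functional genomics reads (toy error model).
    true_geno = rng.choice([0, 1, 2], size=5000, p=[0.49, 0.42, 0.09])
    flip = rng.random(5000) < 0.2                              # 20% miscall rate (illustrative)
    noisy_call = np.where(flip, rng.choice([0, 1, 2], size=5000), true_geno)

    def entropy(counts):
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y, k=3):
        joint = np.zeros((k, k))
        for a, b in zip(x, y):
            joint[a, b] += 1
        return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

    # Bits of information the noisy call leaks about the true genotype at this site;
    # aggregating such per-variant leakage is one simplified way to bound linking power.
    print(f"leakage ~ {mutual_information(true_geno, noisy_call):.3f} bits")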

11:40 AM-12:00 PM
Principal Component Region Set Analysis: Facilitating Interpretation of PCA Dimensions for DNA Methylation Data
Room: Columbus AB
  • John Lawson, University of Virginia, United States

Presentation Overview:

Principal component analysis (PCA) is a widely used technique for dimensionality reduction and visualization in genomics, where the number of dimensions can be thousands or even hundreds of thousands. However, since each principal component (PC) is a linear combination of the original dimensions, the meaning of the new dimensions can be hard to interpret. For PCA of DNA methylation data, the cytosines that constitute the original dimensions may not have a clear biological annotation, further hindering interpretation. Currently, there is a lack of methods for interpreting PCs of DNA methylation data. We present a method that annotates PCs using sets of genomic regions corresponding to a given biological annotation, such as transcription factor binding or histone modifications. We tested the method on DNA methylation data from breast cancer, confirming known associations, and on data from the rare childhood cancer Ewing sarcoma, discovering novel associations. Our method is computationally efficient, scales well with an increasing number of samples, and will fit well into existing analysis workflows. This method will be broadly useful in helping researchers understand variation in DNA methylation among samples.
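
One simple way to score a PC against a region set, sketched below on toy data, is to compare the mean absolute loading of the cytosines inside the region set with the genome-wide mean; this is an illustrative stand-in, not necessarily the exact scoring used by the presented method.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy methylation matrix: samples x CpGs (values in [0, 1]); genomic coordinates are
    # omitted, so a "region set" is simply the indices of CpGs falling in its regions.
    meth = rng.random((40, 1000))
    meth[:20, :100] += 0.3                      # inject structure so PC1 is driven by CpGs 0-99
    meth = np.clip(meth, 0, 1)

    # PCA via SVD of the centered matrix; rows of vt are PC loadings over CpGs.
    centered = meth - meth.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    def region_set_score(loadings, region_idx):
        """Mean |loading| inside the region set relative to the genome-wide mean |loading|."""
        return np.abs(loadings[region_idx]).mean() / np.abs(loadings).mean()

    tfbs_like_set = np.arange(100)              # hypothetical region set (e.g., a TF's binding sites)
    random_set = rng.choice(1000, size=100, replace=False)

    print("PC1 score, structured set:", round(region_set_score(vt[0], tfbs_like_set), 2))
    print("PC1 score, random set:    ", round(region_set_score(vt[0], random_set), 2))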

12:00 PM-12:20 PM
MetaSRA: Normalized Human Sample-Specific Metadata for the Sequence Read Archive
Room: Columbus AB
  • Matthew Bernstein, University of Wisconsin-Madison, United States

Presentation Overview:

Motivation: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if its data could be analyzed in the aggregate; however, the data remain largely underutilized, in part due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized description. As a result, the metadata include many synonyms, spelling variants, and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues, and cell types present in the SRA.

Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata. Our normalized metadata schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads.

Statement of significance: MetaSRA provides normalized sample-specific metadata for the SRA, enabling more effective queries of SRA metadata and large-scale meta-analyses.
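
As a toy illustration of the kind of normalization described above (mapping free-text attributes to ontology terms and extracting real-valued properties), the sketch below uses a tiny hand-written synonym map; the actual MetaSRA pipeline is far more sophisticated, and the term IDs are shown only for illustration.

    import re

    # Tiny illustrative synonym map from free-text values to ontology terms
    # (real MetaSRA mappings are produced by a much richer computational pipeline).
    ONTOLOGY_SYNONYMS = {
        "liver": ("UBERON:0002107", "liver"),            # Uberon anatomy term
        "homo sapiens": ("NCBITaxon:9606", "Homo sapiens"),
        "female": ("PATO:0000383", "female"),
    }

    def normalize_sample(raw_attributes):
        """Map raw key-value metadata to ontology terms and real-valued properties."""
        terms, properties = [], {}
        for key, value in raw_attributes.items():
            v = value.strip().lower()
            if v in ONTOLOGY_SYNONYMS:
                terms.append(ONTOLOGY_SYNONYMS[v])
            m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(years?|yrs?)", v)
            if m:
                properties["age_years"] = float(m.group(1))
        return {"mapped_terms": terms, "real_valued_properties": properties}

    raw = {"organism": "Homo sapiens", "tissue": "Liver", "sex": "female", "age": "34 years"}
    print(normalize_sample(raw))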

12:20 PM-12:40 PM
Integrating Knowledge-Guided Analysis into Novel Genomic Data Ecosystems using FAIR Principles
Room: Columbus AB
  • Charles Blatti, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

Genomic data analysis ecosystems that incorporate the FAIR principles (Findable, Accessible, Interoperable, and Reusable) have the potential to accelerate biomedical research and discovery. These guidelines have separately informed the designs of the NCI-funded Seven Bridges Cancer Genomics Cloud (CGC), a platform for accessing and investigating large genomic datasets, and of the NIH-BD2K Knowledge Engine for Genomics (KnowEnG), a platform for advanced knowledge-guided analysis tools that leverage public-domain information on gene annotations and interactions. In this work, we demonstrate our FAIR-based approach to facilitating sophisticated analyses that leverage the strengths of these two platforms.

We published the KnowEnG analysis tools as CGC Public Apps using two recent technologies: software containerization via Docker and tool specification through the Common Workflow Language (CWL). Using our apps, we recreated within the CGC a recently published analysis mapping oesophageal carcinoma samples from The Cancer Genome Atlas to molecular subtypes and identified genes differentiating these subtypes. We also employed KnowEnG’s Gene Set Characterization pipeline, which integrates prior knowledge of gene interactions to identify novel pathways and biological processes involved in each subtype. Our efforts showcase a FAIR ecosystem for generating reproducible and shareable analytical workflows in cloud-based environments, facilitating rapid experimentation with large datasets and advanced analytics.
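
For context, a CWL CommandLineTool description of a containerized app has roughly the shape sketched below (written here as JSON from Python; CWL accepts YAML or JSON). The command name, Docker image, and parameters are hypothetical placeholders, not the actual KnowEnG CGC apps.

    import json

    # Minimal sketch of a CWL CommandLineTool wrapping a containerized analysis step.
    # The entry point, Docker image, and parameter names are hypothetical placeholders.
    tool = {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": ["run_gene_set_characterization"],      # hypothetical entry point
        "requirements": [{"class": "DockerRequirement",
                          "dockerPull": "example/knoweng-gsc:latest"}],
        "inputs": {
            "gene_set": {"type": "File", "inputBinding": {"prefix": "--gene-set"}},
            "network": {"type": "File", "inputBinding": {"prefix": "--network"}},
        },
        "outputs": {
            "results": {"type": "File", "outputBinding": {"glob": "results.tsv"}},
        },
    }

    print(json.dumps(tool, indent=2))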

12:40 PM-2:00 PM
Lunch Break
Building the FAIR Data Ecosystem for Discovery to Health
2:00 PM-2:15 PM
Practical Strategies towards Making Biomedical Research Data more FAIR
Room: Columbus AB
  • Avi Ma'ayan, Icahn School of Medicine at Mount Sinai, United States
2:15 PM-2:30 PM
NIH Data Sharing Policies
Room: Columbus AB
  • Dina Paltoo, NIH Office of Science Policy, United States
2:30 PM-3:00 PM
Privacy-Preserving Techniques for Analyzing and Sharing Biomedical Data
Room: Columbus AB
  • Haixu Tang, Indiana University Bloomington, United States
3:00 PM-3:15 PM
SmartAPI: Building a FAIR API Ecosystem for Biomedical Knowledge
Room: Columbus AB
  • Chunlei Wu, The Scripps Research Institute (TSRI), United States
3:15 PM-3:30 PM
Implementing FAIR Principles on Protected Health Information
Room: Columbus AB
  • Tim Clark, University of Virginia, United States
3:30 PM-4:00 PM
Towards a Data Discovery Index: Lessons Learned by the bioCADDIE Consortium
Room: Columbus AB
  • Lucila Ohno-Machado, University of California, San Diego, United States
4:00 PM-4:40 PM
Coffee Break
Machine Learning Approaches to Enable Biomedical Discoveries
4:40 PM-4:55 PM
Automated Cohort Retrieval from EEG Medical Records
Room: Columbus AB
  • Joseph Picone, Temple University, United States
4:55 PM-5:10 PM
Big Data to Knowledge: Integrative Literature Mining and Knowledge Networks for Drug Analytics in Precision Medicine
Room: Columbus AB
  • Cathy Wu, University of Delaware, United States
5:10 PM-5:25 PM
Learning from Text: Translating Clinical Case Reports into Structured Knowledge
Room: Columbus AB
  • Wei Wang, HeartBD2K, UCLA, United States
5:25 PM-5:40 PM
The Role of Prior Knowledge in Machine Learning and Biomedical Data Science
Room: Columbus AB
  • Larry Hunter, CU-Denver Anschutz Medical Campus, United States
5:40 PM-6:00 PM
Causal Network Discovery from Biomedical Data
Room: Columbus AB
  • Greg Cooper, University of Pittsburgh, United States
Biomedical Data Science In Action
Sunday, July 8th
10:15 AM-10:45 AM
NIH Introduction
Room: Columbus AB
10:45 AM-11:05 AM
PanCancer Analysis of Whole Genomes using Multi-Cloud Strategy
Room: Columbus AB
  • Christina Yung, University of Chicago, United States
11:05 AM-11:25 AM
Wikidata for Biomedical Knowledge Integration and Curation
Room: Columbus AB
  • Greg Stupp, The Scripps Research Institute (TSRI), United States
11:25 AM-11:55 AM
Toward the FAIRness of Data Science Training Resources
Room: Columbus AB
  • Jack Van Horn, University of Southern California, United States
11:55 AM-12:15 PM
The Future is Now! Engaging Biomedical Data Scientists in the 21st Century
Room: Columbus AB
  • Ben Busby, NCBI, United States
12:15 PM-12:40 PM
Panel
Room: Columbus AB
12:40 PM-2:00 PM
Lunch Break
BD2K Power Tools: Moving to the Cloud with Industrial Strength Data
2:00 PM-2:20 PM
Scaling Analysis on the Cloud
Room: Columbus AB
  • Brian D. O’Connor, University of California, United States
2:20 PM-2:40 PM
KnowEnG: A Cloud-based Framework for Genomics Data Analysis
Room: Columbus AB
  • Saurabh Sinha, University of Illinois at Urbana-Champaign, United States
2:40 PM-3:00 PM
The Three Faces of Genomic Data Compression
Room: Columbus AB
  • Jianhao Peng, University of Illinois at Urbana-Champaign, United States
3:00 PM-3:20 PM
Cloud Computing Alone Will Not Make Experimental Data FAIR. We Need Better Metadata First.
Room: Columbus AB
  • Mark Musen, Stanford University, United States
3:20 PM-4:00 PM
Panel
Room: Columbus AB
4:00 PM-4:40 PM
Coffee Break
BD2K Data Visualization Tools & Future Directions
4:40 PM-6:00 PM
BD2K Data Visualization Tools & Future Directions
Room: Columbus AB
  • Nils Gehlenborg, Harvard Medical School, United States
  • Griffin M. Weber, Harvard Medical School, United States
  • Alistair Ward, Frameshift, United States