DREAM Challenges

Schedule subject to change
All times listed are in PDT
Thursday, November 30th
9:30-10:00
Developing an audio diagnostic for Tuberculosis infection: the CODA TB Challenge
Format: Live from venue

Moderator(s): Pablo Meyer

  • Solveig Sieberts, Sage Bionetworks, United States
  • Vijay Yadav, Sage Bionetworks, United States
  • Sophie Huddart, University of California, San Francisco, United States
  • Stephen Burkot, Global Health Labs, Inc., United States
  • Sourabh Kulhare, Global Health Labs, Inc., United States
  • Alfred Andama, Walimu, Uganda
  • DJ Christopher, Christian Medical College, India
  • Claudia Denkinger, University Hospital of Heidelberg, Germany
  • Omar Lweno, Ifakara Health Institute, Tanzania
  • Issa Lyimo, Ifakara Health Institute, Tanzania
  • Payam Nahid, University of California, San Francisco, United States
  • Nguyen Viet Nhung, Vietnam National Tuberculosis Program, Viet Nam
  • Mihaja Raberahona, Centre d’Infectiologie Charles Merieux Madagascar, Madagascar
  • Rivonirina Rakotoarivelo, Centre d’Infectiologie Charles Merieux Madagascar, Madagascar
  • Grant Theron, Stellenbosch University, South Africa
  • William Worodria, Walimu, Uganda
  • Charles Yu, De La Salle Medical and Health Sciences Institute, Philippines
  • Chris Bachman, Global Health Labs, Inc., United States
  • Matthew Horning, Global Health Labs, Inc., United States
  • Devan Jaganath, University of California, San Francisco, United States
  • Simon Grandjean Lapierre, University of Montreal, Canada
  • Adithya Cattamanchi, University of California, San Francisco, United States
  • Larsson Omberg, Sage Bionetworks, United States


Presentation Overview:

Tuberculosis (TB), a communicable disease caused by Mycobacterium tuberculosis, is a major cause of ill health and one of the leading causes of death worldwide. In 2021, an estimated 10.6 million people fell ill with TB and 1.6 million died of it. However, approximately 40% of people with TB were not diagnosed or reported to public health authorities, either because of challenges in accessing health facilities or because they were not tested or treated when they did reach one. Low-cost, non-invasive digital screening tools may help close some of these gaps in diagnosis and linkage to care. Cough has shown potential as a diagnostic biomarker for TB, COVID-19 and other conditions in which cough is a common symptom. Machine learning-based prediction from cough sounds represents a novel, low-cost and non-invasive approach to rapid TB screening.

To this end, we developed the COugh Diagnostic Algorithm for Tuberculosis (CODA TB) DREAM Challenge. The challenge leveraged data collected from seven countries (India, Madagascar, the Philippines, South Africa, Tanzania, Uganda, and Vietnam). The sites in each country employed the same protocol, using the Hyfe Research app to record solicited cough sounds from patients presenting with new or worsening cough, followed by a comprehensive evaluation for TB. Challenge participants were provided with a training set consisting of 9,772 cough sounds from 1,082 study participants and were asked to predict TB diagnosis using either the cough sounds only (SC1) or the cough sounds plus available demographics and clinical screening data (SC2). Participants’ models were evaluated using a final test set consisting of 790 study participants (7,125 cough recordings).

For SC1 (cough-only models), the top model achieved an AUROC of 0.7434. For SC2 (cough + clinical/demographic models), the top model achieved an AUROC of 0.8315, and 5 of the 6 competing models achieved the pre-specified benchmark (sensitivity > 0.8 and specificity > 0.6), approaching WHO target thresholds for a TB triage test. Additionally, we found that the predicted probabilities of submitted models increased with higher semi-quantitative levels of bacterial load (p = 4.5e-04, SC1; p = 4.6e-07, SC2), indicating that model predictions are associated with disease burden.
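As an illustration of the metrics reported above, the following is a minimal sketch of how AUROC and a sensitivity/specificity benchmark could be computed from binary labels and model scores. It is not the challenge's scoring code: the function names and the choice of a single decision threshold are assumptions made for this example, and only the benchmark values (sensitivity > 0.8, specificity > 0.6) come from the abstract.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative (ties count one half), i.e. the Mann-Whitney U identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def meets_benchmark(sensitivity, specificity):
    """Benchmark quoted in the abstract: sensitivity > 0.8 and specificity > 0.6."""
    return sensitivity > 0.8 and specificity > 0.6
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.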

10:10-10:40
scRNA-seq and scATAC-seq Data Analysis DREAM Challenge
Format: Live from venue

Moderator(s): Pablo Meyer

  • Olga Nikolova, OHSU, United States
  • Nicholas Calistri, OHSU, United States
  • Rongrong Chai, Sage Bionetworks, United States
  • Andrew Nishida, OHSU, United States
  • Verena Chung, Sage Bionetworks, United States
  • Maria Diaz, Sage Bionetworks, United States
  • Jacob Albrecht, Sage Bionetworks, United States
  • Julio Saez Rodriguez, Heidelberg University, Germany
  • Galip Gürkan Yardimci, OHSU, United States
  • Emek Demir, OHSU, United States
  • Andrew Adey, OHSU, United States
  • Laura Heiser, OHSU, United States


Presentation Overview:

Background: Understanding transcriptional signals at single-cell resolution is fundamental to our understanding of more complex biological systems such as tissues and organs, and is essential to characterizing cell-to-cell heterogeneity. In parallel, examining the epigenomic landscape is important for understanding regulatory transcriptional programs. Emerging high-throughput sequencing technologies now allow transcript quantification and chromatin accessibility assessment at the single-cell level. These technologies, however, present unique challenges due to the low amounts of mRNA sequenced per cell (scRNA-seq) and low copy numbers (scATAC-seq), which lead to inherent data sparsity in the readouts of these assays. In scRNA-seq, proper signal correction is key to accurate gene expression quantification, which propagates into downstream analyses such as differential gene expression analysis, cell-type-specific marker identification, and reconstruction of putative differentiation trajectories. In the even sparser scATAC-seq data, correct identification of informative features is key to assessing cell heterogeneity at the chromatin level.

The aim of this challenge is two-fold:
A) to evaluate computational methods for signal correction and peak identification in scRNA-seq and scATAC-seq, respectively, and
B) to assess the impact these methods have on downstream data analysis

Results: Here we present results from 11 teams. We first provide an overview and breakdown of the classes of participating methods and their individual performance results. Then, we discuss ensemble performance analysis. In assessing the performance of scRNA-seq data correction, we find that an ensemble of the top three performing methods outperforms the individually best-performing approach. We further assess the methods’ ability to distinguish simulated drop-outs, generated by down-sampling, as well as the biological validity of downstream analysis results. In evaluating the more limited pool of scATAC-seq analysis methods, we focus on the innovation and biological relevance of the top performing approach. We hope that our findings and one-of-a-kind benchmarking data will be of service to the greater community. Finally, we will also discuss challenges and lessons learned in the organization and implementation of this project.
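To make the idea of simulated drop-outs generated by down-sampling concrete, here is a minimal sketch using binomial thinning of a count matrix. The abstract does not describe the actual down-sampling procedure, so the thinning scheme, the function names, and the `keep_prob` parameter are all assumptions for illustration.

```python
import random


def downsample_counts(counts, keep_prob, seed=0):
    """Binomially thin a cells-by-genes count matrix (list of lists of ints).

    Each individual read is kept independently with probability keep_prob.
    This is an assumed scheme, not the challenge's documented procedure.
    """
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must be in [0, 1]")
    rng = random.Random(seed)
    return [
        [sum(1 for _ in range(c) if rng.random() < keep_prob) for c in row]
        for row in counts
    ]


def simulated_dropouts(original, downsampled):
    """Positions (cell, gene) that were non-zero before thinning and zero after."""
    return [
        (i, j)
        for i, row in enumerate(original)
        for j, c in enumerate(row)
        if c > 0 and downsampled[i][j] == 0
    ]
```

Because the original counts are known, the positions returned by `simulated_dropouts` can serve as ground truth when checking whether a correction method restores signal at the right entries.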

10:40-11:10
OpenChallenges.io: A Centralized Hub for Biomedical Challenges and More
Format: Live from venue

Moderator(s): Pablo Meyer

  • Thomas Schaffter, Sage Bionetworks, United States
  • Verena Chung, Sage Bionetworks, United States
  • Rong Chai, Sage Bionetworks, United States
  • Maria Diaz, Sage Bionetworks, United States
  • Gaia Andreoletti, Sage Bionetworks, United States
  • Jake Albrecht, Sage Bionetworks, United States


Presentation Overview:

Data challenges have played a significant role in driving biomedical breakthroughs by engaging researchers, data scientists, and contributors from various fields to collaborate on complex problems. However, information about past, active and upcoming challenges organized by dozens of biomedical benchmarking communities is fragmented. Without a hub that centralizes all biomedical challenges, prospective participants find it difficult to discover challenges that interest them. Meanwhile, challenge organizers struggle to find a streamlined way to engage a broader community.

The OpenChallenges (OC) initiative addresses current pain points like fragmented challenge information and lack of standardization. OpenChallenges.io is a centralized hub for biomedical challenges that empowers participants with the most up-to-date information about relevant challenges, while providing challenge organizers with standardized challenge event templates and intelligence. Prospective participants can perform complex search queries intuitively using full-text search and a growing number of filters such as challenge incentives, status, dates, platforms, contributing organizations, input data types, and more. Prospective organizers can also quickly identify benchmarking communities and organizations that have supported similar challenges and may be interested in contributing to new benchmarking opportunities as data providers or sponsors, for instance.

11:10-11:40
Upcoming DREAM Challenges
Format: Live from venue

Moderator(s): Pablo Meyer

  • Paul Boutros
13:10-14:10
Invited Presentation: Keynote: TBD
Format: Live from venue

Moderator(s): Rebecca Levinson

  • Paul Spellman
14:10-14:40
The FINRISK HF Microbiome DREAM challenge: Predicting Incident Heart Failure from the Microbiome
Format: Live from venue

Moderator(s): Rebecca Levinson

  • Pande Putu Erawijantari, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland
  • Ece Kartal, Institute for Computational Biomedicine, Heidelberg University Faculty of Medicine, Heidelberg, Germany
  • José Liñares Blanco, GENYO. Centre for Genomics and Oncological Research: Pfizer, University of Granada, Granada, Spain
  • Teemu D. Laajala, Department of Mathematics and Statistics, Faculty of Science, University of Turku, Finland
  • Rajesh Shigdel, Department of Clinical Science, University of Bergen, Bergen, Norway
  • Mike Inouye, Cambridge Baker Systems Genomics Initiative, Baker Heart & Diabetes Institute, Melbourne, Victoria, Australia
  • Pekka Jousilahti, Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  • Rob Knight, Jacobs School of Engineering, University of California San Diego, La Jolla, CA, United States
  • Veikko Salomaa, Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  • Teemu Niiranen, Division of Medicine, Turku University Hospital, Turku, Finland
  • Aki S. Havulinna, Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland
  • Julio Saez-Rodriguez, Institute for Computational Biomedicine, Heidelberg University Faculty of Medicine, Heidelberg, Germany
  • Rebecca T. Levinson, Department of Internal Medicine & Psychosomatics, Heidelberg University Hospital, Heidelberg, Germany
  • Leo Lahti, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland


Presentation Overview:

Early identification of individuals at risk of heart failure (HF) could allow for interventions that reduce morbidity or mortality and lessen the impact of this major public health challenge. The FINRISK HF Microbiome DREAM challenge was developed to evaluate the use of machine learning approaches to predict incident HF risk over 15 years in a population cohort of 7,231 Finnish adults (FINRISK 2002, n=559 incident HF cases). Shotgun metagenomics data obtained from fecal samples and clinical data were incorporated into these risk predictions. Nine registered teams contributed 35 valid model submissions using synthetic data that were developed for model training and testing. Final models were evaluated on the real data by challenge organizers. The two highest-scoring models both used Cox regression and achieved Harrell's C-indices of 0.8344 and 0.8271, respectively. These were then aggregated to create an ensemble model. After the challenge, we further refined the models by eliminating phylum information and testing models at intermediate timepoints. This challenge provided insights into strategies for the incorporation of microbiome data into incident disease prediction models. Our experience also highlights some of the limitations of using protected real-world clinical data for a community challenge.
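For readers unfamiliar with Harrell's C-index, the following is a minimal sketch of the statistic for right-censored data. It is a textbook-style simplification (pairs with tied follow-up times are skipped), not the challenge's evaluation code, and the function and argument names are assumptions for this example.

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the subject with the earlier
    observed event has the higher predicted risk (risk ties count one half).

    times  : follow-up time per subject
    events : 1 if the event (e.g. incident HF) was observed, 0 if censored
    risks  : predicted risk score, higher meaning earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risks against observed event times.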

14:40-15:00
Team DenverFINRISKHacky, FINRISK Heart Microbiome Challenge
Format: Live from venue

Moderator(s): Rebecca Levinson

  • Lily Elizabeth Feldman, Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
  • Michael Orman, Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
  • Varsha Sreekanth, Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
  • Teemu Daniel Laajala, University of Turku, Turku, Finland & University of Helsinki, Helsinki, Finland


Presentation Overview:

In this presentation, the corresponding author will outline the key steps of the team's participation in the FINRISK Heart Failure Microbiome Challenge (mainly late 2022 to early 2023), which resulted in a top-performing model built on regularized Cox regression. On behalf of the team, special emphasis will also be placed on the format in which we initially participated (a hackathon held in Denver). Furthermore, specific details regarding model building, modelling choices, and technical implementation are provided, including the rationale behind the modular structure of the final submission coupled with multi-seeded cross-validation. Lastly, some personal retrospective reflections are offered on the challenge format, its pros and cons, considerations regarding the model-to-data approach, and a brief comparison to some other DREAM Challenges from a three-time participant’s perspective.

15:30-16:00
Xcelerate RARE: A Rare Disease Open Science Data DREAM Challenge
Format: Live from venue

Moderator(s): Jake Albrecht

  • Jake Albrecht, Sage Bionetworks, United States


Presentation Overview:

The Xcelerate RARE Challenge, which focused on rare pediatric neurodevelopmental diseases, brought together academic and industry researchers and data scientists to use patient-reported data to address unanswered research questions about rare diseases. A total of 132 scientists, many new to rare disease research, participated in 24 teams that created 33 submissions in total. This presentation will give an overview of the Challenge tasks and data, along with the overall findings from the Challenge organizers.

16:00-16:20
Expansion of rare disease symptoms using RARE-X & external data
Format: Live from venue

Moderator(s): Jake Albrecht

  • Won Chan Jeong, 3billion, South Korea
  • Kisang Kwon, 3billion, South Korea
  • Namseok Lee, 3billion, South Korea
  • Yongjun Song, 3billion, South Korea


Presentation Overview:

Building comprehensive phenotype ontologies for rare diseases is challenging due to the limited number of patients available. However, comprehensive phenotype data are essential for diagnosing rare diseases: they provide the observable characteristics needed to identify patterns, guide genetic testing, and differentiate between similar conditions, ensuring accurate and timely diagnosis. Therefore, expanding the understanding of rare disease phenotypes remains critical to improving diagnostic accuracy.

In this challenge, we examined phenotype data from 741 patients diagnosed with 27 different diseases, focusing on symptoms and responses collected from the PELHS and CSHQ questionnaires. Our goal was to use these data to uncover novel or under-recognized phenotypes associated with these diseases.

Here we used three different methods to improve our understanding of rare disease symptoms. First, recognizing the limitations of manual curation, we used natural language processing (NLP), specifically biomedical language models, to annotate all publications in PubMed. By annotating genes, diseases, and phenotypes, we identified previously unrecognized candidate phenotypes. Second, using the patient phenotype data from RARE-X, we applied Fisher’s exact test, which is well suited to small datasets, to identify statistically significant disease-specific traits. Finally, we explored phenotype data from mice, taking advantage of their genetic similarities to humans, to uncover potential phenotypes not previously explored.
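As a sketch of the second method, Fisher's exact test on a 2x2 table (disease versus other patients, symptom present versus absent) can be written directly from the hypergeometric distribution. The one-sided "enrichment" alternative and the function name below are assumptions for illustration; the abstract does not state which alternative or software the team used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on the 2x2 table

                      symptom   no symptom
        disease          a          b
        other            c          d

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # patients with the disease
    col1 = a + c          # patients reporting the symptom
    total = a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator
```

When many symptom-disease pairs are tested, the resulting p-values would normally be adjusted for multiple comparisons before any trait is called disease-specific.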

By integrating these three methods, we discovered 36 potential novel or under-recognized phenotype candidates across 14 diseases. These findings were supported by annotated clinical publications and further validated by RARE-X patient data, providing robust evidence for novel or under-recognized phenotypes.

Expanding our understanding of rare disease phenotypes, facilitated by the methods described, is critical. Not only does it improve diagnostic accuracy by providing more information about new genes and existing diseases, it also opens new avenues for drug discovery by broadening drug indications and identifying new targets, offering promising prospects for advances in medical treatment.

16:20-16:40
Validation and Insights into Rare Disease Symptoms via NetraAI Analysis with Emphasis on Immune Functions
Format: Live from venue

Moderator(s): Jake Albrecht

  • Kaiwen Deng, University of Michigan, United States
  • Yuanfang Guan, University of Michigan, United States


Presentation Overview:

In this task, we analyzed rare disease symptom data to assess the claims made by NetraAI regarding immune functions, employing UMAP for symptom visualization and the two-sample proportions z-test for statistical significance on both raw and processed data. We focused on 41 individuals with unique immunocharacteristics. Upon further analysis, another 144 children showed similar immune functions. Among the 185 individuals, notable differences in immunodeficiency and recurrent infections were observed, which separated them into two groups. We hypothesized that Wiedemann-Steiner Syndrome (WSS) (KMT2A) and FOXP1 Syndrome might be crucial in explaining these disparities. In particular, the KMT2A-FOXP1 pathway might be the underlying reason for the observed symptom co-occurrences, suggesting the need for further gene expression data analysis.
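The two-sample proportions z-test mentioned above can be sketched as follows, using the pooled-variance form and a two-sided p-value from the standard normal distribution. The function name and arguments are assumptions for this example; it does not reproduce the authors' analysis or data.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions.

    x1, x2 : number of individuals with the symptom in each group
    n1, n2 : group sizes
    Returns (z, p_value) using the pooled standard error.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if standard_error == 0.0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / standard_error
    # Two-sided tail area: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = erfc(abs(z) / sqrt(2.0))
    return z, p_value
```

The normal approximation is only reasonable when each group has enough individuals with and without the symptom; for very small counts an exact test is the safer choice.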

16:40-17:00
Random Walk with Restart on a Biomedical Multilayer Network to Uncover Novel Phenotypes associated with Rare Diseases
Format: Live from venue

Moderator(s): Jake Albrecht

  • Galadriel Brière, Aix Marseille Univ, CNRS, I2M, Marseille, France
  • Cécile Beust, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Morgane Térézol, Aix Marseille Univ, INSERM, MMG, Marseille, France
  • Anaïs Baudot, Aix Marseille Univ, INSERM, MMG, Marseille, France / Barcelona Supercomputing Center, France


Presentation Overview:

Background: Rare diseases present significant diagnostic, management, and treatment challenges. Nowadays, the abundance of biomedical data offers an unparalleled opportunity to investigate rare diseases. However, biomedical data is large and complex. It encompasses information at different scales, from molecular and cellular information to biological pathways, to broader biomedical concepts such as phenotypes. In this context, it is imperative to develop novel computational strategies to analyze, integrate, and extract knowledge from heterogeneous biomedical data. Notably, networks have proven an invaluable approach to integrate data within a common framework while enabling the use of the robust toolkit of graph theory. We propose here a multilayer network approach to integrate the Rare-X challenge data with prior knowledge extracted from public databases, in order to highlight novel phenotypes associated with rare diseases.

Material and methods: We constructed a two-layer network combining the Rare-X challenge dataset with prior knowledge on rare diseases. The first layer encodes Patient-Disease and Patient-Phenotype associations extracted from the Rare-X dataset. The second layer encodes Disease-Disease and Disease-Phenotype associations extracted from Orphanet, as well as Phenotype-Phenotype associations extracted from the Human Phenotype Ontology. We connected corresponding diseases in the Rare-X and prior knowledge layers. We explored this two-layer network with MultiXrank, a Random Walk with Restart algorithm. MultiXrank calculates proximity scores for all nodes in the multilayer network with respect to a seed node. Starting from each Rare-X disease node in the network, we used MultiXrank to prioritize phenotype nodes from the Rare-X layer and the prior knowledge layer.
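The core idea of Random Walk with Restart can be sketched on a single small graph. This is a simplified, single-layer illustration written for this summary, not MultiXrank or its API: the adjacency format, the `restart` value, and the node names used in the usage note are hypothetical.

```python
def random_walk_with_restart(adjacency, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that, at each step,
    jumps back to `seed` with probability `restart` and otherwise moves to
    a uniformly chosen neighbour.

    adjacency : dict mapping each node to a list of neighbour nodes; every
                neighbour must also appear as a key (undirected graph).
    """
    nodes = list(adjacency)
    start = {node: 1.0 if node == seed else 0.0 for node in nodes}
    current = dict(start)
    for _ in range(max_iter):
        updated = {node: restart * start[node] for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            mass = (1.0 - restart) * current[node]
            if not neighbours:
                updated[seed] += mass  # isolated node: send its mass back to the seed
                continue
            share = mass / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        change = sum(abs(updated[node] - current[node]) for node in nodes)
        current = updated
        if change < tol:
            break
    return current
```

Seeding the walk at a disease node, for example `random_walk_with_restart(graph, "disease_A")` on a hypothetical `graph`, yields scores that can be sorted to rank phenotype nodes by proximity to that disease.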

Results: We correctly identified known phenotypes associated with rare diseases, but also highlighted novel candidate phenotypes from the Rare-X dataset. For example, we predict novel potential associations of Osteoporosis, Intellectual disability, and Abnormal Bladder function with 4H Leukodystrophy. Upon reviewing the available literature, we found several studies advocating for the recognition of these phenotypes as complications of 4H Leukodystrophy.

Conclusion: Multilayer networks offer a natural way to integrate heterogeneous and complex information, enabling the exploration of cohort data alongside public sources. This, in turn, presents an opportunity to identify potential gaps in current biomedical knowledge.