CAMDA COSI

Schedule subject to change
All times listed are in CDT
Monday, July 11th
10:30-10:40
CAMDA Welcome
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • David Kreil
10:40-11:40
Keynote Presentation: Addressing standardization challenges through integrated approaches in biomedical and genomic data
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • Lynn Schriml


Presentation Overview:

Big data integration holds the promise of accessible datasets amenable to ML/AI approaches for knowl...

11:40-12:00
CAMDA Invited: MetaSUB, An initiative to characterize the global microbiome and establish planetary-scale metagenomic surveillance
Room: EH
Format: Live-stream

Moderator(s): Paweł Łabaj

  • Krista Ryon


Presentation Overview:

TBP

12:00-12:30
CAMDA Invited: Antimicrobial resistance prediction from whole-genome sequence and metagenomic data - challenges and opportunities
Room: EH
Format: Live-stream

Moderator(s): Paweł Łabaj

  • Leonid Chindelevitch


Presentation Overview:

TBC

14:30-14:50
Proceedings Presentation: Phage-bacteria contig association prediction with a convolutional neural network
Room: EH
Format: Live from venue

Moderator(s): Joaquin Dopazo

  • Tianqi Tang, Department of Quantitative and Computational Biology, University of Southern California, United States
  • Shengwei Hou, Department of Ocean Science and Engineering, Southern University of Science and Technology, China
  • Jed Fuhrman, Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, United States
  • Fengzhu Sun, Department of Quantitative and Computational Biology, University of Southern California, United States


Presentation Overview:

Motivation: Phage-host associations play important roles in microbial communities.
However, in natural communities, where phages are discovered and characterized metagenomically rather than through culture-based lab studies, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches.
These are often based on whole viral and host genomes, but in metagenomics-based studies we rarely have whole genomes and must instead rely on contigs that are sometimes only hundreds of bp long.
Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs.
Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here we develop ContigNet, a convolutional neural network-based model capable of predicting phage-host matches based on relatively short contigs, and compare it with the previously published VirHostMatcher and WIsH.
Results: On the validation set, ContigNet achieves 72-85% area under the receiver operating characteristic curve (AUROC) scores, compared to a maximum of 68% by VirHostMatcher or WIsH, for contigs of lengths between 200 bp and 50 kbp.
We also applied the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples, and achieved 60-70% AUROC scores, while VirHostMatcher and WIsH reached 52%.
Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts.
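As a rough illustration of the AUROC metric reported above, the sketch below computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name and the toy scores are hypothetical and are not taken from ContigNet, VirHostMatcher, or WIsH.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four phage-host pairs (1 = true association).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of true and false pairs.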

14:50-15:00
The systematic assessment of completeness of public metadata accompanying omics studies
Room: EH
Format: Live from venue

Moderator(s): Joaquin Dopazo

  • Serghei Mangul, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Ram Ayyala, Department of Translational Biomedical Informatics, Keck School of Medicine, University of Southern California, United States
  • Yu-Ning Huang, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Anushka Rajesh, Department of Pharmacology and Pharmaceutical Sciences, University of Southern California, United States
  • Aditya Sarkar, School of Computing and Electrical Engineering, Indian Institute of Technology Mandi, North Campus, India
  • Ruiwei Guo, Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Elizabeth Ling, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Irina Nakashidze, Department of Clinical Medicine, Faculty of Natural Science and Health Care, Batumi Shota Rustaveli State University, Georgia
  • Man Yee Wong, Department of Translational Genomics, Keck School of Medicine, University of Southern California, United States
  • Jieting Hu, Department of Translational Genomics, Keck School of Medicine, University of Southern California, United States
  • Alexey Nosov, Department of Physical Sciences, Santa Monica College, United States
  • Yutong Chang, Department of Pharmacology and Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Malak S. Abedalthagafi, King Fahad Medical City and King Abdulaziz City for Science and Technology, Saudi Arabia


Presentation Overview:

Genomic data are easily accessible from public genomic repositories, allowing the biomedical community to share omics datasets effectively. However, improperly annotated or incomplete metadata accompanying the raw omics data can negatively impact the utility of shared data for secondary analysis. In this study, we perform a comprehensive analysis of 137 studies comprising 18,559 samples across six therapeutic fields to assess the completeness of metadata accompanying omics studies in both the publications and the online repositories. This analysis involved finding studies based on the six therapeutic fields, which are Alzheimer’s disease, acute myeloid leukemia, cystic fibrosis, cardiovascular diseases, inflammatory bowel disease, sepsis, and tuberculosis. We carefully examined the availability of metadata for nine clinical variables: disease condition, age, organism, sex, tissue type, ethnicity, country, mortality, and clinical severity. By comparing the metadata availability in the original publications and the online repositories, we observed discrepancies in how metadata are shared. We determine that the overall availability of metadata is 72.8%. Our study is the first to systematically assess the completeness of metadata accompanying raw data across a large number of studies and phenotypes, and it opens a crucial discussion about solutions to improve the availability of metadata accompanying omics studies.

15:00-15:10
Assessing the completeness of immunogenetics databases across diverse populations
Room: EH
Format: Live from venue

Moderator(s): Joaquin Dopazo

  • Yu-Ning Huang, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Yiting Meng, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Naresh Amrat Patel, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Jay Himanshu Mehta, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Brittney Hua, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Marina Fayzullina, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Houda Alachkar, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Serghei Mangul, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States


Presentation Overview:

Recent advances in high-throughput sequencing technologies provide the scientific community with efficient tools to profile the human adaptive immune receptor repertoire via adaptive immune receptor repertoire sequencing (AIRR-Seq). However, few studies have taken ancestry information into account, and the ancestry distribution of open AIRR-Seq studies is unknown. In this study, I examined the completeness of the immunogenetics database (IMGT). By leveraging the bioinformatics software MiXCR, I am able to comprehensively examine mismatches in the VDJ genes in reads from samples of different ancestry groups and evaluate the completeness of the immunogenetics database across diverse ancestry groups. Unveiling the ancestry distribution in TCR-Seq studies and the completeness of the immunogenetics database in representing diverse populations could highlight the need to improve ancestry diversity for underrepresented populations and guide future immunogenomics studies to improve ancestry availability and distribution.

15:10-15:20
Data availability of open T-cell receptor repertoire data, a systematic assessment
Room: EH
Format: Live from venue

Moderator(s): Joaquin Dopazo

  • Serghei Mangul, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Yu-Ning Huang, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Naresh Amrat Patel, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Jay Himanshu Mehta, Department of Pharmaceutical Sciences, School of Pharmacy, University of Southern California, United States
  • Srishti Ginjala, School of Computing and Electrical Engineering, Indian Institute of Technology, Mandi, India
  • Petter Brodin, Department of Immunology and Inflammation, Imperial College London, United Kingdom
  • Clive M Gray, Division of Molecular Biology and Human Genetics, Biomedical Research Institute, Stellenbosch University, South Africa
  • Yesha M Patel, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States
  • Lindsay G. Cowell, Department of Population and Data Sciences, University of Texas Southwestern Medical Center at Dallas, United States
  • Amanda M. Burkhardt, Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, United States


Presentation Overview:

Modern data-driven research has the power to promote novel biomedical discoveries through secondary analyses of raw data. Data-driven research must be highly reproducible to support precise secondary analyses of immunogenomics data. Rigor is needed in designing and conducting experiments, and specifically in scientific writing and in reporting results. It is also crucial to make raw data available, discoverable, and well described in order to promote future re-analysis of the data. To assess the data availability of published T cell receptor repertoire data, we examined 11,918 TCR-Seq samples corresponding to 134 TCR-Seq studies ranging from 2006 to 2022. Among the 134 studies, only 38.1% had publicly available raw TCR-Seq data shared in public repositories. We also found a statistically significant association between the presence of data availability statements and the increase in raw data availability (p=0.014). Yet, 46.8% of studies with data availability statements failed to share the raw TCR-Seq data. There is a pressing need for biomedical communities to increase awareness of the importance of raw data availability in research and to take immediate action to improve it, enabling cost-effective secondary analysis of existing immunogenomics data by the larger scientific community.

15:20-15:30
GO Bench: Shared-hub for Universal Benchmarking of Machine Learning-Based Protein Functional Annotations
Room: EH
Format: Live from venue

Moderator(s): Joaquin Dopazo

  • Andrew Dickson, University of California - Berkeley, United States
  • E A, University of California - Berkeley, United States
  • Mohammad Mofrad, University of California - Berkeley, United States
  • Alice McHardy, Helmholtz Institute, Germany


Presentation Overview:

Motivation: Gene annotation is the problem of mapping proteins to their functions represented as Gene Ontology terms, typically inferred based on the primary sequences. Gene annotation is a multi-label multi-class classification problem which has generated growing interest for its uses in the characterization of millions of proteins with unknown functions. However, there is no standard GO dataset that follows best practices and can be used to properly benchmark newly developed machine learning models within the bioinformatics community. Thus, the significance of improvements for these models remains unclear.
Summary: The Gene Benchmarking database is the first effort to provide an easy-to-use and configurable hub for the training and evaluation of gene annotation models. It provides easy access to preset datasets and takes the non-trivial steps of preprocessing and filtering all data according to custom presets using a web interface. The GO Bench web application can also evaluate any trained model and display it on leaderboards for annotation tasks.

16:00-16:20
From Critical Assessment at CAMDA to real life applications - metagenomics in forensics
Room: EH
Format: Live from venue

Moderator(s): David Kreil

  • Paweł Łabaj


Presentation Overview:

The foundation of the Metagenomics & Metadesign of Subways & Urban Biomes (MetaSUB) consortium and the launch of a worldwide citizen science event, the global City Sampling Day (gCSD), have unveiled the huge exploratory potential of Whole Metagenome Sequencing (WMS) data from environmental samples. It was first at CAMDA that data scientists could investigate the geolocation and forensic potential of such data. Biodiversity studies conducted on urban biomes show that microbiome profiles can vary even within the same city. CAMDA challenge participants have shown that unique profiles allowing accurate classification of sample origin can be built not only on taxonomic or functional profiles but also on k-mer profiles, which represent a reference-free approach. Those findings have recently been confirmed by the MetaSUB paper. The limiting factor for classical metagenomic data analysis approaches, however, is the completeness of the reference databases used. Thus, a reference-free approach is needed to allow analysis and interpretation. Here we show how the concept born at CAMDA from MetaSUB data has evolved into a successful application in forensics. We took advantage of the k-mer approach to WMS data analysis, paired with Targeted Metagenomic Sequencing and modern ML, to accurately classify the origin of soil samples.
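To make the reference-free k-mer idea concrete, the following minimal sketch turns a DNA sequence into a k-mer frequency profile that could serve as a feature vector for a classifier. The function name, the choice of k, and the toy sequence are illustrative assumptions and do not reproduce the pipeline described in the talk.

```python
from collections import Counter


def kmer_profile(sequence, k):
    """Return the relative frequency of every k-mer found in a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= set("ACGT")  # skip k-mers with ambiguous bases
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy read; real profiles would aggregate k-mers over all reads in a sample.
profile = kmer_profile("ACGTACGT", 2)
```

Because no reference database is consulted, such profiles can be computed for samples whose organisms are poorly represented in existing catalogues.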

16:20-16:40
Network Medicine in Times of Pandemic: Can we repurpose drugs?
Room: EH
Format: Live-stream

Moderator(s): David Kreil

  • Deisy Gysi


Presentation Overview:

The COVID-19 pandemic has highlighted the importance of prioritizing approved drugs to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections. Here I present three deployed algorithms: one based on artificial intelligence, one on network diffusion, and one on network proximity. We then use those methods to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. We experimentally screened 918 drugs, allowing us to evaluate the performance of the existing drug-repurposing methodologies, and used a consensus algorithm to increase the accuracy of the predictions. Finally, we screened the top-ranked drugs in human cells, identifying six drugs that reduced viral infection, four of which could be repurposed to treat COVID-19. The developed strategy has significance beyond COVID-19, allowing us to identify drug-repurposing candidates for neglected diseases.

16:40-17:00
A Data-Driven and Knowledge-Based Approach to Inferring Temporal Gene Networks for COVID-19
Room: EH
Format: Live from venue

Moderator(s): David Kreil

  • Mitsuhiro Odaka, The Graduate University for Advanced Studies / National Institute of Informatics / Nantes Université, Centrale Nantes, Japan
  • Morgan Magnin, Nantes Université, Centrale Nantes / National Institute of Informatics, France
  • Katsumi Inoue, National Institute of Informatics / The Graduate University for Advanced Studies / Nantes Université, Centrale Nantes, Japan


Presentation Overview:

Intercellular attachment is potentially significant in COVID-19. However, the interactions among the responsible molecules have not been adequately uncovered. Limited understanding of how such molecules are regulated in COVID-19 leaves signaling pathways incomplete. For example, β-actin (ACTB), ICAM-1 (ICAM1), and MOCCI (C15orf48) are not in existing pathway databases, such as COVID-19 Disease Map or Signor 2.0. Therefore, this research aims to construct pathways associated with the above three genes of interest (GOIs). We propose a general framework for inferring gene networks from single-cell transcriptome data and six different knowledge bases, and apply the framework to the GOIs at three time points. Firstly, a Markov random field retrieves differentially coexpressed genes specific to COVID-19, with spurious correlations deleted. Secondly, background knowledge validates the edges obtained from data. Lastly, pathway enrichment analysis using KEGG pathways discovers nine GOI-associated pathways. These pathways suggest the immune response typical in COVID-19 and the possibility of membrane fusion or microtubule organizing center formation in COVID-19. In this manner, data-driven and knowledge-based approaches are combined in gene network inference for pathway construction. The results can contribute to repairing and completing the pathway databases, improving our current understanding of the COVID-19 mechanisms.

17:00-17:20
WITHDRAWN BY AUTHORS - Modeling of large biological knowledge graph augmented with COVID-19 causal network data for drug repurposing
Room: EH
Format: Live-stream

Moderator(s): David Kreil

  • Ihor Stepanov, Institute of Molecular Biology and Genetics of NASU, Kyiv, Ukraine
  • Valerii Vasylevskyi, Bogomolets National Medical University, Kyiv, Ukraine
  • Roman Koval, Bogomolets National Medical University, Kyiv, Ukraine


Presentation Overview:

SARS-CoV-2 was first detected in Wuhan, Hubei province, in central China. Since the first whole-genome sequencing of this virus on December 30, it has caused more than 500 million confirmed cases and 6 million deaths around the globe. Apart from vaccines, most medications for COVID-19 are used to relieve symptoms, including NSAIDs, or glucocorticoids in more severe cases. Therefore, new etiological and pathogenetic treatments must be found and implemented. A completely new molecule must go through many steps before it can be deployed. With the help of artificial intelligence in drug repurposing, however, we can predict at large scale and with high accuracy that some drugs already in use may be helpful in the fight against COVID-19. Our research focused on enhancing a prebuilt large biological knowledge graph with new data for effective drug repurposing. TransE was used as the model for obtaining knowledge graph embeddings. We tested several approaches to drug repurposing tasks depending on the relation types analyzed. As a result, most of the compounds selected by our methods have reported therapeutic effects against COVID-19 in the scientific literature, and some are even in active clinical trials.
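For readers unfamiliar with TransE, the sketch below shows its core idea: a triple (head, relation, tail) is considered plausible when the head embedding plus the relation embedding lies close to the tail embedding. The two-dimensional vectors, the entity names, and the `treats` relation are hypothetical values chosen for illustration; they are not the embeddings or graph schema used by the authors.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Sort candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )


# Hypothetical 2-dimensional embeddings, for illustration only.
drug = [1.0, 0.0]
treats = [0.0, 1.0]
diseases = {"disease_a": [1.0, 1.0], "disease_b": [3.0, -2.0]}
ranking = rank_tails(drug, treats, diseases)
```

In a repurposing setting, the same ranking could be run over all candidate entities for a fixed relation, with the top-ranked ones treated as hypotheses to check against the literature.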

17:20-17:50
History of CAMDA
Room: EH
Format: Live-stream

Moderator(s): David Kreil

  • Joaquin Dopazo


Presentation Overview:

CAMDA is among the longest-running and most resilient conferences on massive data analysis. It started with the century and has evolved as omic technologies have changed over time. It has witnessed two omic revolutions: the microarray era and the massive sequencing era, which have provided examples for an open-ended style of challenges. Here we will review the most interesting aspects and achievements of CAMDA in the last two decades.

17:50-18:00
First day closing - departure for CAMDA dinner
Room: EH
Format: Live from venue

Moderator(s): David Kreil

  • Wenzhong Xiao
Tuesday, July 12th
10:30-11:30
Keynote Presentation: PubMed & Beyond: Biomedical Text Mining for Knowledge Discovery
Room: EH
Format: Live from venue

Moderator(s): Wenzhong Xiao

  • Zhiyong Lu


Presentation Overview:

The explosion of biomedical big data and information in the past decade or so has created new opport...

11:30-11:50
Automatic identification of drug-induced liver injury literature using natural language processing and machine learning methods
Room: EH
Format: Live from venue

Moderator(s): Wenzhong Xiao

  • Jung Hun Oh, Memorial Sloan Kettering Cancer Center, United States
  • Allen Tannenbaum, Stony Brook University, United States
  • Joseph Deasy, Memorial Sloan Kettering Cancer Center, United States


Presentation Overview:

Drug-induced liver injury (DILI) is an adverse hepatic drug reaction that can potentially lead to life-threatening liver failure. Previously published work in the scientific literature on DILI has provided valuable insights for the understanding of hepatotoxicity as well as drug development. However, the manual search of scientific literature in PubMed is laborious. To address this challenge, we have developed an integrated natural language processing (NLP) / machine learning classification model to identify DILI-related literature using only paper titles and abstracts. We used 14,203 publications provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, employing word vectorization techniques in NLP coupled with machine learning methods. The best performance was achieved using a linear support vector machine (SVM) model that combined vectors derived from term frequency-inverse document frequency (TF-IDF) and Word2Vec. The final SVM model built using all 14,203 publications was tested on independent datasets, resulting in accuracies of 92.5%, 96.3%, and 98.3%, and F1-scores of 93.5%, 86.1%, and 75.6% for three test sets (T1-T3). The SVM model was tested on four external validation sets (V1-V4), resulting in accuracies of 92.0%, 96.2%, 98.3%, and 93.1%, and F1-scores of 92.4%, 82.9%, 75.0%, and 93.3%.
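The TF-IDF vectorization mentioned above can be sketched in a few lines. This version uses the plain weighting tf × log(N / df) on whitespace tokens, which is an assumption made for brevity: library implementations typically add smoothing and normalization, and the authors' exact preprocessing is not described here. The toy documents are invented.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented titles standing in for paper titles and abstracts.
docs = ["drug liver injury", "drug trial", "drug liver report"]
vectors = tfidf(docs)
```

Terms that occur in every document, such as "drug" in this toy corpus, receive a weight of zero, so the resulting vectors emphasize the words that distinguish one paper from another before they are passed to a classifier.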

11:50-12:10
The CAMDA Contest Challenges 2022: TextNetTopics Combined with Random Forest Applied on Drug-induced Liver Injury (DILI) Literature
Room: EH
Format: Live-stream

Moderator(s): Wenzhong Xiao

  • Malik Yousef, Zefat College, Israel
  • Daniel Voskergian, Al-Quds University, Palestine


Presentation Overview:

In this study, TextNetTopics was used to detect significant topics. Each topic is a set of words detected by the LDA approach. A random forest is then trained on the top k topics and tested on the test part of the data. Moreover, to improve the model's performance and deal with unbalanced test/validation data, we suggest using the probability distribution to pick the discriminating threshold that yields improved performance. The results showed that we could improve the performance by about 5% to 12% when applying the threshold tuning technique; the outcome on the validation datasets indicates that TextNetTopics is a stable tool.
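The threshold tuning step can be illustrated with a small sketch that scans candidate cut-offs over predicted probabilities and keeps the one with the best F1-score. Selecting by F1 and restricting candidates to the observed probabilities are assumptions made for this example; the abstract does not state which criterion the authors used, and the data are invented.

```python
def f1_at_threshold(probabilities, labels, threshold):
    """F1-score obtained when predicting positive for probabilities >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 1)
    fp = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 1 and y == 0)
    fn = sum(1 for y_hat, y in zip(predicted, labels) if y_hat == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probabilities, labels):
    """Return the observed probability that maximizes F1 when used as the cut-off."""
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: f1_at_threshold(probabilities, labels, t))


# Invented classifier outputs on a held-out tuning set.
probs = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]
tuned = best_threshold(probs, truth)
```

The threshold should be tuned on data that is kept separate from the final test set; otherwise the reported improvement would be optimistic.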

12:10-12:30
Comparative analysis of information-theory-based statistical methods and transformer-based machine learning techniques for scientific literature classification
Room: EH
Format: Live-stream

Moderator(s): Wenzhong Xiao

  • Ihor Stepanov, Institute of Molecular Biology and Genetics of NASU, Ukraine
  • Alina Frolova, Institute of Molecular Biology and Genetics of NASU, Ukraine
  • Arsentii Ivasiuk, Bogomoletz Institute of Physiology, Ukraine
  • Stanislav Zubenko, Institute of Molecular Biology and Genetics of NASU, Ukraine


Presentation Overview:

Scientific literature grows very fast. One of the first studies regarding scientific literature production was conducted by De Solla Price, who used publication data collected over the 100 years (1862–1961) to calculate a doubling time. The results showed 13.5 years for doubling the scientific corpus with a 5.1% annual growth rate (de Solla Price, 1965). The development of technologies created conditions for scientific literature production, which made scientific information more accessible and introduced new challenges.
Our research focuses on the biomedical domain, which is one of the largest and most rapidly developing. Accessibility of biomedical literature through databases such as Medline (Medline, 2021) and research activity in biomedicine creates an opportunity to use natural language processing (NLP) techniques.
We implemented an information-theory-based statistical approach and compared it with modern transformers on a relevant practical task: classifying biomedical papers related to Drug-Induced Liver Injury (DILI) as part of the CAMDA 2022 Challenge 1. DILI is a clinically significant condition and is one reason for drug registration failures. Scientific literature is the primary source of information related to DILI. Thus, collecting and processing vast amounts of biomedical literature can help pharma companies, research organizations, and regulators to find relevant information.

14:30-14:50
DeSIDE-DDI: Interpretable prediction of drug-drug interactions using drug-induced gene expressions
Room: EH
Format: Live from venue

Moderator(s): David Kreil

  • Eunyoung Kim, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea
  • Hojung Nam, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea


Presentation Overview:

Adverse drug-drug interaction (DDI) is a major concern in polypharmacy due to its unexpected adverse side effects and must be identified at an early stage of drug discovery and development. Many computational methods have been proposed for this purpose, but most require specific types of information or pay little attention to interpretation in terms of the underlying genes. We propose a deep learning-based framework for DDI prediction with drug-induced gene expression signatures so that the model can provide expression-level interpretability for DDIs. The model engineers dynamic drug features using a gating mechanism that mimics the co-administration effects by imposing attention on genes. Also, each side effect is projected into a latent space through translating embedding. As a result, the model achieved an AUC of 0.889 and an AUPR of 0.915 in unseen interaction prediction, which is highly accurate and outperforms other state-of-the-art methods. Furthermore, it can predict potential DDIs with new compounds not used in training. In conclusion, using drug-induced gene expression signatures followed by gating and translating embedding can increase DDI prediction accuracy while providing model interpretability. The source code is available on GitHub (https://github.com/GIST-CSBL/DeSIDE-DDI).

14:50-15:30
AI-based Filter of Drug-induced Liver Injury Publications with Natural Language Processing and Conformal Prediction
Room: EH
Format: Live-stream

Moderator(s): David Kreil

  • Xianghao Zhan, Stanford University, United States
  • Fanjin Wang, University College London, United Kingdom
  • Olivier Gevaert, Stanford University, United States


Presentation Overview:

Drug-induced liver injury (DILI) is an adverse effect of drugs that may lead to acute liver failure. Conventionally, screening the large corpus of publications to label DILI-related reports is carried out manually, which limits the processing speed. To accelerate this process, we developed models to filter the DILI literature using the titles and abstracts of around 14,000 papers provided by the Critical Assessment of Massive Data Analysis challenge. Among five text embedding techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed the others, with an accuracy of 0.953 on the internal validation set. Furthermore, an ensemble model with similar overall performance was developed by fitting a logistic regression model on the predicted probabilities given by separate models with different vectorization techniques. The prediction reliability was quantified with conformal prediction, offering users control over the prediction uncertainty. Overall, the ensemble model and the TF-IDF model reached satisfactory classification results, while the TF-IDF model has a significantly lower computational cost. The two models showed high performance on seven hold-out datasets, with the TF-IDF model ranking first on the leaderboard on four tasks. This model is promising enough to be used by researchers to rapidly filter literature describing liver injury induced by medications.
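The way conformal prediction gives users control over uncertainty can be sketched as follows. This is a minimal inductive variant that assumes a nonconformity score of one minus the predicted probability and a single calibration set shared by both labels; the abstract does not specify the authors' exact setup, and the calibration scores and label names below are invented.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of calibration nonconformity scores at least as large as the test score."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration_scores, class_probabilities, significance):
    """Keep every label whose p-value exceeds the chosen significance level."""
    kept = []
    for label, prob in class_probabilities.items():
        # Nonconformity: how little probability the model gives to this label.
        if conformal_p_value(calibration_scores, 1.0 - prob) > significance:
            kept.append(label)
    return kept


# Invented nonconformity scores (1 - probability of the true label) from calibration papers.
calibration = [0.05, 0.1, 0.2, 0.3, 0.6, 0.25, 0.15, 0.4, 0.08]
paper_probs = {"DILI": 0.9, "non-DILI": 0.1}
confident = prediction_set(calibration, paper_probs, significance=0.2)
cautious = prediction_set(calibration, paper_probs, significance=0.05)
```

Lowering the significance level makes the prediction sets larger: a paper that receives a single label at 0.2 may receive both labels at 0.05, which signals that it should be reviewed manually.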

16:00-16:20
Proceedings Presentation: High-sensitivity pattern discovery in large, paired multi-omic datasets
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • Andrew Ghazi, Broad Institute, United States
  • Kathleen Sucipto, Harvard T.H. Chan School of Public Health, United States
  • Ali Rahnavard, Harvard T.H. Chan School of Public Health, United States
  • Eric Franzosa, Harvard T.H. Chan School of Public Health, United States
  • Lauren McIver, Harvard T.H. Chan School of Public Health, United States
  • Jason Lloyd-Price, Harvard T.H. Chan School of Public Health, United States
  • Emma Schwager, Harvard T.H. Chan School of Public Health, United States
  • George Weingart, Harvard T.H. Chan School of Public Health, United States
  • Yo Sup Moon, Harvard T.H. Chan School of Public Health, United States
  • Xochitl Morgan, University of Otago, New Zealand
  • Levi Waldron, City University of New York Graduate School of Public Health and Health Policy, United States
  • Curtis Huttenhower, Harvard T.H. Chan School of Public Health, United States


Presentation Overview:

Motivation: Modern biological screens yield enormous numbers of measurements, and identifying and interpreting statistically significant associations among features is essential. In experiments featuring multiple high-dimensional datasets collected from the same set of samples, it is useful to identify groups of associated features between the datasets in a way that provides high statistical power and false discovery rate control.
Results: Here, we present a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for structured association discovery between paired high-dimensional datasets. HAllA efficiently integrates hierarchical hypothesis testing with false discovery rate correction to reveal significant linear and non-linear block-wise relationships among continuous and/or categorical data. We optimized and evaluated HAllA using heterogeneous synthetic datasets of known association structure, where HAllA outperformed all-against-all and other block testing approaches across a range of common similarity measures. We then applied HAllA to a series of real-world multi-omics datasets, revealing new associations between gene expression and host immune activity, the microbiome and host transcriptome, metabolomic profiling, and human health phenotypes.
Availability: An open-source implementation of HAllA is freely available at http://huttenhower.sph.harvard.edu/halla along with documentation, demo datasets, and a user group.
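As background for the false discovery rate control mentioned above, the sketch below implements the standard, non-hierarchical Benjamini-Hochberg procedure. It is not HAllA's hierarchical method; it only illustrates the kind of correction that an all-against-all baseline would apply, and the p-values are invented.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Invented p-values from four pairwise association tests.
few_hits = benjamini_hochberg([0.001, 0.04, 0.03, 0.2], alpha=0.05)
more_hits = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], alpha=0.05)
```

With many feature pairs, each extra test tightens the thresholds, which is the loss of power that block-wise testing aims to reduce.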

16:20-16:40
CAMDA Trophy ceremony
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • David Kreil
16:40-17:40
Panel: CAMDA Caffee
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • Wenzhong Xiao
17:40-18:00
CAMDA summary and closing remarks
Room: EH
Format: Live from venue

Moderator(s): Paweł Łabaj

  • Joaquin Dopazo