Presentation Overview: Show
Over the last 20 years the kind of "data wrangling" research I do has been called information technology, then bioinformatics, and more recently health data science - with all three names being open to subjective interpretation. Considerable funding is currently directed towards health data science, which many take to mean "the use of data to derive new knowledge or utility". But in my view such activities are not really health data science, but simply health research. In contrast, I contend that health data science entails a virtuous circle that connects novel methods research to innovative data engineering in order to access and leverage especially large or challenging datasets in new ways (thereby identifying the need for further new methods). Some examples of this virtuous circle will be presented, in the context of making valuable and sensitive assets (data, patients, samples, etc) responsibly discoverable and thereby more widely used.
Motivation: During lead compound optimization, it is crucial to identify the pathways in which a drug-like compound is metabolized. Recently, machine learning-based methods have made encouraging progress in predicting potential metabolic pathways for drug-like compounds. However, they neglect the fact that metabolic pathways depend on each other. Moreover, they cannot adequately explain why compounds participate in specific pathways.
Results: To address these issues, we propose a novel multi-label graph learning framework for metabolic pathway prediction boosted by pathway inter-dependence, called MLGL-MP, which contains a compound encoder, a pathway encoder, and a multi-label predictor. The compound encoder learns compound embedding representations with graph neural networks (GNNs). After constructing a pathway dependence graph from re-trained word embeddings and pathway co-occurrences, the pathway encoder learns pathway embeddings with graph convolutional networks (GCNs). Moreover, after adapting the compound embedding space to the pathway embedding space, the multi-label predictor measures the proximity between the two spaces to determine which pathways a compound participates in. A comparison with state-of-the-art methods on KEGG pathways demonstrates the superiority of MLGL-MP. In addition, ablation studies reveal how three components contribute to the model: the pathway dependence, the adapter between compound embeddings and pathway embeddings, and the pre-training strategy. Furthermore, a case study illustrates the interpretability of MLGL-MP by indicating crucial substructures in a compound that are significantly associated with the corresponding metabolic pathways. We anticipate that this work can boost metabolic pathway prediction in drug discovery.
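The multi-label prediction step described above can be illustrated with a minimal sketch. It assumes a linear adapter and a sigmoid over dot products as the proximity measure; the abstract does not specify either choice, and all names, dimensions, and numbers below are hypothetical toy values rather than the MLGL-MP implementation.

```python
import math

def adapt(compound_vec, weight):
    """Project a compound embedding into the pathway embedding space.
    A plain linear map is an assumption; the abstract only says an adapter is used."""
    return [sum(w * x for w, x in zip(row, compound_vec)) for row in weight]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_pathways(compound_vec, weight, pathway_embs, threshold=0.5):
    """Score every pathway independently (multi-label) by the dot product
    between the adapted compound embedding and the pathway embedding."""
    adapted = adapt(compound_vec, weight)
    scores = {
        name: sigmoid(sum(a * p for a, p in zip(adapted, emb)))
        for name, emb in pathway_embs.items()
    }
    labels = {name for name, s in scores.items() if s >= threshold}
    return scores, labels

# Hypothetical toy inputs: a 3-d compound embedding, a 2x3 adapter matrix,
# and 2-d pathway embeddings (which would come from the pathway encoder).
compound = [1.0, 0.0, 2.0]
W = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
pathways = {
    "lipid": [1.0, 1.0],
    "energy": [-1.0, -1.0],
    "amino_acid": [1.0, -1.0],
}

scores, labels = predict_pathways(compound, W, pathways)
print(sorted(labels))
```

Because each pathway is scored independently, a compound can be assigned to zero, one, or several pathways, which is what distinguishes multi-label prediction from ordinary multi-class classification.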
Determining the contribution of the tumour microenvironment (TME) to tumour progression and resistance has proven a complex challenge due to its heterogeneity. Multiplexed imaging provides an unprecedented opportunity for studying the interaction between cancer cells and the TME. We utilised a multiplexed tissue imaging dataset of 746 colorectal tumours from different stages. Each tumour section was stained with 60 markers simultaneously to visualise immune and stromal cells as well as key cancer signalling pathways. By implementing image analysis and segmentation approaches, around 3000 cells were quantified per tumour, yielding data on ~3 million single cells. We performed compartmentalised image analysis to determine signalling activities in cancer, stromal, and immune cells. We developed a graph convolutional network (GCN) and visualisation approach to determine cellular subpopulations associated with patient survival. Using our approach, we found that mTOR pathway signalling can have heterogeneous activation patterns in different TME compartments, which correlate with different patient outcomes. We further validated our observations using transcriptional data from TCGA. Our findings can have a significant impact on the design of mTOR-based therapies and future clinical trials. This demonstrates the utility of GCNs in determining clinically relevant signatures and biomarkers from heterogeneous single-cell imaging data.
Immune checkpoint blockade (ICB) has revolutionized cancer treatment, but its low response rate and high resistance remain a problem. Here, we report a novel algorithm to reliably predict chemo-ICB synergism for overcoming ICB resistance, termed Perturbed Transcriptome-based Synergism Prediction for ICB-Chemotherapy Combinations (PerTSynIC). Through a clinical response-guided feature selection procedure, we established that treatment-induced gene expression changes (TECs) are among the major determinative phenotypes for anti-PD1 response in melanoma. By integrating one million perturbed transcriptomes of cancer cell lines treated with ten thousand genetic and pharmacological inhibitors from high-throughput screening studies, PerTSynIC identified chemo-/targeted agents that can induce a TEC shift between anti-PD1 non-responders and responders. These agents include MEKi, HDACi and CDKi, whose synergism with ICBs has been reported in clinical practice. PerTSynIC characterized 23 top synergy target genes whose genetic and pharmacological inhibition share a consistent TEC shift ability in melanoma. Among these genes, PAK4 and its pharmacological inhibitors were identified. An in vitro assay validated that treating the melanoma cell line MEL526 with PAK inhibitors can induce significant dose-dependent activation of antigen processing/presentation and type II interferon signaling. Our study provides a reliable prediction method for chemo-ICB synergism, which will help cancer patients better cope with immunotherapy resistance.
Severity evaluation is crucial in clinical settings for assessing patients' prognosis. Severity calculators are used to estimate survival chances and to optimize patient treatment and resource allocation, notably in intensive care units (ICUs). In this work, we present a novel method for applying Test Time Augmentation (TTA) to tabular data. We used TTA along with an ensemble of 42 models to achieve superior performance on the dataset of the MIT Global Open Source Severity of Illness Score (GOSSIS) initiative, comprising 131,051 ICU visits and outcomes. This method achieved an AUC of 0.915 on the private test set (19,669 admissions) and won first place at Stanford's WiDS Datathon 2020 challenge on Kaggle, while the widely used Acute Physiology and Chronic Health Evaluation (APACHE) IV model achieved an AUC of 0.868. In addition to improving predictions of patient risk, our method also reduces “unfair” bias
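The general idea of test-time augmentation on tabular data can be sketched as follows. The abstract does not say how rows are augmented, so the Gaussian feature noise used here is purely an assumption (and presumes standardized numeric features); `toy_model`, `noise_scale`, and `n_aug` are hypothetical names and values, not the authors' method.

```python
import math
import random

def tta_predict(model, row, n_aug=20, noise_scale=0.05, seed=0):
    """Average a model's predictions over the original row and n_aug noisy copies.
    Gaussian perturbation is an assumed augmentation, not the one from the abstract."""
    rng = random.Random(seed)
    preds = [model(row)]  # always include the unperturbed row
    for _ in range(n_aug):
        noisy = [x + rng.gauss(0.0, noise_scale) for x in row]
        preds.append(model(noisy))
    return sum(preds) / len(preds)

def ensemble_tta(models, row, **kwargs):
    """Average the TTA predictions of several models."""
    return sum(tta_predict(m, row, **kwargs) for m in models) / len(models)

def toy_model(row):
    """Hypothetical stand-in for a trained risk model: a fixed logistic function."""
    return 1.0 / (1.0 + math.exp(-(0.8 * row[0] - 0.5 * row[1])))

row = [1.2, -0.3]  # one patient's (standardized) features, made up for illustration
risk = ensemble_tta([toy_model, toy_model], row, n_aug=20, seed=1)
print(round(risk, 3))
```

Averaging over perturbed copies smooths the prediction around each input, which is the same motivation as TTA for images, where the perturbations are flips or crops.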
Variation in lamin A/C (LMNA) results in a spectrum of clinical disease, including arrhythmias and cardiomyopathy. Known benign variation is rare, and current in silico predictions have limited utility in driving ACMG classification of LMNA missense variants. Our study of a family with inherited conduction system disease revealed a novel segregating missense variant, p.Asp136Glu, initially reported as a VUS by a commercial testing company. Additional familial analysis and in vitro testing enabled classification of the variant as likely pathogenic per ACMG guidelines. However, extended familial analysis is not always feasible, leaving clinicians with little genetic guidance beyond the presence of a missense variant. This prompted the development of an ML algorithm to aid clinical interpretation of LMNA missense variants. While insufficient known benign variation exists to create an ML classifier, unsupervised clustering of previously observed variants in gnomAD and ClinVar using UMAP and K-means identified three clusters with significantly different proportions of reported pathogenic/likely pathogenic variants (38.8%, 15.0%, and 6.1%). We anticipate that these findings can be translated to clinical use by guiding the treatment of patients with a VUS present in a cluster enriched for pathogenicity, and that the approach may prove useful in other genes where classification is difficult.
Prostate cancer (PrCa) is one of the most genetically driven solid cancers, with heritability estimates as high as 57%. African American men are at an increased risk of PrCa; however, current risk prediction models are based on European ancestry groups and may not be broadly applicable. In this study, we define an African ancestry group of 4,533 individuals to develop an African ancestry-specific PrCa polygenic risk score (PRState). We identified risk loci on chromosomes 3, 8, and 11 in the African ancestry group GWAS and constructed a polygenic risk score (PRS) from 10 African ancestry-specific PrCa risk SNPs, achieving an AUC of 0.61 [0.60-0.63] alone and 0.65 [0.64-0.67] when combined with age and family history. Performance dropped significantly when using ancestry-mismatched PRS models but remained comparable when using trans-ancestry models. Importantly, we validated the PRState score in the Million Veteran Program, demonstrating improved prediction of PrCa and metastatic PrCa in African American individuals. This study underscores the need for inclusion of individuals of African ancestry in gene variant discovery to optimize PRS.
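As a rough illustration of how a polygenic risk score and its AUC are typically computed, the sketch below sums per-SNP effect weights multiplied by risk-allele dosages and then estimates the AUC as the fraction of case-control pairs ranked correctly. The SNP names, weights, and genotypes are invented for illustration and are not the PRState SNPs or effect sizes.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per SNP)."""
    return sum(weights[snp] * dosages.get(snp, 0) for snp in weights)

def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control
    (ties count as half), i.e. the Mann-Whitney formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-SNP weights (e.g. log odds ratios from a GWAS).
weights = {"snp_a": 0.2, "snp_b": 0.5, "snp_c": 0.1}

# Hypothetical individuals: risk-allele dosages and case (1) / control (0) status.
cohort = [
    ({"snp_a": 2, "snp_b": 1, "snp_c": 0}, 1),
    ({"snp_a": 0, "snp_b": 0, "snp_c": 1}, 0),
    ({"snp_a": 1, "snp_b": 0, "snp_c": 0}, 1),
    ({"snp_a": 1, "snp_b": 0, "snp_c": 2}, 0),
]

prs = [polygenic_risk_score(d, weights) for d, _ in cohort]
status = [y for _, y in cohort]
print(auc(prs, status))  # 3 of the 4 case-control pairs are ranked correctly
```

An ancestry-mismatched model corresponds to using `weights` estimated in one population to score individuals from another, which is the setting in which the abstract reports a drop in performance.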
Knowledge of a patient’s tumor type is essential for guiding clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly indicative of tumor type and can be used to build classifiers that predict diagnoses, but most genomic classification methods use WGS data, which is not feasible for widespread clinical implementation at present. MSK-IMPACT is an FDA-approved clinical sequencing fixed-panel assay that reports genomic alterations including mutations, indels and copy number alterations across 468 cancer-associated genes, and has sequenced over 65,000 Memorial Sloan Kettering patients to date. We use genomic features from this large dataset to develop Deep Genome-Derived-Diagnoses (GDD-NN): a deep-ensemble tumor type classifier. GDD-NN achieves 78.6% accuracy across 40 common cancer types, outperforming similar models. For MSK-IMPACT patients with rarer cancers, we implement out-of-distribution (OOD) detection using ensemble-based features, which classifies OOD samples (AUC = 0.94) without explicitly training on them. For patients where non-genomic information might inform predictions, we implement a prediction-specific adaptive prior and report improved accuracy after adjusting predictions given the sample biopsy site. Overall, integrating GDD-NN into the well-established MSK-IMPACT pipeline will enable clinically relevant tumor type predictions that can guide treatment decisions in real time at an institutional level.
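Two of the ideas above, ensemble-based out-of-distribution features and prior adjustment, can be sketched in a few lines. The abstract does not state which ensemble features or which prior formulation GDD-NN uses, so the entropy of the averaged prediction and the simple multiply-and-renormalize prior below are assumptions, and all probabilities are made-up toy values.

```python
import math

def ensemble_mean(member_probs):
    """Average class probabilities across ensemble members."""
    n = len(member_probs)
    return [sum(col) / n for col in zip(*member_probs)]

def entropy(probs):
    """Predictive entropy; higher values mean a less confident ensemble,
    which is one possible (assumed) out-of-distribution feature."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adjust_with_prior(probs, site_prior):
    """Reweight class probabilities by a biopsy-site prior and renormalize
    (a simplified Bayes-style adjustment, assumed for illustration)."""
    weighted = [p * q for p, q in zip(probs, site_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Three hypothetical ensemble members over three tumor types.
agreeing = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.85, 0.10, 0.05]]
disagreeing = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

in_dist_score = entropy(ensemble_mean(agreeing))
ood_score = entropy(ensemble_mean(disagreeing))
print(in_dist_score < ood_score)  # member disagreement raises the OOD score

# A hypothetical prior for one biopsy site shifts the top prediction.
adjusted = adjust_with_prior([0.5, 0.4, 0.1], [0.2, 0.7, 0.1])
print(adjusted.index(max(adjusted)))
```

Thresholding a score like `ood_score` flags samples without ever training on rare cancer types, which matches the abstract's description of detecting OOD samples "without explicitly training on them".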
Motivation: A critical element of drug development is the identification of therapeutic targets for diseases. However, the depletion of therapeutic targets is a serious problem.
Results: In this study, we propose the novel concept of target repositioning, an extension of the concept of drug repositioning, to predict new therapeutic targets of various diseases. Predictions were performed by a trans-disease analysis which integrated genetically perturbed transcriptomic signatures (knock-down of 4,345 genes and over-expression of 3,114 genes) and disease-specific gene transcriptomic signatures of 79 diseases. The trans-disease method, which takes into account similarities among diseases, enabled us to distinguish the inhibitory from activatory targets, and to predict the therapeutic targetability of not only proteins with known target–disease associations, but also orphan proteins without known associations. Our proposed method is expected to be useful for understanding the commonality of mechanisms among diseases and for therapeutic target identification in drug discovery.
Availability: Supplemental information and software are available at the following website [http://labo.bio.kyutech.ac.jp/~yamani/target_repositioning/].
Contact: yamani@bio.kyutech.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
The presence of tumor cell clusters in pleural effusion may be a signal of cancer metastasis. The instance segmentation of single cells from cell clusters plays a pivotal role in cluster cell analysis. However, current cell segmentation methods perform poorly on cluster cells due to the overlapping/touching nature of clusters, the multiple-instance properties of cells, and the poor generalization ability of the models. In this paper, we propose a contour constraint instance segmentation framework (CC framework) for cluster cells based on a cluster cell combination enhancement module. The framework can accurately locate each instance within cluster cells and achieve high-precision contour segmentation from only a few samples. Specifically, we propose the contour attention constraint (CAC) module to alleviate over-segmentation and under-segmentation at the boundaries between individual cell instances. In addition, to evaluate the framework, we construct a pleural effusion cluster cell dataset including 197 high-quality samples. The quantitative results show that the mask AP is greater than 90%, an increase of more than 10% compared with state-of-the-art semantic segmentation algorithms. The qualitative results show that our method rarely produces segmentation errors.
The 1H-NMR metabolomics platform is rapidly gaining popularity in epidemiological research, as it provides a reproducible and cost-effective assessment of the blood metabolome. We will illustrate how we used 1H-NMR metabolomics data from a commercial platform to successfully predict 19 out of 20 routinely assessed clinical variables using a logistic ElasticNET. We will detail how these models were trained and evaluated within the 26 biobanks participating in BBMRI-nl (~26,000 samples). We will continue by showing that these surrogates can be used to impute missing phenotypic information in external cohorts. Moreover, we will demonstrate that these metabolic surrogates can be used as substitutes for partially or completely unobserved confounders in association studies (Metabolome- or Transcriptome-Wide Association studies) and show that the metabolic surrogates themselves can be used as novel biomarkers, by presenting significant associations with incident all-cause mortality in the elderly population. Finally, we will present our new R-shiny tool (MiMIR), which can compute new and previously published multivariate metabolomics models in other cohorts with 1H-NMR metabolomics, calibrate their predicted values using Platt’s method, and compare the uploaded Nightingale metabolomics quantifications to the metabolites’ distributions observed in BBMRI-nl.
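Platt's method, mentioned above for calibrating predicted values, fits a one-dimensional logistic regression that maps raw model scores to probabilities. The sketch below is a minimal plain-Python version fitted by gradient descent on made-up scores; it omits the target smoothing from Platt's original formulation, uses the sign convention p = 1 / (1 + exp(-(a * s + b))), and is not the MiMIR implementation (which is an R tool).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit slope a and intercept b of p = sigmoid(a * score + b)
    by gradient descent on the mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Map a raw model score to a calibrated probability."""
    return sigmoid(a * score + b)

# Hypothetical uncalibrated scores and observed binary outcomes (not separable).
raw_scores = [-2.0, -1.0, -0.5, -0.2, 0.2, 0.5, 1.0, 2.0]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, outcomes)
print(calibrate(2.0, a, b) > calibrate(-2.0, a, b))
```

In practice the calibration parameters are fitted on held-out data from the target cohort, so that a surrogate trained in one biobank yields sensible probabilities in another.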
Selecting the right immunosuppressant to ensure rejection-free outcomes poses unique challenges in pediatric liver transplant (LT) recipients. A molecular predictor can comprehensively address these challenges. Currently, there are no well-validated blood-based biomarkers for pediatric LT recipients either pre- or post-LT. Here, we discover and validate separate pre- and post-LT transcriptomic signatures of rejection. Using an integrative machine learning approach, we combine transcriptomic data with the reference high-quality human protein interactome to identify network module signatures, which underlie rejection. Unlike gene signatures, our approach is inherently multivariate, more robust to replication and captures the structure of the underlying network, encapsulating additive effects. We also identify, in a patient-specific manner, signatures that can be targeted by current anti-rejection drugs and other drugs that can be repurposed. Overall, our approach can enable personalized adjustment of drug regimens for the dominant targetable pathways in pre- and post-LT in children.
Spatial molecular data provide unprecedented characterization of the cellular and molecular architecture of human tissue and disease. These technologies are particularly important for cancer immunotherapy, in which the interactions between diverse cell types mediate therapeutic response and resistance. In this talk, we describe how spatial molecular data enable us to uncover mechanisms of therapeutic response and resistance in a liver cancer immunotherapy clinical trial. Moreover, we present our new analysis approach, SpaceMarkers, which infers molecular changes arising from cell-cell interactions through latent space analysis of spatial transcriptomics (ST) data from this trial. Transfer learning in matched scRNA-seq data further enabled quantification of the specific cell types in which SpaceMarkers are enriched. Altogether, SpaceMarkers can identify location- and context-specific molecular interactions within the TME from ST data.
Sepsis is a leading cause of death and disability in children globally, accounting for approximately three million childhood deaths per year. In pediatric sepsis patients, multiple organ dysfunction syndrome (MODS) is considered a significant risk factor for adverse clinical outcomes, characterized by high mortality and morbidity in the pediatric intensive care unit (PICU). The recent rapid growth in the availability of electronic health records (EHRs) has allowed researchers to develop data-driven approaches such as machine learning in healthcare, with considerable success. However, effective machine learning models that can accurately predict, at an early stage, the recovery of pediatric sepsis patients from MODS to a mild state, and thus assist clinicians in the decision-making process, are still lacking.
This study develops a machine learning-based approach to predict recovery from MODS to zero or single organ dysfunction (Z/SOD) one week in advance in the Swiss Pediatric Sepsis Study (SPSS) cohort of children with blood-culture confirmed bacteremia. Our model achieves internal validation performance on the SPSS cohort with an AUROC of 79.1 and an AUPRC of 73.6, and it was also externally validated on another pediatric sepsis patient cohort collected in the U.S., yielding an AUROC of 76.4 and an AUPRC of 72.4. These results indicate that our model has the potential to be integrated into EHR systems and to contribute to patient assessment and triage in pediatric sepsis care.
Motivation: Advances in bioimaging now permit in-situ proteomic characterization of cell-cell interactions in complex tissues, with important applications across a spectrum of biological problems from development to disease. These methods depend on selection of antibodies targeting proteins that are expressed specifically in particular cell types. Candidate marker proteins are often identified from single-cell transcriptomic data, with variable rates of success, in part due to divergence between expression levels of proteins and the genes that encode them. In principle, marker identification could be improved by using existing databases of immunohistochemistry for thousands of antibodies in human tissue, such as the Human Protein Atlas. However, these data lack detailed annotations of the types of cells in each image.
Results: We develop a method to predict cell type specificity of protein markers from unlabeled images. We train a convolutional neural network with a self-supervised objective to generate embeddings of the images. Using nonlinear dimensionality reduction, we observe that the model clusters images according to cell types and anatomical regions for which the stained proteins are specific. We then use estimates of cell type specificity derived from an independent single-cell transcriptomics dataset to train an image classifier, without requiring any human labelling of images. Our scheme demonstrates superior classification of known proteomic markers in kidney compared to differential expression in single-cell transcriptomics.
Endometriosis is a disorder in which endometrial tissue is implanted outside of the uterus. Endometriosis affects 5–10% of all women of reproductive age yet is under-diagnosed. This research aims to develop an endometriosis prediction model using multiple inputs from the UK Biobank (UKBB). The data were split into those with a diagnosis of endometriosis (5,924; ICD-10: N80) and the rest (142,576). Over 1000 variables were used, including personal information regarding female health, lifestyle, self-reported data, genetic variants, and medical history prior to the endometriosis diagnosis. An endometriosis prediction model was developed using machine learning (ML) algorithms. CatBoost's gradient boosting methods produced the best prediction for the data-combined model, with an area under the ROC curve (ROC-AUC) of 0.78. We found that prior to being diagnosed with endometriosis, women had significantly more ICD-10 diagnoses than the average unaffected woman. Irritable bowel syndrome (IBS) and the length of the menstrual cycle were among the most informative variables ranked by SHAP values. Despite the limitations of missing data and noisy medical input, we conclude that the UKBB's large population-based retrospective data are useful for the development of predictive models. The informative features extracted from the model may improve the clinical utility of endometriosis diagnosis.
Cetuximab or bevacizumab combined with chemotherapy are approved regimens for first-line treatment of metastatic colorectal cancer (mCRC). However, the as yet unknown biological pathways perturbed by these therapies may be responsible for the large variation observed in therapeutic responses. To elucidate these mechanisms, we used tumor RNA-seq and germline genotype data from 1,284 mCRC patients treated with cetuximab/bevacizumab in a randomized phase III trial (CALGB/SWOG-80405). We applied a novel integrative approach and identified treatment-specific putative causal biomarkers and gene regulatory pathways impacting overall survival (OS). This analysis accounted for confounders using Mendelian randomization, and the findings were reproduced in replication sets. To gain insight into their biological functions, we evaluated the relationships of the causal gene regulatory pathways with immune features estimated from RNA-seq data. Our study suggested a potentially important role for the interaction between RELT and MYO1G related to a tumor cell escape mechanism under cetuximab therapy. We identified a pathway with a common function in DNA damage and repair and a pathway highly correlated with cytotoxicity signatures, impacting the response to cetuximab and bevacizumab, respectively. Moreover, SCD5, with a causal effect on the OS of patients treated with bevacizumab, highlighted the possible risk of dyslipidemia in the use of VEGF inhibitors.
Support: U10CA180821, U10CA180882, U24CA196171, https://acknowledgments.alliancefound.org; U10CA180888 (SWOG); Lilly, Genentech, and Pfizer; ClinicalTrials.gov Identifier: NCT00265850
Different omics techniques allow us to characterise the genetic, transcriptomic, epigenomic and proteomic landscapes of cells and tissues, and better understand their perturbations in disease. However, joint analyses of different omics datasets for a holistic understanding of cell function present a computational challenge. We recently developed ActivePathways, an integrative pathway enrichment analysis method that uses data fusion to merge signals from multiple omics datasets, prioritizes genes and pathways through p-value merging, and evaluates their contribution from individual input datasets. Here we extend this computational framework to account for directional activities of genes and proteins across the input omics datasets. For example, fold-change in protein expression would be expected to associate positively with mRNA change of the corresponding gene, while DNA methylation change of the gene promoter would be expected to associate negatively. We extend our method to encode such directional interactions and penalize genes and proteins where such assumptions are violated. We demonstrate the approach by integrating cancer RNA-seq, DNA methylation, and proteomics datasets in the CPTAC and TCGA projects, in which we uncover novel candidate biomarkers and pathways that have been previously overlooked in the analysis of individual datasets.
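The idea of merging p-values while penalizing directional conflicts can be illustrated with a small sketch built on Fisher's method. The directional variant below is one possible encoding, written only to show the principle; it should not be read as the exact ActivePathways formula, and the p-values and directions are toy inputs. Each dataset carries an observed direction (+1 up, -1 down) and an expected sign relative to a reference dataset (for example +1 for mRNA versus protein, -1 for promoter methylation).

```python
import math

def chi2_sf_even_df(x, k):
    """Survival function of a chi-square distribution with 2k degrees of
    freedom, using the closed form that exists for even degrees of freedom."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def fisher_merge(pvals):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under the null."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even_df(stat, len(pvals))

def directional_merge(pvals, observed, expected):
    """Assumed directional variant: evidence from datasets whose direction
    conflicts with the expected relationship cancels instead of accumulating."""
    signed = sum(o * e * -math.log(p) for p, o, e in zip(pvals, observed, expected))
    stat = 2.0 * abs(signed)
    return chi2_sf_even_df(stat, len(pvals))

pvals = [0.01, 0.01]   # toy p-values: mRNA, promoter methylation
expected = [1, -1]     # mRNA should agree with the reference, methylation should oppose it

consistent = directional_merge(pvals, observed=[1, -1], expected=expected)
conflicting = directional_merge(pvals, observed=[1, 1], expected=expected)
print(consistent < conflicting)
```

When all datasets behave as expected, the directional statistic reduces to Fisher's statistic, so directionally consistent genes are scored exactly as before and only conflicting genes are down-weighted.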
Genomic medicine positively impacts pediatric care, from rapid diagnosis of genetic disorders to optimization of childhood cancer treatments. The Computational Genomics Group at Nationwide Children’s Hospital supports multiple translational research protocols, combining genomics and bioinformatics to improve patient outcomes. Our “Genomics of Rare Disease” protocol has utilized genome sequencing and novel bioinformatics approaches to find answers for patients with undiagnosed disease, revealing novel disease mechanisms. Through the application of cloud computing technologies, optimized bioinformatics pipelines, and an Apache Spark-backed variant warehouse, our “Rapid Genome Sequencing” protocol returns results for infants in the ICU within 48 hours. The “Cancer Protocol” provides extensive genomic profiling, positively impacting patient diagnosis, prognosis, and therapy. Finally, the Molecular Characterization Initiative is a new screening approach to identify therapeutic vulnerabilities in pediatric cancers, by performing genomic, transcriptomic and epigenomic characterization of thousands of pediatric cancer samples nationwide. All genomic data is generated, analyzed, and interpreted within 14 days. Deidentified genomic and clinical data is submitted to dbGaP within minutes of a case being completed. Together with sharing data from our translational protocols, we are creating a community resource that will enable wide-scale engagement of the translational bioinformatics community to help solve the puzzle of genetic disease.