DREAM POSTERS




Updated Nov 3, 2014



Rheumatoid Arthritis Responder Challenge Posters
...............................................................................................................................
DREAM P01: A generic method for predicting clinical outcomes and drug response and its application in the RA challenge

Fan Zhu1 and Yuanfang Guan1,2,3

1University of Michigan

We developed a Gaussian Process Regression (GPR)-based model to predict clinical outcomes and drug response. We applied it to both the genetics-only task and the combined genetics + clinical information task in the DREAM Rheumatoid Arthritis Outcome Responder Challenge. It achieved the top accuracy on both the leaderboard and the final, previously unseen test set, for both change in disease severity (ΔDAS) and prediction of non-response to treatment. We will present the rationale and method of this approach and elaborate on the techniques for using it to predict RA outcomes.
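
As a rough illustration of the core technique (not the authors' implementation), the sketch below fits a Gaussian process regressor to synthetic stand-in data with scikit-learn; the feature count, kernel choice, and outcome simulation are assumptions for demonstration only:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Toy stand-ins: 100 patients x 20 features (e.g. clinical covariates),
# with a continuous outcome such as change in disease severity (deltaDAS).
X = rng.normal(size=(100, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# RBF kernel plus a white-noise term; hyperparameters are tuned by
# maximizing the marginal likelihood inside .fit().
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X[:80], y[:80])

# GPR returns both a predictive mean and a per-patient uncertainty.
pred, std = gpr.predict(X[80:], return_std=True)
```

The predictive standard deviation is one practical reason GPR is attractive for clinical outcome prediction: it quantifies confidence per patient rather than only producing a point estimate.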

...............................................................................................................................
DREAM P02: Predicting response to arthritis treatments: regression-based Gaussian processes on small sets of SNPs

Javier García-García1 , Daniel Aguilar1, Daniel Poglayen1, Jaume Bonet1, Oriol Fornés1, Emre Güney2, Joan Planas-Iglesias1,3, Manuel Alejandro Marín1, Bernat Anton1 and Baldo Oliva1

1Universitat Pompeu Fabra, 2Northeastern University, 3Dana-Farber Cancer Institute, 4Current Address: University of Warwick

The aim of our study was to identify candidate SNPs playing a role in the response to therapy in rheumatoid arthritis (RA) patients, by compiling several sources of information such as the localization of each SNP in the coding or non-coding region of a gene and its consequences for the translated protein (i.e., a synonymous or non-synonymous mutation). Genes affected by SNPs were first analyzed in order to select the most relevant associations with RA, as follows. An initial list of potential candidates was selected using association analysis of the experimental data provided by the DREAM challenge. Additionally, we used multiple external sources of biomedical data to filter candidate SNPs. The list of candidates was expanded using gene-prioritization algorithms that combined protein-protein interaction networks and expression data. The procedure is based on the guilt-by-association principle, and from the extended list we selected only those candidates with known SNPs. After the selection of genes, we used all SNPs reported for these genes. The resulting SNPs, in combination with clinical data, were used to predict the patients' response to treatment by means of regression-based models and 10-fold cross-validation on the training dataset provided by the DREAM challenge. When the models were applied to an independent dataset (the leaderboard set), their predictive power decreased significantly, pointing to an overfitting problem. After comparing the initial list of potential candidates against the versions refined with external information (i.e., biomedical data used to filter the candidate list and guilt-by-association extensions), we confirmed that the predictive value of the original list of candidate SNPs was not improved by any of the external information.
Therefore, we simplified the approach and reduced the SNP list by selecting only those showing the highest Pearson's correlation with the patients' response (DAS) in the leaderboard set (only about 20% of the initial SNPs). In the final independent dataset (CORRONA dataset) we achieved an AUC-ROC value of 0.6237 and AUC-PR value of 0.5071.
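
The simplified selection step (ranking SNPs by correlation with DAS and keeping the top ~20%) can be sketched as follows; this is a minimal re-creation on synthetic genotypes, not the authors' pipeline:

```python
import numpy as np

def top_snps_by_pearson(genotypes, das, keep_frac=0.2):
    """Rank SNPs by |Pearson correlation| with the response (DAS)
    and keep the top fraction, mirroring the simplified final approach."""
    X = genotypes - genotypes.mean(axis=0)
    y = das - das.mean()
    # Per-column Pearson r, guarding against zero-variance SNPs.
    denom = np.sqrt((X ** 2).sum(axis=0) * (y ** 2).sum())
    r = np.where(denom > 0, X.T @ y / np.where(denom == 0, 1, denom), 0.0)
    k = max(1, int(keep_frac * genotypes.shape[1]))
    return np.argsort(-np.abs(r))[:k]

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 50)).astype(float)  # 0/1/2 genotype codes
das = 0.8 * G[:, 3] - 0.6 * G[:, 7] + rng.normal(size=200)
selected = top_snps_by_pearson(G, das)  # indices of the top 20% of SNPs
```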

...............................................................................................................................
DREAM P03: Integrating prior biological knowledge into machine learning models for predicting drug responses

Lu Cheng 1,2,3 , Gopal Peddinti1, Muhammad Ammad-ud-din2,3, Alok Jaiswal1, Himanshu Chheda1, Suleiman Ali Khan1,2,3 , Kerstin Bunte2,3, Jing Tang1, Matti Pirinen1, Pekka Marttinen2,3, Janna Saarela1, Jukka Corander 2,4, Krister Wennerberg1, Samuel Kaski2,3, Tero Aittokallio1

1Institute for Molecular Medicine Finland (FIMM), University of Helsinki, 2Helsinki Institute for Information Technology (HIIT), 3 Aalto University, 4University of Helsinki

For complex genetic diseases such as rheumatoid arthritis, treatment effects can vary significantly among patients. We integrate prior biological knowledge into machine learning models to explain these differences. We use GWAS, CCA-based selection, PharmGKB, and differential gene expression analyses to select SNPs. We use GEMMA (a linear mixed model) and BEMKL (a multiple kernel learning method) for our predictions. Our results show that clinical information contributes the most to explaining the differences. Genetic information contributes relatively little to the predictions, which we validated by comparing against clinical-only predictions and predictions from random SNP sets. Using the methods correctly is also very important: our results show that changing the settings within these methods can produce large differences in the predictions.

...............................................................................................................................
DREAM P04: DREAM Rheumatoid Arthritis Responder Challenge: Team Lucia

Victor Bellón 1,2,3 , Chloé-Agathe Azencott1,2,3, Véronique Stoven1,2,3, Olivier Collier1,2,3, Azadeh Khaleghi1,2,3, Valentina Boeva1,2,3 and Jean Philippe Vert1,2,3

1MINES ParisTech, PSL-Research University, 2Institut Curie, 3INSERM U900

Approximately 30% of rheumatoid arthritis patients do not respond to their treatment. This challenge focused on anti-TNFα drugs; TNFα is involved in the inflammation pathway, which plays an important role in the disease. We used 2.5 million SNPs and simple clinical variables to predict patients' response to their anti-TNFα drugs. We selected SNPs based on biological knowledge and on their statistical relevance according to mutual information criteria. After selecting the SNPs, we used kernel methods to deal with the high dimensionality of the data. During the first phase of the challenge, we observed that genetic information explained only a small part of the data. Our team achieved the second-best score in the first sub-challenge using only clinical variables.
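
A minimal sketch of the two-stage idea (mutual-information-based SNP filtering followed by a kernel classifier), using scikit-learn on synthetic stand-in data; the dimensions and simulated response are illustrative assumptions, not the team's actual setup:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Toy stand-in: 300 patients x 100 SNPs (0/1/2 codes), binary response status.
G = rng.integers(0, 3, size=(300, 100)).astype(float)
y = (G[:, 0] + G[:, 1] + rng.normal(scale=0.8, size=300) > 2).astype(int)

# Rank SNPs by mutual information with the response and keep the top 10.
mi = mutual_info_classif(G, y, discrete_features=True, random_state=0)
keep = np.argsort(-mi)[:10]

# An RBF-kernel SVM then handles the (still multivariate) retained signal.
clf = SVC(kernel="rbf").fit(G[:200, keep], y[:200])
acc = clf.score(G[200:, keep], y[200:])
```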



ICGC-TCGA-DREAM Somatic Mutation Calling Challenge Posters
...............................................................................................................................
DREAM P05: novoBreak: robust characterization of structural breakpoints in cancer genomes

Zechen Chong1 , Ken Chen1

1University of Texas MD Anderson Cancer Center

Structural variation (SV) is a major source of genomic variation and plays a driving role in cancer genome evolution. However, the current strategy of next-generation whole genome sequencing still does not achieve the comprehensiveness and sensitivity required to identify the abundant SV breakpoints in heterogeneous tumor samples. This is due to challenges in acquiring high sequencing depth as well as methodological limitations in aligning and interpreting short reads spanning breakpoints. To alleviate these challenges and to deepen our understanding of cancer genome evolution, we developed a novel algorithm, novoBreak, which targets reads that substantially differ from the normal genome reference and outputs the "breakome": the collection of genomic sequences that span breakpoints and are unobserved in the reference alignment. novoBreak can comprehensively characterize, at base-pair resolution, breakpoints introduced by small indels, large deletions, duplications, inversions, insertions, and translocations from whole genome sequencing data. In contrast to most existing SV discovery programs such as Delly and Meerkat, novoBreak first clusters reads around potential breakpoints and then locally assembles the reads associated with each breakpoint into contigs. After aligning the contigs to the reference, novoBreak identifies the precise breakpoints and infers the types of SVs. novoBreak performs substantially better than other widely used algorithms and ranked No. 1 in the recent ICGC-TCGA DREAM Somatic Mutation Calling Challenge. The higher sensitivity of novoBreak makes it possible to uncover a large number of novel and rare SVs, as shown in our data from The Cancer Genome Atlas (TCGA) and from the 1000 Genomes Project. Wider application of novoBreak is under way and is expected to definitively reveal the comprehensive structural landscape that can be linked to novel mechanistic signatures in cancer genomes.
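
The core intuition, collecting reads whose sequence cannot be explained by the reference, can be illustrated very loosely with k-mers. This toy sketch is not novoBreak (which clusters and assembles such reads into contigs); it only shows why reads spanning a breakpoint stand out:

```python
def kmers(seq, k):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_breakpoint_reads(reads, reference, k=5):
    """Return reads carrying at least one k-mer absent from the reference.
    A read crossing an SV junction contains novel k-mers spanning the
    breakpoint, even when both its ends match the reference."""
    ref_kmers = kmers(reference, k)
    return [r for r in reads if any(km not in ref_kmers for km in kmers(r, k))]

reference = "ACGTACGTTTGACCGATTACAGGCAT"
reads = [
    "ACGTACGTTTGA",   # matches the reference
    "TTGACCGATTAC",   # matches the reference
    "TTGAGGCAT",      # spans a deletion junction: contains novel k-mers
]
novel = candidate_breakpoint_reads(reads, reference)
```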

...............................................................................................................................
DREAM P06: Applying logistic regression to combine multiple somatic mutation call sets for increased overall prediction accuracy

Li Tai Fang1,* , Pegah T. Afshar2,*, John C. Mu1, Narges Bani Asadi1, Wing H. Wong2,3, Hugo Y.K. Lam1

*Authors contributed equally

1Bina Technologies, 2Stanford University, 3Stanford University School of Medicine

Integrating multiple different algorithms to detect genetic variants or pathogenicity has been shown to be an effective approach for increasing prediction accuracy. It is worth noting that multiple prediction algorithms are not independent lines of evidence, as recently pointed out by MacArthur et al. Therefore, proper understanding and handling of the underlying algorithms are required to leverage the strengths of different algorithms and avoid their misuse. Recent reviews comparing somatic mutation callers by Wang et al. and Roberts et al. have clearly shown that various well-established tools have drastically contrasting performance in different situations, such as at varying allele frequencies. This indicates that integrating multiple algorithms can increase detection sensitivity, leaving the difficult question of how to maintain specificity at an acceptable level.

Previous attempts, such as the Cake pipeline by Rashid et al., have taken the ensemble approach, providing an integrated analysis of somatic variants based on multiple callers followed by a series of post-processing filters. However, Cake does not have a robust model proven to maximize accuracy for an arbitrary combination of tools. In this regard, Kim et al. proposed a statistical model that combines calls from multiple somatic mutation callers using regularized logistic regression with feature-weighted linear stacking (FWLS). The model was able to build a combined caller across the full range of stringency levels that outperformed each of the individual callers. Based on this approach, we carefully chose four somatic callers, namely MuTect, VarScan2, JointSNVMix2, and SomaticSniper, based on their performance and characteristics. We combined their call sets with over 75 caller and sequencing features. Our approach achieves high accuracy, as demonstrated in the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (the Challenge).

The four chosen algorithms complement each other with their private calls in different circumstances. For example, MuTect is very sensitive at detecting low allele frequency mutations, while SomaticSniper and VarScan2 are more tolerant of candidates with mutation evidence in the matched normal. However, each tool also brings in extra false positive calls. Our approach improves precision with features that are predictive of variant calling confidence, such as depth of coverage, alternate allele frequency, strand bias, mapping score, and base call quality. The final confidence score of each mutation candidate is the weighted sum of all the features. We applied our algorithm to the SNVs detected by the four callers from the synthetic reads in Stage 4 of the Challenge. The union of the four call sets captured 78.7% of the true mutations (sensitivity), but only 23.4% of the captured calls were true positives (precision). When our model was trained on Stage 3 data, we were able to improve the precision from 23.4% to 97.1% while maintaining the sensitivity at 74.0%, achieving an accuracy (the average of sensitivity and precision, as computed in the Challenge) of 85.5%.

If we instead use the Stage 4 dataset for cross-validation (i.e., randomly partitioning half of the Stage 4 data for training and the other half for validation), where we expect a high degree of consistency between the training and validation datasets, the accuracy improves to 88.2%, with a sensitivity of 77.8% and precision of 98.6%. We demonstrated that our machine learning approach with multiple call sets dramatically improves both sensitivity and specificity over any single call set, even though sensitivity is capped by the combined performance of the callers. As a result, we envision that by incorporating more algorithmically diverse tools, our approach can achieve ultra-high accuracy.
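
The combination scheme can be sketched as logistic regression over caller flags plus quality-style features; the simulated data and exact feature names below are illustrative assumptions, not the actual 75+ features or FWLS weighting used in the poster:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
# Per-candidate features: four caller flags (e.g. MuTect, VarScan2,
# JointSNVMix2, SomaticSniper) plus two quality-style features.
calls = rng.integers(0, 2, size=(n, 4)).astype(float)
depth = rng.normal(30, 8, size=n)
allele_freq = rng.uniform(0, 0.6, size=n)
X = np.column_stack([calls, depth, allele_freq])

# Simulated truth: candidates backed by more callers and more read
# support are more likely to be real somatic mutations.
logit = 1.5 * calls.sum(axis=1) + 3 * allele_freq + 0.02 * depth - 4.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# L2-regularized logistic regression learns a weight per feature; the
# fitted probability plays the role of the per-candidate confidence score.
model = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
conf = model.predict_proba(X[1500:])[:, 1]
acc = model.score(X[1500:], y[1500:])
```

Thresholding `conf` at different levels trades sensitivity against precision, which is how a combined caller can be tuned across stringency levels.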

References

1. Kim SY et al. Combining calls from multiple somatic mutation-callers. BMC Bioinformatics 2014, 15:154. (doi:10.1186/1471-2105-15-154)

2. MacArthur DG et al. Guidelines for investigating causality of sequence variants in human disease. Nature 2014, 508(7497):469-76. (doi:10.1038/nature13127)

3. Roberts ND et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 2013, 29(18):2223-30. (doi:10.1093/bioinformatics/btt375)

4. Rashid M et al. Cake: a bioinformatics pipeline for the integrated analysis of somatic variants in cancer genomes. Bioinformatics 2013, 29(17):2208-10. (doi:10.1093/bioinformatics/btt371)

...............................................................................................................................
DREAM P07: Application of MuTect for sensitive and specific somatic point mutation detection in DREAM challenge synthetic data

Mara Rosenberg1, Kristian Cibulskis1, Adam Kiezun1, Louis Bergelson1, Gad Getz1,2

1Broad Institute of Harvard and MIT, 2Massachusetts General Hospital

Sensitive and specific detection of somatic point substitutions is a critical aspect of characterizing the cancer genome. However, tumor heterogeneity, purity, and sequencing errors confound the confident identification of events at low allelic fractions. MuTect, a previously described method for somatic mutation calling [1], achieves high sensitivity by first applying a Bayesian classifier and then reducing false positives through carefully tuned filters. We applied MuTect to the four synthetic datasets in the DREAM challenge and achieved top-scoring performance, with specificity ranging from 0.98 to 0.99 and sensitivity from 0.74 to 0.97, consistent with our experience on real data. This corresponds to a false positive rate between 0.01 and 0.07 mutations per Mb. Here, we describe our approach, which applied MuTect together with filters that reduce artifacts from BAM alignment errors and base-specific sequencing noise.
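
The quoted metrics (sensitivity, and false positives normalized per Mb) can be computed from a call set and a truth set as in this small sketch; the positions and genome size are hypothetical:

```python
def evaluate_calls(called, truth, genome_size_mb):
    """Sensitivity, precision, and false positives per Mb for a call set,
    a simplified version of the metrics quoted for the challenge."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision, fp / genome_size_mb

# Hypothetical (chrom, pos) calls: 8 of 10 true mutations recovered,
# plus 3 false calls, on a 3000 Mb genome.
truth = {(1, p) for p in range(10)}
called = {(1, p) for p in range(8)} | {(2, 5), (2, 9), (3, 1)}
sens, prec, fp_per_mb = evaluate_calls(called, truth, genome_size_mb=3000)
```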

Reference

1. Cibulskis K, Lawrence MS, Carter SL, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31:213-9.




DREAM9 Acute Myeloid Leukemia (AML) Outcome Prediction Challenge Posters
...............................................................................................................................
DREAM P08: Evolution-informed modeling to predict AML outcomes

Li Liu1

1 Arizona State University

As part of the DREAM9 Challenge, the Acute Myeloid Leukemia (AML) Outcome Prediction Subchallenge 1 aims to predict whether an AML patient will have a complete response or resistance to treatment, based on 40 clinical covariates and 231 proteomic measurements. Previous analysis by the challenge organizers showed that the high level of noise in the proteomic data reduced its predictive power when used in an uninformed manner. To address this problem, I designed an evolution-informed model that incorporates weights derived from evolutionary conservation and univariate analysis into machine learning algorithms.

Based on the evolutionary patterns of cancer genes, it can be inferred that changes in expression levels may have a more profound impact when they involve conserved proteins rather than variable proteins. Therefore, higher weights can be given to slow-evolving proteins, and to proteins differentially expressed between the two outcome groups. To estimate protein conservation, the evolutionary rate (r) of each position in each protein was calculated from alignments of orthologous sequences from 46 vertebrates. The evolutionary weight (WE) of a protein was the reciprocal of its average evolutionary rate over all positions (WE = 1 / mean(r)). Clinical variables took the maximum of all WEs. Next, Student's t-test was performed for each feature. P-values were transformed via negative logarithm (-log(P)) and used as the differential weight (WD). For a given feature, the final weight was the sum of its evolutionary and differential weights (W = WE + WD).

In the feature selection step, each feature was first transformed to z-scores and then multiplied by its corresponding weight (W). Because the training data were highly unbalanced, an ensemble approach was used to construct multiple classification models from balanced subsamples. Using stability selection with sparse logistic regression, features identified in >50% of bootstrapping runs were selected. In the classification step, these features, with unweighted values, were used to construct a random forest model with 50 trees. The above procedure was repeated 100 times to produce an ensemble of 100 random forest models. Given a patient, 100 predictions were obtained, one from each model. The confidence score equals the proportion of models that predict the patient to have a complete response.
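
The weighting scheme (W = WE + WD, applied to z-scored features) can be sketched as follows. The evolutionary rates, data, and logarithm base here are simulated stand-ins; the real WE values come from 46-vertebrate alignments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 120, 30
X = rng.normal(size=(n, p))            # protein expression levels
y = rng.integers(0, 2, size=n)         # complete response vs. resistance
X[y == 1, 0] += 1.0                    # protein 0 is truly differential

# Evolutionary weight: reciprocal of the mean per-position evolutionary
# rate of each protein (rates here are simulated stand-ins).
mean_rate = rng.uniform(0.2, 2.0, size=p)
w_evo = 1.0 / mean_rate

# Differential weight: -log of the two-sample t-test p-value per feature
# (base 10 chosen here for illustration).
_, pvals = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
w_diff = -np.log10(pvals)

w = w_evo + w_diff                     # combined weight W = WE + WD
Xw = stats.zscore(X, axis=0) * w       # weighted, z-scored feature matrix
```

`Xw` would then feed the stability-selection step; the weights bias selection toward conserved, differentially expressed proteins.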

When applied to test data that are blind to participants of this challenge, this evolution-informed model achieved a balanced accuracy of 77.9% and AUROC of 0.795, ranked number one among all participants. Features that were selected in more than 80% of all models include Chemo (Flu-HDAC), cyto.cat (21), CD34, cyto.cat (-5), Age.at.Dx, ABS.BLST, PIK3CA, and GSKA_B.pS21_9.

...............................................................................................................................
DREAM P09: Acute myeloid leukemia outcome prediction via dictionary learning for sparse coding

Zhilin Yang1, Subarna Sinha2, David L. Dill2

1 Tsinghua University, 2Stanford University

This challenge was to use clinical and reverse-phase protein array (RPPA) data to solve three subchallenges: predicting complete remission after treatment, predicting remission duration, and predicting survival time. We describe our solutions to the first two subchallenges. For the first subchallenge, we found that a support vector machine (SVM) classifier with the radial basis function kernel was the most effective of the standard classifiers we tried. We added a manual rule predicting that any patient treated with Flu-HDAC would experience remission.

We found it difficult to improve prediction performance using the RPPA data until we used dictionary learning for sparse coding as a feature extractor: it learns low-rank latent state vectors from the original data in an unsupervised way and represents each sample as a sparse linear combination of the latent states. Using sparse-coding features of all protein data improved classifier performance. Applying sparse coding to pathway-specific subsets of proteins improved performance further, showing that prior knowledge of pathways can be useful in this task. Interestingly, some of the latent states in the pathway-specific sparse codes appeared to be biologically meaningful. The quality of the results also depended on a hybrid feature selection algorithm for clinical variables that avoids mixing up continuous and categorical features. We observed a significant batch effect in the RPPA data, which we tried, unsuccessfully, to correct using several standard methods.
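
A minimal sketch of the feature-extraction step with scikit-learn's dictionary learning; the synthetic low-rank "RPPA-like" matrix and the number of atoms are assumptions for demonstration:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(5)
# Toy RPPA-like matrix: 60 samples x 40 proteins generated from 5 latent
# states, mimicking the low-rank structure the method exploits.
states = rng.normal(size=(5, 40))
codes_true = rng.exponential(size=(60, 5)) * (rng.uniform(size=(60, 5)) < 0.4)
X = codes_true @ states + 0.05 * rng.normal(size=(60, 40))

# Learn 5 dictionary atoms; each sample is re-expressed as a sparse
# combination of them, and those codes become the new features.
dl = DictionaryLearning(n_components=5, alpha=0.5, max_iter=200, random_state=0)
codes = dl.fit_transform(X)
```

In the poster's pipeline, codes like these (computed per pathway-specific protein subset) would then be fed to the SVM classifier.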

For the second subchallenge, we used an average of three support vector regressions using different subsets of the features. We were unable to improve the quality of predictions in this subchallenge using the RPPA data.

...............................................................................................................................
DREAM P10: A bagged, semi-parametric model to predict survival time for acute myeloid leukemia patients

Xihui Lin1 , Gregory M. Chen1, Honglei Xie1, Geoffrey A. M. Hunter1, Paul C. Boutros1,2

1Ontario Institute for Cancer Research, 2University of Toronto

While many AML patients go into remission after treatment, survival time remains highly variable across individuals. Predicting these differences would be of major clinical value in personalizing therapy. As part of the ninth Dialogue on Reverse Engineering Assessment and Methods (DREAM9) challenge, we sought to accurately estimate the survival of AML patients by integrating clinical and proteomic features. We initially formulated survival models based on random forests, boosted quantile regression, and weighted linear models, but these performed no better than a benchmark Cox model with only five clinical variables. We therefore extended the benchmark Cox model for our final DREAM9 submission. Specifically, we used a bootstrap-aggregated (bagged) modified Cox model based on five clinical features: age at diagnosis, anthracycline-based treatment administered, hemoglobin count, albumin level, and cytogenetic category. Cytogenetics has been identified as the single most important prognostic factor in AML patients; however, the cytogenetic categories in the challenge data were imbalanced. To resolve this, we re-stratified patients into high, intermediate, intermediate-low, and low risk survival groups based on their cytogenetic category. This significantly improved the predictive power of the model. Surprisingly, incorporating additional clinical and/or proteomic features in the Cox model diminished its performance. These results suggest that our reclassified cytogenetic categories can improve predictions of patient survival and, hence, might be key to tailoring therapies for AML patients.
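
A from-scratch sketch of a bagged Cox model on synthetic data; this is not the authors' code, and the Breslow partial likelihood, the coarse risk-group mapping, and the no-censoring simplification are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, X, time, event):
    """Breslow negative log partial likelihood for a Cox model."""
    lp = X @ beta
    order = np.argsort(-time)                # longest survivors first
    lp, ev = lp[order], event[order]
    # log of the cumulative sum of exp(lp) over each risk set.
    log_cumsum = np.logaddexp.accumulate(lp)
    return -np.sum(ev * (lp - log_cumsum))

def fit_cox(X, time, event):
    res = minimize(neg_log_partial_likelihood, np.zeros(X.shape[1]),
                   args=(X, time, event), method="BFGS")
    return res.x

# Re-stratify a cytogenetic category into coarse risk-group codes
# (this mapping is illustrative, not the authors' exact grouping).
risk_group = {"favorable": 0, "intermediate": 1, "adverse": 2}

rng = np.random.default_rng(6)
n = 300
X = np.column_stack([rng.normal(size=n),                        # e.g. scaled age
                     rng.choice(list(risk_group.values()), n)]) # risk-group code
true_beta = np.array([0.8, 0.5])
time = rng.exponential(1.0 / np.exp(X @ true_beta))             # hazard ∝ exp(Xβ)
event = np.ones(n)                                              # no censoring, for simplicity

# Bootstrap-aggregated (bagged) Cox: fit on resamples, average risk scores.
risks = np.zeros(n)
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    beta_b = fit_cox(X[idx], time[idx], event[idx])
    risks += X @ beta_b
risks /= 20
```

Bagging stabilizes the coefficient estimates, which matters most when, as here, one covariate is a small set of imbalanced category codes.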

...............................................................................................................................
DREAM P11: Predicting overall survival of AML patients with Cox proportional hazards model

Ljubomir Buturovic1 , Damjan Krstajic1,2,3

1Clinical Persona Inc., 2Research Centre for Cheminformatics, Belgrade, 3University of Belgrade

One of the goals of the AML Challenge was to predict Overall Survival of AML patients using clinical and proteomics data. We tried the following two approaches to find an optimal survival model:

  • Cox proportional hazards model
  • parametric survival model

Bøvelstad et al. [1] built various survival models on clinical and genomic data and showed that, on average, a ridge-regularized Cox proportional hazards model outperforms the alternatives. However, the Cox model is not ideally suited for predicting actual survival (expected time until death), because its baseline hazard is unknown, which makes it difficult to apply in the overall survival subchallenge. The strength of a Cox model is that it can provide risk scores for various time points. To overcome the baseline hazard issue, we used 600 * S(600) as the predicted overall survival in our Cox models, where S(t) is the patient's survival function estimated by the Cox model and 600 is the maximum observed survival time in the subchallenge.
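
Under the Cox model, S(t | x) = S0(t)^exp(lp) for a patient with linear predictor lp, so the 600 * S(600) score can be sketched as below; the baseline value S0(600) used here is an arbitrary stand-in, since in practice it would be estimated (e.g. by the Breslow estimator) from the training data:

```python
import numpy as np

def cox_overall_survival_score(lp, s0_600=0.35, horizon=600.0):
    """Predicted overall survival as horizon * S(horizon | x), where
    S(t | x) = S0(t) ** exp(lp) under a Cox model. s0_600 = 0.35 is an
    arbitrary stand-in for the estimated baseline survival at t = 600."""
    return horizon * s0_600 ** np.exp(np.asarray(lp))

# Lower-risk patients (smaller lp) receive larger predicted survival.
scores = cox_overall_survival_score([-1.0, 0.0, 1.0])
```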

Parametric survival models are more suited for predicting overall survival, because their baseline hazard is well defined. However, their main weakness is that they usually overfit.

As described in our recent publication [2], we used grid-search repeated cross-validation to select an optimal model. The best cross-validation results were obtained for a regularized Cox regression model fitted with the glmnet R package. Our results confirm the finding of Bøvelstad et al. [1] that regularized Cox regression models are serious contenders for clinico-genomic survival modeling.

References

1. Bøvelstad HM, Nygård S, Borgan Ø. Survival prediction from clinico-genomic models - a comparative study. BMC Bioinformatics 2009, 10:413.

2. Krstajic D, Buturovic LJ, Leahy DE, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics 2014, 6(1):1-15.

...............................................................................................................................
DREAM P12: Ensembling classical supervised statistical learning methods and survival models in the DREAM9 AML Challenge

Rolland He 1

1Stanford University

The DREAM9 AML Challenge provides a novel training dataset detailing the clinical covariates and proteomic measurements of 191 patients diagnosed with acute myeloid leukemia (AML). This dataset provides a basis for applying conventional supervised statistical learning methods, including linear elastic net regression and gradient boosting machines, as well as the well-known Cox survival model. Despite being relatively simple to implement and understand, these models prove competitive when combined. I will discuss some of the practical limitations of the dataset, some of the missteps I ran into along the way, and the implications of my results for future biostatistical analysis. In particular, it is important to appreciate the versatility of classical supervised statistical models and the strength of ensembling in forming accurate predictions.

...............................................................................................................................
DREAM P13: Application of crowdsourcing to improve predictive models of AML outcome: constructing the DREAM9 AML challenge

David Noren1 , Byron Long1, Raquel Norel2, Gustavo Stolovitzky2, Steven Kornblau3, Amina Qutub1

1Rice University, 2IBM Computational Biology Center, 3University of Texas MD Anderson Cancer Center

It is projected that clinical informatics will have a profound impact on the immediate future of health care. In particular, insights drawn from clinical genomic, proteomic, and metabolomic data have the potential to greatly improve the accuracy of patient prognosis and the effectiveness of therapeutic selection. However, to realize this potential, there is a need to explore, develop, and validate new computational algorithms capable of translating high-throughput measurements into clinically useful information. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) is a crowdsourcing effort that has pioneered the competitive and cooperative development of predictive biological models to fill this need. Each DREAM endeavor is structured around a challenge framework in which participants from different technical fields are invited to contribute their best solutions. Here we describe the background, design, and implementation of the DREAM9 Acute Myeloid Leukemia (AML) Outcome Prediction Challenge.

Acute myeloid leukemia (AML) is a devastating cancer of the bone marrow and the blood, with an estimated 18,860 new cases and 10,460 deaths attributed to AML this year. While information about patient mutation status and cytogenetics has assisted clinicians in matching therapies to specific subsets of patients, the effectiveness of treatment remains low and the overall 5-year survival rate is only ~25%. To improve personalized therapy for these patients, MD Anderson Cancer Center has employed new proteomic technologies such as the Reverse Phase Protein Array (RPPA). RPPA allows clinicians to directly ascertain changes in AML patients' signaling proteins, which are the molecular targets of most therapeutics. The goal of the DREAM9 AML Outcome Prediction Challenge is to develop models that predict patient outcome using the RPPA proteomics data in conjunction with other clinical measurements.

The design of the DREAM9 AML Outcome Prediction Challenge embodied several components, each with special considerations. Clinical data are often "messy", so the AML data were first processed to ensure clarity and consistency. Two separate datasets were identified: a training dataset given to participants to develop their models, and a test set later used to evaluate participants' predictions. These datasets were chosen to ensure that the training data provided an adequate representation of the overall dataset. Different scoring metrics were evaluated, both to measure participant performance and to evaluate the threshold signal-to-noise ratio. Finally, a feedback structure (i.e., the leaderboard schedule) was designed to assist participants with model development without promoting model over-fitting. Here we describe the methodology and tools used to address each of these considerations when designing the DREAM9 AML Outcome Prediction Challenge, and review some of the preliminary results.


Broad-DREAM Gene Essentiality Prediction Challenge Posters
...............................................................................................................................
DREAM P14: Learning kernel-based feature representation for gene essentiality prediction

Masayuki Karasuyama 1 , Hiroshi Mamitsuka1

1Kyoto University

We develop a predictive method for estimating gene essentiality, focusing on learning a predictive feature representation. Our method uses a kernel technique in which the kernel is trained to capture mutual relations among different cell lines with respect to essentiality. We start from a baseline model, kernel ridge regression (KRR), which is well known for stable, high predictive performance. We then attempt to improve on KRR by learning the kernel (or features) from the gene essentiality data itself. More concretely, we focus on the essentiality scores of the genes in the given data, where the scores of different genes are heavily interdependent. We incorporate this dependency into our predictive model by using kernel canonical correlation analysis (KCCA) and kernel target alignment (KTA), both of which can be interpreted as estimating feature representations using the 'ideal' kernel defined by the essentiality scores. After obtaining kernels through KCCA and KTA, we predict the essentiality of an arbitrary gene with the two corresponding KRR models, and we average the two predictions to stabilize the results. An important point of our model is that the trained kernel is shared across all genes rather than estimated separately for each gene; this reduces estimation variance, which can be a severe problem for high-dimensional, small-sample data such as the data provided here. Overall, these modifications make our predictive model a high-performance approach, particularly in subchallenge 1 of the Broad-DREAM Gene Essentiality Prediction Challenge. An additional, substantial advantage of our approach is computational efficiency: all of the techniques used (KCCA, KTA and KRR) are kernel methods, so once the kernels are computed we never need to handle the high-dimensional data directly.
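
A simplified sketch of the kernel-target-alignment idea feeding kernel ridge regression; this uses an uncentered alignment score and synthetic data, and stands in only loosely for the KCCA/KTA machinery of the poster:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

def alignment(K1, K2):
    """Kernel target alignment: cosine similarity between kernel matrices."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 10))                      # cell-line features
y = np.tanh(X[:, 0]) + 0.1 * rng.normal(size=80)   # essentiality-like score

# Candidate RBF kernels at different bandwidths; the 'ideal' kernel
# defined by the target scores is y y^T.
candidates = [rbf_kernel(X, gamma=g) for g in (0.01, 0.1, 1.0)]
ideal = np.outer(y, y)
weights = np.array([alignment(K, ideal) for K in candidates])

# Combine kernels, weighted by their alignment to the ideal kernel,
# then plug the result into kernel ridge regression.
K = sum(w * Kc for w, Kc in zip(weights, candidates))
model = KernelRidge(kernel="precomputed", alpha=1.0).fit(K, y)
pred = model.predict(K)
```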

...............................................................................................................................
DREAM P15: Multi Pathway Learning accurately predicts gene essentiality in the Cancer Cell Line Encyclopedia

Vladislav Uzunangelov1, Sahil Chopra1, Kiley Graim1, Daniel Carlin1, Yulia Newton1, Alden Deran1, Adrian Bivol1, Sam Ng1, Kyle Ellrott1, Joshua M. Stuart1+, Artem Sokolov1+ and Evan Paull1+.

1University of California, Santa Cruz

+Corresponding author(s).

We applied biologically motivated feature transformations coupled with established machine learning methods to predict gene essentiality in CCLE cell line models. By leveraging additional large datasets, such as The Cancer Genome Atlas PanCancer12 data and MSigDB pathway definitions, we improved the robustness and biological interpretability of our models. We developed a multi-pathway learning (MPL) approach that associates a genetic pathway from MSigDB with a distinct kernel for use in a multiple kernel learning setting. We evaluated the performance of MPL compared to several other regression methods including random forests, kernel ridge regression, and elastic net linear models. We combined multiple approaches using an ensemble technique on the diverse set of predictors.
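A minimal sketch of the pathway-kernel idea follows, assuming hypothetical gene sets in place of MSigDB definitions and replacing the learned MKL weights with a uniform kernel average (a real multiple kernel learning solver would optimize one weight per pathway kernel); this is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
genes = [f"g{i}" for i in range(30)]
X = rng.normal(size=(25, 30))                       # 25 cell lines x 30 genes
y = X[:, 0] - X[:, 5] + 0.1 * rng.normal(size=25)   # toy essentiality score

# Hypothetical pathway definitions (stand-ins for MSigDB gene sets).
pathways = {"pathA": genes[0:10], "pathB": genes[5:20], "pathC": genes[18:30]}
idx = {g: i for i, g in enumerate(genes)}

# One linear kernel per pathway, built only from that pathway's genes.
kernels = []
for members in pathways.values():
    Xp = X[:, [idx[g] for g in members]]
    kernels.append(Xp @ Xp.T)

# Uniform weights stand in for learned MKL weights; a trained model's
# per-kernel weights would rank pathways by predictive relevance.
K = sum(kernels) / len(kernels)
coef = np.linalg.solve(K + 1.0 * np.eye(len(y)), y)  # kernel ridge dual fit
fit_corr = np.corrcoef(K @ coef, y)[0, 1]
print(f"{fit_corr:.2f}")
```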

We found that the winning method was an ensemble that combined the random forest and MPL predictions. Both models utilized features derived from both gene expression and copy number data, the latter of which were filtered to those predicted as driver events in prior pan-cancer studies. MPL also ranked as the 5th highest performing method on the third sub-challenge. In this case, the 100 features with the highest cumulative MPL weights were selected and applied with kernel ridge regression to produce the final predictions. Thus, MPL also demonstrates merit as a feature selector when used with other downstream methods.

The method performed best at predicting the essentiality of genes belonging to classes such as kinases (31 of the top 100), fibronectin type III (7 of the top 100), and insulin signaling (6 of the top 100). Kinases represent a broad class of genes whose knock-outs are expected to have a wide range of effects due to their master regulatory role; signatures of gene essentiality for these genes might therefore be more readily inferred. In addition, high prediction accuracy was achieved for several genes involved in cancer, such as TP53, PIK3CA, RB1, FGFR1, ABL1, and FLT3, suggesting MPL's utility as a biomarker for detecting key tumorigenic events.

The advantage of MPL is that mechanistically coherent gene sets are automatically selected as high-scoring pathway kernels (HSPKs). We investigated whether the HSPKs identify cellular processes relevant to the loss of key genes. To do this, we inspected the HSPKs for a few of the most abundantly mutated genes in cancer. The MPL predictor for TP53 included the targets of this transcription factor as well as HSPKs involved in apoptosis, a cellular process regulated by TP53. The retinoblastoma gene (RB1) MPL predictor included RB1 targets as well as HSPKs involved in the regulation of histone deacetylase (HDAC), which interacts with RB1 to suppress DNA synthesis. PIK3CA, a gene that is mutated frequently in luminal, but not basal, breast cancers, was associated with HSPKs composed of genes differentially expressed in luminal breast cancers. Finally, HSPKs predictive of BRAF essentiality included genes associated with uveal melanoma, a finding consistent with the prevalence of BRAF mutations in skin cancers. These findings suggest that trends in the MPL results could reveal a pathway-level view of the synthetic lethal architecture of cells. Such a map, which links patterns of pathway expression to potential genetic vulnerabilities, could provide an invaluable tool for exploring new avenues to target cancer cells.

...............................................................................................................................
DREAM P16: Integrative model to predict gene essentiality for cancer cell survival

Tao Wang1, Xiaowei Zhan1, Hao Tang1, Yang Xie1, Guanghua Xiao1

1University of Texas Southwestern Medical Center

The central question of precision medicine in cancer is how to identify patients who are more susceptible to a given treatment. This is especially important for targeted cancer therapies, which inhibit cancer cell growth by targeting specific genes or pathways and may be effective only for a specific sub-population. It is therefore critical to predict the extent to which cell survival relies on specific genes (gene essentiality), as these are the target genes for cancer treatment. Here, we show that an advanced data-driven dimension-reduction strategy, integrating features from expression, copy number, and mutation data and coupled with Gaussian process regression (GPR), is an effective approach to predicting gene essentiality. The feature selection step takes several factors into consideration, including the gene expression level, the distribution difference between the training and testing data, and the association with outcome variables. We applied GPR to the selected features and specified appropriate kernels to capture the complex non-linear relationship between predictors and response. In conclusion, we developed a best-performing predictive pipeline for the Broad-DREAM Gene Essentiality Prediction Challenge (sub-challenge 1) and provide new perspectives for the realization of precision medicine in cancer.
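The GPR prediction step can be sketched with the standard posterior-mean computation via a Cholesky factorization. The numpy snippet below uses toy data, an RBF kernel, and an arbitrary length-scale and noise level, all illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def rbf(A, B, length=2.0):
    # RBF kernel with unit prior variance and the given length-scale.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length**2))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))              # selected features, 30 cell lines
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=30)
X_new = rng.normal(size=(4, 5))           # cell lines to predict

noise = 0.05**2
K = rbf(X, X) + noise * np.eye(30)        # kernel matrix plus noise term
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
K_new = rbf(X_new, X)
mean = K_new @ alpha                      # posterior mean prediction
v = np.linalg.solve(L, K_new.T)
var = 1.0 - (v**2).sum(0)                 # posterior variance (prior var = 1)
print(mean.shape)  # (4,)
```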

...............................................................................................................................
DREAM P17: A strategy to select most informative biomarkers for cancer cell lines

Fan Zhu1, Yuanfang Guan1

1University of Michigan

Cancer cells exhibit strong heterogeneity, and thus the response to treatment varies dramatically between individuals. By rough estimates, 80% of patients currently do not respond to cancer therapy. Personalized treatment of tumors therefore requires accurate identification of drug targets for the specific samples collected at biopsy. Ideally, a test panel with a limited number of biomarkers could be designed for each type of cancer to identify effective drug targets for a patient. Subchallenge 2 of the Broad-DREAM Gene Essentiality Prediction Challenge asks whether such biomarkers can be found for each type of cancer. We developed a method to rigorously select stable biomarkers based on both their informativeness in the cell line under investigation and their global informativeness over all cell lines. This was the best-performing method in this subchallenge.
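One way to sketch the combined local/global selection idea in numpy is below; the correlation-based informativeness measure, the equal weighting of the two scores, and the panel size are illustrative assumptions, not the authors' actual criteria.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 200))             # candidate biomarkers, 50 cell lines
y = X[:, 3] - 0.5 * X[:, 7] + 0.2 * rng.normal(size=50)  # toy essentiality
cancer_type = rng.integers(0, 4, size=50)  # hypothetical cancer-type labels

def abs_corr(X, y):
    # Absolute Pearson correlation of each column of X with y.
    Xc, yc = X - X.mean(0), y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)

global_score = abs_corr(X, y)              # informativeness over all cell lines
mask = cancer_type == 0                    # cell lines of the type of interest
local_score = abs_corr(X[mask], y[mask])   # informativeness within that type

# Equal weighting of local and global informativeness is an assumption.
combined = 0.5 * local_score + 0.5 * global_score
panel = np.argsort(combined)[::-1][:10]    # 10-marker test panel
print(len(panel))  # 10
```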

...............................................................................................................................
DREAM P18: Predicting gene essentiality using linear-time greedy feature selection

Peddinti Gopalacharyulu*1, Alok Jaiswal*1, Kerstin Bunte2, Suleiman Khan1,2, Jing Tang1, Antti Airola4, Krister Wennerberg1, Tapio Pahikkala4, Samuel Kaski2,3, Tero Aittokallio1

*Equal contributions

1Institute for Molecular Medicine Finland FIMM, University of Helsinki, 2Helsinki Institute for Information Technology HIIT, Aalto University, 3Helsinki Institute for Information Technology HIIT, University of Helsinki, 4University of Turku

Genome-wide prediction of gene essentiality from the molecular characteristics of cancer cells has the potential to open up new avenues for selective cancer therapies, as well as to provide insights into the systems-level genetic interaction networks of cancer cells. Subchallenges 2 and 3 of the Broad-DREAM9 Gene Essentiality Prediction Challenge address the problem of finding a limited number of molecular features that are most predictive of gene essentiality. To solve this problem, we used a greedy forward feature selection algorithm for regularized least squares (RLS), called GreedyRLS. GreedyRLS is a wrapper-type feature selection method that starts with an empty feature set and, in each iteration, adds the feature whose inclusion yields the minimum RLS error in leave-one-out cross-validation (LOO-CV). GreedyRLS, however, performs this selection far more efficiently than previously known feature selection algorithms for RLS: whereas a naive implementation retrains the model for every candidate feature and every LOO fold, GreedyRLS selects k features from a total of n features in a data set with m training samples in O(knm) time, i.e., in time linear in each of k, n and m. In sub-challenge 3, we utilized the GreedyRLS approach for multi-task learning, and it performed best among all competing methods in this sub-challenge. We addressed sub-challenge 1 using additional information based on pathways from PARADIGM and gene sets from MSigDB. In this sub-challenge, we used the Bayesian multitask multiple kernel learning (BEMKL) method, a non-linear method based on kernelized regression and Bayesian inference. Using additional information on gene similarities based on Gene Ontology appeared helpful for predicting gene essentiality, in line with lessons learned from the previous NCI-DREAM Drug Sensitivity Prediction Challenge, but did not lead to the top performance in this sub-challenge.
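The wrapper scheme can be sketched in Python using the closed-form leave-one-out error of ridge regression, which avoids refitting once per held-out sample. This naive toy version, with invented data, rebuilds the hat matrix for every candidate feature, whereas GreedyRLS reaches essentially the same greedy choices with incremental updates; it is a sketch of the selection criterion, not the authors' implementation.

```python
import numpy as np

def loo_mse_ridge(X, y, lam=1.0):
    # Closed-form leave-one-out error for ridge regression via the hat
    # matrix: LOO residual_i = (y_i - yhat_i) / (1 - H_ii), with no refits.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    r = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.mean(r**2))

def greedy_select(X, y, k, lam=1.0):
    # At each step, add the feature whose inclusion gives the lowest
    # LOO error, starting from the empty feature set.
    selected = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        errs = [loo_mse_ridge(X[:, selected + [j]], y, lam) for j in rest]
        selected.append(rest[int(np.argmin(errs))])
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 40))                       # 60 samples, 40 features
y = 2 * X[:, 8] - X[:, 17] + 0.1 * rng.normal(size=60)
sel = greedy_select(X, y, k=2)
print(sel)
```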

...............................................................................................................................
DREAM P19: Integrative learning of gene essentiality using data and knowledge with cluster specific models

Simone Rizzetto1, Paurush Praveen1, Mario Lauria1, Corrado Priami1,2

1The Microsoft Research-University of Trento Centre for Computational and Systems Biology, 2University of Trento

Motivation: Predictive models that infer gene essentiality in cancer cell lines can aid the molecular characterization of those cell lines, which can ultimately be used to identify biomarkers and tailored treatments, as well as patients with higher expected treatment efficacy. Approaches to predicting essentiality have been based on genome-scale data (Roberts et al. 2007) as well as on gene networks (Kim et al. 2012). However, gene essentiality should be seen as context-dependent: for example, a gene essential in a lung cancer cell line may not be essential in a breast cancer cell line, and vice versa. A generalized model for quantitative estimation of gene essentiality across heterogeneous cell lines is therefore not suitable. Another limitation of current approaches, even with gene-specific models, is that they tend to use data from heterogeneous or diverse cell lines, rendering the overall model noisy and reducing its predictive power. A further, less exploited possibility is the use of existing knowledge about the genes and cell lines, which can boost the performance of quantitative prediction models. We exploit these hypotheses in addressing the essentiality prediction problem of the Broad-DREAM Gene Essentiality Prediction Challenge (2014). We propose two models that address these issues: (1) One Gene-One Model (OGOM) with integrated knowledge features, and (2) Cluster-Specific OGOM (CS-OGOM).

Methods: We developed two approaches to building a model for quantitative prediction of gene essentiality, based on expression and copy number variation (CNV) data. Support vector regression (ε-SVR) (Smola & Schölkopf 2004) forms the underlying prediction engine. The key aspects of the two models are highlighted below.

1. OGOM (One Gene One Model): This approach trains one model per gene across all cell lines using support vector regression. The distinguishing feature is the integration of knowledge from multiple freely available information sources (Oncogene, cell line information, etc.) as features, in addition to the gene expression and CNV data provided by the organizers.

2. CS-OGOM (Cluster-Specific OGOM): This approach is based on the hypothesis that closely related cell lines follow the same per-gene model of essentiality. Thus, compared to OGOM, we have a gene-specific model for every cluster. The approach first performs hierarchical clustering on the expression data to identify closely related cell lines; to obtain optimal cluster homogeneity, we used a distance metric based on the degree of similarity between rank-based signatures (Lauria et al. 2013). Within each cluster we then identify the training and test data: the training data of a cluster are used to learn a model for each gene, and this model predicts the essentiality of the corresponding gene in the test cell lines belonging to that cluster.
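The two steps of CS-OGOM can be sketched with scipy and scikit-learn as below. The Ward/Euclidean clustering stands in for the rank-signature distance of Lauria et al. (2013), and the cluster count, SVR hyperparameters, and within-cluster split are illustrative toy choices, not the authors' settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.svm import SVR

rng = np.random.default_rng(5)
# Two well-separated groups of cell lines (20 + 20), 15 expression features.
expr = np.vstack([rng.normal(0.0, 1.0, size=(20, 15)),
                  rng.normal(3.0, 1.0, size=(20, 15))])
ess = expr[:, 0] + 0.1 * rng.normal(size=40)   # one gene's essentiality scores

# Step 1: hierarchical clustering of cell lines on expression data.
clusters = fcluster(linkage(expr, method="ward"), t=2, criterion="maxclust")

# Step 2: a per-gene SVR trained inside each cluster predicts that gene's
# essentiality for the held-out cell lines of the same cluster.
results = {}
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    train, test = members[:-3], members[-3:]   # toy within-cluster split
    model = SVR(kernel="rbf", C=10.0).fit(expr[train], ess[train])
    for i, p in zip(test, model.predict(expr[test])):
        results[int(i)] = float(p)
print(len(results))  # 6 predictions: 3 held-out lines per cluster
```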

Results: We applied leave-one-out cross-validation to both models to assess their performance, measured in terms of correlation coefficients. Both models outperformed the generalized model (one model for all cell lines and genes). CS-OGOM performed substantially better than OGOM on both the Pearson and Spearman scales: the Pearson correlation for CS-OGOM was > 0.4 on average, and the Spearman correlation oscillated around 0.3 depending on the gene under observation, whereas the scores for OGOM were ~0.2 on both scales. The inclusion of prior knowledge also improved the performance of OGOM relative to the same model without these features.

Discussion: Our models demonstrate the improvement brought about by gene-specificity and the integration of prior knowledge. The leave-one-out cross-validation of CS-OGOM showed that training on similar cell lines leads to better performance for specific cell lines. However, the choice of the number of clusters requires a trade-off between specificity (cell-line homogeneity) and the number of training samples. Another important aspect is the scaling of training and test data, to avoid numerical artifacts when mixing the clusters for the final performance measure.

References

1. Roberts SB, Mazurie AJ, Buck GA (2007) Integrating Genome-Scale Data for Gene Essentiality Prediction. Chemistry & Biodiversity 4: 2618-2630.

2. Kim J, Kim I, Han SK, Bowie JU, Kim S (2012) Network rewiring is an important mechanism of gene essentiality change. Sci Rep

3. Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Statistics and Computing 14: 199-222.

4. Lauria M (2013) Rank-based transcriptional signatures: A novel approach to diagnostic biomarker definition and analysis. Systems Biomedicine 1: 228-239.

...............................................................................................................................
DREAM P20: Cell viability prediction from large-scale omics data using machine learning and mechanistic modeling approaches

Emanuel Gonçalves1, Daniel Machado2, Michael Menden1, Julio Saez-Rodriguez1, Miguel Rocha2
 
1 EMBL-EBI European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge, UK
2 Centre of Biological Engineering, University of Minho, 4710-057 Braga, Portugal
 
Systematic studies of the impact of gene loss of function, for example via short hairpin RNA (shRNA) or CRISPR screens, on cancer cell viability help to identify vulnerabilities that can be exploited for possible therapeutic treatments.
 
A current challenge in this field is the accurate prediction of gene essentiality from a given genomic context, which can be characterised by gene copy number variation, mutation profiles and transcriptomic expression data. In this context, the final submissions of our team, UM-EBI, to the DREAM 9 Gene Essentiality challenge were produced with a combination of machine learning methods. For feature selection, we discarded features behaving similarly across all the samples and ranked the remaining features using a univariate feature selection method. For model estimation, we used a “wisdom of crowds” approach that consisted of averaging the results of different estimation methods, including Ridge regression, support vector machines (SVM) and Passive Aggressive regression (PAR).
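The pipeline described above can be sketched with scikit-learn as follows; the variance threshold, the number of selected features, and the model hyperparameters are illustrative assumptions on toy data, not the team's actual settings.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import PassiveAggressiveRegressor, Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 300))                  # omics features, 80 cell lines
X[:, :50] *= 0.01                               # near-constant features
y = X[:, 100] - X[:, 200] + 0.2 * rng.normal(size=80)
X_tr, X_te, y_tr = X[:60], X[60:], y[:60]

# Step 1: discard features that barely vary across samples.
keep = X_tr.std(axis=0) > 0.1
X_tr, X_te = X_tr[:, keep], X_te[:, keep]

# Step 2: univariate ranking, keeping only the top-scoring features.
sel = SelectKBest(f_regression, k=20).fit(X_tr, y_tr)
X_tr, X_te = sel.transform(X_tr), sel.transform(X_te)

# Step 3: "wisdom of crowds": average the predictions of three regressors.
models = [Ridge(alpha=1.0),
          SVR(kernel="linear", C=1.0),
          PassiveAggressiveRegressor(max_iter=1000, random_state=0)]
pred = np.mean([m.fit(X_tr, y_tr).predict(X_te) for m in models], axis=0)
print(pred.shape)  # (20,)
```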

We observed that gene expression was the most informative data set. In many cases, considering copy-number variation and mutation data decreased the performance of the estimators. Changes in the prediction methods resulted in small perturbations to the final scores, whereas the filtering and selection steps significantly affected our predictions.

One limitation of these machine learning approaches is the limited biological insight they provide. Furthermore, they do not take into account the compendium of information that has already been assembled and curated into mechanistic models. In this regard, we propose to extend our approach by considering genome-scale models of human metabolism and its regulation. These models capture biochemical processes that affect cell growth and viability, allowing mechanistic and quantitative predictions of the impact of genetic changes on the cellular growth phenotype. Methods have recently been developed to generate tissue-specific models of human cells from gene expression data. We are extending these methods to generate models of metabolism, gene regulation and signal transduction by integrating multiple omics data, which we plan to apply to the study of gene essentiality.

 

