DREAM POSTERS

Updated Nov 3, 2015

Poster: DR01
From Shape to Smell: Predicting Olfactory Perceptual Descriptors using Molecular Structural Information

Richard C Gerkin¹

¹Arizona State University, United States

Abstract: The DREAM Olfaction Prediction Challenge asked participants to predict 21 different olfactory perceptual features (i.e. smell descriptors) of single molecules using a library of several thousand structural features of those molecules. One sub-challenge asked participants to make predictions for individual human subjects, and one for the mean and variance of responses across subjects. Here I describe the winning submission to the latter sub-challenge.

After using missing value imputation to fill out the structural feature dataset, I trained Random Forest Regression models (using Python's scikit-learn package) to predict mean (across subjects) responses for each of the perceptual features. I made extensive use of cross-validation to optimize these models. While I also constructed similar models to predict the variance (across subjects) of the responses, I found that prediction was improved by exploiting the relationship between the variance and the mean that was guaranteed by basic psychometric considerations. Consequently, I obtained decisive improvements in my prediction of the variance by pooling results from models trained only on the variance with a theoretically motivated non-linear transformation of results from models trained only on the mean. This technique proved decisive in constructing the winning submission for the sub-challenge.
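
The abstract does not state the exact transformation, so the following is only a sketch of the kind of mean-variance link it alludes to. For ratings confined to a bounded scale, the variance across subjects can never exceed (mean - lo)(hi - mean), so a predicted mean already constrains the spread. The 0-100 scale, the `scale` and `weight` parameters, the use of the standard deviation as the spread measure, and all function names are illustrative assumptions, not the author's code.

```python
import math

def variance_upper_bound(mean, lo=0.0, hi=100.0):
    """Largest possible variance of ratings in [lo, hi] that have this mean."""
    return (mean - lo) * (hi - mean)

def std_from_mean(mean_pred, scale=0.5, lo=0.0, hi=100.0):
    """Hypothetical non-linear transform: a fixed fraction of the maximal spread."""
    bound = max(variance_upper_bound(mean_pred, lo, hi), 0.0)
    return scale * math.sqrt(bound)

def pooled_std(direct_std_pred, mean_pred, weight=0.5, **kwargs):
    """Blend a directly modeled spread with the spread implied by the mean."""
    implied = std_from_mean(mean_pred, **kwargs)
    return weight * direct_std_pred + (1.0 - weight) * implied

if __name__ == "__main__":
    # A molecule with predicted mean 50 and a directly predicted spread of 10.
    print(pooled_std(direct_std_pred=10.0, mean_pred=50.0))
```

The bound is zero at both ends of the scale and largest in the middle, which is why a mean prediction carries information about the spread.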

...............................................................................................................................

Poster: DR02
The ALS Stratification Prize: Using Big Data and Crowdsourcing for Catalyzing Breakthroughs in ALS

Neta Zach¹, Robert Küffner², Hagit Alon¹, Nazem Atassi³, Barbara di Camillo⁴, Merit Cudkowicz³, Javier Garcia-Garcia⁵, Orla Hardiman⁶, Guang Li⁷, Lara Mangravite⁸, Raquel Norel⁹, Thea Norman⁸, Alexander Sherman³, Liuxia Wang⁷, Gustavo Stolovitzky⁹

¹Prize4Life, Israel
²Helmholtz Center, Germany
³Massachusetts General Hospital, United States
⁴University of Padova, Italy
⁵Universitat Pompeu Fabra, Spain
⁶Beaumont Hospital and Trinity College Dublin, Ireland
⁷Origent Data Solutions, United States
⁸Sage Bionetworks, United States
⁹IBM, United States

Abstract: The heterogeneity of the ALS patient population presents a substantial barrier to the understanding of disease mechanisms and to the planning and interpretation of ALS clinical trials, leading to large, expensive, and potentially unbalanced trials.

The 2015 DREAM ALS Stratification Prize4Life challenge offers an innovative approach to developing tools to allow a more accurate assignment of individual patients to a specific sub-group of patients with clear clinical implications for either survival or disease progression. It is expected to provide important tools for precision medicine in ALS.

The ALS Stratification challenge aims to address directly the problem of ALS patient heterogeneity with regard to important clinical targets such as ALSFRS progression and survival. In the challenge, we asked participants to derive meaningful subgroups of ALS patients along with the clinical features that characterize them.

...............................................................................................................................

Poster: DR03
DREAMTools: a Python Package for scoring collaborative challenges

Thomas Cokelaer¹, Mukesh Bansal², Christopher Bare³, Erhan Bilal⁴, Brian M. Bot³, Elias Chaibub Neto³, Federica Eduati¹, Mehmet Gönen⁵, Steven Hill⁶, Bruce Hoff³, Jonathan R. Karr⁷, Robert Küffner⁸, Michael Menden¹, Pablo Meyer⁴, Raquel Norel⁴, Abhishek Pratap³, Robert J. Prill⁹, Matthew T. Weirauch¹⁰, James C. Costello¹¹, Gustavo Stolovitzky⁴, Julio Saez-Rodriguez¹²

¹EMBL-EBI, United Kingdom
²Department of Systems Biology, Columbia University, United States
³Sage Bionetworks, United States
⁴IBM, TJ Watson, Computational Biology Center, United States
⁵Oregon Health & Science University, United States
⁶MRC Biostatistics Unit, Cambridge Institute of Public Health, United Kingdom
⁷Department of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, United States
⁸Institute of Bioinformatics and Systems Biology, German Research Center for Environmental Health, Germany
⁹IBM Almaden Research Center, San Jose, United States
¹⁰Center for Autoimmune Genomics and Etiology and Divisions of Biomedical Informatics and Developmental Biology, Cincinnati Children’s Hospital, United States
¹¹Department of Pharmacology, University of Colorado Anschutz Medical Campus, United States
¹²RWTH Aachen University Medical Hospital, Germany

Abstract: DREAM challenges are community competitions designed to advance computational methods and address fundamental questions in systems biology and translational medicine. Each challenge asks participants to develop and apply computational methods to either predict unobserved outcomes or identify unknown model parameters given a set of training data. Computational methods are evaluated using an automated scoring metric, scores are posted to a public leaderboard, and methods are published to facilitate community discussions on how to build improved methods. By engaging participants from a wide range of science and engineering backgrounds, DREAM challenges can comparatively evaluate a wide range of statistical, machine learning, and biophysical methods. Here, we describe DREAMTools, a Python package for evaluating DREAM challenge scoring metrics. DREAMTools allows one to reproduce results from past DREAM challenges. The software also provides a command line interface that enables researchers to test new methods on past challenges, as well as a framework for scoring new challenges. As of September 2015, DREAMTools includes more than 80% of completed DREAM challenges. DREAMTools complements the data, metadata, and software tools available at the DREAM website (http://dreamchallenges.org) and on the Synapse platform (www.synapse.org). In the poster, we will give an overview of the past and present challenges and show how the DREAMTools package can be used to reproduce scores from previous competitions. We will also describe the scoring functions that are currently available within the package and how new challenges can be added to it.
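
To make the "automated scoring metric plus leaderboard" idea concrete, here is a minimal, generic sketch. It does not use or reproduce the DREAMTools API. The Pearson-correlation metric, the function names, and the team names are assumptions chosen for illustration, and each challenge defines its own metric.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def leaderboard(gold, submissions):
    """Score every submission against the gold standard and rank best first."""
    scores = {team: pearson(pred, gold) for team, pred in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0, 4.0]
    submissions = {"team_a": [1.1, 1.9, 3.2, 3.9], "team_b": [4.0, 3.0, 2.0, 1.0]}
    for team, score in leaderboard(gold, submissions):
        print(team, round(score, 3))
```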

...............................................................................................................................

Poster: DR04
A Two-layer Predictor for DREAM 9.5 Olfaction Prediction Challenge

Ping-Han Hsieh¹, Bor-Wei Cherng², Yu-Chuan Chang¹, Ming-Yi Hong¹, Yi-An Tung³, Yen-Jen Oyang⁴, Chien-Yu Chen⁵

¹National Taiwan University, Taipei, Taiwan
²National Taiwan University and Academia Sinica, Taipei, Taiwan
³Genome and Systems Biology Program, National Taiwan University and Academia Sinica, Taipei, Taiwan
⁴Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
⁵Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan

Abstract: Olfaction is one of the most important senses in animal behavior. Understanding the olfactory percept in terms of molecular properties may broaden our understanding of sensory cognition and enable more odorant applications in industry. Although many methods for predicting human olfactory perception have been published, prediction accuracy is expected to improve further with additional training data sets. This challenge was designed to predict personal olfactory perception from the chemical features of compounds. According to the one-neuron-one-receptor rule from olfactory sensory studies, odor molecules bind to specific types of olfactory receptors, which transmit neuronal signals to multiple brain regions to generate the olfactory percept. Because this neuronal circuitry is conserved, it is reasonable to hypothesize that hidden relations exist between olfactory perception and molecular properties, and that these relations allow smells to be perceived precisely. To tackle this olfaction prediction challenge, we built a machine-learning pipeline that generates individual-specific, ensemble-based linear models, modified from the PCA-based baseline model provided by the challenge organizers. The adopted features include molecular physicochemical properties from Dragon molecular descriptors, while the target information came from perceptual data collected in the Rockefeller University Smell Study. Because sensory perception varies between individuals, we hypothesized that the predictive models for different persons may differ slightly from each other while still sharing some common factors. We therefore built a predictive model for each individual with respect to each chemical compound, and expected the ensemble approach to select the individual-specific features carefully.
The proposed method also employed a second-layer prediction framework to predict the target value 'valence/pleasantness'. The results revealed that the two-layer approach performed better than the conventional single-layer design, suggesting that some of the olfactory descriptors are highly related.

...............................................................................................................................

Poster: DR05
Attractor Metafeatures Discover Molecular Signatures for Odor-Perception Prediction

Andrew Matteson¹

¹Applied BioMath, United States

Abstract: In chemoinformatics, the predictors generally outnumber the samples available for a target variable. The key challenge in working with such data sets is avoiding overfitting. My approach sought to preserve as much information about the target variable as possible while reducing the dimensionality of the data. My methods focused on the use of mutual information either to select features or to project all the features onto a space that was a good predictor of the target variables. For selection, my methods are connected to the technique of maximum-relevance-minimum-redundancy (MRMR) feature selection.

I engineered "metapredictors" from the chemoinformatics predictors using the "Attractor Metagene" algorithm (1). The algorithm engineers features in an unsupervised way by weighted averaging over the original features. Weights are chosen as a function of the mutual information between the engineered feature and the original features.
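
The iteration described above can be sketched as follows. This is a simplified stand-in, not the published algorithm: it swaps mutual information for a clipped Pearson correlation to stay dependency-free, and the `exponent`, `tol`, and seed-feature choices are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def attractor_metafeature(features, seed, exponent=5, tol=1e-6, max_iter=100):
    """Iterate a weighted average of features until it stops changing.

    features: dict mapping feature name -> list of values (one per sample).
    Association is a correlation stand-in for mutual information (assumption).
    """
    meta = list(features[seed])
    weights = {}
    for _ in range(max_iter):
        weights = {name: max(pearson(meta, col), 0.0) ** exponent
                   for name, col in features.items()}
        total = sum(weights.values())
        if total == 0:
            break
        new_meta = [sum(weights[name] * features[name][i] for name in features) / total
                    for i in range(len(meta))]
        shift = max(abs(a - b) for a, b in zip(new_meta, meta))
        meta = new_meta
        if shift < tol:
            break
    return meta, weights

if __name__ == "__main__":
    data = {"a": [1, 2, 3, 4], "b": [1.1, 2.0, 2.9, 4.2], "noise": [3, 1, 4, 1]}
    meta, weights = attractor_metafeature(data, seed="a")
    print({name: round(w, 3) for name, w in weights.items()})
```

Features that move together with the evolving metafeature keep high weights, while unrelated ones are driven toward zero, which is what makes the resulting weights interpretable.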

Several of the metapredictors map onto known chemical structures associated with particular smells. Other metapredictors correspond to chemical structures not yet identified as corresponding to odors. These metapredictors enable connections to be made between molecular structure and perception that are directly interpretable.

The learning methods I developed fall into an enrich-project-predict framework. I discuss opportunities to extend these methods to other chemoinformatics and bioinformatics machine learning problems.

(1) Cheng, Wei-Yi, T. H. Ou Yang, and Dimitris Anastassiou. "Biomolecular events in cancer revealed by attractor metagenes." PLoS Comput Biol 9.2 (2013): e1002920.

...............................................................................................................................

Poster: DR06
Reflecting on the Prostate Cancer Dream Challenge: Lessons Learned

Team Jayhawks

Devin C. Koestler¹, Joseph Usset¹, Stefan Graw¹, Richard Meier¹, Rama Raghavan¹, Junqiang (Eric) Dai¹, Prabhakar Chalise¹, Shellie Ellis², Brooke L. Fridley¹

¹Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, United States
²Department of Health Policy and Management, University of Kansas Medical Center, Kansas City, KS, United States

Abstract: From March through August 2015, nearly 60 teams from around the world participated in the Prostate Cancer Dream Challenge, cosponsored in part by the Prostate Cancer Foundation, the National Cancer Institute (NCI), and the American Joint Committee on Cancer (AJCC). Participating teams were faced with the task of developing prediction models for patient survival and treatment toxicity using clinical variables collected from the comparator arms of four phase III clinical trials, including over 2,000 metastatic castrate-resistant prostate cancer patients treated with first-line docetaxel. In this poster presentation, we describe: (a) the three sub-challenges comprising the Prostate Cancer Dream Challenge, (b) the statistical metrics used by the challenge organizers for benchmarking the performance of prediction models for each sub-challenge, and (c) our team's overall analytic strategy for addressing each of the challenge objectives. Specifically, we discuss our approach for identifying clinically important risk predictors (i.e., feature selection and dimension reduction), the methodological framework(s) considered by our team for model development and validation, including the ensemble-based Cox proportional hazards regression model representing our final submission, and the adaptation of our modeling framework based on the results from the intermittent leaderboard rounds. As the aftermath of the Prostate Cancer Dream Challenge has prompted our team to reflect on the lessons learned throughout the challenge, we also provide our perspectives on the importance of delegation, collaboration, data cleaning, and organization in challenges such as the Prostate Cancer Dream Challenge.

...............................................................................................................................

Poster: DR07
Feature Selection & Random Forest for ALS prediction

Witold R. Rudnicki¹, Wojciech Lesiński¹, Aneta Polewko-Klim¹, Krzysztof Mnich², Agnieszka Golińska¹

¹Department of Bioinformatics, University of Białystok, ²Computational Centre, University of Białystok, Konstantego Ciołkowskiego 1M, 15-245 Białystok, Poland

Abstract: The main goal of the challenge was to find the clustering of patients that would improve prediction of the progression of the ALS disease. The success of the clustering was measured by comparing prediction of the progress of the disease with actual data. We provided answers to all questions of the challenge, namely predicting disease progress and eventual death of patients for two data sets.

The modeling was performed in three steps: feature construction, feature selection, and model building. Features describing the time series data were constructed following the approach proposed by the winners of the ALS DREAM 7 Challenge [1].

We attempted clustering using informative features, but without success; hence the final clusters were based only on the availability of data for a given object.

Feature selection was based on information entropy. Information gain was computed for all variables and all pairs of variables. Informative variables and pairs were selected, and redundant features were removed.
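
As a minimal sketch of entropy-based selection, the code below computes information gain for single discrete variables and for pairs. It assumes that features have already been discretized. The function names are illustrative and this is not the team's implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def pair_information_gain(f1, f2, labels):
    """Information gain of two features considered jointly."""
    return information_gain(list(zip(f1, f2)), labels)

if __name__ == "__main__":
    # XOR-like case: each variable alone is uninformative, the pair is not.
    f1, f2, y = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
    print(information_gain(f1, y), pair_information_gain(f1, f2, y))
```

The XOR-like example shows why scoring pairs can matter: a variable with zero gain on its own may still be informative in combination with another.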

Final models were built using a random forest classifier [2] with six original features. All possible combinations of variables were tested using cross-validation.

Eleven informative features for question 1 are (variables used by best model denoted by *, second model by ^): onset_delta*^, hands*^, Q1_Speech*^, Q9_Climbing_Stairs*^, Q5_Cutting*, fvc*, ALSFRS_Total^, Q3_Swallowing^, Q6_Dressing, Creatinine, Q4_Handwriting.

The cross-validated correlation of these models was 0.43 and 0.42, respectively, and the RMSD was 0.549 and 0.568.

Eight informative variables for the second question are (marked as previously): onset_delta*^, ALSFRS_Total*^, Creatinine*^, weight*^, fvc*, fvc_percent*, Chloride^, Gender^.

Cross-validated classification error for the first model was 0.22 for 12 months and 0.18 for 18 and 24 months.

There were six informative variables for both question 3 (trunk, Q4_Handwriting, Q3_Swallowing, Q8_Walking, hands, ALSFRS_R) and question 4 (ALSFRS_R_Total, ALSFRS_Total, hands, MEDHx_Thyroid, Q6_Dressing_and_Hygiene, R3_Respiratory_Insufficiency).
The quality of data was much lower for the second data set, and although we obtained sets of informative variables, the results were on the border of random. The correlation coefficient for the third question was 0.15, and the classification error for the fourth question was 0.35; both results were estimated by cross-validation.

[1] Küffner R. et al. (2015) Nature Biotechnology 33(1), 51-57.
[2] Breiman, L. (2001) Machine Learning, 45, 5-32.

...............................................................................................................................

Poster: DR08
Predicting Discontinuation of Docetaxel Treatment for Metastatic Castration-Resistant Prostate Cancer (mCRPC) with hill climbing and random forest

Team Yoda
Daniel Kristiyanto, Kevin Anderson, Seyed Sina Khankhajeh, Kaiyuan Shi, Seth West, Ling Hong Hung, Azu Lee, Qi Wei, Migao Wu, Yunhong Yin, Ka Yee Yeung*

Institute of Technology, University of Washington, Tacoma, WA, United States
*Corresponding author

Motivation: Prostate cancer patients may develop resistance to androgen deprivation therapy (ADT) [1,2]. In sub-challenge 2 of the DREAM 9.5 Prostate Cancer Challenge [2], we developed models to predict patient outcomes in metastatic castrate-resistant prostate cancer (mCRPC) with subsequent discontinuation of docetaxel therapy.

Objective: The input data consist of 131 variables measured across clinical data from three clinical trials, namely, Memorial Sloan Kettering (MSK, with 476 patients), Celgene (with 526 patients), Sanofi (with 598 patients). The goal is to predict which patients in a fourth clinical trial (test data), AstraZeneca (AZ, with 470 patients), would discontinue treatment due to adverse events within 3 months.

Data & Methods: Data cleansing and pre-processing. Data cleansing was done separately within each clinical trial, and the results were later merged back together. Our data cleansing and pre-processing procedures include imputation of missing data [3] and removal of clinical variables with a high percentage of missing data. Data augmentation was also performed by converting selected multi-label variables into binary variables.

Feature selection. We observed that univariate feature selection methods did not perform well. Hence, we adopted a hill-climbing [4] approach that optimized the AUC within 10-fold cross-validation of the training data. We also addressed the issue of imbalanced data (a total of 1292 negative samples and 197 positive samples) by randomly removing negative samples to reach a ratio of roughly 60% negative to 40% positive samples.
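
The class-balancing step can be sketched as random undersampling of the negative class. The function name, the fixed seed, and the rounding-down rule are assumptions made for this illustration. Only the sample counts and the 60/40 target come from the abstract.

```python
import random

def undersample_negatives(labels, positive_fraction=0.4, seed=0):
    """Return sorted indices keeping all positives and a random subset of negatives."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed so that positives make up `positive_fraction`.
    target = int(len(pos) * (1 - positive_fraction) / positive_fraction)
    n_keep = min(len(neg), target)
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_keep))

if __name__ == "__main__":
    labels = [1] * 197 + [0] * 1292   # class counts reported in the abstract
    kept = undersample_negatives(labels)
    n_pos = sum(labels[i] for i in kept)
    print(len(kept), n_pos, round(n_pos / len(kept), 3))
```

With 197 positives, a 40% positive share requires about 295 negatives, so roughly 1000 negative samples are discarded before training.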

Classification. We applied random forest [5] using Sanofi as the hold-out, setting the parameter “mtry” to 25% of the number of features and the number of trees to 100 times the number of features.

Assessment: For validation, we repeated the training step 10 times, with the average AUC as the assessment criterion. We also conducted an additional assessment by holding out one of the three clinical trials. Our predictive model using MSK and Celgene data as the training set and Sanofi data as the test set yielded AUC = 0.165, accuracy = 0.9, precision = 0.21, F1 = 0.092, and recall = 0.06.
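
As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The helper below is a generic illustration, not the team's evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Rounded values from the hold-out assessment above.
    print(round(f1_score(0.21, 0.06), 3))
```

The rounded inputs give about 0.093, which agrees with the reported F1 of 0.092 to within the rounding of precision and recall. High accuracy alongside low recall is what one would expect when the positive class is rare.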

Results: Our final submission for predicting the discontinuation of docetaxel in the AstraZeneca clinical trial (using MSK, Celgene and Sanofi as training data) resulted in an AUC of 0.13. Of the 470 patients in the AstraZeneca clinical trial, 8 were predicted to discontinue the treatment within 3 months.

Acknowledgement: Ling Hong Hung and Ka Yee Yeung are supported by NIH grant U54-HL127624. This project used computing resources provided by Microsoft Azure. We would like to thank all students in TCSS 588 Bioinformatics in Spring 2015 at University of Washington Tacoma who contributed to this project.

References
1. Gupta, Eva, Troy Guthrie, and Winston Tan. "Changing paradigms in management of metastatic Castration Resistant Prostate Cancer (mCRPC)." BMC urology 14.1 (2014): 55.
2. "DREAM 9.5 Prostate Cancer DREAM Challenge - Dream ..." 2015. 8 Oct. 2015 <http://dreamchallenges.org/project/closed/project dataspheresprostate-cancer-challenge/>
3. Hastie T, Tibshirani R, Narasimhan B and Chu G. impute: Imputation for microarray data. R package version 1.42.0.
4. Romanski, P. "FSelector: Selecting attributes." Vienna: R Foundation for Statistical Computing (2009).
5. Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.

...............................................................................................................................

Poster: DR09
Predicting olfaction response for each individual
Yuanfang Guan

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor

Abstract: This abstract describes the method I developed for the 2015 DREAM Olfaction Challenge, sub-challenge 1: building models to predict the olfactory response of each individual. I used decision trees as the base learner for the chemical structural data, for two reasons: 1) the structural data are high-dimensional, with over 4000 parameters, and decision trees help to reduce the dimension; 2) the data matrix is sparse, so a decision boundary can be placed between zeros and the remaining values. Olfactory responses reported by individuals were noisy. I therefore used the global response across all individuals to balance the individual response, in order to 1) capture the personalized features and 2) stabilize the predictions. For example, a chemical reported to be ‘sweet’ by individual A is trusted more when other people also report this chemical to be ‘sweet’. Finally, I used 0.2 * individual score + 0.8 * global average for each chemical as the prediction, although these weights can be varied considerably while still achieving decent performance. A similar technique also produced one of the best-performing algorithms in the 2014 DREAM Broad Institute Gene Essentiality Challenge. No external data were used. This technique turned out to be the best-performing method for this sub-challenge.
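
The final blending step is simple enough to write out. The sketch below uses the 0.2/0.8 weights stated in the abstract. The function and argument names are assumptions.

```python
def blend(individual_scores, global_average, w_individual=0.2):
    """Shrink each individual's predicted rating toward the across-subject average."""
    w_global = 1.0 - w_individual
    return [w_individual * ind + w_global * glob
            for ind, glob in zip(individual_scores, global_average)]

if __name__ == "__main__":
    # One individual's predicted 'sweet' ratings for three chemicals,
    # alongside the average prediction across all individuals.
    print(blend([10.0, 60.0, 90.0], [20.0, 50.0, 40.0]))
```

Weighting the global average heavily trades some personalization for stability, which is the balance the abstract describes for noisy individual ratings.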

...............................................................................................................................

Poster: DR10
Predicting olfactory perception from chemical structure: a gradient boosting model

Chung Wen Yu¹, Yusuke Ihara¹,², and Joel D. Mainland¹,³

¹Monell Chemical Senses Center, Philadelphia PA, United States
²Institute for Innovation, Ajinomoto Co., Inc., Kawasaki, Japan.
³Department of Neuroscience, University of Pennsylvania School of Medicine, Philadelphia PA, United States

A fundamental problem in olfaction is to understand how the physical properties of a stimulus relate to perceptual characteristics. In vision, wavelength translates into color; in audition, frequency translates into pitch. By contrast, the mapping from chemical structure to olfactory percept is unknown. In other words, there is not a scientist or perfumer in the world who can view a novel molecular structure and predict how it will smell. Here we used an unpublished dataset in which 49 subjects rated 476 odors to develop a model that uses physicochemical descriptors to predict 21 perceptual features. Previously published models for predicting pleasantness have used principal components of physicochemical descriptors or molecular complexity; both performed poorly on this dataset (Khan et al., 2007: r = 0.25, p < 0.001; Kermen et al., 2011: r = 0.26, p < 0.0001). Our model outperformed these previously published models on a validation set (r = 0.61, p < 0.001).
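
The abstract does not describe the model's configuration, so the following is only a generic sketch of gradient boosting for squared-error regression. It is restricted to a single feature and depth-one trees (stumps) to stay short. The learning rate, number of rounds, and function names are assumptions.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosting(x, y, n_rounds=100, learning_rate=0.1):
    """Fit stumps sequentially to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                 for xi, p in zip(x, preds)]
    return base, stumps, learning_rate

def predict(model, xi):
    """Sum the base value and the shrunken contributions of all stumps."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if xi <= t else rm) for t, lm, rm in stumps)

if __name__ == "__main__":
    model = fit_boosting([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
    print(round(predict(model, 2), 3), round(predict(model, 5), 3))
```

A practical model would use multivariate trees over many physicochemical descriptors, but the residual-fitting loop is the same.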

Khan, R. M., Luk, C.-H., Flinker, A., Aggarwal, A., Lapid, H., Haddad, R., & Sobel, N. (2007). Predicting odor pleasantness from odorant structure: pleasantness as a reflection of the physical world. The Journal of Neuroscience : the Official Journal of the Society for Neuroscience, 27(37), 10015–10023.

Kermen, F., Chakirian, A., Sezille, C., Joussain, P., Le Goff, G., Ziessel, A., et al. (2011). Molecular complexity determines the number of olfactory notes and the pleasantness of smells. Scientific Reports, 1, 206–206. http://doi.org/10.1038/srep00206

...............................................................................................................................

Poster: DR11
Predicting patient survival in the DREAM 9.5 mCRPC challenge

Team FIMM-UTU
Teemu D. Laajala¹,², Suleiman Khan², Antti Airola³, Tuomas Mirtti²,⁴, Tapio Pahikkala³, Peddinti Gopalacharyulu², Tero Aittokallio¹,²

¹ Department of Mathematics and Statistics, University of Turku, Finland
² Institute for Molecular Medicine Finland, University of Helsinki, Finland
³ Department of Information Technology, University of Turku, Finland
⁴ Department of Pathology, HUSLAB, Helsinki University Hospital, Finland

In this poster, our team FIMM-UTU presents the top performing ensemble of penalized regression models for predicting patient survival in the context of metastatic castration-resistant prostate cancer (mCRPC) patients, originating from several clinical trials (subchallenge 1a of the DREAM 9.5 Prostate Cancer Challenge). Here, we outline the key stages in our method development: (i) processing raw data input; (ii) imputation of missing values, filtering and truncation; (iii) utilizing unsupervised learning to identify most relevant preliminary patterns; (iv) fitting batch-wise optimized penalized regression glmnet-models; and (v) constructing the final ensemble collection of models for performing accurate novel predictions.

By coupling unsupervised learning with survival-analysis-based supervised learning, we constructed an ensemble of batch-wise optimized penalized regression coxnet-models. The final ensemble models were simultaneously optimized over the L1/L2-norm mixing parameter α and the penalization coefficient λ using cross-validation, and were averaged over multiple cross-validation runs to average out randomness in the binning. We present model-based imputation of missing values and the incorporation of clinical a priori knowledge of variables, along with practical lessons learned from processing such challenging clinical data, which required wide multidisciplinary expertise. We systematically identified and validated multiple key upstream modeling decisions that contributed to our successful performance, and we provide diagnostics showing which selected variables and modeling strategies contributed most favorably to the final ensemble of models. Finally, we report clinically novel findings from the models and discuss their importance for future prognostic modeling in mCRPC.
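
For readers unfamiliar with the roles of α and λ, the sketch below writes out the elastic-net penalty in the parametrization documented for glmnet, plus a simple averaging of linear risk scores across ensemble members. The averaging rule and all names are illustrative assumptions. The abstract does not specify how the ensemble members were combined.

```python
def elastic_net_penalty(beta, alpha, lam):
    """glmnet-style penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def ensemble_risk(models, x):
    """Average the linear risk scores x . beta over several coefficient vectors."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in models]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    beta = [1.0, -2.0]
    print(elastic_net_penalty(beta, alpha=1.0, lam=2.0))   # pure lasso
    print(elastic_net_penalty(beta, alpha=0.0, lam=2.0))   # pure ridge
    print(ensemble_risk([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0]))
```

Setting α = 1 gives the lasso and α = 0 gives ridge regression, so tuning α and λ jointly selects both the type and the strength of regularization.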

...............................................................................................................................

Poster: DR12
GENESIS: A variation discovery framework for clinical cancer genomic profiling

Allen Chi-Shing YU¹,²*, Aldrin Kay-Yuen YIM¹,³*, Marco Jing-Weoi LI¹,²*

¹ Codex Genetics Limited, Hong Kong
² School of Life Sciences, The Chinese University of Hong Kong, Hong Kong
³ Computational & System Biology Program, Washington University School of Medicine
*Co-first authors

Large-scale cancer genomics projects carried out by consortia such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have had an enormous impact on the understanding, diagnosis and treatment of cancer. Data generated in these projects provide important clinical insights for the advancement of bioinformatics algorithms, for instance somatic variant callers with increasing accuracy and the identification of very rare malignant clones as well as DNA/RNA structural alterations. To participate in the ICGC-TCGA SMC-DNA DREAM Challenge (Intel-10 SNV real-tumor Sub-Challenge), we developed a machine learning-based ensemble somatic variant calling pipeline, Genesis. Genesis was trained on a public cancer genomic dataset with validated mutation calls, and over thirty informative metrics were gathered from four individual variant callers (MuTect, Strelka, Vardict and VarScan2) and from genomic sequence features such as regional GC level, entropy, mappability and mutational hotspots. Variants were then classified as true or false positives using Support Vector Machine (SVM) models optimized for the tumor type and variant type (SNP or indel). Based on a published dataset of 88 hepatocellular carcinomas and in-house clinical cancer samples, we demonstrated that our machine learning approach can achieve higher sensitivity and specificity than the individual call sets. Genesis is therefore expected to be well suited to clinical cancer genomic profiling.

...............................................................................................................................

Poster: DR13
Predicting odor perception from molecular structure using a "nearest neighbor" approach

Aharon Ravia¹, Lavi Secundo¹, Kobi Snitz¹, and Noam Sobel¹

¹Department of Neurobiology, Weizmann Institute of Science

The DREAM olfaction challenge was to predict the perceptual qualities of novel molecules according to their structural properties. The data consisted of 476 molecules rated by 49 subjects across 21 different descriptors.

Predicting odor perception from odor structure is a major goal in olfaction research. We and others have made initial steps in this direction such that we can now predict aspects like odorant pleasantness (Khan et al., 2007; Zarzo 2007; Koulakov et al., 2011) and pairwise odorant similarity (Snitz et al., 2013) from odorant structure alone. These abilities rest in part on the observation that principal component analysis (PCA) of molecular descriptors has predictive power, whereby the first principal component of the physicochemical space is related to perceived odor pleasantness.

Here we set out to apply PCA and a "nearest neighbors" approach to the challenge data using a variation of the Rotation Forest method (Rodríguez, Kuncheva, & Alonso, 2006). This method applies PCA to random splits of the features to create weak predictors and then combines them into an ensemble. In the Rotation Forest method, classification is done on a rotated subspace. To answer the challenge we needed a continuous estimator instead of a classifier, and used a nearest-neighbor approach for this purpose. Moreover, we performed the estimation on one subspace of features at a time.
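
A stripped-down sketch of the ensemble idea follows: nearest-neighbor regression on random feature subsets, averaged across members. It deliberately omits the PCA rotation step of Rotation Forest (a pure rotation that keeps all components leaves Euclidean neighbor distances within a subset unchanged). The subset size, `k`, member count, and function names are assumptions, so this should not be read as the authors' method.

```python
import random

def knn_predict(train_x, train_y, query, features, k=3):
    """Mean target of the k training rows closest to `query` on the given features."""
    distances = sorted(
        (sum((row[f] - query[f]) ** 2 for f in features), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

def subspace_knn_ensemble(train_x, train_y, query,
                          n_members=10, subspace_size=2, k=3, seed=0):
    """Average k-NN estimates computed on random feature subspaces."""
    rng = random.Random(seed)
    n_features = len(train_x[0])
    predictions = []
    for _ in range(n_members):
        features = rng.sample(range(n_features), subspace_size)
        predictions.append(knn_predict(train_x, train_y, query, features, k))
    return sum(predictions) / len(predictions)

if __name__ == "__main__":
    # Toy descriptors for four molecules and their pleasantness ratings.
    x = [[0, 0, 0], [1, 1, 1], [10, 10, 10], [11, 11, 11]]
    y = [0.0, 0.0, 10.0, 10.0]
    print(subspace_knn_ensemble(x, y, [0.5, 0.5, 0.5], k=2))
```

Because each member sees only a few descriptors, the ensemble also exposes which molecules are neighbors in which subspaces, in line with the interest in proximity between molecules.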

This method yielded better results than other methods we tried, such as linear regression. For example, when we correlated predicted vs. actual pleasantness ratings of the leaderboard data we obtained R ≈ 0.52. We think that although others were able to get similar or better predictions using other methods, this kind of analysis may reveal facts about the physicochemical space and extract characteristics such as proximity between molecules. Finally, we examine the possibility that the olfactory system solves olfactory space using a similar strategy.

...............................................................................................................................

Poster: DR14
Ensemble Approaches To Prostate Cancer Dream Challenge

Team The Data Wizard
Wen-Chieh Fang¹, Li-Min Tu², Huan-Jui Chang³, Chia-Tse Chang¹, Yu-Fu Wang¹, Mu-Hung Tsai⁴, Alexey Yu. Lupatov⁵, Konstantin N. Yarygin⁵, Hsih-Te Yang¹,⁴, Chiang Jung-Hsien¹,⁴

¹Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
²Department of Computer Science and Engineering, New York University, United States
³Department of Economics, National Cheng Kung University, Taiwan
⁴Institute of medical informatics, National Cheng Kung University, Taiwan
⁵Orekhovich Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences, Moscow, Russia

Abstract: Prostate cancer is the most common cancer diagnosed in many countries and the third most common cause of cancer death. However, prognostic models for overall survival of patients are dated. Based on preliminary exploration of the three sets of raw trial data, we model the data using ensemble learning in subchallenge 1a. We provide a novel boosting approach to tackle the problem in subchallenge 1b. For the second subchallenge, we combine three different prediction models to derive the final prediction.

We first preprocess the data to handle missing values and use a filter method for feature selection. For predicting overall survival, we derive a different feature set for each of the three training sets (ASCENT2, CELGENE, EFC6546) and then apply the Cox model with maximum penalized likelihood on the selected features to train a specific model for each set. We combine the ranking results from the different models. To predict the exact time to event (death of a patient), we develop a novel AdaBoost-like regression algorithm for survival problems, called "adaboost-s". In the training phase, if the predicted time to event of a censored data point is smaller than the lower bound, the data point is considered incorrectly predicted and its weight is increased. To predict treatment discontinuation due to adverse events at early time points for patients treated with docetaxel, we apply an ensemble technique to combine the ranking results of several methods. The methods include two random forest classifiers and one gradient boosting classifier.
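
The reweighting rule for censored data points described above can be sketched in Python as follows. This is only an illustration, not the adaboost-s algorithm itself: the function name, the multiplicative `factor`, and the `tolerance` rule for uncensored points are assumptions, since the abstract specifies only the rule for censored points.

```python
import numpy as np


def update_weights(weights, predicted, observed, censored,
                   tolerance=0.0, factor=2.0):
    """One AdaBoost-style reweighting step for time-to-event predictions."""
    weights = np.asarray(weights, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    censored = np.asarray(censored, dtype=bool)

    # Censored points: the true event time is at least the censoring time,
    # so only a prediction below that lower bound counts as an error.
    wrong_censored = censored & (predicted < observed)
    # Uncensored points (assumed rule): error if off by more than `tolerance`.
    wrong_observed = ~censored & (np.abs(predicted - observed) > tolerance)

    wrong = wrong_censored | wrong_observed
    # Increase the weight of incorrectly predicted points, then renormalize.
    new_weights = np.where(wrong, weights * factor, weights)
    return new_weights / new_weights.sum()


# Four patients: the first two are censored at time 20.
w = update_weights(
    weights=[0.25, 0.25, 0.25, 0.25],
    predicted=[10, 30, 20, 5],
    observed=[20, 20, 20, 20],
    censored=[True, True, False, False],
    tolerance=1.0,
)
```

In this toy call the first patient (censored, predicted below the lower bound) and the last patient (uncensored, far off) gain weight, while the second patient is treated as correct because the prediction exceeds the censoring time.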

The main advantage of ensembles of different models is that it is unlikely that all models will make the same mistake. Ensembles tend to reduce the variance of models. Therefore, we apply ensemble approaches to deal with the prediction problems in this challenge. On the leaderboard, our approaches outperform the baseline and the methods of many teams.
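
One simple way to combine ranking results from several models is to average the per-model ranks. The Python sketch below assumes this averaging rule and uses made-up scores; the abstract does not state which combination rule was actually used.

```python
import numpy as np
from scipy.stats import rankdata


def average_ranks(score_lists):
    """Convert each model's scores to ranks and average them per patient."""
    ranks = np.array([rankdata(scores) for scores in score_lists])
    return ranks.mean(axis=0)


# Hypothetical risk scores from three models for four patients.
model_scores = [
    [0.10, 0.40, 0.35, 0.80],
    [0.20, 0.90, 0.30, 0.50],
    [0.30, 0.20, 0.60, 0.70],
]
combined = average_ranks(model_scores)
highest_risk = int(np.argmax(combined))
```

Working on ranks instead of raw scores avoids having to calibrate the outputs of different model types against each other.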

...............................................................................................................................

Poster: DR15
Supervised ensembles to boost the predictive power of DREAM challenges

Gaurav Pandey¹, Gustavo Stolovitzky¹,², Sean Whalen³, Lara Mangravite⁴, Solveig Sieberts⁴, Abhishek Pratap⁴ and Om Prakash Pandey¹

¹Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY
²IBM Research, NY
³Gladstone Institutes, University of California at San Francisco
⁴Sage Bionetworks, Seattle

Abstract: Prediction problems in biomedical sciences, such as those posed in DREAM challenges, are well known to be difficult to solve convincingly. This is due in part to incomplete knowledge of the biomedical phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. These issues are reflected in the diversity of prediction techniques, datasets, domain knowledge and other ingredients used to develop submissions to DREAM challenges. In such scenarios, a powerful approach to improving prediction performance is to construct ensemble predictors that combine the output of complementary individual predictors derived from diverse techniques and/or datasets. Traditional ensemble methods like boosting, bagging and random forest are insufficient for this task as they (generally) assume that the individual/base predictors are of the same type. They also expect the ensemble process to have control over the generation of these predictors from (generally) a single training set. Neither of these assumptions holds in the challenge setting. Thus, in this work, we propose the use of heterogeneous ensemble methods, such as stacking and ensemble selection, for building effective ensembles for DREAM challenges as well as other biomedical prediction problems. First, using several protein function and genetic interaction prediction datasets, we illustrated how such heterogeneous ensembles can provide statistically significant gains over individual predictors, including those based on boosting and random forests (Whalen et al., Methods, 2015).
Deeper analysis shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional individual predictors. Motivated by these results, we built stacking-based ensembles for subchallenge 2 of the Rheumatoid Arthritis anti-TNF drug response challenge. Using only six of the individual predictors, these ensembles (AUPR=0.5228) again provided prediction gains over the two best individual predictors (AUPR=0.5099 and 0.5071). In current work, we are trying to realize such gains for other DREAM challenges as well, especially by systematically addressing the theoretical and implementation issues associated with this task.
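
As a rough illustration of stacking over predictors that the ensemble builder does not control, the Python sketch below fits a meta-learner on a matrix of already-computed predictor outputs. Everything here is assumed for illustration: the six predictors are simulated as label-plus-noise scores, logistic regression is used as the meta-learner, and cross-validated predictions are used so that the stacked score is not evaluated on its own training data. It does not reproduce the authors' pipeline or the AUPR values reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_samples, n_predictors = 400, 6

# Binary response labels (synthetic).
y = rng.integers(0, 2, size=n_samples)

# Hypothetical outputs of six independently built predictors:
# the true label plus noise of increasing strength.
noise_scale = np.linspace(1.0, 2.0, n_predictors)
P = y[:, None] + rng.normal(size=(n_samples, n_predictors)) * noise_scale

# Stacking: learn how to weight the predictor outputs with a meta-learner,
# scoring each sample with a model that did not see it during fitting.
meta = LogisticRegression(max_iter=1000)
stacked = cross_val_predict(meta, P, y, cv=5, method="predict_proba")[:, 1]

aupr_stacked = average_precision_score(y, stacked)
aupr_single = [average_precision_score(y, P[:, j]) for j in range(n_predictors)]
```

Because the meta-learner sees only predictor outputs, the base predictors may come from any technique or dataset, which matches the heterogeneous setting described in the abstract.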

...............................................................................................................................

Poster: DR16
A Boosting approach and Cox model for Predicting Slope and Survival of ALS

Wen-Chieh Fang¹, Chen Yang¹, Huan-Jui Chang², Hsih-Te Yang¹,³, Jung-Hsien Chiang¹,³

¹Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
²Department of Economics, National Cheng Kung University, Taiwan
³Institute of Medical Informatics, National Cheng Kung University, Taiwan

Abstract: Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that leads to muscle weakness and gradually impairs the functioning of the body, leading to eventual death. It greatly reduces an individual's life expectancy. Currently, experts do not know precisely what causes ALS, and there is no known cure. The DREAM ALS Stratification Prize4Life Challenge is held with the goal of enabling better understanding of patient profiles and application of personalized ALS treatments.

In our approach to the DREAM ALS Challenge, we first disregard those features for which 90 percent of the values are missing. For the remaining features, we replace any missing value with the mean of that variable over all other cases. For all four subchallenges, we apply equal-frequency binning, which divides the response variable into three groups such that each group contains approximately the same number of values. There are two kinds of features: static features and 'time-resolved' features (those whose values vary over time). For the latter, we try two designated measurements, the minimum and the maximum, as features. Then, for both kinds of features, we apply feature selection based on information gain to select the top six features. In order to select optimal features, we run cross-validation on the feature candidates. We apply Gradient Boosted Regression Trees (GBRT) to predict the ALSFRS slopes. GBRT computes a sequence of simple decision trees, where each successive tree is built for the prediction residuals of the preceding tree. We apply the Cox model with maximum penalized likelihood to predict the survival probability.
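
The preprocessing and slope-prediction steps above can be sketched in Python with pandas and scikit-learn. This is a minimal illustration on synthetic data, not the authors' code: the feature names are made up, the 90 percent threshold is read as "at least 90 percent missing", scikit-learn's mutual information estimate stands in for information gain, and using the binned response as the target of the feature ranking is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 200

# Synthetic static features and a synthetic ALSFRS slope.
X = pd.DataFrame(rng.normal(size=(n, 10)), columns=[f"f{i}" for i in range(10)])
slope = -0.5 * X["f0"] + 0.3 * X["f1"] + 0.1 * rng.normal(size=n)
X.loc[rng.random(n) < 0.97, "f9"] = np.nan  # mostly missing feature
X.loc[rng.random(n) < 0.10, "f2"] = np.nan  # occasionally missing feature

# 1. Drop features whose missing rate is 90 percent or more.
X = X[X.columns[X.isna().mean() < 0.9]]

# 2. Replace remaining missing values with the column mean.
X = X.fillna(X.mean())

# 3. Equal-frequency binning of the response into three groups.
bins = pd.qcut(slope, q=3, labels=False)

# 4. Rank features against the binned response and keep the top six.
mi = mutual_info_classif(X.values, bins, random_state=0)
top6 = X.columns[np.argsort(mi)[::-1][:6]]

# 5. Gradient boosted regression trees on the selected features.
model = GradientBoostingRegressor(random_state=0).fit(X[top6], slope)
predicted_slope = model.predict(X[top6])
```

In practice the feature candidates would be compared by cross-validation, as the abstract describes, instead of fitting once on all data.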

In this challenge, we think that feature selection is one of the most important steps, and we believe that choosing the most appropriate features dominates the performance of the model. In the last leaderboard round, our team ranked second and our approaches outperformed the methods of most teams.

...............................................................................................................................

Poster: DR17
Using aggregated weights along paths across random forest to select important features and predict ALS progression
Jinfeng Xiao¹ and Jian Peng¹

¹University of Illinois at Urbana-Champaign, USA

Amyotrophic lateral sclerosis (ALS) is typically a rapidly progressing neurodegenerative disease. In many cases it leads to death within 3-5 years from onset of symptoms, but the rate of progression across the patient population can vary by an order of magnitude. Unraveling this underlying heterogeneity can hopefully shed light on disease mechanisms and drug development, and reliable prediction of progression rate can assist clinical decisions.

We developed a novel random forest based method to select 6 important clinical variables from 68 and predict ALS progression based on those 6. After training a random forest F with all n available features (those with a missing rate below 50%), the patients used for training were dropped down the forest and their paths across each tree were tracked. Along each path, nodes were assigned different weights based on their positions along the path. Then node weights from all paths were aggregated so that each patient was represented by a point in an n-dimensional space Rⁿ, where the coordinates are the aggregated weights of the n clinical features. All training patients were then clustered in Rⁿ, and within the i-th cluster (with i ranging from 1 to the number of clusters) the 6 clinical variables with the highest aggregated weights were used to train a new random forest F_i. When a new patient came in, based on the aggregated node weights along his/her paths across F, an F_i and the 6 corresponding clinical variables were selected for predicting his/her ALS progression rate.
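
The procedure above can be sketched in Python with scikit-learn as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the weight 1/(1 + position) for a node at a given position along a path, the use of k-means with two clusters, the forest sizes and the synthetic data are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def aggregated_path_weights(forest, X):
    """Represent each sample by per-feature weights summed along its tree paths."""
    agg = np.zeros(X.shape)
    for tree in forest.estimators_:
        split_feature = tree.tree_.feature      # negative for leaf nodes
        paths = tree.decision_path(X)           # sparse sample-by-node indicator
        for i in range(X.shape[0]):
            nodes = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
            # Node ids increase from root to leaf, so sorting gives path order.
            for position, node in enumerate(np.sort(nodes)):
                f = split_feature[node]
                if f >= 0:
                    # Assumed weighting: nodes closer to the root count more.
                    agg[i, f] += 1.0 / (1 + position)
    return agg


# Synthetic stand-in for clinical features and progression rates.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 12))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=120)

# Forest F trained on all features, then patients embedded by path weights.
forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0).fit(X, y)
W = aggregated_path_weights(forest, X)

# Cluster patients in the weight space and train one forest F_i per cluster
# on the 6 features with the highest aggregated weights in that cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(W)
cluster_models = {}
for c in range(kmeans.n_clusters):
    in_cluster = kmeans.labels_ == c
    top6 = np.argsort(W[in_cluster].sum(axis=0))[::-1][:6]
    sub_forest = RandomForestRegressor(n_estimators=20, random_state=0)
    cluster_models[c] = (top6, sub_forest.fit(X[in_cluster][:, top6], y[in_cluster]))

# A new patient is routed to a cluster via his/her path weights across F.
x_new = rng.normal(size=(1, 12))
c_new = kmeans.predict(aggregated_path_weights(forest, x_new))[0]
top6_new, forest_new = cluster_models[c_new]
prediction = forest_new.predict(x_new[:, top6_new])
```

The key design point is that both clustering and feature selection operate on how the forest uses each feature for a given patient, not on the raw feature values.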

Our method was the top performer in sub-challenge 3, which was to predict ALS progression of patients from two national registries. Several other methods were tested locally, and our method performed best. For example, we tried representing each patient with a point in Rⁿ whose coordinates were the values of the n clinical features instead of their aggregated weights along paths across the random forest, calculated the z-score (aggregated from concordance index, Pearson correlation and root-mean-square deviation) and found that the z-score of our submitted method was 41% higher. We also tried ranking the importance of features using the permutation setting of the importance function of the R package randomForest, and the z-score of our submitted method was 30% higher. Inspired by those preliminary results, we are currently further developing and testing our aggregated weights method.
