DREAM ABSTRACTS

DREAM Prostate Cancer Challenge Session

Project Data Sphere® initiative overview
Liz Zhou, Director, US Medical Affairs, Sanofi

The Project Data Sphere® initiative (PDS) is an independent not-for-profit initiative of the Life Sciences Consortium of the CEO Roundtable on Cancer, with the vision to broadly share, integrate, and analyze historical comparator-arm cancer trial data sets from any source (academic, government, industry, etc.) with protocols, CRFs, and data descriptors, freely online at www.projectdatasphere.org. The goal is to accelerate innovation to improve cancer care.

Launched in April 2014, the PDS website provides registered users access to de-identified patient level raw data from the control arms of Phase 3 oncology clinical trials. At the time of the launch, there were 4,000 patient lives across 9 datasets from 7 industry and academic data providers. As of August 2015, the number of datasets had increased from 9 to 52, with nearly 30,000 patient lives across multiple tumor types.

Data from the PDS platform was used for the Prostate Cancer DREAM Challenge in the summer of 2015; this challenge had the highest registration of any DREAM Challenge to date. Considerable effort went into preparing the data for the Challenge, which resulted in a relatively smooth Challenge process for solvers working with clinical trial data.

PDS provides a model to share clinical trial data from oncology; the rapid growth of the number of users, downloads, and publications demonstrates a viable approach to accelerating research, including exploring innovative ways to analyze the data (e.g. crowdsourcing) without exposing patients to new clinical trials.

...............................................................................................................................

Introduction to the Prostate Cancer DREAM Challenge
James C. Costello¹
¹ Department of Pharmacology, University of Colorado Anschutz Medical Campus, Colorado, USA

Prostate cancer is the most common cancer among men in developed countries and ranks third in terms of mortality after lung cancer and colorectal cancer. Nearly 15% of prostate cancer patients have metastatic disease (Stage IV) at the time of diagnosis. The mainstay of treatment for metastatic disease is androgen deprivation therapy (ADT), though inevitably many patients develop resistance, resulting in metastatic castrate-resistant prostate cancer (mCRPC). To gain a better understanding of mCRPC, the Prostate Cancer DREAM Challenge was developed to address two sub-challenges: 1) predict overall survival for mCRPC patients based on clinical variables, and 2) predict treatment discontinuation due to adverse events at early time points for mCRPC patients treated with docetaxel. The underlying data were collected from 4 separate clinical trials and annotated by Project Data Sphere, LLC. The prognostic calculators for sub-challenge 1 were scored using the integrated Area Under the Curve (iAUC), and sub-challenge 2 was scored using the Area Under the Precision-Recall Curve (AUPRC). The Challenge had over 50 teams with 180 individual researchers actively participating, and more than half of the teams outperformed the standard model in the field for sub-challenge 1. Sub-challenge 2 represents a novel set of results, as the amount of data needed to address discontinuation due to adverse events had not been compiled before the Prostate Cancer DREAM Challenge. The results of the Prostate Cancer DREAM Challenge, specifically the prognostic calculators developed by participating teams, will be made available to clinicians to aid in patient treatment decisions. The methods developed to identify patients likely to discontinue can be used for patient selection in future clinical trial design.

...............................................................................................................................

Prostate Cancer Challenge - Best Performer SC1

Predicting patient survival in the DREAM 9.5 mCRPC challenge
Teemu D. Laajala¹,², Suleiman Khan², Antti Airola³, Tuomas Mirtti²,⁴, Tapio Pahikkala³, Peddinti Gopalacharyulu², Tero Aittokallio¹,²
¹ Department of Mathematics and Statistics, University of Turku, Finland
² Institute for Molecular Medicine Finland, University of Helsinki, Finland
³ Department of Information Technology, University of Turku, Finland
⁴ Department of Pathology, HUSLAB, Helsinki University Hospital, Finland

We present the top-performing ensemble of models for predicting survival of metastatic castration-resistant prostate cancer (mCRPC) patients from several clinical trials (subchallenge 1a of the DREAM 9.5 Prostate Cancer Challenge). By coupling unsupervised learning with survival-analysis-based supervised learning, we constructed an ensemble of batch-wise optimized penalized regression coxnet models. The final ensemble models were optimized simultaneously over the L1/L2-norm mixing parameter α and the penalization coefficient λ. We discuss model-based imputation of missing values and the incorporation of clinical a priori knowledge of the variables, along with practical lessons learned from processing such challenging clinical data, which required wide multidisciplinary expertise. Lastly, we offer our personal view of the clinical novelty of the model coefficients and of the ensemble structure, including interactions among the clinical variables.

...............................................................................................................................

Prostate Cancer Challenge - Best Performer SC2

Docetaxel adverse event prediction: a boosting method application
Fatemeh Seyednasrollah¹,², Mehrad Mahmoudian¹, Outi Hirvonen³,⁴, Sirkku Jyrkkiö³ and Laura L. Elo¹,²
¹Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Turku, Finland
² Department of Mathematics and Statistics, University of Turku, Turku, Finland
³ The Department of Oncology and Radiotherapy, Turku University Central Hospital, Turku, Finland
⁴ The Department of Clinical Oncology, University of Turku, Turku, Finland

The aim of this study was to predict early adverse events of docetaxel in patients with metastatic castration-resistant prostate cancer (mCRPC). More specifically, our objective was to give clinicians more precise guidance, based on baseline clinical features, on whether to continue or discontinue docetaxel within three months of starting the treatment. To address this question, we used boosting, a class of supervised machine learning techniques.

The analysis began with feature selection and data preprocessing. Preliminary predictors were selected after filtering out features with insufficient clinical relevance to the study question (as suggested by our clinical members) and features with high rates of missing values and/or collinearity. For selected feature categories, including lesions, prior medications and diseases, we used an arithmetic sum of the features present in the corresponding category. The selected laboratory values were transformed, scaled or truncated based on their reference ranges and distributions. Finally, these preprocessed features were used to develop the final predictive model.

In the model building step, we focused on the R package “gbm” (Generalized Boosted Regression Modeling). The package uses a gradient boosting approach, which iteratively tests the importance of features and aims to combine several weak models into a powerful, high-performance ensemble prediction. The gbm package can also handle missing values and has a short run time. Beyond this DREAM challenge, boosting methods have been shown to be successful in various practical applications.

In conclusion, the model can help clinicians identify suitable candidates for continuing or discontinuing docetaxel treatment.

...............................................................................................................................

Prostate Cancer Challenge - Best Performer SC2

Predicting discontinuation due to adverse effect in mCRPC
Yuanfang Guan¹,²

¹Department of Computational Medicine and Bioinformatics, Ann Arbor, MI, 48103
²Department of Internal Medicine, Ann Arbor, MI, 48103

In the second sub-challenge of the DREAM 2015 prostate cancer challenge, participants were asked to predict which patients cannot tolerate docetaxel therapy, i.e. experience an early adverse effect (AE), defined as termination within 91.5 days. Multi-task comparison showed that early AE is tightly connected to early death, while the patients who experienced early AE but not early death showed heterogeneous characteristics, preventing them from being used as a gold standard. Thus, predicting adverse effect was transformed into the problem of predicting early death. I used 3 months as the cutoff: deaths prior to 3 months were used as the gold standard positives and the rest as negatives. This method turned out to be the best-performing method of this sub-challenge.
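The relabeling described above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the `Patient` record, its field names, and the toy cohort are hypothetical, and the sketch assumes that "3 months" corresponds to the 91.5-day window used in the AE definition.

```python
from dataclasses import dataclass

# Assumption: "3 months" is taken to be the 91.5-day window from the AE definition.
CUTOFF_DAYS = 91.5


@dataclass
class Patient:
    patient_id: str        # hypothetical identifier
    followup_days: float   # days from treatment start to death or last follow-up
    died: bool             # True if death was observed at followup_days


def early_death_label(patient, cutoff_days=CUTOFF_DAYS):
    """Return 1 for a death observed before the cutoff, 0 for everyone else."""
    return 1 if patient.died and patient.followup_days < cutoff_days else 0


# Toy cohort (invented values, for illustration only).
cohort = [
    Patient("A", 60.0, True),    # died early: positive
    Patient("B", 200.0, True),   # died after the cutoff: negative
    Patient("C", 45.0, False),   # no death observed: negative
]
labels = {p.patient_id: early_death_label(p) for p in cohort}
```

Any classifier can then be trained on these binary labels in place of the original discontinuation labels.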

...............................................................................................................................

DREAM Olfaction Challenge Session

Introduction to the DREAM olfaction challenge
Andreas Keller¹
¹ Laboratory of Neurogenetics and Behavior, The Rockefeller University, New York City, USA

The complexity of olfactory stimuli as well as of the olfactory perceptual space makes the question of what determines a molecule's smell an ideal topic for the collaborative approach provided by DREAM challenges. I will briefly discuss previous attempts to predict how a molecule smells based on its physical properties and then present the psychophysical data set that was collected at Rockefeller University as the basis for the DREAM olfaction challenge. I will point out how this dataset differs from data that has been traditionally used for this type of project. I will also discuss how the influence of genetic variability between individuals and of previous experiences on odor perception complicates the stimulus-percept-correlation in olfaction.

...............................................................................................................................

DREAM Olfaction Challenge - Best Performer SC1

Predicting olfaction response for each individual
Yuanfang Guan

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor

This abstract describes the method I wrote for sub-challenge 1 of the 2015 DREAM Olfaction Challenge: building models to predict the olfactory response of each individual. I used decision trees as the base learner for the chemical structural data, for two reasons: 1) the structural data are high-dimensional, with over 4000 parameters, and decision trees help to reduce the dimension; 2) the data matrix is sparse, so a decision boundary can be placed between zeros and the remaining values. Olfactory responses reported by individuals were noisy. I therefore used the global response across all individuals to balance the individual response, in order to 1) capture the personalized features and 2) stabilize the predictions. For example, a chemical reported as ‘sweet’ by individual A is trusted more when other people also report it as ‘sweet’. Finally, I used 0.2 * individual score + 0.8 * global average for each chemical as the prediction, although these weights can be varied considerably while still achieving decent performance. A similar technique also produced one of the best-performing algorithms in the 2014 DREAM Broad Institute Gene Essentiality Challenge. No external data was used. This technique turned out to be the best-performing method for this sub-challenge.
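The weighting step can be illustrated with a minimal Python sketch. The function names, the subject labels, and the numbers are hypothetical; only the 0.2 / 0.8 weights come from the abstract, and the sketch blends plain numbers rather than the outputs of the decision-tree models.

```python
from statistics import mean


def global_average(scores_by_subject):
    """Average the scores for one chemical and descriptor across all subjects."""
    return mean(scores_by_subject.values())


def blend(individual_score, global_avg, w_individual=0.2):
    """Weighted mix of an individual's score and the population average."""
    return w_individual * individual_score + (1.0 - w_individual) * global_avg


# Invented 'sweet' scores for a single chemical (illustration only).
sweet_scores = {"subject_A": 80.0, "subject_B": 40.0, "subject_C": 30.0}
avg = global_average(sweet_scores)
prediction_for_a = blend(sweet_scores["subject_A"], avg)
```

Pulling each individual estimate toward the population mean trades some personalization for lower variance, which matches the stated goal of stabilizing noisy individual responses.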

...............................................................................................................................

DREAM Olfaction Challenge - Best Performer SC2

From Shape to Smell: Predicting Olfactory Perceptual Descriptors using Molecular Structural Information
Richard C. Gerkin¹
¹ School of Life Sciences, Arizona State University

The DREAM Olfaction Prediction Challenge asked participants to predict 21 different olfactory perceptual features (i.e. smell descriptors) of single molecules using a library of several thousand structural features of those molecules. One sub-challenge asked participants to make predictions for individual human subjects, and one for the mean and variance of responses across subjects. Here I describe the winning submission to the latter sub-challenge.

After using missing value imputation to fill out the structural feature dataset, I trained Random Forest Regression models (using Python's scikit-learn package) to predict mean (across subjects) responses for each of the perceptual features. I made extensive use of cross-validation to optimize these models. While I also constructed similar models to predict the variance (across subjects) of the responses, I found that prediction was improved by exploiting the relationship between the variance and the mean that was guaranteed by basic psychometric considerations. Consequently, I obtained decisive improvements in my prediction of the variance by pooling results from models trained only on the variance with a theoretically motivated non-linear transformation of results from models trained only on the mean. This technique proved decisive in constructing the winning submission for the sub-challenge.

Collaboration with other challenge participants further improved olfactory prediction by utilizing additional molecular features; this provided scientific insight into the categories and origins of features that provide useful olfactory information about molecules.

...............................................................................................................................

DREAM Olfaction Challenge: Lessons Learned

Amit Dhurandhar, Pablo Meyer, Guillermo Cecchi

IBM TJ Watson Research, NY, USA

We present the insights gained from the DREAM Olfaction challenge run earlier this year. Our sense of smell critically affects our emotions and hence our decisions, making it an important part of human cognition. We report an analysis of results from the challenge, in which only molecular structure was used to design learning algorithms that were trained to predict individual and average judgments of smell for 49 non-expert subjects based on 21 descriptors. The results were promising in the sense that the best models achieved greater than 0.75 correlation, with linear models being competitive with the best. The challenge showed that, in terms of predictability, not only is generalization across odors for specific individuals possible, but generalization across individuals for the same odors is also possible, which is highly encouraging from a science and application point of view.

...............................................................................................................................

DREAM ALS Stratification Challenge Session

The ALS Stratification Prize: Using the Power of Big Data and Crowdsourcing for Catalyzing Breakthroughs in Amyotrophic Lateral Sclerosis (ALS)
Neta Zach¹, Robert Kueffner², Nazem Atassi³, Venkat Balagurusamy⁴, Barbara di Camillo⁵, Merit Cudkowicz³, Donna Dillenberger⁴, Javier Garcia-Garcia⁶, Orla Hardiman⁷, Bruce Hoff⁸, Joshua Knight⁴, Melanie Leitner⁹, Guang Li¹⁰, Lara Mangravite⁸, Raquel Norel⁴, Thea Norman⁸, Liuxia Wang¹⁰, Gustavo Stolovitzky⁴

¹ Prize4Life, Israel; ² Ludwig-Maximilian-University, Germany; ³ Massachusetts General Hospital, MA, USA; ⁴ IBM Research, NY, USA; ⁵ University of Padova, Italy; ⁶ Pompeu Fabra University, Spain; ⁷ Trinity College Institute of Neuroscience, Ireland; ⁸ Sage Bionetworks; ⁹ Biogen Idec, MA, USA; ¹⁰ Origent, VA, USA

Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease with significant heterogeneity in its progression. To address this heterogeneity and spur ALS research, clinical care and drug development, we need sufficient clinical data and suitable analysis approaches. To address this heterogeneity for the first time, we launched the DREAM ALS Stratification Prize4Life Challenge in summer 2015, using clinical data from the PRO-ACT database of ALS clinical trials as well as data from national ALS registries in Italy and Ireland.

In the challenge, we asked participants to derive meaningful subgroups of ALS patients relative to disease progression and survival. The challenge drew in 75+ submissions from 31 teams. We will discuss the different approaches used by participants, as well as the baseline algorithms, and how patient classification affected performance in predicting disease outcomes. We will also discuss the different predictive features and patient subgroups that the challenge helped unveil.

...............................................................................................................................

DREAM ALS Stratification Challenge Best Performers

A Boosting Approach to Predicting ALSFRS Slope for the PRO-ACT Database
Wen-Chieh Fang¹, Chen Yang¹, Huan-Jui Chang², Hsih-Te Yang¹,³, Jung-Hsien Chiang¹,³

¹Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
²Department of Economics, National Cheng Kung University, Taiwan
³Institute of Medical Informatics, National Cheng Kung University, Taiwan

Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that leads to muscle weakness and gradually impairs the functioning of the body, leading to eventual death. It greatly reduces an individual's life expectancy. Currently, experts do not know precisely what causes ALS, and there is no known cure. The DREAM ALS Stratification Prize4Life Challenge was held to enable a better understanding of patient profiles and the application of personalized ALS treatments.

In our approach to the DREAM ALS Challenge, we first discard features with a high percentage of missing values. For the remaining features, we replace any missing value with the mean of that feature over all other cases. We also apply equal-frequency binning, which divides the response variable into three groups such that each group contains approximately the same number of values. There are two kinds of features in the data set: static features and 'time-resolved' features (those whose values vary over time). For the latter, we try two summary measurements, the minimum and the maximum, as additional features. Then, for both kinds of features, we apply feature selection based on information gain to select the top six features. To select optimal features, we run cross-validation on the feature candidates. For prediction, we apply Gradient Boosted Regression Trees (GBRT) to predict the ALSFRS slopes. GBRT computes a sequence of simple decision trees, where each successive tree is built for the prediction residuals of the preceding tree.
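The three preprocessing steps named above (mean imputation, equal-frequency binning of the response, and min/max summaries of time-resolved features) can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the team's code: the function names, the use of `None` for missing values, and the toy numbers are all hypothetical.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def equal_frequency_bins(values, n_bins=3):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def summarize_time_resolved(measurements):
    """Collapse a time-resolved feature into its minimum and maximum."""
    return {"min": min(measurements), "max": max(measurements)}


# Toy data (invented values, for illustration only).
imputed = impute_with_mean([1.0, None, 3.0])
slope_bins = equal_frequency_bins([0.5, -1.2, 0.1, -0.3, 2.0, -2.5])
summary = summarize_time_resolved([3.1, 2.8, 2.4, 2.9])
```

The binned response and the summarized features would then feed the information-gain feature selection and the GBRT model, which are not shown here.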

In this challenge, we think that the feature selection is one of the most important steps and we believe that the most appropriate features dominate the performance of the model. In the final submission round, our team attained the best performance, outperforming the methods of all other teams.

...............................................................................................................................

Predicting ALS survival through complete ranking of censored data
Yuanfang Guan¹

¹Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA, 48109

Sub-challenges 2 and 4 of the Prize4Life ALS challenge asked participants to predict survival for two separate cohorts, both typical censored-data problems. In this talk, I will review existing methods (Cox-related, survival random forest, etc.), their potential limitation of providing only an incomplete ranking of training patients, and the resulting limitation in the choice of base learners. Then, I will describe the method I wrote for this challenge, which provides a probabilistic comparison between two censored data points derived from the K-M curves (in addition to pairs of non-censored data points and censored/non-censored pairs). A complete ranking of all patients is thus obtained, which allows any base learner to be incorporated. This generalizes a censored-data prediction problem to a standard regression problem. Finally, I will talk about one (GPR) of the many base learners that are capable of achieving similar performance under this modeling framework. This method turned out to be the best-performing one for both sub-challenges.

...............................................................................................................................

DREAM SMC Challenge Session

The ICGC-TCGA DREAM Somatic Mutation Calling Challenge: Combining accurate tumour genome simulation with crowd-sourcing to benchmark somatic variant detection
Joshua M. Stuart²,¹¹, Anna Y. Lee¹,¹⁰, Kathleen E. Houlahan¹,¹⁰, Adam D. Ewing²,³,¹⁰, Kyle Ellrott²,¹⁰, Yin Hu⁴, J. Christopher Bare⁴, Shadrielle Espiritu¹, Vincent Huang¹, Kristen Dang⁴, Cristian Caloian¹, Takafumi N. Yamaguchi¹, ICGC-TCGA DREAM Somatic Mutation Calling Challenge Participants, Michael R. Kellen⁴, Thea C. Norman⁴, Stephen H. Friend⁴, Justin Guinney⁴, Gustavo Stolovitzky⁵, David Haussler², Adam A. Margolin⁴,⁷,¹¹, Paul C. Boutros¹,⁸,⁹,¹¹

¹ Informatics and Biocomputing Program; Ontario Institute for Cancer Research; Toronto, Ontario, Canada
² Department of Biomolecular Engineering; University of California, Santa Cruz; Santa Cruz, CA, USA
³ Mater Research Institute; University of Queensland; Woolloongabba, QLD, Australia
⁴ Sage Bionetworks; Seattle, WA, USA
⁵ IBM Computational Biology Center; T.J.Watson Research Center; Yorktown Heights, NY, USA
⁶ Computational Biology Program; Oregon Health & Science University; Portland, OR, USA
⁷ Department of Biomedical Engineering; Oregon Health & Science University; Portland, OR, USA
⁸ Department of Medical Biophysics; University of Toronto; Toronto, Ontario, Canada
⁹ Department of Pharmacology & Toxicology; University of Toronto; Toronto, Ontario, Canada
¹⁰ These authors contributed equally
¹¹ Corresponding authors

The identification of somatic mutations in cancer genomes via next-generation sequencing will transform our understanding and treatment of cancer. Unfortunately, accurate identification of somatic mutations of all types, from point mutations to structural variants, remains challenging, with many anecdotal, small-scale reports of discordance across methods. One underlying reason for this problem is the lack of robust, impartial benchmarking studies and widely accepted gold standards. The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) launched the ICGC-TCGA DREAM Somatic Mutation Calling Challenge: a crowd-sourcing effort to identify the best pipelines for detecting mutations in the high-throughput sequencing reads of cancer genomes (www.synapse.org/#!Synapse:syn312572).

To benchmark variant calling approaches, a novel simulator called BAMSurgeon was developed to synthesize cancer genomes in silico. The results of 248 single nucleotide variant and 204 structural variant analyses run on five synthetic tumors will be presented. Different algorithms exhibit characteristic error profiles, and, intriguingly, false positives show a trinucleotide mutation signature often reported in human tumors. Although the three simulated tumors differ in sequence contamination (deviation from normal cell sequence) and in sub-clonality, an ensemble of pipelines outperforms the best individual pipeline in all cases. We will discuss several findings from the analysis of these methods including the ability of methods to improve performance without overfitting, that SNV but not SV callers benefit from a “wisdom of the crowds” ensemble, and the first clear picture of the origins of methodological errors in SNV and SV calling. The leaderboard for this Challenge remains open and is continually attracting new entries, serving as a living-benchmark for comparing new algorithms.

...............................................................................................................................

DREAM SMC Challenge Best Performer SC3

Strategies for SNV and Indel Detection in the DREAM Challenges
Team WashU
R. Jay Mashl, Daniel C. Koboldt, Kai Ye, Li Ding

McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA

Identifying genomic variants is a fundamental step in the understanding of mutations associated with cancer. To this end, the ICGC-TCGA DREAM series of in silico calling challenges was designed to improve standard methods for identifying somatic mutations and rearrangements in whole-genome sequencing (WGS) data. Here, we present an assessment of metrics and other observations for distinguishing true single-nucleotide variants from false positives in the third DREAM sub-challenge. Additional filtering of putative calls using a panel of normal samples was found to further reduce the false-positive rate. A filtering suite was also developed for calling insertion and deletion events. The resulting parameters are being incorporated into the GenomeVIP cloud-enabled variant identification platform.

...............................................................................................................................

DREAM SMC Challenge Best Performer SC2, 3, & 4

novoBreak: a k-mer targeted assembly algorithm for breakpoint detection in cancer genomes
Zechen Chong¹ and Ken Chen¹

¹Department of Bioinformatics and Computational Biology, the University of Texas MD Anderson Cancer Center

Somatic structural variations (SVs) are major driving forces for tumor development and progression. Sporadic and recurrent chromosomal aberrations have been observed in most cancer types, including breast, lung, brain, leukemia, pancreatic and prostate cancers. The advent of high-throughput next generation sequencing (NGS) technologies has made it possible to perform genome-wide detection of SVs at base pair resolution. However, current sequencing-based computational methods are limited in sensitivity and comprehensiveness due to the challenges of acquiring sufficient information to characterize different types of SVs. Here, we present novoBreak, a novel k-mer targeted local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole genome sequencing data. NovoBreak can directly identify breakpoints from clusters of reads that share a set of k-mers uniquely present in a subject genome (e.g., a tumor genome) but not in the human reference genome or any control data (e.g., a matched normal genome). In synthetic data from the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge and real data from a cancer cell line, novoBreak consistently outperformed existing algorithms, mainly due to more effective utilization of reads spanning breakpoints. NovoBreak also demonstrated great sensitivity in identifying short INDELs and gene fusions. The wider application of novoBreak is expected to reveal a comprehensive structural landscape that can be linked to novel mechanistic signatures in cancer genomes.

...............................................................................................................................

Lessons from the SMC-DNA IS Challenges & Looking Forward
Paul C. Boutros¹,², Adam A. Margolin³, Kyle Ellrott³, Quaid D. Morris², Paul Spellman³, David Wedge⁴, Peter Van Loo⁵, Gustavo Stolovitzky⁶, Joshua D. Stuart⁷
¹ Ontario Institute for Cancer Research, Toronto, Canada
² University of Toronto, Toronto, Canada
³ Oregon Health & Science University, Portland, Oregon, USA
⁴ Wellcome Trust Sanger Institute, Hinxton, UK
⁵ Crick Research Institute, London, UK
⁶ IBM Research, NY, NY, USA
⁷ University of California, Santa Cruz, California USA

We present a summary of the key lessons learnt through the in silico challenges of the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (SMC-DNA), with a particular focus on the challenges in simulating, scoring and integrating structural variant (SV) predictions. We discuss the remarkable failure of ensemble models to improve upon SV prediction, and note the significant differences to other types of genomic data. Finally we outline the future of the SMC series of DREAM Challenges, giving the official launch of a tumour subclonality reconstruction challenge (SMC-Het) and talking about the progress towards an RNA-Seq Challenge (SMC-RNA).

...............................................................................................................................

DREAM Drug Combination Challenge Session

Crowdsourcing combinatorial therapies: The AZ-Sanger DREAM synergy prediction challenge
Menden MP¹,*, Wang D²,*, Chaibub Neto E³, Ghazoui Z², Jang IS³, Giovanni Di Veroli⁴, Gustavo Stolovitzky⁵, Dry JR²,#, Guinney J³,#, Saez-Rodriguez J¹,⁶,#

¹European Molecular Biology Laboratory – European Bioinformatics Institute
²Oncology Innovative Medicines, AstraZeneca
³Sage Bionetworks
⁴Early Clinical Development – Innovative Medicine, AstraZeneca
⁵IBM Research
⁶RWTH Aachen University Hospital, Joint Research Center for Computational Biomedicine (JRC-Combine)
* First authors; # Corresponding authors (alphabetically ordered)

In the last 20 years, targeted therapies and personalized treatments have been the most promising assets to treat cancer. However, their success is often lessened by secondary resistance. As a way to increase the therapeutic search space and to overcome resistance, combinations of drugs are currently being investigated extensively. A major current limitation is the lack of effective strategies to search the virtually intractable combinatorial space. To tackle this issue and advance the development of combinatorial therapies, DREAM, Sage Bionetworks and AstraZeneca are jointly hosting a drug combination challenge, which is open to the whole scientific community.

The challenge aims to address two specific questions: (i) to predict synergies in cancer cell lines from molecular data, and (ii) to identify biomarkers that discriminate between synergistic and non-synergistic behaviors. Towards this goal, AstraZeneca provides a combinatorial cell line screen comprising a total of ~11.5k experimentally tested drug combinations, as well as all monotherapies as baselines. The previous DREAM drug combination challenge focused on ~100 experimentally tested combinations and explored synergy signatures based on gene expression before and after treatment. Besides the smaller scale, those signatures are not practical or ethical to retrieve from patients. In addition, our proposed challenge focuses on revealing mechanistic insights from any biomarker selection and model building. By providing a framework in which any group in the world can participate, and in which methods can be evaluated in an unbiased and double-blinded way, we aim to move forward on the complex problem of discovering novel and effective drug combinations.

Here we present results from the first challenge round and show glimpses of the knowledge gained in this crowd-sourcing drug combination challenge.

...............................................................................................................................

Preventing data-leakage in leaderboard evaluations: the Ladder and LadderBoot algorithms
Elias Chaibub Neto¹
¹Sage Bionetworks

Over-fitting is a common issue in machine learning challenges. Because participants rely on the public leaderboard to evaluate and refine their models, there is always the danger they might start to over-fit their models to the holdout data supporting the leaderboard. Standard remedies to this problem include limiting the number of allowed submissions per participant and rounding the released public scores. Recently, Hardt and Blum (2015) proposed the Ladder algorithm, which reduces over-fitting by preventing the participant from exploiting minor fluctuations in public leaderboard scores during their model refinement activities. Mechanistically, the Ladder only releases the actual (rounded) score of a new submission if the score presents a statistically significant improvement over the previously best submission of the participant. If not, the Ladder releases the score of the best submission so far.
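The release rule described above can be sketched as a small Python class. This is a simplified illustration rather than the published algorithm: a fixed margin `step_size` stands in for the statistical-significance test, scores are treated as losses (lower is better), and the class and parameter names are our own.

```python
class Ladder:
    """Leaderboard that reveals a new score only after a clear improvement."""

    def __init__(self, step_size=0.01):
        # Assumption: a fixed margin approximates "statistically significant".
        self.step_size = step_size
        self.best_released = None  # best (lowest) loss released so far

    def submit(self, holdout_loss):
        """Return the score shown to the participant for this submission."""
        improved = (
            self.best_released is None
            or holdout_loss < self.best_released - self.step_size
        )
        if improved:
            # Release the new loss, rounded to the nearest multiple of step_size.
            steps = round(holdout_loss / self.step_size)
            self.best_released = steps * self.step_size
        # Otherwise the participant just sees the previous best again.
        return self.best_released


ladder = Ladder(step_size=0.01)
first = ladder.submit(0.30)    # first submission is always released
second = ladder.submit(0.295)  # improvement smaller than the margin: not released
third = ladder.submit(0.25)    # clear improvement: released
```

Because small fluctuations never change the displayed score, a participant cannot use them as feedback when refining a model. One `Ladder` instance would be kept per participant.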

In this talk, we present evaluations of the Ladder algorithm under two adversarial attacks. Both are inspired by Freedman's paradox, where the selection of the features entering a multiple regression model is guided by the public leaderboard. In the first attack, we simply select the top features according to the public leaderboard scores of the univariate regression models. Our experiments show the effectiveness of the Ladder algorithm in this context. Our second attack, on the other hand, is based on a more aggressive step-forward variation of this first attack, and can lead to severe over-fitting. This attack exploits the fact that the Ladder leaks too much information about the holdout data when it releases the public leaderboard score of the best model so far. To circumvent this problem, we propose a variation of the Ladder mechanism, called the LadderBoot algorithm, which releases a bootstrapped estimate of the public leaderboard score instead of the actual rounded score. In our experiments, the LadderBoot mechanism tended to compare favorably to the Ladder.

Reference. Hardt and Blum (2015) The Ladder: a reliable leaderboard for machine learning competitions. arXiv:1502.04585, 2015.
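
The release rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name, the `step` parameter, and the assumption that higher scores are better are all illustrative, and a fixed improvement threshold stands in for the statistical significance check mentioned in the abstract.

```python
from typing import Optional


class Ladder:
    """Toy leaderboard that only reveals a new score on a clear improvement."""

    def __init__(self, step: float = 0.01) -> None:
        # Hypothetical parameter: minimum improvement and rounding grid.
        self.step = step
        self.best: Optional[float] = None

    def submit(self, holdout_score: float) -> float:
        """Return the score shown to the participant (higher is better)."""
        if self.best is None or holdout_score > self.best + self.step:
            # Clear improvement: release the new score, rounded to the grid.
            self.best = round(holdout_score / self.step) * self.step
        # Otherwise the previously released best score is repeated.
        return self.best


board = Ladder(step=0.01)
released = [board.submit(s) for s in (0.700, 0.704, 0.695, 0.731)]
```

In this toy run the second and third submissions do not clear the threshold, so the participant sees the first released score again and learns nothing about the small fluctuations in between.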

...............................................................................................................................

OTHERS

The DREAM Challenges channel: open science publishing for all participating DREAMers
Michael Markie¹

¹F1000Research, London, UK

F1000Research is an Open Science publishing platform that offers the immediate publication of posters, slides and articles with no editorial bias. All published articles benefit from a collaborative, transparent peer review process and the inclusion of all source data and code. F1000Research partners with the International Society for Computational Biology through the publication of the ISCB Community Journal, and has recently built the DREAM Challenges channel, a central venue to publish peer-reviewed method articles from participants of DREAM challenges. Developments in science most often build upon previous findings, insights, and data, so it is important to make this information accessible to enable easy reuse. Both DREAM challenges and F1000Research have core values that are underpinned by open, reproducible science, and this collaboration aspires to advance the questions posed in DREAM Challenges through the open sharing of data analyses and computational methods. Participants of the recently concluded Prostate Cancer challenge will be the first DREAMers to make use of the channel. Participants will take advantage of a dynamic, post-publication peer review model and have the opportunity to work with expert reviewers from the biomedical research community to help improve and refine their methods after the challenge has completed. In addition to encouraging participants to publish their findings, the DREAM Challenges channel is also open to the wider DREAM community to publish research around challenge topics and the theory of how the challenges work.

The aim of the collaboration is to make further progress beyond the challenges, help answer important biomedical and biological questions, and pave the way to improving clinical practice, for example through the prognostic calculators developed for the recent Prostate Cancer DREAM Challenge.

...............................................................................................................................

DREAM Hackathon

The DREAM Challenges Hackathon: mining big data from a Parkinson’s Disease mobile research study
Brian Bot¹, Chris Bare¹, and DREAM Challenges Hackathon Team

Sage Bionetworks¹, University of Rochester, DREAM Challenges

The DREAM Challenges will be sponsoring a Hackathon on Sunday and Monday evenings, based on data from the mPower study. mPower is an app-based clinical study focused on Parkinson's disease (PD), developed by Sage Bionetworks and the University of Rochester. As of August 2015, mPower had enrolled 18,000 participants, making it the largest PD study ever. Typically, symptoms in PD patients are evaluated and recorded twice a year, when the patient goes to the doctor. Between these doctor’s visits, a patient’s disease status is left unmonitored, and important decisions about interventions are often not made in a timely manner. mPower is built to allow patients to continuously track their PD signs and symptoms, generating data on the patient’s voice (pitch and tremor), balance, walking, and finger speed and dexterity. The data being collected in this study will be made available to participants in this year’s DREAM Conference Hackathon.

The goal of this hackathon is to use the mPower data to generate insights about PD. The data is collected using the Sage Bionetworks Bridge Server, and Sage scientists have already shown that valuable information can be extracted from the mPower study. However, there is a lot more work to be done. In this hackathon, participants will bring their expertise and creative energies to develop ideas, models or proposals on how best to glean insights from this powerful study.


DREAM POSTERS

Updated Nov 3, 2015

Poster: DR01
From Shape to Smell: Predicting Olfactory Perceptual Descriptors using Molecular Structural Information

Richard C Gerkin¹

¹Arizona State University, United States

Abstract: The DREAM Olfaction Prediction Challenge asked participants to predict 21 different olfactory perceptual features (i.e. smell descriptors) of single molecules using a library of several thousand structural features of those molecules. One sub-challenge asked participants to make predictions for individual human subjects, and one for the mean and variance of responses across subjects. Here I describe the winning submission to the latter sub-challenge.

After using missing value imputation to fill out the structural feature dataset, I trained Random Forest Regression models (using Python's scikit-learn package) to predict mean (across subjects) responses for each of the perceptual features. I made extensive use of cross-validation to optimize these models. While I also constructed similar models to predict the variance (across subjects) of the responses, I found that prediction was improved by exploiting the relationship between the variance and the mean that was guaranteed by basic psychometric considerations. Consequently, I obtained decisive improvements in my prediction of the variance by pooling results from models trained only on the variance with a theoretically motivated non-linear transformation of results from models trained only on the mean. This technique proved decisive in constructing the winning submission for the sub-challenge.

...............................................................................................................................

Poster: DR02
The ALS Stratification Prize: Using Big Data and Crowdsourcing for Catalyzing Breakthroughs in ALS

Neta Zach¹, Robert Küffner², Hagit Alon¹, Nazem Atassi³, Barbara di Camillo⁴, Merit Cudkowicz³, Javier Garcia-Garcia⁵, Orla Hardiman⁶, Guang Li⁷, Lara Mangravite⁸, Raquel Norel⁹, Thea Norman⁸, Alexander Sherman³, Liuxia Wang⁷, Gustavo Stolovitzky⁹

¹Prize4Life, Israel
²Helmholtz Center, Germany
³Massachusetts General Hospital, United States
⁴University of Padova, Italy
⁵Universitat Pompeu Fabra, Spain
⁶Beaumont Hospital and Trinity College Dublin, Ireland
⁷Origent Data Solutions, United States
⁸Sage Bionetworks, United States
⁹IBM, United States

Abstract: The heterogeneity of the ALS patient population presents a substantial barrier to the understanding of disease mechanisms and to the planning and interpretation of ALS clinical trials, leading to large, expensive, and potentially unbalanced trials.

The 2015 DREAM ALS Stratification Prize4Life challenge offers an innovative approach to developing tools to allow a more accurate assignment of individual patients to a specific sub-group of patients with clear clinical implications for either survival or disease progression. It is expected to provide important tools for precision medicine in ALS.

The ALS Stratification challenge aims to directly address the problem of ALS patient heterogeneity with regard to important clinical targets such as ALSFRS progression and survival. In the challenge, we asked participants to derive meaningful subgroups of ALS patients along with the clinical features that characterize them.

...............................................................................................................................

Poster: DR03
DREAMTools: a Python Package for scoring collaborative challenges

Thomas Cokelaer¹, Mukesh Bansal², Christopher Bare³, Erhan Bilal⁴, Brian M. Bot³, Elias Chaibub Neto³, Federica Eduati¹, Mehmet Gönen⁵, Steven Hill⁶, Bruce Hoff³, Jonathan R. Karr⁷, Robert Küffner⁸, Michael Menden¹, Pablo Meyer⁴, Raquel Norel⁴, Abhishek Pratap³, Robert J. Prill⁹, Matthew T. Weirauch¹⁰, James C. Costello¹¹, Gustavo Stolovitzky⁴, Julio Saez-Rodriguez¹²

¹EMBL-EBI, United Kingdom
²Department of Systems Biology, Columbia University, United States
³Sage Bionetworks, United States
⁴IBM, TJ Watson, Computational Biology Center, United States
⁵Oregon Health & Science University, United States
⁶MRC Biostatistics Unit, Cambridge Institute of Public Health, United Kingdom
⁷Department of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, United States
⁸Institute of Bioinformatics and Systems Biology, German Research Center for Environmental Health, Germany
⁹IBM Almaden Research Center, San Jose, United States
¹⁰Center for Autoimmune Genomics and Etiology and Divisions of Biomedical Informatics and Developmental Biology, Cincinnati Children’s Hospital, United States
¹¹Department of Pharmacology, University of Colorado Anschutz Medical Campus, United States
¹²RWTH Aachen University Medical Hospital, Germany

Abstract: DREAM challenges are community competitions designed to advance computational methods and address fundamental questions in systems biology and translational medicine. Each challenge asks participants to develop and apply computational methods to either predict unobserved outcomes or to identify unknown model parameters given a set of training data. Computational methods are evaluated using an automated scoring metric, scores are posted to a public leaderboard, and methods are published to facilitate community discussions on how to build improved methods. By engaging participants from a wide range of science and engineering backgrounds, DREAM challenges can comparatively evaluate a wide range of statistical, machine learning, and biophysical methods. Here, we describe DREAMTools, a Python package for evaluating DREAM challenge scoring metrics. DREAMTools allows one to reproduce results from past DREAM challenges. The software also provides a command line interface that enables researchers to test new methods on past challenges, as well as a framework for scoring new challenges. As of September 2015, DREAMTools includes more than 80% of completed DREAM challenges. DREAMTools complements the data, metadata, and software tools available at the DREAM website (http://dreamchallenges.org) and on the Synapse platform (www.synapse.org). In the poster, we will give an overview of the past and present challenges and how the DREAMTools package can be used to reproduce scores from previous competitions. We will also describe the scoring functions that are currently available within the package and how new challenges can be included in the package.

...............................................................................................................................

Poster: DR04
A Two-layer Predictor for DREAM 9.5 Olfaction Prediction Challenge

Ping-Han Hsieh¹, Bor-Wei Cherng², Yu-Chuan Chang¹, Ming-Yi Hong¹, Yi-An Tung³, Yen-Jen Oyang⁴, Chien-Yu Chen⁵

¹National Taiwan University, Taipei, Taiwan
²National Taiwan University and Academia Sinica, Taipei, Taiwan
³Genome and System biology program, National Taiwan University and Academia Sinica, Taipei, Taiwan
⁴Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
⁵Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, Taiwan

Abstract: Olfaction is one of the most important senses in animal behavior. Understanding the olfactory percept in terms of molecular properties may broaden our understanding of sensory cognition and open up more odorant applications for industry. Although many methods for predicting human olfactory perception have been published, prediction accuracy is expected to improve further with additional training data sets. This challenge was designed to predict individual olfactory perception from the chemical features of compounds. According to the one-neuron-one-receptor rule in olfactory sensory studies, odor molecules bind to specific types of olfactory receptors, which transmit neuronal signals to multiple brain regions to generate the olfactory percept. Because this neuronal circuitry is conserved, it is reasonable to hypothesize that there are hidden relations between olfactory perception and molecular properties that allow us to process smells precisely. To tackle this olfaction prediction challenge, we built a machine learning-based pipeline that generates individual-specific, ensemble-based linear models, modified from the PCA-based baseline model provided by the challenge organizers. The adopted features include molecular physicochemical properties from Dragon molecular descriptors, while the target information came from perceptual data collected in the Rockefeller University Smell Study. Because sensory perception varies between individuals, we hypothesized that the predictive models for different persons may differ slightly from each other, although they may still share some common factors. We therefore built a predictive model for each individual. The ensemble approach is expected to select the individual-specific features carefully.
The proposed method also employed a second-layer prediction framework to predict the target value 'valence/pleasantness'. The results showed that the two-layer approach performed better than the conventional single-layer design, suggesting that some of the olfactory percepts are highly related.

...............................................................................................................................

Poster: DR05
Attractor Metafeatures Discover Molecular Signatures for Odor-Perception Prediction

Andrew Matteson¹

¹Applied BioMath, United States

Abstract: Chemoinformatics predictors are generally more numerous than the number of samples in a target variable. The key challenge in working with these imbalanced data sets is avoiding overfitting. My approach sought to preserve as much information about the target variable as possible while reducing the dimensionality of the data. My methods focused on the use of mutual information to either select features or project all the features onto a space that was a good predictor of the target variables. For selection, my methods are connected to the technique of maximum-relevance-minimum-redundancy (MRMR) feature selection.

I engineered "metapredictors" from the chemoinformatics predictors using the "Attractor Metagene" algorithm (1). The algorithm engineers features in an unsupervised way by weighted averaging over the original features. Weights are chosen as a function of the mutual information between the engineered feature and the original features.

Several of the metapredictors map onto known chemical structures associated with particular smells. Other metapredictors correspond to chemical structures not yet identified as corresponding to odors. These metapredictors enable connections to be made between molecular structure and perception that are directly interpretable.

The learning methods I developed fall into an enrich-project-predict framework. I discuss opportunities to extend these methods to other chemoinformatics and bioinformatics machine learning problems.

(1) Cheng, Wei-Yi, T. H. Ou Yang, and Dimitris Anastassiou. "Biomolecular events in cancer revealed by attractor metagenes." PLoS Comput Biol 9.2 (2013): e1002920.

...............................................................................................................................

Poster: DR06
Reflecting on the Prostate Cancer DREAM Challenge: Lessons Learned

Team Jayhawks

Devin C. Koestler¹, Joseph Usset¹, Stefan Graw¹, Richard Meier¹, Rama Raghavan¹, Junqiang (Eric) Dai¹, Prabhakar Chalise¹, Shellie Ellis², Brooke L. Fridley¹

¹Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, United States
²Department of Health Policy and Management, University of Kansas Medical Center, Kansas City, KS, United States

Abstract: From March through August 2015, nearly 60 teams from around the world participated in the Prostate Cancer DREAM Challenge, cosponsored in part by the Prostate Cancer Foundation, the National Cancer Institute (NCI), and the American Joint Committee on Cancer (AJCC). Participating teams were faced with the task of developing prediction models for patient survival and treatment toxicity using clinical variables collected from the comparator arms of four phase III clinical trials, including over 2,000 metastatic castrate-resistant prostate cancer patients treated with first-line docetaxel. In this poster presentation, we describe: (a) the three sub-challenges comprising the Prostate Cancer DREAM Challenge, (b) the statistical metrics used by the challenge organizers for benchmarking the performance of prediction models for each sub-challenge, and (c) our team’s overall analytic strategy for addressing each of the challenge objectives. Specifically, we discuss our approach for identifying clinically important risk predictors (i.e., feature selection and dimension reduction), the methodological framework(s) considered by our team for model development and validation, including the ensemble-based Cox proportional hazards regression model representing our final submission, and the adaptation of our modeling framework based on the results from the intermittent leaderboard rounds. As the aftermath of the Prostate Cancer DREAM Challenge has prompted our team to reflect on the lessons learned throughout the challenge, we also provide our perspectives on the importance of delegation, collaboration, data cleaning, and organization in challenges such as the Prostate Cancer DREAM Challenge.

...............................................................................................................................

Poster: DR07
Feature Selection & Random Forest for ALS prediction

Witold R. Rudnicki¹, Wojciech Lesiński¹, Aneta Polewko-Klim¹, Krzysztof Mnich², Agnieszka Golińska¹

¹Department of Bioinformatics, University of Białystok, ²Computational Centre, University of Białystok, Konstantego Ciołkowskiego 1M, 15-245 Białystok, Poland

Abstract: The main goal of the challenge was to find the clustering of patients that would improve prediction of the progression of the ALS disease. The success of the clustering was measured by comparing prediction of the progress of the disease with actual data. We provided answers to all questions of the challenge, namely predicting disease progress and eventual death of patients for two data sets.

The modeling was performed in three steps: feature construction, feature selection, and model building. Features describing the time series data were constructed following the approach proposed by the winners of the ALS DREAM 7 Challenge [1].

We attempted clustering using informative features, but without success; hence the final clusters were based only on the availability of data for a given object.

Feature selection was based on information entropy. Information gain was computed for all variables and all pairs of variables. Informative variables and pairs were selected, and redundant features were removed.
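
As an illustration of entropy-based screening of single variables and pairs, the following Python sketch computes information gain for discrete features. It is a generic sketch, not the team's code: the function names are hypothetical, and it assumes that features have already been discretized.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target):
    """Reduction in target entropy after splitting on a discrete feature."""
    n = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def pair_information_gain(feature_a, feature_b, target):
    """Information gain of two features considered jointly."""
    return information_gain(list(zip(feature_a, feature_b)), target)


# Toy XOR-like data: neither feature is informative alone, the pair is.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]
gain_a = information_gain(a, y)
gain_pair = pair_information_gain(a, b, y)
```

The toy data shows why scoring pairs can matter: each variable alone carries no information about the target, while the pair determines it completely.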

Final models were built with a random forest classifier [2] using six original features. All possible combinations of variables were tested using cross-validation.

Eleven informative features for question 1 are (variables used by best model denoted by *, second model by ^): onset_delta*^, hands*^, Q1_Speech*^, Q9_Climbing_Stairs*^, Q5_Cutting*, fvc*, ALSFRS_Total^, Q3_Swallowing^, Q6_Dressing, Creatinine, Q4_Handwriting.

The cross-validated correlation of these models was 0.43 and 0.42, respectively, and the RMSD was 0.549 and 0.568.

Eight informative variables for the second question are (marked as previously): onset_delta*^, ALSFRS_Total*^, Creatinine*^, weight*^, fvc*, fvc_percent*, Chloride^, Gender^.

The cross-validated classification error for the first model was 0.22 for 12 months and 0.18 for 18 and 24 months.

There were six informative variables for question 3 (trunk, Q4_Handwriting, Q3_Swallowing, Q8_Walking, hands, ALSFRS_R) and six for question 4 (ALSFRS_R_Total, ALSFRS_Total, hands, MEDHx_Thyroid, Q6_Dressing_and_Hygiene, R3_Respiratory_Insufficiency).
The quality of data was much lower for the second data set and, while we obtained sets of informative variables, the results were on the border of random. The correlation coefficient for the third question was 0.15 and the classification error for the fourth question was 0.35; both results were estimated by cross-validation.

[1] Küffner R. et al. (2015) Nature Biotechnology 33(1), 51-57.
[2] Breiman, L. (2001) Machine Learning, 45, 5-32.

...............................................................................................................................

Poster: DR08
Predicting Discontinuation of Docetaxel Treatment for Metastatic Castration-Resistant Prostate Cancer (mCRPC) with hill climbing and random forest

Team Yoda
Daniel Kristiyanto, Kevin Anderson, Seyed Sina Khankhajeh, Kaiyuan Shi, Seth West, Ling Hong Hung, Azu Lee, Qi Wei, Migao Wu, Yunhong Yin, Ka Yee Yeung*

Institute of Technology, University of Washington, Tacoma, WA, United States
*Corresponding author

Motivation: Prostate cancer patients may develop resistance to androgen deprivation therapy (ADT) [1,2]. In sub-challenge 2 of the DREAM 9.5 Prostate Cancer Challenge [2], we developed models to predict which patients with metastatic castrate-resistant prostate cancer (mCRPC) would subsequently discontinue docetaxel therapy.

Objective: The input data consist of 131 variables measured across clinical data from three clinical trials, namely Memorial Sloan Kettering (MSK, with 476 patients), Celgene (with 526 patients), and Sanofi (with 598 patients). The goal is to predict which patients in a fourth clinical trial (test data), AstraZeneca (AZ, with 470 patients), would discontinue treatment due to adverse events within 3 months.

Data & Methods: Data cleansing and pre-processing. The data cleansing was done separately within each clinical trial, and the data were later merged back together. Our data cleansing and pre-processing procedures include imputation of missing data [3] and removal of clinical variables with a high percentage of missing data. Data augmentation was also performed by converting selected multi-label variables into binary variables. Feature selection. We observed that univariate feature selection methods did not perform well. Hence, we adopted a hill-climbing [4] approach that optimized the AUC within 10-fold cross-validation of the training data. We also addressed the issue of imbalanced data (a total of 1292 negative samples and 197 positive samples) by randomly removing negative samples to reach a ratio of roughly 60% negative to 40% positive samples.
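
The class rebalancing step can be sketched as follows. This is an illustrative Python sketch rather than the team's code: the function name, the fixed random seed, and the 0/1 label encoding are assumptions, and only the class counts are taken from the abstract.

```python
import random


def undersample_negatives(labels, positive_share=0.4, seed=0):
    """Keep all positives and a random subset of negatives.

    Returns sorted sample indices such that positives make up roughly
    `positive_share` of the retained samples.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed to reach the requested class ratio.
    wanted = round(len(positives) * (1 - positive_share) / positive_share)
    wanted = min(wanted, len(negatives))
    rng = random.Random(seed)  # fixed seed only for reproducibility
    kept_negatives = rng.sample(negatives, wanted)
    return sorted(positives + kept_negatives)


labels = [1] * 197 + [0] * 1292  # class counts quoted in the abstract
kept = undersample_negatives(labels)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos
```

With these counts, all 197 positive samples are kept and the negatives are cut to about 1.5 times that number, which gives the roughly 60/40 split.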

Classification. We applied random forest [5] using Sanofi as the hold-out, setting the parameter “mtry” to 25% of the number of features and the number of trees to 100 times the number of features.

Assessment: For validation, we repeated the training step 10 times, with the average AUC as the assessment criterion. We also conducted an additional assessment by holding out one of the three clinical trials. Our predictive model using MSK and Celgene data as the training set and Sanofi data as the test set yielded AUC = 0.165, accuracy = 0.9, precision = 0.21, F1 = 0.092, and recall = 0.06.
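
As a quick consistency check on the reported metrics, F1 can be recomputed from precision and recall as their harmonic mean. The helper below is a generic sketch; only the two input values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(precision=0.21, recall=0.06)
```

This gives approximately 0.093, in line with the reported F1 of 0.092 once rounding of the quoted precision and recall is taken into account.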

Results: Our final submission for predicting the discontinuation of docetaxel in the AstraZeneca clinical trial (using MSK, Celgene and Sanofi as training data) resulted in an AUC of 0.13. Of the 470 patients in the AstraZeneca clinical trial, 8 were predicted to discontinue the treatment within 3 months.

Acknowledgement: Ling Hong Hung and Ka Yee Yeung are supported by NIH grant U54-HL127624. This project used computing resources provided by Microsoft Azure. We would like to thank all students in TCSS 588 Bioinformatics in Spring 2015 at University of Washington Tacoma who contributed to this project.

References
1. Gupta, Eva, Troy Guthrie, and Winston Tan. "Changing paradigms in management of metastatic Castration Resistant Prostate Cancer (mCRPC)." BMC urology 14.1 (2014): 55.
2. "DREAM 9.5 Prostate Cancer DREAM Challenge - Dream ..." 2015. 8 Oct. 2015 <http://dreamchallenges.org/project/closed/project dataspheresprostate-cancer-challenge/>
3. Hastie T, Tibshirani R, Narasimhan B and Chu G. impute: Imputation for microarray data. R package version 1.42.0.
4. Romanski, P. "FSelector: Selecting attributes." Vienna: R Foundation for Statistical Computing (2009).
5. Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.

...............................................................................................................................

Poster: DR09
Predicting olfaction response for each individual
Yuanfang Guan

Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor

Abstract: This abstract describes the method I developed for sub-challenge 1 of the 2015 DREAM Olfaction Challenge: building models to predict the olfactory response of each individual. I used decision trees as the base learner for the chemical structure data, for two reasons: 1) the structure data are high-dimensional, containing over 4000 parameters, and decision trees help to reduce the dimension; 2) the data matrix is sparse, so a decision boundary can be placed between zeros and the remaining values. Olfactory responses reported by individuals were noisy. I therefore used the global response across all individuals to balance the individual response, in order to 1) capture the personalized features and 2) stabilize the predictions. For example, a chemical reported to be ‘sweet’ by individual A is trusted more when other people also report this chemical to be ‘sweet’. Finally, I used 0.2 * individual score + 0.8 * global average for each chemical as the prediction, although this weighting can be varied considerably while still achieving decent performance. A similar technique also produced one of the best-performing algorithms in the 2014 DREAM Broad Institute Gene Essentiality Challenge. No external data was used. This technique turned out to be the best-performing method for this sub-challenge.
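
The weighting of individual and global responses can be illustrated with a small Python sketch. The function name, the data layout, and the example scores are made up for illustration; only the 0.2/0.8 weights come from the abstract, and the global average is assumed here to be a plain mean across subjects.

```python
def blend_predictions(individual, weight_individual=0.2):
    """Blend each subject's scores with the across-subject mean.

    `individual` maps a subject id to a list of scores, one per chemical.
    """
    subjects = list(individual)
    n_chemicals = len(individual[subjects[0]])
    global_mean = [
        sum(individual[s][j] for s in subjects) / len(subjects)
        for j in range(n_chemicals)
    ]
    return {
        s: [
            weight_individual * individual[s][j]
            + (1 - weight_individual) * global_mean[j]
            for j in range(n_chemicals)
        ]
        for s in subjects
    }


raw = {"A": [80.0, 10.0], "B": [40.0, 30.0]}  # made-up scores, two chemicals
blended = blend_predictions(raw)
```

For the first chemical the across-subject mean is 60, so subject A's score of 80 is pulled down to 0.2 * 80 + 0.8 * 60 = 64, which shows how the global response damps noisy individual ratings.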

...............................................................................................................................

Poster: DR10
Predicting olfactory perception from chemical structure: a gradient boosting model

Chung Wen Yu¹, Yusuke Ihara¹,², and Joel D. Mainland¹,³

¹Monell Chemical Senses Center, Philadelphia PA, United States
²Institute for Innovation, Ajinomoto Co., Inc., Kawasaki, Japan.
³Department of Neuroscience, University of Pennsylvania School of Medicine, Philadelphia PA, United States

A fundamental problem in olfaction is to understand how the physical properties of a stimulus relate to perceptual characteristics. In vision, wavelength translates into color; in audition, frequency translates into pitch. By contrast, the mapping from chemical structure to olfactory percept is unknown. In other words, there is not a scientist or perfumer in the world who can view a novel molecular structure and predict how it will smell. Here we used an unpublished dataset in which 49 subjects rated 476 odors to develop a model that uses physicochemical descriptors to predict 21 perceptual features. Previously published models have used principal components of physicochemical descriptors or molecular complexity to predict pleasantness; both performed poorly on this dataset (Khan et al., 2007: r = 0.25, p < 0.001; Kermen et al., 2011: r = 0.26, p < 0.0001). Our model outperformed these previously published models on a validation set (r = 0.61, p < 0.001).

Khan, R. M., Luk, C.-H., Flinker, A., Aggarwal, A., Lapid, H., Haddad, R., & Sobel, N. (2007). Predicting odor pleasantness from odorant structure: pleasantness as a reflection of the physical world. The Journal of Neuroscience : the Official Journal of the Society for Neuroscience, 27(37), 10015–10023.

Kermen, F., Chakirian, A., Sezille, C., Joussain, P., Le Goff, G., Ziessel, A., et al. (2011). Molecular complexity determines the number of olfactory notes and the pleasantness of smells. Scientific Reports, 1, 206–206. http://doi.org/10.1038/srep00206

...............................................................................................................................

Poster: DR11
Predicting patient survival in the DREAM 9.5 mCRPC challenge

Team FIMM-UTU
Teemu D. Laajala¹,², Suleiman Khan², Antti Airola³, Tuomas Mirtti²,⁴, Tapio Pahikkala³, Peddinti Gopalacharyulu², Tero Aittokallio¹,²

¹ Department of Mathematics and Statistics, University of Turku, Finland
² Institute for Molecular Medicine Finland, University of Helsinki, Finland
³ Department of Information Technology, University of Turku, Finland
⁴ Department of Pathology, HUSLAB, Helsinki University Hospital, Finland

In this poster, our team FIMM-UTU presents the top performing ensemble of penalized regression models for predicting patient survival in the context of metastatic castration-resistant prostate cancer (mCRPC) patients, originating from several clinical trials (subchallenge 1a of the DREAM 9.5 Prostate Cancer Challenge). Here, we outline the key stages in our method development: (i) processing raw data input; (ii) imputation of missing values, filtering and truncation; (iii) utilizing unsupervised learning to identify most relevant preliminary patterns; (iv) fitting batch-wise optimized penalized regression glmnet-models; and (v) constructing the final ensemble collection of models for performing accurate novel predictions.

By coupling unsupervised learning with survival-analysis-based supervised learning, we constructed an ensemble of batch-wise optimized penalized regression coxnet-models. The final ensemble models were simultaneously optimized for the penalized regression through the L1/L2-norm parameter α along with the penalization coefficient λ using cross-validation, and were averaged over multiple cross-validation runs to avoid randomness in the binning. Model-based imputation of missing values and the incorporation of clinical a priori knowledge of variables are presented, along with practical lessons learned from processing such challenging clinical data, which required wide multidisciplinary expertise. We systematically identified and validated multiple upstream key modeling decisions that contributed to our successful performance, and provide diagnostics of which selected variables and diverse modeling strategies most favorably contributed to the final ensemble of models. Finally, we report clinically novel findings from the models and discuss their importance for future prognostic modeling in mCRPC.
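
To make the roles of α and λ concrete, the Python sketch below evaluates the elastic-net penalty in the parameterization used by glmnet, λ · (α · ‖β‖₁ + (1 − α)/2 · ‖β‖₂²). The function name and the example coefficients are illustrative only and are not taken from the team's models.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Elastic-net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)


beta = [1.0, -2.0, 0.0]  # made-up coefficient vector
lasso_like = elastic_net_penalty(beta, alpha=1.0, lam=0.1)   # pure L1
ridge_like = elastic_net_penalty(beta, alpha=0.0, lam=0.1)   # pure L2
mixed = elastic_net_penalty(beta, alpha=0.5, lam=0.1)
```

Setting α to 1 recovers the lasso penalty and α to 0 the ridge penalty, while λ scales the overall strength; tuning both jointly by cross-validation selects a point between these two extremes.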

...............................................................................................................................

Poster: DR12
GENESIS: A variation discovery framework for clinical cancer genomic profiling

Allen Chi-Shing YU^1,2*, Aldrin Kay-Yuen YIM^1,3*, Marco Jing-Weoi LI^1,2*

¹ Codex Genetics Limited, Hong Kong
² School of Life Sciences, The Chinese University of Hong Kong, Hong Kong
³ Computational & System Biology Program, Washington University School of Medicine
*Co-first authors

Large-scale cancer genomics projects carried out by consortia such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have had an enormous impact on the understanding, diagnosis and treatment of cancer. Data generated in these projects provide important clinical insights and drive the advancement of bioinformatics algorithms, for instance somatic variant callers with increasing accuracy, and the identification of very rare malignant clones as well as DNA/RNA structural alterations. To participate in the ICGC-TCGA SMC-DNA DREAM Challenge (Intel-10 SNV real-tumor Sub-Challenge), we have developed a machine learning-based ensemble somatic variant calling pipeline, Genesis. Genesis was trained on a public cancer genomic dataset with validated mutation calls, and over thirty informative metrics were gathered from the four individual variant callers (MuTect, Strelka, Vardict and VarScan2) and from genomic sequence features such as regional GC level, entropy, mappability and mutational hotspots. Variants were then classified into true or false positives using Support Vector Machine (SVM) models that are optimized for the tumor type and variant type (SNP or indel). Based on a published dataset of 88 hepatocellular carcinomas and in-house clinical cancer samples, we demonstrated that our machine learning approach can achieve higher sensitivity and specificity than the individual call sets. Genesis is therefore expected to be well suited to clinical cancer genomic profiling.

...............................................................................................................................

Poster: DR13
Predicting odor perception from molecular structure using a "nearest neighbor" approach

Aharon Ravia¹, Lavi Secundo¹, Kobi Snitz¹, and Noam Sobel¹

¹Department of Neurobiology, Weizmann Institute of Science

The DREAM olfaction challenge was to predict the perceptual qualities of novel molecules according to their structural properties. The data consisted of 476 molecules rated by 49 subjects across 21 different descriptors.

Predicting odor perception from odor structure is a major goal in olfaction research. We and others have made initial steps in this direction such that we can now predict aspects like odorant pleasantness (Khan et al., 2007; Zarzo 2007; Koulakov et al., 2011) and pairwise odorant similarity (Snitz et al., 2013) from odorant structure alone. These abilities rest in part on the observation that principal component analysis (PCA) of molecular descriptors has predictive power, whereby the first principal component of the physicochemical space is related to perceived odor pleasantness.

Here we set out to apply PCA and a "nearest neighbor" approach to the challenge data using a variation of the Rotation Forest method (Rodríguez, Kuncheva, & Alonso, 2006). This method applies PCA to random splits of the features in order to create weak predictors and then combines them as an ensemble. In the Rotation Forest method, classification is done on a rotated subspace. To answer the challenge we needed a continuous estimator instead of a classifier, and used a nearest neighbor approach for this purpose. Moreover, we executed the estimation on one subspace of features at a time.
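The idea of rotating random feature subsets with PCA and averaging nearest-neighbor estimates can be sketched as follows. This is a minimal illustration and not the team's code: the function names, the number of subsets, the subset size, the number of retained components and the value of k are all assumptions made for the example.

```python
import numpy as np

def fit_rotation_knn(X, y, n_subsets=5, subset_size=4, n_components=2, seed=0):
    """Fit one PCA rotation per random feature subset (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_subsets):
        # Pick a random subset of feature columns.
        cols = rng.choice(X.shape[1], size=subset_size, replace=False)
        mean = X[:, cols].mean(axis=0)
        centered = X[:, cols] - mean
        # PCA via SVD of the centered data: rows of vt are principal axes.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        rot = vt[:n_components].T  # keep the leading components (assumption)
        members.append((cols, mean, rot, centered @ rot))
    return members, np.asarray(y, dtype=float)

def predict_rotation_knn(model, X_new, k=3):
    """Average k-nearest-neighbor estimates over all rotated subspaces."""
    members, y = model
    preds = []
    for cols, mean, rot, train_proj in members:
        proj = (X_new[:, cols] - mean) @ rot
        # Euclidean distance from every new point to every training point.
        d = np.linalg.norm(proj[:, None, :] - train_proj[None, :, :], axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]
        preds.append(y[nearest].mean(axis=1))
    return np.mean(preds, axis=0)
```

Each ensemble member is a weak continuous estimator on its own subspace; averaging across members plays the role that voting plays in the classification version of Rotation Forest.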

This method yielded better results than other methods we tried, such as linear regression. For example, when we correlated predicted vs. actual pleasantness ratings of the leaderboard data we obtained R ≈ 0.52. We think that although others were able to get similar and better predictions using other methods, this kind of analysis may reveal facts about the physicochemical space, and extract characteristics such as proximity between molecules. Finally, we examine the possibility that the olfactory system solves olfactory space using a similar strategy.

...............................................................................................................................

Poster: DR14
Ensemble Approaches To Prostate Cancer Dream Challenge

Team The Data Wizard
Wen-Chieh Fang¹, Li-Min Tu², Huan-Jui Chang³, Chia-Tse Chang¹, Yu-Fu Wang¹, Mu-Hung Tsai⁴, Alexey Yu. Lupatov⁵, Konstantin N. Yarygin⁵, Hsih-Te Yang^1,4, Chiang Jung-Hsien^1,4

¹Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
²Department of Computer Science and Engineering, New York University, United States
³Department of Economics, National Cheng Kung University, Taiwan
⁴Institute of Medical Informatics, National Cheng Kung University, Taiwan
⁵Orekhovich Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences, Moscow, Russia

Abstract: Prostate cancer is the most common cancer diagnosed in many countries and the third most common cause of cancer death. However, prognostic models for overall survival of patients are dated. Based on preliminary exploration of the three sets of raw trial data, we model the data using ensemble learning in subchallenge 1a. We provide a novel boosting approach to tackle the problem in subchallenge 1b. For the second subchallenge, we combine three different prediction models to derive the final prediction.

We first preprocess the data to handle missing values and use a filter method for feature selection. For predicting overall survival, we derive a different feature set for each of the three training sets (ASCENT2, CELGENE, EFC6546) and then apply the Cox model with maximum penalized likelihood on the selected features to train a specific model for each set. We combine the ranking results from the different models. To predict the exact time to event (death of a patient), we develop a novel AdaBoost-like regression algorithm called "adaboost-s" for survival problems, specifically to predict time to event. In the training phase, if the predicted time to event of a censored data point is smaller than the lower bound, the data point is considered incorrectly predicted and its weight is increased. To predict treatment discontinuation due to adverse events at early time points for patients treated with docetaxel, we apply an ensemble technique to combine the ranking results of several methods. The methods include two random forest classifiers and one gradient boosting classifier.

The main advantage of ensembles of different models is that it is unlikely that all models will make the same mistake. Ensembles tend to reduce the variance of models. Therefore, we apply ensemble approaches to deal with the prediction problems in this challenge. In the leaderboard, our approaches outperform the baseline and the methods of many teams.
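Combining the ranking results of several models can be illustrated with a simple rank-averaging sketch. The abstract does not state how the rankings were combined, so plain averaging of per-model ranks is an assumption here, and the function names are hypothetical.

```python
def ranks(scores):
    """Return 1-based ranks (lowest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result

def average_ranks(model_scores):
    """Average the per-model ranks of each patient (assumed combination rule)."""
    per_model = [ranks(s) for s in model_scores]
    n_patients = len(per_model[0])
    return [sum(r[i] for r in per_model) / len(per_model) for i in range(n_patients)]

# Two hypothetical models scoring the same three patients on different scales.
combined = average_ranks([[0.2, 0.9, 0.5], [10, 30, 20]])
```

Working on ranks rather than raw scores lets models with different output scales contribute equally, which is one reason rank-based combination is common when a ranking metric is used for evaluation.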

...............................................................................................................................

Poster: DR15
Supervised ensembles to boost the predictive power of DREAM challenges

Gaurav Pandey^1, Gustavo Stolovitzky^1,2, Sean Whalen³, Lara Mangravite⁴, Solveig Sieberts⁴, Abhishek Pratap⁴ and Om Prakash Pandey¹

¹Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, NY
²IBM Research, NY
³Gladstone Institutes, University of California at San Francisco
⁴Sage Bionetworks, Seattle

Abstract: Prediction problems in biomedical sciences, such as those posed in DREAM challenges, are well known to be quite difficult to solve convincingly. This is due in part to incomplete knowledge of the biomedical phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. These issues are reflected in the diversity of prediction techniques, datasets, domain knowledge and other ingredients used to develop submissions to DREAM challenges. In such scenarios, a powerful approach to improving prediction performance is to construct ensemble predictors that combine the output of complementary individual predictors derived from diverse techniques and/or datasets. Traditional ensemble methods like boosting, bagging and random forest are insufficient for this task as they (generally) assume that the individual/base predictors are of the same type. They also expect the ensemble process to have control over the generation of these predictors from (generally) a single training set. Neither of these important assumptions holds for the challenge setting. Thus, in this work, we propose the use of heterogeneous ensemble methods, such as stacking and ensemble selection, for building effective ensembles for DREAM challenges as well as other biomedical prediction problems. First, using several protein function and genetic interaction prediction datasets, we illustrated how such heterogeneous ensembles can provide statistically significant gains over individual predictors, including those based on boosting and random forests (Whalen et al., Methods, 2015).
Deeper analysis shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional individual predictors. Motivated by these results, we built stacking-based ensembles for subchallenge 2 of the Rheumatoid Arthritis anti-TNF drug response challenge. Using only six of the individual predictors, these ensembles (AUPR = 0.5228) again provided prediction gains over the two best individual predictors (AUPR = 0.5099 and 0.5071). In current work, we are trying to realize such gains for other DREAM challenges as well, especially by systematically addressing the theoretical and implementation issues associated with this task.
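Stacking can be sketched in a few lines: the outputs of the individual predictors on held-out samples become the input features of a second-level learner. The sketch below uses a logistic regression meta-learner fitted by gradient descent, which is an assumption for illustration; the abstract does not say which meta-learner was used, and the data and function names are hypothetical.

```python
import math

def sigmoid(z):
    # Clamp the input to avoid overflow in exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_stacker(base_preds, labels, lr=0.5, epochs=2000):
    """Fit the meta-learner. base_preds holds one row per sample, one score per base predictor."""
    m, n = len(base_preds[0]), len(labels)
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * m, 0.0
        for row, y in zip(base_preds, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - y
            for j in range(m):
                gw[j] += err * row[j]
            gb += err
        # Batch gradient step on the mean logistic loss.
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def stack_predict(model, row):
    """Combine the base predictors' scores for one sample into a single probability."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b)

# Hypothetical held-out scores from two base predictors, with true labels.
held_out = [[0.9, 0.4], [0.8, 0.6], [0.2, 0.5], [0.1, 0.3], [0.7, 0.2], [0.3, 0.8]]
labels = [1, 1, 0, 0, 1, 0]
stacker = fit_stacker(held_out, labels)
```

The base predictors can be of any type and can come from different teams or datasets, since the meta-learner only sees their outputs. Training it on held-out predictions, rather than on the data the base predictors were fitted to, is what keeps the learned weights from rewarding overfitted predictors.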

...............................................................................................................................

Poster: DR16
A Boosting approach and Cox model for Predicting Slope and Survival of ALS

Wen-Chieh Fang¹, Chen Yang¹, Huan-Jui Chang², Hsih-Te Yang^1,3, Jung-Hsien Chiang^1,3

¹Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
²Department of Economics, National Cheng Kung University, Taiwan
³Institute of Medical Informatics, National Cheng Kung University, Taiwan

Abstract: Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that leads to muscle weakness and gradually impairs the functioning of the body, leading to eventual death. It greatly reduces an individual's life expectancy. Currently, experts do not know precisely what causes ALS, and there is no known cure. The DREAM ALS Stratification Prize4Life Challenge was held with the goal of enabling better understanding of patient profiles and the application of personalized ALS treatments.

In our approach to the DREAM ALS Challenge, we first discard features for which 90 percent of values are missing. For the remaining features, we replace any missing value with the mean of that variable over all other cases. For all four subchallenges, we apply equal-frequency binning, which divides the response variable into three groups such that each group contains approximately the same number of values. There are two kinds of features: static features and 'time-resolved' features (those whose values vary over time). For the latter, we try two summary measurements, the minimum and the maximum, as features. Then, for both kinds of features, we apply feature selection based on information gain to select the top six features. In order to select optimal features, we run cross-validation on the feature candidates. We apply Gradient Boosted Regression Trees (GBRT) to predict the ALSFRS slopes. GBRT computes a sequence of simple decision trees, where each successive tree is built for the prediction residuals of the preceding tree. We apply a Cox model with maximum penalized likelihood to predict the survival probability.
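The preprocessing and feature-scoring steps described above can be sketched as follows. This is an illustrative outline, not the team's implementation: the function names are hypothetical, and binning the features themselves before computing information gain is an assumption (the abstract only states that the response variable is binned).

```python
import math
from collections import Counter

def missing_fraction(column):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def impute_mean(column):
    """Replace None with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def equal_frequency_bins(values, n_bins=3):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * n_bins // len(values), n_bins - 1)
    return bins

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_bins, target_bins):
    """Entropy of the target minus its entropy conditional on the feature."""
    n = len(target_bins)
    conditional = 0.0
    for value in set(feature_bins):
        subset = [t for f, t in zip(feature_bins, target_bins) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(target_bins) - conditional
```

Features would first be filtered with `missing_fraction(column) < 0.9`, then imputed, and finally ranked by `information_gain` against the binned response so that the six highest-scoring features can be kept.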

In this challenge, we think that feature selection is one of the most important steps, and we believe that choosing the most appropriate features dominates the performance of the model. In the last leaderboard round, our team was the second-best team and our approaches outperformed the methods of most teams.

...............................................................................................................................

Poster: DR17

Using aggregated weights along paths across random forest to select important features and predict ALS progression
Jinfeng Xiao¹ and Jian Peng¹

¹University of Illinois at Urbana-Champaign, USA

Amyotrophic lateral sclerosis (ALS) is typically a rapidly progressing neurodegenerative disease. In many cases it leads to death within 3-5 years from onset of symptoms, but the rate of progression across the patient population can vary by an order of magnitude. Unwinding such underlying heterogeneity can hopefully shed light on disease mechanisms and drug development, and reliable prediction on progression rate can assist clinical decisions.

We developed a novel random-forest-based method to select 6 important clinical variables from 68 and predict ALS progression based on those 6. After training a random forest F with all n available features (those with a missing rate < 50%), the patients used for training were dropped down the forest and their paths across each tree were tracked. Along each path, nodes were assigned different weights based on their positions along the path. Then node weights from all paths were aggregated so that each patient was represented by a point in an n-dimensional space Rⁿ, where the coordinates are the aggregated weights of the n clinical features. All patients for training were then clustered in Rⁿ, and within the i-th cluster (i ranging from 1 to the number of clusters) the 6 clinical variables with the highest aggregated weights were used to train a new random forest F_i. When a new patient came in, based on the aggregated node weights along his/her paths across F, an F_i and the 6 corresponding clinical variables were selected for predicting his/her ALS progression rate.

Our method was the top performer in sub-challenge 3, which was to predict ALS progression of patients from two national registries. Several other methods were tested locally, and our method performed best. For example, we tried representing each patient with a point in Rⁿ whose coordinates were the values of the n clinical features instead of their aggregated weights along paths across the random forest, calculated the z-score (aggregated from concordance index, Pearson correlation and root-mean-square deviation) and found that the z-score of our submitted method was 41% higher. We also tried ranking the importance of features using the permutation setting of the importance function of the R package randomForest, and the z-score of our submitted method was 30% higher. Inspired by those preliminary results, we are currently further developing and testing our aggregated weights method.

RSG POSTER ABSTRACTS - 61 and above

...............................................................................................................................

Poster: P61
Identifying sequence-dependent regulators of gene expression from a novel massively parallel reporter assay

Vincent Fitzpatrick, Columbia University, Department of Biological Sciences, United States
Joris van Arensbergen, Netherlands Cancer Institute, Netherlands
Marcel de Haas, Netherlands Cancer Institute, Netherlands
J Omar Yáñez-Cuna, Netherlands Cancer Institute, Netherlands
Ludo Pagie, Netherlands Cancer Institute, Netherlands
Bas van Steensel, Netherlands Cancer Institute, Netherlands
Harmen Bussemaker, Columbia University, Department of Biological Sciences, United States

DNA-binding proteins regulate expression through sequence-specific interactions with gene promoters. These interactions are further mediated by local chromatin context extrinsic to the promoter sequence, making it difficult to separate sequence-dependent regulatory mechanisms from other contextual factors. To address this, our collaborators in the Van Steensel lab (NKI) have developed SuRE-seq, a high-throughput reporter assay that screens for genomic fragments capable of driving expression of a uniform plasmid reporter. SuRE-seq quantifies the relative expression rate of millions of genomic elements in parallel, providing insight into genome-wide mechanisms of transcriptional regulation. Using a regression-based approach, we have discovered sequence-specific, spatially-dependent mechanisms of gene regulation in Drosophila and human cell lines, including motifs attributable to known transcription factors and low-complexity sequence patterns with strand-dependent contributions to expression. These results allow us to separate sequence-intrinsic regulatory properties of gene promoters and enhancers that are independent of endogenous chromatin context.

...............................................................................................................................

Poster: P62
Characterization of phased, secondary, small interfering RNAs (phasiRNAs) using Machine Learning

Parth Patel, University of Delaware, United States
Sandra Mathioni, University of Delaware, United States
Atul Kakrana, University of Delaware, United States
Hagit Shatkay, University of Delaware, United States
Blake Meyers, University of Delaware, United States

Small RNAs (sRNAs) in plants range in size from 21 to 24 nucleotides, and play important roles in biological processes such as development, epigenetic modification, and plant defense. They can be partitioned into three major classes: microRNAs (miRNAs); heterochromatic small interfering RNAs (hc-siRNAs); and phased, secondary, small interfering RNAs (phasiRNAs) (Fei et al., 2013). Our study focuses on phasiRNAs, for which knowledge about functionality is still limited. We (Zhai et al. 2015) and others have shown that maize anthers (male reproductive organs) express two classes of phasiRNAs (21-nt and 24-nt) at different developmental time points (pre-meiotic and meiotic). Other data suggest these phasiRNAs are required for fertility.

Given the important role grasses such as maize and rice play as a prime food source in many countries and as influential factors in the global economy, we aim to identify and understand the function of grass-specific phasiRNAs in maize and rice development. To this end, we use the framework of hidden Markov models (HMMs) to model both phasiRNA and non-phasiRNA sequences, and to distinguish between these two types of small RNAs. We performed ANOVA with Dunnett's method, demonstrating that the probability assigned by the resulting HMMs to phasiRNAs (21/24-nt) from rice and maize is significantly different from that assigned to other genomic sequences of similar length. Future work will include classification to distinguish phasiRNA sequences from non-phasiRNA sequences using other machine learning classifier(s), aiming to extract patterns (e.g., motifs, GC content) occurring in phasiRNAs to provide further insight into their biological function.
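The probability that an HMM assigns to a sequence is computed with the forward algorithm. The sketch below shows that computation with per-step scaling for numerical stability; the two-state model and its parameters are invented for illustration and are not the models trained in this study.

```python
import math

def log_likelihood(seq, start, trans, emit):
    """Forward algorithm with scaling; returns log P(seq | HMM).

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][x]  : probability that state s emits symbol x
    """
    states = range(len(start))
    alpha = [start[s] * emit[s][seq[0]] for s in states]
    total = sum(alpha)
    log_p = math.log(total)
    alpha = [a / total for a in alpha]
    for symbol in seq[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in states) * emit[s][symbol]
            for s in states
        ]
        # Rescale so alpha sums to 1 and accumulate the log of the scale factor.
        total = sum(alpha)
        log_p += math.log(total)
        alpha = [a / total for a in alpha]
    return log_p

# Hypothetical two-state model over a two-letter alphabet.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"A": 0.7, "C": 0.3}, {"A": 0.1, "C": 0.9}]
score = log_likelihood("AC", start, trans, emit)
```

With one HMM trained on phasiRNAs and another on background sequences, the difference between the two log-likelihoods of a candidate sequence gives a natural score for telling the two classes apart.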

...............................................................................................................................

Poster: P63
The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer

Robert J.A. Bell, University of California San Francisco, United States
H. Tomas Rube, Columbia University, United States
Alex Kreig, University of Illinois Urbana-Champaign, United States
Andrew Mancini, University of California San Francisco, United States
Shaun F. Fouse, University of California San Francisco, United States
Raman P. Nagarajan, University of California San Francisco, United States
Serah Choi, University of California San Francisco, United States
Chibo Hong, University of California San Francisco, United States
Daniel He, University of California San Francisco, United States
Melike Pekmezci, University of California San Francisco, United States
John K. Wiencke, University of California San Francisco, United States
Margaret R. Wrensch, University of California San Francisco, United States
Susan M. Chang, University of California San Francisco, United States
Kyle M. Walsh, University of California San Francisco, United States
Sua Myong, University of Illinois Urbana-Champaign, United States
Jun S. Song, University of Illinois Urbana-Champaign, United States
Joseph F. Costello, University of California San Francisco, United States

Reactivation of telomerase reverse transcriptase (TERT) expression enables cells to overcome replicative senescence and escape apoptosis, which are fundamental steps in the initiation of human cancer. Multiple cancer types, including up to 83% of glioblastomas (GBMs), harbor highly recurrent TERT promoter mutations of unknown function but specific to two nucleotide positions. We identified the functional consequence of these mutations in GBMs to be recruitment of the multimeric GA-binding protein (GABP) transcription factor specifically to the mutant promoter. Allelic recruitment of GABP is consistently observed across four cancer types, highlighting a shared mechanism underlying TERT reactivation. Tandem flanking native E26 transformation-specific motifs critically cooperate with these mutations to activate TERT, probably by facilitating GABP heterotetramer binding. GABP thus directly links TERT promoter mutations to aberrant expression in multiple cancers.

...............................................................................................................................

Poster: P64
Charting the human genome’s regulatory landscape with transcription factor binding site predictions

Xi Chen, New York University, United States
Richard Bonneau, New York University/Simons Foundation, United States

Transcription factor (TF) binding is an essential step in the regulation of gene expression. Differential binding of multiple TFs at key cis-regulatory loci allows the specification of progenitor cells into various cell types, tissues and organs. ChIP-Seq is a technique that can reveal genome-wide patterns of TF binding. However, it lacks the scalability to cover the range of factors, cell types and dynamic conditions that a multicellular eukaryotic organism experiences. Charting the regulatory landscape spanning multi-lineage differentiation therefore requires computational methods to predict TF binding sites (TFBS) in an efficient and scalable manner.

We develop a method to predict binding sites for over 800 human TFs using a rich collection of DNA binding motifs. We integrate genomic features, including chromatin accessibility, motif scores, TF footprints, CpG/GC content, evolutionary conservation and the proximity of TF motifs to transcription start sites, in sparse logistic regression classifiers. We label candidate motif sites with ChIP-Seq data and apply a correlation-based filter and L1 regularization to select relevant features for each trained TF. The resulting logistic regression classifiers accurately predict TFBS and perform favorably in comparison to the current best TFBS prediction methods. Further, we map TFs based on feature distance to a nearest trained TF neighbor. Cross-TF predictions allow us to scale and expand the repertoire of putative TFBS to any TF for which motif data is available and to any cell type for which accessibility data is obtainable. Our method has the potential to be applied in previously intractable domains, such as the identification of cell type-specific cis-regulatory modules, and to reveal key properties underlying the regulatory complexity of multicellular eukaryotes.

...............................................................................................................................

Poster: P65
Deconvolving discriminative sequence features in overlapping categories of TF binding sites

Akshay Kakumanu, Penn State, United States
Silvia Velasco, New York University, United States
Esteban Mazzoni, New York University, United States
Shaun Mahony, Penn State, United States

Given multiple ChIP-seq experiments, we often seek to define clusters of binding sites that describe site properties across experiments. For example, we may categorize a given transcription factor’s binding sites as condition-independent or condition-specific across multi-condition ChIP-seq experiments. Similarly, we may categorize a transcription factor’s binding sites as being located in active enhancers or not, based on overlaps with appropriate histone modification ChIP-seq experiments. Given such binding site categories, it is natural to ask what sequence features are associated with a category label. However, discovering such label-specific sequence features is often confounded by overlaps between binding site categories. For example, if condition-independent transcription factor binding sites are also more likely to be located within promoter regions, any sequence features specific to condition-independent binding behavior will be convolved with sequence features specific to promoters. Therefore, in order to identify sequence signals specifically associated with a given binding label, it is necessary to deconvolve discriminative sequence signals from overlapping labels.

In order to meet this challenge, we developed SeqUnwinder, a principled approach to identifying interpretable discriminative sequence features for overlapping categories of transcription factor binding sites. SeqUnwinder uses local k-mer frequencies as predictors for a multiclass logistic classifier. Class label relationships between clusters are incorporated through an L2-norm regularizer that encourages clusters sharing a label to have similar predictor weights. Our approach yields an integrated framework that identifies discriminative sequence signals for individual TF binding class labels and all combinations of labels, making it easier to gain insight into TF binding preferences in a given in vivo system.
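The k-mer frequency predictors mentioned above can be illustrated with a short sketch. The function name, the default k and the choice to normalize by the number of windows are assumptions made for this example and do not describe SeqUnwinder's actual feature extraction.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return the frequency of every possible k-mer in seq, in lexicographic order."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # guard against sequences shorter than k
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]

# A feature vector of length 4**k for one (hypothetical) binding-site window.
features = kmer_frequencies("AAAC", k=2)
```

Each binding site then becomes a fixed-length vector that a multiclass logistic classifier can take as input, with one weight per k-mer and class.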

We demonstrate SeqUnwinder by using it to characterize TF binding during direct motor neuron programming. In our system, over-expression of Ngn2, Isl1, and Lhx3 (NIL) induces rapid and highly efficient conversion of mouse embryonic stem (ES) cells to spinal motor neurons. However, little is known about how the NIL factor combination achieves direct programming. We used ChIP-seq to profile NIL binding at three intermediate time points during the direct programming process. We then formed overlapping clusters of binding sites according to two criteria: dynamics over the course of programming, and the chromatin context in the initial ES cells. SeqUnwinder enables us to identify several meaningful sequence features associated with each cluster label, and thereby allows us to formulate hypotheses about the mechanisms through which over-expression of NIL can alter the fate of ES cells into induced motor neurons.

...............................................................................................................................

Poster: P66
Implementation of a Deep Learning Framework to Predict De Novo Anticancer Drug Activity

Jose Zamalloa, Princeton University, United States
Mona Singh, Princeton University, United States

Cancer treatment can greatly benefit from highly accurate drug prediction models. Current methods aim to identify key features in genomic data in order to predict known efficacies of a particular drug across cancer cell lines. The Cancer Cell Line Encyclopedia (CCLE) and Cancer Genome Project (CGP) provide the cancer drug panel data needed to build predictive models based on known drug compounds and genomic backgrounds of cancer cells. However, given that known compounds are currently not sufficient to treat cancer effectively, there is a pressing need to develop methods that can accurately predict the activity of compounds for which we have no prior information. The present method aims to solve this problem by combining chemical information across compounds and cancer genetic backgrounds to predict an unknown drug activity using a Deep Learning framework. We incorporate structural information of endogenous metabolites to describe chemical features of drug compounds and integrate them as features into our predictor along with selected genomic information.

We applied our approach to the CCLE dataset. We train our model on the entire dataset except for the compound of interest and then test it on that compound, in order to simulate the prediction of an unknown drug. Our preliminary results show that our accuracy is on par with or better than that of current methods, suggesting its potential use in predicting the activity of untested cancer drug candidates.
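The evaluation scheme described here amounts to a leave-one-compound-out split, which can be sketched as follows. The record layout (compound, cell line, activity) and the function name are assumptions for the example, not the actual CCLE data format.

```python
def leave_one_compound_out(records):
    """Yield (held_out, train, test) splits, holding out one compound at a time.

    records: list of (compound, cell_line, activity) tuples (hypothetical layout).
    """
    compounds = sorted({r[0] for r in records})
    for held_out in compounds:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test

# Toy records with made-up compound and cell line names.
records = [
    ("drugA", "line1", 0.1),
    ("drugA", "line2", 0.2),
    ("drugB", "line1", 0.3),
    ("drugC", "line2", 0.4),
]
splits = list(leave_one_compound_out(records))
```

Because every measurement of the held-out compound is removed from training, the model can only rely on chemical similarity to other compounds, which is what makes the test resemble prediction for a genuinely untested drug.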

...............................................................................................................................

Poster: P67
Computational Discovery of Transcription Factors Associated with Drug Response

Casey Hanson, University of Illinois at Urbana-Champaign, United States
Junmei Cairns, Mayo Clinic, United States
Liewei Wang, Mayo Clinic, United States
Saurabh Sinha, University of Illinois at Urbana-Champaign, United States

Purpose: Genome wide association studies in pharmacogenomics generally involve associating drug-induced response with biomarkers. While GWAS suffers from sensitivity issues after correction, even signals that survive face problems of functional interpretation. Our study ameliorates this issue by posing the statistical test in the context of gene regulation. Rather than identifying SNPs or genes associated with drug response, we integrate biomarkers with genome-wide transcription factor (TF) binding data to elucidate whether a TF’s regulatory influence is associated with the drug. Our approach (GENMi) integrates gene expression, genotype, and drug-response data with ENCODE TF tracks to quantify the association between TFs and drugs via cis-regulatory eQTLs.

Methods: The GENMi method for testing a (TF, drug) combination consists of the following procedure. First, SNPs located outside of the TF’s ENCODE peaks are discarded. Considering the 50kb upstream region of a gene as a putative cis-regulatory region, the gene is scored by the most significant eQTL under the TF’s peak. The top 400 eQTL genes are then tested for overlap with all genes correlated with the drug-induced cytotoxicity, using Gene Set Enrichment Analysis.
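The gene-scoring part of this procedure can be sketched as below. It is a simplified illustration under stated assumptions: the input layouts, the strand-aware definition of "upstream" and all names are hypothetical, and the final Gene Set Enrichment Analysis step is not shown.

```python
def in_any_peak(chrom, pos, peaks):
    """True if the position falls inside any (chrom, start, end) peak."""
    return any(c == chrom and start <= pos <= end for c, start, end in peaks)

def top_eqtl_genes(eqtls, tf_peaks, tss, upstream=50_000, top_n=400):
    """Rank genes by their most significant eQTL lying under a TF peak.

    eqtls    : list of (gene, chrom, snp_pos, p_value)
    tf_peaks : list of (chrom, start, end) for one TF
    tss      : dict gene -> (chrom, tss_pos, strand)
    """
    best = {}
    for gene, chrom, pos, p_value in eqtls:
        g_chrom, g_pos, strand = tss[gene]
        # Upstream window depends on the strand the gene is transcribed from.
        if strand == "+":
            lo, hi = g_pos - upstream, g_pos
        else:
            lo, hi = g_pos, g_pos + upstream
        if chrom != g_chrom or not lo <= pos <= hi:
            continue  # SNP is outside the putative cis-regulatory region
        if not in_any_peak(chrom, pos, tf_peaks):
            continue  # SNP is outside the TF's peaks and is discarded
        best[gene] = min(p_value, best.get(gene, 1.0))
    # Smallest p-value first; keep the top_n genes.
    return sorted(best, key=best.get)[:top_n]
```

The returned gene list is what would then be compared, by enrichment analysis, against the genes whose expression correlates with drug-induced cytotoxicity.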

Results: We analyzed 114 TFs and 24 treatments using GENMi, yielding 334 significantly associated (TF, drug) pairs. The top 20 sparse (TF, drug) pairs yielded literature support for 13 associations, often from studies where perturbation of the TF’s expression changes drug response. We demonstrate the advantage of our approach by contrasting it with a baseline that does not use gene expression data. Our method reports more associations than the baseline approach at identical false positive rates (FPR). We further tested 14 TFs that GENMi associated with an anthracycline (doxorubicin or epirubicin) and 21 TFs that it associated with a taxane (paclitaxel or docetaxel). MTS cytotoxicity assays after TF knockdowns in two triple-negative breast cancer cell lines, BT549 and MDA-MB231, yielded 6 TFs that significantly de-sensitized the cells to taxane-induced apoptosis and 4 TFs that significantly de-sensitized the cells to anthracycline-induced apoptosis.

...............................................................................................................................

Poster: P68
Pervasive variation of transcription factor orthologs contributes to regulatory network divergence

Shilpa Nadimpalli, Princeton University, United States
Anton V. Persikov, Princeton University, United States
Mona Singh, Princeton University, United States

Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~44% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, present in ~70% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~26% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve more slowly than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a set of highly dynamic and largely unstudied TFs are a likely source of regulatory variation in Drosophila and other metazoans.

...............................................................................................................................

Poster: P69
A Novel Experimental Model Sheds Light on the Mechanism of Host-Gut Microbiome Interactions

Allison Richards, Wayne State University, United States
Michael Burns, University of Minnesota, United States
Adnan Alazizi, Wayne State University, United States
Roger Pique-Regi, Wayne State University, United States
Ran Blekhman, University of Minnesota, United States
Francesca Luca, Wayne State University, United States

The components of the human gut microbiome vary between physiological and pathological states. It has been shown that the gut microbiome differs in individuals with certain diseases, such as diabetes. However, the question of the cause and effect of the differences in both gut microbiome and host state remains unsolved. In order to delve into the relationship between gut microbiome and host, we treated human primary colonic epithelial cells (colonocytes) with different concentrations of gut microbiome from a healthy donor for varying time intervals. Each experiment was performed in triplicate. We found that a gut microbiome to host ratio of 10:1 is best to simulate the symbiotic environment of the colon. Under these conditions, we performed RNA-seq to determine changes in the host gene expression that comprise a response to microbiome exposure. RNA-sequencing reads were aligned using BWA-mem and differentially expressed genes were identified using DESeq2. We found 2,111 genes and 1,110 genes that change expression in the host colonocytes following exposure to the microbiome for 4 and 6 hours, respectively (FDR = 1%). These genes are enriched for a variety of pathways involved in the interaction of host cells and gut microbiome. Specifically, we found enrichment in pathways involved in cell adhesion and cell surface receptor signaling. In addition, we performed 16S sequencing on bacterial DNA derived from the same co-cultures as the host colonocytes in order to study changes in the composition of the gut microbiome following exposure to the host. We found that after 4 hours of co-culturing, there was a decrease in the proportional abundance of the phylum Firmicutes and a corresponding increase in the phylum Proteobacteria. Furthermore, there was a decrease in overall diversity of the gut microbiome following exposure to the host colonocytes. 
Together, these results help us to identify which pathways are involved in the host response following microbiome exposure and in turn, how the microbiome is changed by host exposure. Study of both of these responses will help us to understand the cause of the differences in gut microbiome composition that have been seen in various pathological states.

...............................................................................................................................

Poster: P70
Experimentally identified gene-environment interactions contribute to heritability of complex traits

Cynthia Kalita, Wayne State University, United States
Gregory Moyerbrailean, Wayne State University, United States
Omar Davis, Wayne State University, Canada
Chris Harvey, Wayne State University, United States
Adnan Alazizi, Wayne State University, United States
Donovan Watza, Wayne State University, United States
Xiaoquan Wen, University of Michigan, United States
Xiang Zhou, University of Michigan, United States
Roger Pique-Regi, Wayne State University, United States
Francesca Luca, Wayne State University, United States

Genome wide association studies (GWAS) have identified thousands of common genetic variants associated with complex traits, including normal traits and common diseases. However, the significant SNPs found in these association studies explain only a small proportion of disease heritability. One possible explanation for this missing heritability is that the effect of the variant on the trait can be detected only under the right environmental conditions.

To test this hypothesis, we used GEMMA to jointly analyze summary statistics from 18 GWAS meta-analysis studies with annotations of regulatory variation. Our annotations are derived from SNPs with allele-specific expression (ASE) in 48 cellular environments, eQTLs (Wen et al. 2014), and SNPs with conditional allele-specific expression (cASE).

GEMMA (Genome-wide Efficient Mixed Model Association) estimates the proportion of variance in phenotypes explained (PVE) by typed genotypes, i.e., the “chip heritability”. At the same time, it estimates the enrichment of a set of annotations within a GWAS trait. We find a range of enrichments for SNPs in genes with ASE, up to 7.84 for mean platelet volume. In comparison, for this same trait, SNPs in genic regions without ASE show an enrichment value of 1.03. When we consider SNPs in genes with cASE, we observe an enrichment of 5.10, compared to 3.93 (SNPs in genes with ASE) and 1.04 (SNPs in genic regions). This approach, which integrates regulatory variation and gene-environment interactions into GWAS signals, can provide a much better understanding of the molecular mechanisms underlying inter-individual variation in complex traits.

...............................................................................................................................

Poster: P71
A systematic survey of the Cys2His2 zinc finger DNA-binding landscape

Joshua Wetzel, Princeton University, United States
Anton Persikov, Princeton University, United States
Mona Singh, Princeton University, United States
Marcus Noyes, NYU Institute for Systems Genetics, United States

Cys2His2 zinc fingers (C2H2-ZFs) comprise the largest class of metazoan DNA-binding domains. Despite this domain's well-defined DNA-recognition interface, and its successful use in the design of chimeric proteins capable of targeting genomic regions of interest, much remains unknown about its DNA-binding landscape. To help bridge this gap in fundamental knowledge and to provide a resource for design-oriented applications, we screened large synthetic protein libraries to select binding C2H2-ZF domains for each possible three base pair target. The resulting data consist of >160 000 unique domain–DNA interactions and comprise the most comprehensive investigation of C2H2-ZF DNA-binding interactions to date. An integrated analysis of these independent screens yielded DNA-binding profiles for tens of thousands of domains and led to the successful design and prediction of C2H2-ZF DNA-binding specificities. Computational analyses uncovered important aspects of C2H2-ZF domain–DNA interactions, including the roles of within-finger context and domain position on base recognition. We observed the existence of numerous distinct binding strategies for each possible three base pair target and an apparent balance between affinity and specificity of binding. In sum, our comprehensive data help elucidate the complex binding landscape of C2H2-ZF domains and provide a foundation for efforts to determine, predict and engineer their DNA-binding specificities.

...............................................................................................................................

Poster: P72
Statistical Algorithms for Motif Discovery on SELEX Data

Chaitanya Rastogi, Department of Biological Sciences, Columbia University, United States
Harmen Bussemaker, Department of Biological Sciences, Columbia University, United States

SELEX-seq is an experimental and computational platform that combines biophysical modeling and deep sequencing in order to determine the DNA binding specificity of transcription factor complexes [1]. Recent work has demonstrated the protocol’s ability to elucidate novel recognition properties of the eight Drosophila Hox proteins [2]. SELEX-seq analyses require detailed oligomer count information to infer affinities, a challenging computational task given the size of the data. Efficient implementations of the computational pipeline are required as the adoption of SELEX-seq increases. Following the methodology set out in [1,2], we have developed a suite of R/Bioconductor functions, named "SELEX," to facilitate the analysis of SELEX-seq data. Thanks to efficient algorithms, this software can run on a standard laptop computer. Our package includes functionality for k-mer counting, Markov model construction, and information gain (Kullback-Leibler divergence) calculations, along with integrated solutions for painless annotation and management of SELEX-seq experiments. Significantly, the package forms the basis for advanced feature-based modeling of TF binding sites. These novel statistical models directly infer ΔΔG values for nucleotide, dinucleotide, and DNA shape features without any prior information about the binding factor in question.

[1] T.R. Riley, M. Slattery, N. Abe, C. Rastogi, D. Liu, R.S. Mann†, and H.J. Bussemaker†. (2014) SELEX-seq, a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol. Biol. 1196:255-78.

[2] M. Slattery, T.R. Riley, P. Liu, N. Abe, P. Gomez-Alcala, R. Rohs*, B. Honig*, H.J. Bussemaker*, R.S. Mann*. (2011) Cofactor Binding Evokes Latent Differences in DNA Binding Specificity between Hox proteins. Cell 147(6):1270-82.
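
The k-mer counting and information gain calculations mentioned in this abstract can be illustrated with a short sketch. It is written in Python for illustration only and is not the API of the R/Bioconductor SELEX package. As a simplifying assumption, the background is the observed k-mer distribution of an unselected library rather than a Markov model, and a pseudocount (also an assumption) keeps the divergence finite for k-mers missing from one library. The function names are hypothetical.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def information_gain(selected, background, pseudocount=1.0):
    """Kullback-Leibler divergence D(selected || background) in bits.

    Both arguments are Counters of k-mer counts. The pseudocount is added
    to every k-mer seen in either library so that no probability is zero.
    """
    kmers = set(selected) | set(background)
    s_total = sum(selected.values()) + pseudocount * len(kmers)
    b_total = sum(background.values()) + pseudocount * len(kmers)
    gain = 0.0
    for kmer in kmers:
        p = (selected[kmer] + pseudocount) / s_total
        q = (background[kmer] + pseudocount) / b_total
        gain += p * math.log2(p / q)
    return gain
```

Computing `information_gain` for several values of k and comparing the results is one way to see at which k-mer length the selected library departs most from the background.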

...............................................................................................................................

Poster: P73
Interpreting non-coding SNPs and quantifying the value of perturbation experiments using ensembles of biophysical models

Farzaneh Khajouei, University of Illinois Urbana-Champaign, United States
Md. Abul Hassan Samee, University of Illinois Urbana-Champaign, United States
Stanislav Shvartsman, Princeton University, United States
Saurabh Sinha, University of Illinois Urbana-Champaign, United States

Characterizing the mechanisms of gene regulation and predicting the functional effects of single nucleotide polymorphisms in regulatory sequences are two major challenges of the day. Thermodynamics-based models that map an enhancer sequence to its precise expression levels in varying cellular conditions are a promising means to meet these challenges. However, these models often involve several (10-20) free parameters, and the parameter space typically has many local optima. Current approaches rely on the best-fitting model (parameterization) to make predictions about perturbation conditions, but a large collection of models that fit the training data almost equally well remains unexplored as a result. We study this crucial problem with the goals of (1) constructing a probabilistic landscape or ‘ensemble’ of models from training data and (2) using the ensemble of models to analyze non-coding polymorphisms and to systematically design perturbation experiments.

Our approach considers all models with goodness-of-fit above a threshold. A large collection of models is constructed through deep, uniform sampling of the parameter space followed by local optimizations. From this collection we construct a continuous probability landscape over the parameter space, using Gaussian mixture models and adaptive variance estimation. The resulting landscape allows predictions about perturbation conditions to be made with error estimates. Using the thermodynamics-based GEMSTAT model, we constructed probabilistic landscapes for wild-type expression of developmental genes in Drosophila. We also constructed landscapes of models that fit both wild-type data and perturbation data from transcription factor knock-down or site mutagenesis experiments. Finally, we assessed the value of each perturbation experiment by the reduction in entropy of the parameter landscape resulting from the experiment. Assigning information-theoretic values to experiments is a first step toward systematic experiment design, paving the way to fully resolved models.
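
The idea of valuing an experiment by its entropy reduction can be sketched on a discrete ensemble. This is a simplified illustration, not the authors' procedure: a finite list of weighted models stands in for the continuous Gaussian mixture landscape, and the reweighting of each model by a likelihood under the new perturbation data is an assumption made for the example. All names are hypothetical.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def experiment_value(prior_weights, likelihoods):
    """Entropy reduction (bits) after an experiment.

    prior_weights : weight of each model before the experiment
    likelihoods   : how well each model fits the new perturbation data
                    (assumed inputs; at least one must be positive)
    """
    prior = normalize(prior_weights)
    posterior = normalize([p * l for p, l in zip(prior, likelihoods)])
    return entropy_bits(prior) - entropy_bits(posterior)
```

With four equally weighted models, an experiment that rules out two of them reduces the entropy from 2 bits to 1 bit, while an experiment that all models fit equally well has a value of zero.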

We also used the above probability landscape to analyze single nucleotide polymorphisms (SNPs) in a well-studied enhancer of the developmental gene ‘ind’ in Drosophila. We predicted the effects of every possible single nucleotide variation within this enhancer, and shortlisted the variations with the greatest predicted effects on average (over the probability landscape) or whose predicted effects exhibit the most uncertainty. We then focused on SNPs within this enhancer as recorded in the DGRP resource, and compared the spectrum of predicted quantitative effects of these observed SNPs to the spectrum of all possible SNPs. Our analyses pointed to an avoidance of strong-effect variations in general, but also provided strong evidence for compensation between modest-effect SNPs in the same individual.

...............................................................................................................................

Poster: P74
Tissue context improves disease-gene mining from biomedical text

Ruth Dannenfelser, Princeton University, United States
Ran Zhang, Princeton University, United States
Olga Troyanskaya, Princeton University, United States

Identifying the genes associated with disease is an important task in biology that furthers our understanding of disease mechanisms and has important clinical implications. Although it is well known that genetic defects regulate disease manifestation in a tissue-specific manner, tissue contexts are not considered in disease-gene curation, nor are they annotated in current disease-gene databases. In this work we develop the first method to identify potential disease-gene associations in a given tissue by mining PubMed abstracts. Conditioning on tissues, we collect and extract the information contained in biomedical abstracts to fit an unsupervised model, based on conditional mutual information, for prediction. In over 50 diverse tissues, we achieve promising performance with our model-based approach, which also outperforms tissue-naive prediction, suggesting that our method can accurately assign disease-related genes to their specific tissue contexts. Additionally, our model can help identify the genetic cause of tissue-specific dysfunction when a disease affects multiple tissues. We illustrate this by localizing obesity-related genes in the hypothalamus and adipocytes. A web server will be made available for users to query disease-gene predictions with tissue labels.

...............................................................................................................................

Poster: P75
H3K4me3 downstream of transcription start sites is responsible for transcriptomic modifications in systemic lupus erythematosus

Zhe Zhang, The Children's Hospital of Philadelphia, United States
Lihua Shi, The Children's Hospital of Philadelphia, United States
Kathleen Sullivan, The Children's Hospital of Philadelphia, United States

The autoimmune disease systemic lupus erythematosus (SLE) has a systematically modified epigenome, according to our previous studies of histone modifications such as tri-methylation of histone H3 lysine 4 (H3K4me3). H3K4me3 is a canonical open chromatin mark of active transcription. Recent studies also suggested that H3K4me3 breadth at the transcription start site (TSS) has an important regulatory role in cell identity. This project examined H3K4me3 breadth at TSSs in primary monocytes and its association with differential gene transcription in SLE. Integrative bioinformatics analysis was applied to ChIP-seq and RNA-seq data generated from the same samples, as well as public genomic data. We created an online application for this project, which also enables users to explore its data and perform their own analysis (http://awsomics.org/project/sle_h3k4me3_breadth).

Distinctive H3K4me3 patterns of ChIP-seq peaks were identified from 14,217 TSSs in control monocytes. The narrow peaks are mostly related to housekeeping functions. The broader peaks have extended H3K4me3 upstream and/or downstream of the TSS and are often found at immune response genes. Many TSSs have downstream H3K4me3 extended to ~650bp, where H3K36me3, a transcriptional elongation mark, starts to rise. The H3K4me3 pattern is strongly associated with gene overexpression in SLE. Genes with narrow peaks were less likely (OR = 0.14) while genes with extended downstream H3K4me3 were more likely (OR = 2.4) to be overexpressed in SLE. Since H3K4me3 levels of nearby regions are correlated with each other, we removed the interdependence of TSS, upstream and downstream regions by fitting a linear model and evaluated the direct correlation between differential transcription and differential H3K4me3 at each region. The downstream region has the strongest association with differential transcription. Of the genes having significant overexpression in SLE (p < 0.01), 78.8%, 55.0% and 47.1% had increased H3K4me3 at their downstream, TSS and upstream regions, respectively. Gene transcription sensitively and consistently responded to downstream H3K4me3 change, as every one percent increase of H3K4me3 led to ~1.5% average increase of transcription.

In summary, we identified the region downstream of the TSS as crucial for transcription changes in SLE. Given that many genes have the transcriptional initiation-elongation transition in this region, it is plausible to hypothesize that an increase in downstream H3K4me3 will facilitate the transition by making the nucleosome more accessible to elongation machinery. This study applied a unique method to study the effect of H3K4me3 breadth on diseases, and revealed new insights about epigenomic modifications in SLE, which can potentially lead to novel treatments.

...............................................................................................................................

Poster: P76
Coupled dynamics of drug synergy, gene expression, and alternative splicing in combination therapies of breast cancer

Bojan Losic, Icahn Institute for Genomics and Multiscale Biology, Mount Sinai, United States
Xintong Chen, Icahn Institute for Genomics and Multiscale Biology, Mount Sinai, United States
Gustavo Stolovitzky, IBM Research and Icahn School of Medicine at Mount Sinai, United States

Drug combination therapies in the cancer setting often succeed where mono-therapies fail, facilitating durable and robust responses that may curtail metastases and even be accompanied by milder side-effects. Predicting synergistic and antagonistic combinations based on the gene expression data of mono-therapy drug-tumor response is an important open problem (see concurrent submission) wherein the role of transcriptional splicing dynamics is often ignored or too poorly correlated with phenotypes to be useful.

In this work we leverage the inherent transcript/exon level resolution of RNA-seq data to infer gene expression and splicing signatures associated with additive and synergistic drug combinations as defined by canonical viability measurements in a time-course experiment. Briefly, we used the HiSeq Illumina RNA-seq assay to study the transcriptional response over time (0, 3, 6, 9, 12, and 24 h) for three drugs (A, B and C) and their combinations (AB, AC and BC) in the MCF-7 (ER+) breast cancer cell line. Cell viability measurements show that one of the combinations (AB) is strongly synergistic, whereas the other two (AC and BC) are merely additive. We show via rigorous linear modeling of RNA-seq count data at the exon level that in addition to a novel transcriptional signature driven by differential expression, the combination AB transcriptional landscape is characterized by persistent alternative splicing signatures mostly comprised of genes which are not differentially expressed with respect to A or B but whose functional role has been dramatically changed by the addition (deletion) of a key regulatory protein domain encoded by the extra (missing) exon. We construct an isoform-level co-expression network to probe the regulatory changes this dynamical splicing induces and show that it crucially contributes to the emergence of extensive transcriptional cascades by creating and removing key gene-gene correlations and altering the modular structure of the network. Our results suggest that any gene-signature based drug synergy prediction algorithm must take into account alternative splicing in order to effectively characterize the novel pathways being activated in the synergistic drug-tumor interaction.

...............................................................................................................................

Poster: P77
Spatial-temporal gene regulatory network of maize embryo and endosperm development

Wenwei Xiong, Montclair State University, United States
Chunguang Du, Montclair State University, United States

The study of maize embryo and endosperm has significant agricultural importance, but remains elusive because of a great number of involved genes and their complex interactions. To better understand the genetic control in maize seed development, we need to reveal the dynamic transcriptional regulatory relationships among transcription factors and their target genes quantitatively. Here we report our integrated regulatory network study using genome-wide spatiotemporal transcriptome RNA-Seq data of B73 maize seed development. Gene expression intensities at all stages were normalized and discretized into bins defined by the B-Spline functions. Then we calculated the entropy of each gene according to its respective distribution probabilities within each bin, which is also known as the marginal entropy. For each pair of genes, their joint distribution under the previously defined bins was taken into account to measure the joint entropy.

The mutual information between any two genes was defined as the sum of their marginal entropies minus their joint entropy, which indicates the mutual dependence between the genes. Transcription factors are major regulators of gene expression and thus share more information with their target genes than with unrelated genes, and more than non-regulatory genes share with one another. Greater mutual information generally suggests a higher probability of dependence. However, indirect relationships can also contribute to mutual information. To avoid false positives caused by these indirect relationships in the inferred transcriptional network, we used the relative importance of mutual information, indicated by z-scores among all potential regulators and targets, premised on the sparse nature of biological networks. We compared the inferred gene regulatory network to known well-studied genes and found potential transcription factors and genes. We further conducted motif analysis within the same target gene groups. Since functional domains are often involved in transcriptional events, we searched the Pfam database for hits in our enriched set of genes. Network motifs were discovered from the number of edges connected to each node, as well as topological patterns such as hierarchical structure and network hubs. There are 91 transcription factors and 1,167 genes present exclusively in seed development among a total of 26,105 investigated genes. This work provides an in-depth dynamic view of the complex regulatory network in maize kernel development.
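
The mutual information definition above, I(X;Y) = H(X) + H(Y) - H(X,Y), and the z-score step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: equal-width hard binning replaces the B-spline binning described in the abstract, the z-scores are computed over a flat list of MI values, and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (hard binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (bits) of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y, n_bins=3):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) on binned expression profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))


def mi_zscores(mi_values):
    """Standardize MI values so unusually strong pairs stand out."""
    mu, sd = mean(mi_values), pstdev(mi_values)
    return [(m - mu) / sd if sd > 0 else 0.0 for m in mi_values]
```

In a network setting, `mi_zscores` would be applied to the MI values of one regulator against all candidate targets (or one target against all regulators), and only pairs with high z-scores would be kept as edges.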

...............................................................................................................................

Poster: P78
Copy Number Variation Analysis with GROM-RD

Sean Smith, Rutgers University, United States
Joseph Kawash, Rutgers University, United States
Andrey Grigoriev, Rutgers University, United States

Copy number variants (CNVs), amplifications or deletions of genome segments, are important contributors to phenotypic variation. The advent of next-generation sequencing (NGS) has prompted read depth analysis as an essential tool for the detection of CNVs. However, the predictive capabilities of existing algorithms using genome read coverage are frequently hindered by various biases in NGS platforms. Additionally, imprecise breakpoint identification somewhat limits the utility of read depth tools. We describe GROM-RD, an algorithm that analyzes multiple biases in read coverage to detect CNVs in NGS data. After using existing GC bias correction methods we found lingering non-uniform variance across distinct GC regions and developed a novel approach to normalize such variance. By adjusting for repeat bias and using a two-pipeline masking approach GROM-RD is able to detect CNVs in complex and repetitive segments that otherwise complicate CNV detection, as well as improve sensitivity in less complicated regions. GROM-RD employs a CNV search using size-varying overlapping windows to improve breakpoint resolution, a typical weakness of RD methods. Compared to two widely used programs based on read depth methods, CNVnator and RDXplorer, GROM-RD showed improvements in CNV detection and breakpoint accuracy.
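
The two normalization ideas mentioned above, correcting the mean read depth for GC content and then equalizing the variance across GC strata, can be illustrated generically. This sketch is not the GROM-RD algorithm, whose details the abstract does not give: per-bin median scaling and per-bin z-scoring are stand-in techniques chosen for the example, and the window depths, GC bin labels, and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def group_by_bin(depths, gc_bins):
    """Collect window read depths by their GC-content bin label."""
    by_bin = defaultdict(list)
    for depth, gc in zip(depths, gc_bins):
        by_bin[gc].append(depth)
    return by_bin


def gc_normalize(depths, gc_bins):
    """Scale each window so every GC bin shares the global median depth."""
    by_bin = group_by_bin(depths, gc_bins)
    global_median = median(depths)
    bin_median = {gc: median(v) for gc, v in by_bin.items()}
    return [
        d * global_median / bin_median[gc] if bin_median[gc] > 0 else d
        for d, gc in zip(depths, gc_bins)
    ]


def per_bin_zscores(depths, gc_bins):
    """Standardize depth within each GC bin so bins with different
    variances become comparable."""
    by_bin = group_by_bin(depths, gc_bins)
    stats = {gc: (mean(v), pstdev(v)) for gc, v in by_bin.items()}
    scores = []
    for d, gc in zip(depths, gc_bins):
        mu, sd = stats[gc]
        scores.append((d - mu) / sd if sd > 0 else 0.0)
    return scores
```

Windows whose standardized depth is far from zero would be candidate amplifications or deletions; the choice of threshold and window size is outside the scope of this sketch.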

...............................................................................................................................

Poster: P79
Probabilistic modeling of multiple ChIP-Seq and open chromatin datasets enables discrimination of direct and indirect binding events

Artur Jaroszewicz, UCLA, United States
Jason Ernst, UCLA, United States

Chromatin ImmunoPrecipitation followed by high-throughput sequencing (ChIP-Seq) is an important assay in the study of gene regulation and epigenetics. Unfortunately, this assay does not discriminate signal associated with a target being in direct contact with the DNA from indirect signal, such as that arising through protein-protein interactions or 3-dimensional looping interactions. We present a method, called ChIPs n DIP (ChIP-seq Signal Prediction of Direct and Indirect Peaks), that attempts to discriminate between such cases. This method takes as input ChIP-seq tracks for multiple targets and open chromatin data such as DNaseI hypersensitivity data and probabilistically models their joint signal to infer for each ChIP-seq track a relative probability of direct binding at each position. Specifically, we model the ChIP-seq signal at each position and each track as belonging to one of three binding types: direct, indirect, or none. The inferred probability of these events in the model depends not only on the signal of the target track, but also on the chromatin and transcription factor binding context in which it occurs. We use the presence of DNA motifs in direct binding predictions to evaluate the method.

RSG POSTER ABSTRACTS - 42 through 60

...............................................................................................................................

Poster: P42
microRNA-mediated feed forward disinhibition of multiple functional pathways amplifies prohypertensive signaling

Danielle Decicco, Thomas Jefferson University, United States
James Schwaber, Thomas Jefferson University, United States
Rajanikanth Vadigepalli, Thomas Jefferson University, United States

MicroRNAs have emerged as novel post-transcriptional regulators of many cellular disease processes. However, in essential hypertension, there has been no characterization of the microRNA expression landscape in key neuroanatomical blood pressure control regions during hypertension development. Using a global analysis of microRNA expression levels in these regions, we quantified 419 well-annotated microRNAs in the brainstem, and we identified 24 microRNAs showing stage-dependent differential expression in hypertensive rats compared to controls. We constructed microRNA regulatory networks based on predicted targets from bioinformatic databases including RNA22 and miRWALK. Our microRNA regulatory networks indicated that predicted targets primarily fell into functional pathways previously associated with hypertension such as inflammation and Angiotensin II signaling. We measured the putative targets using high-throughput qPCR to evaluate correlations between microRNAs and their predicted gene targets. Our analysis revealed a similar extent of positive and negative correlations between the microRNA and predicted target transcript patterns, suggesting regulatory relationships. We discovered a pair of microRNAs, previously shown to be enriched in different cell types (miR-135a in astrocytes and miR-376a in neurons), which demonstrated stronger anti-correlational relationships with their putative targets in the hypertensive state compared to controls. These microRNAs demonstrate expression levels which are negatively correlated with key target expression levels in the inflammation and Angiotensin II pathways. Interestingly, the key putative targets are known inhibitors of these functional pathways that show increased activity in hypertension. Such feed-forward disinhibition by microRNA-135a and microRNA-376a of the inflammatory and Angiotensin II pathways occurred at the onset of hypertension, suggesting a mechanistic role for this regulatory network.
Given that both pathways are hyperactive in the chronic hypertensive stage, microRNA regulatory network-mediated disinhibition of those pathways at the onset stage is likely to have a causal effect of amplifying those pathways, contributing to the development of hypertension. This feed-forward disinhibition by miR-135a and miR-376a suggests synergistic network activity contributing to the development of hypertension.

...............................................................................................................................

Poster: P43
Prioritizing animal models for human diseases using genome-wide functional networks

Max Homilius, Princeton University, United States
Arjun Krishnan, Princeton University, United States
Calum MacRae, Brigham and Women’s Hospital, Harvard Medical School, United States
Olga Troyanskaya, Princeton University, United States

Model organisms are key to studying the molecular basis of human traits and diseases. Therefore, for a rare or common disease defined by a group of implicated genes, it is valuable to identify relevant model organism phenotypes to transfer knowledge and propel further investigation. Yet, we lack tools to seamlessly search across organisms to identify the model phenotype equivalent to a human disease (or the human disease corresponding to a model phenotype of interest). The most straightforward approach – mapping disease to phenotype based on overlapping homologous genes – is severely limiting because 1) our knowledge of associated genes for most diseases and phenotypes is largely incomplete, thus leaving many actual disease-phenotype pairs with few or no ‘common’ genes; and 2) treating diseases and phenotypes as bags of genes ignores the underlying complex organism-specific biology. Here we present a framework for systematically matching diseases and phenotypes that overcomes both of these limitations. By jointly using genome-scale functional gene interaction networks in both human and the model organism, we create and match genome-wide representations of human diseases and model phenotypes, and further filter nonspecific matches to arrive at highly resolved disease-phenotype mappings. Further, for each disease-phenotype pair, in addition to known genes, we report the novel homologous genes most associated with the disease/phenotype, which are prime candidates for experimental follow-up. We have made our approach available through a dynamic web interface that allows researchers to easily use their own gene set (or a previously known disease/phenotype) to query a large collection of resources containing disease-gene and phenotype-gene associations in human and five model organisms (mouse, zebrafish, fly, worm and yeast). 
Users can readily see prioritized diseases/phenotypes, list candidate genes, explore them in the context of the underlying networks, and export all results.

...............................................................................................................................

Poster: P44
Discovery of bruchid resistance-related variations in regulatory regions by genome-wide sequence comparison

Dung-Chi Wu, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Mao-Sen Liu, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Tony Chien-Yen Kuo, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Kuan-Yi Li, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Roland Schafleitner, AVRDC-the World Vegetable Center, Taiwan
Hsiao-Feng Lo, Department of Horticulture and Landscape Architecture, National Taiwan University, Taiwan
Long-Fang O. Chen, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Chien-Yu Chen, Dept. of BIME, National Taiwan University, Taiwan
Chia-Yun Ko, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Huei Mei Chen, AVRDC-the World Vegetable Center, Taiwan

Mungbean, Vigna radiata [L.] R. Wilczek, is one of the most important legume crops, with high nutritional and medicinal value. The bruchid beetle (Callosobruchus maculatus), a bean weevil, attacks mungbeans both in the field and in storage, resulting in great losses of stored grain. A wild mungbean from Madagascar, V. radiata var. sublobata (TC1966), is resistant to many bean weevils and can be crossed with V. radiata. Although little is known about the basis of weevil resistance, breeding for bruchid resistance remains a major goal in mungbean studies. In this study, we first de novo assembled the genome of a bruchid-resistant recombinant inbred line 59 (RIL59), which was derived from TC1966 and a bruchid-susceptible variety, NM92. These initial genomic data were combined with additional genome and transcriptome analyses of mungbean lines with different levels of bruchid resistance, including the two parents, TC1966 and NM92, and the other 12-inbred-generation progenies, to investigate where the major loci that distinguish bruchid-resistant from bruchid-susceptible lines might lie. The ab initio predicted gene models of RIL59 consist of 44,317 genes, representing 49,952 transcripts. The genome-wide variation analysis performed on NM92, TC1966 and RIL59 revealed that 3,162 genes have sequence variants in exons, including non-synonymous substitutions and INDELs, that cause protein sequence changes. These genes were suspected to be related to bruchid resistance. In addition, a draft genome of a bruchid-susceptible mungbean (Vigna radiata var. radiata VC1973A) was previously published. We mapped the above-mentioned putative bruchid-resistance-related genes to this bruchid-susceptible draft genome and found a hot region on Vr05. This result was consistent with the genotyping-by-sequencing (GBS) data, which also suggested that the region from 5 Mbp to 12 Mbp of Vr05 is strongly related to bruchid resistance. 
These two draft genomes were aligned to identify 127 scaffolds of RIL59 that together correspond to the Vr05 of VC1973A. Among them, about 50 scaffolds were considered associated with this region. In total, 508 genes were identified in these scaffolds. Taking the 2,000 bp upstream of each gene model as the promoter, there were 544 promoters falling in the suspected resistance-related region. Comparison of the promoters between the bruchid-resistant and bruchid-susceptible mungbean lines revealed some large structural variations, suggesting that the gain or loss of regulatory elements might play key roles in bruchid resistance. In summary, the comparison of promoters between the two draft genomes reveals the potential impact of regulatory regions on the resistance phenotypes of mungbeans.
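
The promoter definition used in this abstract (the 2,000 bp upstream of each gene model) can be illustrated with a minimal sketch. The function name `promoter_interval`, the 1-based inclusive coordinate convention, and the optional `seq_len` clipping are assumptions made for illustration; the abstract does not describe how the authors computed promoter coordinates.

```python
def promoter_interval(start, end, strand, upstream=2000, seq_len=None):
    """Return the (start, end) of the region upstream of a gene model.

    Coordinates are assumed to be 1-based and inclusive. The interval is
    clipped to the scaffold boundaries; None is returned when the gene sits
    at the scaffold edge and no upstream sequence exists.
    """
    if strand == "+":
        # Upstream of a forward-strand gene lies before its start coordinate.
        p_start = max(1, start - upstream)
        p_end = start - 1
    elif strand == "-":
        # Upstream of a reverse-strand gene lies after its end coordinate.
        p_start = end + 1
        p_end = end + upstream
        if seq_len is not None:
            p_end = min(seq_len, p_end)
    else:
        raise ValueError("strand must be '+' or '-'")
    if p_start > p_end:
        return None
    return (p_start, p_end)


if __name__ == "__main__":
    # Hypothetical gene model coordinates, not taken from the RIL59 assembly.
    print(promoter_interval(5001, 7000, "+"))
    print(promoter_interval(5001, 7000, "-", seq_len=8000))
```

Clipping at scaffold boundaries matters for a fragmented draft assembly, where genes near scaffold ends have shorter or missing upstream regions.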

...............................................................................................................................

Poster: P45
Empirical Evidence Supporting a Systematic Approach to Gene Network Identification

Sweta Sharma, Rutgers University, United States
Desmond Lun, Rutgers University, United States

A major cellular systems biology challenge of the past decade has been the development of a comprehensive model for gene regulatory networks (GRNs). Particularly, there is growing impetus for the extraction of regulatory information from expression data as it becomes increasingly available and accurate. Identifying networks from such information requires deciphering direct interactions from indirect ones. For instance, if gene A regulates gene B and B regulates gene C, then changing A's expression will directly affect B's expression while indirectly affecting C's.
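
The A, B, C example above can be made concrete with a minimal sketch. It assumes a noiseless acyclic network and a hypothetical input, `affected`, that records which genes change expression when each gene is perturbed. It is a plain transitive-reduction illustration of separating direct from indirect effects, not the identification algorithm discussed in this abstract.

```python
def direct_edges(affected):
    """Keep only the edges that cannot be explained through an intermediate gene.

    affected[g] is the set of genes whose expression changes when g is
    perturbed (g itself excluded). An effect g -> t is treated as indirect
    when another gene affected by g also affects t.
    """
    direct = set()
    for src, targets in affected.items():
        for tgt in targets:
            explained = any(
                tgt in affected.get(mid, set())
                for mid in targets
                if mid != tgt
            )
            if not explained:
                direct.add((src, tgt))
    return direct


if __name__ == "__main__":
    # A regulates B and B regulates C, so perturbing A changes both B and C.
    affected = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
    print(sorted(direct_edges(affected)))  # [('A', 'B'), ('B', 'C')]
```

This sketch returns the smallest edge set consistent with the perturbation data. A direct A-to-C interaction running in parallel with the A-to-B-to-C path would be dropped, which is one reason real identification methods need more than reachability information.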

Recently, Birget et al. proposed a systematic approach for network identification. They consider a binary model that captures the non-linear dependencies of GRNs and reverse-engineer the network using assignments (perturbations to the expression level of a single gene) and whole-transcriptome steady-state expression measurements. Under this model, their approach achieves identification of acyclic networks with worst-case complexity costs in terms of assignments and measurements that scale quadratically with the size of the network. For networks with cycles, the worst-case complexity cost scales cubically.

We conducted a proof-of-concept experiment for this approach by reverse-engineering a five-gene sub-network of the outer-membrane protein regulator (ompR) in E. coli. Through assignments achieved by gene deletions and expression measurements from qPCR, we successfully identified the regulatory relationships and discerned direct from indirect interactions. We also performed computational experiments on in silico networks derived from known regulatory relationships in E. coli and S. cerevisiae, where gene regulation is thermodynamically modeled using the system of ODEs that was used to generate data for previous DREAM challenges. We achieved 100% identification for noiseless acyclic networks ranging in size from 100 to 4,000 genes. For noisy acyclic E. coli networks of size 100, we obtained an AUPR of 0.95. This is a substantial improvement over the 0.71 AUPR obtained by the top performer in the DREAM3 inference challenge for acyclic in silico networks. Furthermore, we achieved this using ten-fold fewer assignments and measurements. For noisy cyclic E. coli networks of size 100, we obtained an AUPR of 0.75, compared to 0.45 for the top performer in the DREAM4 InSilico_Size100 sub-challenge containing cyclic networks. We achieved this using roughly the same number of assignments and half as many measurements.

Taken together, our results imply that the reverse engineering method of Birget et al. is not only experimentally feasible but also uses reasonable resources. It can therefore serve as the basis for systematic, accurate reverse engineering of large-scale gene regulatory networks.

...............................................................................................................................

Poster: P46
Dysregulation of co-regulatory microRNA networks by chronic ethanol consumption leads to impaired liver regeneration

Austin Parrish, Thomas Jefferson University, United States
Egle Juskeviciute, Thomas Jefferson University, United States
Jan Hoek, Thomas Jefferson University, United States
Rajanikanth Vadigepalli, Thomas Jefferson University, United States

microRNAs are a class of small, non-coding RNAs ~21 nucleotides long that regulate numerous cellular processes in a post-transcriptional manner. Previous research has identified several microRNAs of interest involved in liver regeneration and hepatocellular carcinoma, including miR-21, which has been shown by our lab to be significantly upregulated following liver damage by 70% partial hepatectomy, along with chronic ethanol consumption. Given that microRNAs often exert their effects in regulatory networks that display both positive and negative cooperation, we sought to identify additional microRNAs involved in liver regeneration alongside miR-21. To accomplish this, we performed in vivo knockdown of miR-21 using a locked nucleic acid (LNA) probe containing a sequence complementary to miR-21. Whole liver tissue samples were collected from both control- and ethanol-fed Sprague-Dawley rats at baseline conditions and 24 hours post-partial hepatectomy. These samples were analyzed for microRNA expression using the NanoString microRNA microarray platform. Analysis of the expression data reveals twelve microRNAs that show differential expression in response to miR-21 knockdown. Of these microRNAs, three show positive correlations with miR-21 expression while eight are negatively correlated. Using target prediction software, we developed a network of putative microRNA-gene interactions and compared the predicted targets to genes identified as differentially expressed based on Affymetrix microarray analysis. This network of putative targets identifies a number of genes that are potentially regulated by these differentially expressed microRNAs. Gene ontology and pathway analysis reveals that multiple predicted targets are involved in processes relating to cell cycle progression. In conclusion, these studies identified a set of co-regulatory microRNAs whose dysregulation by chronic ethanol consumption may lead to impaired liver regeneration.

...............................................................................................................................

Poster: P47
Furthering understanding of Parkinson's Disease through integrative analysis in C. elegans

Victoria Yao, Princeton University, United States
Rachel Kaletsky, Princeton University, United States
Coleen Murphy, Princeton University, United States
Olga Troyanskaya, Princeton University, United States

The etiology of complex human diseases, especially in the context of aging, such as Parkinson's disease, is likely a combination of many environmental and genetic factors. Elucidating the molecular basis of the pathophysiology of such diseases requires a combination of systems-level studies in human and model systems. The nematode C. elegans is an effective and efficient model for human disease due to its sufficient complexity and high genetic conservation with humans, combined with its short lifespan and the abundance of genetic tools and assays. In particular, the complexity of C. elegans at the tissue level allows for in-depth investigations of relevant diseases in a tissue-specific manner. To this end, we developed a novel semi-supervised regularized Bayesian integration method to integrate a large compendium of heterogeneous datasets for the construction of 203 tissue- and cell-type-specific networks in C. elegans. We demonstrate the accuracy of these networks in detecting tissue-specific functional signal, even for very small and specific tissues and cell types. We then use the dopaminergic neuron worm network combined with Parkinson's disease genes identified in quantitative genetic studies in human to predict new genes implicated in Parkinson's disease. A subset of these predictions have been experimentally confirmed to have Parkinson's disease endophenotypes in C. elegans and are conserved in human, providing potential therapeutic targets.

...............................................................................................................................

Poster: P48
Transcription Network Inference using RNA Expression and Degradation Rate Data in S. cerevisiae

Konstantine Tchourine, NYU - Center for Genomics and Systems Biology, United States
Christian L. Mueller, Simons Center for Data Analysis, United States
Christine Vogel, NYU - Center for Genomics and Systems Biology, United States
Richard Bonneau, NYU - Center for Genomics and Systems Biology, United States

Despite many years of research and the availability of large-scale datasets, modeling RNA transcription and predicting transcriptional regulatory interactions on a systems level in eukaryotes remains a challenging problem and requires modeling changes in RNA abundance due to both the regulation of synthesis and degradation. Even Saccharomyces cerevisiae has several hundred putative TFs and ~6,000 potential targets, rendering the theoretical regulatory interaction space enormous. Further, eukaryotes are marked by extensive promoter regions, many response pathways, and additional regulatory layers, e.g. RNA decay, which further confound gene expression regulation. For these reasons, even the best network inference algorithms have so far performed very poorly in yeast. To address this challenge, we are taking several steps towards constructing the first high-quality, high-coverage yeast regulatory network. I am developing an expanded version of an existing gene regulatory inference framework, Inferelator-BBSR, that incorporates RNA decay rates to predict new regulatory interactions, estimate each interaction’s contribution to the dynamics of the transcription process, and estimate gene-dependent RNA decay rates. Incorporation of RNA decay rates can be done either computationally by finding optimal decay rates for different modes of regulation in yeast, or empirically by directly incorporating RNA decay rate data into the inference procedure. In this presentation, I will show that both ways of incorporating RNA decay rates into the inference framework improve regulatory network inference. Furthermore, I will show that the inferred regulatory network can help identify different modes of stress adaptation which require different average RNA decay rates.

...............................................................................................................................

Poster: P49
The optimized high-throughput siRNA screening: Applications in cancer target discovery

Nayoung Kim, Sookmyung Women's University, Korea, The Republic of
Sukjoon Yoon, Sookmyung Women's University, Korea, The Republic of

RNA interference (RNAi) has become a powerful tool for drug target discovery, and systematic loss-of-function screens using RNAi libraries can now be performed to identify the biological functions of specific genes or pathways in various diseases. Cancer target discovery studies on clinically relevant drug applications and their modes of action can be accelerated by integrating multi-level omics data such as genome, transcriptome, proteome and phosphatome data together with siRNA screening data.

We introduce an siRNA screening platform comprising image-based assay optimization, primary screening, data analysis and hit selection criteria, and illustrate it with two studies that investigate novel therapeutic targets in cancer. The first study uses a specific gene-knockdown cell line. To identify novel therapeutic targets in STK11-deficient lung cancer cells, we used large-scale siRNA screening to identify genes that would sensitize STK11-deficient lung cancer cells (A549) with or without AMPK. The second study is a genome-wide siRNA screen using a sphere-forming (3D) culture system that resembles in vivo conditions. 3D growth of cancer cells in vitro is more reflective of in situ cancer cell growth than growth in monolayer (2D). This study is designed to identify genes whose knockdown reduces sphere size in 3D as compared to 2D.

In the study using a stable knockdown cell line, the perturbation of several genes exhibited a significant inhibitory effect on the growth of AMPK-knockdown cells. Among the various functional categories, the specific hits that inhibited cell growth under AMPK knockdown were related to metabolism and signal transduction. These results highlight the potential of synthetic lethal siRNA screens with AMPK inhibitors to define new determinants of potential therapeutic targets. In the second screen, using the 3D culture system, we found specific genes whose knockdown reduced sphere formation. These hits were related to lipid metabolism. These results can guide the search for new target-related drugs that inhibit tumor progression and metastasis.

This screening platform can serve as a valuable tool to find novel therapeutic targets and drugs for cancer therapy. We now provide this platform as a service to academic and industrial organizations.

...............................................................................................................................

Poster: P50
MACE: a web-based application for analyzing mutation-specific drug response and gene expression in cancers

Yourae Hong, Sookmyung Women's University, Korea, The Republic of
Sukjoon Yoon, Sookmyung Women's University, Korea, The Republic of

Systematic understanding of mutation-oriented drug sensitivity in cancer cell lines will provide therapeutic benefits for cancer therapy. Here, we present the MACE database as a web-based interactive tool for interpreting drug response and gene expression in the genotypic classification of cancer cell lines. Chemical screening and DNA microarray data on NCI60 cell lines were organized to identify mutation- or lineage-specific chemicals and gene expression signatures. In this system, users can perform individual and combined analyses to find potential associations of chemicals and genes with major gene mutations of cancers. The present MACE database can be used to understand how gene mutation is interconnected with drug response and gene expression in cancer subtypes. This database provides a valuable tool to predict and optimize the therapeutic window for anticancer agents and related gene targets. The MACE web database is available at http://mace.sookmyung.ac.kr/.

...............................................................................................................................

Poster: P51
Reverse engineering gene regulatory networks from structural and epigenetic datasets

Brittany Baur, Marquette University, United States
Serdar Bozdag, Marquette University, United States

One of the major challenges in computational biology is the identification of “driver” copy number changes that promote cancer cell progression. The goal of this study is to identify genes within regions with aberrant copy number and DNA methylation changes that have widespread downstream effects, and their associated targets. We first identified these aberrant regions by integrating DNA methylation or copy number datasets with gene expression datasets in luminal A breast cancer patients. We then identified candidate genes within the aberrant regions that could act as regulators of downstream targets by integrating the expression levels of the regulators and potential targets with pathway analysis. Based on gene ontology, we established that genes associated with aberrant copy number and DNA methylation changes are enriched in terms associated with the regulation of various biological processes. This indicates that these genes are potentially regulators of other genes. We identified several candidate genes within these regions that are likely regulators strongly affected by copy number or DNA methylation aberrations. By identifying causal genes within the aberrant regions, this study could aid in the discovery of therapeutic targets of cancer drugs.

...............................................................................................................................

Poster: P52
Alternative Splicing During Heat Stress in Arabidopsis thaliana

Gaurav Kandoi, Iowa State University, United States
Julie A Dickerson, Iowa State University, United States

Alternative splicing (AS), which produces multiple messenger RNAs by different combinations of various regions of the precursor transcript, is a major cause of diversity in gene products. Recent estimates suggest the rate of AS to be as high as 95% in humans and 60% in plants. Despite the prevalence of AS events, their functional consequences are largely unknown. Although the impact of abiotic stresses (temperature, salt, light, etc.) on AS events in Arabidopsis thaliana has been widely studied, not much is known about how differential splicing affects the metabolic pathways under such stress conditions. High-throughput RNA sequencing (RNA-seq) data from a heat stress experiment in A. thaliana were used to find regions that undergo differential splicing. Even though heat stress leads to an increase in the number of AS events, only ~90 of such alternatively spliced genes are also differentially spliced (DS) between the two conditions. Most of these are nuclear genes and have been annotated with biological processes such as response to stress, response to abiotic or biotic stimulus, and cell organization and biogenesis. A significant portion of these differentially spliced genes are also linked with molecular functions such as binding (DNA or RNA, nucleotide, protein and nucleic acid) and enzymatic activity (transferase, hydrolase and kinase). For the most part, the novel spliced isoforms are predicted to be more abundant than the normal transcript in the heat stress condition. Conserved domain analyses indicate that novel spliced isoforms share similar domain architecture with the normal transcripts more often than not. By studying the effect of such alternative splicing events on protein function, we can identify important metabolic networks. Combining these differential networks across the spectrum of stress conditions generates metabolic models with a high-level regulatory framework.

...............................................................................................................................

Poster: P53
An integrated computational pipeline for analysing genetic, molecular, and functional variations in complex diseases

Bajuna Salehe, University of Reading, United Kingdom
Chris Jones, University of Reading, United Kingdom
Giuseppe Di Fatta, University of Reading, United Kingdom
Liam McGuffin, University of Reading, United Kingdom

The ongoing advancement of the technologies used for generating 'omic data has led to a flood of biological data. This ‘big data’ phenomenon has increased the challenge for biologists and biomedical experts of finding better analytical strategies that are capable of integrating variation data from different 'omic states, using integrated computational approaches to further the understanding of phenotypes (complex diseases/polygenic traits). Here, we propose an 'omic variation framework focusing initially on single nucleotide polymorphisms (SNPs), one of the key 'omic variation types studied in order to understand the relationships underpinning complex traits. However, this framework should also be adaptable to other 'omic variation data, such as methylomic, transcriptomic and copy number variation (CNV) data. Furthermore, we have designed a pipeline for an integrated computational approach to implement this framework, which we have applied to study platelet proteomic data sets. In this case study, the aim is to understand the association of SNPs at different levels with the adenosine diphosphate (ADP)-activated platelet response. Platelets play key roles in thrombus formation, which is one of the major risks for cardiovascular diseases (CVDs), and the ADP-activated platelet response is highly involved in thrombus formation, as well as being variable among individuals. Using the initial implementation of this pipeline, we have been able to identify key genetic variants (SNPs), such as rs6141803 and rs7007145 in the PTK2B and COMMD7 genes, respectively, that are significantly associated with platelet aggregation. Many of our identified SNPs were previously unidentified, and have been independently reported to be associated with the risk of CVDs.

...............................................................................................................................

Poster: P54
Exploration of Breast Cancer Genes and Bioinformatics Analyses

Shahrzad Eslamian, Grand Valley State University, United States
Jonathan P. Leidig, Grand Valley State University, United States

Information visualization may be applied to bioinformatics research tools to assist in understanding complex (often textual) datasets. The main goal of this work was to design an interactive visualization tool to detail the genes potentially responsible for breast cancer as they are discovered through bioinformatics analysis. The dataset is derived from publicly shared research maintained by the bioinformatics research community. The visualization aims to detail the explicit relationships and existing analyses of these target genes and their related microRNAs, considering the distributed nature of this field of research and the disaggregation of the underlying datasets.

...............................................................................................................................

Poster: P55
Spectral coherence classification of uORF translation in a neuroblastoma cell model of differentiation

Sang Chun, University of Michigan, United States
Caitlin Rodriguez, University of Michigan, United States
Peter Todd, University of Michigan, United States
Ryan Mills, University of Michigan, United States

Upstream open reading frames (uORFs) are prevalent in the human transcriptome and may negatively regulate the abundance of canonically encoded proteins through the promotion of mRNA decay and competitive expression, among other mechanisms. uORFs are conserved across species and have been annotated to genes with diverse biological functions, including but not limited to oncogenes, cell cycle control and differentiation, and stress response. As such, the aberrant expression of certain uORFs has been implicated in the development and progression of various diseases. Therefore, the positive identification and validation of uORFs as translational products is critical for understanding their role in complex biological processes and disease etiology. Whereas mRNA-Seq has been used to approximate the transcriptomic content of a cell, or group of cells, the recently developed method of sequencing ribosome-protected fragments aims to profile the translational landscape of a sample. In concert, various algorithms have been developed to differentiate coding transcripts from non-coding transcripts based on the alignment of ribosome-protected fragments to a reference transcriptome. We have developed a classification algorithm based on the magnitude of coherence between the aligned ribosome profiling reads and the tri-nucleotide periodic signal inherent to protein-coding sequences. In this study, we compare our spectral coherence-based classification algorithm (SPECtre) against existing methods and apply our approach to positively identify variably translated uORFs related to differentiation of SH-SY5Y neuroblastoma cells.

...............................................................................................................................

Poster: P56
Transcriptional Regulatory Networks During the Endothelial-to-Hematopoietic Transition in the Mouse Embryo

Long Gao, The University of Iowa, United States
Joanna Tober, University of Pennsylvania, United States
Peng Gao, The University of Iowa, United States
Jianshu Zhang, The University of Iowa, United States
Changya Chen, The University of Iowa, United States
Nancy A. Speck, University of Pennsylvania, United States
Kai Tan, The University of Iowa, United States

Hematopoietic stem cells (HSCs) in the embryo are derived from hemogenic endothelia (HE) of the arterial wall from the aorta/gonad/mesonephros (AGM) region and yolk sac (YS). HE from AGM and YS have different developmental potentials. HE from YS primarily produces committed erythroid/myeloid progenitors, whereas HE from AGM can produce lymphoid progenitors and HSCs. The transcriptional regulatory networks (TRNs) that control the endothelial-to-hemogenic transition in AGM and YS are poorly understood. Here we compared the transcriptomes of endothelium and hemogenic endothelium from embryonic (E) day 9.5 and E10.5 AGM and YS by RNA-Seq. We developed a novel computational method for constructing condition-specific TRNs by sample elimination and network comparison with a limited number of samples. By modeling developmental-stage-specific TRNs, we identified 73 gene modules (1429 genes) with differential activities between E and HE and between AGM HE and YS HE. We further identified a number of transcription factors that regulate the endothelial-to-hemogenic transitions, including Runx1, Sox7, Hoxa7, and Hoxd9. Long intergenic noncoding RNAs (lincRNAs) have been shown to regulate the development of various lineages. However, nothing is known about the role of lincRNAs during embryonic hematopoiesis. We identified 18 and 41 novel lincRNAs that are specifically expressed in E and HE, respectively. Among them, 10 lincRNAs were differentially expressed between E and HE, suggesting a role in regulating the development of hemogenic endothelium. In summary, our systematic analysis of the transcriptomes during the endothelial-to-hemogenic transition has uncovered a number of novel regulators and gene pathways of this critical developmental transition.

...............................................................................................................................

Poster: P57
Dynamic organization and activation of enhancers and super-enhancers dictate effector and memory CD8+ T cell responses

Bing He, University of Iowa, United States
Haihui Xue, University of Iowa, United States
Kai Tan, University of Iowa, United States

CD8 T cells are critical in controlling infection by intracellular pathogens, including viruses and intracellular bacteria. Differentiation of naïve CD8 T cells (TN) to effector (TE) and memory (TCM) CD8 T cells is accompanied by dynamic gene expression and epigenetic modification changes at promoters, as revealed by previous analyses. However, there is virtually no information to date regarding the dynamics of enhancers during CD8 T cell responses. Here, we have mapped four histone modification marks in TN, TE, and TCM cells after viral infection. Our results suggest that the chromatin environment at regulatory DNA sequences in TCM is more permissive than in TN and TE. We further predicted the enhancers, super-enhancers, and their targets, and constructed condition-specific transcriptional regulatory networks (TRNs) in the three T cell stages. We have identified a highly dynamic repertoire of enhancers and their targets during CD8 T cell responses, as 77% of the enhancers and 82% of the enhancer-promoter interactions are stage-specific. Our results suggest that the dynamic change of enhancer activity during cell stage transition leads to TRN rewiring, which explains the expression change of the key factors of T cell function.

...............................................................................................................................

Poster: P58
Data and Computing Platform to facilitate NCER-PD (National Centre of Excellence in Research on Parkinson’s Disease) Project

Venkata Satagopam, LCSB, University of Luxembourg, Luxembourg
Peter Banda, LCSB, University of Luxembourg, Luxembourg
Jan Martens, LCSB, University of Luxembourg, Luxembourg
Joachim Kutzera, LCSB, University of Luxembourg, Luxembourg
Kirsten Roomp, LCSB, University of Luxembourg, Luxembourg
Wei Gu, LCSB, University of Luxembourg, Luxembourg
Patrick May, LCSB, University of Luxembourg, Luxembourg
Reinhard Schneider, LCSB, University of Luxembourg, Luxembourg

The Data and Computing Platform provides key infrastructure for the integration, curation and interrogation of anonymized clinical and experimental data. The platform manages multidimensional data associated with clinical research, including patient data, sample-associated information, and high-throughput molecular readouts from these samples. These different data flows are integrated at their source with the help of advanced data capture and transfer approaches. Clinical data can be entered remotely, via electronic forms at the time of collection, ensuring their integrity and standardization. To attain this goal, REDCap[1], a state-of-the-art clinical research data management system, has been implemented. All entered data will be immediately anonymized, and sample-associated data will be accessed directly at their storage location, the IBBL, via secure communication with the LIMS of the biobank. High-throughput experimental data will be uploaded directly to the database service provided by LCSB for handling large, heterogeneous biomedical datasets: the tranSMART system[2]. tranSMART enables sharing, integration, standardization and analysis of heterogeneous data from collaborative translational research. It is used in the pharmaceutical industry and in Innovative Medicines Initiative projects (e.g. eTRIKS[3], AETIONOMY[4]) to store and share curated phenotypic data, such as clinical observations and adverse events, and omics data, such as transcriptomics, proteomics, metabolomics and genotyping. Well-grounded machine learning and computational modeling approaches will enable data analysis and interpretation.

1. REDCap (2015) www.project-redcap.org
2. tranSMART (2015) http://transmartfoundation.org
3. eTRIKS (2015) www.etriks.org
4. AETIONOMY (2015) www.aetionomy.eu

...............................................................................................................................

Poster: P59
Emergent Topological and Statistical Properties of Gene Regulatory Grids

Wilberforce Ouma, Ohio State University, United States
Mohammadmahdi Yousefi, Ohio State University, United States
Andrea Doseff, Ohio State University, United States
Erich Grotewold, Ohio State University, United States

Gene regulatory grids (GRGs) are static representations of gene regulatory networks (GRNs) encompassing all possible regulator-target gene interactions that provide a system-wide view of transcriptional gene regulation. To understand their architectural organization, we constructed and investigated emergent topological and statistical properties of GRGs of the following model organisms: Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae and Arabidopsis thaliana. We implemented a formal statistical approach for fitting a power-law function to the empirical degree distribution of the grid and observed that the out-degree, and not the in-degree, follows a power-law distribution, suggesting a scale-free property of GRGs that have transcription factors as hubs. The four GRGs however exhibit different power-law exponents. A computational sub-sampling of sub-grids from the original grids showed that for D. melanogaster, the exponent was invariant for a large number of sub-grids. With this invariant property of the exponent, we derived a mathematical formulation that estimates the number of interactions in a fully-connected grid. We hypothesize that a consequence of the scale-free property in cellular networks is reduction of the average path-length in the grid, resulting in faster signal propagation.

...............................................................................................................................

Poster: P60
Integrated Feature Detection From Chromatin State Measurements

Anastasia Shcherban, Institute of Biosciences and Medical Technology (BioMediTech), University of Tampere, Finland
Matti Nykter, Institute of Biosciences and Medical Technology (BioMediTech), University of Tampere, Finland
Juha Kesseli, Institute of Biosciences and Medical Technology (BioMediTech), University of Tampere, Finland

Identification and extraction of biologically relevant features, such as active regulatory regions, from high-throughput sequencing (HTS) data is a major challenge. A number of algorithms have been developed to perform feature detection on HTS data. In most cases, features are detected from one track of HTS data at a time, followed by further analysis and integration of peak detection results obtained from multiple sources. Thus, these methods fail to efficiently discover and utilize the complementary information found in multiple signals.

In this work we present our approach extending an existing algorithm called ZINBA (Zero-inflated negative binomial algorithm) to analyze multiple data tracks simultaneously. Our goal is to shed light on the relationships between data tracks using our statistical model while improving detection results for features that can be found from a selected pair of tracks. The statistical model is built by incorporating a correlation term for HTS data, so that the algorithm can be run for two tracks in parallel with improved results. To estimate the parameters of this model, the iterative algorithm (of EM-type) is extended from the original by including correlation estimation based on the results of logistic regression and generalized linear model fitting steps. As an output, our algorithm provides feature detection results for both tracks separately, a correlation model describing the interaction between the two tracks, and an optional consensus track output showing feature calls based on data from both of the two tracks used.

Here, we consider three examples of integrating HTS data across multiple data types and evaluate the performance of our algorithm with Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values. The example cases have been selected as pairs of tracks from ENCODE datasets that share promoter and enhancer activity information. First, we combine the ChIP-seq histone modification marks H3K4me3 and H3K27ac (K562 cell line). Second, we apply our algorithm to ChIP-seq data for the TF P300 and FAIRE-seq open chromatin data (K562). Finally, we apply our algorithm to a pair of FAIRE-seq replicates (K562) and study the resulting consensus track. In all cases, the analysis of algorithm performance shows a significant improvement in comparison to single-track analysis.
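
As an illustration of the AUC evaluation mentioned above, the following minimal sketch computes AUC as the probability that a randomly chosen true feature receives a higher score than a randomly chosen background region, with ties counted as one half. The function name and the toy scores are assumptions made for this example; the abstract does not describe the authors' evaluation code.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    pos_scores: scores assigned to known true features (hypothetical input).
    neg_scores: scores assigned to background regions (hypothetical input).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Toy scores only; not data from the poster.
    print(auc([0.9, 0.3], [0.5, 0.1]))
```

The quadratic loop keeps the definition explicit; for genome-scale tracks a rank-based implementation would be the more practical choice.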


RSG POSTER ABSTRACTS - 21 through 41

...............................................................................................................................

Poster: P21
Epigenetic dysregulation of human myogenesis affects time regulated eRNA and associated transposable element expression

Loqmane Seridi, King Abdullah University of Science and Technology, Saudi Arabia
Yanal Ghosheh, King Abdullah University of Science and Technology, Saudi Arabia
Beatrice Bodega, Istituto Nazionale Genetica Molecolare ‘Romeo ed Enrica Invernizzi’, Italy
Gregorio Alanis-Lobato, King Abdullah University of Science and Technology, Saudi Arabia
Timothy Ravasi, King Abdullah University of Science and Technology, Saudi Arabia
Valerio Orlando, King Abdullah University of Science and Technology, Saudi Arabia

Transcriptional regulation is a complex process that involves the interaction of transcription factors, promoters, enhancers, noncoding RNAs, transposable elements and chromatin states. To understand the transcriptional regulome, spatiotemporal measurements of its components are necessary. Myogenesis is a model system for studying transcriptional regulation because the factors driving the process are well known and evolutionarily conserved. However, most time-course studies of myogenesis are limited to a few time points and cell lines. Here, using RNA-Seq and CAGE, we deep sequenced a high-resolution time course of the myogenesis transcriptome from human primary cells of healthy donors and donors affected by Duchenne Muscular Dystrophy (DMD). We compiled a full catalog of coding and non-coding RNAs, promoters, enhancers, and active transposable elements. Comparative analysis of the two time courses suggests a major change in the epigenetic landscape in DMD, leading to global dysregulation of coding and non-coding genes, enhancers, and full-length transposable elements. It also indicates a high correlation between the activities of enhancers and transposable elements.

...............................................................................................................................

Poster: P22
Characterizing the dynamics of enzyme localization

Pablo Meyer, IBM T.J.Watson Research Center, United States
Stacey Gifford, IBM T.J.Watson Research Center, United States

To better understand how enzyme localization affects enzyme activity, we used timelapse microscopy to study the cellular localization of enzymes necessary for cell wall synthesis (MurA and MurG) in the bacterium Bacillus subtilis. The enzymes localize during exponential growth, and measuring the diffusion coefficient of their complex shows that it diffuses actively around the cell in an antibiotic-dependent manner. Point mutations in the helical domain of one of the proteins that disrupt its localization to the membrane caused severe sporulation defects, but did not affect localization or cause detectable defects during exponential growth. We found a lipid-dependent mechanism for MurG localization: in strains where the cardiolipin-synthesizing genes were deleted, MurG levels were diminished at the forespore. These results support localization as a critical factor in the regulation of proper enzyme function and catalysis.

...............................................................................................................................

Poster: P23
The Systems Toxicology Computational Challenge: Identification of Exposure Response Markers

Vincenzo Belcastro, PMI, Switzerland
Carine Poussin, PMI, Switzerland
Stephanie Boue, PMI, Switzerland
Florian Martin, PMI, Switzerland
Alain Sewer, PMI, Switzerland
Bjoern Titz, PMI, Switzerland
Manuel C Peitsch, PMI, Switzerland
Julia Hoeng, PMI, Switzerland

Risk assessment in the context of 21st century toxicology relies on the identification of specific exposure response markers and the elucidation of mechanisms of toxicity, which can lead to adverse events. As a foundation for this future predictive risk assessment, diverse sets of chemicals or mixtures are tested in different biological systems, and datasets are generated using high-throughput technologies. However, the development of effective computational approaches for the analysis and integration of these datasets remains challenging. The sbv IMPROVER (Industrial Methodology for Process Verification in Research; http://sbvimprover.com/) project aims to verify methods and concepts in systems biology research via challenges posed to the scientific community. In fall 2015, the 4th sbv IMPROVER computational challenge will be launched, aimed at evaluating algorithms for the identification of specific markers of chemical mixture exposure response in the blood of humans or rodents. Blood is an easily accessible matrix, but it remains a complex biofluid to analyze. This computational challenge will address questions related to the classification of samples based on transcriptomics profiles from well-defined sample cohorts. Moreover, it will address whether gene expression data derived from human or rodent whole blood are sufficiently informative to identify human-specific or species-independent blood gene signatures predictive of the exposure status of a subject to chemical mixtures (current/former/non-exposure). Participants will be provided with high-quality datasets to develop predictive models/classifiers, and the predictions will be scored by an independent scoring panel. The results and post-challenge analyses will be shared with the scientific community and will open new avenues in the field of systems toxicology.

...............................................................................................................................

Poster: P24
E-Flux2 and SPOT: Validated methods for inferring intracellular metabolic flux distributions from transcriptomic data

Min Kyung Kim, Rutgers University, United States
Anatoliy Lane, Rutgers University, United States
James Kelly, Rutgers University, United States
Desmond Lun, Rutgers University, United States

Several methods have been developed to predict system-wide intracellular metabolic fluxes by integrating transcriptomic data with genome-scale metabolic models. While powerful in many ways, existing methods have several shortcomings, and because of limited validation against experimentally measured intracellular fluxes, it is unclear which method has the best accuracy in general.

We present a general strategy for inferring intracellular metabolic flux distributions using transcriptomic data coupled with genome-scale metabolic reconstructions. It consists of two different template models, called DC (determined carbon source model) and AC (all possible carbon sources model), and two new methods, called E-Flux2 (E-Flux method combined with minimization of the l2 norm) and SPOT (Simplified Pearson cOrrelation with Transcriptomic data), which can be chosen and combined depending on the availability of knowledge on the carbon source or objective function. This enables our strategy to be applied to a broad range of experimental conditions. We examined E. coli and S. cerevisiae as representative prokaryotic and eukaryotic microorganisms, respectively. The predictive accuracy of our algorithm was validated by calculating the uncentered Pearson correlation between predicted fluxes and measured fluxes. To this end, we compiled 20 experimental conditions (11 in E. coli and 9 in S. cerevisiae) of transcriptome measurements coupled with corresponding central carbon metabolism intracellular flux measurements determined by 13C metabolic flux analysis (13C-MFA), which is the largest dataset assembled to date for the purpose of validating inference methods for predicting intracellular fluxes. In both organisms, our method achieves an average correlation coefficient ranging from 0.59 to 0.87, outperforming a representative sample of competing methods. Easy-to-use implementations of E-Flux2 and SPOT are available as part of the open-source package MOST (http://most.ccib.rutgers.edu/).

Our method represents a significant advance over existing methods for inferring intracellular metabolic flux from transcriptomic data. It not only achieves higher accuracy, but it also combines into a single method a number of other desirable characteristics including applicability to a wide range of experimental conditions, production of a unique solution, fast running time, and the availability of a user-friendly implementation.
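
The validation metric named above, the uncentered Pearson correlation, is the ordinary Pearson formula without mean subtraction, which makes it equal to the cosine of the angle between the predicted and measured flux vectors. The sketch below illustrates the metric only; the function name and the toy flux values are assumptions for this example and are not taken from MOST.

```python
import math


def uncentered_pearson(predicted, measured):
    """Uncentered Pearson correlation (cosine similarity) of two flux vectors."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(p * m for p, m in zip(predicted, measured))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_m = math.sqrt(sum(m * m for m in measured))
    if norm_p == 0.0 or norm_m == 0.0:
        raise ValueError("correlation is undefined for an all-zero vector")
    return dot / (norm_p * norm_m)


if __name__ == "__main__":
    # Toy values only; not measurements from the poster.
    predicted_flux = [10.0, 4.0, 0.5, 6.0]
    measured_flux = [9.0, 5.0, 1.0, 6.5]
    print(round(uncentered_pearson(predicted_flux, measured_flux), 3))
```

Because the means are not subtracted, the score is unchanged when either vector is multiplied by a positive constant, but it does change under a constant offset, unlike the centered coefficient.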

...............................................................................................................................

Poster: P25
SetRank: A highly specific tool for pathway analysis

Cedric Simillion, Bern University, Switzerland
Robin Liechti, SIB Swiss Institute of Bioinformatics, Switzerland
Heidi Lischer, University of Zurich, Switzerland
Vassilios Ioannidis, SIB Swiss Institute of Bioinformatics, Switzerland
Rémy Bruggmann, Bern University, Switzerland

The purpose of gene set enrichment analysis (GSEA) is to find general trends in the huge lists of genes or proteins generated by many functional genomics techniques and bioinformatics analyses. We present SetRank, an advanced GSEA algorithm that is able to eliminate many false positive hits. The key principle of the algorithm is that it discards gene sets that have initially been flagged as significant if their significance is only due to the overlap with another gene set. The algorithm is explained in detail and its performance is compared to that of other methods using objective benchmarking criteria. The benchmarking results show that SetRank is a highly specific and accurate tool for GSEA. Furthermore, we show that the reliability of results can be improved by taking sample source bias into account. SetRank and the accompanying visualization tools are available both as R/Bioconductor packages and through an online web interface.

...............................................................................................................................

Poster: P26
The role of genome accessibility in transcription factor binding in bacteria

Antonio Gomes, Columbia University, United States
Harris Wang, Columbia University, United States

ChIP-seq enables the identification of regulatory regions that govern gene expression at genome scale. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often correlate poorly with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF) binding in vivo. A more comprehensive biophysical representation of TF binding will improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility leads to a model that explains 69% of the ChIP-seq profile variance, while a model based on motif conservation alone explains only 46% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions, reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysical model provides a more comprehensive description of TF binding in vivo from first principles, towards a better representation of gene regulation in silico, with promising applications in systems biology.

...............................................................................................................................

Poster: P27
Does the overall shape of gene networks differ between cancer and normal states? Towards a comprehensive understanding of cancer system biology by meta-analysis of various cancer transcriptomes

Pegah Khosravi, School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran (Islamic Republic of)
Esmaeil Ebrahimie, Department of Genetics and Evolution, School of Biological Sciences, The University of Adelaide, Adelaide, Australia

Recent advances in computational biology have made it possible to formulate the characteristics of gene networks in terms of network topology statistics. The aim of the present study is to find network topology rules that can distinguish different types of cancer from the normal state. To this end, meta-analysis is employed to analyse the gene regulatory networks of 8 different types of cancer (breast, cervical, esophageal, head and neck, leukemia, prostate, rectal, and lung, plus two subtypes of lung cancer (small cell lung and non-small cell lung)) in comparison to the normal state. Microarray data were downloaded from the GEO database, NCBI. Gene regulatory networks were constructed using the ARACNE algorithm through the Cyni toolbox; subsequently, 20 network statistics were calculated using the NetworkAnalyzer plugin for Cytoscape. These statistics mainly describe number of edges, clustering coefficient, connected components, network diameter, network centralization, characteristic path length, average number of neighbors, number of nodes, network density, and heterogeneity in networks. Discriminant function analysis shows that number of edges, network diameter, and average number of neighbors are the main network topology statistics that discriminate cancer networks from normal ones. Cancer networks have fewer edges, a shorter diameter, and fewer neighbors on average, which confirms the extensive network rewiring during cancer progression. Discriminant function analysis distinguishes cancer gene networks from normal ones with 70% accuracy according to a cross-validation test. PCA demonstrates the similarity in network statistics between cervical cancer and breast cancer. Lung cancer has a distinctly different network pattern, with low network centralization and diameter. This study demonstrates the possibility of finding a universal pattern in different types of cancer based on network topological statistics. It also shows that decision tree models (pattern recognition) are successful in finding the pattern of cancer induction based on the important network statistics.

...............................................................................................................................

Poster: P28
Comparative Assessment Suite for Transcription Factor Binding Motifs

Caleb Kipkurui Kibet, Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
Philip Machanick, Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa

Predicting transcription factor (TF) binding sites remains an active challenge due to degeneracy and multiple potential binding sites in the genome. The advent of high throughput sequencing has seen several experimental approaches, including ChIP-seq, DNase-seq and ChIP-exo, and dozens of algorithms developed to address the challenge. An increasing number of motif models has been published and those in databases have more than doubled in the last two years. However, there is no standardized means of motif assessment let alone a computational tool to rank the available motifs for a given TF. This makes it hard to choose the best models and for algorithm developers to benchmark, test, quantify and improve on their tools. We introduce a web server hosting a suite of tools that assesses PWM-based motif models using scoring, comparison and enrichment approaches. Given that there is no agreed standard for motif quality assessment, we present a range of measures so users can apply their own judgement. An assess-by-scoring approach uses motif models to score benchmark data partitioned into positive and background sets, then uses AUC, Pearson, MNCP and Spearman’s rank statistics to quantify their performance – scoring functions are energy, GOMER, sum occupancy and sum log-odds. An assess-by-comparison approach seeks to rank, for a given TF, motifs based on similarity to all available motifs in the database using TOMTOM’s Euclidean distance function and FISim. It assumes the best model should be representative of information in the others, provided a variety of data and algorithms is used. This is a quick data-independent approach that has proved to be powerful, reproducing assessment-by-score ranks with over 0.7 average correlation. A web interface to the tools uses the Django framework with a MySQL back end. The database contains 6,530 human and mouse motif models and benchmark data derived from available databases and publications. 
A user-entered test motif for a given TF is ranked against motifs for the same TF in the database using the available benchmark data as well as user-supplied data in BED or FASTA format. Results are returned in interactive visuals providing further information on motif clustering, similarity and ranks, with options to download publication-ready figures and ranked motif data. We have demonstrated the benefit of our web server in motif choice and ranking as well as in motif discovery. Web server and command-line versions are available (link to be added once available, estimated mid-October 2015).

...............................................................................................................................

Poster: P29
Loregic: A method to characterize the cooperative logic of regulatory factors

Daifeng Wang, Yale University, United States
Koon-Kiu Yan, Yale University, United States
Cristina Sisu, Yale University, United States
Chao Cheng, Dartmouth Medical School, United States
Joel Rozowsky, Yale University, United States
William Meyerson, Yale University, United States
Mark Gerstein, Yale University, United States

Gene expression is controlled by various gene regulatory factors. These factors work cooperatively, forming complex genome-wide regulatory circuits. Disruption of regulatory cooperativity may lead to abnormal gene expression, as in cancer. Traditional experimental methods, however, can only identify small-scale regulatory activities. Thus, to systematically understand the cooperativity between and among different types of regulatory factors, we need efficient and systematic computational methods. Regulatory circuits have been found to behave analogously to electronic circuits, in which a wide variety of electronic elements work in coordination to function correctly. Recently, an increasing amount of next-generation sequencing data has provided great resources for studying regulatory activity, so it is possible to go beyond this and systematically study regulatory circuits in terms of logic elements. To this end, we developed Loregic, a computational method integrating gene expression and regulatory network data, to characterize the cooperativity of regulatory factors for the first time in cancers such as acute myeloid leukemia, which provided unprecedented insights into the gene regulatory logic of complex biological systems [1]. Loregic uses all 16 possible two-input-one-output logic gates (e.g. AND or XOR) to describe triplets of two factors regulating a common target. We attempt to find the gate that best matches each triplet’s observed gene expression pattern across many conditions. We make Loregic available as a general-purpose tool (loregic.gersteinlab.org). We validate it with known yeast transcription-factor knockout experiments. Next, using human ENCODE ChIP-Seq and TCGA RNA-Seq data, we are able to demonstrate how Loregic characterizes complex circuits involving both proximally and distally regulating transcription factors (TFs) and also miRNAs. Furthermore, we show that MYC, a well-known oncogenic driving TF, can be modeled as acting independently from other TFs (e.g., using OR gates) but antagonistically with repressing miRNAs. Finally, we inter-relate Loregic’s gate logic with other aspects of regulation, such as indirect binding via protein-protein interactions, feed-forward loop motifs and global regulatory hierarchy.

[1] Daifeng Wang, Koon-Kiu Yan, Cristina Sisu, Chao Cheng, Joel Rozowsky, William Meyerson, Mark Gerstein, "Loregic: A method to characterize the cooperative logic of regulatory factors," PLoS Computational Biology 11(4): e1004132, April 2015
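
The gate-matching idea described in the abstract can be sketched as follows. This is a simplified illustration, not the published Loregic scoring: it assumes expression has already been binarized to 0/1 per condition, enumerates the 16 truth tables over two inputs, and ranks gates by the fraction of conditions in which the target matches the gate output. All names (`GATES`, `best_gates`) and the toy data are hypothetical.

```python
from itertools import product

# The four possible input states of two regulatory factors (RF1, RF2).
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# All 16 two-input-one-output gates, each stored as a truth table.
GATES = [dict(zip(INPUTS, outputs)) for outputs in product((0, 1), repeat=4)]


def best_gates(rf1, rf2, target):
    """Return (best_score, gates) for binarized expression across conditions.

    Each gate is reported as its outputs for inputs 00, 01, 10, 11.
    The matching-fraction score is an assumption made for this sketch.
    """
    if not target or not (len(rf1) == len(rf2) == len(target)):
        raise ValueError("inputs must be non-empty and of equal length")
    scores = []
    for gate in GATES:
        matches = sum(gate[(a, b)] == t for a, b, t in zip(rf1, rf2, target))
        scores.append(matches / len(target))
    best = max(scores)
    winners = [tuple(g[i] for i in INPUTS) for g, s in zip(GATES, scores) if s == best]
    return best, winners


if __name__ == "__main__":
    # Toy triplet: the target is on only when both factors are on (AND).
    rf1 = [0, 0, 1, 1, 1, 0]
    rf2 = [0, 1, 0, 1, 1, 1]
    target = [0, 0, 0, 1, 1, 0]
    print(best_gates(rf1, rf2, target))
```

When some input states never occur in the data, several gates tie for the best score, which is why the sketch returns a list rather than a single gate.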

...............................................................................................................................

Poster: P30
Affymetrix Probesets as Proxies for Mature MicroRNAs

Rebecca Tagett, Wayne State University, United States
Sorin Draghici, Wayne State University, United States

Motivation:
MicroRNAs are small non-coding RNAs that regulate mRNA abundance post-transcriptionally, and have been implicated in many contexts, from development to disease. Primary microRNAs (pri-miRNAs) from intergenic regions, which represent around half of the known population, are independently transcribed, 5-prime capped and poly-adenylated. Like precursor messenger RNAs, they can be kilobases in length, and must undergo extensive processing.
Previous studies have suggested that the expression of some intergenic pri-miRNAs can be used as a surrogate for the expression of mature microRNAs. Little is known about which pri-miRNAs have this property, and sequence annotation is unavailable for the majority of pri-miRNAs.
The Affymetrix HG U133 Plus 2.0 array includes probesets for 19,612 protein-coding genes and 15,943 non-coding, mostly unannotated transcripts. With over 3000 public data series, these experiments cover a large variety of human tissues, conditions and disease states. We identify pri-miRNAs among the non-protein-coding probesets, and show that some of them can be used as proxies for mature microRNAs, opening up the possibility of combined microRNA-mRNA studies of mRNA abundance in relation to the presence of microRNAs.

Methods:
Using sequence similarity, we leverage the Unigene database to connect microRNA stem-loop sequences to target sequences from U133 Plus 2.0, identifying over 250 probeset-microRNA pairs. Fifty of these are one-to-one matches between a Unigene cluster and a probeset, with one or several associated stem-loop precursors. Having compiled a set of public datasets from experiments where samples are run on both a microRNA platform and U133 Plus 2.0, we identify pri-miRNA probesets that correlate with mature microRNA. We select two well-studied microRNAs from these results and screen all public U133 Plus 2.0 series to find which sets express the pri-miRNA of interest. We then validate the selected pri-miRNAs based on published literature, and perform functional analyses on the gene sets that are co-expressed and anti-co-expressed with these microRNAs.

Results:
We present the set of probesets from U133 Plus 2.0 which target pri-microRNA transcripts, highlighting which are acceptable surrogates for mature microRNA abundance. Those that do not correlate to mature microRNA abundance may be useful to study microRNA processing regulation, tissue specificity, co-transcription and transcription factor activity. Those that are proxies for mature microRNAs can be studied within the pool of all coding mRNAs, over the vast repository of U133 Plus 2.0 chips, to generate new, testable hypotheses regarding microRNA function. We present some intriguing results for the two pri-miRNAs selected for in-depth analysis.

...............................................................................................................................

Poster: P31
Non-coding isoforms of coding genes in B cell development and malignancies

Irtisha Singh, Memorial Sloan Kettering Cancer Institute, United States
Shih-Han Lee, Memorial Sloan Kettering Cancer Institute, United States
Christina Leslie, Memorial Sloan Kettering Cancer Institute, United States
Christine Mayr, Memorial Sloan Kettering Cancer Institute, United States

Alternative cleavage and polyadenylation (ApA) is most often viewed as the selection of alternative pA signals in the 3'UTR, generating 3'UTR isoforms that code for the same protein. However, ApA events can also occur in introns, generating either non-coding transcripts or truncated protein-coding isoforms due to the loss of C-terminal protein domains, leading to diversification of the proteome. Since previous studies have demonstrated the cell type and condition specific expression of 3'UTR isoforms, we decided to investigate the cell type specificity and potential functional consequences of isoforms generated by intronic ApA. We therefore carried out an analysis of 3'-seq and RNA-seq profiles from chronic lymphocytic leukemia (CLL) and multiple myeloma (MM) samples as compared to mature human B cells (naïve and CD5+) and plasma cells, respectively, together with our previous 3'-seq atlas generated from a wide variety of tissues and cell lines. This analysis showed that intronic ApA is a normal and regulated process, most widely used in immune cells, with intronic ApA events enriched near the start of the transcription unit, yielding non-coding transcripts or messages with minimal coding sequence (CDS). These early intronic ApA events preferentially occur in transcription factors, chromatin regulators, and ubiquitin pathway genes. De novo assembly of RNA-seq data supports ~60% of the intronic ApA events from plasma cells and MM samples, leading to >2000 candidate alternative transcripts arising from intronic ApA, with ~900 transcripts ending near the start of the transcription unit, retaining less than 25% of the coding sequence. Our analysis showed that two thirds of these intronic ApA isoforms have minimal coding potential, likely generating non-coding isoforms from protein coding genes. CLL cells increase the expression of early intronic ApA events relative to mature B cells, while MM cells decrease the expression of these events relative to plasma cells. 
For a fraction of genes, increased expression of isoforms generated by intronic ApA coincides with reduced expression of the full length mRNA in CLLs compared to mature B cells; conversely, lower expression of intronic ApA events coincides with higher full length mRNA expression for some genes in MM samples compared to plasma cells. In these genes, expression of the intronic event may function as a switch to alter full-length mRNA expression. The other fraction of these non-coding isoforms may potentially act as scaffolds for recruiting regulatory factors to the locus.

...............................................................................................................................

Poster: P32
Predicting Metabolic Networks through Pairwise Rational Kernels

Abiel Roche-Lima, Medical Science Campus, University of Puerto Rico, Puerto Rico

Metabolic networks are represented by sets of metabolic pathways. A metabolic pathway is a series of chemical reactions in which the product of one reaction serves as the input to another. Many pathways remain incompletely characterized, and in some of them not all enzyme components have been identified. One of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models depend on gene annotation, so errors accumulate when pathways are predicted from incorrectly annotated genes.

Pairwise kernel frameworks have been used in supervised learning approaches, e.g., Pairwise Support Vector Machines (SVMs), to predict relationships between pairs of entities. Pairwise kernel methods are computationally expensive, especially when used to manipulate pairs of sequences, for example to predict metabolic networks. Rational kernels are based on transducers to manipulate sequence data, computing similarity measures between sequences or automata. Rational kernels take advantage of the smaller and faster representation and algorithms of weighted finite-state transducers. They have been used effectively in problems that handle large amounts of sequence information, such as protein essentiality, natural language processing, and machine translation.

We propose a new framework, Pairwise Rational Kernels (PRKs), to manipulate pairs of sequence data, as pairwise combinations of rational kernels. We develop experiments using SVM with PRKs applied to metabolic pathway predictions in order to validate our methods. As a result, we obtain faster execution times with PRKs than similar pairwise kernels, while maintaining accurate predictions. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We also obtain a new type of Pairwise Rational Kernels based on automaton and transducer operations. In this case, we define new operations over two pairs of automata to obtain new rational kernels. We also develop experiments to validate these new PRKs to predict metabolic networks. As a result, we obtain the best execution times when we compare them with pairwise kernels and the previous PRKs.
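
As a rough illustration of the pairwise idea, the sketch below builds a kernel on pairs of sequences from a base sequence kernel. It assumes an n-gram count kernel as the base (n-gram kernels are one family of rational kernels) and the symmetric tensor-product combination K((a,b),(c,d)) = K(a,c)K(b,d) + K(a,d)K(b,c). Both choices, the function names, and the toy sequences are assumptions made for illustration; they are not the PRK definitions used in this work, and no weighted-transducer library is involved.

```python
from collections import Counter


def ngram_counts(seq, n=3):
    """Count every length-n substring of seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_kernel(x, y, n=3):
    """Base kernel: dot product of the n-gram count vectors of x and y."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(count * cy[gram] for gram, count in cx.items())


def pairwise_kernel(pair1, pair2, base=ngram_kernel):
    """Symmetric tensor-product kernel between two pairs of sequences.

    The value does not depend on the order of the two sequences
    inside either pair.
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real enzymes.
    p = ("MKTAYIAK", "MKTAYLAK")
    q = ("MKTAYIAR", "GGSGGSGG")
    print(pairwise_kernel(p, q), pairwise_kernel(p, p))
```

A kernel of this form can be passed as a precomputed Gram matrix to any SVM implementation that accepts one. The cost grows with the number of pairs, which is the expense the abstract refers to.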

...............................................................................................................................

Poster: P33
Measuring and interpreting similarity between scale-free biological networks

Qian Peng, Department of Computer Science, Illinois Institute of Technology, United States
Bingqing Xie, Department of Computer Science, Illinois Institute of Technology, United States
Gady Agam, Department of Computer Science, Illinois Institute of Technology, United States

Biological networks, such as metabolic networks, protein-protein interaction networks, and gene regulation networks, are used in numerous applications to reveal the functions of genes, proteins, and molecules. Measuring the similarity between such networks is important for both clustering algorithms and the validation of algorithms. In clustering algorithms the similarity measure is used to determine networks that can be grouped in the same cluster. In the validation of algorithms the similarity measure is used to determine the similarity between a synthetic network and an actual one. Synthetic networks are useful for algorithm validation because they can be synthesized in a manner where the ground truth is known (e.g., a network where the clusters are known).

Existing methods for measuring network similarity, such as NetSimile, do not target biological networks specifically and lack absolute interpretation of their measurements. In this paper we propose a principled metric using machine learning which consistently measures the similarity between biological networks and apply it to measure the similarity between actual networks and synthesized ones. In addition to improved performance, similarity in our approach has a meaning of edge rewiring percentage which makes interpreting absolute similarity results easier.

Our similarity model uses several network features, such as maximum degree, local and global clustering coefficients, degree exponent, degree distribution, and some other geometric features. To train the model, we used a set of 140 actual biological networks, for each of which we generated perturbed versions at various levels by randomly rewiring edges. We use random forest regression with 10-fold cross-validation. Our cross-validation results show that we can estimate the known percentage of rewired edges with an average error of roughly 5.5%.
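
The training-data construction described above can be sketched in plain Python. The example rewires a chosen fraction of edges at random and computes three of the listed features (maximum degree, global clustering coefficient, mean local clustering coefficient). Function names, the ring-shaped toy network, and the exact rewiring rule are assumptions for illustration. The regression step is omitted because it needs a machine-learning library and the abstract does not specify how the features of two networks are combined.

```python
import random
from itertools import combinations


def rewire(edges, fraction, n_nodes, seed=0):
    """Replace a fraction of the edges with random new edges.

    New edges have no self-loops, do not duplicate an existing edge,
    and never restore the edge they replace, so the edge count is kept.
    """
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in edges}
    to_remove = rng.sample(sorted(current), round(fraction * len(current)))
    for old in to_remove:
        current.remove(old)
        while True:
            u, v = rng.sample(range(n_nodes), 2)
            new = (min(u, v), max(u, v))
            if new not in current and new != old:
                current.add(new)
                break
    return current


def features(edges, n_nodes):
    """Maximum degree plus global and mean local clustering coefficients."""
    adj = {v: set() for v in range(n_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    closed_total = triplets = 0
    local = []
    for nbrs in adj.values():
        k = len(nbrs)
        possible = k * (k - 1) // 2          # neighbor pairs centered here
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        triplets += possible
        closed_total += closed
        local.append(closed / possible if possible else 0.0)
    return {
        "max_degree": max(len(nbrs) for nbrs in adj.values()),
        "global_clustering": closed_total / triplets if triplets else 0.0,
        "mean_local_clustering": sum(local) / n_nodes,
    }


if __name__ == "__main__":
    n = 30
    # Toy network: each node linked to its next two neighbors on a ring.
    ring = [(i, (i + d) % n) for i in range(n) for d in (1, 2)]
    for level in (0.0, 0.1, 0.3, 0.5):
        print(level, features(rewire(ring, level, n), n))
```

Each (feature vector, rewiring level) pair produced this way is one labeled example; a regressor trained on such examples predicts the rewiring percentage, which is what gives the similarity score its absolute interpretation.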

After training the model we measured its performance on predicting the similarity between synthesized networks. In this evaluation we synthesized 100 networks using the Barabasi scale-free network synthesis algorithm. The objective was to measure whether our model can accurately estimate the difference between networks generated with various synthesis parameters (number of vertices and degree distribution). We compared the proposed approach to a standard implementation of NetSimile.

We observed that in the proposed approach the produced measure increased monotonically as the difference in synthesis parameters increased, whereas in NetSimile this was not always the case. The paper provides the full details of this evaluation.

...............................................................................................................................

Poster: P34
Global Functional Annotation and Visualization of the 2015 Yeast Genetic Interaction Network

Anastasia Baryshnikova, Princeton University, United States
Michael Costanzo, University of Toronto, Canada
Chad Myers, University of Minnesota, United States
Brenda Andrews, University of Toronto, Canada
Charlie Boone, University of Toronto, Canada

Large-scale biological networks map functional relationships between most genes in the genome and can potentially uncover high level organizing principles governing cellular functions. Despite the availability of an incredible wealth of network data, our current understanding of their functional organization is very limited and nearly inaccessible for biologists. To facilitate the discovery of functional structure and advance its biological interpretation, we developed a systematic quantitative approach to determine which functions are represented in a network, which parts of the network they are associated with and how they are related to one another. Our method, named Spatial Analysis of Functional Enrichment (SAFE), detects network regions that are statistically overrepresented for a functional group or a quantitative phenotype of interest, and provides an intuitive visual representation of their relative positioning within the network. Using SAFE, we examined the most recent genetic interaction network from budding yeast Saccharomyces cerevisiae, which was derived from the quantitative growth analysis of over 20 million double mutants. By annotating the genetic interaction network with GO biological process, protein localization and protein complex membership data, SAFE showed that the network is structured hierarchically and reflects the functional organization of the yeast cell at many different levels of resolution. In addition, we analyzed the network using a large-scale chemical genomics dataset and generated a global view of the yeast cellular response to chemical treatment. This view recapitulated the known modes-of-action of chemical compounds and identified a potentially novel mechanism of resistance to the anti-cancer drug bortezomib. Our results demonstrate that SAFE is a powerful tool for annotating biological networks and a unique framework for understanding the global wiring diagram of the cell.
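
The core idea of testing network regions for overrepresentation of a functional group can be sketched as follows. This is a simplified stand-in, not the SAFE procedure itself: it assumes that a node's region is everything within a fixed number of hops and that enrichment is scored with a one-sided hypergeometric test, with no multiple-testing correction. All names and the toy graph are hypothetical.

```python
from collections import deque
from math import comb


def neighborhood(adj, source, radius):
    """Nodes within `radius` hops of `source`, including `source` itself."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when n nodes are drawn from N, K of which are annotated."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


def enrichment_pvalues(adj, annotated, radius=1):
    """Enrichment p-value of each node's neighborhood for the annotated set."""
    N, K = len(adj), len(annotated)
    pvalues = {}
    for node in adj:
        hood = neighborhood(adj, node, radius)
        k = len(hood & annotated)
        pvalues[node] = hypergeom_tail(k, N, K, len(hood))
    return pvalues


if __name__ == "__main__":
    # Toy graph: two triangles joined by one edge; one triangle is annotated.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    for node, p in sorted(enrichment_pvalues(adj, {0, 1, 2}).items()):
        print(node, round(p, 3))
```

Nodes whose neighborhoods score low p-values for the same attribute form the kind of enriched region that can then be outlined on a network layout.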

...............................................................................................................................

Poster: P35
Memory of Inflammation in Regulatory T Cells

Joris van der Veeken, Memorial Sloan Kettering Cancer Center, United States
Alvaro J. González, Memorial Sloan Kettering Cancer Center, United States
Hyunwoo Cho, Memorial Sloan Kettering Cancer Center, United States
Aaron Arvey, Memorial Sloan Kettering Cancer Center, United States
Christina S. Leslie, Memorial Sloan Kettering Cancer Center, United States
Alexander Y. Rudensky, Memorial Sloan Kettering Cancer Center, United States

Regulatory T (Treg) cells are a specialized lineage of suppressive CD4 T cells that act as critical negative regulators of inflammation in various biological contexts. Treg cells exposed to an inflammatory environment undergo numerous transcriptional and epigenomic changes, acquire highly enhanced suppressive capacity, and show altered tissue homing potential. Whether these changes represent stable differentiation akin to memory T cells, or a transient adaptation to the inflammatory environment, is currently unclear.

We used an inducible lineage tracing system to analyze the long-term stability of inflammation-induced transcriptional, epigenomic, and functional changes in inflammation-experienced Treg cells. To this end, we performed an integrative computational analysis of ATAC-seq, histone modification (H3K27ac, H3K27me3, H3K4me1) ChIP-seq, and RNA-seq profiles of Treg cells before, during, and two months after exposure to an acute inflammatory environment. We found that Treg cells, in contrast to memory T cells, showed a striking ability to revert activation-induced transcriptional and epigenomic changes and maintained only a selective and specific memory of inflammation. Genes undergoing stable expression changes underwent qualitatively similar but more dramatic chromatin remodeling than genes undergoing transient changes. Stable gene expression changes were further reinforced during secondary Treg cell activation, while genes undergoing transient expression changes were similarly regulated during primary and secondary responses. Moreover, transiently expressed genes did not maintain stable chromatin modifications that would facilitate their reactivation. Importantly, while the activation-induced increase in Treg cell suppressive function was transient, inflammation-experienced Treg cells acquired a stable non-lymphoid tissue preference characterized by differential expression of tissue homing molecules. These data suggest that memory of inflammation allows Treg cells to preferentially localize to non-lymphoid organs to dampen ongoing tissue inflammation, without becoming stably hyperactive and causing an immunosuppressed state.

...............................................................................................................................

Poster: P36
Enhancing the detection of genomic rearrangements to better understand cancer pathology

Francesca Cordero, Department of Computer Science, University of Torino, Italy
Marco Beccuti, Department of Computer Science, University of Torino, Italy
Maddalena Arigoni, University of Torino, Italy
Raffaele Calogero, University of Torino, Italy

Among genome structural variants, genomic rearrangements are one of the major sources of genetic diversity in human cancer. Chimeric (or fusion) genes arise from recombination events formed by the breakage and re-joining of two DNA sequences. Detecting these rearrangements is intrinsically difficult because of both the experimental protocols used and the computational methodologies implemented to detect fusion events. Fusion events occurring in a specific cell type are usually detected at the transcript level, and results generated in different laboratories overlap only partially. A prototypical example is fusion detection in MCF7: Edgren et al. (Genome Biology, 2011), Kangaspeska et al. (PLoS One, 2012), Sakarya et al. (PLoS Computational Biology, 2012), Maher et al. (PNAS, 2009), and Inaki et al. (Genome Research, 2011) used different tools, sequenced on both Illumina and SOLiD platforms, and started from either polyA-selected RNA or total RNA.

To understand the effect of library preparation on fusion detection, we compare different sequencing protocols: polyA selection, ACCESS protocol and ribosomal depleted total RNA.

Our data indicate that sequencing polyA-selected RNA is the least effective method for detecting known MCF7 fusions, while ribosomal-depleted total RNA is the most effective.

Taken together with previously published data, our data suggest that the MCF7 cell line is an ideal model for evaluating the presence of specific genomic roles that define the sites involved in aberrant translocations.

However, RNA-seq does not provide information on the actual region in which the translocation is located. We have therefore sequenced the MCF7 genome at 35X and detected the breakpoint regions of the known MCF7 fusions. Preliminary data indicate the presence of some patterns that are associated with these events. We are currently evaluating whether we can identify genomic roles that are also observed in translocations annotated in the COSMIC database.

...............................................................................................................................

Poster: P37
In-silico Analysis of Circular RNA as Regulators of miRNA

Nicholas Akers, Icahn School of Medicine at Mount Sinai, United States
Xintong Chen, Icahn School of Medicine at Mount Sinai, United States
Eric Schadt, Icahn School of Medicine at Mount Sinai, United States
Bojan Losic, Icahn School of Medicine at Mount Sinai, United States

MicroRNA (miRNA) are well characterized as important non-coding regulators of cellular gene expression. Less well understood are the mechanisms that regulate miRNA. Recently, circular RNA (cRNA) have been described as a well-expressed, non-coding, tissue-specific RNA product with an ambiguous cellular function. One plausible hypothesis is that cRNA specifically bind cellular regulators such as miRNA. This hypothesis was informed by the discovery that cRNA ciRS-7 contains numerous binding sites for miRNA miR-7, allowing the cRNA to attenuate the effects of the miRNA. This finding has led to speculation that the cellular role of many cRNA is to ‘sponge’ miRNA as a tier of control over gene expression. To investigate this hypothesis, we used BLAST to align the sequences of all known miRNA against all reported cRNA and looked for enrichment of miRNA binding sites. As a negative control, we also created a database of random-nucleotide miRNA and aligned it to our cRNA database. Our results cover ~236M cRNA/miRNA pairs and indicate that published cRNA are 41% more likely to contain a binding site for known miRNA than for randomly generated miRNA (p < 2.2×10^-16). In addition, ciRS-7 and miR-7 are among the strongest cRNA/miRNA pairs, residing in the top 0.001% of all combinations in binding sites per base pair. The next step is to examine the effects of cRNA expression on miRNA targets. We are analyzing the effect of expression of the most prolific miRNA-binding cRNA on the mRNA targets of these miRNA in a public-access RNA-Seq dataset. If the hypothesis that cRNA attenuate the effects of miRNA is true, we expect to find that mRNA targets of miRNA degradation increase with greater cRNA expression. The results of this analysis will be presented in order to provide clarity on the role of cRNA in regulating mRNA within the cell.
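
The comparison of real against randomly generated miRNA can be sketched with a much simpler matching rule than the BLAST alignment used here. The example below assumes that a binding site is an exact match to the reverse complement of the miRNA seed (nucleotides 2-8), counts sites per base, and builds the negative control by drawing random sequences of the same lengths. The seed rule, the function names, and the toy sequences are assumptions for illustration only.

```python
import random

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=1, end=8):
    """Target motif: reverse complement of the seed (nucleotides 2-8)."""
    return mirna[start:end].translate(COMPLEMENT)[::-1]


def count_sites(target, motif):
    """Count occurrences of motif in target, overlaps included."""
    count, pos = 0, target.find(motif)
    while pos != -1:
        count += 1
        pos = target.find(motif, pos + 1)
    return count


def sites_per_base(crnas, mirnas):
    """Seed-match sites per cRNA base, averaged over all cRNA/miRNA pairs."""
    total_sites = sum(count_sites(c, seed_site(m)) for c in crnas for m in mirnas)
    total_bases = sum(len(c) for c in crnas) * len(mirnas)
    return total_sites / total_bases


def random_mirnas(mirnas, seed=0):
    """Negative control: random sequences with the same lengths."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGU") for _ in m) for m in mirnas]


if __name__ == "__main__":
    # Hypothetical toy sequences, not database entries.
    mirnas = ["UGGAAGACUAGU"]
    crnas = ["GUCUUCCAAGUCUUCC", "ACGUACGUACGUACGU"]
    print("real:  ", sites_per_base(crnas, mirnas))
    print("random:", sites_per_base(crnas, random_mirnas(mirnas)))
```

The ratio of the two densities plays the role of the enrichment figure quoted above. A linear scan like this misses sites that span the back-splice junction of a circular molecule, so a fuller version would append the first few bases of each cRNA to its end before scanning.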

...............................................................................................................................

Poster: P38
A parallel negative feedback motif exhibits bidirectional control based on differential kinetics in cytokine regulatory networks

Warren Anderson, Thomas Jefferson University, United States
Hirenkumar Makadia, Thomas Jefferson University, United States
Andrew Greenhalgh, McGill University, Canada
James Schwaber, Thomas Jefferson University, United States
Samuel David, McGill University, Canada
Rajanikanth Vadigepalli, Thomas Jefferson University, United States

Negative feedback is critical for maintaining homeostasis within and between cells. Inflammatory immune diseases involve aberrant negative feedback interactions within cytokine regulatory networks. We developed a computational model of a macrophage cytokine interaction network to study the regulatory mechanisms of macrophage-mediated inflammation. We established a literature-based cytokine network, including TNFα, TGFβ, and IL-10, and fitted a mathematical model to published data from LPS-treated microglia (brain macrophages). We evaluated the validity of our model by testing whether it could recapitulate the experimentally determined “tolerance” response to the endotoxin LPS. We applied two doses of LPS and determined the gain of the peak TNFα responses. Our results were consistent with published experimental data demonstrating tolerance to LPS. Global sensitivity analysis revealed that TGFβ- and IL-10-mediated inhibition of TNFα was critical for regulating network behavior. Further analysis revealed that TNFα exhibited adaptation to sustained LPS stimulation. We simulated the effects of functionally inhibiting TGFβ and IL-10 on TNFα adaptation. Our analysis showed that TGFβ and IL-10 knockouts (TGFβ KO and IL-10 KO) exert divergent effects on adaptation. TGFβ KO attenuated TNFα adaptation whereas IL-10 KO enhanced TNFα adaptation. We experimentally tested the hypothesis that IL-10 KO enhances TNFα adaptation in murine macrophages and found supporting evidence. Next, we tested the effect of IL-10 and TGFβ KO on tolerance using our computational model. Surprisingly, we found that IL-10 KO enhanced tolerance of the TNFα response to sequentially applied LPS doses. In contrast, TGFβ KO repressed LPS tolerance. These opposing effects could be explained by differential kinetics of negative feedback. Inhibition of IL-10 reduced early negative feedback, resulting in enhanced TNFα-mediated TGFβ expression. 
To further assess whether the relative effects of IL-10 and TGFβ could be explained by their differential kinetics, we adapted our macrophage model to a 3-node system. We found that the 3-node parallel negative feedback topology supported robust adaptation and tolerance. Inhibition of relatively fast negative feedback enhanced adaptation and tolerance. In contrast, inhibition of relatively slow negative feedback attenuated adaptation and tolerance. We propose that differential kinetics in parallel negative feedback loops constitute a novel mechanism underlying the complex and non-intuitive pro- versus anti-inflammatory effects of individual cytokine perturbations. Based on the data from our reduced 3-node network, we posit that parallel negative feedback motifs with differential kinetics can be tuned for bi-directional control (i.e., negative and positive influences) in contexts ranging from intracellular biochemical signaling to inter-cellular interactions.
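
A 3-node parallel negative feedback motif of this kind can be explored with a few lines of numerical integration. The sketch below is a toy system, not the fitted macrophage model or the authors' reduced network: an output node x is driven by a constant stimulus and inhibited by a fast node f and a slow node g, both of which are induced by x. The equations, rate constants, and the adaptation measure are all assumptions chosen for illustration.

```python
def simulate(stimulus, k_fast=1.0, k_slow=0.05, fast_on=True, slow_on=True,
             steps=20000, dt=0.01):
    """Forward-Euler integration of a toy parallel negative feedback motif.

    dx/dt = stimulus / (1 + f + g) - x   output, inhibited by f and g
    df/dt = k_fast * (x - f)             fast inhibitor induced by x
    dg/dt = k_slow * (x - g)             slow inhibitor induced by x
    Setting fast_on or slow_on to False removes that inhibitor's effect on x.
    Returns the time course of x.
    """
    x = f = g = 0.0
    trace = []
    for _ in range(steps):
        inhibition = 1.0 + (f if fast_on else 0.0) + (g if slow_on else 0.0)
        dx = stimulus / inhibition - x
        df = k_fast * (x - f)
        dg = k_slow * (x - g)
        x, f, g = x + dt * dx, f + dt * df, g + dt * dg
        trace.append(x)
    return trace


def adaptation_index(trace):
    """Fraction of the peak response lost by the end of the simulation."""
    peak = max(trace)
    return (peak - trace[-1]) / peak


if __name__ == "__main__":
    runs = {
        "intact": simulate(4.0),
        "fast feedback removed": simulate(4.0, fast_on=False),
        "slow feedback removed": simulate(4.0, slow_on=False),
    }
    for label, trace in runs.items():
        print(f"{label}: peak={max(trace):.2f} final={trace[-1]:.2f} "
              f"adaptation={adaptation_index(trace):.2f}")
```

Comparing the three runs shows how removing one loop changes the balance between the early peak and the late steady state. Tolerance could be probed in the same way by replacing the constant stimulus with two pulses and comparing the two peaks.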

...............................................................................................................................

Poster: P39
Sequence biases in CLIP experimental data are incorporated in protein RNA-binding models

Yaron Orenstein, MIT, United States
Bonnie Berger, MIT, United States

Protein-RNA interactions play important roles in many processes in the cell. CLIP-based methods measure protein-RNA binding in vivo in a high-throughput manner on a genome-wide scale. In these technologies, the protein is cross-linked to the RNA and pulled down. The RNA is cleaved by a ribonuclease, and the protein is then removed. Finally, the bound RNA segments are sequenced and mapped back to the genome, where they are called as peaks.

Here, we present a newly identified bias in CLIP peaks, which we call the ‘terminating G’. Most called peaks terminate in a G, since RNase T1 cleaves at accessible G’s much more strongly than at other nucleotides. The fact that most raw sequences do not terminate at a G implies that this bias is introduced in the peak calling process. Unfortunately, protein RNA-binding preferences are not easily disentangled from enzyme specificities. Thus, we call for an appropriate experimental control to measure the specificities of the cleaving enzyme. These should then be incorporated as covariates in the peak calling process to identify unbiased binding sites. Better algorithms may then be developed to predict binding sites more accurately.
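
A check for this kind of bias is straightforward to sketch: tabulate the last nucleotide of each sequence and compare called peaks with raw reads. The function names and the toy sequences below are hypothetical, and the example makes no assumption about any particular peak caller or file format.

```python
from collections import Counter


def terminal_base_fractions(sequences):
    """Fraction of sequences ending in each nucleotide (T is counted as U)."""
    counts = Counter(seq[-1].upper().replace("T", "U") for seq in sequences if seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-empty sequences given")
    return {base: counts[base] / total for base in "ACGU"}


def terminal_g_excess(peaks, raw_reads):
    """Difference in terminal-G fraction between called peaks and raw reads."""
    return (terminal_base_fractions(peaks)["G"]
            - terminal_base_fractions(raw_reads)["G"])


if __name__ == "__main__":
    # Hypothetical toy data, not real CLIP output.
    peaks = ["AUCG", "GGAG", "UUCG", "ACCA"]
    raw_reads = ["AUCG", "GGAA", "UUCU", "ACCC"]
    print(terminal_base_fractions(peaks))
    print(terminal_g_excess(peaks, raw_reads))
```

A clearly positive excess would point to the peak-calling step as the source of the bias, which is the pattern described above.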

...............................................................................................................................

Poster: P40
A novel study of the scope and limitations of baker’s yeast as a model organism for human tissue-specific pathways

Shahin Mohammadi, Purdue University, United States
Baharak Saberidokht, Purdue University, United States
Shankar Subramaniam, University of California, San Diego, United States
Ananth Grama, Purdue University, United States

Budding yeast, S. cerevisiae, has been used extensively as a model organism for studying cellular processes in evolutionarily distant species, including humans. However, different human tissues, while inheriting a similar genetic code, exhibit distinct anatomical and physiological properties. The driving biochemical processes and associated biomolecules that mediate the differentiation of various tissues are not completely understood, nor is the extent to which a unicellular organism, such as yeast, can be used to model these processes within each tissue.

We propose a novel computational framework coupled with the corresponding statistical model to assess the suitability of yeast as a model organism for different human tissues. Using our method, we dissect the functional space of human tissue-specific networks according to their conservation both across species and among different tissues.

Using a case study of the GNF Gene Atlas dataset, we classify different tissues based on their similarity to yeast. Where the suitability of yeast can be established through conservation of tissue-specific pathways, yeast can serve as an experimental model for further investigation of new biomarkers, as well as an unbiased phenotypic screen to assay pharmacological and genetic interventions. On the other hand, for tissues with functionality missing in yeast, we provide molecular constructs (gene insertions) for creating more appropriate, tissue-engineered humanized yeast models.

...............................................................................................................................

Poster: P41
Deciphering single-cell transcriptional heterogeneity to understand principles of neuronal phenotype organization and plasticity

James Park, University of Delaware, United States
Babatunde Ogunnaike, University of Delaware, United States
James Schwaber, Thomas Jefferson University, United States
Rajanikanth Vadigepalli, Thomas Jefferson University, United States

Reconciling a cell’s transcriptional state and its phenotypic function is confounded by the transcriptional heterogeneity observed at the single-cell scale. This transcriptional heterogeneity conflicts with the traditional expectation that a neuronal phenotype consists of functionally identical neurons that respond uniformly to synaptic and neuromodulatory inputs. Moreover, this transcriptional heterogeneity is prominent within and across post-mitotic neuronal populations throughout the brain, where neurons interact to form circuits that regulate physiological function. High-throughput “-omic” level analysis, however, suggests that a more complex molecular organization potentially underlies neuronal phenotypic function and emergent systems-level behavior that occurs in the brain. In order to understand the functional relevance of this transcriptional heterogeneity, we examined two distinct brain nuclei by analyzing the transcriptional responses of individual neurons responding to specific physiological perturbations. In the first case, we generated a high-dimensional gene expression data set from individual blood pressure-regulating neurons within the nucleus tractus solitarius (NTS) that were collected from rats undergoing an acute hypertensive challenge. In the second case, we analyzed the transcriptional states of hundreds of single neurons within the suprachiasmatic nucleus (SCN) from mice responding to a light-induced phase shift in circadian rhythms. Using a combination of multivariate analytical techniques, graph network theory, and a novel fuzzy logic-based regulatory network modeling methodology, we identified molecular organizational structures in which individual neurons from both brain nuclei form distinct transcriptional states that align with synaptic/neuromodulatory inputs. 
Concomitantly, our quantitative regulatory network models and simulations of NTS neurons suggest that distinct networks correspond to these subtypes and drive heterogeneous gene expression behavior in a continuous fashion. Within the SCN, the presence of transcriptionally distinct neuronal subtypes provides insight into the organization and intercellular interactions underlying SCN regulation of circadian function. Having identified these SCN neuronal subtypes, we are now able to postulate a cellular interaction network in which specific neuronal subtypes fulfill specific functional roles in regulating circadian phase-shift behavior.
