12th Annual Rocky Mountain Bioinformatics Conference


Automatic Discovery of Regulatory Networks from Morphological Experimental Data

Presenting Author: Daniel Lobo, Tufts University

Michael Levin, Tufts University

Bioinformatics tools for analysis of regulatory mechanisms are generally limited to genomic or time-series concentration data. However, many crucial experiments in developmental and regenerative biology are based on manipulations and perturbations resulting in morphological outcomes. For example, planarian worms can regenerate a complete organism from almost any amputated piece, but knocking down certain genes can result in the regeneration of double-head worms. The inherent complexity and non-linearity of biological regulatory networks prevent us from manually discerning testable comprehensive models. No automated tool exists to mine the huge database of functional results, and despite hundreds of years of experiments, no model has been found that can explain more than one or two morphological results. To bridge the gulf separating morphological data from an understanding of pattern formation, we developed a software method to automate the discovery of regulatory networks from phenotypic experimental data. The method uses mathematical ontologies to unambiguously formalize surgical, genetic, and pharmacological experiments and the resultant morphological phenotypes to be explained, together with a whole-organism simulator capable of performing the same experiments in silico. We demonstrate this approach by automatically discovering the first comprehensive model of planarian regeneration, which not only explains at once all the key experiments available in the literature (including surgical amputations, knock-down of specific genes, and pharmacological treatments), but also predicts testable novel outcomes. Our approach is the first step to a bioinformatics of shape, which will pave the way for understanding complex pattern formation in developmental and regenerative biology.

Polysaccharide A: a widely distributed immunoregulatory microbial product

Presenting Author: Matthew Rhodes, University of Colorado Denver

Catherine Lozupone, University of Colorado, Denver

The human species and its microbiome have coevolved over the ages. For certain beneficial microbes, our adaptive immune system provides a selective advantage over the competition. However, the mechanisms by which this interaction is accomplished are poorly understood. Furthermore, when the adaptive immune system is compromised by diseases such as HIV, this mutualistic relationship is thrown into disarray. Here we investigate the phylogenetic distribution of a proven immunoactive microbial product, Polysaccharide A (PSA), which was previously known only from the commensal bacterial species Bacteroides fragilis, and which is known to induce anti-inflammatory IL-10-producing regulatory CD4+ T cells. By scanning >8000 publicly available complete and draft genomes, we have found that species within a wide variety of microbial lineages have operons containing close homologues to the essential genes in the PSA operon of B. fragilis, suggesting that this highly influential molecule has participated in significant lateral gene transfer. Based on the observation that PSA is required for colonization of the host mucosa by B. fragilis, we hypothesized that other PSA-encoding bacteria would also be depleted in HIV-infected subjects, because they may utilize this anti-inflammatory, CD4+ T cell-interacting signaling molecule to establish their niche. Having identified microbial lineages that produce PSA, we went on to ascertain whether these lineages are significantly impacted by HIV. Our results indicate that PSA appears to be a sufficient yet not necessary determinant of members of the human microbiome that decrease significantly in HIV patients.

Meta-analysis leads to deeper understanding of cellular senescence

Presenting Author: Chris Morrissey, Buck Institute

Marco Demaria, Buck Institute
Dave Slate, Buck Institute
Judy Campisi, Buck Institute
Sean Mooney, Buck Institute

Cellular senescence (CS) is a state of terminal mitotic arrest. It plays a role in aging, as senescent cells accumulate in older tissues and secrete a set of extracellular signal proteins (SASP) that contribute to chronic inflammation and age-related diseases. Senescence is also an anti-cancer mechanism that can be triggered by oncogenes and directly blocks tumor formation. This anti-cancer role is complicated by SASP being conducive to tumor growth in the surrounding tissue. This unique set of beneficial as well as harmful effects leads us to believe that CS is an example of antagonistic pleiotropy, wherein the evolutionary advantage senescence provides as a check against cancer outweighs its deleterious effects on older, post-reproductive organisms. As such, it provides an attractive target for therapeutic regulation.
We aim to better understand these aspects of CS by conducting a meta-analysis of various models of senescence in mouse and human tissues. We have detected characteristic changes in gene expression that are cell type independent, and are conserved between mouse and human. We are integrating transcriptomic, proteomic, and epigenetic data to better understand these signatures of senescence, allowing us to better understand the short term benefits of senescence and the long term detriments. We have detected transcriptional regulation of the genes involved with SASP, mitotic replication, and apoptosis, and are using the connectivity map to identify drugs to target these networks.

Characterizing the landscape of biomedical ontologies via deductive entailment

Presenting Author: William Baumgartner Jr, University of Colorado Denver

Kevin M. Livingston, University of Colorado Denver
Michael Bada, University of Colorado Denver
Lawrence E. Hunter, University of Colorado Denver

Biomedical ontologies have a semantic richness that is severely underutilized. While the backbone of ontologies remains grounded in the familiar subsumption (is-a) and meronymy (part-of) relationship hierarchies, there are increasing numbers and varieties of other relations being incorporated into the Open Biomedical Ontologies (OBOs). These logical definitions formally define ontology terms with respect to other ontology terms and in doing so, add a rich semantic interconnectedness not only within individual ontologies, but also between ontologies. While the use of biomedical ontologies is a mainstay in many standard bioinformatics practices, the rich interconnectedness of the current ontological landscape is often under-utilized or ignored altogether.

This work characterizes the richly connected network of prominent OBO ontologies and corresponding logical definitions. We explore how logical definitions enable the identification of deductively entailed concepts in this network using Prolog back-chaining. For example, when using all available definitions, the Mammalian Phenotype Ontology term for “abnormal circulating lipoprotein level” entails 191 concepts spanning chemicals, anatomy, biological processes, and other phenotypes whereas using only the subsumption hierarchy yields 11 entailed phenotype concepts. We examine the technical challenges of deductive reasoning over such a rich network in general and consider the practicalities of assessing meaning from sets of entailed concepts. We further discuss issues in assessing the semantics of the specific paths used to entail concepts and highlight issues related to finding those paths. Our results provide a comprehensive characterization of the interconnectedness of the current landscape of biomedical ontologies.
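The contrast between subsumption-only entailment and entailment over all logical definitions can be illustrated with a small sketch. The toy edges, term names, and the `entailed` helper below are hypothetical and do not come from any real OBO ontology; the work described above uses Prolog back-chaining, whereas this sketch substitutes a plain breadth-first traversal in Python to show the same idea.

```python
from collections import deque

# Hypothetical toy edges: (subject, relation, object). Not real OBO content.
EDGES = [
    ("abnormal_lipoprotein_level", "is_a", "abnormal_lipid_level"),
    ("abnormal_lipid_level", "is_a", "abnormal_homeostasis"),
    ("abnormal_lipoprotein_level", "has_entity", "lipoprotein"),
    ("lipoprotein", "is_a", "macromolecular_complex"),
    ("abnormal_lipoprotein_level", "located_in", "blood"),
    ("blood", "part_of", "circulatory_system"),
]


def entailed(term, edges, relations=None):
    """Return every concept reachable from `term`, following only the
    relation types in `relations` (all relation types when None)."""
    graph = {}
    for subj, rel, obj in edges:
        if relations is None or rel in relations:
            graph.setdefault(subj, []).append(obj)
    seen, queue = set(), deque([term])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


subsumption_only = entailed("abnormal_lipoprotein_level", EDGES, {"is_a"})
all_relations = entailed("abnormal_lipoprotein_level", EDGES)
```

In this toy graph the subsumption-only set is a strict subset of the set obtained with all relations, mirroring (at a much smaller scale) the 11 versus 191 concepts reported above.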

A Bayesian framework for Signature-driven Protein Quantification

Presenting Author: Bobbie-Jo Webb-Robertson, Pacific Northwest National Laboratory

Melissa Matzke, Unilever
Susmita Datta, University of Louisville
Samuel Payne, Pacific Northwest National Laboratory
Jiyun Kang, Pacific Northwest National Laboratory
Lisa Bramer, Pacific Northwest National Laboratory
Carrie Nicora, Pacific Northwest National Laboratory
Anil Shukla, Pacific Northwest National Laboratory
Thomas Metz, Pacific Northwest National Laboratory
Karin Rodland, Pacific Northwest National Laboratory
Richard Smith, Pacific Northwest National Laboratory
Mark Tardiff, Pacific Northwest National Laboratory
Jason McDermott, Pacific Northwest National Laboratory
Joel Pounds, Pacific Northwest National Laboratory
Katrina Waters, Pacific Northwest National Laboratory

Mass spectrometry-based proteomics has the capability to measure tens of thousands of peptides simultaneously across complex biological samples. With samples from higher organisms, translating these peptides into protein-level estimates is a major challenge for computational proteomics. A limitation of existing computationally driven protein quantification methods is that most ignore protein variation, such as alternate splicing of the RNA transcript, post-translational modifications, or other possible proteoforms, which will affect a significant fraction of the proteome. The consequence of this assumption is that statistical inference at the protein level, and consequently downstream analyses such as network and pathway modeling, have only limited power for biomarker discovery. A new approach to this problem is to utilize peptide-level signatures in a Bayesian framework (BP-Quant) to identify peptides associated with over-expressed patterns to improve relative protein abundance estimates. BP-Quant is a research-driven approach that utilizes the objectives of the experiment, defined in the context of a standard statistical hypothesis, to identify a set of peptides exhibiting similar statistical behavior relating to a protein. This approach infers that changes in relative protein abundance can be used as a surrogate for changes in function, without necessarily taking into account the effect of differential post-translational modifications, processing, or splicing in altering protein function. We verify the approach using a dilution study from mouse plasma samples and demonstrate that BP-Quant achieves accuracy similar to that of the current state-of-the-art methods at proteoform identification, with significantly better specificity.

Integrating Data in a Microbiome Context

Presenting Author: Michael Shaffer, University of Colorado - Denver

Catherine Lozupone, University of Colorado - Denver

The complex community of microorganisms that colonize the human body (the microbiome) plays an important role in human health and has been implicated in many diseases. A major challenge is to determine how the composition of microbes relates to the functional attributes they confer, and how these attributes relate to disease. Many research groups have generated “multi-omics” datasets that have the potential to link composition to function. We use ‘omics techniques including 16S and shotgun metagenomic sequencing and metabolomics to tease out the organisms, genes, reactions and metabolites associated with disease phenotypes. Traditionally, metagenomic studies are analyzed in terms of the genes present. By translating genes to the reactions they encode and from this generating metabolic networks, it is possible to determine the degree to which metagenomics datasets can explain metabolomics data. Furthermore, by relating 16S rRNA data to gene networks (using the content of related genomes as a proxy), it is possible to produce networks containing the organisms and their genes. This allows the linking of microbiome composition to predicted function and phenotypes. We have been working with a variety of datasets to define methods for predicting functional attributes from 16S rRNA and visualizing links using networks. We are also working to determine optimal methods of correcting for differential sequencing depth before predicting metagenomes. We have found that normalization before metagenome prediction affects the resulting gene sets, and we are testing different normalizations of 16S data against paired metagenomic sequencing from the same dataset to determine which normalization techniques best match metagenomic sequencing.

Dietary Influences on the Incidence of Alzheimer Disease: Converging Inferences by Disparate Algorithms

Presenting Author: George Acquaah-Mensah, MCPHS University

Ronald Taylor, Pacific Northwest National Laboratory

Unlike microarray data, in situ hybridization (ISH) data have rarely been exploited as sources of high-throughput gene expression data for transcriptional regulatory relationship network inference. Dementia is a hallmark of Alzheimer Disease (AD), and the hippocampus is a brain region associated with learning and memory. In this study, the Allen Brain Atlas mouse hippocampus and hippocampal formation ISH data were utilized as a source for gene expression measurements. Three disparate, high-performing network inference algorithms: the ordinary differential equations-based Inferelator, the mutual information-based Context Likelihood of Relatedness (CLR), and the tree-based GEne Network Inference with Ensemble of trees (GENIE3) were used to explore these data for insights into the AD etiology.

Unique ISH data in the hippocampal fields were extracted. Focusing on 275 genes relevant to neurodegeneration, transcriptional regulatory networks were learned using Inferelator, CLR, and GENIE3. Each gene can have up to 159,326 expression values, one per voxel. A human post-mortem hippocampal microarray dataset from AD and control subjects was superposed on the orthologous versions of the networks.

There was consensus among the three inference algorithms regarding regulatory roles for USF2, SP1, CDKN1A, EGR1, and NFE2L1 in the overall network. Some network hubs are affected by glucose or by high dietary fats, both of which are associated with AD. The similarities between diabetes mellitus and AD have been documented. Furthermore, high-fat diet enhances the hallmarks of AD. These findings shed light on possible environment-gene interactions on AD. Further, they demonstrate the utility of ISH in transcriptional regulatory network inference.

Comparative genomics between frogs Xenopus laevis and Xenopus tropicalis.

Presenting Author: Gonzalo Riadi, Pontificia Universidad Católica de Chile

Juan Larraín, Center for Aging and Regeneration and Millennium Nucleus in Regenerative Biology
Francisco Melo, Molecular Bioinformatics Laboratory

The clawed African frog Xenopus laevis has been one of the main animal models for genetic studies in developmental biology. However, for molecular studies, Xenopus tropicalis has been the experimental model of choice due to a simpler genome that does not result from genome duplication as in the case of X. laevis. Today, over 80% of the X. tropicalis genome and 90% of the X. laevis genome have been sequenced, although both remain in a large number of scaffolds. There is a growing expectation for a comparative physical map that can be used as a Rosetta Stone between X. laevis genetic studies and X. tropicalis genomic research. In this work, we have mapped, through coarse-grained alignment, the 3,169 largest scaffolds of X. laevis onto the 10 reference scaffolds representing the haploid genome of X. tropicalis. After validating the map with empirical and theoretical data, and establishing an average of 44.57% identity between the two species, we analyzed the synteny of 3,475 orthologous genes. Interestingly, although we found that 99.4% of genes are in the same order, we estimate that up to 10% of the genes may have undergone some type of rearrangement. Taken together, our results established the correspondence between half of both genomes, providing a new and more comprehensive tool for comparative analysis of these species.
Acknowledgements: Fondecyt project 3130441; Iniciativa Cientifica Milenio No. P09-016-F and No. P07/011-F.

Comparison of Phylogeographic Node Flux with Local Disease Trends

Presenting Author: Daniel Magee, Arizona State University

Matthew Scotch, Arizona State University

Discrete phylogeography is a widely used approach for inferring ancestral history of genetic lineages. For virus sequences, this can be used to estimate the spread of an outbreak. However, by default, analyses of discrete phylogeography do not consider the specific characteristics of each sampled location as a means for validating the ancestral reconstruction. For example, for a communicable virus like influenza the discrete locations, or nodes, that are abundant within a tree should in theory contain a high population density, large travel network, or other defining trait to support the node’s virus flux.

In this talk, we will discuss our approach to validating discrete phylogeography estimates of virus spread by linking ancestral models with local data indicative of disease trends. We will consider five recent phylogeographic studies on both a communicable virus, influenza, and a vector-borne virus, West Nile Virus (WNV). For influenza, we will analyze human population density, airports, and incidence of influenza at each node. For WNV, we will analyze human population density, humidity, and incidence of WNV at each node. Here, we will consider both the influx and outflux of the ancestral nodes for both viruses. The influx represents the number of branches leading to the location and outflux represents the number of branches diverging from the location. We will use our work to formulate a hypothesis on the efficacy of discrete phylogeographic studies and to determine if our descriptive approach could be an effective metric of or quality control for phylogeographic inference.
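The influx and outflux definitions above can be made concrete with a short sketch. The branch list and the `node_flux` helper are hypothetical illustrations, and ignoring branches that stay within one location is an assumption of this sketch, not something stated in the abstract.

```python
from collections import Counter

# Hypothetical tree: each branch is (parent_location, child_location),
# i.e. the inferred location at the two ends of one branch.
BRANCHES = [
    ("Hong Kong", "New York"),
    ("Hong Kong", "London"),
    ("New York", "Phoenix"),
    ("New York", "New York"),  # no change of location along this branch
    ("London", "New York"),
]


def node_flux(branches):
    """Count branches leading to (influx) and diverging from (outflux)
    each location. Branches that stay within one location are ignored,
    which is an assumption made for this sketch."""
    influx, outflux = Counter(), Counter()
    for parent, child in branches:
        if parent != child:
            outflux[parent] += 1
            influx[child] += 1
    return influx, outflux


influx, outflux = node_flux(BRANCHES)
```

The resulting per-location counts could then be set beside local covariates such as population density or disease incidence, as the talk proposes.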

Co-expression Network Hubs in Hypoxia and Acclimatization to High Altitude

Presenting Author: Daniel Dvorkin, University of Colorado

Robert Roach, University of Colorado

The AltitudeOmics project studied the transcriptomic and epigenomic mechanisms that humans employ to counteract the physiological challenge posed by high-altitude hypoxia. We exposed 21 healthy subjects recruited at sea level to high altitude while obtaining a variety of physiological and genomic measurements to follow the process of acclimatization.

During the two-week exposure period, almost half the genes in the genome show significant changes in expression. Weighted gene co-expression network analysis (WGCNA) classifies these genes into "modules" of genes that are highly connected (co-expressed) across the course of the experiment. As in most networks, some "hub" genes are connected to large numbers of other genes, and within the module framework, certain genes are especially strongly connected to other genes within the module. Previous work in module-based network analysis has largely focused on these module "intrahubs." Here we discuss a subset of these that may also be classified as "interhubs," that is, within-module hub genes that are also highly connected to other hub genes outside the module.

We hypothesize that these genes are crucial to information flow throughout the network, particularly between pairs of modules in which one module shows a pattern of early expression change that is then mirrored by late change in the other module. We show examples of this phenomenon, discuss and contrast the characteristics of intrahubs and interhubs relative to other genes, and propose experiments to test the hypothesis that interhubs are of particular importance in the response to hypoxia.
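The distinction between connectivity inside and outside a module can be sketched as follows. This is not the AltitudeOmics analysis: the expression values, module labels, the `connectivity` helper, and the soft-threshold power `beta=6` are illustrative assumptions, and the outside-module score here sums over all outside genes rather than only outside hub genes. The adjacency used is the unsigned form |cor|^beta familiar from weighted co-expression analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def connectivity(expr, modules, beta=6):
    """For each gene, return (within-module, outside-module) connectivity
    as sums of the unsigned adjacency |cor| ** beta."""
    result = {}
    for g in expr:
        k_in = k_out = 0.0
        for h in expr:
            if h == g:
                continue
            adjacency = abs(pearson(expr[g], expr[h])) ** beta
            if modules[h] == modules[g]:
                k_in += adjacency
            else:
                k_out += adjacency
        result[g] = (k_in, k_out)
    return result


# Hypothetical expression profiles over five time points.
EXPR = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10.5],
    "g3": [5, 4, 3, 2, 1],
    "g4": [1, 3, 2, 5, 4],
}
MODULES = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
scores = connectivity(EXPR, MODULES)
```

A gene with a high first score would be a candidate intrahub; one that also has a high second score would be a candidate interhub under this simplified definition.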

Network-based analysis for large environment microbial genomics data

Presenting Author: Erliang Zeng, University of South Dakota

Wei Zhang, WalmartLabs
Scott Emrich, University of Notre Dame
Stuart Jones, University of Notre Dame
Abdelali Barakat, University of South Dakota
Joshua Livermore, University of Notre Dame
Dan Liu, University of Notre Dame

Early environmental microbe studies characterized a limited snapshot of microbial diversity. The availability of huge amounts of genome sequence data from natural microbial consortia enables integrated analysis to resolve the genetic and metabolic potential of microbial communities, to establish how functions are partitioned in and among populations, and to reveal how microbial communities evolve and adapt across multiple environments. In this study, we propose to analyze comparative microbial genomes from three environments (human gut, soil, and marine) using network preservation analysis. The developed computational framework was able to identify functional modules and evaluate the functional roles of those modules in microbial communities in response to environmental change. We found modularized environmental adaptation among microbial communities: it is the gene module, rather than the individual gene, that serves as the unit of evolution in microbial communities. Overall, the gene network from soil samples shows a more coherent structure, and the network from marine samples shows the largest divergence. We observed that the wet environments (gut and marine) tend to be more complex than the dry environment (soil), due to the dynamics of wet environments. We demonstrated that modules with strong evidence of preservation contain ubiquitous pathways that contribute to the metabolism of major nutrient substances, while non-preserved modules include many environmentally dependent pathways that are expected to be responsive to environmental change.

Predicting survival for diverse patient cohorts using large-scale cancer genomics data

Presenting Author: Nicolle Witte, University of Colorado - Anschutz Medical Campus

James Costello, University of Colorado - Anschutz Medical Campus

A fundamental challenge in precision medicine is to identify the genomic features that are predictive of patient response to drug treatment and overall survival. Large, nation-wide efforts, such as The Cancer Genome Atlas (TCGA), have characterized the genomes, transcriptomes and proteomes from tens of thousands of patients across many tumor types. These data present a rich source to characterize predictive genomic features, but caution must be taken when associating patient features with outcome. TCGA represents a heterogeneous sampling of the population with inconsistent patient demographics and patients being treated with an array of therapies. Here, we apply machine learning approaches (lasso regression and random forests) to predict patient survival across several –omics data sources (gene expression, copy number, methylation and protein quantification). Our predictions are based on stratifying patients into cohorts defined by clinical variates, drug treatments and demographics. Here, we explore four cancers: glioblastoma, ovarian, kidney, and lung adenocarcinoma. We compare our results to published results that do not consider clinical or demographic information when predicting survival and show that stratifying the population is critical to identifying features that are predictive of survival. Our results demonstrate differences in predictive ability when accounting for clinical variates across data sources. For example, we see significant differences in performance when comparing temozolomide-treated vs. untreated glioblastoma patients using methylation data. Our results represent a benchmarking dataset for survival prediction using genomics data. We also identify genomic signatures associated with different patient cohorts.

Global Survey of Protein Complexes from the Sulfate Reducer Desulfovibrio vulgaris: evidence for lower connectivity within stable bacterial interactomes

Presenting Author: Maxim Shatsky, Lawrence Berkeley National Lab

Simon Allen, UCSF
Barbara Gold, LBNL
Nancy Liu, LBNL
Thomas Juba, University of Missouri
Sonia Reveco, LBNL
Dwayne Elias, ORNL
Ramadevi Prathapam, LBNL
Jennifer He, LBNL
Wenhong Yang, LBNL
Evelin Szakal, UCSF
Haichuan Liu, UCSF
Mary Singer, LBNL
Jil Geller, LBNL
Bonita Lam, LBNL
Avneesh Saini, LBNL
Valentine Trotter, LBNL
Steven Hall, UCSF
Susan Fisher, UCSF
Steven Brenner, UC Berkeley
Mark Biggin, LBNL
Swapnil Chhabra, LBNL
Terry Hazen, University of Tennessee
Judy Wall, University of Missouri
Ewa Witkowska, UCSF
John-Marc Chandonia, LBNL
Gareth Butland, LBNL

Sulfate Reducing Microorganisms (SRMs) derive energy from the dissimilatory reduction of sulfate. Desulfovibrio vulgaris Hildenborough is the most intensively studied SRM and serves as a model. Here we report a global survey of protein-protein interactions in D. vulgaris using tandem affinity purification and mass spectrometry (MS). Building on our previous work which enabled rapid locus-specific modification of the D. vulgaris chromosome, we generated and successfully tested 947 gene affinity-tagged D. vulgaris strains. We developed a novel computational pipeline that significantly reduced the false discovery rate of identified interactions, allowing 459 high-confidence protein-protein interactions to be detected with a 17% false discovery rate. Our high-confidence interactome contains many novel interactions associated with existing or new complexes and includes 145 previously unannotated or hypothetical proteins. Our binary protein-protein interactions include 21% same-operon pairs, significantly more than reported for any previous bacterial affinity purification MS study. This observation as well as additional analysis implies that the networks proposed in these earlier studies are largely comprised of false positives and that bacterial stable interactomes are less connected than these studies implied.

EMIRGE 2: Improved resolution of microbial community structure

Presenting Author: Adrienne Narrowe, University of Colorado Denver

Casey Bleeker, University of Colorado Denver
Cuining Liu, University of Colorado Denver
Christopher S. Miller, University of Colorado Denver

Sequencing of the 16S rDNA gene is a popular method to characterize bacterial community membership, abundance, and spatial and temporal dynamics. However, current high-throughput sequencing read lengths limit most studies to sequencing only 30% of the length of the full 16S rRNA gene. This reduction in information content can confound community profiling and abundance estimates by hindering our ability to distinguish closely related organisms.

EMIRGE is an approach designed to exploit the depth of high-throughput sequencing without the loss of information content experienced in a typical short-read amplicon sequencing experiment. By iterative mappings against and refinements to a set of candidate sequences, EMIRGE reconstructs near full-length 16S rRNA gene sequences, including novel sequences not represented in the candidate set, and provides accurate sequence abundance estimates.

We improved the EMIRGE algorithm and systematically characterized parameter space in the face of rapidly changing sequencing technologies. Using realistic simulated datasets, we find increased sensitivity and specificity with the updated algorithm. This ability to more accurately reconstruct community structure from complex communities results from changes to: 1) composition of the candidate sequence database; 2) a sequence-merging heuristic; 3) using a new, indel-aware read mapper; and 4) changes to read preprocessing and postprocessing of output sequences.

We applied EMIRGE 2 to the characterization of bacterial and archaeal diversity in methanogenic freshwater wetlands sediments across 87 samples spanning hydrological and depth gradients. The complex community shifts substantially across these geographic and geochemical gradients, implicating specific organisms in carbon cycling and methane dynamics.

RNABindRPlus Predicts RNA-Protein Interface Residues in Multiple Protein Conformations

Presenting Author: Carla Mann, Iowa State University

Rasna Walia, Iowa State University
Drena Dobbs, Iowa State University

RNABindRPlus (http://einstein.cs.iastate.edu/RNABindRPlus/) is a sequence-based machine learning program that predicts RNA-interacting residues in protein sequences. RNABindRPlus combines an optimized Support Vector Machine (SVM) classifier with a sequence homology-based predictor (HomPRIP) to predict which residues of a protein directly participate in the RNA-protein interface (Walia et al., 2014 PLoS ONE). On several benchmark datasets, RNABindRPlus outperforms other available sequence- and structure-based methods for predicting interfacial residues in RNA binding proteins (0.72 % Specificity; 0.86 AUC of ROC on RB44 test set).
To further improve the performance of RNABindRPlus, we investigated the source of false positive predictions. We hypothesized that certain “false positive” interfacial residue predictions might correspond to actual interfacial residues in a different structural conformation of the RNA-protein complex under consideration. Here we report that many apparently “false positive” predictions made by RNABindRPlus are, in fact, interfacial residues. Thus, using only sequence information as input, RNABindRPlus can recognize amino acids that are involved in binding RNA when the protein is in alternate conformations. Because highly similar sequences were eliminated in constructing the RB198 dataset on which RNABindRPlus was trained (to avoid biasing the SVM classifier), structures with identical or nearly identical sequences but differing conformations were removed. Thus, instances in which a specific protein has been observed to make alternative contacts with an RNA partner were not included in cross-validation experiments. We conclude that RNABindRPlus has a lower false positive rate than previously reported. An updated version of the RNABindRPlus webserver that incorporates these results is under development.

Integrative Genomics Approaches for Predicting Drug Synergies

Presenting Author: Andrew Goodspeed, University of Colorado-Anschutz Medical Campus

James Costello, UCD-AMC
Heide Ford, UCD-AMC
Andrew Thorburn, UCD-AMC
Annie Jean, UCD-AMC

High-throughput, genome-scale technologies offer the opportunity to study disease and response to therapeutics at an unprecedented resolution; however, extracting relevant biology from these data remains a challenge. Here, we present a novel approach to integrate gene expression and genome-wide synthetic lethal data to predict synergistic drug combinations along with experimental validation.
To demonstrate our methodology, we integrated gene expression sampled from TNF-related apoptosis-inducing ligand (TRAIL)-treated B lymphoma cells and a genome-wide shRNA screen with TRAIL-treated cells. We identified and then rank-ordered genes that, upon treatment, both increased in expression and were synthetic lethal. Using the Drug Gene Interaction database, we compiled sets of genes targeted by a drug and tested these sets for enrichment in the ranked gene lists using the Kolmogorov-Smirnov test. We predicted that antagonists of the α-1 adrenergic receptor would be synergistic with TRAIL and validated this finding by showing that Prazosin sensitized cells to TRAIL treatment.
While our method and results are encouraging, the approach still relies on previous annotation. To identify novel relationships, we present two additional approaches that rely less on previously annotated genes, including one approach that is completely data driven. This approach uses a gene-gene interaction network constructed from a gene expression compendium. Gene expression and synthetic lethal interactions are mapped onto the network. Novel predictions of drug combinations are based on network clustering to identify druggable gene hotspots. The approaches we present can be applied to any set of gene expression and synthetic lethal screening data.
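The enrichment step described above (testing a drug's target genes against a ranked gene list with the Kolmogorov-Smirnov test) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene identifiers and the `ks_statistic` helper are hypothetical, and only the two-sample statistic D is computed, without a p-value.

```python
def ks_statistic(ranked_genes, gene_set):
    """Two-sample Kolmogorov-Smirnov statistic D between the rank
    positions of genes in `gene_set` and those of all other genes in
    `ranked_genes` (best-ranked gene first)."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_in = sum(in_set)
    n_out = len(ranked_genes) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("gene set must overlap the list without covering it")
    cdf_in = cdf_out = d = 0.0
    for hit in in_set:
        # Advance the empirical CDF of whichever group this rank belongs to.
        if hit:
            cdf_in += 1 / n_in
        else:
            cdf_out += 1 / n_out
        d = max(d, abs(cdf_in - cdf_out))
    return d


# Hypothetical ranking: T* are targets of one drug, G* are other genes.
RANKED = ["T1", "T2", "G1", "T3", "G2", "G3", "G4", "G5"]
TARGETS = {"T1", "T2", "T3"}
d_top = ks_statistic(RANKED, TARGETS)

# The same targets scattered evenly through the list give a smaller D.
SPREAD = ["G1", "T1", "G2", "G3", "T2", "G4", "G5", "T3"]
d_spread = ks_statistic(SPREAD, TARGETS)
```

A larger D indicates that the drug's targets are concentrated at one end of the ranking; a full analysis would also need a significance estimate and a check of which end they concentrate at.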

Prediction and validation of genes regulating breast cancer metastasis.

Presenting Author: Eran Andrechek, Michigan State University

Metastasis is responsible for mortality associated with breast cancer. Despite this, the molecular underpinnings of cancer progression are poorly understood. By integrating bioinformatics with genetics we are examining the events mediating metastasis using mouse model systems. We assembled a gene expression database of over 1200 mouse breast cancers from 23 different models. Using this database, transcription factor motif binding predicted E2F transcription factors were overrepresented in metastatic models. Gene expression signatures for specific E2Fs were applied and predicted strong activity in metastatic models. Further bioinformatic analysis predicted that E2F transcription factors were active in a model of HER2+ breast cancer, a type that composes 25% of human breast cancer. To test the hypothesis that E2Fs regulate HER2 breast cancer, transgenic mice overexpressing an ortholog of HER2 were interbred with E2F transcription factor knockout mice. Strikingly, there was a significant reduction in metastasis with loss of either E2F1 or E2F2. To determine the mechanism by which E2Fs regulated metastatic progression, we examined gene expression profiles of tumors in both control and E2F knockout tumors. Importantly, genes identified as being lost in the non-metastatic model system are co-amplified in human HER2+ breast cancer in a large amplicon. This amplicon is present in over 30% of HER2+ human breast cancers and we show a strong association with metastasis. Taken together, by integrating bioinformatics with mouse models we demonstrated that E2F transcription factors play a pivotal role in tumor metastasis.

The Inverse Relationship Between Sample Size and Differentially Expressed Genes in the selected Microarray Experiments

Presenting Author: Akram Samarikhalaj, Ryerson University

Asli Uyar, Okan-asliuyar@gmail.com
Ayse Basar Bener, Ryerson University- ayse.bener@ryerson.ca

In microarray studies, sample size determination is a concern for biology and life science researchers: results obtained from small samples are not accurate, while large samples are expensive and time-consuming to obtain. In this study, three experiments with 8 or more samples per condition were collected from the EMBL-EBI public database to investigate the relationship between sample size and the number of differentially expressed genes. An unpaired t-test with a p-value threshold of 0.05 was used to identify the number of genes that are differentially expressed in all arrays. The results obtained after Robust Multi-Chip Average (RMA) normalization and removal of outliers from the samples using PCA plots indicate that the mean total number of differentially expressed genes decreases as the sample size increases. The standard deviation from the mean also decreases as the sample size increases. This result is reliable if the samples are of high quality and the outliers are removed. The total number of common genes was an indicator for identifying the outliers: it increases unusually when the outliers are not removed from the samples.
Conclusion: There is an inverse relationship between the sample size and the number of differentially expressed genes in these three selected experiments. This conclusion can be generalized to all microarray experiments if the selected samples are of high quality.

Calculating prior and posterior probabilities of concepts in conceptually annotated corpora

Presenting Author: Negacy Hailu, University of Colorado

In a natural language, predicates have restrictions on their arguments. For example, the direct object of the verb phosphorylate is likely to be a protein, and the object of the verb methylate is likely to be DNA. This concept is called selectional preference in linguistics. Selectional preference is a very important technique for understanding the semantics of specialized domains, such as molecular biology. We are interested in determining if there are predicates, including verbs and nominalizations, in the CRAFT corpus that have restrictions on the semantic classes of the arguments that they take. The calculation of selectional preferences is not mathematically complicated. Conceptual annotations, e.g. of concepts from ontologies, should make it easier to determine selectional preferences, since the semantic classes are a given. In practice, however, determining what should be the denominator in calculating both prior and posterior distributions is quite complicated when dealing with conceptually annotated corpora. A number of competing approaches are presented, and their consequences are discussed.
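As a rough illustration of the prior and posterior distributions mentioned above, the sketch below computes a prior P(class), a posterior P(class | predicate), and a relative-entropy preference strength for each predicate. This is one common formulation, not necessarily the one used in this work. The function name, the toy observations, and in particular the choice of denominator (all annotated argument tokens for the prior, all arguments of a predicate for the posterior) are assumptions made for the example.

```python
import math
from collections import Counter


def selectional_preference(pairs):
    """Estimate priors and preference strengths from (predicate, class) pairs.

    Assumption: the prior denominator is the total number of annotated
    argument tokens; the posterior denominator is the number of arguments
    observed with the given predicate.
    """
    pairs = list(pairs)
    total = len(pairs)
    class_counts = Counter(cls for _, cls in pairs)
    predicate_counts = Counter(pred for pred, _ in pairs)
    joint_counts = Counter(pairs)

    # Prior distribution over semantic classes, P(class).
    prior = {cls: n / total for cls, n in class_counts.items()}

    # Preference strength: KL divergence of P(class | predicate) from P(class).
    strength = {}
    for pred, n_pred in predicate_counts.items():
        kl = 0.0
        for cls in class_counts:
            n_joint = joint_counts[(pred, cls)]
            if n_joint:
                posterior = n_joint / n_pred
                kl += posterior * math.log2(posterior / prior[cls])
        strength[pred] = kl
    return prior, strength


# Hypothetical toy observations, not taken from the CRAFT corpus.
observations = [
    ("phosphorylate", "protein"), ("phosphorylate", "protein"),
    ("methylate", "DNA"), ("methylate", "DNA"),
    ("bind", "protein"), ("bind", "DNA"),
]
prior, strength = selectional_preference(observations)
```

In this toy data a predicate that always takes the same class scores high, while one that takes both classes equally scores zero. Changing either denominator changes both distributions, which is where the difficulty described above arises.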

RNABindRPlus Parallelization: Faster Prediction of RNA-Binding Residues in Proteins

Presenting Author: John Hsieh, Iowa State University

Rasna Walia, Iowa State University
Drena Dobbs, Iowa State University

RNABindRPlus (http://einstein.cs.iastate.edu/RNABindRPlus/) uses a sequence-based machine learning method to predict potential RNA-binding residues in proteins. It is free software hosted on a modest webserver at Iowa State University. The rate-limiting step in the prediction process is the calculation of position-specific scoring matrices (PSSMs), which can require up to 10 minutes per sequence (using PSI-BLAST). To alleviate this bottleneck, we recently implemented a Fortran script that invokes MPICH to distribute the PSSM calculation to multiple cores on the RNABindRPlus server. To evaluate the parallelized version of RNABindRPlus, we executed test jobs in which predictions were made for sets of protein sequences (ranging from 1 to 16 different protein sequences) and compared the runtimes to those of similar jobs run serially on the webserver. We obtained a dramatic decrease in runtimes using the new parallelized version of RNABindRPlus compared to the original serial version. Other performance metrics indicated that we had achieved even data distribution and load balancing, despite using a simple random assignment of PSSM calculations to the multiple cores. Thus, this study not only resulted in improved performance of RNABindRPlus, but has also demonstrated a potential open source solution for relatively small servers that do not have MPI installed. In future work, we plan to use DELTA-BLAST instead of PSI-BLAST to generate PSSMs, which should provide an additional boost in performance, both in terms of runtime and prediction sensitivity. The new parallelized version of RNABindRPlus can be freely accessed at: http://einstein.cs.iastate.edu/RNABindRPlus/index_parallel.html

A phylogenomic approach to assess global genetic diversity and intrapatient evolution of clinical Mycobacterium abscessus strains

Presenting Author: Rebecca Davidson, National Jewish Health

Nabeeh Hasan, University of Colorado Denver
Paul Reynolds, National Jewish Health
Sarah Totten, National Jewish Health
Benjamin Garcia, University of Colorado Denver
Adrah Levin, National Jewish Health
Charles Daley, National Jewish Health
Michael Strong, National Jewish Health

Nontuberculous mycobacterial (NTM) infections caused by Mycobacterium abscessus are responsible for a range of disease manifestations from pulmonary to skin infections and are notoriously difficult to treat due to innate resistance to antibiotics. Previous population studies of clinical M. abscessus utilized multilocus sequence typing or pulsed-field gel electrophoresis, but genetic diversity at the whole-genome level has not been well characterized at high resolution, particularly among clinical isolates derived in the United States (US). We performed whole genome sequencing of eleven clinical M. abscessus isolates derived from eight US patients with pulmonary NTM infections, compared them to 30 globally diverse clinical isolates and investigated intra-patient genomic diversity and evolution. Phylogenomic analyses revealed a cluster of closely related US and Western European-derived M. abscessus ssp. abscessus isolates that are genetically distinct from other European and all Asian isolates. Large-scale variation analyses suggested genome content differences of 0.3 – 8.3% relative to the reference strain, ATCC 19977T. Longitudinally sampled isolates showed very few single nucleotide polymorphisms and correlated genomic deletion patterns, suggesting homogeneous infection populations. Our study explores the genomic diversity of clinical M. abscessus strains from multiple continents, and provides insight into the genome plasticity of an opportunistic pathogen.

Discovering Disease Associated Molecular Interactions Using Discordant Correlation

Presenting Author: Charlotte Siska, University of Colorado Anschutz Medical Campus

A common approach for identifying molecular features (such as transcripts or proteins) associated with disease is testing for differential expression or abundance in –omics data. However, this approach is limited for studying interactions between molecular features, which would give a deeper knowledge of the relevant molecular systems and pathways. We have developed a method for this purpose that we call the Discordant method. The Discordant method measures the posterior probability that a pair of features has discordant correlation between phenotypic groups using mixture models and the EM algorithm. We compare our method to two existing approaches: one that uses Fisher’s transformation in a classical frequentist framework and another that uses an Empirical Bayes joint probability model. We show with simulations and miRNA-mRNA glioblastoma data from the Cancer Genome Atlas that the Discordant method performs better in predicting related feature pairs. In simulations we demonstrate that while all of the methods have similar specificity, the Discordant method has better sensitivity and is better at identifying pairs that have a correlation coefficient close to 0 in one group and a largely positive or negative correlation coefficient in the other group. Using the glioblastoma data, which has matched samples between miRNA and mRNA, we find that the Discordant method finds relatively more glioblastoma-related miRNAs compared to other methods. We conclude from the results in both simulations and glioblastoma data that the Discordant method is more appropriate for identifying molecular feature interactions unique to disease.
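For readers unfamiliar with the classical comparison approach mentioned above, the following is a minimal sketch of the standard Fisher z-transformation test for a difference between two independent correlation coefficients. It illustrates the frequentist baseline only, not the Discordant method. The function name and the example values are assumptions made for the illustration.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Test whether two independent correlations differ.

    r1, r2: correlation of a feature pair in group 1 and group 2.
    n1, n2: number of samples in each group (each must exceed 3).
    Returns the z statistic and a two-sided normal p-value.
    """
    # Fisher's transformation makes the sampling distribution of r
    # approximately normal with variance 1 / (n - 3).
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical pair: strongly correlated in one group, uncorrelated in the other.
z_stat, p_val = fisher_z_difference(0.8, 50, 0.0, 50)
```

A pair with similar correlations in both groups gives a z statistic near zero, whereas the hypothetical pair above yields a large statistic and a small p-value.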

A pipeline for virus phylogeography that accounts for geospatial observation error

Presenting Author: Matthew Scotch, Arizona State University

Robert Rivera, Arizona State University
Tasnia Tahsin, Arizona State University
Rachel Beard, Arizona State University
Mari Firago, Arizona State University
Davy Weissenbacher, Arizona State University
Garrick Wallstrom, Arizona State University
Graciela Gonzalez, Arizona State University

Tracking evolutionary changes in viral genomes and their spread often requires the use of data deposited in public databases such as GenBank or the Influenza Research Database. Sequences and their metadata can be downloaded and imported into software applications that generate phylogeographic models for surveillance. These models require the geospatial assignment of taxa, which is often obtained from GenBank metadata. Unfortunately, geospatial metadata such as host location is uncertain in GenBank, with a median of only 30% containing precise location such as a county or town. For example, information such as China or USA was indicated instead of Beijing or Seattle. While town or county might be included in the corresponding journal article, this valuable information is not available for immediate use unless it is extracted and then linked back to the appropriate sequence.

This work focuses on developing and applying information extraction and statistical phylogeography approaches to enhance models that track evolutionary changes in viral genomes and their spread. We will discuss the design and initial results of a framework that uses natural language processing for the automatic extraction of relevant geospatial data from the literature, and assigns a confidence between such geospatial mentions and the GenBank record. Our system will use these locations and the estimates as observation error in the creation of phylogeographic models of zoonotic virus spread.

Prediction of bacteriocin associated operons

Presenting Author: James Morton, University of Colorado Boulder

Iddo Friedberg, Miami University
Stefan Freed, University of Notre Dame
Shaun Lee, University of Notre Dame

Bacteriocins are peptide-derived molecules produced by bacteria, which function as virulence factors, antibiotics, and signaling molecules. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are very diverse and suggest bacteriocins are widely distributed among bacterial species. However, many tools struggle with identifying bacteriocins due to the large sequence and structural diversity of bacteriocins. Bacteriocins are derived from their precursor via a pathway comprising several genes known as context genes. Although bacteriocins themselves are structurally diverse, context genes have been shown to be similar across unrelated species. Our goals are: (1) to identify new candidates for context genes which may clarify how bacteriocins are synthesized, and (2) to identify new candidates for bacteriocins which bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon Associator (BOA), that can identify homologous bacteriocin associated gene clusters and predict novel bacteriocin associated gene clusters.

A Novel Visualization Technique for the Discovery of Inter-strain Recombination in Apicomplexans

Presenting Author: Javi Zhang, University of Toronto

Asis Khan, NIH
Michael E. Grigg, NIH
Andrea Kennard, NIH
John Parkinson, Hospital for Sick Children

Apicomplexan parasites, which include Toxoplasma gondii and Plasmodium falciparum, the causative agent of malaria, cause more than a million deaths annually. The population structures of Apicomplexan parasites are a key factor for their evolutionary and virulence potential. However, they are poorly described by existing visualization techniques, such as phylogenetic trees and neighbor-nets, because recombination between strains means that genetic distance is no longer an accurate metric of genetic relationship. To overcome this challenge, we have developed a novel visualization pipeline that shows inter-strain genetic relationships on both the global and the local level. Here, we apply this pipeline to a geographically diverse set of T. gondii strains to determine the scale and frequency of inter-strain recombination.
Whole genome sequences of the T. gondii strains are aligned and clustered using the Markov clustering algorithm (MCL) on the basis of genome-wide single nucleotide polymorphisms (SNPs). The genomes are then divided into 10kb segments, and each clustered individually using MCL. The longitudinal clustering patterns along the genomes are used to detect recombination events and regions of recent common ancestry between strains. By combining the two analyses into a single network view, we achieve an accurate, high-resolution representation of the inter-strain genetic relationships. Our results show regions of recombination between both genetically and geographically diverse strains, highlighting the crucial role recombination plays in shaping T. gondii population structure.
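The segment-wise step described above can be sketched as follows. The sketch assigns variable sites to 10 kb segments and counts pairwise allele mismatches between strains within each segment, giving the kind of per-segment distances a clustering algorithm such as MCL could consume. The input layout, the function name, and the toy genotypes are assumptions made for illustration; the clustering itself is not shown, and this is not the authors' pipeline.

```python
from collections import defaultdict
from itertools import combinations


def segment_distances(genotypes, segment_size=10_000):
    """Count pairwise SNP mismatches between strains per genome segment.

    genotypes: {(chromosome, position): {strain: allele}} with 1-based
    positions (an assumed input layout).
    Returns {(chromosome, segment_index): {(strain_a, strain_b): mismatches}}.
    """
    distances = defaultdict(lambda: defaultdict(int))
    for (chrom, pos), calls in genotypes.items():
        # Positions 1..10000 fall in segment 0, 10001..20000 in segment 1, etc.
        key = (chrom, (pos - 1) // segment_size)
        for a, b in combinations(sorted(calls), 2):
            distances[key][(a, b)] += int(calls[a] != calls[b])
    return {key: dict(pairs) for key, pairs in distances.items()}


# Hypothetical toy genotypes for three strains at three variable sites.
toy_genotypes = {
    ("chrIa", 500): {"ME49": "A", "GT1": "G", "VEG": "A"},
    ("chrIa", 9_999): {"ME49": "C", "GT1": "C", "VEG": "T"},
    ("chrIa", 10_001): {"ME49": "T", "GT1": "T", "VEG": "T"},
}
per_segment = segment_distances(toy_genotypes)
```

Comparing which strains are close in each segment along a chromosome is what allows a change in clustering pattern to be read as a candidate recombination breakpoint.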

Computational Methods to Study Differential Transcription

Presenting Author: Debra Goldberg, University of Colorado Boulder

Suzanne Gallagher, University of Colorado Boulder
May Alhazzani, University of Colorado Boulder
Leslie Seitz, Fairview High School

While the general process of gene transcription is well understood, the mechanisms by which different genes are activated in different conditions or different cell types are not. Transcription must be precisely controlled for proper development and response to differing conditions. The Mediator protein complex is essential for most transcription in eukaryotes, and seems to have a role in differential transcription. CDK8 and CDK19 are homologous proteins that function similarly, alternatively occupying the same position in the CDK module of Mediator. We wish to identify the functional differences between CDK8 and CDK19 by considering how the presence of each one impacts the transcriptional program that Mediator helps regulate. Towards this end, we have studied differences in gene expression associated with the presence of CDK8 and CDK19 under various stress conditions. We have developed methods to predict the transcription factors (TFs) that play a role in this differential gene expression both directly and indirectly. Many TFs are implicated with both CDK8 and CDK19, so we have also identified TFs that are significantly more associated with one than the other.

Modeling Binding Affinity of the Multiple Zinc-Finger Protein Prdm9

Presenting Author: Greg Carter, The Jackson Laboratory

Michael Walker, The Jackson Laboratory
Timothy Billings, The Jackson Laboratory
Alexander Fine, The Jackson Laboratory
Kenneth Paigen, The Jackson Laboratory
Petko Petkov, The Jackson Laboratory

Mammalian genomes encode hundreds of zinc finger proteins (ZFPs) that bind DNA, but most have unknown functions. One well-characterized example, Prdm9, recognizes and binds with multiple tandem zinc fingers to regulate genomic locations of meiotic recombination. Over 20 alleles of Prdm9 are known in mice, each containing a unique array of between 7 and 17 zinc fingers. Each allele is expected to bind a different set of loci, providing a natural experimental system for investigating how ZFPs recognize DNA sequence. To study Prdm9 site selection, we used a novel, in vitro sequencing strategy called Affinity-Seq to assess binding of two mouse Prdm9 alleles to genomic DNA. We found over 30,000 significant binding sites for each allele and quantified the frequency of binding at each sequence. This enabled estimation of binding affinity at each site in addition to standard nucleotide frequencies. The vast majority (95%) of sites contained an allele-specific binding motif, suggesting a single binding sequence for each Prdm9 allele. We identified a few core nucleotides required for binding. However, analysis of F1 hybrid mice suggested variability at all bases affects binding frequency. To assess the importance of each nucleotide, we performed linear regression to model effects on binding affinity. Quantitative data for thousands of sites allowed us to test additive and interactive effects for all bases, revealing multiple nucleotide-nucleotide interactions that drive binding. Our work provides a detailed view of DNA binding by Prdm9 and unprecedented power to assess the complex rules of zinc finger binding specificity.

Taking an omics approach to phenotype responses to cigarette smoke in humans in vivo and in vitro: comparison with healthy smokers and those with COPD

Presenting Author: Mark Pau-Clark, Imperial College London

Neil Galloway-Phillipps, Imperial College London
Paul Armstrong, Queen Mary's University London
Clare Ross, Imperial College London
Jane Mitchell, Imperial College London
Christopher Brearley, Imperial College London
Trevor Hansel, Imperial College London
Katsuhito Ikeda, Imperial College London

According to the World Health Organisation, approximately 5 million people die each year as a result of smoking cigarettes. Smoking represents the second largest global disease burden, and is accountable for 90% of lung cancers. It is a leading risk factor for chronic obstructive pulmonary disease (COPD) and cardiovascular disease. Despite the link between cigarette smoking and disease, the biological mechanisms by which cigarette smoke is associated with inflammation and pathology are not fully understood. We have, over the past 5 years, used transcriptomic analysis on in vitro and ex vivo systems to unravel predicted and novel pathways altered by cigarette smoke. To progress this work we have now undertaken a systems biology approach using a range of ‘omics’ endpoints to address this question in human subjects. Specifically, we have used transcriptomics and metabolomics together with traditional clinical and lab-based assays to analyze and profile responses to smoking in healthy smokers and well-defined phenotypes of patients with COPD. Metabolomic, transcriptomic and ex vivo blood assays suggest that smokers with COPD have different baseline characteristics and responses to smoking cigarettes when compared to healthy smokers. Principal component analysis of the transcriptome revealed clustering with severity of COPD. At the metabolic level, disease-distinguishing pathways included (i) 1,5-anhydroglucitol, which implicates a disturbance in renal glucose homeostasis; (ii) arginine, which implicates the nitric oxide synthase pathway; and (iii) lysolipids, which implicate cytosolic phospholipase A2 activity. A comprehensive analysis of all endpoints may reveal biomarkers and therapeutic targets for smoking related disease.

SASE-hunter – a method for detecting signatures of accelerated somatic evolution in cancer genomes

Presenting Author: Kyle Smith, University of Colorado

Subhajyoti De, University of Colorado
Brent Pedersen, University of Colorado

Detection of novel cancer-associated genes and pathways has yielded novel therapeutics and advanced detection, diagnosis, and treatment of cancer. So far, the emphasis has primarily been on protein coding regions, but non-coding regions, which cover 98% of the genome and harbor major regulatory elements, have been largely under-studied. To this end, we have developed a novel computational framework called SASE-hunter (Signatures of Accelerated Somatic Evolution – hunter) to identify genomic regions that have a significant excess of somatic mutations compared to a locally constructed null model. This method takes into account regional variation in mutation rate, evolutionary conservation and genomic context while making the inference. By applying this methodology to the promoter regions of protein-coding genes in 724 samples in 10 different cancer types, we have identified the promoters that were deemed significant in multiple samples. Integrating gene expression, methylation, and survival data, we assess the functional impact of the selected cases in lymphoma and melanoma. Our findings facilitate biomarker discovery and have implications for cancer diagnosis and treatment. Application of the SASE-hunter algorithm has identified mutated promoters in lymphoma that are associated with alterations in gene expression for multiple cancer related genes.

Phylogeny-wide Discovery of Bacterial Transcription Factor Binding Motifs by Protein Family-based Approach

Presenting Author: Maxim Shatsky, Lawrence Berkeley National Lab

Alexey Kazakov, LBNL
Kanchana Padmanabhan, LBNL
John-Marc Chandonia, LBNL
Pavel Novichkov, LBNL

Reconstruction of high quality genome-scale gene regulatory networks remains challenging even when gene expression data are available. We developed an automated method for phylogeny-wide discovery of transcription factor (TF) binding motifs across all bacterial genomes to produce starting points for regulon identification. Our approach, SEFMA (Simultaneous Entire-Family Motif Analysis), is based on the simultaneous analysis of all regulators from a given TF family. In contrast to the existing approaches that require a collection of sequenced genomes neighboring the genome of interest, we are able to identify TF binding sites (BSs) across many microbial genomes with high accuracy. Here, we target cis-acting regulators, as in bacterial species the vast majority (~80%) of TFs have BSs in regions local to their operons. The detected local motifs can then be used to search the genome to identify regulons.

We demonstrate an improvement in discovery of local TF BSs over current state-of-the-art approaches. First, we compare SEFMA results to the manually curated motifs from the RegPrecise database and experimentally verified binding sites from RegulonDB. Next, we compare SEFMA results and the available motifs for Shewanella oneidensis MR-1, predicted by one of the standard approaches in the field, against the RegPrecise data. Next, we test SEFMA on >2 million TetR BSs that were experimentally synthesized and assayed for binding. We also applied SEFMA to identify binding sites of metal-sensing regulators in Pseudomonas stutzeri and to discover a novel fatty acid metabolism regulon in Gammaproteobacteria.

A Model of the Colonic Crypt Microenvironment

Presenting Author: Violeta Kovacheva, University of Warwick

David Snead, University Hospitals Coventry and Warwickshire
Nasir Rajpoot, University of Warwick

Recently, there have been great advancements in the field of multiplexed immunofluorescence imaging. The surge in development of analytical methods for such data makes it crucial to develop benchmark synthetic datasets for objectively validating these methods. We propose a model of the healthy and cancerous colonic crypt microenvironments. Our model can simulate multi-protein immunofluorescence image data with parameters that allow control over differentiation grade of cancer, crypt morphology, cellularity, cell overlap ratio, image resolution, and objective level. The model can take into account localized expression of proteins and interaction between two or more proteins, hence enabling researchers to simulate cell-localized protein dependence profiles. The protein-protein dependence profiles can also be used to simulate different cell phenotypes that may exist within the tissue. The model can learn some of its parameters from real histology image data stained with standard Hematoxylin and Eosin (H&E) dyes in order to generate realistic chromatin texture, nuclei morphology, and crypt architecture. To the best of our knowledge, ours is the first model to simulate multiplexed immunofluorescence image data at subcellular level for healthy and cancerous colon tissue, where the cells have several compartments, express numerous proteins, and are organized to mimic the microenvironment of tissue in situ rather than dispersed cells in a cultured environment. The simulated data could be used to validate techniques such as cell segmentation, cell phenotyping, differentiation grading, analysis of multiplex image data, and localized protein network analysis.

Dev Ops and Automation, key tools in the bioinformatician's tool box

Presenting Author: Simon Twigger, BioTeam

Flexibility, repeatability, reliability - these are all important traits for a modern bioinformatics infrastructure. As an independent consulting group, BioTeam has experience dealing with these issues on all scales, from individual labs to national agencies. There are a number of tools and approaches that we have found lend themselves to building and managing this type of infrastructure. This presentation will introduce the concepts behind one of these, using Chef and related 'dev ops' approaches to get your software and infrastructure under control so you can focus less on server wrangling and software configurations, and more on the science.

Characterizing bladder cancer cell lines as models of solid tumor biology

Presenting Author: Somsak Phattarasukol, University of Colorado Anschutz Medical Campus

Dan Theodorescu, University of Colorado Anschutz Medical Campus
James Costello, University of Colorado Anschutz Medical Campus

According to the American Cancer Society, there are 74,690 new cases of bladder cancer diagnosed each year. This type of cancer affects both men and women and is one of the most expensive cancers to treat due to high rates of relapse. Despite progress in bladder cancer research, the survival rate for patients has only marginally improved in the past thirty years and no new chemotherapy has been approved in the past fifteen years. Cell lines are important tools for studying cancer in the laboratory and it is important that these cells reflect the genomic diversity of bladder cancer patients. To study this genomic diversity, and its relevance to studying drug sensitivity, we compiled exome sequencing data for a panel of 40 cell lines and 460 patient derived tumor samples. We processed these data under a common pipeline to identify genetic variants. Our results show that the number of exonic variants found in the cell lines varies, ranging from 15,694 (UMUC2) to 23,242 (253J-P). We characterize all 40 cell lines and the genes that are altered in bladder cancer. As a specific example, TP53, a commonly mutated bladder cancer driver gene, is mutated in all cell lines with the exception of PSI. Moreover, our results reveal novel relationships between the cell lines and identify cell lines that more closely model patient genomics. The characterization and indexing of these cell lines represent an invaluable resource for studying the cellular response to drugs across a range of bladder cancer genomic subtypes.

A computational and experimental method for combinatorial drug discovery.

Presenting Author: Muhammad Kashif, Uppsala University

Claes Andersson, Uppsala University
Sadia Hassan, Uppsala University
Henning Karlsson, Uppsala University
Rolf Larsson, Uppsala University
Mats Gustafsson, Uppsala University

The potential of multi-compound therapies to achieve effective anti-cancer treatments is well established but largely unexplored, in particular regarding combinations involving more than two compounds. There is a lack of robust automatic methods practical enough to search large combination spaces effectively for combinations of arbitrary size.
To address these problems we have developed a semi-automated robotic platform for in vitro combination screening that uses a pipeline consisting of (1) combination generation as per design of experiment, (2) cell seeding, (3) drug combination transfer using newly developed Biomek-2000 robotic functionality and (4) end point analysis.
The main limiting factor in this pipeline was drug combination transfer, which becomes exponentially more complicated as the number of drugs increases. A “cherry picking” programme was developed and integrated into the Biomek-2000 robot to perform automated large-scale drug combination transfers. One only needs to specify the desired drug combinations in a spreadsheet and the Biomek-2000 robot does the rest. Validation experiments showed that the new functionality is less error prone than manual procedures, and can transfer drug combinations of arbitrary size precisely.
Another robot, Precision-2000, was programmed for cell seeding, and a Beckman Coulter ORCA robot was used for automated end point readouts. Additionally, R scripts were developed for high-throughput analysis of end point data and for generating sets of combinations based upon that analysis and user-defined criteria.
Experimental results showed that data from the whole platform are reproducible and reliable, and that combination studies of drug combinations of arbitrary size can be performed semi-automatically under different experimental settings.
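To make the combination-generation step and the spreadsheet input more concrete, the sketch below enumerates every drug combination up to a chosen size and writes it out as a presence/absence table in CSV form. It uses exhaustive enumeration as a simple stand-in for a design-of-experiment scheme. The drug names, the column layout, and the function names are hypothetical and do not describe the actual Biomek-2000 input format.

```python
import csv
import io
from itertools import combinations
from math import comb


def generate_combinations(drugs, max_size):
    """Yield every combination of 2..max_size drugs (exhaustive enumeration)."""
    for size in range(2, max_size + 1):
        yield from combinations(drugs, size)


def to_spreadsheet(combos, drugs):
    """Render combinations as CSV text: one row each, 1 = drug included."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["combination_id", *drugs])
    for index, combo in enumerate(combos, start=1):
        writer.writerow([index, *(1 if drug in combo else 0 for drug in drugs)])
    return buffer.getvalue()


# Hypothetical four-drug library, combinations of size two and three.
library = ["drug_A", "drug_B", "drug_C", "drug_D"]
combos = list(generate_combinations(library, max_size=3))
sheet = to_spreadsheet(combos, library)

# The search space grows quickly with library size and combination size.
expected_count = comb(4, 2) + comb(4, 3)
```

Even this small library yields ten combinations, which illustrates why manual transfers become impractical as the number of drugs grows.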

Biomed Summarization Of Topics

Presenting Author: Prabha Yadav, University of Colorado, School of Medicine, Denver

Hoa Trang Dang, National Institute of Standards and Technology
Anita de Waard, Elsevier
Lucy Vanderwende, Microsoft Research
Kevin Bretonnel Cohen, University of Colorado School of Medicine

As the millions of scientific documents on the web grow at an exponential pace, a tool that provides timely access to, and digests of, various documents is necessary. Papers are usually summarized by the abstract provided by authors. The shortcoming is that the abstract doesn’t necessarily reflect the actual impact of the paper in the community.
Our group created a corpus for building and evaluating summarization systems based on citations. There were 10 annotators. There are 550 articles in total, with sets of 50 reference papers, each having 10 citing papers. The entire set has 805 citations. The subjects of the papers include tumorigenesis, oncogenes, malignant transformations, identification of tumor suppressor genes, role of miRNA in cancer development and tumor metastasis, siRNA’s role in necroptosis, signaling pathways, oncogenic mutations, gene expression profiling, antigen receptor signaling, chromosomal translocation, and activation of STAT3, etc.
Sets of 4 annotators annotated each article. On average, they took 7-8 hours to annotate each reference paper set. Annotators also wrote a 250-word summary of the reference paper based on its citations, which indicates the impact of the paper over the years.
This pilot study has implications for natural language processing systems in the construction of both rule based and machine-learning based systems for creating a user interface to get a citance-focused faceted summary of the referenced paper from the citations. A number of lessons learned from the annotation process will be presented.

Beyond the coding mutations in Pan-cancer genomes

Presenting Author: Subhajyoti De, University of Colorado

Vinod Yadav, Univ. Colorado
Kyle Smith, Univ. Colorado
Brent Pedersen, Univ. Colorado
Katherine Pollard, UCSF
James DeGregori, Univ. Colorado

Cancer genomes are pockmarked with mutations, and only a subset of them contribute to tumor initiation and progression. Cancer genomics initiatives have, so far, primarily focused on mutations that affect protein function, but regulatory mutations that alter gene expression are poorly characterized. Recent findings highlight that regulatory mutations in cancers might be more common and clinically important than previously anticipated. We have developed two algorithms to identify novel regulatory alterations in cancer genomes. First, we present REDACT – a computational framework for detecting regulatory mutations within coding regions. We apply the method to 4606 samples from 19 cancer types, analyzing allelic expression, overall mRNA and protein expression, regulatory motif perturbation, and chromatin signatures to identify 121 mutations that alter protein sequence as well as their regulation. Several of the mutations affect known cancer genes and modulate downstream pathways. Second, we extend our search for regulatory mutations to gene promoters, and present a novel algorithm, SASE-hunter, for this purpose. SASE-hunter uses a subsampling approach for genomic inference to identify promoters that have signatures of accelerated evolution in cancer genomes. We apply the method to 724 completely sequenced samples from 10 different cancer types, and identify promoters that are under accelerated evolution in multiple samples. Integrating expression, regulatory motif, and chromatin data, we highlight the functional impact and clinical significance of accelerated evolution in selected cases. Taken together, our findings have implications for biomarker discovery for cancer patients, and call for a systematic, multi-platform initiative to uncover regulatory mutations in different cancer types.

MEseq, a software package for bisulfite sequencing data analysis by regression hidden Markov model

Presenting Author: Kui Shen, National Institute of Allergy and Infectious Diseases

Darrell Hurt, National Institute of Allergy and Infectious Diseases

Epigenetic information is encoded by DNA methylation, which plays a critical role in the regulation of gene expression. Recent developments in next-generation sequencing technology have enabled the measurement and comparison of whole-genome methylation at single-base resolution. However, identifying differentially methylated sites and regions is challenging because the methylation states of nearby CpG sites are correlated.

In this study, we propose a method to identify differentially methylated sites and regions using beta-binomial regression as the emission probability structure in a hidden Markov model. We implemented this method as a software package named MEseq that runs in the R statistical computing environment.
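
As a rough illustration of the idea (not the MEseq implementation, which is an R package), the following Python sketch uses a beta-binomial distribution as the emission probability of a hidden Markov model over consecutive CpG sites and decodes the most likely state path with the Viterbi algorithm. Everything here is an assumption made for the example: the function names, the mean/dispersion parameterization, the fixed per-state mean methylation levels (standing in for the regression component, which is omitted), and the single "stay" transition probability.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-probability of k methylated reads out of n total reads.

    Assumed parameterization: mean methylation level mu (0 < mu < 1)
    and dispersion rho (0 < rho < 1), so that
    alpha = mu * (1 - rho) / rho and beta = (1 - mu) * (1 - rho) / rho.
    """
    a = mu * (1.0 - rho) / rho
    b = (1.0 - mu) * (1.0 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def viterbi(obs, state_mu, rho, stay=0.9):
    """Most likely state path for a list of (methylated, total) read counts.

    state_mu holds one hypothetical mean methylation level per hidden state.
    Transitions keep the current state with probability `stay` and spread
    the remainder evenly over the other states; the start is uniform.
    """
    n_states = len(state_mu)
    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / (n_states - 1))

    k0, n0 = obs[0]
    score = [
        -math.log(n_states) + betabinom_logpmf(k0, n0, mu, rho)
        for mu in state_mu
    ]
    backpointers = []

    for k, n in obs[1:]:
        new_score, pointers = [], []
        for j, mu in enumerate(state_mu):
            # Best previous state for arriving in state j at this site.
            candidates = [
                score[i] + (log_stay if i == j else log_switch)
                for i in range(n_states)
            ]
            best = max(range(n_states), key=candidates.__getitem__)
            pointers.append(best)
            new_score.append(candidates[best] + betabinom_logpmf(k, n, mu, rho))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For example, calling `viterbi([(1, 20)] * 5 + [(18, 20)] * 5, state_mu=[0.1, 0.9], rho=0.05)` labels the first five sites with the low-methylation state and the last five with the high-methylation state. The transition term is what lets neighboring CpG sites share information, which is the correlation the abstract refers to.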

MEseq provides the following functions:
1. Identification of differentially methylated sites by a beta-binomial regression model.
2. Identification of differentially methylated regions by a regression hidden Markov model.
3. Annotation and visualization of identified sites and regions.

MEseq outperforms similar software packages by adopting a flexible emission probability structure in the hidden Markov model. Furthermore, we extended the capabilities of MEseq by including a caller for hydroxymethylation, which allows it to be used for hydroxymethylation data analysis and visualization.

Coreference Annotations in the CRAFT Corpus – Statistics and Performance Analysis

Presenting Author: Natalya Panteleyeva, University of Colorado

Kevin Cohen, University of Colorado

The performance of natural language processing tools is closely tied to the corpora on which they are trained. Evaluating state-of-the-art technology against a divergent test set provides useful metrics and information, not only for improving the performance of existing tools but also for developing new methodology.
This work focuses on the coreference chain annotations of the Colorado Richly Annotated Full Text Corpus (CRAFT), a manually annotated corpus consisting of 67 full-text biomedical journal articles. These articles were annotated and adjudicated to create a gold standard, which we examined to compute global and per-document corpus statistics: total annotations, total coreference chains and appositions, the mean and median number of annotations per article, the mean and median length of coreference chains, and the types of elements in coreference chains and appositions. We compare the performance of the Stanford CoreNLP deterministic coreference resolution system, the current de facto standard, against the coreference annotations in the gold standard. We also describe and quantify the types of errors observed, in order to recommend which types of errors are easiest to eliminate and which would yield the largest improvements in the system’s performance.
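
The corpus statistics listed above could be computed along the following lines. This is a minimal Python sketch under stated assumptions: the input format (a dictionary mapping a document identifier to a list of chains, each chain being a list of mention strings), the function name `chain_stats`, and the toy data are all hypothetical and do not reflect the CRAFT distribution format; every document is assumed to contain at least one chain.

```python
from statistics import mean, median


def chain_stats(chains_by_doc):
    """Global and per-document coreference chain statistics.

    chains_by_doc: dict mapping a document id to a list of chains,
    where each chain is a list of mention strings (assumed format).
    """
    per_doc = {}
    all_lengths = []
    for doc_id, chains in chains_by_doc.items():
        lengths = [len(chain) for chain in chains]
        all_lengths.extend(lengths)
        per_doc[doc_id] = {
            "chains": len(chains),
            "mentions": sum(lengths),
            "mean_chain_length": mean(lengths),
            "median_chain_length": median(lengths),
        }

    mentions_per_doc = [stats["mentions"] for stats in per_doc.values()]
    overall = {
        "total_chains": len(all_lengths),
        "total_mentions": sum(all_lengths),
        "mean_chain_length": mean(all_lengths),
        "median_chain_length": median(all_lengths),
        "mean_mentions_per_doc": mean(mentions_per_doc),
        "median_mentions_per_doc": median(mentions_per_doc),
    }
    return overall, per_doc


# Toy data for illustration only; not taken from CRAFT.
toy = {
    "doc1": [["the mouse", "it", "the animal"], ["BRCA1", "the gene"]],
    "doc2": [["p53", "it"]],
}
overall, per_doc = chain_stats(toy)
```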

Abstract Withdrawn