11th Annual Rocky Mountain Bioinformatics Conference
AssemblSprint: Examining Genome Assemblies of Ion PGM next generation sequencing data from Mycobacterium species
Presenting Author: Nabeeh Hasan, University of Colorado, Denver
Rebecca Davidson, National Jewish Health
Paul Reynolds, National Jewish Health
Eveline Farias-Hesson, Life Technologies
Michael Strong, National Jewish Health
Genome sequencing technologies differ in their speed, accuracy, and capacity. These differences present challenges for genome assembly software to perform optimally with data from multiple sequencing platforms. Moreover, sequence assembly and finishing of genomes remain challenging in many instances. Robust methods to evaluate genome assemblies have not been well defined, particularly for data produced by newer sequencers, such as the Ion Torrent Personal Genome Machine (PGM) from Life Technologies, Inc. In our AssemblSprint, we examined the performance of three commonly used de novo genome assemblers (MIRA, Newbler, and Velvet) in assembling Ion PGM 400 bp sequence data from two Mycobacterium species, M. abscessus ssp. massiliense and M. chelonae, along with Ion PGM 200 bp sequence data from M. malmoense, using varying read depths and read-filtering parameters. Each assembly was evaluated based upon assembly length, number of scaffolds, N50 of scaffolds, number of ambiguous base calls, and comparison to a reference genome, where available. We find that Newbler performed best in assembling Ion PGM data, but inter-assembly variation suggests a need for further research into methods to improve de novo genome assembly and evaluation.
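The N50 statistic used in these assembly evaluations can be computed directly from the scaffold lengths. A minimal sketch of the standard definition (the function name and example lengths are illustrative, not from the study):

```python
def n50(scaffold_lengths):
    """Length L such that scaffolds of length >= L cover at least
    half of the total assembly length."""
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Toy assembly of seven scaffolds: half of 24 is 12,
# reached after the 8 bp and 4 bp scaffolds, so N50 = 4.
print(n50([8, 4, 3, 3, 2, 2, 2]))  # -> 4
```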
Host-pathogen interactions and microRNAs
Presenting Author: Christian Forst, Icahn School of Medicine at Mount Sinai
MicroRNAs (miRNAs) are increasingly recognized to be important in the regulation of biological functions. Although the study of miRNAs in host-pathogen systems is still in its infancy, growing evidence suggests that miRNAs play a key role in the control of infection.
We utilize time-series expression data for both miRNAs and mRNAs derived under identical conditions from cell cultures infected with influenza virus. Using an integrative approach, we infer causal interactions between and among expressed miRNAs and mRNAs. We have validated the resulting pathways with miRNA inhibitors, revealing pro-viral and anti-viral miRNAs. In particular, we can identify miRNAs that are important players in the regulation of the innate immune response -- both targets required for viral replication and host resistance factors that defend against infection.
Characterizing Unknown Viral Genes Through Metabolomics
Presenting Author: Tiffany Liang, San Diego State University
Savannah Sanchez, San Diego State University
Jason Rostron, San Diego State University
Jeremy Frank, San Diego State University
Daniel Cuevas, San Diego State University
Anca Segall, San Diego State University
Forest Rohwer, San Diego State University
Robert Edwards, San Diego State University
Daniel Garza, Evandro Chagas Institute
Viruses are the most diverse biological entities on earth. However, they also have the least characterized genetic, taxonomic, and functional diversity. In metagenomic analyses of viral communities from various environments, most sequences are unrelated to any known sequences; for example, about 90% of the viral sequences found in marine environments are unknown. The goal of this study is to characterize the function of unknown viral genes and identify those that alter host metabolism.
Viral metagenomes were collected from filtered seawater from Pacific coral reefs and sequenced with Roche 454 technology, and open reading frames were predicted from those sequences. Genes were synthesized and cloned into E. coli, and the resulting clones have been characterized in several different ways. To investigate the clones that affected metabolic processes, metabolites were identified by gas chromatography-coupled time-of-flight (GC/TOF) mass spectrometry. In total, 423 metabolites were found; however, only 15% of those matched known compounds. We are identifying the specific metabolites produced or affected by the overexpression of phage proteins in order to predict physiological roles for these proteins that can then be tested experimentally. We have also analyzed metabolic changes associated with expression of proteins with known functions that are involved in central metabolism; clustering of these changes allows us to predict functions for other proteins. We are building a systematic analysis pipeline that can process metabolomics data for downstream analysis of metabolomics and related data sets.
Predicting disease related and pharmacogenes through curated and text-mined annotations
Presenting Author: Christopher Funk, University of Colorado
Lawrence Hunter, University of Colorado
Kevin Cohen, University of Colorado
Identifying genetic variants that play a role in disease or affect drug response is an important task. Before individual variants can be explored efficiently, specific candidate genes must be identified. While many methods rank candidate genes through the use of sequence features and network topology, only a few exploit the information contained in the biomedical literature. We present a set of enriched pharmacogenic Gene Ontology concepts, and we train and test a classifier on known pharmacogenes from PharmGKB. Our classifier uses only Gene Ontology concept annotations and simple features mined from the biomedical literature; it is then used to predict pharmacogenes on a genome-wide scale. We achieve performance of F=0.86 and AUC=0.860 on five-fold cross-validation. Additionally, the top 10 predicted genes are analyzed.
PAIRpred: A large margin method for partner-specific prediction of protein interfaces
Presenting Author: Fayyaz Minhas, Colorado State University
Brian Geiss, Colorado State University
Asa Ben-Hur, Colorado State University
We have developed a novel partner-specific protein-protein interaction site prediction method called PAIRpred that uses the sequences and unbound structures of two proteins in a complex, and is based on support vector machines (SVMs). Unlike most existing machine-learning methods for this problem, PAIRpred uses information extracted from both proteins in a complex using pairwise kernels to predict inter-residue contacts. Due to its partner-specific nature, PAIRpred presents a more accurate model of protein binding and is able to generate more detailed predictions. In order to better model the problem, we present an extension of SVMs that can capture the pairwise constraint that two distant residues in a protein cannot simultaneously interact with the other protein in a complex. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We have compared PAIRpred's performance to existing methods such as ZDOCK, PPiPP, and PredUS. PAIRpred offers state-of-the-art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We have studied the contribution of different sequence and structure features along with the effect of binding-associated conformational change on prediction accuracy. As an illustration of potential applications of PAIRpred, we have used it to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. More information on PAIRpred is available at: http://combi.cs.colostate.edu/supplements/pairpred/.
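The pairwise kernels mentioned above commonly take the symmetrized form K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c), which scores a pair of residues from one complex against a pair from another. A minimal sketch of that form (the linear base kernel and toy feature vectors are illustrative, not PAIRpred's actual residue features):

```python
def dot(u, v):
    """Linear base kernel: inner product of two residue feature vectors."""
    return sum(x * y for x, y in zip(u, v))

def pairwise_kernel(a, b, c, d, k=dot):
    """Symmetrized pairwise kernel over residue pairs:
    K((a,b),(c,d)) = k(a,c)k(b,d) + k(a,d)k(b,c).
    Invariant to swapping the residues within either pair."""
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)
```

The symmetrization matters because an inter-protein contact (a, b) is the same contact as (b, a); the two-term sum makes the kernel value independent of that ordering.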
MP2GO: Inferring Gene Function from Phenotype.
Presenting Author: Judith Blake, The Jackson Laboratory
Joao Ascensao, Rice University
Mary Dolan, The Jackson Laboratory
David Hill, The Jackson Laboratory
While biomedical ontologies have proven instrumental in the advancement of biological research through their ability to efficiently consolidate scientific data, they are also hampered by the segregation of knowledge domains that results from their independent curation. We have developed a new method to computationally infer gene function, as encoded in the Gene Ontology (GO), from mutant phenotypes, as encoded in the Mammalian Phenotype Ontology (MP), using a set and graph theory-inspired approach. We apply this methodology to laboratory mouse (Mus musculus) data as represented in the Mouse Genome Informatics Resource (MGI). We believe this procedure represents a novel methodology for the inference of gene function, as it examines the emergent structure and relationships between the GO and MP annotations without considering the relationships semantically. This could allow for the discovery of unforeseen associations between gene function and phenotypes that would be overlooked by a semantic-based approach. The technique could be applied to a variety of other organisms and annotation databases, taking full advantage of the abundance of available high quality curated data.
Temporal Expression Recognition for Cell Cycle Phase Concepts in Biomedical Literature
Presenting Author: Negacy Hailu, University of Colorado Computational Bioscience Program
Kevin Cohen, University of Colorado School of Medicine
The number of publications in the biomedical domain is increasing exponentially. Searching for papers specific to a researcher's interest in this domain is difficult. PubMed allows search using keywords, but it doesn't rank results based on document relevance. We present a recognizer for temporal expressions related to Cell Cycle Phase (CCP) concepts in biomedical literature. This task is one of the fundamental tasks towards building a search engine for queries with temporal components. Our ultimate goal is to build a specialized search engine for searches in the CCP domain using genes and small molecules. We seek to improve search accuracy by allowing searches using semantic indexing instead of keywords. We identified 11 cell-cycle-related temporal expressions, for which we made extensions to TIMEX3, arranging them in an ontology derived from the Gene Ontology. We annotated 310 abstracts from PubMed. We developed annotation guidelines that are consistent with existing time-related annotation guidelines such as TimeML. Two annotators participated in the annotation, and we computed inter-annotator agreement (IAA). We achieved an IAA of 0.79 for exact span match and 0.82 for relaxed constraints. Our approach is a hybrid: machine learning to recognize the temporal expressions and a rule-based approach to classify them. We trained a named entity recognizer using Conditional Random Fields (CRFs), with an off-the-shelf implementation of the linear-chain CRF model. We obtained a performance of 0.77 F-score for temporal expression recognition, and 0.79 and 0.78 macro- and micro-averaged F-scores for classification.
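The exact-span versus relaxed agreement comparison described above can be sketched as an F-measure over annotated character spans, where relaxed matching accepts any overlap. This toy implementation (the span tuples and greedy one-to-one matching are illustrative assumptions, not the authors' actual scoring code):

```python
def f1_agreement(spans_a, spans_b, relaxed=False):
    """F1 between two annotators' (start, end) spans.
    Exact mode requires identical spans; relaxed mode accepts overlap."""
    def hits(match):
        # Greedy one-to-one matching: each span in B is used at most once.
        used, n = set(), 0
        for a in spans_a:
            for i, b in enumerate(spans_b):
                if i not in used and match(a, b):
                    used.add(i)
                    n += 1
                    break
        return n

    exact = lambda a, b: a == b
    overlap = lambda a, b: a[0] < b[1] and b[0] < a[1]
    if not spans_a or not spans_b:
        return 0.0
    m = hits(overlap if relaxed else exact)
    p, r = m / len(spans_b), m / len(spans_a)
    return 2 * p * r / (p + r) if p + r else 0.0
```

With spans [(0, 5), (10, 15)] versus [(0, 5), (11, 16)], exact matching credits only the first span while relaxed matching credits both, mirroring why the relaxed IAA (0.82) exceeds the exact-span IAA (0.79).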
Identification of novel therapeutics for complex diseases from genome-wide association data
Presenting Author: Mani Grover, Deakin University
Merridee Wouters, Deakin University
Sara Ballouz, Cold Spring Harbor Laboratory, NY
Tamsyn Crowley, CSIRO centre
R.A. George, Victor Chang Institute
Kaavya Mohanasundaram, Deakin University
Craig Sherman, Deakin University
Human genome sequencing has rapidly advanced our knowledge of disease genetics. In silico tools such as candidate gene prediction systems allow rapid identification of disease genes by identifying the most probable candidate genes of the disease. Integration of drug-target data with candidate gene prediction systems can identify novel phenotypes which may benefit from current therapeutics.
We previously used the Gentrepid candidate gene prediction tool as a platform to predict 1,805 candidate genes for the seven complex diseases considered in the WTCCC genome-wide association study, namely Type 2 Diabetes, Bipolar Disorder, Crohn's Disease, Hypertension, Type 1 Diabetes, Coronary Artery Disease, and Rheumatoid Arthritis. Using the drug databases Therapeutic Target Database, PharmGKB, and DrugBank as sources of drug-target association data, we identified a total of 390 (22%) candidate genes as novel therapeutic targets for the phenotype of interest and 2,132 drugs feasible for repositioning against the predicted targets.
By integrating genetic, bioinformatic and drug data, we have demonstrated that currently available drugs may be repositioned as novel therapeutics for the seven diseases studied here, quickly taking advantage of prior work in pharmaceutics to translate groundbreaking results in genetics to clinical treatments.
The Critical Assessment of Function Annotation experiment: a community-wide effort towards a better functional annotation of genes and genomes
Presenting Author: Sean Mooney, Buck Institute for Research on Aging
Predrag Radivojac, Indiana University
Iddo Friedberg, Miami University
A major challenge of the post-genomic era is understanding the function and disease associations of gene products. We are discovering new proteins far faster than we can characterize them experimentally. Most genome projects and derived databases rely fully on automated functional annotation, making increases in annotation accuracy and coverage a prime goal for annotation algorithms. Understanding the accuracy of these function prediction algorithms is of primary importance to the process of translating sequence data into biologically meaningful information. Here we present the results of the first Critical Assessment of Function Annotations (CAFA), held during 2010-2011, and the challenge of the second CAFA experiment, now underway. Thirty-four research groups worldwide participated in the first experiment, employing over 50 function annotation algorithms. The prediction methods were assessed using ROC curves, precision/recall curves, and variations on semantic similarity as applied to the Gene Ontology. During this presentation, I will discuss the results of the first CAFA experiment, the challenges we faced in assessing the results, and the future of CAFA. I will also describe the new experiment, which will include biological process, molecular function, cellular component, and human disease prediction tracks. Finally, I will describe ways in which you, the community, can participate.
Conceptual comparison through integrative functional genomics in GeneWeaver.org
Presenting Author: Elissa Chesler, The Jackson Laboratory
Erich J. Baker, Baylor University
Charles Phillips, University of Tennessee
Michael Langston, University of Tennessee
The comparison of biological concepts is fundamental to the improved definition of disease processes, research models, ontology terms, and other descriptors used to define, characterize and categorize biology within and across species. The GeneWeaver system is designed to close the gap between empirically-derived or conceptually defined data sets and the robust data mining tools required to undertake integrative analysis in high dimensional space. To this end, GeneWeaver contains a rich data store of curated functional genomics experimental results, including differential expression studies, term annotations to sets of biological entities, results of genetic mapping, user submitted gene lists, and privately defined gene sets and gene set associations. The GeneWeaver tool set is designed to efficiently integrate these data sets and enable flexible, comparison-based gene set analytics. Algorithms for rapid identification of set-set relations based on maximal biclique enumeration have enabled discovery of novel sets of genes related to underlying sets of biological processes in common, and new algorithms provide for systematic comparison of biological concepts through the transitive relationships of multiple known gene set associations. The overall aim of our tools is to rapidly identify conceptual similarity and cohesiveness through the enumeration of the biological basis of that similarity. We will present several examples of the flexibility and utility of the GeneWeaver approach to real-time analysis of arbitrary collections of publicly available or user-submitted gene sets representing a range of biological processes.
Supported by NIH AA018776
Integrative Visualization for Discovery of Phenotype Associations in Clinical and miRNA Data
Presenting Author: Michael Hinterberg, University of Colorado-Denver
Lawrence Hunter, University of Colorado
David Kao, University of Colorado
The increasing size and availability of large clinical datasets provides opportunity for discovery of novel, complex phenotypes in patients. Some of these phenotypes, such as drug responsiveness, are important for differential treatment modalities. Complex phenotypes may also be associated with arrays of diagnostic biomarkers; for example, differential expression of mRNA as well as microRNA can segregate different classes of patients.
In datasets with thousands of clinical features, testing hypotheses for associations between clinical phenotype and genetic expression can be a tedious process. Furthermore, slight modifications in patient stratification may have dramatic effects on biomarker association, but these differences may not be readily apparent. In ongoing work, we present a novel web-based visualization tool that allows the user to view and modify tree-based representations of clinical phenotypes and examine associations with microRNA and mRNA expression, with visible transitions that show the effect of modifying phenotype definition. A specific motivating application to drug-responsiveness in non-ischemic dilated cardiomyopathy is presented as well.
Evolution of palmitoyl acyl transferases (PATs) in apicomplexa
Presenting Author: Swapna Seshadri, Research Institute, Hospital for Sick Children
John Parkinson, Research Institute, Hospital for Sick Children
Tim Gilberger, McMaster University
Protein palmitoylation is the only reversible post-translational mechanism utilising a hydrophobic anchor known to dynamically regulate a protein's function by influencing its subcellular localization, stability, and interactions. Although this process is ubiquitous in eukaryotes, it was only recently that a study uncovered hundreds of palmitoylated proteins in P. falciparum. Therefore, characterizing the suite of enzymes catalyzing this process, the palmitoyl acyl transferases (PATs), in apicomplexan parasites is essential for understanding various aspects of parasite biology. We conducted a comprehensive survey to identify and classify PATs from complete genomes of 16 parasitic apicomplexans and 2 closely related free-living protists (ciliates). Using HMMER, 159 and 138 PATs were identified in apicomplexans and ciliates, respectively. Classification is confounded by the lower resolution stemming from the short (~50 aa) conserved catalytic domain, combined with the presence of ankyrin repeats in many sequences. Analysis revealed a ~170 aa region with sufficient information to distinguish the PATs into 7 major clades and 14 sub-clades, using Bayesian and maximum likelihood phylogenetic methods. The sub-clades demonstrate distinct patterns of sequence conservation and indels, providing molecular signatures for possible sub-functionalisation. A structural model of the catalytic domain was generated, providing a molecular perspective on these signatures. Overall, 5 sub-clades are apicomplexa-specific, containing members localized to the rhoptries and inner membrane complex, organelles unique to apicomplexa that are involved in host cell invasion. Further, 2 clades and 2 sub-clades contain yeast and human orthologs, indicating a role in the secretory pathway. In summary, apicomplexans have evolved PATs to serve as an integral part of the biological machinery required to facilitate their parasitic lifestyle.
From Sample to Answer - Sequencing in the Cloud Era with the Illumina BaseSpace® Bioinformatics Platform
Presenting Author: Raymond Tecotzky, Illumina
Genomics research promises to revolutionize public and human health. Yet, extracting meaningful information from an enormous collection of sequence data is at risk without the development of new and scalable bioinformatics approaches. Illumina's genomics cloud platform delivers a suite of industry-leading analysis tools, ensures secure data storage, and simplifies collaboration with integrated, push-button sharing and analysis options. Learn how BaseSpace has become a valuable platform for bioinformatics developers, and how rapid data access enables new genomics research.
MSProcess - Summarization, Normalization, and Diagnostics for Processing of Mass Spectrometry Based Metabolomic Data
Presenting Author: Katerina Kechris, University of Colorado Denver
Grant Hughes, University of Colorado Denver
The initial processing of Liquid Chromatography coupled with Mass Spectrometry (LC/MS) metabolomic data is covered by a variety of software packages provided by instrument manufacturers and a number of open source packages such as xMSAnalyzer, XCMS and MzMine. While these manage the initial data pre-processing steps of peak detection, chromatogram building, alignment, and quantification, they often lack functions for further processing. We designed the MSProcess package in R to complement existing software by providing additional processing tools and statistical and graphical tools for evaluation of different methods. As there are no universally accepted procedures, the package provides implementation of a variety of novel and previously published methods. The primary functions of the MSProcess package are: summarization of replicates, filtering, imputation of missing data, normalization and/or batch effect adjustment, and dataset diagnostics. The output is in a format ready for input to leading software such as MetaboAnalyst to perform clustering and other downstream analyses. In summary, we developed the MSProcess package to complement other packages by providing additional pre-processing steps, implementing a selection of popular normalization algorithms and generating diagnostics to help guide investigators in their analyses of LC/MS based metabolomic data.
Metabolic Reconstruction Identifies Strain-Specific Regulation of Virulence in Toxoplasma gondii
Presenting Author: Nirvana Nursimulu, University of Toronto
Carl Song, University of Toronto
Stacy Hung, University of Toronto
Melissa Chiasson, NIAID, National Institutes of Health
James Wasmuth, University of Calgary
Michael Grigg, NIAID, National Institutes of Health
John Parkinson, University of Toronto
Estimated to infect at least a third of the world's population, the apicomplexan parasite Toxoplasma gondii represents a major threat to immunocompromised individuals and pregnant women, especially given the limited efficacy of current therapeutic interventions. Since metabolism plays an essential role in providing energy and the basic building blocks required for growth, drug-development programs are now focusing more on targeting metabolic enzymes. We hypothesize that metabolic potential plays a key role in determining the virulence of different strains. Given the often nonintuitive relationships between enzymes and pathways, constraint-based models, such as flux balance analysis (FBA), have emerged as indispensable tools to study the organization and operation of metabolic networks. Here we present a novel application of FBA that leverages microarray data to explore the impact of differential enzyme expression observed between virulent and avirulent strains of T. gondii. Our model correctly predicts the increased growth rate of the more virulent type I strain relative to type II; further analysis predicts that the increase in growth rate results from increased energy production via upregulation of the glycolytic, pentose phosphate, and TCA-cycle pathways. These findings highlight a regulatory route which, in addition to conferring growth-rate plasticity, may impact the parasite's outstanding ability to infect a broad range of hosts. Moreover, drug assays confirm strain-specific sensitivities of several reactions, as predicted by in silico single-knockout experiments. This study demonstrates how expression data can be integrated into a model to give robust strain-specific predictions.
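In expression-constrained FBA of the kind described above, enzyme expression effectively scales reaction flux bounds; for a strictly linear pathway, the optimal biomass flux reduces to the tightest scaled capacity along the chain. A toy sketch under that simplifying assumption (the reaction names, base capacities, and 1.5x expression ratio are hypothetical; a real FBA solves a full linear program over the stoichiometric matrix):

```python
def max_biomass_flux(base_bounds, expression_ratio):
    """For a strictly linear pathway, the FBA optimum is the minimum
    expression-scaled flux capacity over the chain of reactions.
    base_bounds: {reaction: base upper bound}
    expression_ratio: {reaction: fold-change vs reference}, default 1.0."""
    scaled = [bound * expression_ratio.get(rxn, 1.0)
              for rxn, bound in base_bounds.items()]
    return min(scaled)

# Hypothetical: type I upregulates both pathways 1.5-fold vs type II.
bounds = {"glycolysis": 10.0, "tca": 8.0}
type_i = max_biomass_flux(bounds, {"glycolysis": 1.5, "tca": 1.5})
type_ii = max_biomass_flux(bounds, {})
```

Under these made-up numbers the upregulated strain supports a higher biomass flux (12.0 versus 8.0), illustrating the direction of the strain-specific growth-rate prediction rather than its actual magnitude.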
An Update on KaBOB: towards an integrated knowledge base of biomedicine
Presenting Author: Kevin Livingston, University of Colorado
Michael Bada, University of Colorado
William Baumgartner, University of Colorado
Lawrence Hunter, University of Colorado
A vast wealth of information currently exists in the form of curated biomedical databases with information about genes, proteins, SNPs, drugs, diseases, pathways, interactions, etc. Unfortunately, many hurdles prevent researchers from viewing their data in a context that spans several of these sources. These problems include idiosyncratic file formats, multiple unique identifiers for the same entities, and varying semantics of the curated data.
The Knowledge Base of Biomedicine (KaBOB) aims to overcome these problems through a series of technical solutions. We have produced an RDF model for representing the contents of databases as an extension of the Information Artifact Ontology (IAO), thus providing a uniform foundation for the incoming data. Curated mappings are used to link identifiers across the data sources. The transitive closure of these linkages is computed, creating a set of all the identifiers in the source databases for each biomedical entity, thus resolving terminology problems. Biomedical entities are created to represent the real-world concepts (genes, proteins, drugs, etc.) referred to by each set of database identifiers. These biomedical entities form the foundation for representing knowledge extracted from the source databases. Declarative rules are used to translate the source data into OWL representations that are integrated under the Open Biomedical Ontologies (OBOs). These rules also function to preserve the provenance of the newly created representations. These representations and rules are currently under active development and evaluation.
A Corpus-Based Study of Temporal Relations in Clinical Text
Presenting Author: Natalya Panteleyeva, University of Colorado
Lawrence Hunter, University of Colorado
Kevin Cohen, University of Colorado
A corpus of clinical data was used to investigate the hypothesis that there are correlations between pairs of event types and the temporal links between them. A corpus of about 98,000 words that had been annotated with events, event types, TIMEX3 expressions, and temporal links was examined for such associations. It was found that in fact most pairs of event types show a strong preference for or against a particular type of temporal link. It was also noted that all possible pairs of event types occur even in this relatively small corpus. The preference of specific pairs of event types for particular types of temporal links has implications for natural language processing systems, including establishing baselines for their performance and providing a priori knowledge that can be used to inform the construction of both rule-based and machine-learning-based systems for labeling temporal links in clinical documents. More basic questions about the linguistic expression of temporal relations in clinical text are examined, such as the extent to which they are sequential or not and the extent to which they are intersentential versus intrasentential. Whether surface linguistic cues from morphology, syntax, and lexicon enhance accuracy in establishing temporal link types is addressed.
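Tabulating which temporal link each pair of event types prefers, as in the corpus study above, amounts to counting (event-type pair, link type) co-occurrences and taking the most frequent link per pair. A minimal sketch (the event-type and link labels are illustrative, not the corpus's actual tag set):

```python
from collections import Counter

def link_preferences(annotations):
    """annotations: iterable of (event_type_1, event_type_2, link_type).
    Returns the most frequent temporal link for each event-type pair."""
    counts = Counter()
    for e1, e2, link in annotations:
        counts[(e1, e2), link] += 1
    prefs = {}
    for (pair, link), n in counts.items():
        if pair not in prefs or n > prefs[pair][1]:
            prefs[pair] = (link, n)
    return {pair: link for pair, (link, n) in prefs.items()}
```

Such a table of per-pair preferences is exactly what a rule-based or machine-learning labeler could consume as a prior, or what an evaluator could use as a majority-class baseline.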
Building an optimally informative machine-learning model of gene regulatory control
Presenting Author: Molly Megraw, Oregon State University
Gene promoter prediction has long been a difficult challenge, particularly in organisms for which little high-throughput data is available for building and testing accurate computational models. Our lab has recently produced a large-scale transcription start site (TSS) dataset using a sequencing-based method for analysis of the 5' ends of mRNA transcripts in plants. Using this dataset, we first categorize the different shapes taken on by the TSS location distributions into TSS "tag clusters". For example, some gene upstream regions have very narrow, high clusters, while others have a broader shape. We then design a high-resolution machine-learning model that predicts the presence of a TSS tag cluster with an auROC near 0.98 for each cluster shape. We use this model to analyze the transcription factor binding site content of different promoter shapes. We find that while canonical notions of sharp, narrow-peak TATA-containing promoters versus broader "TATA-less" promoters have some merit, the model shows that a large compendium of elements is actually necessary and sufficient for accurate promoter prediction for all tag cluster shapes. In this talk I will demonstrate how a machine-learning model can suggest sets of gene interactions with the potential to "turn on" a particular gene, and briefly discuss one possible approach for dissecting which of those sets are optimal predictors of gene up-regulation.
Recalibration of p-values for Multiple Testing Problems in Genomics
Presenting Author: Dean Palejev, Bulgarian Academy of Sciences
John Ferguson, University of Limerick
Conservative statistical tests are often used in complex multiple testing settings in which computing the type I error may be difficult. In such tests, the reported p-value for a hypothesis can understate the evidence against the null hypothesis, and consequently statistical power may be lost. False Discovery Rate adjustments, used in multiple comparison settings, can worsen this unfavorable effect. Despite these effects, the problem seems to be somewhat overlooked within the computational biology and bioinformatics communities, with many practitioners not even aware of the issue.
We present a computationally efficient and test-agnostic calibration technique that can substantially reduce the conservativeness of such tests. As a consequence, a lower sample size might be sufficient to reject the null hypothesis for true alternatives, and experimental costs can be lowered.
As an example, we apply the calibration technique to the results of DESeq, a popular method for detecting differentially expressed genes from high-throughput RNA sequencing data. The increase in power may be particularly high in small-sample-size experiments, often used in preliminary studies and funding applications. In some situations, after correction, statistical power can increase threefold without additional experimental costs.
Our results are structured in a way that makes them easily usable by practitioners in the fields of computational biology and bioinformatics. We also provide R code that can be used or modified as needed.
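One generic, test-agnostic way to calibrate a conservative p-value (a standard empirical-quantile recipe, not necessarily the authors' exact procedure) is to map the observed p-value to its quantile among p-values simulated under the null with the same test:

```python
import bisect

def recalibrate(p_obs, null_pvalues):
    """Map an observed p-value to its empirical quantile under the null,
    estimated from p-values produced by the same (conservative) test on
    data simulated under the null hypothesis."""
    null_sorted = sorted(null_pvalues)
    n_le = bisect.bisect_right(null_sorted, p_obs)
    # The (+1)/(n+1) adjustment keeps the calibrated p-value valid
    # (strictly positive, never understating by chance with few simulations).
    return (n_le + 1) / (len(null_sorted) + 1)
```

If the test is conservative, its null p-values pile up near 1, so the empirical quantile of a small observed p-value is smaller than the nominal one, recovering the lost power described above.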
GeneSeer Aids Drug Discovery by Exploring Evolutionary Relationships Between Genes Across Genomes
Presenting Author: Douglas Fenger, Dart NeuroScience
Matthew Shaw, Dart NeuroScience
Philip Cheung, Dart NeuroScience
Tim Tully, Dart NeuroScience
Homologous relationships facilitate drug discovery by mapping gene/protein function between and within species, allowing functional predictions of novel or unknown genes. Additional benefits of cross-species mapping include the following: use of paralogs for selectivity/specificity screens to eliminate drug side effects, translation of pathway information from model organisms to humans, and allowing comparison and combination of data from different species.
GeneSeer (http://geneseer.com) is a publicly available tool that leverages public sequence data, gene metadata, and other publicly available data to calculate and display orthologous and paralogous gene relationships for all genes from several species, including yeast, insects, worms, vertebrates, mammals, and primates, including humans. GeneSeer calculates homology relationships, and its interface is designed to help scientists quickly predict important attributes such as additional closely related family members and paralogous relationships. It is a useful tool for cross-species translational mapping and enables scientists to easily translate hypotheses about gene identity and function from one species to another. We have validated GeneSeer against HomoloGene, the homolog prediction tool from NCBI. The results show that GeneSeer is as good as, if not better than, HomoloGene. Finally, a comparison of features shows GeneSeer to be the most feature-rich when compared to alternative homology tools.
The Human Gene Connectome Server: an Online Tool for Prioritizing Genes by Biological Distance
Presenting Author: Yuval Itan, The Rockefeller University
Mark Mazel, The Rockefeller University
Benjamin Mazel, The Rockefeller University
Avinash Abhyankar, New York Genome Center
Bertrand Boisson, The Rockefeller University
Patrick Nitschke, INSERM
Lluis Quintana-Murci, Pasteur Institute
Laurent Abel, INSERM
Shen-Ying Zhang, The Rockefeller University
Jean-Laurent Casanova, The Rockefeller University
To determine the disease-causing allele(s) underlying human disease, high-throughput genomic methods are applied and provide thousands of gene variants per patient. We recently developed a novel approach, the "human gene connectome" (HGC), a concept and method that describes the set of all in silico-predicted biologically plausible routes and distances between all pairs of human genes. With the HGC, we generated a "gene-specific connectome" for each human gene, the set of all human genes ranked by their predicted biological proximity to the core gene of interest, available at: http://lab.rockefeller.edu/casanova/HGC/. We demonstrated that the HGC is the most powerful approach for prioritizing high-throughput genetic variants in Mendelian disease studies. However, there is currently no effective gene-centric online interface for ranking genes by biological distance. We describe here the human gene connectome server (HGCS): a powerful, easy-to-use interactive online tool that enables researchers to prioritize any list of genes according to their biological proximity to core genes (i.e. genes that are known to be associated with the phenotype), and to predict novel gene pathways. We demonstrated the effectiveness of the HGCS in detecting herpes simplex encephalitis-predisposing genes in patients' whole exome sequencing data. The HGCS is freely available for non-commercial use at: http://hgc.rockefeller.edu/.
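The core ranking idea, computing biological distances from a core gene and sorting all other genes by proximity, can be sketched as a shortest-path search over a weighted gene-gene network. This is an illustrative sketch only: the toy graph and edge weights below are invented, and the HGC derives its actual distances from predicted biologically plausible routes rather than this simple Dijkstra search.

```python
# Hypothetical sketch: rank genes by shortest-path distance to a core gene.
# Gene names and weights are invented for illustration.
import heapq

def rank_by_distance(graph, core):
    """graph: {gene: {neighbor: distance}}. Returns (gene, distance) pairs
    sorted by increasing distance from the core gene (Dijkstra's algorithm)."""
    dist = {core: 0.0}
    heap = [(0.0, core)]
    while heap:
        d, g = heapq.heappop(heap)
        if d > dist.get(g, float("inf")):
            continue  # stale heap entry
        for nb, w in graph.get(g, {}).items():
            if d + w < dist.get(nb, float("inf")):
                dist[nb] = d + w
                heapq.heappush(heap, (d + w, nb))
    return sorted(dist.items(), key=lambda kv: kv[1])

toy = {"TLR3": {"TRIF": 1.0, "IRF3": 2.5},
       "TRIF": {"IRF3": 1.0}}
ranking = rank_by_distance(toy, "TLR3")  # core gene first, then by proximity
```

A full gene-specific connectome would simply be this ranking computed over all human genes.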
Type I Error Rate Analysis of Methods for Correlated Binary Outcomes
Presenting Author: Aarti Munjal, University of Colorado Denver
Jacqueline Johnson, University of North Carolina at Chapel Hill
Sarah Kreidler, University of Colorado Denver
Deborah Glueck, University of Colorado Denver
Keith Muller, University of Florida
We describe a simulation study to estimate type I error rate, calculate power, and investigate optimization
algorithm convergence properties for multilevel and longitudinal designs with correlated binary outcomes.
Multilevel and longitudinal studies with binary outcomes are common throughout the biomedical literature. These studies allow scientists to investigate the causes of disease, determine the efficacy of drugs, and conduct basic biomedical research to improve human health. Our simulation studies show that using standard multivariate models with binary outcome data controls the type I error rate for many important designs. The approach produces reasonable results even with up to 10% missing data and very small sample sizes. We discuss the implications for scientists working on oral cancer detection.
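The simulation logic can be illustrated with a minimal Monte Carlo sketch of type I error estimation. For simplicity this assumes a two-sample comparison of binary outcomes under the null (both groups share one response probability) tested with a pooled two-proportion z-test, not the authors' multilevel models; sample sizes and probabilities are invented.

```python
# Illustrative Monte Carlo estimate of type I error rate for binary outcomes.
import random

def estimate_type_i_error(n_per_group=30, p_null=0.3, n_sims=2000, seed=1):
    """Simulate two groups under the null and count false rejections
    of a two-sided pooled two-proportion z-test at alpha = 0.05."""
    random.seed(seed)
    rejections = 0
    for _ in range(n_sims):
        x = sum(random.random() < p_null for _ in range(n_per_group))
        y = sum(random.random() < p_null for _ in range(n_per_group))
        p1, p2 = x / n_per_group, y / n_per_group
        p = (x + y) / (2 * n_per_group)               # pooled proportion
        se = (2 * p * (1 - p) / n_per_group) ** 0.5   # pooled standard error
        if se > 0 and abs(p1 - p2) / se > 1.96:       # |z| > 1.96 rejects
            rejections += 1
    return rejections / n_sims

rate = estimate_type_i_error()  # should land near the nominal 0.05
```

An estimated rate far from 0.05 would indicate the test fails to control type I error for that design.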
An Optimal Metabolic Route Search Tool: RouteSearch
Presenting Author: Mario Latendresse, SRI International
RouteSearch is a new Web accessible metabolic engineering tool
available as part of BioCyc since March 2013. It enables searching for
optimal metabolic linear routes between a start compound and a goal
compound. The optimality criteria are the weighted sum of the costs
of the reactions used, and the weighted sum of the costs of atoms that
are lost in the transformation from the start compound to the goal
compound. These costs and the number of minimum cost routes to find
and display are user selectable. The routes are displayed as a series
of connected enzymatic reactions including chemical structures of the
substrates, where the conserved moieties within each metabolite are
shown using colors. By using a graphical interface, the user can also
easily identify each atom conserved or lost along each route.
RouteSearch uses two algorithms to search for optimal routes: the
Bellman-Ford algorithm that finds the least cost route, and a more
general branch and bound search algorithm that can find several
minimum cost routes. RouteSearch also uses a preferred organism to
search -- a chassis in metabolic engineering terms, such as E. coli --
and a library of additional reactions, which is the MetaCyc
database. The cost of using a reaction from MetaCyc is usually set
higher than using a reaction from the chassis. In this way, new and
more productive metabolic routes can be found for the chassis by
adding reactions from MetaCyc. We will also briefly describe the
computation of atom mappings for MetaCyc. Atom mappings are used by
RouteSearch to track the atoms conserved and lost in a route.
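The least-cost route search can be sketched with a bare-bones Bellman-Ford implementation over reaction arcs. The toy network and costs below are invented for illustration; as in the abstract, a hypothetical "MetaCyc" reaction is costed higher than the chassis reactions, so routes prefer the chassis when possible.

```python
# Illustrative Bellman-Ford search for a least-cost linear metabolic route.
# Compounds, reactions, and costs are invented; not RouteSearch itself.

def bellman_ford_route(edges, start, goal):
    """edges: list of (substrate, product, cost) reaction arcs.
    Returns the least-cost route from start to goal and its total cost."""
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: float("inf") for n in nodes}
    pred = {n: None for n in nodes}
    dist[start] = 0.0
    for _ in range(len(nodes) - 1):     # relax all arcs |V|-1 times
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                pred[v] = u
    route = [goal]                       # walk predecessors back to the start
    while pred[route[-1]] is not None:
        route.append(pred[route[-1]])
    return list(reversed(route)), dist[goal]

# Chassis reactions cost 1.0; the direct "MetaCyc" reaction costs 5.0.
reactions = [("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 5.0)]
route, cost = bellman_ford_route(reactions, "A", "C")
```

Here the two-step chassis route is found instead of the expensive direct reaction.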
Computational and Mathematical Modeling of the segmentation genes of Honeybee (Apis mellifera)
Presenting Author: Maryam Bagher Oskouei, University of Otago
Brendan McCane, University of Otago
Peter Dearden, University of Otago
Drosophila and Honeybee embryos are two examples of embryos that develop a segmented body plan during early development. The basic body plan consists of distinct segments along the anterior-posterior axis, established via a segmentation process. This process, which subdivides the embryo into segments, is controlled by interactions between segmentation genes. Many experimental and computational studies have sought to reveal which interactions drive this process in Drosophila embryos, but few have addressed Honeybee embryos. The Honeybee genome has several features that make it worth studying: Honeybees are excellent comparative model systems for understanding the evolutionary pathways behind the segmentation process, given that the two insects diverged ~350 million years ago. Here, we present a method using ordinary differential equations (ODEs) to model segmentation genes in Honeybee embryos. The initial and target models for the ODEs were configured with data collected in Peter Dearden's lab. The computational modeling was carried out to explore how likely each gene is to be regulated, positively or negatively, by other genes. The simulations were performed in two phases: first for pre-stripe networks and then for stripe-pattern-forming networks. The main findings predict gene networks that are most likely to pattern different parts of the embryo along its anterior-posterior axis during early developmental stages. These results are comparable with Drosophila embryos. Importantly, the predicted networks provide hypotheses that can be tested experimentally.
Improved RNAi target sequencing (RIT-seq) enables dissection of cellular function in Trypanosoma brucei.
Presenting Author: Jonathan Wilkes, University of Glasgow
Graham Hamilton, University of Glasgow
Richard McCulloch, University of Glasgow
The protozoan parasite Trypanosoma brucei utilises an RNA interference (RNAi) pathway that is widely conserved among eukaryotes. This can be adapted to regulate expression of the polycistronically transcribed genes of T. brucei, using gene-specific sequences within a tetracycline-inducible cassette to allow RNAi 'knockdown', now an important research tool. RIT-seq has been developed, which enabled the parallel analysis of >8000 genes in T. brucei across life-cycle and differentiation stages (1). The original RIT-seq methodology possesses a number of shortcomings that compromise its potential: the semi-specific PCR produces only modest enrichment of the inserted sequences, amplifies sequences inconsistently, and yields significant genomic sequence unrelated to the inducible fragments.
We have designed an adaptation of the methodology involving a specific PCR to amplify sequences between common primer sites flanking inserted genomic fragments in the RNAi cassette. Preparing the sequencing library from this amplified material requires 10-fold less material (500 ng of DNA), produces a higher proportion (3- to 10-fold) of reads unequivocally derived from the cassettes, utilises standard protocols for library preparation, and permits sample multiplexing. To validate this RIT-seq approach, we have screened for T. brucei genes that act in DNA damage repair by measuring read abundance after RNAi in the presence or absence of the SN2 alkylator methyl methanesulphonate. A number of previously characterised T. brucei DNA repair genes are revealed, as well as several novel pathways that have not been examined to date. The system was also adapted to produce a comprehensive panel of protein kinase (kinome) probes.
1) Alsford et al. Genome Res. 2011 Jun;21(6):915-24.
What you did not know your transcription factor was doing
Presenting Author: Mary Allen, University of Colorado
Justin Freeman, University of Colorado
Hestia Mellert, University of Colorado
Joaquin Espinosa, University of Colorado
Robin Dowell, University of Colorado
A transcription factor (TF) protein binds to DNA and regulates transcription of a target gene. The guardian of the genome, p53, is a transcription factor important in cancer and aging, and activates transcription of many genes involved in apoptosis and cell cycle arrest. I have used a novel technique to discover over 200 annotated genes that are direct transcriptional targets of p53. This technique, GRO-seq, captures nascent transcription. Moreover, my work shows that short bidirectional transcripts are produced from p53 binding sites whether the sites are within, near, or distant from protein-coding genes. Additionally, I demonstrate that when p53 is activated, transcription at its binding sites increases. Finally, I show that p53 binding sites have high levels of transcription when they are located near p53 target genes (protein-coding genes). The novel discovery that binding sites are transcribed and that transcription levels of binding sites correlate with TF activity leads to new questions about how transcription of binding sites affects TF binding and activation of target genes.
Linking genotype and enterotype in inflammatory bowel disease
Presenting Author: Dan Knights, University of Minnesota
Mark Silverberg, University of Toronto
Rinse Weersma, University Medical Center Groningen
Dirk Gevers, Broad Institute of Harvard and MIT
Gerard Dijkstra, University Medical Center Groningen
Hailiang Huang, Massachusetts General Hospital
Andrea Tyler, Mount Sinai Hospital
Suzanne von Sommeren, University Medical Center Groningen
Floris Imhann, University Medical Center Groningen
Joanne Stempak, Mount Sinai Hospital
Caitlin Russell, Massachusetts General Hospital
Jenny Sauk, Massachusetts General Hospital
Jo Knight, University of Toronto
Mark Daly, Massachusetts General Hospital
Curtis Huttenhower, Harvard School of Public Health
Ramnik Xavier, Massachusetts General Hospital
Human genetics and host-associated microbiomes have each been associated with inflammatory bowel disease (IBD); however, IBD risk cannot be fully explained by either factor alone. Recent findings implicate genotype-enterotype crosstalk as a contributor to IBD pathogenesis, but there has been no large study of complex genome-microbiome interactions in humans. We have performed such a study using bacterial 16S ribosomal RNA enterotyping and Immunochip genotyping from intestinal mucosal biopsies in three independent cohorts totalling more than 500 individuals. We present methodology, validated internally between cohorts, to test for interaction between host genetic loci and taxonomic and functional components of the microbiome. In a targeted analysis integrating fine mapping of causal variants, we find nucleotide oligomerization domain 2 (NOD2)-specific risk associated with known IBD-related imbalances in bacterial taxa, including increased Gammaproteobacteria and Escherichia. NOD2 has known roles in the management of commensal bacteria and a strong genetic signal for increased IBD risk. Using imputed bacterial metagenomes, we also find NOD2 risk linked to increased sulfur reduction and lipopolysaccharide biosynthesis. These findings point to pathobiont expansion and bacterial production of the genotoxic agent hydrogen sulfide, both involved in inflammation and IBD pathogenesis. In a novel omnibus test, we demonstrate links between host innate and adaptive immune pathways and broad enterotype composition. Our analysis demonstrates that novel interactions can be uncovered from paired genotype-enterotype data and that host genetics is linked to microbial dysbiosis in IBD.
Probing chromatin with digestion assays: from static nucleosome positioning to dynamic response
Presenting Author: Michael Tolstorukov, Massachusetts General Hospital
The read-out of genetic information occurs in the cell in the context of chromatin, and in recent years chromatin structure has been the focus of intensive research. The basic repeating unit of chromatin, termed the nucleosome, comprises eight histone proteins and about 150 bp of genomic DNA. Nucleosomes are subject to repositioning, histone modification and variant exchange, which constitute potent tools of epigenetic regulation of gene expression. Furthermore, multiple diseases, including developmental disorders and cancers, are associated with dysregulation of these pathways.
Digestion assays, often in combination with chromatin immunoprecipitation and followed by next-generation sequencing, are widely used for profiling primary chromatin structure. However, processing the data produced in such experiments constitutes a substantial challenge, due both to the complexity of the underlying factors influencing the chromatin response to a digestion agent and to internal biases in the technique. In my presentation I will describe a comprehensive set of computational procedures developed to address these challenges and to extract multiple layers of information on chromatin structure from a single experimental series. In addition to mapping stable positions of bulk and epigenetically modified nucleosomes, our approach allows identification of nucleosomes of non-canonical sizes and of sites where DNA accessibility is regulated by changes in nucleosome physical properties. While still in development, this approach has already produced novel results on the distribution of variants of histone H2A and, importantly, provided mechanistic insights into the role of these variants in the regulation of gene expression.
GIST: an ensemble approach to the taxonomic classification of metatranscriptomic reads
Presenting Author: John Parkinson, Hospital for Sick Children
Geoffrey Halliday, University of Toronto
Whole-microbiome gene expression profiling ('metatranscriptomics' or 'RNA-seq') has emerged as a powerful means of gaining a mechanistic understanding of the complex inter-relationships that exist in microbial communities. However, due to the inherent complexity of microbial communities and the lack of a comprehensive set of reference genomes, currently available computational tools for metatranscriptomic analysis are limited in their ability to functionally classify and organize these sequence datasets. To meet this challenge we have been developing methods that combine accurate transcript annotation with systems-level functional interrogation of metatranscriptomic datasets. As part of these methods, we present GIST (Generative Inference of Sequence Taxonomy), which combines several statistical and machine learning methods for compositional analysis of both nucleotide and amino acid content with the output from the Burrows-Wheeler Aligner to produce robust taxonomic assignments of metatranscriptomic RNA reads. In addition to identifying taxon-specific pathways within the context of a pan-microbial functional network, linking taxa with specific functions in a microbiome will produce a deeper understanding of how their loss or gain alters microbiome functionality. Applied to real as well as synthetic datasets, generated using an in-house simulation tool termed GENEPUDDLE, we demonstrate improved performance in taxonomic assignments over existing methods.
Comparative metagenomics by cross-assembly
Presenting Author: Bas Dutilh, Radboud University Medical Centre
Rob Edwards, San Diego State University
Determining the interrelationships between metagenomes from different biomes or different time points is important to understand the microbial world around us. Mapping metagenomic sequences to a reference database of known genes is a feasible approach to transfer taxonomical and functional annotations to sequence reads. However, it can limit the amount of data that can be analyzed because the majority of the sequencing reads in difficult-to-annotate datasets, such as viral metagenomes from biomes other than the human microbiome, lack known homologs. A promising alternative is reference-independent comparative metagenomics by cross-assembly.
Cross-assembly of different metagenomes is a fast and insightful way to obtain information about sequences that are shared between the samples, represented by cross-contigs. Importantly, cross-assembly is independent of an annotated reference database, providing a way to also handle unknown sequences. The cross-assembly tool crAss allows a rapid analysis of these cross-contigs. First, it provides cross-contig-based similarity scores between all metagenome pairs. Second, crAss creates insightful images displaying the inter-relationships between samples. Third, it generates occurrence profiles of the cross-contig sequences across metagenomes that can be used to discover related sequences, aiding further assembly and interpretation.
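The idea of a cross-contig-based similarity score can be sketched as follows. This is a hedged illustration, not crAss's exact metric: the scoring formula here is simply the fraction of reads landing on contigs shared by both samples, and the contig read counts are invented.

```python
# Illustrative cross-assembly similarity: after co-assembling two metagenomes,
# score samples by how many of their reads fall on shared cross-contigs.
# Contig names, counts, and the formula are invented for illustration.

def cross_assembly_similarity(contig_reads_a, contig_reads_b):
    """Each argument maps cross-contig name -> reads contributed by a sample.
    Returns the fraction of all reads that land on contigs used by both."""
    shared = [c for c in contig_reads_a if c in contig_reads_b
              and contig_reads_a[c] > 0 and contig_reads_b[c] > 0]
    total = sum(contig_reads_a.values()) + sum(contig_reads_b.values())
    shared_reads = sum(contig_reads_a[c] + contig_reads_b[c] for c in shared)
    return shared_reads / total if total else 0.0

sample_a = {"contig1": 40, "contig2": 10, "contig3": 0}
sample_b = {"contig1": 25, "contig2": 0, "contig3": 5}
score = cross_assembly_similarity(sample_a, sample_b)
```

Computing this score for every metagenome pair yields the pairwise similarity matrix that tools like crAss visualize; occurrence profiles come from the per-contig counts themselves.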
The changing view of protein interaction networks based on data availability
Presenting Author: Barrett Hostetter-Lewis, California State University, Chico
Todd A Gibson, California State University, Chico
The end of the 20th century marked the beginning of the era of large-scale studies identifying protein interactions. These large data sets catalyzed a renaissance in protein interaction network research. The mainstay of this research has been in elucidating the biological and evolutionary factors that affect the network's topological features. As the quantity of data increases and the quality of data improves we have been regularly refining our understanding of these biological and evolutionary influences. Now, as interaction data becomes available for recently sequenced organisms, there is a great opportunity for research on today's nascent interactomes to benefit from the analytical steps and missteps taken in fledgling interactome analyses 10-15 years ago.
Here we describe how the researcher's view of the Saccharomyces cerevisiae protein interaction network has changed since the first publication of large-scale yeast data in late 1999. By creating network snapshots using increasing amounts of interaction data constrained by date-of-publication and various quality criteria, we identify trends in the researcher's view of network topology, and compare this to the interactomes of organisms which still have early, incomplete interaction data sets.
Hypernetworks: the Future of Network Modeling
Presenting Author: Debra Goldberg, University of Colorado Boulder
Hypernetworks (hypergraphs) are an extension to network models that allows any number of nodes in an edge (called a hyperedge). Many classes of problems currently modeled with networks involve data that are not intrinsically binary. These would be more naturally captured by a hypernetwork model. There are not (yet) many analysis tools available for biological hypernetworks. We are developing such tools in the context of protein interactions. However, many measures and algorithms must be tailored specifically to the data being analyzed and the questions being asked. Here I describe some of the domains currently modeled by networks that could benefit from analysis with hypernetworks.
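A minimal hypernetwork representation stores each hyperedge as a set of nodes, so that a single interaction (for example, a multi-protein complex) can involve any number of members rather than exactly two. The class and node names below are an invented sketch, not an existing analysis tool.

```python
# Illustrative hypernetwork: hyperedges are frozensets of any size.

class Hypernetwork:
    def __init__(self):
        self.hyperedges = set()

    def add_hyperedge(self, *nodes):
        """Add one hyperedge connecting any number of nodes."""
        self.hyperedges.add(frozenset(nodes))

    def degree(self, node):
        """Number of hyperedges containing the node."""
        return sum(1 for e in self.hyperedges if node in e)

    def neighbors(self, node):
        """All nodes sharing at least one hyperedge with the node."""
        return {n for e in self.hyperedges if node in e for n in e} - {node}

h = Hypernetwork()
h.add_hyperedge("A", "B", "C")   # e.g. a three-protein complex as one hyperedge
h.add_hyperedge("A", "D")        # an ordinary pairwise interaction
```

Note how degree and neighborhood, the building blocks of many network measures, already need redefinition here, which is exactly why hypernetwork analysis tools must be tailored to the data and questions at hand.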
Analyzing biological networks using degree-of-interest functions
Presenting Author: Corinna Vehlow, Visualization Research Center, University of Stuttgart
Carsten Goerg, University of Colorado Anschutz Medical Campus
David Kao, University of Colorado Anschutz Medical Campus
Biologists commonly analyze experimental data using biological networks, such as gene-expression correlation networks, to explain disease specific patterns and identify genotype-phenotype relationships. Biomedical knowledge from various databases and the literature can be integrated with these data networks to allow analysts to interpret experimental data in the context of existing knowledge. While these combined networks provide a rich resource and profound basis for data analysis, they are difficult to explore and understand since they are very dense. Using current static visualization approaches, it takes time and expertise to "untangle the hairball" and manually extract sub-networks that can explain a phenomenon or tell a meaningful biological story. To improve this analytical workflow, we developed a visualization approach that applies the concept of degree-of-interest (DOI) functions to highlight or filter particular parts of a network that are relevant for a specific question or task. We also use these DOI functions to automatically extract and lay out sub-networks in a way that DOI-based groups and their intersections become visually apparent, e.g., extracting a sub-network that includes all nodes involved in a set of pathways of interest and visually arranging these nodes based on their pathway information. To facilitate the analysis of extracted sub-networks in the context of the complete network, the network visualizations are linked through a brushing and linking feature. DOI functions can model various analytical facets, including an analyst's background and interest, properties of the experimental data, and phenotype information. Hence, they provide a generic and powerful approach for analyzing biological networks.
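A degree-of-interest function can be sketched as a weighted combination of analytical facets per node, with a cutoff deciding which nodes enter the extracted sub-network. The facets, weights, and gene names below are invented for illustration and are not the authors' actual DOI formulation.

```python
# Illustrative DOI filter: score nodes by weighted facets, keep those above
# a cutoff. Facet names, weights, and data are hypothetical.

def doi_score(facets, weights):
    """Weighted sum of a node's facet values (missing facets count as 0)."""
    return sum(weights[f] * facets.get(f, 0.0) for f in weights)

def extract_subnetwork(nodes, weights, cutoff):
    """Return the names of nodes whose DOI score meets the cutoff."""
    return {name for name, facets in nodes.items()
            if doi_score(facets, weights) >= cutoff}

nodes = {"TP53":  {"fold_change": 2.0, "in_pathway": 1.0},
         "GAPDH": {"fold_change": 0.1, "in_pathway": 0.0}}
weights = {"fold_change": 0.5, "in_pathway": 1.0}
sub = extract_subnetwork(nodes, weights, cutoff=1.0)
```

Swapping in different weight vectors models different analytical facets (analyst interest, experimental signal, phenotype membership) without changing the extraction machinery.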
Experimental Determination of Useful and Informative Visualizations of Microbial Ecology Data for Public and Scientific Audiences
Presenting Author: Megan Pirrung, University of Colorado
Modeling Transcriptional Regulation through simulation of the dynamic changes in DNA binding factor configuration
Presenting Author: David Knox, University of Colorado Anschutz Medical Campus
Robin Dowell, University of Colorado Boulder
Transcriptional regulation is the complex system behavior arising from the interaction of numerous regulators with DNA. Experimental efforts have unraveled the function of many individual components of the process, but the systems level behavior remains unpredictable. Growing evidence indicates that the transcriptional response of the system emerges not solely from the individual components, but rather by their collective behavior -- including competition and cooperation. The environment surrounding DNA undergoes millions of molecular interactions every second, resulting in continuous changes to the configuration of physically bound molecular components. It is from these stochastic, temporal, and spatial interactions of regulatory components that transcriptional regulation arises within each cell. Encapsulating our understanding of these interactions into computational models is essential for a full understanding of transcriptional regulation.
Our goal was to create biologically realistic computational models of transcriptional regulation that not only capture the behavior of several individual components, but also describe the dynamic and stochastic behavior of competing components. To this end we have developed an automated rule builder that creates both stochastic simulation rule sets and basic visualizations of the resultant simulations. Our modeling framework captures the competition between regulatory proteins and, more importantly, the dynamics of regulatory events occurring within individual cells.
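The flavor of such stochastic simulation can be illustrated with a Gillespie-style sketch of a single DNA site switching between free and factor-bound states. The rate constants are invented, and this two-state toy stands in for the much richer rule sets the framework generates.

```python
# Illustrative Gillespie-style simulation of one DNA site alternating between
# free and TF-bound states. Rates are hypothetical, not the authors' rules.
import math
import random

def simulate_occupancy(k_on=2.0, k_off=1.0, t_end=1000.0, seed=7):
    """Fraction of simulated time the site is bound; in this two-state
    model the expected value is k_on / (k_on + k_off)."""
    random.seed(seed)
    t, bound, time_bound = 0.0, False, 0.0
    while t < t_end:
        rate = k_off if bound else k_on            # one possible event per state
        dt = -math.log(random.random()) / rate     # exponential waiting time
        dt = min(dt, t_end - t)                    # clip the final interval
        if bound:
            time_bound += dt
        t += dt
        bound = not bound                          # fire the event
    return time_bound / t_end

occupancy = simulate_occupancy()  # expected near 2/3 for these rates
```

Extending the state space to many competing factors and sites is what turns such a simulator into a model of emergent, collective regulatory behavior.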
Deletion of COX-2 is associated with reduced expression of rgl1: relevance to chemopreventative effect of COX-2 inhibitors
Presenting Author: Nicholas Kirkby, Imperial College London
Jane Mitchell, Imperial College London
Non-steroidal anti-inflammatory drugs (NSAIDs) such as aspirin, ibuprofen, rofecoxib and meloxicam inhibit cyclo-oxygenase (COX)-2 and are used for the treatment of arthritis. NSAIDs and COX-2 gene deletion have also been shown to prevent cancer in animal models, but research into the chemopreventative benefits of NSAIDs in man has been restricted following the association of these drugs with cardiovascular toxicity. As such, the mechanisms by which COX-2 regulates cancer progression and cardiovascular health are incompletely understood. Here we have performed transcriptomic analysis of COX-2 knock out mouse aorta in order to identify gene pathways that may help to explain the role of COX-2 in these systems.
Microarray analysis of COX-2-/- vs COX-2+/+ aorta identified 29 differentially expressed genes. The most strongly altered of these was Ral guanine nucleotide dissociation stimulator-like 1 (Rgl1), which was down-regulated in COX-2-/- mice (-1.73-fold, p=1.03x10^-8). To validate these observations, we mined data from ArrayExpress, which revealed studies showing that chronic dosing of rats with several NSAIDs, including meloxicam, is associated with a reduction in Rgl1 expression. This is in line with our findings and supports our observations in gene-deleted mice.
These data suggest that Rgl1 gene expression may be regulated by COX-2 activity. Rgl1 is a guanine nucleotide exchange factor, activated by Ras, the most common oncogene in human cancer, and catalyses the activation of Ral. Ras and Ral are known to be crucial for cancer transformation and progression. Although preliminary, our observations may implicate Rgl1 down-regulation as a novel mechanism by which NSAIDs exert chemopreventative effects.
Prioritizing hypotheses for epigenetic mechanisms in Huntington's Disease using an e-Science approach
Presenting Author: Eleni Mina, Leiden University Medical Center
Willeke van Roon-Mom, Leiden University Medical Center
Peter A.C. 't Hoen, Leiden University Medical Center
Mark Thompson, Leiden University Medical Center
Reinout van Schouwen, Leiden University Medical Center
Rajaram Kaliyaperumal, Leiden University Medical Center
Kristina M. Hettne, Leiden University Medical Center
Erik Schultes, Leiden University Medical Center
Barend Mons, Leiden University Medical Center
Marco Roos, Leiden University Medical Center
The amount of data from high-throughput experiments requires novel approaches for effective data analysis, dissemination of methods and collaborative research. We followed an e-Science approach to elucidate molecular mechanisms involved in chromatin-mediated regulation of gene expression in Huntington's Disease (HD), particularly for prioritizing hypotheses that result from the analysis and integration of publicly available datasets.
Our e-Science approach involves implementation of computational experiments by scientific workflows, and communication with different experts at all stages. By mining literature information, we prioritized mechanisms and proteins that are likely to be involved in HD, such as SLIT2 (involved in axonal transport during neural development) or the neuronal adaptor APBA1. We discovered enrichment of differentially expressed genes in the poised chromatin state of the caudate nucleus, suggesting a functional link with neuronal development. Moreover, we observed overrepresentation of CpG islands among promoters of differentially expressed genes, implicating DNA methylation in gene deregulation in HD.
A computational approach can produce more predictions than we can test ourselves. To preserve predictions and facilitate data integration, we expose them in a machine readable format (nanopublications), together with provenance information on authorship and the origin of the data. We demonstrate interoperability by linking the nanopublication provenance to the Research Object (RO) model to preserve our methods and resources.
In conclusion, by applying our methodology we can extract meaningful biological associations for generating novel hypotheses that can be tested and validated by wet lab experimentation. Exposure as nanopublications and ROs enhances reproducibility and reuse of our methodology.
Variance component score test in a mixed-effects model framework to map tissue-specific eQTL
Presenting Author: Chaitanya Acharya, Duke University
Andrew Allen, Duke University
Expression quantitative trait loci (eQTL) analysis associates putative regulatory variants (SNPs) with gene expression levels, which are treated as quantitative traits. Until recently, eQTL analysis was performed on a tissue-by-tissue basis, followed by an examination of the overlap of eQTLs across all tissues. However, most such methods fall short in their ability to jointly analyze data across multiple tissues, and joint analyses of tissue types have been shown to improve power to identify eQTLs that have similar effects across tissues. We propose a variance component score test in a mixed-effects framework to jointly analyze multiple tissue types, and assess the power of such tests. Using Monte Carlo simulations, we show that the new score test performs much better than the traditional likelihood ratio method in terms of statistical power. Using real data sets, we show that the new score test not only preserves power but is also computationally very efficient. We expect this method to be particularly useful in prioritizing variants when analyzing heterogeneous disease model systems, especially for downstream genomic analysis including, but not restricted to, next-generation sequencing analysis.
Introducing Computations in Biology/Bioinformatics: An Undergraduate Perspective
Presenting Author: Maheen Kibriya, Chapman University
Shehzein Khan, Chapman University
Louis Ehwerhemuepha, Chapman University
The importance of computing in the biological and life sciences cannot be overemphasized. It is imperative for students in the biological sciences to be introduced to computing at the undergraduate level, and our work presents the view of undergraduates toward this shift to an interdisciplinary field. We developed a simple nucleotide sequence analysis program written in Python and discuss our experience in learning and using Python to solve simple biological problems. The program was tested using sequence data from the tuberculosis database (www.tbdb.org), and we briefly discuss some high-level functions freely available in BioPython.
High Throughput Phenotype Profiling for Bacterial Flux-Balance Model Optimization
Presenting Author: Daniel Cuevas, San Diego State University
Rob Edwards, San Diego State University
Daniel Garza, Evandro Chagas Institute
Savannah Sanchez, San Diego State University
Advances in large-scale genomic sequencing allow researchers to create accurate computational models of organisms through the use of gene annotation software, such as RAST (Rapid Annotation using Subsystem Technology). These bioinformatics tools deduce gene function through homology-based comparisons that depend on previously verified information; thus new discoveries cannot easily be extrapolated from current analysis tools without experimental examination. Recent developments using phenotype microarrays (PMs) provide a high-throughput, large-scale technique for profiling bacterial characteristics and phenotypes. PMs can experimentally test various growth conditions and report bacterial yield in real time. By coupling PM experiments with the advances of genomic sequencing and annotation, more robust and accurate computational models can be developed and confirmed.
Here we present a combined biological and computational approach that (1) uses optical density data from a PM system as input to evaluate various growth curves, and (2) optimizes flux-balance analysis (FBA) models by using the PM results as a basis for in silico growth simulations. The bacterium Citrobacter sedlakii was sequenced and studied in the PM-FBA pipeline to assess the capabilities of our approach. RAST annotations produced a base computational model consisting of 1,367 enzymatic reactions. After PM-FBA optimization, a total of 44 reactions were added to, or modified within, the model. The model correctly predicted the outcome of 89% of the growth experiments.
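The two pipeline steps can be sketched as follows: call growth/no-growth from PM optical density curves, then score a model's in silico predictions against those calls. The threshold, condition names, OD values, and agreement metric are all invented for illustration and are not the authors' pipeline parameters.

```python
# Hypothetical sketch of PM growth calls and model-vs-experiment agreement.

def calls_from_od(curves, threshold=0.2):
    """curves: {condition: [OD readings over time]}.
    A condition 'grows' if maximum OD gain over baseline meets the threshold."""
    return {c: (max(v) - v[0]) >= threshold for c, v in curves.items()}

def model_accuracy(pm_calls, fba_predictions):
    """Fraction of conditions where the in silico prediction matches the PM call."""
    agree = sum(pm_calls[c] == fba_predictions.get(c, False) for c in pm_calls)
    return agree / len(pm_calls)

pm = calls_from_od({"glucose": [0.1, 0.3, 0.6],
                    "citrate": [0.1, 0.12, 0.13]})
acc = model_accuracy(pm, {"glucose": True, "citrate": True})
```

Conditions where the model and the PM data disagree are exactly the ones that drive reaction additions or modifications during optimization.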