Oral Poster Presentations
Attention Conference Presenters - please review the Speaker Information Page available here.
Twenty-one (21) posters have been selected for oral presentations (8-minute talk) on the first day of the conference. These Oral Poster Presentations will also be considered for "best poster" awards.
Note: The presenting authors should be available to further discuss their poster content during regular poster-presentation times.
Presenting Author: Yi-Hung Huang, Institute of Information Science, Academia Sinica, Taiwan
Yi-Hung Huang, Academia Sinica, Taiwan
Chun-Nan Hsu, University of California, San Diego, United States
Peter W. Rose, University of California, San Diego, United States
The Protein Data Bank (PDB) is the worldwide repository of 3D structures of proteins, nucleic acids and complex assemblies, most of which play essential biological roles and are prime drug targets in various diseases. Most journals require prior submission of structures to the PDB as part of the publication process, and each deposited structure can be retrieved by a unique identifier. By exploring these rich structure data and related citations, we can investigate the relationships between protein structures from the viewpoint of the citation network. Moreover, the analysis of the literature and data citation networks may demonstrate potential pathways of scientific discovery, that is, how knowledge and data were used to advance a particular field in structural biology. Here we propose an information-cascade-based approach to study the whole PDB citation network. We provide a quantitative measure to show how the relationships between protein structures can be characterized by their corresponding citation cascades. We map protein structures in overlapping citation cascades to drug targets and drugs. The results show that related protein structures can be clustered into groups that correlate with the same drug targets. By quantifying the growth of the cascade of each protein structure, the study reveals how the PDB greatly impacts drug development.
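As a minimal illustration of the cascade idea, the sketch below treats a structure's citation cascade as the set of papers that cite its primary publication directly or indirectly, and scores the relationship between two structures by the Jaccard overlap of their cascades. The toy graph, the identifiers, and the choice of Jaccard overlap are assumptions made for illustration; the abstract does not specify the measure actually used.

```python
from collections import deque

def cascade(cited_by, root):
    """Return every paper that cites `root` directly or indirectly (breadth-first)."""
    seen = set()
    queue = deque([root])
    while queue:
        paper = queue.popleft()
        for citing in cited_by.get(paper, ()):
            if citing not in seen:
                seen.add(citing)
                queue.append(citing)
    return seen

def cascade_overlap(cited_by, root_a, root_b):
    """Jaccard overlap of two citation cascades (an assumed, illustrative measure)."""
    a, b = cascade(cited_by, root_a), cascade(cited_by, root_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical citation graph: key = cited item, value = items citing it.
cited_by = {
    "pdb_A": ["p1", "p2"],
    "pdb_B": ["p2", "p3"],
    "p2": ["p4"],
}
overlap = cascade_overlap(cited_by, "pdb_A", "pdb_B")
```

Structures whose cascades overlap strongly could then be grouped with any standard clustering method.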
Presenting Author: Raffaele Calogero, Università di Torino, Italy
Matteo Carrara, University of Torino, Italy
Francesca Cordero, University of Torino, Italy
Marco Beccuti, University of Torino, Italy
Susanna Donatelli, University of Torino, Italy
Gene fusions arising from chromosomal translocations have been implicated in cancer. The discovery of novel gene fusions can lead to a better comprehension of cancer progression and development. RNA sequencing has opened many opportunities for the identification of this class of genomic alterations, leading to the discovery of novel chimeric transcripts in melanomas, breast cancers and lymphomas. Various computational approaches have been developed for the detection of chimeric transcripts. These computational methods produce outputs that do not follow a standard structure and, as far as we know, no tools are available to analyze and manipulate these outputs. Thus, we have developed chimera, a Bioconductor package for downstream processing of data obtained by the following fusion detection tools: bellerophontes, deFuse, FusionFinder, FusionHunter, mapSplice, tophat-fusion, FusionMap, chimeraScan and STAR. The outputs are reorganized in a common data structure. Chimera implements various filters to reduce false positives (spanning/encompassing reads threshold, selecting genomic regions associated with coding genes, excluding fusions encompassing long introns, retaining fusions encoding in-frame fused peptides, etc.). It implements a de novo validation of the fusion junction and produces a structured output that provides all the information needed for further laboratory characterization of fusion events.
Presenting Author: Caitlyn Mills, Northeastern University, United States
Mary Jo Ondrechen, Northeastern University, United States
Pengcheng Yin, Northeastern University, United States
Penny Beuning, Northeastern University, United States
Mary Jo Ondrechen, Northeastern University, United States
There are now over 12,500 Structural Genomics (SG) proteins that have structures deposited in the PDB. However, most of these SG proteins are of unknown or uncertain biochemical function. Although many SG proteins have putative functional assignments, these assignments are often incorrect. The Crotonase Superfamily consists of five diverse functional subgroups that are well characterized both structurally and functionally. These subgroups represent different types of reactivity, including hydratase, isomerase, and dehalogenase activities. The Crotonase Superfamily also contains at least 60 SG proteins, so it is ideal to test predictions of protein function. Our approach is based on computational prediction of the functionally active residues in SG protein structures and comparison of these local chemical signatures with those of proteins of known function. First, we utilize Partial Order Optimum Likelihood (POOL) to predict the functionally important residues of each SG protein. Next, Structurally Aligned Local Site of Activity (SALSA) is used to compare the catalytically active residues of the well characterized members in the superfamily to those of the SG proteins. We demonstrate based on these computational methods that the majority of the putative annotations in this superfamily are likely incorrect. For some proteins, a more likely function is predicted. Currently, biochemical assays are being developed and used to test these predictions. The main outcome of this project will be to successfully classify these SG proteins based on their local structure at the predicted active sites.
Presenting Author: Amrita Roy Choudhury, National Institute of Chemistry, Slovenia
Marjana Novič, National Institute of Chemistry, Slovenia
Igor Zhukov, Polish Academy of Sciences, Poland
The goal of our work is to elucidate the transport channel structure and functional mechanism of the transmembrane protein bilitranslocase. The primary function of bilitranslocase is transport of organic anions, and the protein is potentially druggable. To analyze the protein structure, we have used a combination of computational and experimental methods.
Bilitranslocase has four transmembrane alpha-helical regions. The probable assembly of these four transmembrane regions forming the transport channel is analyzed with a Monte Carlo approach. Predicted interhelical interactions between transmembrane regions TM2:TM3 and TM1:TM4 serve as the primary constraint during the simulation. Analysis of the best-scoring conformations indicates three probable assemblies of transmembrane regions. In the most populated assembly, the two key transmembrane regions, TM2 and TM3, are arranged diagonally. In addition, the structures of these two regions are analyzed, both individually and in mixture, with NMR experiments performed in an SDS micelle environment. FRET experiments validate the TM2:TM3 interaction.
The transmembrane regions TM2 and TM3 consist of amino acids participating in H-bond formation, and are flanked by ligand binding motifs. These two regions therefore play prime roles in ligand mediation through the transport channel. Further, structures of both the transmembrane regions show Pro-induced kinks, which render flexibility to the transport channel. These structural features are in line with the metastable nature of the protein.
Presenting Author: Aik Choon Tan, University of Colorado Denver, United States
Jihye Kim, University of Colorado, United States
Minjae Yoo, University of Colorado, United States
Jaewoo Kang, Korea University, Korea, Rep
Protein kinases represent one of the largest ‘druggable’ and well-studied families in the human genome. Protein kinases play a key role as regulators and transducers of signaling in eukaryotic cells. Kinases are relatively easy to target with small molecules and have been extensively studied at the biochemical, structural, and physiological levels. In cancer cells, some kinases are mutated and acquire oncogenic properties to drive tumorigenesis. Small molecules that inhibit these oncogenic kinases can effectively kill cancer cells, as demonstrated by the success story of imatinib. We developed K-Map, a novel and user-friendly web-based program that systematically connects a set of query kinases to kinase inhibitors based on quantitative profiles of the kinase inhibitor activities. K-Map is motivated by the ‘connectivity map’ concept, in which gene expression changes are used as the ‘universal language’ to connect biological systems, genes, and drugs. Instead of gene expression signatures, we used kinase activity profiles as the ‘language’ for connecting kinases and small molecules in K-Map to reveal the complex interactions of kinases and inhibitors. As a proof-of-concept, we queried K-Map with a set of essential kinases identified in an EGFR-mutant lung cancer cell line. By connecting the essential kinases to compounds in K-Map, we identified and validated bosutinib as an effective compound that could inhibit proliferation and induce apoptosis in EGFR-resistant lines. In summary, we have demonstrated a proof-of-concept, bioinformatics-driven discovery roadmap for drug repurposing and development in cancer research, which could be generalized to other diseases in the era of personalized medicine.
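The idea of connecting query kinases to inhibitors through activity profiles can be sketched as a simple ranking. The example below scores each inhibitor by its mean activity against the query kinases minus its mean activity against all other profiled kinases. This scoring rule, the function name `rank_inhibitors`, and the placeholder kinase and inhibitor names and values are all assumptions for illustration; they are not K-Map's actual statistic or data.

```python
def rank_inhibitors(profiles, query_kinases):
    """Rank inhibitors by on-query activity minus off-query activity.

    profiles: dict inhibitor -> dict kinase -> percent inhibition (hypothetical values).
    query_kinases: iterable of kinase names of interest.
    """
    query = set(query_kinases)
    scores = {}
    for inhibitor, activity in profiles.items():
        # Kinases missing from a profile are treated as not inhibited.
        on_query = [activity.get(k, 0.0) for k in query]
        off_query = [v for k, v in activity.items() if k not in query]
        mean_on = sum(on_query) / len(on_query)
        mean_off = sum(off_query) / len(off_query) if off_query else 0.0
        scores[inhibitor] = mean_on - mean_off
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder profiles, not measured data.
profiles = {
    "inhibitor_1": {"KIN_A": 90.0, "KIN_B": 80.0, "KIN_C": 10.0},
    "inhibitor_2": {"KIN_A": 20.0, "KIN_B": 30.0, "KIN_C": 95.0},
    "inhibitor_3": {"KIN_A": 60.0, "KIN_B": 60.0, "KIN_C": 60.0},
}
ranking = rank_inhibitors(profiles, ["KIN_A", "KIN_B"])
```

Here `inhibitor_1` ranks first because it is both potent against the query set and selective for it.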
Presenting Author: Benedict Anchang, Stanford University, United States
Brian Williamson, Stanford University, United States
Sylvia Plevritis, Stanford University, United States
Accumulating evidence implicates intratumor heterogeneity as an important challenge to cancer treatment. To address this challenge, there has been a recent surge in clinical trials of combination therapy for various cancer types; however, a principled framework for rational and unbiased design of optimal drug combinations is needed. We reason that by simultaneously targeting multiple key pathways across different cell types at onset, we will decrease the likelihood of emerging resistant populations. Our goal is to develop an optimization framework for effective combination therapy using cell population data that reveal heterogeneity in inter- and intra-cellular signaling at the level of single cells. Here we introduce an algorithm named DRUGMNEM that combines cell state identification, nested effect modeling and link analysis to optimize possible novel combination therapeutic strategies from experimentally derived multi-drug screening single-cell data; these combinations may yield better prognostic results than the known monotherapies. We apply DRUGMNEM to normal hematopoietic drug screening single-cell data comprising 10 surface markers and 14 intracellular protein expression responses measured 30 minutes after administration of 5 Jak and Bcr inhibitors at 2 dose levels (maximum and no dose) under different stimulations. Under each stimulation, we show that DRUGMNEM is optimized across the entire single-cell drug screening data and produces a reduced drug combination set that has the potential for maximally targeting different signals in all major and rare cell types. Experimental validation by comparing the progression of the cells after treatment with the derived optimized drug combination cocktail is ongoing.
Presenting Author: Kunal Kundu, Tata Consultancy Services, United States
Steven Brenner, University of California at Berkeley, United States
Rajgopal Srinivasan, Tata Consultancy Services Ltd, India
Uma Sunderam, Tata Consultancy Services Ltd, India
Sadhna Rana, Tata Consultancy Services Ltd, India
Ajithavalli Chellapan, Tata Consultancy Services Ltd, India
We describe and present a comprehensive and easily extensible open source tool for Human Genome Annotation called VARANT, written in the Python programming language. While several tools for annotating variants are available, we believe that VARANT distinguishes itself by being fully open source, capable of using multiple processors/cores for speedy annotation and providing extensive annotation of UTR and non-coding regions in addition to the customary annotations of genes. An additional highlight of the tool is the ability to incorporate various inheritance models into the annotation process, which when coupled with phenotype information can be used to quickly generate a list of prioritized genes and variants. The tool has been successfully used to identify causal variants in rare immuno-disorders.
Presenting Author: Xiaojia Tang, Mayo Clinic, United States
Kevin Thompson, Mayo Clinic, United States
Asha Nair, Mayo Clinic, United States
Krishna Kalari, Mayo Clinic, United States
It is known that the 3’ untranslated region (UTR) contains essential regulatory elements that affect the expression and stability of mRNA. Transcriptome sequencing (RNA-Seq) with poly-A selected libraries usually generates high coverage at the 3’ UTRs and thus enables accurate calling of expressed single nucleotide variants (eSNVs) in untranslated regions. We have developed a novel computational system, ESNV-Detect, to identify eSNVs from RNA-Seq data. It has been validated in several tumors and lymphoblastoid cell lines with high precision and sensitivity in both coding regions and UTRs. Here we applied ESNV-Detect to study 3’ UTR eSNVs in estrogen-receptor positive (ER+) breast tumors from The Cancer Genome Atlas (TCGA). We obtained RNA-Seq data of 559 ER+ samples. To identify somatic eSNVs, we restricted our analyses to 94 samples (47 pairs) that have both tumor and normal data. We identified 49,636 unique somatic 3’ UTR eSNVs in the 47 ER+ tumor-normal pairs. Comparison of somatic 3’ UTR eSNVs with the list of somatic mutations that alter miRNA target sites obtained from SomamiR DB (http://compbio.uthsc.edu/SomamiR/) revealed that in our data 175 eSNVs create and 71 eSNVs disrupt known miRNA target sites. Thus far, only limited information is available in current databases about somatic eSNVs in 3’ UTRs. Hence we are developing a computational system that will allow us to investigate the impact of the novel somatic 3’ UTR eSNVs in ER+ tumors. This will enhance our understanding of transcription regulation of ER+ disease.
Presenting Author: Benjamin Good, The Scripps Research Institute, United States
Karthik Gangavarapu, The Scripps Research Institute, United States
Salvatore Loguercio, The Scripps Research Institute, United States
Obi L. Griffith, Washington University School of Medicine, United States
Max Nanis, The Scripps Research Institute, United States
Chunlei Wu, The Scripps Research Institute, United States
Andrew I. Su, The Scripps Research Institute, United States
Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility and biological interpretability. Methods that take advantage of structured prior knowledge show promise in helping to define better signatures but most knowledge remains unstructured.
Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes previously unheard of. Here, we developed and evaluated a game called The Cure on the task of gene selection for breast cancer survival prediction. Our central hypothesis was that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from game players. We envisioned capturing knowledge both from the players’ prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game.
Between its launch in Sept. 2012 and Sept. 2013, The Cure attracted more than 1,000 registered players who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data clearly demonstrated the accumulation of relevant expert knowledge. In terms of predictive accuracy, these gene sets provided comparable performance to gene sets generated using other methods including those used in commercial tests. The Cure is available at http://genegames.org/cure/
Presenting Author: Andrew Quitadamo, University of North Carolina at Charlotte, United States
Xinghua Shi, University of North Carolina at Charlotte, United States
Ovarian cancer accounts for 5% of cancer deaths in women, making a thorough understanding of its biological basis important for developing better treatment and diagnosis. In this study we develop a network approach that integrates serous ovarian cancer DNA methylation and gene expression data from The Cancer Genome Atlas (TCGA). We apply two methods to obtain networks by mining the associations between DNA methylation and gene expression. Specifically, we 1) perform a methylation expression quantitative trait loci (meQTL) analysis for each CpG and gene pair, and build gene networks by expanding the meQTL results under the guidance of known protein-protein interactions; and 2) apply a machine learning approach that simultaneously builds gene networks using graphical models. We then build an integrated network by consolidating the networks constructed by these different methods. Our further analysis of the integrated network points to a network view of how epigenetic signatures, particularly DNA methylation, perturb gene regulation and lead to tumorigenesis and cancer progression in ovarian cancer.
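The pairwise step of a meQTL-style scan can be sketched as follows. This simplified example computes the Pearson correlation between methylation and expression across samples for every CpG and gene pair and keeps the pairs above a correlation threshold. A real meQTL analysis would typically fit a regression with covariates and correct for multiple testing; the threshold, the function names, and the toy values here are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def associate(methylation, expression, min_abs_r=0.8):
    """Return (cpg, gene, r) for every pair with |r| >= min_abs_r."""
    hits = []
    for cpg, meth_values in methylation.items():
        for gene, expr_values in expression.items():
            r = pearson(meth_values, expr_values)
            if abs(r) >= min_abs_r:
                hits.append((cpg, gene, round(r, 3)))
    return hits

# Toy data: one CpG site and two genes measured in five samples.
methylation = {"cg_1": [0.1, 0.2, 0.3, 0.4, 0.5]}
expression = {
    "GENE_A": [10.0, 8.0, 6.0, 4.0, 2.0],  # falls as methylation rises
    "GENE_B": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related
}
hits = associate(methylation, expression)
```

The retained pairs would form the seed edges that are then expanded using known protein-protein interactions.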
Presenting Author: Serdar Bozdag, Marquette University, United States
Brittany Baur, Marquette University, United States
One of the challenging and important computational problems in systems biology is to infer gene regulatory networks of biological systems. Several methods that exploit gene expression data have been developed to tackle this problem. Due to the limitations of the gene expression data, these approaches resulted in many spurious regulatory interactions. Additional information such as copy number and DNA methylation could be integrated to improve the accuracy of the results. Here, we report the first dynamic Bayesian network-based framework that integrates gene expression, copy number and DNA methylation to infer gene regulatory networks. We show that DNA methylation between genes that have a regulatory relationship is more correlated than DNA methylation between other genes. We also show that there is a higher correlation between copy number and gene expression of genes that have a regulatory relationship than correlation between copy number and gene expression of other genes. Our results show that the integration of copy number and DNA methylation as a network prior improves the accuracy of the network construction. Our approach could easily integrate any number of other biological datasets such as transcription factor binding and/or literature.
Presenting Author: Sergio Pulido Tamayo, Department of Information Technology, Belgium
Kathleen Marchal, Ghent University, Belgium
Jorge Duitama, KU Leuven, Belgium
Aminael Sanchez-Rodriguez, KU Leuven, Belgium
Annelies Goovaerts, KU Leuven, Belgium
Georg Hubmann, KU Leuven, Belgium
María R. Foulquié-Moreno, KU Leuven, Belgium
Johan M. Thevelein, KU Leuven, Belgium
Kevin J. Verstrepen, KU Leuven, Belgium
Kathleen Marchal, Ghent University, Belgium
Bulk segregant analysis (BSA) coupled to high throughput sequencing is a powerful method to map genomic regions associated with phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, and their DNA is extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors.
To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM).
Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed reliable identification of QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method.
EXPLoRA performs at least as well as the state of the art and is robust even at low signal-to-noise ratios, i.e., when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available.
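To make the idea of modeling dependency between neighboring markers concrete, the sketch below decodes a two-state Hidden Markov Model (unlinked vs. linked) over allele counts with the Viterbi algorithm. It is not EXPLoRA's actual model: the binomial emissions, the allele frequencies of 0.5 and 0.9, the transition probability, and the toy counts are all assumptions chosen for illustration.

```python
import math

def log_binom(k, n, p):
    """Log-probability of k successes in n trials with success probability p."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def viterbi_linked(counts, p_unlinked=0.5, p_linked=0.9, stay=0.95):
    """Label each marker 0 (unlinked) or 1 (linked).

    counts: list of (superior_parent_allele_reads, total_reads), ordered along the chromosome.
    """
    freqs = (p_unlinked, p_linked)
    log_stay, log_switch = math.log(stay), math.log(1 - stay)
    k, n = counts[0]
    # Uniform prior over the two states at the first marker.
    score = [math.log(0.5) + log_binom(k, n, f) for f in freqs]
    back = []
    for k, n in counts[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [score[prev] + (log_stay if prev == state else log_switch)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log_binom(k, n, freqs[state]))
        score = new_score
        back.append(pointers)
    # Trace back the most probable state path.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy pool: three central markers are enriched for the superior-parent allele.
counts = [(10, 20), (11, 20), (18, 20), (19, 20), (17, 20), (9, 20)]
path = viterbi_linked(counts)
```

Because switching states is penalized, a single noisy marker is less likely to be called linked on its own than a run of consistently enriched neighbors.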
Presenting Author: Perry Haaland, BD Technologies, United States
John Palowitch, UNC and BD, United States
Steve Marron, UNC, United States
Qunqun Yu, UNC and BD, United States
This work focuses on the microbiome of the lower lung based on sequencing of DNA from bronchoalveolar lavage (BAL) samples. This is relevant to lower respiratory infections (LRI) and to ventilator associated pneumonia (VAP) as it occurs in the hospital intensive care unit.
We analyze published data consisting of taxonomic identifications and relative abundances for BAL samples from ICU patients. With the goal of characterizing similarities and differences among four diagnosis groups, we employ object oriented data analysis (OODA) as a statistical framework.
The goal of OODA is to carefully define and appropriately combine the possible data objects. First, we consider relative abundances as data objects and consider species richness measures using standard multivariate analysis and visualization techniques. Second, we construct a support tree that captures phylogenetic relationships. We consider relative abundances along with genetic distances in the context of tree structure as data objects. Third, we consider patient-specific subtrees as data objects. We propose and implement a similarity measure for tree objects. The measure integrates within-sample branch distances into leaf abundance differentials.
Our results suggest distributional differences in diversity among diagnosis groups. For example, certain taxa are dominant in the Principal Component Analysis directions and these taxa are over-represented in subjects with low species richness. Taking into account tree structure suggests interesting clusters of patients that cut across diagnosis groups.
We believe that OODA provides a feature-rich data analysis strategy in support of our long-term goal to better characterize LRI and to provide guidance in selecting antimicrobial therapies.
Presenting Author: Emma Schwager, Harvard School of Public Health, United States
Curtis Huttenhower, Harvard School of Public Health, United States
Uri Weingart, Harvard School of Public Health, United States
Timothy Tickle, Harvard School of Public Health, United States
Xochitl Morgan, Harvard School of Public Health, United States
Curtis Huttenhower, Harvard School of Public Health, United States
Background: Compositional data, or data constrained to sum to a constant total, occur in many scientific areas. The non-independence of such data causes spurious correlations when standard covariance measures are applied, regardless of the similarity measure used. This problem has not yet been addressed in a way that generalizes to different similarity measures, nor for the high-dimensional measurements typical of modern biological data, including data from microbial community studies.
Results: We developed an approach to provide appropriate p-values for varied similarity scores between compositional measurements, which we call Compositionality Corrected by PErmutation and REnormalization (CCREPE). We assessed the false positive rate of CCREPE using synthetic datasets modeling a variety of realistic community structures, as well as comparing its performance and behavior with existing methods. We observed that CCREPE performs better in communities with greater evenness than in more skewed communities. We further applied the CCREPE procedure using a novel ecologically-targeted similarity score (the N-dimensional Checkerboard score) to 682 metagenomes from the Human Microbiome Project to determine significant co-variation patterns while avoiding spurious correlation from compositionality. Overall, the resulting network recapitulated the basic characteristics of earlier 16S-based networks, including little (<15%) between-site interaction and few "hub" microbes (scale-freeness).
Conclusions: These new methods will allow the derivation of significant co-variation networks from high-dimensional compositional data, particularly the detection of species and, eventually, sub-species level ecological interactions within microbial communities.
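The permutation-and-renormalization idea can be sketched in a few lines. In this simplified example, each feature is shuffled independently across samples to break real associations, every sample is then renormalized to sum to one so that the null distribution still carries the compositional constraint, and the observed similarity is compared against that null. This is only an illustration of the principle, not the CCREPE implementation; Pearson correlation, the permutation count, and the toy abundances are assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def renormalize(rows):
    """Rescale each sample (row) to sum to 1; assumes every row has a positive total."""
    return [[value / sum(row) for value in row] for row in rows]

def permutation_pvalue(samples, i, j, n_perm=200, seed=0):
    """Two-sided permutation p-value for the similarity of features i and j."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    observed = pearson([row[i] for row in samples], [row[j] for row in samples])
    extreme = 0
    for _ in range(n_perm):
        columns = [[row[c] for row in samples] for c in range(n_features)]
        for column in columns:
            rng.shuffle(column)  # break associations between features
        null_rows = renormalize([list(row) for row in zip(*columns)])
        null = pearson([row[i] for row in null_rows], [row[j] for row in null_rows])
        if abs(null) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Toy relative abundances: 6 samples x 3 taxa, each row sums to 1.
samples = [
    [0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.2, 0.3, 0.5],
    [0.1, 0.4, 0.5], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4],
]
observed, p_value = permutation_pvalue(samples, 0, 2)
```

Any other similarity score could be substituted for `pearson` without changing the structure of the test.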
Presenting Author: Shany Ofaim, Technion, Israel
Maya Ofek, Agricultural Research Organization, Israel
Noa Sela, Agricultural Research Organization, Israel
Dror Minz, Agricultural Research Organization, Israel
Yechezkel Kashi, Technion, Israel
Shiri Freilich, Agricultural Research Organization, Israel
Rapid advances in metagenomics and genome sequencing have led to the accumulation of vast amounts of empirical ecological data such as 16S and RNA-Seq. With the increase in ecological data production, the need for robust automated approaches to functional community analysis rises, creating an information-analysis gap. An accumulating body of evidence now supports the reliability of metabolic analysis, such as metabolic networks, as a tool for processing genomic data into information describing the 'lifestyle' of microbial species and the network of interactions they form within such communities. Metabolic networks comprise interconnected chains of chemical reactions that occur in a living organism to maintain life. It has been demonstrated that metabolic network approaches are a sufficient tool for the prediction of cellular activity and growth capacity under changing conditions at the single species level. The integration of multiple single species networks into a communal metabolic network representation allows for the investigation of the functional division between its participants, showing the metabolic hierarchy in the sampled environment. Here we demonstrate the use of such metabolic network approaches in the functional division analysis of communities in the rhizosphere and bulk soil environments based on RNA-Seq data that were collected for cucumber and wheat crops in Israel. Through the application of the Expansion algorithm we simulate community activity following the systematic deletion of community members, allowing us to delineate each member's unique contribution to community metabolism. For example, we identify species' contributions to N, S and glycan metabolism.
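The deletion experiment can be illustrated with a small network-expansion sketch. Starting from seed metabolites, a reaction fires once all of its substrates are available and its products are added to the reachable set; the process repeats until nothing new is produced. Removing one member's reactions and re-running the expansion shows which metabolites depend on that member. The species, reactions, and seed set below are hypothetical, and this is a generic sketch of the technique rather than the authors' implementation.

```python
def expand(reactions, seeds):
    """Return all metabolites reachable from `seeds`.

    reactions: list of (substrates, products) pairs, each a set of metabolite names.
    """
    scope = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            # A reaction fires only when every substrate is already available.
            if substrates <= scope and not products <= scope:
                scope |= products
                changed = True
    return scope

def member_contributions(community, seeds):
    """Metabolites lost from the community scope when each member is deleted."""
    all_reactions = [r for reactions in community.values() for r in reactions]
    full_scope = expand(all_reactions, seeds)
    lost = {}
    for species in community:
        remaining = [r for name, reactions in community.items()
                     if name != species for r in reactions]
        lost[species] = full_scope - expand(remaining, seeds)
    return lost

# Hypothetical three-member community.
community = {
    "species_1": [({"nitrate"}, {"nitrite"})],
    "species_2": [({"nitrite"}, {"ammonia"}), ({"glucose"}, {"pyruvate"})],
    "species_3": [({"glucose"}, {"pyruvate"})],
}
seeds = {"nitrate", "glucose"}
lost = member_contributions(community, seeds)
```

In this toy community, `species_3` is metabolically redundant, whereas deleting `species_1` removes the whole downstream nitrogen pathway.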
Presenting Author: Tomislav Ilicic, Wellcome Trust Sanger Institute, United Kingdom
Mouse embryonic stem cells cultured in serum/LIF are more morphologically heterogeneous than when cultured in 2i/LIF.
Stochastic fluctuations in transcription, also referred to as "noise", can cause phenotypic changes in cells, leading to heterogeneity. This implies that supposedly homogeneous populations of cells can yield biologically important subpopulations. Bulk sequencing masks such heterogeneity, as gene expression is averaged over a cell population. By using single cell sequencing, we were able to capture transcriptomes of single mESCs cultured under three different conditions (Serum/LIF, 2i/LIF and alternative 2i/LIF) and chart out the landscape of heterogeneity. Our findings suggest that different growth conditions result in distinct transcriptional profiles of cells. Cells cultured in alternative 2i/LIF seem to share greater transcriptional similarity with cells from 2i/LIF. Comparative analysis of single cells within the same culture condition revealed a differentiated subpopulation of cells in Serum/LIF, characterised by downregulation of Nanog, Oct4, Rex1, Esrrb, Sox2 and Tet2, and upregulation of Krt8, Krt18, Klf6 and Tpm1.
To measure the degree of noise across culture conditions, the cells were examined for allele-specific expression patterns in hybrid mice. We identified four distinct noise signatures representing extrinsic noise (variability between cells), intrinsic noise (variability within a cell) and allele specific expression. Most genes show signatures characterised by intrinsic noise and only a few genes show extrinsic noise.
In summary, our work reveals transcriptional heterogeneity of mouse embryonic stem cells grown in different culture conditions and gives a global overview about the degree of stochastic fluctuations in such cells.
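One standard way to separate the two kinds of noise from allele-resolved counts is the dual-reporter decomposition, sketched below with the two alleles of a gene playing the role of the two reporters. The abstract does not state how its noise signatures were computed, so applying this estimator here is an assumption, and the counts are toy values.

```python
def noise_decomposition(allele_a, allele_b):
    """Return (intrinsic, extrinsic) squared noise for one gene.

    allele_a, allele_b: expression of each allele, one value per cell.
    intrinsic = <(a - b)^2> / (2 <a><b>)   (alleles disagree within a cell)
    extrinsic = (<ab> - <a><b>) / (<a><b>) (alleles co-vary between cells)
    """
    n = len(allele_a)
    mean_a = sum(allele_a) / n
    mean_b = sum(allele_b) / n
    mean_ab = sum(a * b for a, b in zip(allele_a, allele_b)) / n
    mean_sq_diff = sum((a - b) ** 2 for a, b in zip(allele_a, allele_b)) / n
    denom = mean_a * mean_b
    intrinsic = mean_sq_diff / (2 * denom)
    extrinsic = (mean_ab - denom) / denom
    return intrinsic, extrinsic

# Alleles that move together across cells: purely extrinsic variation.
coupled = noise_decomposition([10, 20, 30], [10, 20, 30])
# Alleles that trade off within each cell: dominated by intrinsic variation.
uncoupled = noise_decomposition([10, 30], [30, 10])
```

With small samples the extrinsic estimate can be negative, as in the second example, so in practice it is interpreted alongside confidence intervals.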
Presenting Author: Naisha Shah, National Institutes of Health, United States
Angelique Biancotto, NIH, United States
Yong Lu, NIH, United States
CHI Consortium, NIH, United States
Pamela L. Schwartzberg, NIH, United States
John S. Tsang, NIH, United States
It is increasingly clear that the immune system and inflammation contribute not only to the pathogenesis of autoimmune and infectious disease, but also to a host of pathologies, including cancer, diabetes, neurodegeneration and other chronic illnesses. Thus, a more comprehensive and comparative characterization of immune signatures of diseases could lead to a better understanding of etiology, biomarker identification, and ultimately, disease prevention and treatment. Blood transcriptomic profiling is a powerful approach to assess the status of the immune system, and it has been widely applied to examine diseases. Due to the myriad immune-cell subsets in blood, changes in transcript abundance could reflect alterations in the composition of cell populations, transcriptional changes within cell subsets, or both. While flow cytometry could be used for assessing cell population abundances, such data are often unavailable. Thus, interpretations of blood profiling results tend to focus solely on the genomics level, and much less on the cellular perspective.
Here, by leveraging natural variations within a cohort of healthy subjects where blood transcriptomics and 100+ immune-cell subset abundances were measured simultaneously, we derived machine-learning models for predicting cell frequencies using gene-expression information alone. Our approach involves cross-platform normalization of gene-expression data as well as training and cross-validation of Elastic Net models. We next applied these models to predict and compare immune-cell subset alterations across 112 diseases, including lupus, sepsis and autism. We identified a number of immune-cell associations, including ones shared across multiple diseases. Our approach can be applied in other settings to dissect the cellular origin of transcriptomic signatures.
Presenting Author: Hong Yu, UMass Medical School, United States
An intelligent figure search engine will not only assist biocuration and allow individual biomedical researchers to access figures more efficiently from full-text biomedical articles, but is also an important step towards automatic validation of genome-wide high-throughput predictions. With more and more full-text biomedical articles becoming open access (the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities, the Bethesda Statement on Open Access Publishing, and PubMed Central), we are developing a figure search system (available at http://figuresearch.askhermes.org) that integrates natural language processing, image processing, machine learning, and user interfacing (“biomedical Natural Language figure Processing” approaches) for intelligent biomedical figure search (iBioFigureSearch). Our iBioFigureSearch associates each figure with text that describes the content of the figure, summarizes the associated text, ranks figures by their importance, and integrates both image and text for improved information retrieval. We have evaluated iBioFigureSearch by both intrinsic and task-driven extrinsic evaluation and found that iBioFigureSearch improves user-centered information seeking.
Presenting Author: Alistair Martin, University of Oxford, United Kingdom
Charlotte Deane University of Oxford, United Kingdom
Due to the degeneracy of the genetic code, there exist multiple synonymous DNA sequences that result in the same amino acid sequence being produced. Both the central dogma and Anfinsen’s dogma state that the flow of information within a cell is unidirectional, and hence these synonymous sequences should retain no knowledge once translated, resulting in equivalent proteins. An increasing volume of experimental work shows that, in actuality, these synonymous sequences produce proteins with different physical properties; a wide range of changes have been reported to occur in association with synonymous mutations in coding DNA sequences. In our research, we use a bioinformatics approach to probe the manner in which codon choice can influence tertiary protein structure. We utilise the concepts of codon optimality and cotranslational folding to score the translation rate of RNA sequences that have corresponding Protein Data Bank entries, crucially using data sourced from a wide range of species. By subsequently grouping the sequences by structure and aligning them, we find features in their translation profiles that are evolutionarily conserved in association with structural features. Our research indicates that mRNA contains an additional layer of information pertaining to protein structure that is lost once translation occurs. We hope that these results can be used in the future to improve the current biophysical understanding of translation and structure formation.
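The idea of scoring translation rate from codon choice can be sketched as a sliding-window profile. This is only an illustration under stated assumptions: the abstract does not specify the scoring scheme, so the per-codon speed values, the window size, the default score and the function names below are all hypothetical.

```python
def split_codons(cds):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translation_profile(cds, codon_speed, window=3, default=0.5):
    """Mean per-codon speed score in each sliding window along the sequence.

    Codons missing from codon_speed get the (assumed) default score.
    """
    scores = [codon_speed.get(c, default) for c in split_codons(cds.upper())]
    if len(scores) < window:
        return []
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical speed scores; CUG and CUA both encode leucine.
codon_speed = {"AUG": 1.0, "CUG": 1.0, "CUA": 0.2, "GAA": 0.8}

# Two synonymous mRNAs, both encoding Met-Leu-Leu-Glu.
fast = translation_profile("AUGCUGCUGGAA", codon_speed)
slow = translation_profile("AUGCUACUAGAA", codon_speed)
```

The two sequences encode the same peptide but yield different profiles, which is the kind of signal that could then be grouped by structure and aligned.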
Presenting Author: Reazur Rahman, Brandeis University, United States
Yuliya Sytnikova Brandeis University, United States
Nelson Lau Brandeis University, United States
Transposons are major structural variants (SVs) in animal genomes. In cancer and human biology, there is a need to identify new transposon SVs beyond the tremendous load of existing transposons (>45% of the human genome). Most current efforts to discover transposon SVs rely on Paired-End (PE) reads from genome deep-sequencing, but the greater cost of PE reads compared to Single-End (SE) reads (the standard form of genome deep-sequencing) motivated us to develop a new bioinformatics tool called GenTrAn (Genome Transposon Analyzer). By scanning SE read libraries with a hybrid approach of broad-level split-read mapping followed by filtering with various quality criteria, GenTrAn discovers de novo transposon SVs with high sensitivity and specificity. Importantly, the transposon SV sites that GenTrAn identifies display target site duplications indicative of a recent transposition event, and point to precise genomic coordinates that enable discrimination of SVs that disrupt coding gene exons from less disruptive intronic insertions.
We demonstrate the efficacy of our tool by discovering the genome-wide distributions of transposon SVs in four different Drosophila melanogaster cell lines. GenTrAn showed that transposon SV landscapes can be surprisingly diverse even in a natural cell line, and that these SVs tend to avoid coding exons yet prefer to insert near genes in intergenic regions. In addition, GenTrAn can measure the allele ratio of transposon SVs, and all predicted SVs were successfully validated by genomic PCR. GenTrAn’s precision in transposon SV detection and its ability to mine the more economical SE read libraries make it an attractive tool for genome diagnostics.
Presenting Author: Christopher Schlosberg, Washington University in St. Louis, United States
Nathan VanderKraats Washington University in St. Louis, United States
Jeffery Hiken Washington University in St. Louis, United States
Kilian Weinberger Washington University in St. Louis, United States
Tao Ju Washington University in St. Louis, United States
John Edwards Washington University in St. Louis, United States
Establishment of specific patterns of DNA methylation is necessary for normal development, and aberrant methylation is frequently observed in cancer. Hypermethylation of CpG islands overlapping the transcription start site (TSS) downregulates tumor suppressor genes, thus promoting tumorigenesis. However, recent genome-wide mapping of methylation indicates only modest correlation between differential gene expression (DGE) and methylation, casting doubt on the importance of methylation in regulating DGE. In addition, complex patterns, such as CpG island-shore methylation and long hypomethylated domains, also correlate with DGE. We hypothesize that unbiased computational tools will better model complex patterns of methylation and capture strong associations between DGE and methylation. By representing methylation as continuous curves centered on a gene’s TSS and performing unsupervised clustering using Dynamic Time Warping, we enumerate complex, differential methylation signatures that correlate strongly with DGE. We next trained a nearest neighbor classifier on examples of these significantly correlated signatures to identify genes that display both differential methylation and expression. Using data from the Human Epigenome Atlas, ENCODE, and eight breast cancer cell lines, our classifier significantly outperforms state-of-the-art Differentially Methylated Region (DMR)- and Support Vector Machine-based methods at identifying associated genes. By further analyzing these associated genes, we find that methylation’s silencing mechanism may be signature-dependent. In breast cancer cells, we observe that methylation at the TSS does not affect transcriptional initiation; however, methylation proximal to the TSS may inhibit transcriptional elongation. The discovery of these potentially functional methylation changes will facilitate the identification of patients who may benefit from clinically approved demethylating therapeutics.
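As a rough illustration of comparing TSS-centered methylation curves with Dynamic Time Warping, the sketch below computes the standard dynamic-programming DTW distance and uses it in a 1-nearest-neighbor lookup. Using DTW as the classifier's distance is an assumption (the abstract names DTW only for the clustering step), and the curves, labels and function names are invented for the example.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat b[j-1] against a new point of a
                cost[i][j - 1],      # repeat a[i-1] against a new point of b
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


def nearest_neighbor_label(query, labeled_curves):
    """Label of the training curve closest to query under DTW (1-NN)."""
    best_label, best_dist = None, inf
    for curve, label in labeled_curves:
        dist = dtw_distance(query, curve)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Made-up differential-methylation curves sampled across a TSS window.
training = [
    ([0.0, 0.1, 0.8, 0.9, 0.1, 0.0], "tss_hypermethylation"),
    ([0.0, 0.0, 0.0, 0.1, 0.7, 0.8], "downstream_hypermethylation"),
]
# A query with a similar peak, slightly shifted and of a different length.
query = [0.0, 0.7, 0.9, 0.8, 0.1, 0.0, 0.0]
predicted = nearest_neighbor_label(query, training)
```

Because DTW allows local stretching, curves of different lengths or with shifted peaks can still be matched, which is one reason it suits shape-based comparison of methylation profiles.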