Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide



Accepted Posters

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.


Track: CAMDA

Session A-163: TRUCE: a Hidden Markov Model for Annotation of Tandem Repeats
COSI: CAMDA
  • Daniel Olson, University of Montana, United States
  • Travis Wheeler, University of Montana, United States

Short Abstract: Short tandem repeats make up more than 2% of the unlabeled human genome, and a significant fraction of both transposable elements and protein-coding DNA. Though well-established software exists for annotating this repetitive content, incorporation into annotation pipelines is hampered by limited information on the statistical significance of repeat annotation. We present a prototype of a new tool called TRUCE (Tandem Repeat Unifying disCovErer) for labeling tandem repeats, based on a new hidden Markov model of repetitive sequence. TRUCE analyzes sequence in two passes. The first pass uses a high sensitivity and low specificity hidden Markov model to detect candidate repeat regions. The second pass models the repeat core, refines boundaries, and removes false positives. We demonstrate its effectiveness and discuss incorporation into sequence annotation tools, such as RepeatMasker and HMMER.

Session A-165: Improving homology-based gene prediction using intron position conservation and RNA-seq data
COSI: CAMDA
  • Jens Keilwagen, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Germany
  • Frank Hartung, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Germany
  • Michael Paulini, EMBL-EBI, United Kingdom
  • Jan Grau, Institute of Computer Science, Martin Luther University Halle-Wittenberg, Germany

Short Abstract: Today, genomes are sequenced and assembled rapidly. Annotations are added to those genomes based on RNA-seq data and computational predictions. Homology-based approaches predict genes or transcripts in newly sequenced target genomes based on their similarity to known genes from well-annotated reference genomes. Most programs assume conservation of amino acid sequence but neglect the known gene structure defined by exons and introns. However, the gene structure of intron-containing orthologous genes is highly conserved throughout the whole plant or animal kingdom. Here, we present GeMoMa, which additionally exploits the conservation of gene structures from a reference species to predict gene models in a target genome. Instead of searching for the complete coding sequence or amino acid sequence, GeMoMa searches for the amino acid sequences encoded by the (partially) coding exons and subsequently builds gene models from these hits. Recently, we extended GeMoMa to improve prediction accuracy by using RNA-seq data to define splice sites, and to join predictions from multiple reference species. In contrast to purely transcriptomics-based gene predictions, GeMoMa is capable of predicting lowly or specifically transcribed genes. By design, GeMoMa automatically provides information about putative homologous gene pairs and allows for transferring information about gene function. We assess the performance of GeMoMa and compare it with state-of-the-art competitors on plant, animal, and fungal genomes. Subsequently, we predict gene models for four nematode species and compare them with the official annotation from WormBase, proposing hundreds to thousands of high-quality new or re-annotations per species.

Session A-167: Computing Pangenome Statistics in R
COSI: CAMDA
  • Asterios Mpatziakas, Aristotle University of Thessaloniki, Greece
  • Fotis Psomopoulos, Institute of Applied Biosciences, Center for Research and Technology Hellas, Greece
  • Theodoros Moysiadis, Institute of Applied Biosciences, Center for Research and Technology Hellas, Greece
  • Stefanos Sgardelis, Department of Biology, Aristotle University of Thessaloniki, Greece

Short Abstract: Advances in sequencing techniques have massively increased the publicly accessible -omics data and thus enable further and more extensive research opportunities on genome diversity. The concept of the pangenome refers to the union of gene families shared by a set of genomes. Several studies have implemented specific pangenome analyses for a variety of organisms, ranging from microbes to viruses and plants [1], leading to genomic projects of various scales and the advancement of general understanding of evolutionary mechanisms, generating usable knowledge across multiple sectors. A pangenome analysis can be defined as the identification of three distinct subsets of gene families: i) the Core genome, consisting of gene families that are shared among all genomes; ii) the Dispensable genome, consisting of families present in the majority of the genomes; and iii) the Peripheral genome, consisting of genes that are present in only one genome. The essential part of this type of analysis is the use of data in an encompassing way instead of the traditionally linear approaches evident in targeted genome studies [1]. We present open source software, freely available for the R statistical language, that is usable in the later stages of such an analysis, i.e. after the construction of the gene families for a given set of genomes, based on information about the full complement of gene families. A complete methodology is proposed, suitable for sets of genomes of varying complexity, optimizing and enriching an assortment of existing measures from micropan [2]. Finally, we demonstrate the methodology using publicly available data from UniProt [http://www.uniprot.org/]. [1] The Computational Pan-genomics Consortium, “Computational pan-genomics: status, promises and challenges,” Briefings in Bioinformatics, no. May, pp. 1–18, 2016. [2] L. Snipen and K. H. Liland, “micropan: an R-package for microbial pan-genomics,” BMC Bioinformatics, vol. 16, p. 79, 2015.
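The three subsets described in this abstract can be illustrated with a minimal Python sketch. It is not the presented R package: the function name and the input layout (a mapping from genome name to a set of gene family identifiers) are hypothetical, and every family found in more than one but not all genomes is assumed to count as Dispensable.

```python
from collections import Counter


def classify_gene_families(presence):
    """Split gene families into core, dispensable and peripheral subsets.

    presence: dict mapping a genome name to the set of gene family ids
    found in that genome (hypothetical input layout).
    """
    n_genomes = len(presence)
    # Count in how many genomes each gene family occurs.
    counts = Counter(fam for families in presence.values() for fam in families)

    subsets = {"core": set(), "dispensable": set(), "peripheral": set()}
    for family, count in counts.items():
        if count == n_genomes:
            subsets["core"].add(family)        # shared by all genomes
        elif count == 1:
            subsets["peripheral"].add(family)  # present in a single genome
        else:
            # Assumption: anything in between is treated as dispensable.
            subsets["dispensable"].add(family)
    return subsets


example = {
    "genome_1": {"famA", "famB", "famC"},
    "genome_2": {"famA", "famB"},
    "genome_3": {"famA", "famD"},
}
subsets = classify_gene_families(example)
```

With the toy input above, famA is shared by all three genomes, famB by two, and famC and famD each occur in one genome only.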

Session A-169: Identification of Differentially Expressed Genes Using Tests Based on Multiple Imputations
COSI: CAMDA
  • Sang Cheol Kim, Korea National Research Institute of Health, Centers for Disease Control & Prevention, South Korea

Short Abstract: Datasets from DNA microarray experiments, which take the form of large matrices of gene expression levels, often have missing values. However, existing statistical methods, including principal component analysis (PCA) and Hotelling’s t-test, are not directly applicable to datasets with missing values because they generally assume that the observed dataset is complete. Numerous methods, such as k-nearest neighbor (kNN) imputation, the local least squares (LLS) method, and multiple imputation (MI), have been proposed in the literature to impute the missing values in the observed data. To identify differentially expressed genes, we propose a new testing procedure for observed data with missing values. The proposed procedure uses Stouffer’s z-scores and combines the test results of the individual imputed samples, which are dependent on each other. We show numerically that the proposed test procedure based on MI performs better than existing test procedures based on single imputation (SI) by comparing their ROC curves. We apply the proposed method to a public microarray dataset.
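The abstract does not give the combining formula, so the following is only a sketch of the textbook Stouffer combination, extended with one possible adjustment for dependence. The function names are hypothetical, and the assumption that every pair of per-imputation z-scores shares a single known correlation is an illustration, not the authors' procedure.

```python
import math
from statistics import NormalDist


def stouffer_combined_z(z_scores, correlation=0.0):
    """Combine per-imputation z-scores into a single z-score.

    correlation: assumed common pairwise correlation between the
    z-scores (0.0 recovers the classical Stouffer method for
    independent tests).
    """
    k = len(z_scores)
    if k == 0:
        raise ValueError("at least one z-score is required")
    # Var(sum of z_i) = k + k*(k-1)*rho when every pair has correlation rho.
    variance = k + k * (k - 1) * correlation
    if variance <= 0:
        raise ValueError("correlation is too negative for this many tests")
    return sum(z_scores) / math.sqrt(variance)


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


z_independent = stouffer_combined_z([1.0, 1.0, 1.0, 1.0])
z_dependent = stouffer_combined_z([1.0, 1.0, 1.0, 1.0], correlation=1.0)
```

The comparison of the two calls shows why the dependence matters: treating perfectly correlated imputations as independent would overstate the evidence.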

Session A-171: Microbiome Diversity on Materials
COSI: CAMDA
  • Chandrima Bhattacharya, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Pinaki Chakraborty, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Rohit Pandey, Indian Institute of Engineering Science and Technology, Shibpur, India
  • Malay Bhattacharyya, Indian Institute of Engineering Science and Technology, Shibpur, India

Short Abstract: The study of the microbiome is promising for understanding higher-level organisms more broadly. In this paper, we analyze microbiome diversity across different materials. We focus in particular on the metagenomics data from the MetaSUB International Consortium for this purpose. With an emphasis on materials such as metals, plastics, and woods found in subways and their vicinity, we demonstrate how diverse the microbiome community might appear in multiple cities.

Session A-173: Computational Approaches to Assessing Clinical Relevance of Pre-clinical Cancer Models
COSI: CAMDA
  • Vladimir Uzun, University of Sheffield, United Kingdom
  • Ian Sudbery, University of Sheffield, United Kingdom
  • James Bradford, University of Sheffield, United Kingdom

Short Abstract: Pre-clinical cancer models, such as tumour-derived cell-lines and animal models, are essential in cancer research. Consistently used as a platform to investigate mechanism of action, they can identify potential biomarkers prior to clinical trials, where similar exploration is more complicated and expensive. However, whilst cell-lines are the most used pre-clinical model, their applicability in certain settings is questioned because of the difficulty of aligning the appropriate cell-lines with a clinically relevant disease segment. We aim to develop computational tools that determine, for a given pre-clinical model, its suitability for clinical experiments and the most relevant disease segment. Genomics profiling data from patient tumours and cell-lines were used to train and test the method. Machine learning techniques (including random forests, principal component analysis, Gaussian processes) were applied to create predictive models based on patient training data. Their accuracy was evaluated on the patient test set, and the models were then applied to cell-line data. Endometrial and breast cancer classification achieved good correspondence with established subtypes (0.94 AUC). With the appropriate classifiers (copy-number for endometrial, expression for breast), most cell-lines were accurately differentiated into their respective subtypes. Whilst most cell-lines were associated with clinically relevant segments, a significant number were ambiguous. Furthermore, cell-line suitability scores across different subtypes were not complementary: a cell-line that is inappropriate for one subtype is likely to be inappropriate for the other. We will refine the methodology and ultimately develop an online scoring tool to improve the usage of pre-clinical cancer models in therapeutic testing.

Session A-175: Codon usage diversity in city microbiomes
COSI: CAMDA
  • Haruo Suzuki, Keio University, Japan

Short Abstract: For the MetaSUB Inter-City Challenge, we propose to apply annotation-independent approaches for synonymous codon usage to the microbiomes of three cities: New York City, Boston, and Sacramento. Multivariate statistical analysis identified gene features such as the codon-anticodon interaction efficiency and nucleotide content at third codon positions as major trends of variation in synonymous codon usage among genes of the metagenomes. We also found that diversity in synonymous codon usage was high in Sacramento, intermediate in Boston, and low in New York City. Our results suggest that codon usage can provide additional information on genetic diversity in microbiomes.
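One of the gene features named in this abstract, nucleotide content at third codon positions, can be sketched in Python as the G+C fraction at those positions (often called GC3). This is an illustration only: the function name is hypothetical, and the input is assumed to be an in-frame coding sequence, which the annotation-independent approach in the abstract does not necessarily require.

```python
def gc3_content(coding_sequence):
    """Fraction of G or C bases at third codon positions.

    Assumes the sequence starts in frame; trailing bases that do not
    form a complete codon are ignored.
    """
    seq = coding_sequence.upper()
    usable_length = len(seq) - len(seq) % 3
    # Third codon positions are indices 2, 5, 8, ... within the usable part.
    third_positions = seq[2:usable_length:3]
    if not third_positions:
        raise ValueError("sequence contains no complete codon")
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)


# Codons ATG, GCC, AAA have third bases G, C, A.
example_gc3 = gc3_content("ATGGCCAAA")
```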

Session A-177: Unraveling bacterial fingerprints of city subways from microbiome 16S gene profiles
COSI: CAMDA
  • Alejandro Walker, University of Florida, United States
  • Tyler Grimes, University of Florida, United States
  • Susmita Datta, University of Florida, United States
  • Somnath Datta, University of Florida, United States

Short Abstract: Microbial communities can be location specific, and the abundance of some species within locations can be more influential in determining whether a sample belongs to one city or another. This analysis, as part of the 2017 CAMDA competition, is focused on the MetaSUB Challenge dataset. We undertake a number of investigations of the OTU count data at the taxonomic level “Order” across the three cities. First, a PCA showed a clear clustering of the data points for the three cities, where a large proportion of variability was explained by the first three principal components. Next, we attempted to build a classifier using the OTU count data and were successful in achieving very high specificity and sensitivity. The relative abundance patterns of the OTUs varied significantly across the cities, which was formally confirmed by an analysis of variance. Finally, we conducted a network analysis based on the co-abundance patterns of the OTUs in a given city. Overall, we found different patterns in the three networks when inspected visually; the networks of nearby cities showed more similar bacterial co-abundance patterns than those of distant cities.

Session A-179: Integrative analysis of heterogeneous genomics data with phenome to discover functional cancer subgroups
COSI: CAMDA
  • Gift Nyamundanda, The Institute of Cancer Research, Sutton, UK, United Kingdom
  • Katherine Eason, The Institute of Cancer Research, Sutton, UK, United Kingdom
  • Pawan Poudel, The Institute of Cancer Research, Sutton, UK, United Kingdom
  • Yatish Patil, The Institute of Cancer Research, Sutton, UK, United Kingdom
  • Anguraj Sadanandam, The Institute of Cancer Research, Sutton, UK, United Kingdom

Short Abstract: The main challenge in finding clinically relevant subgroups of cancer is how to integrate information from multiple molecular events (multi-omics) in tumors with patient clinical characteristics. The current standard approaches involve several steps: a) clustering individual datasets to discover subtypes; b) reconciling subtypes identified from several datasets; c) associating the reconciled subtypes with patient phenotypic information (such as grade, stage, age and lifestyle) to provide some clinical interpretation of the new subtypes; and d) identifying signatures associated with the subtypes. However, these approaches do not capture interactions between different data types, and statistical power to find associations is lost because of the several steps involved. We developed a tool that combines these steps to jointly model the dependence structure (borrowing strength) across different omics datasets with patient phenotypic information as covariates. In addition to discovering new molecular functional subgroups, the tool allows for feature and covariate selection to yield a panel of the most discriminative integrated features and phenotypes characterizing the identified subtypes. The framework of this tool is governed by a key assumption: features from different omics data types are correlated due to some "hidden" variable(s) (meta-variables), which allows for data integration. These meta-variables are modeled as a function of covariates, which include patient phenotypic information, to expose any underlying clustering structure between multiple omics data types. In addition, in order to capture both between- and within-data-type variability, the meta-variables are allowed either to be common between datasets or to be specific to each data type. This is achieved by adopting a sparsity-inducing prior to switch some meta-variables on and off for some datasets.
The utility of this tool is demonstrated using publicly available methylation and gene expression data of 189 breast tumors along with the clinical information of patients. Five breast cancer subtypes were identified, and their prognostic value was significant even when controlling for other important clinical variables. Although our results show that gene expression was mainly driving these subtypes, integration with methylation and patient clinical information improved the prognostic value of the identified subtypes.

Session A-181: Assessing reproducibility of metagenomics studies and diversity of public transport systems microbiome profiles of New York, Boston and Sacramento cities
COSI: CAMDA
  • Alina Frolova, The Institute of Molecular Biology and Genetics of NASU, Ukraine

Short Abstract: As new members of the MetaSUB Consortium, we were greatly interested in analyzing the microbiome profiles of Boston, New York and Sacramento to point out important issues and problems before the upcoming global City Sampling Day 2017. Here we performed detailed quality control of raw sequences, evaluated the collected metadata in the context of creating a uniform specification for data collection, assessed the reproducibility of OTU abundance calculations, verified the presence of Yersinia pestis (the causative agent of plague) and Bacillus anthracis (the causative agent of anthrax) in the NY microbiome profile, and investigated biodiversity vs biolocation. We conclude that it is important to sample fewer, more controlled environments with greater specificity and uniform coverage of meta-variables.

Session A-183: Analysis of CAMDA RNA-seq data with the knowledge of protein domains in genes
COSI: CAMDA
  • Anna Lesniewska, Institute of Computer Science, Poznan University of Technology, Poland
  • Alicja Szabelska-Beresewicz, Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poland
  • Joanna Zyprych-Walczak, Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poland
  • Michal Okoniewski, Scientific IT Services, ETH Zurich, Switzerland

Short Abstract: In RNA sequencing with short reads, it is often not possible to assign an RNA fragment to a gene due to similarities in repetitive regions or protein domains. This may influence the downstream analysis. We have compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the Neuroblastoma dataset. The major findings are:
  • pairs of genes that share a domain have increased Pearson's correlation coefficients of counts
  • the distribution of correlation coefficients for those pairs leans more towards positive values for smaller numbers of biological samples
  • using diverse primary analysis counting strategies on non-CAMDA datasets suggests that the increased correlation reflects real biological co-expression rather than sequence-based artifacts
  • genes sharing a domain are expected to have lower predictive power due to increased correlation, but with various types of classifiers the number of misclassified samples does not yet show an obvious dependence
  • various classifiers perform very differently on the CAMDA data, which indicates that clinical application of gene signatures from similar datasets may be difficult
We have to admit that the outcomes sometimes do not follow our intuition and experience from standard RNA-seq analysis. That is why we would like to present them at the CAMDA meeting and discuss them there with experts in the area.

Session A-185: Identification of mobile elements in metagenomic data.
COSI: CAMDA
  • Josef Moser, Austrian Centre of Industrial Biotechnology (ACIB), Vienna, Austria, Austria
  • Samuel Gerner, FH Campus Wien, Austria
  • Alexandra Graf, FH Campus Wien, Austria

Short Abstract: Little is known about mobile elements in metagenome samples, but they certainly represent important traits of microbial communities. Their potential to confer resistances and to transfer beneficial mutations and advantageous genes across species borders improves the evolutionary fitness of bacteria in ways that are sometimes disastrous for humans. Antibiotic resistance, which is partly acquired by horizontal gene transfer through mobile elements, has been termed one of the major threats humanity faces today. In current metamobilomics experiments, additional lab protocols such as plasmid purification are applied to achieve higher sensitivity and selectivity compared with whole metagenome samples. This, however, removes plasmids from the population context they exist in. In this study, we evaluate the opportunities to study aspects of mobile elements in shotgun-sequenced metagenomic data, using only bioinformatics methods. This would allow the scientific community to get a better understanding of existing data and thereby give a clearer picture of the trade-off between the different experimental approaches. Functional analysis of the plasmids found will help to elucidate the specific advantage they impart to the microbial community. Additionally, we look at CRISPR found in the provided metagenome samples to gain information about the exposure history of the microbial populations in the sample. We used the metagenome data provided by CAMDA to compare the different cities with regard to mobile elements and CRISPR content and composition. Preliminary results on the plasmid and CRISPR content of the Boston and Sacramento data show that the Sacramento samples contain a higher abundance of both plasmids and CRISPR. Alignment of reads to known plasmids produced very limited results compared to de novo assembled plasmid candidates, highlighting the knowledge that can be gained from urban metagenome data, and therefore its importance.

Session A-187: Integration of CNV and RNA-seq data can increase the predictive power of Neuroblastoma endpoint
COSI: CAMDA
  • Yimin Ma, East China Normal University, China
  • Jiajun Chen, East China Normal University, China
  • Tieliu Shi, East China Normal University, China

Short Abstract: Neuroblastoma (NB) is the most common extracranial solid tumor in children. To compare the predictive power of data integration with that of the original expression-only study, we first built two risk-score models based on RNA-seq data and CNV data, respectively; we then combined them using two different strategies; finally, we evaluated the predictive power of these four models. Using the Cox regression method, we built the first risk-score model with five genes. NB patients could be classified into a high-risk group or a low-risk group based on this model. Overall survival between these two groups was significantly different (P = 0.00953 in the testing set). In addition, this model can further subdivide each of the clinically defined high/low groups into two subgroups. By applying similar procedures, we selected four CNV loci and built the second risk-score model. This model can also classify the matched NB patients, but its predictive power was weaker than that of the RNA-seq based model (P = 0.0884 in the testing set). To test whether integrating the two data types (CNV and RNA-seq) can increase the predictive power, we combined the two individual models using two strategies. The first strategy was to define patients who were classified as high-risk by both individual models as a new high-risk group; the predictive power of the new model was significantly improved (P = 0.00228 in the testing set). The second strategy was to combine the high-risk samples defined by the two individual models into a new high-risk group, but the predictive power of this strategy was improved only marginally (P = 0.0412 in the testing set). However, the clinically defined high-risk samples were entirely included in the expanded high-risk group. According to clinical information, about three quarters of the NB patients were alive and most patients were defined as low-risk.
When we redefined the high-risk group by overlapping the classification results of both models, which makes the samples more consistent with the risk distribution of this disease, the new model was significantly improved in predictive power, suggesting that different integration strategies should be chosen for different purposes and different data to improve predictive performance.
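The two integration strategies in this abstract amount to an intersection and a union of the high-risk groups from the individual models. A minimal Python sketch follows; the function name, dictionary keys and patient identifiers are hypothetical, and the survival testing that produced the reported P values is not reproduced here.

```python
def combine_high_risk_groups(rna_high_risk, cnv_high_risk):
    """Combine the high-risk patient sets of two risk-score models.

    Returns both integration strategies:
    - "both_models": patients called high-risk by both models (intersection)
    - "either_model": patients called high-risk by at least one model (union)
    """
    rna = set(rna_high_risk)
    cnv = set(cnv_high_risk)
    return {
        "both_models": rna & cnv,
        "either_model": rna | cnv,
    }


# Toy patient identifiers, purely illustrative.
groups = combine_high_risk_groups({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
```

The intersection always yields a high-risk group no larger than either input, and the union one no smaller, which matches the abstract's description of a stricter first strategy and an expanded second one.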

