Accepted Posters

If you need assistance, please contact submissions@iscb.org and provide your poster title or submission ID.


Track: Other

Session A-480: Fast accurate sequence alignment using Maximum Exact Matches
COSI: Non-COSI
  • Arash Bayat
  • Aleksandar Ignjatovic
  • Bruno Gaeta
  • Sri Parameswaran

Short Abstract: While efficient in general, sequence alignment by dynamic programming is still costly in time and memory when aligning very large sequences. We describe MEM-Align, an alignment algorithm that operates on Maximal Exact Matches (MEMs) between two sequences instead of processing every symbol individually. In its original definition, MEM-Align is guaranteed to find the optimal alignment, but its execution time is not manageable unless optimisations are applied that decrease its accuracy. However, these optimisations can be configured to balance speed and accuracy. The resulting algorithm outperforms existing solutions for the alignment of reads to a reference following BWT search.
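
As an editorial illustration of the object MEM-Align operates on, the sketch below enumerates all Maximal Exact Matches between two short sequences by dynamic programming over match-run lengths. This O(nm) toy is for exposition only; the actual algorithm relies on far more efficient index-based machinery.

```python
# Illustrative enumeration of Maximal Exact Matches (MEMs) between two
# sequences via dynamic programming over match-run lengths. Not the
# authors' implementation; MEM-Align uses much faster indexing.

def find_mems(a: str, b: str, min_len: int = 2):
    """Return (start_a, start_b, length) for each maximal exact match."""
    n, m = len(a), len(b)
    # run[i][j] = length of the exact match ending at a[i-1], b[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    mems = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                # A match run is maximal when it cannot be extended right;
                # left-maximality is implicit in how the run length resets.
                right_blocked = (i == n or j == m or a[i] != b[j])
                if right_blocked and run[i][j] >= min_len:
                    length = run[i][j]
                    mems.append((i - length, j - length, length))
    return mems

print(find_mems("ACGTACGT", "TTACGTAA", min_len=3))  # [(0, 2, 5), (3, 1, 5)]
```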

Session B-335: Digital assisted curation to the rescue of traditional literature curation for life-science databases
COSI: Non-COSI
  • Fabio Rinaldi, Swiss Institute of Bioinformatics and University of Zurich, Switzerland
  • Socorro Gama, Center for Genomic Sciences - UNAM, Mexico
  • Yalbi Itzel Balderas-Martínez, Facultad de Ciencias, UNAM, Mexico
  • Oscar Lithgow, Center for Genomic Sciences - UNAM, Mexico
  • Hilda Solano Lira, Center for Genomic Sciences - UNAM, Mexico
  • Mishael Sánchez-Pérez, Center for Genomic Sciences - UNAM, Mexico
  • Alejandra Lopez-Fuentes, Center for Genomic Sciences - UNAM, Mexico
  • Luis Muñiz-Rascado, Center for Genomic Sciences - UNAM, Mexico
  • Cecilia Ishida, Center for Genomic Sciences - UNAM, Mexico
  • Carlos-Francisco Méndez-Cruz, Center for Genomic Sciences - UNAM, Mexico
  • Alberto Santos-Zavaleta, Center for Genomic Sciences - UNAM, Mexico
  • Julio Collado-Vides, Center for Genomic Sciences - UNAM, Mexico

Short Abstract: Faced with decreasing financial resources, life science databases struggle to keep pace with the constantly increasing amount of published results. Traditional approaches based on careful human review of published papers guarantee a high quality of database entries and cannot easily be replaced by automated technologies, but they are slow and not cost-effective. Several technologies derived from the field of natural language processing promise to support the search for information in textual resources, and they have been applied to the scientific literature for many years. In the life science domain in particular, several community-organized evaluation campaigns carried out in the past few years have shown a steady improvement in the capabilities of these systems. However, there is still widespread skepticism about the possibility of using such tools in a curation pipeline. We argue that although text mining tools on their own would not be easily usable in a curation pipeline, their integration in a supportive environment can lead to a remarkable increase in the efficiency of the curation process, and we support this point with recent digitally assisted curation experiments in the context of an established bacterial database and a recently initiated curation effort for an important human disease.

Session B-337: A workflow for accurate neoantigen discovery using NGS data
COSI: Non-COSI
  • Ognjen Milicevic, Seven Bridges, Serbia
  • Vladimir Kovačević, Seven Bridges, Serbia
  • Ana Mijalkovic Lazic, Seven Bridges, Serbia
  • Nikola Skundric, Seven Bridges, Serbia
  • Nevena Ilic Raicevic, Seven Bridges, Serbia
  • Milica Kojicic, Seven Bridges, Serbia
  • Jack Digiovanna, Seven Bridges, United States

Short Abstract: Neoantigens are proteins presented on the surface of cancer cells that are recognized by the immune system. Multiple novel therapeutic approaches involve identifying neoantigens and using them to trigger immunity-induced tumor regression. Seven Bridges has developed a workflow for neoantigen discovery using NGS data, which analyzes tumor-normal pairs of whole exome sequencing samples together with tumor gene expression data in order to output candidate epitopes for neoantigens. The proposed Neoantigen workflow consists of two parts: Total calling and Neoantigen prediction. Total calling performs read alignment and preparation of aligned files (using ABRA, an assembly-based realigner), as well as germline variant calling (using FreeBayes) and somatic variant calling (using Strelka and VarDict). After variants are called and merged into one file, the workflow performs variant phasing, i.e., determining which variants lie on the same chromosome copy. The purpose of this step is to accurately reconstruct the nucleotide sequence that is translated into the neoantigen candidate. Next, RNA reads are processed with the Salmon and RSEM tools, which enable high-precision quantification of transcripts; they output a list of gene isoforms and corresponding RNA expression scores. Neoantigen prediction consists of several tools: protein extraction tools; OptiType, which calculates the Human Leukocyte Antigen (HLA) type; and NetCTLpan, which predicts epitopes. The protein extraction tools, developed by Seven Bridges, extract the nucleotide sequences around the positions of the somatic variants and translate them to proteins (a toy version of this step is sketched below). During this process the protein extraction tools take into account a full range of complex changes not currently handled by published workflows: changes from both germline and somatic variants (SNPs and indels), mutations (SNPs or indels) within the STOP codon (nonstop), and mutations that create STOP codons (nonsense). The extracted protein sequence and identified HLA type are used by the epitope prediction tool NetCTLpan to compute confidence-ranked epitope candidates. Finally, the confidence-ranked neoantigen candidates, HLA types, variant information, and RNA expression scores are merged and prioritized in the Analyse epitopes tool, also developed by Seven Bridges. Non-proprietary components of this portable and reproducible workflow will be publicly available on the Seven Bridges Platform, enabling rapid identification of patient-specific neoantigen candidates.
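
To make the protein-extraction step concrete, here is a minimal sketch (not the Seven Bridges tool itself): it applies a single somatic SNV to a toy coding sequence and translates the surrounding window with Biopython. Frame-shifting indels, phased germline variants, and the nonstop/nonsense cases described above are deliberately omitted.

```python
# Illustrative sketch of the protein-extraction idea: mutate a reference
# coding window and translate both alleles. Requires Biopython.
from Bio.Seq import Seq

def mutant_peptide(coding_seq: str, pos: int, alt: str, flank_codons: int = 4):
    """Return (ref_peptide, alt_peptide) around a 0-based SNV position."""
    codon_idx = pos // 3
    start = max(0, (codon_idx - flank_codons) * 3)
    end = min(len(coding_seq), (codon_idx + flank_codons + 1) * 3)
    ref_window = coding_seq[start:end]
    alt_seq = coding_seq[:pos] + alt + coding_seq[pos + 1:]  # apply the SNV
    alt_window = alt_seq[start:end]
    return str(Seq(ref_window).translate()), str(Seq(alt_window).translate())

cds = "ATGGCTGAAAAACTGACTGGCCGTATCGAA"  # toy 10-codon CDS
print(mutant_peptide(cds, 10, "T"))     # ('MAEKLTGR', 'MAEILTGR')
```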

Session B-339: Elucidation of time-dependent systems biology cell response patterns with time course network enrichment
COSI: Non-COSI
  • Christian Wiwie, Department of Mathematics and Computer Science, University of Southern Denmark, Denmark
  • Richard Röttger, University of Southern Denmark, Denmark
  • Jan Baumbach, University of Southern Denmark, Denmark

Short Abstract: Advances in OMICS technologies emerged both massive expression data sets and huge networks modeling the molecular interplay of genes, RNAs, proteins and metabolites. Network enrichment methods combine these two data types to extract subnetwork responses from case/control setups. However, no methods exist to integrate time series data with networks, thus preventing the identification of time-dependent systems biology responses. We close this gap with Time Course Network Enrichment (TiCoNE). It combines a new kind of human-augmented clustering with a novel approach to network enrichment. It finds temporal expression prototypes that are mapped to a network and investigated for enriched prototype pairs interacting more often than expected by chance. Such patterns of temporal subnetwork co-enrichment can be compared between different conditions. With TiCoNE, we identified the first distinguishing temporal systems biology profiles in time series gene expression data of human lung cells after infection with Influenza and Rhino virus. TiCoNE is available online (https://ticone.compbio.sdu.dk) and as Cytoscape app.
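
A toy stand-in for the pair test described above, assuming a simple permutation scheme rather than TiCoNE's actual statistic: given a network and two node sets (prototypes), it asks whether the observed number of connecting edges exceeds the chance expectation.

```python
# Minimal sketch of prototype-pair co-enrichment: do nodes of two clusters
# interact more often than expected under random node assignment?
import random

def edge_count(edges, set_a, set_b):
    return sum((u in set_a and v in set_b) or (u in set_b and v in set_a)
               for u, v in edges)

def pair_enrichment_pvalue(edges, nodes, set_a, set_b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = edge_count(edges, set_a, set_b)
    ge = 0
    labels = list(nodes)
    for _ in range(n_perm):
        rng.shuffle(labels)  # random reassignment of nodes to prototypes
        perm_a = set(labels[:len(set_a)])
        perm_b = set(labels[len(set_a):len(set_a) + len(set_b)])
        if edge_count(edges, perm_a, perm_b) >= observed:
            ge += 1
    return (ge + 1) / (n_perm + 1)  # permutation P-value

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(pair_enrichment_pvalue(edges, range(5), {0, 1}, {2, 3}))
```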

Session B-341: Comparative genomics analysis of human gut microbiome demonstrated broad distribution of metabolic pathways for mucin glycans foraging
COSI: Non-COSI
  • Dmitry Ravcheev, University of Luxembourg, Luxembourg
  • Ines Thiele, University of Luxembourg, Luxembourg

Short Abstract: Mucins are heavily glycosylated, high-molecular-weight proteins produced by the epithelium of most animals. In the human intestine, mucins are responsible for forming the mucus layer. Recent findings demonstrated that alterations in mucin glycoconjugates (MGC) impact the composition of the human gut microbiota (HGM). Here, we present a systematic analysis of HGM-encoded systems for the degradation of MGC. We applied genomic analysis to 397 genomes of HGM microorganisms belonging to the phyla Actinobacteria, Bacteroidetes, Euryarchaeota, Firmicutes, Fusobacteria, Proteobacteria, Synergistetes, Tenericutes, and Verrucomicrobia. Gene functions were annotated on the PubSEED platform (http://pubseed.theseed.org) using available literature data, protein sequence similarity, protein domain structure, and genome-context-based approaches, including gene chromosomal clustering and phyletic patterns. We analyzed genes required for the degradation of MGC to monosaccharides as well as genes for the utilization of these monosaccharides (fucose, galactose, N-acetylgalactosamine, N-acetylglucosamine, and N-acetylneuraminic acid) as carbon and energy sources. Genes for the utilization of one or more monosaccharides were found in 369 (93%) of the studied genomes. In addition to previously known genes involved in MGC degradation, we predict four non-orthologous replacements for enzymes and four novel transport systems for MGC-derived monosaccharides. The analysis of genes for the utilization of multiple monosaccharides in a large number of co-inhabiting organisms revealed the following roles of the gut microbial community in MGC foraging. First, different monosaccharides showed distinct distribution patterns across the analyzed genomes, which correlated with the distribution of these monosaccharides in nature and particularly within the human intestine. Second, 339 genomes encoded only partial pathways, i.e., either the glycosyl hydrolases (GHs) for cleavage of a monosaccharide from MGC or the catabolic pathway for the utilization of a monosaccharide. Based on these pathways, we propose that exchange pathways for MGC-derived monosaccharides exist within the HGM; consistently, we show that 338 (85%) of the analyzed genomes may be involved in such exchanges. Third, the analysis of MGC-degrading GHs allowed us to predict the ability of each analyzed microorganism to degrade specific types of MGC. Finally, we predict so-called beneficial pairs of organisms, i.e., pairs that together can utilize specific MGCs that neither microbe can degrade alone; 325 (82%) of the analyzed genomes are capable of forming such pairs. We demonstrate that the HGM community is highly adapted to the utilization of MGCs as sources of carbon and energy and suggest that this adaptation may be a consequence of co-evolution.

Session B-343: Alpha and Omega of Darwinian selection: disentangling the two in codon models
COSI: Non-COSI
  • Iakov Davydov, University of Lausanne, Switzerland
  • Nicolas Salamin, University of Lausanne, Switzerland
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland

Short Abstract: The fixation of mutations in genes is due to a balance of selection, mutation, and drift. Codon models have proven very useful in distinguishing selection, including positive selection, from drift. Synonymous substitution rates are assumed to capture all variation that is not under selection, and thus the ratio of non-synonymous (dN) to synonymous (dS) substitutions should indicate selection. There are many models for gene-wide identification of positive selection that allow selection (and thus dN) to vary across the gene, but dS is usually assumed to be constant over all positions of a gene. Yet significant variation of dS has been observed within genes. We have developed a simple new model that accounts for variation in the codon substitution rate in addition to variation in amino acid selection levels. Our approach introduces rate variation as a single parameter, thus not inflating the number of model parameters. We introduce codon rate variation into models with amino acid selection variation (M8 of PAML) and branch-site selection variation (see the schematic rate matrix below). We use a simulated dataset to assess the models' statistical performance and show that our new models work well both in the absence and in the presence of rate variation in the data. While the increase in model complexity comes at a computational cost, our models remain computationally tractable and useful even for large datasets. We provide an implementation in Go at https://bitbucket.org/Davydov/godon/. We use our improved positive selection model to scan genome-scale real data in two clades, vertebrates and Drosophila. We show that the data provide strong support for synonymous rate variation, and we demonstrate that positive selection inference is strongly affected by model choice: a majority of the positive selection predictions given by the model without rate variation are not supported by our models. We therefore hypothesize that a large proportion of the positive selection detected by models agnostic to codon rate variation consists of false positives caused by violations of model assumptions. Finally, we study how different biological factors affect codon rate variation as estimated by the model. We demonstrate that codon rate variation is correlated with gene expression levels, recombination rate, and GC-content. We show that our new models are able to capture rate variation caused by synonymous selection acting at the nucleotide level, for example detecting strong synonymous selection in the proximity of intron splice sites. In conclusion, our new models capture important biological information about gene evolution, force a reconsideration of the detection of positive selection, and are computationally tractable for genome scans.
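
For orientation, the schematic below writes a standard Goldman-Yang-style codon rate matrix with a site-specific rate multiplier (rho) added alongside the selection parameter (omega). This is a generic formulation of the idea; the authors' exact parameterization may differ.

```latex
% Schematic instantaneous rates for codons i -> j differing by one
% nucleotide, with a site rate multiplier \rho alongside selection \omega:
\[
q_{ij} \propto \rho \cdot
\begin{cases}
\pi_j               & \text{synonymous transversion,}\\
\kappa \pi_j        & \text{synonymous transition,}\\
\omega \pi_j        & \text{non-synonymous transversion,}\\
\omega \kappa \pi_j & \text{non-synonymous transition,}
\end{cases}
\qquad q_{ij} = 0 \text{ otherwise.}
\]
```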

Session B-345: Prediction of drug-drug interactions using molecular structure information and link prediction approaches
COSI: Non-COSI
  • Eunyoung Kim, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea
  • Hojung Nam, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea

Short Abstract: As the concurrent use of multiple medications has increased, characterizing adverse or synergistic interactions among drugs has become crucial. Identification of drug-drug interactions (DDIs) matters both because DDIs may cause severe unexpected adverse reactions and because they can reveal combination drug candidates acting as synergistic pairs. However, DDI identification through in vivo or in vitro experiments is extremely time-consuming and expensive. Therefore, many researchers have tried to develop in silico prediction models for DDIs, which can significantly reduce time and effort. In this study, we developed a DDI prediction model using molecular structure information and similarity scores obtained from missing-link prediction algorithms. The currently known DDI network is sparsely connected, and a large number of unknown interactions remain to be identified. We adopted missing-link prediction algorithms to predict the likelihood of a future association between drugs, treating nodes as drugs and missing links as unknown interactions. Moreover, compound structural similarity was used as an additional feature to represent the relationship. Features combining structural similarity and similarity scores derived from various link prediction algorithms were computed over known DDIs (the feature construction is sketched below), and prediction models were trained with the Random Forest algorithm. The constructed model achieved high performance in terms of AUC, AUPR, and accuracy.
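
A minimal sketch of this feature construction, assuming common-neighbour and Jaccard scores as the link-prediction features and a mock Tanimoto matrix for structural similarity; the study's actual feature set is broader.

```python
# Sketch: link-prediction scores on the known DDI network plus a structural
# similarity score form the feature vector of each drug pair; a Random
# Forest is trained on known vs. unknown pairs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(adj, sim, i, j):
    ni, nj = adj[i].astype(bool), adj[j].astype(bool)
    common = np.sum(ni & nj)                       # shared interaction partners
    union = np.sum(ni | nj)
    jaccard = common / union if union else 0.0
    return [common, jaccard, sim[i, j]]

rng = np.random.default_rng(0)
n = 50
adj = (rng.random((n, n)) < 0.1).astype(int)
adj = np.triu(adj, 1); adj += adj.T                # symmetric mock DDI network
sim = rng.random((n, n)); sim = (sim + sim.T) / 2  # mock Tanimoto similarities

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
X = np.array([pair_features(adj, sim, i, j) for i, j in pairs])
y = np.array([adj[i, j] for i, j in pairs])        # 1 = known interaction
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])              # interaction likelihoods
```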

Session B-347: Ultra-sensitive n-plexed protein quantification by a model-based reconstruction method
COSI: Non-COSI
  • Kyowon Jeong, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Yeon Choi, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Joon-Won Lee, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Sangtae Kim, Pacific Northwest National Laboratory, United States
  • Jae Hun Jung, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Young-Suk Lee, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Kwang Pyo Kim, Department of Applied Chemistry, Kyung Hee University, South Korea
  • V. Narry Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Jong-Seo Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea

Short Abstract: Isotopic-labeling-based protein quantification has advantages over other approaches, such as accurate quantity ratios and reduced technical bias. However, conventional isotopic labeling schemes (e.g., SILAC) have limited multiplexity (≤3-plex). Although attempts have been made to increase multiplexity, they require either ultrahigh-resolution instruments or complicated/expensive labeling schemes. Also, thorough evaluation of quantification has rarely been performed, or the number of proteins quantified in all labels was often insufficient for most applications. We present a model-based reconstruction method called EPIQ (Epic Protein Integrative Quantification) that enables ultra-sensitive n-plexed protein quantification. EPIQ allows deuterium-based isotopic labeling and small mass differences between labels (≥2 Da). Such labels make the XICs (eXtracted-Ion Chromatograms) from distinct labels hard to separate; they have different retention times (due to the deuterium effect) and mutual interference (due to overlapping isotope clusters). EPIQ is based on a generative model that describes how the observed XICs are generated from the labeled peptide ions of the same species. The model assumes the observed XICs are generated by superimposing signal components (XICs from labeled peptide ions) as well as noise components (coelution or flat intensity noise). Given an identified PSM (Peptide-Spectrum Match), it predicts the retention time, isotope distribution, and XIC shapes of the labeled peptide ions. By integrating these predictions, the signal and noise components in the generative model are predicted, and EPIQ reconstructs the observation from the predicted components (a toy decomposition is sketched below). As a result, it successfully separates XICs from distinct labels and performs accurate quantification with a low limit of quantification (LOQ). To test the quantification performance of EPIQ, we developed deuterium-based 6-plexed labeling. An unfractionated labeled HeLa sample with ratio 30:20:10:1:5:10 was subjected to LC-MS/MS (Q-Exactive). EPIQ reported ~3,000 proteins with a median quantity ratio of 30.4:21.3:10:1.1:4.1:10.1. In ~70% of the cases, the ratios (to the first label) fell within a 2-fold change of the input ratio. To benchmark against other state-of-the-art tools, we adopted 13C-based 3-plexed labeling. A sample with a known ratio (HeLa, 1:10:20) and a biological sample (Xenopus early embryo) were analyzed by EPIQ and other tools. While all tools reported comparable numbers of proteins, the ratios from tools other than EPIQ were severely biased, especially for low-abundance peptides. These results demonstrate that EPIQ achieves a lower LOQ than other tools. As EPIQ allows higher multiplexity, we are currently developing further chemical/metabolic labeling schemes (≥8-plex). EPIQ could facilitate various biological applications (e.g., cell dynamics studies or sensitive detection of differentially expressed proteins).
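
A conceptual toy version of the reconstruction step, assuming simple Gaussian component shapes and using non-negative least squares in place of EPIQ's full generative model (which predicts retention-time shifts and isotope clusters).

```python
# Sketch: decompose an observed chromatogram into a non-negative
# superposition of predicted per-label components plus a flat noise term.
import numpy as np
from scipy.optimize import nnls

t = np.linspace(0, 10, 200)
def gauss(mu, sigma=0.5):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2)

# Predicted XIC shapes of 3 labels, slightly shifted (deuterium effect),
# plus a flat noise column.
components = np.column_stack([gauss(4.0), gauss(4.3), gauss(4.6),
                              np.ones_like(t)])
true_amounts = np.array([30.0, 10.0, 5.0, 0.2])
observed = components @ true_amounts \
    + 0.05 * np.random.default_rng(1).normal(size=t.size)

amounts, residual = nnls(components, observed)
print(np.round(amounts, 2))   # recovered label quantities (+ noise level)
```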

Session B-349: Epistatic SNP pair analysis for lung cancer
COSI: Non-COSI
  • Jairo Rocha, University of Balearic Islands, Spain
  • Jaume Sastre, University of the Balearic Islands, Spain
  • Emidio Capriotti, University of Bologna, Italy

Short Abstract: Using the TCGA (The Cancer Genome Atlas) genomes for lung cancer and the 1000 Genomes Project data, we analyze all possible SNP pairs to find synergistic epistasis. We look for associations between SNPs (and other mutations) and tumor occurrence by comparing tumor samples with normal-tissue samples from the same subjects. In addition, we search for SNP pairs in the normal tissue of subjects with lung cancer using the 1000 Genomes database as control. The models used are logistic regression and independence tests of two-way and three-way contingency tables (a toy version of a pairwise test is shown below). Preliminary tests show that the synergy is very difficult to detect, as high significance is needed given the large number of pairs considered. The small set of pairs found is being analyzed for biological significance.
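
A toy version of one such pairwise test, assuming a simple 2x2 Fisher exact test of joint carrier status against case/control labels; the study additionally uses logistic regression and three-way contingency tables.

```python
# Sketch: test whether carrying both minor alleles of a SNP pair is
# associated with disease status, on simulated data with a synergistic
# (epistatic) risk effect.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n = 2000
snp1 = rng.integers(0, 2, n)              # carrier status at SNP 1
snp2 = rng.integers(0, 2, n)              # carrier status at SNP 2
both = (snp1 & snp2).astype(bool)
# Simulated synergistic effect: higher risk only when both are carried
case = rng.random(n) < np.where(both, 0.30, 0.10)

table = [[np.sum(both & case),  np.sum(both & ~case)],
         [np.sum(~both & case), np.sum(~both & ~case)]]
odds_ratio, p = fisher_exact(table)
print(f"OR={odds_ratio:.2f}, P={p:.2e}")
```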

Session B-353: A Consensus of Molecular Subgroups in Medulloblastoma
COSI: Non-COSI
  • Tanvi Sharma, German Cancer Research Centre (DKFZ), Germany
  • Ed Schwalbe, Newcastle University, United Kingdom
  • Paul Northcott, St. Jude's, USA
  • Volker Hovestadt, Broad Institute, USA
  • Dan Williamson, Newcastle University, United Kingdom
  • Steve Clifford, Newcastle University, United Kingdom
  • Stefan Pfister, German Cancer Research Centre (DKFZ), Germany
  • Lukas Chavez, German Cancer Research Centre (DKFZ), Germany

Short Abstract: Medulloblastoma is a highly malignant childhood brain tumor type. Recently, several independent studies have stratified Medulloblastoma tumors into distinct molecular subgroups. Due to differences in patient cohorts and applied analytical methods, this has created confusion regarding the definitive recognition and comparability of the relevant subgroups. Consequently, this study aims to establish a consensus of clinically and genetically relevant molecular subgroups in Medulloblastoma. We combined DNA methylation array and gene expression data across different patient cohorts, complemented by 194 novel cases, resulting in the largest Medulloblastoma sample cohort ever compiled (1,845 samples with DNA methylation and 392 samples with matched DNA methylation and gene expression data). We have subjected this cohort to all previously employed methods, including t-distributed stochastic neighbor embedding (t-SNE) followed by DBSCAN, non-negative matrix factorization (NMF), and similarity network fusion (SNF). Our preliminary results indicate that the previously proposed Medulloblastoma subgroups largely overlap regardless of the applied analytical method. By combining DNA methylation and gene expression data, fewer samples are required to derive otherwise largely similar clusters than with DNA methylation data alone. Our plan of action involves integrating mutations and clinical features to finally establish a consensus on Medulloblastoma stratification into clinically relevant subgroups.

Session B-355: Impact of tissue architecture on the nature and predictability of tumour evolution
COSI: Non-COSI
  • Robert Noble, Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  • John Burley, Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States
  • Michael Hochberg, Institut des Sciences de l’Evolution, University of Montpellier, France

Short Abstract: Intra-tumour genetic heterogeneity is a product of evolution in spatially structured populations of cells. Whereas genetic heterogeneity has been proposed as a prognostic biomarker in cancer, its spatially dynamic nature makes accurate prediction of tumour progression challenging. We use a novel computational model of cell proliferation, competition, mutation and migration to assess when and how genetic diversity is predictive of tumour growth and evolution. We characterize how tissue architecture (cell-cell competition and cell migration) influences the potential for subclonal population growth, the prevalence of clonal sweeps, and the resulting pattern of intra-tumour heterogeneity. We further compare the accuracy of cancer growth forecasts generated using different virtual biopsy sampling strategies, in different tissue types, and when cancer evolution is characteristically neutral or non-neutral. We thus determine the conditions under which genetic diversity is most predictive of future tumour states. Our findings help explain the multiformity of tumour evolution and contribute to establishing a theoretical foundation for predictive oncology.

Session B-357: FRICTION: validated, quantitative immune cell type deconvolution
COSI: Non-COSI
  • Aaron Wise, Illumina, USA
  • Alex So, Illumina, USA
  • Joyee Yao, Illumina, USA
  • Shannon Kaplan, Illumina, USA
  • Shile Zhang, Illumina, USA

Short Abstract: Recent work has demonstrated the value of understanding the tumor microenvironment for its impact on tumor progression and immunotherapy efficacy. Computational tools based on gene expression data have shown promise in their ability to deconvolve the tumor microenvironment and report the types of immune cells present in heterogeneous tumor samples. Here we present FRICTION, a new algorithm for cell type deconvolution. While many state-of-the-art deconvolution approaches report relative fractions or statistical enrichment, we focus on the careful selection and normalization of genes to better detect the absolute fractional level of cell types. To enable this, we developed a novel gene selection method that combines statistical properties of gene expression with the expression's ability to discriminate different cell types. Furthermore, we normalized against expression levels from over 10 different control tissues to ensure robustness across many tissue backgrounds. FRICTION combines our gene selection and normalization techniques with a support-vector-regression-based approach to deconvolution (the general idea is sketched below). FRICTION has been trained to detect three cell types: CD8+ T, CD4+ T, and CD19+ B cells. We have validated the technique using spike-in cell titrations and IHC (immunohistochemistry) staining of FFPE (formalin-fixed, paraffin-embedded) tumor samples. The titration experiments demonstrate our method's linearity in a variety of tissue backgrounds (median R2 > 0.97), with high reproducibility among both technical and biological replicates. The IHC staining experiments demonstrate our method's ability to differentiate CD4 high/low status in tumor samples. FRICTION represents an important step towards delivering validated, quantitative cell type deconvolution results.
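
A generic sketch of support-vector-regression-based deconvolution, the family of approaches FRICTION builds on; the signature matrix, gene selection, and normalization steps that are central to FRICTION are replaced here by mock data.

```python
# Sketch: regress a mixture expression profile onto a cell-type signature
# matrix with linear nu-SVR and read cell-type levels off the coefficients.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
n_genes, cell_types = 200, 3          # e.g. CD8+ T, CD4+ T, CD19+ B
signature = rng.lognormal(size=(n_genes, cell_types))   # mock signatures
true_fracs = np.array([0.2, 0.1, 0.05])
mixture = signature @ true_fracs + 0.05 * rng.normal(size=n_genes)

svr = NuSVR(kernel="linear", nu=0.5, C=1.0).fit(signature, mixture)
coef = np.maximum(svr.coef_.ravel(), 0)   # clip negative weights
print(np.round(coef, 3), "vs true", true_fracs)
```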

Session B-359: Identification of novel peptidic antibiotics via large-scale scoring of mass spectra against natural products databases
COSI: Non-COSI
  • Alexander Shlemov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Alexey Gurevich, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Alla Mikheenko, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Anastasiia Abramova, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Anton Korobeynikov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Hossein Mohimani, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA
  • Pavel Pevzner, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA

Short Abstract: Discovery of novel peptidic antibiotics and other natural products is hardly possible without modern high-throughput technologies and computational pipelines. Previous studies mainly relied on low-throughput NMR-based technologies requiring large amounts of highly purified material that is often difficult to obtain, which contributed to the recent decline in the pace of antibiotics discovery. Coupling NGS and mass spectrometry data enabled a renaissance in antibiotics research and led to the discovery of teixobactin, the first new class of antibiotics in three decades. While databases in traditional proteomics consist of known peptides, ongoing genome mining efforts for natural product discovery generate vast databases of still unknown putative compounds, making matching spectra against such databases prohibitively time-consuming. This leads to the challenging problem of matching millions of spectra against millions of peptides in a reasonable time. The statistical significance (P-value) of a few high-scoring matches can then be calculated to further eliminate untrustworthy hits. Here we suggest a fast scoring strategy based on peptide database partitioning and preprocessing (illustrated schematically below), coupled with an accurate method for calculating the P-value of a given peptide-spectrum pair. Traditional proteomics approaches are not applicable here because peptidic natural products typically have complex structures (e.g., cyclic or branch-cyclic) and contain non-standard amino acids along with post-translational modifications. Besides the P-value estimation algorithm itself, we also propose a fast method to estimate whether the P-value would fall below a predefined threshold, speeding up the search even more.
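
A generic illustration of the partitioning idea, assuming peptides are pre-sorted by mass so each spectrum is scored only against the narrow mass window it can plausibly match; the actual preprocessing additionally handles cyclic and branch-cyclic structures.

```python
# Sketch: precursor-mass partitioning via binary search, so that each
# spectrum is compared against only a small slice of the peptide database.
import bisect

peptides = [("pepA", 980.5), ("pepB", 1200.1), ("pepC", 1200.4),
            ("pepD", 1534.8), ("pepE", 2210.0)]     # mock (name, mass) pairs
peptides.sort(key=lambda p: p[1])
masses = [m for _, m in peptides]

def candidates(precursor_mass, tol=0.5):
    lo = bisect.bisect_left(masses, precursor_mass - tol)
    hi = bisect.bisect_right(masses, precursor_mass + tol)
    return peptides[lo:hi]

print(candidates(1200.3))   # [('pepB', 1200.1), ('pepC', 1200.4)]
```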

Session B-361: Developing rAMP-seq primers for use in Durum and Bread Wheat
COSI: Non-COSI
  • Dustin Cram, National Research Council of Canada, Canada
  • David Konkin, National Research Council of Canada, Canada

Short Abstract: Cost-effective genotyping strategies are enabling technologies for researchers and breeders. Buckler et al. (2016) recently proposed rAMP-seq as a simple approach to reduced-representation sequencing and demonstrated its use in maize. This approach amplifies variable repetitive regions that are flanked by conserved regions, such that hundreds to thousands of loci can be targeted by a single set of primers. In wheat, there is a rich diversity of transposable elements, with ~180 families represented by 11-100 copies on the 3B chromosome (Daron et al., 2014). With the goal of developing rAMP-seq primer sets that target 1000-5000 loci distributed across the wheat genome, we are currently exploring the rAMP-seq primer design space in wheat and performing preliminary trials.

Session B-363: VarQuest: modification-tolerant identification of novel variants of peptidic antibiotics and other natural products
COSI: Non-COSI
  • Alexey Gurevich, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Alla Mikheenko, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Alexander Shlemov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Anton Korobeynikov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Hossein Mohimani, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA
  • Pavel Pevzner, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA; Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Peters, USA

Short Abstract: Motivation: Peptidic natural products (PNPs) include many antibiotics, anti-cancer agents, and other bioactive compounds. While billions of tandem mass spectra of natural products have been generated and deposited to the Global Natural Products Social (GNPS) molecular network, the discovery of novel PNPs, and even of variants of known PNPs, from this gold mine of spectral data remains challenging. To address this problem, bioinformaticians have developed dereplication techniques to identify known PNPs and their novel variants. However, since PNP databases are dominated by the most abundant representatives of PNP families, existing algorithms, which focus on the dereplication of known PNPs, identify only a small fraction of spectra in the GNPS molecular network. Results: We present VarQuest, a novel algorithm for the identification of PNP variants via database search of mass spectra and the first high-throughput mutation-tolerant PNP identification method capable of analyzing the entire GNPS infrastructure. VarQuest identified an order of magnitude more PNP spectra and many novel PNP variants compared to existing PNP identification strategies. Availability and implementation: http://cab.spbu.ru/software/varquest Contact: aleksey.gurevich@spbu.ru

Session B-365: Comparison of open source (galaxy based) and commercial pipelines for RNA-Seq data analysis
COSI: Non-COSI
  • Slave Trajanoski, Medical University Graz, Austria
  • Marija Djurdjevic, Medical University Graz, Austria
  • Andrea Groselj-Strele, Medical University Graz, Austria

Short Abstract: Since the first RNA-Seq projects, software for data analysis has been constantly developed and improved. Nowadays we find a plethora of tools, both for single parts of RNA-Seq data analysis and as complete pipelines that deliver gene expression results. In our work, we present a comparison of commercial software from Partek(R), in its two implementations Partek(R) Flow(R) and Partek(R) Genomics Suite(R), with open source tools implemented as a pipeline in the very popular web-based framework Galaxy (1). Using a test dataset from the NCBI Sequence Read Archive (SRA), we ran performance and usability tests of the two approaches. We come to the following conclusions: 1. Results from both solutions are comparable, but one should be very careful about the parameters used in each step, since they can lead to different results; 2. Usability favours the commercial solution, even though the Galaxy developers are making good progress in this direction; 3. For the sake of easy and fast analysis, many parameters are hidden from the user in the commercial software, which leads to lower flexibility compared with the open source pipelines; 4. Commercial software requires less management effort, which on the other hand is connected with license costs. 1. Afgan E, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016;44(W1):W3.

Session B-367: CALQ: Coverage-Adaptive Lossy compression of high-throughput sequencing quality values
COSI: Non-COSI
  • Jan Voges, Leibniz Universitaet Hannover, Germany
  • Mikel Hernaez, Stanford, United States
  • Joern Ostermann, Leibniz Universitaet Hannover, Germany

Short Abstract: Recent advancements in high-throughput sequencing technology have led to rapid growth of genomic data. Several lossless compression schemes have been proposed for coding such data, present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. We present CALQ, a novel lossy compression scheme for quality values. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses (the idea is sketched below). We analyze the performance of several lossy compressors of quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the precision and recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. We show that CALQ achieves, on average, better variant-calling performance than the original data, with a size reduction of more than an order of magnitude with respect to state-of-the-art lossless compressors.
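
A conceptual sketch of coverage-adaptive quantization, assuming a scalar per-locus genotype confidence controls the bin coarseness; CALQ derives this confidence from a statistical genotyping model rather than taking it as a given.

```python
# Sketch: quantize Phred quality values coarsely where the genotype is
# already certain, finely where it is uncertain.
import numpy as np

def quantize(quals, genotype_confidence, fine_bins=8, coarse_bins=2):
    n_bins = coarse_bins if genotype_confidence > 0.99 else fine_bins
    edges = np.linspace(0, 41, n_bins + 1)          # Phred range 0..41
    idx = np.clip(np.digitize(quals, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[idx].round().astype(int)         # representative values

quals = np.array([37, 12, 40, 25, 3])
print(quantize(quals, genotype_confidence=0.5))    # finer, at uncertain loci
print(quantize(quals, genotype_confidence=0.999))  # coarser, at certain loci
```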

Session B-369: A CWL-based pipeline for the ultra-fast identification of mutually exclusive pairs of aberrant genes in tumors
COSI: Non-COSI
  • Tarcisio Fedrizzi, CIBIO, Italy
  • Davide Prandi, CIBIO, Italy
  • Francesca Demichelis, CIBIO, Italy

Short Abstract: Synthetic lethality (SL) is a phenomenon in which the concomitant aberration of two (or more) genes causes cell death, while aberration of either gene alone is compatible with life. As tumor cells tend to accumulate aberrations, SL can be exploited to selectively kill tumor cells by targeting the second gene with a drug. Genomic aberrations of SL combinations should be reflected in mutual exclusivity (ME) signatures. By studying co-aberration patterns in large datasets (e.g., TCGA), we can nominate mutually exclusive combinations and validate their SL potential through experimental work. To identify such combinations, we developed a pipeline to handle Whole Exome Sequencing data and an algorithm named FaME (Fast Mutual Exclusivity). The pipeline is written in the Common Workflow Language (CWL), one of the leading languages for pipeline specification. CWL combined with Docker offers a way to execute the pipeline in heterogeneous environments (e.g., HPC cluster, cloud) while avoiding complex setups. With matched normal and tumor sample data (BAM) as input, the pipeline generates purity- and ploidy-adjusted somatic copy number alteration (SCNA) calls, both gene-based and allele-specific, and single nucleotide variants (SNVs), all annotated with clonality estimates using a tool developed by our group (CLONET). FaME leverages fast matrix multiplication (OpenBLAS), a logarithm-based Fisher test implementation, and parallel code execution to compute hundreds of millions of ME tests (genome-wide coverage) for thousands of samples in a few minutes on a single HPC machine (the core counting trick is sketched below). Details of the pipeline and FaME will be presented. This work is part of the ERC-funded project SPICE (ERC-CoG-2014, 648670).
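
The counting trick behind this speed can be sketched as follows: with binary aberration matrices, a single matrix product yields co-occurrence counts for all gene pairs at once, from which each 2x2 table follows by arithmetic. This illustrative version falls back to scipy's Fisher test for one pair; FaME's log-space implementation vectorizes that step as well.

```python
# Sketch: co-occurrence counts for ALL gene pairs via one matrix product,
# then a mutual-exclusivity (depletion) Fisher test for a single pair.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_genes, n_samples = 100, 500
A = (rng.random((n_genes, n_samples)) < 0.15).astype(int)  # mock aberrations

both = A @ A.T                          # co-aberration counts, all pairs
per_gene = A.sum(axis=1)
only_i = per_gene[:, None] - both       # gene i aberrant, gene j not
only_j = per_gene[None, :] - both
neither = n_samples - both - only_i - only_j

i, j = 3, 7                             # one pair, for illustration
table = [[both[i, j], only_i[i, j]], [only_j[i, j], neither[i, j]]]
_, p = fisher_exact(table, alternative="less")  # depletion = exclusivity
print(p)
```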

Session B-371: Wet lab preparations critically influence metagenomics profiles as shown in surrogate samples and a sample of a patient with unknown cause of meningitis
COSI: Non-COSI
  • Corinne P. Oechslin, Biology Division, SPIEZ LABORATORY, Swiss Federal Office for Civil Protection, Spiez; Institute for Infectious Diseases, University of Bern, Bern; Graduate School for Cellular and Biomedical Sciences, Switzerland
  • Stephen L. Leib, Institute for Infectious Diseases, University of Bern, Bern, Switzerland
  • Christian M. Beuret, Biology Division, SPIEZ LABORATORY, Swiss Federal Office for Civil Protection, Spiez, Switzerland

Short Abstract: Sample preparation steps, from host nucleic acid (NA) depletion through NA extraction up to sequencing library preparation, directly influence the quality of the sequencing output. This study compares different sample preparations and their influence on the metagenomics profiles of surrogate samples of bacterial or viral infections in humans, as well as of a cerebrospinal fluid sample of a patient suffering from meningitis of unknown etiology. To examine the influence of the predominant amount of host NA in clinical samples on the detection of infectious pathogens, we implemented a host NA depletion method. Furthermore, we compared four combinations of native and host-NA-depleted samples and different NA extractions: crowding agent followed by magnetic bead purification, and phenol-chloroform-based mixtures of different pH followed by column purification. Extracted NA were reverse transcribed, whole-genome amplified, and prepared for sequencing with the IonTorrent™ S5. Sequencing data were analyzed using the Kraken taxonomic sequence classifier, mapping to reference genomes, and BLAST®. The analysis of the metagenomics profiles showed that the four sample preparations of the surrogate samples selectively single out either intact or lysed bacteria, RNA viruses, or intact DNA viruses. These results guided the analysis of the patient sample data: the host-NA-depleted sample in combination with phenol-chloroform (pH >8) and column purification provided reliable metagenomics evidence of a multi-bacterial infection that caused the meningitis.

Session B-373: Finding associated variants in genome-wide associations studies on multiple traits
COSI: Non-COSI
  • Lisa Gai, UCLA, USA
  • Eleazar Eskin, University of California, Los Angeles, USA

Short Abstract: Many variants identified by genome-wide association studies (GWAS) have been found to affect multiple traits, either directly or through shared pathways. There is currently a wealth of GWAS data collected for numerous phenotypes, and analyzing multiple traits at once can increase the power to detect shared variant effects. However, the vast majority of studies consider one trait at a time. Studies that do analyze multiple traits are typically limited to sets of traits already believed to share a genetic basis. Traditional meta-analysis methods for combining studies are designed for use on studies of the same trait; when applied to dissimilar studies, they are underpowered compared to univariate analysis. This is a major limitation, as the degree to which a pair of traits shares effects is often not known. Here we present a flexible method for finding associated variants from GWAS summary statistics for multiple traits. Our method estimates the degree of shared effects between traits from the data. Using simulations, we show that our method properly controls the false positive rate and increases power compared to trait-by-trait GWAS at varying degrees of relatedness between traits. We apply our method to real data sets for a variety of disease-relevant traits.
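
For context, a standard building block for combining summary statistics across correlated traits is shown below (a fixed-effects-style combination with a given correlation matrix); note that the poster's contribution is estimating the degree of effect sharing from the data rather than assuming this matrix.

```python
# Sketch: combine per-trait Z-scores for one variant, given a correlation
# matrix R among the traits' test statistics.
import numpy as np
from scipy.stats import norm

def combined_z(z, R):
    """Combined statistic (1' R^{-1} z) / sqrt(1' R^{-1} 1)."""
    w = np.linalg.solve(R, np.ones_like(z))      # R^{-1} * 1
    return (w @ z) / np.sqrt(w.sum())

z = np.array([2.1, 1.8, 0.4])                    # one variant, three traits
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
zc = combined_z(z, R)
print(zc, 2 * norm.sf(abs(zc)))                  # combined Z and P-value
```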

Session B-375: Inferring molecular data type similarity in anti-cancer drug response prediction
COSI: Non-COSI
  • Nanne Aben, Netherlands Cancer Institute, Netherlands
  • Yipeng Song, University of Amsterdam, Netherlands
  • Johan Westerhuis, University of Amsterdam, Netherlands
  • Henk Kiers, University of Groningen, Netherlands
  • Magali Michaut, Netherlands Cancer Institute, Netherlands
  • Age Smilde, University of Amsterdam, Netherlands
  • Lodewyk Wessels, Netherlands Cancer Institute, Netherlands

Short Abstract: As patient response to anticancer drugs is highly variable, biomarkers are needed to identify which patients will benefit from a given treatment. These biomarkers can be determined using large-scale pharmacogenomic screens, in which hundreds of cell lines have been extensively molecularly profiled for mutations, gene expression, drug response, etc. We have previously shown that the classic approach to biomarker identification (an Elastic Net predicting drug response using all molecular data types simultaneously) is strongly affected by overlapping information contained in multiple data types. Here, we want to gain further insight into how the information overlaps between different data types. To this end, one could employ matrix correlation measures, such as the RV coefficient (a reference computation is shown below), to estimate the redundancy between data types. However, these measures are not suitable for comparing binary with continuous data. We therefore extended the RV coefficient to compare binary and continuous data (e.g., the correlation between mutation and proteomics data) and to compute partial matrix correlations (e.g., the correlation between mutation and drug response, corrected for proteomics data). Using this approach, we found that gene expression and proteomics data act as 'mediator data types': they contain all the information shared between drug response and the remaining data types. As linear models give relatively high weights to mediator variables, this result explains why an Elastic Net predictor of drug response mostly selects gene expression and proteomics biomarkers.
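
The classical RV coefficient the authors extend can be computed directly; here is a reference implementation for two column-centred matrices with matched samples in the rows (the binary extension and partial correlations are the poster's contribution and are not shown).

```python
# Classical RV coefficient between two data matrices X and Y that share
# rows (samples): trace-based correlation of their cross-product matrices.
import numpy as np

def rv_coefficient(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T          # sample-by-sample cross-products
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))          # e.g. expression, 40 cell lines
Y = X[:, :30] + 0.5 * rng.normal(size=(40, 30))  # partially redundant data
print(rv_coefficient(X, Y))             # high RV = redundant data types
```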

Session B-377: Unsupervised domain adaptation for age prediction from DNA methylation data
COSI: Non-COSI
  • Lisa Handl, Max Planck Institute for Informatics, Germany
  • Adrin Jalali, Max-Planck-Institut für Informatik, Germany
  • Michael Scherer, Max-Planck Institute for Informatics, Germany
  • Nico Pfeifer, Department of Computer Science, University of Tübingen, Germany

Short Abstract: Over the last years, huge resources of biological and medical data have become available for research. These data offer great opportunities for machine learning applications in health care, e.g., for precision medicine, but are also challenging to analyze. One key challenge in biological data is heterogeneity, which can arise, e.g., from different data sources, unknown subgroups, or differences in data acquisition. If this heterogeneity causes a distribution mismatch between the training and test data of a statistical model, prediction performance quickly deteriorates. An interesting problem in epigenetics where this is relevant is age prediction across multiple tissues. Here, DNA methylation data are used to predict a donor's "epigenetic age", and an increased epigenetic age has been shown to be linked to lifestyle and disease history. In this work, we address the problem of heterogeneity by proposing an adaptive model which detects distribution mismatches between the inputs of the training and test data and excludes or down-weights features that behave differently (see the sketch below). Our method can be seen as unsupervised domain adaptation. We apply the model to age prediction based on DNA methylation data from a variety of tissues and compare it to a standard model that does not take heterogeneity into account. The standard approach performs particularly badly on one tissue type, on which we show substantial improvement with our new adaptive approach, even though no samples of that tissue were part of the training data.
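
A minimal sketch of the adaptation idea, assuming a plain per-feature Kolmogorov-Smirnov filter; the proposed method is more refined than this, but the mechanism is the same: detect train/test distribution mismatch and drop or down-weight the offending features.

```python
# Sketch: keep only features whose training and test distributions do not
# show a detectable mismatch (two-sample KS test), then fit on those.
import numpy as np
from scipy.stats import ks_2samp

def stable_features(X_train, X_test, alpha=0.01):
    keep = []
    for f in range(X_train.shape[1]):
        _, p = ks_2samp(X_train[:, f], X_test[:, f])
        if p > alpha:                  # no detectable distribution mismatch
            keep.append(f)
    return np.array(keep)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(50, 5))
X_test[:, 2] += 3.0                    # feature 2 shifts in the test tissue
print(stable_features(X_train, X_test))   # e.g. [0 1 3 4]
```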

Session B-379: MicrobiomeDB: a Web-based platform for interrogating microbiome experiments
COSI: Non-COSI
  • Francislon S. Oliveira, Centro de Pesquisas René Rachou (Fiocruz Minas), Brazil
  • Shon Cade, Department of Biology, University of Pennsylvania, United States
  • John Brestelli, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • John Iodice, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Brian P. Brunk, Department of Biology, University of Pennsylvania, United States
  • Jie Zheng, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Christian J. Stoeckert Jr, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Gabriel R. Fernandes, Centro de Pesquisas René Rachou (Fiocruz Minas), Brazil
  • Jessica C. Kissinger, Center for Tropical & Emerging Global Diseases, Department of Genetics and Institute of Bioinformatics, University of Georgia, United States
  • David S. Roos, Department of Biology, University of Pennsylvania, United States
  • Daniel P. Beiting, Center for Host-Microbial Interactions, Department of Pathobiology, University of Pennsylvania, United States

Short Abstract: High-throughput sequencing has revolutionized microbiology by allowing scientists to complement culture-based approaches with culture-independent profiling of complex microbial communities, also known as the microbiome. Community composition data are often accompanied by rich metadata that describe the source from which the sample was derived and how samples were treated prior to collection. However, there is a lack of tools that allow the integration of sample metadata with microbiome profiling data. To better understand how the variables described by metadata influence the structure and function of microbial communities, we developed MicrobiomeDB (microbiomeDB.org), a discovery platform that empowers researchers to fully leverage their experimental metadata to construct queries that interrogate microbiome datasets. Furthermore, the resulting queries can be statistically analysed and graphically visualized in the web browser, giving the user a powerful tool to interpret any set of samples from the loaded datasets. A key feature of MicrobiomeDB is an automated pipeline for loading data from microbiome experiments using the standard Biological Observation Matrix (.biom) format as input. Taxonomy assignments from the .biom file are mapped to the GreenGenes database to retrieve full 16S rRNA gene sequences and NCBI taxon identifiers. MIxS and user-defined metadata terms describing each sample are mapped to an OBO-based application ontology, which is used to guide metadata harmonization, organization, and display. Taken together, these results constitute a first step toward a full-featured open-source platform for a systems biology view of microbial communities.

Session B-381: Stability Change Prediction and Multiple Structure Alignment Extensions for UCSF Chimera
COSI: Non-COSI
  • Jessica Köberle, University of Applied Sciences Upper Austria, Austria
  • Markus Saliger, University of Applied Sciences Upper Austria, Austria
  • Jonas Schurr, University of Applied Sciences Upper Austria, Austria
  • Josef Laimer, University of Salzburg, Austria
  • Peter Lackner, University of Salzburg, Austria

Short Abstract: UCSF Chimera is a widely used software package for the visualization and modeling of protein 3D structures. We developed a bundle of extensions that supports the analysis of protein variants and protein engineering. The first extension offers an easy-to-use interface to our online service MAESTRO, a versatile tool for the prediction of stability changes upon point mutations. MAESTRO includes several search algorithms for the most (de)stabilizing combination of point mutations or potential stabilizing disulfide bonds. Our Chimera extension provides access to the most popular application scenarios. Results are presented in tabular form, and mutation sites can be highlighted within the structures. The second extension performs multiple structure alignments (MStAs) by utilizing our web service PIRATES. PIRATES is a meta server for MStAs, which currently provides access to eight widely used alignment methods. In addition, the service computes a consensus alignment based on their results. All resulting alignments are scored, and loaded structures can be superimposed based on them. An extension providing advanced residue selection options based on amino acid types or groups, secondary structure, and accessible surface area complements the bundle. Various selection constraints can be combined and inverted, and the resulting residue filter can be passed to the MAESTRO extension. All extensions are easy to use without any knowledge of the underlying software, and results are presented within Chimera. All components are freely available for academic research and have no dependencies other than a UCSF Chimera (version 11.1 or later) installation and an internet connection.

Session B-383: Analyzing cell migration dynamics from intravital imaging by deformable image matching
COSI: Non-COSI
  • Hideo Matsuda, Osaka University, Japan
  • Hironori Shigeta, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Junichi Kikuta, Osaka University, Japan
  • Masaru Ishii, Osaka University, Japan

Short Abstract: Tools for analyzing cell migration dynamics play an important role in the field of bio-imaging. Understanding physiological processes in health and disease requires imaging and analysis of the dynamic behavior of cells in tissues under normal and perturbed conditions. This typically involves tracking large numbers of cells in time-lapse microscopy data sets. Conventional cell tracking methods generally consist of two processing steps: (1) cell segmentation (dividing an image into biologically meaningful parts, "objects", and the remainder, "background"), and (2) cell association (associating segmented objects from frame to frame). However, cell segmentation is generally difficult to resolve due to the heterogeneity and dynamically deforming nature of cell morphology. For example, migrating macrophages often change their shapes, and their "segments" frequently overlap in an image. To cope with this issue, we propose a different method for analyzing cell migration dynamics that replaces the cell segmentation step with frame-to-frame image matching. For the image matching, we use a method called "Deep Matching", which can match objects from frame to frame via non-rigid deformable image matching with deep convolution operations. We applied the method to a comparative analysis of the distributions of cell migration speeds between normal and LPS-stimulated mouse leukocytes. The intravital image data of the leukocytes were obtained by two-photon microscopy. The performance of the proposed method is demonstrated by evaluating the analysis results.

Session B-385: A method of bioimage analysis for spatial pattern of cellular interaction and intercalation
COSI: Non-COSI
  • Shigeto Seno, Osaka University, Japan
  • Masayuki Furuya, Osaka University, Japan
  • Junichi Kikuta, Osaka University, Japan
  • Masaru Ishii, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan

Short Abstract: Biological imaging technologies have been advancing rapidly for several years, and multicolor fluorescence imaging has become an important part of the field of biology. Moreover, intravital imaging enables us to monitor the activity and spatial distribution of various types of cells. Because cell-cell interaction and intercalation play important roles in developmental processes and in many diseases, analyzing the spatial distributions of cells in images is a fundamental task in biological research. Co-localization analysis is one of the well-established approaches to analyzing the spatial interaction of two sets of objects: the relative location of one set of cells with respect to another contains information about potential interactions, and their spatial distributions are correlated if the cells need to communicate with each other. However, when analyzing cellular competition or intercalation, not only co-localization but also exclusive colony formation should be quantified. In this study, we propose a novel method to measure the degree of mixing and exclusive colony formation of two kinds of cells. As the first step in the image analysis, we construct a dendrogram representing the connectivity between cell areas using hierarchical clustering. The impurity of the resulting clusters is then calculated for each level of the dendrogram, and the change of this impurity value across levels indicates the pattern of co-localization or exclusive colony formation. Finally, we calculate the CMI (cell mixture index) as the area under the characteristic impurity curve (a toy computation is shown below). This index compensates for weaknesses of co-localization analyses such as pixel-intensity spatial correlation and object-based overlap methods.
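
A toy computation of such an index, assuming Gini impurity, average-linkage clustering, and a simple average over dendrogram levels; the authors' exact choices may differ.

```python
# Sketch: cluster cell positions hierarchically, track cluster impurity
# (mixing of two cell types) across dendrogram levels, and summarize the
# impurity curve into a single mixture index.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cell_mixture_index(positions, labels, n_levels=20):
    Z = linkage(positions, method="average")
    impurities = []
    for k in np.linspace(2, len(labels) // 2, n_levels).astype(int):
        assign = fcluster(Z, t=k, criterion="maxclust")
        imp = 0.0
        for c in np.unique(assign):
            frac = labels[assign == c].mean()     # fraction of type-1 cells
            imp += (assign == c).mean() * 2 * frac * (1 - frac)  # Gini
        impurities.append(imp)
    return float(np.mean(impurities))  # summary of the impurity curve

rng = np.random.default_rng(0)
pos = rng.random((60, 2))                      # mock cell coordinates
mixed = rng.integers(0, 2, 60)                 # well-mixed cell types
print(cell_mixture_index(pos, mixed))
```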

Session B-387: Exploration of Bacteriophage within the Female Urinary Microbiota
COSI: Non-COSI
  • Taylor Miller-Ensminger, Loyola University Chicago, United States
  • Jonathon Brenner, Loyola University Chicago, United States
  • Krystal Thomas-White, Loyola University Chicago, United States
  • Alan J. Wolfe, Loyola University Chicago, United States
  • Catherine Putonti, Loyola University Chicago, United States

Short Abstract: Bacteriophages (viruses that infect bacteria) play a significant role in shaping the overall bacterial community structure of the human microbiota. While this has been clearly demonstrated in the microbial communities of the gut, the contributions of phages within the bladder microbiota remain unknown. Prior evidence has shown correlations between bacterial populations within the bladder and clinical symptoms. We recently began to investigate phages in the bladder. In an effort to identify urinary bacteriophages, we sequenced 300 phylogenetically diverse species from our collection of more than 8000 bacteria isolated from urine obtained by transurethral catheter from about 1000 women. The sequences from these samples were run through our software, developed in Python, which integrates the tool VirSorter with novel functionality to automate downstream analyses. Our method identified 304 phages with confidence and an additional 318 sequences warranting further investigation. A key challenge in identifying phage genomes via computational methods, including our own, is the dearth of publicly available phage genomes; to date, only 2010 phage species have been characterized. Despite this challenge, we identified phages that infect a myriad of species commonly found within the bladder's microbiota, including Lactobacillus, Streptococcus, Enterococcus, Lactococcus, Morganella, and several Enterobacteriaceae. Since the bladder microbiota has been relatively unexplored, we expect that many novel phages exist beyond the ones found here. From our computational work, we have identified samples from which to isolate and characterize phages in the lab. This effort is essential to furthering our knowledge of the contributions of phages to bladder health.

Session B-389: Sequence analysis and evolutionary relationships of Microbial Transglutaminases
COSI: Non-COSI
  • Deborah Giordano, Institute of Food Sciences, CNR, via Roma 64, Avellino, Italy
  • Angelo Facchiano, Institute of Food Sciences, CNR, via Roma 64, Avellino, Italy

Short Abstract: Since 1990, the microbial transglutaminase (MTGase) of Streptomyces mobaraensis has been of industrial interest because of its ability to catalyze post-translational modifications in many proteins. Although many studies have searched for novel forms of MTGase as alternatives to the one in use, this MTGase remains the only one commercially available and employed as an industrial tool; moreover, the functions of transglutaminases in bacteria are still unknown, and a true classification of all known MTGases does not exist. We analyzed all protein sequences annotated as transglutaminases in the Pfam database in order to divide them into groups and select the most representative sequences. Among these sequences, we analyzed evolutionary relationships (using the MEGA tool and phylogenetic trees based on Maximum Likelihood and Neighbor Joining algorithms). Based on the features of the analyzed proteins, at least five groups of MTGases can be detected. Group I: MTGases similar to the experimentally characterized MTGase of Chryseobacterium sp., a novel form of MTGase that is very different from all other known MTGases. Group II: MTGases similar to the already known MTGase of Bacillus subtilis (Tgl-like), some of which do not preserve all the catalytic residues. Group III: MTGases that preserve the main catalytic triad of the Streptomyces mobaraensis MTGase, in the order Cys, Asp, His. Group IV: a small group of proteins from Proteobacteria. Group V: MTGases similar to the eukaryotic TGases, preserving their typical catalytic triad order, i.e., Cys, His, Asp.
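
A neighbor-joining analysis of the kind mentioned above can be reproduced in outline with Biopython; this is a generic sketch, not the authors' MEGA workflow, and "mtgases.aln" is a hypothetical alignment file.

    # Generic neighbor-joining sketch with Biopython (the authors used MEGA;
    # this is only an illustration; "mtgases.aln" is a hypothetical input).
    from Bio import AlignIO, Phylo
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

    alignment = AlignIO.read("mtgases.aln", "clustal")
    dm = DistanceCalculator("blosum62").get_distance(alignment)  # pairwise distances
    tree = DistanceTreeConstructor().nj(dm)                      # neighbor joining
    Phylo.draw_ascii(tree)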

Session B-390: New Application of Computational Target Deconvolution for Phenotypic Screening
COSI: Non-COSI
  • Ryo Kunimoto, University of Bonn, Germany
  • Dilyana Dimova, University of Bonn, Germany
  • Jürgen Bajorath, University of Bonn, Germany

Short Abstract: Target deconvolution of phenotypic assays is a hot topic in chemical biology and drug discovery. It is generally thought that phenotypic screens might produce leads that are more relevant for addressing complex biology in vivo than compounds identified in target-based assays. Phenotypic discovery is challenged by the need to identify -or at least narrow down- cellular targets for compounds with interesting phenotypic readouts, a process often referred to as target deconvolution. A widely applied computational approach infers putative targets of new active molecules on the basis of their chemical similarity to compounds with activity against known targets. Herein, we introduce a molecular scaffold-based variant of similarity-based target deconvolution for chemical cancer cell line screens, which were used as a model system for phenotypic assays. A new scaffold type, termed the analog series-based (ASB) scaffold, was applied for substructure-based similarity assessment. Compared to conventional scaffolds and compound-based similarity calculations, target assignment centered on ASB scaffolds derived from screening hits and bioactive reference compounds restricted the number of target hypotheses in a meaningful way and led to a significant enrichment of known cancer targets among the candidates.
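
The similarity-based target assignment that this work builds on can be sketched with RDKit. The snippet below uses whole-compound Morgan fingerprints and Tanimoto similarity as a generic illustration only; the poster's method instead matches analog series-based scaffolds.

    # Generic similarity-based target inference sketch with RDKit
    # (illustrative only; not the ASB scaffold procedure itself).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def infer_targets(hit_smiles, reference, cutoff=0.7):
        """reference: list of (smiles, target) pairs for bioactive compounds."""
        hit_fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(hit_smiles), 2, nBits=2048)
        hypotheses = set()
        for smiles, target in reference:
            fp = AllChem.GetMorganFingerprintAsBitVect(
                Chem.MolFromSmiles(smiles), 2, nBits=2048)
            if DataStructs.TanimotoSimilarity(hit_fp, fp) >= cutoff:
                hypotheses.add(target)   # similar compound -> candidate target
        return hypotheses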

Session B-391: Origin mechanisms of L-SAARs in signal peptides among higher eukaryotes
COSI: Non-COSI
  • Michał Stolarczyk, Silesian University of Technology, Gliwice, Poland
  • Joanna Polańska, Silesian University of Technology, Gliwice, Poland
  • Paweł P. Łabaj, Chair of Bioinformatics, Boku University Vienna, Austria

Short Abstract: Single amino acid repeats (SAARs) are reiterations of single amino acids within peptides. Preliminary analysis of proteomes suggests the significance of those containing leucine. The vast majority of leucine reiterations are located in the signal peptide (the amino-terminus of proteins destined for the secretory pathway). Our earlier research has shown that leucine repeats are overrepresented and conserved in eukaryotic organisms. This raises the question of an additional role for leucine repeats in signal peptides. Here, we analyze the distribution of leucine repeats found in signal peptides of orthologous eukaryotic proteins and investigate the changes of the leucines forming L-SAARs at both the amino acid and nucleotide levels. This thorough analysis facilitates the detection of trends determining the direction of evolution and, prospectively, the determination of the origins of L-SAARs in signal peptides. The study focused chiefly on mammals, and the data set consisted of the proteomes and transcriptomes of nine organisms. We examined the following: changes in signal peptide and L-SAAR length (to determine whether DNA replication slippage is indeed the main L-SAAR-creating phenomenon), amino acid-to-leucine changes (to inspect the amino acids previously present at leucine positions in evolutionarily older lineages), leucine codon usage (to test for any bias in leucine codon distribution), and codon changes within leucine codons (to determine whether the inequality in leucine codon distribution results from any directional point mutations within leucine codons).
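
Detecting leucine runs in a signal peptide and tallying leucine codon usage, as described above, reduces to simple sequence scans; below is a minimal, hypothetical sketch, not the authors' pipeline.

    # Minimal sketch: find leucine repeats (L-SAARs) in a peptide and count
    # leucine codon usage in the matching CDS (illustrative only).
    import re
    from collections import Counter

    LEU_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}

    def leucine_repeats(peptide, min_len=3):
        # e.g. "MKLLLLSA" -> [(2, "LLLL")]
        return [(m.start(), m.group())
                for m in re.finditer(r"L{%d,}" % min_len, peptide)]

    def leucine_codon_usage(cds):
        codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
        return Counter(c for c in codons if c in LEU_CODONS)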

Session B-392: Global deceleration of gene evolution following recent genome hybridizations in fungi
COSI: Non-COSI
  • Sira Sriswasdi, Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Japan
  • Masako Takashima, Japan Collection of Microorganisms, RIKEN BioResource Center, Japan
  • Ri-Ichiroh Manabe, Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Japan
  • Moriya Ohkuma, Japan Collection of Microorganisms, RIKEN BioResource Center, Japan
  • Takashi Sugita, Department of Microbiology, Meiji Pharmaceutical University, Japan
  • Wataru Iwasaki, Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Japan

Short Abstract: Polyploidization events such as whole-genome duplication and inter-species hybridization are major evolutionary forces that shape genomes. Although long-term effects of polyploidization have been well-characterized, early molecular evolutionary consequences of polyploidization remain largely unexplored. Here, we report the discovery of two recent and independent genome hybridizations within a single clade of a fungal genus, Trichosporon. Comparative genomic analyses revealed that redundant genes are experiencing decelerations, not accelerations, of evolutionary rates. We identified a relationship between gene conversion and decelerated evolution suggesting that gene conversion may improve the genome stability of young hybrids by restricting gene functional divergences. Furthermore, we detected large-scale gene losses from transcriptional and translational machineries that indicate a global compensatory mechanism against increased gene dosages. Overall, our findings illustrate counteracting mechanisms during an early phase of post-genome hybridization and fill a critical gap in existing theories on genome evolution.

Session B-393: Gene expression pattern and genetic variation in human induced pluripotent stem cells in a family-based cohort of Tetralogy of Fallot
COSI: Non-COSI
  • Sandra Appelt, Charité – Universitätsmedizin Berlin, Germany
  • Marcel Grunert, Charité – Universitätsmedizin Berlin, Germany
  • Sophia Schönhals, Charité – Universitätsmedizin Berlin, Germany
  • Huanhuan Cui, Charité – Universitätsmedizin Berlin, Germany
  • Silke R.-Sperling, Charité – Universitätsmedizin Berlin, Germany

Short Abstract: Patient-specific induced pluripotent stem cells (ps-iPSCs) and their differentiated cell types are a powerful model system for gaining insights into the mechanisms driving developmental and disease-associated regulatory networks. An open question is to what degree somatic mutations affect functional studies of genetic disorders using ps-iPSCs. To investigate this question, as well as the impact of differential expression patterns between ps-iPSCs and iPSCs derived from healthy relatives, we studied healthy individuals and individuals with Tetralogy of Fallot (ToF). ToF represents the most common cyanotic heart defect in humans, characterized by a multigenic background. We performed whole genome sequencing of pooled fibroblast-derived iPSCs and transcriptome sequencing of iPSCs at three differentiation states as well as blood samples. To estimate the relative contributions of sample identity, genetic background, and differentiation state to transcriptional variation, we applied a linear mixed effect model. As expected, the most important driver of global variation is the differentiation state, followed by the genetic background, with sample identity having only minor impact. We further applied a stringent filtering pipeline to identify damaging somatic and germline variants. We found a damaging somatic mutation in the DNA-binding domain of TP53 in two iPSC clones of one patient, which was furthermore functionally validated. TP53 encodes the tumor suppressor P53, and mutations at this site are associated with cancer and are described to decrease P53-mediated regulation of apoptosis, genomic stability, and cell cycle. These results imply that a careful genetic characterization of iPSCs is essential before further follow-up experiments or clinical usage.

Session B-394: Discovering novel drug indications based on NLP and topic modeling
COSI: Non-COSI
  • Giup Jang, Gachon University, South Korea
  • Taekeon Lee, Gachon University, South Korea
  • Soyoun Hwang, Gachon University, South Korea
  • Youngmi Yoon, Gachon University, South Korea

Short Abstract: Text mining is a technique that extracts meaningful information from unstructured text data. We propose a method to discover novel drug indications using natural language processing (NLP) and topic modeling, both text mining techniques. First, we extracted sentences in which a gene and a drug co-occur from abstracts in PubMed. Using these sentences, we identified words that explain relationships between genes and drugs using the Stanford parser. We labeled each identified word as up- or down-regulation, assigning +1 to up-regulation words and -1 to down-regulation words. We multiplied all assigned scores to calculate the GRS (Gene Regulation Score), assigning activation to the gene–drug relationship if the GRS is greater than 0 and inhibition if the GRS is smaller than 0. Next, we clustered genes and their regulations and built a set of topics based on topic modeling. For each drug, we used the topic probabilities as features and known drug–gene associations as the classification class, and we measured classifier performance. We measured AUC for J48, RandomForest, and NaiveBayes using 10-fold cross-validation. Furthermore, we used the classifier to predict novel drug–gene associations for potential drugs among unknown drug–gene associations. For novel drugs, we measured p-values using Fisher's exact test and DisGeNET, and we identified candidate drugs for drug repositioning when the p-value was smaller than 0.05.
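
The GRS computation described here is a sign product over the regulatory words found in a sentence; a minimal sketch follows, with a hypothetical word lexicon standing in for the authors' curated one.

    # Minimal GRS sketch: +1 for up-regulation words, -1 for down-regulation
    # words, multiplied together (the word lexicon below is hypothetical).
    from math import prod

    UP_WORDS = {"activates", "induces", "increases"}      # assumed examples
    DOWN_WORDS = {"inhibits", "suppresses", "decreases"}  # assumed examples

    def relation(sentence_words):
        signs = [+1 if w in UP_WORDS else -1 for w in sentence_words
                 if w in UP_WORDS or w in DOWN_WORDS]
        if not signs:
            return "unknown"
        grs = prod(signs)   # Gene Regulation Score
        return "activation" if grs > 0 else "inhibition"

    # relation(["drugX", "inhibits", "geneY"]) -> "inhibition"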

Session B-395: gTide-Hi: Accelerating a Cross-correlation Algorithm in Tide using a single GPU
COSI: Non-COSI
  • Hyunwoo Kim, Korea Institute of Science and Technology Information, South Korea
  • Kyongseok Park, Korea Institute of Science and Technology Information, South Korea
  • Sunggeun Han, Korea Institute of Science and Technology Information, South Korea
  • Jung-Ho Um, Korea Institute of Science and Technology Information, South Korea

Short Abstract: Cross-correlation is one of the most popular algorithms for peptide identification in database search, and many computer programs such as SEQUEST, Comet, and Tide currently use it. Recently, the HiXCorr algorithm was developed to speed up this computation for high-resolution spectra by improving the preprocessing step for tandem mass spectra. However, despite the development of HiXCorr, the algorithm is still slow when search parameters such as the number of enzymatic termini, missed cleavages, or post-translational modifications (PTMs) are used. To solve this problem, we used the graphics processing unit (GPU) to develop gTide-Hi, which combines Tide's cross-correlation algorithm with the HiXCorr algorithm. gTide-Hi is 2.7 times faster than the original Tide with the following parameters: mz-bin-width = 0.01, maximum number of missed cleavages = 3, and variable modifications of 2 oxidations per peptide on Met and 2 deamidations per peptide on Asn and Gln. Regardless of the parameters used, the results produced by gTide-Hi and the original Tide were the same.
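
The score being accelerated can be sketched on the CPU with NumPy. The snippet follows the commonly described fast-XCorr formulation (background subtraction with a sliding mean over a fixed bin window, then a dot product); it is an illustration under those assumptions, not the GPU implementation.

    # CPU sketch of an XCorr-style score (illustrative only; gTide-Hi
    # performs the preprocessing and scoring on the GPU).
    import numpy as np

    def xcorr(theoretical, observed, offset=75):
        """theoretical/observed: binned spectra as equal-length float arrays."""
        n = len(observed)
        csum = np.cumsum(np.concatenate(([0.0], observed)))
        background = np.empty(n)
        for i in range(n):
            lo, hi = max(0, i - offset), min(n, i + offset + 1)
            background[i] = (csum[hi] - csum[lo]) / (2 * offset + 1)
        # subtract the windowed mean, then correlate at zero lag
        return float(np.dot(theoretical, observed - background))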

Session B-396: A Hierarchical Gene Tree for Survival Prediction
COSI: Non-COSI
  • Minhyeok Lee, Korea University, South Korea
  • Sung Won Han, Korea University, South Korea
  • Junhee Seok, Korea University, South Korea

Short Abstract: Predicting survival risk from gene expression is an important issue in large-scale genomic data analyses. While this problem has been studied extensively for decades, it still suffers from low predictive power. Here, we propose a hierarchical gene tree to improve survival prediction. From gene expression data, the proposed method constructs a tree structure by recursively finding genes associated with survival outcomes. To find predictor genes for a target gene, variable selection based on a regularized regression method is conducted. In order to reduce variation and error within the gene expression data, data points are projected onto the inter-level models of the hierarchical gene tree. The proposed method was evaluated in a simulation study. In most simulation cases, the proposed method outperformed conventional methods based on Cox regression. Furthermore, the proposed method was applied to survival prediction for pancreatic cancer patients. The proposed method improved performance by 16.3% compared to the conventional method for the prediction of low- and high-risk patients. We expect that the proposed method will significantly advance survival prediction with gene expression data.
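
One step of the recursive construction, selecting predictor genes for a target gene via regularized regression, can be sketched with scikit-learn's Lasso; this is a hypothetical helper, not the authors' code, and the L1 penalty is an assumed choice of regularizer.

    # Sketch of one step of the hierarchical gene tree: select predictor
    # genes for a target gene with L1-regularized regression (illustrative).
    import numpy as np
    from sklearn.linear_model import Lasso

    def select_predictors(expr, target_idx, alpha=0.1, max_children=5):
        """expr: samples x genes matrix; returns indices of predictor genes."""
        X = np.delete(expr, target_idx, axis=1)
        y = expr[:, target_idx]
        coef = Lasso(alpha=alpha).fit(X, y).coef_
        order = np.argsort(-np.abs(coef))[:max_children]
        picked = [i for i in order if coef[i] != 0]
        # map back to original gene indices (account for the deleted column)
        return [i if i < target_idx else i + 1 for i in picked]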

Session B-397: Losses of Ubiquitylation Sites during Human Evolution
COSI: Non-COSI
  • Dongbin Park, Department of Life Science, Chung-Ang University, South Korea
  • Chul Jun Goh, Department of Life Science, Chung-Ang University, South Korea
  • Hyein Kim, Department of Life Science, Chung-Ang University, South Korea
  • Yoonsoo Hahn, Department of Life Science, Chung-Ang University, South Korea

Short Abstract: Ubiquitylation, in which the highly conserved 76-residue polypeptide ubiquitin is covalently attached to a lysine residue of substrate proteins, mediates targeted destruction of ubiquitylated proteins by the ubiquitin-proteasome system. We hypothesize that the loss of ancestral ubiquitylation sites in highly conserved proteins during evolution may modify the ubiquitin-mediated regulatory network, potentially resulting in the acquisition of novel phenotypes. We analyzed mouse ubiquitylation data compiled in the Mammalian Ubiquitination Site Database (mUbiSiDa) and multiple sequence alignments of orthologous proteins from 62 mammalian species to identify losses of ancestral ubiquitylation sites in the Euarchonta lineage leading to humans. We found that 194 ancestral ubiquitylation sites were lost in 170 human proteins since the Euarchonta lineage diverged from the Glires lineage. Of the 194 sites, 9 losses occurred in human proteins after the human-chimpanzee divergence. The loss of ancestral ubiquitylation sites may have driven the evolution of protein degradation and/or other regulatory networks, and the emergence of novel phenotypes.

Session B-398: Analysis of predicted interaction hotspots in host-specific protein sequences
COSI: Non-COSI
  • Myeongji Cho, Seoul National University, South Korea
  • Ji-Hae Lee, Seoul National University, South Korea
  • Mikyung Je, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea

Short Abstract: We analyzed the host specificity of selected viral receptors affecting virus–host interactions through the prediction and evaluation of interspecies transmissible viruses and their potential hosts. As a novel way to explain viral cross-species infection, which is difficult to describe clearly from the genetic information of full amino acid sequences alone, we predicted and compared protein disordered regions that are likely to interact with the selected viruses using host-specific protein sequences. We used charge/hydropathy analysis to calculate the ratio of charge to hydropathy for each amino acid along the protein sequence. Predictions were made by classifying sequence segments with structural differences using the boundary line equation that separates folded proteins from disordered proteins according to Uversky's algorithm. Next, interspecies conservation rates of the receptor protein sequences were calculated and compared by mapping the amino acid residues predicted as interaction hotspots onto each sequence via multiple sequence alignment. We confirmed through literature review that all predicted results corresponded to the actual viral host ranges and infectivity. The results of this study suggest that predicted structural regions of host proteins that are important for interactions with viral pathogens can be used to infer viral host ranges and the risk of infection by viruses jumping interspecific barriers. We expect that this method will be applicable to the search for, and evaluation of, major receptors that play important roles in the mechanisms of virus infection and transmission.
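
The charge/hydropathy classification referenced above compares mean net charge against mean normalized Kyte-Doolittle hydropathy and applies Uversky's boundary line; the sketch below assumes the commonly cited boundary <R> = 2.785<H> - 1.151 and a net charge computed from K/R vs. D/E counts, both labeled assumptions.

    # Charge/hydropathy sketch after Uversky (boundary constants and the
    # simplified net-charge formula are assumptions; illustrative only).
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
          "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
          "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
          "Y": -1.3, "V": 4.2}   # Kyte-Doolittle hydropathy scale

    def is_disordered(seq):
        # mean hydropathy, normalized from [-4.5, 4.5] to [0, 1]
        h = sum((KD[a] + 4.5) / 9.0 for a in seq) / len(seq)
        # mean net charge (absolute excess of K/R over D/E)
        r = abs(sum(a in "KR" for a in seq) - sum(a in "DE" for a in seq)) / len(seq)
        return r > 2.785 * h - 1.151   # above the boundary line -> disordered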

Session B-399: Discovering new long-term potentiation pathway members using protein–protein interaction networks
COSI: Non-COSI
  • Ji-Hae Lee, Seoul National University, South Korea
  • Mikyung Je, Seoul National University, South Korea
  • Myeongji Cho, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea

Short Abstract: Synaptic plasticity is the ability of synapses to strengthen or weaken their connection strength. Long-term potentiation (LTP) and depression (LTD) in hippocampal synapses of the brain represent forms of long-term synaptic plasticity and play important roles in memory systems. The proteins and components involved in LTP/LTD induction have not yet been fully elucidated. To extend the known long-term potentiation reference pathway, we used a protein–protein interaction network-based method to select candidate genes. We compared the hippocampal tissues of mice with reduced learning and memory to those of control mice using microarray data in the Gene Expression Omnibus database. To identify genes associated with memory deficit, genes showing significant differential expression were investigated. We also used the STRING protein–protein interaction database to investigate the interactions between various interactors and the genes in the known long-term potentiation maps of the Kyoto Encyclopedia of Genes and Genomes. Gene Ontology enrichment analysis was performed to identify significant biological subsystems. Cytoscape was used to visualize the interactions of the components constituting the protein–protein interaction networks. We generated protein-pair data including the correlation coefficient of expression patterns, sequence similarity, and binding affinity. Classification models were generated to predict candidate genes that could also be involved in the synaptic long-term potentiation pathway. We developed a method to extend the synaptic transmission network using systems biology, which can help to identify therapeutic targets for diseases that cause memory loss.

Session B-400: Inference of microbial interactions from human gut metagenome data
COSI: Non-COSI
  • Yu Watanabe, Niigata University, Japan
  • Yiwei Ling, Niigata University, Japan
  • Shujiro Okuda, Niigata University, Japan

Short Abstract: Microbial communities play important roles in the biocycles of all ecosystems. However, most microbes remain uncultivated, and much metabolic diversity is still to be elucidated. The recently developed metagenomics approach is a powerful tool for measuring biodiversity in ecosystems, but the dynamics of microbial interactions are still unknown. Furthermore, human gut microbial communities are well known to be related to host health; thus, clarifying their dynamics is one of the most important issues. In order to investigate the interactions of microorganisms in human gut environments, we applied network analysis of the functional modules defined in metabolic pathways to human gut metagenomics data. We used the integrated reference catalog of the human gut microbiome (IGC) as the metagenome data and KEGG MODULES as the functional metabolic modules. Subsequently, we mapped the KEGG Orthology (KO) annotations of the IGC data onto the module data sets to obtain presence/absence information for the functional pathway modules. We finally integrated the modules and constructed human gut microbial networks from their linkages. The results suggested that healthy and diseased human guts possess their own specific microbial networks and tend to develop microbial interaction networks built from optimal or sub-optimal species compositions.
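
The KO-to-module presence/absence mapping described here can be sketched with simple set operations. The module definitions below are hypothetical simplifications: real KEGG MODULE definitions are boolean expressions over KOs, flattened here to plain sets, and the completeness cutoff is an assumed parameter.

    # Sketch of KO-to-module presence/absence mapping (illustrative; real
    # KEGG MODULE definitions are logical expressions, simplified to sets).
    def module_presence(sample_kos, modules, completeness=0.8):
        """sample_kos: set of KO ids; modules: {module_id: set of KO ids}."""
        calls = {}
        for mid, kos in modules.items():
            frac = len(sample_kos & kos) / len(kos)
            calls[mid] = frac >= completeness   # present if mostly covered
        return calls

    # e.g. module_presence({"K00001", "K00002"},
    #                      {"M00001": {"K00001", "K00002", "K00003"}})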

Session B-401: MetaGraph: De Bruijn graph based data structures and algorithms for comparative and metagenomics
COSI: Non-COSI
  • Andreas Andrusch, Robert Koch Institute, Germany
  • Michael Schwabe, Robert Koch Institute, Germany
  • Simon H. Tausch, Robert Koch Institute, Germany
  • Piotr W. Dabrowski, Robert Koch Institute, Germany
  • Bernhard Y. Renard, Robert Koch Institute, Germany
  • Andreas Nitsche, Robert Koch Institute, Germany

Short Abstract: Due to the ever-growing amount of NGS data generated, efficient data structures for their storage and analysis are becoming increasingly crucial. Here we present MetaGraph, a novel approach addressing both requirements in the context of metagenomic data analysis. MetaGraph uses a de Bruijn graph based data structure for reference sequence storage, augmented with sequence metadata including taxonomic lineage. One of MetaGraph's main applications is the taxonomic binning of unknown sequences, for example reads from metagenomic NGS datasets, in order to assess sample constituents. Additionally, it enables classifications and comparisons based on user-selectable clades, stepping away from single references towards pan-genome references. Outside of the field of metagenomics, it enables the researcher to perform a wide array of analyses important for comparative genomics. This includes sequence comparisons of references against reads or other references in order to find shared sequence stretches or unique subsequences. In the same fashion it allows the analysis of pan-genomes. Using the graph structures presented here, MetaGraph avoids redundant computations by fully exploiting similarities between sequences. Due to its flexibly extensible sequence metadata, it can be adapted to a multitude of sequence analysis contexts. These features are realized using a highly performant and scalable data structure, which exploits sequence redundancies to amortize space requirements. It performs comparably to or better than published tools with similar functionality in both speed and scalability. Further improvements are planned regarding the out-of-process storage of MetaGraph's data structures using database back-ends, and increases in sensitivity by working with spaced k-mers.
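
At its core, a de Bruijn graph of the kind MetaGraph builds stores (k-1)-mer nodes linked by k-mers, with metadata attached to k-mers. The toy sketch below uses plain hash maps purely for illustration; MetaGraph itself relies on succinct, compressed structures.

    # Toy annotated de Bruijn graph sketch (MetaGraph uses succinct data
    # structures; a plain dict is used here purely for illustration).
    from collections import defaultdict

    def build_dbg(sequences, k=31):
        graph = defaultdict(set)     # (k-1)-mer -> successor (k-1)-mers
        labels = defaultdict(set)    # k-mer -> metadata (e.g. taxon ids)
        for seq_id, seq in sequences:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
                labels[kmer].add(seq_id)   # sequence metadata on each k-mer
        return graph, labels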

Session B-402: Supervised Chromatin Segmentation
COSI: Non-COSI
  • Tobias Frisch, University of Southern Denmark, Denmark
  • Xinyi Yang, Max Planck Institute for Molecular Genetics, Germany
  • Johannes Helmuth, Max Planck Institute for Molecular Genetics, Germany
  • Jan Baumbach, University of Southern Denmark, Denmark
  • Annalisa Marsico, Max Planck Institute for Molecular Genetics, Germany
  • Ho-Ryun Chung, Max Planck Institute for Molecular Genetics, Germany

Short Abstract: RNA sequencing has become a widely accepted and used technique for analyzing the human transcriptome. However, its capability to reveal low-abundance transcripts is limited by sequencing depth, which in turn correlates with cost. Furthermore, research in recent years has shown that a significant fraction of RNAs undergo high degradation rates and are therefore missing from the corresponding RNA-Seq experiments. We advocate the use of histone modifications, assayed via ChIP-Seq, to reveal the expression of transcripts. To this end, a hidden Markov model is trained on the histone modifications of highly expressed genes. The model is then used to divide the human genome into functional units associated with transcribed and suppressed genes. The model revealed histone patterns for transcription start sites, elongation, and intergenic regions that correlate well with the known functions of those modifications. Furthermore, we show that the modification patterns of a significant number of transcripts that appear unexpressed according to RNA-Seq data are strongly related to those of highly expressed genes. Overall, we demonstrate that a hidden Markov model based on histone modifications is able to reveal the location and expression status of previously hidden transcripts.
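
The segmentation step, fitting an HMM to binned histone modification signals and decoding one state per genomic bin, can be sketched with hmmlearn. This is a generic unsupervised Gaussian HMM sketch with placeholder data, not the authors' supervised model.

    # Generic chromatin-segmentation HMM sketch with hmmlearn (the poster's
    # model is trained in a supervised way; this is only an illustration).
    import numpy as np
    from hmmlearn import hmm

    # X: bins x marks matrix of ChIP-Seq signal (placeholder random data here)
    X = np.random.rand(10000, 5)
    model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(X)
    states = model.predict(X)   # one chromatin state per genomic bin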

Session B-403: Isoelectric Point is Evidence of Transcriptional and Translational Pseudogenes
COSI: Non-COSI
  • Seunghyuk Choi, Hanyang University, South Korea
  • Bongseok Jo, Kangwon University, South Korea
  • Eunok Paek, Hanyang University, South Korea
  • Sun Choi, Kangwon University, South Korea

Short Abstract: Pseudogenes have been described as non-functional and can be categorized by evolutionary mechanism into unitary, duplicated, and processed. In the case of duplicated and processed pseudogenes, recent studies suggest that some pseudogenes are often transcribed and sometimes translated. However, the characterization of translated pseudogenes has seldom been studied. We used 140,503,871 spectra from 50 early-onset gastric cancer patients and applied a multi-stage search against a human pseudogene database, constructed by three-frame translation of previously reported pseudogene transcripts, for tandem mass spectrometry (MS/MS)-based analysis. We controlled the resulting peptide spectrum matches (PSMs) at a 1% estimated false discovery rate (FDR). Among the 72,575 MS/MS-certified PSMs, 32,630 PSMs were discarded because they overlapped with a reference protein database (UniProt 2016-09). The 1,959 unique pseudogene peptides from the remaining 39,945 PSMs were analyzed in terms of 1) propensities of isoelectric point (pI) against control peptides (Ensembl coding genes), and 2) transcriptional activity of the pseudogene peptides. The median pI of the 1,959 pseudogene peptides was ~4.1, significantly lower than that of the control (~7.3). Furthermore, the average pI decreased to ~2.5 as we filtered by increasing numbers of samples, up to 10. Transcriptional activities of the 1,959 pseudogene peptides were obtained from a previous annotation. We found 1,597 pseudogenes corresponding to the 1,959 pseudogene peptides, but transcriptional activity annotations were found for only 483 of them (~30.2%). It is noteworthy that the remaining 10,110 pseudogenes, which did not map to our findings, showed lower transcriptional activity (~11.3%).
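
The per-peptide pI comparison described above can be reproduced with Biopython's ProtParam module; a minimal sketch, not the authors' exact pI calculator, follows.

    # Minimal sketch: median isoelectric point of identified peptides,
    # computed with Biopython (illustrative only).
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    def median_pi(peptides):
        pis = sorted(ProteinAnalysis(p).isoelectric_point() for p in peptides)
        mid = len(pis) // 2
        return pis[mid] if len(pis) % 2 else (pis[mid - 1] + pis[mid]) / 2

    # e.g. median_pi(["DEEDSAK", "PEPTIDE"])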

Session B-404: Optimisation of RNA-Seq gene expression data pre-processing
COSI: Non-COSI
  • Pashupati Mishra, University of Helsinki, Finland
  • Petri Törönen, University of Helsinki, Finland
  • Liisa Holm, University of Helsinki, Finland

Short Abstract: RNA-Seq enables the study of RNA expression profiles. The inference of gene expression levels in a sample involves pre-processing steps such as read alignment, transcript compilation, and expression estimation. Numerous alternative methods are available for each of these steps, but the existing literature comparing different pre-processing methods presents varying results. Clearly, a quality control system that allows users to robustly estimate the amount of signal in their RNA-Seq data after different pre-processing methods would benefit the field. We tested a set of quality control metrics for RNA-Seq gene expression data. The set included four subsets of metrics that aim to measure the biological signal in data by monitoring four different features: a) differential gene expression, b) treatment group separation, c) control gene separation, and d) differential gene set expression. We evaluated alternative metrics within each subset using a novel benchmark based on an Artificial Dilution Series (ADS). ADS takes real RNA-Seq data, makes multiple copies of it, and adds varying amounts of noise to the copies. The rationale behind the evaluation was to rank the metrics within each subset based on their sensitivity to the different levels of noise. Our results show drastic differences between metrics within each subset. We present a Quality Control system for RNA-Seq Gene Expression Data (QC-GED) that comprises the best metrics for monitoring each of the four features in data. It thus reliably quantifies biological signal from four different perspectives and allows users to choose the best pre-processing methods.
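
The Artificial Dilution Series idea, copies of a real dataset with increasing amounts of noise, can be sketched in a few lines of NumPy. The Gaussian noise model on log-scale expression used below is an assumption made only for illustration.

    # Sketch of an Artificial Dilution Series: copies of a real expression
    # matrix with increasing noise (Gaussian noise on the log scale is an
    # assumed model, chosen only for illustration).
    import numpy as np

    def artificial_dilution_series(log_expr, noise_levels=(0.0, 0.5, 1.0, 2.0),
                                   seed=0):
        rng = np.random.default_rng(seed)
        return {s: log_expr + rng.normal(0.0, s, size=log_expr.shape)
                for s in noise_levels}

    # A good QC metric should degrade monotonically across the series.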

Session B-405: Bgee database: creating knowledge from gene expression in any animal species
COSI: Non-COSI
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland
  • Bgee Team, Swiss Institute of Bioinformatics, Switzerland

Short Abstract: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as "healthy" or not). Data are integrated and made comparable between species thanks to calls of presence/absence of expression and of differential over-/under-expression, integrated along with information on gene orthology and on homology between organs. As a result, Bgee provides unique gene expression analysis tools: i- Bgee is capable of detecting the preferred conditions of expression of any single gene, accommodating any data type and species. These condition rankings are highly specific, even for broadly expressed genes. ii- Bgee provides a new type of gene list enrichment analysis tool, TopAnat, capable of detecting the preferred conditions of expression of a list of genes. We hope that TopAnat will prove to be as useful as, and complementary to, standard Gene Ontology enrichment tests. iii- Bgee provides a convenient Bioconductor package, allowing analyses to be performed directly in R and all processed expression data available in Bgee to be downloaded. This includes thousands of annotated and re-processed Affymetrix chips and RNA-Seq libraries. Bgee includes 29 animal species and is available at http://bgee.org/

Session B-406: Half-Sibling Reconstruction Using Forbidden Subgraphs
COSI: Non-COSI
  • Tanya Berger-Wolf, University of Illinois at Chicago, United States
  • Nick Shaskevich, Google Inc, United States
  • Dhruv Mubayi, University of Illinois at Chicago, United States
  • Aayush Kataria, University of Illinois at Chicago, United States
  • Krutarth Joshi, University of Illinois at Chicago, United States

Short Abstract: Knowledge of lower-order pedigree is an important component of many biological studies, particularly those focused on mating systems, evolution, and adaptation. In non-monogamous species it is often important to know the half-sibling relationships, as those provide insight into the degree of polygamy, the mating mechanisms, and dominance relationships. Microsatellites, or Short Tandem Repeats (STRs), are the genetic markers of choice for wildlife population studies. We propose an algorithm for solving the problem of half-sibling reconstruction starting from a microsatellite sample of individuals all belonging to the same generation and the same population. We show that the problem of half-sibling reconstruction is equivalent to 2-Vertex Cover, and we propose and experimentally validate an algorithm that uses the equivalence of 2-cover obstructions to find valid half-sibling groups. The algorithm runs in time cubic in the number of individuals in the population and produces accurate results when the number of alleles per locus is sufficiently high to be informative.
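
One natural reading of the underlying Mendelian constraint is that a valid half-sibling group must, at every locus, be covered by at most two alleles (those of the shared parent). The sketch below implements that feasibility check under this assumed reading; it is an illustration, not the poster's forbidden-subgraph algorithm.

    # Sketch of a half-sibling feasibility check (assumed constraint: at each
    # locus there exist two alleles such that every individual carries at
    # least one of them; illustrative only).
    from itertools import combinations

    def valid_halfsib_group(genotypes):
        """genotypes: list of loci; each locus is a list of (a, b) allele
        pairs, one pair per individual."""
        for locus in genotypes:
            alleles = {a for pair in locus for a in pair}
            candidates = list(combinations(alleles, 2)) + [(a, a) for a in alleles]
            if not any(all(p in pair or q in pair for pair in locus)
                       for p, q in candidates):
                return False   # no 2-allele cover -> forbidden configuration
        return True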

Session B-407: smartAPI: Towards a More Intelligent Network of Web APIs
COSI: Non-COSI
  • Michel Dumontier, Maastricht University, Netherlands
  • Shima Dastgheib, Stanford University, United States
  • Trish Whetzel, University of California San Diego, United States
  • Pedro Assisi, Stanford University, United States
  • Paul Avillach, Harvard Medical School, United States
  • Kathleen Jagodnik, Icahn School of Medicine at Mount Sinai, United States
  • Gabor Korodi, Harvard Medical School, United States
  • Marcin Pilarczyk, University of Cincinnati, United States
  • Stephan Schurer, University of Miami, United States
  • Raymond Terryn, University of Miami, United States
  • Ruben Verborgh, Ghent University, Belgium
  • Chunlei Wu, The Scripps Research Institute, United States

Short Abstract: Data science increasingly employs cloud-based Web application programming interfaces (APIs) stored in different repositories. However, discovering and connecting suitable APIs for a given application by sifting through these repositories is difficult due to the lack of rich metadata needed to precisely describe a service and the lack of explicit knowledge about the structure and datatypes of Web API inputs and outputs. To address this challenge, we conducted a survey to identify the metadata elements that are crucial to the description of Web APIs and subsequently developed a smartAPI metadata specification that includes 54 API metadata elements divided into five categories: (i) API Metadata, (ii) Service Provider Metadata, (iii) API Operation Metadata, (iv) Operation Parameter Metadata, and (v) Operation Response Metadata. Then, we extended the widely used Swagger editor for annotating APIs to develop a smartAPI editor that captures the APIs' domain-related and structural characteristics following the FAIR (Findable, Accessible, Interoperable, Reusable) principles. The smartAPI editor enables API developers to reuse existing metadata elements and values by automatically suggesting terms used by other APIs. In addition to making APIs more accessible and interoperable, we integrated the editor with a smartAPI profiler to annotate API parameters and responses with semantic identifiers. Finally, the annotated APIs are published into a searchable API registry. The registry makes it easier to find and reuse APIs and to see how different APIs connect together, so that complex workflows can be composed more easily. Links to the specification, tool, and registry are available at: http://smart-api.info/.

Session B-408: A journey for building up the ELIXIR Scientific Benchmark infrastructure: openEBench
COSI: Non-COSI
  • Salvador Capella-Gutiérrez, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Spain
  • Diana De La Iglesia, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Spain
  • Juergen Haas, SIB, Switzerland
  • José María Fernández González, Spanish National Cancer Research Centre (CNIO), Spain
  • Dmitry Repchevsky, INB, BSC/CNS, Spain
  • Josep Ll Gelpi, Dept. Bioquimica i Biologia Molecular. Univ. Barcelona, Spain
  • Alfonso Valencia, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Barcelona Supercomputing Center (BSC-CNS), Spain

Short Abstract: Benchmarking of bioinformatics tools provides objective metrics in terms of scientific quality, technical reliability, and functionality [1], and stimulates new developments by highlighting areas which require improvement [2]. Within the ELIXIR-EXCELERATE project [3], we propose a community-driven benchmarking infrastructure to support online assessment, comparison, and ranking of bioinformatics tools. Our objective is to store all relevant data and metadata for the continuous evaluation of existing methods and the development of new ones. Nevertheless, evaluating bioinformatics methods in an unbiased fashion remains challenging. Key challenges are the integration of highly heterogeneous data sets and models, created for specific tasks in a myriad of fields, into a single infrastructure, as well as the diversity of metrics, which should be integrated in such a manner that they can be compared. The design of the infrastructure is therefore based on standards, where JSON (JavaScript Object Notation [4]) is used as the data-exchange format to define a common structure [5] for benchmarking data generated in open scientific challenges. There are various active communities worldwide who stand to benefit from such a collaborative infrastructure: research groups providing new algorithms that need to be evaluated, tool developers wanting to promote their bioinformatics tools, data scientists demanding reference data sets and 'gold standards' to feed their methods, and users who need an unbiased ranking of available resources for conducting their research. The ultimate goal is to become the reference infrastructure for a broad range of bioinformatics communities, from protein structure modelling [6] to orthology [7] to biomedical text mining [8], putting forward different benchmark initiatives. A first prototype of the platform is available at [9]. REFERENCES: [1] Jackson M. et al. Software Evaluation: Criteria-based Assessment. Technical Report. Software Sustainability Institute. 2011. [2] Costello JC, Stolovitzky G. Seeking the wisdom of crowds through challenge-based competitions in biomedical research. Clin Pharmacol Ther. 2013 May;93(5):396-8. [3] https://www.elixir-europe.org/excelerate [4] http://json.org [5] https://github.com/inab/benchmarking-data-model [6] Haas J. et al. The Protein Model Portal--a comprehensive resource for protein structure and model information. Database (Oxford). 2013 Apr 26;2013:bat031. [7] Altenhoff AM. et al. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016 May;13(5):425-30. [8] Hirschman L. et al. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6 Suppl 1, S1 (2005). [9] https://elixir.bsc.es/benchmarking/home.htm

Session B-409: Toward elucidation of behavioral significance about mounting behavior through multi-omics analysis of temporary social parasitic ant, Polyrhachis lamellidens (Hymenoptera: Formicidae) 
COSI: Non-COSI
  • Hironori Iwai, Institute for Advanced Biosciences, Keio University, Japan
  • Nobuaki Kono, Institute for Advanced Biosciences, Keio University, Japan
  • Daiki Horikawa, Institute for Advanced Biosciences, Keio University, Japan
  • Masaru Tomita, Institute for Advanced Biosciences, Keio University, Japan
  • Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan

Short Abstract: Polyrhachis lamellidens (Hymenoptera: Formicidae) is a temporary social parasitic ant whose newly mated queen founds her new colony by invading colonies of other ant species. The newly mated P. lamellidens queen is known to perform mounting behavior against a worker ant in the host colony during the early stage of social parasitism, and it has been hypothesized that this behavior is required for cuticular hydrocarbon (CHC) camouflage. In this study, we conducted gas chromatography-mass spectrometry (GC/MS) analysis of CHCs in order to elucidate the role of mounting behavior in the social parasitic strategy of P. lamellidens. Furthermore, we carried out genome sequencing of P. lamellidens and transcriptome sequencing of larvae, workers, and queens in order to confirm the existence and expression of mounting behavior-related genes. The GC/MS analysis showed low levels of CHCs in the pre-parasitism P. lamellidens queen; however, a host-like CHC profile was observed following mounting behavior, supporting our hypothesis. This implies that desaturase, a gene previously suggested to be related to CHC biosynthesis, is a candidate gene for the mounting behavior mechanism. Genomic analysis of P. lamellidens identified multiple desaturase genes, expressed in all three stages. These desaturase genes are widely conserved in the subfamily Formicinae but show lower similarity in other subfamilies. In this poster, we discuss the molecular mechanism underlying the mounting behavior of P. lamellidens.

Session B-410: High-throughput immunophenotyping of a large knockout mouse library offers a systems-level insight into genetic control of immune homeostasis
COSI: Non-COSI
  • Anna Lorenc, Kings College London Immunology Department, United Kingdom
  • Albina Rahim, BC Cancer Research Centre, Canada
  • Justin Meskas, BC Cancer Research Centre, Canada
  • Sibyl Drissler, BC Cancer Research Centre, Canada
  • Alice Yue, BC Cancer Research Centre, Canada
  • Lucie Abeler-Doerner, Kings College London Immunology Department, United Kingdom
  • Adam Laing, Kings College London Immunology Department, United Kingdom
  • Ryan Brinkman, BC Cancer Research Centre, Canada
  • Adrian Hayday, Kings College London Immunology Department, United Kingdom

Short Abstract: The Infection Immunity and Immunophenotyping (3i) consortium has performed a high-throughput immune phenotyping analysis of ~400 knockout (KO) mouse lines and matched wild-type controls generated by the Wellcome Trust Sanger Institute within the International Mouse Phenotyping Consortium. The screen comprised high-content (up to 14 markers) flow cytometric analysis of multiple immune tissues at steady state and several infection challenges, with the aim of identifying genes that influence the cellular composition and function of the immune system in health and disease. We: (1) developed an automated gating strategy for flow cytometry data from thousands of animals analysed over 2.5 years; (2) created a computational pipeline to identify significant phenotypes in KO mouse strains; (3) jointly analyzed multiple immune and non-immune parameters to discover relationships between these parameters, their links to knocked-out genes, and the function of the immune system under challenge; (4) related the findings in mice to the human immune system and health by cross-referencing with pre-existing knowledge; and (5) made the data publicly available. Our analyses confirmed the known immune-system involvement of many genes and discovered several unknown players and unexpected relationships.

Session B-411: Evolutionary analysis of Rift Valley fever virus
COSI: Non-COSI
  • Mikyung Je, Seoul National University, South Korea
  • Ji-Hae Lee, Seoul National University, South Korea
  • Myeongji Cho, Seoul National University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea

Short Abstract: Increasing numbers of viruses, such as severe fever with thrombocytopenia syndrome virus (SFTSV) and heartland virus (HRTV), have been newly identified as members of the Phlebovirus genus. According to the International Committee on Taxonomy of Viruses, the Phlebovirus genus currently contains 70 viruses, including viruses that can cause severe disease in humans. Rift Valley fever virus (RVFV), the most widely known virus in the Phlebovirus genus, has been confirmed to have recently spread to Europe, the USA, and Asia, beyond its traditional endemic region since it was first reported. The emergence of RVFV in new areas can cause serious public health problems. In this study, bioinformatics analysis was performed to investigate the relationship between the expansion of RVFV infection areas and viral evolutionary variation. We downloaded the sequence data of four CDS regions within the large, medium, and small segments from the GenBank database of the National Center for Biotechnology Information to perform phylogenetic and codon usage analyses. The results confirmed the presence of a time-dependent codon usage pattern in the medium (M) segment of RVFV, and RSCU analysis of the CDS regions of the Gn and Gc glycoproteins showed that codon usage differs for certain amino acids. These features may be critical factors in the expansion of the host range and infection region, or may induce changes in the virulence of RVFV. Therefore, further studies to predict future evolutionary patterns based on the results of this study are required.
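
RSCU (relative synonymous codon usage), used here, is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally; a minimal sketch over one synonymous family follows (shown for leucine, chosen arbitrarily for illustration).

    # Minimal RSCU sketch: observed codon count divided by the mean count of
    # its synonymous family (leucine family shown, for illustration only).
    from collections import Counter

    LEU = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

    def rscu(cds, family=LEU):
        codons = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
        total = sum(codons[c] for c in family)
        if total == 0:
            return {c: 0.0 for c in family}
        expected = total / len(family)   # equal usage within the family
        return {c: codons[c] / expected for c in family}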

Session B-412: Simple scoring method for predicting oxidative stress and inflammation status in 3D organotypic cultures of human bronchial cells exposed to cigarette smoke
COSI: Non-COSI
  • Kazushi Matsumura, JAPAN TOBACCO INC., Japan
  • Shinkichi Ishikawa, JAPAN TOBACCO INC., Japan
  • Shigeaki Ito, JAPAN TOBACCO INC., Japan

Short Abstract: Cigarette smoke (CS) is a known risk factor for several airway diseases, including chronic obstructive pulmonary disease, which are believed to be initiated by increased oxidative stress followed by chronic inflammation. These airway diseases are triggered by cellular responses evoked in the airway epithelium by inhaled CS. Therefore, evaluation of such intracellular perturbations, especially oxidative stress and the inflammatory response, should facilitate prediction of the risk of airway disease onset. As the first step towards development of a risk assessment model, we constructed a simple scoring method to predict oxidative stress and inflammation status. First, we exposed MucilAir, a 3D organotypic culture of human bronchial cells, to 6 inducers with different mechanisms of action related to cellular stress (including oxidative stress and the inflammatory response), using various doses and exposure times. After microarray-based hierarchical clustering and canonical pathway analysis of the data, we identified oxidative stress and inflammation inducers. We then identified the genes commonly differentially expressed at each timepoint as early or late gene sets regulating the oxidative stress and inflammatory responses. Next, the transcriptomics data from CS-exposed MucilAir were analyzed using our scoring method, based on the log2 fold changes of the differential measurements included in our gene sets. We found that the resulting scores accurately captured the dynamic changes and inducer dose-responses of oxidative stress and inflammation status. This supports the potential of our scoring method as the first step towards a quantitative risk assessment based on an adverse outcome pathway.
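
A score of this form, aggregating log2 fold changes over a response gene set, can be sketched simply; the mean-based aggregation below is an assumed choice made for illustration, not necessarily the authors' exact formula, and the gene-set name is hypothetical.

    # Sketch of a gene-set stress score from log2 fold changes (aggregation
    # by the signed mean is an assumed choice; illustrative only).
    def stress_score(log2fc, gene_set):
        """log2fc: {gene: log2 fold change vs. control}; gene_set: iterable."""
        values = [log2fc[g] for g in gene_set if g in log2fc]
        return sum(values) / len(values) if values else 0.0

    # oxidative = stress_score(log2fc, EARLY_OXSTRESS_GENES)  # hypothetical set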

Session B-413: Novel methodologies for gene family silencing using the CRISPR-Cas9 system
COSI: Non-COSI
  • Gal Hyams, Tel Aviv University, Israel
  • Itay Mayrose, Tel Aviv University, Israel
  • Eran Halperin, UCLA, United States

Short Abstract: The CRISPR-Cas9 system is a bacterial immune system that has recently been adopted as a genome-editing technique for eukaryotic genomes. The system is directed to a genomic site by a programmed single-guide RNA (sgRNA) that base-pairs with the DNA target, subsequently leading to a site-specific double-strand break. The binding affinity of the CRISPR-Cas9 system does not require perfect matching between the sgRNA and the DNA target. Thus, in addition to cleaving the desired "on-target" site, cleavage may occur at multiple unintended genomic sites (termed off-targets) that are similar, up to a certain degree, to the on-target. Due to an extensive history of local and large-scale genome duplications, many eukaryotic genomes harbor large gene families with partially overlapping functions. This redundancy often results in a buffering effect: most single null mutants present no or minimal phenotypic consequence due to the overlapping function of one or more paralogs. Therefore, in many cases, silencing multiple members of a gene family is necessary to uncover any phenotypic effects. Here, we introduce graph-based algorithms for the optimal design of potential sgRNAs. The developed algorithms harness the low specificity of the CRISPR-Cas9 system to target multiple members of a given gene family. In silico examination of all gene families in the Solanum lycopersicum genome shows that our approach outperforms simpler alignment-based techniques. The utility of the developed algorithm is further demonstrated in vivo by successfully silencing an entire family of gibberellin transporters in S. lycopersicum, consisting of seven members, using a single sgRNA. This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University and by the Manna Center for food safety and security at Tel-Aviv University.
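
The core idea, one protospacer that lies within a few mismatches of a site in every family member, can be sketched with a brute-force scan. The poster's graph-based algorithms are far more efficient; the snippet below also ignores PAM requirements and uses assumed parameters (20-nt guides, 3 mismatches).

    # Brute-force sketch: find a 20-nt protospacer within `max_mm` mismatches
    # of some site in every gene of a family (illustrative; PAM constraints
    # are ignored and the real method is graph-based and far more efficient).
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def covers_family(guide, genes, max_mm=3):
        k = len(guide)
        return all(any(hamming(guide, g[i:i + k]) <= max_mm
                       for i in range(len(g) - k + 1)) for g in genes)

    def candidate_guides(genes, k=20, max_mm=3):
        # seed guides from exact sites in the first family member
        seeds = {genes[0][i:i + k] for i in range(len(genes[0]) - k + 1)}
        return [s for s in seeds if covers_family(s, genes, max_mm)]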

Session B-414: Epigenetic regulation model in AHNAK Deficient Adipose Differentiation
COSI: Non-COSI
  • Young Seek Lee, Hanyang University, South Korea
  • Soo Young Cho, National Cancer Center, South Korea
  • Jong Kyu Woo, Seoul National University, South Korea
  • Soojun Park, ETRI, South Korea

Short Abstract: An Ahnak-deficient model was used to investigate the function of AHNAK in adipocyte differentiation and obesity. Several reports have proposed molecular mechanisms and related pathways in Ahnak-deficient adipocyte differentiation. However, the functional role of Ahnak in adipocyte differentiation is unclear. In particular, the epigenomic mechanisms of the Ahnak-deficient model have not been studied in adipocyte differentiation. To understand the epigenetic mechanism, we constructed genome-wide DNA methylation profiles with methyl-CpG-binding domain sequencing. We found numerous adipocyte-dependent and Ahnak-associated methylated regions. Intriguingly, the dynamics of the methylation changes were associated with Ahnak deficiency and differentiation state. Differentiation of Ahnak-deficient cells involved genome-wide epigenetic changes and maintenance of the differentiation stage through temporal methylation and demethylation.

Session B-415: NEPHROSEQ: A WEB-BASED GENE EXPRESSION DATABASE AND ANALYSIS PLATFORM
COSI: Non-COSI
  • Heather Ascani, University of Michigan - Ann Arbor, United States
  • Becky Steck, University of Michigan - Ann Arbor, United States
  • Rebecca Reamy, University of Michigan - Ann Arbor, United States
  • Zach Wright, University of Michigan - Ann Arbor, United States
  • Viji Nair, University of Michigan - Ann Arbor, United States
  • Felix Eichinger, University of Michigan - Ann Arbor, United States
  • Sean Eddy, University of Michigan - Ann Arbor, United States
  • Wenjun Ju, University of Michigan - Ann Arbor, United States
  • Matthias Kretzler, University of Michigan - Ann Arbor, United States

Short Abstract: Introduction: Transcriptomic data of human renal disease can provide critical information to identify de novo molecular pathways associated with kidney disease, guide focused analysis in model systems towards pathways with human relevance, and verify the relevance of presumed biomarkers. Considerable amounts of transcriptome data from kidney disease gene expression experiments are being generated and made available in public repositories. However, these datasets are still not analyzed by most researchers because of 1) a lack of data harmonization, especially with regard to gene annotation and associated clinical information, and 2) the absence of accessible tools that allow users to explore the available datasets. Methods: To facilitate access of the renal research community to human kidney disease gene expression data and the relevant associated clinical information, we have developed Nephroseq (www.nephroseq.org), a web-based data repository and analytical tool. We collect and standardize publicly available kidney genome-wide expression data before loading it into Nephroseq. To date, we have curated and analyzed 1,989 samples across 32 datasets. Results: This provides a platform for comparison of differential expression, co-expression, and outlier results for queried genes or gene lists across standardized datasets, and between model systems and human kidney disease. It also allows meta-analysis and gene list comparisons from the same disease entities, between clinical disease entities, or across diseases. Conclusions: Nephroseq empowers researchers without bioinformatics knowledge to employ sophisticated systems biology tools to extract key information from ongoing kidney disease research, validate model system findings in relevant human diseases, and generate hypotheses for further experimental investigations, contributing to knowledge transfer between bench and bedside and ultimately towards improving patient outcomes.

Session B-416: HTSanalyzeR2: an ultra fast R/Bioconductor package for high-throughput screens with interactive report
COSI: Non-COSI
  • Feng Gao, City University of Hong Kong & Cornell University, United States
  • Xiupei Mei, City University of Hong Kong, Hong Kong
  • Lina Zhu, City University of Hong Kong, Hong Kong
  • Yuchen Zhang, City University of Hong Kong, Hong Kong
  • Wei Wang, City University of Hong Kong, Hong Kong
  • Xin Wang, City University of Hong Kong, Hong Kong

Short Abstract: High-throughput screening (HTS) is one of the most promising tools in functional genomics, enabling scientists to study genome-wide perturbations. Our previous work, HTSanalyzeR, provides a set of pipelines for integrated functional analysis, including gene set enrichment and network analysis, and presents the results as rich-text HTML pages with plots. Since its first release in 2011, it has become popular software that is widely used by the community. Recent biotechnological developments such as CRISPR have enhanced the efficiency and applicability of HTS. In the meantime, new algorithms and technologies have greatly accelerated computation and enabled a more convenient and powerful way to present results as interactive reports. Here we present HTSanalyzeR2, which supports high-throughput CRISPR screens. We have also increased its compatibility to work with most of the common species used in biomedical experiments and updated the computational module, speeding up calculations by up to 1000X compared with the previous version. Beyond these functional upgrades, HTSanalyzeR2 now presents the results in an interactive web interface, which can easily be deployed to a Shiny server for further communication with biologists. HTSanalyzeR2 is an R package, available via GitHub at https://github.com/CityUHK-CompBio/HTSanalyzeR2

Session B-417: Candidate gene prioritization with Endeavour
COSI: Non-COSI
  • Amin Ardeshirdavani, KU Leuven - ESAT - STADIUS, Belgium
  • Léon-Charles Tranchevent, Cancer Research Centre of Lyon, France
  • Sarah Elshal, KU Leuven, Belgium
  • Daniel Alcaide, KU Leuven, Belgium
  • Jan Aerts, Leuven University, Belgium
  • Didier Auboeuf, INSERM, France
  • Yves Moreau, KU Leuven, Belgium

Short Abstract: Genomic studies and high-throughput experiments often produce large lists of candidate genes, among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for six species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/

Session B-418: The human gut virome at the extremes of life
COSI: Non-COSI
  • Feargal Ryan, APC Microbiome Institute, Ireland
  • Angela McCann, APC Microbiome Institute, Ireland
  • Stephen Stockdale, APC Microbiome Institute, Ireland
  • Marion Dalmasso, APC Microbiome Institute, Ireland
  • R. Paul Ross, University College Cork, Ireland
  • Colin Hill, APC Microbiome Institute, Ireland

Short Abstract: The role of the human microbiome in health has received a large amount of scientific attention in recent years, with a number of large cohorts analysing the bacterial component of the microbiome. However, the role of the virome in the context of the microbiome and human health is still poorly understood. Recent work has indicated that there is a healthy human virome and that it can become perturbed in disease. Here we sequence the viromes of faecal samples collected from 20 Irish elderly subjects (>67 years old) and 20 Irish infants (12 months after birth) to examine the gut virome at the extremes of life. Virome analysis is complicated by the lack of reference sequences in established databases, so a de novo assembly-based approach was applied here. We show that the gut virome is able to distinguish between infants and the elderly. Furthermore, it can distinguish between infants born by caesarean section and those born by spontaneous vaginal delivery 12 months after birth, and between elderly subjects living in the community and those in long-stay care. The infant viromes were marked by the presence of viruses highly similar to known Lactococcus phages; however, 16S rRNA gene analysis found virtually no Lactococcus sequences present in these data. The elderly were found to contain particularly high levels of two novel phages, one of which is proposed to infect Bacteroides vulgatus and the other Clostridium difficile, based on homology of a CRISPR-Cas spacer and a tRNA gene, respectively. CrAssphage was identified in both the elderly and infants, and was the most prevalent sequence identified throughout the dataset.

Session B-419: A DNA barcode archive and analysis system for medicinal herbs of traditional Korean medicine
COSI: Non-COSI
  • Sang-Jun Yea, KIOM, South Korea
  • Boseok Seong, KIOM, South Korea
  • Yunji Jang, KIOM, South Korea
  • Chul Kim, KIOM, South Korea

Short Abstract: Introduction: Although there are several kinds of DNA barcode systems for archiving, analyzing, and identifying species, the existing systems are not adequate because one or more plants can be identified as a single medicinal herb. Therefore, we implemented a web-based system for DNA barcode archiving and analysis, specifically designed for traditional Korean medicine, to support these conditions. Methodology: The proposed system, called SDBMH, was designed around three main modules: DNA barcode archives, DNA barcode analysis, and system management. In the DNA barcode archives, SDBMH fetches specimen information from an external herbarium information management system (HIMS). In DNA barcode analysis, we used NCBI's Basic Local Alignment Search Tool (BLAST) to align multiple FASTA sequences. The data-processing flow of SDBMH was designed in three steps: DNA barcode management; group management and analysis; and reporting and visualization. Results: SDBMH provides three main functions: barcode registration, barcode search and view, and species identification. In barcode registration, the user selects specimen information from HIMS and chooses a specific region and primer set for each DNA barcode. DNA barcodes are searchable by barcode ID, herbal name, species name, etc. Besides those search options, the user can designate a marker to filter the results. To identify a species, the user goes through three steps: apply BLAST, select a norm group, and choose results. SDBMH offers a graphical user interface for conflict information, phylogenetic trees, and identification reports. Conclusions: Our system will help in archiving and identifying the correct botanical origins of herbal medicines, supporting the standardization and quality control of Korean herbal medicine.

Session B-420: The Cancer Genome Collaboratory
COSI: Non-COSI
  • Christina K Yung, University of Chicago, United States
  • Michelle D Brazas, Ontario Institute for Cancer Research, Canada
  • George L Mihaiescu, Ontario Institute for Cancer Research, Canada
  • Bob Tiernay, Ontario Institute for Cancer Research, Canada
  • Junjun Zhang, Ontario Institute for Cancer Research, Canada
  • Francois Gerthoffert, Ontario Institute for Cancer Research, Canada
  • Andy Yang, Ontario Institute for Cancer Research, Canada
  • Jared Baker, Ontario Institute for Cancer Research, Canada
  • Guillaume Bourque, McGill University, Canada
  • Paul C Boutros, Ontario Institute for Cancer Research, Canada
  • Bartha M Knoppers, McGill University, Canada
  • B. F. Francis Ouellette, Genome Quebec, Canada
  • Cenk Sahinalp, Simon Fraser University, Canada
  • Sohrab Shah, BC Cancer Agency, Canada
  • Vincent Ferretti, Ontario Institute for Cancer Research, Canada
  • Lincoln D Stein, Ontario Institute for Cancer Research, Canada

Short Abstract: The Cancer Genome Collaboratory is an academic compute cloud designed to enable computational research on the world’s largest and most comprehensive cancer genome dataset, that of the International Cancer Genome Consortium (ICGC). The ICGC is on target to categorize the genomes of 25,000 tumors by 2018. A subproject of ICGC, the PanCancer Analysis of Whole Genomes (PCAWG), alone has generated over 800TB of harmonized sequence alignments, variants and interpreted data from over 2,800 cancer patients. A dataset of this size requires months to download and significant resources to store and process. By making the ICGC data available in cloud compute form in the Collaboratory, researchers can bring their analysis methods to the cloud, benefiting from the high availability, scalability and economy offered by cloud services, avoiding a large investment in static compute resources and essentially eliminating the time needed to download the data. To facilitate computational analysis of the ICGC data, the Collaboratory has developed software solutions that are optimized for typical cancer genomics workloads, including well-tested and accurate genome aligners and somatic variant calling pipelines. We have developed a simple-to-use, fast, and secure data transfer tool that imports genomic data from cloud object storage into the user’s compute instances. Because a growing number of cancer datasets have restrictions on their storage locations, it is important to have software solutions that are interoperable across multiple cloud environments. We have successfully demonstrated interoperability across The Cancer Genome Atlas (TCGA) dataset hosted at the University of Chicago’s Bionimbus Protected Data Cloud, the ICGC dataset hosted at the Collaboratory, and ICGC datasets stored in Amazon Web Services (AWS) S3 storage. Lastly, we have developed a non-intrusive user authorization system that allows the Collaboratory to authenticate against the ICGC Data Access Compliance Office (DACO) when researchers require access to controlled-tier data. We anticipate that our software solutions will be implemented on additional commercial and academic clouds. The Collaboratory is actively growing, with a target hardware infrastructure of over 3000 CPU cores and 15 petabytes of raw storage. As of November 2016, the Collaboratory holds information on 2,000 ICGC PCAWG donors (500TB total). We anticipate expanding the Collaboratory to host the entire ICGC dataset of 25,000 donors (approximately 5PB) and to extend its data management and analysis facilities across multiple clouds. The Collaboratory has been successfully utilized by multiple research groups, most notably PCAWG project researchers, who analyzed thousands of genomes at scale over a few weeks’ time. The Collaboratory is now open to the public, and we invite cancer researchers to learn more about our cloud resources at cancercollaboratory.org and apply for access to the Collaboratory.

Session B-421: SonicParanoid: extremely fast, easy and accurate orthology inference
COSI: Non-COSI
  • Salvatore Cosentino, Graduate School of Science, The University of Tokyo, Japan
  • Wataru Iwasaki, Graduate School of Science, The University of Tokyo, Japan

Short Abstract: Thanks to recent advancements in DNA sequencing technologies, the number of species for which genome sequences are available is growing at an accelerated pace. Accurate inference of orthologous genes encoded on multiple genomes is the key to various analyses based on those datasets: elucidation of species-specific and/or evolutionarily shared genomic signatures at the sequence, structural, and/or functional levels (comparative genomics), transfer of genetic knowledge between genomes of model and non-model organisms, reconstruction of phylogenetic trees using genomic information (phylogenomics), and development of reference genome databases all depend on reliable orthology inference. Despite its general importance, it is still time-consuming and difficult to conduct orthology inference on a specific set of genomes, which often includes in-house sequenced genomes. Here, we present SonicParanoid, which is orders of magnitude faster than, but comparably accurate to, existing tools, with a balanced precision-recall trade-off.

Session B-423: Using heterogenous components to build a scalable digital pathology environment
COSI: Non-COSI
  • Yves Sucaet, Vrije Universiteit Brussel, Belgium
  • Silke Smeets, Vrije Universiteit Brussel, Belgium
  • Sandrina Martens, Vrije Universiteit Brussel, Belgium
  • Wim Waelput, UZ Brussel, Belgium
  • Peter In'T Veld, Vrije Universiteit Brussel, Belgium

Short Abstract: At Brussels Free University (VUB), we wanted to build a core digital pathology infrastructure to support a range of different use cases. Various imaging platforms needed to be accessible through a single access point, while still supporting different user profiles. We wanted a scalable solution that would allow interaction between equipment from different research groups, both intramuros and extramuros. A combination of commercial hardware, commercial software, and open-source software was used to accomplish this, with custom code to connect interfaces where needed. We built a centralized infrastructure that integrates a variety of imaging platforms (brightfield, fluorescence, z-stacking), and we now have an interconnected network of heterogeneous and scalable information silos. Image analysis and data/image mining projects can remain stuck in micro-environments due to limits artificially imposed by vendor-specific solutions. We have shown this need not be the case, and have integrated five different imaging platforms onto one architecture. We store data from all modalities in a single storage facility and manage it through a single access point. We support 40+ users, serve 5000+ whole slide images monthly, and facilitate different use cases, including education, biobanking, and telepathology.

Session B-424: Maximum Entropy Methods for Extracting the Learned Features of Deep Neural Networks
COSI: Non-COSI
  • Alex Finnegan, University of Illinois, Urbana-Champaign, United States
  • Jun Song, University of Illinois, Urbana-Champaign, United States

Short Abstract: Motivation: New architectures of multilayer artificial neural networks and new methods for training them are rapidly revolutionizing the application of machine learning in diverse fields, including business, social science, physical sciences, and biology. Interpreting deep neural networks, however, currently remains elusive, and a critical challenge lies in understanding which meaningful features a network is actually learning. Results: We present a general method for interpreting deep neural networks and extracting network-learned features from input data. We describe our algorithm in the context of biological sequence analysis. Our approach, based on ideas from statistical physics, samples from the maximum entropy distribution over possible sequences, anchored at an input sequence and subject to constraints implied by the empirical function learned by a network. Using our framework, we demonstrate that local transcription factor binding motifs can be identified from a network trained on ChIP-seq data and that nucleosome positioning signals are indeed learned by a network trained on chemical cleavage nucleosome maps. Imposing a further constraint on the maximum entropy distribution, similar to the grand canonical ensemble in statistical physics, also allows us to probe whether a network is learning global sequence features, such as the high GC content in nucleosome-rich regions. This work thus provides valuable mathematical tools for interpreting and extracting learned features from feed-forward neural networks.
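To make the sampling idea concrete, the following is a hedged Metropolis sketch of drawing from a maximum entropy distribution anchored at an input sequence under a constraint on a learned function. Everything here is an assumption for illustration only: `net_score` is a stand-in for the trained network's output, `lam` plays the role of a constraint multiplier, and the energy form is not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = np.array(list("ACGT"))

def net_score(seq):
    # Placeholder for the trained network's scalar output on a sequence
    # (hypothetical scoring function used only for illustration).
    return sum(1.0 for a, b in zip(seq, "ACGTACGTACGT") if a == b)

def maxent_sample(anchor, lam=2.0, n_steps=5000):
    """Metropolis sampling from a maximum entropy distribution anchored
    at `anchor`: P(s) ~ exp(-lam * |f(s) - f(anchor)|)."""
    target = net_score(anchor)
    seq = list(anchor)
    energy = abs(net_score(seq) - target)  # zero at the anchor itself
    samples = []
    for _ in range(n_steps):
        pos = rng.integers(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(BASES)          # propose a single-base mutation
        new_energy = abs(net_score(seq) - target)
        if rng.random() < np.exp(lam * (energy - new_energy)):
            energy = new_energy               # accept the move
        else:
            seq[pos] = old                    # reject: revert the mutation
        samples.append("".join(seq))
    return samples

samples = maxent_sample("ACGTACGTACGT")
```

Positions that are frequently mutated in the resulting samples are unconstrained by the learned function, while conserved positions mark network-learned features such as binding motifs.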

Session B-425: Comparative genomic, evolutionary and functional analysis of caleosins: a family of multifunctional plant and fungal proteins
COSI: Non-COSI
  • Farzana Rahman, University of South Wales, United Kingdom
  • Mehedi Hassan, University of South Wales, United Kingdom
  • Rozana Rosli, University of South Wales, United Kingdom
  • Abdulsamie Hanano, Atomic Energy Commission of Syria, Syria
  • Denis Murphy, University of South Wales, United Kingdom

Short Abstract: Caleosins (CLO) belong to a family of multifunctional calcium- and lipid-binding proteins with peroxygenase and other signalling activities. This gene family is found almost ubiquitously in two distinct eukaryotic clades: Viridiplantae and Fungi. This evolutionary pattern of CLO gene occurrence is not consistent with descent from a common ancestor, suggesting that caleosins may have originated in one of the current clades via horizontal gene transfer from the other. We studied CLO gene and protein sequences across a range of plant and fungal species to understand the structure and function of these proteins in detail. We characterised CLO occurrence and function in date palm, oil palm and banana with respect to tissue expression, subcellular localisation and oxylipin pathway substrate specificities in developing seedlings. Here we report the variation across a comprehensive range of plant and fungal species. Protein structure predictions suggest that the calcium-binding and EF hand domains are widely conserved across species. While the biological functions of the studied proteins have yet to be determined in detail, it is clear that these proteins have several subcellular locations and participate in a range of physiological processes in both plants and fungi, including acting as peroxygenases. One crucial role appears to be in responses to a range of biotic and abiotic stresses, including plant-fungal interactions. In this presentation, we describe additional studies that have been carried out to shed light on the origin and functions of this intriguing group of proteins.

Session B-426: Factor Extraction from transcriptome-wide expression data as a method for robust prognosis of prostate carcinoma
COSI: Non-COSI
  • Dominik Otto, Fraunhofer IZI, Germany
  • Susanne Füssel, Universitätsklinikum Carl Gustav Carus Dresden, Germany
  • Manfred Wirth, Universitätsklinikum Carl Gustav Carus Dresden, Germany
  • Friedemann Horn, Fraunhofer Institute for Cell Therapy and Immunology, Germany
  • Kristin Reiche, Fraunhofer Institute for Cell Therapy and Immunology, Germany

Short Abstract: Prostate cancer is the most prevalent cancer among men in the US, and patients often face unnecessary surgeries because biomarkers and their associated classification models frequently have poor discrimination accuracy. One layer of gene regulation that has been undervalued for years but is now emerging as associated with prostate cancer is lncRNA-mediated gene regulation: individual long non-protein-coding RNAs (lncRNAs) have been linked to cancer-specific death [Prensner et al. 2013 Nature Genetics]. Assessing expression variation of protein-coding and non-protein-coding RNAs with high-throughput methods is a promising approach, but it is accompanied by the problem of high dimensionality: the number of differentially regulated genes exceeds the number of available samples by several orders of magnitude, and reliable models can only be derived from a reasonably small parameter set. Furthermore, a prostate cancer tissue sample is very diverse and often contains heterogeneous cell mixtures. The challenge is to separate the relevant information in the derived transcriptome-wide expression data from expression variation unrelated to disease outcome. We developed a computational method that enables us to eliminate inter-individual differences in gene expression that are unrelated to cancer-specific death and to the tumor cell content of a sample. In addition, our method provides dimensionality reduction by finding associations between individual genes. We applied our approach to a set of protein-coding and non-protein-coding genes in 139 prostate cancer samples and were able to associate selected lncRNAs with cancer-specific death.

Session B-427: Optimization of a cancer panel library preparation platform for FFPE samples of all solid tumor types
COSI: Non-COSI
  • Min-Kyeong Gwon, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Hyun-Jung Lee, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Ye-Lim Hong, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Hyun-Seob Lee, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea

Short Abstract: Formalin-fixed, paraffin-embedded (FFPE) samples are used to conduct large-scale studies across a variety of tumor phenotypes without incurring the significant expense of recruiting patients to build new cohorts. However, degraded FFPE-derived nucleic acids, a result of long-term storage and formalin fixation, often hinder NGS research because of a high failure rate in analysis. Here, we present a platform that remarkably improves the success rate of cancer panel analysis through validation of FFPE samples and libraries, with a quality-control process set up in our laboratory, using FFPE samples collected over the last 10 years for various cancer types. Genomic DNA was extracted using six types of FFPE extraction kits to determine the most suitable extraction method for FFPE samples. We measured the quality of the extracted DNA using a spectrophotometer, two types of fluorometers, and qPCR analysis, and compared these results with the quantified yield of the final libraries. Furthermore, we adjusted the amount of initial gDNA input for Covaris shearing (200~1000 ng) according to the ddCt result from the Agilent NGS FFPE QC kit, instead of changing the number of pre-PCR cycles as instructed by the manufacturer. On this platform, we successfully built and analyzed 487 libraries from FFPE samples of 10 solid tumor types with a success rate of 88%. The established validation platform is easy to use and applicable to all libraries regardless of cancer type.

Session B-428: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature
COSI: Non-COSI
  • Sunwon Lee, Korea University, South Korea
  • Donghyeon Kim, Korea University, South Korea
  • Sunkyu Kim, Korea University, South Korea
  • Kyubum Lee, Korea University, South Korea
  • Jaehoon Choi, Korea University, South Korea
  • Seongsoon Kim, Korea University, South Korea
  • Minji Jeon, Korea University, South Korea
  • Sangrak Lim, Korea University, South Korea
  • Donghee Choi, Korea University, South Korea
  • Aik-Choon Tan, Division of Medical Oncology, University of Colorado Anschutz Medical Campus, United States
  • Jaewoo Kang, Korea University, South Korea

Short Abstract: As the volume of publications rapidly increases, searching for relevant information from the literature becomes more challenging. To complement standard search engines such as PubMed, it is useful to have an advanced search tool that directly returns relevant biomedical entities such as targets, drugs, and mutations rather than a long list of articles. Some existing tools submit a query to PubMed and process retrieved abstracts to extract information at query time, resulting in a slow response time and limited coverage of only a fraction of the PubMed corpus. Other tools preprocess the PubMed corpus to speed up the response time; however, they are not constantly updated, and thus produce outdated results. Further, most existing tools cannot process sophisticated queries such as searches for mutations that co-occur with query terms in the literature. To address these problems, we introduce BEST, a biomedical entity search tool. BEST returns, as a result, a list of 10 different types of biomedical entities including genes, diseases, drugs, targets, transcription factors, miRNAs, and mutations that are relevant to a user’s query. For example, BEST returns imatinib, dasatinib, and nilotinib for a query on drugs for chronic myeloid leukemia. To the best of our knowledge, BEST is the only system that processes free text queries and returns up-to-date results in real time including mutation information. BEST is freely accessible at http://best.korea.ac.kr.

Session B-429: EIS-DB: an Exon-Intron Structure Database
COSI: Non-COSI
  • Irina Poverennaya, Vavilov Institute of General Genetics RAS, Russia
  • Denis Gorev, Moscow Institute of Physics and Technology, Russia
  • Mikhail Roytberg, Moscow Institute of Physics and Technology; Institute of Mathematical Problems of Biology RAS, Russia

Short Abstract: We present a new exon-intron structure database (EIS-DB) containing comprehensive data on well-annotated genes in more than 100 eukaryotic genomes from different taxonomic groups (vertebrates, invertebrates, plants, and fungi). It allows extracting data related to a specific gene or isoform of a particular organism, or obtaining statistical data related to a given set of genes and/or organisms. Although similar databases exist, they are mostly out of date, or taxon- or isoform-specific. EIS-DB is a relational database managed by PostgreSQL. Structurally, it contains 15 tables; the main ones are ‘Organisms’, ‘Genes’, ‘Isoforms’, ‘Orthologous groups’, ‘Exons’ and ‘Introns’, while the others contain auxiliary data, e.g., taxonomy. EIS-DB also contains FASTA files with the related sequences. The detailed intron section makes EIS-DB especially appealing for studying various intron features in different organisms. As the main source of gene sequences and annotations, 112 RefSeq genome assemblies (current to March 2017) were used, along with additional input data on gene orthology obtained from NCBI. To ascertain orthology between exons and introns, we have developed a special tool. It first builds a multiple protein alignment using a modification of the MUSCLE program that takes into account data on exon borders; we then realign alignment regions where exon borders are not well aligned. The orthologous groups of exons and introns are determined based on the refined alignment. A preliminary version of the EIS-DB web interface is available at http://212.47.226.240:3000/; the database can also be downloaded to the user’s PC for more advanced queries.

Session B-430: Harnessing Deep Learning for High-Content Imaging Screens
COSI: Non-COSI
  • Jan Robin Winter, Bayer AG, Germany
  • Stefan Prechtl, Bayer AG, Germany
  • Andreas Steffen, Bayer AG, Germany
  • Djork-Arné Clevert, Bayer AG, Germany

Short Abstract: High-Throughput Image Analysis (HT-IMA) is a well-established approach for phenotypic screening applications in pharmaceutical research projects. Until now, standard image analysis has typically relied on a small set of physiological features (70% of all published screens rely on fewer than 3 extracted features) in the tested biological systems. A prominent weak point of this approach is that the capability of HT-IMA is not used to its full capacity, as the vast majority of recorded parameters is not analyzed and remains statistically unnoticed. Recent advances in Deep Learning have made significant contributions to image analysis and drug discovery. In this work, we harness Convolutional Neural Networks (CNNs) for high-content screening-based phenotype classification. CNNs have achieved remarkable empirical success in both industry and academia and are now the state-of-the-art methods for classification and segmentation of images. Model selection and evaluation were performed on public benchmark data. In particular, we trained a CNN on single-cell images cropped from the entire field image; the classification task was to determine the correct phenotype class given the cropped images. The performance of our CNN model was validated in a five-fold cross-validation scheme and compared to a Random Forest classifier. In terms of classification accuracy, the CNN clearly outperforms its competitors.
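A minimal PyTorch sketch of a CNN phenotype classifier over cropped single-cell images, in the spirit of the setup described; the 64x64 crop size, three channels and eight phenotype classes are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PhenotypeCNN(nn.Module):
    """Small CNN for classifying cropped single-cell images.
    Input size 64x64 with 3 channels and 8 classes are assumptions."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 8 * 8, n_classes)

    def forward(self, x):
        x = self.features(x)                 # (batch, 128, 8, 8)
        return self.classifier(x.flatten(1)) # (batch, n_classes)

model = PhenotypeCNN()
logits = model(torch.randn(4, 3, 64, 64))   # batch of 4 dummy crops
```

Such a model would be trained per cross-validation fold and compared against a Random Forest on the same splits, as the abstract describes.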

Session B-431: Improving predictions of pan-allele peptide-MHC binding affinities using deep learning
COSI: Non-COSI
  • Rudolph Layko, National Research University - Higher Schools of Economics, Russia
  • Vadim Nazarov, National Research University - Higher Schools of Economics, Russia

Short Abstract: We propose a novel preprocessing approach for amino acid sequences using distributed representations, together with a new neural network architecture, for one of the most essential tasks in immunoinformatics: prediction of binding affinities between MHC molecules and their peptide ligands. A thorough review of existing solutions allowed us to overcome their constraints and limitations. We tested our model on human peptides and obtained results competitive with existing software tools: an F1 score of 0.72 and AUC of 0.84 for predicting binding affinities of peptides to unseen MHC sequences, and an F1 score of 0.82 and AUC of 0.9 on the IEDB testing dataset. The obtained results suggest a low generalization error and the overall applicability of the proposed model to predicting binding affinities for peptide-MHC complexes with both known and unseen MHC sequences.

Session B-432: Defining a Core Genome for the Herpesvirales and Elucidating their Evolutionary Relationship with the Caudovirales
COSI: Non-COSI
  • Juan Sebastián Andrade Martínez, Universidad de los Andes, Colombia
  • Alejandro Reyes Muñoz, Universidad de los Andes, Colombia

Short Abstract: The order Herpesvirales encompasses a great variety of important and widely distributed human pathogens, including the Varicella-Zoster virus, Human Cytomegalovirus and Epstein-Barr virus. During the last decades, similarities in the viral cycle and in the structure of some of their proteins with those of the tailed phages have prompted speculation regarding an evolutionary relationship between the two clades. To evaluate this hypothesis, we used over 700 Herpesvirales and 2000 Caudovirales genomes downloaded from the NCBI genome and nucleotide databases, which were first de-replicated at both the nucleotide and amino acid levels. They were then screened for the presence or absence of clusters of orthologous viral proteins, and a dendrogram was constructed based on their compositional similarities. The results strongly suggest that the Herpesvirales are indeed the closest viral order with eukaryotic hosts to the Caudovirales, and they allow us to put forth hypotheses concerning the specific details of this relationship (i.e., whether they are sister clades or one stems from a minor clade within the other). Moreover, the identification of clusters that were abundant amongst the Herpesvirales made it possible to propose a Core Genome for the entire order, composed of 5 proteins, including the ATPase subunit of the DNA-packaging terminase, the only one with previously verified conservation in this clade. Overall, this work provides important results supporting the long-held hypothesis that the two orders are evolutionarily related and contributes to the understanding of the evolutionary history of the Herpesvirales themselves.

Session B-433: Deep learning model of clonal selection for T-cell receptor sequences
COSI: Non-COSI
  • Sofia Tolstoukhova, National Research University - Higher School of Economics, Moscow, Russia, Russia
  • Evgenii Ofitserov, Tula State University, Tula, Russia, Russia
  • Vadim Nazarov, National Research University - Higher School of Economics, Moscow, Russia, Russia

Short Abstract: The immune system protects the human organism from the different pathogens invading the body, and T cells are its basic weapon against viral infections. Each T cell carries an amino acid molecule on its surface called the T-cell receptor (TCR), which is able to bind a particular kind of pathogen-derived peptide. Cells with the same TCR form a clonotype. Originally, each clonotype is represented by very few cells; after detecting an infected cell, a T cell starts to proliferate and the number of cells in the clonotype grows. This process is called clonal selection. Observing the dynamics of TCR numbers in the peripheral blood is a significant task for both fundamental immunology and medicine. However, no methods have been developed to predict the quantities of specific TCR sequences and model the general clonal selection process. In this work we build a deep learning model capable of predicting T-cell quantity from the receptor sequence. For this purpose we implemented variations of the following architectures: a Variational Autoencoder combined with a recurrent neural network using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells. We tested them on 6 repertoires from 3 individuals with 2 replicas per individual, using parts of the replicas as cross-validation and testing datasets for the corresponding individuals, and explored a wide spectrum of hyperparameters for each model. Our final model was based on the GRU, with a weighted RMSE of 1.81. To demonstrate the applicability of the model, we implemented a method for comparative analysis of clonal selection in different repertoires and tested it on repertoires of monozygotic twins.
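A minimal PyTorch sketch of a GRU-based regressor from TCR amino acid sequence to clonotype quantity, in the spirit of (but not taken from) the described model; the tokenization, layer sizes and example CDR3 sequence are illustrative assumptions.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
aa_index = {a: i for i, a in enumerate(AA)}

class ClonotypeGRU(nn.Module):
    """GRU regressor mapping a TCR sequence to a (log) clonotype
    quantity; embedding and hidden sizes are assumptions."""
    def __init__(self, embed_dim=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(AA), embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens):           # tokens: (batch, seq_len) int64
        _, h = self.gru(self.embed(tokens))
        return self.head(h[-1]).squeeze(-1)  # one value per sequence

seq = "CASSLGQAYEQYF"                    # an example CDR3-like sequence
tokens = torch.tensor([[aa_index[a] for a in seq]])
model = ClonotypeGRU()
pred = model(tokens)                     # predicted (log) quantity
```

Training such a model with a weighted squared-error loss would correspond to the weighted RMSE criterion the abstract reports.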

Session B-434: Comparative analysis of bootstrap and jackknife methods in the programs of phylogenetic reconstruction
COSI: Non-COSI
  • Dmitry Penzar, MSU FBB, Russia
  • Sergei Spirin, MSU FBB, Russia

Short Abstract: The phylogenetic reconstruction program PQ was developed previously. It reconstructs a phylogenetic tree based on a new criterion of congruence with a multiple alignment. PQ showed strong results on the majority of datasets compared to the most popular reconstruction programs, FastME and RAxML. When working with multiple alignments containing a small number of sequences, PQ frequently generates a tree closer to the real one than the other programs do; however, PQ tends to be less accurate on multiple alignments with more than 30 sequences. Some modifications, such as a new way of finding suboptimal trees for multiple alignments with many sequences, were implemented in the program to improve the results. Moreover, two resampling methods (bootstrap and jackknife) were added to the PQ package to make the evaluation of branches possible. Bootstrap is widely used in bioinformatics programs; it replaces some columns of the alignment with other columns from the same alignment, whereas jackknife simply removes some columns from the alignment. Despite being mathematically equivalent to bootstrap and requiring fewer computational resources, jackknife is rarely used in tree reconstruction. The efficacy of these methods incorporated into the programs (PQ, FastME, RAxML, etc.) was investigated using different datasets with different numbers of replicas. The results are: 1) jackknife shows the same accuracy in predicting the reliability of a branch as bootstrap, given the same number of replicas; 2) the dependency between the support value (generated by either bootstrap or jackknife) and a branch's probability of being correct is almost linear on our datasets; 3) this dependency remains linear with either 100 replicas or 20. The area under the ROC curve (and other metrics of prediction quality) does not change when raising the number of replicas from 20 to 100, which supports the hypothesis that 100 replicas are redundant for computing the reliability of tree branches.
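The two resampling schemes can be sketched in a few lines of NumPy; this is illustrative only (PQ's actual implementation is not shown in the abstract), and the 80% retention fraction for the jackknife is an assumed parameter.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_columns(aln):
    """Resample alignment columns with replacement: some columns are
    duplicated, others dropped, keeping the original length."""
    n = aln.shape[1]
    return aln[:, rng.integers(0, n, size=n)]

def jackknife_columns(aln, keep=0.8):
    """Delete a random fraction of columns (no replacement), so each
    replicate is shorter than the original alignment."""
    n = aln.shape[1]
    cols = np.sort(rng.choice(n, size=int(keep * n), replace=False))
    return aln[:, cols]

# Toy alignment: 4 sequences x 10 columns of characters.
aln = np.array([list("ACGTACGTAC"),
                list("ACGTACGAAC"),
                list("ACGAACGTTC"),
                list("ACGTTCGTAC")])
b = bootstrap_columns(aln)   # replicate for one bootstrap tree
j = jackknife_columns(aln)   # replicate for one jackknife tree
```

A tree is then reconstructed from each replicate, and the fraction of replicates containing a given branch is its support value.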

Session B-435: triMS5 - storing LC-IMS-MS data in HDF5
COSI: Non-COSI
  • Jennifer Leclaire, Institute for Computer Science, JGU Mainz, Germany
  • Stefan Tenzer, UMC of the Johannes-Gutenberg-University Mainz, Germany
  • Andreas Hildebrandt, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany

Short Abstract: Mass spectrometry-based proteomics is a key technology for the elucidation of many biological processes and, as such, is a dynamically evolving research field aiming at the complete identification and quantification of the proteins present in a complex sample. Ongoing developments, not only in instrumentation but also in the design of workflows, such as the integration of additional separation strategies like ion mobility separation (IMS) and data-independent acquisition (DIA) modes, lead to continuously growing complexity of raw data, which in turn poses a challenge to existing analysis tools as well as storage routines. Raw datasets are stored in vendor-specific formats, and the provided software packages are usually closed-source or restricted to the Microsoft Windows operating system. Here, we introduce triMS5, an open-source data format for liquid chromatography coupled mass spectrometry datasets enhanced with IMS (LC-IMS-MS), based on the hierarchical data format version 5 (HDF5). HDF5 is a well-established scientific data format that is highly optimized for flexible I/O. In particular, triMS5 benefits from HDF5's chunked representation: data can be partitioned and then processed individually, e.g., by natively supported compression filters. This chunked representation is combined with a compressed sparse row (CSR) data layout to ensure storage efficiency. Additionally, triMS5 provides rapid extraction of signal regions of interest in all three dimensions (m/z, retention and drift time) by using a multi-dimensional search tree, with language bindings to C/C++ and Python.
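A toy sketch of the CSR-in-HDF5 idea using h5py; the group and dataset names and the example matrix are invented for illustration and do not follow the actual triMS5 schema.

```python
import h5py
from scipy.sparse import random as sparse_random

# Hypothetical sparse intensity matrix for one (retention time x m/z)
# plane; real LC-IMS-MS layouts are richer, this only illustrates
# storing a CSR matrix as chunked, compressed HDF5 datasets.
plane = sparse_random(1000, 5000, density=0.01, format="csr",
                      random_state=0)

with h5py.File("trims5_sketch.h5", "w") as f:
    grp = f.create_group("ms_level_1")
    # The three CSR arrays go into chunked, gzip-compressed datasets,
    # so slices can be decompressed chunk by chunk on read.
    for name, arr in (("data", plane.data),
                      ("indices", plane.indices),
                      ("indptr", plane.indptr)):
        grp.create_dataset(name, data=arr, chunks=True, compression="gzip")
```

Chunking is what makes partial reads cheap: a query touching a few rows only decompresses the chunks that intersect those rows rather than the whole matrix.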

Session B-436: TrAMiS - A distributed large-scale molecular trajectory representation and analysis framework for Apache Spark
COSI: Non-COSI
  • Thomas Kemmer, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany
  • Christian Ortwein, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany
  • Marialore Sulpizi, Department of Physics, Johannes Gutenberg University Mainz, Germany
  • Andreas Hildebrandt, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany

Short Abstract: Molecular dynamics (MD) simulations are among the most important, versatile, and accurate tools for studying molecular behavior in a variety of different settings. Furthermore, they are the main provider of high-resolution molecular trajectory data, which are not only essential in drug design and protein docking, but also of particular importance in atomic-scale studies, such as the analysis of the dielectric properties of biological systems and media. MD simulation toolsets have been highly optimized over the past decades, and are now capable of handling almost arbitrary systems at multi-scale or fixed resolutions. At the same time, these tools utilize multi-core platforms and multi-node environments, e.g., large compute clusters, to allow extended simulation times and finer resolutions in the same overall runtime, resulting in an exponential growth of the generated data and causing I/O operations to become a major bottleneck of the subsequent analysis. Here, we present our software framework for representing and analyzing GROMACS-generated molecular trajectory and topology data, combining the Apache Avro data serialization system and the Apache Spark data processing engine. Our data representation layer is specially tailored to distributed file systems, reducing communication overhead while simultaneously increasing reliability and data redundancy in multi-node environments. The data can be used either with our own analysis tools, e.g., to compute radial distribution functions for user-selected partitions of the given system, or integrated into existing pipelines owing to the format's bindings to several other programming languages, including C(++), C#, Java, Python, and Perl.
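A minimal PySpark sketch of the kind of distributed per-frame analysis such a framework enables; the Avro record schema, file names and the centroid computation below are illustrative assumptions, not TrAMiS code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reading Avro requires the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.appName("trajectory-sketch").getOrCreate()

# Hypothetical per-atom records with columns: frame, atom_id, x, y, z.
traj = spark.read.format("avro").load("trajectory.avro")

# Example analysis distributed across the cluster: the geometric
# centre of the system in every frame.
centres = (traj.groupBy("frame")
               .agg(F.avg("x").alias("cx"),
                    F.avg("y").alias("cy"),
                    F.avg("z").alias("cz")))
centres.write.mode("overwrite").parquet("frame_centres.parquet")
```

Because the Avro records live on a distributed file system, each worker reads and aggregates only its local blocks, which is exactly the I/O-bottleneck relief the abstract describes.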

Session B-437: Ebola virus multi-alignment: analysis and visualization
COSI: Non-COSI
  • Paulina Hyży, University of Warsaw, Poland
  • Jakub Tyrek, University of Warsaw, Poland
  • Norbert Dojer, University of Warsaw, Poland

Short Abstract: A multiple sequence alignment is a rich source of knowledge. However, even a deep analysis is not enough to make this knowledge easily available to other researchers: to fully benefit from it, one has to visualize the research results in a way that is convenient for others to consume. As an example of this approach, we will present a multi-alignment of the Ebola virus. We extract the most relevant information, golden path predictions and the sequence structure, as a single model rendered by a web browser. Visual readability and computational efficiency are achieved through the use of a graph representation.

Session B-438: VHLdb: a curated community resource for Von Hippel Lindau syndrome
COSI: Non-COSI
  • Francesco Tabaro, University of Padua, Italy
  • Federica Quaglia, University of Padua, Italy
  • Giovanni Minervini, University of Padua, Italy
  • Damiano Piovesan, University of Padua, Italy
  • Silvio C. E. Tosatto, University of Padua, Italy

Short Abstract: Mutations in the von Hippel-Lindau tumor suppressor protein (pVHL) predispose carriers to develop tumors affecting specific organs, such as the retina, epididymis, adrenal glands, pancreas and kidneys. VHLdb (http://vhldb.bio.unipd.it/) is a publicly available resource collecting interaction and mutation data from different resources. Currently, it provides more than 400 pVHL-interacting proteins and more than 1,000 pVHL mutations, making VHLdb the largest available database for pVHL-related information. The set of pVHL-interacting proteins was generated by collecting and annotating data from different public databases; a quarter of the retrieved pVHL interactors has been manually curated, adding extra value to the data. Mutation data have been collected and annotated from published papers and selected to be highly relevant to clinically observed phenotypes. VHLdb offers different ways to access its data. First, a graphical web interface exploits advanced visualization strategies to clearly depict both interaction and mutation data. Second, a public RESTful API (Application Programming Interface) is available for headless access to the data. VHLdb is actively maintained at the University of Padova: three times a year novel data are added and existing entries are updated and reviewed. The maintenance process is managed via a curator web interface, which lets an expert user upload custom data; the annotation of novel entries is performed by automatic in-house software. A user-friendly feedback function to improve database content through community-driven curation is also provided.

Session B-439: Class Imbalance learning for the analysis of protein dynamics in mammalian heart proteome
COSI: Non-COSI
  • Bilal Mirza, ucla, United States
  • Jennifer Polson, UCLA, United States
  • Ding Wang, UCLA, United States
  • Howard Choi, UCLA, United States
  • Peipei Ping, UCLA, United States

Short Abstract: Machine learning methods have been effectively applied in biological studies to convert data into knowledge. In particular, omics datasets contain huge amounts of information that can be extracted using state-of-the-art machine learning approaches. A special type of machine learning, referred to as class imbalance learning (CIL), is very useful for datasets with rare biological patterns. Imbalance in a dataset refers to a condition where the number of sample points belonging to one group is much smaller than the number in the other groups. Standard machine learning models trained on such imbalanced datasets are biased towards the larger group, while the recognition rate on the smaller group is very low. For example, a standard machine learning model trained to identify the distinct molecular signatures of a functional protein group in the mammalian heart, with respect to the entire protein ensemble, obtains sub-optimal results. In this study, we applied a cost-sensitive CIL method with a support vector machine (SVM) as the base classifier to study the temporal dynamics of two small functional groups in the mammalian heart proteome: contractile proteins and degradation machineries. The analysis is based on the fold change in protein abundance values at seven time points and across six genetic mouse strains. We observed that the temporal dynamics of contractile proteins are markedly different from those of the entire protein ensemble, achieving a high recognition rate with the SVM-based CIL models, whereas no distinct pattern was observed for the protein degradation machineries, whose recognition rate was close to random guessing.
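A minimal sketch of cost-sensitive class imbalance learning with an SVM base classifier, in the spirit of (but not taken from) the study; the toy data, feature layout (7 time points x 6 strains = 42 fold-change features) and the use of scikit-learn's class-weight reweighting are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy imbalanced stand-in: 30 "contractile" proteins vs 470 background
# proteins, each described by 42 fold-change features.
X = rng.normal(size=(500, 42))
y = np.zeros(500, dtype=int)
y[:30] = 1
X[y == 1] += 0.8          # give the minority class a weak signal

# class_weight="balanced" penalizes errors inversely to class
# frequency: the basic cost-sensitive CIL idea with an SVM learner.
clf = SVC(kernel="rbf", class_weight="balanced")
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(scores.mean())
```

Without the cost-sensitive weighting, the same SVM tends to predict the majority class almost everywhere, which is the bias the abstract describes.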

Session B-440: Pan-genome annotation transfer
COSI: Non-COSI
  • Jakub Tyrek, University of Warsaw, Poland
  • Norbert Dojer, University of Warsaw, Poland
  • Paulina Hyży, University of Warsaw, Poland

Short Abstract: Using a single reference genome per species is the standard approach for exploring genetic mechanisms. It has become acknowledged that this approach misses an important aspect: inter-individual genomic variation. Thanks to the development of sequencing techniques, acquiring new genomic data has become cheaper, faster and more accessible, enabling pan-genomic approaches. One important aspect of such approaches is the effective transfer of genome annotations. We compare different approaches to the task of transferring annotations between aligned genomes, using existing tools including annotation transfer tools (CrossMap), annotation correction tools (Mugsy-annotator) and prediction tools (Prokka).

Session B-441: Recognition site avoidance as an anti-restriction strategy of prokaryotic viruses
COSI: Non-COSI
  • Ivan Rusinov, Lomonosov Moscow State University, Russia
  • Anna Ershova, Lomonosov Moscow State University, Russia
  • Sergey Spirin, Lomonosov Moscow State University, Russia
  • Anna Karyagina, Lomonosov Moscow State University, Russia
  • Andrei Alexeevski, Lomonosov Moscow State University, Russia

Short Abstract: Restriction-modification (R-M) systems protect prokaryotes from the invasion of foreign DNA, such as bacteriophages. An R-M system is specific to a short DNA sequence called a recognition site. Bacteriophages avoid some recognition sites of R-M systems in their genomes. This avoidance is considered one of the anti-restriction strategies of bacteriophages but has not been systematically studied. We analyzed restriction site avoidance in the genomes of 2069 prokaryotic viruses. As one might expect, DNA bacteriophages demonstrate significant restriction site avoidance, while RNA phages do not. DNA bacteriophages commonly avoid only the recognition sites of Type II R-M systems (excluding subtype IIG); sites of Type I, IIG, and III systems are avoided only in scattered instances. This could indicate that bacteriophages have other widespread anti-restriction strategies targeting such R-M systems. We also demonstrated that Myoviridae coliphages encoding DNA hydroxymethylase (an anti-restriction enzyme) do not avoid Type II restriction sites, while related phages without the gene avoid 73.7% of such sites. Temperate and lytic bacteriophages show different trends in Type II restriction site avoidance: lytic phages completely eliminate restriction site occurrences from their genomes more often (11.9% of the sites) than temperate phages do (2.0% of the sites). This is probably caused by the long-term prophage stage, during which a temperate bacteriophage shares the host's selective pressure affecting the oligonucleotide composition of its genome. The average number of avoided potential R-M sites in a phage genome exceeds the average number of R-M systems of a bacterium several times over, which might indicate that bacteriophages do not generally specialize in a single host strain.

Session B-442: The R package zeroSum
COSI: Non-COSI
  • Thorsten Rehberg, University of Regensburg, Germany
  • Michael Altenbuchinger, University of Regensburg, Germany
  • Rainer Spang, University of Regensburg, Germany

Short Abstract: zeroSum is an R package for fitting reference-point-insensitive linear models by imposing the zero-sum constraint combined with elastic net regularization. The zero-sum constraint makes linear models invariant to sample-wise shifts, thus working around normalization problems and measurement uncertainties such as diluted samples. The advantages of the zero-sum constraint for data analysis, especially for building cross-platform signatures, are shown in the presentation “Molecular signatures that can be transferred across different omics platforms” by Altenbuchinger et al. We present our efficient coordinate descent algorithm for fitting generalized linear zero-sum models, along with details of the C++ implementation that forms the core library of the zeroSum R package. Moreover, we give a quick-start tutorial showing the basic steps for creating linear zero-sum models with our package in R.
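As a hedged illustration (the exact loss and penalty weighting used by the package are not spelled out in the abstract, so the concrete form below is an assumption), the fitted objective can be written as a standard elastic net with one extra linear constraint:

```latex
\hat{\beta}_0,\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
  \;+\;\lambda\Bigl(\alpha\lVert\beta\rVert_1+\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr)
  \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j=0 .
```

The shift-invariance claim then follows directly: replacing every feature vector x_i by x_i + c·1 (a sample-wise shift, e.g. caused by dilution) leaves x_i^T β unchanged, because 1^T β = 0 under the constraint.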

Session B-443: Pescal++: A high performance standards compliant tool for label free quantitation of peptides from LC-MS/MS shotgun proteomics data
COSI: Non-COSI
  • Ryan Smith, Queen Mary University of London, United Kingdom
  • Pedro Cutillas, Barts Cancer Institute, United Kingdom
  • Jon Hays, Queen Mary University of London, United Kingdom
  • Conrad Bessant, Queen Mary University of London, United Kingdom

Short Abstract: Advances in the sensitivity and accuracy of mass spectrometry instrumentation have led to an increase in the size and ambition of large-scale label-free proteomics studies, which, in turn, has highlighted the need for fast and stable high-throughput bioinformatics software. Pescal++ has been developed in the C++ programming language and enables accurate and fast quantitation of peptides in large-scale proteomics experiments. HUPO-PSI data standards such as mzIdentML (input) and mzTab (output) have been adopted for connectivity with other tools, allowing Pescal++ to be used as part of complex analysis workflows. Pescal++ has been successfully applied to a range of datasets, including a very large phosphoproteomics experiment in which we were able to quantify 30,000+ peptides in a single experiment containing 900+ samples. Pescal++ will be used as a platform for studying further optimisation and parameterisation at key stages in the quantitation workflow, where its stability and rapid performance in processing hundreds of samples simultaneously will be particularly useful for developing a more accurate retention time alignment algorithm.

Session B-444: Ultra-fast 2-way and 3-way SNP Interaction Tests on FPGAs and GPUs
COSI: Non-COSI
  • Lars Wienbrandt, Institute of Clinical Molecular Biology, Kiel University, Germany
  • Jan Christian Kässens, Institute of Clinical Molecular Biology, Kiel University, Germany
  • Matthias Hübenthal, Institute of Clinical Molecular Biology, Kiel University, Germany
  • David Ellinghaus, Institute of Clinical Molecular Biology, Kiel University, Germany

Short Abstract: Exhaustive higher-order SNP interaction testing is computationally very demanding due to its algorithmic complexity. The combination of FPGA (Field Programmable Gate Array) and GPU (Graphics Processing Unit) computing technologies provides an ideal architecture to significantly speed up such interaction tests. The problem is split into two main parts, each implemented on the architecture that fits it best: the first part is the creation of contingency tables, which ideally suits FPGA technology, and the test statistic based on these contingency tables is then computed efficiently with GPU technology. We show that the application of the information gain measure, an entropy-based test statistic, delivers significant results in an example ulcerative colitis case-control dataset for 2-way (SNPxSNP) as well as 3-way (SNPxSNPxSNP) interaction analysis. We confirmed the validity of the statistic by evaluating two different ways to determine the null distribution: firstly, we tested 300 cross-validated permutations of the trait of the original data, and secondly, 100 cross-validated reduced datasets with 10% of the original SNPs and the same 300 trait permutations. We achieve a speedup of more than 1,650-fold on our FPGA-GPU computer using four Xilinx Kintex UltraScale KU115 FPGAs and four Nvidia Tesla P100 GPUs when compared to a multi-core CPU cluster node (32 threads on Intel Xeon E5-2667v4), reducing the computational runtime from 11.3 years to only 2.5 days for a 3-way test of 5,725 SNPs, >43,000 samples and 300 permutations.
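The abstract names the information gain, an entropy-based measure of synergy between SNPs. One common definition for the 2-way case is IG = I(SNP1,SNP2; Y) − I(SNP1; Y) − I(SNP2; Y); the CPU-only NumPy sketch below (the genotype coding and toy data are assumptions, and this is of course nothing like the FPGA/GPU implementation) shows how it is computed from contingency tables.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(table):
    """I(X;Y) from a joint contingency table of counts."""
    return (entropy(table.sum(axis=1)) + entropy(table.sum(axis=0))
            - entropy(table.ravel()))

def information_gain(snp1, snp2, trait):
    """IG = I(SNP1,SNP2;Y) - I(SNP1;Y) - I(SNP2;Y): the synergy of
    two SNPs about case/control status beyond their single effects."""
    t12 = np.zeros((9, 2))            # joint genotype (3x3) x trait
    t1 = np.zeros((3, 2))
    t2 = np.zeros((3, 2))
    for g1, g2, y in zip(snp1, snp2, trait):
        t12[3 * g1 + g2, y] += 1
        t1[g1, y] += 1
        t2[g2, y] += 1
    return mutual_info(t12) - mutual_info(t1) - mutual_info(t2)

rng = np.random.default_rng(1)
snp1 = rng.integers(0, 3, 2000)       # genotypes coded 0/1/2
snp2 = rng.integers(0, 3, 2000)
trait = rng.integers(0, 2, 2000)      # case/control status
print(information_gain(snp1, snp2, trait))
```

The contingency-table counting is the cheap, streaming part that maps naturally onto FPGAs, while the floating-point entropy arithmetic over millions of tables is what the GPUs handle.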

Session B-445: Assembly and Annotation of the Hexaploid Oat Genome
COSI: Non-COSI
  • Rachel Walstead, University of North Carolina at Charlotte, United States
  • Adam Whaley, University of North Carolina at Charlotte, United States
  • Robert Reid, University of North Carolina at Charlotte, United States
  • Veronica Vallejo, PepsiCo, United States
  • Cory Brouwer, University of North Carolina at Charlotte, United States
  • Jessica Schlueter, University of North Carolina at Charlotte, United States

Short Abstract: The hexaploid oat (Avena sativa L.) is a staple cereal crop used for both human consumption and animal feed. Despite the importance of cereals, genomic resources for oat are lagging behind many other crops. The estimated genome size of A. sativa is about 13 Gb. We aim to fully sequence, assemble, and annotate the hexaploid oat genome (2n = 6x = 42). To this end, we utilized PacBio RSII technology and sequenced approximately 580 SMRT cells, achieving a coverage of approximately 40X. We are currently assembling the genome using the FALCON, Canu, and SMARTdenovo assemblers, and upon completion we will compare the quality of each assembly. We will then annotate the genome using a combined approach of RNA-seq data, predictive gene models, and comparative annotations from other grass genomes, integrated using MAKER. The genomic information obtained by this project will be a valuable resource for crop scientists and breeders.

Session B-446: A proteome informatic approach to investigate the role of retroelement proteins in disease
COSI: Non-COSI
  • Mohamed Nazrath Mohamed Nawaz, Queen Mary - University of London, United Kingdom
  • Paul Hurd, Queen Mary - University of London, United Kingdom
  • Miguel Branco, Queen Mary - University of London, United Kingdom
  • Conrad Bessant, Queen Mary - University of London, United Kingdom

Short Abstract: Retroelements have been implicated in a number of diseases, but little is known about the behaviour of the proteins encoded by these elements. To contribute to understanding how retroelement proteins relate to disease, we developed an unbiased proteomic approach to re-analyse large amounts of publicly available proteomics data. The main database used was PRIDE, from which we re-analysed disease datasets to detect retroelement proteins, helping to build a picture of retroelement protein expression across a range of diseases. A pipeline was created to automatically carry out spectral data re-analysis using the combination of SearchGUI and PeptideShaker for confident protein identification on a high-performance computing cluster. Other available proteomics datasets, from CPTAC and PeptideAtlas, could potentially also be re-analysed. Furthermore, different genetic variants and the tissue specificity of retroelement proteins were explored as a first step towards understanding the role retroelement proteins play in disease.

Session B-447: Sequence based prediction of TCR and peptide interaction
COSI: Non-COSI
  • Vanessa Isabell Jurtz, Technical University of Denmark, Denmark
  • Martin Closter Jespersen, Technical University of Denmark, Denmark
  • Kamilla Kjærgaard Jensen, Technical University of Denmark, Denmark
  • Bjoern Peters, La Jolla Institute for Allergy and Immunology, United States
  • Morten Nielsen, Technical University of Denmark, Denmark

Short Abstract: A major challenge for T cell therapy and rational identification of T cell epitopes is the identification of the cognate target (the peptide-HLA complex) of a given TCR. While reliable predictions of HLA-peptide interaction are available for most HLA class I alleles, prediction models for the interaction between the TCR and the HLA-peptide complex have not yet, to the best of our knowledge, been described. Recent sequencing projects have generated a considerable amount of data relating TCR sequences to the HLA-peptide complexes they recognize. We utilize such data to train sequence-based predictors of the interaction between TCRs and peptides presented by HLA-A*02:01. Our models are based on convolutional neural networks, which are especially designed to meet the challenges posed by sequences of variable length, such as TCRs. We show that such sequence-based models allow for the identification of the cognate peptide-HLA target of a given TCR from its sequence alone. Moreover, we expect predictive performance to increase as more data becomes available.

Session B-448: Bacteriophage Whole Genome Alignment and Recombination History Reconstruction
COSI: Non-COSI
  • Krister Swenson, CNRS, Université de Montpellier, France
  • Anne Bergeron, Universite du Quebec a Montreal, Canada
  • Severine Berard, Universite Montpellier, France
  • Annie Chateau, Universite Montpellier, France

Short Abstract: Virus genomes are generally very compact due to the physical constraint of fitting into a capsid: coding regions are dense, and gene orders are relatively conserved between strains. The evolution of virus genome architecture, however, is complicated by the -- often programmed -- existence of recombination points that mix genetic material from two individuals into a new mosaic strain. Phylogenetic reconstruction of virus strains can thus be tricky business, since obtaining clean alignments in the presence of recombination is difficult. To this end, we address two current challenges: 1. the effective "alignment" of viral genomes, and 2. the study of the peculiar nature of bacteriophage recombination histories. The mosaic structure of viral sequence alignments is captured by our tool called Alpha (ALignment of PHAges), which builds a partial order on well-aligned blocks. With homologous blocks in hand, we study a model of bacteriophage recombination that requires two homologous points for each recombination event, and we show conditions under which recombination histories can be readily reconstructed.

Session B-449: EpiC: assessing Epigenetics profiles in Cancer samples contaminated by normal cells
COSI: Non-COSI
  • Elnaz Saberi Ansari, Institute Cochin, France
  • Valentina Boeva, Institute Cochin, France

Short Abstract: The aim of this research is to develop a computational method to characterize epigenetic profiles (histone modifications and open chromatin sites) in primary tumor tissues that represent a mixture of cancer and normal cells. The method can be used by researchers studying chromatin remodeling in cancer initiation and progression and searching for epigenetic markers associated with drug sensitivity and overall patient survival. In this method, the enrichment in histone modifications in a primary tumor is modeled as a linear mixture of the signals coming from the normal and cancer cells. The sample with the lowest level of contamination by normal cells is taken as the reference, and the algorithm tries to extract the tumor signal from the samples. The data are normalized by copy number and tumor purity, which are assessed using the Control-FREEC software, and also by the noise ratio between different experiments. After removing the tumor signal, the remaining signal (a mixture of normal cells) is subjected to linear decomposition using a non-negative matrix factorization algorithm or independent component analysis. Our method will be validated on both simulated and experimental datasets. During the validation on simulated data, we will address questions such as the minimal required number of tumor samples, the maximal level of contamination, and the maximal number of different contaminating normal cell types. Experimental validation will be done on 16 different neuroblastoma cell lines contaminated with normal cell lines at different rates; ChIP-seq profiles for 3 histone marks for two normal and 19 neuroblastoma cell lines are already available in our lab. This project is supported by Worldwide Cancer Research.
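The final linear decomposition step can be illustrated with scikit-learn's NMF on synthetic data; the dimensions, the component count and the generative model below are assumptions made purely for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical stand-in: 10 tumour-signal-removed ChIP-seq profiles
# over 1,000 genomic bins, each a non-negative mixture of 3 normal
# cell-type signals.
true_profiles = rng.gamma(2.0, size=(3, 1000))
weights = rng.dirichlet(np.ones(3), size=10)   # per-sample proportions
signal = weights @ true_profiles               # observed mixtures

# Decompose the residual normal-cell mixture: W recovers per-sample
# proportions, H recovers per-cell-type signal profiles.
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(signal)
H = model.components_
```

Non-negativity is the natural constraint here, since both cell-type proportions and ChIP-seq enrichment signals cannot be negative.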

Session B-450: Prospecting in Contributed Personal Genomic Data
COSI: Non-COSI
  • Richard Shaw, Repositive Ltd., United Kingdom
  • Dennis Schwartz, Repositive Ltd., United Kingdom
  • Manuel Corpas, Repositive Ltd., United Kingdom
  • Fiona Nielsen, Repositive Ltd., United Kingdom

Short Abstract: Open-access personal genomic data contributed directly by the individuals genotyped are a growing resource but, as with human genomic data in general, their storage is fragmented across multiple sites. Using the Repositive human genomic metadata aggregation platform (https://discover.repositive.io/?ECCB2017), we explored the landscape of such genomic data, within the scope of SNP array genotypes generated by a prominent provider (23andMe TM). Our approach was to search for metadata containing the name of the provider, download the corresponding (3137) data files and then filter out those files not matching the format of interest (GRCh37 23andMe genotypes) or that appeared corrupted. An initial principal component analysis revealed that 122 of the 2402 remaining genotypes were from the same individual as other genotypes in the dataset. Some corresponded to identical files submitted multiple times to the same or different repositories, but others to different versions of the same genotype. Mapping the deduplicated set of 2280 genotypes onto principal component axes generated from a set of African, Asian and European genotypes from 1000 Genomes populations and then applying nearest neighbour classification showed that the dataset is predominantly composed of European ancestry genotypes. Promethease (https://www.snpedia.com/index.php/Promethease) analyses of these genotypes revealed, among other traits, a preponderance of male individuals. With this analysis we have shown that it is possible to collect and aggregate a large dataset from open-access data available across multiple data sources. The examination of the data may be useful in further investigations into linking genotype and phenotype.
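A minimal sketch of the projection-and-classification step, assuming genotypes are already encoded as 0/1/2 allele counts over a shared SNP set; the shapes and labels below are placeholders, not the actual data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    ref = rng.integers(0, 3, size=(500, 20000))            # 1000 Genomes panel
    labels = rng.choice(["AFR", "EAS", "EUR"], size=500)   # population labels

    pca = PCA(n_components=2).fit(ref)  # axes from the reference panel only
    knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(ref), labels)

    # project the deduplicated contributed genotypes onto the same axes
    contributed = rng.integers(0, 3, size=(2280, 20000))   # placeholder
    ancestry = knn.predict(pca.transform(contributed))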

Session B-451: A probabilistic approach to whole genome based phylogeny
COSI: Non-COSI
  • Johanne Ahrenfeldt, Technical University of Denmark, Denmark
  • Anders Gorm Pedersen, Technical University of Denmark, Denmark
  • Anders Krogh, University of Copenhagen, Denmark
  • Ole Lund, Technical University of Denmark, Denmark

Short Abstract: The use of whole genome sequencing is increasing in diagnostics. To perform outbreak analysis it is crucial to be able to make accurate phylogenetic trees. These phylogenetic trees are often based on mapping of raw reads to a reference genome. Raw read data, however, contain more information than is used by current methods for WGS phylogeny, which are mostly based on nucleotide calling. We exploit this information with an algorithm that utilizes all the information available when mapping raw reads to a reference genome. The phylogenetic distance is calculated as the sum over positions of the probability that each position differs between two samples. This probability is calculated using a Bayesian approach. The method is tested on a dataset with known phylogeny, generated by in vitro evolution of an E. coli K12 strain (Ahrenfeldt et al. 2017).
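In LaTeX notation, the distance described above can be written as a sum of per-position mismatch probabilities; this is our reading of the abstract, with symbols chosen for illustration rather than taken from the paper:

    d(a, b) = \sum_{i=1}^{L} P\left(a_i \neq b_i \mid \text{mapped reads}\right)

where L is the length of the reference and each term is the posterior probability, computed with the Bayesian model from the read pileups of the two samples, that samples a and b carry different nucleotides at position i.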

Session B-452: Optimization of mutational pressure in bacterial genomes according to costs of amino acid replacement
COSI: Non-COSI
  • Paweł Mackiewicz, University of Wroclaw, Poland
  • Paweł Błażej, University of Wroclaw, Poland
  • Małgorzata Grabińska, University of Wroclaw, Faculty of Biotechnology, Poland
  • Małgorzata Wnętrzak, University of Wroclaw, Faculty of Biotechnology, Poland
  • Dorota Mackiewicz, University of Wroclaw, Poland

Short Abstract: Mutations occurring in DNA are usually considered a spontaneous and random process. They are important in the evolution of organisms because they generate genetic variation. On the other hand, most mutations are undesirable because they make genes non-functional, and their repair requires a lot of energy. Therefore, we can expect that the mutational pressure should be optimized during evolution to simultaneously generate genetic diversity and preserve genetic information. In order to check the optimization level of empirical mutational pressures, we compared the matrices of nucleotide mutation rates derived from bacterial genomes with their best possible alternatives, found by an Evolutionary Multiobjective Optimization approach. We searched for the matrices that minimized or maximized costs of amino acid replacements resulting from differences in their physicochemical properties, e.g. hydropathy and polarity. It should be emphasised that the studied empirical nucleotide substitution matrices and the costs of amino acid replacements are independent, because these matrices were derived from sites free of selection on amino acid properties, and the amino acid costs assumed only amino acid physicochemical properties without any information about mutation at the nucleotide level. The obtained results indicate that the empirical mutational matrices have a tendency to minimize costs of amino acid replacements. This implies that bacterial mutational pressures can evolve to decrease the consequences of amino acid substitutions. However, the optimization is not complete, which enables the generation of some genetic variability necessary to adapt to a changing environment.
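The cost being optimized can be made concrete with a small sketch: given a matrix of nucleotide mutation rates and a codon table, the expected cost of the amino acid replacements induced by single-nucleotide mutations might be computed as below; the quadratic polarity cost and the weighting are our simplifications, not the published formulation:

    import itertools

    BASES = "ACGT"

    def replacement_cost(aa1, aa2, polarity):
        # simple squared difference in a physicochemical property
        return (polarity[aa1] - polarity[aa2]) ** 2

    def expected_cost(mutation_matrix, codon_table, polarity):
        """Average cost of amino acid replacements induced by single
        nucleotide mutations, weighted by the mutation rates.
        mutation_matrix: dict of dicts, e.g. mutation_matrix['A']['G'];
        codon_table: sense codon -> amino acid letter."""
        total, weight = 0.0, 0.0
        for codon in codon_table:
            for pos, alt in itertools.product(range(3), BASES):
                if alt == codon[pos]:
                    continue
                mutant = codon[:pos] + alt + codon[pos + 1:]
                if mutant not in codon_table:   # skip stop codons
                    continue
                rate = mutation_matrix[codon[pos]][alt]
                total += rate * replacement_cost(
                    codon_table[codon], codon_table[mutant], polarity)
                weight += rate
        return total / weight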

Session B-453: The role of signal-anchor in the subcellular localization of transmembrane protein
COSI: Non-COSI
  • Tatsuki Kikegawa, Department of Electronics, Graduate School of science and technology, Meiji University, Japan
  • Yuri Mukai, Department of Electronics, Graduate School of science and technology, Meiji University, Japan

Short Abstract: Transmembrane proteins are typical integral membrane proteins spanning biomembranes, including the endoplasmic reticulum (ER), Golgi and plasma membranes. Their functions are essential to maintain homeostasis via signal transduction, membrane transport and energy production. Their transmembrane regions usually consist of ten to thirty hydrophobic amino acids, which are known as ER-targeting signals called signal-anchors. However, the transport mechanisms underlying transmembrane protein localization from the ER to other organelles have not been elucidated. Understanding the mechanism of protein subcellular localization is believed to be crucial for the treatment of incurable diseases resulting from erroneous subcellular localization. In this study, to elucidate transport mechanisms of transmembrane proteins, the amino acid propensity around the signal-anchor was calculated. The transmembrane protein dataset was classified into three groups: plasma membrane proteins, ER membrane proteins and Golgi membrane proteins. The discrimination accuracy of each group was estimated from discrimination scores calculated with a position-specific scoring matrix and an artificial neural network, to evaluate whether transmembrane protein localization is determined by the sequences around the transmembrane region. Members of each group could be discriminated with high accuracy (> 90%) based on a 5-fold cross-validation test. The result suggested that the amino acid propensity around the transmembrane domain is related to the localization mechanisms. To verify this presumption experimentally, GFP fusion proteins with the signal-anchors of representative proteins selected from each group were designed. The subcellular localization of these GFP fusion proteins expressed in HeLa cells was observed by confocal laser fluorescence microscopy.
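As a rough illustration of the position-specific scoring matrix underlying the discrimination scores (the neural network stage is omitted), assuming fixed-length sequence windows around the signal-anchor; the background frequency and pseudocount are arbitrary choices here:

    import math

    AAS = "ACDEFGHIKLMNPQRSTVWY"

    def build_pssm(sequences, background=0.05, pseudo=1.0):
        """Log-odds PSSM from aligned windows around the signal-anchor;
        all windows are assumed to share one length."""
        length = len(sequences[0])
        pssm = []
        for pos in range(length):
            counts = {aa: pseudo for aa in AAS}
            for seq in sequences:
                counts[seq[pos]] = counts.get(seq[pos], pseudo) + 1
            total = sum(counts.values())
            pssm.append({aa: math.log((counts[aa] / total) / background)
                         for aa in AAS})
        return pssm

    def score(window, pssm):
        # higher scores indicate closer match to the group's propensity
        return sum(col.get(aa, 0.0) for aa, col in zip(window, pssm))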

Session B-454: Geena 2, a public web tool for automated analysis of MALDI/TOF mass spectra
COSI: Non-COSI
  • Paolo Romano, Ospedale Policlinico San Martino, Genoa, Italy, Italy
  • Aldo Profumo, Ospedale Policlinico San Martino, Genoa, Italy, Italy
  • Claudia Angelini, CNR - Istituto per l'Applicazione del Calcolo, Italy
  • Eugenio Del Prete, Department of Sciences, University of Basilicata, Potenza, Italy, Italy
  • Angelo Facchiano, CNR - Istituto di Scienze dell'Alimentazione, Italy

Short Abstract: Geena2 is a public tool for the pre-processing of MALDI/ToF spectra, developed to address the lack of tools for the automatic analysis of proteomics data and to help scientists who do not have a strong computer science background. Automation in the analysis of high-throughput data is useful both for managing large amounts of data and for supporting replicability and reproducibility of data analysis. Geena2 implements: a) unification of isotopic abundances for the same molecule, b) normalization of data against a standard, c) background noise reduction, d) computation of an average spectrum representative of replicate spectra, e) alignment of average spectra. Input consists of peak lists and parameters for setting up the procedure according to the user's needs. The output consists of average spectra, their alignment, and intermediate results for checking the correct execution. Geena2 was successfully used for the evaluation of the effects of long-term cryopreservation on serum samples, and then for two retrospective studies on the correlation between serum peptidomic profiles and cancer. These applications demonstrated that Geena2 is able to automate many steps in the pre-processing of MALDI/ToF spectra. We are now working to implement GeenaR, to exploit the power of R modules and broaden adoption by scientists without specific programming and computer science expertise, following the reproducible research philosophy.

Session B-455: Identifiers.org - Persistent identifier resolution and services
COSI: Non-COSI
  • Sarala Wimalaratne, EBI, United Kingdom
  • Nick Juty, EBI, United Kingdom
  • Henning Hermjakob, EBI, United Kingdom

Short Abstract: In this era of 'big data', it has become increasingly important to reference data consistently, robustly, and in a manner that facilitates perennial accessibility. This enables reliable referencing, allows users unfettered access to data and records, and facilitates interoperability between diverse data sets. Solutions for data identification and retrieval must contend with the distributed nature of data and with differing data accessibility and availability. Furthermore, it is necessary to record alternative means by which the same data can be referenced, to facilitate seamless transition across the identifier landscape. The Identifiers.org system provides a central infrastructure towards facilitating findable, accessible, interoperable and re-usable (FAIR) data. It offers a range of services to generate, resolve and validate persistent Compact Identifiers to promote the citability of individual data providers and integration with e-infrastructures. The Identifiers.org registry contains hundreds of manually curated, high quality data collections, with each assigned a unique prefix. A combination of the prefix and a locally assigned database identifier (accession) forms a Compact Identifier, [prefix]:[accession]. The Identifiers.org resolver provides a stable resolution service for these Compact Identifiers, taking into consideration information such as the uptime and reliability of all available hosting resources.
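For illustration, a Compact Identifier can be resolved programmatically by requesting its identifiers.org URL and following the redirect; the prefix and accession below are arbitrary examples:

    import requests

    # a Compact Identifier is [prefix]:[accession]; resolving it through
    # identifiers.org redirects to a hosting resource for that record
    compact_id = "GO:0006915"  # illustrative prefix and accession
    response = requests.get(f"https://identifiers.org/{compact_id}",
                            allow_redirects=True, timeout=30)
    print(response.url)  # final URL at the resolved data provider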

Session B-456: Single Cell Mass Cytometry Marker Panel Extension
COSI: Non-COSI
  • Tamim Abdelaal, Delft University of Technology, Netherlands
  • Ahmed Mahfouz, Delft University of Technology, Netherlands
  • Thomas Höllt, Delft University of Technology, Netherlands
  • Vincent van Unen, Leiden University Medical Center, Netherlands
  • Frits Koning, Leiden University Medical Center, Netherlands
  • Boudewijn Lelieveldt, Leiden University Medical Center, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands

Short Abstract: High-dimensional mass cytometry (CyTOF) allows simultaneous measurement of multiple cellular markers, providing a system-wide view of immune phenotypes at the single-cell level. The maximum number of markers that can be measured simultaneously (N) is limited to ~50 due to technical challenges. We propose an approach to integrate CyTOF data from several marker panels that include an overlapping marker set, allowing for a deeper interrogation of the cellular composition of the immune system. Assuming two CyTOF panels with share m

Session B-457: ARTISiN: A Repository and Multi-Agent System for omics data Integration and Identification of Signaling Networks
COSI: Non-COSI
  • Milton Y. Nishiyama-Jr, Laboratorio Especial de Toxinologia Aplicada, Instituto Butantan, Brazil
  • Marcelo S. Reis, Laboratório Especial de Ciclo Celular (LECC), Center of Toxins, Immune-response and Cell Signaling (CeTICS), Instituto Butantan, Brazil
  • Henrique C Vieira, Instituto Butantan, Brazil
  • Bruno F de Souza, Instituto Butantan, Brazil
  • Daniel F. Silva, Instituto Butantan, Brazil
  • Inácio L.M. Junqueira-De-Azevedo, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Julia P.C. Da Cunha, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Leo K. Iwai, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Junior Barrera, Instituto de Matemática e Estatística, Universidade de São Paulo, and CeTICS, Instituto Butantan, Brazil
  • Solange M.T. Serrano, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Hugo A. Armelin, Butantan Institute, Brazil

Short Abstract: The mission of CeTICS is to discover and chemically characterize molecular targets of venom toxins, which probably initiate biological responses of interest in human pathophysiology and therapeutics. The CeTICS sub-projects generate a vast amount of heterogeneous, low- and high-throughput omics data, whose complexity defies analytical efforts to uncover hidden biological knowledge. Moreover, the static maps eventually yielded by those omics analyses are very often not sufficient to unveil the underlying dynamics of cell signaling networks, which demands the design and simulation of quantitative dynamical models. We propose the ARTISiN platform, an amalgam of repositories and tools, both public and built in-house, for the identification of signaling networks and the assessment of mechanistic aspects of their actions. ARTISiN has been designed as a Multi-agent system (MAS), a composite system of multiple intelligent agents whose orchestration allows the solution of a complex problem; these agents are autonomous entities, each one focused on solving part of the problem. The two core components of this platform are: i) CeTICSdb, an access-controlled repository for storage, analysis and integration of omics data and also for generation of static maps of signaling networks; ii) SigNetSim, a tool in which an “artisan” selects part of the signaling network that could be approximately isolated as a functional module and “handcrafts” dynamical models to explain its underlying mechanism. Finally, our mid-term objective is to develop communication between CeTICSdb and SigNetSim, which will be accomplished through the MAS under development, and to make the platform available to the scientific community.

Session B-458: Study on the properties of alternative genetic codes in comparison with the canonical genetic code and theoretical codon assignments
COSI: Non-COSI
  • Pawel Blazej, University of Wroclaw, Poland
  • Przemyslaw Gagat, University of Wroclaw, Poland
  • Małgorzata Wnętrzak, Faculty of Biotechnology, University of Wrocław, Poland
  • Paweł Mackiewicz, University of Wroclaw, Poland

Short Abstract: Generally, the standard genetic code (SGC) is regarded as universal. However, there are many alternative genetic codes, whose number has increased rapidly in recent years. The large number of deviations in codon assignments raises further questions about the structure, properties and evolutionary directions of the existing codes. In this work, we evaluated differences between the SGC and existing alternative codes in terms of the costs of amino acid replacement based on their polarity. Furthermore, we tested the properties of all possible theoretical genetic codes that differ from the SGC in one, two or three changes in the assignments of codons to amino acids. Depending on the number of changes, a substantial fraction of the theoretical codes minimized the costs of amino acid replacement better than the SGC. Interestingly, many types of codon reassignments observed in the alternative codes are also responsible for a significant improvement of the fitness measure. These reassignments are among the most beneficial changes to the genetic code structure in terms of cost minimization, compared with the theoretical assignments. These findings suggest potential evolutionary directions of alternative genetic codes.

Session B-459: Nonpher: computational method for design of hard-to-synthesize structures
COSI: Non-COSI
  • Milan Voršilák, UCT Prague, Czech Republic
  • Daniel Svozil, UCT Prague, Czech Republic

Short Abstract: Machine learning methods are often used in cheminformatics to predict activity, cluster similar structures or classify structures into distinctive classes. To train a classifier, a training data set must contain examples from every class. A binary classifier needs examples from two classes, usually the positive and the negative (e.g. active/inactive). While a biological activity or toxicity can be experimentally measured, another important molecular property, synthetic feasibility, is a more abstract feature that cannot be easily assessed. Furthermore, synthetic feasibility is not only abstract, but hard-to-synthesize structures are also not readily available from any database. Nonpher is a computational method developed to construct the needed virtual library of hard-to-synthesize structures. Nonpher is based on a molecular morphing algorithm, which iteratively generates new structures by simple structural changes to a starting structure, such as the addition or removal of an atom or a bond. Nonpher was optimized to yield reasonably complex structures that are hard to synthesize. Structures generated by Nonpher were compared with structures selected by the SAscore and dense region (DR) methods. A random forest classifier trained on Nonpher data achieved better results than models obtained using SAscore and DR data.

Session B-460: Integrating gene expression and proteomics data into protein-protein interactions networks using a modular methodology based on open source software
COSI: Non-COSI
  • Frederico Guimarães, Centro de Pesquisas René Rachou, Brazil
  • Leilane Gonçalves, Instituto Oswaldo Cruz, Brazil
  • Henrique Toledo, Centro de Pesquisas René Rachou, Brazil
  • Daniela Resende, Instituto Oswaldo Cruz, Brazil
  • Jeronimo Ruiz, Centro de Pesquisas René Rachou, Brazil

Short Abstract: Nowadays there is a considerable amount of biological data available, but the challenges involved in extracting information from them are substantial and growing. Differences between data formats, the absence of a unified identifier for biological features and the lack of integration between existing databases are some of the challenges researchers face in the task of biological information mining. With the main goal of integrating biological data from different sources for further analysis and interpretation, we developed a methodology that uses a series of shell and Perl scripts to extract, filter and format protein interaction data from the STRING v.10 database and from high-throughput genomic data (RNASeq and shotgun proteomics), integrating them into protein-protein interaction networks using Cytoscape. The methodology was modularly structured and can be adapted and/or integrated to different analytical protocols and organisms. Specifically in this study, the model organism used was Trypanosoma cruzi, the causative agent of Chagas disease. As results, we generated a series of protein-protein interaction networks that emphasize, using Cytoscape visual styles, characteristics of biological interest, such as EC number, functional grouping, protein interaction types (binding, reaction, expression, activation, catalysis and post-translational modifications) and graph metrics. In these networks we could highlight several biologically relevant features, such as topological associations between gene regulatory mechanisms (kinases, phosphatases and RNA polymerases) and clusters of functionally associated proteins. Finally, the developed methodological approach emphasizes that integrating genomic data into PPI networks can be valuable in converting high-throughput data into biological knowledge.

Session B-461: The communities of party hubs in fusion protein-protein interaction networks increases their robustness against their site-directed knockouts
COSI: Non-COSI
  • Somnath Tagore, BAR-ILAN University, Israel
  • Vikrant Palande, BAR-ILAN University, Israel
  • Milana Frenkel-Morgenstern, BAR-ILAN University, Israel
Session B-462: Algorithms for Structural Variation Discovery Using Hybrid Sequencing Technologies
COSI: Non-COSI
  • Ezgi Ebren
  • Ayse Berceste Dincer
Session B-463: The Affinity Data Bank: An improved online suite of tools for investigation of protein-nucleic acid affinity models and biophysical analysis of regulatory sequences
COSI: Non-COSI
  • Cory Colaneri, University of Massachusetts Boston, United States
  • Brandon Phan, University of Massachusetts Boston, United States
  • Aadish Shah, University of Massachusetts boston, United States
  • Pritesh Patel, University of Massachusetts Boston, United States
  • Todd Riley, University of Massachusetts Boston, United States

Short Abstract: We present The Affinity Data Bank (ADB), an improved suite of tools that provides biologists with novel aids to deeply investigate the sequence-specific binding properties of a transcription factor (TF) or an RNA-binding protein (RBP), and to study subtle differences in specificity between homologous nucleic acid-binding proteins. Integrated with Pfam, the PDB, and the UCSC database, the ADB also allows for simultaneous interrogation of protein-DNA and protein-RNA specificity and structure in order to find the biochemical basis for differences in specificity across protein families. The ADB further includes a biophysical genome browser for quantitative annotation of levels of binding, using free protein concentrations to model the non-linear saturation effect that relates binding occupancy with binding affinity. The biophysical browser also integrates dbSNP and other polymorphism data in order to depict changes in affinity due to genetic polymorphisms, which can aid in finding both functional SNPs and functional binding sites. Lastly, the biophysical browser supports biophysical positional priors to allow for quantitative designation of the level of locus-specific accessibility that a protein has to the DNA. Importantly, the use of this toolset does not require bioinformatics programming knowledge, which makes the ADB tool suite highly useful for a wide range of researchers.

Protein concentration is an important ingredient that, along with the protein’s sequence specificity, greatly affects levels of protein-nucleic acid binding. In addition, as protein concentrations increase, the saturation of the highest-affinity binding sites additionally increases the levels of occupancy for functional medium- and low-affinity sites. This biophysical, non-linear relationship between free protein concentration, binding site affinity, and resultant binding is an important part of accurately determining the level of protein-DNA and protein-RNA binding under in vivo conditions.

Accurate protein-DNA affinity models are necessary but not sufficient to properly model and predict the level of in vivo protein-DNA binding and subsequent gene regulation. For example, in the human genome most possible binding sites are not accessible for binding by a TF. Tissue-specific chromatin state and accessibility is a complex, major factor that heavily influences protein-DNA binding. Because many possible binding sites are actually inaccessible for binding, methods that do not include in vivo accessibility when searching for putative binding sites in or near a gene have a high false positive rate. The ADB can properly model in vivo protein-DNA binding by integrating the effects of chromatin accessibility and epigenetic marks via the inclusion of biophysical occupancy-based and affinity-based positional priors.

Lastly, the ADB now includes two new tools for affinity model visualization and stochastic modeling of transcriptional and translational regulation. Firstly, the new graphical Universal Sequence Logo incorporates any order of nucleotide dependencies, insertions, and deletions between positions in a protein-DNA or protein-RNA binding affinity model. Secondly, the newly integrated Biochemical Network Stochastic Simulator (BioNetS) Version 2.0 can import the annotated binding sites, protein concentrations, and accessibility annotations in order to accurately model the stochastic behavior and dynamics of gene expression regulation at both the transcriptional and translational levels.
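The non-linear saturation effect relating occupancy to affinity that the browser models is commonly written as the standard binding isotherm; in LaTeX notation, and as a textbook sketch rather than the exact ADB formulation:

    \text{occupancy}(s) = \frac{[P]}{[P] + K_d(s)}

where [P] is the free protein concentration and K_d(s) the dissociation constant of binding site s. As [P] increases, the highest-affinity sites saturate while medium- and low-affinity sites continue to gain occupancy, which is exactly the behavior described above.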

Session B-464: Human Phenotype Ontology Prediction with the Utilization of Co-occurrences Between HPO terms and GO terms
COSI: Non-COSI
  • Tunca Doğan, METU / EMBL-EBI, Turkey
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Ahmet Rifaioglu, METU, Turkey
  • Volkan Atalay, METU, Turkey
  • Rengul Atalay, METU, Turkey
  • Maria Martin, EMBL-EBI, United Kingdom

Short Abstract: Here we propose a new approach to predict HPO term associations to human genes/proteins by detecting co-annotation fractions between all HPO term and Gene Ontology (GO) term combinations, using the annotations of the training-set genes/proteins. HPO terms and GO terms that frequently co-occur as annotations on different proteins are linked to each other (training step). Finally, proteins with a linked GO term annotation receive the corresponding HPO term as a prediction. The idea here is to associate HPO term Y with GO term X in the sense that: "if a protein loses its function defined by GO term X (or at least suffers a reduction in the defined functionality) as a result of a mutation, then it will cause the disease which is defined by the phenotype term Y". This idea is based on the nature of annotating genes/proteins with HPO terms, as only the mutated versions of these genes (i.e. disease-causing variants) are associated with genetic diseases and their phenotypic abnormality terms. Mutations usually lead to diseases by causing functionality losses in the gene products. As a result, if HPO term Y and GO term X are observed to co-occur frequently on different proteins, then the lost function which gave rise to the corresponding disease is probably the one defined by GO term X. We applied this methodology to predict HPO terms for the human protein dataset provided in the CAFA3 challenge.
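A minimal sketch of the training step under our reading of the abstract; the input mappings and thresholds are illustrative, not the published configuration:

    from collections import defaultdict
    from itertools import product

    def link_terms(go_annots, hpo_annots, min_fraction=0.5, min_support=5):
        """Link each GO term to HPO terms that frequently co-occur with it
        on the same proteins. go_annots / hpo_annots map protein -> set of
        terms; thresholds here are arbitrary examples."""
        go_count = defaultdict(int)
        pair_count = defaultdict(int)
        for protein, go_terms in go_annots.items():
            for go in go_terms:
                go_count[go] += 1
            for go, hpo in product(go_terms, hpo_annots.get(protein, ())):
                pair_count[(go, hpo)] += 1
        links = defaultdict(set)
        for (go, hpo), n in pair_count.items():
            if n >= min_support and n / go_count[go] >= min_fraction:
                links[go].add(hpo)  # proteins annotated with go receive hpo
        return links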

Session B-465: Integrative Mixed Graphical Models Identify Causal Factors for Chronic Lung Disease Diagnosis and Progression
COSI: Non-COSI
  • Andrew J Sedgewick, University of Pittsburgh, United States
  • Panayiotis V Benos, University of Pittsburgh, United States
  • Joseph Ramsey, Carnegie Mellon University, United States
  • Ivy Shi, University of Pittsburgh, United States
  • Dimitris Manatakis, University of Pittsburgh, United States
  • Yingze Zhang, University of Pittsburgh, United States
  • Jessica Bon, University of Pittsburgh, United States
  • Divay Chandra, University of Pittsburgh, United States
  • Chad Karoleski, University of Pittsburgh, United States
  • Peter Spirtes, Carnegie Mellon University, United States
  • Frank Sciurba, University of Pittsburgh, United States
  • Clark Glymour, Carnegie Mellon University, United States

Short Abstract: Integration of data from different modalities is a necessary step for multi-scale data analysis in many fields, including biomedical research and systems biology. Causal graphical models offer an attractive tool for this problem because they can represent both the complex, multivariate probability distributions and the causal pathways influencing the system. Graphical models learned from biomedical data can be used for classification, biomarker selection and functional analysis, while revealing the underlying causal network structure and thus allowing for arbitrary likelihood queries over the data. In this paper, we present and test new methods for finding directed graphs over mixed data types (continuous and discrete variables). We used this new algorithm, MGM-Learn, to identify variables causally linked to disease diagnosis and progression in various multi-modal datasets, including clinical datasets from chronic obstructive pulmonary disease (COPD). COPD is the third leading cause of death and a major cause of disability, so determining the factors that drive longitudinal lung function decline is very important. By applying our causal inference algorithm to the COPD dataset we were able to confirm and extend previously described connections, which provided new insights regarding the factors causally affecting the longitudinal lung function decline of COPD patients.

Session B-466: De novo pathway-based classification of breast cancer subtypes
COSI: Non-COSI
Session B-467: Detection of Significantly Differentially Expressed Cleavage Site Intervals within 3’ Untranslated Regions using CSI-UTR
COSI: Non-COSI
  • Eric Rouchka, University of Louisville, United States
  • Benjamin Harrison, University of New England,
  • Juw Won Park, University of Louisville, United States
  • Cynthia Gomes, University of Louisville, United States
  • Jeffrey Petruska, University of Louisville, United States
  • Matt Sapio, National Institutes of Health, United States
  • Michael Iadarola, National Institutes of Health,

Short Abstract: Motivation: Untranslated regions of the 3’ end of transcripts (3’UTRs) are critical for controlling transcript abundance and location. 3’UTR configuration is highly regulated and provides functional diversity, similar to alternative splicing of exons. Detailed transcriptome-wide profiling of 3’UTR structures may help elucidate mechanisms regulating cellular functions. This profiling is more difficult than for coding sequences (CDS), where exon/intron boundaries are well-defined. To enable this, we developed a new approach, CSI-UTR. Meaningful configurations of the 3’UTR are determined using cleavage site intervals (CSIs) that lie between functional alternative polyadenylation (APA) sites. The functional APAs are defined using publicly-available polyA-seq datasets biased to the site of polyadenylation. CSI-UTR can be applied to any RNASeq dataset, regardless of the 3’ bias.

Results: Using CSI-UTR, we produced a predefined set of CSIs for human, mouse, and rat. Previous studies indicate 3’UTR structure is highly regulated during nervous system functions. We therefore assessed CSI-UTR using archived RNASeq datasets from the nervous system (SRP056604 and SRP038707) and a rat dataset of our own. In all three species, CSI-UTR identified differential expression (DE) events not detected by standard gene-based differential analyses. Many DE events were in transcripts in which the CDS was unchanged. Enrichment analyses determined these DE 3’UTRs are associated with genes with known roles in neural processes. CSI-UTR is a powerful new tool to uncover DE that is undetectable by standard pipelines, but can exert a major influence on cellular function.

Availability: Source code, CSI BED files and example datasets are available at: http://bioinformatics.louisville.edu/CSI-UTR/

Contact: eric.rouchka@louisville.edu
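As a small illustration of the central data structure, a 3’UTR can be partitioned into CSIs lying between consecutive functional APA sites; the coordinate conventions below are illustrative only:

    def cleavage_site_intervals(utr_start, utr_end, apa_sites):
        """Partition a 3'UTR into cleavage site intervals (CSIs), i.e. the
        stretches between consecutive functional APA sites (0-based,
        half-open coordinates, chosen here for simplicity)."""
        sites = sorted(s for s in apa_sites if utr_start < s < utr_end)
        bounds = [utr_start] + sites + [utr_end]
        return list(zip(bounds[:-1], bounds[1:]))

    # e.g. a UTR spanning 1000-2000 with APA sites at 1200 and 1700
    print(cleavage_site_intervals(1000, 2000, [1700, 1200]))
    # [(1000, 1200), (1200, 1700), (1700, 2000)]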

Session B-468: Reprogrammed Lipid Metabolism in Bladder Cancer with Cisplatin Resistance
COSI: Non-COSI
  • Jayoung Kim

Short Abstract: Due to its tendency to recur and acquire chemoresistance quickly, bladder cancer (BC) remains an elusive and hard-to-treat disease. Patients with recurring BC and acquired chemoresistance have an extremely poor prognosis, with a 5-15% chance of 5-year survival. Thus, there is an unsolved yet highly urgent need to identify patients who are at a higher risk of developing chemoresistance. Currently, the molecular signatures underlying cisplatin resistance remain unknown. One avenue that could provide more information regarding resistance mechanisms is lipid metabolism. Metabolism of lipids is essential for cancer cells and is associated with the regulation of a variety of key cellular processes and functions. This study conducted a comprehensive and comparative lipidomic profiling of two isogenic human T24 bladder cancer cell lines, one of which was clinically characterized as cisplatin sensitive and the other as resistant. Immunohistochemistry analysis revealed that expression of cytosolic acetyl-CoA synthetase 2 (ACSS2) is positively correlated with aggressive BC. Ultra Performance Liquid Chromatography-Mass Spectrometry analysis profiled a total of 1,864 lipids, and the levels of differentially expressed lipids such as cholesteryl ester (CE(22:6)), triglyceride (TG(49:1)), and TG(53:2) were markedly higher in cisplatin-resistant cancer cells than in sensitive cells. The levels of metabolites such as CE(18:1), CE(22:6), TG(49:1) and TG(53:2) were greatly perturbed by ACSS2 inhibition. This study broadens our current knowledge of the links between cisplatin resistance and lipid metabolism in aggressive bladder cancer, and suggests potential biomarkers for bladder cancer patients at a higher risk.

Session B-469: Network Centrality in Non-alcoholic steatohepatitis: An Integrative Analysis
COSI: Non-COSI
  • Cristina Baciu, Toronto General Hospital, Canada
  • Marc Angeli, Toronto General Hospital, Canada
  • Elisa Pasini, Toronto General Hospital, Canada
  • Atul Humar, Toronto General Hospital,
  • Mamatha Bhat, Toronto General Hospital, Canada

Short Abstract: We performed an integrative computational analysis of publicly available gene expression data in human non-alcoholic steatohepatitis (NASH) from GEO. Pathways, networks, molecular interactions and functional analyses were generated using IPA. We discovered that HNF4A is the central gene in the network of NASH connected to metabolic diseases, and we show, for the first time to our knowledge, that HNF4A is central to the pathogenesis of NASH.

Session B-470: Integrating heterogeneous data using deep autoencoders for protein function prediction
COSI: Non-COSI
  • Vladimir Gligorijevic, Simons Center for Computational Biology, Flatiron Institute, United States
  • Meet Barot, Simons Center for Computational Biology, Flatiron Institute, United States
  • Richard Bonneau, Simons Center for Computational Biology, Flatiron Institute; NYU Departments of Biology and Computer Science, United States

Short Abstract: The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale nonlinear data representing different types of protein interactions. These types of data are more difficult to integrate with the standard methods of function prediction, which are often linear and unable to capture hierarchical, abstract features that are more indicative of protein function. Deep learning is a promising technique to deal with such problems, and has been shown to work well for several biological problems. Thus, we propose a method based on deep multimodal autoencoders to extract the features of proteins from multilayer molecular interaction networks. We apply this method on STRING networks to construct a common low-dimensional representation containing high-level protein features. We use different autoencoder architectures for handling different network modalities in the early layers of the autoencoder, later connecting all the architectures into a single bottlenecked layer from which we extract features to predict protein function for the yeast and human species. We compared the 5-fold cross validation predictive performance of our method with the state-of-the-art method, Mashup. Our results show that our method outperforms Mashup for both human and yeast STRING networks. We have also demonstrated the superior performance of our method in comparison to Mashup for predicting GO terms grouped into categories of varying specificity; i.e., we obtain micro-AUPR scores of 0.35 (Mashup: 0.29), 0.22 (Mashup: 0.20) and 0.52 (Mashup: 0.49) for predicting MF, BP and CC GO terms, respectively, belonging to the category of GO terms annotating between 11-30 human proteins.
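A minimal PyTorch sketch of the idea, with two input network modalities encoded separately and joined in a shared bottleneck; the layer sizes, depths and activations are placeholders, not the authors' architecture:

    import torch
    import torch.nn as nn

    class MultimodalAutoencoder(nn.Module):
        """Each modality has its own encoder/decoder; all are joined
        through a single shared low-dimensional bottleneck."""
        def __init__(self, dims=(1000, 800), hidden=256, bottleneck=64):
            super().__init__()
            self.encoders = nn.ModuleList(
                nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims)
            self.bottleneck = nn.Linear(hidden * len(dims), bottleneck)
            self.expand = nn.Linear(bottleneck, hidden * len(dims))
            self.decoders = nn.ModuleList(
                nn.Sequential(nn.ReLU(), nn.Linear(hidden, d)) for d in dims)
            self.hidden = hidden

        def forward(self, inputs):
            # encode each modality, concatenate, compress to shared features
            encoded = torch.cat([enc(x) for enc, x in zip(self.encoders, inputs)],
                                dim=1)
            z = self.bottleneck(encoded)   # shared protein features
            chunks = self.expand(z).split(self.hidden, dim=1)
            return z, [dec(c) for dec, c in zip(self.decoders, chunks)]

    # usage: z, recons = model([x_modality1, x_modality2]); train on a
    # reconstruction loss, then feed z to a function classifier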

Session B-472: DWCOACH: Predict protein complexes from dynamic weighted PPI networks by GO semantic similarity
COSI: Non-COSI
  • Xiaowu Sun, Information and Engineering Department, Capital Normal University, China
  • Lizhen Liu, Information and Engineering Department, Capital Normal University, China
  • Wei Song, Information and Engineering Department, Capital Normal University, China

Short Abstract: We propose a method called DWCOACH to predict protein complexes from dynamic weighted PPI (protein-protein interaction) networks by integrating multiple techniques. The method consists of three parts: the construction of dynamic networks, the computation of network weights and the prediction of protein complexes. Firstly, we combine gene expression data at different time points with a traditional static PPI network to construct dynamic sub-networks. This part improves the three-sigma method by optimizing the variance and mean for each protein based on parameter estimation and interval estimation. Secondly, to further filter out data noise, semantic similarity based on Gene Ontology (GO) is used as the network weight, together with principal component analysis (PCA), which is introduced to combine the weights computed by three traditional methods. Thirdly, the DWCOACH algorithm is applied to detect protein complexes, based on the “core-attachment” structural characteristic. DWCOACH selects proteins with a high weighted local clustering coefficient to construct cores; to expand a core, attachment proteins that satisfy the judgment condition are added to it; finally, since redundancies may be generated among the predicted protein complexes, the complexes are refined. Experimental results show that our method performs well in detecting complexes from dynamic weighted PPI networks.
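The first part can be illustrated with the basic three-sigma rule for deciding when a protein is active in a time-course; DWCOACH's refined variance and mean estimation is not reproduced here, this is only the baseline idea:

    import numpy as np

    def active_timepoints(expression, k=3):
        """Flag time points where one protein's expression exceeds
        mu + k*sigma over its own time-course (baseline three-sigma rule;
        DWCOACH refines the mu and sigma estimates)."""
        expr = np.asarray(expression, dtype=float)
        mu, sigma = expr.mean(), expr.std(ddof=1)
        return expr >= mu + k * sigma

    # the dynamic sub-network at time t then keeps only the proteins
    # flagged active at t, plus the static PPI edges between them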

Session B-473: Rfam: Growth and Improvements in the RNA Families Database
COSI: Non-COSI
  • Ioanna Kalvari, EMBL-EBI, United Kingdom
  • Joanna Argasinska, EMBL-EBI, United Kingdom
  • Natalia Quinones Olvera, EMBL-EBI, United Kingdom
  • Anton Petrov, EMBL-EBI, United Kingdom
  • Eric Nawrocki, National Institutes of Health, National Library of Medicine, United States
  • Rob Finn, EMBL-EBI,
  • Alex Bateman, EMBL-EBI, United Kingdom

Short Abstract: Rfam is a database of functional non-coding RNA families represented by multiple sequence alignments and consensus secondary structures. The sequence and structure information is used to build probabilistic models called covariance models that can find new instances of Rfam families in sequences and annotate genomes with non-coding RNAs. The Rfam website is available at http://rfam.xfam.org.

In the past year we continued the development of Rfam with three releases containing over 200 new RNA families. The Rfam website has been updated with a new unified search interface that allows searching by keywords, species names, RNA types, and more. All Rfam families have been analysed using the R-scape software, which identifies statistically significant basepairs supported by covariation and suggests alternative secondary structures that are consistent with the alignments. In order to make it easier to query the data in ways that are not supported by the website, we created a public MySQL database with the latest Rfam data. Work is underway on a major new version of Rfam, release 13.0, which will be built using a new sequence database based on a non-redundant genome collection maintained by UniProt. Rfam 13.0 will be available in late 2017.
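For example, the public MySQL database can be queried directly; the host, port and credentials below are those announced for the public read-only server at the time of writing and may change, and the table and column names are only assumed to match the Rfam schema:

    import pymysql  # assumes: pip install pymysql

    # public read-only connection details as documented by Rfam
    conn = pymysql.connect(host="mysql-rfam-public.ebi.ac.uk", port=4497,
                           user="rfamro", database="Rfam")
    with conn.cursor() as cur:
        # illustrative query: count families per RNA type
        cur.execute("SELECT type, COUNT(*) FROM family GROUP BY type")
        for rna_type, n in cur.fetchall():
            print(rna_type, n)
    conn.close()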

Rfam is continuously growing with the addition of new families and the development of new features. Multiple fixes have improved the quality of our data, and the transition to a genome-centric Rfam will not only reduce data redundancy but also enable meaningful taxonomic comparisons and frequent updates.

Session B-474: RNAcentral: The Unified Entry Point for Non-coding RNA Sequences
COSI: Non-COSI
  • Anton Petrov, EMBL-EBI, United Kingdom
  • Blake Sweeney, EMBL-EBI, United Kingdom
  • Boris Burkov, EMBL-EBI, United Kingdom
  • Natalia Quinones-Olvera, EMBL-EBI, United Kingdom
  • Simon Kay, EMBL-EBI, United Kingdom
  • Rob Finn, EMBL-EBI, United Kingdom
  • Alex Bateman, EMBL-EBI,

Short Abstract: RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that represents all types of ncRNA from a broad range of organisms. RNAcentral provides a single entry point for anyone interested in ncRNA biology by integrating the data from a consortium of RNA resources. The RNAcentral website is available at http://rnacentral.org.

RNAcentral currently contains over ten million ncRNA sequences from more than twenty RNA databases, such as miRBase, RefSeq, GtRNAdb and others. Recent updates include ncRNA data from HGNC, Ensembl, and FlyBase. We are also integrating RNAcentral with the Rfam database so that the majority of RNAcentral sequences are annotated with Rfam families.

There are three main ways of browsing the data through the RNAcentral website. The text search makes it easy to explore all ncRNA sequences, compare data across different resources, and discover what is known about each ncRNA. Using the sequence similarity search one can search data from multiple RNA databases starting from a sequence. Finally, one can explore ncRNAs in select species by genomic location using an integrated genome browser.

RNAcentral continues to grow, with an additional one million new non-coding RNA sequences added to the database in 2016. The website has been continuously improved, including a redesigned homepage and more relevant search results. Our immediate priorities include the incorporation of functional annotations of non-coding RNAs, such as intermolecular interactions, nucleotide modifications, and high-quality secondary structures. The ultimate goal of RNAcentral is to include curated information about all non-coding RNAs as UniProt does for proteins.

Session B-475: Structure-based prediction of protein-peptide binding regions using Random Forest
COSI: Non-COSI
  • Ghazaleh Taherzadeh, Griffith University, Australia

Short Abstract: Protein-peptide interactions are among the most important biological interactions and play a crucial role in many diseases, including cancer. However, only a small portion of proteins has known complex structures, and experimental determination of protein-peptide interactions is costly and inefficient. Thus, predicting peptide-binding sites computationally will be useful to improve the efficiency and cost-effectiveness of experimental studies. Here, we established a machine learning method called SPRINT-Str (Structure-based prediction of protein-Peptide Residue-level Interaction) that uses structural information for predicting protein-peptide binding residues. These predicted binding residues are then employed to infer the peptide-binding site by a clustering algorithm.
SPRINT-Str achieves robust and consistent results for the prediction of protein-peptide binding regions in terms of both residues and sites. The Matthews Correlation Coefficient (MCC) for 10-fold cross-validation and the independent test set is 0.27 and 0.293, respectively, with corresponding Area Under the Curve (AUC) values of 0.775 and 0.782. The prediction outperforms other state-of-the-art methods, including our previously developed sequence-based method. A further spatial-neighbor clustering of predicted binding residues leads to prediction of binding sites at 20%-116% higher coverage than the next best method at all precision levels in the test set. The application of SPRINT-Str to protein binding with DNA, RNA, and carbohydrates confirms the method’s capability of separating peptide-binding sites from other functional sites. More importantly, similar performance in the prediction of binding residues and sites is obtained when experimentally determined structures are replaced by unbound structures or quality model structures built from homologs, indicating its wide applicability.

Session B-476: A sparse latent regression approach for integrative analysis of glycomic and glycotranscriptomic data
COSI: Non-COSI
  • Xuefu Wang, Indiana University, United States

Short Abstract: We present a Bayesian sparse latent regression (BSLR) model for predicting quantitative glycan abundances from glycotranscriptomic data. The model is built using matched glycomic and glycotranscriptomic data collected from the same samples, and is then exploited to infer common properties among training samples and to predict these properties (e.g., the glycan abundances) in similar samples for which only glycotranscriptomic data are available. The BSLR model assumes the glycan and glycotranscriptomic abundances are both modulated by a small number of independent latent variables, and thus can be constructed using only a relatively small number of training samples. We further employ a Bayesian learning algorithm to promote sparse models with fewer parameters linking latent variables to glycans and glycan-synthesis genes.
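In LaTeX notation, the shared latent variable assumption can be sketched as follows; the symbols are ours, chosen for illustration only:

    t = A h + \varepsilon, \qquad g = B h + \delta

where t denotes the glycotranscriptomic abundances, g the glycan abundances, h a low-dimensional vector of latent variables, and A and B sparse loading matrices whose entries the Bayesian learning algorithm drives toward zero. Given only t for a new sample, one infers the posterior of h and predicts g through B.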

Session B-478: Using genomic analysis to identify tomato Tm-2 resistance-breaking mutations and their underlying evolutionary path in a new and emerging tobamovirus
COSI: Non-COSI
  • Yonatan Maayan, ARO Volcani, Israel
  • Eswari Pandaranayaka, ARO Volcani, Israel
  • Moshe Lapidot, ARO Volcani, Israel
  • Ilan Levin, ARO Volcani, Israel
  • Aviv Dombrovsky, ARO Volcani, Israel
  • Arye Harel, Volcani Center, ARO, Israel

Short Abstract: Recently, a new tobamovirus able to break the Tm-2-mediated resistance that has protected tomato for over 40 years was discovered in Israel. Following isolation and sequencing, the virus was found to be Tomato brown rugose fruit virus (ToBRFV), a new tobamovirus recently identified in Jordan. Previous studies on mutations causing resistance breaking, including breaking of Tm-2-mediated resistance, demonstrated that this phenotype was mediated by only a few mutations. Identification of such residues in resistance breakers is hindered by the significant background resulting from approximately 10% differences in their genomic sequences compared to known species. To understand the evolutionary path leading to the emergence of this resistance breaker, we utilized a comprehensive phylogenetic analysis and genomic comparison of tobamovirus species, followed by molecular modelling of its viral helicase. Our phylogenetic analysis highlights the location of the resistance-breaker genes within a host-shifting inter-clade, which, together with a relatively low mutation rate, suggests a similar evolutionary path for the emergence of this new species. Our comparative genomic analysis identified 5 potential resistance-breaking mutations in the viral movement protein (MP), the primary target of the related Tm-2 resistance, and 2 in its helicase. Finally, molecular modelling of the helicase enabled the identification of 2 additional resistance-breaking mutations.

Session B-502: On the feasibility of mining CD8+ T-cell receptor patterns underlying immunogenic peptide recognition
COSI: Non-COSI
  • Nicolas De Neuter, University of Antwerp, Belgium
  • Wout Bittremieux, University of Antwerp, Belgium
  • Charlie Beirnaert, University of Antwerp, Belgium
  • Bart Cuypers, University of Antwerp, Belgium
  • Aida Mrzic, University of Antwerp, Belgium
  • Pieter Moris, University of Antwerp, Belgium
  • Arvid Suls, University of Antwerp, Belgium
  • Viggo Van Tendeloo, University of Antwerp, Belgium
  • Benson Ogunjimi, University of Antwerp, Belgium
  • Kris Laukens, University of Antwerp, Belgium
  • Pieter Meysman, University of Antwerp, Belgium

Short Abstract: Current T-cell epitope prediction tools are a valuable resource in designing targeted immunogenicity experiments. They typically focus on, and can accurately predict, peptide binding and presentation by major histocompatibility complex (MHC) molecules on the surface of antigen-presenting cells. However, recognition of the peptide-MHC complex by a T-cell receptor is often not included in these tools. We developed a classification approach based on random forest classifiers to predict recognition of a peptide by a T-cell and to discover patterns that contribute to recognition. We considered two approaches to this problem: (1) distinguishing between two sets of T-cell receptors that each bind to a known peptide and (2) retrieving T-cell receptors that bind to a given peptide from a large pool of T-cell receptors. Evaluation of the models on two HIV-1, B*08-restricted epitopes reveals good performance and hints at structural CDR3 features that can determine peptide immunogenicity. These results are of particular importance as they show that predicting T-cell epitopes and their recognition from sequence data is feasible. In addition, the validity of our models not only serves as a proof of concept for the prediction of immunogenic T-cell epitopes but also paves the way for more general and high-performing models.
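A minimal sketch of approach (1) with scikit-learn, using simple 2-mer composition features over CDR3 sequences; the featurization and hyperparameters are illustrative, not those of the published models:

    from itertools import product
    from sklearn.ensemble import RandomForestClassifier

    AAS = "ACDEFGHIKLMNPQRSTVWY"
    KMERS = ["".join(p) for p in product(AAS, repeat=2)]

    def featurize(cdr3):
        """2-mer composition of a CDR3 amino acid sequence (a toy feature
        set; richer sequence features would be used in practice)."""
        return [sum(cdr3[i:i + 2] == k for i in range(len(cdr3) - 1))
                for k in KMERS]

    # cdr3_seqs: CDR3 strings; labels: which of two known epitopes each
    # receptor binds (placeholder training data)
    cdr3_seqs = ["CASSLAPGATNEKLFF", "CASSIRSSYEQYF"]
    labels = [0, 1]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit([featurize(s) for s in cdr3_seqs], labels)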

