Accepted Posters

If you need assistance, please contact submissions@iscb.org and provide your poster title or submission ID.


Track: Other

Session A-480: Fast accurate sequence alignment using Maximum Exact Matches
COSI: Non-COSI
  • Arash Bayat
  • Aleksandar Ignjatovic
  • Bruno Gaeta
  • Sri Parameswaran

Short Abstract: While efficient in general, sequence alignment by dynamic programming is still costly in time and memory when aligning very large sequences. We describe MEM-Align, an alignment algorithm that operates on Maximal Exact Matches (MEMs) between two sequences instead of processing every symbol individually. In its original definition, MEM-Align is guaranteed to find the optimal alignment, but its execution time is not manageable unless optimisations are applied that decrease its accuracy. However, these optimisations can be configured to balance speed and accuracy. The resulting algorithm outperforms existing solutions for the alignment of reads to a reference following BWT search.
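
As an editorial illustration of the object MEM-Align operates on, the sketch below enumerates all Maximal Exact Matches between two short sequences by dynamic programming over match-run lengths. This O(nm) toy is for exposition only; the actual algorithm relies on far more efficient index-based machinery.

```python
# Illustrative enumeration of Maximal Exact Matches (MEMs) between two
# sequences via dynamic programming over match-run lengths. Not the
# authors' implementation; MEM-Align uses much faster indexing.

def find_mems(a: str, b: str, min_len: int = 2):
    """Return (start_a, start_b, length) for each maximal exact match."""
    n, m = len(a), len(b)
    # run[i][j] = length of the exact match ending at a[i-1], b[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    mems = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                # A match run is maximal when it cannot be extended right;
                # left-maximality is implicit in how the run length resets.
                right_blocked = (i == n or j == m or a[i] != b[j])
                if right_blocked and run[i][j] >= min_len:
                    length = run[i][j]
                    mems.append((i - length, j - length, length))
    return mems

print(find_mems("ACGTACGT", "TTACGTAA", min_len=3))  # [(0, 2, 5), (3, 1, 5)]
```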

Session B-335: Digital assisted curation to the rescue of traditional literature curation for life-science databases
COSI: Non-COSI
  • Fabio Rinaldi, Swiss Institute of Bioinformatics and University of Zurich, Switzerland
  • Socorro Gama, Center for Genomic Sciences - UNAM, Mexico
  • Yalbi Itzel Balderas-Martínez, Facultad de Ciencias, UNAM, Mexico
  • Oscar Lithgow, Center for Genomic Sciences - UNAM, Mexico
  • Hilda Solano Lira, Center for Genomic Sciences - UNAM, Mexico
  • Mishael Sánchez-Pérez, Center for Genomic Sciences - UNAM, Mexico
  • Alejandra Lopez-Fuentes, Center for Genomic Sciences - UNAM, Mexico
  • Luis Muñiz-Rascado, Center for Genomic Sciences - UNAM, Mexico
  • Cecilia Ishida, Center for Genomic Sciences - UNAM, Mexico
  • Carlos-Francisco Méndez-Cruz, Center for Genomic Sciences - UNAM, Mexico
  • Alberto Santos-Zavaleta, Center for Genomic Sciences - UNAM, Mexico
  • Julio Collado-Vides, Center for Genomic Sciences - UNAM, Mexico

Short Abstract: Faced with decreasing financial resources, life science databases struggle to keep pace with the constantly increasing amount of published results. Traditional approaches based on careful human review of published papers guarantee a high quality of database entries and cannot easily be replaced by automated technologies, but they are slow and not cost-effective. Several technologies derived from the field of natural language processing promise to support the search for information in textual resources, and they have been applied to the scientific literature for many years. In the life science domain in particular, several community-organized evaluation campaigns carried out in the past few years have shown a steady improvement in the capabilities of these systems. However, there is still widespread skepticism about the possibility of using such tools in a curation pipeline. We argue that although text mining tools on their own would not be easily usable in a curation pipeline, their integration in a supportive environment can lead to a remarkable increase in the efficiency of the curation process, and we support this point with recent digitally assisted curation experiments in the context of an established bacterial database and a recently initiated curation effort for an important human disease.

Session B-337: A workflow for accurate neoantigen discovery using NGS data
COSI: Non-COSI
  • Ognjen Milicevic, Seven Bridges, Serbia
  • Vladimir Kovačević, Seven Bridges, Serbia
  • Ana Mijalkovic Lazic, Seven Bridges, Serbia
  • Nikola Skundric, Seven Bridges, Serbia
  • Nevena Ilic Raicevic, Seven Bridges, Serbia
  • Milica Kojicic, Seven Bridges, Serbia
  • Jack Digiovanna, Seven Bridges, United States

Short Abstract: Neoantigens are proteins presented on the surface of cancer cells that are recognized by the immune system. Multiple novel therapeutic approaches involve identifying neoantigens and using them to trigger immunity-induced tumor regression. Seven Bridges has developed a workflow for neoantigen discovery using NGS data, which analyzes tumor-normal pairs of whole exome sequencing samples together with tumor gene expression data in order to output candidate epitopes for neoantigens. The proposed Neoantigen workflow consists of two parts: Total calling and Neoantigen prediction. Total calling performs read alignment and preparation of aligned files (using ABRA, an assembly-based realigner), as well as germline variant calling (using FreeBayes) and somatic variant calling (using Strelka and VarDict). After variants are called and merged into one file, the workflow performs variant phasing, i.e., determining which variants lie on the same chromosome copy. The purpose of this step is to accurately reconstruct the nucleotide sequence that is translated into the neoantigen candidate. Next, RNA reads are processed with the Salmon and RSEM tools, which enable high-precision quantification of transcripts; they output a list of gene isoforms and corresponding RNA expression scores. Neoantigen prediction consists of several tools: protein extraction tools; OptiType, which calculates the Human Leukocyte Antigen (HLA) type; and NetCTLpan, which predicts epitopes. The protein extraction tools, developed by Seven Bridges, extract the nucleotide sequences around the positions of the somatic variants and translate them to proteins (a toy version of this step is sketched below). During this process the protein extraction tools take into account a full range of complex changes not currently handled by published workflows: changes from both germline and somatic variants (SNPs and indels), mutations (SNPs or indels) within the STOP codon (nonstop), and mutations that create STOP codons (nonsense). The extracted protein sequence and identified HLA type are used by the epitope prediction tool NetCTLpan to compute confidence-ranked epitope candidates. Finally, the confidence-ranked neoantigen candidates, HLA types, variant information, and RNA expression scores are merged and prioritized in the Analyse epitopes tool, also developed by Seven Bridges. Non-proprietary components of this portable and reproducible workflow will be publicly available on the Seven Bridges Platform, enabling rapid identification of patient-specific neoantigen candidates.
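
To make the protein-extraction step concrete, here is a minimal sketch (not the Seven Bridges tool itself): it applies a single somatic SNV to a toy coding sequence and translates the surrounding window with Biopython. Frame-shifting indels, phased germline variants, and the nonstop/nonsense cases described above are deliberately omitted.

```python
# Illustrative sketch of the protein-extraction idea: mutate a reference
# coding window and translate both alleles. Requires Biopython.
from Bio.Seq import Seq

def mutant_peptide(coding_seq: str, pos: int, alt: str, flank_codons: int = 4):
    """Return (ref_peptide, alt_peptide) around a 0-based SNV position."""
    codon_idx = pos // 3
    start = max(0, (codon_idx - flank_codons) * 3)
    end = min(len(coding_seq), (codon_idx + flank_codons + 1) * 3)
    ref_window = coding_seq[start:end]
    alt_seq = coding_seq[:pos] + alt + coding_seq[pos + 1:]  # apply the SNV
    alt_window = alt_seq[start:end]
    return str(Seq(ref_window).translate()), str(Seq(alt_window).translate())

cds = "ATGGCTGAAAAACTGACTGGCCGTATCGAA"  # toy 10-codon CDS
print(mutant_peptide(cds, 10, "T"))     # ('MAEKLTGR', 'MAEILTGR')
```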

Session B-339: Elucidation of time-dependent systems biology cell response patterns with time course network enrichment
COSI: Non-COSI
  • Christian Wiwie, Department of Mathematics and Computer Science, University of Southern Denmark, Denmark
  • Richard Röttger, University of Southern Denmark, Denmark
  • Jan Baumbach, University of Southern Denmark, Denmark

Short Abstract: Advances in OMICS technologies emerged both massive expression data sets and huge networks modeling the molecular interplay of genes, RNAs, proteins and metabolites. Network enrichment methods combine these two data types to extract subnetwork responses from case/control setups. However, no methods exist to integrate time series data with networks, thus preventing the identification of time-dependent systems biology responses. We close this gap with Time Course Network Enrichment (TiCoNE). It combines a new kind of human-augmented clustering with a novel approach to network enrichment. It finds temporal expression prototypes that are mapped to a network and investigated for enriched prototype pairs interacting more often than expected by chance. Such patterns of temporal subnetwork co-enrichment can be compared between different conditions. With TiCoNE, we identified the first distinguishing temporal systems biology profiles in time series gene expression data of human lung cells after infection with Influenza and Rhino virus. TiCoNE is available online (https://ticone.compbio.sdu.dk) and as Cytoscape app.
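
A toy stand-in for the pair test described above, assuming a simple permutation scheme rather than TiCoNE's actual statistic: given a network and two node sets (prototypes), it asks whether the observed number of connecting edges exceeds the chance expectation.

```python
# Minimal sketch of prototype-pair co-enrichment: do nodes of two clusters
# interact more often than expected under random node assignment?
import random

def edge_count(edges, set_a, set_b):
    return sum((u in set_a and v in set_b) or (u in set_b and v in set_a)
               for u, v in edges)

def pair_enrichment_pvalue(edges, nodes, set_a, set_b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = edge_count(edges, set_a, set_b)
    ge = 0
    labels = list(nodes)
    for _ in range(n_perm):
        rng.shuffle(labels)  # random reassignment of nodes to prototypes
        perm_a = set(labels[:len(set_a)])
        perm_b = set(labels[len(set_a):len(set_a) + len(set_b)])
        if edge_count(edges, perm_a, perm_b) >= observed:
            ge += 1
    return (ge + 1) / (n_perm + 1)  # permutation P-value

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(pair_enrichment_pvalue(edges, range(5), {0, 1}, {2, 3}))
```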

Session B-341: Comparative genomics analysis of human gut microbiome demonstrated broad distribution of metabolic pathways for mucin glycans foraging
COSI: Non-COSI
  • Dmitry Ravcheev, University of Luxembourg, Luxembourg
  • Ines Thiele, University of Luxembourg, Luxembourg

Short Abstract: Mucins are heavily glycosylated, high-molecular-weight proteins produced by the epithelium of most animals. In the human intestine, mucins are responsible for forming the mucus layer. Recent findings demonstrated that alterations in mucin glycoconjugates (MGC) impact the composition of the human gut microbiota (HGM). Here, we present a systematic analysis of HGM-encoded systems for the degradation of MGC. We applied genomic analysis to 397 genomes of HGM microorganisms belonging to the phyla Actinobacteria, Bacteroidetes, Euryarchaeota, Firmicutes, Fusobacteria, Proteobacteria, Synergistetes, Tenericutes, and Verrucomicrobia. Gene functions were annotated on the PubSEED platform (http://pubseed.theseed.org) using available literature data, protein sequence similarity, protein domain structure, and genome-context-based approaches, including gene chromosomal clustering and phyletic patterns. We analyzed genes required for the degradation of MGC to monosaccharides as well as genes for the utilization of these monosaccharides (fucose, galactose, N-acetylgalactosamine, N-acetylglucosamine, and N-acetylneuraminic acid) as carbon and energy sources. Genes for the utilization of one or more monosaccharides were found in 369 (93%) of the studied genomes. In addition to previously known genes involved in MGC degradation, we predict four non-orthologous replacements for enzymes and four novel transport systems for MGC-derived monosaccharides. The analysis of genes for the utilization of multiple monosaccharides in a large number of co-inhabiting organisms revealed the following roles of the gut microbial community in MGC foraging. First, different monosaccharides showed distinct distribution patterns across the analyzed genomes, which correlated with the distribution of these monosaccharides in nature and particularly within the human intestine. Second, 339 genomes encoded only partial pathways, i.e., either the glycosyl hydrolases (GHs) for cleavage of a monosaccharide from MGC or the catabolic pathway for the utilization of a monosaccharide. Based on these pathways, we propose that exchange pathways for MGC-derived monosaccharides exist within the HGM; consistently, we show that 338 (85%) of the analyzed genomes may be involved in such exchanges. Third, the analysis of MGC-degrading GHs allowed us to predict the ability of each analyzed microorganism to degrade specific types of MGC. Finally, we predict so-called beneficial pairs of organisms, i.e., pairs that together can utilize specific MGCs that neither microbe can degrade alone; 325 (82%) of the analyzed genomes are capable of forming such pairs. We demonstrate that the HGM community is highly adapted to the utilization of MGCs as sources of carbon and energy and suggest that this adaptation may be a consequence of co-evolution.

Session B-343: Alpha and Omega of Darwinian selection: disentangling the two in codon models
COSI: Non-COSI
  • Iakov Davydov, University of Lausanne, Switzerland
  • Nicolas Salamin, University of Lausanne, Switzerland
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland

Short Abstract: The fixation of mutations in genes is due to a balance of selection, mutation, and drift. Codon models have proven very useful in distinguishing selection, including positive selection, from drift. Synonymous substitution rates are assumed to capture all variation that is not under selection, and thus the ratio of non-synonymous (dN) to synonymous (dS) substitutions should indicate selection. There are many models for gene-wide identification of positive selection that allow selection (and thus dN) to vary across the gene, but dS is usually assumed to be constant over all positions of a gene. Yet significant variation of dS has been observed within genes. We have developed a simple new model that accounts for variation in the codon substitution rate in addition to variation in amino acid selection levels. Our approach introduces rate variation as a single parameter, thus not inflating the number of model parameters. We introduce codon rate variation into models with amino acid selection variation (M8 of PAML) and branch-site selection variation (see the schematic rate matrix below). We use a simulated dataset to assess the models' statistical performance and show that our new models work well both in the absence and in the presence of rate variation in the data. While the increase in model complexity comes at a computational cost, our models remain computationally tractable and useful even for large datasets. We provide an implementation in Go at https://bitbucket.org/Davydov/godon/. We use our improved positive selection model to scan genome-scale real data in two clades, vertebrates and Drosophila. We show that the data provide strong support for synonymous rate variation, and we demonstrate that positive selection inference is strongly affected by model choice: a majority of the positive selection predictions given by the model without rate variation are not supported by our models. We therefore hypothesize that a large proportion of the positive selection detected by models agnostic to codon rate variation consists of false positives caused by violations of model assumptions. Finally, we study how different biological factors affect codon rate variation as estimated by the model. We demonstrate that codon rate variation is correlated with gene expression levels, recombination rate, and GC-content. We show that our new models are able to capture rate variation caused by synonymous selection acting at the nucleotide level, for example detecting strong synonymous selection in the proximity of intron splice sites. In conclusion, our new models capture important biological information about gene evolution, force a reconsideration of the detection of positive selection, and are computationally tractable for genome scans.
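
For orientation, the schematic below writes a standard Goldman-Yang-style codon rate matrix with a site-specific rate multiplier (rho) added alongside the selection parameter (omega). This is a generic formulation of the idea; the authors' exact parameterization may differ.

```latex
% Schematic instantaneous rates for codons i -> j differing by one
% nucleotide, with a site rate multiplier \rho alongside selection \omega:
\[
q_{ij} \propto \rho \cdot
\begin{cases}
\pi_j               & \text{synonymous transversion,}\\
\kappa \pi_j        & \text{synonymous transition,}\\
\omega \pi_j        & \text{non-synonymous transversion,}\\
\omega \kappa \pi_j & \text{non-synonymous transition,}
\end{cases}
\qquad q_{ij} = 0 \text{ otherwise.}
\]
```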

Session B-345: Prediction of drug-drug interactions using molecular structure information and link prediction approaches
COSI: Non-COSI
  • Eunyoung Kim, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea
  • Hojung Nam, School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), South Korea

Short Abstract: As the concurrent use of multiple medications has increased, characterizing adverse or synergistic interactions among drugs has become crucial. Identification of drug-drug interactions (DDIs) matters both because DDIs may cause severe unexpected adverse reactions and because they can reveal combination drug candidates acting as synergistic pairs. However, DDI identification through in vivo or in vitro experiments is extremely time-consuming and expensive. Therefore, many researchers have tried to develop in silico prediction models for DDIs, which can significantly reduce time and effort. In this study, we developed a DDI prediction model using molecular structure information and similarity scores obtained from missing-link prediction algorithms. The currently known DDI network is sparsely connected, and a large number of unknown interactions remain to be identified. We adopted missing-link prediction algorithms to predict the likelihood of a future association between drugs, treating nodes as drugs and missing links as unknown interactions. Moreover, compound structural similarity was used as an additional feature to represent the relationship. Features combining structural similarity and similarity scores derived from various link prediction algorithms were computed over known DDIs (the feature construction is sketched below), and prediction models were trained with the Random Forest algorithm. The constructed model achieved high performance in terms of AUC, AUPR, and accuracy.
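
A minimal sketch of this feature construction, assuming common-neighbour and Jaccard scores as the link-prediction features and a mock Tanimoto matrix for structural similarity; the study's actual feature set is broader.

```python
# Sketch: link-prediction scores on the known DDI network plus a structural
# similarity score form the feature vector of each drug pair; a Random
# Forest is trained on known vs. unknown pairs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(adj, sim, i, j):
    ni, nj = adj[i].astype(bool), adj[j].astype(bool)
    common = np.sum(ni & nj)                       # shared interaction partners
    union = np.sum(ni | nj)
    jaccard = common / union if union else 0.0
    return [common, jaccard, sim[i, j]]

rng = np.random.default_rng(0)
n = 50
adj = (rng.random((n, n)) < 0.1).astype(int)
adj = np.triu(adj, 1); adj += adj.T                # symmetric mock DDI network
sim = rng.random((n, n)); sim = (sim + sim.T) / 2  # mock Tanimoto similarities

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
X = np.array([pair_features(adj, sim, i, j) for i, j in pairs])
y = np.array([adj[i, j] for i, j in pairs])        # 1 = known interaction
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:5])[:, 1])              # interaction likelihoods
```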

Session B-347: Ultra-sensitive n-plexed protein quantification by a model-based reconstruction method
COSI: Non-COSI
  • Kyowon Jeong, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Yeon Choi, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Joon-Won Lee, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Sangtae Kim, Pacific Northwest National Laboratory, United States
  • Jae Hun Jung, Department of Applied Chemistry, Kyung Hee University, South Korea
  • Young-Suk Lee, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Kwang Pyo Kim, Department of Applied Chemistry, Kyung Hee University, South Korea
  • V. Narry Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea
  • Jong-Seo Kim, Center for RNA Research, Institute for Basic Science; School of Biological Sciences, Seoul National University, South Korea

Short Abstract: Isotopic-labeling-based protein quantification has advantages over other approaches, such as accurate quantity ratios and reduced technical bias. However, conventional isotopic labeling schemes (e.g., SILAC) have limited multiplexity (≤3-plex). Although attempts have been made to increase multiplexity, they require either ultrahigh-resolution instruments or complicated/expensive labeling schemes. Also, thorough evaluation of quantification has rarely been performed, or the number of proteins quantified in all labels was often insufficient for most applications. We present a model-based reconstruction method called EPIQ (Epic Protein Integrative Quantification) that enables ultra-sensitive n-plexed protein quantification. EPIQ allows deuterium-based isotopic labeling and small mass differences between labels (≥2 Da). Such labels make the XICs (eXtracted-Ion Chromatograms) from distinct labels hard to separate; they have different retention times (due to the deuterium effect) and mutual interference (due to overlapping isotope clusters). EPIQ is based on a generative model that describes how the observed XICs are generated from the labeled peptide ions of the same species. The model assumes the observed XICs are generated by superimposing signal components (XICs from labeled peptide ions) as well as noise components (coelution or flat intensity noise). Given an identified PSM (Peptide-Spectrum Match), it predicts the retention time, isotope distribution, and XIC shapes of the labeled peptide ions. By integrating these predictions, the signal and noise components in the generative model are predicted, and EPIQ reconstructs the observation from the predicted components (a toy decomposition is sketched below). As a result, it successfully separates XICs from distinct labels and performs accurate quantification with a low limit of quantification (LOQ). To test the quantification performance of EPIQ, we developed deuterium-based 6-plexed labeling. An unfractionated labeled HeLa sample with ratio 30:20:10:1:5:10 was subjected to LC-MS/MS (Q-Exactive). EPIQ reported ~3,000 proteins with a median quantity ratio of 30.4:21.3:10:1.1:4.1:10.1. In ~70% of the cases, the ratios (to the first label) fell within a 2-fold change of the input ratio. To benchmark against other state-of-the-art tools, we adopted 13C-based 3-plexed labeling. A sample with a known ratio (HeLa, 1:10:20) and a biological sample (Xenopus early embryo) were analyzed by EPIQ and other tools. While all tools reported comparable numbers of proteins, the ratios from tools other than EPIQ were severely biased, especially for low-abundance peptides. These results demonstrate that EPIQ achieves a lower LOQ than other tools. As EPIQ allows higher multiplexity, we are currently developing further chemical/metabolic labeling schemes (≥8-plex). EPIQ could facilitate various biological applications (e.g., cell dynamics studies or sensitive detection of differentially expressed proteins).
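
A conceptual toy version of the reconstruction step, assuming simple Gaussian component shapes and using non-negative least squares in place of EPIQ's full generative model (which predicts retention-time shifts and isotope clusters).

```python
# Sketch: decompose an observed chromatogram into a non-negative
# superposition of predicted per-label components plus a flat noise term.
import numpy as np
from scipy.optimize import nnls

t = np.linspace(0, 10, 200)
def gauss(mu, sigma=0.5):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2)

# Predicted XIC shapes of 3 labels, slightly shifted (deuterium effect),
# plus a flat noise column.
components = np.column_stack([gauss(4.0), gauss(4.3), gauss(4.6),
                              np.ones_like(t)])
true_amounts = np.array([30.0, 10.0, 5.0, 0.2])
observed = components @ true_amounts \
    + 0.05 * np.random.default_rng(1).normal(size=t.size)

amounts, residual = nnls(components, observed)
print(np.round(amounts, 2))   # recovered label quantities (+ noise level)
```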

Session B-349: Epistatic SNP pair analysis for lung cancer
COSI: Non-COSI
  • Jairo Rocha, University of Balearic Islands, Spain
  • Jaume Sastre, University of the Balearic Islands, Spain
  • Emidio Capriotti, University of Bologna, Italy

Short Abstract: Using the TCGA (The Cancer Genome Atlas) genomes for lung cancer and the 1000 Genomes Project data, we analyze all possible SNP pairs to find synergistic epistasis. We look for associations between SNPs (and other mutations) and tumor occurrence by comparing tumor samples with normal-tissue samples from the same subjects. In addition, we search for SNP pairs in the normal tissue of subjects with lung cancer using the 1000 Genomes database as control. The models used are logistic regression and independence tests of two-way and three-way contingency tables (a toy version of a pairwise test is shown below). Preliminary tests show that the synergy is very difficult to detect, as high significance is needed given the large number of pairs considered. The small set of pairs found is being analyzed for biological significance.
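
A toy version of one such pairwise test, assuming a simple 2x2 Fisher exact test of joint carrier status against case/control labels; the study additionally uses logistic regression and three-way contingency tables.

```python
# Sketch: test whether carrying both minor alleles of a SNP pair is
# associated with disease status, on simulated data with a synergistic
# (epistatic) risk effect.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n = 2000
snp1 = rng.integers(0, 2, n)              # carrier status at SNP 1
snp2 = rng.integers(0, 2, n)              # carrier status at SNP 2
both = (snp1 & snp2).astype(bool)
# Simulated synergistic effect: higher risk only when both are carried
case = rng.random(n) < np.where(both, 0.30, 0.10)

table = [[np.sum(both & case),  np.sum(both & ~case)],
         [np.sum(~both & case), np.sum(~both & ~case)]]
odds_ratio, p = fisher_exact(table)
print(f"OR={odds_ratio:.2f}, P={p:.2e}")
```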

Session B-353: A Consensus of Molecular Subgroups in Medulloblastoma
COSI: Non-COSI
  • Tanvi Sharma, German Cancer Research Centre (DKFZ), Germany
  • Ed Schwalbe, Newcastle University, United Kingdom
  • Paul Northcott, St. Jude's, USA
  • Volker Hovestadt, Broad Institute, USA
  • Dan Williamson, Newcastle University, United Kingdom
  • Steve Clifford, Newcastle University, United Kingdom
  • Stefan Pfister, German Cancer Research Centre (DKFZ), Germany
  • Lukas Chavez, German Cancer Research Centre (DKFZ), Germany

Short Abstract: Medulloblastoma is a highly malignant childhood brain tumor type. Recently, several independent studies have stratified Medulloblastoma tumors into distinct molecular subgroups. Due to differences in patient cohorts and applied analytical methods, this has created confusion regarding the definitive recognition and comparability of the relevant subgroups. Consequently, this study aims to establish a consensus of clinically and genetically relevant molecular subgroups in Medulloblastoma. We combined DNA methylation array and gene expression data across different patient cohorts, complemented by 194 novel cases, resulting in the largest Medulloblastoma sample cohort ever compiled (1,845 samples with DNA methylation and 392 samples with matched DNA methylation and gene expression data). We have subjected this cohort to all previously employed methods, including t-distributed stochastic neighbor embedding (t-SNE) followed by DBSCAN, non-negative matrix factorization (NMF), and similarity network fusion (SNF). Our preliminary results indicate that the previously proposed Medulloblastoma subgroups largely overlap regardless of the applied analytical method. By combining DNA methylation and gene expression data, fewer samples are required to derive otherwise largely similar clusters than with DNA methylation data alone. Our plan of action involves integrating mutations and clinical features to finally establish a consensus on Medulloblastoma stratification into clinically relevant subgroups.

Session B-355: Impact of tissue architecture on the nature and predictability of tumour evolution
COSI: Non-COSI
  • Robert Noble, Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  • John Burley, Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States
  • Michael Hochberg, Institut des Sciences de l’Evolution, University of Montpellier, France

Short Abstract: Intra-tumour genetic heterogeneity is a product of evolution in spatially structured populations of cells. Whereas genetic heterogeneity has been proposed as a prognostic biomarker in cancer, its spatially dynamic nature makes accurate prediction of tumour progression challenging. We use a novel computational model of cell proliferation, competition, mutation and migration to assess when and how genetic diversity is predictive of tumour growth and evolution. We characterize how tissue architecture (cell-cell competition and cell migration) influences the potential for subclonal population growth, the prevalence of clonal sweeps, and the resulting pattern of intra-tumour heterogeneity. We further compare the accuracy of cancer growth forecasts generated using different virtual biopsy sampling strategies, in different tissue types, and when cancer evolution is characteristically neutral or non-neutral. We thus determine the conditions under which genetic diversity is most predictive of future tumour states. Our findings help explain the multiformity of tumour evolution and contribute to establishing a theoretical foundation for predictive oncology.

Session B-357: FRICTION: validated, quantitative immune cell type deconvolution
COSI: Non-COSI
  • Aaron Wise, Illumina, USA
  • Alex So, Illumina, USA
  • Joyee Yao, Illumina, USA
  • Shannon Kaplan, Illumina, USA
  • Shile Zhang, Illumina, USA

Short Abstract: Recent work has demonstrated the value of understanding the tumor microenvironment for its impact on tumor progression and immunotherapy efficacy. Computational tools based on gene expression data have shown promise in their ability to deconvolve the tumor microenvironment and report the types of immune cells present in heterogeneous tumor samples. Here we present FRICTION, a new algorithm for cell type deconvolution. While many state-of-the-art deconvolution approaches report relative fractions or statistical enrichment, we focus on the careful selection and normalization of genes to better detect the absolute fractional level of cell types. To enable this, we developed a novel gene selection method that combines statistical properties of gene expression with the expression's ability to discriminate different cell types. Furthermore, we normalized against expression levels from over 10 different control tissues to ensure robustness across many tissue backgrounds. FRICTION combines our gene selection and normalization techniques with a support-vector-regression-based approach to deconvolution (the general idea is sketched below). FRICTION has been trained to detect three cell types: CD8+ T, CD4+ T, and CD19+ B cells. We have validated the technique using spike-in cell titrations and IHC (immunohistochemistry) staining of FFPE (formalin-fixed, paraffin-embedded) tumor samples. The titration experiments demonstrate our method's linearity in a variety of tissue backgrounds (median R2 > 0.97), with high reproducibility among both technical and biological replicates. The IHC staining experiments demonstrate our method's ability to differentiate CD4 high/low status in tumor samples. FRICTION represents an important step towards delivering validated, quantitative cell type deconvolution results.
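
A generic sketch of support-vector-regression-based deconvolution, the family of approaches FRICTION builds on; the signature matrix, gene selection, and normalization steps that are central to FRICTION are replaced here by mock data.

```python
# Sketch: regress a mixture expression profile onto a cell-type signature
# matrix with linear nu-SVR and read cell-type levels off the coefficients.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
n_genes, cell_types = 200, 3          # e.g. CD8+ T, CD4+ T, CD19+ B
signature = rng.lognormal(size=(n_genes, cell_types))   # mock signatures
true_fracs = np.array([0.2, 0.1, 0.05])
mixture = signature @ true_fracs + 0.05 * rng.normal(size=n_genes)

svr = NuSVR(kernel="linear", nu=0.5, C=1.0).fit(signature, mixture)
coef = np.maximum(svr.coef_.ravel(), 0)   # clip negative weights
print(np.round(coef, 3), "vs true", true_fracs)
```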

Session B-359: Identification of novel peptidic antibiotics via large-scale scoring of mass spectra against natural products databases
COSI: Non-COSI
  • Alexander Shlemov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Alexey Gurevich, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Alla Mikheenko, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, Russia
  • Anastasiia Abramova, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Anton Korobeynikov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Hossein Mohimani, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA
  • Pavel Pevzner, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA

Short Abstract: Discovery of novel peptidic antibiotics and other natural products is hardly possible without modern high-throughput technologies and computational pipelines. Previous studies mainly relied on low-throughput NMR-based technologies requiring large amounts of highly purified material that is often difficult to obtain, which contributed to the recent decline in the pace of antibiotics discovery. Coupling NGS and mass spectrometry data enabled a renaissance in antibiotics research and led to the discovery of teixobactin, the first new class of antibiotics in three decades. While databases in traditional proteomics consist of known peptides, ongoing genome mining efforts for natural product discovery generate vast databases of still unknown putative compounds, making matching spectra against such databases prohibitively time-consuming. This leads to the challenging problem of matching millions of spectra against millions of peptides in a reasonable time. The statistical significance (P-value) of a few high-scoring matches can then be calculated to further eliminate untrustworthy hits. Here we suggest a fast scoring strategy based on peptide database partitioning and preprocessing (illustrated schematically below), coupled with an accurate method for calculating the P-value of a given peptide-spectrum pair. Traditional proteomics approaches are not applicable here because peptidic natural products typically have complex structures (e.g., cyclic or branch-cyclic) and contain non-standard amino acids along with post-translational modifications. Besides the P-value estimation algorithm itself, we also propose a fast method to estimate whether the P-value would fall below a predefined threshold, speeding up the search even more.
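
A generic illustration of the partitioning idea, assuming peptides are pre-sorted by mass so each spectrum is scored only against the narrow mass window it can plausibly match; the actual preprocessing additionally handles cyclic and branch-cyclic structures.

```python
# Sketch: precursor-mass partitioning via binary search, so that each
# spectrum is compared against only a small slice of the peptide database.
import bisect

peptides = [("pepA", 980.5), ("pepB", 1200.1), ("pepC", 1200.4),
            ("pepD", 1534.8), ("pepE", 2210.0)]     # mock (name, mass) pairs
peptides.sort(key=lambda p: p[1])
masses = [m for _, m in peptides]

def candidates(precursor_mass, tol=0.5):
    lo = bisect.bisect_left(masses, precursor_mass - tol)
    hi = bisect.bisect_right(masses, precursor_mass + tol)
    return peptides[lo:hi]

print(candidates(1200.3))   # [('pepB', 1200.1), ('pepC', 1200.4)]
```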

Session B-361: Developing rAMP-seq primers for use in Durum and Bread Wheat
COSI: Non-COSI
  • Dustin Cram, National Research Council of Canada, Canada
  • David Konkin, National Research Council of Canada, Canada

Short Abstract: Cost-effective genotyping strategies are enabling technologies for researchers and breeders. Buckler et al. (2016) recently proposed rAMP-seq as a simple approach to reduced-representation sequencing and demonstrated its use in maize. This approach amplifies variable repetitive regions that are flanked by conserved regions, such that hundreds to thousands of loci can be targeted by a single set of primers. In wheat, there is a rich diversity of transposable elements, with ~180 families represented by 11-100 copies on the 3B chromosome (Daron et al., 2014). With the goal of developing rAMP-seq primer sets that target 1000-5000 loci distributed across the wheat genome, we are currently exploring the rAMP-seq primer design space in wheat and performing preliminary trials.

Session B-363: VarQuest: modification-tolerant identification of novel variants of peptidic antibiotics and other natural products
COSI: Non-COSI
  • Alexey Gurevich, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Alla Mikheenko, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Alexander Shlemov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
  • Anton Korobeynikov, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia; Department of Mathematics and Mechanics, St. Petersburg State Uni, Russia
  • Hossein Mohimani, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA
  • Pavel Pevzner, Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California, USA; Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Peters, USA

Short Abstract: Motivation: Peptidic natural products (PNPs) include many antibiotics, anti-cancer agents, and other bioactive compounds. While billions of tandem mass spectra of natural products have been generated and deposited to the Global Natural Products Social (GNPS) molecular network, the discovery of novel PNPs, and even of variants of known PNPs, from this gold mine of spectral data remains challenging. To address this problem, bioinformaticians have developed dereplication techniques to identify known PNPs and their novel variants. However, since PNP databases are dominated by the most abundant representatives of PNP families, existing algorithms, which focus on the dereplication of known PNPs, identify only a small fraction of spectra in the GNPS molecular network. Results: We present VarQuest, a novel algorithm for the identification of PNP variants via database search of mass spectra and the first high-throughput mutation-tolerant PNP identification method capable of analyzing the entire GNPS infrastructure. VarQuest identified an order of magnitude more PNP spectra and many novel PNP variants compared to existing PNP identification strategies. Availability and implementation: http://cab.spbu.ru/software/varquest Contact: aleksey.gurevich@spbu.ru

Session B-365: Comparison of open source (galaxy based) and commercial pipelines for RNA-Seq data analysis
COSI: Non-COSI
  • Slave Trajanoski, Medical University Graz, Austria
  • Marija Djurdjevic, Medical University Graz, Austria
  • Andrea Groselj-Strele, Medical University Graz, Austria

Short Abstract: Since the first RNA-Seq projects, software for data analysis has been constantly developed and improved. Nowadays we find a plethora of tools, both for single parts of RNA-Seq data analysis and as complete pipelines that deliver gene expression results. In our work, we present a comparison of commercial software from Partek(R), in its two implementations Partek(R) Flow(R) and Partek(R) Genomics Suite(R), with open source tools implemented as a pipeline in the very popular web-based framework Galaxy (1). Using a test dataset from the NCBI Sequence Read Archive (SRA), we ran performance and usability tests of the two approaches. We come to the following conclusions: 1. Results from both solutions are comparable, but one should be very careful about the parameters used in each step, since they can lead to different results; 2. Usability favours the commercial solution, even though the Galaxy developers are making good progress in this direction; 3. For the sake of easy and fast analysis, many parameters are hidden from the user in the commercial software, which leads to lower flexibility compared with the open source pipelines; 4. Commercial software requires less management effort, which on the other hand is connected with license costs. 1. Afgan E, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016;44(W1):W3.

Session B-367: CALQ: Coverage-Adaptive Lossy compression of high-throughput sequencing quality values
COSI: Non-COSI
  • Jan Voges, Leibniz Universitaet Hannover, Germany
  • Mikel Hernaez, Stanford, United States
  • Joern Ostermann, Leibniz Universitaet Hannover, Germany

Short Abstract: Recent advancements in high-throughput sequencing technology have led to rapid growth of genomic data. Several lossless compression schemes have been proposed for coding such data, present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. We present CALQ, a novel lossy compression scheme for quality values. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses (the idea is sketched below). We analyze the performance of several lossy compressors of quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the precision and recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. We show that CALQ achieves, on average, better variant-calling performance than the original data, with a size reduction of more than an order of magnitude with respect to state-of-the-art lossless compressors.
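
A conceptual sketch of coverage-adaptive quantization, assuming a scalar per-locus genotype confidence controls the bin coarseness; CALQ derives this confidence from a statistical genotyping model rather than taking it as a given.

```python
# Sketch: quantize Phred quality values coarsely where the genotype is
# already certain, finely where it is uncertain.
import numpy as np

def quantize(quals, genotype_confidence, fine_bins=8, coarse_bins=2):
    n_bins = coarse_bins if genotype_confidence > 0.99 else fine_bins
    edges = np.linspace(0, 41, n_bins + 1)          # Phred range 0..41
    idx = np.clip(np.digitize(quals, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[idx].round().astype(int)         # representative values

quals = np.array([37, 12, 40, 25, 3])
print(quantize(quals, genotype_confidence=0.5))    # finer, at uncertain loci
print(quantize(quals, genotype_confidence=0.999))  # coarser, at certain loci
```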

Session B-369: A CWL-based pipeline for the ultra-fast identification of mutually exclusive pairs of aberrant genes in tumors
COSI: Non-COSI
  • Tarcisio Fedrizzi, CIBIO, Italy
  • Davide Prandi, CIBIO, Italy
  • Francesca Demichelis, CIBIO, Italy

Short Abstract: Synthetic lethality (SL) is a phenomenon in which the concomitant aberration of two (or more) genes causes cell death, while aberration of either gene alone is compatible with life. As tumor cells tend to accumulate aberrations, SL can be exploited to selectively kill tumor cells by targeting the second gene with a drug. Genomic aberrations of SL combinations should be reflected in mutual exclusivity (ME) signatures. By studying co-aberration patterns in large datasets (e.g., TCGA), we can nominate mutually exclusive combinations and validate their SL potential through experimental work. To identify such combinations, we developed a pipeline to handle Whole Exome Sequencing data and an algorithm named FaME (Fast Mutual Exclusivity). The pipeline is written in the Common Workflow Language (CWL), one of the leading languages for pipeline specification. CWL combined with Docker offers a way to execute the pipeline in heterogeneous environments (e.g., HPC cluster, cloud) while avoiding complex setups. With matched normal and tumor sample data (BAM) as input, the pipeline generates purity- and ploidy-adjusted somatic copy number alteration (SCNA) calls, both gene-based and allele-specific, and single nucleotide variants (SNVs), all annotated with clonality estimates using a tool developed by our group (CLONET). FaME leverages fast matrix multiplication (OpenBLAS), a logarithm-based Fisher test implementation, and parallel code execution to compute hundreds of millions of ME tests (genome-wide coverage) for thousands of samples in a few minutes on a single HPC machine (the core counting trick is sketched below). Details of the pipeline and FaME will be presented. This work is part of the ERC-funded project SPICE (ERC-CoG-2014, 648670).
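
The counting trick behind this speed can be sketched as follows: with binary aberration matrices, a single matrix product yields co-occurrence counts for all gene pairs at once, from which each 2x2 table follows by arithmetic. This illustrative version falls back to scipy's Fisher test for one pair; FaME's log-space implementation vectorizes that step as well.

```python
# Sketch: co-occurrence counts for ALL gene pairs via one matrix product,
# then a mutual-exclusivity (depletion) Fisher test for a single pair.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_genes, n_samples = 100, 500
A = (rng.random((n_genes, n_samples)) < 0.15).astype(int)  # mock aberrations

both = A @ A.T                          # co-aberration counts, all pairs
per_gene = A.sum(axis=1)
only_i = per_gene[:, None] - both       # gene i aberrant, gene j not
only_j = per_gene[None, :] - both
neither = n_samples - both - only_i - only_j

i, j = 3, 7                             # one pair, for illustration
table = [[both[i, j], only_i[i, j]], [only_j[i, j], neither[i, j]]]
_, p = fisher_exact(table, alternative="less")  # depletion = exclusivity
print(p)
```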

Session B-371: Wet lab preparations critically influence metagenomics profiles as shown in surrogate samples and a sample of a patient with unknown cause of meningitis
COSI: Non-COSI
  • Corinne P. Oechslin, Biology Division, SPIEZ LABORATORY, Swiss Federal Office for Civil Protection, Spiez; Institute for Infectious Diseases, University of Bern, Bern; Graduate School for Cellular and Biomedical Sciences, Switzerland
  • Stephen L. Leib, Institute for Infectious Diseases, University of Bern, Bern, Switzerland
  • Christian M. Beuret, Biology Division, SPIEZ LABORATORY, Swiss Federal Office for Civil Protection, Spiez, Switzerland

Short Abstract: Sample preparation steps, from host nucleic acid (NA) depletion through NA extraction up to sequencing library preparation, directly influence the quality of the sequencing output. This study compares different sample preparations and their influence on the metagenomics profiles of surrogate samples of bacterial or viral infections in humans, as well as of a cerebrospinal fluid sample of a patient suffering from meningitis of unknown etiology. To examine the influence of the predominant amount of host NA in clinical samples on the detection of infectious pathogens, we implemented a host NA depletion method. Furthermore, we compared four combinations of native and host-NA-depleted samples and different NA extractions: crowding agent followed by magnetic bead purification, and phenol-chloroform-based mixtures of different pH followed by column purification. Extracted NA were reverse transcribed, whole-genome amplified, and prepared for sequencing with the IonTorrent™ S5. Sequencing data were analyzed using the Kraken taxonomic sequence classifier, mapping to reference genomes, and BLAST®. The analysis of the metagenomics profiles showed that the four sample preparations of the surrogate samples selectively single out either intact or lysed bacteria, RNA viruses, or intact DNA viruses. These results guided the analysis of the patient sample data: the host-NA-depleted sample in combination with phenol-chloroform (pH >8) and column purification provided reliable metagenomics evidence of a multi-bacterial infection that caused the meningitis.

Session B-373: Finding associated variants in genome-wide associations studies on multiple traits
COSI: Non-COSI
  • Lisa Gai, UCLA, USA
  • Eleazar Eskin, University of California, Los Angeles, USA

Short Abstract: Many variants identified by genome-wide association studies (GWAS) have been found to affect multiple traits, either directly or through shared pathways. There is currently a wealth of GWAS data collected for numerous phenotypes, and analyzing multiple traits at once can increase the power to detect shared variant effects. However, the vast majority of studies consider one trait at a time. Studies that do analyze multiple traits are typically limited to sets of traits already believed to share a genetic basis. Traditional meta-analysis methods for combining studies are designed for use on studies of the same trait; when applied to dissimilar studies, they are underpowered compared to univariate analysis. This is a major limitation, as the degree to which a pair of traits shares effects is often not known. Here we present a flexible method for finding associated variants from GWAS summary statistics for multiple traits. Our method estimates the degree of shared effects between traits from the data. Using simulations, we show that our method properly controls the false positive rate and increases power compared to trait-by-trait GWAS at varying degrees of relatedness between traits. We apply our method to real data sets for a variety of disease-relevant traits.
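
For context, a standard building block for combining summary statistics across correlated traits is shown below (a fixed-effects-style combination with a given correlation matrix); note that the poster's contribution is estimating the degree of effect sharing from the data rather than assuming this matrix.

```python
# Sketch: combine per-trait Z-scores for one variant, given a correlation
# matrix R among the traits' test statistics.
import numpy as np
from scipy.stats import norm

def combined_z(z, R):
    """Combined statistic (1' R^{-1} z) / sqrt(1' R^{-1} 1)."""
    w = np.linalg.solve(R, np.ones_like(z))      # R^{-1} * 1
    return (w @ z) / np.sqrt(w.sum())

z = np.array([2.1, 1.8, 0.4])                    # one variant, three traits
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
zc = combined_z(z, R)
print(zc, 2 * norm.sf(abs(zc)))                  # combined Z and P-value
```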

Session B-375: Inferring molecular data type similarity in anti-cancer drug response prediction
COSI: Non-COSI
  • Nanne Aben, Netherlands Cancer Institute, Netherlands
  • Yipeng Song, University of Amsterdam, Netherlands
  • Johan Westerhuis, University of Amsterdam, Netherlands
  • Henk Kiers, University of Groningen, Netherlands
  • Magali Michaut, Netherlands Cancer Institute, Netherlands
  • Age Smilde, University of Amsterdam, Netherlands
  • Lodewyk Wessels, Netherlands Cancer Institute, Netherlands

Short Abstract: As patient response to anticancer drugs is highly variable, biomarkers are needed to identify which patients will benefit from a given treatment. These biomarkers can be determined using large-scale pharmacogenomic screens, in which hundreds of cell lines have been extensively molecularly profiled for mutations, gene expression, drug response, etc. We have previously shown that the classic approach to biomarker identification (an Elastic Net predicting drug response using all molecular data types simultaneously) is strongly affected by overlapping information contained in multiple data types. Here, we want to gain further insight into how the information overlaps between different data types. To this end, one could employ matrix correlation measures, such as the RV coefficient (a reference computation is shown below), to estimate the redundancy between data types. However, these measures are not suitable for comparing binary with continuous data. We therefore extended the RV coefficient to compare binary and continuous data (e.g., the correlation between mutation and proteomics data) and to compute partial matrix correlations (e.g., the correlation between mutation and drug response, corrected for proteomics data). Using this approach, we found that gene expression and proteomics data act as 'mediator data types': they contain all the information shared between drug response and the remaining data types. As linear models give relatively high weights to mediator variables, this result explains why an Elastic Net predictor of drug response mostly selects gene expression and proteomics biomarkers.
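
The classical RV coefficient the authors extend can be computed directly; here is a reference implementation for two column-centred matrices with matched samples in the rows (the binary extension and partial correlations are the poster's contribution and are not shown).

```python
# Classical RV coefficient between two data matrices X and Y that share
# rows (samples): trace-based correlation of their cross-product matrices.
import numpy as np

def rv_coefficient(X, Y):
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sx, Sy = X @ X.T, Y @ Y.T          # sample-by-sample cross-products
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))          # e.g. expression, 40 cell lines
Y = X[:, :30] + 0.5 * rng.normal(size=(40, 30))  # partially redundant data
print(rv_coefficient(X, Y))             # high RV = redundant data types
```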

Session B-377: Unsupervised domain adaptation for age prediction from DNA methylation data
COSI: Non-COSI
  • Lisa Handl, Max Planck Institute for Informatics, Germany
  • Adrin Jalali, Max-Planck-Institut für Informatik, Germany
  • Michael Scherer, Max-Planck Institute for Informatics, Germany
  • Nico Pfeifer, Department of Computer Science, University of Tübingen, Germany

Short Abstract: Over the last years, huge resources of biological and medical data have become available for research. These data offer great opportunities for machine learning applications in health care, e.g., for precision medicine, but are also challenging to analyze. One key challenge in biological data is heterogeneity, which can arise, e.g., from different data sources, unknown subgroups, or differences in data acquisition. If this heterogeneity causes a distribution mismatch between the training and test data of a statistical model, prediction performance quickly deteriorates. An interesting problem in epigenetics where this is relevant is age prediction across multiple tissues. Here, DNA methylation data are used to predict a donor's "epigenetic age", and an increased epigenetic age has been shown to be linked to lifestyle and disease history. In this work, we address the problem of heterogeneity by proposing an adaptive model which detects distribution mismatches between the inputs of the training and test data and excludes or down-weights features that behave differently (see the sketch below). Our method can be seen as unsupervised domain adaptation. We apply the model to age prediction based on DNA methylation data from a variety of tissues and compare it to a standard model that does not take heterogeneity into account. The standard approach performs particularly badly on one tissue type, on which we show substantial improvement with our new adaptive approach, even though no samples of that tissue were part of the training data.
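
A minimal sketch of the adaptation idea, assuming a plain per-feature Kolmogorov-Smirnov filter; the proposed method is more refined than this, but the mechanism is the same: detect train/test distribution mismatch and drop or down-weight the offending features.

```python
# Sketch: keep only features whose training and test distributions do not
# show a detectable mismatch (two-sample KS test), then fit on those.
import numpy as np
from scipy.stats import ks_2samp

def stable_features(X_train, X_test, alpha=0.01):
    keep = []
    for f in range(X_train.shape[1]):
        _, p = ks_2samp(X_train[:, f], X_test[:, f])
        if p > alpha:                  # no detectable distribution mismatch
            keep.append(f)
    return np.array(keep)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(50, 5))
X_test[:, 2] += 3.0                    # feature 2 shifts in the test tissue
print(stable_features(X_train, X_test))   # e.g. [0 1 3 4]
```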

Session B-379: MicrobiomeDB: a Web-based platform for interrogating microbiome experiments
COSI: Non-COSI
  • Francislon S. Oliveira, Centro de Pesquisas René Rachou (Fiocruz Minas), Brazil
  • Shon Cade, Department of Biology, University of Pennsylvania, United States
  • John Brestelli, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • John Iodice, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Brian P. Brunk, Department of Biology, University of Pennsylvania, United States
  • Jie Zheng, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Christian J. Stoeckert Jr, Department of Genetics, School of Medicine, University of Pennsylvania, United States
  • Gabriel R. Fernandes, Centro de Pesquisas René Rachou (Fiocruz Minas), Brazil
  • Jessica C. Kissinger, Center for Tropical & Emerging Global Diseases, Department of Genetics and Institute of Bioinformatics, University of Georgia, United States
  • David S. Roos, Department of Biology, University of Pennsylvania, United States
  • Daniel P. Beiting, Center for Host-Microbial Interactions, Department of Pathobiology, University of Pennsylvania, United States

Short Abstract: High-throughput sequencing has revolutionized microbiology by allowing scientists to complement culture-based approaches with culture-independent profiling of complex microbial communities, also known as the microbiome. Community composition data are often accompanied by rich metadata that describe the source from which the sample was derived and how samples were treated prior to collection. However, there is a lack of tools that allow the integration of sample metadata with microbiome profiling data. To better understand how the variables described by metadata influence the structure and function of microbial communities, we developed MicrobiomeDB (microbiomeDB.org), a discovery platform that empowers researchers to fully leverage their experimental metadata to construct queries that interrogate microbiome datasets. Furthermore, the resulting queries can be statistically analysed and graphically visualized in the web browser, giving the user a powerful tool to interpret any set of samples from the loaded datasets. A key feature of MicrobiomeDB is an automated pipeline for loading data from microbiome experiments using the standard Biological Observation Matrix (.biom) format as input. Taxonomy assignments from the .biom file are mapped to the GreenGenes database to retrieve full 16S rRNA gene sequences and NCBI taxon identifiers. MIxS and user-defined metadata terms describing each sample are mapped to an OBO-based application ontology, which is used to guide metadata harmonization, organization, and display. Taken together, these results constitute a first step toward a full-featured open-source platform for a systems biology view of microbial communities.

Session B-381: Stability Change Prediction and Multiple Structure Alignment Extensions for UCSF Chimera
COSI: Non-COSI
  • Jessica Köberle, University of Applied Sciences Upper Austria, Austria
  • Markus Saliger, University of Applied Sciences Upper Austria, Austria
  • Jonas Schurr, University of Applied Sciences Upper Austria, Austria
  • Josef Laimer, University of Salzburg, Austria
  • Peter Lackner, University of Salzburg, Austria

Short Abstract: UCSF Chimera is a widely used software package for the visualization and modeling of protein 3D structures. We developed a bundle of extensions that supports the analysis of protein variants and protein engineering. The first extension offers an easy-to-use interface to our online service MAESTRO, a versatile tool for the prediction of stability changes upon point mutations. MAESTRO includes several search algorithms for the most (de)stabilizing combination of point mutations or potential stabilizing disulfide bonds. Our Chimera extension provides access to the most popular application scenarios. Results are presented in tabular form, and mutation sites can be highlighted within the structures. The second extension performs multiple structure alignments (MStAs) by utilizing our web service PIRATES. PIRATES is a meta server for MStAs, which currently provides access to eight widely used alignment methods. In addition, the service computes a consensus alignment based on their results. All resulting alignments are scored, and loaded structures can be superimposed based on them. An extension providing advanced residue selection options based on amino acid types or groups, secondary structure, and accessible surface area complements the bundle. Various selection constraints can be combined and inverted, and the resulting residue filter can be passed to the MAESTRO extension. All extensions are easy to use without any knowledge of the underlying software, and results are presented within Chimera. All components are freely available for academic research and have no dependencies other than a UCSF Chimera (version 11.1 or later) installation and an internet connection.

Session B-383: Analyzing cell migration dynamics from intravital imaging by deformable image matching
COSI: Non-COSI
  • Hideo Matsuda, Osaka University, Japan
  • Hironori Shigeta, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Junichi Kikuta, Osaka University, Japan
  • Masaru Ishii, Osaka University, Japan

Short Abstract: Tools for analyzing cell migration dynamics play an important role in the field of bio-imaging. Understanding physiological processes in health and disease requires imaging and analysis of the dynamic behavior of cells in tissues under normal and perturbed conditions. This typically involves tracking large numbers of cells in time-lapse microscopy data sets. Conventional cell tracking methods generally consist of two processing steps: (1) cell segmentation (dividing an image into biologically meaningful parts, "objects", and the remainder, "background"), and (2) cell association (associating segmented objects from frame to frame). However, cell segmentation is generally difficult to resolve due to the heterogeneity and dynamically deforming nature of cell morphology. For example, migrating macrophages often change their shapes, and their "segments" frequently overlap in an image. To cope with this issue, we propose a different method for analyzing cell migration dynamics that replaces the cell segmentation step with frame-to-frame image matching. For the image matching, we use a method called "Deep Matching", which can match objects from frame to frame via non-rigid deformable image matching with deep convolution operations. We applied the method to a comparative analysis of the distributions of cell migration speeds between normal and LPS-stimulated mouse leukocytes. The intravital image data of the leukocytes were obtained by two-photon microscopy. The performance of the proposed method is demonstrated by evaluating the analysis results.

Session B-385: A method of bioimage analysis for spatial pattern of cellular interaction and intercalation
COSI: Non-COSI
  • Shigeto Seno, Osaka University, Japan
  • Masayuki Furuya, Osaka University, Japan
  • Junichi Kikuta, Osaka University, Japan
  • Masaru Ishii, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan

Short Abstract: Biological imaging technologies have been advancing rapidly for several years, and multicolor fluorescence imaging has become an important part of the field of biology. Moreover, intravital imaging enables us to monitor the activity and spatial distribution of various types of cells. Because cell-cell interaction and intercalation play important roles in developmental processes and in many diseases, analyzing the spatial distributions of cells in images is a fundamental task in biological research. Co-localization analysis is one of the well-established approaches to analyzing the spatial interaction of two sets of objects: the relative location of one set of cells with respect to another contains information about potential interactions, and their spatial distributions are correlated if the cells need to communicate with each other. However, when analyzing cellular competition or intercalation, not only co-localization but also exclusive colony formation should be quantified. In this study, we propose a novel method to measure the degree of mixing and exclusive colony formation of two kinds of cells. As the first step in the image analysis, we construct a dendrogram representing the connectivity between cell areas using hierarchical clustering. The impurity of the resulting clusters is then calculated for each level of the dendrogram, and the change of this impurity value across levels indicates the pattern of co-localization or exclusive colony formation. Finally, we calculate the CMI (cell mixture index) as the area under the characteristic impurity curve (a toy computation is shown below). This index compensates for weaknesses of co-localization analyses such as pixel-intensity spatial correlation and object-based overlap methods.
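
A toy computation of such an index, assuming Gini impurity, average-linkage clustering, and a simple average over dendrogram levels; the authors' exact choices may differ.

```python
# Sketch: cluster cell positions hierarchically, track cluster impurity
# (mixing of two cell types) across dendrogram levels, and summarize the
# impurity curve into a single mixture index.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cell_mixture_index(positions, labels, n_levels=20):
    Z = linkage(positions, method="average")
    impurities = []
    for k in np.linspace(2, len(labels) // 2, n_levels).astype(int):
        assign = fcluster(Z, t=k, criterion="maxclust")
        imp = 0.0
        for c in np.unique(assign):
            frac = labels[assign == c].mean()     # fraction of type-1 cells
            imp += (assign == c).mean() * 2 * frac * (1 - frac)  # Gini
        impurities.append(imp)
    return float(np.mean(impurities))  # summary of the impurity curve

rng = np.random.default_rng(0)
pos = rng.random((60, 2))                      # mock cell coordinates
mixed = rng.integers(0, 2, 60)                 # well-mixed cell types
print(cell_mixture_index(pos, mixed))
```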

Session B-387: Exploration of Bacteriophage within the Female Urinary Microbiota
COSI: Non-COSI
  • Taylor Miller-Ensminger, Loyola University Chicago, United States
  • Jonathon Brenner, Loyola University Chicago, United States
  • Krystal Thomas-White, Loyola University Chicago, United States
  • Alan J. Wolfe, Loyola University Chicago, United States
  • Catherine Putonti, Loyola University Chicago, United States

Short Abstract: Bacteriophages (viruses that infect bacteria) play a significant role in shaping the overall bacterial community structure of the human microbiota. While this has been clearly demonstrated in the microbial communities of the gut, the contributions of phages within the bladder microbiota remain unknown. Prior evidence has shown correlations between bacterial populations within the bladder and clinical symptoms. We recently began to investigate phages in the bladder. In an effort to identify urinary bacteriophages, we sequenced 300 phylogenetically diverse species from our collection of more than 8000 bacteria isolated from urine obtained by transurethral catheter from about 1000 women. The sequences from these samples were run through our software, developed in Python, which integrates the tool VirSorter with novel functionality to automate downstream analyses. Our method identified 304 phages with confidence and an additional 318 sequences warranting further investigation. A key challenge in identifying phage genomes via computational methods, including our own, is the dearth of publicly available phage genomes; to date, only 2010 phage species have been characterized. Despite this challenge, we identified phages that infect a myriad of species commonly found within the bladder's microbiota, including Lactobacillus, Streptococcus, Enterococcus, Lactococcus, Morganella, and several Enterobacteriaceae. Since the bladder microbiota has been relatively unexplored, we expect that many novel phages exist beyond the ones found here. From our computational work, we have identified samples from which to isolate and characterize phages in the lab. This effort is essential to furthering our knowledge of the contributions of phages to bladder health.

Session B-389: Sequence analysis and evolutionary relationships of Microbial Transglutaminases
COSI: Non-COSI
  • Deborah Giordano, Institute of Food Sciences, CNR, via Roma 64, Avellino, Italy
  • Angelo Facchiano, Institute of Food Sciences, CNR, via Roma 64, Avellino, Italy

Short Abstract: Since 1990, the microbial transglutaminase (MTGase) of Streptomyces mobaraensis has been of industrial interest because of its ability to catalyze post-translational modifications in many proteins. Although many studies have searched for novel forms of MTGase as alternatives to the one in use, this MTGase remains the only one commercially available and employed as an industrial tool; moreover, the functions of transglutaminases in bacteria are still unknown, and a true classification of all known MTGases does not exist. We analyzed all protein sequences annotated as transglutaminases in the Pfam database in order to divide them into groups and select the most representative sequences. Among these sequences, we analyzed evolutionary relationships (using the MEGA tool and phylogenetic trees based on Maximum Likelihood and Neighbor Joining algorithms). Based on the features of the analyzed proteins, at least five groups of MTGases can be detected. Group I: MTGases similar to the experimentally characterized MTGase of Chryseobacterium sp., a novel form of MTGase that is very different from all other known MTGases. Group II: MTGases similar to the already known MTGase of Bacillus subtilis (Tgl-like), some of which do not preserve all the catalytic residues. Group III: MTGases that preserve the main catalytic triad of the Streptomyces mobaraensis MTGase, in the order Cys, Asp, His. Group IV: a small group of proteins from Proteobacteria. Group V: MTGases similar to the eukaryotic TGases, preserving their typical catalytic triad order, i.e., Cys, His, Asp.
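
A neighbor-joining analysis of the kind mentioned above can be reproduced in outline with Biopython; this is a generic sketch, not the authors' MEGA workflow, and "mtgases.aln" is a hypothetical alignment file.

    # Generic neighbor-joining sketch with Biopython (the authors used MEGA;
    # this is only an illustration; "mtgases.aln" is a hypothetical input).
    from Bio import AlignIO, Phylo
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

    alignment = AlignIO.read("mtgases.aln", "clustal")
    dm = DistanceCalculator("blosum62").get_distance(alignment)  # pairwise distances
    tree = DistanceTreeConstructor().nj(dm)                      # neighbor joining
    Phylo.draw_ascii(tree)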

Session B-390: New Application of Computational Target Deconvolution for Phenotypic Screening
COSI: Non-COSI
  • Ryo Kunimoto, University of Bonn, Germany
  • Dilyana Dimova, University of Bonn, Germany
  • Jürgen Bajorath, University of Bonn, Germany

Short Abstract: Target deconvolution of phenotypic assays is a hot topic in chemical biology and drug discovery. It is generally thought that phenotypic screens might produce leads that are more relevant for addressing complex biology in vivo than compounds identified in target-based assays. Phenotypic discovery is challenged by the need to identify -or at least narrow down- cellular targets for compounds with interesting phenotypic readouts, a process often referred to as target deconvolution. A widely applied computational approach infers putative targets of new active molecules on the basis of their chemical similarity to compounds with activity against known targets. Herein, we introduce a molecular scaffold-based variant of similarity-based target deconvolution for chemical cancer cell line screens, which were used as a model system for phenotypic assays. A new scaffold type, termed the analog series-based (ASB) scaffold, was applied for substructure-based similarity assessment. Compared to conventional scaffolds and compound-based similarity calculations, target assignment centered on ASB scaffolds derived from screening hits and bioactive reference compounds restricted the number of target hypotheses in a meaningful way and led to a significant enrichment of known cancer targets among the candidates.
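
The similarity-based target assignment that this work builds on can be sketched with RDKit. The snippet below uses whole-compound Morgan fingerprints and Tanimoto similarity as a generic illustration only; the poster's method instead matches analog series-based scaffolds.

    # Generic similarity-based target inference sketch with RDKit
    # (illustrative only; not the ASB scaffold procedure itself).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def infer_targets(hit_smiles, reference, cutoff=0.7):
        """reference: list of (smiles, target) pairs for bioactive compounds."""
        hit_fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(hit_smiles), 2, nBits=2048)
        hypotheses = set()
        for smiles, target in reference:
            fp = AllChem.GetMorganFingerprintAsBitVect(
                Chem.MolFromSmiles(smiles), 2, nBits=2048)
            if DataStructs.TanimotoSimilarity(hit_fp, fp) >= cutoff:
                hypotheses.add(target)   # similar compound -> candidate target
        return hypotheses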

Session B-391: Origin mechanisms of L-SAARs in signal peptides among higher eukaryotes
COSI: Non-COSI
  • Michał Stolarczyk, Silesian University of Technology, Gliwice, Poland
  • Joanna Polańska, Silesian University of Technology, Gliwice, Poland
  • Paweł P. Łabaj, Chair of Bioinformatics, Boku University Vienna, Austria

Short Abstract: Single amino acid repeats (SAARs) are reiterations of single amino acids within peptides. Preliminary analysis of proteomes suggests the significance of those containing leucine. The vast majority of leucine reiterations are located in the signal peptide (the amino-terminus of proteins destined for the secretory pathway). Our earlier research has shown that leucine repeats are overrepresented and conserved in eukaryotic organisms. This raises the question of an additional role for leucine repeats in signal peptides. Here, we analyze the distribution of leucine repeats found in signal peptides of orthologous eukaryotic proteins and investigate the changes of the leucines forming L-SAARs at both the amino acid and nucleotide levels. This thorough analysis facilitates the detection of trends determining the direction of evolution and, prospectively, the determination of the origins of L-SAARs in signal peptides. The study focused chiefly on mammals, and the data set consisted of the proteomes and transcriptomes of nine organisms. We examined the following: changes in signal peptide and L-SAAR length (to determine whether DNA replication slippage is indeed the main L-SAAR-creating phenomenon), amino acid-to-leucine changes (to inspect the amino acids previously present at leucine positions in evolutionarily older lineages), leucine codon usage (to test for any bias in leucine codon distribution), and codon changes within leucine codons (to determine whether the inequality in leucine codon distribution results from any directional point mutations within leucine codons).
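
Detecting leucine runs in a signal peptide and tallying leucine codon usage, as described above, reduces to simple sequence scans; below is a minimal, hypothetical sketch, not the authors' pipeline.

    # Minimal sketch: find leucine repeats (L-SAARs) in a peptide and count
    # leucine codon usage in the matching CDS (illustrative only).
    import re
    from collections import Counter

    LEU_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}

    def leucine_repeats(peptide, min_len=3):
        # e.g. "MKLLLLSA" -> [(2, "LLLL")]
        return [(m.start(), m.group())
                for m in re.finditer(r"L{%d,}" % min_len, peptide)]

    def leucine_codon_usage(cds):
        codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
        return Counter(c for c in codons if c in LEU_CODONS)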

Session B-392: Global deceleration of gene evolution following recent genome hybridizations in fungi
COSI: Non-COSI
  • Sira Sriswasdi, Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Japan
  • Masako Takashima, Japan Collection of Microorganisms, RIKEN BioResource Center, Japan
  • Ri-Ichiroh Manabe, Division of Genomic Technologies, RIKEN Center for Life Science Technologies, Japan
  • Moriya Ohkuma, Japan Collection of Microorganisms, RIKEN BioResource Center, Japan
  • Takashi Sugita, Department of Microbiology, Meiji Pharmaceutical University, Japan
  • Wataru Iwasaki, Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Japan

Short Abstract: Polyploidization events such as whole-genome duplication and inter-species hybridization are major evolutionary forces that shape genomes. Although long-term effects of polyploidization have been well-characterized, early molecular evolutionary consequences of polyploidization remain largely unexplored. Here, we report the discovery of two recent and independent genome hybridizations within a single clade of a fungal genus, Trichosporon. Comparative genomic analyses revealed that redundant genes are experiencing decelerations, not accelerations, of evolutionary rates. We identified a relationship between gene conversion and decelerated evolution suggesting that gene conversion may improve the genome stability of young hybrids by restricting gene functional divergences. Furthermore, we detected large-scale gene losses from transcriptional and translational machineries that indicate a global compensatory mechanism against increased gene dosages. Overall, our findings illustrate counteracting mechanisms during an early phase of post-genome hybridization and fill a critical gap in existing theories on genome evolution.

Session B-393: Gene expression pattern and genetic variation in human induced pluripotent stem cells in a family-based cohort of Tetralogy of Fallot
COSI: Non-COSI
  • Sandra Appelt, Charité – Universitätsmedizin Berlin, Germany
  • Marcel Grunert, Charité – Universitätsmedizin Berlin, Germany
  • Sophia Schönhals, Charité – Universitätsmedizin Berlin, Germany
  • Huanhuan Cui, Charité – Universitätsmedizin Berlin, Germany
  • Silke R.-Sperling, Charité – Universitätsmedizin Berlin, Germany

Short Abstract: Patient-specific induced pluripotent stem cells (ps-iPSCs) and their differentiated cell types are a powerful model system for gaining insights into the mechanisms driving developmental and disease-associated regulatory networks. An open question is to what degree somatic mutations affect functional studies of genetic disorders using ps-iPSCs. To investigate this question, as well as the impact of differential expression patterns between ps-iPSCs and iPSCs derived from healthy relatives, we studied healthy individuals and individuals with Tetralogy of Fallot (ToF). ToF represents the most common cyanotic heart defect in humans, characterized by a multigenic background. We performed whole genome sequencing of pooled fibroblast-derived iPSCs and transcriptome sequencing of iPSCs at three differentiation states as well as blood samples. To estimate the relative contributions of sample identity, genetic background, and differentiation state to transcriptional variation, we applied a linear mixed effect model. As expected, the most important driver of global variation is the differentiation state, followed by the genetic background, with sample identity having only minor impact. We further applied a stringent filtering pipeline to identify damaging somatic and germline variants. We found a damaging somatic mutation in the DNA-binding domain of TP53 in two iPSC clones of one patient, which was furthermore functionally validated. TP53 encodes the tumor suppressor P53, and mutations at this site are associated with cancer and are described to decrease P53-mediated regulation of apoptosis, genomic stability, and cell cycle. These results imply that a careful genetic characterization of iPSCs is essential before further follow-up experiments or clinical usage.

Session B-394: Discovering novel drug indications based on NLP and topic modeling
COSI: Non-COSI
  • Giup Jang, Gachon University, South Korea
  • Taekeon Lee, Gachon University, South Korea
  • Soyoun Hwang, Gachon University, South Korea
  • Youngmi Yoon, Gachon University, South Korea

Short Abstract: Text mining is a technique that extracts meaningful information from unstructured text data. We propose a method to discover novel drug indications using natural language processing (NLP) and topic modeling, both text mining techniques. First, we extracted sentences in which a gene and a drug co-occur from abstracts in PubMed. Using these sentences, we identified words that explain relationships between genes and drugs using the Stanford parser. We labeled each identified word as up- or down-regulation, assigning +1 to up-regulation words and -1 to down-regulation words. We multiplied all assigned scores to calculate the GRS (Gene Regulation Score), assigning activation to the gene–drug relationship if the GRS is greater than 0 and inhibition if the GRS is smaller than 0. Next, we clustered genes and their regulations and built a set of topics based on topic modeling. For each drug, we used the topic probabilities as features and known drug–gene associations as the classification class, and we measured classifier performance. We measured AUC for J48, RandomForest, and NaiveBayes using 10-fold cross-validation. Furthermore, we used the classifier to predict novel drug–gene associations for potential drugs among unknown drug–gene associations. For novel drugs, we measured p-values using Fisher's exact test and DisGeNET, and we identified candidate drugs for drug repositioning when the p-value was smaller than 0.05.
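
The GRS computation described here is a sign product over the regulatory words found in a sentence; a minimal sketch follows, with a hypothetical word lexicon standing in for the authors' curated one.

    # Minimal GRS sketch: +1 for up-regulation words, -1 for down-regulation
    # words, multiplied together (the word lexicon below is hypothetical).
    from math import prod

    UP_WORDS = {"activates", "induces", "increases"}      # assumed examples
    DOWN_WORDS = {"inhibits", "suppresses", "decreases"}  # assumed examples

    def relation(sentence_words):
        signs = [+1 if w in UP_WORDS else -1 for w in sentence_words
                 if w in UP_WORDS or w in DOWN_WORDS]
        if not signs:
            return "unknown"
        grs = prod(signs)   # Gene Regulation Score
        return "activation" if grs > 0 else "inhibition"

    # relation(["drugX", "inhibits", "geneY"]) -> "inhibition"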

Session B-395: gTide-Hi: Accelerating a Cross-correlation Algorithm in Tide using a single GPU
COSI: Non-COSI
  • Hyunwoo Kim, Korea Institute of Science and Technology Information, South Korea
  • Kyongseok Park, Korea Institute of Science and Technology Information, South Korea
  • Sunggeun Han, Korea Institute of Science and Technology Information, South Korea
  • Jung-Ho Um, Korea Institute of Science and Technology Information, South Korea

Short Abstract: Cross-correlation is one of the most popular algorithms for peptide identification in database search, and many computer programs such as SEQUEST, Comet, and Tide currently use it. Recently, the HiXCorr algorithm was developed to speed up this computation for high-resolution spectra by improving the preprocessing step for tandem mass spectra. However, despite the development of HiXCorr, the algorithm is still slow when search parameters such as the number of enzymatic termini, missed cleavages, or post-translational modifications (PTMs) are used. To solve this problem, we used the graphics processing unit (GPU) to develop gTide-Hi, which combines Tide's cross-correlation algorithm with the HiXCorr algorithm. gTide-Hi is 2.7 times faster than the original Tide with the following parameters: mz-bin-width = 0.01, maximum number of missed cleavages = 3, and variable modifications of 2 oxidations per peptide on Met and 2 deamidations per peptide on Asn and Gln. Regardless of the parameters used, the results produced by gTide-Hi and the original Tide were the same.
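
The score being accelerated can be sketched on the CPU with NumPy. The snippet follows the commonly described fast-XCorr formulation (background subtraction with a sliding mean over a fixed bin window, then a dot product); it is an illustration under those assumptions, not the GPU implementation.

    # CPU sketch of an XCorr-style score (illustrative only; gTide-Hi
    # performs the preprocessing and scoring on the GPU).
    import numpy as np

    def xcorr(theoretical, observed, offset=75):
        """theoretical/observed: binned spectra as equal-length float arrays."""
        n = len(observed)
        csum = np.cumsum(np.concatenate(([0.0], observed)))
        background = np.empty(n)
        for i in range(n):
            lo, hi = max(0, i - offset), min(n, i + offset + 1)
            background[i] = (csum[hi] - csum[lo]) / (2 * offset + 1)
        # subtract the windowed mean, then correlate at zero lag
        return float(np.dot(theoretical, observed - background))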

Session B-396: A Hierarchical Gene Tree for Survival Prediction
COSI: Non-COSI
  • Minhyeok Lee, Korea University, South Korea
  • Sung Won Han, Korea University, South Korea
  • Junhee Seok, Korea University, South Korea

Short Abstract: Predicting survival risk from gene expression is an important issue in large-scale genomic data analyses. While this problem has been studied extensively for decades, it still suffers from low predictive power. Here, we propose a hierarchical gene tree to improve survival prediction. From gene expression data, the proposed method constructs a tree structure by recursively finding genes associated with survival outcomes. To find predictor genes for a target gene, variable selection based on a regularized regression method is conducted. In order to reduce variation and error within the gene expression data, data points are projected onto the inter-level models of the hierarchical gene tree. The proposed method was evaluated in a simulation study. In most simulation cases, the proposed method outperformed conventional methods based on Cox regression. Furthermore, the proposed method was applied to survival prediction for pancreatic cancer patients. The proposed method improved performance by 16.3% compared to the conventional method for the prediction of low- and high-risk patients. We expect that the proposed method will significantly advance survival prediction with gene expression data.
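
One step of the recursive construction, selecting predictor genes for a target gene via regularized regression, can be sketched with scikit-learn's Lasso; this is a hypothetical helper, not the authors' code, and the L1 penalty is an assumed choice of regularizer.

    # Sketch of one step of the hierarchical gene tree: select predictor
    # genes for a target gene with L1-regularized regression (illustrative).
    import numpy as np
    from sklearn.linear_model import Lasso

    def select_predictors(expr, target_idx, alpha=0.1, max_children=5):
        """expr: samples x genes matrix; returns indices of predictor genes."""
        X = np.delete(expr, target_idx, axis=1)
        y = expr[:, target_idx]
        coef = Lasso(alpha=alpha).fit(X, y).coef_
        order = np.argsort(-np.abs(coef))[:max_children]
        picked = [i for i in order if coef[i] != 0]
        # map back to original gene indices (account for the deleted column)
        return [i if i < target_idx else i + 1 for i in picked]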

Session B-397: Losses of Ubiquitylation Sites during Human Evolution
COSI: Non-COSI
  • Dongbin Park, Department of Life Science, Chung-Ang University, South Korea
  • Chul Jun Goh, Department of Life Science, Chung-Ang University, South Korea
  • Hyein Kim, Department of Life Science, Chung-Ang University, South Korea
  • Yoonsoo Hahn, Department of Life Science, Chung-Ang University, South Korea

Short Abstract: Ubiquitylation, in which the highly conserved 76-residue polypeptide ubiquitin is covalently attached to a lysine residue of substrate proteins, mediates targeted destruction of ubiquitylated proteins by the ubiquitin-proteasome system. We hypothesize that the loss of ancestral ubiquitylation sites in highly conserved proteins during evolution may modify the ubiquitin-mediated regulatory network, potentially resulting in the acquisition of novel phenotypes. We analyzed mouse ubiquitylation data compiled in the Mammalian Ubiquitination Site Database (mUbiSiDa) and multiple sequence alignments of orthologous proteins from 62 mammalian species to identify losses of ancestral ubiquitylation sites in the Euarchonta lineage leading to humans. We found that 194 ancestral ubiquitylation sites were lost in 170 human proteins since the Euarchonta lineage diverged from the Glires lineage. Of the 194 sites, 9 losses occurred in human proteins after the human-chimpanzee divergence. The loss of ancestral ubiquitylation sites may have driven the evolution of protein degradation and/or other regulatory networks, and the emergence of novel phenotypes.

Session B-398: Analysis of predicted interaction hotspots in host-specific protein sequences
COSI: Non-COSI
  • Myeongji Cho, Seoul National University, South Korea
  • Ji-Hae Lee, Seoul National University, South Korea
  • Mikyung Je, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea

Short Abstract: We analyzed the host specificity of selected viral receptors affecting virus–host interactions through the prediction and evaluation of interspecies transmissible viruses and their potential hosts. As a novel way to explain viral cross-species infection, which is difficult to describe clearly from the genetic information of full amino acid sequences alone, we predicted and compared protein disordered regions that are likely to interact with the selected viruses using host-specific protein sequences. We used charge/hydropathy analysis to calculate the ratio of charge to hydropathy for each amino acid along the protein sequence. Predictions were made by classifying sequence segments with structural differences using the boundary line equation that separates folded proteins from disordered proteins according to Uversky's algorithm. Next, interspecies conservation rates of the receptor protein sequences were calculated and compared by mapping the amino acid residues predicted as interaction hotspots onto each sequence via multiple sequence alignment. We confirmed through literature review that all predicted results corresponded to the actual viral host ranges and infectivity. The results of this study suggest that predicted structural regions of host proteins that are important for interactions with viral pathogens can be used to infer viral host ranges and the risk of infection by viruses jumping interspecific barriers. We expect that this method will be applicable to the search for, and evaluation of, major receptors that play important roles in the mechanisms of virus infection and transmission.
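
The charge/hydropathy classification referenced above compares mean net charge against mean normalized Kyte-Doolittle hydropathy and applies Uversky's boundary line; the sketch below assumes the commonly cited boundary <R> = 2.785<H> - 1.151 and a net charge computed from K/R vs. D/E counts, both labeled assumptions.

    # Charge/hydropathy sketch after Uversky (boundary constants and the
    # simplified net-charge formula are assumptions; illustrative only).
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
          "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
          "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
          "Y": -1.3, "V": 4.2}   # Kyte-Doolittle hydropathy scale

    def is_disordered(seq):
        # mean hydropathy, normalized from [-4.5, 4.5] to [0, 1]
        h = sum((KD[a] + 4.5) / 9.0 for a in seq) / len(seq)
        # mean net charge (absolute excess of K/R over D/E)
        r = abs(sum(a in "KR" for a in seq) - sum(a in "DE" for a in seq)) / len(seq)
        return r > 2.785 * h - 1.151   # above the boundary line -> disordered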

Session B-399: Discovering new long-term potentiation pathway members using protein–protein interaction networks
COSI: Non-COSI
  • Ji-Hae Lee, Seoul National University, South Korea
  • Mikyung Je, Seoul National University, South Korea
  • Myeongji Cho, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea

Short Abstract: Synaptic plasticity is the ability of synapses to strengthen or weaken their connection strength. Long-term potentiation (LTP) and depression (LTD) in hippocampal synapses of the brain represent forms of long-term synaptic plasticity and play important roles in memory systems. The proteins and components involved in LTP/LTD induction have not yet been fully elucidated. To extend the known long-term potentiation reference pathway, we used a protein–protein interaction network-based method to select candidate genes. We compared the hippocampal tissues of mice with reduced learning and memory to those of control mice using microarray data in the Gene Expression Omnibus database. To identify genes associated with memory deficit, genes showing significant differential expression were investigated. We also used the STRING protein–protein interaction database to investigate the interactions between various interactors and the genes in the known long-term potentiation maps of the Kyoto Encyclopedia of Genes and Genomes. Gene Ontology enrichment analysis was performed to identify significant biological subsystems. Cytoscape was used to visualize the interactions of the components constituting the protein–protein interaction networks. We generated protein-pair data including the correlation coefficient of expression patterns, sequence similarity, and binding affinity. Classification models were generated to predict candidate genes that could also be involved in the synaptic long-term potentiation pathway. We developed a method to extend the synaptic transmission network using systems biology, which can help to identify therapeutic targets for diseases that cause memory loss.

Session B-400: Inference of microbial interactions from human gut metagenome data
COSI: Non-COSI
  • Yu Watanabe, Niigata University, Japan
  • Yiwei Ling, Niigata University, Japan
  • Shujiro Okuda, Niigata University, Japan

Short Abstract: Microbial communities play important roles in the biocycles of all ecosystems. However, most microbes remain uncultivated, and much metabolic diversity is still to be elucidated. The recently developed metagenomics approach is a powerful tool for measuring biodiversity in ecosystems, but the dynamics of microbial interactions are still unknown. Furthermore, human gut microbial communities are well known to be related to host health; thus, clarifying their dynamics is one of the most important issues. In order to investigate the interactions of microorganisms in human gut environments, we applied network analysis of the functional modules defined in metabolic pathways to human gut metagenomics data. We used the integrated reference catalog of the human gut microbiome (IGC) as the metagenome data and KEGG MODULES as the functional metabolic modules. Subsequently, we mapped the KEGG Orthology (KO) annotations of the IGC data onto the module data sets to obtain presence/absence information for the functional pathway modules. We finally integrated the modules and constructed human gut microbial networks from their linkages. The results suggested that healthy and diseased human guts possess their own specific microbial networks and tend to develop microbial interaction networks built from optimal or sub-optimal species compositions.
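
The KO-to-module presence/absence mapping described here can be sketched with simple set operations. The module definitions below are hypothetical simplifications: real KEGG MODULE definitions are boolean expressions over KOs, flattened here to plain sets, and the completeness cutoff is an assumed parameter.

    # Sketch of KO-to-module presence/absence mapping (illustrative; real
    # KEGG MODULE definitions are logical expressions, simplified to sets).
    def module_presence(sample_kos, modules, completeness=0.8):
        """sample_kos: set of KO ids; modules: {module_id: set of KO ids}."""
        calls = {}
        for mid, kos in modules.items():
            frac = len(sample_kos & kos) / len(kos)
            calls[mid] = frac >= completeness   # present if mostly covered
        return calls

    # e.g. module_presence({"K00001", "K00002"},
    #                      {"M00001": {"K00001", "K00002", "K00003"}})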

Session B-401: MetaGraph: De Bruijn graph based data structures and algorithms for comparative and metagenomics
COSI: Non-COSI
  • Andreas Andrusch, Robert Koch Institute, Germany
  • Michael Schwabe, Robert Koch Institute, Germany
  • Simon H. Tausch, Robert Koch Institute, Germany
  • Piotr W. Dabrowski, Robert Koch Institute, Germany
  • Bernhard Y. Renard, Robert Koch Institute, Germany
  • Andreas Nitsche, Robert Koch Institute, Germany

Short Abstract: Due to the ever-growing amount of NGS data generated, efficient data structures for their storage and analysis are becoming increasingly crucial. Here we present MetaGraph, a novel approach addressing both requirements in the context of metagenomic data analysis. MetaGraph uses a de Bruijn graph based data structure for reference sequence storage, augmented with sequence metadata including taxonomic lineage. One of MetaGraph's main applications is the taxonomic binning of unknown sequences, for example reads from metagenomic NGS datasets, in order to assess sample constituents. Additionally, it enables classifications and comparisons based on user-selectable clades, stepping away from single references towards pan-genome references. Outside of the field of metagenomics, it enables the researcher to perform a wide array of analyses important for comparative genomics. This includes sequence comparisons of references against reads or other references in order to find shared sequence stretches or unique subsequences. In the same fashion it allows the analysis of pan-genomes. Using the graph structures presented here, MetaGraph avoids redundant computations by fully exploiting similarities between sequences. Due to its flexibly extensible sequence metadata, it can be adapted to a multitude of sequence analysis contexts. These features are realized using a highly performant and scalable data structure, which exploits sequence redundancies to amortize space requirements. It performs comparably to or better than published tools with similar functionality in both speed and scalability. Further improvements are planned regarding the out-of-process storage of MetaGraph's data structures using database back-ends, and increases in sensitivity by working with spaced k-mers.
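
At its core, a de Bruijn graph of the kind MetaGraph builds stores (k-1)-mer nodes linked by k-mers, with metadata attached to k-mers. The toy sketch below uses plain hash maps purely for illustration; MetaGraph itself relies on succinct, compressed structures.

    # Toy annotated de Bruijn graph sketch (MetaGraph uses succinct data
    # structures; a plain dict is used here purely for illustration).
    from collections import defaultdict

    def build_dbg(sequences, k=31):
        graph = defaultdict(set)     # (k-1)-mer -> successor (k-1)-mers
        labels = defaultdict(set)    # k-mer -> metadata (e.g. taxon ids)
        for seq_id, seq in sequences:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])
                labels[kmer].add(seq_id)   # sequence metadata on each k-mer
        return graph, labels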

Session B-402: Supervised Chromatin Segmentation
COSI: Non-COSI
  • Tobias Frisch, University of Southern Denmark, Denmark
  • Xinyi Yang, Max Planck Institute for Molecular Genetics, Germany
  • Johannes Helmuth, Max Planck Institute for Molecular Genetics, Germany
  • Jan Baumbach, University of Southern Denmark, Denmark
  • Annalisa Marsico, Max Planck Institute for Molecular Genetics, Germany
  • Ho-Ryun Chung, Max Planck Institute for Molecular Genetics, Germany

Short Abstract: RNA sequencing has become a widely accepted and used technique for analyzing the human transcriptome. However, its capability to reveal low-abundance transcripts is limited by sequencing depth, which in turn correlates with cost. Furthermore, research in recent years has shown that a significant fraction of RNAs undergo high degradation rates and are therefore missing from the corresponding RNA-Seq experiments. We advocate the use of histone modifications, assayed via ChIP-Seq, to reveal the expression of transcripts. To this end, a hidden Markov model is trained on the histone modifications of highly expressed genes. The model is then used to divide the human genome into functional units associated with transcribed and suppressed genes. The model revealed histone patterns for transcription start sites, elongation, and intergenic regions that correlate well with the known functions of those modifications. Furthermore, we show that the modification patterns of a significant number of transcripts that appear unexpressed according to RNA-Seq data are strongly related to those of highly expressed genes. Overall, we demonstrate that a hidden Markov model based on histone modifications is able to reveal the location and expression status of previously hidden transcripts.
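
The segmentation step, fitting an HMM to binned histone modification signals and decoding one state per genomic bin, can be sketched with hmmlearn. This is a generic unsupervised Gaussian HMM sketch with placeholder data, not the authors' supervised model.

    # Generic chromatin-segmentation HMM sketch with hmmlearn (the poster's
    # model is trained in a supervised way; this is only an illustration).
    import numpy as np
    from hmmlearn import hmm

    # X: bins x marks matrix of ChIP-Seq signal (placeholder random data here)
    X = np.random.rand(10000, 5)
    model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(X)
    states = model.predict(X)   # one chromatin state per genomic bin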

Session B-403: Isoelectric Point is Evidence of Transcriptional and Translational Pseudogenes
COSI: Non-COSI
  • Seunghyuk Choi, Hanyang University, South Korea
  • Bongseok Jo, Kangwon University, South Korea
  • Eunok Paek, Hanyang University, South Korea
  • Sun Choi, Kangwon University, South Korea

Short Abstract: Pseudogenes have been described as non-functional and can be categorized by evolutionary mechanism into unitary, duplicated, and processed. In the case of duplicated and processed pseudogenes, recent studies suggest that some pseudogenes are often transcribed and sometimes translated. However, the characterization of translated pseudogenes has seldom been studied. We used 140,503,871 spectra from 50 early-onset gastric cancer patients and applied a multi-stage search against a human pseudogene database, constructed by three-frame translation of previously reported pseudogene transcripts, for tandem mass spectrometry (MS/MS)-based analysis. We controlled the resulting peptide spectrum matches (PSMs) at a 1% estimated false discovery rate (FDR). Among the 72,575 MS/MS-certified PSMs, 32,630 PSMs were discarded because they overlapped with a reference protein database (UniProt 2016-09). The 1,959 unique pseudogene peptides from the remaining 39,945 PSMs were analyzed in terms of 1) propensities of isoelectric point (pI) against control peptides (Ensembl coding genes), and 2) transcriptional activity of the pseudogene peptides. The median pI of the 1,959 pseudogene peptides was ~4.1, significantly lower than that of the control (~7.3). Furthermore, the average pI decreased to ~2.5 as we filtered by increasing numbers of samples, up to 10. Transcriptional activities of the 1,959 pseudogene peptides were obtained from a previous annotation. We found 1,597 pseudogenes corresponding to the 1,959 pseudogene peptides, but transcriptional activity annotations were found for only 483 of them (~30.2%). It is noteworthy that the remaining 10,110 pseudogenes, which did not map to our findings, showed lower transcriptional activity (~11.3%).
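
The per-peptide pI comparison described above can be reproduced with Biopython's ProtParam module; a minimal sketch, not the authors' exact pI calculator, follows.

    # Minimal sketch: median isoelectric point of identified peptides,
    # computed with Biopython (illustrative only).
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    def median_pi(peptides):
        pis = sorted(ProteinAnalysis(p).isoelectric_point() for p in peptides)
        mid = len(pis) // 2
        return pis[mid] if len(pis) % 2 else (pis[mid - 1] + pis[mid]) / 2

    # e.g. median_pi(["DEEDSAK", "PEPTIDE"])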

Session B-404: Optimisation of RNA-Seq gene expression data pre-processing
COSI: Non-COSI
  • Pashupati Mishra, University of Helsinki, Finland
  • Petri Törönen, University of Helsinki, Finland
  • Liisa Holm, University of Helsinki, Finland

Short Abstract: RNA-Seq enables the study of RNA expression profiles. The inference of gene expression levels in a sample involves pre-processing steps such as read alignment, transcript compilation, and expression estimation. Numerous alternative methods are available for each of these steps, but the existing literature comparing different pre-processing methods presents varying results. Clearly, a quality control system that allows users to robustly estimate the amount of signal in their RNA-Seq data after different pre-processing methods would benefit the field. We tested a set of quality control metrics for RNA-Seq gene expression data. The set included four subsets of metrics that aim to measure the biological signal in data by monitoring four different features: a) differential gene expression, b) treatment group separation, c) control gene separation, and d) differential gene set expression. We evaluated alternative metrics within each subset using a novel benchmark based on an Artificial Dilution Series (ADS). ADS takes real RNA-Seq data, makes multiple copies of it, and adds varying amounts of noise to the copies. The rationale behind the evaluation was to rank the metrics within each subset based on their sensitivity to the different levels of noise. Our results show drastic differences between metrics within each subset. We present a Quality Control system for RNA-Seq Gene Expression Data (QC-GED) that comprises the best metrics for monitoring each of the four features in data. It thus reliably quantifies biological signal from four different perspectives and allows users to choose the best pre-processing methods.
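
The Artificial Dilution Series idea, copies of a real dataset with increasing amounts of noise, can be sketched in a few lines of NumPy. The Gaussian noise model on log-scale expression used below is an assumption made only for illustration.

    # Sketch of an Artificial Dilution Series: copies of a real expression
    # matrix with increasing noise (Gaussian noise on the log scale is an
    # assumed model, chosen only for illustration).
    import numpy as np

    def artificial_dilution_series(log_expr, noise_levels=(0.0, 0.5, 1.0, 2.0),
                                   seed=0):
        rng = np.random.default_rng(seed)
        return {s: log_expr + rng.normal(0.0, s, size=log_expr.shape)
                for s in noise_levels}

    # A good QC metric should degrade monotonically across the series.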

Session B-405: Bgee database: creating knowledge from gene expression in any animal species
COSI: Non-COSI
  • Marc Robinson-Rechavi, University of Lausanne, Switzerland
  • Bgee Team, Swiss Institute of Bioinformatics, Switzerland

Short Abstract: Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as "healthy" or not). Data are integrated and made comparable between species thanks to calls of presence/absence of expression and of differential over-/under-expression, integrated along with information on gene orthology and on homology between organs. As a result, Bgee provides unique gene expression analysis tools: i- Bgee is capable of detecting the preferred conditions of expression of any single gene, accommodating any data type and species. These condition rankings are highly specific, even for broadly expressed genes. ii- Bgee provides a new type of gene list enrichment analysis tool, TopAnat, capable of detecting the preferred conditions of expression of a list of genes. We hope that TopAnat will prove to be as useful as, and complementary to, standard Gene Ontology enrichment tests. iii- Bgee provides a convenient Bioconductor package, allowing analyses to be performed directly in R and all processed expression data available in Bgee to be downloaded. This includes thousands of annotated and re-processed Affymetrix chips and RNA-Seq libraries. Bgee includes 29 animal species and is available at http://bgee.org/

Session B-406: Half-Sibling Reconstruction Using Forbidden Subgraphs
COSI: Non-COSI
  • Tanya Berger-Wolf, University of Illinois at Chicago, United States
  • Nick Shaskevich, Google Inc, United States
  • Dhruv Mubayi, University of Illinois at Chicago, United States
  • Aayush Kataria, University of Illinois at Chicago, United States
  • Krutarth Joshi, University of Illinois at Chicago, United States

Short Abstract: Knowledge of lower-order pedigree is an important component of many biological studies, particularly those focused on mating systems, evolution, and adaptation. In non-monogamous species it is often important to know the half-sibling relationships, as those provide insight into the degree of polygamy, the mating mechanisms, and dominance relationships. Microsatellites, or Short Tandem Repeats (STRs), are the genetic markers of choice for wildlife population studies. We propose an algorithm for solving the problem of half-sibling reconstruction starting from a microsatellite sample of individuals all belonging to the same generation and the same population. We show that the problem of half-sibling reconstruction is equivalent to 2-Vertex Cover, and we propose and experimentally validate an algorithm that uses the equivalence of 2-cover obstructions to find valid half-sibling groups. The algorithm runs in time cubic in the number of individuals in the population and produces accurate results when the number of alleles per locus is sufficiently high to be informative.
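
One natural reading of the underlying Mendelian constraint is that a valid half-sibling group must, at every locus, be covered by at most two alleles (those of the shared parent). The sketch below implements that feasibility check under this assumed reading; it is an illustration, not the poster's forbidden-subgraph algorithm.

    # Sketch of a half-sibling feasibility check (assumed constraint: at each
    # locus there exist two alleles such that every individual carries at
    # least one of them; illustrative only).
    from itertools import combinations

    def valid_halfsib_group(genotypes):
        """genotypes: list of loci; each locus is a list of (a, b) allele
        pairs, one pair per individual."""
        for locus in genotypes:
            alleles = {a for pair in locus for a in pair}
            candidates = list(combinations(alleles, 2)) + [(a, a) for a in alleles]
            if not any(all(p in pair or q in pair for pair in locus)
                       for p, q in candidates):
                return False   # no 2-allele cover -> forbidden configuration
        return True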

Session B-407: smartAPI: Towards a More Intelligent Network of Web APIs
COSI: Non-COSI
  • Michel Dumontier, Maastricht University, Netherlands
  • Shima Dastgheib, Stanford University, United States
  • Trish Whetzel, University of California San Diego, United States
  • Pedro Assisi, Stanford University, United States
  • Paul Avillach, Harvard Medical School, United States
  • Kathleen Jagodnik, Icahn School of Medicine at Mount Sinai, United States
  • Gabor Korodi, Harvard Medical School, United States
  • Marcin Pilarczyk, University of Cincinnati, United States
  • Stephan Schurer, University of Miami, United States
  • Raymond Terryn, University of Miami, United States
  • Ruben Verborgh, Ghent University, Belgium
  • Chunlei Wu, The Scripps Research Institute, United States

Short Abstract: Data science increasingly employs cloud-based Web application programming interfaces (APIs) stored in different repositories. However, discovering and connecting suitable APIs for a given application by sifting through these repositories is difficult due to the lack of rich metadata needed to precisely describe a service and the lack of explicit knowledge about the structure and datatypes of Web API inputs and outputs. To address this challenge, we conducted a survey to identify the metadata elements that are crucial to the description of Web APIs and subsequently developed a smartAPI metadata specification that includes 54 API metadata elements divided into five categories: (i) API Metadata, (ii) Service Provider Metadata, (iii) API Operation Metadata, (iv) Operation Parameter Metadata, and (v) Operation Response Metadata. Then, we extended the widely used Swagger editor for annotating APIs to develop a smartAPI editor that captures the APIs' domain-related and structural characteristics following the FAIR (Findable, Accessible, Interoperable, Reusable) principles. The smartAPI editor enables API developers to reuse existing metadata elements and values by automatically suggesting terms used by other APIs. In addition to making APIs more accessible and interoperable, we integrated the editor with a smartAPI profiler to annotate API parameters and responses with semantic identifiers. Finally, the annotated APIs are published into a searchable API registry. The registry makes it easier to find and reuse APIs and to see how different APIs connect together, so that complex workflows can be composed more easily. Links to the specification, tool, and registry are available at: http://smart-api.info/.

Session B-408: A journey for building up the ELIXIR Scientific Benchmark infrastructure: openEBench
COSI: Non-COSI
  • Salvador Capella-Gutiérrez, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Spain
  • Diana De La Iglesia, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Spain
  • Juergen Haas, SIB, Switzerland
  • José María Fernández González, Spanish National Cancer Research Centre (CNIO), Spain
  • Dmitry Repchevsky, INB, BSC/CNS, Spain
  • Josep Ll Gelpi, Dept. Bioquimica i Biologia Molecular. Univ. Barcelona, Spain
  • Alfonso Valencia, Spanish National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Barcelona Supercomputing Center (BSC-CNS), Spain

Short Abstract: Benchmarking of bioinformatics tools provides objective metrics in terms of scientific quality, technical reliability, and functionality [1], and stimulates new developments by highlighting areas which require improvement [2]. Within the ELIXIR-EXCELERATE project [3], we propose a community-driven benchmarking infrastructure to support online assessment, comparison, and ranking of bioinformatics tools. Our objective is to store all relevant data and metadata for the continuous evaluation of existing methods and the development of new ones. Nevertheless, evaluating bioinformatics methods in an unbiased fashion remains challenging. Key challenges are the integration of highly heterogeneous data sets and models, created for specific tasks in a myriad of fields, into a single infrastructure, as well as the diversity of metrics, which should be integrated in such a manner that they can be compared. The design of the infrastructure is therefore based on standards, where JSON (JavaScript Object Notation [4]) is used as the data-exchange format to define a common structure [5] for benchmarking data generated in open scientific challenges. There are various active communities worldwide who stand to benefit from such a collaborative infrastructure: research groups providing new algorithms that need to be evaluated, tool developers wanting to promote their bioinformatics tools, data scientists demanding reference data sets and 'gold standards' to feed their methods, and users who need an unbiased ranking of available resources for conducting their research. The ultimate goal is to become the reference infrastructure for a broad range of bioinformatics communities, from protein structure modelling [6] to orthology [7] to biomedical text mining [8], putting forward different benchmark initiatives. A first prototype of the platform is available at [9]. REFERENCES: [1] Jackson M. et al. Software Evaluation: Criteria-based Assessment. Technical Report. Software Sustainability Institute. 2011. [2] Costello JC, Stolovitzky G. Seeking the wisdom of crowds through challenge-based competitions in biomedical research. Clin Pharmacol Ther. 2013 May;93(5):396-8. [3] https://www.elixir-europe.org/excelerate [4] http://json.org [5] https://github.com/inab/benchmarking-data-model [6] Haas J. et al. The Protein Model Portal--a comprehensive resource for protein structure and model information. Database (Oxford). 2013 Apr 26;2013:bat031. [7] Altenhoff AM. et al. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016 May;13(5):425-30. [8] Hirschman L. et al. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6 Suppl 1, S1 (2005). [9] https://elixir.bsc.es/benchmarking/home.htm

Session B-409: Toward elucidation of behavioral significance about mounting behavior through multi-omics analysis of temporary social parasitic ant, Polyrhachis lamellidens (Hymenoptera: Formicidae) 
COSI: Non-COSI
  • Hironori Iwai, Institute for Advanced Biosciences, Keio University, Japan
  • Nobuaki Kono, Institute for Advanced Biosciences, Keio University, Japan
  • Daiki Horikawa, Institute for Advanced Biosciences, Keio University, Japan
  • Masaru Tomita, Institute for Advanced Biosciences, Keio University, Japan
  • Kazuharu Arakawa, Institute for Advanced Biosciences, Keio University, Japan

Short Abstract: Polyrhachis lamellidens (Hymenoptera: Formicidae) is a temporary social parasitic ant whose newly mated queen founds her new colony by invading colonies of other ant species. The newly mated P. lamellidens queen is known to perform mounting behavior against a worker ant in the host colony during the early stage of social parasitism, and it has been hypothesized that this behavior is required for cuticular hydrocarbon (CHC) camouflage. In this study, we conducted gas chromatography-mass spectrometry (GC/MS) analysis of CHCs in order to elucidate the role of mounting behavior in the social parasitic strategy of P. lamellidens. Furthermore, we carried out genome sequencing of P. lamellidens and transcriptome sequencing of larvae, workers, and queens in order to confirm the existence and expression of mounting behavior-related genes. The GC/MS analysis showed low levels of CHCs in the pre-parasitism P. lamellidens queen; however, a host-like CHC profile was observed following mounting behavior, supporting our hypothesis. This implies that desaturase, a gene previously suggested to be related to CHC biosynthesis, is a candidate gene for the mounting behavior mechanism. Genomic analysis of P. lamellidens identified multiple desaturase genes, expressed in all three stages. These desaturase genes are widely conserved in the subfamily Formicinae but show lower similarity in other subfamilies. In this poster, we discuss the molecular mechanism underlying the mounting behavior of P. lamellidens.

Session B-410: High-throughput immunophenotyping of a large knockout mouse library offers a systems-level insight into genetic control of immune homeostasis
COSI: Non-COSI
  • Anna Lorenc, Kings College London Immunology Department, United Kingdom
  • Albina Rahim, BC Cancer Research Centre, Canada
  • Justin Meskas, BC Cancer Research Centre, Canada
  • Sibyl Drissler, BC Cancer Research Centre, Canada
  • Alice Yue, BC Cancer Research Centre, Canada
  • Lucie Abeler-Doerner, Kings College London Immunology Department, United Kingdom
  • Adam Laing, Kings College London Immunology Department, United Kingdom
  • Ryan Brinkman, BC Cancer Research Centre, Canada
  • Adrian Hayday, Kings College London Immunology Department, United Kingdom

Short Abstract: The Infection Immunity and Immunophenotyping (3i) consortium has performed a high-throughput immune phenotyping analysis of ~400 knockout (KO) mouse lines and matched wild-type controls generated by the Wellcome Trust Sanger Institute within the International Mouse Phenotyping Consortium. The screen comprised high-content (up to 14 markers) flow cytometric analysis of multiple immune tissues at steady state and several infection challenges, with the aim of identifying genes that influence the cellular composition and function of the immune system in health and disease. We: (1) developed an automated gating strategy for flow cytometry data from thousands of animals analysed over 2.5 years; (2) created a computational pipeline to identify significant phenotypes in KO mouse strains; (3) jointly analyzed multiple immune and non-immune parameters to discover relationships between these parameters, their links to knocked-out genes, and the function of the immune system under challenge; (4) related the findings in mice to the human immune system and health by cross-referencing with pre-existing knowledge; and (5) made the data publicly available. Our analyses confirmed the known immune-system involvement of many genes and discovered several unknown players and unexpected relationships.

Session B-411: Evolutionary analysis of Rift Valley fever virus
COSI: Non-COSI
  • Mikyung Je, Seoul National University, South Korea
  • Ji-Hae Lee, Seoul National University, South Korea
  • Myeongji Cho, Seoul National University, South Korea
  • Hyeon S. Son, Seoul National University, South Korea
  • Hayeon Kim, Kyungdong University, South Korea

Short Abstract: Increasing numbers of viruses, such as severe fever with thrombocytopenia syndrome virus (SFTSV) and heartland virus (HRTV), have been newly identified as members of the Phlebovirus genus. According to the International Committee on Taxonomy of Viruses, the Phlebovirus genus currently contains 70 viruses, including viruses that can cause severe disease in humans. Rift Valley fever virus (RVFV), the most widely known virus in the Phlebovirus genus, has been confirmed to have recently spread to Europe, the USA, and Asia, beyond its traditional endemic region since it was first reported. The emergence of RVFV in new areas can cause serious public health problems. In this study, bioinformatics analysis was performed to investigate the relationship between the expansion of RVFV infection areas and viral evolutionary variation. We downloaded the sequence data of four CDS regions within the large, medium, and small segments from the GenBank database of the National Center for Biotechnology Information to perform phylogenetic and codon usage analyses. The results confirmed the presence of a time-dependent codon usage pattern in the medium (M) segment of RVFV, and RSCU analysis of the CDS regions of the Gn and Gc glycoproteins showed that codon usage differs for certain amino acids. These features may be critical factors in the expansion of the host range and infection region, or may induce changes in the virulence of RVFV. Therefore, further studies to predict future evolutionary patterns based on the results of this study are required.
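
RSCU (relative synonymous codon usage), used here, is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally; a minimal sketch over one synonymous family follows (shown for leucine, chosen arbitrarily for illustration).

    # Minimal RSCU sketch: observed codon count divided by the mean count of
    # its synonymous family (leucine family shown, for illustration only).
    from collections import Counter

    LEU = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

    def rscu(cds, family=LEU):
        codons = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
        total = sum(codons[c] for c in family)
        if total == 0:
            return {c: 0.0 for c in family}
        expected = total / len(family)   # equal usage within the family
        return {c: codons[c] / expected for c in family}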

Session B-412: Simple scoring method for predicting oxidative stress and inflammation status in 3D organotypic cultures of human bronchial cells exposed to cigarette smoke
COSI: Non-COSI
  • Kazushi Matsumura, JAPAN TOBACCO INC., Japan
  • Shinkichi Ishikawa, JAPAN TOBACCO INC., Japan
  • Shigeaki Ito, JAPAN TOBACCO INC., Japan

Short Abstract: Cigarette smoke (CS) is a known risk factor for several airway diseases, including chronic obstructive pulmonary disease, which are believed to be initiated by increased oxidative stress followed by chronic inflammation. These airway diseases are triggered by cellular responses evoked in the airway epithelium by inhaled CS. Therefore, evaluation of such intracellular perturbations, especially oxidative stress and the inflammatory response, should facilitate prediction of the risk of airway disease onset. As the first step towards development of a risk assessment model, we constructed a simple scoring method to predict oxidative stress and inflammation status. First, we exposed MucilAir, a 3D organotypic culture of human bronchial cells, to 6 inducers with different mechanisms of action related to cellular stress (including oxidative stress and the inflammatory response), using various doses and exposure times. After microarray-based hierarchical clustering and canonical pathway analysis of the data, we identified oxidative stress and inflammation inducers. We then identified the genes commonly differentially expressed at each timepoint as early or late gene sets regulating the oxidative stress and inflammatory responses. Next, the transcriptomics data from CS-exposed MucilAir were analyzed using our scoring method, based on the log2 fold changes of the differential measurements included in our gene sets. We found that the resulting scores accurately captured the dynamic changes and inducer dose-responses of oxidative stress and inflammation status. This supports the potential of our scoring method as the first step towards a quantitative risk assessment based on an adverse outcome pathway.
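
A score of this form, aggregating log2 fold changes over a response gene set, can be sketched simply; the mean-based aggregation below is an assumed choice made for illustration, not necessarily the authors' exact formula, and the gene-set name is hypothetical.

    # Sketch of a gene-set stress score from log2 fold changes (aggregation
    # by the signed mean is an assumed choice; illustrative only).
    def stress_score(log2fc, gene_set):
        """log2fc: {gene: log2 fold change vs. control}; gene_set: iterable."""
        values = [log2fc[g] for g in gene_set if g in log2fc]
        return sum(values) / len(values) if values else 0.0

    # oxidative = stress_score(log2fc, EARLY_OXSTRESS_GENES)  # hypothetical set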

Session B-413: Novel methodologies for gene family silencing using the CRISPR-Cas9 system
COSI: Non-COSI
  • Gal Hyams, Tel Aviv University, Israel
  • Itay Mayrose, Tel Aviv University, Israel
  • Eran Halperin, UCLA, United States

Short Abstract: The CRISPR-Cas9 system is a bacterial immune system that has recently been adopted as a genome-editing technique for eukaryotic genomes. The system is directed to a genomic site by a programmed single-guide RNA (sgRNA) that base-pairs with the DNA target, subsequently leading to a site-specific double-strand break. The binding affinity of the CRISPR-Cas9 system does not require perfect matching between the sgRNA and the DNA target. Thus, in addition to cleaving the desired "on-target" site, cleavage may occur at multiple unintended genomic sites (termed off-targets) that are similar, up to a certain degree, to the on-target. Due to an extensive history of local and large-scale genome duplications, many eukaryotic genomes harbor large gene families with partially overlapping functions. This redundancy often results in a buffering effect: most single null mutants present no or minimal phenotypic consequence due to the overlapping function of one or more paralogs. Therefore, in many cases, silencing multiple members of a gene family is necessary to uncover any phenotypic effects. Here, we introduce graph-based algorithms for the optimal design of potential sgRNAs. The developed algorithms harness the low specificity of the CRISPR-Cas9 system to target multiple members of a given gene family. In silico examination of all gene families in the Solanum lycopersicum genome shows that our approach outperforms simpler alignment-based techniques. The utility of the developed algorithm is further demonstrated in vivo by successfully silencing an entire family of gibberellin transporters in S. lycopersicum, consisting of seven members, using a single sgRNA. This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University and by the Manna Center for food safety and security at Tel-Aviv University.
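
The core idea, one protospacer that lies within a few mismatches of a site in every family member, can be sketched with a brute-force scan. The poster's graph-based algorithms are far more efficient; the snippet below also ignores PAM requirements and uses assumed parameters (20-nt guides, 3 mismatches).

    # Brute-force sketch: find a 20-nt protospacer within `max_mm` mismatches
    # of some site in every gene of a family (illustrative; PAM constraints
    # are ignored and the real method is graph-based and far more efficient).
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def covers_family(guide, genes, max_mm=3):
        k = len(guide)
        return all(any(hamming(guide, g[i:i + k]) <= max_mm
                       for i in range(len(g) - k + 1)) for g in genes)

    def candidate_guides(genes, k=20, max_mm=3):
        # seed guides from exact sites in the first family member
        seeds = {genes[0][i:i + k] for i in range(len(genes[0]) - k + 1)}
        return [s for s in seeds if covers_family(s, genes, max_mm)]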

Session B-414: Epigenetic regulation model in AHNAK Deficient Adipose Differentiation
COSI: Non-COSI
  • Young Seek Lee, Hanyang University, South Korea
  • Soo Young Cho, National Cancer Center, South Korea
  • Jong Kyu Woo, Seoul National University, South Korea
  • Soojun Park, ETRI, South Korea

Short Abstract: An Ahnak-deficient model was used to investigate the function of AHNAK in adipocyte differentiation and obesity. Several reports have proposed molecular mechanisms and related pathways in Ahnak-deficient adipocyte differentiation. However, the functional role of Ahnak in adipocyte differentiation is unclear. In particular, the epigenomic mechanisms of the Ahnak-deficient model have not been studied in adipocyte differentiation. To understand the epigenetic mechanism, we constructed genome-wide DNA methylation profiles with methyl-CpG-binding domain sequencing. We found numerous adipocyte-dependent and Ahnak-associated methylated regions. Intriguingly, the dynamics of the methylation changes were associated with Ahnak deficiency and differentiation state. Differentiation of Ahnak-deficient cells involved genome-wide epigenetic changes and maintenance of the differentiation stage through temporal methylation and demethylation.

Session B-415: NEPHROSEQ: A WEB-BASED GENE EXPRESSION DATABASE AND ANALYSIS PLATFORM
COSI: Non-COSI
  • Heather Ascani, University of Michigan - Ann Arbor, United States
  • Becky Steck, University of Michigan - Ann Arbor, United States
  • Rebecca Reamy, University of Michigan - Ann Arbor, United States
  • Zach Wright, University of Michigan - Ann Arbor, United States
  • Viji Nair, University of Michigan - Ann Arbor, United States
  • Felix Eichinger, University of Michigan - Ann Arbor, United States
  • Sean Eddy, University of Michigan - Ann Arbor, United States
  • Wenjun Ju, University of Michigan - Ann Arbor, United States
  • Matthias Kretzler, University of Michigan - Ann Arbor, United States

Short Abstract: Introduction: Transcriptomic data of human renal disease can provide critical information to identify de novo molecular pathways associated with kidney disease, guide focused analysis in model systems towards pathways with human relevance, and verify the relevance of presumed biomarkers. Considerable amounts of transcriptome data from kidney disease gene expression experiments are being generated and made available in public repositories. However, these datasets are still not analyzed by most researchers because of 1) a lack of data harmonization, especially with regard to gene annotation and associated clinical information, and 2) the absence of accessible tools that allow users to explore the available datasets. Methods: To facilitate access of the renal research community to human kidney disease gene expression data and the relevant associated clinical information, we have developed Nephroseq (www.nephroseq.org), a web-based data repository and analytical tool. We collect and standardize publicly available kidney genome-wide expression data before loading it into Nephroseq. To date, we have curated and analyzed 1,989 samples across 32 datasets. Results: This provides a platform for comparison of differential expression, co-expression, and outlier results for queried genes or gene lists across standardized datasets, and between model systems and human kidney disease. It also allows meta-analysis and gene list comparisons from the same disease entities, between clinical disease entities, or across diseases. Conclusions: Nephroseq empowers researchers without bioinformatics knowledge to employ sophisticated systems biology tools to extract key information from ongoing kidney disease research, validate model system findings in relevant human diseases, and generate hypotheses for further experimental investigations, contributing to knowledge transfer between bench and bedside and ultimately towards improving patient outcomes.

Session B-416: HTSanalyzeR2: an ultra fast R/Bioconductor package for high-throughput screens with interactive report
COSI: Non-COSI
  • Feng Gao, City University of Hong Kong & Cornell University, United States
  • Xiupei Mei, City University of Hong Kong, Hong Kong
  • Lina Zhu, City University of Hong Kong, Hong Kong
  • Yuchen Zhang, City University of Hong Kong, Hong Kong
  • Wei Wang, City University of Hong Kong, Hong Kong
  • Xin Wang, City University of Hong Kong, Hong Kong

Short Abstract: High-throughput screening (HTS) is one of the most promising tools in functional genomics, enabling scientists to study genome-wide perturbations. Our previous work, HTSanalyzeR, provides a set of pipelines for integrated functional analysis, including gene set enrichment and network analysis, and presents the results as rich-text HTML pages with plots. Since its first release in 2011, it has become popular software that is widely used by the community. Recent biotechnological developments such as CRISPR have enhanced the efficiency and applicability of HTS. In the meantime, new algorithms and technologies have greatly accelerated computation and enabled a more convenient and powerful way to present results as interactive reports. Here we present HTSanalyzeR2, which supports high-throughput CRISPR screens. We have also increased its compatibility to work with most of the common species used in biomedical experiments and updated the computational module, speeding up calculations by up to 1000X compared with the previous version. Beyond these functional upgrades, HTSanalyzeR2 now presents the results in an interactive web interface, which can easily be deployed to a Shiny server for further communication with biologists. HTSanalyzeR2 is an R package, available via GitHub at https://github.com/CityUHK-CompBio/HTSanalyzeR2

Session B-417: Candidate gene prioritization with Endeavour
COSI: Non-COSI
  • Amin Ardeshirdavani, KU Leuven - ESAT - STADIUS, Belgium
  • Léon-Charles Tranchevent, Cancer Research Centre of Lyon, France
  • Sarah Elshal, KU Leuven, Belgium
  • Daniel Alcaide, KU Leuven, Belgium
  • Jan Aerts, Leuven University, Belgium
  • Didier Auboeuf, INSERM, France
  • Yves Moreau, KU Leuven, Belgium

Short Abstract: Genomic studies and high-throughput experiments often produce large lists of candidate genes, among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for six species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/

Session B-418: The human gut virome at the extremes of life
COSI: Non-COSI
  • Feargal Ryan, APC Microbiome Institute, Ireland
  • Angela McCann, APC Microbiome Institute, Ireland
  • Stephen Stockdale, APC Microbiome Institute, Ireland
  • Marion Dalmasso, APC Microbiome Institute, Ireland
  • R. Paul Ross, University College Cork, Ireland
  • Colin Hill, APC Microbiome Institute, Ireland

Short Abstract: The role of the human microbiome in health has received a large amount of scientific attention in recent years, with a number of large cohorts analysing the bacterial component of the microbiome. However, the role of the virome in the context of the microbiome and human health is still poorly understood. Recent work has indicated that there is a healthy human virome and that it can become perturbed in disease. Here we sequence the viromes of faecal samples collected from 20 Irish elderly subjects (>67 years old) and 20 Irish infants (12 months after birth) to examine the gut virome at the extremes of life. Virome analysis is complicated by the lack of reference sequences in established databases, so a de novo assembly-based approach was applied here. We show that the gut virome is able to distinguish between infants and the elderly. Furthermore, it can distinguish between infants born by caesarean section and those born by spontaneous vaginal delivery 12 months after birth, and between elderly subjects living in the community and those in long-stay care. The infant viromes were marked by the presence of viruses highly similar to known Lactococcus phages; however, 16S rRNA gene analysis found virtually no Lactococcus sequences present in these data. The elderly were found to contain particularly high levels of two novel phages, one of which is proposed to infect Bacteroides vulgatus and the other Clostridium difficile, based on homology of a CRISPR-Cas spacer and a tRNA gene, respectively. CrAssphage was identified in both the elderly and infants, and was the most prevalent sequence identified throughout the dataset.

Session B-419: A DNA barcode archive and analysis system for medicinal herbs of traditional Korean medicine
COSI: Non-COSI
  • Sang-Jun Yea, KIOM, South Korea
  • Boseok Seong, KIOM, South Korea
  • Yunji Jang, KIOM, South Korea
  • Chul Kim, KIOM, South Korea

Short Abstract: Introduction: Although there are several kinds of DNA barcode systems for archiving, analyzing, and identifying species, the existing systems are not adequate because one or more plants can be identified as a single medicinal herb. Therefore, we implemented a web-based system for DNA barcode archiving and analysis, specifically designed for traditional Korean medicine, to support these conditions. Methodology: The proposed system, called SDBMH, was designed around three main modules: DNA barcode archives, DNA barcode analysis, and system management. In the DNA barcode archives, SDBMH fetches specimen information from an external herbarium information management system (HIMS). In DNA barcode analysis, we used NCBI's Basic Local Alignment Search Tool (BLAST) to align multiple FASTA sequences. The data-processing flow of SDBMH was designed in three steps: DNA barcode management; group management and analysis; and reporting and visualization. Results: SDBMH provides three main functions: barcode registration, barcode search and view, and species identification. In barcode registration, the user selects specimen information from HIMS and chooses a specific region and primer set for each DNA barcode. DNA barcodes are searchable by barcode ID, herbal name, species name, etc. Besides those search options, the user can designate a marker to filter the results. To identify a species, the user goes through three steps: apply BLAST, select a norm group, and choose results. SDBMH offers a graphical user interface for conflict information, phylogenetic trees, and identification reports. Conclusions: Our system will help in archiving and identifying the correct botanical origins of herbal medicines, supporting the standardization and quality control of Korean herbal medicine.

Session B-420: The Cancer Genome Collaboratory
COSI: Non-COSI
  • Christina K Yung, University of Chicago, United States
  • Michelle D Brazas, Ontario Institute for Cancer Research, Canada
  • George L Mihaiescu, Ontario Institute for Cancer Research, Canada
  • Bob Tiernay, Ontario Institute for Cancer Research, Canada
  • Junjun Zhang, Ontario Institute for Cancer Research, Canada
  • Francois Gerthoffert, Ontario Institute for Cancer Research, Canada
  • Andy Yang, Ontario Institute for Cancer Research, Canada
  • Jared Baker, Ontario Institute for Cancer Research, Canada
  • Guillaume Bourque, McGill University, Canada
  • Paul C Boutros, Ontario Institute for Cancer Research, Canada
  • Bartha M Knoppers, McGill University, Canada
  • B. F. Francis Ouellette, Genome Quebec, Canada
  • Cenk Sahinalp, Simon Fraser University, Canada
  • Sohrab Shah, BC Cancer Agency, Canada
  • Vincent Ferretti, Ontario Institute for Cancer Research, Canada
  • Lincoln D Stein, Ontario Institute for Cancer Research, Canada

Short Abstract: The Cancer Genome Collaboratory is an academic compute cloud designed to enable computational research on the world’s largest and most comprehensive cancer genome dataset, that of the International Cancer Genome Consortium (ICGC). The ICGC is on target to categorize the genomes of 25,000 tumors by 2018. A subproject of ICGC, the PanCancer Analysis of Whole Genomes (PCAWG), alone has generated over 800TB of harmonized sequence alignments, variants and interpreted data from over 2,800 cancer patients. A dataset of this size requires months to download and significant resources to store and process. By making the ICGC data available in cloud compute form in the Collaboratory, researchers can bring their analysis methods to the cloud, benefiting from the high availability, scalability and economy offered by cloud services, avoiding a large investment in static compute resources and essentially eliminating the time needed to download the data. To facilitate computational analysis of the ICGC data, the Collaboratory has developed software solutions that are optimized for typical cancer genomics workloads, including well-tested and accurate genome aligners and somatic variant calling pipelines. We have developed a simple-to-use, fast, and secure data transfer tool that imports genomic data from cloud object storage into the user’s compute instances. Because a growing number of cancer datasets have restrictions on their storage locations, it is important to have software solutions that are interoperable across multiple cloud environments. We have successfully demonstrated interoperability across The Cancer Genome Atlas (TCGA) dataset hosted at the University of Chicago’s Bionimbus Protected Data Cloud, the ICGC dataset hosted at the Collaboratory, and ICGC datasets stored in Amazon Web Services (AWS) S3 storage. Lastly, we have developed a non-intrusive user authorization system that allows the Collaboratory to authenticate against the ICGC Data Access Compliance Office (DACO) when researchers require access to controlled-tier data. We anticipate that our software solutions will be implemented on additional commercial and academic clouds. The Collaboratory is actively growing, with a target hardware infrastructure of over 3000 CPU cores and 15 petabytes of raw storage. As of November 2016, the Collaboratory holds information on 2,000 ICGC PCAWG donors (500TB total). We anticipate expanding the Collaboratory to host the entire ICGC dataset of 25,000 donors (approximately 5PB) and to extend its data management and analysis facilities across multiple clouds. The Collaboratory has been successfully utilized by multiple research groups, most notably PCAWG project researchers, who analyzed thousands of genomes at scale over a few weeks’ time. The Collaboratory is now open to the public, and we invite cancer researchers to learn more about our cloud resources at cancercollaboratory.org and apply for access to the Collaboratory.

Session B-421: SonicParanoid: extremely fast, easy and accurate orthology inference
COSI: Non-COSI
  • Salvatore Cosentino, Graduate School of Science, The University of Tokyo, Japan
  • Wataru Iwasaki, Graduate School of Science, The University of Tokyo, Japan

Short Abstract: Thanks to recent advancements in DNA sequencing technologies, the number of species for which genome sequences are available is growing at an accelerated pace. Accurate inference of orthologous genes encoded on multiple genomes is the key to various analyses based on those datasets: elucidation of species-specific and/or evolutionarily shared genomic signatures at the sequence, structural, and/or functional levels (comparative genomics), transfer of genetic knowledge between genomes of model and non-model organisms, reconstruction of phylogenetic trees using genomic information (phylogenomics), and development of reference genome databases all depend on reliable orthology inference. Despite its general importance, it is still time-consuming and difficult to conduct orthology inference on a specific set of genomes, which often includes in-house sequenced genomes. Here, we present SonicParanoid, which is orders of magnitude faster than, but comparably accurate to, existing tools, with a balanced precision-recall trade-off.

Session B-423: Using heterogenous components to build a scalable digital pathology environment
COSI: Non-COSI
  • Yves Sucaet, Vrije Universiteit Brussel, Belgium
  • Silke Smeets, Vrije Universiteit Brussel, Belgium
  • Sandrina Martens, Vrije Universiteit Brussel, Belgium
  • Wim Waelput, UZ Brussel, Belgium
  • Peter In'T Veld, Vrije Universiteit Brussel, Belgium

Short Abstract: At Brussels Free University (VUB), we wanted to build a core digital pathology infrastructure to support a range of different use cases. Various imaging platforms needed to be accessible through a single access point, while still supporting different user profiles. We wanted a scalable solution that would allow interaction between equipment from different research groups, both intramuros and extramuros. A combination of commercial hardware, commercial software, and open-source software was used to accomplish this, with custom code to connect interfaces where needed. We built a centralized infrastructure that integrates a variety of imaging platforms (brightfield, fluorescence, z-stacking), and we now have an interconnected network of heterogeneous and scalable information silos. Image analysis and data/image mining projects can remain stuck in micro-environments due to limits artificially imposed by vendor-specific solutions. We have shown this need not be the case, and have integrated five different imaging platforms onto one architecture. We store data from all modalities in a single storage facility and manage it through a single access point. We support 40+ users, serve 5000+ whole slide images monthly, and facilitate different use cases, including education, biobanking, and telepathology.

Session B-424: Maximum Entropy Methods for Extracting the Learned Features of Deep Neural Networks
COSI: Non-COSI
  • Alex Finnegan, University of Illinois, Urbana-Champaign, United States
  • Jun Song, University of Illinois, Urbana-Champaign, United States

Short Abstract: Motivation: New architectures of multilayer artificial neural networks and new methods for training them are rapidly revolutionizing the application of machine learning in diverse fields, including business, social science, physical sciences, and biology. Interpreting deep neural networks, however, currently remains elusive, and a critical challenge lies in understanding which meaningful features a network is actually learning. Results: We present a general method for interpreting deep neural networks and extracting network-learned features from input data. We describe our algorithm in the context of biological sequence analysis. Our approach, based on ideas from statistical physics, samples from the maximum entropy distribution over possible sequences, anchored at an input sequence and subject to constraints implied by the empirical function learned by a network. Using our framework, we demonstrate that local transcription factor binding motifs can be identified from a network trained on ChIP-seq data and that nucleosome positioning signals are indeed learned by a network trained on chemical cleavage nucleosome maps. Imposing a further constraint on the maximum entropy distribution, similar to the grand canonical ensemble in statistical physics, also allows us to probe whether a network is learning global sequence features, such as the high GC content in nucleosome-rich regions. This work thus provides valuable mathematical tools for interpreting and extracting learned features from feed-forward neural networks.
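To make the sampling idea concrete, the following is a hedged Metropolis sketch of drawing from a maximum entropy distribution anchored at an input sequence under a constraint on a learned function. Everything here is an assumption for illustration only: `net_score` is a stand-in for the trained network's output, `lam` plays the role of a constraint multiplier, and the energy form is not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = np.array(list("ACGT"))

def net_score(seq):
    # Placeholder for the trained network's scalar output on a sequence
    # (hypothetical scoring function used only for illustration).
    return sum(1.0 for a, b in zip(seq, "ACGTACGTACGT") if a == b)

def maxent_sample(anchor, lam=2.0, n_steps=5000):
    """Metropolis sampling from a maximum entropy distribution anchored
    at `anchor`: P(s) ~ exp(-lam * |f(s) - f(anchor)|)."""
    target = net_score(anchor)
    seq = list(anchor)
    energy = abs(net_score(seq) - target)  # zero at the anchor itself
    samples = []
    for _ in range(n_steps):
        pos = rng.integers(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(BASES)          # propose a single-base mutation
        new_energy = abs(net_score(seq) - target)
        if rng.random() < np.exp(lam * (energy - new_energy)):
            energy = new_energy               # accept the move
        else:
            seq[pos] = old                    # reject: revert the mutation
        samples.append("".join(seq))
    return samples

samples = maxent_sample("ACGTACGTACGT")
```

Positions that are frequently mutated in the resulting samples are unconstrained by the learned function, while conserved positions mark network-learned features such as binding motifs.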

Session B-425: Comparative genomic, evolutionary and functional analysis of caleosins: a family of multifunctional plant and fungal proteins
COSI: Non-COSI
  • Farzana Rahman, University of South Wales, United Kingdom
  • Mehedi Hassan, University of South Wales, United Kingdom
  • Rozana Rosli, University of South Wales, United Kingdom
  • Abdulsamie Hanano, Atomic Energy Commission of Syria, Syria
  • Denis Murphy, University of South Wales, United Kingdom

Short Abstract: Caleosins (CLO) belong to a family of multifunctional calcium- and lipid-binding proteins with peroxygenase and other signalling activities. This gene family is found almost ubiquitously in two distinct eukaryotic clades: Viridiplantae and Fungi. This evolutionary pattern of CLO gene occurrence is not consistent with descent from a common ancestor, suggesting that caleosins may have originated in one of the current clades via horizontal gene transfer from the other. We studied CLO gene and protein sequences across a range of plant and fungal species to understand the structure and function of these proteins in detail. We characterised CLO occurrence and function in date palm, oil palm and banana with respect to tissue expression, subcellular localisation and oxylipin pathway substrate specificities in developing seedlings. Here we report the variation across a comprehensive range of plant and fungal species. Protein structure predictions suggest that the calcium-binding and EF hand domains are widely conserved across species. While the biological functions of the studied proteins have yet to be determined in detail, it is clear that these proteins have several subcellular locations and participate in a range of physiological processes in both plants and fungi, including acting as peroxygenases. One crucial role appears to be in responses to a range of biotic and abiotic stresses, including plant-fungal interactions. In this presentation, we describe additional studies that have been carried out to shed light on the origin and functions of this intriguing group of proteins.

Session B-426: Factor Extraction from transcriptome-wide expression data as a method for robust prognosis of prostate carcinoma
COSI: Non-COSI
  • Dominik Otto, Fraunhofer IZI, Germany
  • Susanne Füssel, Universitätsklinikum Carl Gustav Carus Dresden, Germany
  • Manfred Wirth, Universitätsklinikum Carl Gustav Carus Dresden, Germany
  • Friedemann Horn, Fraunhofer Institute for Cell Therapy and Immunology, Germany
  • Kristin Reiche, Fraunhofer Institute for Cell Therapy and Immunology, Germany

Short Abstract: Prostate cancer is the most prevalent cancer among men in the US, and patients often face unnecessary surgeries because biomarkers and their associated classification models frequently have poor discrimination accuracy. One layer of gene regulation that has been undervalued for years but is now emerging as associated with prostate cancer is lncRNA-mediated gene regulation: individual long non-protein-coding RNAs (lncRNAs) have been linked to cancer-specific death [Prensner et al. 2013 Nature Genetics]. Assessing expression variation of protein-coding and non-protein-coding RNAs with high-throughput methods is a promising approach, but it is accompanied by the problem of high dimensionality: the number of differentially regulated genes exceeds the number of available samples by several orders of magnitude, and reliable models can only be derived from a reasonably small parameter set. Furthermore, a prostate cancer tissue sample is very diverse and often contains heterogeneous cell mixtures. The challenge is to separate the relevant information in the derived transcriptome-wide expression data from expression variation unrelated to disease outcome. We developed a computational method that enables us to eliminate inter-individual differences in gene expression that are unrelated to cancer-specific death and to the tumor cell content of a sample. In addition, our method provides dimensionality reduction by finding associations between individual genes. We applied our approach to a set of protein-coding and non-protein-coding genes in 139 prostate cancer samples and were able to associate selected lncRNAs with cancer-specific death.

Session B-427: Optimization of a cancer panel library preparation platform for FFPE samples of all solid tumor types
COSI: Non-COSI
  • Min-Kyeong Gwon, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Hyun-Jung Lee, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Ye-Lim Hong, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea
  • Hyun-Seob Lee, Genomics Core Facility, Biomedical Research Institute, Seoul National University Hospital, South Korea

Short Abstract: Formalin-fixed, paraffin-embedded (FFPE) samples are used to conduct large-scale studies across a variety of tumor phenotypes without incurring the significant expense of recruiting patients to build new cohorts. However, degraded FFPE-derived nucleic acids, a result of long-term storage and formalin fixation, often hinder NGS research because of a high failure rate in analysis. Here, we present a platform that remarkably improves the success rate of cancer panel analysis through validation of FFPE samples and libraries, with a quality-control process set up in our laboratory, using FFPE samples collected over the last 10 years for various cancer types. Genomic DNA was extracted using six types of FFPE extraction kits to determine the most suitable extraction method for FFPE samples. We measured the quality of the extracted DNA using a spectrophotometer, two types of fluorometers, and qPCR analysis, and compared these results with the quantified yield of the final libraries. Furthermore, we adjusted the amount of initial gDNA input for Covaris shearing (200~1000 ng) according to the ddCt result from the Agilent NGS FFPE QC kit, instead of changing the number of pre-PCR cycles as instructed by the manufacturer. On this platform, we successfully built and analyzed 487 libraries from FFPE samples of 10 solid tumor types with a success rate of 88%. The established validation platform is easy to use and applicable to all libraries regardless of cancer type.

Session B-428: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature
COSI: Non-COSI
  • Sunwon Lee, Korea University, South Korea
  • Donghyeon Kim, Korea University, South Korea
  • Sunkyu Kim, Korea University, South Korea
  • Kyubum Lee, Korea University, South Korea
  • Jaehoon Choi, Korea University, South Korea
  • Seongsoon Kim, Korea University, South Korea
  • Minji Jeon, Korea University, South Korea
  • Sangrak Lim, Korea University, South Korea
  • Donghee Choi, Korea University, South Korea
  • Aik-Choon Tan, Division of Medical Oncology, University of Colorado Anschutz Medical Campus, United States
  • Jaewoo Kang, Korea University, South Korea

Short Abstract: As the volume of publications rapidly increases, searching for relevant information from the literature becomes more challenging. To complement standard search engines such as PubMed, it is useful to have an advanced search tool that directly returns relevant biomedical entities such as targets, drugs, and mutations rather than a long list of articles. Some existing tools submit a query to PubMed and process retrieved abstracts to extract information at query time, resulting in a slow response time and limited coverage of only a fraction of the PubMed corpus. Other tools preprocess the PubMed corpus to speed up the response time; however, they are not constantly updated, and thus produce outdated results. Further, most existing tools cannot process sophisticated queries such as searches for mutations that co-occur with query terms in the literature. To address these problems, we introduce BEST, a biomedical entity search tool. BEST returns, as a result, a list of 10 different types of biomedical entities including genes, diseases, drugs, targets, transcription factors, miRNAs, and mutations that are relevant to a user’s query. For example, BEST returns imatinib, dasatinib, and nilotinib for a query on drugs for chronic myeloid leukemia. To the best of our knowledge, BEST is the only system that processes free text queries and returns up-to-date results in real time including mutation information. BEST is freely accessible at http://best.korea.ac.kr.

Session B-429: EIS-DB: an Exon-Intron Structure Database
COSI: Non-COSI
  • Irina Poverennaya, Vavilov Institute of General Genetics RAS, Russia
  • Denis Gorev, Moscow Institute of Physics and Technology, Russia
  • Mikhail Roytberg, Moscow Institute of Physics and Technology; Institute of Mathematical Problems of Biology RAS, Russia

Short Abstract: We present a new exon-intron structure database (EIS-DB) containing comprehensive data on well-annotated genes in more than 100 eukaryotic genomes from different taxonomic groups (vertebrates, invertebrates, plants, and fungi). It allows extracting data related to a specific gene or isoform of a particular organism, or obtaining statistical data related to a given set of genes and/or organisms. Although similar databases exist, they are mostly out of date, or taxon- or isoform-specific. EIS-DB is a relational database managed by PostgreSQL. Structurally, it contains 15 tables; the main ones are ‘Organisms’, ‘Genes’, ‘Isoforms’, ‘Orthologous groups’, ‘Exons’ and ‘Introns’, while the others contain auxiliary data, e.g., taxonomy. EIS-DB also contains FASTA files with the related sequences. The detailed intron section makes EIS-DB especially appealing for studying various intron features in different organisms. As the main source of gene sequences and annotations, 112 RefSeq genome assemblies (current to March 2017) were used, along with additional input data on gene orthology obtained from NCBI. To ascertain orthology between exons and introns, we have developed a special tool. It first builds a multiple protein alignment using a modification of the MUSCLE program that takes into account data on exon borders; we then realign alignment regions where exon borders are not well aligned. The orthologous groups of exons and introns are determined based on the refined alignment. A preliminary version of the EIS-DB web interface is available at http://212.47.226.240:3000/; the database can also be downloaded to the user’s PC for more advanced queries.

Session B-430: Harnessing Deep Learning for High-Content Imaging Screens
COSI: Non-COSI
  • Jan Robin Winter, Bayer AG, Germany
  • Stefan Prechtl, Bayer AG, Germany
  • Andreas Steffen, Bayer AG, Germany
  • Djork-Arné Clevert, Bayer AG, Germany

Short Abstract: High-Throughput Image Analysis (HT-IMA) is a well-established approach for phenotypic screening applications in pharmaceutical research projects. Until now, standard image analysis has typically relied on a small set of physiological features (70% of all published screens rely on fewer than 3 extracted features) in the tested biological systems. A prominent weak point of this approach is that the capability of HT-IMA is not used to its full capacity, as the vast majority of recorded parameters is not analyzed and remains statistically unnoticed. Recent advances in Deep Learning have made significant contributions to image analysis and drug discovery. In this work, we harness Convolutional Neural Networks (CNNs) for high-content screening-based phenotype classification. CNNs have achieved remarkable empirical success in both industry and academia and are now the state-of-the-art methods for classification and segmentation of images. Model selection and evaluation were performed on public benchmark data. In particular, we trained a CNN on single-cell images cropped from the entire field image; the classification task was to determine the correct phenotype class given the cropped images. The performance of our CNN model was validated in a five-fold cross-validation scheme and compared to a Random Forest classifier. In terms of classification accuracy, the CNN clearly outperforms its competitors.
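A minimal PyTorch sketch of a CNN phenotype classifier over cropped single-cell images, in the spirit of the setup described; the 64x64 crop size, three channels and eight phenotype classes are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PhenotypeCNN(nn.Module):
    """Small CNN for classifying cropped single-cell images.
    Input size 64x64 with 3 channels and 8 classes are assumptions."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 8 * 8, n_classes)

    def forward(self, x):
        x = self.features(x)                 # (batch, 128, 8, 8)
        return self.classifier(x.flatten(1)) # (batch, n_classes)

model = PhenotypeCNN()
logits = model(torch.randn(4, 3, 64, 64))   # batch of 4 dummy crops
```

Such a model would be trained per cross-validation fold and compared against a Random Forest on the same splits, as the abstract describes.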

Session B-431: Improving predictions of pan-allele peptide-MHC binding affinities using deep learning
COSI: Non-COSI
  • Rudolph Layko, National Research University - Higher Schools of Economics, Russia
  • Vadim Nazarov, National Research University - Higher Schools of Economics, Russia

Short Abstract: We propose a novel preprocessing approach for amino acid sequences using distributed representations, together with a new neural network architecture, for one of the most essential tasks in immunoinformatics: prediction of binding affinities between MHC molecules and their peptide ligands. A thorough review of existing solutions allowed us to overcome their constraints and limitations. We tested our model on human peptides and obtained results competitive with existing software tools: an F1 score of 0.72 and AUC of 0.84 for predicting binding affinities of peptides to unseen MHC sequences, and an F1 score of 0.82 and AUC of 0.9 on the IEDB testing dataset. The obtained results suggest a low generalization error and the overall applicability of the proposed model to predicting binding affinities for peptide-MHC complexes with both known and unseen MHC sequences.

Session B-432: Defining a Core Genome for the Herpesvirales and Elucidating their Evolutionary Relationship with the Caudovirales
COSI: Non-COSI
  • Juan Sebastián Andrade Martínez, Universidad de los Andes, Colombia
  • Alejandro Reyes Muñoz, Universidad de los Andes, Colombia

Short Abstract: The order Herpesvirales encompasses a great variety of important and widely distributed human pathogens, including the Varicella-Zoster virus, Human Cytomegalovirus and Epstein-Barr virus. During the last decades, similarities in the viral cycle and in the structure of some of their proteins with those of the tailed phages have prompted speculation regarding an evolutionary relationship between the two clades. To evaluate this hypothesis, we used over 700 Herpesvirales and 2000 Caudovirales genomes downloaded from the NCBI genome and nucleotide databases, which were first de-replicated at both the nucleotide and amino acid levels. They were then screened for the presence or absence of clusters of orthologous viral proteins, and a dendrogram was constructed based on their compositional similarities. The results strongly suggest that the Herpesvirales are indeed the closest viral order with eukaryotic hosts to the Caudovirales, and they allow us to put forth hypotheses concerning the specific details of this relationship (i.e., whether they are sister clades or one stems from a minor clade within the other). Moreover, the identification of clusters that were abundant amongst the Herpesvirales made it possible to propose a Core Genome for the entire order, composed of 5 proteins, including the ATPase subunit of the DNA-packaging terminase, the only one with previously verified conservation in this clade. Overall, this work provides important results supporting the long-held hypothesis that the two orders are evolutionarily related and contributes to the understanding of the evolutionary history of the Herpesvirales themselves.

Session B-433: Deep learning model of clonal selection for T-cell receptor sequences
COSI: Non-COSI
  • Sofia Tolstoukhova, National Research University - Higher School of Economics, Moscow, Russia, Russia
  • Evgenii Ofitserov, Tula State University, Tula, Russia, Russia
  • Vadim Nazarov, National Research University - Higher School of Economics, Moscow, Russia, Russia

Short Abstract: The immune system protects the human organism from the different pathogens invading the body, and T cells are its basic weapon against viral infections. Each T cell carries an amino acid molecule on its surface called the T-cell receptor (TCR), which is able to bind a particular kind of pathogen-derived peptide. Cells with the same TCR form a clonotype. Originally, each clonotype is represented by very few cells; after detecting an infected cell, a T cell starts to proliferate and the number of cells in the clonotype grows. This process is called clonal selection. Observing the dynamics of TCR numbers in the peripheral blood is a significant task for both fundamental immunology and medicine. However, no methods have been developed to predict the quantities of specific TCR sequences and model the general clonal selection process. In this work we build a deep learning model capable of predicting T-cell quantity from the receptor sequence. For this purpose we implemented variations of the following architectures: a Variational Autoencoder combined with a recurrent neural network using Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells. We tested them on 6 repertoires from 3 individuals with 2 replicas per individual, using parts of the replicas as cross-validation and testing datasets for the corresponding individuals, and explored a wide spectrum of hyperparameters for each model. Our final model was based on the GRU, with a weighted RMSE of 1.81. To demonstrate the applicability of the model, we implemented a method for comparative analysis of clonal selection in different repertoires and tested it on repertoires of monozygotic twins.
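A minimal PyTorch sketch of a GRU-based regressor from TCR amino acid sequence to clonotype quantity, in the spirit of (but not taken from) the described model; the tokenization, layer sizes and example CDR3 sequence are illustrative assumptions.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
aa_index = {a: i for i, a in enumerate(AA)}

class ClonotypeGRU(nn.Module):
    """GRU regressor mapping a TCR sequence to a (log) clonotype
    quantity; embedding and hidden sizes are assumptions."""
    def __init__(self, embed_dim=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(AA), embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, tokens):           # tokens: (batch, seq_len) int64
        _, h = self.gru(self.embed(tokens))
        return self.head(h[-1]).squeeze(-1)  # one value per sequence

seq = "CASSLGQAYEQYF"                    # an example CDR3-like sequence
tokens = torch.tensor([[aa_index[a] for a in seq]])
model = ClonotypeGRU()
pred = model(tokens)                     # predicted (log) quantity
```

Training such a model with a weighted squared-error loss would correspond to the weighted RMSE criterion the abstract reports.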

Session B-434: Comparative analysis of bootstrap and jackknife methods in the programs of phylogenetic reconstruction
COSI: Non-COSI
  • Dmitry Penzar, MSU FBB, Russia
  • Sergei Spirin, MSU FBB, Russia

Short Abstract: The phylogenetic reconstruction program PQ was developed previously. It reconstructs a phylogenetic tree based on a new criterion of congruence with a multiple alignment. PQ showed strong results on the majority of datasets compared to the most popular reconstruction programs, FastME and RAxML. When working with multiple alignments containing a small number of sequences, PQ frequently generates a tree closer to the real one than the other programs do; however, PQ tends to be less accurate on multiple alignments with more than 30 sequences. Some modifications, such as a new way of finding suboptimal trees for multiple alignments with many sequences, were implemented in the program to improve the results. Moreover, two resampling methods (bootstrap and jackknife) were added to the PQ package to make the evaluation of branches possible. Bootstrap is widely used in bioinformatics programs; it replaces some columns of the alignment with other columns from the same alignment, whereas jackknife simply removes some columns from the alignment. Despite being mathematically equivalent to bootstrap and requiring fewer computational resources, jackknife is rarely used in tree reconstruction. The efficacy of these methods incorporated into the programs (PQ, FastME, RAxML, etc.) was investigated using different datasets with different numbers of replicas. The results are: 1) jackknife shows the same accuracy in predicting the reliability of a branch as bootstrap, given the same number of replicas; 2) the dependency between the support value (generated by either bootstrap or jackknife) and a branch's probability of being correct is almost linear on our datasets; 3) this dependency remains linear with either 100 replicas or 20. The area under the ROC curve (and other metrics of prediction quality) does not change when raising the number of replicas from 20 to 100, which supports the hypothesis that 100 replicas are redundant for computing the reliability of tree branches.
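The two resampling schemes can be sketched in a few lines of NumPy; this is illustrative only (PQ's actual implementation is not shown in the abstract), and the 80% retention fraction for the jackknife is an assumed parameter.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_columns(aln):
    """Resample alignment columns with replacement: some columns are
    duplicated, others dropped, keeping the original length."""
    n = aln.shape[1]
    return aln[:, rng.integers(0, n, size=n)]

def jackknife_columns(aln, keep=0.8):
    """Delete a random fraction of columns (no replacement), so each
    replicate is shorter than the original alignment."""
    n = aln.shape[1]
    cols = np.sort(rng.choice(n, size=int(keep * n), replace=False))
    return aln[:, cols]

# Toy alignment: 4 sequences x 10 columns of characters.
aln = np.array([list("ACGTACGTAC"),
                list("ACGTACGAAC"),
                list("ACGAACGTTC"),
                list("ACGTTCGTAC")])
b = bootstrap_columns(aln)   # replicate for one bootstrap tree
j = jackknife_columns(aln)   # replicate for one jackknife tree
```

A tree is then reconstructed from each replicate, and the fraction of replicates containing a given branch is its support value.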

Session B-435: triMS5 - storing LC-IMS-MS data in HDF5
COSI: Non-COSI
  • Jennifer Leclaire, Institute for Computer Science, JGU Mainz, Germany
  • Stefan Tenzer, UMC of the Johannes-Gutenberg-University Mainz, Germany
  • Andreas Hildebrandt, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany

Short Abstract: Mass spectrometry-based proteomics is a key technology for the elucidation of many biological processes and, as such, is a dynamically evolving research field aiming at the complete identification and quantification of the proteins present in a complex sample. Ongoing developments, not only in instrumentation but also in the design of workflows, such as the integration of additional separation strategies like ion mobility separation (IMS) and data-independent acquisition (DIA) modes, lead to continuously growing complexity of raw data, which in turn poses a challenge to existing analysis tools as well as storage routines. Raw datasets are stored in vendor-specific formats, and the provided software packages are usually closed-source or restricted to the Microsoft Windows operating system. Here, we introduce triMS5, an open-source data format for liquid chromatography coupled mass spectrometry datasets enhanced with IMS (LC-IMS-MS), based on the hierarchical data format version 5 (HDF5). HDF5 is a well-established scientific data format that is highly optimized for flexible I/O. In particular, triMS5 benefits from HDF5's chunked representation: data can be partitioned and then processed individually, e.g., by natively supported compression filters. This chunked representation is combined with a compressed sparse row (CSR) data layout to ensure storage efficiency. Additionally, triMS5 provides rapid extraction of signal regions of interest in all three dimensions (m/z, retention and drift time) by using a multi-dimensional search tree, with language bindings to C/C++ and Python.
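A toy sketch of the CSR-in-HDF5 idea using h5py; the group and dataset names and the example matrix are invented for illustration and do not follow the actual triMS5 schema.

```python
import h5py
from scipy.sparse import random as sparse_random

# Hypothetical sparse intensity matrix for one (retention time x m/z)
# plane; real LC-IMS-MS layouts are richer, this only illustrates
# storing a CSR matrix as chunked, compressed HDF5 datasets.
plane = sparse_random(1000, 5000, density=0.01, format="csr",
                      random_state=0)

with h5py.File("trims5_sketch.h5", "w") as f:
    grp = f.create_group("ms_level_1")
    # The three CSR arrays go into chunked, gzip-compressed datasets,
    # so slices can be decompressed chunk by chunk on read.
    for name, arr in (("data", plane.data),
                      ("indices", plane.indices),
                      ("indptr", plane.indptr)):
        grp.create_dataset(name, data=arr, chunks=True, compression="gzip")
```

Chunking is what makes partial reads cheap: a query touching a few rows only decompresses the chunks that intersect those rows rather than the whole matrix.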

Session B-436: TrAMiS - A distributed large-scale molecular trajectory representation and analysis framework for Apache Spark
COSI: Non-COSI
  • Thomas Kemmer, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany
  • Christian Ortwein, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany
  • Marialore Sulpizi, Department of Physics, Johannes Gutenberg University Mainz, Germany
  • Andreas Hildebrandt, Institute of Computer Science, Johannes Gutenberg University Mainz, Germany

Short Abstract: Molecular dynamics (MD) simulations are among the most important, versatile, and accurate tools for studying molecular behavior in a variety of different settings. Furthermore, they are the main provider of high-resolution molecular trajectory data, which are not only essential in drug design and protein docking, but also of particular importance in atomic-scale studies, such as the analysis of the dielectric properties of biological systems and media. MD simulation toolsets have been highly optimized over the past decades, and are now capable of handling almost arbitrary systems at multi-scale or fixed resolutions. At the same time, these tools utilize multi-core platforms and multi-node environments, e.g., large compute clusters, to allow extended simulation times and finer resolutions in the same overall runtime, resulting in an exponential growth of the generated data and causing I/O operations to become a major bottleneck of the subsequent analysis. Here, we present our software framework for representing and analyzing GROMACS-generated molecular trajectory and topology data, combining the Apache Avro data serialization system and the Apache Spark data processing engine. Our data representation layer is specially tailored to distributed file systems, reducing communication overhead while simultaneously increasing reliability and data redundancy in multi-node environments. The data can be used either with our own analysis tools, e.g., to compute radial distribution functions for user-selected partitions of the given system, or integrated into existing pipelines owing to the format's bindings to several other programming languages, including C(++), C#, Java, Python, and Perl.
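A minimal PySpark sketch of the kind of distributed per-frame analysis such a framework enables; the Avro record schema, file names and the centroid computation below are illustrative assumptions, not TrAMiS code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reading Avro requires the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.appName("trajectory-sketch").getOrCreate()

# Hypothetical per-atom records with columns: frame, atom_id, x, y, z.
traj = spark.read.format("avro").load("trajectory.avro")

# Example analysis distributed across the cluster: the geometric
# centre of the system in every frame.
centres = (traj.groupBy("frame")
               .agg(F.avg("x").alias("cx"),
                    F.avg("y").alias("cy"),
                    F.avg("z").alias("cz")))
centres.write.mode("overwrite").parquet("frame_centres.parquet")
```

Because the Avro records live on a distributed file system, each worker reads and aggregates only its local blocks, which is exactly the I/O-bottleneck relief the abstract describes.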

Session B-437: Ebola virus multi-alignment: analysis and visualization
COSI: Non-COSI
  • Paulina Hyży, University of Warsaw, Poland
  • Jakub Tyrek, University of Warsaw, Poland
  • Norbert Dojer, University of Warsaw, Poland

Short Abstract: A multiple sequence alignment is a rich source of knowledge. However, even a deep analysis is not enough to make this knowledge easily available to other researchers: to fully benefit from it, one has to visualize the research results in a way that is convenient for others to consume. As an example of this approach, we will present a multi-alignment of the Ebola virus. We extract the most relevant information, golden path predictions and the sequence structure, as a single model rendered by a web browser. Visual readability and computational efficiency are achieved through the use of a graph representation.

Session B-438: VHLdb: a curated community resource for Von Hippel Lindau syndrome
COSI: Non-COSI
  • Francesco Tabaro, University of Padua, Italy
  • Federica Quaglia, University of Padua, Italy
  • Giovanni Minervini, University of Padua, Italy
  • Damiano Piovesan, University of Padua, Italy
  • Silvio C. E. Tosatto, University of Padua, Italy

Short Abstract: Mutations in the von Hippel-Lindau tumor suppressor protein (pVHL) predispose carriers to develop tumors affecting specific organs, such as the retina, epididymis, adrenal glands, pancreas and kidneys. VHLdb (http://vhldb.bio.unipd.it/) is a publicly available resource collecting interaction and mutation data from different resources. Currently, it provides more than 400 pVHL-interacting proteins and more than 1,000 pVHL mutations, making VHLdb the largest available database for pVHL-related information. The set of pVHL-interacting proteins was generated by collecting and annotating data from different public databases; a quarter of the retrieved pVHL interactors has been manually curated, adding extra value to the data. Mutation data have been collected and annotated from published papers and selected to be highly relevant to clinically observed phenotypes. VHLdb offers different ways to access its data. First, a graphical web interface exploits advanced visualization strategies to clearly depict both interaction and mutation data. Second, a public RESTful API (Application Programming Interface) is available for headless access to the data. VHLdb is actively maintained at the University of Padova: three times a year novel data are added and existing entries are updated and reviewed. The maintenance process is managed via a curator web interface, which lets an expert user upload custom data; the annotation of novel entries is performed by automatic in-house software. A user-friendly feedback function to improve database content through community-driven curation is also provided.

Session B-439: Class Imbalance learning for the analysis of protein dynamics in mammalian heart proteome
COSI: Non-COSI
  • Bilal Mirza, ucla, United States
  • Jennifer Polson, UCLA, United States
  • Ding Wang, UCLA, United States
  • Howard Choi, UCLA, United States
  • Peipei Ping, UCLA, United States

Short Abstract: Machine learning methods have been effectively applied in biological studies to convert data into knowledge. In particular, omics datasets contain huge amounts of information that can be extracted using state-of-the-art machine learning approaches. A special type of machine learning, referred to as class imbalance learning (CIL), is very useful for datasets with rare biological patterns. Imbalance in a dataset refers to a condition where the number of sample points belonging to one group is much smaller than the number in the other groups. Standard machine learning models trained on such imbalanced datasets are biased towards the larger group, while the recognition rate on the smaller group is very low. For example, a standard machine learning model trained to identify the distinct molecular signatures of a functional protein group in the mammalian heart, with respect to the entire protein ensemble, obtains sub-optimal results. In this study, we applied a cost-sensitive CIL method with a support vector machine (SVM) as the base classifier to study the temporal dynamics of two small functional groups in the mammalian heart proteome: contractile proteins and degradation machineries. The analysis is based on the fold change in protein abundance values at seven time points and across six genetic mouse strains. We observed that the temporal dynamics of contractile proteins are markedly different from those of the entire protein ensemble, achieving a high recognition rate with the SVM-based CIL models, whereas no distinct pattern was observed for the protein degradation machineries, whose recognition rate was close to random guessing.
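A minimal sketch of cost-sensitive class imbalance learning with an SVM base classifier, in the spirit of (but not taken from) the study; the toy data, feature layout (7 time points x 6 strains = 42 fold-change features) and the use of scikit-learn's class-weight reweighting are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy imbalanced stand-in: 30 "contractile" proteins vs 470 background
# proteins, each described by 42 fold-change features.
X = rng.normal(size=(500, 42))
y = np.zeros(500, dtype=int)
y[:30] = 1
X[y == 1] += 0.8          # give the minority class a weak signal

# class_weight="balanced" penalizes errors inversely to class
# frequency: the basic cost-sensitive CIL idea with an SVM learner.
clf = SVC(kernel="rbf", class_weight="balanced")
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(scores.mean())
```

Without the cost-sensitive weighting, the same SVM tends to predict the majority class almost everywhere, which is the bias the abstract describes.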

Session B-440: Pan-genome annotation transfer
COSI: Non-COSI
  • Jakub Tyrek, University of Warsaw, Poland
  • Norbert Dojer, University of Warsaw, Poland
  • Paulina Hyży, University of Warsaw, Poland

Short Abstract: Using a single reference genome per species is the standard approach for exploring genetic mechanisms. It has become acknowledged that this approach misses an important aspect: inter-individual genomic variation. Thanks to the development of sequencing techniques, acquiring new genomic data has become cheaper, faster and more accessible, enabling pan-genomic approaches. One important aspect of such approaches is the effective transfer of genome annotations. We compare different approaches to the task of transferring annotations between aligned genomes, using existing tools including annotation transfer tools (CrossMap), annotation correction tools (Mugsy-annotator) and prediction tools (Prokka).

Session B-441: Recognition site avoidance as an anti-restriction strategy of prokaryotic viruses
COSI: Non-COSI
  • Ivan Rusinov, Lomonosov Moscow State University, Russia
  • Anna Ershova, Lomonosov Moscow State University, Russia
  • Sergey Spirin, Lomonosov Moscow State University, Russia
  • Anna Karyagina, Lomonosov Moscow State University, Russia
  • Andrei Alexeevski, Lomonosov Moscow State University, Russia

Short Abstract: Restriction-modification (R-M) systems protect prokaryotes from the invasion of foreign DNA, such as bacteriophages. An R-M system is specific to a short DNA sequence called a recognition site. Bacteriophages avoid some recognition sites of R-M systems in their genomes. This avoidance is considered one of the anti-restriction strategies of bacteriophages but has not been systematically studied. We analyzed restriction site avoidance in the genomes of 2069 prokaryotic viruses. As one might expect, DNA bacteriophages demonstrate significant restriction site avoidance, while RNA phages do not. DNA bacteriophages commonly avoid only the recognition sites of Type II R-M systems (excluding subtype IIG); sites of Type I, IIG, and III systems are avoided only in scattered instances. This could indicate that bacteriophages have other widespread anti-restriction strategies targeting such R-M systems. We also demonstrated that Myoviridae coliphages encoding DNA hydroxymethylase (an anti-restriction enzyme) do not avoid Type II restriction sites, while related phages without the gene avoid 73.7% of such sites. Temperate and lytic bacteriophages show different trends in Type II restriction site avoidance: lytic phages completely eliminate restriction site occurrences from their genomes more often (11.9% of the sites) than temperate phages do (2.0% of the sites). This is probably caused by the long-term prophage stage, during which a temperate bacteriophage shares the host's selective pressure affecting the oligonucleotide composition of its genome. The average number of avoided potential R-M sites in a phage genome exceeds the average number of R-M systems of a bacterium several times over, which might indicate that bacteriophages do not generally specialize in a single host strain.

Session B-442: The R package zeroSum
COSI: Non-COSI
  • Thorsten Rehberg, University of Regensburg, Germany
  • Michael Altenbuchinger, University of Regensburg, Germany
  • Rainer Spang, University of Regensburg, Germany

Short Abstract: zeroSum is an R package for fitting reference-point-insensitive linear models by imposing the zero-sum constraint combined with elastic net regularization. The zero-sum constraint makes linear models invariant to sample-wise shifts, thus working around normalization problems and measurement uncertainties such as diluted samples. The advantages of the zero-sum constraint for data analysis, especially for building cross-platform signatures, are shown in the presentation “Molecular signatures that can be transferred across different omics platforms” by Altenbuchinger et al. We present our efficient coordinate descent algorithm for fitting generalized linear zero-sum models, along with details of the C++ implementation that forms the core library of the zeroSum R package. Moreover, we give a quick-start tutorial showing the basic steps for creating linear zero-sum models with our package in R.
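As a hedged illustration (the exact loss and penalty weighting used by the package are not spelled out in the abstract, so the concrete form below is an assumption), the fitted objective can be written as a standard elastic net with one extra linear constraint:

```latex
\hat{\beta}_0,\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
  \;+\;\lambda\Bigl(\alpha\lVert\beta\rVert_1+\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr)
  \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j=0 .
```

The shift-invariance claim then follows directly: replacing every feature vector x_i by x_i + c·1 (a sample-wise shift, e.g. caused by dilution) leaves x_i^T β unchanged, because 1^T β = 0 under the constraint.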

Session B-443: Pescal++: A high performance standards compliant tool for label free quantitation of peptides from LC-MS/MS shotgun proteomics data
COSI: Non-COSI
  • Ryan Smith, Queen Mary University of London, United Kingdom
  • Pedro Cutillas, Barts Cancer Institute, United Kingdom
  • Jon Hays, Queen Mary University of London, United Kingdom
  • Conrad Bessant, Queen Mary University of London, United Kingdom

Short Abstract: Advances in the sensitivity and accuracy of mass spectrometry instrumentation have led to an increase in the size and ambition of large-scale label-free proteomics studies, which, in turn, has highlighted the need for fast and stable high-throughput bioinformatics software. Pescal++ has been developed in the C++ programming language and enables accurate and fast quantitation of peptides in large-scale proteomics experiments. HUPO-PSI data standards such as mzIdentML (input) and mzTab (output) have been adopted for connectivity with other tools, allowing Pescal++ to be used as part of complex analysis workflows. Pescal++ has been successfully applied to a range of datasets, including a very large phosphoproteomics experiment in which we were able to quantify 30,000+ peptides in a single experiment containing 900+ samples. Pescal++ will be used as a platform for studying further optimisation and parameterisation at key stages in the quantitation workflow, where its stability and rapid performance in processing hundreds of samples simultaneously will be particularly useful for developing a more accurate retention time alignment algorithm.

Session B-444: Ultra-fast 2-way and 3-way SNP Interaction Tests on FPGAs and GPUs
COSI: Non-COSI
  • Lars Wienbrandt, Institute of Clinical Molecular Biology, Kiel University, Germany
  • Jan Christian Kässens, Institute of Clinical Molecular Biology, Kiel University, Germany
  • Matthias Hübenthal, Institute of Clinical Molecular Biology, Kiel University, Germany
  • David Ellinghaus, Institute of Clinical Molecular Biology, Kiel University, Germany

Short Abstract: Exhaustive higher-order SNP interaction testing is computationally very demanding due to its algorithmic complexity. The combination of FPGA (Field Programmable Gate Array) and GPU (Graphics Processing Unit) computing technologies provides an ideal architecture to significantly speed up such interaction tests. The problem is split into two main parts, each implemented on the architecture that fits it best: the first part is the creation of contingency tables, which ideally suits FPGA technology, and the test statistic based on these contingency tables is then computed efficiently with GPU technology. We show that the application of the information gain measure, an entropy-based test statistic, delivers significant results in an example ulcerative colitis case-control dataset for 2-way (SNPxSNP) as well as 3-way (SNPxSNPxSNP) interaction analysis. We confirmed the validity of the statistic by evaluating two different ways to determine the null distribution: firstly, we tested 300 cross-validated permutations of the trait of the original data, and secondly, 100 cross-validated reduced datasets with 10% of the original SNPs and the same 300 trait permutations. We achieve a speedup of more than 1,650-fold on our FPGA-GPU computer using four Xilinx Kintex UltraScale KU115 FPGAs and four Nvidia Tesla P100 GPUs when compared to a multi-core CPU cluster node (32 threads on Intel Xeon E5-2667v4), reducing the computational runtime from 11.3 years to only 2.5 days for a 3-way test of 5,725 SNPs, >43,000 samples and 300 permutations.
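The abstract names the information gain, an entropy-based measure of synergy between SNPs. One common definition for the 2-way case is IG = I(SNP1,SNP2; Y) − I(SNP1; Y) − I(SNP2; Y); the CPU-only NumPy sketch below (the genotype coding and toy data are assumptions, and this is of course nothing like the FPGA/GPU implementation) shows how it is computed from contingency tables.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(table):
    """I(X;Y) from a joint contingency table of counts."""
    return (entropy(table.sum(axis=1)) + entropy(table.sum(axis=0))
            - entropy(table.ravel()))

def information_gain(snp1, snp2, trait):
    """IG = I(SNP1,SNP2;Y) - I(SNP1;Y) - I(SNP2;Y): the synergy of
    two SNPs about case/control status beyond their single effects."""
    t12 = np.zeros((9, 2))            # joint genotype (3x3) x trait
    t1 = np.zeros((3, 2))
    t2 = np.zeros((3, 2))
    for g1, g2, y in zip(snp1, snp2, trait):
        t12[3 * g1 + g2, y] += 1
        t1[g1, y] += 1
        t2[g2, y] += 1
    return mutual_info(t12) - mutual_info(t1) - mutual_info(t2)

rng = np.random.default_rng(1)
snp1 = rng.integers(0, 3, 2000)       # genotypes coded 0/1/2
snp2 = rng.integers(0, 3, 2000)
trait = rng.integers(0, 2, 2000)      # case/control status
print(information_gain(snp1, snp2, trait))
```

The contingency-table counting is the cheap, streaming part that maps naturally onto FPGAs, while the floating-point entropy arithmetic over millions of tables is what the GPUs handle.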

Session B-445: Assembly and Annotation of the Hexaploid Oat Genome
COSI: Non-COSI
  • Rachel Walstead, University of North Carolina at Charlotte, United States
  • Adam Whaley, University of North Carolina at Charlotte, United States
  • Robert Reid, University of North Carolina at Charlotte, United States
  • Veronica Vallejo, PepsiCo, United States
  • Cory Brouwer, University of North Carolina at Charlotte, United States
  • Jessica Schlueter, University of North Carolina at Charlotte, United States

Short Abstract: The hexaploid oat (Avena sativa L.) is a staple cereal crop used for both human consumption and animal feed. Despite the importance of cereals, genomic resources for oat are lagging behind many other crops. The estimated genome size of A. sativa is about 13 Gb. We aim to fully sequence, assemble, and annotate the hexaploid oat genome (2n = 6x = 42). To this end, we utilized PacBio RSII technology and sequenced approximately 580 SMRT cells, achieving a coverage of approximately 40X. We are currently assembling the genome using the FALCON, Canu, and SMARTdenovo assemblers, and upon completion we will compare the quality of each assembly. We will then annotate the genome using a combined approach of RNA-seq data, predictive gene models, and comparative annotations from other grass genomes, integrated using MAKER. The genomic information obtained by this project will be a valuable resource for crop scientists and breeders.

Session B-446: A proteome informatic approach to investigate the role of retroelement proteins in disease
COSI: Non-COSI
  • Mohamed Nazrath Mohamed Nawaz, Queen Mary - University of London, United Kingdom
  • Paul Hurd, Queen Mary - University of London, United Kingdom
  • Miguel Branco, Queen Mary - University of London, United Kingdom
  • Conrad Bessant, Queen Mary - University of London, United Kingdom

Short Abstract: Retroelements have been implicated in a number of diseases, but little is known about the behaviour of the proteins encoded by these elements. To contribute to understanding how retroelement proteins relate to disease, we developed an unbiased proteomic approach to re-analyse large amounts of publicly available proteomics data. The main database used was PRIDE, from which we re-analysed disease datasets to detect retroelement proteins, helping to build a picture of retroelement protein expression across a range of diseases. A pipeline was created to automatically carry out spectral data re-analysis using the combination of SearchGUI and PeptideShaker for confident protein identification on a high-performance computing cluster. Other available proteomics datasets, from CPTAC and PeptideAtlas, could potentially also be re-analysed. Furthermore, different genetic variants and the tissue specificity of retroelement proteins were explored as a first step towards understanding the role retroelement proteins play in disease.

Session B-447: Sequence based prediction of TCR and peptide interaction
COSI: Non-COSI
  • Vanessa Isabell Jurtz, Technical University of Denmark, Denmark
  • Martin Closter Jespersen, Technical University of Denmark, Denmark
  • Kamilla Kjærgaard Jensen, Technical University of Denmark, Denmark
  • Bjoern Peters, La Jolla Institute for Allergy and Immunology, United States
  • Morten Nielsen, Technical University of Denmark, Denmark

Short Abstract: A major challenge for T cell therapy and rational identification of T cell epitopes is the identification of the cognate target (the peptide-HLA complex) of a given TCR. While reliable predictions of HLA-peptide interaction are available for most HLA class I alleles, prediction models for the interaction between the TCR and the HLA-peptide complex have not yet, to the best of our knowledge, been described. Recent sequencing projects have generated a considerable amount of data relating TCR sequences to the HLA-peptide complexes they recognize. We utilize such data to train sequence-based predictors of the interaction between TCRs and peptides presented by HLA-A*02:01. Our models are based on convolutional neural networks, which are especially designed to meet the challenges posed by sequences of variable length, such as TCRs. We show that such sequence-based models allow for the identification of the cognate peptide-HLA target of a given TCR from its sequence alone. Moreover, we expect predictive performance to increase as more data becomes available.

Session B-448: Bacteriophage Whole Genome Alignment and Recombination History Reconstruction
COSI: Non-COSI
  • Krister Swenson, CNRS, Université de Montpellier, France
  • Anne Bergeron, Universite du Quebec a Montreal, Canada
  • Severine Berard, Universite Montpellier, France
  • Annie Chateau, Universite Montpellier, France

Short Abstract: Virus genomes are generally very compact due to the physical constraint of fitting into a capsid: coding regions are dense, and gene orders are relatively conserved between strains. The evolution of virus genome architecture, however, is complicated by the -- often programmed -- existence of recombination points that mix genetic material from two individuals into a new mosaic strain. Phylogenetic reconstruction of virus strains can thus be tricky business, since obtaining clean alignments in the presence of recombination is difficult. To this end, we address two current challenges: 1. the effective "alignment" of viral genomes, and 2. the study of the peculiar nature of bacteriophage recombination histories. The mosaic structure of viral sequence alignments is captured by our tool called Alpha (ALignment of PHAges), which builds a partial order on well-aligned blocks. With homologous blocks in hand, we study a model of bacteriophage recombination that requires two homologous points for each recombination event, and we show conditions under which recombination histories can be readily reconstructed.

Session B-449: EpiC: assessing Epigenetics profiles in Cancer samples contaminated by normal cells
COSI: Non-COSI
  • Elnaz Saberi Ansari, Institute Cochin, France
  • Valentina Boeva, Institute Cochin, France

Short Abstract: The aim of this research is to develop a computational method to characterize epigenetic profiles (histone modifications and open chromatin sites) in primary tumor tissues that represent a mixture of cancer and normal cells. The method can be used by researchers studying chromatin remodeling in cancer initiation and progression and searching for epigenetic markers associated with drug sensitivity and overall patient survival. In this method, the enrichment in histone modifications in a primary tumor is modeled as a linear mixture of the signals coming from the normal and cancer cells. The sample with the lowest level of contamination by normal cells is taken as the reference, and the algorithm tries to extract the tumor signal from the samples. The data are normalized by copy number and tumor purity, which are assessed using the Control-FREEC software, and also by the noise ratio between different experiments. After removing the tumor signal, the remaining signal (a mixture of normal cells) is subjected to linear decomposition using a non-negative matrix factorization algorithm or independent component analysis. Our method will be validated on both simulated and experimental datasets. During the validation on simulated data, we will address questions such as the minimal required number of tumor samples, the maximal level of contamination, and the maximal number of different contaminating normal cell types. Experimental validation will be done on 16 different neuroblastoma cell lines contaminated with normal cell lines at different rates; ChIP-seq profiles for 3 histone marks for two normal and 19 neuroblastoma cell lines are already available in our lab. This project is supported by Worldwide Cancer Research.
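The final linear decomposition step can be illustrated with scikit-learn's NMF on synthetic data; the dimensions, the component count and the generative model below are assumptions made purely for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical stand-in: 10 tumour-signal-removed ChIP-seq profiles
# over 1,000 genomic bins, each a non-negative mixture of 3 normal
# cell-type signals.
true_profiles = rng.gamma(2.0, size=(3, 1000))
weights = rng.dirichlet(np.ones(3), size=10)   # per-sample proportions
signal = weights @ true_profiles               # observed mixtures

# Decompose the residual normal-cell mixture: W recovers per-sample
# proportions, H recovers per-cell-type signal profiles.
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(signal)
H = model.components_
```

Non-negativity is the natural constraint here, since both cell-type proportions and ChIP-seq enrichment signals cannot be negative.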

Session B-450: Prospecting in Contributed Personal Genomic Data
COSI: Non-COSI
  • Richard Shaw, Repositive Ltd., United Kingdom
  • Dennis Schwartz, Repositive Ltd., United Kingdom
  • Manuel Corpas, Repositive Ltd., United Kingdom
  • Fiona Nielsen, Repositive Ltd., United Kingdom

Short Abstract: Open-access personal genomic data contributed directly by the individuals genotyped are a growing resource but, as with human genomic data in general, their storage is fragmented across multiple sites. Using the Repositive human genomic metadata aggregation platform (https://discover.repositive.io/?ECCB2017), we explored the landscape of such genomic data, within the scope of SNP array genotypes generated by a prominent provider (23andMe TM). Our approach was to search for metadata containing the name of the provider, download the corresponding (3137) data files and then filter out those files not matching the format of interest (GRCh37 23andMe genotypes) or that appeared corrupted. An initial principal component analysis revealed that 122 of the 2402 remaining genotypes were from the same individual as other genotypes in the dataset. Some corresponded to identical files submitted multiple times to the same or different repositories, but others to different versions of the same genotype. Mapping the deduplicated set of 2280 genotypes onto principal component axes generated from a set of African, Asian and European genotypes from 1000 Genomes populations and then applying nearest neighbour classification showed that the dataset is predominantly composed of European ancestry genotypes. Promethease (https://www.snpedia.com/index.php/Promethease) analyses of these genotypes revealed, among other traits, a preponderance of male individuals. With this analysis we have shown that it is possible to collect and aggregate a large dataset from open-access data available across multiple data sources. The examination of the data may be useful in further investigations into linking genotype and phenotype.
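A minimal sketch of the projection-and-classification step, assuming genotypes are already encoded as 0/1/2 allele counts over a shared SNP set; the shapes and labels below are placeholders, not the actual data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    ref = rng.integers(0, 3, size=(500, 20000))            # 1000 Genomes panel
    labels = rng.choice(["AFR", "EAS", "EUR"], size=500)   # population labels

    pca = PCA(n_components=2).fit(ref)  # axes from the reference panel only
    knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(ref), labels)

    # project the deduplicated contributed genotypes onto the same axes
    contributed = rng.integers(0, 3, size=(2280, 20000))   # placeholder
    ancestry = knn.predict(pca.transform(contributed))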

Session B-451: A probabilistic approach to whole genome based phylogeny
COSI: Non-COSI
  • Johanne Ahrenfeldt, Technical University of Denmark, Denmark
  • Anders Gorm Pedersen, Technical University of Denmark, Denmark
  • Anders Krogh, University of Copenhagen, Denmark
  • Ole Lund, Technical University of Denmark, Denmark

Short Abstract: The use of whole genome sequencing is increasing in diagnostics. To perform outbreak analysis it is crucial to be able to make accurate phylogenetic trees. These phylogenetic trees are often based on mapping of raw reads to a reference genome. Raw read data, however, contain more information than is used by current methods for WGS phylogeny, which are mostly based on nucleotide calling. We exploit this information with an algorithm that utilizes all the information available when mapping raw reads to a reference genome. The phylogenetic distance is calculated as the sum over positions of the probability that each position differs between two samples. This probability is calculated using a Bayesian approach. The method is tested on a dataset with known phylogeny, generated by in vitro evolution of an E. coli K12 strain (Ahrenfeldt et al. 2017).
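In LaTeX notation, the distance described above can be written as a sum of per-position mismatch probabilities; this is our reading of the abstract, with symbols chosen for illustration rather than taken from the paper:

    d(a, b) = \sum_{i=1}^{L} P\left(a_i \neq b_i \mid \text{mapped reads}\right)

where L is the length of the reference and each term is the posterior probability, computed with the Bayesian model from the read pileups of the two samples, that samples a and b carry different nucleotides at position i.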

Session B-452: Optimization of mutational pressure in bacterial genomes according to costs of amino acid replacement
COSI: Non-COSI
  • Paweł Mackiewicz, University of Wroclaw, Poland
  • Paweł Błażej, University of Wroclaw, Poland
  • Małgorzata Grabińska, University of Wroclaw, Faculty of Biotechnology, Poland
  • Małgorzata Wnętrzak, University of Wroclaw, Faculty of Biotechnology, Poland
  • Dorota Mackiewicz, University of Wroclaw, Poland

Short Abstract: Mutations occurring in DNA are usually considered a spontaneous and random process. They are important in the evolution of organisms because they generate genetic variation. On the other hand, most mutations are undesirable because they make genes non-functional, and their repair requires a lot of energy. Therefore, we can expect that the mutational pressure should be optimized during evolution to simultaneously generate genetic diversity and preserve genetic information. In order to check the optimization level of empirical mutational pressures, we compared the matrices of nucleotide mutation rates derived from bacterial genomes with their best possible alternatives, found by an Evolutionary Multiobjective Optimization approach. We searched for the matrices that minimized or maximized costs of amino acid replacements resulting from differences in their physicochemical properties, e.g. hydropathy and polarity. It should be emphasised that the studied empirical nucleotide substitution matrices and the costs of amino acid replacements are independent, because these matrices were derived from sites free of selection on amino acid properties, and the amino acid costs assumed only amino acid physicochemical properties without any information about mutation at the nucleotide level. The obtained results indicate that the empirical mutational matrices have a tendency to minimize costs of amino acid replacements. This implies that bacterial mutational pressures can evolve to decrease the consequences of amino acid substitutions. However, the optimization is not complete, which enables the generation of some genetic variability necessary to adapt to a changing environment.
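The cost being optimized can be made concrete with a small sketch: given a matrix of nucleotide mutation rates and a codon table, the expected cost of the amino acid replacements induced by single-nucleotide mutations might be computed as below; the quadratic polarity cost and the weighting are our simplifications, not the published formulation:

    import itertools

    BASES = "ACGT"

    def replacement_cost(aa1, aa2, polarity):
        # simple squared difference in a physicochemical property
        return (polarity[aa1] - polarity[aa2]) ** 2

    def expected_cost(mutation_matrix, codon_table, polarity):
        """Average cost of amino acid replacements induced by single
        nucleotide mutations, weighted by the mutation rates.
        mutation_matrix: dict of dicts, e.g. mutation_matrix['A']['G'];
        codon_table: sense codon -> amino acid letter."""
        total, weight = 0.0, 0.0
        for codon in codon_table:
            for pos, alt in itertools.product(range(3), BASES):
                if alt == codon[pos]:
                    continue
                mutant = codon[:pos] + alt + codon[pos + 1:]
                if mutant not in codon_table:   # skip stop codons
                    continue
                rate = mutation_matrix[codon[pos]][alt]
                total += rate * replacement_cost(
                    codon_table[codon], codon_table[mutant], polarity)
                weight += rate
        return total / weight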

Session B-453: The role of signal-anchor in the subcellular localization of transmembrane protein
COSI: Non-COSI
  • Tatsuki Kikegawa, Department of Electronics, Graduate School of science and technology, Meiji University, Japan
  • Yuri Mukai, Department of Electronics, Graduate School of science and technology, Meiji University, Japan

Short Abstract: Transmembrane proteins are typical integral membrane proteins spanning biomembranes, including the endoplasmic reticulum (ER), Golgi and plasma membranes. Their functions are essential to maintain homeostasis via signal transduction, membrane transport and energy production. Their transmembrane regions usually consist of ten to thirty hydrophobic amino acids, which are known as ER-targeting signals called signal-anchors. However, the transport mechanisms underlying transmembrane protein localization from the ER to other organelles have not been elucidated. Understanding the mechanism of protein subcellular localization is believed to be crucial for the treatment of incurable diseases resulting from erroneous subcellular localization. In this study, to elucidate transport mechanisms of transmembrane proteins, the amino acid propensity around the signal-anchor was calculated. The transmembrane protein dataset was classified into three groups: plasma membrane proteins, ER membrane proteins and Golgi membrane proteins. The discrimination accuracy of each group was estimated from discrimination scores calculated with a position-specific scoring matrix and an artificial neural network, to evaluate whether transmembrane protein localization is determined by the sequences around the transmembrane region. Members of each group could be discriminated with high accuracy (> 90%) based on a 5-fold cross-validation test. The result suggested that the amino acid propensity around the transmembrane domain is related to the localization mechanisms. To verify this presumption experimentally, GFP fusion proteins with the signal-anchors of representative proteins selected from each group were designed. The subcellular localization of these GFP fusion proteins expressed in HeLa cells was observed by confocal laser fluorescence microscopy.
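As a rough illustration of the position-specific scoring matrix underlying the discrimination scores (the neural network stage is omitted), assuming fixed-length sequence windows around the signal-anchor; the background frequency and pseudocount are arbitrary choices here:

    import math

    AAS = "ACDEFGHIKLMNPQRSTVWY"

    def build_pssm(sequences, background=0.05, pseudo=1.0):
        """Log-odds PSSM from aligned windows around the signal-anchor;
        all windows are assumed to share one length."""
        length = len(sequences[0])
        pssm = []
        for pos in range(length):
            counts = {aa: pseudo for aa in AAS}
            for seq in sequences:
                counts[seq[pos]] = counts.get(seq[pos], pseudo) + 1
            total = sum(counts.values())
            pssm.append({aa: math.log((counts[aa] / total) / background)
                         for aa in AAS})
        return pssm

    def score(window, pssm):
        # higher scores indicate closer match to the group's propensity
        return sum(col.get(aa, 0.0) for aa, col in zip(window, pssm))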

Session B-454: Geena 2, a public web tool for automated analysis of MALDI/TOF mass spectra
COSI: Non-COSI
  • Paolo Romano, Ospedale Policlinico San Martino, Genoa, Italy, Italy
  • Aldo Profumo, Ospedale Policlinico San Martino, Genoa, Italy, Italy
  • Claudia Angelini, CNR - Istituto per l'Applicazione del Calcolo, Italy
  • Eugenio Del Prete, Department of Sciences, University of Basilicata, Potenza, Italy, Italy
  • Angelo Facchiano, CNR - Istituto di Scienze dell'Alimentazione, Italy

Short Abstract: Geena2 is a public tool for the pre-processing of MALDI/ToF spectra, developed to address the lack of tools for the automatic analysis of proteomics data and to help scientists who do not have a strong computer science background. Automation in the analysis of high-throughput data is useful both for managing large amounts of data and for supporting replicability and reproducibility of data analysis. Geena2 implements: a) unification of isotopic abundances for the same molecule, b) normalization of data against a standard, c) background noise reduction, d) computation of an average spectrum representative of replicate spectra, e) alignment of average spectra. Input consists of peak lists and parameters for setting up the procedure according to the user's needs. The output consists of average spectra, their alignment, and intermediate results for checking the correct execution. Geena2 was successfully used for the evaluation of the effects of long-term cryopreservation on serum samples, and then for two retrospective studies on the correlation between serum peptidomic profiles and cancer. These applications demonstrated that Geena2 is able to automate many steps in the pre-processing of MALDI/ToF spectra. We are now working to implement GeenaR, to exploit the power of R modules and broaden adoption by scientists without specific programming and computer science expertise, following the reproducible research philosophy.

Session B-455: Identifiers.org - Persistent identifier resolution and services
COSI: Non-COSI
  • Sarala Wimalaratne, EBI, United Kingdom
  • Nick Juty, EBI, United Kingdom
  • Henning Hermjakob, EBI, United Kingdom

Short Abstract: In this era of 'big data', it has become increasingly important to reference data consistently, robustly, and in a manner that facilitates perennial accessibility. This enables reliable referencing, allows users unfettered access to data and records, and facilitates interoperability between diverse data sets. Solutions for data identification and retrieval must contend with the distributed nature of data and with differing data accessibility and availability. Furthermore, it is necessary to record alternative means by which the same data can be referenced, to facilitate seamless transition across the identifier landscape. The Identifiers.org system provides a central infrastructure towards facilitating findable, accessible, interoperable and re-usable (FAIR) data. It offers a range of services to generate, resolve and validate persistent Compact Identifiers to promote the citability of individual data providers and integration with e-infrastructures. The Identifiers.org registry contains hundreds of manually curated, high quality data collections, with each assigned a unique prefix. A combination of the prefix and a locally assigned database identifier (accession) forms a Compact Identifier, [prefix]:[accession]. The Identifiers.org resolver provides a stable resolution service for these Compact Identifiers, taking into consideration information such as the uptime and reliability of all available hosting resources.
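For illustration, a Compact Identifier can be resolved programmatically by requesting its identifiers.org URL and following the redirect; the prefix and accession below are arbitrary examples:

    import requests

    # a Compact Identifier is [prefix]:[accession]; resolving it through
    # identifiers.org redirects to a hosting resource for that record
    compact_id = "GO:0006915"  # illustrative prefix and accession
    response = requests.get(f"https://identifiers.org/{compact_id}",
                            allow_redirects=True, timeout=30)
    print(response.url)  # final URL at the resolved data provider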

Session B-456: Single Cell Mass Cytometry Marker Panel Extension
COSI: Non-COSI
  • Tamim Abdelaal, Delft University of Technology, Netherlands
  • Ahmed Mahfouz, Delft University of Technology, Netherlands
  • Thomas Höllt, Delft University of Technology, Netherlands
  • Vincent van Unen, Leiden University Medical Center, Netherlands
  • Frits Koning, Leiden University Medical Center, Netherlands
  • Boudewijn Lelieveldt, Leiden University Medical Center, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands

Short Abstract: High-dimensional mass cytometry (CyTOF) allows simultaneous measurement of multiple cellular markers, providing a system-wide view of immune phenotypes at the single-cell level. The maximum number of markers that can be measured simultaneously (N) is limited to ~50 due to technical challenges. We propose an approach to integrate CyTOF data from several marker panels that include an overlapping marker set, allowing for a deeper interrogation of the cellular composition of the immune system. Assuming two CyTOF panels with share m

Session B-457: ARTISiN: A Repository and Multi-Agent System for omics data Integration and Identification of Signaling Networks
COSI: Non-COSI
  • Milton Y. Nishiyama-Jr, Laboratorio Especial de Toxinologia Aplicada, Instituto Butantan, Brazil
  • Marcelo S. Reis, Laboratório Especial de Ciclo Celular (LECC), Center of Toxins, Immune-response and Cell Signaling (CeTICS), Instituto Butantan, Brazil
  • Henrique C Vieira, Instituto Butantan, Brazil
  • Bruno F de Souza, Instituto Butantan, Brazil
  • Daniel F. Silva, Instituto Butantan, Brazil
  • Inácio L.M. Junqueira-De-Azevedo, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Julia P.C. Da Cunha, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Leo K. Iwai, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Junior Barrera, Instituto de Matemática e Estatística, Universidade de São Paulo, and CeTICS, Instituto Butantan, Brazil
  • Solange M.T. Serrano, Laboratório Especial de Toxinologia Aplicada - CeTICS, Instituto Butantan, Brazil
  • Hugo A. Armelin, Butantan Institute, Brazil

Short Abstract: The mission of CeTICS is to discover and chemically characterize molecular targets of venom toxins, which probably initiate biological responses of interest in human pathophysiology and therapeutics. The CeTICS sub-projects generate a vast amount of heterogeneous, low- and high-throughput omics data, whose complexity defies analytical efforts to uncover hidden biological knowledge. Moreover, the static maps eventually yielded by those omics analyses are very often not sufficient to unveil the underlying dynamics of cell signaling networks, which demands the design and simulation of quantitative dynamical models. We propose the ARTISiN platform, an amalgam of repositories and tools, both public and built in-house, for the identification of signaling networks and the assessment of mechanistic aspects of their actions. ARTISiN has been designed as a Multi-agent system (MAS), a composite system of multiple intelligent agents whose orchestration allows the solution of a complex problem; these agents are autonomous entities, each one focused on solving part of the problem. The two core components of this platform are: i) CeTICSdb, an access-controlled repository for storage, analysis and integration of omics data and also for generation of static maps of signaling networks; ii) SigNetSim, a tool in which an “artisan” selects part of the signaling network that could be approximately isolated as a functional module and “handcrafts” dynamical models to explain its underlying mechanism. Finally, our mid-term objective is to develop communication between CeTICSdb and SigNetSim, which will be accomplished through the MAS under development, and to make the platform available to the scientific community.

Session B-458: Study on the properties of alternative genetic codes in comparison with the canonical genetic code and theoretical codon assignments
COSI: Non-COSI
  • Pawel Blazej, University of Wroclaw, Poland
  • Przemyslaw Gagat, University of Wroclaw, Poland
  • Małgorzata Wnętrzak, Faculty of Biotechnology, University of Wrocław, Poland
  • Paweł Mackiewicz, University of Wroclaw, Poland

Short Abstract: Generally, the standard genetic code (SGC) is regarded as universal. However, there are many alternative genetic codes, whose number has increased rapidly in recent years. The large number of deviations in codon assignments raises further questions about the structure, properties and evolutionary directions of the existing codes. In this work, we evaluated differences between the SGC and existing alternative codes in terms of the costs of amino acid replacement based on their polarity. Furthermore, we tested the properties of all possible theoretical genetic codes that differ from the SGC in one, two or three changes in the assignments of codons to amino acids. Depending on the number of changes, a substantial fraction of the theoretical codes minimized the costs of amino acid replacement better than the SGC. Interestingly, many types of codon reassignments observed in the alternative codes are also responsible for a significant improvement of the fitness measure. These reassignments are among the most beneficial changes to the genetic code structure in terms of cost minimization, compared with the theoretical assignments. These findings suggest potential evolutionary directions of alternative genetic codes.

Session B-459: Nonpher: computational method for design of hard-to-synthesize structures
COSI: Non-COSI
  • Milan Voršilák, UCT Prague, Czech Republic
  • Daniel Svozil, UCT Prague, Czech Republic

Short Abstract: Machine learning methods are often used in cheminformatics to predict activity, cluster similar structures or classify structures into distinctive classes. To train a classifier, a training data set must contain examples from every class. A binary classifier needs examples from two classes, usually the positive and the negative (e.g. active/inactive). While a biological activity or toxicity can be experimentally measured, another important molecular property, synthetic feasibility, is a more abstract feature that cannot be easily assessed. Furthermore, synthetic feasibility is not only abstract, but hard-to-synthesize structures are also not readily available from any database. Nonpher is a computational method developed to construct the needed virtual library of hard-to-synthesize structures. Nonpher is based on a molecular morphing algorithm, which iteratively generates new structures by simple structural changes to a starting structure, such as the addition or removal of an atom or a bond. Nonpher was optimized to yield reasonably complex structures that are hard to synthesize. Structures generated by Nonpher were compared with structures selected by the SAscore and dense region (DR) methods. A random forest classifier trained on Nonpher data achieved better results than models obtained using SAscore and DR data.

Session B-460: Integrating gene expression and proteomics data into protein-protein interactions networks using a modular methodology based on open source software
COSI: Non-COSI
  • Frederico Guimarães, Centro de Pesquisas René Rachou, Brazil
  • Leilane Gonçalves, Instituto Oswaldo Cruz, Brazil
  • Henrique Toledo, Centro de Pesquisas René Rachou, Brazil
  • Daniela Resende, Instituto Oswaldo Cruz, Brazil
  • Jeronimo Ruiz, Centro de Pesquisas René Rachou, Brazil

Short Abstract: Nowadays there is a considerable amount of biological data available, but the challenges involved in extracting information from them are substantial and growing. Differences between data formats, the absence of a unified identifier for biological features and the lack of integration between existing databases are some of the challenges researchers face in the task of biological information mining. With the main goal of integrating biological data from different sources for further analysis and interpretation, we developed a methodology that uses a series of shell and Perl scripts to extract, filter and format protein interaction data from the STRING v.10 database and from high-throughput genomic data (RNASeq and shotgun proteomics), integrating them into protein-protein interaction networks using Cytoscape. The methodology was modularly structured and can be adapted and/or integrated to different analytical protocols and organisms. Specifically in this study, the model organism used was Trypanosoma cruzi, the causative agent of Chagas disease. As results, we generated a series of protein-protein interaction networks that emphasize, using Cytoscape visual styles, characteristics of biological interest, such as EC number, functional grouping, protein interaction types (binding, reaction, expression, activation, catalysis and post-translational modifications) and graph metrics. In these networks we could highlight several biologically relevant features, such as topological associations between gene regulatory mechanisms (kinases, phosphatases and RNA polymerases) and clusters of functionally associated proteins. Finally, the developed methodological approach emphasizes that integrating genomic data into PPI networks can be valuable in converting high-throughput data into biological knowledge.

Session B-461: The communities of party hubs in fusion protein-protein interaction networks increases their robustness against their site-directed knockouts
COSI: Non-COSI
  • Somnath Tagore, BAR-ILAN University, Israel
  • Vikrant Palande, BAR-ILAN University, Israel
  • Milana Frenkel-Morgenstern, BAR-ILAN University, Israel
Session B-462: Algorithms for Structural Variation Discovery Using Hybrid Sequencing Technologies
COSI: Non-COSI
  • Ezgi Ebren
  • Ayse Berceste Dincer
Session B-463: The Affinity Data Bank: An improved online suite of tools for investigation of protein-nucleic acid affinity models and biophysical analysis of regulatory sequences
COSI: Non-COSI
  • Cory Colaneri, University of Massachusetts Boston, United States
  • Brandon Phan, University of Massachusetts Boston, United States
  • Aadish Shah, University of Massachusetts boston, United States
  • Pritesh Patel, University of Massachusetts Boston, United States
  • Todd Riley, University of Massachusetts Boston, United States

Short Abstract: We present The Affinity Data Bank (ADB), an improved suite of tools that provides biologists with novel aids to deeply investigate the sequence-specific binding properties of a transcription factor (TF) or an RNA-binding protein (RBP), and to study subtle differences in specificity between homologous nucleic acid-binding proteins. Integrated with Pfam, the PDB, and the UCSC database, the ADB also allows for simultaneous interrogation of protein-DNA and protein-RNA specificity and structure in order to find the biochemical basis for differences in specificity across protein families. The ADB further includes a biophysical genome browser for quantitative annotation of levels of binding, using free protein concentrations to model the non-linear saturation effect that relates binding occupancy with binding affinity. The biophysical browser also integrates dbSNP and other polymorphism data in order to depict changes in affinity due to genetic polymorphisms, which can aid in finding both functional SNPs and functional binding sites. Lastly, the biophysical browser supports biophysical positional priors to allow for quantitative designation of the level of locus-specific accessibility that a protein has to the DNA. Importantly, the use of this toolset does not require bioinformatics programming knowledge, which makes the ADB tool suite highly useful for a wide range of researchers.

Protein concentration is an important ingredient that, along with the protein’s sequence specificity, greatly affects levels of protein-nucleic acid binding. In addition, as protein concentrations increase, the saturation of the highest-affinity binding sites additionally increases the levels of occupancy for functional medium- and low-affinity sites. This biophysical, non-linear relationship between free protein concentration, binding site affinity, and resultant binding is an important part of accurately determining the level of protein-DNA and protein-RNA binding under in vivo conditions.

Accurate protein-DNA affinity models are necessary but not sufficient to properly model and predict the level of in vivo protein-DNA binding and subsequent gene regulation. For example, in the human genome most possible binding sites are not accessible for binding by a TF. Tissue-specific chromatin state and accessibility is a complex, major factor that heavily influences protein-DNA binding. Because many possible binding sites are actually inaccessible for binding, methods that do not include in vivo accessibility when searching for putative binding sites in or near a gene have a high false positive rate. The ADB can properly model in vivo protein-DNA binding by integrating the effects of chromatin accessibility and epigenetic marks via the inclusion of biophysical occupancy-based and affinity-based positional priors.

Lastly, the ADB now includes two new tools for affinity model visualization and stochastic modeling of transcriptional and translational regulation. Firstly, the new graphical Universal Sequence Logo incorporates any order of nucleotide dependencies, insertions, and deletions between positions in a protein-DNA or protein-RNA binding affinity model. Secondly, the newly integrated Biochemical Network Stochastic Simulator (BioNetS) Version 2.0 can import the annotated binding sites, protein concentrations, and accessibility annotations in order to accurately model the stochastic behavior and dynamics of gene expression regulation at both the transcriptional and translational levels.
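The non-linear saturation effect relating occupancy to affinity that the browser models is commonly written as the standard binding isotherm; in LaTeX notation, and as a textbook sketch rather than the exact ADB formulation:

    \text{occupancy}(s) = \frac{[P]}{[P] + K_d(s)}

where [P] is the free protein concentration and K_d(s) the dissociation constant of binding site s. As [P] increases, the highest-affinity sites saturate while medium- and low-affinity sites continue to gain occupancy, which is exactly the behavior described above.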

Session B-464: Human Phenotype Ontology Prediction with the Utilization of Co-occurrences Between HPO terms and GO terms
COSI: Non-COSI
  • Tunca Doğan, METU / EMBL-EBI, Turkey
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Ahmet Rifaioglu, METU, Turkey
  • Volkan Atalay, METU, Turkey
  • Rengul Atalay, METU, Turkey
  • Maria Martin, EMBL-EBI, United Kingdom

Short Abstract: Here we propose a new approach to predict HPO term associations to human genes/proteins by detecting co-annotation fractions between all HPO term and Gene Ontology (GO) term combinations, using the annotations of the training-set genes/proteins. HPO terms and GO terms that frequently co-occur as annotations on different proteins are linked to each other (training step). Finally, proteins with a linked GO term annotation receive the corresponding HPO term as a prediction. The idea here is to associate HPO term Y with GO term X in the sense that: "if a protein loses its function defined by GO term X (or at least suffers a reduction in the defined functionality) as a result of a mutation, then it will cause the disease which is defined by the phenotype term Y". This idea is based on the nature of annotating genes/proteins with HPO terms, as only the mutated versions of these genes (i.e. disease-causing variants) are associated with genetic diseases and their phenotypic abnormality terms. Mutations usually lead to diseases by causing functionality losses in the gene products. As a result, if HPO term Y and GO term X are observed to co-occur frequently on different proteins, then the lost function which gave rise to the corresponding disease is probably the one defined by GO term X. We applied this methodology to predict HPO terms for the human protein dataset provided in the CAFA3 challenge.
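A minimal sketch of the training step under our reading of the abstract; the input mappings and thresholds are illustrative, not the published configuration:

    from collections import defaultdict
    from itertools import product

    def link_terms(go_annots, hpo_annots, min_fraction=0.5, min_support=5):
        """Link each GO term to HPO terms that frequently co-occur with it
        on the same proteins. go_annots / hpo_annots map protein -> set of
        terms; thresholds here are arbitrary examples."""
        go_count = defaultdict(int)
        pair_count = defaultdict(int)
        for protein, go_terms in go_annots.items():
            for go in go_terms:
                go_count[go] += 1
            for go, hpo in product(go_terms, hpo_annots.get(protein, ())):
                pair_count[(go, hpo)] += 1
        links = defaultdict(set)
        for (go, hpo), n in pair_count.items():
            if n >= min_support and n / go_count[go] >= min_fraction:
                links[go].add(hpo)  # proteins annotated with go receive hpo
        return links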

Session B-465: Integrative Mixed Graphical Models Identify Causal Factors for Chronic Lung Disease Diagnosis and Progression
COSI: Non-COSI
  • Andrew J Sedgewick, University of Pittsburgh, United States
  • Panayiotis V Benos, University of Pittsburgh, United States
  • Joseph Ramsey, Carnegie Mellon University, United States
  • Ivy Shi, University of Pittsburgh, United States
  • Dimitris Manatakis, University of Pittsburgh, United States
  • Yingze Zhang, University of Pittsburgh, United States
  • Jessica Bon, University of Pittsburgh, United States
  • Divay Chandra, University of Pittsburgh, United States
  • Chad Karoleski, University of Pittsburgh, United States
  • Peter Spirtes, Carnegie Mellon University, United States
  • Frank Sciurba, University of Pittsburgh, United States
  • Clark Glymour, Carnegie Mellon University, United States

Short Abstract: Integration of data from different modalities is a necessary step for multi-scale data analysis in many fields, including biomedical research and systems biology. Causal graphical models offer an attractive tool for this problem because they can represent both the complex, multivariate probability distributions and the causal pathways influencing the system. Graphical models learned from biomedical data can be used for classification, biomarker selection and functional analysis, while revealing the underlying causal network structure and thus allowing for arbitrary likelihood queries over the data. In this paper, we present and test new methods for finding directed graphs over mixed data types (continuous and discrete variables). We used this new algorithm, MGM-Learn, to identify variables causally linked to disease diagnosis and progression in various multi-modal datasets, including clinical datasets from chronic obstructive pulmonary disease (COPD). COPD is the third leading cause of death and a major cause of disability, so determining the factors that drive longitudinal lung function decline is very important. By applying our causal inference algorithm to the COPD dataset we were able to confirm and extend previously described connections, which provided new insights regarding the factors causally affecting the longitudinal lung function decline of COPD patients.

Session B-466: De novo pathway-based classification of breast cancer subtypes
COSI: Non-COSI
Session B-467: Detection of Significantly Differentially Expressed Cleavage Site Intervals within 3’ Untranslated Regions using CSI-UTR
COSI: Non-COSI
  • Eric Rouchka, University of Louisville, United States
  • Benjamin Harrison, University of New England,
  • Juw Won Park, University of Louisville, United States
  • Cynthia Gomes, University of Louisville, United States
  • Jeffrey Petruska, University of Louisville, United States
  • Matt Sapio, National Institutes of Health, United States
  • Michael Iadarola, National Institutes of Health,

Short Abstract: Motivation: Untranslated regions of the 3’ end of transcripts (3’UTRs) are critical for controlling transcript abundance and location. 3’UTR configuration is highly regulated and provides functional diversity, similar to alternative splicing of exons. Detailed transcriptome-wide profiling of 3’UTR structures may help elucidate mechanisms regulating cellular functions. This profiling is more difficult than for coding sequences (CDS), where exon/intron boundaries are well-defined. To enable this, we developed a new approach, CSI-UTR. Meaningful configurations of the 3’UTR are determined using cleavage site intervals (CSIs) that lie between functional alternative polyadenylation (APA) sites. The functional APAs are defined using publicly-available polyA-seq datasets biased to the site of polyadenylation. CSI-UTR can be applied to any RNASeq dataset, regardless of the 3’ bias.

Results: Using CSI-UTR, we produced a predefined set of CSIs for human, mouse, and rat. Previous studies indicate 3’UTR structure is highly regulated during nervous system functions. We therefore assessed CSI-UTR using archived RNASeq datasets from the nervous system (SRP056604 and SRP038707) and a rat dataset of our own. In all three species, CSI-UTR identified differential expression (DE) events not detected by standard gene-based differential analyses. Many DE events were in transcripts in which the CDS was unchanged. Enrichment analyses determined these DE 3’UTRs are associated with genes with known roles in neural processes. CSI-UTR is a powerful new tool to uncover DE that is undetectable by standard pipelines, but can exert a major influence on cellular function.

Availability: Source code, CSI BED files and example datasets are available at: http://bioinformatics.louisville.edu/CSI-UTR/

Contact: eric.rouchka@louisville.edu
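As a small illustration of the central data structure, a 3’UTR can be partitioned into CSIs lying between consecutive functional APA sites; the coordinate conventions below are illustrative only:

    def cleavage_site_intervals(utr_start, utr_end, apa_sites):
        """Partition a 3'UTR into cleavage site intervals (CSIs), i.e. the
        stretches between consecutive functional APA sites (0-based,
        half-open coordinates, chosen here for simplicity)."""
        sites = sorted(s for s in apa_sites if utr_start < s < utr_end)
        bounds = [utr_start] + sites + [utr_end]
        return list(zip(bounds[:-1], bounds[1:]))

    # e.g. a UTR spanning 1000-2000 with APA sites at 1200 and 1700
    print(cleavage_site_intervals(1000, 2000, [1700, 1200]))
    # [(1000, 1200), (1200, 1700), (1700, 2000)]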

Session B-468: Reprogrammed Lipid Metabolism in Bladder Cancer with Cisplatin Resistance
COSI: Non-COSI
  • Jayoung Kim

Short Abstract: Due to its tendency to recur and acquire chemoresistance quickly, bladder cancer (BC) remains an elusive and hard-to-treat disease. Patients with recurring BC and acquired chemoresistance have an extremely poor prognosis, with a 5-15% chance of 5-year survival. Thus, there is an unsolved yet highly urgent need to identify patients who are at a higher risk of developing chemoresistance. Currently, the molecular signatures underlying cisplatin resistance remain unknown. One avenue that could provide more information regarding resistance mechanisms is lipid metabolism. Metabolism of lipids is essential for cancer cells and is associated with the regulation of a variety of key cellular processes and functions. This study conducted a comprehensive and comparative lipidomic profiling of two isogenic human T24 bladder cancer cell lines, one of which was clinically characterized as cisplatin sensitive and the other as resistant. Immunohistochemistry analysis revealed that expression of cytosolic acetyl-CoA synthetase 2 (ACSS2) is positively correlated with aggressive BC. Ultra Performance Liquid Chromatography-Mass Spectrometry analysis profiled a total of 1,864 lipids, and the levels of differentially expressed lipids such as cholesteryl ester (CE(22:6)), triglyceride (TG(49:1)), and TG(53:2) were markedly higher in cisplatin-resistant cancer cells than in sensitive cells. The levels of metabolites such as CE(18:1), CE(22:6), TG(49:1) and TG(53:2) were greatly perturbed by ACSS2 inhibition. This study broadens our current knowledge of the links between cisplatin resistance and lipid metabolism in aggressive bladder cancer, and suggests potential biomarkers for bladder cancer patients at a higher risk.

Session B-469: Network Centrality in Non-alcoholic steatohepatitis: An Integrative Analysis
COSI: Non-COSI
  • Cristina Baciu, Toronto General Hospital, Canada
  • Marc Angeli, Toronto General Hospital, Canada
  • Elisa Pasini, Toronto General Hospital, Canada
  • Atul Humar, Toronto General Hospital,
  • Mamatha Bhat, Toronto General Hospital, Canada

Short Abstract: We performed an integrative computational analysis of publicly available gene expression data in human non-alcoholic steatohepatitis (NASH) from GEO. Pathways, networks, molecular interactions and functional analyses were generated using IPA. We discovered that HNF4A is the central gene in the network of NASH connected to metabolic diseases, and we show, for the first time to our knowledge, that HNF4A is central to the pathogenesis of NASH.

Session B-470: Integrating heterogeneous data using deep autoencoders for protein function prediction
COSI: Non-COSI
  • Vladimir Gligorijevic, Simons Center for Computational Biology, Flatiron Institute, United States
  • Meet Barot, Simons Center for Computational Biology, Flatiron Institute, United States
  • Richard Bonneau, Simons Center for Computational Biology, Flatiron Institute; NYU Departments of Biology and Computer Science, United States

Short Abstract: The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale nonlinear data representing different types of protein interactions. These types of data are more difficult to integrate with the standard methods of function prediction, which are often linear and unable to capture hierarchical, abstract features that are more indicative of protein function. Deep learning is a promising technique to deal with such problems, and has been shown to work well for several biological problems. Thus, we propose a method based on deep multimodal autoencoders to extract the features of proteins from multilayer molecular interaction networks. We apply this method on STRING networks to construct a common low-dimensional representation containing high-level protein features. We use different autoencoder architectures for handling different network modalities in the early layers of the autoencoder, later connecting all the architectures into a single bottlenecked layer from which we extract features to predict protein function for the yeast and human species. We compared the 5-fold cross validation predictive performance of our method with the state-of-the-art method, Mashup. Our results show that our method outperforms Mashup for both human and yeast STRING networks. We have also demonstrated the superior performance of our method in comparison to Mashup for predicting GO terms grouped into categories of varying specificity; i.e., we obtain micro-AUPR scores of 0.35 (Mashup: 0.29), 0.22 (Mashup: 0.20) and 0.52 (Mashup: 0.49) for predicting MF, BP and CC GO terms, respectively, belonging to the category of GO terms annotating between 11-30 human proteins.
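A minimal PyTorch sketch of the idea, with two input network modalities encoded separately and joined in a shared bottleneck; the layer sizes, depths and activations are placeholders, not the authors' architecture:

    import torch
    import torch.nn as nn

    class MultimodalAutoencoder(nn.Module):
        """Each modality has its own encoder/decoder; all are joined
        through a single shared low-dimensional bottleneck."""
        def __init__(self, dims=(1000, 800), hidden=256, bottleneck=64):
            super().__init__()
            self.encoders = nn.ModuleList(
                nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims)
            self.bottleneck = nn.Linear(hidden * len(dims), bottleneck)
            self.expand = nn.Linear(bottleneck, hidden * len(dims))
            self.decoders = nn.ModuleList(
                nn.Sequential(nn.ReLU(), nn.Linear(hidden, d)) for d in dims)
            self.hidden = hidden

        def forward(self, inputs):
            # encode each modality, concatenate, compress to shared features
            encoded = torch.cat([enc(x) for enc, x in zip(self.encoders, inputs)],
                                dim=1)
            z = self.bottleneck(encoded)   # shared protein features
            chunks = self.expand(z).split(self.hidden, dim=1)
            return z, [dec(c) for dec, c in zip(self.decoders, chunks)]

    # usage: z, recons = model([x_modality1, x_modality2]); train on a
    # reconstruction loss, then feed z to a function classifier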

Session B-472: DWCOACH: Predict protein complexes from dynamic weighted PPI networks by GO semantic similarity
COSI: Non-COSI
  • Xiaowu Sun, Information and Engineering Department, Capital Normal University, China
  • Lizhen Liu, Information and Engineering Department, Capital Normal University, China
  • Wei Song, Information and Engineering Department, Capital Normal University, China

Short Abstract: We propose a method called DWCOACH to predict protein complexes from dynamic weighted PPI (protein-protein interaction) networks by integrating multiple techniques. The method consists of three parts: the construction of dynamic networks, the computation of network weights and the prediction of protein complexes. Firstly, we combine gene expression data at different time points with a traditional static PPI network to construct dynamic sub-networks. This part improves the three-sigma method by optimizing the variance and mean for each protein based on parameter estimation and interval estimation. Secondly, to further filter out data noise, semantic similarity based on Gene Ontology (GO) is used as the network weight, together with principal component analysis (PCA), which is introduced to combine the weights computed by three traditional methods. Thirdly, the DWCOACH algorithm is applied to detect protein complexes, based on the “core-attachment” structural characteristic. DWCOACH selects proteins with a high weighted local clustering coefficient to construct cores; to expand a core, attachment proteins that satisfy the judgment condition are added to it; finally, since redundancies may be generated among the predicted protein complexes, the complexes are refined. Experimental results show that our method performs well in detecting complexes from dynamic weighted PPI networks.
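The first part can be illustrated with the basic three-sigma rule for deciding when a protein is active in a time-course; DWCOACH's refined variance and mean estimation is not reproduced here, this is only the baseline idea:

    import numpy as np

    def active_timepoints(expression, k=3):
        """Flag time points where one protein's expression exceeds
        mu + k*sigma over its own time-course (baseline three-sigma rule;
        DWCOACH refines the mu and sigma estimates)."""
        expr = np.asarray(expression, dtype=float)
        mu, sigma = expr.mean(), expr.std(ddof=1)
        return expr >= mu + k * sigma

    # the dynamic sub-network at time t then keeps only the proteins
    # flagged active at t, plus the static PPI edges between them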

Session B-473: Rfam: Growth and Improvements in the RNA Families Database
COSI: Non-COSI
  • Ioanna Kalvari, EMBL-EBI, United Kingdom
  • Joanna Argasinska, EMBL-EBI, United Kingdom
  • Natalia Quinones Olvera, EMBL-EBI, United Kingdom
  • Anton Petrov, EMBL-EBI, United Kingdom
  • Eric Nawrocki, National Institutes of Health, National Library of Medicine, United States
  • Rob Finn, EMBL-EBI,
  • Alex Bateman, EMBL-EBI, United Kingdom

Short Abstract: Rfam is a database of functional non-coding RNA families represented by multiple sequence alignments and consensus secondary structures. The sequence and structure information is used to build probabilistic models called covariance models that can find new instances of Rfam families in sequences and annotate genomes with non-coding RNAs. The Rfam website is available at http://rfam.xfam.org.

In the past year we continued the development of Rfam with three releases containing over 200 new RNA families. The Rfam website has been updated with a new unified search interface that allows searching by keywords, species names, RNA types, and more. All Rfam families have been analysed using the R-scape software, which identifies statistically significant basepairs supported by covariation and suggests alternative secondary structures that are consistent with the alignments. In order to make it easier to query the data in ways that are not supported by the website, we created a public MySQL database with the latest Rfam data. Work is underway on a major new version of Rfam, release 13.0, which will be built using a new sequence database based on a non-redundant genome collection maintained by UniProt. Rfam 13.0 will be available in late 2017.
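For example, the public MySQL database can be queried directly; the host, port and credentials below are those announced for the public read-only server at the time of writing and may change, and the table and column names are only assumed to match the Rfam schema:

    import pymysql  # assumes: pip install pymysql

    # public read-only connection details as documented by Rfam
    conn = pymysql.connect(host="mysql-rfam-public.ebi.ac.uk", port=4497,
                           user="rfamro", database="Rfam")
    with conn.cursor() as cur:
        # illustrative query: count families per RNA type
        cur.execute("SELECT type, COUNT(*) FROM family GROUP BY type")
        for rna_type, n in cur.fetchall():
            print(rna_type, n)
    conn.close()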

Rfam is continuously growing with the addition of new families and the development of new features. Multiple fixes have improved the quality of our data, and the transition to a genome-centric Rfam will not only reduce data redundancy but also enable meaningful taxonomic comparisons and frequent updates.

Session B-474: RNAcentral: The Unified Entry Point for Non-coding RNA Sequences
COSI: Non-COSI
  • Anton Petrov, EMBL-EBI, United Kingdom
  • Blake Sweeney, EMBL-EBI, United Kingdom
  • Boris Burkov, EMBL-EBI, United Kingdom
  • Natalia Quinones-Olvera, EMBL-EBI, United Kingdom
  • Simon Kay, EMBL-EBI, United Kingdom
  • Rob Finn, EMBL-EBI, United Kingdom
  • Alex Bateman, EMBL-EBI,

Short Abstract: RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences that represents all types of ncRNA from a broad range of organisms. RNAcentral provides a single entry point for anyone interested in ncRNA biology by integrating the data from a consortium of RNA resources. The RNAcentral website is available at http://rnacentral.org.

RNAcentral currently contains over ten million ncRNA sequences from more than twenty RNA databases, such as miRBase, RefSeq, GtRNAdb and others. Recent updates include ncRNA data from HGNC, Ensembl, and FlyBase. We are also integrating RNAcentral with the Rfam database so that the majority of RNAcentral sequences are annotated with Rfam families.

There are three main ways of browsing the data through the RNAcentral website. The text search makes it easy to explore all ncRNA sequences, compare data across different resources, and discover what is known about each ncRNA. Using the sequence similarity search one can search data from multiple RNA databases starting from a sequence. Finally, one can explore ncRNAs in select species by genomic location using an integrated genome browser.

RNAcentral continues to grow, with an additional one million new non-coding RNA sequences added to the database in 2016. The website has been continuously improved, including a redesigned homepage and more relevant search results. Our immediate priorities include the incorporation of functional annotations of non-coding RNAs, such as intermolecular interactions, nucleotide modifications, and high-quality secondary structures. The ultimate goal of RNAcentral is to include curated information about all non-coding RNAs as UniProt does for proteins.

Session B-475: Structure-based prediction of protein-peptide binding regions using Random Forest
COSI: Non-COSI
  • Ghazaleh Taherzadeh, Griffith University, Australia

Short Abstract: Protein-peptide interactions are among the most important biological interactions and play a crucial role in many diseases, including cancer. However, only a small portion of proteins has known complex structures, and experimental determination of protein-peptide interactions is costly and inefficient. Thus, predicting peptide-binding sites computationally will be useful to improve the efficiency and cost-effectiveness of experimental studies. Here, we established a machine learning method called SPRINT-Str (Structure-based prediction of protein-Peptide Residue-level Interaction) that uses structural information for predicting protein-peptide binding residues. These predicted binding residues are then employed to infer the peptide-binding site by a clustering algorithm.
SPRINT-Str achieves robust and consistent results for the prediction of protein-peptide binding regions in terms of both residues and sites. The Matthews Correlation Coefficient (MCC) for 10-fold cross-validation and the independent test set is 0.27 and 0.293, respectively, with corresponding Area Under the Curve (AUC) values of 0.775 and 0.782. The prediction outperforms other state-of-the-art methods, including our previously developed sequence-based method. A further spatial-neighbor clustering of predicted binding residues leads to prediction of binding sites at 20%-116% higher coverage than the next best method at all precision levels in the test set. The application of SPRINT-Str to protein binding with DNA, RNA, and carbohydrates confirms the method’s capability of separating peptide-binding sites from other functional sites. More importantly, similar performance in the prediction of binding residues and sites is obtained when experimentally determined structures are replaced by unbound structures or quality model structures built from homologs, indicating its wide applicability.

Session B-476: A sparse latent regression approach for integrative analysis of glycomic and glycotranscriptomic data
COSI: Non-COSI
  • Xuefu Wang, Indiana University, United States

Short Abstract: We present a Bayesian sparse latent regression (BSLR) model for predicting quantitative glycan abundances from glycotranscriptomic data. The model is built using matched glycomic and glycotranscriptomic data collected from the same samples, and is then exploited to infer common properties among training samples and to predict these properties (e.g., the glycan abundances) in similar samples for which only glycotranscriptomic data are available. The BSLR model assumes the glycan and glycotranscriptomic abundances are both modulated by a small number of independent latent variables, and thus can be constructed using only a relatively small number of training samples. We further employ a Bayesian learning algorithm to promote sparse models with fewer parameters linking latent variables to glycans and glycan-synthesis genes.
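In LaTeX notation, the shared latent variable assumption can be sketched as follows; the symbols are ours, chosen for illustration only:

    t = A h + \varepsilon, \qquad g = B h + \delta

where t denotes the glycotranscriptomic abundances, g the glycan abundances, h a low-dimensional vector of latent variables, and A and B sparse loading matrices whose entries the Bayesian learning algorithm drives toward zero. Given only t for a new sample, one infers the posterior of h and predicts g through B.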

Session B-478: Using genomic analysis to identify tomato Tm-2 resistance-breaking mutations and their underlying evolutionary path in a new and emerging tobamovirus
COSI: Non-COSI
  • Yonatan Maayan, ARO Volcani, Israel
  • Eswari Pandaranayaka, ARO Volcani, Israel
  • Moshe Lapidot, ARO Volcani, Israel
  • Ilan Levin, ARO Volcani, Israel
  • Aviv Dombrovsky, ARO Volcani, Israel
  • Arye Harel, Volcani Center, ARO, Israel

Short Abstract: Recently, a new tobamovirus able to break the Tm-2-mediated resistance that has protected tomato for over 40 years was discovered in Israel. Following isolation and sequencing, the virus was found to be Tomato brown rugose fruit virus (ToBRFV), a new tobamovirus recently identified in Jordan. Previous studies on mutations causing resistance breaking, including breaking of Tm-2-mediated resistance, demonstrated that this phenotype was mediated by only a few mutations. Identification of such residues in resistance breakers is hindered by the significant background resulting from approximately 10% differences in their genomic sequences compared to known species. To understand the evolutionary path leading to the emergence of this resistance breaker, we utilized a comprehensive phylogenetic analysis and genomic comparison of tobamovirus species, followed by molecular modelling of its viral helicase. Our phylogenetic analysis highlights the location of the resistance-breaker genes within a host-shifting inter-clade, which, together with a relatively low mutation rate, suggests a similar evolutionary path for the emergence of this new species. Our comparative genomic analysis identified 5 potential resistance-breaking mutations in the viral movement protein (MP), the primary target of the related Tm-2 resistance, and 2 in its helicase. Finally, molecular modelling of the helicase enabled the identification of 2 additional resistance-breaking mutations.

Session B-502: On the feasibility of mining CD8+ T-cell receptor patterns underlying immunogenic peptide recognition
COSI: Non-COSI
  • Nicolas De Neuter, University of Antwerp, Belgium
  • Wout Bittremieux, University of Antwerp, Belgium
  • Charlie Beirnaert, University of Antwerp, Belgium
  • Bart Cuypers, University of Antwerp, Belgium
  • Aida Mrzic, University of Antwerp, Belgium
  • Pieter Moris, University of Antwerp, Belgium
  • Arvid Suls, University of Antwerp, Belgium
  • Viggo Van Tendeloo, University of Antwerp, Belgium
  • Benson Ogunjimi, University of Antwerp, Belgium
  • Kris Laukens, University of Antwerp, Belgium
  • Pieter Meysman, University of Antwerp, Belgium

Short Abstract: Current T-cell epitope prediction tools are a valuable resource in designing targeted immunogenicity experiments. They typically focus on, and can accurately predict, peptide binding and presentation by major histocompatibility complex (MHC) molecules on the surface of antigen-presenting cells. However, recognition of the peptide-MHC complex by a T-cell receptor is often not included in these tools. We developed a classification approach based on random forest classifiers to predict recognition of a peptide by a T-cell and to discover patterns that contribute to recognition. We considered two approaches to this problem: (1) distinguishing between two sets of T-cell receptors that each bind to a known peptide and (2) retrieving T-cell receptors that bind to a given peptide from a large pool of T-cell receptors. Evaluation of the models on two HIV-1, B*08-restricted epitopes reveals good performance and hints at structural CDR3 features that can determine peptide immunogenicity. These results are of particular importance as they show that predicting T-cell epitopes and their recognition from sequence data is feasible. In addition, the validity of our models not only serves as a proof of concept for the prediction of immunogenic T-cell epitopes but also paves the way for more general and high-performing models.
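A minimal sketch of approach (1) with scikit-learn, using simple 2-mer composition features over CDR3 sequences; the featurization and hyperparameters are illustrative, not those of the published models:

    from itertools import product
    from sklearn.ensemble import RandomForestClassifier

    AAS = "ACDEFGHIKLMNPQRSTVWY"
    KMERS = ["".join(p) for p in product(AAS, repeat=2)]

    def featurize(cdr3):
        """2-mer composition of a CDR3 amino acid sequence (a toy feature
        set; richer sequence features would be used in practice)."""
        return [sum(cdr3[i:i + 2] == k for i in range(len(cdr3) - 1))
                for k in KMERS]

    # cdr3_seqs: CDR3 strings; labels: which of two known epitopes each
    # receptor binds (placeholder training data)
    cdr3_seqs = ["CASSLAPGATNEKLFF", "CASSIRSSYEQYF"]
    labels = [0, 1]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit([featurize(s) for s in cdr3_seqs], labels)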

