Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide



Poster Categories
Poster Schedule
Preparing your Poster - Information and Poster Size
How to mount your poster
Print your poster in Basel

View Posters By Category

Session A: (July 22 and July 23)
Session B: (July 24 and July 25)

Presentation Schedule for July 22, 6:00 pm – 8:00 pm

Presentation Schedule for July 23, 6:00 pm – 8:00 pm

Presentation Schedule for July 24, 6:00 pm – 8:00 pm

Session A Poster Set-up and Dismantle
Session A posters set up: Monday, July 22, between 7:30 am and 10:00 am
Session A posters should be removed at 8:00 pm on Tuesday, July 23.

Session B Poster Set-up and Dismantle
Session B posters set up: Wednesday, July 24, between 7:30 am and 10:00 am
Session B posters should be removed at 2:00 pm on Thursday, July 25.

V-001: Metadata Acquired from Clinical Case Reports: a resource for extracting information from clinical narratives
COSI: General Comp Bio
  • Yijiang Zhou, Department of Cardiology, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
  • Quan Cao, The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA, United States
  • Jessica Lee, The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA, United States
  • Sanjana Murali, The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA, United States
  • Sarah Spendlove, The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA, United States
  • Anders Olav Garlid, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
  • Harry Caufield, University of California, Los Angeles, United States
  • David Liem, BD2K Center of Excellence @ UCLA, United States

Short Abstract: How do we isolate novel relationships from numerous sets of biomedical observations? With clinical documents, we face the issue of massive variation in document content, vocabulary and style across domains, clinical specialties, and locations. If they are to be applied to clinical language, truly comprehensive text mining tools must therefore be trained and designed using both domain-specific knowledge and clinical text, such as that provided by expert curation and clinical case reports (CCRs), respectively. More than 1.9 million CCRs have been published in the last century. To provide a starting point for the development of methods for identification of high-level concepts within clinical narratives, we assembled a set of metadata acquired from clinical case reports (MACCRs). This expert-curated, publicly available data set contains publication metadata and text describing clinical events and observations within 3,100 CCRs. The reports span 15 disease groups, more than 750 rare disease presentations, and include an additional focus on 7 selected mitochondrial diseases. Our MACCR set is a resource for aiding clinicians, researchers, and machine learning systems in understanding how disease presentations are described and how they may be written about more clearly. Future developments will provide strategies for document comparison, biomarker identification, and early diagnosis.

V-002: The BIG Data Center: from deposition to integration to translation
COSI: General Comp Bio
  • Zhang Zhang, BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, China
  • On Behalf Of Big Data Center Members, BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, China

Short Abstract: The BIG Data Center at Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences provides a suite of database resources in support of worldwide research activities in both academia and industry. With the vast amounts of multi-omics data generated at unprecedented scales and rates, the BIG Data Center is continually expanding, updating and enriching its core database resources through big data integration and value-added curation. Resources with significant updates in the past year include BioProject (a biological project library), BioSample (a biological sample library), Genome Sequence Archive (GSA, a data repository for archiving raw sequence reads), Genome Warehouse (GWH, a centralized resource housing genome-scale data), Genome Variation Map (GVM, a public repository of genome variations), Science Wikis (a catalog of biological knowledge wikis for community annotations) and IC4R (Information Commons for Rice). Newly released resources include EWAS Atlas (a knowledgebase of epigenome-wide association studies), iDog (an integrated omics data resource for dog) and RNA editing resources (for editome-disease associations and plant RNA editosome, respectively). To promote biodiversity and health big data sharing around the world, the Open Biodiversity and Health Big Data (BHBD) initiative is introduced. All of these resources are publicly accessible at http://bigd.big.ac.cn.

V-003: A statistical model for screening high dimensional data
COSI: General Comp Bio
  • Weixing Dai, The Chinese University of Hong Kong, Hong Kong
  • Dianjing Guo, The Chinese University of Hong Kong, Hong Kong

Short Abstract: Screening samples in high-dimensional and imbalanced data is a great challenge in various fields of research. We here describe BNNL (Bayes nearest neighbor learning), a screening model focusing locally on the range of interest. With a specially designed loss function taking both bias and variance into account, BNNL outperformed conventional classification algorithms on both synthetic and real datasets in terms of noise detection and screening accuracy. The model was then used to screen for potential activators of HIV-1 integrase multimerization in an independent compound library, and the virtual screening results were experimentally validated. Of the 25 compounds tested, 12 proved to be active. The most potent activator in the experimental validation showed an EC50 value of 0.71 μM.

V-004: Pharmacoepidemiologic Evaluation of Birth Defects from Health‑Related Postings in Social Media During Pregnancy
COSI: General Comp Bio
  • Susan Golder, University of York, United Kingdom
  • Stephanie Chiuve, AbbVie, United States
  • Ari Klein, University of Pennsylvania, United States
  • Karen O'Connor, University of Pennsylvania, United States
  • Martin Bland, University of York, United Kingdom
  • Murray Malin, Abbvie, United States
  • Mondira Bhattacharya, Abbvie, United States
  • Linda Scarazzini, Abbvie, United States
  • Davy Weissenbacher, University of Pennsylvania, United States
  • Graciela Gonzalez-Hernandez, University of Pennsylvania, United States

Short Abstract: Introduction: Adverse effects of medications taken during pregnancy are traditionally studied through post-marketing pregnancy registries, which have limitations. Social media data may be an alternative data source for pregnancy surveillance studies. Methods: We assessed the feasibility of using social media data as a source for pregnancy surveillance. We created an automated method to identify Twitter accounts of pregnant women. We identified 196 pregnant women with a mention of a birth defect in relation to their baby and 196 without such a mention. We extracted information on pregnancy and maternal demographics, medication intake and timing, and birth defects. Results: Although often incomplete, we extracted data for the majority of the pregnancies. Among women who reported birth defects, 35% reported taking one or more medications during pregnancy compared with 17% of controls. After accounting for age, race, and place of residence, higher medication intake was observed in women who reported birth defects. Conclusions: Twitter data capture information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. Development of improved methods to automatically extract and annotate social media data may increase their value.

V-005: Numerical Study of an Influenza Epidemic Dynamical Model with Diffusion
COSI: General Comp Bio
  • Mudassar Imran, Gulf University for Science and Technology, Kuwait
  • Mohamed Ben-Romdhane, Gulf University for Science and Technology, Kuwait
  • Ali Ansari, Gulf University for Science and Technology, Kuwait
  • Helmi Temimi, Gulf University for Science and Technology, Kuwait

Short Abstract: In our study, a deterministic model is formulated with the aim of performing a thorough investigation of the transmission dynamics of influenza. The main advantage of our model over existing models is that it takes into account the effects of both the hospitalization of infected individuals and diffusion. The proposed model, consisting of a dynamical system of partial differential equations with diffusion terms, is numerically solved using a fast and accurate discretization technique. This discretization leads to a consistent explicit scheme yielding accurate numerical solutions of our dynamical system under some stability assumptions. Furthermore, a steady-state analysis as well as a local stability analysis of the proposed model are carried out. The basic reproductive number guaranteeing the local stability of the disease-free steady state without the diffusion term is then calculated. Various numerical simulations for different values of the model input parameters are finally performed to show the effect of the effective contact rate on the steady state of the different population compartments (susceptible, exposed, infected, hospitalized, and recovered). The consequence of including an inhibition effect in our model is also discussed.

V-006: Medically Actionable Rare Variants In 50,000 Exomes From UK BioBank
COSI: General Comp Bio
  • Suganthi Balasubramanian, Regeneron Pharmaceuticals, Inc., United States

Short Abstract: The promise of precision medicine is to apply large-scale human genomic sequencing to preemptively identify patients and their family members carrying “medically actionable” pathogenic variants. Here, we present a survey of such variants identified from the exomes of 50,000 individuals from UK Biobank in 59 ACMG genes. The ACMG59 genes were chosen because pathogenic variants in them are known to cause or predispose individuals to disease, and because medical intervention is expected to improve outcomes in terms of mortality or the avoidance of significant morbidity. We integrate exome data with genetic variant databases to identify pathogenic variants in the ACMG59. Additionally, we identify hitherto unreported “Likely Pathogenic” loss-of-function variants in ACMG59 genes where truncating mutations are expected to cause disease. Using stringent criteria for defining pathogenic variants from ClinVar, we find that approximately 2% of the population have a medically actionable variant. Using broader definitions of pathogenic variants from ClinVar and HGMD, we obtain higher estimates ranging from 2% to 7%. Variants in the cancer-associated genes BRCA2, BRCA1, PMS2 and MSH6 are the most prevalent, followed by LDLR, associated with familial hypercholesterolemia. We highlight the importance of building a scalable workflow for rapid identification and systematic evaluation of pathogenic variants.
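The two-tier filtering logic this abstract describes — restricting to ACMG59 genes, then applying stringent versus broader ClinVar pathogenicity criteria — can be sketched roughly as follows. The gene subset, field names, and significance labels are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch: filter variant records to "medically actionable" candidates.
# Gene subset, field names and significance labels are illustrative.

ACMG_GENES = {"BRCA1", "BRCA2", "PMS2", "MSH6", "LDLR"}  # subset of the ACMG59

def is_actionable(variant, stringent=True):
    """Keep variants in ACMG genes with a pathogenic ClinVar assertion."""
    if variant["gene"] not in ACMG_GENES:
        return False
    sig = variant["clinical_significance"]
    if stringent:
        # Stringent criteria: asserted "Pathogenic" with no conflicting reports.
        return sig == "Pathogenic" and variant["review_status"] != "conflicting"
    # Broader criteria also admit "Likely pathogenic" assertions.
    return sig in {"Pathogenic", "Likely pathogenic"}

variants = [
    {"gene": "BRCA2", "clinical_significance": "Pathogenic",
     "review_status": "multiple submitters"},            # stringent hit
    {"gene": "BRCA1", "clinical_significance": "Likely pathogenic",
     "review_status": "single submitter"},               # broad-only hit
    {"gene": "TTN", "clinical_significance": "Pathogenic",
     "review_status": "multiple submitters"},            # gene not in list
]

stringent_hits = [v for v in variants if is_actionable(v, stringent=True)]
broad_hits = [v for v in variants if is_actionable(v, stringent=False)]
print(len(stringent_hits), len(broad_hits))  # stringent hits are a subset of broad hits
```

As in the abstract, the broader definition can only enlarge the stringent set, which is why the two criteria yield a range of prevalence estimates rather than a single number.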

V-007: Genetic Characterization of Rift Valley Fever Virus in Kenya
COSI: General Comp Bio
  • Lucy Kariuki, University of Nairobi, Kenya
  • Prof. Wallace Bulimo, University of Nairobi, Kenya
  • Prof. Rosemary Sang, Kenya Medical Research Institute, Kenya

Short Abstract: Rift Valley fever virus (RVFV) is an arbovirus transmitted by mosquitoes that causes Rift Valley fever (RVF) in livestock and humans. Its sporadic outbreaks cause considerable alarm in Kenya and affect the economy through the burden on health care and the morbidity and mortality they cause. RVFV undergoes genetic evolution as it adapts to its different vector and host environments. Different genotypes of the virus are associated with disparate disease outcomes in humans and livestock; hence the need for genetic studies of RVFV to identify strains that cause severe disease and those amenable to manipulation for vaccine development. The most recent outbreak in Kenya was in May 2018. Virus isolates were obtained from archived infectious biological materials from mosquito pools and human and livestock sera. Viral genomic RNA was extracted, reverse transcribed to cDNA, and sequenced using NGS. Contigs were assembled, followed by bioinformatics analyses. This work opens up the study of the genetic diversity of RVFV and its association with severe or mild disease outcomes, geared towards the adoption of appropriate strains for vaccine development as well as new molecular diagnostic tools for detecting and preventing outbreaks.

V-008: Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset
COSI: General Comp Bio
  • Judith Somekh, University of Haifa, Israel
  • Shai Shen-Orr, Technion, Israel
  • Isaac Kohane, Harvard University, United States

Short Abstract: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task, yet attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal of interest to the researcher. We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing the co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations derived from thousands of unrelated experiments in an external reference. Our framework includes three steps: (1) data adjustment with the desired methods; (2) calculation of gene-gene co-expression measurements for the adjusted datasets; (3) evaluation of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissues derived from the GTEx project. Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that correcting for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
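The evaluation idea behind the three steps — score how well gene-gene co-expression in an adjusted dataset recovers a gold standard of known associations — can be illustrated with a minimal stdlib sketch. The toy data, correlation cutoff, and scoring function are assumptions, not the B-CeF implementation.

```python
# Sketch: compare co-expression in an (already adjusted) expression matrix
# against a gold standard of a-priori known gene-gene associations.
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient, stdlib only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Toy expression matrix: gene -> values across samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],  # tracks g1: a known association preserved
    "g3": [4.0, 1.0, 3.5, 0.5],  # unrelated gene
}
gold_standard = {frozenset({"g1", "g2"})}  # a-priori known association

def coexpression_score(expr, gold, r_cutoff=0.8):
    """Fraction of gold-standard pairs recovered among high-|r| pairs."""
    called = {frozenset({a, b})
              for a, b in combinations(expr, 2)
              if abs(pearson(expr[a], expr[b])) >= r_cutoff}
    return len(called & gold) / len(gold)

print(coexpression_score(expr, gold_standard))  # 1.0: the known pair survives adjustment
```

An over-corrected dataset would flatten the g1-g2 relationship and drive this score toward zero, which is the failure mode such a framework is designed to expose.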

V-009: Calling full length transcripts with nucleotide precision using iTiSS
COSI: General Comp Bio
  • Florian Erhard, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Germany
  • Christopher Jürges, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Germany

Short Abstract: Transcription start sites (TiSS) can be identified by a variety of sequencing techniques including cRNA-seq, dRNA-seq and PROcap-seq, all of which rely on enriching reads at the 5’-ends of mRNAs. For individual techniques, computational tools have been developed to automatically detect and call TiSS, but no uniform tool for all these data is available. Moreover, third-generation sequencing in principle provides full-length mRNAs including TiSS. Here we show that each individual technique produces large numbers of false positives and also misses many bona-fide TiSS. We present our tool iTiSS (integrative Transcriptional Start Site caller), an integrative approach for fast and sensitive TiSS identification with high specificity. iTiSS was used for an unbiased re-annotation of the herpes simplex virus 1 genome (manuscript under revision), integrating data from cRNA-seq, dRNA-seq as well as PacBio third-generation sequencing. Manually curated mRNAs reveal both good sensitivity (113/201 TiSS, 56.2%) and perfect specificity (100%) for all high-confidence TiSS called by iTiSS. A recently added feature of iTiSS further allows it to use third-generation datasets to extend called TiSS to full-length transcripts, including splicing events, making it the first program of its kind.

V-010: FFT-Mutant Kit: a novel library for the design of de novo mutations using mathematical modeling techniques, data mining and pattern recognition
COSI: General Comp Bio
  • David Medina, CeBiB, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Department of Chemical Engineering and Biotechnology, University of Chile, Chile

Short Abstract: Designing mutations to obtain desirable biological activities is one of the most recurrent problems in biotechnology. Experimental methods imply large time and economic costs and limit the search space for mutants. Computational tools appear as a powerful solution to this challenge; nevertheless, the underlying problem persists. We propose FFT-Mutant Kit, a novel tool for designing mutations from linear sequences by digitizing their physicochemical properties and evaluating them using structural and phylogenetic information together with data mining and pattern recognition techniques. The tool trains models through meta-learning techniques. The descriptors are based on frequency spectra of the protein, obtained by encoding the residues and digitizing them through Fourier transforms. The dataset is composed of homologous proteins whose characteristics are known. To propose new mutations and evaluate their potential effect, physicochemical, thermodynamic and phylogenetic properties are considered in a pre-filter stage. The tool then applies the trained models to associate mutations with their expected effect on the target variable and with the physicochemical properties relevant for describing the suggested mutations. We believe this tool will be a significant contribution to the design of mutants with desirable biological activities and a powerful means of studying proteins through the digitization of their physicochemical properties.

V-011: 20 Years Of The PSIPRED Protein Analysis Workbench
COSI: General Comp Bio
  • Daniel Buchan, University College London, United Kingdom
  • David Jones, University College London, United Kingdom

Short Abstract: The PSIPRED Protein Analysis Workbench is an online web server that has been available to the biosciences community for 20 years. The server offers a number of machine learning based predictive algorithms for annotating either protein sequences or protein structures. These methods cover secondary structure prediction, disordered region prediction, automated homology modelling, fold recognition, function prediction and many more. We present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The poster covers our new website, the workflow for users, and updates to the algorithms and predictive methods.

V-012: Literature Triage Advances at Mouse Genome Informatics
COSI: General Comp Bio
  • Li Ni, The Jackson Laboratory, Bar Harbor, Maine 04609, United States

Short Abstract: Although the number of yearly accessions to PubMed is rising rapidly, the genetics and genomics literature specifically chosen for curation by the Mouse Genome Informatics (MGI) resource has remained fairly steady over the last few years. MGI selects an average of ten to twelve thousand articles a year for curation. Despite using natural language processing where possible, screening the literature requires many steps and a substantial resource commitment. To help reduce the effort involved in identifying the most relevant mouse papers, where mining of PubMed abstracts is ineffective, a scrum of MGI data curators and software engineers devoted effort to improving the literature triage process through greater automation and simplified workflows. Here we present our historical triage method and the improved references module that is being implemented. Similar machine learning approaches will provide a general framework for us to build a centralized Alliance Literature Curation Portal that will provide curators with an interface to query for all papers that have been indexed/triaged for given species, data types/methods, named entities, and relevant sentences for fact extraction. Supported by NIH grant HG000330.

V-013: Multi-omics approach for the study of Mycoplasma hyopneumoniae metabolism and the response of swine epithelial cells to infection
COSI: General Comp Bio
  • Mariana G. Ferrarini, INSA Lyon, France
  • Scheila Mucha, CBiot - UFRGS, Brazil
  • Arnaldo Zaha, CBiot - UFRGS, Brazil
  • Marie-France Sagot, Inria, Université Claude Bernard Lyon 1, France

Short Abstract: Mycoplasma hyopneumoniae, the causative agent of enzootic pneumonia, is an economically devastating pathogen in the pig farming industry; however, little is known about its relationship with the swine host. We performed a multi-omics study of the host-pathogen interaction. In previous work, we reconstructed the metabolic models of this species along with two other mycoplasmas from the respiratory tract of swine: Mycoplasma hyorhinis, considered less pathogenic but which nonetheless causes disease, and Mycoplasma flocculare, a commensal bacterium. We identified metabolic differences that partially explained their virulence. In these models, the most important trait was the production of hydrogen peroxide from glycerol only in the two pathogenic species. Strikingly, in vitro, only the pathogenic strains of M. hyopneumoniae produced this toxic product. Furthermore, to improve our understanding of this interaction, we infected swine epithelial cells with M. hyopneumoniae to identify the effects of infection on the expression of swine genes and miRNAs. M. hyopneumoniae appears to elicit an antioxidant response in infected cells via hydrogen peroxide production. Although further tests are needed, we present an interesting metabolic trait of M. hyopneumoniae potentially related to its enhanced virulence and some of the host mechanisms activated to fight the infection.

V-014: EV-CNN: A web application for EnteroVirus Genotyping in Deep learning
COSI: General Comp Bio
  • Shu-Hwa Chen, Institute of Information Science, Academia Sinica, Taiwan
  • Chieh-Hwa Lin, Institute of Population Science, National Health Research Institutes, Taiwan
  • Zhe-Ren Hsu, Institute of Information Science, Academia Sinica, Taiwan
  • I-Hsuan Lu, Institute of Information Science, Academia Sinica, Taiwan
  • Chung-Yen Lin, Institute of Information Science, Academia Sinica, Taiwan

Short Abstract: Here we integrated and corrected EV sequences with their serotypes according to the virus classifications of NCBI GenBank and the International Committee on Taxonomy of Viruses (ICTV). Using the corrected sequences of the EV family with their serotypes, around 48,382 records, we introduce a deep learning approach (Convolutional Neural Networks, CNN) to classify the 308 genotypes of the EV family. Although the macro-average prediction accuracy by five-fold cross-validation (CV) is around 80%, the accuracy/recall rates for EV-71 and D-68 are 96.5%/98% and 91.8%/99.7%, respectively. To ensure that submitted sequences belong to the EV family, we built the pipeline with a homology-search filter (coverage > 80% and E-value < 1.0E-5); the trained CNN model then classifies them into the appropriate genotype. We implement this approach as a fully automatic web application (EV-CNN) that provides precise and rapid pre-diagnosis of EV genotype. EV-CNN can perform genotyping immediately on long reads from biopsies produced by third-generation sequencers such as Oxford Nanopore. This integrated platform will be helpful to clinical laboratories and the research community for disease surveillance. The website is free and open to all users, with no login requirement. EV Genotyping is available at http://symbiosis.iis.sinica.edu.tw/Enterovirus/
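The homology-search gate described in this abstract (coverage > 80%, E-value < 1.0E-5) amounts to a simple pre-classification filter; only sequences passing it are handed to the CNN genotyper. The field names and toy hits below are illustrative, not the EV-CNN code.

```python
# Sketch: gate homology-search hits before genotype classification.
# Thresholds follow the abstract; field names and data are illustrative.

def passes_ev_filter(hit, min_coverage=0.80, max_evalue=1.0e-5):
    """Return True if a homology-search hit qualifies as an EV sequence."""
    return hit["coverage"] > min_coverage and hit["evalue"] < max_evalue

hits = [
    {"id": "read1", "coverage": 0.95, "evalue": 1.0e-30},  # clear EV hit
    {"id": "read2", "coverage": 0.50, "evalue": 1.0e-30},  # coverage too low
    {"id": "read3", "coverage": 0.90, "evalue": 0.01},     # E-value too high
]
to_classify = [h["id"] for h in hits if passes_ev_filter(h)]
print(to_classify)  # only read1 proceeds to the genotype classifier
```

Requiring both conditions keeps short spurious alignments (high identity, low coverage) and weak full-length matches out of the classifier's input.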

V-015: Elegance, Electronic Lab Notebook On Cloud: Digitize Experimental Designs And Results Into Wisdom From Discovery To Application
COSI: General Comp Bio
  • Shu-Hwa Chen, Institute of Information Science, Academia Sinica, Taiwan
  • Chung-Yen Lin, Institute of Information Science, Academia Sinica, Taiwan
  • Chi-Wei Huang, Institute of Information Science, Academia Sinica, Taiwan

Short Abstract: Handwritten, paper-based record keeping cannot cope with data of increasing volume and complexity, and makes data sharing difficult in cooperative projects spanning various disciplines and research communities. We have developed a framework for a purely web-based ELN, available as standalone versions and Docker images, which can be deployed on a local machine, servers, NAS devices or clouds. The essential functions of the ELN include simple installation with a few clicks, note creation with digital signatures, attachment of experimental digital outputs, full-text search, succinct user management, automatic system backup, a calendar with event notification, a personalized interface with high privacy, data sharing and exchange via the web, lab resource management, etc. This framework thus reshapes the way a single researcher manages thoughts and all kinds of lab working logs and experimental data. For a small research team, the system provides a public internet web service and an intranet framework to manage experimental results, along with a shared working platform connecting lab members with collaborators outside the lab. In brief, we believe the ELN developed by our team will help the research community support interventions, share information, re-organize knowledge, and document actual laboratory work. Reference: https://hub.docker.com/r/lsbnb/eln/

V-016: Building database for the base quality score recalibration in the genetic variant calling
COSI: General Comp Bio
  • Sunhee Kim, Kongju National University, South Korea
  • Chang-Yong Lee, Kongju National University, South Korea

Short Abstract: The base quality score recalibration (BQSR) is an important step in variant calling from high-throughput next-generation sequencing (NGS) data. While BQSR necessarily requires a database of known variants such as dbSNP, many organisms other than human do not have a database large enough to recalibrate base quality scores effectively. Based on the finding that the size of the database plays a crucial role in BQSR, we propose a method of creating a database when none of sufficient size is available for BQSR to be reliable. The proposed method is based on stratified sampling and consists of two parts. First, we call variants from a set of stratified samples without the BQSR step to obtain a variant call format (VCF) file. Second, we use this VCF file as the known-variants database, named dbSELF, and call variants again with the BQSR step. To assess the proposed method, we called variants from human and rice NGS data with both dbSNP and dbSELF. We demonstrate that, in the case of human, dbSNP and dbSELF produce more or less the same results. In the case of rice, however, dbSELF provides more reasonable results than dbSNP.
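The stratified-sampling step underlying dbSELF can be sketched as proportional sampling per stratum; the strata, sizes, and helper function here are illustrative assumptions, and the downstream variant calling (first without, then with, BQSR) is not shown.

```python
# Sketch: draw a proportional sample from each stratum, e.g. regions or read
# groups per chromosome, before the first (non-recalibrated) calling pass.
import random

def stratified_sample(strata, total_n, seed=0):
    """Proportionally sample `total_n` items across named strata."""
    rng = random.Random(seed)
    pop = sum(len(items) for items in strata.values())
    chosen = {}
    for name, items in strata.items():
        k = round(total_n * len(items) / pop)   # proportional allocation
        chosen[name] = rng.sample(items, min(k, len(items)))
    return chosen

strata = {
    "chr1": list(range(60)),  # toy item counts per stratum
    "chr2": list(range(30)),
    "chr3": list(range(10)),
}
picked = stratified_sample(strata, total_n=10)
print({k: len(v) for k, v in picked.items()})  # {'chr1': 6, 'chr2': 3, 'chr3': 1}
```

Sampling proportionally keeps each stratum represented in the bootstrap VCF, so the resulting dbSELF covers the genome rather than over-weighting any one region.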

V-017: Methods for detecting contribution of mutational signatures in cancer genomes
COSI: General Comp Bio
  • Damian Wójtowicz, National Institutes of Health, NCBI, United States
  • Xiaoqing Huang, National Institutes of Health, NCBI, United States

Short Abstract: Cancers arise as the result of somatically acquired changes in the DNA of cancer cells. However, in addition to the mutations that confer a growth advantage, cancer genomes accumulate a large number of somatic mutations resulting from normal DNA damage and repair processes as well as carcinogenic exposures or cancer-related aberrations of DNA maintenance machinery. These mutagenic processes often produce characteristic mutational patterns called mutational signatures. The decomposition of a cancer genome's mutation catalog into mutations consistent with such signatures can provide valuable information about cancer etiology. We developed methods to decompose the mutation catalog of a cancer patient into a linear combination of predefined mutational signatures and to assess the accuracy of such decomposition, as well as methods to assign mutational signatures to mutations. We proposed two complementary ways of measuring the confidence and stability of decomposition results and applied them to analyze mutational signatures in many cancer genomes. We identified signatures that previously had not been associated with a particular cancer type and provided additional support for the presence of these signatures. Our results emphasize the importance of assessing the confidence and stability of inferred signature contributions.
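Decomposing a mutation catalog into a non-negative linear combination of fixed signatures can be sketched with multiplicative updates (the Frobenius NMF rule with the signature matrix held fixed). This is a generic sketch of the technique, not the authors' method, and the signatures are toy 4-channel examples rather than real 96-channel signatures.

```python
# Sketch: find non-negative weights h minimizing || catalog - W @ h ||,
# with the signature matrix W (channels x signatures) held fixed.

def decompose(signatures, catalog, iters=2000):
    """Multiplicative updates: h <- h * (W^T v) / (W^T W h)."""
    m = len(catalog)         # number of mutation channels
    k = len(signatures[0])   # number of signatures
    h = [1.0] * k
    for _ in range(iters):
        wh = [sum(signatures[i][j] * h[j] for j in range(k)) for i in range(m)]
        for j in range(k):
            num = sum(signatures[i][j] * catalog[i] for i in range(m))
            den = sum(signatures[i][j] * wh[i] for i in range(m)) or 1e-12
            h[j] *= num / den   # keeps h non-negative by construction
    return h

# Two toy signatures over four mutation channels (columns of W sum to 1).
W = [[0.7, 0.1],
     [0.1, 0.6],
     [0.1, 0.2],
     [0.1, 0.1]]
true_h = [30.0, 70.0]
catalog = [sum(W[i][j] * true_h[j] for j in range(2)) for i in range(4)]

est = decompose(W, catalog)
print([round(x, 1) for x in est])  # recovers weights close to [30.0, 70.0]
```

Because the update only rescales positive weights, the estimate stays in the non-negative cone, which is what makes the inferred contributions interpretable as mutation counts per signature.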

V-018: Genome-wide methylome and transcriptome analysis reveals potential therapeutic targets for triple negative breast cancer
COSI: General Comp Bio
  • Maoni Guo, Faculty of Health Sciences, University of Macau, Macau, China

Short Abstract: The prognosis of triple negative breast cancer (TNBC) is poor due to the lack of specific biomarkers for clinical intervention. To identify potential TNBC biomarkers, we performed a comprehensive bioinformatics study on TNBC methylation and transcription data derived from TCGA. We selected triple positive breast cancer (TPBC) as the control in order to increase the sensitivity and specificity of biomarker detection in TNBC. Our study identified 911 differentially methylated genes and 710 differentially expressed genes in TNBC, of which 114 genes were both differentially methylated and differentially expressed. We identified 250 differentially methylated CpG sites that were able to effectively distinguish between TNBC and TPBC. Applying drug repositioning analysis, we determined that 16 differentially methylated and expressed genes (DEMGs) are potential therapeutic targets for TNBC. Together, our study revealed widespread alterations in DNA methylation and gene expression in TNBC, providing a rich resource for identifying novel TNBC biomarkers.

V-019: Human Aging DNA Methylation Signatures are Conserved but Accelerated in Cultured Fibroblasts
COSI: General Comp Bio
  • Gabriel Sturm, Columbia University, United States
  • Andres Cardenas, University of California, Berkeley, United States
  • Marie-Abèle Bind, Harvard University, United States
  • Steve Horvath, University of California, Los Angeles, United States
  • Shuang Wang, Columbia University, United States
  • Yunzhang Wang, Karolinska Institutet, Sweden
  • Sara Hägg, Karolinska Institutet, Sweden
  • Michio Hirano, Columbia University, United States
  • Martin Picard, Columbia University, United States

Short Abstract: Aging is associated with progressive and site-specific changes in DNA methylation (DNAm). These global DNAm changes have been used to train elastic net regression algorithms, i.e. DNAm clocks, to accurately predict chronological age in humans. However, relatively little is known about how these clocks perform on cells in culture. Here we cultured primary human fibroblasts across the cellular lifespan (~6 months) and used four different DNAm clocks to show that age-related DNAm signatures are conserved and accelerated in vitro. The Skin & Blood clock shows the best linear correlation with chronological time (r=0.90), including during replicative senescence. Although similar in nature, the rate of epigenetic aging is approximately 62 times faster in cultured cells than in the human body. Leveraging the high temporal resolution of these data, we subsequently applied generalized additive modeling and show how single CpGs exhibit loci-specific, linear and nonlinear trajectories across the lifespan that reach rates from -47% (hypomethylation) to +23% (hypermethylation) per month, remarkably higher than changes in the human body. Our computational approach demonstrates how global and single-CpG DNAm dynamics are conserved and accelerated in cultured fibroblasts, which may represent a system to evaluate age-modifying interventions across the lifespan.
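At prediction time, an elastic net DNAm clock reduces to an intercept plus a weighted sum of beta values at the clock's CpGs. The CpG names and coefficients below are toy values, not any published clock's weights.

```python
# Sketch: the linear prediction step of a DNAm clock.
# CpG identifiers, weights and intercept are toy values.

def dnam_age(betas, weights, intercept):
    """Predicted age = intercept + sum of weight_i * beta_i over clock CpGs."""
    return intercept + sum(weights[cpg] * betas.get(cpg, 0.0)
                           for cpg in weights)

weights = {"cg0001": 25.0, "cg0002": -10.0, "cg0003": 40.0}  # toy coefficients
intercept = 20.0
sample = {"cg0001": 0.8, "cg0002": 0.3, "cg0003": 0.5}       # beta values in [0, 1]

print(round(dnam_age(sample, weights, intercept), 2))  # 57.0
```

Comparing such predicted ages against elapsed time in culture is what lets a rate of epigenetic aging (e.g. the ~62x acceleration above) be expressed as a simple slope.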

V-020: Genomic Variant Calling Implemented over the Spark MapReduce Framework: Spark+GATK4+WDL+Cromwell
COSI: General Comp Bio
  • Ambarish Kumar, Jawaharlal Nehru University, New Delhi, India

Short Abstract: The data-driven challenges of genomics can be met with parallel and distributed computing, together with interfaces that let biologists interact with high-performance computing platforms in a customised way. Workflows implemented on multi-node clusters, or offered as cloud-based services, are one way to bring parallelism, scalability, accessibility, reproducibility and customisation to large-scale, data-driven genomics research. Variant calling methods need to be ported to multi-node clusters running the MapReduce framework, which provides substantial performance gains in both computation and accuracy. GATK4 bundles tools for detecting genomic variants (SNPs, INDELs and SVs) that are enabled for the Spark MapReduce framework. All steps of genomic variant calling can be combined into a customised workflow for biologists using the Workflow Description Language (WDL), and the Cromwell execution engine provides a REST API for executing WDL scripts. Furthermore, web accessibility can be achieved by hosting the workflow as a cloud-based service. The work presented covers setting up a multi-node Spark cluster and creating simulated RNA-seq reads from a manually mutated Ebola reference genome containing SNPs, INDELs, inversions, translocations and large INDELs. Performance on a multi-core CPU was checked against the standard implementation.

V-021: Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins
COSI: General Comp Bio
  • Hampapathalu Adimurthy Nagarajaram, Department of Systems and Computational Biology, University of Hyderabad (UoH), Hyderabad, Telangana, 500046, India
  • Rakesh Trivedi, Center For DNA Fingerprinting And Diagnostics (CDFD), Hyderabad, Telangana, 500039, India

Short Abstract: An amino acid substitution scoring matrix encapsulates the rates at which amino acid residues in proteins are substituted by other residues over time. Database search methods use substitution scoring matrices to identify sequences with homologous relationships. However, widely used substitution scoring matrices, including the BLOSUM series, have been developed from aligned blocks that are mostly devoid of disordered regions in proteins. Hence, these matrices are largely inappropriate for studying disordered regions, which have a distinct amino acid compositional bias and are therefore expected to show distinct substitution frequencies compared with ordered regions. We developed a novel series of substitution scoring matrices, EDSSMat, by exclusively considering the substitution frequencies of amino acids in the disordered regions of eukaryotic proteins. The newly developed matrices were used in SSEARCH-assisted homology detection for proteins composed of varying percentages of disordered residues, and their sensitivities at a given threshold of specificity were compared with various conventionally used search matrices. The results indicate that the EDSSMat matrices detect more homologs than the widely used BLOSUM, PAM and other standard matrices, indicating their utility for sequence analyses of proteins enriched with disordered regions.
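
For context, a BLOSUM-style matrix entry is a scaled log-odds score comparing the observed frequency of an amino acid pair with the frequency expected under independence. A minimal sketch with invented frequencies (not the EDSSMat derivation itself):

```python
import math

def log_odds_score(q_ab, p_a, p_b, same=False, scale=2.0):
    """BLOSUM-style matrix entry: scale * log2(observed pair frequency /
    expected frequency under independence), rounded to an integer."""
    # Off-diagonal pairs occur in two orders, hence the factor of 2
    e_ab = p_a * p_b if same else 2 * p_a * p_b
    return round(scale * math.log2(q_ab / e_ab))

# Hypothetical pair observed twice as often as chance -> positive score
print(log_odds_score(0.02, 0.1, 0.05))  # 2
```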

V-022: Integration resource for human cancer cell line multiomics data and functional investigation for proteogenomics
COSI: General Comp Bio
  • Daejin Hyung, National Cancer Center, South Korea
  • Min-Jeong Baek, National Cancer Center, South Korea
  • Young Seek Lee, College of Veterinary Medicine, Research Institute for Veterinary Science, Seoul National University, South Korea
  • Soojun Park, Bio-Medical IT Convergence Research Department, ETRI, South Korea
  • Charny Park, National Cancer Center, South Korea
  • Soo Young Cho, National Cancer Center, South Korea

Short Abstract: A basic assumption in proteomics is that all proteins of the gene models are present in the reference sequence database, yet many peptides are not found in it. One strategy to identify such novel peptides is proteogenomics, which integrates proteomics with genomics and transcriptomics. Here we propose a database of cancer-associated protein variation and expression abundance, which aims to facilitate the functional investigation of proteogenomics. 404 proteomics studies, 14 cancer types, 335 human cancer cell lines and 141 genomic studies were annotated by manual curation and preprocessed with several tools for proteogenomic exploration. We provide information about protein expression in human cancer cell lines, the landscape of protein variation in cancer (e.g., single amino acid variation, fusion gene-associated peptides, and novel peptides), the dynamics of mRNA-protein abundance, and isoprotein expression. The protein variations are identified using improved genome annotation of protein-coding sequences. The dynamics of mRNA-protein abundance are calculated by coexpression analysis, classified by state transition (mRNA-protein concordance or discordance), and used to estimate significantly perturbed pathways. This database integrates multi-omics data sets and supports functional investigation in proteogenomics.

V-023: Convergent evolution of anhydrobiosis-related proteins in Tardigrades suggested by multi-omics analysis of Echiniscus testudo
COSI: General Comp Bio
  • Yumi Murai, Keio University, Japan
  • Masayuki Fujiwara, Keio University, Japan
  • Masaru Tomita, Keio University, Japan
  • Kazuharu Arakawa, Keio University, Japan

Short Abstract: Limno-terrestrial tardigrades enter an ametabolic state termed anhydrobiosis upon desiccation, in which the animals can withstand extreme environments. Genomic studies have unveiled several molecular components of anhydrobiosis, such as the expansion of oxidative stress response genes, loss of stress signaling pathways, and gain of tardigrade-specific heat-soluble protein families. However, studies so far have been limited to the class Eutardigrada, and the molecular mechanisms in the other class, Heterotardigrada, remain elusive. To this end, we report a multi-omics study of a heterotardigrade, Echiniscus testudo, one of the most desiccation-tolerant species caught from the wild. We employed a multi-omics strategy, i.e. genome sequencing, transcriptomic analysis, and proteomics, to elucidate the molecular basis of anhydrobiosis in E. testudo. After removing contamination from the genome sequence with BlobTools, we identified novel heat-soluble proteins in the draft genome as candidate components of Heterotardigrada-specific anhydrobiosis machinery. Structural domains similar to those of previously identified tardigrade-specific genes were predicted using FoldIndex and DISOPRED, suggesting that these genes may be analogs of known tardigrade-specific anhydrobiosis-related genes. These results suggest that Heterotardigrada have partly shared, but distinct, anhydrobiosis machinery compared with Eutardigrada, possibly obtained in part by convergent evolution in Tardigrades.

V-024: Deciphering the landscape of phosphorylated HLA-I ligands
COSI: General Comp Bio
  • Marthe Solleder, University of Lausanne, Swiss Institute of Bioinformatics, Switzerland
  • David Gfeller, University of Lausanne, Swiss Institute of Bioinformatics, Switzerland

Short Abstract: The identification and prediction of HLA-I–peptide interactions play an important role in our understanding of antigen recognition in infected or malignant cells. In cancer, non-self HLA-I ligands can arise from many different alterations, including non-synonymous mutations, gene fusion, cancer-specific alternative mRNA splicing or aberrant post-translational modifications. In this study, we collected in-depth phosphorylated HLA-I peptidomics data (1,920 unique phosphorylated peptides) from several studies covering 67 HLA-I alleles and expanded our motif deconvolution tool to identify precise binding motifs of phosphorylated HLA-I ligands for several alleles. In addition to the previously observed preferences for phosphorylation at P4, for proline next to the phosphosite and for arginine at P1, we detected a clear enrichment of phosphorylated peptides among HLA-C ligands and among longer peptides. Binding assays were used to validate and interpret these observations. Using these data, we then developed the first predictor of HLA-I–phosphorylated peptide interactions and demonstrated that combining phosphorylated and unmodified HLA-I ligands in the training of the predictor led to the highest accuracy.

V-025: Analysis of coherent network partitions reveals protein complexes in large-scale protein-protein interaction networks
COSI: General Comp Bio
  • Sara Omranian, University of Potsdam, Max Planck Institute of Molecular Plant Physiology, Germany
  • Angela Angeleska, University of Tampa, United States
  • Zoran Nikoloski, University of Potsdam, Max Planck Institute of Molecular Plant Physiology, Germany

Short Abstract: One of the fundamental problems in network analysis is clustering, popularly called community detection. While different clustering algorithms are available, they usually rely on parameter tuning or involve unintuitive cluster quality measures. Here, we introduce a new graph clustering algorithm based on the recently proposed concept of a coherent partition. Since the problem is computationally intractable, we devised a greedy approximation algorithm for arriving at a coherent network partition. The approximation algorithm is based on inspection of the induced second neighborhood of each vertex and identification of the largest subgraph whose complement is disconnected. A subgraph is then iteratively selected for removal based on the relation between the number of edges within the subgraph and the number of edges connecting it to the rest of the network. This algorithm ensures community quality, since it guarantees connectivity and compactness, whereby each cluster includes a biclique. Moreover, it requires no training or parameter tuning. We demonstrate that applying the algorithm to large-scale protein-protein interaction networks accurately identifies protein complexes.
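
The central test in the greedy step, whether the complement of a candidate subgraph is disconnected, can be sketched as a breadth-first search over complement edges; a simplified illustration, not the authors' implementation:

```python
from collections import deque

def complement_is_disconnected(vertices, edges):
    """True if the complement of the graph (vertices, edges) is
    disconnected, i.e. the graph is a 'join' of two vertex sets in
    which every cross pair is adjacent."""
    vs = list(vertices)
    adj = {v: set() for v in vs}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {vs[0]}
    queue = deque([vs[0]])
    while queue:  # BFS over complement edges (non-adjacent pairs)
        u = queue.popleft()
        for w in vs:
            if w != u and w not in adj[u] and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) < len(vs)

# The complement of a 4-cycle is two disjoint edges, hence disconnected
print(complement_is_disconnected([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)]))  # True
```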

V-026: MS Atlas - A molecular map of brain lesion stages in progressive multiple sclerosis
COSI: General Comp Bio
  • Tobias Frisch, University of Southern Denmark, Odense, Denmark
  • Maria L. Elkjaer, University of Southern Denmark, Odense, Denmark
  • Richard Reynolds, Division of Brain Science, Imperial College, London, United Kingdom
  • Tanja Maria Michel, Department of Psychiatry, University of Southern Denmark, Odense, Denmark
  • Tim Kacprowski, TUM School of Life Sciences, Technical University of Munich, Munich, Germany
  • Mark Burton, Odense University Hospital, Odense, Denmark
  • Torben A. Kruse, University of Southern Denmark, Odense, Denmark
  • Mads Thomassen, University of Southern Denmark, Odense, Denmark
  • Zsolt Illes, University of Southern Denmark, Odense, Denmark
  • Jan Baumbach, Technical University of Munich, Germany

Short Abstract: Multiple sclerosis (MS) is a chronic inflammatory neurodegenerative disorder of the central nervous system with an untreatable late progressive phase in a high percentage of patients. Molecular maps of the stages of brain lesion evolution in patients with progressive MS (PMS) are missing but critical for understanding disease development and identifying novel targets to halt progression. We introduce the first MS brain lesion atlas (msatlas.dk), developed to address the current challenges of understanding the mechanisms driving the fate of PMS on a lesion basis. The MS Atlas provides means for testing research hypotheses and validating candidate biomarkers and drug targets. The database comprises comprehensive high-quality transcriptomic profiles of 73 brain white matter lesions at different stages of lesion evolution from 10 PMS patients, and 25 control white matter samples from five patients with non-neurological disease. The MS Atlas was assembled from next-generation RNA sequencing of post mortem samples using strict, conservative preprocessing as well as advanced statistical data analysis. It comes with a user-friendly web interface for querying and interactively analyzing PMS lesion evolution, and it fosters bioinformatics methods for de novo network enrichment to extract mechanistic markers for specific lesion types and for pathway-based lesion type comparison.

V-027: The genetic code structure reflects the impact of different types of translational inaccuracies
COSI: General Comp Bio
  • Małgorzata Wnętrzak, University of Wrocław, Poland
  • Pawel Blazej, University of Wrocław, Poland
  • Dorota Mackiewicz, University of Wrocław, Poland
  • Paweł Mackiewicz, University of Wrocław, Poland

Short Abstract: The standard genetic code (SGC) is an unambiguous assignment of 20 amino acids and the stop translation signal to 64 codons, although at the beginning of its evolution, codons may have been read ambiguously due to the inaccuracy of the translation machinery. The goal of our work was to find structures of genetic codes that could have evolved under different types of inaccuracy of the translation apparatus, starting from ambiguous codon assignments. Thus, we developed a computational model in which the level of uncertainty of codon assignments gradually decreases during the simulations. Since one hypothesis of genetic code evolution states that the SGC evolved to be robust against point mutations and mistranslations, we performed three simulation scenarios assuming that such errors affect one, two, or three codon positions. To search for codes that decrease coding ambiguity and increase robustness against mutations and mistranslations under the assumed conditions, we used an evolutionary algorithm. The results suggest that the codon block structure of the SGC could have evolved to decrease the ambiguity of codon assignments and to increase translation fidelity. This work was supported by the National Science Centre, Poland, under Grant number 2017/27/N/NZ2/00403.
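
A robustness objective of the kind optimized in such simulations can be illustrated by counting synonymous single-nucleotide codon changes; a toy four-codon table for illustration (the actual study optimizes far richer cost functions):

```python
def code_robustness(code):
    """Fraction of single-nucleotide codon changes (staying within the
    code table) that leave the encoded amino acid unchanged."""
    same = total = 0
    for codon, aa in code.items():
        for pos in range(3):
            for base in "ACGU":
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if mutant in code:  # ignore mutants outside the toy table
                    total += 1
                    same += code[mutant] == aa
    return same / total

# Toy codon block: third-position changes are often synonymous
toy = {"UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L"}
print(code_robustness(toy))  # ~0.333
```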

V-028: An online repository for diseases associated with amyloid deposition
COSI: General Comp Bio
  • Katerina Nastou, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Vassiliki Iconomidou, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Georgia Nasi, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Paraskevi Tsiolaki, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece
  • Zoi Litou, Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, Greece

Short Abstract: Amyloid fibrils are highly ordered and insoluble aggregates formed by otherwise soluble proteins. The deposition of amyloid fibrils in various human organs and tissues is the hallmark of a group of disorders called “amyloidoses”. Interestingly, amyloid deposition is recorded as a complication in a broad range of devastating, well-known or less common, pathological conditions. To date, only a few attempts have been made to classify and gather data on the diseases associated with amyloid deposition. In this work, we introduce AmyCo, an open-access database providing a comprehensive, literature-curated repository for amyloidoses and clinical disorders related to amyloid deposition. Currently, it contains information about 75 diseases, classified by their association with amyloid deposition into two major categories, namely amyloidoses and clinical conditions associated with amyloidosis. Database entries are supplemented with detailed annotation and linked to the ICD-10, MeSH, OMIM, PubMed, AmyPro and UniProtKB databases. In addition to information regarding the diseases, AmyCo holds data about the proteinaceous components of amyloid deposits. The database is available at http://bioinformatics.biol.uoa.gr/amyco and is open to annotation of existing entries or submission of novel data from the scientific community through an online submission form.

V-029: Optimization of drug administration via control theory
COSI: General Comp Bio
  • Fabrizio Angaroni, Università Milano Bicocca, Italy
  • Marco Rossignolo, Institute for Complex Quantum Systems, Center for Integrated Quantum Science and Technologies, Universität Ulm, Germany
  • Rocco Piazza, Hematology and Clinical Research Unit, San Gerardo Hospital, Monza, Italy
  • Simone Montagero, Dept. of Physics and Astronomy “G. Galilei”, University of Padova, Italy
  • Marco Antoniotti, Università Milano Bicocca, Italy
  • Davide Maspero, Biotechnology and Biosciences, University Milano-Bicocca, Italy
  • Alex Graudenzi, University of Milan - Bicocca, Italy

Short Abstract: One of the challenges in current cancer research is the development of reliable methods for defining personalized therapeutic strategies based on the experimental data available for individual patients. This goal leads to better clinical outcomes while reducing drug usage. To this end, methods from control theory can be effectively employed on patient-specific pharmacokinetic, pharmacodynamic and tumor models to generate optimal drug administration schedules. We introduce a framework for the generation of optimized personalized therapeutic strategies for cancer patients, based on control theory and population dynamics modeling. This method can help clinicians design patient-specific therapeutic regimens, with the specific goal of optimizing the efficacy of the treatment while reducing its costs, especially in terms of toxicity and adverse effects. In particular, this algorithmic approach introduces the possibility of tuning the therapy with respect to different targets measured in clinical trials. We present the application of the framework to the specific case of imatinib administration in chronic myeloid leukemia, in which we show that the optimized therapeutic strategies are diversified among patients and display improvements over the actual regimens in terms of efficacy or drug consumption.

V-030: Read Mapping on Genome Variation Graphs
COSI: General Comp Bio
  • Kavya Vaddadi, TCS Research, India
  • Rajgopal Srinivasan, TCS Research, India
  • Naveen Sivadasan, TCS Research, India

Short Abstract: Genome variation graphs are natural candidates to represent a pangenome collection. In such graphs, common subsequences are encoded as vertices, and genomic variations are captured by introducing additional labeled vertices and directed edges. Unlike a linear reference, a reference graph allows rich representation of genomic diversity and avoids reference bias. We address the fundamental problem of mapping reads to genome variation graphs. We give a novel mapping algorithm, V-MAP, for efficient identification of a small subgraph of the genome graph for optimal gapped alignment of reads. For fast and accurate mapping, V-MAP creates a space-efficient index using locality-sensitive minimizer signatures, computed using a novel graph winnowing and an embedding of the graph into a metric space. Experiments on a graph constructed from the 1000 Genomes data, using both real and simulated reads, show that V-MAP is fast, memory efficient and can map short reads as well as PacBio/Nanopore long reads with high accuracy. V-MAP performance is significantly better than the state of the art, especially for long reads.
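
Minimizer indexing, in its simplest lexicographic form, keeps one representative k-mer per window of consecutive k-mers; a sketch of the general idea only (V-MAP's locality-sensitive signatures and graph winnowing are more involved):

```python
def minimizers(seq, k, w):
    """Collect the lexicographically smallest k-mer from every window of
    w consecutive k-mers; these sparse 'minimizers' index the sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

# Two distinct minimizers summarize all eight 3-mers of this sequence
print(sorted(minimizers("ACGTACGTAC", k=3, w=3)))  # ['ACG', 'CGT']
```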

V-031: Modeling effector-host interactions in the context of the barley protein interactome
COSI: General Comp Bio
  • Valeria Velasquez-Zapata, Iowa State University, United States
  • Sagnik Banerjee, Iowa State University, United States
  • Priyanka Surana, Iowa State University, United States
  • James Mitch Elmore, USDA-ARS, Iowa State University, United States
  • Roger Wise, USDA-ARS, Iowa State University, United States

Short Abstract: Pathogen effectors are excellent tools to explore dynamic regulation of plant resistance/susceptibility. To discover novel mechanisms of effector action, we exploited the biotrophic powdery mildew fungus, Blumeria graminis f. sp. hordei (Bgh), and its host, barley (Hordeum vulgare). We used next-generation sequencing to identify interacting partners from high-throughput yeast two-hybrid assays, using Bgh effectors as baits, and as preys, a time-course cDNA library from infected barley and isogenic immune-mutants. We evaluated selected vs. non-selected conditions for positive interactors using a robust informatics and statistics pipeline, including mapping reads to barley and Bgh genomes, reconstruction of prey fragments and fusions with GAL4-AD, and processing of count data. We used this information to develop a ranking system for interactors, comprising 1) significant enrichment under selection for positive interactions, 2) in-frame fusion with GAL4-AD, and 3) degree of enrichment in pairwise comparisons of baits under selection. Outputs from this pipeline facilitated sorting and validation by binary Y2H. We integrated the top-ranked effector targets with a predicted barley protein interactome to identify barley-Bgh interaction hubs. Additionally, we filtered the interactions based on co-expression to detect tightly regulated genes during the immune response. Immune modules exhibited enrichment for genes associated with transcription, phosphorylation and intracellular transport.

V-032: Active transitivity clustering for memory- and compute-efficient clustering of large datasets
COSI: General Comp Bio
  • Mathias Bøgebjerg, SDU, Denmark

Short Abstract: Cluster analysis has been widely applied to biological datasets in order to retrieve meaningful structure. Applying clustering algorithms to larger datasets remains an issue: both computing time and memory usage can become bottlenecks, as the similarity matrix becomes huge and computing pairwise similarity measures is necessarily quadratic. We previously developed the clustering algorithm transitivity clustering, which has shown very good results on many datasets; however, applying it to very large datasets runs into the aforementioned memory and computing time problems. To run on large datasets, we have developed active transitivity clustering. It decides which similarity measures are necessary to improve the clustering, allowing a dataset to be clustered with a minimal number of similarities and reducing both computing time and memory usage, since we neither store the entire similarity matrix nor compute all similarities. We developed multiple strategies for picking which similarities must be computed while still keeping the clustering stable.

V-033: A deep learning model for predicting gene dependencies of cancer by integrated genomic profiles
COSI: General Comp Bio
  • Yu-Chiao Chiu, University of Texas Health Science Center at San Antonio, United States
  • Siyuan Zheng, University of Texas Health Science Center at San Antonio, United States
  • Manjeet Rao, University of Texas Health Science Center at San Antonio, United States
  • Yidong Chen, University of Texas Health Science Center at San Antonio, United States
  • Yufei Huang, Department of Electrical and Computer Engineering, the University of Texas at San Antonio, United States

Short Abstract: Recent genome-wide CRISPR-Cas9 screens of cancer cell lines have brought insights into the genetic dependencies of cancer. However, it remains challenging to utilize accumulating genomic data to accurately predict gene dependencies for unscreened cell lines and translate the findings to tumors. Here we propose a deep learning model that learns feature representations of high-dimensional genomic profiles (DNA mutations, gene expression, DNA methylation, and copy number alterations) to predict a sample's dependency on more than 1,000 genes. We trained and tested the model on cell-line data collected from the Cancer Dependency Map (DepMap) project. The model demonstrated superior prediction performance over conventional machine learning methods in hold-out validation and on an independent set of cell lines. To translate the findings to real tumors, we used a transfer learning scheme to generate the first pan-cancer dependency map of more than 8,000 tumors of The Cancer Genome Atlas (TCGA). The pan-cancer dependency map allowed us to investigate the interplay of different genomic mechanisms in the determination of cancer dependencies and to identify novel therapeutic targets. We expect our model to evolve with rapidly developing dependency screens and to facilitate the prioritization of therapeutic targets in cancer.

V-034: Consensus approaches for CRISPR guide design
COSI: General Comp Bio
  • Jacob Bradford, Queensland University of Technology, Australia
  • Dimitri Perrin, Queensland University of Technology, Australia

Short Abstract: CRISPR-based systems are playing an important role in modern genome engineering. A large number of computational methods have been developed to assist in the identification of suitable guides. However, there is only limited overlap between the guides that each tool identifies. This can motivate further development, but also raises the question of whether it is possible to combine existing tools to improve guide design. We considered 10 leading guide design tools, and their output for a set of guides for which experimental validation data is available. We found that consensus approaches were able to outperform all individual tools. The best performance (with a precision of 0.924) was obtained when combining five of the tools and accepting all guides selected by at least four of them. These results can be used to improve CRISPR-based studies, but also to guide further tool development.
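
The best-performing rule reported above, accepting a guide selected by at least four of five tools, amounts to simple vote counting; a minimal sketch with hypothetical tool outputs:

```python
def consensus_guides(tool_outputs, min_votes=4):
    """Accept a guide if at least `min_votes` of the tools selected it."""
    votes = {}
    for selected in tool_outputs:
        for guide in selected:
            votes[guide] = votes.get(guide, 0) + 1
    return {guide for guide, v in votes.items() if v >= min_votes}

# Five hypothetical tools voting on three candidate guides
tools = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}, {"g1"}, {"g1", "g2"}]
print(consensus_guides(tools))  # {'g1'}
```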

V-035: DNA methylation and treatment response to de-methylation agents: genome-wide profiling and prediction signature identification through machine learning
COSI: General Comp Bio
  • Zhifu Sun, Mayo Clinic, United States
  • Xuewei Wang, Mayo Clinic, United States
  • Pete Vedell, Mayo Clinic, United States
  • Jean-Pierre Kocher, Mayo Clinic, United States

Short Abstract: DNA de-methylation agents have been used to treat certain hematological disorders and solid tumors with success. However, there are no established DNA methylation markers that can predict which patients would benefit from treatment. Taking advantage of large datasets in which both DNA methylation and drug-response data for 4 de-methylation agents are available for 600 cancer cell lines, we compared the response profiles of these 4 drugs, conducted genome-wide methylation association analyses, and applied machine learning techniques to predict drug response. Cancer cell lines of different origins responded very differently to the 4 agents. Haematologic cancer lines were highly responsive to decitabine. A large number of CpG sites had methylation levels significantly associated with decitabine and RG-108 response, but almost none for azacitidine and zebularine. Multiple pathways were responsible for the responsiveness, including, notably, calcium signaling, regulation of the actin cytoskeleton, and MAPK signaling. Using haematopoietic and lymphoid cell lines, we trained and developed machine learning models that showed high predictive performance. More importantly, these models could predict the response of other cancer cell lines, which are generally not treated with de-methylation agents, suggesting that a proportion of other cancers may benefit from decitabine treatment.

V-036: A de novo RNA-seq short read assembler by a recursive breakpoint detection, branch extension, and merging approach
COSI: General Comp Bio
  • Yu-Wei Tsay, Institute of Information Science, Academia Sinica, Taiwan
  • Arthur Chun-Chieh Shih, Institute of Information Science, Academia Sinica, Taiwan

Short Abstract: Deep short-read sequencers produce shorter reads in much greater numbers, posing new challenges for many computational problems, such as de novo genome assembly for DNA-seq data and transcriptome assembly for RNA-seq data. In genome assembly, if one read is partially mapped to one contig with the remaining parts entirely or partially mapped to others, it is considered to cross a repeat boundary. Most assemblers break the contig at the repeat boundary if there is no further information to examine the connection. In contrast, most repeats in transcripts are much shorter than those in the genome. Thus, in transcriptome assembly, a read with a breakpoint is more likely crossing a splicing site than a repeat boundary if the read length is longer than most repeats in transcripts. In this study, we propose a new de novo transcriptome assembler, called JR-Trans. In addition to using the kernel developed in JR-Assembler for read extension by jumping and extension with back-trimming, JR-Trans detects breakpoint sites and provides a recursive procedure for branch extension and contig merging. Compared with current assemblers on real data, JR-Trans achieves better overall assembly quality while requiring less execution time and memory.

V-037: Leveraging DNA Methylation Data to Better Understand Transcription Factor Binding Site Selection
COSI: General Comp Bio
  • Fei-Man Hsu, The University of Tokyo, Japan
  • Paul Horton, National Cheng Kung University, Taiwan

Short Abstract: DNA methylation can affect transcriptional regulation indirectly via changes induced in chromatin structure, and may also directly change the binding strength of (cytosine-containing) transcription factor binding sites (TFBS). We treated 5-methylcytosine as a fifth DNA base and developed a TFBS prediction pipeline, EpiDWM (epigenetic dinucleotide weight matrix), to model the impact of DNA methylation on TF binding, accompanied by a visualization tool, MethylSeqLogo, to display such effects. Our pipeline outperforms other published software packages and predicts TFBSs with tissue specificity. MethylSeqLogo can further 1) differentiate TF binding motifs in distinct genomic regions, 2) reveal the impact of DNA methylation on splicing sites, 3) create tissue-specific logos, 4) monitor time-lapse methylomes and 5) compare TF binding motifs that show sequence similarity.
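
Treating 5-methylcytosine as an extra symbol simply widens the weight-matrix alphabet; a toy mononucleotide version (EpiDWM itself uses dinucleotide weights, and these numbers are invented for illustration):

```python
def score_site(site, weights):
    """Sum position-specific weights over a candidate binding site; the
    alphabet includes 'M' for 5-methylcytosine as a fifth base."""
    return sum(weights[i][base] for i, base in enumerate(site))

# Hypothetical two-position weight matrix over A, C, G, T and M (5mC)
weights = [
    {"A": -1.0, "C": 0.2, "G": -1.0, "T": -1.0, "M": 1.5},
    {"A": 0.5, "C": -0.5, "G": 1.0, "T": -0.5, "M": -0.5},
]
print(score_site("MG", weights))  # 2.5
```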

V-038: Association of variants in coding regions with clinical data in Colombian patients using data mining techniques
COSI: General Comp Bio
  • Jennifer Vélez, Universidad Nacional de Colombia, Colombia

Short Abstract: I propose a model for the analysis of variants in the gene coding regions of Colombian patients. The data correspond to 227 patients for whom 4,813 genes were sequenced and clinical histories obtained. Variants were filtered by quality for each patient, and clinical histories were stored in a relational database. An analysis model integrating three components was designed and implemented: a pipeline for the identification of variants; textual analysis of medical records using data mining; and an association model that applies association rules to variants and groups of patients. The variant identification pipeline was used to minimize variant identification error. The aim of the textual analysis was to identify groups of patients according to the content of their clinical records; five groups of patients were obtained. Association rules were then applied to each group in order to identify the relationships of the variants among themselves and with the groups of patients. A specific analysis of the CFTR and RB1 genes was also performed. Through the model, polymorphisms for the CFTR gene and pathogenic variants for RB1 were identified, indicating that groups of patients can be associated with the variants found in this study.

V-039: Transcriptome analysis for discovering biomarkers in the targeted cancer therapy
COSI: General Comp Bio
  • Sukjoon Yoon, Sookmyung women's university, South Korea
  • Hyejeong Gu, Sookmyung women's university, South Korea

Short Abstract: Integrating multi-level omics data and RNAi data can accelerate clinical application by identifying associations between cancer targets and biomarkers. The availability of synergistic and predictive biomarkers is key to the success of anticancer therapies. We are screening biomarkers for SCD1 (stearoyl-CoA desaturase 1), which plays an important role in unsaturated fatty acid metabolism in cancer cells, a core process in cancer stem cell control. The anticancer effect of siSCD1 was systematically screened across a colon cancer cell line panel. Transcriptome data from the screened panel were analyzed to discover gene expression (biomarkers) associated with the inhibitory effect of SCD1 knockdown. Functional gene set analysis of the relevant gene expression can provide insights into the mechanism of the synergistic effect between the target (SCD1) and biomarkers. This approach will have a major impact on the discovery of novel SCD1-inhibitory chemicals and their clinical applications.

V-040: ATAC-graph: analyzing and visualizing chromatin accessibility with ATAC-seq
COSI: General Comp Bio
  • Yen-Ting Liu, Academia Sinica, Taiwan
  • Jui-Hsien Lu, Academia Sinica, Taiwan
  • Ming-Ren Yen, Academia Sinica, Taiwan
  • Pao-Yang Chen, Academia Sinica, Taiwan

Short Abstract: Chromatin structure is dominated by the binding of nucleosomes and transcription factors (TFs) to DNA, such that the accessibility of regulatory DNA can indicate activation status. An efficient and precise way to reveal chromatin accessibility is the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), in which chromatin-accessible regions are cleaved, adaptor sequences are integrated, and the DNA is sequenced using next-generation sequencing. Current ATAC-seq analysis tools simply follow chromatin immunoprecipitation sequencing (ChIP-seq) pipelines and consider neither ATAC-seq-specific quality control nor the underlying biology of chromatin accessibility, which can be very different from that in ChIP-seq. We developed ATAC-graph, bioinformatics software written in Python specifically for analyzing ATAC-seq data. ATAC-graph profiles accessible chromatin regions and provides ATAC-seq-specific information. We demonstrated the biological relevance of ATAC-graph analysis in both animal and plant genomes. A full analysis of human ATAC-seq data takes only 12 minutes on our local 32-core computer. ATAC-graph performs ATAC-seq-specific data analysis and is not limited to particular genomes. ATAC-graph is available at https://github.com/kullatnunu/atacgraph.

V-041: Data handling challenges in the Organ on a Chip informatic platform
COSI: General Comp Bio
  • Pavel Vazquez Faci, Hybrid Technology Hub, University of Oslo, Norway
  • Simon Rayner, Department of Medical Genetics, Oslo University Hospital & Hybrid Technology Hub, University of Oslo, Norway

Short Abstract: The Hybrid Technology Hub, like many other research centers, works in cross-functional teams whose workflow is not necessarily linear and where, in many cases, technological advances are made through parallel work. The lack of proper tools and platforms for this collaborative environment can create time lags in coordination and limit the sharing of research advances. To solve this, we are developing a simple, user-friendly platform built for academic and scientific research collaboration. The platform will consist of a version control system along with an object storage solution. It already integrates high-performance, distributed and scalable object storage to persist and secure each project's data. The platform also implements identity and access management to guarantee the confidentiality and integrity of researchers' work. We are developing a version control system that will provide a history of the project along with the possibility of reviewing the project's development; a restoration function will also be implemented. We are also testing a tamper-proof solution to store relevant metadata in order to ensure authenticity and transparency. This platform aims to be a standardised tool within the Hybrid Technology Hub to ease collaboration, speed up research workflows and improve research quality.

V-042: The origin of the central dogma through conflicting multilevel selection
COSI: General Comp Bio
  • Nobuto Takeuchi, School of Biological Sciences, University of Auckland, New Zealand
  • Kunihiko Kaneko, Research Center for Complex Systems Biology, Graduate School of Arts and Sciences, University of Tokyo, Japan

Short Abstract: The central dogma of molecular biology rests on two kinds of asymmetry between genomes and enzymes: informatic asymmetry, where information flows from genomes to enzymes but not from enzymes to genomes; and catalytic asymmetry, where enzymes provide chemical catalysis but genomes do not. How did these asymmetries originate? Here we show that these asymmetries can spontaneously arise from conflict between selection at the molecular level and selection at the cellular level. We developed a computational model consisting of a population of protocells, each containing a population of replicating catalytic molecules. The molecules are assumed to face a trade-off between serving as catalysts and serving as templates. This trade-off causes conflicting multilevel selection: serving as catalysts is favored by selection between protocells, whereas serving as templates is favored by selection between molecules within protocells. This conflict induces informatic and catalytic symmetry breaking, whereby the molecules differentiate into genomes and enzymes, establishing the central dogma. We show mathematically that the symmetry breaking is caused by a positive feedback between Fisher's reproductive values and the relative impact of selection at different levels. Our results suggest that the central dogma is a logical consequence of conflicting multilevel selection.

V-043: Simultaneous unsupervised inference of protein-protein contacts and interactions
COSI: General Comp Bio
  • Aalt-Jan Van Dijk, Wageningen University and Research, Netherlands

Short Abstract: Protein-protein contact residues can be predicted from correlated mutations, which can be revealed by statistical analysis of alignments of homologs of interacting proteins. It is non-trivial, however, to avoid introducing non-interacting proteins, which decreases contact prediction performance. We have developed a novel algorithm to reduce such noise. The method simultaneously models protein-protein interaction and correlated mutations, with no prior knowledge of interactions. It iterates between weighting proteins according to how likely they are to interact, and predicting correlated mutations based on the weighted alignment. Importantly, unlike previous approaches, our method can be applied to many-to-many interactions. The method was initially tested on two experimental interaction datasets with various levels of noise. Without using knowledge of protein interaction status, the algorithm discriminates well between interacting and non-interacting proteins, and improves the prediction of protein-protein contacts. Subsequently, we applied the algorithm to polyketide synthases (PKSs). PKSs are organized in assembly lines defined by specific protein-protein interactions. We accurately predicted PKS assembly line order, which enables prediction of scaffold chemical structures for PKS gene clusters. This will be of great use for efforts to engineer synthetic assembly lines consisting of entirely new combinations of proteins.

V-044: Finding a critical gene related to clubroot disease resistance of Brassica rapa, phenotyped using a ‘Yeoncheon’ provincial isolate
COSI: General Comp Bio
  • Suhyoung Park, NIHHS, South Korea
  • Suk-Woo Jang, NIHHS, South Korea
  • Jeong-Soo Lee, NIHHS, South Korea
  • Min Young Park, NIHHS, South Korea

Short Abstract: Kimchi cabbage (Chinese cabbage) is one of the major vegetables in Korea. Koreans use Kimchi cabbage mainly for making Kimchi, and sometimes for shabu-shabu, Ssam and other dishes. As Koreans love Kimchi, the continuous production of fresh Kimchi cabbage is a major research subject. Continuous cultivation of Kimchi cabbage in one area has caused mass spread of clubroot disease, and even resistant varieties lost their resistance after three rounds of cultivation. Since the clubroot disease resistance gene should be introduced from turnip, we searched for resistant materials among Brassica rapa plants using GWAS (genome-wide association study) analysis. We selected 96 plant materials showing different resistance levels. After the analysis, 20,540 different SNPs were identified. The plant materials were divided into four groups, confirming that genetically diverse plant materials were included. We developed a CAPS marker located in a gene belonging to the glucose-methanol-choline (GMC) oxidoreductase family. As this marker was developed against the ‘Yeoncheon’ provincial clubroot isolate, it was possible to explain disease resistance. We applied this marker to a wide range of Cruciferae vegetables and found that its phenotype match was on average 18.4% higher than that of six other markers.

V-045: Redundancy Removal from NGS Data on Microbial Genome
COSI: General Comp Bio
  • Fabricio Araujo, UFPA, Brazil
  • Marcus Braga, UFRA - Universidade Federal Rural da Amazônia, Brazil
  • Kenny Pinheiro, UFPA - Universidade Federal do Pará, Brazil
  • Rommel Ramos, Federal University of Pará, Brazil

Short Abstract: Repetitive DNA sequences longer than the read length produce assembly gaps. In addition, repetition can cause complex rearrangements and misassemblies that create branches in assembly graphs, and algorithms must decide which path is best. Incorrect decisions create false associations, called chimeric contigs. Reads coming from different copies of a repetitive region of the genome may be wrongly assembled into a single contig, a repetitive contig. Furthermore, the growth of hybrid assembly approaches that combine data from different sequencing platforms, different fragment sizes or even distinct assemblers significantly increases the number of generated contigs and, therefore, the redundancy in the data. This work presents a computational method to detect and eliminate redundant contigs from microbial genome assemblies. It consists of two hashing-based techniques: a Bloom filter to detect duplicated contigs and locality-sensitive hashing (LSH) to remove similar contigs. The redundancy reduction facilitates downstream analysis and reduces the time required to finish and curate genomic assemblies. A hybrid assembly of the GAGE-B dataset was performed with the SPAdes (de Bruijn graph) and Fermi (OLC) assemblers. The proposed pipeline was applied to the resulting contigs and its performance compared to similar tools such as HS-BLASTN, Simplifier and CD-HIT. Results are presented.
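The first hashing technique mentioned here, a Bloom filter for flagging exact-duplicate contigs, can be sketched in a few lines of Python. This is a generic illustration under toy parameters, not the authors' implementation; in practice the bit-array size and hash count are tuned to the expected number of contigs:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: membership test with no false negatives and a
    small, tunable false-positive rate."""
    def __init__(self, size=10000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, seq):
        # Derive num_hashes bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{seq}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, seq):
        for pos in self._positions(seq):
            self.bits[pos] = True

    def might_contain(self, seq):
        # False means definitely unseen; True may (rarely) be a false positive
        return all(self.bits[pos] for pos in self._positions(seq))

def deduplicate(contigs):
    """Keep only the first occurrence of each contig sequence."""
    bloom, kept = BloomFilter(), []
    for contig in contigs:
        if not bloom.might_contain(contig):
            kept.append(contig)
            bloom.add(contig)
    return kept

print(deduplicate(["ATCG", "GGTA", "ATCG", "TTAA"]))
```

The Bloom filter answers "have I seen this exact sequence before?" in constant memory per contig; near-identical (rather than identical) contigs are what the LSH step is for.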

V-046: Bioinformatics approach to support risk assessment in toxicology
COSI: General Comp Bio
  • Pranika Singh, Edelweiss Connect GmbH, Switzerland
  • Tatyana Y. Doktorova, Edelweiss Connect GmbH, Switzerland
  • Barry Hardy, Edelweiss Connect, Switzerland
  • Thomas Exner, Edelweiss Connect GmbH , Switzerland

Short Abstract: Adverse outcome pathways (AOPs) are novel tools in toxicology and human risk assessment, designed to describe relationships between events at different levels of biological organization which ultimately lead to an adverse outcome. To integrate heterogeneous data from public sources into the AOP development process in a fast, efficient and unbiased way, and to develop predictive models based on the AOPs, we have adopted several bioinformatics strategies: (i) frequent itemset mining of in vitro, in vivo and disease data, to look for relationships between genes, pathways and diseases that co-occur across datasets; (ii) toxic-class-specific biomarker discovery using transcriptomics, biochemical/hematological and histopathological data; (iii) computational annotation and linking of newly discovered key events to biological networks. These serve as a source for new AOP discoveries as well as quantification of toxicant-dependent network perturbations. The effectiveness of these strategies is demonstrated for different endpoints (i.e. cardiac developmental toxicity, acute kidney failure, hepatocellular carcinoma) and shows the general usefulness of the proposed approach for risk assessment. This work was supported by funding from the Marie Skłodowska-Curie Actions (in3, GA 721975, and ADVaNCE, GA 750195, projects), the EU Horizon 2020 programme (OpenRiskNet e-infrastructure, GA 731075) and the NC3Rs CrackIT challenge.

V-047: Gene-specific correlation analysis of mRNA and protein levels in colorectal cancer cell lines
COSI: General Comp Bio
  • Fatemeh Zamanzad Ghavidel, University of Bergen, department of Informatics, Computational Biology Unit (CBU), Norway
  • Inge Jonassen, University of Bergen, department of Informatics, Computational Biology Unit (CBU), Norway
  • Alvis Brazma, European bioinformatics institute , Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

Short Abstract: The central dogma of molecular biology includes the translation of genetic information from mRNA to protein. Quantitative analyses show that the correlation between the abundance of mRNAs and the corresponding proteins can be weak or moderate, and that the level of correlation varies between experimental conditions and between organisms. In this work, we performed a correlation analysis of gene and protein expression in cancer cell lines. We carried out a comprehensive correlation study of protein expression profiles of 50 colorectal cancer cell lines and the corresponding gene expression levels from two public databases: the Cancer Cell Line Encyclopedia (CCLE) and the Sanger Genomics of Drug Sensitivity in Cancer project (GDSC). We identified genes with discordant/concordant gene and protein expression levels; this information has important implications for diagnosis and therapeutic targets. Moreover, gene-specific correlation indicates a GO-dependent concordance of protein/mRNA expression. We found moderate cell line-specific correlation (median Spearman’s r = 0.59). Highly variable mRNAs tend to correspond to highly variable proteins (Spearman’s r = 0.68), although with a wide distribution. Notably, several genes, including TP53, displayed high variation at the protein level despite low variation at the mRNA level, implicating significant post-transcriptional modulation of their abundance.
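Spearman's r, the rank correlation used throughout this abstract, can be computed from scratch in a few lines; the toy mRNA/protein values below are invented for illustration (real analyses would use a library such as scipy.stats.spearmanr):

```python
def rank(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

mrna    = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [2.1, 2.0, 3.5, 5.0, 4.9]
print(round(spearman(mrna, protein), 2))  # 0.8
```

Because only ranks matter, a monotone but nonlinear mRNA-to-protein relationship still scores r = 1, which is why Spearman (rather than Pearson) correlation is a natural choice for such comparisons.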

V-048: OpenRiskNet Part I: Development of an open e-infrastructure for predictive toxicology and risk assessment
COSI: General Comp Bio
  • Marc Jacobs, Fraunhofer, Germany
  • Atif Raza, Johannes Gutenberg-Universität Mainz, Germany
  • Thomas Exner, Edelweiss Connect GmbH , Switzerland
  • Denis Gebele, in silico toxicology gmbh, Switzerland
  • Stefan Kramer, Johannes Gutenberg University Mainz, Germany
  • Tim Dudgeon, Informatics Matters Ltd, United Kingdom
  • Egon Willighagen, Maastricht University, Netherlands
  • Chris T. Evelo, Maastricht University, Netherlands
  • Barry Hardy, Edelweiss Connect, Switzerland
  • Paul Jennings, Vrije Universiteit Amsterdam, Netherlands
  • Daan Geerke, Vrije Universiteit Amsterdam, Netherlands
  • Frederic Bois, Institut National De L’environnement Et Des Risques, France
  • Alan Christie, Informatics Matters Ltd., United Kingdom
  • Ola Spjuth, Uppsala University, Sweden
  • Lucian Farcal, Edelweiss Connect, Switzerland
  • Haralambos Sarimveis, National Technical University of Athens, Greece
  • Pantelis Karatzas, National Technical University of Athens, Greece
  • Philip Doganis, National Technical University of Athens, Greece
  • George Gkoutos, University of Birmingham, United Kingdom
  • Iseult Lynch, University of Birmingham, United Kingdom
  • Marvin Martens, Maastricht University, Netherlands
  • Jumamurat Bayjanov, Maastricht University, Netherlands
  • Danyel Jennen, Maastricht University, Netherlands
  • Jordi Rambla, Fundacio Centre De Regulacio Genomica, Spain
  • Cedric Notredame, Fundacio Centre De Regulacio Genomica, Spain
  • Evan Floden, Fundacio Centre De Regulacio Genomica, Spain
  • Nofisat Oki, Edelweiss Connect, Switzerland
  • Daniel Bachler, Edelweiss Connect, Switzerland

Short Abstract: OpenRiskNet (https://openrisknet.org/) is a 3-year project funded by the EU within the Horizon 2020 EINFRA-22-2016 Programme, whose main objective is to develop an open e-infrastructure providing data and software resources and services to a variety of industries requiring risk assessment (e.g. chemicals, cosmetic ingredients, pharma or nanotechnologies). The infrastructure is built on virtual research environments (VREs), which can be deployed to workstations as well as public and in-house cloud infrastructures. Services providing data, data analysis, modelling and simulation tools for risk assessment are integrated into the e-infrastructure and can be combined into workflows using harmonised and interoperable application programming interfaces (APIs) (https://openrisknet.org/e-infrastructure/services/). For complete risk assessment and safe-by-design studies, OpenRiskNet e-infrastructure functionality is combined via a variety of incorporated services demonstrated within a set of case studies. The case studies present real-world settings such as data curation, systems biology approaches for grouping compounds, read-across applications using chemical and biological similarity, and identification of areas of concern based only on alternative (non-animal testing) approaches. OpenRiskNet is working with a network of partners, organised within an Associated Partners Programme, aiming to strengthen working ties to other organisations developing relevant solutions or tools.

V-049: An introduction to genome graphs using GenGraph and Python.
COSI: General Comp Bio
  • Jon Ambler, University of Cape Town, South Africa
  • Shandukani Mulaudzi, University of Cape Town, South Africa
  • Nicola Mulder, University of Cape Town, South Africa

Short Abstract: Genome graphs are being used increasingly in research, replacing the limited and biased single reference sequence. But adoption is slow, possibly due in part to a difficult conceptual model and lack of tools for using genome graphs in more everyday analysis. GenGraph provides users with the power to create and work with genome graphs in an intuitive manner allowing for easier tool development. GenGraph is an open source python library that uses existing file formats, presenting genome graphs in a way that is intuitive to even novice programmers. It comes with a number of functions that make common tasks like extracting sequences from a region simple, and make the transition away from single linear sequences far less daunting. The graphs themselves are designed and structured in a manner that prioritises representing the biological relationships between the sequences, making interpretation simple. Sequences that are homologous between isolates are represented in the same nodes, with edges creating a path through the graph that allows the reconstitution of the component genomes. Here we demonstrate the functionality of GenGraph, and give examples of how it can be used to carry out common tasks in sequence analysis and comparative genomics in a highly intuitive manner.
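The conceptual model described here, homologous sequence shared in nodes and per-isolate paths reconstituting each genome, can be illustrated with plain Python dictionaries. This is a hypothetical toy graph to convey the idea, not the GenGraph library's API or data structures:

```python
# Nodes hold sequence; sequence homologous between isolates lives in one node.
nodes = {
    1: "ATG",   # shared start
    2: "CCT",   # variant segment present only in isolateA
    3: "GGT",   # variant segment present only in isolateB
    4: "TAA",   # shared end
}

# Each isolate's genome is an ordered path of node ids through the graph.
paths = {
    "isolateA": [1, 2, 4],
    "isolateB": [1, 3, 4],
}

def reconstitute(isolate):
    """Rebuild the linear genome of one isolate by walking its path."""
    return "".join(nodes[n] for n in paths[isolate])

def shared_nodes(a, b):
    """Node ids (homologous sequence) common to two isolates."""
    return set(paths[a]) & set(paths[b])

print(reconstitute("isolateA"))               # ATGCCTTAA
print(sorted(shared_nodes("isolateA", "isolateB")))
```

Walking a path recovers a single linear reference, while set operations over paths directly answer comparative questions such as "which sequence do these isolates share?".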

V-050: Assimilation of high-resolution HLA alleles from low-resolution serological typing: a computational approach.
COSI: General Comp Bio
  • Adriana Toutoudaki, Cambridge University Hospitals NHS Foundation Trust, United Kingdom
  • Hannah Turnbull, Cambridge University Hospitals NHS Foundation Trust, United Kingdom

Short Abstract: High-resolution HLA typing information is required as input for algorithms used to determine immunogenicity scores. Performing high-resolution HLA typing on retrospective cohorts of solid organ transplant patients is both time-consuming and expensive. Therefore, we developed an algorithm to convert existing low/intermediate-resolution HLA typing information into high-resolution, to be subsequently used to generate immunogenicity scores. Common HLA haplotypes, associations and allele frequencies within the Caucasian population were assessed through web databases and used to create a series of rules (n=115) which were incorporated into an assimilation table. A python script has been developed which manipulates multiple HLA types simultaneously and transforms low-resolution data into high-resolution using the assimilation table as reference. This computational algorithm was tested on an existing dataset genotyped by NGS (n=104) to evaluate the rule validity. The success rate by locus was: HLA-A (95%); HLA-B (86%); HLA-C (94%); HLA-DRB1 (78%) and HLA-DQB1 (73%). Overall the algorithm performed well, particularly for Class I loci, with some common errors identified in specific loci. Considering the complexity of the HLA system and ethnic variation, correct assimilation of HLA alleles is challenging. This initial proof of concept indicates it is possible and further development could lead to a useful tool for research.
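The core of the script described here is a rule lookup: each low-resolution type maps, via the assimilation table, to its most likely high-resolution allele. A minimal sketch follows; the table entries are invented for illustration (the real table encodes 115 population-specific rules) and the function is hypothetical, not the authors' code:

```python
# Hypothetical assimilation rules: most likely high-resolution allele for each
# low-resolution HLA type in the studied population (illustrative values only).
ASSIMILATION_TABLE = {
    "A*01": "A*01:01",
    "A*02": "A*02:01",
    "B*08": "B*08:01",
}

def assimilate(low_res_typing):
    """Map each low-resolution HLA type to high resolution; leave types
    without a rule unchanged so they can be flagged for manual review."""
    return [ASSIMILATION_TABLE.get(t, t) for t in low_res_typing]

print(assimilate(["A*02", "B*08", "DRB1*15"]))
```

Keeping unmatched types as-is mirrors the evaluation in the abstract: per-locus success rates can then be measured by comparing assimilated alleles against NGS-genotyped truth.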

V-051: In silico structural characterization and molecular docking for the human TAS2R16 receptor
COSI: General Comp Bio
  • Catiane Souza, Laboratório de Pesquisa em Microbiologia das Universidade Estadual de Feira de Santana, Brazil
  • Geovane Araujo, Laboratório de Bioinformática e Química Computacional da Universidade Estadual do Sudoeste da Bahia - Jequié, Brazil
  • Bruno Andrade, Laboratório de Bioinformática e Química Computacional da Universidade Estadual do Sudoeste da Bahia - Jequié, Brazil
  • Samille Gonçalves, Laboratório de Pesquisa em Microbiologia das Universidade Estadual de Feira de Santana, Brazil
  • Aristóteles Goes-Neto, Laboratório de Biologia Molecular e Computacional de Fungos da Universidade Federal de Minas Gerais, Brazil
  • Raquel Benevides, Laboratório de Pesquisa em Microbiologia das Universidade Estadual de Feira de Santana, Brazil

Short Abstract: Autism is a psychiatric disorder characterized by imbalanced intellectual development, which impairs the ability to socialize and, in some cases, motor coordination. This condition occurs due to genetic alterations that affect the normal development of the central nervous system. A range of genetic bases is involved in autism; one of these relates to G-protein-coupled taste receptors, and according to some authors, genetic polymorphism in these molecules is responsible for different levels of autism. The TAS2R16 receptor is associated with detecting the bitter taste of molecules such as sesquiterpene lactones, clerodane and labdane diterpenoids, strychnine and denatonium. In this work, we aimed to construct 3D structures of the human TAS2R16 receptor, based on its normal gene sequence, and to perform molecular docking with different bitter-taste molecules in order to describe active-site interactions. For 3D construction, we used Modeller 9.21 and then performed an AMBER 14 energy minimization with 5,000 cycles of steepest descent and 5,000 cycles of conjugate gradient to adjust the protein structure. The structure was validated using the QMEAN, ANOLEA and Procheck programs. Docking results were obtained with AutoDock Vina, and 2D ligand interaction maps were constructed using Accelrys Discovery Studio 2.5.

V-052: atyPeak: Correlating TRBS ChIP-seq peaks from multiple datasets in ReMap using deep convolutional autoencoders
COSI: General Comp Bio
  • Quentin Ferré, TAGC (INSERM, UMR U1090) & LIS (CNRS, UMR 7020), France
  • Jeanne Chèneby, Aix Marseille Univ, INSERM, UMR U1090, TAGC, Marseille, France, France
  • Denis Puthier, Aix Marseille Univ, INSERM, UMR U1090, TAGC, Marseille, France, France
  • Cécile Capponi, Aix Marseille Univ, CNRS, UMR 7020, LIS, Qarma, Marseille, France, France
  • Benoît Ballester, Aix Marseille Univ, INSERM, UMR U1090, TAGC, Marseille, France, France

Short Abstract: Cis-regulatory elements (CREs) are genomic regions that regulate gene expression by binding proteins called transcriptional regulators (TRs). TR binding is mostly studied experimentally via ChIP-seq, but these experiments have false positives, and there is no method to discern them. However, TRs are known to co-occur, and many replicate datasets exist. As such, we use common TR and/or dataset combinations to identify “atypical” peaks. We use the ReMap database to learn such correlations. CREs are represented as 3D tensors of peak presence (with axes ‘position’, ‘TR’ and ‘dataset’). We use an autoencoder to perform a lossy compression of each region, keeping common patterns and discarding rare elements (atypical peaks). The regions are viewed by the model through convolutional filters to focus on the correlations. Each peak gets an anomaly score corresponding to the autoencoder's reconstruction error. We use artificial data to confirm the model's ability to discover correlated groups of TRs/datasets and to label lonely/anomalous peaks. Application to ReMap is in progress, currently on a curated subset of the data. To our knowledge, this is the first use of a large-scale meta-analysis to corroborate different ChIP-seq datasets, using deep learning to integrate them in complex combinations and eliminate atypical peaks.

V-053: RADeep: The Rare Anaemia Disorders European Epidemiological Platform
COSI: General Comp Bio
  • Stella Tamana, Molecular Genetics Thalassaemia, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus, Cyprus
  • Petros Kountouris, Molecular Genetics Thalassaemia, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus, Cyprus
  • Paola Bianchi, Fondazione IRCCS Ca' Granda Ospedale Policlinico Milano. Hematology Unit. Pathophysiology of Anemias Unit Milan- Italy, Italy
  • Raffaella Colombatti, Clinic of Pediatric Hematology Oncology, Department of Child and Maternal Health, Padova, Italy, Italy
  • Eduard van Beers, University Medical Center Utrecht, Utrecht, Netherlands, Netherlands
  • Victoria Gutierrez Valle, Rare Diseases Centre. University Hospital Vall d'Hebron- Vall d'Hebron Research Institute. Barcelona, Spain, Spain
  • Beatrice Gulbis, CUB Hopital Erasme- LHUB-ULB, Brussels, Belgium, Belgium
  • Marina Kleanthous, Molecular Genetics Thalassaemia, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus, Cyprus
  • Maria del Mar Mañú Pereira, Rare Diseases Centre. University Hospital Vall d'Hebron- Vall d'Hebron Research Institute. Barcelona, Spain, Spain

Short Abstract: Rare anaemias (RAs) are a highly heterogeneous group of disorders characterized by anaemia as the main clinical manifestation. The Rare Anaemia Disorders European Epidemiological Platform (RADeep) is an initiative of ERN-EuroBloodNet to serve as an umbrella for new and existing patient registries of RAs in Europe. RADeep’s primary objective is to assess, at the EU level, the prevalence, incidence and survival of RA patients stratified by demographics and severity. Compatible with the FAIR principles, RADeep will allow mapping of the diagnostic methods, demography, main clinical features and treatments of RA patients. RADeep is being implemented in different phases through disease-specific arms, advised by multi-disciplinary scientific committees. To date, RADeep has reached significant milestones: (a) a legal framework has been established for the secure sharing and re-use of data on RA patients among data providers and third parties, including other ERNs, the research community and industry; (b) the Steering Committee has compiled comprehensive descriptions of metadata to facilitate database and platform development; (c) the first phase of RADeep implementation will be launched shortly for pyruvate kinase deficiency.

V-054: MultiBaC: A strategy to remove batch effect between different omic data types
COSI: General Comp Bio
  • Manuel Ugidos, Centro de Investigación Príncipe Felipe, Spain
  • Sonia Tarazona, Centro de Investigacion Principe Felipe, Spain
  • José Manuel Prats-Montalbán, Universitat Politècnica de València, Spain
  • Alberto Ferrer, Universitat Politècnica de València, Spain
  • Ana Conesa, University of Florida, United States

Short Abstract: The diversity of omic technologies has expanded together with the number of omics integration strategies. However, the costs of the different techniques are still high and many research groups cannot afford projects generating many different omic data types. Nevertheless, as researchers share their data in public repositories, datasets from other laboratories can be used to construct a multi-omic study. An important issue when integrating data from different studies is the batch effect. Several published methods correct the batch effect on omic data types common to the different studies, but they cannot correct non-common, distinctive data (i.e. an omic that has been analyzed at only one lab). This impairs multi-omics meta-analyses. We have developed MultiBaC, a batch effect correction strategy for a distinct omic data type that facilitates the integration of different omic data types from different studies. Our strategy is based on the existence of at least one shared data type and on data prediction across omics. We validate this approach on a case where a multi-omics design is fully shared by two labs, comparing within-data-type batch correction using traditional methods with across-data-type batch correction using MultiBaC.

V-055: Context-specific interaction networks from vector representation of words
COSI: General Comp Bio
  • Roland Mathis, Telepathy Labs, Switzerland
  • Matteo Manica, IBM, Switzerland
  • Joris Cadow, IBM, Switzerland
  • María Rodríguez Martínez, IBM, Switzerland

Short Abstract: The number of biomedical publications has grown steadily in recent years. However, most biomedical facts are not readily available, but buried in the form of unstructured text. Here we present INtERAcT, an unsupervised method to extract interactions from a corpus of biomedical articles. INtERAcT exploits a vector representation of words, computed on a corpus of domain-specific knowledge, and implements a new metric that estimates an interaction score between two molecules in the space where the corresponding words are embedded. We use INtERAcT to reconstruct the molecular pathways of 10 different cancer types using corpora of disease-specific articles, considering the STRING database as a benchmark. Our metric outperforms currently adopted approaches and it is highly robust to parameter choices, leading to the identification of known molecular interactions in all studied cancer types. Furthermore, our approach does not require text annotation, manual curation or the definition of semantic rules based on expert knowledge, and can therefore be efficiently applied to different scientific domains.

V-056: Contribution of synthetic lethality to cancer risk and onset time across human tissues
COSI: General Comp Bio
  • Nishanth Ulhas Nair, National Institutes of Health (NIH), United States
  • Kuoyuan Cheng, National Institutes of Health (NIH), United States
  • Joo Sang Lee, Cancer Data Science Lab, NCI/NIH, United States
  • Eytan Ruppin, Cancer Data Science Lab, NCI/NIH, United States

Short Abstract: Considerable variation exists in lifetime cancer risk across human tissues, which has been reported to be strongly correlated with the number of stem cell divisions and with abnormal DNA-methylation levels occurring in a tissue. Here, we investigate the hypothesis that the number of down-regulated synthetic lethal (SL) gene pairs in a tissue (termed its SL load) is another strong determinant of its cancer risk. We show that the SL load of normal tissues is higher than that of the cancers that originate from them, and that SL load of early-stage tumors is higher than that of late-stage ones. These findings testify that many SLs are lost during these transitions, and lead to the hypothesis that high SL load in normal tissues may impede cancer development. Accordingly, we find that normal tissues with high SL load have less risk of developing cancer than tissues with low SL load. Tissues with high SL load also develop cancer at later ages than tissues with low SL load. The SLs lost in the transition from healthy to cancer tissues tend to be the functionally stronger ones. Our findings highlight the significant role of synthetic lethality in determining cancer risk and onset time across tissues.
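The "SL load" defined in this abstract, the number of synthetic lethal pairs in which both partners are down-regulated, reduces to a simple count over a pair list. The sketch below uses invented gene names and an arbitrary expression threshold, not the study's actual data or cutoffs:

```python
def sl_load(expression, sl_pairs, threshold=1.0):
    """Count SL pairs where both partner genes fall below the expression
    threshold (i.e. both are down-regulated); missing genes never count."""
    return sum(
        1
        for g1, g2 in sl_pairs
        if expression.get(g1, float("inf")) < threshold
        and expression.get(g2, float("inf")) < threshold
    )

# Toy data: expression per gene and a list of known SL partner pairs
expression = {"GENE_A": 0.2, "GENE_B": 0.5, "GENE_C": 2.0, "GENE_D": 0.1}
sl_pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_C", "GENE_D")]
print(sl_load(expression, sl_pairs))  # 1 (only GENE_A/GENE_B are jointly down)
```

Computing this count per tissue is what allows SL load to be compared against cancer risk and onset age across tissues.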

V-057: A multi-modal knowledge-based hybrid feature selection model for identification of cancer biomarkers
COSI: General Comp Bio
  • Osama Hamzeh, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada

Short Abstract: Identifying biomarkers that can be used to predict certain diseases or states of a disease is one of the most important applications of machine learning. Traditional biomarker identification approaches typically use machine learning techniques to identify a number of genes and macromolecules as biomarkers that can diagnose specific diseases or disease states with very high accuracy; experts' opinions and knowledge are then required to validate such findings. We propose a new machine learning method that incorporates a knowledge-based system built on the DisGeNET database, a framework that provides proven relationships between diseases and genes. The machine learning pipeline starts by reducing the number of features using a filter-based feature-selection method. The DisGeNET database then scores each gene related to the given cancer, followed by a wrapper-based feature-selection method that picks the best subset of genes. The method returns key genes that predict with high accuracy while being biologically relevant, with no human intervention needed. Initial results provided a high area under the curve with a handful of genes that are already proven to be related to the relevant diseases based on the latest published medical findings.
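The filter-then-wrapper pattern described above can be sketched with generic scikit-learn components. This is our own illustration on synthetic data; the authors' actual filter, the DisGeNET scoring step, and their wrapper method are not reproduced here.

```python
# Illustrative filter -> wrapper feature selection (generic components,
# synthetic data; parameter choices are assumptions for demonstration).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

# Filter step: keep the 50 features most associated with the label.
filt = SelectKBest(f_classif, k=50).fit(X, y)
X_filtered = filt.transform(X)

# Wrapper step: recursively eliminate features using a classifier.
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
wrapper.fit(X_filtered, y)
print(X_filtered[:, wrapper.support_].shape)  # (100, 5)
```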

V-058: Combining multiple features and algorithms to learn antimicrobial resistance genotype-phenotype relationships
COSI: General Comp Bio
  • Kara K. Tsang, McMaster University, Canada
  • Finlay Maguire, Dalhousie University, Canada
  • Haley Zubyk, McMaster University, Canada
  • Sommer Chou, McMaster University, Canada
  • Gerard D. Wright, McMaster University, Canada
  • Robert G. Beiko, Dalhousie University, Canada
  • Andrew G. McArthur, McMaster University, Canada

Short Abstract: Genotypic methods could be faster, cheaper, and more sensitive than existing approaches to diagnosing antimicrobial resistance (AMR). Machine learning studies report highly accurate AMR prediction, yet few have attempted to elucidate AMR genotype-phenotype relationships. To identify the mechanisms driving clinical, multidrug-resistant Escherichia coli and Pseudomonas aeruginosa, we used logistic regression (LR) or the set covering machine (SCM) in combination with whole-genome k-mers, resistance determinants predicted using the Resistance Gene Identifier (RGI), or mutations identified using reference sequences. Some approaches predicted specific AMR phenotypes with higher accuracy than others. All methods identified genetic elements for some AMR phenotypes, e.g., (mutations in) gyrA and parC for ciprofloxacin resistance. LR+RGI and SCM+RGI predicted novel genotype-phenotype relationships, e.g., cefazolin resistance associated with CTX-M-27 and CMY-2 despite no published reports, which we experimentally verified. Only SCM+k-mers identified k-mers within aac(3)-iid predictive of gentamicin resistance, yet k-mers are harder to interpret than annotated mutations or resistance determinants. Overall, broad sampling of clinical, farm, and environmental isolates is needed to better assess genomic diversity for adequately training machine learning models. Increased clinical sequencing will yield data of immense depth that can build the foundation for implementing artificial-intelligence diagnostics in clinics and hospitals.
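The whole-genome k-mer featurization mentioned above can be illustrated simply: each genome is represented by counts of its length-k substrings, which can then be fed to a classifier such as LR or SCM. The sequence and k below are toy values, not the study's data.

```python
# Toy sketch of k-mer featurization for genotype-phenotype models.
from collections import Counter

def kmer_counts(seq, k=4):
    """Count all overlapping k-mers in a nucleotide sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

genome = "ACGTACGTAC"
counts = kmer_counts(genome, k=4)
print(counts["ACGT"])  # "ACGT" occurs at positions 0 and 4 -> 2
```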

V-059: OpenRiskNet Part II: Predictive Toxicology based on Adverse Outcome Pathways and Biological Pathway Analysis
COSI: General Comp Bio
  • Danyel Jennen, Maastricht University, Netherlands
  • Jumamurat Bayjanov, Maastricht University, Netherlands
  • Marvin Martens, Maastricht University, Netherlands
  • Chris T. Evelo, Maastricht University, Netherlands
  • Egon Willighagen, Maastricht University, Netherlands
  • Nofisat Oki, Edelweiss Connect GmbH, Switzerland
  • Tim Dudgeon, Informatics Matters Ltd, United Kingdom
  • Thomas Exner, Edelweiss Connect GmbH, Switzerland

Short Abstract: The OpenRiskNet project (https://openrisknet.org/) is funded by the H2020-EINFRA-22-2016 Programme. Here we present how the concept of Adverse Outcome Pathways (AOPs), which captures mechanistic knowledge from a chemical exposure causing a Molecular Initiating Event (MIE), through Key Events (KEs), towards an Adverse Outcome (AO), can be extended with additional knowledge using tools and data available through the OpenRiskNet e-Infrastructure. This poster describes how the AOPLink case study, together with DataCure, TGX, and SysGroup, can utilize the AOP framework for knowledge and data integration to support risk assessments. AOPLink integrates knowledge captured in AOPs with additional data sources and experimental data from DataCure. TGX feeds this integration with prediction models of the MIE of such AOPs, using either gene expression data or knowledge about stress response pathways. This is complemented by SysGroup, which groups chemical compounds by structural similarity and by mode of action derived from omics data. The combination of these case studies therefore extends the AOP knowledge and allows biological pathway analysis in the context of AOPs, by combining experimental data with the molecular knowledge captured in the KEs of AOPs.

V-060: OpenRiskNet Part III: Modelling Services in Chemical/Nano-safety, Environmental Science and Pharmacokinetics
COSI: General Comp Bio
  • Lucian Farcal, Edelweiss Connect, Switzerland
  • Philip Doganis, National Technical University of Athens, Greece
  • Pantelis Karatzas, National Technical University of Athens, Greece
  • Haralambos Sarimveis, National Technical University of Athens, Greece
  • Ola Spjuth, Uppsala University, Sweden
  • Barry Hardy, Edelweiss Connect, Switzerland
  • Jonathan Alvarsson, Uppsala University, Sweden
  • Staffan Arvidsson, Uppsala University, Sweden
  • Stefan Kramer, Johannes Gutenberg University Mainz, Germany
  • Denis Gebele, in silico toxicology gmbh, Switzerland
  • Atif Raza, Johannes Gutenberg University Mainz, Germany
  • Thomas Exner, Edelweiss Connect GmbH, Switzerland

Short Abstract: The OpenRiskNet project (https://openrisknet.org/) is funded by the H2020-EINFRA-22-2016 Programme and its main objective is the development of an open e-infrastructure providing data and software resources and services to a variety of industries requiring risk assessment (e.g. chemicals, cosmetic ingredients, pharma or nanotechnologies). Case studies were used to test and evaluate the proposed solutions, as described at https://openrisknet.org/e-infrastructure/development/case-studies/. Two case studies, namely ModelRX and RevK, focus on modelling within risk assessment. The ModelRX – Modelling for Prediction or Read Across case study provides computational methods for predictive modelling and support for assessing the suitability of existing data. It supports final risk assessment by providing calculations of theoretical descriptors, gap filling of incomplete datasets, computational modelling (QSAR) and predictions of adverse effects. Services are offered through Jaqpot (UI/API), JGU WEKA (API), Lazar (UI) and Jupyter & Squonk Notebooks. In the RevK – Reverse dosimetry and PBPK prediction case study, physiologically based pharmacokinetic (PBPK) models are made accessible for risk assessment-relevant scenarios. The PKSim software, the httk R package and custom-made PBPK models have been integrated. RevK offers services through Jaqpot (UI/API).

V-061: Genome-wide analysis shows multiple genetic loci shared between major depressive disorder and intelligence
COSI: General Comp Bio
  • Shahram Bahrami, University of Oslo, Norway
  • Alexey Shadrin, University of Oslo, Norway
  • Oleksandr Frei, University of Oslo, Norway
  • Kevin O’connell, University of Oslo, Norway
  • Francesco Bettella, University Of Oslo, Norway
  • Florian Krull, University of Oslo, Norway
  • Chun Fan, University of California, United States
  • Jan Røssberg, University of Oslo, Norway
  • Torill Ueland, University of Oslo, Norway
  • Anders Dale, University of California, United States
  • Srdjan Djurovic, University of Oslo, Norway
  • Nils Steen, University of Oslo, Norway
  • Olav Smeland, University of Oslo, Norway
  • Ole Andreassen, University of Oslo, Norway

Short Abstract: Genome-wide association studies (GWAS) have identified several common genetic variants influencing major depression (MD) and general intelligence (INT), but little is known about whether the two share any of their genetic etiology. In this study we identified susceptibility loci shared between MD and INT. Using a conditional false discovery rate (condFDR) statistical method, we analyzed GWAS data on MD (n=480,359) and INT (n=269,867) to improve statistical power for revealing the genetic underpinnings of MD and INT. We applied the conjunctional false discovery rate (conjFDR) framework to identify genetic loci shared between these phenotypes. Functional analysis of identified loci was performed using FUMA, and genetic correlation was estimated with LD score regression. Despite a non-significant genetic correlation (rg=-0.0148, p=0.5008), we identified 92 loci shared between MD and INT at conjFDR<0.05. Forty-eight of the shared loci showed consistent directions of allelic effects in MD and INT, while the remaining 44 loci had opposite effect directions. Based on the functional analysis, the most significant functions were cell adhesion and metabolic processes for the genes with consistent effect directions, and regulation of gene silencing for the genes with opposite effect directions.

V-062: Selective neuronal vulnerability in Alzheimer's disease: an integrative network-based analysis
COSI: General Comp Bio
  • Olga Troyanskaya, Princeton University, United States
  • Vicky Yao, Princeton University, United States
  • Jean-Pierre Roussarie, The Rockefeller University, United States
  • Paul Greengard, The Rockefeller University, United States

Short Abstract: A major obstacle to treating Alzheimer's disease is our lack of understanding of the molecular mechanisms underlying selective neuronal vulnerability, a key characteristic of the disease. While this property is shared among most neurodegenerative diseases, it is challenging to study because we are unable to obtain high quality cell type-specific profiles from non-postmortem human brain. Here we present a framework to integrate high-quality neuron-type specific molecular profiles from the mouse together with a large compendium of postmortem human functional genomics and quantitative genetics data. We demonstrate human-mouse conservation of cellular taxonomy at the molecular level for Alzheimer's vulnerable and resistant neurons. Leveraging our spatial homology mapping, we construct in silico neuron-type-specific networks for each of the neurons, then develop a new method that leverages probabilistic subsampling, NetWAS 2.0, to identify specific genes and pathways associated with Alzheimer's disease pathology. Finally, we pinpoint a specific functional gene module underlying selective vulnerability, enriched in processes associated with axonal remodeling, and affected by both amyloid accumulation and aging. Overall, our study provides a molecular framework for understanding the complex interplay between Aβ, aging, and neurodegeneration within the most vulnerable neurons in Alzheimer's disease.

COSI: General Comp Bio
  • Tinuke Oladele, Department of Computer Science, University of Ilorin, Ilorin, Nigeria
  • Oluwafisayo Ayoade, Department of Computer Science, University of Ilorin, Ilorin, Nigeria
  • Roseline Ogundokun, Department of Computer Science, Landmark University, Omu-Aran, Nigeria
  • Marion Adebiyi, Department of Computer Science, Landmark University, Omu-Aran, Nigeria

Short Abstract: Sickle Cell Disease (SCD) is one of the most serious hematological syndromes in Nigeria, marked by various vaso-occlusive incidents. SCD encompasses a diverse range of hemoglobin disorders (such as sickle cell hemoglobin S (HbSS) and hemoglobin C (HbSC), sickle cell anemia, beta and alpha thalassemias, HbSD, HbSE, and so on). SCD is a common genetic disorder in most sub-Saharan African countries, affecting up to three percent of births in various parts of the continent, and especially in Nigeria. Feature selection, extraction and classification can enhance adequate diagnosis and management of the disorder. The objective is to develop a health information system that performs feature extraction and classification on a simulated SCD dataset using machine learning techniques. Feature selection was performed with the Box Counting Method, extraction with Neural Networks, and classification with Adaptive Neuro-Fuzzy Inference Systems (ANFIS). Adequate data and information management support for diagnosing and managing SCD can reduce the high rate of early mortality, which has been increasing in Nigeria. The combination of these machine learning techniques produced good performance, even on a simulated dataset, in addressing various attributes of SCD.

V-064: Meta-analysis of C. elegans single-cell developmental data reveals multi-frequency oscillation in gene activation
COSI: General Comp Bio
  • Isaac Kohane, Harvard University, United States
  • Luke Hutchison, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

Short Abstract: The advent of in vivo techniques for single-cell lineaging, sequencing, and gene expression analysis has increased understanding of organismal development. We applied novel meta-analysis and visualization techniques to the EPIC single-cell-resolution developmental gene expression dataset from Bao et al. to gain insight into developmental timing mechanisms. Our meta-analysis revealed that a simple linear combination of the expression levels of the developmental genes is strongly correlated with the developmental age of the organism, irrespective of cell division rate. We uncovered a pattern of collective sinusoidal oscillation in gene activation, of multiple orthogonal frequencies, pointing to the existence of a global timing mechanism. We developed a novel method based on Fisher's Discriminant Analysis (FDA) to identify gene expression weightings that maximally separate traits of interest, and found that simple linear gene expression weightings can produce oscillations of any frequency and phase. We cross-linked EPIC with gene ontology and anatomy ontology terms, employing FDA methods to identify previously unknown contributions to developmental processes and phenotypes. Our results highlight both the continued relevance of the EPIC technique, and the value of meta-analysis of previous results. The presented techniques are broadly applicable across developmental and systems biology.
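The Fisher's Discriminant Analysis idea described above, finding a linear weighting of gene expression levels that maximally separates two groups, can be sketched as follows. The data are synthetic toy values (not the EPIC dataset), and this two-class closed form is a simplification of the authors' FDA-based method.

```python
# Toy two-class Fisher discriminant: w = Sw^-1 (mean_B - mean_A).
import numpy as np

rng = np.random.default_rng(0)
early = rng.normal(0.0, 1.0, size=(50, 3))              # "early" cells, 3 genes
late = rng.normal([3.0, 0.0, 0.0], 1.0, size=(50, 3))   # shifted in gene 0

# Within-class scatter, approximated here by summed sample covariances.
sw = np.cov(early, rowvar=False) + np.cov(late, rowvar=False)
w = np.linalg.solve(sw, late.mean(axis=0) - early.mean(axis=0))

# Projecting onto w separates the two groups along a single axis.
print((late @ w).mean() > (early @ w).mean())  # True
```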

V-065: Inferring pathway activation/suppression to rank tumors by sensitivity to immune checkpoint therapy
COSI: General Comp Bio
  • Boris Reva, Icahn School of Medicine at Mount Sinai, United States
  • Anna Calinawan, Icahn School of Medicine at Mount Sinai, United States
  • Dmitry Rykunov, Icahn School of Medicine at Mount Sinai, United States
  • Azra Krek, Icahn School of Medicine at Mount Sinai, United States
  • Sujit Nair, Icahn School of Medicine at Mount Sinai, United States
  • Ash Tewari, Icahn School of Medicine at Mount Sinai, United States
  • Eric Schadt, Icahn School of Medicine at Mount Sinai, United States

Short Abstract: We introduce a new method to infer pathway activation and suppression by examining under- and over-representation of pathway genes among tumor genes ranked by expression level. The novelty of our approach rests on the independent assessment of over- and under-representation of a given pathway's genes in the rank-ordered gene list for a given sample. By finding the point of maximal pathway enrichment in the rank-ordered list, the tumors are stratified into two groups, one in which the pathway is inferred as activated (or suppressed) and the other inferred as not activated (or suppressed). We applied this method to differentiate prostate cancers by sensitivity to immune checkpoint inhibitors. We hypothesized that non-responder tumors had either the IFN-γ axis suppressed, which makes tumors invisible to immune cells, or the IFN-γ axis activated along with highly activated processes of immune evasion. Our findings show that ~1/3 of prostate tumors are likely non-responders to checkpoint therapy due to downregulation of key genes along the IFN-γ axis. Using the nominated tumor immune subtypes, we determined characteristically expressed genes involved in immune evasion, proposed combination therapy and specific targets for both immune subtypes, and proposed biomarkers for clinical diagnostics of prostate cancer immune subtypes.
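The "point of maximal pathway enrichment in the rank-ordered list" can be illustrated with a simple running-score walk down the ranked genes, in the spirit of enrichment-score methods. This integer-scored version is our own simplification, not the authors' statistic, and the gene names are toy examples.

```python
# Walk the ranked list; step up at pathway genes, down otherwise, with
# integer increments so hits and misses balance. The running maximum marks
# the point of maximal enrichment.
def max_enrichment_position(ranked_genes, pathway):
    n_hit = sum(g in pathway for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    score = best = best_pos = 0
    for i, g in enumerate(ranked_genes, 1):
        score += n_miss if g in pathway else -n_hit
        if score > best:
            best, best_pos = score, i
    return best_pos

ranked = ["IFNG", "STAT1", "GAPDH", "JAK2", "ACTB", "TUBB"]
print(max_enrichment_position(ranked, {"IFNG", "STAT1", "JAK2"}))  # 2
```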

V-066: P. aeruginosa in cystic fibrosis lungs: within host sub-clone diversity is implicated in differential virulence factor expression
COSI: General Comp Bio
  • Aaron Weimann, University of Cambridge, United Kingdom
  • Louise Ellison, University of Cambridge, United Kingdom
  • Karen Brown, University of Cambridge, United Kingdom
  • Emem Ukor, University of Cambridge, United Kingdom
  • Damian Sutcliffe, University of Cambridge, United Kingdom
  • Judy Ryan, Royal Papworth Hospital, United Kingdom
  • Josie Bryant, University of Cambridge, United Kingdom
  • Vasanthini Athithan, Royal Papworth Hospital, United Kingdom
  • Martin Welch, University of Cambridge, United Kingdom
  • Julian Parkhill, Wellcome Trust Sanger Institute, United Kingdom
  • John Winn, Microsoft, United Kingdom
  • Andres Floto, University of Cambridge, United Kingdom

Short Abstract: Chronic lung infections with the opportunistic pathogen Pseudomonas aeruginosa represent a major burden for people suffering from cystic fibrosis (CF). P. aeruginosa harbours a multitude of virulence factors that allow the bacteria to thrive in the human lungs and that change over the course of chronic infections. The extent of co-occurring sub-clones of Pseudomonas in the CF lungs, their differential ability to produce virulence factors and the underlying genetic variants driving those changes are not well understood. We monitored nine CF patients over six months and recovered, sequenced and phenotyped over 4,000 isolates from sputum samples. We found that most patients harboured a number of distinct sub-clones with considerable genomic variation, some of which produced markedly different quantities of virulence factors. Genome-wide association studies identified several variants that were associated with changes in these virulence phenotypes. We further confirmed hits using gene knock-out mutants from an arrayed transposon library of Pseudomonas PAO1. Chronic infections with P. aeruginosa create a massive burden for CF patients, and controlling such infections requires extensive therapy. We revealed a high diversity of Pseudomonas populations in chronic infections, which underlines the importance of treating the entirety of the extant Pseudomonas sub-clones in the CF lungs.

V-067: DORMAN: Database Of Reconstructed MetAbolic Networks
COSI: General Comp Bio
  • Furkan Ozden, Bilkent University, Turkey
  • Metin Can Siper, Bilkent University, Turkey
  • Necmi Acarsoy, Bilkent University, Turkey
  • Turgrulcan Elmas, Bilkent University, Turkey
  • Bryan Marty, Case Western Reserve University, United States
  • Xinjian Qi, Case Western Reserve University, United States
  • A. Ercument Cicek, Bilkent University, Turkey

Short Abstract: Genome-scale reconstructed metabolic networks have provided an organism-specific understanding of cellular processes and their relations to phenotype. As they are deemed essential to the study of metabolism, the number of organisms with reconstructed metabolic networks continues to increase. This sustained research interest has led to the development of online systems/repositories that store existing reconstructions and enable new model generation, integration and constraint-based analyses. While features that support model reconstruction are widely available, current systems lack the means to help users who are interested in analyzing the topology of the reconstructed networks. Here, we present the Database of Reconstructed Metabolic Networks - DORMAN. DORMAN is a centralized online database that stores SBML-based reconstructed metabolic networks published in the literature, and provides web-based computational tools for visualizing and analyzing model topology. Novel features of DORMAN are (i) an interactive visualization interface that allows rendering of the complete network as well as editing and exporting the model, (ii) hierarchical navigation that provides efficient access to connected entities in the model, (iii) a built-in query interface that allows posing topological queries, and finally, (iv) a model comparison tool that enables comparing models with different nomenclatures, using approximate string matching. DORMAN is online and freely accessible at http://ciceklab.cs.bilkent.edu.tr/dorman.
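The approximate string matching used for comparing models with different nomenclatures can be sketched with the standard library's similarity ratio. The matching threshold and metabolite names below are illustrative assumptions, not DORMAN's actual matcher.

```python
# Toy name reconciliation across nomenclatures via similarity ratio.
from difflib import SequenceMatcher

def best_match(name, candidates, cutoff=0.8):
    """Return the candidate most similar to `name`, if above the cutoff."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

print(best_match("D-Glucose", ["alpha-D-glucose", "d_glucose", "pyruvate"]))
```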

V-068: Deep2Full: Computational strategies for predicting large complementary fraction of deep mutational scan outcomes
COSI: General Comp Bio
  • Sruthi C. K., JNCASR, Bangalore, India
  • Meher Prakash, JNCASR, Bangalore, India

Short Abstract: Deep mutational scanning probes the phenotypic effects of thousands of variants of a protein, but such large-scale experiments may not be required if computational models can predict these effects. The predictions of generic computational models, however, are not sensitive to the phenotype being studied. Supervised machine learning techniques, in which a predictive model is trained on a subset of mutations from deep mutational scan data using sequence and structure information of the protein as features, have been used to address this issue. We propose and evaluate different strategies for choosing the minimal subset of mutations on which the model is trained. For the six proteins analyzed, we find that the prediction quality of a model trained only on mutations to alanine, asparagine and histidine (a kind of site-directed approach) is comparable to that of one trained on a randomly chosen set of mutations, whereas a model with all 19 substitutions at randomly chosen positions in the training set had lower prediction quality. Our study suggests that fitness data for a subset of all possible single amino acid substitutions, obtained through random mutagenesis, is sufficient to develop computational models that reliably predict the fitness of the remaining mutants.

V-069: ELIXIR 5 years on: Providing a coordinated European Infrastructure for Life Science Data and Services
COSI: General Comp Bio
  • Jennifer Harrow, ELIXIR, United Kingdom

Short Abstract: Since its inception in 2014, ELIXIR, the European Life-science Infrastructure for Biological Information, has provided best-practice guidelines for the implementation of databases, software tools, standards, training, data management and analysis. ELIXIR promotes open access, enabling users to access and reuse publicly funded research effectively, and consolidates Europe’s national services and core bioinformatics resources into a single, coordinated infrastructure. There are currently 22 countries involved in ELIXIR, bringing together more than 200 institutes and 600 scientists. ELIXIR's activities are coordinated across five areas called 'Platforms', which have made significant progress over the past few years. The Data Platform has developed a process to identify data resources that are of fundamental importance to research and committed to long-term preservation of data, known as Core Data Resources. The Tools Platform offers services to help find appropriate software tools, workflows and benchmarking, as well as a BioContainers registry. The Compute Platform provides services to store, share and analyse large data sets and has developed the Authentication and Authorisation Infrastructure (AAI) single-sign-on service. The Interoperability Platform develops and encourages adoption of standards such as FAIRsharing, and the Training Platform helps scientists and developers find the training they need via the Training e-Support System (TeSS).

V-070: Nanopore base-calling from a perspective of instance segmentation
COSI: General Comp Bio
  • Yao-Zhong Zhang, The University of Tokyo, Japan
  • Seiya Imoto, The University of Tokyo, Japan
  • Satoru Miyano, Human Genome Center, the Institute of Medical Science, University of Tokyo, Japan
  • Rui Yamaguchi, Aichi Cancer Center Research Institute, Japan
  • Arda Akdemir, The University of Tokyo, Japan
  • Georg Tremmel, The University of Tokyo, Japan

Short Abstract: Nanopore sequencing is a rapidly developing technology that can provide long nucleotide reads on a portable device in real time. It translates the ionic current signals of a DNA/RNA fragment passing through a pore into nucleotides. Compared with short-read sequencing, a higher error rate is the fundamental challenge of nanopore sequencing. Recently, deep learning models have been applied to nanopore base-calling, reducing the error rate to a range between 5% and 15%. In this work, we propose a novel base-calling method for raw current signals from the perspective of instance segmentation. Directly applying existing instance segmentation algorithms to noisy current data often suffers from over-segmentation. Instead, we propose a simple yet effective method based on a deep U-net model. We formulate base-calling as a segmentation task that splits raw current signals and assigns nucleotide labels in an end-to-end manner. For adjacent identical nucleotides, which cannot be distinguished by the original U-net model, we pre-process the nucleotide labels with longest-consecutive-length information. We compare our proposed method with state-of-the-art base-callers. Our experimental results show that the proposed method provides competitive results with small edit distance when compared with recent deep learning based base-callers.
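The homopolymer problem above, that adjacent identical nucleotides cannot be told apart from per-position labels alone, motivates encoding labels with run-length information. A toy sketch of that idea (our own simplified illustration, not the authors' exact encoding):

```python
# Collapse a per-position nucleotide label sequence into (base, run-length)
# segments so homopolymer runs carry explicit length information.
from itertools import groupby

def encode_runs(bases):
    """Return (base, consecutive length) pairs for a label sequence."""
    return [(b, len(list(g))) for b, g in groupby(bases)]

print(encode_runs("AAACCGT"))  # [('A', 3), ('C', 2), ('G', 1), ('T', 1)]
```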

V-071: Large blocklength LDPC codes for Illumina sequencing-based DNA storage
COSI: General Comp Bio
  • Shubham Chandak, Stanford University, United States
  • Kedar Tatwawadi, Stanford University, United States
  • Billy Lau, Stanford University, United States
  • Matthew Kubit, Stanford University, United States
  • Jay Mardia, Stanford University, United States
  • Joachim Neu, Stanford University, United States
  • Hanlee Ji, Stanford University, United States
  • Tsachy Weissman, Stanford University, United States
  • Peter Griffin, Stanford University, United States
  • Mary Wootters, Stanford University, United States

Short Abstract: With the amount of data being stored increasing rapidly, current storage technologies are unable to keep up due to the slowing down of Moore’s law. In this context, DNA based storage systems can offer significantly higher storage densities (petabytes/gram) and durability (thousands of years) than current technologies. Recent advances in DNA sequencing and synthesis have made DNA storage a promising candidate for the storage technology of the future. Recently, there have been multiple efforts in this direction focusing on aspects such as error correction for synthesis/sequencing errors and erasure correction to handle missing sequences. The typical approach is to separate the codes for handling errors and erasures, but there is limited understanding of the efficiency of this framework. In this work, we study the trade-off between the writing and reading costs involved in DNA storage and propose practical and efficient schemes to achieve a smooth trade-off between these quantities. Our scheme breaks from the traditional framework and instead uses large block-length LDPC codes for both erasure and error correction, coupled with novel techniques to handle insertion and deletion errors. For a range of writing costs, the proposed scheme achieves 30-40% lower reading costs than state-of-the-art techniques using Illumina sequencing.

V-072: Controlling the False Discovery Rate in Epistasis Test Prioritization
COSI: General Comp Bio
  • Gizem Caylak, Bilkent University, Turkey
  • A. Ercument Cicek, Bilkent University, Turkey

Short Abstract: Identification of interacting (epistatic) loci, even just pairs, is a major challenge both computationally and statistically. A popular approach is to prioritize the tests to be performed rather than discarding pairs from the search space. While several algorithms have been designed with this goal, they still suffer from high False Discovery Rates (FDR); for instance, the FDR of the state-of-the-art method Linden is ~0.99. Here, we propose a new pipeline that guides epistasis prioritization algorithms to focus on areas of the genome that are likely to yield epistatic pairs. For this purpose, we use the SPADIS method for prescreening. SPADIS selects a subset of SNPs that are (i) individually associated with the phenotype, and (ii) diverse in terms of their genomic locations, by optimizing a submodular set function. As its output contains complementary SNPs, they are likely to be epistatic as well. This set prunes the search space of downstream methods. Our results show that SPADIS' guidance increases the precision of the state-of-the-art method Linden by up to 55%, while requiring only one fourth of the running time of Linden without SPADIS guidance.
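The selection objective described above, phenotype-associated yet genomically diverse SNPs, can be caricatured by a greedy pass that takes high-scoring SNPs while skipping any too close to an already-selected one. This is our own toy simplification; SPADIS itself optimizes a submodular set function, and the positions, scores, and distance cutoff here are made up.

```python
# Greedy association-plus-diversity SNP prescreen (toy illustration).
def diverse_select(snps, k=2, min_dist=10_000):
    """snps: (position, association score) pairs; returns selected positions."""
    chosen = []
    for pos, _ in sorted(snps, key=lambda s: -s[1]):
        if all(abs(pos - p) >= min_dist for p in chosen):
            chosen.append(pos)
        if len(chosen) == k:
            break
    return sorted(chosen)

snps = [(100_000, 9.1), (101_000, 8.7), (250_000, 6.2), (400_000, 3.0)]
print(diverse_select(snps))  # 101000 is within 10 kb of 100000 -> [100000, 250000]
```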

V-073: Evaluation of software for mutational signature analysis based on realistic synthetic data
COSI: General Comp Bio
  • Yang Wu, Duke-NUS Medical School, Singapore
  • S M Ashiqul Islam, University of California San Diego, United States
  • Ludmil B Alexandrov, University of California San Diego, United States
  • Michael R Stratton, Wellcome Sanger Institute, United Kingdom
  • Steven G Rozen, Duke-NUS Medical School, Singapore

Short Abstract: Mutational signature analysis examines exome- or genome-wide mutations, usually in tumors, to infer the endogenous or exogenous processes that generated the mutations and the processes' characteristic mutational profiles. The analysis emphasizes the causes of the mutations, not their phenotypic or selective consequences. This analysis is important for studies of genetic toxicology and cancer epidemiology and of the life histories of tumors. Indeed, signature analysis revealed important mutational processes in cancer that were unknown 5 years ago. Mutational signatures can be delineated in experimental systems, but can also be discovered by unsupervised analysis of somatic mutations in 100s to 10,000s of tumours, an approach sometimes called signature "extraction". Mutational signature analysis must also determine which signatures are present (and by inference, which mutational processes were operating) in a given tumor, a task known as "signature attribution". While at least 16 approaches to mutational signature analysis have been implemented, there has been negligible evaluation of their usability, utility, or accuracy on substantial sets of realistic synthetic data. Here we present a suite of synthetic data sets, an R package for generating such synthetic data sets, and evaluations of 7 software packages on these synthetic data sets.
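A common way to generate synthetic mutation spectra of the kind described above is to mix known signature profiles according to per-tumour exposures and sample counts multinomially. The profiles and exposures below are toy values, not the package's real catalogues or its generation procedure.

```python
# Toy synthetic-spectrum generation from signature profiles and exposures.
import numpy as np

rng = np.random.default_rng(42)

# Rows: signatures; columns: mutation types (tiny 4-type toy alphabet).
signatures = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])
exposures = np.array([300, 700])  # mutations attributed to each signature

# Mix signatures according to exposures, then sample one synthetic spectrum.
profile = exposures @ signatures / exposures.sum()
spectrum = rng.multinomial(exposures.sum(), profile)
print(spectrum.sum())  # total sampled mutations equal the summed exposure: 1000
```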

V-074: Delineating features associated with non-small-cell lung cancer
COSI: General Comp Bio
  • Murlidharan Nair, Indiana University South Bend, United States
  • Eilis Kilbride, Indiana University South Bend, United States
  • Johnny Dang, Indiana University South Bend, United States

Short Abstract: Lung cancer is one of the major causes of mortality worldwide. Non-small-cell lung cancer (NSCLC) is an epithelial lung cancer and accounts for a large proportion (85%) of all lung cancers. NSCLC is refractory to chemotherapy in comparison to small-cell lung cancer. NSCLC may be adenocarcinoma (ADC), squamous cell carcinoma (SqCC) or large cell carcinoma. The two types of NSCLC that have been the focus of this study are ADC and SqCC. ADCs are considered the most common type of cancer among non-smokers, while SqCCs have been correlated with a history of smoking. Molecular characterization of NSCLC could help identify biomarkers that could be used to determine targets for novel therapy. We have developed novel statistical methods to rank comparative analysis results. The top-ranking features were then used to build machine learning models. We also developed machine learning approaches to delineate features associated with second- and higher-order correlations. These features were also capable of class separation. The features identified were analyzed for their biological significance. The results of the functional analysis revealed features that provided key insights into the biology of ADC and SqCC.

V-075: Unsupervised noise removal strategy for mass cytometry data
COSI: General Comp Bio
  • Maria-Fernanda Senosain, Vanderbilt University, United States
  • Pierre P. Massion, Vanderbilt University, United States

Short Abstract: Mass cytometry is a single-cell proteomic technique that allows the simultaneous measurement of ~40 proteins. Although signal overlap is minimal due to the use of metal-conjugated antibodies, other sources of noise are present, especially in solid tissue samples that have gone through dissociation steps, and manual processing (gating) can be subjective and tedious when working with a large number of samples. We developed an algorithm that automatically removes noise from the data, taking as input the normalized files of the samples and their batch controls. An initial step removes doublets (event length >70) and zeros. Calibration beads are then removed with Gaussian mixture models and a Random Forest classifier. Finally, dead cells are identified in the controls using DBSCAN, setting the boundaries for dead cell removal. We tested this algorithm on a lung adenocarcinoma dataset. The bead removal step had 92.1% accuracy. The dead cell removal step set the following mean boundaries (arcsinh transform, cofactor = 5): for Rhodium viability dye 0–1.97 (95% CI: [0–0] [1.73–2.21]); for Histone H3 (nucleated cells) 1.3–5.99 (95% CI: [1.03–1.58] [5.85–6.13]). In summary, this approach provides automated, unbiased detection and removal of noise compared to the use of theoretical cutoff values or user-dependent gating.
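
Two of the steps described above, event-length doublet removal and boundary filtering on an arcsinh-transformed viability channel (cofactor 5), can be sketched as follows. This is an illustrative toy, not the authors' code; the event values and the interpretation of the 0–1.97 window as the live-cell region are assumptions for the example.

```python
import math

COFACTOR = 5.0

def arcsinh(x, cofactor=COFACTOR):
    # standard mass-cytometry transform: asinh(raw / cofactor)
    return math.asinh(x / cofactor)

def filter_events(events, max_length=70, live_lo=0.0, live_hi=1.97):
    """events: list of dicts with raw 'length' and 'rhodium' values.
    Keeps singlets whose transformed viability signal falls inside
    the boundaries (boundary values taken from the abstract)."""
    kept = []
    for ev in events:
        if ev["length"] > max_length:   # event too long -> doublet
            continue
        v = arcsinh(ev["rhodium"])
        if live_lo <= v <= live_hi:     # outside window -> dead cell
            kept.append(ev)
    return kept

events = [
    {"length": 40, "rhodium": 2.0},    # singlet, low dye -> kept
    {"length": 90, "rhodium": 2.0},    # doublet -> removed
    {"length": 40, "rhodium": 500.0},  # high dye uptake -> removed
]
print(len(filter_events(events)))  # 1
```

The actual pipeline derives these boundaries per batch from controls via DBSCAN rather than hard-coding them.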

V-076: Genome-scale genetic interaction analysis to rewire comprehensive metabolic models of Escherichia coli
COSI: General Comp Bio
  • Ai Muto-Fujita, Data Science Center, Nara Institute of Science and Technology, Japan
  • Jonathan Monk, University of California San Diego, United States
  • Yuichiro Tanaka, Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
  • Markus Herrgard, Technical University of Denmark, Denmark
  • Bernhard Palsson, University of California San Diego, United States
  • Hirotada Mori, Data Science Center, Nara Institute of Science and Technology, Japan

Short Abstract: Genome-scale metabolic models (GSMs) are now commonly used for prediction of gene essentiality, growth phenotypes, and the pathway disruptions responsible for phenotypic changes. iJO1366, a comprehensive genome-scale reconstruction of Escherichia coli metabolism, is one of the most complete metabolic models available. However, comparison between genome-scale essentiality data and growth predictions using iJO1366 still shows discrepancies between predicted and observed cell growth. This implies the existence of unknown metabolic reactions or pathways. Genetic interaction (GI) is the phenomenon whereby mutation of one gene affects another mutation's effect on phenotype. In the case of growth phenotypes, GI can be detected by the difference between observed and estimated growth of double-knockout (DKO) mutants. Our group has previously established two comprehensive collections of single-gene deletion mutants of E. coli: the Keio collection and the Aska deletion libraries. We also established systems for high-throughput construction of DKO mutants via conjugal transfer between Keio and Aska strains, and automated detection of their growth on agar plates. Genome-wide screening of GIs enables us to uncover unknown connections between genes. We performed genome-wide measurements of DKO mutants' growth on different culture media, focusing on the genes that showed the discrepancies. We will report how our results rewired the networks.
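
The "difference between observed and estimated growth of double-knockout mutants" is commonly scored with a multiplicative model, in which the expected double-mutant fitness is the product of the single-mutant fitnesses. A quick sketch of that scoring (the study's exact score may differ):

```python
def gi_score(w_a, w_b, w_ab):
    """Genetic-interaction score under the multiplicative model.
    w_a, w_b: single-knockout fitnesses relative to wild type;
    w_ab: observed double-knockout fitness."""
    expected = w_a * w_b     # expected fitness if genes are independent
    return w_ab - expected   # negative -> synthetic sick/lethal

# Two individually dispensable genes whose double knockout barely grows:
print(round(gi_score(0.9, 0.8, 0.1), 2))  # -0.62
```

A strongly negative score like this flags a gene pair whose joint loss is far worse than the model predicts, which is exactly the kind of signal used to propose missing reactions or pathways.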

V-077: GeneWalk: function association of individual genes through network representation learning with random walks
COSI: General Comp Bio
  • Peter Sorger, Harvard University, United States
  • Robert Ietswaart, Harvard University, United States
  • Benjamin Gyori, Harvard University, United States
  • John Bachman, Harvard University, United States
  • Stirling Churchman, Harvard University, United States

Short Abstract: A central bottleneck in high-throughput gene expression analysis is identifying the most relevant genes and their function from a list of results, for instance from an RNA-seq experiment or CRISPR screen. Gene Ontology (GO) annotation lists all known gene functions, but it is not clear which functions are at play in the particular experimental context. Functional analysis methods determine which biological processes are enriched across a gene set, but do not address what function any individual gene could have. Here we introduce GeneWalk, a method that quantifies similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a context-specific gene regulatory network that is assembled automatically from all published literature with INDRA. Similarity significance is determined through comparison with randomized networks. Consequently, GeneWalk determines for each gene the functions that are relevant in a particular biological context as the resulting significantly similar GO terms. We benchmark GeneWalk on genes involved in myelination with mouse brain RNA-seq and apply it to human NET-seq where conventional GO enrichment analysis breaks down. GeneWalk is a powerful tool for exploratory functional analysis that will ease the burden of analyzing results of high-throughput genetics experiments.
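
The underlying intuition, random walks on a network yield neighborhood profiles, and a gene is associated with a GO term when their profiles are similar, can be shown with a toy sketch. This is not GeneWalk's implementation (which learns vector embeddings from the walks and tests significance against randomized networks); the tiny network and node names are invented for illustration.

```python
import random
from collections import Counter

def walk_counts(graph, start, n_walks=200, walk_len=5, seed=0):
    """Counts of nodes visited by repeated random walks from `start`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(walk_len):
            node = rng.choice(graph[node])  # step to a random neighbor
            counts[node] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two visit-count vectors."""
    dot = sum(c1[k] * c2[k] for k in set(c1) | set(c2))
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2)

# Hypothetical network: geneA is wired to GO:1, geneB to GO:2 only.
graph = {"geneA": ["GO:1"], "GO:1": ["geneA"],
         "geneB": ["GO:2"], "GO:2": ["geneB"]}
sim_a = cosine(walk_counts(graph, "geneA"), walk_counts(graph, "GO:1"))
sim_b = cosine(walk_counts(graph, "geneB"), walk_counts(graph, "GO:1"))
print(round(sim_a, 3), round(sim_b, 3))  # 0.923 0.0
```

Walks from geneA repeatedly co-visit GO:1, so their profiles overlap; walks from geneB never touch GO:1's neighborhood, giving similarity zero.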

V-078: OpenRiskNet Part IV: WEKA Machine Learning Services for the Prediction of Half-Lives of Chemicals and Nanoparticle Transport
COSI: General Comp Bio
  • Stefan Kramer, Johannes Gutenberg University Mainz, Germany
  • Denis Gebele, in silico toxicology gmbh, Switzerland
  • Atif Raza, Johannes Gutenberg University Mainz, Germany

Short Abstract: The OpenRiskNet project (https://openrisknet.org/) is funded by the H2020-EINFRA-22-2016 Programme, and its main objective is the development of an open e-infrastructure providing data and software resources and services to a variety of industries requiring risk assessment (e.g. chemicals, cosmetic ingredients, pharma or nanotechnologies). We will present the WEKA machine learning services within the infrastructure and how they can be used to solve complex prediction tasks: the prediction of (i) half-lives of chemicals under given environmental conditions and (ii) nanoparticle transport behavior from physicochemical properties. For that purpose, we reconstruct previous efforts using complex workflows and architectures and simplify the models while maintaining their prediction performance. In both cases, the overall problem (predicting the fate of a compound depending on its properties and external conditions) is modeled as a cascaded prediction model, where the prediction of one model enters another model as input, with particular attention to validity and performance. The approach performs well on the half-life data, while the nanoparticle data are too noisy and incomplete to warrant more than the most basic models. Overall, the reconstruction of the two applications within OpenRiskNet provides more evidence for the power and versatility of the framework.
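
The cascade structure described above (one model's prediction becoming a feature of the next) can be sketched as follows. This is a made-up toy, not the OpenRiskNet/WEKA services: both stage models and all coefficients, names, and inputs are hypothetical.

```python
def model_soil_sorption(logp):
    """Hypothetical first-stage model: sorption from lipophilicity."""
    return 0.5 * logp + 1.0

def model_half_life(logp, temperature):
    """Hypothetical second-stage model: the first model's output
    enters as an input feature alongside an environmental condition."""
    sorption = model_soil_sorption(logp)      # cascaded prediction
    return 10.0 + 4.0 * sorption - 0.2 * temperature  # half-life, days

print(model_half_life(logp=2.0, temperature=20.0))  # 14.0
```

The abstract's point about "validity and performance" matters here: errors of the first stage propagate into the second, so each stage must be validated on the distribution of inputs the cascade actually produces.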

V-079: Detection and prioritization of very low frequency somatic cancer mutations in tumor derived circulating cell-free DNA sequencing data
COSI: General Comp Bio
  • Preetida Bhetariya, University of Utah, United States
  • Gabor Marth, University of Utah, United States

Short Abstract: Somatic cancer mutation profiling in circulating tumor DNA (ctDNA) is a promising avenue for non-invasive biomarker development. With growing interest in ctDNA testing, it is important to establish a sensitive and specific variant calling and prioritization bioinformatic pipeline, which can facilitate personalized therapies. We developed a variant calling workflow that detects low-frequency (<1%) variants from circulating cell-free DNA (ccfDNA) samples sequenced with a targeted panel. We benchmarked our pipeline against several state-of-the-art variant calling methods using synthetic datasets and demonstrated its performance in detecting somatic variants. Our variant prioritization workflow incorporates three critical resources: functional mutations from TCGA, known cancer driver mutations from CHASMplus, and clinically actionable mutations from the DEPO database. We tested the workflow with a dataset containing tumor and paired-normal targeted sequencing of ctDNA from cancer patients. Tumor-specific mutations were identified in most ctDNA samples, with the observed mutation spectra highly concordant with the matched tumor tissues. We used our pipeline to identify clinically actionable variants detectable in ctDNA from the patient. With improved sequencing library protocols, the pipeline is ready to be integrated into high-throughput sequencing-based ctDNA mutation profiling in research settings.

V-080: Generating patient insights in chronic obstructive pulmonary disease (COPD) with social media listening study
COSI: General Comp Bio
  • Florian S. Gutzwiller, Novartis, Switzerland

Short Abstract: Background: Online forums are commonly used by patients to share disease information and grievances and to seek mutual support. However, social media analysis has seldom been applied to evaluate patients' perspectives on COPD. Aim: To use social media data to understand the patients' perspective on COPD with respect to symptoms, unmet needs, and impact on their health-related quality of life (HRQoL). Methods: A social media data aggregator tool was used to download relevant records posted between July 2016 and January 2018 using predefined keywords. Data sources were forums, blogs, Twitter and newswires. After data anonymization, text algorithms and manual curation techniques were used to analyse and describe the impact of COPD symptoms on patients' HRQoL. Results: 695 unique patient records were considered for the analysis. Patients' discussions were primarily on symptoms, diagnosis, treatments and HRQoL. The most commonly reported symptoms were cough (25%), mucus (23%), and shortness of breath (23%). These symptoms impacted patients' HRQoL by affecting their sleep, mobility, work and emotional wellbeing, and by inducing panic attacks. Conclusions: Social media sourced data (unfiltered, uninfluenced) can generate useful insights on patients' experiences with COPD and identify unmet needs. This can inform decision-making in early drug development and guide discussions with regulatory bodies.

V-081: Bringing the Algorithms to the Data - Distributed Medical Analytics using the Personal Health Train Paradigm
COSI: General Comp Bio
  • Marius Herr, University Hospital Tübingen & University of Tübingen, Germany
  • Lukas Zimmermann, University Hospital Tübingen & University of Tübingen, Germany
  • Oliver Kohlbacher, University Hospital Tübingen, University of Tübingen & Max Planck Institute for Developmental Biology, Tübingen, Germany
  • Nico Pfeifer, University of Tübingen & Max Planck Institute for Informatics, Saarbrücken, Germany

Short Abstract: The 'Personal Health Train' (PHT) is a paradigm proposed within the GO-FAIR initiative as one solution for distributed analysis of medical data, enhancing their FAIRness. Rather than transferring data, the analysis algorithm (the 'train') travels between multiple sites (e.g., hospitals as 'train stations') hosting the data in a secure fashion. Implementing trains as lightweight Docker containers enables even complex data analysis between sites, for example genomics pipelines or deep-learning algorithms - analytics methods that are not easily amenable to established distributed queries. We developed a prototypical PHT implementation within the context of the German national medical informatics initiative (https://github.com/personalhealthtrain). Furthermore, we demonstrate how modern cloud techniques can be leveraged for complex distributed, privacy-preserving medical data analytics. The scope of applications of the infrastructure ranges from statistical queries to complex machine learning algorithms and sophisticated omics and image analyses. To participate, a station only needs to deploy a lightweight platform application, which provides the communication interface with the registry. Currently, each train station provides access to an i2b2/tranSMART instance as a local data repository. Each constructed train image is immutable and thus enhances reproducibility of analyses. A comprehensive Python library has been developed that facilitates the implementation of train images.

V-082: Analysis options for whole human genome sequencing
COSI: General Comp Bio
  • Tomasz Stokowy, University of Bergen, Norway
  • Anna Supernat, University of Gdańsk and Medical University of Gdańsk, Poland
  • Oskar Vidarsson, University of Bergen, Norway
  • Vidar Steen, University of Bergen, Norway

Short Abstract: Testing of patients with hereditary disorders and cancer is shifting to whole-genome sequencing (WGS). We decided to summarise analysis options for human WGS from a clinical perspective. We performed a comprehensive literature search and review of 512 articles related to WGS applications. Subsequently, we tested DeepVariant, a new TensorFlow machine learning-based variant caller, and compared it to GATK 4.0, SpeedSeq and Illumina Dragen using 30X WGS data of the well-known NA12878 DNA reference sample. According to our comparison, DeepVariant reached the highest F-score (harmonic mean of recall and precision) for both SNVs and indels (0.98 and 0.94, respectively). Performance of SNV calling using the four approaches was comparable (GATK, SpeedSeq and Dragen reached lower F-scores, though still equal to 0.98 after rounding to two decimal places). On the other hand, DeepVariant was more precise in indel calling (Dragen, GATK and SpeedSeq reached F-scores of 0.92, 0.90 and 0.84, respectively). We concluded that the DeepVariant tool has great potential and usefulness for medical genetics. We used the methods mentioned above to disclose causes of several monogenic disorders: keratolytic winter erythema (CTSB, duplication of enhancer), Penttinen syndrome (PDGFRB, de novo substitution) and a nuclear envelopathy (LEMD2, de novo substitution).
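
For reference, the F-score used in this comparison is the harmonic mean of precision (the fraction of called variants that are true) and recall (the fraction of true variants that are called); the precision/recall values below are illustrative, not from the study:

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.99, 0.97), 2))  # 0.98
```

Because the harmonic mean is dominated by the smaller of the two values, a caller cannot reach a high F-score by trading recall for precision or vice versa, which is why it is a common headline metric for benchmarks like this one.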

V-083: Structure-based identification of new HDAC substrates
COSI: General Comp Bio
  • Julia K. Varga, Institute of Enzymology, Research Center for Natural Sciences of the Hungarian Academy of Sciences, Hungary
  • Katherine Leng, Biophysics and Department of Chemistry, University of Michigan, United States
  • Alisa Khramushin, Institute for Medical Research Israel-Canada, Faculty of Medicine, The Hebrew University, Israel

Short Abstract: The acetylation state of a protein can influence its interactions, subcellular localization and stability. Histone deacetylase 6 (HDAC6) catalyzes the removal of acetyl groups from the lysines of proteins. It localizes mainly in the cytoplasm and has been shown to have diverse cellular roles, participating in the regulation of cytoskeleton modulation, cell motility, misfolded protein degradation, gene expression activation and autophagy. These roles are fulfilled either by one of its two catalytic deacetylase domains or by its ubiquitin-binding C-terminal domain. Finding additional substrates of the protein could help determine what other processes it is involved in and expand our understanding of how it influences different processes. HDAC6 enzyme activity was measured on a set of acetylated hexapeptides. The Rosetta FlexPepBind module was calibrated using these experimental values, first to distinguish between substrates and non-substrates and then to predict enzyme activity based on peptide-protein binding ability. The calibrated protocol is to be run on acetylated sites from the PhosphoSitePlus database to find new in vitro substrate proteins.

V-084: Construction of bioinformatics workflow system (Bio-Express) for massive genomic sequencing data analysis
COSI: General Comp Bio
  • Gunhwan Ko, Korean Bioinformation Center, South Korea
  • Pan-Gyu Kim, Korean Bioinformation Center, South Korea
  • Byungwook Lee, Korean BioInformation Center, South Korea

Short Abstract: The rapidly increasing amounts of data available from new high-throughput methods have made data processing without automated pipelines infeasible. Integration of data and analytic resources into workflow systems provides a solution to the problem, simplifying the task of data analysis. To address this challenge, we developed a cloud-based workflow management system, Bio-Express, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows making optimal use of high-performance compute clusters. Bio-Express allows users to create multi-step analyses using drag-and-drop functionality and to modify parameters of pipeline tools. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate. KoDS achieves file transfer speeds up to 10 times those of normal FTP. The computer hardware for Bio-Express comprises 800 CPU cores and 800 TB of storage, which enables 500 jobs to run at the same time. Bio-Express provides a user-friendly interface that helps genomic scientists select the right results from NGS platform data. The Bio-Express cloud server is freely available at https://www.bio-express.re.kr

V-085: Integrating Genome3D structural annotations and predicted secondary structures in InterPro 7
COSI: General Comp Bio
  • Typhaine Paysan-Lafosse, InterPro, Protein Data Bank in Europe, EMBL-EBI, United Kingdom
  • Matthias Blum, InterPro, EMBL-EBI, United Kingdom
  • Matloob Qureshi, InterPro, EMBL-EBI, United Kingdom
  • Gustavo A Salazar, InterPro, EMBL-EBI, United Kingdom
  • Ian Sillitoe, UCL, United Kingdom
  • Robert D Finn, InterPro, EMBL-EBI, United Kingdom

Short Abstract: Genome3D provides consensus structural annotations and 3D models for sequences from 10 model organisms, including human. These data are generated by several UK-based resources that together form the Genome3D consortium: SCOP, CATH, SUPERFAMILY, Gene3D, FUGUE, pDomTHREADER and PHYRE. InterPro, meanwhile, provides functional analysis of proteins by classifying them into homologous superfamilies and families, and by predicting domains, repeats and important sites based on data from 14 member databases. Until now, InterPro has only presented CATH-Gene3D and SUPERFAMILY annotations from Genome3D. The Genome3D resources FUGUE, PHYRE and pDomTHREADER have not historically been integrated in InterPro, as they are too computationally expensive to calculate over the entire UniProt database. However, the tools underpinning these resources exploit sensitive threading-based techniques, and as such their coverage of a given genome is greater than that of either SUPERFAMILY or CATH-Gene3D. To enhance the coverage of InterPro, we have integrated the predicted secondary structures from the aforementioned resources dynamically from Genome3D at different levels: InterPro entry, UniProt accession and PDB structure. This work has been conducted in collaboration between InterPro and Genome3D developers, using API calls underpinned by the mappings from InterPro entries to UniProt accessions and from UniProt accessions to PDB structures.

V-086: Spatio-molecular dissection of the breast cancer metastatic microenvironment
COSI: General Comp Bio
  • Johanna Klughammer, Harvard University, United States
  • Orit Rozenblatt-Rosen, Harvard University, United States
  • Nikhil Wagle, DFCI, United States
  • Aviv Regev, Harvard University, United States

Short Abstract: The tumor microenvironment, defined as the ecosystem of malignant and non-malignant cells within a tumor, is being increasingly recognized for its role in disease progression including therapeutic resistance and metastasis. Single-cell RNA sequencing has proven to be a powerful tool in the characterization of the large variety of stromal, immune, and malignant cell-types and states that make up these heterogeneous ecosystems. Emerging RNA- and protein-based spatial methods are starting to complement the picture, describing the spatial organization of those cell types and states. Breast cancer is the most common cancer among women. When diagnosed at an early stage, breast cancer is potentially curable with a combination of surgery, radiation, and systemic therapy. Unfortunately, metastatic breast cancer remains incurable due to inevitable development of resistance. Nevertheless, characterization of metastasis has been lacking due to both technological barriers and limited availability of samples. To characterize the metastatic microenvironment, including malignant and non-malignant cells, we performed both single-cell RNA sequencing and spatio-molecular methods on biopsies of breast cancer metastasis. We assess multiple clinically relevant breast cancer subtypes and sites of metastasis in the form of fresh and frozen biopsy samples. To enable clinically relevant discovery, we combine these analyses with detailed clinical annotation.

V-087: Adapter removal with no a priori knowledge of adapter sequences
COSI: General Comp Bio
  • Cheng-Ching Huang, NCTU, Taiwan
  • Ting-Hsuan Wang, NCTU, Taiwan
  • Jui-Hung Hung, NCTU, Taiwan

Short Abstract: NGS reads are contaminated by adapter sequence fragments that have to be removed before downstream analyses. Modern adapter trimmers require users to provide candidate adapter sequences, which are sometimes unavailable or mistaken; large-scale meta-analyses are therefore confounded by suboptimal trimming. Here we introduce a fast and accurate adapter trimming algorithm which can be applied to both paired-end and single-end sequences and requires no a priori adapter sequences. We implemented the algorithm in modern C++ with SIMD and multithreading to accelerate its speed and compared it with current mainstream adapter trimmers using simulated data and several real-life datasets. Results show that the new algorithm reaches higher throughput than, and accuracy comparable to, existing adapter trimmers. Our new adapter trimmer does not need any prior knowledge of adapter sequences and can be used in any NGS sequence analysis pipeline, especially meta-analyses.
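
One standard way to find adapters without knowing their sequence exploits paired-end geometry: when the insert is shorter than the read length, aligning read 2's reverse complement to read 1 exposes the insert boundary, and everything after it is adapter read-through. The sketch below illustrates only that idea with naive exact matching, not the authors' SIMD/multithreaded algorithm, and the adapter and insert sequences are invented.

```python
def revcomp(seq):
    """Reverse complement of an ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def infer_adapter(read1, read2):
    """Returns (trimmed_read1, adapter_fragment), or (read1, '')
    if the mate's reverse complement is not found in read 1."""
    insert = revcomp(read2)          # mate covers the same insert
    pos = read1.find(insert)         # naive exact overlap search
    if pos == -1:
        return read1, ""
    end = pos + len(insert)
    return read1[pos:end], read1[end:]   # tail = adapter read-through

insert = "ACGTACGTGGCC"
adapter = "AGATCGGA"                 # hypothetical adapter start
read1 = insert + adapter             # read 1 runs into the adapter
read2 = revcomp(insert)              # read 2 sequences the other strand
print(infer_adapter(read1, read2))   # ('ACGTACGTGGCC', 'AGATCGGA')
```

A production trimmer would use gapped/approximate matching to tolerate sequencing errors and would pool adapter fragments across many read pairs to build a consensus adapter.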

V-088: The multiple myeloma risk allele at 5q15 lowers ELL2 expression and increases ribosomal gene expression in malignant plasma cells
COSI: General Comp Bio
  • Mina Ali, Biotech research and innovation centre, Denmark

Short Abstract: Multiple myeloma (MM) is the second most common hematologic malignancy and forms in plasma cells. By carrying out a case-control genome-wide association study (GWAS) on a dataset from Sweden-Norway and Iceland, we identified one novel MM risk locus related to ELL2 (Elongation Factor for RNA Polymerase II). We confirmed the ELL2 association in a meta-analysis of six GWASs, together with cohorts from the United Kingdom, Germany, the Netherlands and the United States. We performed expression quantitative-trait locus (eQTL) analysis in CD138+ plasma cells from 1,630 MM patients from four populations. We show that the MM risk allele lowers ELL2 expression in these cells (Pcombined = 2.5×10⁻²⁷), but not in peripheral blood or other tissues. A total of 67 single-nucleotide polymorphisms and 5 small insertions/deletions are highly correlated with the best-supported sentinel MM risk variant (rs1423269). Using bioinformatic approaches, we identified 8 variants that might alter the efficiency of ELL2 transcription. Among those, three risk variants (rs3777189-C, rs3777185-C and rs4563648-G) yielded decreased luciferase activity relative to their corresponding protective variants in plasma cell lines, but not in non-plasma cell lines. Further analysis reveals that the MM risk allele associates with upregulation of gene sets related to ribosome biogenesis, and knockout/knockdown and rescue experiments support a cause-effect relationship.

V-089: Predicting the effects of SNPs on transcription factor binding affinity
COSI: General Comp Bio
  • Sierra Nishizaki, University of Michigan, United States
  • Natalie Ng, Stanford University, United States
  • Shengcheng Dong, University of Michigan, United States
  • Cody Morterud, University of Michigan, United States
  • Colten Williams, University of Michigan, United States
  • Alan Boyle, University of Michigan, United States

Short Abstract: GWASs have revealed that 88% of disease-associated SNPs reside in noncoding regions. However, noncoding SNPs remain understudied, partly because they are challenging to prioritize for experimental validation. To address this deficiency, we developed the SNP effect matrix pipeline (SEMpl). SEMpl estimates transcription factor binding affinity by observing differences in ChIP-seq signal intensity for SNPs within functional transcription factor binding sites genome-wide. By cataloging the effects of every possible mutation within the transcription factor binding site motif, SEMpl can predict the consequences of SNPs for transcription factor binding. This knowledge can be used to identify potential disease-causing regulatory loci.
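
Conceptually, a SNP effect matrix records, for each motif position and base, the binding signal relative to the strongest base at that position. The sketch below shows that bookkeeping only; it is not the SEMpl pipeline (which works from genome-wide ChIP-seq peaks), and the 2-bp "motif" and signal values are invented.

```python
from collections import defaultdict

def effect_matrix(kmers_with_signal, motif_len):
    """kmers_with_signal: [(k-mer, ChIP-seq signal), ...].
    Returns {position: {base: mean signal scaled to the best base}}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for kmer, signal in kmers_with_signal:
        for pos in range(motif_len):
            sums[(pos, kmer[pos])] += signal
            counts[(pos, kmer[pos])] += 1
    matrix = {}
    for pos in range(motif_len):
        means = {b: sums[(pos, b)] / counts[(pos, b)]
                 for b in "ACGT" if counts[(pos, b)]}
        best = max(means.values())
        # effect of each observed base, scaled to the strongest base
        matrix[pos] = {b: m / best for b, m in means.items()}
    return matrix

data = [("AC", 10.0), ("AC", 8.0), ("GC", 3.0)]  # (k-mer, signal)
m = effect_matrix(data, 2)
print(m[0])  # position 0: A keeps full binding, G reduces it
```

A SNP changing position 0 from A to G would then be predicted to weaken binding, while a change at a position where all bases score similarly would be predicted as tolerated.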

V-090: ENCODE Data on the Cloud
COSI: General Comp Bio
  • Paul Sud, Stanford University, United States

Short Abstract: The Encyclopedia of DNA Elements (ENCODE) Project has generated a wealth of genomic data on human samples and model organisms for the purposes of identification and analysis of all functional elements in the genome. As of May 2019, the ENCODE data corpus includes over 15,000 datasets spanning a variety of assays as well as both raw and uniformly processed files. ENCODE data is available on a public Amazon Web Services S3 bucket, facilitating usage with cloud-based services and applications. Here, we present an example of a data science stack that can be deployed for exploratory cross-sample correlation analysis of ENCODE ChIP-seq data. The components of this stack include a Jupyter server running on a virtual machine instance in the cloud, allowing for dynamic scalability of compute resources, the Anaconda distribution, which facilitates dependency management and environment isolation, and Goofys, a tool that can mount the S3 bucket to the instance as a POSIX-like filesystem. For the analysis, a Jupyter notebook permits interactive exploration of the data when combined with Python libraries like pandas and deepTools.

V-091: Induced pluripotent stem cells of patients with Tetralogy of Fallot reveal alterations in cardiomyocyte differentiation
COSI: General Comp Bio
  • Marcel Grunert, Charité – Universitätsmedizin Berlin, Germany
  • Sandra Appelt, Charité – Universitätsmedizin Berlin, Germany
  • Sophia Schönhals, Charité – Universitätsmedizin Berlin, Germany
  • Natalie Weber, Hannover Medical School, Germany
  • Huanhuan Cui, Charité – Universitätsmedizin Berlin, Germany
  • Fleur Mason, University Medical Center Göttingen, Germany
  • Niels Voigt, University Medical Center Göttingen, Germany
  • Silke R. Sperling, Charité - Universitätmedizin Berlin, Germany

Short Abstract: In this study, we use patient-specific induced pluripotent stem cells (ps-iPSCs) to gain insights into Tetralogy of Fallot (TOF), which represents the most common cyanotic heart defect in humans. Patient-specific expression patterns and genetic variability were investigated in iPSCs and derived cardiomyocytes (CMs) using whole genome and transcriptome sequencing data. First, the clonal mutational burden of the iPSCs was studied, which revealed, in two out of three iPSC lines of one patient, a somatic mutation in the DNA-binding domain of the tumor suppressor P53 that was not observed in the genomic DNA from blood. Characterization of this mutation showed its functional impact, which makes the cells inappropriate for modelling and studying the disease. For the other patient, potentially disease-relevant differential gene expression across stages of cardiac differentiation was shown. Here, clear differences at the later stages of differentiation could be observed. In addition, the patient-specific CMs showed abnormalities in intracellular calcium handling, cell contraction and action potential dynamics. Overall, this study provides first insights into the complex molecular and functional mechanisms underlying iPSC-derived cardiomyocyte differentiation and its alterations in TOF, which might also have an impact on the long-term clinical outcome and management of these patients.

COSI: General Comp Bio
  • Faisal Albalwy, The University of Manchester, United Kingdom
  • Angela Davies, Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, United Kingdom
  • Andy Brass, Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, United Kingdom

Short Abstract: The advent of fast, effective genome sequencing technologies has caused a step change in the diagnosis of rare genetic diseases, as their causes can now be accurately determined from variants in the patient's genome. These variants, termed pathogenic, are very rare and found in only a small percentage of the population. The molecular diagnosis of a rare disease involves comparing a patient's genetic variant data with the existing variants of others with similar diseases. Due to privacy and security concerns, these variants are still generally collected and stored in silos by local laboratories, making it challenging to share this data on a large scale. This project presents a novel blockchain-based dynamic consent model that we are developing to determine the efficacy of blockchain technology in supporting genomic data sharing. Implemented on the Ethereum test blockchain, the model operates as a secure middleware system between patients, clinicians and laboratories. The model's web-based portal allows patients to store and update their consent and track their data. Likewise, clinicians can review and securely access patient data upon patient consent. All data transitions and sharing between participants are recorded in a tamper-proof manner, enabling data auditing at any future time.

V-093: Analytical Methods to Identify Tumor Heterogeneity and Rare Subclones in Single Cell DNA Sequencing Data from Targeted Panels
COSI: General Comp Bio
  • Sombeet Sahu, Mission Bio, United States
  • Manimozhi Manivannan, Mission Bio, United States
  • Shu Wang, Mission Bio, United States
  • Dong Kim, Mission Bio, United States
  • Saurabh Gulati, Mission Bio, United States
  • Nianzhen Li, Mission Bio, United States
  • Adam Sciambi, Mission Bio, United States
  • Nigel Beard, Mission Bio, United States

Short Abstract: With advances in single-cell sequencing technologies, it is now possible to interrogate thousands of cells in a single experiment. scRNA-seq has been available for several years, but high-throughput single-cell DNA analysis is in its infancy. To address these challenges and enable the characterization of genetic diversity in cancer cell populations, we developed a novel approach to identify the mutation signatures that define the subclones present in a tumor population. Here we present a subclone identification method using data generated on the Tapestri platform and analyzed with the Tapestri analytical workflow. The resulting variant-cell matrix is then used to identify subclones, together with the top variants that define each subclone's signature. To validate our methodology, we used two different targeted sequencing panels on model systems with known truth mutations. Our pipeline recovers distinct clusters correlating with the titration and cell-line ratios, and cluster-associated signature mutations were also identified. Our approach addresses the key issue of identifying rare subpopulations of cells down to 0.1%, and transforms the ability to accurately characterize clonal heterogeneity in tumor samples. This high-throughput method advances research efforts to improve patient stratification and therapy selection for various cancer indications.

V-094: Dynamically Defined Microdomains in Rho GTPase Signaling
COSI: General Comp Bio
  • Xuexia Jiang, UT Southwestern Medical Center, United States
  • Gaudenz Danuser, UT Southwestern Medical Center, United States

Short Abstract: Many biological signaling networks are spatiotemporally organized at the subcellular scale to control crosstalk between both synergistic and antagonistic pathways. This organization is required for robust cellular decision making in the context of multi-factorial inputs. It is increasingly clear that transient organizational changes (i.e. receptor clustering, micro and macro autophagy, cell morphodynamics) play a critical role in both normal cell function and cancer. Developments in molecular biosensor technology allow us to directly visualize this organization over time. One such signaling system revolves around the Rho-family GTPase Rac1, which is commonly altered in cancer progression. Rac1 activity can be robustly visualized via FRET-based biosensors. We are developing computational image analysis tools to quantitatively catalogue the organization of Rac1 activity in microdomains, learn for a specific model system (focal adhesion maturation) the role of Rac1 regulators in directing signaling organization, and develop a predictive strategy using simulations of Rac1 activity to bridge measured changes in Rac1 signaling organization and potential changes in Rac1 regulation.

V-095: Identification of potential blood biomarkers for Parkinson’s disease by gene expression and DNA methylation data integration analysis
COSI: General Comp Bio
  • Changliang Wang, University of Macau, China
  • Liang Chen, University of Macau, China
  • Yang Yang, FHS, University of Macau, Macau, China
  • Menglei Zhang, University of Macau, China

Short Abstract: The prognosis of Parkinson’s disease (PD) is poor due to the lack of specific biomarkers for targeted clinical intervention. To identify potential biomarkers, we developed an approach consisting of 1) combining methylation and expression data from PD blood to identify methylation-regulated genes, and 2) using normal blood samples as controls to detect PD blood-specific biomarkers. Using this approach, we characterized a methylation dataset and a transcription dataset from GEO and identified 1,045 differentially expressed genes (DEGs) and 891 differentially methylated genes (DMGs). By integrating the two, we identified 94 coincidently differentially methylated and expressed genes, 90.4% of which (85 out of 94) are hypo-methylated and up-regulated (hypo-up) genes. We applied the 85 hypo-up genes to classify PD and normal blood samples in order to increase the sensitivity and specificity of PD detection. Both the hypo-up gene expression and methylation profiles could significantly separate PD from normal blood samples. Our study reveals the existence of rich biomarkers in PD blood and the power of our approach to effectively identify them. This study was supported by grant MYRG2016–00101-FHS from the Faculty of Health Sciences, University of Macau.
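
The integration step, selecting genes that are both hypo-methylated and up-regulated, amounts to a set intersection with sign filters. A minimal sketch (function name and input encoding are illustrative, not the authors' code):

```python
def hypo_up_candidates(deg, dmg):
    """Intersect differentially expressed and differentially methylated
    gene sets and keep the 'hypo-up' genes.

    deg: dict mapping gene -> expression log fold change
         (positive = up-regulated)
    dmg: dict mapping gene -> methylation difference
         (negative = hypo-methylated)
    Returns the sorted list of candidate biomarker genes.
    """
    shared = set(deg) & set(dmg)
    return sorted(g for g in shared if deg[g] > 0 and dmg[g] < 0)
```

In the study, this kind of filter reduces 94 coincident genes to the 85 hypo-up candidates used for classification.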

V-096: TCRex: a webtool for the prediction of T-cell receptor sequence epitope specificity
COSI: General Comp Bio
  • Sofie Gielis, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Pieter Moris, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Nicolas De Neuter, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Sara Benmohammed, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Wout Bittremieux, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Benson Ogunjimi, Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium, Belgium
  • Kris Laukens, University of Antwerp, Belgium
  • Pieter Meysman, University of Antwerp, Belgium

Short Abstract: To date, multiple immunoinformatics tools have been created with the goal of achieving a better understanding of immunological processes. Although good tools exist for the prediction of epitopes and their binding to MHC molecules, we still lack useful tools for the prediction of epitope-MHC recognition by TCRs. Hence, we propose TCRex, a tool to investigate TCR recognition of epitopes. This tool is based on our prior work on the feasibility of predicting TCR-epitope recognition from TCRβ sequences, in which we showed that a random forest classifier trained to predict TCR-epitope interactions from TCR amino acid physicochemical properties can achieve high accuracy. We extended this work into a toolbox trained on a large dataset containing epitopes from different viruses and tumour cells. To this end, we collected epitope-specific human TCRβ sequence data containing the CDR3 sequences and the corresponding V- and J-genes. Random forest classifiers were trained on these data and kept if they reported sufficiently high performance in a cross-validation setting. These classifiers are freely available in a webtool, called TCRex, at tcrex.biodatamining.be. TCRex can be used to make predictions on newly gathered experimental TCRβ sequence data. https://doi.org/10.1101/373472
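
The featurization idea, mapping a CDR3 amino-acid sequence to physicochemical properties before training a random forest, can be sketched as follows (toy property tables covering only a few residues; the actual TCRex feature set differs):

```python
# Toy Kyte-Doolittle-style hydrophobicity values and side-chain charges
# for a handful of residues (illustrative only, not the TCRex tables).
HYDRO = {"A": 1.8, "C": 2.5, "D": -3.5, "G": -0.4, "K": -3.9,
         "L": 3.8, "R": -4.5, "S": -0.8, "F": 2.8, "Y": -1.3}
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1, "H": 0.1}

def featurize_cdr3(seq):
    """Summarize a CDR3 sequence as a fixed-length feature tuple:
    (length, mean hydrophobicity, net charge). Residues missing from
    the toy tables contribute 0."""
    hydro = sum(HYDRO.get(a, 0.0) for a in seq) / len(seq)
    charge = sum(CHARGE.get(a, 0.0) for a in seq)
    return (len(seq), round(hydro, 3), charge)
```

In practice, feature tuples like these would be fed to a classifier such as scikit-learn's RandomForestClassifier, trained per epitope as the abstract describes.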

V-097: Hierarchical LDA reveals reasonable profiles of mutation signatures in cancer genomes
COSI: General Comp Bio
  • Taro Matsutani, Waseda University, CBBD-OIL, Japan
  • Michiaki Hamada, Waseda University, Japan

Short Abstract: A mutation signature is defined as the mutational distribution per mutational process (e.g., tobacco smoke, UV light and so on), and elucidation of signatures leads to clarification of carcinogenic mechanisms. Previous studies revealed many mutation signatures by applying matrix decomposition to mutation catalogs of cancer genomes; however, the mutation catalogs need to be divided by primary lesion. Moreover, signatures in different primary lesions arising from the same mutational process have to be merged to determine the "true" signature. If all mutation catalogs are analyzed together to avoid this problem, the sparsity of the data becomes high and signature prediction becomes difficult, because the active signatures differ between primary lesions (e.g., the tobacco signature is active in lung cancer genomes but not in others). In this study, we propose hierarchical LDA, a novel Bayesian model to predict mutation signatures from all mutation catalogs simultaneously. Hierarchical LDA extends latent Dirichlet allocation (LDA), which is similar to matrix decomposition, by hierarchizing the hyperparameters of the prior distributions for signature activity; these hyperparameters represent the sparsity of signatures in each primary lesion. We applied this model to COSMIC mutation catalogs and obtained reasonable profiles of mutation signatures.
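
The input to signature models of this kind is a mutation catalog: counts of substitutions in trinucleotide context per sample, treated like word counts per document. A minimal sketch of catalog construction (the input tuple format is a hypothetical simplification):

```python
from collections import Counter

def mutation_catalog(mutations):
    """Count mutation types in trinucleotide context.

    mutations: iterable of (upstream_base, ref, alt, downstream_base)
    Returns a Counter keyed by the conventional 'U[ref>alt]D' labels,
    i.e. the per-sample 'word counts' that (hierarchical) LDA or
    matrix decomposition factorizes into signatures.
    """
    return Counter(f"{u}[{ref}>{alt}]{d}" for u, ref, alt, d in mutations)
```

With pyrimidine-normalized substitutions this yields the standard 96-category catalog used by COSMIC signature analyses.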

V-098: PMgram: an interactive web-based prediction model nomogram generator
COSI: General Comp Bio
  • Sungyoung Lee, Seoul National University, South Korea

Short Abstract: Prediction of clinical outcomes such as drug response or early diagnosis is an emerging topic in clinical medicine. Such prediction is usually achieved by feeding a set of a patient's clinical variables into a mathematical model that links the outcome of interest to the input variables. While many successful prediction models have been implemented, they are often highly complex and require professional knowledge to interpret. One practical way to overcome these limitations is to visualize the prediction model as a nomogram, a useful method for visualizing both the effects of the variables and the result of the prediction model. In this poster, we present PMgram, a web-based tool that generates customized, interactive nomograms to give researchers a better understanding of their prediction models. PMgram is provided as a web service and implements interactive nomograms for both binary and survival phenotypes. Multiple nomograms generated from clinical studies demonstrate the utility of the proposed tool. PMgram is under development and will be made freely available to researchers who want to visualize their prediction models.

V-099: A Computational Platform for Screening Large-Scale ctDNA Data
COSI: General Comp Bio
  • Chieh-Wei Huang, National Center for High-Performance Computing, Taiwan
  • Chang-Wei Yeh, National Center for High-Performance Computing, Taiwan
  • Chao-Chun Chuang, National Center for High-Performance Computing, Taiwan
  • Chien-Ta Tu, National Center for High-Performance Computing, Taiwan
  • Yu-Tai Wang, National Center for High-Performance Computing, Taiwan
  • Hsi-Ching Lin, National Center for High-Performance Computing, Taiwan

Short Abstract: Background: Medical applications of ctDNA have become increasingly popular because of its non-invasive and comprehensive nature. However, analyzing ctDNA data is a time-consuming task. Method and Results: At NCHC, we set up a platform that speeds up computation by optimizing the workflow. The workflow contains sub-modules for quality control, alignment, SNP/indel detection, fusion gene detection, and annotation, which take more than four hours to complete when run linearly. We reallocated the steps to optimize resource usage; by parallelizing them, the computation time decreased by 30%. Furthermore, by deploying the workflow on our 64-node computing cluster, which is set up for processing medical data, we can process more than 2,000 samples per day at full computational capacity. The computing nodes are connected by a fiber network, and the Lustre file system provides high-speed reading and writing of data. Conclusion: In brief, our platform speeds up analysis by a factor of roughly 3,000 compared with a single-node computer/PC. This platform could therefore be valuable for future screening of large ctDNA datasets to build a circulating DNA database.
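
The reallocation idea, running independent steps such as SNP/indel and fusion detection concurrently once the shared alignment step has finished, can be sketched with Python's standard concurrency primitives (stand-in step functions, not the NCHC workflow):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in step functions; a real pipeline would invoke aligners and
# variant callers here (each returning the path of its output file).
def align(sample):
    return f"{sample}.bam"

def call_snv_indel(bam):
    return f"{bam}.vcf"

def call_fusions(bam):
    return f"{bam}.fusions"

def run_sample(sample):
    """Run one sample: alignment first, then the two independent
    variant-calling steps concurrently instead of one after another."""
    bam = align(sample)
    with ThreadPoolExecutor(max_workers=2) as pool:
        snv = pool.submit(call_snv_indel, bam)
        fus = pool.submit(call_fusions, bam)
        return snv.result(), fus.result()
```

On a cluster, the same dependency structure is typically expressed in a workflow manager so each step lands on its own node.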

V-100: A Robust Method to Identify Complex Associations of Gene-Sets with Censored Survival Time Outcomes
COSI: General Comp Bio
  • Xueyuan Cao, University of Tennessee Health Science Center, United States
  • Natasha Sahr, St. Jude Children's Research Hospital, United States
  • Stanley Pounds, St. Jude Children's Research Hospital, United States

Short Abstract: Biological research studies often seek to identify genes or sets of related genes that associate with a treatment or phenotype. Gene-set enrichment analysis (GSEA; PMID 16199517) and other methods are frequently used to identify gene-sets in which several individual genes have simple associations with the treatment or phenotype of interest. The multi-response permutation procedure (MRPP; PMID 18042553) identifies sets of genes that have complex associations with a categorical treatment or phenotype, such as differential correlation of genes across groups. We developed the generalized MRPP (GMRPP) to also identify sets of genes that have complex associations with censored survival time outcomes, such as time until relapse or death in oncology studies. In simulations evaluating the performance of GMRPP, GSEA, and five other methods, GMRPP was unrivaled in its ability to identify sets of genes that have a complex association with a censored survival time. In an application involving pediatric acute myeloid leukemia (AML), only GMRPP found evidence that pediatric AML patients’ survival was associated with their leukemic cells’ expression of KEGG AML pathway genes. These results show that GMRPP can discover complex associations of gene-sets with censored survival time outcomes that are invisible to other methods. Software will be available at https://github.com/nsahr.
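
The core of an MRPP-style test, comparing the observed mean within-group distance against a null distribution obtained by permuting group labels, can be sketched for a single one-dimensional feature (a simplified illustration; GMRPP's handling of censored survival outcomes is more involved):

```python
import itertools
import random

def mean_within_distance(data, labels):
    """Mean pairwise distance between observations sharing a label
    (the MRPP 'delta' statistic, unweighted for simplicity)."""
    total, count = 0.0, 0
    for g in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        for i, j in itertools.combinations(idx, 2):
            total += abs(data[i] - data[j])
            count += 1
    return total / count

def mrpp_pvalue(data, labels, n_perm=999, seed=0):
    """Permutation p-value: how often does a random relabelling yield
    a within-group distance as small as the observed one?"""
    rng = random.Random(seed)
    observed = mean_within_distance(data, labels)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if mean_within_distance(data, perm) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Tight, well-separated groups give a small observed delta that few random relabellings can match, hence a small p-value.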

V-101: Enzyme Prediction with Word Embedding Approach
COSI: General Comp Bio
  • Erkan Akin, METU, Turkey
  • Mehmet Volkan Atalay, Middle East Technical University, Turkey

Short Abstract: Information such as molecular function, biological process and cellular localization can be inferred from the protein sequence. For this purpose, we describe an approach based on a word2vec model, more specifically the continuous bag-of-words model, to generate the vector representation of a given protein sequence. In the word2vec model, a protein sequence is treated as a document or a sentence and its subsequences correspond to words. Continuous bag-of-words is a supervised word2vec model that predicts a subsequence from its neighboring subsequences. Feature vectors from the word2vec model can be coupled with classifiers to infer information from the sequence. As a sample application, we consider the problem of determining whether a given protein sequence is an enzyme or not. On a sample dataset containing 19,155 enzyme and non-enzyme protein sequences, with 20% set aside for testing and 80% used for 5-fold cross-validation, the best performance scores obtained were 0.98 for precision, recall, F1 and accuracy, and 0.95 for the Matthews correlation coefficient, achieved by a word2vec model with vector size 100, window size 25 and 100 epochs, combined with a 5-nearest-neighbor classifier.
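
The preprocessing step, treating a protein sequence as a sentence of overlapping k-mer "words" and deriving CBOW (context, target) training pairs, can be sketched as follows (k and window here are illustrative; the poster reports a window size of 25):

```python
def kmer_words(seq, k=3):
    """Split a protein sequence into overlapping k-mer 'words',
    the vocabulary units fed to the word2vec model."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def cbow_pairs(words, window=2):
    """(context, target) training pairs as consumed by a CBOW model:
    each word is predicted from its neighbours within the window."""
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs
```

A library such as gensim would perform this training internally; averaged word vectors then give the fixed-length sequence representation passed to the k-nearest-neighbor classifier.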

V-102: Importance of Visit Type in Understanding Results from Phenome-Wide Association Studies: Results from a Visit-WAS
COSI: General Comp Bio
  • Mary Regina Boland, Department of Epidemiology, Biostatistics, and Informatics, Perelman School of Medicine, University of Pennsylvania, United States

Short Abstract: Introduction: Widespread adoption of Electronic Health Records (EHR) has increased the number of reported disease association studies, or Phenome-Wide Association Studies (PheWAS). Traditional PheWAS studies ignore visit type (i.e., the department/service conducting the visit). In this study, we investigate the role of visit type in disease association results in the first ‘VisitWAS’. Results: We studied the effect of visit type on PheWAS results using EHR data from the University of Pennsylvania. Penn EHR data come from 1,048 different departments and clinics. We analyzed differences between cancer and obstetrics/gynecology (Ob/Gyn) visits. Some findings were expected (i.e., an increase of neoplasm diagnoses among cancer visits), but others were surprising, including an increase in infectious disease conditions among those visiting the Ob/Gyn. Conclusion: We conclude that assessing visit type is important for EHR studies because different medical centers have different visit type distributions. To increase reproducibility among EHR data mining algorithms, we recommend that researchers report visit type in studies.

V-103: OncoMX: enabling exploration of integrated cancer biomarker data in the context of mutation, differential expression, conserved expression patterns, and automatically mined literature evidence
COSI: General Comp Bio
  • Hayley Dingerdissen, The George Washington University, United States
  • Raja Mazumder, The George Washington University, United States
  • Dan Crichton, JPL, United States
  • K. Vijay-Shanker, University of Delaware, United States
  • Frederic B. Bastian, University of Lausanne, SIB Swiss Institute of Bioinformatics, Switzerland
  • Amanda Bell, The George Washington University, United States
  • Samir Gupta, University of Delaware, United States
  • Robel Kahsay, The George Washington University, United States
  • Heather Kincaid, NASA, United States
  • David Liu, NASA, United States
  • Marc Robinson-Rechavi, Universite de Lausanne, Switzerland
  • Stephanie Singleton, The George Washington University, United States

Short Abstract: OncoMX is a web portal designed to facilitate cancer biomarker research by integrating relevant data across types, systems, and repositories into a single, harmonized resource. Contributing sources include BioMuta, BioXpress, Bgee, DEXTER, DiMeX, and EDRN, with disease and anatomical names unified through domain-specific ontologies. As an NCI-ITCR funded project, OncoMX has leveraged extensive collaboration and user feedback from its inception. Based on these collaborative activities, OncoMX has focused development on four user perspectives: (1) Exploration of cancer biomarkers; (2) Evaluation of mutation and expression in an evolutionary context; (3) Side-by-side exploration of published literature for mutations and expression in cancer; and (4) Exploration of a specific gene or biomarker within a pathway context. Major updates to OncoMX include: an improved gene-centric search view and dashboard for data exploration, addition of scRNA-seq tissue-specific expression in cancer, implementation of data unification practices and BioCompute tracking of data provenance, new infrastructure for management and hosting of OncoMX datasets (data.oncomx.org), improved user documentation including web help and explainer videos, and improved mobile-friendly access. The OncoMX model of data integration enables both scaling and extensibility of different data types over time, resulting in a sustainable and easily searchable cancer biomarker online resource. URL: https://www.oncomx.org/.

V-104: A two-stage approach for detection, segmentation and classification: an application in cervical cytology image
COSI: General Comp Bio
  • Jing Ke, Shanghai Jiao Tong University / University of New South Wales, Australia

Short Abstract: Typically, high accuracy in segmentation and classification by deep learning is attributed to large pixel-wise labeled datasets. In the biomedical domain, however, acquiring the relevant annotations remains challenging. In this paper, we propose a novel two-stage approach for detection, segmentation and classification of lesions, verified by a case study in cervical cancer diagnosis. The advantages of one-stage image-level labeling for classification and two-stage pixel-level labeling for detection and segmentation are combined to save labor-intensive annotation. First, we use a hybrid ResNet (encoder) and U-Net (decoder) architecture with skip connections to segment nuclei, cytoplasm and background. Then, ResNet-based classification is applied to squares whose centers have been located as nuclei, while the previously generated contours are preserved. Tests were performed on 49 positive LBC images of around 56,000x56,000 pixels each. Only three segmentation maps of 2,000x2,000 pixels were pixel-wise annotated, and 3,700 images of 400x400 pixels for training and inference were image-level labeled. For evaluation, cytotechnologists randomly selected 4,600 cells, balanced across subtypes, of which three subtypes are catalogued as abnormal. The results show an average precision of 90.1% in ten-class classification, 95.2% in normal/abnormal binary classification, and 92.0% in nuclei segmentation. The cytotechnologists estimate that the model efficiently removes over 90% of the annotation burden.

V-105: The evolutionary traceability of proteins
COSI: General Comp Bio
  • Arpit Jain, Goethe University Frankfurt, Germany
  • Dominik Perisa, Goethe University Frankfurt, Germany
  • Arndt von Haeseler, Max F. Perutz Laboratories, Austria
  • Ingo Ebersberger, Goethe University Frankfurt, Germany

Short Abstract: Background: Orthologs document the evolution of genes and metabolic capacities encoded in extant genomes. Orthologous genes detected in all domains of life allow reconstructing the gene set of LUCA, the last universal common ancestor. However, as similarity between orthologs decays with time, it eventually becomes insufficient to infer common ancestry. Thus, ancient gene set reconstructions are incomplete and distorted to an unknown extent. Methods and Results: Here we introduce the evolutionary traceability, together with the software protTrace, which quantifies, for each protein, the evolutionary distance beyond which the sensitivity of the ortholog search becomes limiting. protTrace estimates, for a seed protein, its specific evolutionary rate together with constraints on evolutionary change, jointly from a pre-compiled ortholog set and from the seed protein’s domain architecture. A simulation-based framework then estimates the decay of traceability with time. We show that the LUCA set comprises only highly traceable proteins, and we demonstrate how a traceability-informed adjustment of the search sensitivity identifies hitherto missed orthologs. Discussion: The evolutionary traceability helps to differentiate between true absence and non-detection of orthologs, and thus improves our understanding of the evolutionary dynamics of functional protein interaction networks.

V-106: Tracing functional protein interaction networks using a feature-aware phylogenetic profiling
COSI: General Comp Bio
  • Ingo Ebersberger, Goethe University Frankfurt, Germany
  • Julian Dosch, Goethe University Frankfurt, Germany
  • Hannah Muelbaier, Goethe University Frankfurt, Germany

Short Abstract: Introduction: Tracing the phylogenetic distribution of protein networks across hundreds to thousands of species calls for scalable and reliable methods for assessing evolutionary relationships and functional similarity of proteins. Standard ortholog search tools have running times that prohibit dynamic analyses. Moreover, the resulting presence-absence patterns of orthologs across taxa provide no information about their functional similarity or divergence. Methods and Results: For a seed protein, HaMStR-OneSeq integrates a targeted ortholog search with an assessment of the pairwise feature architecture similarity (FAS) between the seed and its orthologs. For the ortholog search, HaMStR-OneSeq applies a hidden Markov model-based approach, which scales linearly with the number of search taxa. For the FAS scoring, HaMStR-OneSeq considers identity, copy number, and positional similarity of shared features between two proteins, e.g. Pfam or SMART domains, transmembrane domains, or low-complexity regions. The feature-aware phylogenetic profiles of HaMStR-OneSeq can be visualized, interactively explored, and analyzed in PhyloProfile (https://github.com/BIONF/PhyloProfile). Discussion: HaMStR-OneSeq facilitates the dynamic generation of feature-aware phylogenetic profiles across large and customizable taxon collections. The simultaneous assessment of both the presence/absence of orthologs across species and the similarities/deviations in their feature architectures eases the tracing of proteins and their functions across species and through time.
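
A crude stand-in for architecture comparison, multiset overlap of feature annotations, illustrates why copy number matters alongside feature identity (this is not the actual FAS score, which additionally weights positional similarity):

```python
from collections import Counter

def feature_similarity(features_a, features_b):
    """Toy architecture similarity between two proteins given their
    feature lists (e.g. Pfam domain names): multiset intersection over
    multiset union. Sharing a feature raises the score, and mismatched
    copy numbers lower it, even when the feature types agree."""
    a, b = Counter(features_a), Counter(features_b)
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 1.0
```

For example, a protein with two SH2 domains scores below 1.0 against an ortholog carrying only one, which a plain set-based comparison would miss.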

V-107: Identification and Characterization of MicroRNAs Associated with Somatic Copy Number Alterations in Cancer
COSI: General Comp Bio
  • Jihee Soh, Gwangju Institute of Science and Technology, South Korea
  • Hyejin Cho, Gwangju Institute of Science and Technology, South Korea
  • Chan-Hun Choi, Dongshin University, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology (GIST), South Korea

Short Abstract: MicroRNAs (miRNAs) are key molecules that regulate biological processes such as cell proliferation, differentiation, and apoptosis in cancer. Somatic copy number alterations (SCNAs) are common genetic mutations that play essential roles in cancer development. Here, we investigated the association between miRNAs and SCNAs in cancer. We collected 2,538 tumor samples of seven cancer types (bladder urothelial carcinoma, breast invasive carcinoma, head and neck squamous cell carcinoma, kidney renal clear cell carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, and uterine corpus endometrial carcinoma) from The Cancer Genome Atlas, and first examined SCNA regions, which included 32-84% of miRNAs depending on the cancer type. A statistical approach allowed the identification of 80 SCNA-miRNAs whose expression was mainly associated with SCNAs in at least one cancer type, and 58 SCNA-miRNAs common to the seven cancer types (CC-SCNA-miRNAs). We demonstrated the relevance of SCNA-miRNAs in cancer by survival analysis and literature searches, and showed that CC-SCNA-miRNAs are more likely to regulate the expression of genes and proteins than other miRNAs. Furthermore, the oncogenic role of miR-589 was experimentally validated. In conclusion, our study suggests that SCNA-miRNAs significantly alter biological processes related to cancer development, confirming the importance of SCNAs in non-coding regions in cancer.

V-108: Integrated genome annotation to functional annotation pipeline for plants
COSI: General Comp Bio
  • Sathishkumar Natarajan, 3BIGS CO., LTD, South Korea
  • Hoyong Chung, 3BIGS CO., LTD, South Korea
  • Dawood Dedekule, 3BIGS CO., LTD, South Korea
  • Sridhar Srinivasan, 3BIGS CO., LTD, South Korea
  • Preethiba Gopi, 3BIGS CO., LTD, South Korea
  • Junhyung Park, 3BIGS CO., LTD, South Korea

Short Abstract: In recent years, sequencing platforms have seen tremendous upgrades for plant genomics research, and many tools and automated pipelines have been developed for specific applications. Here, we introduce a comprehensive genome annotation pipeline for plant species. The customized pipeline consists of three major parts: de novo assembly, structural annotation, and functional annotation. In detail, NGS platforms are supported for plant-specific sequencing, and de novo assembly is performed using OmicsBox to generate assembled FASTA files. The structural annotation workflow, adapted from MAKER-P, collects plant-specific supporting evidence, including repeats, EST evidence, homology evidence, and gene prediction methods, to generate consensus gene models. Post-validation criteria and manual curation are then applied to optimize the final gene models. Finally, functional annotation identifies functional descriptions by BLAST (SwissProt and NCBI NR), Gene Ontology (biological process, cellular component, and molecular function), pathways (KEGG, Reactome), domains (InterProScan), transmembrane prediction, and signal peptide prediction. This customized plant-specific pipeline will help plant researchers carry their projects from genome annotation through functional annotation. Furthermore, we will incorporate additional bioinformatics and visualization tools for easier data access.

V-109: A statistical method for the de novo inference of the regulators involved in genetic buffering
COSI: General Comp Bio
  • Jia-Hsin Huang, Institute of Information Science, Taiwan
  • Huai-Kuang Tsai, Institute of Information Science, Academia Sinica, Taiwan

Short Abstract: Genome-wide genetic perturbation experiments have considerably advanced our understanding of the function of the genome. Although they constitute an extraordinary resource for studying complex organismal traits and diseases, tools to better explore genetic perturbation data are rarely developed. In this study, we propose a statistical framework incorporating a mixture regression model, fitted with the expectation-maximization algorithm, for the de novo inference of trait-associated function from genetic perturbation experiments. We applied our method to identify 19 candidate regulators whose deletion produced expression changes significantly correlated with DNA replication timing in budding yeast. Among the 19 candidates, some have previously been reported to be involved in transcriptional buffering during DNA replication, while others are novel. We then selected four candidates, i.e. mrc1, elg1, rtt109, and ctf8, for transcriptome analysis during S phase as experimental validation. Notably, the transcriptome results for all four mutants show loss of transcriptional buffering during S phase, confirming the robustness of our method.

V-110: Prediction of survival and recurrence of pancreatic cancer by integrating multi-omics data
COSI: General Comp Bio
  • Bin Baek, GIST, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology (GIST), South Korea

Short Abstract: Although several studies have addressed predicting the prognosis of pancreatic cancer, methods for stratifying high-risk pancreatic adenocarcinoma (PAAD) are still lacking. We propose two factors, and computational approaches to derive them, for prediction. First, we infer the clonal expansion of DNA to identify tumor progression and find candidate genes typically mutated in the early stage of PAAD (i.e., with high cellular prevalence (CP) values). We found five candidate genes with high CP values among PAAD patients, and survival and recurrence differed significantly between patients with mutations in these candidate genes and the others. Second, we built autoencoder networks to reduce the dimension of multi-omics data from 134 PAAD patients, including RNA sequencing, microRNA sequencing, and DNA methylation data from The Cancer Genome Atlas. After obtaining features from the autoencoder, K-means clustering provided two subgroups of patients with significant survival and recurrence differences. We then built a prognosis prediction model using these two factors and clinical data. Among various prediction models, logistic regression showed the best performance (ACC = 0.754; AUC = 0.745). In conclusion, this study classifies patients with a high probability of recurrence and a high risk of PAAD.
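
The clustering step on autoencoder-derived features can be illustrated with a plain k-means implementation (a pure-Python sketch on toy latent vectors; the autoencoder itself is not reproduced here):

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means on latent feature vectors (lists of floats), a
    stand-in for the clustering applied to autoencoder features.
    Returns a cluster label per input point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        # Index of the center closest to p (squared Euclidean distance).
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(p, centers[c])))

    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return [nearest(p) for p in points]
```

In the study, the resulting subgroup label becomes one input, alongside the early-mutation factor and clinical data, to the logistic regression model.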

V-111: Genome-wide detection of structural variations in cancer
COSI: General Comp Bio
  • Hyunju Lee, Gwangju Institute of Science and Technology (GIST), South Korea
  • Yeonghun Lee, Gwangju Institute of Science and Technology, South Korea

Short Abstract: Structural variations (SVs) are main characteristics of cancers, generating aberrant karyotypes with highly rearranged chromosomes. In the era of high-throughput sequencing, several methods have been developed to detect SVs, but they were limited to detecting raw SV breakpoints in small genomic windows rather than genome-wide SVs. Here, we present an integrative framework for detecting genome-wide SVs that simultaneously analyzes SVs, cancer purity and ploidy, total CNAs, allele-specific CNAs, and haplotype information. Our framework initially constructs a breakpoint graph to represent cancer genomes through iterative optimization procedures, and extends it to an allele-specific graph and a haplotype graph. Based on the haplotype graph, we find candidate karyotypes, including cancer amplicon structures from homogeneously staining regions and double minutes. We validated our framework using simulated cancer data sets and HeLa cell line data. Our framework outperformed other independent variant detection tools on the simulated sets, and the reconstructed karyotypes of HeLa matched the spectral karyotyping. Our framework extends previous genomic views to genome-wide reconstruction, and will provide a basis for structure-based analyses in cancer development studies.

V-112: Hierarchical and programmable one-pot synthesis of oligosaccharides
COSI: General Comp Bio
  • Cheng-Wei Cheng, Genomics Research Center, Academia Sinica, Taiwan
  • Yixuan Zhou, Genomics Research Center, Academia Sinica, Taiwan
  • Wen-Harn Pan, Institute of Biomedical Sciences, Academia Sinica, Taiwan
  • Supriya Dey, The Scripps Research Institute, United States
  • Chung-Yi Wu, Genomics Research Center, Academia Sinica, Taiwan
  • Wen-Lian Hsu, Institute of Information Science, Academia Sinica, Taiwan
  • Chi-Huey Wong, Genomics Research Center, Academia Sinica, Taiwan

Short Abstract: The programmable one-pot oligosaccharide synthesis method was designed to enable the rapid synthesis of a large number of oligosaccharides, using the software Optimer to search for building blocks (BBLs) with defined relative reactivity values (RRVs) to be used sequentially in the one-pot reaction. However, the original library contained only about 50 BBLs with measured RRVs, and the method could only synthesize small oligosaccharides due to the RRV ordering requirement. Here, we expand the library to 154 validated BBLs and more than 50,000 virtual BBLs with RRVs predicted by machine learning. We also develop the software Auto-CHO, which accommodates the larger library and supports hierarchical one-pot synthesis using fragments generated by one-pot synthesis as BBLs. This advanced programmable one-pot method provides potential synthetic solutions for complex glycans. Four essential glycans, including Globo-H, SSEA-4, heparin pentasaccharide, and OligoLacNAc, were successfully synthesized in this work. This work has been published in Nature Communications 9 (2018): 5202. DOI: https://doi.org/10.1038/s41467-018-07618-8
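As an illustration, the RRV-ordering constraint at the heart of programmable one-pot synthesis can be sketched in a few lines: candidate BBL sequences are valid only if their reactivities strictly decrease along the one-pot order. The building-block names and RRVs below are made up for the example and are not taken from the actual Optimer/Auto-CHO libraries.

```python
from itertools import permutations

# Hypothetical building-block library with illustrative relative
# reactivity values (RRVs); real libraries hold measured or predicted values.
bbl_library = {
    "BBL-A": 7200.0,
    "BBL-B": 310.5,
    "BBL-C": 52.4,
    "BBL-D": 1.0,
}

def valid_one_pot_orders(library, length):
    """Enumerate BBL sequences whose RRVs strictly decrease,
    the ordering required for sequential one-pot reactions."""
    orders = []
    for combo in permutations(library, length):
        rrvs = [library[b] for b in combo]
        if all(a > b for a, b in zip(rrvs, rrvs[1:])):
            orders.append(combo)
    return orders

print(valid_one_pot_orders(bbl_library, 3))
```

With four blocks of distinct RRV, each 3-element subset admits exactly one decreasing order, so four candidate sequences are returned.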

V-113: Towards computational drug screening: profiling drug toxicity in the context of a biological network
COSI: General Comp Bio
  • Artem Lysenko, RIKEN, Japan
  • Keith Boroevich, RIKEN, Japan
  • Tatsuhiko Tsunoda, The University of Tokyo, Japan

Short Abstract: High levels of toxicity can lead to the failure of candidate drug compounds during clinical trials and to drug withdrawals from the market. In particular, cases of idiosyncratic toxicity, which occur unpredictably and infrequently in the general population, are very difficult to identify in smaller clinical trial cohorts. Therefore, to minimise drug development costs and improve safety, it is very important to identify potentially dangerous compounds at the earliest possible stage. To address this problem, we propose an original machine learning method that leverages the identity of drug targets and off-targets, Gene Ontology annotations, and biological networks to predict toxicity. Graph-based representations allow efficient mining of interactions in biological systems, but given their size and complexity, optimal ways of relating this information to clinically relevant phenotypes remain a topic of ongoing research. In this work we have explored one potential solution to this problem, which uses machine learning to enable network locations to be considered in the context of relevant clinical and pharmacological covariates. Using a dataset enriched for idiosyncratically toxic drugs, we demonstrate that the constructed model can successfully predict potentially dangerous compounds, and show that its interpretation suggests the existence of toxicity-associated “hot spots” in a protein association network.

V-114: Hands-On Training in Single Cell Data Analytics with scOrange
COSI: General Comp Bio
  • Janez Demsar, University of Ljubljana, Slovenia
  • Martin Stražar, University of Ljubljana, Slovenia
  • Lan Žagar, University of Ljubljana, Slovenia
  • Jaka Kokošar, University of Ljubljana, Slovenia
  • Vesna Tanko, University of Ljubljana, Slovenia
  • Aleš Erjavec, University of Ljubljana, Slovenia
  • Pavlin Poličar, University of Ljubljana, Slovenia
  • Anže Starič, University of Ljubljana, Slovenia
  • Gad Shaulsky, Baylor College of Medicine, United States
  • Menon Vilas, Howard Hughes Medical Institute, United States
  • Andrew Lamire, Howard Hughes Medical Institute, United States
  • Anup Parikh, Naringi Inc., United States
  • Blaž Zupan, University of Ljubljana, Slovenia

Short Abstract: While single-cell RNA sequencing (scRNA-seq) is increasingly accessible throughout the biomedical research community, the complexity of computational analysis may impose substantial barriers for biologists. Complementary to existing single-cell gene expression analysis environments in R or Python, we present single-cell Orange (scOrange), a workflow-based open-source tool with interactive visualizations. Its components include data normalization, filtering, batch effect correction, clustering and cluster characterization, and differential gene expression analysis. Users of scOrange connect computational components into flexible workflows, retaining control over data flow and over the order of applied methods and their parameters. scOrange supports interactive, real-time computation: any change in method parameters or data selection in the visualizations propagates immediately to the downstream components. Workflow design by visual programming also reduces the likelihood of errors. Workflows store parameters and any of the user's selections in the visualizations, supporting the sharing of results and guaranteeing reproducibility. Compared to scripting, scOrange's user-friendly interface offers a gentler learning curve. Its home page (https://singlecell.biolab.si) contains comprehensive learning material: protocol-oriented blogs, video tutorials, and example workflows. The simplicity of use and interactivity make scOrange a viable option for organizing educational tutorials and enable biologists to grasp the core principles underlying scRNA-seq data analysis methods.

V-115: VarFish – fishing for causative variants
COSI: General Comp Bio
  • Manuel Holtgrewe, Berlin Institute of Health (BIH), Core Unit Bioinformatics, Berlin, 10178 Germany, Germany
  • Oliver Stolpe, Berlin Institute of Health (BIH), Core Unit Bioinformatics, Berlin, 10178 Germany, Germany
  • Mikko Nieminen, Berlin Institute of Health (BIH), Core Unit Bioinformatics, Berlin, 10178 Germany, Germany
  • Dieter Beule, Berlin Institute of Health, Germany

Short Abstract: VarFish is an easy-to-use web-based database system designed to empower geneticists in the analysis of clinical and whole exome sequencing variant data sets for individuals and families. It provides a set of tools supporting the full workflow, from variant data quality control and variant filtration to efficient assessment of variants based on visual alignment inspection and annotation data such as functional and frequency annotation. The system allows users to organize data into a folder structure of projects with access control. Variant quality metrics can be displayed project-wise or for single cases/families. The variants themselves can be filtered based on genotype, population frequency, variant effect, quality metrics, and annotation. A special ClinVar-centric view allows for easy screening of variants based on pathogenicity annotation in ClinVar. After filtration, quick and efficient assessment of the variants is supported by various tools: color flags and comments allow note-taking, IGV can be remote-controlled to display a variant's locus, and important database excerpts are available directly within the system. Further, there are link-outs to various external databases and assessment tools. Filtered data sets can be downloaded as VCF or Excel files or submitted to external tools. VarFish is available under a permissive open source license.

V-116: Estimation of microbe viability from human oral to gut
COSI: General Comp Bio
  • Shion Hosoda, Waseda University, Japan
  • Michiaki Hamada, Waseda University, Japan

Short Abstract: Metagenomic analysis is revealing the relationships between humans and the human microbiome. The Human Microbiome Project (HMP) collected samples from various body sites of different individuals. One of the new findings of the HMP is a correlation between the human oral and gut microbiomes. Although this relationship is plausible given the path microbes travel, the two microbiomes do not match exactly. It is therefore expected that each microbe's viability along the path from the oral cavity to the gut differs. Estimating this viability helps elucidate the dynamics of microbes from the human oral cavity to the gut. However, capturing changes in microbe abundance between the oral and gut microbiomes from relative abundances is difficult. In this study, we constructed a hierarchical Bayesian model that defines the viability of each microbe and generates oral and gut taxonomic profiles. The model's parameters are the relative abundances of the oral and gut microbiomes and the viability. As a result, we found that Bacteroides, Faecalibacterium, and Parabacteroides have high viability. These genera were abundant in the human gut. In addition, Dialister has low viability despite being abundant in the human gut. These results are consistent with the taxonomic profiles.
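The intuition behind a per-taxon viability parameter (gut abundance ≈ oral abundance × survival through the digestive tract) can be illustrated with a crude point estimate. This sketch is not the authors' hierarchical Bayesian model, and the genus profiles below are made-up numbers for illustration only.

```python
def estimate_viability(oral, gut):
    """Crude per-taxon viability point estimate: the ratio of gut to
    oral relative abundance, normalized so the maximum is 1.0.
    A hierarchical Bayesian model would instead infer viability jointly
    with the latent profiles across many subjects."""
    ratios = {t: gut.get(t, 0.0) / oral[t] for t in oral if oral[t] > 0}
    top = max(ratios.values())
    return {t: r / top for t, r in ratios.items()}

# Made-up relative-abundance profiles (each sums to 1).
oral = {"Bacteroides": 0.10, "Dialister": 0.30, "Streptococcus": 0.60}
gut = {"Bacteroides": 0.70, "Dialister": 0.25, "Streptococcus": 0.05}

print(estimate_viability(oral, gut))
```

In this toy example Bacteroides gains relative abundance from mouth to gut and therefore receives the highest viability score, matching the qualitative pattern the abstract reports.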

V-117: FAIRDOM: FAIR Data Management for Life Science
COSI: General Comp Bio
  • Olga Krebs, Heidelberg Institute for Theoretical Studies, Germany
  • Finn Bacall, The University of Manchester, United Kingdom
  • Martin Golebiewski, HITS gGmbH, Germany
  • Stuart Owen, The University of Manchester, United Kingdom
  • Alan Williams, The University of Manchester, United Kingdom
  • Ulrike Wittig, HITS gGmbH, Germany
  • Katy Wolstencroft, Leiden University, Netherlands
  • Jacky Snoep, Stellenbosch University, South Africa
  • Wolfgang Müller, HITS gGmbH, Germany
  • Carole Goble, The University of Manchester, United Kingdom

Short Abstract: Systems biology research typically involves the integration and analysis of heterogeneous data types in order to model and predict biological processes. As a result, data management has become an important part of modern research, and software infrastructure is necessary to support the whole data life cycle, from data acquisition through analysis to data sharing. Here, we present the FAIRDOM project, which aims to establish an internationally sustained service enabling the systems biology community to produce Findable, Accessible, Interoperable and Reusable (FAIR) Data, Operating procedures and Models. The FAIRDOMHub is an internationally sustained resource for researchers to share and publish data, models, protocols and the relationships between them, together with services for curation, training and community knowledge exchange. The FAIR principles for systems biology are supported by the capabilities of the Hub, and in particular by our SEEK software that underpins it. By drawing together the multiple components of investigations, regardless of their physical location, we contextualize experiments and richly annotate and interlink their components. By using the FAIRDOM software as they run their projects, either through the Hub or through independent installations, researchers can prepare for reproducible publication and more effective exchange with collaborators.

V-118: Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq
COSI: General Comp Bio
  • Simona Cristea, Harvard University, United States

Short Abstract: Triple-negative breast cancer (TNBC) is an aggressive subtype characterized by extensive intratumoral heterogeneity. To investigate the underlying biology, we conducted single-cell RNA sequencing (scRNA-seq) of >1,500 cells from six primary TNBC tumors. Here, we show that intercellular heterogeneity of gene expression programs within each tumor is variable and largely correlates with the clonality of inferred genomic copy number changes, suggesting that genotype drives the gene expression phenotype of individual subpopulations. Clustering of gene expression profiles identified distinct subgroups of malignant cells shared by multiple tumors, including a single subpopulation associated with multiple signatures of treatment resistance and metastasis, and characterized functionally by activation of glycosphingolipid metabolism and associated innate immunity pathways. A novel signature defining this subpopulation predicts long-term outcomes for TNBC patients in a large cohort. Collectively, this analysis reveals the functional heterogeneity of TNBC and its association with genomic evolution, and uncovers unanticipated biological principles dictating poor outcomes in this disease.

V-119: Causes and Consequences of Breakdown of Robustness: Theoretical framework, computational model, and application to metastasis
COSI: General Comp Bio
  • Maryl Lambros, Albert Einstein College of Medicine, United States
  • Aviv Bergman, Albert Einstein College of Medicine, United States
  • Yehonatan Sella, Albert Einstein College of Medicine, United States

Short Abstract: In evolutionary biology, the well-established Waddingtonian concept of robustness roughly implies phenotypic insensitivity to environmental and genetic variation. During multicellular organism development, environmental cues dictate lineage; after development, however, environmental robustness and lineage commitment are observed. Is it then possible for a committed cell to sustainably switch phenotypes, as observed in metastatic cells, without needing to return to a stem-like state (a capacity we term phenotypic pliancy)? If so, what mechanisms allow this to occur? Mechanistically, many epigenetic mechanisms confer environmental robustness on cells after development by continually controlling the expression state of susceptible genes. Expanding a well-established evolutionary gene regulatory network model to explicitly incorporate environmental interactions and an epigenetic mechanism, we find that breakage of an epigenetic mechanism leads to phenotypic pliancy and decreased environmental robustness. To help validate our phenotypic pliancy hypothesis, we develop and apply statistical and analytical tools to single-cell RNA-sequencing data from metastatic head and neck cancer. We find that metastatic cells are enriched for epigenetic disruption, and that the level of this dysregulation correlates with the level of phenotypic pliancy. Our work demonstrates the importance of theoretical evolutionary biology concepts, such as the breakdown of robustness, and of incorporating them into fields like cancer biology.

COSI: General Comp Bio
  • Ahmet Sureyya Rifaioglu, Middle East Technical University, Turkey
  • Rengül Atalay, Middle East Technical University, Turkey
  • Gokhan Ozsari, Middle East Technical University, Turkey
  • Tunca Dogan, European Bioinformatics Institute, Turkey
  • Mehmet Volkan Atalay, Middle East Technical University, Turkey

Short Abstract: Several computational methods exist for the automated prediction of protein subcellular localization; however, there is still room for better performance. Here, we propose a multi-view SVM-based approach that provides predictions for human nucleus proteins. We represent each protein sequence by multi-view features, i.e., physicochemical properties, amino acid compositions, and homology-based features. Our classification model contains seven classifiers for each localization, where each classifier provides a probabilistic result. To develop a multi-view voting classifier, we employ a weighted classifier combination method that assigns different weights to classifiers depending on their discriminative strengths. We evaluated the described method on previously used datasets, as well as on our in-house dataset, called the Trust dataset. The Trust dataset was created using a novel subcellular localization hierarchy that merges the UniProt subcellular localization hierarchy and the GO Cellular Component hierarchy, applied only to manual experimental annotations in UniProtKB. We compared our results with five state-of-the-art methods: SubCon, LocTree2, CELLO2.5, SherLoc2, and MultiLoc2. Our approach outperformed the others with Matthews correlation coefficient (MCC) scores of 65%, 65%, and 62% on the Trust, Golden (the SubCon benchmark dataset), and Golden-Trust (refined Golden) datasets, respectively, where SubCon's MCC scores were 49%, 64%, and 52%.

V-121: maTE: Discovering Expressed MicroRNA - Target Interactions
COSI: General Comp Bio
  • Malik Yousef, Zefat College, Israel
  • Loai Abdallah, Department of Information Systems, The Max Stern Yezreel Valley College, Israel
  • Jens Allmer, Hochschule Ruhr West, University of Applied Sciences, Mülheim an der Ruhr, Germany, Germany

Short Abstract: We present maTE, a novel machine learning approach that integrates miRNA target genes with gene expression data. maTE depends on the availability of a sufficient number of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per-miRNA basis. A combined classifier is built from multiple miRNAs to improve separation. The aim of the study is to find the set of miRNAs whose regulation of their target genes best explains the difference between groups (e.g., cancer vs. control). maTE provides a list of significant groups of genes, where each group is targeted by a specific microRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. Notably, when the accuracy is much lower (e.g., ~50%), the provided set of miRNAs is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, it opens up new avenues for exploring miRNA regulation and may pave the way for the development of miRNA-based biomarkers and drugs.

V-122: Uncovering gene-specific branching dynamics with a multiple output branched Gaussian Process (mBGP)
COSI: General Comp Bio
  • Sumon Ahmed, The University of Manchester, United Kingdom
  • Alexis Boukouvalas, PROWLER.io, United Kingdom
  • Magnus Rattray, The University of Manchester, United Kingdom

Short Abstract: Identifying branching dynamics from high-throughput single-cell data can help uncover gene expression changes leading to cellular differentiation and fate determination. Boukouvalas et al. (Genome Biology 2018) developed a branched Gaussian Process (BGP) method that provides a posterior estimate of gene-specific branching time with associated credible regions. Inference in this model is performed independently per gene, resulting in two significant drawbacks: potentially inconsistent cell assignment and very high computational requirements. To address these issues, we propose a multiple output branching Gaussian Process (mBGP) model that performs inference jointly across all genes of interest and involves two main ideas: (1) We develop a joint model with a different branching time parameter for each output dimension (gene) where cell allocation is shared for all genes. (2) We develop a gradient-based approach to learn branching times. Using gradients removes the need for a grid search, which is impractical in the multiple-gene case since there is a combinatorial explosion of the number of branching time combinations. By applying our model on both synthetic and real single-cell RNA-seq data, we show that it can jointly estimate all branching times with significantly less computational time compared to the original BGP model, whilst also ensuring cell assignment consistency.

V-123: Pentachlorophenol affects the RIG-I antiviral pathway that produces type I interferon at the transcriptional level
COSI: General Comp Bio
  • Yayoi Natsume-Kitatani, National Institutes of Biomedical Innovation, Health and Nutrition, Japan
  • Kenji Mizuguchi, National Institutes of Biomedical Innovation, Health and Nutrition, Japan
  • Ken-Ichi Aisaki, National Institute of Health Sciences, Japan
  • Satoshi Kitajima, National Institute of Health Sciences, Japan
  • Samik Ghosh, The Systems Biology Institute, SBX Corporation, Japan
  • Hiroaki Kitano, The Systems Biology Institute, Okinawa Institute of Science and Technology Garuda School, Japan
  • Jun Kanno, Japan Bioassay Research Center, Japan

Short Abstract: Pentachlorophenol (PCP) is a pesticide that has been banned or strictly restricted from use because of its various functional symptoms in humans, including acute sweating, convulsions, and hyperthermia [1]. We attempted to examine the mechanism of these symptoms using a computational biology approach. The list of genes up- or down-regulated by oral administration of PCP in mice (0, 10, 30, or 100 mg/kg) after 2, 4, 8 or 24 hrs of treatment was obtained from [2]. Pathway enrichment analysis showed that genes in the RIG-I antiviral pathway, whose activation results in induction of type I interferon (IFN), and in the IFN α/β signaling pathway were up-regulated after 24 hr. We detected a gene cluster whose members are known to associate densely with each other at the protein level, and disease enrichment analysis showed that this cluster is associated with RNA virus infection, especially influenza A. Since influenza virus is reported to be recognized by RIG-I [3], our results imply that the acute toxicity of PCP is caused by a mechanism similar to the viral response via RIG-I. [1] Proudfoot AT, Toxicol Rev. 2003;22(1):3-11. [2] Kanno J. et al., J. Toxicol. Sci. 2013;38(4):643-654. [3] Kato H. et al., Nature. 2006;441:101–5.

V-124: dv-trio: a trio variant calling pipeline using DeepVariant with Mendelian error correction
COSI: General Comp Bio
  • Eddie Ip, Victor Chang Cardiac Research Institute, Australia
  • Clinton Hadinata, Victor Chang Cardiac Research Institute, Australia
  • Joshua Ho, School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong
  • Eleni Giannoulatou, Victor Chang Cardiac Research Institute, Australia

Short Abstract: In 2018, Google published an innovative variant caller, DeepVariant, which converts pileups of sequence reads into images and uses a deep neural network to identify single nucleotide variants and small insertions/deletions from whole genome sequencing data. Its classification of true genetic variants from false positives demonstrated greater accuracy than other contemporary variant callers. However, DeepVariant was designed to call variants for a single sample. In the study of diseases, the ability to examine a family trio (father-mother-affected child) provides greater power for discovery. To utilise DeepVariant's accuracy in this setting, we have developed a trio variant calling pipeline called “dv-trio”, which combines DeepVariant's individual variant calling with the Genome Analysis Toolkit's co-calling ability to create a trio-based VCF. dv-trio also applies Mendelian error correction based on family pedigree using a Bayesian network algorithm via FamSeq. Using the Genome in a Bottle Consortium's Ashkenazim trio, we demonstrated that the Mendelian error rate of a dv-trio VCF was reduced by 68% compared to a trio VCF created by merging individual DeepVariant VCFs, and by 20% compared to a GATK co-called trio VCF. dv-trio provides a simple pipeline that improves trio variant calling by harnessing the accuracy of DeepVariant, with the additional advantage of Mendelian error correction.
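As an illustration, the Mendelian-consistency idea underlying this kind of trio correction can be sketched as a simple genotype check: a child's diploid genotype is consistent if one allele can come from each parent. This toy check is not the FamSeq Bayesian network the pipeline actually uses, and the genotype calls below are invented.

```python
def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking
    one allele from the father and one from the mother.
    Genotypes are unordered allele pairs, e.g. ("0", "1")."""
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)

def count_mendelian_errors(trio_calls):
    """Count sites where the trio genotypes violate Mendelian inheritance."""
    return sum(
        not is_mendelian_consistent(f, m, c) for f, m, c in trio_calls
    )

# Toy genotype calls at three sites: (father, mother, child).
calls = [
    (("0", "0"), ("0", "1"), ("0", "0")),  # consistent
    (("0", "0"), ("0", "0"), ("0", "1")),  # error: allele "1" from neither parent
    (("1", "1"), ("0", "0"), ("0", "1")),  # consistent
]
print(count_mendelian_errors(calls))  # one violating site
```

A pedigree-aware caller goes further than flagging such sites: it re-weights the genotype likelihoods at each site so that the most probable Mendelian-consistent trio genotype is reported.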

V-125: HyperMut: a method to detect localized hypermutation with stringent control for confounders
COSI: General Comp Bio
  • David Mas-Ponte, Institute for Research in Biomedicine (IRB Barcelona), Spain
  • Fran Supek, Institute for Research in Biomedicine (IRB Barcelona), Spain

Short Abstract: The study of human cancer genomes has revealed unexpected patterns of mutagenesis. Regional variability in mutation rate has been well explored at coarse resolution (domain and gene level) but less so at finer scales. Sub-gene-resolution mutation patterns are, however, pervasive in many cancer types. A common underlying mechanism involves the APOBEC3 family of cytidine deaminases, which generate high-density mutation clusters (kataegis). We developed HyperMut, a novel statistical method to detect localized hypermutation events. Our algorithm stringently controls for regional mutation rate variability and oligonucleotide mutation spectra by generating a randomized version of each genome and comparing its inter-mutation distance distribution to the observed values, in order to estimate a local false discovery rate. This enabled us to detect prevalent mutation clusters across almost all human tumors that were inaccessible to existing methods due to high Type I and Type II error rates in tumors with a higher mutation burden. We used the resulting clustered mutations to obtain pentanucleotide signatures and to associate them with transcription profiles of DNA replication and repair genes. We were able to precisely quantify the mutagenic preferences of the APOBEC3A and APOBEC3B enzymes across cancer (sub)types, and we detected novel associations with the expression of TLS polymerases and replication stress-associated genes.
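As an illustration, the core comparison described above, observed inter-mutation distances versus those of a position-randomized genome, can be sketched with a uniform shuffle. The real method additionally preserves regional mutation rates and oligonucleotide spectra in its randomization; the positions and cutoff below are invented toy values.

```python
import random

def intermutation_distances(positions):
    """Distances between consecutive sorted mutation positions."""
    pos = sorted(positions)
    return [b - a for a, b in zip(pos, pos[1:])]

def clustered_fraction(positions, genome_length, cutoff=1000,
                       n_random=200, seed=0):
    """Fraction of observed inter-mutation distances below `cutoff`,
    and the same fraction averaged over uniform random placements.
    A large excess in the observed fraction hints at localized
    hypermutation (e.g. kataegis)."""
    rng = random.Random(seed)
    n = len(positions)

    def frac_close(pos):
        d = intermutation_distances(pos)
        return sum(x < cutoff for x in d) / len(d)

    observed = frac_close(positions)
    expected = sum(
        frac_close(rng.sample(range(genome_length), n))
        for _ in range(n_random)
    ) / n_random
    return observed, expected

# Toy data: one tight cluster plus scattered background mutations.
muts = [100, 250, 400, 550, 700] + [10_000_000, 25_000_000, 60_000_000]
obs, exp = clustered_fraction(muts, genome_length=100_000_000)
print(obs, exp)
```

Under a uniform model, eight mutations in a 100 Mb genome almost never land within 1 kb of each other, so the observed close-pair fraction far exceeds the randomized expectation, flagging the toy cluster.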

V-126: HyperCell: outlying, mislabelled and diverged cancer cell lines revealed by a multi-omics approach
COSI: General Comp Bio
  • Fran Supek, Institute for Research in Biomedicine (IRB Barcelona), Spain
  • Marina Salvadores, Institute for Research in Biomedicine (IRB Barcelona), Spain

Short Abstract: Cell lines are commonly used as cancer models because they carry the genomic alterations that arose in the tumors they derive from. Despite their wide use, two major problems are associated with them: cell line mislabelling/contamination and the emergence of genomic, transcriptomic and epigenomic alterations during cell culture. To account for this, we aligned the mRNA and methylation data between tumors and cell lines using batch-effect correction methods, such that the two become indistinguishable. Once the data were made comparable, we correlated the mRNA and methylation data from the cell lines with human tumor samples from TCGA in order to identify possibly contaminated or mislabelled cell lines. Overall, using mRNA and methylation independently, we identified some cell lines that significantly match an incorrect but related cancer type and, surprisingly, a few cell lines that match a very distant cell type. Additionally, we identified some cell lines whose transcriptomic and epigenomic profiles diverged very substantially from any cancer type examined and that should therefore be used with great caution. We suggest that using the corrected cell-type labels and excluding outlier cell lines will boost the accuracy of analyses of drug screening and CRISPR genetic screening data.

V-127: Tamock – Simulation of habitat-specific benchmark data in metagenomics
COSI: General Comp Bio
  • Samuel Gerner, University of Vienna, Austria
  • Alexandra Graf, FH Campus Wien, Austria
  • Thomas Rattei, Universität Wien, Germany

Short Abstract: Background: Simulated metagenomic reads are widely used to benchmark software and workflows for metagenome interpretation. Ideally, the simulation is based on genomes that resemble a realistic microbial community; the scope and power of metagenomic benchmarks therefore depend on the selection of their underlying communities. Purely simulated data, however, cannot fully represent biological conditions, yet for optimal software and parameter selection, benchmark data should be tailored to the study in question. Methods and Results: We developed Tamock to simulate metagenomic reads according to a microbial community derived from real metagenomic data. Tamock simulations thus enable the assessment of computational methods, workflows and parameters specifically for a microbial habitat. Tamock automatically determines taxonomic profiles from shotgun metagenomic data, selects reference genomes accordingly and uses them to simulate metagenomic reads. Tamock simulations are not based on artificial design and its associated biases but provide tailored benchmark data that reflect the original sample parameters as closely as possible. Conclusion: Tamock is a user-friendly command-line application enabling the fully automated creation of benchmark data derived from real metagenomic data. Extensive supplementary information is provided along with the simulation. Availability: Tamock is available at https://github.com/gerners/tamock

V-128: PathRacer: racing profile HMM paths on assembly graph
COSI: General Comp Bio
  • Anton Korobeynikov, Saint Petersburg State University, Russia
  • Alexander Shlemov, Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Russia

Short Abstract: Large databases of profile Hidden Markov Models (pHMMs) have recently emerged. These pHMMs may represent, for example, the sequences of antibiotic resistance genes or allelic variation among highly conserved housekeeping genes used for strain typing. The typical application of such a database involves aligning contigs to a pHMM in the hope that the sequence of the gene of interest is located within a single contig. This condition is often violated for metagenomes, preventing the effective use of such databases. We present PathRacer, a novel standalone tool that aligns a profile HMM directly to the assembly graph (performing codon translation on the fly for amino acid pHMMs). The tool provides the set of most probable paths traversed by the HMM through the whole assembly graph, regardless of whether the sequence of interest is encoded on a single contig or scattered across a set of edges, thereby significantly improving the recovery of sequences of interest even from fragmented metagenome assemblies. Compared to analogous tools (Xander and MegaGTA), PathRacer can perform partial gene search as well as search for pseudogenes and gene sequences with frameshifts. This makes PathRacer suitable for annotating unpolished long-read assemblies as well.

V-129: Semantic annotation of scientific media using controlled vocabularies with iCLiKVAL
COSI: General Comp Bio
  • Naveen Kumar, RIKEN Center for Integrative Medical Sciences, Japan
  • Todd Taylor, RIKEN Center for Integrative Medical Sciences, Japan

Short Abstract: iCLiKVAL is a web-based application that provides a platform for crowdsourcing semantic annotations of scientific media found online in the form of text, images, audio, video, and datasets. Each annotation is structured as an (entity, key, value) tuple, where the key is an IRI (Internationalized Resource Identifier) and the value can be a literal or another IRI. The media item is treated as an entity and is identified by an IRI; an annotation attaches information to the media (entity) by assigning a value via an attribute (key). The user interface provides ontology lookup across various known controlled vocabularies to help users enter annotations easily and accurately. In addition, users can upload their own vocabularies and use them to create annotations. The annotations are saved in the database similarly to the "triples" used in the RDF (Resource Description Framework) data model. The idea behind iCLiKVAL is to assign semantic annotations for various concepts related to scientific media, identifying and marking occurrences of ontological entities and relationships, thereby effectively linking all online scientific media through these informative user-curated annotations. This helps computers index and interpret information and allows for sophisticated data searches and knowledge discovery.
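As an illustration, the (entity, key, value) annotation model described above, which mirrors RDF triples, can be sketched as a minimal in-memory store. The IRIs below are invented example identifiers, not actual iCLiKVAL keys or vocabularies.

```python
# Minimal triple store mirroring the (entity, key, value) annotation
# model; values may be literals or IRIs, as in RDF.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def annotate(self, entity_iri, key_iri, value):
        """Attach information to an entity via an attribute (key)."""
        self.triples.add((entity_iri, key_iri, value))

    def values(self, entity_iri, key_iri):
        """All values recorded for a given entity and key."""
        return {v for e, k, v in self.triples
                if e == entity_iri and k == key_iri}

store = TripleStore()
# Hypothetical media IRI and key IRI for the example.
paper = "https://doi.org/10.1000/example"
subject_key = "http://example.org/vocab/subject"
store.annotate(paper, subject_key, "metagenomics")           # literal value
store.annotate(paper, subject_key,
               "http://example.org/ontology/Metagenomics")   # IRI value
print(sorted(store.values(paper, subject_key)))
```

Because both the literal and the IRI value attach to the same (entity, key) pair, a query for that pair returns both annotations, just as multiple RDF triples can share a subject and predicate.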

V-130: HaDeX - analysis of data from hydrogen-deuterium exchange mass spectrometry experiments
COSI: General Comp Bio
  • Weronika Puchała, Institute of Biochemistry and Biophysics, Polish Academy of Science, Poland
  • Michał Kistowski, Institute of Biochemistry and Biophysics, Polish Academy of Science, Poland
  • Katarzyna A. Dąbrowska, Institute of Biochemistry and Biophysics, Polish Academy of Science, Poland
  • Aleksandra E. Badaczewska-Dawid, Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Warsaw, Poland, Poland
  • Dominik Cysewski, Institute of Biochemistry and Biophysics, Polish Academy of Science, Poland
  • Michał Dadlez, Institute of Biochemistry and Biophysics, Polish Academy of Science, Poland
  • Michal Burdukiewicz, University of Wrocław, Poland

Short Abstract: Hydrogen-deuterium exchange mass spectrometry (HDX-MS) is a staple tool for monitoring the dynamics and interactions of proteins. Due to the sheer size of HDX-MS results, data analysis requires a dedicated software suite. However, the majority of existing tools do not cover a complete analytic workflow. We propose HaDeX, a novel tool for the processing, analysis, and visualization of HDX-MS experiments. HaDeX supports the whole analytic process, including preliminary data exploration, quality control, and generation of personalized publication-quality figures. It is the only tool to support multiple-state comparisons and exact uncertainty estimation. The reproducibility of the whole procedure is ensured with advanced reporting functions. HaDeX is available primarily as a web server (http://mslab-ibb.pl/shiny/HaDeX/), but all of its functionality is also accessible as an R package (https://CRAN.R-project.org/package=HaDeX).

V-131: HyperMatch: a framework for detecting differential selection in human somatic cells
COSI: General Comp Bio
  • Fran Supek, Institute for Research in Biomedicine (IRB Barcelona), Spain
  • Elizaveta Besedina, Institute for Research in Biomedicine (IRB Barcelona), Spain

Short Abstract: Carcinogenesis is a process of evolution of somatic cells. Determining which genes are under positive or negative selection can explain mechanisms of tumor formation and provide new therapeutic targets. Detection of selection in somatic mutation data is a challenging task due to the heterogeneity of mutational processes across the genome and between cell types. A number of methods have been proposed to study selection in somatic cells. However, no general framework exists to detect tissue-specific selection while stringently controlling for background mutation rate variability, which also changes across tissues. Here we propose HyperMatch, an approach to systematically detect condition-specific (differential) selection in the soma, and apply it to detect the tissue specificity of essential genes. The algorithm compares the mutation rate in a gene of interest with a baseline that can be estimated in different ways, for instance from neighboring genes. In its general form, the HyperMatch framework can test for different types of selection signals and control for background genomic alterations and their interactions. The algorithm performed well on a literature-derived set of tissue-specific oncogenes, with AUC>0.9. Condition-specific essential genes and non-trivial combinations of genomic alterations leading to synthetic lethality identified by HyperMatch may be useful for suggesting targeted cancer therapies.
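One simple way such a gene-versus-baseline comparison could work, sketched in Python as a binomial test; this is an illustration of the general idea, not the HyperMatch model itself, and the counts are made up:

```python
# Illustrative sketch (not the HyperMatch model): compare a gene's mutation
# count against a baseline per-bp rate estimated from neighboring genes.
from scipy import stats

def selection_pvalue(gene_mutations, gene_bp, neighbor_mutations, neighbor_bp):
    """Two-sided test: does the gene's per-bp mutation rate deviate from
    the neighborhood baseline (suggesting positive or negative selection)?"""
    baseline_rate = neighbor_mutations / neighbor_bp
    return stats.binomtest(gene_mutations, gene_bp, p=baseline_rate).pvalue

# A gene with far more mutations than its neighborhood baseline predicts:
p_pos = selection_pvalue(gene_mutations=40, gene_bp=1500,
                         neighbor_mutations=100, neighbor_bp=50000)
# A gene mutated exactly at the baseline rate:
p_neutral = selection_pvalue(3, 1500, 100, 50000)
```

A real framework additionally has to model tissue-specific mutation rate variability, which is exactly what makes the problem hard.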

V-132: Integration and evaluation of scRNAseq-, bulk RNAseq-, and microarray-based gene coexpression data
COSI: General Comp Bio
  • Yuichi Aoki, Tohoku University, Japan
  • Kengo Kinoshita, Tohoku University, Japan
  • Takeshi Obayashi, Graduate School of Info. Sci., Tohoku University, Japan

Short Abstract: Gene coexpression is the relationship between genes that show similar expression profiles across a variety of cellular conditions. Based on the guilt-by-association principle, gene coexpression information is widely used to identify functionally associated gene pairs. The quality of gene coexpression data, which can be quantified as the power to discriminate whether or not a pair of coexpressed genes shares the same cellular function, is the key factor for a gene coexpression database. Because gene coexpression information is a summarization of gene expression data, the quality of coexpression data depends primarily on the quality and quantity of the underlying transcriptome data, in addition to the methodology. Different transcriptomic technologies produce noise with different properties, so appropriate data treatment is necessary to make full use of them when deducing high-quality coexpression data. Recently, single-cell RNAseq technologies have been applied to various tissues. In this study, we first prepared gene coexpression data from scRNAseq data, then compared them with the bulk RNAseq-based and microarray-based coexpression data in ATTED-II (atted.jp) and COXPRESdb (coxpresdb.jp). Finally, we integrated the coexpression data from the different technologies to obtain representative coexpression data with high coverage and high quality.
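The basic computation behind any coexpression resource can be sketched as ranking gene pairs by expression-profile similarity. The toy example below uses Pearson correlation on simulated data; it is illustrative only and not the ATTED-II/COXPRESdb pipeline:

```python
# Sketch: ranking gene pairs by coexpression (Pearson correlation)
# computed from a genes x samples expression matrix; simulated data.
import numpy as np

def coexpression_ranks(expr, gene_names):
    """Return gene pairs sorted by absolute Pearson correlation."""
    corr = np.corrcoef(expr)            # gene-by-gene correlation matrix
    pairs = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((gene_names[i], gene_names[j], corr[i, j]))
    return sorted(pairs, key=lambda p: -abs(p[2]))

rng = np.random.default_rng(0)
base = rng.normal(size=20)
expr = np.vstack([base + rng.normal(scale=0.1, size=20),  # geneA tracks base
                  base + rng.normal(scale=0.1, size=20),  # geneB tracks base
                  rng.normal(size=20)])                   # geneC independent
top = coexpression_ranks(expr, ["geneA", "geneB", "geneC"])[0]
```

Real databases additionally normalize per technology and summarize over many experiments, which is where the data-treatment issues discussed above come in.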

V-133: Characterizing human iPSC-derived microglia by single-cell and bulk RNA sequencing
COSI: General Comp Bio
  • Maria Zavodszky, Biogen, United States
  • Tom Lanz, Biogen, United States
  • Mark Sheehan, Biogen, United States
  • Hui-Hsin Tsai, Biogen, United States
  • Qiurong Xiao, Biogen, United States
  • Mehool Patel, Biogen, United States
  • Ravi Challa, Biogen, United States
  • Chao Sun, Biogen, United States
  • Chris Roberts, Biogen, United States

Short Abstract: Induced pluripotent stem cells (iPSC) are a valuable model system for studying human cell biology. We employed a set of computational approaches, including clustering, deconvolution, and differential gene expression, to benchmark data obtained by single-cell and bulk RNA sequencing of iPSC-derived cells against native cells taken directly from the brain. The resulting characterization of iPSC-derived microglia allowed us to optimize protocol and assay development. Based on their bulk expression profiles, in-house iPSC-derived microglia resembled published patient-derived microglia cultured after extraction (in vitro) more than freshly isolated microglia (ex vivo). Single-cell sequencing data showed, however, that these iPSC-derived cells were quite heterogeneous. In an attempt to mimic their native environment, iPSC-derived microglia were co-cultured with iPSC-derived neurons. This generated a distinct cluster of microglia in which inflammatory and cell cycle genes were differentially expressed compared to microglia cultured without neurons. Many of the same genes were also differentially expressed in ex vivo compared to in vitro microglia. These results emphasize the need for neuronal context in the cellular microenvironment. Our work demonstrates that combining multiple computational methods with wet lab biology improved the development of a translatable iPSC-derived model of human microglia to advance drug discovery efforts related to neuroinflammatory targets.

V-134: solida.core: download, deploy and use an open-source and ready-to-use set of bioinformatic analysis pipelines
COSI: General Comp Bio
  • Gianmauro Cuccuru, Albert Ludwigs University, Freiburg, Germany, Germany
  • Matteo Massidda, CRS4, Biosciences Sector, Italy
  • Rossano Atzeni, CRS4, Biosciences Sector, Italy
  • Paolo Uva, CRS4, Biosciences Sector, Italy
  • Giorgio Fotia, CRS4, Biosciences Sector, Italy

Short Abstract: solida-core (https://github.com/solida-core) is a robust, ready-to-use collection of extensively validated Snakemake-based bioinformatic pipelines that ensure both reproducibility and portability between different computing environments. The solida-core pipelines are built following the GATK Best Practices for DNA and RNA sequencing analysis. Further improvements and refinements are incorporated after testing in various research sequencing projects at the CRS4 Next Generation Sequencing Core Facility (http://next.crs4.it), one of the largest sequencing facilities in Italy. For example, the Exome Sequencing Data Analysis Pipeline (DiVA, https://github.com/solida-core/diva) was thoroughly tested and validated on ~1000 samples from >20 projects. The solida-core open-source curated collection is publicly released with SOLIDA (https://github.com/solida-core/solida), a pipeline manager developed at CRS4 that guides the user through pipeline configuration and deployment. Together, these resources form a complete, easy-to-use bioinformatic analysis framework usable both by researchers encountering bioinformatics for the first time and by experienced bioinformaticians. Moreover, since the resource is public, other users can contribute to the development of new pipelines or to the improvement of existing ones.

V-135: PhenDB: Deciphering the microbiome. Large-scale prediction of microbial roles and traits
COSI: General Comp Bio
  • Javier Geijo, Universität Wien, Austria
  • Patrick Hyden, Universität Wien, Austria
  • Thomas Rattei, Universität Wien, Germany
  • Lukas Lüftinger, Universität Wien, Austria

Short Abstract: Background: With increasing sequencing capacities and rapidly improving computational tools, more and more near-complete genomes are sequenced and binned from whole-metagenome shotgun sequence data of microbial communities. The rapid, automatic annotation and comparative analysis of large numbers of metagenomic bins requires novel bioinformatics tools specifically adapted to these problems. Methods and Results: We developed PhenDB, a freely available resource for analysing entire collections of metagenomic bins for microbial traits. PhenDB provides a first taxonomic and functional overview of a bin collection, simplifying the identification of interesting metagenomic bins for follow-up analysis. Training and prediction are performed by PICA, which uses a support vector machine (SVM) algorithm. Samples are represented as binary vectors of protein family presence/absence, based on EggNOG 4.5. Conclusion: PhenDB provides a user-friendly interface with several browsable tables that simplify the identification of interesting findings. The models used in PhenDB are continuously improved with training data gathered by text mining. We believe that this resource will be useful to a broad range of scientists as a first-line, easy-to-use analysis tool. Availability: https://phendb.csb.univie.ac.at/
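The binary presence/absence encoding mentioned above can be sketched in a few lines of Python; the protein family IDs here are made-up placeholders, not actual EggNOG identifiers:

```python
# Sketch: encoding a genome/bin as a binary vector of protein family
# presence/absence, the feature representation fed to the SVM.
# Family IDs are placeholders for illustration.
FAMILY_UNIVERSE = ["FAM0001", "FAM0002", "FAM0003", "FAM0004", "FAM0005"]

def to_binary_vector(families_present, universe=FAMILY_UNIVERSE):
    """One fixed-length 0/1 feature vector per genome bin."""
    present = set(families_present)
    return [1 if fam in present else 0 for fam in universe]

bin_vector = to_binary_vector({"FAM0001", "FAM0004"})
```

Every bin is thus mapped onto the same fixed-length feature space, which is what allows a single trained model to score arbitrary new bins.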

V-136: Multi-cohort study identifies a leukocyte shift associated with smoking
COSI: General Comp Bio
  • Giulia Piaggeschi, University of Turin/ Italian Institute for Genomic Medicine (IIGM), Italy
  • Chiara Catalano, University of Turin, Italy
  • Laura Conti, University of Turin, Italy
  • Sonia Tarallo, Italian Institute for Genomic Medicine (IIGM), Italy
  • Valentina Panero, Italian Institute for Genomic Medicine (IIGM), Italy
  • Alessia Visconti, King's College London, United Kingdom
  • Mario Falchi, King's College London, United Kingdom
  • Paolo Vineis, Imperial College London/Italian Institute for Genomic Medicine (IIGM), United Kingdom
  • Silvia Polidoro, Italian Institute for Genomic Medicine (IIGM), Italy
  • Francesca Cordero, University of Turin, Italy

Short Abstract: Cigarette smoking is a major risk factor for human health. Previous studies showed that it affects leucocyte cell counts; however, its effects on the main leucocyte sub-populations remain unclear. In 300 healthy volunteers aged 35 to 70 years, we evaluated the association between self-reported smoking habits (current smokers, former smokers, and never smokers) and the cell-count distributions of nine leucocyte subpopulations (namely: CD4+ T-helper, CD8+ T-cytotoxic, CD16/CD56+ NK cells, CD3+ T cells, CD56/CD3+ NKT cells, CD19+ B cells, CD14+ monocytes, neutrophils, and eosinophils), as well as their GPR15 cell receptor as a smoking marker, quantified by flow cytometry. Indeed, previous studies suggest that GPR15+ counts in CD3+ T cells are significantly higher in current smokers. Association tests between cell-type and GPR15 cell receptor counts were carried out using linear models, as implemented in the R statistical software, with age and sex as covariates. Current smokers showed a significantly lower NK cell count, an increase of GPR15+ cells in both T cells (CD3+, CD4+ and CD8+) and B cells, and a decrease of GPR15+ cells in monocytes (P<0.05/18=2.8x10-3), even though the cohort included only light smokers (<10 cigarettes/day). The results are being validated in two independent cohorts.
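The association test described above (the study used linear models in R) can be sketched in Python on simulated data; effect sizes and counts below are invented for illustration, and the Bonferroni threshold 0.05/18 matches the abstract:

```python
# Sketch of the association test: a linear model of cell count on smoking
# status with age and sex as covariates, on simulated data.
import numpy as np
from scipy import stats

def ols_pvalue(y, X):
    """Two-sided t-test p-value for the last column of design matrix X."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
    return 2 * stats.t.sf(abs(beta[-1] / se[-1]), df)

rng = np.random.default_rng(1)
n = 300                                   # cohort size, as in the abstract
age = rng.uniform(35, 70, n)
sex = rng.integers(0, 2, n)
smoker = rng.integers(0, 2, n)
# Simulated NK counts: lower in smokers (invented effect size).
nk_count = 200 - 30 * smoker + 0.5 * age + rng.normal(0, 20, n)
X = np.column_stack([np.ones(n), age, sex, smoker])
p = ols_pvalue(nk_count, X)
bonferroni = 0.05 / 18                    # 18 tests, threshold 2.8e-3
```

Including age and sex as design-matrix columns adjusts the smoking coefficient for those covariates, mirroring the R model.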

V-137: dropSeqPipe - A SingleCell RNASeq pre-processing snakemake workflow
COSI: General Comp Bio
  • Patrick Roelli, TUM, Lehrstuhl für Tierphysiologie und Immunologie, Germany
  • Kristiyan Kanev, TUM, Lehrstuhl für Tierphysiologie und Immunologie, Germany
  • Ming Wu, TUM, Lehrstuhl für Tierphysiologie und Immunologie, Germany
  • Dietmar Zehn, TUM, Lehrstuhl für Tierphysiologie und Immunologie, Germany

Short Abstract: In recent years, the increasing demand for single-cell methods to answer complex biological questions has driven scientific communities and companies to create numerous single-cell RNA-sequencing platforms. Choosing the best approach for a particular experiment can be tricky. Although commercial protocols are easy to use and provide software support for their data, they can be hard to customize and adapt to specific needs. Custom protocols, on the other hand, have to be tested and optimized in-house, and the software support might have to be developed as well. As we developed our own protocol, we needed as much feedback as possible on our library preparation to improve and adapt our methodology. To this end, we developed dropSeqPipe, a workflow specifically designed to provide relevant feedback about single-cell RNA-seq data. Based on Snakemake, it focuses on reproducibility, ease of use, and flexibility. It is specifically tailored to polyA-capturing protocols and is compatible with droplet-based protocols such as Drop-seq and 10x as well as well-plate-based protocols such as SCRB-seq. dropSeqPipe is composed of three main steps: trimming, mapping, and demultiplexing of UMIs and cell barcodes. It automatically provides reports and plots to better understand the underlying data.

V-138: Dasa: a computational pipeline for differential ATAC-Seq analysis
COSI: General Comp Bio
  • Alberto Riva, Bioinformatics Core, ICBR, University of Florida, United States

Short Abstract: ATAC-Seq (Assay for Transposase-Accessible Chromatin with high-throughput Sequencing) is a powerful method to study genome-wide chromatin accessibility. Basic computational analysis of ATAC-Seq data is similar to ChIP-seq analysis: reads are aligned to the genome, and peaks in the alignment pileup indicate accessible regions. Differential ATAC-Seq analysis aims at identifying chromatin accessibility differences between two conditions. This allows detection of altered chromatin modification signatures, or the effect of mutations that disrupt chromatin structure, which in turn can have profound downstream effects on gene regulation. Identification of significantly different peaks between conditions is a challenging problem. Peak positions must be inferred from the data, and they may not be the same across samples (or even biological replicates). The number of reads in each peak is affected by total library size and by the total size of the accessible regions, requiring careful normalization, and there is no consensus on how to assess significance of between-sample differences. Dasa is a complete pipeline for differential ATAC-Seq analysis. Starting from alignment pileup and called peaks, Dasa identifies and quantifies common and unique peaks, finds peaks that are significantly different, associates differential peaks with nearby genes, and automatically generates a complete report including genome browser tracks.

V-139: Interactive Quality Control and Analysis for Targeted Metabolomics Kits
COSI: General Comp Bio
  • Eric Blanc, Berlin Institute of Health, Germany
  • Dieter Beule, Berlin Institute of Health, Germany
  • Mathias Kuhring, Berlin Institute of Health, Germany
  • Yoann Gloaguen, Berlin Institute of Health, Germany
  • Alina Eisenberger, Berlin Institute of Health, Germany
  • Raphaela Fritsche, Berlin Institute of Health, Germany
  • Jennifer Kirwan, Berlin Institute of Health, Germany

Short Abstract: Targeted mass spectrometry profiling methods, optimized and validated for defined metabolites, enable comprehensive routine metabolomics applications. However, routine scientific applications rely not only on established or standardized measurement methods, such as commercial targeted metabolomics kits; the computational processing, including quality control and first-level analysis, should also be standardized and automated in a flexible, accessible, time-efficient, and reproducible manner. Here, we present an interactive web application for the initial analysis of targeted metabolomics data, with a focus on Biocrates kits. Built with R Shiny, the app provides interactive, visual access to quality controls (for instance, measured and missing values, positional irregularities, variability, and reproducibility) as well as to established univariate (e.g. adjusted normality, t- and correlation tests) and multivariate analysis methods (for instance hierarchical clustering, PCA, and PLS-DA). Overall, the app helps verify data consistency and quality and provides initial insights into research questions in an accessible and automated fashion, thereby standardizing and accelerating routine applications in targeted metabolomics. The app supports the latest Biocrates kits, with possible future extension to other kits and to generic targeted metabolomics data. It is made available as an easy-to-install package under a permissive open-source license.

V-140: Methylation analysis of combined drug application in leukemia cell lines
COSI: General Comp Bio
  • Yvonne Saara Gladbach, University Medical Center Rostock, Germany
  • Anna Richter, University Medical Center Rostock, Germany
  • Catrin Roolf, University Medical Center Rostock, Germany
  • Hugo Murua Escobar, University Medical Center Rostock, Germany
  • Christian Junghanss, University Medical Center Rostock, Germany
  • Mohamed Hamed Fahmy, University Medical Center Rostock, Germany

Short Abstract: Background: DNA methylation plays essential roles in diverse biological processes, yet its full regulatory code remains incompletely understood, in part because any methylome measurement captures only a snapshot of one moment in time. Distinct DNA methylation patterns may be key players in chemotherapy resistance and therefore need further investigation. DNA methyltransferase 3A (DNMT3A)-induced PTEN promoter hypermethylation results in increased PI3K signaling and can be reversed by hypomethylating agents such as decitabine (DEC). Results: First, we surveyed the effects of the novel CK2 inhibitor CX-4945 and of DEC in B-ALL. Incubation with CX-4945 resulted in PI3K pathway downregulation and induced anti-proliferative effects in B-ALL cell lines. Whole-methylome screening then revealed broad DEC-induced demethylation, whereas CX-4945 had little influence; the five genes hypomethylated after CX-4945, DEC, and combined incubation included tumor suppressors. Finally, in vivo assessment of the anti-tumor potential of CX-4945 in B-ALL xenografts showed decreased leukemic blast frequency. Conclusion: Our analysis showed that treatment of patient-derived xenografts with CX-4945 alone did not change tumor cell proliferation or infiltration, while the addition of DEC reduced blast frequency in one sample; it thereby evaluates the effect of combined CK2 inhibition and DEC-mediated epigenetic modification.

V-141: Castor: Reference-based error assessment and correction of long read assemblies
COSI: General Comp Bio
  • Janet Lorv, University of Waterloo, Canada
  • Brendan McConkey, University of Waterloo, Canada

Short Abstract: Long read sequencing technologies are becoming an increasingly attractive option to generate highly contiguous assemblies. One caveat is the high read error rate, often 10% or more. After polishing, long read assemblies can have a low overall error rate, but even highly accurate assemblies (~99.8% identity) can still contain many errors that hinder downstream analyses. Insertions and deletions are the most prominent errors, and can introduce frameshifts and premature stop codons, making protein prediction challenging. To identify these errors within long-read assemblies we developed the reference-based error assessment tool, Castor. The software utilizes consensus alignments between multiple reference genomes and the draft assembly to determine putative sites of errors. The assembly quality at potential error sites is then evaluated, and if desired, corrected using reference data. Castor was evaluated on two Pseudomonas syringae genomes sequenced using ONT’s MinION sequencer and assembled using an optimized OLC pipeline. For both assemblies, Castor detected fewer than 3700 errors, accounting for less than 0.06% of each assembly; the majority of errors were homopolymer-associated indels. On correction, genome completeness improved from 81-85% to >99.6%, rivaling Illumina assemblies. Overall, Castor detects and corrects fragmenting errors, significantly improving long read assembly quality.

V-142: Integrated bioinformatic analysis of -omics data reveals differentiation status of human hepatocellular carcinoma cell lines in association with drug-specific sensitivity
COSI: General Comp Bio
  • Panagiotis Agioutantis, School of Chemical Engineering, National Technical University of Athens, Greece, Greece
  • Heleni Loutrari, 1st Department of Critical Care Medicine & Pulmonary Services, Evangelismos Hospital, Medical School, NKUA, Greece, Greece
  • Fragiskos N. Kolisis, School of Chemical Engineering, National Technical University of Athens, Greece, Greece

Short Abstract: Hepatocellular carcinoma (HCC), the predominant type of liver malignancy, is associated with a high mortality rate due to its inherent aggressiveness and the limited available therapeutic regimens. Human HCC cell lines provide appealing preclinical systems for drug screening and for elucidating the molecular heterogeneity underlying treatment sensitivity and resistance. Herein, we conducted an integrated computational analysis of transcriptome and proteome data from the CCLE database, covering 21 widely used HCC cell lines, to gain global insights into their molecular profiles. Exploratory data and single-sample gene set enrichment analyses revealed two discrete clusters, consistent at both the gene and protein expression levels: a group of liver-like, well differentiated cell lines displaying high enrichment scores in a "specifically upregulated in liver" gene set, and a group of undifferentiated cell lines. Hierarchical clustering based on a published Epithelial-Mesenchymal Transition gene set further supported this stratification. Subsequent between-group differential expression and functional analyses unveiled key distinctive genes and proteins. Finally, drug screening data obtained from the CTRPv2 database were used to correlate cell line differentiation status with drug sensitivity. In conclusion, the present results provide a rational basis for the accurate selection of HCC cell lines as appropriate models in studies of drug development/repositioning and pharmacogenomics.

V-143: Drug repurposing using a simple omics integration approach for melanoma patients
COSI: General Comp Bio
  • Yian Chen, Moffitt Cancer Center, United States
  • Zachary Thompson, Moffitt Cancer Center, United States
  • Jamie Teer, Moffitt Cancer Center, United States
  • Zhihua Chen, Moffitt Cancer Center, United States
  • Yonghong Zhang, Moffitt Cancer Center, United States
  • Eric Welsh, Moffitt Cancer Center, United States
  • Ling Cen, Moffitt Cancer Center, United States
  • Eroglu Zeynep, Moffitt Cancer Center, United States
  • Aik-Choon Tan, University of Colorado Boulder, United States
  • Keiran Smalley, Moffitt Cancer Center, United States

Short Abstract: Background: Much progress in treating melanoma has been based on targeting specific mutations. After accounting for the major mutations, approximately 25% of patients still lack clear driver mutations. Although a few immunotherapies show promise, patients have limited options after treatment failure. Methods: We developed an integrated drug-repurposing approach that uses Cox regressions to test the association between patients' overall survival (OS) and the expression and mutation status of each drug's target genes. An eQTL analysis was performed to assess the association between mutation and expression of target genes. We applied this approach to WES and RNAseq data from two melanoma cohorts: 459 TCGA and 135 BMS patients. Fisher's product method was used to synthesize the results, and the false discovery rate (FDR) was used for ranking. A total of 5,835 candidate treatments (from DsigDB) were included. Results: PD1/PDL1 (FDR=4.92x10-10), an FDA-approved immunotherapy target, ranked second, serving as a positive control. LAG3 (FDR=3.2x10-8), another immunotherapy target currently in clinical trials, and the chemotherapy uramustine (FDR=3.4x10-8), used in lymphatic malignancies, were among the top 10. Although other top-ranked drugs have not previously been used for cancer treatment, their potential signaling mechanisms have recently been studied and seem promising.
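Fisher's product method, used above to synthesize results across the two cohorts, combines k independent p-values via -2*sum(log p_i), which follows a chi-square distribution with 2k degrees of freedom under the null. A minimal sketch with example p-values (the values are illustrative, not from the study):

```python
# Fisher's product (combined probability) method:
# -2 * sum(ln p_i) ~ chi-square with 2k degrees of freedom under the null.
import math
from scipy import stats

def fisher_combine(pvalues):
    stat = -2 * sum(math.log(p) for p in pvalues)
    return stats.chi2.sf(stat, df=2 * len(pvalues))

combined = fisher_combine([0.01, 0.03])   # e.g., one p-value per cohort
```

The same result is available via `scipy.stats.combine_pvalues(..., method='fisher')`; the combined p-value is smaller than either input when both cohorts agree.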

V-144: An effective transcription factor binding site prediction model for Nuclear factor, Erythroid 2 like 2 (NFE2L2)
COSI: General Comp Bio
  • Kalyani Dhusia, Michigan State University, United States
  • Sudin Bhattacharya, Michigan State University, United States

Short Abstract: Ligand-activated Nrf2 binds genomic sequences in the promoter regions of target genes containing a core antioxidant responsive element (ARE) with the motif RTGACnnnGC. Current methods to identify Nrf2 binding sites suffer from high false-positive rates and do not account for flanking sequences beyond the central ARE core motif. Prediction of Nrf2 binding sites and their target genes in specific tissues is made difficult by our lack of knowledge of the identity of these flanking sequences. In this study, we compared multiple machine learning models for predicting Nrf2 binding sites in the liver based on chromatin accessibility and Nrf2 ChIP-Seq data (ENCODE). These data sets were mined to identify Nrf2-DNA binding peaks containing a single instance of the ARE core motif. These instances, along with their flanking sequences, constituted a set of positive Nrf2 binding sites. A set of negative sites was assembled by sampling instances of the core ARE motif in Nrf2-unbound accessible chromatin regions. One-hot encoding was used to represent both positive and negative sites, which were then classified using several machine learning models, including k-nearest neighbors, logistic regression, support vector machines, random forests, and boosted trees.
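The one-hot encoding step mentioned above is standard: each base becomes a 4-bit indicator, so a site of length L becomes a 4L-dimensional binary feature vector. A minimal sketch (the example sequence is an invented match to RTGACnnnGC, where R is A or G):

```python
# Sketch: one-hot encoding of a DNA site (core ARE motif plus flanks)
# into a flat binary feature vector for the classifiers above.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA sequence as a flat length-4L binary vector."""
    mat = np.zeros((len(seq), 4), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        mat[i, BASES[base]] = 1
    return mat.ravel()

x = one_hot("ATGACTCAGC")   # example sequence matching RTGACnnnGC
```

This representation lets models such as logistic regression or boosted trees learn position-specific base preferences, including in the flanking positions the core motif ignores.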

V-145: Multiple-Sample Somatic Variant Caller
COSI: General Comp Bio
  • Chuanyi Zhang, University of Illinois at Urbana-Champaign, United States
  • Idoia Ochoa, University of Illinois at Urbana-Champaign, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Short Abstract: Somatic single nucleotide variant (SNV) calling from bulk DNA sequencing samples is more challenging than germline SNV calling due to intra-tumor heterogeneity: the presence of distinct subclonal cellular populations, each with its own complement of SNVs, generates variant allele frequencies (VAFs) that can be too small to distinguish from sequencing errors. Using multiple samples from the same patient may help to discover such variants. However, current somatic variant callers either do not support multiple samples and instead run on each sample independently (e.g. Strelka2), or support them but suffer from long running times (e.g. Octopus). We therefore propose a multiple-sample variant calling method that can be used as an extension to existing single-sample SNV callers. Our method takes as input the BAM files and the candidate locations reported by any single-sample SNV caller, then calculates somatic scores via a model tailored for multiple samples that considers sample-specific VAFs and latent states for somatic SNVs. Using a synthetic dataset, we demonstrate that our method improves the recall rate at only a modest increase in running time. In particular, our method discovers low-VAF variants in all samples that are missed when taking the union of variants called by Strelka2 and Octopus on each sample independently.
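A toy illustration of why multiple samples help (this is not the authors' model; the error rate and counts are invented): pooling read support across samples can separate a low-VAF variant from sequencing error where each sample alone is inconclusive.

```python
# Illustrative sketch: binomial test of alt-read support against a
# sequencing error rate, per sample versus pooled across samples.
from scipy import stats

def error_pvalue(alt_counts, depths, error_rate=0.001):
    """One-sided binomial test of pooled alt reads vs. the error rate."""
    return stats.binomtest(sum(alt_counts), sum(depths),
                           p=error_rate, alternative="greater").pvalue

# Three samples from one patient, each with 3 alt reads at depth 1000
# (VAF 0.3%): individually weak, jointly convincing.
per_sample = [error_pvalue([3], [1000]) for _ in range(3)]
pooled = error_pvalue([3, 3, 3], [1000, 1000, 1000])
```

The abstract's model goes further, with sample-specific VAFs and latent somatic states, but the same joint-evidence intuition applies.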

V-146: An interventional healthcare data resource for interpreting gene-environment interactions
COSI: General Comp Bio
  • Joon Ho Kang, Department of Molecular Cell Biology, Sungkyunkwan University School of Medicine, Suwon 16419, Gyeonggi-do, South Korea
  • Heewon Kim, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Yoon Jung Yang, Department of Food and Nutrition, Dongduk Women’s University, Seoul 02748, South Korea
  • Soyeon Cha, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Jinki Kim, Software R&D Center, Samsung Electronics, Hwaseong 18448, Gyeonggi-do, South Korea
  • Hyunjung Oh, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Hong-Hee Won, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Samsung Medical Center, Seoul, South Korea
  • Kyunga Kim, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Sung Jin Park, Software R&D Center, Samsung Electronics, Hwaseong 18448, Gyeonggi-do, South Korea
  • Sunga Kong, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Jae-Hak Lee, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Joon Seol Bae, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Jinho Kim, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea
  • Woong-Yang Park, Samsung Genome Institute, Samsung Medical Center, Gangnam-gu, Seoul 06351, South Korea

Short Abstract: The interaction between genetic and environmental factors affecting obesity and body fat mass has been studied, but it is hard to clarify the effect of such interactions with a data resource that involves no lifestyle modification. We constructed a longitudinal interventional healthcare resource of 279 adults to reveal the effects of SNPs on phenotype according to lifestyle. Lifelogs were collected with a wearable device over a 3-month period, and genotype data, anthropometry, blood chemistry tests, and blood pressure were collected at the beginning and end of the study. To widen the variation in the data, we recruited participants at two different time points, summer and winter, and advised them on dietary habits and physical activity. Intervention based on baseline lifelogs helped participants improve their overall health status. Effects of SNP-environment interactions were calculated from the difference in anthropometry between the start and the end of the study. We validated these interaction effects by comparison with a previous study on genome-wide polygenic score (GPS) calculation. Finally, the randomized controlled lifestyle intervention data showed significant changes in blood chemistry tests and anthropometry, and could identify SNP-environment interaction effects with a relatively small sample size compared to past studies.

V-147: The Importance of Privacy Preserving Machine Learning in Psychiatric Research
COSI: General Comp Bio
  • Anne-Christin Hauschild, University of Marburg, Germany
  • Daniel Mueller, Center for Addiction and Mental Health, Canada
  • Dominik Heider, University of Marburg, Germany

Short Abstract: In the era of big data and machine learning (ML), secure data management is one of the key challenges in developing biomedical software that handles personal and clinical patient information. Particularly in psychiatric research, where state-of-the-art ML approaches are slowly gaining attention, data confidentiality is a continuous concern. Privacy-preserving technologies will therefore play a crucial role in paving the way for the application of modern ML to medical diagnostics and treatment optimization for psychiatric disorders. In this presentation, we will highlight the pitfalls of traditional ML approaches with respect to data privacy and leakage, as well as the availability and distribution of clinical information. Subsequently, we will compare possible solutions that utilize the power of distributed systems, such as federated learning, whose aim is to build a high-quality centralized model while patient data remain at a potentially large number of secure host locations. In summary, it uses the concepts of distributed computing to (1) train local models on the host devices, (2) build a central aggregate model, and (3) re-distribute the updated model to the hosts. Finally, we will demonstrate potential applications of privacy-preserving machine learning within the scope of personalized psychiatric diagnostics and pharmacogenomic research for treatment optimization.
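The three-step loop described above can be sketched as federated averaging on a toy linear regression; this is a minimal illustration (no secure aggregation, synchronous hosts, invented data), not a production federated-learning system:

```python
# Minimal federated-averaging sketch of the three steps above:
# (1) local training, (2) central aggregation, (3) re-distribution.
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """(1) One gradient step of linear regression on a host's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(weights, hosts):
    local = [local_step(weights.copy(), X, y) for X, y in hosts]  # (1)
    aggregate = np.mean(local, axis=0)                            # (2) FedAvg
    return aggregate                       # (3) sent back to all hosts

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
hosts = []                                 # three "hospitals"; raw data never leaves a host
for _ in range(3):
    X = rng.normal(size=(50, 2))
    hosts.append((X, X @ true_w + rng.normal(scale=0.01, size=50)))

w = np.zeros(2)
for _ in range(200):
    w = federated_round(w, hosts)
```

Only model parameters cross host boundaries in each round; the per-host (X, y) data stay local, which is the privacy property the abstract emphasizes.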

V-148: Multi-omic temporal analysis of Gene Expression in the Developing Neocortex
COSI: General Comp Bio
  • Dermot Harnett, BIMSB, Humboldt Universitat Berlin, Germany

Short Abstract: The neocortex is the primary brain structure responsible for complex cognition in humans and other advanced species. Its development in the prenatal and perinatal period requires the differentiation of neural stem cells into diverse neuronal subtypes, as well as glial cells. This maturation process involves the emergence of cell type-specific gene expression programs, where changes in mRNA levels and mRNA translation result in proteins that drive neuronal differentiation through specialized proteomes. Our study is the first to simultaneously measure and model steady-state mRNA, mRNA translation, and steady-state protein levels genome-wide in the mammalian neocortex across prenatal-perinatal developmental time points. Our analysis of gene expression levels in the mouse neocortex demonstrates that neuronal precursors transcribe and translate genes associated with differentiated tissues. By analysing temporal shifts in the rate of translation, we demonstrate that more than 1,000 genes show temporally dynamic translation efficiency (TE), including key regulators of neuronal stem-cell fate such as SatB2. Genes associated with neuronal function are enriched for dynamic TE. We further demonstrate that the levels of certain proteins show non-equilibrium dynamics, and use a hierarchical Bayesian model of synthesis and degradation to estimate the relationship between ribosome footprint density and protein synthesis.
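Translation efficiency as discussed above is commonly computed as the ratio of ribosome footprint density to mRNA abundance. A minimal sketch of that ratio (our own illustration, not the authors' pipeline; the function name, pseudocount, and numbers are assumptions):

```python
import numpy as np

def translation_efficiency(ribo_counts, rna_counts, pseudocount=1.0):
    """log2 translation efficiency: ribosome footprint density
    relative to mRNA abundance. Counts are assumed to be
    library-size normalised (e.g. TPM); the pseudocount stabilises
    lowly expressed genes."""
    return np.log2((ribo_counts + pseudocount) / (rna_counts + pseudocount))

# Toy values for three genes: translationally up, down, unchanged.
ribo = np.array([100.0, 10.0, 50.0])
rna  = np.array([ 50.0, 40.0, 50.0])
te = translation_efficiency(ribo, rna)
```

"Dynamic TE" in the abstract then corresponds to this quantity changing significantly across developmental time points for a given gene.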

V-149: Medically Actionable Rare Variants In 50,000 Exomes From UK BioBank
COSI: General Comp Bio
  • Suganthi Balasubramanian, Regeneron Pharmaceuticals, Inc., United States

Short Abstract: The promise of precision medicine is to apply large-scale human genomic sequencing to preemptively identify patients and their family members carrying “medically actionable” pathogenic variants. Here, we present a survey of such variants identified from the exomes of 50,000 individuals from UK Biobank in the 59 ACMG genes. The ACMG59 set was chosen because pathogenic variants in these genes are known to cause or predispose individuals to disease, and because medical intervention is expected to improve outcomes in terms of mortality or the avoidance of significant morbidity. We integrate exome data with genetic variant databases to identify pathogenic variants in the ACMG59. Additionally, we identify hitherto unreported “Likely Pathogenic” loss-of-function variants in ACMG59 genes where truncating mutations are expected to cause disease. Using stringent criteria for defining pathogenic variants from ClinVar, we find that approximately 2% of the population have a medically actionable variant. Using broader definitions of pathogenic variants from ClinVar and HGMD, we obtain higher estimates, ranging from 2% to 7%. Variants in the cancer-associated genes BRCA2, BRCA1, PMS2 and MSH6 are the most prevalent, followed by LDLR, associated with familial hypercholesterolemia. We highlight the importance of building a scalable workflow for rapid identification and systematic evaluation of pathogenic variants.

V-150: Calling full length transcripts with nucleotide precision using iTiSS
COSI: General Comp Bio
  • Florian Erhard, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Germany
  • Christopher Jürges, Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Germany

Short Abstract: Transcription start sites (TiSS) can be identified by a variety of sequencing techniques, including cRNA-seq, dRNA-seq and PROcap-seq, all of which rely on enriching reads at the 5’ end of mRNAs. Computational tools have been developed to automatically detect and call TiSS for individual techniques, but no uniform tool for all of these data types is available. Moreover, third-generation sequencing in principle provides full-length mRNAs, including TiSS. Here we show that each individual technique produces large numbers of false positives and also misses many bona fide TiSS. We present our tool iTiSS (integrative Transcriptional Start Site caller), an integrative approach for fast and sensitive TiSS identification with high specificity. iTiSS was used for an unbiased re-annotation of the herpes simplex virus 1 genome (manuscript under revision), integrating data from cRNA-seq and dRNA-seq as well as PacBio third-generation sequencing. Against manually curated mRNAs, iTiSS achieves good sensitivity (113/201 TiSS, 56.2%) and perfect specificity (100%) for its high-confidence TiSS calls. A recently added feature further allows iTiSS to use third-generation datasets to extend called TiSS into full-length transcripts, including splicing events, making it the first program of its kind.
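A single-technique TiSS caller of the kind the abstract contrasts with reduces, in essence, to peak detection on the pile-up of read 5' ends. The toy sketch below is our own illustration of that idea, not iTiSS itself (which integrates evidence across datasets); the function name and thresholds are assumptions:

```python
import numpy as np

def call_tiss(five_prime_counts, window=100, fold=10.0, min_reads=5):
    """Call candidate TiSS positions where the 5'-end read count
    exceeds both a minimum depth and a fold-change over the local
    background (mean count in a flanking window)."""
    counts = np.asarray(five_prime_counts, dtype=float)
    n = len(counts)
    peaks = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # local background excludes the candidate position itself
        background = (counts[lo:hi].sum() - counts[i]) / max(hi - lo - 1, 1)
        if counts[i] >= min_reads and counts[i] >= fold * max(background, 1e-9):
            peaks.append(i)
    return peaks

# Toy pile-up: a strong 5'-end peak at position 150, low-level noise at 10.
counts = np.zeros(300)
counts[150] = 50
counts[10] = 2
peaks = call_tiss(counts)
```

The false positives and misses the abstract reports arise precisely because any single such threshold scheme trades sensitivity against specificity per technique.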

V-151: Development of a transfer learning framework to predict recurrence of colon cancers
COSI: General Comp Bio
  • Jun-Gi Jeong, Samsung Genome Institute, Samsung Medical Center, Seoul 06351, South Korea, South Korea
  • Yeon Jeong Kim, Samsung Genome Institute, Samsung Medical Center, Seoul 06351, South Korea, South Korea
  • Hyojeong Jeon, Samsung Genome Institute, Samsung Medical Center, Seoul 06351, South Korea, South Korea
  • Hye Kyung Hong, Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine,Seoul,Republic of Korea, South Korea
  • Bojana Popovic, AstraZeneca, United Kingdom
  • Yong Beom Cho, Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine,Seoul,Republic of Korea, South Korea
  • Donghyun Park, Samsung Genome Institute, Samsung Medical Center, Seoul 06351, South Korea, South Korea
  • Jinho Kim, Samsung Genome Institute, Samsung Medical Center, Seoul 06351, South Korea, South Korea

Short Abstract: Treatment strategies for colon cancer patients could be tailored if accurate recurrence prediction were possible. Methylation signatures have been suggested to be important in colon cancer recurrence, but available methylation datasets are heterogeneous, making it challenging to integrate them into a predictive model. We devised a two-step transfer learning approach to develop a statistical model predicting recurrence of colorectal cancers. First, we selected methylation features using TCGA-deposited methylation array data of 291 colon cancers, with patient survival information as a proxy for recurrence. We condensed 450,000 CpG sites into 200 principal components (PCs), which improved prediction accuracy (average AUC 0.6). We then transferred this feature information to the second step, in which our in-house bisulfite sequencing data comprising 77 colon cancer samples with time-of-recurrence information were used. We kept only the 3,337 CpG sites that were highly correlated with the 28 most important PCs. The model showed an average concordance index of 0.63 across all stages, greater than that of models trained without the assistance of transfer learning. Our predictive model may provide orthogonal information, together with existing clinical tests, to screen patients at high risk of recurrence.
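The feature-transfer step described above (condense CpG sites into PCs, then keep CpGs strongly correlated with the most important PCs) can be sketched at toy scale. This is our own illustration under assumed details: the SVD-based PCA, the correlation threshold, and the function name are not taken from the study:

```python
import numpy as np

def select_cpgs_by_pc_correlation(beta, n_pcs=28, r_thresh=0.9):
    """Toy version of PC-guided CpG selection: run PCA on a
    methylation beta-value matrix (samples x CpGs), then keep the
    CpG columns whose absolute correlation with any of the top
    principal-component scores exceeds r_thresh."""
    X = beta - beta.mean(axis=0)          # centre each CpG column
    # PCA via SVD: U * S gives the PC scores per sample
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_pcs] * S[:n_pcs]
    keep = set()
    for k in range(scores.shape[1]):
        pc = scores[:, k]
        for j in range(X.shape[1]):
            r = np.corrcoef(pc, X[:, j])[0, 1]
            if abs(r) >= r_thresh:
                keep.add(j)
    return sorted(keep)
```

At real scale (450,000 CpGs, 200 PCs) one would use randomized or incremental PCA, but the selection logic is the same.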

V-152: CRAB: A Comprehensive Repository of Drug Resistant Determinants of Acinetobacter baumannii
COSI: General Comp Bio
  • Tina Sharma, CSIR-IMTech, India
  • Anshu Bhardwaj, CRI, France

Short Abstract: Acinetobacter baumannii (AB) has been identified by the WHO as the most critical drug-resistant (DR) pathogen, given its increasing resistance to conventional treatments. It is imperative that drug resistance determinants and trends of antibiotic resistance are understood across various clinical isolates of AB, and crucial that the determinants of various clinically resistant phenotypes are consolidated with detailed annotation so as to understand the entire molecular landscape of drug resistance in AB. Towards this, the Comprehensive Repository of drug resistant determinants of A. baumannii (CRAB) was developed, covering the drug resistance profiles of 42 complete genomes and 94 draft genomes, with antibiograms for 57 isolates. In CRAB, the coordinates of the reference genome are used to represent manually curated data on 619 essential genes, 1,334 genes with pathway annotation, 221 PDB structures, 81 drug targets, 334 genes with a reported resistance mechanism, 118 transcription factors, 4 sigma factors, and 14 two-component systems (TCS). The genotype-phenotype correlation data on DR determinants of clinical isolates, together with the comprehensive annotation of ATCC 17978, makes CRAB a unique resource. The consolidated data has facilitated the prioritisation of 36 potential drug targets which are non-human octamer, core and invariant across 42 clinical isolates.

V-154: CASP_COMMONS: A Community Wide Program in Data Guided Protein Structure Prediction
COSI: General Comp Bio
  • Chin-Hsien Tai, National Cancer Institute, NIH, United States
  • Yojiro Ishida, Rutgers University, United States
  • Greg Hura, The Advanced Light Source, Lawrence Berkeley National Laboratory, United States
  • Susan Tsutakawa, The Advanced Light Source, Lawrence Berkeley National Laboratory, United States
  • John Tainer, MD Anderson Cancer Center, United States
  • Andriy Kryshtafovych, University of California, Davis, United States
  • John Moult, University of Maryland, United States
  • Krzysztof Fidelis, University of California, Davis, United States
  • Gaetano Montelione, Rutgers University, United States

Short Abstract: Protein structure information can drive biomedical research, and when high-resolution experimental structures are not available, high-quality models can help. To engage the CASP community with broader biological fields, CASP_COMMONS was launched: an initiative to provide structural information, which may include de novo or data-guided predictions, NMR, or SAXS data, for proteins of high biological significance. Thirty-two targets were nominated by 28 high-impact biomedical research labs worldwide. Their sequences were released to the prediction groups, which were given three weeks to submit at most 5 models for each target. On average, 30 models per target were received, evaluated, and ranked; they are available on the CASP website for nominators to examine further. Fifteen targets with 50 to 200 residues, shallow sequence alignments, and no good templates were selected for NMR and SAXS experiments. SAXS data were then released for the “data-assisted” prediction contests, while NMR information was used to evaluate the de novo prediction models. In summary, CASP_COMMONS is a collaborative initiative in which all participants contribute their expertise to help advance each other's fields.

V-155: The European Variation Archive: Genetic variation archiving and accessioning for all species
COSI: General Comp Bio
  • Cristina Gonzalez, EMBL-EBI, United Kingdom
  • Jose Miguel Mut Lopez, EMBL-EBI, United Kingdom
  • Sundararaman Venkataraman, EMBL-EBI, United Kingdom
  • Andres Silva, EMBL-EBI, United Kingdom
  • Baron Koylass, EMBL-EBI, United Kingdom
  • Kirill Tsukanov, EMBL-EBI, United Kingdom
  • Thomas Keane, EMBL-EBI, United Kingdom

Short Abstract: The European Variation Archive (EVA, https://www.ebi.ac.uk/eva) is a primary open repository for archiving, accessioning, and distributing genome variation, including single nucleotide variants, short insertions and deletions (indels), and larger structural variants (SVs), in any species. Since launching in 2014, the EVA has archived more than 780 million unique variants across more than 50 species. The EVA currently peers with the NCBI-based database dbSNP to form a worldwide network for exchanging and brokering variation data. Since 2017, issuing and maintaining variant accessions has been divided by species: the EVA is responsible for non-human species and dbSNP for human. Since then, the EVA has imported approximately 450 million identifiers from dbSNP and issued 360 million new ones. The EVA offers a REST API for querying and exporting data that supports the htsget streaming protocol defined by the GA4GH. An implementation of the Beacon v0.4 specification has been developed and will be updated to v1.0 as part of ongoing development work. The EVA also contributes to maintaining the Variant Call Format (VCF) specification and has implemented a validation suite to ensure the correctness of all submissions made to the archive.

V-156: The de.NBI / ELIXIR-DE training platform (SIG3)
COSI: General Comp Bio
  • Daniel Wibberg, ELIXIR-DE

Short Abstract:

The 'German Network for Bioinformatics Infrastructure' (de.NBI) provides bioinformatics services and training to users in life sciences research, industry and biomedicine. Training activities of de.NBI are focused on supporting and training end users. Life science researchers will thus be enabled to exploit their data more effectively by applying tools, standards and compute services provided by de.NBI. To effectively coordinate training courses of the consortium, de.NBI has established the Special Interest Group 3 (SIG3 - Training and Education) also known as the de.NBI training platform. SIG3 is composed of training experts from each de.NBI unit.

Different types of training activities are supported and organized by de.NBI. First of all, the de.NBI summer schools provide training courses for undergraduate and graduate students on specific topics related to one or several de.NBI nodes. The respective nodes organize tool-specific training. These training courses can either be attached to existing conferences (which many of the potential participants will attend anyway) or organized independently. In addition, online training was introduced on the de.NBI website in 2016; it gives users a first insight into bioinformatics tools. Beyond online training material, online hackathons for different software packages and webinars have been established by the service centers RBC and CIBI.

Between 2015 and 2018, the number and diversity of training activities increased significantly (2015: 329 participants, 17 courses; 2016: 882 participants, 40 courses; 2017: 1,489 participants, 69 courses; 2018: 1,520 participants, 77 courses). Since 2017, attendance has plateaued at ~1,500 participants and ~70 courses per year, while the number of training courses held at larger conferences has increased. To further increase the number of participants and courses as well as the visibility of the de.NBI network, de.NBI supports training courses of external partners, e.g. UKE Hamburg. Following recent administrative developments, Germany joined ELIXIR, and the German ELIXIR Node will be run by de.NBI. As a result, de.NBI has also joined the ELIXIR Training Platform, and SIG3 has already started to establish collaborations with ELIXIR on training activities.