Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Posters - Schedules

Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations during the conference, and the content will remain available until October 31, 2021. Q&A takes place through a chat function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:

Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC

A method for classification of cell migration types based on cell-tracking results from live-cell imaging
COSI: General Comp Bio
  • Aya Watanabe, Graduate School of Information Science and Technology, Osaka University, Japan
  • Tsubasa Mizugaki, Graduate School of Information Science and Technology, Osaka University, Japan
  • Hironori Shigeta, Graduate School of Information Science and Technology, Osaka University, Japan
  • Shigeto Seno, Graduate School of Information Science and Technology, Osaka University, Japan
  • Hideo Matsuda, Graduate School of Information Science and Technology, Osaka University, Japan

Short Abstract: Cell migration is an important criterion for determining the effects of inflammatory and/or chemical stimulation on cells. Detecting cell movement with tracking tools is not easy because the cells have similar fluorescence intensities and shapes. In this study, we analyze neutrophil migration based on the results of cell tracking by manual inspection, using time-series images observed with two-photon excitation microscopy. The cells were activated by two types of stimulants: LPS (lipopolysaccharide) and GM-CSF (granulocyte-macrophage colony-stimulating factor). Because existing methods often classify cell migration types by velocity, no migration features were found that distinguish between the stimulations. To address this issue, we classify the types of cell migration arising from different stimulations based on curvature. We applied a spline interpolation method and a second-order differential method to calculate the motion curvatures, and we applied principal component analysis to classify the types of cell migration. We will demonstrate the performance of our method by revealing the differences in cell migration types and show its effectiveness compared to existing methods.
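
To make the curvature feature concrete, here is a minimal sketch of computing the curvature of a 2D cell track. It is not the authors' implementation: the abstract describes spline interpolation followed by second-order differentiation, whereas this sketch substitutes plain central finite differences and assumes positions sampled at equal time steps. The function name `track_curvature` is hypothetical.

```python
import math


def track_curvature(track):
    """Discrete curvature at each interior point of a 2D track.

    `track` is a list of (x, y) positions sampled at equal time steps.
    Assumption: central finite differences stand in for the spline-based
    derivatives mentioned in the abstract.
    """
    kappas = []
    for i in range(1, len(track) - 1):
        (x0, y0), (x1, y1), (x2, y2) = track[i - 1], track[i], track[i + 1]
        # First derivatives (central difference, time step = 1).
        dx, dy = (x2 - x0) / 2.0, (y2 - y0) / 2.0
        # Second derivatives (second-order central difference).
        ddx, ddy = x2 - 2.0 * x1 + x0, y2 - 2.0 * y1 + y0
        speed_sq = dx * dx + dy * dy
        if speed_sq == 0.0:
            # Stationary point: curvature is undefined, reported here as 0.
            kappas.append(0.0)
            continue
        # kappa = |x'y'' - y'x''| / (x'^2 + y'^2)^(3/2)
        kappas.append(abs(dx * ddy - dy * ddx) / speed_sq ** 1.5)
    return kappas


# A cell moving on a circle of radius 5 has curvature close to 1/5.
circle = [(5 * math.cos(0.05 * t), 5 * math.sin(0.05 * t)) for t in range(40)]
circle_kappa = track_curvature(circle)

# A cell moving in a straight line has zero curvature.
line_kappa = track_curvature([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Per-track summaries of these values (for example the mean or a histogram) could then serve as input features for the principal component analysis step.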

A multiresolution optimization strategy for inferring 3D genome architecture from Hi-C data
COSI: General Comp Bio
  • William Stafford Noble, University of Washington, United States
  • Alexandra Gesine Cauer, University of Washington, United States
  • Jean-Philippe Vert, Google, France
  • Nelle Varoquaux, University of California, Berkeley, France

Short Abstract: The three-dimensional organization of the genome plays an important part in regulating numerous basic cellular functions, including gene regulation, differentiation, the cell cycle, DNA replication, and DNA repair. Assays like Hi-C measure DNA-DNA contacts in a high-throughput fashion, and inferring accurate 3D models of chromosomes can yield insights hidden in the raw data. However, inference on low-coverage or high-resolution data is challenging, as is inference of diploid genomes. Previous haploid structural inference methods have successfully addressed the difficulties presented by low-coverage or high-resolution data via multiscale optimization, an optimization strategy that solves a large optimization problem by building upon the solutions to smaller versions of the problem. Because many organisms of interest are diploid, we sought to develop a multiscale optimization approach that infers the structure of diploid genomes. We use simulations to show that integrating multiscale optimization with a previously published diploid inference method significantly improves the accuracy of inferred structures.

A Quantitative Bioinformatics Approach to Rabies Lyssavirus Proteome Diversity
COSI: General Comp Bio
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia/ Bezmialem Vakif University, Turkey, Turkey
  • Li Chuin Chong, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Esin Özkan, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey

Short Abstract: Rabies is a zoonotic and fatal infectious disease caused by the Rabies Lyssavirus (RABV). Herein, a comprehensive quantitative bioinformatics approach was applied to dissect the diversity of RABV proteome sequences from human and dog hosts. All reported RABV protein sequences from human and dog hosts were retrieved from the NCBI Virus database. The sequences were deduplicated, grouped by protein for each host, co-aligned across hosts, and separated for further analyses. Shannon’s entropy and diversity motif analyses (index and its variants: major, minor and unique) were performed for each aligned overlapping nonamer position of each host protein to study the proteome diversity. A total of 10,444 protein sequences were retrieved, with 847 and 9,597 for human and dog hosts, respectively. The average proteome-wide entropy was higher for human RABV (~0.82) than for dog RABV (~0.70). The P protein was the most diverse, while L was the most conserved for both hosts. Index, the predominant nonamer, and its variants exhibited distinctive patterns across the proteins and the hosts. Completely conserved positions represented ~16% and ~6% of the human and dog RABV proteomes, respectively. This study provides insight into the RABV proteome diversity, with implications for vaccine and drug design.
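
As an illustration of the entropy measure, the following minimal sketch computes Shannon's entropy for each overlapping nonamer position of an alignment. It is not the authors' pipeline: the function name `nonamer_entropy`, the choice of base-2 logarithms, and the rule of skipping windows that contain gap characters are all assumptions made for the example.

```python
import math
from collections import Counter


def nonamer_entropy(aligned, k=9):
    """Shannon entropy (bits) of the k-mers seen at each overlapping window.

    `aligned` is a list of equal-length aligned protein sequences.
    Assumption: windows containing a gap character '-' are not counted.
    """
    length = len(aligned[0])
    entropies = []
    for start in range(length - k + 1):
        kmers = [s[start:start + k] for s in aligned]
        kmers = [w for w in kmers if "-" not in w]
        counts = Counter(kmers)
        total = sum(counts.values())
        # H = sum over distinct k-mers of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


# Toy alignment: the first nonamer position is fully conserved, the second
# is split evenly between two distinct nonamers.
toy = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKM", "ACDEFGHIKM"]
toy_entropy = nonamer_entropy(toy)
mean_entropy = sum(toy_entropy) / len(toy_entropy)
```

A position with entropy 0 is completely conserved; averaging the per-position values gives a proteome-wide figure comparable in spirit to the averages quoted in the abstract.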

A Transcriptomic Approach to Normal Tissue Selection Strategies for Cancer and Basic Research
COSI: General Comp Bio
  • Seda Arat, Pfizer, United States
  • John Jakubczak, Pfizer, United States
  • Ahmed Shoieb, Pfizer, United States
  • Renee Huynh, Pfizer, United States
  • Zhiyong Xie, Pfizer, United States
  • Matthew Martin, Pfizer, United States
  • David Potter, Pfizer, United States

Short Abstract: Normal human tissues are critical in the drug discovery setting to better understand disease biology, validate drug targets, assess target distribution, develop and validate various assay platforms, and determine the binding properties of therapeutic biologics. Access to high-quality normal human tissue can be challenging because of a variety of clinical, logistical and ethical factors. We dissected four different kidney tissue sources: post-mortem cadaveric (CAD), deceased organ donor (OD), adjacent normal (NAT) and tumor from surgical remnants of living donors. Unlike other cancer tissue studies, pathological and transcriptomic profiles of 91 donors from three normal and one tumor kidney tissue sources suggest that NAT is more similar to the normal kidney tissue from CAD and OD than to their matching tumor tissue. Our differential expression and enrichment analyses provide tissue source selection strategies for both cancer and basic research. The common strategies are: (1) NAT may be an alternative tissue source to OD for studies focusing on adaptive/humoral immune response, and (2) there is no alternative baseline to OD for lymphocyte/leukocyte-related immune processes. To our knowledge, our work is not only the first to marry pathological data with transcriptomics on three different normal kidney tissue sources but also provides strategies for proper “baseline” selection for both cancer and non-cancer studies.

AcrFinder: genome mining anti-CRISPR operons in prokaryotes and their viruses; AcrDB: a database of anti-CRISPR operons in prokaryotes and viruses
COSI: General Comp Bio
  • Yanbin Yin, University of Nebraska-Lincoln, United States
  • Bowen Yang, University of Nebraska-Lincoln, United States
  • Haidong Yi, University of North Carolina at Chapel Hill, United States

Short Abstract: Anti-CRISPR (Acr) proteins encoded by (pro)phages/(pro)viruses have great potential to enable more controllable genome editing. Here we present AcrFinder and AcrDB. AcrFinder is a web server designed for Acr screening. The tool has the following unique functions: (i) it is the first online server that specifically mines genomes for Acr-Aca operons; (ii) it provides a most comprehensive Acr and Aca (Acr-associated regulator) database; (iii) it combines homology-based, guilt-by-association (GBA)-based, and self-targeting approaches in one software package; and (iv) it provides a user-friendly web interface. AcrFinder achieved 100% recall in validation tests. AcrDB was constructed by scanning ~19,000 genomes of prokaryotes and viruses with AcrFinder, with the results further processed by two machine learning-based programs (AcRanker and PaCRISPR). Compared to existing Acr databases, AcrDB has the following unique features: (i) it is a genome-centric database with a much larger data size; (ii) it offers a user-friendly web interface; (iii) it focuses on the genomic context of Acr and Aca homologs instead of individual Acr protein families; and (iv) it collects data with three independent programs, each having a unique data mining algorithm, for cross validation. Both AcrFinder and AcrDB will be valuable resources for the anti-CRISPR research community.

Analysis of Zika Virus Protein Sequence Diversity
COSI: General Comp Bio
  • Nadir Emre Herdan, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Li Chuin Chong, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia/ Bezmialem Vakif University, Turkey, Turkey

Short Abstract: Zika virus (ZIKV), an arthropod-borne flavivirus, is considered a global health threat. Herein, we describe a comprehensive quantitative analysis of the sequence diversity of the ZIKV proteome, isolated from the human host. A total of 1,214 protein sequences were retrieved, separated by BLASTp into the respective 10 viral proteins encoded by the proteome, deduplicated (CD-HIT), and aligned (MUSCLE). Shannon’s entropy was calculated with DiMA for each of the aligned overlapping nonamer positions to measure the proteome diversity. The distinct sequences at each position were classified into four diversity motifs (index and its variants: major, minor and unique) based on their incidence. The human ZIKV proteome was highly conserved, with a mean entropy value of only ~0.2 and a peak total variants incidence as low as ~35%. The structural capsid protein was the most diverse (average entropy of ~0.43), while the non-structural (NS) 3 protein was the most conserved (~0.13). Despite the limited diversity, given the recent history of the virus, there were notable dynamics of sequence change involving each variant motif. This indicates the opportunity for a vaccine against the virus, while the diversity is low, and the need for surveillance of viral variants to monitor possible fitness change.
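
The diversity motif classification can be sketched as below. The abstract only states that distinct sequences are classified "based on their incidence", so the exact rules here are assumptions made for illustration: the most frequent nonamer is taken as the index, the most frequent variant as major, variants seen exactly once as unique, and all remaining variants as minor. The function names are hypothetical and this is not the DiMA implementation.

```python
from collections import Counter


def classify_motifs(kmers):
    """Label each distinct nonamer observed at one alignment position.

    Assumed rules (not taken from the abstract): rank 1 by incidence is
    'index', rank 2 is 'major', singletons are 'unique', the rest 'minor'.
    Ties are broken by first appearance in `kmers`.
    """
    labels = {}
    for rank, (kmer, count) in enumerate(Counter(kmers).most_common()):
        if rank == 0:
            labels[kmer] = "index"
        elif rank == 1:
            labels[kmer] = "major"
        elif count == 1:
            labels[kmer] = "unique"
        else:
            labels[kmer] = "minor"
    return labels


def total_variant_incidence(kmers):
    """Fraction of sequences at a position that differ from the index."""
    counts = Counter(kmers)
    index_count = counts.most_common(1)[0][1]
    return (len(kmers) - index_count) / len(kmers)


# Toy position with 11 sequences (3-mers used for brevity).
position = ["AAA"] * 5 + ["AAB"] * 3 + ["AAC"] * 2 + ["AAD"]
motifs = classify_motifs(position)
variants = total_variant_incidence(position)
```

In this toy position the index accounts for 5 of 11 sequences, so the total variants incidence is 6/11, or about 55%.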

BatchQC: An interactive R package for batch effect assessment, visualization, and correction
COSI: General Comp Bio
  • Michael Silverstein, Boston University, United States
  • Regan Conrad, Boston University, United States
  • Zhaorong Li, Boston University, United States

Short Abstract: When sequencing and microarray samples are processed in multiple batches, it can be challenging to disentangle true biological differences between samples from differences that arise as a result of non-biological processing discrepancies. While batch effects can usually be avoided by uniformly processing all samples in a single batch, this becomes less realistic with large studies (e.g. multi-site consortiums) or those that require the sequential processing of samples. Thus, there is a need for analytical methods for removing the bias introduced by batch effects from the true biological signals. We propose BatchQC, an interactive R Shiny package that streamlines batch effect identification and exploration, through visualizations and statistical analyses, and implements batch effect correction using common algorithms such as ComBat-seq, SVA, and RUV. Further, we demonstrate the effectiveness of BatchQC using a SARS-CoV-2 immune response dataset (GSE147507), which was generated from several experiments with different protocols and thus has significant batch effects. Using BatchQC, we are able to successfully identify, display, and correct the batch effects between the separately run experiments. We recommend that researchers analyzing sequencing experiments add BatchQC to their computational toolkit for efficient batch effect identification and correction.

Characterization and prediction of DNA methylation instability across human cancers
COSI: General Comp Bio
  • Brittany Lasseigne, University of Alabama at Birmingham, United States
  • Sasha Thalluri, University of Alabama at Birmingham, United States

Short Abstract: DNA methylation (DNAm) instability occurs when there are globally altered DNA methylation signatures. While these patterns occur throughout the genome, extensive research to date in the cancer field has suggested that methylation changes occurring at CpG islands may be important biomarkers of disease etiology, progression, or treatment response. While many studies have examined this phenomenon, known as the CpG Island Methylator Phenotype (CIMP), few have studied the relationship between non-CpG island methylation, including CpG shores and CpG shelves, within and across cancers and with respect to clinical characteristics. We obtained DNAm array (450k Infinium Chip) data from over 11,000 patients in 33 cancers, publicly available from The Cancer Genome Atlas (TCGA) database. In this study, we calculate different DNAm instability metrics and compare and contrast them to gain insight into the correlation between these signatures within and across cancer types. Our findings provide insights for generating hypotheses and contribute to the clinical relevance of DNA methylation as a potential target for future therapeutic intervention.

Cluster guided de novo isoform assembly
COSI: General Comp Bio
  • Karl Johan Westrin, KTH Royal Institute of Technology, Sweden
  • Warren Kretzschmar, Karolinska Institutet, Sweden
  • Olof Emanuelsson, KTH Royal Institute of Technology, Sweden

Short Abstract: Motivation: Transcriptome assembly in species without a reference genome has to be performed de novo, which requires deeper sequencing than a reference-based approach would require. In turn, this makes the study of alternative splicing in such species difficult, particularly for lowly expressed isoforms. Sequencing of full-length transcripts using long reads could improve this, but such techniques are still either expensive or error-prone.

Result: We present the de novo transcript isoform assembler ClusTrast, which clusters a set of guiding contigs by similarity, aligns short reads to the guiding contigs, and assembles each thus clustered set of short reads individually. Tested on bulk-RNA sequencing datasets from different species, ClusTrast recovered more expressed known isoforms than any of the other tested de novo assemblers at a moderate reduction in precision. Therefore, we propose that ClusTrast can be a useful tool for studying alternative splicing in non-model organisms.

CMOS: An Encyclopedia of Multi-omic Stacks and their Respective Cell Lines in Head and Neck Squamous Cell Carcinoma
COSI: General Comp Bio
  • Sabaoon Zeb, Precision Medicine Lab, Pakistan
  • Faisal F. Khan, Precision Medicine Lab, Pakistan

Short Abstract: Over the past decades, hundreds of Head and Neck Squamous Cell Carcinoma (HNSCC) cell lines have been established globally. However, the information available online is sparse, incomplete or distributed. Moreover, bioinformatics datasets, especially at the whole-cell ‘omic’ scale, are critical and even more dispersed. In this study, we have developed the largest database that we know of for HNSCC cell lines, with data collected from PubMed (NCBI), Cellosaurus, COSMIC and other data sources. This encyclopedia contains details on 979 established HNSCC cell lines with metadata on patient demographics, tumorigenicity in mouse xenograft models, primary cell culture protocol reagents, reported mutations and available multi-omics datasets. The catalogue contains multi-omics datasets of expression profiling by array (n=186), non-coding RNA profiling by array (n=79), methylation profiling by array (n=57), SNP array (n=51), RNA-seq analysis (n=118), ChIP-seq analysis (n=44), proteome analysis (n=107), metabolome analysis (n=30), whole exome sequencing (n=150) and whole genome sequencing (n=45). According to the data, CAL27, FaDu and HSC2 are the cell lines that are commercially available and for which all of the above-mentioned multi-omics datasets exist. This catalogue is a benchmark for the scientific community for breakthroughs in HNSCC research using both computational and experimental analysis.

Combination of statistical methods and unsupervised learning algorithms facilitates the identification of patterns in protein sequence sets
COSI: General Comp Bio
  • David Medina-Ortiz, Centre for Biotechnology and Bioengineering - CeBiB, Chile
  • Álvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering - CeBiB, Uruguay
  • Alfredo Hernández-Inostroza, Universidad de Chile, Chile

Short Abstract: Currently, computational tools for discovering patterns in sets of protein sequences and grouping them seldom deliver satisfactory results. We therefore designed and implemented a sequence clustering strategy that combines different coding methods, sequence alignments, analytical methods using distance metrics, applications for processing digital sequences, and natural language processing. Using these data and statistical models, it is possible to estimate the probability that given sequences are related. By representing these probabilities as graph structures and applying community detection strategies, groups of sequences can be identified. We evaluated the tool's effectiveness in three cases: a set of hydrophobin protein data, DNA interaction proteins, and reconnecting groups of sequences in peptides, obtaining 75.6%, 83.2%, and 87.5%, respectively. In our work in progress, we are testing the effectiveness of our approach using varied datasets of protein variants with known classes. We think that this approach will represent an attractive alternative to the classical strategies. It could also be used as a classification method in a semi-supervised approach.

CStone: A de novo assembler that identifies non-chimeric contiguous sequences based on underlying graph structure
COSI: General Comp Bio
  • John Archer, CIBIO – Research Centre in Biodiversity and Genetic Resources, Portugal
  • Raquel Linheiro, CIBIO – Research Centre in Biodiversity and Genetic Resources, Portugal

Short Abstract: Artificially generated chimeric sequences can closely resemble underlying expressed transcripts, but patterns such as those seen between co-evolving sites or re-mapped read counts become obscured. With the exponential growth of sequence information stored over the last decade, the quantification of chimeras has become essential, especially when assembling read data. We have created a de novo assembler, CStone, that annotates each contig produced with one of three classification levels indicating whether or not it can be guaranteed to be non-chimeric. Classification levels are dependent on the complexity of the gene family from which the reads are derived. As a by-product, this also provides insight into the structural makeup of the gene families that have been sequenced. To demonstrate CStone's ability to assemble high-quality contigs, and to label them in this manner, RNA-seq data was simulated from cDNA libraries representing ten different species and assembled using three different assemblers: CStone, Trinity and rnaSPAdes. On comparison back to the original cDNA libraries, the contigs that CStone generates are comparable in quality to those of Trinity and rnaSPAdes, while providing additional information on chimerism. The CStone project is available at: sourceforge.net/projects/cstone/.

Deciphering effect and mutation relationships in von Hippel Lindau disease using data mining strategies
COSI: General Comp Bio
  • David Medina-Ortiz, Centre for Biotechnology and Bioengineering (CeBiB), Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering (CeBiB), Uruguay
  • Gabriel Cabas-Mora, Universidad de Talca, Chile
  • Julio Salgado, Universidad de Talca, Chile
  • Claudio Guevara-Vasquez, Universidad de Talca, Chile

Short Abstract: Von Hippel-Lindau is one of the most complex diseases to cope with, mainly due to the effects it causes on the well-being of the individual. It is characterized by point mutations in the pVHL protein, which causes detrimental changes in its structure, losing interactions, stability, or functionality. Different computational approaches have been implemented to study and analyze disease mutations. However, there are no massive studies that attempt to decipher the phylogeny-structure correlation between these and the consequences at the structural level that cause changes in pVHL. To solve this, we design and implement a database with refined and updated information on mutations and effects. We then correlated the impact of mutations concerning thermodynamic changes and co-evolution analysis using the SDM and evocoupling services. Remarkably, we developed a mutant landscape and evaluated changes at the structural level using discrete mathematical modeling. Besides, we design and implement different binary predictive models to classify the effects of mutations, clinical risks, and VHL types, using Machine Learning strategies and characterizing mutations from the phylogenetic, structural, and thermodynamic points of view, achieving a weighted performance of 88.7% accuracy. Finally, we implement VHL-Hunter, a user-friendly web application, as a support tool for the scientific community.

Deciphering liver metastases with pan-cancer transcriptomic comparisons
COSI: General Comp Bio
  • Ke Liu, Michigan State University, United States
  • Mingdian Tan, Stanford University, United States
  • Benjamin Glicksberg, Icahn School of Medicine at Mount Sinai, United States
  • Shreya Paithankar, Michigan State University, United States
  • Rama Shankar, Michigan State University, United States
  • Dimitri Joseph, Michigan State University, United States
  • Samuel So, Stanford University, United States
  • Mei-Sze Chua, Stanford University, United States
  • Bin Chen, Michigan State University, United States

Short Abstract: We conduct transcriptomic comparisons in seven cancer types to decipher the complexity of liver metastases. We first develop DEBoost to identify differentially expressed (DE) genes between primary and metastatic cancer cells. The following functional analyses suggest that liver metastases of prostate cancer and pancreatic neuroendocrine tumor are more active in cell cycling than their respective primary cancers whereas other cancer types are not. The expressions of DE genes have limited associations with clinical measures, indicating most DE genes may be passengers in the metastasis process. We cluster DE genes based on their chromosome coordinates to uncover copy number differences and further confirm gain of 19p13.12 drives metastasis in Basal-like breast cancer. Finally, we show that metastatic cancer cells could partially mimic the secretome of hepatocytes by selectively expressing liver-specific genes encoding secreted proteins. Our work provides a novel framework to study cancer metastasis using pan-cancer transcriptomic data.

Dissecting the Dynamics of Lassa Virus Protein Sequence Diversity
COSI: General Comp Bio
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih, Istanbul, Turkey, Turkey
  • Eyyüb Selim Ünlü, Istanbul Faculty of Medicine, Istanbul University, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Bezmialem Vakif University, Turkey / Perdana University, Malaysia, Turkey

Short Abstract: Lassa virus (LASV) poses an endemic threat to sub-Saharan African countries, with no available vaccine currently. An ongoing aim is to elucidate the proteome sequence diversity from an immunological perspective for novel intervention strategies. Herein, we used a comprehensive quantitative bioinformatics approach to dissect the human and rodent Lassa virus proteome sequence diversity. All reported LASV protein sequences and relevant metadata were downloaded from public databases (as of June 2020), cleaned, deduplicated (CD-HIT) and aligned (Clustal Omega). Shannon’s entropy was measured for each nonamer position to survey the overall proteome diversity. The distinct nonamers at each position were further classified into four diversity motifs based on their incidence (tool: DiMA): index, major, minor and unique. Additionally, the motifs were assessed for irregular switching between positions (MoSWA). Proteome-wide entropy of human LASV was generally higher than rodent. The inter-relationships between the diversity motifs were complex, with motif switching a common phenomenon. Most of the proteome nonamer positions showed mixed-variability or were highly conserved. There were more than 100 human LASV nonamer sequences that exhibited complete conservation. This study provides an insight into LASV proteomic diversity, fitness change of amino acids, and evolutionary conservation of the virus.

Does the proportion of functional T cell receptor genes change with age?
COSI: General Comp Bio
  • Justyna Mika, Department of Data Science and Engineering, Silesian University of Technology, Poland
  • Serge Candéias, Univ. Grenoble Alpes, CEA, CNRS, IRIG-LCBM-UMR5249, Grenoble, France
  • Joanna Polanska, Department of Data Science and Engineering, Silesian University of Technology, Poland

Short Abstract: T lymphocytes play an essential role in the defense against pathogens and cancers through their clonally distributed T cell receptor (TCR). TCR genes are assembled from discrete V, D and J segments in developing lymphocytes. Due to the random nature of this process, only one-third of the rearranged TCR genes are functional while two-thirds are non-functional. Here we intend to determine the impact of age on the proportion of functionally rearranged TCR genes in blood.
We used a collection of 587 human TCRβ repertoires obtained from healthy donors. After data preprocessing, we calculated diversity of functional status of TCR genes for each donor using Pielou’s J index and applied piecewise linear regression to find different patterns of evolution in age categories. The best age split was determined based on a minimal Bayesian Information Criterion value.
We observed a significant drop of functional status diversity (r=-0.60, p<0.0001) in donors younger than 19 years, whereas diversity is stable in older donors over time (r=-0.06, p=0.1985). Thus, the proportion of functional sequences increases early in life and then stabilizes, suggesting different patterns of T lymphocyte proliferation with age.
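
For readers unfamiliar with Pielou's J index, here is a minimal sketch of the evenness calculation applied to counts per functional-status category. It is not the authors' code: the function name `pielou_j`, the example counts, and the convention of using only observed (non-zero) categories for the normalisation are assumptions.

```python
import math


def pielou_j(counts):
    """Pielou's evenness J = H / ln(S) for a list of category counts.

    H is Shannon entropy (natural log) and S is the number of observed
    categories. Assumption: categories with zero count are ignored, and
    J is reported as 0 when fewer than two categories are observed.
    """
    counts = [c for c in counts if c > 0]
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    h = sum((c / total) * math.log(total / c) for c in counts)
    return h / math.log(len(counts))


# Hypothetical donors: [functional, non-functional] rearrangement counts.
even_donor = pielou_j([500, 500])    # perfectly even split
skewed_donor = pielou_j([900, 100])  # mostly functional
```

J equals 1 when the categories are equally represented and falls toward 0 as one category dominates, so a drop in J with age corresponds to the functional status distribution becoming less even.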

The work was supported by the European Social Fund grant POWR.03.02.00-00-I029 (JM).

Drug repositioning for Mucopolysaccharidoses based on systems biology data
COSI: General Comp Bio
  • Gerda Cristal Villalba Silva, Universidade Federal do Rio Grande do Sul, Bioinformatics Core, Brazil
  • Ursula Silveira Matte, Universidade Federal do Rio Grande do Sul, Department of Genetics, Brazil

Short Abstract: Mucopolysaccharidoses (MPS) are lysosomal storage diseases characterized by defects in the activity of lysosomal hydrolases. In MPS, secondary cell disturbance affects pathways common to cancer. Hence, the study aimed to identify MPS-related drugs targeting oncogenic pathways and to compile a list of drugs for repurposing. Gene expression and ontology analyses were performed with the human MPS datasets GSE111906 (MPS I) and GSE23075 (MPS IIIB). We retrieved drug data from CTD, Drugbank, cmttDB, PharmacoDB, and GDSC. We used a Venn diagram to choose the drugs related to the oncogenic pathways for the next steps. We used STITCH v.5 and Cytoscape v.3.8.2 to build the protein-drug network. To improve the interaction networks, we used Omnipath v.2. The network was composed of 244 nodes, 13 of them related to drugs, and 1824 edges. In the Omnipath analysis, GSE111906 showed 47 enriched drugs in the interaction network, and GSE23075 showed 31 enriched drugs. Our results suggest that drugs modulating the axon guidance, EGFR, mTOR, Wnt, and immune system pathways are particularly promising for intervention. Furthermore, the list of drugs and related MPS-enriched genes could be useful not only for new treatments but also for pathophysiological studies.

Dynamics of Primate erythroparvovirus 1 Protein Sequence Change
COSI: General Comp Bio
  • Mohammad Asif Khan, Bezmialem Vakif University, Turkey / Perdana University, Malaysia, Turkey
  • Li Chuin Chong, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Malaysia
  • Pendy Tok, Faculty of Information Science and Technology, Multimedia University, Malaysia

Short Abstract: Primate erythroparvovirus 1, generally known as parvovirus B19 (B19V), causes erythema infectiosum, commonly seen in children. Herein, we report quantitative analyses of B19V proteome sequence diversity. All reported protein sequences of the virus from human hosts were retrieved from the NCBI Entrez Protein and Virus databases. A total of 6,562 protein sequences were collected, deduplicated (CD-HIT), processed into separate proteins (BLASTp), and aligned (Clustal Omega) individually. Only three (VPs: VP1 and VP2; NS1) of the six proteins had sufficient sample size (>30) for further analysis. Shannon’s entropy calculations and quantitative pattern analysis of sequence diversity motifs (index and its variants: major, minor and unique) were carried out for each of the overlapping proteome nonamer positions. The proteome studied had a low mean entropy of ~0.5, with only five positions exhibiting total variants close to 55%, indicating high overall conservation. The distribution of index, the predominant sequence, and its variant motifs across the individual protein nonamer positions illustrated a distinctive pattern of sequence change dynamics. The variants of the index originated mostly from ~26% of the proteome. Notably, as many as 156 nonamer positions were completely (100%) conserved, which merit further investigation as possible targets for vaccine design.

ENCODE Uniform Processing Pipeline Infrastructure
COSI: General Comp Bio
  • Jennifer Jou, Stanford University, United States
  • J. Michael Cherry, Stanford University, United States
  • Ben Hitz, Stanford University, United States
  • Matt Simison, Stanford University, United States
  • Stuart Miyasato, Stanford University, United States
  • Pedro Assis, Stanford University, United States
  • Emma Spragins, Stanford University, United States
  • Philip Adenekan, Stanford University, United States
  • Forrest Tanaka, Stanford University, United States
  • Jessica Au, Stanford University, United States
  • Paul Sud, Stanford University, United States
  • Khine Lin, Stanford University, United States
  • Ingrid Youngworth, Stanford University, United States
  • Bonita Lam, Stanford University, United States
  • Meenakshi Kagda, Stanford University, United States
  • Keenan Graham, Stanford University, United States
  • Jin Lee, Stanford University, United States
  • Otto Jolanki, Stanford University, United States
  • Idan Gabdank, Stanford University, United States

Short Abstract: The Encyclopedia of DNA Elements (ENCODE) Consortium has generated a wealth of genomic data for the purposes of identification and analysis of the functional elements in the human genome. Furthermore, ENCODE has also developed numerous uniform, reproducible and portable processing pipelines in order to generate readily integratable secondary and tertiary analysis products. Here we use the ENCODE Hi-C pipeline (github.com/ENCODE-DCC/hic-pipeline/) as a case study to illustrate the components that comprise a typical ENCODE production-grade pipeline, namely Cromwell (cromwell.readthedocs.io/), Caper (github.com/ENCODE-DCC/caper), Docker (www.docker.com/), WDL (github.com/openwdl/wdl), Google Cloud Platform (GCP), Amazon Web Services (AWS), and CircleCI (circleci.com/). Cromwell provides cross-platform workflow execution, and Caper provides a high-level, user-friendly wrapper over Cromwell. Docker is used for pipeline containerization, helping ensure reproducibility, and together with Caper allows users to run pipelines on their preferred backend infrastructure. Automated builds and testing are conducted on the CircleCI platform following every code change in GitHub. Cloud computing (GCP, AWS) provides the scalability required to expediently process the volumes of ENCODE data. We also describe a multi-level testing strategy that increases confidence in the reproducibility of pipeline outputs.

GeneCodis 4: Expanding the modular enrichment analysis to regulatory elements.
COSI: General Comp Bio
  • Adrian Garcia-Moreno, Pfizer-University of Granada-Junta de Andalucía Centre for Genomics and Oncological Research (GENYO), Spain
  • Raul López-Domínguez, Pfizer-University of Granada-Junta de Andalucía Centre for Genomics and Oncological Research (GENYO), Spain
  • Pedro Carmona-Saez, Centre for Genomics and Oncological Research (GENYO) and Biostatistics, Department of Statistics and O.R. UGR, Spain

Short Abstract: This work presents the extension of GeneCodis functionality to analyse regulatory elements, namely transcription factors, CpG sites and miRNAs. This is implemented with noncentral hypergeometric distribution models in order to address the bias previously reported in the literature when performing over-representation analyses of these biological entities. GeneCodis 4 also extends its database with new and updated sources and incorporates the GO Covid subset to better serve SARS-CoV-2 research. It provides a new gene-annotation network visualisation to capture functional modules dynamically. Finally, a set of miRNAs associated with heart disorders is studied to validate the new implementations. GeneCodis 4 is freely available at genecodis.genyo.es.
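
To make the idea of a noncentral hypergeometric over-representation test concrete, here is a small sketch. The abstract does not say which noncentral variant GeneCodis uses or how the bias weight is estimated, so this example assumes Fisher's noncentral hypergeometric distribution with a user-supplied odds weight `omega`; the function name and parameters are hypothetical and this is not the GeneCodis implementation.

```python
from math import comb


def nchg_upper_tail(x, N, m, n, omega=1.0):
    """P(X >= x) under Fisher's noncentral hypergeometric distribution.

    N: genes in the background
    m: background genes annotated to the term
    n: genes in the input list
    x: annotated genes observed in the input list
    omega: odds weight for annotated genes (omega = 1 gives the classic test)
    """
    lo = max(0, n - (N - m))
    hi = min(n, m)
    # Unnormalised probability of each possible overlap size y.
    weights = {
        y: comb(m, y) * comb(N - m, n - y) * omega ** y
        for y in range(lo, hi + 1)
    }
    total = sum(weights.values())
    return sum(w for y, w in weights.items() if y >= x) / total


# Same overlap, with and without a selection bias towards annotated genes.
print(nchg_upper_tail(2, N=10, m=4, n=3))             # classic hypergeometric
print(nchg_upper_tail(2, N=10, m=4, n=3, omega=2.0))  # bias-adjusted
```

With `omega` above 1, annotated genes are assumed to be intrinsically more likely to appear in the input list, so the same overlap yields a larger (less significant) p-value than the classic test.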

Generation of whole-genome, allele-specific maps for CRISPR-Cas9 genome editing
COSI: General Comp Bio
  • Jacob Bradford, Queensland University of Technology, Australia
  • Bo Zhou, Stanford University, United States
  • Alexander E Urban, Stanford University, United States
  • Dimitri Perrin, Queensland University of Technology, Australia

Short Abstract: The CRISPR-Cas9 system has become a leading tool for gene editing. However, designing guide RNAs for targeting specific regions is not trivial, as target sequences must maximise the likelihood of obtaining the desired cut, and minimise the risk of off-target modifications. Achieving this across entire genomes is computationally challenging, particularly when sequence variations must be considered. Here, we present an extended edition of our guide RNA design pipeline, Crackling, which now incorporates the ability to leverage haplotype-phased data to design allele-specific guide RNAs. Candidate sequences and off-target sites are extracted using both the reference and alternative sequences. On-target efficiency is assessed by combining three scoring methods with a customisable consensus vote, and our off-target indexing technology gains an order of magnitude in speed compared to other methods. To demonstrate the pipeline, we use data from the haplotype-resolved whole-genome characterization for marmoset ESC line cj367 from the Wisconsin National Primate Research Center. Due to anatomical and physiological similarities to humans, the common marmoset (Callithrix jacchus) is an ideal organism for the study of human diseases. These data and our gRNA design method make it easier to leverage genome editing for the in vivo biomedical modelling of human neuropsychiatric and neurodegenerative diseases.

Harmonising The Integration Of Biomedical Data, A Case Study On COVID-19 Patients
COSI: General Comp Bio
  • Bilge Gültepe, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Rümeyza Kazancıoğlu, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • İbrahim Tuncay, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Teoman Aydın, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Ramazan Özdemir, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Fatma Nur Okyaltırık, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Özlem Su Küçük, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Meliha Meriç Koç, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Kazım Karaaslan, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Bedia Gülen, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Esra Büşra Işık, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Faruk Üstünel, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Hatice Dilara Karakuş, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Ömer Erkam Engin, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Hasiba Karimi, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Rashid Mukaila, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia/Bezmialem Vakif University, Turkey, Turkey

Short Abstract: Harmonising the integration of biomedical datasets can introduce long delays when a timely response is required. Herein, we describe a case study aiming to rapidly integrate COVID-19 patient data for health analytics. We compared the effort required to harmonise the integration of COVID-19 clinical data from two departments of Bezmialem Vakif University (BVU) versus direct extraction from the BVU’s hospital central repository (HCR). Patient data were provided by two departments, which collectively accounted for 875 unique patient records (571 PCR positive) and covered 254 unique variables. Integration was manual and was met with multiple challenges, such as duplicates, conflicting information, missing data and errors (relative to the HCR), typos, and inconsistencies. Additionally, it required regular clarification from the respective departments. Taken together, these issues prolonged the final creation of the integrated, harmonised golden master dataset (approximately seven months). Subsequently, similar data categories were obtained directly from the HCR for the full list of unique patients. A computational pipeline was developed to automate the integration process. The general pipeline workflow included filtering out unrelated information and standardising variables. The pipeline took approximately five months to develop and was able to produce the golden master dataset within hours.

Identification, classification, and prioritization of most influential players in normal biological processes and diseases
COSI: General Comp Bio
  • Abbas Salavaty, Monash University, Australia
  • Mirana Ramialison, Monash University, Australia
  • Peter Currie, Monash University, Australia

Short Abstract: High-throughput technologies have enabled the identification and measurement of the activity of thousands of genes and proteins at a time and across several conditions. However, one of the biggest challenges is the selection of the right candidates amongst thousands of features for experimental functional validation. Currently, a number of different models have been developed for candidate gene prioritization, most of which rely on external sources of information and are not able to classify genes into “drivers”, “biomarkers”, and “mediators”. In contrast, ExIR (Experimental data-based Integrative Ranking) harnesses the Integrated Value of Influence (IVI) algorithm and combines it with machine learning techniques to extract, classify and prioritize candidate features from any type of experimental data, such as single-cell and bulk sequencing. ExIR is accessible as a rapid, user-friendly web-based application on the Influential Software Package web portal (influential.erc.monash.edu/ExIR/).

Interoperability of Dataset for Bioinformatic Studies of Arachis hypogaea L. and Arachis duranensis
COSI: General Comp Bio
  • Fortune Ogo-Ndah Awala, University of Port Harcourt, Nigeria
  • Osivmete Victor Andrew, Federal Polytechnic, Ukana Akwa Ibom State, Nigeria

Short Abstract: Source Code: www.legumefederation.org/
License: (Cyverse Data Commons)

Data interoperability is the ability of systems to create, exchange and consume data. It is useful in bioinformatics for integrating and joining up data from multiple sources and across systems, for better collaboration and research. When bioinformatic resources are open and public, they offer better accessibility, usability and reproducibility, and limitless improvement in research. The genomic data analyzed were sourced from a public data repository in Cogepedia, a software used in retrieval and comparative genome analysis. The plant genomes were obtained by name search, selected and added to the analysis environment utilizing an open-ended analysis network workflow. The sequences obtained were further subjected to the genome expression tools known as Epic-Coge. The results showed that Arachis duranensis has a genome length of 1,084,261,490, while Arachis hypogaea var hypogaea has a total genome length of 2,556,916,893. The analyses ranged from OrganismView to 3D SynMap, an interactive map showing the syntenic and non-syntenic relationships of the genomes in space. The research holds great potential for plant breeding and genetic engineering of this economic plant, and it was made possible through public dataset policy, which houses the concept of data interoperability.

Large-scale multi-mediator analyses under a composite null
COSI: General Comp Bio
  • En-Yu Lai, Institute of Statistical Science, Academia Sinica, Taiwan
  • Yen-Tsung Huang, Institute of Statistical Science, Academia Sinica, Taiwan

Short Abstract: Mediation analysis aims to evaluate the effect of a hypothetical causal mechanism that is from an exposure, through mediators, to an outcome. Therefore, the effect of the exposure on the mediator and the effect of the mediator on the outcome conditional on the exposure will jointly construct the mediation effect. Conventional test statistics of the mediation effect become conservative when signals are sparse. The power loss originates from 1) a poor approximation of the normal product distribution using the normal distribution and 2) the complications of a composite null hypothesis. Huang (2019) has proposed a novel test for single-mediator analyses to accurately assess the normal product distribution under a composite null hypothesis accounting for the composition of null hypotheses within a study. Here we extend the method to accommodate the setting with multiple mediators. We utilize Huang (2019)'s method to account for the composite null hypothesis and then exploit global testing procedures proposed by Sun and Lin (2019) to conduct multivariate tests. We conduct extensive simulation studies to evaluate the performance of the proposed method. In addition, we apply the method to the TCGA-LUAD dataset in order to select genes whose expression may be regulated by smoking-induced DNA methylation.
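
For readers unfamiliar with the term, the composite null can be written out in the common product-of-coefficients formulation. The symbols below are introduced here for illustration and do not appear in the abstract: alpha denotes the effect of the exposure on the mediator, and beta the effect of the mediator on the outcome conditional on the exposure.

```latex
\begin{align*}
H_0 &: \alpha\beta = 0, \qquad H_0 = H_{00} \cup H_{01} \cup H_{10}, \\
H_{00} &: \alpha = 0,\ \beta = 0, \\
H_{01} &: \alpha = 0,\ \beta \neq 0, \\
H_{10} &: \alpha \neq 0,\ \beta = 0 .
\end{align*}
```

Under H_{00}, the product of the two standardised estimates follows a normal product distribution rather than a normal distribution, which is the first source of power loss named above; the need to cover all three cases at once is the second.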

LongDat: an R package for confounder-sensitive longitudinal analysis on various data types
COSI: General Comp Bio
  • Chia-Yu Chen, MDC/Charité/ECRC, Germany
  • Sofia Kirke Forslund, MDC/Charité/ECRC, Germany

Short Abstract: Longitudinal data consist of repeated measurements on the same individuals over time. Compared to cross-sectional data, longitudinal data may enable the establishment of cause-and-effect relationships. Two crucial points must be considered for correct analysis of such data. First, when choosing statistical tests, one should take the distribution of data into account. Second, confounders, which are factors other than the tested hypothesized causes (e.g. treatments) that might influence the outcome, should be uncovered and/or controlled for to avoid false conclusions. No existing tool, however, is capable of dealing with different data types as input while systematically detecting and controlling confounding effects in longitudinal data. Therefore, we developed LongDat, an R package employing generalized linear mixed models (GLMMs) and non-parametric tests. LongDat analyzes longitudinal data in a confounder-sensitive manner and can be applied to various data types. The output tables from LongDat contain estimated significances, effect sizes and identified confounders for each feature, making downstream analysis convenient for users. A development version of the tool was used in a recently published study of gut immunomodulation under a dietary intervention to account for changes in probands’ medication dosages. In conclusion, LongDat serves as a confounder-sensitive analysis tool for various longitudinal data types.

Mapping the Minimal Set of the Viral Peptidome across all Major Viral Taxonomies
COSI: General Comp Bio
  • Li Chuin Chong, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Turkey, Malaysia
  • Mohammad Asif Khan, Bezmialem Vakif University, Turkey / School of Data Sciences, Perdana University, Malaysia, Turkey

Short Abstract: Sequence changes in viral genomes generate protein sequence diversity that enables viruses to evade the host immune system and poses challenges in the design of interventions. The total repertoire of antigenic diversity within a dataset of protein sequences can be represented by only a fraction of the sequences due to antigenic redundancy. Consequently, the idea of the minimal set of the viral peptidome is to identify the smallest set of protein sequences that can represent the antigenic diversity present in a given protein sequence dataset. This is achieved by subjecting a protein sequence dataset to two levels of data compression (duplicate and antigenic reductions) without incurring any loss of information in terms of the total antigenic repertoire. This was applied across ranks of taxonomic lineages (species, genus, and family) for major viruses with a sufficiently large number of reported protein sequences. As of December 2019, a total of 168 viral species originating from 85 genera and 46 families possessed at least 2,000 sequences. The minimal sets generated elucidated intricate patterns of antigenic diversity inherent across the taxonomic lineage ranks. The analysis enables closer interrogation of these patterns, revealing some order despite the complexity and positing new questions that merit further investigation.

Mapping the Protein Sequence Diversity Dynamics of Chikungunya Virus
COSI: General Comp Bio
  • Eyyüb Selim Ünlü, Istanbul Faculty of Medicine, Istanbul University, Istanbul, Turkey, Turkey
  • Li Chuin Chong, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Turkey, Malaysia
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia/Bezmialem Vakif University, Turkey, Turkey

Short Abstract: Chikungunya virus (CHIKV) is an emerging arthropod-borne pathogen that is a significant health threat. This study focuses on proteome-wide analysis of CHIKV sequence diversity dynamics. All reported CHIKV protein sequences (5,820) isolated from humans were collected from the NCBI Virus database. The sequences were grouped by the 9 CHIKV proteins using BLASTp, deduplicated, and aligned. Shannon’s entropy was calculated for each of the aligned overlapping nonamer positions of the proteins. Furthermore, each position was analyzed for diversity motifs (index and its variants: major, minor and unique). The proteome-wide mean entropy was low (~0.41), indicating that the emerging virus is highly conserved. Nonetheless, only ~7.4% of nonamers were completely conserved. The most diverse protein was E3 envelope glycoprotein (average entropy ~1.19), while the most conserved was nonstructural protein 2 (~0.22). The absolute peak entropy (2.71) was observed in the capsid protein. The predominant nonamer (the index) and its variants elucidated distinctive patterns of inherent sequence change dynamics, with total variants restricted to a maximum of ~71.7%. The major variant was notable and challenged the incidence of the index at 14 of the positions. Understanding CHIKV diversity may provide important insights into the evolution of the virus and its interaction with the host.

META-ANALYSIS OF TRANSCRIPTOME REVEALS BIOMARKER PAIRS IN TETRALOGY OF FALLOT
COSI: General Comp Bio
  • Sona Charles, Bharathiar University, India
  • Jeyakumar Natarajan, Bharathiar University, India

Short Abstract: Tetralogy of Fallot (TOF) is a cyanotic congenital condition to which genetic, epigenetic and environmental factors contribute. We applied sparse machine learning algorithms to RNAseq and sRNAseq data to select prospective biomarker candidates. Furthermore, we applied filtering techniques to identify a subset of biomarker pairs in TOF. Differential expression analysis disclosed 2757 genes and 214 miRNAs that are dysregulated. Weighted gene co-expression network analysis on the differentially expressed genes extracted 5 significant modules that are enriched in the GO terms extracellular matrix, signaling and calcium ion binding. voomNSC selected 2 genes and 5 miRNAs, and transformed PLDA predicted 72 genes and 38 miRNAs as prognostic biomarkers. Among the selected biomarkers, miRNA target analysis revealed 14 miRNA-gene interactions. Ten of the 14 pairs were oppositely expressed, and 4 of these 10 pairs shared the common pathways of focal adhesion and PI3K-Akt signaling. In conclusion, our study demonstrated the concept of biomarker pairs, which may be considered for clinical validation given their strong literature and experimental support.

MetaboNet: An application and database for small molecule - gene interactions
COSI: General Comp Bio
  • Anuradha Surendra, National Research Council Canada, Canada
  • Miroslava Cuperlovic-Culf, NRC, Canada

Short Abstract: The objective of this work was to (i) provide a systematic understanding of small molecule (including metabolites and drugs) - gene relationships, (ii) compile a database (MetaboNet) of interaction data from publicly available information, and (iii) build an interactive website that links the compiled information for querying and seamless information generation by the end-user.
Interaction data were parsed using R and Python scripts from 4 different publicly available databases: DrugBank, HMDB, PDBbind and BRENDA. Furthermore, the corresponding information from PubChem, UniProt and PubMed was extracted.
The PostgreSQL database is split into separate tables for each collated data source. These tables incorporate details such as compound name, gene ID, gene name, UniProt ID and PubChem ID. The PubChem, UniProt and PubMed tables provide detailed information on various properties. These relational tables are connected by the DBID (a unique identifier), PubChem IDs, UniProt IDs and PubMed publication IDs.
The interactive website presenting the database to the end-user is built mainly using R Shiny, incorporating a number of R libraries.
By combining published interaction data, the MetaboNet database and application interface provide an overview as well as a network of drug or metabolite and gene interactions.

MoSwA: Protein Sequence Diversity Motif Switch Analyser for Viruses
COSI: General Comp Bio
  • Asif M. Khan, Perdana University, Kuala Lumpur, Malaysia / Bezmialem Vakif University, Beykoz, Istanbul, Turkey, Turkey
  • Muhammet Celik, Bezmialem Vakif University, Istanbul, Turkey / Konya Food and Agriculture University, Konya, Turkey, Turkey
  • Kaushal Kumar Singh, Quantum Cipher Private Limited, New Delhi, India, India
  • Shan Tharanga, Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia, Malaysia

Short Abstract: Protein sequence diversity is one of the major challenges in the design of interventions against viruses. Shannon’s entropy has been used as a quantitative measure of protein sequence diversity, applied via a user-defined k-mer sliding window. Studies have classified distinct k-mer peptides at a given position into diversity motifs based on their incidence: index (predominant sequence), major (most common) variant, unique (singleton) and minor (incidence between major and unique). Motif switching at a given k-mer alignment position is a phenomenon where fitness change in one or more amino acids, such as through mutations, changes the incidence of a given k-mer sequence across its overlapping positions, resulting in a sequence rank change, and thus, a motif change. Identifying k-mer positions that exhibit a motif switch and determining the nature of the switches has been a challenge, given the large number of possible switch combinations and their omnipresence. Herein, we present MoSwA (github.com/macelik/MoSwA), a tool that not only identifies all alignment k-mer positions that exhibit motif switching, but also provides a multi-faceted and extensive characterisation of the switches. The input to MoSwA is a protein multiple sequence alignment, and the tool enables comparative analyses of motif switches within and between viral species proteomes.
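
The diversity motif definitions above can be sketched in a few lines. This is an illustration of the classification rule only, not MoSwA's code or interface: the function name is hypothetical, and two tie-handling choices are assumptions (a variant seen once is always labelled unique, and ties in incidence are broken by order of first appearance).

```python
from collections import Counter


def classify_motifs(kmers):
    """Label each distinct k-mer at one alignment position with a diversity motif."""
    ranked = Counter(kmers).most_common()  # sorted by decreasing incidence
    labels = {ranked[0][0]: "index"}       # predominant sequence
    major_assigned = False
    for seq, count in ranked[1:]:
        if count == 1:
            labels[seq] = "unique"         # singleton variant
        elif not major_assigned:
            labels[seq] = "major"          # most common variant
            major_assigned = True
        else:
            labels[seq] = "minor"          # incidence between major and unique
    return labels


# Hypothetical 3-mers observed at a single alignment position.
position = ["AAA"] * 5 + ["AAB"] * 3 + ["AAC"] * 2 + ["AAD"]
print(classify_motifs(position))
```

A motif switch could then be detected by comparing the label a given sequence receives at neighbouring overlapping positions, although the abstract does not specify how MoSwA performs that comparison.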

Multi-omics data integration reveals correlated regulatory features of triple negative breast cancer
COSI: General Comp Bio
  • Kanishka Manna, University of Arkansas for Medical Sciences, United States
  • Kevin Chappell, University of Arkansas for Medical Sciences, United States
  • Charity Washam, University of Arkansas for Medical Sciences, United States
  • Duah Alkam, University of Arkansas for Medical Sciences, United States
  • Jordan Bird, University of Arkansas for Medical Sciences, United States
  • Allen Gies, University of Arkansas for Medical Sciences, United States
  • Stephanie Byrum, University of Arkansas for Medical Sciences, United States

Short Abstract: Triple negative breast cancer (TNBC) is an aggressive type of breast cancer with very few treatment options. TNBC is very heterogeneous, with large alterations in the genomic, transcriptomic, and proteomic landscapes leading to various subtypes with differing responses to therapeutic treatments. We applied a multi-omics data integration method to evaluate the correlation of important regulatory features in TNBC BRCA1 wild-type MDA-MB-231 and TNBC BRCA1 5382insC mutated HCC1937 cells compared with non-tumorigenic epithelial breast MCF10A cells. The data include DNA methylation, RNAseq, protein, phosphoproteomics, and histone post-translational modification. Data integration showed that the regulatory features identified from each omics method had greater than 80% positive correlation within each TNBC subtype. Key regulatory features distinguishing the three cell lines were identified at each omics level and were involved in important cancer-related pathways such as TGFβ signaling, PI3K/AKT/mTOR, and Wnt/beta-catenin signaling. The DNA methylation and RNAseq data are freely available via GEO GSE171958, and the proteomics data are available via ProteomeXchange PXD025238.

mutSigMapper: an R package to map spectra to mutational signatures based on shot-noise modeling
COSI: General Comp Bio
  • Julian Candia, National Institutes of Health, United States

Short Abstract: Background: Mutational signatures are quantitative representations of mutagenic processes in a discrete space of somatic mutation motifs. The mutational profile (spectrum) of individual cancer samples can be compared against a compendium of mutational signatures to inform of possible etiologies, features for prognostic and biologic stratification, and vulnerabilities to be exploited therapeutically. A critical shortcoming of existing software for mutational signature analysis, however, is the difficulty of finding parsimonious and biologically plausible associations.

Results: Exploiting the analogy between mutagenic exposures and shot-noise phenomena in optics and electronics, we propose a model to generate spectral ensembles that allow a quantitative, non-parametric assessment of statistical significance for the association between mutational signatures and observed spectra. The package implements Poisson and negative binomial noise models. As a case example, the analysis of 60 WGS colorectal adenocarcinomas shows signatures SBS1, SBS6, SBS10 and SBS18, in excellent agreement with previously reported observations.

Significance: The central question we aim to address is how to assess, in a statistically meaningful way, the significance of the association between spectra and mutational signatures. With mutSigMapper, we propose a framework to robustly map spectra to signatures based on shot-noise modeling, which fills an important gap in the existing software for mutational signature analysis.
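
As a rough illustration of the ensemble idea, the sketch below draws Poisson shot-noise spectra from a signature and asks how often a simulated spectrum is no closer to the signature than the observed one. Everything here is an assumption made for illustration: the function names, the use of cosine similarity as the test statistic, the add-one empirical p-value, and the toy four-motif signature. It is written in Python rather than R and is not mutSigMapper's interface or its actual statistic.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ensemble_pvalue(spectrum, signature, n_draws=500, seed=0):
    """Empirical p-value that the spectrum is compatible with signature + shot noise."""
    rng = random.Random(seed)
    total = sum(spectrum)
    observed = cosine(spectrum, signature)
    hits = 0
    for _ in range(n_draws):
        # One ensemble member: independent Poisson counts around the expected spectrum.
        noisy = [poisson(total * p, rng) for p in signature]
        if cosine(noisy, signature) <= observed:
            hits += 1
    return (hits + 1) / (n_draws + 1)


signature = [0.5, 0.3, 0.2, 0.0]                      # hypothetical 4-motif signature
print(ensemble_pvalue([50, 30, 20, 0], signature))    # compatible spectrum
print(ensemble_pvalue([0, 0, 0, 100], signature))     # incompatible spectrum
```

A small p-value here means the observed spectrum is further from the signature than shot noise alone would explain, so the association would be rejected for that signature.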

NCBI Datasets, a new resource for fast, easy access to NCBI sequence data
COSI: General Comp Bio
  • Nuala Oleary, NCBI, United States
  • Eric Cox, NCBI, United States
  • Brad Holmes, NCBI, United States
  • Anne Ketter, NCBI, United States
  • Vichet Hem, NCBI, United States
  • Robert Falk, NCBI, United States
  • William Anderson, NCBI, United States
  • Xuan Zhang, NCBI, United States
  • Wes Ulm, NCBI, United States
  • Greg Schuler, NCBI, United States
  • Valerie Schneider, NCBI, United States
  • Peter Meric, NCBI, United States

Short Abstract: NCBI is the world’s most extensive public repository of genome sequence, annotation, and metadata for organisms across the tree of life. The rapid increase in volume and complexity of genomic data has made it increasingly challenging for researchers to find and retrieve comprehensive genome datasets in formats that are convenient for their workflows. Furthermore, researchers need data structures, access mechanisms, and sharing capabilities that adhere to the FAIR principles (Findable, Accessible, Interoperable, and Reusable). Toward this goal, NCBI introduces Datasets, a new resource that provides intuitive and user-friendly web, command line, and API interfaces for accessing NCBI sequence data. Datasets delivers data as a coherent data package including genome, transcript, and protein sequence, annotation, and a JSON-lines formatted data report of metadata. Lastly, we provide command line tools for parsing and converting data into user-friendly formats, and Python and R libraries that allow researchers to access the API. This presentation will demonstrate the latest Datasets features and show how to use Datasets to integrate NCBI sequence and metadata into analysis workflows.

NetExtract: Pathways reconstruction from phosphoproteomics data
COSI: General Comp Bio
  • Evangelia Petsalaki, EMBL-EBI, United Kingdom
  • Girolamo Giudice, EMBL-EBI, United Kingdom

Short Abstract: Signalling pathways regulate the cell’s response to external stimuli and modulate some of the most important biological processes. A small alteration in signalling pathways can fuel cancer initiation and progression, making signalling pathways attractive candidates for anti-cancer drugs. Despite great efforts, the dynamic pattern that takes place inside the cell upon stimulation remains poorly understood.
Phosphoproteomics data could help to dissect the dynamic networks active in a cell in each condition leading to a specific cell response or phenotype. However, the data tend to be sparse, relatively noisy and poorly reproducible, making it a challenge to detect the right signal and compare across datasets.
To overcome these limitations, we developed NetExtract. NetExtract simulates the global signal propagation employing a random walk with restart algorithm. Next, ego networks, centred on phosphorylated proteins, are employed to propagate the effect of the differentially phosphorylated proteins on the local neighbourhood and reconstruct the signal transduction. The enrichment analysis performed on the extracted subnetworks shows that NetExtract (i) amplifies and detects the correct background signals, yielding context-specific signalling networks, and (ii) permits comparison of phosphoproteomics datasets even when the overlap between them is poor.
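
For readers unfamiliar with random walk with restart, the following is a minimal sketch of the algorithm on a toy undirected network. It is not NetExtract's code: the function name, the restart probability of 0.3, the uniform seed weights, and the four-protein path graph are assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adj: dict mapping each node to a list of its neighbours
    seeds: nodes where the walk starts and restarts (e.g. phosphorylated proteins)
    """
    nodes = list(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            moving = (1.0 - restart) * p[u]
            if adj[u]:
                # Otherwise it moves to a uniformly chosen neighbour.
                share = moving / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                # Assumption: isolated nodes send their mass back to the seeds.
                for n in nodes:
                    nxt[n] += moving * p0[n]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Toy path network A - B - C - D with the signal seeded at A.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(network, seeds={"A"})
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Nodes closer to the seed receive higher scores, which is what allows the signal from a sparse set of measured phosphoproteins to be spread over their network neighbourhood.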

Optimizing Model Selection for Glioblastoma Utilizing Gene Expression
COSI: General Comp Bio
  • Vishal Oza, University of Alabama at Birmingham, United States
  • Brittany Lasseigne, University of Alabama at Birmingham, United States
  • Jennifer Fisher, University of Alabama at Birmingham, United States
  • Avery Williams, University of Alabama at Birmingham, United States

Short Abstract: Glioblastoma (GBM) is a debilitating brain cancer that affects around 210,000 people worldwide. Currently, disease diagnosis and monitoring are typically done via tissue biopsy, but it is invasive, difficult in cases of tumor inaccessibility, and only provides a single snapshot that may not be representative of disease heterogeneity or etiology. Further, there has been difficulty determining viable treatment options for the disease, which has a high relapse and morbidity rate. One possibility for improved patient diagnosis and treatment is the preclinical model, in this case particularly cell lines and patient-derived xenografts (PDXs), as they are valuable tools for studying disease etiology and treatment efficacy.
In this study, I used public cohort data from The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE), and the Mayo Clinic Brain Tumor Patient-Derived Xenograft National Resource for GBM patient tissue, cell line, and PDX gene expression, respectively. With these data, I performed hierarchical clustering and rank correlation to identify global patterns and differences that may suggest advantages and weaknesses of given preclinical models as avatars for specific patients. Our long-term goal is to identify the best model for a patient and to develop computational approaches for assessment and further analyses.
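
The rank correlation step can be illustrated with a short sketch that scores preclinical models against a patient profile using Spearman correlation. The function names, the toy expression values, and the idea of picking the single highest-scoring model are assumptions for illustration; the abstract does not describe the exact procedure. Profiles are assumed to be non-constant and measured over the same ordered list of genes.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def best_model(patient, models):
    """Name of the model whose expression profile ranks genes most like the patient."""
    return max(models, key=lambda name: spearman(patient, models[name]))


# Hypothetical expression values for four genes.
patient = [1.0, 2.0, 3.0, 4.0]
models = {"cell_line": [4.0, 3.0, 2.0, 1.0], "pdx": [10.0, 20.0, 30.0, 40.0]}
print(best_model(patient, models))  # prints pdx
```

Because only ranks are compared, the score is insensitive to platform-specific scaling differences between patient tissue, cell line, and PDX expression data.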

Pathway enrichment, machine learning and causal reasoning analysis to deconvolute potential targets of a small-molecule tau aggregation inhibitor
COSI: General Comp Bio
  • Layla Hosseini-Gerami, University of Cambridge, United Kingdom
  • Andreas Bender, University of Cambridge, Germany
  • David Collier, Eli Lilly and Company, United Kingdom
  • Howard Broughton, Eli Lilly and Company, Spain
  • Emma Laing, GlaxoSmithKline, United Kingdom
  • David Evans, DeepMind, United Kingdom
  • Suchira Bose, Eli Lilly and Company, United Kingdom
  • Elena Ficulle, University College London, United Kingdom
  • Brian Eastwood, Eli Lilly and Company, United Kingdom
  • James Scherschel, Eli Lilly and Company, United States
  • David Airey, Eli Lilly and Company, United States
  • Neil Humphryes-Kirilov, C4X Discovery, United Kingdom

Short Abstract: In this work we employed bioinformatics analysis, including pathway enrichment and causal reasoning, of an in vitro tauopathy model consisting of cultured rat cortical neurons seeded with human-derived tau aggregates and treated with a tool compound that modulates tau aggregation. Gene expression data were generated and used to derive additional hypotheses for the compound's mode of action through causal reasoning and pathway enrichment analysis. In parallel, we performed ligand-target prediction using the compound's chemical structure. Combining the different approaches, we found mechanistic evidence involving processes related to Alzheimer's disease (AD) progression, including cholesterol homeostasis and neuroinflammation. On the pathway level, we found pathways related to these two processes, including “Superpathway of cholesterol biosynthesis” and “Granulocyte adhesion and diapedesis”. With causal reasoning, we inferred differential activity of SREBF1/2 (involved in cholesterol regulation) and mediators of the inflammatory response such as NFKB1. Additionally, through structure-based ligand-target prediction we predicted the intracellular cholesterol carrier NPC1 as well as NF-κB subunits as potential targets of the compound. This study has furthered our understanding of the likely mechanism of action of a small-molecule tau aggregation inhibitor, potentially extending its mode of action beyond direct aggregation with tau.

POMaDe (Perturb-Omics Machine Learning target Deconvolution): machine learning for molecules mode of action classification using perturb-omics data
COSI: General Comp Bio
  • Yannick Cogne, Bayer Crop Science, France
  • Luigi Di Vietro, Bayer Crop Science, France

Short Abstract: The identification of the cellular targets and the modes of action (MoA) of a bioactive compound is a critical step in drug discovery, both for de-risking possible toxicity and for enabling better compound optimization.
Among the different approaches used in target deconvolution, MoA classification using transcriptomic fingerprinting is a promising methodology, but its potential has been hampered by the lack of sufficient data and of high-performing analytical methods to process them.
Nevertheless, in recent years, very large transcriptomics datasets spanning different parameters (small molecules, gene knockdown/knockout, cell lines, time and dosage), such as the L1000 project, have been made available through the efforts of large consortia like the Library of Integrated Network-Based Cellular Signatures (LINCS).
The aim of our project is to build a machine learning (ML) method capable of classifying the MoA of small molecules according to the transcriptomic fingerprints present in the L1000 database.
To this end, we have tested different ML algorithms and applied them to different data subsets to separate the effects of individual parameters and to test different experimental setups and their impact on classification performance.
Our method is a new, high-performing take on MoA classification that makes optimal use of transcriptomics data and machine learning algorithms.

rAMPage: Rapid Antimicrobial Peptide Annotation and Gene Estimation
COSI: General Comp Bio
  • Diana Lin, Canada's Michael Smith Genome Sciences Centre, Canada
  • Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, Canada
  • Sambina Aninta, Canada's Michael Smith Genome Sciences Centre, Canada
  • Chenkai Li, Canada's Michael Smith Genome Sciences Centre, Canada
  • Rene L. Warren, Canada's Michael Smith Genome Sciences Centre, Canada
  • Caren Helbing, Department of Biochemistry and Microbiology, University of Victoria, Canada
  • Linda Hoang, Department of Pathology and Laboratory Medicine, University of British Columbia, Canada
  • Inanc Birol, Department of Medical Genetics, University of British Columbia, Canada

Short Abstract: Antimicrobial peptides (AMPs) are a family of short defence proteins produced naturally by all organisms. Since AMPs do not confer resistance as easily as antibiotics, they are a potential alternative to antibiotics. Past research has shown that amphibians have the richest known AMP diversity; in particular, the North American bullfrog has demonstrated potential in aiding the discovery of novel putative AMPs. Antibiotic resistance is becoming more prevalent each day, requiring agricultural practices to reduce the use of antibiotics to protect human health, animal health, and food safety. rAMPage is a scalable bioinformatics-based discovery platform for mining AMP sequences in publicly available genomic resources. RNA-seq amphibian and insect reads from the Sequence Read Archive (SRA) are used. After trimming, reads are assembled with RNA-Bloom into transcripts, filtered, and translated in silico. Then, the translated protein sequences are compared to known AMP sequences from the NCBI protein database and specific AMP databases via homology search. These sequences are cleaved into their mature/bioactive form. Next, the machine learning algorithm AMPlify is employed to classify and prioritize the candidate AMPs based on their AMP probability score. Finally, these candidate AMPs are annotated and characterized. Across 84 datasets, rAMPage detected >1000 putative AMPs for downstream validation.

Repurposing DX-600, an ACE2 inhibitor, against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein
COSI: General Comp Bio
  • Muhammed Miran Öncel, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Elif Karaaslan, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Nadir Emre Herdan, Faculty of Medicine, Bezmialem Vakif University, Fatih 34093, Istanbul, Turkey, Turkey
  • Nesibe Selma Çetin, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Merve Kalkan-Yazıcı, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Choi Sy Bing, Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia, Malaysia
  • Ayesha Fatima, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia/Bezmialem Vakif University, Turkey, Turkey
  • Mehmet Ziya Doymaz, Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz 34820, Istanbul, Turkey, Turkey

Short Abstract: The global research community is focused on new approaches to reduce the time and cost of developing an antiviral against COVID-19 infection. In this study, we explored the possibility of repurposing an inhibitor of angiotensin-converting enzyme 2 (ACE2), DX-600, as an inhibitory ligand to the receptor-binding domain of the spike protein of SARS-CoV-2. Higher-affinity variants of the DX-600 sequence were explored by use of the BLOSUM 62 matrix, followed by docking and molecular dynamics (MD) simulations. Given the astronomical number of possible combinations, 60,000 peptides were generated for the initial screening. The top ten peptides were evaluated by molecular dynamics simulations and synthesised for wet-lab validation. The DX-600 peptide and its two closely related variants, although they showed promising in silico results, did not neutralize virus infection in our in vitro infection model. Whether these peptides show inhibitory effects at the viral transcription or viral gene expression level remains to be addressed. However, at the level of overall viral cytopathic effect, DX-600 peptides do not seem to support a significant antiviral outcome, at least in the model applied.

ritmic: an R package to study the Regulation of Tumor Microenvironment
COSI: General Comp Bio
  • Magali Richard, CNRS, France
  • Daniel Jost, ENS Lyon, France
  • Clementine Decamps, TIMC - University Grenoble Alpes, France

Short Abstract: Current understandings of cancer biology and therapeutic strategies are based on the level of inter-patient classification and neglect the fact that cancers consist of cells with different identities and origins (cell heterogeneity). Here, we propose a new method and a corresponding package to take advantage of recent advances in high throughput sequencing technologies to study how the gene expression of ‘pure’ tumor cells specifically contributes to the regulation of the immune microenvironment.
First, we reconstitute a surrogate differential expression matrix specific to tumor cells, at the patient/individual level, using a reference free deconvolution algorithm (EDec) and a personalized differential expression approach (PenDA). Second, we statistically infer which genes are involved in the regulation of the microenvironment composition.
Using a realistic benchmark of simulated pancreatic tumors, we demonstrated that ritmic achieved high specificity and sensitivity in detecting tumor-specific genetic regulation of immune cell fractions. We are currently applying our pipeline to several independent cohorts of non-small cell lung cancer to validate the method in a real pathological context. These results will help interrogate and revisit the current understanding of tumorigenesis in the light of genetic models accounting for tumor heterogeneity.

scGNN: a novel graph neural network framework for single-cell RNA-Seq analyses
COSI: General Comp Bio
  • Juexin Wang, University of Missouri, United States
  • Anjun Ma, Ohio State University, United States
  • Qin Ma, Ohio State University, United States
  • Dong Xu, Univ. of Missouri-Columbia, United States

Short Abstract: Single-cell RNA-sequencing (scRNA-Seq) is widely used to reveal the heterogeneity and dynamics of tissues, organisms, and complex diseases, but its analyses still suffer from multiple grand challenges, including the sequencing sparsity and complex differential patterns in gene expression. We introduce the scGNN (single-cell graph neural network) to provide a hypothesis-free deep learning framework for scRNA-Seq analyses. This framework formulates and aggregates cell–cell relationships with graph neural networks and models heterogeneous gene expression patterns using a left-truncated mixture Gaussian model. scGNN integrates three iterative multi-modal autoencoders and outperforms existing tools for gene imputation and cell clustering on four benchmark scRNA-Seq datasets. In an Alzheimer’s disease study with 13,214 single nuclei from postmortem brain tissues, scGNN successfully illustrated disease-related neural development and the differential mechanism. scGNN provides an effective representation of gene expression and cell–cell relationships. It is also a powerful framework that can be applied to general scRNA-Seq analyses.

Semi-supervised identification of SARS-CoV-2 molecular targets
COSI: General Comp Bio
  • Kristen Beck, IBM Research, United States
  • Edward Seabolt, IBM Research, United States
  • Akshay Agarwal, IBM Research, United States
  • Gowri Nayar, IBM Research, United States
  • Simone Bianco, IBM Research, United States
  • Harsha Krishnareddy, IBM Research, United States
  • Vandana Mukherjee, IBM Research, United States
  • James Kaufman, IBM Research, United States

Short Abstract: SARS-CoV-2 genomic sequencing efforts scaled dramatically to address the current global pandemic and aid public health. In this work, we analyzed a corpus of 66,000 SARS-CoV-2 genome sequences. We developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on a single reference genome and by overcoming atypical genome traits. Using this method, we identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools such as Prokka (base) and VAPiD, we obtained a 6.4- and 1.8-fold increase in protein annotations, respectively. Our method generated 13,000,000 molecular target sequences, some conserved across time and geography while others represent emerging variants. We observed 3,362 non-redundant sequences per protein on average within this corpus and describe key D614G and N501Y variants spatiotemporally. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized Receptor Binding Domain variants. Here, we comprehensively present the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable high-accuracy method to analyze newly sequenced infections.

Single-cell analytics for phospho flow cytometry reveals dynamic interactions between molecular pathways
COSI: General Comp Bio
  • Paul Pavlidis, Department of Psychiatry, The University of British Columbia, Canada
  • Yue Huang, Graduate Program in Bioinformatics, The University of British Columbia, Canada
  • Patrick Coleman, Djavad Mowafaghian Centre for Brain Health, The University of British Columbia, Canada
  • Fabian Meili, Djavad Mowafaghian Centre for Brain Health, The University of British Columbia, Canada
  • Warren M. Meyers, Djavad Mowafaghian Centre for Brain Health, The University of British Columbia, Canada
  • Wun C. Sin, Djavad Mowafaghian Centre for Brain Health, The University of British Columbia, Canada
  • Kurt Haas, Djavad Mowafaghian Centre for Brain Health, The University of British Columbia, Canada

Short Abstract: Quantification of large single-cell measures acquired by flow cytometry typically involves establishing inclusion gate thresholds and combining measures from accepted cells into a single median metric.

Here, we have formulated approaches to extract additional information from these population data sets involving dose-response and interactions between multiple molecules from flow cytometry data sets.

We used phospho flow cytometry for multiplexed sampling of cell physical features together with primary antibodies against protein markers, including GAPDH as a protein expression control and 8 antibodies detecting the activation (phosphorylation) state of 8 proteins. Two panels of phospho antibodies were used simultaneously for multiplexed measures in the same cells.

Our approach involves single-cell standardization (by GAPDH), fitting loess regression, identifying linear domains in dose-response plots, building linear mixed-effects models, and multi-dimensional analyses to detect interactions between markers.

We demonstrate the utility of this approach by expressing wild-type PTEN and 5 PTEN variants and measuring 8 markers of molecular pathways downstream of PTEN; we also expressed wild-type RHEB to test its impact on markers in the shared associated pathways.

Results demonstrate dose-response and molecular pathway interactions that are unavailable when population data are reduced to single values. Our approach shows strong promise for variant function measurement and molecular pathway inference.

Structural and computational analysis of sense-antisense chimeric transcripts reveals their potential regulatory roles in human cells
COSI: General Comp Bio
  • Sumit Mukherjee, Bar-Ilan University, Israel
  • Rajesh Detroja, Bar-Ilan University, Israel
  • Milana Frenkel-Morgenstern, Bar-Ilan University, Israel

Short Abstract: Many human genes are transcribed from both strands and produce sense-antisense gene pairs. Sense-antisense (SAS) chimeric transcripts are produced upon the coalescing of exons/introns from both sense and antisense transcripts of the same gene. SAS chimeras were first reported in a prostate cancer cell line. Subsequently, numerous SAS chimeras have been reported in the ChiTaRS database. Still, the functional implications and evolutionary significance of SAS chimeras remain elusive. We investigated the structural and functional aspects of SAS chimeras. We found that longer palindromic sequences are a unique feature of SAS chimeras. Structural analysis indicates that these long palindromic sequences give rise to a long hairpin-like structure formed by many consecutive Watson-Crick base pairs, which possibly plays a role similar to that of double-stranded RNA (dsRNA) in interfering with gene expression. RNA-RNA interaction analysis suggested that SAS chimeras could significantly interact with their parental mRNAs, indicating their potential regulatory roles. Finally, we found several SAS chimeras in the RNA-seq data of different healthy human tissues and detected their potential orthologs in mice, highlighting their possible regulatory roles. Our study is the first comprehensive analysis of SAS chimeras in humans and establishes their potential for functional regulation.

Studying the Dynamics of Ebola Virus (EBOV) Proteome Sequence Diversity
COSI: General Comp Bio
  • Hasiba Karimi, Bezmialem Vakif University, Turkey
  • Li Chuin Chong, Perdana University, Malaysia / Bezmialem Vakif University, Turkey, Turkey
  • Eyyüb Selim Ünlü, Bezmialem Vakif University, Turkey
  • Mohammed Miran Öncel, Bezmialem Vakif University, Turkey
  • Mohammad Asif Khan, Perdana University, Malaysia / Bezmialem Vakif University, Turkey, Turkey

Short Abstract: Ebola virus disease (EVD) most commonly affects humans and other primates, causing a highly fatal hemorrhagic fever. Herein, we describe an analysis of Zaire Ebola virus (ZEBOV) proteome sequence diversity. A total of 23,543 ZEBOV sequences from the human host were downloaded from the NCBI Virus database, deduplicated (CD-HIT), separated by BLASTp into the eight encoded proteins, and aligned (MAFFT and MUSCLE). Shannon’s entropy and motif diversity analyses (index and its variants: major, minor and unique) were performed with DiMA for each of the aligned overlapping nonamer positions to measure proteome diversity. The ZEBOV proteome (2,706 unique sequences) was highly conserved, with a mean entropy value of ~0.3. The Envelope glycoprotein (GP) was the most diverse (average entropy ~0.65), while VP30 was the most conserved (average entropy ~0.16). The highest entropy (~1.84) was observed in GP, but with the peak incidence (~54.69%) of total variants in the sGP. Complete conservation was observed for 19% of the proteome positions. Motif diversity analysis revealed notable patterns of sequence change, distinct between the proteins, against a backdrop of high host fatality rate. The results have new implications for understanding the evolution of the virus and for vaccine and drug design.
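The per-position entropy measure can be sketched as follows. This is a simplified illustration and not the DiMA implementation, which may treat gaps, ambiguous residues, and sample-size bias differently; the function name and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def nonamer_entropy(aligned_seqs, k=9):
    """Shannon entropy (bits) of the k-mers observed at each overlapping alignment window.

    aligned_seqs: equal-length aligned protein sequences (hypothetical toy input).
    Returns one entropy value per window start position.
    """
    length = len(aligned_seqs[0])
    entropies = []
    for start in range(length - k + 1):
        # Count the distinct k-mers seen at this window across all sequences.
        counts = Counter(seq[start:start + k] for seq in aligned_seqs)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Two toy sequences that differ only at the last residue.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKM"]
window_entropy = nonamer_entropy(toy_alignment)
```

A window in which every sequence carries the same nonamer has entropy 0 (complete conservation), while a window split evenly between two nonamers has entropy 1 bit.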

Super3Path: Identifying commonly enriched biological pathways using three pathway databases
COSI: General Comp Bio
  • Renata Fu, Scarsdale High School, United States
  • Yongsheng Bai, Next Generation Intelligence Science Training, United States

Short Abstract: In recent years, the drastic increase in publicly available pathway information has enabled researchers to easily identify enriched pathways between cell conditions, making the development of gene targeted therapy more efficient and affordable. However, many existing studies fail to analyze results from multiple pathway databases because different databases often have different names for a common biological process, making it hard for researchers to utilize multiple databases in a single study.

To address this urgent need, we developed a user-friendly bioinformatics pipeline, Super3Path, that utilizes three publicly available pathway databases (the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and WikiPathways) to categorize pathways based on the databases that present them. Super3Path includes an R script that uses pathway data from the aforementioned databases to perform gene set enrichment analysis. It also features a Python script that outputs equivalent and hierarchical relationships between enriched pathways from different databases. Our Python script uses a mapping catalog developed by ComPath to report commonly enriched pathways in the three pathway databases. We evaluated Super3Path using three Gene Expression Omnibus (GEO) datasets. Super3Path output the significant pathways reported by the related studies, confirming that it is effective. Super3Path is available as a GitHub repository at github.com/Renata-Fu/Super3Path.

The anti-influenza potential and dynamic simulation of active phytochemicals from Canarium patentinervium Miq
COSI: General Comp Bio
  • Najwan Jubair, UCSI university, Malaysia
  • Mogana R, UCSI university, Malaysia
  • Ayesha Fatima, Quest International University, Malaysia
  • Anna A. Muryleva, St. Petersburg Pasteur Institute, Russia
  • Vladimir V. Zarubaev, St. Petersburg Pasteur Institute, Russia
  • Nor Hayati Binti Abdullah, Forest Research Institute Malaysia, Malaysia
  • Christophe Wiart, University of Nottingham, Malaysia

Short Abstract: Influenza A virus is an RNA virus that causes acute respiratory diseases and can undergo point mutation or genetic recombination, causing seasonal flu outbreaks or pandemics. In this study, ethanol, hexane, and chloroform bark extracts as well as an ethanolic leaf extract of Canarium patentinervium Miq. were tested for their anti-influenza activity against H3N2. Among the extracts tested, the ethanolic leaf and bark extracts had the best activity, with CC50 > 300 µg/mL, IC50 = 30.2 µg/mL, and SI = 10. Catechin was isolated from the ethanolic bark extract. Molecular docking was performed for catechin and four other previously identified active phytochemicals to assess their binding affinity to H3N2 (PDB code: 4WE5) using AutoDock Vina. Catechin, hyperin, and cynaroside had stronger binding affinities (∆G = -5.6, -6.9, and -7.8 kcal/mol, respectively) than the ligand (∆G = -5.2 kcal/mol). The docking between H3N2 and the active compounds was characterized by hydrogen bonds and pi-pi interactions. The dynamic behavior of these compounds was tested using Amber software, in which catechin remained firmly bound to H3N2 (∆G = -3,378.8 +/- 467.6 kcal/mol). The molecular docking and dynamics simulations supported the experimental findings, emphasizing the efficacy of catechin, hyperin, and cynaroside as promising anti-influenza agents.

The application of mixture models to RNA-seq data to discover ageing regulators
COSI: General Comp Bio
  • Atefeh Taherian Fard, University of Queensland, Australia
  • Jessica Mar, University of Queensland, Australia
  • Sasdekumar Loganathan, University of Queensland, Australia
  • Ameya Kulkarni, https://www.abbvie.com.au/, United States

Short Abstract: Ageing is a complex process. The combined effects of environmental and genetic factors make it challenging to isolate specific regulators. Given the dynamic nature of gene expression, a gene's expression can follow different distributions during the ageing process. We can capture this biological variability using mixture models, which model the variability via multimodality using multiple different distributions at the gene level for RNA-sequencing (RNA-seq) data.
We used the Genotype-Tissue Expression (GTEx) cohort to identify lists of candidate genes that clustered according to multimodal distributions with donors that showed significant changes in age. MTOR was the only age-related gene that was identified through our mixture model analysis and not captured through differential expression analysis. We identified mixture-model-only genes that were common across different tissues, suggesting the presence of systemic ageing genes. Gene set over-representation using the mixture-model-only genes and the standard differentially expressed gene list resulted in similar pathways, indicating that mixture models detect different genes in the same pathways.
The results indicate that modelling gene expression variability using mixture models in conjunction with standard differential gene expression can help uncover regulators that have a potential role in understanding human ageing.

The Differences of Prokaryotic Pan-genome Analysis on Complete Genomes and Simulated Metagenome-Assembled Genomes
COSI: General Comp Bio
  • Yanbin Yin, University of Nebraska-Lincoln, United States
  • Tang Li, University of Nebraska-Lincoln, United States

Short Abstract: A pan-genome represents the entire gene set of all strains in a species. It is commonly used in studying complete/reference/isolate genomes in various research areas. Metagenome-assembled genomes (MAGs) are generated from environmental metagenomes to study the unculturable species in a community. In recent years, pan-genome analyses of MAGs have been increasingly used in analyzing microbiomes from the human oral cavity, human gut, seawater, and hot springs. How much accuracy pan-genome analysis loses due to the nature of MAGs (fragmentation, incompleteness, and contamination) remains an open question. By simulating MAGs from complete genomes of 17 prokaryotic species and performing pan-genome analysis on the simulated MAGs, we found that fragmentation and incompleteness were the major reasons for decreased core genome size in MAGs. Additionally, we performed core genome functional analyses and constructed phylogenetic trees based on gene presence and absence. The observed underestimation or misprediction in functional/phylogenetic analysis of MAGs indicated potential bias or errors in understanding environmental microbiomes. Overall, the accuracy of pan-genome studies is significantly influenced by the nature of MAGs. Quality control of MAGs, pan-genome parameter selection, and gene clustering algorithms need to be improved for more precise pan-genome analysis.

The EMBL-EBI search and sequence analysis tools APIs and their role during the current COVID-19 pandemic
COSI: General Comp Bio
  • Fábio Madeira, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Matt Pearce, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Rodrigo Lopez, European Bioinformatics Institute (EMBL-EBI), United Kingdom

Short Abstract: The EMBL-EBI provides free access to a full-featured text search engine with powerful cross-referencing and data retrieval capabilities as well as to popular bioinformatics sequence analysis applications. These services can be accessed via user-friendly web interfaces and via established Application Programming Interfaces (APIs), which are increasingly used to improve the way in which biological data are consumed and integrated into third-party systems. The EBI Search [1] and Job Dispatcher (JD) [1] frameworks have been developed with the same core principles, making their APIs an integral part of many popular EMBL-EBI resources, such as MGnify, RNAcentral, ENA, UniProtKB, InterPro, and Ensembl Genomes.
Here, we describe the latest improvements made to the frameworks, with particular emphasis on the role of these services during the current pandemic. EBI Search has been extensively used and forms the core data search resource powering the European COVID-19 Data Portal [2]. The COVID-19 Data Portal platform is leading the way by enabling research and patient data to be deposited, analysed, searched and visualised by the community. The JD system has seen a noticeable surge in the usage of several bioinformatics applications, with more than 592 million jobs performed in 2020 alone.

Topological Strategies for the Analysis of Rhythmic Dynamics in Transcriptomic Time-Series Data
COSI: General Comp Bio
  • Elan Ness-Cohn, Northwestern University, United States
  • Rosemary Braun, Northwestern University, United States

Short Abstract: The circadian clock drives the oscillatory expression of thousands of genes across all tissues and bears significant implications for human health. RNA-seq timeseries experiments interrogate the mechanistic links between transcriptional rhythms and phenotypic outcomes. Analysis methods must overcome the challenges of sparse temporal sampling, noisy data, and non-strictly periodic dynamics.

We present two complementary methods to overcome these challenges: “TimeCycle” detects oscillatory dynamical components in noisy, sparsely sampled data; and “TimeChange” quantifies how gene rhythms change across experimental conditions. Methods leverage a data transformation technique known as time-delay embedding to reconstruct the underlying state space for each gene-of-interest. Takens’ embedding theorem implies that rhythmic dynamics will exhibit circular patterns in the embedded space. “TimeCycle” quantifies the circularity of the embedding using persistent homology, an algebraic method for discerning the topological features of data. The persistence scores are compared to a biologically-informed null model that considers RNA transcription and degradation rates to identify cycling genes. “TimeChange” nonparametrically compares the distributions of points in the embedded space to assess whether the topological structures differ significantly between phenotypes, thereby quantifying differences in transcriptional dynamics without requiring knowledge of the underlying model.

We demonstrate each method’s accuracy and reliability using synthetic and real data.
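The time-delay embedding step can be illustrated with a short sketch. It covers only the embedding, not the persistent homology or null-model steps, and it is not taken from the TimeCycle or TimeChange code; the function name, the embedding dimension of 2, and the delay of a quarter period are assumptions made for illustration.

```python
import numpy as np

def time_delay_embedding(series, dimension=2, delay=1):
    """Map a 1-D time series to points (x[t], x[t+delay], ..., x[t+(dimension-1)*delay])."""
    x = np.asarray(series, dtype=float)
    n_points = len(x) - (dimension - 1) * delay
    if n_points <= 0:
        raise ValueError("series is too short for this dimension and delay")
    # Each column is the series shifted by a multiple of the delay.
    return np.column_stack([x[i * delay: i * delay + n_points] for i in range(dimension)])

# Toy rhythmic "expression" profile: a sine wave with a 24-sample period.
t = np.arange(48)
signal = np.sin(2 * np.pi * t / 24)
embedded = time_delay_embedding(signal, dimension=2, delay=6)  # delay = quarter period
radii = np.hypot(embedded[:, 0], embedded[:, 1])
```

For this noiseless sine wave the embedded points lie on the unit circle, which is the kind of circular structure that a topological summary is then used to quantify; a non-rhythmic signal would not trace out a loop.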

Towards an integrative multi-omics workflow
COSI: General Comp Bio
  • Florian Jeanneret, Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France., France
  • Stéphane Gazut, Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France., France

Short Abstract: The advent of high-throughput techniques has greatly enhanced biological discovery. In recent years, analysis of multi-omics data has taken the front seat in improving physiological understanding. Handling functional enrichment results from various biological data raises practical questions.

We propose an integrative workflow, wrapped in the Bioconductor R package multiSight, to better interpret biological process insights in a multi-omics approach. In this work, we present this workflow applied to breast cancer data from The Cancer Genome Atlas (TCGA) related to Invasive Ductal Carcinoma (IDC) and Invasive Lobular Carcinoma (ILC). Pathway enrichment by Over Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) was conducted using either feature information from differential expression analysis (DEA) or features selected by multi-block sPLS-DA methods. Then, comprehensive comparisons of enrichment results were carried out by looking at classical enrichment analysis, probability pooling by Stouffer's Z-score method, and pathway clustering into biological themes.

Our work shows that ORA enrichment with selected sPLS-DA features, combined with pathway probability pooling by Stouffer's method, leads to enrichment maps highly associated with the physiological knowledge of the IDC or ILC phenotypes, better than ORA and GSEA with differential-expression-driven features.
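Stouffer's pooling of per-omic pathway p-values can be sketched as below. This is a generic, standard-library illustration of the method rather than the multiSight implementation; the function name, the equal default weights, and the use of one-sided p-values are assumptions.

```python
import math
from statistics import NormalDist

def stouffer_pool(p_values, weights=None):
    """Combine one-sided p-values (each strictly between 0 and 1) with Stouffer's Z-score method.

    Returns the combined Z score and the combined one-sided p-value.
    """
    normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)         # unweighted pooling by default
    # Convert each p-value to a Z score (small p gives large positive Z).
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = (sum(w * z for w, z in zip(weights, z_scores))
                  / math.sqrt(sum(w * w for w in weights)))
    return combined_z, 1.0 - normal.cdf(combined_z)

# Hypothetical p-values for one pathway from two omic blocks.
z, pooled_p = stouffer_pool([0.05, 0.05])
```

Two moderately significant results for the same pathway pool to a smaller combined p-value (about 0.01 here), which is why pooling can surface pathways that are only weakly enriched in each omic layer alone.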

Towards haplotype-specific chromatin contact maps from GAM data
COSI: General Comp Bio
  • Julia Markowski, Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine (MDC), Berlin, Germany
  • Alexander Kukalev, Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine (MDC), Berlin, Germany
  • Teresa Szczepińska, Centre of New Technologies, University of Warsaw, Warsaw, Poland
  • Ana Pombo, Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine (MDC), Berlin, Germany
  • Roland F Schwarz, Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine (MDC), Berlin, Germany

Short Abstract: The dynamics of chromatin conformation are essential for the precise orchestration of gene expression in time and space, and ultimately for healthy organismal development. Allele-specific chromatin folding due to genetic variation can perturb these expression programs. Detailed, haplotype-specific analyses of chromatin contacts are crucial for understanding the impact of genetic variation on mechanisms of gene expression in health and disease.
Traditionally, e.g. in Hi-C data, sequencing reads can only be assigned to their homologous chromosome of origin if they overlap heterozygous variant positions. The low variant density in human genomes results in low phasing efficiency and impedes the generation of haplotype-specific chromatin contact matrices.
Genome Architecture Mapping (GAM) measures chromatin contacts through the co-segregation frequencies of genomic regions captured in thin nuclear slices. All sequencing reads from a DNA fragment captured in a nuclear slice originate from the same chromosome copy, thus providing local phasing information.
Leveraging this unique feature of GAM data, we have developed a novel genome-wide phasing strategy. We drastically improve read phasing efficiency and, for the first time, derive accurate, detailed haplotype-specific chromatin contact matrices in genomes with low variant density. Our phasing approach reveals previously unappreciated allele-specific chromosome topologies in human genomes at high resolution.
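
To make the notion of co-segregation frequency in this abstract concrete, the sketch below computes a raw co-segregation matrix from a binary segregation table. It only illustrates the basic quantity that GAM measures and is not the authors' phasing strategy; the function name `cosegregation_matrix`, the table layout (one row per genomic window, one column per nuclear slice) and the toy data are assumptions.

```python
def cosegregation_matrix(segregation):
    """Raw co-segregation frequencies from a binary segregation table.

    segregation[i][k] is 1 if genomic window i was detected in nuclear
    slice k, else 0 (an assumed layout for this illustration). Entry
    (i, j) of the result is the fraction of slices in which windows i
    and j were detected together.
    """
    n_windows = len(segregation)
    n_slices = len(segregation[0])
    matrix = [[0.0] * n_windows for _ in range(n_windows)]
    for i in range(n_windows):
        for j in range(i, n_windows):
            # Count slices in which both windows are present.
            both = sum(
                1 for a, b in zip(segregation[i], segregation[j]) if a and b
            )
            matrix[i][j] = matrix[j][i] = both / n_slices
    return matrix


# Toy table: 3 genomic windows observed across 4 nuclear slices.
toy_table = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
for row in cosegregation_matrix(toy_table):
    print(row)
```

A haplotype-specific version would need one such table per haplotype, built from phased reads; how reads are phased is exactly what the poster addresses and is not reproduced here.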

Unlocking insights into cellular senescence through single cell transcriptomics of ageing mesenchymal stem cells
COSI: General Comp Bio
  • Atefeh Taherian Fard, Australian Institute for Bioengineering and Nanotechnology, University of Queensland, Australia
  • Jessica Mar, Australian Institute for Bioengineering and Nanotechnology, University of Queensland, Australia

Short Abstract: Cellular senescence acts to protect against cancer and is involved in other fundamental biological processes such as development, tissue repair, and ageing. Having a clear understanding of the molecular mechanisms that define the progression of senescence is critical to identifying new therapeutic strategies for age-related diseases. Recent advances in single cell (sc) technologies have helped to elucidate the regulatory mechanisms and modulators of single cells. The application of these technologies has the potential to unlock insights into cellular senescence in different tissues and cell types. Here, for the first time, sc RNA-seq data were generated to investigate the gene expression heterogeneity of mesenchymal stem cells (MSCs) undergoing replicative senescence. We computationally characterised different MSC sub-populations at the different stages of the cell cycle, compared the transcription profiles of cells going from a proliferative to a senescent state, and identified the key factors driving this transitional process. We found that there are at least three different senescent phenotypes in ageing MSCs. Using novel computational methods and statistical approaches for sc RNA-seq data analysis, we identified senescent phenotypes that are linked to the senescence-associated secretory phenotype (SASP), oncogene- and SASP-induced senescence escapees, revealing a level of previously unappreciated heterogeneity associated with the senescent phenotype.

Using single-cell transcriptomics to characterise the bone marrow microenvironment in health and leukemia
COSI: General Comp Bio
  • Sarah Ennis, National University of Ireland, Galway, Ireland
  • Alessandra Conforte, National University of Ireland, Galway, Ireland
  • Pilib O Broin, National University of Ireland, Galway, Ireland
  • Eva Szegezdi, National University of Ireland, Galway, Ireland

Short Abstract: Acute myeloid leukemia (AML) is an aggressive blood cancer that causes an accumulation of myeloid precursor cells in the bone marrow. Drug resistance is common among patients and is partly driven by the protective microenvironment in which the cells reside. Recent studies using single-cell technologies have provided valuable insights into the behaviour of this niche in individual patients and demonstrated the high level of inter- and intra-patient heterogeneity among AML cells. However, these studies have been restricted to small numbers of patients, which limits their ability to identify unifying mechanisms of AML progression. Here, we performed single-cell RNA-seq of bone marrow aspirates from 10 AML patients and integrated the data with published studies of both healthy and AML bone marrow to create a combined dataset of ~250,000 cells from more than 60 donors. Using this dataset, we first established a baseline reference of healthy cells and then looked for changes in cell type composition, gene expression and ligand-receptor interactions that occur during the establishment and progression of AML. This analysis not only highlighted the heterogeneity among patients but also revealed that cell adhesion and inflammatory interactions undergo substantial change during AML development.

Whole genome doubling-aware copy number phylogenies for cancer evolution with MEDICC2
COSI: General Comp Bio
  • Tom L Kaufmann, Max Delbrück Center for Molecular Medicine, Germany
  • Marina Petkovic, Max Delbrück Center for Molecular Medicine, Croatia
  • Roland F Schwarz, Max Delbrück Center for Molecular Medicine, Germany

Short Abstract: Somatic copy number alterations (SCNAs) include large-scale events, such as chromosome arm-level gains and losses, as well as focal amplifications and deletions, and play a key role in the evolutionary processes that shape cancer genomes. SCNAs often appear together with whole genome doubling (WGD), which generates near-tetraploid cells and is associated with poor patient outcome.
While the importance of SCNAs and WGD events for tumour evolution is widely accepted, there are currently no methods for phylogenetic inference from SCNAs that include WGD events.
Here we present MEDICC2, a new phylogenetic algorithm for multi-sample haplotype-specific SCNA data based on a minimum-evolution criterion that infers phylogenetic trees, reconstructs ancestral genomes and reliably detects WGD events.
MEDICC2 accurately locates clonal and subclonal copy number events, including WGDs, and times them relative to each other. Detected events can be compared with user-provided genomic regions (such as known oncogenes) to decipher the evolutionary history of the tumour, and bootstrap resampling techniques allow the robustness of a given tree topology to be estimated. Efficient parallel implementations enable application to experiments with thousands of samples.
Here, we introduce the algorithmic novelties in MEDICC2 and apply it to single-cell data from triple-negative breast cancer to demonstrate its range of features.



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
