Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


General Computational Biology


Schedule subject to change
Thursday, July 16th
10:40 AM-11:00 AM
Proceedings Presentation: Inference Attacks Against Differentially-Private Query Results from Genomic Datasets Including Dependent Tuples
Format: Pre-recorded with live Q&A

  • Erman Ayday, Case Western Reserve University, United States
  • Nour Almadhoun, Bilkent University, Turkey
  • Ozgur Ulusoy, Bilkent University, Turkey

Presentation Overview:

Motivation: The rapid decrease in sequencing technology costs has led to a revolution in medical research and clinical care. Today, researchers have access to large genomic datasets to study associations between variants and complex traits. However, the availability of such genomic datasets also raises new privacy concerns about the personal information of participants in genomic studies. Differential privacy (DP) is a rigorous privacy concept that has been widely adopted for sharing summary statistics from genomic datasets while protecting the privacy of participants against inference attacks. However, DP has a known drawback: it does not take into account the correlation between dataset tuples. Therefore, the privacy guarantees of DP-based mechanisms may degrade if the dataset includes dependent tuples, which is a common situation for genomic datasets due to the inherent correlations between the genomes of family members.

Results: In this paper, using two real-life genomic datasets, we show that exploiting the correlation between the dataset participants results in significant information leakage from differentially-private results of complex queries. We formulate this as an attribute inference attack and show the privacy loss in minor allele frequency (MAF) and chi-square queries. Our results show that, using the results of differentially-private MAF queries and utilizing the dependency between tuples, an adversary can reveal up to 50% more sensitive information about the genome of a target (compared to the original privacy guarantees of standard DP-based mechanisms), while differentially-private chi-square queries can reveal up to 40% more sensitive information. Furthermore, we show that the adversary can use the inferred genomic data obtained from the attribute inference attack to infer the membership of a target in another genomic dataset (e.g., one associated with a sensitive trait). Using a log-likelihood-ratio (LLR) test, our results also show that the inference power of the adversary can be significantly high in such an attack, even when using inferred (and hence partially incorrect) genomes.
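For readers unfamiliar with differentially-private MAF queries, the following is a minimal sketch of one standard way to release such a query, the Laplace mechanism. The abstract does not state which mechanism the authors attacked, so the choice of mechanism, the function name dp_maf and its parameters are assumptions made for illustration. The epsilon guarantee of this sketch holds only if individuals are independent, which is exactly the assumption the talk calls into question.

```python
import numpy as np

def dp_maf(genotypes, epsilon, rng=None):
    """Release a noisy minor allele frequency (MAF) for one SNP.

    genotypes: 1-D array of minor-allele counts per individual (0, 1 or 2).
    epsilon:   privacy budget of the Laplace mechanism (hypothetical parameter name).
    """
    rng = np.random.default_rng() if rng is None else rng
    genotypes = np.asarray(genotypes)
    n = genotypes.size
    true_maf = genotypes.sum() / (2 * n)
    # Replacing one individual changes the allele count by at most 2,
    # so the MAF changes by at most 2 / (2n) = 1 / n.
    sensitivity = 1.0 / n
    noisy = true_maf + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    # Clipping is post-processing and does not weaken the DP guarantee.
    return float(np.clip(noisy, 0.0, 1.0))
```

A smaller epsilon adds more noise; with family members in the dataset, the effective protection per person is weaker than epsilon suggests because one person's genotype is informative about several tuples.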

11:00 AM-11:20 AM
RCA2 - An improved framework for reference-based clustering of single cell transcriptomes
Format: Pre-recorded with live Q&A

  • Florian Schmidt, A-Star Institute Singapore, Singapore
  • Bobby Ranjan, Genome Institute of Singapore (GIS/A*STAR), Singapore
  • Mohammad Amin Honardoost, School of Medicine, National University of Singapore, Singapore
  • Shyam Prabhakar, Genome Institute of Singapore, Singapore

Presentation Overview:

Clustering is one of the most critical steps in the analysis of scRNA-seq data, since it is essential for cell-type and marker gene identification. However, it remains challenging to cluster cells accurately in the presence of experimental noise, technical variation and batch effects. It was shown that Reference Component Analysis (RCA), a supervised clustering approach guided by a set of reference transcriptomes, is more accurate and less susceptible to batch effects than unsupervised clustering (Li et al., Nat Genet 2017). However, the original RCA software does not scale to the size of modern scRNA-seq data sets, has limited usability, visualization options and documentation, and includes only a single reference transcriptome panel.
Here, we present RCA2, an improved implementation of reference-based clustering addressing all of the above limitations. We have reduced runtime, incorporated memory-efficient graph-based clustering, expanded the set of reference panels and facilitated the generation of custom reference panels from user-provided data. RCA2 also has easy-to-use plotting functions, e.g. expression heat maps and 2D/3D UMAP visualizations. Finally, RCA2 includes extensive documentation and tutorials describing the use of the software and its integration into widely used scRNA-seq pipelines such as Seurat. RCA2 is freely available on GitHub: github.com/prabhakarlab/RCAv2.
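The core idea of reference-based clustering, describing each cell by its similarity to a panel of reference transcriptomes before clustering, can be sketched as follows. This is not the RCA2 implementation or its API: the function name project_to_reference is hypothetical, and the use of plain Pearson correlation without any further transformation is a simplifying assumption.

```python
import numpy as np

def project_to_reference(cells, reference):
    """Pearson correlation of every cell with every reference profile.

    cells:     (n_genes, n_cells) expression matrix.
    reference: (n_genes, n_refs) reference panel over the same genes.
    Returns an (n_refs, n_cells) projection matrix.
    Columns with zero variance are not handled and would yield NaN.
    """
    # Centre each column, then scale it to unit length, so that the
    # dot product of two columns equals their Pearson correlation.
    c = cells - cells.mean(axis=0, keepdims=True)
    r = reference - reference.mean(axis=0, keepdims=True)
    c = c / np.linalg.norm(c, axis=0, keepdims=True)
    r = r / np.linalg.norm(r, axis=0, keepdims=True)
    return r.T @ c
```

Cells would then be clustered in this projection space rather than in raw gene space, which is what makes the result less sensitive to batch-specific expression shifts.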

11:20 AM-11:40 AM
Automating FAIR assessment scores
Format: Pre-recorded with live Q&A

  • Joseph Bonello, University of Malta, Malta
  • Ernest Cachia, University of Malta, Malta

Presentation Overview:

We present a research project that provides a semi-automatic means of conducting FAIR assessments of bioinformatics tools and datasets. Our motivation stems from the growing interest in ensuring the transparency and reproducibility of the published scientific literature. A study of 149 biomedical articles (published between 2015 and 2017) by Wallach, Boyack and Ioannidis (2018) showed that only 19 (~18%) of 104 articles with empirical data discussed publicly available data, only one (~1.0%) included a link to a full study protocol, and only 5 (~5.2%) of 97 articles had replication of previous studies.

Findability, Accessibility, Interoperability and Reusability (FAIR) assessments provide an indication of how easy it is for a researcher to reproduce a study. They score aspects such as whether a tool or dataset used in a study is easily available for download and use, whether a tool can be used on different OS platforms, and whether the tools or datasets are available from a reputable source, among others.

Our tool is a first step towards automating the assessment: it attempts to search the internet for evidence of the FAIR criteria and gives the researcher the option to input missing details. It can also score entire pipelines of tools.

12:00 PM-12:20 PM
DeepPS: a transformer model for predicting general and kinase-specific phosphorylation sites
Format: Pre-recorded with live Q&A

  • Mark Lennox, Queen's University Belfast, United Kingdom
  • Neil Robertson, Queen's University Belfast, United Kingdom
  • Barry Devereux, Queen's University Belfast, United Kingdom

Presentation Overview:

Deep learning has become an innovative tool for detecting phosphorylation sites within a protein. However, the imbalance between negative and positive sites makes it challenging for a deep learning model to classify all sites accurately. Although identifying additional sites is possible, it is often costly and time-consuming with existing methods. Therefore, there is a demand for innovative modelling techniques that can overcome these limitations. To address these issues, we have designed a modelling scheme that utilises both convolutional and transformer-based neural networks. Specifically, we explore how both types of network can be combined and trained using a loss function employed in computer vision to form a robust architecture that is less likely to overfit to any one class when compared to previous baselines. We evaluate our model on a general phosphorylation site dataset, and a variety of kinase-specific datasets, including CDK, CK2, MAPK, PKA and PKC. Finally, to emphasise that this is an example of white-box deep learning, we show how one can visualise the model's features to gain a better understanding behind the prediction of each site.

12:20 PM-12:40 PM
Drug-adapted cancer cell lines reveal drug-induced heterogeneity and enable the identification of biomarker candidates for the acquired resistance setting
Format: Pre-recorded with live Q&A

  • Florian Rothweiler, Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
  • Mark Wass, Industrial Biotechnology Centre and School of Biosciences, University of Kent, Canterbury, UK, United Kingdom
  • Martin Michaelis, Industrial Biotechnology Centre and School of Biosciences, University of Kent, Canterbury, UK, United Kingdom
  • Jindrich Cinatl, Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
  • Ian Reddin, Industrial Biotechnology Centre and School of Biosciences, University of Kent, Canterbury, UK, United Kingdom
  • Yvonne Voges, Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
  • Stephanie Hehlgans, Department of Radiotherapy and Oncology, University Hospital, Goethe University Frankfurt am Main, Frankfurt, Germany
  • Jaroslav Cinatl, Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
  • Marco Mernberger, Institute of Molecular Oncology, Member of the German Center for Lung Research (DZL), Philipps-University, Marburg, Germany
  • Andrea Nist, Genomics Core Facility, Philipps-University, Marburg, Germany
  • Thorsten Stiewe, Institute of Molecular Oncology, Member of the German Center for Lung Research (DZL), Philipps-University, Marburg, Germany
  • Franz Roedel, Department of Radiotherapy and Oncology, University Hospital, Goethe University Frankfurt am Main, Frankfurt, Germany

Presentation Overview:

Acquired resistance is a central problem in cancer treatment. Biomarkers that indicate early therapy failure are needed to adapt therapies if resistance emerges; however, intrinsic and acquired resistance mechanisms may differ substantially. We investigate this by determining whether we can obtain information from acquired resistance models that cannot be identified from non-adapted cell lines. Findings from one sub-line of the neuroblastoma cell line UKF-NB-3 adapted to the survivin suppressant YM155 had suggested that increased ABCB1 levels, decreased SLC35F2 levels, decreased survivin levels, and TP53 mutations indicate YM155 resistance. Here, the investigation of ten additional YM155-adapted UKF-NB-3 sub-lines confirmed only the roles of ABCB1 and SLC35F2. Computational analysis of drug response data obtained from two large pharmacogenomic cell line screens confirmed this association between SLC35F2 and ABCB1 expression and YM155 sensitivity; however, sensitivity could not be determined by the expression of either gene in these YM155-naive cell lines. In conclusion, cancer cell populations of limited intrinsic heterogeneity can develop various resistance phenotypes in response to treatment. Therefore, individualised therapies will require monitoring of cancer cell evolution in response to treatment. Moreover, biomarkers can indicate resistance formation in the acquired resistance setting, even when they are not predictive in the intrinsic resistance setting.

3:20 PM-3:40 PM
Proceedings Presentation: Efficient Exact Inference for Dynamical Systems with Noisy Measurements using Sequential Approximate Bayesian Computation
Format: Pre-recorded with live Q&A

  • Yannik Schälte, Helmholtz Zentrum München, Germany
  • Jan Hasenauer, University of Bonn, Germany

Presentation Overview:

Approximate Bayesian Computation (ABC) is an increasingly popular method for likelihood-free parameter inference in systems biology and other fields of research, since it allows analysing complex stochastic models. However, the introduced approximation error is often not clear. It has been shown that ABC actually gives exact inference under the implicit assumption of a measurement noise model. Noise being common in biological systems, it is intriguing to exploit this insight. But this is difficult in practice, since ABC is in general highly computationally demanding. Thus, the question we want to answer here is how to efficiently account for measurement noise in ABC.

We illustrate with examples how ABC yields erroneous parameter estimates when measurement noise is neglected. Then, we discuss practical ways of correctly including the measurement noise in the analysis. We present an efficient algorithm, based on adaptive sequential importance sampling, that is applicable to various model types and noise models. We test and compare it on several models, including ordinary and stochastic differential equations, Markov jump processes, and stochastically interacting agents, and on noise models including normal, Laplace, and Poisson noise. We conclude that the proposed algorithm could improve the accuracy of parameter estimates for a broad spectrum of applications.
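The idea of accounting for measurement noise inside the ABC acceptance step can be illustrated with a plain rejection sampler. This is a simplified sketch, not the adaptive sequential importance sampling algorithm presented in the talk: the function and parameter names are hypothetical, and a normal noise model with known standard deviation sigma is assumed.

```python
import numpy as np

def abc_exact_rejection(y_obs, simulate, sample_prior, sigma, n_samples, rng):
    """Rejection sampler whose acceptance step uses a normal noise model.

    simulate(theta) returns the noise-free model output for parameter theta.
    A candidate is accepted with probability p(y_obs | simulate(theta)) / c,
    where c is the maximum of the normal density, so accepted samples follow
    the posterior under that noise model instead of an approximation of it.
    """
    accepted = []
    while len(accepted) < n_samples:
        theta = sample_prior(rng)
        resid = np.asarray(y_obs) - np.asarray(simulate(theta))
        # Ratio of the normal density at the residual to its maximum value.
        accept_prob = np.exp(-0.5 * np.sum(resid**2) / sigma**2)
        if rng.uniform() < accept_prob:
            accepted.append(theta)
    return np.array(accepted)
```

The acceptance rate of this naive version falls quickly as the data dimension grows or sigma shrinks, which is the efficiency problem a sequential scheme is meant to address.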

3:40 PM-4:00 PM
Mogrify: A computational framework to convert between cell types
Format: Pre-recorded with live Q&A

  • Kalaivani Raju, Mogrify Limited, United Kingdom
  • Minkyung Sung, Mogrify Limited, United Kingdom
  • Owen Rackham, Duke-NUS Medical School, Singapore
  • Jose Polo, Monash University, Australia
  • Aida Moreno-Moral, Mogrify Limited, United Kingdom
  • Rodrigo Santos, Mogrify Limited, United Kingdom
  • Karin Schmitt, Mogrify Limited, United Kingdom
  • Julian Gough, Laboratory of Molecular Biology, United Kingdom

Presentation Overview:

Mogrify is a computational framework that combines gene expression data and regulatory information to systematically predict the reprogramming factors necessary to induce cell conversion. The platform is developed to systematically control the cellular transcriptomic network underlying cellular identity, and consequently identify the key regulatory factors necessary to convert any cell type into any other cell type without going through the stem cell state, a process called transdifferentiation. We have applied Mogrify to 173 human cell types and 134 tissues, defining an atlas of cellular reprogramming including both known transcription factors used in transdifferentiations and new ones, never implicated before in these cellular conversions. Mogrify in silico predictions have been validated in vitro in over 20 cell conversions, including generation of endothelial cells, astrocytes and cardiomyocytes. This technology also allows the development of enhanced differentiations and reduces the costs of current cell therapies.

4:00 PM-4:10 PM
Investigating the potential of quantum computing for protein folding
Format: Pre-recorded with live Q&A

  • Carlos Outeiral Rubiera, University of Oxford, United Kingdom
  • Garrett Morris, University of Oxford, United Kingdom
  • Jiye Shi, UCB Pharma, United Kingdom
  • Martin Strahm, F. Hoffmann La Roche, Switzerland
  • Simon Benjamin, University of Oxford, United Kingdom
  • Charlotte Deane, University of Oxford, United Kingdom

Presentation Overview:

Protein folding, the determination of the lowest-energy configuration of a protein, is an unsolved computational problem. If protein folding could be solved, it would lead to significant advances in molecular biology, and technological development in areas such as drug discovery and catalyst design. As a hard combinatorial optimisation problem, protein folding has been studied as a potential target problem for adiabatic quantum computing. Although several experimental implementations have been discussed in the literature, the computational scaling of these approaches has not been elucidated. In this article, we present a numerical study of the (stoquastic) adiabatic quantum algorithm applied to protein lattice folding. Using exact numerical modelling of small systems, we find that the time-to-solution metric scales exponentially with peptide length, even for small peptides. However, comparison with classical heuristics for optimisation indicates a potential limited quantum speedup. Overall, our results suggest that quantum algorithms may well offer improvements for problems in the protein folding and structure prediction realm.
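The abstract does not define its time-to-solution metric. One definition that is common in the quantum annealing literature, assumed here, is the total run time needed to observe the ground state at least once with a target probability (often 99%), given the success probability of a single run. The function name and parameters below are illustrative only.

```python
import math

def time_to_solution(run_time, p_success, target=0.99):
    """Total run time needed to find the ground state at least once
    with probability `target`, given the per-run success probability."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must lie in (0, 1]")
    if p_success >= target:
        # A single run already meets the target confidence.
        return run_time
    # Number of independent repetitions R solving 1 - (1 - p)^R = target.
    repeats = math.log(1.0 - target) / math.log(1.0 - p_success)
    return run_time * repeats
```

Under this definition, an exponentially shrinking per-run success probability translates directly into an exponentially growing time-to-solution, which is the kind of scaling the abstract reports for increasing peptide length.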

4:10 PM-4:20 PM
Analysis of ISCB honorees and keynotes reveals disparities
Format: Pre-recorded with live Q&A

  • Trang T. Le, University of Pennsylvania, United States
  • Daniel S. Himmelstein, University of Pennsylvania, United States
  • Ariel A. Hippen Anderson, University of Pennsylvania, United States
  • Matthew R. Gazzara, University of Pennsylvania, United States
  • Casey S. Greene, University of Pennsylvania, United States

Presentation Overview:

Professional societies and their conferences provide an important venue for disseminating scientific knowledge. Delivering a keynote for or being named a fellow by an international society is a major recognition. Do such recognitions reflect the composition of the field of bioinformatics? We compiled a list of 412 International Society for Computational Biology (ISCB) honorees and a list of corresponding authors in leading bioinformatics journals. Comparing the two distributions, we looked for disparities in gender, country of affiliation, race, and name origin. The proportion of female honorees has kept pace with increasing levels of female authorship, but neither has yet reached gender parity. However, we noticed a striking geographic disparity: the proportion of honorees with an affiliation in the U.S. was 1.6-fold greater than among field-specific senior authors. In total, we estimate the U.S. received 85 more honorees than would be expected from randomly selecting honorees from senior authors. Furthermore, within the U.S., we identify racial disparities, with an excess of white honorees and a depletion of Asian honorees. This pattern replicated globally, where we find that names of East Asian origin have been persistently underrepresented among ISCB honorees. Early indications suggest the ISCB has taken note of our findings by selecting more diverse honorees.

4:40 PM-5:00 PM
Proceedings Presentation: CRISPRLand: Interpretable Large-Scale Inference of DNA Repair Landscape Based on a Spectral Approach
Format: Pre-recorded with live Q&A

  • Amirali Aghazadeh, University of California, Berkeley, United States
  • Orhan Ocal, University of California, Berkeley, United States
  • Kannan Ramchandran, University of California, Berkeley, United States

Presentation Overview:

We propose a new spectral framework for reliable training, scalable inference, and interpretable explanation of the DNA repair outcome following Cas9 cutting. Our framework, dubbed CRISPRLand, relies on an unexploited observation about the nature of the repair process: the landscape of the DNA repair is highly sparse in the (Walsh-Hadamard) spectral domain. This observation enables our framework to address key shortcomings that limit the interpretability and scaling of current deep-learning-based DNA repair models. In particular, CRISPRLand reduces the time to compute the full DNA repair landscape from a striking 5230 years to one week and the sampling complexity from 10^12 to 3 million guide RNAs with only a small loss in accuracy (R^2 ~ 0.9). Our proposed framework is based on a divide-and-conquer strategy that uses a fast peeling algorithm to learn the DNA repair models. CRISPRLand captures lower-degree features around the cut site, which enrich for short insertions and deletions, as well as higher-degree microhomology patterns that enrich for longer deletions.
The CRISPRLand software is publicly available at https://github.com/UCBASiCS/CRISPRLand.
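What it means for a landscape to be sparse in the Walsh-Hadamard domain can be seen on a toy example. The sketch below is a textbook fast Walsh-Hadamard transform, not the CRISPRLand peeling algorithm; the function name fwht and the 1/N normalisation are choices made for this illustration.

```python
import numpy as np

def fwht(values):
    """Fast Walsh-Hadamard transform of a length-2^n array.

    Index k of the input is read as a binary feature vector; index s of the
    output is the coefficient of the interaction among the features whose
    bits are set in s. The result is normalised by 1 / 2^n.
    """
    a = np.array(values, dtype=float)
    n = a.size
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step over pairs of blocks that differ in the bit of value h.
        for start in range(0, n, 2 * h):
            x = a[start:start + h].copy()
            y = a[start + h:start + 2 * h].copy()
            a[start:start + h] = x + y
            a[start + h:start + 2 * h] = x - y
        h *= 2
    return a / n
```

A function of three binary features that depends only on a constant term, one single feature and one pairwise interaction has just three non-zero coefficients out of eight after this transform. Sparsity of that kind is what allows a landscape to be recovered from far fewer samples than its full size.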

5:00 PM-5:20 PM
Single-cell transcriptomic analysis of highly-multiplexed cytometry data via antigen mapping
Format: Pre-recorded with live Q&A

  • Kiya Govek, University of Pennsylvania, United States
  • Emma Troisi, University of Pennsylvania, United States
  • Zhen Miao, University of Pennsylvania, United States
  • Steven Woodhouse, University of Pennsylvania, United States
  • Pablo Camara, University of Pennsylvania, United States

Presentation Overview:

Recently developed technologies for digital imaging and highly-multiplexed immunohistochemistry (mIHC) are enabling the field of histology to enter into a quantitative era, allowing for more complex descriptions of tissue architecture. Imaging cytometry by time of flight (CyTOF), multiplexed ion beam imaging, and co-detection by indexing (CODEX) can be used to simultaneously profile the expression of dozens of proteins in a tissue section with single-cell resolution. However, annotating cell populations or states that differ little in the profiled antigens or for which the antibody panel does not include specific markers is challenging. To overcome this obstacle, we have developed a computational approach for enriching mIHC images with single-cell RNA-seq data, building upon recent experimental procedures for augmenting single-cell transcriptomes with concurrent antigen measurements. Spatially-resolved Transcriptomics via Epitope Anchoring (STvEA) performs transcriptome-guided annotation of highly-multiplexed cytometry datasets. It increases the level of detail in histological analyses by enabling annotation of subtle cell populations, spatial patterns of transcription, and interactions between cell types. We demonstrate the utility of STvEA by uncovering the architecture of poorly characterized cell types in the murine spleen using published CODEX and CyTOF datasets, and a CITE-seq atlas we have generated.

5:20 PM-5:40 PM
Variable Number Tandem Repeats mediate the expression of proximal genes
Format: Pre-recorded with live Q&A

  • Mehrdad Bakhtiari, University of California San Diego, United States
  • Jonghun Park, University of California San Diego, United States
  • Yuan-Chun Ding, Beckman Research Institute of City of Hope, United States
  • Sharona Shleizer-Burko, University of California San Diego, United States
  • Susan Neuhausen, Beckman Research Institute of City of Hope, United States
  • Bjarni Halldorsson, deCODE Genetics, Iceland
  • Kari Stefansson, deCODE Genetics, Iceland
  • Melissa Gymrek, University of California San Diego, United States
  • Vineet Bafna, University of California San Diego, United States

Presentation Overview:

Variable Number Tandem Repeats (VNTRs) account for a significant amount of human genetic variation. VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by whole-genome analysis pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks for fast read recruitment. On 55X whole-genome data, adVNTR-NN genotyped each VNTR in less than 18 CPU-seconds, while maintaining 100% accuracy on 76% of VNTRs.

We used adVNTR-NN to genotype 10,264 VNTRs in 652 individuals from the GTEx project and associated VNTR length with gene expression in 46 tissues. We identified 163 'eVNTRs' that were significantly associated with gene expression. Of the 22 eVNTRs in blood where independent data was available, 21 (95%) were replicated in terms of significance and direction of association. 49% of the eVNTRs showed a strong and likely causal impact on the expression of genes and 80% had an effect size of at least 0.1. The impacted genes have important roles in complex phenotypes including Alzheimer's, obesity and familial cancers. Our results point to the importance of studying VNTRs for understanding the genetic basis of complex diseases.
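The abstract does not define how VNTR length was associated with expression or what "effect size" means. As an assumption for illustration, the sketch below uses the slope of a simple linear regression on standardised values, which equals Pearson's r, and ignores the covariates a real expression association analysis would adjust for. The function name is hypothetical.

```python
import numpy as np

def vntr_expression_effect(vntr_length, expression):
    """Least-squares slope of standardised expression on standardised VNTR length.

    Both inputs hold one value per individual. Constant inputs (zero variance)
    are not handled and would cause a division by zero.
    """
    x = np.asarray(vntr_length, dtype=float)
    y = np.asarray(expression, dtype=float)
    # Standardise so the slope is unit-free and comparable across genes.
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.dot(x, y) / np.dot(x, x))
```

The sign of the returned value gives the direction of association, which is the quantity compared between cohorts when checking replication.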

5:40 PM-6:00 PM
Inferring identifying characteristics through pooling of information across genotypic trajectories
Format: Pre-recorded with live Q&A

  • Prashant Emani, Yale University, United States
  • Gamze Gürsoy, Yale University, United States
  • Mark Gerstein, Yale University, United States

Presentation Overview:

The leakage of identifying information in genetic and omics data has been established in several studies, and represents a new frontier in the struggle for individual privacy protections. Specifically, it has been shown that single nucleotide polymorphisms (SNPs) carry a strong risk of reidentification for individuals and close relatives. The risk is substantial given that SNPs can be inferred not only from direct genotyping data but also from omics measurements and genotype-phenotype relationships. Accordingly, we present a computational tool to assess the risk of publicly releasing genotype/omics data, by quantifying the informativeness of even a sparse set of common SNPs from an individual. We employ a Hidden Markov Model of recombination and mutation to reveal SNP representation within a reference genotype database, and to pool information across the identified genotypic trajectories. We find that even ~10 common (minor allele frequency > 0.05) SNPs are sufficient to identify individuals in databases, and that genotypic “mosaics” of reference individuals can be determined with as few as 20 common SNPs. We are also able to identify first-order relatives (parents and siblings) of query individuals with 20-30 common SNPs. The tool could be used to determine the value of selectively masking released SNPs.
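A Hidden Markov Model of recombination and mutation can be sketched with the classic haplotype-copying formulation, in which a query sequence is modelled as a mosaic of reference haplotypes. This is a generic illustration under simplifying assumptions (haploid data, constant recombination and mutation probabilities), not the authors' tool; the function name and the default values of rho and mu are hypothetical.

```python
import numpy as np

def copying_log_likelihood(query, panel, rho=0.01, mu=0.001):
    """Log-likelihood of a haploid query under a haplotype-copying HMM.

    query: length-M array of alleles (0/1) at M SNPs.
    panel: (K, M) array of K reference haplotypes at the same SNPs.
    rho:   per-site probability of switching the copied haplotype.
    mu:    per-site probability that the copied allele is altered.
    """
    query = np.asarray(query)
    panel = np.asarray(panel)
    k = panel.shape[0]
    # Emission probability of the query allele from each reference haplotype.
    emit = np.where(panel == query[None, :], 1.0 - mu, mu)
    alpha = emit[:, 0] / k  # uniform prior over which haplotype is copied
    log_lik = 0.0
    for j in range(1, panel.shape[1]):
        # Rescale to avoid underflow and accumulate the log of the scale.
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
        # Stay on the same haplotype, or recombine to a uniformly chosen one.
        alpha = emit[:, j] * ((1.0 - rho) * alpha + rho / k)
    return float(log_lik + np.log(alpha.sum()))
```

A query that closely follows one or a few reference haplotypes scores much higher than one requiring many switches or mismatches, which is the signal that makes a small set of SNPs identifying.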