Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


May 16-19, 2016 | GLBIO/CCBC 2016

Talks

Modeling drug repositioning by leveraging drug-target-disease association datasets: A case study of Ebola virus disease
Download

Date: TBA
Room: TBA

  • Gaston Mazandu, University of Cape Town, South Africa
  • Kayleigh Rutherford, University of Cape Town, South Africa
  • Elsa-Gayle Zekeng, University of Liverpool, United Kingdom
  • Emile Chimusa, University of Cape Town, South Africa
  • Nicola Mulder, University of Cape Town, South Africa

Presentation Overview:

The world's population currently faces several public health challenges, including the growing prevalence of infections and the emergence of new pathogenic organisms. The cost and risk associated with the drug development process make the development of drugs for several diseases, especially orphan or rare diseases, unappealing to the pharmaceutical industry. A potential strategy to address the challenges of novel drug development is drug repositioning, which consists of examining new uses for existing approved drugs. Here, we developed an integrative computational framework to predict possible drug repositioning for approved drugs by leveraging drug-target-disease associations, and functional and genomic data from public databases. We identified drug target enriched biological processes and similar diseases based on their biological processes and mapped diseases to new potential drugs using Gene Ontology (GO) semantic similarity scores. We assessed the performance of this model using the area under the Receiver Operating Characteristic (ROC) curve (AUC), precision and accuracy score as measures of discriminative power. The model performs well with an AUC score of 0.97684, 85% precision and 78% accuracy. Applying this model to Ebola virus disease (EVD), we were able to assess associations between human proteins associated with EVD and approved drug targets. The viral matrix protein VP40 showed evidence of playing a critical role in the viral life cycle inside the host, and we identified three putative protein targets in humans: DEAD (Asp-Glu-Ala-Asp) box polypeptide 58 (DDX58), Tumor Necrosis Factor (TNF) and Toll-Like Receptor 4 (TLR4). Mining the DrugBank database, we revealed potential drugs for EVD, including immune-related drugs, such as infliximab, and amebicide-related drugs.
The suggested model has the potential to bridge the gap in the production of orphan disease therapies, offering a systematic, effective and reliable approach to predicting new uses for existing drugs and thereby harnessing their full therapeutic power.
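
For readers unfamiliar with the evaluation measures named in this abstract, the following is a minimal sketch of how AUC, precision and accuracy can be computed from scored predictions. The function names, the toy labels and scores, and the 0.45 decision threshold are illustrative assumptions; they are not the authors' data or code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_accuracy(labels, scores, threshold):
    """Precision and accuracy after calling every score >= threshold a positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, accuracy


# Hypothetical drug-disease pairs: 1 = known association, 0 = no known association.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(labels, scores))
print(precision_accuracy(labels, scores, threshold=0.45))
```

AUC is threshold-free, whereas precision and accuracy depend on the cut-off chosen to turn similarity scores into predictions, which is why the three measures are usually reported together.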

Investigating the usefulness of long-read sequencing technologies on the study of large eukaryotic gene families
Download

Date: TBA
Room: TBA

  • Armin Rouhi, University of Calgary, Canada
  • Janneke Wit, University of Calgary, Canada
  • James Wasmuth, University of Calgary, Canada

Presentation Overview:

Advances in genome sequencing have led to the sequencing of a wide range of species at an unprecedented pace. The majority of these genomes, however, remain in a draft state: fragmented and unfinished. While the per-base accuracy of the genomes is extremely high (>99%), the most important factor limiting accurate genome assembly is read length. Next-generation sequencing (NGS) technologies have been limited to read lengths of up to 300 bp. The assemblies generated from these reads frequently misassemble tandemly arrayed genes and gene families with high copy number. Such gene families are often implicated in disease phenotypes and, for pathogenic organisms, interactions with their hosts. Therefore, it is important that they are correctly assembled.
A promising solution is the use of long-read sequencing (LRS) technologies. Three LRS technologies have recently become available: PacBio RS, Illumina Synthetic Long Reads (SLR), and the pocket-sized Oxford Nanopore MinION. These technologies are capable of producing read lengths of tens of thousands of bases. Published studies have reported success based on the length of the resultant assemblies and the ability to assemble transposable elements. There has not been a rigorous investigation of how well large gene families are resolved.
Here, we have examined the accuracy of de novo assembled C. elegans genomes using each of the three LRS technologies. We have generated new genome data using the MinION and compared them with data from the other LRS platforms. We show that assemblies generated using LRS platforms resolve regions containing gene families significantly better than assemblies based exclusively on short reads. We will present these findings and our assessment of which LRS technology is best suited to different scenarios.

Using phylogenetic instability to identify members of large gene families under adaptive evolution
Download

Date: TBA
Room: TBA

  • David Curran, University of Calgary, Canada
  • John Gilleard, University of Calgary, Canada
  • James Wasmuth, University of Calgary, Canada

Presentation Overview:

Properly understanding large gene families is an important part of understanding how organisms are able to adapt to their environments, as gene duplication followed by sub-functionalization is one of the most rapid routes to phenotypic innovation. This presents as lineage-specific expansions and contractions of gene families, generating paralogues and inter-species copy number variants, a characteristic that has been termed ‘phylogenetic instability’. The phenomenon correlates well with functions that involve direct environmental interactions, such as chemosensory receptors, immune responses to pathogens, and xenobiotic detoxification pathways. While phylogenetic instability has been used as a predictor of such functionality, there are no current methods to quantify it in large gene families. Here we present a novel algorithm, MIPhy, which solves this problem.
There are two aspects to the phylogenetic instability algorithm: clustering a phylogenetic tree into a set of meaningful sub-trees, and then quantifying the instability in those sub-trees. Algorithms do exist that could be used to quantify sub-trees, but some require detailed ancestral information that is only available for a small number of organisms, some do not incorporate the sub-tree branching information, and others use evolutionary models that we have found to be unsuitable. Ideally the clustering problem would be solved by protein interaction or biochemical reaction data, but this is rarely available. There are many ways to cluster a phylogenetic tree, but we have found existing methods to be inconsistent and overly sensitive to their arbitrary parameter values.
MIPhy finds the minimum number of events (gene duplication, gene loss, and incomplete lineage sorting) required to reconcile the observed gene tree with a given species tree. The clustering problem is solved by grouping the genes into homologous clusters such that total genomic events, modified by the standard deviation of each cluster, is minimized. We have applied MIPhy to several large and complex gene families, including the cytochrome P450s. We were able to distinguish between genes with endogenous functions and those that detoxify xenobiotic drugs.
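
To illustrate what reconciling a gene tree with a species tree involves, here is a minimal sketch of the textbook LCA (last common ancestor) reconciliation, which counts only the duplication events implied by a gene tree. It is not MIPhy itself: it ignores gene loss and incomplete lineage sorting and performs no clustering. The nested-tuple tree encoding, the `species_geneid` leaf naming and all function names are assumptions made for this example, and binary trees are assumed.

```python
def leaf_set(tree):
    """Set of leaf labels under a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def lca(species_tree, species):
    """Smallest species-tree clade (as a leaf set) containing all given species."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if species <= leaf_set(child):
                node = child          # descend into the child that still covers them
                break
        else:
            return leaf_set(node)     # no single child covers them: this is the LCA
    return leaf_set(node)


def count_duplications(gene_tree, species_tree):
    """Number of gene-tree nodes that LCA reconciliation labels as duplications."""
    def visit(node):
        if isinstance(node, str):
            # Gene leaves are named "<species>_<id>" in this sketch.
            return frozenset([node.split("_")[0]]), 0
        results = [visit(child) for child in node]
        species = frozenset().union(*(sp for sp, _ in results))
        dups = sum(d for _, d in results)
        here = lca(species_tree, species)
        # A node is a duplication if it maps to the same clade as one of its children.
        if any(lca(species_tree, sp) == here for sp, _ in results):
            dups += 1
        return species, dups

    return visit(gene_tree)[1]


species_tree = (("A", "B"), "C")
gene_tree = ((("A_1", "B_1"), ("A_2", "B_2")), "C_1")
print(count_duplications(gene_tree, species_tree))
```

In the example, the gene-tree node joining the two (A, B) copies maps to the same species clade as its children, so it is counted as one duplication in the ancestor of A and B.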

Efficient techniques for direct analysis of discrete-time population genetic models
Download

Date: TBA
Room: TBA

  • Ivan Kryukov, University of Calgary, Canada
  • Bianca de Sanctis, University of Calgary, Canada
  • A.P. Jason de Koning, University of Calgary, Canada

Presentation Overview:

Diffusion theory approximations lie at the heart of most population genetic and phylogenetic methods. These include: 1) Estimators of allele age based on current population frequency; 2) Site-frequency spectrum methods for inferring demography and selection from population-level variation; and 3) Population-genetic models of molecular evolution. However, these approximations are only available for a subset of analytically tractable models, which are often unrealistic. For more complex models, the diffusion approximations need to be themselves approximated, leading to further potential error and computational difficulty.

We present a new software package, Wright-Fisher Exact Solver (WFES), which performs fast, scalable computations in population genetics without diffusion approximations or simulations. WFES employs rapid, parallel, sparse linear algebra techniques for a direct analysis of arbitrary discrete-time models. This approach solves for long-term properties, such as the probability and expected time to fixation or extinction. Importantly, it also allows for new, custom, and exact statistics to be easily implemented, creating a workbench for new ideas to be easily developed and explored. By taking advantage of efficient computational techniques, WFES is applicable to the analysis of populations ranging in size from humans to model organisms, with effective population sizes up to hundreds of thousands of individuals on typical computers. This substantially extends the applicability of direct Markov chain methods in population genetics, which have typically been limited to studying population sizes of only a few hundred individuals.
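
The idea of analysing a discrete-time model directly, without diffusion approximations or simulation, can be sketched on a very small scale as follows. This toy builds the full Wright-Fisher transition matrix and solves a dense linear system for fixation probabilities in pure Python. It is not WFES, which the abstract describes as using rapid, parallel, sparse linear algebra. The function names, the haploid-style selection parameterisation and the dense Gaussian elimination are assumptions made for illustration.

```python
from math import comb


def wf_transition(m, s=0.0):
    """Wright-Fisher transition matrix for m gene copies; row i = current allele count."""
    rows = []
    for i in range(m + 1):
        x = i / m
        p = x * (1 + s) / (x * (1 + s) + (1 - x))   # sampling probability after selection
        rows.append([comb(m, j) * p**j * (1 - p)**(m - j) for j in range(m + 1)])
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fixation_probabilities(m, s=0.0):
    """u[k-1] = probability of eventual fixation starting from k copies, k = 1..m-1."""
    P = wf_transition(m, s)
    transient = range(1, m)
    # Absorbing-chain identity: (I - Q) u = r, with Q the transient block of P
    # and r the one-step probabilities of jumping straight to fixation.
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in transient] for i in transient]
    r = [P[i][m] for i in transient]
    return solve(A, r)


u = fixation_probabilities(20)
print(u[0])   # neutral case: a single new copy fixes with probability 1/20
```

The dense solve costs cubic time in the number of gene copies, which is why direct methods were historically limited to small populations; sparse solvers of the kind the abstract mentions are what make larger populations feasible.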

Biological Interaction Networks in Bacteria - Generation and Evolutionary Insights
Download

Date: TBA
Room: TBA

  • Cedoljub Bundalovic-Torma, Hospital for Sick Children, Canada
  • John Parkinson, University of Toronto, Canada

Presentation Overview:

Bacteria inhabit a diverse array of environments and form integral relationships with humans, which bear important consequences for health and disease. With the advent of next-generation sequencing technology, the availability of bacterial genomes has increased exponentially, allowing us to study not only the emergence of bacterial pathogens and the pressing challenges of circumventing antibiotic resistance, but also the composition of the human microbiome and the role of bacteria in mediating complex disorders, such as autism and obesity, which are increasing in prevalence today. However, the majority of bacterial biology remains uncharted, which for the past decade has motivated the generation of numerous large-scale biological networks for the model bacterium E. coli.

In this talk I will present current work to address these challenges, which involves the construction of the first large-scale physical interaction network of the Escherichia coli cell envelope proteome generated by tandem-affinity purification. Graphical clustering enabled the elucidation of several important biological processes mediated by physical interactions of cell-envelope-associated proteins. I will also illustrate how such data can be applied to identify biological processes that contribute to environmental adaptation in enterobacterial foodborne pathogens.

High-Performance and Exascale Computing Frontiers in Cancer Applications
Download

Date: TBA
Room: TBA

  • Eric Stahlberg, Frederick National Laboratory for Cancer Research, United States
  • Carl McCabe, Center for Biomedical Informatics and Information Technology - National Cancer Institute, United States
  • George Zaki, Frederick National Laboratory for Cancer Research, United States

Presentation Overview:

The expanded use of high-performance computing (HPC) is transforming the future of cancer research and clinical applications. The recently announced collaboration between the National Cancer Institute and the US Department of Energy joins earlier initiatives, including the NCIP Cancer Cloud Pilots and the Genomic Data Commons, that offer an advance look at a future in which extreme-scale and exascale computing are used regularly in cancer research applications. The presentation will include discussions of critical lessons learned, research results from applying HPC to image processing and signal processing applications in RNA structure determination, and areas of future research activity related to the use of future computing technologies. The presentation will also cover efforts underway to raise the level of computational and data science education within the cancer research community.

Xenolog Classification: Steps toward a Xenolog Conjecture
Download

Date: TBA
Room: TBA

  • Charlotte Darby, Carnegie Mellon University, United States
  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

In his landmark review, Fitch (Trends Genet. 2000) defined xenology as "the relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material." However, the nomenclature currently available to describe homology relationships involving transfer remains ambiguous although many different evolutionary scenarios could be described as histories that involve interspecies transfer. Yet careful classification of horizontally transferred genes is essential for gaining insight into complex evolutionary processes and for homology-based gene function prediction.
We propose a classification scheme which provides a comprehensive nomenclature capturing the variety of xenologous relationships which can—and do—occur. We define formal rules that unambiguously assign gene pairs to xenolog classes. These rules are based on the topology of a reconciled gene and species tree, can be applied to a tree with arbitrary number and arrangement of transfer events, and have been implemented in prototype software. Our scheme accounts for the inherent asymmetry of horizontal transfer, whether both genes are in the same species, the potential interaction of duplication and transfer events, and order of divergences in the gene and species trees. We demonstrate the practical application of this classification by showing its correspondence to different functional properties.

Reactome Revolutions - Pathways and Networks
Download

Date: TBA
Room: TBA

  • Robin Haw, Reactome - OICR, Canada

Presentation Overview:

Modern health initiatives and drug discovery are focused increasingly on targeting diseases that arise from perturbations in complex cellular events. Consequently, there has been a tremendous effort in biological research to elucidate the molecular mechanisms that underpin normal cellular processes. A reaction-network pathway knowledgebase is the tool of choice for assembling and visualizing the “parts list” of proteins and functional RNAs, as a foundation for understanding cellular processes, function and disease. The Reactome Knowledgebase (www.reactome.org) is a publicly accessible, open-access bioinformatics resource that stores full descriptions of human biological reactions, pathways and processes. Curated pathway knowledgebases, like Reactome, are uniquely powerful and flexible tools for extracting biologically and clinically useful information from the flood of genomic data. Our data model accommodates the annotation of disease processes, allowing us to represent the altered biological behaviour of mutant variants frequently found in cancer, and to describe the mode of action and specificity of drugs and therapeutics. Bio- and chemoinformaticians use Reactome to interpret high-throughput experimental datasets, to develop novel algorithms for data mining and visualization, and to build predictive models of normal and abnormal pathways. Specific features of Reactome support the visualization of interactions of many gene products in a complex biological process, and the application of bioinformatics tools to find causal patterns in genomic data sets. To maximize Reactome’s coverage of the genome, we have supplemented curated data with a conservative set of predicted functional interactions (FI), roughly doubling our coverage of the translated genome.
We have developed a Cytoscape app called “ReactomeFIViz”, which utilizes this FI network to assist biologists to perform pathway and network analysis to search for gene signatures from within gene expression data sets or identify significant genes within a list. Pathway and network-based tools for building and validating interaction networks derived from multiple data sets will give researchers substantial power to screen intrinsically noisy experimental data in order to uncover biologically relevant information.

Identification of novel genomic islands in Liverpool epidemic strain of Pseudomonas aeruginosa using segmentation and clustering
Download

Date: TBA
Room: TBA

  • Rajeev Azad, University of North Texas, United States
  • Mehul Jani, University of North Texas, United States
  • Kalai Mathee, Florida International University, United States

Presentation Overview:

Pseudomonas aeruginosa is an opportunistic pathogen implicated in a myriad of infections, and a leading pathogen responsible for mortality in patients with cystic fibrosis (CF). Horizontal transfers of genes among the microorganisms living within CF patients have led to more morbid and multidrug-resistant strains, such as the Liverpool epidemic strain of Pseudomonas aeruginosa, namely the LESB58 strain, which has the propensity to acquire virulence and antibiotic resistance genes. Often these genes are acquired in large clusters, referred to as “genomic islands”. A genome mining tool based on a recursive segmentation and clustering procedure, “GEMINI”, was used to decipher novel genomic islands and understand their contributions to the evolution of virulence and antibiotic resistance in P. aeruginosa LESB58. GEMINI was validated on experimentally verified genomic islands in P. aeruginosa LESB58 before examining its potential to decipher novel islands. Of the 6062 genes in P. aeruginosa LESB58, 596 genes were identified as residing on 20 genomic islands, of which 12 had not been previously reported. Comparative genomics provided evidence in support of our novel predictions. GEMINI unravelled the mosaic structure of islands that are composed of segments of likely different evolutionary origins, and also demonstrated its ability to identify potential strain biomarkers. These newly found islands have likely contributed to the hyper-virulence and multidrug resistance of the Liverpool epidemic strain of P. aeruginosa.

Computational methods to identify cancer-driver single nucleotide variants and large rearrangements in non-coding regions
Download

Date: TBA
Room: TBA

  • Eric Minwei Liu, Weill Cornell Medical College, United States
  • Alexander Martinez Fundichely, Weill Cornell Medical College, United States
  • Priyanka Dhingra, Weill Cornell Medical College, United States
  • Andrea Sboner, Weill Cornell Medical College, United States
  • Ekta Khurana, Weill Cornell Medical College, United States

Presentation Overview:

Most variants obtained from whole-genome sequencing occur in non-coding regions of the genome. Although variants in protein-coding regions have received the majority of attention, numerous studies have now noted the importance of non-coding variants in cancer. Identification of functional non-coding variants that drive tumor growth remains a challenge and a bottleneck for the use of whole-genome sequencing in the clinic. We have developed two computational methods to identify non-coding cancer drivers. I will present the details of these methods and discuss the ongoing efforts to apply them to analyze ~2800 tumor whole genomes in the ‘Pan-Cancer Analysis of Whole Genomes’ (PCAWG) consortium. (1) For single nucleotide variants, we have developed CompositeDriver-SNV. This method integrates the signals of high functional impact of variants with the recurrence of variants across multiple tumor samples to identify the elements that carry more mutations, and mutations of higher functional impact, than expected by chance. The functional impact is predicted using the FunSeq scheme, which uses the properties of ENCODE elements (including conservation, transcription-factor (TF) motif disruption and network properties) within a weighted scoring scheme to predict the deleteriousness of non-coding variants. (2) For large genomic rearrangements, we have developed CompositeDriver-SV. The structural variants analyzed can be of any type, e.g. deletions, duplications or translocations. The non-parametric null model accounts for tumor-specific properties of the rearrangements. Using this approach, we are able to identify novel functional elements that are significantly rearranged in prostate cancer.

A Versatile Framework for Learning Feature-Based Protein-DNA Recognition Models Directly from SELEX Data
Download

Date: TBA
Room: TBA

  • Chaitanya Rastogi, Columbia University, United States
  • Gabriella Martini, Columbia University, United States
  • H. Tomas Rube, Columbia University, United States
  • Harmen Bussemaker, Columbia University, United States

Presentation Overview:

SELEX-seq [1] and HT-SELEX [2,3] are sequencing-based methods for elucidating the intrinsic DNA binding specificity of transcription factor (TF) complexes at high resolution. While the amount of raw information that modern SELEX provides is unprecedented, the computational methods for building DNA recognition models (“motifs”) from these data are still far from mature. The standard is to tabulate the relative enrichment of each oligomer of a given length [4], for which we have developed efficient software [5]. Unfortunately, having to use oligomer tables as an intermediate step for feature-based analysis [6] has two key disadvantages: (i) a limited range over which readout can be analyzed, as counts decrease exponentially with footprint size; and (ii) the requirement for prior ad hoc sequence-based alignment of different oligomers. We present a new and highly versatile framework for motif discovery from SELEX data that overcomes these limitations. It uses a hierarchical maximum likelihood approach to fit a feature-based, biophysically motivated protein-DNA recognition model directly to the raw SELEX data. First, this allows us to consider base and shape readout in more detail and over a larger footprint than was possible before, as we illustrate by reanalyzing Hox heterodimer data. Second, we can now for the first time analyze shape readout for TFs with low binding specificity, which we demonstrate using Hox monomer data. We find that shape readout by the Hox N-terminal arm is already seen for the monomer, but is altered by the presence of the Exd cofactor. Our method produces rich, biophysically interpretable models from only a single round of SELEX-seq data. Additionally, our flexible modeling framework should be easily extendable to other sequencing-based assays.

[1] M. Slattery, T.R. Riley, P. Liu, N. Abe, P. Gomez-Alcala, R. Rohs*, B. Honig*, H.J. Bussemaker*, R.S. Mann*. (2011) Cofactor Binding Evokes Latent Differences in DNA Binding Specificity between Hox proteins. Cell 147:1270-82.
[2] Y. Zhao, D. Granas, G.D. Stormo*. (2009) Inferring Binding Energies from Selected Binding Sites. PLoS Comput. Biol. 5(12): e1000590.
[3] A. Jolma, et al. (2013) DNA-Binding Specificities of Human Transcription Factors. Cell 152: 327-339.
[4] T.R. Riley, M. Slattery, N. Abe, C. Rastogi, D. Liu, R.S. Mann*, and H.J. Bussemaker*. (2014) SELEX-seq, a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol. Biol. 1196:255-78.
[5] http://bioconductor.org/packages/release/bioc/html/SELEX.html
[6] N. Abe, I. Dror, L. Yang, M. Slattery, T. Zhou, H.J. Bussemaker, R. Rohs*, R.S. Mann*. (2015) Deconvolving the Recognition of DNA Shape from Sequence. Cell 161:307-18.
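
The oligomer-table step that the abstract describes as the current standard can be sketched as follows: count every k-mer in the reads of an initial library and of a selected round, convert the counts to frequencies, and report each k-mer's enrichment relative to the most enriched one. This is a simplified illustration, not the cited software [5]; the function names, the pseudocount and the toy reads are assumptions, and published pipelines may correct for biases in the initial library differently.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_enrichment(round0, round1, k, pseudocount=1.0):
    """Frequency ratio (round1 / round0) per k-mer, scaled so the top k-mer is 1.0."""
    c0 = kmer_counts(round0, k)
    c1 = kmer_counts(round1, k)
    kmers = set(c0) | set(c1)
    # The pseudocount (an assumption of this sketch) avoids division by zero
    # for k-mers seen in only one of the two rounds.
    total0 = sum(c0.values()) + pseudocount * len(kmers)
    total1 = sum(c1.values()) + pseudocount * len(kmers)
    ratio = {
        kmer: ((c1[kmer] + pseudocount) / total1) / ((c0[kmer] + pseudocount) / total0)
        for kmer in kmers
    }
    best = max(ratio.values())
    return {kmer: value / best for kmer, value in ratio.items()}


# Hypothetical reads from the initial library (round 0) and after one selection round.
table = relative_enrichment(["ACGT", "TTTT"], ["ACGT", "ACGT", "ACGT"], k=3)
for kmer, value in sorted(table.items(), key=lambda item: -item[1]):
    print(kmer, round(value, 3))
```

The sketch also makes the first disadvantage named in the abstract concrete: the number of possible k-mers grows as 4^k, so the counts available per table entry shrink quickly as the footprint length k increases.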

How plastic are multidomain proteins? Quantifying domain co-evolution in primate genomes
Download

Date: TBA
Room: TBA

  • Maureen Stolzer, Carnegie Mellon University, United States
  • Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

Multidomain proteins evolve via the insertion, duplication, and deletion of domains, sequence fragments that encode structural or functional modules. This process of domain shuffling allows for rapid exploration of functions by introducing new combinations of existing folds. Understanding the evolution of protein function requires an understanding of how domain architectures change over time, because gain, loss, or replacement of a domain can result in an immediate and dramatic change in protein interactions. It has been argued that domain pairs that co-occur are selectively favored combinations and once united tend to persist as a unit. This hypothesis is supported by empirical studies reporting that convergent formation of the same domain architecture is rare. However, the gain-loss parsimony methods typically used to reconstruct domain shuffling histories ignore sequence variation between domain instances and cannot recognize parallel gains or losses. If the same domain architectures were forming repeatedly, this pattern would not be discerned by gain-loss parsimony.

To test this hypothesis, we conducted a genome-scale analysis of co-occurring domain pairs in primates, using a novel, phylogeny-based approach that can distinguish between different domain instances and infer parallel events. Almost half of all domain pairs tested (45%) experienced at least two independent fusions. Even when only events with very strong statistical support (bootstrap > 90%) were considered, two or more independent fusions were inferred in 25% of the cases considered. Our results challenge the hypothesis that convergent formation of domain architectures is rare. Further, they lead us to question the extent to which our understanding of multidomain evolution is driven by the algorithms we use to study them and highlight the need for more powerful evolutionary algorithms. Finally, our results have practical consequences for homology-based function prediction, since they suggest that proteins with the same domain architecture may contain domains that are not orthologous.

PopNet: Revealing the Impact of Recombination on Population Structure
Download

Date: TBA
Room: TBA

  • Javi Zhang, University of Toronto, Canada
  • Asis Khan, NIH, United States
  • Andrea Kennard, NIH/NIAID, United States
  • Michael E. Grigg, NIH, United States
  • John Parkinson, Hospital for Sick Children, Canada

Presentation Overview:

A central question in population genomics is how populations interact. Admixture and recombination between populations strongly impact their structures. In the context of public health, this may translate to the development and spread of virulence and resistance factors. Currently, new visualization methods are needed to study recombination. Existing methods, such as Neighbour-Net and Structure, offer limited ability to compare populations. In particular, while shared ancestry can be inferred, the genomic locations are not shown. Hence, the challenge lies in facilitating the identification of regions of shared ancestry between individuals.
To meet the challenge, we present PopNet, which combines the concept of chromosome painting with network visualization to illustrate regions of shared ancestry. We demonstrate the effectiveness of PopNet through its application to three diverse populations of Saccharomyces cerevisiae, Toxoplasma gondii, and Plasmodium falciparum, showing that S. cerevisiae lineages form around habitat or function, North American T. gondii families share multiple regions of similarity, and Asian P. falciparum populations show a higher degree of introgression compared to their African counterparts. PopNet’s novel visualization process offers a new framework for the analysis of recombination between populations.

Building and Sustaining Bioinformatics Community Diversity
Download

Date: TBA
Room: TBA

  • Alexander Ropelewski, Pittsburgh Supercomputing Center, United States
  • Ricardo Gonzalez Mendez, University of Puerto Rico School of Medicine, United States
  • Jimmy Torres, School of Communications, University of Puerto Rico Rio Piedras, Puerto Rico
  • Hugh Nicholas, Pittsburgh Supercomputing Center, United States
  • Pallavi Ishwad, Pittsburgh Supercomputing Center, United States

Presentation Overview:

Recent discussions have focused on improving bioinformatics education by defining key biological, mathematical, and computational competencies required to be a bioinformatics practitioner. Keeping bioinformatics education up-to-date is a challenge on all campuses but especially so at Minority Serving Institutions (MSIs), where faculty are burdened with heavy teaching loads and are isolated from critical bioinformatics resources, such as sequencers and high performance computers.

Here, we describe our experience building bioinformatics competencies within the underrepresented minority community through the Pittsburgh Supercomputing Center’s Minority Access to Research Careers program. This program, begun in 2001, assists Minority Serving Institutions through broad-based educational and research programs oriented towards building and sustaining bioinformatics curricula and programs at partner MSIs. The emphasis of this program is based on the community’s documented training needs, enabling scientists to develop the prerequisite competencies needed to excel in bioinformatics.

We will discuss the barriers and challenges encountered by the program and interventions developed to address these barriers and challenges. We will conclude with general recommendations to improve diversity within the bioinformatics community based on the experiences of the program.

PIPE (Protein-protein Interaction Prediction Engine): A computational approach for comprehensive soybean functional genomics
Download

Date: TBA
Room: TBA

  • Bahram Samanfar, AAFC-ORDC, Canada
  • Andrew Schoenrock, Carleton University, Canada
  • Frank Dehne, Carleton University, Canada
  • Ashkan Golshani, Carleton University, Canada
  • Elroy Cober, AAFC-ORDC, Canada
  • Martin Charette, AAFC-ORDC, Canada
  • Steve Molnar, AAFC-ORDC, Canada

Presentation Overview:

Protein-Protein Interactions (PPIs) are essential molecular interactions that define the biology of a cell, its development and responses to various stimuli. Theoretically (“guilt by association”), if a gene interacts with groups of genes involved in one specific pathway, that gene might also be involved in that specific pathway. Our knowledge of global PPI networks in complex organisms such as human and plants is restricted by technical limitations of current methods. The Protein-protein Interaction Prediction Engine (PIPE) is a computational tool used to predict protein-protein interactions (PPI). PIPE has been used to produce proteome-wide, all-to-all predicted interactomes in a variety of organisms including yeast (Saccharomyces cerevisiae), human (Homo sapiens), Arabidopsis and others. PIPE can produce individual PPI predictions in a fraction of a second and is typically tuned, for a given organism, to achieve a specificity of 99.95%. PIPE has been independently evaluated and compared to other PPI prediction methods and has been shown to significantly outperform the others in terms of recall-precision across all of the datasets tested. It has also been shown that PIPE has the ability to produce cross-species predictions (ie. use interaction data from one organism to make high quality PPI predictions in another). Briefly, PIPE works based on searching for re-occurring short polypeptide sequences between known interacting protein pairs; simply, it predicts interactions based on protein sequence information and a database of known interacting pairs. PIPE requires a set of known interacting protein pairs as well as their primary (amino acid) sequences to be able to make its predictions. Recently, PIPE is being redesigned to be able to computationally handle the large proteome of soybean (75,778 confirmed soybean protein sequences). Currently we are using PIPE towards predicting the first comprehensive protein-protein interaction network for soybean.
Soybean is one of the major Canadian grain crops and its production is expanding in Canada, with the majority of the increase in short-season areas (Western Canada and northern regions). So far, eleven maturity loci have been reported in soybean; however, the molecular basis of almost half of them is not yet clear. The list of novel factors affecting these pathways in soybean, and in model plants like Arabidopsis, continues to grow, suggesting the presence of other novel players that are yet to be discovered. To this end, we have used three different approaches, namely bioinformatics (functional genomics), classical plant breeding and molecular biology (analysis of SSR and SNP haplotypes), to identify novel genes involved in flowering and maturity pathways in soybean. Identification of molecular markers tagging the PIPE-identified genes controlling flowering and maturity in soybean will allow soybean breeders to efficiently develop varieties using molecular marker-assisted breeding. Allele-specific markers will allow stacking of early maturity alleles to develop even earlier-maturing cultivars. This bioinformatics approach will also help to bridge the gap in knowledge of the flowering and maturity pathway in soybean and can be applied to other important traits such as seed protein content, oil quality and host-pathogen interactions.
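
The window-matching idea described above can be illustrated with a toy sketch. It is not PIPE itself: it assumes exact matching of fixed-length windows, and the function names (windows, pipe_like_score) and the example sequences are hypothetical. The abstract says only that PIPE searches for recurring short polypeptide sequences between known interacting pairs.

```python
def windows(seq, w):
    """Return the set of all length-w substrings (sliding windows) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def pipe_like_score(seq_a, seq_b, known_pairs, sequences, w=3):
    """Count window pairs (one window from each query protein) that co-occur
    in a known interacting pair. Exact matching is an assumption of this sketch."""
    wa, wb = windows(seq_a, w), windows(seq_b, w)
    score = 0
    for x, y in known_pairs:
        wx, wy = windows(sequences[x], w), windows(sequences[y], w)
        # A known pair (x, y) supports the query in either orientation.
        score += len(wa & wx) * len(wb & wy)
        score += len(wa & wy) * len(wb & wx)
    return score


# Hypothetical toy database: one known interaction between P1 and P2.
sequences = {"P1": "MKTAYIAK", "P2": "GGSLLQWE"}
known_pairs = [("P1", "P2")]
print(pipe_like_score("AAKTAYAA", "CCSLLQCC", known_pairs, sequences))
```

A higher count means more shared windows with known interacting pairs; a real predictor would compare this evidence against a tuned threshold.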

Mining for new antimicrobial agents: predicting bacteriocin gene blocks
Download

Date: TBA
Room: TBA

  • James Morton, University of California, San Diego, United States
  • Stefan Freed, University of Notre Dame, United States
  • Md Nafiz Hamid, Iowa State University, United States
  • Shaun Lee, University of Notre Dame, United States
  • Iddo Friedberg, Iowa State University, United States

Presentation Overview:

Bacteriocins are peptide-derived molecules produced by bacteria, whose recently-discovered functions include virulence factors and signaling molecules as well as their better known roles as antibiotics. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are highly diverse and widely distributed among bacterial species. Given the heterogeneity of bacteriocin compounds, many tools struggle with identifying novel bacteriocins due to their vast sequence and structural diversity. Many bacteriocins undergo post-translational processing or modifications necessary for the biosynthesis of the final mature form. Enzymatic modification of bacteriocins as well as their export is achieved by proteins whose genes are often located in a discrete gene cluster proximal to the bacteriocin precursor gene, referred to as context genes in this study. Although bacteriocins themselves are structurally diverse, context genes have been shown to be largely conserved across unrelated species.

Using this knowledge, we set out to identify new candidates for context genes which may clarify how bacteriocins are synthesized, and identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon and gene block Associator (BOA) that can identify homologous bacteriocin associated gene blocks and predict novel ones. BOA generates profile Hidden Markov Models from the clusters of bacteriocin context genes, and uses them to identify novel bacteriocin gene blocks and operons.
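
As a rough illustration of what a "gene block" means here, the sketch below groups gene hits on one contig whenever the gap between neighbours is small. It is not the BOA implementation: the grouping rule, the max_gap parameter and the example coordinates are assumptions made for illustration only.

```python
def gene_blocks(hits, max_gap):
    """Group (start, end, name) hits on one contig into blocks.
    A hit joins the current block if it starts within max_gap bases
    of the furthest end seen so far in that block."""
    blocks = []
    for hit in sorted(hits):
        if blocks and hit[0] - max(h[1] for h in blocks[-1]) <= max_gap:
            blocks[-1].append(hit)
        else:
            blocks.append([hit])
    return blocks


# Hypothetical hits, e.g. as they might come from profile HMM searches.
hits = [
    (100, 400, "bacteriocin"),
    (450, 1200, "transporter"),
    (1300, 2000, "modifier"),
    (50000, 51000, "immunity"),
]
for block in gene_blocks(hits, max_gap=500):
    print([name for _, _, name in block])
```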
Results and conclusions

We provide a novel dataset of predicted bacteriocins and context genes. We also discover that several phyla have a strong preference for bacteriocin genes, suggesting distinct functions for this group of molecules.

Software Availability: https://github.com/idoerg/BOA

Network-driven discovery and interpretation of cancer driver mutations
Download

Date: TBA
Room: TBA

  • Jüri Reimand, Ontario Institute for Cancer Research, Canada

Presentation Overview:

Identifying driver mutations from cancer exome and genome sequencing data is essential to deciphering tumour biology and designing precision treatments. Information on pathways and molecular interaction networks can improve the interpretation of cancer mutations and help associate them with mechanisms and clinical information. We hypothesize that many cancer driver mutations precisely modify interfaces encoded in small sites in proteins and DNA, leading to interaction losses and gains in networks. We developed computational strategies to find network-associated driver mutations and infer their impact on network topology. The mutation enrichment model ActiveDriver detects proteins with site-specific positive selection, and the machine learning method MIMP infers mutations that rewire kinase signalling networks by erasing existing phosphorylation sites and creating new phosphorylation sites in substrate proteins. We conducted pan-cancer analyses of post-translational modification (PTM) networks and showed their enrichment in known driver mutations, increased functional impact, frequent rewiring of network topology, and associations with clinical characteristics. We also studied PTM networks in the human population and found that inter-individual genome variation is significantly reduced in PTM sites while inherited disease mutations are significantly enriched. This emphasizes the importance of network-related variation in human physiology and cancer. We extended our approaches to full cancer genomes and investigated site-specific mutations in gene promoters, enhancers, and transcription regulatory networks, using data from the International Cancer Genome Consortium and the Roadmap Epigenomics project. Our network-centric approaches provide novel interpretations of known cancer mutations, help find new cancer drivers and risk modifier alleles, and characterise their biological mechanisms.

Detecting a novel signature of non-coding, regulatory alterations in cancer genomes
Download

Date: TBA
Room: TBA

  • Kyle Smith, University of Colorado, United States
  • Vinod Yadav, University of Colorado, United States
  • Subhajyoti De, Rutgers Cancer Institute of New Jersey, United States

Presentation Overview:

Oncogenic mutations outside protein-coding regions remain largely unexplored. Analyses of the TERT locus have indicated that non-coding regulatory mutations can be more frequent than previously suspected and play important roles in oncogenesis. So far, only a limited number of studies are under way to identify recurrent mutations in promoters of known genes. Yet functional mutations need not always be recurrent at the same base position (e.g., TP53 mutations are distributed throughout the gene), and non-coding mutations are no exception. Recurrence-based detection methods are not designed to detect these alternative mutation signatures. The signature of accelerated somatic evolution (SASE) is one such novel mutation signature in non-coding regions that we recently reported. Genomic regions under accelerated evolution are those that accumulate an excess of mutations compared to the number expected from the background mutation rate. In mammalian evolution, human accelerated regions (HARs; regions that acquired significantly more substitutions than expected after divergence from the common ancestor with chimpanzees) were frequently found to have regulatory functions contributing to human-specific attributes. We applied this concept to cancer and developed a computational method, SASE-hunter, to identify SASE in a genomic locus, and prioritized those loci that carried the signature in multiple cancer patients. Interestingly, even when an affected locus carried the signature in multiple individuals, the mutations contributing to SASE themselves were not necessarily recurrent at base-pair resolution. In a pan-cancer analysis of 12 tumor types, we detected SASE in the promoters of known cancer genes such as MYC, BCL2, RBM5 and WWOX. SASEs in selected cancer gene promoters were associated with over-expression, and also correlated with the age of onset of cancer, aggressiveness of the disease and survival.
Taken together, our work detects a hitherto under-appreciated and clinically important class of regulatory changes in cancer genomes.
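
The notion of "excess of mutations compared to that expected based on background mutation rate" can be made concrete with a simple tail probability. The sketch below assumes a Poisson model for mutation counts; the abstract does not state which statistic SASE-hunter actually uses, so the model, the function names and the numbers are illustrative assumptions.

```python
import math


def poisson_sf(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    cdf_below_k = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf_below_k


def excess_mutation_pvalue(observed, region_length, background_rate):
    """Probability of seeing at least `observed` mutations in a region of
    `region_length` bases if mutations arise at `background_rate` per base
    (Poisson model, an assumption of this sketch)."""
    expected = region_length * background_rate
    return poisson_sf(observed, expected)


# Hypothetical promoter: 1000 bp, background rate 0.001 per bp, 8 mutations seen.
print(excess_mutation_pvalue(8, 1000, 0.001))
```

Note that the test uses only the count of mutations in the region, so a locus can score as accelerated even when no single base position is mutated recurrently.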

Dynamic and integrative biological network research of aging
Download

Date: TBA
Room: TBA

  • Fazle Faisal, University of Notre Dame, United States
  • Yuriy Hulovatyy, University of Notre Dame, United States
  • Huili Chen, University of Notre Dame, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Presentation Overview:

ABSTRACT

The world is on average growing older, with people over 60 years of age representing 11% of the global population. Because of this, and because susceptibility to diseases increases with age, studying molecular causes of aging continues to gain importance. However, human aging is hard to study experimentally due to long lifespan as well as ethical constraints. Therefore, human aging-related knowledge needs to be inferred computationally. Computational analyses of gene expression or genomic sequence data, which have been indispensable for investigating human aging, are limited to studying genes (or their protein products) in isolation, ignoring their cellular interconnectivities. But proteins do not function in isolation; instead, they carry out cellular processes by interacting with other proteins. And this is exactly what biological networks, such as protein-protein interaction (PPI) networks, model. Thus, analyzing topologies of proteins in PPI networks could contribute to our understanding of the processes of aging.

The majority of the current methods for analyzing systems-level PPI networks deal with their static representations, due to limitations of biotechnologies for PPI collection, even though cellular functioning is dynamic. For this reason, and because different data types can give complementary biological insights, we integrate current static PPI network data with aging-related gene expression data to computationally infer dynamic, age-specific PPI networks. Then, we apply a series of sensitive measures of network topology to the dynamic PPI network data to study cellular changes with age. For example, we apply a graphlet-based measure of local network position (or centrality) of a node; graphlets are small connected induced subgraphs. By doing so, we find that while global PPI network topologies do not significantly change with age, local topologies (i.e., network centralities) of a number of genes do. We predict such genes to be key players in the processes of aging [1]. We demonstrate the credibility of our predictions by: 1) observing significant overlap between our predicted aging-related genes and known "ground truth" aging-related genes; 2) observing significant overlap between functions and diseases that are enriched in our aging-related predictions and those that are enriched in the "ground truth" data; 3) providing evidence that diseases which are enriched in our aging-related predictions are linked to human aging; and 4) validating our high-scoring novel predictions in the literature.
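
One simple way to picture the integration step is to keep, for each age, only the static PPI edges whose two genes are both expressed at that age. The sketch below does this and then compares node degree across ages. The expression threshold, the data layout and the use of degree in place of the graphlet-based centrality are assumptions of this illustration.

```python
def age_specific_network(edges, expression, age, threshold):
    """Return the edges of the static network induced on genes whose
    expression at `age` is at least `threshold`."""
    active = {g for g, by_age in expression.items() if by_age.get(age, 0.0) >= threshold}
    return {(u, v) for u, v in edges if u in active and v in active}


def degree(edges, node):
    """Number of edges touching `node` (a stand-in for a centrality measure)."""
    return sum(node in e for e in edges)


# Hypothetical static network and expression values at two ages.
edges = {("A", "B"), ("B", "C"), ("A", "C")}
expression = {
    "A": {20: 5.0, 80: 5.0},
    "B": {20: 5.0, 80: 0.1},
    "C": {20: 5.0, 80: 5.0},
}
for age in (20, 80):
    net = age_specific_network(edges, expression, age, threshold=1.0)
    print(age, sorted(net), degree(net, "A"))
```

A gene whose centrality changes markedly across the age-specific snapshots would be a candidate aging-related gene under this scheme.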

In the above work, we study network (e.g., graphlet-based) positions of a node in each individual (static) age-specific PPI network "snapshot" and then simply consider time series of the results. In the process, we still overlook likely important relationships between the different snapshots. To capture the inter-snapshot relationships explicitly, we take the well-established and proven ideas behind static graphlets to the next level to develop novel theory of dynamic graphlets that are needed to allow for truly dynamic analysis of the age-specific PPI networks [2]. When we apply the dynamic graphlet approach to study human aging (just as described above), this approach further improves upon our previous work in terms of the quality of aging-related predictions. Namely, our new predictions lead to better overlap with "ground truth" aging-related data as well as to more aging-relevant functional and disease enrichments. Importantly, our new approach unveils novel knowledge about human aging with high (e.g., literature) validation accuracy, thus complementing the existing aging-related knowledge.

REFERENCES

1. Faisal F.E. and Milenković T. 2014. Dynamic networks reveal key players in aging. Bioinformatics 30(12):1721-29.
2. Hulovatyy Y., Chen H. and Milenković T. 2015. Exploring the structure and function of temporal networks with dynamic graphlets. Bioinformatics 31(12):i171-180.

Survey of the Heritability and Sparsity of Gene Expression Traits Across Human Tissues
Download

Date: TBA
Room: TBA

  • Heather Wheeler, Loyola University Chicago, United States
  • Kaanan Shah, University of Chicago, United States
  • Jonathon Brenner, Loyola University Chicago, United States
  • Hae Kyung Im, University of Chicago, United States

Presentation Overview:

Regulatory variation plays a key role in the genetics of complex traits, as demonstrated by the consistent enrichment of expression quantitative trait loci (eQTLs) among trait-associated variants. Thus, understanding the genetic architecture of gene expression traits within and across tissues will help elucidate the underlying mechanisms of complex traits. We present a systematic survey of the heritability (h²) and the distribution of variant effect sizes on gene expression across the human body. Using RNA-seq data from a comprehensive set of tissue samples generated by the Genotype-Tissue Expression (GTEx) Project and the Depression Genes and Networks (DGN) whole blood cohort, we find that local h² (the contribution of SNPs within 1 Mb of the gene) can be relatively well characterized, with 50% of expressed genes showing significant h² in DGN and 8-19% in GTEx. However, the current sample sizes (n = 922 in DGN and n < 362 in each of the 40 GTEx tissues) only allow us to compute distal h² for a handful of genes (3% in DGN and < 1% in GTEx). Thus, here we focus on local regulation. Bayesian Sparse Linear Mixed Model (BSLMM) analysis provides strong evidence that the local architecture of gene expression traits is sparse rather than polygenic across DGN and all 40 GTEx tissues examined. This result is further confirmed by the sparsity of the best-performing gene expression predictors obtained via elastic net modeling. To further explore tissue context specificity, we use a mixed-effects model to decompose the expression traits into cross-tissue and tissue-specific components. Heritability and sparsity estimates of these derived expression phenotypes show similar characteristics to the original traits. The local h² estimates of the cross-tissue phenotype have larger magnitude and lower standard errors than single-tissue estimates owing to the borrowing of information across all samples.
Consistent properties relative to prior GTEx multi-tissue results suggest that these derived traits reflect the expected biology. We apply this knowledge to develop prediction models of gene expression traits for all tissues. The prediction models, heritability, and prediction performance R² for original and decomposed expression phenotypes are made publicly available for use in our gene-level association method, PrediXcan (https://github.com/hakyimlab/PrediXcan).

Decoding compound mechanism of action using integrative pharmacogenomics
Download

Date: TBA
Room: TBA

  • Nehme El-Hachem, Institut de recherches cliniques de Montreal, Canada

Presentation Overview:

Nehme El-Hachem1,2,*, Deena M.A. Gendoo3,4,*, Laleh Soltan Ghoraie3,4, Zhaleh Safikhani3,4, Petr Smirnov3, Ruth Isserlin5, Jacques Archambault6, Gary Bader5,6, Anna Goldenberg8,9, Benjamin Haibe-Kains3,4,8
1 Integrative Computational Systems Biology, Institut de Recherches Cliniques de Montréal, Montreal, Quebec, Canada
2 Department of Biomedical Sciences, Université de Montréal, Montreal, Quebec, Canada
3 Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
4 Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
5 The Donnelly Centre, Toronto, Ontario, Canada
6 Laboratory of Molecular Virology, Institut de Recherches Cliniques de Montréal
7 The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
8 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
9 Hospital for Sick Children, Toronto, Ontario, Canada

For decades, the “one drug-one target-one disease” paradigm dictated much of the drug development process. However, in the past ten years, tremendous advances in transcriptomics and genomics research shifted this simplistic view of a drug mechanism of action (MoA) to a more complex systems pharmacology paradigm where a drug can bind to several targets.
Several computational strategies have been proposed to elucidate the mechanism of action of existing and newly developed drug-like compounds. Traditional approaches predicted new drug-target associations based on the chemical similarity of the corresponding ligands or on the side effects of approved drugs. Recent bioinformatic approaches built drug-drug networks from drug-induced transcriptional profiles and inferred new mechanisms of action. However, current drug taxonomies either rely on information that is difficult to gather for new compounds (e.g., side effects) or predict drug targets and MoA inaccurately. There is therefore a pressing need to leverage the increasing amount of pharmacogenomic data in order to improve drug taxonomy by better characterizing drug targets without relying on prior knowledge such as therapeutic indications or side effects.
In our study, we integrated different layers of information from recent large-scale pharmacogenomic datasets in order to infer new MoA for chemical compounds extracted from cancer screens, using three data layers: (i) drug structural similarity; (ii) drug perturbation transcriptomic profiles from the LINCS database; and (iii) drug sensitivity profiling assays from cancer cell lines (CTRPv2). We used our recently published Similarity Network Fusion algorithm to efficiently integrate these three data layers into a single, integrative drug taxonomy called Drug Network Fusion (DNF). We found that DNF outperformed drug taxonomies based on single data layers for both drug target prediction (DNF concordance index = 0.89 vs. 0.71, 0.83, 0.64 for structure, sensitivity and perturbation layers, respectively) and ATC classification (DNF concordance index = 0.77 vs. 0.72, 0.58, 0.54 for structure, sensitivity and perturbation, respectively).
We correctly classified almost all kinase inhibitors and inferred new mechanisms for other undescribed compounds. Our innovative computational framework highlights the importance of integrating complementary data layers concerning drugs, such as chemical, transcriptional and sensitivity profiles. DNF can be easily extended to more compounds or data layers and as such constitutes a valuable resource for the cancer research community, providing new hypotheses on compound MoA and potential insights for drug repurposing.
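
For readers unfamiliar with the concordance index used to compare the taxonomies, the sketch below shows the standard pairwise definition on a binary benchmark: the fraction of (positive, negative) pairs in which the positive receives the higher score. How the authors computed it is not described in the abstract, so the setup (drug-pair similarity scores against a shared-target label) and the example numbers are assumptions.

```python
def concordance_index(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly by `scores`.
    Ties count as half. Labels are 1 (e.g. drug pair shares a target) or 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return concordant / (len(pos) * len(neg))


# Hypothetical similarity scores for four drug pairs and their benchmark labels.
print(concordance_index([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the reported 0.89 and 0.77 should be read.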

A NOVEL ATOMISTIC MOTIONAL CORRELATION METHOD COMBINED WITH THERMODYNAMICS TO DELINEATE THE INTRICATE MECHANISM OF SUBSTRATE SPECIFIC CATALYSIS: ENZYME ENGINEERING PERSPECTIVE
Download

Date: TBA
Room: TBA

  • Devashish Das, Polyclone Bioservices, India
  • Pravin Kumar, Polyclone Bioservices, India
  • Naveen Kulkarni, Polyclone Bioservices, India
  • Anurag Kumar, Polyclone Bioservices, India

Presentation Overview:

Enzymes are powerful and highly specific catalysts, both in the reactions that they catalyze and in their choice of reactants (1). Enzymes show this partiality towards a substrate through a precise mechanism. The precision of this mechanism is governed by a well-connected network of residues in and around the catalytic site, in terms of their motions and the consequent thermodynamics. In this study we designed a novel atomistic motional (AM) correlation method which measures the distance and direction of atoms in motion. After discretizing these variables, 41 different combinations (AM alphabets) of displacement and direction of the motion of a single atom were calculated over a wide range of molecular motions derived from MD simulations. This is the maximum reported number of measurements, and it quantifies motions ranging from high-frequency harmonic oscillations to slow functional conformational transitions with a higher level of sensitivity. The method was tested by delineating the mechanism of the Michaelis complexes of Penicillin G acylase (PGA) with Penicillin G (native reaction) and of PGA with PGSO (slow reaction). The correlation of the AM alphabets of a pair of atoms (i, j) was calculated as a normalized mutual information (MI; $I_{LL}^{n}$), $nI(C_i, C_j) = \frac{I(C_i, C_j) - \varepsilon(C_i, C_j)}{H(C_i, C_j)}$ (2), and this was summed to derive the per-residue correlation (PRC; CnMI), $\mathrm{CnMI} = \sum_{i,j=1}^{n} I_{LL}^{n}$. The CnMI was used for network modelling and clustering analysis after 0.25 μs of simulation of the Michaelis complexes. The results show a clear difference between the AM alphabets of the fast and the slow hydrolyzing enzymatic reactions (Fig. 1A). The networks formed between the amino acids in the slow reaction are very different from those in the native reaction (Fig. 1B and C); in particular, clusters 1 and 2, which are closely related to the substrate in the native reaction, are completely decomposed in the slow reaction.
Further, CnMI was combined with a novel method that quantitatively weights atomic interactions (qWAI), in conjunction with high-throughput binding free energy calculations deposited over every amino acid (PRB). The three methods CnMI, qWAI and PRB, in isolation and in combination, revealed the precise mechanism of PGA substrate specificity. Specifically, CnMI shows the decomposition of clusters 1 and 2 in the slow reaction. qWAI shows that the amide bond of PenG is stabilized by βSer1, βThr68, βGln23 and βAla69, whereas the same bond in PGSO is stabilized only by βSer1 and βAla69. Finally, the combined score shows that βGln23 and βPro22, which form part of the oxyanion hole, are strongly modulated in the slow reaction, resulting in destabilization of the tetrahedral intermediate. The presentation will show the precise mechanism of substrate selectivity of PGA revealed by these three methods, together with insights for enzyme engineering.
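
The normalized mutual information between the discretized motions of two atoms can be sketched as follows. Each atom's trajectory is taken to be a sequence of AM alphabet symbols, one per frame. The correction term ε is exposed as a plain parameter because the abstract does not say how it is estimated; the default of 0, the natural-log base and the toy sequences are assumptions of this sketch.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum(c / n * math.log(c / n) for c in Counter(symbols).values())


def normalized_mi(a, b, epsilon=0.0):
    """(I(a, b) - epsilon) / H(a, b) for two equal-length symbol sequences."""
    h_joint = entropy(list(zip(a, b)))
    if h_joint <= 0:
        return 0.0  # both sequences are constant, so there is nothing to correlate
    mutual_information = entropy(a) + entropy(b) - h_joint
    return (mutual_information - epsilon) / h_joint


# Toy AM-alphabet trajectories for three hypothetical atoms.
atom_i = "ABABABAB"
atom_j = "ABABABAB"   # moves in lockstep with atom_i
atom_k = "AABBAABB"   # statistically independent of atom_i
print(normalized_mi(atom_i, atom_j), normalized_mi(atom_i, atom_k))
```

Summing such pairwise values over the atoms of a residue would give a per-residue quantity in the spirit of the CnMI described above.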

References:
1. Berg JM, Tymoczko JL, Stryer L. Biochemistry. 5th edition. New York: W H Freeman; 2002.
2. Cover TM, Thomas JA. Elements of Information Theory. New York: Wiley-Interscience; 1991.

Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet
Download

Date: TBA
Room: TBA

  • Coby Viner, University of Toronto, Canada
  • James Johnson, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
  • Nicolas Walker, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
  • Hui Shi, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
  • Marcela Sjoberg, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
  • David J. Adams, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
  • Anne C. Ferguson-Smith, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
  • Timothy L. Bailey, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
  • Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Many transcription factors (TFs) initiate transcription only in specific sequence contexts, providing the means for sequence specificity of transcriptional control. A four-letter DNA alphabet only partially describes the possible diversity of nucleobases a TF might encounter. Cytosine is often present in the modified forms: 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC). TFs have been shown to distinguish unmodified from modified bases. Modification-sensitive TFs provide a mechanism by which widespread changes in DNA methylation and hydroxymethylation can dramatically shift active gene expression programs.

To understand the effect of modified nucleobases on gene regulation, we developed methods to discover motifs and identify TF binding sites (TFBSs) in DNA with covalent modifications. Our models expand the standard A/C/G/T alphabet, adding m (5mC) and h (5hmC). We additionally add symbols to encode guanine complementary to these modified cytosine nucleobases and represent states of ambiguous modification. We adapted the position weight matrix model of TFBS affinity to an expanded alphabet. We developed a program, Cytomod, to create a modified sequence. We also enhanced the MEME Suite to be able to handle custom alphabets. We created an expanded-alphabet sequence using whole-genome maps of 5mC and 5hmC in naive ex vivo mouse T cells. Using this sequence and ChIP-seq data from Mouse ENCODE and others, we identified modification-sensitive cis-regulatory modules. We elucidated various known methylation binding preferences, including the preference of ZFP57 and C/EBPβ for methylated motifs and the preference of c-Myc for unmethylated E-box motifs. We demonstrated that our method is robust to parameter perturbations, with TF sensitivities for (hydroxy)methylated DNA broadly conserved across a range of modified base calling thresholds. Hypothesis testing across different threshold values was used to determine cutoffs most suitable for further analyses. Using these known binding preferences to tune model parameters enables discovery of novel modified motifs.
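
A minimal sketch of the expanded-alphabet idea follows: cytosines at modified positions are rewritten as m (5mC) or h (5hmC), and a position weight matrix with columns over the expanded alphabet scores candidate sites. This is not Cytomod or the MEME Suite; the function names, the treatment of unlisted symbols as neutral (score 0) and the toy matrix are assumptions, and the symbols for guanine opposite a modified cytosine are left out for brevity.

```python
def modify_sequence(seq, modifications):
    """Rewrite cytosines as 'm' (5mC) or 'h' (5hmC).
    `modifications` maps 0-based positions to 'm' or 'h'."""
    out = list(seq)
    for pos, symbol in modifications.items():
        if out[pos] != "C":
            raise ValueError(f"position {pos} is {out[pos]!r}, not a cytosine")
        out[pos] = symbol
    return "".join(out)


def pwm_score(pwm, site):
    """Sum per-position log-odds scores; symbols missing from a column score 0."""
    return sum(column.get(base, 0.0) for column, base in zip(pwm, site))


def best_site(pwm, seq):
    """Return (score, offset) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((pwm_score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))


# Hypothetical 2-column motif that prefers a methylated cytosine followed by G.
pwm = [{"C": -1.0, "m": 2.0}, {"G": 1.0}]
expanded = modify_sequence("ACGCGT", {1: "m", 3: "h"})
print(expanded, best_site(pwm, expanded))
```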

Hypothesis testing of motif central enrichment provides a natural means of differentially assessing modified versus unmodified binding affinity. This approach can be readily extended to other DNA modifications. As more high-resolution epigenomic data becomes available, we expect this method to continue to yield insights into altered TFBS affinities across a variety of modifications.

Prediction of Metal Binding Sites in Proteins Using Coevolution Data
Download

Date: TBA
Room: TBA

  • Frazier Baker, University of Cincinnati, United States
  • Alexey Porollo, Children's Hospital Medical Center, United States

Presentation Overview:

Trace metals play an important role in determining the folding and function of many proteins. Knowledge of protein metal binding sites can facilitate the efforts in protein modeling, the understanding of molecular mechanisms of protein function, and the identification of new drug targets. Experimental annotation of metalloproteomes lags behind the pace at which genomes and proteomes become available. Hence, there is a growing demand for reliable sequence-based prediction of metal binding proteins and the actual binding sites.

To fulfill this need, we have developed a new machine learning-based model for metal-binding site prediction. The model employs coevolution information derived from multiple sequence alignment. Three coevolution metrics were explored: Chi-squared, Mutual Information, and Pearson Correlation. All metrics were adjusted for phylogeny bias in the multiple sequence alignment. Features are based on the cumulative properties derived from the most covariant residues (group) for each potential metal binding residue (C, D, E, H, N, Q, S, T). The feature space includes the average of individual conservation scores and the composition of amino acids within the group.

The training set comprises 165 manually curated metal-binding proteins taken from the Metal MACiE database. There are 637 residues that bind metals (true positives) and 25909 non-binding residues of the same amino acid types (true negatives). To keep the training data balanced, we randomly resampled the negative class at 1:1 and 2:1 (TN:TP) ratios, 1000 times each. Two machine learning algorithms implemented in Weka, the C4.5 decision tree (DT) and Random Forest (RF), were used to build prediction models. Each model was evaluated using 10-fold cross-validation (CV) on the 1000 datasets at each of the 2 ratios. The best-performing model (23 features) yielded a Matthews correlation coefficient of 0.67 with an overall accuracy of 87.5%, both averaged over 1000 runs of 10-fold CV on the 2:1 ratio dataset. RF appears to outperform DT; furthermore, the coevolution-based model with group-based features is superior to other existing models that use features derived from individual residues.
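
The resampling and evaluation steps can be sketched in a few lines. The code below draws a random subset of negatives at a fixed negative-to-positive ratio and computes the Matthews correlation coefficient from a confusion matrix. The models themselves were built in Weka; the function names and the integer stand-ins for residues here are hypothetical.

```python
import math
import random


def undersample(positives, negatives, ratio, rng):
    """Keep all positives plus a random sample of ratio * len(positives) negatives."""
    return positives + rng.sample(negatives, ratio * len(positives))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Integer ids stand in for the 637 binding and 25909 non-binding residues.
positives = list(range(637))
negatives = list(range(1000, 1000 + 25909))
rng = random.Random(0)
dataset = undersample(positives, negatives, ratio=2, rng=rng)
print(len(dataset))  # 637 positives + 1274 negatives
```

MCC is a useful companion to accuracy here because, unlike accuracy, it stays near zero for a classifier that simply predicts the majority class.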

A novel algorithm for analyzing drug-drug interactions from MEDLINE literature
Download

Date: TBA
Room: TBA

  • Yin Lu, College of Pharmacy, University of South Florida, United States
  • Yi-Cheng Tu, Department of Computer Science, University of South Florida, United States
  • Feng Cheng, College of Pharmacy, University of South Florida, United States

Presentation Overview:

Drug–drug interaction (DDI) is becoming a serious clinical safety issue as the use of multiple medications becomes more common. Searching the MEDLINE database for journal articles related to DDI produces over 330,000 results, far too many to read and summarize manually. As the volume of biomedical references in the MEDLINE database continues to expand at a rapid pace, automatic identification of DDIs from the literature is becoming increasingly important. Here, we present a random-sampling-based statistical algorithm to identify possible DDIs and their underlying mechanisms from the substances field of MEDLINE records. The substance terms are essentially carriers of compound (including protein) information in a MEDLINE record. Four case studies on warfarin, ibuprofen, furosemide and sertraline indicated that our method was able to rank possible DDIs with high accuracy (90.0% for warfarin, 83.3% for ibuprofen, 70.0% for furosemide and 100% for sertraline in the top 10% of a list of compounds ranked by p-value). A social network analysis of substance terms was also performed to construct networks between proteins and drug pairs to elucidate how the two drugs could interact.
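
One plausible reading of a random-sampling test on substance terms is an empirical p-value: how often does a random set of records, of the same size as the drug's record set, mention a term at least as often as the drug's own records do? The sketch below implements that reading. The abstract does not give the algorithm's details, so the procedure, the function name and the synthetic records are assumptions.

```python
import random


def sampling_pvalue(drug_records, all_records, term, n_samples, rng):
    """Empirical p-value for over-representation of `term` in `drug_records`.
    Each record is a set of substance terms."""
    observed = sum(term in record for record in drug_records)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(all_records, len(drug_records))
        if sum(term in record for record in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_samples + 1)


# Synthetic corpus: 20 warfarin records mention CYP2C9, the other 980 do not.
drug_records = [frozenset({"Warfarin", "CYP2C9", "Humans"}) for _ in range(20)]
other_records = [frozenset({"Aspirin", "Humans"}) for _ in range(980)]
all_records = drug_records + other_records
rng = random.Random(0)
print(sampling_pvalue(drug_records, all_records, "CYP2C9", 200, rng))
```

Ranking candidate terms by this p-value gives an ordered list of the kind evaluated in the case studies, with terms that appear in almost every record (such as a generic descriptor) landing at the bottom.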

Distal chromatin loop prediction with deep siamese neural networks
Download

Date: TBA
Room: TBA

  • Davide Chicco, Princess Margaret Cancer Centre, University of Toronto, Canada
  • Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Introduction. Transcriptional regulation is influenced by physical interactions with distal genetic elements such as enhancers. While those elements might be tens of thousands of base pairs away from the genes they affect on the DNA backbone, they can be physically close in the folded three-dimensional conformation of chromatin. Powerful molecular biology techniques, such as chromosome conformation capture (3C) and Hi-C, can locate these 3D long-range interactions (or loops). They are too expensive and difficult, however, for widespread use.
Previous work demonstrates that chromatin loops can be predicted from DNase hyper-sensitivity signals. For example, Thurman et al. in 2012 used these data in a statistical method that takes advantage of hierarchical clustering and Pearson correlation to predict distal interactions. We sought to improve these methods with a more flexible deep learning technique.

Methods. We created a new method for distal interaction prediction that uses a deep siamese neural network algorithm. This technique was originally developed in artificial intelligence to recognize forged hand-written signatures. The siamese neural network learns the inner mathematical non-linear representation of pairs of DNase I hyper-sensitivity profiles, and states whether these pairs represent long-range interactions or not.
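
The defining feature of a siamese network is that both inputs pass through the same embedding function with the same weights, and the decision is based on the distance between the two embeddings. The untrained toy below shows only that structure. The single tanh layer, the arbitrary weights and the exp(-distance) score are assumptions for illustration and say nothing about the architecture used in the talk.

```python
import math


def embed(profile, weights):
    """One shared dense layer with tanh activation; `weights` is a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, profile))) for row in weights]


def siamese_score(profile_a, profile_b, weights):
    """Similarity in (0, 1]: 1.0 when the two embeddings coincide,
    approaching 0 as they move apart."""
    ea, eb = embed(profile_a, weights), embed(profile_b, weights)
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))
    return math.exp(-distance)


# Arbitrary (untrained) weights and two hypothetical DNase I signal profiles.
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
region_a = [1.0, 0.0, 2.0]
region_b = [0.0, 3.0, 1.0]
print(siamese_score(region_a, region_b, weights))
```

In a trained model the weights would be fitted so that pairs of regions known to interact score high and non-interacting pairs score low; because the weights are shared, the score is symmetric in its two inputs.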

Results. We tested the effectiveness of our method through a standard cross-validation optimization approach, built on receiver operating characteristic (ROC) curves and precision-recall curves, by using the high-resolution genome-scale Hi-C datasets as gold standard. Preliminary results on held-out test sets confirm the efficacy of our algorithm.

Discussion. We designed a siamese neural network algorithm that predicts chromatin loops from pairs of chromosome region DNase I hyper-sensitivity profiles.
Compared to previous models, our computational method has the following advantages: (i) the ability to train a machine-learning model from the DNase profile signals of pairs of chromosome regions representing interactions; (ii) a prediction and validation pipeline that can be easily expanded and integrated with alternative algorithms or additional datasets such as those from ChIP-seq data in the future.

Ray Surveyor: phenetic comparison of genomes and its application in microbial evolution
Download

Date: TBA
Room: TBA

  • Frederic Raymond, Université Laval, Canada
  • Maxime Déraspe, Université Laval, Canada
  • Sébastien Boisvert, Gydle Inc., Canada
  • Alexander Culley, Université Laval, Canada
  • Paul H. Roy, Université Laval, Canada
  • François Laviolette, Université Laval, Canada
  • Jacques Corbeil, Université Laval, Canada

Presentation Overview:

Microbial genomics studies are becoming more extensive and complex, so epidemiologists and microbiologists need new analysis methods to make sense of these massive datasets. Here we demonstrate that comparing genomes by their k-mer content makes it easy to reconstruct a phenetic tree coherent with classical phylogeny, without prior data curation such as the alignment of core genomes. The Ray Surveyor software can compare hundreds to thousands of microbial genomes based on their k-mers and removes the bias caused by the use of conserved regions to cluster samples. Using Ray Surveyor, we built an accurate phenetic tree of the Bacteria kingdom from 2429 complete genome sequences in less than 6 hours. The distinguishing feature of Ray Surveyor is its ability to dissect population structures based on a subset of sequences, for example resistance genes or bacteriophages, and thus determine which genetic elements drive bacterial fitness. We show that the population structure of Pseudomonas aeruginosa is closely linked with resistance genes while bacteriophage-related sequences are important in Streptococcus pneumoniae populations. We applied this methodology to 57 bacterial families belonging to 7 different phyla, showing for each family the importance of mobile elements, resistance genes, bacteriophages, plasmids and biosynthetic gene clusters. Only 5% of these 57 families were correlated with mobile elements, 23% with resistance genes and 39% with bacteriophages. In addition to determining correlations with phenetic tree structure, we quantified the abundance of k-mers related to the five categories. This allowed us to determine which taxa have the most abundant genetic determinants associated with each of these five categories. This global view of the pan-genome of human pathogens demonstrates the taxa-dependent influence of mobility-related genes on population structure.
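A minimal sketch of alignment-free comparison by k-mer content follows. It uses the Jaccard distance between k-mer sets, which is an assumption chosen for simplicity and not necessarily the measure Ray Surveyor computes; the sequences and function names are toy examples.

```python
def kmer_set(sequence, k):
    """Return the set of all k-mers found in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def jaccard_distance(seq_a, seq_b, k=4):
    """1 - shared k-mers / all k-mers; 0 means identical k-mer content."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy "genomes" (hypothetical sequences).
genomes = {
    "g1": "ACGTACGTGGA",
    "g2": "ACGTACGTGGC",
    "g3": "TTTTCCCCAAAA",
}
names = sorted(genomes)
distances = {
    (x, y): jaccard_distance(genomes[x], genomes[y])
    for x in names for y in names if x < y
}
```

A pairwise distance matrix of this kind is the usual input to a distance-based tree-building method; restricting the k-mer sets to a subset of interest (for example k-mers from resistance genes) gives the subset-specific view described above.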

Radiation Dose Estimation by Automated Cytogenetic Biodosimetry
Download

Date: TBA
Room: TBA

  • Peter Rogan, University of Western Ontario, Canada
  • Yanxin Li, University of Western Ontario, Canada
  • Ruth Wilkins, Health Canada, Canada
  • Farrah Flegal, Canadian Nuclear Laboratories, Canada
  • Joan Knoll, University of Western Ontario, Canada

Presentation Overview:

The dose from ionizing radiation exposure can be interpolated from a calibration curve fit to the frequency of dicentric chromosomes (DC) in metaphase cells of peripheral blood lymphocytes at multiple doses. As DC counts are manually determined, there is an acute need for accurate, fully automated biodosimetry calibration curve generation and analysis of exposed samples. We automated DC detection by extracting key chromosome-derived features in Giemsa-stained metaphase chromosome images and classifying objects by machine learning (ML). The algorithm finds centromeres and differentiates DCs from monocentric chromosomes (MCs), overlapped chromosomes and other objects with acceptable accuracy over a wide range of radiation exposures (0-5 Gy; Microscopy Res. Tech. 2016. DOI:10.1002/jemt.22642). At high dose (3-4 Gy), for a true positive rate of 0.65, the positive predictive value is 0.72. These methods were incorporated into the Automated Dicentric Chromosome Identifier (ADCI), a software program that detects and segments chromosomes (Trans. Biomed. Engineering 2013. 60: 2005-13), detects centromere candidates and discriminates DCs from MCs by ML, then computes biodosimetry curves and determines the radiation dose of test samples. Manually scored images from two reference laboratories exposed to 1.4 to 3.4 Gy were re-analyzed with ADCI, which estimated exposures within 0 to 0.4 Gy of the physical dose. ADCI can determine radiation dose with accuracies comparable to standard triage biodosimetry (within 0.5 Gy of the physical dose; Rad. Protect. Biodosimetry, submitted). Calibration curves were generated from metaphase images in ~10 hr, and dose estimations required ~0.8 hr per 500-image sample. Running multiple instances of ADCI may be an effective response to a mass casualty radiation event (Rad. Prot. Biodosimetry 2014. 159: 95-104).
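The interpolation step can be sketched as follows, assuming the linear-quadratic calibration form Y = c + alpha*D + beta*D^2 that is commonly used for dicentric yields. The coefficients below are hypothetical, and the function names are illustrative rather than part of ADCI.

```python
import math

def expected_dc_per_cell(dose_gy, c, alpha, beta):
    """Calibration curve: expected dicentrics per cell at a given dose (Gy)."""
    return c + alpha * dose_gy + beta * dose_gy ** 2

def estimate_dose(dc_count, cells_scored, c, alpha, beta):
    """Invert the calibration curve for an observed dicentric frequency."""
    observed = dc_count / cells_scored
    excess = max(observed - c, 0.0)  # yields at or below background map to 0 Gy
    if beta == 0:
        return excess / alpha
    # Positive root of beta*D^2 + alpha*D - excess = 0.
    return (-alpha + math.sqrt(alpha ** 2 + 4 * beta * excess)) / (2 * beta)

# Hypothetical calibration coefficients, for illustration only.
C, ALPHA, BETA = 0.001, 0.03, 0.06

dose = estimate_dose(dc_count=301, cells_scored=1000, c=C, alpha=ALPHA, beta=BETA)
```

With these toy coefficients, 301 dicentrics in 1000 cells corresponds to a dose of about 2 Gy, since 0.001 + 0.03*2 + 0.06*4 = 0.301.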

Patient similarity networks as a framework for genetic case-control prediction in autism spectrum disorders
Download

Date: TBA
Room: TBA

  • Shraddha Pai, University of Toronto, Canada
  • Shirley Hui, University of Toronto, Canada
  • Ruth Isserlin, University of Toronto, Canada
  • Hussam Kaka, University of Toronto, Canada
  • Gary Bader, University of Toronto, Canada

Presentation Overview:

Autism spectrum disorders (ASD) are heritable, childhood-onset disorders that affect 1-2% of the population worldwide. Screening can identify genetic causes in 10-30% of cases; DNA copy-number variants (CNV) are present in at least 10% of cases and contribute to disease risk (Ref 1,2). We have developed a predictor for ASD case-control status based on the genomic location of rare CNV deletions (data from the Autism Genome Project; N=1,485 cases and 1,806 controls of European descent; Ref 2). Our method uses patient similarity networks based on shared CNV overlap and uses GeneMANIA for network integration and prediction (Ref 3). Using individual genes as features resulted in chance performance, consistent with previous reports of nonoverlapping disrupted genes in ASD patients. In contrast, using pathways as features improved performance to AUC=0.71, beyond that seen with other pathway-based predictors. At recommended stringency levels, our predictor accounts for 8-15% of all cases with rare CNV deletions. Feature-selected pathways recapitulate mechanisms previously identified in ASD genetics, including themes of cell proliferation and division, neuronal development and function, and signal transduction.
CNV-based networks are sparse and binary, making the predictor prone to overfitting. Our feature selection therefore combines scores from three train-test partitions of the data, with ten-fold cross validation within each partition. Separately, we use nonparametric statistics to identify and exclude “random-like” cliques; this clique-filtering step substantially reduces the fraction of random networks that pass feature selection. Future challenges include incorporation of reference epigenome data to account for 32% of patients with CNVs located in intergenic regions, and extending the predictor to include single nucleotide variants, clinical data and medical history. As a framework, patient similarity networks provide important advantages for building predictors, including the ability to integrate heterogeneous data sources and handle highly missing or sparse data. This approach also naturally provides network-based visualization of patient similarities and organization of predictive pathways, thus making an intuitive tool for clinical and basic ASD research.
References: 1. Anagnostou E et al. (2014). Can Med Assoc J 186:509-19. 2. Pinto D et al (2014). Am J Hum Genet. 94:677-94. 3. Mostafavi S and Q Morris (2010). Bioinformatics. 26 (14):1759-65.

MetaDCN: meta-analysis framework for differential coexpression network detection with an application in breast cancer
Download

Date: TBA
Room: TBA

  • Li Zhu, University of Pittsburgh, United States
  • Ying Ding, University of Pittsburgh, United States
  • Cho-Yi Chen, University of Pittsburgh, United States
  • Zhiguang Huo, University of Pittsburgh, United States
  • Sunghwan Kim, University of Pittsburgh, United States
  • Steffi Oesterreich, Magee-Women’s Research Institute, United States
  • George Tseng, University of Pittsburgh, United States

Presentation Overview:

Background: Gene coexpression network analysis from large transcriptomic studies is often used to elucidate potential gene-gene interactions and regulatory mechanisms. In contrast to traditional differential expression analysis, identifying differential coexpression subnetworks between cases and controls could reveal pathways with disease-related dysfunction. A coexpression network estimated from a single transcriptomic study is often unstable and not generalizable due to biological variation, cohort bias and limited sample size. With the rapid accumulation of transcriptomic studies in the public domain, coexpression analysis combining multiple transcriptomic studies can provide more accurate and robust results. In this paper, we propose the MetaDCN framework, which combines multiple studies to identify disease-associated differential coexpression networks.

Methods: The framework is composed of two major components: module searching and network visualization. Coexpression networks are first constructed in cases and controls separately in each study. Differential coexpression seed modules are detected by optimizing an energy function via simulated annealing. Seed modules sharing common pathways are merged into pathway-centric supermodules and a visualization tool is developed using Cytoscape.
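The module search step can be pictured with a generic simulated annealing loop. In the sketch below, the energy function (negative mean absolute difference between case and control correlations over gene pairs in the module), the cooling schedule and the toy matrix `DIFF` are all assumptions for illustration; the abstract does not give MetaDCN's actual energy function or parameters.

```python
import math
import random

def energy(module, diff):
    """Negative mean |case - control| correlation difference over gene pairs."""
    genes = sorted(module)
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    if not pairs:
        return 0.0
    return -sum(diff[a][b] for a, b in pairs) / len(pairs)

def anneal(diff, seed_module, steps=2000, start_temp=1.0, cooling=0.995, rng_seed=0):
    """Toggle one gene at a time; accept worse moves with Boltzmann probability."""
    rng = random.Random(rng_seed)
    current = set(seed_module)
    current_e = energy(current, diff)
    best, best_e = set(current), current_e
    temp = start_temp
    for _ in range(steps):
        candidate = current ^ {rng.randrange(len(diff))}  # add or remove a gene
        if len(candidate) >= 2:
            delta = energy(candidate, diff) - current_e
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_e = candidate, current_e + delta
                if current_e < best_e:
                    best, best_e = set(current), current_e
        temp *= cooling
    return best, best_e

# Toy matrix of |correlation in cases - correlation in controls| for 5 genes.
DIFF = [
    [0.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 0.0, 0.85, 0.1, 0.2],
    [0.8, 0.85, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.1, 0.0],
]
best_module, best_energy = anneal(DIFF, seed_module={3, 4})
```

The loop returns the lowest-energy module it visited, so the result is never worse than the seed module it started from.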

Results: We applied the method to five breast cancer studies (ER+ vs ER-) and identified 32 supermodules engaged in 96 pathways under 5% FDR control. The top-ranked pathways are the immune response pathway and the complement and coagulation cascades pathway. The supermodules associated with those two immune system related pathways demonstrated alternative ER activation, which is consistent with recently reported ER-mediated immune functions.

Conclusions: MetaDCN integrates multiple studies to detect disease associated gene modules with differential coexpression. The result sheds light on the underlying disease mechanisms in a systems manner.

On the clustering of biomedical datasets - a data-driven perspective
Download

Date: TBA
Room: TBA

  • Christian Wiwie, University of Southern Denmark, Denmark
  • Jan Baumbach, University of Southern Denmark, Denmark
  • Richard Röttger, University of Southern Denmark, Denmark

Presentation Overview:

Nowadays, scientists of virtually all disciplines are confronted with an increasing supply of information; this is especially true for biomedical research where recent advances in wet-lab technologies have led to a sheer explosion of the wealth, quality, and amount of available data. In order to extract actual knowledge from this plethora of information, one of the most common approaches, and often the beginning of an analysis, is the so-called cluster analysis which unravels the inherent structure of the data by grouping similar objects together.

Despite being a long-standing problem, conducting a cluster analysis is anything but straightforward; on the contrary, a high-quality clustering analysis very often overwhelms the practitioner. A multitude of decisions have to be made, all of which require a deep understanding of the underlying methods. These include feature extraction, similarity calculation, the choice of clustering tool and its parameter optimization, and many more; the layperson often cannot make these decisions or is not even aware of them. Well-structured and objective guidelines are widely missing, especially on a larger scale.

To address these challenges, we have developed ClustEval, a fully integrated and automated cluster evaluation framework. The power of this framework allowed us to conduct a massive, objective and fully reproducible clustering comparison analysis consisting of several million evaluations. This massive data-driven background of structured clustering results allowed us to provide a much-needed overview of the field and to carefully derive guidelines for the clustering of biomedical datasets, which we recently published in Nature Methods. Based on this effort, we want to present ClustEval and our most recent findings, and to evaluate the future perspectives for improving the overall quality and usability of cluster analyses. All results and the framework are freely available: http://clusteval.sdu.dk/

miRNet – dissecting miRNA-target interactions and functional associations through network-based visual analysis
Download

Date: TBA
Room: TBA

  • Yannan Fan, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
  • Paula Ribeiro, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
  • Sarah Kimmins, Department of Animal Science, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
  • Jianguo Xia, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada

Presentation Overview:

MicroRNAs (miRNAs) can regulate nearly all biological processes, and their dysregulation is implicated in various complex diseases and pathological conditions. Recent years have seen a growing number of functional studies of miRNAs using high-throughput experimental technologies, which have produced a large amount of high-quality data on miRNA target genes, their interactions with small molecules, long non-coding RNAs and epigenetic modifiers, as well as disease associations. These rich sets of information have enabled the creation of comprehensive networks linking miRNAs with various biologically important entities to shed light on their collective functions and regulatory mechanisms. Here, we introduce miRNet, a high-performance, easy-to-use, web-based tool that offers statistical, visual, and network-based approaches to help researchers understand miRNA functions and regulatory mechanisms. The key features of miRNet include: (i) a comprehensive knowledge base integrating high-quality miRNA-target interaction data from ten databases; (ii) support for differential expression analysis of data from microarray, RNA-seq and quantitative PCR; (iii) a flexible interface for data filtering, refinement and customization during network creation; (iv) a powerful, fully featured network visualization system coupled with enrichment analysis. miRNet offers a comprehensive tool suite to enable statistical analysis and functional interpretation of various data generated from current miRNA studies.

Non-invasive Precision Medicine - Statistical learning of exhaled biomarker profiles
Download

Date: TBA
Room: TBA

  • Anne-Christin Hauschild, Max Planck Institute for Informatics, Germany
  • Jörg Ingo Baumbach, Faculty Applied Chemistry, Reutlingen University, Germany
  • Jan Baumbach, University of Southern Denmark, Denmark

Presentation Overview:

Precision medicine aims to tailor medical treatment to the individual patient. It relies on efficient molecular methods for diagnostic testing and computational methods for high-throughput data analysis. Major initiatives have recently been launched to build the infrastructure needed to guide clinical practice. Detection of clinically relevant biomarker molecules in exhaled air has the potential to establish a non-invasive precision medicine branch. To this end, we utilize volatile organic compounds, which are emitted by all living cells and tissues. We seek to identify non-invasive biomarkers that are predictive for the biomedical fate of individual patients or cell cultures. This holds promise for moving therapeutic windows to earlier stages of disease progression. While portable devices for exhaled volatile metabolite measurement exist, we face the traditional biomarker research barrier: a lack of robustness hinders translation to the world outside laboratories. To move from biomarker discovery to validation, from separability to predictability, we have developed several bioinformatics methods for computational breath analysis, which have the potential to redefine non-invasive biomedical decision making by rapid and cheap matching of decisive medical patterns in exhaled air. We aim to provide a supplementary diagnostic tool complementing classic urine, blood and tissue samples. In the presentation, we will review the state of the art, study some clinical application examples, highlight existing challenges, and introduce new data mining methods for identifying exhaled biomarkers.

Predicting physiologically relevant SH3 domain mediated protein-protein interactions in human
Download

Date: TBA
Room: TBA

  • Shobhit Jain, University of Toronto, Canada
  • Gary Bader, University of Toronto, Canada

Presentation Overview:

Protein-protein interactions (PPIs) are physical associations between protein pairs in a specific biological context. Their knowledge provides important insights into the functioning of a cell. Many intracellular signaling processes are mediated by interactions involving peptide recognition modules such as SH3 domains. These domains bind to small, linear sequence motifs within proteins, which can be identified using high-throughput experimental screens such as phage display or peptide chips. Binding motif patterns can then be used to computationally predict protein interactions mediated by these domains. While many protein-protein interaction prediction methods exist, most do not work with peptide recognition module mediated interactions or do not consider many of the known constraints governing physiologically relevant interactions between two proteins.

A new approach for predicting physiologically relevant SH3 domain-peptide mediated protein-protein interactions in H. sapiens using phage display data is presented. Like some previous similar methods, this method uses position weight matrix models of protein linear motif preference for individual SH3 domains to scan the proteome for potential hits and then filters these hits using a range of evidence sources related to sequence-based and cellular constraints on protein interactions. The novelty of this approach is the large number of evidence sources used and the method of combination of sequence based and protein pair based evidence sources. This method combines diverse binding site (peptide) features, including presence in a disordered region of the protein, surface accessibility, conservation across different species, and structural contact with the SH3 domain, as well as protein features such as cellular proximity, shared biological process, similar molecular function, correlated gene expression, protein expression and sequence signature. By combining different peptide and protein features using multiple Bayesian models we are able to predict high confidence SH3 domain-peptide interactions.
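The proteome-scanning step with a position weight matrix (PWM) can be sketched as below. The matrix values, the uniform background frequency, the fallback probability for unlisted residues and the score threshold are all hypothetical; the later evidence-integration step with Bayesian models is not shown.

```python
import math

def log_odds_score(peptide, pwm, background=0.05, floor=0.001):
    """Sum of log2(p / background); residues absent from a column get `floor`."""
    return sum(
        math.log2(column.get(residue, floor) / background)
        for residue, column in zip(peptide, pwm)
    )

def scan(protein, pwm, threshold):
    """Slide the PWM along the protein and keep windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        score = log_odds_score(window, pwm)
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical 4-column PWM loosely shaped like a PxxP motif.
PWM = [
    {"P": 0.9},
    {"A": 0.3, "L": 0.3, "P": 0.2},
    {"A": 0.3, "L": 0.3, "R": 0.2},
    {"P": 0.9},
]
hits = scan("MKTPLAPGSSPQQP", PWM, threshold=8.0)
```

In this toy sequence only the window PLAP at position 3 passes the threshold; PQQP has prolines at both ends but is penalized at the two middle positions.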

Time-series clustering enables the exploration of temporal patterns in marker gene data
Download

Date: TBA
Room: TBA

  • Michael Hall, Dalhousie University, Canada
  • Jonathan Perrie, Dalhousie University, Canada
  • Robert Beiko, Dalhousie University, Canada

Presentation Overview:

Marker-gene sequencing of entire communities of microorganisms is a popular method for identifying the microbial taxa that are associated with different conditions or environmental changes. In particular, longitudinal studies are beginning to highlight the dynamism of these groups of invisible inhabitants. Typically, marker genes from these organisms are clustered by sequence identity into “operational taxonomic units” (OTUs) that are surrogates for species or some other level of taxonomic similarity. This approach is widely used and useful as it minimizes the effects of sequencing error and reduces the size of the data set for downstream computations. However, OTU clustering has been shown to group ecologically distinct organisms into a single unit which can obscure potentially important functional differences. We present an alternate clustering approach for time-series marker gene data: clustering sequences by temporal abundance patterns. Our method groups sequences from taxa that are potentially exhibiting similar responses to their environment. An interactive user interface allows the researcher to explore their data and discover distinct temporal patterns. For example, time-series clustering can reveal the seasonal cycles of taxa in a freshwater lake, and can group together taxa that follow the same periodicity. By modifying the radius of the clusters, we can identify discordance within OTUs and generate hypotheses about interactions between taxa. Time-series clustering represents a novel way to explore marker gene data that complements existing techniques.
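One way to picture clustering by temporal abundance pattern is a greedy scheme with a correlation-based distance and a cluster radius, sketched below. The distance (1 minus Pearson correlation), the greedy assignment rule and the toy abundance series are assumptions for illustration and may differ from the authors' algorithm.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_by_pattern(series, radius):
    """Join the first cluster whose seed lies within `radius`, else start a new one."""
    clusters = []  # list of (seed name, member names)
    for name, values in series.items():
        for seed, members in clusters:
            if 1.0 - pearson(series[seed], values) <= radius:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return [members for _, members in clusters]

# Toy abundance time series for four sequences (hypothetical counts).
abundances = {
    "seq_a": [1, 5, 9, 5, 1, 5, 9, 5],
    "seq_b": [2, 10, 18, 10, 2, 10, 18, 10],
    "seq_c": [9, 5, 1, 5, 9, 5, 1, 5],
    "seq_d": [3, 3, 4, 3, 3, 4, 3, 3],
}
clusters = cluster_by_pattern(abundances, radius=0.2)
```

Here seq_a and seq_b share the same cycle at different magnitudes and fall into one cluster, while the anti-phase seq_c and the weakly correlated seq_d stay apart. Widening the radius merges more patterns; narrowing it splits them.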

A regulatory model for discovering aberrant post-transcriptional programs in cancer
Download

Date: TBA
Room: TBA

  • Hamed Najafabadi, McGill University, Canada
  • Pouria Jandaghi, McGill University, Canada
  • Shraddha Solanki, McGill University, Canada
  • Andreas Papadakis, McGill University, Canada
  • Maryam Safisamghabadi, McGill University, Canada
  • Cristina Storoz, McGill University, Canada
  • Mark Lathrop, McGill University, Canada
  • Sidong Huang, McGill University, Canada
  • Simon Tanguay, McGill University, Canada
  • Fadi Brimo, McGill University, Canada
  • Yasser Riazalhosseini, McGill University, Canada

Presentation Overview:

In contrast to extensive studies on diverse molecular factors involved in transcriptional gene programs, the current knowledge of the mechanisms underlying post-transcriptional gene de-regulation in cancer is scarce, and has mainly been limited to the function of tumor-suppressive or oncogenic microRNAs (miRNAs). RNA-binding proteins (RBPs), as key factors that modulate stability and splicing of mRNAs, play a central role in post-transcriptional gene regulation, but have rarely been studied in the context of cancer. By combining sequence specificities of human RBPs with tens of genome-wide expression datasets across multiple tissues and cell lines, we have developed an integrated regulatory code that models the stability of each mRNA as a combinatorial function of the binding of RBPs. This model enables us not only to predict the abundance of mRNAs based on the ‘activity’ of upstream regulatory RBPs, but also to infer the cancer-associated change in the activity of each RBP based on the change in the abundance of its stability targets. Using this model, we have investigated the activity landscape of RBPs in normal and tumor tissues from 45 clear cell renal cell carcinoma (ccRCC) patients. This model identified several RBPs with recurrent tumor-associated change in activity, including an increase in the activity of MBNL2 and PCBP2, and a decrease in the activity of ESRP2, despite a lack of change in the mRNA levels of some of these RBPs. Using immunohistochemistry, we have found that these RBPs are de-regulated at the protein level in ccRCC tumors. Furthermore, using shRNA-mediated knockdown followed by cellular phenotyping and RNA-seq analysis, we have validated the function of these RBPs in regulating several cancer-related pathways. In particular, ESRP2 knockdown remodels the transcriptome of normal kidney cells toward that of ccRCC tumors, and activates pathways associated with cancer. 
On the other hand, knockdown of MBNL2 and PCBP2 can revert the cancer gene expression profile, and suppress various cancer-associated pathways. Inhibition of MBNL2 also suppresses proliferation of ccRCC cell lines, underlining the role of MBNL2 as a potential oncogene. These findings highlight the effectiveness of predictive gene regulatory models in identification of cancer-driving post-transcriptional programs, and suggest a prominent role of RBPs in development and progression of cancer.

RGAugury: A pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants
Download

Date: TBA
Room: TBA

  • Pingchuan Li, Morden Research and Development Centre, AAFC, Canada
  • Sylvie Cloutier, Ottawa Research and Development Centre, AAFC, Canada
  • Frank You, Morden Research and Development Centre, AAFC, Canada

Presentation Overview:

Resistance gene analogs (RGAs) are a large class of potential R-genes that usually have conserved domain and motif configurations. To date, NBS-encoding proteins, receptor-like proteins (RLPs), serine/threonine/tyrosine receptor-like kinases (RLKs) and membrane-associated coiled-coil proteins (TM-CC) have been broadly reported to be closely associated with plant defense and resistance. Although few RGAs have been fully characterized for their functions and pathways, genome-wide prediction of potential resistance genes from sequenced plant genomes has been shown to be an effective and practical approach for identification, fine mapping and cloning of plant resistance genes. A few computational programs that identify individual resistance-related domains are available, but a comprehensive and easy-to-use pipeline for RGA prediction has been lacking. Here, we propose an integrative package named RGAugury that automates genome-wide prediction of different types of RGAs. All modules of the pipeline have been parallelized to run in multiple threads so that the available CPU resources of a server are fully used. The pipeline first identifies resistance-related domains such as NBS, LRR, transmembrane, serine/threonine/tyrosine kinase and coiled-coil from the gene and predicted protein sequences of a genome. All identified domains are then analyzed together to declare RGA candidates and classify them as NBS-encoding, TM-CC, or membrane-associated RLP or RLK. The pipeline was tested using the Arabidopsis and Medicago genomes and validated against their previously reported RGAs. In Arabidopsis and Medicago, respectively, the pipeline identified 93% and 90% of the reported putative NBS-encoding genes, and 98% and 99% of the membrane-associated RLPs and RLKs. 
These results demonstrated that RGAugury is an effective bioinformatics tool for genome-wide RGA identification that can be applied to other plant genomes. The pipeline program will be available at Bitbucket.

ORFanFinder: automated identification of taxonomically restricted orphan genes
Download

Date: TBA
Room: TBA

  • Yanbin Yin, Northern Illinois University, United States
  • Alex Ekstrom, Northern Illinois University, United States

Presentation Overview:

Motivation: Orphan genes, also known as ORFans, are newly evolved genes in a genome that enable the organism to adapt to a specific living environment. The gene content of every sequenced genome can be classified into different age groups, based on how widely or narrowly a gene's homologs are distributed in the context of species taxonomy. Genes whose homologs are restricted to organisms of a particular taxonomic rank are classified as taxonomically restricted ORFans.
Results: Implementing this idea, we have developed an open source program named ORFanFinder and a free web server to allow automated classification of a genome’s gene content and identification of ORFans at different taxonomic ranks. ORFanFinder and its web server will contribute to the comparative genomics field by facilitating the study of the origin of new genes and the emergence of lineage-specific traits in both prokaryotes and eukaryotes.
Availability: http://cys.bios.niu.edu/orfanfinder
Publication: Ekstrom A and Yin Y (2016) ORFanFinder: automated identification of taxonomically restricted orphan genes, Bioinformatics, doi:10.1093/bioinformatics/btw122, in press
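The classification idea in the Motivation paragraph can be sketched as follows: given the taxonomic lineage of the query genome and the lineages of organisms carrying homologs, report the narrowest rank that still contains every homolog. The rank list, the labels and the function name are illustrative assumptions and do not reproduce ORFanFinder's actual interface or output categories.

```python
RANKS = ["species", "genus", "family"]  # narrowest to broadest (truncated for the example)

def classify_gene(query_lineage, hit_lineages):
    """Return the narrowest rank of the query lineage shared by all homologs."""
    if not hit_lineages:
        return "no homologs found"
    for rank in RANKS:
        taxon = query_lineage[rank]
        if all(lineage.get(rank) == taxon for lineage in hit_lineages):
            return f"{rank}-restricted"
    return "not restricted at these ranks"

query = {"species": "Escherichia coli", "genus": "Escherichia",
         "family": "Enterobacteriaceae"}
same_species = {"species": "Escherichia coli", "genus": "Escherichia",
                "family": "Enterobacteriaceae"}
same_genus = {"species": "Escherichia fergusonii", "genus": "Escherichia",
              "family": "Enterobacteriaceae"}
other_family = {"species": "Vibrio cholerae", "genus": "Vibrio",
                "family": "Vibrionaceae"}

label_a = classify_gene(query, [same_species])
label_b = classify_gene(query, [same_species, same_genus])
label_c = classify_gene(query, [same_genus, other_family])
```

In practice the homolog lineages would come from a sequence similarity search joined with a taxonomy database; that step is omitted here.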

Set Covering Machines and Reference-Free Genome Comparisons Uncover Predictive Biomarkers of Antibiotic Resistance
Download

Date: TBA
Room: TBA

  • Alexandre Drouin, Université Laval, Canada
  • Sébastien Giguère, Institute for Research in Immunology and Cancer, Canada
  • Maxime Déraspe, Université Laval, Canada
  • Mario Marchand, Université Laval, Canada
  • Jacques Corbeil, Université Laval, Canada
  • François Laviolette, Université Laval, Canada

Presentation Overview:

Despite an era of supercomputing and increasingly precise instrumentation, many biological phenomena remain poorly understood. One approach to understanding such phenomena is the case-control study, in which large groups of phenotypically different individuals are compared with the objective of finding predictive biomarkers of a phenotype.

We focus on the identification of genomic biomarkers, ranging from single nucleotide substitutions and indels, to large scale genomic rearrangements. We use reference-free genome comparisons based on k-mers, i.e., sequences of k nucleotides, coupled with the Set Covering Machine (SCM), a machine learning algorithm that produces sparse classifiers. We devise extensions to the algorithm that make it well suited for learning from extremely large sets of genomic features. Our method is robust to large-scale genomic rearrangements and is well-suited for organisms that show high genomic diversity. Moreover, the uncharacteristically sparse models produced by the SCM explicitly highlight the relationship between genomic variations and the phenotype of interest.
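A simplified sketch of greedy conjunction learning in the style of the Set Covering Machine is shown below, using presence/absence rules on k-mers. The utility function (negatives excluded minus p times positives lost), the toy data and the function names are assumptions for illustration; the extensions for very large feature sets and the Kover implementation are not reproduced.

```python
def holds(rule, sample):
    """A rule is ('presence' | 'absence', kmer); a sample is a set of k-mers."""
    kind, kmer = rule
    return (kmer in sample) == (kind == "presence")

def train_conjunction(samples, labels, p=1.0, max_rules=3):
    """Greedily add the rule that excludes the most remaining negatives."""
    kmers = sorted(set().union(*samples))
    rules = [(kind, kmer) for kmer in kmers for kind in ("presence", "absence")]
    negatives = {i for i, y in enumerate(labels) if not y}
    positives = {i for i, y in enumerate(labels) if y}
    model = []
    while negatives and len(model) < max_rules:
        def utility(rule):
            covered = sum(not holds(rule, samples[i]) for i in negatives)
            errors = sum(not holds(rule, samples[i]) for i in positives)
            return covered - p * errors
        best = max(rules, key=utility)
        if utility(best) <= 0:
            break
        model.append(best)
        negatives = {i for i in negatives if holds(best, samples[i])}
        positives = {i for i in positives if holds(best, samples[i])}
    return model

def predict(model, sample):
    """Positive (e.g. resistant) only if every rule in the conjunction holds."""
    return all(holds(rule, sample) for rule in model)

# Toy data: k-mer sets per genome; True = resistant (hypothetical).
samples = [{"ACGT", "GGGG"}, {"ACGT", "TTTT"}, {"GGGG", "TTTT"}, {"TTTT", "CCCC"}]
labels = [True, True, False, False]
model = train_conjunction(samples, labels)
```

On this toy data a single rule, presence of ACGT, separates the two classes, which shows why such models are easy to read: each rule names one k-mer.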

The method was validated by generating models that predict the antibiotic resistance of four important human pathogens: C. difficile, M. tuberculosis, P. aeruginosa and S. pneumoniae. The method generated accurate models for 17 antibiotics, the majority achieving error rates smaller than 10% on a validation set. Within hours of computation, the method recovered, de novo, known and validated antibiotic resistance mechanisms that have been reported over the past decades. Moreover, previously unreported genomic variations that could prove biologically relevant were uncovered. The method also identified markers of cross-resistance between antibiotics, knowledge that could prove relevant for the improvement of combination antibiotherapies.

We are confident that this method is applicable to other organisms and that it could guide biological efforts for understanding a plethora of phenotypes. To this end, we propose Kover, a highly scalable implementation of our method that relies on external storage, instead of the computer’s memory. Kover is open-source software and is available from http://github.com/aldro61/kover.

CoDaSeq: compositional data analysis for high throughput sequencing data
Download

Date: TBA
Room: TBA

  • Greg Gloor, U. Western Ontario, Canada
  • Jean Megan Macklaim, U. Western Ontario, Canada
  • Jia Wu, U. Western Ontario, Canada

Presentation Overview:

We present CoDaSeq, a compositionally appropriate end-to-end workflow for the analysis of sparse high throughput sequencing (HTS) datasets. HTS is routinely applied to collect data from 16S rRNA gene sequencing, RNA-seq, metagenomic sequencing and other experimental designs. It is now acknowledged in several domains that HTS generates compositional data that are sparse (contain many 0 values). Compositional data are those where the multivariate components in individual samples have a constant, yet arbitrary sum. The datasets generated by HTS instruments are compositional because the number of reads output by the machine is constrained by the capacity of the instrument. Compositional data are prone to sub-compositional effects and spurious correlations, first noted by Pearson in 1896, where the conclusions drawn are conditioned upon arbitrary choices about which features (genes, operational taxonomic units, etc.) are included. Compositional data analysis (CoDa) approaches to analyze these datasets are rarely used, and when applied are only used piecemeal, in part because sparse data are often thought to be incompatible with CoDa approaches. CoDaSeq uses a Bayesian approach to estimate posterior probability distributions of the data such that sparsity can be accounted for appropriately during the analysis. We will show how the CoDaSeq approach can be applied to 16S rRNA gene sequencing, metagenomic inference, and transcriptome experiments in a coherent manner for ordination (using the compositional biplot), correlation (using measures of constant ratio) and group difference analysis (using measures of ratio variance). This approach is a powerful and fully generalizable addition to the HTS analysis toolkit.
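Why an arbitrary total does not matter once ratios are used can be seen with the centred log-ratio (clr) transform, a standard CoDa transformation. The sketch below is an assumption-laden illustration: it handles zeros with a fixed pseudocount, which is a simplification and not the Bayesian posterior estimation that CoDaSeq is described as using.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centred log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# The same community sequenced at two depths (hypothetical counts).
shallow = [10, 20, 70]
deep = [100, 200, 700]

clr_shallow = clr(shallow, pseudocount=0)
clr_deep = clr(deep, pseudocount=0)

# A sparse sample needs the pseudocount to avoid log(0).
clr_sparse = clr([0, 3, 12, 0, 85])
```

With no pseudocount, the two depths give the same clr values, because multiplying every count by a constant cancels in the ratio to the geometric mean. The clr values of any sample also sum to zero.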

Meta-analysis of large pharmacogenomics studies to develop isoform-based biomarkers predictive of response to targeted therapies
Download

Date: TBA
Room: TBA

  • Zhaleh Safikhani, Princess Margaret Cancer Centre, University Health Network, Canada
  • Benjamin Haibe-Kains, Princess Margaret Cancer Centre, University Health Network, Canada
  • Kelsie Thu, Princess Margaret Cancer Centre, University Health Network, Canada
  • Petr Smirinov, Princess Margaret Cancer Centre, University Health Network, Canada
  • David Cescon, Princess Margaret Cancer Centre, University Health Network, Canada
  • Mathieu Lupien, Princess Margaret Cancer Centre, University Health Network, Canada

Presentation Overview:

Introduction: Advances in genome-wide molecular profiling and high-throughput drug screening technologies offer a unique opportunity to identify novel biomarkers predictive of response to anticancer therapies. The vast majority of predictive biomarkers for targeted therapies are based on genetic aberrations or protein expression, as opposed to transcriptomic biomarkers. However, the recent adoption of next-generation sequencing technologies enables accurate profiling of not only gene expression but also alternative and trans-spliced transcripts in large-scale pharmacogenomic studies.
Methods: We applied multiple machine learning modeling techniques to identify transcriptomic biomarkers for drug response in cancer. To address the lack of reproducibility of drug sensitivity measurements across studies, we developed a framework to efficiently combine the pharmacological data from two large studies, the Cancer Cell Line Encyclopedia (CCLE) and the Genomics of Drug Sensitivity in Cancer (GDSC). Our framework consists of fitting predictive models using the cell lines' RNA-seq profiles as predictor variables, controlled for tissue type and batch indicators, and combined CCLE and GDSC drug sensitivity calls as dependent variables. The accuracy and significance of the fitted models were assessed using cross-validation, embedding both feature selection and model fitting. We prioritized gene- and isoform-based biomarkers that are differentially distributed between healthy tissues from the GTEx dataset and cancer cell lines.
Results: Independent pharmacogenomic datasets developed by the Gray and Neel laboratories were used to validate the biomarkers that predict the response of breast cancer cell lines. We validated in vitro our most promising in silico predictions, such as NM_004207 (SLC16a3-002) as a significant predictive biomarker for the MEK inhibitor AZD6244.
Conclusion: Despite its initial promise, biomarker discovery from large pharmacogenomic datasets has not fully realized its potential, with only a few robust biomarkers being reproduced across studies. Our study is the first to implement a meta-analysis pipeline of such valuable data, opening new avenues of research for the identification of isoform-based biomarkers predictive of response to targeted therapies in breast cancer.

CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization
Download

Date: TBA
Room: TBA

  • Houxiang Zhu, Miami University, United States
  • Lauren Misel, Miami University, United States
  • Mitchell Graham, Miami University, United States
  • Chun Liang, Miami University, United States

Presentation Overview:

The CRISPR system holds much promise for successful genome engineering, but therapeutic, industrial, and research applications will place high demand on improving the specificity and efficiency of this tool. CT-Finder (http://bioinfolab.miamioh.edu/ct-finder) is a web service to help users design guide RNAs (gRNAs) optimized for specificity and efficiency. CT-Finder accommodates the original single-gRNA Cas9 system and two specificity-enhancing paired-gRNA systems: Cas9 D10A nickases (Cas9n) and dimeric RNA-guided FokI nucleases (RFNs). Optimal target candidates can be chosen based on the minimization of predicted off-target effects and the maximization of gRNA activities. Graphical visualization of on-target and off-target sites in the genome is provided for target validation. Major model organisms are covered by this web service.
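
To make the notion of a candidate target concrete, the following minimal sketch scans the forward strand of a DNA string for 20-nt protospacers followed by an NGG PAM, the motif commonly associated with the single-gRNA Cas9 system. It is only an illustration: the function name `find_forward_sites` and the demo sequence are hypothetical, the reverse strand is ignored, and none of CT-Finder's off-target or activity scoring is reproduced here.

```python
import re

# 20-nt protospacer followed by an NGG PAM; the lookahead lets
# overlapping candidate sites be reported.
SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")

def find_forward_sites(sequence):
    """Return (start, protospacer, pam) for each forward-strand candidate."""
    seq = sequence.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in SITE.finditer(seq)]

# Hypothetical input: one protospacer, an AGG PAM, then filler.
demo = "ACGTACGTACGTACGTACGT" + "AGG" + "TTTT"
sites = find_forward_sites(demo)
for start, protospacer, pam in sites:
    print(start, protospacer, pam)
```

A real design tool would then rank such candidates, for example by searching the genome for near-matches to each protospacer.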

In-silico pipeline for tissue-specific drug combination discovery
Download

Date: TBA
Room: TBA

  • Seyed Ali Madani Tonekaboni, University of Toronto, Canada
  • Zhaleh Safikhani, University of Toronto, Canada
  • Benjamin Haibe-Kains, University of Toronto, Canada

Presentation Overview:

Rationale. Efforts toward developing efficient single-agent anticancer therapies against many aggressive cancer types have failed. A possibly more efficient strategy to overcome the limitations of single-agent therapeutics (monotherapies) could be to develop combination therapies that target multiple components of a complicated disease such as cancer. However, testing numerous combination therapies in preclinical studies or clinical trials is not feasible, considering the number of potential combinations of over a thousand potential drugs.
Objective. We developed a high-throughput drug combination computational modeling approach to integrate multiple pharmacogenomic datasets in order to efficiently explore the large space of drug combinations and predict the most synergistic drug combinations in vitro.
Methods. For each combination of drugs on a cell line, we used the transcriptomic profile of the cancer cell line, the L1000 drug perturbation data, and the monotherapy response of the cell line to each agent in the combination as the input data to train our model. We used elastic net as the supervised machine learning approach in our pipeline and tested our model using random sampling on the drug combination data provided in the AstraZeneca-Sanger DREAM Challenge, for all tissue types as well as for lung and breast cancer cell lines separately.
Results. Our predictions for the drug combinations provided in the challenge were significantly better than random (p < 1e-6). Although the results were promising, we went one step further and developed tissue-specific predictors for lung and breast cancer cell lines in the data set which significantly improved our predictions (p < 1e-3).
Conclusions. Using our tissue-specific in-silico pipeline, we can predict synergistic drug combinations for each tissue type in our training set. This constitutes a promising approach which can be integrated with high throughput chemical screens to improve drug combination discovery in cancer.

A genome-scale algorithmic approach for metabolic engineering of plants
Download

Date: TBA
Room: TBA

  • Jiun Yen, Virginia Tech, United States
  • Glenda Gillaspy, Virginia Tech, United States
  • Ryan Senger, Virginia Tech, United States

Presentation Overview:

One of the major challenges in metabolic engineering of cells to over-produce commodity chemicals is the identification of effective gene modification targets. For this reason, much focus has been on the development of computational tools to accurately predict metabolic engineering strategies. Predictive tools that utilize genome-scale metabolic flux models (GEMs) have shown promising results in engineering microbes; however, few studies have utilized GEMs of plants. This work introduces a novel algorithm called “Reverse Flux Balance Analysis with Flux Ratios” (R-FBrAtio) and deploys this algorithm to enhance cellulose production in Arabidopsis thaliana. R-FBrAtio generates gene candidates that are ranked to indicate the best targets for gene over-expression and/or knockdown. R-FBrAtio predicted many intuitive metabolic engineering strategies, including over-expression of cellulose synthase and UDP-glucose pyrophosphorylase, which had been shown to increase cellulose in previous studies. There were also many non-intuitive predictions, and this research focused on experimental validation of one of them, the over-expression of mitochondrial malate dehydrogenase (mMDH). This was done by generating multiple transgenic plant lines with upregulated mMDH2:2HA and examining cellulose content. Characterization of mMDH2:2HA plants showed increased biomass and altered morphology. Analysis of the crystalline cellulose revealed a 30% increase in the stem.

Compositional Epistasis Detection Using A Few Prototype Disease Models
Download

Date: TBA
Room: TBA

  • Lu Cheng, University of Waterloo, Canada
  • Mu Zhu, University of Waterloo, Canada

Presentation Overview:

Failure to replicate single-locus effects in genome-wide association studies (GWAS) motivates the exploration of epistasis (interaction effects) in complex human diseases. Existing methods mostly target epistasis in the statistical sense, i.e., deviation from additive effects, which is believed to be of limited help for understanding the biological mechanism. Of more biological relevance is the so-called "compositional epistasis", a term used by Phillips (2008), which bears the original meaning of "masking effect" that the word carried when it was coined.

It is straightforward to model compositional epistasis by genetic disease models. There are 512 two-locus, two-allele, two-phenotype and complete-penetrance disease models (Li et al 1999). Studying all of them not only lends insight to the exact epistasis mechanism, but also enhances the power for locus discoveries. Studies of single SNP effects show that the maximal detection power is achieved when the correct genetic disease model is used, i.e., when the testing of SNPs is done by using the genetic disease model that matches the actual underlying mode of inheritance. Hence to achieve high power for epistasis detection, it is imperative to determine proper disease models to test.

One direct way is to check all of them and use the one most fitting for the current data. However this not only causes computational burden but also multiple testing problems. Observing that the disease models are similar to each other, we come up with the idea of testing only a few representative ones, similar to what principal component regression does. We define a novel "distance" metric to measure how different two disease models are and then use it to group disease models into a few clusters. We find that the 512 disease models form 6 clusters most of the time, and a prototype disease model selected from each cluster serves as a good representative model that can be used for epistasis loci detection. It is worth mentioning that clustering epistasis models is not only beneficial to the aforementioned computational and multiple-testing problems, but it also allows us to better understand and characterize different disease models for future research.
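
The count of 512 follows from the model definition: two biallelic loci give 3 × 3 = 9 joint genotypes, and a complete-penetrance, two-phenotype model assigns each genotype either "unaffected" (0) or "affected" (1), giving 2^9 = 512 tables. The sketch below enumerates them and compares two models with a Hamming distance. The Hamming distance is only a hypothetical stand-in, since the abstract does not specify the authors' "distance" metric, and the names `enumerate_models`, `hamming`, `threshold` and `null` are introduced here for illustration.

```python
from itertools import product

# Genotypes at each biallelic locus: 0, 1 or 2 copies of the minor allele.
GENOTYPES = (0, 1, 2)
CELLS = [(a, b) for a in GENOTYPES for b in GENOTYPES]  # 9 joint genotypes

def enumerate_models():
    """Return every complete-penetrance model as a 9-tuple of 0/1 phenotypes."""
    return list(product((0, 1), repeat=len(CELLS)))

def hamming(model_a, model_b):
    """Number of joint genotypes on which two models disagree.

    A hypothetical stand-in for the authors' distance, which the
    abstract does not specify.
    """
    return sum(x != y for x, y in zip(model_a, model_b))

models = enumerate_models()
print(len(models))  # 512

# Example models: affected only when both loci carry a minor allele,
# versus a model in which nobody is affected.
threshold = tuple(int(a >= 1 and b >= 1) for a, b in CELLS)
null = tuple(0 for _ in CELLS)
print(hamming(threshold, null))  # 4
```

Any pairwise distance of this kind can be fed to a standard clustering routine, after which one representative per cluster plays the role of a prototype model.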

By carrying out simulation studies on some popular disease models, we observe that our approach provides satisfying power when compared with two other most relevant methods, i.e., MDR and the complete compositional epistasis detection approach by Wan et al (2013). For certain heterogeneous models that involve at least two pairs of SNPs contributing to the disease, our approach performs better than the other two methods. For some other cases, our method may perform worse. The causes are explored and two alternative methods are proposed, which use more refined ways to determine prototype disease models.

In summary, there is a limited amount of work devoted to complete compositional epistasis detection. The underlying variable selection/screening task is complicated by the need to determine the interaction form, or disease model, before selection/screening can be carried out. Our approach --- first finding a few prototype disease models and then using them to perform screening --- complements MDR and the method by Wan et al (2013) in detecting more biologically relevant epistasis.

Integrative analysis for identification and functional prediction of long non-coding RNAs in cancer
Download

Date: TBA
Room: TBA

  • Musaddeque Ahmed, UHN, Canada
  • Haiyang Guo, UHN, Canada
  • He Housheng, UHN, Canada

Presentation Overview:

Trait-associated SNPs identified through Genome-Wide Association Studies are enriched in regulatory regions. However, the functional link between these SNPs and their target genes remains elusive, particularly for non-protein-coding genes. Due to their involvement in fundamental biological processes, the largest class of non-protein-coding genes, long noncoding RNAs (lncRNAs), represent an attractive class of candidates to mediate cancer risk. We have developed an integrative computational method that can identify lncRNAs that may have a potential role in the development and/or progression of any cancer type. Our prediction algorithm integrates outputs from multiple data types including expression data, chromatin accessibility, genomic occupancy and genotyping data, both from cell lines and patient samples. By applying our method to the lncRNA transcriptome together with genomic and prostate cancer GWAS SNP data, we identified 45 candidate lncRNAs associated with prostate cancer risk. The top hit from our algorithm is PCAT1, the expression of which we found to be affected by a SNP, rs7463708, through modulation of an enhancer region 78 kb downstream of the PCAT1 transcription start site. Further analysis suggested that this enhancer is likely prostate cancer specific and that the risk allele of rs7463708 increases the binding of a transcription factor, ONECUT2. The efficacy of our prediction algorithm is further supported by the presence, among the candidates, of lncRNAs that were previously reported to be associated with prostate cancer, such as H19 and KCNQ1OT1. Our method provides a novel and effective approach to pinpoint lncRNAs that are functionally critical in the development or progression of any disease.

Insights into the metatranscriptome and metabolome of the vaginal microbiome
Download

Date: TBA
Room: TBA

  • Jean Megan Macklaim, University of Western Ontario, Canada
  • Amy McMillan, University of Western Ontario, Canada
  • Mark Sumarah, Agriculture and Agri-food Canada, Canada
  • Jonathan Swann, Division of Computational and Systems Medicine, Department of Surgery and Cancer, Canada
  • Gregor Reid, University of Western Ontario, Canada
  • Greg Gloor, University of Western Ontario, Canada

Presentation Overview:

A major challenge of any microbiome investigation is to determine the role of the microbes in the environment, and the effects on the host or system. High-throughput (HT) sequencing and small molecule analyses provide an overview of the function of the entire microbiome, which can be thought of as a “meta-organism”. Analyzing such data to make biological inferences is challenging due to the multivariate nature of the data, the scale of output from high-throughput experiments, and the complexity of the interactions and fluctuations of the biological system. In any HT sequencing output (16S rRNA gene sequencing, metatranscriptomics, and metagenomics) the data are expressed as parts of the whole, and are therefore bound by the requirement for compositional data analysis (CoDa).

We used the vaginal microbiome as a model system to demonstrate the relationship between the bacterially expressed mRNAs (the metatranscriptome), and the products of metabolism (the metabolome). In using a CoDa framework, we identified novel functional profiles of the vaginal microbiome associated with healthy and dysbiotic conditions. Our strategy involved mapping mRNA reads to a reference library, assembling unmapped reads, and grouping the reads by functional categories. We show that the transcriptional components of specific taxa (Megasphaera and Prevotella) associate with the vaginal microbiome subgroups, while others (Gardnerella, Lactobacillus iners) are nondiscriminatory to the different subgroups. The power to separate subgroups transcriptionally was increased by aggregating reads into functional groups rather than individual organisms.

Despite significant taxonomic variability within each subgroup, we found core metabolic products separating health and dysbiosis. We show correlations between the small molecules and the transcripts detected in key metabolic pathways of the condition: amino acids and polyamines, end-products of anaerobic metabolism, and structural components that could be biomarkers of disease. This study underscored the importance and value of a multi-omics approach to understanding an alteration in a microbiome state.

Statistical, Visual and Functional Analysis of 16S rRNA Marker Gene Data
Download

Date: TBA
Room: TBA

  • Achal Dhariwal, McGill University, Canada
  • Jeff Xia, McGill University, Canada

Presentation Overview:

Metagenomics studies aim to understand the composition and function of uncultured microbial communities. Nowadays, 16S rRNA-based marker gene metagenomics sequencing is widely used to characterize the diversity of complex microbial communities. Functional and statistical analysis and visualization of such data pose great challenges. Many tools or approaches provide pipelines for performing microbiome analysis on such data to understand microbiome composition and function. However, many aspects of the current approaches can be improved to get a deeper understanding of communities. Hence, we introduce MetagenomeNet, a user-friendly, high-performance tool for comprehensive analysis of 16S rRNA metagenomic data. The key features of MetagenomeNet include: (i) a knowledgebase comprising taxonomic profiling from multiple databases (Greengenes, SILVA and RDP), allowing users to input OTU tables mapped from various 16S rRNA analysis pipelines for functional profiling; (ii) support for differentially abundant feature analysis for marker gene data, along with discerning gene- and taxon-specific associations for comparative analysis; (iii) a powerful, fully-featured network visualization system at gene level (metabolic networks) and at taxon level (correlation networks). Such a tool will provide system-level insight, help in understanding microbiome function and predicting biomarkers, and support multiple alternative interpretations and hypothesis generation for various pathophysiological states.

Global Peak Alignment for Comprehensive Two-Dimensional Gas Chromatography Mass Spectrometry Using Point Matching Algorithms
Download

Date: TBA
Room: TBA

  • Beichuan Deng, Department of Mathematics, Wayne State University, United States
  • Hengguang Li, Department of Mathematics, Wayne State University, United States
  • Xiang Zhang, Department of Chemistry, University of Louisville, United States
  • Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University, United States

Presentation Overview:

Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC×GC-MS) has been used to analyze multiple samples in a metabolomics study. However, due to uncontrollable experimental conditions, such as differences in temperature or pressure, matrix effects on samples, and stationary phase degradation, there is always a shift of retention times in the two GC columns between samples. To correct the retention time shifts in GC×GC-MS, peak alignment is a crucial data analysis step that recognizes the peaks generated by the same metabolite in different samples. Two approaches have been developed for GC×GC-MS data alignment: profile alignment and peak matching alignment. However, these existing alignment methods are all based on local alignment, so a peak may be incorrectly aligned in a dense chromatographic region where many peaks are present in a small area. False alignment will result in false discovery in the downstream statistical analysis. We therefore developed a global comparison-based peak alignment method using a point matching algorithm (PMA-PA) for both homogeneous and heterogeneous data. PMA-PA first extracts feature points in the chromatogram and then searches globally for the matching peaks in the consecutive chromatogram by adopting the projection of rigid and non-rigid transformations. Simulation studies show that PMA-PA outperforms the existing alignment algorithms in terms of F1 score, although it uses only peak location information.

Balancing mRNA and Protein Levels in a Demand-Directed Dynamic Flux Balance Analysis Describes Effects of the Transition of an Anaerobic Escherichia coli Culture to Aerobic Conditions
Download

Date: TBA
Room: TBA

  • Joachim von Wulffen, University of Stuttgart, Institute for System Dynamics, Germany
  • Oliver Sawodny, University of Stuttgart, Institute for System Dynamics, Germany
  • Ronny Feuer, University of Stuttgart, Institute for System Dynamics, Germany

Presentation Overview:

The facultative anaerobic bacterium Escherichia coli is frequently forced to adapt to changing environmental conditions. One important determinant for metabolism is the availability of oxygen allowing a more efficient metabolism. Especially in large scale bioreactors the distribution of oxygen is inhomogeneous and individual cells encounter frequent changes. This might contribute to observed yield losses during process upscaling. Short-term gene expression data exist of an anaerobic E. coli batch culture shifting to aerobic conditions. The data reveal temporary upregulation of genes that are less efficient in terms of energy conservation than the genes predicted by conventional flux balance analyses.

In this study, we provide evidence that a positive correlation between metabolic fluxes and gene expression exists. We then hypothesize that the more efficient enzymes are limited by their low expression, which restricts flux through their reactions, and we define a demand that triggers expression of the demanded enzymes, which we explicitly include in our model. With these features we propose a method, demand-directed dynamic flux balance analysis (dddFBA), bringing together elements of several previously published methods. The introduction of additional flux constraints proportional to gene expression provokes a temporary demand for less efficient enzymes, which is in agreement with the transient upregulation of these genes observed in the data.

In the proposed approach, the applied objective function of growth rate maximization, together with the introduced constraints, triggers expression of metabolically less efficient genes. This finding is one possible explanation for the yield losses observed in large-scale bacterial cultivations where a steady oxygen supply cannot be guaranteed.

Ligand similarity complements sequence, physical interaction, and co-expression for gene function prediction
Download

Date: TBA
Room: TBA

  • Matthew O'Meara, University of California at San Francisco, Department of Pharmaceutical Chemistry, United States
  • Sara Ballouz, Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, United States
  • Brian Shoichet, University of California at San Francisco, Department of Pharmaceutical Chemistry, United States
  • Jesse Gillis, Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, United States

Presentation Overview:

The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63-0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.
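
For readers less familiar with the AUROC figures quoted above, the sketch below computes AUROC directly from its rank interpretation: the probability that a randomly chosen annotated gene receives a higher prediction score than a randomly chosen unannotated gene, with ties counted as one half. The function name `auroc` and the scores and labels are hypothetical and are not taken from the study.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for genes annotated with the function, 0 otherwise.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical guilt-by-association scores for four genes.
example = auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(example)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported improvement from roughly 0.63-0.75 to roughly 0.8 in context.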

Simulating Next-Generation Sequencing Datasets From Empirical Mutation and Sequencing Models
Download

Date: TBA
Room: TBA

  • Zachary Stephens, University of Illinois at Urbana-Champaign, United States
  • Matthew Hudson, University of Illinois at Urbana-Champaign, United States
  • Liudmila Mainzer, University of Illinois at Urbana-Champaign, United States
  • Morgan Taschuk, Ontario Institute for Cancer Research, Canada
  • Matthew Weber, University of Illinois at Urbana-Champaign, United States
  • Ravishankar Iyer, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

An obstacle to the validation and benchmarking of methods for the analysis of genomes is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read and alignment simulators can fulfill this requirement. The most flexible approach to quantifying the accuracy of next-generation sequencing (NGS) variant calling pipelines is to utilize simulated read data, where the ground truth (correct read mapping positions, variant locations) is known a priori. However, simulated data is often criticized for limited resemblance to true data, and some simulation tools can be inflexible. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that includes not only an easy-to-use read simulator but also additional scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters, which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads.

Estimation of Free Energy Contribution of Protein Residues as Feature for Structure Prediction from Sequence
Download

Date: TBA
Room: TBA

  • Sumaiya Iqbal, University of New Orleans, United States
  • Md Tamjidul Hoque, University of New Orleans, United States

Presentation Overview:

A feature that can map one-dimensional sequence information into three-dimensional information is crucial for solving complex protein structure prediction problems. With this in view, based on the contact energy and the predicted relative solvent accessibility (RSA), we propose a novel approach to estimate position specific estimated energy (PSEE) per residue from sequence alone. PSEE can identify the structured as well as the unstructured, or intrinsically disordered, regions of a protein by computing favorable and unfavorable energies respectively, separated by an appropriate threshold. This intriguing feature of PSEE, verified empirically, suggests that PSEE can effectively classify disordered versus ordered residues and can segregate secondary structure components by computing their constituent energies. PSEE-based residue characterization also correlates strongly with hydrophobicity indices. Further, PSEE can detect the existence of critical binding regions that undergo a disorder-to-order transition to perform crucial biological functions. As an application to disorder prediction, we have rigorously tested the PSEE feature and found that it helps the predictor perform consistently better.

Expanding the UniFrac toolbox
Download

Date: TBA
Room: TBA

  • Ruth G. Wong, University of Western Ontario, Canada
  • Jia R. Wu, University of Western Ontario, Canada
  • Gregory B. Gloor, University of Western Ontario, Canada

Presentation Overview:

Microbiome analysis is frequently performed using the UniFrac distance metric to separate groups. Here we demonstrate that unweighted UniFrac is highly sensitive to rarefaction instance and to sequencing depth in uniform data sets. We show that this arises because of subcompositional effects. We introduce information UniFrac and centered ratio UniFrac, two new weightings that are not sensitive to rarefaction and allow greater separation of outliers than classic unweighted and weighted UniFrac. With this expansion of the UniFrac toolbox, we hope to empower researchers to extract more varied information from their data.
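
As background for the sensitivity described above, here is a minimal sketch of classic unweighted UniFrac on a small hypothetical tree: the distance is the branch length leading to taxa seen in only one sample, divided by the branch length leading to taxa seen in either. The tree, the taxon names, and the function name `unweighted_unifrac` are assumptions for illustration; the new weightings introduced in the talk are not implemented here.

```python
# Hypothetical rooted tree ((t1:1, t2:1):1, (t3:1, t4:1):1), stored as
# (branch_length, set of leaves below that branch).
BRANCHES = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

def unweighted_unifrac(sample_a, sample_b, branches=BRANCHES):
    """Unique branch length divided by total observed branch length.

    sample_a, sample_b: sets of taxa detected (presence/absence only).
    """
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total else 0.0

full = {"t1", "t2", "t3", "t4"}
missing_rare = {"t1", "t2", "t3"}  # same community, t4 not detected

print(unweighted_unifrac(full, full))          # 0.0
print(unweighted_unifrac(full, missing_rare))  # 1/6, about 0.167
```

Because only presence or absence matters, failing to detect a single low-abundance taxon in one subsample is enough to move the distance away from zero.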

An unlabeled-negative learning framework for human enhancer prediction based on low-methylation regions
Download

Date: TBA
Room: TBA

  • Jingting Xu, University of Illinois at Chicago, United States
  • Hong Hu, University of Illinois at Chicago, United States
  • Yang Dai, University of Illinois at Chicago, United States

Presentation Overview:

Background The identification of enhancers is a challenging task. Various types of epigenetic information, including histone modifications, have been utilized in the construction of enhancer prediction models based on a diverse panel of machine learning models. However, DNA methylation profiles generated from whole genome bisulfite sequencing (WGBS) have not been fully explored for their potential in enhancer prediction, despite the fact that low methylated regions (LMRs) have been implied to be distal to transcription start sites and active in the regulation of target genes.

Method In this work we propose an unlabeled-negative learning framework using a weighted support vector machine model to build prediction models based on LMRs from cell-type specific WGBS DNA methylation profiles. The unlabeled LMR set is further divided into reliable positive, likely positive and likely negative subsets according to their resemblance to a small set of experimentally validated enhancers in the VISTA database, based on their non-parametric density distributions.

Results We demonstrate the performance of LMR-wSVM by using the WGBS DNA methylation profile derived from the ES H1 cell line. Our results show that the predicted enhancers are highly conserved and the validation rate ranges from 68.92% to 83.4% from 64,791 to 29,818 predicted enhancers. The performance of our models is competitive with, or even better than, that of the best-performing existing methods.

Conclusion Our work suggests that low methylated regions detected from WGBS data are useful for developing models for the prediction of cell type-specific enhancers.

WEVOTE: Weighted Voting Taxonomic Identification Method of Microbial Sequences
Download

Date: TBA
Room: TBA

  • Ahmed Metwally, University of Illinois at Chicago, United States
  • Yang Dai, University of Illinois at Chicago, United States
  • Patricia Finn, University of Illinois at Chicago, United States
  • David Perkins, University of Illinois at Chicago, United States

Presentation Overview:

Background: Metagenome shotgun sequencing presents opportunities to identify organisms that may prevent or promote disease. Analysis of sample diversity is achieved by taxonomic identification of metagenomic reads followed by generating an abundance profile. However, existing taxonomic identification tools with the best precision and practical performance still lack sensitivity. Moreover, methods with the highest sensitivity suffer from low precision and low specificity, along with long computation time.

Methods: In this paper, we present WEVOTE (WEighted VOting Taxonomic idEntification), a method that classifies whole genome shotgun sequencing DNA reads based on an ensemble of existing methods using k-mer based, marker-based, and naive-similarity approaches. Our evaluation based on three benchmarking datasets shows that WEVOTE reduces the false positives to half of that produced by the other highly sensitive tools while preserving the same level of sensitivity.

Conclusions: WEVOTE is an automated efficient tool that combines individual taxonomic identification methods. It is expandable and has the potential to reduce the false positives and produce more accurate taxonomic identification for microbiome data. The WEVOTE framework is written in C++, Perl, and shell scripting.
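
As a rough illustration of the ensemble idea described above, the sketch below combines per-read taxon calls from several tools by weighted voting. It is not the WEVOTE algorithm itself, which the abstract does not spell out; the tool names, the weights, and the `min_support` threshold are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(calls, weights, min_support=0.5):
    """Pick the taxon with the largest total weight among the tools that made a call.

    calls:   dict tool name -> taxon label, or None if the tool left the read unclassified
    weights: dict tool name -> non-negative weight (hypothetical values)
    Returns the winning taxon, or None if its share of the total weight
    of all tools is below min_support.
    """
    totals = defaultdict(float)
    for tool, taxon in calls.items():
        if taxon is not None:
            totals[taxon] += weights[tool]
    if not totals:
        return None
    best = max(totals, key=totals.get)
    share = totals[best] / sum(weights[tool] for tool in calls)
    return best if share >= min_support else None


# Hypothetical tools and weights, for illustration only.
weights = {"kmer_tool": 1.0, "marker_tool": 1.5, "similarity_tool": 1.0}
read_calls = {
    "kmer_tool": "Escherichia coli",
    "marker_tool": "Escherichia coli",
    "similarity_tool": "Salmonella enterica",
}
print(weighted_vote(read_calls, weights))
```

Requiring a minimum share of the total weight is one simple way to trade sensitivity for fewer false positives: a read supported by only one low-weight tool is left unclassified.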

Bayesian Correlation Analysis for Sequence Count Data

Date: TBA
Room: TBA

  • Daniel Sanchez-Taltavull, OHRI, Canada
  • Parameswaran Ramachandran, OHRI, Canada
  • Nelson Lau, OHRI, Canada
  • Theodore Perkins, Ottawa Health Research Institute, Canada

Presentation Overview:

Measuring similarity between different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities' measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low, especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities' signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on RNA-seq data describing gene expression across a collection of human tissue types.

Dissecting the expression relationships between RNA-binding proteins and their cognate targets in eukaryotic post-transcriptional regulatory networks

Date: TBA
Room: TBA

  • Sneha Nishtala, Indiana University Purdue University Indianapolis (IUPUI), United States
  • Yaseswini Neelamraju, Indiana University Purdue University Indianapolis (IUPUI), United States
  • Sarath Chandra Janga, Indiana University Purdue University Indianapolis (IUPUI), United States

Presentation Overview:

RNA-binding proteins (RBPs) are a class of regulatory molecules pivotal in orchestrating several steps in the metabolism of RNA in eukaryotes, thereby controlling an extensive network of RBP-RNA interactions. In this study, we employ CLIP-seq datasets for 50 human RBPs and RIP-chip data for 69 yeast RBPs to construct a network of genome-wide RBP-target RNA interactions for each RBP. Using these datasets we studied the expression association (measured as correlation) of RBPs at both transcriptomic and proteomic levels with their experimentally known target transcripts across 16 human tissues and 18 experimental conditions in yeast. We show that in humans the majority (~78%) of the RBPs are strongly associated with their target transcripts at the transcript level, while ~96% of the studied RBPs were found to be strongly associated with the expression levels of target transcripts when protein expression levels of RBPs were employed. Further analysis revealed that, based on the observed distribution of RBP-RNA correlations compared to a null distribution, RBPs can be classified into three classes: significantly congruent (SC) – RBPs which exhibit significantly higher correlation than expected; significantly incongruent (SIC) – RBPs which exhibit significantly lower correlation than expected; and no significant change (NSC) – RBPs which exhibit no association with their targets. At the transcript level, RBP-RNA interaction data for the yeast genome exhibited a strong association for 57% of the RBPs, confirming that our observed association between RBPs and their targets is conserved across large phylogenetic distances.
Further analysis to uncover the features contributing to these associations using elastic net as well as multivariate regression modelling revealed the significant contribution of features like the number of target transcripts and the length of the selected protein-coding transcript of an RBP at the transcript level, while the intensity of the CLIP signal, the number of RNA-binding domains, and the location of the binding site on the transcript were found to be significant at the protein level. Our analysis provides a comprehensive understanding of the relation between the expression levels of RBPs and their targets with specific insights into the factors contributing to the observed association and will contribute to improved modelling and prediction of post-transcriptional networks.

Metabolomics and Cheminformatics analysis guiding the Discovery of Antifungal Metabolites for Crop Protection

Date: TBA
Room: TBA

  • Miroslava Cuperlovic-Culf, National Research Council of Canada, Canada
  • Nandhakishore Rajagopalan, National Research Council of Canada, Canada
  • Dan Tulpan, National Research Council of Canada, Canada
  • Michele Loewen, National Research Council of Canada, Canada

Presentation Overview:

Fusarium head blight (FHB), also known as scab or tombstone, is a devastating disease of wheat, barley, oats and other small-grain cereals as well as corn, caused primarily by Fusarium graminearum. Several cultivars of wheat have developed some level of resistance to FHB. Resistance to this fungal pathogen includes specific metabolic responses to inoculation. A number of published metabolomics studies have determined major metabolic changes induced by the pathogen in resistant and susceptible plants. The function of the majority of these metabolites in resistance remains, however, unknown. In this work we have compiled all metabolites determined to selectively accumulate following FHB inoculation in resistant plants. Characteristics as well as possible functions and targets of these plant metabolites are investigated using cheminformatics approaches. A particular focus has been on the likelihood of these metabolites targeting specific proteins and acting as drug-like molecules. Results of computational analyses of binding properties of several representative metabolites to homology models of proteins are presented. Theoretical analysis highlights the possibility of strong inhibitory activity of several metabolites against some major proteins in F. graminearum such as carbonic anhydrases and cytochrome P450s. Activity of several of these compounds has been experimentally confirmed in fungal growth inhibition assays.

An unsupervised kNN Method to systematically detect changes in protein localization in high-throughput microscopy images

Date: TBA
Room: TBA

  • Alex Lu, Department of Computer Science, University of Toronto, Canada
  • Alan Moses, Department of Computer Science and Department of Cells and System Biology, University of Toronto, Canada

Presentation Overview:

Despite the importance of characterizing genes that exhibit subcellular localization changes between conditions in proteome-wide imaging experiments, many recent studies still rely upon manual evaluation to assess the results of high-throughput imaging experiments. We describe and demonstrate an unsupervised k-nearest neighbours method for the detection of localization changes. Compared to previous classification-based supervised change detection methods, our method is much simpler and faster, and operates directly on the feature space, which avoids the need to manually curate training sets that may not generalize well between screens. In addition, the output of our method is flexible in its utility, generating both a quantitatively ranked list of localization changes that permits user-defined cut-offs, and a vector for each gene describing the feature-wise direction and magnitude of localization changes. We demonstrate that our method is effective at the detection of localization changes using the Δrpd3 perturbation in Saccharomyces cerevisiae, where we capture 71.4% of previously known changes within the top 10% of ranked genes, and find at least four new localization changes within the top 1% of ranked genes. The results of our analysis indicate that simple unsupervised methods may be able to identify localization changes in images without laborious manual image labelling steps.
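
One way such an unsupervised score could work is sketched below. This is an assumption made for illustration, not the authors' published procedure: each protein's feature change between conditions is compared with the average change of its k nearest neighbours in the reference feature space, and the residual vector gives both a ranking score (its length) and a feature-wise direction. The protein names and feature values are made up.

```python
import math


def knn_change_scores(reference, perturbed, k=2):
    """Score each protein by how far its feature change departs from its neighbours'.

    reference, perturbed: dict protein -> list of features (same keys, same lengths)
    Returns dict protein -> (score, residual change vector).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Raw feature-wise change of every protein between the two conditions.
    change = {p: [b - a for a, b in zip(reference[p], perturbed[p])] for p in reference}
    results = {}
    for p in reference:
        # k nearest neighbours of p in the reference condition.
        others = sorted((q for q in reference if q != p),
                        key=lambda q: dist(reference[p], reference[q]))
        neighbours = others[:k]
        n_feat = len(reference[p])
        expected = [sum(change[q][i] for q in neighbours) / len(neighbours)
                    for i in range(n_feat)]
        residual = [c - e for c, e in zip(change[p], expected)]
        results[p] = (math.sqrt(sum(r * r for r in residual)), residual)
    return results


# Toy data: only protein "D" moves in feature space after the perturbation.
reference = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1], "D": [0.1, 0.1]}
perturbed = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1], "D": [0.9, 0.1]}
results = knn_change_scores(reference, perturbed)
ranked = sorted(results, key=lambda p: results[p][0], reverse=True)
print(ranked)
```

Subtracting the neighbours' average change is meant to cancel shifts that affect a whole region of feature space, so that only protein-specific changes rank highly.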

Using ancestral sequence reconstruction methods to predict functional evolution in cetacean rhodopsin over a major evolutionary transition

Date: TBA
Room: TBA

  • Sarah Dungan, University of Toronto, Canada
  • Belinda Chang, University of Toronto, Canada

Presentation Overview:

Ancestral sequence reconstruction methods, particularly those that use probabilistic models, have become increasingly refined over recent years, which has resulted in their popular use as tools for tracing the functional evolution of proteins. Nevertheless, the robustness of reconstructed sequences is often poorly addressed, with reliance on only the most probable sequence under a single model. This is despite probabilistic models having known optimization biases towards more frequent amino acid states that can subsequently lead to biased inferences of ancestral protein function. The dim-light visual protein, rhodopsin, was recently shown to be under positive selection in cetaceans, with accompanying functional shifts that suggest divergence and adaptation within Cetacea to different underwater light environments. Nevertheless, the evolution of dim-light vision at the origin of Cetacea as they transitioned to aquatic environments remains unexplored. Because cetacean rhodopsin is highly conserved, yet has strong signatures of functional evolution, it is an ideal system in which to test the application of ancestral sequence reconstruction. We compare commonly used amino acid and codon-based likelihood models to reconstruct the rhodopsin sequences from the ancestral cetacean, and the common ancestor of cetaceans with their nearest hippopotamid and ruminant relatives (Whippomorpha and Cetruminantia). Specifically, we determine whether different models result in the same most probable ancestral sequences, and within-model, which ancestral sites vary after randomly sampling from the empirical Bayesian posterior probability distribution. We then construct homology models of the different ancestral protein 3D structures to assess the likelihood that uncertain sites will impact rhodopsin functions. 
By evaluating model uncertainty, we are able to reliably develop precise hypotheses for resolving whether functional differences in rhodopsin between extant cetaceans and outgroups are due to ancestral or derived substitutions. The single-gene focus of our work facilitates a more detailed discussion of protein structure-function in an evolutionary context, thus providing necessary foundations for future investigations that can evaluate our predictions experimentally.
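
The comparison between the most probable ancestral sequence and sequences sampled from the posterior can be sketched as follows. The per-site posterior probabilities here are invented for illustration; in practice they would come from the output of a reconstruction program, and the function names are hypothetical.

```python
import random


def most_probable_sequence(site_posteriors):
    """Take the highest-posterior state at every site."""
    return "".join(max(site, key=site.get) for site in site_posteriors)


def sample_sequences(site_posteriors, n, seed=0):
    """Draw n sequences, sampling each site independently from its posterior."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(list(site), weights=list(site.values()))[0]
                for site in site_posteriors)
        for _ in range(n)
    ]


def variable_sites(sequences):
    """Indices of sites that take more than one state across the sampled sequences."""
    return [i for i in range(len(sequences[0]))
            if len({seq[i] for seq in sequences}) > 1]


# Hypothetical posteriors for a three-residue ancestral fragment.
posteriors = [
    {"M": 1.0},
    {"F": 0.55, "L": 0.45},
    {"A": 0.98, "S": 0.02},
]
samples = sample_sequences(posteriors, 200)
print(most_probable_sequence(posteriors), variable_sites(samples))
```

Sites that vary across samples are the candidates to inspect on a structural model, since the single most probable sequence hides that uncertainty.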

Base-By-Base v3: new tools for the comparative analysis of genomes

Date: TBA
Room: TBA

  • Chad Smithson, University of Victoria, Canada
  • Chris Upton, University of Victoria, Canada

Presentation Overview:

Base-By-Base (BBB) is a Java tool to create, edit and analyse multiple sequence alignments (MSA) of proteins, genes and large viral genomes. A number of significant new analysis features have been added to this release of the software:

1) CODEHOP, a tool to design sets of degenerate PCR primers for the detection of distant homologs has been rewritten in Java and included as a feature of BBB.
2) FIND DIFFERENCES is a powerful feature for the exploration of which SNPs contribute to phylogenetic assignments. From a set of aligned DNA sequences, the user can count and display nucleotides that are a) unique to one or more sequences; b) present in sequence X, but not in sequences Y and Z; c) present in sequences A, B and C, but not in sequences X, Y and Z with tolerance to a number of failures in this matching. This analysis has been employed to detect small regions of recombination in the smallpox virus that may have been involved in the switch to human hosts.
3) SNIP allows the removal of nucleotides (conversion to consensus sequence) that are present in only one sequence of an MSA. This provides a useful simplification of MSAs when looking for recombination events (patterns of SNPs) in MSAs.
4) Individual amino acids present above a user-defined frequency in protein MSAs can be highlighted.
5) Graphs can be drawn to display % similarity between proteins in MSAs.
6) MAFFT and ClustalO have been included for alignment options.
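
The column-wise logic behind FIND DIFFERENCES (case c) and SNIP in the list above can be sketched as follows. This is a simplified illustration written in Python, not code from Base-By-Base (which is a Java tool); the function names and the toy alignment are hypothetical, and the tolerance for mismatches mentioned in item 2 is left out.

```python
from collections import Counter


def find_differences(alignment, present_in, absent_from):
    """Columns where all `present_in` sequences share a nucleotide
    that none of the `absent_from` sequences carry."""
    length = len(next(iter(alignment.values())))
    hits = []
    for col in range(length):
        shared = {alignment[name][col] for name in present_in}
        if len(shared) != 1:
            continue
        base = shared.pop()
        if base != "-" and all(alignment[name][col] != base for name in absent_from):
            hits.append((col, base))
    return hits


def snip(alignment):
    """Replace nucleotides seen in only one sequence at a column with the column consensus."""
    names = list(alignment)
    out = {name: list(seq) for name, seq in alignment.items()}
    for col in range(len(alignment[names[0]])):
        counts = Counter(alignment[name][col] for name in names)
        consensus, consensus_count = counts.most_common(1)[0]
        if consensus_count < 2:
            continue  # no clear consensus in this column; leave it untouched
        for name in names:
            if counts[alignment[name][col]] == 1:
                out[name][col] = consensus
    return {name: "".join(seq) for name, seq in out.items()}


# Toy alignment of five sequences (columns are 0-based).
aln = {"A": "ACGTA", "B": "ACGTA", "C": "ACCTA", "X": "ATGTG", "Y": "ATGTG"}
print(find_differences(aln, ["A", "B"], ["X", "Y"]))  # [(1, 'C'), (4, 'A')]
print(snip(aln)["C"])  # ACGTA
```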

Multi-genome Scaffold Co-Assembly Based on the Analysis of Gene Orders and Genomic Repeats

Date: TBA
Room: TBA

  • Sergey Aganezov, Computational Biology Institute & Department of Mathematics, The George Washington University, United States
  • Max Alekseyev, George Washington University, United States

Presentation Overview:

Advances in DNA sequencing technology over the past decades have increased the volume of raw sequenced genomic data available for further assembly and analysis. While there exist many software tools for assembly of sequenced genomic material, they often experience difficulties with reconstructing complete chromosomes. Major obstacles include uneven read coverage and the presence of long similar DNA subsequences (repeats). Genome assemblers are therefore often able to reliably reconstruct only long fragments, called scaffolds. We present a method for simultaneous co-assembly of all fragmented genomes (represented as collections of scaffolds rather than chromosomes) in a given set of annotated genomes. The method is based on the analysis of gene orders and relies on an evolutionary model that includes genome rearrangements as well as gene insertions and deletions. It can also utilize information about genomic repeats and the phylogenetic tree of the given genomes, further improving their assembly quality.

Machine Learning Approaches for Breast Cancer Subtypes Reveal Key Genes as Potential Biomarkers

Date: TBA
Room: TBA

  • Michele D'Agnillo, Department of Biological Sciences, University of Windsor, Canada
  • Iman Rezaeian, University of Windsor, Canada
  • Alioune Ngom, School of Computer Science, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada

Presentation Overview:

Worldwide, breast cancer is the second leading cause of death among women, and one in nine women is diagnosed with breast cancer in her lifetime. Accurate diagnosis of the specific subtypes of this disease is a vital step in determining an appropriate therapy for a patient. In this study, we use machine learning approaches to identify the most informative genes that can best discriminate the ten subtypes of breast cancer. In particular, we use a bottom-up hierarchical classification approach to select the most informative genes for different subtypes. This approach clusters the subtypes based on their similarity and produces a tree-based model with a semi-balanced topology. We also use different classification methods and perform an in-depth comparison of their performance using different performance measures on the METABRIC dataset consisting of 997 samples. Our results support that this approach to gene selection and breast cancer subtyping yields a small subset of genes that can predict each of these ten subtypes with very high accuracy of at least 95%. Moreover, the machine learning model provides an insightful structure for further analysis of these subtypes.
We have further analyzed the functions of three genes identified by the machine learning approaches: USP21, PTRH2 and TACO1. Differential expression of USP21 discriminates between Subtype-7 and Subtype-8. USP21 encodes a protein that catalyzes intracellular protein degradation. PTRH2 and TACO1 discriminate between Subtype-1 and Subtype-3. PTRH2 encodes a mitochondrial enzyme, which degrades peptidyl tRNA. The downregulation of PTRH2 causes translational errors, which have been linked to tumour progression and metastasis. TACO1 encodes a mitochondrial translation activator for cytochrome c oxidase, the upregulation of which increases cellular respiration. The proposed mechanism involves the accumulation of mitochondrial protein, which can cause translational errors that upregulate cellular respiration, thereby increasing the risk of oncogenesis.

Pan-Cancer Analyses Reveal Long Intergenic Non-Coding RNAs Relevant to Tumor Diagnosis, Subtyping and Prognosis

Date: TBA
Room: TBA

  • Travers Ching, University of Hawaii Cancer Center, United States
  • Lana Garmire, University of Hawaii Cancer Center, United States

Presentation Overview:

Long intergenic noncoding RNAs (lincRNAs) are a relatively new class of non-coding RNAs that have potential as cancer biomarkers. To seek a panel of lincRNAs as pan-cancer biomarkers, we have analyzed transcriptomes from over 3300 cancer samples with clinical information. Compared to mRNA, lincRNAs exhibit significantly higher tissue specificities that are then diminished in cancer tissues. Moreover, lincRNA clustering results accurately classify tumor subtypes. Using RNA-Seq data from thousands of paired tumor and adjacent normal samples in The Cancer Genome Atlas (TCGA), we identify six lincRNAs as potential pan-cancer diagnostic biomarkers (PCAN-1 to PCAN-6). These lincRNAs are robustly validated using cancer samples from four independent RNA-Seq data sets, and are verified by qPCR in both primary breast cancers and the MCF-7 cell line. Interestingly, the expression levels of these six lincRNAs are also associated with prognosis in various cancers. We further experimentally explored the growth and migration dependence of breast and colon cancer cell lines on two of the identified lincRNAs. In summary, our study highlights the emerging role of lincRNAs as potentially powerful and biologically functional pan-cancer biomarkers and represents a significant leap forward in understanding the biological and clinical functions of lincRNAs in cancers.

Novel personalized pathway-based metabolomics models reveal key metabolic pathways for breast cancer diagnosis

Date: TBA
Room: TBA

  • Sijia Huang, University of Hawaii at Manoa, United States
  • Lana Garmire, University of Hawaii Cancer Center, United States

Presentation Overview:

Background

More accurate diagnostic methods are pressingly needed to diagnose breast cancer, the most common malignant cancer in women worldwide. Blood-based metabolomics is a promising diagnostic method for breast cancer. However, many metabolic biomarkers are difficult to replicate among studies.

Methods

We propose that higher-order functional representation of metabolomics data, such as pathway-based metabolomic features, can be used as robust biomarkers for breast cancer. Towards this, we have developed a new computational method that uses personalized pathway dysregulation scores for disease diagnosis. We applied this method to predict breast cancer occurrence, in combination with correlation feature selection (CFS) and classification methods.

Results

The resulting all-stage and early-stage diagnosis models are highly accurate in two sets of testing blood samples, with average AUCs (Area Under the Curve, a receiver operating characteristic curve) of 0.968 and 0.934, sensitivities of 0.946 and 0.954, and specificities of 0.934 and 0.918. These two metabolomics-based pathway models are further validated by RNA-Seq-based TCGA (The Cancer Genome Atlas) breast cancer data, with AUCs of 0.995 and 0.993. Moreover, important metabolic pathways, such as taurine and hypotaurine metabolism and the alanine, aspartate, and glutamate pathway, are revealed as critical biological pathways for early diagnosis of breast cancer.

Conclusions

We have successfully developed a new type of pathway-based model to study metabolomics data for disease diagnosis. Applying this method to blood-based breast cancer metabolomics data, we have discovered crucial metabolic pathway signatures for breast cancer diagnosis, especially early diagnosis. Further, this modeling approach may be generalized to other omics data types for disease diagnosis.
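
For readers less familiar with the reported metrics, the sketch below computes sensitivity, specificity, and AUC from first principles. The labels and scores are made-up toy values and are unrelated to the study's data.

```python
def sensitivity_specificity(labels, predictions):
    """labels and predictions are sequences of 0/1 values (1 = case)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random case
    scores higher than a random control (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three cases and three controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
predictions = [1 if s >= 0.45 else 0 for s in scores]
print(sensitivity_specificity(labels, predictions), auc(labels, scores))
```

Sensitivity and specificity depend on the chosen score threshold (0.45 here, an arbitrary choice), whereas AUC summarizes performance over all thresholds.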

Integrated Microbiome Resource (IMR): Developing an Open and Streamlined Experimental and Analysis Pipeline for Microbiome Research

Date: TBA
Room: TBA

  • André M. Comeau, Dalhousie University, Canada
  • Gavin M. Douglas, Dalhousie University, Canada
  • Morgan Langille, Dalhousie University, Canada

Presentation Overview:

Microbiome studies have revolutionized the microbiology field and are becoming increasingly popular. In recent years, advances in sequencing technologies and in bioinformatic methods have led to faster and more robust methods for generating and analyzing data. The Comparative Genomics and Evolutionary Bioinformatics – Integrated Microbiome Resource (CGEB-IMR: http://cgeb-imr.ca/) has streamlined and connected each essential step of a microbiome study starting with samples and ending with various plots and tables ready for interpretation within a single workweek. Our pipeline can handle up to 380 samples per sequencing run covering a variety of amplicon targets (16S, 18S, ITS, Bar-Seq, etc.). In little over one year of operation we have processed 7600 samples generating 600 M sequences and 336 G bases from a variety of host-associated (e.g. humans, mice, rats, fish, insects, birds, reptiles) and environmental (e.g. soil, waste water, marine) biomes. These samples encompass 75 projects from 30 principal investigators from several countries. We openly present each step of this resource including primer validation, library preparation, sequencing, quality control, paired-end assembly, taxonomic annotation, functional annotation (metagenomes), predictive functional annotation using PICRUSt (16S data), statistical evaluation, and visualization. This pipeline, Microbiome Helper (https://github.com/mlangill/microbiome_helper), is continually updated based on evolving best practices and can be replicated in other locations with only standard molecular and computational equipment, minimum personnel, and access to a bench top next-generation sequencer (e.g. Illumina MiSeq). Our results illustrate that microbiome studies can be easily conducted in various scientific settings, including for time-sensitive applications, and provide a complete experimental and analysis package that can be replicated by other microbiome researchers.

Hierarchical Clustering based on Non-negative Matrix Factorization for Time Series Transcriptome Profiles

Date: TBA
Room: TBA

  • Abed Alkhateeb, University of Windsor, Canada
  • Iman Rezaeian, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada

Presentation Overview:

Studying the transcriptome of cancer cells at different cancer stages is essential for understanding disease development. The dataset used here contains samples at different progression stages from a Chinese population. First, the sample reads were preprocessed by aligning them to the human genome; the transcripts were then constructed at each cancer stage and the reads were quantified on the constructed transcripts. The final preprocessing step was to construct a matrix V of vectors that contains the transcript profiles measured in fragments per kilobase per million reads (FPKM).
Non-negative matrix factorization (NMF) is used to obtain a parts-based representation of the data by factorizing the matrix V into two non-negative matrices, which intuitively finds localized parts of interest. Only additive combinations of parts are allowed because of the non-negativity constraint. The method focuses on learning identifying features, so that a sparse set of details can represent V sharply in a lower-dimensional representation. The vectors are then clustered by focusing on these sparse, sharp features and removing noisy and redundant ones. The best number of clusters (k) is determined at each stage of a hierarchical model using the purity and sparsity of the clusters.
The results show that the method finds meaningful clusters; the resulting clusters are assessed biologically using information gathered from the literature. The significant clusters are analyzed to find relationships among transcripts that follow similar trends across the different stages. We studied the functions and promoters of the genes that belong to each cluster, and found similarities among the gene ontologies of genes in the same cluster. Further biological validation and wet-lab experiments on these transcripts are left as future work.
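
The factorization of V into two non-negative matrices can be illustrated with a minimal sketch. It uses the standard multiplicative update rules for the squared-error objective, which is an assumption, since the abstract does not say which NMF variant was used. The toy FPKM values, the rank, and the argmax cluster assignment are likewise illustrative choices.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into W (n x rank) and H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy matrix: 4 transcripts (rows) x 3 stages (columns), made-up FPKM values.
v = [[5.0, 4.0, 0.5], [10.0, 8.0, 1.0], [0.2, 1.0, 6.0], [0.4, 2.0, 12.0]]
w, h = nmf(v, rank=2)
# One simple clustering choice: assign each transcript to its dominant factor.
clusters = [max(range(2), key=lambda a: row[a]) for row in w]
print(round(frobenius_error(v, w, h), 4), clusters)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what restricts the representation to additive parts.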

Genomics and transcriptomic analysis of imatinib resistance in gastrointestinal stromal tumor

Date: TBA
Room: TBA

  • Asmaa Elzawahry, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
  • Tsuyoshi Takahashi, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
  • Sachiyo Mimaki, Division of Translational Research, Exploratory Oncology Research and Clinical Trial Center, National Cancer Center, 6-5-1 Kashiwanoha, Kashiwa, Chiba 277-8577, Japan, Japan
  • Eisaku Furukawa, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
  • Rie Nakatsuka, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
  • Isao Kurosaka, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
  • Takahiko Nishigaki, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
  • Hiromi Nakamura, Division of Cancer Genomics, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
  • Satoshi Serada, Laboratory for Immune Signal, National Institute of Biomedical Innovation, 7-6-8 Saito-Asagi, Ibaraki City, Osaka, 567-0085, Japan, Japan
  • Tetsuji Naka, Laboratory for Immune Signal, National Institute of Biomedical Innovation, 7-6-8 Saito-Asagi, Ibaraki City, Osaka, 567-0085, Japan, Japan
  • Seiichi Hirota, Department of Surgical Pathology, Hyogo Medical College, 1-1, Mukogawa-cho, Nishinomiya City, Hyogo, 663-8501, Japan, Japan
  • Tatsuhiro Shibata, Division of Cancer Genomics, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
  • Katsuya Tsuchihara, Division of Translational Research, Exploratory Oncology Research and Clinical Trial Center, National Cancer Center, 6-5-1 Kashiwanoha, Kashiwa, Chiba 277-8577, Japan, Japan
  • Toshirou Nishida, Department of Surgery, National Cancer Center Hospital East, 6-5-1 Kashiwanoha, Kashiwa, Chiba, 277-8577, Japan, Japan
  • Mamoru Kato, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan

Presentation Overview:

Background
The gastrointestinal stromal tumor (GIST) is the most common mesenchymal tumor of the digestive tract, and its proliferation is driven by gain-of-function mutations in KIT. These characteristics have facilitated the development of targeted therapies with tyrosine kinase inhibitors, such as imatinib. Although many clinical studies have demonstrated revolutionary effects of imatinib, more than 80% of patients eventually develop disease progression driven by secondary resistance mutations located in KIT kinase domains. However, the full spectrum of genomic and transcriptomic changes behind the resistance remains unknown.

Results
This study analyzed genomic and transcriptomic changes in imatinib-sensitive and imatinib-resistant cell lines. We also looked at an “intermediate” cell line that had not yet reached full resistance. We identified SNVs and CNAs from next-generation sequencing and profiled the transcriptome with microarrays. For clinical insights, we conducted exome sequencing for two resistant clinical samples. Notably, the cell line briefly exposed to imatinib exhibited drastic transcriptional changes, but few genomic changes.

Conclusion
We suggest that pre-existing cell death-resistant subpopulations are the main cause of full resistance via secondary KIT mutations. The combination of chemotherapy with imatinib and apoptosis pathway-targeting drugs could limit the emergence of drug-resistant cancer.

Computational drug repositioning through graph-based semi-supervised learning with genomic expression and drug-gene interaction network

Date: TBA
Room: TBA

  • Gyeongmo Gu, Kyungpook National University, Korea, Republic of
  • Erkhembayar Jadamba, Kyungpook National University, Korea, Republic of
  • Miyoung Shin, Kyungpook National University, Korea, Republic of

Presentation Overview:

Background:
Drug repositioning is an efficient drug discovery method that detects new indication(s) for existing drugs. So far, there have been various computational drug repositioning methods using pharmaceutical data and/or disease data, but currently we have new demands for computational approaches to combine experimental data, such as patients’ gene expression profiles, with pharmaceutical and genomic databases in order to explore relationships between drugs and diseases.

Results:
We propose a new strategy for drug repositioning that employs a graph-based semi-supervised learning method to analyze experimental knowledge from gene expression data and pharmaceutical knowledge from drug resources. In the first step of our study, we applied a graph-based semi-supervised learning algorithm to gene expression data in order to identify discriminative genes related to a specific disease. Next, we built a drug-gene interaction bipartite graph using existing relationships between drugs and genes recorded in public databases. To discover candidates for drug repositioning, we used the known drug-disease associations and the discriminative genes found earlier to initialize the drug-gene interaction network, and performed network propagation via the guilt-by-association principle in a semi-supervised manner. We tested our method by evaluating 1239 FDA-approved drugs on the Rosetta breast cancer dataset, and observed that our strategy can identify promising candidate drugs for breast cancer treatment.

Conclusions:
Through our new drug repositioning method, we were able to discover new drug-disease associations.
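
Network propagation over a drug-gene bipartite graph can be sketched as follows. The update rule used here, f <- alpha * S f + (1 - alpha) * y with a symmetrically normalized adjacency matrix S, is a common label-propagation scheme and is assumed for illustration; the abstract does not specify the exact algorithm. The drug and gene names, the seed scores, and the `alpha` value are hypothetical.

```python
def propagate(edges, seeds, alpha=0.8, n_iter=50):
    """Spread seed scores over an undirected graph (guilt by association).

    edges: list of (node, node) pairs, e.g. (drug, gene) interactions
    seeds: dict node -> initial score y (known drug-disease links, discriminative genes)
    Returns dict node -> propagated score.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    degree = {n: len(nb) for n, nb in neighbours.items()}
    y = {n: seeds.get(n, 0.0) for n in neighbours}
    f = dict(y)
    for _ in range(n_iter):
        # Each node takes a degree-normalized share of its neighbours' scores
        # and keeps a fraction (1 - alpha) of its own seed score.
        f = {
            n: alpha * sum(f[m] / (degree[n] * degree[m]) ** 0.5 for m in neighbours[n])
            + (1 - alpha) * y[n]
            for n in neighbours
        }
    return f


# Hypothetical drug-gene interactions; drugA is a known treatment, gene1 a discriminative gene.
edges = [("drugA", "gene1"), ("drugA", "gene2"), ("drugB", "gene2"),
         ("drugC", "gene3"), ("drugD", "gene3")]
scores = propagate(edges, seeds={"drugA": 1.0, "gene1": 1.0})
candidates = sorted((d for d in scores if d.startswith("drug") and d != "drugA"),
                    key=scores.get, reverse=True)
print(candidates)
```

In this toy graph drugB shares a gene with the seeded drug and so receives a positive score, while drugC and drugD sit in a disconnected component and stay at zero.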

Inferring genes sensitive to severity of toxicity symptom

Date: TBA
Room: TBA

  • Jinwoo Kim, Kyungpook National University, Korea, Republic of
  • Hyunjung Lee, Kyungpook National University, Korea, Republic of
  • Miyoung Shin, Kyungpook National University, Korea, Republic of

Presentation Overview:

Background:
It is important to find genes related to toxicity symptoms in order to develop drugs efficiently. Many studies have sought to infer such markers, but most of them focused only on markers of toxicity occurrence. In this work, our aim was to find gene markers related to the aggravation of toxicity symptoms, which should be more sensitive to the severity of symptoms than markers obtained by existing approaches.

Result:
To identify gene markers for each of four targeted liver toxicity symptoms (necrosis, hypertrophy, cellular infiltration, cellular changes), we used microarray and pathology data from 14,144 in vivo rat samples, and employed sparse linear discriminant analysis (sLDA) for gene selection. Severity was used as the class label for sLDA. To evaluate the inferred gene markers, we constructed regression models for predicting symptom severity. In 10-fold cross-validation, our model achieves an AUC of 0.96 in predicting the samples in which liver necrosis occurs, with a Spearman correlation coefficient of 0.80 between predicted and actual necrosis severity. For the other three symptoms, the coefficients are 0.72, 0.77 and 0.65 for hypertrophy, infiltration and cellular changes, respectively. In addition, we used one-way ANOVA and Student's t-test as alternative gene selection methods, and compared the performance of the models built with the different gene selection methods. sLDA provides higher correlation coefficients than ANOVA and the t-test for all targeted symptoms.
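
As an illustration of the evaluation metric mentioned above, the following sketch computes a Spearman rank correlation between graded severities and model predictions using only the standard library. The helper names and the toy values are hypothetical; the abstract does not describe how the authors computed the coefficient.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


if __name__ == "__main__":
    actual_severity = [0, 0, 1, 2, 3]          # toy pathology grades
    predicted = [0.2, 0.1, 0.9, 1.5, 2.8]      # toy regression outputs
    print(spearman(actual_severity, predicted))
```

Rank-based correlation suits ordinal severity grades because it only asks whether predictions preserve the ordering, not whether they match the grade values.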

Conclusion:
Our sLDA-based feature selection method is more useful for finding gene markers sensitive to the aggravation of toxicity symptoms than conventional statistical methods. Our method may help to develop new drugs or treatments that have high therapeutic effects while minimizing toxic effects.

Prediction of Calmodulin-binding Proteins Using Canonical Motifs
Download

Date: TBA
Room: TBA

  • Mrinalini Pandit, University of Windsor, Canada
  • Mina Maleki, University of Windsor, Canada
  • Nicholas J Carruthers, Wayne State University, United States
  • Paul Stemmer, Wayne State University, United States
  • Luis Rueda, University of Windsor, Canada

Presentation Overview:

Calmodulin (CaM) is a calcium-binding protein that is a major transducer of calcium signaling. It has no enzymatic activity of its own but rather acts by binding to and altering the activity of a panel of cellular protein targets. Its targets are structurally and functionally diverse and participate in a wide range of physiological functions including immune response, muscle contraction and memory formation. Identifying CaM target proteins and CaM sites in those proteins is an important and ongoing research problem because its binding sites are defined by physical characteristics like helicity and charge rather than a particular amino acid sequence. Current algorithms for CaM binding site prediction struggle to identify novel CaM-binding proteins.
Short Linear Motifs (SLiMs), on the other hand, help regulate many cellular processes by serving as interaction sites for other SLiM-containing proteins. SLiM-mediated interactions are often transient, or utilize additional interaction domains to cooperatively produce stable complexes. In this work, we propose a meta-analysis model for predicting CaM-binding proteins based on sequence information. The model uses SLiMs derived from a set of validated CaM-binding proteins as features for the prediction. A dataset of 194 manually curated CaM-binding proteins from the Calmodulin Target Database [2] and another dataset of 200 mitochondrial proteins were obtained and used for testing the model. For each protein, the extracted features are the frequencies of occurrence of the known CaM-binding motifs reported in [1]. Predictions were performed using k-nearest neighbor, support vector machine and random forest classifiers, achieving accuracies of 93%, 93% and 91%, respectively. These results indicate that using SLiMs for prediction of CaM-binding proteins helps identify the CaM-binding regions, which will support future biological experiments analyzing CaM and its binding to target proteins. This analysis also suggests that identification of new CaM-binding motifs will enrich the study of CaM-binding proteins in the field of computational biology.
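
The feature extraction step (motif occurrence frequencies per protein) might look roughly like the sketch below. The two regular expressions are illustrative placeholders only; they are not the CaM-binding motifs from reference [1], and normalising counts by sequence length is an assumption.

```python
import re

# Hypothetical placeholder patterns, NOT the motif set used in the abstract.
ILLUSTRATIVE_MOTIFS = {
    "hydrophobic_1_10": r"[FILVW].{8}[FILVW]",  # hydrophobic anchors 9 apart
    "basic_pair": r"[KR]{2}",                   # two adjacent basic residues
}


def motif_frequencies(sequence, motifs=ILLUSTRATIVE_MOTIFS):
    """Return {motif name: occurrences per residue} for one protein sequence."""
    length = max(len(sequence), 1)  # avoid division by zero on empty input
    features = {}
    for name, pattern in motifs.items():
        # A lookahead lets overlapping occurrences be counted as well.
        overlapping = re.compile("(?=(" + pattern + "))")
        count = sum(1 for _ in overlapping.finditer(sequence))
        features[name] = count / length
    return features


if __name__ == "__main__":
    print(motif_frequencies("MKRKLAAAAAAAALGG"))
```

The resulting feature dictionaries can be turned into fixed-order vectors and passed to any standard classifier, such as the k-nearest neighbor, support vector machine or random forest models named in the abstract.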

CoDuMIMM: Coevolution Detection using Mutual Information and Mutational Mapping
Download

Date: TBA
Room: TBA

  • Andrew Low, Carleton University, Canada
  • Alex Wong, Carleton University, Canada

Presentation Overview:

Epistasis, the genetic interaction of mutations in nucleotide sequences to produce effects that are not the sum of their parts, was once thought to be a relatively uncommon phenomenon. However, research has shown that epistasis may be much more pervasive than previously thought, to the point of being an important force in molecular evolution. While many approaches to detect epistatic interactions in silico have been attempted, there is still room for improvement in these techniques. To this end, we present a new algorithm, CoDuMIMM (Coevolution Detection using Mutual Information and Mutational Mapping). CoDuMIMM works by first generating substitution histories along a phylogeny for each site in an alignment using Bayesian techniques, and then calculating mutual information between pairs of sites based on the proportion of time throughout evolutionary history that states have been shared. Preliminary results on simulated data show that CoDuMIMM is able to identify epistatic sites with high sensitivity and specificity. Further testing on transfer RNA and ribosomal RNA sequences with known structures (and therefore known epistatic interactions) will be carried out in the near future, with the potential to move on to protein sequences and the prediction of inter-gene epistatic interactions after that.
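
To make the mutual information step concrete, here is a minimal sketch that scores a pair of alignment columns. It is a simplification: it weights each sequence equally, whereas the abstract describes weighting by the proportion of evolutionary time that states are shared along the phylogeny. The function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(column_a) != len(column_b) or not column_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(column_a)
    freq_a = Counter(column_a)
    freq_b = Counter(column_b)
    freq_ab = Counter(zip(column_a, column_b))
    total = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        p_independent = (freq_a[a] / n) * (freq_b[b] / n)
        # Contribution is positive when the pair co-occurs more than expected.
        total += p_joint * log2(p_joint / p_independent)
    return total


if __name__ == "__main__":
    print(mutual_information("AACC", "GGTT"))  # perfectly covarying sites
    print(mutual_information("AACC", "GTGT"))  # independent sites
```

Replacing the per-sequence counts with branch-length-weighted state durations from sampled substitution histories would bring this closer to the approach the abstract outlines.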

Modelling stochastic pulsatility of transcription factors in Saccharomyces cerevisiae
Download

Date: TBA
Room: TBA

  • Ian Hsu, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada

Presentation Overview:

Many transcription factors in Saccharomyces cerevisiae exhibit pulsatility. Pulsatile transcription factors localize to nuclei for a short period and return to the cytoplasm under constant conditions. Although most pulsatile transcription factors passively localize to nuclei after receiving a signal transduced from a change in the environment, a recently documented class of pulsatile transcription factors shows active localization to nuclei. The localization of this class is believed to be stochastic, in that the period of pulses is not synchronized between cells in a homogeneous environment; in addition, the frequency of pulses in a single cell does not show obvious oscillation. The mechanism of stochastic pulsatility in cells is largely unknown. We use mathematical models to explore this mechanism. Fluorescent-protein-tagged transcription factors can be tracked by confocal microscopy. We therefore use time-lapse images combined with image analysis of the fluorescence intensity in nuclei to quantify pulsatility over time in each cell. Based on the quantified data, we construct a model that focuses on the phosphorylation and dephosphorylation reactions of the transcription factors, which are currently hypothesized to play a major role in pulsatility. Understanding the mechanism of active and stochastic pulsatility, and how it differs from passive and deterministic pulsatility, could help us understand an important strategy of gene expression regulation.

Identification of haplotypes in HLA using genetic variant calling from amplicon NGS data
Download

Date: TBA
Room: TBA

  • Hong Hu, University of Illinois at Chicago, United States
  • Mark Maienschein-Cline, University of Illinois at Chicago, United States
  • Zhengdeng Lei, University of Illinois at Chicago, United States
  • Pinal Kanabar, University of Illinois at Chicago, United States
  • George Chlipala, University of Illinois at Chicago, United States
  • Morris Chukhman, University of Illinois at Chicago, United States
  • Neil Bahroos, University of Illinois at Chicago, United States
  • David Everly, Rosalind Franklin University of Medicine and Science, United States

Presentation Overview:

The human leukocyte antigen (HLA) complex is of key importance in determining antigenic specificity in the adaptive immune response. Therefore, typing the HLA genes is essential for matching donors and recipients in stem cell transplantation. Amplicon-based targeting of HLA genes is an efficient and cost-effective strategy for obtaining high-coverage sequencing for HLA typing. However, existing off-the-shelf HLA type prediction tools such as HLAminer are designed for whole-genome or whole-exome shotgun sequencing data, and provide low accuracy when applied to amplicon NGS data. Here we present a novel workflow based on genetic variant calling to identify the actual haplotypes on each amplicon for HLA typing. The short reads from NGS are merged into single long reads that completely span each given amplicon, aligned to reference HLA amplicons, and then used for variant calling. The haplotypes are obtained by reprocessing each merged read using the identified genetic variants, yielding consensus sequences of each haplotype spanning each HLA amplicon. The consensus sequences are aligned back to the IMGT/HLA database for nomenclature and are being validated against reference sequences obtained through Sanger sequencing. We have applied this method to datasets of known HLA types to assess the specificity and sensitivity of our approach.

Predicting patient outcomes of hormone therapy in the METABRIC breast cancer study
Download

Date: TBA
Room: TBA

  • Iman Rezaeian, University of Windsor, Canada
  • Eliseos Mucaki, University of Western Ontario, Canada
  • Katherina Baranova, University of Western Ontario, Canada
  • Huy Pham Quang, University of Windsor, Canada
  • Dimo Angelov, University of Western Ontario, Canada
  • Lucian Ilie, University of Western Ontario, Canada
  • Alioune Ngom, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada
  • Peter Rogan, University of Western Ontario, Canada

Presentation Overview:

Genomic aberrations and gene expression-defined subtypes in the METABRIC patient cohort have been used to stratify and predict survival in the breast cancer population (Nature 486: 346; Molecular Oncology 9: 115). Gene expression and clinical outcome were used to predict response for different survival durations in METABRIC patients receiving hormone treatments (HT), with or without chemotherapy (CT). Our previously optimized and validated biochemically-inspired gene expression signature for paclitaxel response was used to predict different outcomes in patients (Molecular Oncology 10: 85). By applying machine learning algorithms for classification and feature selection, this signature, which was originally developed to model paclitaxel response in breast cancer cell lines, was used to predict survival in METABRIC patients. For 54 CT patients, a Random Forest classifier containing ABCB11, BAD, CYP2C8, CYP3A4, MAP2, MAPT and FGF2 exhibited 77% accuracy (AUROC = 0.76) in discriminating survivors from deceased individuals. For HT patients (n = 188), a 19-gene signature (ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, TUBB4B) predicted >3-year survival with 88% accuracy. This signature showed 83% accuracy in the combined HT+CT patient set (n = 221, AUROC = 0.60). In addition, for 84 HT+CT patients, a Support Vector Machine classifier was applied using genes ABCB1, BAD, BCAP29, BCL2, BCL2L1, BMF, CNGA3, CYP2C8, CYP3A4, FGF2, GBP1, MAP2, MAPT, OPRK1, SLCO1B3, TLR6, TUBB1 and TWIST1 as features, yielding 76% accuracy. Applied to untreated patients, the accuracy of predicting survival was lower (about 70%). These tumor gene expression signatures for response to paclitaxel therapy may be useful as a surrogate measure of early to intermediate term survival.

Selection on quantitative traits within intrinsically disordered protein regions preserves functional output of phosphorylation sites
Download

Date: TBA
Room: TBA

  • Caressa Tsai, University of Toronto, Canada
  • Taraneh Zarin, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada

Presentation Overview:

Intrinsically disordered regions (IDRs) of proteins are characterized by their conformational flexibility and absence of stable tertiary structure. Comprising 30% of eukaryotic proteins, IDRs have critical roles in regulation, but sequence analyses have revealed high turnover rates and divergence in these regions, suggestive of weak evolutionary constraints. Notably, phosphorylation sites are a highly important regulatory motif enriched in IDRs, yet display evidence of rapid evolution and poor positional conservation. We examined patterns of phosphorylation site evolution in IDRs in an attempt to detect molecular signatures of selection on quantitative traits describing these regulatory elements. We predicted that under a model of stabilizing selection, rapid divergence may be permitted with the reshuffling and turnover of individual phosphorylation sites, so long as their overall functional output is maintained within an optimal range. Thus, we tested this prediction using both a comparative phylogenetic method as well as a birth-death model of evolution, on Ste50 (a regulatory protein with a highly divergent IDR) and subsequently, on the yeast proteome. Our applications of both of these models reveal striking differences between IDRs in yeast proteins compared to sequences simulated under neutral expectations, indicative of selective constraint on these molecular phenotypes. These results offer an explanation for rapidly diverging IDRs, capable of maintaining regulatory functional outputs under the action of mutation-selection balance.

Systematic Characterization of Subcellular RNA Localization Through Fractionation-Sequencing
Download

Date: TBA
Room: TBA

  • Louis Philip Benoit Bouvrette, IRCM, Canada
  • Neal Cody, Mount Sinai, United States
  • Julie Bergalet, IRCM, Canada
  • Alexis Blanchet-Cohen, IRCM, Canada
  • Xiaofeng Wang, IRCM, Canada
  • Eric Lecuyer, IRCM, Canada

Presentation Overview:

Most eukaryotic cells are highly asymmetric in shape and composition. This feature relies on the capacity of molecular constituents, including proteins and nucleic acids, to be organized within distinct organelles and is central for different cell types to execute specialized functions. Subcellular localization of messenger RNA (mRNA) is a post-transcriptional mechanism which can modulate protein activities at specific functional sites. Global RNA imaging-based screens in Drosophila oocytes and embryos have demonstrated that as much as 70% of the coding transcripts are localized in patterns that broadly correlate with the distribution and function of the encoded proteins. However, this may represent an exceptional example and it remains unclear whether a comparable prevalence of RNA localization is observable in standard cells grown in culture.

To gain global insights into RNA subcellular localization properties, we subjected Drosophila (D17) and human (HepG2, K562) cells to biochemical fractionation combined with RNA sequencing (Frac-seq), allowing RNA mapping within several subcellular compartments (i.e. nuclear, cytosolic, membrane, insoluble). We further performed mass spectrometry on proteins extracted from these same fractions. A complete bioinformatics analysis was carried out to determine RNA and protein enrichment and correlations. We also catalogued specific attributes and motifs characterising asymmetrically distributed RNA.

These results reveal the high prevalence of RNA asymmetric localization, with distinctive subcellular enrichments observed for a diverse array of cellular RNA species (i.e. mRNA, lncRNA, circular RNA). Additionally, we observed specific correlations and anti-correlations between groups of mRNAs and their encoded proteins, as well as attributes (e.g. UTR and coding sequence lengths, exon content) that distinguish fraction-specific mRNA populations. Our study therefore reveals that RNA localization is a prevalent and evolutionarily conserved process, acting through discriminative constraints, that likely impacts every aspect of post-transcriptional gene regulation.

Linking Transposable Elements to Chromatin Architecture in Arabidopsis thaliana
Download

Date: TBA
Room: TBA

  • Christopher Cameron, McGill University, Canada
  • Maia Kaplan, McGill University, Canada
  • Alex Drouin, Laval University, Canada
  • François Laviolette, Laval University, Canada
  • Mathieu Blanchette, McGill University, Canada

Presentation Overview:

Transposable elements (TEs) are self-replicating sequences of mobile DNA that move throughout a host genome. These “selfish” movements are known to be harmful when a TE either: 1) inserts in or near a functional gene, often perturbing its expression; or 2) produces a double-stranded break that may not be repaired correctly (Ayarpadikannan and Kim, 2014). To ensure self-preservation, host genomes have evolved complex defence mechanisms to decrease TE activity, which are known to influence epigenetic states (Feschotte et al., 2002; Fulz et al., 2015). The downstream effects of these mechanisms are often not limited to the TE itself and also affect surrounding DNA. Chromatin architecture of A. thaliana has recently been shown to be tightly linked to epigenetic state. Here, we investigate the role played by TEs, with their potentially disruptive nature, in defining the three-dimensional structure of DNA.

By combining publicly available A. thaliana genomic information and high-throughput chromosome conformation capture (Hi-C) data (Grob et al., 2014), we identify trends in Hi-C interaction frequency (IF) that can be described as a function of TE presence. Hi-C provides a genome-wide observation of all DNA-DNA interactions, facilitated by proteins, occurring within a population of cells. The high-level genomic spatial organization (compartments) identified by Hi-C has been shown to be recapitulated by long-range correlations in epigenetic data, providing a possible link between TEs and chromatin structure. We explore the effects of TE presence on chromatin architecture, which may result from the modification of the epigenetic landscape. The Hi-C and TE correlations we discovered suggest that machine learning models could predict the presence of TEs from Hi-C data and other key genomic features. We compare such a classifier to current predictors, such as Hidden Markov Models and consensus sequences, to demonstrate the potential link between chromatin architecture and the presence of TEs.

The nitrogen responsive transcriptome in potato (Solanum tuberosum L.) reveals significant gene regulatory motifs
Download

Date: TBA
Room: TBA

  • Jose Hector Galvez Lopez, McGill University, Canada
  • Helen H. Tai, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
  • Martin Lague, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
  • Bernie Zebarth, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
  • Martina V. Stromvik, McGill University, Canada

Presentation Overview:

Nitrogen (N) fertilization is an important abiotic factor for the growth of potato (S. tuberosum) because of its potential effects on yield. Additionally, since excess N in the soil negatively impacts the environment, studies on N use by the plant are key. Three commercial potato cultivars (Shepody, Russet Burbank and Atlantic) were grown under two different rates of applied N-fertilizer (0 kg N ha⁻¹ and 180 kg N ha⁻¹) to obtain more information on the underlying gene regulation mechanisms associated with N. Total mRNA samples were taken at two different time-points during the growth season and sequenced. The results for each cultivar and time-point were analyzed separately to find differentially expressed genes. The results of the differential expression analysis were compared to identify N-responsive genes found in all cultivars and time-points. A total of thirty genes were found to be over-expressed and nine genes were found to be under-expressed in all plants with added N-fertilizer. The 1000 bp upstream flanking regions of the differentially expressed genes were analyzed to find overrepresented motifs using three de novo motif discovery algorithms (Seeder, Weeder and MEME). Nine different motifs were found, indicating potential gene regulatory mechanisms for potato under N-deficiency.
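
Extracting the upstream flanking region of a gene before motif discovery could be sketched as follows. The coordinate convention (0-based, half-open), the function names and the toy sequence are assumptions for illustration; the abstract does not describe the authors' extraction procedure.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def upstream_region(chromosome, start, end, strand, length=1000):
    """Return up to `length` bases upstream of a gene.

    start, end: 0-based, half-open gene coordinates on the forward strand.
    strand: "+" or "-". For "-" genes the upstream region lies after `end`
    on the forward strand and is reverse-complemented so that it reads
    5' to 3' relative to the gene.
    """
    if strand == "+":
        return chromosome[max(0, start - length):start]
    if strand == "-":
        return reverse_complement(chromosome[end:end + length])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy_chromosome = "AAACCCGGGTTT"
    print(upstream_region(toy_chromosome, 3, 6, "+", length=3))
    print(upstream_region(toy_chromosome, 3, 6, "-", length=3))
```

The extracted regions would then be written out, for example in FASTA format, as input to motif discovery tools such as those named in the abstract.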

A COMPREHENSIVE MAP OF CRITICAL PATHWAYS AND NETWORKS IN CANCER STEM CELLS
Download

Date: TBA
Room: TBA

  • Jeffrey Liu, University of Toronto, Canada
  • Veronique Voisin, University of Toronto, Canada
  • Changjiang Xu, University of Toronto, Canada
  • Ruth Isserlin, University of Toronto, Canada
  • Gary Bader, University of Toronto, Canada

Presentation Overview:

Introduction: It has been shown that a hierarchy exists in cancer, with cancer stem cells (CSC) at the apex, able to regenerate the disease and resistant to chemotherapy and radiation therapy. Here, we show an example of how pathway and network analysis can be used to extract common stem cell features, given that distinguishing stem-cell-maintaining pathways from CSC-driving mechanisms is critical in the fight against cancer.
Method: RNA-Seq datasets comparing CSC and normal stem cells (NSC) from multiple tissues were processed using the STAR alignment software with the latest genome assembly (GRCh38). Differential expression analysis and Gene Set Enrichment Analysis (GSEA) generated lists of significant genes and pathways. An Enrichment Map (EM) was created using Cytoscape to summarize the genes and pathways into networks of interactions.
Results: GSEA was performed on gene expression data contrasting normal stem cells with stem-cell-derived tissues of heart, blood, breast, nervous system, adipose, kidney, brain, and developing embryo from 24 publicly available RNA-Seq datasets. Highly significant pathways from the combined results (FDR q-value < 0.005) were selected to generate the EM. We discovered that Telomere, DNA replication, DNA repair, Cell cycle, Mitotic spindle, VPR/nuclear transport, pluripotent stem cell, and Histone methylation pathways are generally enriched in normal stem cells. In contrast, Notch1, ERK/MAPK, and Toll-like receptor pathways are activated after stem cell differentiation.
Conclusion: We demonstrated that combining multiple stem-cell datasets through GSEA and EM identifies well-known stem cell pathways such as Telomere, DNA replication, and pluripotency. We also identified pathways involved in differentiation and tissue specification, such as the Notch1 and MAPK pathways. Furthermore, we discovered a novel pathway, VPR/nuclear transport, in stem cells that merits in-depth study. Multiple CSC datasets will be analyzed and combined for EM, and by contrasting the result with the EM of normal stem cells, therapeutic and biomarker candidates will be identified. In the Bader lab, we actively develop and improve tools such as Cytoscape, GeneMANIA, and EM to discover dynamic pathways and networks. A comprehensive map of CSC will provide valuable information for developing specific anti-CSC strategies.

Pathway Commons: Single Point of Access to Biological Pathway Information
Download

Date: TBA
Room: TBA

  • Jeffrey Wong, University of Toronto, Canada
  • Gary Bader, University of Toronto, Canada
  • Igor Rodchenkov, University of Toronto, Canada
  • Chris Sander, Dana Farber Cancer Institute, Harvard Medical School, United States
  • Emek Demir, Oregon Health & Science University, United States
  • Ethan Cerami, Dana Farber Cancer Institute, Harvard Medical School, United States

Presentation Overview:

Pathway Commons is a service that collects and integrates public pathway data and makes it readily available for researchers through a single point of access. The Pathway Commons web site (www.pathwaycommons.org) provides an integrated tool to quickly search and visualize pathways. A download site provides integrated, bulk sets of pathway information in a variety of formats. An accompanying web service is available for software developers to conveniently query and access all data. Pathways include biochemical reactions, complex assembly, transport and catalysis events, gene regulation, genetic interactions, and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons currently contains over 42 000 pathways and 1.35 million interactions from twenty-two data providers with ongoing plans to expand the reach.


Convergent Evolution of Medulloblastoma Metastatic Tumors
Download

Date: TBA
Room: TBA

  • Patryk Skowron, The Hospital for Sick Children, Canada
  • Livia Garzia, The Hospital for Sick Children, Canada
  • Sorana Morrissy, The Hospital for Sick Children, Canada
  • Michael Taylor, The Hospital for Sick Children, Canada

Presentation Overview:

Introduction: Medulloblastoma initiates within the cerebellum and in 30% of cases disseminates throughout the brain and spinal cord. Little is known about the genes driving dissemination since matching primary and metastatic samples are rare. The medulloblastoma Sleeping Beauty (SB) mouse model uses random integration of transposons to initiate tumorigenesis. Insertions that confer a growth advantage are selected upon as the cancer progresses. Recent literature has demonstrated divergent evolution between the primary and metastatic sites and phenotypic convergent evolution between independent metastatic sites in multiple cancers. The extent of convergent evolution in medulloblastoma metastasis is unknown and its investigation may reveal important therapeutic targets.

Methods: Independent metastatic samples were collected from each mouse and the SB transposon/normal genomic DNA junctions were sequenced. From every sample, only the most abundant (i.e. clonal) insertions were kept for downstream analysis. Convergently selected genes were located by identifying unique insertion sites targeting the same gene in different metastatic compartments. For each mouse, the probability of a random convergent event was modelled using the binomial distribution and compared to the observed rates to identify genes undergoing selective pressure.
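
The binomial comparison described above can be sketched as an upper-tail probability. The parameterisation here is an assumption: `n` is the number of metastatic compartments sampled in one mouse, `k` the number carrying an independent insertion in a given gene, and `p` a hypothetical per-compartment chance of hitting that gene at random. The abstract does not state how these quantities were defined.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Toy numbers: 5 compartments, 3 hit the same gene, random hit chance 1%.
    print(binomial_tail(5, 3, 0.01))
```

A small tail probability means that seeing the same gene hit independently in that many compartments is unlikely under random insertion, which is the signal interpreted as convergent selection.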

Results: Fifteen genes showed significant convergent selective pressure across multiple mice. The most recurrent of these were Crebbp, Lgals3, Rabgap1l, Ak7, Ncoa3, Ptk2, Gabrb3, and Ophn1. Crebbp and Ncoa3 are chromatin remodellers that are part of the same complex and play an essential role in growth control and embryonic development, while Ptk2, Gabrb3 and Ophn1 regulate cell-to-cell junction maintenance. Subsequent gene set enrichment analysis revealed a multitude of pathways essential for metastasis in medulloblastoma, such as cell adhesion and Hedgehog signalling.

Conclusions: Convergent evolution plays a prominent role in medulloblastoma metastasis progression. Independent metastases have unique insertions in the same gene indicative of strong selective pressure.

In silico Discovery of Candidate Transcriptional Biomarkers for Ionizing Radiation
Download

Date: TBA
Room: TBA

  • Yared Kidane, WYLE/NASA, United States

Presentation Overview:

Gene expression profiling has aided in the identification of biomarkers for ionizing radiation. In spite of previous attempts, a comprehensive list of biomarker genes that hold across experimental conditions still remains to be discovered. This has hampered the development of countermeasures against ionizing radiation. Polymerase chain reaction (PCR) based studies conducted earlier were aimed at identifying a small number of radiation signatures that can be used to assess exposure to ionizing radiation during mass radiologic incidents. In this sense, they focused on identifying genes that act in isolation.

Our goal is to formulate a comprehensive list of genes that respond to a range of radiation characteristics, including various particles, doses, and times post exposure, by leveraging the wealth of knowledge in protein-protein interaction networks. More specifically, we collected a list of well-studied ionizing radiation signature genes derived from previous PCR-based studies. We overlaid a protein-protein interaction network on these genes/proteins to predict additional ionizing radiation-responsive genes using a guilt-by-association technique. Validating these molecular markers, which could represent a radiation-induction signature, and assessing their robustness is essential. With this objective in mind, we mapped the predicted genes to biological pathways using the KEGG, GO, and Pathway Commons databases. A number of the ionizing radiation-responsive genes that we predicted are associated with previously known ionizing radiation-related biological processes, molecular functions, and cellular components, which reinforces the validity of our prediction. In addition, mapping the predicted genes to diseases revealed enrichment of cancers of different types, radiation-induced neoplasms, diseases related to chromosomal aberration, and genetic disorders of DNA repair. Furthermore, we conducted a two-fold cross-validation to assess the accuracy of the predictor. The area under the precision-recall curve (AUC) was approximately 0.75.
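
One simple form of guilt-by-association on an interaction network is neighbour voting, sketched below. This is an assumed variant chosen for illustration; the abstract does not say which scoring rule was used, and the gene labels are placeholders rather than real signature genes.

```python
def neighbour_vote(interactions, known):
    """Rank candidate genes by the fraction of their partners that are known.

    interactions: iterable of (gene, gene) pairs from an interaction network.
    known: set of established signature genes (seed set).
    Returns a list of (candidate, score) sorted by descending score.
    """
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    scores = {}
    for gene, linked in partners.items():
        if gene in known:
            continue  # only score genes that are not already in the seed set
        scores[gene] = len(linked & known) / len(linked)
    # Sort by score, breaking ties alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy_network = [("SEED1", "CAND_A"), ("SEED2", "CAND_A"),
                   ("SEED1", "CAND_B"), ("CAND_B", "CAND_C")]
    for gene, score in neighbour_vote(toy_network, {"SEED1", "SEED2"}):
        print(gene, score)
```

Holding out part of the seed set and checking where the held-out genes land in the ranking is one way to obtain the kind of cross-validated precision-recall estimate the abstract reports.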

Taken together, we used a computational approach to predict potential ionizing radiation-responsive genes by leveraging existing radiation signature genes and protein interaction networks. These predictions may be used in discovery of potential transcriptional biomarkers and future experimental prioritization and validation.

Neptune: Signature Discovery Software
Download

Date: TBA
Room: TBA

  • Eric Marinier, Public Health Agency of Canada, Canada
  • Rahat Zaheer, Public Health Agency of Canada, Canada
  • Chrystal Berry, Public Health Agency of Canada, Canada
  • Kelly Weedmark, Public Health Agency of Canada, Canada
  • Michael Domaratzki, University of Manitoba, Canada
  • Philip Mabon, Public Health Agency of Canada, Canada
  • Natalie Knox, Public Health Agency of Canada, Canada
  • Aleisha Reimer, Public Health Agency of Canada, Canada
  • Morag Graham, Public Health Agency of Canada, Canada
  • The Lids-Ng Consortium, The LiDS-NG Consortium, Canada
  • Gary Van Domselaar, Public Health Agency of Canada, Canada

Presentation Overview:

An important component of public health response is rapid characterization of infectious agents, including the discovery of discriminatory sequences that may be leveraged to uniquely delineate a group of organisms, such as isolates associated with a disease cluster. These discriminatory sequences may be useful for further investigation of bacterial association with virulence, or assist in the development of rapid diagnostic assays for identification of bacterial isolates. The volume of available, high-throughput, next generation sequence data has necessitated the use of "big data" computational approaches for effective, real-time, comprehensive outbreak investigation and response.

We present new software that locates genomic signatures using an exact k-mer matching strategy while accommodating sequence mismatches. The software identifies sequences that are sufficiently represented within a group of interest and sufficiently absent from a background group. These groups may be provided by the user and specified dynamically. The signature discovery process is accomplished using probabilistic models instead of heuristic strategies.
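
The inclusion/exclusion idea can be illustrated with a deliberately simplified sketch. It uses exact k-mer matching with fixed thresholds, whereas the abstract describes tolerating mismatches and using probabilistic models, so this is not Neptune's algorithm. All function names, parameters and toy sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def candidate_signatures(inclusion, exclusion, k,
                         min_inclusion=1.0, max_exclusion=0.0):
    """Find k-mers common in the inclusion group and rare in the background.

    inclusion, exclusion: lists of genome strings.
    min_inclusion: minimum fraction of inclusion genomes containing the k-mer.
    max_exclusion: maximum fraction of exclusion genomes containing it.
    """
    inclusion_sets = [kmers(genome, k) for genome in inclusion]
    exclusion_sets = [kmers(genome, k) for genome in exclusion]
    candidates = set().union(*inclusion_sets)

    selected = []
    for kmer in sorted(candidates):
        in_fraction = sum(kmer in s for s in inclusion_sets) / len(inclusion_sets)
        if exclusion_sets:
            ex_fraction = sum(kmer in s for s in exclusion_sets) / len(exclusion_sets)
        else:
            ex_fraction = 0.0  # no background supplied
        if in_fraction >= min_inclusion and ex_fraction <= max_exclusion:
            selected.append(kmer)
    return selected


if __name__ == "__main__":
    print(candidate_signatures(["ACGTAC", "TTACGT"], ["ACGGGG"], k=3))
```

Relaxing `min_inclusion` below 1.0 or raising `max_exclusion` above 0.0 gives a crude stand-in for tolerance to sequence variation within groups.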

We have evaluated Neptune on Listeria monocytogenes and Escherichia coli genome data sets and found that signatures identified from these experiments are sensitive and specific to their respective data sets. In addition, the identified sequences provide a catalogue of differential loci for further investigation of group-specific traits. Neptune has broad implications in bacterial characterization for public health applications due to its efficient ad hoc signature discovery based upon user-specified differential genomics and scalability with analyses of large bacterial populations.

Neptune is freely available as open-source software and as a module within the Galaxy platform.

The Affinity Data Bank for biophysical analysis of regulatory sequences
Download

Date: TBA
Room: TBA

  • Todd Riley, University of Massachusetts Boston, United States
  • Cory Colaneri, University of Massachusetts Boston, United States
  • Brandon Phan, University of Massachusetts Boston, United States
  • Aadish Shah, University of Massachusetts Boston, United States
  • Pritesh Patel, University of Massachusetts Boston, United States

Presentation Overview:


We present the Affinity Data Bank (ADB), a suite of tools that helps biologists investigate in depth the sequence-specific binding properties of a transcription factor (TF) or an RNA-binding protein (RBP), and study subtle differences in specificity between homologous nucleic acid-binding proteins. Integrated with Pfam, the PDB, and the UCSC database, the ADB allows simultaneous interrogation of protein-DNA and protein-RNA specificity and structure in order to find the biochemical basis for differences in specificity across protein families. The ADB also includes a biophysical genome browser for quantitative annotation of levels of binding, using free protein concentrations to model the non-linear saturation effect that relates binding occupancy to binding affinity. The biophysical browser also integrates dbSNP and other polymorphism data in order to depict changes in affinity due to genetic polymorphisms, which can aid in finding both functional SNPs and functional binding sites. Lastly, the biophysical browser supports biophysical positional priors that quantify the level of locus-specific accessibility that a protein has to the DNA. With the inclusion of these occupancy-based and affinity-based positional priors, the ADB can model in vivo protein-DNA binding by integrating the effects of chromatin accessibility and epigenetic marks. Importantly, the toolset does not require bioinformatics programming knowledge, which makes the ADB tool suite useful for a wide range of researchers.
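
The saturation effect mentioned above can be illustrated with the standard single-site equilibrium binding relation, occupancy = [P] / ([P] + Kd), where [P] is the free protein concentration and Kd the dissociation constant of the site. The Python sketch below assumes this textbook model; the abstract does not state which formulation the ADB uses, and the function names are chosen for this example only.

```python
def occupancy(free_protein_conc, kd):
    """Fractional occupancy of a single site at equilibrium.

    Assumes the textbook single-site model: [P] / ([P] + Kd).
    Both arguments must be in the same concentration units.
    """
    if free_protein_conc < 0 or kd <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_protein_conc / (free_protein_conc + kd)


def occupancy_change(free_protein_conc, kd_reference, kd_variant):
    """Change in occupancy when a variant alters the site's Kd."""
    return (occupancy(free_protein_conc, kd_variant)
            - occupancy(free_protein_conc, kd_reference))
```

Under this model a site is half occupied when the free protein concentration equals Kd, and occupancy approaches 1 as the concentration grows, so a polymorphism that weakens affinity changes occupancy most when the concentration is near Kd and very little at saturating concentrations.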

Mining for New Antimicrobials: Predicting Bacteriocin Gene Blocks
Download

Date: TBA
Room: TBA

  • James Morton, University of California, San Diego, United States
  • Stefan Freed, University of Notre Dame, United States
  • Md Nafiz Hamid, Iowa State University, United States
  • Shaun Lee, University of Notre Dame, United States
  • Iddo Friedberg, Iowa State University, United States

Presentation Overview:

Bacteriocins are peptide-derived molecules produced by bacteria. Beyond their better-known role as antibiotics, their recently discovered functions include acting as virulence factors and signaling molecules. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are highly diverse and widely distributed among bacterial species. Because of this vast sequence and structural diversity, many tools struggle to identify novel bacteriocins. Many bacteriocins undergo post-translational processing or modifications necessary for the biosynthesis of the final mature form. Enzymatic modification of bacteriocins, as well as their export, is achieved by proteins whose genes are often located in a discrete gene cluster proximal to the bacteriocin precursor gene; we refer to these as context genes in this study. Although bacteriocins themselves are structurally diverse, context genes have been shown to be largely conserved across unrelated species.

Using this knowledge, we set out to identify new candidates for context genes, which may clarify how bacteriocins are synthesized, and to identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon and gene block Associator (BOA), that can identify homologous bacteriocin-associated gene blocks and predict novel ones. BOA generates profile Hidden Markov Models from the clusters of bacteriocin context genes, and uses them to identify novel bacteriocin gene blocks and operons.
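
The notion of context genes lying proximal to a bacteriocin precursor gene can be sketched as a simple coordinate query. The Python example below is not BOA's method, which relies on profile Hidden Markov Models; it only illustrates collecting genes within a fixed distance of an anchor gene. The `Gene` record, the `genes_near` function, and the `max_distance` parameter are hypothetical, and all genes are assumed to lie on the same contig.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # start coordinate in bp
    end: int    # end coordinate in bp, with end >= start


def genes_near(anchor: Gene, genes: List[Gene], max_distance: int) -> List[Gene]:
    """Return genes whose nearest edge is within max_distance bp of the anchor.

    Assumes every gene is on the same contig as the anchor.
    """
    nearby = []
    for gene in genes:
        if gene == anchor:
            continue
        # Gap between the two intervals; 0 when they overlap.
        gap = max(gene.start - anchor.end, anchor.start - gene.end, 0)
        if gap <= max_distance:
            nearby.append(gene)
    return nearby
```

Given a putative bacteriocin gene and the annotated genes of its contig, `genes_near(bacteriocin, genes, max_distance=10000)` returns the candidate neighbors, preserving their input order; the 10 kb window is an arbitrary illustrative value.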
Results and conclusions

We provide a novel dataset of predicted bacteriocins and context genes. We also discover that several phyla have a strong preference for bacteriocin genes, suggesting distinct functions for this group of molecules.

Software Availability: https://github.com/idoerg/BOA

Discovering regulatory elements in co-expressed genes
Download

Date: TBA
Room: TBA

  • Yichao Li, Ohio University, United States
  • Rami Al-Ouran, Ohio University, United States
  • Lonnie Welch, Ohio University, United States

Presentation Overview:

Motif discovery is an important step in understanding gene regulation. Both microarray and next-generation sequencing techniques now provide the opportunity to reverse engineer the genetic regulatory code. However, the performance of existing motif discovery algorithms remains poor due to the high heterogeneity within co-expressed genes.
In this study, we provide a new motif discovery pipeline to analyze co-expressed genes with high heterogeneity. The pipeline includes a convolution-and-pooling motif discovery ensemble and sequence clustering based on motif content. We demonstrate the usefulness of our pipeline in two case studies: 1. Infective stage-specific gene set in Brugia malayi; 2. Myc-induced over-expressed gene set in the Homo sapiens MCF10A cell line. In both cases, known motifs can be found. For the ENCODE cell line, DNA methylation information was used as a prior in motif discovery, and TF ChIP-seq information was used to determine putative TF-TF interactions and promoter-enhancer interactions.

A Metagenomic perspective on the structure and function of Lake Microbial Communities exposed to toxic metal traces in short term evolution
Download

Date: TBA
Room: TBA

  • Bachar Cheaib, Laval University, Canada
  • Malo Le Boulch, Laval University, Canada
  • Pierre-Luc Mercier, Laval University, Canada
  • Nicolas Derome, Laval University, Canada

Presentation Overview:

Metals and metalloids released by industrial mining and smelting activities are major pollutants of soil and water ecosystems. Toxic metal traces affect all life forms, including microbes. Consequently, metal contamination shapes microbial meta-community structure and function by accelerating evolutionary and adaptive processes. Few studies have investigated the impact of a metallic cocktail gradient on natural microbial communities. In this research, we studied five lakes located in a mining area in western Quebec. Among them, three are interconnected and lie along a metal contamination gradient (Cadmium, Copper, Lead, and Mercury) caused by historic mining activities (< 60 years of exposure). Two landlocked and distant lakes were used as positive and negative controls. Using a metagenomic shotgun sequencing approach (Illumina HiSeq), we generated 30 Tbp of data from five lake samples. We used a comparative metagenomic strategy to explore shifts and thresholds for community disturbance. We found parallel shifts in taxonomic abundance (Actinobacteria, filamentous Cyanobacteria, Proteobacteria) between sites. This disturbance of meta-community structure was also observed at the functional and metabolic levels, as a decrease in the abundance of subsystems implicated in the nitrogen cycle and photosynthesis and, conversely, an increase in virulence and stress response functions. The differential abundance of metabolic components suggests metabolic erosion along the five sites. Furthermore, a decoupling of taxonomy and function may suggest that horizontal gene transfer (HGT) events mitigated taxonomic erosion. Ongoing work will provide insights into meta-community plasticity in response to environmental pressure, revealing thresholds between transient and permanent shifts in community composition and genetic repertoire.