GLBIO/CCBC 2016

Talks

Modeling drug repositioning by leveraging drug-target-disease association datasets: A case study of Ebola virus disease

Date: TBA
Room: TBA

Gaston Mazandu, University of Cape Town, South Africa
Kayleigh Rutherford, University of Cape Town, South Africa
Elsa-Gayle Zekeng, University of Liverpool, United Kingdom
Emile Chimusa, University of Cape Town, South Africa
Nicola Mulder, University of Cape Town, South Africa

Presentation Overview:

The world's population currently faces several public health challenges, including a growing prevalence of infections and the emergence of new pathogenic organisms. The cost and risk associated with the drug development process make the development of drugs for several diseases, especially orphan or rare diseases, unappealing to the pharmaceutical industry. A potential strategy to address the challenges of novel drug development is drug repositioning, which consists of examining new uses for existing approved drugs. Here, we developed an integrative computational framework to predict possible drug repositioning for approved drugs by leveraging drug-target-disease associations, and functional and genomic data from public databases. We identified drug target enriched biological processes and similar diseases based on their biological processes and mapped diseases to new potential drugs using Gene Ontology (GO) semantic similarity scores. We assessed the performance of this model using the area under the Receiver Operating Characteristic (ROC) curve (AUC), precision and accuracy score as measures of discriminative power. The model performs well with an AUC score of 0.97684, 85% precision and 78% accuracy. Applying this model to Ebola virus disease (EVD), we were able to assess associations between human proteins associated with EVD and approved drug targets. The viral matrix protein VP40 showed evidence of playing a critical role in the viral life cycle inside the host, and we identified three putative protein targets in humans: DEAD (Asp-Glu-Ala-Asp) box polypeptide 58 (DDX58), Tumor Necrosis Factor (TNF) and Toll-Like Receptor 4 (TLR4). Mining the DrugBank database, we revealed potential drugs for EVD, including immune-related drugs, such as infliximab, and amebicide-related drugs.
There is potential for the model suggested to bridge the gap in the production of orphan disease therapies, offering a systematic, effective and reliable approach to predict new uses for existing drugs, and thereby harnessing their full therapeutic power.
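As a rough illustration of the three evaluation measures named in this abstract, the following sketch computes AUC, precision and accuracy for a toy set of scored drug-disease pairs. The labels, scores, the 0.5 threshold and the function names are all hypothetical and are not taken from the authors' framework.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_accuracy(labels, scores, threshold):
    """Precision and accuracy after calling score >= threshold a positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(labels)
    return precision, accuracy


# Hypothetical data: 1 = known drug-disease association, 0 = no association.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]

auc = roc_auc(labels, scores)
precision, accuracy = precision_accuracy(labels, scores, threshold=0.5)
print(auc, precision, accuracy)
```

AUC is threshold-free, whereas precision and accuracy depend on the cut-off chosen, which is why all three are usually reported together.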

Investigating the usefulness of long-read sequencing technologies on the study of large eukaryotic gene families

Date: TBA
Room: TBA

Armin Rouhi, University of Calgary, Canada
Janneke Wit, University of Calgary, Canada
James Wasmuth, University of Calgary, Canada

Presentation Overview:

Advances in genome sequencing have led to the sequencing of a wide range of species at an unprecedented pace. The majority of these genomes, however, remain in a draft state: fragmented and unfinished. While the per-base accuracy of the genomes is extremely high (>99%), the most important limiting factor to accurately assembling genomes is the read length. Next-generation sequencing (NGS) technologies have been limited to read lengths of up to 300 bp. The assemblies generated from these reads frequently misassemble tandemly arrayed genes and gene families with high copy number. Such gene families are often implicated in disease phenotypes and, for pathogenic organisms, interactions with their hosts. Therefore, it is important that they are correctly assembled.
A promising solution is the use of long-read sequencing (LRS) technologies. Three LRS technologies have recently become available: PacBio RS, Illumina Synthetic Long Reads (SLR), and the pocket-sized Oxford Nanopore MinION. These technologies are capable of producing read lengths of tens of thousands of bases. Published studies have reported success based on the length of the resultant assemblies and the ability to assemble transposable elements. However, there has not been a rigorous investigation of how well large gene families are resolved.
Here, we have examined the accuracy of de novo assembled C. elegans genomes using each of the three LRS technologies. We have generated new genome data using the MinION and compared it to the other LRS platforms. We show that assemblies generated using LRS platforms resolve regions containing gene families significantly better than exclusively short-read-based assemblies. We will present these findings and our assessment of which LRS technology is better suited to different scenarios.

Using phylogenetic instability to identify members of large gene families under adaptive evolution

Date: TBA
Room: TBA

David Curran, University of Calgary, Canada
John Gilleard, University of Calgary, Canada
James Wasmuth, University of Calgary, Canada

Presentation Overview:

Properly understanding large gene families is an important part of understanding how organisms are able to adapt to their environments, as gene duplication and subsequent sub-functionalization is one of the most rapid methods of phenotypic innovation. This presents as lineage-specific expansions and contractions of gene families, generating paralogues and inter-species copy number variants, a characteristic that has been termed ‘phylogenetic instability’. The phenomenon correlates well with direct environmental interactions, such as chemosensory receptors, immune responses to pathogens, and xenobiotic detoxification pathways. While phylogenetic instability has been used as a predictor of such functionality, there are no current methods to quantify it in large gene families. Here we present a novel algorithm, MIPhy, which solves this problem.
There are two aspects to the phylogenetic instability algorithm: clustering a phylogenetic tree into a set of meaningful sub-trees, and then quantifying the instability in those sub-trees. Algorithms do exist that could be used to quantify sub-trees, but some require detailed ancestral information that is only available for a small number of organisms, some do not incorporate the sub-tree branching information, and others use evolutionary models that we have found to be unsuitable. Ideally the clustering problem would be solved by protein interaction or biochemical reaction data, but this is rarely available. There are many ways to cluster a phylogenetic tree, but we have found existing methods to be inconsistent and overly sensitive to their arbitrary parameter values.
MIPhy finds the minimum number of events (gene duplication, gene loss, and incomplete lineage sorting) required to reconcile the observed gene tree with a given species tree. The clustering problem is solved by grouping the genes into homologous clusters such that the total number of genomic events, modified by the standard deviation of each cluster, is minimized. We have applied MIPhy to several large and complex gene families, including the cytochrome P450s. We were able to distinguish between genes with endogenous functions and those that detoxify xenobiotic drugs.
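To make the idea of reconciling a gene tree with a species tree concrete, here is a minimal sketch of the classic last-common-ancestor (LCA) mapping, which counts only the duplications implied by a gene tree. It is a simplification and not the MIPhy algorithm: losses, incomplete lineage sorting and the clustering step are omitted, and the nested-tuple tree encoding, the `species_of` naming convention and all function names are assumptions made for illustration.

```python
def clades(tree):
    """Return (leaf set, list of all clades) for a nested-tuple tree."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    all_clades = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        all_clades.extend(child_clades)
    all_clades.append(leaves)
    return leaves, all_clades


def lca(species_clades, species_set):
    """Smallest species-tree clade that contains every species in the set."""
    return min((c for c in species_clades if species_set <= c), key=len)


def count_duplications(gene_tree, species_tree, species_of):
    """Count gene-tree nodes that map to the same species node as a child."""
    _, sp_clades = clades(species_tree)
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(sp_clades, frozenset([species_of(node)]))
        child_maps = [visit(child) for child in node]
        here = lca(sp_clades, frozenset().union(*child_maps))
        # A node mapped to the same place as one of its children cannot be
        # explained by speciation alone, so it is counted as a duplication.
        if any(m == here for m in child_maps):
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


# Hypothetical example: genes are named "<species>_<copy>".
species_tree = (("A", "B"), "C")
gene_tree = ((("A_1", "B_1"), ("A_2", "B_2")), "C_1")
species_of = lambda gene: gene.split("_")[0]

print(count_duplications(gene_tree, species_tree, species_of))
```

In this toy gene tree the two (A, B) subtrees both map to the A-B ancestor, so their parent is inferred to be a single duplication that predates the A-B split.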

Efficient techniques for direct analysis of discrete-time population genetic models

Date: TBA
Room: TBA

Ivan Kryukov, University of Calgary, Canada
Bianca de Sanctis, University of Calgary, Canada
A.P. Jason de Koning, University of Calgary, Canada

Presentation Overview:

Diffusion theory approximations lie at the heart of most population genetic and phylogenetic methods. These include: 1) Estimators of allele age based on current population frequency; 2) Site-frequency spectrum methods for inferring demography and selection from population-level variation; and 3) Population-genetic models of molecular evolution. However, these approximations are only available for a subset of analytically tractable models, which are often unrealistic. For more complex models, the diffusion approximations need to be themselves approximated, leading to further potential error and computational difficulty.

We present a new software package, Wright-Fisher Exact Solver (WFES), which performs fast, scalable computations in population genetics without diffusion approximations or simulations. WFES employs rapid, parallel, sparse linear algebra techniques for a direct analysis of arbitrary discrete-time models. This approach solves for long-term properties, such as the probability and expected time to fixation or extinction. Importantly, it also allows for new, custom, and exact statistics to be easily implemented, creating a workbench for new ideas to be easily developed and explored. By taking advantage of efficient computational techniques, WFES is applicable to the analysis of populations ranging in size from humans to model organisms, with effective population sizes up to hundreds of thousands of individuals on typical computers. This substantially extends the applicability of direct Markov chain methods in population genetics, which have typically been limited to studying population sizes of only a few hundred individuals.
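The direct Markov chain analysis described here can be sketched on a very small scale as follows. This is a dense, pure-Python toy and not WFES itself, which the abstract describes as using parallel sparse linear algebra. The haploid selection scheme, the population size of 20 gene copies and all function names are assumptions chosen for illustration.

```python
from math import comb


def wright_fisher_matrix(n_copies, s=0.0):
    """P[i][j]: probability of moving from i to j copies in one generation."""
    P = []
    for i in range(n_copies + 1):
        x = i / n_copies
        # Sampling probability after selection with coefficient s (s=0: neutral).
        p = x * (1 + s) / (x * (1 + s) + (1 - x))
        P.append([comb(n_copies, j) * p**j * (1 - p) ** (n_copies - j)
                  for j in range(n_copies + 1)])
    return P


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def transient_system(n_copies, s=0.0):
    """Return P and (I - Q), where Q covers the transient states 1..n-1."""
    P = wright_fisher_matrix(n_copies, s)
    inner = range(1, n_copies)
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in inner] for i in inner]
    return P, A


def fixation_probabilities(n_copies, s=0.0):
    """Probability of fixation starting from 1..n-1 copies."""
    P, A = transient_system(n_copies, s)
    return solve(A, [P[i][n_copies] for i in range(1, n_copies)])


def absorption_times(n_copies, s=0.0):
    """Expected generations until fixation or extinction from 1..n-1 copies."""
    _, A = transient_system(n_copies, s)
    return solve(A, [1.0] * (n_copies - 1))


u = fixation_probabilities(20)
t = absorption_times(20)
print(u[0], t[0])  # neutral theory predicts u[0] = 1/20 for a single copy
```

Both quantities come from the same linear system in (I - Q), so new statistics only require a different right-hand side. The matrix grows with the square of the population size, which is why sparse and parallel solvers matter for realistic populations.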

Biological Interaction Networks in Bacteria - Generation and Evolutionary Insights

Date: TBA
Room: TBA

Cedoljub Bundalovic-Torma, Hospital for Sick Children, Canada
John Parkinson, University of Toronto, Canada

Presentation Overview:

Bacteria inhabit a diverse array of environments and form integral relationships with humans, which bear important consequences for health and disease. With the advent of next generation sequencing technology, the availability of bacterial genomes has exponentially increased, allowing us to study not only the emergence of bacterial pathogens and the pressing challenges to circumventing antibiotic resistance, but also the composition of the human microbiome and the role of bacteria in mediating complex disorders, such as autism and obesity, which are increasing in prevalence today. However, the majority of bacterial biology remains uncharted, which for the past decade has motivated the generation of numerous large-scale biological networks for the model bacterium E. coli.

In this talk I will present current work to address these challenges, which involves the construction of the first large-scale physical interaction network of the Escherichia coli cell envelope proteome generated by tandem-affinity purification. Graphical clustering enabled the elucidation of several important biological processes mediated by physical interactions of cell-envelope associated proteins. I will also illustrate how such data can be applied to identify biological processes that contribute to environmental adaptation in enterobacterial foodborne pathogens.

High-Performance and Exascale Computing Frontiers in Cancer Applications

Date: TBA
Room: TBA

Eric Stahlberg, Frederick National Laboratory for Cancer Research, United States
Carl McCabe, Center for Biomedical Informatics and Information Technology - National Cancer Institute, United States
George Zaki, Frederick National Laboratory for Cancer Research, United States

Presentation Overview:

The expanded use of high-performance computing (HPC) is transforming the future of cancer research and clinical applications. The recently announced collaboration between the National Cancer Institute and the US Department of Energy joins earlier initiatives, including the NCIP Cancer Cloud Pilots and the Genomic Data Commons, that help provide an advance look at a future where extreme-scale and exascale computing are used regularly in cancer research applications. The presentation will include discussions of critical lessons learned, research results when applying HPC to image processing and signal processing applications in RNA structure determination, and areas of future research activity related to the use of future computing technologies. The presentation will also include efforts underway to raise the level of computational and data science education within the cancer research community.

Xenolog Classification: Steps toward a Xenolog Conjecture

Date: TBA
Room: TBA

Charlotte Darby, Carnegie Mellon University, United States
Maureen Stolzer, Carnegie Mellon University, United States
Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

In his landmark review, Fitch (Trends Genet. 2000) defined xenology as "the relationship of any two homologous characters whose history, since their common ancestor, involves an interspecies (horizontal) transfer of the genetic material." However, the nomenclature currently available to describe homology relationships involving transfer remains ambiguous although many different evolutionary scenarios could be described as histories that involve interspecies transfer. Yet careful classification of horizontally transferred genes is essential for gaining insight into complex evolutionary processes and for homology-based gene function prediction.
We propose a classification scheme which provides a comprehensive nomenclature capturing the variety of xenologous relationships which can, and do, occur. We define formal rules that unambiguously assign gene pairs to xenolog classes. These rules are based on the topology of a reconciled gene and species tree, can be applied to a tree with an arbitrary number and arrangement of transfer events, and have been implemented in prototype software. Our scheme accounts for the inherent asymmetry of horizontal transfer, whether both genes are in the same species, the potential interaction of duplication and transfer events, and the order of divergences in the gene and species trees. We demonstrate the practical application of this classification by showing its correspondence to different functional properties.

Reactome Revolutions - Pathways and Networks

Date: TBA
Room: TBA

Robin Haw, Reactome - OICR, Canada

Presentation Overview:

Modern health initiatives and drug discovery are focused increasingly on targeting diseases that arise from perturbations in complex cellular events. Consequently, there has been a tremendous effort in biological research to elucidate the molecular mechanisms that underpin normal cellular processes. A reaction-network pathway knowledgebase is the tool of choice for assembling and visualizing the “parts list” of proteins and functional RNAs, as a foundation for understanding cellular processes, function and disease. The Reactome Knowledgebase (www.reactome.org) is a publicly accessible, open-access bioinformatics resource that stores full descriptions of human biological reactions, pathways and processes. Curated pathway knowledgebases, like Reactome, are uniquely powerful and flexible tools for extracting biologically and clinically useful information from the flood of genomic data. Our data model accommodates the annotation of disease processes, allowing us to represent the altered biological behaviour of mutant variants frequently found in cancer, and to describe the mode of action and specificity of drugs and therapeutics. Bio- and chemoinformaticians use Reactome to interpret high-throughput experimental datasets, to develop novel algorithms for data mining and visualization, and to build predictive models of normal and abnormal pathways. Specific features of Reactome support the visualization of interactions of many gene products in a complex biological process, and the application of bioinformatics tools to find causal patterns in genomic data sets. To maximize Reactome’s coverage of the genome, we have supplemented curated data with a conservative set of predicted functional interactions (FI), roughly doubling our coverage of the translated genome.
We have developed a Cytoscape app called “ReactomeFIViz”, which utilizes this FI network to assist biologists to perform pathway and network analysis to search for gene signatures from within gene expression data sets or identify significant genes within a list. Pathway and network-based tools for building and validating interaction networks derived from multiple data sets will give researchers substantial power to screen intrinsically noisy experimental data in order to uncover biologically relevant information.

Identification of novel genomic islands in Liverpool epidemic strain of Pseudomonas aeruginosa using segmentation and clustering

Date: TBA
Room: TBA

Rajeev Azad, University of North Texas, United States
Mehul Jani, University of North Texas, United States
Kalai Mathee, Florida International University, United States

Presentation Overview:

Pseudomonas aeruginosa is an opportunistic pathogen implicated in a myriad of infections, and a leading pathogen responsible for mortality in patients with cystic fibrosis (CF). Horizontal transfers of genes among the microorganisms living within CF patients have led to more morbid and multidrug-resistant strains, such as the Liverpool epidemic strain of Pseudomonas aeruginosa, namely the LESB58 strain, which has the propensity to acquire virulence and antibiotic resistance genes. Often these genes are acquired in large clusters, referred to as “genomic islands”. A genome mining tool based on a recursive segmentation and clustering procedure, “GEMINI”, was used to decipher novel genomic islands and understand their contributions to the evolution of virulence and antibiotic resistance in P. aeruginosa LESB58. GEMINI was validated on experimentally verified genomic islands in P. aeruginosa LESB58 before examining its potential to decipher novel islands. Of the 6062 genes in P. aeruginosa LESB58, 596 genes were identified as residing on 20 genomic islands, of which 12 had not been previously reported. Comparative genomics provided evidence in support of our novel predictions. GEMINI unravelled the mosaic structure of islands that are composed of segments of likely different evolutionary origins, and also demonstrated its ability to identify potential strain biomarkers. These newly found islands likely have contributed to the hyper-virulence and multidrug resistance of the Liverpool epidemic strain of P. aeruginosa.
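The abstract does not spell out how GEMINI segments a genome, so the following is only a generic sketch of recursive compositional segmentation. It uses the Jensen-Shannon divergence between the nucleotide compositions on either side of a candidate split, which is one common criterion for this kind of analysis. The fixed threshold, the minimum segment length and the function names are all assumptions, and the clustering step is not shown.

```python
from collections import Counter
from math import log2


def shannon_entropy(counts):
    """Shannon entropy (bits) of a symbol count table."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total)
                for c in counts.values() if c > 0)


def best_split(seq, min_len=5):
    """Position maximizing the Jensen-Shannon divergence of left vs right."""
    n = len(seq)
    whole = Counter(seq)
    h_whole = shannon_entropy(whole)
    left = Counter()
    best_pos, best_js = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        if i < min_len or n - i < min_len:
            continue
        right = whole - left
        js = (h_whole
              - (i / n) * shannon_entropy(left)
              - ((n - i) / n) * shannon_entropy(right))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js


def segment(seq, start=0, threshold=0.1, min_len=5):
    """Recursively split while the best split exceeds the threshold."""
    pos, js = best_split(seq, min_len)
    if pos is None or js < threshold:
        return [(start, start + len(seq))]
    return (segment(seq[:pos], start, threshold, min_len)
            + segment(seq[pos:], start + pos, threshold, min_len))


# Hypothetical toy "genome" with a compositionally distinct middle block.
toy = "A" * 20 + "G" * 20 + "A" * 20
print(segment(toy))
```

A practical method would normally replace the fixed threshold with a statistical significance test and use richer sequence models than single-nucleotide composition. Segments of similar composition would then be clustered, so that atypical clusters can be flagged as candidate islands.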

Computational methods to identify cancer-driver single nucleotide variants and large rearrangements in non-coding regions

Date: TBA
Room: TBA

Eric Minwei Liu, Weill Cornell Medical College, United States
Alexander Martinez Fundichely, Weill Cornell Medical College, United States
Priyanka Dhingra, Weill Cornell Medical College, United States
Andrea Sboner, Weill Cornell Medical College, United States
Ekta Khurana, Weill Cornell Medical College, United States

Presentation Overview:

Most variants obtained from whole-genome sequencing occur in non-coding regions of the genome. Although variants in protein-coding regions have received the majority of attention, numerous studies have now noted the importance of non-coding variants in cancer. Identification of functional non-coding variants that drive tumor growth remains a challenge and a bottleneck for the use of whole-genome sequencing in the clinic. We have developed two computational methods to identify non-coding cancer drivers. I will present the details of these methods and discuss the ongoing efforts to apply them to analyze ~2800 tumor whole-genomes in the ‘Pan-Cancer Analysis of Whole Genomes, PCAWG’ consortium. (1) For single nucleotide variants, we have developed CompositeDriver-SNV. This method integrates the signals of high functional impact of variants with the recurrence of variants across multiple tumor samples to identify the elements that show more and higher functional impact mutations than expected randomly. The functional impact is predicted using the FunSeq scheme that uses the properties of ENCODE elements (including conservation, transcription-factor (TF) motif disruption and network properties) within a weighted scoring scheme to predict deleteriousness of non-coding variants. (2) For large genomic rearrangements, we have developed CompositeDriver-SV. The structural variants analyzed can be of any type, e.g. deletions, duplications, translocations, etc. The non-parametric null model accounts for tumor-specific properties of the rearrangements. Using this approach, we are able to identify novel functional elements that are significantly rearranged in prostate cancer.

A Versatile Framework for Learning Feature-Based Protein-DNA Recognition Models Directly from SELEX Data

Date: TBA
Room: TBA

Chaitanya Rastogi, Columbia University, United States
Gabriella Martini, Columbia University, United States
H. Tomas Rube, Columbia University, United States
Harmen Bussemaker, Columbia University, United States

Presentation Overview:

SELEX-seq [1] and HT-SELEX [2,3] are sequencing-based methods for elucidating the intrinsic DNA binding specificity of transcription factor (TF) complexes at high resolution. While the amount of raw information that modern SELEX provides is unprecedented, the computational methods for building DNA recognition models (“motifs”) from these data are still far from mature. The standard is to tabulate the relative enrichment of each oligomer of a given length [4], for which we have developed efficient software [5]. Unfortunately, having to use oligomer tables as an intermediate step for feature-based analysis [6] has two key disadvantages: (i) limited range over which readout can be analyzed, as counts decrease exponentially with footprint size; and (ii) requirement for prior ad hoc sequence-based alignment of different oligomers. We present a new and highly versatile framework for motif discovery from SELEX data that overcomes these limitations. It uses a hierarchical maximum likelihood approach to fit a feature-based biophysically motivated protein-DNA recognition model directly to the raw SELEX data. First, this allows us to consider base and shape readout in more detail and over a larger footprint than was possible before, as we illustrate by reanalyzing Hox heterodimer data. Second, we can now for the first time analyze shape readout for TFs with low binding specificity, which we demonstrate using Hox monomer data. We find that shape readout by the Hox N-terminal arm is already seen for the monomer, but is altered by the presence of the Exd cofactor. Our method produces rich, biophysically interpretable models from only a single round of SELEX-seq data. Additionally, our flexible modeling framework should be easily extendable to other sequencing-based assays.

[1] M. Slattery, T.R. Riley, P. Liu, N. Abe, P. Gomez-Alcala, R. Rohs*, B. Honig*, H.J. Bussemaker*, R.S. Mann*. (2011) Cofactor Binding Evokes Latent Differences in DNA Binding Specificity between Hox proteins. Cell 147:1270-82.
[2] Y. Zhao, D. Granas, G.D. Stormo*. (2009) Inferring Binding Energies from Selected Binding Sites. PLoS Comput. Biol. 5(12): e1000590.
[3] A. Jolma, et al. (2013) DNA-Binding Specificities of Human Transcription Factors. Cell 152: 327-339.
[4] T.R. Riley, M. Slattery, N. Abe, C. Rastogi, D. Liu, R.S. Mann*, and H.J. Bussemaker*. (2014) SELEX-seq, a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol. Biol. 1196:255-78.
[5] http://bioconductor.org/packages/release/bioc/html/SELEX.html
[6] N. Abe, I. Dror, L. Yang, M. Slattery, T. Zhou, H.J. Bussemaker, R. Rohs*, R.S. Mann*. (2015) Deconvolving the Recognition of DNA Shape from Sequence. Cell 161:307-18.
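The oligomer tabulation that this abstract describes as the current standard can be sketched roughly as below. This is a simplified illustration and not the SELEX package referenced in [5]: it compares raw k-mer frequencies between an unselected library and a selected round, whereas real analyses model the initial library more carefully. The toy reads, the pseudocount and the function names are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def relative_enrichment(round0, round1, k, pseudocount=1.0):
    """Frequency ratio (selected / initial) per k-mer, scaled so max = 1."""
    c0 = kmer_counts(round0, k)
    c1 = kmer_counts(round1, k)
    kmers = set(c0) | set(c1)
    # Pseudocounts keep k-mers that are absent from one library finite.
    total0 = sum(c0.values()) + pseudocount * len(kmers)
    total1 = sum(c1.values()) + pseudocount * len(kmers)
    raw = {kmer: ((c1[kmer] + pseudocount) / total1)
                 / ((c0[kmer] + pseudocount) / total0)
           for kmer in kmers}
    best = max(raw.values())
    return {kmer: value / best for kmer, value in raw.items()}


# Hypothetical reads: initial library (round 0) and after one selection round.
round0 = ["ACGTTGCA", "TTGACCAT"]
round1 = ["TGATTGAT", "ATGATTGA"]

enrichment = relative_enrichment(round0, round1, k=4)
top = max(enrichment, key=enrichment.get)
print(top, enrichment[top])
```

The number of possible k-mers grows as 4^k while the read count stays fixed, so per-oligomer counts thin out quickly as k increases. This is the range limitation the abstract attributes to table-based approaches.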

How plastic are multidomain proteins? Quantifying domain co-evolution in primate genomes

Date: TBA
Room: TBA

Maureen Stolzer, Carnegie Mellon University, United States
Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

Multidomain proteins evolve via the insertion, duplication, and deletion of domains, sequence fragments that encode structural or functional modules. This process of domain shuffling allows for rapid exploration of functions by introducing new combinations of existing folds. Understanding the evolution of protein function requires an understanding of how domain architectures change over time, because gain, loss, or replacement of a domain can result in an immediate and dramatic change in protein interactions. It has been argued that domain pairs that co-occur are selectively favored combinations and once united tend to persist as a unit. This hypothesis is supported by empirical studies reporting that convergent formation of the same domain architecture is rare. However, the gain-loss parsimony methods typically used to reconstruct domain shuffling histories ignore sequence variation between domain instances and cannot recognize parallel gains or losses. If the same domain architectures were forming repeatedly, this pattern would not be discerned by gain-loss parsimony.

To test this hypothesis, we conducted a genome-scale analysis of co-occurring domain pairs in primates, using a novel, phylogeny-based approach that can distinguish between different domain instances and infer parallel events. Almost half of all domain pairs tested (45%) experienced at least two independent fusions. Even when only events with very strong statistical support (bootstrap > 90%) were considered, two or more independent fusions were inferred in 25% of the cases considered. Our results challenge the hypothesis that convergent formation of domain architectures is rare. Further, they lead us to question the extent to which our understanding of multidomain evolution is driven by the algorithms we use to study them and highlight the need for more powerful evolutionary algorithms. Finally, our results have practical consequences for homology-based function prediction, since they suggest that proteins with the same domain architecture may contain domains that are not orthologous.

PopNet: Revealing the Impact of Recombination on Population Structure

Date: TBA
Room: TBA

Javi Zhang, University of Toronto, Canada
Asis Khan, NIH, United States
Andrea Kennard, NIH/NIAID, United States
Michael E. Grigg, NIH, United States
John Parkinson, Hospital for Sick Children, Canada

Presentation Overview:

A central question in population genomics is how populations interact. Admixture and recombination between populations strongly impact their structures. In the context of public health, this may translate to the development and spread of virulence and resistance factors. Currently, new visualization methods are needed to study recombination. Existing methods, such as Neighbour-Net and Structure, offer limited ability to compare populations. In particular, while shared ancestry can be inferred, the genomic locations are not shown. Hence, the challenge lies in facilitating the identification of regions of shared ancestry between individuals.
To meet the challenge, we present PopNet, which combines the concept of chromosome painting with network visualization to illustrate regions of shared ancestry. We demonstrate the effectiveness of PopNet through its application to three diverse populations of Saccharomyces cerevisiae, Toxoplasma gondii, and Plasmodium falciparum, showing that S. cerevisiae lineages form around habitat or function, that North American T. gondii families share multiple regions of similarity, and that Asian P. falciparum populations show a higher degree of introgression compared to their African counterparts. PopNet’s novel visualization process offers a new framework for the analysis of recombination between populations.

Building and Sustaining Bioinformatics Community Diversity

Date: TBA
Room: TBA

Alexander Ropelewski, Pittsburgh Supercomputing Center, United States
Ricardo Gonzalez Mendez, University of Puerto Rico School of Medicine, United States
Jimmy Torres, School of Communications, University of Puerto Rico Rio Piedras, Puerto Rico
Hugh Nicholas, Pittsburgh Supercomputing Center, United States
Pallavi Ishwad, Pittsburgh Supercomputing Center, United States

Presentation Overview:

Recent discussions have focused on improving bioinformatics education by defining key biological, mathematical, and computational competencies required to be a bioinformatics practitioner. Keeping bioinformatics education up-to-date is a challenge on all campuses but especially so at Minority Serving Institutions (MSIs), where faculty are burdened with heavy teaching loads and are isolated from critical bioinformatics resources, such as sequencers and high performance computers.

Here, we describe our experience building bioinformatics competencies within the underrepresented minority community through the Pittsburgh Supercomputing Center’s Minority Access to Research Careers program. This program, begun in 2001, assists Minority Serving Institutions through broad-based educational and research programs oriented towards building and sustaining bioinformatics curricula and programs at partner MSIs. The emphasis of this program is based on the community’s documented training needs, enabling scientists to develop the prerequisite competencies needed to excel in bioinformatics.

We will discuss the barriers and challenges encountered by the program and interventions developed to address these barriers and challenges. We will conclude with general recommendations to improve diversity within the bioinformatics community based on the experiences of the program.

PIPE (Protein-protein Interaction Prediction Engine): A computational approach for comprehensive soybean functional genomics

Date: TBA
Room: TBA

Bahram Samanfar, AAFC-ORDC, Canada
Andrew Schoenrock, Carleton University, Canada
Frank Dehne, Carleton University, Canada
Ashkan Golshani, Carleton University, Canada
Elroy Cober, AAFC-ORDC, Canada
Martin Charette, AAFC-ORDC, Canada
Steve Molnar, AAFC-ORDC, Canada

Presentation Overview:

Protein-Protein Interactions (PPIs) are essential molecular interactions that define the biology of a cell, its development and responses to various stimuli. Theoretically (“guilt by association”), if a gene interacts with groups of genes involved in one specific pathway, that gene might also be involved in that specific pathway. Our knowledge of global PPI networks in complex organisms such as humans and plants is restricted by technical limitations of current methods. The Protein-protein Interaction Prediction Engine (PIPE) is a computational tool used to predict PPIs. PIPE has been used to produce proteome-wide, all-to-all predicted interactomes in a variety of organisms including yeast (Saccharomyces cerevisiae), human (Homo sapiens), Arabidopsis and others. PIPE can produce individual PPI predictions in a fraction of a second and is typically tuned, for a given organism, to achieve a specificity of 99.95%. PIPE has been independently evaluated and compared to other PPI prediction methods and has been shown to significantly outperform the others in terms of recall-precision across all of the datasets tested. It has also been shown that PIPE has the ability to produce cross-species predictions (i.e., to use interaction data from one organism to make high quality PPI predictions in another). Briefly, PIPE works by searching for recurring short polypeptide sequences between known interacting protein pairs; simply, it predicts interactions based on protein sequence information and a database of known interacting pairs. PIPE requires a set of known interacting protein pairs as well as their primary (amino acid) sequences to be able to make its predictions. PIPE is currently being redesigned to be able to computationally handle the large proteome of soybean (75,778 confirmed soybean protein sequences). We are now using PIPE to predict the first comprehensive protein-protein interaction network for soybean.
Soybean is one of the major Canadian grain crops and its production is expanding in Canada, with the majority of the increase in short season areas (Western Canada and northern regions). So far, eleven maturity loci have been reported in soybean; however, the molecular basis of almost half of them is not yet clear. The list of novel factors affecting these pathways in soybean, and in model plants like Arabidopsis, continues to grow, suggesting the presence of other novel players which are yet to be discovered. To this end, we have used three different approaches: bioinformatics (functional genomics), classical plant breeding and molecular biology (analysis of SSR and SNP haplotypes) to identify novel genes involved in flowering and maturity pathways in soybean. Identification of molecular markers tagging the PIPE-identified genes controlling flowering and maturity in soybean will allow soybean breeders to efficiently develop varieties using molecular marker assisted breeding. Allele specific markers will allow stacking of early maturity alleles to develop even earlier maturing cultivars. This bioinformatics approach will also help to bridge the gap in knowledge of the flowering and maturity pathway in soybean and can be applied to other important traits such as seed protein content, oil quality and host-pathogen interactions.

Mining for new antimicrobial agents: predicting bacteriocin gene blocks

Date: TBA
Room: TBA

James Morton, University of California, San Diego, United States
Stefan Freed, University of Notre Dame, United States
Md Nafiz Hamid, Iowa State University, United States
Shaun Lee, University of Notre Dame, United States
Iddo Friedberg, Iowa State University, United States

Presentation Overview:

Bacteriocins are peptide-derived molecules produced by bacteria, whose recently discovered functions include virulence factors and signaling molecules as well as their better known roles as antibiotics. To date, close to five hundred bacteriocins have been identified and classified. Recent discoveries have shown that bacteriocins are highly diverse and widely distributed among bacterial species. Because of this vast sequence and structural diversity, many tools struggle to identify novel bacteriocins. Many bacteriocins undergo post-translational processing or modifications necessary for the biosynthesis of the final mature form. Enzymatic modification of bacteriocins as well as their export is achieved by proteins whose genes, referred to as context genes in this study, are often located in a discrete gene cluster proximal to the bacteriocin precursor gene. Although bacteriocins themselves are structurally diverse, context genes have been shown to be largely conserved across unrelated species.

Using this knowledge, we set out to identify new candidates for context genes which may clarify how bacteriocins are synthesized, and identify new candidates for bacteriocins that bear no sequence similarity to known toxins. To achieve these goals, we have developed a software tool, Bacteriocin Operon and gene block Associator (BOA), that can identify homologous bacteriocin-associated gene blocks and predict novel ones. BOA generates profile Hidden Markov Models from the clusters of bacteriocin context genes, and uses them to identify novel bacteriocin gene blocks and operons.
Results and conclusions

We provide a novel dataset of predicted bacteriocins and context genes. We also discover that several phyla have a strong preference for bacteriocin genes, suggesting distinct functions for this group of molecules.

Software Availability: https://github.com/idoerg/BOA

Network-driven discovery and interpretation of cancer driver mutations

Date: TBA
Room: TBA

Jüri Reimand, Ontario Institute for Cancer Research, Canada

Presentation Overview:

Identifying driver mutations from cancer exome and genome sequencing data is essential to deciphering tumour biology and designing precision treatments. Information on pathways and molecular interaction networks can improve the interpretation of cancer mutations and help associate them with mechanisms and clinical information. We hypothesize that many cancer driver mutations precisely modify interfaces encoded in small sites in proteins and DNA, leading to interaction losses and gains in networks. We developed computational strategies to find network-associated driver mutations and infer their impact on network topology. The mutation enrichment model ActiveDriver detects proteins with site-specific positive selection, and the machine learning method MIMP infers mutations that rewire kinase signalling networks by erasing existing phosphorylation sites and creating new phosphorylation sites in substrate proteins. We conducted pan-cancer analyses of post-translational modification (PTM) networks and showed their enrichment in known driver mutations, increased functional impact, frequent rewiring of network topology, and associations with clinical characteristics. We also studied PTM networks in the human population and found that inter-individual genome variation is significantly reduced in PTM sites while inherited disease mutations are significantly enriched. This emphasizes the importance of network-related variation in human physiology and cancer. We extended our approaches to full cancer genomes and investigated site-specific mutations in gene promoters, enhancers, and transcription regulatory networks, using data from the International Cancer Genome Consortium and the Roadmap Epigenomics project. Our network-centric approaches provide novel interpretations of known cancer mutations, help find new cancer drivers and risk modifier alleles, and characterise their biological mechanisms.

Detecting a novel signature of non-coding, regulatory alterations in cancer genomes

Date: TBA
Room: TBA

Kyle Smith, University of Colorado, United States
Vinod Yadav, University of Colorado, United States
Subhajyoti De, Rutgers Cancer Institute of New Jersey, United States

Presentation Overview:

Oncogenic mutations outside protein-coding regions remain largely unexplored. Analyses of the TERT locus have indicated that non-coding regulatory mutations can be more frequent than previously suspected and play important roles in oncogenesis. So far, only a limited number of studies are under way to identify recurrent mutations in promoters of known genes. Yet functional mutations need not always be recurrent at the same base position (e.g. TP53 mutations are distributed throughout the gene), and non-coding mutations are no exception. Recurrence-based detection methods are not designed to detect these alternative mutation signatures. The signature of accelerated somatic evolution (SASE) is one such novel mutation signature in non-coding regions that we recently reported. Genomic regions under accelerated evolution are those that accumulate an excess of mutations compared to that expected from the background mutation rate. In mammalian evolution, human accelerated regions (HARs; regions that acquired significantly more substitutions than expected after divergence from the common ancestor with chimpanzees) were frequently found to have regulatory functions contributing to human-specific attributes. We applied the concept to cancer and developed a computational method, SASE-hunter, to identify SASE in a genomic locus, and prioritized those loci that carried the signature in multiple cancer patients. Interestingly, even when an affected locus carried the signature in multiple individuals, the mutations contributing to SASE themselves were not necessarily recurrent at base-pair resolution. In a pan-cancer analysis of 12 tumor types, we detected SASE in the promoters of known cancer genes such as MYC, BCL2, RBM5 and WWOX. SASEs in selected cancer gene promoters were associated with over-expression, and also correlated with the age of onset of cancer, aggressiveness of the disease and survival.
Taken together, our work detects a hitherto under-appreciated and clinically important class of regulatory changes in cancer genomes.
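The notion of a region accumulating an excess of mutations relative to the background rate can be made concrete with a simple tail-probability calculation. The sketch below is an illustration under a Poisson assumption and is not the SASE-hunter method; the function names and the example rate are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def excess_mutation_pvalue(observed, region_length, background_rate):
    """p-value for seeing at least `observed` mutations in a region.

    Assumes mutations arrive independently at `background_rate` per base,
    so the expected count is region_length * background_rate.
    """
    expected = region_length * background_rate
    return poisson_upper_tail(observed, expected)


# Hypothetical promoter: 1,000 bases, 2 mutations expected, 8 observed.
print(excess_mutation_pvalue(8, 1000, 0.002))
```

Because the test looks only at the total count in the region, it can flag a locus even when no single base position is mutated recurrently.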

Dynamic and integrative biological network research of aging

Date: TBA
Room: TBA

Fazle Faisal, University of Notre Dame, United States
Yuriy Hulovatyy, University of Notre Dame, United States
Huili Chen, University of Notre Dame, United States
Tijana Milenkovic, University of Notre Dame, United States

Presentation Overview:

ABSTRACT

The world is on average growing older, with people over 60 years of age representing 11% of the global population. Because of this, and because susceptibility to diseases increases with age, studying molecular causes of aging continues to gain importance. However, human aging is hard to study experimentally due to long lifespan as well as ethical constraints. Therefore, human aging-related knowledge needs to be inferred computationally. Computational analyses of gene expression or genomic sequence data, which have been indispensable for investigating human aging, are limited to studying genes (or their protein products) in isolation, ignoring their cellular interconnectivities. But proteins do not function in isolation; instead, they carry out cellular processes by interacting with other proteins. And this is exactly what biological networks, such as protein-protein interaction (PPI) networks, model. Thus, analyzing topologies of proteins in PPI networks could contribute to our understanding of the processes of aging.

The majority of the current methods for analyzing systems-level PPI networks deal with their static representations, due to limitations of biotechnologies for PPI collection, even though cellular functioning is dynamic. For this reason, and because different data types can give complementary biological insights, we integrate current static PPI network data with aging-related gene expression data to computationally infer dynamic, age-specific PPI networks. Then, we apply a series of sensitive measures of network topology to the dynamic PPI network data to study cellular changes with age. For example, we apply a graphlet-based measure of local network position (or centrality) of a node; graphlets are small connected induced subgraphs. By doing so, we find that while global PPI network topologies do not significantly change with age, local topologies (i.e., network centralities) of a number of genes do. We predict such genes to be key players in the processes of aging [1]. We demonstrate the credibility of our predictions by: 1) observing significant overlap between our predicted aging-related genes and known "ground truth" aging-related genes; 2) observing significant overlap between functions and diseases that are enriched in our aging-related predictions and those that are enriched in the "ground truth" data; 3) providing evidence that diseases which are enriched in our aging-related predictions are linked to human aging; and 4) validating our high-scoring novel predictions in the literature.
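To illustrate what a graphlet-based measure of node position looks like, the sketch below counts, for every node, how often it occupies each position in the two 3-node graphlets (the induced path and the triangle). It is a minimal illustration restricted to 3-node graphlets; the function name and the toy network are assumptions, not the measures used in the cited work.

```python
from itertools import combinations


def three_node_orbit_counts(adj):
    """Count each node's positions in 3-node graphlets.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    Returns {node: (path_end, path_centre, triangle)}.
    """
    counts = {}
    for v, nbrs in adj.items():
        # Triangles: pairs of neighbours of v that are themselves connected.
        triangle = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        # Induced paths centred on v: neighbour pairs that are NOT connected.
        centre = len(nbrs) * (len(nbrs) - 1) // 2 - triangle
        # v is an end of an induced path v-u-w when w neighbours u but not v.
        end = sum(1 for u in nbrs for w in adj[u] if w != v and w not in nbrs)
        counts[v] = (end, centre, triangle)
    return counts


# Toy network: a triangle A-B-C with a pendant node D attached to C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(three_node_orbit_counts(toy))
```

Comparing such count vectors for the same protein across age-specific networks is one way to detect a change in local topology even when global network statistics stay constant.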

In the above work, we study network (e.g., graphlet-based) positions of a node in each individual (static) age-specific PPI network "snapshot" and then simply consider time series of the results. In the process, we still overlook likely important relationships between the different snapshots. To capture the inter-snapshot relationships explicitly, we take the well-established and proven ideas behind static graphlets to the next level to develop novel theory of dynamic graphlets that are needed to allow for truly dynamic analysis of the age-specific PPI networks [2]. When we apply the dynamic graphlet approach to study human aging (just as described above), this approach further improves upon our previous work in terms of the quality of aging-related predictions. Namely, our new predictions lead to better overlap with "ground truth" aging-related data as well as to more aging-relevant functional and disease enrichments. Importantly, our new approach unveils novel knowledge about human aging with high (e.g., literature) validation accuracy, thus complementing the existing aging-related knowledge.

REFERENCES

1. Faisal F.E. and Milenković T. 2014. Dynamic networks reveal key players in aging. Bioinformatics 30(12):1721-29.
2. Hulovatyy Y., Chen H. and Milenković T. 2015. Exploring the structure and function of temporal networks with dynamic graphlets. Bioinformatics 31(12):i171-180.

Survey of the Heritability and Sparsity of Gene Expression Traits Across Human Tissues

Date: TBA
Room: TBA

Heather Wheeler, Loyola University Chicago, United States
Kaanan Shah, University of Chicago, United States
Jonathon Brenner, Loyola University Chicago, United States
Hae Kyung Im, University of Chicago, United States

Presentation Overview:

Regulatory variation plays a key role in the genetics of complex traits as demonstrated by the consistent enrichment of expression quantitative trait loci (eQTLs) among trait-associated variants. Thus, understanding the genetic architecture of gene expression traits within and across tissues will help elucidate the underlying mechanisms of complex traits. We present a systematic survey of the heritability (h2) and the distribution of variant effect sizes on gene expression across the human body. Using RNA-seq data from a comprehensive set of tissue samples generated by the Genotype-Tissue Expression (GTEx) Project and the Depression Genes and Networks (DGN) whole blood cohort, we find that local h2 (contribution of SNPs within 1Mb of the gene) can be relatively well characterized with 50% of expressed genes showing significant h2 in DGN and 8-19% in GTEx. However, the current sample sizes (n = 922 in DGN and n < 362 in each of the 40 GTEx tissues) only allow us to compute distal h2 for a handful of genes (3% in DGN and < 1% in GTEx). Thus, here we focus on local regulation. Bayesian Sparse Linear Mixed Model (BSLMM) analysis provides strong evidence that the local architecture of gene expression traits is sparse rather than polygenic across DGN and all 40 GTEx tissues examined. This result is further confirmed by the sparsity of the best-performing gene expression predictors obtained via elastic net modeling. To further explore the tissue context specificity, we use a mixed-effects model to decompose the expression traits into cross-tissue and tissue-specific components. Heritability and sparsity estimates of these derived expression phenotypes show similar characteristics to the original traits. The local h2 estimates of the cross-tissue phenotype have larger magnitude and lower standard errors compared to single tissue estimates due to the borrowing of information across all samples.
Consistent properties relative to prior GTEx multi-tissue results suggest that these derived traits reflect the expected biology. We apply this knowledge to develop prediction models of gene expression traits for all tissues. The prediction models, heritability, and prediction performance R2 for original and decomposed expression phenotypes are made publicly available for use in our gene-level association method, PrediXcan (https://github.com/hakyimlab/PrediXcan).

Decoding compound mechanism of action using integrative pharmacogenomics

Date: TBA
Room: TBA

Nehme El-Hachem, Institut de recherches cliniques de Montreal, Canada

Presentation Overview:

Nehme El-Hachem1,2,*, Deena M.A. Gendoo3,4,*, Laleh Soltan Ghoraie3,4, Zhaleh Safikhani3,4, Petr Smirnov3, Ruth Isserlin5, Jacques Archambault6, Gary Bader5,6, Anna Goldenberg8,9, Benjamin Haibe-Kains3,4,8
1 Integrative Computational Systems Biology, Institut de Recherches Cliniques de Montréal, Montreal, Quebec, Canada
2 Department of Biomedical Sciences. Université de Montréal, Montreal, Quebec, Canada
3 Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
4 Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
5 The Donnelly Centre, Toronto, Ontario, Canada
6 Laboratory of Molecular Virology, Institut de Recherches Cliniques de Montréal
7 The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
8 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
9 Hospital for Sick Children, Toronto, Ontario, Canada

For decades, the “one drug-one target-one disease” paradigm dictated much of the drug development process. However, in the past ten years, tremendous advances in transcriptomics and genomics research shifted this simplistic view of a drug mechanism of action (MoA) to a more complex systems pharmacology paradigm where a drug can bind to several targets.
Several computational strategies have been proposed to elucidate the mechanism of action for existing and newly developed drug-like compounds. Traditional approaches predicted new drug-target associations based on the chemical similarity of corresponding ligands or side effects of approved drugs. Recent bioinformatic approaches built drug-drug networks from drug-induced transcriptional profiles and inferred new mechanisms of action. However, current drug taxonomies either rely on information that is difficult to gather for new compounds (e.g., side effects) or predict drug target(s) and MoA inaccurately. There is therefore a pressing need to leverage the increasing amount of pharmacogenomic data in order to improve drug taxonomy by better characterizing drug targets without relying on prior knowledge such as therapeutic indications or side effects.
In our study, we integrated different layers of information from recent large-scale pharmacogenomic datasets in order to infer new MoA for chemical compounds extracted from cancer screens from three data layers: (i) drug structural similarity, (ii) drug perturbation transcriptomics profiles from the LINCS database; and (iii) drug sensitivity profiling assays from cancer cell lines (CTRPv2). We used our recently published Similarity Network Fusion algorithm to efficiently integrate these three data layers into a single, integrative drug taxonomy called Drug Network Fusion (DNF). We found that DNF outperformed drug taxonomies based on single data layers for both drug target prediction (DNF concordance index = 0.89 vs. 0.71, 0.83, 0.64 for structure, sensitivity and perturbation layers, respectively) and ATC classification (DNF concordance index = 0.77 vs. 0.72, 0.58, 0.54 for structure, sensitivity and perturbation, respectively).
We classified correctly almost all kinase inhibitors and inferred new mechanisms for other undescribed compounds. Our innovative computational framework highlights the importance of integrating complementary data layers concerning drugs such as chemical, transcriptional and sensitivity profiles. DNF can be easily extended to more compounds or data layers and as such, constitutes a valuable resource to the cancer research community by providing new hypotheses on the compound MoA and potential insights for drug repurposing.
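The concordance index used above to compare taxonomies can be read as the fraction of comparable pairs that a score ranks in the right order. The sketch below shows the generic definition for binary labels; how the index was actually computed for DNF is not stated in the abstract, so the function name and the example values are assumptions.

```python
def concordance_index(scores, labels):
    """Fraction of comparable pairs ranked consistently with their labels.

    Two items are comparable when their labels differ; ties in score count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # same label: the pair carries no ranking information
            comparable += 1
            hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
            if scores[hi] > scores[lo]:
                concordant += 1.0
            elif scores[hi] == scores[lo]:
                concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical drug-pair similarities and whether each pair shares a target.
similarity = [0.9, 0.2, 0.8, 0.1]
shares_target = [1, 1, 0, 0]
print(concordance_index(similarity, shares_target))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values should be read.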

A NOVEL ATOMISTIC MOTIONAL CORRELATION METHOD COMBINED WITH THERMODYNAMICS TO DELINEATE THE INTRICATE MECHANISM OF SUBSTRATE SPECIFIC CATALYSIS: ENZYME ENGINEERING PERSPECTIVE

Date: TBA
Room: TBA

Devashish Das, Polyclone Bioservices, India
Pravin Kumar, Polyclone Bioservices, India
Naveen Kulkarni, Polyclone Bioservices, India
Anurag Kumar, Polyclone Bioservices, India

Presentation Overview:

Enzymes are powerful and highly specific catalysts, both in the reactions that they catalyze and in their choice of reactants (1). Enzymes show this partiality towards a substrate through a precise mechanism. The precision of this mechanism is governed by a well-connected network of residues in and around the catalytic site, in terms of their motions and consequential thermodynamics. In this study we have designed a novel atomistic motional (AM) correlation method which measures the distance and direction of the atoms in motion. After discretizing these variables, 41 different combinations (AM alphabets) of displacement and direction of the motion of a single atom were calculated over a wide range of molecular motions derived from MD simulations. This is the maximum reported number of measurements, quantifying motions from high-frequency harmonic oscillations to slow functional conformational transitions with a higher level of sensitivity. The method was tested to delineate the mechanism of the Michaelis complexes of Penicillin G acylase (PGA) with Penicillin G (native reaction) and of PGA with PGSO (slow reaction). The correlation of the AM alphabets of a pair of atoms (i, j) was calculated as a normalized mutual information (MI; $I_{LL}^{n}$), $nI(C_i, C_j) = \frac{I(C_i, C_j) - \varepsilon(C_i, C_j)}{H(C_i, C_j)}$ (2), and this was summed to derive the per-residue correlation (PRC; CnMI), $\sum_{ij=1}^{n} I_{LL}^{n}$. The CnMI was used to generate network models and for clustering analysis after 0.25 μs of simulation of the Michaelis complexes. Results show a clear difference between the AM alphabets of the fast and the slow hydrolyzing enzymatic reactions (Fig1.A). Networks formed between the amino acids in the slow reaction are very different from those in the native reaction (Fig1.B & C); in particular, clusters 1 and 2, which show a close relation with the substrate in the native reaction, are completely decomposed in the slow reaction.
Further, CnMI was combined with a novel method that quantitatively weights atomic interactions (qWAI), in conjunction with high-throughput binding free energy calculations assigned to every amino acid (PRB). The three methods CnMI, qWAI and PRB, in isolation and in combination, revealed the precise mechanism of PGA pertaining to substrate specificity. Specifically, CnMI shows the decomposition of clusters 1 and 2 in the slow reaction. qWAI shows that the amide bond of PenG is stabilized by βSer1, βThr68, βGln23 and βAla69, whereas the same bond in PGSO is stabilized only by βSer1 and βAla69. Finally, the combined score shows that βGln23 and βPro22, which form part of the oxyanion hole, are strongly modulated in the slow reaction, resulting in the destabilization of the tetrahedral intermediate. The presentation will show the precise mechanism of substrate selectivity of PGA revealed by these three methods, together with insights for enzyme engineering.
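The normalized mutual information between the discretized motions of two atoms can be sketched as follows. This is an illustrative reading of the formula in the abstract, with the correction term ε passed in as a plain number (default 0) because its estimation is not described; the function names and the toy symbol sequences are assumptions.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in nats) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())


def normalized_mi(ci, cj, epsilon=0.0):
    """(I(Ci, Cj) - epsilon) / H(Ci, Cj) for two aligned symbol sequences.

    I is the mutual information and H the joint entropy; epsilon is a
    user-supplied correction term (hypothetical default of 0).
    """
    joint = list(zip(ci, cj))
    h_joint = entropy(joint)
    if h_joint == 0.0:
        return 0.0  # both atoms stay in a single state: nothing to correlate
    mi = entropy(ci) + entropy(cj) - h_joint
    return (mi - epsilon) / h_joint


# Toy AM-alphabet trajectories: perfectly coupled, then independent.
print(normalized_mi("ABABAB", "XYXYXY"))
print(normalized_mi("AABB", "XYXY"))
```

Dividing by the joint entropy bounds the uncorrected value between 0 and 1, so that pairs of atoms with different numbers of visited states remain comparable before summing per residue.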

References:
1. Berg JM, Tymoczko JL, Stryer L. Biochemistry. 5th edition. New York: W H Freeman; 2002.
2. Cover T. M., Thomas J. A. (1991) Elements of Information Theory, Wiley-Interscience, New York

Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet

Date: TBA
Room: TBA

Coby Viner, University of Toronto, Canada
James Johnson, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Nicolas Walker, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Hui Shi, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Marcela Sjoberg, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
David J. Adams, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
Anne C. Ferguson-Smith, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Timothy L. Bailey, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Many transcription factors (TFs) initiate transcription only in specific sequence contexts, providing the means for sequence specificity of transcriptional control. A four-letter DNA alphabet only partially describes the possible diversity of nucleobases a TF might encounter. Cytosine is often present in the modified forms: 5-methylcytosine (5mC) or 5-hydroxymethylcytosine (5hmC). TFs have been shown to distinguish unmodified from modified bases. Modification-sensitive TFs provide a mechanism by which widespread changes in DNA methylation and hydroxymethylation can dramatically shift active gene expression programs.

To understand the effect of modified nucleobases on gene regulation, we developed methods to discover motifs and identify TF binding sites (TFBSs) in DNA with covalent modifications. Our models expand the standard A/C/G/T alphabet, adding m (5mC) and h (5hmC). We additionally add symbols to encode guanine complementary to these modified cytosine nucleobases and represent states of ambiguous modification. We adapted the position weight matrix model of TFBS affinity to an expanded alphabet. We developed a program, Cytomod, to create a modified sequence. We also enhanced the MEME Suite to be able to handle custom alphabets. We created an expanded-alphabet sequence using whole-genome maps of 5mC and 5hmC in naive ex vivo mouse T cells. Using this sequence and ChIP-seq data from Mouse ENCODE and others, we identified modification-sensitive cis-regulatory modules. We elucidated various known methylation binding preferences, including the preference of ZFP57 and C/EBPβ for methylated motifs and the preference of c-Myc for unmethylated E-box motifs. We demonstrated that our method is robust to parameter perturbations, with TF sensitivities for (hydroxy)methylated DNA broadly conserved across a range of modified base calling thresholds. Hypothesis testing across different threshold values was used to determine cutoffs most suitable for further analyses. Using these known binding preferences to tune model parameters enables discovery of novel modified motifs.
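The two ingredients described above, a sequence rewritten with modification symbols and a position weight matrix over the expanded alphabet, can be sketched in a few lines. This is a simplified illustration and not Cytomod or the MEME Suite: it omits the symbols for guanine opposite a modified cytosine and for ambiguous states, and all names, positions and probabilities are hypothetical.

```python
import math

ALPHABET = "ACGTmh"  # m = 5-methylcytosine, h = 5-hydroxymethylcytosine


def modify_sequence(seq, methylated=(), hydroxymethylated=()):
    """Replace cytosines at the given 0-based positions with m or h."""
    out = list(seq)
    calls = [(p, "m") for p in methylated] + [(p, "h") for p in hydroxymethylated]
    for pos, symbol in calls:
        if out[pos] != "C":
            raise ValueError(f"position {pos} is not an unmodified cytosine")
        out[pos] = symbol
    return "".join(out)


def log_odds_score(site, pwm, background):
    """Sum of log2(p_motif / p_background) over the positions of a site.

    pwm is a list of {symbol: probability} columns over ALPHABET.
    """
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log2(col[s] / background[s]) for s, col in zip(site, pwm))


uniform = {s: 1 / len(ALPHABET) for s in ALPHABET}
# Hypothetical one-column motif that prefers a methylated cytosine.
prefers_m = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.1, "m": 0.5, "h": 0.1}]
print(modify_sequence("ACGTC", methylated=[1], hydroxymethylated=[4]))
print(log_odds_score("m", prefers_m, uniform))
```

Because m and h are ordinary symbols in the model, the same scoring code separates a factor that prefers methylated sites from one that prefers unmodified ones without any special cases.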

Hypothesis testing of motif central enrichment provides a natural means of differentially assessing modified versus unmodified binding affinity. This approach can be readily extended to other DNA modifications. As more high-resolution epigenomic data becomes available, we expect this method to continue to yield insights into altered TFBS affinities across a variety of modifications.

Prediction of Metal Binding Sites in Proteins Using Coevolution Data

Date: TBA
Room: TBA

Frazier Baker, University of Cincinnati, United States
Alexey Porollo, Children's Hospital Medical Center, United States

Presentation Overview:

Trace metals play an important role in determining the folding and function of many proteins. Knowledge of protein metal binding sites can facilitate the efforts in protein modeling, the understanding of molecular mechanisms of protein function, and the identification of new drug targets. Experimental annotation of metalloproteomes lags behind the pace at which genomes and proteomes become available. Hence, there is a growing demand for reliable sequence-based prediction of metal binding proteins and the actual binding sites.

To fulfill this need, we have developed a new machine learning-based model for metal-binding site prediction. The model employs coevolution information derived from multiple sequence alignment. Three coevolution metrics were explored: Chi-squared, Mutual Information, and Pearson Correlation. All metrics were adjusted for phylogeny bias in the multiple sequence alignment. Features are based on the cumulative properties derived from the most covariant residues (group) for each potential metal binding residue (C, D, E, H, N, Q, S, T). The feature space includes the average of individual conservation scores and the composition of amino acids within the group.

The training set comprises 165 manually curated metal binding proteins taken from the Metal MACiE database. There are 637 residues that bind metals (true positives), and 25909 non-binding residues of the same amino acids (true negatives). To keep the training data balanced, we randomly resampled the negative class at 1:1 and 2:1 (TN:TP) ratios, 1000 times for each. Two machine learning algorithms, C4.5 decision tree (DT) and Random Forest (RF), implemented in Weka were used to build prediction models. Each model was evaluated using 10-fold cross-validation (CV) in 1000 datasets with 2 ratios. The best performing model (23 features) yielded a Matthews correlation coefficient of 0.67 with an overall accuracy of 87.5%, averaged over 1000 runs of 10-fold CV on the 2:1 ratio dataset. RF appears to outperform DT; furthermore, the coevolution-based model with group-based features is superior to other existing models using features derived from individual residues.
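The resampling and evaluation steps can be sketched briefly: draw a fixed ratio of negatives to positives, then summarise a confusion matrix with the Matthews correlation coefficient. This is an illustration only (the study used Weka); the function names are assumptions, and the integer identifiers merely stand in for residues.

```python
import math
import random


def undersample(positives, negatives, ratio, rng):
    """Keep all positives and draw ratio * len(positives) negatives without replacement."""
    k = min(len(negatives), ratio * len(positives))
    return positives, rng.sample(negatives, k)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


rng = random.Random(0)
binding = list(range(637))          # stand-ins for metal-binding residues
non_binding = list(range(25909))    # stand-ins for non-binding residues
pos, neg = undersample(binding, non_binding, ratio=2, rng=rng)
print(len(pos), len(neg))
print(matthews_cc(tp=25, tn=25, fp=25, fn=25))
```

Repeating the draw with different seeds gives the many resampled datasets over which cross-validated scores are averaged.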

A novel algorithm for analyzing drug-drug interactions from MEDLINE literature

Date: TBA
Room: TBA

Yin Lu, College of Pharmacy, University of South Florida, United States
Yi-Cheng Tu, Department of Computer Science, University of South Florida, United States
Feng Cheng, College of Pharmacy, University of South Florida, United States

Presentation Overview:

Drug–drug interaction (DDI) is becoming a serious clinical safety issue as the use of multiple medications becomes more common. Searching the MEDLINE database for journal articles related to DDI produces over 330,000 results. It is impossible to read and summarize these references manually. As the volume of biomedical references in the MEDLINE database continues to expand at a rapid pace, automatic identification of DDIs from the literature is becoming increasingly important. In this article, we present a random-sampling-based statistical algorithm to identify possible DDIs and the underlying mechanism from the substances field of MEDLINE records. The substance terms are essentially carriers of compound (including protein) information in a MEDLINE record. Four case studies on warfarin, ibuprofen, furosemide and sertraline indicated that our method was able to rank possible DDIs with high accuracy (90.0% for warfarin, 83.3% for ibuprofen, 70.0% for furosemide and 100% for sertraline in the top 10% of a list of compounds ranked by p-value). A social network analysis of substance terms was also performed to construct networks between proteins and drug pairs to elucidate how the two drugs could interact.
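One way to obtain a p-value for a substance term by random sampling is to ask how often randomly drawn records mention the term at least as frequently as the records for the drug of interest. The abstract does not give the details of the authors' algorithm, so the sketch below is only a plausible illustration; the function name, record representation and example terms are assumptions.

```python
import random


def empirical_pvalue(drug_records, all_records, term, n_samples=1000, seed=0):
    """Empirical p-value for enrichment of `term` among a drug's records.

    Each record is a set of substance terms. Random samples of the same size
    as drug_records are drawn from all_records without replacement.
    """
    rng = random.Random(seed)
    observed = sum(term in rec for rec in drug_records)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(all_records, len(drug_records))
        if sum(term in rec for rec in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_samples + 1)


# Hypothetical corpus: 5 warfarin records mention CYP2C9, 95 others do not.
warfarin = [{"Warfarin", "CYP2C9"} for _ in range(5)]
corpus = warfarin + [{"Aspirin"} for _ in range(95)]
print(empirical_pvalue(warfarin, corpus, "CYP2C9"))
```

Ranking all substance terms by such p-values yields the kind of ordered candidate list that the case studies evaluate.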

Distal chromatin loop prediction with deep siamese neural networks

Date: TBA
Room: TBA

Davide Chicco, Princess Margaret Cancer Centre, University of Toronto, Canada
Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Introduction. Transcriptional regulation is influenced by physical interactions with distal genetic elements such as enhancers. While those elements might be tens of thousands of base pairs away from the genes they affect on the DNA backbone, they can be physically close in the folded three-dimensional conformation of chromatin. Powerful molecular biology techniques, such as chromosome conformation capture (3C) and Hi-C, can locate these 3D long-range interactions (or loops). They are too expensive and difficult, however, for widespread use.
Previous work demonstrates that chromatin loops can be predicted from DNase hypersensitivity signals. For example, Thurman et al. in 2012 used these data in a statistical method that takes advantage of hierarchical clustering and Pearson correlation to predict distal interactions. We sought to improve these methods with a more flexible deep learning technique.

Methods. We created a new method for distal interaction prediction that uses a deep siamese neural network algorithm. This technique was originally developed in artificial intelligence to recognize forged hand-written signatures. The siamese neural network learns the inner mathematical non-linear representation of pairs of DNase I hypersensitivity profiles, and states whether these pairs represent long-range interactions or not.
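The defining feature of a siamese network is that both inputs pass through the same encoder with the same parameters before their representations are compared. The forward pass below is a minimal sketch of that idea with a single shared layer and untrained, hand-picked weights; the layer size, the distance-to-probability mapping and all names are assumptions rather than the authors' architecture.

```python
import math


def encode(profile, weights, biases):
    """One shared dense layer with tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, profile)) + b)
            for row, b in zip(weights, biases)]


def siamese_interaction_probability(profile_a, profile_b, weights, biases):
    """Encode both profiles with the SAME parameters, then squash their distance."""
    ea = encode(profile_a, weights, biases)
    eb = encode(profile_b, weights, biases)
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))
    # 1.0 for identical embeddings, approaching 0 as they diverge.
    return math.exp(-distance)


# Hypothetical 3-bin DNase I profiles and untrained 2-unit encoder weights.
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
biases = [0.0, 0.1]
region_a = [1.0, 2.0, 3.0]
region_b = [3.0, 2.0, 1.0]
print(siamese_interaction_probability(region_a, region_b, weights, biases))
```

Weight sharing makes the score symmetric in the two regions; training would adjust the shared weights so that pairs labelled as interacting by the gold standard end up close in the embedding space.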

Results. We tested the effectiveness of our method through a standard cross-validation optimization approach, built on receiver operating characteristic (ROC) curves and precision-recall curves, by using the high-resolution genome-scale Hi-C datasets as gold standard. Preliminary results on held-out test sets confirm the efficacy of our algorithm.

Discussion. We designed a siamese neural network algorithm that predicts chromatin loops from pairs of chromosome region DNase I hypersensitivity profiles. Compared to previous models, our computational method has the following advantages: (i) the ability to train a machine-learning model from the DNase profile signals of pairs of chromosome regions representing interactions; (ii) a prediction and validation pipeline that can be easily expanded and integrated with alternative algorithms or additional datasets such as those from ChIP-seq data in the future.

Ray Surveyor: phenetic comparison of genomes and its application in microbial evolution

Date: TBA
Room: TBA

Frederic Raymond, Université Laval, Canada
Maxime Déraspe, Université Laval, Canada
Sébastien Boisvert, Gydle Inc., Canada
Alexander Culley, Université Laval, Canada
Paul H. Roy, Université Laval, Canada
François Laviolette, Université Laval, Canada
Jacques Corbeil, Université Laval, Canada

Presentation Overview:

Microbial genomics studies are getting more extensive and complex. Therefore, new methods for analyses are required for epidemiologists and microbiologists to make sense of these massive datasets. Here we demonstrate that comparison of genomes based on their k-mer content makes it easy to reconstruct a phenetic tree coherent with classical phylogeny, without the need for prior data curation such as the alignment of core genomes. The Ray Surveyor software can compare hundreds to thousands of microbial genomes based on k-mers and removes the bias caused by the use of conserved regions to cluster samples. Using the Ray Surveyor software, we built in less than 6 hours an accurate phenetic tree of the Bacteria kingdom using 2429 complete genome sequences. The distinguishing feature of Ray Surveyor is its ability to dissect population structures based on a subset of sequences, for example resistance genes or bacteriophages, and thus determine which genetic elements drive bacterial fitness. We show that the population structure of Pseudomonas aeruginosa is closely linked with resistance genes while bacteriophage-related sequences are important in Streptococcus pneumoniae populations. We applied this methodology to 57 bacterial families belonging to 7 different phyla, showing for each family the importance of mobile elements, resistance genes, bacteriophages, plasmids and biosynthetic gene clusters. Only 5% of these 57 families were correlated with mobile elements, 23% with resistance genes and 39% with bacteriophages. In addition to determining correlations with phenetic tree structure, we quantified the abundance of k-mers related to the five categories. This allowed us to determine which taxa have the most abundant genetic determinants associated with each of these five categories. This global view of the pan-genome of human pathogens demonstrates the taxa-dependent influence of mobility-related genes on population structure.
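Comparing genomes by k-mer content needs no alignment: each genome is reduced to its set of k-mers and pairs of genomes are scored by how many k-mers they share. The sketch below uses a Jaccard distance as a simple stand-in; it is not the Ray Surveyor implementation, and the function names, the value of k and the toy sequences are assumptions.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def distance_matrix(genomes, k):
    """Pairwise k-mer distances for a {name: sequence} mapping."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    names = sorted(sets)
    matrix = [[jaccard_distance(sets[x], sets[y]) for y in names] for x in names]
    return names, matrix


# Toy genomes; real analyses use much larger k and whole genome sequences.
genomes = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "TTTTTT"}
names, matrix = distance_matrix(genomes, k=3)
print(names)
print(matrix)
```

Such a matrix can be passed to any standard clustering or tree-building method, and restricting the k-mer sets to those found in, say, resistance genes gives a tree driven by that subset alone.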

Radiation Dose Estimation by Automated Cytogenetic Biodosimetry

Download

Date: TBA
Room: TBA

Peter Rogan, University of Western Ontario, Canada
Yanxin Li, University of Western Ontario, Canada
Ruth Wilkins, Health Canada, Canada
Farrah Flegal, Canadian Nuclear Laboratories, Canada
Joan Knoll, University of Western Ontario, Canada

Presentation Overview:

The dose from ionizing radiation exposure can be interpolated from a calibration curve fit to the frequency of dicentric chromosomes (DC) in metaphase cells of peripheral blood lymphocytes at multiple doses. As DC counts are manually determined, there is an acute need for accurate, fully automated biodosimetry calibration curve generation and analysis of exposed samples. We automated DC detection by extracting key chromosome-derived features in Giemsa-stained metaphase chromosome images and classifying objects by machine learning (ML). The algorithm finds centromeres and differentiates DCs from monocentric chromosomes (MCs), overlapped chromosomes and other objects with acceptable accuracy over a wide range of radiation exposures (0-5 Gy; Microscopy Res. Tech. 2016. DOI:10.1002/jemt.22642). At high dose (3-4 Gy), for a true positive rate of 0.65, the positive predictive value is 0.72. These methods were incorporated into the Automated Dicentric Chromosome Identifier (ADCI), a software program that detects and segments chromosomes (Trans. Biomed. Engineering 2013. 60: 2005-13), detects centromere candidates and discriminates DCs from MCs by ML, then computes biodosimetry curves and determines the radiation dose of test samples. Manually scored images from two reference laboratories, of samples exposed to 1.4 to 3.4 Gy, were re-analyzed with ADCI, which estimated exposures within 0 to 0.4 Gy of the physical dose. ADCI can determine radiation dose with accuracies comparable to standard triage biodosimetry (within 0.5 Gy of the physical dose; Rad. Protect. Biodosimetry, submitted). Calibration curves were generated from metaphase images in ~10 hr, and dose estimations required ~0.8 hr per 500 image sample. Running multiple instances of ADCI may be an effective response to a mass casualty radiation event (Rad. Prot. Biodosimetry 2014. 159: 95-104).
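The interpolation step can be sketched as follows. The abstract does not state the functional form of the calibration curve, so this assumes the linear-quadratic dose response commonly used for dicentric assays, Y = c + alpha*D + beta*D^2. The coefficients are hypothetical values chosen only to make the example concrete; they are not ADCI output.

```python
import math


def dicentrics_per_cell(dose_gy, c, alpha, beta):
    """Linear-quadratic calibration curve: Y = c + alpha*D + beta*D^2."""
    return c + alpha * dose_gy + beta * dose_gy ** 2


def estimate_dose(dc_count, cell_count, c, alpha, beta):
    """Invert the calibration curve for an observed dicentric frequency.

    Assumes beta > 0, so the positive root of the quadratic is the dose.
    """
    frequency = dc_count / cell_count
    if frequency <= c:
        return 0.0  # at or below background: report no detectable dose
    discriminant = alpha ** 2 + 4.0 * beta * (frequency - c)
    return (-alpha + math.sqrt(discriminant)) / (2.0 * beta)


# Hypothetical coefficients (assumption), not fitted to real data.
C, ALPHA, BETA = 0.001, 0.03, 0.06

# 150 dicentrics scored in 500 metaphase cells.
example_dose = estimate_dose(dc_count=150, cell_count=500, c=C, alpha=ALPHA, beta=BETA)
```

In practice the coefficients would come from fitting the curve to samples irradiated at known doses, and the dose estimate would be reported with a confidence interval.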

Patient similarity networks as a framework for genetic case-control prediction in autism spectrum disorders

Download

Date: TBA
Room: TBA

Shraddha Pai, University of Toronto, Canada
Shirley Hui, University of Toronto, Canada
Ruth Isserlin, University of Toronto, Canada
Hussam Kaka, University of Toronto, Canada
Gary Bader, University of Toronto, Canada

Presentation Overview:

Autism spectrum disorders (ASD) are heritable, childhood-onset disorders that affect 1-2% of the population worldwide. Screening can identify genetic causes in 10-30% of cases; DNA copy-number variants (CNV) are present in at least 10% of cases and contribute to disease risk (Ref 1,2). We have developed a predictor for ASD case-control status based on the genomic location of rare CNV deletions, using data from the Autism Genome Project (N=1,485 cases and 1,806 controls of European descent; Ref 2). Our method builds patient similarity networks from shared CNV overlap and uses GeneMANIA for network integration and prediction (Ref 3). Using individual genes as features resulted in chance performance, consistent with previous reports of nonoverlapping disrupted genes in ASD patients. In contrast, using pathways as features improved performance to AUC=0.71, beyond that seen with other pathway-based predictors. At recommended stringency levels, our predictor accounts for 8-15% of all cases with rare CNV deletions. Feature-selected pathways recapitulate mechanisms previously identified in ASD genetics, including themes of cell proliferation and division, neuronal development and function, and signal transduction.
CNV-based networks are sparse and binary, making the predictor prone to overfitting. Our feature selection therefore combines scores from three train-test partitions of the data, with ten-fold cross validation within each partition. Separately, we use nonparametric statistics to identify and exclude “random-like” cliques; this clique-filtering step substantially reduces the fraction of random networks that pass feature selection. Future challenges include incorporation of reference epigenome data to account for 32% of patients with CNVs located in intergenic regions, and extending the predictor to include single nucleotide variants, clinical data and medical history. As a framework, patient similarity networks provide important advantages for building predictors, including the ability to integrate heterogeneous data sources and handle highly missing or sparse data. This approach also naturally provides network-based visualization of patient similarities and organization of predictive pathways, thus making an intuitive tool for clinical and basic ASD research.
References: 1. Anagnostou E et al. (2014). Can Med Assoc J 186:509-19. 2. Pinto D et al. (2014). Am J Hum Genet 94:677-94. 3. Mostafavi S and Morris Q (2010). Bioinformatics 26(14):1759-65.

MetaDCN: meta-analysis framework for differential coexpression network detection with an application in breast cancer

Download

Date: TBA
Room: TBA

Li Zhu, University of Pittsburgh, United States
Ying Ding, University of Pittsburgh, United States
Cho-Yi Chen, University of Pittsburgh, United States
Zhiguang Huo, University of Pittsburgh, United States
Sunghwan Kim, University of Pittsburgh, United States
Steffi Oesterreich, Magee-Women’s Research Institute, United States
George Tseng, University of Pittsburgh, United States

Presentation Overview:

Background: Gene coexpression network analysis from large transcriptomic studies is often used to elucidate potential gene-gene interactions and regulatory mechanisms. In contrast to traditional differential expression analysis, identifying differential coexpression subnetworks between cases and controls could reveal pathways with disease-related dysfunction. A coexpression network estimated from a single transcriptomic study is often unstable and not generalizable due to biological variation, cohort bias and limited sample size. With the rapid accumulation of transcriptomic studies in the public domain, coexpression analysis combining multiple transcriptomic studies can provide more accurate and robust results. Here, we propose MetaDCN, a framework that combines multiple studies to identify disease-associated differential coexpression networks.

Methods: The framework is composed of two major components: module searching and network visualization. Coexpression networks are first constructed in cases and controls separately in each study. Differential coexpression seed modules are detected by optimizing an energy function via simulated annealing. Seed modules sharing common pathways are merged into pathway-centric supermodules and a visualization tool is developed using Cytoscape.

Results: We applied the method to five breast cancer studies (ER+ vs ER-) and identified 32 supermodules engaged in 96 pathways under 5% FDR control. The top-ranked pathways are the immune response pathway and the complement and coagulation cascades pathway. The supermodules associated with those two immune system related pathways demonstrated alternative ER activation, which is consistent with recently reported ER-mediated immune functions.

Conclusions: MetaDCN integrates multiple studies to detect disease associated gene modules with differential coexpression. The result sheds light on the underlying disease mechanisms in a systems manner.

On the clustering of biomedical datasets - a data-driven perspective

Download

Date: TBA
Room: TBA

Christian Wiwie, University of Southern Denmark, Denmark
Jan Baumbach, University of Southern Denmark, Denmark
Richard Röttger, University of Southern Denmark, Denmark

Presentation Overview:

Nowadays, scientists of virtually all disciplines are confronted with an increasing supply of information; this is especially true for biomedical research where recent advances in wet-lab technologies have led to a sheer explosion of the wealth, quality, and amount of available data. In order to extract actual knowledge from this plethora of information, one of the most common approaches, and often the beginning of an analysis, is the so-called cluster analysis which unravels the inherent structure of the data by grouping similar objects together.

Despite being a long-standing problem, conducting a cluster analysis is anything but straightforward; on the contrary, a high-quality clustering analysis very often overwhelms the practitioner. A multitude of decisions have to be made, all of which require a deep understanding of the underlying methods; these are decisions the layperson often cannot make or is not even aware of, such as feature extraction, similarity calculation, choice of clustering tool and its parameter optimization, and many more. Well-structured and objective guidelines are largely missing, especially at larger scale.

To address these challenges, we have developed ClustEval, a fully integrated and automated cluster evaluation framework. The power of this framework allowed us to conduct a massive, objective and fully reproducible clustering comparison analysis consisting of several million evaluations. This massive data-driven background of structured clustering results allowed us to provide a much-needed overview of the field and to carefully derive guidelines for the clustering of biomedical datasets, which we recently published in Nature Methods. Based on this effort, we will present ClustEval and our most recent findings, and evaluate future perspectives for improving the overall quality and usability of cluster analyses. All results and the framework are freely available: http://clusteval.sdu.dk/

miRNet – dissecting miRNA-target interactions and functional associations through network-based visual analysis

Download

Date: TBA
Room: TBA

Yannan Fan, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
Paula Ribeiro, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
Sarah Kimmins, Department of Animal Science, Faculty of Agricultural and Environmental Sciences, McGill University, Canada
Jianguo Xia, Institute of Parasitology, Faculty of Agricultural and Environmental Sciences, McGill University, Canada

Presentation Overview:

MicroRNAs (miRNAs) can regulate nearly all biological processes, and their dysregulation is implicated in various complex diseases and pathological conditions. Recent years have seen a growing number of functional studies of miRNAs using high-throughput experimental technologies, which have produced a large amount of high-quality data on miRNA target genes, their interactions with small molecules, long non-coding RNAs and epigenetic modifiers, and their disease associations. These rich sets of information have enabled the creation of comprehensive networks linking miRNAs with various biologically important entities to shed light on their collective functions and regulatory mechanisms. Here, we introduce miRNet, a high-performance, easy-to-use, web-based tool that offers statistical, visual, and network-based approaches to help researchers understand miRNA functions and regulatory mechanisms. The key features of miRNet include: (i) a comprehensive knowledge base integrating high-quality miRNA-target interaction data from ten databases; (ii) support for differential expression analysis of data from microarray, RNA-seq and quantitative PCR; (iii) a flexible interface for data filtering, refinement and customization during network creation; (iv) a powerful, fully featured network visualization system coupled with enrichment analysis. miRNet offers a comprehensive tool suite to enable statistical analysis and functional interpretation of various data generated from current miRNA studies.

Non-invasive Precision Medicine - Statistical learning of exhaled biomarker profiles

Download

Date: TBA
Room: TBA

Anne-Christin Hauschild, Max Planck Institute for Informatics, Germany
Jörg Ingo Baumbach, Faculty Applied Chemistry, Reutlingen University, Germany
Jan Baumbach, University of Southern Denmark, Denmark

Presentation Overview:

Precision medicine aims to tailor medical treatment to the individual patient. It relies on efficient molecular methods for diagnostic testing and computational methods for high-throughput data analysis. Major initiatives have recently been launched to build the infrastructure needed to guide clinical practice. Detection of clinically relevant biomarker molecules in exhaled air has the potential to establish a non-invasive precision medicine branch. To this end, we utilize volatile organic compounds, which are emitted by all living cells and tissues. We seek to identify non-invasive biomarkers that are predictive for the biomedical fate of individual patients or cell cultures. This holds promise for moving the therapeutic window to earlier stages of disease progression. While portable devices for exhaled volatile metabolite measurement exist, we face the traditional biomarker research barrier: a lack of robustness hinders translation to the world outside laboratories. To move from biomarker discovery to validation, from separability to predictability, we have developed several bioinformatics methods for computational breath analysis, which have the potential to redefine non-invasive biomedical decision making by rapid and cheap matching of decisive medical patterns in exhaled air. We aim to provide a supplementary diagnostic tool complementing classic urine, blood and tissue samples. In the presentation, we will review the state of the art, study some clinical application examples, highlight existing challenges, and introduce new data mining methods for identifying exhaled biomarkers.

Predicting physiologically relevant SH3 domain mediated protein-protein interactions in human

Download

Date: TBA
Room: TBA

Shobhit Jain, University of Toronto, Canada
Gary Bader, University of Toronto, Canada

Presentation Overview:

Protein-protein interactions (PPIs) are physical associations between protein pairs in a specific biological context. Their knowledge provides important insights into the functioning of a cell. Many intracellular signaling processes are mediated by interactions involving peptide recognition modules such as SH3 domains. These domains bind to small, linear sequence motifs within proteins, which can be identified using high-throughput experimental screens such as phage display or peptide chips. Binding motif patterns can then be used to computationally predict protein interactions mediated by these domains. While many protein-protein interaction prediction methods exist, most do not work with peptide recognition module mediated interactions or do not consider many of the known constraints governing physiologically relevant interactions between two proteins.

A new approach for predicting physiologically relevant SH3 domain-peptide mediated protein-protein interactions in H. sapiens using phage display data is presented. Like some previous similar methods, this method uses position weight matrix models of protein linear motif preference for individual SH3 domains to scan the proteome for potential hits and then filters these hits using a range of evidence sources related to sequence-based and cellular constraints on protein interactions. The novelty of this approach is the large number of evidence sources used and the method of combination of sequence based and protein pair based evidence sources. This method combines diverse binding site (peptide) features, including presence in a disordered region of the protein, surface accessibility, conservation across different species, and structural contact with the SH3 domain, as well as protein features such as cellular proximity, shared biological process, similar molecular function, correlated gene expression, protein expression and sequence signature. By combining different peptide and protein features using multiple Bayesian models we are able to predict high confidence SH3 domain-peptide interactions.
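The proteome-scanning step can be sketched as a sliding-window score against a position weight matrix. Everything below is illustrative: the toy matrix is loosely modeled on the well-known PxxP core recognized by many SH3 domains, not on the authors' phage-display-derived models. The uniform background, pseudo-frequency, threshold and sequence are assumptions.

```python
import math

BACKGROUND = 0.05  # assumed uniform background over the 20 amino acids


def score_window(window, position_frequencies, pseudo=0.01):
    """Sum of log2 odds over constrained positions.

    Each column is a dict of residue frequencies, or None for a position
    with no residue preference (contributes nothing to the score).
    """
    score = 0.0
    for residue, column in zip(window, position_frequencies):
        if column is None:
            continue
        frequency = column.get(residue, pseudo)
        score += math.log2(frequency / BACKGROUND)
    return score


def scan_protein(sequence, position_frequencies, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(position_frequencies)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(window, position_frequencies)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical 4-column motif: proline strongly preferred at both ends.
TOY_MOTIF = [{"P": 0.9}, None, None, {"P": 0.9}]
toy_hits = scan_protein("MAPSLPGKPAAP", TOY_MOTIF, threshold=8.0)
```

Hits from a scan like this would only be candidates. The approach in the abstract then filters them using peptide-level and protein-level evidence combined through Bayesian models.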

Time-series clustering enables the exploration of temporal patterns in marker gene data

Download

Date: TBA
Room: TBA

Michael Hall, Dalhousie University, Canada
Jonathan Perrie, Dalhousie University, Canada
Robert Beiko, Dalhousie University, Canada

Presentation Overview:

Marker-gene sequencing of entire communities of microorganisms is a popular method for identifying the microbial taxa that are associated with different conditions or environmental changes. In particular, longitudinal studies are beginning to highlight the dynamism of these groups of invisible inhabitants. Typically, marker genes from these organisms are clustered by sequence identity into “operational taxonomic units” (OTUs) that are surrogates for species or some other level of taxonomic similarity. This approach is widely used and useful as it minimizes the effects of sequencing error and reduces the size of the data set for downstream computations. However, OTU clustering has been shown to group ecologically distinct organisms into a single unit which can obscure potentially important functional differences. We present an alternate clustering approach for time-series marker gene data: clustering sequences by temporal abundance patterns. Our method groups sequences from taxa that are potentially exhibiting similar responses to their environment. An interactive user interface allows the researcher to explore their data and discover distinct temporal patterns. For example, time-series clustering can reveal the seasonal cycles of taxa in a freshwater lake, and can group together taxa that follow the same periodicity. By modifying the radius of the clusters, we can identify discordance within OTUs and generate hypotheses about interactions between taxa. Time-series clustering represents a novel way to explore marker gene data that complements existing techniques.
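One way to picture clustering by temporal abundance pattern is the sketch below. The abstract does not specify the clustering algorithm or distance, so this assumes z-score normalization (so that only the shape of each profile matters), Euclidean distance, and a simple greedy "leader" clustering controlled by a radius. All names and data are hypothetical.

```python
import math


def zscore(profile):
    """Normalize a time series to zero mean and unit variance."""
    mean = sum(profile) / len(profile)
    variance = sum((x - mean) ** 2 for x in profile) / len(profile)
    std = math.sqrt(variance)
    if std == 0:
        return [0.0] * len(profile)
    return [(x - mean) / std for x in profile]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radius_clusters(profiles, radius):
    """Greedy leader clustering: each sequence joins the first cluster
    whose leader lies within `radius`, otherwise it starts a new cluster."""
    normalized = {name: zscore(values) for name, values in profiles.items()}
    clusters = []  # list of (leader_profile, member_names)
    for name in sorted(normalized):
        for leader, members in clusters:
            if euclidean(normalized[name], leader) <= radius:
                members.append(name)
                break
        else:
            clusters.append((normalized[name], [name]))
    return [members for _, members in clusters]


# Toy abundances over six time points (assumption): seq1 and seq2 share a
# cycle at different overall abundance; seq3 cycles in the opposite phase.
toy_profiles = {
    "seq1": [1, 5, 1, 5, 1, 5],
    "seq2": [10, 50, 10, 50, 10, 50],
    "seq3": [5, 1, 5, 1, 5, 1],
}
toy_clusters = radius_clusters(toy_profiles, radius=1.0)
```

A smaller radius splits sequences with subtly different dynamics into separate clusters. This mirrors how changing the radius can expose discordance among sequences that identity-based OTU clustering would merge.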

A regulatory model for discovering aberrant post-transcriptional programs in cancer

Download

Date: TBA
Room: TBA

Hamed Najafabadi, McGill University, Canada
Pouria Jandaghi, McGill University, Canada
Shraddha Solanki, McGill University, Canada
Andreas Papadakis, McGill University, Canada
Maryam Safisamghabadi, McGill University, Canada
Cristina Storoz, McGill University, Canada
Mark Lathrop, McGill University, Canada
Sidong Huang, McGill University, Canada
Simon Tanguay, McGill University, Canada
Fadi Brimo, McGill University, Canada
Yasser Riazalhosseini, McGill University, Canada

Presentation Overview:

In contrast to extensive studies on diverse molecular factors involved in transcriptional gene programs, the current knowledge of the mechanisms underlying post-transcriptional gene de-regulation in cancer is scarce, and has mainly been limited to the function of tumor-suppressive or oncogenic microRNAs (miRNAs). RNA-binding proteins (RBPs), as key factors that modulate stability and splicing of mRNAs, play a central role in post-transcriptional gene regulation, but have rarely been studied in the context of cancer. By combining sequence specificities of human RBPs with tens of genome-wide expression datasets across multiple tissues and cell lines, we have developed an integrated regulatory code that models the stability of each mRNA as a combinatorial function of the binding of RBPs. This model enables us not only to predict the abundance of mRNAs based on the ‘activity’ of upstream regulatory RBPs, but also to infer the cancer-associated change in the activity of each RBP based on the change in the abundance of its stability targets. Using this model, we have investigated the activity landscape of RBPs in normal and tumor tissues from 45 clear cell renal cell carcinoma (ccRCC) patients. This model identified several RBPs with recurrent tumor-associated change in activity, including an increase in the activity of MBNL2 and PCBP2, and a decrease in the activity of ESRP2, despite a lack of change in the mRNA levels of some of these RBPs. Using immunohistochemistry, we have found that these RBPs are de-regulated at the protein level in ccRCC tumors. Furthermore, using shRNA-mediated knockdown followed by cellular phenotyping and RNA-seq analysis, we have validated the function of these RBPs in regulating several cancer-related pathways. In particular, ESRP2 knockdown remodels the transcriptome of normal kidney cells toward that of ccRCC tumors, and activates pathways associated with cancer. 
On the other hand, knockdown of MBNL2 and PCBP2 can revert the cancer gene expression profile, and suppress various cancer-associated pathways. Inhibition of MBNL2 also suppresses proliferation of ccRCC cell lines, underlining the role of MBNL2 as a potential oncogene. These findings highlight the effectiveness of predictive gene regulatory models in identification of cancer-driving post-transcriptional programs, and suggest a prominent role of RBPs in development and progression of cancer.

RGAugury: A pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants

Download

Date: TBA
Room: TBA

Pingchuan Li, Morden Research and Development Centre, AAFC, Canada
Sylvie Cloutier, Ottawa Research and Development Centre, AAFC, Canada
Frank You, Morden Research and Development Centre, AAFC, Canada

Presentation Overview:

Resistance gene analogs (RGAs) are a large class of potential R-genes that usually have conserved domain and motif configurations. To date, NBS-encoding proteins, receptor-like proteins (RLPs), serine/threonine/tyrosine receptor-like kinases (RLKs) and membrane-associated coiled-coil proteins (TM-CC) have been broadly reported to be closely associated with plant defense and resistance. Although few RGAs have been fully characterized for their functions and pathways, genome-wide prediction of potential resistance genes from sequenced plant genomes has been shown to be an effective and practical approach for identification, fine mapping and cloning of plant resistance genes. A few computational programs are available for identifying individual resistance-related domains, but a comprehensive and easy-to-use pipeline for RGA prediction has been lacking. Here, we propose an integrative package named RGAugury that automates genome-wide prediction of different types of RGAs. All modules of the pipeline have been fully parallelized to run in multiple threads, so that the pipeline can use all available CPU cores on a server. The pipeline first identifies resistance-related domains such as NBS, LRR, transmembrane, serine/threonine/tyrosine kinase and coiled-coil from the gene and predicted protein sequences of a genome. All identified essential domains are then comprehensively analyzed to declare RGA candidates and classify them as NBS-encoding, TM-CC, or membrane-associated RLP or RLK. The pipeline was tested using the Arabidopsis and Medicago genomes and validated against their previously reported RGAs. A total of 93% and 90% of the reported putative NBS-encoding genes, and 98% and 99% of the membrane-associated RLPs and RLKs, were identified from Arabidopsis and Medicago, respectively.
These results demonstrated that RGAugury is an effective bioinformatics tool for genome-wide RGA identification that can be applied to other plant genomes. The pipeline program will be available at Bitbucket.

ORFanFinder: automated identification of taxonomically restricted orphan genes

Download

Date: TBA
Room: TBA

Yanbin Yin, Northern Illinois University, United States
Alex Ekstrom, Northern Illinois University, United States

Presentation Overview:

Motivation: Orphan genes, also known as ORFans, are newly evolved genes in a genome that enable the organism to adapt to a specific living environment. The gene content of every sequenced genome can be classified into different age groups, based on how widely or narrowly a gene’s homologs are distributed in the context of species taxonomy. Genes whose homologs are restricted to organisms of particular taxonomic ranks are classified as taxonomically restricted ORFans.
Results: Implementing this idea, we have developed an open source program named ORFanFinder and a free web server to allow automated classification of a genome’s gene content and identification of ORFans at different taxonomic ranks. ORFanFinder and its web server will contribute to the comparative genomics field by facilitating the study of the origin of new genes and the emergence of lineage-specific traits in both prokaryotes and eukaryotes.
Availability: http://cys.bios.niu.edu/orfanfinder
Publication: Ekstrom A and Yin Y (2016) ORFanFinder: automated identification of taxonomically restricted orphan genes, Bioinformatics, doi:10.1093/bioinformatics/btw122, in press

Set Covering Machines and Reference-Free Genome Comparisons Uncover Predictive Biomarkers of Antibiotic Resistance

Download

Date: TBA
Room: TBA

Alexandre Drouin, Université Laval, Canada
Sébastien Giguère, Institute for Research in Immunology and Cancer, Canada
Maxime Déraspe, Université Laval, Canada
Mario Marchand, Université Laval, Canada
Jacques Corbeil, Université Laval, Canada
François Laviolette, Université Laval, Canada

Presentation Overview:

Despite an era of supercomputing and increasingly precise instrumentation, many biological phenomena remain poorly understood. One approach to understanding such phenomena is the case-control study, in which large groups of phenotypically different individuals are compared with the objective of finding predictive biomarkers of a phenotype.

We focus on the identification of genomic biomarkers, ranging from single nucleotide substitutions and indels, to large scale genomic rearrangements. We use reference-free genome comparisons based on k-mers, i.e., sequences of k nucleotides, coupled with the Set Covering Machine (SCM), a machine learning algorithm that produces sparse classifiers. We devise extensions to the algorithm that make it well suited for learning from extremely large sets of genomic features. Our method is robust to large-scale genomic rearrangements and is well-suited for organisms that show high genomic diversity. Moreover, the uncharacteristically sparse models produced by the SCM explicitly highlight the relationship between genomic variations and the phenotype of interest.
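The combination of k-mer features with a sparse, rule-based classifier can be sketched as a greedy set-covering procedure that learns a conjunction of k-mer presence/absence rules. This is a simplified illustration of the general Set Covering Machine idea, not the Kover implementation. The utility function, penalty value, function names and toy genomes are assumptions.

```python
def kmers(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rule_holds(rule, profile):
    """A rule is (kmer, must_be_present); it holds if presence matches."""
    kmer, must_be_present = rule
    return (kmer in profile) == must_be_present


def train_conjunction(genomes, labels, k, penalty=1.0, max_rules=3):
    """Greedily pick rules that reject many remaining negative examples
    while rejecting few remaining positive ones (True = resistant)."""
    profiles = [kmers(genome, k) for genome in genomes]
    vocabulary = sorted(set().union(*profiles))
    rules = [(kmer, flag) for kmer in vocabulary for flag in (True, False)]
    negatives = {i for i, label in enumerate(labels) if not label}
    positives = {i for i, label in enumerate(labels) if label}
    model = []
    while negatives and len(model) < max_rules:
        best = None  # (utility, rule, rejected example indices)
        for rule in rules:
            rejected = {i for i in negatives | positives
                        if not rule_holds(rule, profiles[i])}
            utility = len(rejected & negatives) - penalty * len(rejected & positives)
            if best is None or utility > best[0]:
                best = (utility, rule, rejected)
        if best is None or best[0] <= 0:
            break  # no rule improves the model any further
        model.append(best[1])
        negatives -= best[2]
        positives -= best[2]
    return model


def predict(model, genome, k):
    """A genome is predicted positive only if every rule holds."""
    profile = kmers(genome, k)
    return all(rule_holds(rule, profile) for rule in model)


# Toy data (assumption): resistant genomes carry the 3-mer GGG.
toy_genomes = ["ACGGGTAC", "TTGGGACA", "ACGTTTAC", "TTGACACA"]
toy_labels = [True, True, False, False]
toy_model = train_conjunction(toy_genomes, toy_labels, k=3)
```

The learned model is a short list of k-mer rules that a biologist can read directly, which is the kind of sparsity and interpretability the abstract emphasizes.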

The method was validated by generating models that predict the antibiotic resistance of four important human pathogens: C. difficile, M. tuberculosis, P. aeruginosa and S. pneumoniae. The method generated accurate models for 17 antibiotics, the majority achieving error rates smaller than 10% on a validation set. Within hours of computation, the method recovered, de novo, known and validated antibiotic resistance mechanisms that have been reported over the past decades. Moreover, previously unreported genomic variations that could prove biologically relevant were uncovered. The method also identified markers of cross-resistance between antibiotics, knowledge that could prove relevant for the improvement of combination antibiotherapies.

We are confident that this method is applicable to other organisms and that it could guide biological efforts for understanding a plethora of phenotypes. To this end, we propose Kover, a highly scalable implementation of our method that relies on external storage, instead of the computer’s memory. Kover is open-source software and is available from http://github.com/aldro61/kover.

CoDaSeq: compositional data analysis for high throughput sequencing data

Download

Date: TBA
Room: TBA

Greg Gloor, U. Western Ontario, Canada
Jean Megan Macklaim, U. Western Ontario, Canada
Jia Wu, U. Western Ontario, Canada

Presentation Overview:

We present CoDaSeq, a compositionally appropriate end-to-end workflow for the analysis of sparse high throughput sequencing (HTS) datasets. HTS is routinely applied to collect data from 16S rRNA gene sequencing, RNA-seq, metagenomic sequencing and other experimental designs. It is now acknowledged in several domains that HTS generates compositional data that are sparse (contain many 0 values). Compositional data are those where the multivariate components in individual samples have a constant, yet arbitrary sum. The datasets generated by HTS instruments are compositional because the number of reads output by the machine is constrained by the capacity of the instrument. Compositional data are prone to sub-compositional effects and spurious correlations, first noted by Pearson in 1896, where the conclusions drawn are conditioned upon arbitrary choices made in terms of what features (genes, operational taxonomic units, etc.) are included. Compositional data analysis (CoDa) approaches to analyze these datasets are rarely used, and when applied are only used piecemeal, in part because sparse data are often thought to be incompatible with CoDa approaches. CoDaSeq uses a Bayesian approach to estimate posterior probability distributions of the data such that sparsity can be accounted for appropriately during the analysis. We will show how the CoDaSeq approach can be applied to 16S rRNA gene sequencing, metagenomic inference, and transcriptome experiments in a coherent manner for ordination (using the compositional biplot), correlation (using measures of constant ratio) and group difference analysis (using measures of ratio variance). This approach is a powerful and fully generalizable addition to the HTS analysis toolkit.
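To make the compositional idea concrete, the sketch below applies the centered log-ratio (clr) transform, a standard CoDa tool, to read counts. It handles zeros with a fixed pseudocount, which is a deliberate simplification and an assumption here; the abstract describes a Bayesian estimate of the underlying probabilities instead. The sample values are invented.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: the log of each part minus the mean log of all
    parts. With pseudocount=0 the input must not contain zeros."""
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# A sparse toy sample (assumption): four features, one with zero reads.
sparse_sample = [120, 30, 0, 850]
clr_values = clr(sparse_sample)

# Sequencing ten times deeper changes the counts but not the ratios, so
# the clr values are unchanged when no pseudocount is involved.
shallow = clr([10, 20, 70], pseudocount=0.0)
deep = clr([100, 200, 700], pseudocount=0.0)
```

Because clr values depend only on ratios between features, they are unaffected by the arbitrary total read count imposed by the instrument. This is the property that makes them suitable for ordination and for the ratio-based comparisons mentioned above.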

Meta-analysis of large pharmacogenomics studies to develop isoform-based biomarkers predictive of response to targeted therapies

Download

Date: TBA
Room: TBA

Zhaleh Safikhani, Princess Margaret Cancer Centre, University Health Network, Canada
Benjamin Haibe-Kains, Princess Margaret Cancer Centre, University Health Network, Canada
Kelsie Thu, Princess Margaret Cancer Centre, University Health Network, Canada
Petr Smirinov, Princess Margaret Cancer Centre, University Health Network, Canada
David Cescon, Princess Margaret Cancer Centre, University Health Network, Canada
Mathieu Lupien, Princess Margaret Cancer Centre, University Health Network, Canada

Presentation Overview:

Introduction: Advances in genome-wide molecular profiling and high-throughput drug screening technologies offer a unique opportunity to identify novel biomarkers predictive of response to anticancer therapies. The vast majority of predictive biomarkers for targeted therapies are based on genetic aberrations or protein expression, as opposed to transcriptomic biomarkers. However, the recent adoption of next-generation sequencing technologies enables accurate profiling of not only gene expression but also alternative and trans-spliced transcripts in large-scale pharmacogenomic studies.
Methods: We applied multiple machine learning modeling techniques to the identification of transcriptomic biomarkers for drug response in cancer. To address the lack of reproducibility of drug sensitivity measurements across studies, we developed a framework to efficiently combine the pharmacological data from two large studies, the Cancer Cell Line Encyclopedia (CCLE) and the Genomics of Drug Sensitivity in Cancer (GDSC). Our framework consists of fitting predictive models using the cell lines' RNA-seq profiles as predictor variables, controlled for tissue type and batch indicators, and combined CCLE and GDSC drug sensitivity calls as dependent variables. The accuracy and significance of the fitted models were assessed using cross-validation, embedding both feature selection and model fitting. We prioritized gene and isoform-based biomarkers that are differentially distributed between healthy tissues from the GTEx dataset and cancer cell lines.
Results: Independent pharmacogenomic datasets developed by the Gray and Neel laboratories were used to validate the biomarkers that predict the response of breast cancer cell lines. We validated in vitro our most promising in silico predictions, such as NM_004207(SLC16a3002) as a significant predictive biomarker for the MEK inhibitor AZD6244.
Conclusion: Despite its initial promise, biomarker discovery from large pharmacogenomic datasets has not fully realized its potential, with only a few robust biomarkers being reproduced across studies. Our study is the first to implement a meta-analysis pipeline of such valuable data, opening new avenues of research for the identification of isoform-based biomarkers predictive of response to targeted therapies in breast cancer.

CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization

Download

Date: TBA
Room: TBA

Houxiang Zhu, Miami University, United States
Lauren Misel, Miami University, United States
Mitchell Graham, Miami University, United States
Chun Liang, Miami University, United States

Presentation Overview:

The CRISPR system holds much promise for successful genome engineering, but therapeutic, industrial, and research applications will place high demand on improving the specificity and efficiency of this tool. CT-Finder (http://bioinfolab.miamioh.edu/ct-finder) is a web service to help users design guide RNAs (gRNAs) optimized for specificity and efficiency. CT-Finder accommodates the original single-gRNA Cas9 system and two specificity-enhancing paired-gRNA systems: Cas9 D10A nickases (Cas9n) and dimeric RNA-guided FokI nucleases (RFNs). Optimal target candidates can be chosen based on the minimization of predicted off-target effects and the maximization of gRNA activities. Graphical visualization of on-target and off-target sites in the genome is provided for target validation. Major model organisms are covered by this web service.
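As a rough illustration of the first step in any gRNA design tool, the sketch below enumerates candidate target sites for the single-gRNA Cas9 system. It assumes the commonly used convention of a 20-nt protospacer followed by an NGG PAM and scans the forward strand only. The function names are hypothetical, and the off-target and activity scoring that CT-Finder performs is not reproduced here.

```python
import re

def find_cas9_targets(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for each NGG PAM site on the forward strand.

    Assumes a 20-nt protospacer followed by an NGG PAM (an assumption of
    this sketch, not a description of CT-Finder's implementation).
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

def gc_fraction(s):
    """Fraction of G and C bases, a simple property often inspected for gRNAs."""
    return (s.count("G") + s.count("C")) / len(s)

# Hypothetical input sequence, for demonstration only.
example = "TTGACCTGAAGCTAGCTAGGATCCGATCGATTGGCATG"
for start, protospacer, pam in find_cas9_targets(example):
    print(start, protospacer, pam, round(gc_fraction(protospacer), 2))
```

A real design tool would also scan the reverse complement and rank each candidate by predicted off-target sites across the genome.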

In-silico pipeline for tissue-specific drug combination discovery

Download

Date: TBA
Room: TBA

Seyed Ali Madani Tonekaboni, University of Toronto, Canada
Zhaleh Safikhani, University of Toronto, Canada
Benjamin Haibe-Kains, University of Toronto, Canada

Presentation Overview:

Rationale. Efforts toward developing efficient single-agent anticancer therapies against many aggressive cancer types have failed. A possibly more efficient strategy to overcome the limitations of single-agent therapeutics (monotherapies) could be to develop combination therapies that target multiple components of a complicated disease such as cancer. However, testing numerous combination therapies in preclinical studies or clinical trials is not feasible given the number of possible combinations of over a thousand potential drugs.
Objective. We developed a high-throughput drug combination computational modeling approach to integrate multiple pharmacogenomic datasets in order to efficiently explore the large space of drug combinations and predict the most synergistic drug combinations in vitro.
Methods. For each combination of drugs on a cell line, we used transcriptomic profiles of the cancer cell line, the L1000 drug perturbation data, and the monotherapy response of the cell line to each agent in the combination as the input data to train our model. We used elastic net as the supervised machine learning approach in our pipeline and tested our model using random sampling on the drug combination data provided in the AstraZeneca-Sanger DREAM Challenge, for all tissue types as well as for lung and breast cancer cell lines separately.
Results. Our predictions for the drug combinations provided in the challenge were significantly better than random (p < 1e-6). Although the results were promising, we went one step further and developed tissue-specific predictors for lung and breast cancer cell lines in the data set which significantly improved our predictions (p < 1e-3).
Conclusions. Using our tissue-specific in-silico pipeline, we can predict synergistic drug combinations for each tissue type in our training set. This constitutes a promising approach which can be integrated with high throughput chemical screens to improve drug combination discovery in cancer.
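For reference, the elastic net named in the Methods combines L1 and L2 penalties on the regression coefficients. One common parameterization is given below. The symbols (response vector y, feature matrix X with n rows, penalty strength λ and mixing weight α) are assumptions for illustration; the abstract does not state which form or hyperparameters were used.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)
```

Setting α = 1 recovers the lasso and α = 0 recovers ridge regression; intermediate values keep some sparsity while handling correlated predictors such as co-expressed genes.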

A genome-scale algorithmic approach for metabolic engineering of plants

Download

Date: TBA
Room: TBA

Jiun Yen, Virginia Tech, United States
Glenda Gillaspy, Virginia Tech, United States
Ryan Senger, Virginia Tech, United States

Presentation Overview:

One of the major challenges in metabolic engineering of cells to over-produce commodity chemicals is the identification of effective gene modification targets. For this reason, much focus has been on the development of computational tools to accurately predict metabolic engineering strategies. Predictive tools that utilize genome-scale metabolic flux models (GEMs) have shown promising results in engineering microbes; however, few studies have utilized GEMs of plants. This work introduces a novel algorithm called “Reverse Flux Balance Analysis with Flux Ratios” (R-FBrAtio) and deploys this algorithm to enhance cellulose production in Arabidopsis thaliana. R-FBrAtio generates gene candidates that are ranked to indicate the best targets for gene over-expression and/or knockdown. R-FBrAtio predicted many intuitive metabolic engineering strategies, including over-expression of cellulose synthase and UDP-glucose pyrophosphorylase, which had been shown to increase cellulose in previous studies. There were also many non-intuitive predictions, and this research focused on experimental validation of one of them, the over-expression of mitochondrial malate dehydrogenase (mMDH). This was done by generating multiple transgenic plant lines with upregulated mMDH2:2HA and examining cellulose content. Characterization of mMDH2:2HA plants showed increased biomass and altered morphology. Analysis of the crystalline cellulose revealed a 30% increase in the stem.

Compositional Epistasis Detection Using A Few Prototype Disease Models

Download

Date: TBA
Room: TBA

Lu Cheng, University of Waterloo, Canada
Mu Zhu, University of Waterloo, Canada

Presentation Overview:

Failure of replication for single-locus effects in genome-wide association studies (GWAS) motivates the exploration of epistasis (interaction effects) for human complex diseases. Existing methods mostly target epistasis in the statistical sense, i.e., deviation from additive effects, which is believed to be of limited help for understanding the biological mechanism. Of more biological relevance is the so-called "compositional epistasis", termed by Phillips (2008), which bears the original meaning of "masking effect" from when the term was coined.

It is straightforward to model compositional epistasis by genetic disease models. There are 512 two-locus, two-allele, two-phenotype and complete-penetrance disease models (Li et al 1999). Studying all of them not only lends insight to the exact epistasis mechanism, but also enhances the power for locus discoveries. Studies of single SNP effects show that the maximal detection power is achieved when the correct genetic disease model is used, i.e., when the testing of SNPs is done by using the genetic disease model that matches the actual underlying mode of inheritance. Hence to achieve high power for epistasis detection, it is imperative to determine proper disease models to test.

One direct way is to check all of them and use the one that best fits the current data. However, this causes not only a computational burden but also multiple-testing problems. Observing that the disease models are similar to each other, we propose testing only a few representative ones, similar to what principal component regression does. We define a novel "distance" metric to measure how different two disease models are and then use it to group disease models into a few clusters. We find that the 512 disease models form 6 clusters most of the time, and a prototype disease model selected from each cluster serves as a good representative model that can be used for epistasis loci detection. It is worth mentioning that clustering epistasis models not only alleviates the aforementioned computational and multiple-testing problems, but also allows us to better understand and characterize different disease models for future research.
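The count of 512 follows from the model definition: two loci with three genotypes each give 9 genotype combinations, and under complete penetrance each combination is either affected or unaffected, so there are 2^9 = 512 models. The sketch below enumerates them. The Hamming distance shown is only a stand-in to illustrate what a distance between models looks like; it is not the novel metric defined in this work, which the abstract does not specify.

```python
from itertools import product

def all_disease_models():
    """Yield every two-locus, complete-penetrance model as a 3x3 tuple of 0/1.

    Rows index the genotype at locus 1 and columns the genotype at locus 2
    (0, 1 or 2 copies of one allele); 1 means affected, 0 unaffected.
    """
    for bits in product((0, 1), repeat=9):
        yield tuple(tuple(bits[3 * i: 3 * i + 3]) for i in range(3))

def hamming(m1, m2):
    """Number of genotype cells on which two models disagree.

    Illustrative placeholder distance only (an assumption of this sketch).
    """
    return sum(a != b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

models = list(all_disease_models())
print(len(models))  # 512

# Example: a model where disease requires at least one copy at both loci.
both_loci = tuple(tuple(int(i > 0 and j > 0) for j in range(3)) for i in range(3))
nearest = sorted(models, key=lambda m: hamming(m, both_loci))[:3]
print(nearest[0] == both_loci)
```

Any pairwise distance of this kind can be fed to a standard clustering routine, after which one prototype per cluster is tested instead of all 512 models.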

By carrying out simulation studies on some popular disease models, we observe that our approach provides satisfactory power when compared with the two most relevant methods, i.e., MDR and the complete compositional epistasis detection approach by Wan et al (2013). For certain heterogeneous models that involve at least two pairs of SNPs contributing to the disease, our approach performs better than the other two methods. In some other cases, our method may perform worse. The causes are explored and two alternative methods are proposed, which use more refined ways to determine prototype disease models.

In summary, there is a limited amount of work devoted to complete compositional epistasis detection. The underlying variable selection/screening task is complicated by the need to determine the interaction form, or disease model, before selection/screening can be carried out. Our approach --- first finding a few prototype disease models and then using them to perform screening --- complements MDR and the method by Wan et al (2013) in detecting more biologically relevant epistasis.

Integrative analysis for identification and functional prediction of long non-coding RNAs in cancer

Download

Date: TBA
Room: TBA

Musaddeque Ahmed, UHN, Canada
Haiyang Guo, UHN, Canada
He Housheng, UHN, Canada

Presentation Overview:

Trait-associated SNPs identified through Genome-Wide Association Studies are enriched in regulatory regions. However, the functional link between these SNPs and their target genes remains elusive, particularly for non-protein-coding genes. Due to their involvement in fundamental biological processes, the largest class of non-protein-coding genes, long noncoding RNAs (lncRNAs), represents an attractive class of candidates to mediate cancer risk. We have developed an integrative computational method that can identify lncRNAs that may have a potential role in the development and/or progression of any cancer type. Our prediction algorithm integrates outputs from multiple data types including expression data, chromatin accessibility, genomic occupancy and genotyping data from both cell lines and patient samples. Applying our method to the lncRNA transcriptome with genomic and prostate cancer GWAS SNP data, we identified 45 candidate lncRNAs associated with prostate cancer risk. The top hit from our algorithm is PCAT1, the expression of which we found to be affected by a SNP, rs7463708, through modulation of an enhancer region 78 kb downstream of the PCAT1 transcription start site. Further analysis suggested that this enhancer is likely prostate cancer specific and that the risk allele of rs7463708 increases the binding of a transcription factor, ONECUT2. The efficacy of our prediction algorithm is further supported by the recovery of lncRNAs that were previously reported to be associated with prostate cancer, such as H19 and KCNQ1OT1. Our method provides a novel and effective approach to pinpoint lncRNAs that are functionally critical in the development or progression of any disease.

Insights into the metatranscriptome and metabolome of the vaginal microbiome

Download

Date: TBA
Room: TBA

Jean Megan Macklaim, University of Western Ontario, Canada
Amy McMillan, University of Western Ontario, Canada
Mark Sumarah, Agriculture and Agri-food Canada, Canada
Jonathan Swann, Division of Computational and Systems Medicine, Department of Surgery and Cancer, Canada
Gregor Reid, University of Western Ontario, Canada
Greg Gloor, University of Western Ontario, Canada

Presentation Overview:

A major challenge of any microbiome investigation is to determine the role of the microbes in the environment, and the effects on the host or system. High-throughput (HT) sequencing and small molecule analyses provide an overview of the function of the entire microbiome, which can be thought of as a “metaorganism”. Analyzing such data to make biological inferences is challenging due to the multivariate nature of the data, the scale of output from high-throughput experiments, and the complexity of the interactions and fluctuations of the biological system. In any HT sequencing output (16S rRNA gene sequencing, metatranscriptomics, and metagenomics) the data are expressed as parts of the whole, and are therefore bound by the requirement for compositional data analysis (CoDa).
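A standard CoDa tool for data that are "parts of the whole" is the centered log-ratio (clr) transform, which expresses each feature relative to the geometric mean of its sample. The sketch below is a minimal illustration rather than the analysis pipeline used in this study; the counts and the pseudocount default are assumptions.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    The pseudocount (an assumption of this sketch) avoids log(0) for
    features with zero reads; other zero-handling strategies exist.
    """
    adjusted = [c + pseudocount for c in counts]
    logs = [math.log(v) for v in adjusted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]

shallow = [12, 3, 85]             # hypothetical read counts for three features
deep = [c * 10 for c in shallow]  # same proportions at ten times the depth

# With no pseudocount the transform depends only on ratios between
# features, so the difference in sequencing depth cancels out.
a = clr(shallow, pseudocount=0.0)
b = clr(deep, pseudocount=0.0)
print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))
```

Because clr values always sum to zero within a sample, downstream statistics operate on relative rather than absolute abundances, which is what sequencing actually measures.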

We used the vaginal microbiome as a model system to demonstrate the relationship between the bacterially expressed mRNAs (the metatranscriptome) and the products of metabolism (the metabolome). Using a CoDa framework, we identified novel functional profiles of the vaginal microbiome associated with healthy and dysbiotic conditions. Our strategy involved mapping mRNA reads to a reference library, assembling unmapped reads, and grouping the reads by functional categories. We show that the transcriptional components of specific taxa (Megasphaera and Prevotella) associate with the vaginal microbiome subgroups, while others (Gardnerella, Lactobacillus iners) do not discriminate between the subgroups. The power to separate subgroups transcriptionally was increased by aggregating reads into functional groups rather than individual organisms.

Despite significant taxonomic variability within each subgroup, we found core metabolic products separating health and dysbiosis. We show correlations between the small molecules and the transcripts detected in key metabolic pathways of the condition: amino acids and polyamines, endproducts of anaerobic metabolism, and structural components that could be biomarkers of disease. This study underscored the importance and value of a multiomics approach to understanding an alteration in a microbiome state.

Statistical, Visual and Functional Analysis of 16S rRNA Marker Gene Data

Download

Date: TBA
Room: TBA

Achal Dhariwal, McGill University, Canada
Jeff Xia, McGill University, Canada

Presentation Overview:

Metagenomics studies aim to understand the composition and function of uncultured microbial communities. Nowadays, 16S rRNA-based marker gene sequencing is widely used to characterize the diversity of complex microbial communities. Functional and statistical analysis and visualization of such data pose great challenges. Many tools or approaches provide pipelines for performing microbiome analysis on such data to understand microbiome composition and function. However, many aspects of the current approaches can be improved to gain a deeper understanding of communities. Hence, we introduce MetagenomeNet, a user-friendly, high-performance tool for comprehensive analysis of 16S rRNA metagenomic data. The key features of MetagenomeNet include: (i) a knowledgebase comprising taxonomic profiling from multiple databases (Greengenes, SILVA and RDP), allowing users to input OTU tables mapped from various 16S rRNA analysis pipelines for functional profiling; (ii) support for differentially abundant feature analysis for marker gene data, along with discerning gene- and taxon-specific associations for comparative analysis; (iii) a powerful, fully featured network visualization system at the gene level (metabolic networks) and at the taxon level (correlation network). Such a tool will provide system-level insight, help in understanding microbiome function and predicting biomarkers, and also provide multiple alternative interpretations and hypothesis generation for various pathophysiological states.

Global Peak Alignment for Comprehensive Two-Dimensional Gas Chromatography Mass Spectrometry Using Point Matching Algorithms

Download

Date: TBA
Room: TBA

Beichuan Deng, Department of Mathematics, Wayne State University, United States
Hengguang Li, Department of Mathematics, Wayne State University, United States
Xiang Zhang, Department of Chemistry, University of Louisville, United States
Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University, United States

Presentation Overview:

Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC×GC-MS) has been used to analyze multiple samples in a metabolomics study. However, due to some uncontrollable experimental conditions, such as differences in temperature or pressure, matrix effects on samples, and stationary phase degradation, there is always a shift of retention times in the two GC columns between samples. To correct the retention time shifts in GC×GC-MS, peak alignment is a crucial data analysis step that recognizes the peaks generated by the same metabolite in different samples. Two approaches have been developed for GC×GC-MS data alignment: profile alignment and peak matching alignment. However, these existing alignment methods are all based on local alignment, so a peak may be misaligned in a dense chromatographic region where many peaks are present in a small area. False alignment will result in false discoveries in the downstream statistical analysis. We therefore developed a global-comparison-based peak alignment method using a point matching algorithm (PMA-PA) for both homogeneous and heterogeneous data. PMA-PA first extracts feature points in the chromatogram and then searches globally for the matching peaks in the consecutive chromatogram by adopting the projection of rigid and non-rigid transformations. Simulation studies show that PMA-PA outperforms the existing alignment algorithms in terms of F1 score, although it uses only peak location information.

Balancing mRNA and Protein Levels in a Demand-Directed Dynamic Flux Balance Analysis Describes Effects of the Transition of an Anaerobic Escherichia coli Culture to Aerobic Conditions

Download

Date: TBA
Room: TBA

Joachim von Wulffen, University of Stuttgart, Institute for System Dynamics, Germany
Oliver Sawodny, University of Stuttgart, Institute for System Dynamics, Germany
Ronny Feuer, University of Stuttgart, Institute for System Dynamics, Germany

Presentation Overview:

The facultative anaerobic bacterium Escherichia coli is frequently forced to adapt to changing environmental conditions. One important determinant for metabolism is the availability of oxygen allowing a more efficient metabolism. Especially in large scale bioreactors the distribution of oxygen is inhomogeneous and individual cells encounter frequent changes. This might contribute to observed yield losses during process upscaling. Short-term gene expression data exist of an anaerobic E. coli batch culture shifting to aerobic conditions. The data reveal temporary upregulation of genes that are less efficient in terms of energy conservation than the genes predicted by conventional flux balance analyses.

In this study, we provide evidence that a positive correlation between metabolic fluxes and gene expression exists. We then hypothesize that the more efficient enzymes are limited by their low expression, which restricts flux through their reactions, and we define a demand that triggers expression of the demanded enzymes, which we explicitly include in our model. With these features we propose a method, demand-directed dynamic flux balance analysis (dddFBA), bringing together elements of several previously published methods. The introduction of additional flux constraints proportional to gene expression provokes a temporary demand for less efficient enzymes, which is in agreement with the transient upregulation of these genes observed in the data.

In the proposed approach, the applied objective function of growth rate maximization together with the introduced constraints triggers expression of metabolically less efficient genes. This finding is one possible explanation for the yield losses observed in large-scale bacterial cultivations where a steady oxygen supply cannot be guaranteed.

Ligand similarity complements sequence, physical interaction, and co-expression for gene function prediction

Download

Date: TBA
Room: TBA

Matthew O'Meara, University of California at San Francisco, Department of Pharmaceutical Chemistry, United States
Sara Ballouz, Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, United States
Brian Shoichet, University of California at San Francisco, Department of Pharmaceutical Chemistry, United States
Jesse Gillis, Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, United States

Presentation Overview:

The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63-0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.
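To make the evaluation concrete, the sketch below shows a simple neighbor-voting form of guilt-by-association scoring and an AUROC computed through the rank-sum identity. The toy network, gene names and weights are hypothetical, and a real evaluation would hide known annotations in cross-validation; this is a minimal illustration, not the authors' pipeline.

```python
def neighbor_voting(network, annotated):
    """Score each gene by the weighted fraction of its neighbors that are annotated."""
    scores = {}
    for gene, neighbors in network.items():
        total = sum(neighbors.values())
        hits = sum(w for g, w in neighbors.items() if g in annotated)
        scores[gene] = hits / total if total else 0.0
    return scores

def auroc(scores, labels):
    """AUROC via the Mann-Whitney rank-sum identity; tied scores share an average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical weighted network: gene -> {neighbor: edge weight}.
network = {
    "g1": {"g2": 1.0, "g3": 1.0},
    "g2": {"g1": 1.0, "g4": 0.5},
    "g3": {"g1": 1.0},
    "g4": {"g2": 0.5},
}
votes = neighbor_voting(network, annotated={"g2"})
print(votes["g1"])  # 0.5
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Combining networks can be as simple as summing their edge weights before voting, which is one way aggregate performance can exceed that of any single data modality.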

Simulating Next-Generation Sequencing Datasets From Empirical Mutation and Sequencing Models

Download

Date: TBA
Room: TBA

Zachary Stephens, University of Illinois at Urbana-Champaign, United States
Matthew Hudson, University of Illinois at Urbana-Champaign, United States
Liudmila Mainzer, University of Illinois at Urbana-Champaign, United States
Morgan Taschuk, Ontario Institute for Cancer Research, Canada
Matthew Weber, University of Illinois at Urbana-Champaign, United States
Ravishankar Iyer, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

An obstacle to the validation and benchmarking of methods for the analysis of genomes is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read and alignment simulators can fulfill this requirement. The most flexible approach to quantifying the accuracy of next-generation sequencing (NGS) variant calling pipelines is to utilize simulated read data, where the ground truth (correct read mapping positions, variant locations) is known a priori. However, simulated data is often criticized for limited resemblance to true data, and some simulation tools can be inflexible. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but additional scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters, which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads.

Estimation of Free Energy Contribution of Protein Residues as Feature for Structure Prediction from Sequence

Download

Date: TBA
Room: TBA

Sumaiya Iqbal, University of New Orleans, United States
Md Tamjidul Hoque, University of New Orleans, United States

Presentation Overview:

A feature that can map one-dimensional sequence information into three-dimensional information is crucial for solving complex protein structure prediction problems. With this in view, based on the contact energy and the predicted relative solvent accessibility (RSA), we propose a novel approach to estimate position specific estimated energy (PSEE) per residue from sequence alone. PSEE can identify the structured as well as the unstructured, or intrinsically disordered, regions of a protein by computing favorable and unfavorable energy respectively, characterized by an appropriate threshold. This intriguing feature of PSEE, verified empirically, suggests that PSEE can effectively classify disordered versus ordered residues and can segregate secondary structure components by computing their constituent energies. PSEE-based residue characterization also correlates strongly with hydrophobicity indices. Further, PSEE can detect the existence of critical binding regions that essentially undergo disorder-to-order transitions to perform crucial biological functions. As an application, we rigorously tested disorder prediction using the PSEE feature and found that PSEE helps the predictor perform consistently better.

Expanding the UniFrac toolbox

Download

Date: TBA
Room: TBA

Ruth G. Wong, University of Western Ontario, Canada
Jia R. Wu, University of Western Ontario, Canada
Gregory B. Gloor, University of Western Ontario, Canada

Presentation Overview:

Microbiome analysis is frequently performed using the UniFrac distance metric to separate groups. Here we demonstrate that unweighted UniFrac is highly sensitive to rarefaction instance and to sequencing depth in uniform data sets. We show that this arises because of subcompositional effects. We introduce information UniFrac and centered ratio UniFrac, two new weightings that are not sensitive to rarefaction and allow greater separation of outliers than classic unweighted and weighted UniFrac. With this expansion of the UniFrac toolbox, we hope to empower researchers to extract more varied information from their data.
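The sensitivity of unweighted UniFrac to rarefaction can be seen on a toy example. The sketch below computes classic unweighted UniFrac as the fraction of branch length leading to taxa found in only one of two samples. The tree, taxon names and sample contents are hypothetical, and the two new weightings introduced in this work are not implemented because the abstract does not give their formulas.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Unweighted UniFrac: unique branch length divided by total observed branch length.

    `branches` is a list of (length, set_of_leaf_taxa_below_branch) pairs,
    and each sample is the set of taxa detected in it.
    """
    unique = 0.0
    total = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total else 0.0

# Toy tree ((t1:1, t2:1):1, (t3:1, t4:1):1) written as one entry per branch.
branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

full = {"t1", "t2", "t3"}  # taxa detected before subsampling
rarefied = {"t1", "t2"}    # same community after a rare taxon drops out

print(unweighted_unifrac(branches, full, full))      # 0.0
print(unweighted_unifrac(branches, full, rarefied))  # 0.4
```

Here losing a single rare taxon during subsampling moves the same community to a distance of 0.4 from itself, because presence or absence is all the unweighted metric sees.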

An unlabeled-negative learning framework for human enhancer prediction based on low-methylation regions

Download

Date: TBA
Room: TBA

Jingting Xu, University of Illinois at Chicago, United States
Hong Hu, University of Illinois at Chicago, United States
Yang Dai, University of Illinois at Chicago, United States

Presentation Overview:

Background The identification of enhancers is a challenging task. Various types of epigenetic information, including histone modification, have been utilized in the construction of enhancer prediction models based on a diverse panel of machine learning models. However, DNA methylation profiles generated from whole genome bisulfite sequencing (WGBS) have not been fully explored for their potential in enhancer prediction, despite the fact that low methylated regions (LMRs) have been implied to be distal to transcription start sites and active in the regulation of target genes.

Method In this work we propose an unlabeled-negative learning framework using a weighted support vector machine model to build prediction models based on LMRs from cell-type-specific WGBS DNA methylation profiles. The unlabeled LMR set is further divided into reliable positive, likely positive and likely negative subsets according to their resemblance to a small set of experimentally validated enhancers in the VISTA database, based on their non-parametric density distributions.

Results We demonstrate the performance of LMR-wSVM by using the WGBS DNA methylation profile derived from the ES H1 cell line. Our results show that the predicted enhancers are highly conserved and the validation rate ranges from 68.92% to 83.4% for sets of 64,791 to 29,818 predicted enhancers. The performance of our models is competitive with or even better than that of the best-performing existing methods.

Conclusion Our work suggests that low methylated regions detected from WGBS data are useful for developing models for the prediction of cell-type-specific enhancers.

WEVOTE: Weighted Voting Taxonomic Identification Method of Microbial Sequences

Download

Date: TBA
Room: TBA

Ahmed Metwally, University of Illinois at Chicago, United States
Yang Dai, University of Illinois at Chicago, United States
Patricia Finn, University of Illinois at Chicago, United States
David Perkins, University of Illinois at Chicago, United States

Presentation Overview:

Background: Metagenome shotgun sequencing presents opportunities to identify organisms that may prevent or promote disease. Analysis of sample diversity is achieved by taxonomic identification of metagenomic reads followed by generating an abundance profile. However, existing taxonomic identification tools with the best precision and practical performance still lack sensitivity. Moreover, methods with the highest sensitivity suffer from low precision and low specificity, along with long computation times.

Methods: In this paper, we present WEVOTE (WEighted VOting Taxonomic idEntification), a method that classifies whole genome shotgun sequencing DNA reads based on an ensemble of existing methods using k-mer based, marker-based, and naive-similarity approaches. Our evaluation based on three benchmarking datasets shows that WEVOTE reduces the false positives to half of that produced by the other highly sensitive tools while preserving the same level of sensitivity.

Conclusions: WEVOTE is an automated efficient tool that combines individual taxonomic identification methods. It is expandable and has the potential to reduce the false positives and produce more accurate taxonomic identification for microbiome data. The WEVOTE framework is written in C++, Perl, and shell scripting.
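The general idea of combining several classifiers by weighted voting can be sketched as follows. This is a generic illustration under stated assumptions, not WEVOTE's actual decision scheme, which the abstract does not detail: the tool names, weights and the `min_support` threshold are all hypothetical.

```python
from collections import defaultdict

def weighted_vote(calls, weights, min_support=0.5):
    """Pick the taxon with the largest total weight for one read.

    `calls` maps tool name -> taxon (or None when the tool made no call).
    Returns None if the winner holds less than `min_support` of the total
    weight of all tools, which trades some sensitivity for fewer false positives.
    """
    totals = defaultdict(float)
    for tool, taxon in calls.items():
        if taxon is not None:
            totals[taxon] += weights.get(tool, 1.0)
    if not totals:
        return None
    all_weight = sum(weights.get(tool, 1.0) for tool in calls)
    # Sorting first makes tie-breaking deterministic (alphabetical).
    best = max(sorted(totals), key=lambda taxon: totals[taxon])
    return best if totals[best] / all_weight >= min_support else None

# Hypothetical tools and equal weights, for demonstration only.
weights = {"kmer_tool": 1.0, "marker_tool": 1.0, "similarity_tool": 1.0}
read_calls = {
    "kmer_tool": "Escherichia coli",
    "marker_tool": "Escherichia coli",
    "similarity_tool": "Shigella flexneri",
}
print(weighted_vote(read_calls, weights))  # Escherichia coli
```

Raising `min_support` leaves more reads unclassified but requires stronger agreement between tools before a call is made.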

Bayesian Correlation Analysis for Sequence Count Data

Download

Date: TBA
Room: TBA

Daniel Sanchez-Taltavull, OHRI, Canada
Parameswaran Ramachandran, OHRI, Canada
Nelson Lau, OHRI, Canada
Theodore Perkins, Ottawa Health Research Institute, Canada

Presentation Overview:

Measuring similarity between different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities' measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose occupancy is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low, especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities' signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on RNA-seq data describing gene expression across a collection of human tissue types.

Dissecting the expression relationships between RNA-binding proteins and their cognate targets in eukaryotic post-transcriptional regulatory networks

Download

Date: TBA
Room: TBA

Sneha Nishtala, Indiana University Purdue University Indianapolis (IUPUI), United States
Yaseswini Neelamraju, Indiana University Purdue University Indianapolis (IUPUI), United States
Sarath Chandra Janga, Indiana University Purdue University Indianapolis (IUPUI), United States

Presentation Overview:

RNA-binding proteins (RBPs) are a class of regulatory molecules pivotal in orchestrating several steps in the metabolism of RNA in eukaryotes, thereby controlling an extensive network of RBP-RNA interactions. In this study, we employ CLIP-seq datasets for 50 human RBPs and RIP-chip data for 69 yeast RBPs to construct a network of genome-wide RBP-target RNA interactions for each RBP. Using these datasets, we studied the expression association (measured as correlation) of RBPs at both transcriptomic and proteomic levels with their experimentally known target transcripts across 16 human tissues and 18 experimental conditions in yeast. We show that in humans the majority (~78%) of the RBPs are strongly associated with their target transcripts at the transcript level, while ~96% of the studied RBPs were found to be strongly associated with the expression levels of target transcripts when protein expression levels of RBPs were employed. Further analysis revealed that, based on the observed distribution of RBP-RNA correlations compared to a null distribution, RBPs can be classified into three classes: significantly congruent (SC), RBPs which exhibit significantly higher correlation than expected; significantly incongruent (SIC), RBPs which exhibit significantly lower correlation than expected; and no significant change (NSC), RBPs which exhibit no association with their targets. At the transcript level, RBP-RNA interaction data for the yeast genome exhibited a strong association for 57% of the RBPs, confirming that our observed association between RBPs and their targets is conserved across large phylogenetic distances.
Further analysis to uncover the features contributing to these associations, using elastic net as well as multivariate regression modelling, revealed the significant contribution of features such as the number of target transcripts and the length of the selected protein-coding transcript of an RBP at the transcript level, while the intensity of the CLIP signal, the number of RNA-binding domains, and the location of the binding site on the transcript were found to be significant at the protein level. Our analysis provides a comprehensive understanding of the relation between the expression levels of RBPs and their targets, with specific insights into the factors contributing to the observed association, and will contribute to improved modelling and prediction of post-transcriptional networks.

Metabolomics and Cheminformatics analysis guiding the Discovery of Antifungal Metabolites for Crop Protection

Download

Date: TBA
Room: TBA

Miroslava Cuperlovic-Culf, National Research Council of Canada, Canada
Nandhakishore Rajagopalan, National Research Council of Canada, Canada
Dan Tulpan, National Research Council of Canada, Canada
Michele Loewen, National Research Council of Canada, Canada

Presentation Overview:

Fusarium head blight (FHB), also known as scab or tombstone, is a devastating disease of wheat, barley, oats and other small-grain cereals as well as corn, caused primarily by Fusarium graminearum. Several cultivars of wheat have developed some level of resistance to FHB. Resistance to this fungal pathogen includes specific metabolic responses to inoculation. A number of published metabolomics studies have determined major metabolic changes induced by the pathogen in resistant and susceptible plants. The function of the majority of these metabolites in resistance remains, however, unknown. In this work we have compiled all metabolites determined to selectively accumulate following FHB inoculation in resistant plants. Characteristics as well as possible functions and targets of these plant metabolites are investigated using cheminformatics approaches. A particular focus has been on the likelihood of these metabolites targeting specific proteins and acting as drug-like molecules. Results of computational analyses of binding properties of several representative metabolites to homology models of proteins are presented. Theoretical analysis highlights the possibility of strong inhibitory activity of several metabolites against some major proteins in F. graminearum such as carbonic anhydrases and cytochrome P450s. Activity of several of these compounds has been experimentally confirmed in fungal growth inhibition assays.

An unsupervised kNN Method to systematically detect changes in protein localization in high-throughput microscopy images

Download

Date: TBA
Room: TBA

Alex Lu, Department of Computer Science, University of Toronto, Canada
Alan Moses, Department of Computer Science and Department of Cells and System Biology, University of Toronto, Canada

Presentation Overview:

Despite the importance of characterizing genes that exhibit subcellular localization changes between conditions in proteome-wide imaging experiments, many recent studies still rely upon manual evaluation to assess the results of high-throughput imaging experiments. We describe and demonstrate an unsupervised k-nearest neighbours method for the detection of localization changes. Compared to previous classification-based supervised change detection methods, our method is much simpler and faster, and operates directly on the feature space, avoiding the need to manually curate training sets that may not generalize well between screens. In addition, the output of our method is flexible in its utility, generating both a quantitatively ranked list of localization changes that permits user-defined cut-offs, and a vector for each gene describing the feature-wise direction and magnitude of localization changes. We demonstrate that our method is effective at detecting localization changes using the Δrpd3 perturbation in Saccharomyces cerevisiae, where we capture 71.4% of previously known changes within the top 10% of ranked genes, and find at least four new localization changes within the top 1% of ranked genes. The results of our analysis indicate that simple unsupervised methods may be able to identify localization changes in images without laborious manual image labelling steps.
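As a rough illustration of how an unsupervised k-nearest-neighbour change score could produce both a ranked list and a per-gene change vector, the sketch below uses one plausible reading of the idea. The function name `knn_change_scores`, the choice of Euclidean distance, and the use of the neighbours' mean perturbed profile as the expectation are assumptions made for this example, not details taken from the abstract.

```python
import math


def knn_change_scores(reference, perturbed, k=2):
    """Score each gene by how far its perturbed profile departs from expectation.

    reference, perturbed: dict mapping gene -> list of feature values.
    The expectation for a gene is the mean perturbed profile of its k nearest
    neighbours in the reference condition (an assumption of this sketch).
    Returns dict gene -> (score, change_vector).
    """
    genes = sorted(reference)
    results = {}
    for gene in genes:
        # k nearest neighbours in reference feature space, excluding the gene itself
        others = [h for h in genes if h != gene]
        others.sort(key=lambda h: math.dist(reference[gene], reference[h]))
        neighbours = others[:k]
        dim = len(reference[gene])
        expected = [
            sum(perturbed[h][i] for h in neighbours) / len(neighbours)
            for i in range(dim)
        ]
        # feature-wise direction and magnitude of the unexpected change
        change = [perturbed[gene][i] - expected[i] for i in range(dim)]
        score = math.sqrt(sum(c * c for c in change))
        results[gene] = (score, change)
    return results


# Toy data: gene D shifts strongly in the first feature after perturbation.
reference = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1], "D": [0.1, 0.1]}
perturbed = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [0.0, 0.1], "D": [3.0, 0.1]}
scores = knn_change_scores(reference, perturbed, k=2)
ranked = sorted(scores, key=lambda g: scores[g][0], reverse=True)
print(ranked)
```

A user-defined cut-off would then be applied to the ranked list, and the change vector of each retained gene indicates which features moved and in which direction.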

Using ancestral sequence reconstruction methods to predict functional evolution in cetacean rhodopsin over a major evolutionary transition

Download

Date: TBA
Room: TBA

Sarah Dungan, University of Toronto, Canada
Belinda Chang, University of Toronto, Canada

Presentation Overview:

Ancestral sequence reconstruction methods, particularly those that use probabilistic models, have become increasingly refined over recent years, which has resulted in their popular use as tools for tracing the functional evolution of proteins. Nevertheless, the robustness of reconstructed sequences is often poorly addressed, with reliance on only the most probable sequence under a single model. This is despite probabilistic models having known optimization biases towards more frequent amino acid states that can subsequently lead to biased inferences of ancestral protein function. The dim-light visual protein, rhodopsin, was recently shown to be under positive selection in cetaceans, with accompanying functional shifts that suggest divergence and adaptation within Cetacea to different underwater light environments. Nevertheless, the evolution of dim-light vision at the origin of Cetacea as they transitioned to aquatic environments remains unexplored. Because cetacean rhodopsin is highly conserved, yet has strong signatures of functional evolution, it is an ideal system in which to test the application of ancestral sequence reconstruction. We compare commonly used amino acid and codon-based likelihood models to reconstruct the rhodopsin sequences from the ancestral cetacean, and the common ancestor of cetaceans with their nearest hippopotamid and ruminant relatives (Whippomorpha and Cetruminantia). Specifically, we determine whether different models result in the same most probable ancestral sequences, and within-model, which ancestral sites vary after randomly sampling from the empirical Bayesian posterior probability distribution. We then construct homology models of the different ancestral protein 3D structures to assess the likelihood that uncertain sites will impact rhodopsin functions. 
By evaluating model uncertainty, we are able to reliably develop precise hypotheses for resolving whether functional differences in rhodopsin between extant cetaceans and outgroups are due to ancestral or derived substitutions. The single-gene focus of our work facilitates a more detailed discussion of protein structure-function in an evolutionary context, thus providing necessary foundations for future investigations that can evaluate our predictions experimentally.

Base-By-Base v3: new tools for the comparative analysis of genomes

Download

Date: TBA
Room: TBA

Chad Smithson, University of Victoria, Canada
Chris Upton, University of Victoria, Canada

Presentation Overview:

Base-By-Base (BBB) is a Java tool to create, edit and analyse multiple sequence alignments (MSA) of proteins, genes and large viral genomes. A number of significant new analysis features have been added to this release of the software:

1) CODEHOP, a tool to design sets of degenerate PCR primers for the detection of distant homologs has been rewritten in Java and included as a feature of BBB.
2) FIND DIFFERENCES is a powerful feature for the exploration of which SNPs contribute to phylogenetic assignments. From a set of aligned DNA sequences, the user can count and display nucleotides that are a) unique to one or more sequences; b) present in sequence X, but not in sequences Y and Z; c) present in sequences A, B and C, but not in sequences X, Y and Z with tolerance to a number of failures in this matching. This analysis has been employed to detect small regions of recombination in the smallpox virus that may have been involved in the switch to human hosts.
3) SNIP allows the removal of nucleotides (conversion to consensus sequence) that are present in only one sequence of an MSA. This provides a useful simplification of MSAs when looking for recombination events (patterns of SNPs) in MSAs.
4) Individual amino acids present above a user-defined frequency in protein MSAs can be highlighted.
5) Graphs can be drawn to display % similarity between proteins in MSAs.
6) MAFFT and ClustalO have been included for alignment options.
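
The kind of query described under FIND DIFFERENCES in item 2 can be sketched as follows. This is a minimal illustration of the idea only, not Base-By-Base code: the function name `find_differences`, its parameters, and the way failures are counted across both groups are assumptions.

```python
def find_differences(alignment, include, exclude, max_failures=0):
    """Find (column, nucleotide) pairs shared by `include` and absent from `exclude`.

    alignment: dict mapping sequence name -> aligned DNA string (equal lengths).
    max_failures: how many sequences, across both groups, may break the pattern
    (an assumed interpretation of "tolerance to a number of failures").
    """
    length = len(next(iter(alignment.values())))
    hits = []
    for col in range(length):
        for base in "ACGT":
            # include-group sequences that lack the base count as failures
            failures = sum(alignment[name][col] != base for name in include)
            # exclude-group sequences that carry the base also count as failures
            failures += sum(alignment[name][col] == base for name in exclude)
            present = any(alignment[name][col] == base for name in include)
            if present and failures <= max_failures:
                hits.append((col, base))
    return hits


alignment = {
    "A": "ACGTA", "B": "ACGTA", "C": "ACTTA",
    "X": "ATGTA", "Y": "ATGTC", "Z": "ATGTA",
}
strict = find_differences(alignment, ["A", "B", "C"], ["X", "Y", "Z"])
tolerant = find_differences(alignment, ["A", "B", "C"], ["X", "Y", "Z"], max_failures=2)
print(strict, tolerant)
```

Columns are reported with 0-based indices here; a real tool would more likely report 1-based alignment positions.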

Multi-genome Scaffold Co-Assembly Based on the Analysis of Gene Orders and Genomic Repeats

Download

Date: TBA
Room: TBA

Sergey Aganezov, Computational Biology Institute & Department of Mathematics, The George Washington University, United States
Max Alekseyev, George Washington University, United States

Presentation Overview:

Advances in DNA sequencing technology over the past decades have increased the volume of raw sequenced genomic data available for further assembly and analysis. While many software tools exist for the assembly of sequenced genomic material, they often have difficulty reconstructing complete chromosomes. Major obstacles include uneven read coverage and the presence of long similar DNA subsequences (repeats). Genome assemblers are therefore often able to reliably reconstruct only long fragments, called scaffolds. We present a method for simultaneous co-assembly of all fragmented genomes (represented as collections of scaffolds rather than chromosomes) in a given set of annotated genomes. The method is based on the analysis of gene orders and relies on an evolutionary model that includes genome rearrangements as well as gene insertions and deletions. It can also utilize information about genomic repeats and the phylogenetic tree of the given genomes, further improving their assembly quality.

Machine Learning Approaches for Breast Cancer Subtypes Reveal Key Genes as Potential Biomarkers

Download

Date: TBA
Room: TBA

Michele D'Agnillo, Department of Biological Sciences, University of Windsor, Canada
Iman Rezaeian, University of Windsor, Canada
Alioune Ngom, School of Computer Science, University of Windsor, Canada
Luis Rueda, University of Windsor, Canada

Presentation Overview:

Worldwide, breast cancer is the second leading cause of death among women, and one in nine women is diagnosed with breast cancer in her lifetime. Accurate diagnosis of the specific subtype of this disease is a vital step in determining an appropriate therapy for a patient. In this study, we use machine learning approaches to identify the most informative genes that can best discriminate the ten subtypes of breast cancer. In particular, we use a bottom-up hierarchical classification approach to select the most informative genes for different subtypes. This approach clusters the subtypes based on their similarity and produces a tree-based model with a semi-balanced topology. We also use different classification methods and perform an in-depth comparison of their performance using different performance measures on the METABRIC dataset consisting of 997 samples. Our results indicate that this approach to gene selection and breast cancer subtyping yields a small subset of genes that can predict each of these ten subtypes with an accuracy of at least 95%. Moreover, the machine learning model provides an insightful structure for further analysis of these subtypes.
We have further analyzed the functions of three genes identified by the machine learning approaches: USP21, PTRH2 and TACO1. Differential expression of USP21 discriminates Subtype-7 from Subtype-8. USP21 encodes a protein that catalyzes intracellular protein degradation. PTRH2 and TACO1 discriminate Subtype-1 from Subtype-3. PTRH2 encodes a mitochondrial enzyme, which degrades peptidyl tRNA. The downregulation of PTRH2 causes translational errors, which have been linked to tumour progression and metastasis. TACO1 encodes a mitochondrial translation activator for cytochrome c oxidase, the upregulation of which increases cellular respiration. The proposed mechanism involves the accumulation of mitochondrial protein, which can cause translational errors that upregulate cellular respiration, thereby increasing the risk of oncogenesis.

Pan-Cancer Analyses Reveal Long Intergenic Non-Coding RNAs Relevant to Tumor Diagnosis, Subtyping and Prognosis

Download

Date: TBA
Room: TBA

Travers Ching, University of Hawaii Cancer Center, United States
Lana Garmire, University of Hawaii Cancer Center, United States

Presentation Overview:

Long intergenic noncoding RNAs (lincRNAs) are a relatively new class of non-coding RNAs that have potential as cancer biomarkers. To seek a panel of lincRNAs as pan-cancer biomarkers, we have analyzed transcriptomes from over 3300 cancer samples with clinical information. Compared to mRNA, lincRNAs exhibit significantly higher tissue specificities that are then diminished in cancer tissues. Moreover, lincRNA clustering results accurately classify tumor subtypes. Using RNA-Seq data from thousands of paired tumor and adjacent normal samples in The Cancer Genome Atlas (TCGA), we identify six lincRNAs as potential pan-cancer diagnostic biomarkers (PCAN-1 to PCAN-6). These lincRNAs are robustly validated using cancer samples from four independent RNA-Seq data sets, and are verified by qPCR in both primary breast cancers and the MCF-7 cell line. Interestingly, the expression levels of these six lincRNAs are also associated with prognosis in various cancers. We further experimentally explored the growth and migration dependence of breast and colon cancer cell lines on two of the identified lincRNAs. In summary, our study highlights the emerging role of lincRNAs as potentially powerful and biologically functional pan-cancer biomarkers and represents a significant leap forward in understanding the biological and clinical functions of lincRNAs in cancers.

Novel personalized pathway-based metabolomics models reveal key metabolic pathways for breast cancer diagnosis

Download

Date: TBA
Room: TBA

Sijia Huang, University of Hawaii at Manoa, United States
Lana Garmire, University of Hawaii Cancer Center, United States

Presentation Overview:

Background

More accurate diagnostic methods are pressingly needed to diagnose breast cancer, the most common malignant cancer in women worldwide. Blood-based metabolomics is a promising diagnostic method for breast cancer. However, many metabolic biomarkers are difficult to replicate among studies.

Methods

We propose that higher-order functional representation of metabolomics data, such as pathway-based metabolomic features, can be used as robust biomarkers for breast cancer. Towards this, we have developed a new computational method that uses personalized pathway dysregulation scores for disease diagnosis. We applied this method to predict breast cancer occurrence, in combination with correlation feature selection (CFS) and classification methods.

Results

The resulting all-stage and early-stage diagnosis models are highly accurate in two sets of testing blood samples, with average AUCs (area under the receiver operating characteristic curve) of 0.968 and 0.934, sensitivities of 0.946 and 0.954, and specificities of 0.934 and 0.918. These two metabolomics-based pathway models are further validated by RNA-Seq-based TCGA (The Cancer Genome Atlas) breast cancer data, with AUCs of 0.995 and 0.993. Moreover, important metabolic pathways, such as taurine and hypotaurine metabolism and the alanine, aspartate, and glutamate pathway, are revealed as critical biological pathways for early diagnosis of breast cancer.

Conclusions

We have successfully developed a new type of pathway-based model to study metabolomics data for disease diagnosis. Applying this method to blood-based breast cancer metabolomics data, we have discovered crucial metabolic pathway signatures for breast cancer diagnosis, especially early diagnosis. Further, this modeling approach may be generalized to other omics data types for disease diagnosis.
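
For readers less familiar with the metrics reported in this abstract, the following sketch shows how AUC, sensitivity and specificity can be computed from classifier scores. The data are invented toy values and the function names are chosen for this example; nothing here reproduces the authors' models or results.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    This is the rank-based (Mann-Whitney) formulation; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 1 = case, 0 = control (values are made up for illustration).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
auc = roc_auc(labels, scores)
sens, spec = sensitivity_specificity(labels, scores, threshold=0.65)
print(auc, sens, spec)
```

Unlike sensitivity and specificity, the AUC does not depend on a particular threshold, which is why it is often reported alongside them.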

Integrated Microbiome Resource (IMR): Developing an Open and Streamlined Experimental and Analysis Pipeline for Microbiome Research

Download

Date: TBA
Room: TBA

André M. Comeau, Dalhousie University, Canada
Gavin M. Douglas, Dalhousie University, Canada
Morgan Langille, Dalhousie University, Canada

Presentation Overview:

Microbiome studies have revolutionized the microbiology field and are becoming increasingly popular. In recent years, advances in sequencing technologies and in bioinformatic methods have led to faster and more robust methods for generating and analyzing data. The Comparative Genomics and Evolutionary Bioinformatics – Integrated Microbiome Resource (CGEB-IMR: http://cgeb-imr.ca/) has streamlined and connected each essential step of a microbiome study starting with samples and ending with various plots and tables ready for interpretation within a single workweek. Our pipeline can handle up to 380 samples per sequencing run covering a variety of amplicon targets (16S, 18S, ITS, Bar-Seq, etc.). In a little over one year of operation we have processed 7600 samples generating 600 M sequences and 336 G bases from a variety of host-associated (e.g. humans, mice, rats, fish, insects, birds, reptiles) and environmental (e.g. soil, waste water, marine) biomes. These samples encompass 75 projects from 30 principal investigators from several countries. We openly present each step of this resource including primer validation, library preparation, sequencing, quality control, paired-end assembly, taxonomic annotation, functional annotation (metagenomes), predictive functional annotation using PICRUSt (16S data), statistical evaluation, and visualization. This pipeline, Microbiome Helper (https://github.com/mlangill/microbiome_helper), is continually updated based on evolving best practices and can be replicated in other locations with only standard molecular and computational equipment, minimal personnel, and access to a bench top next-generation sequencer (e.g. Illumina MiSeq). Our results illustrate that microbiome studies can be easily conducted in various scientific settings, including for time-sensitive applications, and provide a complete experimental and analysis package that can be replicated by other microbiome researchers.

Hierarchical Clustering based on Non-negative Matrix Factorization for Time Series transcriptomes profiles

Download

Date: TBA
Room: TBA

Abed Alkhateeb, University of Windsor, Canada
Iman Rezaeian, University of Windsor, Canada
Luis Rueda, University of Windsor, Canada

Presentation Overview:

Studying the transcriptome of cancer cells at different cancer stages is essential for understanding how the disease develops. We use a dataset containing samples at different progression stages from a Chinese population. First, the sample reads were preprocessed by aligning them to the human genome; transcripts were then constructed at each cancer stage and the reads were quantified on the constructed transcripts. The final preprocessing step was to construct a matrix V whose vectors contain the transcript profiles, measured in fragments per kilobase per million reads (FPKM).
Non-negative matrix factorization (NMF) is used to obtain a parts-based representation of the data by factorizing the matrix V into two non-negative matrices, which reveals localized, interesting parts. Because the representation is non-negative, only additive combinations of parts are allowed. The method focuses on learning identifying features, so that sparse components can represent V sharply in a lower-dimensional representation. The vectors are then clustered by focusing on these sparse, sharp features and removing noisy and redundant ones. The best number of clusters (k) is determined at each stage of a hierarchical model using the purity and sparsity of the clusters.
The results show that the method finds meaningful clusters; the resulting clusters are assessed biologically using information gathered from the literature. The significant clusters are analyzed to find relationships among transcripts that follow similar trends across the different stages. We studied the functions and promoters of the genes that belong to each cluster, and found some similarities in the gene ontologies of genes within the same cluster. Further biological validation and wet-lab experiments on these transcripts are left as future work.
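
The factorization of V into two non-negative matrices can be illustrated with a small sketch. It uses the standard multiplicative update rules for the squared-error objective, which is an assumption: the abstract does not say which NMF algorithm, initialization or rank was used. The helper names and the toy matrix are likewise invented for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


# Toy "transcripts x stages" matrix with two obvious expression patterns.
v = [[5.0, 5.0, 0.0, 0.0], [4.0, 4.0, 0.0, 0.0], [0.0, 0.0, 3.0, 3.0], [0.0, 0.0, 2.0, 2.0]]
w0, h0 = nmf(v, rank=2, iterations=0)   # random starting point (same seed)
w, h = nmf(v, rank=2, iterations=200)
# One simple way to cluster: assign each row to its dominant component in W.
clusters = [max(range(2), key=lambda r: row[r]) for row in w]
print(frobenius_error(v, w0, h0), frobenius_error(v, w, h), clusters)
```

Because the updates only multiply non-negative quantities, W and H stay non-negative throughout, which is what restricts the representation to additive combinations of parts.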

Genomics and transcriptomic analysis of imatinib resistance in gastrointestinal stromal tumor

Download

Date: TBA
Room: TBA

Asmaa Elzawahry, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
Tsuyoshi Takahashi, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
Sachiyo Mimaki, Division of Translational Research, Exploratory Oncology Research and Clinical Trial Center, National Cancer Center, 6-5-1 Kashiwanoha, Kashiwa, Chiba 277-8577, Japan, Japan
Eisaku Furukawa, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
Rie Nakatsuka, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
Isao Kurosaka, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
Takahiko Nishigaki, Department of Gastroenterological Surgery, Osaka University Graduate School of Medicine, 2-2 E2, Yamadaoka, Suita City, Osaka, 565-0871, Japan, Japan
Hiromi Nakamura, Division of Cancer Genomics, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
Satoshi Serada, Laboratory for Immune Signal, National Institute of Biomedical Innovation, 7-6-8 Saito-Asagi, Ibaraki City, Osaka, 567-0085, Japan, Japan
Tetsuji Naka, Laboratory for Immune Signal, National Institute of Biomedical Innovation, 7-6-8 Saito-Asagi, Ibaraki City, Osaka, 567-0085, Japan, Japan
Seiichi Hirota, Department of Surgical Pathology, Hyogo Medical College, 1-1, Mukogawa-cho, Nishinomiya City, Hyogo, 663-8501, Japan, Japan
Tatsuhiro Shibata, Division of Cancer Genomics, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan
Katsuya Tsuchihara, Division of Translational Research, Exploratory Oncology Research and Clinical Trial Center, National Cancer Center, 6-5-1 Kashiwanoha, Kashiwa, Chiba 277-8577, Japan, Japan
Toshirou Nishida, Department of Surgery, National Cancer Center Hospital East, 6-5-1 Kashiwanoha, Kashiwa, Chiba, 277-8577, Japan, Japan
Mamoru Kato, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuuoo-ku, Tokyo 104-0045, Japan, Japan

Presentation Overview:

Background
The gastrointestinal stromal tumor (GIST) is the most common mesenchymal tumor of the digestive tract, whose proliferation is driven by gain-of-function mutations in KIT. This characteristic has facilitated the development of targeted therapies with tyrosine kinase inhibitors, such as imatinib. Although many clinical studies have demonstrated the revolutionary effects of imatinib, more than 80% of patients eventually develop disease progression driven by secondary resistance mutations located in the KIT kinase domains. However, the full spectrum of genomic and transcriptomic changes behind the resistance remains unknown.

Results
This study analyzed genomic and transcriptomic changes in imatinib-sensitive and imatinib-resistant cell lines. We also examined an “intermediate” cell line that had not yet reached full resistance. We identified SNVs and CNAs from next-generation sequencing data and profiled the transcriptome with microarrays. For clinical insight, we conducted exome sequencing of two resistant clinical samples. Notably, the cell line briefly exposed to imatinib exhibited drastic transcriptional changes, but few genomic changes.

Conclusion
We suggest that pre-existing cell death-resistant subpopulations are the main cause of full resistance via secondary KIT mutations. Combination chemotherapy with imatinib and apoptosis pathway-targeting drugs could limit the emergence of drug-resistant cancer.

Computational drug repositioning through graph-based semi-supervised learning with genomic expression and drug-gene interaction network

Download

Date: TBA
Room: TBA

Gyeongmo Gu, Kyungpook National University, Korea, Republic of
Erkhembayar Jadamba, Kyungpook National University, Korea, Republic of
Miyoung Shin, Kyungpook National University, Korea, Republic of

Presentation Overview:

Background:
Drug repositioning is an efficient drug discovery method that detects new indication(s) for existing drugs. So far, there have been various computational drug repositioning methods using pharmaceutical data and/or disease data, but currently we have new demands for computational approaches to combine experimental data, such as patients’ gene expression profiles, with pharmaceutical and genomic databases in order to explore relationships between drugs and diseases.

Results:
We propose a new strategy for drug repositioning that employs a graph-based semi-supervised learning method to analyze experimental knowledge from gene expression data and pharmaceutical knowledge from drug resources. In the first step of our study, we applied a graph-based semi-supervised learning algorithm to gene expression data in order to identify discriminative genes related to a specific disease. Next, we built a drug-gene interaction bipartite graph using existing relationships between drugs and genes recorded in public databases. To discover candidates for drug repositioning, we used the known drug-disease associations and the discriminative genes found earlier to initialize the drug-gene interaction network and performed network propagation via the guilt-by-association principle in a semi-supervised manner. We tested our method by evaluating 1239 FDA-approved drugs on the Rosetta breast cancer dataset, and observed that our strategy can identify promising candidate drugs for breast cancer treatment.

Conclusions:
Through our new drug repositioning method, we were able to discover new drug-disease associations.
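
The network propagation step on a drug-gene bipartite graph can be pictured with the sketch below. It is a generic propagation-with-restart scheme written for illustration; the update rule, the `alpha` parameter, the node names and the seed values are all assumptions and are not taken from the abstract.

```python
def propagate(edges, seeds, alpha=0.5, iterations=50):
    """Spread seed scores over a drug-gene bipartite graph (guilt by association).

    edges: list of (drug, gene) pairs; drug and gene names are assumed distinct.
    seeds: dict node -> initial score (known disease drugs, discriminative genes).
    alpha: weight of information arriving from neighbours versus the seed value.
    """
    neighbours = {}
    for drug, gene in edges:
        neighbours.setdefault(drug, set()).add(gene)
        neighbours.setdefault(gene, set()).add(drug)
    scores = {node: seeds.get(node, 0.0) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbours.items():
            # each neighbour shares its score equally among its own neighbours
            incoming = sum(scores[m] / len(neighbours[m]) for m in nbrs)
            updated[node] = alpha * incoming + (1 - alpha) * seeds.get(node, 0.0)
        scores = updated
    return scores


# Hypothetical toy network: drugA is a known treatment, gene1 is discriminative.
edges = [("drugA", "gene1"), ("drugB", "gene1"), ("drugB", "gene2"), ("drugC", "gene3")]
seeds = {"drugA": 1.0, "gene1": 1.0}
scores = propagate(edges, seeds)
candidates = sorted(
    (n for n in scores if n.startswith("drug") and n not in seeds),
    key=lambda n: scores[n],
    reverse=True,
)
print(candidates)
```

In this toy network drugB ranks above drugC because it shares a target gene with the seeded nodes, while drugC sits in a component that receives no seed signal.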

Inferring genes sensitive to severity of toxicity symptom

Download

Date: TBA
Room: TBA

Jinwoo Kim, Kyungpook National University, Korea, Republic of
Hyunjung Lee, Kyungpook National University, Korea, Republic of
Miyoung Shin, Kyungpook National University, Korea, Republic of

Presentation Overview:

Background:
Finding genes related to toxicity symptoms is important for developing drugs efficiently. Many studies have sought to infer such markers, but most focused only on markers of toxicity occurrence. In this work, our aim was to find gene markers related to the aggravation of toxicity symptoms, which should be more sensitive to the severity of symptoms than markers obtained by existing approaches.

Result:
To identify gene markers for each of four targeted liver toxicity symptoms (necrosis, hypertrophy, cellular infiltration, cellular changes), we used microarray and pathology data from 14,144 in vivo rat samples and employed sparse linear discriminant analysis (sLDA) for gene selection, with severity as the class label. To evaluate the inferred gene markers, we constructed regression models for predicting symptom severity. In 10-fold cross-validation, our model achieves an AUC of 0.96 in predicting the samples in which liver necrosis occurs, and a Spearman correlation coefficient of 0.80 between predicted and actual necrosis severity. For the other three symptoms, the coefficients are 0.72, 0.77, and 0.65 for hypertrophy, infiltration, and cellular changes, respectively. In addition, we used one-way ANOVA and Student's t-test as alternative gene selection methods and compared the performance of models built with the different methods. sLDA provides higher correlation coefficients than ANOVA and the t-test for all targeted symptoms.

Conclusion:
Our sLDA-based feature selection method is more useful for finding gene markers sensitive to the aggravation of toxicity symptoms than conventional statistical methods. Our method may help in developing new drugs or treatments that have high therapeutic effects while minimizing toxic effects.

Prediction of Calmodulin-binding Proteins Using Canonical Motifs

Download

Date: TBA
Room: TBA

Mrinalini Pandit, University of Windsor, Canada
Mina Maleki, University of Windsor, Canada
Nicholas J Carruthers, Wayne State University, United States
Paul Stemmer, Wayne State University, United States
Luis Rueda, University of Windsor, Canada

Presentation Overview:

Calmodulin (CaM) is a calcium-binding protein that is a major transducer of calcium signaling. It has no enzymatic activity of its own but rather acts by binding to and altering the activity of a panel of cellular protein targets. Its targets are structurally and functionally diverse and participate in a wide range of physiological functions including immune response, muscle contraction and memory formation. Identifying CaM target proteins and CaM sites in those proteins is an important and ongoing research problem because its binding sites are defined by physical characteristics like helicity and charge rather than a particular amino acid sequence. Current algorithms for CaM binding site prediction struggle to identify novel CaM-binding proteins.
Short Linear Motifs (SLiMs), on the other hand, help regulate many cellular processes by being interaction sites for other SLiM-containing proteins. SLiM-mediated interactions are often transient, or utilize additional interaction domains to co-operatively produce stable complexes. In this work, we propose a meta-analysis model for predicting CaM-binding proteins based on sequence information. The model uses SLiMs that are derived from a set of validated CaM-binding proteins as features for the prediction. A dataset of 194 manually curated CaM-binding proteins from the Calmodulin Target Database [2] and another dataset of 200 mitochondrial proteins have been obtained and used for testing the model. For each protein, the extracted features are the frequencies of occurrence of the known CaM-binding motifs reported in [1]. Predictions have been performed using k-nearest neighbor, support vector machine and random forest classifiers, achieving accuracies of 93%, 93% and 91% respectively. These results indicate that using SLiMs for prediction of CaM-binding proteins helps identify the CaM-binding regions, which will further enhance future biological experiments to analyze CaM and its binding with target proteins. This analysis also suggests that identification of new CaM-binding motifs will enrich the study of CaM-binding proteins in the field of computational biology.
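
The feature extraction described above, motif occurrence frequencies fed to a classifier, might look roughly like the sketch below. The two regular expressions are placeholders invented for this example and are not the CaM-binding motifs of [1]; the normalization by sequence length and the tiny k-nearest-neighbour classifier are also assumptions.

```python
import math
import re

# Placeholder patterns for illustration only; NOT the motifs reported in [1].
PLACEHOLDER_MOTIFS = {
    "basic_cluster": r"[KR]{2}",
    "hydrophobic_pair": r"[FILVW].{3}[FILVW]",
}


def motif_frequencies(sequence, motifs=PLACEHOLDER_MOTIFS):
    """Return one feature per motif: overlapping match count / sequence length."""
    features = []
    for name in sorted(motifs):
        # a lookahead makes finditer report overlapping occurrences too
        overlapping = re.compile("(?=(" + motifs[name] + "))")
        count = sum(1 for _ in overlapping.finditer(sequence))
        features.append(count / len(sequence))
    return features


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


# Toy feature vectors with made-up labels.
train = [
    ([0.5, 0.0], "CaM-binding"),
    ([0.6, 0.1], "CaM-binding"),
    ([0.0, 0.0], "other"),
    ([0.05, 0.0], "other"),
]
print(motif_frequencies("KRKR"), knn_predict(train, [0.55, 0.05]))
```

Features are ordered by motif name so that every protein yields a vector with the same layout.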

CoDuMIMM: Coevolution Detection using Mutual Information and Mutational Mapping

Download

Date: TBA
Room: TBA

Andrew Low, Carleton University, Canada
Alex Wong, Carleton University, Canada

Presentation Overview:

Epistasis, the genetic interaction of mutations in nucleotide sequences to produce effects that are not the sum of their parts, was once thought to be a relatively uncommon phenomenon. However, research has shown that epistasis may be much more pervasive than previously thought, to the point of being an important force in molecular evolution. While many approaches to detect epistatic interactions in silico have been attempted, there is still room for improvement in these techniques. To this end, we present a new algorithm, CoDuMIMM (Coevolution Detection using Mutual Information and Mutational Mapping). CoDuMIMM works by first generating substitution histories along a phylogeny for each site in an alignment using Bayesian techniques, and then calculating mutual information between pairs of sites based on the proportion of time throughout evolutionary history that states have been shared. Preliminary results on simulated data show that CoDuMIMM is able to identify epistatic sites with high sensitivity and specificity. Further testing on transfer RNA and ribosomal RNA sequences with known structures (and therefore known epistatic interactions) will be carried out in the near future, with the potential to move on to protein sequences and the prediction of inter-gene epistatic interactions after that.
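
The mutual information step can be sketched in Python as follows. This is a simplified stand-in, not CoDuMIMM itself: it estimates probabilities from the observed states in two alignment columns, whereas the method described above weights states by the proportion of evolutionary time they are shared along Bayesian substitution histories. The function name and inputs are assumptions for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    # Sum p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over observed state pairs.
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )
```

Two perfectly co-varying columns with two equally frequent states give 1 bit, while statistically independent columns give 0 bits.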

Modelling stochastic pulsatility of transcription factors in Saccharomyces cerevisiae

Download

Date: TBA
Room: TBA

Ian Hsu, University of Toronto, Canada
Alan Moses, University of Toronto, Canada

Presentation Overview:

Many transcription factors in Saccharomyces cerevisiae exhibit pulsatility. Pulsatile transcription factors localize to nuclei for a short period and return to the cytoplasm during constant conditions. Although most pulsatile transcription factors passively localize to nuclei after receiving a signal transduced from a change in the environment, a recently documented class of pulsatile transcription factors shows active localization to nuclei. The localization of this class is believed to be stochastic: the pulses are not synchronized between cells in a homogeneous environment, and the frequency of pulses within a single cell does not show obvious oscillation. The mechanism of stochastic pulsatility in cells is largely unknown. We use mathematical models to explore this mechanism. Fluorescent-protein-tagged transcription factors can be traced through confocal microscopy. Therefore, we use time-lapse images combined with image analysis of the fluorescence intensity in nuclei to quantify pulsatility over time in each cell. Based on the quantified data, we construct a model that focuses on the phosphorylation and dephosphorylation reactions of the transcription factors, which are currently hypothesized to play a major role in pulsatility. Understanding the mechanism of active and stochastic pulsatility, and how it differs from passive and deterministic pulsatility, could help us understand an important strategy of gene expression regulation.
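
To make the idea of a stochastic phosphorylation and dephosphorylation model concrete, here is a minimal Gillespie-style simulation in Python of a pool of molecules switching between an unphosphorylated and a phosphorylated state. It is only a generic sketch: the two-state scheme, the rate constants and the function name are assumptions for illustration and are not taken from the model described above, which is not specified in this abstract.

```python
import random


def gillespie_two_state(n_total=50, k_phos=1.0, k_dephos=2.0, t_end=10.0, seed=0):
    """Simulate U <-> P for n_total molecules; return [(time, n_phosphorylated)].

    All rates and the molecule count are hypothetical illustration values.
    """
    rng = random.Random(seed)
    t, n_p = 0.0, 0
    trajectory = [(t, n_p)]
    while True:
        rate_phos = k_phos * (n_total - n_p)   # U -> P propensity
        rate_dephos = k_dephos * n_p           # P -> U propensity
        total = rate_phos + rate_dephos
        if total == 0:
            break
        t += rng.expovariate(total)            # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() * total < rate_phos:
            n_p += 1
        else:
            n_p -= 1
        trajectory.append((t, n_p))
    return trajectory
```

Running the simulation with different seeds gives unsynchronized trajectories, analogous to different cells in the same environment; reproducing actual pulses would require additional model structure.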

Identification of haplotypes in HLA using genetic variant calling from amplicon NGS data

Download

Date: TBA
Room: TBA

Hong Hu, University of Illinois at Chicago, United States
Mark Maienschein-Cline, University of Illinois at Chicago, United States
Zhengdeng Lei, University of Illinois at Chicago, United States
Pinal Kanabar, University of Illinois at Chicago, United States
George Chlipala, University of Illinois at Chicago, United States
Morris Chukhman, University of Illinois at Chicago, United States
Neil Bahroos, University of Illinois at Chicago, United States
David Everly, Rosalind Franklin University of Medicine and Science, United States

Presentation Overview:

The human leukocyte antigen (HLA) complex is of key importance in determining antigenic specificity in the adaptive immune response. Therefore, typing the HLA genes is essential for stem cell transplantation to match donors and recipients. Amplicon-based targeting of HLA genes is an efficient and cost-effective strategy for obtaining high-coverage sequencing for HLA typing. However, existing off-the-shelf HLA type prediction tools such as HLAminer are designed for whole-genome or whole-exome shotgun sequencing data, and provide low accuracy when applied to amplicon NGS data. Here we present a novel workflow based on genetic variant calling to identify the actual haplotypes on each amplicon for HLA typing. The short reads from NGS are merged into single long reads that completely span each amplicon, aligned to reference HLA amplicons, and then used for variant calling. The haplotypes are obtained by reprocessing each merged read using the identified genetic variants, yielding consensus sequences of each haplotype spanning each HLA amplicon. The consensus sequences are aligned back to the IMGT/HLA database for nomenclature and are being validated against reference sequences obtained through Sanger sequencing. We have applied this method to datasets of known HLA types to assess the specificity and sensitivity of our approach.

Predicting patient outcomes of hormone therapy in the METABRIC breast cancer study

Download

Date: TBA
Room: TBA

Iman Rezaeian, University of Windsor, Canada
Eliseos Mucaki, University of Western Ontario, Canada
Katherina Baranova, University of Western Ontario, Canada
Huy Pham Quang, University of Windsor, Canada
Dimo Angelov, University of Western Ontario, Canada
Lucian Ilie, University of Western Ontario, Canada
Alioune Ngom, University of Windsor, Canada
Luis Rueda, University of Windsor, Canada
Peter Rogan, University of Western Ontario, Canada

Presentation Overview:

Genomic aberrations and gene expression-defined subtypes in the METABRIC patient cohort have been used to stratify and predict survival in the breast cancer population (Nature 486: 346; Molecular Oncology 9: 115). Gene expression and clinical outcome were used to predict response for different survival durations in METABRIC patients receiving hormone treatments (HT), with or without chemotherapy (CT). Our previously optimized and validated biochemically-inspired gene expression signature for paclitaxel response was used to predict different outcomes in patients (Molecular Oncology 10: 85). By applying machine learning algorithms for classification and feature selection, this signature, which was originally developed to model paclitaxel response in breast cancer cell lines, was used to predict survival in METABRIC patients. For 54 CT patients, a Random Forest classifier containing ABCB11, BAD, CYP2C8, CYP3A4, MAP2, MAPT and FGF2 exhibited 77% accuracy (AUROC = 0.76) in discriminating survivors from deceased individuals. For HT patients (n=188), a 19-gene signature (ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, TUBB4B) predicted >3 year survival with 88% accuracy. This signature showed 83% accuracy in the combined HT+CT patient set (n=221, AUROC = 0.60). In addition, for 84 HT+CT patients, a Support Vector Machine classifier was applied using genes ABCB1, BAD, BCAP29, BCL2, BCL2L1, BMF, CNGA3, CYP2C8, CYP3A4, FGF2, GBP1, MAP2, MAPT, OPRK1, SLCO1B3, TLR6, TUBB1 and TWIST1 as features, yielding 76% accuracy. Applied to untreated patients, the accuracy of predicting survival was lower (about 70%). These tumor gene expression signatures for response to paclitaxel therapy may be useful as a surrogate measure of early to intermediate term survival.

Selection on quantitative traits within intrinsically disordered protein regions preserves functional output of phosphorylation sites

Download

Date: TBA
Room: TBA

Caressa Tsai, University of Toronto, Canada
Taraneh Zarin, University of Toronto, Canada
Alan Moses, University of Toronto, Canada

Presentation Overview:

Intrinsically disordered regions (IDRs) of proteins are characterized by their conformational flexibility and absence of stable tertiary structure. Comprising 30% of eukaryotic proteins, IDRs have critical roles in regulation, but sequence analyses have revealed high turnover rates and divergence in these regions, suggestive of weak evolutionary constraints. Notably, phosphorylation sites are a highly important regulatory motif enriched in IDRs, yet display evidence of rapid evolution and poor positional conservation. We examined patterns of phosphorylation site evolution in IDRs in an attempt to detect molecular signatures of selection on quantitative traits describing these regulatory elements. We predicted that under a model of stabilizing selection, rapid divergence may be permitted with the reshuffling and turnover of individual phosphorylation sites, so long as their overall functional output is maintained within an optimal range. Thus, we tested this prediction using both a comparative phylogenetic method as well as a birth-death model of evolution, on Ste50 (a regulatory protein with a highly divergent IDR) and subsequently, on the yeast proteome. Our applications of both of these models reveal striking differences between IDRs in yeast proteins compared to sequences simulated under neutral expectations, indicative of selective constraint on these molecular phenotypes. These results offer an explanation for rapidly diverging IDRs, capable of maintaining regulatory functional outputs under the action of mutation-selection balance.

Systematic Characterization of Subcellular RNA Localization Through Fractionation-Sequencing

Download

Date: TBA
Room: TBA

Louis Philip Benoit Bouvrette, IRCM, Canada
Neal Cody, Mount Sinai, United States
Julie Bergalet, IRCM, Canada
Alexis Blanchet-Cohen, IRCM, Canada
Xiaofeng Wang, IRCM, Canada
Eric Lecuyer, IRCM, Canada

Presentation Overview:

Most eukaryotic cells are highly asymmetric in shape and composition. This feature relies on the capacity of molecular constituents, including proteins and nucleic acids, to be organized within distinct organelles and is central for different cell types to execute specialized functions. Subcellular localization of messenger RNA (mRNA) is a post-transcriptional mechanism which can modulate protein activities at specific functional sites. Global RNA imaging-based screens in Drosophila oocytes and embryos have demonstrated that as much as 70% of the coding transcripts are localized in patterns that broadly correlate with the distribution and function of the encoded proteins. However, this may represent an exceptional example and it remains unclear whether a comparable prevalence of RNA localization is observable in standard cells grown in culture.

To gain global insights into RNA subcellular localization properties, we subjected Drosophila (D17) and human (HepG2, K562) cells to biochemical fractionation combined with RNA sequencing (Frac-seq), allowing RNA mapping within several subcellular compartments (i.e. nuclear, cytosolic, membrane, insoluble). We further performed mass spectrometry on proteins extracted from these same fractions. A complete bioinformatics analysis was carried out to determine RNA and protein enrichment and correlations. We also catalogued specific attributes and motifs characterising asymmetrically distributed RNA.

These results reveal the high prevalence of RNA asymmetric localization, with distinctive subcellular enrichments observed for a diverse array of cellular RNA species (i.e. mRNA, lncRNA, circular RNA). Additionally, we observed specific correlations and anti-correlations between groups of mRNAs and their encoded proteins, as well as attributes (e.g. UTR and coding sequence lengths, exon content) that distinguish fraction-specific mRNA populations. Our study therefore reveals that RNA localization is a prevalent and evolutionarily conserved process, acting through discriminative constraints, that likely impacts every aspect of post-transcriptional gene regulation.

Linking Transposable Elements to Chromatin Architecture in Arabidopsis thaliana

Download

Date: TBA
Room: TBA

Christopher Cameron, McGill University, Canada
Maia Kaplan, McGill University, Canada
Alex Drouin, Laval University, Canada
François Laviolette, Laval University, Canada
Mathieu Blanchette, McGill University, Canada

Presentation Overview:

Transposable elements (TE) are self-replicating sequences of mobile DNA that move throughout a host genome. These “selfish” movements are known to be harmful when a TE either: 1) inserts in or near a functional gene, often perturbing its expression; or 2) produces a double-stranded break that may not be repaired correctly (Ayarpadikannan and Kim, 2014). To ensure self-preservation, host genomes have evolved complex defence mechanisms to decrease TE activity, which are known to influence epigenetic states (Feschotte et al., 2002; Fulz et al., 2015). The downstream effects of these mechanisms are often not limited to the TE itself and also affect surrounding DNA. Chromatin architecture of A. thaliana has recently been shown to be tightly linked to epigenetic state. Here, we investigate the role played by TEs, with their potentially disruptive nature, in defining the three-dimensional structure of DNA.

By combining publicly available A. thaliana genomic information and high-throughput chromosome conformation capture (Hi-C) data (Grob et al., 2014), we identify trends in Hi-C interaction frequency (IF) that can be described as a function of TE presence. Hi-C provides a genome-wide observation of all DNA-DNA interactions, facilitated by proteins, occurring within a population of cells. The high-level genomic spatial organization (compartments) identified by Hi-C has been shown to be recapitulated by long-range correlations in epigenetic data, providing a possible link between TEs and chromatin structure. We explore the effects of TE presence on chromatin architecture, which may result from the modification of the epigenetic landscape. The Hi-C and TE correlations we discovered suggest the potential for machine learning models to predict the presence of TEs from Hi-C data and other key genomic features. We compare such a classifier to current predictors, such as Hidden Markov Models and consensus sequences, to demonstrate the potential link between chromatin architecture and the presence of TEs.

The nitrogen responsive transcriptome in potato (Solanum tuberosum L.) reveals significant gene regulatory motifs

Download

Date: TBA
Room: TBA

Jose Hector Galvez Lopez, McGill University, Canada
Helen H. Tai, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
Martin Lague, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
Bernie Zebarth, Agriculture and Agri-Food Canada, Fredericton Research and Development Centre, Canada
Martina V. Stromvik, McGill University, Canada

Presentation Overview:

Nitrogen (N) fertilization is an important abiotic factor for the growth of potato (S. tuberosum) because of its potential effects on yield. Additionally, since excess N in the soil negatively impacts the environment, studies on N use by the plant are key. Three commercial potato cultivars (Shepody, Russet Burbank and Atlantic) were grown under two different rates of applied N-fertilizer (0 kg N ha-1 and 180 kg N ha-1) to obtain more information on the underlying gene regulation mechanisms associated with N. Total mRNA samples were taken at two different time-points during the growth season and sequenced. The results for each cultivar and time-point were analyzed separately to find differentially expressed genes. The results of the differential expression analysis were compared to identify N-responsive genes found in all cultivars and time-points. A total of thirty genes were found to be over-expressed and nine genes were found to be under-expressed in all plants with added N-fertilizer. The 1000 bp upstream flanking regions of the differentially expressed genes were analyzed to find overrepresented motifs using three de novo motif discovery algorithms (Seeder, Weeder and MEME). Nine different motifs were found, indicating potential gene regulatory mechanisms for potato under N-deficiency.

A COMPREHENSIVE MAP OF CRITICAL PATHWAYS AND NETWORKS IN CANCER STEM CELLS

Download

Date: TBA
Room: TBA

Jeffrey Liu, University of Toronto, Canada
Veronique Voisin, University of Toronto, Canada
Changjiang Xu, University of Toronto, Canada
Ruth Isserlin, University of Toronto, Canada
Gary Bader, University of Toronto, Canada

Presentation Overview:

Introduction: It has been shown that a hierarchy exists in cancer in which cancer stem cells (CSC) sit at the apex, with the ability to regenerate the disease and resistance to chemotherapy and radiation therapy. Here, we show an example of how pathway and network analysis can be used to extract common stem cell features, knowing that distinguishing stem-cell-maintaining pathways from CSC-driving mechanisms is critical in the fight against cancer.
Method: RNA-Seq datasets comparing CSC and normal stem cells (NSC) from multiple tissues were processed using the STAR alignment software with the latest genome assembly (GRCh38). Differential expression and Gene Set Enrichment Analysis (GSEA) generated lists of significant genes and pathways. An Enrichment Map (EM) was created using Cytoscape to summarize the genes and pathways into networks of interactions.
Results: GSEA was performed on gene expression data contrasting normal stem cells with stem-cell-derived tissues of heart, blood, breast, nervous system, adipose, kidney, brain, and developing embryo from 24 publicly available RNA-Seq datasets. Highly significant pathways from the combined results (FDR q-value < 0.005) were selected to generate the EM. We discovered that Telomere, DNA replication, DNA repair, Cell cycle, Mitotic spindle, VPR/nuclear transport, pluripotent stem cell, and Histone methylation pathways are generally enriched in normal stem cells. In contrast, Notch1, ERK/MAPK, and Toll-like receptor pathways are activated after stem cell differentiation.
Conclusion: Here we demonstrated that combining multiple stem-cell datasets through GSEA and EM led to the identification of well-known stem cell pathways such as Telomere, DNA replication, and pluripotency. We also identified pathways involved in differentiation and tissue specification, such as the Notch1 and MAPK pathways. Furthermore, we identified a novel pathway, VPR/nuclear transport, in stem cells for in-depth study. Multiple CSC datasets will be analyzed and combined for EM, and by contrasting with the EM of normal stem cells, therapeutic and biomarker candidates will be identified. Here in the Bader lab, we are actively developing and improving tools such as Cytoscape, GeneMANIA, and EM to discover dynamic pathways and networks. A comprehensive map of CSC will provide valuable information for developing specific anti-CSC strategies.

Pathway Commons: Single Point of Access to Biological Pathway Information

Download

Date: TBA
Room: TBA

Jeffrey Wong, University of Toronto, Canada
Gary Bader, University of Toronto, Canada
Igor Rodchenkov, University of Toronto, Canada
Chris Sander, Dana Farber Cancer Institute, Harvard Medical School, United States
Emek Demir, Oregon Health & Science University, United States
Ethan Cerami, Dana Farber Cancer Institute, Harvard Medical School, United States

Presentation Overview:

Pathway Commons is a service that collects and integrates public pathway data and makes it readily available for researchers through a single point of access. The Pathway Commons web site (www.pathwaycommons.org) provides an integrated tool to quickly search and visualize pathways. A download site provides integrated, bulk sets of pathway information in a variety of formats. An accompanying web service is available for software developers to conveniently query and access all data. Pathways include biochemical reactions, complex assembly, transport and catalysis events, gene regulation, genetic interactions, and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons currently contains over 42 000 pathways and 1.35 million interactions from twenty-two data providers with ongoing plans to expand the reach.

Convergent Evolution of Medulloblastoma Metastatic Tumors

Download

Date: TBA
Room: TBA

Patryk Skowron, The Hospital for Sick Children, Canada
Livia Garzia, The Hospital for Sick Children, Canada
Sorana Morrissy, The Hospital for Sick Children, Canada
Michael Taylor, The Hospital for Sick Children, Canada

Presentation Overview:

Introduction: Medulloblastoma initiates within the cerebellum and in 30% of cases disseminates throughout the brain and spinal cord. Little is known about the genes driving dissemination since matching primary and metastatic samples are rare. The medulloblastoma Sleeping Beauty (SB) mouse model uses random integration of transposons to initiate tumorigenesis. Insertions that confer a growth advantage are selected for as the cancer progresses. Recent literature has demonstrated divergent evolution between the primary and metastatic sites and phenotypic convergent evolution between independent metastatic sites in multiple cancers. The extent of convergent evolution in medulloblastoma metastasis is unknown and its investigation may reveal important therapeutic targets.

Methods: Independent metastatic samples were collected from each mouse and the SB transposon/normal genomic DNA junctions were sequenced. From every sample, only the most abundant (i.e. clonal) insertions were kept for downstream analysis. Convergently selected genes were located by identifying unique insertion sites targeting the same gene in different metastatic compartments. For each mouse, the probability of a random convergent event was modelled using the binomial distribution and compared to the observed rates to identify genes undergoing selective pressure.
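
The binomial comparison can be illustrated with a short Python sketch that computes the upper-tail probability of seeing a gene hit in at least k of n independent samples by chance. How n and p were actually defined in this study is not stated in the abstract; here n is assumed to be the number of metastatic samples in a mouse and p a hypothetical per-sample probability that a random clonal insertion lands in the gene.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more random hits."""
    if not 0.0 <= p <= 1.0 or k < 0 or n < 0:
        raise ValueError("require 0 <= p <= 1 and non-negative k, n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: a gene hit in 3 of 4 metastases, with an assumed
# 1% chance per sample of a random insertion landing in that gene.
p_value = binomial_tail(3, 4, 0.01)
```

A small tail probability relative to the observed recurrence suggests the gene is hit more often than random integration would explain; a genome-wide analysis would also need multiple-testing correction.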

Results: There were 15 significant genes undergoing convergent selective pressure across multiple mice. The most recurrent of these were Crebbp, Lgals3, Rabgap1l, Ak7, Ncoa3, Ptk2, Gabrb3, and Ophn1. Crebbp and Ncoa3 are chromatin remodellers that are part of the same complex; they play an essential role in growth control and embryonic development. Ptk2, Gabrb3 and Ophn1, meanwhile, regulate cell-to-cell junction maintenance. Subsequent gene set enrichment analysis revealed a multitude of pathways essential for metastasis in medulloblastoma, such as cell adhesion and Hedgehog signalling.

Conclusions: Convergent evolution plays a prominent role in medulloblastoma metastasis progression. Independent metastases have unique insertions in the same gene indicative of strong selective pressure.

In silico Discovery of Candidate Transcriptional Biomarkers for Ionizing Radiation

Download

Date: TBA
Room: TBA

Yared Kidane, WYLE/NASA, United States

Presentation Overview:

Gene expression profiling has aided in the identification of biomarkers for ionizing radiation. In spite of previous attempts, a comprehensive list of biomarker genes that cross experimental conditions still remains to be discovered. This has hampered the development of countermeasures against ionizing radiation. Polymerase chain reaction (PCR) based studies conducted earlier were aimed at identifying a small number of radiation signatures that can be used to assess exposure to ionizing radiation during mass radiologic incidents. In this sense, they focused on identifying genes that act in isolation.

Our goal is to formulate a comprehensive list of genes that respond to a range of radiation characteristics, including various particles, doses, and time post exposure, by leveraging the wealth of knowledge in protein-protein interaction networks. More specifically, we collected a list of well-studied ionizing radiation signature genes derived from previous PCR-based studies. We overlaid a protein-protein interaction network on these genes/proteins to predict additional ionizing radiation-responsive genes using a guilt-by-association technique. The validation and assessment of the robustness of these molecular markers that could represent a radiation-induction signature is essential. With this objective in mind, we mapped predicted genes to biological pathways using the KEGG, GO, and Pathway Commons databases. A number of the ionizing radiation-responsive genes that we predicted are associated with previously known ionizing radiation-related biological processes, molecular functions, and cellular components, which reinforces the validity of our prediction. In addition, mapping of predicted genes to diseases revealed enrichment of cancers of different types, radiation-induced neoplasms, diseases related to chromosomal aberration, and genetic disorders of DNA repair. Furthermore, we conducted a two-fold cross validation to assess the accuracy of the predictor. The area under the precision-recall curve (AUC) was approximately 0.75.
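
One simple form of guilt-by-association scoring is sketched below in Python: each non-seed gene is scored by the fraction of its network neighbours that are known signature (seed) genes. This is an assumed scoring rule chosen for illustration; the abstract does not state which guilt-by-association variant was used, and the gene labels in the example are placeholders rather than real radiation-responsive genes.

```python
def guilt_by_association(edges, seeds):
    """Rank non-seed genes by the fraction of their neighbours that are seeds."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seeds = set(seeds)
    scores = {
        gene: len(nb & seeds) / len(nb)
        for gene, nb in neighbours.items()
        if gene not in seeds
    }
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Placeholder network: S1 and S2 stand in for known signature genes.
edges = [("S1", "X"), ("S2", "X"), ("S1", "Y"), ("Y", "Z")]
ranking = guilt_by_association(edges, seeds={"S1", "S2"})
```

In the placeholder network, X interacts only with seed genes and ranks first, while Z has no seed neighbours and ranks last. Cross-validation can then be done by hiding a subset of seeds and checking how highly they are recovered.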

Taken together, we used a computational approach to predict potential ionizing radiation-responsive genes by leveraging existing radiation signature genes and protein interaction networks. These predictions may be used in discovery of potential transcriptional biomarkers and future experimental prioritization and validation.

Neptune: Signature Discovery Software

Download

Date: TBA
Room: TBA

Eric Marinier, Public Health Agency of Canada, Canada
Rahat Zaheer, Public Health Agency of Canada, Canada
Chrystal Berry, Public Health Agency of Canada, Canada
Kelly Weedmark, Public Health Agency of Canada, Canada
Michael Domaratzki, University of Manitoba, Canada
Philip Mabon, Public Health Agency of Canada, Canada
Natalie Knox, Public Health Agency of Canada, Canada
Aleisha Reimer, Public Health Agency of Canada, Canada
Morag Graham, Public Health Agency of Canada, Canada
The LiDS-NG Consortium, The LiDS-NG Consortium, Canada
Gary Van Domselaar, Public Health Agency of Canada, Canada

Presentation Overview:

An important component of public health response is rapid characterization of infectious agents, including the discovery of discriminatory sequences that may be leveraged to uniquely delineate a group of organisms, such as isolates associated with a disease cluster. These discriminatory sequences may be useful for further investigation of bacterial association with virulence, or assist in the development of rapid diagnostic assays for identification of bacterial isolates. The volume of available, high-throughput, next generation sequence data has necessitated the use of "big data" computational approaches for effective, real-time, comprehensive outbreak investigation and response.

We present new software that locates genomic signatures using an exact k-mer matching strategy while accommodating sequence mismatches. The software identifies sequences that are sufficiently represented within a group of interest and sufficiently absent from a background group. These groups may be provided by the user and specified dynamically. The signature discovery process is accomplished using probabilistic models instead of heuristic strategies.
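
The inclusion/exclusion idea can be sketched in a few lines of Python. This is not Neptune's algorithm or interface: it uses exact k-mer matching only, with no mismatch tolerance, no probabilistic model and no reverse-complement handling, and the function names and the min_inclusion parameter are assumptions made for this illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_signatures(inclusion, exclusion, k=5, min_inclusion=1.0):
    """k-mers found in at least min_inclusion (fraction) of the inclusion
    sequences and in none of the exclusion (background) sequences."""
    if not inclusion:
        raise ValueError("at least one inclusion sequence is required")
    counts = {}
    for seq in inclusion:
        for km in kmers(seq, k):       # a set, so each sequence counts once
            counts[km] = counts.get(km, 0) + 1
    background = set()
    for seq in exclusion:
        background |= kmers(seq, k)
    needed = min_inclusion * len(inclusion)
    return sorted(km for km, c in counts.items()
                  if c >= needed and km not in background)


# Toy sequences for illustration only.
signatures = candidate_signatures(["ACGTAC", "TACGTA"], ["ACGTTT"], k=4)
```

Lowering min_inclusion relaxes the "sufficiently represented" requirement; relaxing "sufficiently absent" would need a similar threshold on the background side.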

We have evaluated Neptune on Listeria monocytogenes and Escherichia coli genome data sets and found that signatures identified from these experiments are sensitive and specific to their respective data sets. In addition, the identified sequences provide a catalogue of differential loci for further investigation of group-specific traits. Neptune has broad implications in bacterial characterization for public health applications due to its efficient ad hoc signature discovery based upon user-specified differential genomics and scalability with analyses of large bacterial populations.

Neptune is freely available as open-source software and as a module within the Galaxy platform.

The Affinity Data Bank for biophysical analysis of regulatory sequences

Download

Date: TBA
Room: TBA

Todd Riley, University of Massachusetts Boston, United States
Cory Colaneri, University of Massachusetts Boston, United States
Brandon Phan, University of Massachusetts Boston, United States
Aadish Shah, University of Massachusetts Boston, United States
Pritesh Patel, University of Massachusetts Boston, United States

Presentation Overview:

We present The Affinity Data Bank (ADB), a suite of tools that provides biologists with novel aids to deeply investigate the sequence-specific binding properties of a transcription factor (TF) or an RNA-binding protein (RBP), and to study subtle differences in specificity between homologous nucleic acid-binding proteins. Integrated with Pfam, the PDB, and the UCSC database, the ADB also allows for simultaneous interrogation of protein-DNA and protein-RNA specificity and structure in order to find the biochemical basis for differences in specificity across protein families. The ADB also includes a biophysical genome browser for quantitative annotation of levels of binding – using free protein concentrations to model the non-linear saturation effect that relates binding occupancy with binding affinity. The biophysical browser also integrates dbSNP and other polymorphism data in order to depict changes in affinity due to genetic polymorphisms – which can aid in finding both functional SNPs and functional binding sites. Lastly, the biophysical browser supports biophysical positional priors to allow for quantitative designation of the level of locus-specific accessibility that a protein has to the DNA. With the inclusion of these biophysical occupancy-based and affinity-based positional priors, the ADB can properly model in vivo protein-DNA binding by integrating the effects of chromatin accessibility and epigenetic marks. Importantly, the use of this toolset does not require bioinformatics programming knowledge – which makes the ADB tool suite highly useful for a wide range of researchers.
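
The saturation effect mentioned above can be illustrated with the standard single-site equilibrium binding relation, occupancy = [P] / ([P] + Kd). The Python sketch below assumes this textbook form; the abstract does not give the exact model used by the ADB, and the function name and example values are illustrative only.

```python
def occupancy(free_protein, kd):
    """Fractional occupancy of one site: [P] / ([P] + Kd).

    free_protein and kd must be in the same concentration units.
    """
    if free_protein < 0 or kd <= 0:
        raise ValueError("require free_protein >= 0 and kd > 0")
    return free_protein / (free_protein + kd)


# Illustrative values: a tenfold weaker site (larger Kd) loses far less than
# tenfold occupancy once the protein concentration is well above both Kd values.
strong = occupancy(1000.0, 10.0)
weak = occupancy(1000.0, 100.0)
```

Occupancy is one half when the free protein concentration equals Kd and approaches one as the concentration rises, which is why differences in affinity do not translate linearly into differences in binding.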

Mining for New Antimicrobials: Predicting Bacteriocin Gene Blocks

Download

Date: TBA
Room: TBA

James Morton, University of California, San Diego, United States
Stefan Freed, University of Notre Dame, United States
Md Nafiz Hamid, Iowa State University, United States
Shaun Lee, University of Notre Dame, United States
Iddo Friedberg, Iowa State University, United States

Presentation Overview:

Discovering regulatory elements in co-expressed genes

Download

Date: TBA
Room: TBA

Yichao Li, Ohio University, United States
Rami Al-Ouran, Ohio University, United States
Lonnie Welch, Ohio University, United States

Presentation Overview:

Motif discovery is an important step in understanding gene regulation. Nowadays, both microarray and next-generation sequencing techniques provide the opportunity for reverse engineering of the genetic regulatory code. However, the performance of existing motif discovery algorithms remains poor due to the high heterogeneity within the co-expressed genes.
In this study, we provide a new motif discovery pipeline to analyze co-expressed genes with high heterogeneity. It includes a convolution-and-pooling motif discovery ensemble and sequence clustering based on motif content. We demonstrate the usefulness of our pipeline in two case studies: 1. Infective stage-specific gene set in Brugia malayi; 2. Myc-induced over-expressed gene set in Homo sapiens, MCF10A cell line. In both cases, known motifs can be found. For the ENCODE cell line, DNA methylation information was used as a prior in motif discovery, and TF ChIP-seq information was used to determine putative TF-TF interactions and promoter-enhancer interactions.

A Metagenomic perspective on the structure and function of Lake Microbial Communities exposed to toxic metal traces in short term evolution

Download

Date: TBA
Room: TBA

Bachar Cheaib, Laval University, Canada
Malo Le Boulch, Laval University, Canada
Pierre-Luc Mercier, Laval University, Canada
Nicolas Derome, Laval University, Canada

Presentation Overview:

Metals and metalloids released by industrial mining and smelting activities are major pollutants of soil and water ecosystems. Toxic metal traces affect all life forms, including microbes. Consequently, metal contamination shapes microbial meta-community structure and function by accelerating evolutionary and adaptive processes. Few studies have investigated the impact of a metallic cocktail gradient on natural microbial communities. In this research, we studied five lakes located in a mining area in western Quebec. Among them, three are interconnected and lie along a metal contamination gradient (cadmium, copper, lead, and mercury) caused by historic mining activities (< 60 years of exposure). Two landlocked and distant lakes were used as positive and negative controls. Using a metagenomic shotgun sequencing approach (Illumina HiSeq), we generated 30 Tbp of data from five lake samples. We used a comparative metagenomic strategy to explore shifts and thresholds for community disturbance. We found parallel shifts of taxonomic abundance (Actinobacteria, filamentous Cyanobacteria, Proteobacteria) between sites. This disturbance of meta-community structure was also observed at the functional and metabolic levels, through a decrease in the abundance of subsystems implicated in the nitrogen cycle and photosynthesis and, conversely, an increase in virulence and stress response functions. The differential abundance of metabolic components suggests metabolic erosion along the five sites. Furthermore, a taxonomy/function decoupling may suggest that horizontal gene transfer (HGT) events mitigated taxonomic erosion. Ongoing work will provide insights into meta-community plasticity in response to environmental pressure, revealing thresholds between transient and permanent shifts in composition and genetic repertoire.

Investigating the usefulness of long-read sequencing technologies on the study of large eukaryotic gene families

Download

Date: TBA
Room: TBA

Armin Rouhi, University of Calgary, Canada
Janneke Wit, University of Calgary, Canada
James Wasmuth, University of Calgary, Canada

Presentation Overview:

Visualizing the Effects of Data Transformations on Errors

Download

Date: TBA
Room: TBA

Robert Flight, University of Kentucky, United States
Hunter Moseley, University of Kentucky, United States

Presentation Overview:

In many omics analyses, a primary step in the data analysis is the application of a transformation to the data. Transformations are generally employed to convert proportional error (variance) to additive error, which most statistical methods appropriately handle. However, omics data frequently contain error sources that result in both additive and proportional errors. To our knowledge, there has not been a systematic study on detecting the presence of proportional error in omics data, or on the effect of transformations on the error structure. In this work, we demonstrate a set of three simple graphs which facilitate the detection of proportional and mixed error in omics data when multiple replicates are available. The three graphs illustrate proportional and mixed error in a visually compelling manner that is straightforward both to recognize and to communicate. The graphs plot the 1) absolute range, 2) standard deviation and 3) relative standard deviation against the mean signal across replicates. In addition to showing the presence of different types of error, these graphs readily demonstrate the effect of various transformations on the error structure. Using these graphical summaries, we find that the log-transform is the most effective of the commonly employed methods for removing proportional error.
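
The three quantities plotted against the mean signal can be computed per feature as sketched below. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name and the toy replicate values are hypothetical.

```python
from math import log
from statistics import mean, stdev


def replicate_error_summaries(rows):
    """For each feature (one row of replicate measurements), return the mean
    signal plus the three quantities the abstract plots against it:
    absolute range, standard deviation, and relative standard deviation."""
    summaries = []
    for replicates in rows:
        m = mean(replicates)
        sd = stdev(replicates)  # sample standard deviation (n - 1)
        summaries.append({
            "mean": m,
            "range": max(replicates) - min(replicates),
            "sd": sd,
            "rsd": sd / m,
        })
    return summaries


# Toy data with purely proportional error: the second feature is the first
# scaled by 10, so range and SD grow with the mean while RSD stays constant.
raw = [[1.0, 3.0], [10.0, 30.0]]
raw_summary = replicate_error_summaries(raw)

# After a log transform the range no longer depends on the mean.
logged = [[log(x) for x in row] for row in raw]
log_summary = replicate_error_summaries(logged)
```

Plotting `range`, `sd` and `rsd` against `mean` for every feature, before and after a transformation, gives the three graphs; a flat SD-versus-mean trend after transformation indicates that proportional error has been removed.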

An Accurate RNAseq-based Protocol for Metagenomic Taxonomy Classification

Download

Date: TBA
Room: TBA

Jeremy Cox, University of Cincinnati, United States
Richard Ballweg, Children's Hospital Medical Center, United States
Prakash Velayutham, Children's Hospital Medical Center, United States
David Haslam, Children's Hospital Medical Center, United States
Alexey Porollo, Children's Hospital Medical Center, United States

Presentation Overview:

RNAseq may be more beneficial than DNAseq for studying the microbiome. In addition to genomic evidence of the organisms present, RNAseq provides insights into the functional states and metabolism of the microbial community. A major challenge is designing an RNAseq-based Metagenomic Taxonomy Classifier (MTC) workflow that functions well with rRNA-depleted RNA. No MTCs have been published that work with RNAseq (after depletion of rRNA), and we found no adequate existing MTC workflows for DNAseq to adapt for this problem. Two key challenges for DNAseq-based MTC are (1) how to process a large volume of sequence data efficiently, and (2) how to deal with ambiguous information, when the same sequence matches multiple species. Here, we present a protocol for RNAseq-based MTC that addresses these two issues. A transcriptome assembler is used to assemble many short reads into a few long contigs. These contigs are rarely ambiguous because of their increased length. The decrease in the number of sequences reduces the processing burden by orders of magnitude.
The protocol was comprehensively evaluated using 61 simulation experiments to account for the impact of read length, mutation rate, sequencing depth, coverage, microbiome composition, reference database, and host read abundance. The new protocol for RNAseq-based MTC is robust for both short and long sequences, different mutation rates, and various microbial community compositions. De novo assembly of RNAseq data reduces computation time, increases the accuracy of organism identification, and enables consistent performance across read lengths (50 – 150 bp). The new protocol performs best with a metagenome reference database constructed from complete genomes only.

Modeling of non-detects in qPCR

Download

Date: TBA
Room: TBA

Valeriia Sherina, University of Rochester, United States
Matthew McCall, University of Rochester, United States

Presentation Overview:

Quantitative real-time PCR (qPCR) is one of the most widely used methods to measure gene expression. Despite extensive research in qPCR laboratory protocols, normalization and statistical analysis, little attention has been given to qPCR non-detects, those reactions failing to produce a minimum amount of signal. While most current software replaces these non-detects with the maximum possible Ct value, recent work has shown that this introduces large biases in the estimation of both absolute and differential expression. The existing approach, while better than previously used methods, underestimates the variability, leading to anti-conservative inference. The idea of treating non-detects as missing data, modeling the missing data mechanism, and using this model to impute Ct values for the non-detects with the EM algorithm was extended to multiple imputation. There are three sources of model uncertainty: the parameters of the Ct value distribution, random noise, and the parameters of the curve corresponding to the probability of a point being a non-detect. All three sources of variability were incorporated in this work. The benefits of this approach are shown when estimating both absolute and differential gene expression. This work resulted in user-friendly, open-source software.

Statistical Analysis of Differential Gene Expression in Spinal Nerve Ligated Rats using RNA-seq

Download

Date: TBA
Room: TBA

Dirk Bullock, University of Akron, United States

Presentation Overview:

The purpose of this study is to investigate differential gene expression in a model organism (the rat) subjected to a neurological intervention (spinal nerve ligation) using RNA-seq data. The data were derived from a previous study by Hammer et al. (2010, Genome Research) in which mRNA expression was measured in Sprague-Dawley rats (Rattus norvegicus) that either received a spinal nerve ligation (SNL) or did not. The RNA-seq data obtained from that study were raw counts. The data were cleaned and filtered for small counts, and the library sizes were then normalized. Genes were annotated with symbols and converted from Ensembl IDs to Entrez IDs using the Bioconductor package biomaRt. A design matrix was defined to match the experimental setup, and dispersion estimates were computed with the edgeR package in Bioconductor. A negative binomial GLM was fitted for each of the gene tags in the data set, and a likelihood ratio test was performed to find the differentially expressed genes between the control and treatment groups. The top genes were selected based on the p-values of the likelihood ratio tests. It was found that, in the treatment group, 3573 genes were upregulated, 3692 were downregulated, and 4277 were neither. The gene ontology (GO) and KEGG pathways were investigated for the top genes from the likelihood ratio tests. Many of the gene ontology results suggest that genes that play a role in neural cellular mechanisms involving neurons, synapses, and ion transmembrane transport were upregulated under SNL. On the other hand, genes that play a role in immune response and responses to stimuli were highly downregulated under SNL. From the KEGG pathway database, upregulated pathways included metabolic, Alzheimer’s disease and Huntington’s disease pathways. Downregulated pathways included cancer, HTLV-I (human T-lymphotropic virus) infection, and the PI3K/AKT signaling pathway (an intracellular pathway important in regulating the cell cycle).

Widespread complex genetic bases of ABC transporter mediated drug resistance in S. cerevisiae revealed through an engineered population profiling strategy

Download

Date: TBA
Room: TBA

Albi Celaj, Roth Lab, Canada
Nozomu Yachie, Synthetic Biology Division, Research Center for Advanced Science and Technology, the University of Tokyo, Tokyo, Japan
Louai Musa, University of Toronto, Canada
Minjeong Ko, Ontario Institute for Cancer Research, Canada
Marinella Gebbia, University of Toronto, Canada
Shijie Zhou, University of Toronto, Canada
Benjamin Grys, University of Toronto, Canada
Tina Sing, University of Toronto, Canada
Tiffany Fong, McMaster University, Canada
Frederick Roth, University of Toronto & Mt Sinai Hospital, Canada

Presentation Overview:

Unexpected effects of combined genetic perturbations (‘genetic interactions’) underlie complex traits and are thus critical to understanding genotype-phenotype relationships. Genetic interactions can be systematically mapped via either engineered strains or outbred populations. ‘High-order’ genetic interactions, those involving three or more genes, are likely to be more abundant than pairwise interactions. However, the number of combinations makes engineered strain approaches impractical. In outbred population studies, the multiple-testing burden limits the statistical power to detect high-order interactions. We explored high-order genetic interactions within a targeted gene set by designing a hybrid ‘engineered population’ strategy. To demonstrate this approach, we engineered an S. cerevisiae population of uniquely DNA-barcoded cells with variation segregating at 16 loci encompassing every ATP Binding Cassette (ABC) transporter gene implicated in multiple drug resistance. Using a combination of next-generation sequencing strategies, genotypes were obtained for ~7000 individuals from this population and the growth sensitivity of each individual was determined for 16 drugs. We visualized multi-gene fitness effects and identified numerous high-order genetic interactions conferring both drug resistance and sensitivity. For example, quadruple deletion of SNQ2, YBT1, YCF1, and YOR1 was found to confer resistance to the PDR5 substrates fluconazole and ketoconazole. Our study points to a new model for complex mutual inhibition of ABC transporters, potentially through direct heterodimeric repression. Thus, we illustrate the potential for an ‘engineered population’ strategy to inform our understanding of complex traits.

Gene expression profile based sample classification

Download

Date: TBA
Room: TBA

Khadija El Amrani, Charité - Universitätsmedizin Berlin, Germany
Nancy Mah, Charité - Universitätsmedizin Berlin, Germany
Miguel Andrade-Navarro, Faculty of Biology, Johannes Gutenberg University of Mainz, Germany
Andreas Kurtz, Charité - Universitätsmedizin Berlin, Germany

Presentation Overview:

Discrimination between different classes of samples, such as different cell types or tissues, using gene expression profiles is an important problem in genetic and cell research. It can contribute to our understanding of cell phenotype differences and will allow precise identification of various cell types and tissues. We developed a bioinformatics tool for the classification of samples based on gene expression profiles. The tool requires a training and a test data set, and uses a simple algorithm called Shared Marker Genes (SMG). As the name suggests, the number of shared marker genes between a reference and a query sample is used as a similarity measure. Marker genes are detected using the tool MGFM, which we have previously developed as a Bioconductor package. We demonstrate the utility and effectiveness of the proposed approach by classifying different tissues using public microarray and RNA-seq data sets. We verified our tool using 186 test samples from four human tissues (heart, kidney, liver and lung) from the NCBI’s Gene Expression Omnibus public repository. Our approach accurately classified 99% of these 186 test samples. Furthermore, we compared our tool to Support Vector Machines (SVM). Our method performed comparably to or better than SVM. The proposed tool is implemented as an R package named ‘sampleClassifier’, which will be submitted to Bioconductor. The source code is available upon request.
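
The similarity measure described above (the number of marker genes shared between a query and each reference class) can be sketched in a few lines. This is an illustrative Python sketch under simple assumptions, not the ‘sampleClassifier’ R package: marker detection (done by MGFM in the abstract) is assumed to have already happened, and the function name and gene identifiers are hypothetical placeholders.

```python
def classify_by_shared_markers(query_markers, reference_markers):
    """Assign a query sample to the reference class with which it shares
    the most marker genes.

    query_markers: iterable of marker gene IDs detected in the query sample.
    reference_markers: dict mapping class label -> iterable of marker gene IDs.
    Returns (best_label, scores), where scores maps label -> shared count.
    Ties are resolved in favor of the first label in the dict.
    """
    query = set(query_markers)
    scores = {
        label: len(query & set(markers))
        for label, markers in reference_markers.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical placeholder gene IDs, for illustration only.
references = {
    "heart": ["GENE_A", "GENE_B", "GENE_C"],
    "liver": ["GENE_D", "GENE_E", "GENE_F"],
}
label, scores = classify_by_shared_markers(
    ["GENE_A", "GENE_C", "GENE_F"], references
)
```

Counting shared markers keeps the classifier independent of expression scale, so reference and query profiles do not need to come from the same platform as long as marker genes can be called on each.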

Age estimation for the genus Cymbidium (Orchidaceae: Epidendroideae) with implementation of fossil data calibration using molecular markers (ITS2 & matK) and phylogeographic inference from ancestr

Download

Date: TBA
Room: TBA

Devendra Biswal, Bioinformatics Centre, North-Eastern Hill University, India
Manish Debnath, Bioinformatics Centre, North-Eastern Hill University, India
Ruchishree Konhar, Bioinformatics Centre, North-Eastern Hill University, India
Jean Valrie, Botany Department, North-Eastern Hill University, India
Pramod Tandon, Biotech Park, Lucknow, India

Presentation Overview:

Intercontinental dislocations between tropical regions harboring two-thirds of the flowering plants have always drawn attention from taxonomists and biogeographers. The focus has always been on woody land plants rather than on herbs. The Orchidaceae is one such angiosperm family, with an herbaceous habit and high species diversity in the tropics. Here, we investigate the evolutionary and biogeographical history of the genus Cymbidium, which represents a monophyletic subfamily (Epidendroideae) of the orchids and comprises 50-odd species that are disjunctly distributed in tropical to temperate regions. A relatively well-resolved and highly supported phylogeny of Cymbidium orchids was reconstructed based on sequence analyses of internal transcribed spacer (ITS2) regions and maturase K (matK) from the chloroplast region, available in the public domain in GenBank at NCBI. Crassulacean acid metabolism (CAM) is one of the photosynthetic pathways regarded as adaptations to water stress in land plants, and little is known about correlations among the level of CAM activity, habitat, life form, and phylogenetic relationships of a plant group from an evolutionary perspective. This study performed a genus-level analysis by integrating matrices of ITS2 and matK data with all available fossil data on orchids in a molecular Bayesian relaxed clock implemented in BEAST, and assessed divergence times for the genus Cymbidium with a focus on the evolutionary plasticity of photosynthetic characters. Our study has enabled age estimation for the genus Cymbidium (45 Ma) for the first time using BEAST, by adding two previously analyzed internal calibration points.

Ancestral area reconstruction and genetic variation of Paragonimus westermani (Trematoda: Digenea) and its position within the genus Paragonimus: an in silico study using ITS2 and COXI sequences

Download

Date: TBA
Room: TBA

Devendra Biswal, Bioinformatics Centre, North-Eastern Hill University, India
Manish Debnath, Bioinformatics Centre, North-Eastern Hill University, India
Srinivasan Ramachandran, Institute of Genomics and Integrative Biology, India
Anupam Chatterjee, Department of Biotechnology & Bioinformatics, North-Eastern Hill University, India
Veena Tandon, Biotech Park, Lucknow, India

Presentation Overview:

Paragonimiasis in humans is a neglected tropical disease of the lung and pleural cavity; in addition, extra-pulmonary paragonimiasis is an important clinical manifestation that has received little attention from public health authorities and has a far wider scope than mere clinical diagnosis and treatment. The major causative agent, Paragonimus westermani (Trematoda: Digenea), is a cryptic species complex that is widespread in east and northeast China, Japan, Korea and Taiwan (collectively referred to as East Asia). Lung flukes are also found in the tropics and subtropics of East and South Asia and suburban Africa. The results of genetic variance analyses showed that samples from the same geographical region (country) may be attributed to different clusters but may not have sufficient phylogenetic signal to exhibit the biological uniqueness of the corresponding populations. In a highly heterogeneous population, the genetic variability in toto may be represented by a minor fraction of sampled individuals, so that the clustering pattern exhibited by some variants might have resulted from fortuitous events. In fact, the geographical fitting presents no surprise at all, taking into account that the Paragonimus species distribution, covering different countries, is more or less aligned along an East-South axis because of the relative slenderness of the South Asian countries. We harnessed whole-genome sequence information for P. westermani obtained via Next Generation Sequencing (NGS), and its correlation with the current information for P. westermani, towards mtDNA phylogenomic investigations. Specific primers were designed for the 12 protein-coding genes, guided by the existing P. westermani mtDNA as the reference. Ion Torrent technology was used for the mitochondrial genome sequencing, and the genome was assembled and analyzed in silico.

Predicting cis-regulation in human promoters by information density-based clustering of heterotypic transcription factor binding sites

Download

Date: TBA
Room: TBA

Ruipeng Lu, Western University, Canada
Peter K. Rogan, Western University, Cytognomix Inc., Canada

Presentation Overview:

Background: Heterotypic clusters of transcription factor binding sites (TFBSs) are a common feature of cis-regulatory modules in both bacterial and eukaryotic genomes, and are present in both promoters and distal enhancer/silencer elements. Their complexity and spatial organization can modulate the temporal dynamics of multiple TFBSs and influence gene expression. Window-based (e.g., information density-based clustering (IDBC, Ref. 1)) and model-based (e.g., Cluster-Buster (2), Comet (3), and Cister (4)) algorithms have been described to detect homo- or heterotypic sets of adjacent binding sites. Homogeneous and bipartite information models from ENCODE ChIP-seq datasets (5) were used to detect TFBSs, which were subsequently analyzed with these algorithms.

Methods: Ground-truth TFBS clusters have been documented in studies of the regulation of the GR, CIITA, and AGT gene promoters. We compared the performance of TFBS clustering algorithms by evaluating their similarity to the ground truth clusters. We also sought evidence for crosstalk between adjacent TFBSs within a cluster, and compared these results for known regulated regions and regulatory deserts (e.g., the distal enhancer elements and proximal promoter elements of a looping interaction, and the region between them).

Results: IDBC identified 7 of 8 ground truth binding site clusters in GR, whereas Cluster-Buster, Comet and Cister found 0, 1, and 2 of these clusters, respectively. Similar results were obtained for the other genes: in CIITA, IDBC detected 3/4, Cluster-Buster 0/4, Comet 2/4 and Cister 0/4; in AGT, IDBC detected 2/3, Cluster-Buster 0/3, Comet 1/3 and Cister 0/3 TFBS clusters. Only IDBC and Cluster-Buster detected nearly all validated independent TFBSs that were not members of a cluster in each promoter. Evidence for interaction between proximate TFBSs within a cluster was confirmed by IDBC. Information density and cluster numbers were significantly lower in regulatory deserts than in regulated regions. IDBC also detected unclustered TFBSs that are known to contribute to transcriptional regulation of these genes.

Conclusions: IDBC can predict interactions between proximate TFBSs and distinguish between actively regulated regions and regulatory deserts. Because both binding site strength and distribution are incorporated, IDBC is more comprehensive than alternative approaches.

References: 1. Dinakarpandian et al. (2005) BMC Bioinformatics, 6, 204; 2. Frith et al. (2003) Nucleic Acids Res., 31, 3666–3668; 3. Frith et al. (2002) Nucleic Acids Res., 30, 3214–3224; 4. Frith et al. (2001) Bioinformatics, 17, 878–889; 5. Lu et al. Nucleic Acids Res., submitted.

Towards better identification of potential RNA G-Quadruplexes using machine learning

Download

Date: TBA
Room: TBA

Jean-Michel Garant, RNA Group/Groupe ARN, Département de biochimie, Faculté de médecine et des sciences de la santé, Pavillon de recherche appliquée sur le cancer, Université de Sherbrooke, QC, J1E 4K8, Canada
Michelle S. Scott, RNA Group/Groupe ARN, Département de biochimie, Faculté de médecine et des sciences de la santé, Pavillon de recherche appliquée sur le cancer, Université de Sherbrooke, QC, J1E 4K8, Canada
Jean-Pierre Perreault, RNA Group/Groupe ARN, Département de biochimie, Faculté de médecine et des sciences de la santé, Pavillon de recherche appliquée sur le cancer, Université de Sherbrooke, QC, J1E 4K8, Canada

Presentation Overview:

G-quadruplexes (G4) are tetrahelical nucleic acid structures formed from planar arrangements of guanines. A simple motif was originally proposed to describe G4-forming sequences. More recently, however, formation of G4 was discovered to depend, at least in part, on the context of neighboring sequences. Prediction of G4 folding is becoming more challenging as G4-structured outliers (i.e., those not described by the originally proposed motif) are increasingly reported. Recent observations call for a comprehensive tool, adaptable to the changing definition of a typical G4. We propose a machine learning-based strategy to meet this need and chose an artificial neural network, since it generally provides good prediction capability from noisy data. Using the manually curated G4RNA database as a training set, we successfully produced a tool able to identify most experimentally observed G4s from transcriptomic sequences. G4RNA is designed to be very inclusive and contains numerous unusual structures. We are confident that this diversity is crucial for subsequent predictions. We are validating the generalization power of the tool and adapting its design for use on transcriptomes of various species. Adaptation will include options to obtain several predictive values previously described and known to the field. A web application will be available in the near future for users to submit their own sequences to the tool.

Finding novel targets of snoRNAs influencing alternative splicing and gene expression.

Download

Date: TBA
Room: TBA

Vincent Boivin, Université de Sherbrooke, Canada
Sonia Couture, Université de Sherbrooke, Canada
Sherif Abou Elela, Université de Sherbrooke, Canada
Michelle Scott, Université de Sherbrooke, Canada

Presentation Overview:

Small nucleolar RNAs (snoRNA) are an essential group of non-coding RNAs that are mostly known to mediate chemical modifications of ribosomal RNA (rRNA). Recent studies have, however, highlighted non-canonical functions for some snoRNAs, ranging from the modulation of chromatin structure to the regulation of transcript expression and the regulation of alternative splicing. However, more than a third of all known human snoRNAs have no function yet uncovered and many more could have such unexpected roles in cells.
Analysis of differential gene expression and splicing in our high-throughput RNA sequencing (RNA-seq) datasets on SKOV cell lines shows that the expression and the splicing patterns of many genes are altered when we deplete the snoRNAs SNORD88C and SNORD124, both of which have non-canonical characteristics. Bioinformatic predictions of RNA-RNA interactions of snoRNAs on the whole transcriptome show that some of these events are associated with highly energetically favored interactions. These predicted interactions and their effects on gene expression or splicing patterns will be validated experimentally by PCR and shift-assay.
As more RNAseq experiments with depleted snoRNAs are planned, we will use previous true positive and false positive results to create a Bayesian network-based predictor in order to find new targets more efficiently in new datasets. The predictor will use parameters of the previously validated interactions to determine which newly predicted events are more likely to be true positives. Every validated target from a newly obtained RNAseq dataset will be used to improve the precision of the predictor. By becoming more efficient with each iteration, we hope this predictor will help to find new targets for snoRNAs faster and could eventually be applied to other families of non-coding RNAs.

Computational Prediction of Key Transcription Factors Involved in Regulating the Cellular Response to Rapamycin

Download

Date: TBA
Room: TBA

Kimberly MacKay, University of Saskatchewan, Canada
Zoe Gillespie, University of Saskatchewan, Canada
Brett Trost, University of Saskatchewan, Canada
Christopher Eskiw, University of Saskatchewan, Canada
Anthony Kusalik, University of Saskatchewan, Canada

Presentation Overview:

Background: Rapamycin is a well-known inhibitor of the Target of Rapamycin signalling cascade and is currently used clinically as an immunosuppressant. The effects of rapamycin have been extensively studied in various model organisms, but the global effects it has on gene expression in normal human primary cells remain unclear.

Objective: Identify the most probable set of transcription factors that could be responsible for regulating cellular rapamycin response in normal human primary cells.

Methodology: RNA-seq was performed on proliferative and rapamycin-treated human fibroblasts. The resultant paired-end reads were mapped to the GRCh37/hg19 reference genome using the Tuxedo suite. Fold-change differences in transcriptional abundance were calculated with SeqMonk by comparing read counts from the two datasets. Promoter sequences for genes that experienced a 5-fold or greater increase in transcriptional abundance were analyzed with Clover (Cis-eLement OVERrepresentation) to identify over-represented transcription factor binding motifs (TFBMs).
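
The fold-change selection step can be illustrated with a short sketch. It is not SeqMonk's calculation: the pseudocount, the function name, and the example counts are assumptions made for illustration, and real analyses would normally work from normalized counts.

```python
def fold_change_calls(control_counts, treated_counts,
                      threshold=5.0, pseudocount=1.0):
    """Label genes whose abundance changes by at least `threshold`-fold.

    control_counts / treated_counts: dicts mapping gene ID -> read count.
    A pseudocount (an assumption of this sketch) avoids division by zero.
    Returns a dict mapping gene ID -> "up" or "down"; genes below the
    threshold in both directions are omitted.
    """
    calls = {}
    for gene in control_counts.keys() & treated_counts.keys():
        ratio = ((treated_counts[gene] + pseudocount)
                 / (control_counts[gene] + pseudocount))
        if ratio >= threshold:
            calls[gene] = "up"
        elif ratio <= 1.0 / threshold:
            calls[gene] = "down"
    return calls


# Hypothetical counts for three placeholder genes.
control = {"gene1": 9, "gene2": 49, "gene3": 10}
treated = {"gene1": 49, "gene2": 9, "gene3": 12}
calls = fold_change_calls(control, treated)
up_regulated = sorted(g for g, c in calls.items() if c == "up")
```

The promoters of the genes in `up_regulated` would then be the input to a motif over-representation tool such as Clover.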

Results: Rapamycin treatment of primary human fibroblasts resulted in 537 genes exhibiting a 5-fold or greater change in transcriptional abundance (421 up-regulated, 116 down-regulated). Clover analysis of the up-regulated genes identified significant TFBMs for the Nuclear Factor of Activated T Cells (NFAT) transcription factor family, the Signal Transducer and Activator of Transcription (STAT) family and the Octamer Binding (OCT) family. Additionally, it revealed that 86.2% of the promoters associated with up-regulated genes contained an NFAT binding motif, 78.6% contained a STAT5A/B binding motif, and 83.1% contained an OCT1 binding motif. Overall, STAT5A/B, NFAT and OCT1 motifs are present in 98.1% of the promoters associated with genes up-regulated by rapamycin treatment.

Conclusion: STAT5A/B, NFAT and OCT1 are the most likely nuclear regulators of rapamycin response in normal human fibroblasts. Additionally, the global cellular response to rapamycin could be the result of this transcription factor network mediating changes in transcript levels.

Advances in Detecting Convergent Sequence Evolution at Phylogenomic Scales

Download

Date: TBA
Room: TBA

Nathaniel Bryans, University of Calgary, Canada
Chenzhe Qian, University of Calgary, Canada
A.P. Jason de Koning, University of Calgary, Canada

Presentation Overview:

Because all life on earth shares common ancestry, studies of biological form and function must be careful to distinguish whether functional similarities between species are caused by common ancestry or common functional requirements. Understanding the phylogenetic relationships among species is therefore an important prerequisite to designing experiments involving comparative biology (including in model organism studies). Recently, we have shown that convergent sequence evolution (the parallel substitution of the same amino acid states in different species) can happen at a tremendous scale and that, when present, it can positively mislead all known methods of phylogenetic reconstruction. At present, there is no general statistical procedure for reliably distinguishing between random convergence and convergence resulting from parallel selective pressures. We have developed a computational tool that rapidly estimates posterior convergent substitution probabilities across entire phylogenies. This approach has been combined with models of site-specific constraints, with which we can estimate the exact probability of observed levels of excess non-neutral convergent evolution under a model of site-wise negative selection. Applying these tools and findings, we plan to conduct a large-scale survey of sequence convergence across a set of approximately 70 vertebrate genomes.

Cryptic Genetic Relatedness Among 1000 Human Genomes

Download

Date: TBA
Room: TBA

Larisa Fedorova, The University of Toledo, United States
Alexei Fedorov, The University of Toledo, United States
Rajib Dutta, The University of Toledo, United States
Shuhao Qiu, The University of Toledo, United States
Ahmed Al-Khudhair, The University of Toledo, United States

Presentation Overview:

Nucleotide sequence differences on the whole-genome scale have been computed for 1092 people from 14 populations made publicly available by the 1000 Genomes Project. The total number of differences in genetic variants between 96,464 human pairs has been calculated. We also analyzed the distribution patterns of very rare genetic variants (vrGVs), which have minor allele frequencies below 0.2%, and used these patterns to reveal cryptic genetic relatedness. In contrast to existing probabilistic approaches, our novel computational method for detecting identical-by-descent (IBD) chromosomal segments between sequenced genomes is largely deterministic, because it considers a group of very rare events that cannot occur together by chance alone. This method has been applied in an exhaustive computational search for shared IBD segments among 1092 sequenced individuals from 14 populations. It demonstrated that clusters of vrGVs are unique and powerful markers of genetic relatedness that uncover IBD chromosomal segments between and within populations, irrespective of whether divergence was recent or occurred hundreds to thousands of years ago. We found that several IBD segments are shared by practically any possible pair of individuals belonging to the same population. Moreover, shared short IBD segments (median size 183 Kb) were found in 10% of inter-continental human pairs, each comprising a person from Sub-Saharan Africa and a person from Southern Europe. The shortest shared IBD segments (median size 54 Kb) were found in 0.42% of inter-continental pairs composed of individuals from Chinese/Japanese populations and Africans from Kenya and Nigeria. Knowledge of the inheritance of IBD segments is important in clinical case-control and cohort studies, since unknown distant familial relationships could compromise interpretation of collected data. Clusters of vrGVs should be useful markers for familial relationships and common multifactorial disorders.

Covariation between rates of substitution and levels of quinolone resistance in clinical isolates of Escherichia coli

Download

Date: TBA
Room: TBA

Prabhjeet Basra, Carleton University, Canada
Gabriela Bernal-Astrain, Carleton University, Canada
Alex Wong, Carleton University, Canada

Presentation Overview:

Establishing a connection between genotype and phenotype remains a significant challenge in understanding the genetics of adaptation. Here, we take a computational approach to understand the link between phenotypic and genotypic changes in clinical isolates of the bacterium Escherichia coli. We predict that covariation should be detectable between the rate of change in a phenotype of interest and the rate of substitution at genes underlying that phenotype. We test this prediction using a set of 39 isolates of E. coli displaying differing degrees of resistance to the quinolone antibiotic ciprofloxacin. We estimate covariance between several phenotypes of interest (ciprofloxacin resistance, growth rates in different media) and the rate of substitution at 168 genes that might play a role in drug resistance and host adaptation. We propose that evolutionary rate covariation can be a powerful tool for connecting genotype to adaptive phenotype in bacteria.

Novel strategies for dynamic analysis and alignment of biological networks and their interdisciplinary applications

Download

Date: TBA
Room: TBA

Fazle Faisal, University of Notre Dame, United States
Tijana Milenkovic, University of Notre Dame, United States

Presentation Overview:

Network science spans many domains, including computational biology. Biomolecules in the cell, such as genes, do not function alone but instead interact with each other to carry out cellular processes, and this is exactly what biological networks model. Efficient network (or graph) theoretic and computational analyses of biological networks have the potential to deepen our understanding of complex biological processes and diseases, which can lead to identification of disease genes, design of drugs targeting the disease genes, and consequently improvement in health care. However, several challenges make it hard to efficiently extract biological knowledge from the topology of biological networks: 1) biological network data are large, 2) biological network data are heterogeneous due to the availability of many different data types, 3) biological network data are noisy, 4) current biological network data represent static snapshots of actually dynamic cellular functioning, and 5) many network-theoretic problems are computationally hard.

In this context, computational data integration can take advantage of the complementary information of different data types and can lead to more accurate biological knowledge extraction from the integrated heterogeneous network data. Further, computational data integration can also help with the inference and analysis of dynamic biological networks (which cannot be inferred experimentally due to limitations of biotechnologies for data collection) and can thus allow for studying dynamic cellular processes, such as aging. Finally, efficient computational strategies for network comparison and alignment can revolutionize our biological understanding by facilitating a new way to transfer knowledge between species. With these motivations, we developed novel computational strategies for integrative, dynamic, and comparative biological network analysis. We showed the application of our network-based strategies in addressing important biological problems, such as studying human aging, which is hard to study experimentally due to long life span as well as ethical constraints.

Application of codon-based evolutionary models in the prediction and validation of pathogenic mutations

Download

Date: TBA
Room: TBA

Frances Hauser, University of Toronto, Canada
Portia Tang, University of Toronto, Canada
Ryan Schott, University of Toronto, Canada
Gianni Castiglione, University of Toronto, Canada
Alexander Van Nynatten, University of Toronto, Canada
Elise Heon, Hospital for Sick Children, Canada
Belinda Chang, University of Toronto, Canada

Presentation Overview:

Human genome and exome sequencing projects have substantially expanded the number of novel single-nucleotide variants (SNVs) implicated in hereditary disease. This rapid surge in data means that the development of tools designed to accurately predict the pathogenicity of a given mutation is increasingly critical. These prediction methods are largely dependent on various metrics of evolutionary conservation, but lack of standardization and poor specificity and sensitivity often preclude accurate validation of novel mutations recovered in sequencing projects. Furthermore, due to their protein-based approach, these methods ignore the role that synonymous variation can play in disease. To address these problems, we are generating a pipeline designed to calculate evolutionary conservation across the human exome by extracting interspecific homologous sequences from genome databases and using codon-based likelihood models of sequence evolution to estimate gene-wide and site-specific conservation. These models use both nonsynonymous and synonymous rate variation, lending improved resolution to site-specific calculations of evolutionary conservation. Here, we use codon-based molecular evolutionary analyses of ten genes linked to retinal degeneration to assess the applicability of codon-based methods in predicting pathogenic (disease-associated) mutations. We estimated evolutionary constraint (dN/dS) across all coding sequences, as well as between pathogenic and nonpathogenic codon sites. Differences in the magnitude of nonsynonymous and synonymous substitution rates were assessed between pathogenic and nonpathogenic sites, with pathogenic sites consistently coinciding with regions of high evolutionary conservation. Based on these results, we are moving forward with the development of a pipeline to automate the calculation of evolutionary constraint across vertebrate and mammalian sequences for all protein-coding genes. 
Additionally, metrics of conservation will be incorporated into an improved algorithm for the prediction of deleterious amino acid substitutions implicated in human hereditary disease.

An -omics approach to improve diagnosis and treatment of the lancet liver fluke

Download

Date: TBA
Room: TBA

Sonja Dunemann, University of Calgary, Canada
Matthew Workentine, University of Calgary, Canada
Cameron Goater, University of Lethbridge, Canada
John Gilleard, University of Calgary, Canada
James Wasmuth, University of Calgary, Canada
Doug Colwell, Lethbridge Research and Development Centre, Canada

Presentation Overview:

Dicrocoelium dendriticum is an invasive parasite of primary veterinary importance [Goater and Colwell, 2007, Beck et al., 2014]. D. dendriticum infects cattle, goats, and sheep, leading to massive production losses in the industry [Jahed Khaniki et al., 2013]. Previously, the distribution of the parasite was thought to be confined to Western Europe [Beck et al., 2015], but it has now spread worldwide [Cabeza-Barrera et al., 2011, Beck et al., 2015].

To improve diagnostic detection methods and study host-parasite interaction, we are investigating the transcriptome and the secretome of D. dendriticum. In the absence of a robust genome assembly for D. dendriticum, we are using RNA-seq data to create reliable gene models. The gene models are subsequently used for the analysis of D. dendriticum secretome data, with the focus on antigens and immunity-modulating proteins.

First, creating a reference antigen catalogue for D. dendriticum would allow the development of novel antibody-based detection methods. This will unlock more reliable tools for the study of epidemiology of D. dendriticum and other parasitic trematodes, which is urgently needed.
Second, the investigation of immunomodulation will help us to understand the mechanisms of host-parasite interaction.
This will bring better understanding of establishment and maintenance of the infection, advancing prevention strategies and novel drug targets.

Hence, this work will shed new light on an important veterinary parasite, D. dendriticum, and pave the road for further genomic research.

Deciphering epigenetic signatures underlying coding status and cell-type specificity

Download

Date: TBA
Room: TBA

Pamela Wu, NYU, United States
David Fenyo, NYU, United States

Presentation Overview:

It is unknown how much information about genome annotations is encoded in the signals of histone modifications surrounding a gene locus. It has been extensively shown that expression and transcription factor binding can be predicted well with histone modification signals, showing that histone modification patterns can be read as an indicator of a regulatory mode that varies by transcriptional activity. Those models used ChIP-seq signal values of histone modifications as features, but recently it has been argued that chromatin states are better predictors of expression and transcript location than individual histone modification signals. The aim of this study was to test the power of chromatin states to predict gene expression and whether or not a locus contains a coding gene. We also explored the possibility of using chromatin states to predict the cell-type specificity of a locus. We constructed features for supervised classification out of the probabilistically weighted enrichment of chromatin states at the transcription start and termination sites of gene loci. We found that the chromatin states model was more effective at predicting coding status than previous studies using histone modifications, an effect that increased at higher levels of expression, and that, surprisingly, chromatin states could separate cell-specific and non-specific populations of genes while subtracting the effects of expression. This suggests the possibility that specificity is governed by two distinct modes of chromatin state regulation. Overall, we find that coding status and cell-type specificity are significantly associated with differential modes of chromatin state signals.

MIMEAnTo – A bioinformatics tool to profile functional RNA

Download

Date: TBA
Room: TBA

Maureen Smith, FU Berlin, Germany
Redmond Smyth, Université de Strasbourg, France
Roland Marquet, Université de Strasbourg, France
Max von Kleist, FU Berlin, Germany

Presentation Overview:

RNA has long been believed to mainly serve as a blueprint for proteins (in the form of mRNA). However, a large diversity of so-called functional RNAs has recently been discovered to regulate virtually all cellular processes (Wan et al (2011) Nat Rev Genet 12: 641-55). Characterizing functional domains and structure-function relations in RNA remains a major challenge, as classical approaches are time-consuming and require substantial experience.
We recently developed the mutational inference mapping experiment (MIME, Smyth et al (2015) Nat Meth 12:866-872), which is a powerful, time- and cost-efficient experimental method to identify and characterize RNA domains and structures important for the RNA’s function of interest. In MIME, RNA is randomly mutated, subjected to selection-by-function, physically separated, and sequenced using NGS. The mutation frequencies in the functionally selected vs. supernatant pools contain information about the function and structural commitment of each nucleotide within the analyzed RNA: for example, a nucleotide whose mutation does not affect selection is not required for the function of the RNA. Similarly, the precise functional and structural role of a nucleotide may be inferred if specific mutations are tolerated while others are not.
Here, we present cross-platform software (MIME Analysis Tool, MIMEAnTo) to analyze the vast pool of data. The software has a wizard-like graphical user interface that guides the user through the data analysis procedure in 3 steps: (i) Raw mutation frequencies in the selected and supernatant RNA pools are provided as input, and plausibility checks allow the user to interactively assess the data for errors. (ii) Position- and mutation-specific sequencing-error correction is provided based on control experiments. (iii) Quality criteria for the evaluation are entered by the user (e.g. signal-to-noise ratios), and quantitative binding parameters for each possible mutation at each nucleotide, including a non-parametric statistical ascertainment, are calculated from the data. Finally, publication-ready graphical and tabular output can be generated to conclude the analysis.

Analysis of biclustering algorithms using high dimensional gene expression data from skeletal tissues

Download

Date: TBA
Room: TBA

Katie Ovens, University of Saskatchewan, Canada
Ian McQuillan, University of Saskatchewan, Canada
Patsy Gómez-Picos, University of Saskatchewan, Canada
Brian Eames, University of Saskatchewan, Canada

Presentation Overview:

Understanding of molecular and cellular processes behind bone development and the differences from cartilage formation is currently limited. To further elucidate these processes would contribute to describing an evolutionary relationship between cartilage and bone. Furthermore, it allows for improved classification of normal skeletogenesis and degenerative conditions such as endochondral ossification and osteoporosis, respectively. To accomplish this requires system-level characterization of two possible gene regulatory networks (GRNs) driving cartilage and bone development. Analysis of the transcriptome to discover GRNs typically involves identifying subsets of biologically relevant genes with correlated expression before predicting relationships between them. Clustering is used to select gene subsets when using small sample sizes to improve GRN prediction accuracy. Several biclustering algorithms in particular have worked well when handling multiple tissues, but no gold standard has been identified. Three biclustering algorithms were selected on their ability to cluster either distinct tissues or tissue subtypes. The algorithms were evaluated on their ability to group skeletal tissues in distinct biclusters and separate genes in a biologically relevant manner with RNA-seq data collected from immature cartilage, mature cartilage, and bone in mouse. While immature cartilage persists over an organism’s lifetime, mature cartilage is replaced by bone during endochondral ossification. Biclustering algorithms proficient at clustering tissue subtypes as opposed to distinct tissues are hypothesized to be more appropriate for clustering mature cartilage samples if a majority of gene expression characteristics in mature cartilage depend on bone and immature cartilage GRNs as a consequence of its evolution. From preliminary results, the best biclustering option was FABIA although it was unable to separate all tissues in a single run. 
However, both Plaid and FABIA show evidence of noise sensitivity with few samples, which may be a significant limitation for exploratory analysis of this gene expression data suggesting global clustering may be a stable alternative.

Complex-proficient while being signaling-deficient: insight into the effects of mutations on TLR4 function

Download

Date: TBA
Room: TBA

Muhammad Ayaz Anwar, Ajou University, Suwon, Korea, Republic of
Sangdun Choi, Ajou University, Korea, Republic of

Presentation Overview:

Toll-like receptor (TLR) 4 is a cell surface receptor of the innate immune system that initiates a signaling cascade to challenge bacterial intrusion. For efficient signaling, TLR4 needs to form a stable complex necessary to dimerize the cytoplasmic domain. However, polymorphism may abrogate the stability of the complex, leading to compromised TLR4 signaling. In particular, D299G and T399I mutations in the ectodomain render TLR4 nonfunctional. Crystallography has provided valuable insights into the structural aspects of the TLR4 ectodomain; however, the dynamic behavior of polymorphic TLR4 is still unclear. Here, we employed molecular dynamics simulations (MDS) to decipher the dynamic structural parameters associated with TLR4 mutations. The time-dependent parameters, such as the secondary structure definition, sharp decrease in the rotational correlation function, and low-frequency motions indicated by principal component analysis, are correlated with the differential behaviors of TLR4 mutant forms. Lower number density values, a higher mean Cα distance, and substantial differences in the dihedral distribution were observed in the mutated forms. Normal mode analysis (NMA) further revealed that the mutational variants of TLR4 acquired ‘z-shaped’ forms. Therefore, MDS analysis significantly illuminated the mutant-specific conformational alterations, and these are helpful in deciphering the mechanism of loss-of-function mutations.

NCBI and Regional Data Science, Bioinformatics and Genomics Hackathons!

Download

Date: TBA
Room: TBA

Ben Busby, NCBI, United States

Presentation Overview:

Gaps exist between public datasets and the open-source software tools built by the community to analyze those that are similar. We assemble groups of data science, bioinformatics and genomics professionals and software developers in hackathons to rapidly prototype software to address these gaps. Goals, as defined in collaboration with the project leads, are distributed to participants at the beginning of the event and are refined during the event. Software, scripts, and pipelines were developed and published on GitHub, a web service providing publicly available, free-usage tiers for collaborative software development. The code is published at https://github.com/NCBI-Hackathons/ with repositories for each team. Hackathons happening in 2016 will be announced on NCBI News and on this poster.

Codon Usage Under the Lens of Machine Learning

Download

Date: TBA
Room: TBA

Eric Ho, Lafayette College, United States

Presentation Overview:

Eighteen of the twenty amino acids are encoded by multiple codons. Although an amino acid can be translated from a set of equivalent codons, the usage of these synonymous codons has been found to be nonrandom in nature. This phenomenon is termed codon usage bias (CUB). CUB is prevalent in living organisms including viruses, influencing gene expression, splicing, mRNA stability, and protein translation. Studies have also shown that CUB plays a role in the retention of virulence by some pathogens. Findings of CUB in viruses have been translated for vaccine development by producing attenuated viruses through suboptimal codon usage. Thus, studying CUB is important in answering basic questions in molecular biology and translating it for biomedical uses. Our goal is to use supervised machine learning methods to identify distinctive features that influence codon usage. Two regression methods were used in this study: logistic regression and LASSO. We trained the supervised models on real coding sequences and simulated sequences from a wide variety of organisms, including four animals, three plants, and three microorganisms. A key advantage of regression methods over other machine learning methods such as support vector machines is the interpretability of predictors. In order to create realistic negative sequences to challenge the model during training, we have developed a coding sequence simulator that generates sequences preserving the amino acid and nucleotide composition of the real sequences. By analyzing the predictors of regression models obtained from different organisms, we have identified codon features in animals that are distinct from those in microorganisms, such as the joint nucleotide profile at codon junctions, suggesting that an added layer of complexity is involved in shaping codon usage in higher organisms. In conclusion, we have proposed a novel and effective approach to elucidate codon usage. 
Our findings and the coding sequence simulator will benefit the research community in unraveling the underlying biology of codon usage in different organisms.
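
To make the composition-preserving constraint concrete, the Python sketch below shows one simple scheme that satisfies it: codons are shuffled only among positions that encode the same amino acid. This keeps the protein sequence, the codon counts, and the nucleotide composition unchanged while altering codon order and therefore codon junctions. The abstract does not describe how the authors' simulator works, so this scheme, the function names, and the use of the standard genetic code are assumptions for illustration only.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE: Dict[str, str] = dict(zip(CODONS, AMINO_ACIDS))


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def shuffle_synonymous_codons(seq: str, rng: random.Random) -> str:
    """Hypothetical simulator: permute codons among positions with the same amino acid."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons: List[str] = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    positions = defaultdict(list)
    for idx, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(idx)
    simulated = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)  # reorder the synonymous codons used for this amino acid
        for i, codon in zip(idxs, pool):
            simulated[i] = codon
    return "".join(simulated)
```

Passing an explicit random.Random instance keeps the simulated negatives reproducible across training runs.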

Statistical Models to Capture Mutational Properties for NextGen Sequencing Data

Download

Date: TBA
Room: TBA

Matthew R. Weber, University of Illinois at Urbana-Champaign, United States
Zachary D. Stephens, University of Illinois, United States
Liudmila S. Mainzer, University of Illinois, United States
Matthew Hudson, University of Illinois, United States
B. F. Francis Ouellette, Ontario Institute for Cancer Research, Canada
Morgan Taschuk, Ontario Institute for Cancer Research, Canada

Presentation Overview:

This project is part of an effort to faithfully simulate cancer sequencing datasets while accurately reproducing their statistical properties. The ability to simulate Next Generation sequencing data will be useful for testing bioinformatics tools designed for cancer data, and for training physicians and analysts. The present work specifically addresses statistical properties of NextGen sequencing data, to provide correct models of genomic variation. The aim is to produce realistic simulated data that is free of clinical confidentiality and other restrictions.

Although many such statistical properties have already been described, the published datasets frequently lack the richness and resolution required for accurate simulation. We analyze the pooled variant information from the International Cancer Genome Consortium, dbSNP and other sources to model mutational rates for single nucleotide substitutions as a function of the nucleotide itself and its surrounding trinucleotide context. The frequency is calculated relative to the reference trinucleotide for dbSNP or relative to the corresponding germline trinucleotide for cancer data. When heterozygous alleles are encountered, we randomly pick one, to create variant models appropriate for simulation of diploid genomes. We also derive indel probabilities and indel length distributions. All of the above variant statistics are provided in the context of functionally important positional information within the genome, such as variant location within introns or exons, 5' UTR, regulatory regions, etc. This is accomplished by computing the mutational models in a sliding window of varying sizes along the genome.
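
The trinucleotide-context frequency described above can be illustrated with a minimal Python sketch. It is not taken from the project's repository. The function name, the 0-based (position, alternate base) variant format, and the normalization by the number of times each context occurs in the reference are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def trinucleotide_substitution_rates(
    reference: str,
    variants: Iterable[Tuple[int, str]],  # hypothetical format: (0-based position, alt base)
) -> Dict[Tuple[str, str], float]:
    """Estimate substitution frequency per (trinucleotide context, alt base)."""
    # Count how often each trinucleotide occurs, centred on every interior position.
    context_counts = Counter(
        reference[i - 1:i + 2] for i in range(1, len(reference) - 1)
    )
    substitution_counts: Counter = Counter()
    for pos, alt in variants:
        # Skip positions with no complete flanking context and non-substitutions.
        if pos < 1 or pos > len(reference) - 2 or alt == reference[pos]:
            continue
        substitution_counts[(reference[pos - 1:pos + 2], alt)] += 1
    # Frequency relative to the number of times the context appears in the reference.
    return {
        (context, alt): count / context_counts[context]
        for (context, alt), count in substitution_counts.items()
    }
```

For cancer data, the same counting could be run against the matched germline sequence instead of the reference, as the abstract describes, and repeated per sliding window to obtain position-specific models.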

We have produced mutational models for a number of different cancers, such as breast cancer, leukemia, melanoma, and are in the process of doing so for other cancers. In ongoing work we are adding more detailed mutation model statistics, including heterozygous variant models, longer insertion motifs, heterozygous/homozygous ratios and large structural variants.

Our scripts can be found at: https://github.com/oicr-gsi/NeatDataStructures.git

Assessment of the vaccine on tuberculosis control using age-structured mathematical model

Download

Date: TBA
Room: TBA

Chacha Issarow, University of Cape Town, South Africa

Presentation Overview:

Despite the fact that tuberculosis (TB) vaccine has been implemented globally, TB remains a public health concern that causes morbidity and mortality in the community. Here, we design an age-structured mathematical model that incorporates vaccination for susceptible individuals (young age groups [0, 5) and [5, 15) years) to explore the impact of the vaccine on TB control. Furthermore, we examine the conditions under which a TB vaccine can be useful in a high-transmission, high-reinfection community and examine assumptions about natural immunity. To explore the impact of the vaccine, the study was divided into two parts, (i) without vaccination and (ii) with vaccination, and we observed the sensitivity of the model in the applied data. In the first case (without vaccination), high active TB was detected in the age groups [0, 5), [15, 25), [45, 55) and [55, 65) years, with notification rates of 562, 484, 505 and 484 per 100,000, respectively. The lowest active TB was detected in the age group [5, 15) years, with a notification rate of 83 per 100,000 population. In the second case, when the vaccine was introduced in the population, active TB decreased in the age groups [0, 5), [5, 15) and [15, 25), where TB notification rates became 282, 30 and 90 per 100,000, respectively. However, in both cases (with and without vaccination), active TB detection remained high in the age groups [25, 35), [35, 45), [45, 55) and [55, 65) years (with notification rates of 404, 449, 505 and 484, respectively), suggesting that these age groups are at excessively high risk of developing TB disease. This shows that active disease progression depends on age and on the average duration of the waning of the vaccine. The findings of this study suggest that the new vaccine should be applicable to both children and adults, while targeting age groups at high risk of disease progression for TB control.

Identification of age associated differentially expressed genes in prostate cancer

Download

Date: TBA
Room: TBA

Lihong Yin, Department of Statistics, the University of Akron, Akron, OH, United States
Farahnaz Rahmatpanah, Department of Pathology & Laboratory Medicine, University of California, Irvine; Irvine, CA, United States
Michael McClelland, Department of Pathology & Laboratory Medicine, University of California, Irvine; Irvine, CA, United States
Dan Mercola, Department of Pathology & Laboratory Medicine, University of California, Irvine; Irvine, CA, United States
Jun Ye, Department of Statistics, the University of Akron, Akron, OH, United States
Zhenyu Jia, Department of Botany & Plant Sciences, University of California, Riverside; Riverside, CA, United States

Presentation Overview:

Background. Prostate cancer is the most common non-cutaneous cancer affecting men in North America. It kills more than 200,000 men annually worldwide. Several major factors influence the risk of developing prostate cancer, including age, the greatest risk factor. Prostate cancer mainly affects men over age 50, and the risk increases with age. More than two-thirds of all prostate cancers are diagnosed in men over the age of 65. Since prostate cancer is well accepted to be an aging-related cancer, we set out to determine whether alterations in disease-relevant genes depend on the age of prostate cancer patients.

Methods. We analyzed microarray data generated from 118 prostate cancer patients with a variety of clinical characteristics, such as age, percentage of tumor or stroma, pre-operative PSA levels, Gleason scores, and lymph nodes. The LIMMA package from Bioconductor was used to detect differentially expressed genes. A multiple linear regression model was applied to fit gene expression data with age, or with age interacting with tumor or stroma. Differentially expressed genes were selected using a p-value threshold of 0.001. The false discovery rate was checked by a permutation procedure.

Results. We used two approaches to select age-associated genes. The first was to identify differentially expressed (DE) genes between two age groups (age <= 65 and age > 65). We detected 6835 differentially expressed genes between tumor and normal biopsies. Among these genes, we selected 61 and 32 DE genes between age groups with p-values less than 0.01 and 0.005, respectively. We checked the false discovery rate (FDR) by permutation and found that the FDR for these selected genes was about 100%. This may be due to the large variation in the percentages of tumor and stroma among the samples, which makes it hard to compare differential gene expression between age groups. The second approach used a multiple linear regression model to identify genes associated with age, or with age interacting with tumor or stroma. We detected 189 genes that were significantly associated with age (p-value less than 0.001). In addition, 222 and 126 genes were significantly associated with age interacting with tumor (age*tumor) or with stroma (age*stroma), respectively. The FDR is about 20% for genes related to age or age*tumor. Among these genes, 104 were related to age, age*tumor and age*stroma.

Conclusions. From our advanced microarray analysis, we identified age-related genes in tumor epithelium cells as well as tumor-adjacent stroma cells in prostate cancer. Our findings provide novel insight into the biological connection between aging process and prostate cancer. Further analysis and validation of these age associated genes could improve our understanding of prostate cancer progression, especially the interaction between prostate tumor and its microenvironment, and eventually lead to precision therapies to benefit prostate cancer patients.

Challenges and opportunities of differential methylation identification using bisulfite sequencing data

Download

Date: TBA
Room: TBA

Shuying Sun, Texas State University, United States
Xiaoqing Yu, Yale University, United States

Presentation Overview:

DNA methylation plays an important role in regulating gene expression in cancerous cells. It is important to study differential methylation (DM) patterns between two groups of samples (e.g. cancerous vs. normal individuals). With next generation sequencing (NGS) technologies and the bisulfite technique, it is now possible to identify DM patterns by considering methylation at the single CG site level in an entire genome. However, it is challenging to analyze large and complex NGS data. Even though a number of methods have been developed for DM identification, there is still not a consensus on statistical and computational approaches. In this poster, we will show the comprehensive comparative analysis results of five DM identification methods: methylKit, BSmooth, BiSeq, HMM-DM, and HMM-Fisher. Our comparison results are summarized below. First, parameter settings may largely affect the accuracy of DM identification. Second, all five methods show more accurate results when identifying simulated DM regions that are long and have small within-group variation, but they have low concordance, probably due to the different approaches they have used for DM identification. Third, HMM-DM and HMM-Fisher yield relatively higher sensitivity and lower false positive rates than others, especially in DM regions with large variation. With the comprehensive comparative analysis, we aim to share new perspectives about the challenges and opportunities of DM identification, which may pave the way for novel and better methodology development in the near future.

Redefining the breast tumor margin utilizing a computational genomic ruler of tumor and tumor-adjacent normal tissue

Download

Date: TBA
Room: TBA

Amanda Ernlund, New York University School of Medicine, United States
Shubha Dhage, New York University School of Medicine, United States

Presentation Overview:

Objectives: Current treatment for breast cancer includes biopsy of the tumor to visually determine the extent of tumor infiltration into surrounding tissue, ultimately informing surgeons of the tumor location and the margin of normal tissue surrounding the tumor to remove. Despite the use of chemotherapy following surgery, breast tumors may recur locally near the site of removal, or metastasize to distant locations, suggesting that occult disease and/or a malignant microenvironment are present in the patient following surgery. Current methodology has highlighted that tumor-adjacent normal tissue harbors genomic changes that reflect early tumor transformation events and promote a tumor-nurturing environment. However, the extent to which abnormal genomic changes penetrate into the surrounding tissue has not been examined. Characterizing a genomic tumor margin could ultimately impact the ability of surgeons to distinguish tumor and tumor-like normal tissue from truly normal tissue.

Methods: Samples of tumor and tumor-adjacent histologically normal tissue (verified by a pathologist to be cancer-free) at 5 mm, 10 mm, 15 mm, and 20 mm were obtained from 32 patients undergoing mastectomy for invasive ductal breast cancer. Tissues were analyzed for genome-wide mRNA expression using microarrays. Following background correction and normalization, an unsupervised NMF analysis was performed for data dimension reduction and tissue clustering. To define a tumor-like gene signature, pair-wise statistical analysis was performed to isolate genes with a 2-fold change in expression (adjusted p-value < 0.01) in at least one distance but not all distances of normal tissue compared to tumor. Using tumor-like genes, hierarchical clustering was used to separate genes into 7 clusters based on average gene expression levels across tumor and each distance of normal tissue, and the clusters were further examined for functional characterization using Ingenuity Pathway Analysis. Correlation studies examining each patient’s gene expression in normal tissue compared to tumor were carried out. All computational analyses were performed using either SciPy packages in Python or R Bioconductor packages.

Results: We propose a gene expression molecular ruler for determining tumor margins in a cohort of patients. Unsupervised NMF analysis of the whole gene set in tumor and normal tissues revealed three clusters of samples: tumor (cluster 1), 5 mm and 10 mm (cluster 2), and 15 mm and 20 mm (cluster 3), suggesting that gene expression can classify samples by tumor status and by the distance of normal tissue from tumor. To identify genes with similar expression in tumor and at least one distance of normal tissue, we defined a tumor-like gene signature. Using hierarchical clustering, we segregated the tumor-like genes into 7 functional clusters, each displaying a distinct pattern of gene expression across tumor and all distances of normal tissue. Key tumorigenic pathways displayed a general decrease in gene expression at distances further from tumor, and the 20 mm region in particular displayed an enrichment of pathways necessary for normal tissue maintenance. Finally, when comparing patient samples individually for correlation between tumor and each distance of normal tissue using the tumor-like signature, we found that 50% or more of the samples at 5 mm, 10 mm, and 15 mm correlated well (R > 0.5) with tumor expression, compared to only 30% of the 20 mm tissues.

Conclusion: These results demonstrate that tissue categorized as histologically normal displays genomic changes similar to tumor, particularly in normal tissue isolated closer to the tumor. These genomic changes display a gradient from the tumor, truncating at 20 mm. Ultimately, using computational strategies to distinguish genetically normal tissue could have implications for developing more finely tuned surgical margins, impacting both surgical and disease outcomes in breast cancer.
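
The tumor-like signature rule described in the Methods can be illustrated with a minimal Python sketch. It assumes that per-gene log2 fold changes and adjusted p-values (normal tissue versus tumor) have already been computed for each distance; the function names and the toy values below are hypothetical and are not taken from the study.

```python
# Hypothetical sketch of the "tumor-like" gene rule: a gene qualifies when it
# differs from tumor (>= 2-fold, adjusted p < 0.01) at one or more distances,
# but not at all of them.

DISTANCES_MM = (5, 10, 15, 20)


def differs_from_tumor(log2_fc, adj_p, min_abs_log2_fc=1.0, max_adj_p=0.01):
    """True if expression at one distance differs from tumor.

    A 2-fold change corresponds to |log2 fold change| >= 1.
    """
    return abs(log2_fc) >= min_abs_log2_fc and adj_p < max_adj_p


def is_tumor_like(stats_by_distance):
    """stats_by_distance maps distance (mm) -> (log2 fold change, adjusted p)."""
    flags = [differs_from_tumor(*stats_by_distance[d]) for d in DISTANCES_MM]
    return any(flags) and not all(flags)


# Toy values (assumptions for illustration only).
gene_near_similar = {5: (0.2, 0.60), 10: (0.4, 0.30), 15: (1.3, 0.004), 20: (1.8, 0.0001)}
gene_always_different = {5: (2.0, 0.001), 10: (2.1, 0.001), 15: (2.2, 0.001), 20: (2.5, 0.001)}
gene_never_different = {5: (0.1, 0.9), 10: (0.0, 0.9), 15: (0.2, 0.8), 20: (0.3, 0.7)}

print(is_tumor_like(gene_near_similar))
```

In this reading, a gene that resembles tumor at 5 mm and 10 mm but differs at 15 mm and 20 mm is kept, while genes that differ everywhere or nowhere are excluded.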

Identification of Dipyridamole Analogues for the Treatment of Acute Myeloid Leukemia using Computational Pharmacogenomics Approaches

Date: TBA
Room: TBA

Deena Gendoo, Princess Margaret Cancer Centre & University of Toronto, Canada
Alexandra Pandyra, Princess Margaret Cancer Centre, Canada
Peter Mullen, Princess Margaret Cancer Centre, Canada
Joseph Longo, Princess Margaret Cancer Centre, Canada
Jenna van Leeuwen, Princess Margaret Cancer Centre, Canada
Linda Penn, Princess Margaret Cancer Centre, Canada
Benjamin Haibe-Kains, Princess Margaret Cancer Centre, Canada

Presentation Overview:

Identification of novel drug combinations to treat acute myeloid leukemia (AML) remains necessary to improve management of this hematological malignancy. Our previous investigations identified a synergistic combination of dipyridamole (DP) and statins that induces apoptosis in multiple myeloma and AML cell lines and primary patient samples, and decreases in vivo tumour growth. Statin drugs target the metabolic mevalonate (MVA) pathway. Our findings indicated that inhibition of the statin-induced restorative feedback loop by agents such as DP serves as a mechanism to potentiate the anticancer efficacy of statins. Here, we conduct an exploratory analysis to identify novel potentiators of statin anticancer activity, similar to DP, using an integrative pharmacogenomics and computational pipeline. We implement multiple cross-comparative analyses that harness drug structures, transcriptomic signatures, and drug sensitivity profiles to identify DP-like drugs that demonstrate specificity for targeting the MVA pathway. Our analyses exploit a large collection of high-throughput, curated pharmacogenomics datasets that encompass vast breadth and depth across cell lines, drugs, and genetic information. These datasets include (i) transcriptionally profiled cancer cell lines from the new L1000 dataset, which contains over 1.3 million gene expression profiles spanning 20,000 drugs, and (ii) drug sensitivity data from the NCI60 and recently released CTRPv2 datasets, spanning over 800 cell lines across thousands of drugs. We identified new drug combinations that demonstrate similar efficacy to DP when tested in AML cell lines, and that may serve as new cancer therapies.

Mapping environmental virus-host interactions in solar salterns through metagenomics

Date: TBA
Room: TBA

Abraham Moller, Miami University, United States
Chun Liang, Miami University, United States

Presentation Overview:

Solar salterns are excellent model ecosystems for the study of virus-archaeal interactions because of their low microbial diversity, environmental stability, and very high viral density. By using a set of novel bioinformatics tools to analyze saltern metagenomes, we mapped virus-host interactions across a geographically diverse set of salterns and related them to carbon cycling in these environments. We identified possible host microbes such as Haloquadratum, Halorubrum, and Haloarcula using taxonomic profiling and also found that microbial community composition related not only to salinity but also to local environmental dynamics. Subsequent characterization of glycerol metabolism genes in these ecosystems suggested most dihydroxyacetone kinase (DhaK) genes affiliate to Halorubrum and Haloquadratum and most NAD+ dependent glycerol-3-phosphate dehydrogenase genes affiliate to Salinibacter. We identified CRISPR spacers in metagenomes with two different methods and found more spacers in the generally Halobacteriaceae-dominated IC21 and Cahuil salterns compared with the specifically Haloquadratum-dominated SS19, SS33, and SS37 salterns, suggesting a low level of CRISPR diversity and possibly a high rate of CRISPR loss in the Haloquadratum-dominated salterns. After CRISPR detection, spacers were aligned against halovirus genomes to map virus to host. While most alignments linked viruses to the abundant Haloquadratum walsbyi, there were clusters of interactions with less abundant taxa Haloarcula and Haloferax. Further examination of the dimer and codon usage differences between paired viruses and hosts and detection of Cas genes in the salterns confirmed both the plausibility of virus-host interactions and the possibility of CRISPR activity. Taken together, our studies suggest viruses are critical players in saltern carbon cycling, and the loss of CRISPRs may play an important role in regulating viral-mediated nutrient cycling in these environments.
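
The spacer-to-virus mapping step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: exact substring matching on both strands stands in for the sequence alignment used in the study, and the host names, virus names and sequences are invented.

```python
# Hypothetical sketch: link viruses to hosts when a CRISPR spacer from the host
# occurs in the virus genome (forward or reverse-complement strand).
# Exact matching is a stand-in for real alignment with mismatches.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def map_spacers_to_viruses(spacers_by_host, virus_genomes):
    """Return sorted (virus, host) pairs supported by one or more spacer hits."""
    links = set()
    for host, spacers in spacers_by_host.items():
        for spacer in spacers:
            rc = reverse_complement(spacer)
            for virus, genome in virus_genomes.items():
                if spacer in genome or rc in genome:
                    links.add((virus, host))
    return sorted(links)


# Toy data (assumptions for illustration only).
spacers_by_host = {"host_A": ["AAACCG"], "host_B": ["GATTACA"]}
virus_genomes = {"virus_1": "GGGAAACCGTTT", "virus_2": "TTCGGTTTAA"}

links = map_spacers_to_viruses(spacers_by_host, virus_genomes)
print(links)
```

The resulting pairs form the edges of a virus-host network; a real analysis would also need a mismatch tolerance and a minimum spacer length.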

Study of alternative snoRNAs characterization methods by deep sequencing

Date: TBA
Room: TBA

Fabien Dupuis-Sandoval, Sherbrooke University, Canada
Michelle Scott, Sherbrooke University, Canada
Douglas Wu, University of Texas at Austin, United States
Ryan Nottingham, University of Texas at Austin, United States
Sherif Abou Elela, Sherbrooke University, Canada
Alan Lambowitz, University of Texas at Austin, United States

Presentation Overview:

Small nucleolar RNAs (snoRNAs) are small non-coding RNA (ncRNA) species ranging from 60 to 150 nucleotides in length and conserved in eukaryotes. Each family of snoRNAs is responsible for binding and guiding a distinct set of proteins. The C/D snoRNA family guides fibrillarin to methylate target RNAs, while the H/ACA snoRNA family guides dyskerin to pseudouridylate targets. SnoRNAs guide by pairing to complementary sequences in their targets. Those targets are often found in ribosomal RNAs (rRNAs) and small nuclear RNAs (snRNAs), where they take part in ribosome biogenesis. When ribosome synthesis is increased, snoRNA production is also increased. However, the extent of snoRNA involvement in cancer remains unexplored, even as new metabolic pathways involving snoRNAs continue to surface. These functions range from lipotoxicity resistance to alternative splicing and chromatin unwinding. Furthermore, snoRNA subclasses are often deregulated in cancerous cell types.

At this point in time, there is no standard method to analyze highly structured, small (<350 nt) RNA species such as snoRNAs.

In an effort to characterize snoRNAs, we have developed a new unbiased method to perform unfragmented transcriptomic sequencing. The absence of fragmentation allows quantification of all RNA species in the transcriptome. Sequencing datasets were generated using SKOV3ip1 (ovarian cancer) cells. These RNA-seq datasets were processed with a bioinformatics pipeline to quantify changes in snoRNA abundance and to predict non-canonical snoRNA interactors. This process requires aligning reads to genomic positions and annotating them. The annotation has to be flexible enough to identify and categorize all extensions to snoRNAs.

To validate the quantification of the sequencing datasets, a subset of ncRNAs was selected for analysis by qPCR. By correlating qPCR relative abundance with sequencing data, we showed that the new sequencing protocol is far more representative of the ncRNAs present than the most widely used current methods. This innovative method also retains the ratios between ncRNAs whether the libraries were generated using fragmented or unfragmented RNA. Finally, this analysis reveals a wide range of snoRNA subclasses, not limited to C/D snoRNAs.

Fragment Warheads for Selective Drug Design against Bacterial Glutaredoxins

Date: TBA
Room: TBA

Daniel Morris, The University of Akron, United States
Ram Khattri, The University of Akron, United States
Stephanie Bilinovich, The University of North Carolina at Chapel Hill, United States
Thomas Leeper, The University of Akron, United States

Presentation Overview:

Glutaredoxins (GRXs) play an integral role in buffering the redox state of cells. By extension, they are considered mediators of cell processes that occur at specific redox potentials. Acting as secondary controllers in many diverse pathways, GRXs represent an untapped reservoir of novel drug targets. These proteins are characterized by a dithiol-containing active site with a Cys-Pro-Tyr-Cys (CPYC) motif. By exploiting the nucleophilic properties of these active site cysteines, it is possible to develop covalent inhibitors for GRXs. Thiols are prime targets for chemical modification, and covalent alkylation of enzymatically important cysteines via the Michael addition reaction has already been explored to produce irreversible inhibitors of other proteins containing hot thiols. Small molecule inhibitors containing covalently capable functional groups that target nucleophiles have been termed “warheads”. To confer target specificity, the warhead must be coupled to a target-selective driving group. Active sites among orthologous GRXs were probed for subtle structural differences to discover species-selective hits to serve as this driving group. Using saturation transfer difference nuclear magnetic resonance (NMR) spectroscopy, we screened a small molecule “fragment” library to find driving groups specific for the GRXs of two common infectious bacteria, Pseudomonas aeruginosa and Brucella melitensis. Chemical shift perturbation data were collected on species-selective hits via 2D NMR experiments and used as restraints for the molecular docking program Protein-Ligand ANT System (PLANTS), in which hit fragments are docked into the active sites using algorithms based on ant colony optimization. Potential docked poses are filtered against experimental data using computationally generated proton chemical shifts produced by the shift prediction software SHIFTS and a scoring function.
After determining the binding mode of hit fragments, electrophilic acrylamide moieties were coupled to these driving groups, producing species-selective irreversible inhibitors as lead molecules for potential antibiotics against the bacteria.

Terpene Synthases: The Mechanistically Intriguing Family of Enzymes Generating the Enormous Terpenome

Date: TBA
Room: TBA

Piyush Priya, National Institute of Plant Genome Research, India
Gitanjali Yadav, National Institute of Plant Genome Research, India

Presentation Overview:

Being sessile, plants are exposed to various kinds of stress, and to counteract them they produce a plethora of phytochemicals as end points of sophisticated survival mechanisms. Among these, the terpenes form one of the largest and most diverse classes of specialized metabolites, playing evident roles in secondary and primary metabolism and showing various pharmacological properties (e.g. the anticancer agent taxol, and forskolin, used in glaucoma treatment). The complex chemical library of about 60,000 of these structurally and stereochemically diverse natural products has been named the TERPENOME. This fascinating and puzzling structural diversity of terpenes relates directly to the key molecular players, the terpene synthase (TPS) enzymes. TPSs carry out one of the most unique reactions in biology, based on their ability to produce hundreds of regio- and stereo-specific products from a single initial substrate. Given the complexity and uniqueness of the structural attributes and catalytic mechanisms of TPSs, and the paucity of tools for their detection and classification, methods that improve identification of TPSs in massive high-throughput genomic data are indispensable. We introduce a novel platform, TERZYME (http://www.nipgr.res.in/terzyme.html), a program based on Hidden Markov Models, for identification and classification of TPSs, thus predicting the potential terpenome of a plant. Our results have identified more than 2000 novel TPSs, with a very high degree of predictive accuracy and performance better than existing resources. Further analysis of these novel TPS candidates suggested a clustered spatial pattern in several genomes, which is very interesting in view of the crucial roles played by terpenes in plant adaptive immunity.
Such clusters represent a new avenue of research and are so diverse and dynamic that they act as excellent models for studying genome plasticity and the novel mechanisms of adaptive evolution in plants.

A Morphology Profile Pipeline for Genome-wide Screens in Saccharomyces cerevisiae

Date: TBA
Room: TBA

Nil Sahin, University of Toronto, Canada
Erin Styles, University of Toronto, Canada
Adrian Verster, University of Washington, United States
Quaid Morris, University of Toronto, Canada
Brenda Andrews, University of Toronto, Canada

Presentation Overview:

Synthetic genetic array (SGA) analysis coupled with high-content screening (HCS) in Saccharomyces cerevisiae has provided a wealth of information on functional genomics. Until recently, SGA analysis of genetic interactions has used colony size as a proxy for cellular fitness. Although this metric has proven to be robust, higher resolution phenotypes such as subcellular morphology cannot be assessed. Since there are various perturbations in S. cerevisiae in which mutant growth is normal despite morphological abnormalities within subcellular compartments, it would be of great benefit to the yeast community to complement existing colony size data with cell morphology data. Using the SGA-HCS approach, the Boone and Andrews Labs at the University of Toronto have produced an image-based dataset of subcellular mutant phenotypes in the context of genome-wide perturbations. To analyze these massive datasets of 900,000 images, our labs have developed a machine learning strategy that successfully detects and classifies about half of all the observed and published phenotypes. However, it is challenging to computationally analyze and model a total of 100 classifiers for all the expected phenotypes with the existing pipeline. Thus, to complete the analysis, I optimized the existing pipeline by constructing classifiers for missing phenotypes and scoring the genes that generate aberrant morphologies. So far, the optimized pipeline can classify a substantial fraction of the mutant phenotypes in preliminary analyses. To validate genes that result in similar mutant phenotypes, I generated morphology profiles for each gene and performed a preliminary trend analysis on the morphology profiles by comparing them to the genetic interaction network to identify genes with high fitness but aberrant morphology.
By obtaining complete profiles, we can construct a new informative network for our collections to use alongside the fitness scores from genetic interaction networks. Comparing the biological interpretations of the genetic interaction network and the morphology profiles can reveal further information on biological enrichment and functional analysis that might have been overlooked by the multiplicative model of the fitness measurements alone. This analysis will allow for the identification of connections between discrete biological processes, the prediction of novel gene function, and the generation of a clearer understanding of basic eukaryotic cell biology.

Gene expression networks of transcriptional response to 78 cellular environment perturbations

Date: TBA
Room: TBA

Daniel Kurtz, Wayne State University School of Medicine, United States
Gregory Moyerbrailean, Wayne State University School of Medicine, United States
Allison Richards, Wayne State University School of Medicine, United States
Omar Davis, Wayne State University School of Medicine, United States
Chris Harvey, Wayne State University School of Medicine, United States
Adnan Alazizi, Wayne State University School of Medicine, United States
Donovan Watza, Wayne State University School of Medicine, United States
Yoram Sorokin, Department of Obstetrics and Gynecology, Wayne State University, United States
Nancy Hauff, College of Nursing, Wayne State University, United States
Francesca Luca, Wayne State Center for Molecular Medicine and Genetics, United States
Roger Pique-Regi, Wayne State University School of Medicine, United States

Presentation Overview:

Cells adapt and respond to their complex cellular environment by modulating the expression of multiple transcripts through a programmed response across a network of interconnected genes. Using RNA-seq, it is possible to measure the transcriptional response across the whole genome and identify sets of interacting genes that constitute the pathways and circuits regulating the molecular responses to specific environmental conditions (e.g. drug treatment). Here, we have implemented a weighted correlation network approach to construct a coexpression network from RNA-seq data collected across five different cell types (human umbilical vein endothelial cells, lymphoblastoid cells, peripheral blood mononuclear cells, smooth muscle cells, and melanocytes) exposed to 33 treatments, with 3 different individuals represented for each cell type, for a total of 78 distinct cellular environments. We used the R software package for Weighted Gene Coexpression Network Analysis (WGCNA) to construct a large, scale-free coexpression network for 14,527 genes, of which 7,936 cluster into 87 highly connected network modules. Most modules were found to be significantly associated with more than one cellular environment (p < 0.01, median = 7). Common cellular mechanisms for treatment response (identified by functional enrichment analysis of modules significantly associated with >12 treatments) included N-linked glycosylation, response to endoplasmic reticulum stress, and cellular homeostasis. When focusing on treatments represented in a minimum of 2 cell types, we found that caffeine and vitamin A share a similar gene expression response across at least 2 cell types. We also identified several modules that responded inconsistently across cell types, including a vitamin D-associated module containing Golgi apparatus genes (q-value = 0.003) that responded oppositely in HUVEC and PBMC cells. Modules within a single cell type that had treatment-specific responses were also considered.
For example, module 22, enriched in genes encoding proteasome activity (q-value = 4×10^-4), was down-regulated following treatment with aspirin but up-regulated in response to caffeine (a known activator of the proteasome) in smooth muscle cells. Our network analysis offers a comprehensive view of the transcriptional profiles across 78 unique cellular environments, and the resulting network modules provide an invaluable resource for better understanding the genes that direct the cellular response to changes in the environment.
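
The core idea of a weighted correlation network can be shown with a short Python sketch: pairwise correlations are raised to a power (soft thresholding) so that strong co-expression dominates the network. This is a sketch of the unsigned adjacency only, not of the WGCNA package or of the authors' pipeline; the power beta = 6, the gene names and the expression values are assumptions for illustration.

```python
import math

# Hypothetical sketch of an unsigned soft-threshold adjacency:
# a_ij = |cor(x_i, x_j)| ** beta


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def soft_threshold_adjacency(expr, beta=6):
    """expr maps gene -> expression across samples; returns {(gene_i, gene_j): weight}."""
    genes = list(expr)
    return {
        (gi, gj): abs(pearson(expr[gi], expr[gj])) ** beta
        for gi in genes
        for gj in genes
        if gi != gj
    }


def connectivity(adjacency, gene):
    """Sum of edge weights between one gene and every other gene."""
    return sum(w for (gi, _), w in adjacency.items() if gi == gene)


# Toy expression profiles across four conditions (assumptions for illustration).
expr = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with gene1
    "gene3": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with gene1
    "gene4": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with gene1
}
adjacency = soft_threshold_adjacency(expr)
print(connectivity(adjacency, "gene1"))
```

Module detection would then cluster genes on this adjacency (or on a derived similarity), a step not shown here.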

Mapping multi-protein-interface-interaction network in HIV-1 pre-integration complex (PIC)

Date: TBA
Room: TBA

Madara Hetti Arachchilage, Kent State University, United States
Brett Lowden, Kent State University, United States
Helen Piontkivska, Kent State University, United States

Presentation Overview:

Numerous studies have shown that protein-protein interactions play important roles in the HIV-1 viral life cycle and that inhibiting these interactions has significant therapeutic potential. Yet identifying the most promising targets, including the specific protein regions directly involved in protein-protein interactions, remains a challenge. This is especially the case for viral-viral protein interactions in the large nucleoprotein complex called the pre-integration complex (PIC), whose higher-order protein interaction networks involve mutually shared binding sites capable of binding multiple partners through dynamic, transient interactions. However, these multi-protein interface interactions are not fully understood. Here, we map this higher-order interface-interaction network of viral proteins associated within the PIC through a coevolutionary analysis (Hetti Arachchilage and Piontkivska 2016). Our analysis also identifies direct and prolonged interactions between multiple partners in the PIC that require high affinity and/or specificity, as well as transient interactions through dynamic conformational changes that involve shared binding sites. With validation from experimental studies (for example, site-directed mutagenesis) and/or integration with structural analysis, these findings can help to better understand the viral-viral protein interactions within the PIC and to investigate their applicability as potential targets for multi-epitope or adjuvant-based treatments and/or for the design of novel protein inhibitors that target functionally and/or structurally important regions in viral-viral protein interaction interfaces in the PIC.

Next Generation Pathway Analysis: Increasing Sensitivity and Accuracy by Incorporating Regulatory Elements

Date: TBA
Room: TBA

Marilyn Hayden, Ohio University, United States
Frank Drews, Ohio University, United States
Lonnie Welch, Ohio University, United States

Presentation Overview:

Pathway Analysis (PA) takes experimental expression data, also known as an experimental expression profile, and pathways from an existing pathway database, and uses statistical testing and other algorithms to identify relevant pathways that are enriched in the experimental expression profile. The significantly enriched pathways enable the researcher to form new hypotheses, design follow-up experiments, and confirm experimental findings. Currently, PA incorporates key structural aspects of a pathway, including interactions between two genes and pathway directionality. These structural elements are collectively known as topological information, or pathway topology. Topological information is used to assign different weights to genes based on their location in a pathway. One topological element not accounted for in current Pathway Analysis tools is regulatory elements. Regulatory elements such as Transcription Factors (TFs), Transcription Factor Binding Sites (TFBSs), and chromatin states (CS) can control distal genes, function in cell fates or states, and control more than one gene. Adding regulatory elements will allow topology-based Pathway Analysis methods to more accurately reflect and model real biological systems. The proposed algorithm will model CS, TFs and TFBSs extracted from regulatory element databases. To evaluate accuracy and sensitivity, the proposed algorithm and the current generation of PA tools will analyze expression datasets from 36 disease models and attempt to identify known pathways specific to each disease model. To further assess the proposed algorithm’s sensitivity, disease progression expression datasets for three different diseases (Alzheimer’s Disease, cervical cancer and Type II diabetes) will be analyzed by the proposed algorithm and by the current generation of PA tools. Each of these diseases has finite levels of progression (i.e. early, moderate and post-partum) and a known pathway specific to the disease.
If the proposed algorithm is more sensitive than the current generation of algorithms, only it will be able to detect the known pathway associated with each disease in the early and/or moderate stage expression datasets.
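
For orientation, the simplest form of pathway enrichment testing (over-representation analysis with a hypergeometric test) can be sketched as follows. This is a common baseline that ignores topology and regulatory elements; it is not the proposed algorithm, and the function name and the gene counts are illustrative assumptions.

```python
from math import comb

# Baseline sketch: one-sided hypergeometric test for pathway over-representation.
# Topology weights and regulatory elements are deliberately not modelled here.


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative counts: 10 background genes, 5 in the pathway,
# 5 differentially expressed genes, all 5 of them in the pathway.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
print(p_value)
```

A topology-aware method would replace the equal treatment of genes in this count with position-dependent weights.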

CRF: a web server for detecting CRISPRs in microbe genome

Date: TBA
Room: TBA

Kai Wang, Miami University, United States
Chun Liang, Miami University, United States

Presentation Overview:

CRISPRs (clustered regularly interspaced short palindromic repeats) are special DNA sequence fragments found in bacterial and archaeal genomes. They serve as a microbial immune system against invasion by viruses or plasmids, and with the adoption of the CRISPR/Cas9 system as a genome editing tool, CRISPRs are becoming more important for biologists. Several tools can already be used to detect CRISPRs in microbial genome data, such as CRT (CRISPR Recognition Tool), CRISPRFinder, and PILER-CR. Here we developed a pipeline called CRF for detecting CRISPRs that uses a random forest classifier combining the base composition and secondary structure of repeat sequences as feature vectors (triplet elements). In contrast to programs based on purely computational methods, these triplet elements capture the biological meaning of the repeat sequences. The pipeline includes three parts. First, a parameter-modified CRISPR Recognition Tool (CRT) is used to detect CRISPR array candidates. Second, a random forest classifier is trained as the main filter to exclude pseudo-CRISPR arrays based on the triplet element features of the repeat sequences. The positive data set was taken from CRISPRdb. The negative data set consisted of random sequences generated using a first-order Markov model based on the positive data set. This classifier achieved around 94% accuracy and sensitivity. Last, Phobos, a tandem repeat detection tool, is used to filter satellite sequences and other tandem repeats that are parts of CRISPR arrays. The whole pipeline was implemented with server-side Perl scripts, and a JavaScript-based web server will be established for users to detect CRISPRs. In total, 165 archaea and 2,578 bacteria were analyzed by our new pipeline and 12,649 CRISPR arrays were detected.
Unlike earlier CRISPR databases, we will provide structural visualization of detected CRISPRs in our graphical user interface to facilitate structural validation of CRISPRs. Moreover, users can find CRISPRs in their query sequences through the web interface. As another important improvement, we can analyze the relationship between each spacer and relevant plasmids/viruses, and determine a virus-host interaction network.
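
The negative-set construction mentioned above, random sequences drawn from a first-order Markov model fitted to the positive set, can be sketched as follows. The CRF pipeline itself is implemented in Perl, so this Python version is only an illustration; the function names, the fixed seed and the toy repeat sequences are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical sketch: fit first-order (one-letter memory) transitions on
# positive repeat sequences, then sample random sequences with the same
# dinucleotide statistics.


def train_first_order(sequences):
    """Return (start letters, transitions), where transitions maps a letter to
    the list of letters observed right after it (kept with multiplicity)."""
    starts = []
    transitions = defaultdict(list)
    for seq in sequences:
        if not seq:
            continue
        starts.append(seq[0])
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev].append(nxt)
    return starts, dict(transitions)


def generate(starts, transitions, length, rng):
    """Sample one sequence of the given length (length >= 1)."""
    seq = [rng.choice(starts)]
    while len(seq) < length:
        options = transitions.get(seq[-1])
        # If a letter was never followed by anything in training,
        # fall back to the start-letter distribution.
        seq.append(rng.choice(options if options else starts))
    return "".join(seq)


# Toy positive set (assumption for illustration only).
positives = ["GTTTCAATCC", "GTTGCAATCC"]
starts, transitions = train_first_order(positives)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
negative = generate(starts, transitions, 20, rng)
print(negative)
```

Sampling from the stored successor lists reproduces the empirical transition frequencies without computing probabilities explicitly.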

Cuprizone intoxication perturbs central nervous system metabolism

Date: TBA
Room: TBA

Alexandra Taraboletti, University of Akron, United States
Leah Shriver, University of Akron, United States

Presentation Overview:

Cuprizone intoxication is a widely used animal model for testing new myelin regenerative therapies for diseases such as multiple sclerosis. Mice fed this copper chelator develop reversible, region-specific oligodendrocyte loss and demyelination. While the cellular correlates influencing myelin loss and formation have been well studied in this model, there is no consensus on the biochemical mechanisms of cuprizone toxicity in oligodendrocytes. To provide insight into this mechanism, we have identified an oligodendroglial cell line that is sensitive to cuprizone toxicity and performed global metabolomic profiling to identify metabolites altered by cuprizone. We further link the biochemical changes occurring in cells with alterations in brain metabolism in mice fed cuprizone for six weeks. We find that cuprizone induces widespread changes in one-carbon (1C) and amino acid metabolism as well as alterations in small molecules important for energy generation. We also probe the ability of cuprizone to effectively pull copper away from small molecule mimics of protein copper binding sites. Our results indicate that cuprizone induces global perturbations in cellular metabolism that may be independent of its copper chelating ability.

Prediction of High-throughput Protein-protein Interactions Using Short Linear Motifs

Date: TBA
Room: TBA

Yixun Li, University of Windsor, Canada
Luis Rueda, University of Windsor, Canada
Alioune Ngom, University of Windsor, Canada

Presentation Overview:

Prediction of protein-protein interactions (PPIs) is a difficult and important problem in biology. Although high-throughput technologies have made remarkable progress, the predictions are often inaccurate and include high rates of both false positives and false negatives. Short linear motifs (SLiMs) in protein sequences have been used effectively as features for predicting and analyzing obligate protein interactions, and several computational approaches have been used for prediction of high-throughput PPIs, but none of the latter has exploited the power of SLiMs. In this study, we propose a new method for predicting PPIs based on counting SLiMs between pairs of protein sequences. The method has been tested on a positive dataset of 50 protein pairs obtained from the PrePPI database, which contains human PPIs, and a negative dataset of 38 protein pairs obtained from the Negatome-PDB 2.0 database, which is a repository of non-interacting pairs of proteins. We used Multiple EM for Motif Elucidation (MEME) to obtain 50 motifs for each of the positive and negative datasets separately, yielding a set of 100 motifs, which we call the SM dataset. Similarly, we generated 50 motifs from the combined negative and positive datasets, which we call the CM dataset. We applied the Wrapper criterion with Random Forest for feature selection (FS) on the SM and CM datasets, followed by classification using different algorithms in Weka, in a 3-fold cross-validation setup. Among these, random forest and decision tree yielded the highest classification accuracies, ranging from 55.7% to 80.7% in both datasets, the highest being for the CM dataset, with accuracies ranging from 55.7% to 84.1% using random forest after FS. Our method shows promising results and demonstrates that information contained in SLiMs is highly relevant for accurate prediction of PPIs. In addition to efficient prediction, individual SLiMs bring extra information on meaningful patterns linked to specific roles in protein function.
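
A motif-count feature vector for a protein pair can be sketched in Python as follows. Several details are assumptions made for illustration: motifs are written as regular expressions rather than the position-specific matrices that MEME reports, the count for a pair is taken as the sum over both sequences, and the motifs and sequences are toy examples.

```python
import re

# Hypothetical sketch: one feature per motif, equal to the number of
# (possibly overlapping) occurrences in the two sequences of a protein pair.


def count_motif(pattern, sequence):
    """Count occurrences of a regex motif, including overlapping ones."""
    # A zero-width lookahead matches at every start position of the motif.
    return len(re.findall(f"(?=(?:{pattern}))", sequence))


def pair_features(motifs, seq_a, seq_b):
    """Feature vector for a protein pair: summed motif counts over both sequences."""
    return [count_motif(m, seq_a) + count_motif(m, seq_b) for m in motifs]


# Toy motifs and sequences (assumptions for illustration only).
motifs = ["P..P", "RGD", "AA"]
seq_a = "MPAAPRGD"
seq_b = "AAARGDPP"

features = pair_features(motifs, seq_a, seq_b)
print(features)
```

Vectors of this kind, one per protein pair with an interacting or non-interacting label, are what a feature selection step and a classifier such as a random forest would take as input.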

Leveraging integrative gene expression and metabolomics analysis to define molecular cancer signatures

Date: TBA
Room: TBA

Ewy Mathe, Ohio State University, United States
Elizabeth Baskin, The Ohio State University, United States
Senyang Hu, Ohio State University, United States

Presentation Overview:

Every year, over 14 million people are diagnosed with cancer worldwide and over 8 million people will die of the disease, making cancer a leading cause of death. In recent years, the metabolomics field has gained momentum for its potential of uncovering metabolites that could be measured in the clinic, and could help predict early diagnosis, prognosis, and treatment outcome. Despite the high potential impact of metabolomic profiles in biomarker discovery, the biological mechanisms underlying these cancer-specific profiles are oftentimes unknown. Better understanding the regulation of metabolic enzymes involved in producing these cancer phenotypes is critical and could facilitate the search for novel therapeutic targets. To this end, a global approach to integrating gene expression with metabolite measurements is proposed.

Identifying highly correlated gene:metabolite pairs may enhance our understanding of how genes affect metabolic phenotypes. In addition to assessing global correlations, we hypothesize that the correlations themselves may be associated with different phenotypes. Leveraging the public NCI-60 cell line data, we applied a linear model, m = g + c + g:c, where m are metabolite abundances, g are gene expression values, c is cancer type (e.g. leukemia, prostate), and g:c is the interaction between gene expression and cancer type. A statistically significant interaction p-value indicates that the slope relating genes and metabolites in one cancer type is different from that of another and that the gene:metabolite pair is cancer-type specific. These models were calculated for all possible gene:metabolite pairs (N=4,640,646) to compare different cancer subtypes in the NCI-60 cell lines, and to extract relevant gene:metabolite pairs whose associations are specific to certain cancer types. Results of this approach and comparison with other transcriptomics/metabolomics approaches will be discussed here.
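
The following is a minimal sketch of what the interaction term captures, not the authors' implementation. With a categorical cancer type, the point estimates of the model m = g + c + g:c give each cancer type its own intercept and slope, so the type-specific slopes can be obtained by fitting a simple least-squares line within each type. The function names and toy values are hypothetical, and no interaction p-value is computed here.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def per_type_slopes(gene, metabolite, cancer_type):
    """Return {cancer type: slope of metabolite on gene within that type}.

    In the interaction model, the g:c coefficient for a type is the
    difference between that type's slope and the reference type's slope.
    """
    groups = defaultdict(lambda: ([], []))
    for g, m, c in zip(gene, metabolite, cancer_type):
        groups[c][0].append(g)
        groups[c][1].append(m)
    return {c: ols_slope(gs, ms) for c, (gs, ms) in groups.items()}


# Toy data (hypothetical): the gene:metabolite slope differs by cancer type.
gene = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
metabolite = [2.0, 4.0, 6.0, 5.0, 4.0, 3.0]
cancer_type = ["leukemia"] * 3 + ["prostate"] * 3

slopes = per_type_slopes(gene, metabolite, cancer_type)
interaction = slopes["prostate"] - slopes["leukemia"]
```

A large difference between type-specific slopes is the pattern that a significant interaction p-value would flag; a full analysis would fit the model with a statistics package to obtain that p-value.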

Large-scale GWAS pathway analysis identifies several breast cancer susceptibility pathways

Download

Date: TBA
Room: TBA

Shirley Hui, University of Toronto, Canada
Asha Rostamianfar, University of Toronto, Canada
Gary Bader, University of Toronto, Canada

Presentation Overview:

Data from a large-scale genome-wide association study among 43,612 European controls and 44,791 invasive breast cancer cases (further broken down by ER+ and ER- status) were used to perform pathway analysis to identify breast cancer susceptibility pathways. A modified GSEA (gene set enrichment analysis) algorithm for GWAS data was used to perform this analysis. Pathways related to FGF signaling, the immune system, mitophagy, kinase signaling, and Wnt signaling were among the significant pathways for ER+ cases. Pathways related to DNA damage checkpoint, tricarboxylic acid (TCA) cycle, response to hypoxia, apoptosis, mitochondrial translation and double-strand break repair were significant in both ER+ and ER- cases. This is the most comprehensive pathway analysis performed on a breast cancer risk cohort, and several novel pathways were identified after excluding known breast cancer susceptibility genes from the analysis.

Transcription factor expression and its effects on binding site occupancy and motif preference

Download

Date: TBA
Room: TBA

Mehran Karimzadeh, Princess Margaret Cancer Centre, Canada
Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Introduction. Transcription factor binding site occupancy is inherently limited by the concentration of the binding transcription factor. We investigated how variation in transcription factor expression affected binding frequency. Using ENCODE ChIP-seq and RNA-seq data from multiple cell types, we classified transcription factor binding sites into one of four binding variability categories: dynamic, static, expression-sensitive, and low. We then investigated differential enrichment of motifs discovered de novo, as well as JASPAR vertebrate motifs in binding sites of different variability categories.

Results. For 8 out of 54 analyzed transcription factors, occupancy of more than half of their binding sites correlated with the transcription factor’s mRNA expression across multiple cell types. Some of these transcription factors such as ATF3 and RCOR1 are specific to a particular biological process or developmental stage. For example, BRCA1, ESRRA and NRF1 are involved in a variety of biological processes including metabolism, growth and global gene regulation. NFE2 and MAFK are involved in hematopoietic lineage specification. Investigating NFE2 motif preferences, we found that in the GM12878 cell line, where NFE2 mRNA is not expressed, there were no instances of the JASPAR MAF::NFE2 motif. The USF1 motif, however, was significantly enriched only in this cell line and not others. Furthermore, occupancy of more than half of the binding sites for all of the tested transcription factors is correlated with expression of other transcription factors. This may represent cooperative binding between the other transcription factor and the transcription factor originally analyzed.

Discussion. In addition to recapitulating previously reported motif preferences of each transcription factor, we found novel motif preferences related to differences in mRNA expression for several transcription factors including NFE2. We can expand our approach to identify other transcription factors that occupy different sequence motifs at different expression levels. This may elucidate cooperative and competitive relationships between transcription factors.

Identification of transcriptional patterns conserved across in vitro and in vivo cancer samples

Download

Date: TBA
Room: TBA

Seyed Ali Madani Tonekaboni, University of Toronto, Canada
Laleh Soltan Ghoraie, University of Toronto, Canada
Benjamin Haibe-Kains, University of Toronto, Canada

Presentation Overview:

Rationale. In vitro and in vivo model systems are frequently used in cancer research to investigate pathways involved in carcinogenesis and drug response. However, it is well established that no model system perfectly recapitulates the pathway activities observed in patient tumors. To optimize the translation of preclinical findings into clinical settings, there is a need for a thorough characterization of the pathways that exhibit transcriptional activity patterns conserved between model systems and patient tumors.
Objective. In this study we investigated the conservation of biological pathway activities in cell lines (CCL), patient-derived xenografts (PDX) and patient samples across different tumor types including breast, lung, prostate, ovarian, and pancreatic cancers.
Methods. We leveraged the cancer cell line encyclopedia (CCLE), patient-derived xenografts encyclopedia (PDXE) and the cancer genome atlas (TCGA) for CCL, PDX and patient transcriptomic profiles, respectively. We used 825 biological pathways provided in c5 collection in MSigDB to separate sets of genes based on their biological functions. For each biological pathway, we built pathway-specific matrices of gene expressions for each sample type (CCL, PDX and patients). We adapted the RV coefficient to compare these matrices and assess their similarity in transcriptional activity patterns.
Results. Our pan-cancer analyses show that 73 pathways are significantly conserved and 84 pathways are differentially expressed between CCLs and patients. Interestingly, more than 50 pathways are conserved in some cancers but not in others.
Conclusions. Using our novel approach, we will draw a transcriptomic map of pathway activity conservation across model systems and patient samples for multiple cancer types. This will allow the cancer research community to ensure their preclinical results are likely to translate into clinical settings with patient samples.
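
As an illustration of the similarity measure named in the Methods, the sketch below computes the standard RV coefficient between two matrices that share the same rows. It is not the adapted version used in the study, whose details are not given here. The assumption that rows correspond to the shared genes of one pathway and columns to samples of each type is ours, and the function names and toy values are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every column has mean zero."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def gram(matrix):
    """Row-by-row inner products: returns M M^T as a list of lists."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def rv_coefficient(x, y):
    """Standard RV coefficient between x (n x p) and y (n x q).

    RV = tr(A B) / sqrt(tr(A A) * tr(B B)), with A = Xc Xc^T and
    B = Yc Yc^T computed from column-centered matrices. For symmetric
    A and B, tr(A B) is the sum of element-wise products.
    """
    a = gram(center_columns(x))
    b = gram(center_columns(y))
    n = len(a)
    numerator = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    norm_a = sum(v * v for row in a for v in row)
    norm_b = sum(v * v for row in b for v in row)
    return numerator / math.sqrt(norm_a * norm_b)


# Toy pathway with 4 genes (rows); columns are samples (hypothetical values).
cell_lines = [[1.0, 2.0], [3.0, 1.0], [2.0, 5.0], [4.0, 4.0]]
patients = [[2.0, 1.0, 0.5], [1.0, 3.0, 2.0], [4.0, 2.0, 1.0], [3.0, 5.0, 2.5]]

rv_self = rv_coefficient(cell_lines, cell_lines)
rv_cross = rv_coefficient(cell_lines, patients)
```

The coefficient lies between 0 and 1, and it equals 1 when one configuration is a rescaled copy of the other, which is why it is a natural choice for comparing matrices with different numbers of columns.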

Dramatic impact of knowledge accumulation on pathway enrichment analysis

Download

Date: TBA
Room: TBA

Lina Wadi, OICR, Canada
Mona Meyer, OICR, Canada
Jüri Reimand, OICR/UofT, Canada

Presentation Overview:

Information about gene function is accumulating rapidly due to high-throughput techniques like next generation sequencing. Biological interpretation of data produced from high-throughput techniques is often a challenge. Pathway enrichment analysis is a common method for interpreting gene lists with knowledge of biological processes and pathways. It relies on up-to-date software tools that use the latest databases and Gene Ontology. We surveyed the datasets used in 21 pathway enrichment tools and found that the majority of tools had not updated their functional information in several years. We therefore investigated the evolution of Gene Ontology (GO) and Reactome pathway resources for the period of 2009-2016 and its impact on functional interpretation of gene lists. Specifically, we analyzed the growth in the size and complexity of pathways, the quantity and quality of human gene annotations, and the impact of using outdated functional information on pathway enrichment analysis. We found that the number and complexity of functional annotations of genes is growing consistently and has more than doubled during the period. Annotations have also increased in quality, while genes with no annotations have become less frequent. We analysed candidate cancer gene lists of glioblastoma and breast cancer and observed a dramatic impact on pathway enrichment results in comparing historical and contemporary functional annotations. In particular, the most popular tool DAVID with no updates since 2010 misses 74% of contemporary gene annotations in pathway enrichment analysis, affecting the results of thousands of papers published in 2015. As the use of outdated information leads to misleading interpretation of experimental data, the community needs to use timely resources of gene annotations.

Using networks to identify key players in breast cancer microenvironment interactions

Download

Date: TBA
Room: TBA

Venkata Manem, University Health Network, Canada
Sadiq Saleh, The Goodman Cancer Centre, McGill University, Canada
Nicholas Bertos, McGill University, Canada
Morag Park, The Goodman Cancer Centre, McGill University, Canada
Benjamin Haibe-Kains, University Health Network, Canada

Presentation Overview:

Background. Triple negative breast cancer (TNBC) is considered to be one of the major causes of mortality in women. There is compelling evidence from biological studies highlighting the role of the tumor microenvironment (TME) in carcinogenesis. Tumor cells receive support from the tumor stroma in the TME through growth factors. In this study, we aim to identify key players that participate in the tumor epithelium-stromal interactions.

Methods. We developed a genome-wide network by estimating pairwise co-expression interactions based on Pearson correlation between the tumor epithelium and tumor stroma along with normal counterparts. Nodes in the network represent genes, and edges represent co-expression relationships between the epithelium and stroma. The tumor and normal epithelium-stroma interaction networks were statistically compared using a permutation test to identify gain or loss of interactions specific to the tumor network. This analysis will enable the identification of the set of genes that are involved in rewiring of the tumor epithelium-stroma network. Then, for each gene in the network, we computed its connectivity to identify tightly connected hubs, and performed GO enrichment analysis.

Results. We found the proportion of self-loop interactions, defined as the genes that are co-expressed in both tumor epithelium and stroma, to be highly enriched in tumor epithelial-stromal samples compared to normal epithelial-stromal samples. We have identified a key hub, CDK4, an important regulator of the cell cycle and a drug target. We performed local assessment of GO term associations with the tightly connected hubs in the network, which are highly expressed in tumor and are significantly enriched in biological processes such as regulation of fibroblast proliferation and regulation of B cell apoptotic process.

Conclusions. The treatment efficacy of TNBC relies on killing tumor cells along with inhibiting the tumor epithelial-stromal interactions. This study holds potential to characterize the epithelium-stroma crosstalk for designing new treatments in TNBC.
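
The comparison of tumor and normal networks described in the Methods can be sketched for a single epithelium-stroma gene pair as follows. This is an illustrative permutation test under our own assumptions (sample labels are shuffled between tumor and normal, and the statistic is the absolute difference in Pearson correlation); the abstract does not specify the authors' exact scheme. Function names and toy values are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation(pairs):
    """Correlation between epithelium (first) and stroma (second) values."""
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def permutation_test(tumor_pairs, normal_pairs, n_perm=999, seed=0):
    """Return (observed |r_tumor - r_normal|, permutation p-value)."""
    observed = abs(correlation(tumor_pairs) - correlation(normal_pairs))
    pooled = list(tumor_pairs) + list(normal_pairs)
    n_tumor = len(tumor_pairs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign tumor/normal labels at random
        diff = abs(correlation(pooled[:n_tumor]) - correlation(pooled[n_tumor:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): (epithelium expression, stroma expression) per sample.
tumor = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
         (5, 10.1), (6, 12.2), (7, 13.8), (8, 16.1)]
normal = [(1, 8.0), (2, 7.1), (3, 5.9), (4, 5.2),
          (5, 3.8), (6, 3.1), (7, 2.2), (8, 0.9)]

observed_diff, p_value = permutation_test(tumor, normal)
```

A small p-value for a pair suggests a gained or lost interaction in the tumor network; applied genome-wide, the p-values would also need correction for multiple testing.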

Protein-ligand interaction exploration based on proteome-wide tertiary structure prediction and further in vitro validation

Download

Date: TBA
Room: TBA

Michael Dong, University of Toronto, Canada
Nicholas Provart, University of Toronto, Canada

Presentation Overview:

Proteins rarely function alone; most of the time they interact with other molecules to fulfill their functions. Protein-ligand interactions are crucial for biological activities, and the exploration and analysis of protein-ligand interactions are essential for our biological understanding of cellular function. In my study, I will use bioinformatics methods to screen for candidate proteins that can bind to certain ligands (carbohydrates and proteins) based on their structural features, such as aromatic side chain shapes and protein-protein surface shapes (shape complementarity), from a protein structure-ome containing 30,000 predicted protein structure models covering around 80% of the Arabidopsis proteome. The predicted interactions will be further validated using in vitro methods. My project aims to (1) provide evidence that predicted protein structures can be used in protein structure and functional studies, and (2) explore the relationship between protein structure and protein-ligand interaction ability.

Enrichment Map Pipeline: a set of Cytoscape apps to visualize, explore and summarize pathway enrichment results.

Download

Date: TBA
Room: TBA

Ruth Isserlin, University of Toronto, Canada
Mike Kucera, University of Toronto, Canada
Veronique Voisin, University of Toronto, Canada
Gary D. Bader, University of Toronto, Canada

Presentation Overview:

High-throughput experiments, including transcriptomics, proteomics, or metabolomics, generate millions of signals for all the entities in the cell. In efforts to summarize and explore these signals, expression results are examined in the context of known pathways and processes through enrichment analysis. Enrichment analysis generates a set of pathways that are significantly enriched in the set of differential entities. Due to the high redundancy in annotation resources, this often results in hundreds of enriched pathways. To facilitate the analysis of these results, we have developed the Enrichment Map Pipeline to visualize, explore and summarize enrichment results as a network. The Enrichment Map App generates a network visualization of the enrichments where pathways are nodes and edges represent known pathway crosstalk, defined by the number of genes shared between a pair of pathways. An Enrichment Map post analysis allows users to add specific known signatures or a complete set (e.g. transcription factor, microRNA, or drug targets), filtering potential hits by Mann-Whitney or hypergeometric statistic. An Enrichment Map can be further summarized with the ClusterMaker, WordCloud and AutoAnnotate Cytoscape Apps, which group densely connected regions in the network, calculate prominent themes based on the text in node descriptions, and visually annotate the network so dominant themes can be easily recognized.
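
The idea of linking pathways by shared genes can be sketched as below. This is a simplified illustration rather than the Enrichment Map App's code: the Jaccard and overlap coefficients are two common ways to normalize the number of shared genes, and the threshold, function names and gene sets are hypothetical.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Shared genes divided by the size of the union of both gene sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(genes_a, genes_b):
    """Shared genes divided by the size of the smaller gene set."""
    a, b = set(genes_a), set(genes_b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def build_edges(pathways, threshold, similarity=jaccard):
    """Return (pathway_1, pathway_2, score) for pairs scoring >= threshold."""
    edges = []
    for name_1, name_2 in combinations(sorted(pathways), 2):
        score = similarity(pathways[name_1], pathways[name_2])
        if score >= threshold:
            edges.append((name_1, name_2, score))
    return edges


# Toy enriched pathways (hypothetical gene sets).
pathways = {
    "DNA damage response": {"TP53", "MDM2", "ATM"},
    "Cell cycle checkpoint": {"TP53", "ATM", "CHEK2"},
    "Wnt signaling": {"WNT1", "CTNNB1"},
}

edges = build_edges(pathways, threshold=0.25)
```

Redundant pathways end up connected and form clusters, which is what lets hundreds of enriched terms be read as a handful of themes.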

Performance comparison of targeted gene panel sequencing technologies using several cancer sample types

Download

Date: TBA
Room: TBA

Peter Ulintz, Univ Michigan, United States
Jeanne Geskes, Univ Michigan, United States
Melissa Coon, Univ Michigan, United States
Patricia Tamsen, Univ Michigan, United States
Elizabeth Ketterer, Univ Michigan, United States
Angela Chidester, Univ Michigan, United States
Christopher Gates, Univ Michigan, United States
Erika Koppe, Univ Michigan, United States
Craig Johnson, Univ Michigan, United States
Elena Stoffel, Univ Michigan, United States
Richard McEachin, Univ Michigan, United States
Robert Lyons, Univ Michigan, United States

Presentation Overview:

Custom targeted panels for next-generation sequencing are a valuable tool for in-depth analysis of a subset of genes. Targeted panels offer an economical approach that requires less sequencing to achieve necessary read depth as compared to WGS or WES. Read depth is particularly important for tumor samples, which often contain mutations at lower allele frequencies due to non-aberrant cell admixture or the presence of subclones. A number of vendors have kits available for this strategy. We organized a “bakeoff” across 5 vendors to assess the design process, manufacturing time, kit sizes, protocols, hands-on time, cost, coverage of our genes and repeatability across vendors. Platforms vary based on their cost, ease of use, and hands-on time. Qiagen and NuGen platforms require only 10ng DNA, whereas the capture-based methods require at least 100ng. Overall, capture-based methods show a more continuous tiling pattern across targeted regions than the amplicon method, permitting de-dup and strand bias based variant filtering. Amplicon methods generally do not suffer from off-target effects and appear slightly more sensitive, but at the expense of a higher false-positive rate. Amplicon methods require primer trimming, preferably post-alignment (4), which to our knowledge currently requires a custom tool. All platforms performed well on coverage, with capture-based methods, particularly Agilent XT2, requiring more reads to achieve target coverage saturation (we’ve observed that the post-capture pooling version of the Agilent SureSelect platform, XT, has a lower off-target rate; data not shown). No platform is a clear winner in this experiment; they all have pros, cons and nuances, and choice of platform will depend on multiple constraints such as amount of available DNA, lab expertise, number of project samples, and budget.

How to be a good parasite

Download

Date: TBA
Room: TBA

Nirvana Nursimulu, University of Toronto, Canada
Swapna Seshadri, Hospital for Sick Children, Canada
John Parkinson, Hospital for Sick Children, Canada

Presentation Overview:

The phylum Apicomplexa, comprising over 5000 species of parasites, is mostly known for causing significant impact economically and on human health. For example, the malaria parasite Plasmodium falciparum was responsible for almost 500,000 deaths in 2015 alone, and Toxoplasma gondii is the leading cause of childhood retinitis. To survive and persist, Apicomplexan parasites exploit host pathways. In the case of metabolism, they have evolved different techniques for diverting resources/metabolites—essentially scavenging—from their host organisms. As a consequence, some pathways are no longer essential to parasites and can be lost. Therefore, if we are to develop new therapeutics, understanding host-parasite metabolic interactions is key. During the evolution of Apicomplexans, it is clear that many members have adapted their metabolic potential to exploit availability of host nutrients. What remain to be understood are the factors (for example, different nutrient availability) which have led to the reductive evolution of their diverse metabolic networks. Therefore, we simulate the reductive metabolic evolution of Apicomplexans by using the framework of Flux Balance Analysis. We start by reconstructing a hypothetical ancestral metabolic network of Apicomplexans, and then use mathematical programming to predict different evolutionary outcomes as hypothetical present-day metabolic networks. We find that there can be alternate metabolic pathways that could have arisen independently over the course of evolution; for example, the TCA Cycle—a crucial pathway in energy production—is reduced differently in different Cryptosporidium species. Furthermore, we are able to explain discrepancies between predicted and extant Apicomplexan metabolic networks; for instance, present-day parasites have retained certain alternative pathways (not found in predictions), possibly to maintain functionality despite fluctuations in environmental conditions. 
Altogether, the work we present here gives a novel view into the metabolism of Apicomplexan parasites, especially explaining how different nutrient availability may have shaped their metabolic networks. The methodology applied here may be extensible to other organisms with reduced metabolism, including other parasites and endosymbionts.

Genome Sequencing and De Novo Assembly of a Clinical Isolate of Plasmodium vivax Malaria from India Using the Oxford Nanopore MinION

Download

Date: TBA
Room: TBA

Zunping Luo, New York University, United States
Lingdi Zhang, New York University, United States
Pavitra Rao, New York University, United States
Swapna Uplekar, New York University, United States
Daniel Hupalo, New York University, United States
Jane Carlton, New York University, United States

Presentation Overview:

Of the four Plasmodium species that routinely cause human malaria, Plasmodium vivax is the most widespread species outside Africa, causing ~15.8 million cases each year. India is a major contributor to the worldwide burden of vivax malaria, responsible for ~80% of the cases in the Southeast Asian region. A reference assembly for the P. vivax Indian VII strain was published in 2012 (Neafsey et al., Nature Genetics, 2012); however, this strain was collected from a patient in 1972, and the genome assembly is not of high quality. Here we describe the sequencing, assembly and annotation of a P. vivax isolate, IndiaNYC, isolated from a patient from India in 2013. We tested Oxford Nanopore Technologies’ MinION platform, which can produce read lengths of tens of kilobases, to determine the ability of a nanopore-based system to generate long read lengths from this malaria parasite clinical sample. First, to deplete the human DNA contamination resulting from the presence of white blood cells in the P. vivax-infected blood sample, we tested an enrichment method known to work for another malaria parasite species, Plasmodium falciparum (NEBNext Microbiome DNA Enrichment). We observed a 2000-fold decrease in copies of a human test gene quantified by digital PCR after enrichment, and observed a loss of ~40% of the P. vivax DNA during the process. We then used the MinION device to generate long reads from the enriched P. vivax DNA sample. Data showing the quality and quantity of the reads, and their use to generate a hybrid assembly with Illumina HiSeq data (2x100bp) from the same isolate, will be presented.

In-silico investigation of parasite proteomes to predict immune mimicking proteins

Download

Date: TBA
Room: TBA

Shruti Srivastava, University of Calgary, Canada
Derek McKay, University of Calgary, Canada
James Wasmuth, University of Calgary, Canada

Presentation Overview:

Helminths, parasitic worms, are a major cause of mortality and morbidity in humans and livestock around the world. Infections are typically chronic, enabled by parasite-induced changes in the immune responses, which include a reduced inflammatory response. Proteomic analyses in several helminth species have implicated parasite-derived proteins as modulators of the host immune system, but their identity is unknown. Understanding the molecular components of this host-parasite interaction offers potential new treatments for severe auto-immune diseases, including inflammatory bowel diseases, autoimmune encephalomyelitis, and arthritis. Here, we have taken a genomic-based approach to identify short peptides which potentially mimic host proteins, particularly components of the immune system. Our focus has been on the tapeworm species: Echinococcus, Taenia and Hymenolepis. To overcome the problems typical with eukaryote genomes (e.g. fragmented assemblies and missing annotations), we have searched the published gene models, transcriptomes and DNA sequence reads. To date, we have identified 35 peptides that are potential mimics, of which ten mimic proteins involved in the host immune system. These are undergoing structural modelling, the results of which will be presented.

Can-VD: Data Standard and Online Resource for Reporting Mutations Impact on Cancer Protein Interaction Networks

Download

Date: TBA
Room: TBA

Mohamed Helmy, University of Toronto, Canada
Alexander Crits-Christoph, Johns Hopkins University, United States
Omar Wagih, University of Cambridge, United Kingdom
Gary Bader, University of Toronto, Canada

Presentation Overview:

Research on the impact of cancer mutations on protein-protein interaction (PPI) networks has recently experienced a shift towards the development of large-scale, high-throughput methods. Several recent reports have presented results on the assessment of the impact of mutations on PPIs mediated by peptide recognition domains (PRD), and these data are useful for understanding cellular dynamics in cancer for basic or clinical applications. Considering the importance of these data, a standard for reporting the results and a central online resource for collecting them would provide a useful aid for researchers in the field. Here, we propose a standard for reporting results on the impact of cancer mutations that ensures a high level of data integration and interoperability. Furthermore, we present Can-VD (The Cancer Variants Database), an online resource for cancer variants and their impacts on domain-peptide PPI networks. Currently, Can-VD stores all the published datasets of large-scale mutation-impact assessment in a standard database format with a web interface that allows for search, visualization and bulk download of data.

Analysis of Parallel Bayesian Network Learning

Download

Date: TBA
Room: TBA

Joseph Haddad, University of Akron, United States
Anthony Deeter, University of Akron, United States
Zhong-Hui Duan, University of Akron, United States
Timothy O'Neil, University of Akron, United States

Presentation Overview:

Bayesian networks have proven useful in providing information relevant to how genes interact with one another. These interactions can then be used to make testable hypotheses to determine how gene interactions influence life in various organisms. As a result, tests in the lab can be performed with more confidence and a lower chance of wasting time and resources. Unfortunately, Bayesian network learning is inherently slow due to the nature of the algorithm. The computational complexity can be reduced using methods like search space reduction. K2 is a great example of search space reduction, but this still doesn't solve the fundamental problem. To achieve high confidence in the generated networks, an abundance of Bayesian networks needs to be computed using random search space restrictions. This takes time, but has higher potential to be improved upon. To compute Bayesian networks more efficiently, their individual computation can be spread across multiple cores. Parallelizing the computation of networks across multiple cores results in linear speed-up and negligible added overhead. Furthermore, the computation of networks can also be spread across multiple nodes (systems, computers) or deployed on a supercomputer, which also results in linear speed-up but can introduce variable overhead. The overhead introduced when going across systems varies with the task, because it depends on what must be sent over the network to reach a consensus. Research was conducted to implement methods of computing Bayesian networks in parallel across multiple cores and nodes. These methods were then tested to determine the speed increase resulting from their utilization. The results from the parallel code were also validated in dozens of tests against a known working Bayesian network library to ensure the integrity and correctness of our code.

Starving in the Dark: The Impact of Ultra-Small Cells on a Groundwater Microbial Community

Download

Date: TBA
Room: TBA

Olivia S. Hershey, The University of Akron, United States
Hazel A. Barton, The University of Akron, United States

Presentation Overview:

The lakes in Wind Cave, Wind Cave National Park, South Dakota, provide a rare window into the massive and ancient Madison Aquifer, which is the main source of groundwater for five US states and two Canadian provinces. The long residence time of the groundwater en route to the lakes makes it extremely nutrient-limited, creating oligotrophic conditions that life must adapt to for survival. One of the apparent adaptations by the microorganisms in the aquifer is a reduction in cell size, resulting in an increased surface area to volume ratio and more efficient nutrient absorption. Until recently, the theoretical limit on cell size allowed filtration of cells at a pore diameter of 0.2 µm; however, starved environments have been found to contain cells <0.2 µm. As previous studies of the microbial diversity at the lakes of Wind Cave were performed by collecting biomass on a 0.2 µm pore filter, the community diversity was likely underestimated, as cells <0.2 µm were not captured on the filter. To determine what proportion of the population was below 0.2 µm in size, we collected cells through a 0.1 µm pore filtration system. In order to collect enough cells for DNA extraction, approximately 140 L of lake water was filtered over 24 hours. Microbial DNA was extracted from the filter using the phenol chloroform method on a Boreal Genomics Aurora system, and amplified using universal primers for bacterial and archaeal 16S rRNA (V3/V4 region). The PCR products were sequenced on the Illumina MiSeq platform, processed using the QIIME software platform and matched against the Greengenes rRNA curated database. Analysis of the operational taxonomic units (OTUs) found in the 0.1 µm filter collection, when compared to the previous sample collected on the 0.2 µm filter, demonstrated the potential structure of the ultra-small, ultra-oligotrophic microbial population, which is being used in cultivation approaches to better understand evolutionary adaptations.
This work has implications for uncovering the metabolic processes that assist survival under some of the most starved conditions on Earth, while allowing us to once again examine the theoretical limits for cell size.
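
The link between smaller cells and a higher surface area to volume ratio can be made concrete with a short calculation. The sketch below assumes an idealized spherical cell, which is our simplification and not a claim about the morphology of the cells collected; the function name is hypothetical.

```python
import math


def surface_to_volume(diameter_um):
    """Surface area / volume (per µm) of a sphere with the given diameter in µm."""
    radius = diameter_um / 2.0
    area = 4.0 * math.pi * radius ** 2
    volume = (4.0 / 3.0) * math.pi * radius ** 3
    return area / volume  # algebraically equal to 6 / diameter


# Compare cells at the two filter pore sizes mentioned in the abstract.
ratio_at_0_2 = surface_to_volume(0.2)
ratio_at_0_1 = surface_to_volume(0.1)
fold_increase = ratio_at_0_1 / ratio_at_0_2
```

For a sphere the ratio is 6 divided by the diameter, so halving the diameter from 0.2 µm to 0.1 µm doubles the surface area available per unit of cell volume.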

Dissecting the properties of spatially-coordinated combinatorial transcription factor binding using ChIP-exo and evolutionary conservation

Download

Date: TBA
Room: TBA

Minggao Liang, The Hospital for Sick Children, Canada
Liis Uusküla-Reimand, The Hospital for Sick Children, Canada
Shiao Yuan Huang, The University of Toronto, Canada
Huayun Hou, The Hospital for Sick Children, Canada
Alejandra Medina-Rivera, Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Mexico
Michael Wilson, SickKids, Canada

Presentation Overview:

The cooperative binding of transcription factors (TFs) at cis-regulatory modules (CRMs) is essential for the precise regulation of gene expression. Compared to singleton TF binding events, CRMs are enriched for tissue-specific gene expression and pathways and are more likely to be conserved across species. The relative order, orientation, and spacing of transcription factor binding motifs is an important determinant of CRM function. Furthermore, recent work suggests that spatially constrained binding between TF pairs is widespread and essential for developmental gene expression. Hence, there is increasing interest in characterizing spatially coordinated TF binding.

Our study aims to characterize spatially-coordinated TF binding at CRMs in the mouse liver and identify potential links to CRM function and evolutionary turnover. To explore combinatorial TF occupancy, we analyzed ChIP-exo footprints for four liver master regulators (HNF4A, FOXA1, ONECUT1 and CEBPA) at CRMs that we defined from a set of >20 mouse liver TFs which have been profiled by ChIP-seq. As expected, ChIP-exo revealed footprints representative of canonical TF binding events. However, ChIP-exo footprints also revealed distinct TF binding modalities for each TF corresponding to the pairwise binding of other liver TFs and identified several novel examples of recurrent, spatially constrained co-binding configurations. Specific co-bound footprint profiles enriched for regions of conserved liver TF binding and conserved enhancer marks, suggesting that spatially coordinated TF occupancy is a conserved feature underlying liver gene regulation.

An efficient finite-difference strategy for sensitivity analysis of stochastic models of biochemical systems

Download

Date: TBA
Room: TBA

Monjur Morshed, University Of Waterloo, Canada
Brian Ingalls, University Of Waterloo, Canada
Silvana Ilie, Ryerson University, Canada

Presentation Overview:

Sensitivity analysis characterizes the dependence of a model's behaviour on system parameters. It is a critical tool in the formulation, characterization, and verification of models of biochemical reaction networks, for which confident estimates of parameter values are often lacking. We propose a novel method for sensitivity analysis of discrete stochastic models of biochemical reaction systems whose dynamics occur over a range of timescales. This method combines previously established finite-difference approximations and adaptive tau-leaping strategies to efficiently estimate the parametric sensitivities for stiff stochastic biochemical kinetics models, with negligible loss in accuracy compared with previously published approaches. We analyze several models of interest to illustrate the advantages of our method.
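
As a rough illustration of the finite-difference idea, the sketch below estimates the sensitivity of the mean copy number to a production rate in a simple birth-death model. It is not the authors' method: it swaps in an exact Gillespie simulation for adaptive tau-leaping, and it couples the nominal and perturbed runs by reusing the same random seed (common random numbers). The model, the function names, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def ssa_birth_death(k, g, t_end, rng):
    """Exact Gillespie run of 0 -> X (rate k) and X -> 0 (rate g*X); returns X(t_end)."""
    t, x = 0.0, 0
    while True:
        a_birth = k
        a_death = g * x
        a_total = a_birth + a_death  # always > 0 because k > 0
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            return x
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1


def fd_sensitivity(k, g, t_end, h, n_runs, seed=0):
    """Forward finite-difference estimate of d E[X(t_end)] / dk.

    Each nominal/perturbed pair shares a seed (common random numbers),
    which is one simple way to couple the two trajectories.
    """
    total = 0.0
    for i in range(n_runs):
        x_base = ssa_birth_death(k, g, t_end, random.Random(seed + i))
        x_pert = ssa_birth_death(k + h, g, t_end, random.Random(seed + i))
        total += (x_pert - x_base) / h
    return total / n_runs


# Hypothetical parameter values, chosen only for illustration.
k, g, t_end = 10.0, 1.0, 5.0
estimate = fd_sensitivity(k, g, t_end, h=1.0, n_runs=2000)
# For this linear model the mean is (k/g)(1 - exp(-g t)), so the exact
# sensitivity to k is (1 - exp(-g t)) / g.
exact = (1.0 - math.exp(-g * t_end)) / g
print(f"finite-difference estimate: {estimate:.3f}, exact: {exact:.3f}")
```

For stiff systems an exact simulation like this one becomes expensive, which is the motivation the abstract gives for combining finite differences with tau-leaping.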

sl1p: A computational pipeline for the processing and analysis of 16S rRNA microbiome sequencing data.

Download

Date: TBA
Room: TBA

Fiona Whelan, McMaster University, Canada
Michael Surette, McMaster University, Canada

Presentation Overview:

Advances in next-generation sequencing technologies have allowed for detailed, molecular-based studies of microbial communities such as the human gut, soil, and ocean waters. Sequencing of the 16S rRNA gene, specific to bacteria, using universal PCR primers has become a common approach to studying the composition of microbiomes. However, the bioinformatic analyses of the resulting millions of DNA sequences can be challenging, and a standardized protocol for such analysis would aid in reproducible analyses. The Surette Lab 16S rRNA Pipeline (sl1p, pronounced “slip”) was designed with the purpose of mitigating this lack of reproducibility by combining pre-existing tools into a computational pipeline. This pipeline automates the processing of raw sequencing data to create human-readable tables, graphs, and figures to make the collected data more readily accessible to biologists. The most appropriate analysis software and algorithms were determined by sequencing synthetic microbiomes created in the laboratory and comparing the output using various analysis techniques. To date, sl1p has been used to process data from numerous microbial communities, including those from the human gut, upper and lower airways, and mouse cecal samples. Finally, sl1p promotes reproducible research by providing a comprehensive log file, and reduces the computational knowledge needed by the user to process next-generation sequencing data.

RelA as a Potential Regulator of Inflammation and Tissue Damage in Streptozotocin-Induced Diabetic STAT5 Knockout Mice

Download

Date: TBA
Room: TBA

Emilee Holtzapple, Ohio University Honors Tutorial College, United States
K. Wyatt McMahon, Johns Hopkins School of Medicine, United States
Lonnie Welch, Ohio University Russ College of Engineering, United States
Karen Coschigano, Ohio University Biomedical Sciences, United States

Presentation Overview:

Type 1 Diabetes (T1D) has long-term complications that result in tissue damage. One of these chronic complications is diabetic nephropathy (DN). In diabetic kidneys, activation of the Signal Transducers and Activators of Transcription (STAT) protein family has been observed, suggesting a pathological role for the STAT protein family in DN. To examine the role of STAT5 in DN, a STAT5 knockout mouse was previously created. Type 1 diabetes was then induced using streptozotocin (STZ). In the diabetic STAT5 knockout mouse model, kidney damage was indicated by an increase in tubulointerstitial pathology. In this model, microarray analysis revealed differential expression of over 1,000 genes as compared to the diabetic wildtype kidney. By using pathway analysis software to analyze gene expression results obtained from this diabetic STAT5 knockout mouse model, the role of STAT5 in diabetic kidney damage was examined. The results showed differential activation of a number of inflammatory pathways, which could possibly explain the increased kidney damage. The results also indicated activation of RelA, which could lead to increased transcription of a number of NF-κB-dependent genes. Chromatin immunoprecipitation followed by quantitative PCR (ChIP-qPCR) was used to examine RelA activity in vivo. ChIP-qPCR results showed that RelA bound the promoters of both ICAM1 and BIRC5 at different levels dependent on the genotype and disease state. This indicates that STAT5 affects RelA signaling in DN.

Topoisomerase II beta interacts with cohesin and CTCF at topological domain borders

Download

Date: TBA
Room: TBA

Huayun Hou, SickKids Research Institute, Canada
Liis Uusküla-Reimand, SickKids Research Institute, Canada
Payman Samavarchi-Tehrani, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Canada
Matteo Vietri Rudan, Cancer Institute, University College London, United Kingdom
Minggao Liang, SickKids Research Institute, Canada
Jüri Reimand, Ontario Institute for Cancer Research, Canada
Suzana Hadjur, Cancer Institute, University College London, United Kingdom
Anne-Claude Gingras, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Canada
Michael Wilson, SickKids Research Institute, Canada

Presentation Overview:

Type II DNA topoisomerases (TOP2) regulate DNA topology by generating transient double-strand breaks (DSBs) during replication and transcription. The ubiquitously expressed TOP2B enzyme facilitates rapid gene expression and functions at the later stages of development and differentiation. To gain new insight into these distinct roles of TOP2B, we used proteomics (BioID), chromatin immunoprecipitation (ChIP-seq and ChIP-exo), and high-throughput chromosome conformation capture (Hi-C). We detected novel proximal TOP2B protein interactions and characterized the genomic landscape of TOP2B binding at base pair resolution.
Our TOP2B protein-protein interaction network revealed known and novel human TOP2B interacting proteins including CTCF and several members of the cohesin complex. Mouse TOP2B binds extensively to open chromatin regions marked by DNase I hypersensitivity sites. It is also enriched at evolutionarily conserved transcription factor binding sites and its occupancy correlates with allele-specific transcription factor binding. Approximately half of all CTCF/cohesin binding sites possessed TOP2B binding. Base pair resolution mapping of TOP2B/CTCF/cohesin sites revealed a striking structural ordering of these proteins along the genome relative to the asymmetric CTCF motif. We find that these ordered TOP2B-CTCF-cohesin sites flank the boundaries of topologically associating domains with TOP2B positioned externally and cohesin internally to the domain loop.
We establish that TOP2B is interacting with chromatin architectural proteins and is positioned to solve topological problems at diverse cis-regulatory elements. We demonstrate that TOP2B’s DNA occupancy is a highly ordered and prevalent feature of topological domain boundaries.

Applying a co-elution strategy to generate the first genome scale protein interaction network for T. gondii

Download

Date: TBA
Room: TBA

Swapna Seshadri, Hospital for Sick Children, Canada
Zhongming Hu, University of Toronto, Canada
Verena Brand, Hospital for Sick Children, Canada
Xuejian Xiong, Hospital for Sick Children, Canada
Michael E. Grigg, NIH, United States
Andrew Emili, University of Toronto, Canada
John Parkinson, Hospital for Sick Children, Canada

Presentation Overview:

Proteins do not operate in isolation but form components of intricate biological systems such as biochemical pathways and complexes. Parasites of biomedical importance, such as Toxoplasma gondii and Plasmodium falciparum, rely on a coordinated cascade of invasion-related proteins to invade and persist in their hosts. Understanding their organization is critical to identifying proteins that mediate invasion and hence may be targeted for therapeutics. Applying an innovative protein co-elution strategy, we present the first global scale protein-protein interaction network for T. gondii. First, we apply a battery of biophysical techniques to separate parasite protein extracts into 420 fractions. Each fraction is then subjected to shotgun proteomics to identify 1423 unique T. gondii proteins across all fractions. The tendency of proteins to be found in the same fraction (co-elute) is captured using correlation metrics. These scores are integrated with functional genomic datasets based on gene expression, domain-domain interactions, phylogenetic profiles, text mining and gene fusion, using a RandomForest classifier to construct a network composed of 7448 interactions between 973 proteins. Clustering the network identifies both known invasion interactions and complexes containing novel components. Through predictions of novel functional associations, we believe this network represents a valuable community resource allowing the prioritization of new candidates for in-depth functional characterization.
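
The co-elution scoring step can be pictured with a small sketch. The abstract says only that "correlation metrics" are used, so the choice of the Pearson coefficient here is an assumption, and the protein names and spectral counts are invented for illustration (six fractions rather than 420).

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length elution profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no co-elution signal
    return cov / math.sqrt(var_a * var_b)


def coelution_scores(profiles):
    """Score every unordered protein pair by the correlation of its profiles."""
    names = sorted(profiles)
    return {
        (p, q): pearson(profiles[p], profiles[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Hypothetical spectral counts per fraction, for illustration only.
profiles = {
    "protA": [0, 5, 20, 5, 0, 0],
    "protB": [0, 4, 18, 6, 0, 0],
    "protC": [0, 0, 0, 2, 15, 3],
}
scores = coelution_scores(profiles)
for pair, r in scores.items():
    print(pair, round(r, 3))
```

Pairs that peak in the same fractions (protA and protB here) score near 1, while proteins eluting in different fractions score low or negative. In the workflow described above, such scores would be one input among several to the classifier.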

Mutation patterns in Ebola may shine light on future outbreaks

Download

Date: TBA
Room: TBA

Mary Halpin, Kent State University, United States
Helen Piontkivska, Kent State University, United States

Presentation Overview:

Little is known about where the Ebola virus comes from and how it behaves in its reservoir species (which may or may not be bats), including the extent of sequence changes the virus may experience between outbreaks. Delineating whether it is a chronic or acute infection in its reservoir is important, as different evolutionary trajectories can be expected, and in turn may aid in the development of (future) treatments. We examined the extent of nucleotide sequence divergence among different genes in order to identify regions under strong purifying selection.

Identification of subtypes of sickle cell disease with different prognosis and severity

Download

Date: TBA
Room: TBA

Zhengdeng Lei, University of Illinois at Chicago, United States
Mark Maienscheincline, University of Illinois at Chicago, United States
Hong Hu, University of Illinois at Chicago, United States
Pinal Kanabar, University of Illinois at Chicago, United States
Morris Chukhman, University of Illinois at Chicago, United States
George Chlipala, University of Illinois at Chicago, United States
Neil Bahroos, University of Illinois at Chicago, United States
Ankit Desai, University of Arizona, United States
Roberto Machado, University of Illinois at Chicago, United States

Presentation Overview:

Sickle cell disease (SCD) is characterized by systemic complications and significant clinical variability. While some patients never experience severe clinical events, others experience lifelong morbidity and accelerated mortality. To address this phenotypic variability in a Mendelian disease, we prospectively followed patients with SCD at steady-state to assess outcomes and leveraged microarray analysis of RNA isolated from peripheral blood mononuclear cells (PBMCs) to identify novel gene expression patterns and signaling pathways that could be associated with survival in SCD. We used consensus hierarchical clustering to identify two clusters for SCD patients in the discovery cohort. We derived FAIME pathway profiles (KEGG pathways) from the gene expression profiles. FAIME, the Functional Analysis of Individual Microarray Expression, is an algorithm to compute pathway scores using rank-weighted gene expression of an individual sample. We developed a support vector machine (SVM) classifier using the FAIME signature, which we used to predict the subtype of SCD patients in the validation cohort. We demonstrated that the identified clusters were significantly associated with patients’ prognosis and SCD severity.

Systematic analysis of alternative polyadenylation regulatory mechanisms from RNA-Seq

Download

Date: TBA
Room: TBA

Kevin Ha, University of Toronto, Canada
Benjamin Blencowe, University of Toronto, Canada
Quaid Morris, University of Toronto, Canada

Presentation Overview:

Alternative polyadenylation (APA) is a post-transcriptional process by which multiple RNA transcript isoforms with distinct 3´ ends derived from the same gene can be produced. APA is common: ~30-50% of mammalian genes contain more than one cleavage and polyadenylation site (pA site)[1], and changes in APA patterns have been observed during development of mouse embryonic stem cells (ESCs)[2–4]. However, the regulatory mechanisms involved in APA are not completely understood. We developed a framework for studying APA from polyA+ RNA-Seq datasets by examining the expression of 3´ UTR isoforms to estimate differential usage of pA sites. This method was applied to a published longitudinal RNA-Seq study of neuronal differentiation from mouse ESCs to glutamatergic neurons[5]. Principal component analysis revealed a major subset of transcripts that expressed short 3´ UTRs in ESCs and subsequently lengthened in neurons, confirming previous findings[2]. Gene ontology analysis identified several functions related to neurogenesis and stem cell development. We next inferred an APA regulatory code by using random forests to predict pA site usage based on local sequence features, including the presence of known RBP motifs, polyadenylation signals, and dinucleotide content. Our model achieved good performance in predicting overall and tissue-specific pA site usage. Surprising features that are predictive of pA usage will be presented. Overall, in addition to providing insight into APA regulation, this work demonstrates the feasibility of using RNA-Seq data alone to elucidate these regulatory mechanisms.

1. Tian, B. et al. Nucleic Acids Res. 33, 201–12 (2005).
2. Ji, Z. et al. Proc. Natl. Acad. Sci. U. S. A. 106, 7028–33 (2009).
3. Lackford, B. et al. EMBO J. 33, 878–89 (2014).
4. Ji, Z. & Tian, B. PLoS One 4, e8419 (2009).
5. Hubbard, K. S. et al. F1000Research 2, 35 (2013).

Extracting more from transcriptomes: assessing their utility for comparative analyses across birds

Download

Date: TBA
Room: TBA

Kelly Boyd, Loyola University Chicago, United States
Amanda Misch, Loyola University Chicago, United States
Emma Highland, Loyola University Chicago, United States
Catherine Putonti, Loyola University Chicago, United States
Sushma Reddy, Loyola University, United States

Presentation Overview:

High-throughput sequencing technologies have allowed science to probe previously unexamined areas of life. Genomic sequencing of birds has unearthed a myriad of novel expressed sequences through the use of transcriptomes, yet comparative analyses of these data are seldom performed. We analyzed 425 published datasets from 39 different species and 49 different tissue types to assess their utility in comparative analyses across species. Although numerous transcriptomes have been generated for a wide variety of avian species, only two annotated genomes are complete for the most species-rich class of tetrapods – the chicken (Gallus gallus) and the zebra finch (Taeniopygia guttata). As such, extant bioinformatic tools which rely on a closely related reference sequence often fall short. We have developed a software pipeline to assist in these analyses by repurposing existing tools with new functionalities. This pipeline automates de novo assembly and annotation of each dataset. We further developed a protocol for extracting and comparing common genes across datasets. Here we present the results of cross-species comparisons, focusing on a single tissue – the liver – across nine species, as well as cross-tissue comparisons across a single species, Anas platyrhynchos (the mallard). As expected, there is substantial overlap in common genes, indicating that the use of these data for comparative studies will be fruitful. We intend to further evaluate their utility for phylogenomic analyses to elucidate the evolutionary relationships of birds.

Scaling cancer subpopulation phylogeny reconstruction to thousands of tumors

Download

Date: TBA
Room: TBA

Jeff Wintersinger, University of Toronto, Canada
Amit Deshwar, University of Toronto, Canada
Quaid Morris, University of Toronto, Canada

Presentation Overview:

Tumors are composed of distinct, heterogeneous cellular populations, each containing a different selection of mutations. By observing the frequencies of mutations across all cells in a tumor, we can infer the existence of these subpopulations, as well as their evolutionary histories relative to one another. Past studies have relied solely on semi-manual methods for reconstructing tumor phylogenies, limiting their application to only tens of cancers. Moreover, with few experts possessing the requisite expertise to perform such work, subclonal reconstruction could only be confidently performed by a handful of labs.

We recently published PhyloWGS, an automated method for reconstructing tumor phylogenies. Subsequently, we have made numerous improvements to help non-experts to perform phylogenetic reconstructions at large scales. PhyloWGS uses Markov Chain Monte Carlo methods to probabilistically sample from a Bayesian posterior over phylogenies consistent with observed mutation frequencies; as such, our results consist not of a single phylogeny for each tumor, but of thousands of phylogeny estimates. Given this multitude of evolutionary history reconstructions, we must understand which phylogeny portions are consistent across estimates and which vary. Moreover, we require consensus representations that illustrate the primary structures observed in each sample of evolutionary histories, ignoring minor discrepancies. To realize these goals, we have begun clustering sampled phylogenies for each tumor. Rather than presenting the researcher with thousands of trees, we instead return only a handful of clusters, each of which represents a plausible phylogeny.

To utilize PhyloWGS' ability to perform phylogenetic reconstructions for thousands of tumors, we joined the Pan-Cancer Analysis of Whole Genomes project, where we are analyzing 2800 tumors drawn from 23 cancer types. By adapting PhyloWGS to a cluster-computing environment, we have been able to reconstruct phylogenies for thousands of tumors in parallel, demonstrating the method's ability to operate at massive scales.

A novel model reduction approach for the Chemical Master Equation

Download

Date: TBA
Room: TBA

Midhun Kathanaruparambil Sukumaran, Department of Applied Mathematics, University of Waterloo, Canada
Marc R. Roussel, Department of Chemistry and Biochemistry, University of Lethbridge, Canada
Brian P. Ingalls, Department of Applied Mathematics, University of Waterloo, Canada

Presentation Overview:

The dynamics of biochemical systems typically vary over multiple time scales, a phenomenon referred to as stiffness; this poses challenges to numerical analysis of system behaviour. By eliminating the fast modes, which correspond to fast time scales that are often not experimentally observed, a model reduction can be achieved. In our work, we make use of a stochastic process model of biochemical system dynamics called the Chemical Master Equation (CME). The slow and fast modes of the system correspond to small and large eigenvalues of the transition matrix of the CME. By a transformation generated from a set of left eigenvectors corresponding to slow eigenvalues, we remove the fast modes to arrive at a truncated model. The transformation was constructed so that probability conservation is maintained in the truncated variable set. By this reduction, we attain a significantly reduced set of non-stiff differential equations. Moreover, the transformed truncation yields an exact representation of the initial condition of the original model, providing an optimal reduced representation of the original dynamics.
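
The role of the left eigenvectors can be made concrete with a short derivation. This is a generic sketch for a linear master equation, in notation introduced here for illustration (A for the transition matrix, p for the probability vector, W_s for the matrix whose rows are the slow left eigenvectors); the authors' actual construction may differ in its details.

```latex
% CME as a linear ODE for the probability vector p(t):
\frac{d\mathbf{p}}{dt} = A\,\mathbf{p}

% A left eigenvector w_i of A, with eigenvalue \lambda_i, satisfies
\mathbf{w}_i^{\top} A = \lambda_i\,\mathbf{w}_i^{\top}
\quad\Longrightarrow\quad
\frac{d}{dt}\bigl(\mathbf{w}_i^{\top}\mathbf{p}\bigr)
  = \mathbf{w}_i^{\top} A\,\mathbf{p}
  = \lambda_i\,\bigl(\mathbf{w}_i^{\top}\mathbf{p}\bigr)

% Stacking only the slow left eigenvectors as the rows of W_s gives a
% small, non-stiff system with an exactly projected initial condition:
\mathbf{y}_s = W_s\,\mathbf{p}, \qquad
\frac{d\mathbf{y}_s}{dt} = \Lambda_s\,\mathbf{y}_s, \qquad
\mathbf{y}_s(0) = W_s\,\mathbf{p}(0)
```

Because the columns of A sum to zero, the all-ones vector is a left eigenvector with eigenvalue 0, so keeping it among the slow modes carries total probability into the reduced variables.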

Large Scale Analysis of Predicting Subcellular Locations of Proteins Represented by Chaos Game Theory

Download

Date: TBA
Room: TBA

Brian Powell, Youngstown State, United States

Presentation Overview:

Current approaches to predicting the subcellular locations of proteins have made some advances but are far from perfect. Accurately predicting these locations results in better annotation of a protein and provides a clearer picture of its functions. We approach this problem by using a chaos game representation of the sequence based on physical and chemical properties of amino acids. We then split the resulting graph into two related discrete series, which are then subjected to wavelet transformation. The wavelet transformation data is then used as input for the classification algorithm, k-nearest neighbor. We measure how accurately each property predicts the correct subcellular location. We aim to exceed the threshold of ~45 percent accuracy, which is the average of existing general subcellular predictors. For our study, protein sequences were obtained from UniProt’s freely accessible repositories. After parsing the raw data file, roughly 15,000 mammalian sequences remained that contained subcellular location annotations. We accommodate 11 subcellular locations: Nucleus, Membrane, Cytoplasm, Endoplasmic Reticulum, Secreted, Mitochondria, Lysosome, Plasma Membrane, Peroxisome, Golgi Apparatus, and Extracellular Space.
Protein sequences comprised of 23 amino acids are sorted into groups of 4 based on the selected property of amino acids. These groups allow the sequence to be plotted using 2-dimensional chaos game theory. The resulting graph retains the sequence order in numerical form. Looking at the graph with the human eye, we cannot deduce any information. To address this, we split the graph into two related discrete series based on the x-axis and y-axis. We then use a 3-level Haar wavelet transformation. Each level provides us with a detail coefficient vector the length of our sequence. For each detail coefficient vector we calculate the mean, min, max, and standard deviation. This provides us with 24 features to be used as input for classification. We use k-nearest neighbor as our classification algorithm with a 67/33 ratio for our training/testing sets. After evaluating the accuracy of one property run through this algorithm, we obtain 36% accuracy.
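
The feature-extraction steps above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the abstract: the four residue groups (a common charge/polarity grouping of the 20 standard amino acids, with other symbols skipped), the corner assignment, and the use of an ordinary decimated Haar transform. The abstract's remark that each detail vector has the length of the sequence suggests an undecimated variant instead. The example sequence is arbitrary.

```python
import math
import statistics

# Hypothetical grouping by charge/polarity; the abstract does not list its groups.
GROUPS = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
CORNERS = {
    "nonpolar": (0.0, 0.0),
    "polar": (0.0, 1.0),
    "acidic": (1.0, 0.0),
    "basic": (1.0, 1.0),
}
RESIDUE_TO_CORNER = {aa: CORNERS[g] for g, aas in GROUPS.items() for aa in aas}


def chaos_game(sequence):
    """Walk from the square's centre, moving halfway to each residue's corner."""
    x, y = 0.5, 0.5
    xs, ys = [], []
    for aa in sequence:
        corner = RESIDUE_TO_CORNER.get(aa)
        if corner is None:
            continue  # skip symbols outside the 20 standard residues
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        xs.append(x)
        ys.append(y)
    return xs, ys


def haar_details(series, levels=3):
    """Detail coefficients of a decimated Haar transform, one list per level."""
    details = []
    approx = list(series)
    for _ in range(levels):
        if len(approx) % 2:
            approx.append(approx[-1])  # pad odd lengths by repeating the last value
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details


def extract_features(sequence):
    """2 series x 3 levels x 4 statistics = 24 features."""
    feats = []
    for series in chaos_game(sequence):
        for d in haar_details(series):
            feats += [statistics.mean(d), min(d), max(d), statistics.pstdev(d)]
    return feats


example_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example
feature_vector = extract_features(example_sequence)
print(len(feature_vector))
```

The resulting 24-element vectors could then be passed to any k-nearest-neighbor implementation. Very short sequences (fewer than about eight residues) would leave the deeper levels without coefficients and would need separate handling.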

Deconvolving gene expression profiles for tumor populations with prior frequency information

Download

Date: TBA
Room: TBA

Chris Cremer, University of Toronto, Canada
Quaid Morris, University of Toronto, Canada

Presentation Overview:

The predictive potential of a tumor sample is stifled by its heterogeneity. A tumor sample is not a uniform collection of genetically identical cells, but a mixture of different cell populations. This heterogeneity arises from contamination by non-cancerous tissue, as well as evolution that occurs throughout a tumor’s lifetime. These distinct cellular populations exhibit heterogeneity through not only genomic mutations, but also differences in gene expression. Beyond informing our understanding of cancer evolution, these gene expression differences constrain the discovery of biomarkers necessary for personalized cancer therapeutics.

At present, subclonal populations in a tumor sample can be modeled by analyzing DNA sequencing data for simple somatic mutations and copy number variations unique to each population. To build on these efforts, we have developed a computational model using RNA expression data that addresses how contamination by normal tissue and tumor subpopulation heterogeneity affect gene expression. Our model represents tumor expression profiles as linear combinations of their constituent cancerous populations and non-cancerous contamination. Expression profiles for each population in a tumor are learned by iterative optimization. Prior knowledge of population frequencies within a tumor is drawn from DNA-sequencing-based methods, serving as a regularization factor that penalizes the model when it deviates too far from estimated population frequencies in generating tumor expression profiles. Our method extends existing methods for deconvolving tumor expression data by incorporating frequency estimates as prior information, thereby pushing the deconvolution towards meaningful latent factors.

In evaluating our model using simulated data, we found it accurately recovers correct tumor expression profiles. We are now investigating how variance in estimates for numbers of subpopulations and their associated frequencies affects model performance. Application of our model to clinical data will grant insight into how mutations affect gene expression in cancer, improving understanding of tumor evolution and informing development of improved treatments.
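
The linear-mixture model with a frequency prior can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the authors' implementation: it minimizes a squared reconstruction error plus a squared penalty on deviation from the prior frequencies, using plain gradient descent, and it does not constrain frequencies to stay non-negative or sum to one. All names, the penalty weight `lam`, and the data are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def loss(X, F, P, F_prior, lam):
    """Squared reconstruction error plus a penalty for leaving the prior."""
    FP = matmul(F, P)
    fit = sum((FP[s][g] - X[s][g]) ** 2
              for s in range(len(X)) for g in range(len(X[0])))
    prior = sum((F[s][k] - F_prior[s][k]) ** 2
                for s in range(len(F)) for k in range(len(F[0])))
    return fit + lam * prior


def deconvolve(X, F_prior, n_steps=2000, lr=0.01, lam=1.0):
    """Gradient descent on population profiles P and frequencies F."""
    S, G, K = len(X), len(X[0]), len(F_prior[0])
    F = [row[:] for row in F_prior]       # start from the prior frequencies
    P = [[1.0] * G for _ in range(K)]     # flat initial expression profiles
    for _ in range(n_steps):
        FP = matmul(F, P)
        R = [[FP[s][g] - X[s][g] for g in range(G)] for s in range(S)]
        grad_P = [[2 * sum(F[s][k] * R[s][g] for s in range(S))
                   for g in range(G)] for k in range(K)]
        grad_F = [[2 * sum(R[s][g] * P[k][g] for g in range(G))
                   + 2 * lam * (F[s][k] - F_prior[s][k])
                   for k in range(K)] for s in range(S)]
        P = [[P[k][g] - lr * grad_P[k][g] for g in range(G)] for k in range(K)]
        F = [[F[s][k] - lr * grad_F[s][k] for k in range(K)] for s in range(S)]
    return F, P


# Simulated data: 4 samples, 2 populations, 3 genes (all values hypothetical).
P_true = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
F_true = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
F_prior = [[0.75, 0.25], [0.55, 0.45], [0.3, 0.7], [0.15, 0.85]]  # noisy estimates
X = matmul(F_true, P_true)

initial_loss = loss(X, F_prior, [[1.0] * 3 for _ in range(2)], F_prior, 1.0)
F_est, P_est = deconvolve(X, F_prior)
final_loss = loss(X, F_est, P_est, F_prior, 1.0)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

The penalty term keeps the estimated frequencies near the DNA-derived priors, which is what prevents the factorization from drifting to an arbitrary but equally well-fitting pair of factors.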

GenAP: A computing platform for life sciences research

Download

Date: TBA
Room: TBA

David Bujold, McGill University, Canada

Presentation Overview:

Full author list:
David Bujold (1), Carol Gauthier (2), Kuang Chung Chen (1), David Morais (2), Michel Barrette (2), Joël Fillon (1), Maxime Lévesque (2), Simon Nderitu (1), Jean-François Landry (2), Louis Létourneau (1), Jules Gagnon (2), Bryan Caron (1), Marc‐Étienne Rousseau (1), Alain Veilleux (2), Pierre-Étienne Jacques (2), Guillaume Bourque (1)
(1) McGill University, Montreal, Quebec, Canada
(2) Université de Sherbrooke, Sherbrooke, Quebec, Canada

The Genetics and Genomics Analysis Platform (GenAP) is a computing platform for life sciences researchers, with two goals in mind: making bioinformatics analyses and data sharing more accessible to non-bioinformaticians, and reducing the bottlenecks associated with genomics big data processing. To accomplish these goals, GenAP has developed three components on the Compute Canada infrastructure: a web portal with powerful and user-friendly analysis tools, bioinformatics software analysis pipelines, and a code and genomic reference files distribution platform with CVMFS.

The Web Portal offers various tools through its virtual machine architecture, such as Galaxy, a web-based analysis framework that allows launching data processing software on High-Performance Computing (HPC) clusters through clickable interfaces rather than with the typical command shell. The GenAP flavour of Galaxy enables users to make use of their Compute Canada allocation to run analyses. A data hub service allows researchers to share and publish large datasets using technologies such as UCSC Genome Browser track hubs. Portal service nodes are currently available at two Compute Canada HPC sites.

The Analysis Pipelines component is composed of a framework implemented in an object-oriented fashion to enable better modularity and reusability, and offers features such as error recovery. Pipelines implemented using this framework support different experiment types such as variant calling and transcriptomics analyses. The source code is freely available on a public source control repository.

The CVMFS component includes repositories for ~100 analysis software packages and their required genomic libraries, and allows these resources to be readily accessible from all supporting sites. Such resources include software for pipeline execution (BWA, GATK, etc.) and many reference genomes. CVMFS has already been integrated at several HPC centres.

In addition, GenAP offers services such as a UCSC Genome Browser mirror with full annotations for the human and mouse assemblies. It also hosts the IHEC Data Portal, the International Human Epigenome Consortium (IHEC) data distribution website, with tools to facilitate dataset discovery and analysis.

Screenlamp: A software framework for hypothesis-driven ligand discovery based on virtual screening and machine learning

Download

Date: TBA
Room: TBA

Sebastian Raschka, Michigan State University, United States
Leslie A. Kuhn, Michigan State University, United States
Weiming Li, Michigan State University, United States
Mar Huertas, Michigan State University, United States
Anne M. Scott, Michigan State University, United States

Presentation Overview:

Virtual screening, a computational approach to automate the evaluation of small molecules as protein activators or inhibitors, has ushered in a new era for drug discovery. The goal in virtual screening is to select from huge datasets of small molecules a subset that is likely to be active, which can then be tested in vitro. Here, we present a novel computational framework, Screenlamp, for highly efficient screening. Screenlamp allows scientists to integrate hypotheses and experimental data into the computer-based screening pipeline, thereby increasing the hit rate and the identification of which functional groups are important for activity. Screenlamp is more efficient than traditional brute-force search paradigms, allowing scientists to screen ~15,000,000 compounds within days. Prior knowledge about the importance of chemical groups and their relative spatial orientation can be used as filtering criteria to select the most relevant compounds for further screening. Using a relational database, Screenlamp tracks properties such as molecular weight, solubility, flexibility, vendor availability, polar group matches, drug-likeness, etc., which facilitates efficient subset filtering and post-analysis. External conformer generators and molecular overlay tools are supported, to facilitate advances in technology and the selection of tools that satisfy the desired tradeoff between accuracy and efficiency. Finally, Screenlamp employs robust, ensemble-based regression and classification on the experimental data to predict the relative importance of chemical groups for activity. The predicted structure-activity relationship can then be back-integrated into the screening pipeline or drive the design and synthesis of novel compounds with improved activity. Screenlamp has been applied in several research projects to discover agonists and antagonists. 
For instance, in collaboration with experimental biologists, we used Screenlamp to identify an antagonist molecule that blocked the in vivo olfactory response to a G-protein coupled receptor bile acid ligand by more than 90%.

Cisplatin Response Prediction in Recurrent Bladder Cancer using Biochemically-inspired Machine Learning

Download

Date: TBA
Room: TBA

Katherina Baranova, University of Western Ontario, Canada
Eliseos Mucaki, University of Western Ontario, Canada
Dimo Angelov, University of Western Ontario, Canada
Dan Lizotte, University of Western Ontario, Canada
Peter Rogan, University of Western Ontario, Canada

Presentation Overview:

The ability to predict response to chemotherapy could impact drug selection or dosing, possibly reduce toxicity, and improve outcomes. Using a novel supervised machine learning approach (Mol Oncology 10: 85-100, 2016), we derived and validated Gaussian kernel support vector machine (SVM) models for the prediction of cisplatin chemotherapy response in bladder cancer. Using breast (n = 39) and bladder cancer cell line (n = 18) gene expression and drug concentrations producing 50% inhibition of growth (GI50), we performed parallelized backwards feature selection with cross-validation on a set of 32 biochemically related genes. Cell lines were labeled as sensitive or resistant by thresholding at their collective median GI50 value. Gene expression signatures were validated by classifying relapsed bladder urothelial carcinoma patients from The Cancer Genome Atlas (n = 78) treated with cisplatin as chemotherapy resistant and all others as sensitive. The best performing cell line SVM accurately classified 68% of disease-free patients and 66% of those with recurrent/progressive disease. To avoid reliance on a specific GI50 threshold (which may be different in patients), machine learning experiments were performed by relabeling cell lines according to different GI50 thresholds, which produced a set of optimized SVMs. The 5 most prevalent genes among these models were BCL2, FEN1, BCL2L1, PRKAA2, and BARD1. Receiver operating characteristic (ROC) curves were constructed by comparing patient outcomes with corresponding composite hyperplane distances derived from a set of SVMs generated at multiple GI50 thresholds. Hyperplane distances were averaged from the best model at each resistance threshold to give a single score with information from each resistance threshold model. The resulting set of scores classified disease-free patients (as “sensitive”) with 62% accuracy, and those with recurrent disease (as “resistant”) with 37% accuracy; the area under the ROC curve was 0.57.
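
The composite-score step can be sketched in a few lines: average each patient's signed hyperplane distance across the per-threshold models, then compute the area under the ROC curve from those averages. The distances, labels, and sign convention (positive meaning resistant) below are invented for illustration, and the AUC is computed with the standard pairwise-ranking formula rather than any particular library.

```python
def composite_scores(distance_table):
    """Average each patient's signed hyperplane distance across models."""
    return [sum(row) / len(row) for row in distance_table]


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical distances: one row per patient, one column per GI50-threshold model.
# Assumed sign convention: positive = predicted resistant.
distances = [
    [0.9, 0.4, 0.2],
    [-0.3, 0.1, -0.4],
    [0.5, -0.2, 0.3],
    [-0.8, -0.5, -0.2],
    [0.1, -0.6, -0.4],
]
labels = [1, 0, 1, 0, 1]  # 1 = recurrent disease (resistant), 0 = disease-free

scores = composite_scores(distances)
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

In this toy data five of the six resistant/sensitive pairs are ranked correctly, giving an AUC of about 0.83; an AUC near 0.5, as reported above, indicates ranking close to chance.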

Lipid composition and cardiovascular health of Danio rerio larvae with maternal and embryonic exposure to endocrine disruptors (EDs).

Download

Date: TBA
Room: TBA

Alysha Cypher, The University of Akron, United States
Brian Bagatto, The University of Akron, United States

Presentation Overview:

Bisphenol A (BPA) is a polycarbonate synthesizer known for its ability to mimic estrogen. Estrogen-regulated processes like vitellogenesis, the process of yolk deposition in oocytes, are therefore susceptible to BPA exposure. The yolk, which is entirely provided by maternal resources, is the only energy source for the initial development of nearly every organ system. Lipids in particular are a critical energy source in yolk for embryonic development because of their large ATP yield and role in cellular structure and hormone signaling. The health and function of many organ systems rely on the availability of polyunsaturated fatty acids (PUFAs), which can have structural functions and act as hormone precursors with numerous roles like mediating red blood cell function and vascular homeostasis. Therefore, we hypothesized that maternal exposure to BPA would alter the end product of vitellogenesis, the lipid composition of eggs, and ultimately the cardiovascular health of larvae. An MS/MS/ALL shotgun lipidomics approach was used to detect a broad range of lipids in egg, ovary, liver, and whole mount tissue, while video microscopy was used to assess cardiovascular health in zebrafish larvae. Lipidomics data are presented by lipid class as a ratio of the total lipids detected. Total fat was higher with exposure to BPA in all tissue types. BPA-exposed eggs decreased in phosphatidylethanolamine and increased in phosphatidylserine. Ovarian tissue exposed to BPA experienced increases in phosphatidylethanolamine, phosphatidylglycerol, and phosphatidylserine, while liver tissue experienced a decrease in phosphatidylethanolamine. Whole mount tissue experienced an increase in phosphatidylethanolamine and phosphatidylcholine and decreases in phosphatidylglycerol, phosphatidylserine, sulfogalactoceramides, and triglycerides. More data are needed to decipher the physiological relevance and significance of these results.

FindIR2: A MATLAB-based tool for accurate detection of imperfect inverted repeats in DNA sequences

Download

Date: TBA
Room: TBA

Sreeskandarajan Sutharzan, Miami University, United States
Chun Liang, Miami University, United States
John Karro, Miami University, United States

Presentation Overview:

Inverted Repeat (IR) sequences in DNA play a number of important biological roles. For example, they are active in the formation of DNA secondary structures, and they serve as components in cis-acting elements, transposons and CRISPR sequences. To better understand the numerous roles played by IRs, it is important to develop accurate IR detection tools. Currently available tools, while fast, have poor sensitivity with respect to imperfect IRs (IRs that allow for some small sequence variation). Hence we propose a more efficient and accurate tool for imperfect IR detection, FindIR2, based on our previously proposed tool FindIR. The newly proposed tool extends the prime number based scoring method of FindIR to detect imperfect IRs (IRs with gaps and mismatches). After identifying potential imperfect IR candidates using the modified prime number based scoring system, valid imperfect IR sequences are filtered out from the detected candidates using a dynamic programming technique based on multiple sequence alignment.
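
To make the notion of an imperfect IR concrete, the following Python sketch checks whether two equal-length arms form an inverted repeat with a bounded number of mismatches. It is a brute-force illustration only, not the prime number based scoring of FindIR or FindIR2 (which the abstract does not spell out), and it does not handle gaps; the function names and the `max_mismatches` parameter are assumptions.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def arm_mismatches(left, right):
    """Count mismatches between the left arm and the reverse
    complement of the right arm (arms must have equal length)."""
    if len(left) != len(right):
        raise ValueError("arms must have equal length in this gap-free sketch")
    return sum(a != b for a, b in zip(left, reverse_complement(right)))


def is_imperfect_ir(left, right, max_mismatches=1):
    """True if the arms form an IR with at most max_mismatches mismatches."""
    return arm_mismatches(left, right) <= max_mismatches
```

With `max_mismatches=0` the check reduces to a perfect inverted repeat.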

Reverse engineering gene regulatory networks from structural and epigenetic data

Download

Date: TBA
Room: TBA

Brittany Baur, Marquette University, United States
Serdar Bozdag, Marquette University, United States

Presentation Overview:

Cancer genomes contain many structural and epigenetic abnormalities, including DNA methylation and copy number aberrations. One of the major challenges in computational biology is to identify “driver” copy number, and to a lesser extent DNA methylation, changes that disrupt the expression levels of key regulatory genes, which in turn have significant effects on their downstream targets. In this study, we propose a framework to uncover driver DNA methylation and copy number changes simultaneously. In addition to the driver aberrations, our methodology uncovers the putative regulatory gene(s) within the aberration region and canonical pathways that are disrupted as a result of the aberration. For a given disrupted pathway, our methodology also ranks the potential regulators within the various copy number and DNA methylation aberrations to prioritize cancer genes. We applied our method on the Cancer Genome Atlas breast cancer data. Our results included previously known driver genes and pathways as well as potentially novel structural and epigenetic aberrations that have key roles in breast cancer.

RSAT: A toolkit for rapid generation and analysis of protein-coding sequence datasets

Download

Date: TBA
Room: TBA

Ryan Schott, University of Toronto, Canada
Daniel Gow, University of Toronto, Canada
Belinda Chang, University of Toronto, Canada

Presentation Overview:

We present RSAT (Rapid Sequence Analysis Toolkit) a new application to facilitate the fast and easy generation and analysis of protein-coding sequence datasets. The application uses a portable database framework to manage and organize sequences along with a graphical user interface (GUI) that makes the application extremely easy to use, even for those with little bioinformatics experience. The application consists of two modules that can be used separately or together. The first module enables the assembly of coding sequence datasets. BLAST searches can be used to obtain all related sequences of interest from NCBI. Full GenBank records are saved within the database and coding sequences are automatically extracted. A feature of particular note is that sequences can be sorted based on NCBI taxonomic hierarchy before export to MEGA for visualization. The application provides GUIs for automatic alignment of sequences with the popular tools MUSCLE and PRANK, as well as for reconstructing phylogenetic trees using PhyML. The second module incorporates selection analyses using codon-based likelihood methods. The alignments and phylogenetic trees generated with the dataset module, or those generated elsewhere, can be used to run the models implemented in the codeml PAML package. A GUI allows easy selection of models and parameters. Importantly, replicate analyses with different parameter starting values can be automatically performed in order to ensure selection of the best-fitting model. Multiple analyses can be run simultaneously based on the number of processor cores available, while additional analyses will be run iteratively until completed. Results are saved within the database and can be exported to publication-ready Excel tables, which further automatically compute the appropriate likelihood ratio test between models in order to determine statistical significance. 
Future updates will add additional options for phylogenetic reconstruction (e.g., MrBayes) and selection analyses (e.g., HYPHY). RSAT saves researchers of all bioinformatics experience levels considerable time by automating the numerous tasks required for the generation and analysis of protein-coding sequence datasets using a straightforward graphical interface.
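
The likelihood ratio test between nested models mentioned in this abstract can be sketched as follows. This is a generic illustration, not RSAT's implementation: the function names are hypothetical, and the p-value uses closed-form chi-square tail probabilities that hold only for 1 or 2 degrees of freedom.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_pvalue(statistic, df):
    """Upper-tail chi-square probability, closed form for df = 1 or 2 only."""
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")
```

For example, log-likelihoods of -1000.0 (null) and -995.0 (alternative) give a statistic of 10.0; with 2 degrees of freedom the p-value is exp(-5), roughly 0.0067.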

Comparison of Response of Pristine Soil Microbial Community to the Intrusion of Coal Mine-Derived Acid Mine Drainage from Two Different Sites.

Download

Date: TBA
Room: TBA

Shagun Sharma, The University of Akron, United States
John Senko, The University of Akron, United States
Mathew Lee, The University of Akron, United States

Presentation Overview:

Approximately 10,000 km of streams are adversely impacted by acid mine drainage (AMD) in the Appalachian coal mining regions of the United States. AMD forms when oxygenated water biogeochemically reacts with coal seam associated FeS2 (pyrite), forming acidic fluids with high concentrations of Fe and other metal(loid)s. Geochemical and microbiological analyses were conducted at two different actively monitored abandoned coal mine sites in southeastern Ohio: Huff Run-Mineral City (HF-MC) and Corning Mine Pool (CMP), which affect major watersheds in their regions. We evaluated the microbial communities associated with these circumneutral mine pools with similar chemical characteristics (HF-MC, pH 5.5 to 6.5; Fe (II) 0.02 to 0.05 mM) (CMP, pH 6.4 to 7.1; Fe (II) 0.01 to 0.04 mM), and the responses of microbial communities in pristine soil when infiltrated with different AMD. We incubated AMD-unimpacted soil with two AMD types. Evaluation of pyrosequencing-derived 16S rRNA gene sequences recovered from incubations revealed that both HF-MC and CMP sites had lithotroph-dominated microbial communities centered around phylotypes attributable to Gallionellales (HF-MC), Leionellales (CMP), Burkholderiales (CMP), and Methylophilus and Rhizobiales (methanotrophs). Incubation of CMP AMD with soil resulted in an increase in the contribution of Burkholderiales (Betaproteobacteria) to the community composition (21-25%) in comparison to the CMP drainage alone (5%). Incubation of HF-MC AMD with soil led to a large decline in the relative abundance of Gallionellales-affiliated phylotypes (lithotrophic lineages), resulting in a microbial community with composition similar to that of pristine soil. This study shows how the microbial community of initially pristine soil responds to AMD intrusion from different sites with similar geochemical characteristics.

Identifying the role of noncoding single nucleotide variants (SNVs) on transcription factors activity in liver cancer

Download

Date: TBA
Room: TBA

Parisa Mazrooei, University of Toronto, Canada
Tahmid Mehdi, University of Waterloo, Canada
Anna Goldenberg, University of Toronto, Canada
Mathieu Lupien, Princess Margaret Cancer Centre, Canada

Presentation Overview:

95% of single nucleotide variants (SNVs), including somatic point mutations and single nucleotide polymorphisms (SNPs), occur in noncoding DNA, that is, outside gene coding sequences. Noncoding SNVs directly contribute to cancer development by, for example, affecting the binding intensity of proteins called transcription factors (TFs) to the DNA, thus changing the expression of genes. Previous works have revealed such a disruption for a handful of SNVs and TFs in various cancer types; however, cancer is not likely to arise from one or two SNVs. Each tumor sample contains thousands of SNVs whose functional effects are unknown. Therefore, there is a growing need to identify and analyze the combinatorial effects of SNVs on the binding intensity of TF complexes, which we refer to as signatures.
To do so, we first infer the effect of any SNV on a TF’s binding intensity to the DNA using ChIP-seq data. Then, we utilize our new ensemble method for bi-clustering based on bipartite network analysis to extract the TFs-SNVs signatures. We used this approach to characterize the impact of 2,233 somatic mutations in liver cancer on the binding of 76 transcription factors to the chromatin and utilized our method to identify TFs-SNVs signatures. Our preliminary results reveal a TF combination consisting of CTCF, Rad21 and SMC3 affected by a collection of SNVs. This combination agrees with their co-localization across the genome and their complementary function in regulating chromatin interactions.
This work provides a framework to identify the functional contribution of noncoding SNVs in cancer based on their impact on transcription factor binding to the chromatin. It allows identification of the SNV-transcription factor combinations most significantly affected in cancer, pinpointing the transcriptional machinery to target for therapeutic action.

Phylogenomic Analysis of the TRAF family of Co-regulators in Maize.

Download

Date: TBA
Room: TBA

Jennifer K. Holmes, University of Toledo, Toledo, OH 43606, United States
Rachael A. Wasikowski, University of Toledo, Toledo, OH 43606, United States
Michael W. Scott, University of Toledo, Toledo, OH 43606, United States
Frank McFarland, The University of Wisconsin-Madison, Madison, WI 53706, United States
Erich Grotewold, The Ohio State University, Columbus, OH 43210, United States
Andrea I. Doseff, The Ohio State University, Columbus, OH 43210, United States
John Gray, University of Toledo, Toledo, OH 43606, United States

Presentation Overview:

The TNF receptor (TNFR) associated factor (TRAF) protein family is characterized by the presence of a conserved coiled-coil domain TRAF/MATH domain (for meprin and TRAF-C homology). This domain is required for homo- or heterodimerization and interaction with receptor or cytoplasmic signaling proteins including transcription factors (TFs). This family of proteins is present in plants, but the biological roles of only a few plant TRAF proteins have been characterized. In Arabidopsis the SEVEN IN ABSENTIA (SINA) clade of TRAF-like proteins has been implicated in proteasome-mediated regulation of transcription factors such as NAC1, CUC2, and AP2. In rice, OsDIS1 has been identified as a negative regulator in the drought tolerance response. We performed a survey of the TRAF family of proteins in maize and have identified at least 46 members. The maize TRAF proteins could be divided into major and minor subclades based on the presence or absence of MATH, and BTB (for BR-C, ttk and bab) or POZ (for Pox virus and Zinc finger) domains. The gene structure of 30 members of the TRAF family could be confirmed from full length cDNA clones present in the maize TFome collection. The expression profiles and co-expression of maize TRAF genes were analyzed using publicly available RNA-seq data. To investigate the structure of TRAF proteins in maize, the XTALPRED algorithm was employed to select members with a high crystallizability index. Progress on the overexpression and purification of selected maize TRAF proteins in their native conformation will be presented. This project was supported by NSF grant IOS-1125620.

Single-cell RNA sequencing reveals allele-specific gene expression and rearrangement during V-J recombination

Download

Date: TBA
Room: TBA

Mark Maienschein-Cline, University of Illinois at Chicago, United States
Sophiya Karki, University of Chicago, United States
Zhengdeng Lei, University of Illinois at Chicago, United States
Pinal Kanabar, University of Illinois at Chicago, United States
George Chlipala, University of Illinois at Chicago, United States
Hong Hu, University of Illinois at Chicago, United States
Morris Chukhman, University of Illinois at Chicago, United States
Neil Bahroos, University of Illinois at Chicago, United States
Marcus Clark, University of Chicago, United States

Presentation Overview:

One of the key components underpinning the incredible diversity of our adaptive immune system is V(D)J recombination, which occurs during lymphocyte development. This recombination happens via a stochastic rearrangement, merging one combination of many possible V and J gene segment pairs in a given immunoglobulin (Ig) locus. Because each cell is expected to undergo a different random instance of rearrangement, single-cell RNA sequencing gives us unprecedented precision to observe the cell-to-cell variation in this process. In particular, we are interested in the allele specificity with which rearrangement occurs: i.e., how precisely can we infer which allele produced a given transcript, and how do the two copies of each Ig allele in each cell differ in their transcriptional profiles? We address these and other questions using single-cell RNA-seq on small pre-B cells from a genetically heterozygous mouse, obtained from a cross between two different homozygous lab strains. This genetic background enables us to differentiate the allele of origin of each sequence read using strain-specific SNPs, and to pair this information with V and J expression and splicing patterns. Taken together, these data allowed us to assess the rearrangement status of each cell and observe how this is associated with allele-specific expression across different Ig loci. We observe that there is a great deal of transcriptomic diversity even within a single cell (in addition to cell-to-cell), and discuss the bioinformatic and statistical methods required to obtain robust results.

Seten: A tool for systematic identification and comparison of processes, phenotypes and diseases associated with RNA-binding proteins from condition-specific CLIP-seq profiles

Download

Date: TBA
Room: TBA

Gungor Budak, Indiana University / IUPUI, United States
Sarath Janga, Indiana University / IUPUI, United States

Presentation Overview:

RNA-binding proteins (RBPs) control the regulation of gene expression in eukaryotic genomes at post-transcriptional level by binding to their cognate RNAs. Although several variants of CLIP (crosslinking and immunoprecipitation) protocols are currently available to study the global protein-RNA interaction landscape at single nucleotide resolution in a cell, currently there are very few tools which can facilitate understanding and dissecting the functional associations of RBPs from the resulting binding maps. Here, we present Seten, a web-based (developed in HTML5/JavaScript) and a command line tool (developed in Python), which can identify and compare processes, phenotypes and diseases associated with RBPs from condition-specific CLIP-seq profiles. Seten uses BED files resulting from most peak calling algorithms which include scores reflecting the extent of binding of an RBP on the target transcript, to provide both traditional functional enrichment as well as gene set enrichment results for a number of gene set collections including Reactome, BioCarta, KEGG, Gene Ontology (GO), Human Phenotype Ontology (HPO) and MalaCards Disease Ontology. It combines the results from functional enrichment, which considers only the presence of a binding event in a gene and gene set enrichment, which considers the extent of binding of an RBP to report an integrated significance score for each associated gene set. It also provides an option to dynamically compare the associated gene sets across cell lines as bubble charts. Seten’s web-based user interface currently provides precomputed results for more than 150 CLIP-seq datasets and both interfaces can be used to analyze CLIP-seq and other kinds of binding profile datasets such as ChIP-seq datasets. We highlight several examples to show the general utility of Seten for rapid profiling of various CLIP-seq and ChIP-seq datasets. Seten is available on http://www.iupui.edu/~sysbio/seten/.

Decoding compound mechanism of action using integrative pharmacogenomics

Download

Date: TBA
Room: TBA

Nehme El-Hachem, Institut de Recherche Cliniques de Montreal, Canada

Presentation Overview:

For decades, the “one drug-one target-one disease” paradigm dictated much of the drug development process. However, in the past ten years, tremendous advances in transcriptomics and genomics research shifted this simplistic view of a drug mechanism of action (MoA) to a more complex systems pharmacology paradigm where a drug can bind to several targets.
Several computational strategies have been proposed to elucidate the mechanism of action for existing and newly developed drug-like compounds. Traditional approaches predicted new drug-target associations based on the chemical similarity of corresponding ligands or side effects of approved drugs. Recent bioinformatic approaches built drug-drug networks from drug-induced transcriptional profiles and inferred new mechanisms of action. However, current drug taxonomies either rely on information that is difficult to gather for new compounds (e.g., side effects) or predict drug target(s) and MoA inaccurately. There is therefore a dire need to leverage the increasing amount of pharmacogenomic data in order to improve drug taxonomy by better characterizing drug targets without relying on prior knowledge such as therapeutic indications or side effects.
In our study, we integrated different layers of information from recent large-scale pharmacogenomic datasets in order to infer new MoA for chemical compounds extracted from cancer screens from three data layers: (i) drug structural similarity, (ii) drug perturbation transcriptomics profiles from the LINCS database; and (iii) drug sensitivity profiling assays from cancer cell lines (CTRPv2). We used our recently published Similarity Network Fusion algorithm to efficiently integrate these three data layers into a single, integrative drug taxonomy called Drug Network Fusion (DNF). We found that DNF outperformed drug taxonomies based on single data layers for both drug target prediction (DNF concordance index = 0.89 vs. 0.71, 0.83, 0.64 for structure, sensitivity and perturbation layers, respectively) and ATC classification (DNF concordance index = 0.77 vs. 0.72, 0.58, 0.54 for structure, sensitivity and perturbation, respectively).
We classified correctly almost all kinase inhibitors and inferred new mechanisms for other undescribed compounds. Our innovative computational framework highlights the importance of integrating complementary data layers concerning drugs such as chemical, transcriptional and sensitivity profiles. DNF can be easily extended to more compounds or data layers and as such, constitutes a valuable resource to the cancer research community by providing new hypotheses on the compound MoA and potential insights for drug repurposing.
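
The abstract reports performance as a concordance index but does not say how it was computed. The generic pairwise definition is sketched below: over all pairs with different true values, count the fraction that the scores rank in the same order, giving half credit for tied scores. The function name and the tie handling are assumptions, not the authors' implementation.

```python
from itertools import combinations


def concordance_index(truth, scores):
    """Fraction of comparable pairs ranked in the same order by
    `scores` as by `truth`; tied scores count as half-concordant."""
    concordant = 0.0
    comparable = 0
    for (t1, s1), (t2, s2) in combinations(zip(truth, scores), 2):
        if t1 == t2:
            continue  # pairs with equal truth values are not comparable
        comparable += 1
        if s1 == s2:
            concordant += 0.5
        elif (t1 < t2) == (s1 < s2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values (for example 0.89 for DNF) should be read.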

Predicting Gene Regulation in Diverse Global Populations

Download

Date: TBA
Room: TBA

Virginia Saulnier, Loyola University Chicago, United States
Alexa Badalamenti, Loyola University Chicago, United States
Jeffrey Ng, Loyola University Chicago, United States
Shyam Shah, Loyola University Chicago, United States
Dr. Heather Wheeler, Loyola University Chicago, United States

Presentation Overview:

Variation in gene regulation has been shown to play a key role in the genetics of complex traits. The goal of our research is to build and test genetic predictors of gene expression within and between populations. Ultimately, we want to know how differences in gene regulation affect complex genetic traits, including disease status and drug susceptibility. Utilizing genetic and gene expression data from diverse populations, we aim to expand our predictive models of gene expression beyond those of European ancestry. With the computational program PrediXcan, we produce statistical models that accurately predict gene expression from variant genotypes alone. These models are then applied to existing genome-wide genotype datasets to predict gene expression levels. In the final step of PrediXcan, these predicted levels are tested for association with complex traits, such as disease status. We are using machine learning approaches, including elastic net regularization, to build prediction models in several populations from the HapMap Project and testing model performance across populations. We aim to see how these genetic predictors of expression vary between diverse populations, as well as determine if differences in the expression of the same genes associate with a particular complex trait between these populations. Our research findings have a broad range of potential applications, most notably in the fields of bioinformatics and medicine. If a correlation between disease susceptibility and high predicted gene expression can be found, it raises the possibility of targeting the gene with expression-blocking drugs as a personalized disease treatment plan.
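
Applying a trained model to genotype data amounts to a weighted sum of allele dosages, and elastic net regularization adds a mixed L1/L2 penalty on the weights during training. The Python sketch below illustrates both ideas only; it is not PrediXcan code, the variant IDs are made up, and the `lam`/`alpha` parameterization of the penalty follows a common convention that is assumed here rather than taken from the abstract.

```python
def predict_expression(weights, dosages):
    """Predicted expression of one gene for one individual:
    sum of weight * allele dosage over variants present in both dicts."""
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


def elastic_net_penalty(weights, lam, alpha):
    """Elastic net penalty lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty.
    """
    values = list(weights)
    l1 = sum(abs(w) for w in values)
    l2_squared = sum(w * w for w in values)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
```

For example, with hypothetical weights `{"rs1": 0.5, "rs2": -0.2}` and dosages `{"rs1": 2, "rs2": 1}`, the predicted expression is 0.5 * 2 - 0.2 * 1 = 0.8.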

Semi-automated transcriptome annotation with SegRNA

Download

Date: TBA
Room: TBA

Mickaël Mendez, Department of Computer Science, University of Toronto, Canada
Eric G Roberts, Princess Margaret Cancer Center, Canada
Michael M Hoffman, Department of Computer Science, Department of Medical Biophysics, University of Toronto, Princess Margaret Cancer Center, Canada

Presentation Overview:

Mickaël Mendez1,2, Eric G. Roberts2, Michael M. Hoffman1,2,3

1 Department of Computer Science, University of Toronto, Toronto, ON, Canada;
2 Princess Margaret Cancer Center, Toronto, ON, Canada;
3 Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada

Introduction: We have developed a method that uses data from ChIP-seq and similar assays to identify novel subtypes of known functional elements, and to elucidate relationships among different types of elements across different cell types. Our integrative method, called Segway, uses dynamic Bayesian networks to simultaneously segment the genome and discover joint patterns across different experiment types in an unsupervised fashion. However, the current Segway model does not take into account intrinsic properties of transcriptome datasets such as directionality and strand specificity; thus, it cannot be used to characterize new types of transcripts from RNA-seq and CAGE data.

Method: Here we introduce a new method called SegRNA, which extends our existing Segway software with an optimized model for stranded RNA data. The new model uses a dynamic Bayesian network to learn patterns simultaneously from the RNA signal on both strands and produces two human-understandable annotations and visualizations, one for each strand. Additionally, SegRNA learns a model of the most likely transitions between particular patterns, allowing us to describe the structure of larger domains made of individual RNA elements.

Results: We applied SegRNA to 12 ENCODE RNA-seq datasets from K562 cells, including multiple cellular compartments, and found consistent transitions between segments for both strands. This result suggests that the stranded model is indeed able to identify similar patterns on both strands. Additionally, we discovered recurring multitrack patterns indicative of and associated with protein-coding genes, snRNAs, miRNAs, and pseudogenes.

Discussion: SegRNA will identify novel transcripts that control gene expression, such as enhancer RNAs and long non-coding RNAs. It will pinpoint differences in the locations of these transcripts between various normal and cancer cell types. This will allow us to understand how different kinds of transcripts control behavior of normal human cells, and how this process goes wrong in disease.

Whole Genome Assembly of Camelina microcarpa, a Resource for the Improvement of the Oilseed Crop Camelina sativa

Download

Date: TBA
Room: TBA

Beatriz E. Lujan-Toro, Carleton University, Canada
Connie Sauder, Agriculture and AgriFood Canada, Canada
Owen Rowland, Carleton University, Canada
Sara L. Martin, Agriculture Canada, Canada

Presentation Overview:

Increasing world population and diminishing resources have caused the agricultural and scientific community to seek alternative food and energy sources. Camelina sativa (camelina) is an oilseed crop that has been cultivated since the Bronze Age. Camelina is an excellent candidate for green oil production, suitable for the production of biofuels and specialized industrial oils. In addition, camelina oil has a high omega-3 fatty acid content making its oil a potential renewable alternative for fish oil. Camelina is suitable for Canadian agriculture since it is a low-input, pest and drought resistant crop that can grow on marginal lands. However, its oil profile and agronomic properties need improvement for these applications. Traditional breeding and engineering attempts are hindered by the complex hexaploid nature of the camelina genome. Wild relatives of crops are often neglected in research, but are key in understanding the evolutionary history of crops and can be a source of desirable traits for the improvement of crop species. The most abundant wild relative of camelina in North America is Camelina microcarpa or littlepod false flax. Here we present a hybrid de novo assembly of a diploid C. microcarpa genome using Illumina and PacBio sequencing data, which will serve as a resource for further development of the camelina crop. This genome will provide molecular breeding information as well as valuable understanding on the evolutionary relationship between the two species, and a potential route to introduce variability and desirable traits into the camelina crop.

Gene regulation in Chagas Disease

Download

Date: TBA
Room: TBA

Prashant Kumar Kuntala, Ohio University, United States

Presentation Overview:

Chagas disease is an infectious disease caused by a parasite found in the feces of the triatomine bug. It is most common in South America, Central America, and Mexico. Left untreated, it can cause congestive heart failure. In order to develop better drug targets and therapeutics, discriminative motif discovery and selection can be used. These methods identify the transcription factor binding sites (TFBS) in the promoter regions of Chagas disease related genes belonging to three different cell types, namely, human microvascular endothelial cells, human foreskin fibroblasts and human vascular smooth muscle cells. Identifying these putative binding sites helps facilitate the understanding of gene regulation in Chagas disease.
In this study we aim to discover the putative TFBS in the Chagas disease related genes by using a greedy algorithm for motif selection.
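
The abstract does not describe its greedy algorithm, so the Python sketch below shows one standard way to phrase motif selection greedily, as a set cover: repeatedly pick the motif that occurs in the most promoter sequences not yet covered. The function name, the `max_motifs` parameter, and the input format (motif name mapped to the set of sequences containing it) are assumptions for illustration.

```python
def greedy_motif_selection(motif_hits, max_motifs=None):
    """Greedy set cover over motifs.

    motif_hits maps a motif name to the set of sequence IDs it occurs in.
    Returns motif names in the order they were selected.
    """
    uncovered = set().union(*motif_hits.values())
    selected = []
    while uncovered and (max_motifs is None or len(selected) < max_motifs):
        # Sort names first so ties are broken alphabetically.
        best = max(sorted(motif_hits), key=lambda m: len(motif_hits[m] & uncovered))
        gain = motif_hits[best] & uncovered
        if not gain:
            break
        selected.append(best)
        uncovered -= gain
    return selected
```

The greedy choice is not guaranteed to give the smallest possible motif set, but it is simple and is a common heuristic for this kind of coverage problem.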

NOTUNG-DM: Software for Reconstructing the Evolutionary History of Multidomain Proteins

Download

Date: TBA
Room: TBA

Maureen Stolzer, Carnegie Mellon University, United States
Han Lai, Carnegie Mellon University, United States
Minli Xu, Carnegie Mellon University, United States
Rosanna G Alderson, Carnegie Mellon University, United States
Katherine Siewert, University of Pennsylvania, United States
Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

Multidomain proteins play essential roles in apoptosis, cell signalling and adhesion, multicellularity, and the immune system. Multidomain families evolve via the duplication, insertion, and deletion of domains, sequence fragments that encode protein modules. However, rigorous methods for reconstructing the evolution of this important class of proteins are lacking.

Here, we present Notung-DM, software that reconstructs the evolutionary history of multidomain protein families. Our software implements novel algorithms that utilize an event-based framework to consider evolution across three levels of organization: domains, genes, and species. Given a species tree, a gene tree, and a tree for each domain in the family, Notung-DM infers the events in the history of the multidomain family, as well as the timing of these events relative to gene and species divergences. Notung-DM also outputs the information required to reconstruct the domain content of each ancestral gene.

Our software is unique in that it considers both domain sequence evolution and domain content evolution. By modelling a multidomain family as a collection of co-evolving domain trees, Notung-DM accounts for sequence evolution within individual domains, as well as the evolution of the domain content of the family as a whole.

In contrast, the widely-used Wagner and Dollo parsimony approaches to inferring domain gains and losses only consider the domain content of proteins. These methods cannot recognize parallel gains or losses and tend to underestimate the number of events that occurred. Because Notung-DM can infer more informative and detailed evolutionary histories, it has the potential to leverage novel discoveries in both protein evolution and protein function. For example, Notung-DM can be used to determine whether the same domain architecture formed more than once independently and can identify instances of the gain, loss, or replacement of a domain that encodes interaction specificity, leading to an immediate and dramatic change in function.

Innovating Traditional de novo Genomics Tools to Overcome Traditional Requirements of High Power Computational Platforms Under Modern Financial Restrictions

Download

Date: TBA
Room: TBA

Seth Munholland, University of Windsor, Canada
Claudia Dinatale, University of Windsor, Canada
William Crosby, University of Windsor, Canada

Presentation Overview:

Genome assembly, annotation, and analysis are tools gaining widespread use and applicability in a biological research context. Traditionally, a requirement of these tools is a computing platform that is both powerful and expensive. Budget constraints can still limit further adoption of these tools; however, the increasing power available to desktop workstations has made it possible to perform bioinformatic genomics without the traditional high-cost investment in server hardware. Along with this access come challenges unique to the landscape of non-server computation platforms, largely stemming from the server-oriented design of much of the software. This is further complicated by the refinement of old, and the introduction of new, sequencing methods, such as SMRT sequencing.
To overcome these obstacles, bioinformaticians must alter extant pipelines in new and creative ways to achieve a high-resolution, high-quality, and low-cost solution to the problems of de novo genomics without a prohibitively long timeline. Reorganizing the output of one pipeline or software package so that it can be fed into another is so common as to verge on becoming a subdiscipline in its own right. This often requires an intimate understanding of the biological processes at play along with a keen grasp of coding paradigms. Methods such as parsing contigs/scaffolds into artificial short reads, or reorganizing, renaming, and repurposing predicted SNPs to act as anchors for scaffolding, have become standard fare, all while maintaining the statistical integrity of the assembly for further research.
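One of the methods named above, parsing contigs into artificial short reads, can be illustrated with a minimal Python sketch. The function name, the read length, and the step size are assumptions made for illustration; the abstract does not state which values or tools the authors used.

```python
def contig_to_pseudo_reads(contig, read_len=100, step=50):
    """Slice one contig into overlapping, fixed-length pseudo-reads.

    read_len and step are illustrative defaults, not values from the talk.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    if len(contig) <= read_len:
        # A contig shorter than one read is returned whole (or dropped if empty).
        return [contig] if contig else []
    last_start = len(contig) - read_len
    reads = [contig[i:i + read_len] for i in range(0, last_start + 1, step)]
    # Add a final read flush with the contig end so the tail is not lost.
    if last_start % step != 0:
        reads.append(contig[-read_len:])
    return reads


if __name__ == "__main__":
    for read in contig_to_pseudo_reads("ACGTACGTAC", read_len=4, step=4):
        print(read)
```

Overlapping windows keep every base covered by at least one pseudo-read, which is the property a downstream assembler or scaffolder would normally rely on.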

Enabling phyletic-based comparison and visualization of genomic islands for tens to hundreds of microbial genomes.

Download

Date: TBA
Room: TBA

Claire Bertelli, SFU, Canada
Adrian C Lim, SFU, Canada
Gemma Hoad, SFU, Canada
Geoffrey L Winsor, SFU, Canada
Fiona S L Brinkman, SFU, Canada

Presentation Overview:

The worldwide spread of virulent bacterial strains and antimicrobial resistance has a significant impact on human health and major economic consequences. Whole bacterial genome sequencing is increasingly used to characterize the spread and evolution of strains causing infectious disease outbreaks. Particular genomic regions called genomic islands (GIs; commonly defined as clusters of genes with probable horizontal origins) are of interest because they disproportionately encode medically important adaptations, including virulence factors (VFs) and certain antimicrobial resistance (AMR) genes. While microbial genome sequencing has become rapid and inexpensive, current computational methods for GI analysis are not amenable to rapid, user-friendly, and scalable analysis of the thousands of genomes being sequenced.
To fill this gap, we are developing IslandCompare, an open-source computational pipeline and web-based visualization resource to compare GIs across several to hundreds of bacterial genomes. GI predictions are performed using SIGI-HMM and IslandPath-DIMOB, two of the most accurate GI prediction tools based on nucleotide composition bias. A bacterial core-genome phylogeny is computed using Parsnp and regions of similarity between isolates are calculated pair-wise with Mauve. The core-genome phylogenetic tree is displayed in an interactive, user-friendly display (D3 JavaScript library). At each tree leaf, genomes are displayed linearly with GIs highlighted. A zoom-in functionality permits further visualization of genes and their annotation. Furthermore, we have improved IslandPath-DIMOB and assessed its accuracy against a dataset of GIs identified by comparative genomics. IslandPath-DIMOB V2.0 features a 15% increase in recall and a 4% increase in precision compared to the currently available version.
IslandCompare and IslandPath-DIMOB V2.0 will facilitate more robust, flexible analysis and comparison of GIs, complementing existing tools like IslandViewer, and enabling more efficient larger-scale analysis of disease outbreaks and microbial evolution. The genome visualization developed may be adapted to visualize any genome feature on hundreds of phyletically organized linear genome views, with broad applications.

S-plot 2.0: A computational tool for the rapid identification of horizontally acquired elements within genomic sequences

Download

Date: TBA
Room: TBA

Laurynas Kalesinskas, Loyola University Chicago, United States
Evan Cudone, Loyola University Chicago, United States
Catherine Putonti, Loyola University Chicago, United States

Presentation Overview:

The result of recent horizontal transfer events, genomic islands provide key insights into phenotypic changes within bacterial species. In the age of ‘super-bugs’, the horizontal acquisition of drug resistance genes and virulence factors can dramatically alter a bacterium’s pathogenicity within its host. Thus, the ability to readily identify horizontally acquired elements as well as their putative source(s) can profoundly advance our understanding of both the interactions between the host and pathogen and the interactions between microbes within the host microbiota. Previously, a tool called S-plot was developed for the large-scale comparison of genomic sequences; this tool proved fruitful for the recognition of regions that originated through horizontal gene transfer. Since its inception, however, high-throughput sequencing has exploded, generating complete and partial genomes for many more species and strains of bacteria. As such, we present S-plot 2.0 for the expedient analysis of genomic sequences (partial or complete). The tool creates an interactive, two-dimensional heat map capturing the similarities and dissimilarities in nucleotide usage at various levels both within and between genomic sequences. Putative xenologous sequences can thus be easily identified and further examined for their source(s). The utility of this tool is exhibited here through the comparison of genomic sequences for the cosmopolitan gram-negative bacillus Pseudomonas aeruginosa. While found within soil, water, plants, and animals, it is an opportunistic pathogen and a frequent cause of hospital-acquired infections, posing serious risks for patients with cystic fibrosis. Using the S-plot 2.0 tool, we have conducted an exhaustive analysis of publicly available P. aeruginosa genomic sequences, exploring the acquisition and source of virulence genes as well as the evolutionary history of this species.
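The idea of a window-by-window heat map of nucleotide usage can be sketched in Python as follows. This is not the S-plot 2.0 implementation: the window size, the k-mer length, and the use of Pearson correlation as the similarity measure are all assumptions chosen only to make the concept concrete.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Relative frequency of every k-mer over ACGT, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1  # ignore k-mers containing N etc.
    return [counts[m] / total for m in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def similarity_matrix(seq_a, seq_b, window=1000, k=2):
    """Heat-map values: one row per window of seq_a, one column per window of seq_b."""
    def windows(seq):
        return [seq[i:i + window] for i in range(0, len(seq) - window + 1, window)]
    profiles_a = [kmer_profile(w, k) for w in windows(seq_a)]
    profiles_b = [kmer_profile(w, k) for w in windows(seq_b)]
    return [[pearson(pa, pb) for pb in profiles_b] for pa in profiles_a]
```

In such a matrix, a window whose row or column is consistently dissimilar to the rest of its own genome would be a candidate for closer inspection as a horizontally acquired region.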

Random Walk with Resistance and Solving Laplace’s Equation on MicroRNAs-Gene Interaction Networks

Download

Date: TBA
Room: TBA

Silva Konini, York University, Canada
Jake O'Brien, York University, Canada
Chun Peng, York University, Canada
E J Janse van Rensburg, York University, Canada

Presentation Overview:

In this presentation, two network topology-based algorithms are presented to improve the quality of protein-protein interaction networks and increase the accuracy of protein complex identification. The key idea of the algorithms is that two proteins sharing higher topological similarity are likely to interact with each other and might belong to the same protein complex. The first algorithm used to measure the topological similarity of the proteins in a network is the Random Walk with Resistance algorithm. In addition, solving Laplace’s equation on a network is presented as an alternative to the Random Walk with Resistance algorithm. The genes most strongly up- and down-regulated by microRNA hsa-miR-218-5p, along with the set of their interacting genes, are visualized in a graphical network. Both the Random Walk with Resistance and the Laplace’s equation algorithms are applied to this protein-protein interaction network. Proteins sharing high topological similarity (given by higher values of the Pearson correlation coefficient between every pair of columns of the probability matrix or of the Laplace’s equation solution matrix) are joined together through new interactions. New protein-protein interaction networks are generated and visualized with the help of Cytoscape. The new protein clusters seen in the reconstructed networks contain proteins that share common biological functions.
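The step of correlating columns of a probability matrix and proposing new interactions can be sketched in Python. Note the substitution: the sketch uses a plain random walk with restart, not the authors' Random Walk with Resistance, whose details are not given in the abstract. The restart probability, the threshold, and all function names are hypothetical.

```python
from math import sqrt


def rwr_profiles(adj, restart=0.5, iters=100):
    """For each start node, the visiting probabilities of a random walk with restart.

    adj maps each node to the set of its neighbours (undirected graph).
    Each returned profile is one column of the probability matrix.
    """
    nodes = sorted(adj)
    profiles = {}
    for start in nodes:
        p = {v: 1.0 if v == start else 0.0 for v in nodes}
        for _ in range(iters):
            nxt = {v: (restart if v == start else 0.0) for v in nodes}
            for u in nodes:
                if adj[u]:
                    share = (1 - restart) * p[u] / len(adj[u])
                    for v in adj[u]:
                        nxt[v] += share
                else:
                    nxt[u] += (1 - restart) * p[u]  # isolated node keeps its mass
            p = nxt
        profiles[start] = [p[v] for v in nodes]
    return profiles


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def propose_edges(adj, threshold=0.9, restart=0.5):
    """Non-adjacent node pairs whose profiles correlate at or above the threshold."""
    profiles = rwr_profiles(adj, restart)
    nodes = sorted(adj)
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
            if v not in adj[u] and pearson(profiles[u], profiles[v]) >= threshold]
```

The pairs returned by `propose_edges` would correspond to the new interactions added before re-clustering; a suitable threshold would have to be tuned on real data.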

Inferring Genetic Interactions using PubMed

Download

Date: TBA
Room: TBA

Anthony Deeter, University of Akron, United States
Joseph Haddad, University of Akron, United States
Zhong-Hui Duan, University of Akron, United States
Mark Dalman, Kent State University, United States

Presentation Overview:

The PubMed database offers a large set of publication data that can be useful, yet inherently difficult to use without automated computational techniques. Inspired by past PubMed data mining studies as well as computational methods used with experimental genetic data, we created a method to infer genetic interactions. Mining genetic information from abstracts cited within PubMed, our method combines large numbers of Bayesian networks into consensus networks that represent potential genetic interactions. Through a novel concept we call network resolution, these consensus networks can be tailored to infer groups of interactions that range from narrow and focused to broad and encompassing. Our method can be used to confirm the existence of currently accepted interactions and has the potential to hypothesize new ones as well. Future work includes a web-driven interface that will allow investigators to utilize their own experimental data, as well as data from integrated KEGG pathways and experiment databases such as The Cancer Genome Atlas.

Modeling Time Patterns in Gene Expression Profiling

Download

Date: TBA
Room: TBA

Guenter Tusch, Grand Valley State University, United States
Shahrzad Eslamian, Grand Valley State University, United States
Krishna Nadiminti, Grand Valley State University, United States
Bhanu Yandrapragada, Grand Valley State University, United States
Raveena Pendyam, Grand Valley State University, United States

Presentation Overview:

PURPOSE: Temporal patterns in gene profiles, such as peaks, can represent a biological effect that is reversed over time. In temporal translational research, a researcher typically obtains an expression profile and tries to retrieve similar profiles in a set of genes or features in public databases.

PROCEDURES: We used normalized microarray or RNA-seq datasets from time-series stimulus-response studies in NCBI GEO and ArrayExpress. We assume that a researcher has found a temporal pattern, e.g., in a KEGG pathway, and wants to search for a “similar” time pattern in public datasets. The pattern could be a peak found in a specified time interval, e.g., the first 24 hours. To model the time pattern, we use knowledge-based temporal abstractions that convert expression values or counts (RNA-seq) to an interval-based qualitative representation. Temporal abstractions allow for independence from the particular time points that the experimenters chose for their studies. Thus the researcher can compare the same genes or features across different microarray or RNA-seq platforms. A key assumption is that if the same biological signal is expressed in each study, it should be identified by statistical significance independent of technology and selection of time points.
To make sure the statistical power is not diminished by small sample sizes, we use moderated t-statistics, in which standard errors are moderated across genes, borrowing information from the ensemble of genes.
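The conversion of expression values to an interval-based qualitative representation can be illustrated with a minimal Python sketch. The three labels, the slope tolerance `tol`, and the peak definition (an increasing interval directly followed by a decreasing one) are assumptions for illustration and are not taken from the authors' knowledge base.

```python
def abstract_trends(times, values, tol=0.1):
    """Turn a time series into merged (t_start, t_end, label) intervals.

    tol is a hypothetical slope threshold separating "steady" from a trend.
    """
    intervals = []
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        slope = (v1 - v0) / (t1 - t0)
        if slope > tol:
            label = "increasing"
        elif slope < -tol:
            label = "decreasing"
        else:
            label = "steady"
        if intervals and intervals[-1][2] == label:
            # Extend the previous interval instead of starting a new one.
            intervals[-1] = (intervals[-1][0], t1, label)
        else:
            intervals.append((t0, t1, label))
    return intervals


def has_peak(intervals, start, end):
    """True if an increasing interval turns into a decreasing one within [start, end]."""
    for (_, turn, first), (_, _, second) in zip(intervals, intervals[1:]):
        if first == "increasing" and second == "decreasing" and start <= turn <= end:
            return True
    return False


if __name__ == "__main__":
    # Two hypothetical studies sampling the same response at different time points (hours).
    study_a = abstract_trends([0, 2, 6, 12, 24, 48], [1.0, 1.0, 3.0, 5.0, 2.0, 2.0])
    study_b = abstract_trends([0, 4, 12, 20, 48], [1.0, 2.0, 5.0, 3.0, 2.0])
    print(has_peak(study_a, 0, 24), has_peak(study_b, 0, 24))
```

Because the query is phrased over intervals rather than raw time points, the two toy studies can be matched against the same "peak within the first 24 hours" pattern even though they were sampled on different schedules.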

OUTCOME: We evaluated our approach on a set of 644 temporal GDS with 171 different platforms from NCBI GEO. Preliminary results indicate that the approach has significant potential if sufficiently large gene ensembles are chosen.

IMPACT: This approach shows promise in helping researchers analyze temporal gene expression data and presents a novel opportunity to identify new drug targets or to support phenotypic anchoring of gene expression data.

Genomic and Computational Approaches to Electrogenesis in Fishes

Download

Date: TBA
Room: TBA

Ahmed Elbassiouny, University of Toronto, Canada
Ryan Schott, University of Toronto, Canada
Belinda Chang, University of Toronto, Canada
Nathan Lovejoy, University of Toronto, Canada

Presentation Overview:

Electric fishes have dazzled scientists for decades with their sophisticated electric signal production. Neotropical weakly electric fishes of the order Gymnotiformes use their neuro-/myogenic electric organs to produce electric signals for communication and electrolocation. Different species produce different types of electric signals by modulating both the amplitude and the frequency of those signals. However, the genomic basis of species-specific electrogenesis remains unexplored. We use high-throughput genomic techniques, coupled with high-performance computing resources, to identify the key molecular players in the generation of electric signals.

Improved Hi-C Contact Maps by Adaptive Density Estimation

Download

Date: TBA
Room: TBA

Christopher Cameron, McGill University, Canada
Josée Dostie, McGill University, Canada
Mathieu Blanchette, McGill University, Canada

Presentation Overview:

A correct interpretation of chromosome conformation capture (3C) interaction matrices (or contact maps) relies on the ability to observe genomic interaction data at the proper resolution, which results in a trade-off between resolution and accuracy. The resolution (i.e., the minimum number of base pairs per interaction) of 3C and related techniques is defined by two factors: 1) the choice of restriction enzyme, and 2) the depth of sequencing coverage. Viewing a contact map at the restriction fragment level provides the highest possible resolution but potentially results in the lowest accuracy due to the limited sequencing depth; conversely, a lower resolution (i.e., accumulating contacts in larger bins) may provide increased accuracy due to the larger number of read pairs being considered.
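The fixed-resolution alternative described above, accumulating contacts in larger bins, can be made concrete with a short Python sketch. It shows only this non-adaptive baseline, not the authors' density estimation algorithm; the function name and the toy matrix are illustrative.

```python
def coarsen_contact_map(matrix, factor):
    """Sum a square contact matrix into bins `factor` times larger."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(matrix)
    m = (n + factor - 1) // factor  # ceiling division keeps a partial last bin
    coarse = [[0] * m for _ in range(m)]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            coarse[i // factor][j // factor] += count
    return coarse


if __name__ == "__main__":
    # Toy 4x4 contact map at fragment-level resolution (made-up counts).
    fine = [
        [4, 1, 0, 0],
        [1, 3, 1, 0],
        [0, 1, 5, 2],
        [0, 0, 2, 6],
    ]
    for row in coarsen_contact_map(fine, 2):
        print(row)
```

Each coarse bin pools more read pairs, which reduces sampling noise at the cost of resolution; every bin is widened by the same factor regardless of how many contacts it contains, which is the limitation an adaptive method is meant to address.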

Here we describe a density estimation (DE) algorithm based on Markov Random Fields, designed to incorporate 3C contact density variance to better estimate the underlying contact distribution while preserving the highest possible resolution. Conventional applications of DE techniques to 3C-derivative interaction maps do not consider the changing densities of contacts and implement a fixed resolution across the matrix (non-adaptive). Log-likelihood and squared error measures indicate that all DE techniques have difficulty when observing chromosomal regions of large interaction frequency variance, while adaptive methods provide the best overall estimations. A comparison of publicly available carbon-copy 3C (5C) and Hi-C data sets demonstrates that adaptive DE improves the correlation between observed interaction frequencies of both 3C-derived techniques.

The proper use of DE techniques with Hi-C data is shown to provide an improved representation of the underlying contact distribution and captured chromatin loops. The analysis of 5C data, which are expected to have lower statistical noise due to the reduced genomic region observed, and density estimated Hi-C data sets confirm the accuracy of our DE algorithm.

The utility of draft bacterial genomes for gene function analysis and genomic island prediction

Download

Date: TBA
Room: TBA

Julie A. Shay, Simon Fraser University, Canada
Claire Bertelli, Simon Fraser University, Canada
Bhavjinder K. Dhillon, Simon Fraser University, Canada
Fiona S.L. Brinkman, Simon Fraser University, Canada

Presentation Overview:

Short-read sequencing technologies tend to produce draft genomes, but these technologies continue to be the most popular for bacterial genome sequencing. It is still relatively costly and time consuming to close the gaps between contigs to produce a closed genome, so the majority of bacterial genome sequences are never closed. Despite this, the limitations of using draft bacterial genomes for function analysis have not been well assessed. To characterize the importance of missing regions of draft genomes, draft bacterial genomes (produced using the most common Illumina short-read sequencing technology used today) were compared with the subsequently closed version of these genomic sequences. Thirty-six Listeria monocytogenes genomes sequenced by the Canadian National Microbiology Laboratory were used for this assessment, and the results were compared with Pseudomonas aeruginosa reference panel genomes where draft and identical closed (or very similar) genome data was available. Listeria and Pseudomonas have very different evolutionary lineages and genome characteristics. Analysis of clusters of orthologous groups of genes (COGs), antimicrobial resistance genes, and virulence factors in regions missing from draft genomes was performed. The ability to detect genomic islands (segments of bacterial genomes thought to be acquired by horizontal gene transfer, GIs) in draft genomes was assessed using IslandViewer. Notably, neither antimicrobial resistance genes nor virulence factors were overrepresented in missing regions (draft vs. complete genome). Transposase and tRNA genes, which are associated with GIs, were overrepresented in missing regions, potentially impacting GI prediction (P < 0.00001 for each). This analysis should be repeated for other species of particular interest, as some differences were identified between the Listeria and Pseudomonas data sets. 
In summary, there are limitations to bacterial draft genome analysis, in that certain types of genes are disproportionately missing; however, valuable information of medical interest can still be obtained from some incomplete genome datasets.
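An overrepresentation result such as the one reported for transposase and tRNA genes is commonly assessed with a hypergeometric upper-tail probability. The abstract does not say which test was used, so the following Python sketch is an assumption, and the parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total, tagged, drawn, observed):
    """P(X >= observed) for X ~ Hypergeometric(total, tagged, drawn).

    total:    all genes in the closed genome
    tagged:   genes in the category of interest (e.g. transposases)
    drawn:    genes falling in regions missing from the draft
    observed: category genes seen among the missing genes
    """
    upper = min(tagged, drawn)
    favourable = sum(
        comb(tagged, x) * comb(total - tagged, drawn - x)
        for x in range(observed, upper + 1)
    )
    return favourable / comb(total, drawn)
```

A small value indicates that the category occurs in missing regions more often than expected if missing genes were a random sample of the genome.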

Regulation of genome-wide CTCF binding by MeCP2

Download

Date: TBA
Room: TBA

Michael Levy, Western University and Children's Health Research Institute, Canada
Kristin Kernohan, Western University and Children's Health Research Institute, Canada
Yan Jiang, Western University and Children's Health Research Institute, Canada
Nathalie Berube, Western University and Children's Health Research Institute, Canada

Presentation Overview:

CTCF is a transcription factor which regulates gene expression by modulating three-dimensional genome architecture. We previously showed that CTCF localization at imprinted genes is regulated by methyl-CpG-binding protein 2 (MeCP2), which repositions nucleosomes to facilitate CTCF binding. MeCP2 is an X-linked protein mutated in Rett syndrome, a condition which affects young girls, causing severe developmental regression starting around one year of age. MeCP2 binds throughout the genome, with a preference for centromeric regions and methylated DNA, and therefore has the potential to regulate CTCF at many sites besides imprinted genes. Using ChIP-seq, we found that CTCF enrichment was increased at satellite repeats in the MeCP2-null mouse brain. These repeats also exhibited increased DNA methylation, which was unexpected because DNA methylation typically inhibits CTCF binding. We found decreased expression of satellite repeats in most mice tested, possibly due to the inhibitory nature of the DNA methylation. Preliminary analysis of published RNA-seq data from MeCP2 knockout and overexpression experiments suggests high variability in satellite repeat expression. Approximately 5000 unique CTCF peaks were lost or decreased in the MeCP2-null mice. Using ENCODE whole-genome bisulfite sequencing data, we found that changed CTCF peaks tended to be in areas of high CpG methylation. Of particular interest was the gene Foxp2 (a gene involved in speech and language development), which had decreased CTCF, increased nucleosome density, and a corresponding decrease in expression. The misregulation of Foxp2 could provide a connection between the molecular results seen here and some of the symptoms seen in Rett syndrome.

Improving Taxonomic Annotation of 16S Amplicon Sequencing Data from Environmental Samples.

Download

Date: TBA
Room: TBA

George Chlipala, University of Illinois at Chicago, United States
Ankur Naqib, University of Illinois at Chicago, United States
Daniel May, University of Illinois at Chicago, United States
Mark Maienschein-Cline, University of Illinois at Chicago, United States
Pinal Kanabar, University of Illinois at Chicago, United States
Zhengdeng Lei, University of Illinois at Chicago, United States
Vincent Hu, University of Illinois at Chicago, United States
Morris Chuckman, University of Illinois at Chicago, United States
Orjala Jimmy, University of Illinois at Chicago, United States
Stefan Green, University of Illinois at Chicago, United States
Bahroos Neil, University of Illinois at Chicago, United States

Presentation Overview:

Next-generation sequencing (NGS) of 16S amplicon libraries has become a standard technique to investigate host-associated and environmental microbial communities. The informatics workflow for this technique involves a number of computational steps, which typically include sequence pre-processing, operational taxonomic unit (OTU) clustering, and taxonomic annotation. The taxonomic annotation of OTUs is crucial for describing the types of microorganisms present in a sample and for gaining insight into their possible role and function in the system being studied. Often, some OTUs in a dataset have annotations that are unclear or incomplete, which can make it difficult to properly interpret the results of an NGS amplicon sequencing study. In this study, we explored changes to the informatics workflow that could improve the quality of taxonomic annotations. The primary focus was improving annotation of cyanobacteria through identifying and filtering uncultured references as well as providing phylum-specific enhancements for the GreenGenes reference database. We found that the enhanced database resulted in significant improvements in the quality of taxonomic annotations for cyanobacteria using sequence data obtained from environmental samples. The methods developed can be applied to other phyla and could potentially improve taxonomic annotation of 16S amplicon sequence data for a variety of environmental samples.

Biologically-Based Approach to Evaluate Classification Criteria for Chronic Childhood Arthritis

Download

Date: TBA
Room: TBA

Elham Rezaei, University of Saskatchewan, Canada
Brett Trost, University of Saskatchewan, Canada
Daniel Hogan, University of Saskatchewan, Canada
Tony Kusalik, University of Saskatchewan, Canada
Alan Rosenberg, University of Saskatchewan, Canada
The Bbop Study Consortium, University of Saskatchewan, Canada

Presentation Overview:

Background: Childhood arthritis is a heterogeneous group of diseases. Efforts have been made to establish acceptable classification criteria for the disease. The International League of Associations for Rheumatology (ILAR) defined juvenile idiopathic arthritis (JIA) and proposed seven subgroups mainly based on clinical information. ILAR classifications have limitations: clinical courses are not consistent, and they do not reliably guide treatment choices or predict treatment responses. Data mining methods provide reliable tools to extract precise information from large sets of heterogeneous data and to interrogate diverse data with the goal of improving disease classification.
Objective: To generate a biologically-based robust taxonomy for chronic childhood arthritis based on clinical and biomarker profiles.
Methods: 150 newly diagnosed, treatment-naïve children with JIA participated in the study. Data were collected at enrollment and six months later. Categorical Principal Component Analysis was used for variable reduction, and the K-means method was used for clustering, together with the partitioning around medoids (PAM) algorithm. A dissimilarity matrix was generated using DAISY. The results were compared with the ILAR subgroups. Insensitivity to data perturbation was tested using the Leave-One-Out-Variable (LOOV) method and the median test. SPSS Statistics Professional version 23, R version 3.2.2, and Circos version 0.69 were used.
Results: From 191 variables, 16 were identified to determine clusters, using variance accounted for (VAF) ≥ 70%. To optimize the number of clusters, internal and external K-means clustering validation criteria were considered. Five clusters were identified at each visit. Both visits yielded more homogeneous subgroups than the ILAR subgroups. Clusters were more homogeneous at visit 2 than at visit 1.
Conclusion: The need to re-classify JIA led us to use data-driven, unsupervised, machine learning algorithms. Distinctive patterns recognized within the data provide insight into the underlying biology of JIA, enabling us to more precisely approach childhood arthritis based on the underlying biology.

Heterogeneity decreases diversity

Download

Date: TBA
Room: TBA

Wei Pan, National Chung Cheng University, Taiwan

Presentation Overview:

In a previous study published in Scientific Reports (http://www.nature.com/articles/srep19297), genes not under selection became homozygous (that is, all individuals came to carry the same gene) faster in a panmictic population than in a population in which spreading was spatially constrained. Gene spreading is restricted to neighbors in a spatially constrained population, whereas every other individual is a potential mate in a panmictic population. Homogeneous clusters formed in spatially constrained populations, but not in panmictic populations. Within a cluster, the genes remained the same in subsequent generations because genes were exchanged among individuals that carried the same genes. This result is consistent with the observation that diversity in several small populations is higher than in a single large population. Is there a population structure in which genetic diversity is lost faster than in a panmictic population? In this study, computer simulations were performed to study the allelic ratio under Hardy-Weinberg conditions in finite populations. We compared the loss of diversity in a population structured as a network with a power-law distribution with that under panmixia, which is equivalent to a network in which all individuals are connected. The results showed that diversity is lost faster in the power-law population than in the panmictic population. That is, a population with a heterogeneous structure loses diversity faster than one with a homogeneous structure.

Facilitating Comparative Genomic Analysis of Novel Microbes: Examination of Lactobacillus crispatus in the Female Urinary Microbiome

Download

Date: TBA
Room: TBA

Laurynas Kalesinskas, Bioinformatics Program, Loyola University Chicago, United States
Travis K Price, Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL USA, United States
Evann E Hilt, Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL USA, United States
Alan J Wolfe, Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL USA, United States
Catherine Putonti, Bioinformatics Program, Loyola University Chicago, United States

Presentation Overview:

While urine was long thought to be sterile, bacterial communities have recently been discovered in the human bladder. 16S rRNA gene sequencing of the female urinary microbiota revealed a community of bacterial species within both healthy and symptomatic patients; the diversity and composition, however, vary between the two. Numerous genera and species have been found to proliferate within this microbiota and have been isolated from patients. To gain greater insight into the role of the community as a whole, we have begun to sequence the genomes of these isolated bacteria. In an effort to expedite our analysis of these strains, we have developed a software pipeline which integrates several existing tools for assembly and annotation. Annotation includes predicting CRISPR elements, RNA genes, and protein-coding genes, including their COG classifications and the presence of transmembrane helices, signal peptides, and Pfam domains. Furthermore, we have automated comparative analyses of 16S rRNA gene sequences and whole genome sequences. Thus genic content can be compared both between related taxa isolated from the bladder and between the bladder and other niches within the human microbiome.
Using this new pipeline, we have conducted analyses of several Lactobacillus crispatus genomes. While the role of L. crispatus in vaginal health is well-studied, its role in urinary health remains unclear. Herein, we identify genes unique to: strains isolated from the bladder, healthy patients, and patients displaying urinary disorders/symptoms, thus providing insight into its putative function within bladder microbiota.

Transcriptome analysis of diet-induced obesity in Zebrafish

Download

Date: TBA
Room: TBA

Chris Walsh, The University of Akron, United States

Presentation Overview:

After much investigation, the mechanisms of adipogenesis in Zebrafish are still unclear. To investigate this further, next-generation sequencing was used to examine the transcriptome of Zebrafish. My research looked at which genes were upregulated in fish that were starved compared to fish that were overfed. To compare these two groups, RNA-seq was used to find differences in transcript copy number for certain genes that are known to play a role in adipogenesis. With the knowledge of which genes are upregulated, further investigations can be done on these genes to determine how they are upregulated and what role they play in adipogenesis. My hypothesis is that pathways involved in adipogenesis will have a greater representation in the fish that are fed to obesity.
To investigate my hypothesis, two separate groups of adult Zebrafish were kept on a strict feeding regimen for three weeks. They were then euthanized, weighed, and had their livers removed. RNA extraction was done on each liver, followed by a run on an Agilent Bioanalyzer to ensure that high-integrity RNA was extracted. These samples were then sent to the Genome Technology Access Center at Washington University for RNA sequencing and analysis. This information was then used to find which genes were upregulated, and by how much.

Analyzing Bacteria within the Bladder Microbiota

Download

Date: TBA
Room: TBA

Majed Shaheen, Loyola University Chicago, United States
Travis Price, Loyola University Chicago, United States
Evann Hilt, Loyola University Chicago, United States
Krystal Thomas-White, Loyola University Chicago, United States
Alan Wolfe, Loyola University Chicago, United States
Catherine Putonti, Loyola University Chicago, United States

Presentation Overview:

Though urine was initially believed to be sterile, recent studies have revealed a very diverse microbial community within the healthy bladder. Many of the bacteria found within the female bladder have also been found within the reproductive tract including Gardnerella vaginalis and Lactobacillus crispatus. Increased incidence of G. vaginalis and decreased levels of L. crispatus are often observed within the vaginal microbiome of adult women with bacterial vaginosis. A similar trend has been observed within the bladder microbiome; higher incidence of G. vaginalis is associated with symptoms of urgency urinary incontinence. Nevertheless, the role of G. vaginalis within the bladder has yet to be definitively determined. Numerous G. vaginalis strains isolated from the bladder of female patients have been sequenced in an effort to gain greater understanding into associated virulence factors present as well as its role within the microbiota. Focusing on four individual isolates, genomes were assembled and annotated. Comparative genomics were conducted, both between the bladder strains themselves as well as between the bladder strains and genomes from isolates of the reproductive tract. Thus, we were able to identify coding regions belonging to the core genome of G. vaginalis strains of the bladder in addition to differences between it and those from the reproductive tract. Furthermore, phylogenetic trees were inferred from both the 16S rRNA gene as well as the core genome, providing insight into the evolutionary history of this genus.

The Effect of Different Gene Set Scores on the Accuracy of Gene Set Enrichment Analysis Methods

Download

Date: TBA
Room: TBA

Farhad Maleki, University of Saskatchewan, Canada
Anthony Kusalik, University of Saskatchewan, Canada

Presentation Overview:

Enrichment analysis methods have been widely used to analyze the output of high-throughput experiments, such as microarrays and RNA-seq. Given the large number of enrichment analysis methods, which differ in components such as the gene set statistic, significance assessment, and adjustment for multiple comparisons, biologists are confronted with the challenge of choosing a method for a given experiment.

In this research, we use synthesized datasets to quantitatively compare the accuracy of GSEA (Subramanian et al. 2005) to two alternatives proposed by Tian et al. (2005). The latter use a t-test or the Wilcoxon signed-rank test as alternatives to the enrichment score in GSEA. The comparison uses MSigDB Version 5.1 as the gene set database, the signal-to-noise ratio as the per gene statistic, and sample permutation as the measure of significance of gene set scores.

The results show that GSEA has the highest accuracy. Using the t-test instead of the enrichment score in GSEA significantly decreases the sensitivity. This can be attributed to the inability of the t-test to detect differential enrichment of a gene set when about half of the set’s genes are up-regulated and the rest are down-regulated. In addition, the Wilcoxon signed-rank test generates the highest number of false negatives. This can be attributed to the fact that the median is a robust statistic with a breakdown point of 50%; therefore, as a gene set score, it requires up-regulation (or down-regulation) of at least 50% of the genes within a gene set to detect the set as differentially enriched.
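The cancellation effect described above can be illustrated with a minimal sketch. It assumes the common definition of the per-gene signal-to-noise ratio (difference of group means divided by the sum of group standard deviations) and uses a one-sample t statistic on per-gene scores as one plausible form of a t-test gene set score; this is not necessarily the exact formulation compared in the study, and all numbers are hypothetical.

```python
import math
import statistics


def signal_to_noise(group_a, group_b):
    """Per-gene signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b)."""
    return (statistics.mean(group_a) - statistics.mean(group_b)) / (
        statistics.stdev(group_a) + statistics.stdev(group_b)
    )


def one_sample_t(scores):
    """One-sample t statistic testing whether the mean per-gene score is 0."""
    n = len(scores)
    return statistics.mean(scores) / (statistics.stdev(scores) / math.sqrt(n))


# Hypothetical expression values of one gene in two phenotype groups.
snr_example = signal_to_noise([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])

# Hypothetical per-gene scores for two gene sets of six genes each.
coherent_set = [2.1, 1.8, 2.4, 1.9, 2.2, 2.0]   # all genes up-regulated
mixed_set = [2.1, 1.8, 2.4, -1.9, -2.2, -2.2]   # half up, half down

t_coherent = one_sample_t(coherent_set)  # large: the set is detected
t_mixed = one_sample_t(mixed_set)        # near zero: the signals cancel
```

In this toy case every gene in the mixed set is strongly regulated, yet the mean-based score is close to zero, which is the loss of sensitivity the abstract attributes to the t-test.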

Computational analysis of target specificity of Drosophila double-stranded RNA-binding protein Staufen

Download

Date: TBA
Room: TBA

Kun Nie, Terrence Donnelly Centre for Cellular and Biomolecular Research, Banting and Best Department of Medical Research, Toronto, Canada; University of Toronto, Department of Molecular Genetics, Toronto, Canada
Quaid Morris, Terrence Donnelly Centre for Cellular and Biomolecular Research, Banting and Best Department of Medical Research, Toronto, Canada; University of Toronto, Department of Molecular Genetics, Toronto, Canada

Presentation Overview:

RNA-binding proteins (RBPs) are among the most important regulators of eukaryotic gene expression. They are involved in multiple co- and post-transcriptional processes and are implicated in human diseases, including cancer and genetic and neurological disorders.

In a complex RNA structure, a large fraction of bases engage in non-canonical base pairing, which can create binding sites for proteins and form motifs that mediate long-range RNA-RNA interactions (1). MC-Flashfold is a fast version of MC-Fold (2), which folds RNA while considering non-canonical as well as Watson-Crick base pairing.

Drosophila Staufen is an evolutionarily conserved double-stranded RBP (dsRBP) that is essential for the function of multiple mRNAs, including bicoid. A previous genome-wide analysis of Staufen target mRNAs from our lab suggested that Drosophila Staufen recognizes 19 bp and 12 bp dsRNA stems with few unbalanced bases in the 3’UTRs of its target mRNAs (3). Using MC-Flashfold, we identified a new Staufen-recognized structure (SRS) in the bicoid 3’UTR that was previously undetected. This site also corresponds to one of the mRNA conformations required for Staufen interaction, as described in a previous in vivo study by Ferrandon et al. (4). Our new analysis also suggests that incorporating non-canonical interaction information improves the performance of SRS prediction. Thus, our study provides insights into the binding specificities of dsRBPs as well as post-transcriptional regulation in Drosophila.

References

1. Leontis, N. B., Stombaugh, J. & Westhof, E. The non‐Watson–Crick base pairs and their associated isostericity matrices. Nucleic Acids Res. 30, 3497–3531 (2002).
2. Parisien, M. & Major, F. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 452, 51–55 (2008).
3. Laver, J. D. et al. Genome-wide analysis of Staufen-associated mRNAs identifies secondary structures that confer target specificity. Nucleic Acids Res. 41, 9438–9460 (2013).
4. Ferrandon, D., Koch, I., Westhof, E. & Nüsslein-Volhard, C. RNA-RNA interaction is required for the formation of specific bicoid mRNA 3’ UTR-STAUFEN ribonucleoprotein particles. EMBO J. 16, 1751–1758 (1997).

Transcriptome analysis of developing lens reveals hundreds of novel transcripts and abundance of skipped exon as well as retained intronic splicing events

Download

Date: TBA
Room: TBA

Rajneesh Srivastava, Indiana University Purdue University Indianapolis (IUPUI), United States
Gungor Budak, Indiana University Purdue University Indianapolis (IUPUI), United States
Sarath Chandra Janga, Indiana University Purdue University Indianapolis (IUPUI), United States

Presentation Overview:

Lens development employs a complex and highly orchestrated regulatory program with several specification and differentiation processes. However, the complete lens transcriptome and its various isoforms across developmental stages are not fully characterized. In this study, we investigated the transcriptomic alterations and splicing events across various developmental stages during lens formation, constructing a molecular portrait of known and novel transcript isoforms in the mouse lens. We analyzed multiple publicly available RNA-seq datasets corresponding to raw RNA sequence reads of mouse lens from different developmental stages (E15, E15.5, E18, P0, P3, P6 and P9) to investigate the expression levels of known and novel transcripts as well as alterations in splicing events. We observed that the number of novel transcripts decreases significantly in post-natal stages. We classified novel transcripts as partially novel (novelty score <70%) or completely novel (novelty score >70%) and found that novel transcripts with a phastCons conservation score >80% are highly expressed across multiple developmental stages. We also observed that they are likely to be functional in processes relevant to lens development, such as neural system development, structural morphogenesis, protein localization, cell division and differentiation. Functional analysis of the most abundant alternative splicing events, which included skipped exon and retained intron events, revealed enrichment of mRNA processing, apoptotic signaling pathways, protein polymerization, cell development and differentiation. We found several genes with skipped exon events, including Cdk4, Cryba1, Lim2, Meis1, Pax6, and Smarca4, to be significantly associated with lens developmental processes. Genes with retained intron events were found to be significantly associated with the cell cycle.
Results of our alternative splicing event predictions are also made available through Eye Splicer, a web-based splicing browser showing developmentally altered splicing events in mouse lens. Our analysis provides a high-resolution architecture of the mouse lens transcriptome and a one-stop portal for furthering our understanding of the splicing alterations during lens development.

Visual evolution in marine derived Amazonian fishes

Download

Date: TBA
Room: TBA

Alexander Van Nynatten, University of Toronto, Canada
Nathan R Lovejoy, University of Toronto Scarborough, Canada
Belinda Sw Chang, University of Toronto, Canada

Presentation Overview:

Visual pigments, composed of a light-sensitive chromophore and an opsin protein moiety, mediate the initial step in the visual transduction cascade. Interactions between the chromophore and the pocket formed by residues in the opsin protein determine the specific spectral sensitivity of the visual pigment. Convergent substitutions in these pocket residues have been observed in deep-sea fishes, blue-shifting the spectral sensitivity of visual pigments towards the predominantly blue wavelengths of light illuminating depths greater than 200 m. In contrast with marine environments, large freshwater rivers are red-shifted and significantly dimmer due to increased turbidity. The visual pigments of freshwater fishes are often observed to have red-shifted spectral sensitivities, but the molecular mechanisms determining these differences have not yet been studied. Using likelihood-based comparative sequence analyses, we found evidence for positive selection in the dim-light sensitive visual pigment, rhodopsin, of freshwater fishes with closely related marine ancestors. The non-synonymous substitutions implicated in these analyses include residues known to be important for aspects of rhodopsin function, including spectral tuning. Ancestral reconstructions of amino acid states identify differences between marine and freshwater lineages at known red-shifting sites, tuning the sensitivity of rhodopsin towards the wavelengths of light most common in freshwater rivers. Our results suggest that an increased rate of rhodopsin evolution was driven by diversification into freshwater habitats, thereby constituting a rare example of molecular evolution mirroring large-scale palaeogeographical events.

Heterogeneity Analysis of Gene Fusion in Glioblastoma

Download

Date: TBA
Room: TBA

Morris Chukhman, University of Illinois at Chicago, United States
Mark Maienschein-Cline, University of Illinois at Chicago, United States
Pinal Kanabar, University of Illinois at Chicago, United States
Cong Liu, University of Illinois at Chicago, United States
George Chlipala, University of Illinois at Chicago, United States
Zhengdeng Lei, University of Illinois at Chicago, United States
Vincent Hu, University of Illinois at Chicago, United States
Hui Lu, University of Illinois at Chicago, United States
Neil Bahroos, University of Illinois at Chicago, United States

Presentation Overview:

Gene fusions have been shown to play driving roles in several cancers, and their protein products have been successfully targeted therapeutically. However, identifying driver variants in a cancer cell line with a high degree of intratumor heterogeneity can be difficult using traditional genomic analysis methods, especially when the genomic lesions are rarely expressed. To address these challenges, we perform a fusion detection analysis on a single-cell RNA-seq dataset consisting of 430 cells from five primary glioblastomas, quantifying the copy numbers of all detected gene fusions and their intact parent genes in each of the isolated cancer cells.

Gene set enrichment analysis of leptinA (lepA) in the embryonic

Download

Date: TBA
Room: TBA

Matthew Tuttle, The University of Akron, United States
Mark Dalman, The University of Akron, United States
Dirk Bullock, The University of Akron, United States
Cory Boveington, The University of Akron, United States
Zhong-Hui Duan, The University of Akron, United States
Qin Liu, The University of Akron, United States
Richard Londraville, The University of Akron, United States

Presentation Overview:

Investigating the evolution of vision in Neotropical bats through comparative transcriptomics

Download

Date: TBA
Room: TBA

Eduardo de Almeida Gutierrez, University of Toronto, Canada
Belinda Chang, University of Toronto, Canada

Presentation Overview:

Bats constitute a fascinating model in which to study how sensory systems evolved and adapted to explore the nocturnal environment. The chiropteran visual system operates well under low light intensities and provides crucial input for several aspects of bat biology, which makes it particularly amenable to evolutionary studies. However, different degrees of visual capability are observed in this group, and they appear to vary in response to distinct echolocation abilities. While more visual species have larger eyes and are less reliant on echolocation, less visual species tend to have smaller eyes and exhibit sophisticated echolocation calls. Differences in visual capabilities are also associated with distinct feeding ecologies. Species that forage close to vegetation rely more on visual information than species that forage in open air. This is especially applicable to Neotropical lineages, which exhibit diverse sensory ecology and diet specialization, and adopt contrasting foraging strategies to locate and access food. Using representatives of the Neotropical endemic families Mormoopidae and Phyllostomidae, we have employed next-generation sequencing techniques in a phylogenetic framework to investigate how bats evolved varying visual capabilities in response to different ecological constraints. Eye samples of six species were collected in the field and stored in RNAlater® solution prior to snap freezing in liquid nitrogen. Total RNA was extracted and used to construct cDNA libraries for sequencing on an Illumina HiSeq platform. High-quality reads were assembled de novo into transcriptomes for each species, against which reads were mapped to estimate the abundance of genes expressed in the eye. Our preliminary results and discussions provide some insight into the distinct constraints underlying the evolution of vision in bats, particularly the extent to which patterns of gene expression can be linked to diverse sensory ecologies in Neotropical species.

Structure-based prediction of homeodomain binding specificity using homology models and an integrative energy function.

Download

Date: TBA
Room: TBA

Alvin Farrel, University of North Carolina at Charlotte, United States
Jun-Tao Guo, University of North Carolina at Charlotte, United States

Presentation Overview:

Transcription factors (TFs) are essential to the regulation of gene expression through binding to specific target DNA sites. Structure-based methods for studying TF-DNA interactions can help us annotate TF-binding sites (TFBS) at genome scale, better understand the effects of mutations in transcription factors and target sites, and facilitate structure-based drug design. Structure-based TFBS prediction algorithms require high-resolution TF-DNA complex structures. Despite advances in structure determination methods, solving the structures of protein-DNA complexes remains a difficult task, and there are a limited number of TF-DNA complex structures in the Protein Data Bank (PDB). Therefore, there is a clear need for modeling protein-DNA complex structures to extend the applicability of structure-based TFBS prediction. Here we describe a method of generating TF-DNA complex models by combining TF homology models and DNA structures from homologous complex templates to increase the coverage of TF-DNA conformations. TF-DNA interface features are used to determine the top complex models. The top models are used for structure-based TFBS prediction using an integrative energy function. The integrative energy function combines a residue-level statistical potential with two atomic terms: hydrogen bond energy between protein residues and DNA bases, and electrostatic energy between aromatic residues and DNA bases involved in π stacking interactions. Our approach improves model selection and consequently TFBS prediction accuracy when applied to homeodomains.

Predicting the functional indispensability of non-coding elements

Download

Date: TBA
Room: TBA

Tawny Cuykendall, Weill Cornell Medicine, United States
In Sub Mark Han, Weill Cornell Medicine, United States
Eric Minwei Liu, Weill Cornell Medicine, United States
Ekta Khurana, Weill Cornell Medicine, United States

Presentation Overview:

Identification and prioritization of variants in cancer and other diseases that have causal links with disease remains challenging. Several computational tools are available to functionally annotate and score variants, enabling identification of high impact variants. In addition, knowledge of the functional indispensability of genes is extremely useful for prioritizing candidates for validation. Numerous studies of protein-coding regions have resulted in the functional classification of many genes. By identifying distinguishing properties of essential and loss-of-function (LoF) tolerant genes, such as degree centrality in various networks and the number of networks a gene is involved in, computational models to predict functional indispensability scores for all genes have been developed. However, the majority of variants identified in disease studies are located in non-coding regions and the functional indispensability of these regions remains largely unknown.

We have characterized a subset of nearly 300,000 ENCODE regulatory elements as essential or LoF-tolerant. Because essential genes are frequently mutated in cancer, we applied a novel computational method, CompositeDriver, to ~1500 whole-genome sequences (WGS) spanning more than 20 tumor types, to identify elements containing an excess of high impact somatic variants, which we defined as essential. We defined LoF-tolerant elements as those that contain at least one variant that is predicted to be highly functional by our computational pipeline FunSeq and is homozygous in at least one individual in the 1000 Genomes project, which comprises more than 2500 WGS from 26 different populations. We examined several features of regulatory elements, such as the number of genes the element regulates, the number of tissues the element is active in, and the functional indispensability scores of the target genes, and found that essential and LoF-tolerant elements can be distinguished on the basis of multiple features. We are currently using these features, among others, to build a regression model to predict the functional indispensability of all regulatory elements in the genome.

Developing a virtual high-throughput screening pipeline for faster drug discovery: Cathepsin-L lead identification and confirmation

Download

Date: TBA
Room: TBA

Jonathan Chen, University of Akron, United States
Donald Visco, University of Akron, United States

Presentation Overview:

Drug discovery through conventional high-throughput screening experiments is highly inefficient. Tens if not hundreds of thousands of compounds are tested, with the majority of time and resources spent testing compounds that are inactive. Only a few compounds are identified as possible drug leads, yielding a hit-rate of only a few percent at best. A more efficient process of identifying probable drug leads is needed.

Conventional high-throughput screening methods generate large amounts of target/compound interaction data. Computational methods, capable of leveraging the computational power and data available, are an appealing solution that can focus screening efforts and increase experimental efficiency. These methods include using mathematical models, docking simulations, and search algorithms to find likely candidates in a compound database.

Previously, we described an iterative pipeline that we developed, in which Genetic Algorithms are used to stochastically create, optimize, and select Support Vector Machines with high predictive power. We can then apply these optimized models to screen compound databases for probable drug leads. As a case study, we applied our pipeline to identify Cathepsin-L receptor inhibitors. Cathepsin-L is a receptor implicated in viral disease pathways, including diseases such as Ebola and malaria. We identified 16 compounds and verified that 3 were active, for a hit-rate of 18.75%. We retrained our models by including the new experimental information. With the retrained models, we identified 12 compounds and verified that 9 were active, for a hit-rate of 75%.
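The hit-rates quoted above follow directly from the reported counts. A minimal sketch of the arithmetic is given below; the function name `hit_rate` is an illustrative choice and not part of the authors' pipeline.

```python
def hit_rate(n_active, n_tested):
    """Percentage of tested compounds that were confirmed active."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return 100.0 * n_active / n_tested


# Counts taken from the abstract.
first_round = hit_rate(3, 16)    # initial models
second_round = hit_rate(9, 12)   # models retrained with new experimental data

# Ratio of the two hit-rates (how much retraining improved the yield).
fold_improvement = second_round / first_round
```

With these counts, retraining raised the hit-rate fourfold while also reducing the number of compounds that had to be tested.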

TAIGA: A computational framework for predicting genotype to phenotype associations

Download

Date: TBA
Room: TBA

Arnab Saha Mandal, University of Calgary, Canada
Aaron Mathankeri, University of Calgary, Canada
A.P. Jason De Koning, University of Calgary, Canada

Presentation Overview:

Identifying causal links between patient genotypes and disease outcomes is a central challenge in medical genomics. Machine learning classifiers that integrate variant properties such as evolutionary conservation and population frequency have become increasingly important for providing informed predictions about the likely deleteriousness of genetic variants. We have developed an integrated machine learning classifier, TAIGA (Transformation and Integration of Genomic Annotations), which implements a rotation forest algorithm for missense variant classification. TAIGA integrates the best variant-level predictors with data on known genotype-phenotype associations over a curated subset of pathogenic and neutral missense variants. Upon cross-validation, TAIGA outperforms the most widely used existing variant-prediction methods, and further inclusion of gene-phenotype-specific data enhances its sensitivity and specificity. We believe that TAIGA represents the best computational framework for predicting associations of variants with multiple phenotypes and will be useful for elucidating novel causal variants and genetic pathways in complex diseases that are not yet fully understood.

NOTUNG 2.8: A Reconciliation Engine for Phylogenomics

Download

Date: TBA
Room: TBA

Han Lai, Carnegie Mellon University, United States
Minli Xu, Carnegie Mellon University, United States
Maureen Stolzer, Carnegie Mellon University, United States
Rosanna Alderson, Carnegie Mellon University, United States
Dannie Durand, Carnegie Mellon University, United States

Presentation Overview:

Reconciliation of a gene tree with a species tree infers the association between ancestral genes and ancestral species, the history of gene duplication, loss, and horizontal transfer in the evolution of the gene family, gene copy number in ancestral species, and the timing of gene family expansions and contractions relative to speciation. This provides a framework for investigating the emergence of novel protein function in the context of species evolution. The availability of whole genome data makes it possible to ask more comprehensive questions. By extending reconciliation to a set of gene families, we can probe the evolution of an entire biological system.

Notung is a reconciliation engine that handles both binary and non-binary gene and species trees and can root, rearrange, or resolve a gene tree using duplication-loss parsimony. Notung's novel phylogenomics feature is specifically designed to facilitate simultaneous analysis of a comprehensive set of gene families from a selected set of species. It automatically aggregates results across all reconciled gene trees and all species tree nodes, generating reports summarizing the inferred events, gene copy number, and gene family expansions and contractions associated with each ancestral species. These innovations significantly simplify the analyses required for exploring whole genome duplication, genome streamlining, co-evolution of gene families, and bursts of horizontal gene transfer between species.

Notung offers diverse tools in a unified framework that supports a pipeline of linked analyses and makes results from different tools intuitive and comparable. The software also provides a user friendly GUI for users to interact with a single gene tree and explore multiple solutions and roots. The software is under continuous development and can be downloaded at: http://www.cs.cmu.edu/~durand/Notung/.
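As an illustration of what reconciliation computes, the following is a minimal sketch of the classic lowest-common-ancestor (LCA) mapping for binary trees, which labels a gene-tree node as a duplication when it maps to the same species node as one of its children. It is not Notung's implementation: it ignores losses, transfers, non-binary trees, and rooting, and the species names, gene names, and tree encodings are hypothetical.

```python
# Species tree encoded as child -> parent (hypothetical example).
species_parent = {
    "human": "primates",
    "chimp": "primates",
    "primates": "mammals",
    "mouse": "mammals",
    "mammals": None,
}


def ancestors(node, parent):
    """Return the path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_tree, gene_to_species, parent, duplications):
    """Map a gene-tree node to a species node and record duplications.

    gene_tree is either a gene name (leaf) or a pair (left, right).
    """
    if isinstance(gene_tree, str):
        return gene_to_species[gene_tree]
    left, right = gene_tree
    m_left = reconcile(left, gene_to_species, parent, duplications)
    m_right = reconcile(right, gene_to_species, parent, duplications)
    m = lca(m_left, m_right, parent)
    if m == m_left or m == m_right:
        # A node mapping to the same species as a child implies a duplication.
        duplications.append((gene_tree, m))
    return m


gene_to_species = {"h1": "human", "c1": "chimp", "h2": "human", "m1": "mouse"}
gene_tree = (("h1", "c1"), ("h2", "m1"))

duplications = []
root_species = reconcile(gene_tree, gene_to_species, species_parent, duplications)
```

In this toy example the two subtrees each contain a human gene, so the root of the gene tree is inferred to be a duplication in the common ancestor labeled "mammals".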

Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet

Download

Date: TBA
Room: TBA

Coby Viner, University of Toronto, Canada
James Johnson, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Nicolas Walker, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Hui Shi, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Marcela Sjoberg, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
David J. Adams, Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, England, United Kingdom
Anne C. Ferguson-Smith, Department of Genetics, University of Cambridge, Cambridge, England, United Kingdom
Timothy L. Bailey, Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
Michael M. Hoffman, Princess Margaret Cancer Centre/University of Toronto, Canada

Presentation Overview:

Cultivation of Iron Reducing Bacteria from Iron Ore Caves

Download

Date: TBA
Room: TBA

Ceth Parker, University of Akron, United States
John Senko, University of Akron, United States
Ira Sasowsky, University of Akron, United States
Hazel Barton, University of Akron, United States
Augusto Auler, Instituto do Carste (Institute of Karst), Brazil

Presentation Overview:

The iron mining regions of Brazil contain thousands of “iron ore caves” (IOCs) that form within Fe(III)-rich deposits. The mechanisms by which these IOCs form remain unclear, but the reductive dissolution of the host Fe(III)-rich rock by Fe(III)-reducing bacteria (FeRB) could provide a microbiological mechanism for their formation. The IOCs contain large amounts of organic carbon, which supports a considerable microbial ecosystem with cell counts as high as 2.29 × 10^9 cells per gram of sediment. Previous sequencing of Fe(III)-reducing enrichment cultures determined that the cultures were dominated by Firmicutes. Using Illumina sequencing of environmental samples, we show that the Firmicutes do not represent the dominant FeRB in these caves. Analysis of 16S rRNA gene sequences indicates a dominance by members of the Chloroflexi, Acidobacteria and subdivisions of the Alpha-, Beta- and Gamma-proteobacteria, all of which have member species capable of Fe(III) reduction. Using comparative principal coordinates analysis (PCoA) and unweighted UniFrac techniques, we demonstrate that enrichment culture community compositions shifted away from the original environmental structure. Our enrichment media used a complex carbon (glucose) as a nutrient source, which promoted a specific subset of the FeRB: a group of Firmicutes capable of coupling fermentative metabolism to Fe(III) reduction. We anticipate that although there is an abundance of organic carbon in the IOC systems, glucose is not a large component of it, and that, despite the dominance of non-traditional FeRB in culture, more traditional dissimilatory FeRB are responsible for the bulk of Fe(III) reduction seen in IOCs.

SomVarIUS: Somatic variant identification from unpaired tissue samples

Download

Date: TBA
Room: TBA

Kyle Smith, University of Colorado, United States
Vinod Yadav, University of Colorado, United States
Shanshan Pei, University of Colorado, United States
Dan Pollyea, University of Colorado, United States
Craig Jordan, University of Colorado, United States
Subhajyoti De, Rutgers Cancer Institute of New Jersey, United States

Presentation Overview:

Somatic variant calling typically requires paired tumor-normal tissue samples. Yet, paired normal tissues are not always available in clinical settings or for archival samples. We present SomVarIUS, a computational method for detecting somatic variants using high throughput sequencing data from unpaired tissue samples. We evaluate the performance of the method using genomic data from synthetic and real tumor samples. SomVarIUS identifies somatic variants in exome-seq data of ~150X coverage with at least 67.7% precision and 64.6% recall rates, when compared with paired-tissue somatic variant calls in real tumor samples. We demonstrate utility of SomVarIUS by identifying somatic mutations in formalin-fixed samples, and tracking clonal dynamics of oncogenic mutations in targeted deep sequencing data from pre- and post-treatment leukemia samples.
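The precision and recall figures above compare unpaired calls against paired-tissue calls. A minimal sketch of that comparison is shown below, treating each variant as a (chromosome, position, reference, alternate) tuple; the variants listed are hypothetical and the function is not part of SomVarIUS.

```python
def precision_recall(called, truth):
    """Precision and recall of a call set against a reference call set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical somatic calls from the paired tumor-normal analysis (reference).
paired_calls = [
    ("chr1", 1000, "A", "T"),
    ("chr2", 2000, "G", "C"),
    ("chr3", 3000, "C", "T"),
    ("chr4", 4000, "T", "G"),
    ("chr5", 5000, "G", "A"),
]

# Hypothetical calls from the unpaired analysis: three match, one does not.
unpaired_calls = [
    ("chr1", 1000, "A", "T"),
    ("chr2", 2000, "G", "C"),
    ("chr3", 3000, "C", "T"),
    ("chr9", 9000, "A", "G"),
]

precision, recall = precision_recall(unpaired_calls, paired_calls)
```

Here three of four unpaired calls are confirmed (precision 0.75) and three of five reference variants are recovered (recall 0.6).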

Implementation and availability: SomVarIUS is written in Python 2.7 and available at http://www.sjdlab.org/resources/

Reference:
Smith KS, Yadav VK, Pei S, Pollyea DA, Jordan CT, De S. (2016) SomVarIUS: Somatic variant identification from unpaired tissue samples. Bioinformatics. [Epub ahead of print] PMID: 26589277.

Estimation of a Regulatory Network of Cooperation Response Genes in a Model of Cancer Malignancy

Download

Date: TBA
Room: TBA

Matthew Mccall, University of Rochester, United States
Helene McMurray, University of Rochester, United States
Anthony Almudevar, University of Rochester, United States
Hartmut Land, University of Rochester, United States

Presentation Overview:

Advances in genomic technology have led to the discovery of numerous genes whose expression differs between cellular conditions; however, genes do not act in isolation, rather they act together in complex networks that drive cellular function. By considering the interactions between genes (and gene products), one gains a more in-depth understanding of the underlying cellular mechanisms. Estimation of these gene regulatory networks is necessary to understand cellular mechanisms, detect differences between cell types, and predict cellular response to interventions. Cancer progression has been shown to produce drastic changes in genetic networks critical to normal cellular function. Some oncogenic mutations produce self-sustaining alterations in the network structure such that removal of the original mutation does not restore normal cellular function. This suggests that identifying the original oncogenic mutation may not be sufficient for a targeted intervention; rather, a detailed understanding of the gene regulatory networks present in both normal and malignant cells may be necessary.

Gene perturbation experiments are the primary tool to investigate gene regulatory networks and predict cellular response to interventions. However, given the numerous sources of variability and technical biases inherent in genomic technologies, network estimation algorithms are often unable to accurately reconstruct gene networks. To address this challenge, we propose an approach to network estimation that explicitly models and incorporates uncertainty in each step of the analysis. Instead of attempting to infer a single "best" network, we report a posterior density on the network space that directly conveys the uncertainty in the inferred network structure. Quantifying the uncertainty in specific network features allows researchers to determine highly-probable features and areas in which additional information is needed, thereby guiding future experimentation.
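One way to read "uncertainty in specific network features" is as the marginal posterior probability of each edge. The sketch below assumes the posterior is represented by a list of candidate networks with unnormalized weights; this representation, the gene names, and the weights are hypothetical and are not taken from the authors' model.

```python
from collections import defaultdict


def edge_posteriors(weighted_networks):
    """Marginal posterior probability of each directed edge.

    weighted_networks is a list of (edges, weight) pairs, where edges is an
    iterable of (regulator, target) tuples and weight is proportional to the
    posterior probability of that network.
    """
    total = sum(weight for _, weight in weighted_networks)
    probabilities = defaultdict(float)
    for edges, weight in weighted_networks:
        for edge in set(edges):
            probabilities[edge] += weight / total
    return dict(probabilities)


# Three hypothetical candidate networks over genes A, B and C.
candidates = [
    ([("A", "B"), ("B", "C")], 0.5),
    ([("A", "B"), ("A", "C")], 0.3),
    ([("A", "B"), ("B", "C")], 0.2),
]

posteriors = edge_posteriors(candidates)
```

Edges with probability near 1 (here A to B) are well supported, whereas intermediate values (B to C at 0.7, A to C at 0.3) flag features where additional experiments would be most informative.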

We have applied the proposed model to a network of cooperation response genes (CRGs), which respond synergistically to loss-of-function p53 and Ras activation. CRGs have been shown to play a crucial role in tumor formation independent of the initiating mutations, and many CRGs are essential components of the cellular machinery involved in maintaining malignancy. Finally, there is evidence of a robust regulatory structure that governs patterns of CRG co-expression. Linking phenotypic variables (e.g. tumor growth), experimental perturbation of CRGs (via retrovirus-mediated re-expression of corresponding cDNAs or shRNA-dependent stable knock-down), and features of the CRG gene regulatory network provides a glimpse into the complex multi-gene regulatory relationships that are crucial to the malignant state. Ongoing examination of the CRG network architecture has the potential to uncover specific vulnerabilities of the cancer cell and, ultimately, to guide multi-target interventions.

A NOVEL QUANTITATIVE INSILICO METHOD TO DELINEATE THE MECHANISM OF ENZYME SPECIFICITY- PROVIDE INSIGHTS FOR ENZYME ENGINEERING

Download

Date: TBA
Room: TBA

Pravin Kumar, Polyclone Bioservices, India
Anurag Kumar, Polyclone Bioservices, India
Devashish Das, Polyclone Bioservices, India
Naveen Kulkarni, Polyclone Bioservices, India

Presentation Overview:

Enzymes show different degrees of specificity towards substrates. Enzymes may be specific to a particular bond, a chemical group, the whole substrate, or even an optical configuration (1). Mechanisms of selectivity are difficult to understand, and computational approaches are the preferred means of delineating them. Among the computational methods, MD simulations and QM/MM point to a few crucial amino acids in the active site that underlie this phenomenon (2). However, these investigations are localized, and predictions are biased toward the location where the substrate is changed. We have developed a novel method that quantitatively weights atomic interactions in conjunction with high-throughput binding free energy calculation of the enzyme-substrate (E-S) complex derived from MD simulations. The MM/PBSA method was used to estimate interaction free energies using $\Delta G_{\text{binding}} = G_{\text{complex}} - (G_{\text{protein}} + G_{\text{ligand}})$ (eq. 1), and this was decomposed on a per-residue basis (PRB). At every 2 picoseconds of the simulations, contact scores (CS) were calculated by summing the contributions of the pairwise interactions between the ith atom of the substrate and the jth atom of the surrounding residues (eq. 2), wherein $R_{\text{vdW}}$ is the vdW radius, $D$ is the distance between the atoms and $A$ is the allowance for potentially hydrogen-bonded pairs. Finally, CS and PRB were log-transformed and summed to derive the per-residue weights (PRW), $\text{PRW} = \log \text{CS} + \log \text{PRB}$ (eq. 3). To test this method, we generated the Michaelis complex of Penicillin G acylase (PGA) and Penicillin G (the native substrate) using the X-ray structure of PGA complexed with a substrate for which the enzyme shows hydrolysis (PGSO). MD simulations were launched for these two Michaelis complexes and extended to 120 nanoseconds, and PRW was calculated using eq. 3 for the native and the slow reaction. The analysis shows that the amide bond of PenG was stabilized by βS1, βT68, βQ23 and βA69, whereas the same bond in PGSO is stabilized only by βS1 and βA69.
βQ23, which is involved in stabilizing the tetrahedral intermediate by forming part of the oxyanion hole in the native reaction, shows unfavorable binding in the PGSO simulation. The derived quantitative weights reveal that enzyme specificity is a cumulative effect exerted by many amino acids in and beyond the active site of the enzyme. In this work we will present results that portray the precise mechanism of substrate selectivity and pinpoint residues in the enzyme that can be used for designing enzymes with better specificity.
References
1. Berg JM, Tymoczko JL, Stryer L. Biochemistry. 6th ed. New York: W.H. Freeman and Company; 2007. p. 206.
2. Gao J, Ma S, Major DT, Nam K, Pu J, Truhlar DG. Mechanisms and free energies of enzymatic reactions. Chem Rev. 2006 Aug;106(8):3188-209.
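
The two formulas that the abstract above states explicitly (eq. 1 and eq. 3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the numbers are hypothetical, the logarithm base is taken to be 10, and logs are taken of magnitudes because a per-residue energy term can be negative; the abstract specifies none of these details. Eq. 2 (the contact score itself) is not reproduced in the abstract and is therefore treated as an input.

```python
import math


def binding_free_energy(g_complex, g_protein, g_ligand):
    """Eq. 1: dG_binding = G_complex - (G_protein + G_ligand)."""
    return g_complex - (g_protein + g_ligand)


def per_residue_weight(contact_score, per_residue_energy):
    """Eq. 3: PRW = log(CS) + log(PRB).

    Assumption: base-10 logs of the magnitudes are used, because the
    abstract does not say how negative energies or the base are handled.
    """
    return math.log10(abs(contact_score)) + math.log10(abs(per_residue_energy))


# Hypothetical numbers for illustration only (not values from the study).
dg = binding_free_energy(g_complex=-5230.0, g_protein=-5100.0, g_ligand=-95.0)
prw = per_residue_weight(contact_score=100.0, per_residue_energy=-10.0)
print(dg, prw)
```

In a real analysis these functions would be applied per residue and per trajectory frame, with CS and PRB taken from the simulation output.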

Gene-specific RNA elongation rates are different depending on the treatment triggering gene expression changes

Date: TBA
Room: TBA

Joachim von Wulffen, University of Stuttgart, Institute for System Dynamics, Germany
Andreas Ulmer, University of Stuttgart, Institute for System Dynamics, Germany
Oliver Sawodny, University of Stuttgart, Institute for System Dynamics, Germany
Ronny Feuer, University of Stuttgart, Institute for System Dynamics, Germany

Presentation Overview:

RNA elongation rates are a fundamental quantity in biology. Elongation rates limit the availability of newly expressed genes and determine the dead time after which a newly synthesized protein can be complete at the earliest.
Several methods have been developed to determine bacterial RNA elongation rates, most of which involve poisoning the cells with rifampicin, an antibiotic that inhibits initiation of transcription. One such method traces transcript degradation after poisoning using RNA sequencing. This method yields short reads that can be aligned to specific positions within the genome. By splitting the sequence of individual mRNAs into equally sized features, termed bins, the dead time after which degradation starts can be determined and traced along the sequence of the mRNA. Finally, from the increase of dead time along the mRNA, the elongation rate of the last bound RNA polymerase can be deduced for individual mRNA species.
In our set-up, we did not use rifampicin to induce expression changes by poisoning, but exposed an anaerobic Escherichia coli culture to oxygen and applied a rapid sampling technique. This treatment generates both up- and downregulation of genes, which can be followed by RNA sequencing. By applying this technique, we encountered several differences compared to the elongation rates determined with rifampicin. In particular, we found that several genes that are upregulated in our setting exhibit higher elongation rates than in previous measurements. This effect might be ascribed to cotranscriptional translation, where ribosomes on the nascent RNA prevent proofreading by the RNA polymerase.
This study shows, to our knowledge for the first time, individual elongation rates in a wild-type E. coli strain under largely natural conditions.
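
The binning idea described above, where the dead time grows along the mRNA and the elongation rate follows from that growth, can be illustrated with a straight-line fit. The sketch below is not the authors' pipeline: the function names, the bin layout and the synthetic data are assumptions made for illustration, and it simply assumes that dead time increases linearly with bin position, so that the rate is the inverse of the fitted slope.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def elongation_rate(bin_positions_nt, dead_times_s):
    """Estimate an elongation rate (nt/s) as the inverse slope of dead time vs. position."""
    slope, _intercept = fit_line(bin_positions_nt, dead_times_s)
    if slope <= 0:
        raise ValueError("dead time does not increase along the transcript")
    return 1.0 / slope


# Synthetic example: bin centres every 100 nt, 30 s initial delay, true rate 40 nt/s.
positions = [50.0 + 100.0 * i for i in range(10)]
dead_times = [30.0 + p / 40.0 for p in positions]
rate = elongation_rate(positions, dead_times)
print(round(rate, 3))
```

With measured data the dead times would first have to be estimated per bin from the read-count time courses, which is the harder part and is not shown here.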

miRNome profiling in human osteoarthritis: complexity of 3’ modifications

Date: TBA
Room: TBA

Helen Piontkivska, Kent State University, United States
Abdul Haseeb, NEOMED, United States
Mohammad Makki An, NEOMED, United States
Tariq Haqqi, NEOMED, United States

Presentation Overview:

Osteoarthritis (OA) is the most common disease of the articulating joints and manifests as pain, tenderness and limitation of movement. The pathogenesis of OA is complex and includes an inflammatory component. miRNAs are a class of small non-coding RNAs that serve as important post-transcriptional regulators of gene expression. Although our understanding of the role(s) of miRNAs in the etiology and pathology of diseases, including OA, is at an early stage, several miRNAs have been identified as important for cartilage homeostasis. In this study we used deep sequencing to examine the miRNome of human chondrocytes from OA patients, in order to characterize both the expression patterns and the distribution and variation of isomiRs, the length and sequence variants of individual miRNAs. Such variants can be responsible for regulating different sets of targets. The Illumina platform was used to characterize the miRNA repertoire of untreated and IL-1beta-treated primary human chondrocytes from normal subjects as well as OA patients at several time points. Sets of differentially expressed miRNAs were identified between samples, as well as over time, revealing that while the majority of miRNAs are expressed across all samples, many of them at relatively low levels, some miRNAs are expressed only at specific times. We further examined the nature and distribution of various isomiRs across samples, particularly the nontemplated nucleotide additions, in order to gain a better understanding of the isomiR repertoire and the mechanisms responsible for miRNA regulation in OA. This in turn can provide insights into OA pathogenesis and lead to the discovery of novel therapeutic approaches.

Comprehensive transcriptome meta-analysis to characterize host immune responses in helminth infections

Date: TBA
Room: TBA

Guangyan Zhou, Institute of Parasitology, McGill University, Canada
Tim Geary, Institute of Parasitology, McGill University, Canada
Mary Stevenson, Department of Microbiology and Immunology, McGill University, Canada
Jianguo Xia, Institute of Parasitology, McGill University, Canada

Presentation Overview:

Helminth infections affect more than a third of the world’s population. Despite very broad phylogenetic differences among helminth parasite species, a systemic Th2 host immune response is typically associated with long-term helminth infections, also known as the “helminth effect”. The objective of this study is to determine if there is a common transcriptomic signature characteristic of the helminth effect across multiple helminth species and tissue types. To this end, we performed a comprehensive meta-analysis of publicly available gene expression datasets. After data processing and adjusting for study specific effects, we identified ~400 differentially expressed genes that are consistently upregulated during helminth infection. Functional enrichment analyses indicate that these genes are broadly involved in various immune functions, including immunomodulation, immune signaling, inflammation, pathogen recognition and antigen presentation. This common immune gene signature confirms previous observations and indicates that the helminth effect is robust across different parasite species as well as host tissue types.

Phosphoproteomic analysis on mitochondria isolated from differentiated human neuronal cells identifies new phosphorylation sites on proteins linked to neurodegenerative disorders

Date: TBA
Room: TBA

Florian Goebels, University of Toronto, Canada
Zoran Minic, University of Regina, Canada
Sadhna Phanse, University of Toronto, Canada
Gary Bader, University of Toronto, Canada
Mohan Babu, University of Regina, Canada

Presentation Overview:

Mitochondria play a central role in energy metabolism and in cellular processes such as cell-cycle control, cell differentiation, cell survival, programmed cell death, neuronal protection and aging. Several studies have linked mitochondrial proteins to human illnesses including Parkinson’s disease, Alzheimer’s disease, schizophrenia, and autism. Understanding mitochondrial protein regulation is therefore important to decode the signaling pathways involved in these diseases. One central way in which proteins are regulated in a cell is via post-translational modifications (PTM) such as phosphorylation. To examine the role of phosphorylation of mitochondrial proteins during neural development, we measured phosphoproteomic profiles of mitochondrial and cytosolic fractions isolated from a human pluripotent embryonal carcinoma neuronal Ntera-2 cell line, which is widely used as a model of human neurogenesis. To obtain a comprehensive overview of mitochondrial and soluble phosphoproteins during the transition from the undifferentiated to the differentiated cell stage, we combined an efficient mitochondrial isolation protocol with IMAC and TiO2 phosphopeptide enrichment and LC-MS/MS. We observed that differentiated and undifferentiated Ntera-2 cells display significantly different phosphoproteomic patterns. Phosphorylated proteins in Ntera-2 cells are enriched in three main functional groups: neurogenesis, mitochondrial metabolic pathways, and neurological disease proteins. Analysis of kinase-substrate interactions revealed novel phosphorylation sites and corresponding kinases linked to these three groups. We performed independent affinity purification mass spectrometry (AP-MS) experiments in differentiated and undifferentiated Ntera-2 cells on potential bait proteins, which revealed their unique protein interaction rewiring during differentiation.
The results presented here give novel insight into the molecular changes in signalling networks and metabolic pathways necessary for neuronal cell differentiation, which is important for designing therapies for various diseases, including neurodegenerative diseases.

Optimal architecture of differentiation cascades with asymmetric and symmetric stem cell division

Date: TBA
Room: TBA

Daniel Sánchez-Taltavull, Ottawa Hospital Research Institute, Canada

Presentation Overview:

The role of symmetric division in stem cells is ambiguous. It is necessary after injuries, but an excess of symmetric division makes the appearance of tumours more likely. To explore the role of symmetric and asymmetric division in cell populations, we propose a mathematical model of competition between populations, in which stem cell expansion is controlled by the fully differentiated cells. We show that there is an optimal fraction of symmetric stem cell division, which maximises the long-term survival probability of the organism. Moreover, we determine the optimal number of stem cells in a tissue and show that this number has to be small enough to reduce the probability that advantageous malignant cells appear, and large enough to ensure that the population will not be eliminated by stochastic fluctuations.

Machine learning approach for identifying parasite immunoreactive proteins: example driven by Schistosoma genomes

Date: TBA
Room: TBA

Md Shihab Hasan, The University of Queensland, Australia
Martha Zakrzewski, QIMR Berghofer Medical Research Institute, Australia
Don McManus, QIMR Berghofer Medical Research Institute, Australia
Lutz Krause, The University of Queensland, Australia

Presentation Overview:

Schistosomiasis is a chronic disease caused by Schistosoma species. The World Health Organization considers it the second most socioeconomically devastating and second most common parasitic disease, affecting 200 million people worldwide and causing at least 300,000 deaths annually. No vaccines are available. Driven by the need to improve disease treatment and prevention, the genomes of three human Schistosoma species have recently become publicly available (S. mansoni, S. japonicum Chinese strain and S. haematobium).

Recently, an immunomics study of schistosomes using protein microarrays was published, and the sequence properties of immunoreactive parasite proteins have been determined. We have developed a machine-learning tool, ‘SchistoTarget’, for predicting immunoreactive parasite antigens, which would aid in the discovery of novel vaccine targets against schistosomes. SchistoTarget provides a user-friendly web interface, and results are presented in interactive tables and figures. SchistoTarget is implemented in Python and publicly available at http://schistotarget.bioapps.org. The protocol might be a blueprint for other parasitic diseases such as malaria, scabies, hookworm and fungal infections.

miCoRe: Colorectal Cancer miRNAs Database

Date: TBA
Room: TBA

Rahul Agarwal, Shiv Nadar University, India
Ashutosh Singh, Shiv Nadar University, India

Presentation Overview:

Background: Colorectal cancer (CRC) is the fourth most common type of cancer worldwide and usually arises from benign adenomas. About 1.23 million new cases are reported each year. Profiling of unique and stable CRC miRNAs in human serum/plasma has enabled the study of their role in the diagnosis and prognosis of CRC in the general population. A number of reports suggest a functional role for aberrant miRNAs in the carcinogenesis and progression of colorectal cancer. Various studies have addressed the discovery and molecular characterization of miRNAs specific to CRC, but a comprehensive repository of all the miRNAs involved in CRC is still lacking.
Objectives: This work presents the development of a miRNA database for colorectal cancer, which we named miCoRe. The database comprises information on all validated colorectal cancer miRNAs from the published literature, together with an effective knowledge-based information retrieval system.
Methodology: miRNAs were collected from published literature reports. MySQL is used as the main framework of miCoRe, while the front end was developed in PHP. The aim of developing miCoRe is to create a comprehensive central repository of colorectal carcinoma miRNAs with all pertinent information on the miRNAs and their target genes.
Results: The current version of miCoRe contains 238 miRNAs that are known to be implicated in the malignancy of CRC. Along with the miRNA information, miCoRe also contains information on the target genes of these miRNAs.
Conclusion: miCoRe provides information about the mechanism of incidence and progression of the disease, which should help researchers look for colorectal-specific miRNA therapies and CRC-specific targeted drug design. Moreover, it will help in the development of biomarkers for better and earlier detection of CRC and in better clinical management of the disease.
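
A relational layout of the kind described above, with miRNAs linked to their target genes, might look roughly like the following. This is a sketch only: the actual miCoRe schema is not given in the abstract, so the table names, column names and placeholder rows are hypothetical, and Python's built-in sqlite3 module stands in for MySQL to keep the example self-contained.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mirna (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE target_gene (
        id          INTEGER PRIMARY KEY,
        mirna_id    INTEGER NOT NULL REFERENCES mirna(id),
        gene_symbol TEXT NOT NULL
    );
    """
)

# Placeholder rows, not real database content.
conn.execute("INSERT INTO mirna (id, name) VALUES (1, 'miR-placeholder-1')")
conn.executemany(
    "INSERT INTO target_gene (mirna_id, gene_symbol) VALUES (?, ?)",
    [(1, "GENE_A"), (1, "GENE_B")],
)


def targets_of(connection, mirna_name):
    """Return the target gene symbols recorded for one miRNA, sorted by name."""
    rows = connection.execute(
        "SELECT g.gene_symbol FROM target_gene AS g "
        "JOIN mirna AS m ON m.id = g.mirna_id "
        "WHERE m.name = ? ORDER BY g.gene_symbol",
        (mirna_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(targets_of(conn, "miR-placeholder-1"))
```

Keeping target genes in their own table lets one miRNA map to any number of genes without duplicating the miRNA record.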

Neural Network Word Embeddings of Biological Sequences with Applications in Deep Proteomics and Genomics

Date: TBA
Room: TBA

Ehsaneddin Asgari, University of California, Berkeley, United States
Mohammad R.K. Mofrad, University of California, Berkeley, United States

Presentation Overview:

We introduce a new unsupervised, data-driven feature extraction method for biological sequences. We use the name bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences. This representation can be widely used in applications of deep learning in proteomics and genomics. In this work, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by providing only sequence data for various proteins to this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest.
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at the Life Language Processing website: http://llp.berkeley.edu and at Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
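
To make the idea of a single dense vector per protein more concrete, the sketch below shows one common way of turning a sequence into "words" and combining their vectors. The abstract does not spell out the tokenization, so two things here are assumptions: that sequences are split into 3-residue words at each of the three reading offsets, and that a protein vector is the sum of the vectors of its overlapping 3-residue words. The function names and the toy vectors are hypothetical; real word vectors would come from training a word-embedding model on a large sequence corpus.

```python
def shifted_kmers(sequence, k=3):
    """Split a sequence into k lists of non-overlapping k-mers, one per reading offset."""
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


def embed_sequence(sequence, kmer_vectors, k=3):
    """Sum the vectors of all overlapping k-mers found in the lookup table."""
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    for i in range(len(sequence) - k + 1):
        vector = kmer_vectors.get(sequence[i:i + k])
        if vector is not None:  # unknown k-mers are skipped in this sketch
            total = [t + v for t, v in zip(total, vector)]
    return total


# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"MKT": [1.0, 0.0], "KTA": [0.0, 2.0]}
words = shifted_kmers("MKTAYIAK")
protein_vector = embed_sequence("MKTA", toy_vectors)
print(words, protein_vector)
```

The resulting fixed-length vector can then be passed to any standard classifier, such as the support vector machines mentioned in the abstract.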

Gene Set-based Deep Neural Network Learning for Disease Classification

Date: TBA
Room: TBA

Pingzhao Hu, University of Manitoba, Canada
Yang Wang, University of Manitoba, Canada
Huyen Le, University of Manitoba, Canada

Presentation Overview:

It is well known that early detection of cancer is key to treatment. Consequently, the identification of clinically significant, biologically and functionally relevant biomarkers is critical. It has been found that individual gene signatures sometimes have little biological meaning, since the genes come from different biological pathways. Thus, pathway or gene set biomarkers make more sense. Here we examine deep neural networks to classify cancer outcomes with microarray data sets in conjunction with biologically derived gene sets.
We compared the performance of gene set-based deep neural network learning with that of SVM. We performed our analyses at both the individual gene and gene set levels. The individual genes used to build the classification models were selected using a standard two-sided t-test, while the individual gene sets were selected using the Gene Set Analysis (GSA) R package.
We examined the classification performance using the top 5, 10 and 15 gene sets with 3 hidden layers (30, 20, 10 units, respectively), 4 hidden layers (30, 20, 15, 10 units, respectively) and 6 hidden layers (30, 25, 20, 15, 10, 5 units, respectively). The error function we used is cross-entropy and linear.
In general, the SVM-based classifiers perform better than the neural net-based classifiers. Interestingly, we observed that the disconnected neural net-based classifiers perform better than the fully connected neural net-based classifiers with a smaller number of hidden layers. Our results suggest a potential for applying deep learning methods to improve cancer outcome prediction.
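
For readers unfamiliar with the layer notation used above, the following sketch shows a forward pass through the smallest configuration mentioned (3 hidden layers of 30, 20 and 10 units) with a cross-entropy error for a two-class outcome. It is not the authors' implementation: the sigmoid activations, the random initial weights, the single output unit and the use of one score per gene set as input are all assumptions, and no training step is shown.

```python
import math
import random


def init_layers(sizes, seed=0):
    """Create small random weights and zero biases for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def forward(features, layers):
    """Propagate one sample through fully connected sigmoid layers."""
    activations = features
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]  # probability of the positive class


def cross_entropy(label, probability):
    """Binary cross-entropy error for one sample (label is 0 or 1)."""
    return -(label * math.log(probability) + (1 - label) * math.log(1.0 - probability))


# 5 gene-set scores in, hidden layers of 30, 20 and 10 units, 1 output unit.
layers = init_layers([5, 30, 20, 10, 1])
sample = [0.2, -1.0, 0.5, 0.0, 1.3]  # hypothetical gene-set scores
p = forward(sample, layers)
loss = cross_entropy(1, p)
print(p, loss)
```

The 4-layer and 6-layer configurations differ only in the list of sizes passed to `init_layers`.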

CT-Finder: A Web Service for CRISPR Optimal Target Prediction and Visualization

Date: TBA
Room: TBA

Houxiang Zhu, Miami University, United States
Lauren Misel, Miami University, United States
Mitchell Graham, Miami University, United States
Chun Liang, Miami University, United States

Presentation Overview:

RGAugury: A pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants

Date: TBA
Room: TBA

Pingchuan Li, Morden Research and Development Centre, AAFC, Canada
Sylvie Cloutier, Ottawa Research and Development Centre, AAFC, Canada
Frank You, Morden Research and Development Centre, AAFC, Canada

Presentation Overview:

Investigations into the physiological and biomechanical basis of differential success in oral rabies vaccination between skunks (Mephitis mephitis) and raccoons (Procyon lotor)

Date: TBA
Room: TBA

Charlotte Klimovich, Ohio University, United States
Susan Williams, Ohio University, United States

Presentation Overview:

Oral rabies vaccination (ORV) programs in North America have historically been effective in controlling the spread of rabies in carnivoran populations. This is accomplished by distributing baits filled with the rabies vaccine that are located and consumed by wild animals. This has been most effective in raccoon populations which show significantly reduced rabies infections. However, despite also being a major rabies vector, the striped skunk has a significantly lower inoculation rate in wild populations although the vaccine has been shown to be effective in captive studies. Here, we aim to elucidate biomechanical differences between skunks and raccoons that may contribute to differences in oral immunization success. 3D biomechanical models incorporating data from the jaw muscles were developed to estimate bite forces and muscle moments at different gapes for the two species. Results demonstrate differences that might impact their ability to break down the baits to access the vaccine. Data from this project may be used to improve bait manufacturing techniques.

Set Covering Machines and Reference-Free Genome Comparisons Uncover Predictive Biomarkers of Antibiotic Resistance

Date: TBA
Room: TBA

Presentation Overview:

The Interrelationships Between Ebola and Lassa virus: A Comparative Genomics Perspective

Date: TBA
Room: TBA

Olaitan Awe, University of Ibadan, Nigeria
Angela Makolo, University of Ibadan, Nigeria
Segun Fatumo, University of Ibadan, Nigeria

Presentation Overview:

Ebola virus (a filovirus) and Lassa virus (an arenavirus) are two viral pathogens that cause severe multisystem dysfunction in infected humans. They are associated with deadly epidemics such as Lassa fever (caused by Lassa virus), with a current outbreak in Nigeria a year after the World Health Organisation declared Nigeria free of Ebola virus transmission in 2014. There is an increasing need for methods to study these viral pathogens, especially because of the lethal nature of the diseases they cause in humans, and by studying their genomes scientists are learning more about their infections. From the genomic point of view, we determined the interrelationships between Ebola virus and Lassa virus by obtaining their RNA sequences from GenBank and aligning these sequences. We performed phylogenetic analysis of the aligned sequences and gained insight into the alignment of the regions. We showed that similarities exist among the strains of each virus, and considerable sequence similarity was noted between the Nigerian LP strain and the Sierra Leone Josiah strain of Lassa virus, as opposed to between the two Nigerian strains. This foundation of knowledge will ultimately lead to new insights for diagnosing and treating Ebola Virus Disease and Lassa fever in infected humans. Insights from this research will also further increase our understanding of the ecology and evolution of Lassa virus and Ebola virus.
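
Sequence similarity between aligned strains, as discussed above, is often summarized as percent identity. The sketch below is a generic illustration, not the authors' workflow: the function name and the two short sequences are made up, and it assumes the sequences have already been aligned (gaps written as "-") by a separate alignment tool.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned RNA fragments, for illustration only (not real viral sequences).
identity = percent_identity("AUGGC-UA", "AUGACGUA")
print(round(identity, 2))
```

Pairwise identities of this kind can be collected into a distance matrix, which is one common starting point for phylogenetic tree construction.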

A genome-scale algorithmic approach for metabolic engineering of plants

Date: TBA
Room: TBA

Jiun Yen, Virginia Tech, United States
Glenda Gillaspy, Virginia Tech, United States
Ryan Senger, Virginia Tech, United States

Presentation Overview:

Docking and Molecular Dynamics Studies on the Interaction of Steviol Glycosides with Human Bitter Taste Receptors

Date: TBA
Room: TBA

Waldo Acevedo, Pontificia Universidad Católica de Chile, Chile
Danilo González, Universidad Andrés Bello, Chile
Eduardo Agosin, Pontificia Universidad Católica de Chile, Chile

Presentation Overview:

Stevia is a natural sweetener increasingly demanded by consumers because it is non-caloric and safe. Steviol glycosides (SG) are the active compounds present in the leaves of Stevia rebaudiana responsible for its sweetness, among which the most important are stevioside, different rebaudiosides, dulcosides and rubusoside. However, sensory evaluation has consistently shown that besides the sweetening effect, these compounds have unwanted attributes, to varying degrees, such as bitter, metallic and licorice notes. Recently, in vitro cell experiments revealed that steviol glycosides activate two bitter taste receptors, hT2R4 and hT2R14, triggering this mouthfeel. The aim of this study is to characterize the interaction of the different steviol glycosides with the bitter taste receptors hT2R4 and hT2R14 by molecular simulation. The results show that the SG have one orthosteric binding site in T2R4 and T2R14, whose cavities are formed by transmembrane and extracellular regions. Interactions of SG with their binding site are mediated by hydrogen bonds and hydrophobic contacts. Furthermore, the calculated binding energy (ΔG) and the reported relative bitterness of SG are negatively correlated, with r = -0.95 and r = -0.89 for T2R4 and T2R14, respectively. For example, rebaudioside D and rubusoside have a ΔG of -5.3 and -7.0 kcal/mol for T2R4 and -7.6 and -5.2 kcal/mol for T2R14. Steered molecular dynamics simulations (SMD) helped interpret the difference in affinity of the SG. The force profiles for all complexes are different. Results show that stevioside and rebaudioside A inserted in T2R4 and T2R14 are approximately stabilized in 5.42, 4.72, 4.30 and 5.48 picoseconds, respectively. These results could contribute to the understanding of the phenomena involved in the perception of bitterness of these types of sweeteners.
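
The correlation coefficients quoted above are Pearson correlations between binding energy and a sensory score. A minimal way to compute such a coefficient is sketched below. The numbers are placeholders chosen to give a perfect negative correlation; they are not the binding energies or bitterness values from the study, and the function name is an assumption made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Placeholder values only: binding energies (kcal/mol) and sensory scores.
binding_energy = [-5.0, -6.0, -7.0, -8.0]
sensory_score = [1.0, 2.0, 3.0, 4.0]
r = pearson_r(binding_energy, sensory_score)
print(r)
```

A negative r here means that more negative (more favorable) binding energies go together with higher sensory scores.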

Docking and Molecular Dynamics Studies on the Interaction of Non-caloric Sweeteners with Human Sweet Taste Receptor

Date: TBA
Room: TBA

Waldo Acevedo, Pontificia Universidad Católica de Chile, Chile
Danilo González, Universidad Andrés Bello, Chile
Eduardo Agosin, Pontificia Universidad Católica de Chile, Chile

Presentation Overview:

Natural sweeteners are increasingly demanded by consumers due to their low calorie content and safety. In particular, stevia, which is 300 times sweeter than sugar, has positioned itself strongly in the consumer basket as a tabletop sweetener and as an additive for the production of various food products. Steviol glycosides (SG) are the active compounds present in the leaves of Stevia rebaudiana responsible for its sweetness, among which the most important are stevioside, different rebaudiosides, dulcosides and rubusoside. The ability to predict the sweetening power of different compounds is a very important challenge for the food industry. The aim of this study is to characterize the interaction of the different steviol glycosides with the sweet taste receptor, hT1R2-hT1R3, by molecular simulation. The results show that the active compounds of stevia bind to the same orthosteric binding site as traditional sweeteners, such as sucrose, in both the T1R2 and T1R3 subunits. Furthermore, the calculated binding energy (ΔG) and the reported sweetness intensity for the different families of sweeteners, from terpenes to proteins, are negatively correlated, with r = -0.9 and -0.86 for T1R2 and T1R3, respectively. For example, the ΔG of stevioside and rebaudioside A is -8.8 and -8.4 kcal/mol, respectively, unlike sucrose with a ΔG of -5.1 kcal/mol. The affinity of steviol glycosides for the sweet taste receptor is favored by the increased number of hydrogen bonds and hydrophobic interactions with binding site residues. Steered molecular dynamics simulations (SMD) helped interpret the difference in affinity of the SG. Results show that stevioside and rebaudioside A inserted in T1R2 and T1R3 are approximately stabilized in 4.56, 4.02, 8.28 and 6.3 picoseconds, respectively. These results could contribute to the understanding of the phenomena involved in the perception of sweetness of these types of sweeteners.

May 16-19, 2016 | GLBIO/CCBC 2016
