Oral Poster Presentations

OP01 Role of the DPAGT1/β-catenin/YAP signaling network in oral squamous cell carcinoma
Date: Sunday, July 12, 10:10 am - 10:30 am
Room: Wicklow Hall 2B
Vinay Kartha, Boston University, United States
Liye Zhang, Boston University, United States
Samantha Hiemer, Boston University, United States
Maria Kukuruzinska, Boston University School of Medicine, United States
Xaralabos Varelas, Boston University, United States
Stefano Monti, Boston University School of Medicine, United States
Progression of oral squamous cell carcinoma (OSCC) to metastasis involves complex changes in epithelial cell growth, survival and migration. While the roles of protein N-glycosylation, Wnt/β-catenin and Hippo pathways in cancer have been independently highlighted, the interplay between these pathways in promoting tumor metastasis is less well understood. Prior studies have found this co-dependent homeostatic pathway network to be deregulated in OSCC, where it plays a vital role in tumorigenesis. However, identifying the exact mediators of these changes remains challenging and is crucial to the discovery of novel and lasting OSCC therapeutics. Here, we apply a multi-omic profiling approach to identify potential regulators of OSCC pathogenic pathway activity using a combination of OSCC cell line gene expression profiles and large-scale public genomic data. Gene expression signatures for genetic knockdowns of DPAGT1 (a gene crucial to protein N-glycosylation) and of TAZ and YAP (two transcriptional activators involved in the Hippo pathway) were derived using SCC2 cells. Primary human OSCC high-throughput gene expression data obtained from The Cancer Genome Atlas (TCGA) were then projected onto these signatures and analyzed for their association with clinical features including tumor grade and stage. By scoring samples based on their level of pathway deregulation, and additionally leveraging TCGA copy number alteration (CNA), DNA methylation and somatic mutation data, we are able to identify potential genetic and epigenetic regulators of human OSCC development in the context of the DPAGT1/β-catenin/YAP signaling network, paving the way to discovering targets for OSCC therapy.

OP02 LAPRAS: An Integrative Model Incorporating Heterogeneous Datasets to Discover Genetic Etiology of Autism Spectrum Disorder
Date: Sunday, July 12, 10:10 am - 10:30 am
Room: Wicklow Hall 2B
Sumaiya Nazeen, Massachusetts Institute of Technology, United States
Rohit Singh, Computation & Biology Group, CSAIL, MIT, United States
Bonnie Berger, CSAIL, Massachusetts Institute of Technology, United States
Autism spectrum disorder (ASD), prevalent in 1% of the population, refers to a group of complex neurodevelopmental disorders sharing the common feature of dysfunctional reciprocal social interaction. There is compelling evidence that genetic factors are a predominant cause of ASD; however, the genetic heterogeneity underlying ASD makes it challenging to gain conclusive biological insights into the disease. Most of the general-purpose gene prioritization methods and ASD-specific gene network methods suffer from the limitation of depending just on the protein-protein interaction (PPI) network and/or co-expression network, and do not properly utilize other types of ASD-related information available in literature. We believe understanding the complex genetic background of ASD requires a strategy that can integrate multiple forms of data. To this end, we present a computational method termed LAPRAS (LAsso-Penalized logistic Regression based gene ASsociation) that incorporates ASD-specific DNA copy number variations, PPI network topology, phenotypic similarities of diseases, and pathway knowledge from literature. We provide a rank-list of genes in descending order of their probability of association with ASD. The top-ranked genes are overrepresented in neurological pathways, cell adhesion pathways, and signal transduction pathways pertinent to brain, cellular assembly and communication, synaptic development, and neuronal development. The most significant sub-networks discovered in the top-ranked genes are overrepresented in gastro-intestinal disorders, nervous system development, hereditary developmental disorders, and organismal abnormalities suggesting the existence of subclasses of ASD. This integrative method is novel and outperforms other state-of-the-art gene ranking methods.

OP03 Combined strategy to detect somatic point mutations from circulating DNA by targeted sequencing
Date: Sunday, July 12, 10:10 am - 10:30 am
Room: Wicklow Hall 2B
Nicola Casiraghi, Centre for Integrative Biology, University of Trento, Italy
Alessandro Romanel, Centre for Integrative Biology, University of Trento, Italy
Gerhardt Attard, Royal Marsden National Health Service Foundation Trust, United Kingdom
Francesca Demichelis, Centre for Integrative Biology, University of Trento, Italy
We developed a computational method that combines genetic knowledge and empirical signal to readily detect and quantify somatic point mutations in cell-free DNA by fully exploiting single-base resolution information from targeted next-generation sequencing data of a patient’s plasma (case) and matched germline sample (control). First, each targeted base is tested in both cases and controls for allelic fraction, local coverage and reads supporting the alternative allele(s). The distribution of control allelic fractions is built to determine the cut-off corresponding to the desired detection specificity. Second, to mitigate the impact of potential strand bias, we implemented a combination of standard Fisher’s and odds ratio tests with ad hoc analysis of the distribution of reference/alternative strand proportions in the study cohort. Third, control samples are exploited to build a genomic locus-specific error model to estimate the probability that an observed case allelic fraction is indeed evidence of a somatic event. Fourth, comparison of expected versus observed ratios of non-synonymous to synonymous substitution rates in targeted control genes is adopted as an additional quality check. Last, if the targeted design allows for estimation of case tumor content and local somatic copy number state, the method also controls for point mutation detection suitability stratified by locus coverage (false negative rates). The robustness of our combined strategy was tested across a range of coverage depths by in silico down-sampling analysis. We will present the efficacy of the strategy on 46 plasma samples from 15 metastatic patients recently profiled with a targeted panel spanning 40 kb across eight cancer genes at 1500X mean coverage.
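
The first step of the abstract above (deriving an allelic-fraction cut-off from controls for a desired specificity) can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-rank quantile, the function names and the toy values are all assumptions made for illustration.

```python
import math


def af_cutoff(control_afs, specificity=0.99):
    """Return the smallest observed control allelic fraction (AF) such that
    at least `specificity` of the control AFs are at or below it.
    Uses the nearest-rank empirical quantile (an assumption of this sketch)."""
    if not control_afs:
        raise ValueError("need at least one control allelic fraction")
    ordered = sorted(control_afs)
    rank = max(1, math.ceil(specificity * len(ordered)))
    return ordered[rank - 1]


def candidate_calls(case_afs, cutoff):
    """Keep case positions whose AF is strictly above the control-derived cut-off."""
    return {pos: af for pos, af in case_afs.items() if af > cutoff}


# Hypothetical AFs observed at targeted bases in germline controls.
controls = [0.000, 0.001, 0.001, 0.002, 0.002, 0.003, 0.004, 0.005, 0.010, 0.020]
cutoff = af_cutoff(controls, specificity=0.9)

# Hypothetical plasma (case) AFs keyed by genomic position.
calls = candidate_calls({"chr1:100": 0.004, "chr1:200": 0.030}, cutoff)
```

In this toy example only the position with an allelic fraction above the control-derived cut-off is retained. The strand-bias tests and locus-specific error model described in the abstract would be applied on top of this filter.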

OP04 Transcriptomics of rare cell populations in the aging neural stem cell lineage
Date: Sunday, July 12, 10:30 am - 10:50 am
Room: Wicklow Hall 2B
Katja Hebestreit, Stanford University School of Medicine, United States
Dena Leeman, Stanford University School of Medicine, United States
Anne Brunet, Stanford University School of Medicine, United States
Neural stem cell niches in the adult brain are the locations where neural stem cells produce new neurons necessary for the maintenance and plasticity of brain function. With age, neural stem cell niches deteriorate, with a decline in neural stem cell proliferation and production of new neurons. To examine the transcriptional landscape in neural stem cells during aging, we obtained RNA-seq data from freshly isolated cells along the neural stem cell lineage from young and old mice. Because of very low cell numbers per replicate and because differences with age were expected to be subtle, we captured unwanted variance in the data using surrogate variable analysis. We used limma to detect differentially expressed genes between cell types and between old and young samples for each cell type. We found strong gene expression differences between the cell types, especially between quiescent and activated cell types. Intriguingly, we found that quiescent neural stem cells show transcriptional changes with age, whereas activated neural stem cells do not seem to have an aging signature. Using pathway enrichment analysis, we found that quiescent and activated neural stem cells use different primary pathways to carry out different modes of proteostasis, with quiescent neural stem cells favoring autophagy and activated neural stem cells using the proteasome pathway. As defective proteostasis is a hallmark of aging, it represents an interesting candidate for further investigation to understand why activated neural stem cells are protected from transcriptional aging.

OP05 Improving Clustal Omega's sequence alignment accuracy with annotated profile Hidden Markov Models
Date: Sunday, July 12, 10:30 am - 10:50 am
Room: Wicklow Hall 2B
Quan Le, University College Dublin, Ireland
Des Higgins, University College Dublin, Ireland
Clustal Omega is the latest member of the Clustal sequence alignment program family; it allows the use of an additional profile HMM to improve the accuracy of the alignment. In this experiment, we use HMMER 3.0 and pfam_scan to annotate each sequence in the set to be aligned with profile HMMs from the Pfam database; we then supply the annotated profile HMMs as extra inputs to Clustal Omega to improve alignment quality. Using one Pfam profile HMM per alignment, we obtain positive results on all 5 reference sets of the sequence alignment benchmark BAliBASE 3.0 (the average total column scores improve by between 2.4% for reference 3 and more than 20% for reference 1 version 1). For cases where multiple Pfam profile HMMs hit the sequences to be aligned, we are performing initial experiments using concatenated profile HMMs to further improve alignment quality.

OP06 Investigating evolutionary models of genome structure in aggressive prostate cancer
Date: Sunday, July 12, 10:30 am - 10:50 am
Room: Wicklow Hall 2B
Marek Cmero, The University of Melbourne, Australia
Natalie Kurganovs, Royal Melbourne Hospital, Australia
Jessica Chung, The Victorian Life Sciences Initiative, Australia
Jan Schröder, The Walter + Eliza Hall Institute, Australia
Kangbo Mo, The University of Melbourne, Australia
Clare Sloggett, The Victorian Life Sciences Initiative, Australia
Niall Corcoran, Royal Melbourne Hospital, Australia
Christopher Hovens, Royal Melbourne Hospital, Australia
Cheng Soon Ong, NICTA, Australia
Geoff Macintyre, Cancer Research UK, United Kingdom
Tumour evolution is a complex and multifaceted process. Recently, many approaches have arisen for inferring the evolutionary dynamics of tumour cell populations from point-mutation and copy-number data. However, the role of structural variations (SVs) in cancer evolution, particularly balanced rearrangements, has been less thoroughly explored. We present a method for reconstructing cancer phylogeny from multiple samples of a single patient using large-scale genomic aberrations and apply it to prostate cancer, which is particularly rearrangement-driven. We demonstrate that tumour phylogenies can be reconstructed using rearrangement data alone, and we further expand our model to characterise subclonal SVs. We demonstrate our methods by applying them to longitudinal samples from patients undergoing second-line anti-hormone therapy to gain insight into the mechanisms of castration resistance.

OP07 Systematic characterization of the disease and tissue distributions for identification of novel drug targets
Date: Sunday, July 12, 10:50 am - 11:10 am
Room: Wicklow Hall 2B
David Westergaard, The Novo Nordisk Foundation Center for Protein Research, Denmark
Alberto Santos, The Novo Nordisk Foundation Center for Protein Research, Denmark
Kalliopi Tsafou, The Novo Nordisk Foundation Center for Protein Research, Denmark
Christian Stolte, Digital Productivity, Commonwealth Scientific and Industrial Research Organisation, Australia
Sune Frankild, The Novo Nordisk Foundation Center for Protein Research, Denmark
Albert Pallejà, The Novo Nordisk Foundation Center for Protein Research, Denmark
Janos X Binder, Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Germany
Seán O'Donoghue, Digital Productivity, Commonwealth Scientific and Industrial Research Organisation, Australia
Søren Brunak, The Novo Nordisk Foundation Center for Protein Research, Denmark
Lars Juhl Jensen, The Novo Nordisk Foundation Center for Protein Research, Denmark
The identification and validation of drug targets remains a major obstacle in drug development. To date, the majority of drug targets fall into four classes: G protein-coupled receptors, nuclear receptors, ion channels, and kinases (Overington et al., 2006). Illuminating the Druggable Genome (IDG) is an NIH initiative that will aid the discovery of novel targets by integrating heterogeneous methods and data sources. To this end, we have developed two novel resources, DISEASES and TISSUES, which project evidence onto proteins from the STRING database and two controlled vocabularies, namely Disease Ontology and the BRENDA Tissue Ontology. The use of controlled vocabularies ensures a perfect translatability between the two resources.

The DISEASES and TISSUES resources both integrate heterogeneous evidence from manually curated databases, high-throughput experiments, and automatic literature mining. DISEASES integrates disease-gene associations from Genetics Home Reference, UniProt, DistiLD, and COSMIC. TISSUES is a database of gene expression in human tissues according to publicly available data from microarrays, RNA sequencing, mass spectrometry and immunohistochemical staining. Both resources also contain evidence from co-mentioning in Medline abstracts. Using gold standards, we calibrate quality scores across evidence types and estimate a confidence level for each association.

The resources described here are publicly available under a CC-BY-4.0 license at http://diseases.jensenlab.org and http://tissues.jensenlab.org

OP08 Bioinformatic Analysis of Long Non-coding RNAs in Neuroblastoma
Date: Sunday, July 12, 10:50 am - 11:10 am
Room: Wicklow Hall 2B
Kate Killick, University College Dublin, Ireland
Markus Schroder, University College Dublin, Ireland
Sarah-Jane Lennon, University College Dublin, Ireland
Thomas Schwarz, European Molecular Biology Laboratory (EMBL), Germany
Walter Kolch, University College Dublin, Ireland
Desmond Higgins, University College Dublin, Ireland
David Duffy, University College Dublin, Ireland
Neuroblastoma is an embryonic childhood cancer arising from the neural crest progenitor cells of the sympathetic nervous system. It is the most commonly found extracranial pediatric tumor, accounting for approximately 15% of all childhood cancer deaths. Amplification of the MYCN gene is found in 25% of neuroblastoma tumors and the degree of amplification is correlated with patient outcome. Non-coding RNAs have no protein-coding potential yet have been shown to play a role in a diverse range of cellular functions including cell differentiation and embryonic development. In particular, over the last several years a large body of literature has emerged supporting a role for long non-coding RNAs (lncRNAs) in many types of cancer. Novel lncRNAs have the potential to serve as diagnostic markers and therapeutic targets in this complex disease. Coupled with this, improved methods of examining the transcriptome have enabled advances in identifying and understanding non-coding RNAs. Here, bioinformatic analyses were used to identify lncRNAs from RNA-seq data taken from a range of MYCN-amplified neuroblastoma cell lines. Time course data from a MYCN-overexpressing cell line were also examined, as well as data from a neuroblastoma cell line treated with a retinoid compound known to induce differentiation of neuroblastoma tumors into mature neurons, rendering them benign. Collectively, these results demonstrate the induction of lncRNAs by MYCN in neuroblastoma, identify a subset of lncRNAs involved in neuroblastoma cell fate, and offer a new perspective for neuroblastoma research.

OP09 A new molecular signature approach for prediction of driver cancer pathways from transcriptional data (unable to attend)
Date: Sunday, July 12, 10:50 am - 11:10 am
Room: Wicklow Hall 2B
Boris Reva, Mount Sinai School Of Medicine, United States
Dmitry Rykunov, Mount Sinai School Of Medicine, United States
Andrew Usilov, Mount Sinai School Of Medicine, United States
Hui Li, Mount Sinai School Of Medicine, United States
Eric Schadt, Mount Sinai School Of Medicine, United States
Assigning cancer patients to the most effective treatments requires an understanding of the molecular basis of their disease. While DNA-based molecular profiling approaches have flourished over the past several years to transform our understanding of driver pathways across a broad range of tumors, a systematic characterization of key driver pathways based on RNA data has not been undertaken.

Here we introduce a new approach to predict the status of driver cancer pathways based on weighted sums of gene expression values, or signature functions, derived from RNA sequencing data. To identify the driver cancer pathways of interest, we mined DNA variant data from TCGA and nominated driver alterations in seven major cancer pathways in breast, ovarian, and colon cancer tumors. The activation status of these driver pathways was then characterized using RNA sequencing data by constructing signature functions in training datasets and then testing the accuracy of the signatures in test datasets.

The signature functions perform well in separating tumors with nominated active pathways from tumors with no genomic signs of activation (average AUC of 0.83), systematically exceeding the accuracies obtained by the SVM method that we employed as a control approach. A typical pathway signature is composed of ~20 biomarker genes that are unique to a given pathway and cancer type. Our results confirm that driver genomic alterations are distinctively displayed at the transcriptional level and that the transcriptional signatures can generally provide an alternative to DNA sequencing methods in detecting specific driver pathways.
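
To make the idea of a signature function concrete, the sketch below scores samples with a weighted sum of gene expression values and evaluates the separation with a pairwise AUC. The gene names, weights and expression values are hypothetical, and the abstract does not describe how the weights are fitted, so this only illustrates the scoring and evaluation steps.

```python
def signature_score(expression, weights):
    """Weighted sum of expression values over the signature genes.
    Genes missing from a sample contribute zero (an assumption of this sketch)."""
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items())


def auc(scores_active, scores_inactive):
    """Probability that a randomly chosen active-pathway sample scores higher
    than a randomly chosen inactive one; ties count as one half."""
    wins = 0.0
    for a in scores_active:
        for b in scores_inactive:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_inactive))


# Hypothetical two-gene signature and toy expression profiles.
weights = {"GENE_A": 1.0, "GENE_B": -0.5}
active = [{"GENE_A": 5.0, "GENE_B": 1.0}, {"GENE_A": 3.0, "GENE_B": 2.0}]
inactive = [{"GENE_A": 2.0, "GENE_B": 2.0}, {"GENE_A": 4.0, "GENE_B": 4.0}]

active_scores = [signature_score(s, weights) for s in active]
inactive_scores = [signature_score(s, weights) for s in inactive]
toy_auc = auc(active_scores, inactive_scores)
```

An AUC of 1.0 would mean every active-pathway tumor outscores every inactive one, while 0.5 corresponds to no separation.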

OP10 Evaluating and optimizing variant calling: a comparison of Roche 454, Ion Torrent PGM and Illumina NextSeq sequencing data
Date: Sunday, July 12, 11:40 am - 12:00 pm
Room: Wicklow Hall 2B
Sarah Sandmann, University of Muenster, Germany
Aniek de Graaf, RadboudUMC, Netherlands
Bert van der Reijden, RadboudUMC, Netherlands
Joop Jansen, RadboudUMC, Netherlands
Martin Dugas, University of Muenster, Germany
There are various next-generation sequencing (NGS) techniques, all of them striving to replace Sanger sequencing as the gold standard. The ongoing development of NGS methods has greatly reduced the turnaround time and cost of sequencing. However, false positive calls of SNVs and especially indels are a widely known problem of essentially all NGS sequencers.

We developed optimized variant calling pipelines for three common NGS sequencers, considering both SNVs and short indels. Amplicon-based targeted sequencing of 20 genes known to be recurrently mutated in myelodysplastic syndromes (MDS) was performed in parallel on Roche 454, Ion Torrent PGM and Illumina NextSeq500 platforms. Diagnostic material from MDS patients, partially sequenced twice on each sequencing platform, formed the basis of the optimization and represents the learning cohort. If required, called variants were confirmed by Sanger sequencing of the original patient material.

We calculated various parameters to characterize both SNVs and indels. Yet, instead of setting arbitrary thresholds for each parameter, we combined them to estimate generalized linear models returning a probability for each variant to be a true positive. A single threshold for each model was chosen to provide maximum sensitivity as well as a maximum positive predictive value.
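
As a rough illustration of the paragraph above, the sketch below turns variant parameters into a probability with a logistic model and then scans candidate thresholds. The abstract does not say how sensitivity and positive predictive value (PPV) are traded off; maximizing their sum is an assumption of this sketch, as are all names and toy values.

```python
import math


def variant_probability(features, coefficients, intercept):
    """Logistic model: probability that a called variant is a true positive."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sensitivity_and_ppv(probabilities, labels, threshold):
    """Sensitivity and PPV when variants with probability >= threshold are kept."""
    tp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y)
    fp = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and not y)
    fn = sum(1 for p, y in zip(probabilities, labels) if p < threshold and y)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


def best_threshold(probabilities, labels):
    """Pick the observed probability that maximizes sensitivity + PPV
    (one possible criterion, assumed here for illustration)."""
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: sum(sensitivity_and_ppv(probabilities, labels, t)))


# Hypothetical model outputs and validation status (True = confirmed variant).
probs = [0.95, 0.80, 0.60, 0.30, 0.10]
truth = [True, True, False, True, False]
threshold = best_threshold(probs, truth)
```

Combining the parameters into one probability means a single threshold has to be tuned per model, rather than one arbitrary threshold per parameter.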

Subsequently, we performed a comparison of the three NGS platforms and their previously optimized variant calling pipelines. Sequencing data from additional MDS patients with lab validated SNVs and indels formed the basis of the comparison, representing the validation cohort.

OP11 Bacterial vaccine design using reverse vaccinology
Date: Sunday, July 12, 11:40 am - 12:00 pm
Room: Wicklow Hall 2B
Ashley Heinson, University of Southampton, United Kingdom
Carmen Denman, London School of Hygiene and Tropical Medicine, United Kingdom
Yawwani Gunawardana, University of Southampton, United Kingdom
Mahesan Niranjan, University of Southampton, United Kingdom
Bastiaan Moekser, University of Southampton, United Kingdom
Christopher Woelk, University of Southampton, United Kingdom
Reverse Vaccinology (RV) uses computational approaches to identify vaccine candidates in the genomes of bacterial pathogens. Vaccine development for bacterial pathogens is at a critical juncture due to widespread antibiotic resistance. Previously, our group was the first to apply machine learning approaches to the identification of vaccine candidates in an RV pipeline. The current study aims to dramatically enhance RV by increasing the size of the training data, expanding the number of bioinformatics programs with biological relevance used for protein annotation, and employing nested cross-validation. A literature search identified 200 vaccine candidates, each defined as a protein that resulted in significant protection in an animal model following immunization and subsequent challenge with a bacterial pathogen. This positive training data was twinned with negative training data and annotated with 30 bioinformatic tools capable of annotating protein data to derive a total of 200 annotation features. A support vector machine was trained on this data and compared to previous analyses that used smaller training data sets, fewer protein annotation tools, and improper models of cross-validation. Although nested cross-validation led to a reduction in accuracy compared to previous methods (which were overfit), increasing the size of the training data set and expanding the number of protein annotation tools led to higher accuracies (>92%). In conclusion, we have dramatically improved previous RV approaches such that our trained classifier can now be used to select novel vaccine candidates in the genomes of bacterial pathogens for validation in animal models.

OP12 Pathway relevance ranking for tumor samples through network-based data integration
Date: Sunday, July 12, 11:40 am - 12:00 pm
Room: Wicklow Hall 2B
Lieven Verbeke, Ghent University / iMinds / IBCN, Belgium
Jimmy Van den Eynden, Ghent University / iMinds / IBCN, Belgium
Piet Demeester, Ghent University / iMinds / IBCN, Belgium
Kathleen Marchal, / iMinds / IBCN, Belgium
Jan Fostier, / iMinds / IBCN, Belgium
We present a new pathway relevance ranking method that is able to prioritize pathways according to the information contained in any combination of tumor-related omics datasets. Key to the method is the conversion of all available data into a single network representation containing not only genes but also individual patient samples. Additionally, all data are linked through a network of previously identified molecular interactions. The performance of the new method is demonstrated by applying it to breast and ovarian cancer datasets from The Cancer Genome Atlas. By integrating gene expression, copy number, mutation and methylation data, we illustrate the method’s potential to identify key pathways involved in breast cancer development that are shared by different molecular subtypes. Interestingly, certain pathways were ranked equally important for different subtypes, even when the underlying (epi)-genetic disturbances were diverse. The pathway ranking method was also able to identify subtype-specific pathways. Often the score of a pathway could only be explained by a combination of genetic and epi-genetic disturbances, stressing the need for a network-based data-integration approach. The analysis of ovarian tumors, as a function of survival-based subtypes, demonstrated the method’s ability to correctly identify key pathways, irrespective of tumor subtype. A differential analysis of survival-based subtypes revealed several pathways with higher importance for the bad-outcome patient group than for the good-outcome patient group. Many of the pathways exhibiting higher importance for the bad-outcome patient group could be related to ovarian tumor proliferation and survival.

OP13 ContiBAIT: An R Package for Genome Finishing Using Strand-seq
Date: Sunday, July 12, 12:00 pm - 12:20 pm
Room: Wicklow Hall 2B
Kieran O’Neill, British Columbia Cancer Agency, Canada
Mark Hills, British Columbia Cancer Agency, Canada
Peter Lansdorp, British Columbia Cancer Agency, Canada
Ryan Brinkman, British Columbia Cancer Agency, Canada
Strand-seq is a method for directional, low-coverage sequencing of DNA
template strands in single cells. Taken together, strand-seq data from
cells from the same organism provide genomic distance information.
This can be used to improve the quality of early-build reference
genomes made up of many contigs with no bridging sequence, firstly by
grouping contigs from the same chromosome together, and secondly by
ordering contigs within chromosomes. We present ContiBAIT, an R
package for performing these tasks.

For grouping contigs into chromosomes, ContiBAIT uses a custom
clustering method based on a Chinese restaurant process. Contigs are
then reoriented using a greedy algorithm which optimises for global
inter-contig distance. Contig groups showing close strand similarity
following reorientation are merged.

For ordering contigs within a putative chromosome, ContiBAIT computes
the strand distance between all pairs of contigs. The problem then
becomes one of finding the lowest-weight Hamiltonian path over the
contigs, which can be reformulated into a travelling salesman problem.
ContiBAIT then finds the best ordering of contigs using the TSP

To validate contig clustering, we applied ContiBAIT to an early build
of the mouse genome (mm2), with coordinates lifted over to mm10.
ContiBAIT was able to assign most contigs with sufficient read depth
for strand-seq analysis to the correct chromosome (median

To validate contig ordering, we applied ContiBAIT to artificial
contigs sampled from mm10, of sizes 1MB, 500kB and 250kB. Some
chromosomes were well-ordered (Pearson's rho=0.99), while others had
large sections locally well-ordered but incorrectly ordered relative
to each other.
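
The reformulation mentioned in the ordering step above (lowest-weight Hamiltonian path recast as a travelling salesman problem) can be sketched by adding a dummy node at zero distance from every contig, solving the closed tour, and cutting the tour at the dummy. The brute-force search below is only practical for a handful of contigs and is an illustrative stand-in, not ContiBAIT's solver; the contig names and distances are hypothetical.

```python
from itertools import permutations


def order_contigs(names, dist):
    """Order contigs by solving the lowest-weight Hamiltonian path as a TSP
    with one extra dummy node that is at distance 0 from every contig."""
    n = len(names)
    dummy = n  # index of the artificial node

    def d(i, j):
        return 0.0 if dummy in (i, j) else dist[i][j]

    best_tour, best_cost = None, float("inf")
    # Fix the dummy as the tour start and enumerate every order of real contigs.
    for perm in permutations(range(n)):
        tour = (dummy,) + perm
        cost = sum(d(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour)))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    # Removing the dummy from the closed tour leaves the open path.
    return [names[i] for i in best_tour[1:]], best_cost


# Hypothetical contigs whose true positions along a chromosome are 2, 0, 3, 1.
names = ["c2", "c0", "c3", "c1"]
positions = [2, 0, 3, 1]
dist = [[abs(a - b) for b in positions] for a in positions]
path, cost = order_contigs(names, dist)
```

Because the two edges touching the dummy cost nothing, the cheapest closed tour corresponds exactly to the cheapest open path. The recovered order may come out in either orientation.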

OP14 Novel brain-specific miRNA discovery using small RNA sequencing in post-mortem human brain
Date: Sunday, July 12, 12:00 pm - 12:20 pm
Room: Wicklow Hall 2B
Christian Wake, Boston University, United States
Adam Labadorf, Boston University, United States
Alexandra Dumitriu, Boston University, United States
Andrew Hoss, Boston University, United States
Richard Myers, Boston University, United States
MicroRNAs (miRNA) are short non-coding RNAs that regulate gene expression mainly through translational repression of target mRNA molecules. More than 2700 human miRNAs have been identified and some are known to display tissue-specific patterns of expression. Here, we use high-throughput small RNA sequencing to discover novel and possibly brain-specific miRNAs in 94 human post-mortem prefrontal cortex samples from patients with Huntington's disease and Parkinson's disease and normal neuropathology. Using a custom analysis pipeline, we identified 66 novel miRNA candidates that originate in both intergenic and intragenic regions of the genome. 21 of the candidate miRNAs show sequence similarity with known mature miRNA sequences and may be novel members of known miRNA families, while the remaining 45 may constitute previously undiscovered families of miRNAs that are specific to the brain. In a small number of these novel miRNAs, preliminary differential expression analysis between neurodegenerative disease and normal samples identified differences in expression. These results suggest that a portion of these novel miRNAs may not only be unique to brain, but may have a role in the neurodegenerative disease processes.

OP15 Computationally efficient approach for novel transcript discovery across large RNA-seq dataset reveals glioblastoma-associated lncRNAs
Date: Sunday, July 12, 12:00 pm - 12:20 pm
Room: Wicklow Hall 2B
Maria Laaksonen, BioMediTech, University of Tampere, Finland
Antti Ylipää, 1) BioMediTech, University of Tampere 2) Department of Signal Processing, Tampere University of Technology, Finland
Janne Seppälä, BioMediTech, University of Tampere, Finland
Tommi Rantapero, BioMediTech, University of Tampere, Finland
Kirsi Granberg, 1) BioMediTech, University of Tampere 2) Department of Signal Processing, Tampere University of Technology, Finland
Matti Nykter, BioMediTech, University of Tampere, Finland
The availability of RNA-sequencing data from human tumors and normal tissues has resulted in the discovery of hundreds of tissue-specific transcripts. Uncovering novel transcripts typically requires computationally expensive de novo transcriptome assembly, and combining assemblies across samples has proven challenging. To be able to search for new transcripts in large RNA-seq cohorts, we developed a computational approach that directly identifies unannotated genomic loci that are variably expressed within a sample set, or differentially expressed between two sample sets. These loci are then subjected to gene structure analysis, allowing identification of full transcript structures in a data-driven manner. Our approach was validated by re-discovering a set of well-annotated genes. We were able to correctly rebuild known gene structures and identify the typical structural features of protein-coding genes even when only a single exon of the gene was given as input.

We applied our approach to RNA-seq data of 169 primary glioblastoma samples from The Cancer Genome Atlas (TCGA). We identified 53 unannotated transcripts that did not contain good quality open reading frames, indicating that they were lncRNAs. The expression of 20 out of 22 high confidence lncRNAs was validated by PCR in at least one glioblastoma cell line. Clinical association analyses in the TCGA glioma cohort revealed that a subset of lncRNA expression profiles associates with patient survival, tumor grade and/or IDH1 mutation status. The functional analysis of lncRNA knockdowns was performed in glioblastoma cells to evaluate their significance in disease aggressiveness.

OP16 Tau Protein Related Acetylation of Histone 3 Lysine 9 in the Human Brain
Date: Sunday, July 12, 12:20 pm - 12:40 pm
Room: Wicklow Hall 2B
Hans-Ulrich Klein, Harvard Medical School, United States
Cristin McCabe, Broad Institute, United States
Jishu Xu, Brigham and Women's Hospital and Harvard Medical School, United States
David Bennett, Rush University Medical Center, United States
Philip DeJager, Brigham and Women's Hospital and Harvard Medical School, United States
Accumulation of tau proteins and amyloid-β peptides in the brain are two hallmarks of Alzheimer’s Disease (AD). Recent studies suggest that epigenetic mechanisms are likely to play a key role in the pathogenesis of AD. Here, we profiled the active mark H3K9ac genome-wide using ChIP-seq in 669 post-mortem human brain samples to detect alterations of the epigenome induced by tau. RNA-seq was performed for 500 samples to assess the effect on transcription. We considered modifications of local H3K9ac domains as well as large genomic regions and distinguished alterations primarily associated with tau from those associated with amyloid.

We identified 26,384 H3K9ac domains which primarily occurred at promoters (15,225) and enhancers (8,071). H3K9ac levels at promoters were positively correlated with transcription, even though H3K9ac alone was not sufficient for transcription. Tau protein loads were significantly associated with H3K9ac levels in 5,980 domains and had a much broader impact than amyloid (610 domains). Domains positively associated with tau showed a strong enrichment (p<10^-16) for binding sites of CTCF, which regulates chromatin structure. Indeed, we found large genomic regions showing concordant tau associated increases in H3K9ac. Average transcription in these regions was consistently up-regulated. Strikingly, effect sizes within the regions were highly correlated with the regions' proportion of open chromatin.

Our results demonstrate a genome-wide change in chromatin structure in AD, which is mediated by tau. Tau is known to cause heterochromatin relaxation in Drosophila models. CTCF could be a key factor in the pathogenic process of chromatin opening.

OP17 Low concordance of differential DNA methylation analysis methods
Date: Sunday, July 12, 12:20 pm - 12:40 pm
Room: Wicklow Hall 2B
Helen McCormick, Victor Chang Cardiac Research Institute, Australia
Eleni Giannoulatou, Victor Chang Cardiac Research Institute, Australia
Jennifer Cropley, Victor Chang Cardiac Research Institute, Australia
Catherine Suter, Victor Chang Cardiac Research Institute, Australia
DNA methylation is one of the most widely used markers for the study of epigenetic contributions to phenotypic variation and disease. Several methods for analyzing genome-wide DNA methylation data are in common use, but there has been no rigorous evaluation of their performance. We have performed a systematic assessment and comparison of four packages: methylSig, methylKit, eDMR and DSS, using an empirical dataset of 12 reduced representation bisulphite sequencing libraries (6 test, 6 control). Surprisingly, we observed very low concordance among these commonly used model-based and binomial test-based approaches: using equivalent pre-processing and filtering parameters for each method, we found that the four methods identified significant differentially methylated cytosines at a concordance rate of less than 1%. Similarly low levels of concordance were observed in the identification of differentially methylated regions using tiled data. Our study highlights the need for systematic approaches to reliable differential methylation analysis via data simulation. This concept of simulation will be discussed in the context of the growing implementation of epigenomic data in human medicine.
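
One way to make the concordance figure concrete is to treat it as the share of all called cytosines that every method agrees on. The sketch below assumes that definition (intersection over union of the per-method call sets); the abstract does not state the exact formula used, and the site identifiers are invented for illustration.

```python
from typing import Dict, Set


def concordance(calls: Dict[str, Set[str]]) -> float:
    """Fraction of sites called by any method that are called by all methods."""
    call_sets = list(calls.values())
    if not call_sets:
        return 0.0
    shared = set.intersection(*call_sets)
    union = set.union(*call_sets)
    return len(shared) / len(union) if union else 0.0


# Hypothetical calls: the site IDs are made up, not taken from the study.
example_calls = {
    "methylSig": {"chr1:100", "chr1:250", "chr2:40"},
    "methylKit": {"chr1:100", "chr2:40", "chr3:7"},
    "eDMR": {"chr1:100", "chr4:900"},
    "DSS": {"chr1:100", "chr2:40", "chr5:12"},
}

print(f"concordance = {concordance(example_calls):.3f}")
```

Under this definition a single method with few overlapping calls pulls the rate down sharply, which is one reason concordance across four tools can be far lower than any pairwise agreement.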

OP18 Computational method for detecting patterns of epigenetic changes from time series ChIP-seq data
Date: Sunday, July 12, 12:20 pm - 12:40 pm
Room: Wicklow Hall 2B
Petko Fiziev, University of California, Los Angeles, United States
Jason Ernst, University of California, Los Angeles, United States
Histone modifications associate with important regulatory regions such as promoters and distal enhancers that control the expression of genes. Time-course genome-wide maps of these epigenetic marks have become available in a growing number of biological settings including stem cell reprogramming and differentiation, adipogenesis, cardiac development, circadian rhythms, embryogenesis and lymphocyte development. However, our understanding of the underlying cellular processes remains limited, because the current bioinformatics tools often fail to utilize fully the temporal aspects of this data. Here, we present a novel computational method for systematic detection of major classes of spatio-temporal patterns of epigenetic changes. The method takes as input data from a series of chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) experiments for a single histone mark that are performed at consecutive time points during a given biological process. The method uses a probabilistic mixture model that explicitly models the spatio-temporal nature of the data to identify regions for which the mark either expands or contracts significantly with time or holds steady. Furthermore, it incorporates information about replicate experiments at each time point, which can increase the accuracy of the method. We present applications of the method on publicly available data from T-cell development, which help in understanding the underlying regulatory dynamics during this process.

OP19 Human paralog genes share regulatory elements and co-localize in the three-dimensional chromatin architecture
Date: Sunday, July 12, 2:00 pm - 2:20 pm
Room: Wicklow Hall 2B
Jonas Ibn-Salem, Johannes Gutenberg University, Germany
Miguel Andrade-Navarro, Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
Paralog genes arise from gene duplication events during evolution. The resulting sequence similarity between paralogs often leads to proteins of similar structures and functions in common pathways. It might therefore be useful for the cell to have paralog genes co-regulated. However, since paralog genes often also show slightly different functions, for example through alternative domains, it might also be useful for cells to express only one out of several paralogs for a specific function or response.
Eukaryotic genes are regulated by the binding of transcription factors to distal enhancer elements, which form looping interactions with the transcription machinery at gene promoters. We hypothesised that paralog genes share common regulatory mechanisms that allow both co-regulation and exclusive expression.

To test this hypothesis, we integrated paralogy annotations with genome-wide data-sets of enhancer-promoter associations and genome-wide chromatin interaction data from Hi-C experiments in human cells.

With carefully sampled control data sets that take linear co-localisation of paralogs into account, we show that paralog gene pairs share a significant number of common enhancer elements. Furthermore, they are located in the same topologically associating domain significantly more often than expected and therefore cluster not only in the linear genome but also in the three-dimensional chromatin structure of the nucleus.

Together, our results indicate that human paralog gene pairs share common regulatory mechanisms. We will further integrate expression data from different tissues and functional annotation of genes to support our finding that paralog genes tend to be expressed either collectively or exclusively, depending on the cell's functional needs.

OP20 Comprehensive analysis of association between heterogeneity and translation of 5’ leaders
Date: Sunday, July 12, 2:00 pm - 2:20 pm
Room: Wicklow Hall 2B
Paul Korir, University College Cork, Ireland
Pavel Baranov, University College Cork, Ireland
There is overwhelming evidence of translation of upstream Open Reading Frames (uORFs) in the 5’ leaders of many mammalian mRNAs. The translation of uORFs often inhibits translation of annotated coding ORFs (acORFs) and allows for regulation in response to changes in cellular conditions. We hypothesised that 5’ leader heterogeneity among alternative transcripts (due to alternative transcription initiation and splicing) is associated with the synthesis of mRNAs that code for the same protein product but are regulated differently depending on the uORF organization of their 5’ leaders.

To explore the relationship between translation of 5’ leaders and their heterogeneity, we carried out bioinformatic analyses using publicly available ribosome footprinting data. Our analyses involve identifying high-confidence translated regions and then estimating various facets of heterogeneity of 5’ leaders across alternative transcripts. We devised a simple peak-calling method for ribosome footprints in 5’ leaders, treating such peaks as a proxy for 5’ leader translation. We defined heterogeneity on the set of transcript isoforms associated with a pair of translation termination sites for non-overlapping genes. We reasoned that such an approach would emphasise the effect of heterogeneity because such transcripts differ only in mRNA leader regions, which confer the bulk of regulatory activity. We examined the effect on translation of several key aspects of heterogeneity, such as alternative initiation and/or splicing, mean leader length across isoforms, sequence content (uAUGs, GC content, regulatory motifs such as terminal-oligopyrimidine (TOP) tracts, codon bias), and secondary structure. Finally, we performed functional analyses on extreme cases of heterogeneity to identify enriched gene categories.

OP21 A systems biology characterization of the anti-cancer compound Vorinostat
Date: Sunday, July 12, 2:00 pm - 2:20 pm
Room: Wicklow Hall 2B
Christopher Woelk, University of Southampton, United Kingdom
Cory White, UCSD, United States
Harvey Johnston, University of Southampton, United Kingdom
Celsa Spina, UCSD, United States
Douglas Richman, UCSD, United States
Spiro Garbis, University of Southampton, United Kingdom
Nadejda Beliakova-Bethell, UCSD, United States
Vorinostat is a histone deacetylase inhibitor (HDACi) used to treat refractory cutaneous T-cell lymphoma (CTCL) and is being investigated as a component of “shock and kill” strategies to cure HIV. Vorinostat inhibits deacetylation, leading to the acetylation of histones and the relaxation of chromatin. However, little is known about other mechanisms of action or the off-target effects of this compound. Therefore, the effects of Vorinostat on primary CD4 T cells were evaluated in a systems biology approach. Cells were isolated from 10 healthy donors and treated with 1µM of Vorinostat for 24 hours or left untreated. Protein extracts from 4 donors were subjected to iTRAQ labeling and characterized by two-dimensional liquid chromatography-mass spectrometry quantitative proteomics. RNA was isolated from 6 donors and subjected to transcriptomic analysis (Illumina HT12 v4 microarrays). Differentially expressed genes (DEGs) and proteins (DEPs), as well as differentially expressed phosphorylated (DPPs) and acetylated (DAPs) proteins were identified using Limma. Data integration was primarily facilitated by using all four data types to construct a single protein interaction network. The addition of proteomic data revealed a much more detailed protein interaction network with the inclusion of many nodes not regulated at the transcriptional level but at the post-translational level. In addition, HMGA1 was differentially expressed at the transcript, protein, and acetylated protein levels. This protein is of particular interest since it may repress transcription from the HIV promoter and thus may limit the effectiveness of Vorinostat in HIV cure strategies.

OP22 A high-resolution gene expression atlas of epistasis between gene-specific transcription factors reveals new mechanisms for genetic interactions
Date: Sunday, July 12, 2:20 pm - 2:40 pm
Room: Wicklow Hall 2B
Patrick Kemmeren, UMC Utrecht, Netherlands
Katrin Sameith, UMC Utrecht, Netherlands
Marian Groot Koerkamp, UMC Utrecht, Netherlands
Dik van Leenen, UMC Utrecht, Netherlands
Mariel Brok, UMC Utrecht, Netherlands
Nathalie Brabers, UMC Utrecht, Netherlands
Philip Lijnzaad, UMC Utrecht, Netherlands
Sander van Hooff, UMC Utrecht, Netherlands
Joris Benschop, UMC Utrecht, Netherlands
Tineke Lenstra, UMC Utrecht, Netherlands
Eva Apweiler, UMC Utrecht, Netherlands
Sake van Wageningen, UMC Utrecht, Netherlands
Berend Snel, Utrecht University, Netherlands
Frank Holstege, UMC Utrecht, Netherlands
Recent studies have systematically exposed large numbers of non-additive genetic interactions, the majority of which are functionally uncharacterized. To investigate such genetic interactions between gene-specific transcription factors (GSTFs) in Saccharomyces cerevisiae, we systematically analysed 72 GSTF pairs by DNA microarray analysis of double and single deletion mutants. These pairs were selected through previously published growth-based genetic interaction as well as through similarity in DNA binding properties. The result is a high-resolution atlas of gene expression-based genetic interactions that provides systems-level insight into GSTF epistasis. The atlas confirms known genetic interactions and exposes new ones. Importantly, the data can be used to elucidate the mechanisms that underlie individual genetic interactions. Evidence is provided for two previously uncharacterized mechanisms, "Buffering by induced dependency" and "Alleviation by derepression". These mechanisms demonstrate how negative genetic interactions can occur between seemingly unrelated pathways and how positive genetic interactions can indirectly expose parallel- rather than same-pathway relationships. The study provides general insights into the complex nature of epistasis and results in new models for genetic interactions, the majority of which do not fall into easily recognizable within- or between pathway relationships.

OP23 A novel approach to identify highly connected and differentially expressed subnetworks reveals underlying biological processes in endometrial cancer metastasis
Date: Sunday, July 12, 2:20 pm - 2:40 pm
Room: Wicklow Hall 2B
Kanthida Kusonmano, University of Bergen, Norway
Mari K Halle, University of Bergen, Norway
Helga B Salvesen, University of Bergen, Norway
Kjell Petersen, University of Bergen, Norway
Differential expression analyses based on high-throughput data have been used to study molecular changes between phenotypes of interest. Beyond the typical analysis, which derives a ranked list of individual genes that differ significantly between the studied conditions, several methods have been developed to identify differentially expressed genes as a set to facilitate functional interpretation. One main approach is gene set analysis, which evaluates functional enrichment of differentially expressed genes based on publicly available gene sets. Network analysis, meanwhile, tries to identify functional modules of genes and their interactions based on the data under study. Here we present a novel approach that combines the features of both gene set and network analyses by detecting subnetworks based on internal relations of the studied data and assessing their differential expression using a well-known gene set method, Gene Set Enrichment Analysis (GSEA). The subnetworks are derived by integrating a priori gene-gene interactions (here we used protein-protein interactions) and expression correlations. We demonstrate our approach on endometrial cancer data comparing aggressive primary tumors and metastases. The detected differentially expressed subnetworks provide biological insights into metastatic settings and display interesting expression trends across levels of tumor aggressiveness. A few subnetworks also have significant links to patient disease-specific survival. The study provides findings in the metastatic context that merit follow-up studies.

OP24 Rapamycin treatment of normal human fibroblasts increases the transcriptional abundance of genes involved in cytokine-cytokine receptor signaling
Date: Sunday, July 12, 2:20 pm - 2:40 pm
Room: Wicklow Hall 2B
Kimberly MacKay, University of Saskatchewan, Canada
Zoe Gillespie, University of Saskatchewan, Canada
Brett Trost, University of Saskatchewan, Canada
Christopher Eskiw, University of Saskatchewan, Canada
Anthony Kusalik, University of Saskatchewan, Canada
Background: Rapamycin is an immunosuppressant drug that is currently used to prevent transplant organ rejection. It is additionally being investigated as a potential therapy for many other diseases. The effect it has on cytoplasmic and genomic function has been extensively studied in model organisms. However, it is unclear what effect rapamycin has on gene expression in normal human primary cells.

Objective: To determine the global impact rapamycin has on gene expression in normal human fibroblasts.

Methods: RNA-seq was performed on proliferative and rapamycin-treated human fibroblasts. SeqMonk was used to calculate the fold-change difference in transcriptional abundance by comparing the read counts of the two datasets. A protein interaction network was constructed based on the genes that had at least a 5-fold change in transcriptional abundance using Cytoscape and ReactomeFI. The resultant network was annotated using biological process, molecular function and cellular component terms from the Gene Ontology Consortium as well as pathway annotation terms from the Kyoto Encyclopedia of Genes and Genomes.
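
The fold-change filter described in the Methods can be illustrated with a small sketch. It assumes simple library-size normalisation (counts per million) and a pseudocount of 1. This is not SeqMonk's own quantitation, which the abstract does not detail, and the gene names and read counts are invented.

```python
from typing import Dict, List


def fold_changes(control: Dict[str, int], treated: Dict[str, int],
                 pseudocount: float = 1.0) -> Dict[str, float]:
    """Treated/control ratio of library-size-normalised counts per gene."""
    control_total = sum(control.values())
    treated_total = sum(treated.values())
    result = {}
    for gene in control.keys() & treated.keys():
        # Counts per million, with a pseudocount to avoid division by zero.
        c = (control[gene] + pseudocount) / control_total * 1e6
        t = (treated[gene] + pseudocount) / treated_total * 1e6
        result[gene] = t / c
    return result


def changed_genes(fc: Dict[str, float], threshold: float = 5.0) -> List[str]:
    """Genes whose abundance rose or fell by at least `threshold`-fold."""
    return sorted(g for g, r in fc.items()
                  if r >= threshold or r <= 1.0 / threshold)


# Toy read counts (invented); both libraries have the same total depth.
control = {"geneA": 99, "geneB": 499, "geneC": 9, "geneD": 393}
treated = {"geneA": 599, "geneB": 49, "geneC": 9, "geneD": 343}

print(changed_genes(fold_changes(control, treated)))
```

Genes passing such a filter in either direction would then be the input to network construction and annotation.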

Conclusions: Rapamycin treatment of normal human fibroblasts resulted in 537 genes having a 5-fold or greater change in transcriptional abundance. The network analysis revealed a significant enrichment for genes associated with PI3K-AKT signaling, linking our observations to rapamycin’s established cytoplasmic target. The most significant pathway annotation was cytokine-cytokine receptor interaction with many of these genes belonging to the Interleukin-6 signaling pathway. It is possible that prolonged exposure to rapamycin and the production of cytokines like Interleukin-6 could produce sufficient cellular stress to drive normal human primary cells into senescence.

OP25 DNA methylation-dependent transcription regulatory networks elucidate dynamics of transcription regulatory circuitry in cancers
Date: Sunday, July 12, 2:40 pm - 3:00 pm
Room: Wicklow Hall 2B
Xuerui Yang, Tsinghua University, China
Yu Liu, Tsinghua University, China
Yang Liu, Tsinghua University, China
Zhengtao Xiao, Tsinghua University, China
Shengcheng Dong, Tsinghua University, China
Context-dependent DNA methylation plays a critical role in regulating gene transcription, thereby serving as an important epigenetic marker or regulator in many biological processes and complex diseases such as cancer. However, DNA methylation has rarely been taken into account as a significant factor in de novo reconstructions of cancer type-specific transcription regulatory networks. The present study set out to systematically assess the involvement of DNA methylation in transcription regulatory circuitry in cancer. We took advantage of the multi-dimensional profiling data on DNA methylation and gene expression in tumors of different cancers from The Cancer Genome Atlas consortium, and developed an integrative analysis pipeline based on conditional mutual information to quantify the cooperative regulatory effects of CpG site methylation and transcription factor activity on gene expression. Our genome-wide analysis shows that DNA methylation and transcription factors indeed cooperate to control gene expression. To map the interplay between these two major defining factors of gene expression, a DNA Methylation-dependent Transcription Regulatory Network (MeTRN), the first of its kind, was assembled for each of 19 major cancer types and broadly validated using public ChIP-seq and DNaseI-seq data. Comparison of these networks across cancer types showed that the context-specificity of transcriptional circuits can be largely attributed to the context-dependent nature of DNA methylation patterns. In summary, MeTRN recapitulates an epigenetic scheme that implements dynamics of transcription regulatory circuitry across cancers via context-dependent DNA methylation marks, and thereby serves as a new basis for further mechanistic studies of gene expression dysregulation in cancers.
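
For readers unfamiliar with conditional mutual information, the following is a minimal sketch of a plug-in estimate of I(X;Y|Z) on discretised data. Reading X as transcription factor activity, Y as target expression and Z as methylation state is an assumption made for illustration; the abstract does not describe how the pipeline discretises or conditions the data, and the toy values are invented.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def conditional_mutual_information(x: Sequence[Hashable],
                                   y: Sequence[Hashable],
                                   z: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y|Z) in bits from paired discrete samples."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), count in n_xyz.items():
        # p(x,y,z) * log2(p(z) p(x,y,z) / (p(x,z) p(y,z))); the 1/n factors
        # inside the logarithm cancel, leaving a ratio of raw counts.
        ratio = n_z[zi] * count / (n_xz[(xi, zi)] * n_yz[(yi, zi)])
        cmi += (count / n) * log2(ratio)
    return cmi


# Toy data (invented): expression follows the TF only when the site is
# unmethylated (z == 0) and is unrelated to it when methylated (z == 1).
methylation = [0, 0, 0, 0, 1, 1, 1, 1]
tf_activity = [0, 0, 1, 1, 0, 0, 1, 1]
expression = [0, 0, 1, 1, 0, 1, 0, 1]

print(conditional_mutual_information(tf_activity, expression, methylation))
```

In this toy case all of the dependence comes from the unmethylated samples, which is the kind of methylation-dependent regulation such a measure is meant to expose. Real data would need a continuous or binned estimator and a significance test.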

OP26 Next generation sequencing of human tumor xenografts is significantly improved by prior depletion of mouse cells
Date: Sunday, July 12, 2:40 pm - 3:00 pm
Room: Wicklow Hall 2B
Stefan Tomiuk, Miltenyi Biotec GmbH, Germany
David Agorku, Miltenyi Biotec GmbH, Germany
Kerstin Klingner, Oncotest GmbH, Germany
Stefan Wild, Miltenyi Biotec GmbH, Germany
Silvia Rüberg, Miltenyi Biotec GmbH, Germany
Lisa Zatrieb, Miltenyi Biotec GmbH, Germany
Andreas Bosio, Miltenyi Biotec GmbH, Germany
Julia Schueler, Oncotest GmbH, Germany
Olaf Hardt, Miltenyi Biotec GmbH, Germany
Human tumor xenografts represent the gold standard method for research areas such as drug discovery, cancer stem cell biology, and metastasis prediction.
During the growth phase in vivo, xenografted tissue is vascularized and infiltrated by cells of mouse origin. Due to this, a strong impact of mouse-derived reads on downstream NGS analyses can be expected.
To overcome these limitations, we have developed a fast and easy method (MCD) allowing for the comprehensive depletion of mouse cells by using automated tissue dissociation and magnetic cell sorting (MACS). We have performed whole exome sequencing of bulk human tumor xenografts from lung, bladder, and kidney cancer, and compared the results to samples depleted of mouse cells. A significant increase in read counts (33%) was observed after MCD, indicating improved sample quality.
We mapped the reads of all samples against human and mouse genomes and determined their putative origin. An average of 12% of reads derived from non-depleted samples was assigned to mouse cells. This amount could be reduced to 0.3% by MCD.
A strong impact of MCD was observed on SNP calling: 63 ± 10% of all SNPs predicted for the non-depleted samples could no longer be detected after MCD, while 18 ± 1% were specific to the depleted xenograft samples, probably due to higher coverage.
Taken together, MCD significantly improves the analysis of human tumor xenografts by NGS. As this effect was observed even though a human-sequence-specific selection had been carried out during exome enrichment, the influence on whole genome and transcriptome sequencing is expected to be even more prominent.

OP27 The Developmental Transcriptome for Lytechinus variegatus
Date: Sunday, July 12, 2:40 pm - 3:00 pm
Room: Wicklow Hall 2B
Emily Speranza, Boston University, United States
John Hogan, Boston University, Bioinformatics Program, United States
Jessica Keenan, Boston University, Bioinformatics Program, United States
Lingqi Luo, Boston University, Bioinformatics Program, United States
Akhil Saji, Boston University, Biology Department, United States
Mary Ann Sundermeyer, Boston University, Biology Department, United States
Daphne Schatzberg, Boston University, Biology Department, United States
Michael Piacentino, Boston University, Molecular and Cellular Biology and Biochemistry Program, United States
Daniel Zuch, Boston University, Program in Molecular and Cellular Biology and Biochemistry, United States
Amanda Core, Boston University, Biology Department, United States
Jose Horacio Grau, Dahlem Center for Genome Research and Medical Systems Biology, Germany
Bernd Timmermann, Sequencing Core Facility, Max Planck Institute for Molecular Genetics, Germany
Albert Poustka, Dahlem Center for Genome Research and Medical Systems Biology, Germany
Cynthia Bradham, Boston University, Biology Department, United States
Embryonic development is arguably the most complex process an organism undergoes during its lifetime. Understanding development is best approached with a systems-level perspective. The sea urchin has become a valuable model organism for understanding developmental specification, morphogenesis, and evolution. As a non-chordate deuterostome, the sea urchin occupies an important evolutionary niche between protostomes and vertebrates. Lytechinus variegatus (Lv) is an Atlantic Ocean species that has been studied for a number of years, and has provided important insights into signal transduction, patterning, and morphogenetic changes during embryonic/larval development. The Pacific Ocean species, Strongylocentrotus purpuratus (Sp), is well-studied particularly for gene regulatory networks and cis-regulatory analyses. A well-annotated genome and transcriptome for Sp are available, but similar resources have not been developed for Lv. Here, we provide analysis of the Lv transcriptome at 11 time points during embryonic/larval development. Based on analysis for the expression of a conserved set of genes, we find that the late pluteus larval stage most closely matches the phylotypic vertebrate pharyngula stage, suggesting that conservation of this temporal gene expression pattern predates the appearance of the chordates. Using principal component analysis, we show that the major transitions in variation of embryonic transcription divide the developmental time series into four temporally sequential groups, which is corroborated by k-means cluster analysis, specification network analysis, and metabolic network analysis. Together, these analyses indicate that sea urchin development includes sequential intervals of relatively stable gene expression states punctuated by more abrupt transitions.

OP28 Protein interaction interfaces and genetic variation
Date: Sunday, July 12, 3:30 pm - 3:50 pm
Room: Wicklow Hall 2B
Fábio Madeira, United Kingdom
Geoffrey Barton, Division of Computational Biology, College of Life Sciences, University of Dundee, United Kingdom
There are currently more than 62 million Single Nucleotide Polymorphisms (SNPs) known and this number is doubling every two years stimulated by the falling cost of sequencing. Although many methods have been developed to predict the effect of non-synonymous SNPs on biological function and disease, few have focused on SNPs at protein-protein and protein-ligand interaction interfaces. Interfaces are essential sites for protein function and adaptation, and key in a majority of biological processes. The effects of non-disease intra- and inter-species variation occurring in such interaction surfaces remain mostly unexplored. The availability of over 105,000 protein three-dimensional structures allows the structural context of many SNPs at interfaces to be examined in atomic detail. Here, we present ProIntVar, a computational framework for mapping SNPs onto structure in order to study the features of variation at protein-protein and protein-ligand interfaces. ProIntVar allows the systematic analysis of genetic variation in protein structure interaction surfaces by integrating structural and sequencing data from several biological databases and resources. Genetic variants are analyzed in the context of functional families (FunFams), which are derived from structurally and functionally related protein domains classified in CATH (Class, Architecture, Topology, Homology). Examination of variation in protein interaction interfaces helps to infer which key residues are important for the function of the interface in a broader evolutionary sense. This approach has the potential to identify correlated adaptation, susceptibility to disease and unspecific protein-drug interactions in the human population that are due to sequence variation.

OP29 Length-independent canonical forms of antibody Complementarity Determining Regions
Date: Sunday, July 12, 3:30 pm - 3:50 pm
Room: Wicklow Hall 2B
Jaroslaw Nowak, University of Oxford, United Kingdom
Terry Baker, UCB Pharma Ltd, United Kingdom
Guy Georges, Roche Diagnostics GmbH, Germany
Stefan Klostermann, Roche Diagnostics GmbH, Germany
Jiye Shi, UCB Pharma Ltd, United Kingdom
Sudharsan Sridharan, MedImmune, United Kingdom
Charlotte Deane, University of Oxford, United Kingdom
Antibodies are Y-shaped proteins used by the immune system to bind and potentially neutralize foreign objects (antigens) that have entered the body. The antigen combining site of an antibody consists primarily of six hypervariable loops (L1-L3, H1-H3), known as the Complementarity Determining Regions or CDRs. Together, these determine an antibody's binding properties. Five out of the six CDRs (L1, L2, L3, H1, H2) form only a small number of discrete conformations called canonical classes. Previous work in this area assumes that CDRs of different lengths should, by default, belong to different classes. We exploited dynamic time warping, an algorithm originally designed for comparing temporal sequences varying in speed, to measure similarity between loops of different lengths and used density-based clustering to classify CDRs into length-independent canonical classes. The concept of length-independence allows us to cluster a larger number of CDRs into a smaller number of classes than the length dependent approach. In comparison to the length-dependent approach, it also improves the accuracy of canonical class prediction from sequence. We have also found that CDRs of different lengths that are co-clustered tend to show similar sequence patterns, even when they are coded by genes from different subgroups, pointing to a greater functional redundancy in the immune loci than previously known.
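
The idea of comparing loops of different lengths with dynamic time warping can be sketched as follows. The example assumes each loop is represented by one 3D coordinate per residue (for instance Cα atoms after superposition) and uses the classic unnormalised DTW recurrence with Euclidean step cost. The abstract does not specify the features, step pattern or normalisation actually used, and the coordinates are invented.

```python
from math import dist, inf
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def dtw_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Classic dynamic time warping cost between two 3D point sequences."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # b[j-1] is reused
                                    cost[i][j - 1],      # a[i-1] is reused
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Invented toy loops: the longer one repeats a point of the shorter one,
# and the shifted one is the shorter loop moved by one unit along y.
loop_short = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
loop_long = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
loop_shifted = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]

print(dtw_distance(loop_short, loop_long))
print(dtw_distance(loop_short, loop_shifted))
```

A matrix of such pairwise distances could then be passed to a density-based clustering method, which is the second step the abstract describes.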

OP30 Determining the winning SH3 coalition: how cooperative game theory reveals the importance of domain residues in peptide binding
Date: Sunday, July 12, 3:30 pm - 3:50 pm
Room: Wicklow Hall 2B
Ashley Conard, United States
Elisa Cilia, Université Libre de Bruxelles, Belgium
Tom Lenaerts, Université Libre de Bruxelles, Belgium
Cell signaling relies on protein-protein and protein-peptide interactions involving signaling domains, which typically recognize specific peptide motifs. For instance, SH3 domains bind preferentially to proline-rich amino-acid motifs. Phage-display experiments allow one to determine those motifs and whether surface or core domain mutants gain or lose preference for peptide motifs. Here, we present an approach utilizing the Shapley Value (SV) from Cooperative Game Theory to determine the importance of seven residues in the Fyn SH3’s hydrophobic core. The core positions and the residues in those positions represent the players of a cooperative game in which the worth of each coalition is measured through its capacity to discriminate the binding and non-binding mutants for certain classes of peptides. The players (positions or residues) can be seen as the features of SH3 mutants in a binary classification task. Essentially, we use a feature selection method based on the SV to assign a pay-off to each core position and residue. We quantify their importance in promoting peptide binding as well as their joint effects and their interactions, represented through networks. Our results provide novel insights suggesting that the Fyn SH3 domain must contain different signatures of amino acids to promote binding to various peptide classes. This analysis highlights residue importance for proper domain function, which helps scale conservation profiles (e.g. WebLogo) by adding functionally relevant properties. These detailed pieces of information contribute an effective and novel approach to understanding the role core residues play, next to normally investigated binding-site residues, in binding specific peptides.
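
The Shapley value itself can be computed exactly when the number of players is small, as with seven core positions. The sketch below enumerates all coalitions. The worth function is a made-up stand-in for the classification-based coalition worth described in the abstract, and the position names are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   worth: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each player by enumerating all coalitions."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition S of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition S.
                total += weight * (worth(s | {p}) - worth(s))
        values[p] = total
    return values


def toy_worth(coalition: FrozenSet[str]) -> float:
    """Invented worth: posA and posB are informative and synergistic,
    posC adds nothing. It stands in for a classifier's discriminative power."""
    if "posA" in coalition and "posB" in coalition:
        return 1.0
    if "posA" in coalition or "posB" in coalition:
        return 0.4
    return 0.0


print(shapley_values(["posA", "posB", "posC"], toy_worth))
```

Exact enumeration costs on the order of 2^n worth evaluations per player, which is manageable for seven positions but would need sampling-based approximation for much larger player sets.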

OP31 Detecting small structural variants with SoftSV using soft-clipping information
Date: Sunday, July 12, 3:50 pm - 4:10 pm
Room: Wicklow Hall 2B
Christoph Bartenhagen, Institute of Medical Informatics, Germany
Martin Dugas, Institute of Medical Informatics, Germany
Numerous tools for the detection of structural variations (SVs) have been developed in recent years, including our own contribution called SoftSV. However, a gap remains between small indels, which can be detected by gapped alignments, and large SVs (many hundreds or thousands of bp), which can be reconstructed from paired-end reads or read-depth information. Filling this gap remains difficult and often demands special algorithms for split-read alignments directly at the breakpoints, which only a few of the published tools provide for this range of SVs.

We initially developed SoftSV for large SVs and have now expanded our approach to small and medium-sized deletions, tandem duplications and inversions (starting at 20 bp). As with large rearrangements, we detect their exact breakpoints under the premise that no threshold filters out SVs with low support or reads with low mapping quality or ambiguous mappings. Our greedy approach exploits any kind of soft-clipped alignment and reconstructs the breakpoint sequence just by comparing the soft-clipped reads at the start and end of an SV.
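
To show where soft-clipping information comes from, the sketch below reads the POS and CIGAR fields of a single SAM record and reports the clipped lengths and the reference positions at which clipping starts. It illustrates the input signal only and is not SoftSV's algorithm. The function and field names are hypothetical, while the CIGAR operation semantics follow the SAM format.

```python
import re
from typing import NamedTuple, Optional

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


class SoftClips(NamedTuple):
    left_len: int               # soft-clipped bases at the read start
    right_len: int              # soft-clipped bases at the read end
    left_break: Optional[int]   # first aligned reference base (1-based)
    right_break: Optional[int]  # last aligned reference base (1-based)


def soft_clips(pos: int, cigar: str) -> SoftClips:
    """Locate candidate breakpoints implied by soft clipping in one record."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside the soft clips; drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    end = pos + ref_len - 1
    return SoftClips(left, right, pos if left else None, end if right else None)


# Invented records: a read clipped on the left and one clipped on the right.
print(soft_clips(100, "10S65M"))
print(soft_clips(100, "60M15S"))
```

Reads clipped at the same reference position from opposite sides are the natural candidates for the comparison of soft-clipped sequences that the abstract describes.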

Using simulated and four real datasets from the 1000 Genomes Project, we evaluate the sensitivity and precision of SoftSV and four other tools. Our results show that sensitive and reliable SV detection is subject to many different factors like read length, coverage and SV type. SoftSV achieved sensitivities and PPVs between 80% and 100% consistently for all SV types on simulated datasets starting at 75bp reads and 10-15x sequence coverage, without requiring any parameter configuration by the user.

SoftSV is freely available at http://sourceforge.net/projects/softsv.

OP32 Using reference-free compressed data structures to analyse thousands of human genomes
Date: Sunday, July 12, 3:50 pm - 4:10 pm
Room: Wicklow Hall 2B
Thomas Keane, Wellcome Trust Sanger Institute, United Kingdom
Zhicheng Liu, Wellcome Trust Sanger Institute, United Kingdom
Dirk Dominic-Dolle, Wellcome Trust Sanger Institute, United Kingdom
Shane McCarthy, Wellcome Trust Sanger Institute, United Kingdom
Richard Durbin, Wellcome Trust Sanger Institute, United Kingdom
We are rapidly approaching the point where we have sequenced the genomes of hundreds of thousands of human individuals. The scale-up of human population sequencing has enabled us to detect sequence variants down to extremely low minor allele frequencies, explore ancient human lineages, and use genomics for screening of disease-causing mutations. The Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a highly compressed, searchable data structure used by read aligners and for de novo assembly. We sought to explore the use of BWTs to store and compress the raw sequencing reads of 26 human populations from 2535 individuals in the 1000 Genomes Project. We show that it is possible to compress the data to 0.09 bytes per bp (including sample meta-data), a much higher level of compression than any of the existing sequencing data formats. A key feature of this population BWT is that as more individuals are added to the structure, identical read sequences are observed and compression becomes ever more efficient. BWTs are inherently reference-free, so one can rapidly query all the raw sequencing data for non-reference haplotypes and viral integrations. We use the BWT to assess the support in the raw data for the predicted 1000 Genomes haplotypes, investigate the population support along different versions of the human reference genome, and evaluate sequence specific to versions of the reference with and without support in the population. We develop methods to derive accurate genotypes for both single-base variants and short indels in a reference-free manner.
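
To make the idea of querying a BWT without a reference concrete, here is a toy Python sketch of the Burrows-Wheeler transform and of FM-index backward search, which counts how often a pattern occurs in the indexed text. It is a textbook construction on a single short string, not the population-scale structure described above; the function names and the uncompressed occurrence tables are simplifications assumed for the example.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    assert "$" not in text
    text += "$"  # sentinel, sorts before A/C/G/T and lowercase letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    counts = Counter(bwt_string)
    # first_row[c] = number of characters in the text that sort before c
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt_string[:i]
    occ = {c: [0] * (len(bwt_string) + 1) for c in counts}
    for i, ch in enumerate(bwt_string):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    lo, hi = 0, len(bwt_string)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

transformed = bwt("banana")
print(transformed, count_occurrences(transformed, "ana"))
```

In a real read collection, the query cost depends on the pattern length rather than on the amount of indexed sequence (given rank structures in place of the plain tables used here), which is what makes counting the raw-read support for a candidate haplotype practical.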

OP33 Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations
Date: Sunday, July 12, 3:50 pm - 4:10 pm
Room: Wicklow Hall 2B
Sergio Pulido Tamayo, KULeuven, Belgium
Aminael Sánchez-Rodríguez, Universidad Técnica Particular de Loja, Ecuador
Toon Swings, KULeuven, Belgium
Bram Van den Bergh, KULeuven, Belgium
Akanksa Dubey, KULeuven, Belgium
Hans Steenackers, KULeuven, Belgium
Jan Michiels, KULeuven, Belgium
Jan Fostier, UGent, Belgium
Kathleen Marchal, UGent, Belgium
Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides the information needed to reconstruct these haplotypes and the frequencies at which they occur. However, this reconstruction is technically non-trivial, especially in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus precludes the reconstruction of genome-wide haplotypes based on sequence overlap information alone.

Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples.

OP34 Simple Rapid RNA-seq Analysis with Unique Gapped q-Grams
Date: Sunday, July 12, 4:10 pm - 4:40 pm
Room: Wicklow Hall 2B
Sven Rahmann, University of Duisburg-Essen, Germany
We present a new simple approach to RNA-seq gene expression analysis that avoids separate read mapping and feature counting by constructing an index with the following property:
Each gene (exons and exon junctions) is represented by its q-grams (substrings of length q, e.g. q=16), or, more generally, by gapped q-grams with a given shape.
These sets of q-grams are reduced to gene-specific ones, i.e., all q-grams that occur in more than one gene are discarded.
Now each of the 4^q possible q-grams is either not present, specific for a gene, or present in more than one gene.
We build an index that recognises the specific q-grams and maps them to their respective genes.
Optimisation of the q-gram mask results in high sensitivity and specificity, as we show with several examples.
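
The index construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shape notation (`#` for a kept position, `_` for a gap), the function names, and the use of a plain dictionary in place of a compact index over the 4^q possible q-grams are all choices made for the example, and reverse-complement handling is ignored.

```python
def gapped_qgrams(seq, shape):
    """Yield the gapped q-grams of seq for a shape such as '##_#'.

    '#' marks a position that is kept; any other character is a gap.
    A shape made only of '#' gives ordinary contiguous q-grams.
    """
    keep = [i for i, c in enumerate(shape) if c == "#"]
    for start in range(len(seq) - len(shape) + 1):
        yield "".join(seq[start + i] for i in keep)

def build_unique_index(genes, shape):
    """Map every q-gram that occurs in exactly one gene to that gene's name."""
    index = {}
    for name, seq in genes.items():
        for gram in set(gapped_qgrams(seq, shape)):
            # None marks a q-gram that has been seen in more than one gene.
            index[gram] = name if gram not in index else None
    return {gram: name for gram, name in index.items() if name is not None}

# Two made-up "genes" sharing the suffix GTAC; the shared 3-grams are dropped.
genes = {"g1": "ACGTAC", "g2": "TTGTAC"}
print(build_unique_index(genes, "###"))
```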

Read mapping becomes particularly simple:
We iterate over a read's q-grams and count the number of hits to each gene. Careful analysis allows us to pick the correct gene (or declare the read unmappable or ambiguous) at unprecedented speed.
We thus obtain raw gene counts in a much simpler and computationally less demanding way than with standard approaches.
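
A minimal Python sketch of this counting step follows. The abstract does not specify the decision rule, so the one used here (a minimum number of hits, and "ambiguous" when two genes tie for the most hits) is an assumption, as are the function names, the `min_hits` parameter and the tiny hand-written index of contiguous 3-grams.

```python
from collections import Counter

def assign_read(read, index, q, min_hits=2):
    """Return the gene a read is assigned to, or 'ambiguous' / 'unmappable'.

    index maps gene-specific q-grams to gene names. The decision rule
    (min_hits, ties are ambiguous) is an illustrative assumption.
    """
    hits = Counter()
    for start in range(len(read) - q + 1):
        gene = index.get(read[start:start + q])
        if gene is not None:
            hits[gene] += 1
    if not hits:
        return "unmappable"
    ranked = hits.most_common(2)
    best_gene, best_count = ranked[0]
    if best_count < min_hits:
        return "unmappable"
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return "ambiguous"
    return best_gene

def count_genes(reads, index, q):
    """Raw counts per gene (plus the 'ambiguous' and 'unmappable' bins)."""
    return Counter(assign_read(read, index, q) for read in reads)

# Hypothetical index of gene-specific 3-grams for two made-up genes.
index = {"ACG": "g1", "CGT": "g1", "TTG": "g2", "TGT": "g2"}
print(count_genes(["ACGT", "TTGT", "GGGG", "ACGTTGT"], index, 3))
```

Each read costs one dictionary lookup per q-gram and no alignment, which is where the saving over separate read mapping and feature counting would come from.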

Further analysis (e.g., differential expression, implicated pathways) can proceed as before.
The poster compares the running time and the obtained counts resulting from our method and standard methods, showing that we achieve equivalent results with much less computational work.

We also outline possible extensions of the approach, including variant-tolerance and fusion gene detection.
Software will be made available under the MIT license.

OP35 Nomenclature of the olfactory receptor gene family
Date: Sunday, July 12, 4:10 pm - 4:40 pm
Room: Wicklow Hall 2B
Tsviya Olender, The Weizmann Institute of Science, Israel
Elspeth Bruford, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, United Kingdom
Doron Lancet, The Weizmann Institute of Science, Israel
Olfactory Receptors (ORs) are G protein-coupled receptors with a crucial role in odor detection. There are ~1000 OR genes and pseudogenes in a typical mammalian genome; however, the number of functional ORs varies among species, reflecting their adaptation to different environments, a process which involves gene duplication/deletion events. While for human the current OR nomenclature is based on sequence similarity classification, for other mammals such a nomenclature has not yet been adopted, thus concealing important structural and functional insights. The difficulty stems from the complex orthology relationships among the ORs. We developed the Mutual Maximum Similarity (MMS) algorithm, a systematic classifier for assigning a human-based nomenclature to any OR gene based on detecting hierarchical similarity relationships between any two species. We used the MMS algorithm to compare mouse and rat OR repertoires to human, dog and opossum, and assigned a symbol to each rodent gene. In mouse, 31% of the symbols assigned were identical to human symbols, reflecting orthology. An additional 63% of the symbols were classified into pre-defined OR subfamilies; the remainder (6%) were classified into novel OR subfamilies. In rat, 86% of the genes were assigned the same symbol as their mouse ortholog. The suggested nomenclature was further supported by synteny and phylogenetic analyses. Using symbol comparison alone, we identified species-specific expansions in mouse, rat and human, demonstrating the power of this unified nomenclature system in generating a framework for studying mammalian OR evolution. This nomenclature will be expanded to other mammals in due course.
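
The abstract does not spell out the MMS algorithm, so the Python sketch below shows only the generic building block its name suggests: finding gene pairs that are each other's most similar match between two species (often called reciprocal best hits). The function name, the gene labels and the similarity scores are invented for the example, and the hierarchical family/subfamily classification is not modelled.

```python
def mutual_best_matches(similarity):
    """Return sorted (a, b) pairs that are each other's best match.

    similarity maps (gene_a, gene_b) to a score, where gene_a comes from
    one species and gene_b from the other. Higher means more similar.
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), score in similarity.items():
        if a not in best_for_a or score > similarity[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > similarity[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)

# Hypothetical scores between two mouse genes (m1, m2) and two human genes.
similarity = {
    ("m1", "h1"): 0.9, ("m1", "h2"): 0.6,
    ("m2", "h1"): 0.8, ("m2", "h2"): 0.7,
}
print(mutual_best_matches(similarity))
```

Here m1 and h1 are mutual best matches, so m1 could inherit the symbol of h1, whereas m2 prefers h1 without being preferred in return and would need a different rule, such as placement into a subfamily.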

OP36 Site-specific evolution of selected post-synaptic protein complexes
Date: Sunday, July 12, 4:10 pm - 4:40 pm
Room: Wicklow Hall 2B
Maciej Pajak, University of Edinburgh, United Kingdom
Clive R. Bramham, University of Bergen, Norway
T. Ian Simpson, University of Edinburgh, United Kingdom
Sequence conservation analysis of proteins belonging to the post-synaptic proteome (PSP) has previously revealed that key synaptic protein classes are present in primitive organisms preceding the emergence of nervous systems.
Recent studies suggest that evolution of the PSP may be responsible for the emergence of complex neural system function and behaviour, but these analyses assess evolution only at the whole-protein level.

We have developed an analysis workflow that integrates codon-resolution selection pressure estimates with domain and motif data to allow refinement of our understanding of domain-centric functionalisation of the PSP.

We show the application of this workflow to the Activity-regulated cytoskeleton protein (Arc) complex, a set of 26 Arc interacting proteins. Arc is highly conserved among placental mammals and plays a significant role in the post-synaptic density as a major regulator of long-term synaptic plasticity, the presumed molecular correlate of memory and learning.

Maximum likelihood phylogenetic inference for proteins of the Arc interactome, followed by site-by-site selection pressure analysis using a fixed effect likelihood methodology, reveals a small set of positively selected sites as well as many regions under strong negative selection pressure. Mapping these sites onto both known and predicted binding domains and post-translational modification sites allows inference of key domain-level functionalisation events during Arc complex evolution and provides a rational basis for prioritising regions for functional studies.