Accepted Posters

Attention Conference Presenters - please review the Speaker Information Page available here.

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category H - 'Metagenomics'

H01 - OncoCis: An annotation tool for cis-regulatory mutations in cancer

Dilmi Perera, University of New South Wales, Australia

Short Abstract: Whole genomes are being sequenced at an accelerated pace, but annotation of mutations and inference of their functional significance remain challenging. Whilst a myriad of tools are available for the annotation of protein-coding mutations, few are suited to annotating non-coding mutations. To this end, we have developed OncoCis, an easy-to-use web service for researchers to annotate and prioritize potential causal cis-regulatory mutations. To annotate non-coding mutations, OncoCis integrates publicly available datasets from genome-wide chromatin accessibility and histone modification experiments obtained from ENCODE and Epigenome Atlas to identify mutations that occur within cis-regulatory regions in a cell-type-specific manner. These mutations are further annotated with sequence conservation scores and searched for possible removal or creation of transcription factor consensus binding motifs. Finally, the GREAT tool is used to map each cis-regulatory mutation to the gene it most likely regulates. If gene expression data are available, a fold change is calculated for each of the mapped genes. We have applied this method to whole genome sequencing data from 21 breast cancer samples obtained from the Sanger Institute. A putative cis-regulatory mutation was identified in one sample that increased the expression of COL9A1 by over 5,000 fold over other samples without the mutation. Significantly, this mutation occurred in a region, COL9A1-95, that showed enhancer activity in transfection assays performed in MCF7 breast cancer cells. This tool will be invaluable to researchers seeking to prioritise large lists of mutations for experimental validation.

H02 - Potential non-B DNA regions in the human genome are associated with higher rate of nucleotide mutation and expression variation

Xiangjun Du, National Institutes of Health, United States

Short Abstract: Non-B DNA structures, such as G-quadruplex, Z-DNA, slipped DNA, cruciform, H-DNA and SIDD, are non-canonical conformations of DNA molecules that have been proposed to play regulatory roles. Regions susceptible to non-B DNA formation are abundant in the human genome and some were shown to be conserved between species. However, potential non-B DNA regions are also indicated to be more mutagenic. Seeking to reconcile these properties, we utilized genomic variants and expression quantitative trait loci (eQTL) data to analyse genome-wide variation propensities of potential non-B DNA regions, their relation to gene expression variation, and GC content dependent distribution. Independent of genomic location, potential non-B DNA regions were enriched in nucleotide variants, regardless of the degree of conservation between human and chimpanzee. We also found depletion of eQTL-associated variants in potential non-B DNA regions. Regarding expression variation, we found that, for potential non-B DNA region types enriched near transcription start sites, genes downstream of potential non-B DNA regions showed higher expression variation between individuals. We propose that the high rate of sequence variation of potential non-B DNA regions makes it beneficial for mammalian cells to adapt to higher expression variability of downstream genes.

H03 - Detection of fusion genes in RNA-seq data

Vladan Arsenijevic, Seven Bridges Genomics, United States

Short Abstract: Recent advances in genomics have shown that gene fusions, also known as chimeras, play an important role in cancer development. In this work, a thorough analysis was carried out to address questions that might help improve the robustness, sensitivity and overall performance of current fusion gene detection software, as well as how the detected chimeras can be represented. Different bioinformatics tools were tested using several RNA-seq samples, yielding, in some cases, significant discrepancies between the results. These differences have been attributed to different gene annotations used in the downsampling analysis. Suggestions are made to alert the scientific community to these particular issues, which may lead to inconsistencies between the fusion genes identified by different tools.

H04 - Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast

Kathleen Marchal, Ghent University, Belgium

Short Abstract: Bulk segregant analysis (BSA) coupled to high-throughput sequencing is a powerful method to map genomic regions associated with phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, and their DNA is extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors.

To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM).
Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed reliable identification of QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method.

EXPLoRA performs at least as well as the state of the art and is robust even at low signal-to-noise ratios, i.e. when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available.
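As an illustration of how an HMM can exploit the dependency between neighboring markers, the following is a minimal sketch and not the EXPLoRA implementation. It assumes a two-state model ("unlinked" and "linked"), binomial emissions on per-marker read counts, and hand-picked allele frequencies and transition probabilities; all names and numbers are hypothetical.

```python
import math


def log_binomial(k, n, p):
    """Log-probability of k superior-parent reads out of n at allele frequency p."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


def viterbi_linkage(markers, allele_freq, stay=0.9):
    """Most likely state path for a list of (superior_reads, total_reads) markers."""
    states = list(allele_freq)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (len(states) - 1))

    def step(prev, cur):
        # Log transition probability between neighboring markers.
        return log_stay if prev == cur else log_switch

    # Uniform prior over states at the first marker.
    k, n = markers[0]
    score = {s: -math.log(len(states)) + log_binomial(k, n, allele_freq[s])
             for s in states}
    backpointers = []
    for k, n in markers[1:]:
        new_score, pointer = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda prev: score[prev] + step(prev, cur))
            new_score[cur] = (score[best_prev] + step(best_prev, cur)
                              + log_binomial(k, n, allele_freq[cur]))
            pointer[cur] = best_prev
        backpointers.append(pointer)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Hypothetical read counts: the fourth marker alone looks ambiguous (14/20).
markers = [(10, 20), (11, 20), (18, 20), (14, 20), (19, 20), (10, 20)]
allele_freq = {"unlinked": 0.5, "linked": 0.9}  # assumed values
path = viterbi_linkage(markers, allele_freq)
print(path)
```

In this toy example the fourth marker is called "linked" because both of its neighbors are strongly linked, even though its own read counts slightly favor the unlinked state. That is the kind of smoothing a marker-by-marker test cannot provide.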

H05 - CNVkit: Copy number variant detection through next-generation DNA sequencing

Eric Talevich, University of California, San Francisco, United States

Short Abstract: Germline copy number variants and somatic copy number alterations are found in many diseases, including cancer. Copy number can be detected using array comparative genomic hybridization (aCGH), a microarray-based assay. Next-generation sequencing is increasingly used to detect germline and somatic point mutations. Copy number can also be estimated from NGS read depth; however, this approach has limitations in the case of targeted resequencing, which leaves gaps in coverage and introduces other biases related to the efficiency of target capture and library preparation.

We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and greater overall support in the larger intronic and intergenic regions. After normalizing coverages to a pooled reference, we evaluate and correct for three biases that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint, and local targeting density. In our validation, CNVkit performed comparably to aCGH on a variety of targeted platforms, including Agilent SureSelect. In particular, we successfully inferred copy number at equivalent to 150-kilobase genome-wide resolution from a platform targeting as few as 293 genes. CNVkit is user-friendly and provides flexible visualizations, detailed reporting of significant features, and export options for compatibility with other software.

Availability: http://bitbucket.org/etal/cnvkit
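The core idea of normalizing coverage to a reference can be illustrated with a short sketch. This is not CNVkit's API and it omits the GC content, target footprint and targeting density corrections described above; the function name and the depth values are hypothetical.

```python
import math
from statistics import median


def log2_ratios(sample_depths, reference_depths):
    """Median-centered log2 coverage ratio per bin (all depths must be > 0)."""
    sample_log2 = [math.log2(d) for d in sample_depths]
    reference_log2 = [math.log2(d) for d in reference_depths]
    # Centering removes the difference in overall sequencing depth.
    sample_center = median(sample_log2)
    reference_center = median(reference_log2)
    return [(s - sample_center) - (r - reference_center)
            for s, r in zip(sample_log2, reference_log2)]


# Hypothetical bins: the sample is sequenced twice as deep as the pooled
# reference, and the third bin carries a 1.5x relative gain.
reference = [100, 100, 100, 100, 100]
sample = [200, 200, 300, 200, 200]
ratios = log2_ratios(sample, reference)
print([round(r, 3) for r in ratios])
```

A ratio near 0 indicates a copy number matching the reference, while the third bin comes out at log2(1.5), roughly 0.585, consistent with a single-copy gain in a diploid region.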

H06 - The landscape of RNA splicing alterations in human cancers

Angela Brooks, Broad Institute, United States

Short Abstract: Recent whole-exome sequencing studies have found that somatic mutations frequently occur in splicing factors across multiple cancer types, supporting the need to systematically and globally characterize splicing alterations across human cancers. Through the integration of mRNA and whole-exome sequencing data from The Cancer Genome Atlas, we are identifying RNA splicing alterations across ~7,000 cancer transcriptomes and investigating the underlying somatic mutations that cause these splicing alterations. To perform this analysis, we have developed a computational pipeline called JuncBASE to identify and quantify alternative splicing in cancer RNA-Seq data. We have identified somatic mutations in splice sites using an extended annotation of splice site positions, beyond what is typically used in cancer genomic studies, and have found associated transcriptome changes at the gene expression or splicing level. As a result of this analysis, we have identified splice site mutations that are associated with expression of known oncogenic isoforms. Restricting our analysis to known or predicted mutations in splicing regulatory elements may miss important splicing alterations; therefore, we are also using outlier detection methods to identify additional altered splicing events. To distinguish between cancer-specific splicing alterations and normal transcriptome variation, we are utilizing RNA-Seq data from healthy individuals from the Genotype-Tissue Expression project. Using RNA-Seq data from the Cancer Cell Line Encyclopedia, expression of these altered isoforms in cell lines is being used as a biomarker to identify genetic vulnerabilities in high-throughput shRNA screens. This work will have a significant impact on our understanding of the role of splicing in cancer pathogenesis.

H07 - Genotyping microsatellites in next-generation sequencing data

Harriet Dashnow, University of Melbourne, Australia

Short Abstract: Microsatellites are short (2-6bp) DNA sequences repeated in tandem, which make up approximately 3% of the human genome. These loci are prone to frequent mutations and high polymorphism. Dozens of neurological and developmental disorders have been attributed to microsatellite expansions. Microsatellites have also been implicated in a range of functions such as DNA replication and repair, chromatin organisation and regulation of gene expression.

Traditionally, microsatellite variation has been measured using capillary gel electrophoresis. In addition to being time-consuming and expensive, this method fails to reveal the full complexity at these loci because it cannot detect SNPs and compound microsatellites.

Next-generation sequencing has the potential to address these problems. However, determining microsatellite lengths using next-generation sequencing data is difficult. In particular, polymerase slippage during PCR amplification introduces stutter noise. A small number of software tools claim to genotype simple microsatellites in next-generation sequencing data; however, they fail to address the issues of SNPs and compound repeats, and they tend to provide only approximate genotypes.

We have developed a microsatellite genotyping algorithm that addresses these issues, providing high accuracy as well as more detailed analysis of microsatellite loci. We have validated it using high depth amplicon sequencing data of microsatellites near the AVPR1A gene. We found high concordance between our algorithm and repeat lengths obtained by electrophoresis, manual inspection and Mendelian inheritance. By subsampling the reads, we found that our model is accurate to within one repeat unit down to coverages that we would expect in standard exome sequencing.

H08 - Clinical Genomics facility - DNA sequencing from clinical diagnostics

Rikard Erlandsson, Karolinska Institutet, Sweden

Short Abstract: The Clinical Genomics facility at Science for Life Laboratory provides a dedicated infrastructure for genomics services on a national level, targeting the Swedish healthcare system. All work is carried out in close collaboration with medical expertise provided by the clinical diagnostic laboratories and patients’ managing physicians.

The infrastructure provides state-of-the-art clinical genomics services, serves as a national competence center for high-throughput analysis, and prepares crucial guidelines and ethical principles for the interpretation of test results. The facility also aims to improve the capacity of public health microbiology for national surveillance of infectious diseases and for epidemic preparedness.

The bioinformatics infrastructure includes four physical Linux servers, two virtual servers and a thirty-two-node Linux cluster. An automated pipeline has been built around the Casava pipeline from Illumina. It processes the raw data into FASTQ files that are passed to the bioinformatics analysis, including QC, alignment and variant detection.

H09 - Phenotype-Specific Prioritisation of Variants using PriMe SuSPect

Christopher Yates, Imperial College London

Short Abstract: We have developed PriMe SuSPect (Prioritisation Method using SuSPect), a method for disease-specific prioritisation of non-synonymous variants from sequencing studies.

SuSPect is a method for predicting the phenotypic effects of non-synonymous variants in human disease (www.sbg.bio.ic.ac.uk/suspect), incorporating sequence-conservation and protein-protein interaction network features to give enhanced prediction compared to other tested methods, with an AUC of 0.90 on a test set of over 18,000 variants.

SuSPect only gives a score based on whether a variant is likely to cause disease, but in many cases a user is interested in a specific disease. PriMe SuSPect has been developed to meet this need, using a random-walk with restart on PPI and domain-domain interaction networks to associate variants with specific diseases. The protein- and domain-based scoring is adapted from the PRINCE method developed by Vanunu et al. (2010).

To test this method, disease-causing variants were ‘spiked’ into exomes from the 1000 Genomes Project. Using SuSPect scores, the causative variant was identified as the top candidate in 1.3% of cases. In contrast, PriMe SuSPect was able to select the correct variant up to 66% of the time, with up to 76% of disease-causing variants ranked in the top 10.

PriMe SuSPect will be incorporated into the SuSPect web-server, enabling users to upload data from a sequencing project and rank the variants identified therein for a specific disease of interest. We consider that by offering disease-specific variant scoring, PriMe SuSPect could become a valuable tool in the identification of causative variants from sequencing studies.
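The random walk with restart mentioned in this abstract can be illustrated generically as follows. This is a minimal sketch on a toy undirected network, not the PriMe SuSPect or PRINCE scoring scheme; the network, the seed choice and the restart probability are all assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-12, max_iter=10_000):
    """Stationary visiting probabilities of a walker that restarts at the seeds.

    adjacency maps each node to a list of its neighbors; every node is assumed
    to have at least one neighbor so that probability mass is conserved.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed node.
        nxt = {n: restart * start[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for u in nodes:
            share = (1.0 - restart) * prob[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Hypothetical path-shaped interaction network: A - B - C - D,
# where A is a protein already associated with the disease of interest.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(network, seeds={"A"})
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Nodes closer to the seed receive higher scores, which is the property that lets a network-based method rank variants in proteins near known disease genes above variants in distant ones.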

H10 - VarMod: modeling the functional effects of non-synonymous variants.

Morena Pappalardo, University of Kent

Short Abstract: Unravelling the genotype-phenotype relationship in humans is still one of the most challenging tasks in genomics studies. Recent advances in sequencing technologies have revealed millions of single nucleotide variants (SNVs), yet these explain only a small proportion of heritability. It is now important to develop methods that can identify those variants that are functional and also predict those protein functions that can result in altered phenotypes.
In this poster, we present VarMod, a novel method for investigating the functional effects of non-synonymous single nucleotide variants (nsSNVs) in proteins. VarMod identifies ligand binding and protein-protein interface sites in the query protein and considers the distance of nsSNVs to these functional sites. VarMod combines these features with other widely used ones for predicting whether nsSNVs alter protein function; these features include residue conservation, amino acid properties and structural features such as solvent accessibility and secondary structure properties. The features are combined using a support vector machine to make an overall prediction of the nsSNVs that are likely to have an effect on protein function. VarMod is available as a web server and provides extensive features to visually analyse the protein model and the location of the nsSNVs occurring at ligand binding and/or protein-protein interface sites. In benchmarking on a set of pathogenic and neutral nsSNVs from VariBench, VarMod outperforms PolyPhen.

H11 - A powerful case-only computational method for identifying cancer-predisposing germline genes

Grace Tiao, Broad Institute, United States

Short Abstract: Identifying cancer-predisposing germline variants is an important challenge for cancer genomics. Most studies that report associations between germline variants and cancer phenotypes use case-control association tests, which are powerful but require large numbers of controls, particularly for studies involving rare variants. We have developed a novel case-only algorithm for analyzing rare germline variants found in cancer exome sequencing studies. Our method identifies genes in patients with an unexpectedly large number of rare damaging variants, which may contribute to increased risk for developing cancer. The algorithm, MutSigGL (for GermLine Mutation Significance), which is based on methods originally developed for analysis of somatic cancer mutations (Lawrence et al., Nature 2013), estimates baseline mutation densities for each gene using patient-specific mutation frequencies and general gene properties such as length, conservation, and replication timing. These baseline rates are used as the null model to evaluate genes with observed mutations in case samples. To validate MutSigGL, we applied it to two pediatric cancer studies: pleuropulmonary blastoma and rhabdoid cancer. These studies served as positive controls, as one predisposing gene has been identified for each cancer (DICER1 and SMARCB1, respectively). The MutSigGL algorithm was able to retrieve the DICER1 and SMARCB1 genes at high exome-wide levels of significance despite small cohort sizes (n = 15 and n = 37). Our results show that, for certain tumor types, case-only germline analysis of cancer mutations is feasible, and the logistical, financial, and computational advantages to working with smaller sample sizes make it a promising alternative to classic case-control germline mutation analysis.
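To make the idea of testing observed counts against a baseline null model concrete, here is a minimal sketch. The abstract does not state which test MutSigGL uses, so the Poisson model, the per-patient baseline rate and the observed count below are all assumptions chosen for illustration.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for a Poisson variable with mean `expected`."""
    below = sum(math.exp(-expected) * expected ** i / math.factorial(i)
                for i in range(observed))
    return 1.0 - below


# Hypothetical inputs: a cohort of 15 patients, an assumed baseline of 0.01
# rare damaging variants per patient in one gene, and 5 such variants observed.
cohort_size = 15
baseline_per_patient = 0.01  # assumed, not taken from the abstract
expected = cohort_size * baseline_per_patient
p_value = poisson_upper_tail(5, expected)
print(f"expected variants under the null: {expected:.2f}")
print(f"p-value for observing 5 or more: {p_value:.2e}")
```

Because no control samples enter the calculation, the strength of such a test depends entirely on how well the baseline rate is estimated, which is why the abstract stresses gene length, conservation and replication timing.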

H12 - Analysis of stop-gain and frameshift variants in human innate immunity genes

Antonio Rausell, Swiss Institute of Bioinformatics and Lausanne University Hospital, Switzerland

Short Abstract: There are well-characterized severe immunodeficiencies associated with loss-of-function variants in innate immunity genes. Recent resequencing projects report that stop-gains and frameshifts are collectively prevalent in humans and could be responsible for some of the inter-individual variability in innate immune response. Computational approaches evaluating the contribution of such variants to disease use gene-centric approaches such as evolutionary conservation and functional redundancy across the genome. However, innate immunity genes represent a particular case because they are more likely to be under positive selection and duplicated. In this work we assess truncating variants in terms of sequence features associated with function and create a pathogenicity scoring system applicable to innate immunity genes. Using data from the 1000 Genomes Project and the NHLBI Exome Sequencing Project, we evaluated ~17000 stop-gain and 14000 frameshift variants collectively affecting ~11000 protein coding genes. Sequence-based features such as loss of functional domains, isoform-specific truncation and nonsense-mediated decay were found to correlate with variant allele frequency. As a functional read-out, truncating variants expected to severely disrupt transcript production correlated with measurable decrease in RNA levels in affected individuals. We integrated these features in a Bayesian classification scheme and benchmarked its use in predicting pathogenic variants against OMIM disease stop-gains and frameshifts. The classification scheme was applied in the assessment of 335 stop-gains and 236 frameshifts affecting 227 interferon-stimulated genes. The sequence-based score ranks variants in innate immunity genes according to their potential to cause disease, and complements existing gene-based pathogenicity scores.

H13 - VarSim: A simulation validation framework for alignment and variant calling in high-throughput genome sequencing

Hugo Lam, Bina Technologies, United States

Short Abstract: Realistic simulation validation frameworks are essential for an unbiased comparison of the performance of high-throughput sequencing analysis algorithms. We present VarSim, an integrated computational framework that leverages state-of-the-art read simulation and vast annotation databases to generate realistic high-throughput sequencing reads and report detailed accuracy statistics.

VarSim first generates a phased diploid genome using variants from existing annotations and novel sites - this includes real insertion sequences when simulating structural variations. Next, reads are simulated from this diploid genome using empirical error models. After alignment and variant-calling on the simulated reads, VarSim reports detailed statistics on the accuracy of the results. These statistics include alignment accuracy and variant-calling accuracy for different variant types and sizes, as well as for different categories of genomic regions, e.g., genes and repeats. Since VarSim generates a diploid genome, genotyping accuracy is also reported.

To demonstrate its utility and facilitate rapid validations, we constructed three synthetic genomes using VarSim and simulated their reads at high coverage (100x). The three genomes are: a male personal genome, a female personal genome and a female pseudo-random genome. Each genome contains over 4 million small variants and large structural variants. We compared the accuracy statistics generated by VarSim for these genomes on popular alignment and variant calling algorithms.

VarSim is useful for comparing aligners, SNP/indel detectors and structural variation callers. No other simulation framework offers such a comprehensive validation of whole genomes. It is released as free software and the synthetic genomes are available for download.
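The kind of per-type accuracy statistics described in this abstract can be illustrated with a small sketch that compares a call set against a simulated truth set. This is not VarSim's code or output format; the variant tuples, the SNP/indel rule and the function names are hypothetical.

```python
def variant_type(variant):
    """Classify a (chrom, pos, ref, alt) tuple as 'snp' or 'indel'."""
    _, _, ref, alt = variant
    return "snp" if len(ref) == 1 and len(alt) == 1 else "indel"


def accuracy(truth, called):
    """Precision and recall of a call set against the simulated truth set."""
    true_pos = len(truth & called)
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {"tp": true_pos, "fp": len(called - truth), "fn": len(truth - called),
            "precision": precision, "recall": recall}


def accuracy_by_type(truth, called):
    """Accuracy reported separately for each variant type."""
    report = {}
    for kind in ("snp", "indel"):
        truth_kind = {v for v in truth if variant_type(v) == kind}
        called_kind = {v for v in called if variant_type(v) == kind}
        report[kind] = accuracy(truth_kind, called_kind)
    return report


# Hypothetical truth set (known because the genome was simulated) and call set.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "GA"), ("chr2", 900, "TCC", "T")}
called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 75, "G", "GA"), ("chr3", 40, "A", "C")}
overall = accuracy(truth, called)
by_type = accuracy_by_type(truth, called)
print(overall)
print(by_type)
```

Splitting the report by variant type shows that the single false positive is a SNP while the single missed variant is an indel, a distinction that the overall precision and recall of 0.75 hide.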

H14 - RNA Somatic Mutation Caller (RSMC): accurately identifying somatic mutation using RNAseq data

Quanhu Sheng, Vanderbilt University, United States

Short Abstract: Both somatic mutations and changes in gene expression play a key role in tumor formation. With the maturity of next generation sequencing, RNAseq has successfully replaced the microarray as the primary tool for expression profiling. Traditionally, somatic mutations are detected through DNA sequencing. A robust somatic mutation caller designed specifically for RNAseq data will combine both somatic mutation detection and expression profiling in a single dataset. RSMC is a somatic mutation caller designed especially for RNAseq data. RSMC applies a robust statistical model for detecting somatic mutation, and filters out false positive somatic mutations based on several unique characteristics of RNAseq data such as position within read, strand issues and splicing junction identification. RSMC can accurately identify somatic mutations using RNAseq data and creates additional data mining opportunities using existing RNAseq data.

H15 - Computational and experimental approaches to the limit of detection for rare sequence variants in targeted ultra-deep sequencing.

Tina Koestler, Exosome Diagnostics, Germany

Short Abstract: Biofluids contain nucleic acids, either as cell-free DNA or captured in exosomes and other microvesicles, which are stable sources of genetic material for personalized medicine. Biofluids are easy to access and allow genotyping of solid tumors without requiring tissue. Low numbers of somatic mutations are diluted in a sea of wild-type sequences; targeted ultra-deep sequencing is our method of choice for the detection of rare variants.

We address the question of how well mutation frequencies can be estimated from low copy numbers in our sequencing workflow. We synthesized short DNA sequences that are identical except for 6 positions, where we introduced single nucleotide variations of pre-specified frequency, such that their combination generates 128 distinct sequences with relative frequencies between 26% and 0.0002%. We performed paired-end sequencing where both forward and reverse reads covered the entire 87 nucleotides of the synthetic DNA. We filtered out sequences where the forward and reverse reads did not agree, to increase the precision of the obtained sequences.

Our results show almost perfect recovery of the expected percentages with a Pearson coefficient of 0.99 between input and observation. The variance in counts of rare sequences follows that of a Poisson distribution. Moreover, at a coverage of 40,000 reads, we have a pickup rate of 100% down to frequencies of 0.004%, corresponding to 1.6 molecules detected in the sample. In conclusion, the limiting factor for estimating the frequency of rare variants is determined by a Poisson distribution at very low copy numbers, rather than systematic errors due to the experimental procedure.
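The arithmetic behind the 1.6 molecules quoted above, and the detection probability that a pure Poisson sampling model would predict at that mean, can be checked with a short sketch. The Poisson detection formula is a standard result; applying it here is an illustration, not a calculation reported in the abstract.

```python
import math


def expected_copies(reads, frequency_percent):
    """Expected number of reads carrying a variant at the given frequency."""
    return reads * frequency_percent / 100.0


def prob_at_least_one(mean_copies):
    """Poisson probability of sampling at least one copy."""
    return 1.0 - math.exp(-mean_copies)


mean_copies = expected_copies(40_000, 0.004)
detection = prob_at_least_one(mean_copies)
print(f"expected copies: {mean_copies:.1f}")
print(f"P(at least one copy sampled): {detection:.3f}")
```

Under this model a variant with a mean of 1.6 copies is sampled in roughly 80% of independent draws, which illustrates why sampling noise rather than systematic error becomes the limiting factor at very low copy numbers.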

H16 - Pilon: microbial genome assembly improvement and comprehensive variant detection

Thomas Abeel, Broad Institute of MIT and Harvard, United States

Short Abstract: Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired end data from two Illumina libraries with small (e.g., 180 bp) and large (e.g., 3-5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing misassemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon can identify small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and insertions. Pilon is actively being used to improve the assemblies of thousands of new genomes and to identify genomic variants among thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.

H17 - Novel procedure for detecting etiological SNPs in Genome-Wide Association Studies

Dariusz Plewczynski, University of Warsaw, Poland

Short Abstract: The most common approach to finding etiological Single Nucleotide Polymorphisms in Genome-Wide Association Studies is calculating per-SNP odds ratios between cases and controls, performing a chi-squared test and correcting the results for multiple comparisons. This procedure holds some undeniable advantages: it is straightforward, easy to implement and computationally cheap. Unfortunately, it also has some problems and limitations: insufficient sample size, crude control for multiple comparisons, unaccounted population stratification and inability to detect epistasis, to name a few.

We would like to propose a completely different approach to the problem of detecting etiological SNPs in GWAS. We devised a two stage procedure consisting of a screening stage (the All-Relevant feature selection) and a selection stage (the Minimal-Optimal feature selection).

The motivation behind using an All-Relevant type of procedure for the screening stage is to eliminate as many irrelevant SNPs as possible and at the same time minimize the probability of eliminating an etiological SNP. This also makes both stages independent, in the sense that the All-Relevant feature selection problems are entirely data-driven and model-independent.

For the selection stage, we use Regularized Generalized Linear Models, which are immune to the multiple comparison problem as well as allow for seamless incorporation of non-SNP features to the model (thus allowing for handling population stratification). Also, since the number of SNPs is significantly reduced in the screening stage, we can model interactions of level 2 and above directly.
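For comparison, the conventional per-SNP baseline described at the start of this abstract (odds ratio, chi-squared test, multiple-comparison correction) can be sketched in a few lines. The allele counts and the number of tested SNPs are hypothetical, and the Bonferroni rule is used as one example of the crude correction the authors criticize.

```python
def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds ratio for carrying the alternative allele in cases versus controls."""
    return (case_alt * control_ref) / (case_ref * control_alt)


def chi_squared_2x2(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared statistic (1 degree of freedom, no continuity correction)."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    total = a + b + c + d
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical allele counts for a single SNP.
ratio = odds_ratio(30, 70, 15, 85)
statistic = chi_squared_2x2(30, 70, 15, 85)
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"odds ratio: {ratio:.3f}")
print(f"chi-squared statistic: {statistic:.3f}")
print(f"per-SNP p-value threshold for one million tests: {threshold:.1e}")
```

Each SNP is tested in isolation here, so interactions between SNPs (epistasis) never enter the calculation. The two-stage procedure proposed in the abstract is designed to address that limitation.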

H18 - The discovery of novel SNP associations within a large cohort using dcGO and Superfamily

Natalie Thurlby, University of Bristol

Short Abstract: The Avon Longitudinal Study of Parents and Children (ALSPAC), which is also known as Children of the 90’s, is a long-term study of parents and children. The cohort of over 14,000 mothers and children provided genetic, phenotypic and environmental data.

The genetic part of this data was analysed using the domain-centric gene ontology, dcGO, and Superfamily, which is a database containing the HMM-derived structural and functional annotation of proteins, with the aim to simultaneously predict multiple phenotypes in order to highlight possible novel SNP associations.

This method differs from the traditional GWAS (genome-wide association study) approach as multiple phenotypes are predicted at the same time. Promising SNPs are then more closely investigated to validate the predicted association.

H19 - Extensive trans and cis-QTLs revealed by large scale cancer genome analysis of The Cancer Genome Atlas RNA-seq, WGS-seq and WXS-seq data

Kjong-Van Lehmann, Memorial Sloan Kettering Cancer Institute, United States

Short Abstract: While population structure can be one of the most severe confounding factors in QTL analysis, tumor samples open up many new additional challenges. Tumor specific somatic mutations and recurrence patterns are known to explain large amounts of the observed transcriptome variation and sample heterogeneity can lead to spurious associations. Thus, we have developed a new strategy to perform a common variant association study (CVAS) using mixed models on tumor samples which enables us to account for tumor specific genotypic and phenotypic heterogeneity as well as population structure. We apply this strategy to investigate the relationship between germline and somatic variants as well as splicing patterns and expression changes in order to discover determinants of transcriptome variation. Due to sample size constraints, many QTL studies have been limited to the analysis of cis-associated variants. We use whole genome, exome and RNA-seq data from the TCGA project to overcome this limitation and discover trans-associated variants as well. We also investigate the effect of rare somatic variants that may have a significant effect on transcriptional and post transcriptional regulation. A rare variant association study (RVAS) using variants from whole genome and exome sequencing data is being utilized to investigate the basis of rare mutations. A decomposition of genomic covariances into trans and cis effects elucidates the importance of such factors across different cancer types which will not only improve our understanding of the molecular basis of cancer but may also provide new treatment targets.

H20 - GenTrAn: a new tool for de-novo transposon structural variant detection from single-end deep-sequencing data

Reazur Rahman, Brandeis University, United States

Short Abstract: Transposons are major structural variants (SVs) in animal genomes. In cancer and human biology, there is a need to determine new transposon SVs beyond the tremendous load of existing transposons (>45% of the human genome). Most current efforts to discover transposon SVs rely on Paired-End (PE) reads from genome deep-sequencing, but the greater costs of PE reads compared to Single-End (SE) reads (the standard form of genome deep-sequencing) motivated us to develop a new bioinformatics tool called GenTrAn (Genome Transposon Analyzer). By scanning SE read libraries with a hybrid approach of broad-level split-read mapping and then filtering with various quality criteria, GenTrAn discovers de-novo transposon SVs with high sensitivity and specificity. Importantly, the transposon SV sites that GenTrAn identifies display target site duplications indicative of a recent transposition event, and point to precise genomic coordinates that enable discrimination of SVs that disrupt coding gene exons versus less-disruptive intronic insertions.

We demonstrate the efficacy of our tool by discovering the genome-wide distributions of transposon SVs in four different Drosophila melanogaster cell lines. GenTrAn showed that transposon SV landscapes can be surprisingly diverse even in a natural cell line, and these SVs tend to avoid coding exons, yet prefer to insert near genes in intergenic regions. In addition, GenTrAn can measure the allele ratio of transposon SVs and all predicted SVs were successfully validated by genomic PCR. GenTrAn’s precision in transposon SV detection and feasibility to mine the more economical SE read libraries make this an attractive tool for genome diagnostics.

H21 - Modeling Regulatory Network Evolution in Mycobacteria

Joshua Goldford, Boston University, United States

Short Abstract: Although there is vast evidence that the human-specific pathogen Mycobacterium tuberculosis (M.tb) has evolved from non-pathogenic organisms, the evolutionary processes responsible for acquiring this pathogenesis are still largely not understood. In this study, we developed a method to characterize the likelihood that a given genetic element or DNA binding factor in Mycobacteria evolution would emerge or disappear over time. We applied a phylogenetic hidden Markov model to ChIP-Seq data sets from five transcription factors for four related species, M. tb, M. avium (M.av), M. smegmatis (M.smeg) and R. rhodococcus (R. rho) to estimate the probability that a given transcription factor binding site (TFBS) or gene target would change over time. For kstR, a conserved mycobacterial transcription factor responsible for the repression of genes involved in cholesterol catabolism and lipid biogenesis, we observed that as M.tb and M. avium diverged from a common ancestor, the likelihood of binding site gain and loss was high for M.tb and M. avium, respectively. Significantly different sets of gain and loss probabilities were observed when we partitioned the TFBS observations by location relative to the target gene. Additionally, substantial differences in gain/loss probabilities of TFBS and gene targets were observed between different transcription factors. Thus, this approach demonstrates the utility of a probabilistic model for regulon evolution in prokaryotes.

H22 - Investigating large sequence variants in drug resistant Mycobacterium tuberculosis

Alex Salazar, Broad Institute, United States

Short Abstract: Mycobacterium tuberculosis (Mtb) is the bacterial agent responsible for tuberculosis, a disease that kills over 1.3 million people each year largely due to our inability to rapidly diagnose and effectively treat drug resistant forms of the disease. Molecular diagnostics have emerged that rapidly identify mutations from the infecting Mtb genome known to be associated with drug resistance (DR), increasing the likelihood that patients are appropriately treated. However, these diagnostics are incomplete due to a lack of knowledge of all DR-causing mutations. While most DR-conferring mutations involve single base changes, it is well known that clinical strains of Mtb harbor many larger sequence mutations or variants (LSVs). LSVs have not been closely examined for their role in DR due to the difficulty in identifying these variants for association analyses. Using Pilon, a new tool that enables LSV detection, we identified over 45,000 LSVs in 161 clinical Mtb strains with variable DR profiles. Closer examination of LSVs across multiple samples revealed that identical LSVs were often being represented in Pilon output as different, making it difficult to statistically associate them with DR. We developed Emu, an algorithm that normalizes different representations of the same LSV to a canonical form. Applied to the 161 clinical Mtb genome sequences, Emu reduced the number of unique LSVs by more than half. Of the remaining ~20,000 LSVs, a large fraction appeared to be vertically inherited and not associated with DR, but there were a small number that correlated well with DR and may represent new pathways for achieving DR in Mtb.
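
The abstract does not describe how Emu works internally, so the following is only a minimal sketch of the general idea of mapping equivalent variant representations to one canonical form. It uses the generic trim-and-left-align procedure for insertions and deletions, not Emu's algorithm. The function name `normalize`, the 0-based coordinates and the toy reference sequence are all assumptions made for illustration.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one variant (generic procedure, not Emu itself).

    ref_seq : reference sequence as a string
    pos     : 0-based start of the REF allele on ref_seq
    ref/alt : REF and ALT allele strings
    Returns a canonical (pos, ref, alt) triple.
    """
    while True:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            base = ref_seq[pos]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat: three ways to write the same 2-bp deletion.
reference = "GGCACACT"
calls = [(2, "CAC", "C"), (3, "ACA", "A"), (4, "CAC", "C")]
canonical = {normalize(reference, p, r, a) for p, r, a in calls}
print(canonical)
```

All three input records describe the same deleted sequence inside the repeat, and after normalization they collapse to a single record, which is what makes counting and association testing across samples possible.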

H23 - Whole transcriptome framework to identify human genetic variants relevant to health and disease

Samuel Handelman, The Ohio State University, United States

Short Abstract: Genome wide association (GWA) studies provide a means of identifying genetic variants associated with a disease or drug response of interest. A hypothesis-free approach to GWA studies is free from bias introduced by the current understanding of disease pathophysiology or etiology and can introduce novel genes and variants that would otherwise remain unexplored. However, multiple hypothesis correction then becomes a greater challenge. In this report, novel network mining methods are described which are specially optimized for the use of RNAseq data to prioritize variants and/or genes, partially compensating for the multiple hypothesis problem. These methods recover properties of the broader network based on statistically significant pairwise interactions, thus identifying high-priority hub genes. Local subnetworks are identified using statistical algorithms based on topological and information-theoretic procedures; the effect of virtual gene knockouts in these local subnetworks, estimated using algebraic methods, further serves to prioritize genes. Finally, the networks of potential epistasis (non-additive interactions between variants) and of haplotype structure (reflecting recent positive selection) are also used to prioritize variants. Bagging is used to ensure that all of these network approaches are robust. Preliminary results are presented for efficacy of interventions against neurological, skeletal and cardiovascular disorders of interest to the XGEN group.

H24 - Transcriptome analysis of expressed single nucleotide variants in the 3’ UTR of ER-positive breast tumor

Xiaojia Tang, Mayo Clinic, United States

Short Abstract: It is known that the 3’ untranslated region (UTR) contains essential regulatory elements that affect the expression and stability of mRNA. Transcriptome sequencing (RNA-Seq) with a poly-A selected library usually generates high coverage at the 3’ UTRs and thus provides the possibility of accurate calling of expressed single nucleotide variants (eSNVs) in untranslated regions. We have developed a novel computational system, ESNV-Detect, to identify eSNVs from RNA-Seq data. It has been validated in several tumors and lymphoblastoid cell lines with high precision and sensitivity in both coding regions and UTRs. Here we applied ESNV-Detect to study 3’ UTR eSNVs in estrogen-receptor positive (ER+) breast tumors from The Cancer Genome Atlas (TCGA). We obtained RNA-Seq data of 559 ER+ samples. To identify somatic eSNVs, we restricted our analyses to 94 samples (47 pairs) that have both tumor and normal data. We identified 49,636 unique somatic 3’ UTR eSNVs in the 47 ER+ tumor-normal pairs. Comparison of somatic 3’ UTR eSNVs with the list of somatic mutations that alter miRNA target sites obtained from SomamiR DB (http://compbio.uthsc.edu/SomamiR/) revealed that in our data 175 eSNVs create and 71 eSNVs disrupt known miRNA target sites. Thus far, only limited information is available in currently known databases about somatic eSNVs in 3’ UTRs. Hence we are in the process of developing a computational system which will allow us to investigate the impact of the novel somatic 3’ UTR eSNVs in ER+ tumors. This will enhance our understanding of transcription regulation of ER+ disease.

H25 - MetaMerge-SV: An accurate method-aware merging algorithm for SVs

Hugo Lam, Bina Technologies, United States

Short Abstract: Accurate structural variant (SV) detection has been a key challenge in genomic analysis due to the complexity of SVs. SVs are genomic rearrangements formed by various mechanisms and vary widely in size, making them almost impossible to detect with the relatively short reads from next-generation sequencing (NGS) using any single algorithm. Each algorithm has its own limitations and is only sensitive to certain kinds of SVs with varying degrees of accuracy and resolution.

Nevertheless, to date SV merging tools such as SVMerge are still limited in accuracy and precision as different SV detection methods are treated uniformly. While a couple of tools such as iSVP might consider different methods, their merging is limited to only removing duplicates. Many of the widely used SV detection tools are also not supported by these tools without any modification.

Here we present MetaMerge-SV, an algorithm for accurate and method-aware SV merging. It merges SVs from VCFs detected by multiple methods and by multiple tools for a method. Rather than simply taking either the inner or outer bounds of the merged SVs, it resolves SV breakpoints based on the resolution of the methods. It attempts to recover missing zygosity from alignments and resolve conflicts based on the specificity of the methods in different regions. Local assembly with dynamic programming is used to provide further validation of SVs and to enhance breakpoint precision. With simulation and experimental data, our results show that MetaMerge-SV achieves high accuracy, precision and sensitivity across all SV types and sizes.

H26 - Learning similarity metrics from hierarchically classified data for phenotype mapping

Chun-Nan Hsu, University of California, San Diego, United States

Short Abstract: Phenotype mapping by learning semantic similarity metrics is important for integrating genomic data, such as GWAS and sequencing data. The training examples are usually classified into one-layer classes, where pairs of data points must be either similar or not similar, and may fail to capture different levels of similarity. Here, we present algorithms to learn a Mahalanobis distance metric from hierarchically classified datasets. The novelty of our contributions includes a new objective function that maintains a small margin between same-subclass examples and a larger one amidst same-class examples with regularization of intra-subclass and intra-class distances. We show that the new objective function is convex and can be optimized efficiently by a stochastic-batch sub-gradient descent method. We apply our model to two datasets: low-dimensional synthetic data from a mixture of Gaussians with a hierarchical setting, and high-dimensional hierarchical phenotype descriptions in genomic data. Experiments show that our algorithm is able to yield high accuracy in hierarchical classification tasks and handle both low and high-dimensional datasets.
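
For readers unfamiliar with the term, a Mahalanobis metric measures distance as d(x, y) = sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M. The sketch below only evaluates such a distance. It assumes the common parametrization M = L^T L, which the abstract does not state, and it does not reproduce the authors' hierarchical objective or optimizer. The name `mahalanobis` and the example matrices are illustrative.

```python
import math


def mahalanobis(x, y, L):
    """Distance under M = L^T L, i.e. the Euclidean norm of L(x - y).

    x, y : equal-length sequences of floats
    L    : list of rows; any real matrix gives a positive semidefinite M
    """
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(l * d for l, d in zip(row, diff)) for row in L]
    return math.sqrt(sum(p * p for p in projected))


identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[2.0, 0.0], [0.0, 1.0]]  # weights the first feature more heavily

print(mahalanobis([0.0, 0.0], [3.0, 4.0], identity))   # plain Euclidean distance
print(mahalanobis([0.0, 0.0], [3.0, 4.0], stretched))  # first axis counts double
```

Learning the metric amounts to choosing L (or M) so that pairs that should be similar end up close and dissimilar pairs end up far apart; with the identity matrix the metric reduces to ordinary Euclidean distance.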

H28 - Exome sequencing reveals the cause of more than eighty Mendelian disorders in mice

ANUJ SRIVASTAVA, The Jackson Laboratory, United States

Short Abstract: The discovery of causative mutations in Mendelian disorders is a key to understanding underlying disease biology and mechanism. It also provides diagnostic tools and target information for development of therapeutic approaches. The Mouse Mutant Resource (MMR) at the Jackson Laboratory has been developing and distributing spontaneous mutant mouse models of genetic disease for 50 years. We have found that whole exome sequencing (WES) offers a powerful primary approach for disease gene discovery. Here we show the results of a large-scale effort to identify the causative genes for 174 distinct Mendelian disorders in the laboratory mouse with clinically relevant phenotypes using WES. We also developed an optimized bioinformatics pipeline for analysis of mouse exome data that takes into account strain background by using high quality inbred strain specific SNPs/Indels from the Sanger Mouse Genomes Project. Using a newly developed mutation-calling algorithm, we identified and validated causative mutations in over 80 of these strains. In addition to identifying novel mutations, alleles and their associated genes, our data provide a variation resource of unprecedented breadth, with data from more than 100 unique strains representing 5 clades of laboratory mice. Further, variations identified in these strains were also incorporated into a newly created mouse variation database for subsequent data sorting, filtering, querying and sharing. These variation data in turn power on-going efforts to identify rare variants underlying Mendelian disorders in mice.

H29 - Computational prioritization of phenotype associated variants

Kymberleigh Pagel, Indiana University Bloomington, United States

Short Abstract: Advances in sequencing technologies have generated a wealth of potentially important variants with uncertain phenotypic implications that have shown promise for the identification of novel variant-phenotype relationships. Traditional methods of identifying relationships between genetic variation and phenotypes have yielded success, yet can be time consuming. Genome wide association studies tend to identify common variants with modest contributions to a phenotype and can miss rare variants which are more likely to be causal. The ability of exome sequencing to detect rare causal variants in protein-coding regions has already yielded success in the case of several Mendelian disorders. However, the applications to complex diseases have only recently been realized. A major challenge lies in the prioritization of candidate variants in decreasing order of their putative contribution to a given disease. We develop a novel computational framework to systematically combine predictions from MutPred, a tool that predicts the propensity of an amino acid substitution to cause disease, and PhenoPred, a method to infer gene-disease associations using biological networks, for the simultaneous prioritization of putative causal variants and prediction of the resulting phenotype. Specifically, we concentrate on the application of this method to predict status for variants associated with both Mendelian and complex traits with minimal prior knowledge and small sample sizes. The method performed well on exome sequencing data sets for familial combined hyperlipidemia (FCH), hypoalphalipoproteinemia (HA) and Crohn’s disease as part of the 2013 Critical Assessment of Genome Interpretation (CAGI) conference.

H30 - Development of model-based tumor content estimator for accurate variant calling in next-generation sequencing (NGS) dataset

Dongyoon Park, SK Telecom, Korea, Rep

Short Abstract: Low tumor content (tumor purity) is often caused by stromal cell contamination, and fluctuation in tumor content can be a major source of noise when detecting genetic variants in next-generation sequencing (NGS) datasets. It is therefore important to estimate the tumor content level accurately.
Although many tumor content estimation algorithms have been developed recently, they have certain limitations. First, most of them are based on the loss of heterozygosity (LOH) assumption, but LOH events do not always occur in tumors. Copy-neutral LOH and unbalanced copy gain are common chromosomal aneuploidy events in tumors. Second, existing methods usually require a matched normal sample to identify tumor-related alleles, but a matched normal sample is not available in most cases.
We developed a model-based tumor content estimation method that uses B allele frequency (BAF) information and a graph-based model. In the first step, candidate nodes of tumor content and regional copy numbers are selected from BAF segments in the tumor sample. In the second step, a graph model is built from the candidate nodes. Finally, the most plausible tumor content, the one that best explains the graph model, is determined. The proposed method can therefore be applied to any tumor sample without requiring LOH or matched normal sample information.
We generated 80 simulated datasets by varying the tumor content level from 20% to 90% in 10% intervals. The performance of the proposed method was compared with that of PurBayes on the generated datasets. Our method showed better estimation accuracy for the given tumor content levels.
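
To make the link between BAF and tumor content concrete, the sketch below computes the BAF expected at a germline heterozygous site when tumor and normal cells are mixed. It is the standard two-population mixture calculation and is not the authors' graph model, which the abstract does not specify. The function name `expected_baf` and its parameters are assumptions, as is the premise that normal cells are diploid and carry exactly one B allele.

```python
def expected_baf(purity, total_cn, b_copies):
    """Expected B allele frequency at a heterozygous site in a mixed sample.

    purity   : fraction of tumor cells in the sample, between 0 and 1
    total_cn : total copy number of the segment in tumor cells
    b_copies : number of B-allele copies in tumor cells
    Normal cells are assumed diploid with exactly one B allele.
    """
    b_signal = purity * b_copies + (1.0 - purity) * 1
    total_signal = purity * total_cn + (1.0 - purity) * 2
    if total_signal == 0:
        raise ValueError("segment has no copies in a pure tumor sample")
    return b_signal / total_signal


# Copy-neutral LOH (two copies, both B) at three purity levels.
for purity in (0.0, 0.5, 1.0):
    print(purity, expected_baf(purity, total_cn=2, b_copies=2))
```

Because the expected BAF of a segment shifts with purity for a given copy-number state, observed BAF segments constrain which combinations of tumor content and copy number are plausible.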

H31 - Genetic Variant Analysis - Detecting low frequency SNVs in targeted cancer panels

Gunjan Hariani, EA, A Quintiles Company, United States

Short Abstract: Despite the need for reliable detection of somatic, low frequency SNVs (LFSVs) in targeted cancer panels, existing software does not provide both the sensitivity and specificity necessary for clinical applications. To address these concerns, we have developed a low frequency variant caller, VarPROWL (Variant PROfiling With Logistic Regression), that reliably detects LFSVs while minimizing false positives. VarPROWL uses a logistic model to estimate sequencing error rates using base quality scores, genomic context[1, 2], homopolymer stretches and nearby error rates. Uncertainty in error rate estimation is hierarchically modeled, eliminating surges in false positive rates under increasing sequence depth - a problem seen in other tools[3, 4].

VarPROWL was compared to four other variant calling programs (with hard filter cut-offs) on Illumina sequencing data from a targeted capture of the cancer gene PIK3CA. Data came from 9 pure samples, with an additional 3 biological admixtures of quantified ratios, repeated both within and across flowcells to provide 48 total replicates. True variants were identified through allele frequency conformity, including comparisons to: known allele frequencies (admixtures), frequencies across replicates, and Ion Torrent sequence (when possible). Results provided 93 unique variants with allele frequencies ranging from 4% to 100%. VarPROWL gave a 95% true positive rate while maintaining a false discovery rate of only 21%. GATK[5], VarScan2[6], Platypus[7] and FreeBayes[8] gave TPRs of 63%, 66%, 12%, 23% and FDRs of 23%, 1.6%, 4%, 11%, respectively. This demonstrates the value of VarPROWL in LFSV detection for targeted cancer panels.
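
The two metrics quoted above have standard definitions: the true positive rate is TP / (TP + FN) and the false discovery rate is FP / (TP + FP). The sketch below simply evaluates them. The function name `tpr_fdr` is an assumption, and the counts in the example are hypothetical values chosen for illustration, not results from the poster.

```python
def tpr_fdr(tp, fp, fn):
    """Return (true positive rate, false discovery rate) from call counts.

    tp : true variants that were called
    fp : calls that are not true variants
    fn : true variants that were missed
    """
    if tp + fn == 0:
        raise ValueError("no true variants: TPR is undefined")
    if tp + fp == 0:
        raise ValueError("no calls made: FDR is undefined")
    return tp / (tp + fn), fp / (tp + fp)


# Hypothetical counts, for illustration only.
tpr, fdr = tpr_fdr(tp=90, fp=10, fn=10)
print(f"TPR = {tpr:.2f}, FDR = {fdr:.2f}")
```

The two quantities use different denominators, so a caller can score well on one and poorly on the other, which is why the abstract reports both for each tool.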

H32 - MutPred2: predicting the pathogenicity, structural and functional consequences of missense variants

Vikas Pejaver, Indiana University, United States

Short Abstract: The prioritization of missense variants that are relevant to disease and the inference of their impacts on protein structure and function remains a major challenge. Previously, we developed MutPred and showed that predicted structural and functional features can be used to address this challenge. However, the use of disparate predictors, combined in a heterogeneous infrastructural framework, limits the number of properties that can be considered and hinders the usage of the method in practice. In this study, we present MutPred2, a method and tool for the prediction of pathogenicity of missense variants and their molecular effects. MutPred2 uses a stack-like learning model that combines the high accuracy of random forests with the robustness and interpretability of neural networks. Apart from sequence-derived and evolutionary features, it models the predicted loss or gain of protein stability and other structural and functional properties such as signal and transmembrane helices, DNA-, RNA-, protein- and metal-binding sites, among others. While the performances of these in-house predictors vary (AUC values range from 60 to 98%), their incorporation into MutPred2 addresses the practical limitations mentioned above. MutPred2 achieves an AUC of 88% in 10-fold cross-validation experiments, suggesting that the use of non-specialized predictors does not affect accuracy. Furthermore, MutPred2 can assign up to 47 possible structural and functional consequences for a predicted deleterious variant. Such initial mechanistic hypotheses will be important for the elucidation of the molecular basis of disease and the development of novel therapeutic strategies.

H33 - VARANT: An Open Source Variant Annotation Tool

Steven Brenner, University of California at Berkeley, United States

Short Abstract: We describe and present a comprehensive and easily extensible open source tool for Human Genome Annotation called VARANT, written in the Python programming language. While several tools for annotating variants are available, we believe that VARANT distinguishes itself by being fully open source, capable of using multiple processors/cores for speedy annotation and providing extensive annotation of UTR and non-coding regions in addition to the customary annotations of genes. An additional highlight of the tool is the ability to incorporate various inheritance models into the annotation process, which when coupled with phenotype information can be used to quickly generate a list of prioritized genes and variants. The tool has been successfully used to identify causal variants in rare immuno-disorders.

H34 - NGS-Logistics: Federated analysis of NGS sequence variants across multiple locations

Amin Ardeshirdavani, KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytic, Belgium

Short Abstract: As many personal genomes are now being sequenced across the world, collaborative analysis of those genomes has become essential to effectively gain biomedical knowledge from those sequencing efforts. However, analysis of personal genomic data raises important confidentiality issues, and current solutions for personal genomic data sharing fall short of an effective and comprehensive solution to this problem. We propose a methodology for federated analysis of sequence variants from personal genomes that helps alleviate those problems. Our method allows querying a specific base-pair position or region in the genome for both a set of samples to which the user has authorized direct access (called the active data set) and for the whole set of samples. The query results are statistics that do not breach data confidentiality (by virtue of not being personally identifiable data) but allow further exploration of the data. Relevant samples outside the active data set can be identified through pseudonymous identifiers so that researchers can negotiate access to these samples with the authorized party. This approach minimizes the impact on data confidentiality while enabling powerful data analysis by gaining access to important rare samples. Our methodology is implemented in an open source tool called NGS-Logistics.
