Posters
Poster numbers will be assigned May 30th.
If you cannot find your poster below, that probably means you have not yet confirmed that you will be attending ISMB/ECCB 2015.
To confirm your poster, find the poster acceptance email, which contains a confirmation link.
Click on it and follow the instructions.
If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.
Category H - 'Metagenomics'
H01 - Assessment of Aligner and SNP Caller for Next Generation Sequencing and a Fast and Accurate SNP Detection Method
Short Abstract: The rapid development of next generation sequencing (NGS) technology provides a new chance to extend the scale and resolution of genomic research. How to efficiently map millions of short reads to the reference genome and how to make accurate variant calls are two major challenges in NGS analysis. We review current software for aligning short reads and detecting single nucleotide polymorphisms (SNPs), and extensively evaluate their performance on normal and cancer samples from The Cancer Genome Atlas project and trio data from the 1000 Genomes Project. We find that Burrows–Wheeler Transform based aligners are the most suitable for the Illumina platform, and NovoalignCS shows the best overall performance for SOLiD data. We also demonstrate that FaSD is the most reliable SNP caller compared with several state-of-the-art programs. Furthermore, NGS shows significantly lower coverage and poorer SNP-calling performance in the CpG islands, promoter and 5’-UTR regions of the human genome. We show that both high GC content and low repetitive elements are the causes of lower coverage in the promoter regions.
H02 - Predicting copy number alterations and allelic status in cancer genomes with Control-FREEC using whole genome or exome sequencing data
Short Abstract: In addition to point mutations, cancer genomes often show a vast number of copy number alterations (CNAs): gains and losses of chromosomal material. Frequently, in order to disable a tumor suppressor gene, tumor cells use a loss of heterozygosity (LOH) mechanism: the functional allele is lost while the mutated allele is duplicated.
Using high-depth whole genome sequencing or exome sequencing, we can now identify CNAs and LOH at unprecedented resolution. In that way, we simultaneously get information about point mutations, short insertions/deletions and large chromosomal events.
We developed a method that allows detecting CNAs and LOH in whole genome sequencing data [1, 2]. Recently, we updated Control-FREEC so that it can be applied to exome sequencing data. Run time performance was also significantly improved. Within our software, we solve two important issues in the analysis of cancer genomes: contamination by normal cells and possible polyploidy. We also predict the somatic status of identified events.
The algorithm normalizes read count profiles using polynomial regression; then it applies a LASSO-based segmentation procedure to predict copy number status; allelic status is inferred using nucleotide coverage of SNP positions. Only high quality positions in reads are used if the user specifies a quality threshold. Users can also choose an option that will allow rechecking identified copy number status using the allelic profiles. This option turns out to be extremely helpful when analyzing noisy exome sequencing datasets.
References:
1. Boeva V, Zinovyev A et al. (2011). Bioinformatics 27(2):268–269.
2. Boeva V, Popova T et al. (2012). Bioinformatics 28:423–425.
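The normalization step described in the H02 abstract can be illustrated with a minimal sketch. It assumes that read counts per genomic window are regressed on the GC fraction of each window; the abstract only says "polynomial regression", so the choice of covariate, the degree, and the function names `fit_polynomial` and `normalize_read_counts` are illustrative assumptions and not Control-FREEC's actual code.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit. Returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in reversed(range(n)):
        s = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - s) / a[row][row]
    return coeffs


def normalize_read_counts(gc_fractions, read_counts, degree=3):
    """Divide each window's read count by the count expected from its GC fraction.

    Assumption: GC fraction is the covariate. A ratio near 1 suggests the
    baseline copy number, a ratio well above 1 a gain, well below 1 a loss.
    """
    coeffs = fit_polynomial(gc_fractions, read_counts, degree)
    expected = [sum(c * gc ** i for i, c in enumerate(coeffs)) for gc in gc_fractions]
    return [obs / exp if exp > 0 else float("nan")
            for obs, exp in zip(read_counts, expected)]
```

The normalized ratios would then be the input to a segmentation step; the LASSO-based segmentation mentioned in the abstract is not sketched here.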
H03 - CAGI: The Critical Assessment of Genome Interpretation, a community experiment to evaluate phenotype prediction
Short Abstract: The Critical Assessment of Genome Interpretation (CAGI, 'kā-jē) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. In the CAGI experiment, participants are provided genetic variants and make predictions of resulting phenotypes. Independent assessors then evaluate these predictions against experimental characterizations. The primary goals of the experiment are to establish the current state of the art, identify bottlenecks in genome interpretation, inform critical areas of future research, and connect researchers from diverse disciplines whose expertise is essential for advancing methods for interpreting genomic variation. A long-term goal for CAGI is to improve the accuracy of phenotype and disease predictions in clinical settings.
The present CAGI experiment consists of 10 diverse challenges exploring the phenotypic consequences of genomic variation. Previous CAGI experiments have highlighted striking breakthroughs as well as disappointing failures. The CAGI experiment is underway at time of writing, with submissions due in late March, and assessment to be completed before the CAGI conference to be held just before ISMB on 17-18 July 2013.
This talk will summarize the most salient findings from the current CAGI experiment as first revealed to participants at the CAGI conference.
Further information about CAGI, including challenge details, posters, and slide presentations, is available at the CAGI website at http://genomeinterpretation.org.
H04 - ANISE and BASIL: tools for locating and assembling novel sequence
Short Abstract: Variation analysis is a field in bioinformatics that has recently attracted a lot of attention. The greatest advances have been made for calling SNPs and small indel events. Less work has been done so far on finding large insertions.
We present our advances in finding and assembling long inserted sequence using high-throughput re-sequencing data. Our approach consists of two steps. BASIL is a method for locating insert sites on the nucleotide level. BASIL is based on aligned/unaligned pair signatures and clipping signatures generated by read mappers. ANISE is a method for assembling the inserted novel sequence. ANISE is based on the iterative mapping of reads and then assembling of the mapped reads' mates. For the assembly, an efficient overlapping step is followed by a multiple sequence alignment step and then finally by a consensus step.
BASIL accurately determines possible insertion sites using a BAM read alignment file. The resulting file with possible insertion sites is then given to ANISE together with the unmapped reads from the BAM file. ANISE then iteratively closes the "gaps" at the positions determined by BASIL. Our results indicate that BASIL and ANISE are able to accurately determine insert site positions and assemble the novel sequence.
Great care has been taken to make both BASIL and ANISE easy to use and robustly and efficiently implemented, to exploit in-core parallelism, and to use state-of-the-art read mapping methods.
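The idea of a clipping signature can be sketched as follows. This is only an illustration of the general technique, not BASIL's implementation: it assumes alignments are given as hypothetical `(position, CIGAR)` pairs already extracted from a BAM file, and it reports reference positions where at least `min_support` reads are soft-clipped.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipping_positions(pos, cigar):
    """Return reference positions at which a read is soft-clipped.

    pos is the leftmost aligned reference position of the read.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    sites = []
    if ops and ops[0][1] == "S":
        # Clipped on the left: the aligned part starts at pos.
        sites.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S":
        # Clipped on the right: first reference position after the aligned part.
        ref_len = sum(n for n, op in ops if op in "MDN=X")
        sites.append(pos + ref_len)
    return sites


def candidate_insert_sites(alignments, min_support=3):
    """Positions supported by at least min_support clipped reads (threshold is an assumption)."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(clipping_positions(pos, cigar))
    return sorted(site for site, n in counts.items() if n >= min_support)
```

For example, `candidate_insert_sites([(100, "50M20S"), (120, "30M40S"), (150, "25S45M"), (10, "70M")])` reports position 150, where two right-clipped reads and one left-clipped read meet. A real tool would also use the aligned/unaligned pair signatures mentioned above and tolerate small positional offsets.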
H05 - GWAS in the cloud: practical aspects and pitfalls
Short Abstract: High-performance computing (HPC) in the cloud potentially offers many advantages over in-house HPC, including lowered costs, easy scalability, virtually unlimited storage, easy backup and recovery, quick deployment, and access from almost any location. Increasingly, HPC is a necessity for many of the operations involved in genome-wide association studies (GWAS). Running a GWAS in the cloud is an attractive option, especially for institutions with limited HPC capabilities, or where HPC resources are in high demand. What are the practical aspects of running a GWAS in the cloud, which may use large data sets, often with sensitive patient information? We investigated the practicality of running various GWAS data processing and analysis operations using Amazon Web Services, including analyses involving millions of genetic variants imputed from 1000 Genomes. Specifically, we investigated the transport of data, cluster computing setup and scheduling, cost and data storage. Lessons learned from our investigation serve to inform other investigators who wish to utilize the cloud for GWAS analysis.
H06 - COSMIC Genomes: A resource for mining genomic tumour data in human cancer.
Short Abstract: COSMIC Genomes (http://cancer.sanger.ac.uk/wgs) is a high quality curated resource for the exploration of somatic mutations in human cancer, combining data from the Cancer Genome Project at the Sanger Institute, the scientific literature, and data portals such as the TCGA and ICGC.
There are a variety of ways by which data in COSMIC Genomes can be examined. For interactive and visual exploration of these data, the new COSMIC Genomes website provides enhanced views of genomic information using an integrated GBrowse, and gene centric information using histograms with various filtering options, for instance by tissue type or exclusion of 1000 Genomes SNPs. The powerful new Tissue Browser enables users to specify detailed cancer diseases (using COSMIC's controlled vocabulary to represent tumours) and analyse the associated data with the help of tables and charts. We enable users to analyse and download data in either TSV or CSV format across the whole website, and we also have a dedicated FTP site where more than 1000 unique visitors download our data every release. For more computational approaches we have COSMICMart (a BioMart instance), which can combine our data with other resources like Ensembl and UniProt. We also have a DAS resource by which data can be downloaded.
In our continuing effort to enhance COSMIC’s functionality, we are currently integrating new data types (e.g. Copy Number Variants) and increasing its analytical functionality, ensuring it endures as a highly useful tool in cancer genomics.
H07 - SV-Bay: structural variant detection in cancer genomes based on the Bayesian approach with correction for the GC-content and mappability
Short Abstract: Next-generation sequencing (NGS) allows analyzing whole cancer genomes, providing us with insights into the landscape of somatic mutations including cancer-specific structural variants (SVs). We developed a tool, SV-Bay (Bayesian approach for detecting SVs), to detect SVs from mate-pair or paired-end reads uniquely mapped to the reference genome.
Abnormalities of read mapping signify abnormalities in the tumor genome. Each type of SV is characterized by a particular paired read signature. Read coverage provides additional information, e.g., breakpoint areas and deleted regions have a lower coverage than expected, while duplicated regions have a higher coverage. Both abnormal read mapping and changes in normal read coverage indicate an SV.
We suggest a new method based on the Bayesian approach which combines both types of observations (abnormal reads and coverage) to detect SVs from NGS data. SV-Bay collects and clusters all abnormally mapped reads, which could be associated with SVs. To distinguish between a real SV and possible mis-alignments, SV-Bay calculates a probability provided by a Bayesian model we define. When calculating conditional probabilities of the Bayesian model, we take into account the GC-content and mappability of the region as well as changes in read density and number of abnormal pairs supporting the SV.
SV-Bay is able to detect a large spectrum of SVs including homozygous/heterozygous deletions, insertions of a known/unknown fragment, inversions, tandem/mirror-like duplications, balanced/unbalanced translocations, and amplicon structures. Additionally, the algorithm provides the most likely break-point position for each SV. The tool was tested on the human melanoma cell line COLO-829 and on simulated data, and demonstrated high prediction accuracy.
H08 - SNPest: A probabilistic graphical model for estimating genotypes
Short Abstract: As the use of next-generation sequencing technologies is becoming more widespread, the need for robust software to help with the analysis is growing as well. One of the key challenges when analyzing sequencing data is the prediction of genotypes from the reads, i.e. the researchers have to be able to correctly infer the DNA sequence that produced the fragments being sequenced. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variations between the individual and the population can then be analyzed by looking for single nucleotide polymorphisms (SNPs) in order to investigate diseases or other phenotypic features. To perform robust and high confidence genotyping and SNP calling, methods are needed that take the technology specific limitations into account and can model different sources of error. We present a novel approach to the genotyping problem where a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology specific errors and other sources of variation that can affect the result. The inferred genotype is given a posterior probability to signify the confidence in the result. SNPest has already been used to genotype large scale projects such as the first ancient human genome published in 2010. The program is implemented in C/C++ and the source code is freely available under GNU GPL v3 from https://code.google.com/p/snpest/
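To make the notion of a genotype posterior concrete, here is a minimal sketch of a much simpler model than the graphical model described above. It assumes a diploid site, independent base calls with Phred-scaled qualities, errors spread uniformly over the three other bases, and a uniform prior over the ten unordered genotypes. None of these choices are taken from SNPest; the function name `genotype_posteriors` is hypothetical.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(observations, prior=None):
    """Posterior probability of each diploid genotype at one site.

    observations: list of (base, phred_quality) pairs covering the site.
    prior: optional dict mapping genotype tuples to prior probabilities
           (uniform if omitted; this is an assumption of the sketch).
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}
    joint = {}
    for g in genotypes:
        likelihood = 1.0
        for base, qual in observations:
            err = 10 ** (-qual / 10)
            # Each read is assumed to sample one of the two alleles with probability 1/2.
            likelihood *= sum((1 - err) if base == allele else err / 3
                              for allele in g) / 2
        joint[g] = prior[g] * likelihood
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}
```

With five high-quality A reads and five high-quality G reads, the heterozygous genotype `("A", "G")` receives almost all of the posterior mass. For deep coverage the product of likelihoods should be computed in log space to avoid underflow.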
H09 - EvoSNP-DB: A database of genetic diversity in East Asian populations
Short Abstract: Genome-wide association studies (GWAS) have become popular as an approach for identification of large numbers of phenotype-associated variants. However, differences in genetic architecture and environmental factors mean that the effect of variants can vary across populations. Understanding population genetic diversity is valuable for investigation of possible population specific and independent effects of variants. EvoSNP-DB aims to provide information regarding genetic diversity among East Asian populations including Chinese, Japanese, and Korean. Non-redundant SNPs (1.6 million) were genotyped in 54 Korean trios (162 samples) and compared with 4 million SNPs from HapMap phase II populations. EvoSNP-DB provides two user interfaces for data query and visualization, and integrates scores of genetic diversity (Fst and VarLD) at the levels of SNPs, genes, and chromosome regions. EvoSNP-DB is a web-based application allowing users to navigate and visualize measurements of population genetic differences in an interactive manner and is available online at [http://biomi.cdc.go.kr/EvoSNP].
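As an illustration of the Fst score mentioned above, the following sketch computes Wright's fixation index for a single biallelic SNP from per-population allele frequencies, weighting populations equally. The abstract does not state which Fst estimator EvoSNP-DB uses, so this particular definition, and the function name `fst`, are assumptions.

```python
def fst(allele_freqs):
    """Wright's Fst for one biallelic SNP.

    allele_freqs: frequency of the same allele in each population.
    Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of
    the pooled population and H_S the mean within-population heterozygosity.
    """
    n = len(allele_freqs)
    p_bar = sum(allele_freqs) / n
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The SNP is monomorphic across all populations: no differentiation.
        return 0.0
    h_within = sum(2 * p * (1 - p) for p in allele_freqs) / n
    return (h_total - h_within) / h_total
```

Identical frequencies in all populations give 0, and alleles fixed for different variants in two populations give 1; for example, frequencies of 0.2 and 0.8 give 0.36.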
H10 - Computational Analysis of Large Rearranged Immunoglobulin Sequence Sets
Short Abstract: The study of inherited variation in the immunoglobulin heavy chain (IGH) locus has lagged behind that of other loci. This locus undergoes recombination and somatic mutation and the resulting variation is difficult to distinguish from inherited polymorphisms. In addition most large scale human genomics projects are based on sequencing DNA from lymphoblastoid cells in which the locus has been recombined.
Our group has pioneered the use of ultra-deep sequencing of rearranged IGH genes to understand variation in the germline locus. By sampling and comparing tens of thousands of rearranged sequences from an individual it is possible to identify the patterns of variation that are consistent with inherited polymorphisms rather than somatic mutation, and in turn genotype and in some cases haplotype the IGH locus in that individual.
This poster presents the various computational resources developed by our group for this purpose, including alignment, genotyping and haplotyping tools as well as a database of new polymorphisms discovered using our approach.
H11 - Associations of vitamin D receptor gene polymorphisms FokI and BsmI with susceptibility to rheumatoid arthritis and Behçet’s disease in Tunisians
Short Abstract: Reports of immunomodulating effects of vitamin D suggest a need for examining allele and genotype frequencies of the vitamin D nuclear receptor gene (VDR) in patients with autoimmune diseases. T-helper-1 (Th1) counts in peripheral blood are increased in both rheumatoid arthritis (RA) and Behçet’s disease (BD). We studied VDR polymorphisms in patients with these two diseases in Tunisia.
Methods: In 108 patients with RA, 131 patients with BD, and 152 controls, we studied FokI and BsmI VDR polymorphisms, using the restriction fragment length polymorphism technique.
Results: The FokI polymorphism alleles and genotype were significantly more common in the RA group than in the controls (P = 0.001 and P = 0.005, respectively). The FokI F allele and F/F genotype were significantly associated with BD (P = 0.0003 and P = 0.002, respectively). Furthermore, in the group with BD, the FokI polymorphism was significantly associated with the presence of vascular manifestations (P = 0.006). In patients with RA, the FokI polymorphism was significantly associated with female gender (P = 0.003). No significant associations were found between the BsmI polymorphism and RA or BD.
Conclusion: The VDR F allele is associated with RA and BD in Tunisians.
H12 - Identification of Low Frequency Variants from High-throughput Sequencing Datasets
Short Abstract: The ability to sequence deeply has spurred a renewed interest in investigating the impact of cell-population heterogeneity in a range of settings, including viruses (quasispecies), bacteria, tumors etc. High-throughput sequencing datasets, in principle, enable the detection of extremely low frequency variants seen in a given cell-population, to study their evolution and impact on phenotypes of interest. The use of ad hoc filters, however, often limits the sensitivity and specificity of detection, particularly when multiple samples are compared, as is the case for somatic variant-calling and time-course studies.
In this work, we demonstrate the utility of a systematic and extensible framework for variant-calling (LoFreq*) that simultaneously incorporates sequence-quality, mapping-quality and other sources of error into a unified model, allowing robust and sensitive calls of single-nucleotide variants and indels. Our benchmarking and validation results on real and in silico datasets demonstrate that this approach provides a significant boost in sensitivity over existing variant-callers (accurately calling variants at <1% frequency), while retaining very high specificity. We demonstrate the generality of this approach by discussing results from its application to calling somatic variants in heterogeneous and contaminated tumor samples as well as in the analysis of time-course RNA-seq datasets.
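One way to fold per-read error estimates into a single test is sketched below. It is an illustration only: the abstract does not specify the statistical model, so the Poisson-binomial null model, the way base and mapping quality are combined in `combined_error`, and all function names are assumptions of this sketch.

```python
def phred_to_prob(q):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def combined_error(base_q, map_q):
    """Probability that a base is wrong because of a sequencing OR a mapping error.

    Treating the two error sources as independent is an assumption.
    """
    return 1 - (1 - phred_to_prob(base_q)) * (1 - phred_to_prob(map_q))


def poisson_binomial_tail(error_probs, k):
    """P(at least k errors) when read i errs independently with error_probs[i].

    Under the null hypothesis of no true variant, every non-reference base
    is an error, so a small tail probability supports a real variant.
    """
    # dist[j] = probability of exactly j errors among the reads processed so far.
    dist = [1.0]
    for p in error_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1 - p)
            new[j + 1] += mass * p
        dist = new
    return sum(dist[k:])
```

For a column of 1000 reads that each have an error probability of 0.001, roughly one error is expected, so observing ten non-reference bases yields a very small tail probability even though the variant frequency is only 1%. The quadratic-time recursion is fine for a sketch; a production caller would prune it and correct for multiple testing.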
H13 - Biological pathways of musical aptitude
Short Abstract: Humans have developed the perception, production and processing of sounds into the art of music. Understanding of music is innate: Even newborn infants can recognize familiar melodies - the neuronal architecture is already set to process music. Musical aptitude has long been recognized to be inherited; indeed, heritabilities have been estimated to be as high as 0.7. Here, we evaluate the genetic background of musical aptitude.
We have measured musical aptitude as a skill of auditory perception: abilities to discriminate pitch, duration and sound patterns in tones. Genome-wide linkage and association scans were performed for 76 informative families ranging from trios to extended multigenerational families. We used the Bayesian approach KELVIN, which supports these large families and quantitative phenotypes.
Several genes were identified. Importantly, most of the identified genes are involved in the development of cochlear hair cells or inferior colliculus (IC), both of which belong to the auditory pathway. We also confirmed previous findings of chromosome 4 being linked to musical aptitude.
Notably, all of the best associations were located at gene promoters. To study the biological meaning of the sites, we performed promoter analysis for the best-associated genes.
We hypothesize that genes affecting the development of the auditory pathway constitute the ground for musical abilities and that differences in gene regulation cause the variation in these skills.
H14 - Detecting high quality variants from color-space data
Short Abstract: Next Generation Sequencing (NGS) technologies have greatly improved our ability to mine variants out of the entire genome. The reliability of calling variants is highly related to the sequencing instrument used, due to the sequencing chemistry and the intrinsic properties of each technology. Here, we focus on variants detected from color-space sequences generated by AB SOLiD 5500 XL sequencers. We systematically analyzed 120 human exomes from the Spanish population, identifying the main drivers of bias derived from SOLiD color-space data and, in turn, optimizing an analysis pipeline to obtain high-quality variants.
Among the main issues is the selection of the mapping algorithm. Several open-source procedures have been developed for mapping color-space data. Among them, we found that the BLAT-like Fast Accurate Search Tool (BFAST), using best-hit mode and some post-mapping modifications, gives the best performance for our datasets. The final step comprises the Genome Analysis Toolkit (GATK) for variant calling. Contrary to the GATK recommendation for whole-exome experiments, we found that in experiments with low-medium coverage (~40x), the use of a depth filter in combination with Best Practices V3 quality filters is essential to remove a high number of false positives.
The system that we have developed for color-space data provides high-quality variants with an extremely low rate of false positives as shown by Sanger sequencing validations.
H15 - Identification of 33 osteosarcoma germ-line risk loci in three canine breeds points to multiple pathways including CDKN2A/B regulation
Short Abstract: Osteosarcoma, the most common bone malignancy, is an aggressive cancer characterized by early metastasis, primary onset in children and adolescents, and high mortality rates (30-40%). Recent work suggests that osteosarcoma arises when epigenetic and genetic factors disrupt osteoblast differentiation from mesenchymal precursors. In humans, large-scale genomic analysis is impeded by the genetic complexity of the tumor and relative rarity of the disease. Here, we use genome-wide analyses in three dog breeds (the greyhound, rottweiler and Irish wolfhound) that develop osteosarcomas at rates of 10-25%. We identify 33 distinct genomic loci associated with osteosarcoma explaining 55-85% of the phenotype variance in each breed. While regions differ between breeds, genes involved in bone development and differentiation are overrepresented. High-throughput sequencing of our top locus identifies a risk haplotype associated with disease in greyhounds and fixed in the two other breeds. This 15kb haplotype is located 150kb away from the CDKN2A/B locus. Luciferase assays reveal an element with a 30x-increased enhancer activity in human osteosarcoma cells, overlapping the most associated SNP, which is located at a constrained base in the genome. More generally, gene set enrichment analysis shows that both associated loci and regions potentially under selection are enriched in key pathways - both known (p53, kit) and novel (targets of MIR-124) - that are also frequently altered in tumors. Thus, mapping a complex disease in multiple dog breeds reveals a polygenic spectrum of germ-line risk factors that act as early drivers of disease progression.
H16 - FATHMM for variant analysis
Short Abstract: We present extensions to the "Functional Analysis Through Hidden Markov Models" (FATHMM) tool including cancer-associated variants, indels, predictions for variants in non-coding regions of the human genome and pre-computed results for all possible missense variants in human and other ENSEMBL genomes.
FATHMM is a tool for variant analysis. The central component of FATHMM is the prediction of whether protein missense variants will have a significant impact or not. Also included is a prediction of the functional and phenotypic outcome of the variant. FATHMM performs very well compared to other popular methods such as SIFT and PolyPhen (and other less common methods). FATHMM works on all species and is very fast (high-throughput).
FATHMM has been extended in several ways. There is now a specific prediction for cancer-associated variants, which has been applied to the COSMIC dataset. FATHMM also now includes, in the same tool, predictions for indels in addition to missense variants. FATHMM now also includes prediction for non-coding regions of DNA in human based on two principles. FATHMM previously included pre-computed results for all possible missense variations in human, and this has now been extended to other ENSEMBL genomes.
H17 - Iterative principal component analysis for population structure correction for large GWAS studies
Short Abstract: Population structure refers to the presence of systematic ancestry differences between individuals in genomic data. The detection and correction of population structure is an important step in genome-wide association studies (GWAS), as uncorrected population structure may lead to false positives, confounding the results of GWAS. One popular approach to population structure correction is to determine the principal components of the genotype covariance matrix, which can be viewed as the different axes of ancestry of the individuals. These principal components are then included in a regression framework, accounting for population substructures in the sample data.
One drawback of this approach is that it can become computationally expensive for large-scale GWAS datasets, especially for increasing numbers of individuals.
To address this issue, we develop an iterative method to compute the leading principal components that is efficient in both memory and runtime. Our method calculates the principal components sequentially, taking advantage of the sparse nature of genotype data. Furthermore, it can be easily integrated with the popular EIGENSTRAT program, as well as with different regression approaches, and can therefore be applied independently of the association analysis, allowing the subsequent application of more sophisticated statistical models such as association tests for interactions.
We evaluate our method on several real case-control GWAS datasets and compare its performance to the popular EIGENSTRAT method.
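Computing leading principal components one at a time can be illustrated with power iteration and deflation. This is a generic textbook technique chosen for illustration; the abstract does not say which iterative scheme the authors use, and the sketch below does not exploit sparsity. It assumes a genotype matrix `x` with one row per individual and one column per SNP, already centered per SNP, and the function name is hypothetical.

```python
import math
import random


def pca_power_iteration(x, n_components, n_iter=100, seed=0):
    """Leading eigenpairs of C = X X^T / m, computed sequentially.

    Returns a list of (eigenvalue, unit vector over individuals).
    C itself is never formed: each step multiplies by X^T and then by X.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0, 1) for _ in range(n)]
        eigenvalue = 0.0
        for _ in range(n_iter):
            # Deflation: remove projections onto components already found.
            for _, u in components:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            # w = X^T v (one entry per SNP), then v = X w / m.
            w = [sum(x[i][j] * v[i] for i in range(n)) for j in range(m)]
            v = [sum(x[i][j] * w[j] for j in range(m)) / m for i in range(n)]
            eigenvalue = math.sqrt(sum(a * a for a in v))
            if eigenvalue == 0:
                break
            v = [a / eigenvalue for a in v]
        components.append((eigenvalue, v))
    return components
```

The entries of each returned vector are the coordinates of the individuals along one axis of ancestry, which could then be added as covariates in an association regression. Memory use stays linear in the size of the genotype matrix because only matrix-vector products are needed.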
H18 - Analyzing the influence of genetic variation on the regulatory regions of the genome
Short Abstract: Most genetic variation lies within the immense non-coding fraction of the genome and, interestingly, most phenotype (and disease) associated SNPs are also discovered there. Many of these SNPs may be explained by linkage disequilibrium, but it is also very likely that some of them lie within promoters and enhancers, thereby affecting transcription factor binding, the local chromatin state and ultimately gene expression, resulting in phenotypic changes. Several examples are already well characterized, but a genome wide approach to measure such epigenetic changes associated with genetic variation is not yet available. In order to address this, we developed a ChIP-seq pipeline of histone H3 lysine 27 acetylation (H3K27ac) to purify and sequence active enhancers and promoters and tested it on 58 lymphoblastoid cell lines of the Nigerian HapMap population. After peak calling and normalization we screened for differential peak heights (DPH) between individuals. We then matched 1000 Genomes SNPs that lie within the peaks in the individual cell lines and discovered that SNPs are associated with DPH. Next, we correlated gene expression and DPH for each individual and show that variation in DPH of enhancers and promoters influences gene expression of specific genes. Our results indicate that H3K27ac ChIP-seq can systematically be used to identify potential functional genetic variants in human populations and reveal the regulatory elements that connect non-coding variation to expression phenotype.
H19 - Nearly all adenosines in expressed Alu repeats undergo A-to-I RNA editing
Short Abstract: RNA molecules carry the information encoded in the genome and reflect its content. Adenosine-to-inosine (A-to-I) RNA editing by ADAR proteins converts a genomically encoded adenosine into inosine. It is known that most RNA editing in humans takes place in the primate specific Alu sequences, but the extent of this phenomenon and its effect on transcriptome diversity is not clear. Here we analyzed large-scale RNA-seq data and detected over 1.6 million editing sites. As detection sensitivity increases with sequencing coverage, we performed ultra-deep sequencing of selected Alu sequences and showed that the scope of editing is much larger than anticipated. We found that virtually all adenosines within Alu repeats that form double-stranded RNA undergo A-to-I editing, although most sites exhibit very low levels (<1%). Moreover, we observed editing of transcripts resulting from residual anti-sense expression, doubling the number of edited sites in the human genome. Based on the bioinformatic analyses and deep targeted sequencing, we estimate that there are over 100 million human Alu RNA editing sites, located in the majority of human genes. These findings set the stage for exploring how this primate-specific massive diversification of the transcriptome is utilized.
H20 - Exploiting trait pleiotropy to identify causative genes in an integrated multidimensional genomic dataset
Short Abstract: Genome-wide association studies highlight quantitative trait loci (QTL) involved in regulating physiological traits; however, identification of the causative genes underlying these traits remains a major challenge. On the other hand, identifying the causative gene for gene expression QTL (eQTL) may be straightforward, since colocation of gene expression and genetic association identifies that gene as likely ‘causative’ in itself, suggesting regulation in cis. Based on this logic, in instances where a cis eQTL underlies a physiological QTL, the causative gene for both traits may be readily identified. In the context of phenotype-rich ‘omic’ datasets, identifying genes with pleiotropic effects could therefore enable causative gene identification on a large scale.
To this end, we report a ‘hypothesis-free’ method for identifying candidate causative genes utilizing a large, multidimensional bovine dataset. This dataset consists of 864 F2 crossbreed cows assessed for a vast array of phenotypes, including measures of milk production and composition, disease, growth, behavior, and many other physiological characteristics. These data also include genome-wide microarray expression results from fat and liver tissues. We used these ~50,000 phenotypes in conjunction with >650k genotypes imputed from the Illumina BovineHD 777k platform to perform GWAS on all traits. Integrating these data by individual SNP, P-value rank correlations between traits were computed and used to identify potentially pleiotropic QTL. In the case of co-regulation between cis eQTL and physiological traits, we can identify the candidate causative genes for these phenotypes, and some of these will be presented to illustrate our method.
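The P-value rank correlation between two traits can be sketched as a Spearman correlation over per-SNP GWAS P-values. This is a minimal illustration assuming two equal-length lists of P-values for the same SNPs in the same order; the function names are hypothetical and the authors' exact procedure may differ.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pvalue_rank_correlation(pvalues_trait1, pvalues_trait2):
    """Spearman correlation: Pearson correlation of the P-value ranks."""
    return pearson(ranks(pvalues_trait1), ranks(pvalues_trait2))
```

A value near 1 for an expression trait and a physiological trait within the same region would flag the pair as potentially co-regulated, which is the kind of signal the method above uses to nominate candidate causative genes.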
H21 - Characterization and identification of cis-regulatory elements in Arabidopsis thaliana based on SNP information
Short Abstract: The identification of regulatory elements encoded in an organism’s genome remains a central goal of modern molecular biology. We exploited the genomic sequencing information of a large number of different accessions of Arabidopsis thaliana, as available from the 1001 Genomes Project, to characterize known and to identify novel cis-regulatory elements in its gene promoter regions. Assuming that promoters and regulatory elements such as transcription factor binding sites (TFBSs) are more conserved than non-functional intergenic regions, we wanted to estimate the bounds of promoter regions by determining the density of single nucleotide polymorphisms (SNPs) along the intergenic regions, verify known TFBSs by analyzing their localization versus their level of conservation, and find new candidate motifs out of all possible nucleotide hexamers. Based on the obtained SNP density profile and the genomic layout, the average length of promoter regions could be established at 500nt. We confirmed that known TFBS-motifs are indeed more conserved than the promoter background. For sixteen known motifs, their positional preferences could be clearly substantiated based on their position-specific decreased SNP density. Lastly, twelve candidate hexamers were identified whose relative positional occurrence correlates significantly with their conservation level and which may represent newly discovered motifs awaiting experimental validation. For eight hexamers, significant associations to particular processes and functions of the associated downstream genes suggest a functional relevance of those newly found motifs. Our study demonstrates that the currently available resolution of SNP data offers novel ways for the identification of functional genomic elements and the characterization of gene promoter sequences.
TOP
H22 - Mitochondrial Cytochrome b Gene Missense Mutation Associated with Primary Cardiomyopathy
Short Abstract: Mitochondria play a critical role in both life and death of cardiomyocytes. In healthy cells, their primary function is to meet the high energy demand of the beating heart by providing ATP through oxidative phosphorylation. Mitochondrial disorders (MIDs) leading to myocardial disease show a strong age-dependent clinical heterogeneity. Various types of hypertrophic and dilative cardiomyopathy (hCMP, dCMP) can be attributed to disturbed mitochondrial oxidative energy metabolism. In the present study, the MT-CYB gene was analysed for alterations in 30 patients with hCMP, 40 patients with dCMP, and 50 controls. Altogether, 27 MT-CYB variants were detected. Twenty-four of them were single nucleotide polymorphisms defining common haplogroups. The variant m.15434C>A was found in a single patient with severe dCMP and was assessed as a novel mutation, since it was not found in healthy controls or available data sets and was not associated with any haplogroup in Phylotree. This variant alters an amino acid (L230I) with a high interspecific amino acid conservation index (CI = 97.7%), indicative of the functional importance of the residue. Thus, the L230I mutation seems to play a causative role in dCMP.
TOP
H23 - Variobox: a tool for the exploration of human genetic variations
Short Abstract: Genetic variations not only dictate phenotypic differences between human beings, but are also the underlying cause of many gene-based disorders. Detecting, understanding, categorizing and associating human mutations with phenotypes is becoming the standard process to accomplish personalized medicine. Multiple software and hardware technologies have already sprung from miscellaneous projects, leveraging an exponential growth of available genetics data, and it is now fairly easy and cheap to obtain sequence profiles for large cohorts, such as the 1000 Genomes Project, or for unique individuals, such as the ones produced by several genetic analysis companies and labs. Moreover, the overwhelming quantity of genetic patient data emerging from labs, along with available LSDBs, suggests the need for integration solutions that are able to use knowledge about variations for gene research and patient care scenarios.
We introduce Variobox, a desktop tool for the annotation, analysis and comparison of human genes. Variobox obtains variant annotation data from WAVe, protein metadata annotations from PDB and UniProt, and sequence metadata from the Locus Reference Genomic (LRG) and RefSeq databases. To explore the retrieved data, Variobox provides an advanced sequence visualization that permits agile navigation through the various genetic regions, and combines its features in an intuitive interface to analyse genes and mutations. Finally, genes can be compared to sequences retrieved from LRG and RefSeq, finding and automatically annotating new potential variations. Variobox is a free cross-platform desktop application, available for download at http://bioinformatics.ua.pt/variobox
TOP
H24 - Graph based algorithms for genetic variants
Short Abstract: The amount of data generated by DNA sequencers is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. We have developed a series of graph based algorithms and software for the detection of some challenging types of genomic variation from the data generated by modern DNA sequencers and GWAS data.
The algorithms developed include algorithms for the detection of haplotypes, Alu repeats and copy number polymorphisms. We have formulated these problems in a common framework. The first step is a linear time algorithm that searches the original data for interesting attributes and casts the problem into a graph framework. This step reduces the size of our datasets by several orders of magnitude, allowing us to solve optimization problems over the resulting graph, even those whose solution in the worst case requires exponential time. The final step of our algorithms is verifying that the variants found in the reduced graph based formulation agree with the original data.
Using these formulations we are able to simultaneously consider the DNA sequence reads of multiple individuals in search of variants that have low population frequency and little signal. The algorithms are further able to detect complex variants, where multiple evolutionary events are occurring simultaneously. We have used this graph formulation to develop algorithms for the simultaneous detection of structural variants (inversions and copy number variations) from a large number of individuals.
TOP
H25 - Gene-centered viewing, storing and sharing of exome/genome variant and phenotype data
Short Abstract: The favourite view of sequence variant data in DNA diagnostic centers is gene-centered. We have developed a new version of the LOVD platform (Leiden Open-source Variation Database, http://www.LOVD.nl) facilitating the analysis of exome and genome sequence data. During installation, web services retrieve gene and transcript information on the fly. Imported variant data are stored using chromosomal nucleotide positions as a reference. Data can be stored and displayed in several ways: variant-by-variant or all connected to one individual. Using the existing LOVD functionality, users have the option to query per gene or per individual, to link to other resources of interest, to get genome browser views of the data and to use web services to access variants stored in other gene variant databases. LOVD 3 has the unique option to independently store both the phenotypes screened and the variants detected. This gives submitters the chance to share inconclusive results, allowing collaborators with matching data to join the gene identification project and crack the case together. In addition, LOVD 3 has a new access level, designated “collaborator”, allowing submitters to share otherwise non-public data with other submitters, e.g., to share detailed phenotype information with other diagnostic labs only.
TOP
H26 - Next generation data architecture for large-scale variation studies.
Short Abstract: With the advent of high throughput whole genome sequencing in large populations, many established storage and data-exchange schemes for genetic variation data are overstrained. The field has turned to purpose-built data formats and storage solutions for genetic variation data. Examples of these formats and associated tools include the PLINK format for binary genotype representations and the VCF format. These formats have key advantages in compression and speed compared with loading data into relational database systems; however, their narrow design also tends to produce siloed datasets with a single discrete file per study, rendering global queries and data mining across datasets difficult. Another limitation is the implicit assumption that all non-reported sites contain only the reference allele. A more nuanced representation is needed to explicitly describe which parts of the genome were comprehensively assayed versus unreported regions due to lack of sequencing coverage or quality filtering.
Here, we present and compare alternative solutions that combine the advantages of a file-based approach with the utility of a database, providing a global query infrastructure through the use of NoSQL databases and data formats for big-data scientific applications, including MongoDB and HDF5. We describe an integrated system where archival single submission data files form the basis of a data warehouse layer facilitating global query. Further, we extend these solutions to the ideal future case of the inclusion of phased haplotype data in conjunction with the allele level tracking of site-specific genetic variants.
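The distinction between "reference" and "not assayed" sites can be illustrated with a small lookup. This is a minimal sketch under assumed data structures (a position-to-allele dictionary and a sorted list of assayed intervals); it does not reflect the storage layout of the system described.

```python
from bisect import bisect_right


def site_status(pos, variants, assayed):
    """Classify one position on a single chromosome.

    variants: dict mapping position -> alternate allele (reported sites).
    assayed:  sorted, non-overlapping list of (start, end) intervals, both
              inclusive, that were comprehensively assayed.
    """
    if pos in variants:
        return "variant:" + variants[pos]
    # Index of the last interval whose start is <= pos.
    i = bisect_right(assayed, (pos, float("inf"))) - 1
    if i >= 0 and assayed[i][0] <= pos <= assayed[i][1]:
        return "reference"
    # No call here: absence of a variant says nothing about the genotype.
    return "not_assayed"
```

With only a variant list, the positions reported here as "not_assayed" would silently be treated as reference.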
TOP
H27 - Identifying Genomic Copy Number Alteration and Loss of Heterozygosity in Next-Generation Sequence Data
Short Abstract: Losses and duplications of large genomic regions resulting in copy number alterations or loss of heterozygosity (LOH) are common drivers of cancer development. Until recently, these aberrations have been characterized primarily by means of SNP and copy-number oligonucleotide arrays. However, next-generation sequencing has become a dominant approach for detecting mutations in cancer genomes, and the read data generated in these studies can also be used for characterizing genomic aberrations in cancers.
We developed an analytical approach and software that considers read depth and read counts for each allele to identify genomic regions with copy number alteration or LOH. The approach, “RDAAC” (Read Depth and Allele Counts), requires aligned sequencing reads from matched malignant and non-malignant DNA. RDAAC simultaneously estimates the proportion of non-tumor genomes in the sample and the number of copies of each SNP allele in the tumor sample. We assessed RDAAC’s performance in two ways. First, we compared its results to results from analyzing SNP arrays. Second, we assessed its performance on simulated data, for which true copy number alterations, regions of LOH, and admixture of non-malignant DNA were known. RDAAC functioned well on whole-exome sequencing data even in the presence of generalized polyploidy and >50% admixture of DNA from non-malignant cells.
We developed an R package that implements (1) RDAAC on top of SAMtools (Li et al., 2009) and ASCAT (which was designed for microarray data; Van Loo et al., 2010) and (2) the simulator for synthetic data.
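To show why read depth and allele counts together constrain both admixture and allele-specific copy number, the sketch below computes the expected signals at a heterozygous SNP under a simple mixture model. This is a textbook-style approximation, not RDAAC's actual model: it assumes diploid normal cells with one copy of each allele and ignores depth normalization for overall tumor ploidy.

```python
def expected_signals(purity, n_a, n_b):
    """Expected tumour/normal depth ratio and B-allele fraction at a het SNP.

    purity: fraction of tumour genomes in the sample (1 - normal admixture).
    n_a, n_b: tumour copy numbers of the A and B alleles.
    Normal cells are assumed to carry one copy of each allele.
    """
    mixed_total = purity * (n_a + n_b) + (1 - purity) * 2
    depth_ratio = mixed_total / 2
    if mixed_total == 0:
        # Pure tumour with a homozygous deletion: no reads, BAF undefined.
        return depth_ratio, None
    baf = (purity * n_b + (1 - purity) * 1) / mixed_total
    return depth_ratio, baf
```

For example, at 50% purity a hemizygous loss (`n_a=1, n_b=0`) and a copy-neutral LOH (`n_a=2, n_b=0`) give different depth ratios (0.75 versus 1.0) and different B-allele fractions (about 0.33 versus 0.25), so the two events can be told apart only when both signals are used.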
TOP
H29 - Winnow: A tool to filter and prioritize exome variant datasets to extract user-specific results
Short Abstract: Exome sequencing is becoming a popular method to study causal variants associated with different diseases. Several variant calling software packages are currently available, but sifting through the large datasets of potential variants generated by these packages is a time-consuming, multi-step process. A tool to easily filter such large datasets and extract causal variants of specific interest to researchers would be a useful way to speed up the analysis process. To meet this need, we have developed Winnow, a flexible platform to dynamically filter variants based on different user-specified criteria. It has the following features:
(1) User-specific project space to upload sample data, filter relevant variants, and store results. (2) Perform drill-down set operations to compare variants across multiple samples and extract unique variants or recurrent variants. (3) Refine candidate variant lists using filters for: common VCF tags (e.g. QUAL, DP, MQ, PL, GT) and variant caller-specific tags (Samtools, GATK and VarScan); common variants (dbSNP or 1000 Genomes); somatic variants (Samtools CLR scores or VarScan somatic P values); germline variants (non-Mendelian, dominant and recessive). (4) Prioritize and filter results based on variant functional impact (protein coding, deleterious effect, etc.). (5) Extract variants belonging to a gene set of interest or those located in a specific chromosome region. (6) Explore transcript isoform variations across samples. (7) View sequence reads across the alignment region of filtered variants.
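Features (2) and (3) can be illustrated with a small sketch that filters VCF records on QUAL and the INFO DP tag and then compares samples with set operations. The function names and thresholds are hypothetical and this is not Winnow's code; it assumes tab-separated VCF data lines with at least eight columns.

```python
def parse_info(info):
    """Turn 'DP=35;MQ=60' into {'DP': '35', 'MQ': '60'}; flags map to True."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def passing_variants(vcf_lines, min_qual=30.0, min_dp=10):
    """Return the set of (chrom, pos, ref, alt) passing QUAL and DP cut-offs."""
    keep = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
        dp = parse_info(info).get("DP")
        if qual == "." or dp is None or dp is True:
            continue  # missing quality or depth: cannot be evaluated
        if float(qual) >= min_qual and int(dp) >= min_dp:
            keep.add((chrom, int(pos), ref, alt))
    return keep
```

Given two samples `a` and `b` filtered this way, `a & b` yields the recurrent variants and `a - b` the variants unique to the first sample.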
TOP
H30 - eXtasy: variant prioritization by genomic data fusion
Short Abstract: Next-generation sequencing (NGS) greatly facilitates the discovery of novel disease genes causing Mendelian and oligogenic disorders. However, many mutations are present in any individual genome, and identifying which ones are disease causing remains a largely open problem. We introduce a novel computational approach, called eXtasy, to prioritize nonsynonymous single nucleotide variants (nSNVs) by integrating variant impact prediction, haploinsufficiency prediction and phenotype-specific gene prioritization, which allows significantly improved prediction of disease-causing variants in exome sequencing data. To train our method we use the Human Gene Mutation Database (HGMD) as our source of disease-causing variants and 3 control sets ranging from common polymorphisms to rare variation in healthy individuals. By integrating phenotype-specific gene prioritization information we are able to increase the area under the receiver operating characteristic curve (ROC AUC) by at least 30% compared to classical deleteriousness prediction methods (e.g. SIFT, PolyPhen, MutationTaster). This is likely due to eXtasy’s ability to discriminate between phenotype-specific and phenotype-unrelated deleterious variants. Although our performance is likely overestimated due to prior information bias in a retrospective benchmark, we show that even when controlling for these biases we obtain a substantial performance increase. We believe that the presented approach will greatly facilitate the analysis of exome sequencing data in human disease by efficiently prioritizing nSNVs in the light of the phenotype in question.
TOP
H31 - GUSTAF: Generic multi-split alignment of genomic and transcriptomic sequencing data
Short Abstract: Large-scale population and disease association studies have shown the importance as well as the difficulty of detecting structural variants (SVs) in genomic and also transcriptomic sequencing data. Although very fast and precise, current read mapping tools usually fail to map sequencing reads that cross SV breakpoints or exon-exon boundaries. These events cause one or even multiple splits in the read-to-reference alignment, with parts of the read mapping to various locations on the reference sequence.
We present GUSTAF, a sound generic multi-split detection method implemented in the C++ library SeqAn. GUSTAF uses SeqAn's exact local aligner Stellar to find partial read alignments. Compatible partial alignments are identified, and a split-read graph storing all compatibility information is constructed for each read. Vertices in the graph represent partial alignments, edges represent possible split positions. Using an exact dynamic programming approach, we refine the alignments around possible split positions to determine precise breakpoint locations at single-nucleotide level. We use a DAG shortest path algorithm to determine the best combination of refined alignments, and report those breakpoints supported by multiple reads.
GUSTAF is very versatile: It allows for multiple splits at arbitrary locations in the read, is independent of read length and sequencing platform, and supports both single-end and paired-end reads. Our results show that GUSTAF is able to accurately detect inversions, inter- and intra-chromosomal translocations, insertions and deletions in genomic sequencing data, and also to identify precise exon-exon junctions in RNA-Seq data, including gene fusion transcripts.
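The idea of choosing the best combination of partial alignments by a shortest path over a DAG can be sketched as below. This is a simplified illustration, not GUSTAF's algorithm: the cost function (alignment errors, uncovered read bases and a fixed `split_penalty`) and the compatibility rule (no overlap on the read) are assumptions, and reference-side checks and breakpoint refinement are omitted.

```python
def best_chain(alignments, read_len, split_penalty=5):
    """Cheapest chain of partial alignments covering a read.

    alignments: non-empty list of (read_start, read_end, errors) with
                half-open read coordinates and read_end > read_start.
    Returns (total_cost, chain), where chain is ordered along the read.
    """
    aln = sorted(alignments)  # sorting by read_start gives a topological order
    best = []  # best[i] = (cost of cheapest chain ending at i, predecessor)
    for i, (start, _end, err) in enumerate(aln):
        cost, prev = start + err, None  # chain begins here; leading bases uncovered
        for j in range(i):
            prev_end = aln[j][1]
            if prev_end <= start:  # edge j -> i: a possible split position
                cand = best[j][0] + (start - prev_end) + split_penalty + err
                if cand < cost:
                    cost, prev = cand, j
        best.append((cost, prev))

    def closed(i):  # add the trailing uncovered bases
        return best[i][0] + read_len - aln[i][1]

    last = min(range(len(aln)), key=closed)
    chain, i = [], last
    while i is not None:
        chain.append(aln[i])
        i = best[i][1]
    return closed(last), chain[::-1]
```

For a 100 nt read with partial alignments `(0, 50, 1)`, `(10, 40, 0)` and `(50, 100, 0)`, the cheapest path joins the first and last alignment with a single split at read position 50.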
TOP
H32 - InvFEST: a scientific data warehouse to integrate the information of polymorphic inversions in the human genome.
Short Abstract: The newest genome sequencing technologies have uncovered an unprecedented degree of structural variation in the human genome, and have provided new insights into the genetic basis of phenotypic and disease-susceptibility differences between individuals. Most of these variants are currently catalogued in the Database of Genomic Variants (DGV). However, inversions have been relatively overlooked compared to CNVs because they are difficult to study. Therefore, we have created “InvFEST”, a data-warehouse implementation that integrates several types of data related to inversions with an online analytical processing (OLAP) engine to gather information and compute a report for each inversion. InvFEST merges inversion predictions from healthy individuals into a non-redundant dataset, taking into account the resolution (error) in breakpoint location. Moreover, it stores information from validations and genotyping assays, the association with genes and segmental duplications, and the evolutionary history of the inversions. The initial results show a low overlap between the inversion predictions of the different studies, with more than 70% of inversions predicted by only one study. Nevertheless, after filtering unreliable locations, the total number of independent inversions is reduced by half, to fewer than 600. This suggests that there may be diverse biases in each inversion prediction method and that our knowledge of human inversions is still incomplete. The InvFEST database aims to fill the void in inversion information by becoming a central data repository to share results and collaborate towards the complete characterization of human polymorphic inversions.
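Merging predictions while allowing for breakpoint error can be sketched as follows. The tuple layout, the greedy clustering against the first member of each cluster, and the rule that two predictions match when both breakpoints agree within the sum of their errors are all assumptions made for illustration; InvFEST's actual merging procedure is not described here.

```python
def merge_predictions(preds):
    """Group inversion predictions into non-redundant clusters.

    preds: list of (chrom, breakpoint1, breakpoint2, error, study).
    """
    clusters = []
    for chrom, bp1, bp2, err, study in sorted(preds):
        for c in clusters:
            tolerance = c["err"] + err
            if (
                c["chrom"] == chrom
                and abs(c["bp1"] - bp1) <= tolerance
                and abs(c["bp2"] - bp2) <= tolerance
            ):
                c["studies"].add(study)
                break
        else:  # no existing cluster matched: start a new one
            clusters.append(
                {"chrom": chrom, "bp1": bp1, "bp2": bp2, "err": err,
                 "studies": {study}}
            )
    return clusters


def single_study_fraction(clusters):
    """Fraction of merged inversions supported by only one study."""
    return sum(len(c["studies"]) == 1 for c in clusters) / len(clusters)
```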
TOP
H33 - A next generation variation archive
Short Abstract: A comprehensive resource for genetic variation in human and other species needs to provide an integrated view of variation at all scales, from single nucleotide variants, via short insertions and deletions, to large structural variants. Existing variation archives create an arbitrary distinction between variations at different scales and contain a decade of legacy observations and predicted variants that need to be re-interpreted and prioritized in the context of recent large next generation variant discovery datasets. We describe here our efforts to provide an integrated archive and query tools for genetic variation within a biological context; the archive is scale invariant, reconciles current and legacy data and formats, and provides relevant variation data in one place. Our system archives the original sequencing experiment and variant call products, often expressed as FASTQ, BAM, and VCF files, as a primary result reference object because this accurately reflects the unit of experiment and analysis in current pipelines. These study level objects are then linked to downstream analysis where variants are merged across studies for global query and display. Representation of known non-variant regions is just as critical as reports of variation, hence we explore ways to capture and expose this level of data. Further, explicit tracking of alleles at variant positions provides more refined biological interpretation than tying such information to genomic position based accessions.
TOP
H34 - Genotyping-by-sequence and SNP discovery in rice cultivars
Short Abstract: A wide range of naturally occurring variation, likely to be agronomically relevant, exists in rice cultivars. Knowledge about drought tolerance in rice will help the selection of drought-resistant varieties of rice, one of the most drought-susceptible crops. Rice is a staple food for over half the people in the world, and water scarcity resulting from rising demand for water for competing uses indicates the need to find polymorphisms linked to drought-tolerant varieties of rice.
Genetic variation conditioning drought tolerance exists in rice, but such variation must be captured. Single-nucleotide polymorphisms (SNPs) are considered an effective way of accessing such polymorphism. To explore the existing biodiversity of rice in Brazil, a collection of rice accessions was selected from the Embrapa Rice Core Collection based on diversity, utility in breeding, and geographical representation in Brazil and worldwide.
Genotyping-by-sequence (GBS) was used to detect polymorphism in a collection of 94 accessions. An initial analysis identified 13,167 SNPs distributed along the 12 rice chromosomes when filtering sites for at least 95% of taxa with data. After imputation, around 120,000 quality SNPs could be identified.
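The site filter "at least 95% of taxa with data" amounts to a call-rate threshold per SNP site. A minimal sketch, assuming a hypothetical input of one list of genotype calls per site with `"N"` marking missing data:

```python
def filter_sites(genotypes, min_call_rate=0.95, missing="N"):
    """Keep sites where enough accessions have a genotype call.

    genotypes: dict mapping site id -> list of calls, one per accession.
    """
    kept = []
    for site, calls in genotypes.items():
        call_rate = sum(c != missing for c in calls) / len(calls)
        if call_rate >= min_call_rate:
            kept.append(site)
    return kept
```

With 94 accessions, this threshold tolerates at most four missing calls per site (90/94 is about 0.957, while 89/94 is about 0.947).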
TOP
H35 - Gene containing Variant Annotation for Prioritization: a tool helping scientists to highlight candidate variants of interest
Short Abstract: The convergence of Next Generation Sequencing (NGS) and advances in genome annotation leads to quick and affordable prediction of causal variations. By mapping the reads to the human genome reference and by searching for variations relative to the reference, a list of small nucleotide variations, insertions, deletions and structural rearrangements is predicted. Structural annotation of variants against the reference genome allows a focus on variations within protein-coding genes in the first instance. Further annotation reports variation frequency in the population and its presence in polymorphism and somatic mutation databases. Existing software then scores the severity of single substitutions by assessing their impact on protein structure. These steps reduce the list of predictions submitted for validation.
Cis-regulatory variations may be causal and have started to be reported, so it is imperative to no longer focus solely on protein-encoding exons. The list of candidate variations will grow with full-genome sequencing, and variant validation is time-consuming and expensive. Highlighting gene-related variants that make sense relative to the disease studied would help to prioritize causal variations. Gaining biological understanding from NGS analysis is both necessary and achievable. Several software packages allow the functional annotations of genes to be analyzed, but none has been designed for processing NGS data; dedicated tools are required. Gene annotation would help scientists rank predicted variations with respect to their potential to be causal for diseases. An automatic functional annotation of genes that can be run from the command line within an NGS pipeline is feasible and easier to use than web-based tools. This is the purpose of our tool: Gene containing Variant Annotation for Prioritization.
TOP
H36 - High-speed access to whole-genome variation data for R
Short Abstract: The programming language R has become a widely used environment for developing and running analyses in computational biology. Because R is an interpreted language, processing whole-genome SNP datasets in it is unfeasible.
WhopGenome is a library for R that provides optimised functions to read SNP data from Variant Call Format files, as used by e.g. the 1000 Genomes Project. It is implemented in C++ and offers a growing number of output formats and matrix representations.
To filter out non-interesting data quickly, a highly configurable filtering system allows SNPs to be selected by a number of properties. Additionally, basic functionality to query genome annotation and manage pedigree data is provided.
With WhopGenome it is possible to drastically reduce the time needed to read SNP data into R and obtain useful result formats which allow for fast processing.
It has been successfully integrated into a population genetics software package and enables it to analyse whole genomes.
TOP
H37 - GWIS: Online exhaustive bivariate GWAS in minutes
Short Abstract: GWIS (Genome-Wide Interaction Search) is a fast method for detecting statistical bivariate association between genotype and phenotype in GWAS data.
The algorithms used in GWIS were recently evaluated against conventional methods (Goudey et al., 2013) on 7 Wellcome Trust Case-Control Consortium datasets.
Not only were GWIS methods shown to be faster than all other algorithms, but they also explicitly search for a well-defined proxy of epistasis: an improvement in association for SNP pairs over the association for each individual SNP.
Ranked pairs of SNPs detected by GWIS contain a greater variety of SNPs than ranked pairs from conventional statistics such as Chi2, because the latter are confounded by univariate association. Many pairs detected by GWIS were not previously reported yet have high odds ratios and coverage.
We have now developed a free online interface to GWIS based on an instance of the GalaxyProject server. Users can upload GWAS datasets for processing with a battery of popular conventional tests for association, or using the 3 tests specific to GWIS (SS, DSS and GSS). The server is free for public use as a demonstration of our methods. It is hosted on only a single desktop machine, yet exhaustive bivariate analysis for e.g. 3 tests can be completed in 15 minutes on a dataset of equal dimension to the WTCCC examples.
For each statistical test, the server returns a separate list of the most significant N SNP-pairs along with the score computed. Up to 1 million pairs can be ranked.
TOP
H38 - Stability of bivariate GWAS biomarker detection in the presence of univariate effects
Short Abstract: We compared the cross-validation performance of two recent methods for detecting candidate interacting biomarkers in GWAS data. The new methods specifically look for pairwise epistatic effects by measuring the difference or gain in specificity and sensitivity (DSS and GSS respectively) over the univariate effect (Goudey et al., 2013). The Chi2 test was used as a reference method. All three methods were applied to exhaustive bivariate analysis of seven complete WTCCC datasets, and to 10 trials of 2-fold cross-validation. For each run, the top million ranked scores were obtained.
Although DSS and GSS had higher overlap in five of the seven diseases and equal overlap in one disease, it was noted that Chi2 results were dominated by univariate effects, causing a distinctive U-shape in Jaccard index plots of the overlap between cross-validation folds. SNP-frequency plots confirmed the presence of a few highly significant individual SNPs throughout the 1M ranked scores from Chi2.
Plink was used to remove significant univariate effects (Bonferroni corrected p < 0.05) and the entire experiment was repeated. As expected, the distinctive U shape in Chi2 results was either reduced or eliminated. However, the overlap between folds was reduced for Chi2 but remained higher for DSS and GSS. These results suggest that DSS and GSS tests are successfully targeting bivariate association with phenotype and are able to detect this association more reliably. Conversely, conventional tests for association are unable to avoid confounding over-representation of lower-order effects.
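The overlap between cross-validation folds is measured above with the Jaccard index. A minimal sketch of such an overlap curve follows; the function name and the representation of SNP pairs as tuples are assumptions.

```python
def jaccard_curve(ranked_a, ranked_b, ks):
    """Jaccard index between the top-k entries of two ranked lists, per k.

    ranked_a, ranked_b: SNP pairs ordered from most to least significant,
                        one list per cross-validation fold.
    """
    curve = []
    for k in ks:
        top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
        union = top_a | top_b
        # Two empty sets are treated as identical.
        curve.append(len(top_a & top_b) / len(union) if union else 1.0)
    return curve
```

Plotting this curve against k for each method gives the kind of overlap plot in which the U-shape described above would appear.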
TOP
H39 - ASPIREdb – an interactive web-based system for exploration of complex phenome-genome datasets
Short Abstract: We developed ASPIREdb, an open-source, interactive, web-based database and software system that supports the exploration, analysis and mining of complex phenome-genome datasets. The system links anonymized subject data with their phenotypes and genomic variants. The data model supports all main types of genomic variants, and phenotypes are recorded using the Human Phenotype Ontology, whose controlled vocabulary and hierarchical structure facilitate computational inference and linking to external resources.
ASPIREdb is linked to the UCSC Genome Browser, which allows the user to view selected genomic variants in the context of a large collection of aligned annotation tracks. It is also integrated with two other web-based databases developed in our lab: Gemma, which provides information on gene networks and differential gene expression patterns of the genes associated with the variants, and Neurocarta, a database of curated and scored gene-phenotype associations which can be used for gene prioritization.
In addition to a table-based view of the data, ASPIREdb provides various graphical representations: selected genomic regions and associated variants can be viewed using a genome-wide interactive ideogram, and phenotypic data can be explored using the ontology tree structure or summarizing heatmaps. The interface supports complex searches using genotype and phenotype characteristics in conjunction with data from integrated databases and external sources. We are currently working on adding more features, such as automatic data labeling based on user-defined rules, phenotype-based clustering and group comparison capabilities.
In brief, ASPIREdb is a powerful web-based system designed to facilitate and accelerate the exploration and interpretation of complex phenome-genome datasets.
TOP
H40 - Genome Annotation with Large Scale Mutation Extraction from Scientific Fulltext
Short Abstract: Over the last few years, the UCSC Genocoding project (http://text.soe.ucsc.edu) has assembled the biggest collection of biomedical research articles accessible for analysis to date: more than 5 million articles, many with supplemental files. We have mapped the DNA/protein sequences in them with BLAT and show them on the UCSC Genome Browser. Another important type of information is the mutations discussed in the text (e.g. "the Q562E mutation in WNK4 causes pseudohypoaldosteronism").
We show how our compute cluster processes 200 GB of fulltext files in less than a few hours. We present a simple pipeline to extract mutation mentions from the text and map them to genome coordinates, without using any information in dbSNP. While the recall of this pipeline is relatively low, on the order of 20-30%, precision exceeds 80%. The resulting list of mutations is bigger than any existing database, can be updated in real-time and can be used as a basis for manual annotation. We also demonstrate a tool to intersect our text mining results with variation calls from whole genome sequencing datasets (VCF files).
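A mutation mention such as "Q562E" can be picked out with a simple pattern. The sketch below is a naive illustration, not the pipeline described above: the regular expression only covers one-letter amino-acid substitutions, and the helper names are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Wild-type residue, 1-based position, mutant residue, e.g. "Q562E".
MUTATION = re.compile(
    r"\b([%s])([1-9][0-9]*)([%s])\b" % (AMINO_ACIDS, AMINO_ACIDS)
)


def extract_mutations(text):
    """Return (wild_type, position, mutant) for each substitution mention."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in MUTATION.finditer(text)
        if m.group(1) != m.group(3)
    ]


def precision_recall(predicted, truth):
    """Precision and recall of a predicted set against a curated set."""
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall
```

A pattern this simple also matches tokens such as "H2A" that are not mutations, so any real pipeline needs additional checks, for example confirming that the wild-type residue actually occurs at that position in the gene's protein sequence.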
Existing manually curated databases of human mutations cost several thousand dollars per year and many groups do not have access to them. Text mining of mutation descriptions has the potential to simplify the annotation of genomic variants for all researchers by directing them automatically to the relevant literature.
TOP
H41 - Variants affecting exon skipping contribute to health disparities
Short Abstract: This poster is based on a Proceedings Submission. Alternative splicing (AS) may be one biological factor accounting for cancer health disparities. Although AS is involved in a broad spectrum of cancer pathogenesis, AS differences among human individuals are underestimated. We compiled information on splicing regulatory elements (SREs) and single nucleotide polymorphisms (SNPs) with high population differentiation. For health disparities, we considered candidate SNPs that have a high fixation index (Fst >= 0.5; Fst is an estimate of how populations differ genetically). We observed that synonymous and intronic variations within SRE sites tend to have higher Fst values than those found outside SREs, suggesting that those functional SNPs in SREs are more likely to be under selective pressure. We describe the function(s) for a number of variants that showed phenotypic associations but for which mechanisms were unknown. For example, one of our findings in the TNFRSF1A gene, intronic rs12265291, is predicted to neutralize an SRE and has a high Fst value of 0.88. The exon to the right of this SNP is skipped and participates in encoding the CTNNB1-binding domain. Additionally, rs12265291 is in high linkage disequilibrium (LD) with rs4506565 (LD r2=0.53), which has been previously associated with Type 2 Diabetes. We also found SNP rs12265291 to be in LD with rs216013 (Fst=0.38, LD r2=0.868); this SNP is in the drug-response gene CACNA1C and has been associated with Warfarin maintenance dose requirement. Our study identifies several SNPs that may have a biological impact on human diseases through AS and emphasizes their importance for molecular therapies to reduce health disparities.
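For readers unfamiliar with the fixation index, the sketch below computes Fst for one biallelic SNP in two populations from heterozygosities, Fst = (H_T - H_S) / H_T. This is one common textbook definition with equal population weights; the estimator actually used in the study is not stated, so this choice is an assumption.

```python
def fst(p1, p2):
    """Fst for a biallelic SNP from the allele frequencies in two populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # allele fixed or absent in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total
```

Under this definition, allele frequencies of 0.9 and 0.1 give an Fst of 0.64, which would pass the Fst >= 0.5 threshold used above, while identical frequencies give 0.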
TOP
H42 - Machine learning methods for genotype-phenotype association in bacterial genomes
Short Abstract: Genotype-phenotype association methods for bacterial genomes are not yet well-established. Bacteria do not have sexual reproduction, which invalidates some of the assumptions made in many of the current methods used for association studies in other organisms. Bacteria have a huge influence on human health and we need better methods to learn about how changes in genetics give clinically relevant phenotypes.
Our bug of interest is Mycobacterium tuberculosis (Mtb), a bacterial pathogen that causes pulmonary tuberculosis (TB), which kills over a million people each year. Unfortunately, Mtb is difficult to diagnose and resistance to antibiotics is becoming rampant. The current diagnostics for drug resistance take six to eight weeks. The technology for rapid molecular diagnostics exists, but requires knowledge about resistance marker mutations, which is missing.
To address the lack of knowledge about marker mutations, we present a machine learning based strategy that uses support vector machines to predict genotype-phenotype associations by integrating genome sequence and clinical metadata. An ensemble feature selection method enables the discovery of antibiotic resistance markers in Mtb. The impact is two-fold: (i) the feature selection procedure gives us a ranking of mutations that are associated with drug resistance and (ii) the classification model can be used together with a molecular diagnostic to predict treatment options for patients.
In this poster we discuss the methods and illustrate their capabilities on a panel of bacterial drug-resistance projects with a particular focus on Mycobacterium tuberculosis.
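The abstract names support vector machines and an ensemble feature selection method but gives no implementation details. The sketch below illustrates only the ensemble idea: mutations are scored on many bootstrap resamples of the isolates and the per-round ranks are aggregated. It substitutes a simple carrier versus non-carrier resistance-rate difference for SVM-derived weights, and every name, parameter and data value in it is a hypothetical choice made for illustration.

```python
import random


def association_score(column, labels):
    """Absolute difference in resistance rate between carriers and non-carriers."""
    carriers = [y for x, y in zip(column, labels) if x == 1]
    others = [y for x, y in zip(column, labels) if x == 0]
    if not carriers or not others:
        return 0.0  # mutation is uninformative in this resample
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))


def ensemble_rank(genotypes, labels, n_rounds=50, seed=0):
    """Rank mutations by summed rank over bootstrap resamples (best first).

    genotypes: one row per isolate, one 0/1 entry per mutation.
    labels: 1 for resistant, 0 for susceptible.
    """
    rng = random.Random(seed)
    n_samples = len(labels)
    n_features = len(genotypes[0])
    rank_sums = [0] * n_features
    for _ in range(n_rounds):
        # Resample isolates with replacement
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        ys = [labels[i] for i in idx]
        scores = [
            association_score([genotypes[i][j] for i in idx], ys)
            for j in range(n_features)
        ]
        # Rank 0 is the strongest association in this round
        order = sorted(range(n_features), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            rank_sums[j] += rank
    return sorted(range(n_features), key=lambda j: rank_sums[j])


# Toy panel: mutation 0 tracks resistance exactly, the others do not
toy_genotypes = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0],
                 [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(ensemble_rank(toy_genotypes, toy_labels))
```

Aggregating ranks over resamples makes the final ordering less sensitive to any single subset of isolates, which is the usual motivation for ensemble selection.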
H43 - Quantifying the impact of somatic mutations on gene expression networks in cancer
Short Abstract: To improve interpretability of large-scale whole genome sequencing studies of human cancers, we aim to computationally estimate the functional impacts of individual somatic mutations in patient tumours, providing insights on which mutations drive malignant phenotypes.
We assume that a functional mutation in a gene of interest will have cis-effects on its own expression or trans-effects on the expression of genes in the same biological pathway. In addition, we assume that mutated genes altering phenotype are selected during evolution and thus should accrue at higher than expected rates in population studies. We developed a novel probabilistic model to integrate different sources of data and prior information, such as somatic mutations, gene expression, and pathway datasets, to predict functional mutations that are likely to have affected the transcriptional profile of a tumour. The model outputs a probability that each mutation in a dataset has functional impact.
Publicly available TCGA cancer datasets and breast cancer data generated in our lab were used to evaluate the proposed model. Experimental results showed that the model’s predictions had higher concordance with Cancer Gene Census documented genes than the MutationAssessor algorithm’s predictions. In addition, the model predicted some known driver genes missed by the MutSig algorithm. Finally, randomly permuted datasets were predicted to have statistically significantly lower functional probabilities compared with the original predictions.
We show that the integration of multiple data types by a probabilistic graphical model can predict individual functional mutations and driver genes. We suggest that this technique is an important step on the road to personalized treatment strategies informed by genome and transcriptome sequencing.
H44 - Genome-wide structural variation analysis with genome mapping on nanochannel arrays
Short Abstract: Despite recent advances in next-generation sequencing technology, genome-wide structural variation (SV) detection using ‘short reads’ remains challenging. Detection of large and/or balanced SVs such as inversions or translocations is difficult, if not impossible. To overcome the limitations of short reads, we generated genome maps using a novel approach that allows very long DNA molecules (> 150kb) fluorescently labeled at Nt.BspQI sites (GCTCTTCN/) to be linearized and imaged in highly parallel nanochannel arrays[1]. We obtained data to 50X genome coverage on NA12878, a member of a CEPH CEU trio extensively sequenced in the 1000 Genomes Project. To detect structural variations from these genome maps, we have developed a BreakPoint sensitive Optical Map Dynamic Programming algorithm (BP-OMDP). Data for each DNA molecule are represented as an array of inter-label distances, or segments. BP-OMDP allows molecules to be partially aligned to the hg19 in silico Nt.BspQI reference with mismatching segments on either end of the molecules. Potential SV regions and their breakpoints are identified by observing a clear drop in the number of molecules aligned to the reference. SV regions are then reconstructed and verified by de novo assembly and re-alignment of partially mapped and originally unmapped molecules. Using BP-OMDP, we detected 78 putative inversions, of which 59% overlap with known inversions found in NA12878. Extension of our method to detect other balanced and unbalanced SVs is underway.
1. Lam ET et al. Genome Mapping on Nanochannel Arrays for Structural Variation Analysis and Sequence Assembly. Nat Biotechnol (2012) 30(8):771-776.
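Two ideas from this abstract lend themselves to a small illustration: representing a molecule as an array of inter-label distances, and flagging candidate breakpoints where the number of aligned molecules drops sharply. The sketch below is not BP-OMDP itself (the dynamic-programming alignment is omitted); the function names, the 0.5 drop ratio and the example numbers are assumptions made for illustration.

```python
def inter_label_distances(label_positions):
    """Turn label positions (bp along one molecule) into segment lengths."""
    pos = sorted(label_positions)
    return [b - a for a, b in zip(pos, pos[1:])]


def coverage_drops(coverage, min_ratio=0.5):
    """Indices of bins whose aligned-molecule count falls below
    min_ratio times the count in the previous bin (hypothetical rule)."""
    drops = []
    for i in range(1, len(coverage)):
        prev, cur = coverage[i - 1], coverage[i]
        if prev > 0 and cur < prev * min_ratio:
            drops.append(i)
    return drops


# One molecule with four labels, and per-bin counts of aligned molecules
print(inter_label_distances([0, 5000, 12000, 30000]))
print(coverage_drops([50, 48, 20, 22, 49]))
```

A real pipeline would also need to tolerate sizing error and missing or extra labels when comparing segment arrays, which is what the alignment step described in the abstract addresses.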
H45 - Workflows and Services for Concept Profile Generation
Short Abstract: Concept profile matching is a knowledge discovery method that proved successful in generating hypotheses about molecular mechanisms explaining the results from genotype-phenotype studies. This technology has been implemented in the Anni standalone application (http://biosemantics.org/anni) and web service (http://www.biocatalogue.org/services/3330). At the core of this technology are the concept profiles, which had to be generated using a number of custom scripts and manual operations. To move towards a more customizable and service oriented architecture of the concept profile generation pipeline, we developed a set of workflows and services that represent individual steps of the pipeline. To aid interoperability, semantic web standards have been adopted to interface with these components. For instance, the indexing component now uses SKOS (Simple Knowledge Organization System) as its data interchange format, replacing the old legacy format specific to the indexing engine Peregrine (https://trac.nbic.nl/data-mining/). In addition, biomedical resources that are available as linked data can now be incorporated as sources for concept profile generation.
H46 - Variants affecting exon skipping contribute to complex traits
Short Abstract: Alternative splicing is a common eukaryotic cellular mechanism that increases the diversity of mRNA by allowing for the production of multiple proteins from one gene. Alternative splicing is important for many critical biological processes, including development, evolution, and psychological behavior. Furthermore, alternative splicing has been known to account for 15–50% of human genetic diseases, including breast cancer; however, the precise mechanism by which genetic variations regulate this process remains to be fully elucidated. In this study, we develop an integrative approach that utilizes sequence-based analysis and genome-wide expression profiling to identify genetic variations that may affect alternative splicing. We also evaluate their enrichment among established disease-associated variations. Our study provides insights into the functionality of these variations and emphasizes their importance for complex human traits and diseases.
H47 - A general framework for estimating the relative pathogenicity of human genetic variants
Short Abstract: As genetic information is insufficient to unambiguously implicate many disease-causal variants, annotations that enrich for causal variation are essential. Current annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). A broadly applicable metric that objectively weights and integrates diverse information is needed. Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations. We implement CADD as a support vector machine, trained to use 63 annotations to differentiate 14.7 million variants derived on the human lineage from 14.7 million simulated variants. We pre-compute CADD-based scores (C-scores) for all 8.6 billion possible single nucleotide variants of the reference genome and enable scoring of short insertions/deletions. C-scores strongly correlate with allelic diversity, pathogenicity of both coding and non-coding variants, and experimentally measured regulatory effects, and also highly rank causal variants within individual genome sequences. Finally, C-scores of complex trait-associated variants from genome-wide association studies (GWAS) are significantly higher than matched controls and correlate with study sample size, likely reflecting the increased accuracy of larger GWAS. Thus, the ability of CADD to quantitatively prioritize functional, deleterious, and disease causal variants across a wide range of functional categories, effect sizes and genetic architectures is unmatched by any current annotation and will be widely useful for the identification of causal variation in both research and clinical settings.
H48 - An assessment of the recovery of curated genetic variants through text mining
Short Abstract: We assess a mutation extraction tool with respect to the task of curating the literature for the purpose of populating a database of genetic variation information. Our analysis shows that the ability of text mining tools to recover the mutations catalogued in the databases is far less than what would be expected based on the typically excellent performance of such tools on intrinsic evaluation. While lack of access to the full text of publications has been argued to explain this phenomenon, we show that the effect persists even when the full text article that was indicated to be the direct source of a mutation in a curated resource is available for processing. We explore several possible explanations for these results, including difficulties in linking genetic variants to specific genes, and the inclusion of data from high-throughput experiments. The results of our work have implications for the future development of text mining systems for genetic variation.
H49 - Somatic mutations in MDS originate exclusively in rare human hematopoietic cancer stem cells
Short Abstract: Myelodysplastic Syndrome (MDS) is a clonal hematological disorder characterized by dysplastic and ineffective hematopoiesis in multiple lineages, resulting in anemia, thrombocytopenia and neutropenia and frequently progressing to acute myeloid leukemia (AML). The dysplastic cells have been hypothesized to originate from the normal hematopoietic stem cell compartment.
The concept of human cancer stem cells (CSC) relies on the existence of a rare stem cell population with a unique ability to self-renew and replenish a molecularly and functionally distinct non-tumorigenic progeny. The phenotypically distinct lineage (Lin) negative CD34+CD38-CD90+CD45RA- candidate MDS stem cell and the early myeloid restricted granulocyte-macrophage (GMP) and megakaryocyte-erythroid (MEP) progenitors were molecularly and functionally characterized and shown to be distinct. Only the candidate MDS stem cell population had self-renewal capacity. It also had the ability to replenish the MEP and GMP populations.
Targeted sequencing of 84 genes was performed on genomic DNA from unfractionated BM, purified cell populations and individually picked in vitro long-term single cell clones. All of the 26 somatic genetic lesions identified in the patients analyzed could be tracked back to the Lin-CD34+CD38-CD90+CD45RA- stem cell compartment, including the single cell long-term culture clones.
This study provides evidence for the existence of a rare phenotypically, molecularly and functionally distinct stem cell population, with the ability to self-renew and replenish lineage restricted MDS progenitors. All stable somatic genetic lesions identified could in each MDS patient be backtracked to the rare stem cell population, establishing their unique CSC identity. These findings have implications for therapeutic strategies in MDS.
H50 - Mobster: An accurate detection method for MEI events in Sequencing Data
Short Abstract: Mobile elements (MEs) are considered major drivers of genome evolution as they can insert in new genomic sites, create processed pseudogenes, and transduce parts of the genome. However, the role of MEs in disease is largely unexplored in humans. We have developed a program, called Mobster, to detect novel mobile element insertion (MEI) events in whole genome sequencing (WGS) and whole exome sequencing (WES) paired-end and single-end data. Mobster uses a combination of discordant read pairs and split reads to identify MEI events, which are subsequently mapped to consensus sequences of known active MEs. Results from simulation data show that Mobster can reach a sensitivity of 95.5% on simulated paired-end 5X coverage WGS datasets. We also tested Mobster on a variety of NGS experimental data. In a monozygotic twin paired-end 40X WGS dataset we find approximately 1,000 novel MEIs per sibling compared to the reference. Predicted MEIs from sibling A overlap by 91.4% (non-pooled) or 99.0% (pooled) with predicted MEIs from sibling B. In four samples sequenced to a high depth by paired-end WES, an average of 22 MEIs were found per sample. The MEI events identified in both the paired-end WGS and WES data had a high overlap (95.8% and 97.7% respectively) with previously validated or predicted novel MEI sites, demonstrating the reliability of Mobster’s MEI predictions. We conclude that Mobster is able to robustly detect MEIs in a wide variety of second generation sequencing datasets with high accuracy.
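The abstract reports overlap percentages between call sets without stating how two MEI predictions are matched. As a rough sketch, assuming two calls match when they lie on the same chromosome within a fixed distance, the overlap of one call set with another could be computed as below. The function name, the 100 bp window and the coordinates are hypothetical and are not Mobster's actual criteria.

```python
def overlap_fraction(calls_a, calls_b, window=100):
    """Fraction of calls in calls_a with a call in calls_b on the same
    chromosome within `window` bp. Calls are (chromosome, position) tuples."""
    if not calls_a:
        return 0.0
    # Index the second call set by chromosome for quick lookup
    by_chrom = {}
    for chrom, pos in calls_b:
        by_chrom.setdefault(chrom, []).append(pos)
    hits = 0
    for chrom, pos in calls_a:
        if any(abs(pos - q) <= window for q in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(calls_a)


# Made-up predictions for two siblings
sibling_a = [("chr1", 1000), ("chr1", 5000), ("chr2", 300), ("chr3", 42)]
sibling_b = [("chr1", 1040), ("chr2", 310), ("chr3", 900)]
print(overlap_fraction(sibling_a, sibling_b))
```

Note that the measure is asymmetric: the overlap of A with B generally differs from the overlap of B with A when the call sets differ in size.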
H51 - Identifying and classifying trait linked SNPs in non-reference species by walking coloured De Bruijn graphs
Short Abstract: Single Nucleotide Polymorphisms are invaluable markers for tracing the genetic basis of inheritable traits, and the ability to create marker libraries quickly is vital for timely identification of target genes. Next-generation sequencing makes it possible to sample a genome rapidly, but polymorphism detection relies on having a reference genome to which reads can be aligned and variants detected. We present Bubbleparse, a method for detecting variants directly from next-generation reads without a reference sequence. Bubbleparse uses the de Bruijn graph implementation in the Cortex framework as a basis and allows the user to identify bubbles in these graphs that represent polymorphisms quickly, easily and sensitively. In Arabidopsis thaliana, the Bubbleparse algorithm performs well when compared with polymorphism detection methods based on alignment to a reference, and it found some SNPs not found by the canonical method. The heuristic can be used to maximise the number of true polymorphisms returned, and with a proof-of-principle experiment we show that Bubbleparse is very effective on data from unsequenced wild relatives of potato, where it enabled us to identify disease resistance linked genes quickly and easily. Bubbleparse is a fast and effective tool for detection of polymorphisms in unsequenced genomes and is an excellent addition to the genomics toolbox; it can speed up variant detection and allow for new analyses in organisms that do not as yet have substantial genomic resources.
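To make the notion of a bubble in a coloured de Bruijn graph concrete, here is a toy sketch that is independent of Cortex and of Bubbleparse's real implementation. It builds a k-mer graph in which each k-mer remembers which samples (colours) contain it, then reports simple bubbles: a branching k-mer whose two branches reconverge. The function names, the choice of k=4 and the two short sequences are invented for illustration, and reverse complements, sequencing errors and coverage are ignored.

```python
from collections import defaultdict


def build_coloured_graph(samples, k):
    """Map each k-mer to the set of sample names (colours) that contain it."""
    colours = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                colours[read[i:i + k]].add(name)
    return colours


def successors(node, colours):
    """k-mers in the graph that overlap `node` by k-1 bases."""
    return [node[1:] + b for b in "ACGT" if node[1:] + b in colours]


def find_bubbles(colours, max_len=50):
    """Return (start, branch_a, branch_b, end) for each simple bubble."""
    bubbles = []
    for node in sorted(colours):
        nxt = successors(node, colours)
        if len(nxt) != 2:
            continue
        # Walk each branch while the path is unambiguous
        paths = []
        for start in nxt:
            path = [start]
            while len(path) < max_len:
                step = successors(path[-1], colours)
                if len(step) != 1:
                    break
                path.append(step[0])
            paths.append(path)
        seen = set(paths[1])
        shared = [n for n in paths[0] if n in seen]
        if shared:
            end = shared[0]  # first k-mer where the branches rejoin
            branch_a = paths[0][:paths[0].index(end)]
            branch_b = paths[1][:paths[1].index(end)]
            bubbles.append((node, branch_a, branch_b, end))
    return bubbles


def branch_colours(branch, colours):
    """Colours shared by every k-mer on one branch of a bubble."""
    return set.intersection(*(colours[n] for n in branch))


# Two samples differing by a single base (A versus T at one position)
samples = {"resistant": ["GATTACAGGCTCA"], "susceptible": ["GATTACTGGCTCA"]}
graph = build_coloured_graph(samples, k=4)
for start, a, b, end in find_bubbles(graph):
    print(start, a, branch_colours(a, graph), b, branch_colours(b, graph), end)
```

A single-base difference yields two branches of k k-mers each, and a branch carried by only one colour is the kind of signal that could link a variant to a trait when the colours correspond to phenotype groups.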