ISMB/ECCB 2015

Posters

Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB/ECCB 2015. To confirm your poster, find the poster acceptance email, which contains a confirmation link. Click on it and follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category G - 'Genetic Variation Analysis'

G01 - SeqPurge: highly-sensitive adapter trimming for paired-end short read data

Marc Sturm, University Hospital Tuebingen, Germany

Christopher Schroeder, University Hospital Tuebingen, Germany

Peter Bauer, University Hospital Tuebingen, Germany

Short Abstract: Trimming adapter sequences from short read data is a common preprocessing step in most DNA/RNA sequence analysis pipelines. For amplicon-based approaches, which are mostly used in clinical diagnostics, sensitive adapter trimming is of special importance. Untrimmed adapters can be located at the same genomic position and can lead to spurious variant calls. Shotgun approaches are more robust towards adapter contamination, because untrimmed adapters are randomly distributed over the target region. This reduces the probability of spurious variant calls.
When performing paired-end sequencing, the overlap between forward and reverse read can be used to identify excess adapter sequences. This is exploited by several published adapter trimming tools. However, in our evaluations on amplicon-based paired-end data we found that these tools fail to remove all adapter sequences and that adapter contamination leads to spurious variant calls.
Here we present SeqPurge, a highly-sensitive adapter trimmer that uses a probabilistic approach to detect the overlap between forward and reverse reads of paired-end Illumina sequencing data. The overlap information is then used to remove adapter sequences, even if only one base long. Compared to other adapter trimmers specifically designed for paired-end data, we found that SeqPurge achieves a higher sensitivity. The number of remaining adapters after trimming is reduced by 40-75%, depending on the compared tool. The specificity of SeqPurge is comparable to that of the compared tools. In addition to adapter trimming, SeqPurge can also perform trimming based on quality and based on no-call (N) stretches.
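The overlap idea described in this abstract can be illustrated with a minimal Python sketch. It is not SeqPurge's implementation: the function names, the 0.25 per-base chance-match rate, the 1e-6 probability cutoff and the adapter prefix used in the example are assumptions made for illustration only. For each candidate insert length, the start of the forward read is compared with the reverse complement of the start of the reverse read, and a binomial tail probability measures how likely the observed number of matches would be by chance.

```python
from math import comb

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def chance_match_probability(matches: int, length: int, p: float = 0.25) -> float:
    """Binomial tail: P(at least `matches` of `length` bases agree by chance)."""
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(matches, length + 1)
    )


def detect_insert_length(fwd: str, rev: str, max_prob: float = 1e-6):
    """Return the most convincing insert length shorter than the reads, or None.

    If the insert has length k, fwd[:k] equals the reverse complement of
    rev[:k], and everything after position k in both reads is adapter.
    """
    best = None  # (insert length, probability)
    for k in range(1, min(len(fwd), len(rev))):
        candidate = reverse_complement(rev[:k])
        matches = sum(a == b for a, b in zip(fwd[:k], candidate))
        prob = chance_match_probability(matches, k)
        if prob <= max_prob and (best is None or prob < best[1]):
            best = (k, prob)
    return None if best is None else best[0]


def trim_pair(fwd: str, rev: str):
    """Trim both reads to the detected insert length (no-op if none is found)."""
    k = detect_insert_length(fwd, rev)
    return (fwd, rev) if k is None else (fwd[:k], rev[:k])


# Illustrative data: a 28-base insert followed by 4 bases of a placeholder adapter.
insert = "ACGTTGCAAGGCTTACCGATGCATTAGC"
adapter_prefix = "AGAT"  # assumption: stands in for the start of an adapter
fwd_read = insert + adapter_prefix
rev_read = reverse_complement(insert) + adapter_prefix
trimmed_fwd, trimmed_rev = trim_pair(fwd_read, rev_read)
```

Because the evidence comes from the overlap rather than from the adapter itself, this kind of approach can in principle remove very short adapter remnants, which is the property the abstract emphasizes.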

G02 - MyVariant.info: community-aggregated variant annotations as a service

Chunlei Wu, The Scripps Research Institute, United States

Adam Mark, The Scripps Research Institute, United States

Jiwen Xin, The Scripps Research Institute, United States

Sean Mooney, The University of Washington, United States

Ben Ainscough, Washington University School of Medicine, United States

Ali Torkamani, The Scripps Research Institute, United States

Andrew Su, The Scripps Research Institute, United States

Short Abstract: The accumulation of genetic variant annotations has been increasing explosively with recent technological advances. However, the fragmentation across many data silos is often frustrating and inefficient. We created a platform, called MyVariant.info (http://myvariant.info), to aggregate variant-specific annotations from community resources and provide high-performance programmatic access. Annotations from each resource are first converted into JSON-based objects with their id fields as the canonical names following HGVS nomenclature (genomic DNA based). This scheme allows merging of all annotations relevant to a unique variant into a single annotation object. A high-performance and scalable query engine was built to index the merged annotation objects and provide programmatic access to developers. As of today, MyVariant.info is serving >100M variants in total and we are actively expanding the coverage by engaging community efforts. MyVariant.info decouples two fundamental steps in the management of variant annotations: the creation and maintenance of centralized web services (which requires deep software-engineering expertise), and the task of structuring biological annotations (which requires broad community effort). Annotation providers from the community can provide data parsers to convert their raw data into JSON-compatible objects. The only requirement is that a valid HGVS name is used as the id field for each object. These data can then be queried through the query engine we built. The data provider doesn’t have to worry about building their own query infrastructure, and the research community doesn’t have to learn another query interface in order to access new annotations.
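The merging scheme described in this abstract (one object per variant, keyed by an HGVS-style id) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key name "id", the source names, the HGVS-style strings and the annotation values are all made up, and the sketch does not reflect how MyVariant.info itself is implemented.

```python
def merge_annotations(*sources):
    """Merge per-source annotation records into one object per variant.

    Each source is a (source_name, records) pair; every record is a dict
    whose "id" field holds the variant's HGVS-style name.
    """
    merged = {}
    for source_name, records in sources:
        for record in records:
            hgvs_id = record["id"]
            doc = merged.setdefault(hgvs_id, {"id": hgvs_id})
            # Nest each source's fields under its own key to avoid clashes.
            doc[source_name] = {k: v for k, v in record.items() if k != "id"}
    return merged


# Hypothetical parser outputs from two made-up annotation sources.
source_a = ("source_a", [
    {"id": "chr1:g.100000A>G", "allele_frequency": 0.01},
    {"id": "chr2:g.200000C>T", "allele_frequency": 0.20},
])
source_b = ("source_b", [
    {"id": "chr1:g.100000A>G", "predicted_effect": "missense"},
])

merged = merge_annotations(source_a, source_b)
```

Nesting each provider's fields under its own key means a new provider only has to agree on the id convention, which matches the single requirement stated in the abstract.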

G03 - An integrated computational framework for fast and easy variant prioritization from Whole Genome Sequencing

Bolan Linghu, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Fan Yang, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Robert Bruccoleri, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Thomas Schlitt, Biomarker Development, Novartis Institutes for BioMedical Research, Basel CH-4002, Switzerland

Daniela Wieser, Biomarker Development, Novartis Institutes for BioMedical Research, Basel CH-4002, Switzerland

Nicole Cheung, Biomarker Development, Novartis Institutes for BioMedical Research, Basel CH-4002, Switzerland

Xiaoyu Jiang, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, Switzerland

Yunsheng He, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Joseph Szustakowski, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Thomas Morgan, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Nanguneri Nirmala, Biomarker Development, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA, United States

Short Abstract: Whole genome sequencing (WGS) has become a promising approach to identify genetic variations related to diseases in clinical studies. Pinpointing the small subset of pathogenicity-related variants amongst the millions of variants generated in WGS remains a challenge. A key solution is to integrate a comprehensive landscape of information, such as variant calls, prior disease knowledge, clinical phenotypes, variant effect predictions, allele frequencies, and variant call quality assessment, for variant prioritization. However, developing a centralized infrastructure to efficiently organize and query these “Big Data” remains a daunting task. Recently, a number of high-performance relational database systems have been developed specifically to enable analysis of extremely large data sets. Here we describe applying one such system, namely Vertica, to prioritize variants. Our approach leverages Vertica’s high-performance capabilities to efficiently model, store, and query diverse resources. This framework enabled the convenient and efficient identification of candidate disease variants, with significant improvements over traditional databases. This demonstrates that high-performance databases such as Vertica provide an efficient solution for prioritizing variants from whole genome sequencing through data integration.

G04 - Evaluating and optimizing variant calling: a comparison of Roche 454, Ion Torrent PGM and Illumina NextSeq sequencing data

Sarah Sandmann, University of Muenster, Germany

Aniek de Graaf, RadboudUMC, Netherlands

Bert van der Reijden, RadboudUMC, Netherlands

Joop Jansen, RadboudUMC, Netherlands

Martin Dugas, University of Muenster, Germany

Short Abstract: There are various next-generation sequencing (NGS) techniques, all of them striving to replace Sanger sequencing as the gold standard. The ongoing development of NGS methods has greatly reduced the turnaround time and cost of sequencing. However, false positive calls of SNVs and especially indels are a widely known problem of virtually all NGS sequencers.

We developed optimized variant calling pipelines for three common NGS sequencers considering both SNVs and short indels. Amplicon-based targeted sequencing of 20 genes known to be recurrently mutated in myelodysplastic syndromes (MDS) was performed in parallel on Roche 454, Ion Torrent PGM and Illumina NextSeq500 platforms. Diagnostic material of MDS patients -- partially sequenced twice on each sequencing platform -- formed the basis of the optimization, representing the learning cohort. If required, called variants were confirmed by Sanger sequencing of the original patient material.

We calculated various parameters to characterize both SNVs and indels. Yet, instead of setting arbitrary thresholds for each parameter, we combined them to estimate generalized linear models returning a probability for each variant to be a true positive. A single threshold for each model was chosen to provide maximum sensitivity as well as a maximum positive predictive value.

Subsequently, we performed a comparison of the three NGS platforms and their previously optimized variant calling pipelines. Sequencing data from additional MDS patients with lab validated SNVs and indels formed the basis of the comparison, representing the validation cohort.
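The step of combining several per-variant parameters into a single true-positive probability with one threshold can be sketched as follows. The sketch assumes a logistic link, and the feature names, coefficients, intercept and threshold are hypothetical placeholders; the abstract does not state which parameters or model family the authors actually used.

```python
from math import exp


def true_positive_probability(features, coefficients, intercept):
    """Logistic-model probability that a called variant is a true positive."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    # Numerically stable logistic function for both signs of `linear`.
    if linear >= 0:
        return 1.0 / (1.0 + exp(-linear))
    e = exp(linear)
    return e / (1.0 + e)


def keep_variant(features, coefficients, intercept, threshold):
    """Apply a single probability threshold instead of per-parameter cutoffs."""
    return true_positive_probability(features, coefficients, intercept) >= threshold


# Hypothetical model and variant; none of these numbers come from the study.
coefficients = {"allele_frequency": 4.0, "mean_base_quality": 0.1, "strand_bias": -2.0}
intercept = -3.0
variant = {"allele_frequency": 0.45, "mean_base_quality": 32.0, "strand_bias": 0.1}

probability = true_positive_probability(variant, coefficients, intercept)
decision = keep_variant(variant, coefficients, intercept, threshold=0.5)
```

In practice the coefficients would be estimated from the learning cohort and the threshold chosen from the resulting sensitivity and positive predictive value, as the abstract describes.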

G05 - Quantitative trait association study for mean telomere length in the South Asian population

Anna Hakobyan, Institute of Molecular Biology of the National Academy of Sciences of the Republic of Armenia, Armenia

Lilit Nersisyan, Institute of Molecular Biology of the National Academy of Sciences of the Republic of Armenia, Armenia

Arsen Arakelyan, Institute of Molecular Biology of the National Academy of Sciences of the Republic of Armenia, Armenia

Short Abstract: Mean telomere length (MTL) is reported to be associated with several diseases, including cancer and age-related diseases. We conducted a quantitative trait association study for mean telomere length in the South Asian population, using 168 samples and 5,694,566 SNPs. MTLs were calculated from whole-genome sequencing data with the in-house program Computel. Our results showed that MTL in the South Asian population did not significantly correlate with age (r = -0.048, n = 168, p = 0.53), which is concordant with previously published results.
We identified 52 SNPs associated with MTL at p < 2.43e-6 (FDR-BH corrected p < 0.25) located in 4 genes (ADARB2, PDLIM7, SEMA6B, PHACTR2). Additionally, one associated SNP (p = 3.94e-7) was detected within 2 overlapping genes (TMEM74B and PSMF) and 7 SNPs in intergenic regions.
The ADARB2 gene carried the 38 most significant SNPs, among them rs1007147 (p = 3.354e-08) and rs10903420 (p = 5.457e-08). Earlier studies conducted in different human populations have strongly implicated these SNPs in association with extreme old age. The next-ranked SNP (rs163203, p = 2.007e-06) resides in PDLIM7. Its protein interacts with telomeric transcripts (TERRA), and it has been shown that downregulation of TERRA-binding proteins can affect telomere lengthening.
In conclusion, our study suggests that the association of age with mean telomere length might be population specific. Additionally, we introduce new loci implicated in MTL that may be worth considering in further telomere studies.
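The "FDR-BH corrected" values mentioned in this abstract refer to the Benjamini-Hochberg procedure. A minimal Python sketch of that adjustment is given below; the function name and the example p-values are illustrative and are not taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical SNPs.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
```

A SNP is then reported at a chosen false discovery rate (0.25 in the abstract) when its adjusted value falls below that rate.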

G06 - GenerateReports: an IonTorrent plugin summarizing a whole NGS experiment for clinical interpretation

Pierre-Julien Viailly, Centre Henri Becquerel, INSERM U918, Rouen, France

Sylvain Mareschal, Centre Henri Becquerel, INSERM U918, Rouen, France

Philippe Bertrand, Centre Henri Becquerel, INSERM U918, Rouen, France

Sydney Dubois, Centre Henri Becquerel, INSERM U918, Rouen, France

Elodie Bohers, Centre Henri Becquerel, INSERM U918, Rouen, France

Catherine Maingonnat, Centre Henri Becquerel, INSERM U918, Rouen, France

Thierry Lecroq, LITIS EA 4108, IRIB, Normandie Université, Rouen, France

Hélène Dauchel, LITIS EA 4108, IRIB, Normandie Université, Rouen, France

Herve Tilly, Centre Henri Becquerel, INSERM U918, Rouen, France

Fabrice Jardin, Centre Henri Becquerel, INSERM U918, Rouen, France

Short Abstract: The increasing arrival of Next Generation Sequencing technologies in diagnostic laboratories creates a need to develop tools for rapid data interpretation. For example, targeted cancer sequencing allows biologists to focus on a selected range of known cancer-relevant genes and has become a choice strategy to quickly screen patients' mutational profiles. Although these profiles can aid in developing personalized therapy, their interpretation remains difficult. The aim is now to develop tools to provide a quick understanding of mutation profiles, highlighting the most impactful anomalies.

Here, we present an open-source integrated IonTorrent plugin called GenerateReports. This tool aggregates data in a single clinical report, enabling the clear visualization of the main results for each sample sequenced. GenerateReports is based on the results of the CoverageAnalysis and VariantCaller Torrent Suite plugins but goes further by annotating single nucleotide variants and searching for copy-number variations (CNVs). Thus, the biologist has access to sample identification, run and sample quality metrics, annotated and stratified variants, CNVs detected with statistical relevance, and information about experimental and informatics traceability, all in a single PDF report for each sample sequenced. An associated interfaced database allows for further statistical studies. It could help to identify sequencing artifacts by storing Sanger validation results and to stratify thousands of anomalies across several runs.

We illustrate the results obtained using this plugin through the sequencing of a patient suffering from Diffuse Large B-Cell Lymphoma.

G07 - Comparison of targeted NGS and conventional sequencing performance for a Leukemia gene panel

Mirjam Rehr, WWU Muenster, Germany

Stefanie Göllner, MLU Halle, Germany

Claudia Gebhard, Universitaet Regensburg, Germany

Short Abstract: We present a comparison of the performance of conventional sequencing, taken as the gold standard, and targeted NGS in combination with a bioinformatics pipeline. Our gene panel consists of clinically significant leukemia mutation sites, namely in the genes NPM1, FLT3 and CEBPA. Data and comparison scripts will be made available as the R package DKHcomparison.

We evaluate two datasets in which targeted NGS data have been enriched by the same custom HaloPlex design and were, or are to be, generated on Illumina HiSeq platforms. The pipeline currently in use consists of alignment with bwa and variant calling with gatk. The minimum average coverage at target is set to 50x. Our first analysis was conducted retrospectively on a 66-sample dataset. Conventional diagnostics were available as patients' clinical data with up to 35 missing values. The second dataset will comprise data from 50 patients, for which Sanger sequencing of the same targets will be carried out as conventional diagnostics.

Our preliminary findings with the initial dataset suggest that NGS sensitivity and positive predictive values are generally high, ranging from one out of two (low prevalence) to 98% and from 94% to 100%, respectively. On visual inspection, NGS false positives were judged to actually be true positives, pointing to a higher sensitivity of NGS compared with conventional diagnostics.

Challenges in NGS remain both in the target capture design and in setting up a bioinformatics pipeline including filtering by annotation and automated detection of tandem duplications.
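The sensitivity and positive predictive value reported in this abstract can be computed from two sets of calls, as in the minimal Python sketch below. The function name and the example call labels are hypothetical, and the reference set simply stands for whatever the conventional diagnostics reported.

```python
def sensitivity_and_ppv(ngs_calls, reference_calls):
    """Compare NGS calls with reference calls (both given as sets).

    Returns (sensitivity, positive predictive value); a value is None
    when its denominator is zero.
    """
    true_pos = len(ngs_calls & reference_calls)
    false_pos = len(ngs_calls - reference_calls)
    false_neg = len(reference_calls - ngs_calls)
    sensitivity = true_pos / (true_pos + false_neg) if true_pos + false_neg else None
    ppv = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    return sensitivity, ppv


# Hypothetical (sample, mutation) calls for a low-prevalence mutation.
reference = {("patient_01", "mutation_x"), ("patient_02", "mutation_x")}
ngs = {("patient_01", "mutation_x")}
low_prevalence_result = sensitivity_and_ppv(ngs, reference)
```

With only two reference-positive samples, a single missed call already yields a sensitivity of one out of two, which is why the abstract qualifies that figure as a low-prevalence case.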

G08 - VaDE: a manually-curated database of reproducible associations between various traits and human genomic polymorphisms.

Tadashi Imanishi, Tokai University School of Medicine, Japan

Yoko Nagai, Tokai University School of Medicine, Japan

Yasuko Takahashi, Tokai University School of Medicine, Japan

Short Abstract: Genome-wide association studies (GWASs) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with various traits including common diseases. However, because GWAS relies on statistical evaluation, false positives cannot be completely eliminated and may contaminate the results to a certain extent. On the other hand, it is becoming clearer that genetic risk factors of common diseases are not totally universal but heterogeneous among human populations. We thus developed a new database of genomic polymorphisms that are reproducibly associated with disease susceptibilities, drug responses and other traits for each human population, and released it as "VarySysDB Disease Edition (VaDE)". Using PubMed and the NHGRI GWAS catalog, we collected 1722 GWAS papers and curated them manually. We extracted information on associated SNPs, odds ratios, p-values, study design, nationality of subjects, and many other items. Also, extensive manual curation has been carried out separately for hypertension and rheumatoid arthritis. Then, we assessed the reproducibility of each association in multiple, independent studies for each human population. Finally, we obtained 2498 and 543 reproducible associations for 304 and 119 traits in the European and East Asian populations, respectively. Furthermore, to support finding functional SNPs in VaDE, we integrated data of ChIP-seq, DNaseI hypersensitivity experiments, regulatory motifs, RefSeq genes, H-Inv transcripts, and linkage disequilibrium data in three major human populations that have been obtained from Haploreg v2, VarySysDB, and Univ Michigan. The VaDE database is publicly available from http://bmi-tokai.jp/VaDE/. We believe that our database will contribute to the future establishment of personalized medicine and understanding of genetic factors underlying diseases.

G09 - Exploring mutational effects and filling phenotypic gaps with allelic variability

Saumya Kumar, Medical Research Council Harwell

Michelle Simon, Medical Research Council Harwell

Ann-Marie Mallon, Medical Research Council Harwell

Short Abstract: How genetic variations translate into disease phenotypes is largely unknown. With increasing number of genomic variants potentially associated with diseases being identified through GWAS studies and next-generation sequencing, it is important to work out the underlying principles of genotype-to-phenotype relationships. It has been observed that different alleles of the same gene may also result in phenotypic variability. Mutations such as point mutations, small indels or null mutations for the same gene can affect intricate molecular networks with little consequence or disrupt broader networks causing deleterious effects to the organism. Here we examine the impact an allelic series has on mouse phenotypes and understanding disease aetiology.
Two large-scale mutation screens are carried out at MRC Harwell. The first, the Harwell ENU Ageing screen, follows a forward genetics approach: mice are mutagenised with ENU, causing random point mutations in their progeny. The other major project is the International Mouse Phenotyping Consortium (IMPC), which follows a reverse genetics approach and aims to knock out every gene in the mouse genome. Mice from both screens go through a series of phenotype assays, and the data are collected and analysed.
Here, I present results analysing the molecular effects of different types of mutations derived from the above phenotype screens. I compare the variability of phenotypes for particular allelic series and show how molecular and omic data for each mutation can fill the ‘phenotype gap’ for each gene.

G10 - GRASP v2.0: an update on the Genome-Wide Repository of Associations between SNPs and phenotypes

Andrew Johnson, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

John Eicher, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Christa Landowski, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Brian Stackhouse, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Wenjie Chen, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Nicole Jensen, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Ju-Ping Lien, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Richard Leslie, National Heart, Lung and Blood Institute, NIH, Branch for Cardiovascular Epidemiology & Human Genomi, United States

Short Abstract: Here, we present an update on the Genome-Wide Repository of Associations between SNPs and Phenotypes (GRASP) database version 2.0 (http://apps.nhlbi.nih.gov/Grasp/Overview.aspx). GRASP is a centralized repository of publicly available genome-wide association study (GWAS) results. GRASP v2.0 contains ∼ 8.87 million SNP associations reported in 2082 studies, an increase of ∼ 2.59 million SNP associations (41.4% increase) and 693 studies (48.9% increase) from our previous version. Our goal in developing and maintaining GRASP is to provide a user-friendly means for diverse sets of researchers to query reported SNP associations (P ≤ 0.05) with human traits, including methylation and expression quantitative trait loci (QTL) studies. Therefore, in addition to making the full database available for download, we developed a user-friendly web interface that allows for direct querying of GRASP. We provide details on the use of this web interface and what information may be gleaned from using this interactive option. Additionally, we describe potential uses of GRASP and how the scientific community may benefit from the convenient availability of all SNP association results from GWAS (P ≤ 0.05). We plan to continue updating GRASP with newly published GWAS and increased annotation depth.

G11 - Performance Metrics for a High Throughput Targeted Sequencing Assay

Perry Haaland, BD Technologies, United States

Kristen Borchert, BD Technologies, United States

Jessica Maia, BD Technologies, United States

Frances Tong, BD Technologies, United States

Jeff Baker, BD Technologies, United States

Short Abstract: As next generation sequencing proves its usefulness to medical practice, it becomes critically important to reduce cost and increase reliability of the results. Most performance studies to date have focused on the accuracy and reproducibility of the sequencing platform, but this represents only part of the process and ignores DNA extraction from the sample, library preparation, and template preparation. Sequencing labs have a variety of systems in place for these tasks. When lower-cost, more automated systems become available upstream of the sequencer, it will be critical to understand their performance in comparison to existing systems.

As a first step in developing such performance comparisons, we conducted a benchmark study using the Horizon Multi-Gene Multiplex reference sample. To represent a high-throughput sequencing scenario, we prepared 24 barcoded libraries at one time using the Ion AmpliSeq(TM) Cancer Hotspot v2 targeted sequencing panel and subsequently sequenced them on the Ion Torrent PGM. This library preparation was replicated multiple times, with all other parts of the process being standardized.

We define a set of coverage-based metrics that are relevant to reliable variant calling, and we apply them to the benchmark dataset. Using a statistical approach, we set potential performance criteria for upstream automation systems based on the variation observed in this study.

G12 - TagMix: An integrated Genome-Wide Tag SNPs Selection models for Custom Chip design in multiple populations of high genetic variation and low linkage disequilibrium

Emile Chimusa Rugamika, University of Cape Town, South Africa

Short Abstract: Despite the commercial availability of affordable genome-wide DNA panels, custom-designed DNA panels are frequently used for high-resolution DNA studies focusing on specific genes or chromosomal regions, and are now also frequently used in genome-wide association studies to optimize the inference of disease scoring statistics. Existing chip designs, combined with imputation using large-scale sequencing panels, can capture variation well relative to low-coverage sequencing designs by taking advantage of the high level of linkage disequilibrium (LD) in European populations. However, the high diversity and low LD in Africa influence the design and analysis of genome-wide association studies in African populations. Because recent strategies for finding causal variants that underlie common diseases have been based on LD (the non-random association of variants at separate genetic loci) and on haplotype blocks, current tag SNP selection algorithms are based on either LD or haplotype blocks. We propose TagMix, an integrated cross-population genome-wide tag SNP selection model that combines LD-based, haplotype-based and principal component analysis approaches to efficiently identify informative variants for custom chip arrays covering multiple populations with low LD and high diversity. We applied TagMix to 5 African populations from 1000 Genomes Phase 3, where it improves genetic coverage of both common and rare variation. Tag SNPs identified using TagMix have good portability across these 5 African populations and show higher statistical power in association tests.
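For readers unfamiliar with LD-based tagging, the general idea can be sketched with a simple greedy selection over a pairwise r² matrix. This is a generic textbook-style illustration, not the TagMix model: it covers only the LD component, and the function name, the 0.8 threshold and the toy matrix are assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedily pick tag SNPs so every SNP has r2 >= threshold with some tag.

    `r2` is a symmetric matrix (list of lists) of pairwise r-squared values
    with 1.0 on the diagonal. Returns the chosen SNP indices in pick order.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs;
        # ties are broken in favour of the lowest index.
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if r2[best][j] >= threshold}
    return tags


# Toy matrix: SNPs 0, 1 and 2 form an LD cluster, SNP 3 is independent.
toy_r2 = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.50, 0.10],
    [0.85, 0.50, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]
toy_tags = greedy_tag_snps(toy_r2)
```

In populations with low LD, fewer SNPs exceed the r² threshold with any one tag, so more tags are needed for the same coverage, which is the difficulty the abstract highlights for African populations.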

G13 - cghRA: a flexible workflow for CGH array analysis

Sylvain Mareschal, Centre Henri Becquerel, France

Abdelilah Bouzelfen, INSERM U918, Centre Henri Becquerel, France

Marion Alcantara, INSERM U918, Centre Henri Becquerel, France

Philippe Ruminy, INSERM U918, Centre Henri Becquerel, France

Martin Figeac, Plateforme de Génomique Fonctionelle, IRCL, France

Christian Bastard, INSERM U918, Centre Henri Becquerel, France

Hervé Tilly, INSERM U918, Centre Henri Becquerel, France

Fabrice Jardin, INSERM U918, Centre Henri Becquerel, France

Short Abstract: Although Next Generation Sequencing technologies are becoming the new reference in whole genome analysis, there is still a need for more affordable methods like CGH arrays to compare genomic alterations in large sample series. Despite this need and the large collection of freely available un-interfaced algorithms and commercial software, biologists unfamiliar with command line interfaces and scripting lack a simple and efficient tool to handle such data.

Here, we provide free cross-platform software, combining a user-friendly interface for pure biologists and an object-oriented command line interface for more advanced users. Its open-source R implementation offers a native interface to most of the published algorithms in the field, and a sharp learning curve to users familiar with this widespread scripting language. Aside from well-established algorithms, it includes original algorithms for copy number calling, recurring event definition and polymorphism filtering in a somatic context.

The performance of these algorithms has been assessed in a series of 108 CGH arrays of Diffuse Large B-Cell Lymphoma (DLBCL), for which conventional karyotyping and quantitative PCR had also been performed. Copy number estimations made by cghRA proved to be more accurate than those of competing algorithms (GLAD, CGHcall) when compared to wet-lab validation. Recurring events highlighted by cghRA matched and extended the current knowledge of DLBCL genomics, while offering more flexibility than competing software (GISTIC, SRA). Finally, polymorphism annotation sensitivity was assessed in a public dataset of control samples, and specificity in a subset of arrays hybridized against matching normal DNA rather than a DNA pool.

G14 - Genomic landscape of metastatic colorectal cancer

Christian Rausch, VU University Medical Center, Netherlands

Josien Haan, VU University Medical Center, Netherlands

Mariette Labots, VU University Medical Cente, Netherlands

Stef van Lieshout, VU University Medical Cente, Netherlands

Miriam Koopman, University Medical Center Utrecht, Netherlands

Jolien Tol, Radboud University Medical Center, Netherlands

Leonie J. M. Mekenkamp, Radboud University Medical Center, Netherlands

Mark van de Wiel, VU University Medical Center, Netherlands

Danielle Israeli, VU University Medical Center, Netherlands

Hendrik F. van Essen, VU University Medical Center, Netherlands

Nicole C. T. van Grieken, VU University Medical Center, Netherlands

Quirinus J. M. Voorham, VU University Medical Center, Netherlands

Linda J. W. Bosch, VU University Medical Center, Netherlands

Xueping Qu, Genentech, Inc., United States

Omar Kabbarah, Genentech, Inc., United States

Henk M. W. Verheul, VU University Medical Center, Netherlands

Iris D. Nagtegaal, Radboud University Medical Center, Netherlands

Cornelis J. A. Punt, Academic Medical Center, Netherlands

Bauke Ylstra, VU University Medical Center, Netherlands

Gerrit Meijer, VU University Medical Center, Netherlands

Short Abstract: Colorectal cancer (CRC) is the second leading cause of cancer death in the western world with 1.2 million new cases and over 600,000 deaths worldwide in 2008 (Jemal et al., 2011).
We have recently published a study on the “Genomic landscape of metastatic colorectal cancer” (Haan et al., Nov 2014, DOI: 10.1038/ncomms6457). Here we have analyzed tumour samples of a homogeneous, well-annotated series of patients with metastatic CRC (mCRC) of two phase III clinical trials, CAIRO and CAIRO2. DNA copy number aberrations of 349 patients were determined using high-quality array comparative genomic hybridization (aCGH). Within three treatment regimens, 194 chromosomal subregions were found to be associated with progression-free survival (PFS; uncorrected single-test P-values <0.005). These subregions were filtered for effect on messenger RNA expression, using an independent data set from The Cancer Genome Atlas which returned 171 genes. Three chromosomal regions are associated with a significantly lower PFS in treatment regimens with or without irinotecan. One of these regions, 6q16.1–q21, correlates in vitro (in COSMIC cell lines) with sensitivity to SN-38, the active metabolite of irinotecan. This genomic landscape of mCRC further reveals a number of DNA copy number changes of genes that are known drug targets. The release of this large body of high-quality data, both in terms of phenotype annotations and genomic read out, to the scientific community (deposited in GEO accession code GSE36864) will allow further analysis and validation experiments.

G15 - Integration of wet and dry bench analytics in targeted NGS assays optimizes the accuracy of variant detection in residual clinical FFPE, FNA, and liquid biopsies

Brian Haynes, Asuragen, United States

Robert Zeigler, Asuragen, United States

Dennis Wylie, Asuragen, United States

Jeff Houghton, Asuragen, United States

Sachin Sah, Asuragen, United States

Liangjing Chen, Asuragen, United States

Huiping Zhu, Asuragen, United States

Stephanie Bridger, Asuragen, United States

Julie Krosting, Asuragen, United States

Andrew Hadd, Asuragen, United States

Gary Latham, Asuragen, United States

Short Abstract: Clinical research and diagnostics are increasingly reliant on next-generation sequencing (NGS) technologies due to the rich breadth and depth of information they provide. However, complexity of experimentation, heterogeneity of clinical specimens and burden of data analysis pose significant, ongoing challenges that limit the potential of NGS, particularly for oncology applications.

We present a targeted NGS system that utilizes an integrated and cross-platform workflow comprising optimized QC and library prep reagents, clinically relevant controls, and a novel variant analysis pipeline. Integration of the nucleic acid quantification QC assay with post-sequencing analytics enriches the analysis with sample-specific details that cannot be inferred from PCR-enrichment based targeted sequencing data alone. We developed a variant calling algorithm that directly incorporates amplifiable DNA template count as a model covariate. This variant caller was trained on 425 residual clinical specimens, comprising 171 FFPE and 254 FNA tumor biopsies for which truth was established through orthogonal confirmation or assay replication.

Evaluation of an independent test set of over 50 FFPE specimens reveals that incorporating sample-specific experimental information improves mutation detection sensitivity, especially for low-frequency variants present between 0.5% and 10%, and improves PPV for libraries prepared with less than 200 amplifiable DNA template molecules. We also demonstrate the potential of this technology for liquid biopsy applications using a cohort of matched fresh-frozen, FFPE and plasma specimens representing different solid tumor backgrounds. Our results underscore the value of a systems approach to targeted NGS that links pre-analytical, analytical, and post-analytical steps using a streamlined, cross-platform workflow.
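To see why the amplifiable template count is informative for low-frequency variants, a simple sampling calculation helps. The sketch below is an illustration only and is not the variant caller described in this abstract: it assumes each template molecule independently carries the variant with probability equal to the variant fraction (a plain binomial model), and the function name `prob_variant_sampled` is hypothetical.

```python
def prob_variant_sampled(template_count: int, variant_fraction: float) -> float:
    """Chance that at least one input template carries the variant.

    Assumption: each amplifiable template independently carries the
    variant with probability `variant_fraction` (binomial sampling).
    """
    if template_count < 0:
        raise ValueError("template_count must be non-negative")
    if not 0.0 <= variant_fraction <= 1.0:
        raise ValueError("variant_fraction must lie in [0, 1]")
    # P(at least one) = 1 - P(none of the templates carries the variant)
    return 1.0 - (1.0 - variant_fraction) ** template_count


# Illustrative template counts and variant fractions, chosen for this sketch.
for templates in (50, 200, 1000):
    for fraction in (0.005, 0.05):
        p = prob_variant_sampled(templates, fraction)
        print(f"{templates} templates, {fraction:.1%} variant: P(sampled) = {p:.3f}")
```

Under this simplified model, libraries with few templates may not contain a low-frequency variant at all, which is one reason a per-sample template count can be a useful covariate.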

G16 - Sequence analysis of regulatory variants reveals selection pressure on somatic mutations in breast cancer

Ivan Kulakovskiy, Engelhardt Institute of Molecular Biology, Russian Federation

Ilya Vorontsov, Vavilov Institute of General Genetics, Russian Federation

Grigory Khimulya, Vavilov Institute of General Genetics, Russian Federation

Darya Nikolaeva, Lomonosov Moscow State University, Russian Federation

Vsevolod Makeev, Vavilov Institute of General Genetics, Russian Federation

Short Abstract: Single nucleotide variants (SNVs) are the most common type of variation in the human genome. SNVs in regulatory regions do not affect proteins directly but may instead change the expression patterns of the corresponding genes. The availability of high-throughput sequencing data enables studies based on fundamentally new data on sequence variants arising from somatic mutations, in particular those emerging in cancer.

Using PERFECTOS-APE software (http://opera.autosome.ru/perfectosape/) and HOCOMOCO collection of transcription factor binding sites models (http://autosome.ru/HOCOMOCO/), we predicted transcription factor binding sites overlapping somatic mutations in breast cancer.

By careful statistical assessment, we demonstrate that the observed frequencies of binding sites at mutations differ significantly from those expected. Moreover, the mutations do not merely tend toward or avoid overlapping with binding sites. Instead, they specifically target or avoid particular crucial positions in the binding sites of several transcription factors, in particular those involved in adipogenesis.

We believe this provides important evidence for selection pressure shaping the cancer cell population.

G17 - Comprehensive analysis of genetic variations using computational approaches

Seungchul Lee, School of Information and Communications, Gwangju Institute of Science and Technology, Korea, Rep

Jingu Lee, School of Information and Communications, Gwangju Institute of Science and Technology, Korea, Rep

Se-Hoon Lee, Division of Hematology/Oncology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Korea, Rep

Hyunju Lee, School of Information and Communications, Gwangju Institute of Science and Technology, Korea, Rep

Short Abstract: With the dramatic growth in sequencing technologies and an urgent need to understand the complexity of genomic variation in humans, computational approaches for detecting genetic variations have advanced remarkably over the past decade. However, strategies for distinguishing true variants from sequencing artifacts and for providing a list of cancer-related variants remain unclear. In this study, we developed an integrated framework for the identification of somatic mutations, copy number variations (CNVs) and their functional profiles. We used whole-exome sequencing to detect variants in patients with cancer and optimized the extraction of mutations and CNVs by investigating the properties of established tools and adding biological filters. Specifically, a consensus approach based on genomic features was used to increase the sensitivity of mutation identification and to discriminate false positives from real sequence variants. Our method will be beneficial for the analysis of exome sequencing data from cancer.

G18 - Exploiting a large scale biodata management system to support NGS variant detection studies

Gianmauro Cuccuru, CRS4, Italy

Paolo Uva, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Stefano Onano, Istituto di Ricerca Genetica e Biomedica (IRGB), Consiglio Nazionale delle Ricerche (CNR), Italy

Rossano Atzeni, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Simone Leo, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Luca Lianas, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Manuela Oppo, Dipartimento di Scienze Biomediche, Università di Sassari, Italy

Luca Pireddu, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Andrea Angius, Istituto di Ricerca Genetica e Biomedica (IRGB), Consiglio Nazionale delle Ricerche (CNR), Italy

Laura Crisponi, Istituto di Ricerca Genetica e Biomedica (IRGB), Consiglio Nazionale delle Ricerche (CNR), Italy

Gianluigi Zanetti, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Giorgio Fotia, Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna (CRS4), Italy

Short Abstract: Tracing, analysing and updating the results of data processing protocols can significantly improve turn-around time and reproducibility when systematically dealing with any large scale NGS based clinical study.

We present an infrastructure to automate and track all data-related procedures and clinical information at the CRS4 high-throughput sequencing platform, which is the largest in Italy (20 TB of raw sequencing data every ten days). To allow automated data processing this facility is directly interconnected to the CRS4 computational resources (3000 cores, 4.5 PB storage).
Our infrastructure has recently been used in a clinical exome sequencing program for mutation identification in patients with syndromic intellectual disability, a large group of disorders with variable phenotypes. An average of 60 million 100 bp paired-end reads were generated per sample, with a mean 100-fold coverage across 62 Mb of target regions. Sequences were automatically processed according to the GATK Best Practices, annotated and filtered by allele frequency and predicted impact to select likely pathogenic variants from an average of 170,000 variants per family.
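The final filtering step, selecting candidates by allele frequency and predicted impact, can be sketched as follows. This is a minimal illustration and not the CRS4 pipeline: the `Variant` record, the 1% frequency cutoff and the impact labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    population_af: float  # allele frequency in a reference population
    impact: str           # predicted impact label (hypothetical vocabulary)


def select_candidates(variants, max_af=0.01, impacts=("HIGH", "MODERATE")):
    """Keep rare variants whose predicted impact is in the accepted set.

    Both the cutoff and the accepted labels are illustrative defaults.
    """
    return [v for v in variants
            if v.population_af <= max_af and v.impact in impacts]


# Made-up records, used only to exercise the filter.
family_variants = [
    Variant("GENE_A", 0.0001, "HIGH"),
    Variant("GENE_B", 0.2, "HIGH"),    # too common in the population
    Variant("GENE_C", 0.0005, "LOW"),  # predicted impact too mild
]
for v in select_candidates(family_variants):
    print(v.gene, v.population_af, v.impact)
```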

One of the core components of the CRS4 infrastructure is OMERO.biobank, a robust, extensible and scalable traceability biodata management system. This component provides a way to store, query and retrieve a full description of biomedical datasets, the computational procedures that derived them, and the graph of dependencies between datasets. These functionalities are essential to support dataset traceability, reproducibility and update.

G19 - NGS-SWIFT: A Cloud-Based Variant Analysis Framework Using Control-Accessed Sequencing Data

Chunlin Xiao, NCBI/NLM/NIH, United States

Eugene Yaschenko, NCBI/NLM/NIH, United States

Stephen Sherry, NCBI/NLM/NIH, United States

Short Abstract: Genetic variation analysis plays an important role in elucidating the causes of various human diseases. The massive volume of sequencing data imposes significant technical challenges for data management and analysis, including the tasks of collection, storage, transfer, sharing, and privacy protection. Currently, each analysis group facing these tasks must download all the relevant sequence data into a local file system before variation analysis is initiated. This heavy-weight transaction not only slows down the pace of the analysis, but also creates financial burdens for researchers due to the cost of hardware and the time required to transfer the data over typical academic internet connections. To overcome such limitations and explore the feasibility of analyzing control-accessed sequencing data in a cloud environment while maintaining data privacy and security, we introduce a cloud-based analysis framework that facilitates variation analysis using direct access to the NCBI Sequence Read Archive (SRA) through the NCBI sratoolkit. The toolkit allows users to programmatically access data housed within SRA with encryption and decryption capabilities, and converts the data from the SRA format to the desired format for analysis. A customized machine image (ngs-swift) with preconfigured tools (including NCBI sratoolkit) and resources essential for variant analysis has been created for instantiating an EC2 instance or instance cluster on the Amazon cloud. The performance of this framework has been evaluated and compared with that of a traditional analysis pipeline, and security handling in a cloud environment when dealing with control-accessed sequence data has been addressed.

G20 - Characterization of the effects of mutations detected in tumor exomes combining similarity networks methods and protein structure stability estimations with consideration of protein conformational diversity

Ezequiel Juritz, Universidad Nacional de Quilmes, Argentina

Jose Soto, Center for Bioinformatics and Integrative Biology, UNAB, Chile

Javier Caceres, Center for Bioinformatics and Integrative Biology, UNAB, Chile

Juan Pablo Bascur, Center for Bioinformatics and Integrative Biology, UNAB, Chile

Danilo Gonzalez Nilo, Center for Bioinformatics and Integrative Biology, UNAB, Chile

DE Almonacid, Center for Bioinformatics and Integrative Biology, UNAB, Chile

Short Abstract: Genetic alterations associated with cancer are mostly studied individually, omitting the integrative exploration of the biological, molecular and physicochemical context within the affected superfamilies.
Our goal is to characterize 806,207 point mutations associated with 26 types of cancer, derived from the exomes of 5,838 patients.
To that effect, the statistical significance of each mutated gene was obtained using MutSig2CV3.0 (Lawrence et al. 2014), considering the mutational heterogeneity of the samples (i.e. patient-specific mutation frequencies and gene-specific mutation rates).
Proteins encoded by significantly mutated genes were then cross-linked with CSA (Porter et al. 2004) to identify mutations affecting residues involved in active sites; and with PCDB (Juritz et al. 2011) and CoDNaS (Monzon et al. 2013) to obtain the conformational space of these proteins.
Similarity networks were built for all 1,352 recruited proteins using the recently developed parameter-free clustering software PFClust (Musayeva et al. 2014).
We identified 230 superfamilies significantly mutated in cancer, including Serine/Threonine and Tyrosine-Kinases, Growth Factors and Protocadherins among the most represented groups. Additionally, we detected hot spots presenting high mutation rates in 72% of the superfamilies, including the three mentioned above.
Of all missense mutations, fewer than 10% mapped to catalytic sites, indicating that protein structure destabilization is strongly related to cancer-associated mutations. We are currently estimating the thermodynamic effect of these mutations in all recruited structural conformers.
The integration of protein stability variations and conformational diversity with the superfamily context provides promising insights for characterizing the effects of cancer-associated mutations, particularly within the detected hot spots.

G21 - Analyzing Viral Quasispecies via Semi-Definite Programming

Haris Vikalo, University of Texas at Austin, United States

Somsubhra Barik, University of Texas at Austin, United States

Shreepriya Das, University of Texas at Austin, United States

Short Abstract: RNA viruses are characterized by a high mutation rate that typically leads to a population of closely related genomes (a viral quasispecies) in a host organism. The genetic diversity of quasispecies enables the virus to adapt to varying conditions over the course of infection and keep proliferating. Determining genetic diversity (i.e., inferring viral haplotypes) of a virus is essential for the understanding of its origin and mutation patterns, and the development of effective drug treatments. A fast and affordable analysis of viral genomes can be accomplished using high-throughput sequencing platforms, revealing both the sequences of the mutated viruses as well as their frequencies. However, errors and limited read lengths of high throughput sequencing platforms render the problem of estimating genetic diversity of viral quasispecies challenging. The existing methods typically do not perform well when the sequence divergence is low, and experience difficulties when attempting to recover rare sequences; this is problematic since incomplete viral diversity information may lead to designing inadequate treatments. We present a fast and accurate method to assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure -- namely, the low rank of the SNP fragment matrix -- to solve it rapidly and accurately. Extensive benchmarking tests demonstrate that the proposed method is significantly more accurate than the existing techniques.

G22 - Conserved epigenomic signals in mice and human reveal immune basis of Alzheimer's disease

Andreas Pfenning, Massachusetts Institute of Technology, United States

Elzabeth Gjoneska, Massachusetts Institute of Technology, United States

Hansruedi Mathys, Massachusetts Institute of Technology, United States

Gerald Quon, Massachusetts Institute of Technology, United States

Anshul Kundaje, Stanford University, United States

Manolis Kellis, Massachusetts Institute of Technology, United States

Li-Huei Tsai, Massachusetts Institute of Technology, United States

Short Abstract: Alzheimer’s disease (AD) is a devastating neurological disorder characterized by beta-amyloid accumulation and cognitive decline. Our work builds a rigorous computational framework to characterize regulatory mutations that have been associated with AD through a genome-wide association study meta-analysis. First, we compare the AD-associated mutations to enhancer regions mapped as part of the Roadmap Epigenomics project across 111 different cell types and tissues. Surprisingly, our analysis shows that AD predisposition is primarily mediated by immune cells, rather than neurons. To annotate these AD-associated enhancer regions, we use ChIP-Seq to measure changes in H3K27ac at enhancers during neurodegeneration in the CK-p25 mouse model. We identify a set of immune enhancers that increase in H3K27ac during neurodegeneration in the mouse, which are orthologous to immune enhancers in the human. These conserved enhancers are enriched for AD-associated mutations, providing evidence that mouse models can be used to study the epigenetics of complex disorders.

G23 - SV-Bay: structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability

Valentina Boeva, Institut Curie, Centre de Recherche, France

Daria Iakovishina, INRIA, France

Emmanuel Barillot, Institut Curie, France

Mireille Regnier, INRIA, France

Short Abstract: Whole genome sequencing of paired-end reads can be applied to characterize the landscape of large somatic rearrangements of cancer genomes. Several methods for detecting structural variants with whole genome sequencing data have been developed. So far, none of these methods has combined information about abnormally mapped read pairs connecting rearranged regions and associated copy number changes. Our aim was to create a computational method that could use both types of information, i.e., normal and abnormal reads, and to demonstrate that by doing so we can greatly improve both the sensitivity and specificity of structural variant prediction.
Here we present a computational method, SV-Bay, developed to detect structural variants from whole genome sequencing mate-pair or paired-end data using a probabilistic Bayesian approach. This approach takes into account depth of coverage by normal reads and abnormalities in read pair mappings. To estimate the model likelihood, SV-Bay considers GC-content and read mappability of the genome, thus making important corrections to the expected read count. For the detection of somatic variants, SV-Bay makes use of a matched normal sample when it is available. We validated SV-Bay on simulated datasets and an experimental mate-pair dataset for the CLB-GA neuroblastoma cell line. The comparison of SV-Bay with several other methods for structural variant detection demonstrated that SV-Bay has better prediction accuracy both in terms of sensitivity and false positive detection rate.
SV-Bay is implemented in Python and is available at https://github.com/InstitutCurie/SV-Bay
Contact information: SV.Bay@curie.fr

G24 - Bam2Drug: A novel cancer pipeline for complete characterization of WGS/WES data

Djordje Klisic, Seven Bridges Genomics,

Milos Tziotas, Seven Bridges Genomics,

Vladan Arsenijevic, Seven Bridges Genomics,

Ana Mijalkovic Lazic, Seven Bridges Genomics,

Jelena Radenkovic, Seven Bridges Genomics,

Lu Zhang, Seven Bridges Genomics, United States

Short Abstract: Next generation sequencing allows the mutational spectrum of cancer to be analyzed at an unprecedented scale. This technology has provided new insights into the mechanisms of tumorigenesis and has the potential to reveal novel therapeutic targets and prognostic indicators. Multiple open source tools were implemented into a novel pipeline for characterization of somatic mutations, which permits the reproducibility of analyses on whole genome/exome DNA data sets by processing paired tumor-normal BAMs. To increase robustness, two methodologically distinct somatic SNV and CNV callers were implemented (VarScan2+SomaticSniper and VarScan2+Control-FREEC respectively). Intersecting the results from the two callers allows a significant reduction of false positives, while joining them leads to the detection of a more complete set of true positives.
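The trade-off between intersecting and joining two call sets can be sketched with plain set operations. This is a minimal illustration and not the Bam2Drug implementation: variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples, and the example calls are made up.

```python
def combine_calls(calls_a, calls_b):
    """Return the intersection and the union of two somatic call sets.

    Each call is assumed to be a hashable (chrom, pos, ref, alt) tuple.
    """
    set_a, set_b = set(calls_a), set(calls_b)
    return {
        "intersection": set_a & set_b,  # calls both callers agree on
        "union": set_a | set_b,         # calls made by at least one caller
    }


# Made-up calls from two hypothetical callers.
caller_one = [("chr1", 1000, "A", "T"), ("chr2", 5000, "G", "C")]
caller_two = [("chr1", 1000, "A", "T"), ("chr3", 700, "C", "G")]

combined = combine_calls(caller_one, caller_two)
print(len(combined["intersection"]), "shared calls")
print(len(combined["union"]), "calls in total")
```

The intersection favors precision at the cost of missed variants, while the union favors completeness at the cost of more false positives.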

Somatic mutations obtained by the pipeline are annotated using multiple resources including COSMIC, dbSNP, ClinVar, dbNSFP, SnpEff and DrugBank. Mutations are additionally prioritized using eXtasy and PriVar. Information is then fetched from DrugBank about available drugs for the gene in which a mutation is located, and their mode of action (e.g. inhibiting, activating). The pipeline is highly parallelized, and as a result has an execution time of less than 3h for WES data, and around 30h for WGS data using the HCC1143 breast cancer cell line from the TCGA public database. Further, an in-house prioritization approach was investigated, based on the distribution of annotation scores, to determine which somatic mutations are the foremost cancer drivers. The pipeline was implemented using Rabix, an implementation of the Common Workflow Language.

G25 - Queryor: a web-based platform for exome variants prioritization

Nicola Vitulo, University of Padova, Italy

Claudio Forcato, University of Padova, Italy

Loris Bertoldi, University of Padova, Italy

Riccardo Schiavon, University of Padova, Italy

Alessandro Vezzi, University of Padova, Italy

Erika Feltrin, University of Padova, Italy

Fabio De Pascale, University of Padova, Italy

Giorgio Valle, University of Padova, Italy

Short Abstract: Exome sequencing is a cutting-edge methodology extremely useful to define the extensive inventory of human genetic variation. However, highlighting the few variants that are causal for human diseases among tens of thousands of variants is still a major challenge.
The massive number of obtained variants has to be filtered in order to exclude neutral variants and focus on the subset of potentially pathogenic variants. Here we present Queryor, a web-server platform for the management and retrieval of whole exome sequencing data.
Queryor is a gene-centered system that aggregates functional annotation of variants with gene annotations, offering a large number of criteria by which the end-user can prioritize genes/variants. The prioritization strategy consists of a ranking system that sorts results based on the criteria they satisfy.
Variants obtained from user sequencing projects can be submitted to the platform in VCF format. They are functionally annotated by means of several external resources and the annotation data are stored on a relational database. Annotation data include the genome context of variants (exon, CDS, splicing sites, UTR), their functional impact on coding regions (amino acid substitution, SIFT and PolyPhen scores), genotypic data (read coverage of each allele, allele frequencies, quality scores) and gene annotation (Gene Ontology, OMIM, InterPro). After the annotation step, the data are available for query exploration in a private environment. Queryor has a user-friendly web-interface with a valuable output representation that allows an effective interpretation of results and the possibility to refine ranking strategies. Queryor is available at http://genomes.cribi.unipd.it/cgi-bin/queryor/login.pl
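The ranking idea described in this abstract, sorting results by the criteria they satisfy, can be sketched as a simple count of satisfied criteria. This is an assumption-laden illustration and not Queryor's actual scoring: the record fields, criterion names and equal weighting of criteria are all hypothetical.

```python
def rank_by_criteria(records, criteria):
    """Sort records by how many user-selected criteria each one satisfies.

    `criteria` maps a criterion name to a predicate taking a record.
    Every criterion counts equally here, which is an assumption.
    """
    scored = [(sum(1 for test in criteria.values() if test(rec)), rec)
              for rec in records]
    # Python's sort is stable, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Made-up annotated variants and criteria.
variants = [
    {"id": "v1", "af": 0.2, "impact": "LOW"},
    {"id": "v2", "af": 0.001, "impact": "HIGH"},
    {"id": "v3", "af": 0.001, "impact": "LOW"},
]
criteria = {
    "rare": lambda rec: rec["af"] < 0.01,
    "damaging": lambda rec: rec["impact"] == "HIGH",
}
for score, rec in rank_by_criteria(variants, criteria):
    print(rec["id"], "satisfies", score, "criteria")
```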

G26 - PhyloSpan: using multi-mutation reads to resolve subclonal architectures from heterogeneous tumor samples

Amit Deshwar, University of Toronto, Canada

Levi Boyles, University of Oxford,

Jeff Wintersinger, University of Toronto, Canada

Paul Boutros, Ontario Institute for Cancer Research, Canada

Yee Whye Teh, University of Oxford,

Quaid Morris, University of Toronto, Canada

Short Abstract: Tumors often contain multiple, genetically diverse subclonal populations, as predicted by the clonal theory of cancer. Identifying these sub-populations and their evolutionary relationships can help identify driver mutations associated with cancer development and progression.
Subclonal reconstruction algorithms attempt to infer the prevalence and genotype of multiple, genetically-related subclonal populations using the variant allele frequency (VAF) of somatic variants. To date, these algorithms exclusively use data on individual somatic mutations. In some cases, it is possible to determine the mutation status of more than one mutation in a single cell, for example, when single reads cover multiple single nucleotide variants (SNVs). This type of information can add considerable power to the phylogenetic reconstruction of the tumor subclonal population. Pairs of mutations close enough to be covered by a single read or read-pair provide more phylogenetic certainty than can be found by looking at mutations independently. For example, if those SNVs are found in the same evolutionary branch, then we expect to see some reads containing both mutations. If, however, the SNVs are on separate branches, then no reads should show both SNVs. PhyloSpan integrates this phylogenetic information, along with information about the VAF of each somatic SNV, in order to perform subclonal reconstruction. This algorithm not only infers the number of subclonal populations and their genotype but also provides a measure of uncertainty about this inference, enabling users to determine which parts of the subclonal reconstruction are certain and which parts remain ambiguous.
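The evidence carried by reads that span two SNVs can be illustrated by tallying which mutations each read shows. This is a simplified sketch and not the PhyloSpan model: the function names and the `min_both` threshold are hypothetical, and a real analysis would have to account for sequencing errors instead of using a hard count.

```python
from collections import Counter


def tally_cooccurrence(observations):
    """Count reads by (has_snv1, has_snv2) status.

    `observations` holds one pair of booleans per read (or read pair)
    that covers both SNV positions.
    """
    return Counter(observations)


def supports_same_branch(observations, min_both=1):
    """True if at least `min_both` reads carry both mutations.

    The threshold is an illustrative assumption; with noisy data a
    single read showing both SNVs may not be convincing.
    """
    return tally_cooccurrence(observations)[(True, True)] >= min_both


# Made-up observations for reads spanning two nearby SNVs.
reads = [(True, True), (True, False), (False, False), (True, True)]
print(tally_cooccurrence(reads))
print("same branch supported:", supports_same_branch(reads))
```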

G27 - An integrated approach for analysis of Genotyping-by-Sequencing (GBS) Data

Sateesh Kagale, National Research Council Canada, Canada

Chushin Koh, National Research Council Canada, Canada

Wayne Clarke, Agriculture and Agri-Food Canada, Canada

Venkatesh Bollina, Agriculture and Agri-Food Canada, Canada

Isobel Parkin, Agriculture and Agri-Food Canada, Canada

Andrew Sharpe, National Research Council Canada, Canada

Short Abstract: The advent of very high throughput next generation sequencing (NGS) platforms together with new technical methodologies to take advantage of these gains has provided an opportunity for establishing high resolution genetic analysis in any species. The development of genotyping-by-sequencing (GBS) to rapidly detect nucleotide variation at the whole genome level, in many individuals simultaneously has provided a transformative genetic profiling technique. GBS can be carried out in species with or without reference genome sequences, and yields huge amounts of potentially informative data. One limitation with the approach is the paucity of informatics tools to transform the raw data into a format that can be easily interrogated at the genetic level. Here we describe an integrated bioinformatics pipeline that is designed to identify genetic variants such as single nucleotide polymorphisms (SNPs) and insertions/deletions (InDels) from NGS data generated by most major GBS approaches. The bioinformatics pipeline described here focuses on the analyses of GBS data where there is access to a complete or draft genome. The basic workflow for variant discovery can be divided into three sequential steps: raw data processing, read alignment to a reference genome or de novo assembly of the sequence tags, and variant discovery and annotation. In addition to reviewing these three steps, we will outline some of the recently available bioinformatics resources to enable researchers to establish GBS applications for genetic analysis in their laboratories and also provide a description of key factors that need to be considered in experimental design.

G28 - SNPs and INDELs in genes involved in lipid metabolism of mammary gland of Zebu breeds identified by whole genome sequencing

Guilherme Oliveira, ITV, Brazil

Izinara Rosse, FIOCRUZ, Brazil

Juliana Assis, FIOCRUZ, Brazil

Francislon Oliveira, FIOCRUZ, Brazil

Laura Leite, FIOCRUZ, Brazil

Flávio Araújo, FIOCRUZ, Brazil

Anna Salim, FIOCRUZ, Brazil

Adhemar Zerlotini, EMBRAPA, Brazil

Beatriz Lopes, EPAMIG, Brazil

Wagner Arbex, EMBRAPA, Brazil

Marco Antônio Machado, EMBRAPA, Brazil

Maria Gabriela Peixoto, EMBRAPA, Brazil

Rui Verneque, EMBRAPA, Brazil

Martha Martins, EMBRAPA, Brazil

Roney Coimbra, FIOCRUZ, Brazil

Marcos Vinícius Silva, EMBRAPA, Brazil

Maria Raquel Carvalho, UFMG, Brazil

Short Abstract: Background: Guzerá and Gir are the main indicine dairy cattle breeds that make up the Brazilian herd. The milk produced by these indicine breeds contains a higher concentration of fat than that of taurine breeds. However, the genetic bases for these differences are unknown. In this context, the objective of this study was to sequence and map the genomes of three Guzerá bulls and three Gir bulls in order to identify zebu-specific variations involved in the lipid metabolism of the mammary gland.
Results: Genome sequencing was performed using SOLiD and HiSeq platforms. The sequences obtained were mapped to the reference genome of Bos taurus (UMD 3.1) using the LifeScope and BWA-mem programs. The average depth of coverage achieved from mapping ranged from 10 to 26X for the six samples, and the genome coverage ranged from 87% to 98% of the reference genome. A list of putative SNPs and INDELs was generated using LifeScope and SAMtools. We selected the variations shared among the six samples, which accounted for 2% of the SNPs and INDELs. These variations were classified functionally according to NSG-SNP and those located in genes involved in lipid metabolism of mammary gland were selected. We found potentially functional variations in genes involved in transport and secretion of cholesterol, activation of fatty acids and synthesis of sphingolipids.
Conclusions: This study identified potentially functional variations in genes that may be responsible for differences in the composition of lipids in the milk from indicine and taurine breeds.

G29 - Understanding the discrepancies between somatic variant callers

Andreas Schreiber, SA Pathology, Australia

Paul P. S. Wang, SA Pathology, Australia

Wendy T. Parker, SA Pathology, Australia

David T. Yeung, SA Pathology, Australia

Susan Branford, SA Pathology, Australia

Short Abstract: Cancer research has benefited greatly from recent improvements in next-generation sequencing methods that lead to significantly increased sequencing depth at relatively low cost. In particular, detection of somatic variants, which often requires greater sensitivity than germline variants due to low allele fraction, has become technically and economically feasible.

Several dedicated somatic variant-calling algorithms have been developed. Studies that examine these variant callers show that there is a much greater discrepancy between somatic variant callers than germline variant callers. And in most cases, the recommended approach is to use multiple callers and rely on their overlap to gauge confidence for variant calling.

Using data from tumour and normal samples of chronic myeloid leukaemia (CML) patients, we found that while variants with high caller consensus are much more likely to be true variants (in agreement with other studies), there exist many validated variants that are likely biologically relevant, and yet have low caller consensus.

With the aim to achieve better understanding of the differences between caller behaviour, we carried out in-depth analysis of seven currently available somatic variant callers (MuTect, Seurat, Shimmer, SomaticSniper, Strelka, VarScan2 and Virmid) by testing their component processes (i.e. pre-processing of reads, somatic statistical model, and post-processing of candidate variants). Our data strongly suggest that the caller-specific filters have significant impact on overall caller concordance. We report on these findings and also make recommendations on methods that can improve the accuracy and confidence of somatic variant calling.

G30 - Classification of true and false positive variants within a single sample of next generation sequencing.

Eleni Giannoulatou, Victor Chang Cardiac Research Institute, Australia

Steven Phan, University of Sydney, Australia

Joshua WK Ho, Victor Chang Cardiac Research Institute, Australia

Short Abstract: The emergence of high-throughput Next Generation Sequencing (NGS) technologies has resulted in multiple novel disease gene discoveries and has therefore revolutionised the clinical diagnosis of human genetic diseases. Sequencing the whole genome or exome of an individual can yield millions of variants ranging from single nucleotide variants (SNVs) to complex variants such as short insertions/deletions (indels) or copy number variation (CNV).

The discovery of disease-causing variants requires the application of methodologies with high accuracy and precision. However, many sources of false positives have been identified; false positive variants can be the result of sequencing errors, alignment errors or variant calling errors. Currently, the identification of these false positive variants in a sequencing experiment of a single sample is very difficult. Genotyping algorithms can provide quality scores and other metrics for each variant called, but any one of these metrics alone might not be sufficient to determine whether a variant is real. Many sequencing studies use large cohorts of control samples (or patient samples from different diseases) to help prioritize variants found in a single disease sample.

We have developed a hierarchical logistic regression model that can help classify variants found in a single sample into false positives (FPs) and true positives (TPs). By applying our model to an exome sequencing run of NA12878 and using the gold standard variant callset of Genome in a Bottle, we were able to identify multiple features that can distinguish TPs from FPs.

G31 - Jacquard: A practical approach to integrating complex somatic variant data sets

James Cavalcoli, University of Michigan, United States

Christopher Gates, University of Michigan, United States

Jessica Bene, University of Michigan, United States

Peter Ulintz, University of Michigan, United States

Kevin Meng, University of Michigan, United States

Short Abstract: Identification of somatic variants from Next-Generation Sequencing (NGS) data generated from matched tumor-normal patient samples enables researchers to elucidate fundamental mechanisms of cancer and nominate key prognostic and predictive biomarkers. Most variant callers have embraced the Variant Call Format (VCF) standard which clearly and succinctly describes variants from a single tumor-normal pair. However, while many tools follow the standard format, tools often adopt different ways to partition results (e.g. somatic file vs. germline file, or SNP vs. indel) and each tool creates its own idiosyncratic dialect of VCF fields and tags. Moreover, variant callers typically analyze a single tumor-normal pair at a time, emitting a VCF for each patient.
To address these challenges, we have developed Jacquard, a suite of Python command-line tools that integrate VCFs across different tumor-normal pairs as well as different variant callers to produce a unified variant calling output immediately useful to researchers. Jacquard provides simple consensus calling results, highlights prevalence of variants across samples, and expedites downstream annotation, analysis, and visualization. Jacquard accepts VCFs from MuTect, Strelka, and VarScan and can be easily extended to support new callers. Jacquard emits VCFs and equivalent text files and gracefully integrates into a comprehensive variant calling pipeline.
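
The consensus-calling idea can be sketched in a few lines. This is a minimal illustration, not Jacquard's actual interface or algorithm: the function name, the `min_callers` threshold, and the representation of a variant as a `(chrom, pos, ref, alt)` tuple are all assumptions made for the example.

```python
from collections import defaultdict

def consensus(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` callers.

    calls_by_caller maps a caller name to a set of variant keys,
    here assumed to be (chrom, pos, ref, alt) tuples.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Keep each variant together with the sorted list of supporting callers
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }

# Hypothetical calls for one tumor-normal pair
calls = {
    "MuTect": {("chr1", 100, "A", "T"), ("chr2", 5, "G", "C")},
    "Strelka": {("chr1", 100, "A", "T")},
    "VarScan": {("chr1", 100, "A", "T"), ("chr3", 7, "C", "G")},
}
shared = consensus(calls, min_callers=2)
```

The same counting pattern, keyed by sample instead of caller, would give the prevalence of a variant across samples.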

G32 - Improving comparability of quantitative SNV calling across samples

Jochen Singer, ETH Zürich, Switzerland

Jonas Behr, ETH Zürich, Switzerland

David Seifert, ETH Zürich, Switzerland

Niko Beerenwinkel, ETH Zürich, Switzerland

Short Abstract: With the advent of high-throughput sequencing techniques, high-coverage genome analysis projects have become of great interest in the field of cancer research. The large number of reads covering each genomic site allows the detection of low-frequency single nucleotide variants (SNVs) and thereby the identification of sub-clones in cancer tissues. However, the quantitative comparison of the identified SNVs between samples is still challenging, for instance due to coverage differences between the samples. The coverage affects the power of the applied test statistic, so coverage variation between samples makes a direct comparison of SNV counts problematic. To overcome this problem, we make the statistical power comparable across samples by position-wise sub-sampling of reads, depending on the individual coverage distribution of each sample. This is combined with deepSNV, an SNV caller that is based on a probabilistic model and tests whether the nucleotide distributions at a specified position are similar across samples. In doing so, we are able to compare SNV frequencies between different samples. In addition, the sampling procedure allows us to evaluate the reproducibility of SNV calling results.
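
Position-wise sub-sampling can be illustrated as follows. This is a sketch under stated assumptions, not the authors' procedure: the abstract does not say how the target coverage is chosen, so the example simply down-samples every sample to the minimum coverage observed at each position. The function names and the per-position nucleotide count tables are hypothetical.

```python
import random
from collections import Counter

def downsample_counts(counts, target, rng):
    """Draw `target` reads without replacement from a nucleotide count table."""
    pool = [base for base, n in counts.items() for _ in range(n)]
    if target >= len(pool):
        return Counter(pool)
    return Counter(rng.sample(pool, target))

def equalize_coverage(samples, seed=0):
    """Down-sample each position to the lowest coverage among all samples.

    samples maps a sample name to a list of per-position count tables,
    e.g. {"A": 100, "C": 5}; all samples must cover the same positions.
    """
    rng = random.Random(seed)
    n_positions = len(next(iter(samples.values())))
    out = {name: [] for name in samples}
    for pos in range(n_positions):
        # Assumed target: the minimum coverage at this position
        target = min(sum(samples[name][pos].values()) for name in samples)
        for name in samples:
            out[name].append(downsample_counts(samples[name][pos], target, rng))
    return out

# Hypothetical counts at a single position in two samples
equalized = equalize_coverage({
    "tumor_1": [{"A": 100, "C": 5}],
    "tumor_2": [{"A": 40, "C": 2}],
})
```

Repeating the procedure with different seeds gives replicate count tables, which is one way the reproducibility of the resulting calls could be checked.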
