VarI COSI Track

VarI Welcome from the committee
Date: Monday, July 24
Time: 10:00 AM - 10:10 AM
Room: Meeting Hall IA
  • Yana Bromberg
  • Emidio Capriotti
  • Hannah Carter

Presentation Overview:

Session Introduction

Common sequence variants affect molecular function more than rare variants?
Date: Monday, July 24
Time: 10:10 AM - 10:40 AM
Room: Meeting Hall IA
  • Burkhard Rost, TUM Munich / Columbia University, Germany
  • Yannick Mahlich, TUM Munich, Germany
  • Maximilian Hecht, Amazon, Germany
  • Tjaart Andries Petrus De Beer, Basel Biocenter, Switzerland
  • Yana Bromberg, Rutgers University, United States
  • Maria Schelling, TUM Munich, Germany

Presentation Overview:

Any two unrelated individuals differ by about 10,000 single amino acid variants (SAVs). Do these impact molecular function? Experiments cannot answer this question comprehensively, while state-of-the-art prediction methods can. We predicted the functional impacts of SAVs within human and for variants between human and other species. Several surprising results stood out. Firstly, four methods (CADD, PolyPhen-2, SIFT, and SNAP2) agreed within 10 percentage points on the percentage of rare SAVs predicted with effect. However, they differed substantially for the common SAVs: SNAP2 predicted, on average, more effect for common than for rare SAVs. Given the large ExAC data sets sampling 60,706 individuals, the differences were extremely significant (p-value<2.2e-16). We provided evidence that SNAP2 might be closer to reality for common SAVs than the other methods, due to its different focus in development. Secondly, we predicted significantly higher fractions of SAVs with effect between healthy individuals than between species; the difference increased for more distantly related species. The same trends were maintained for subsets of only housekeeping proteins and when moving from exomes of 1,000 to 60,000 individuals. SAVs frozen at speciation might maintain protein function, while many variants within a species might bring about crucial changes, for better or worse.

Computational predictors fail to identify amino acid substitution effects at rheostat positions.
Date: Monday, July 24
Time: 10:40 AM - 11:00 AM
Room: Meeting Hall IA
  • Maximilian Miller, Rutgers University, United States
  • Yana Bromberg, Rutgers University, United States
  • Liskin Swint-Kruse, The University of Kansas Medical Center, United States

Presentation Overview:

Many computational approaches exist for predicting the effects of amino acid substitutions. Here, we considered whether the protein sequence position class – rheostat or toggle – affects these predictions. The classes are defined as follows: experimentally evaluated effects of amino acid substitutions at toggle positions are binary, while rheostat positions show progressive changes. For substitutions in the LacI protein, all evaluated methods failed two key expectations: toggle neutrals were incorrectly predicted as more non-neutral than rheostat non-neutrals, while toggle and rheostat neutrals were incorrectly predicted to be different (https://www.nature.com/articles/srep41329/figures/4). However, toggle non-neutrals were distinct from rheostat neutrals. Since many toggle positions are conserved, and most rheostats are not, predictors appear to annotate position conservation better than mutational effect. This finding can explain the well-known observation that predictors assign disproportionate weight to conservation, as well as the field’s inability to improve predictor performance. Thus, building reliable predictors requires distinguishing between rheostat and toggle positions.

When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants
Date: Monday, July 24
Time: 11:00 AM - 11:20 AM
Room: Meeting Hall IA
  • Kymberleigh Pagel, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Vikas Pejaver, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Guan Ning Lin, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Hyunjun Nam, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Matthew Mort, Institute of Medical Genetics, Cardiff University, United Kingdom
  • David N Cooper, Institute of Medical Genetics, Cardiff University, United Kingdom
  • Jonathan Sebat, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Lilia M Iakoucheva, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Sean D Mooney, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States
  • Predrag Radivojac, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States

Presentation Overview:

Motivation: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. Furthermore, they do not provide information to the user regarding specific molecular alterations potentially causative of disease.
Results: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar, and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1,142 de novo variants from neurodevelopmental disorders and found enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants.
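For readers unfamiliar with the metric quoted above, the area under the ROC curve can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen tolerated one. The following is a minimal sketch of that pairwise definition; the function name `roc_auc` and the toy scores are illustrative assumptions and are not taken from MutPred-LOF.

```python
def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC: the fraction of positive/negative pairs
    in which the positive example has the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predictor scores: 1 = pathogenic, 0 = tolerated.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 8 of the 9 pairs are ordered correctly
```

The quadratic loop is chosen for readability; a rank-based computation gives the same value for large variant sets.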

Assessing regulatory variant effect scores by massively parallel reporter assays
Date: Monday, July 24
Time: 11:20 AM - 11:40 AM
Room: Meeting Hall IA
  • Martin Kircher, Berlin Institute of Health, Germany
  • Fumitaka Inoue, University of California San Francisco, United States
  • Chenling Xiong, University of California San Francisco, United States
  • Beth Martin, University of Washington, United States
  • Nadav Ahituv, University of California San Francisco, United States
  • Jay Shendure, University of Washington, United States

Presentation Overview:

The use of sequencing approaches for the identification of disease causal mutations is rapidly gaining traction, but the interpretation of the identified variants remains a major challenge. When scaling from exome to genome sequencing, the vast majority of variants fall in non-coding regions. However, we currently have a very limited toolset for their interpretation and almost no validation data for predictive approaches or training data that can be applied for machine learning strategies.
We assessed the performance of currently available computational predictors of non-coding sequence effects (e.g. CADD, DeepSEA, Eigen, FATHMM-MKL, FunSeq2, GAWAVA, ReMM) on several massively parallel reporter assay (MPRA) datasets. To generate these MPRAs, we used saturation mutagenesis on regulatory sequences where mutations are known to cause human disease. Specifically, we derived variant-specific activity maps for several clinically relevant promoter (F9, HBB, TERT and others) and enhancer sequences (SORT1, RET and others) – creating an unprecedented database for the interpretation of potentially disease-causing regulatory mutations.
We observe that score performance largely depends on the specific regulatory region under consideration. In some cases (e.g. LDLR promoter), predictors integrating evolutionary conservation with biochemical read-outs perform well while the same predictors perform poorly for other regulatory regions (e.g. TERT promoter). Further, we observe that motif predictions are frequently incomplete and ChIP-seq peaks tend to be too broad to guide variant prioritization. We believe that our experiments provide a rich dataset for benchmarking predictive models of variant effects and have the potential to provide critical insight for the development of improved computational tools.

PhD-SNPg: A webserver and lightweight tool for scoring single nucleotide variants.
Date: Monday, July 24
Time: 11:40 AM - 12:00 PM
Room: Meeting Hall IA
  • Emidio Capriotti, University of Bologna, Italy
  • Piero Fariselli, University of Padova, Italy

Presentation Overview:

One of the major challenges in human genetics is to identify functional effects of coding and non-coding single nucleotide variants (SNVs). In the past, several methods have been developed to identify disease-related single amino acid changes, but only a few tools are capable of scoring the impact of non-coding variants. Among the most popular algorithms, CADD and FATHMM predict the effect of SNVs in non-coding regions by combining sequence conservation with several functional features derived from the ENCODE project data. Thus, running CADD or FATHMM locally requires downloading a large set of pre-calculated information. To facilitate the process of variant annotation, we developed PhD-SNPg, a new easy-to-install and lightweight machine learning method that depends only on sequence-based features. Despite this, PhD-SNPg performs similarly to or better than more complex methods. This makes PhD-SNPg ideal for quick SNV interpretation and as a benchmark for tool development.
Availability: PhD-SNPg is accessible at http://snps.biofold.org/phd-snpg.

Phenotype-driven discovery of digenic variants in personal genome sequences.
Date: Monday, July 24
Time: 12:00 PM - 12:20 PM
Room: Meeting Hall IA
  • Imane Boudellioua, King Abdullah University of Science and Technology, Saudi Arabia
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Paul Schofield, University of Cambridge, United Kingdom
  • Georgios Gkoutos, University of Birmingham, United Kingdom
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Identification of variants associated with inherited diseases is a major challenge, in particular in the analysis of clinical sequence data from individual patients. An increasing number of Mendelian diseases have been identified in which two or more variants in multiple genes are required to cause the disease, or significantly modify its severity or phenotype. It is difficult to discover such interactions using existing approaches. Information that links patient phenotypes to databases of gene–phenotype associations observed in clinical and basic research can provide useful information and improve variant prioritization for Mendelian diseases. PhenomeNET is a computational framework that utilizes pan-phenomic data from human and non-human model organisms to prioritize candidate genes in genetically based diseases, and we have recently combined PhenomeNET with genome-wide pathogenicity prediction methods into the PhenomeNET Variant Predictor (PVP), which can be used to prioritize variants in inherited diseases. Here, we illustrate extensions to PVP that can be used to identify variants in oligogenic diseases and their interactions. We inserted multiple variants known to be associated with digenic disease into synthetic genomes and found that PVP can identify sets of causative variants in a hypothesis-neutral manner. Our results show that PVP can efficiently detect oligogenic interactions using a phenotype-driven approach and identify etiologically important variants in whole genomes.

The importance of using a most comprehensive Knowledgebase for the identification of pathogenic variants in cancer and inherited diseases
Date: Monday, July 24
Time: 12:20 PM - 12:35 PM
Room: Meeting Hall IA
  • Anika Joecker, Qiagen, Germany

Presentation Overview:

Next generation sequencing technology has enabled identifying causal genetic variants underlying rare and inherited diseases in a short time. However, identifying critical variants for diagnosis and treatment can be difficult and time consuming.
Having a comprehensive manually curated collection of disease relevant variants including their impact is essential to overcome those challenges.
In this presentation, we will present QIAGEN's Knowledgebase, a database of hundreds of thousands of manually curated pathogenic variants in the areas of oncology and inherited disease, as well as HGMD, the gold standard in inherited diseases. Based on real-world examples using our interpretation tools QIAGEN Clinical Insight Interpret and Ingenuity Variant Analysis, we will show why it is important to make use of a comprehensive database to achieve better diagnosis and treatment of the patient.

Integration of molecular phenotypes into genome-wide association studies.
Date: Monday, July 24
Time: 2:00 PM - 2:30 PM
Room: Meeting Hall IA
  • Sven Bergmann, Université de Lausanne, Switzerland

Presentation Overview:

Genome-wide association studies (GWAS) screen for links between genotypes and phenotypic traits. The large number of tested genetic markers poses a major challenge in terms of multiple hypothesis testing. One strategy to prioritize the interpretation of GWAS results for complex traits and diseases is to measure not only organismal but also molecular phenotypes in large collections of samples, and to search for association between these measurements and genotypes. Yet, the appropriate methodologies for integrating such data are still poorly developed. Here, we present several examples of new integrative approaches for large-scale molecular data, such as gene expression and metabolomics data. Specifically, we highlight two examples where SNPs associated with diseases could also be linked to metabolites, providing insight into potential mechanisms of action. Furthermore, we show how using gene expression data from the FANTOM5 project enabled us to construct 394 cell type and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity between transcription factors, enhancers, promoters and genes. We found that genetic variants associated with human diseases disrupt components of these networks in disease-relevant tissues, giving new insights into disease mechanisms. Finally, we demonstrate how mapping GWAS results into gene-scores can be used for annotating gene network communities, an approach that we applied in a recent DREAM challenge for disease module identification.

Increasing the power of meta-analysis of genome-wide association studies to detect heterogeneous effects.
Date: Monday, July 24
Time: 2:30 PM - 2:50 PM
Room: Meeting Hall IA
  • Cue Hyunkyu Lee, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, South Korea
  • Eleazar Eskin, Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, United States
  • Buhm Han, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, South Korea

Presentation Overview:

Meta-analysis is essential to combine the results of genome-wide association studies (GWASs). Recent large-scale meta-analyses have combined studies of different ethnicities, environments, and even studies of different related phenotypes. These differences between studies can manifest as effect size heterogeneity. We previously developed a modified random effects model (RE2) that can achieve higher power to detect heterogeneous effects than the commonly used fixed effects model (FE). However, RE2 cannot perform meta-analysis of correlated statistics, which are found in recent research designs, and the identified variants often overlap with those found by FE. Here, we propose RE2C, which increases the power of RE2 in two ways. First, we generalized the likelihood model to account for correlations of statistics to achieve optimal power, using an optimization technique based on spectral decomposition for efficient parameter estimation. Second, we modified the statistic to focus on the heterogeneous effects that FE cannot detect, thereby increasing the power to identify new associations. We developed an efficient and accurate p-value approximation procedure using analytical decomposition of the statistic. In simulations, RE2C achieved a 71% increase in power compared with 21% for the decoupling approach when the statistics were correlated. Even when the statistics are uncorrelated, RE2C achieves a modest increase in power. Applications to real genetic data supported the utility of RE2C. RE2C is highly efficient and can meta-analyze one hundred GWASs in one day.
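As background for the comparison above, the baseline fixed effects model (FE) combines per-study effect estimates by inverse-variance weighting, and Cochran's Q is a standard summary of effect size heterogeneity. The sketch below shows only these two textbook quantities; it does not implement RE2 or RE2C, and the function name and input values are illustrative assumptions.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance fixed effects meta-analysis.

    betas: per-study effect size estimates
    ses:   per-study standard errors
    Returns (pooled beta, pooled SE, z-score, Cochran's Q).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]
    total_w = sum(weights)
    beta_fe = sum(w * b for w, b in zip(weights, betas)) / total_w
    se_fe = math.sqrt(1.0 / total_w)
    z = beta_fe / se_fe
    # Cochran's Q: weighted squared deviations of study effects from the pooled effect.
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    return beta_fe, se_fe, z, q


# Two hypothetical studies with equal precision but different effect sizes.
beta_fe, se_fe, z, q = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
```

Under the null hypothesis of no heterogeneity, Q is approximately chi-square distributed with (number of studies - 1) degrees of freedom, which is why a large Q relative to that reference signals the kind of heterogeneity the abstract targets.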

FUMA: Functional mapping and annotation of genetic associations.
Date: Monday, July 24
Time: 2:50 PM - 3:10 PM
Room: Meeting Hall IA
  • Kyoko Watanabe, VU University Amsterdam, Netherlands
  • Erdogan Taskesen, VU University (VU), Netherlands
  • Arjen van Bochoven, Vrije Universiteit Amsterdam, Netherlands
  • Danielle Posthuma, VU University Amsterdam (VU), Netherlands

Presentation Overview:

A main challenge in genome-wide association studies (GWAS) is to prioritize genetic variants and identify potential causal mechanisms of human diseases. Although multiple bioinformatics resources are available for functional annotation and prioritization, a standard, integrative approach is lacking. We, therefore, developed FUMA: a web-based platform to facilitate functional annotation of GWAS results, prioritization of genes and interactive visualization of annotated results by incorporating information from multiple state-of-the-art biological databases.

PopCluster: A new algorithm to identify genetic variants with effects that change with ethnicity
Date: Monday, July 24
Time: 3:10 PM - 3:30 PM
Room: Meeting Hall IA
  • Anastasia Gurinovich, Boston University, United States
  • John Farrell, Boston University, United States
  • Harold Bae, Oregon State University, United States
  • Annibale Puca, University of Salerno, Italy
  • Gil Atzmon, Albert Einstein College of Medicine, United States
  • Nir Barzilai, Albert Einstein College of Medicine, United States
  • Thomas Perls, Boston University, United States
  • Paola Sebastiani, Boston University, United States

Presentation Overview:

Over the last decade, increasingly diverse populations have been included in genome-wide association studies (GWAS). Thus, it is important to adapt the existing techniques to account for the heterogeneity of genetic effects. In this paper, we propose PopCluster – a novel algorithm to automatically identify clusters of subjects who have varying genetic effects in different ethnicities. PopCluster combines logistic and linear regression models, principal component analysis, hierarchical clustering, and a novel recursive bottom-up tree parsing procedure. It addresses two important issues that are not accounted for in standard GWAS with heterogeneous groups of study subjects. First, if a genetic variant has a varying effect on a phenotype in different populations, GWAS applied to a dataset as a whole would not pinpoint the differences. Second, if a genetic variant is strongly associated with ethnicity, the association model becomes unstable. Evaluation of PopCluster demonstrates that the algorithm has a stable, low false positive rate (~2%), and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. In addition, we used PopCluster to test the association between APOE alleles and extreme longevity, which demonstrated that the effect of the alleles changes with ethnicity.

Representing genetic determinants in bacterial GWAS with compacted De Bruijn graphs.
Date: Monday, July 24
Time: 3:30 PM - 3:50 PM
Room: Meeting Hall IA
  • Magali Jaillard, bioMerieux, France
  • Maud Tournoud, bioMerieux, France
  • Leandro Lima, LBBE, Université de Lyon, France
  • Vincent Lacroix, LBBE, Université de Lyon, France
  • Jean-Baptiste Veyrieras, bioMerieux, France
  • Laurent Jacob, LBBE, Université de Lyon, France

Presentation Overview:

Antimicrobial resistance has become a major worldwide public health concern, calling for a better characterization of existing and novel resistance mechanisms. GWAS methods applied to bacterial genomes have shown encouraging results for new genetic marker discovery. Most existing approaches either look at SNPs obtained by sequence alignment or consider sets of k-mers whose presence in the genome is associated with the phenotype of interest. While the former approach can only be performed when genomes are similar enough for an alignment to make sense, the latter can lead to redundant descriptions and to results that are hard to interpret. We propose an alignment-free GWAS method that detects haplotypes of variable length associated with the resistance phenotype, using compacted De Bruijn graphs. Our representation is flexible enough to deal with very plastic bacterial genomes subject to gene transfers while drastically reducing the number of features to explore compared to k-mers, without loss of information. It accommodates polymorphisms in core genes, accessory genes and non-coding regions. Using our representation in a GWAS leads to the selection of a small number of entities that are easier to visualize and interpret than fixed-length k-mers. We illustrate the benefit of our approach by describing known as well as potential novel determinants of antimicrobial resistance in Pseudomonas aeruginosa, a pathogenic bacterium with a highly plastic genome.
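To illustrate why compaction reduces the number of features, the sketch below builds a node-centric De Bruijn graph from the k-mers of a few sequences and merges every non-branching path into a single unitig. It is a simplified teaching example under stated assumptions: reverse complements are ignored, the function names are hypothetical, and the toy sequences are invented. It is not the authors' implementation.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compacted_dbg_unitigs(sequences, k):
    """Return the unitigs (maximal non-branching paths) of the De Bruijn graph
    whose nodes are the distinct k-mers of `sequences` (forward strand only)."""
    nodes = set()
    for s in sequences:
        nodes |= kmers(s, k)

    def succ(n):
        return [n[1:] + c for c in "ACGT" if n[1:] + c in nodes]

    def pred(n):
        return [c + n[:-1] for c in "ACGT" if c + n[:-1] in nodes]

    unitigs, seen = [], set()
    for n in sorted(nodes):
        if n in seen:
            continue
        # Walk backwards to the first k-mer of the non-branching path containing n.
        start = n
        while True:
            p = pred(start)
            if len(p) != 1 or len(succ(p[0])) != 1 or p[0] == n:
                break
            start = p[0]
        # Walk forwards, appending one base per merged k-mer.
        path, cur = start, start
        seen.add(start)
        while True:
            s = succ(cur)
            if len(s) != 1 or len(pred(s[0])) != 1 or s[0] in seen:
                break
            cur = s[0]
            seen.add(cur)
            path += cur[-1]
        unitigs.append(path)
    return unitigs


# Two invented "genomes" that differ by a single base in the middle.
toy_unitigs = compacted_dbg_unitigs(["ATGGCGT", "ATGACGT"], k=3)
```

In this toy case the eight distinct 3-mers collapse into four unitigs: the shared flanks `ATG` and `CGT`, plus one unitig per allele (`TGGCG` and `TGACG`). The polymorphism is thus described by two variable-length features rather than by six overlapping k-mers.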

One test to rule them all: Clinical grade Whole Genome Sequencing as first-line genetic test.
Date: Monday, July 24
Time: 3:50 PM - 4:05 PM
Room: Meeting Hall IA
  • Alexander Kaplun, Variantyx, United States

Presentation Overview:

With the cost of next-generation sequencing continuing to decrease, the inflection point at which Whole Exome Sequencing successfully replaces the traditional diagnostic trajectory as the standard of care has already been reached. Whole Genome Sequencing (WGS), however, is still considered by many to be an unnecessary waste of resources that is not suitable for clinical implementation. We report here a clinically validated WGS pipeline that is economically feasible and delivers the benefits that are helping to define WGS as the new diagnostic standard.

Many benefits are straightforward, inherent in the sequencing technology itself. PCR-free DNA preparation methods eliminate amplification-related issues producing better coverage of amino acid coding regions. At the same time, comprehensive coverage of intronic and intergenic regions ensures potentially relevant transcription factor binding site, enhancer and other regulatory variants are identified. WGS provides unique opportunities for detection of structural variants, and we have developed a clinically validated WGS pipeline for highly specific and sensitive detection of structural variants. Using a combination of breakpoint analysis, read depth analysis and de novo assembly of tandem nucleotide repeats and tri-nucleotide tandem repeats, the pipeline identifies structural variants down to single base pair resolution. False positives are minimized using calculations for loss of heterozygosity combined with bi-modal heterozygous variant allele frequencies. Identified variants are annotated with phenotype information derived from HGMD Professional and population allele frequencies derived from DGV facilitating clinical interpretation. Single base pair resolution enables easy visual inspection of potentially causal variants using the IGV genome browser. Patient cases demonstrating clinical utility of the pipeline will be presented.

Network-based integration of multi-omics data for prioritizing cancer genes.
Date: Monday, July 24
Time: 4:30 PM - 5:00 PM
Room: Meeting Hall IA
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Presentation Overview:

Cancer cells are altered in multiple ways, including genomic, epigenomic, transcriptomic, and proteomic changes. Protein interaction networks can help decode the functional relationship between aberration events and changes in the expression of genes and proteins. We present a graph diffusion-based method for prioritizing cancer genes by integrating diverse molecular data types on a directed functional interaction network. Genes are prioritized for individual tumor samples separately and integrated using a robust rank aggregation technique. Using TCGA data, we demonstrate that the method can aid in explaining the heterogeneity of aberration events by their functional convergence to common differentially expressed genes and proteins.
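As a rough illustration of graph diffusion on a directed network, the sketch below runs a generic random walk with restart from a set of seed genes (for example, genes carrying aberrations) and returns a steady-state score per node. This is a common diffusion scheme chosen for illustration, not the method presented in the talk; the function name, the restart probability, and the toy network are assumptions.

```python
def random_walk_with_restart(edges, seeds, restart=0.3, n_iter=200):
    """Diffuse seed weights over a directed graph.

    edges:   iterable of (source, target) pairs
    seeds:   dict mapping node -> non-negative starting weight
    restart: probability of jumping back to the seed distribution at each step
    Returns a dict mapping node -> diffusion score (scores sum to 1).
    """
    edges = list(edges)
    nodes = sorted({n for e in edges for n in e} | set(seeds))
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    total = float(sum(seeds.values()))
    if total <= 0:
        raise ValueError("seeds must carry positive total weight")
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}

    p = dict(p0)
    for _ in range(n_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for src in nodes:
            mass = (1.0 - restart) * p[src]
            targets = out[src]
            if targets:
                share = mass / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Node without outgoing edges: return its mass to the seeds.
                for n in nodes:
                    nxt[n] += mass * p0[n]
        p = nxt
    return p


# Invented toy network: two mutated genes converge on one transcription factor.
toy_edges = [("mut1", "tf"), ("mut2", "tf"), ("tf", "gene"), ("mut2", "other")]
toy_scores = random_walk_with_restart(toy_edges, {"mut1": 1.0, "mut2": 1.0})
```

In the toy network, `tf` receives flow from both seed genes and therefore scores higher than `other`, which mirrors the idea of functional convergence of distinct aberration events on common downstream genes.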

Inferring clonal composition from multiple tumor biopsies.
Date: Monday, July 24
Time: 5:00 PM - 5:20 PM
Room: Meeting Hall IA
  • Matteo Manica, ETH-IMSB//IBM Research, Switzerland
  • Roland Mathis, IBM Research Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM, Zurich Research Laboratory, Switzerland
  • Pavel Sumazin, Baylor College of Medicine, United States
  • Peter Wild, University Hospital of Zurich (USZ), Switzerland

Presentation Overview:

Motivation. Knowledge about the clonal evolution of each tumor can inform driver-alteration discovery by pointing out initiating genetic events as well as events that contribute to the selective advantage of proliferative, and potentially drug-resistant tumor subclones. A necessary building block to the reconstruction of clonal evolution from tumor profiles is the estimation of the cellular composition of each tumor subclone (cellularity), and these, in turn, are based on estimates of the relative abundance (frequency) of subclone-specific genetic alterations in tumor biopsies. Estimating the frequency of genetic alterations is complicated by the high genomic instability that characterizes many tumor types.
Results. Analysis of our mutation-centric model for genomic instability suggests that copy number variations (CNVs) that are commonly observed in tumor profiles can dramatically alter mutation-frequency estimates and, consequently, the reconstruction of tumor phylogenies. We argue that detailed accounting for CNVs based on profiles of multiple biopsies for each tumor is required to accurately estimate mutation frequencies. To help resolve this problem, we propose an optimization algorithm, Chimæra (clonality inference from mutations across biopsies), that accounts for the effects of CNVs in multiple same-tumor biopsies to estimate both mutation frequencies and copy numbers of mutated alleles. We show that mutation-frequency estimates by Chimæra are consistently more accurate in unstable genomes compared to existing methods. When studying profiles of multiple biopsies of a high-risk prostate tumor, we show that Chimæra inferences allow for reconstructing its clonal evolution.
Data availability. Sequencing data is deposited in ENA project PRJEB1919.

Evaluating Variant Calling Tools for Non-Matched Next Generation Sequencing Data.
Date: Monday, July 24
Time: 5:20 PM - 5:40 PM
Room: Meeting Hall IA
  • Sarah Sandmann, Institute of Medical Informatics, Germany
  • Aniek de Graaf, Laboratory Hematology, Netherlands
  • Mohsen Karimi, Center for Hematology and Regenerative Medicine, Sweden
  • Bert van der Reijden, Laboratory Hematology, Netherlands
  • Eva Hellström-Lindberg, Center for Hematology and Regenerative Medicine, Sweden
  • Joop Jansen, Laboratory Hematology, Netherlands
  • Martin Dugas, Institute of Medical Informatics, Germany

Presentation Overview:

Next-generation sequencing (NGS) has revolutionized the application of personalized medicine. Already, sequencing results influence diagnosis, prognosis and even therapy. However, for the application of NGS in clinical routine, it is essential to work with valid results. To perform variant calling, there are numerous tools that usually feature different algorithms, filtering strategies and recommendations, and thus also produce different output.
We performed variant calling with respect to single nucleotide variants and short indels with allelic frequencies as low as 1%, considering eight open-source tools: GATK HaplotypeCaller, Platypus, VarScan, LoFreq, FreeBayes, SNVer, SAMtools and VarDict. We analyzed two sets of sequencing data of patients with myelodysplastic syndrome (MDS). The first set covers 54 Illumina HiSeq samples, the second set covers 111 Illumina NextSeq samples. Validation of all calls was achieved by re-sequencing on the same platform, on a different platform and expert based review. In addition, we analyzed two sets of simulated data with varying coverages and error profiles, covering 50 samples each. In all cases we evaluated an identical target region consisting of 19 genes (42,322 bp) known to be recurrently mutated in MDS.
Our evaluation shows that variant calling -- even of single nucleotide variants and short indels -- remains challenging. Validated mutations were missed by every tool. High sensitivity usually went along with low precision. Reproducible results could not be obtained in multithreading mode. The influence of simulated varying coverages and background noise on variant calling was generally low. Considering both real and simulated data sets, VarDict performed best.
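The two metrics contrasted above can be computed by comparing a tool's call set against the validated variants. The sketch below is a minimal example under the assumption that each variant is represented as a (chromosome, position, reference allele, alternate allele) tuple; the function name and the variants listed are invented for illustration and are not data from the study.

```python
def sensitivity_precision(called, validated):
    """Compare a call set with a validated truth set.

    Sensitivity = TP / (TP + FN): fraction of validated variants that were called.
    Precision   = TP / (TP + FP): fraction of calls that are validated.
    """
    called, validated = set(called), set(validated)
    tp = len(called & validated)
    fp = len(called - validated)
    fn = len(validated - called)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


# Invented example: three validated variants, four calls from a hypothetical tool.
truth = [("chr1", 100, "A", "G"), ("chr2", 250, "C", "T"), ("chr4", 900, "G", "GA")]
calls = [("chr1", 100, "A", "G"), ("chr2", 250, "C", "T"),
         ("chr3", 77, "T", "C"), ("chr5", 12, "AT", "A")]
toy_sensitivity, toy_precision = sensitivity_precision(calls, truth)
```

Here two of the three validated variants are recovered (sensitivity 2/3), and two of the four calls are validated (precision 1/2). Exact tuple matching assumes that indels are normalized to the same representation in both sets.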

Genomes as documents of evolutionary history: a probabilistic macrosynteny model for the reconstruction of ancestral genomes.
Date: Monday, July 24
Time: 5:40 PM - 6:00 PM
Room: Meeting Hall IA
  • Yoichiro Nakatani, Trinity College Dublin, University of Dublin, Ireland
  • Aoife McLysaght, Trinity College Dublin, University of Dublin, Ireland

Presentation Overview:

Motivation:
It has been argued that whole-genome duplication (WGD) exerted a profound influence on the course of evolution. To fully understand the impact of WGD, several formal algorithms have been developed for reconstructing pre-WGD gene order in yeasts and plants. However, to the best of our knowledge, those algorithms have never been successfully applied to WGD events in teleosts and vertebrates, where they are impeded by extensive gene shuffling and gene losses.
Results:
Here we present a probabilistic model of macrosynteny (i.e., conserved linkage or chromosome-scale distribution of orthologs), develop a variational Bayes algorithm for inferring the structure of pre-WGD genomes, and study estimation accuracy by simulation. Then, by applying the method to the teleost WGD, we demonstrate the effectiveness of the algorithm in a situation where gene-order reconstruction algorithms perform relatively poorly due to a high rate of rearrangement and extensive gene losses. Our high-resolution reconstruction reveals previously overlooked small-scale rearrangements, necessitating a revision of previous views on genome structure evolution in teleosts and vertebrates.
Conclusions:
We have reconstructed the structure of a pre-WGD genome by employing a variational Bayes approach that was originally developed for inferring topics from millions of text documents. Interestingly, comparison of the macrosynteny and topic model algorithms suggests that macrosynteny can be regarded as documents on ancestral genome structure. From this perspective, the present study would seem to provide a textbook example of the prevalent metaphor that genomes are documents of evolutionary history.

Closing remarks from the committee.
Date: Monday, July 24
Time: 6:00 PM - 6:10 PM
Room: Meeting Hall IA

Presentation Overview:

Session end