Category G - 'Genetic Variation Analysis'
Short Abstract: The recent Ebola virus outbreak demonstrates how lethal the virus is. In this study we have used sequence analysis and structural modelling to investigate genetic variation within the Ebolavirus family with the aim of identifying the molecular determinants of Ebolavirus pathogenicity. The Ebolavirus family consists of five species, four of which are pathogenic to humans (Ebola (formerly Zaire), Sudan, Bundibugyo and Tai Forest). The fifth species, Reston virus, is the only Ebolavirus that is not pathogenic in humans. We analysed 196 Ebolavirus genomes to identify specificity determining positions (SDPs) in all nine Ebolavirus proteins that distinguish the four human pathogenic Ebolaviruses from Reston viruses. Structural analysis and results from computational tools to investigate genetic variation revealed novel functional insights, in particular for Ebolavirus proteins VP40 and VP24. The VP40 SDP P85T appears to interfere with VP40 octamer formation. The VP40 SDP Q245P affects the structure and hydrophobic core of the protein and consequently protein function. Three VP24 SDPs (T131S, M136L, Q139R) are likely to reduce VP24 binding to human karyopherin alpha5 (KPNA5), thereby preventing inhibition of interferon signaling. As only a few SDPs distinguish Reston virus VP24 from VP24 of other Ebolaviruses, it is possible that human pathogenic Reston viruses may emerge. This is of concern because Reston viruses circulate in domestic pigs and can infect humans.
Short Abstract: Short insertions and deletions (indels) are the second most common type of variation in the human genome. Despite tremendous advances in high-throughput sequencing technologies and computational methods for variant calling from DNA sequence data, accurate detection of indels remains a challenge. Some of the reasons for this difficulty include over-representation of short indels in regions of low sequence complexity, variability in indel error rates across different platforms as well as the lack of good error models for indels.
We have developed an EM algorithm for the detection and genotyping of short indels using aligned sequence reads from multiple individuals. Our probabilistic method models sequence context-specific error rates to estimate the posterior probability of a variant and genotypes. Modeling such error rates is particularly important for indel detection in homopolymer regions.
Using extensive simulations, we assessed the power of our EM algorithm to detect indels as a function of read depth, population allele frequency and indel error rates. Our method was significantly more accurate than the recently proposed population-based method SOAP-popIndel. We subsequently performed a comprehensive comparison of our method against a number of leading variant calling methods, including GATK HaplotypeCaller, FreeBayes and Platypus, using exome data from the 1000 Genomes Project. Our algorithm is shown to have high sensitivity and a low false positive rate compared to the other methods. We further demonstrate that our population-based approach enables the discovery of indels that would be impossible to call using individual data.
Short Abstract: Insertions and deletions (INDELs) are alterations in the DNA sequence. INDELs located within a coding region can modify splice sites, alter the amino acid sequence or cause a frameshift. The 1000 Genomes Project can be used as a source of data to search for INDELs in different populations. We applied an innovative method using human transcriptome data to search for small deletions (up to 100 nucleotides) within coding exons. We present the analysis of 66 genomes from the 1000 Genomes Project (1000G): three from each of 11 populations available in phase 3 of the project. A total of 750,327 small deletions were identified when we used only ESTs to map against these genomes, but only 14,123 were present in more than 33 genomes (50%). We detected deletions previously identified only by the 1000G (151), only annotated in dbSNP (102), and identified by both (13). In addition, 7,344 may cause a frameshift. For example, we have found a novel small deletion of 11 nucleotides in the NFKBIZ gene, which truncates the encoded protein, shortening it by 136 amino acids relative to the wild-type protein. This hypothetical translated protein loses ankyrin repeats within the C-terminus. This alteration prevents the inhibition of p65 transactivation. In conclusion, we present preliminary data in which we used transcriptome data to identify deletions previously described in dbSNP and the 1000G, and to detect novel unannotated deletions in human genes that affect protein domains. Financial support: FIOCRUZ, CAPES, INCA/MS, Fundação do Câncer, FAPERJ and CNPq.
Short Abstract: The presence of SNPs in ligand-binding sites often has important functional consequences, leading to pathogenicity and variation in drug response. Understanding how SNPs may alter the efficacy and metabolism of certain drugs is crucial for successful implementation of the precision medicine model.
We review 136 unique protein-drug complexes and analyze the non-synonymous SNPs present in the drug-binding sites and the proximal residues. About 90% of these proteins have SNPs associated with less than 45% of their binding residues. In total, 2664 unique SNPs (2563 missense and 101 stop-gain mutations) are mapped. Frequency or clinical significance data are available for only 25.49% of these SNPs. Most show very low minor allele frequency in the populations and are associated with pathogenicity or drug response. Only two of the SNPs are present in the GWAS catalogue. For the rest of the SNPs, online tools are used to predict the functional effects and conservation. We also analyze the SNP-containing amino acids and the mutations that show significant differences between the binding residues and the rest of the protein sequences. Moreover, the protein-drug complexes with significant differences in the presence of SNPs in binding sites are investigated separately.
This study is an effort towards understanding the possible effects of SNPs on drug response. We have comprehensively analyzed the association of SNPs with drug-binding sites and also highlighted the gaps in current knowledge.
Short Abstract: Variant Call Format (VCF) is widely used to store data about genetic variations. Variant calling workflows discover variants between billions of short sequences produced by sequencing machines and report them in VCF format. To evaluate the accuracy of variant callers, it is critical to correctly compare their output against a VCF file containing a true set of variants. However, finding concordance between VCF files is a complicated task, as the same variant can be represented in several different ways and is therefore not necessarily reported in a unique way by different software. In this paper, we introduce a VCF normalization method that results in more accurate comparison. In our proposed normalization procedure, we apply all variations in a VCF file to the reference genome to create a mutated sequence, and then recall variants by aligning this mutated sequence back to the reference sequence. The normalized VCF is not necessarily closer to the truth but is suitable for comparison purposes. We define a better normalization algorithm as one resulting in less disagreement in the output of different VCF comparison algorithms. Our results show over 10.6 times less disagreement when comparing VCF files normalized by our method, relative to unnormalized files. Our method relies mostly on available, validated software.
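The apply-then-recall idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, positions are 0-based rather than VCF's 1-based convention, and a simple suffix/prefix trimming step stands in for the real alignment of the mutated sequence against the reference, so it only handles a haplotype that differs from the reference by a single event.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a reference string."""
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported in this sketch")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference at %d" % pos)
        out.append(reference[cursor:pos])  # unchanged sequence before the variant
        out.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    out.append(reference[cursor:])
    return "".join(out)


def recall_single_variant(reference, mutated):
    """Re-derive one canonical (pos, ref, alt) from a mutated sequence.

    Assumption: the two sequences differ by a single event. Trimming the
    shared suffix first yields the left-most placement of an indel.
    """
    if reference == mutated:
        return None
    ref, alt = reference, mutated
    # Trim the shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    pos = 0
    # Trim the shared prefix, again keeping one base as the anchor.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (pos, ref, alt)


reference = "GCACACAT"
# Two different representations of the same deletion of one "CA" unit.
hap_a = apply_variants(reference, [(0, "GCA", "G")])
hap_b = apply_variants(reference, [(2, "ACA", "A")])
normalized_a = recall_single_variant(reference, hap_a)
normalized_b = recall_single_variant(reference, hap_b)
```

Both representations produce the same mutated sequence and therefore the same recalled variant, which is what makes the normalized files directly comparable.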
Short Abstract: Background: Knowledge of multiple synteny between plants and/or within a plant is key to understanding genome evolution, and visualizing multiple synteny helps in interpreting that evolution. Several web applications have been developed to determine and visualize patterns of multiple homologous regions at once. However, these applications are not fully convenient for biologists: some do not determine synteny themselves and only plot synteny data uploaded by the user, while others determine synteny solely from BLAST similarity information using algorithms not designed for synteny determination. Here, we introduce a web application that determines and visualizes multiple synteny from two types of files, a GFF file and a protein sequence file.
Results: We developed a web application, MultiSyn (Multiple Synteny determination and visualization), to determine synteny and draw sophisticated multiple synteny plots from sequence information supplied by the user and/or publicly released genome sequences. MultiSyn determines synteny blocks using MCScanX, a program designed specifically for synteny determination, and visualizes the multiple synteny around a pivot selected by the user.
Conclusions: MultiSyn shows the multiple synteny in a plot and helps biologists to understand the pattern of evolution of synteny regions. The software and example data sets are freely available at http://18.104.22.168:62001/.
Short Abstract: Protein kinases regulate many cellular pathways, including those that govern growth, differentiation and proliferation, phosphorylating approximately a third of all proteins. Due to the profound effects that they have upon a cell, their catalytic activity is stringently regulated. Loss of this control, often through the introduction of dominant activating mutations within these proto-oncogenes, underlies many diseases, including the development of many cancers. The ability to identify these activating mutations would be an invaluable tool for understanding the role of mutations in disease and the development of treatment strategies. We propose to use a structural approach to predict and differentiate missense activating mutations from neutral polymorphisms and inactivating mutations. Here we introduce KAMP (Kinase Activating Mutations Predictor), a new machine learning method for predicting activating mutations in kinases based on a set of structural features. These include graph-based signatures, residue environment properties, non-covalent interactions made by the wild-type residue and stability change predictions upon mutation. The best combination of features was identified via feature selection, different supervised learning algorithms were evaluated, and the best predictive model was selected. We collected 214 missense mutations (154 activating) with experimental evidence from 24 different proteins from the Ensembl and KinDriver databases, which we used to train, test and validate our predictive model. KAMP achieved a precision of 91% and an AUC of 0.93 when predicting activating mutations in cross-validation, and 90% and 0.94, respectively, in blind tests. KAMP is freely available through a user-friendly web server at http://biosign.cpqrr.fiocruz.br/kamp.
Short Abstract: Herpes simplex virus type 1 (HSV-1) causes recurrent mucocutaneous ulcers, and is the primary source of infectious blindness and sporadic encephalitis in the United States. Research using animal models has found that the genetic makeup of viral strains is one of the main factors for the severity of HSV-1 ocular disease. Conventional studies on the genetics of viral virulence have depended on characterizing a naturally occurring strain, and genetically engineering mutations into viruses. In this study, we exploit the fact that HSV-1 has been shown to be highly recombinogenic. We present a quantitative trait locus (QTL) based analysis of the genotypes and phenotypes of HSV-1 viral recombinants. The genotypes characterize complete genome sequences for two parental strains, OD4 and CJ994, as well as 65 OD4:CJ994 recombinants. The phenotypes quantify the severity of blepharitis, stromal keratitis, and other aspects of the disease. Our QTL mapping has been conducted by learning genotype-phenotype models using the Lasso, Ridge regression, and Random Forest methods. Many of the phenotypically meaningful SNPs identified by our models are involved in HSV-1 regulatory networks and viral genes that affect innate immunity. Several genes were previously indicated as being important in virulence, which validates this approach, and other genes were novel. We are currently extending the approach to use multitask learning approaches to simultaneously model all disease phenotypes, and to identify epistatic interactions affecting HSV-1 ocular virulence.
Short Abstract: The accumulation of gene and genetic variant annotations has been increasing explosively with recent technological advances. However, the fragmentation of these annotations across many data silos is often frustrating and inefficient. We created two platforms, MyGene.info (http://mygene.info) and MyVariant.info (http://myvariant.info), as centralized repositories to aggregate and serve dispersed annotation data. Both are free, open source, high-performance, and continuously updated application programming interfaces (APIs) for accessing comprehensive, structured gene and variant annotations. These resources are offered as cloud-based web service endpoints with the goal of providing “Annotation as a Service”. All annotations relevant to a unique gene or variant are merged into a single annotation object. NCBI’s gene ID and HGVS ID are selected as the primary keys for all annotation objects in MyGene.info and MyVariant.info, respectively. A high-performance and scalable query engine was built to index the merged annotation objects and provide programmatic access for developers. In addition, we have built a scheduling system that automates the updates for each data source according to its own schedule. Currently, MyGene.info provides over 200 gene-specific annotation fields, covering more than 13 million genes for over 15,000 species. MyVariant.info contains over 500 variant-specific annotation types from dozens of resources, covering more than 334 million unique variants. Both MyGene.info and MyVariant.info APIs are currently widely used by the research community, with more than 4 million requests per month for MyGene.info and more than 2 million requests per month for MyVariant.info.
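Because each annotation object is keyed by a single identifier, a lookup reduces to building one URL per gene or variant. The Python sketch below only constructs such URLs and makes no network request. The version prefixes (`/v3` and `/v1`), the `gene/` and `variant/` path segments, and the `fields` parameter are assumptions that are not stated in the abstract and should be checked against the services' current documentation; the helper names and the example identifiers are illustrative.

```python
from urllib.parse import quote, urlencode

# Assumed base paths; the abstract gives only the host names.
MYGENE_BASE = "http://mygene.info/v3"
MYVARIANT_BASE = "http://myvariant.info/v1"


def gene_url(ncbi_gene_id, fields=None):
    """Build a lookup URL keyed by an NCBI gene ID."""
    url = "%s/gene/%s" % (MYGENE_BASE, quote(str(ncbi_gene_id)))
    if fields:
        # Optional comma-separated list of annotation fields (assumed parameter).
        url += "?" + urlencode({"fields": ",".join(fields)})
    return url


def variant_url(hgvs_id):
    """Build a lookup URL keyed by an HGVS ID such as 'chr7:g.140453134T>C'."""
    # Keep ':' readable; characters such as '>' are percent-encoded.
    return "%s/variant/%s" % (MYVARIANT_BASE, quote(hgvs_id, safe=":"))


example_gene = gene_url(1017, fields=["symbol", "name"])
example_variant = variant_url("chr7:g.140453134T>C")
```

Fetching either URL (for example with `urllib.request.urlopen`) would be expected to return the merged annotation object as JSON, under the same assumptions.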
Short Abstract: An understanding of insect–plant relationships, especially the relationships between insects and their edible plants, is essential for Integrated Pest Management (IPM), an environment-friendly approach to growing healthy crops and minimizing the use of pesticides. It is known that insects have their specific edible plants; for example, the swallowtail butterfly eats only citrus leaves. To date, a considerable body of knowledge on insect–plant relationships has been gathered by researchers, farmers, and amateur naturalists. However, these data are scattered across many different books and Internet sites, making it difficult to view them in a comprehensive way. It would therefore be beneficial to have a database of insect–plant relationships, which would also have links to the accumulating “omics” (genomic, transcriptomic, proteomic, and metabolomics) databases for agricultural and environmental studies. We present an integrated database of insect genomes and ortholog genes, chemical interaction networks between insects and plants, and prediction of plant metabolic pathways. These data will be integrated with other data, such as taxonomic, nucleic-acid and metabolite information stored in NCBI and other repositories.
Short Abstract: Large molecular datasets, including gene expression, proteomics, and especially next-generation sequencing, have brought many “big data” challenges to the biological sciences. Sequence data in general have proved challenging to store, access, and query due to the massive amount of data generated by sequencing experiments. While indexed file formats have made individual sample data fast and easy to access, combining information across numerous samples has stressed even high-end database systems. The most widely used solution is to store only genetic variants compared to the reference genome. While this focuses on the most interesting positions, information regarding most of the ~3 billion reference genotypes per patient (or their absence due to targeted sequencing or missing data) is not stored, limiting the questions such a database can be used to answer. This lack of knowledge about the precise state of non-variant samples (i.e., reference vs. missing or not tested) prevents accurate definition of the genetic control group. We have developed a Negative Storage Model (NSM) to efficiently store precise state information (variant, reference, or missing), allowing for flexible and enhanced querying ability across large numbers of samples. We have evaluated this model using relational SQL (MySQL) and NoSQL (MongoDB) databases, and find advantages to each. Best times for a variety of queries, including genotype state at individual and multiple positions in a sample, ranged from 1 to 60 seconds. Ongoing work will utilize Hadoop together with MongoDB to evaluate scalability across thousands of samples, including TCGA samples and 3,383 internal targeted sequencing samples.
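The three-state distinction (variant, reference, missing) can be illustrated with a small in-memory Python sketch. The abstract does not describe the NSM schema, so the layout here is an assumption: each sample stores its variant calls plus the intervals that were not tested, and any other position is inferred to be reference. The class and function names are hypothetical, and intervals are taken to be 1-based, inclusive and non-overlapping.

```python
import bisect


class SampleStore:
    """Per-sample store: explicit variants plus untested (missing) intervals."""

    def __init__(self, variants, missing_intervals):
        # variants: {(chrom, pos): genotype}
        self.variants = dict(variants)
        # missing_intervals: iterable of (chrom, start, end), grouped per chromosome
        self._missing = {}
        for chrom, start, end in sorted(missing_intervals):
            self._missing.setdefault(chrom, []).append((start, end))

    def state(self, chrom, pos):
        """Return 'variant', 'missing' or 'reference' for one position."""
        if (chrom, pos) in self.variants:
            return "variant"
        intervals = self._missing.get(chrom, [])
        # Last interval whose start is <= pos (intervals are sorted by start).
        i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            return "missing"
        return "reference"


def reference_controls(samples, chrom, pos):
    """Names of samples that were tested at the position and carry the reference allele."""
    return sorted(name for name, store in samples.items()
                  if store.state(chrom, pos) == "reference")


samples = {
    "s1": SampleStore({("chr1", 100): "0/1"}, []),
    "s2": SampleStore({}, [("chr1", 50, 150)]),   # position 100 not tested
    "s3": SampleStore({}, [("chr1", 200, 300)]),  # position 100 tested, no variant
}
controls = reference_controls(samples, "chr1", 100)
```

A variant-only database would count both `s2` and `s3` as non-carriers at position 100; with the missing state stored, only `s3` qualifies as a control.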
Short Abstract: Imaging genetics combines brain imaging and genetic information to identify the relationships between genetic variants and brain activity. When the data samples belong to different classes (e.g., disease status), the relationships may exhibit class-specific patterns that can be used to facilitate the understanding of a disease. Conventional approaches typically perform independent analysis on each class and simply detect the differences, but ignore important shared patterns.
In this paper, we develop a multivariate method to analyze the differential dependency across multiple classes. We propose a joint sparse canonical correlation analysis (JSCCA) method, which uses a generalized fused lasso penalty to jointly estimate multiple pairs of canonical vectors with both shared and class-specific patterns. Using a data fusion approach, the method effectively integrates the strength from each individual class to improve the accuracy of detection of the differences. The results from simulation studies demonstrate its higher accuracy in discovering both common and differential canonical correlations than conventional sparse CCA. Using a schizophrenia dataset with 92 cases and 116 controls including single nucleotide polymorphism (SNP) array and functional magnetic resonance imaging (fMRI) data, the proposed method reveals a set of distinct SNP-voxel interactions for the schizophrenia patients that are verified to be statistically and biologically significant.
Short Abstract: The study of a recently published high-coverage Neandertal genome confirmed that modern humans outside of Africa trace a small percentage of their ancestry back to an admixture event with Neandertals. We use the Neandertal genome sequence to assign short indels on the human lineage to three categories: (i) indels that predate the split from Neandertals (Neandertal-shared indels), (ii) indels that likely arose after the split from Neandertals (modern-human-specific indels), and (iii) indels that were likely contributed to modern human populations outside of Africa through admixture with Neandertals (introgressed indels). When comparing the abundance of modern-human-specific indels to Neandertal-shared indels, we find a significant enrichment of modern-human-specific indels in genes associated with dendrite development. This finding is compatible with a relaxation of constraint affecting these genes. While introgressed indels were not associated with any specific functional category of genes compared to the other two categories of indels, we find that introgressed indels are significantly underrepresented in genic regions in general, suggesting that Neandertal-introgressed material in genes was often deleterious.
Our analysis provides a comprehensive list of introgressed and modern human specific indels that are predicted to affect phenotype in modern humans. These changes provide a resource to further investigate the contribution of Neandertals to modern human phenotypic variation and the specific evolutionary trajectory of modern humans.
Short Abstract: The identification of genomic structural variation with high sensitivity and specificity remains a major challenge. Here we present the most comprehensive comparison of structural variant calling software for Illumina-based sequencing to date. We compare BreakDancer, CORTEX, clever, CREST, DELLY, GASVPro, GRIDSS, HYDRA, lumpy, Socrates, TIGRA, and VariationHunter on both simulated data and well-studied cell lines to identify the strengths and weaknesses of existing structural variation detection algorithms. Our simulation suite compares each caller across a wide range of read lengths (36-250bp), fragment sizes (150-500bp), read depths (4-100x), aligners (bwa, bowtie2, mrsFast), and event types and sizes. We show that, unlike for SNV calling, all these factors are critically important in the selection of the most appropriate tool. This study reveals that for some variant calling approaches, sequencing using longer reads actually reduces variant calling performance, in some cases increasing the false discovery rate to over 98%. We demonstrate that newer methods combining genome assembly with split read and read pair information perform best across almost all combinations of input data.
Short Abstract: Haplotype-resolved genomes are important in many areas of human genetics, ranging from variant-disease associations and mapping regions of loss of heterozygosity (LOH) to studying inheritance patterns in human populations. To assemble haplotypes, statistical, sequencing-based, and specialized experimental approaches have been developed.
In this study we use, for the first time, a hybrid phasing approach that takes advantage of experimental phasing using Strand-seq along with sequencing-based phasing using the WhatsHap algorithm. Strand-seq is a single-cell sequencing technique with a unique ability to retain the directionality of sequencing reads, based on DNA template strand inheritance. This allows us to map every read to a single parental chromosome and thereby phase all variants present in this chromosome.
We show that WhatsHap is able to combine Strand-seq and PacBio data to yield nearly complete chromosome-length haplotypes at high accuracy. To demonstrate validity of this approach we have performed an experimental study using different subsets of single cell Strand-seq data to construct scaffold haplotypes and combined them with PacBio reads for chromosome 22 of the NA12878 individual.
Our results show that as few as 10 Strand-seq single-cell libraries combined with PacBio data are sufficient to phase 24,283 heterozygous variants (out of 31,821 in total), compared with 23,946 heterozygous variants phased using data from 134 Strand-seq libraries alone. These results show that this novel hybrid strategy can deliver dense chromosome-spanning haplotypes at high accuracy and at a considerably reduced cost compared to Strand-seq alone.
Short Abstract: In microbial engineering, metabolic evolution is an essential method for developing organisms with a desired phenotype such as tolerance or product yield. In an evolution experiment, organisms with advantageous phenotypes emerge under strong selective pressure and displace the parent strain in a population. Mutations in the evolved strains are credited with improved fitness. This method generates a strain with a desired phenotype, but understanding how genomic variations relate to fitness requires further investigation.
Evolved strains can contain numerous mutations of which only some demonstrate phenotypic changes and may be relevant to fitness. Genomic sequencing identifies mutations, but interpreting variations in the context of larger biological systems remains a challenge. Our previous work produced a pipeline for mutation analysis that leverages public E. coli databases and computational tools such as structure prediction software. Mutations in coding regions and extragenic regulatory sites are analyzed and then visualized on an integrated gene regulatory and metabolic network to investigate their relationships and relevance to metabolic pathways.
Mutations affecting regulators can be difficult to interpret without additional information. Incorporating their entire regulons into the network can introduce numerous nodes and impede analysis. To better interpret such cases, associated transcriptomic experiments can provide insight into the implications of genomic variations.
Here, we demonstrate the integration of genomic data from an E. coli evolution study for improved octanoic acid tolerance and associated RNA-seq experiments for the parent and evolved strains. The additional transcriptomic data reveals the impact mutations in regulators have on genes in associated regulons.
Short Abstract: Influenza A viruses exhibit vast genetic mutability, which enables them to acquire the capability to transmit among different hosts, and are known to be responsible for several pandemics. One of the key computational issues in influenza prevention and control is to identify potential molecular signatures with cross-species transmission potential. We propose a new entropy-based host-specific signature identification method that uses a similarity coefficient to incorporate amino acid substitution information and improve identification performance. Our preliminary evaluation using simulated and real datasets demonstrates that our method has significant advantages in both identification sensitivity and false positive control.
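The entropy component of such a method can be sketched in Python as follows. This is a simplified illustration, not the proposed method: it omits the similarity coefficient mentioned in the abstract, and the function names, the `max_entropy` threshold and the toy sequences are assumptions. A position is flagged when each host group is well conserved (low Shannon entropy) but the two groups prefer different residues.

```python
import math
from collections import Counter


def shannon_entropy(residues):
    """Shannon entropy (in bits) of the residues observed at one alignment column."""
    total = len(residues)
    counts = Counter(residues)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def candidate_signature_positions(host_a, host_b, max_entropy=0.5):
    """Return 0-based columns conserved within each host but differing between hosts.

    host_a, host_b: lists of equal-length aligned protein sequences.
    max_entropy: hypothetical conservation threshold (bits).
    """
    positions = []
    for col in range(len(host_a[0])):
        column_a = [seq[col] for seq in host_a]
        column_b = [seq[col] for seq in host_b]
        conserved = (shannon_entropy(column_a) <= max_entropy
                     and shannon_entropy(column_b) <= max_entropy)
        consensus_a = Counter(column_a).most_common(1)[0][0]
        consensus_b = Counter(column_b).most_common(1)[0][0]
        if conserved and consensus_a != consensus_b:
            positions.append(col)
    return positions


avian_like = ["MKTA", "MKTA", "MKSA"]   # toy aligned sequences, not real data
human_like = ["MRTA", "MRTA", "MRTA"]
signatures = candidate_signature_positions(avian_like, human_like)
```

In the toy data only column 1 (K versus R) qualifies; column 2 varies within the first group and shares the same consensus residue across groups.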
Short Abstract: Toxoplasma gondii is a zoonotic apicomplexan parasite with a global distribution and a broad host range among warm-blooded animals. The parasite is the cause of toxoplasmosis, a significant health risk to pregnant women and the immunocompromised. Until recently, T. gondii was believed to exhibit an almost exclusively clonal population structure, with few sexual recombination events. However, recent analyses by the Toxoplasma research community suggest that the population structure involves more sexual recombination than previously thought.
Since the parasite can only undergo sexual recombination in the felid gut, it is difficult to know how frequently strains of T. gondii meet to facilitate sexual recombination. Understanding sex at the population level can be informative of how rapidly markers of pathogenesis may be moving throughout the population. To quantify these events, we use two approaches, (i) an in silico population simulation, and (ii) a matrix clustering analysis using Matlab.
For the in silico population simulation, we generate multiple hypothetical sub-populations in the form of simulated genome sequences using software for forward-genetics simulation. At different rates of sexual recombination, we generate new progeny populations. These in silico progeny are then compared to the measured natural Toxoplasma population to infer the rate of sexual recombination occurring in the natural population. The matrix clustering analysis will allow us to distinguish between individuals that are the result of sexual events, and those that are the result of clonal expansion (both sexual [inbreeding], and asexual). This comparison will allow us to construct proposed groups of sexually produced progeny.
Short Abstract: Accumulation of somatic mutations may contribute to the development of cancers and the functional decline associated with aging. However, the rate and extent of somatic mutation accumulation in otherwise healthy cells is poorly quantified at present, as estimates range from 10 to 100,000 mutations per cell. Somatic mutation rates for any complex organism likely vary between heterogeneous tissues and over the course of an individual’s lifespan. As such, we have collected extensive time series DNA-seq data from diverse tissues for three well-defined strains of Mus musculus, each with a distinct aging phenotype. Existing approaches for somatic mutation detection are largely designed for oncogenomics, and are not entirely appropriate for whole-genome aging research. To remedy this, we have created an algorithm for accurately determining the incidence rate of somatic mutations in complex DNA-seq data. Through its use of a sophisticated deep neural network machine learning model, our approach is able to detect rare sequence variations, while accounting for the systematic noise intrinsic to high-throughput sequencing technologies. Through our analyses, we have determined strain-specific profiles of somatic mutation accumulation in vivo across several tissues. Moreover, this study has deepened our understanding of how somatic mutations may accumulate differentially across subspecies, possibly giving rise to discrete aging phenotypes. Further work is underway to determine if the observed somatic mutation rates are stochastic, or driven by selective pressures.
Short Abstract: Genomic structural variation (SV) is a common clinical feature known to be involved in the initiation and pathogenesis of cancer. This complex class of variants also has significant implications for therapeutic decisions and efficacy and has emerging roles in evidence-based clinical applications. However, despite recent advancements in the field, there is a lack of robust tools to accurately identify SVs from NGS data. Here we present STAR-SEQR, a novel tool used to detect and annotate DNA SVs and RNA fusions from paired-end sequencing data. This approach uses the popular STAR aligner as a first-pass approach to produce gapped junction and discordant paired reads. Additional filters include marking duplicates before assembling each candidate region, realignment and read-directionality checks to mitigate false-positive calls. Analytical testing has been performed on a set of samples with known DNA SVs including EML4-ALK, RET-CCDC6, and SLC34A2-ROS1, with 9 technical replicates of each. In every case STAR-SEQR accurately detected the breakpoint, leading to 100% sensitivity, and outperformed the other software packages that were evaluated. Specificity was also 100% for these known samples. RNA fusions were evaluated against a synthetic dataset, with every known fusion being detected. In summary, STAR-SEQR performs well in our testing and is currently being used in both clinical and research settings.
Short Abstract: Germline genetic variants have a major impact on drug response in cancer patients. These variants affect pharmacogenomic (PGx) genes encoding proteins involved in drug absorption, distribution, metabolism, and excretion. Genotyping PGx variants can stratify cancer patients based on their susceptibility to drug toxicities and likelihood to benefit from treatment. While peripheral blood is the primary DNA source for genotyping PGx variants, archival formalin-fixed paraffin-embedded (FFPE) tumours from clinical trials provide a large and valuable resource for retrospective PGx studies. However, disadvantages include DNA damage caused by formalin fixation, heterogeneity in tumour specimens, and somatic mutations in tumour DNA. We evaluated the concordance of PGx variants between 130 matched peripheral blood and FFPE tumours from amplicon sequencing data of a clinical genomic panel at the BC Cancer Agency. Six PGx genes, namely DPYD, MTHFR, GSTP1, TYMS, TYMP, and UGT1A1, were amplified with custom PCR primers and sequenced using Illumina MiSeq. We applied an in-house bioinformatics pipeline for data pre-processing and variant calling, and assessed the concordance of 23 clinically relevant variants using kappa statistics. The mean concordance rate was 84.9%, in which 657/711 genotypes were concordant. 7/23 variants had a Cohen’s kappa of ≥ 0.75, indicating substantial to almost perfect agreement. This study demonstrated that high concordance can be achieved between peripheral blood and FFPE tumours, thereby validating the use of FFPE tumours as an alternative to peripheral blood for germline variant calling in PGx genes.
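For readers unfamiliar with the kappa statistic used here, the following Python sketch computes Cohen's kappa for paired genotype calls at one variant. It is a generic textbook calculation, not the authors' in-house pipeline; the function name and the toy genotype lists are illustrative.

```python
from collections import Counter


def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two equally long lists of categorical calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of samples with identical calls.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Expected chance agreement from the marginal genotype frequencies.
    freq_a = Counter(calls_a)
    freq_b = Counter(calls_b)
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources always report the same single genotype
    return (observed - expected) / (1 - expected)


# Toy genotypes for one variant in six matched samples (not study data).
blood = ["AA", "AA", "AG", "GG", "AG", "AA"]
ffpe = ["AA", "AA", "AG", "GG", "AA", "AA"]
kappa = cohens_kappa(blood, ffpe)
```

Here five of six calls agree (83.3% raw concordance), but kappa is 5/7, about 0.71, because some agreement is expected by chance given how common the AA genotype is. This is why a variant can show high raw concordance and still fall below a kappa threshold such as 0.75.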
Short Abstract: One of the key challenges in secondary analysis of next-generation sequencing (NGS) data is detecting structural variations (SVs) accurately and efficiently. Since SVs are large variations in the genome, detecting them using relatively short reads from NGS is particularly difficult. Although a variety of methods and tools have been developed to detect SVs, they have lacked either comprehensiveness or accuracy.
To address that, we previously developed MetaSV, an integrated SV-caller which leverages multiple complementary methods to report highly accurate SVs. It also incorporates analysis of soft-clips in read alignments to significantly improve the detection of long insertions.
Here, we present several significant enhancements of soft-clip analysis in MetaSV which improve both its accuracy and speed. To extend the soft-clip analysis beyond insertions, we analyze soft-clip reads to detect more SV types such as deletions, inversions and duplications. We also propose an interval strength measure that formalizes the prioritization of local assembly intervals (regions around SV breakpoints), leading to up to a 10X speed-up of computation time for the assembly. We also enable local assembly for large SVs by restricting the assembly to the regions around the breakpoints. Performance comparison against the original MetaSV using the VarSim simulated dataset revealed significant improvement in detecting different SVs, especially insertions and inversions, both in terms of precision and sensitivity (respectively by 11.9% and 2.6% for insertions, and 2.0% and 28.5% for inversions).
Short Abstract: The rate of mRNA translation makes an important contribution to determining protein abundance in cells, and genetic variants that affect the rate at which a protein is synthesized may give rise to human genetic diseases and phenotypes. Here we developed a computational pipeline to detect differences between alleles in the rate of protein translation, referred to as allele-specific translation (AST). Our method makes use of samples for which both RNA-seq and Ribo-seq data are available, and is sensitive to differences in the relative abundance of ribosome-associated and non-ribosome-associated mRNA between alternative alleles in heterozygous samples. We applied the pipeline to identify AST in the HeLa cancer cell line and found 347 AST candidate genes, carrying 1477 genetic variants. This set of genes is significantly enriched for genes that have been shown experimentally to be associated with genetic variation in protein abundance (Fisher's test p-value = 1e-6). The variant rs114238154, located in the start codon of the cancer-associated gene NQO1 and showing strong statistical evidence of AST, has been validated experimentally, with an average 30-fold difference in its allelic translation rates. Application of our pipeline to the HeLa cell line was facilitated by the recent availability of the complete haplotype-resolved genome of HeLa. However, we demonstrate that existing high-throughput sequencing data can be used to recover the haplotype-resolved genome of other samples with sufficient accuracy to infer AST. This approach has the capacity to provide insights into the etiology of a subset of mapped genetic diseases for which the causal variant remains undiscovered.
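One plausible way to test a single heterozygous site for allele-specific translation is to compare allele read counts between RNA-seq and Ribo-seq in a 2x2 table. The sketch below uses a two-sided Fisher's exact test for that comparison. This is an assumption made for illustration: the abstract does not state which statistic the pipeline uses per site, and the function name and read counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: assay (RNA-seq, Ribo-seq); columns: allele (reference, alternative).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x reference reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical read counts at one heterozygous site.
rna_ref, rna_alt = 52, 48      # RNA-seq: roughly balanced alleles
ribo_ref, ribo_alt = 80, 20    # Ribo-seq: reference allele over-represented
p = fisher_exact_two_sided(rna_ref, rna_alt, ribo_ref, ribo_alt)
print(f"p = {p:.2e}")
```

A small p-value here indicates that the allelic ratio among ribosome-associated reads differs from the ratio in total mRNA, which is the signal the abstract describes.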
Short Abstract: Affymetrix Genome-Wide Human SNP Array 6.0 (SNP6) has been widely used for genome-wide association study (GWAS) discovery and validation. The CytoScanHD array (CytoHD) is mainly designed for CNV analysis, with the ability to output SNP genotype calls. To investigate the potential use of CytoHD arrays for validating GWAS SNP6 discovery results, we investigated the genotype comparability between 174 myeloid malignant samples typed on CytoHD and two groups of normal samples (#1: 1,297; #2: 497) typed using SNP6 and downloaded from dbGaP. An ancestry test was applied to remove non-Caucasian samples. SNPs failing Hardy-Weinberg equilibrium (HWE; p < 10^-5), SNPs with call rate < 95%, and potentially ambiguous A/T or C/G SNPs were removed. After QC, 1,949 arrays and 123,281 SNPs were included in the analysis. With SNPs binned by HapMap MAF at 1% intervals, the average Pearson correlation between CytoHD and SNP6 is 0.81. As expected, the correlation within the same platform, i.e., between SNP6#1 and SNP6#2, is higher (r = 0.91) than across platforms. The correlation of the association pattern between CytoHD/(SNP6#1+SNP6#2) and SNP6#1/SNP6#2 is very high (r = 0.97, p-value < 2.2e-16). In PCA analysis, the samples typed by CytoHD moderately overlapped with those typed by SNP6 (t-test p = 0.003 for PC1; t-test p = 0.06 for PC2). The SNP frequencies seemed to be highly comparable between the two platforms. We are currently verifying genotype concordance between exome sequencing data and CytoHD data among 8 lung carcinoma and normal samples. These data suggest that researchers may be able to use CytoHD arrays on a refined scale for validation of GWAS data discovered using SNP6.
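The SNP-level QC criteria listed in the abstract above can be sketched as a simple per-SNP filter. The thresholds (HWE p < 10^-5, call rate < 95%, A/T or C/G alleles) come from the abstract; the function names are hypothetical, and the 1-degree-of-freedom chi-square test for HWE is an assumption, since the abstract does not say which HWE test was used.

```python
from math import erfc, sqrt

HWE_P_MIN = 1e-5       # SNPs with HWE p below this are removed
CALL_RATE_MIN = 0.95   # SNPs with call rate below this are removed
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T or C/G SNPs, whose strand cannot be inferred from alleles."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return erfc(sqrt(chi2 / 2.0))


def passes_qc(allele1, allele2, genotype_counts, call_rate):
    """Apply the three SNP filters: ambiguity, call rate, and HWE."""
    return (
        not is_strand_ambiguous(allele1, allele2)
        and call_rate >= CALL_RATE_MIN
        and hwe_chi2_pvalue(*genotype_counts) >= HWE_P_MIN
    )


# Hypothetical SNPs: (allele1, allele2, (n_AA, n_AB, n_BB), call rate)
print(passes_qc("A", "G", (25, 50, 25), 0.99))  # well-behaved SNP
print(passes_qc("A", "T", (25, 50, 25), 0.99))  # strand-ambiguous
print(passes_qc("A", "G", (50, 0, 50), 0.99))   # strong HWE deviation
```

Removing A/T and C/G SNPs matters for cross-platform comparisons such as this one, because the two arrays may report alleles on opposite strands and such SNPs look identical after strand flipping.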
Short Abstract: Understanding cis-regulatory control of gene expression is crucial to understanding complex diseases. Traditional methods for identifying sequence variants, such as the expression quantitative trait loci (eQTL) approach, have identified many genetic variants correlated with changes in gene expression. However, these methods have difficulty distinguishing functional variants from neighboring variants in linkage disequilibrium. Furthermore, since the effect size of individual variants is small, the eQTL approach requires large sample sizes to identify variants. To address these challenges, we implemented a method to identify functional sequence variants by searching for variants that explain observed patterns of allele-specific expression (ASE) from RNA-seq data. Allele-specific analyses benefit from within-sample controls that reduce environmental and trans-acting influences. We measured the power of our method as a function of minor allele frequency and ASE measurement error. We applied this method using RNA-seq data and genotype data collected from the Genotype-Tissue Expression (GTEx) Project, and the variants identified have additional evidence for functional relevance based on gene annotation and epigenomics data.
Short Abstract: Background. Preterm birth (PTB) complications are the leading cause of long-term morbidity and mortality in children. Methods. We integrated whole-genome sequencing (WGS), RNAseq, and methylation data for 270 PTB and 521 control families. Statistical analyses identified genomic variants associated with PTB, very early PTB (VEPTB), as well as premature rupture of membranes, pre-eclampsia, placenta-related, uterine-related, cervix-related, and idiopathic PTB. We identified differentially expressed genes and methylated probes, performed eQTL and mQTL analyses to link genomic variants to these expression and methylation changes, and performed enrichment tests. Results. We identified 128 significant genomic variants associated with PTB-related phenotypes. The most significant variants, differentially expressed genes, and differentially methylated probes were associated with VEPTB. Integration of all data types allowed us to nominate a set of candidate biomarker genes for VEPTB, encompassing both novel and previously reported PTB genes. Notably, RAB31 and RBPJ were both identified by all three data types (WGS, RNAseq, and methylation). Systems involved in VEPTB include EGFR and prolactin signaling pathways, inflammation- and immunity-related pathways, chemokine signaling, interferon-gamma signaling, and Notch1 signaling. Both machine learning efforts with RandomForest and an enrichment of variants further supported the candidate pathways in mothers. Conclusions. Progress in identifying molecular and systems components of complex disease is aided by integrated analyses of multiple molecular data types, WGS, and clinical data. With these data, and by stratifying PTB by sub-phenotype, we have identified additional PTB genes and pathways, particularly for VEPTB.