Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: In sequence database search, profile hidden Markov models (HMMs) are more sensitive than BLAST and run at roughly the same speed. The open source software suite HMMER achieves this speed with a multi-stage pipeline, in which efficient early stages filter out poor matches before they reach later, more computationally expensive algorithms. The initial filter, called SSV, computes maximal-scoring ungapped alignments and is responsible for ~70% of HMMER’s runtime in typical searches. Here, we present a custom hardware accelerator for the SSV algorithm, designed for integration into HMMER on HPC systems with FPGA subsystems. This hardware accelerator reduces computation time, decreases power utilization, and scales with additional FPGA hardware.
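To make the idea of a maximal-scoring ungapped alignment concrete, here is a minimal Python sketch. It is not HMMER's SSV implementation: it compares two plain strings rather than a profile HMM and a sequence, and the function name and the match/mismatch scores are assumptions chosen only for illustration. Every diagonal of the comparison matrix is scanned with a running sum that resets to zero whenever it goes negative.

```python
def best_ungapped_score(query, target, match=2, mismatch=-3):
    """Return the highest score of any ungapped local alignment.

    Hypothetical scoring: +match for identical residues, +mismatch otherwise.
    A real profile search would look up position-specific scores instead.
    """
    best = 0
    # Each diagonal corresponds to one fixed offset (j - i) between the strings.
    for offset in range(-(len(query) - 1), len(target)):
        running = 0
        i = max(0, -offset)   # start position in query
        j = i + offset        # matching start position in target
        while i < len(query) and j < len(target):
            score = match if query[i] == target[j] else mismatch
            # Reset to zero when the running segment stops paying off.
            running = max(0, running + score)
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(best_ungapped_score("ACGT", "TTACGTTT"))
```

Because each diagonal is independent of the others, this kind of computation parallelizes naturally, which is one reason ungapped filters are attractive targets for vector units and dedicated hardware.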
Short Abstract: DNA methylation of cytosine in dinucleotide CpG is an essential epigenetic modification that plays a key role in gene expression, development and disease. DNA enrichment-based methods offer high coverage of methylated CpG dinucleotides with the lowest cost per CpG covered genome-wide. They measure the DNA enrichment of methyl-CpG binding and therefore do not directly provide absolute methylation levels. Further, the enrichment is influenced by confounding factors besides the methylation status, e.g., CpG density. Computational models that derive the absolute methylation levels from the enrichment signal are necessary. We present “MeDEStrand”, a method that uses a sigmoid function to estimate and correct CpG bias from the enrichment read counts. Uniquely, “MeDEStrand” processes the reads for the positive and negative DNA strands separately, which yields further improvement. We compare the performance of “MeDEStrand” with three other state-of-the-art methods, “MEDIPS”, “BayMeth” and “QSEA”, on four independent MeDIP-seq datasets generated from immortalized cell lines (GM12878 and K562) and human primary cells (foreskin fibroblast and mammary epithelial). “MeDEStrand” shows the best performance at high resolutions of 25, 50 and 100 bp. In conclusion, “MeDEStrand” provides a free tool to infer whole-genome absolute DNA methylation levels at the cost of enrichment-based methods with adequate accuracy and resolution.
Short Abstract: Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls—with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is at par with current tools. A software implementation is available as an open-source Python program.
Short Abstract: Single-cell RNA-Seq has been available for several years but high-throughput single-cell DNA analysis is in its infancy. To address current challenges and enable the characterization of genetic diversity in cancer cell populations, we developed a novel approach to identify mutation signatures which define subclones present in a tumor population. Methods: Here we present a two-step clustering and subclone identification method using data generated on the Tapestri single-cell DNA platform. The variant-cell matrix generated is subjected to unsupervised hierarchical clustering on a PCA projection space to identify the subclone structure. The silhouette value is used to identify the optimal clustering. After filtering on these top variants, we perform additional rounds of clustering to obtain the subclone profile. To validate our methodology, we used two different model systems: A) a 50:50 mix of cell lines or B) a mixture of 3 distinct cell lines present at 33% each. Results: With our two-step clustering process we show distinct clusters correlating with titration and cell line ratio. We were also able to identify the cluster-associated signature mutations. Our approach has the potential to address the key issue of identifying rare subpopulations of cells and to improve patient stratification.
Short Abstract: Human induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) have promising applications in drug testing, disease modeling, and regenerative medicine. Single-cell RNA sequencing (scRNA-seq) provides unprecedented capability to precisely measure transcriptomic profiles in individual cells. In this study, hiPSCs generated from patient skin biopsies were differentiated into cardiomyocytes. We performed scRNA-seq on 85 cardiomyocytes, 42 at Day 12 (D12) and 43 at Day 40 (D40). Our data revealed dramatic patterns and identified 732 genes highly expressed at D12 and 271 at D40. Novel stage-specific markers were further identified based on differential analysis and the number of expressing cells. Numerous cardiac transcription factors were among the D12 differentially expressed genes, including NKX2.5, BMP2, and ISL1. Functional enrichment analysis showed that D12 genes are involved in atrial muscle tissue morphogenesis and ventricular septum development. More robust novel markers, 15 at D12 and 24 at D40, were selected by subpopulation comparison based on the expression pattern of known cardiac markers. Furthermore, late-stage subpopulation markers were identified by comparing cardiomyocyte and fibroblast-like subpopulations at D40. In summary, combining scRNA-seq and hiPSC-CMs enabled us to identify novel stage-specific cardiac markers during in vitro differentiation. These results will allow more accurate phenotyping of hiPSC-CMs and help tailor them for specific applications.
Short Abstract: Somatic variant callers are very sensitive with high recall, but provide insufficient filtering and produce results with low precision that are rich in false positives. This negatively affects all downstream analyses. Standard practice is ad hoc filtering, which is poorly standardized. To address this problem, we have developed Filters for Next Generation Sequencing (FiNGS), which outperforms both default filters and alternative filtering software such as FPfilter. FiNGS calculates 23 metrics by querying the normal and tumor BAM files at variant sites. Default cut-offs were determined using training data representing a range of tumor purities, sequencing depths and modalities including targeted panel, whole exome (WES) and whole genome (WGS). A test set of WGS data sourced from the International Cancer Genome Consortium (ICGC) was used for validation. These data were high depth (314x) and annotated with a list of 961 gold standard true positives. Variants were called using Strelka2. When compared to both default filters and FPfilter, FiNGS achieved a large increase in precision (86%, up from 41%) for a minimal decrease in recall (95%, down from 99%). We demonstrate that FiNGS can supplement or supplant the default filters in variant callers and delivers improved somatic variant calls.
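The trade-off reported in the abstract above (precision 41% to 86%, recall 99% to 95%) can be put on a single scale with a short calculation. The F1 score and the false-positives-per-true-positive ratio below are not reported by the authors; they are added here only as an illustrative summary, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_positives_per_true_positive(precision):
    """Expected false calls accompanying each true call at a given precision."""
    return (1 - precision) / precision


if __name__ == "__main__":
    # Figures quoted in the abstract: (precision, recall) before and after filtering.
    for label, p, r in [("default filters", 0.41, 0.99), ("FiNGS", 0.86, 0.95)]:
        print(label,
              "F1 = %.3f" % f1_score(p, r),
              "FP per TP = %.2f" % false_positives_per_true_positive(p))
```

Under these figures the F1 score rises from roughly 0.58 to roughly 0.90, and the number of false positives per true positive falls from about 1.4 to about 0.16.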
Short Abstract: Recently, 10x Genomics introduced the Chromium library preparation protocol for augmenting Illumina paired-end reads with long range linkage information ("linked reads"). Under the Chromium protocol, each read pair is tagged with a 16 bp barcode that associates it to one or more long DNA molecule(s) up to 100 kbp in length, providing invaluable information for resolving genomic repeat structures during de novo assembly. Here we present ABySS-LR, a linked reads assembly pipeline that leverages Chromium barcode information to resolve repeat components (ABySS), cut contigs at misassemblies (Tigmint), and build assembly scaffolds (ARCS). ABySS-LR is being developed for assembly of large genomes with multiple Chromium libraries, in conjunction with other sequencing data types such as paired-end reads, mate pair reads, and long reads. On a linked reads data set for human chromosome 21, ABySS-LR yields an NA50 length of 5.1 Mbp, which represents a 50X improvement over a standard ABySS v2.0 assembly.
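For readers unfamiliar with the contiguity metric quoted above, the sketch below computes N50 from a list of sequence lengths. NA50 is commonly obtained by applying the same calculation to aligned blocks after contigs have been broken at misassemblies; the function name and input format here are assumptions for illustration and are not taken from ABySS-LR or any evaluation tool.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical contig (or aligned block) lengths in bp.
    print(n50([2, 3, 4, 5, 6]))
```

For NA50, `lengths` would hold the aligned block lengths instead of the raw contig or scaffold lengths.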
Short Abstract: Errors in genotype calling can have perverse effects on genetic analyses, confounding association studies and obscuring rare variants. Because genotyping error rates vary between studies and technologies, reliable estimates can be difficult to obtain. Furthermore, most estimates of error rates are typically reported for the entire study even though genotypes can be miscalled in more than one way. Here, we report a method for estimating the rates at which different types of genotyping errors occur at bi-allelic loci using pedigree information. Our method uses instances where the haplotypic phase has not been faithfully transmitted to identify potential genotyping errors. We develop a model that uses the differences in the frequencies of inconsistent phase to estimate rates for different types of genotype error. We apply our method to a dataset of genotypes from the whole-genome sequencing of owl monkey families (Aotus nancymaae). We find significant differences between estimates for different types of genotyping error, with the most common being heterozygous sites miscalled as homozygous for the reference allele. The approach we describe is applicable to any set of genotypes where haplotypic phase can reliably be called, and should prove useful in helping to control for false discoveries.
Short Abstract: The classifying methods used in CancerSEEK, a novel blood test using ctDNA for the early detection of eight different cancer types, will be presented (Cohen et al. Science 2018, 359(6378):926-930). The blood test uses a combination of features derived from genetic alterations and protein biomarkers: mutations in cell-free DNA and levels of circulating proteins. The performance of the classifier is assessed via two different tasks. The first task consists in classifying patients with cancer vs healthy controls (Figure 1a; median sensitivity across the eight tissues 70%, with 99% specificity). The second task, for the positive cases only, consists in localizing the cancer site (Figure 1b). We will also present current developments of the algorithm, yielding improvements in its performance, as well as future directions.
Short Abstract: RNA viruses mutate at extremely high rates forming an intra-host viral population of closely related variants (i.e., quasi-species). Hepatitis C virus (HCV) outbreaks pose a significant problem for public health and it is important to infer transmission clusters, i.e., to decide whether two viral samples belong to the same outbreak. An initial approach was based on estimating relatedness between two samples as the distance between consensuses of the corresponding viral populations. The distance between the closest pair of representatives from two populations, MinDist, has been shown to be significantly more accurate. We utilized our algorithm Finch to distinguish related versus unrelated HCV sequences within an outbreak and compared our results to the MinDist algorithm. Finch utilizes a lightweight k-mer Boolean pairwise XOR function and obtained 99.16% accuracy for the detection of related versus unrelated sequences in 9406 pairwise comparisons. This accuracy is substantially improved over the previously reported <93% accuracy of the MinDist algorithm. Finch demonstrates that k-mer-based metrics allow for rapid and accurate separation of epidemiologically related and unrelated viral samples with high sensitivity, specificity and accuracy. Finch is also scalable allowing analysis of arbitrarily large viral datasets.
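One way to picture a k-mer Boolean XOR comparison is sketched below. This is an assumption-laden illustration, not the Finch algorithm: the abstract does not specify k, the normalization, or the decision threshold, so the choice of k=4, the division by the size of the union, and the function names are all hypothetical.

```python
def kmer_set(seq, k):
    """Set of all length-k substrings present in seq (presence/absence only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def xor_distance(a, b, k=4):
    """Fraction of observed k-mers present in exactly one of the two sequences.

    The Boolean XOR of the two presence vectors equals the symmetric
    difference of the k-mer sets.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return len(ka ^ kb) / len(union)


if __name__ == "__main__":
    print(xor_distance("ACGTACGTAC", "ACGTACGTAA"))  # closely related
    print(xor_distance("AAAAAA", "CCCCCC"))          # unrelated
```

A pair of samples would then be labeled related when the distance falls below some threshold calibrated on known outbreaks. Because the comparison is alignment-free and uses only set operations, it scales to large numbers of pairwise comparisons.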
Short Abstract: To improve the quality and reliability of our RNAseq results, we have leveraged information available in SNP and gene count data to validate our sample annotations. The advent of next-generation sequencing (NGS) technologies has enabled researchers to quickly and inexpensively collect rich sequence data from large cohorts of subjects. However, with this high sample throughput comes the increased chance that samples may end up mislabeled due to processing errors. Recent studies have identified sample label error frequencies ranging from 1% to 10% in various datasets, highlighting the need for sample validation steps in NGS processing pipelines. To address this need, we have implemented two sample confirmation steps in our RNAseq pipeline. We first predict sample sex from gene counts, and then compare the similarity of SNPs observed between samples that originate from the same donor. Here we present a study of the impact of different aligners and different reference panels on each of these metrics, with a focus on conserving computational resources while maintaining sensitivity to labeling errors. We apply these tools to a real-world dataset to demonstrate that mislabeled samples can be readily identified, allowing researchers to either reassign corrected labels or exclude data with dubious labels.
Short Abstract: Transcript-guided targeted assemblies of genes can result in more complete reconstructions of gene loci and flanking sequences by focusing on target regions, thus simplifying the assembly problem. 10x Genomics’ Chromium sequencing platform produces linked reads by partitioning long DNA molecules during library construction, thereby ensuring that all sequencing fragments from a given molecule have the same 16-mer barcode. Our targeted assembly pipeline, TAILR, first utilizes barcode information from linked reads along with the ABySS unitig graph to identify de novo assembly intermediate sequences (unitigs) that co-locate to a target genomic region. It then assembles these filtered unitigs with ABySS. When running TAILR using 6,019 C. elegans transcripts and simulated linked reads, the pipeline achieved a 92.9% success rate, compared to 68.4% for a traditional ABySS whole-genome assembly (reconstructions with over 90% coverage are considered successful). TAILR also assembled genomic regulatory regions well beyond the initial target loci (median 15.9 kbp). We have also successfully run TAILR using real human data. Informed by the long-range information encoded in linked reads, TAILR recruits and assembles unitigs from targets of interest and successfully reconstructs their corresponding genomic loci at a higher rate than otherwise possible with a traditional short read de novo assembly.
Short Abstract: Introduction: Single-cell RNA-seq (scRNA-seq) experiments approach biological questions at the resolution of individual cells, but interpretations on the results of more than 5,000 cells are usually complicated by feature selection, dimensionality reduction, and stochasticity. To make interpretations more intuitive, we present SCEDAR, a machine-learning based Python package for efficiently exploring large scale scRNA-seq datasets that does not require pre-processing by feature selection or dimensionality reduction thus maintaining the biological context. Methods: SCEDAR implements a novel clustering algorithm for non-reduced scRNA-seq datasets, minimum description length (MDL) iteratively regularized agglomerative clustering (MIRAC), which iteratively divides the hierarchical tree data structure built by agglomerative clustering to identify the local sub-clusters, regularized by the MDL. Identified clusters are interpreted by a sparsity-aware gradient boosting tree algorithm. For single-cell quality control, two k-nearest neighbors methods are also provided to detect aberrant cells and “pick up” drop-out genes. These analytical methods are built upon the underlying efficient and modularized data models that are also designed for fast visualization. Conclusion: We demonstrate that SCEDAR is a computationally tractable method for analyzing scRNA-seq datasets to obtain biologically meaningful interpretations.
Short Abstract: In spliced alignment of an RNA sequencing (RNA-seq) sample to a reference genome, it is challenging to accurately call exon-exon junctions spanned close to the ends of reads. Modern splice-aware aligners improve alignment of these short-anchored reads by accepting input lists of known junctions. We develop a novel measure of spliced alignment accuracy based on defining as ground truth spliced alignments of an RNA-seq sample, then truncating its reads and realigning with a given protocol to assess precision and recall of junctions. We show that for some modern aligners, our accuracy metrics are often lower than those reported in published comparisons. We also show that even when these aligners are fed the ground-truth junctions, they often fail to map truncated reads across the correct junctions. To address these issues, we introduce anchorage, which realigns the output of a modern aligner. Given the ground-truth junctions, anchorage improves precision and recall in part because of a new multiread resolution algorithm. We further introduce a tool for building human junction lists to pass to anchorage called morna. Given a query sample, morna passes junctions from similar samples in publicly available data to anchorage to improve alignment.
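The junction-level precision and recall used in the evaluation above can be sketched as a set comparison. The representation of a junction as a (chromosome, donor position, acceptor position) tuple and the function name are assumptions for illustration; the abstract does not describe the actual data structures.

```python
def junction_precision_recall(truth, called):
    """Compare called exon-exon junctions against ground-truth junctions.

    Both arguments are sets of (chrom, donor_pos, acceptor_pos) tuples.
    """
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical junctions from full-length reads (truth) and from
    # the same reads after truncation and realignment (called).
    truth = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)}
    called = {("chr1", 100, 200), ("chr1", 300, 410)}
    print(junction_precision_recall(truth, called))
```

In this toy case one of the two called junctions is correct and one of the three true junctions is recovered.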
Short Abstract: Accurate interpretation of RNA-Seq data presents a moving target as the technology evolves. This challenge has led researchers to perform many benchmarking studies to determine best analysis practices, most of which depend on simulated data. Despite this strong need for simulated data, only a few RNA-Seq simulators are available in the public domain, all of which are based on simplifying assumptions that limit their utility. To address these shortcomings and generate realistic simulated data, we are developing an open-source, modular simulator that models each step in the process of converting RNA molecules into sequencing reads. We will model the biochemical reactions and biases (e.g. polymerase GC-content biases, PCR duplication) of each step in library construction as separate modules. Using an object-oriented paradigm, each module will have well-defined inputs and outputs, allowing users to easily substitute new modules. This modular design will give the simulator the flexibility to model different library construction and sequencing protocols as the technology continues to advance. To model biological variability we will take an empirical approach based on using real data to configure the simulator’s parameters. This simulator will be a crucial tool for the community as we continue to develop standard practices for transcriptome analysis.
Short Abstract: Sharing genomes without personal identifiers is common practice. However, recent studies revealed the risk of re-identifying people from their genomes, or attached quasi-identifiers, such as birthdate and zip code. The additional availability of individuals’ RNA-seq data has implications for privacy, as it may be linked to the genome, potentially allowing the person’s privacy to be breached. RNA-seq reads contain genetic variants, and thus can be directly linked to the genome. To avoid this risk, some researchers release gene expression or isoform expression values instead of raw reads. However, using a Bayesian framework, we found that it is feasible to predict genomic variants from relative isoform expression. Based on GTEx splicing QTL data, using relative isoform expression from 15 genes, we could identify the target genome within a pool containing hundreds of individuals with >90% accuracy. When gene expression is integrated, we are able to re-identify the source of an RNA-seq dataset from a pool containing billions of genomes. Our result implies that mitigating the linking risk by adding noise would severely compromise the biological integrity of the data, since the expression of over half of the genes is affected. Our study also implies that other kinds of “omic” data may leak genome privacy.
Short Abstract: Structural variants (SVs) are large genomic rearrangements and have been linked to multiple diseases as well as cancer. With the increased adoption of NGS, reproducibility of these platforms in reporting SVs is critical for clinical practice. However, reproducibility of SV detection is not well-studied and, unlike small-variant detection, best practices are lacking for identifying SVs. We address these issues in the Germline Variants Detection Working Group of the SEQC2 consortium by systematically investigating the reproducibility of SV detection using combinations of multiple sequencing platforms and SV calling workflows. Based on our study, we also propose best practices for calling SVs using NGS data. In this work, replicates of a HapMap trio sample were sequenced using the Illumina HiSeq-2000 platform. SVs were called using multiple workflows, combining different aligners (BWA-MEM, Bowtie2, Isaac, and Stampy) and SV callers (Pindel, CNVnator, BreakSeq, BreakDancer, MetaSV, DELLY, LUMPY, Manta, and Parliament). The combination of SVs detected by multiple workflows allows assessment of possible modes of variability. We compared the SV call sets to study reproducibility of detecting different kinds of SVs using multiple metrics and quality control filters. Additionally, we also investigated genomic signatures linked to inconsistent SV calls across replicates and workflows.
Short Abstract: Variant discovery is crucial in medical and clinical research, especially in the setting of personalized medicine. As such, precision in variant identification is paramount. However, variants identified by current genomic analysis pipelines contain many false positives. These can potentially be eliminated by applying the state-of-the-art filtering tool VQSR, but its performance is unsatisfactory and it often fails to run in practice. Therefore, we propose VEF, a variant filtering tool based on ensemble methods that overcomes the drawbacks of VQSR. We treat filtering as a supervised learning problem, using the high-confidence call set produced by GIAB (for human sample NA12878) as ground truth and the annotations of each variant in the output variant call set as features. VEF trains a Random Forest classifier for filtering. We tested the performance of VEF on two NA12878 whole genome sequencing (WGS) datasets and one simulated WGS dataset, all with 30× coverage. We show that the proposed filtering tool consistently outperforms VQSR.
Short Abstract: Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants with few mutations with respect to major strains may result in immune escape, emergence of drug resistance, and an increase in virulence and infectivity. Next-generation sequencing technologies permit sampling of intra-host viral populations at great depth, thus providing an opportunity to access low-frequency variants. Long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants. In this article, we present CliqueSNV, a novel reference-based method for reconstruction of viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single nucleotide variations and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The full paper text is available at https://www.biorxiv.org/content/early/2018/03/31/264242
Short Abstract: Accurate typing of human leukocyte antigen (HLA), a histocompatibility test, is important because HLA genes play various roles in immune responses and disease genesis. The current gold standard for HLA typing uses targeted DNA sequencing technology requiring specially designed primers or probes. Although there exist enrichment-free computational methods that use various types of sequencing data, hyper-polymorphism found in HLA region of the human genome makes it challenging to type HLA genes with high accuracy from whole genome sequencing data. Furthermore, these methods are database-matching approaches where their output is inherently limited by the incompleteness of already known types, forcing them to find the best matching known alleles from a database, thereby causing them to be unsuitable for discovery of novel alleles. In order to ensure both high accuracy as well as the ability to type novel alleles, we have developed a graph-guided assembly technique for classical HLA genes, which is capable of assembling phased, full-length haplotype sequences of typing exons given high-coverage (> 30-fold) whole genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art database-matching methods. Using various data, we also demonstrate that our method can type novel alleles.
Short Abstract: Single-cell analysis has the potential to improve the understanding of cellular heterogeneity by obtaining individual cellular information instead of aggregate information that is usually seen in bulk level analysis. DNA methylation at CpG dinucleotides is an important epigenetic phenomenon that regulates gene expression. At present several single-cell methylation protocols exist to understand disease and normal state mechanisms but they all suffer from low coverage due to the low quantity of input DNA in a single-cell. We find that on average, only about 5 – 10% of CpGs are observed in typical single-cell libraries. We show how missingness of methylation status can bias seemingly simple metrics such as mean methylation estimates and clustering analyses. We propose a joint analysis approach that leverages either bulk sequencing data or a consensus generated from a large number of single-cells, to infer bias-corrected single-cell methylation status. In this approach we model and explicitly adjust for biases that arise due to missingness. Understanding and correcting the biases that exist in single-cell methylation data will be crucial to making robust biological conclusions about individual cells based on their methylation profiles.
Short Abstract: Somatic mutations are typically identified by comparing variants found in tumor samples to those in matched normal samples. However, molecular diagnostic assays often sequence and analyze variants only in tumors, followed by filtering against known germline variant databases. Such assays often lead to misclassification of germline variants as somatic mutations. Here we developed a pipeline to detect somatic variants from tumor-only samples that greatly reduces misclassification while maintaining a competitive number of true positives. This pipeline integrates existing tools such as MuTect2 and PureCN with customized filtering strategies based on a panel of normals, allele frequency, microsatellite instability status, and contamination estimation. We tested this pipeline using single-tumor Whole Exome Sequencing (WES) samples from TCGA and observed above 70% precision and near 50% recall when compared with tumor-normal paired MuTect2 results downloaded from the Genomic Data Commons (GDC). In our test setting, this pipeline greatly outperforms many existing variant calling tools. Our pipeline provides a balance of precision and recall in detecting somatic mutations in single tumor samples, and is particularly useful for clinical cancer specimens or older tumor samples when matched normal samples are not available.
Short Abstract: Accurate annotation of biological sequences is fundamental to modern molecular biology. For many sequences this is a straightforward process: tools such as BLAST and HMMER quickly and accurately annotate sequences by aligning them to known sequences or sequence models. Here, we are interested in annotation by translated alignment, in which protein-coding DNA is aligned directly to protein sequences or models. We demonstrate that the use of profile hidden Markov models (HMMs) substantially increases annotation sensitivity relative to sequence-to-sequence comparison methods such as tblastn. Even with profile HMMs, annotation of protein-coding DNA sequences containing frameshift-inducing indels can be particularly troublesome, as standard models do not support alignment through frameshifts. We present a new tool, built within the open source HMMER software package, that produces high-quality translated alignments and accurate annotation for even heavily frameshifted DNA sequences. With a new model and a first-ever Forward-Backward dynamic programming algorithm for frameshift-aware alignment, this tool promises to increase annotation of naturally frameshifted data such as pseudogenes and transposable elements, as well as to improve the annotation of indel-rich long-read sequencer data by eliminating the need for extensive error correction.
Short Abstract: Next-generation sequencing (NGS) techniques have been widely used to investigate gene expression, protein binding peaks and chromatin interactions. Many computational methods have been developed that compare two samples to detect significant differences between their signals, which are generally fitted as negative binomial (NB) distributions, but these methods usually rely on data normalization and their significance calls are highly biased because of distribution skewness. Here we introduce a general statistical model to directly compare two general NB distributions whose parameters are real positive numbers. We developed a computational method (named MORN) for significance calling in comparative analysis of ChIA-PET data, RNA-seq data and ChIP-seq data. The computational results on real datasets show that our method has higher performance and less bias than existing methods in RNA-seq and ChIP-seq analysis. From PolII and CTCF ChIA-PET data analysis of cancer cell lines, we detected a large number of cancer-associated disorders of chromatin interactions that consequently cause dysregulation of multiple oncogenes in various cancer types. Thus, this work provides a theoretical foundation upon which unbiased comparative analysis can be pursued for any NGS data that are fitted as NB distributions.
Short Abstract: Motivation: The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes. Results: In this manuscript, we demonstrate the performance of state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce the concept of an upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Availability and implementation: http://cab.spbu.ru/software/quast-lg Contact: firstname.lastname@example.org
Short Abstract: ATAC-Seq (Assay for Transposase-Accessible Chromatin with high throughput sequencing) is a modern technology used to accurately probe open chromatin accessibility, using a mutated hyperactive enzyme that preferentially binds to stretches of exposed DNA. An ATAC-Seq experiment will typically produce millions of sequencing reads that can be mapped to the reference genome to pinpoint transposition events. One can then assign a cut count to each genomic position and create a base-pair resolution signal. Several open source analysis tools have been developed to aid each step of the analysis; however, accurate and reproducible bioinformatics analysis of ATAC-Seq data is still challenging. We developed an automated and scalable bioinformatics pipeline for the analysis of ATAC-Seq data on the cloud, which runs analyses in parallel on data sets of virtually any sample size. This pipeline includes adapter trimming, alignment, de-duplication, removal of reads with low mapping quality, merging of samples, summarization and visualization of alignment statistics, detection of chromosome binding sites, and identification of differentially bound sites between groups. Parameter tuning can be easily carried out through a configuration file. This pipeline is being used as part of the ongoing Toxicant Exposure and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET) consortium effort.
Short Abstract: RNA-seq technology allows high-throughput detection and quantification of RNA molecules. RNA-seq extends the NGS portfolio by enhancing one’s ability to study non-contiguous genomic sequences, such as alternative splicing events. In RNA-seq, reads may be aligned to a transcriptomic reference, as opposed to the default genomic reference. The use of the transcriptome alleviates the need for splice-aware alignment of reads, which is not a trivial task. However, quantification and detection have been shown to rely heavily on the choice of the reference transcriptome. While there is typically only one major genomic assembly, users must choose which annotation, and therefore which transcriptome, to use for a specific analysis. For a single gene, such annotations may vary in terms of isoform and exon numbers, alternative splicing events, and length of untranslated regions. This project illustrates and quantifies the extent of discrepancies between the Ensembl and RefSeq annotations using a graph-based approach. Using a simplified graph-edit-distance algorithm, one can compare two annotations by quantifying the cost of transforming one annotation graph into the other, which may be used as a metric of discrepancy. Such a metric provides a prediction of which isoform abundances are prone to be affected by the choice of transcriptome.
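The graph comparison in the abstract above can be sketched as follows. The abstract does not specify the graph model or the cost function, so this example makes its own assumptions: exons are nodes identified by their (start, end) coordinates, consecutive exons within a transcript are joined by an edge, and every node or edge insertion or deletion costs 1. Under those assumptions the edit distance reduces to the size of the symmetric differences. All names and coordinates are hypothetical.

```python
def annotation_graph(transcripts):
    """Build (nodes, edges) for one gene from a list of transcripts.

    transcripts: list of transcripts, each a list of (start, end) exons.
    Nodes are exons; edges join consecutive exons within a transcript.
    """
    nodes, edges = set(), set()
    for exons in transcripts:
        ordered = sorted(exons)
        nodes.update(ordered)
        edges.update(zip(ordered, ordered[1:]))
    return nodes, edges

def simple_edit_distance(graph_a, graph_b):
    """Unit-cost insertions and deletions needed to turn graph_a into graph_b."""
    (nodes_a, edges_a), (nodes_b, edges_b) = graph_a, graph_b
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)

# Toy annotations of the same gene (coordinates are made up).
annotation_1 = annotation_graph([
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],            # exon-skipping isoform
])
annotation_2 = annotation_graph([
    [(100, 200), (300, 400), (500, 650)],  # longer last exon
])
distance = simple_edit_distance(annotation_1, annotation_2)
```

Because nodes are matched by exact coordinates, a shifted exon boundary counts as one deletion plus one insertion; a weighted substitution cost would be a natural refinement.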
Short Abstract: Recent progress in DNA sequencing technology has been remarkable and is beginning to overcome the obstacles to structural analysis of complex genomic regions such as the vicinity of centromeres and telomeres. The most striking advances are the practical application of long-read sequencing technology, the associated computational power, and the analytical pipelines. We are interested in the genomic structure, function, and evolution of primate subtelomeric regions, and a comparative genome analysis is currently in progress. Subtelomeres have a unique genomic structure: they form the transition between chromosome-specific sequence and the arrays of telomeric repeat sequence near the chromosome ends. Because of this complex structure, the gene content, the composition of DNA elements, and the copy-number variation of duplicated sequences in subtelomeric regions have not been fully analyzed. To determine the precise structure of subtelomeres, we isolated fosmid/cosmid clones covering primate subtelomeric regions and determined their complete structures using long-read sequencing technology. In addition, comparative analysis of primate subtelomeric sequences revealed that dynamic structural changes have occurred between species. We present an overview of our project and its current status at the meeting.
Short Abstract: Platforms and methods for single-cell RNA sequencing (scRNA-seq) have greatly advanced in recent years. While droplet and barcoded mRNA-capture methods have significantly increased the isolation of cells for scRNA-Seq analysis, these technologies frequently produce technical artifacts, such as doublet-cell captures. Doublets formed from distinct cell types can appear as hybrid scRNA-Seq profiles, but their transcriptomes are not distinct from those of the individual cell states. Traditional approaches for detecting doublets, such as assessing the number of sequencing reads, fall short when cell types with differing levels of transcriptional activity or library amplification efficiency are sequenced. Although computational methods for detecting doublets are beginning to emerge, none has been proven to remove doublets with both high sensitivity and high specificity. We introduce DoubletDecon, an approach that detects doublets with a combination of deconvolution analyses and identification of unique cell-state gene expression. We demonstrate the ability of DoubletDecon to accurately identify synthetic and microscopy-validated cell doublets from scRNA-Seq datasets of varying cellular complexity. DoubletDecon is able to account for covariates such as cell-cycle effects and is compatible with diverse species. We believe this approach has the potential to become a standard quality control step for the accurate delineation of cell states.
Short Abstract: Focal oncogene amplification and rearrangements drive tumor growth in multiple cancer types. Our team recently showed that circular, extrachromosomal DNA (ecDNA) formation could explain amplification in nearly half of all samples across cancer types, dramatically changing the tumor evolution trajectory (Turner, Nature 2017). Here we present AmpliconArchitect (AA), a tool to robustly reconstruct the fine structure of focally amplified regions using whole genome sequencing. For a candidate amplicon, AA performs copy number (CN)-aware breakpoint detection, estimates CN by optimizing a flow on a breakpoint graph, and interactively visualizes a simple cycle decomposition to represent multiple amplicon structures. AA is the first tool to be systematically validated on a thousand diverse simulations using a novel edit-distance metric. Mechanisms proposed for amplification and complex rearrangements include fragile sites and chromothripsis. AA-reconstructed amplicons and FISH images from a pan-cancer dataset, combined with TCGA data, revealed multiple lines of evidence suggestive of random breaks in the human genome, ecDNA-mediated amplification, gradual accumulation of complex rearrangements, aggregation, and reintegration into the chromosomes. In a second dataset of virus-driven cervical cancer samples, AA consistently reported a bifocal signature of viral integration, which can be explained by chimeric human-viral ecDNA, suggesting a novel mechanism for onco-viral pathogenesis.
Short Abstract: The Rat Genome Database (RGD) collects information on genetic variation from the worldwide community of rat researchers and provides tools for searching and retrieving these data. Currently, we show details about almost 605 million variants (SNVs, insertions and deletions) and the studies that identified these variants using different genome references and methods. Most variants were generated using previous rat genomic assemblies (Rnor3.4 and Rnor5.0). RGD reanalyzed high-throughput whole-genome sequence data for 25 rat strains using the newest rat reference assembly, Rnor6.0, released in 2014. Difficulties with correct variant identification require up-to-date evaluation of the available algorithms, their accuracy and their performance. Therefore, we use state-of-the-art tools for variant calling and effect prediction. We identified over 12 million genomic variants (SNVs and indels) using GATK Best Practices recommendations and annotated the data with recent versions of the SnpEff and PolyPhen2 prediction tools. In addition, we are expanding our current search tools to allow a variant position in the genome to be combined with the functional annotations assigned to RGD genes and strains. The combination of improved variant data with disease, phenotype, pathway, function and chemical interaction information will help researchers find appropriate models for disease research and drug testing.
Short Abstract: Cancer is one of the leading causes of death worldwide and was responsible for 8.8 million deaths in 2015. Precision medicine, which involves selecting appropriate therapies after molecular diagnostics, brings better treatment methods. Non-invasive biopsy from blood samples, which is quicker and much easier to tolerate than invasive biopsy, can be used to examine cell-free DNA derived from tumor cells. This method has the potential to offer many advantages in the future, including diagnosis and the monitoring of treatment progress and recurrence. Herein, we developed a new computational workflow to detect low-level variants in liquid biopsy samples prepared with the Ion AmpliSeq™ HD Library Kit and custom assay designs from Ion AmpliSeq™ HD Panels. This workflow correctly recognizes the molecular tags incorporated by the Ion AmpliSeq™ HD Library Kit and accurately reports variants with a 0.1% limit of detection in circulating free DNA (cfDNA). It has been enabled in Torrent Suite™ and Ion Reporter™ software. This workflow, combining the Ion AmpliSeq™ HD Library Kit and custom assay panels, has proven to be a powerful tool for cfDNA-like samples. Variants with 0.1% allele frequency could be detected by sequencing with sensitivity of ≥80% and specificity of ≥98% for cfDNA samples.
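The role of molecular tags in low-level variant detection can be illustrated with a generic sketch. This is not the Torrent Suite™ or Ion Reporter™ algorithm, which the abstract does not describe; it only assumes that reads sharing a tag come from the same original molecule, so a per-tag majority vote suppresses sequencing errors before the allele frequency is computed. The function names, the tag strings and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_tag(reads):
    """Collapse reads into one consensus base per molecular tag.

    reads: iterable of (tag, base) pairs observed at a single position.
    Assumption: reads with the same tag derive from one original molecule.
    """
    families = defaultdict(list)
    for tag, base in reads:
        families[tag].append(base)
    # Majority vote within each tag family.
    return {tag: Counter(bases).most_common(1)[0][0]
            for tag, bases in families.items()}

def allele_frequency(consensus, alt):
    """Fraction of tagged molecules whose consensus base equals `alt`."""
    calls = list(consensus.values())
    return calls.count(alt) / len(calls) if calls else 0.0

# Toy data: the lone "T" in family AAT is treated as a sequencing error.
reads = [
    ("AAT", "C"), ("AAT", "C"), ("AAT", "T"),
    ("GGC", "T"), ("GGC", "T"),
    ("CTA", "C"),
]
consensus = consensus_by_tag(reads)
frequency = allele_frequency(consensus, "T")
```

Counting molecules rather than raw reads is what makes a threshold as low as the 0.1% quoted in the abstract meaningful, since amplification duplicates no longer inflate either allele.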
Short Abstract: Xenograft models are of significant utility in biomedical research. However, an often-overlooked confounding factor is host-graft contamination. Therefore, an important step in using xenograft-derived tissues is to remove mouse contamination from the samples. Here, we present a toolkit that effectively removes mouse reads from PDX samples for WGS and RNA-seq. We also present details on the differences in the variant calls and their subsequent functional effects, with and without the removal of mouse reads, using a human + mouse concatenated genome versus a human-only genome. We also note the changes in RNA expression.
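One common way to use a concatenated human + mouse reference is to assign each read to the species of the contig it aligned to. The sketch below shows that idea only; it is not the toolkit presented in the abstract above. It assumes primary alignments are available as (read ID, contig name) pairs and that mouse contigs carry a distinguishing prefix, here the hypothetical `mouse_`.

```python
def split_by_species(alignments, mouse_prefix="mouse_"):
    """Partition read IDs by the species of the contig they aligned to.

    alignments: iterable of (read_id, contig_name) for primary alignments.
    mouse_prefix: hypothetical naming convention for mouse contigs in the
    concatenated reference; every other contig is treated as human.
    """
    human, mouse = [], []
    for read_id, contig in alignments:
        if contig.startswith(mouse_prefix):
            mouse.append(read_id)
        else:
            human.append(read_id)
    return human, mouse

# Toy alignments against a concatenated reference.
alignments = [
    ("r1", "human_chr1"),
    ("r2", "mouse_chr1"),
    ("r3", "human_chrX"),
]
human_reads, mouse_reads = split_by_species(alignments)
```

Only the reads in `human_reads` would then be passed on to variant calling or expression quantification; a stricter version might also drop reads with low mapping quality, since those are ambiguous between the two genomes.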
Short Abstract: There are considerable advantages to both strand-specific sequencing and the ability to sequence samples with very small amounts of starting material. Until recently, Illumina did not have a kit that allowed both. The standard TruSeq kit requires an abundant starting quantity. The V4 kit allows for ultra-low starting quantities but sacrifices strand specificity. A new kit called Pico has recently become available that does both. We have performed a detailed comparative analysis of these three kits on a set of samples representing two experimental conditions, allowing for comparison of the kits in a standard differential expression analysis. Comparison is performed at the level of alignment, differential gene expression and pathway enrichment analysis. The 19 most discordant genes were selected for validation with PCR. All three kits reveal meaningful yet different results, with Pico being more similar to TruSeq than to V4.