Track: High Throughput Sequencing Algorithms and Applications (HitSeq)
Short Abstract: The PDX mouse model is an emerging platform for testing treatment responses in a preclinical setting, and it provides ample opportunities to realize personalized and precision medicine. Ideally, trio samples of patient normal, patient tumor, and mouse tumor tissues are required to identify somatic mutations in the PDX mouse that are concordant with the patient tumor. However, it is often the case that patient tissue is insufficient to generate deep sequencing data, making the subsequent somatic calling process difficult and error-prone. Here we developed a computational pipeline to predict somatic mutations from exome sequencing data of PDX mice in such circumstances. It consists of intricate read mapping and filtering processes to remove mouse-originated mutations and germline mutations. We tested our pipeline on over 60 trio cases of lung cancer, assuming either patient normal or tumor data is missing, and demonstrated that we could recover most genuine somatic mutations without excessive false positives. Our results indicate that PDX mice can be utilized effectively even without patient reference tissues.
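The core of such a filtering step—removing reads that more plausibly originate from the mouse genome—can be sketched by comparing each read's best alignment score against the human and mouse references. This is an illustrative sketch, not the authors' pipeline; the score inputs and margin are assumptions:

```python
def classify_read(score_human, score_mouse, margin=5):
    """Toy disambiguation of a PDX read by best-alignment score.

    score_human / score_mouse: alignment scores of the same read against
    the human and mouse references (higher is better); `margin` guards
    against near-ties. Illustrative only, not the authors' tool.
    """
    if score_human >= score_mouse + margin:
        return "human"
    if score_mouse >= score_human + margin:
        return "mouse"
    return "ambiguous"

def filter_reads(scored_reads):
    """Keep only reads confidently assigned to the human (tumor) genome."""
    return [name for name, (h, m) in scored_reads.items()
            if classify_read(h, m) == "human"]
```

Ambiguous reads are dropped rather than guessed, which trades a little sensitivity for fewer mouse-derived false calls.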
Short Abstract: Sewage is a major source of both human pathogens and their associated bacteriophages. Phages control bacterial populations by predation and can act as natural reservoirs for accessory genes such as antimicrobial resistance genes and virulence factors. However, limited knowledge is currently available about the sequence and functional diversity of such sewage phage communities. Here we present a study of the phage communities of 81 sewage samples from 62 different countries around the world. The samples consist of metagenomic assemblies in which we identified phage contigs using the MetaPhinder tool. These contigs were subsequently screened for the presence of known virulence and resistance genes with the VirFinder and ResFinder tools. Additionally, we performed host prediction with HostPhinder as well as taxonomic classification. Antimicrobial resistance genes were found in the phage population of 52 out of 80 samples, and virulence factors in 18 of the samples. Potential hosts were predicted for 12.7 ± 3% of phage contigs. Among the most common host genera were Escherichia, Caulobacter and Bacillus. Taxonomic classifications were assigned to 0.5% of the phage contigs on average. Among the most common taxonomic assignments was crAssphage, which was identified in 74 of the samples. In conclusion, we found that the phage communities in sewage are extremely diverse and contain many novel sequences.
Short Abstract: Whole genome sequencing was performed on 855 different strains of clinically isolated non-typeable Haemophilus influenzae. After gene prediction and gene clustering, a gene presence/absence matrix was produced by homology search. Using these gene clusters as features, the three goals were 1) to predict whether the strain was isolated from a sick or a healthy patient, 2) to predict the body site from which the strain had been recovered, and 3) to identify an informative subset of genes with high predictive power for either of the previous predictions. Feature selection was done using a combination of variance analysis and a GLM lasso implementation to identify informative features and reduce or remove redundancy. In predicting whether the strain came from a sick or healthy patient, an artificial neural network (ANN) implementation achieved an accuracy of 0.7917, which increased to 0.7976 using the lasso-selected features. The random forest (RF) implementation initially did slightly better with an accuracy of 0.8095, which interestingly went down to 0.7917 using only the selected features. In predicting the body site of origin, ANN accuracy was 0.4385 and, after feature selection, 0.4154. RF identified the correct body site with an accuracy of 0.4970 before feature selection and 0.5308 after.
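The variance-analysis stage of a feature selection like the one described can be sketched in a few lines: columns of the presence/absence matrix that are nearly constant across strains carry little signal and can be dropped before the lasso stage. The matrix layout and threshold below are illustrative assumptions:

```python
def variance_filter(matrix, min_var=0.05):
    """Drop gene-cluster columns whose presence/absence variance is too low.

    matrix: list of rows (strains), each a list of 0/1 gene-presence calls.
    Returns the indices of informative columns. Illustrative sketch only;
    a lasso step would then run on the retained columns.
    """
    n = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        p = sum(row[j] for row in matrix) / n  # frequency of presence
        if p * (1 - p) >= min_var:             # Bernoulli variance p(1-p)
            keep.append(j)
    return keep
```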
Short Abstract: Introduction: Rheumatoid arthritis (RA) is a chronic autoimmune disease which leads to inflammation of the joints. The cause of RA is still not well understood, but smoking, gender, pregnancy and genetic factors are all known to contribute to its development. MicroRNAs are short non-coding RNAs, 19 to 26 nucleotides in length, that regulate gene expression by binding to mRNA targets. Microarray experiments have identified several miRNAs that appear to play a functional role in RA patients, but there are few miRNA studies on RA using next-generation sequencing. In this work, we investigate changes both in miRNA expression levels and in the population of miRNA isoforms (isomiRs) in miRNA sequence data extracted from three immune cell types from Norwegian RA patients and healthy controls. Methods: We collected blood samples and three types of immune cells (CD19, CD4 memory and CD4 naïve) from RA patients at three time points: newly diagnosed; three months of treatment after diagnosis; and long-term patients. Samples from newly diagnosed and long-term patients were prepared for small RNA sequencing, and the sequence data were subjected to a standard preliminary analysis including quality control and adapter trimming. Prior to mapping, identical reads were collapsed into single sequences and mapped to human reference miRNA hairpin sequences using Bowtie, allowing 2 mismatches. We then investigated the variation among conditions in the population of the reads (i.e. isoforms or isomiRs) that map to the reference set of human miRNAs (according to miRBase version 21). To facilitate this, we introduced a comprehensive nomenclature to describe the modifications between a specific isomiR and the reference “parent” miRNA as specified in miRBase.
Results: We identified a set of isomiRs that are differentially expressed between the two RA cohorts, with many more isomiRs differentially expressed in CD19 than in CD4 memory and CD4 naïve cells, and distinct isomiRs observed in each cell type. Additionally, computational target prediction identified distinct target sets for each isomiR, which are also distinct from the predicted targets of the parent miRNA. For example, the mature form of hsa-miR-126 is predicted to target more than 1600 genes. In contrast, the differentially expressed isomiRs each have a dramatically reduced set of target genes: the shorter isoform (one nucleotide deleted at each of the 5’ and 3’ ends) has 5 targets, and the same-length isomiR (one nucleotide extended at the 5’ end and one nucleotide deleted at the 3’ end) has 61 targets. Conclusions: Investigating the additional dimensionality of small RNA NGS data (in the form of isomiR populations) can reveal additional structure that provides further insight into differences among tested conditions.
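A nomenclature of the kind described—recording how an isomiR's ends differ from the reference mature miRNA—can be encoded as signed offsets in hairpin coordinates. The label format below is a hypothetical illustration, not the authors' exact scheme:

```python
def isomir_label(ref_start, ref_end, iso_start, iso_end):
    """Describe an isomiR by its 5' and 3' offsets from the reference
    mature miRNA, using hairpin coordinates.

    Positive = extension beyond the reference end, negative = trimming.
    The "5p..|3p.." label format is a hypothetical illustration.
    """
    five = ref_start - iso_start   # bases gained (+) or lost (-) at the 5' end
    three = iso_end - ref_end      # bases gained (+) or lost (-) at the 3' end
    return f"5p{five:+d}|3p{three:+d}"
```

For instance, with a reference mature miRNA at hairpin positions 10-31, the two isomiRs from the abstract would be labelled "5p-1|3p-1" (trimmed at both ends) and "5p+1|3p-1" (extended 5', trimmed 3').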
Short Abstract: The exponential growth of high-dimensional biological data has led to a rapid increase in demand for automated approaches to knowledge production. Previous studies rely on two major approaches to this type of challenge: 1) the theory-driven approach, and 2) the data-driven approach. The former constructs new knowledge from previously acquired background knowledge, while the latter formulates scientific knowledge solely by analyzing previously obtained data. In this work, we argue that either approach alone suffers from a bias towards past and present knowledge, as it fails to incorporate all of the knowledge currently available for knowledge production. To address this challenge, we propose a novel two-step analytical workflow that incorporates a new dimensionality reduction paradigm as the first step, to handle high-throughput gene expression data analysis, and utilizes graphical causal modeling as the second step, to handle the automatic extraction of causal relationships. Our results on real-world clinical datasets from The Cancer Genome Atlas (TCGA) show that our approach is capable of judiciously selecting genes for learning effective causal networks.
Short Abstract: Skeletal diseases, including the complex diseases osteoarthritis and osteoporosis, present a large and growing health care burden with often poor treatment options. There is a critical need for a more detailed mechanistic understanding of these diseases to enable the development of rational disease-modifying treatments. Transcriptomics analysis has often been used both to characterise tissue affected by skeletal diseases and to find disease gene candidates in relevant cell types. Despite a large number of existing skeletal disease datasets, these expression profiles are difficult to interrogate and have not been examined in an integrated way. Using an automated pipeline for reproducible analysis of publicly available transcriptomics data, we have produced a large collection of consistently analysed, annotated expression datasets to allow mining for hidden connections and shared pathogenic mechanisms between different skeletal diseases. Unsupervised clustering at multiple regulatory levels revealed clusters of similarity at the pathway and transcription factor level that were not readily visible from simple examination of the gene expression signatures, demonstrating the utility of this integrative analysis. Our knowledge base of gene signatures, enriched pathways, active sub-networks and upstream transcription factors provides a resource for querying skeletal disease related datasets, contextualising the data with prior knowledge to provide more meaningful biological insight.
Short Abstract: Single-cell sequencing technology is rapidly improving the resolution at which cellular heterogeneity in complex tissues is studied, particularly in stem cell biology, where rare intermediate cell types are often difficult to capture or isolate. We have developed a computational method which leverages high-throughput single-cell RNA-seq data to assess which biological pathways undergo dysregulation in the context of neoplastic development and where in differentiation maximum dysregulation occurs. Our method is built upon our previous computational strategy for aligning single cells along a developmental or pseudo-temporal axis to estimate and describe changes in gene expression as tissues differentiate and mature. We have leveraged publicly available data from 10X Genomics and Pathway Commons to estimate pathway activity changes during differentiation of the erythroid lineage in bone marrow captured from two healthy individuals and a patient with acute erythroid leukemia. This Pathway Discordance Analysis has identified biologically discordant pathways previously implicated in myeloproliferative disorders in the scientific literature, as well as less-studied pathways which may represent new opportunities for future research.
Short Abstract: The wild lentil species, Lens lamottei and Lens odemensis, are potential sources of novel genetic variation for disease resistance and other desirable traits for Lens culinaris (cultivated lentil) breeding. Populations from crosses with L. odemensis have been made but hybrids with L. lamottei have been very difficult to produce. Understanding the structural differences between the wild and cultivated genomes will give insight into both the evolution within the Lens genus and domestication of cultivated lentil, as well as identify large-scale structural differences that may contribute to the level of success in obtaining viable hybrid offspring. Short-read assemblies were improved with 10x scaffolding for L. lamottei (3.5Gb total assembly, 4.4Mb N50, 28,638 scaffolds) and L. odemensis (3.7Gb total assembly, 3.9Mb N50, 23,352 scaffolds). The scaffolds were anchored on high-density genetic maps generated from GBS data to create pseudomolecules and additional unanchored scaffolds. The 10x Chromium data (~30X read depth) initially used in superscaffolding were then remapped against the alternate genomes to identify both variations in coverage (PAV/CNV) and breakpoints (structural variation) amongst the wild and cultivated lentil species. The long-range information from linked reads was used to confirm interspecific differences in genome organization. Although large-scale rearrangements were expected based on previous cytogenetic studies, this approach allowed for much higher confidence and more fine-grained identification of variation between the genomes beyond simple SNP and indel calling.
Short Abstract: BACKGROUND: Gains and losses of genetic material, also known as DNA copy number alterations, are aberrations involved in the development of cancer. Their analysis is therefore critical for research and diagnostics in oncology. Sequencing-based determination of copy number aberrations is becoming the most cost-effective approach compared to microarray-based techniques at equal resolution. Obtaining copy number calls from low-coverage whole genome sequencing reads requires the combined use of several programs with various steps, followed by further statistical analysis tools for pairwise comparisons, etc. A complete workflow would therefore be useful for many researchers in the field. RESULTS: Here we present QDNAseqFLOW, a computational workflow that produces DNA copy number plots along with various summaries and statistics, including the aberration differences found between groups of input samples. Written in the R programming language, it relies on the Bioconductor packages QDNAseq, DNAcopy, CGHcall and CGHregions as well as the open-source R packages NoWaves and CGHtest, all of them described in peer-reviewed journal articles. USAGE: The program can be run without programming skills on Windows, MacOSX and Linux through the provided wrapper scripts. The user is guided by simple graphical pop-ups to enter parameters or select file locations, while access to the program code allows users with R programming skills to change advanced parameters. FEATURES and WORKFLOW: (1) Reads obtained from low-coverage (= “shallow”) whole genome sequencing of DNA samples need to be provided as BAM files obtained by alignment to the human reference genome hg19. (2) Copy number plots and files are created using the Bioconductor package QDNAseq.
(3) ‘Waves’ in the profiles are smoothed with the R package NoWaves (van de Wiel et al., 2009); subsequently, segments are determined with the circular binary segmentation (CBS) algorithm implemented in the Bioconductor package DNAcopy, and the copy numbers of the obtained segments are called using the Bioconductor package CGHcall. (4) Summarizing frequency plots and quality statistics for all plots are created. Plots are flagged if their noise and/or number of segments is higher than expected, based on the inter-quartile range of values observed for all samples, and can then be checked and removed by the user from subsequent analysis. (5) If the user provides a grouping for the samples, individual frequency plots, aberration summaries (per chromosome arm) and a differential aberration analysis are produced. To obtain the latter, the Bioconductor package CGHregions is used to slightly adjust the segments in all samples so as to obtain regions with start and end positions identical across all samples with minimal information loss. Then, with the help of the R package CGHtest (van de Wiel et al., 2005), a Wilcoxon-Mann-Whitney two-sample test or Kruskal-Wallis k-sample test is applied to all aberrated regions to determine which aberrations differ significantly between the groups. CONCLUSIONS: QDNAseqFLOW is a comprehensive workflow for the analysis of copy number aberrations. It will be made available at github.com/NKI-Pathology.
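The per-region two-group test at the heart of the differential aberration step can be sketched with a rank-sum statistic. This pure-Python normal approximation is only a stand-in for the CGHtest implementation: it omits tie corrections (which matter for heavily tied copy-number calls) and average ranks:

```python
import math

def mann_whitney_p(a, b):
    """Two-sided Wilcoxon-Mann-Whitney p-value via normal approximation.

    Stand-in for the CGHtest implementation; no tie correction or average
    ranks, so it is only indicative when many values are tied.
    """
    pooled = sorted(a + b)
    r1 = sum(pooled.index(x) + 1 for x in a)   # rank sum of group a
    n1, n2 = len(a), len(b)
    u = r1 - n1 * (n1 + 1) / 2                 # Mann-Whitney U statistic
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail

def differential_regions(calls_a, calls_b, alpha=0.05):
    """Indices of regions whose per-sample values differ between groups.

    calls_a / calls_b: one list of per-sample values per region."""
    return [i for i, (a, b) in enumerate(zip(calls_a, calls_b))
            if mann_whitney_p(a, b) < alpha]
```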
Short Abstract: Naegleria fowleri, commonly known as the brain-eating amoeba, is a free-living eukaryote found in soil and warm fresh water sources all over the world. Once it has entered the nose, N. fowleri follows the olfactory nerves to the brain and causes primary amoebic meningoencephalitis (PAM), a fast-progressing and mostly fatal disease of the central nervous system. The mechanisms involved in the pathogenesis are still poorly understood. To gain a better understanding of the relationships within the genus Naegleria and to investigate pathogenicity factors of N. fowleri, we characterized the genome of its closest non-pathogenic relative, N. lovaniensis. To achieve a nearly complete assembly of the N. lovaniensis genome, long-read sequencing was applied, followed by assembly of the data using FALCON, a diploid-aware string graph assembler. To unravel the relatedness of Naegleria species, a phylogenetic tree was constructed based on maximum likelihood and bootstrapping using RAxML. With pathogenicity in mind, proteins specific to N. fowleri were defined by clustering of orthologous gene families between different Naegleria species, and their functions were characterized by functional annotation and GO enrichment analysis. In this study, we present the 30 Mb genome of N. lovaniensis for the first time. Sequencing and de novo assembly of the genome support the hypothesis of a close relationship to the human pathogen N. fowleri. Thus, knowledge of the N. lovaniensis genome provides the basis for further comparative approaches to unravel pathways involved in the pathogenicity of PAM and to identify structures for possible treatment options.
Short Abstract: Alternative splicing is well documented at the transcript level, but reliable large-scale proteomics experiments detect many fewer alternative isoforms than expected. Instead, proteomics evidence suggests that the vast majority of coding genes have a single dominant splice isoform, irrespective of cell type. Where a main proteomics isoform can be determined, there is almost perfect agreement with two orthogonal sources of reference isoforms: principal isoforms from the APPRIS database and unique CCDS variants, based respectively on the conservation of protein structure and function and on cDNA evidence. When alternative isoforms are detected in proteomics experiments, they tend to be highly conserved and are enriched in subtle splice events such as mutually exclusively spliced homologous exons and tiny indels. Only a small fraction of proteomics-supported alternative events disrupt protein functional domain composition, whereas two-thirds of annotated alternative transcripts would disrupt functional domains. Many annotated alternative splice transcripts have little cross-species conservation; however, it has been suggested that these alternative variants may play an important role in evolutionary innovation. We have analysed the results of human population variation studies and find that this is not the case: most alternative exons appear to be evolving neutrally in present-day human populations. While a small number of annotated alternative variants are conserved across species and are translated in detectable quantities, most are evolving neutrally. This strongly suggests that most alternative variants will not generate functionally relevant proteins.
Short Abstract: The MUC1 gene codes for the transmembrane glycoprotein mucin-1, and its coding sequence is GC-rich (82%), containing 25-120 polymorphic tandem repeats (VNTR) of 60 bp each. Frameshift mutations in MUC1 lead to the synthesis of an abnormal, highly basic, cysteine-rich protein, MUC1-fs. MUC1-fs accumulates in the tubular cells of the kidneys, causes progressive deterioration of renal function and leads to renal failure. The age at kidney failure varies from 17 to 75 years, and we hypothesize that the exact location of the mutation may be related to the age of renal failure. Using current technologies, it is very difficult to detect mutations in MUC1, and it seems to be impossible to determine their exact position because of the repetitive and GC-rich sequence, the length of the VNTR and the homopolymer stretch of 7 cytosines within each repeat. To detect the mutation, we amplified the VNTR region using long-range PCR and sequenced the amplified region on an Illumina HiSeq, followed by bioinformatic analysis of the raw reads. We tested this method successfully on samples with a previously known C insertion and then applied it to samples with unknown mutations. Using this approach, we identified three completely new mutations. To determine the haplotype and the exact position of the mutations, we are currently using the MinION sequencer from Oxford Nanopore. These methods enable genetic diagnostics of autosomal dominant tubulointerstitial kidney disease (ADTKD) and could contribute to the understanding of the genetic factors determining the progression and the age of onset of kidney failure in ADTKD.
Short Abstract: Multiple endogenous and exogenous mutational processes drive cancer mutagenesis and leave distinct fingerprints. Notably, they have inherent mutational nucleotide context biases. Mutation profiling of a cancer sample captures all mutations accumulated over the lifetime, including somatic alterations arising both before cancer initiation and during cancer development. In a generative model, multiple latent processes produce mutations over time, drawing from their corresponding nucleotide context distributions (the “mutational signature”). In a cancer sample, mutations from the various mutational processes are mixed and observable by sequencing. Many mutational processes are recognized and linked with known etiologies, and understanding the fundamental underlying processes helps explain cancer initiation and development. A key issue in the field is detecting the operative signatures in new cancer samples by leveraging known signatures derived from large-scale pan-cancer analyses. Previously published methods use empirical forward selection or iterate over all combinations (brute force). Here, we formulate this as a LASSO linear regression problem. By parsimoniously assigning signatures to cancer genome mutation profiles, the solution becomes sparse and biologically interpretable. Additionally, LASSO organically integrates biological priors into the solution by fine-tuning the penalties on coefficients. Compared with the current approach of subsetting signatures in fitting, our method leaves leeway for noise and allows promoting similarity within sample subgroups, leading to a more reliable and interpretable signature solution. Finally, our method can be automatically parameterized based on cross-validation. This objective, robust approach promotes replicability and fair comparison across studies.
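The LASSO formulation can be illustrated with a tiny non-negative projected-gradient solver. The 6-context toy signatures in the test are made up for illustration (real signatures would be 96-dimensional trinucleotide profiles), and this is a sketch of the general technique, not the authors' exact solver:

```python
def fit_signature_weights(profile, signatures, lam=0.01, lr=0.05, iters=5000):
    """Non-negative LASSO fit of known signatures to a mutation profile.

    Minimizes ||profile - S.w||^2 / 2 + lam * ||w||_1 subject to w >= 0,
    by projected gradient descent. Illustrative sketch only.
    """
    k, n = len(signatures), len(profile)
    w = [0.0] * k
    for _ in range(iters):
        # current reconstruction S.w over the n mutation contexts
        recon = [sum(signatures[j][i] * w[j] for j in range(k))
                 for i in range(n)]
        # gradient of the squared loss plus the L1 penalty (for w >= 0)
        grads = [sum(signatures[j][i] * (recon[i] - profile[i])
                     for i in range(n)) + lam
                 for j in range(k)]
        # gradient step, projected back onto the non-negative orthant
        w = [max(0.0, w[j] - lr * grads[j]) for j in range(k)]
    return w
```

The L1 penalty drives the weights of inactive signatures to exactly zero, which is what makes the exposure estimate sparse and interpretable.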
Short Abstract: Human DNA sequencing studies are often compromised by mix-ups during either sample preparation or data management, making checks for mislabeled samples a necessary routine quality control step. We present ClearUp, a method and software package for sample identity validation from BAM files. By selecting a set of common population SNPs shared across the input with sufficient sequencing quality, ClearUp builds "SNP fingerprints" and uses them to determine relatedness, ancestry and sex. The user-friendly web-based interface allows users to review and refine the results using an interactive dendrogram and a built-in genome browser. The method works across different types of sequencing data, including WGS, WES, RNA-seq, and targeted sequencing, as long as the input targets overlap. We demonstrate that SNP fingerprints provide enough variation to accurately detect mislabeled and related samples. In contrast to similar tools, ClearUp is undemanding in terms of input and does not require any data pre-processing, taking only files in BAM format. The tool is open source and available on GitHub at https://github.com/AstraZeneca-NGS/Fingerprinting, and provides both a command line interface and a Flask-driven web server with a graphical user interface.
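The fingerprint comparison reduces to measuring genotype concordance over the SNP positions shared by two samples. A minimal sketch, in which the 0/1/2 genotype encoding and the concordance threshold are assumptions rather than ClearUp's actual parameters:

```python
def concordance(fp_a, fp_b):
    """Fraction of shared SNP positions with identical genotype calls.

    fp_a, fp_b: dicts mapping SNP id -> genotype (0/1/2 = hom-ref/het/hom-alt).
    Only positions covered in both samples are compared. Sketch only.
    """
    shared = fp_a.keys() & fp_b.keys()
    if not shared:
        return 0.0
    return sum(fp_a[s] == fp_b[s] for s in shared) / len(shared)

def same_individual(fp_a, fp_b, threshold=0.9):
    """Flag two fingerprints as likely coming from the same individual."""
    return concordance(fp_a, fp_b) >= threshold
```

Pairwise concordances like these are what a clustering dendrogram of samples would be built from.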
Short Abstract: In RNA-Seq experiments, up to 20 percent of reads can be observed mapping to intronic regions, and there are numerous hypotheses about the true origin of these reads. Whether reads mapping to non-coding regions should be considered noise, and consequently ignored, is a subject of recent debate, as RNA-Seq aims at estimating protein abundance. This study was conducted to determine whether incorporating reads that map to non-coding regions into RNA-Seq transcript abundance estimation improves the accuracy of differential expression analysis compared to standard pipelines. To adjust transcript counts for these so-called non-coding reads, we estimate for each transcript the abundance of non-coding RNA fragments and subtract it from the original transcript abundance. To estimate the reduction rate, we test a brute-force method and a binning approach that corrects for 5’ to 3’ read distribution bias. We assess the accuracy of differential gene expression analysis using corrected read counts on synthetic data sets mimicking well-known RNA-Seq biases. Correcting for RNA-Seq reads possibly originating from non-coding elements improves reproducibility between replicates, and the accuracy of differential gene expression analysis on synthetic data sets was significantly improved compared to results obtained with HTSeq read counts. Testing our approach on a freely available experimental data set, we could increase the number of detected differentially expressed genes. An R package will be available soon.
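The subtraction step itself is simple once a non-coding background rate has been estimated. The uniform per-base background used below is an assumed stand-in for the paper's brute-force and binning estimators:

```python
def corrected_counts(transcript_counts, noncoding_rate, transcript_lengths):
    """Subtract an estimated non-coding background from transcript counts.

    noncoding_rate: estimated non-coding fragments per base, e.g. derived
    from intron-mapping reads (illustrative stand-in for the brute-force /
    binning estimators described above). Counts are floored at zero.
    """
    return {tx: max(0.0, n - noncoding_rate * transcript_lengths[tx])
            for tx, n in transcript_counts.items()}
```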
Short Abstract: Correction of errors is often a necessary step in the analysis of high-throughput sequence data. We previously developed the general-purpose error correction software Pollux, which is highly effective at correcting substitution, insertion and deletion errors, including homopolymer repeat errors. The software is effective at identifying errors in Illumina and Ion Torrent data, and can be applied to single- or mixed-genome data sets while remaining sensitive to low-coverage areas of sequencing projects. Using published data sets, we demonstrate error correction accuracy greater than 94% for Illumina data and 88% for Ion Torrent data. Here we present the updated version of the software, Pollux 2.0, which further increases error correction rates while greatly reducing the occurrence of introduced errors (false positives). The new version implements a nearest-neighbor-based algorithm to estimate the rates of substitution, insertion, deletion, and homopolymer repeat errors directly from k-mer data. The algorithm is used to set data-dependent thresholds for the correction of the different error types, and is able to distinguish sequencing errors from low-frequency variants in data sets. The new version also implements a more efficient memory management system, reducing memory requirements by up to six-fold. Here, a compressed hash table is used to store k-mer count data, where each k-mer is represented by an index and a remainder using an XOR-based hash function. The index is dual-purpose: it locates the k-mer data within the hash table and also represents approximately half of each k-mer sequence. A user-specified memory limit is also implemented, so that if available memory is limited, the hash table is written to disk and a new hash table is constructed for additional reads.
The in-memory representation of k-mers is further reduced by omitting low-frequency k-mers and blank entries in the hash tables, providing a dense representation of k-mer frequencies and permitting large data sets to be analyzed on a single workstation. The updated software is highly effective at correcting errors across platforms, and provides general-purpose error correction that may be used in applications with or without assembly.
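The index/remainder idea—storing only half of each k-mer explicitly, with the hash-table slot index encoding the other half—can be sketched as follows. The specific XOR fold is an assumption for illustration, not Pollux's actual hash function:

```python
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    code = 0
    for b in kmer:
        code = (code << 2) | BASES[b]
    return code

def xor_fold(code, bits):
    """Invertible XOR mix of the high half into the low half.

    Because the shift is half the width, applying the fold twice
    recovers the original value (it is its own inverse)."""
    return code ^ (code >> (bits // 2))

def split(kmer):
    """Split a k-mer into (index, remainder).

    The index both locates the hash slot AND encodes roughly half of
    the k-mer, so only the remainder needs to be stored in the slot."""
    bits = 2 * len(kmer)
    h = xor_fold(encode(kmer), bits)
    half = bits // 2
    return h & ((1 << half) - 1), h >> half
```

Because the fold is invertible, the full k-mer can be reconstructed from the slot index plus the stored remainder, halving the per-entry storage.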
Short Abstract: Genome-wide association studies (GWAS) have been highly successful in identifying genetic variants associated with risk for common diseases. The majority of the phenotype-associated SNPs identified in GWAS are in non-coding, regulatory regions. Despite this fact, existing functional studies fall short of exploiting the regulatory impact of variants: most methods either do not go beyond positional overlap of annotated regulatory regions and associated variants, or they integrate other types of molecular readouts such as eQTLs or tfQTLs. While the overlap-based approaches cannot assess the actual impact of a variant on regulatory elements, the integration methods need additional data which are not always available. Here we describe DeepWAS, a new approach in which the phenotype-genotype link is interrogated in a cell line and transcription factor specific manner via multilocus regression models, using the regulatory features of variants predicted by the deep learning method DeepSEA. DeepWAS, combining classical GWAS with deep learning-based functional variant annotation, has potential as a powerful tool to uncover disease mechanisms, including relevant cell types, for common disorders.
Short Abstract: Functional metagenomics is used to understand who is doing what in microbial ecosystems. DNA sequencing can be prioritized by activity-based screening of libraries obtained by cloning and expressing metagenomic DNA fragments in a heterologous host. When large-insert libraries are used, allowing direct access to the functions encoded by entire metagenomic loci spanning several dozen kbp, NGS is required to identify the genes responsible for the screened function. The pipeline presented here allows biologists to easily assemble, clean and annotate their NGS sequences. It has been set up in Galaxy as two tools which can easily be chained into a pipeline. The first produces the cleaned assemblies and their metrics from the sequencing reads. It provides users with a table containing links to the files from the assembly and vector-cleaning steps, as well as an interactive graph showing contig depth and length for all metagenomic inserts, enabling quick assembly validation. Another output of this module is a compressed file including the metagenomic insert sequences after assembly and cleaning, in FASTA format. The second tool generates an annotation table including Metagene ORF finding and BLAST annotation against the nr, Swiss-Prot and COG databases, which helps biologists perform functional and taxonomic annotation. The table includes links to the alignment files, enabling precise analysis. The poster presents the results for simulated and real read sets.
Short Abstract: Antibodies are proteins of the immune system that tag noxious molecules for elimination. They can be adjusted to bind with high affinity and specificity to a target molecule. This property has been extensively exploited in biopharmaceuticals, diagnostics and research agents. Their binding malleability arises from their diversity (>10^10 possible sequences). The advent of next-generation sequencing (NGS) has made it possible to produce snapshots of this diversity. In this poster we describe our work with a large NGS dataset comprising 13.5 million heavy and light chains from ~500 individuals. By studying this dataset we aim to establish a set of descriptors that will allow us to formally interrogate the properties of immune repertoires. The descriptors we have explored include the lengths of the complementarity determining regions (CDRs), gene usage, and amino acid distributions. One feature we have identified in the data is that 6-7% of H3 loops in our dataset contain a cysteine pair motif. H3 length correlates with the proportion of cysteine residues inside the loop (Pearson correlation, R2 = 0.89). Two of the most common motifs enclosed by a cysteine pair are 5 and 6 amino acids long. Our analysis of amino acid distribution reveals that both motifs have a strong preference for tyrosines in the flanking positions, although the amino acid distributions inside each motif are distinct. This knowledge is one example of how NGS can be used to rationally design antibodies in novel ways.
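The reported length-versus-cysteine relationship is a plain Pearson correlation, which for paired observations can be computed directly (the loop-length and proportion inputs below are placeholders, not the poster's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired numeric series,
    e.g. H3 loop lengths vs. cysteine proportions. R^2 is simply r**2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```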
Short Abstract: Endogenous retroviruses invade the host genome and are then vertically transmitted from parents to offspring. The koala retrovirus (KoRV) is currently invading the genome of Phascolarctos cinereus. By investigating KoRV, we can study the endogenization process of an infectious virus in real time. We conducted different studies to examine insertion sites in ancient DNA samples (Cui, P. et al. Comprehensive profiling of retroviral integration sites using target enrichment methods from historical koala samples without an assembled reference genome. PeerJ 4, e1847 (2016)), in samples from wild and zoo animals, and in comparisons of cancer and control tissues. We found that integration sites were rarely shared between animals and observed recombination between two viruses, which may result in a reduced prevalence of the virus.
Short Abstract: Long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore have dramatically increased achievable read length, with reads routinely exceeding 10 kbp. These reads are crucial for resolving ambiguities when mapping and assembling reads from repetitive genomes such as human, with consequences for many applications, from closing gaps in the reference to mapping the structural variation that underlies many human diseases. However, the increased read length comes at the cost of a significantly higher error rate. Despite continuing improvements, read mappers and assemblers designed for short-read technologies such as Illumina, where indels are almost nonexistent, struggle to map these reads accurately, or at all. We describe six very simple "squash" transformation functions that can be applied to any DNA sequence to produce a smaller sequence, on average one quarter of the input size, and which have a useful "indel-tolerant" property: on average, 75% of deletions and 62.5% of insertions leave the result of the transformation unchanged. The transformations are fast, streamable and in-place. When applied as preprocessing before k-mer lookup, they can be viewed as a form of gapped k-mers, but with sequence-dependent gaps. We show that two of the functions significantly improve the accuracy of the initial k-mer lookup phase of read mapping: for simulated PacBio data, transforming both reference sequence and reads before mapping k-mers yields relative support scores higher than baseline for at least 76% of the reads, suggesting that these transformations have the potential to simultaneously improve both the speed and accuracy of long-read mapping.
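The six transformation functions themselves are not given in the abstract. As a generic illustration of the "indel-tolerant" idea only, the sketch below uses homopolymer compression, a well-known squash-like transform under which an indel inside a run of identical bases leaves the output unchanged; it does not reproduce the specific size-reduction or tolerance figures quoted above.

```python
def squash(seq: str) -> str:
    """Homopolymer compression: collapse runs of identical bases."""
    out = []
    for base in seq:
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)

# An insertion or deletion inside a homopolymer run -- a common error
# mode of long-read platforms -- leaves the transform unchanged:
print(squash("AAACGGT"))   # → ACGT
print(squash("AACGGGGT"))  # → ACGT  (indel variants squash identically)
```

Because the transformed reference and read agree wherever errors fall inside runs, k-mer lookups on the squashed sequences tolerate those indels for free.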
Short Abstract: 4C-seq is a method to identify chromosomal contact partners for one chosen position in the genome. Since the 3C-based technique inherently produces output with a fragment structure and suffers from technical artifacts like PCR bias, most current 4C-seq algorithms use windows or smoothing techniques to decrease noise, introducing arbitrary window sizes or smoothing parameters. This leads to the general problem of parameter choice for optimal analysis and visualization. We present the R package Scale4C, which uses Witkin's scale-space filtering approach to create a novel multi-scale 4C-seq near-cis visualization. This representation of the data allows for exploratory analysis of candidate interactions and structural comparison of datasets. During scale-space filtering, the 4C-seq signal is smoothed with Gaussian kernels of increasing smoothing factors. Inflection points of the resulting curves are subsequently tracked in a so-called fingerprint map, and singular points of these curves are identified. Focusing on features of the data ('peaks' and 'valleys') and their transitions across multiple smoothing parameters, the package's plot functions can create 2D tessellation maps in scale-space from these singularities. Tessellation maps allow prominent features of the 4C-seq signal, and thus potential interactions, to be assessed visually with a high degree of stability across multiple smoothing factors. Additional functions of the package include further visualization routines for data smoothed with a chosen smoothing factor and its corresponding inflection points, and plot functions for fingerprint maps with their traced singularities. Data import from BED files and Basic4Cseq is supported, as well as export of the scale-space tessellation in tabular form.
Short Abstract: Sequencing simulators are a useful tool for testing the accuracy of algorithms and their robustness to sequencing errors. Despite being an exceptional source of information about the history of past populations, ancient DNA data are characterized by a series of idiosyncrasies such as extensive fragmentation, damage and contamination, all of which can influence downstream analyses. We present gargammel, a package to simulate sequencing reads from a set of user-provided reference genomes. The package simulates the entire molecular process, including post-mortem DNA fragmentation, DNA damage, experimental sequencing errors and GC bias, as well as potential bacterial and present-day human contamination. We present two case studies to illustrate the capabilities of our software and how it can be used to assess the validity of specific read-alignment procedures and inferences of past population histories. First, we evaluate the impact of present-day human contamination on admixture analyses for hominin species. Second, we present the impact of microbial contamination on ancient DNA alignments to the human reference genome. The package is publicly available on GitHub (https://grenaud.github.io/gargammel/) and released under the GPL.
Short Abstract: T-cell acute lymphoblastic leukemia (T-ALL) comprises 25% of all ALL cases and primarily affects children. We aimed to investigate the heterogeneity of T-ALL patient samples and identify the order of mutation acquisition during leukemia evolution. We performed targeted DNA sequencing and RNA sequencing on 200-400 single cells from 4 human T-ALL samples. Whole genome sequencing of the bulk diagnostic samples was used to identify the spectrum of genomic lesions present in the major diagnostic clone(s). We then used targeted sequencing of about 20 genomic lesions and 40 heterozygous SNPs (for quality control) in the single leukemia cells. Cells were discarded from analysis if locus and allelic drop-out exceeded 33.3%. Of the 4 patients analysed, two exhibited a single homogeneous leukemic cell population, with most cells carrying all mutations. In the other two patients we observed two distinct subclones with different mutation loads. A graph-based algorithm was developed to determine the order in which mutations were acquired; it showed that most chromosomal translocations were early events in leukemia development, while NOTCH1 mutations were typically late events. Single-cell RNA-sequencing analysis was performed on the 10x Genomics platform and revealed limited heterogeneity at the level of gene expression, with the major discriminating factor being cell cycle effects. In conclusion, our novel graph-based algorithm for single-cell sequence data was able to provide new information on the order in which mutations accumulate during T-ALL development.
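The graph-based ordering algorithm is not spelled out in the abstract. One minimal way to infer a temporal order from single-cell presence/absence calls is to add an edge a→b whenever every cell carrying mutation b also carries a (but not vice versa), then topologically sort the resulting graph. The sketch below uses an invented function and toy data and is not the authors' implementation.

```python
from itertools import permutations

def mutation_order(cells):
    """cells: list of sets of mutation names observed per cell.
    Infer 'a precedes b' when every cell carrying b also carries a,
    but not every cell carrying a carries b; topologically sort."""
    muts = set().union(*cells)
    edges = set()
    for a, b in permutations(muts, 2):
        with_b = [c for c in cells if b in c]
        with_a = [c for c in cells if a in c]
        if with_b and all(a in c for c in with_b) and not all(b in c for c in with_a):
            edges.add((a, b))
    # Kahn's algorithm for topological sorting
    order, indeg = [], {m: 0 for m in muts}
    for _, b in edges:
        indeg[b] += 1
    queue = sorted(m for m in muts if indeg[m] == 0)
    while queue:
        m = queue.pop(0)
        order.append(m)
        for a, b in list(edges):
            if a == m:
                edges.remove((a, b))
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
        queue.sort()
    return order

# Toy example: the translocation is present in every NOTCH1-mutant
# cell but not vice versa, so it is inferred to be the earlier event.
cells = [{"t(1;14)"}, {"t(1;14)", "NOTCH1"}, {"t(1;14)", "NOTCH1"}]
print(mutation_order(cells))  # → ['t(1;14)', 'NOTCH1']
```

Real data would additionally require tolerating allelic drop-out, e.g. by relaxing the "every cell" condition to a high fraction of cells.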
Short Abstract: Next generation sequencing (NGS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. This approach is currently limited to prokaryotic communities and communities of few eukaryotic species with sequenced genomes. For eukaryotes the analysis is hindered mainly by the lack of appropriate reference databases for inferring community composition. In this study, we focus on the development of a tool (TaxMapper) for reliable mapping to a microeukaryotic reference database, together with a comprehensive analysis workflow. We focus on the assignment of higher taxonomic groups and therefore collected publicly available genomic and transcriptomic sequences from NCBI, the Marine Microbial Eukaryote Transcriptome Sequencing Project and JGI. A total of 143 references were selected such that the taxa represent the main lineages within each of the seven supergroups of eukaryotes and possess predominantly complete transcriptomes or genomes. TaxMapper assigns taxonomic information to each NGS read by mapping it to the database and filtering low-quality assignments. To this end, a logistic regression classifier was trained and tested on sequences in the database, sequences of taxa related to those in the database, and randomly generated reads. TaxMapper is part of a metatranscriptome Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation, and (multivariate) statistical analysis including environmental data. The workflow is provided and described in detail to empower researchers to easily apply it for metatranscriptome analysis of any environmental sample.
Short Abstract: The analysis of Next-generation sequencing (NGS) data remains a major obstacle to the efficient utilization of the technology. While substantial effort has been invested in the development of software dedicated to the individual analysis steps of NGS experiments, insufficient resources are currently available for integrating the individual software components within the widely used R/Bioconductor environment into automated workflows capable of running the analysis of most types of NGS applications from start to finish in a time-efficient and reproducible manner. To address this need, we have developed the R/Bioconductor package systemPipeR. It is an extensible environment for both building and running end-to-end analysis workflows with automated report generation for a wide range of NGS applications. Its unique features include a uniform workflow interface across different NGS applications, automated report generation, and support for running both R and command-line software on local computers and computer clusters. A flexible sample annotation infrastructure efficiently handles complex sample sets and experimental designs. To simplify the analysis of widely used NGS applications, the package provides pre-configured workflows and reporting templates for RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. Additional workflow templates will be provided in the future. systemPipeR accelerates the extraction of reproducible analysis results from NGS experiments. By combining the capabilities of many R/Bioconductor and command-line tools, it makes efficient use of existing software resources without limiting the user to a set of predefined methods or environments. systemPipeR is freely available for all common operating systems from Bioconductor (http://bioconductor.org/packages/devel/systemPipeR).
Short Abstract: Cancer metastasis is a multistage process that moves tumor cells to distant locations and is the main cause of mortality and morbidity in cancer patients. Despite remarkable advances in understanding the causes and treatment of cancer over the past few decades, the molecular mechanisms underlying the invasion and metastasis of cancer cells remain unclear. Meanwhile, recent advances in high-throughput sequencing technologies have revealed that many non-coding RNAs play important roles in a diversity of biological processes, and that dysregulation of these non-coding RNAs may cause many acute diseases and cancers. Many non-coding RNAs involved in tumor invasion have also been identified, such as long non-coding RNAs (lncRNAs) that promote cancer metastasis. A more comprehensive view of the transcriptomic regulation of tumour-cell invasion and migration is therefore required, especially of the competing endogenous RNA (ceRNA) network composed of these mRNAs and non-coding RNAs. In this work, we performed expression analysis on microarray and RNA-seq datasets of breast cancer and integrated multiple omics data (including PPI, TF-gene, microRNA-mRNA, microRNA-lncRNA and microRNA-circRNA interaction data) to identify the key regulators and targets during cancer migration. Combining these results with regulator-target pair information, we constructed a multi-level regulatory network. Finally, applying network and functional analysis, we identified the module most relevant to cancer migration and invasion, which may be helpful for the prevention and treatment of metastatic breast cancer.
Short Abstract: In the past years, Next Generation Sequencing has been utilized in time-critical applications such as pathogen diagnostics, with promising results. Yet, long turnaround times had to be accepted to generate sufficient data, as the analysis was performed sequentially after the sequencing had finished. In addition, the interpretation of results can be hindered by various types of contamination, clinically irrelevant sequences, and the sheer amount and complexity of the data. We designed and implemented a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples up to five days before the sequencing procedure is even finished. To achieve this, we adapted the core algorithm of HiLive, a real-time read mapper, while enhancing its accuracy for our use case. Furthermore, common contaminations, low-entropy areas, and sequences of widespread, non-pathogenic organisms are automatically marked beforehand, using NGS datasets from healthy humans as a baseline. The results are visualized in an interactive taxonomic tree, providing the user with several measures regarding the relevance of each identified potential pathogen. We applied the pipeline to a human plasma sample spiked with Vaccinia virus, Yellow fever virus, Mumps virus, Rift Valley fever virus, Adenovirus and Mammalian orthoreovirus, which was then sequenced on an Illumina HiSeq. All spiked agents could be detected after only 12% of the complete sequencing procedure. While we also found a large number of other sequences, these are correctly marked as clinically irrelevant in the resulting visualization, allowing the user to obtain a correct assessment of the situation at first glance.
Short Abstract: The revolution in next-generation sequencing (NGS) technologies has enabled a step-change in the way that sequence data are collected and used in biology, including in metagenomics, the sequencing of mixed-source nucleic acid samples. These studies have profound implications for human, animal and plant health and disease, as well as for diverse areas such as forensic science, environmental pollution monitoring and climate modelling. The increasing quantity of metagenomic sequence data being generated and the diversity of its application areas require highly optimised and computationally scalable solutions to process and interpret these data. Yet, there is no standardised way of evaluating the accuracy of methods that assign these sequences to taxonomies. We present a comparative evaluation of metagenomic analysis methods in which we use sequence simulators to generate gold-standard data against which to benchmark the efficacy of the methods. We use our method to develop an approach to estimating errors in taxonomic sequence assignment by perturbing the underlying taxonomic trees used in our simulations. Our method demonstrates the strong dependence of taxonomic classification success and accuracy on the information present in the reference database and the methods used for classification. We also present an evaluation of the relative importance of different regions of the 16S rRNA marker gene in taxonomic assignment for metagenetic studies.
Short Abstract: Recent advances in next-generation sequencing technologies and genome assembly algorithms have enabled the accumulation of a huge volume of genome sequences from various species. This trend has provided new opportunities for large-scale comparative genomics, together with an unprecedented burden in handling large-scale genomic data. Identifying and utilizing synteny blocks, which are conserved genomic regions among multiple species, is the key step for large-scale comparative genomics, such as comparing the genomes of multiple species, reconstructing ancestral genomes, and revealing evolutionary changes in genomes and their functional consequences. However, the construction of synteny blocks is very challenging, especially for biologists unfamiliar with bioinformatics, because it requires the systematic comparison of whole-genome sequences of multiple species. To alleviate these difficulties, we recently developed a web-based application, called Synteny Portal, for constructing, visualizing, and browsing synteny blocks. Synteny Portal can be used to (i) construct synteny blocks among multiple species by using prebuilt alignments, (ii) visualize and download syntenic relationships as high-quality images, such as Circos plots, (iii) browse synteny blocks with genetic information, and (iv) download the raw data of synteny blocks for use as input for downstream synteny-based analyses. It also provides an intuitive and easy-to-use web-based interface. In addition, a stand-alone version of Synteny Portal is under development, which will allow users to construct their own Synteny Portal instance. Synteny Portal will play a pivotal role in promoting the use of large-scale comparative genomic approaches. Synteny Portal is freely available at http://bioinfo.konkuk.ac.kr/synteny_portal/.
Short Abstract: Rala is a standalone layout module intended for assembly of raw reads generated by third generation sequencing platforms. It consists of two parts: read preprocessing inspired by HINGE (Kamath et al, 2017), and assembly graph simplification as described in Miniasm (Li, 2016). In preprocessing, coverage graphs are generated from pairwise mappings using Minimap (Li, 2016) and are used to detect chimeric reads as well as reads from repetitive regions. Afterwards, the assembly graph is simplified with transitive reduction, trimming, bubble popping and a heuristic which untangles leftover junctions in the graph. As a side result, we show that the percentage of chimeric reads produced by the Pacific Biosciences and Oxford Nanopore Technologies platforms is correlated with read length.
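The intuition behind coverage-graph-based chimera detection is that a junction joining two unrelated genomic segments is supported by few or no overlapping reads, so per-base coverage collapses in the read's interior. The toy sketch below illustrates only that idea; the function name, thresholds and data are invented and do not reflect Rala's actual implementation.

```python
def is_chimeric(coverage, min_cov=3, flank=50):
    """Flag a read as chimeric if its per-base coverage profile
    (computed from pairwise overlaps) drops below min_cov somewhere
    in the read's interior, away from the naturally low-coverage ends."""
    interior = coverage[flank:len(coverage) - flank]
    return any(c < min_cov for c in interior)

# A read whose middle is unsupported by any overlap is likely chimeric:
clean = [10] * 200
chimera = [10] * 100 + [0] * 5 + [10] * 100
print(is_chimeric(clean), is_chimeric(chimera))  # → False True
```

A real implementation would additionally distinguish chimeric junctions from repeat boundaries, where coverage rises rather than falls.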
Short Abstract: Background: Advances in technology that have lowered the cost of generating gene expression data from large numbers of samples have led to the development of "Big Data" approaches to analyzing gene expression in basic and biomedical systems. That said, such data still comprise relatively small numbers of samples and tens of thousands of variables/genes. Different techniques have been proposed for searching these gene spaces to select the most informative genes that can accurately distinguish one class of subjects/samples from another. We describe a new approach for selecting significant clusters of genes using recursive cluster elimination (RCE) based on an ensemble clustering approach, called Support Vector Machine RCE-Ensemble Clustering (SVM-RCE-EC), which improves on the traditional SVM-RCE approach. We present results comparing the performance of SVM-RCE-EC with different methods applied to the same datasets. Results: SVM-RCE-EC uses an ensemble-clustering method to identify clusters that are robust. Support Vector Machines (SVMs) with cross-validation are first applied to score (rank) clusters of genes by their contribution to classification accuracy. Recursive cluster elimination (RCE) is then applied to iteratively remove the gene clusters that contribute least to classification performance. SVM-RCE-EC searches the cluster space for the clusters most significantly differentially expressed between two classes of samples. Utilizing gene clusters via the ensemble method enhances the accuracy of the classifier compared to SVM-RCE and similar methods. Conclusions: SVM-RCE-EC outperforms or is comparable to other methods. A further advantage of SVM-RCE-EC is that the number of clusters is determined by the ensemble approach, capturing the real structure of the data, rather than being defined by the user as in SVM-RCE (k-means). Moreover, we show that the clusters generated by SVM-RCE-EC are more robust. Availability: The Matlab version of SVM-RCE-EC is available upon request to the first author.
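The iterative scoring-and-elimination loop described above can be sketched generically. In the sketch below, a precomputed dictionary stands in for the SVM cross-validation accuracy of each gene cluster; the function name, cluster names and drop fraction are illustrative only, not the Matlab implementation.

```python
def rce(clusters, score, keep=1, drop_frac=0.3):
    """Recursive cluster elimination (sketch): repeatedly rank gene
    clusters by their contribution to classification accuracy and drop
    the worst-scoring fraction until `keep` clusters remain.
    `score(cluster)` stands in for SVM cross-validation accuracy."""
    clusters = list(clusters)
    while len(clusters) > keep:
        ranked = sorted(clusters, key=score, reverse=True)
        n_drop = max(1, int(len(ranked) * drop_frac))
        clusters = ranked[:-n_drop]
    return clusters

# Toy scorer: pretend each cluster's CV accuracy is precomputed.
acc = {"c1": 0.91, "c2": 0.55, "c3": 0.78, "c4": 0.60}
print(rce(acc, score=acc.get, keep=2))  # → ['c1', 'c3']
```

In the full method the score is recomputed after each elimination round, since removing one cluster changes the feature space seen by the SVM.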
Short Abstract: New technologies enabling the measurement of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 20% to 40% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Here we propose a Bayesian hierarchical method to share information across cells and quantify spatially varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). The method clusters individual cells based on genome-wide methylation patterns, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation at unassayed CpG sites, enabling transfer of information between individual cells. We show that by jointly learning the posterior distribution of all parameters of interest, the proposed model is more robust and allows the sharing of information across cells to improve its imputation accuracy on both simulated and real data sets.
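The actual model is Bayesian and hierarchical; purely as intuition for how clustering shares information across cells, the sketch below assigns a sparsely covered cell to its best-matching cluster profile and fills unassayed CpGs from that profile. All names and data are hypothetical and this hard-assignment scheme is far simpler than the proposed method.

```python
def impute(cell, clusters):
    """Impute missing CpG methylation states (None) for one cell by
    borrowing from the best-matching cluster mean profile.
    `cell` is a list of 0/1/None; `clusters` maps name -> mean profile."""
    def agreement(profile):
        # Fraction of the cell's observed CpGs matching the profile.
        pairs = [(c, p) for c, p in zip(cell, profile) if c is not None]
        return sum(1 for c, p in pairs if round(p) == c) / len(pairs)
    best = max(clusters.values(), key=agreement)
    return [c if c is not None else round(best[i]) for i, c in enumerate(cell)]

clusters = {"A": [0.9, 0.8, 0.1, 0.2], "B": [0.1, 0.2, 0.9, 0.8]}
cell = [1, None, 0, None]          # only 2 of 4 CpGs covered
print(impute(cell, clusters))      # → [1, 1, 0, 0]
```

The Bayesian treatment replaces this hard assignment with posterior cluster probabilities, which is what yields the reported robustness.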
Short Abstract: Mosquitoes are vectors of numerous human pathogens that cause enormous public health problems, yet the splice isoforms of gene transcripts in these vector species are poorly curated. IsoPlot is a publicly available database with visualization tools for the exploration of alternative splicing events in three major mosquito species, Aedes aegypti, Anopheles gambiae, and Culex quinquefasciatus, and one model insect, the fruit fly Drosophila melanogaster. IsoPlot includes annotated transcripts and 17,037 newly predicted transcripts derived from extensive transcriptome data across insect life stages. The interactive web interface allows users to explore the patterns and abundance of isoforms under different experimental conditions, as well as cross-species sequence comparison of orthologous transcripts.
Short Abstract: Ra is a novel de novo genome assembler based on the Overlap-Layout-Consensus paradigm, tailored for reads produced by third generation sequencing platforms. It integrates the previously developed Minimap (Li, 2016) overlap and Racon (Vaser et al, 2017) consensus tools with a newly developed layout module, Rala, into one package. Omitting time-consuming error correction in the preprocessing step enables fast genome assembly while maintaining high accuracy. The results achieved on several read datasets generated by the Pacific Biosciences sequencing platform are comparable with those of the similar de novo assemblers HINGE (Kamath et al, 2017) and Miniasm+Racon in contiguity, accuracy and running time.
Short Abstract: Modern high-throughput single-cell technologies facilitate the efficient processing of hundreds of individual cells to comprehensively study their morphological and genomic heterogeneity. Fluidigm's C1 Auto Prep system isolates fluorescence-stained cells into specially designed capture sites, generates high-resolution image data and prepares the associated cDNA libraries for mRNA sequencing. Existing methods for downstream analysis, such as Monocle and Oscope, sort and classify cells using single-cell RNA-seq expression data and do not take advantage of the important information carried by the images themselves. We propose a novel statistical model whose multiple steps are integrated into the Cell OrderiNg (by) FluorEScence Signal (CONFESS) R package. CONFESS performs image analysis and fluorescence signal estimation for data coming from the Fluidigm C1. It collects extensive information on cell morphology, location and signal that can be used for quality control and phenotype prediction. Where applicable, it normalizes and uses the signals for unsupervised cell ordering (pseudotime estimation) and 2-dimensional clustering via scalar projection, change-point analysis and the Data-Driven Haar-Fisz transformation for multivariate data. One could potentially use CONFESS to classify and sort fluorescent cells in various applications (cell cycle, cell differentiation, etc.). Here we illustrate the use of CONFESS to trace Fucci-labeled HeLa cells through cell cycle progression. The output can be easily integrated with available single-cell RNA-seq (or other) expression-profile packages for subsequent analysis.
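CONFESS's actual ordering uses scalar projection, change-point analysis and the Data-Driven Haar-Fisz transform. Purely as an illustration of ordering cells by a two-dimensional fluorescence signal, the sketch below sorts cells by their angle around the population centroid of the (red, green) Fucci intensities; the function and toy data are hypothetical, not the package's algorithm.

```python
import math

def pseudotime_order(signals):
    """Order cells along a cyclic progression by the angle of their
    (red, green) fluorescence signal around the population centroid --
    a simplified stand-in for image-signal-based pseudotime ordering."""
    cx = sum(r for r, g in signals) / len(signals)
    cy = sum(g for r, g in signals) / len(signals)
    angles = [math.atan2(g - cy, r - cx) for r, g in signals]
    return sorted(range(len(signals)), key=lambda i: angles[i])

# Four hypothetical cells at successive phases (red-high -> green-high):
cells = [(9.0, 1.0), (5.0, 5.0), (1.0, 9.0), (1.0, 1.0)]
order = pseudotime_order(cells)
print(order)  # → [3, 0, 1, 2]
```

The recovered order (low-signal, red, mixed, green) matches the cyclic progression the Fucci reporter encodes, which is the information an expression-only method would not see.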
Short Abstract: The characterization of microbial communities based on sequencing and analysis of their genetic information has become a popular approach also referred to as metagenomics; in particular, recent advances in sequencing technologies have enabled researchers to study even the most complex communities, consisting of thousands of species. Metagenome analysis, the assignment of sequences to taxonomic and functional entities, however, remains a tedious task, as large amounts of data need to be processed. A number of approaches aim to solve this problem for particular aspects, but scientific questions are often too specific to be answered by a general-purpose method. We developed MGX, an extensible framework for the management and analysis of unassembled metagenome datasets. MGX is a client/server application providing a comprehensive set of predefined analysis pipelines, including recent tools like Kraken (Wood et al, 2014) and Centrifuge (Kim et al, 2016). MGX allows users to include their own data sources and to devise custom analysis pipelines based on the Conveyor workflow engine (Linke et al, 2011). All analysis tasks are executed on the server infrastructure, so no extensive compute resources need to be provided by the user. The intuitive and easy-to-use graphical user interface is available for all major operating systems (Windows, Linux, Mac OS X) and allows users to create interactive, high-quality charts based on taxonomic and functional profiling results. References: Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15:R46 (2014). Kim D, Song L, Breitwieser FP, Salzberg SL: Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Research 26:1721-1729 (2016). Linke B, Giegerich R, Goesmann A: Conveyor: a workflow engine for bioinformatic analyses. Bioinformatics 27(7):903-911 (2011).
Short Abstract: High-throughput sequencing has made it possible to dissect the immune repertoire at higher resolution, deepening our understanding of the adaptive immune system. Significant insights can be gained into various states such as cancer, autoimmune conditions, infection and the aging process, and such data can help uncover the underlying mechanisms of immunity in health and disease. As sequencing technologies progress, ever larger amounts of data are generated, requiring sophisticated analysis methods. Various tools have been developed for immunosequencing (T- and B-cell receptor) analysis, and progress in this area is ongoing. The main aim of analysis is to unravel the diversity of the immune system and perform composition profiling to obtain clinically relevant information. In T-cell receptor analysis, the general strategy includes determining gene segments by aligning sequences to a reference set, identifying clonotypes, detecting the complementarity determining region 3 (CDR3), and estimating their abundance. However, there is still a clear lack of guidelines and consensus about the features crucial for reliable data analysis. To provide a practical comparison of the computational methods for immune repertoire analysis, we conducted an in-depth and systematic comparative study of eight available methods. We employed numerous in silico and experimental datasets to thoroughly assess each approach with respect to various analysis factors. Moreover, a clonal plane analysis strategy is used to perform clonality analysis of the samples under investigation. In addition, we describe in detail the substantial effects of the choice of analysis method on interpretation and outcome. Our study will help researchers in this field select an optimal analysis method and provides basic evaluation-based guidelines.
Short Abstract: Reaching valid conclusions from DNA sequencing requires accurate data and a thorough understanding of that data, yet we – a bioinformatics facility based at the Babraham Institute, Cambridge UK – all too often see researchers misinterpreting artefacts as genuine results, consequently presenting incorrect findings and wasting time chasing false leads. To help researchers become aware of the key technical issues affecting their experiments, we developed the QC Fail website. When we encounter a problem which may be of interest to the wider Life Sciences community, we record our experiences there. Each article typically discusses how a problem was identified; whether we determined the underlying cause; what measures should be taken to ameliorate the problem; and how this experience should shape the planning of future work. We shall also make available example datasets for each problem, which people can download and analyse themselves, or use in the development of processing or QC tools. In addition, these datasets should prove useful for teaching purposes. We hope the QC Fail website will be an invaluable resource to help other scientists plan, perform and analyse experiments. The stories generally focus on sequencing-related matters, but we intend to build on the site, reporting new technology trends as they develop. To visit the website please go to: qcfail.com
Short Abstract: Cancer samples investigated by high-throughput transcriptomics are often highly heterogeneous. Intra-tumour heterogeneity, in conjunction with sample composition variability, may lead to the observation of an unrealistic averaged combination of abundant transcripts, with lowly abundant cells masked by such averaging. However, computational methods can help to separate mixed transcriptional signals. Here we investigated independent component analysis (ICA) as a feature extraction method for sample classification and patient diagnostics. The method was applied to a TCGA RNA-seq melanoma dataset. First, the stability of the ICA deconvolution was improved by performing multiple runs and building consensus signal and mixture matrices. We optimized the number of independent components by minimising the correlation between component weights. Several gene expression metrics were investigated, and FPKM, which showed the most promising results, was chosen for the subsequent analysis. Importantly, unlike PCA, the ICA method yields both gene signatures (signals) and clinical predictors (weight coefficients). Using this information, each component was associated with a biological, technical or clinical factor. The weight coefficients showed a strong statistical linkage with clinical data and were used as input features for a support vector machine classifier. We validated the method by leave-one-out cross-validation, which resulted in 91% accuracy for classifying melanoma subtypes (immune, keratin or MITF-low). Interestingly, several identified components were strong predictors of patient survival. Next, we used the method to successfully classify a new patient. Thus, the proposed data-driven method not only improves patient classification, but also provides prognostic information.
Short Abstract: Discovering cancer-specific genetic markers or signatures is extremely important for uncovering disease mechanisms and predicting the effect of treatments. In particular, since pathway activities and their associations differ by cancer type, exploring notable cancer-specific associations among pathway activities is of interest. In this study we investigate the activities of pathways found in a specific type of cancer and identify their distinct associations as cancer-specific signatures. To this end, we use RNA-seq data to define the activity level of pathways for each cancer type and apply association rule mining to find pathway associations. Specifically, we find sets of pathways that are frequently active together in the gene expression profiles of a specific cancer type. In addition, we visualize pathway activities and their associations as cancer-specific signatures.
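The support-counting core of such association rule mining can be sketched as below. The pathway names, samples and support threshold are illustrative only; a full implementation would extend this from pairs to larger itemsets (e.g. Apriori) and derive confidence-scored rules.

```python
from itertools import combinations

def frequent_pathway_pairs(samples, min_support=0.6):
    """Find pairs of pathways frequently active together across tumour
    samples. `samples` is a list of sets of active pathways per sample;
    support is the fraction of samples in which both pathways are active."""
    pathways = sorted(set().union(*samples))
    n = len(samples)
    pairs = {}
    for a, b in combinations(pathways, 2):
        support = sum(1 for s in samples if a in s and b in s) / n
        if support >= min_support:
            pairs[(a, b)] = support
    return pairs

# Hypothetical per-sample pathway activity calls for one cancer type:
samples = [{"WNT", "TP53"}, {"WNT", "TP53", "MYC"}, {"WNT", "MYC"},
           {"WNT", "TP53"}]
print(frequent_pathway_pairs(samples))  # → {('TP53', 'WNT'): 0.75}
```

Pairs surviving the support threshold are the candidate co-activity signatures that would then be compared across cancer types.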
Short Abstract: The African mole-rats (Bathyergidae) are a family of subterranean rodents with very unusual physiological traits for mammals. The most famous member of the family is the naked mole-rat (Heterocephalus glaber), which shows several extraordinary phenotypes such as poikilothermy, extreme longevity, cancer resistance and extreme adaptation to low-oxygen environments [Park, Reznick et al., Science 2017]. Additionally, the naked mole-rat and some other Bathyergidae species are insensitive to several noxious substances or algogens (e.g. acid, capsaicin, or mustard oil) [Park et al., PLOS Biology 2008]. This study focuses on understanding the sensory phenotypes of at least 8 African mole-rat species, as these closely related species show different patterns of insensitivity to noxious substances. Recently, a sequence motif in the NaV1.7 ion channel of the naked mole-rat was found to be directly connected to its acid insensitivity [Smith et al., Science 2011]. We sequenced poly-A selected mRNA from multiple tissues of 8 African mole-rat species. As there are no annotated genomes available for most of the species, we performed de novo transcriptome assembly to obtain the protein-coding sequences. We developed a bioinformatic workflow to annotate putatively coding transcripts and exclude contaminating or falsely assembled sequences and chimeras. Using this approach, we were able to identify more than 9,000 unique protein-coding transcripts per species. We also directly compared the protein-coding sequences and transcript levels across species boundaries. Using statistical models correcting for the phylogenetic relationships between species, we were able to robustly identify differentially expressed genes in the species tree. Maximum likelihood methods for phylogenomics yielded insights into differences in selection pressure along the African mole-rat lineage.
This approach allows a multivariate analysis of the relationship between gene expression level, sequence variation and extreme phenotypes across this rodent family.
Short Abstract: For obvious reasons, many tools in bioinformatics have been developed with the human genome in mind. Nonetheless, many other eukaryotic organisms used in biotechnology have quite distinct genome compositions. This has to be taken into consideration during protocol development and software selection. RNA sequencing data of the yeast Komagataella phaffii, a versatile host for recombinant protein production, was analyzed for differential gene expression using two different approaches. Due to two distinct requirements, the data was analyzed using the recent successor of the Tuxedo workflow – HISAT2-StringTie-Ballgown – and the rather new count-based workflow kallisto-DESeq2. Whereas the HISAT2-StringTie workflow uses a genome-guided assembly and can identify new genes and alternative transcripts, with subsequent gene expression analysis in Ballgown, the kallisto workflow is transcript-guided and intended solely for differential expression analysis. To achieve an optimal protocol, the parameters needed to be adjusted according to differences in genome composition, such as smaller intron sizes. Besides the protocol development, comparison and parameter optimisation, the data were analyzed for alternative transcripts. References: Michael I. Love et al., "RNA-Seq workflow: gene-level exploratory analysis [...]", F1000Research 4 (2015); Mihaela Pertea et al., "Transcript-level expression analysis of RNA-seq [...]", Nat Protocols 11.9 (2016); Minoska Valli et al., "Curation of the genome annotation of Pichia pastoris (Komagataella phaffii) CBS7435 [...]", FEMS Yeast Research 16.6 (2016).
Short Abstract: Arrhythmogenic cardiomyopathy (ACM) is a genetic disorder in which the heart muscle is progressively substituted with fibro-fatty tissue, leading to severe ventricular arrhythmias, heart failure and sudden cardiac death. Even though numerous genes are known to be involved in the disease, causal variants cannot be identified in 40% of patients, and identified mutations often have low penetrance, suggesting the involvement of unknown genetic or environmental factors. In fact, recent studies have reported mutations in two different genes, implying digenic inheritance as a disease-causing mechanism. We applied whole exome sequencing to investigate digenic inheritance in two ACM families in which all affected and some healthy individuals were known to carry mutations in PKP2, the gene most commonly mutated in ACM. We determined all genes that harbor variants in affected but not in healthy PKP2 carriers or vice versa. We identified likely candidates in each family by computationally prioritizing these genes and restricting to known ACM disease genes and genes related to PKP2 through protein interactions, functional relationships or shared biological functions. The top candidate in the first family is FRZB, which is located at the border of a known ACM locus and has been previously associated with other cardiac diseases. TTN, the most likely candidate in the second family, is a known ACM gene which, however, has not yet been reported in a digenic disease-causing context. We propose that these variants might impair or modify protein function or structure and may cause ACM in combination with the PKP2 variant.
Short Abstract: NGS produces vast amounts of genomic data. Lists of genomic ranges annotated with scores, names and other values are often explored only partially to answer specific scientific questions, so unknown dependencies between annotations can remain undiscovered. We developed a method to examine associations between all possible pairs of annotations. For this, we first categorize and summarize the annotations according to their scale (binary, nominal, ordinal, interval and rational). In a second step, an adequate statistical model is chosen to find possible dependencies between the annotations. To facilitate the differentiation between significant dependencies and random effects, the results are then visualized. We chose R for the implementation: in Bioconductor, annotated genomic ranges can be easily handled as GRanges objects, and implementations of statistical models can be used to determine associations between annotations. An input of, for example, a genomic ranges object with 30 columns of annotations will lead to an output of 435 plots with corresponding statistics. The user can then choose the most promising pairings of annotations to further examine the found dependencies. Because the approach involves multiple testing, found dependencies need to be further validated. We first concentrate on pairwise comparisons. To exploratively examine dependencies between all possible subsets of annotations, more complicated statistical models are necessary, and the number of comparisons grows exponentially. Further work will also include the explorative analysis of dependencies between more than one list of annotated genomic ranges.
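The pairwise screen can be sketched as follows (the column names and the scale-to-test mapping are illustrative assumptions, not the authors' exact rules): categorize each annotation column by scale and choose a test per pair.

```python
from itertools import combinations
from math import comb

# Hypothetical annotation columns and their measurement scales.
scales = {"peak_called": "binary", "chrom": "nominal",
          "rank": "ordinal", "score": "interval"}
categorical = {"binary", "nominal", "ordinal"}

def choose_test(a, b):
    """Pick a statistical test for a pair of scales (illustrative mapping)."""
    if a in categorical and b in categorical:
        return "chi-squared"            # both categorical
    if a in categorical or b in categorical:
        return "Kruskal-Wallis"         # categorical vs. numeric
    return "Pearson correlation"        # both numeric

tests = {(x, y): choose_test(scales[x], scales[y])
         for x, y in combinations(scales, 2)}
print(len(tests), comb(30, 2))  # 6 pairs here; 30 columns give the 435 of the abstract
```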
Short Abstract: Synthetic spider silks have been explored for potential industrial applications, taking advantage of their immense toughness and renewability to realize protein-based plastic biomaterials as an alternative to those that rely on petroleum. On the other hand, the complete identification of spider fibroin genes remains relatively uncharted territory, due to the many challenges in sequencing these genes. There are up to seven morphologically differentiated silks, and all of the corresponding genes are extremely long (>10 kbp) and almost entirely composed of tandem iterations of repeat sequences. In order to understand the sequence design principles of spider fibroins by marrying genotype to phenotype, we are currently conducting a de novo transcriptome study of 1,000 spiders, and we have developed a sequential read extension algorithm using a hybrid of short and long read sequencing technologies to overcome these challenges. Here, we present a streamlined feasibility study of various storage and logistic conditions for field samples and their effects on de novo transcriptome assembly results, the sequencing protocols and analysis algorithms, as well as the obtained knowledge about phylogenetically conserved and diverse features of spider silk genes.
Short Abstract: For a long time, fingerprinting methods such as RFLP, AFLP and SSR have been used for plant molecular breeding. More recently, SNPs related to specific traits have been identified and are used as new molecular markers for marker-assisted selection of target traits. The development of next generation sequencing (NGS) technology has made SNPs more powerful than conventional fingerprinting techniques, and the large number of SNPs produced by NGS enables new molecular breeding approaches. Rice is a major food crop in Korea; hundreds to thousands of resequencing datasets are produced for the study of new varieties, and these data are stored in the National Agricultural Biotechnology Information Center (NABIC). Because these data are important for new plant molecular breeding studies, new applications are needed to identify individual SNPs between two genomes from NGS data. Our pipeline is largely composed of comparisons between individuals for the development of molecular breeding and group analysis for discrimination of origin, using Python scripts, BIOPYTHON, VCF-tools, Primer3 and PLINK. The comparison pipeline consists of three parts: a) specific SNP discovery among individuals, b) restriction enzyme cleavage checks, and c) primer construction for SNPs. In our experiments, 30-40% of the SNP candidates were confirmed to be actual SNPs and selected as marker candidates. This pipeline simplifies the manual analysis process, which is difficult and complex for plant molecular breeding. Although its accuracy and efficiency depend on the accuracy of the sequencing, it may become a tool of great help to breeders.
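The restriction enzyme cleavage check (part b) can be sketched for CAPS-style markers (the enzyme list and flanking sequences are illustrative): a SNP is a candidate when it creates or destroys a recognition site, so the two alleles can be distinguished after digestion.

```python
# Recognition sites of two common enzymes (illustrative subset).
enzymes = {"EcoRI": "GAATTC", "HindIII": "AAGCTT"}

def caps_candidates(flank_ref, flank_alt):
    """Enzymes whose site is present in exactly one allele's flank."""
    return [name for name, site in enzymes.items()
            if (site in flank_ref) != (site in flank_alt)]

ref_flank = "TTGGAATACCA"   # reference allele with flanking sequence
alt_flank = "TTGGAATTCCA"   # A>T SNP creates a GAATTC (EcoRI) site
print(caps_candidates(ref_flank, alt_flank))  # ['EcoRI']
```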
Short Abstract: Analyzing single cell-based transcriptome profiles highlights the heterogeneity of cancer cells. Although genomic instability is the major cause of cellular variation of the transcriptome in a given sample, non-genetic clonal variations of gene expression may also contribute to the differentiation and transformation of cancer cells in the course of anticancer therapies. Here we characterized the varied profile of the cancer transcriptome within a homogeneous cancer cell population. Single cell-based RNA sequencing data were retrieved and analyzed for a total of 50 cells of the lung cancer cell line H358. Varied transcriptome profiles in the clonal population were compared to the lineage-dependent variation of gene expression in diverse lung cancer cell lines. Geneset-based analysis provided new insights into functional categories associated with the non-genetic variation among homogeneous cancer cells. The present approach has applications in dissecting genetic and non-genetic factors in cancer progression.
Short Abstract: As next generation sequencing technology advances, enormous amounts of whole genome sequence information for a variety of species have been released. However, it is still difficult to assemble whole genomes precisely due to inherent limitations of short read sequencing technology. In particular, the complexities of plant genomes are incomparable to those of microorganisms or animals because of whole genome duplications, repeat insertions, Numt insertions, etc. In this study, we describe a new methodology for detecting misassembled sequence regions of Brassica rapa with genotyping-by-sequencing (GBS) followed by MadMapper clustering. The misassembly candidate regions were cross-checked with BAC clone paired-end library sequences that had been mapped on the reference genome. The list was further verified with gene synteny relations between Brassica rapa and Arabidopsis thaliana. We conclude that this method will help to detect misassembled regions and be applicable to incompletely assembled reference genomes from a variety of species.
Short Abstract: As many de novo genome assembly projects are performed using high-throughput sequencers, many genomic sequences are being produced. Gene prediction is one of the most important steps in genome annotation, along with the assembly process. A large number of software tools and pipelines developed with various computing technologies can be used for gene prediction. However, no such pipeline accurately predicts all or even most protein coding regions. Also, among currently available gene prediction programs, there is no Hidden Markov Model (HMM) that can automatically perform gene prediction for all life forms; species-specific HMMs are therefore required for genome annotation. We present NAGPP, an automated gene prediction pipeline using a self-training HMM, a core-gene model and transcriptomic data. In this pipeline, the genome and transcript sequences of the target species are processed using CEGMA, GlimmerHMM, SNAP and AUGUSTUS, and the MAKER2 program is then used to analyze the protein sequences and unify the gene structures. For the plant genomes currently being processed, NAGPP uses CEGMA to generate an HMM that can be used universally, without being divided into monocots and dicots, and then produces a species-specific HMM. We evaluated this pipeline using the known Arabidopsis and rice genomes, and confirmed that gene structures can be identified with probabilities of 22% and 28% for Arabidopsis and rice, respectively. Because it uses CEGMA and species-specific HMMs, NAGPP shows better prediction results than the GlimmerHMM, SNAP and AUGUSTUS models used in existing MAKER2 runs. NAGPP provides researchers with a pipeline that can reveal more accurate gene structures through species-specific HMMs, both for species whose gene structures have not yet been precisely resolved and for new species.
We conclude that gene structure prediction for new species, as well as for new model species, can yield better results with this pipeline than with conventional ones.
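At its core, HMM-based gene prediction assigns a state (e.g. coding or noncoding) to every base; a toy two-state Viterbi decoder illustrates the idea (all probabilities below are invented for illustration, not trained values from the pipeline).

```python
import math

states = ("noncoding", "coding")
trans = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding":    {"noncoding": 0.1, "coding": 0.9}}
emit = {"noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "coding":    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def viterbi(seq):
    """Most likely state path under the toy HMM (log-space Viterbi)."""
    v = [{s: math.log(0.5) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            col[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# A GC-rich stretch between AT-rich flanks is labelled as a coding segment.
print(viterbi("ATATGCGCGCATAT"))
```

Self-training, as in the pipeline, would re-estimate `trans` and `emit` from the predictions of a previous round.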
Short Abstract: The human genome is complex: it holds nested genes, genes with multiple copies and overlapping genes, all of which are difficult to study. Small nucleolar RNAs (snoRNA) are one such family, most being intronic and many nested in a retained intron. Furthermore, the latest advances in RNA sequencing enable the study of different types of RNA at once. Existing tools are generally designed for a specific RNA type at the expense of others and do not address these genome particularities, or address only one of them. We developed CoCo (Count Corrector for nested genes and multimapped reads), which modifies an annotation by inserting holes in exons and in retained introns containing nested genes. This annotation is then submitted to an existing tool such as Subread's featureCounts. Afterwards, CoCo distributes the counts from multimapped reads, usually coming from duplicated genes, based on the proportion of uniquely mapped reads. This approach prevents allocating counts to a non-expressed gene, for example in the case of a protein coding gene and its pseudogene. CoCo salvages over 15% of reads that are usually left out. Using CoCo, we detect 133 more snoRNA species than with traditional methods, and 60% of rescued reads come from snoRNA. With the multimapped read distribution, the estimated counts triple for genes with multiple copies, such as the signal recognition particle 7SL. The correlation between different types of RNA measured by PCR and by sequencing is higher using CoCo than with traditional methods. Thus, CoCo gives a better portrait of the abundance of most RNA types.
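The multimapped-read rescue can be sketched as follows (gene names and counts are invented; the real tool works on alignments and a corrected annotation): each multimapped read is split across its candidate genes in proportion to their uniquely mapped counts.

```python
# Uniquely mapped counts per gene (toy values, hypothetical gene names).
unique = {"SNORD1": 90, "SNORD1_pseudo": 0, "RN7SL1": 10}

def distribute(multimapped_reads):
    """multimapped_reads: one list of candidate genes per read."""
    counts = dict(unique)
    for candidates in multimapped_reads:
        total = sum(unique[g] for g in candidates)
        if total == 0:
            continue                  # no unique evidence; leave unassigned
        for g in candidates:
            counts[g] += unique[g] / total
    return counts

# 10 reads map equally well to a snoRNA and its (unexpressed) pseudogene:
rescued = distribute([["SNORD1", "SNORD1_pseudo"]] * 10)
print(rescued)  # all 10 reads go to SNORD1, none to the pseudogene
```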
Short Abstract: Motivation: Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of, e.g., the EM algorithm can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. Results: We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiency of the compatibility-based factorizations. Availability and Implementation: Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations.
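A minimal equivalence-class EM illustrates the compatibility-based factorization discussed above (transcript names and class counts are toy values, not from the paper): fragments are grouped by the set of transcripts they are compatible with, and each round splits each class by the current abundance estimates.

```python
# Equivalence classes: compatible transcript set -> fragment count (toy data).
eq_classes = {("t1",): 20, ("t2",): 5, ("t1", "t2"): 50, ("t2", "t3"): 25}
transcripts = ["t1", "t2", "t3"]
theta = {t: 1 / len(transcripts) for t in transcripts}

for _ in range(200):                              # EM iterations
    alloc = {t: 0.0 for t in transcripts}
    for ts, n in eq_classes.items():              # E-step: split each class
        z = sum(theta[t] for t in ts)
        for t in ts:
            alloc[t] += n * theta[t] / z
    total = sum(alloc.values())
    theta = {t: a / total for t, a in alloc.items()}  # M-step: renormalize

# t1 and t2 split the shared fragments; t3, lacking unique support, vanishes.
print({t: round(v, 3) for t, v in theta.items()})
```

The per-fragment model would instead keep a conditional probability for every fragment; the data-driven factorizations of the abstract sit between these two extremes.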
Short Abstract: Second- and third-generation sequencing are now the preferred choices for de novo genome reconstruction. Second-generation sequencing offers high throughput at low cost, but the read length is inadequate to resolve large repeats. On the other hand, third-generation sequencing can generate much longer reads that span large repeats, but the error rate and sequencing cost are much higher. This poster presents a novel algorithm for correcting low-quality long reads using an FM-index constructed from high-quality short reads. In particular, long reads are mapped onto the FM-index of the short reads, and correct sequences are generated via FM-index extension without time-consuming dynamic-programming alignment. The experimental results indicate that the correction power, accuracy, and speed are better than those of existing methods. The strength of hybrid assembly is most apparent when the coverage of third-generation sequencing is low.
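The spirit of FM-index extension can be sketched with a k-mer counter standing in for the FM-index (an assumption-level simplification, not the authors' implementation): extend a seed one base at a time, choosing the base with the strongest short-read support.

```python
from collections import Counter

short_reads = ["ACGTACGTAC", "CGTACGTACG"] * 5   # error-free short reads
k = 4
kmers = Counter(r[i:i + k] for r in short_reads for i in range(len(r) - k + 1))

def extend(seed, steps):
    """Greedily extend a seed along the short-read consensus."""
    seq = seed
    for _ in range(steps):
        ctx = seq[-(k - 1):]                     # last k-1 bases as context
        best = max("ACGT", key=lambda b: kmers.get(ctx + b, 0))
        if kmers.get(ctx + best, 0) == 0:
            break                                # no support: stop extending
        seq += best
    return seq

print(extend("ACGT", 4))  # ACGTACGT
```

An FM-index supports the same "which base can follow this context?" query via backward search, in compressed space rather than with an explicit k-mer table.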
Short Abstract: We propose a modification of the DNA de novo assembly algorithm that uses the relative frequency of reads to properly reconstruct repetitive sequences (tandem repeats).
The main advantage of our approach is that tandem repeats, which are longer than the insert size of paired-end tags, can also be properly reconstructed (other genome assemblers fail in such cases).
Moreover, tandem repeats can also be restored when only single-read sequencing data are available.
The application was developed in a client-server architecture, where a web browser is used to communicate with the end user and the algorithms are implemented in C++ and Python.
Our data structures allow building and handling graphs of up to 8*10^9 vertices (e.g. for the human genome) in 256 GB of RAM, which makes our solution faster than others.
The software was thoroughly tested: over 350 unit tests and about 25 simulated sets of reads were used, and almost 100% code coverage was achieved.
The results of the experiments prove the correctness of the algorithm and show the effectiveness of the presented approach.
The server with the application has been running for one year; several biological and medical groups use this software and are involved in the testing process.
The application has been used on real data sets from Illumina sequencers.
Source code as well as a demo application with a web interface are available online:
http://smyrna.ise.pw.edu.pl:9007 - demo application
http://sourceforge.net/projects/dnaasm - source code
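The relative-frequency idea can be illustrated with k-mer coverage on a toy circular genome (a sketch under simplifying assumptions, not the assembler's actual graph algorithm): k-mers inside a tandem repeat occur at a multiple of single-copy coverage, and that ratio estimates the number of repeat copies.

```python
from collections import Counter

genome = "ATCGGC" + "TTAGGA" * 4 + "CCATGC"   # toy genome with 4 tandem copies
k, read_len = 6, 12
# Uniform circular read sampling: one read per start position.
reads = [(genome * 2)[i:i + read_len] for i in range(len(genome))]
kmers = Counter(r[j:j + k] for r in reads for j in range(read_len - k + 1))

# Repeat-unit k-mer coverage relative to a unique flanking k-mer.
copies = kmers["TTAGGA"] / kmers["ATCGGC"]
print(copies)  # 4.0
```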
Short Abstract: Copy number variation (CNV) is a type of structural variant that affects a large range of nucleotides (usually more than 1000 bp) and has been shown to correlate strongly with many genomic diseases, such as cancer. NGS-based CNV profiling is increasingly prevalent: it provides higher resolution than array-based approaches, but also brings more computational challenges. In this work, we propose an efficient algorithm for CNV segmentation on NGS data. Unlike previous approaches, we use a vector-based bin representation and the distance distribution of adjacent bins to detect potential breakpoints. Our algorithm runs in approximately linear time and scales to whole genome sequencing (WGS) data. We compared our method with classic methods, such as circular binary segmentation and event-wise testing, on both simulated data and real data based on GIAB NA12878.
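The adjacent-bin distance idea can be sketched in one linear pass (toy read-depth values; the published method uses richer vector representations per bin and a learned distance distribution rather than a fixed threshold):

```python
# Per-bin mean read depth along a chromosome (toy data).
depth = [2.0, 2.1, 1.9, 2.0, 4.1, 3.9, 4.0, 4.2, 2.0, 2.1]
threshold = 1.0                                   # assumed cut-off

# Flag positions where neighbouring bins are unusually far apart.
breakpoints = [i + 1 for i in range(len(depth) - 1)
               if abs(depth[i + 1] - depth[i]) > threshold]
print(breakpoints)  # [4, 8]: boundaries of the copy-gain segment
```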
Short Abstract: The body of an organism is a single system composed of a large number of cells, and analyzing the behavior of these cells contributes to the elucidation of life phenomena and the treatment of diseases. In particular, analytical methods such as single-cell RNA-seq have attracted attention, because expression levels differ from cell to cell even within the same cell type, as do the biological functions the cells perform. Clustering cells by their gene expression profiles is a common informatics approach in single-cell analysis, aiming to discover cell heterogeneity. Many clustering methods reduce the dimensionality of expression profiles using principal component analysis or independent component analysis. However, gene expression profiles obtained by single-cell RNA-seq protocols contain a vast number of zeros, which makes it difficult to reduce the dimensionality appropriately. In this research, we propose a novel clustering method using Latent Dirichlet Allocation (LDA), which is known to operate well on sparse matrices, as the dimensionality reduction method. Our method allocates genes into gene sets called topics, and the topics constitute the reduced dimensions. Topics are inferred from biases in expression levels, and genes with similar functions are assigned to the same topic. In experiments classifying cells on real expression profiles, our method classified more accurately than conventional methods.
Short Abstract: Prediction of functional variant consequences is an important part of sequencing pipelines, allowing the categorization and prioritization of genetic variants for follow-up analysis. However, current predictors analyze variants as isolated events, which can lead to incorrect predictions when adjacent variants alter the same codon, or when a frame-shifting indel is followed by a frame-restoring indel. Exploiting known haplotype information when making consequence predictions can resolve these issues. BCFtools/csq is a fast program for haplotype-aware consequence calling which takes known phase into account. Consequence predictions change for 501 of 5019 compound variants found among the 81.7M variants in the 1000 Genomes Project data, with an average of 139 compound variants per haplotype. Predictions match existing tools when run in localized mode, but the program is an order of magnitude faster and requires an order of magnitude less memory. The program is freely available for commercial and non-commercial use in the BCFtools package, which is available for download from http://samtools.github.io/bcftools
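Why phase matters can be shown with a toy codon example (codons chosen for illustration, using the standard genetic code): two SNVs that each look like ordinary missense changes in isolation jointly create a stop codon when they lie on the same haplotype.

```python
# Amino acids for the codons relevant to this example (standard code).
table = {"CAC": "His", "TAC": "Tyr", "CAA": "Gln", "TAA": "*"}

ref = "CAC"
snv1 = (0, "T")   # C>T at codon position 0
snv2 = (2, "A")   # C>A at codon position 2

def mutate(codon, *snvs):
    s = list(codon)
    for pos, alt in snvs:
        s[pos] = alt
    return "".join(s)

print(table[mutate(ref, snv1)])        # Tyr: missense when seen alone
print(table[mutate(ref, snv2)])        # Gln: missense when seen alone
print(table[mutate(ref, snv1, snv2)])  # *: stop gained when phased together
```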
Short Abstract: Somatic copy number alterations (SCNAs) are pervasive in cancer due to genomic instability that can lead to whole genome duplication (WGD), focal amplifications or deletions, and other chromosomal abnormalities. WGD is frequently observed in cancer and is typically associated with adverse outcomes, suggesting that it plays an essential role in the development of an aggressive tumour phenotype. However, as WGD does not occur in isolation but in concert with other chromosomal aberrations, the doubling of the whole genomic content is convolved with the effects of other SCNAs, resulting in a complex landscape of chromosomal rearrangements that is highly challenging to interpret. Here we developed a computational approach for parsing complex copy number profiles from multiple tumour samples that can be used to deconvolute the effects of WGD and focal alterations. The method seeks to separate recurrent and tumour-specific SCNA events and can be viewed as a form of dimensionality reduction for structured high-dimensional discrete data. The output from our method provides an estimate of the sequential series of copy number alteration events that occurred in the tumours. The problem is modelled as an optimization problem with a quadratic objective function and constraints; the Lagrange dual of the original optimization problem is solved by a fixed-point method. The model also has a handful of user-defined parameters, making the method flexible enough to accommodate various user needs. We demonstrate the utility of the method by analysing 380 colorectal cancer samples from The Cancer Genome Atlas. The method gives a better indication of the existence of WGD in a sample than average ploidy alone. The results also show that the copy number differences among different genomic loci have different causes. Thus, this method could help researchers understand the evolution of SCNAs in cancer.
Short Abstract: Motivation: Detection of copy-number variants from whole-genome sequencing (WGS) read-depth data suffers from confounding factors such as amplification and mapping bias, which are difficult to separate from the signal, and from the sheer data size, which mandates simplifying assumptions. Methodological problems concern the correspondence between segment means and copy numbers, implicit biases imposed by the modeling, and uncertainty in the number of copy numbers called. Additionally, methods often lack a calibrated measure of uncertainty in their CNV calls owing to the computational effort required. It is notable that fully Bayesian methods have not been used for WGS data. Results: We revisit multiplexed sequencing of multiple individuals from two populations as a method to identify recurrent copy number differences. Our novel implementation of Forward-Backward Gibbs (FBG) sampling for Bayesian Hidden Markov Models (HMM) is based on wavelet compression and can analyze sequences of hundreds of millions of observations in a few minutes. Using two rat populations divergently selected for tame and aggressive behavior as an example, we demonstrate that multiplexed sequencing addresses the bias and normalization problems. Algorithmic improvements include a novel data structure called a breakpoint array. It allows for efficient dynamic compression of the data into blocks for which the underlying mean signal is constant and without discontinuities. We show that a breakpoint array can be obtained from Haar wavelet regression at arbitrary noise levels, in place and in linear time. We discuss the discovery of several CNVs of varying sizes consistent with earlier results concerning the domestication syndrome as well as with experimental observations.
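The block-compression idea can be illustrated with a toy stand-in (the merge rule and threshold below are simplifications of the Haar-wavelet construction, not the breakpoint array itself): merge adjacent observations into maximal blocks with an approximately constant mean, so the sampler only needs one sufficient statistic per block.

```python
data = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 1.0, 1.1]   # toy read depths
eps = 0.5                                              # assumed noise level

blocks, start = [], 0                 # (start, length, mean) per block
for i in range(1, len(data) + 1):
    if i == len(data) or abs(data[i] - data[start]) > eps:
        seg = data[start:i]
        blocks.append((start, len(seg), sum(seg) / len(seg)))
        start = i
print(blocks)  # three blocks: two at depth ~1 around one at depth ~3
```

Nine observations collapse to three blocks; on WGS-scale data this compression is what makes each Gibbs sweep cheap.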
Short Abstract: The aim of this project is to develop a workflow that automates the processing, assembly and annotation of whole genome sequencing data. To this end, the Snakemake workflow engine is applied, which automates the execution of tools and scripts. The steps include quality control, de novo assembly, gene prediction, gene annotation and gene comparison. Usually these steps require time for configuration and interpretation of intermediate data; the workflow combines these tasks into one step. The key features of the presented workflow are the generation of hybrid assemblies from PacBio and Illumina sequencing data, optional filtering of prokaryotic sequences (from non-axenic eukaryotic cultures) and advanced gene prediction from available RNA sequencing data. The workflow is applied to five whole-genome-sequenced Chrysophyceae strains, sequenced with Illumina HiSeq X Ten and PacBio RS II technologies. In the course of evolution, Chrysophyceae frequently reduced their plastids and accordingly their nutritional mode. Hence, the genome comparison reveals details that depend on nutrition: essential and optional genes, gene density and arrangement, GC content and genome size. The automated Snakemake workflow incorporates SPAdes to assemble the Illumina and PacBio reads. Additionally, for non-axenic cultures, the software MaxBin 2.0 and Kraken separate the contigs into eukaryotic and prokaryotic sequences. If available, RNA-Seq data aids the gene prediction process of the programs TopHat2, AUGUSTUS, GeneMark and BRAKER. The predicted genes are searched with DIAMOND against the KEGG database. Finally, gene matches of each species are clustered and compared among the different nutritional modes. This workflow will simplify future genomic analyses.
Short Abstract: Viral infections can be a major public health threat, sometimes causing massive epidemics. The viral genome is of importance for both characterization and prevention. As the viral genotype can diverge rapidly because of high mutation rates, traditional sequencing approaches such as genome walking are laborious and time consuming, while often generating only fragmented sequences. Alternatively, utilizing whole genome sequencing (WGS), the complete viral genome can be obtained, providing unprecedented resolution for the study of viral genomics. The lack of the required bioinformatics expertise for analysing WGS data is, however, a hurdle preventing its broad adoption in many national reference centers. We developed a pipeline specifically designed to bridge this gap. Our pipeline is flexible and, moreover, species-agnostic. It performs automated quality control based on Illumina data (including removal of host DNA when working in a metagenomics context), and then either de novo subtyping or subtyping based on a user-provided set of input reference genomes, to generate the viral consensus sequence contained within a sample. A detailed output report containing intermediary results and quality parameters of importance is also created. A user-friendly interface has been deployed in an in-house Galaxy instance to facilitate access for a broad audience of scientists. Preliminary validation on influenza and mumps data demonstrates that our pipeline is capable of obtaining high-quality viral consensus sequences, providing a solid basis for downstream analyses such as viral genotyping, in silico serotyping, and virulence and/or resistance characterization. Our pipeline can easily be adapted to other viral species and will be made publicly available upon its publication.
Short Abstract: Panel sequencing of patient-derived cancer samples has become an important tool for clinicians and researchers. Calling variants on panel-seq data, however, is error-prone due to the limited amount of raw material and the unknown ploidy and purity of commonly hyper-ploid and impure samples. The Perturber method identifies robust or sub-clonal variants and additionally determines the parameters that maximize the likelihood of the observed variant calls. This is achieved by calculating the a posteriori likelihood of a GATK null-hypothesis run that is bootstrapped by a stochastic grid search over different algorithms and parameter settings. The method is freely available and has been benchmarked on the 1000 Genomes data and a corresponding somatic variant calling gold-standard dataset.
Short Abstract: Motivation: A key component in many RNA-Seq based studies is the production of multiple replicates for varying experimental conditions. Such replicates make it possible to capture underlying biological variability and control for experimental variability. However, during data production researchers often lack clear definitions of what constitutes a "bad" replicate that should be discarded, and if data from failed replicates is published, downstream analysis by groups using this data can be hampered. Results: Here we develop a probability model to weigh a given RNA-Seq experiment as a representative of an experimental condition when performing alternative splicing analysis. Using both synthetic and real-life data, we demonstrate that this model detects outlier samples which are consistently and significantly different from samples of the same condition. We also perform an extensive evaluation of the algorithm in different scenarios involving perturbed samples, mislabeled samples, no-signal groups, and different levels of coverage, and show that it compares favorably with current state-of-the-art tools. Availability: Program and code will be available at majiq.biociphers.org
Short Abstract: High-throughput DNA sequencing (HTS) enables metagenomic studies to be performed at large-scale. Such analyses are not restricted to present day environmental or clinical samples but can also be applied to molecular data from archaeological remains (ancient DNA) in order to provide insights into the relationship between hosts and bacteria through time. Here we present AMPS (Ancient Metagenomic Pathogen Screening), an automated bacterial pathogen screening pipeline for ancient DNA sequence data that provides straightforward and reproducible information on species identification and authentication of their ancient origin. AMPS consists of a customized version of (1) MALT (Megan ALignment Tool) (Herbig et al., 2016), (2) RMAExtractor, a Java tool that evaluates a series of authenticity criteria for a list of target species, and (3) customizable post-processing scripts to identify, filter, and visualize candidate hits from the RMAExtractor output. We evaluated AMPS with DNA sequences obtained from archaeological samples known to be positive for specific pathogens, as well as simulated ancient DNA data from 33 bacterial pathogens of interest spiked into diverse metagenomic backgrounds (soil, archaeological bone, dentine, and dental calculus). AMPS successfully confirmed all experimental samples. AMPS further correctly identified all simulated target pathogens that were present with at least 500 reads in the metagenomic library. In addition, we used these data to assess and compensate for biases resulting from the reference database contents and structure. AMPS provides a versatile and fast pipeline for high-throughput pathogen screening of archaeological material that aids in the identification of candidate samples for further analysis.
Short Abstract: Background: Next generation sequencing (NGS) techniques have been around for over a decade. As the technology evolves, assembly algorithms and tools have to continuously adjust and improve. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. Results: An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. The high error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new read aligner, HISEA (HIerarchical SEed Aligner), for SMRT sequencing data. Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the previous best. The current best assembler for SMRT data is the Canu program, which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner into the Canu pipeline and benchmarked it against the MHAP-based pipeline for multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP at both coverage levels. Moreover, Canu+HISEA assemblies at 30x coverage are comparable with Canu+MHAP assemblies at 50x coverage, while being faster and cheaper. Availability: The source code of the HISEA aligner and the Canu+HISEA assembly pipeline is freely available from: https://github.com/lucian-ilie/HISEA and https://github.com/lucian-ilie/Canu HISEA, respectively.
Short Abstract: Standardizing the measurement of the human gut microbiota is the first step toward combining an individual's microbial profile with other types of healthcare data. The microbial profile is affected by various factors, such as fecal sampling conditions, sequencing platform, and targeted sequencing region. In this study we focused on evaluating commercial sequencing platforms and targeted 16S regions. With the assistance of bioinformatics tools, the 454 FLX pyrosequencer was successfully applied to microbial community profiling. The Illumina MiSeq platform has increasingly outpaced the 454 systems, mainly due to its much higher throughput, despite the limitation of its shorter read length. The Pacific Biosciences (PacBio) Single Molecule, Real-Time (SMRT) DNA sequencing system has since become available for microbial phylogenetic profiling, with the ability to sequence the full-length 16S gene. For this purpose, we generated fecal sequences from 170 Korean subjects using the GS FLX+ (V1-4), Illumina MiSeq (V1–3, V3–4 and V4), and PacBio (V1–9) systems. We compared phylogenetic resolution and anomalies in a simulation study based on a public 16S rRNA gene database. The information generated by this study will become a valuable resource for constructing a standard protocol for Korean gut microbiome analysis.
Short Abstract: During the past two decades, a large number of antibody libraries have been constructed to meet the needs of drug discovery and diagnostic processes. The advent of next-generation sequencing (NGS) technology has enabled scientists to rigorously assess library size, quality, diversity and robustness at different stages of the construction process. Currently available bioinformatic tools mainly focus on the analysis of clonotypes of T cell receptors. We propose a new software pipeline, Abseq, designed to facilitate high-throughput analysis of NGS reads of the variable domain of an antibody chain. The Abseq pipeline includes all the essential analysis steps: merging paired-end reads, annotating V-(D)-J rearrangements, estimating the abundance of germline genes and families, visualising the alignment quality of germline genes (including the filtering of low-quality sequences), predicting frame shifts and identifying functional clones, and finally calculating spectratypes and clonotypes to estimate the diversity of the library. Importantly, Abseq also contains functionality to facilitate the selection of the best combination of restriction enzymes for the construction of library vectors. We illustrate the pipeline's capabilities by applying it to a naïve IgM repertoire extracted from peripheral blood lymphocytes (PBL) of pooled human donors. The results show that the abundance of germline genes is in line with the natural distribution reported in the literature. The integrity of frameworks, complementarity-determining regions and secretion signals was examined through comprehensive motif analysis. Overall, the results confirm that the repertoire is not pathological and can be used for library construction.
Short Abstract: Bacteriophages (phages), or viruses that infect bacteria, are the most abundant group of viruses on Earth and play a critical role in structuring bacterial communities. Despite their prevalence and ubiquity, only a small number of phage genomes have been sequenced and characterized. Viral communities inhabiting niches from across the globe have now been sequenced, uniformly discovering a wealth of sequence data with no recognizable homology to extant sequence collections. Nevertheless, novel viral species genomes have been successfully excavated directly from complex community metagenomes. Discovery of such viral genomes often relies heavily on manual curation and prior studies have employed a variety of different criteria when sifting through sequencing data. To provide an automated solution for identifying viral genomes from complex sequence data sets, we developed the tool virMine. Synthetic metagenome data sets were created and examined to assess the performance of this new tool, testing its sensitivity to both variation in the abundance of viral relative to non-viral sequences and sequence divergence from previously characterized taxa. VirMine was next used to mine viral metagenomic data sets from several studies of: (1) the gut virome, (2) the urinary virome, and (3) the freshwater virome. Numerous complete and largely-complete phage genome sequences resembling previously characterized phage species were extracted from each dataset without manual intervention. Furthermore, novel putative phage genomes were identified warranting further investigation and confirmation in the lab. The virMine tool provides a robust and expedient means for viral genome sequence discovery from complex community sequence data.
Short Abstract: Abnormal variations are frequent in the clonal genome evolution of cancers, and such aberrant variations often act as drivers of cancer cell growth. The fundamental evolutionary dynamics underlying these variations in tumor metastasis remain understudied owing to their genetic complexity. Recently, whole genome sequencing has made it possible to determine genome variations in the short-term evolution of cell populations. This approach has been applied to evolving populations of unicellular organisms, including yeast. Examining sequence changes at such fine-scale resolution represents substantial progress in evolutionary genomics. These studies, however, have been limited to observing only point mutations and small insertions and deletions relative to a given reference sequence, due to the incomplete, fragmented construction of individual de novo genome assemblies. We herein design a new meta-assembly approach for building a sequence assembly of each population at different time points and use time-series analysis to identify novel genome-wide variations. We improve the continuity and accuracy of the genome assemblies and determine the evolutionary patterns of variations in a large dataset of yeast (Saccharomyces cerevisiae) W303 strain genomes from 40 populations at 12 time points.
Short Abstract: Structural variations (SVs) are major sources of human genetic variation. Both haplotype information and SVs have been shown to be related to disease phenotypes. Computational methods that integrate haplotype phasing and SV detection are lacking. Moreover, detecting complex SVs, such as inverted duplications, is still a challenging problem. By preserving the directionality of the template strands, single-cell strand sequencing (Strand-seq) provides a means to better tackle these problems: it is a powerful sequencing technology that enables us to simultaneously phase haplotypes and detect SVs in a diploid genome. We aim to develop a statistical framework that phases SVs based on read counts observed in Strand-seq. We can then combine this framework with current haplotype phasing tools to obtain more complete and accurate haplotypes.
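To make the read-count idea concrete, here is a minimal sketch (not the authors' statistical framework) that classifies a region's template-strand state from Watson- and Crick-oriented read counts under a simple binomial model with an assumed background error rate:

```python
from math import comb, log

def strand_state_loglik(watson, crick, error=0.05):
    """Log-likelihood of each template-strand state (WW, WC, CC) given
    Watson- and Crick-oriented read counts, under a binomial model.
    The error rate is an assumed constant, not a fitted parameter."""
    n = watson + crick
    # Expected fraction of Watson-oriented reads under each state.
    p_watson = {"WW": 1 - error, "WC": 0.5, "CC": error}
    return {state: log(comb(n, watson)) + watson * log(p)
                   + crick * log(1 - p)
            for state, p in p_watson.items()}

def call_state(watson, crick):
    """Return the maximum-likelihood strand state for a region."""
    ll = strand_state_loglik(watson, crick)
    return max(ll, key=ll.get)
```

A genotype-aware version of such likelihoods, extended with SV-specific states, is the kind of component the proposed framework would combine with existing haplotype phasing tools.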
Short Abstract: The most common histological subtype of epithelial ovarian cancer, high-grade serous ovarian carcinoma (HGS-EOC), shows a five-year survival rate of less than 30%: despite an initial response to platinum agents, patients become progressively resistant and ultimately incurable. Conventional array-based approaches, which rely on known transcript structures, have failed to identify biomarkers of platinum resistance. Thus, with the aim of discovering new mutations or transcript variants associated with the mechanism of resistance, we sequenced the transcriptomes of multiple HGS-EOC biopsies: 14 biopsies from platinum-sensitive patients, 14 biopsies from platinum-resistant patients and 16 matched longitudinal biopsies (tumor specimens collected from onset throughout the progression of the disease) that were sensitive at the time of first biopsy and resistant at the time of the last biopsy. The samples and analyses presented here are part of an ongoing project, VIOLeTS, which uses novel methodological and computational approaches coupled with next generation sequencing of DNA and RNA to study the transcriptional and genomic changes in HGS-EOC that lead to relapse and resistance, thereby allowing the study of tumor evolution during therapy. After the first project year, transcriptome reconstruction of the 14 sensitive and 14 resistant tumors highlights 1371 transcripts differentially expressed between resistant and sensitive samples: 125 known transcripts, 686 potentially novel isoforms of known transcripts and, for the remainder, if validated, novel intergenic and antisense transcripts. Interestingly, only a very small part of the observed transcriptional alterations can be ascribed to coding genes, suggesting a prominent non-coding role in HGS-EOC platinum resistance.
Short Abstract: Sample size calculation is crucial to ensure sufficient statistical power for detecting existing effects, but is rarely performed for RNA-seq studies. To evaluate feasibility and provide guidance, we performed a systematic search and evaluation of open source tools for RNA-Seq sample size estimation. We used simulations based on real data to examine which tools perform well for different levels of fold change between conditions, different sequencing depths and variable numbers of differentially expressed genes. Furthermore, we examined the effect of the number of pilot replicates on the results and whether real pilot data are necessary for reliable results at all. In addition, we looked at the actual false discovery rate after correction. The six evaluated tools provided widely different answers for human data, which were affected by fold change, the required power and the data used. While all tools failed for small fold changes, some tools can at least be recommended when closely matching pilot data are available and relatively large fold changes are expected.
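As a toy illustration of simulation-based power estimation (not one of the six evaluated tools), the chance of detecting a given log2 fold change can be estimated by repeated two-sample comparisons on simulated log-expression values. All parameters here, including the Gaussian noise model, are invented for the sketch:

```python
import random
from math import sqrt
from statistics import mean, stdev

def empirical_power(n_per_group, log2fc, sigma=0.5, n_sim=500,
                    z_crit=1.96, seed=1):
    """Fraction of simulated experiments in which a two-sample z-style
    test detects a mean shift of `log2fc` between two groups of
    `n_per_group` replicates with noise level `sigma`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        a = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        b = [rng.gauss(log2fc, sigma) for _ in range(n_per_group)]
        se = sqrt(stdev(a) ** 2 / n_per_group + stdev(b) ** 2 / n_per_group)
        if abs(mean(b) - mean(a)) / se > z_crit:
            hits += 1
    return hits / n_sim
```

Real RNA-seq power tools model counts with negative binomial distributions and correct for multiple testing, but the structure is the same: simulate, test, count detections.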
Short Abstract: The immune system is essential to human health, and analyzing it requires computational methods that can properly handle the huge amounts of immunosequencing data being generated. We here present a new web service, ImmunExplorer Online, that enables users to upload their NGS or preprocessed IMGT/HighV-QUEST data, perform clonality analyses and diversity calculations, and obtain important statistics based on the functional and non-functional sequences. Additionally, users can create and manage their own projects, and results can be visualized and downloaded. These analyses are built on the freely available software framework ImmunExplorer (IMEX), which enables the analysis of raw next-generation sequencing data and preprocessed IMGT/HighV-QUEST data. IMEX implements several features, such as the calculation of clonality and diversity, primer efficiency analyses, and the prediction of the status of the adaptive immune repertoire using machine learning algorithms. Moreover, various analyses of the V-(D)-J rearranged regions, genes and alleles, as well as statistics on preprocessed IMGT/HighV-QUEST data, can be produced. Using ImmunExplorer Online, users can run a full pipeline to profile the human adaptive immune system, from raw NGS data to health state prediction. Profiling the immune system and analyzing the interactions of its key players is one of the most frequently addressed research topics in immunoinformatics, and this web service should help provide detailed insights for medical research, transplantation medicine, and the diagnosis and treatment of diseases.
Short Abstract: Background: Many algorithms and pipelines have been developed to analyse RNA-seq data. Nevertheless, there is much debate about which of these approaches provide the best results. Material and Methods: The KMS12-BM and JJN-3 multiple myeloma cell lines were used to test the drugs amiloride and TG003. Control and treatment samples were sequenced in triplicate. Several pipelines combining 3 trimming methods, 5 alignment algorithms, 5 counting methods and 6 normalization approaches were tested. Precision and accuracy of the 108 resulting pipelines were determined by analysing MAD ranks and qPCR correlation for the most stably expressed genes, respectively. 9 differential expression programs were compared under 5 testing approaches. Objective: To assess the performance of the most widely used RNA-seq algorithms. Results: We observed that most of the pipelines provided comparable precision, except those based on eXpress, which performed worse on this parameter. With respect to accuracy, pipelines based on the counting algorithm HTSeq reached the best results, with RUM and HISAT2 being the preferred alignment methods. When evaluating differentially expressed genes (DEG), we found a high degree of overlap between methods such as edgeR, limma and DESeq2. We also noticed that the detection power of these methods was dependent on the similarity between the 2 compared groups. Conclusion: In our hands, the best analysis approach included the HTSeq counting method, with RUM and HISAT2 the most recommendable aligners. edgeR, limma and DESeq2 performed similarly well for DEG detection. Funding: “Fundación Española de Hematología y Hemoterapia”
Short Abstract: Nanopore sequencing, a promising single-molecule DNA sequencing technology, exhibits many attractive qualities and, in time, could potentially surpass current sequencing technologies. The first commercial nanopore sequencing device, MinION, is an inexpensive, pocket-sized, high-throughput sequencing apparatus that produces real-time data using the R7 nanopore chemistry. These properties enable new potential applications of genome sequencing, such as rapid surveillance of Ebola, Zika or other epidemics, near-patient testing, and other applications that require real-time data analysis. The technology is capable of generating very long reads (~50,000bp) with minimal sample preparation. Despite all these advantageous characteristics, it has one major drawback: high error rates. To take advantage of the real-time data produced by MinION, the tools used for nanopore sequence analysis must be fast and must overcome high error rates. Our goal in this work is to comprehensively analyze currently publicly available tools for nanopore sequence analysis, with a focus on understanding their advantages, disadvantages, and bottlenecks. It is important to understand where current tools do not perform well in order to develop better ones. To this end, we analyze the multiple steps and tools in the nanopore genome analysis pipeline and provide guidelines for determining the appropriate tools and corresponding parameters for each step of the pipeline.
Short Abstract: Leishmaniasis refers to a disease complex caused by protozoan parasites of the genus Leishmania. Annually, approximately 300,000 new cases and 20,000 deaths related to visceral leishmaniasis are reported. The treatment of leishmaniasis is problematic, mainly due to the high toxicity of pentavalent antimonials and the emergence of parasites resistant to these compounds. In order to investigate whether the presence of single nucleotide polymorphisms (SNPs) could be associated with antimony resistance mechanisms, we used transcriptome data obtained by NGS Illumina RNA sequencing from susceptible (LiWTS) and trivalent antimony (SbIII) resistant (LiSbR) L. infantum (MHOM/BR/74/PP75) lines. Given the availability of Leishmania genomic data and the high synteny observed between all sequenced genomes, for an initial assessment of SNPs the Burrows-Wheeler Aligner (BWA) was used to map the reads against the reference genome, L. infantum JPCM5. SAMtools and BCFtools were used for SNP calling, and SnpEff was used for variant annotation. In addition, functional annotation was performed using the Blast2GO software. The analysis pipeline identified a variant rate of one variant every ~3,532 bases in LiWTS and one every ~35,716 bases in LiSbR, most of them resulting in missense effects. The functional effects of the polymorphisms, variant rates per chromosome, and indels were also assessed. For the LiSbR line, the high-impact SNPs are related to proteins with domains of unknown function (DUF proteins), amastin and cysteine peptidase, which play important roles in the survival and virulence of the parasites.
Short Abstract: Genome-scale expression profiling has become a key tool of functional genomics, critically supporting progress in the post-genomic era. It improves our understanding of living systems at the molecular level. The fast development of sequencing technologies has recently led to many updates of genome sequences as well as annotations and has revealed the complexity of the gene models of many species. In this work, we have extensively analysed the evolution of human gene models, focusing on alternative splicing events (ASE). In addition to the well-defined canonical events, we have focused on those that are more complex and do not fit any canonical category; in the latest Ensembl releases these made up over 40% of all ASE. We here define 4 new ASE categories, which account for about two-thirds of all complex ASE. The remaining third appears to be a combination of already known types and these 4 new ones. In future work we would like to investigate the possible evolutionary origins of these 'complex' ASE. Based on the detailed analysis of gene model complexity evolution and the appearance of alternative splicing events, we would like to incorporate this knowledge into a newly introduced combined metric for assessing gene model complexity. The motivation is that currently available metrics are of limited use, as they assess only part of the gene model and do not reflect the evolutionary sources of gene model complexity.
Short Abstract: Introduction: High variation in sequencing data, as present in metagenomics and polyploid genome analysis, poses many difficulties during data processing and analysis. Error detection is particularly impeded, as erroneously read bases are hard to discern among correct bases of varying abundance. The problem becomes particularly significant when using modern sequencing technologies like Oxford Nanopore that offer low-cost sequencing with very high error rates. Rationale and methods: Our work has focused on developing an aggregate error detection approach that takes this variation into account. It uses the conjunctive summary of two predictors: an analytic error predictor that creates per-position fuzzy clusters of similar sequences, and a machine learning (ML)-based model trained to discern errors from naturally occurring bases. Variants of the approach have been tested on metagenomic and hexaploid wheat data. Deliverables and conclusion: In our study, the aggregate approach showed very promising results, particularly on low-quality datasets with significant error rates. The ML-based models alone had precision and recall of over 99.5% on metagenomic data, and even higher on wheat. Because of Bayes' rule, this accuracy was insufficient for improving the already low average error rates of Illumina or 454 metagenomic data; however, it leaves the approach highly applicable to Oxford Nanopore datasets. Work is ongoing to apply this hybrid approach to metagenomic datasets sequenced with technologies such as Oxford Nanopore and to demonstrate that the consistently high accuracy of the ML model persists.
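A minimal sketch of the conjunctive idea, with a naive per-position frequency threshold standing in for the fuzzy clustering and a caller-supplied flag list standing in for the ML model (all thresholds are illustrative):

```python
from collections import Counter

def analytic_error_flags(column, min_frac=0.1):
    """Analytic predictor: flag bases whose frequency in a pileup
    column falls below `min_frac` (a crude stand-in for the
    per-position fuzzy clustering described in the abstract)."""
    freq = Counter(column)
    total = len(column)
    return [freq[base] / total < min_frac for base in column]

def conjunctive_errors(column, ml_flags, min_frac=0.1):
    """Conjunctive summary: report an error only where both the
    analytic predictor and the ML model agree."""
    analytic = analytic_error_flags(column, min_frac)
    return [a and m for a, m in zip(analytic, ml_flags)]
```

Requiring agreement of both predictors trades recall for precision, which is the point when true low-frequency variants must not be mistaken for errors.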
Short Abstract: Background: Exchanging information about the genotypes of microbial isolates is a cornerstone of many clinical and public-health-related genomics applications. Genotyping methods such as Multilocus Sequence Typing (MLST) and its variants have proven to be useful tools for this purpose, labelling genotypes by the alleles observed at a chosen set of conserved loci. However, as more isolates are sequenced across a wider range of species, curating the database of observed alleles and allele combinations is becoming an increasingly significant bottleneck in the timely communication of novel genotypes. Results: We propose a simple method for eliminating the majority of the curation burden through the use of cryptographic hashing and a minimal allele database, allowing a hash of the allele to be used as a self-identifying signature. Combining these hash-based signatures yields a genotypic identifier, allowing unambiguous discourse about the genotypes of microbial isolates with minimal recourse to curated databases. We demonstrate the effectiveness of this approach by showing that the use of standard hashing algorithms results in unambiguous labelling of alleles in practice, and provide a simple reference implementation that replicates the information provided by MLST and cgMLST across four sets of bacterial isolates. Finally, we discuss several implications of enabling allelic genotyping without the creation and maintenance of a large database of observed alleles, including improved scalability, easier augmentation of existing schemes and simpler development of novel schemes. Availability: GenomeHash is open-source, requires R and BLAST, and is available from https://github.com/bgoudey/genomehash. Contact: Address correspondence to Dr. Benjamin Goudey (firstname.lastname@example.org)
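A minimal sketch of hash-based allele signatures and their combination into a genotypic identifier. The hash function choice, truncation lengths and locus names below are illustrative, not the GenomeHash defaults:

```python
import hashlib

def allele_signature(sequence, length=12):
    """Self-identifying signature: a truncated SHA-256 digest of the
    normalized allele sequence. No allele database lookup is needed."""
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode()).hexdigest()[:length]

def genotype_identifier(alleles_by_locus):
    """Combine per-locus signatures into one genotype label; loci are
    sorted by name so the identifier is order-independent."""
    parts = [locus + ":" + allele_signature(seq)
             for locus, seq in sorted(alleles_by_locus.items())]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

Two laboratories that sequence the same alleles obtain the same identifier without consulting a shared curated database, which is exactly the curation burden the method removes.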
Short Abstract: Motivation: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn Graph of the reads they are given as input (Compeau et al., 2011; Pevzner et al., 2001; Simpson et al., 2009; Schulz et al., 2012; Zerbino and Birney, 2008; Grabherr et al., 2011; Chang et al., 2015; Liu et al., 2016; Kannan et al., 2016). Even when other approaches are used for subsequent assembly (e.g., when one is using “long read” technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly (Carvalho et al., 2016; Koren et al., 2017), and state-of-the-art long-read error-correction methods use de Bruijn Graphs (Salmela et al., 2016). Because of the centrality of de Bruijn Graphs, researchers have proposed numerous methods for representing them compactly (Pell et al., 2012; Pellow et al., 2016; Chikhi and Rizk, 2013; Salikhov et al., 2013). Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e., the number of times that each k-mer occurs, which is key in transcriptome assemblers. Results: We present a method for compactly representing the weighted de Bruijn Graph (i.e., with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by only 18%–28%. Our technique is based on a simple invariant that all weighted de Bruijn Graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn Graph-based systems. Availability: https://github.com/splatlab/debgr
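For intuition, a weighted de Bruijn Graph can be sketched with k-mer counts as node weights and (k+1)-mer counts as edge weights; one simple invariant is that a k-mer which never ends a read carries its full abundance on its outgoing edges. This toy dictionary version ignores the paper's space-efficient representation entirely:

```python
from collections import Counter

def weighted_debruijn(reads, k):
    """Node weights are k-mer counts; edge weights are (k+1)-mer
    counts. A toy in-memory representation of the structure."""
    nodes, edges = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes[read[i:i + k]] += 1
        for i in range(len(read) - k):
            edges[read[i:i + k + 1]] += 1
    return nodes, edges

def out_weight(edges, kmer):
    """Total weight of the edges leaving `kmer`."""
    return sum(edges[kmer + base] for base in "ACGT")
```

The actual method exploits such structural constraints to store abundances compactly; this sketch only illustrates what the weights mean.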
Short Abstract: Motivation: The increasing application of Next Generation Sequencing technologies has led to the availability of thousands of reference genomes, often providing multiple genomes for the same or closely related species. The current approach to represent a species or a population with a single reference sequence and a set of variations cannot represent their full diversity and introduces bias towards the chosen reference. There is a need for the representation of multiple sequences in a composite way that is compatible with existing data sources for annotation and suitable for established sequence analysis methods. At the same time, this representation needs to be easily accessible and extendable to account for the constant change of available genomes. Results: We introduce seq-seq-pan, a sequential genome aligning workflow for the rapid construction of a pan-genome data structure from multiple genomes. The flexible data structure provides methods for adding a new genome to or removing one from the set of genomes. It further allows the extraction of sequences and the fast generation of a consensus sequence and provides a global coordinate system. All these features form the basis for the usage of pan-genomes in downstream analyses. Availability: https://gitlab.com/groups/rki_bioinformatics
Short Abstract: Motivation: The interpretation of transcriptome dynamics in single-cell data, especially pseudotime estimation, could help in understanding the transition of gene expression profiles. The recovery of pseudotime increases the temporal resolution of single-cell transcriptional data, but is challenging due to the high variability in gene expression between individual cells. Here, we introduce HopLand, a pseudotime recovery method that uses a continuous Hopfield network to map cells onto Waddington’s epigenetic landscape. From the single-cell data, it reveals the combinatorial regulatory interactions of genes that control the dynamic progression through successive cellular states. Results: We applied HopLand to different types of single-cell transcriptome data. It achieved high accuracy of pseudotime prediction compared with existing methods. Moreover, a kinetic model can be extracted from each dataset. Through the analysis of such a model, we identified key genes and regulatory interactions driving the transition of cell states. Therefore, our method has the potential to generate fundamental insights into cell fate regulation. Availability and implementation: The Matlab implementation of HopLand is available at https://github.com/NetLand-NTU/HopLand.
Short Abstract: Motivation: Contigs assembled from second generation sequencing short reads may contain misassembly errors, which complicate downstream analysis or even lead to incorrect analysis results. Fortunately, with more and more sequenced species available, it becomes possible to use the reference genome of a closely related species to detect misassembly errors. In addition, long reads from third generation sequencing technology are increasingly widely used and can also help detect misassembly errors. Results: Here, we introduce ReMILO, a reference-assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called a red-black multipositional de Bruijn graph to detect misassembly errors. In addition, ReMILO aligns the contigs to long reads and finds their differences from the long reads to detect further misassembly errors. In our performance tests on contigs assembled from short reads of human chromosome 14 and japonica rice data, ReMILO detected 16.1-84.1% of extensive misassembly errors and 14.0-100.0% of local misassembly errors. On hybrid A. thaliana contigs assembled from both short and long reads, ReMILO also detected 10.7-25.7% of extensive misassembly errors and 8.7-23.8% of local misassembly errors. Availability: The ReMILO software can be downloaded for free from https://github.com/songc001/remilo
Short Abstract: Motivation: Experimental techniques for measuring chromatin accessibility are expensive and time consuming, calling for the development of computational methods that precisely predict open chromatin regions from DNA sequences. Existing computational methods along this direction fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, a comprehensive framework that integrates useful k-mer co-occurrence information with recent advances in deep learning is still lacking. Method and results: We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can effectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility. Availability and implementation: The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm.
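The k-mer tokenization and co-occurrence counting that precede the embedding pre-training step can be sketched as follows. The choice of k and window size here is illustrative; an unsupervised embedding method would then learn vectors from such a matrix:

```python
from collections import Counter

def tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def cooccurrence(seqs, k=3, window=2):
    """Count k-mer pairs co-occurring within `window` tokens of each
    other: the kind of matrix an unsupervised representation learning
    step would consume to pre-train k-mer embedding vectors."""
    counts = Counter()
    for seq in seqs:
        tokens = tokenize(seq, k)
        for i, left in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[tuple(sorted((left, tokens[j])))] += 1
    return counts
```

Sorting each pair makes the matrix symmetric, so (A, B) and (B, A) accumulate into a single count.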
Short Abstract: We aimed to develop RNA-sequencing-based classifiers for the key breast cancer histopathological biomarkers — estrogen receptor (ER), progesterone receptor (PgR), human epidermal growth factor receptor 2 (ERBB2/HER2), Ki67, and Nottingham histological grade (NHG) — which are routinely used for determining prognosis and treatment in the clinic. To obtain reliable training labels we performed a multi-rater histopathological biomarker evaluation on a training cohort of 405 tumor samples. Using the resulting consensus labels and RNA-seq-derived tumor gene expression data as input, we trained single-gene classifiers (SGC) and multi-gene nearest shrunken centroid classifiers (MGC). We assessed the performance of the resulting classifiers by comparing their predictions to the clinical biomarker status in an independent prospective population-based series of 3273 primary breast cancer cases from the SCAN-B study (ClinicalTrials.gov identifier NCT02306096; Saal et al., Genome Medicine 2015), and by analyzing the overall survival of the patients. The results show that concordance between histopathological evaluations was high for ER, PgR, and HER2, but only moderate for Ki67 and NHG. Within the 3273-case cohort, the concordance between our biomarker predictions and clinical histopathology was similar to the concordance baseline established in the multi-rater biomarker evaluation. Survival analysis showed that our predictions add clinical value to histopathology by identifying patients who could potentially benefit from additional treatment and patients with poor prognosis.
Short Abstract: Recently, we developed DeepNano (Boža et al. 2016), an open-source base caller for Oxford Nanopore reads based on recurrent neural networks. On R7 data, our base caller outperforms the alternatives, while on R9 data the accuracy of DeepNano is slightly worse than that of the Albacore and Nanonet base callers released by Oxford Nanopore.
The advantage of DeepNano, however, is its flexibility. Under the default settings, DeepNano is faster, and by adjusting the size of the underlying network it is possible to further trade accuracy for speed. Fast base calling is essential in applications such as selective on-device sequencing (ReadUntil; Loose et al. 2016) and in settings where using cloud services, as supported by Oxford Nanopore, is impractical. It is also possible to adaptively retrain the network, which can be used to handle data that is otherwise impossible to base call by standard means (e.g., due to modifications or damage to the DNA).
Finally, we examine the dynamic time warping (DTW; Sankoff and Kruskal 1983) scheme for classification of reads and show that for applications such as ReadUntil, the method suffers from low specificity at high sensitivity. We demonstrate that by adjusting the methods used to scale the raw data, the sensitivity vs. specificity tradeoff can be substantially improved.
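The abstract's DTW scheme operates on raw nanopore current signals, and its last sentence hinges on how those signals are scaled. As a minimal illustration only (a textbook DTW distance plus simple z-normalization, not the authors' implementation), the two ingredients can be sketched as:

```python
def znorm(x):
    """Z-normalize a raw signal; scaling choices like this are
    exactly what the abstract reports as affecting sensitivity."""
    mu = sum(x) / len(x)
    sd = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5 or 1.0
    return [(v - mu) / sd for v in x]

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two
    1-D signals, e.g. a read's raw current vs. a reference model."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

A read would be classified by thresholding its DTW distance to a reference signal; the threshold sets the sensitivity/specificity tradeoff discussed above.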
Short Abstract: The Flow Decomposition problem has recently been used as an important computational step in genetic assembly problems. We give a practical linear fixed-parameter tractable (FPT) algorithm that solves this problem optimally.
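The contribution above is an exact FPT algorithm; as background only, the simple greedy baseline that flow decomposition methods are commonly compared against (repeatedly peel off a positive-flow source-to-sink path and subtract its bottleneck) can be sketched as follows. The names and the DAG input format are illustrative assumptions, not the authors' method:

```python
def decompose_flow(edges, source, sink):
    """Greedy flow decomposition on a DAG.
    `edges` maps (u, v) -> positive flow value; returns a list of
    (path, weight) pairs whose superposition equals the input flow."""
    flow = {e: f for e, f in edges.items() if f > 0}
    adj = {}
    for (u, v) in flow:
        adj.setdefault(u, []).append(v)

    def find_path(u, path):
        # DFS for any source-to-sink path with positive residual flow
        if u == sink:
            return path
        for v in adj.get(u, []):
            if flow.get((u, v), 0) > 0:
                r = find_path(v, path + [v])
                if r:
                    return r
        return None

    paths = []
    while True:
        path = find_path(source, [source])
        if not path:
            break
        bottleneck = min(flow[(path[i], path[i + 1])]
                         for i in range(len(path) - 1))
        for i in range(len(path) - 1):
            flow[(path[i], path[i + 1])] -= bottleneck
        paths.append((path, bottleneck))
    return paths
```

Greedy peeling is not guaranteed to use the minimum number of paths, which is why exact algorithms like the one announced here matter.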
Short Abstract: We introduce Tibanna, a cloud-based genomic workflow management system. There is an increasing demand for processing large-scale genomic data using cloud computing. Our goals are to accommodate flexible and automated handling of massive genomic data of heterogeneous types, to improve reproducibility by utilizing standardized workflows, and to facilitate effective use of the elastic nature of the cloud platform. To this end, we built an integrated system with adaptable components, tailored for the 4D Nucleome (4DN) Data Coordination and Integration Center (DCIC).
Tibanna adopts Amazon Web Services (AWS) as its main platform and consists of an upstream scheduler built on AWS Step Functions (the 'Tibanna scheduler') and a set of 'minions', i.e., Lambda functions that perform specific tasks in a serverless environment. The Lambda functions use three different utility components: AWSF, SBG pipe and annotator.
Tibanna AWSF and Tibanna SBG pipe are workflow executors. AWSF, an independent Autonomous Workflow machine Submission and monitoring Facility, launches an autonomous EC2 instance that executes a specified workflow, reports logs, and self-terminates (AWSEM, Autonomous Workflow Step Executor Machine). AWSF serves as a stand-alone tool as well as a Lambda utility. The SBG pipe is a connector to the proprietary Seven Bridges Genomics (SBG) platform and a controller for file import/export. A workflow may run either on the SBG platform (SBG pipe) or on the given account's EC2 instances (AWSF). Tibanna annotator creates and updates metadata for workflow runs, output files and quality metrics.
Tibanna handles workflows described in the Common Workflow Language and is naturally Docker-friendly. Tibanna is available at http://github.com/4dn-dcic/tibanna.
Short Abstract: Most existing dimensionality reduction and clustering packages for single-cell RNA-seq (scRNA-seq) data deal with dropouts by heavy modeling and computational machinery. Here, we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm that uses a novel yet very simple implicit imputation approach to alleviate the impact of dropouts in scRNA-seq data in a principled manner. Using a range of simulated and real data, we show that CIDR improves on standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA, and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds when processing a data set of hundreds of cells and within minutes for a data set of thousands of cells. CIDR can be downloaded at https://github.com/VCCRI/CIDR.
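CIDR's implicit imputation is only summarized in the abstract. The toy sketch below conveys the general idea (down-weighting expression differences that look like dropouts when computing a cell-cell dissimilarity) and is a heavily simplified assumption, not CIDR's actual formula; the threshold and weight here are made-up parameters:

```python
def dropout_aware_dissimilarity(x, y, threshold=1.0, w=0.5):
    """Toy dissimilarity between two cells' expression vectors.
    If one cell reads zero for a gene the other clearly expresses
    (above `threshold`), the zero is treated as a likely dropout
    and its contribution to the distance is dampened by `w`."""
    d = 0.0
    for a, b in zip(x, y):
        diff = a - b
        if (a == 0 and b > threshold) or (b == 0 and a > threshold):
            diff *= w  # dampen dropout-driven differences
        d += diff * diff
    return d ** 0.5
```

Such a dissimilarity matrix would then feed standard principal coordinate analysis and hierarchical clustering, which keeps the whole pipeline fast.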
View Posters By Category
- Bioinformatics Open Source Conference (BOSC)
- Network Biology
- Regulatory Genomics (RegGenSig)
- Computational Modeling of Biological Systems (SysMod)
Session A: (July 22 and July 23)
- High Throughput Sequencing Algorithms and Applications (HitSeq)
- Machine Learning Systems Biology (MLSB)
- Translational Medicine (TransMed)