Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Accurate quantification of differential splicing (DS) from RNA-Seq data is a challenging task which has received much attention in previous years. However, most of the works on DS have focused on comparing two conditions with few replicates, while DS analysis for large heterogeneous datasets such as GTEx remains a challenge. In order to address this challenge, we developed MAJIQ-HET, which builds upon the MAJIQ framework (Vaquero et al, eLife 2016; Norton and Vaquero et al Bioinformatics 2017). Unlike many tools, the MAJIQ framework allows HET to accurately detect and quantify complex local splicing variations (LSVs), including de-novo junctions not included in the annotation. In contrast to the original MAJIQ, HET discards the assumption of a joint (hidden) inclusion level per group and replaces it with robust statistics, either parametric or nonparametric, which account for missing values common in splicing analysis. Coupled with efficient C/C++ implementation, the new algorithm is 20 times faster than MAJIQ 1.0 and is able to easily process thousands of samples. Using both real and synthetic data based on GTEx and TCGA, we demonstrate MAJIQ-HET’s robust performance compared to current state of the art.
Short Abstract: Fragile X syndrome is the most common form of inherited cognitive impairment and is caused by the absence of the Fragile X Mental Retardation Protein (FMRP). FMRP is an RNA binding protein that associates with approximately 4% of brain mRNAs and regulates their translation. FMRP regulates the activity of microRNA-mediated silencing in the 3’ UTR of a subset of mRNAs by directly associating with Argonaute (AGO) and the RNA helicase Moloney leukemia virus 10 (MOV10). In previous work, we showed that FMRP can agonistically or antagonistically mediate translational suppression through its interaction with MOV10. In this work, we elucidate the mechanism of FMRP’s role in enhancing translational activity by mapping the interacting domains of FMRP, MOV10 and AGO and then showing that the RGG box of FMRP protects cobound mRNAs from AGO association. The N-terminus of MOV10 is required for this protection: its over-expression leads to increased levels of the endogenous proteins encoded by this cobound subset of mRNAs. The N-terminus of MOV10 also leads to increased RGG box-dependent binding to the G-quadruplex (GQ)-rich MAZ mRNA. We hypothesize that association of FMRP with MOV10 at GQs acts to stabilize the secondary structure through its RGG box, which modulates association with AGO.
Short Abstract: Transposable Elements (TEs) are ubiquitous in eukaryotic genomes — comprising up to 80% of plants like maize. They play a major role in regulating nearby genes and provide mechanisms for relocating host genes and for generating new genes and pseudo genes. Computational tools have proven invaluable in automatically detecting TEs. Long terminal repeat (LTR) retrotransposons are a particular class of TEs for which the existing tools produce incomplete annotations. Many are difficult to use and suffer from runtime and memory inefficiencies. We are currently developing a new software tool called LtrDetector, which efficiently identifies LTR retrotransposons using de novo techniques inspired by signal processing. LtrDetector calculates the distance to the closest k-mer that matches the one starting at each nucleotide. Sections of identical scores — plateaus — will emerge where a repeat exists and will be merged into repeat candidates. Finally, these LTR candidates are paired. Our initial results evaluated on manual annotations of D. melanogaster are promising. Based on the F-measure, which combines the sensitivity and the precision, LtrDetector outperformed three related tools in the least amount of time and using reasonable memory requirements. Further work will seek to identify nested and solo LTRs.
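The scoring step described in the abstract above can be illustrated with a minimal Python sketch. It is not the LtrDetector implementation: the function names `kmer_distance_scores` and `find_plateaus` are hypothetical, the sketch only looks for the nearest identical k-mer to the right of each position (the abstract does not say which direction is used), and the plateau rule (a run of equal nonzero scores of a minimum length) is an assumption. The merging and pairing of candidates is not covered.

```python
def kmer_distance_scores(seq, k):
    """For each k-mer start position, return the distance to the nearest
    identical k-mer further to the right, or 0 if there is none."""
    n = len(seq) - k + 1
    scores = [0] * max(n, 0)
    nearest_right = {}  # k-mer -> closest start position seen so far
    # Scan right to left so the dictionary always holds the nearest
    # occurrence to the right of the current position.
    for i in range(n - 1, -1, -1):
        kmer = seq[i:i + k]
        if kmer in nearest_right:
            scores[i] = nearest_right[kmer] - i
        nearest_right[kmer] = i
    return scores


def find_plateaus(scores, min_len):
    """Return (start, end, score) for every run of equal nonzero scores
    that spans at least min_len positions (end is inclusive)."""
    plateaus = []
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and scores[j + 1] == scores[i]:
            j += 1
        if scores[i] != 0 and j - i + 1 >= min_len:
            plateaus.append((i, j, scores[i]))
        i = j + 1
    return plateaus


# Toy sequence: an 8-bp repeat separated by an 8-bp spacer.
toy = "ACGTTGCA" + "GGGGCCCC" + "ACGTTGCA"
toy_scores = kmer_distance_scores(toy, 4)
toy_plateaus = find_plateaus(toy_scores, 3)
```

In the toy sequence, every 4-mer of the first repeat copy recurs 16 positions downstream, so those positions share one score and form a single plateau, while the spacer positions score 0.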
Short Abstract: A fundamental feature of genetic diversity is that genetic similarity decays with geographic distance; however, this relationship is often complex, and it may vary across space and time. Methods to uncover and visualize these relationships in genetic datasets have widespread use for analyses. While various frameworks exist, a promising approach is to build maps of how migration rates vary across space. Such maps in principle could be estimated across time to reveal the full complexity of population histories. Here, we present a step in this direction by providing a method that uses haplotype data to build separate maps of population sizes and migration rates for different time periods. We focus on long segments with a single pairwise shared coalescent (PSC) time across their length (also known as identity-by-descent tracks). By varying the length of PSC segments used as input, we obtain estimates of parameters for qualitatively different time periods. Using simulations, we find the method is capable of revealing time-varying migration rates and population sizes, including changes that are not detectable with data summaries that ignore haplotypic structure. We apply the method to the POPRES dataset consisting of 1400 European individuals to provide insights on recent population structure in Europe.
Short Abstract: Orphan genes are genes that are highly specific to a species; they are also called taxonomically restricted genes when present only in species belonging to a particular taxonomic group. These genes have been found to be involved in many molecular and biochemical functions, which have been studied in many plants, including Arabidopsis thaliana and soybean. The QQS gene was discovered as an orphan gene by our group and was found to regulate carbon/nitrogen allocation and increase protein content (Li, 2015). My research focuses on finding orphan genes in maize and using clustering methods such as the Markov Cluster Algorithm (MCL) and hierarchical clustering to find genes that are very closely related. Unannotated ORFs found within these gene clusters can be used to predict whether they are orphan genes in maize and to predict their function on the basis of the clusters they belong to. I am also verifying predicted orphan genes by growing orphan mutants under certain identified conditions. These experiments will give insight into the plausible functions of the identified orphan genes.
Short Abstract: Proprotein convertases (PCs) are important in the progression of various diseases, including cancer and metastasis, viral infections, inflammatory diseases, hypercholesterolaemia and pathogenic infections. Inhibitors and modulators of PCs may have applications in these disorders. Non-peptide inhibitors from natural sources, particularly from medicinal plants, have provided a large number of therapeutic agents. In this work, natural and semi-synthetic biflavonoid compounds of Garcinia brasiliensis were characterized in terms of IC50, inhibition constant (Ki), inhibitory mechanism, and flexible molecular docking simulations (MDS) against the Kex2, PC1 and furin proteases. Biflavonoid compounds were more potent inhibitors of PC1 than of furin and Kex2. Compound BF1 was a competitive inhibitor of furin and a mixed-type inhibitor of Kex2 and PC1. Compounds BF1 and BF3 presented the best XP-GScore binding energies on PCs due to the presence of H-bond donor (hydroxyl) and H-bond acceptor (carboxyl) groups. Compound BF4 showed no interaction with Kex2 and furin because its carboxypropyl groups generated steric hindrance to ligand-protein interaction. According to MDS, the compounds predominantly form hydrogen bonds with amino acids from the S1 and S2 subsites. In conclusion, the in vitro IC50, inhibitory mechanism, and MDS studies of biflavonoids showed these molecules to be a source of a new class of compounds to modulate PCs.
Short Abstract: Many biological processes including gene regulation and signal transduction are stochastic. Although the probability landscapes of their underlying networks can be now computed exactly using methods such as the ACME algorithm, the high-dimensional nature of these landscapes hinders clear understanding and interpretation of behaviors of these systems. We have developed a computational method to characterize the topology and metric properties of these high dimensional probability landscapes via the construction of the approximate Morse-Smale complex. Our method is efficient and requires information of only the 1-skeleton of the probability surface. With the analysis of persistence of the approximate Morse-Smale complex, we can identify precisely all important features of the high dimensional probability landscape such as basins, peaks and saddle points. We further discuss how this method can reveal important cellular states and obligated transition paths among different states.
Short Abstract: The interactome is the ensemble of complexes generated through protein-protein interactions. Protein-protein interactions represent a further dimension of structural properties. All aspects of cell biology are linked by networks of protein-protein interactions, and a hallmark of human disease is the disruption or dysfunction of network assembly. Targeting protein interfaces has enormous therapeutic potential. Modulating protein-protein interactions with small molecules is a concept that has been pursued for more than a decade, yet it remains a challenge to discover small molecules that can disrupt protein-protein interactions. Small molecules have drug-like potencies and bind to hotspots on the contact surfaces involved in protein-protein interactions. Because of their difficult topologies and large buried surface areas, protein-protein interactions were considered undruggable. A detailed understanding of the mechanism of apoptosis helps to design newer strategies in cancer therapy. Small molecules acting as inhibitors of anti-apoptotic proteins are promising anti-cancer agents. Natural products are scaffolds that specifically interact with macromolecules, especially proteins. They have a wide spectrum of chemical and functional diversity. The diversity and complexity of natural compounds enable these small molecules to target a wide variety of biological macromolecules in a selective fashion and help in the development of drugs and the understanding of biological systems.
Short Abstract: Here we integrated enterovirus (EV) sequences and corrected their serotypes according to the virus classification of NCBI GenBank and the International Committee on Virus Taxonomy (ICTV) taxonomy. Using around 48,382 corrected EV family sequence records with their serotypes, we applied a deep learning approach (Convolutional Neural Networks, CNN) to classify the 308 genotypes of the EV family. Although the macro-average of prediction accuracy by five-fold cross-validation (CV) is around 80%, the micro-averages for EV-71 and D-68 are up to 99% and 98%, respectively. We therefore built a pipeline that combines a homology-search filter (coverage > 80% with E < 1.0E-5) with the CNN model to confirm that submitted sequences belong to the EV family and then classify them into the appropriate genotype. We name this approach EV Genotyping and have constructed a fully automatic web application that provides precise and rapid prediagnosis of EV genotype, especially for EV-71 and D-68 (99.3% and 97.8%, respectively). EV Genotyping is available at http://symbiosis.iis.sinica.edu.tw/Enterovirus/
Short Abstract: MySort is a web tool for resolving the relative proportions of twenty-one immune cell subclasses from a human tissue transcriptome profiled by microarray technology. We use ν-Support Vector Regression and a selected gene feature set to construct a deconvolution model. The resolved proportions from a user-uploaded dataset are presented in a column-wise layout and visualized as a bar chart with hierarchical clustering among the submitted data. The diversity of immune cell components is estimated as alpha diversity (diversity of composition in a sample) and beta diversity (diversity of composition among samples). The performance of our system was evaluated using blood biopsies from 20 adults, in which 9 immune cell types were identified using flow cytometry. The present computations performed better than current state-of-the-art deconvolution methods. MySort is available at http://188.8.131.52/mySORT/
Short Abstract: In recent years, long read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been demonstrated to substantially improve the quality of genome assembly and transcriptome characterization. Compared to the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. That is, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is in high demand. Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads (SRs). Using GM12878 human data as a gold standard, we demonstrated that IDP-denovo had superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered many novel genes and novel isoforms that were not reported by the existing annotation library. These results reveal the high diversity of isoforms in D. officinale.
Short Abstract: Third Generation Sequencing technologies (TGS), including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have drawn growing attention in biological research. However, the application of the long reads currently suffers from relatively high error rates. The high quality Second Generation Sequencing (SGS) short reads can be applied to correct the TGS long reads, which is termed as hybrid error correction. Most of the existing hybrid correction methods can be categorized into two classes: alignment- and graph-based methods. Several principal factors affect the error correction performance, including long read and short read error rates, short read coverage, alignment criterion and solid k-mer size, etc. We propose a theoretical framework of hybrid-correction algorithm. Through the modeling of short read alignment and solid k-mer occurrence, we analyze the effects of different factors on error correction performance (i.e., accuracy gain) for alignment- and graph-based methods. The theoretical results are further validated by simulated and real data. We also make a general comparison of the two kinds of methods on error correction performance. Our study serves as guidance for method selection, parameter design and future method development for hybrid correction.
Short Abstract: By analyzing a cohort of microarray datasets from OA patients, we identified differentially modulated genes, including the highly deregulated genes termed ‘drivers’ of the OA mechanism. Genes whose expression level was, on average, deregulated more than expected by chance were determined using the “Gitools-oncodrive” script in the Gitools software, with the binomial (Bernoulli) test as the statistical test of significance. Our approach not only identified previously reported OA-associated genes, e.g. ACAN, OGN and LOXL4, which validates our concept, but also identified some new genes such as ASPM, AURKB, HAPLN1, HAS3, MAP1B and PCSK1, which have functions in extracellular matrix stability, cell cycle regulation and inflammation, making them potential regulators in the OA mechanism. Most of the up-regulated driver genes detected are associated with inflammatory agents, e.g. the cytokines IL-8, IL-11 and IL-6, which may drive increased production of matrix-degrading enzymes in OA joint tissues. Most down-regulated driver genes found are involved in stabilizing the proteoglycan monomers with hyaluronic acid in the extracellular cartilage matrix. Because of their significant down-regulation (as detected in this study), the stability of the ECM may be compromised in OA cartilage. Our identified driver genes not only provide better insight into the disease mechanism but can also serve as biomarkers in OA diagnosis and may work as potential drug targets.
Short Abstract: In silico deconvolution methods for estimating cell compositions in bulk gene expression data require cell type-specific reference gene expression profiles (RGEPs) and different statistical algorithms. All published RGEPs (e.g. CIBERSORT LM22 and IRIS) were generated from sorted microarray data based on peripheral blood samples, which may be inaccurate because of the absence of tissue-specific cell types (e.g. fibroblasts) and average expression level of a large cell population. As a high-resolution technology at the single cell level, single cell RNA-Sequencing may generate more comprehensive and accurate RGEPs including infiltrated immune cells, tissue-specific cells and other rare cells without known cell markers. Thus, we developed a novel method to generate tissue-derived RGEPs from scRNA-Seq of rheumatoid arthritis (RA) synovial tissue and found that this RGEP has a significantly higher Goodness-of-Fit than current blood-derived RGEPs when dissecting both sorted and bulk RNA-Seq data from RA synovial tissue. We further showed that single cell-based RGEP outperformed RGEP derived from sorted RNA-Seq data, supporting the advantage of single cell RNA-Seq method on studying cell compositions with higher resolution. Finally, we evaluated association between estimated proportion of cell subtypes and clinical parameters and found that some cell subtypes might associate with certain clinical phenotypes of RA.
Short Abstract: Age, disease and/or exposure to environmental factors can induce tissue remodelling as a consequence of alterations in either protein structure or relative protein composition/abundance. The causative mechanisms may include post-translational modification (PTM) by reactive oxygen species (ROS), glucose, proteases or the accumulation of DNA-damage. We have previously shown experimentally that proteins rich in amino acids susceptible to ultraviolet (UV) damage are differentially susceptible to UV radiation. Human skin is prone to UVR damage leading to photoageing and attention to date has focused on abundant extracellular matrix (ECM) proteins in the dermis. However, it is evident that “minor” proteins also play an important role in maintaining tissue homeostasis. We hypothesise that proteins and their genes can be stratified according to their susceptibility to modification. Our bioinformatic approaches characterise proteins defined by the Manchester Skin Proteome (http://www.manchesterproteome.manchester.ac.uk) according to their susceptibility to UV/ROS, glycation and protease cleavage (proteins) and photochemical modification by UVB (DNA). Using this approach collagenous proteins are predicted to be sensitive to glycation and proteolysis whilst their DNA is UV sensitive. These methods may be useful in identifying novel pathways and protein targets of ageing, inflammation and diabetes. Study was supported by a programme grant from Walgreens Boots Alliance.
Short Abstract: With a bang, alignment-free (AF) sequence analysis tools have exploded into biological research. As AF methods offer computational speed many hundreds of times faster than the comparable alignment-based approaches, they have been applied to problems such as NGS analysis, whole-genome/proteome phylogeny, identification of horizontally transferred genes or recombined sequences — and many more. We present Alfree (http://www.combio.pl/alfree), a web toolkit for both newcomers to the alignment-free field as well as experienced users and developers of AF tools. We present recent and forthcoming features of the web service: (i) integrated on-line tools for sequence comparisons/phylogeny, (ii) a catalogue of freely available programs covering 20 basic research tasks (e.g., NGS analysis, annotation of long non-coding RNAs), (iii) accuracy evaluation of popular alignment-free methods and (iv) benchmarking resources and protocols for developers of AF methods.
Short Abstract: With the rapid growth of biomedical literature in PubMed, about two articles every minute, finding and retrieving the most relevant papers for a given query is increasingly challenging. We present Best Match, a new algorithm for relevance search in PubMed, which uses machine learning to build on top of a classic term weighting strategy by integrating many additional relevance signals (e.g. publication year, type and past usage) for finding best matching articles. We demonstrate that the new ranking algorithm provides state-of-the-art retrieval performance in benchmarking experiments. Furthermore, we find that this positive algorithmic change translates into increased click-through rate and improved user experience in real-world circumstances. Since the new algorithm was fully deployed in June 2017, we have also observed a steady increase (over 30%) in Best Match usage by PubMed users, and it now assists millions of searches in PubMed on a weekly basis. By presenting this work in detail, we hope to increase the awareness and transparency of this new relevance sort option for PubMed users worldwide, enabling them to ultimately search more effectively.
Short Abstract: One of the prevalent mutational processes in cancers is the C:G>T:A transition in a CpG context. This mutational process is connected to the cytosine methylation where 5-methylated cytosine can spontaneously deaminate to thymine. This mutagenic process in cancer has yet to be explored in detail, especially in protein coding regions. Whole genome bisulfite sequencing (WGBS) allows the methylation status of almost all CpG’s in the human genome to be assayed. In this study, we address the following questions by using WGBS from normal and tumor tissue together with the somatic mutation data. How well can cytosine methylation explain the variance in mutation rate at CpG dinucleotides in different regions of cancer genomes? To what extent does this mutational process affect various genes associated with cancer? We find that there is large heterogeneity in methylation levels within the cell population at mutated sites, and that the mutation frequency at CpG sites in cancer is highly correlated with CpG methylation in both protein coding and non-coding regions.
Short Abstract: Jupyter Notebooks provide a means for making bioinformatics data analyses more transparent, accessible and reusable. However, creating notebooks requires a degree of programming expertise which is often prohibitive for common users. Here we introduce BioJupies Generator, a Jupyter Notebook generator that enables users to easily create, store, and deploy Jupyter Notebooks containing RNA-seq data analyses. Through an intuitive interface, novice users can rapidly generate tailored reports to analyze RNA-seq data from over 4000 published studies currently available on the NCBI Gene Expression Omnibus (GEO). Users can additionally upload their own data for analysis, either as raw sequencing files or gene expression matrices with counts. The reports contain interactive data visualizations of the samples, differential expression and enrichment analyses, and queries of signatures against the LINCS L1000 dataset for small molecules that can either mimic or reverse the signatures. Generated notebooks are permanently stored on the cloud, made available through a public URL, and can be cited through a unique Digital Object Identifier (DOI). By combining an intuitive user interface for Jupyter Notebook generation with options to upload customized analysis scripts, BioJupies Generator addresses computational needs of both experimental and computational biologists. The BioJupies Chrome extension is available at: https://chrome.google.com/webstore/detail/biojupies-generator/picalhhlpcjhonibabfigihelpmpadel.
Short Abstract: A growing body of evidence suggests that the equilibrium of microbiome communities and their interaction with their host plays an important role in maintaining host health. Changes in microbiome communities (dysbiosis) may be a cause or a consequence of the development of many diseases. The choice of adequate computational metagenomics methods is decisive in determining microbial taxonomy profiles and functions accurately. However, there is no consensus yet about the best analytical and computational approaches to use. The sbv IMPROVER crowdsourcing project, as a means to verify methods and data in systems biology, has already shown its usefulness in benchmarking computational methods. The design of the sbv IMPROVER Microbiomics Challenge focuses on the influence of sample complexity (number of species) and sequence bias (AT vs. GC-rich) on the quantification of microbial communities at various taxonomic levels based on shotgun sequencing data. Preliminary results obtained with in-house computational pipelines benchmarked by the challenge organizers show disparities in method performance depending on taxonomic rank and sample complexity. Overall, the results of the sbv IMPROVER Microbiomics Challenge will contribute to our knowledge on specific aspects of metagenomic data analysis and the applicability of the different methods in various contexts.
Short Abstract: We present the first large-scale analysis of bioinformatics source code. We have curated 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we have included 23 high profile repositories identified by their popularity in an online bioinformatics forum. We have extracted repository data from the GitHub API, including file contents, activity data, and repository metadata, as well as metadata for the published articles. We demonstrate key relationships within our dataset, including: (1) certain article topics are associated with more active code development and higher community interest in the repository, (2) most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high profile set is statically typed, (3) developer team size is associated with community engagement and high profile repositories have larger teams, (4) the proportion of female contributors decreases for high profile repositories and with seniority level in author lists, and (5) multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication.
Short Abstract: Burkholderia pseudomallei (Bp) is a motile gram-negative bacillus and the causative agent of melioidosis, a febrile disease. Bp possibly evades the innate immune response by binding to the alternative pathway regulatory protein factor H (fH). fH-binding proteins (fHbps) exist in a microbe’s outer membrane, facilitating exposure and binding to fH. Several microbial species possess fHbps, including Y. enterocolitica via YadA, N. meningitidis via fHbp and H. influenzae via P5. We hypothesize that Bp possesses one or more fHbps. Western blot analysis in our lab using OM protein preparations demonstrated fH binding to whole cell Bp. In Burkholderia, BpaC appears to have a role in resistance to serum-mediated killing, which suggests its role as a potential fHbp. BLAST database searches identified BpaC in Bp 1026b as possessing homology to Y. enterocolitica YadA. Homology and ab initio-based modeling predicted four domains in BpaC. Topological algorithms predict domain two to be entirely extracellularly exposed. The binding affinity between BpaC domain two and fH domains 19-20 is predicted to be stronger than the interaction between the known fHbp B. burgdorferi OspE and fH domains 19-20. These findings support BpaC interaction with fH to prevent complement deposition via the alternative pathway. In vivo methods to confirm this interaction are ongoing.
Short Abstract: The volume of transcriptome data is experiencing tremendous growth due to significant improvements in sequencing technologies. The National Center for Biotechnology Information (NCBI) is continually adapting its computational infrastructure to accommodate this large influx of data with databases such as the Transcriptome Shotgun Assembly Sequence Database (TSA) and the Sequence Read Archive (SRA). Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web applications for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web applications. They are based on open-source bioinformatics tools. The interactive web applications provide a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present two case studies for the species Physalis peruviana and Musa acuminata.
Short Abstract: To determine the genetic factors causing variation in longevity, several genome-wide association studies (GWAS) have been carried out on panels of long-lived individuals. Most studies tend to have little impact due to small sample sizes. For this reason, model organisms such as Drosophila melanogaster have become increasingly important in identifying genetic factors affecting longevity. In this study, a network approach was used to predict novel genes, genomic regions and single nucleotide polymorphisms (SNPs) playing a role in longevity by integrating three-dimensional (3D) chromosome interaction data and two GWAS datasets (Burke et al. 2013; Ivanov et al. 2015). We hypothesise that the 3D architecture of the Drosophila genome dictates the co-location of specific genes/genomic regions. Genes and/or SNPs residing within these co-located genomic regions may influence longevity either independently or through a cumulative effect. To identify influential nodes, the properties of the networks were calculated (clustering, modularity and PageRank). These nodes were further analysed using Gene Ontology. References: Burke, M.K. et al. 2013. Genome-wide association study of extreme longevity in Drosophila melanogaster. Genome Biology and Evolution, 6: 1-11. Ivanov, D.K. et al. 2015. Longevity GWAS using the Drosophila genetic reference panel. Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences, 70: 1470-1478.
Short Abstract: Our study tackles three challenges facing high-dimensional sparse count data: mathematical theory, algorithmic development, and applications. With single-cell RNA sequencing, the transcript counts contain low integers and very high rates of 0's due to shallow sampling and “drop-outs”. Since cell number (100's-1000's) is usually smaller than gene number (20,000-40,000), statistical inference tends to be under-determined. We developed new theoretical support for the unique distribution properties of “double-sparse” data, with (1) low-rank structure and (2) low integers with zero-inflation. One recent effort is to adopt the Gini index to assess whether a cell devotes most of its transcripts to a small spectrum of genes or spreads them over many regulatory programs. Yet the Gini index varies with the cell's sequencing depth; in fact, both the α and β diversity measures depend on the abundance and dispersion properties of the count data. We adjusted both the Gini index (α-diversity) and the Spearman rank correlation (β-diversity) and critically reassessed them in truth-known simulation data. We applied our workflow to analyze >36K single cells from the mouse testis, finding that the transition from spermatocytes to mature sperm follows a smooth trajectory with increasing Gini index. Our experience highlights the value of iterative discovery cycles among theory, methods and applications.
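For readers unfamiliar with the Gini index mentioned in the abstract above, the following Python sketch computes the standard, unadjusted index for one cell's transcript counts. The function name `gini_index` is hypothetical, and the sketch does not include the authors' depth adjustment, which the abstract does not specify. It uses the sorted-value form G = 2 Σ i·x_(i) / (n Σ x) − (n + 1)/n.

```python
def gini_index(counts):
    """Unadjusted Gini index of a list of non-negative counts.

    Returns 0 when all genes carry equal counts and approaches 1 when
    a single gene carries nearly all of the counts.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one gene and a nonzero total count")
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# A cell spreading transcripts evenly versus one concentrating them.
even_cell = gini_index([5, 5, 5, 5])
focused_cell = gini_index([0, 0, 0, 10])
```

Because the raw index is computed from observed counts, two cells with the same underlying expression program but different sequencing depths can yield different values, which is the dependence the abstract sets out to correct.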
Short Abstract: The quantification of RNA sequencing (RNA-Seq) abundance using the normalization method of transcripts per million (TPM) is a crucial step when comparing multiple samples from different experiments. TPMCalculator is a one-step software to process RNA-Seq alignments in BAM format and report TPM values, raw read counts and feature lengths for genes, transcripts, exons and introns. The program describes the genomic features through a model generated from the gene transfer format (GTF) file used during alignments. We calculate the Pearson correlation coefficient between TPM, FPKM and DESeq2 results for normalized expression values using RNA-Seq data of 1155 samples from the TCGA-BRCA project. TPM and FPKM were correlated above 0.9 for 95.67% of the samples, while the correlation coefficient of either TPM or FPKM using DESeq2 normalized data was between 0.6 and 0.8. The correlation coefficient between raw read counts calculated with TPMCalculator and with HTSeq was above 0.9 in 87.73% of the cases using a strict setting in HTSeq. TPMCalculator is freely available at https://github.com/ncbi/TPMCalculator. It is implemented in C++14 and supported on Mac OS X, Linux and MS Windows.
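As an illustration of the TPM normalization described above, the following Python sketch computes TPM values from raw read counts and feature lengths. It follows the standard TPM definition and is not TPMCalculator's actual implementation; the function name `tpm` and the example numbers are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw counts and feature lengths (bp).

    Illustrative helper: each count is first divided by its feature length,
    then the length-normalized rates are rescaled to sum to one million.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    if any(length <= 0 for length in lengths_bp):
        raise ValueError("feature lengths must be positive")
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


# Three hypothetical genes: raw read counts and lengths in base pairs.
values = tpm([100, 200, 300], [1000, 2000, 1000])
print(values)
```

Because the values always sum to one million within a sample, TPM values are comparable across samples in a way that raw counts are not.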
Short Abstract: Although significant progress has been made in developing computational methods for binning and taxonomic profiling of metagenomic and microbiome samples, these methods have so far been mostly restricted to the analysis of prokaryotic sequences. However, many recent surveys of environmental metagenomic samples also include sequences from microbial eukaryotes, such as fungi, metazoans and protists, in addition to the usual bacterial and archaeal sequences. The accurate identification of eukaryotic sequences in large sets of short sequences is an important step toward further characterization of their role in the microbial community and evaluation of their metabolic potential. To aid with this task, we conducted a comparative analysis of ~900 eukaryotic genomes together with several thousand non-redundant reference prokaryotic genomes from the IMG database in order to identify PFAM domains unique to either eukaryotes or prokaryotes. Additionally, other unique eukaryotic and prokaryotic sequences without PFAM annotations were clustered into families, and corresponding Hidden Markov Models (HMMs) were built for them. These HMMs were calibrated and tested to predict eukaryote- and prokaryote-specific sequences with high accuracy. Based on the unique PFAMs found and the newly constructed HMMs, we developed a program that identifies and bins potential eukaryotic sequences from metagenomic projects.
Short Abstract: Genomic data-sharing beacons provide a standardized interface for data sharing by allowing only yes/no queries on the presence of specific alleles in the dataset. Previously deemed secure against re-identification attacks, beacons were later shown to be vulnerable. Recent studies have demonstrated that it is possible to determine whether the victim is in the dataset by repeatedly querying the beacon for his/her SNPs. Here, we propose a novel re-identification attack and show that the privacy risk is more serious than previously thought. Even if the victim systematically hides informative SNPs, it is possible to infer the alleles at positions of interest as well as the query results with high confidence. We use linkage disequilibrium and a high-order Markov chain-based algorithm for inference. We show that in a simulated beacon with 65 individuals from the CEU population, we can infer membership of individuals with 95% confidence with only 5 queries, even when SNPs with MAF less than 0.05 are hidden. We need less than 0.5% of the number of queries that existing works require to determine beacon membership under the same conditions. We show that countermeasures such as a query budget would still fail to protect the privacy of the participants.
Short Abstract: An international consensus in 2012 reported that medulloblastoma (MB) comprises four molecular subgroups (WNT, SHH, Grp3, Grp4), each associated with distinct genomic features and clinical behaviors. Independently, recent work by three groups, using either DNA-methylation microarrays or combined DNA-methylation/expression-microarrays has defined varying intra-subgroup heterogeneity in the form of biologically and clinically relevant subtypes within subgroups. However, subtype number and definition differed between studies, especially within Grp3/Grp4. This study aims to establish consensus on the number and definition of Grp3/Grp4 subtypes. By combining published cohorts with novel cases, we assembled 1,506 Grp3/4 samples with methylation data, including 852 with paired transcriptome/methylome data. We analyzed these cohorts using the techniques previously employed in the recent subgrouping studies (i.e. NMF, t-SNE/DBSCAN, SNF), additionally adding objective subgroup definition and robustness measures by multiple resampling of datasets. Regardless of technique, the maximum number of robust subtypes was 8. We also inferred that while analysis of DNA-methylation data alone is sufficient for robust subtype stratification in large datasets, the use of SNF, which analyzes paired transcriptome/methylome data, can be advantageous for smaller datasets. We are working to understand remaining inconsistencies before presenting a final consensus which is broadly consistent with the previously published subtypes I-VIII.
Short Abstract: Motivation: CRISPR-Cas9 genome-wide screens have recently gained popularity, and the number of algorithms that can analyze screen data has concurrently increased. Finding the right tool can be a daunting task, especially with the ever more complex CRISPR screens with varying effect sizes, biological read outs, and number of replicates. Results may be further biased by additional analytical difficulties, such as data normalization or the selection of sgRNA library. Several algorithms exist, including MaGECK and ScreenBeam, but despite specific strengths, each performs differently depending on the screening conditions. Hence, there is a need for a tool that summarizes results and aids in prioritization of genes for further biological investigation. Results: Here we introduce CRISPRKat, a versatile and modular tool that combines results of some of the most widespread CRISPR screening algorithms. CRISPRKat aims to report not only selected genes, but also to explain how results vary based on normalization methods or cut-off stringencies. Finally, we incorporate information from biological knowledge databases (GO, STRING) to help with prioritization of genes of interest. Availability and implementation: CRISPRKat is open-source and implemented as an easy-to-use Snakemake pipeline that combines scripts written in R, Ruby and Python. All scripts are available at https://github.com/DBHi-BiC/CRISPRkat.
Short Abstract: Bacterial species resistant to antimicrobial agents have become a worldwide health concern. Among these, strains of Salmonella, which cause diarrhoeal diseases, can sometimes be life-threatening. The host infection mechanism of the pathogen involves a series of secretory proteins whose folding and quality are regulated by molecular chaperones. This motivates targeting chaperones as a potential antibiotic strategy directed at host-pathogen interactions. We selected DnaK, a crucial chaperone that plays an important role in pathogen survival under stress conditions such as those encountered during antibiotic therapy. As DnaK is a heat-shock protein of the Hsp70 class, we selected another Hsp70 chaperone, SigE, which is required for stabilization and secretion of the secretory proteins. Because SigE is in turn regulated by another indispensable, autoregulatory chaperone, SicA, we chose these three representative proteins for molecular docking with available marketed drugs and other selected ligands. Our results showed that the best binding interaction of these proteins was with XR770, a phenalenone derivative, suggesting a therapeutic intervention against the emerging drug resistance of pathogenic Salmonella strains. We further propose potential susceptibility predictors suitable for a series of trials, testing and analysis in the wet-lab environment.
Short Abstract: Identification of drug-target interactions (DTIs) plays a key role in drug discovery. However, in vitro and in vivo experiments for identifying DTIs are laborious and costly. Computational models that predict potential DTIs are therefore employed to reduce cost. Previous computational models used protein and drug features to predict potential DTIs, but they are limited in that they can capture only simple combinations of features. In our model, we apply a convolutional neural network to the protein sequence to capture local patterns of the drug-target binding region. We convert each amino acid to an embedding vector and perform convolution on the embeddings. Through global max-pooling of the convolution result, our model measures how well a given protein sequence matches the learned local patterns. To train our model, we collected 35164 DTIs from various databases. Our trained model achieves an AUPR of 0.83 on an independent validation dataset from MATADOR.
Short Abstract: Recent research has suggested that short-term fasting triggers immune system rebooting and stem cell-based regeneration of new immune cells, shifting stem cells from a dormant state into a state of self-renewal. However, the mechanism of immune system rebooting in intermittent fasting remains unclear. We investigated the immune system inflection using genome-wide DNA methylation and whole-transcriptome data in mice. Short-term fasting mice were fed for 5 days and fasted for 2 days, and their immune cell populations were examined for CD4-positive T cells. DNA methylation in fasting mice was significantly decreased in gene body regions. The expression of DNA methylases, including DNMT1, DNMT2, DNMT3a and DNMT3b, was decreased in the short-term fasting groups. In the integration and network analysis of epigenome and transcriptome, pathways related to immune system regulation were enriched, including the NF-kappa and Jak-STAT signaling pathways. Cytokine-mediated signaling and activated T cell proliferation pathways were identified by module-based network analysis. We validated that the CD4-positive T cell population increased in the short-term fasting group, but not in the normal diet and refasting groups. We propose that short-term fasting plays a crucial role in the immune system through DNA methylation changes via CD4-positive T cell activation and can be involved in immune system rebooting.
Short Abstract: Osteoarthritis (OA) is a chronic disease of the joints that occurs most often in joints such as the knees and hips. Although OA was once considered to be caused by wear-and-tear damage to the joints, researchers now regard it as a disease. To understand the genetic regulation of OA-causing conditions, we analyzed transcriptome data for OA-inducing factors, augmented with protein-protein interaction (PPI) data. We used gene expression profiles of 18 mice, comprising 11 OA-condition articular cartilage samples induced by IL1β, HIF2α or ZIP8 and 7 healthy controls. For human, 10 OA-condition synovium cell samples and 10 healthy controls were used. Augmented differentially expressed genes (DEGs) were identified for each group using the PPI network, and enrichment analysis of the augmented DEGs was performed for each experimental group. We found 1039, 693 and 433 PPI-based DEGs in the mouse OA datasets induced by IL1β, HIF2α and ZIP8, respectively. The OA-associated enrichment scores were improved with the PPI network compared with traditional DEGs. In the human dataset, inflammatory GO terms were additionally enriched among the augmented DEGs. Through this PPI-based genome-wide analysis, we confirmed that inflammation is associated with OA mechanisms. We suggest that this PPI-based approach to DEG analysis can improve the identification of potential biological pathways.
Short Abstract: As cancer omics data become increasingly available, there is a growing need to integrate these data with the biological networks to reveal tumor heterogeneity and discover potential biomarkers for precision medicine. Recent pan-cancer studies have proposed several strategies for an ever-expanding survey of genome variation, but these studies still do not consider both module/network evolution and pan-cancer analysis for studying tumor heterogeneity and evolution in patient cohorts and treatment regimens. Here, we present a new strategy via considering module organization, module variance, and pan-cancer module network analysis to identify the cancer-focused dominating protein-module networks by using expression signatures from 5,922 tumors with overall survival outcomes across 15 cancers. For example, ITGAV module and some interacting proteins participate in the processes of cell adhesion and communication in most cancer types; however, some proteins are tumor-specific (e.g., PARVG in lung squamous cell carcinoma). Moreover, we found that expression of genes in ITGAV module-regulated subnetworks are significantly associated with adversely prognostic outcomes in nine cancer types, and also have more prognostic power than the expression of ITGAV itself in these cancers. This strategy provides a roadmap to investigate the tumor-specific and -common signatures as well as the molecular classification of cancer subtypes.
Short Abstract: Functional genomics data is emerging as a valuable resource for personalized medicine. Here, we focus on the privacy aspects of genome-wide RNA-Seq and ChIP-Seq signal profiles, which represent measurement of activity at each genomic position. The signal profiles do not contain any explicit nucleotide information, i.e. no actual read information is revealed, and are thought to be safe to share publicly. Several consortia, for example IHEC, GTEx, and TCGA, publicly share functional genomics signal profiles while the genotype data is under restricted access. Here, we show that the signal profiles can be used to correctly genotype genomic deletions. Moreover, we demonstrate that these deletions can be used in linking attacks to identify individuals in genotype datasets. We develop measures of correct genotype prediction and information leakage from the RNA-seq signal profiles. We then present practical methods for genotyping deletions, and accurate linking of individuals to a large sample. To close the genotype leakage, we present an effective anonymization procedure against genotype prediction based linking attacks. Considering the extent to which RNA-seq signal profiles are shared publicly on the web, our results point to a critical source of sensitive information leakage, which can be potentially protected by our anonymization technique.
Short Abstract: The arrival of Next Generation Sequencing (NGS) has benefitted the development of many bio-related fields. In large-scale multi-sample analysis, a slew of publicly available NGS libraries are downloaded and processed by a series of computational pipelines to extract useful information. However, it is very common for the adapter sequences to be recorded with errors in the databases, or to be missing altogether. This paper presents a set of algorithms that, for the first time, find adapter contamination without the need for adapter information and are suitable for both Illumina single-end and paired-end NGS data. The algorithms start by predicting the adapter sequence from the libraries without any a priori information, and then use the de novo predicted adapter to remove possible contamination from the sequencing library. We designed separate algorithms for single- and paired-end sequencing. To our knowledge, this is the first adapter trimming algorithm that does not need a priori information about the adapter sequence for both single- and paired-end sequencing. The performance of our work has been validated on NCBI open datasets and shown to be fast and accurate. This tool is especially suitable for meta-analysis, enabling fully automatic NGS processing.
Short Abstract: Bacteriophages (phages), or viruses that infect bacteria, outnumber both eukaryotic viruses and bacteria within the human body. Despite their prevalence, only a small fraction of viral diversity has been characterized. To further our understanding of the phage communities within the urinary tract, we examined metagenomic data collected from urine samples. Samples were collected from patients without urinary symptoms, and from patients with either cytomegalovirus infection, urinary tract infection, or overactive bladder. A key challenge in studying the urinary microbiome, be it bacterial or viral, is its low biomass. Thus, the datasets examined here include genomic material from both the bacterial and viral fractions of the community. We employed the bioinformatic tool virMine, which was developed by our group specifically to isolate viral sequences from mixed samples. Using this approach, we found numerous partial and complete viral genomes, including both phage genomes and eukaryotic viruses. This approach provides a powerful tool for discovering novel viral species within the microbiome.
Short Abstract: Childhood cancers and structural birth defects share a common context of altered developmental biology. While researchers have been increasingly identifying the underlying biological causes of these conditions, the potential role of shared, genetic alterations and/or pathways across pediatric cancers and birth defects is not well explored. A better understanding of common developmental settings could spur advancements in prevention, detection, and therapeutics that will improve the lives of the children and families impacted by these conditions. The NIH Common Fund Gabriella Miller Kids First Program represents a first-in-kind national, collaborative initiative focused on large-scale clinically annotated genomic data for childhood cancers and structural birth defects. The Kids First Data Resource Center (DRC) is charged with empowering collaborative discovery across Kids First datasets. Through newly developed platforms and cloud-based resources, researchers will be able to access harmonized, WGS data as well as phenotypic/clinical data across a diverse landscape of childhood cancers and structural birth defects. Approximately 8,000 patient samples will be ready at the launch of the Kids First DRC portal. More than 25,000 WGS are expected to be processed by 2019, making the Kids First Data Resource Center Portal one of the largest pediatric data resources of its kind.
Short Abstract: The glycoscience field comprises not only carbohydrate sugar chains, or glycans, but also their closely related partners, including proteins, lipids and metabolites. Thus the glycosciences encompass a large portion of the life sciences and affect diseases, infection and immunity, to list a few. In order to grasp the complexity of the glycoscience field, we have been working on integrating glycoscience data through Semantic Web technologies. In doing so, we have developed ontologies to standardize these data, and we have released the International Glycan Structure Repository (GlyTouCan; https://glytoucan.org), in which scientists can register glycans to obtain accession numbers for use in publications. We note that the complexity of glycan structures in and of themselves adds to the complexity of the standards we have developed. Glycans are synthesized by enzymes, so they cannot be sequenced directly; that is, no glycan sequencer exists, although semi-high-throughput methods using mass spectrometry have been developed in recent years. However, this difficulty in sequencing glycans results in data that are ambiguous and incomplete. We have developed a repository system that can handle such ambiguity to uniquely identify glycans and to create links between glycans that encapsulate their relationships.
Short Abstract: We present GWASpro, a GWAS (Genome-Wide Association Study) web server for analyzing large-scale and complex data. In GWAS, the linear mixed model (LMM), in which design matrices encode the experimental design, is widely adopted. While other popular GWAS software tools are confined to simple experimental designs, GWASpro supports building complex design matrices and thereby complex experimental designs, which can lead to high-quality QTL mapping. GWASpro also addresses big data-driven challenges by supporting job distribution and parallel computation. We simulated data sets imitating a situation in which the same two populations (Pop.A, Pop.B) are grown in different environments, and performed GWAS practices with phenotypes of: (1) Pop.A, (2) Pop.B, (3) the average between Pop.A and Pop.B, and (4) the combination of Pop.A and Pop.B. Other GWAS software can handle any of (1), (2) or (3), while GWASpro can handle all four practices. Practice (4) produced the best GWAS resolution, as judged by raised QTL peaks without background noise inflation. Given a real Medicago truncatula dataset consisting of 220 individuals and more than 1.8 million SNPs, GWASpro completed the analysis within 8 minutes, while TASSEL spent more than one day on the same analysis. Web server: http://bioinfo.noble.org/GWASPRO
Short Abstract: Precise HLA genotyping is of great clinical importance, albeit a challenging bioinformatics endeavour due to the hyperpolymorphism of the HLA region. The ever-increasing availability of next-generation sequencing (NGS) solutions has spurred the development of several computational methods for predicting HLA genotypes from NGS data. Although some of these tools genotype HLA class I alleles reasonably well, there is a need to incorporate integrative parameters related to ethnicity and haplotype information in order to improve performance for both class I and class II alleles. Here we present OncoHLA, which addresses some of these current shortfalls in HLA genotyping from NGS. In OncoHLA, reads from the HLA region are aligned against a more comprehensive library of reference alleles. Alleles are then called on the basis of the distribution of aligned reads and the prior probabilities given by the ethnic and haplotype frequencies of alleles. Two NGS datasets were used to benchmark OncoHLA against five similar tools. OncoHLA displayed an overall accuracy of 98.62% for class I and 96.11% for class II alleles (at 2-field resolution). We illustrate that OncoHLA’s integrative approach outperforms the existing tools, and therefore is able to predict HLA alleles with improved fidelity.
Short Abstract: Blumeria graminis f. sp. hordei (Bgh) causes powdery mildew disease in barley. Candidate Secreted Effector Proteins (CSEPs) expressed by Bgh interact with barley proteins to manipulate host cell physiology during infection. Currently, little is known about the virulence mechanisms and host targets of the ~540 predicted Bgh CSEPs. To understand how Bgh causes disease, CSEPs are screened against a comprehensive cDNA library made from leaves of barley cultivar CI 16151-WT and derived immune-signaling mutants mla6, rar3, bln1, mla6+bln1 over the first 48 hours of infection with Bgh. A modified yeast two-hybrid (Y2H) assay coupled with next-generation sequencing monitors the enrichment of prey sequences in yeast populations under selection, which allows high-throughput identification of putative interacting proteins. A negative control bait, luciferase, is used to identify non-specific preys within the cDNA library. We designed a pipeline to map sequencing reads, verify prey translational fusions with the GAL4 activation domain, and perform statistical enrichment analysis. Using this approach, we identified high-confidence protein-protein interactions for Bgh CSEPs and the barley disease resistance protein MLA6. Diverse CSEPs putatively interact with unique but overlapping sets of host proteins with roles in cell division, transcriptional regulation, post-translational modifications, DNA binding activity, stress responses, and immune signaling.
Short Abstract: Elastic nets for generalized linear models are popular regression and variable selection methods, particularly useful when the number of features is substantially greater than the sample size and when there are many correlated predictor variables. The mixing parameter alpha generates a family of models with increasing shrinkage effects from ridge to lasso; for each of them, an ensemble of solutions depends on the penalty parameter lambda. This package provides a quantitative toolkit to explore elastic net families and to uncover correlates contributing to prediction under a cross-validation framework. Parameter importance is evaluated by flexible criteria based on out-of-bag predictions, which are assessed via user-defined quality functions. Statistical significance is assigned to each model by comparison to a set of null models generated by random permutations of the sample labels; analogous permutation procedures were used to assess significance for the contribution of individual parameters to prediction. The package provides a set of standard plots, summary statistics and output tables. It fits linear, binomial (logistic) and multinomial models. Our package enables quantitative, exploratory analysis to generate hypotheses on what parameters may be associated with biological phenotypes of interest, such as for the identification of biomarkers for therapeutic responsiveness.
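For reference, the roles of the mixing parameter alpha and the penalty parameter lambda described above can be written out as follows. The abstract does not state which parameterization the package uses, so the form below is an assumption: it is the convention of the widely used glmnet package, with \ell denoting the per-sample negative log-likelihood of the chosen GLM family and N the sample size.

```latex
\hat{\beta}(\alpha, \lambda) =
  \arg\min_{\beta_0,\, \beta}\;
  \frac{1}{N} \sum_{i=1}^{N} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
  + \lambda \left[ \frac{1-\alpha}{2}\, \lVert \beta \rVert_2^2
  + \alpha\, \lVert \beta \rVert_1 \right]
```

Under this convention, alpha = 0 gives ridge regression and alpha = 1 gives the lasso, while for each fixed alpha the family of solutions is traced out by varying lambda.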
Short Abstract: Leishmaniasis is a group of neglected infectious diseases with more than 1 million new cases each year, and the available treatments have serious drawbacks. It is therefore extremely important to apply methods capable of selecting new therapeutic targets efficiently and at low cost. We propose the use of multidisciplinary computational methods capable of evaluating the druggability of the Leishmania braziliensis and Leishmania infantum proteomes in a structural, chemical and functional context. To achieve this goal, the proteomes were obtained from the TriTrypDB database and aligned against approved targets deposited in DrugBank using the BlastP tool. Protein structures were predicted by comparative modeling and submitted to the FPocket tool to predict structural druggability. Data on protein interactions were obtained from the STRING database or predicted through protein-protein docking. In this way, it was possible to identify 283 L. braziliensis and 292 L. infantum proteins with high similarity to target proteins. In addition, among the top 10 proteins to be used as therapeutic targets in a topological context of protein interaction networks, 5 L. braziliensis and 4 L. infantum proteins presented druggability in the structural context. These data will be expanded with information on protein expression and functional information.
Short Abstract: Functional enrichment analysis is widely used to identify significant functional information over-represented within sets of genes obtained from high-throughput profiling assays such as microarray and RNA-Seq. Various standalone tools and R packages have been developed; however, efficiently visualizing results that contain many significant biological functions and pathways from various resources is still a challenge. Moreover, the current tools do not directly support plants or other under-studied species, even when complete genome resources are available for them. Here, we present RichR, a comprehensive R package for enrichment analysis and network construction, supporting not only the common species included in Bioconductor but also all species in the Ensembl database. RichR can perform a network analysis based on enrichment and differential expression analysis results to build networks connecting significant biological functional terms and pathways that share similar gene contents and expression profiles. This network visualization enables an easy comparison among multiple enrichment results. RichR can help to disclose new associations and potential cooperating pathways in hidden biological processes and provides a new way to summarize large amounts of information from enrichment analysis.
Short Abstract: Both the intrinsic regulatory network and spatial environment are contributors of cellular identity and result in cell state variations. However, their individual contributions remain poorly understood. Here we present a systematic approach to integrate both sequencing- and imaging-based single-cell transcriptomic profiles, thereby combining whole-transcriptomic and spatial information from these assays. We applied this approach to dissect the cell-type and spatial domain associated heterogeneity within the mouse visual cortex region. Our analysis identified distinct spatially associated signatures within glutamatergic and astrocyte cell compartments, indicating strong interactions between cells and their surrounding environment. Using these signatures as a guide to analyze single cell RNAseq data, we identified previously unknown, but spatially associated subpopulations. As such, our integrated approach provides a powerful tool for dissecting the roles of intrinsic regulatory networks and spatial environment in the maintenance of cellular states.
Short Abstract: Motivation: Images convey essential information in biomedical publications. As such, there is growing interest within the bio-curation and bio-databases communities in storing images from publications as evidence for biomedical processes and for experimental results. However, many of the images in biomedical publications are compound images consisting of multiple panels, where each individual panel potentially conveys a different type of information. For instance, graphs, microscopy images and X-ray images may all be shown side-by-side as panels in one single figure. To display and further process individual image panels, segmenting compound images into their constituent panels forms an essential first step toward utilizing images. Methods: We have developed a new compound image segmentation system, FigSplit, which is based on Connected Component Analysis. To overcome shortcomings typically manifested by existing methods, we develop a quality assessment step for evaluating and modifying segmentations, employing methods that re-segment images if the initial segmentation is inaccurate. Results and Significance: We have tested FigSplit on multiple public datasets; the results show significant improvement and effectiveness compared to state-of-the-art methods. FigSplit effectively addresses an essential need for extracting panels within published biomedical figures, while significantly improving over the state-of-the-art. Our system is publicly available for use at: https://www.eecis.udel.edu/~compbio/FigSplit.
Short Abstract: FastqBLAST is a python-based program that invokes NCBI’s BLAST and EFetch services over the internet on a random sample of sequences in a FASTQ file and compiles summary reports that characterize the organismal and genomic diversity of that sample. The program is user-friendly, can be run on a personal computer, and can handle very large FASTQ files. Additionally, it does not require users to download and constantly update BLAST databases on a local server because NCBI requests are submitted via the internet. FastqBLAST is particularly useful in diagnosing issues of low mapping rates of next generation sequencing reads (RNA- and DNA-seq) to a reference genome and can quickly elucidate potential problems associated with sequence quality, reference genome assembly, and contamination from poor library preparation or an organism’s microbiome. The program selects a random sample of sequences from the FASTQ file, trims low quality ends from each sequence, BLASTs the trimmed sequences, retrieves additional data on the returned gene list with EFetch, and parses the results to produce tabular reports and key figures. Default parameters of FastqBLAST closely mirror those used on the NCBI website and can be adjusted as needed along with the sample size. FastqBLAST is available at https://github.com/AbashtLaboratory/FastqBLAST.
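The first step described above, drawing a random sample of sequences from a very large FASTQ file, can be sketched in Python as follows. This is not FastqBLAST's own code: reservoir sampling is swapped in here as one plausible single-pass technique, the function name `sample_fastq` is hypothetical, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import io
import random


def sample_fastq(handle, k, seed=None):
    """Return up to k FASTQ records chosen uniformly at random in one pass.

    Assumes each record is exactly four lines (header, sequence, '+',
    qualities). Uses reservoir sampling (Algorithm R), so memory use
    depends on k and not on the size of the file.
    """
    rng = random.Random(seed)
    reservoir = []
    n_seen = 0  # number of records read before the current one
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of file
            break
        if n_seen < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (n_seen + 1).
            j = rng.randint(0, n_seen)
            if j < k:
                reservoir[j] = record
        n_seen += 1
    return reservoir


# Toy in-memory FASTQ with ten reads; a real run would pass an open file.
demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10)))
for header, sequence, _, qualities in sample_fastq(demo, k=3, seed=0):
    print(header.strip(), sequence.strip(), qualities.strip())
```

The sampled records could then be trimmed and submitted to BLAST; those later steps depend on NCBI's services and are not sketched here.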
Short Abstract: Comparative analysis of helminth genomes is important to understand the genomic biodiversity and evolution of parasites and their hosts in terms of different selective pressures in their habitats. The interactions between helminths and their hosts are mediated in large part by secreted proteins. Proteins secreted by parasites are able to modify a host's environment and modulate its immune system. The present study aimed to predict, in silico, the secretome in 44 helminth species including Nematoda (31 species) and Platyhelminthes (13 species), and to understand the diversity and evolution of secretomes. Secretomes of plant helminths range from 7.6% to 13.9% of the filtered proteome (average 10.2%) and those of free-living helminths range from 4.4% to 13% (average 9.8%); both are considerably larger than the secretomes of animal helminths, which range from 4.2% to 11.8% of the proteome (average 7.1%). Across 44 secretomes in different helminth species, we found five conserved domains: PF00014 (Kunitz/Bovine pancreatic trypsin inhibitor domain), PF00046 (Homeobox domain), PF00188 (cysteine-rich secretory proteins), PF00085 (Thioredoxin) and PF07679 (Immunoglobulin I-set domain). Secreted proteins had higher architecture diversity compared with non-secreted proteins. This study will contribute towards the understanding of host-parasite interactions.
Short Abstract: Analyses of viral genome sequences from clinical specimens can be complicated by co-infection with multiple viruses and/or the quasispecies nature of many RNA viruses. Robust next-generation sequencing (NGS) techniques provide an opportunity to investigate the different viral components, yet de novo assembly programs typically output multiple contig fragments for each taxon, often leading to mis-assemblies and possible virus mis-identifications. There is a large demand for a performance metric that accurately depicts how well the assembly performed on viral datasets. Currently, assembly performance is measured by N50, but it often produces skewed, inaccurate results. We developed a new performance metric called U50 to correct for over- and under- estimation of N50 due to poor assemblies, and the UG50% metric allows for cross-platform comparisons. We applied U50 to demonstrate assembly effectiveness on NGS data generated from enterovirus- and sapovirus-positive clinical specimens. 57/61 samples had UG50% values of over 99%; this high value indicates that the de novo assembler generated the full genomic sequence directly. In addition, the U50 returned an equal or more accurate assessment than N50 for 60/61 samples. For the remaining difficult data, we present an iterative mapping approach to handle these complex viral NGS data for full genome construction.
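For reference, the baseline N50 statistic that U50 is compared against can be computed as below. This sketch covers only the standard N50 definition (the length of the contig at which the cumulative length, taken from longest to shortest, first reaches half of the total assembly length); the abstract does not define U50 or UG50% in enough detail to reproduce them here, so they are deliberately left out.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig")


# Toy contig lengths (illustrative only): total 54, half 27, 10 + 9 + 8 = 27.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Since N50 depends only on contig lengths, it cannot tell whether contigs overlap or belong to the target genome at all.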
Short Abstract: The Institute for Systems Biology Cancer Genomics Cloud Resource (ISB-CGC) is becoming an invaluable asset for research that holds over 2 petabytes of data and features an open tools infrastructure. Focusing on one particular aspect of the project: clinical and molecular data for all of TCGA, genomic reference sources, and recently published results are all stored in Google BigQuery tables, a managed service consisting of a columnar database backed by a massively parallel analytics engine. Data is stored in a highly distributed manner making it possible to split up SQL queries automatically, resulting in super-fast processing times. All of this means researchers can focus on asking questions and getting answers rather than curating data and managing databases! We have developed numerous examples of relevant SQL queries addressing specific biomedical research questions. These include simple queries such as computing summary statistics like T-tests, building to more complex scenarios that include tasks such as gene set scoring or more advanced machine learning analysis methods such as the Top Scoring Pairs algorithm. We show how BigQuery can be used as a backend to R Shiny web apps. All examples are performed on publicly available data and have biologically relevant results.
Short Abstract: Ebola virus disease (EVD) occurs in outbreaks and results in a hemorrhagic fever with mortality rates up to 90%. mRNA changes in the blood can be used to predict whether individuals will survive or succumb to infection following their initial diagnosis. Based on this we hypothesized that tracking circulating mRNA changes following EVD infection will predict the stage of infection of an individual, allowing more effective treatment. To test this hypothesis, we examined transcriptional data from animal models of EVD where time series transcriptomes were known. We used linear models and Bayesian additive regression trees as the machine learning methods to develop algorithms that predicted days post infection based on mRNA abundance. Additionally, we used both single gene expression data as well as multi-gene averages (eigengene) approaches. The linear model was accurate in the cross-validation but poor in cross-dataset validation. The Bayesian additive regression tree approaches combined with eigengenes, produced a more accurate cross-species and cross-sample prediction. Our final predictor accurately and consistently predicts the stage of infection, especially during the mid and late stages. We will also present and discuss the generalizability of our predictor using a human influenza study as well as available human EVD samples.
Short Abstract: Previously, Han et al. discovered population structure in a global identical-by-descent (IBD) network using multilevel clustering. Population structure, e.g. maternal and paternal lines, can also be discovered on a more local level such as an IBD network centered around an individual. Many of the 7 million customers in AncestryDNA's database, especially enthusiastic genealogists, are interested in separating their relatives into maternal and paternal lines. Our DNA database also contains a large set of trios (child-father-mother), on which we can assign maternal/paternal labels for all the edges in a child’s local IBD network. Using the trios’ label assignments, we evaluate the performance of 8 clustering methods: label propagation, Markov cluster algorithm, SPICi, edge betweenness, fast greedy, walktrap, spinglass, and leading eigenvector. The performance metrics are adjusted Rand index, mutual information, Fowlkes-Mallows score, homogeneity, completeness, and V-measure. The results of this comparison are valuable for understanding the strengths and limitations of each method on a real-world dataset. Reference: Han et al., https://www.ncbi.nlm.nih.gov/pubmed/28169989 (accessed 3/12/2018)
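Of the metrics listed, the adjusted Rand index is compact enough to show in full. The sketch below is a plain standard-library implementation of the usual pair-counting formula; the maternal/paternal labels in the example are invented for illustration and are not taken from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    if n != len(predicted) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 items)")
    # Pair counts from the contingency table and its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical trio-derived labels versus cluster ids from some method.
truth = ["M", "M", "M", "P", "P", "P"]
perfect = adjusted_rand_index(truth, [0, 0, 0, 1, 1, 1])
partial = adjusted_rand_index(truth, [0, 0, 1, 1, 2, 2])
```

The index is 1 for identical partitions and close to 0 for agreement no better than chance, which makes scores comparable across methods that return different numbers of clusters.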
Short Abstract: During the last two years, the National Cancer Institute (NCI) Genomic Data Commons (GDC) has been serving the cancer research community with a data repository of uniformly processed genomic and associated clinical data. As the input data source changes from archived data to dynamic, continuously added user submissions, GDC requires an automation system that enables continuous data processing with less human intervention. Here we present a data-driven workflow system that utilizes a Directed Acyclic Graph (DAG) to describe relationships between workflows and inputs/outputs. The system is composed of three services: one that determines which workflows are ready to be submitted, one that submits workflows to the backend computer cluster, and one that coordinates data between databases. The system is modularized so that the framework is not dependent on a particular input database, a particular workflow language or execution engine, or the cluster’s workload scheduler. In addition, a client-service pair was created to abstract file location and access from the workflow and allow remote file access via a Universally Unique IDentifier (UUID). The system improves reproducibility by tracking inputs/outputs and workflow versions, and by supporting the use of containerized applications.
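The first of the three services, deciding which workflows are ready to be submitted, reduces to a simple check on the DAG: a workflow is ready when all of its upstream workflows have completed. The sketch below illustrates only that idea; the function name, the dictionary layout, and the workflow names are hypothetical and do not describe the GDC implementation.

```python
def ready_workflows(dependencies, completed, submitted=frozenset()):
    """Return workflows whose upstream workflows are all completed.

    dependencies maps each workflow name to the set of workflows it depends on.
    Workflows already completed or submitted are not returned again.
    """
    done = set(completed)
    return sorted(
        workflow
        for workflow, upstream in dependencies.items()
        if workflow not in done
        and workflow not in submitted
        and set(upstream) <= done
    )


# Hypothetical DAG: alignment feeds variant calling and QC; annotation follows calling.
deps = {
    "align": set(),
    "call_variants": {"align"},
    "qc": {"align"},
    "annotate": {"call_variants"},
}
first_wave = ready_workflows(deps, completed=set())
second_wave = ready_workflows(deps, completed={"align"})
```

Calling the function again each time a workflow finishes yields the next wave of runnable work, which is what makes the processing data-driven rather than scheduled.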
Short Abstract: Concept normalization is an important step in biomedical information extraction. The task is to identify references to particular controlled vocabulary or ontology terms in text. In this work, we present a novel sequence-to-sequence architecture to normalize biomedical concepts. We develop a sequence-to-sequence model based on LSTM (Long Short-Term Memory) with an encoder-decoder architecture. The encoder is a bidirectional LSTM that takes as input a one-hot encoding of the characters of biomedical mentions. The decoder is a unidirectional LSTM that takes the output of the encoder. An attention model sits between the encoder and the decoder; it assigns different weights to the encoder outputs, determining how much attention the decoder should give to each of them. The decoder predicts the most likely concept IDs of biomedical mentions. The proposed architecture is evaluated against gold textual mentions. It is also evaluated in an end-to-end system, where the input is scientific articles, so that we can compare it with other systems. For the end-to-end system, we extracted the spans of biomedical mentions in text using an existing state-of-the-art approach, conditional random fields.
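The weighting performed by the attention model can be sketched in a few lines. The abstract does not say how attention scores are computed, so the dot-product scoring used here is an assumption chosen for brevity; the softmax normalization and the weighted sum of encoder outputs are the generic parts of the mechanism.

```python
import math


def attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step.

    Scores are dot products between the decoder state and each encoder output
    (an assumed scoring function); weights are their softmax.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, output))
        for output in encoder_outputs
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(encoder_outputs[0])
    context = [
        sum(w * output[i] for w, output in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
    return weights, context


# Toy vectors: the decoder state is most similar to the first encoder output.
weights, context = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The weights sum to one, so the context vector is a convex combination of the encoder outputs that the decoder then consumes.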
Short Abstract: The Cancer Genome Collaboratory is an academic compute cloud designed to enable computational research on the world’s largest and most comprehensive cancer genome dataset, the International Cancer Genome Consortium (ICGC). As of April 2018, the Collaboratory holds genomic data (BAMs, VCFs and others) on more than 3180 donors (671 TB), including harmonized whole genomes and variant calls from over 2,800 cancer patients of the ICGC PanCancer Analysis of Whole Genomes. A dataset of this size requires months to download and significant resources to store and process. By making the ICGC data available in cloud compute form in the Collaboratory, researchers can bring their analysis methods to the cloud, yielding benefits from the high availability, scalability and economy offered by cloud services and essentially eliminating the time needed to download the data. The Collaboratory, built upon the OpenStack and CEPH platforms, has developed innovative and interoperable open source software solutions for managing, searching and accessing genomic data from cloud object storage into the user’s compute instances. The Collaboratory is open to the public and we invite cancer researchers to apply for access to the Collaboratory at cancercollaboratory.org.
Short Abstract: Massive growth in the amount of research data in biology has led to an increasing need for user-friendly analysis tools. We developed the BioWardrobe platform to simplify routine analysis of epigenomics data. It had scheduled Python scripts as a back-end and a web interface to display the results. However, growing utilization of BioWardrobe revealed that the Python back-end lacked functionality: it was difficult to add new pipelines or update existing ones, and it could not optimally utilize resources. The recent development of the Common Workflow Language (CWL) allowed us to solve these challenges. Thus, we combined CWL with an open-source pipeline manager, Airflow, to create a new back-end. BioWardrobe’s pipelines were rewritten in CWL, component tools were containerized with Docker, and Airflow was extended with CWL support. This enabled BioWardrobe to have more control over pipeline execution, which in turn helped to save time and reduce computational costs. CWL-based workflows are compact, well formalized, and can be easily versioned and visually represented, making them more scientist-friendly. CWL-Airflow makes it possible to set up analyses in any hardware environment, from a stand-alone server to a cluster or cloud. The simplified installation of CWL-Airflow and pipelines enables scientists to concentrate on analyzing the data.
Short Abstract: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses, and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. We present an overview of our recent work in aligning sequences to graphs. Compared to linear sequence-to-sequence alignment, the graph structure introduces new difficulties. Our work includes theoretical insights on how to generalize banded alignment and Myers' bitvector parallelism to arbitrarily shaped node-labeled directed graphs, and an implementation which uses these insights in practice. Our implementation accepts commonly used file formats: GFA or vg for graphs and FASTA or FASTQ for reads. The implementation has been successfully used as a part of a hybrid Illumina-PacBio genome assembly pipeline.
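To make the problem concrete, the sketch below computes the minimum edit distance between a read and any path in a node-labeled graph using plain dynamic programming. It is far simpler than the work described: it assumes an acyclic graph whose nodes carry one character each and are numbered in topological order, and it uses neither banding nor bitvector parallelism. The function name and input layout are hypothetical.

```python
def align_to_dag(read, labels, edges):
    """Minimum edit distance between read and any path in a node-labeled DAG.

    labels[v] is the single character on node v; edges are (u, v) pairs with
    u < v, i.e. node ids are assumed to be in topological order. The path may
    start and end at any node.
    """
    m = len(read)
    preds = [[] for _ in labels]
    for u, v in edges:
        if not u < v:
            raise ValueError("nodes must be numbered in topological order")
        preds[v].append(u)
    # Virtual source column: read[:j] aligned to an empty path costs j insertions.
    source = list(range(m + 1))
    columns = []
    best = m
    for v, label in enumerate(labels):
        incoming = [source] + [columns[u] for u in preds[v]]
        col = [1 + min(p[0] for p in incoming)]
        for j in range(1, m + 1):
            diag = min(p[j - 1] for p in incoming) + (read[j - 1] != label)
            delete = min(p[j] for p in incoming) + 1  # skip node v
            insert = col[j - 1] + 1                   # extra read character
            col.append(min(diag, delete, insert))
        columns.append(col)
        best = min(best, col[m])
    return best


# Toy variation graph: A -> (C or G) -> T, i.e. a single-nucleotide bubble.
labels = ["A", "C", "G", "T"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
exact = align_to_dag("AGT", labels, edges)
one_mismatch = align_to_dag("AAT", labels, edges)
```

The only change from linear alignment is that each cell takes a minimum over all predecessor columns instead of a single previous column; handling cycles and speeding this up are where the difficulties mentioned above arise.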
Short Abstract: Realizing African genomics research requires the capacity for, and development of, reproducible, portable bioinformatics workflows that are able to handle numerous software dependencies and data movement, and to automate intermediate analysis steps. H3ABioNet, the Pan-African bioinformatics network, is bridging this gap by developing 4 key portable, reproducible bioinformatics analysis pipelines for African researchers from the Human Heredity and Health in Africa (H3Africa) projects and the larger bioinformatics community. Namely, these are: (1) variant calling from WGS/WES data; (2) diversity analysis from 16S rDNA data; (3) genotyping and GWAS; and (4) SNP imputation. Once SOPs were established, the workflows were developed and containerized using community-supported workflow management systems and languages, Nextflow and CWL respectively. This work culminated in a week-long hackathon held in South Africa in 2016, teaming up African bioinformaticians with US and European collaborators. The workflows were subsequently tested and validated continuously, the code was made openly available on GitHub, and the containerized images were deposited in Quay.io. These portable, reproducible workflows developed by H3ABioNet are currently in use by a range of H3Africa projects and non-H3Africa researchers, contributing to bioinformatics infrastructure and capacity building and enabling the analysis of African genomics data by Africans in Africa.
Short Abstract: Massive growth in the amount of research data has led to increased utilization of pipeline managers in biomedical computational research. However, each such manager uses its own way to describe pipelines, leading to difficulty porting workflows to different environments and poor reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was introduced as a specification for platform-independent workflow description. Here, we present CWL-Airflow, an extension for the Apache Airflow pipeline manager supporting the CWL v1.0 specification. The Apache Airflow code is extended with a Python package that organizes a CWL descriptor file in the form of a directed acyclic graph (DAG). This DAG is used by Airflow to run a workflow with a structure identical to the original CWL descriptor file. A sample CWL pipeline for processing ChIP-Seq data is provided. CWL-Airflow is one of the first pipeline managers supporting the CWL specification and provides a robust and user-friendly interface for executing CWL pipelines. Furthermore, as one of the most lightweight pipeline managers, Airflow contributes only a small amount of overhead to the overall execution of a computational pipeline. This gives us a lightweight workflow management system with full support for CWL, the most promising scientific workflow description language.
Short Abstract: Understanding Cell-to-Cell communication (C2CC) is critical to deciphering the complexity of the cellular ecosystems that control biological functions and disease states. For example, analyzing communication among cancer cells and stromal cells in a tumor has revealed important biological insights. Ligand-Receptor (LR) interactions typically drive C2CC. scRNA-seq helps identify C2CC by exploiting known LR interactions. However, no tool is available to identify C2CC networks and explore modes of communication among different cells at the single-cell level. Hence, we developed a methodology and an interactive tool called Single Cell-2-Cell Communicator (SC2CC) for end-2-end analysis of scRNA-seq data, from normalization to eliciting the C2CC network. Using the SC2CC tool, we analyzed scRNA-seq data of a spontaneous murine medulloblastoma for C2CC network identification. The C2CC network indicates that all immune cell populations can communicate with each other. Interestingly, while tumor cell populations and astrocytes could signal robustly to all infiltrating macrophages, reciprocal signaling from immune cells to tumor cells or astrocytes was undetected. In addition, tumor cell signaling to brain-resident macrophages (microglia) was absent and may require an infiltrating macrophage intermediary. This network shows hierarchical interaction among different cell types in a tumor and suggests subpopulation-specific C2CC in tumors. Our results signify the importance of the SC2CC tool.
Short Abstract: Cancer genomics studies result in numerous, huge, heterogeneous datasets with variable characteristics including file formats, attribute names, positional coordinates, reference data, and more. This heterogeneity is compounded by duplicate, non-uniform presentation of the data across several distributed databases, many of which have highly specific research foci. OncoMX is a web portal enabling search of unified cancer genomics data via integrated cancer mutation, expression, and biomarker databases, and is actively being developed to address issues of biomedical data integration to better facilitate cancer biomarker research. Funded by ITCR U01, OncoMX is a collaboration between the George Washington University, NASA’s Jet Propulsion Laboratory, the Swiss Institute of Bioinformatics, and the University of Delaware. Briefly, sequencing-based mutation and expression data are integrated and unified by Disease Ontology and Uberon terms into BioMuta and BioXpress, the core knowledgebases for OncoMX. Additional integrated data include normal expression from Bgee, text-mining results for mutation and expression in cancer, functional annotations from both curated and non-curated resources, EDRN biomarker information, and Reactome pathways. The proposed access to integrated cancer genomics data and supporting information in OncoMX is expected to benefit basic cancer research and promote efficient consumption of information by end users to ultimately improve detection capabilities.
Short Abstract: The advent of high-throughput genomics technologies has resulted in massive amounts of diverse genome-scale data. Disease subtyping is often the first step to interpret the massive amounts of high-dimensional and heterogeneous data types to gain insights into biological processes. Here we present a framework, named PINSPlus, that is: i) able to integrate multiple types of data and ii) substantially outperforms both single data type analyses as well as established integrative approaches, when identifying cancer subtypes with significantly different survival profiles. The framework was validated on thousands of cancer samples using mRNA, methylation, miRNA, and copy number variation data downloaded from The Cancer Genome Atlas, the European Genome-Phenome Archive, Gene Expression Omnibus, and the Broad Institute. For each cancer dataset tested, PINSPlus outperforms existing state-of-the-art subtyping approaches in identifying known subtypes as well as in discovering novel groups of patients with significantly different survival profiles. PINSPlus is sufficiently general to replace existing unsupervised clustering approaches outside the scope of disease subtyping. The software is currently available on CRAN: https://cran.r-project.org/web/packages/PINSPlus/index.html.
Short Abstract: With different types of cancer data (including clinical, genomics, proteomics, imaging) growing into petabyte scale, there is a need for infrastructure that leverages the diverse datasets and accelerates discovery by researchers. The NCI Cancer Research Data Commons (CRDC) will be an ecosystem where repositories of different data types will be connected along with analytical tools to enable the integration, query, analysis and visualization of data. The goal is to provide secure access to data; allow users to process data using elastic compute, analyze data with various applications, share results with collaborators, and incorporate their own data and tools for analysis. As an integral component of the CRDC, the Data Commons Framework (DCF) is a reusable and expandable framework for interoperable Data Commons. The modular components that can be leveraged across Data Commons are: secure user authentication and authorization; metadata validation tools; domain-specific and extensible data models; API and container environment for tools and pipelines; computational workspaces for storing and sharing data, tools and results. The DCF components are being developed in collaboration with the NCI Cloud Resources; will later be used to stand up new Data Commons; and can be leveraged by the community to set up their own commons.
Short Abstract: Determining gene function remains a fundamental problem in biology. Measuring gene expression levels via transcript analysis across various treatments and developmental stages from many tissues greatly facilitates gene, pathway, and genomic functional annotation and interpretation. Here we present a method that involves transforming expression data to approximate a normal distribution, followed by dividing the genes into groups and then applying Gaussian parametric methods to assess the significance of observed differences. This method enables the assessment of differences in gene expression distributions within and across samples, enabling hypothesis-based comparison among groups of genes. We have implemented our method on a Shiny server with a user-friendly interface. It has the following features: 1) visualization of gene expression value distributions; 2) visualization of the data normality assumption; and 3) performance of Student’s t-test or Wilcoxon signed-rank test between two groups of genes. This application enables biologists to access a de novo statistics pipeline and to conduct an entirely new kind of RNAseq-based, hypothesis-driven research.
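The transform-then-test idea can be sketched as follows. The abstract does not name the transformation, so the log2(x + 1) transform here is an assumption, and the function names are hypothetical. The sketch computes only the pooled-variance Student's t statistic and its degrees of freedom; turning that into a p-value requires a t-distribution CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def log2_transform(values):
    """Assumed normalizing transform: log2(x + 1)."""
    return [math.log2(v + 1) for v in values]


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Toy expression values for two groups of genes (illustrative only).
group_a = log2_transform([1, 3, 7])    # becomes [1.0, 2.0, 3.0]
group_b = log2_transform([3, 7, 15])   # becomes [2.0, 3.0, 4.0]
t_stat, dof = student_t(group_a, group_b)
```

The pooled variance assumes both groups share a common variance, which is part of why checking normality and spread visually, as the application does, matters before trusting the test.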
Short Abstract: An ESE (Exonic Splicing Enhancer) is a DNA motif that plays an important role in the regulation of alternative splicing. An ESE interacts with SR proteins and positively regulates the exon's inclusion in the mRNA, and alteration of an ESE may lead to exclusion of the exon, called exon skipping. In this context, silent mutations may have a functional effect by inducing exon skipping and the production of altered proteins lacking the exon. In this study, we investigated the impact of ESE alteration on splicing patterns in cancer genomics. We collected 438 ESEs that were experimentally validated or computationally predicted. We also collected WEX and RNA-seq data from 124 gastric and lung cancer samples. Somatic mutations that modify ESEs were called, and exon skipping events were then predicted by comparing the change in VAF (Variant Allele Frequency) between the DNA and RNA-seq data. Gene-level expression data were also used to estimate the occurrence of NMD (Nonsense-Mediated Decay). We observed a trend that ESE-altering silent or missense mutations decrease the expression level of the genes when the mutations are in an asymmetric exon. We also observed that genes with a nonsense mutation may be rescued by exon skipping if the ESE-modifying mutations are in a symmetric exon.
Short Abstract: Uropathogenic E. coli (UPEC) are the most common cause of urinary tract infections (UTIs) in women. Bacteriophages (phages), or viruses that infect bacteria, can contribute to a bacterial strain’s pathogenicity. Recently, the complete genome for a UPEC strain, E. coli ECONIH2, was sequenced using both Pacbio and Illumina technologies. The longer read technology uncovered a tandemly repeated phage genome (prophage) in the E. coli genome; this repeat was undetected by Illumina sequencing alone. What is particularly intriguing about this repeated prophage is its similarity to the YpfΦ prophage found within several Yersinia pestis strains. YpfΦ also occurs in tandem repeats and has been shown to play a role in Yersinia pathogenesis. We performed Blast queries and identified a single copy of the ECONIH2 prophage within the genomes of several Citrobacter, Salmonella, and Yersinia spp. as well as other E. coli strains. We have written a tool to automate bacterial genome reconstruction and detect instances of repeating prophages. This approach was used to reassess E. coli strains isolated from the urinary tract and discovered the presence of tandemly repeated prophages.
Short Abstract: With the advent of microarray- and RNAseq-based gene expression profiling, finding gene sets with phenotype-associated functional changes has been the focus of many studies. While gene set enrichment analysis tools have been extensively used to test for the significant association between gene sets and a phenotype of interest, gene set projection (GSP) tools have been used to couple the enrichment scores of individual samples with a gene set of interest. Multiple GSP tools have been developed to perform this task, including ssGSEA, ASSIGN, Pathifier, and GSVA, among others. In this analysis, we seek to compare their performance on real and simulated data. In particular, we use simulated data derived from real gene expression profiles to compare sensitivity and specificity of the different methods, by maintaining the correlation structure between the genes in our cohort and by adding a constant effect size to measure the efficacy of each tool to measure the true enrichment in the underlying data. We have measured the influence of various parameters such as gene set size, correlation between genes within the gene set and the effect size on the accuracy of the Gene set projection tools.
Short Abstract: The NBSeq project is evaluating the effectiveness of whole exome sequencing (WES) for detecting inborn errors of metabolism (IEM) in population-level newborn screening (NBS). Among the 4.4 million newborns born and screened by the California NBS program from 2005 to 2013, we sequenced the de-identified archived dried blood spots of most newborns affected with any of 48 IEMs. We analyzed the exomes using customized variant interpretation pipelines that incorporated CNV detection and variant curation, and found that the ability to detect likely disease-causing variants varied across metabolic disorders. After an integrative review of the genetic, biochemical and clinical data, we found that the interpretation pipeline achieved an overall sensitivity of 88%, highlighting limitations of our genetic understanding of even long-studied classic Mendelian disorders. In some cases, exomes compellingly identified disorders that differed from the metabolic center diagnoses, suggesting that sequencing information would have been valuable for making a correct, timely diagnosis in those cases. While not specific enough to screen for most IEMs, WES can play an important role in efficiently establishing a definitive diagnosis for several disorders.
Short Abstract: Samples from microarray experiments may differ significantly in gene expression levels. This difference is known as sample heterogeneity, and it may cause serious issues in subsequent analysis. Therefore, processing heterogeneous samples is a critical aspect of microarray data analysis. In our previous study on the variation of samples in the PCA bi-plot when using a series of subsets of DEGs, the distances between samples in the PCA bi-plot were significantly influenced by the selection of genes. Our analyses of the genetic distances between samples indicated that genetic distance is a better index for characterizing the genotypic and phenotypic differences between samples, and could be used to select an optimal list of genes. On this basis, a novel strategy, the dynamic PCA bi-plot, could be developed for identifying heterogeneous samples. Using this method, more heterogeneous samples have been identified in each of the datasets, and the genetic distances from these heterogeneous samples to other samples are consistently larger than inter-sample distances. The combination of genetic distance and dynamic PCA bi-plot is able to identify heterogeneous samples more efficiently and reliably. This work provides a basis for developing novel strategies to process heterogeneous samples in microarray data analysis.
Short Abstract: Obligate fungal pathogens are a major threat to cereal grain production worldwide, and represent ideal tools for exploring interdependent signaling between disease agents and their hosts. We performed an expression Quantitative Trait Locus (eQTL) analysis to interrogate the temporal control of immunity-associated gene expression in barley (Hordeum vulgare L.) challenged with the powdery mildew fungus, Blumeria graminis f. sp. hordei (Bgh), identifying two highly significant clusters of trans eQTL. Using these data, we outlined computational steps to discover transcriptional regulators that govern the temporal dynamics of plant immunity. We paired the barley genome assembly with extensive barley-Bgh expression data using two complementary approaches to predict defense gene modules, immune-active cis-regulatory elements (CREs) and their cognate transcription factors (TFs). First, we compared experimentally validated TF-CRE pairs with barley promoter sequences and calculated an enrichment score and FDR-adjusted p-value using Fisher's exact test. Second, we performed de novo CRE discovery. Consistent with our hypothesis, we identified overrepresented CREs in promoters of the trans eQTL-associated gene sets. Over 70% of the recovered motifs were consistent between the analyses, some of them novel. These results were represented with unrooted phylogenetic trees of each barley TF family, and used as a selection tool for experimental validation.
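The enrichment test in the first approach can be illustrated with a one-sided Fisher's exact test on a 2x2 table of promoters (in the gene set or not, containing the motif or not). This is a minimal sketch under stated assumptions: the function name and table layout are hypothetical, only the right-tail (over-representation) p-value is computed, and the enrichment score and FDR adjustment mentioned above are not reproduced.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact p-value for over-representation.

    a: promoters in the gene set with the motif
    b: promoters in the gene set without the motif
    c: background promoters with the motif
    d: background promoters without the motif
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    n = a + b + c + d
    in_set = a + b
    with_motif = a + c
    upper = min(in_set, with_motif)
    tail = sum(
        comb(with_motif, x) * comb(n - with_motif, in_set - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(n, in_set)


# Small worked table (illustrative counts only): (16 + 1) / 70.
p_value = fisher_right_tail(3, 1, 1, 3)
```

Because the test is run once per motif, the resulting p-values need a multiple-testing correction such as the FDR adjustment the abstract refers to.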
Short Abstract: ATAC-seq was first described in 2013 and was immediately embraced by the scientific community as a fast and straightforward assay for defining gene regulatory elements marked as chromatin-accessible regions. Dependable results of the assay rely on the overall quality of the ATAC-seq library. However, clear-cut evidence for the quality of the experiment can be obtained only from the sequencing data. Having a reliable and comprehensive standard for estimating sample quality prior to sequencing could save money and help guide sequencing study design. Common practice for assessing the quality of the ATAC-seq library is to analyse the Bioanalyzer electropherograms. However, this approach can be inconsistent because it relies on subjective human interpretation. The goal of our study is to establish an automated and reliable procedure for standardizing ATAC-seq quality control by devising the ATAC-seq Integrity Number (AIN) algorithm. We propose a logistic regression model for more accurate estimation. Our dataset included 416 ATAC-seq libraries from regulatory T cells, lymphoblastoid cell lines, macrophages and conventional T cells. Data were generated following the original ATAC-seq and FAST-ATAC-seq protocols. The results showed better accuracy of the model (94.9%) than of the human eye (56.5%), and demonstrated its applicability across different cell types and protocols.
Short Abstract: Currently, there is no comprehensive pipeline available in breast cancer research that ranks cancer cell lines by similarity to a patient’s profile utilizing all four types of omics profile (somatic and germline SNV, DNA copy number (CNV) and RNA gene-expression data). Hence, we have developed Oncomatch - a novel computational non-parametric algorithm that returns a list of a) the most similar breast cancer cell lines and b) the most sensitive anti-cancer drugs, based on the patient’s genetic information. Expression, CNV and mutation data are analyzed and ranked separately by creating similarity/dissimilarity score matrices, after which we optimize the three ranks to obtain an aggregate ranking. Each similarity matrix is a partially supervised weighted similarity/dissimilarity measure, giving weights to oncogenes, tumor suppressors and transporter genes. Rank aggregation is performed using the Cross-Entropy Monte Carlo Algorithm and a Genetic Algorithm. The best matched cell lines can also serve as a proxy in lieu of an applicable immortalized tumor cell line, patient-derived xenografts (PDXs) or organoid models. Further, the ranked list of anti-cancer drugs can serve as a prioritized foundation for in vivo screening in applicable models. The drug sensitivity information is obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) database. Current efforts are validating and improving our drug prioritizations.
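The rank-aggregation step combines several per-data-type rankings into one list. The sketch below uses a Borda count, which is a much simpler stand-in and not the Cross-Entropy Monte Carlo or Genetic Algorithm used by Oncomatch; it only illustrates what aggregating three rankings means. The cell line names and rankings are placeholders invented for the example.

```python
def borda_aggregate(rankings):
    """Aggregate rankings of the same items with a Borda count.

    Each ranking lists items from best to worst. An item earns more points the
    higher it is ranked; ties in total score are broken alphabetically.
    """
    items = set(rankings[0])
    scores = {item: 0 for item in items}
    for ranking in rankings:
        if set(ranking) != items or len(ranking) != len(items):
            raise ValueError("every ranking must contain the same items exactly once")
        for position, item in enumerate(ranking):
            scores[item] += len(ranking) - position
    return sorted(items, key=lambda item: (-scores[item], item))


# Hypothetical rankings of three placeholder cell lines by data type.
by_expression = ["line_A", "line_B", "line_C"]
by_cnv = ["line_B", "line_A", "line_C"]
by_mutation = ["line_A", "line_C", "line_B"]
aggregate = borda_aggregate([by_expression, by_cnv, by_mutation])
```

A Borda count treats all three data types equally; optimization-based aggregation can instead search for the ordering that minimizes a weighted distance to the input rankings.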
Short Abstract: Epigenetic marks play critical roles in development, disease, and aging. One of the most readily assayed epigenetic marks is DNA methylation, most commonly via bisulfite conversion of unmethylated cytosines, followed by quantitative genotyping through sequencing or hybridization. For population-scale studies and exploratory analyses, short-read sequencing approaches can be cost-prohibitive. Thus, hybridization-based bead arrays, in particular Illumina's HumanMethylation450 and HumanMethylationEPIC platforms, have become the dominant source of population-scale DNA methylation data. Over 120,000 samples have results publicly available, and many more have been assayed as part of pharmaceutical characterization efforts ranging from autoimmune to neurodegenerative diseases. Previous work on population-scale analysis of this platform has shown that beta regression outperforms competing methods in sensitivity to detect differential methylation while maintaining specificity. Here we extend this work to evaluate a range of modeling approaches for methylation arrays, including mixed-effects beta regression and generalized estimating equations, using Monte Carlo simulations and publicly available data. We revisit and update existing work in light of improved preprocessing methods, larger sample sizes, and the broad range of effect sizes now under study. This work establishes quantitative benchmarks to serve as a baseline for further work on a growing corpus of epigenetic data.
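For readers unfamiliar with the model class, a mixed-effects beta regression for a methylation proportion can be written in the common mean-precision form below. The notation is an illustrative assumption and is not taken from the abstract: y_ij is the methylation proportion at one probe for sample i in group j, x_ij the covariates, u_j a random intercept, and phi the precision.

```latex
% Illustrative notation only; symbols are not taken from the abstract.
y_{ij} \sim \mathrm{Beta}\bigl(\mu_{ij}\phi,\ (1-\mu_{ij})\phi\bigr), \qquad
\operatorname{logit}(\mu_{ij}) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j, \qquad
u_j \sim \mathcal{N}(0, \sigma_u^2)
% Conditional moments:
% E[y_{ij} \mid u_j] = \mu_{ij}, \quad
% \mathrm{Var}(y_{ij} \mid u_j) = \mu_{ij}(1-\mu_{ij}) / (1+\phi).
```

Setting u_j = 0 recovers ordinary (fixed-effects) beta regression.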
Short Abstract: Many biologically important traits vary on a quantitative scale and are regulated by many interacting genes of small effect. In GWAS, such traits are modeled most accurately by simultaneously quantifying additive and epistatic genomic effects. We compared methods for constructing such genotype-to-phenotype models in terms of statistical power, using simulated data based on maize and human genotypic data. We first calculated the type-I-error thresholds for model selection by permutation. We then evaluated the true/false positive detection rates of the quantitative trait nucleotides (QTN) underlying the simulated traits. Although the Stepwise Epistatic Model Selection (SEMS) approach could identify a substantial proportion of the simulated QTN and distinguish between additive and epistatic effects, a similar statistical model fitting additive effects only (joint-linkage, or JL, analysis) yielded similar true/false positive rates. The power of SEMS improved with sample size, but SEMS was more time-consuming than JL because the number of models tested grows with the number of SNPs. Therefore, it is vital to implement strategies for reducing the search space of candidate models, either by pre-selecting SNPs using prior biological information or by eliminating SNPs that do not affect the phenotype using statistical approximations. Here we propose several approaches that automate this process as a preparatory step before constructing accurate models with SEMS.
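The permutation step in the GWAS abstract can be sketched as follows. This is an assumption-laden toy, not the SEMS procedure: it uses the absolute Pearson correlation between a single SNP and the phenotype as the test statistic, records the maximum statistic over all SNPs for each shuffled phenotype, and takes roughly the (1 - alpha) quantile of those maxima as the threshold. The genotypes and phenotype are simulated with a fixed seed.

```python
import random

def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def permutation_threshold(genotypes, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Approximate (1 - alpha) quantile of the null maximum statistic."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        # Shuffling breaks any genotype-phenotype association.
        rng.shuffle(shuffled)
        null_max.append(max(abs_corr(snp, shuffled) for snp in genotypes))
    null_max.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_max[index]

# Simulated toy data: 50 SNPs coded 0/1/2 in 100 individuals; SNP 0 is causal.
rng = random.Random(1)
n_individuals, n_snps = 100, 50
genotypes = [[rng.randint(0, 2) for _ in range(n_individuals)]
             for _ in range(n_snps)]
phenotype = [genotypes[0][i] + rng.gauss(0.0, 1.0) for i in range(n_individuals)]

threshold = permutation_threshold(genotypes, phenotype)
observed = abs_corr(genotypes[0], phenotype)
print(f"threshold = {threshold:.3f}, observed |r| at the causal SNP = {observed:.3f}")
```

Taking the maximum over all SNPs within each permutation controls the family-wise error rate across the tested markers, which is why the threshold is stricter than a single-test cutoff.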
Short Abstract: While life in animals commences with the zygote, the process of oogenesis that generates the mature oocyte for fertilization is equally important for understanding the onset of life. However, our current knowledge of oogenesis and fertilization is far from complete. To fill this gap, we profiled the transcriptomes of different segments and individual cells in the gonad of the model organism C. elegans, which correspond to different developmental stages of oogenesis and fertilization. Through clustering and pseudo-time trajectory analysis, we found that the transcriptomes of single cells clearly reflect their developmental stages. We also identified a large number of differentially expressed genes and pathways between adjacent stages of oogenesis and fertilization, and uncovered many known and previously unknown biological processes that drive oogenesis and fertilization in this organism.
Short Abstract: Diet and certain food supplements can significantly impact cancer outcomes. However, the diversity of molecular alterations in tumors necessitates a personalized approach to determining the optimal drugs, diet and food supplements for each patient. Motivated by the hypotheses that (i) upregulation of certain metabolic pathways results in downregulation of specific oncogenes; (ii) expression levels of metabolic pathway genes can be upregulated by certain metabolic substrates (food supplements); and (iii) downregulation of tumor-specific oncogenes will suppress tumor progression, we studied co-regulation between gene expression levels using RNA-seq profiles produced by The Cancer Genome Atlas (TCGA) consortium. We found that major cancer genes are significantly co-regulated with large sets of genes, which we call “gene clouds”, many of which were significantly enriched for genes of various biological pathways. In particular, KEGG’s ribosome and oxidative phosphorylation pathways were universally overrepresented in oncogene clouds across many cancers. We developed a computational protocol that determines gene clouds, overrepresented biological pathways and related metabolites. So far, metabolites have been derived for ~30 common oncogenes and ~30 tumor suppressors across ~20 TCGA cancers. The computational protocol can be applied to an individual set of oncogenes and tumor suppressors to propose metabolites for testing in cell lines and mouse models.
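One simple way to picture a "gene cloud" is as the set of genes whose expression correlates strongly with a chosen cancer gene across samples. The sketch below assumes a plain Pearson correlation with a fixed cutoff; the abstract does not state which co-regulation measure or significance test the authors used, and the gene names and expression values are placeholders.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def gene_cloud(anchor, expression, min_abs_r=0.8):
    """Split genes into those positively and negatively correlated with the anchor."""
    positive, negative = [], []
    for gene, values in expression.items():
        if gene == anchor:
            continue
        r = pearson(expression[anchor], values)
        if r >= min_abs_r:
            positive.append(gene)
        elif r <= -min_abs_r:
            negative.append(gene)
    return sorted(positive), sorted(negative)

# Placeholder expression values across five samples.
expression = {
    "ONCOGENE_X": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with the anchor
    "GENE_B": [9.0, 7.2, 5.1, 2.9, 1.0],  # falls as the anchor rises
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 3.0],  # unrelated to the anchor
}

positive, negative = gene_cloud("ONCOGENE_X", expression)
print("positively co-regulated:", positive)
print("negatively co-regulated:", negative)
```

The resulting gene lists could then be passed to a pathway enrichment test, which is a separate step not shown here.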
Short Abstract: There is an increasing need to study the possible impact of anti-cancer therapies before they are applied to a patient. As a low-cost alternative to expensive and hard-to-establish patient-derived xenografts (PDXs), widely available cancer cell lines are receiving more attention from researchers. Based on our cancer cell line matching algorithm OncoMatch, we present the OncoType webportal, a web-based tool that delivers fast cell line matching results, annotations and visualization. The web portal has a user-friendly interface for customizing the matching analysis, with a selection of distance metrics and ranking methods. The selected cell lines are ranked by similarity to the molecular profile of the patient and annotated with subtyping information and drug sensitivity from the Genomics of Drug Sensitivity in Cancer (GDSC). Gene features from the subject’s input (gene expression, CNV and/or SNV) that match each cell line of interest are visualized in circos plots for quick inspection. OncoType helps to bridge the genomic information between the patient and cancer cell lines. The webportal is available at http://kalarikrlab.org/Projects/OncoType.html
Short Abstract: In breast cancer, early diagnosis and new therapeutics have significantly improved patient survival. However, metastasis and tumor recurrence are still observed years after successful treatment. Cancer stem cells, the most aggressive subpopulation of a tumor, are hypothesized to be highly associated with micrometastasis formation, tumor recurrence and chemoresistance. In this project, our main aim is to identify biomarkers that predict relapse and chemoresistance and to determine potential therapeutic targets. To this end, paired-end whole-transcriptome profiles of breast cancer stem cells, chemotherapy-treated cancer stem cells and normal breast cancer cells were compared to identify novel, cancer stem cell-specific genes. The Gene Expression Omnibus resource was used to validate associations between stemness, recurrence and chemoresistance. In addition to identifying differentially expressed genes, we constructed co-expression networks to define the gene clusters that correlate significantly with Spheres. Gene ontology analysis was performed to assess functional enrichment and facilitate biological interpretation. Data mining approaches were used to integrate various biological databases and define the roles of these genes in clinical outcomes. As a result, we defined a catalog of breast cancer stemness genes involved in chemoresistance and tumor relapse, together with the pathways in which they are enriched, which provides information for identifying new therapeutic targets.
Short Abstract: The McLysaght and Guerzoni (2016) article reported evidence for 35 de novo genes located in ancestral noncoding sequence. Methods similar to those described in that paper were used for this project, including an E-value threshold of 1 × 10^-4. A sequence identity of at least 60% and a coverage of at least 40% were required to determine homology between genes. Using the current NCBI nucleotide collection database, the goal of this project was to determine whether these genes may still be considered orphan genes. The 35 genes were retrieved from the McLysaght and Guerzoni article and analyzed to determine whether they were still considered to be homologous to humans, gorillas, or chimpanzees. It was concluded that ten of the genes reported by the source paper to be specific to humans also matched distantly related organisms. The BLAST searches confirmed 25 of the source paper’s results, strengthening the evidence for orphan genes. These results were supported by analysis of output from UCSC Genome Browser queries. It can be concluded that the potential scientific implication that these results provide is that within the past couple years, improvements in technologies
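The three homology thresholds stated in the abstract (E-value 1 × 10^-4, identity of at least 60%, coverage of at least 40%) could be applied to parsed BLAST hits roughly as sketched below. The `Hit` record, its field names, the choice of "less than or equal" for the E-value, the lineage set and the example hits are all assumptions made for illustration; the abstract does not describe the authors' actual parsing code.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One parsed BLAST hit (hypothetical record layout)."""
    subject: str             # organism of the matched sequence
    evalue: float
    percent_identity: float  # 0-100
    query_coverage: float    # 0-100

MAX_EVALUE = 1e-4      # threshold stated in the abstract
MIN_IDENTITY = 60.0    # percent, stated in the abstract
MIN_COVERAGE = 40.0    # percent, stated in the abstract

def is_homolog(hit):
    """True if the hit passes all three thresholds."""
    return (hit.evalue <= MAX_EVALUE
            and hit.percent_identity >= MIN_IDENTITY
            and hit.query_coverage >= MIN_COVERAGE)

def still_orphan(hits, lineage):
    """A gene stays a candidate orphan if no passing hit lies outside the lineage."""
    return not any(is_homolog(h) and h.subject not in lineage for h in hits)

# Made-up example hits for two hypothetical genes.
great_apes = {"human", "chimpanzee", "gorilla"}
gene_1_hits = [Hit("human", 0.0, 100.0, 100.0),
               Hit("zebrafish", 2e-3, 71.0, 52.0)]   # fails the E-value cutoff
gene_2_hits = [Hit("human", 0.0, 100.0, 100.0),
               Hit("zebrafish", 1e-12, 68.0, 45.0)]  # passes all thresholds

print(still_orphan(gene_1_hits, great_apes))
print(still_orphan(gene_2_hits, great_apes))
```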
Short Abstract: Nasopharyngeal carcinoma (NPC) is predominant in Hong Kong. To gain more insight into the regulatory mechanisms and identify master regulators of NPC, we performed the first time-series analysis in NPC. Specifically, paired tumor and normal specimens were collected from a total of 60 NPC patients under three conditions: at diagnosis, on celecoxib and on radiotherapy treatment. Genome-wide cDNA microarrays were then used to capture the expression profiles of the tumor and normal samples. We assembled the expression data into tensor form and used a newly developed tensor decomposition algorithm called Sparse Decomposition of Arrays (SDA) to extract latent factors from the data. We identified 8 biologically significant non-overlapping factors for NPC tumor and stroma, namely immune response, viral response, cell cycle, mitochondrial metabolism, detoxification, stress response, RNA polymerase transcription and RNA processing. These 8 functional categories are strongly associated with NPC and cover the vast majority of mechanisms underlying NPC tumorigenesis. We also identified several master regulators in the gene networks, which may guide targeted therapies for NPC. In summary, our study not only provides insights into the mechanisms underlying NPC tumorigenesis, but also identifies novel master regulators that may participate in the tumorigenesis process.
Short Abstract: The Hope for Depression Research Foundation focuses on the neurobiology of mood disorders and includes five research groups that use transcriptomic studies of animal models of depression. Integrating the results across diverse models could provide better insight into the neurobiology of mood disorders, so we developed a meta-analysis pipeline to extract consistent gene expression signatures across models and platforms. We tested this pipeline using five data sets from three animal models: selectively-bred high responder and low responder rats, Flinders sensitive and resistant rats, and mice with glucocorticoid receptor overexpression (early-life and lifetime). The whole hippocampus or dentate gyrus was analyzed using either microarray or RNA-Seq. To ensure comparability, raw data from each study were run through our proposed pipeline for annotation, quality control, and data preprocessing. We calculated the effect size from the moderated t-statistic and used it within a multi-level, random effects meta-analysis model, which allows grouping by biologically relevant variables. We find that several hundred candidate genes show similar differential expression despite minimal intersection between the top results from each individual study, highlighting several biological pathways. We conclude that it is beneficial to integrate results from different models to understand the neurobiology of mood disorders.
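The effect-size and pooling steps can be illustrated with a simplified sketch. It converts a two-group t-statistic into a standardized mean difference using the usual approximation d = t * sqrt(1/n1 + 1/n2), then pools across studies with a single-level DerSimonian-Laird random-effects estimator. This is a stand-in for, not a reproduction of, the multi-level model described in the abstract, and applying the conversion to a moderated t-statistic is itself an approximation. The study values are invented for illustration.

```python
import math

def effect_from_t(t, n1, n2):
    """Approximate standardized mean difference and its variance from a t-statistic."""
    d = t * math.sqrt(1.0 / n1 + 1.0 / n2)
    var = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return d, var

def random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate, its standard error, and tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    # Cochran's Q measures between-study heterogeneity.
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * di for wi, di in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Invented (t, n1, n2) values for one gene in five hypothetical data sets.
studies = [(2.8, 6, 6), (1.9, 5, 5), (3.4, 8, 8), (0.7, 6, 6), (2.2, 10, 10)]
effects, variances = zip(*(effect_from_t(t, n1, n2) for t, n1, n2 in studies))
pooled, se, tau2 = random_effects(effects, variances)
print(f"pooled effect = {pooled:.3f}, SE = {se:.3f}, tau^2 = {tau2:.3f}")
```

Because the pooled estimate weights each study by its precision, a gene with a modest but consistent effect across data sets can rank highly even if it is not among the top results of any single study.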
Short Abstract: Prostate cancer (PCa) is the second most common cancer in men and the fourth most common tumor type worldwide. Many PCas are indolent and do not result in cancer mortality, even without treatment. Despite remarkable progress in diagnosis and patient care in recent years, the identification of genetic markers that distinguish indolent from aggressive PCa remains a major challenge. Large volumes of somatic mutation information have not been leveraged and integrated with germline mutations to find genetic markers predictive of aggressive tumors. The objective of this study is three-fold: (i) to identify and functionally characterize genes containing germline and somatic mutations that distinguish indolent from aggressive PCa, (ii) to discover and characterize the molecular networks and biological pathways influenced by germline and somatic mutations in each disease subtype, and (iii) to model pathway crosstalk between pathways controlled by germline mutations and pathways regulated by somatic mutations. We combine germline mutations from GWAS and somatic mutations from TCGA, using gene expression data from TCGA as the organizing principle, and complement this approach with network and pathway analysis. The joint analysis shows how germline and somatic mutations are likely to cooperate in driving aggressive PCa.
Short Abstract: Breast cancer (BC) is the most diagnosed cancer among women in the US. The majority of BCs are ER+BC and respond to targeted therapies. However, a significant proportion are triple-negative BC (TNBC), which does not respond to targeted therapies. The application of next-generation sequencing technologies has led to the discovery of somatic mutations driving BC. The development of BC involves both germline and somatic mutations. However, to date, the possible oncogenic interactions between somatic and germline mutations and their joint role in tumorigenesis remain elusive. The objective of this investigation was to map the germline and somatic mutation interaction landscape in TNBC and ER+BC. We integrated somatic mutation information derived from patients diagnosed with TNBC and ER+BC from The Cancer Genome Atlas (TCGA) with germline mutation information from genome-wide association studies, using gene expression data from TCGA as the organizing principle. The analysis revealed a gene signature harboring both somatic and germline mutations that distinguishes TNBC from ER+BC. Further analysis revealed molecular networks and biological pathways enriched for germline and somatic mutations in each type of BC. The study suggests that germline and somatic mutations are likely to cooperate in driving TNBC and ER+BC phenotypes.
Short Abstract: Prostate cancer (PCa) is a complex genetic disease influenced by both inherited variants in the germline DNA and somatic mutations acquired during formation of tumors. Recently, large multi-center projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have performed detailed analysis and characterization of somatic mutations. However, the large amounts of somatic mutation information generated have not been optimally leveraged and integrated with germline mutations to infer the possible oncogenic interactions between germline and somatic alterations during tumorigenesis. Here we performed computational analysis integrating somatic mutation information derived from 495 PCa patients in TCGA with germline mutation information from genome-wide association studies, using transcriptome data from TCGA to model pathway crosstalk between germline and somatic mutation modulated pathways in PCa. Our analysis revealed molecular networks and biological pathways enriched for germline and somatic alterations. Computational analysis of pathway crosstalk revealed interactions between pathways regulated by germline mutations and pathways regulated by somatic mutations, as well as pathways enriched for both types of genomic alteration, including the AR, STAT3, P53, PTEN and IGF-1 signaling pathways. Together, these data may lay the foundation for developing novel strategies for precision prevention of PCa and for developing novel therapeutics.