Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: How new protein-coding genes and new protein domains appear in evolution are major questions in biology. While new genes are often built by duplicating existing genes, new genes were recently found to arise de novo from genomic DNA. To understand how new genes may arise de novo, we built a mathematical birth-and-death model based on gene and genome dimensions and dynamic factors such as mutation, recombination and selection. We found most genomes should contain many new genes, with few being maintained. Second, we identified thousands of candidate de novo genes in 20 eukaryotic genomes, using phylostratigraphy and proteomics, and evaluated their predicted biophysical properties. Compared to ancient proteins, new proteins are shorter, more vulnerable to proteases, disordered, likely to bind other proteins, yet less prone to toxic aggregation. To test structural predictions, we performed biophysical experiments comparing human new proteins to ancient proteins. We found that new genes encode short proteins that have distinct structural features and are expressed in brain and male germline, readily providing an avenue for evolutionary testing of function. The continuous creation and destruction of new genes provides a dynamic reservoir of molecular variation that enables genomic exploratory behavior to find new structures and new functions.
Short Abstract: Splicing requires the macromolecular machine known as the spliceosome to produce a functional transcript from a pre-mRNA. The roles of many of the 200 spliceosome proteins remain unknown. We are examining loss-of-function mutations in two conserved proteins (DGCR14 and FRA10AC1), which are part of the remodeled C* complex of the spliceosome. These proteins are conserved in organisms with a high density of introns. The drg14-1 and fra10-1 mutations were identified in the unicellular green alga Chlamydomonas reinhardtii as suppressors of mutants with noncanonical splice sites (Lin et al., 2018). Through transcriptome analyses, we find that in the drg14-1 and fra10-1 mutant strains, a small number of splicing events are affected when cells are grown logarithmically in complete medium. However, the fra10-1 mutant strain shows reduced survival when placed in medium lacking nitrogen. Thus, we hypothesize that the FRA10AC1 protein modulates splicing of genes involved in responding to the lack of nitrogen. To test this hypothesis, we will perform RNA-seq on the fra10-1 and parental strains under nitrogen-deprived conditions to identify and analyze changes in splicing patterns. This study may provide novel insights into the post-transcriptional regulation of genes by a spliceosomal protein and will provide greater understanding of genes that respond to nitrogen deprivation.
Short Abstract: Many analytical tools for microbial metagenomics have emerged in recent years to produce an OTU (Operational Taxonomic Unit) table or a pathway table, and then step through a basic analytical pipeline. However, this is just a starting point for determining how these billions of bacteria interact and what the implications are for their functional and ecological roles. Many interesting biological questions require more advanced and easily accessible methods for downstream analyses. Therefore, we developed a novel online computational framework, M2ROC, and made it available through an interactive web interface. The framework selects and ranks attributes using multiple feature selection methods and evaluates the selected feature set using a variety of machine learning methods. The challenges lie in the fact that microbiome metagenomics data have high dimensions in feature space and are often multiclass. M2ROC implements bootstrap sampling and an ensemble strategy to address these challenges. Micro-average and macro-average methods are included in M2ROC to evaluate performance in a multiclass classification setting. Finally, a network component is implemented to explore the inherent connections among features identified in the microbiome. M2ROC is deployed on the BioComs Lab website and can be accessed at http://m2roc.biocoms.org/.
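The difference between the micro-average and macro-average mentioned in the M2ROC abstract can be illustrated with a small sketch. This is not M2ROC's code: the function name `micro_macro_precision` and the toy labels are hypothetical, and precision is used only as a simple stand-in metric. The same two averaging rules can be applied to other per-class scores such as ROC curves.

```python
from collections import Counter


def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) averaged precision for multiclass labels.

    Hypothetical helper for illustration only; not part of M2ROC.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    classes = sorted(set(y_true) | set(y_pred))
    true_pos = Counter()   # predictions of class c that were correct
    false_pos = Counter()  # predictions of class c that were wrong
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_pos[pred] += 1
        else:
            false_pos[pred] += 1

    # Macro: compute precision per class, then take the unweighted mean,
    # so rare classes count as much as common ones.
    per_class = []
    for c in classes:
        predicted = true_pos[c] + false_pos[c]
        per_class.append(true_pos[c] / predicted if predicted else 0.0)
    macro = sum(per_class) / len(per_class)

    # Micro: pool the counts over all classes first, then compute one ratio,
    # so every sample counts equally.
    pooled_tp = sum(true_pos.values())
    pooled_fp = sum(false_pos.values())
    micro = pooled_tp / (pooled_tp + pooled_fp)
    return micro, macro


if __name__ == "__main__":
    truth = ["a", "a", "a", "a", "b", "c"]
    preds = ["a", "a", "a", "a", "c", "c"]
    print(micro_macro_precision(truth, preds))
```

With imbalanced classes the two averages can differ noticeably, which is presumably why a multiclass evaluation would report both.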
Short Abstract: Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared Data Science Infrastructures like Boa (http://boa.cs.iastate.edu/), which abstract away details of parallelization and storage management, can be used to process and parse data contained in large data repositories more efficiently. Results: Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq’s 97,716 annotation (GFF), assembly files and metadata. We used BoaG to query the entire RefSeq dataset and gain insight into the RefSeq genome assemblies and gene model annotations, and we show that assembly quality using the same assembler varies depending on species. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, BoaG, can give researchers greater access to efficiently explore data in ways previously not possible for anyone but the most well-funded research groups. We demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation, as a proof of concept for much larger datasets.
Short Abstract: RGD’s general search tool is built with the open source technologies Java and Elasticsearch, a real-time distributed full-text search and analytics engine built on top of Apache Lucene. RGD’s data is distributed across a cluster of 8 nodes with 8 primary shards and 2 replica shards for best performance, stability, scalability and high availability. The results are aggregated into facets based on various object categories, species and types. Faceted search refers to a way to explore large amounts of data by displaying summaries about various partitions of the data and later allowing the user to narrow the navigation to a specific partition. The search tool is integrated with various RGD data analysis tools: the OLGA (Object List Generator and Analyzer) tool, where a list of genes can be combined with the results of one or more additional queries; InterViewer, a visualization tool for protein-protein interactions; Variant Visualizer, to view the strain-specific variants in those genes that are predicted to be possibly or probably damaging; and the PhenoMiner tool, to explore quantitative phenotype measurement data. In addition to the general search, an enhanced search function is attached to the RGD ontology browser to give ontology term-specific results.
Short Abstract: Hi-C data has been used to reconstruct chromosomal 3D structures. Many of the methods use a fixed parameter when converting the number of Hi-C contacts to a wish distance in three-dimensional space. We developed a new method to infer this conversion parameter and further infer the pairwise Euclidean distances without needing to generate a 3D structure, based instead on the topology of the Hi-C complex network (HiCNet). The inferred distances are modeled by clustering coefficient and multiple constraints. Our inferred distances have a higher correlation with FISH and Xist localization data and better match 156 pairs of protein-enabled long-range chromatin interactions detected by ChIA-PET. Using the inferred distances and another round of optimization, we further reconstruct 40Kbp high-resolution 3D chromosomal structures of mouse male ES cells. The high-resolution structures illustrate topological domains and DNA loops and show that chromosome X has two large compartments and that the Xist locus is located at the boundary of the two compartments.
Short Abstract: Gene cascades refer to the linked expression of genes; that is, the upstream gene serves as the switch for multiple downstream genes. Identifying a gene cascade makes it possible to elucidate the comprehensive function of an upstream gene. However, because targets of gene regulation cannot be predicted with certainty, it is difficult to identify genome-wide gene cascades of translational regulators. In this study, we constructed gene cascades that have multiple downstream genes by integrating RNA-Seq and database data for an early embryo. First, the transcription factors that were directly regulated by the upstream gene were extracted from the RNA-Seq results. Second, the downstream transcription factors were extracted from the interaction and expression data obtained from WormBase, the database for nematodes. This extraction process was repeated until no further downstream genes in the early embryo could be obtained from the interaction and expression data. The maternal gene spn-4, which is important in embryogenesis because its loss is embryonic lethal, was used as a model for this study. To analyze the cascade, specific domains were extracted. From the results, the comprehensive function of spn-4 in early embryogenesis was determined to be the regulation of pharyngeal development through the canonical Wnt signaling pathway.
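The repeated extraction step described in the gene cascade abstract resembles a layer-by-layer traversal of a regulator-to-target map. The sketch below is only an illustration under that assumption: `build_cascade`, the `targets_of` dictionary, and all gene names in it are hypothetical placeholders, not data or code from the study.

```python
def build_cascade(upstream_gene, targets_of):
    """Collect downstream genes layer by layer until no new gene is found.

    targets_of maps a regulator to the genes it is reported to regulate
    (hypothetical stand-in for RNA-Seq and database-derived relations).
    """
    layers = []
    seen = {upstream_gene}
    frontier = [upstream_gene]
    while frontier:
        next_layer = []
        for regulator in frontier:
            for target in targets_of.get(regulator, []):
                if target not in seen:  # skip genes already in the cascade
                    seen.add(target)
                    next_layer.append(target)
        if next_layer:
            layers.append(next_layer)
        frontier = next_layer  # stops when a round yields nothing new
    return layers


if __name__ == "__main__":
    # Placeholder relations, invented purely for the example.
    relations = {
        "upstream": ["tf1", "tf2"],
        "tf1": ["tf3"],
        "tf2": ["tf3", "tf1"],
        "tf3": [],
    }
    print(build_cascade("upstream", relations))
```

Tracking the `seen` set keeps the loop from cycling when two regulators target each other, so the traversal always terminates.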
Short Abstract: Intractable diseases are diseases with objective diagnostic criteria, which include rare diseases with infrequently observed genetic mutations, such as missense mutations that cause amino acid substitutions. Although 330 diseases have been designated as intractable, sequence information flanking the mutation sites in the causal genes remains elusive. In this study, we systematically analyzed the characteristics of amino acid sequences flanking 626 mutation sites responsible for intractable diseases, which are located on the human X chromosome. First, we classified these mutations based on their structural effects on the protein products. IUPred analysis indicated that 413 and 213 mutations were found in ordered and disordered protein regions, respectively. We also found that mutations in the ordered regions resulted in substitutions in glycine (12%), arginine (11%), and leucine (8%). In contrast, 77% of the mutations in amino acids in the disordered regions resulted in substitutions in glycine. Moreover, we found that glycine residues were often substituted with charged amino acids. In this presentation, we discuss the importance of repetitive sequences in disordered regions flanking causal amino acids of rare diseases.
Short Abstract: Epigenomics, such as histone modification, is the regulation of gene expression without altering the genome sequence. Histone acetylation is catalyzed by histone acetyltransferases (HATs) and histone deacetylases (HDACs); the latter are regulated by protein complexes that depend on various scaffold proteins. During early embryogenesis in C. elegans, LET-418, SIN-3, and SPR-1 are the scaffold proteins of the HDAC complex. These proteins are evolutionarily conserved from yeast to humans, have different domains, and cause various phenotypes. The knock-out mutants of these scaffold proteins show embryonic lethality in various model organisms. In C. elegans, HDAC complex scaffold proteins are also important for embryogenesis; however, their function in early embryogenesis currently remains unknown. To elucidate the function of the HDAC complex in early embryogenesis, we analyzed three HDAC scaffold genes (let-418, sin-3, and spr-1) by RNA-Seq analysis. Our study reveals that sin-3 is involved in embryo pattern specification, vulval fate commitment, and Wnt signaling. let-418 is involved in vulval cell fate commitment, muscle cell fate commitment, mRNA processing, and ncRNA processing. spr-1 is involved in nerve fate commitment and morphogenesis. Our results suggest that each scaffold gene is involved in a different stage and function during embryonic development.
Short Abstract: Glycosyltransferases (GTs) and glycoside hydrolases (GHs) produce various glycans with diverse biological functions depending on their structural differences. Analysis of GTs partially elucidated the evolutionary acquisition of human glycan functions. However, evolutionary structural changes in glycans produced by GHs have not been characterized. Therefore, here, glycan functions were studied by genome-wide analysis of GHs using the phylogenetic profiling method. Human GHs were classified into four classes: class 1, conserved only in chordates; class 2, conserved only in metazoans; class 3, conserved in metazoans/plants; and class 4, widely conserved in eukaryotes. The enzymes belonging to classes 1 and 2 are involved in fertilization, antimicrobial activity, and extracellular matrix remodeling. This suggests that multicellular systems that possess GH-derived glycans acquired these during evolution into metazoans. Among the GHs producing N-glycans, those processing high mannose-type glycans were found in class 4, those involved in the degradation of complex glycans were identified in class 2, and endo-type degradation enzymes were located in class 1. These results suggest that N-glycans diversified among eukaryotes to produce glycan structures of increasing complexity as evolution progressed towards chordates.
Short Abstract: Kinases control cellular growth and possess extracellular binding domains, making them the most commonly explored class of therapeutic targets in cancer. Despite this, the current number of approved small-molecule kinase inhibitor (SMKI) drugs is limited and could be greatly expanded using appropriate data mining approaches. Accurate identification of inhibitors with pan-cancer application is an important and challenging problem. SMKIs have previously been designed based on approved drugs and screened manually, resulting in a slow process for identifying new therapeutics. We will use extensive computational power coupled with analysis of high-throughput ‘omics datasets to search for potential targets, allowing us to quickly design novel inhibitors. The availability of SMKIs with good efficacy, together with the identification of putative driver kinases and their expression patterns in multiple cancers, will provide a tremendous opportunity to understand how to fully exploit SMKIs. We analyzed RNAseq data from TCGA for altered kinase expression between tumor and normal samples, which revealed mostly up-regulation of kinases across multiple cancer types. We also explored somatic mutations and copy number variation of key kinases associated with various cancers. Using these data, we aim to create a list of kinases that can be further studied using SMKIs for therapeutic benefit in multiple cancers.
Short Abstract: Immunotherapeutic approaches are quickly becoming an important new source of promising drugs. These approaches generally employ or induce an antibody or T-cell receptor which recognizes a relatively short peptide epitope within a target protein. However, these therapeutics may unintentionally target peptides found in other proteins in the body and run the risk of off-target effects and toxicities. As most sequence homology tools are designed to identify relatively long stretches of sequence homology, there is a need for an application capable of homology-searching peptide sequences within a protein against all proteins in the proteome to identify these potential off-targets. AntigenBLAST divides an input protein sequence into all possible n-mer peptides and runs a BLAST sequence analysis on each peptide fragment using parameters suitable for short sequences. All perfect matches in a proteome are quickly identified and mapped onto an easy-to-interpret graph and report. Using this tool, we have been able to determine if any other proteins within a proteome share one or more epitopes. We have also noted that protein active sites tend to show significant regions of local homology suggesting that these regions may be the source of off-target effects observed with some small molecule drugs.
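The first step described for AntigenBLAST, dividing a protein into all possible n-mer peptides, amounts to a sliding window over the sequence. The sketch below illustrates that idea only. It is not the AntigenBLAST implementation: the function names and toy sequences are hypothetical, and a plain exact-substring check stands in for the BLAST search used by the tool.

```python
def peptide_windows(sequence, n):
    """Return (1-based start, peptide) for every overlapping n-mer."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [(i + 1, sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def shared_peptides(query, proteome, n):
    """Map each query n-mer to the proteome entries containing it exactly.

    proteome is a dict of {protein name: sequence}. Exact substring
    matching is an assumption made for this sketch; the tool itself
    runs BLAST with short-sequence parameters.
    """
    hits = {}
    for start, peptide in peptide_windows(query, n):
        names = [name for name, seq in proteome.items() if peptide in seq]
        if names:
            hits[(start, peptide)] = names
    return hits


if __name__ == "__main__":
    # Toy sequences invented for the example.
    toy_proteome = {"P1": "GGTAYIGG", "P2": "AAAA"}
    print(peptide_windows("MKTAYIAK", 3))
    print(shared_peptides("MKTAYIAK", toy_proteome, 4))
```

A sequence of length L yields L - n + 1 windows, so the number of searches grows linearly with protein length for a fixed peptide size.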
Short Abstract: Immunotherapy has already demonstrated its potential by reducing tumor burden and increasing patient survival in several cancer types. However, not all patients respond to a treatment and predicting which patients will profit from which type of immunotherapy is still an ongoing challenge. For adoptive T cell therapy, the T cells of a patient are enhanced by adding a T cell receptor (TCR) that can detect a cancer antigen epitope presented by major histocompatibility complex (MHC). Besides antigen expression, the human leukocyte antigen (HLA) allele restricting the receptor is often the only criterion for selecting patients for this type of therapy, although there are more than 70 genes involved in antigen processing and presentation. We investigate differences in expression levels of these genes and compare HLA typing results of RNA and DNA based approaches for cancer samples. Downregulation of MHC alleles has been observed in cancer cell lines. We also found significant differences regarding the expression of key genes such as TAP (transporter associated with antigen processing) and MHC class I in breast cancer samples for different ethnicities. Work is in progress to identify additional decision factors for selecting suitable therapies that can lead to increasing response rates to immunotherapy approaches.
Short Abstract: Structural variations (SVs) in the human genome originate from different mechanisms related to DNA repair and retrotransposition and are frequently associated with diseases. We analyzed 26927 SVs from the 1000 Genomes Project. The analyses revealed differential distributions and consequences of SVs of different origin; for instance, SVs from non-allelic homologous recombination (NAHR) are more prone to disrupt chromatin organization, while processed pseudogenes can create accessible chromatin. We found that SVs from NAHR are primarily associated with spontaneous double-strand breaks (DSBs). This evidence, along with strong physical interaction of NAHR breakpoints belonging to the same SV and minor association of the breakpoints with DNA recombination sites, suggests that the majority of NAHR SVs originate from errors during homology-directed repair (HDR) of spontaneous DSBs. In turn, the origin of the DSBs is associated with transcription factor binding, revealing the vulnerability of functional, open chromatin. The chromatin itself is enriched with repeats, particularly Alus, which provide the homology required to maintain stability via HDR. Additionally, we observed a striking difference between the genomic distributions of fixed and variable Alus. Based on the co-localization of fixed Alus and NAHR SVs in open chromatin, we hypothesize that Alu expansion in the hominid lineage had a stabilizing role on the human genome.
Short Abstract: Transposable Elements (TEs) are small repetitive sequences of DNA that copy and paste themselves across the genome. TEs are important drivers of evolution. However, despite their importance, they are not well classified, and the factors governing TE activity are not well understood. To address the classification problem within the Drosophila simulans genome, we devised a system for TE classification based on TE identity, and used it to create 1333 clusters of highly similar TEs. By superimposing existing sequence search strategies onto our clusters, we were able to categorize the proteins expressed by TEs, and examine whether active TE copies tend to have more intact proteins than inactive copies. Although the number of intact open reading frames was poorly correlated with TE activity, we found a significant correlation between specific categories of protein and TE activity, suggesting potential key roles for these proteins in facilitating TE expression. We further note that our method of clustering similar sequences may have other applications for classifying a variety of chimeric sequences and those that undergo frequent horizontal transfer.
Short Abstract: The gastrointestinal (GI) tract hosts the largest number of microbes in the human body, with an estimated 3x10^13 bacterial cells. These gut microbes exhibit commensal, symbiotic, or pathogenic relationships with the host, regulating host physiology by producing different enzymes and metabolites. Altered gut ecology due to dietary changes, infections, and antibiotic exposure results in changes in the gut microbial enzyme profile, which leads to a wide range of gastric disorders including cancers. Hence, cataloguing the microbial enzymes in the GI tract would facilitate the identification of drug targets for cancer treatment. We analyzed 374 reference GI tract microbial genomes from the HMP database to identify gut microbial enzymes and their association with human cancer metabolic pathways using the KEGG Automatic Annotation Server. These genomes mainly include species from Firmicutes, Epsilonproteobacteria, Gammaproteobacteria and Bacteroidetes. Results showed that 19% of GI tract microbial proteins encoded different enzymes, and 1.4% of those are associated with various human GI cancers: gastric, pancreatic, and colorectal cancers. A unique set of 40 different microbial enzymes associated with different human cancer metabolic pathways was identified from the reference genomes. Future work will focus on the identification of specific microbe-derived metabolite markers for different GI cancers using microbial metabolic network modeling.
Short Abstract: GigaScience aims to revolutionize publishing by promoting reproducibility of analyses and data dissemination, organization, understanding, and reuse. We publish ALL research objects (data, software tools and workflows) from 'big data' studies across all life sciences. We follow the FAIR Principles for scientific data management and stewardship. To achieve this we have a novel publication format: standard manuscript publication linked with an extensive database hosting all associated data and providing data analysis tools and cloud-computing resources. The manual curation and hosting of data files is included free of charge for accepted manuscripts. Here we describe the latest release of GigaDB.org, which has recently undergone a complete overhaul to improve the user experience, allow faster navigation and improve the visibility of links to related resources. In addition, we have recently installed new tools: a 3D visualisation tool to enable interactive visualisation of surface-rendered reconstructions of 3D imaging data, such as microCT scan reconstructions; JBrowse genome browsers for a small number of genome assemblies; and a map viewer that utilises latitude/longitude coordinates. In future developments we wish to provide more tools for visualisation of data with a view to increasing data accessibility and reuse. If you have opinions about the tools that we could offer, please get in touch.
Short Abstract: For fifteen years our laboratory has been exploring hippocampal gene expression differences between selectively-bred high responder (bHR) and low responder (bLR) rats in order to provide insight into the biology underlying mood and temperament. In total, as part of experiments run across 53 generations of selective breeding, we have collected eight different microarray or RNA-seq datasets from samples of the whole hippocampus of HR and LR rats during both development (P7, P14, P21) and adulthood. Recently, we re-analyzed these datasets and ran a formal meta-analysis to determine which molecular candidates were consistently differentially-expressed across different generations and age groups. We further interpreted our results using cell type-based matrix deconvolution, gene set enrichment analysis, co-expression and protein-protein interaction networks, and positional gene enrichment analysis. Results indicate large differences in the expression of several genes with known relationships to mood and temperament, accompanied by an enrichment of differentially-expressed genes in their respective genomic neighborhoods. Results also indicate an enrichment of differentially-expressed genes within the dentate gyrus and within hippocampal-specific co-expression and protein-protein interaction networks that are important for proliferation and neural differentiation. We conclude that using meta-analysis, even small, legacy transcriptomic datasets can be leveraged to produce biological insight.
Short Abstract: A G4C2 hexanucleotide repeat expansion (HRE) in the first intron of the C9orf72 gene (C9) is the most significant genetic driver of amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD). Recent findings identified a novel pathogenic mechanism wherein the C9 mutation disrupts nucleocytoplasmic transport. Accordingly, we developed a C9-HRE expression system and utilized biochemical subcellular fractionation coupled to mass spectrometry (MS)-based quantitative proteomics to identify proteins that demonstrate altered subcellular distribution in C9-HRE-expressing cells compared to controls. We found that the proteome shifted to a higher level of cytosolic accumulation in cells expressing the C9-HRE. Further, we identified 126 proteins that demonstrate a significantly altered nuclear to cytoplasmic ratio. The majority of proteins that showed a bidirectional change shifted localization from the nucleus to the cytosol. Gene ontology analysis revealed a striking enrichment for proteins involved in RNA metabolism, proteostasis, nucleocytoplasmic transport, and protein translation. We validated that one identified protein, eukaryotic termination factor 1 (ETF1), is enriched in the nuclear fraction of C9 patient-derived motor neurons (MNs). Current analyses are interrogating possible mechanisms for nuclear enrichment of ETF1, including sequestration by HRE RNA foci, sequestration by aberrant dipeptide proteins produced by the HRE, and compensatory mechanisms, particularly nonsense-mediated decay (NMD).
Short Abstract: Alzheimer’s Disease (AD) is a growing epidemic as longer life expectancy fuels its principal risk factor, aging. As understanding of AD grows in the setting of many failed clinical trials, the concept of AD as a single disease is giving way to the hypothesis that it is a syndrome with multiple disease pathways progressing towards a common end-stage clinical presentation. Here, we aim to identify FDA-approved drugs that target these pathways and thus are candidates for repurposing in AD. Given an FDA-approved drug, we asked if its mechanism of action is related to AD biology by training a predictor of disease stage. The predictor was limited to using expression of genes known to be associated with the drug, and its performance was compared to predictors constructed on randomly-selected gene sets of equal size. Thirty top-performing drugs were subsequently profiled on human neuroprogenitor cell lines that differentiate into a mixed culture of neurons, glia and oligodendrocytes to further refine their mechanisms of action in relevant cell types. The JAK inhibitors Tofacitinib and Ruxolitinib were among the top performers, and additional in vitro experiments demonstrated that the two drugs can rescue inflammation-induced neuronal death, suggesting their potential as repurposing candidates for AD.
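The comparison against randomly selected gene sets of equal size, as described in the drug repurposing abstract, can be summarized as an empirical p-value. The sketch below assumes that framing. The function `empirical_p_value`, the `score` callback, and the toy weights are all hypothetical; in the study the score would be the performance of a trained disease-stage predictor, which is not reproduced here.

```python
import random


def empirical_p_value(observed, gene_pool, set_size, score, n_random=1000, seed=0):
    """Fraction of random equal-size gene sets scoring at least `observed`.

    score is any function mapping a list of genes to a number
    (a hypothetical stand-in for predictor performance).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    null_scores = [
        score(rng.sample(gene_pool, set_size)) for _ in range(n_random)
    ]
    at_least = sum(s >= observed for s in null_scores)
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_random + 1)


if __name__ == "__main__":
    # Toy setup: ten placeholder genes with made-up weights 0..9.
    pool = [f"g{i}" for i in range(10)]
    weights = {gene: i for i, gene in enumerate(pool)}

    def toy_score(genes):
        return sum(weights[g] for g in genes)

    print(empirical_p_value(20, pool, 3, toy_score))
```

Drawing random sets of the same size controls for the fact that larger gene sets tend to yield better predictors regardless of biological relevance.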
Short Abstract: The Genome 10K Project (G10K) is sequencing hundreds of mammalian genomes, enabling us to compare diverse species whose most recent common ancestors lived hundreds of millions of years ago. Many phenotypes evolved through gene expression, so they differ across species due to differences in cis-regulatory elements (CREs) that affect transcription. Since many of these phenotypes are tissue-specific, we are studying CRE strength in one tissue relative to another. As a proxy for CREs, we are using DNA accessibility from ATAC-seq. As a proof of concept, we trained a convolutional neural network (CNN) to predict whether a CRE in mouse is brain-specific relative to liver or vice versa. Our current CNN achieves > 90% AUROC and > 80% AUPRC on chromosomes in the validation set and on human regions whose orthologs are not CREs in the same mouse tissue. In addition, we used deepLIFT to identify important nucleotides in each brain-specific example and found that nucleotides in motif hits of known brain transcription factors (TFs) tend to be more important than the nucleotides in motif hits of other TFs (Wilcoxon rank-sum p = 2.662 x 10^-7). We are now using our CNN to predict the tissue-specificity of CREs across all mammals in G10K.
Short Abstract: The recently developed BioAssay Express technology streamlines the conversion of human-readable assay descriptions to computer-readable information. BioAssay Express uses public semantic standards (ontologies) to markup bioprotocols, which unleashes the full power of informatics technology on data that could previously only be organized by crude text searching. One of several annotation-support strategies within BioAssay Express is the use of machine learning models to provide statistically backed "suggestions" to the curator. We will describe our efforts to complement these models by applying ontology derived text mining, association rules mining based on existing annotations, and axioms that are embedded within the underlying ontologies. BioAssay Express includes the BioAssay Ontology (BAO), Gene Ontology (GO), Drug Target Ontology (DTO) and Cell Line Ontology (CLO). We will explore how this resource will be used, in conjunction with models and axiom support, to encourage further semantic annotation of publicly available bioassay protocol data. These efforts are timely and important, as such datasets (released by both public and private organizations) are only increasing, with the volume already exceeding the ability of individual scientists to manage productively.
Short Abstract: Measurement of changes in protein levels and in post-translational modifications, such as phosphorylation, can be highly informative about the phenotypic consequences of genetic differences or about the dynamics of cellular processes. Typically, such proteomic profiles are interpreted intuitively or by simple correlation analysis. Here, we present a computational method to generate causal explanations for proteomic profiles using prior mechanistic knowledge in the literature, as recorded in cellular pathway maps. To demonstrate its potential, we use this method to analyze the cascading events after EGF stimulation of a cell line, to discover new pathways in platelet activation, to identify influential regulators of oncoproteins in breast cancer, to describe signaling characteristics in predefined subtypes of ovarian and breast cancers, and to highlight which pathway relations are most frequently activated across 32 cancer types. Causal pathway analysis, which combines molecular profiles with prior biological knowledge captured in computational form, may become a powerful discovery tool as the amount and quality of cellular profiling rapidly expand. The method is freely available at http://causalpath.org.
Short Abstract: Standard somatic variant callers require a matched normal sample from the same individual, which is often not available, making it difficult to distinguish between true somatic variants and germline variants that are private to the individual. Archival sections frequently contain adjacent normal tissue, but it is often contaminated with tumor. Comparative somatic variant callers are designed to exclude variants present in the normal sample, so a novel approach is required to leverage sequencing of adjacent normal tissue. Here we present LumosVarMulti, a software package that jointly analyzes multiple samples from the same patient, models copy number states, and uses allelic fractions to classify variants as somatic or germline. We evaluated this approach both on simulated data and on a patient cohort where matched low and high tumor content biopsies and peripheral blood (for constitutional DNA) were available from each patient. Finally, we applied this approach to a set of archival tumor samples where constitutional DNA was not available, but the tumor blocks contained adjacent normal tissue. LumosVarMulti detected variants that were not detected by standard somatic variant callers that used the adjacent normal as a reference, including several known cancer hotspot mutations.
Short Abstract: Characterization of within-host HIV genetic diversity is required for selection of effective treatment. Because of the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on viral sequences by single genome amplification (SGA) for genetic and phylogenetic analysis has surged, necessitating automation of the assembly process. Here we present HIVA, a pipeline for the assembly of raw sequencing reads into annotated HIV genomes, capable of reconstructing thousands of genomes within hours. Specifically, a quality-control check trims Illumina adapters and low-quality bases, followed by multiple assembly steps that combine two widely used classes of algorithms: de Bruijn graph (DBG) and overlap-layout-consensus (OLC). DBG is performed by SPAdes for initial de novo assembly of contigs, which are then aligned via BLAST to a database of HIV genome sequences to select the closest reference. OLC is performed by MIRA in two steps. First, a modified version of the closest reference is generated by alignment to the contigs produced by SPAdes. Second, the modified reference is used as a scaffold for the final reference-guided assembly of the initial trimmed reads. HIVA is constructed with Snakemake, a workflow management system, allowing reproducible data analyses and scalability to cluster and cloud environments.
Short Abstract: Tumors are characterized by a variety of cell types called sub-clones that dynamically interact among themselves and with their microenvironment. This results in heterogeneous cancer phenotypes that can have increased tumor volumes and growth rates depending on their sub-clonal compositions. Perturbing specific sub-clones can significantly affect tumor developmental dynamics, from accelerated tumor proliferation to even collapse. To elucidate the dynamics of sub-clonal interactions and predict the effects of targeted therapeutic interventions, we developed a computational framework to infer dynamic mathematical models of sub-clonal interactions. Based on a genetic programming approach, our method utilizes high-performance computing to infer de novo complete models from data, simulate them through time, and evaluate their resulting tumor volume predictions in comparison to experimental data. We inferred sub-clonal interaction models that can accurately recapitulate tumor volume and clonal frequency data from experiments with genetically engineered human breast cancer cells. Importantly, the reverse-engineered models can predict the results of novel experiments and perturbations, and hence determine the optimal sub-clones to target for therapeutic intervention causing the tumor to stabilize or even collapse. These results provide insight into potential therapeutic targets and into the underlying complexity of tumor sub-clonal dynamics, toward the development of personalized treatments.
Short Abstract: Approximately a third of eukaryotic genes encode membrane or secreted proteins. Most of these proteins, characterized by hydrophobic signals, are co-translationally recognized and translocated to the ER by the signal recognition particle (SRP). However, the underlying sorting mechanism remains elusive. The specificity was proposed to originate from the hydrophobic signal of the nascent peptide. Two recent studies in yeast and E. coli suggested that a cis-signal in the mRNA could serve as a pausing signal to promote SRP recognition for translocation. Moreover, the lack of sequence similarity among the hydrophobic signal sequences and their downstream regions suggests that RNA structure plays a role. To test this hypothesis, we performed a transcriptome-wide in vivo structure probing assay using selective 2'-hydroxyl acylation (SHAPE) in yeast and human cells. Our preliminary data showed a structured region downstream of the transmembrane domain region that could facilitate SRP recognition and, in turn, translocation. We are currently identifying this downstream structured region (mRNA structure motif) and will use mutation and rescue experiments to verify the function of the motif. This work will shed light on mRNA co-translational targeting and on the role of mRNA structure in translation regulation.
Short Abstract: AIMS. Tissue fibrosis, a major health problem that affects all vital organs, is estimated to be involved in 45% of all deaths in developed countries. We have previously shown that cardiac fibroblasts display a unique cardiogenic identity, important for cardiac pathological remodelling. We therefore asked whether fibroblasts from different organ systems would display a molecular signature unique to their organ of origin. METHODS. Transcriptomic profiles were obtained from adult mouse tissue fibroblast samples (heart, kidney, liver, lung, skin and tail), using two independent technologies: microarray and single-cell transcriptomic sequencing. Gene ontology analysis, bioinformatics analyses and visualisation were performed in R, MeV and Bioconductor. RESULTS. Here we demonstrate that fibroblasts isolated from the adult mouse skin, lung, kidney, gonad and liver retain a molecular positional and organ signature, similar to heart fibroblasts, and that this identity is most likely a remnant from embryonic development. This information opens novel opportunities for the treatment of fibrotic diseases in an organ-specific manner. CONCLUSIONS. Systems analysis coupled with human-driven data exploration can be a powerful tool for understanding the development and diseases associated with fibroblasts.
Short Abstract: Valproic acid (VPA) is a widely prescribed anti-epileptic drug whose side effects include steatosis. This study aims to foster better understanding of the mechanism of VPA-induced hepatotoxicity using a systems toxicology approach. Male mice (C57BL/6, 12 weeks old) were orally dosed with VPA (0, 50, 150, 500 mg/kg; vehicle: 0.5% methyl cellulose) and their organs (brain, heart, lung, liver, kidney) were sampled 2, 4, 8, or 24 hr post-administration for microarray analysis. The raw data were normalized using the Percellome method [1], and the genes judged to be differentially expressed in a time- and dose-dependent manner were used for pathway enrichment analysis by specific gadgets on the Garuda platform [2]. Our results suggested that VPA altered the expression patterns of PPARα and ER target genes in liver. To elucidate the linkage of events suggested by the gene expression data generated by the Percellome project study, a Boolean network of the crosstalk among PPARα, ER and SREBP-1c signaling was constructed. The simulation result supported our hypothesis that VPA administration may contribute to a decreased PPARα/SREBP-1c activation ratio via ER ligand reduction. [1] BMC Genomics 7.1 (2006): 64. [2] Nature Reviews Genetics 12.12 (2011): 821-832.
Short Abstract: Knowledge discovery is lagging behind the generation of genomic data. To facilitate the transformation of massive and complicated genomic data into novel knowledge, we developed the Awsomics framework, a combination of Amazon Web Services (AWS), a Shiny web server, and various data analysis methods. Awsomics is supported by an archive of curated genomic data, including standard gene annotation and published experimental data. For example, the archive stores over 3,000 sets of GWAS analysis results and over 250 transcriptome data sets. All experimental data are curated to ensure high quality and a consistent format. The data are then made accessible to users through generic or project-specific online apps. For example, GSA Genie (http://gsagenie.awsomics.org) runs gene set enrichment analysis on >2.2 million predefined gene sets. Project-specific apps are designed around the particular needs of individual projects and allow users to jointly analyze project-specific and public genomic data. Awsomics is highly portable: a new instance can be set up by simply cloning its image. It is also highly expandable: apps can be developed independently while the data archive is shared by all apps. Awsomics will enable researchers without strong analysis skills to efficiently exploit genomic data. Awsomics is available at http://awsomics.org.
Short Abstract: The development of next-generation sequencing technologies has made it easy to sequence large genomes, producing huge numbers of short reads (small fragments of DNA). Despite the many alignment tools that have been developed, mapping short-read datasets to a reference genome, a crucial step in genome analysis, remains a challenge. In this study, we developed a short-read alignment program, BWTaligner, based on Burrows-Wheeler Transform compression, exact matching, and inexact matching. We tested it on paired-end read data simulated from chromosome 9 of the rice genome to compare alignment and single-nucleotide polymorphism (SNP) calling between our aligner and BWA, a widely used alignment program. The results showed that BWA achieves higher recall and F-score, while BWTaligner has better precision at high depth of coverage.
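The exact-matching step of a Burrows-Wheeler aligner can be illustrated with a minimal backward-search sketch. This is not BWTaligner's code: the function names, the naive suffix-array construction, and the uncompressed occurrence table are assumptions chosen for readability, and inexact matching is not covered.

```python
def build_index(reference):
    """Build a toy BWT index (suffix array, first-column offsets, occurrence table)."""
    text = reference + "$"                      # "$" sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)      # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0                        # first[c]: # of characters smaller than c
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}        # occ[c][i]: c's in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def exact_match(read, index):
    """Return sorted 0-based reference positions where `read` occurs exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)                         # half-open suffix-array interval
    for ch in reversed(read):                   # extend the match one base leftward
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("ACGTACGT")
print(exact_match("ACG", index))   # [0, 4]
print(exact_match("TT", index))    # []
```

A production aligner would sample the occurrence table and suffix array to save memory; the full tables are kept here only to make the interval update easy to follow.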
Short Abstract: Phosphorylation is the most common post-translational modification (PTM) in the eukaryotic proteome. Despite its importance, proteome-wide understanding of phosphorylation is limited, largely due to experimental difficulties and the transient nature of PTM. Phosphorylation site predictors have been developed and utilized as a practical alternative to deal with this issue. Here, we present PHOSforUS, a kinase-independent phosphorylation predictor based on biophysical properties. To acquire reasonable predictive performance, we applied two major changes: first, we divided site-specific ('vertical') and context-dependent ('horizontal') information into two separate layers, which roughly correspond to kinase-substrate interaction and conformational configuration, respectively. Also, we further divided serine/threonine phosphorylation sites into two distinct subclasses, one of which is associated with the well-known +1 proline residue sequence motif. Along with other technical modifications, this approach significantly improved predictive performance, which is now comparable or superior to currently available predictors. As it does not rely on external information such as sequence similarity, PHOSforUS is both faster and more robust when applied to novel sequence datasets. We interpret PHOSforUS's effectiveness as being due to the importance of the substrate's conformational equilibrium as the underlying biophysical mechanism, thereby increasing our understanding of protein phosphorylation.
Short Abstract: RNA-binding proteins (RBPs) play an important role in gene expression regulation. Study of RBP binding preferences can provide new insights into various cellular processes. Accumulating data show that RNA recognition requires identification of both the sequence and the local structure attributes of the binding sites. While high-throughput experimental methods have provided vast data on RBP target sequences, a major limitation of most high-throughput RNA binding experiments is that they do not capture the RNA structural context of the RBP binding sites. We developed the SMARTIV web server for de novo discovery and visualization of combined sequence and structure motifs from in vivo RNA binding data. SMARTIV combines the RNA sequence and secondary structure information in a single representation. Users have the option to select from several folding models for RNA secondary structure prediction. SMARTIV uses ranked CLIP data to extract top-enriched motifs and assigns p-values to predicted motifs. The web-server output contains extensive information for further analysis of the results. The server was tested extensively on RNA binding data for different RBPs generated from different types of CLIP technologies. SMARTIV demonstrated consistent and accurate results and the fastest running times. The server is user-friendly and freely accessible via http://smartiv.technion.ac.il/.
Short Abstract: Ovarian cancer has a very high mortality rate, largely due to relapse of chemoresistant disease after standard Platinum-Taxane (Pt/Tx) chemotherapy in >80% of patients. Thus, the focus in ovarian cancer research has recently shifted towards searching for new targeted/personalized therapeutic strategies. Since various tumors rely on mitochondrial function, inhibitors of mitochondrial function, including the anti-diabetic drug Metformin, are being tested as possible therapeutics for various cancers. Our recent investigation of the TCGA and ICGC data revealed that cell cycle regulation by a mitochondrial regulator, Dynamin Related Protein 1 (Drp1), underlies the poor post-chemotherapeutic outcome of certain ovarian cancer patients. Drp1 is an emerging regulator of tumorigenesis and has been implicated in various cancers. We designed a Drp1-Based-Gene-Expression-Signature (DBGES) that specifically identifies high-grade serous carcinoma (HGSC) patients with poor survival. The goal of this project is to develop DBGES as a tool to be used on HGSC patient biopsy gene expression data to identify patients who may respond better to mitochondria-based targeted therapy than to standard Pt/Tx chemotherapy. In this work, we sequenced ovarian cancer patient samples, matched pre- and post-chemotherapy. We used various machine learning techniques to classify patients who would respond better to an alternative mitochondrial therapy regimen.
Short Abstract: Though most genes in mammalian genomes have multiple isoforms, it remains debated whether these isoforms are all functional and to what extent they increase the functional repertoire of the genome. Classically, a molecular trait is only considered functional when sufficient experimental evidence supports the trait’s necessity. For alternative splicing, support for splice isoform function must derive from isoform depletion experiments. Here we present a literature-based analysis of experimental evidence for functionally distinct splice isoforms (FDSIs), that is, cases in which a gene has multiple splice isoforms that are necessary for the gene’s overall function. We established a curation framework for evaluating experimental evidence of FDSIs, and analyzed over 700 human and mouse genes, biased towards genes that are prominent in the alternative splicing literature. Despite our bias, we found experimental evidence in agreement with the classical definition for functionally distinct isoforms in fewer than 10% of the curated human and mouse genes. Furthermore, many of these isoforms could not be mapped to a specific isoform in Ensembl, a database that forms the basis for much computational research. Our results imply a lack of support from the experimental literature for claims that alternative splicing increases the functional diversity of the genome.
Short Abstract: RNA-binding proteins (RBPs) play a critical role in the regulation of alternative splicing (AS), a prevalent mechanism for generating transcriptomic and proteomic diversity in eukaryotic cells. The dysregulation of AS can cause diseases including cancer. Studies have shown that AS can be regulated by RBPs in a binding-site-position dependent manner. Depending on where RBPs bind, splicing of an alternative exon can be enhanced or suppressed. Here, we conducted RBP motif enrichment analysis of alternative splicing events associated with skin carcinogenesis following arsenic exposure. Using rMATS v3.2.5 (http://rnaseq-mats.sourceforge.net/), we identified 825 exon skipping events in human keratinocytes exposed to arsenic (100nM sodium arsenite) for 7 weeks (As+) compared to control samples (As-) from 100×2 bp RNA-seq. rMAPS2 (http://rmaps.cecsresearch.org/) was used for the analysis of binding motif enrichment near the exon skipping events. At p-value < 0.001, we identified 46 RBPs whose binding motifs were enriched near these exon skipping events. Interestingly, the KHDRBS2 binding motif ([AG]ATAAA[AC]) was enriched within upstream introns of up-regulated exons while it was enriched within downstream introns of down-regulated exons. Our results provide a set of RBPs with a potential alternative splicing regulatory role in early stages of arsenic-induced skin cancer.
Short Abstract: It is increasingly clear that cell-population level behaviors, such as disease progression or drug response, cannot be determined purely from bulk measurements. Rather, it is often rare subpopulations of cells that, individually or through mutual interactions, underpin these critical behaviors. With the advent of single-cell technologies, it is now possible to identify the functionally relevant components of such cellular-heterogeneity by profiling a large range of conditions. However, obtaining a large number of samples can be difficult, particularly for scarce or non-renewable patient tissue, and experiments can be costly. Thus, a critical question is which and how many samples should be extracted from a cellular population to be confident that its cellular heterogeneity has been well captured. Here, we present a data-driven framework to estimate the sampling depth required to reliably profile heterogeneity in prospective experiments given a “reference” collection of existing specimens and an experimental scheme to sub-sample them. Additionally, we demonstrate how identifying dominant sources of experimental variation can help identify efficient sub-sampling schemes. These approaches are illustrated in the context of image based studies in patient tissue and patient derived xenografts.
Short Abstract: The Mucuna bean belongs to the Fabaceae family, subfamily Papilionaceae, which includes approximately 150 species of annual and perennial legumes. The seed has many reported medicinal benefits, including antidiabetic, anti-snake venom, antispasmodic, anti-inflammatory, antipyretic and antidepressant properties, and has shown promise in managing Parkinson's disease. We report the first genomic sequencing and annotation of the seed of M. pruriens. The M. pruriens seed was sequenced using Illumina HiSeq in a 2×150 bp configuration. CLC Genomics Workbench 10.0.1 was used for the assembly. The minimum contig length was set to 5000 to better resolve repeats. Gene annotation was done using MAKER-P cctools; expressed sequence tags (ESTs) were obtained for Cajanus cajan, Glycine max and Vigna angularis, and protein homology evidence for the Fabaceae family was obtained from UniProt. Protein-encoding genes and tRNAs were predicted by the SNAP and tRNAscan-SE programs, respectively. Comparative genomics analysis was performed using OrthoVenn to compare M. pruriens, M. truncatula, A. lyrata, A. thaliana and G. max. The genome of M. pruriens is 411 Mbp (G+C: 31.4%). A total of 718 putative tRNAs and 63,178 protein-coding genes were identified. The genome has 284,155 coding sequences. All the compared species shared 8,249 orthologous clusters. The NCBI accession number for the deposited genome of M. pruriens is QJKJ00000000.
Short Abstract: Single-cell RNA-seq (scRNA-seq) is increasingly used in the study of organ development, function, and disease, and is ideally suited to identify tissue progenitor cells, their differentiation trajectories, and molecular networks controlling them. As scRNA-seq data becomes increasingly complex, advanced computational skills are required to process the data, impeding the accessibility of the data for physicians and research investigators. To address this challenge, we developed SVIA, a Shiny web app for online visualization and interactive analysis of large-scale scRNA-seq data from tissue development and disease related studies, providing easy access and visualization of gene and signaling pathways in thousands of cells on the web, and providing on-demand interactive analytic functions (e.g., cell type and signature identification) and statistical output (e.g., differential expression, gene and cell type correlation) based on user queries. The app was developed using R, facilitating its integration with sophisticated statistics, visualization, and scRNA-seq data analysis tools in R. We validated the app using scRNA-seq data from normal mouse lung development (embryonic day 10.5, 16.5, 18.5, postnatal day 1, 3, 7, 10, 14, 28) and mouse and human lung cancer and disease cells. SVIA will be available through CCHMC LGEA web portal. (Supported by NIH LungMAP and PCTC grants).
Short Abstract: Investigators typically require a ranking of biological features (e.g. genes) by their differential abundance between two populations. In RNA-Seq for example, we wish to compare the quantified values for tens of thousands of genes across a wide spectrum of expression intensities. A naive ranking by fold-change leads to several issues. The division-by-zero issue happens when the change is from 0 to a positive quantity. This problem is usually dealt with by using a pseudo-count of 1. Another issue is that fold-changes from smaller numbers tend to dominate the top of ranking lists in case of discrete data like RNA-Seq, leading one to question whether a change from 1 to 2 (fold change of 2) is to be considered more significant than a change from 100 to 190 (fold change of 1.9). We systematically study this issue at both theoretical and empirical levels and conclude that in RNA-Seq data the use of a pseudocount of one is highly sub-optimal. We introduce foundational mathematics in terms of an axiomatic framework to enable the systematic exploration of the ranking problem. Additionally we demonstrate how pseudocounts integrate the advantages of the two standard approaches of ranking based on ratios and differences.
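The effect of the pseudocount on a fold-change ranking can be seen in a small sketch. It does not reproduce the axiomatic framework described above; the gene names, counts, and pseudocount values are hypothetical and chosen only to show that a small pseudocount behaves like a ratio-based ranking while a large one behaves more like a difference-based ranking.

```python
import math


def log2_fold_change(before, after, pseudocount):
    """log2 ratio after adding the same pseudocount to both values."""
    return math.log2((after + pseudocount) / (before + pseudocount))


def rank_genes(counts, pseudocount):
    """Return gene names sorted by decreasing absolute log2 fold change."""
    return sorted(
        counts,
        key=lambda gene: abs(log2_fold_change(*counts[gene], pseudocount)),
        reverse=True,
    )


# Hypothetical (before, after) counts: a 0 -> 3 change, the 1 -> 2 change
# and the 100 -> 190 change mentioned in the abstract.
toy_counts = {"gene_zero": (0, 3), "gene_low": (1, 2), "gene_high": (100, 190)}

for pseudocount in (0.01, 1, 20):   # must be > 0 because one count is zero
    print(pseudocount, rank_genes(toy_counts, pseudocount))
# 0.01 ['gene_zero', 'gene_low', 'gene_high']
# 1 ['gene_zero', 'gene_high', 'gene_low']
# 20 ['gene_high', 'gene_zero', 'gene_low']
```

The three pseudocounts give three different orderings of the same data, which is why the choice of pseudocount deserves more attention than the default of 1 usually receives.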
Short Abstract: HIV tends to generate many potentially drug-resistant mutants within infected patients, and identifying them is important for efficient clinical treatment. Viral populations from patients can therefore be sequenced using NGS approaches to identify the different HIV clones through analysis of the variations in the virus sequences. This is a nonstandard clustering problem on large sequencing datasets, because pairwise similarity measures are missing between non-overlapping reads. Although some tools are available to accomplish this task, they fail to work on large datasets or are poorly reproducible due to the underlying statistical subsampling approach. Here, we present a new computational technique, called HaploVir, for the identification of HIV haplotypes. It relies on a variation-based graph model that is explored by two path reconstruction algorithms: MaxFlow, based on the maximum flow of the graph, and MaxPaths, a greedy approach that finds the maximal augmented paths. We perform an experimental analysis on both simulated and real data (from our published clinical investigations) to assess the performance of HaploVir. Results show that this approach is very promising for inferring HIV clones in clinical studies, providing fast, robust and reproducible results. HaploVir has been implemented in Python and its source code is freely available at https://bitbucket.org/bereste/haplovir.
Short Abstract: A number of skin diseases are of immune origin. Although quite different in their signs and symptoms, they are known to share a number of pathways that are dysregulated in the disease state. For such immune-mediated skin diseases, human clinical transcriptomic signatures could be used to elucidate their unique and/or shared mechanisms, potentially accelerating therapeutic development for these diseases. Stacked denoising autoencoders (SDAEs) have been hypothesized to improve feature selection for machine learning applications by 1) reducing dimensionality and 2) decreasing effects resulting from random noise. Here we designed a study using real-world clinical transcriptomics datasets consisting of 1,376 skin samples from 33 studies from the NCBI Gene Expression Omnibus across three immune-mediated skin diseases: psoriasis, atopic dermatitis and acne. By training SDAE models to capture the intrinsic structures of our curated transcriptomic datasets, we find that the transcriptomic signatures learned by SDAEs gave supervised machine learning models greater power to differentiate between lesioned and normal samples than principal component analysis (PCA). Moreover, we demonstrate that SDAEs facilitate transfer learning between skin diseases with shared mechanisms and help counter batch effects among multiple studies in a meta-analysis.
Short Abstract: Compared to other agriculturally important plants, little is known about Cannabis. This includes a lack of knowledge regarding the soil microbiomes associated with the plants. However, soil microbiomes have been associated with plant growth and development, as they not only contribute to the nutritional requirements of the plants but are also believed to aid the overall immune system of the plants. In this study, a 16S rRNA community analysis approach was used to begin to characterize the soil microbiomes of Cannabis. To better understand how these communities correlate with plant characteristics, a range of soil samples from varied plants were collected in replicate and sequenced. To determine how bacterial consortia varied spatially, samples were taken from the bulk soil, rhizosphere, and rhizoplane of each plant. To date, analysis of microbial communities has pointed towards shifts in community composition between different spatial locations of the soil, and also between plants displaying a range of attributes, including general health, sustainability and pest resistance, as well as between different cultivars and phenotypes. This study is generating data that will establish a baseline for the general soil microbiome of Cannabis and provide insight into microbial involvement in the growth and development of the plant.
Short Abstract: Objectives: RNA-Seq and microarray data have been used to screen lncRNAs for diagnostic biomarkers in cancer. However, no reports have used data from these two platforms together to validate each other. This study aimed to identify lncRNA biomarkers using RNA-Seq and microarray data separately, then validate the results from each platform on the other. Methods: Lung cancer datasets were obtained from GEO (n = 287) and TCGA (n = 216). Microarray datasets from the same platform were merged and batch effects were removed. Feature selection was then used to find the top 20 lncRNAs in each of the Affymetrix and TCGA datasets for further analysis. A Bayesian network classifier was trained on the top 20 Affymetrix lncRNAs and validated on the TCGA and Agilent datasets. A similar procedure was applied to TCGA and validated on the Affymetrix and Agilent datasets. Results and Conclusion: With the TCGA dataset as training data, the sensitivity was 0.991 and the specificity was 0.954; the sensitivity and specificity were 0.949 and 0.964 for Affymetrix validation and 0.600 and 0.950 for Agilent validation. With the Affymetrix dataset as training data, the sensitivity was 0.971 and the specificity was 0.991; the sensitivity and specificity were 0.991 and 0.880 for TCGA validation and 0.850 and 0.900 for Agilent validation. The biomarkers were further subjected to functional and prognostic analyses.
Short Abstract: Recent advances in single-cell profiling, such as droplet-based sequencing and methods for pairing transcriptomics with cellular physiology, have massively increased the scale and applicability of single-cell genomics. However, these methods are susceptible to technical artifacts, including off-cell type contamination, where mRNA from multiple cells is present in a single-cell sample. Here, we describe approaches for identifying and correcting off-cell type contamination that make use of cell type-specific marker genes. Re-analyzing data from patch-seq experiments, where scRNAseq is combined with patch-clamp electrophysiology, we find that these data show widespread evidence for multi-cell contamination. We suggest that this off-cell type contamination is due to the passage of the micropipette through the processes of other cells before and after collection of cellular mRNA. After correcting for these contamination artifacts, we show an improved correspondence between cellular transcriptomes and physiological features. In droplet-based scRNAseq data, we show that our method can identify droplets containing multiple cells (doublets) and pervasive mRNA contamination from lysed or dead cells. Removing suspected doublets improves the fidelity of pooled cell type-specific transcriptomic profiles across samples and between datasets. Our work highlights the need for detecting and accounting for multi-cell contamination as part of routine quality control for single-cell analyses.
Short Abstract: The intergovernmental research infrastructure ELIXIR unites >20 national Nodes in its aims to integrate life science resources across Europe. ELIXIR-Luxembourg (ELIXIR-LU) is strongly focused on the management and analysis of translational human data for personalised medicine. The lack of adaptable systems that collect, integrate, and manage access to the many and varied types of health data effectively remains a hurdle for data FAIRification in this sector. However, opening and sustaining (controlled) access to biomedical data for re-use in research without major hassles is critical for understanding and treating human diseases. ELIXIR-LU operates independently of a particular disease but our link to the Luxembourg Centre for Systems Biomedicine and our national mission open up collaborations centred around Parkinson’s, other neurodegenerative, and rare diseases. We discuss recent and ongoing technical work, including our clinical/translational medicine data catalog service (TMDC), our core services for clinical data hosting, a ready-for-deployment Beacon to support targeted searches for rare variants across a network of data hosting sites, and a new Data Information System (DAISY) tool. DAISY aims to assist the hosting and sharing of sensitive human data in the "bookkeeping" required to meet the accountability standards imposed by the new European General Data Protection Regulation (GDPR).
Short Abstract: Predicting the macromolecular targets of small molecules is important for drug discovery to flag off-targets and for drug repositioning. We have devised a target prediction method by representing small molecules with topological descriptors and predicting target proteins with a nearest neighbor estimator based on the Tanimoto coefficient of query molecules to annotated ligands of biological targets. We validated our approach with a dataset of 301,179 drug-protein interactions derived from ChEMBL. Cross-validated results were analyzed by precision-recall analysis yielding excellent performance (average precision, AP: 0.71; 90% chance of ranking the annotated target top-1). However, these results were biased by a large number of similar compounds, e.g. from series of analogs. We therefore devised a UCLUST-inspired clustering algorithm to remove redundancy. Performance dropped substantially (AP: 0.35) on our non-redundant dataset with highly similar compounds removed. However, the annotated target was still ranked top-1 for 60% of all molecules. This indicates an unexploited potential for drug discovery despite the limits imposed by chemical similarity. We confirmed this potential by testing an adenosine A2B receptor modulator and a c-Jun N-terminal kinase 3 inhibitor for inhibition of their predicted target: coagulation factor Xa. Both compounds were found to inhibit factor Xa in vitro.
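A nearest-neighbor target predictor based on the Tanimoto coefficient can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: fingerprints are represented as plain sets of integer feature IDs, each target is scored by its most similar annotated ligand, and the ligand data and target names below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of feature IDs."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def predict_targets(query, annotated_ligands):
    """Rank targets by the similarity of their closest annotated ligand to the query.

    annotated_ligands: iterable of (fingerprint_set, target_name) pairs.
    Returns a list of (target_name, best_similarity), best first.
    """
    best = {}
    for fingerprint, target in annotated_ligands:
        score = tanimoto(query, fingerprint)
        if score > best.get(target, -1.0):
            best[target] = score
    # Sort by decreasing score; break ties alphabetically for determinism.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical annotated ligands (feature-ID sets) and target labels.
ligands = [
    ({1, 2, 3, 4}, "target_A"),
    ({1, 2, 9}, "target_B"),
    ({7, 8}, "target_C"),
]
ranking = predict_targets({1, 2, 3, 5}, ligands)
print(ranking[0][0])   # target_A
```

Because the top-ranked target is simply the one with the most similar known ligand, datasets rich in close analogs inflate performance, which is the redundancy effect the abstract reports.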
Short Abstract: In prostate cancer, stem-like cells have been shown to resist standard treatments, reseed tumor growth, and increase metastasis. However, the current prostate (PS) culture system still generates a heterogeneous mixture of stem-like and progenitor cell types. Characterizing the subgroups within this mixture is therefore a natural step toward identifying cells at different stages of differentiation. Here, we build a pipeline to analyze single-cell RNA-seq data from mixed human prostate progenitor cells in order to characterize prostate stem cells. After filtering for high-quality cells, the pipeline successfully clarified the human prostate epithelial lineage hierarchy. A couple of KRT family genes show significantly different expression profiles across clusters, which is further supported by pseudo-time prediction across the cells. This clarification of the stem cell lineage hierarchy and keratin profiling may provide enhanced opportunities for translational studies that target therapy-resistant cancer stem-like cells.
Short Abstract: Numerous studies have shown that existing whole-genome aligners do not capture several biologically meaningful, high-scoring alignments. Software whole-genome aligners are all based on the seed-and-extend paradigm, with an ungapped extension step between the seeding and dynamic programming steps. Using the genomes of two nematodes (C. elegans and C. briggsae), we show that replacing ungapped extension with dynamic programming (banded Smith-Waterman) increases the number of high-scoring alignment chains by over 3X but also slows down the software considerably (160X). In this poster, we propose Darwin-WGA, the first hardware-acceleration framework for whole-genome alignment, which massively (over 3000X) accelerates the dynamic programming extension using Field Programmable Gate Arrays (FPGAs). We also introduce GACT-X, a novel algorithm to align arbitrarily long genome sequences using small and constant on-chip memory that achieves empirically optimal alignments. Overall, Darwin-WGA achieves a 20X speedup on whole-genome alignments over state-of-the-art software (LASTZ) with improved sensitivity: over 3X more high-scoring chains, aligning 1.9X more base pairs with a 3.8% average score improvement in the 10 highest-scoring chains. We plan to open-source the framework and port the design to the Amazon FPGA cloud, providing a practical and convenient solution to carry out whole-genome alignments at scale and facilitate new genomic discoveries.
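For readers unfamiliar with the banded Smith-Waterman step mentioned above, the following is a minimal software sketch of the general technique, not the Darwin-WGA hardware design or GACT-X. It assumes a linear gap penalty and illustrative scoring values (match 2, mismatch -1, gap -2), and returns only the best local score; the function name and parameters are hypothetical.

```python
def banded_smith_waterman(a, b, band, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, filling only cells with |i - j| <= band.

    Cells outside the band stay at 0, which can never beat the local
    alignment floor of 0 once a (negative) gap penalty is added.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        lo = max(1, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                          # start a new local alignment
                prev[j - 1] + substitution,  # extend along the diagonal
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# A band of 0 allows no gaps; a band of 1 can bridge the extra T.
print(banded_smith_waterman("ACGTTACG", "ACGTACG", band=0))
print(banded_smith_waterman("ACGTTACG", "ACGTACG", band=1))
```

The band width trades sensitivity for work: only about (2 * band + 1) cells per row are filled instead of the full row.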
Short Abstract: As NGS technology becomes more affordable, the number of samples sequenced is increasing exponentially, and so are storage and analysis costs. High-performance computing clusters for data analysis are lessening the workload, leveraging the power of machine learning and clustering algorithms to distill large-scale genomic data. Storage poses a major challenge, as scalability is an issue with legacy data. Post-analysis data sets are difficult and costly to manage, and long retention periods result in rapid growth of stored data. Here we propose an NFS solution that merges Dell-EMC-Isilon and MapR storage to meet both genomic data storage and archival needs. We mount this hybrid NFS for our data analysis, leveraging MapR for MapReduce jobs and Isilon for I/O-intensive jobs. We then move the resulting data to MapR-based databases for query and visualization. Legacy pre-analysis data are compressed and stored in cheaper and faster Isilon storage for future needs. We benchmark the capabilities of both and couple them together to provide a cross-platform solution that gives researchers an affordable, scalable, and flexible genomic data storage system, one that can grow with the needs of genomics research and advancing software technology and is equipped with multi-disk backup and archival techniques.
Short Abstract: In D. melanogaster, polycomb response elements (PREs) are short segments of DNA with a high density of binding sites for transcription factors that recruit PcG and TrxG proteins to chromatin. Each PRE has a different number and topological organization of binding sites for these factors. However, the nature and function of PREs are still not properly understood. We have developed a framework to predict potential PRE regions across the entire D. melanogaster genome using machine learning algorithms. Using predicted TF binding sites available through a newly developed database, combined with other sequence features, we were able to train a random forest model with very high performance in predicting PRE regions. This model could not only distinguish PRE regions from non-PRE regions (precision and recall both greater than 0.99 upon cross-validation), but could also identify a large number of proteins that might contribute to PcG/TrxG recruitment at the PRE locations it predicted. As a result, we have discovered potential PRE regions across the whole fly genome that are consistent with known PRE regions with high confidence, and obtained new insight into the identities of other DNA-binding proteins acting in tandem with PcG proteins to establish functional PREs.
Short Abstract: The Tasar silkworm, Antheraea mylitta Drury, a sericigenous lepidopteran insect of commercial importance, is distributed in various parts of India as ecoraces with diverse phenotypic and quantitative traits, including fecundity, voltinism, cocoon weight and silk ratio. In spite of the superior quality of their silk, the ecoraces face problems such as a gradual decline in numbers and difficulty of identification. Though phenotypic traits such as larval markings and cocoon colour and shape, as well as biochemical analysis, have been used to differentiate silkworm genotypes, the advent of molecular markers has enabled rapid analysis of germplasm and estimation of genetic diversity, which helps in their identification, a prerequisite for preserving their natural biodiversity. As lakhs of tribal people depend on tasar culture for their livelihood, a drive to protect the tasar silkworm ecoraces by exploiting technological and scientific developments is the need of the hour. Hence, in the present investigation, total indoor rearing, biochemical estimations and molecular characterization were adopted. The genomic DNA of seven distinct populations of A. mylitta was extracted and screened for polymorphism. The DNA profiles based on EST, SSR and ISSR markers suggest that these markers could be effectively utilized to identify genetic variability among tasar ecoraces and to design appropriate breeding methods for their improvement.
Short Abstract: Large-scale sequencing projects will produce the genomes of thousands of species. Repetitive elements, known as repeats, make up significant portions of most known genomes. We have developed ReFind, which uses Hidden Markov Models (HMMs) to annotate repeats. ReFind uses the results from our two previous tools, Red (a de novo repeat detection tool) and MeShClust (a sequence clustering tool), to obtain initial repeat clusters for training its HMMs. Unlike MeShClust, ReFind can annotate incomplete and nested repeats. We used three HMM implementations: (i) a k-mer-based HMM, (ii) a score-based HMM, and (iii) a coupled HMM of the k-mer-based and score-based HMMs. After training, the tool runs the Viterbi algorithm to find repeats in the genome. We tested the HMMs on synthetic DNA sequences with different mutation rates while fixing the length at a million nucleotides and the repeat content at 30%, and then calculated the sensitivities. The coupled HMM produced higher sensitivities than either of the single HMMs. For a 30% mutation rate, the sensitivities are 58.8% for the score-based HMM, 93.9% for the k-mer-based HMM, and 94.2% for the coupled HMM. The results on synthetic sequences look promising, and we expect the model to perform well on real data.
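The Viterbi decoding step named above can be sketched for a plain two-state HMM over single nucleotides. This is a generic textbook illustration, not ReFind's k-mer-based, score-based or coupled model; the state names and all probabilities below are made-up assumptions chosen so that the "repeat" state favours the letter A.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]  # back[t][s] = best predecessor of state s at position t
    for obs in observations[1:]:
        row, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy parameters.
states = ("background", "repeat")
start_p = {"background": 0.5, "repeat": 0.5}
trans_p = {"background": {"background": 0.9, "repeat": 0.1},
           "repeat": {"background": 0.1, "repeat": 0.9}}
emit_p = {"background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
          "repeat": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

path = viterbi("CGCGAAAAAAAACGCG", states, start_p, trans_p, emit_p)
print(path)
```

With these toy parameters the run of eight A's is labelled as repeat and the flanking CGCG segments as background.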
Short Abstract: The laboratory rat has been extensively used for biomedical research and multiple strains have been developed that serve as disease models. The Rat Genome Database (RGD, https://rgd.mcw.edu) is the premier site for rat genomic, genetic, and phenotype data. PhenoMiner, a tool within the RGD suite, provides a repository and searchable interface for strain specific quantitative phenotype measurements for a large number of inbred, outbred, congenic, consomic and mutant strains. This data has been standardized and integrated from large phenotyping projects and manually curated data from published literature. This comprehensive data set has facilitated analysis for generation of expected ranges for individual clinical measurements for individual disease model strains as well as expected ranges for strains commonly used as controls. Search and visualization tools provide easy comparisons of these ranges across strains, ages and sexes. The quantitative phenotype datasets and tools at RGD aid researchers as they choose appropriate preclinical experimental models for their studies.
Short Abstract: Protein function prediction is vital for elucidating complex yet essential metabolic pathways. However, accurate protein function annotation is yet to be achieved. Various computational approaches are available for protein function annotation, but they lack specificity and precision. Despite the ever-increasing number of protein sequences in the UniProtKB/Swiss-Prot database (total number of sequences: ~114,759,640; manually curated: ~557,275), only ~71,229 are annotated in accordance with evidence codes EXP, IC, IDA, IMP, IEP, IGI and TAS. Similarly, out of ~20,328 human proteins, only ~15,005 have been annotated. Hence, a fast yet accurate method of function annotation is a pressing need. Here, we present S2F, a meta-server that integrates several popular protein annotation programs and furnishes the user with a set of top consensus functional terms on a single platform. The meta-server approach has enabled the annotation of the complete human proteome, including ~5,323 proteins that were hitherto unannotated. The S2F meta-server will shortly be made freely available at www.scfbio-iitd.res.in/proteins/S2F.jsp/.
Short Abstract: Colorectal cancer (CRC) is one of the most common carcinomas in the world. Development of colorectal tumors is related mainly to diet and a sedentary lifestyle, and recent lifestyle changes in many populations contribute to the rise in CRC cases. Recently, the Colorectal Cancer Subtyping Consortium (CRCSC), a joint effort involving multiple research groups, has identified and characterized four consensus molecular subtypes (CMS). This new classification allows us to better understand CRC and develop new therapy strategies based on each subtype's features. Our aim was to investigate the Microsatellite Instability Immune (CMS1) subtype in order to identify new potential druggable targets. This subtype is characterized by hypermutation, CpG island methylator phenotype, immune infiltration, BRAF mutations, and worse survival after relapse. Transcriptomic data for 456 colon and 167 rectal primary tumor samples were obtained from The Cancer Genome Atlas (TCGA) and used to classify these samples into the consensus molecular subtypes using CRCSC's R package, CMSClassifier. Seventy-four samples were classified as CMS1. Expression data for these samples were normalized using DESeq2's Variance Stabilizing Transformation function, and co-expression modules were identified using the WGCNA package. Four modules were identified, and the eigengenes of each module are being analyzed as potential drug targets.
Short Abstract: The clonal architecture of tumors plays a vital role in their pathogenesis; however, it is not yet clear how this clonality contributes to different malignancies. In this study, using multidisciplinary approaches, we sought to address clonality and mutational intratumor heterogeneity (ITH) in adult T-cell leukemia/lymphoma (ATL), a malignancy caused by human T-cell leukemia virus type-1 infection. We conducted targeted sequencing of provirus integration sites and quantified the clonality of infected cells in 70 clinical samples. To determine the clonal structure through mutation profiles, we also investigated whole-exome sequencing data of tumor and matched normal samples from 71 ATL patients. We translated the observed clonality data into tree data structures and visually described the clonality patterns. The model simply represents the progression status of each person and can be beneficial for predicting the risk of ATL development. Based on mutation profiles, the ATL samples showed a wide spectrum of clonal/subclonal frequencies, ranging from one to nine clusters. All tumors had some heterogeneity, but the degree of heterogeneity was quite variable. Our findings indicate that ATL displays a high degree of ITH and a complex subclonal structure, and suggest that clonal architecture is a useful measure for prognostic and diagnostic purposes.
Short Abstract: Despite the success of convolutional neural networks in multiple biological fields of research, their application has mostly been restricted to cases where classes are known a priori. Unfortunately, this information is not always available, for example when classifying protein localization patterns in images for the Cell Atlas, part of the Human Protein Atlas. So far, attempts at solving this issue have been restricted to methods that assume a certain number of patterns, which is less than ideal when the number of classes is unknown. In this work we present a clustering convolutional neural network approach that is independent of the number of patterns in the dataset, to identify suborganelle structures in cellular images. To test the method, we analyze cellular subnuclear patterns in high-quality immunofluorescent images from the Cell Atlas. Although the nucleus consists of several known substructures, many are hard to distinguish by eye in microscopy images and have so far required manual investigation to identify. We show that the clustering convolutional approach presented here is capable of distinguishing between proteins localized to several different subnuclear patterns and demonstrate how it can be used to find potential candidates for future studies.
Short Abstract: Many studies have described the special form of alternative splicing that produces a circular form of RNA. Although these circular RNAs have garnered considerable attention in the scientific community for their biogenesis and functions, the focus of these studies has been on exonic circular RNAs (circRNAs: donor site and acceptor site are from exon boundaries) and circular intronic RNAs (ciRNAs: donor and acceptor are from a single intron). This focus reflects the relative absence of methods for finding another group of circular RNAs, circular complex RNAs (ccRNAs: either the donor site or the acceptor site is not from exon boundaries), which contain at least one exon and one or more flanking introns. Studies of ccRNAs would serve as a significant first step in filling this void. Here, we present a new computational algorithm that can detect all three types of circular RNAs. We applied our algorithm to a set of RNA-seq data to examine the composition of circular RNAs in the dataset. Surprisingly, our results showed that the new type of circular RNA (ccRNA) was the second most common type of circular RNA, while circRNA was the most common type, as expected.
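The three definitions above can be turned into a small classification sketch. It is only an illustration of the stated definitions, not the authors' detection algorithm: it assumes a single plus-strand transcript with sorted, non-overlapping exons given as inclusive (start, end) coordinates, and a circle described by the genomic span between its two back-splice sites. The function name and the "unclassified" label are hypothetical.

```python
def classify_circle(start, end, exons):
    """Classify a circular RNA spanning [start, end] against one transcript.

    exons: sorted, non-overlapping (start, end) inclusive intervals.
    """
    exon_starts = {s for s, _ in exons}
    exon_ends = {e for _, e in exons}
    # circRNA: both back-splice sites sit on annotated exon boundaries.
    if start in exon_starts and end in exon_ends:
        return "circRNA"
    # ciRNA: the whole circle lies inside a single intron.
    introns = [(exons[i][1] + 1, exons[i + 1][0] - 1)
               for i in range(len(exons) - 1)]
    if any(s <= start and end <= e for s, e in introns):
        return "ciRNA"
    # ccRNA: at least one site is off an exon boundary, yet the circle
    # still contains a whole exon (and therefore some flanking intron).
    if any(start <= s and e <= end for s, e in exons):
        return "ccRNA"
    return "unclassified"


exons = [(100, 200), (300, 400), (500, 600)]
print(classify_circle(300, 400, exons))  # exon boundaries on both sides
print(classify_circle(210, 290, exons))  # inside the first intron
print(classify_circle(250, 400, exons))  # intronic start, exonic end
```

Circles that fit none of the three definitions, such as one lying entirely inside a single exon, are reported as "unclassified" in this sketch.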
Short Abstract: Extracellular surface lipids constitute a protective layer for the plant cuticle against environmental stresses, and are composed primarily of long-chain fatty acids, aldehydes, and hydrocarbons as the hypothesized precursors, intermediates, and end-products of hydrocarbon biosynthesis. To investigate this pathway, we employed a systems approach to query the metabolomes and transcriptomes of silks across a spatio-temporal developmental gradient that also captures the environmental transition as silks emerge from the husks. Metabolomic profiling revealed that metabolome composition is dynamic across encasement status, genotype, and development. Signature lipids, the major contributors to metabolomic variation across genotypes and environments, were selected by partial least squares (PLS) discriminant analysis. These lipids were then used in PLS regression with transcriptome data to identify metabolome-transcriptome associations affecting metabolomic composition. The PLS models explain ~80% of the transcriptomic and >60% of the metabolomic variation. Variable selection using Variable-Importance-in-Projection thresholds and validation via random permutation identified ~200 candidate genes affecting the lipid metabolome. Many of these genes are likely involved in several key but elusive metabolic processes in hydrocarbon accumulation, e.g. aldehyde decarbonylation and extracellular lipid transport. Further analysis is in progress to generate a network that integrates metabolomes with correlation-based gene clusters and will provide deeper insights into the surface lipid biosynthesis network.