Oral Poster Presentations - Sunday, July 10
While the general process of gene transcription is well understood, the mechanisms by which different genes are activated in different conditions or different cell types are not. Transcription must be precisely controlled for proper development and response to differing conditions, and determining exactly which part of the cellular machinery is responsible for changes in expression is an important task in biology. To determine which transcription factors are responsible for very specific conditions, it can be helpful to examine which genes are differentially expressed in similar but slightly different conditions. Here, we consider the problem of taking two closely related differentially expressed gene sets and determining which transcription factors could be responsible for the differences. While identifying transcription factors whose targets are significantly enriched in a set of differentially expressed genes is a common computational task, here we address a subtly but importantly different question: which transcription factors' targets are more significantly overrepresented in one set than in another. We present approaches to rank transcription factors based on their regulation of one set of genes as compared to another and apply them to gene expression sets associated with the Mediator complex, a complex essential for most eukaryotic transcription that may play an important role in differential transcription. We apply our methods to investigate the regulatory differences between CDK8 and CDK19, homologous proteins that function similarly and can alternatively occupy the same position in Mediator. We show that our methods perform substantially better than naïve methods.
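The core comparison can be illustrated with a toy sketch (all gene counts are hypothetical, and this is not the abstract's actual ranking method): score a transcription factor by how much more significantly its targets are overrepresented in one differentially expressed set than in the other, using hypergeometric tail probabilities.

```python
from math import comb, log10

def hypergeom_sf(k, M, n, N):
    """P(X >= k): overlap when drawing N genes from a universe of M containing n TF targets."""
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1)) / comb(M, N)

# Hypothetical numbers: a 1,000-gene universe and a TF with 50 annotated targets.
M, n = 1000, 50
p_A = hypergeom_sf(15, M, n, 100)  # set A: 100 DE genes, 15 are targets
p_B = hypergeom_sf(6, M, n, 100)   # set B: 100 DE genes, 6 are targets
# Naive differential score: orders of magnitude by which A is more enriched than B.
score = log10(p_B) - log10(p_A)
```

TFs with a large positive score are candidates for driving the difference between the two conditions; the abstract's ranking approaches refine this naive comparison.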
Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Functional knowledge can then be transferred from well- to poorly-annotated species between aligned network regions. Network alignment typically encompasses two algorithmic components: a node cost function (NCF), which measures similarities between nodes in different networks, and an alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs, so it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We previously showed, using the then state-of-the-art methods MI-GRAAL and IsoRankN, that combining the NCF of one method with the AS of another can yield a new, superior method. We further confirmed this by mixing and matching MI-GRAAL's and GHOST's NCFs and ASs. Most recently, we introduced a novel AS called Weighted Alignment VotEr (WAVE). When used on top of well-established NCFs of existing methods (such as MI-GRAAL or GHOST), WAVE improves alignment quality compared to those methods.
Ribosome profiling (or Ribo-seq) is currently the most popular methodology for studying translation; in recent years it has been employed to decipher various fundamental aspects of gene expression regulation.
The main promise of the approach is its ability to detect ribosome densities over an entire transcriptome at single-codon resolution. Indeed, dozens of Ribo-seq studies have reported results related to local ribosome densities in different parts of the transcript; nevertheless, the performance of Ribo-seq has yet to be quantitatively evaluated in a large-scale, multi-organism, multi-protocol study of currently available datasets.
Here we provide the first objective evaluation of Ribo-seq at single-nucleotide resolution using clear, interpretable measures, based on the analysis of 15 experiments, 6 organisms, and a total of 712,168 transcripts. Our major conclusion is that the ability to infer signals of ribosomal densities at nucleotide scale is considerably lower than previously thought, as signals at this level are not well reproduced in experimental replicates. In addition, we provide various quantitative measures that connect the expected error rate with Ribo-seq analysis resolution.
The utility of tumor-derived cell lines depends on their ability to recapitulate the underlying genomic aberrations found in primary tumor biology. Here, we analyzed the exome sequences of 25 bladder cancer (BCa) cell lines and compared mutations, copy number alterations, gene expression and drug response to BCa patient samples in The Cancer Genome Atlas (TCGA). We show that the genomic aberrations found in BCa cell lines mimic patient samples, including similar mutation patterns associated with altered CpGs and APOBEC-family cytosine deaminases, activating mutations in the TERT promoter, mutations in known BCa-associated genes (TP53, RB1, CDKN2A and TSC1), and alterations in chromatin-associated proteins (MLL3, ARID1A, CHD6 and KDM6A). We confirmed non-silent sequence alterations in 76 cancer-associated genes. Next, we used PARADIGM to infer pathway activities for cisplatin-treated BCa cell lines based on the cell lines' gene expression and copy number data. We used the inferred pathway activities to build a predictive model of platinum drug response. The model was based on an elastic net regression, whose implicit feature selection identified pathway concepts relevant to cisplatin response. When applied to BCa patient data from TCGA, the model predicted overall response, showing a clear separation in survival between predicted nonresponders and predicted responders in the platinum-treated patient cohort (p=0.05) and no separation in the untreated patient cohort (p=0.62). Together, these data and predictive models represent a valuable community resource to model basic tumor biology and to study the pharmacogenomics of BCa.
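A minimal sketch of the elastic-net idea on simulated "pathway activity" features (hand-rolled coordinate descent; the data, penalty values, and feature names are illustrative, not the study's): the L1 component performs the implicit feature selection described above, zeroing out features irrelevant to the simulated response.

```python
import numpy as np

def elastic_net_cd(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Minimal coordinate-descent elastic net (assumes roughly standardized columns)."""
    n, p = X.shape
    beta = np.zeros(p)
    lam1 = alpha * l1_ratio * n        # L1 penalty -> implicit feature selection
    lam2 = alpha * (1 - l1_ratio) * n  # L2 penalty -> stability
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual without feature j
            rho = X[:, j] @ r
            # Soft-thresholding update for coordinate j.
            beta[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (X[:, j] @ X[:, j] + lam2)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 10))  # 10 simulated "pathway activities" for 80 cell lines
# Only the first two features truly drive the simulated drug response.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(80)
beta = elastic_net_cd(X, y)
selected = set(np.flatnonzero(np.abs(beta) > 0.05))
```

In practice one would use a mature solver (e.g. scikit-learn's ElasticNet with cross-validated penalties), but the sketch shows why the fitted model comes with a built-in short list of predictive features.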
Circular RNAs (circRNAs) are a new class of abundant, non-adenylated, and stable RNAs that form a covalently closed loop. Recent studies have suggested that circRNAs play important regulatory roles through interactions with miRNAs and ribonucleoproteins. Detecting circRNAs by high-throughput RNA sequencing requires non-poly(A)-selected protocols. In this study, we established the use of an exome capture RNA-seq protocol to profile circRNAs across more than 1,000 human cancer samples. We validated our protocol against two other gold-standard methods, depletion of rRNA (Ribo-Zero) and digestion of linear transcripts (RNase-R). Capture RNA-seq was shown to greatly facilitate the high-throughput profiling of circRNAs, providing the most comprehensive catalogue of circRNA species to date. Specifically, our method achieved significantly better enrichment for circRNAs than rRNA depletion and, unlike RNase-R treatment, preserved accurate circular-to-linear ratios. Although the correlation between circular and linear isoform abundance was modest in general, we found strong evidence that the lineage specificity of circular RNAs is due to the lineage specificity of their parent genes. To shed light on the mechanism of circRNA biogenesis, we are investigating the associations of mutations in canonical splicing sites and splicing factors with aberrant formation of circRNAs. Finally, the ratio of circular to linear transcript abundance was explored to give insight into the dynamics between transcriptome stability/turnover and cell proliferation. Overall, our compendium provides a comprehensive resource that could aid the exploration of circRNAs as a new type of biomarker, or as intriguing splicing and regulatory phenomena.
Colorectal cancer is one of the most common forms of cancer and the second leading cause of cancer deaths in the world. While patient-derived xenograft models have emerged as an important tool to study tumor growth, progression and response to therapy, the extent to which they recapitulate the genetic features of the primary tumors is unknown. In this study, we compared colorectal patient tumors and patient-derived tumor xenografts (PDXs) obtained from the same patients to identify recurrently mutated genes and their overlap in colorectal cancer.
We generated patient-derived xenografts from 8 different colorectal cancer tumors. We sequenced the exomes of these tumors, paired germline DNA and PDXs to identify somatic and germline mutations in the 8 patients. We compared somatic mutations and copy number alterations between the tumors and PDXs. We further applied copy number analyses together with somatic allele frequencies to infer tumor purity. Integrating allelic fraction and copy number information also helped us to identify tumor subpopulations.
We identified significant recurrent mutations in the PI3K pathway gene PIK3CA, the ERBB-RAS pathway gene NRAS and the Wnt pathway genes TCF7L2 and APC. We found hotspot mutations in the tumor suppressor gene TP53 and the transcriptional modifier gene SMAD2 in both patients and PDXs. We observed significant subclonal heterogeneity in frequently mutated colorectal cancer genes, both in patient tumors and in PDXs.
Our study demonstrates that tumor-specific PDX models faithfully recapitulate the genetic heterogeneity and clonality in tumors and are viable models for targeted therapies.
In recent years, the rapid expansion of mobile devices, including smartphones and tablets, has created a new trend in personal computing. Personal mobile devices have become convenient tools for daily information retrieval and exchange. However, few mobile applications (apps) have been created to retrieve and display genome annotation information on tablets or smartphones, and no bioinformatics-related mobile application has been developed specifically for the visualization of large-scale NGS sequence data. With the increasing computation and graphics capacities of mobile devices, they are becoming suitable, user-friendly platforms for interrogating large-scale bioinformatic and genomic data. Here, we developed mobile application software to demonstrate the feasibility of visualizing large-scale human cancer gene expression information. We implemented an iOS mobile application (RNA-Seq Viewer) to visualize next-generation sequencing gene expression information from over 2,500 human cancer patients retrieved from The Cancer Genome Atlas (TCGA). Users can select RNA-Seq data from any individual sample across nine cancer types, and the application efficiently displays whole-transcriptome expression information over a human chromosome framework, with easy accessibility and an intuitive navigation interface. Local gene modulation patterns can be inspected thoroughly. In addition, users can visualize their own RNA-Seq data by building a customized dataset. We envision such mobile applications being used in future personalized medicine, serving as an underlying component for easy access to genomic and medical information via cloud infrastructure on various mobile devices.
Duchenne muscular dystrophy (DMD) is a common and devastating genetic disease characterized by muscle wasting. Exon skipping uses small DNA-like molecules, antisense oligos (AOs), that act like stitches to modulate gene products and rescue the mutations. The efficacy of exon skipping at different target positions can vary by more than 20-fold, so the selection of the target site can make the difference between success and failure of clinical trials. However, no effective method has been developed to choose the optimal target site. We propose to develop an in silico (computational) method as a fast, inexpensive, and effective way to guide the screening. We have recently developed such a framework and identified a "DNA-stitch" more than 10-fold better than current clinical trial molecules. We aim to improve it further and identify new drug candidates that can treat the majority of DMD patients with various mutations.
We plan to pursue the following objectives: 1) identify influential features in exon skipping and use bioinformatics techniques to develop an efficient algorithm to predict the exon-skipping efficacy of AOs; 2) improve the efficacy of both single- and multi-exon skipping, extending our framework to predict the efficacy of multiple AOs with a new algorithm that addresses the interaction of random sets of oligos and RNAs; 3) verify the correlation of predicted and actual exon-skipping efficacy in vitro and in vivo; and 4) launch the web software and incorporate community feedback to improve its quality.
Alzheimer's disease (AD) is a common neurodegenerative disease, and age is its main known risk factor. We analyzed the epigenetic mark histone 3 lysine 9 acetylation (H3K9ac) in the human prefrontal cortex of 676 samples from the ROSMAP study. Participants were not cognitively impaired upon study entry. After death, AD pathologies including neurofibrillary tangles were measured, and anti-H3K9ac ChIP-seq experiments were conducted. We identified 26,384 H3K9ac domains in the ChIP-seq data. The number of sequence reads falling into each domain was determined for each sample and normalized by regressing out technical nuisance variables.
We split the dataset into training (n=446) and test (n=230) data. An L1-penalized regression model was fitted on the training data with age at death as the outcome and H3K9ac domains as penalized explanatory variables. Gender was added as an unpenalized covariate. The penalty parameter was determined by maximizing the cross-validated likelihood on the training set. The coefficients of 10 domains were nonzero. This model was used to predict the epigenetic age of the test samples. Predicted epigenetic age showed a moderate correlation of 0.25 with age at death. We defined accelerated aging as the residuals from regressing epigenetic age on age at death and gender. Accelerated aging was positively associated with neurofibrillary tangles (p=0.022).
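The accelerated-aging definition can be sketched in a few lines (toy simulated ages; the gender covariate is omitted here for brevity): regress epigenetic age on chronological age and take the residuals, so that a positive residual means "epigenetically older than expected".

```python
import random
import statistics

random.seed(1)
chron_age = [random.uniform(70, 100) for _ in range(50)]  # chronological age at death
epi_age = [a + random.gauss(0, 5) for a in chron_age]     # toy "epigenetic" age estimate

# Ordinary least squares of epigenetic age on chronological age.
mx, my = statistics.mean(chron_age), statistics.mean(epi_age)
slope = sum((x - mx) * (y - my) for x, y in zip(chron_age, epi_age)) / \
        sum((x - mx) ** 2 for x in chron_age)
intercept = my - slope * mx
# Accelerated aging = residual from this regression.
accel = [y - (intercept + slope * x) for x, y in zip(chron_age, epi_age)]
```

These residuals are, by construction, uncorrelated with chronological age, which is what allows them to be tested against pathology independently of age itself.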
We further discuss accelerated aging in AD and limitations of our study. We also calculate accelerated aging based on DNA methylation from the same samples [Levine et al., 2015] and compare those estimations to the H3K9ac-derived estimations.
Zika virus (ZIKV) is an emerging mosquito-borne flavivirus, first isolated in 1947 from the serum of a pyrexial rhesus monkey caged in the Zika Forest (Uganda, Africa). In 2007, ZIKV was reported to be responsible for an outbreak of relatively mild disease on Yap Island in the western Pacific Ocean. In the past year, ZIKV has been circulating in the Americas, probably introduced through Easter Island (Chile) by French Polynesians. In early 2015, a new outbreak was recognized in northeast Brazil, raising concerns over possible links with infant microcephaly. Providing a definitive link between ZIKV infection and birth defects remains a major challenge. Small noncoding RNAs (small ncRNAs) play important roles in biological processes, mainly regulating post-transcriptional gene expression through mechanisms of translation repression and gene silencing. Some classes of small ncRNA are known to influence viral pathogenesis and brain development. The potential for flavivirus-mediated small ncRNA signaling dysfunction in brain-tissue development provides a compelling mechanism underlying the perceived link between ZIKV and microcephaly. We have assembled a collaborative database, ZIKV-CDB, to help target mechanistic investigations of this possible relationship between ZIKV symptoms and small ncRNA-mediated control of human gene expression, and to identify potential therapeutic targets. The database is under development but already includes predicted miRNAs involved in ZIKV/human-host interaction and is available at http://zikadb.cpqrr.fiocruz.br.
Environmental exposures contribute greatly to human health and disease, yet such impacts have been difficult to quantify. Research on the exposome and the metabolome, driven by high-resolution mass spectrometry, is now moving this frontier forward. The exposome aims to catalog internal doses, or surrogates, of all environmental exposures. The metabolome captures all small molecules, reflecting the biochemical state that serves as deep phenotyping and as the footprint of gene activities. These new data thus supply the missing pillar in understanding gene-environment interactions. To illustrate this emerging paradigm, we used high-resolution metabolomics to study the effect of exposure to the pesticide DDT (dichlorodiphenyltrichloroethane) in a human population and in mouse models.
Archived serum samples of 465 subjects in California from the 1960s, when DDT exposure was at its peak, were used for metabolomics analysis, using a Thermo Q-Exactive mass spectrometer coupled with reverse-phase C18 liquid chromatography. The association of each metabolite feature with DDT was assessed by regression models, accounting for age, BMI and total blood lipids. This metabolome-wide association study (MWAS) was followed by mummichog, our published algorithm for untargeted metabolomics, to perform metabolic pathway and network analysis. Similar analysis was carried out in mouse models and confirmed the significant pathways detected in the human population, including the metabolism of arginine, aspartate, asparagine and fatty acids. This study demonstrates a new methodology for MWAS and reveals the biological effects of DDT exposure.
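An MWAS yields one p-value per metabolite feature, so some multiple-testing control is typically applied before pathway analysis; the abstract does not specify its correction procedure, but a standard choice, Benjamini-Hochberg FDR selection, can be sketched on hypothetical p-values:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Indices of features passing Benjamini-Hochberg FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    passing = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            passing = rank  # largest rank satisfying the BH condition
    return set(order[:passing])

# Hypothetical per-feature MWAS p-values for ten metabolite features.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
hits = benjamini_hochberg(pvals, q=0.05)
```

Tools like mummichog then take the significant feature list (plus the full feature table) as input for pathway and network analysis.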
High-dimensional data are an important part of the tremendous recent growth of human immunology, which benefits to a great extent from controlled longitudinal vaccination studies. We report here our development of a Multiscale, Multifactorial Response Network (MMRN) using data from a herpes zoster vaccine study in humans. Metabolomics, transcriptomics, cytokines and frequencies of cell subpopulations were measured multiple times at the beginning of the study, and antibody response was monitored for up to 6 months. Dimension reduction was performed in two steps: for example, the transcriptome was collapsed into our previously published blood transcription modules, and the modules were further grouped by network clustering techniques. Partial least squares regression, with permutation testing, was used to assess the association between different data types. The resulting MMRN revealed important temporal connections between cytokines, plasma metabolites, blood cell frequencies and gene expression. We demonstrate that the MMRN is highly accurate in predicting biological outcomes. These results also suggest a new paradigm in which gene expression in blood cells is guided by metabolite cues from the plasma.
Concern about the reproducibility and reliability of biomedical research has been rising. A bedrock principle of research conduct is that the samples analyzed are correctly identified and not mixed up during processing, but this has rarely been assessed formally.
Here we studied the prevalence of sample misannotation in a large corpus of genomics studies by comparing meta-data annotations of sex to predictions from expression of sex-specific genes. We identified apparent misannotated samples in 46% of the datasets sampled. Extrapolating beyond our corpus, we estimate that at least 33% of all studies have at least one such mix-up (99% confidence interval). Because this method can only identify a subclass of potential misannotations, this provides a conservative estimate for the breadth of the problem. In an additional set of studies that used samples from the same subjects, 2 of 4 had misannotated samples. These misannotations are likely to result from laboratory mix-ups rather than from errors in collecting subject meta-data.
Our findings emphasize the need for genomics researchers to implement more stringent sample tracking and data quality control steps, and suggest that re-use of published data should be accompanied by careful re-examination of meta-data.
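The detection principle can be sketched with two sex-specific markers (the expression values and the simple decision rule are hypothetical; the study's actual classifier may differ): XIST is highly expressed in female samples, while RPS4Y1 is Y-linked and expressed in male samples.

```python
def predict_sex(xist, rps4y1):
    """Toy classifier: high XIST -> female, high Y-linked RPS4Y1 -> male."""
    return "female" if xist > rps4y1 else "male"

samples = [
    # (sample id, annotated sex, XIST expression, RPS4Y1 expression) -- hypothetical values
    ("S1", "female", 9.1, 0.3),
    ("S2", "male",   0.2, 8.7),
    ("S3", "female", 0.1, 7.9),  # annotation disagrees with expression -> flagged
]
mismatches = [sid for sid, sex, x, y in samples if predict_sex(x, y) != sex]
```

A flagged sample is not proof of a mix-up on its own, but a dataset containing any such discordance warrants scrutiny of its sample tracking.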
Biological systems employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Advances in high-throughput technologies have generated comprehensive datasets measuring multiple aspects of biological regulation. Public databases, such as TCGA (The Cancer Genome Atlas), have been created for depositing diverse types of omics data for public dissemination. However, sample errors, such as sample swapping or mislabeling, are inevitable during data generation and management. Because data errors can lead to wrong scientific conclusions, it is critical to properly match different types of omics data pertaining to the same individual before applying integrative analysis.
We applied a systematic alignment method to TCGA datasets. For example, in the breast cancer (BRCA) dataset of ~1,000 samples, we detected multiple sample errors in different types of molecular data. In each data type, about 3-8% of profiles were not consistent with the labels based on the sample barcodes (16 profiles in microarray, 4 in HM27, 18 in HM450, 9 in GA-miRNA, 84 in HiSeq-miRNA, 31 in CNV). Multi-omics alignment identified sample swapping of the 16 microarray samples and mislabeling of 8 miRNA samples. Errors in gender or sample labeling were also observed in other TCGA cancer datasets (such as glioblastoma, lung, prostate and stomach). These results suggest that sample errors are not a dataset-specific problem but a more global problem in public databases; our approach therefore provides a critical QC step to clean data for integrative analysis of large-scale datasets.
Computational modeling of signaling pathways is crucial for understanding carcinogenesis and predicting responses of cancer cells to drug treatments. However, canonical signaling pathways curated from the literature are seldom context-specific and thus can hardly make precise prediction of anti-cancer drug effects. Association-based data-driven methods have drawbacks such as limited interpretability about underlying mechanisms. Therefore, hybrid methods that integrate prior knowledge and real data for network inference are highly desirable. In this paper, we propose a knowledge-guided fuzzy logic network model to infer signaling pathways by exploiting both prior knowledge and time-series data. Dynamic time warping is adopted to measure the goodness of fit between experimental and predicted data, so that our method can model temporally-ordered experimental observations. Moreover, two regularizers are introduced to penalize the incompatibility of the model with prior knowledge and constrain the number of proteins interacting with each signaling protein. The knowledge-guided fuzzy logic network model is further converted to a constrained nonlinear integer programming problem that can be solved by a genetic algorithm. We evaluated the proposed method on a synthetic dataset and a real time-series phosphoproteomics dataset. The experimental results demonstrate that our model can effectively uncover drug-induced alterations in signaling pathways in cancer cells. Compared with existing hybrid models, we are able to model feedback loops so that the dynamical mechanisms of signaling networks can be uncovered from time-series data. By calibrating generic models of signaling pathways against real data, our method supports precise predictions of context-specific anticancer drug effects.
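The goodness-of-fit component can be illustrated with a minimal dynamic time warping implementation (toy trajectories, not the paper's phosphoproteomics data): DTW scores two temporally ordered series as similar even when one is shifted or stretched in time, which is why it suits noisy, unevenly timed experimental observations.

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric time series."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

observed = [0.0, 0.5, 1.0, 0.8, 0.3]             # measured protein activity over time
predicted = [0.0, 0.0, 0.5, 1.0, 0.8, 0.3]       # same trajectory, delayed by one step
```

Here the delayed prediction incurs zero DTW cost, whereas a pointwise (Euclidean) comparison would penalize the time shift heavily.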
In many fields of science, observations on a studied system represent complex mixtures of signals of various origins. Tumors are engulfed in a complex microenvironment (TME) that critically impacts progression and response to therapy; it includes tumor cells, fibroblasts, and a diversity of immune cells. It is known that, under some assumptions, complex signal mixtures can be separated using classical and advanced methods of source separation and dimension reduction.
In this work, we apply independent component analysis (ICA) to decipher the sources of signals shaping the transcriptomes (global quantitative profiles of mRNA molecules) of tumor samples, with a particular focus on immune system-related signals. We use ICA iteratively, decomposing signals into sub-signals that can be interpreted against pre-existing immune signatures through correlation or enrichment analysis.
Our analysis revealed that signals related to groups of immune cell types can be identified with an unsupervised learning approach in a breast cancer dataset. Using Fisher's exact test, we identified significant groups corresponding to three of five sub-signals: (1) T cells, (2) DCs/macrophages, (3) monocytes/macrophages/eosinophils/neutrophils. The T-cell metagene correlates well with tumor grade (Kruskal-Wallis test p-value=0.003).
Ongoing analysis aims to evaluate the robustness of these groups and possible differences between several types of cancer. We will characterize the degree of immune infiltration in the cancer transcriptome dataset and correlate it with patient survival and tumor characteristics. If successful, the results could be used in cancer diagnosis and therapy, especially immunotherapy.
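The signature-matching step can be sketched as a one-sided Fisher exact test on a 2x2 overlap table (the gene counts below are purely illustrative): genes in a sub-signal versus genes in a T-cell signature, within a shared background universe.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test, P(overlap >= a), for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def hyper(k):
        # Hypergeometric probability of exactly k overlapping genes.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return sum(hyper(k) for k in range(a, min(row1, col1) + 1))

# Hypothetical overlap of a sub-signal's 20 top genes with a 10-gene T-cell
# signature in a 100-gene background: 8 shared, vs ~2 expected by chance.
p_enriched = fisher_exact_greater(8, 12, 2, 78)
p_background = fisher_exact_greater(2, 18, 8, 72)  # overlap exactly at chance level
```

Sub-signals whose overlap p-value survives correction get labeled with the matching immune cell type, as in the three groups reported above.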
As mounting evidence indicates, each cell in the human body has its own genome, a phenomenon called somatic mosaicism. Such somatic variations include single nucleotide variants (SNVs), small insertions and deletions (indels), transposable element insertions, large copy-number variations (CNVs), and structural variations. Although somatic mosaicism may pose functional and pathological implications, there has been no comprehensive estimate of the number and allelic frequency of genomic variations in normal somatic cells in various tissues of the human body, as somatic mosaic variants remain difficult to detect given their limited presence in tissue, at times less than a fraction of a percent. To circumvent this problem, we sequenced the genomes of clonal cell populations derived from single brain progenitor cells to identify genomic variations present in the founder cell and manifested in each clone at 50% allele frequency. Unlike single-cell sequencing, our approach avoids amplification artifacts. For data analysis, we developed a workflow to synergize calls from several variant calling programs: MuTect, SomaticSniper, Strelka, and VarScan for SNVs; Scalpel, Strelka, and VarScan for indels; and CNVnator for CNVs. By applying the workflow to compare germline genomes of different individuals, we performed a data-driven estimation of workflow sensitivity. Using real data for six clones from an individual healthy brain, we detected per clone 200-500 SNVs at >75% sensitivity, 10-30 indels at >40% sensitivity, and 1-5 CNVs. Orthogonal experimental validation revealed a ~100% specificity of the calls generated. Thus, our analysis has revealed extensive somatic mosaicism within the human brain.
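The call-synergizing idea can be sketched as simple consensus voting over per-caller call sets (the variant coordinates below are hypothetical, and the actual workflow is more elaborate than a plain vote): a variant is retained when enough independent callers report it.

```python
from collections import Counter

# Hypothetical SNV call sets keyed by (chrom, pos, ref, alt); caller names from the workflow.
calls = {
    "MuTect":        {("chr1", 101, "A", "G"), ("chr2", 55, "C", "T"), ("chr3", 7, "G", "A")},
    "SomaticSniper": {("chr1", 101, "A", "G"), ("chr2", 55, "C", "T")},
    "Strelka":       {("chr1", 101, "A", "G"), ("chr3", 7, "G", "A")},
    "VarScan":       {("chr1", 101, "A", "G"), ("chr4", 9, "T", "C")},
}

# Count how many callers support each variant.
support = Counter(v for callset in calls.values() for v in callset)
# Keep SNVs reported by at least two independent callers.
consensus = {v for v, n_callers in support.items() if n_callers >= 2}
```

Requiring multi-caller support trades a little sensitivity for a large gain in specificity, which matters when true mosaic variants are rare.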
The human leukocyte antigen (HLA) gene family plays a critical role in many biomedical settings, including organ transplantation, autoimmune diseases and infectious diseases. Because the family contains the most polymorphic genes in humans, clinical applications and biomedical research require highly accurate HLA typing. NGS data have proved capable of achieving high-resolution HLA typing; however, the reads of most platforms are not long enough to cover the two sequential exons, i.e., exon 2 and exon 3, which leads to phasing ambiguities. On the other hand, the long reads of the PacBio system can unequivocally solve the phasing problem, although this advantage can be compromised by high error rates. We therefore propose a typing method that adapts Bayes' theorem to tolerate sequencing errors as well as de-multiplexing errors. We have implemented the method and integrated the HLA typing pipeline into a program named BayesTyping.
High-throughput Next Generation Sequencing (NGS) technologies and reference databases have enhanced our ability to explore diversity at genetic and taxonomic levels. Most off-the-shelf tools for examining genetic diversity implement algorithms that rely on sequence similarity and composition, which can lead to resolution loss in genetic comparisons, particularly at the species/sub-species taxonomic ranks. We present a new version of the Automated Oligonucleotide Design Pipeline (AODP). AODP designs signature oligonucleotides (SO) with specificity and fidelity based on genome or DNA barcode sequence identity, reducing the resolution loss observed with existing approaches. SO designed with AODP highlight regions with taxon or clade-specific polymorphisms that are useful for comparative genomics and provide suitable candidates for the design of primers/probes in diagnostic assays. AODP has several unique features: 1) The AODP algorithm uses a novel packed-Trie data structure, with support for multi-threaded insertion, optimized for DNA nucleotide strings, which scales well to multi-processor architectures; 2) SO can be designed for a large dataset with relatively small memory footprint; 3) Regions of DNA with a single nucleotide polymorphism (SNP) can be optionally ignored to minimize noise caused by sequencing errors during NGS; 4) The specificity of SO can be further validated against large reference databases; 5) SO thermodynamic properties can be calculated for wet-lab experimental conditions; and 6) SO can be directly used for in silico identification of taxa from environmental NGS data.
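The underlying data structure can be illustrated with a minimal, unpacked DNA trie (a sketch of the idea only; AODP's packed-Trie additionally supports multi-threaded insertion and a memory-optimized layout):

```python
class DNATrie:
    """Minimal (unpacked) trie over the A/C/G/T alphabet -- a sketch only;
    AODP's packed-Trie adds multi-threaded insertion and memory packing."""

    def __init__(self):
        self.root = {}

    def insert(self, seq):
        node = self.root
        for base in seq:
            node = node.setdefault(base, {})
        node["$"] = True  # end-of-sequence marker

    def contains(self, seq):
        node = self.root
        for base in seq:
            if base not in node:
                return False
            node = node[base]
        return "$" in node

trie = DNATrie()
for oligo in ("ACGT", "ACGA", "TTGC"):  # hypothetical candidate oligonucleotides
    trie.insert(oligo)
```

Shared prefixes ("ACG" above) are stored once, which is what makes trie-based lookup of candidate signature oligonucleotides against large sequence collections both fast and memory-efficient.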
Most new protein-coding genes originate from old genes by duplication and domain shuffling. It was previously assumed that intergenic DNA could not yield long enough protein products through random mutations. Yet de novo protein-coding genes - derived from intergenic DNA - were recently found in multiple species. These genes are of particular interest as they alone can invent novel protein structures.
We asked how often de novo genes appear, how many exist in any genome and what proteins they make. We built a mathematical model incorporating gene dimensions and genome dynamic processes (mutation, recombination, selection). It predicts that de novo genes can easily be created and that at any time many young de novo genes exist, most being lost quickly. We identified thousands of de novo genes by phylostratigraphy in five genomes and analyzed their biophysical properties using structural bioinformatics. We found that, compared to ancient proteins, de novo proteins are shorter, more disordered, promiscuous (interacting with more proteins and DNA), vulnerable to proteases, and less prone to aggregation. Moreover, de novo proteins lack Pfam domains and may be structurally novel.
Frequent gene creation and a reduced tendency toward aggregation (which is toxic) provide a steady-state population of young de novo genes in the genome. This, along with de novo proteins' propensity to interact, increases the chance that some will use their novel structures (and possibly novel functionalities) to integrate into existing genetic networks and survive for a long evolutionary time.
The assumption of lack of memory, i.e. Markovianity, is common to many models of protein sequence evolution, in particular those based on point accepted mutation matrices (Dayhoff et al., 1978). Nevertheless, it has been observed (Benner et al., 1994; Mitchison and Durbin, 1995) that evolution seems to proceed differently at different time scales, questioning the Markovian assumption. We show that the among-site variability of substitution rates introduces an effective memory that makes protein sequence evolution non-Markovian: each site retains the 'memory' of its own substitution rate, and this influences both the local destiny of that site and the global destiny of the full sequence. We introduce a simple model that describes the occurrence of substitutions in a generic protein sequence, based on the idea that mutations are more likely to be accepted at sites that interact with a spot where a substitution has occurred in the recent past. The model therefore extends the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution. We validate this model by successfully predicting the correlation of substitutions as a function of their distance along the sequence. Despite its simplicity, this model predicts a distribution of substitution rates highly compatible with a gamma distribution, consistent with common wisdom (Yang 1993; Yang et al. 1994).
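The effective-memory argument can be reproduced in a few lines of simulation (parameters are illustrative, not the paper's): when each site keeps its own gamma-distributed rate, having substituted recently is evidence of a high rate, so that site is more likely to substitute again; the pooled process therefore has memory.

```python
import math
import random

random.seed(42)

# Across-site rate variation: each site draws its own rate from a gamma
# distribution with shape alpha (mean rate normalized to 1).
alpha, n_sites, t = 0.5, 20000, 1.0
rates = [random.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def substituted(rate, t):
    """At least one event of a rate-`rate` Poisson process within time t."""
    return random.random() < 1.0 - math.exp(-rate * t)

first = [substituted(r, t) for r in rates]   # epoch 1
second = [substituted(r, t) for r in rates]  # epoch 2

# Conditioning on a past substitution raises the future substitution probability.
p_given_sub = sum(f and s for f, s in zip(first, second)) / sum(first)
p_given_none = sum((not f) and s for f, s in zip(first, second)) / first.count(False)
```

Under a genuinely Markovian (single-rate) model the two conditional probabilities would coincide; the gap between them is exactly the effective memory induced by rate variation.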
Many of the most powerful tools in biology rely on inference of homologs via sequence-based algorithms. However, many loci are invisible to such methods. Those that are short or rapidly evolving, such as orphan genes and small non-coding RNAs, may yield no significant hits, whereas low-complexity or high-copy-number loci may hide in a crowd of false positives. Searching by context bypasses this problem. We present an algorithm for tracing loci between genomes using a synteny map, and test its efficacy by mapping all Arabidopsis thaliana-specific genes to the genomes of eight related species. By reducing the search space and winnowing false positives, we were able to assess the origin of individual orphan genes with unprecedented resolution. We traced many to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species also provides a starting point for future studies. Our pipeline can easily be adapted to contextualize elusive elements such as small RNAs and lineage-specific genes in any species for which reliable synteny maps can be built.
The Protein Data Bank (PDB) is a key source of the general principles that have shaped our understanding of protein structure. Most existing statistical generalizations of protein structure are made either for secondary structures, which are often too generic to satisfy specific design goals, or for protein domains, whose distribution in the PDB is heavily biased by evolution and human sampling and thus not physically meaningful. To fill this gap, we proposed local tertiary motifs (TERMs) as a new fundamental level of structural unit. TERMs are combinations of small, non-contiguous secondary-structure fragments connected by inter-residue contacts. We hypothesized that the PDB contains valuable quantitative information at the level of TERMs, and studied the propensities of TERMs within their corresponding ensembles, i.e., geometrically similar structural fragments from completely unrelated proteins. These TERM propensities are physically meaningful in many contexts. By breaking a protein structure into its constituent TERMs, we can evaluate the accuracy of structure-prediction models and identify poorly predicted regions, via a metric we named the “structure score” that captures the sequence-structure relationships in TERMs. Querying the TERMs affected by point mutations also enables straightforward prediction of mutational free energies. Our performance exceeds or is comparable to that of state-of-the-art methods. Our results suggest that the data in the PDB are now sufficient to enable the quantification of complex structural features, such as those associated with entire TERMs. This should present opportunities for advances in computational structural biology, including structure prediction and design.
High-throughput sequencing has become rapid and inexpensive, providing a vast number of protein and DNA sequences for many genomes. The next challenge for biology is to use this information to gain fundamental insights into biomolecular mechanisms. One important direction towards this goal is structural reconstruction of entire interactomes/biological pathways, with subsequent mapping of genetic variants/mutations onto the corresponding structures. Due to inherent limitations of experimental techniques, most structures of protein-protein interactions (PPIs) have to be computationally modeled (docked). Protein docking pipelines produce a large number of putative docking models, and identifying near-native models among them is a serious challenge. At the same time, a rapidly growing amount of publicly available information from biomedical research provides constraints on the binding mode, which can be essential for the docking. Recently, we showed the potential of basic text mining (TM) for protein docking (Badal VD, Kundrotas PJ, Vakser IA, PLoS Comput Biol, 2015, 11: e1004630). Here we present an extension of the TM tool that utilizes natural language processing (NLP) to analyze residue-containing sentences and their surroundings in the retrieved PubMed abstracts. To generate sentence dependency trees we utilized the Stanford parser, and we used inverse distances between PPI-relevant keywords and residues mentioned in the abstracts to discriminate non-interface residues. We tested WordNet, dictionary look-up and deep-parsing NLP approaches. The procedure was benchmarked on 579 X-ray bound structures of binary protein complexes and validated in docking of unbound protein structures from the DOCKGROUND resource (http://dockground.compbio.ku.edu).
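The inverse-distance scoring idea can be sketched as follows (purely illustrative: the sentence, the dependency tree, the keywords and the scoring rule are made up here; in the real pipeline the tree comes from the Stanford parser):

```python
from collections import deque

# Hypothetical mini-example: a dependency tree represented as an undirected
# graph over the tokens of one parsed sentence:
# "Mutation of residue W62 abolishes binding to the ligand"
edges = {
    "abolishes": ["Mutation", "binding"],
    "Mutation": ["abolishes", "residue"],
    "residue": ["Mutation", "W62"],
    "W62": ["residue"],
    "binding": ["abolishes", "ligand"],
    "ligand": ["binding"],
}

def tree_distance(a, b):
    """Breadth-first-search distance between two tokens in the tree."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nb in edges[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return float("inf")

# Score a residue mention by summed inverse distances to PPI-relevant
# keywords; larger scores suggest the residue is discussed as interface.
keywords = ["binding", "ligand"]

def residue_score(residue):
    return sum(1.0 / tree_distance(residue, kw) for kw in keywords)

score = residue_score("W62")
```

Residues that sit syntactically close to interaction keywords receive high scores, while residues mentioned in unrelated clauses are down-weighted, which is the intuition behind using dependency distances to discriminate non-interface residues.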
Advances in genomic sequencing technology have drastically increased the amount of available sequence data, escalating the need for rapid annotation of genes and protein models. Recently, the Conserved Domain Database curation team has been developing an in-house procedure, the SPecific ARChitecture Labeling Engine (SPARCLE), to study the extent to which protein domain architecture can be utilized to define groups of proteins with similar molecular functions and to derive corresponding functional characterizations. So far, about 3,000 common domain architectures from bacteria have been labelled, and SPARCLE will be made available to the public as a searchable resource. Currently, SPARCLE only considers best-scoring or top-ranked domain hits and is also hampered by imperfect domain annotation. To overcome some of these limitations, we propose an alternative computational procedure for defining clusters of functionally similar proteins that utilizes pre-computed domain annotation from each available source database (COGs, TIGRFAMs, Pfam, and NCBI-curated annotations) for grouping protein sequences, instead of the terse domain annotation currently employed by SPARCLE. This approach provides tunable, fine-grained separation of domain architectures, and has been tested on multiple domain architecture families and several genomic datasets. The quality of the resulting classifications has been examined by curators and validated via analysis of the consistency and uniqueness of clusters. We will also discuss the limitations uncovered to date, and hope that this study will identify suitable approaches for rapid, sustainable, and increasingly accurate functional labeling of protein models predicted from genomic sequences.
As single-cell experimental approaches become increasingly popular, cell-to-cell heterogeneity has emerged as a key factor contributing to variability in gene expression and signaling responses. Mass cytometry (CyTOF) is a new proteomic technology that enables the simultaneous quantification of dozens of proteins in thousands of individual cells. In the context of cancer research, recent applications of CyTOF include the characterization of inter- and intra-tumor heterogeneity and the identification of novel cell subpopulations. However, as already demonstrated for single-cell RNA-seq, the resulting measurements are largely influenced by confounding factors, such as the cell cycle and cell volume.
We present here TRACE, a novel computational approach to quantify this source of variability. TRACE first exploits a hybrid machine learning approach to classify single cells into discrete cell cycle phases according to measurements of established markers. Next, a metric embedding optimization technique creates a one-dimensional continuous marker that tracks biological pseudotime, and individual cells are subsequently ordered according to this pseudotime marker. The resulting cell cycle trajectories across perturbation time points allow us to separate cell cycle effects from experimentally induced responses, enabling the direct comparison of signaling responses through cell cycle progression. Additionally, we show that volume biases can be corrected using housekeeping gene measurements. Our approach, implemented in a simple and intuitive graphical user interface, was used to analyze data from various cell lines subjected to different stimulations. In each case, TRACE was able to separate confounding effects from signaling responses, enabling the unbiased analysis of biological processes.
The relationships between gene expression, cellular functions, and disease phenotypes have been defined largely by transcriptome profiling. Transcriptomic studies rely explicitly or implicitly on the assumption that co-expressed mRNAs share similar biological functions, which guides common data analysis approaches, including gene clustering, co-expression network analysis, and gene set enrichment analysis. However, recent studies report only a moderate correlation between mRNA and protein profiles. Quantitative analysis of multi-level gene expression regulation is conceptually and technically challenging, and a key question — whether protein co-expression or mRNA co-expression better predicts gene co-functionality — remains largely unexplored. Here, we address this question in cancer using rich mRNA and protein profiling data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). We constructed mRNA and protein co-expression networks for three cancer types with matched mRNA and protein profiling data sets. The analyses revealed a marked difference between the wiring of the protein and mRNA co-expression networks. Whereas protein co-expression was driven primarily by functional similarity between co-expressed genes, mRNA co-expression was driven by both co-function and chromosomal co-localization of the genes. Protein-level regulation strengthened the link between gene expression and function for at least three quarters of Gene Ontology (GO) biological processes and ninety percent of KEGG pathways. A web application built on the three protein networks revealed novel gene-function relationships. Protein-level regulation provides essential mechanisms to drive coordinated gene functions, and elucidating these mechanisms requires proteomic measurements.
Motivation: Most RNA-seq data analysis software packages are not designed to handle the complexities involved in properly apportioning short sequencing reads to highly repetitive regions of the genome. These regions are often occupied by transposable elements (TEs), which make up 20-80% of eukaryotic genomes. TEs can contribute a substantial portion of transcriptomic and genomic sequence reads, but are typically ignored in most analyses.
Results: Here we present a method and software package for including both gene- and TE-associated ambiguously mapped reads in differential expression analysis. Our method shows improved recovery of TE transcripts over other published expression analysis methods, in both synthetic data and qPCR/NanoString-validated published datasets.
Availability: The source code, associated GTF files for TE annotation, and testing data are freely available at http://
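The core idea of apportioning ambiguously mapped reads can be sketched with a minimal EM loop (an illustrative simplification, not the published implementation): multi-mapped reads are fractionally assigned to candidate features in proportion to the current abundance estimates, and the abundances are then re-estimated from those assignments.

```python
def em_assign(read_hits, n_iter=50):
    """Minimal EM for read apportionment. read_hits: list of lists; each
    inner list holds the features (genes/TEs) a read maps to. Returns
    estimated read counts per feature."""
    feats = {f for hits in read_hits for f in hits}
    abundance = {f: 1.0 for f in feats}          # uniform starting point
    for _ in range(n_iter):
        counts = {f: 0.0 for f in feats}
        for hits in read_hits:                   # E-step: fractional assignment
            total = sum(abundance[f] for f in hits)
            for f in hits:
                counts[f] += abundance[f] / total
        abundance = counts                       # M-step: update abundances
    return abundance

# Two uniquely mapping reads support TE1, so the ambiguous third read is
# pulled toward TE1 instead of being split 50/50 or discarded.
reads = [["TE1"], ["TE1"], ["TE1", "TE2"]]
est = em_assign(reads)
```

This is the general intuition behind including ambiguous reads rather than ignoring them: unique evidence propagates to resolve the multi-mapped fraction.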
Motivation: Duplication in biological sequence databases has persisted for 20 years. Duplicate records introduce redundancy, delay biocuration, and undermine the accuracy of analyses that depend on sequence properties such as GC content and melting temperature. The rapid growth of data makes purely manual de-duplication nearly impossible, and existing automatic systems cannot detect duplicates as precisely as experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While a mature approach in other duplicate-detection contexts, machine learning has seen only preliminary application to large biological sequence databases.
Results: We developed a supervised duplicate detection method, trained on an expert-curated dataset of over one million record pairs across five organisms, derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed both binary and multi-class models. Both models achieve promising performance: the binary model had over 90% accuracy in all five organisms, while the multi-class model maintains high accuracy and generalises more robustly. We performed an ablation study to quantify the impact of different record features, finding that features derived from metadata, sequence identity, and alignment quality affect performance most strongly; in particular, better measures of sequence similarity drive performance.
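A minimal sketch of the record-pair featurization step (the feature names, weights, and threshold rule below are illustrative stand-ins for the trained models and the 22 published features):

```python
from difflib import SequenceMatcher

def record_features(a, b):
    """A few of the kinds of features one can compute for a pair of
    sequence-database records (names are hypothetical)."""
    return {
        "desc_similarity": SequenceMatcher(None, a["description"],
                                           b["description"]).ratio(),
        "seq_identity": SequenceMatcher(None, a["sequence"],
                                        b["sequence"]).ratio(),
        "same_organism": float(a["organism"] == b["organism"]),
        "length_ratio": min(len(a["sequence"]), len(b["sequence"]))
                        / max(len(a["sequence"]), len(b["sequence"])),
    }

def is_duplicate(features, threshold=0.8):
    # stand-in for a trained binary classifier: a simple weighted score
    weights = {"desc_similarity": 0.3, "seq_identity": 0.4,
               "same_organism": 0.1, "length_ratio": 0.2}
    score = sum(weights[k] * features[k] for k in weights)
    return score >= threshold

rec1 = {"description": "ATP synthase subunit beta",
        "sequence": "MKTAYIAKQRQISFVKSHFSRQ",
        "organism": "Escherichia coli"}
rec2 = {"description": "ATP synthase beta subunit",
        "sequence": "MKTAYIAKQRQISFVKSHFSRQ",
        "organism": "Escherichia coli"}
feats = record_features(rec1, rec2)
dup = is_duplicate(feats)
```

In the actual method the decision rule is learned from the expert-curated pairs rather than hand-weighted, but the feature-vector-per-record-pair representation is the same shape.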
BioStudies is a new database at EBI that aims to address current limitations of the traditional structured data archives available to scientists.
It can accept and store data from new and emergent technologies whose output formats are not supported by current EBI data resources. BioStudies can also link to data in other databases; this is particularly advantageous in multi-omic studies, where data have been deposited in a number of repositories with no central link to tie everything together. Owing to the flexible nature of its data model, BioStudies can also store the supplementary data associated with publications.
A simple tab-delimited text format, PAGE-TAB, has been developed to capture all the information described above. PAGE-TAB allows the submitter to describe files and external links associated with a study, organise information in hierarchies, and attach annotation as appropriate. Extra functionality can be added for specific purposes, such as the compound view in the Data Infrastructure for Chemical Safety (diXa) project.
Studies can be submitted through a new online tool that allows the submitter to enter metadata (including the data release date), upload files directly, link to already-deposited data, and provide associated publication information. The tool also enables users to maintain and edit their own BioStudies records.
As of March 2016, BioStudies contains 578,167 studies that are free to browse, download and reuse. The user interface supports ontology-driven query expansion, enabling powerful searching across thousands of datasets.
Next-generation sequencing (NGS) technologies and data processing pipelines are rapidly and inexpensively producing ever-larger volumes of sequencing data and associated (epi)genomic features for many individual genomes, in multiple biological and clinical conditions, generally made publicly available within well-curated repositories. Answers to fundamental biomedical problems are hidden in these data; yet, efficiently managing and integratively processing them is becoming the biggest and most important “big data” problem of mankind. Multi-sample processing of heterogeneous information can support data-driven discoveries and biomolecular sense-making, such as discovering how heterogeneous genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; yet, it requires state-of-the-art “big data” computing strategies, with abstractions beyond the capabilities of commonly used tools.
We recently proposed a new paradigm for NGS data management and processing, introducing an essential Genomic Data Model (GDM) that uses a few general abstractions for genomic region data and associated experimental, biological and clinical metadata, guaranteeing interoperability between existing data formats. Building on GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomic data; here, we demonstrate its usefulness, flexibility and simplicity of use through several biological query examples. GMQL operates downstream of raw-data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are achieved by using parallel computing on clusters or public clouds. GDM and GMQL are applicable to federated repositories, and can be exploited to provide integrated access to curated data made available by large consortia such as ENCODE, Epigenomics Roadmap, or TCGA, through user-friendly search services.
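For flavor, a schematic query in the spirit of GMQL (operator and attribute names are paraphrased from published examples and may not match the exact GMQL syntax):

```
# select TSS annotations and K562 ChIP-seq peaks, then count peaks per TSS
TSS    = SELECT(annotation_type == 'TSS') ANNOTATIONS;
PEAKS  = SELECT(cell == 'K562') HG19_ENCODE_NARROW;
COUNTS = MAP(peak_count AS COUNT()) TSS PEAKS;
MATERIALIZE COUNTS INTO results;
```

Each statement operates on whole datasets of samples at once, with metadata tracked alongside the region data; this is what allows a few declarative lines to express a multi-sample analysis.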
The most common form of congenital heart disease (CHD), ventricular septal defect (VSD), is also a subfeature of Tetralogy of Fallot (TOF), which comprises the majority of cases of cyanotic CHD. The underlying causes of the bulk of CHDs are still unclear but most probably involve a combination of genetic, epigenetic and environmental factors. DNA methylation is the most widely studied epigenetic modification and, here, we present the first analysis of genome-wide DNA methylation data (MBD-seq) obtained from myocardial biopsies of TOF and VSD patients. We found clear methylation differences between cases and controls, and between patient groups. For TOF, we linked DNA methylation with genome-wide gene expression data (RNA-seq) and found a significant overlap between hypermethylated promoters and down-regulated genes, and vice versa. Interestingly, we found examples of methylation changes co-localized with novel, differential splicing events among sarcomeric genes. In addition to DNA methylation, short non-coding RNAs such as microRNAs have been shown to play a role in gene silencing. Thus, we further analyzed genome-wide small RNA-seq data from TOF patients and controls, and combined the microRNA expression data with previously analyzed gene expression profiles. In summary, our data suggest that DNA methylation and microRNAs likely contribute to the pathogenesis of CHD by modulating disease-specific gene expression profiles.
RNA can fold into secondary and tertiary structures, which are important for the regulation of gene expression. We recently developed a method to perform genome-wide RNA structure profiling in vivo employing high-throughput sequencing techniques, and applied this methodology to Arabidopsis. This method makes it possible to probe thousands of RNA structures at one time in living cells. Hidden RNA codes have been revealed by bioinformatic analyses of our RNA structuromes, including RNA structures related to alternative polyadenylation and splicing [1].
Recently, further analysis of this dataset revealed a correlation between mRNA structure and the encoded protein structure, wherein the regions of individual mRNAs that code for protein domains generally have significantly higher structural reactivity than regions that encode protein domain junctions. This relationship is prominent for proteins annotated for catalytic activity but is reversed in proteins annotated for binding and transcription regulatory activity. We also found that mRNA segments that code for ordered regions have significantly higher structural reactivity than those that encode disordered regions [2].
We also developed a new computational platform, StructureFold, to facilitate the analysis of high-throughput RNA structure profiling data. As a component of the Galaxy platform (https://usegalaxy.org), StructureFold integrates four computational modules in a user-friendly web-based interface or via local installation [3].
1. Ding Y, Tang Y, Kwok CK, Zhang Y, Bevilacqua PC, Assmann SM. Nature. 2014;505:696-700.
2. Tang Y, Assmann SM, Bevilacqua PC. J Mol Biol. 2016;428:758-766.
3. Tang Y, Bouvier E, Kwok CK, Ding Y, Nekrutenko A, Bevilacqua PC, Assmann SM. Bioinformatics. 2015;31:2668-2675.
Long intergenic noncoding RNAs (lincRNAs) are a novel class of regulators that play important roles in many biological processes. Myogenesis is the formation of muscular tissue, particularly during embryonic development, yet little is known about how lincRNAs are involved in skeletal myogenesis. First, to identify functional lincRNAs in myogenesis, we present a novel computational framework that can accurately identify potential functional lincRNAs from millions of assembled transcripts obtained from transcriptome sequencing data during myogenesis. Second, among the many potential functional lincRNAs identified, we functionally validate a novel one, Linc-YY1, transcribed from the promoter of the transcription factor (TF) Yin Yang 1 (YY1) gene. We demonstrate that Linc-YY1 is dynamically regulated during myogenesis in vitro and in vivo. Gain or loss of function of Linc-YY1 in C2C12 myoblasts or muscle satellite cells alters myogenic differentiation and, in injured muscles, has an impact on the course of regeneration. Linc-YY1 interacts with YY1 through its middle domain to evict the YY1/Polycomb repressive complex 2 (PRC2) from target promoters, thus activating gene expression in trans. Altogether, we show that Linc-YY1 regulates skeletal myogenesis and uncover a previously unappreciated mechanism of gene regulation by lincRNAs.
The work described here is substantially supported by General Research Funds (GRF) and a Collaborative Research Fund (CRF) from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China (476113, 473713, 14116014, 14113514 and C6015-14G).
Single-cell RNA-seq has been widely used in biological studies. Removing technical noise and normalizing the sequencing data are critical to fully exploiting the power of this technology. Various methods have been developed for normalization, including FPKM, UQ, DESeq, RUV, and GRM. Among these, RUV and GRM can use ERCC spike-ins to calibrate the technical noise. There is a pressing need to assess the performance of these methods using data with ground truth.
Recently, the NIH Single Cell Analysis Program – Transcriptome Project generated an RNA-seq data set using different amounts of RNA (10 pg, 100 pg and bulk) with ERCC spike-ins. These data provide an unprecedented opportunity to compare different methods on the same data set. After normalization with each method, we clustered the samples, under the assumption that bulk samples are most similar to each other, 100 pg samples are more similar to bulk than 10 pg samples are, and 10 pg samples are the most diverse. We evaluated clustering performance using several statistical indices.
Among the methods that do not use ERCC spike-ins, UQ, DESeq and RUV performed comparably and better than FPKM. RUV and GRM, which calibrate against the ERCC spike-ins, significantly outperformed the methods that do not. Between the two, GRM was more robust to the choice of gene set.
In summary, we present the first systematic comparison of normalization methods for single-cell RNA-seq. We found that using ERCC spike-ins helps remove technical noise and drastically improves clustering results. This study provides guidance for selecting normalization methods when analyzing single-cell RNA-seq data.
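One standard statistical index for this kind of clustering evaluation is the adjusted Rand index, sketched here in plain Python (illustrative; the abstract does not specify which indices were used):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat clusterings (stdlib only):
    agreement on sample pairs, corrected for chance."""
    n = len(labels_true)
    pairs = comb(n, 2)
    cont = Counter(zip(labels_true, labels_pred))       # contingency table
    sum_cells = sum(comb(c, 2) for c in cont.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# ground truth: the three input groups (bulk / 100 pg / 10 pg)
truth = ["bulk"] * 4 + ["100pg"] * 4 + ["10pg"] * 4
good = [0] * 4 + [1] * 4 + [2] * 4      # perfect recovery of the groups
bad = [0, 1, 2] * 4                     # clustering unrelated to the groups
ari_good = adjusted_rand_index(truth, good)
ari_bad = adjusted_rand_index(truth, bad)
```

A normalization method whose clusters reproduce the known sample groups scores near 1, while one that scrambles them scores near 0, which is the sense in which the known 10 pg/100 pg/bulk design provides ground truth.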
Co-expression networks have been a useful tool for functional genomics, providing important clues about the cellular and biochemical mechanisms that are active in normal and disease processes. With recent advances in single-cell RNA-seq technology, it is now possible to zoom in to identify pathways at single-cell resolution. We performed the first major analysis of single-cell co-expression, sampling from 31 individual studies comprising 28,799 samples from 163 cell types. Data from 163 bulk RNA-seq experiments were used as an external control. Using neighbor voting in cross-validation, we found that single-cell network connectivity is less likely to overlap with known Gene Ontology (GO) functions than co-expression derived from bulk RNA-seq (aggregate single-cell AUROC = 0.68, aggregate bulk AUROC = 0.73), which can be attributed to the preferential occurrence of expression drop-outs in single-cell data. Strikingly, we discovered that functional variation within cell types strongly resembles variation occurring across cell types (rs ~ 0.95). The lack of additional variation within cell types suggests that current knowledge in GO cannot readily identify functions occurring in a cell-type-specific manner, and that systematic mining of single-cell data may be required to define novel pathways.
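Neighbor voting can be sketched as follows (a single-fold simplification of the cross-validated, degree-normalized procedure typically used for this task, e.g. in EGAD; the toy network is made up):

```python
def neighbor_voting_auroc(adj, genes, positives):
    """Score each gene by the summed co-expression edge weight to genes
    annotated to a function, then compute the AUROC separating annotated
    (positive) from unannotated genes under those scores.
    adj: dict mapping gene -> {neighbor: weight}."""
    scores = {}
    for g in genes:
        scores[g] = sum(w for nb, w in adj.get(g, {}).items()
                        if nb in positives and nb != g)
    pos = [scores[g] for g in genes if g in positives]
    neg = [scores[g] for g in genes if g not in positives]
    # AUROC = P(random positive outranks random negative); ties count 1/2
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy network: A, B, C form a co-expressed functional module; D, E do not
adj = {
    "A": {"B": 1.0, "C": 0.8},
    "B": {"A": 1.0, "C": 0.6},
    "C": {"A": 0.8, "B": 0.6},
    "D": {"E": 0.9},
    "E": {"D": 0.9},
}
auroc = neighbor_voting_auroc(adj, ["A", "B", "C", "D", "E"],
                              positives={"A", "B", "C"})
```

Dropout-induced loss of co-expression edges directly lowers the votes received by functionally related genes, which is why single-cell networks score below bulk networks in this evaluation.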
SNPs in ligand-binding sites often have important functional consequences, leading to pathogenicity and variation in drug response. Understanding how SNPs may alter the efficacy and metabolism of certain drugs is crucial for the successful implementation of the precision medicine model.
We review 136 unique protein-drug complexes and analyze the non-synonymous SNPs present in the drug-binding sites and the proximal residues. About 90% of these proteins have SNPs associated with fewer than 45% of their binding residues. In total, 2,664 unique SNPs (2,563 missense and 101 stop-gain mutations) are mapped. Frequency or clinical-significance data are available for only 25.49% of these SNPs. Most show very low minor allele frequencies in the populations and are associated with pathogenicity or drug response. Only two of the SNPs are found in the GWAS catalogue. For the remaining SNPs, online tools are used to predict functional effects and conservation. We also analyze the SNP-containing amino acids and the mutations that show significant differences between the binding residues and the rest of the protein sequence. Moreover, the protein-drug complexes with significant differences in the presence of SNPs at binding sites are investigated separately.
This study is an effort towards understanding the possible effects of SNPs on drug response. We have comprehensively analyzed the association of SNPs with drug-binding sites and highlighted the gaps in current knowledge.
Short insertions and deletions (indels) are the second most common type of variation in the human genome. Despite tremendous advances in high-throughput sequencing technologies and computational methods for variant calling from DNA sequence data, accurate detection of indels remains a challenge. Reasons for this difficulty include the over-representation of short indels in regions of low sequence complexity, variability in indel error rates across different platforms, and the lack of good error models for indels.
We have developed an EM algorithm for the detection and genotyping of short indels using aligned sequence reads from multiple individuals. Our probabilistic method models sequence-context-specific error rates to estimate the posterior probability of a variant and the corresponding genotypes. Modeling such error rates is particularly important for indel detection in homopolymer regions.
Using extensive simulations, we assessed the power of our EM algorithm to detect indels as a function of read depth, population allele frequency and indel error rate. Our method was significantly more accurate than the recently proposed population-based method SOAP-popIndel. We subsequently performed a comprehensive comparison of our method against a number of leading variant-calling methods, including GATK HaplotypeCaller, FreeBayes and Platypus, using exome data from the 1000 Genomes Project. Our algorithm showed high sensitivity and a low false-positive rate compared to the other methods. We further demonstrate that our population-based approach enables the discovery of indels that would be impossible to call using data from individuals analyzed separately.
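The population-based EM idea can be sketched for a single candidate site (an illustrative simplification: a real indel caller models sequence-context-specific error rates rather than the single symmetric error rate used here). Genotype priors follow Hardy-Weinberg proportions at the current allele-frequency estimate; the E-step computes per-individual genotype posteriors, and the M-step re-estimates the frequency:

```python
def genotype_likelihood(n_ref, n_alt, g, err=0.01):
    """P(read counts | genotype g), where g = number of alt alleles
    (0, 1, 2), with a symmetric per-read error rate."""
    p_alt = {0: err, 1: 0.5, 2: 1 - err}[g]
    return (p_alt ** n_alt) * ((1 - p_alt) ** n_ref)

def em_genotype(samples, n_iter=30):
    """samples: list of (n_ref, n_alt) read counts, one per individual.
    Returns the estimated population alt-allele frequency and the
    per-sample genotype posteriors."""
    freq = 0.2                                   # initial allele frequency
    for _ in range(n_iter):
        posteriors = []
        for n_ref, n_alt in samples:
            prior = {0: (1 - freq) ** 2,         # Hardy-Weinberg priors
                     1: 2 * freq * (1 - freq),
                     2: freq ** 2}
            joint = {g: prior[g] * genotype_likelihood(n_ref, n_alt, g)
                     for g in (0, 1, 2)}
            z = sum(joint.values())
            posteriors.append({g: joint[g] / z for g in (0, 1, 2)})
        # M-step: expected alt-allele count over 2N chromosomes
        freq = sum(p[1] + 2 * p[2] for p in posteriors) / (2 * len(samples))
    return freq, posteriors

# three homozygous-ref, one ambiguous-leaning-ref, one heterozygous and
# one homozygous-alt individual
samples = [(10, 0), (9, 0), (11, 1), (5, 6), (0, 12)]
freq, post = em_genotype(samples)
```

Sharing the allele-frequency estimate across individuals is what lets weak per-sample evidence (e.g. the (11, 1) individual) be resolved confidently, which is the intuition behind the population-based gains reported above.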