Category O - 'Systems Biology and Networks'
Short Abstract: Metabolic networks and transcriptional regulatory networks are among the best-reconstructed networks in biological systems; however, the links between the two are not as well established and frequently remain inadequately defined. Reconstructed metabolic networks allow constraint-based modeling of metabolism and provide computational frameworks to guide metabolic engineering experiments and explore fundamental questions about metabolism, whereas reconstruction of transcriptional regulatory networks (TRNs) can reveal fundamental aspects of the regulatory framework and wiring of the cell. Chlamydomonas reinhardtii, a biofuel-relevant green alga that has retained key genes with plant, animal, and protist phylogenetic affinities, serves as an ideal model organism to investigate the interplay between gene regulation and metabolism at the systems level. Here, we have integrated a recently reconstructed high-resolution metabolic-gene network model (Chaiboonchoe et al., under review) with an expression-refined TRN of C. reinhardtii (http://baliga.systemsbiology.net/home/chlamy-portal/) to reveal missing regulatory layers between the two. We have established an association network describing the conservation of transcriptional regulation of enzymes that function in related metabolic pathways. Our results suggest that a metabolic constraint-based approach integrated with TRNs derived from high-throughput data can identify hubs of genes, metabolites, and reactions. This may improve the quality and predictive ability of the metabolic network and enable manipulation of metabolic reactions via modification of transcriptional regulators. To explore the conservation of metabolic-transcriptional regulatory constraints in other related species, work on the diatom Phaeodactylum tricornutum is currently in progress.
Short Abstract: Metabolic pathway diagrams are a classical way of visualizing a linked
cascade of biochemical reactions. However, to understand some
biochemical situations, viewing a single pathway is insufficient,
whereas viewing the entire metabolic network results in information
overload. How do we enable scientists to rapidly construct
personalized pathway diagrams that depict a desired collection of
interacting pathways that emphasizes particular pathway interactions?
We present software for constructing pathway-collage diagrams using a
combination of manual and automatic layouts. The user specifies a set
of pathways of interest for the collage from a Pathway/Genome
Database. Layouts for the individual pathways are generated by the
application, which is implemented using Cytoscape.js. The application allows
the user to re-position pathways; define connections between pathways;
change visual style parameters; and paint metabolomics, gene
expression, and reaction flux data onto the collage to obtain a
desired multi-pathway diagram. We demonstrate the use of pathway
collages in two application areas: a metabolomics study of pathogen
drug response, and an Escherichia metabolic model. The Pathway
Collage application is a new component of the Pathway Tools software,
and is freely available to academic users, and available for a fee to
commercial users, from http://biocyc.org/download.shtml.
Short Abstract: The increasing number of genome-wide assays of gene expression available from public databases presents opportunities for computational methods that facilitate hypothesis generation and biological interpretation of these data. We present an unsupervised machine learning approach, ADAGE (analysis using denoising autoencoders of gene expression), and apply it to the publicly available gene expression data compendium for Pseudomonas aeruginosa. In this approach, the machine-learned ADAGE model contained 50 features, which we predicted would correspond to gene expression patterns depicting different biological states of the organism. While no biological knowledge was used during model construction, co-operonic genes and genes sharing KEGG pathways were grouped into similar features. By analyzing newly generated and previously published microarray and RNA-seq data, we demonstrated that features extracted by ADAGE represent differences between strains and model the cellular response to low oxygen. The ADAGE model also accurately predicted the involvement of biological processes, even with low-level gene expression differences, through feature activation. ADAGE compared favorably with traditional principal component analysis and independent component analysis approaches in its ability to extract validated patterns, and based on our analyses, we propose that these approaches differ in the types of patterns they preferentially identify. Extraction of consistent patterns across large-scale collections of genomic data using methods like ADAGE provides the opportunity to identify general principles and biologically important patterns in biological systems.
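The denoising-autoencoder machinery that ADAGE builds on can be sketched in a few lines of NumPy. This is not the authors' implementation: the layer size, noise level, and learning rate below are illustrative, and the model here is a single tied-weight sigmoid layer trained by plain gradient descent on reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dae(X, n_hidden=50, noise=0.1, lr=0.1, epochs=200):
    """One-hidden-layer denoising autoencoder with tied weights.
    X: samples x genes, values scaled to (0, 1). Returns (W, b, c)."""
    n, d = X.shape
    W = rng.normal(0, 0.1, size=(d, n_hidden))
    b = np.zeros(n_hidden)   # hidden-layer bias
    c = np.zeros(d)          # reconstruction bias
    for _ in range(epochs):
        # corrupt the input with Gaussian noise, reconstruct the clean version
        Xn = X + rng.normal(0, noise, size=X.shape)
        H = sigmoid(Xn @ W + b)
        R = sigmoid(H @ W.T + c)
        # gradients of mean squared reconstruction error (tied weights)
        dR = (R - X) * R * (1 - R) / n
        dH = (dR @ W) * H * (1 - H)
        W -= lr * (Xn.T @ dH + dR.T @ H)
        b -= lr * dH.sum(axis=0)
        c -= lr * dR.sum(axis=0)
    return W, b, c
```

After training, each column of W plays the role of one "feature": genes with large weights in the same column are grouped together, analogous to the 50 ADAGE features whose activations the abstract interprets.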
Short Abstract: Recent developments regarding the analysis of gene co-expression profiles using complex network theory have shown promising results. Such analysis usually starts with the construction of an unweighted gene co-expression network; an important step in constructing the network is the selection of a proper threshold, which determines whether a pair of vertices is connected. We aimed to provide and compare two different approaches for threshold selection. To do so, we selected 25 microarray experiments from different biological groups. For each experiment, we used Pearson’s correlation coefficient to measure the correlation between each gene pair, and the resulting adjacency matrices were used to construct the gene co-expression networks. We repeated the procedure for 30 different threshold values and collected information about many topological characteristics of the networks. For our first threshold criterion, we used Principal Component Analysis (PCA) to reduce the dimensionality of the dataset, which defined a set of trajectories followed by the networks associated with each experiment for increasing threshold values. The trajectories were found to present a region of stability followed by an unstable one; we suggest the optimal threshold should be near the end of the stable region. The second criterion is based on the Shannon entropy of selected network measurements. A threshold providing the maximum entropy was chosen for each experiment; this threshold should be related to the configuration in which the network presents the most information. The two criteria were compared and shown to provide distinct characterizations of the expression profiles.
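The entropy-based criterion can be illustrated with a small pure-Python sketch. This is a deliberate simplification of the abstract's method: the "network measurement" here is only the degree distribution, and we scan a fixed grid of candidate thresholds.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def degree_entropy(adj):
    """Shannon entropy of the degree distribution of an adjacency matrix."""
    degrees = [sum(row) for row in adj]
    n = len(degrees)
    counts = {}
    for d in degrees:
        counts[d] = counts.get(d, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(profiles, thresholds):
    """Pick the correlation threshold that maximizes degree-distribution entropy."""
    m = len(profiles)
    corr = [[abs(pearson(a, b)) for b in profiles] for a in profiles]
    best_t, best_h = None, -1.0
    for t in thresholds:
        adj = [[1 if i != j and corr[i][j] >= t else 0 for j in range(m)]
               for i in range(m)]
        h = degree_entropy(adj)
        if h > best_h:
            best_t, best_h = t, h
    return best_t, best_h
```

In practice the abstract's criterion combines entropies of several network measurements, not degree alone; the scanning structure is the same.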
Short Abstract: Recent advances in systems biology have made clear the importance of network models for capturing knowledge about complex relationships in gene regulation, metabolism, and cellular signaling. A common approach to uncovering biological networks involves performing perturbations on elements of the network, such as gene knockdown experiments, and measuring how the perturbation affects some reporter of the process under study. We develop a method for inferring these networks from single- and double-knockdown data. The traditional network-learning paradigm is one in which a single structure applies across all data; however, there is evidence that some biological processes do not necessarily follow this constraint. This led us to develop a representation and a learning method for network structures in which an interaction between two genes may be present in one context but absent in another, reflecting the biological phenomenon that some sets of molecules might only be available in some pools in the cell. This type of model, which more faithfully matches the biological process under study, aids in furthering general systems biology knowledge and helps apply this knowledge effectively.
Short Abstract: Blood is non-invasively accessible, rendering it an ideal body fluid for disease prognosis and diagnosis by clinicians. Plasma metabolic profiles are expected to carry a mixture of process signals that can be attributed to specific tissues or organs. In this study, we systematically investigate the utility of plasma metabolites as proxy markers for organ processes and function. To this end, global metabolic profiles were measured in adipose tissue, kidney, liver, muscle, and plasma samples, all simultaneously obtained from diabetic and healthy mice (db/db; wt). In addition, gene expression was monitored in kidney and liver samples of the same mice with microarrays. Based on these data, we first systematically compared the metabolomes across the various organs and identified a unique metabolic profile for each organ, as well as metabolites measurable only in certain organs. Next, we constructed inter-organ networks between plasma and organs at both the pathway and single-molecule levels. These networks were then used to examine which specific organ processes and functions are reflected by plasma metabolites. We thereby observed a clear cross-talk between the different organs and plasma, and a clearly distinguishable pattern of reflection by plasma metabolites, with strong signals of kidney processes and only weak signals of liver processes. On a regulatory level, plasma metabolites predominantly reflect transport processes in the kidney and signaling pathways and metabolic processes in the liver. By a differential analysis, we demonstrate that changes in the levels of organ molecules are reflected by changes in plasma metabolite concentrations, highlighting their utility as proxy markers.
Short Abstract: Recent technological advances in bioimage informatics and computer simulation have produced a large amount of quantitative data on the spatiotemporal dynamics of biological objects such as molecules, cells, and organisms. However, these data are difficult to reuse for further analysis because they are often scattered over the Internet in different formats, so there is a crucial need to bring them together in a coherent and systematic manner. We developed SSBD (Systems Science of Biological Dynamics database) as an open repository for quantitative data and microscopy images (http://ssbd.qbic.riken.jp). SSBD aims to facilitate and contribute to data-driven biology by repurposing and exploiting large sets of quantitative and image data. SSBD provides quantitative data in a unified format (BDML; Kyoda et al., 2015) and through a REST API; it currently holds over 300 sets of quantitative data. The data extracted from microscopy images include single-molecule, nuclear division, and behavioral dynamics data in D. discoideum, C. elegans, D. melanogaster, zebrafish, and mouse. The data generated from computer simulation include single-molecule and pronuclear migration dynamics data in E. coli and C. elegans. SSBD also provides image data in the original format using the OMERO platform, including time-lapse image data obtained from confocal microscopy, differential interference contrast microscopy, total internal reflection fluorescence microscopy, and a bioluminescence imaging system. A web-based viewer allows users to visualize time series of 3D spatial data on demand without any additional plugin. In addition, SSBD provides software tools for data visualization and analysis. The open-source version of SSBD for managing quantitative data is available at http://github.com/openssbd/.
Short Abstract: Here we report a new R package, CoSync, that takes pseudo-temporally ordered single-cell data and applies several dynamical-systems and signal-processing techniques, including phase synchronization, Granger causality, and coherence analysis, to infer gene interactions and construct co-synchronization networks. It connects smoothly with a number of single-cell pseudo-temporal ordering tools upstream and with the co-expression network tool WGCNA downstream. Its functionality and potential have been tested with two real data sets. CoSync can be used as a module in a regulatory network modeling pipeline for single-cell studies.
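As an illustration of the Granger-causality step (a sketch of the idea, not the CoSync code, which is in R), a lag-1 test of whether one gene's pseudo-temporal profile helps predict another's can be written with NumPy least squares:

```python
import numpy as np

def granger_lag1(x, y):
    """F-like statistic: does the past of y help predict x beyond x's own past?
    Lag-1 only; larger values suggest y 'Granger-causes' x."""
    xt, xp, yp = x[1:], x[:-1], y[:-1]
    # restricted model: x_t ~ intercept + x_{t-1}
    A1 = np.column_stack([np.ones_like(xp), xp])
    r1 = xt - A1 @ np.linalg.lstsq(A1, xt, rcond=None)[0]
    # full model: x_t ~ intercept + x_{t-1} + y_{t-1}
    A2 = np.column_stack([np.ones_like(xp), xp, yp])
    r2 = xt - A2 @ np.linalg.lstsq(A2, xt, rcond=None)[0]
    rss1, rss2 = r1 @ r1, r2 @ r2
    n = len(xt)
    return (rss1 - rss2) / (rss2 / (n - 3))
```

Running the statistic in both directions for each gene pair, and keeping pairs with a large asymmetry, yields the directed edges of a co-synchronization network.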
Short Abstract: The human genome encodes hundreds of RNA-binding proteins (RBPs), but the targets of only a small fraction of these proteins have been characterized. Recently, high-throughput in vitro and in vivo technologies have revealed the binding motifs of an appreciable number of RBPs, providing a great opportunity to develop RBP target prediction models. To predict as-yet-uncharacterized binding targets of RBPs, we propose a random forest model using four types of features: matching scores of putative binding sites, their accessibility, their conservation, and matching scores of neighboring regions. The random forest was trained on experimentally validated binding targets and randomly selected negative datasets. We built a specific random forest classifier for each RBP by integrating binding motifs and experimentally validated binding sites. We compared our classifier with two existing methods, RBPmap and DeepBind; across the 12 RBPs covered by all three methods, our classifier achieved better performance than both. Moreover, by adding a post-scoring normalization step, we found satisfactory cross-prediction accuracy among the RBP-specific models, i.e., a model specific to one RBP often achieved comparable prediction accuracy for another RBP. We therefore built a general random forest model to expand the application. The general model would be especially useful for interested users, since for most human RBPs the binding motif is known but sufficient experimentally validated targets are lacking. We expect to implement our methods as a web service for predicting binding targets of human RBPs, thus helping place RBPs into human post-transcriptional regulatory networks.
Short Abstract: Along with the recent popularization of live-imaging technologies, various types of microscopy images are becoming available in public databases. Quantitative data obtained from these images, such as the positions and shapes of nuclei and cells and their temporal changes, are important resources in systems biology and can be used in various kinds of computational analysis, including phenotype screening and mathematical modeling. Here, we develop a new resource of quantitative data on nuclear division dynamics in C. elegans RNAi embryos. Quantitative data were extracted from two-dimensional time-lapse DIC microscopy images in a public database, Phenobank, using our newly developed automated image processing system. The resource consists of 1,579 sets of quantitative data for RNAi embryos, including three sets of data for each of the genes (518/549 of all genes) whose individual depletion by RNAi caused defects in embryogenesis. Each dataset contains the outlines of nuclear regions and their temporal changes. All data were verified through manual error correction. To demonstrate the usage of this resource, we applied it to computational phenotype screening for female pronuclear migration. In this screening, fifteen candidate genes were selected for faster or slower migration by calculating the maximum migration speed. We found that two of them, sds-22 and F44B9.8, exhibited significantly faster and slower migration, respectively, than wild-type in our independent experiments. sds-22(RNAi) and F44B9.8(RNAi) embryos expressing GFP::tubulin exhibited markedly larger and smaller asters, respectively, consistent with the mechanism of nuclear tracking along microtubules. This resource will be openly available in the SSBD database (http://ssbd.qbic.riken.jp).
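The screening statistic, maximum migration speed, is straightforward to compute from tracked centroid positions. A minimal sketch (the frame interval and 2D coordinates are placeholders for whatever the tracking system produces):

```python
import math

def max_migration_speed(track, dt):
    """Maximum pronuclear migration speed from a list of (x, y) centroid
    positions sampled every dt time units: the largest per-frame
    displacement divided by the frame interval."""
    speeds = [math.dist(track[i], track[i + 1]) / dt
              for i in range(len(track) - 1)]
    return max(speeds)
```

Ranking genes by this statistic across RNAi embryos is what selects the faster/slower migration candidates described above.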
Short Abstract: We can now consider interactions between various biomedical objects, such as genes, chemicals, molecular signatures, diseases, pathways, and environments. Often, any pair of objects, such as a gene and a disease, can be related in multiple different ways: for example, directly via gene-disease associations or indirectly via functional annotations, chemicals, and pathways. Different ways of relating these objects carry different meanings and imply different similarities. However, present methods often disregard these semantics, which prevents them from fully exploiting their potential.
Our goal is to answer association queries on complex data systems and explicitly distinguish between diverse data semantics. For example, given a few diseases, infer the most significant size-k gene module. Or, given a list of genes, propose a group of k other genes that, taken together, will give the highest significance for a particular disease under an appropriate null hypothesis. Or, given a few pathways, find which k chemicals have collectively the largest impact on these pathways. To achieve this level of versatility we build on compressive data fusion to derive all possible semantics and formulate a submodular optimization program with theoretical guarantees about the detection quality.
In gene-disease association prediction and disease module detection studies on 310 complex diseases we show the new approach compares favorably to approaches that conflate data semantics. Importantly, we find that different semantics vary in their ability to make accurate predictions and that the utility of semantics depends on disease category. Overall, compressive data fusion provides a powerful mechanism for inference across gene-disease contexts.
Short Abstract: Recently it has been shown that cancer mutations selectively target protein-protein interactions (PPIs). We hypothesized that mutations affecting distinct PPIs involving established cancer genes could contribute to tumor heterogeneity, and that novel mechanistic insights into tumorigenesis might be gained by investigating PPIs under positive selection in cancer. To identify such PPIs, we mapped over 1.2 million nonsynonymous somatic cancer mutations onto 4,896 experimentally determined protein structures and analyzed their spatial distribution. In total, 20% of mutations on the surface of known cancer genes perturbed PPIs, and this enrichment for PPI interfaces was observed for both tumor suppressors (OR 1.28, P-value < 10^-4) and oncogenes (OR 1.17, P-value < 10^-3). Moreover, we constructed a bipartite network representing structurally resolved PPIs from all available human complexes in the Protein Data Bank (2,864 proteins, 3,072 PPIs). Analysis of frequently mutated cancer genes within this network revealed that tumor suppressors, but not oncogenes, are significantly enriched with functional mutations in homo-oligomerization regions (OR 3.68, P-value < 10^-8). We present two important examples, TP53 and beta-2-microglobulin, for which the patterns of somatic mutations at interfaces provide insights into specifically perturbed biological circuits. In patients with TP53 mutations, patient survival correlated with the specific interactions that were perturbed. Moreover, we provide a resource of 3,072 PPI interfaces ranked according to their mutation rates. Analysis of this list highlights 282 novel candidate cancer genes that encode proteins participating in interactions that are perturbed recurrently across tumors. In summary, mutation of specific PPIs is an important contributor to tumor heterogeneity and may have important implications for clinical outcomes.
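The interface-enrichment question can be illustrated with a small permutation test. This is a sketch of the general idea, not the paper's odds-ratio analysis, and the residue identifiers are arbitrary integers:

```python
import random

def interface_enrichment(muts, iface, surface, n_perm=2000, seed=0):
    """Permutation test: are the mutated surface residues over-represented
    at interface positions, relative to random placement on the surface?
    muts: mutated residue ids; iface: set of interface residue ids;
    surface: list of all surface residue ids (toy identifiers).
    Returns (observed interface hits, empirical p-value)."""
    rng = random.Random(seed)
    observed = sum(1 for m in muts if m in iface)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        perm = rng.sample(surface, len(muts))
        if sum(1 for p in perm if p in iface) >= observed:
            at_least_as_extreme += 1
    # add-one correction so the p-value is never exactly zero
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)
```

Applied per interface across tumors, a statistic of this kind is what ranks the 3,072 interfaces by mutation rate.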
Short Abstract: KaBOB is a knowledge-integration framework focused on genes and proteins, intended to support mechanistic explanations of experimental results in genomics, transcriptomics, and proteomics. Extending it to include metabolic information would facilitate analysis of metabolomic datasets as well. Potential metabolomic knowledge sources for integration include HumanCyc with 1826 metabolites, ChEBI with 3947 “human metabolites”, and the Human Metabolome Database (HMDB) with 29289 “endogenous” human metabolites.
HMDB contains an order of magnitude more metabolites than HumanCyc or ChEBI, largely because it curates not only small molecules but also lipids, which are important in metabolism and signalling. HMDB provides cross-references to HumanCyc (1174) and ChEBI (2791). Of these, only 1064 are cross-referenced to both; 1767 are in ChEBI but not HumanCyc, and 235 are in HumanCyc but not ChEBI. However, HMDB is not a superset of the other two sources: relative to their self-reported totals, 36% (652/1826) of HumanCyc metabolites are absent from HMDB, and 29% (1156/3947) of ChEBI metabolites are absent from HMDB.
To create a comprehensive knowledge base of metabolites, each of these sources must be integrated. To do so, the KaBOB framework requires that each knowledge source be converted into a formal semantic representation grounded in Open Biomedical Ontologies and expressed in the Semantic Web standard OWL language. Future work involves semantic mappings for each of the sources and a set of queries demonstrating the ability to access knowledge seamlessly from all of them simultaneously.
Short Abstract: Screening sets of chemical compounds for potential synergy or antagonism has a wide range of applications, from medicine to bioenergy research. In order to facilitate design and analysis of synergy screens, we have developed an R package, SynergyScreen. Given a set of compounds and dose ranges, SynergyScreen can produce a design of a screen testing their pairwise combinations. The package generates layouts for a set of 96-well microtiter plates. It includes titration series for each compound and each combination in a fixed ratio, comprising a set of single-ray synergy experiments. Once the experiments have been carried out, SynergyScreen can analyze the data to detect synergistic and antagonistic compound pairs. The analysis includes normalization to remove potential plate bias, modeling each individual dose-response curve, and computing interaction index values for a set of effect sizes. The package can produce tabular outputs and visualizations. We demonstrate SynergyScreen functionality using example data.
Short Abstract: Motivation: Most of the existing aging-related databases provide fragmented information that is limited to individual molecules or genome expression profiles. However, actual aging mechanisms are the result of networks of interactions between numerous molecules, rather than simply being determined by individual genes/proteins. In order to fully understand such complex mechanisms, a comprehensive network of diseases, drugs, and molecules is required. Therefore, in this study we aimed to integrate several public databases in order to construct an anti-aging-related meta-database.
Results: A network database of approximately 40,000 edges was constructed from the literature and from publicly available databases. Open anti-aging-related databases, protein interactions, drugs, biochemicals, diseases, and signaling pathways were retrieved from public databases. Other relevant information was extracted from NCBI articles using text mining. We constructed a user-friendly web server to present molecular information from the anti-aging-related network. Unlike previous molecule-centered databases, our server provides a collection of up-to-date research and results at the network level, which will aid searches for anti-aging-related information and make it possible to discover new and significant insights into the mechanisms that underlie aging (http://antiaging2.labkm.net).
Short Abstract: Accurate detection of significant gene expression variation in RNA-seq experiments relies on high-quality data. For this purpose, we developed an RNA-seq QC (RSQC) pipeline for researchers to check data quality and identify outlier samples. Many quality issues arise in RNA-seq data, such as sequencing bias, non-genome-region reads, non-uniquely mapped reads, and outlier samples; these issues come from various sources, such as low RNA quantity and quality, library preparation, and platform selection. To ensure high data quality, our proposed RSQC pipeline cleans the raw data file by filtering out non-uniquely mapped reads, PCR duplicates, and reads mapped to non-coding RNA. Since a successful experiment requires the sequencing data to meet certain criteria, the pipeline generates various data metrics for the RNA-seq experiments: a sequencing quality summary, the numbers of raw and clean reads, uniquely mapped reads, duplication ratio, chromosome and gene coverage, mapping quality, and the percentage of reads mapped to non-coding RNA. At the experimental design level, RSQC can identify outlier samples using principal component analysis (PCA), Pearson correlation, and irreproducible discovery rate (IDR) methods. As quality control software for RNA quality, RSQC assesses RNA integrity for each sample based on the expression of housekeeping genes. We tested eight samples (half controls and half treatments) in a pilot study; the pipeline successfully identified the sample with a low RIN value and other outlier samples. The pipeline runs on a Linux cluster with the Maui scheduler and 32 GB nodes.
Short Abstract: Major malignancies such as breast, colorectal, and lung cancers are known to be heterogeneous both molecularly and clinically, which hampers the selection of the right patients for optimized therapy. Therefore, it is important to determine subtypes of cancers for better clinical management and the design of more targeted agents. In recent years, owing to the development of next-generation sequencing high-throughput technologies, signature-gene-based subtyping has been widely demonstrated as an efficient approach for cancer stratification. However, such a classification strategy has intrinsic limitations due to platform differences, batch effects, etc. Here we propose a novel approach, based on functional gene sets and deep learning, that can overcome the limitations of the current signature-gene-based classification strategy. As a demonstration, we applied our approach to a robust and widely accepted colorectal cancer classification system, the four consensus molecular subtypes (CMSs) proposed by the international colorectal cancer consortium. Compared with their official Random Forest-based classifier, as well as other popular classification algorithms, our approach showed much better performance in accuracy, robustness, and clinical relevance. Furthermore, we also propose a new approach to interpreting cancer heterogeneity, which improves our understanding of the underlying biological mechanisms. To summarize, our approach not only overcomes the shortcomings of traditional methods but also brings new insights into understanding cancer.
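The move from signature genes to functional gene sets can be sketched as a simple feature transformation: collapse each sample's expression profile into per-gene-set scores before classification. This toy illustration uses mean expression as the set-level score, and the gene and set names are hypothetical:

```python
def pathway_activities(expr, gene_sets):
    """Collapse a per-gene expression profile (gene -> value) into
    per-gene-set scores: the mean expression over the genes of each set
    that are present in the profile (0.0 if none are)."""
    activities = {}
    for name, genes in gene_sets.items():
        values = [expr[g] for g in genes if g in expr]
        activities[name] = sum(values) / len(values) if values else 0.0
    return activities
```

Because set-level scores aggregate over many genes, features of this kind are less sensitive to platform differences and batch effects than individual signature genes, which is the motivation the abstract gives for the approach.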
Short Abstract: Small non-coding RNAs (sRNAs) are key regulators of gene expression in bacteria [1]. Many studies have demonstrated that sRNA-mediated regulation is important for the coordination of stress and physiological responses [2]. Through complementary base pairing with messenger RNAs (mRNAs), sRNAs can affect gene expression by modulating transcript stability and/or translation efficiency. Here, we describe a network inference approach designed to identify sRNA-mRNA interactions affecting transcript levels. We use transcriptional datasets and prior knowledge to infer bacterial sRNA networks using the Inferelator [3], our network inference tool. We show the successful application of our network inference strategy in the scenario of computationally predicted sRNA-mRNA interactions and partially characterized sRNA regulons. We plan to build expanded gene regulation models of several bacterial species, incorporating the contributions of sRNAs. Additionally, we plan to study the connections between transcriptional regulation mediated by DNA-binding proteins and the sRNA regulatory layer. Ultimately, this effort will improve our understanding of the regulation of gene expression in bacteria.
1. G. Storz, J. Vogel, and K.M. Wassarman, Molecular Cell, 2011, 43, 880.
2. E.G.H. Wagner and P. Romby, Advances in Genetics, 2015, 90, 133.
3. M.L. Arrieta-Ortiz, C. Hafemeister, et al., Molecular Systems Biology, 2015, 11, 839.
Short Abstract: The relationships between gene expression, cellular functions, and disease phenotypes have been defined largely by transcriptome profiling. Transcriptomic studies rely explicitly or implicitly on the assumption that co-expressed mRNAs share similar biological functions, which guides common data analysis approaches, including gene clustering, co-expression network analysis, and gene set enrichment analysis. However, recent studies report only a moderate correlation between mRNA and protein profiles. Quantitative analysis of multi-level gene expression regulation is conceptually and technically challenging, and a key question — whether protein co-expression or mRNA co-expression better predicts gene co-functionality — remains largely unexplored. Here, we address this question in cancer using rich mRNA and protein profiling data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC). We constructed mRNA and protein co-expression networks for three cancer types with matched mRNA and protein profiling data sets. The analyses revealed a marked difference between the wiring of the protein and mRNA co-expression networks. Whereas protein co-expression was driven primarily by functional similarity between co-expressed genes, mRNA co-expression was driven by both co-function and chromosomal co-localization of the genes. Protein-level regulation strengthened the link between gene expression and function for at least three quarters of Gene Ontology (GO) biological processes and ninety percent of KEGG pathways. A web application developed based on the three protein networks revealed novel gene-function relationships. Protein-level regulation provides essential mechanisms to drive coordinated gene functions. Elucidating these mechanisms requires proteomic measurements.
Short Abstract: Background:
In many fields of science, observations on a studied system represent complex mixtures of signals of various origins. Tumors are engulfed in a complex tumor microenvironment (TME) that critically impacts progression and response to therapy; it includes tumor cells, fibroblasts, and a diversity of immune cells. It is known that, under some assumptions, it is possible to separate complex signal mixtures using classical and advanced methods of source separation and dimension reduction.
In this work, we apply independent component analysis (ICA) to decipher the sources of signals shaping the transcriptomes (global quantitative profiling of mRNA molecules) of tumor samples, with a particular focus on immune-system-related signals. We use ICA iteratively, decomposing signals into sub-signals that can be interpreted using pre-existing immune signatures through correlation or enrichment analysis.
Our analysis revealed that signals related to groups of immune cell types can be identified with an unsupervised learning approach in a breast cancer dataset. Using Fisher's exact test, we identified significant groups corresponding to three of the five sub-signals: (1) T cells, (2) DCs/Macrophages, (3) Monocytes/Macrophages/Eosinophils/Neutrophils. The T-cell metagene correlates well with tumor grade (Kruskal-Wallis test, p-value = 0.003).
Ongoing analysis aims to evaluate the robustness of the identified groups and potential differences between several types of cancer. We will characterize the degree of immune infiltration in the cancer transcriptome dataset and further correlate it with patients’ survival and tumor characteristics. If successful, the results could be used in diagnosis and cancer therapy, especially immunotherapy.
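The ICA step at the heart of this decomposition rests on standard FastICA machinery. A compact NumPy version (tanh contrast, deflation) gives the flavor; it is a sketch under simplifying assumptions, not the pipeline's actual implementation:

```python
import numpy as np

def fastica(X, n_components, n_iter=200, seed=0):
    """Minimal FastICA with a tanh nonlinearity and deflation.
    X: mixtures x samples (e.g. genes x tumors). Returns the unmixing
    matrix U such that the estimated sources are U @ X."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=1, keepdims=True)
    # whiten via eigendecomposition of the covariance matrix
    cov = X @ X.T / X.shape[1]
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues ascending
    K = vecs[:, -n_components:] / np.sqrt(vals[-n_components:])
    Z = K.T @ X                               # whitened data
    W = np.zeros((n_components, n_components))
    for i in range(n_components):
        w = rng.normal(size=n_components)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            g = np.tanh(Z.T @ w)
            w_new = (Z @ g) / Z.shape[1] - (1 - g ** 2).mean() * w
            # deflation: stay orthogonal to previously found components
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-8
            w = w_new
            if converged:
                break
        W[i] = w
    return W @ K.T   # unmixing matrix: sources ~ (W @ K.T) @ X
```

The recovered source components (metagenes) are then matched against pre-existing immune signatures by correlation or enrichment, as described above.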
Short Abstract: The human gut microbiota is a diverse community of bacteria, archaea, and eukaryotic cells. Close interactions within this community are hypothesized to form complex, interdependent metabolisms (syntrophy) integral to maintaining human health. To study these interactions, we combine metabolomics, genetics, and microbial growth studies with software testing techniques and metabolic modeling to explore the limits of a proposed syntrophic relationship between two members of the gut microbiota. In the small intestine, the bacterium Bacteroides thetaiotaomicron anaerobically ferments dietary polysaccharides to produce short-chain fatty acids, carbon dioxide, and hydrogen gas. The archaeon Methanobrevibacter smithii utilizes the fermentation products formate, acetate, and hydrogen gas to reduce carbon dioxide to methane. To characterize their relationship, we use classification trees from machine learning in combination with variable coverage analysis from software engineering to predict a syntrophic relationship from in vitro high-throughput growth studies and in silico flux balance analysis of uncurated models. We verify in vitro that B. thetaiotaomicron fermentation products are sufficient for M. smithii growth, and that a build-up of fermentation products alters B. thetaiotaomicron production of metabolites to benefit M. smithii growth. Growth studies further suggest that B. thetaiotaomicron grows better in the absence of fermentation products or in the presence of M. smithii. Our combination of in vitro and in silico techniques indicates a mutually beneficial syntrophic relationship between B. thetaiotaomicron and M. smithii. The results will be used to refine metabolic models and analyses to increase the accuracy of future predictions without the need for full-factorial experimentation or manual curation of metabolic models.
Short Abstract: Many biological networks divide naturally into modules. Most biological functions are carried out by specific groups of genes and proteins, so the structure can be separated into functional modules. However, detecting and characterizing modular structure in physical networks remains one of the outstanding problems in the study of biological networked systems. A useful strategy for grouping nodes (i.e. genes, proteins) into functional modules is to consider network perturbation models. Mass spectrometry (MS)–based proteomics has been applied to network biology to detect and quantify perturbation-induced network changes.
Here we introduce an unsupervised, topology-based approach that uses multidimensional quantitative proteomics data to infer structural modularity. This novel TDA-based approach allows related proteins to organize automatically into connected nodes, providing a geometric representation of structure in the data that is usually hidden. We demonstrate the performance of this approach as a proof of concept using data generated from a human drug network and from a yeast deletion network with roles in chromatin remodeling processes. We show that the generated topological structures can correctly suggest the biological roles of proteins. In addition, our method provides a novel tool for visualizing hidden biology within large protein interaction networks.
Short Abstract: The broad application of genome-scale metabolic modeling has made it a useful technique for tackling fundamental questions in biological research and engineering. Today, over 100 models have been constructed for organisms that carry out a diverse array of metabolic activities spanning all three domains of life. These models, however, have been curated independently following different conventions. The maintenance of model consistency has been challenging due to the lack of consensus in model representation and the absence of integrated modeling software for associating mathematical simulations with the annotation and biological interpretation of metabolic models. To solve this problem, we developed a new software package, PSAMM, and a new model format that incorporates heterogeneous, model-specific annotation information into modular representations of model definitions and simulation settings. PSAMM provides significant advances in standardizing the workflow of model annotation and consistency checking. Compared to existing tools, PSAMM supports more flexible configurations and is more efficient in running constraint-based simulations. All functions of PSAMM are freely available for academic users and can be downloaded from a public Git repository at https://zhanglab.github.io/psamm/ under the GNU General Public License.
Short Abstract: Biological networks contribute effectively to unveiling the complex structure of molecular interactions and to discovering driver genes, especially in the cancer context. Due to gene mutations, for example as cancer progresses, gene expression networks can undergo some amount of localised re-wiring. The ability to detect statistically relevant changes in the interaction patterns induced by the progression of the disease can lead to the discovery of novel driver genes. Several procedures have recently been proposed to detect sub-network differences between two networks. Here, we propose an improvement over the state of the art based on the Generalized Hamming Distance, adopted for evaluating the degree of topological difference between two networks and estimating its statistical significance. The proposed procedure exploits a more effective model selection criterion and is more efficient in terms of computational time and prediction accuracy than literature methods. Moreover, the structure of the proposed algorithm allows for a faster parallelized version. The proposed approach is 10-15x faster than the literature method and achieves 5-10% higher AUC, precision and kappa values than the state of the art in the case of dense random geometric networks.
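The Generalized Hamming Distance itself is simple to compute; the sketch below pairs it with an exact permutation test that is feasible only for toy networks. This is an illustration of the underlying statistic, not the proposed procedure, whose model selection and parallelization are not reproduced here.

```python
from itertools import permutations

def ghd(a, b):
    """Generalized Hamming Distance between two networks given as
    adjacency matrices (lists of lists) over the same node set:
    the mean absolute edge-weight difference over ordered node pairs."""
    n = len(a)
    total = sum(abs(a[i][j] - b[i][j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def ghd_pvalue(a, b):
    """Exact permutation p-value: relabel the nodes of b in every possible
    way and count relabelings with a GHD at least as small as the observed
    one (small GHD = similar topology). Feasible only for tiny networks."""
    n = len(a)
    observed = ghd(a, b)
    count, total = 0, 0
    for perm in permutations(range(n)):
        permuted = [[b[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        total += 1
        if ghd(a, permuted) <= observed:
            count += 1
    return count / total

# Two 4-node networks differing in a single undirected edge
a = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
b = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
print(ghd(a, b))  # 2 differing ordered pairs out of 12
```

In practice the null distribution is approximated by sampling permutations rather than enumerating all of them.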
Short Abstract: We have developed a new algorithm for quantitating the strength of gene co-expression within a given sample cohort and applied this to several cohorts of cancer tumors. Pearson and Spearman correlations are calculated between genes using only data points at which at least one of the two genes is differentially expressed vs. the trimmed mean of the gene, rather than using all data points, in an attempt to better estimate the correlation of biological activity. These correlations are then combined with the ranks of each gene in the other's list of significant correlations, the length of the vector used in the correlation, and similarity of rank-ordered co-expressed gene lists within local networks to yield a final interaction score. These scores are used to generate gene and sample cluster heat maps, which are then manually annotated with biological meaning. This approach was applied to in-house gene expression microarrays from ~10,000 primary tumor samples, spanning diverse sites of origin. Additional co-expression clusterings were performed on several subsets of the data, for individual organs and histologies. Three major biological processes were identified in the gene-clusters: proliferation, epithelial to mesenchymal transition (EMT), and immune response (with sub-clusters related to NK/T-cells, DC/B-cells, and interferon response). Gene membership within these broad clusters, as well as the signatures derived from them, can then be used to identify and interpret major cancer-related pathway biology in other experiments and gene signatures.
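The filtering step above can be sketched as follows: correlate two genes only over samples in which at least one of them deviates from its trimmed mean. The `threshold` and `trim_frac` values are hypothetical (the abstract specifies no cutoffs), and the rank-based combination into a final interaction score is omitted.

```python
import math

def trimmed_mean(values, trim_frac=0.1):
    """Mean after dropping the lowest and highest trim_frac of values."""
    s = sorted(values)
    k = int(len(s) * trim_frac)
    core = s[k:len(s) - k] if k > 0 else s
    return sum(core) / len(core)

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filtered_pearson(xs, ys, threshold=1.0, trim_frac=0.1):
    """Pearson correlation restricted to samples in which at least one of
    the two genes deviates from its trimmed mean by more than `threshold`
    (a hypothetical cutoff standing in for 'differentially expressed')."""
    tx, ty = trimmed_mean(xs, trim_frac), trimmed_mean(ys, trim_frac)
    pairs = [(x, y) for x, y in zip(xs, ys)
             if abs(x - tx) > threshold or abs(y - ty) > threshold]
    if len(pairs) < 3:
        return None  # too few informative samples to correlate
    fx, fy = zip(*pairs)
    return pearson(list(fx), list(fy))

# Synthetic example: two genes co-activated in half of the samples
xs = [0.0, 0.1, -0.1, 0.05, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, -0.05, 0.1, 0.0, 5.1, 6.2, 6.9, 8.1]
r = filtered_pearson(xs, ys, threshold=1.0)
```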
Short Abstract: BinoX is a novel method for pathway analysis and for determining the relation between gene sets. Most common approaches use overlapping genes and set theory as the fundamental statistic for assessing the significance of the relation between gene sets and pathways. Because these methods disregard non-overlapping genes, they capture only a small part of the whole picture. BinoX overcomes this problem with a new statistical model based on genome-wide functional association networks. The implemented algorithm outperforms common methods in terms of computation time, detection rate, true positive rate and false positive rate.
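The overlap-based baseline that BinoX is contrasted with can be sketched as a hypergeometric tail test; BinoX's own network-based statistical model is not reproduced here. The gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_overlap_pvalue(n_universe, n_set_a, n_set_b, n_overlap):
    """P(overlap >= n_overlap) when a query set of size n_set_a is drawn
    at random from a universe of n_universe genes containing a pathway of
    n_set_b genes. This is the classic overlap statistic that disregards
    non-overlapping genes."""
    total = comb(n_universe, n_set_a)
    upper = min(n_set_a, n_set_b)
    p = 0.0
    for k in range(n_overlap, upper + 1):
        p += comb(n_set_b, k) * comb(n_universe - n_set_b, n_set_a - k) / total
    return p

# 20,000-gene universe; pathway of 100 genes; query set of 50; 10 shared
p = hypergeom_overlap_pvalue(20000, 50, 100, 10)
```

The expected overlap here is only 50 × 100 / 20000 = 0.25 genes, so observing 10 shared genes yields a vanishingly small p-value.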
Short Abstract: Over 200 million people suffer from asthma worldwide (PMID: 24997565). A characteristic feature of asthma is the aberrant accumulation, differentiation or function of memory CD4(+) T cells that produce type 2 cytokines (TH2 cells). Several studies have recently uncovered new metabolic checkpoints for T cell activity. In the current study, we identified potential asthma-specific metabolic genes in activated T cells that differ between healthy and asthmatic individuals. Differential expression analysis between naïve (TN) and memory CD4+ T cells (TH) from asthmatic and healthy individuals (PMID: 24997565) identified a disease-specific set of 73 metabolic genes that are significantly upregulated in TH relative to TN cells. Transcriptomic data from healthy (H) and asthmatic (A) TN and TH cells were used to reconstruct four cell-type-specific genome-scale metabolic models (H-TN, H-TH, A-TN, A-TH) by integrating the transcriptomic data from each group with the generic human metabolic reconstruction (Recon 1). Two genes, 6-phosphogluconolactonase (6PGL) and the disease-causing gene hydroxysteroid (17β) dehydrogenase 10 (HSD17B10), were of particular interest since they were unique to the A-TH model as well as being in the list of disease-specific upregulated metabolic genes. 6PGL hydrolyses 6-phosphogluconolactone to 6-phosphogluconate in the oxidative arm of the pentose phosphate pathway (PPP) (PMID: 25037503), while HSD17B10 functions in isoleucine degradation and is required for mitochondrial integrity (PMID: 20077426). Both genes have been implicated in several diseases, including cancer and Alzheimer's disease. However, our study is the first to propose a role for these genes in disease-specific T-cell activation events.
Short Abstract: Homeostatic renewal of mammalian tissue mass and composition is critical to the maintenance of robust organ function throughout the life of the organism. This is typically achieved through a continuous and constitutive supply of new cells from differentiating progenitor cells. Several questions remain about how different tissue-resident and non-resident cellular responses are integrated towards a coordinated dynamic equilibrium. In this study, we focused on the homeostatic renewal of liver tissue. We developed a new mathematical model that accounts for two recently described populations of hepatocytes: Axin2+ and Axin2- cells. In this scheme, liver tissue renewal occurs predominantly through replication of Wnt-responsive Axin2+ “stem-cell-like” hepatocytes, which have been shown to cluster pericentrally in the liver lobule, and presumably respond to factors produced by pericentral endothelial cells. As Axin2+ hepatocytes self-renew, some of the daughter cells migrate away from the pericentral region, transform into Axin2- hepatocytes, and populate the remainder of the liver. We investigated this system by considering several putative models with different feedback interactions between the two cell types. We utilized a Design-of-Experiments-based approach to investigate the robustness of each model to multiple perturbations in the parameters. Our analysis identified a network model that included multiple feedback influences on cell proliferation as well as cellular transformation as yielding the most robust response to individual parameter changes. Based on our simulations and model analysis, we predict that homeostatic renewal of liver tissue is governed by a combination of feedback influences on proliferation and transformation of “stem-cell-like” hepatocytes.
Short Abstract: Colorectal cancer (CRC) is the third most common cancer worldwide, with nearly 1.4 million new cases in 2012, and it lacks biomarkers for disease progression and defined therapeutic targets. This study aimed to identify differences in gene expression and the associated biological pathways across the different stages of CRC (Stage I - IV). Meta-analysis was performed on four microarray datasets comprising 372 cancer samples (73 Stage I, 147 Stage II, 90 Stage III and 62 Stage IV) and 43 controls. Raw-data integration methods such as BMC, COMBAT, GENENORM and XPN, as well as processed-data integration such as p-value integration, were applied; assessment parameters such as the reproducibility of differentially expressed genes (DEGs) relative to the individual datasets, PCA, MDS plots and hierarchical clustering analysis suggested that the GENENORM and COMBAT methods performed well. A total of 4069, 4059, 4226, and 4327 genes were found dysregulated in the four stages (I-IV) of CRC, respectively. Comparison of these dysregulated genes showed that more than 3000 genes are common among the different stages, indicating that these genes are dysregulated throughout the progression of CRC. In addition, stage-specific dysregulated genes (182 in Stage I, 102 in Stage II, 133 in Stage III and 282 in Stage IV) were identified, which are important and helpful in distinguishing the different stages of CRC. This study suggests that the DEGs in Stages I and II were mostly involved in metabolic processes and immune response, while in Stages III and IV the genes were involved in cell cycle and division.
Short Abstract: Modeling RNA transcription in eukaryotes remains challenging and incomplete with respect to biological relevance. We show that incorporation of RNA degradation substantially improves regulatory network inference on the example of yeast Saccharomyces cerevisiae.
Yeast has several hundred putative TFs and ~6,000 potential targets, which renders the potential interaction space enormous. Furthermore, eukaryotes are marked by extensive promoter regions, many response pathways, and additional regulatory layers, e.g. RNA decay. For these reasons, even the best network inference algorithms have so far performed very poorly.
We addressed this challenge by taking several steps towards constructing the first genome-wide biophysically relevant yeast regulatory network.
These modifications include incorporating RNA decay rates into the existing network inference framework Inferelator-BBSR, which allowed recovery of interactions missed by existing methods. RNA decay rates were incorporated either by computationally finding optimal decay rates for different modes of regulation in yeast, or by incorporating empirical RNA decay rate data into the procedure. Both approaches substantially improve regulatory network inference.
The new, inferred network independently recapitulates biological findings, such as biases in RNA half-lives amongst subsets of genes. Furthermore, RNA decay rates that maximize recovery of known interactions agree with experimentally measured rates. Finally, the new network can identify different modes of stress adaptation which require different average RNA decay rates.
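The first-order view underlying these modifications is that measured mRNA reflects both synthesis, which the regulatory network controls, and decay. A toy sketch of that model with illustrative rate values (this is not the Inferelator-BBSR code):

```python
def simulate_mrna(alpha, lam, m0=0.0, t_end=10.0, dt=0.001):
    """Euler integration of dm/dt = alpha - lam*m: mRNA level m driven by
    synthesis rate alpha (what transcription factors control) and degraded
    at rate lam. Half-life = ln(2)/lam; ignoring lam conflates the two."""
    m, t = m0, 0.0
    while t < t_end:
        m += (alpha - lam * m) * dt
        t += dt
    return m

# With synthesis 2.0/min and decay 0.5/min, mRNA settles near alpha/lam = 4
steady = simulate_mrna(alpha=2.0, lam=0.5, t_end=40.0)
```

The point of the sketch: two genes with identical synthesis profiles but different decay rates show very different measured expression, which is why folding decay rates into the inference recovers otherwise missed interactions.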
Short Abstract: Nowadays, it is largely accepted that a comprehensive understanding of biological systems demands a joint analysis of diverse omics layers. Integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level insights that can drive biological experimentation. A potential source of such data is RNA sequencing (RNA-Seq), an efficient method for the discovery and annotation of many types of transcripts. However, the analysis of raw data requires multiple steps. We are currently working on a network-based integration of transcriptomic, genomic and proteomic data using Galaxy as a scientific workflow system. To carry out the initial steps of this task, we aimed at establishing a baseline for the processing capability and suitability of the tool and of the entire apparatus involved in RNA-Seq data analysis. The implemented analytical steps involved: quality analysis; trimming; reference genome mapping; and differential expression analysis. As a result, we processed over 189.3 GB of data, generating graphics and reports. Noteworthy results include: a) the system's robustness and traceability; b) the possibility of sharing the results and the developed pipeline. We are now working on the integration of the resulting data into a pre-built network, in order to analyze possible topological elements of biological interest. Data regarding the RNA-Seq analysis workflow developed with Galaxy, together with the preliminary results of network integration, will be presented.
Supported by: CAPES, CNPq, FAPEMIG and CPqRR.
Short Abstract: Diagnosis and stage classification of many diseases, including cancer, is still a major challenge. With the rapidly accumulating omics data on diseases, identification of biomarkers from such data poses a mammoth problem and is a highly pursued objective. We report the development of a new algorithm that constructs condition-specific response networks by mapping gene expression values onto a curated, directed, genome-scale protein-protein interaction network that comprehensively captures both structural interactions and functional influences. Where required, these can also be made patient-specific or patient-group-specific response networks. For a given disease, separate networks are first constructed for control and disease conditions or disease stages. Top activity paths for each network are computed using Dijkstra's algorithm. The highest-ranked paths of perturbations due to the disease are then computed by comparing them using the Jaro-Winkler similarity metric. Next, network communities are identified using a fast-greedy algorithm, followed by an identification of max-span and min-span paths that reflect inter-community and intra-community high-influence paths. An influence index is computed for each node in these paths, based on eccentricity, betweenness centrality and the degree difference, to obtain high-influence networks yielding a ranked list of high-influence nodes. From these, genes that showed significant differential expression were selected, and a tree-based estimator was used to identify a minimal signature set capable of efficiently classifying disease samples from control samples or classifying disease stages. As a case study, signatures for mutant BRAF, mutant RAS, mutant NF1 and triple-WT classified melanoma patients in TCGA data will be discussed.
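The path-computation step above relies on Dijkstra's algorithm; a minimal implementation on a toy weighted graph follows. The expression-derived edge weights of the actual method are not reproduced, so the graph and weights here are illustrative only.

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths on a weighted directed graph given as
    {node: [(neighbor, weight), ...]}. In a response network, edge weights
    can be derived from expression so that low-cost paths correspond to
    high-activity paths."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"A": [("B", 1.0), ("C", 4.0)],
     "B": [("C", 2.0), ("D", 6.0)],
     "C": [("D", 3.0)]}
print(dijkstra(g, "A"))  # {'A': 0.0, 'B': 1.0, 'C': 3.0, 'D': 6.0}
```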
Short Abstract: Molecular dynamics simulations were used to compare the binding of (R)- and (S)-propranolol to the chiral molecular micelle poly-(sodium undecyl-(L)-leucine-valine). This study is part of a larger effort to understand the mechanism of chiral recognition in capillary electrophoresis by characterizing the molecular micelle binding of chiral compounds with different geometries and charges. Molecular dynamics simulations showed that both propranolol enantiomers inserted their aromatic rings into the molecular micelle core and that (S)-propranolol associated more strongly with the molecular micelle than (R)-propranolol. This difference was attributed to stronger molecular micelle hydrogen bonding interactions experienced by (S)-propranolol. Chiral separations are especially important in the medical and pharmaceutical fields because, in a chiral in vivo environment, a drug's chirality often has a significant impact on its biological activity. For example, the beta-blocker drug propranolol is sold as a racemic mixture; however, only the (S) enantiomer has the desired beta-blocking activity. As a result, the Food and Drug Administration mandates that the properties of each enantiomer of a chiral drug be studied separately before decisions are made to bring the drug to market as a single enantiomer or as a racemic mixture.
Short Abstract: High-dimensional data are an important part of the tremendous recent growth of human immunology, which, to a great extent, benefits from controlled longitudinal vaccination studies. We report here our development of a Multiscale, Multifactorial Response Network (MMRN) using data from a herpes zoster vaccine study in humans. Metabolomics, transcriptomics, cytokines and frequencies of cell subpopulations were measured multiple times at the beginning of the study, and the antibody response was monitored for up to 6 months. Dimension reduction was performed in two steps: for example, the transcriptome was collapsed into our previously published blood transcription modules, and the modules were further grouped by network clustering techniques. Partial least squares regression was used to assess the association between different data types, using permutation tests. The resulting MMRN revealed important temporal connections between cytokines, plasma metabolites, blood cell frequencies and gene expression. We demonstrate that the MMRN is highly accurate in predicting biological outcomes. These results also suggest a new paradigm in which gene expression in blood cells is guided by metabolite cues from the plasma.
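The permutation-testing idea used above can be illustrated on a single pair of measurements; the study itself applies it to partial least squares regression across whole data blocks, which this sketch does not attempt. All data here are synthetic.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_pvalue(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the association between two
    measurements across subjects: shuffle one vector to break the pairing
    and count permutations with |r| at least as large as observed.
    The +1 terms keep the estimate away from an impossible p of zero."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    hits = 0
    ys_shuffled = list(ys)
    for _ in range(n_perm):
        rng.shuffle(ys_shuffled)
        if abs(pearson(xs, ys_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic strongly associated pair: metabolite level vs. antibody titer
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 for x in xs]
p = permutation_pvalue(xs, ys, n_perm=500, seed=1)
```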
Short Abstract: Hodgkin lymphoma (HL) is a type of B cell lymphoma characterized by Hodgkin/Reed-Sternberg cells. To diagnose the disease and identify the subtype, biopsies are taken, immunostained for markers like CD30, and inspected under a microscope. Little is known about the spatial distribution of CD30 cells in the lymph node. In digital pathology, microscope slides are digitized to produce whole slide images (WSI). We present an analysis of a set of WSIs which includes cases of both HL and lymphadenitis (LA), an inflammation of the lymph node.
CD30+ cells were identified using our imaging pipeline. We defined cell graphs based on the positions and morphological properties of the immunostained cells. We analyzed the vertex degree distributions of the graphs and compared them to a suitable null model. CD30 cell graphs showed higher vertex degrees than expected, suggesting cell clustering. We found the gamma distribution suitable for modeling the vertex degree distributions. LA and HL showed different vertex degree distributions, while those of two HL subtypes were similar. Graph partitioning methods were used to determine biologically meaningful cell groups.
This investigation provides objective parameters for CD30+ cells in HL. The method can be extended to different immune cell types, their shapes and interactions. For future work, we aim to combine our data with additional information, e.g. lymph node structures, to build a model of the tumor microenvironment of HL.
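Fitting a gamma distribution to a vertex degree sample can be done quickly by the method of moments; this is a first-pass sketch with a toy degree sample, not the study's actual fitting procedure.

```python
def gamma_moments_fit(degrees):
    """Method-of-moments fit of a gamma distribution to vertex degrees:
    shape k = mean^2 / variance, scale theta = variance / mean.
    Maximum-likelihood fitting would refine this initial estimate."""
    n = len(degrees)
    mean = sum(degrees) / n
    var = sum((d - mean) ** 2 for d in degrees) / (n - 1)
    return mean * mean / var, var / mean

# Toy degree sample; the fitted mean (shape * scale) recovers the sample mean
degrees = [2, 4, 4, 4, 5, 5, 7, 9]
shape, scale = gamma_moments_fit(degrees)
```

Comparing fitted parameters between conditions (e.g. LA vs. HL graphs) then reduces to comparing a pair of numbers per graph.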
Short Abstract: Environmental exposures contribute greatly to human health and disease, yet it has been difficult to quantify such impacts. Research on the exposome and metabolomics, driven by high-resolution mass spectrometry, is now moving this frontier forward. The exposome aims to catalog internal doses or surrogates of all environmental exposures. The metabolome captures all small molecules, reflecting the biochemical state that serves as deep phenotyping and as the footprint of gene activities. These new data thus become the missing pillar in understanding gene-environment interactions. To illustrate this emerging paradigm, we use high-resolution metabolomics to study the effect of exposure to the pesticide DDT (dichlorodiphenyltrichloroethane) in a human population and in mouse models.
Archived serum samples of 465 subjects in California from the 1960s, when DDT exposure was at its peak, were used for metabolomics analysis, using a Thermo Q-Exactive mass spectrometer coupled with reverse phase C18 liquid chromatography. The association of each metabolite feature to DDT was assessed by regression models, accounting for age, BMI and total blood lipids. This metabolome wide association study (MWAS) was followed by mummichog, our published algorithm for untargeted metabolomics, to perform metabolic pathway and network analysis. Similar analysis was carried out in mouse models, and confirmed the significant pathways detected in the human population, including the metabolism of arginine, aspartate, asparagine and fatty acids. This study demonstrates a new set of methodology for MWAS, and reveals the biological effects from DDT exposure.
Short Abstract: The integrative personal omics profile published by Michael Snyder’s group in 2012 is a landmark paper that sets an example for this new era of multi-omics and personalized medicine. Over the past few years, computational methods for transcriptomics and metabolomics have progressed considerably, and warrant a revisit of the Snyderome data. Our group developed a set of algorithms to perform metabolic pathway and network analysis from untargeted metabolomics (mummichog; Li et al., 2013, PLoS Comput Biol), which leverage known metabolic reactions to resolve ambiguity in metabolite prediction. Applying mummichog to the metabolomics data in the Snyderome, we identified pathways of porphyrin, glycerophospholipid and linoleate metabolism. The porphyrin pathway corresponds precisely to the anemia condition of the patient. Glycerophospholipid and linoleate metabolic pathways are commonly observed metabolic changes in immunological responses to infections. The metabolomic pathway changes are consistent with alterations in the transcriptomic profiles from the patient. The analysis of blood transcriptomics is often confounded by mixed cell populations and the various immunological states of white blood cells. Our blood transcription modules (BTMs; Li et al., 2014, Nature Immunology) are a set of gene modules that were computationally inferred from a large compendium of public data as an alternative to conventional pathways. Our new BTM analysis revealed detailed immunology during the RSV infection in the Snyderome, including monocyte signals, antiviral pathways, TLR signaling, cell activation and complement pathways. This study demonstrates the utility of our new tools for personalized medicine, and provides significant novel insights in addition to those discovered in the original publication.
Short Abstract: Clostridium difficile is a nosocomial pathogen of the intestinal flora producing two major toxins that lead to diarrhea, pseudomembranous colitis and even death in 8% of cases. With close to half a million infections per year in the U.S. and a relapse rate of 20.5% among healthcare-acquired cases, urgent and immediate action against the bacterium is required, as mentioned in the latest report of the Centers for Disease Control and Prevention (CDC) of the United States.
To address this dire need, we created the first curated metabolic network of the bacterium, named iMLTC806cdf, and identified 163 new potential therapeutic targets for Clostridium difficile using flux balance analysis (FBA) and synthetic accessibility (SA). These genes affect growth under different conditions. On rich medium, we achieved an 89% precision rate for the prediction of gene essentiality. We now seek to validate genes that aren't essential in all conditions as therapeutic targets, since they offer better specificity and are better adapted to the biology of the pathogen than other classic targets (peptidoglycan, lipids, etc.).
Among these targets, the gene nadA was selected as a gene of high interest due to its implication in sporulation (an important process for pathogenesis) and its absence from humans (reducing the probability of side effects). The essentiality of the gene in the absence of NAD+ was validated experimentally, and deeper analyses are currently under way to validate its importance in vivo and to find potential binders via virtual screening, which could be developed into the new therapeutic agents needed to control the bacterium.
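Synthetic accessibility-style reasoning can be sketched as scope expansion over a reaction list: starting from the medium, a reaction fires once all of its substrates are producible. The reactions and metabolite names below are toy placeholders, not drawn from iMLTC806cdf, and the actual SA scoring is not reproduced.

```python
def accessible_metabolites(seed, reactions):
    """Iteratively expand the set of producible metabolites: a reaction
    (substrates, products) fires when all its substrates are available.
    Knockouts can be simulated by removing reactions and checking whether
    a biomass precursor remains reachable."""
    available = set(seed)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

# Toy network: glucose -> g6p -> pyruvate; pyruvate + nad -> acetyl-CoA
reactions = [
    (["glc"], ["g6p"]),
    (["g6p"], ["pyr"]),
    (["pyr", "nad"], ["accoa"]),
]
print(accessible_metabolites(["glc"], reactions))
```

With glucose alone, acetyl-CoA stays unreachable because NAD+ is missing, mirroring the kind of condition-dependent essentiality argument made for nadA above.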
Short Abstract: The immune system is composed of a large number of cell types that can be characterized by their combinations of cell surface markers. The quantity of each particular cell type is considered a quantitative trait, which is typically influenced by both environmental and genetic factors. Here we analyzed previously published data on cell-type traits that were measured in blood samples from human monozygotic twins. We identified a non-random distribution of environmental effects on different cell types. Thus, environmental effects act on specific cell types (characterized by their combination of cell surface markers) but not on others. The analysis distinguishes between different environmental effects acting on the same trait, generating a comprehensive map of immune system components that are affected by non-genetic factors.
Short Abstract: Biological network alignment (NA) aims to find a node mapping between molecular networks of different species that identifies topologically or functionally similar network regions. Analogous to genomic sequence alignment, NA can be used to transfer biological knowledge from well-studied to poorly-studied species between aligned network regions. Pairwise NA (PNA) finds similar regions between two networks, while multiple NA (MNA) can align more than two networks. We focus on MNA since it captures at once functional knowledge that is common to multiple species, which can lead to deeper biological insights compared to PNA. Typically, existing MNA methods aim to maximize the total similarity over all aligned nodes (node conservation). Then, they evaluate alignment quality by measuring the amount of conserved edges, but only after the alignment is constructed. Instead, we present multiMAGNA++, a novel MNA approach that directly optimizes both node and edge conservation during the alignment construction process. We show that multiMAGNA++ generally outperforms or is on par with the existing MNA methods, while often completing faster. That is, multiMAGNA++ scales well to larger network data and can be parallelized effectively. During method evaluation, we introduce new MNA quality measures to allow for more complete alignment characterization as well as fairer MNA method comparison compared to using only the existing alignment quality measures. Thus, our study can impact future MNA-related work in terms of both efficient method development and fair method evaluation.
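The edge-conservation component can be illustrated pairwise: given a node mapping between two networks, count the fraction of edges whose endpoints map onto an edge in the other network. multiMAGNA++ itself optimizes node and edge conservation jointly across more than two networks, which this toy sketch does not attempt; the networks and mapping below are hypothetical.

```python
def edge_conservation(edges1, edges2, mapping):
    """Fraction of edges of network 1 whose endpoints map onto an edge of
    network 2 under the node mapping: the pairwise building block of
    edge-conservation objectives in network alignment."""
    e2 = {frozenset(e) for e in edges2}  # undirected edge lookup
    conserved = sum(
        1 for u, v in edges1
        if frozenset((mapping[u], mapping[v])) in e2
    )
    return conserved / len(edges1)

# Triangle aligned to a 3-node path: 2 of the 3 edges are conserved
edges1 = [("a", "b"), ("b", "c"), ("a", "c")]
edges2 = [("x", "y"), ("y", "z")]
mapping = {"a": "x", "b": "y", "c": "z"}
print(edge_conservation(edges1, edges2, mapping))  # 2/3 conserved
```

Evaluating this only after alignment, as the abstract notes most methods do, is exactly what optimizing it during construction avoids.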
Short Abstract: As single-cell experimental approaches become increasingly popular, cell-to-cell heterogeneity has emerged as a key determinant of variability in gene expression and signaling responses. Mass cytometry (CyTOF) is a new proteomic technology that enables the simultaneous quantification of dozens of proteins in thousands of individual cells. In the context of cancer research, recent applications of CyTOF include the characterization of inter- and intra-tumor heterogeneity and the identification of novel cell subpopulations. However, as already demonstrated for single-cell RNA-seq, the resulting measurements are largely influenced by confounding factors, such as the cell cycle and cell volume.
We present here TRACE, a novel computational approach to quantify this source of variability. TRACE first exploits a hybrid machine learning approach to classify single cells into discrete cell cycle phases according to measurements of established markers. Next, a metric embedding optimization technique creates a one-dimensional continuous marker that tracks biological pseudotime, and individual cells are subsequently ordered according to this pseudotime marker. The resulting cell cycle trajectories across perturbation time points allow us to separate cell cycle effects from experimentally induced responses, enabling the direct comparison of signaling responses through cell cycle progression. Additionally, we show that volume biases can be corrected using housekeeping gene measurements. Our approach, implemented in a simple and intuitive graphical user interface, was used to analyze data from various cell lines subject to different stimulations. In each case, TRACE was able to separate confounding effects from signaling responses, enabling the unbiased analysis of biological processes.
Short Abstract: Frequent subgraph mining (FSM) is a common but complex problem within the data mining field that has gained in importance as more graph data has become available. However, traditional FSM finds all frequent subgraphs within the graph dataset, while often a more interesting query is to find the subgraphs that are most associated with a specific set of nodes. Nodes of interest might be those associated with a specific disease, or those that are differentially expressed in an omics experiment.
We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given set of nodes. The association between a subgraph pattern and a set of nodes is defined by its significant enrichment based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it on three distinct problems, namely identification of gene ontology network motifs associated with duplicated genes in yeast, network motifs enriched for PhoR transcription factor orthologs across seven transcriptional networks, and amino acid labeled subgraphs associated with manganese-binding residues in protein structure networks. These applications could all be tackled with the same exact algorithm despite their diversity and the results show that in each case we can find relevant functional subgraphs enriched for the selected nodes.
Short Abstract: Introduction:
With the increase in scientific literature, efficient text-mining methods greatly improve the extraction of useful information from texts. Moreover, two or more types of information may be tagged concurrently to find functional coincidences. For example, our aim is to identify instances of the fusion protein BCR-ABL1 and its protein-protein interactions (PPIs) in the literature. However, BCR-ABL1 is spelled in various ways (BCR/ABL1, bcr-abl1, etc.), which makes it hard to tag.
The Protein Fusions (ProtFus) server retrieves cancer fusion proteins and their PPIs using natural language processing (NLP). The text mining engine has four steps: (1) a consistent dictionary of the literature corpus, with text and annotation mappers; (2) a bibliography with keywords, rule bases, negative tokens, and a pattern extractor; (3) a synonym tagger and normalization; and (4) a regular expression mapper.
Results and Conclusions:
ProtFus searches for fusion-protein tokens, e.g. 'chimera', 'transcripts', 'fusion proteins', etc., to generate output like 'FGFR3-TACC3 fusion proteins', highlighting fusion proteins in the text. Similarly, for PPIs, ProtFus searches for tokens like 'bind', 'block', 'elevate', 'interact', etc. to generate output such as 'GRB1 interacts with BCR-ABL' from the text.
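The spelling-variant problem noted in the introduction is commonly handled with case-insensitive regular expressions. A minimal sketch, using an illustrative pattern rather than ProtFus's actual rules:

```python
# Match common variants of the BCR-ABL1 fusion name: BCR-ABL1,
# BCR/ABL1, bcr-abl1, BCR-ABL, etc. (illustrative pattern only).
import re

FUSION_RE = re.compile(r'\b(BCR)\s*[-/]\s*(ABL1?)\b', re.IGNORECASE)

def find_fusion_mentions(text):
    """Return each matched fusion-name variant as it appears in the text."""
    return [m.group(0) for m in FUSION_RE.finditer(text)]

sentence = "GRB1 interacts with BCR-ABL; the bcr/abl1 chimera was detected."
mentions = find_fusion_mentions(sentence)
```

A production tagger would combine such patterns with the synonym tagger and normalization steps listed above so that all variants map to one canonical identifier.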
Huang M, Ding S, et al. (2008) Mining physical protein-protein interactions from the literature. Genome Biol. 9(Suppl 2): S12.
Frenkel-Morgenstern M, Gorohovski A, et al. (2015) ChiTaRS 2.1--an improved database of the chimeric transcripts and RNA-seq data with novel sense-antisense chimeric RNA transcripts. Nucleic Acids Res. 43(Database issue): D68-75.
Frenkel-Morgenstern M, Valencia A. (2012) Novel domain combinations in proteins encoded by chimeric transcripts. Bioinformatics. 28(12): i67-74.
Short Abstract: Humans are constantly exposed to chemicals (e.g. pollutants, cigarette smoke) that can trigger molecular changes and be harmful to the organism. Risk assessment in the context of 21st century toxicology relies on the elucidation of mechanisms of toxicity and the identification of markers of exposure response. For that purpose, datasets using high-throughput technologies are generated from various biological samples after the exposure of cells or subjects to individual chemicals or mixtures. The development of relevant computational approaches for the analysis and integration of these large-scale data remains challenging and requires qualitative and quantitative evaluation. The scope of sbv IMPROVER (Industrial Methodology for Process Verification in Research; http://sbvimprover.com/) is the verification of methods and concepts in systems biology research via challenges proposed to the scientific community. The latest sbv IMPROVER computational challenge (Fall 2015 - Spring 2016) aims to address questions on (i) the identification of exposure-response markers in human blood that enable discrimination between exposed and non-exposed subjects, and (ii) the translatability of those markers between species. Participants are provided with human and mouse blood gene expression datasets to develop gene signature-based models for exposure group class prediction. Anonymized submissions will be scored by comparing the predictions to the “Gold Standard” (true class labels) using specific metrics to identify the best-performing computational methods. The outcome of the computational challenge is summarized in this poster.
Short Abstract: Ribosome profiling (or Ribo-seq) is currently the most popular methodology for studying translation; it has been employed in recent years to decipher various fundamental aspects of gene expression regulation.
The main promise of the approach is its ability to detect ribosome densities over an entire transcriptome at single-codon resolution. Indeed, dozens of Ribo-seq studies have included results related to local ribosome densities in different parts of the transcript; nevertheless, the performance of Ribo-seq has yet to be quantitatively evaluated and reported in a large-scale, multi-organism, multi-protocol study of currently available datasets.
Here we provide the first objective evaluation of Ribo-seq at single-nucleotide resolution using clear, interpretable measures, based on the analysis of 15 experiments, 6 organisms, and a total of 712,168 transcripts. Our major conclusion is that the ability to infer signals of ribosomal densities at nucleotide scale is considerably lower than previously thought, as signals at this level are not reproduced well in experimental replicates. In addition, we provide various quantitative measures that connect the expected error rate with the resolution of a Ribo-seq analysis.
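The replicate-agreement question at the heart of this evaluation reduces to comparing per-nucleotide density profiles between replicates. A toy sketch with invented counts, not one of the paper's actual measures:

```python
# Correlate per-nucleotide ribosome footprint counts for one transcript
# across two experimental replicates; the counts below are synthetic.
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

rep1 = [0, 3, 12, 5, 0, 1, 9, 2, 0, 4]
rep2 = [1, 2, 10, 7, 0, 0, 8, 3, 1, 5]
r = pearson(rep1, rep2)   # high r means positions reproduce well
```

Low correlations between replicates at this resolution, as reported above, indicate that per-nucleotide density signals are dominated by experimental noise.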
Short Abstract: Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Functional knowledge can then be transferred from well- to poorly-annotated species between aligned network regions. Network alignment typically encompasses two algorithmic components: a node cost function (NCF), which measures similarities between nodes in different networks, and an alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs, so it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We previously showed, using the state-of-the-art methods of the time, MI-GRAAL and IsoRankN, that combining the NCF of one method with the AS of another can give a new, superior method. More recently, we further confirmed this by mixing and matching MI-GRAAL’s and GHOST’s NCFs and ASs. Most recently, we introduced a novel AS called Weighted Alignment VotEr (WAVE). When used on top of the well-established NCFs of existing methods (such as MI-GRAAL or GHOST), WAVE improves alignment quality compared to the existing methods.
Short Abstract: Large-scale genomic studies have identified many non-coding genetic variations associated with various diseases, many of which overlap transcriptional enhancers. Accurate evaluation of the functional impacts of these variations and their links to disease phenotypes requires the identification of active enhancers, their target genes, and the quantitative relationships between enhancer features and gene expression levels in the relevant cell and tissue types. Here we present a novel bioinformatics pipeline for predicting enhancer-target networks in a sample-specific manner. We have confirmed the accuracy of the pipeline using multiple lines of evidence, including chromosome interaction data. Using the pipeline, we have constructed and analyzed active enhancer-target networks in more than 900 human samples that cover a large variety of primary cells, tissues and cell lines. Features of the enhancers were shown to infer the expression levels of their target genes with high accuracy. Similarity of the networks from different samples closely follows their biological origin and groups related samples into distinct clusters. Enhancers specifically active in a group of related samples tend to regulate genes specifically expressed in those samples. We also discovered that enhancers regulate genes in several major modes.
Short Abstract: MicroRNAs are short, non-coding RNAs that regulate their target genes post-transcriptionally. A number of microRNA target prediction tools have been developed. These tools are generally classified into sequence-based (base-pairing of complementary sequences between target genes and microRNAs) and expression-based (inverse relationships between the expression profile of a microRNA and those of its target genes). We focus on the latter strategy, detecting regulatory relationships with two Bayesian data analysis tools: GenMir++ and bnlearn. GenMir++ (J. C. Huang, et al., 2007) was used to identify a network of thousands of high-confidence target predictions for human microRNAs. However, in our preliminary experiment with simultaneous expression profiles of microRNAs and mRNAs in mouse adipocytes, the tool predicted too many microRNAs for their target genes (i.e., including false-positive microRNAs for each target gene) compared with experimentally validated regulations from the literature. The other tool we used is bnlearn, an R package for learning the graphical structure of Bayesian networks. Using this package, we performed structure learning of the regulatory relationships between microRNAs and mRNAs based on the Bayesian Information Criterion (BIC). In the resulting network, too many target genes were predicted for the microRNAs (i.e., including false-positive target genes for each microRNA). To cope with these issues, we intersected the regulations predicted by GenMir++ and bnlearn. Using this method, we successfully reduced the number of false positives and improved the precision of the target predictions.
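The consensus step, keeping only relations predicted by both tools, amounts to a set intersection over predicted microRNA-target edges. A minimal sketch with hypothetical predictions:

```python
# Keep only microRNA -> target relations predicted by both tools.
# The prediction sets below are invented for illustration.
genmir_preds = {("miR-1", "GeneA"), ("miR-1", "GeneB"), ("miR-2", "GeneC")}
bnlearn_preds = {("miR-1", "GeneA"), ("miR-2", "GeneC"), ("miR-3", "GeneD")}

consensus = genmir_preds & bnlearn_preds   # relations both tools agree on
```

Requiring agreement between two independent predictors trades recall for precision, which is the effect reported above: fewer false positives at the cost of discarding tool-specific predictions.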
Short Abstract: Identification of the molecular pathways perturbed under a given biological condition is important for understanding cellular behaviour under that condition and for developing better therapeutic interventions. Typically, knowledge about perturbed pathways is obtained by studying transcriptomics (gene expression) data to identify differentially expressed genes or to construct a gene co-expression network (by functionally linking genes whose expressions significantly “correlate” across different time points). However, these studies neglect physical interactions among the dysregulated genes (i.e., among their protein products), and it is the proteins that carry out cellular function by interacting with each other. Hence, studies of the cellular interactome (i.e., the protein-protein interaction (PPI) network) are promising. However, current PPI data span multiple conditions and other contexts, so using the whole interactome without considering other condition-specific biological data fails to capture any condition-specific knowledge. As a result, recent studies have integrated transcriptomics data with PPI network data by mapping the activity of dysregulated genes (as captured by the gene expression data) to their corresponding proteins in the PPI network, in order to assign activity weights to nodes (genes/proteins) and edges (PPIs) in the network. These studies then consider only the most active network parts as condition-specific dysregulated pathways. We evaluate and compare such data-integrative network inference methods to learn their (dis)advantages, understand their influence in explaining the given biological behaviour, and guide future method development. This presentation will discuss the results of our evaluation.
Short Abstract: Autophagy is known to be important in stress responses, regulation of inflammation, and intestinal homeostasis, including the elimination of intracellular pathogens. Conversely, autophagy is often hijacked or manipulated by intestinal pathogenic bacteria, such as Salmonella. A better understanding of the effect of certain bacterial species on the regulation of human intestinal autophagy could help us propose prognosis markers for inflammatory bowel disease and colon cancer.
The major post-translational regulators of autophagy are well known; however, they have not yet been collected comprehensively. The precise and context-dependent regulation of autophagy necessitates additional regulators, including transcriptional and post-transcriptional components that are listed in various datasets. Prompted by the lack of systems-level autophagy-related information, we developed an online resource, the Autophagy Regulatory Network (ARN; http://autophagy-regulation.org), to provide an integrated database for autophagy research. ARN contains manually curated, imported and predicted interactions of autophagy components in humans. We listed transcription factors and miRNAs regulating autophagy components or their protein regulators. The user-friendly ARN website allows researchers without a computational background to search, browse and download the database in various file formats (SBML, PSI-MI, BioPAX, SQL, Cytoscape, CSV). ARN has the potential to facilitate the experimental validation of novel autophagy components and regulators.
To investigate how Salmonella modulates autophagy, we developed the first large-scale network resource for Salmonella enterica, integrating known and predicted regulatory, metabolic and signalling interactions. We then integrated previously identified Salmonella-host interactions and data from ARN to list and validate novel genes responsible for autophagy modulation in the gut.
Short Abstract: Intracellular pathogens manipulate host pathways to establish themselves in the host cell. Burkholderia mallei uses secreted virulence factor proteins to directly influence host-pathogen interactions. Although we know that virulence factor proteins attenuate infection in animal models, we have scant information about their direct molecular roles in regulating host processes. Here, we used host-pathogen protein-protein interactions derived from yeast two-hybrid screens to derive mechanistic insights into nine B. mallei virulence factors. We showed that these virulence factors selectively targeted multifunctional host proteins, proteins that interacted with each other, and host proteins with a large number of interacting partners. Furthermore, we developed a novel host-pathogen interaction alignment algorithm to identify similarities between the host-pathogen interactions of B. mallei, Yersinia pestis, and Salmonella enterica. Importantly, we showed how multiple B. mallei virulence factors broadly influence key host processes to modulate and adapt the host-cell environment for establishing infection and promoting intracellular spread.
Short Abstract: The multifactorial nature of traumatic brain injury (TBI), especially the complex secondary tissue injury involving intertwined networks of molecular pathways that mediate cellular behavior, has confounded attempts to elucidate the pathology underlying the progression of TBI. Here, we present a meta-analysis of four TBI studies to demonstrate that systems biology can provide an efficient approach to generate testable hypotheses in discovering novel mechanisms of action and molecular indicators of TBI. In our study, we used canonical pathways and a large human protein-interaction network as a scaffold and separately overlaid the gene expression data from each TBI study to identify conserved molecular signatures. We found that only significantly suppressed molecular signatures were specific to the nervous system. Our in-depth analysis of a suppressed synaptic subnetwork led to the hypothesis of three novel protein indicators for TBI, which were successfully confirmed in a subsequent Western blot experiment.
Short Abstract: A principal claim for RNA-sequencing has been greater replicability, typically measured by sample-sample correlations of gene expression levels. Measuring replicability of transcript abundances in this way provides misleading estimates of the replicability of conditional variation, which is what is of interest in most expression analyses. Heuristics that implicitly address this problem have emerged as quality-control measures to obtain ‘good’ differential expression results. However, these methods involve strict filters, such as discarding low-expressing genes or using technical replicates to remove discordant transcripts, and are costly or simply ad hoc.
Instead, we show that gene-level replicability is a more useful metric, and demonstrate that it can be modeled in a co-expression framework, using known co-expressing gene pairs as pseudo-replicates instead of true replicates. We use this as a quality control metric: by modelling the effects of noise that perturbs a gene’s expression, we can then measure the aggregate effect of this perturbation on these co-expressing gene-pairs or ‘housekeeping interactions’. We find that perturbing expression by only 5% within its usual range of values is readily detectable (AUROC~0.73), suggesting this test is extraordinarily sensitive. In addition to making the software readily available (github.com/sarbal/AuPairWise), we have adapted the test to optimize RNA-seq alignment with the STAR aligner tool. Our findings suggest that more stringent parameters at the read mapping stage (e.g., minimum alignment scores) would have a modestly positive impact, making the post-hoc filtering done for high-expressing or high fold-changes a more intuitive part of direct quality control.
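A much-simplified illustration of the underlying idea (not the AuPairWise algorithm itself): perturb one gene of a known co-expressed "housekeeping" pair by up to 5% per sample and observe the effect on the pair's correlation. All data here are synthetic.

```python
# Synthetic co-expressed pair: gene_b tracks gene_a with small noise.
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

random.seed(0)
gene_a = [random.gauss(100, 10) for _ in range(50)]
gene_b = [0.9 * a + random.gauss(0, 2) for a in gene_a]  # co-expressed partner

r_clean = pearson(gene_a, gene_b)
# perturb gene_b by up to 5% of its value in each sample
gene_b_perturbed = [b * (1 + random.uniform(-0.05, 0.05)) for b in gene_b]
r_perturbed = pearson(gene_a, gene_b_perturbed)
```

AuPairWise aggregates this kind of signal over many known co-expressing pairs and scores detectability as an AUROC, rather than inspecting a single pair as done here.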
Short Abstract: Adequately classifying individual tumors into distinct molecular subtypes of cancer can predict likely paths of disease progression and suggest better therapeutic methods for intervention. Here we present an unsupervised method for characterizing large cohorts of tumor tissue samples collected by The Cancer Genome Atlas (TCGA).
Non-negative matrix factorization (NMF) is a classification method that has been used to identify gene expression subtypes in many datasets, including TCGA. However, the structure of the input datasets for NMF will heavily influence the number of clusters identified, the robustness of the assigned classes for each sample, and the replicability of the gene lists that mark each class. The input data structure is largely constrained by pre-processing steps to select the features used as input for NMF. Common feature selection methods include ranking genes by their median absolute deviation (MAD) across samples. However, the optimal number of genes, N, to select depends upon the expected number of classes present, the within class variance levels, and the level of noise in the dataset, with no automated way to predetermine the optimal N.
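The MAD-based feature selection described above can be sketched as follows; the expression values and gene names are illustrative, not a TCGA cohort:

```python
# Rank genes by median absolute deviation (MAD) across samples and
# keep the top N as input features for NMF.
from statistics import median

def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median(abs(v - m) for v in values)

def top_n_by_mad(expression, n):
    """expression: dict mapping gene -> per-sample expression values."""
    ranked = sorted(expression, key=lambda g: mad(expression[g]), reverse=True)
    return ranked[:n]

expr = {
    "GENE1": [1, 1, 1, 1],   # flat across samples: MAD 0
    "GENE2": [0, 5, 0, 5],   # bimodal: high MAD
    "GENE3": [2, 3, 2, 3],   # mild variation
}
selected = top_n_by_mad(expr, 2)
```

As the abstract notes, the weakness of this heuristic is that N must be fixed in advance and high MAD may reflect noise rather than class structure, which motivates the module-based alternative below.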
Our rationale is to identify gene modules that are co-expressed across samples using a statistical metric to identify cohesive groups. This will preferentially select for genes whose deviation across samples is due to differential expression rather than simply high variance levels. We demonstrate improved classification performance by the ability to predict patient survival outcome and the comparison to known markers for cancer molecular subtypes for several datasets drawn from the TCGA.
Short Abstract: Analyzing incomplete gene expression data to detect significant genes and pathways in cancer is a challenging task in bioinformatics.
In the past few years, several imputation methods have been proposed to deal with missing values. The Normalized Root Mean Square Error (NRMSE) is one of the most widely used measures for evaluating the accuracy of different imputation algorithms. In this work, we evaluate different imputation methods in more practical terms: we study their ability to preserve the significant genes and pathways.
For this purpose, in the simulation study, 5% of all genes are first selected at random. For these genes we consider two kinds of missingness, ignorable and non-ignorable, at rates of 10%, 20% and 30%. Nine well-known imputation methods are then applied in the context of two cancers (rectal and lung), and the NRMSE is computed to compare these methods. Based on NRMSE, the local least squares imputation method performs better than the other methods. Finally, an appropriate statistical test is applied to assess whether different imputation methods affect the set of significant genes or important pathways.
Results show that the choice of imputation method affects the set of significant genes and pathways identified. We conclude that results can be improved by applying an appropriate imputation method.
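For reference, one common convention for the NRMSE used in imputation benchmarks: the RMSE between true and imputed values at the masked positions, normalised by the standard deviation of the true values (other normalisations, e.g. by the data range, also exist).

```python
# NRMSE of imputed values against the held-out true values.
from math import sqrt

def nrmse(true_vals, imputed_vals):
    """RMSE at masked positions, divided by the std dev of true values."""
    n = len(true_vals)
    rmse = sqrt(sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n)
    mean = sum(true_vals) / n
    sd = sqrt(sum((t - mean) ** 2 for t in true_vals) / n)
    return rmse / sd

true_vals = [2.0, 4.0, 6.0, 8.0]
perfect = nrmse(true_vals, true_vals)            # exact imputation: 0.0
rough = nrmse(true_vals, [3.0, 3.0, 7.0, 7.0])   # imperfect imputation
```

Lower values indicate better recovery of the masked entries; an NRMSE near 1 means the imputation is no better than predicting the mean.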
Short Abstract: Primary sclerosing cholangitis (PSC) is a cholestatic autoimmune liver disorder of unknown etiology characterized by prominent hepatic immune infiltrate and extensive fibrotic strictures of the intra- and extra-hepatic biliary ducts. Genome-wide association studies (GWAS) have shown that the genetic architecture of PSC closely resembles other prototypical autoimmune disorders such as celiac disease and type 1 diabetes where the strongest genetic risk factors reside within the HLA alleles of the major histocompatibility complex. In PSC, the contribution of the overwhelming HLA association is unclear. In this work, we assessed the genetic HLA association in PSC by screening the autoantibody response in patient livers using human recombinant protein arrays and HLA elution studies. We identified eleven novel autoantibody targets shared among PSC samples from nine patients, which implies pathways involving these proteins may be targeted by autoimmune responses in patients with PSC. Furthermore, systems level analysis of the pathways modulated by the shared PSC autoantibody targets and their sequence homology analysis against the known proteomes have refined our understanding of the molecular triggers of PSC.
Short Abstract: Introduction:
New webservers provide biologists and bioinformaticians with new analytical opportunities but also raise workflow challenges. These challenges include sharing collections of genes with collaborators, translating gene identifiers to the most appropriate for each server, tracking these collections across multiple analysis tools and webservers, and maintaining effective records of the genes used in each analysis.
In this paper, we present the Tribe webserver (available at https://tribe.greenelab.com), which addresses these challenges in order to make multi-server workflows seamless and reproducible. This allows users to create analysis pipelines that use their own sets of genes in combinations of specialized data mining webservers and tools while seamlessly maintaining gene set version control. Tribe’s web interface facilitates collaborative editing: users can share with collaborators, who can then view, download, and edit these collections. Tribe’s fully-featured API allows users to interact with Tribe programmatically if desired. Tribe implements the OAuth 2.0 standard as well as gene identifier mapping, which facilitates its integration into existing servers. Access to Tribe’s resources is facilitated by an easy-to-install Python application called tribe-client. We provide Tribe and tribe-client under a permissive open-source license to encourage others to download the source code and set up a local instance or to extend its capabilities.
The Tribe webserver addresses challenges that have made reproducible multi-webserver workflows difficult to implement until now. It is open source, has a user-friendly web interface, and provides a means for researchers to perform reproducible gene set based analyses seamlessly across webservers and command line tools.
Short Abstract: Computational modeling of signaling pathways is crucial for understanding carcinogenesis and predicting responses of cancer cells to drug treatments. However, canonical signaling pathways curated from the literature are seldom context-specific and thus can hardly make precise prediction of anti-cancer drug effects. Association-based data-driven methods have drawbacks such as limited interpretability about underlying mechanisms. Therefore, hybrid methods that integrate prior knowledge and real data for network inference are highly desirable. In this paper, we propose a knowledge-guided fuzzy logic network model to infer signaling pathways by exploiting both prior knowledge and time-series data. Dynamic time warping is adopted to measure the goodness of fit between experimental and predicted data, so that our method can model temporally-ordered experimental observations. Moreover, two regularizers are introduced to penalize the incompatibility of the model with prior knowledge and constrain the number of proteins interacting with each signaling protein. The knowledge-guided fuzzy logic network model is further converted to a constrained nonlinear integer programming problem that can be solved by a genetic algorithm. We evaluated the proposed method on a synthetic dataset and a real time-series phosphoproteomics dataset. The experimental results demonstrate that our model can effectively uncover drug-induced alterations in signaling pathways in cancer cells. Compared with existing hybrid models, we are able to model feedback loops so that the dynamical mechanisms of signaling networks can be uncovered from time-series data. By calibrating generic models of signaling pathways against real data, our method supports precise predictions of context-specific anticancer drug effects.
Short Abstract: FAIRDOM's primary mission is to support researchers, students, trainers, funders and publishers by enabling Systems Biology projects to make their Data, Operating procedures and Models Findable, Accessible, Interoperable and Reusable (FAIR).
The data stewardship challenges in modern science laboratories are manifold. From a practical perspective, most projects involve the exchange of data between geographically distributed partners. Data from several labs may need to be integrated into a single model, with little input from the original experimentalists. Data and models from projects need to be packaged and made available as supplementary material, or for long-term accessibility up to 10 years beyond the conclusion of the project. Coupled to this, staff within labs undergo turnover due to career progression, parental leave, and sickness. These issues can only be mitigated with robust, easy-to-use data and model management infrastructure, and with carefully developed plans that are agreed and adhered to by all researchers involved.
At FAIRDOM we have spent more than 8 years understanding the challenges involved in data and model management for interdisciplinary projects. We have used our understanding to advance infrastructure that supports data and model management, and we have helped to develop research into standardisation, annotation best practice, and the social behaviour surrounding data and model sharing, all in order to offer better solutions for making data FAIR. Along with FAIRDOMHub we offer software (SEEK, RightField, openBIS, JWS Online), as well as a Knowledge Hub, to assist researchers in better data, model, and operating procedure management.
FAIRDOMHub is a component of the ISBE-lite infrastructure.
Short Abstract: The application of graph theory to the analysis and summarization of omics data is well established. There are algorithms optimized to build single networks from data generated with specific experimental procedures. However, biological processes involve the integration of several mechanisms, each analyzed with different laboratory methods. Multigraphs are well suited to combining diverse graphs into an integrated model that better describes complex processes. In multigraphs, any two vertices can be connected by more than one edge, each representing a distinct type of interaction. However, querying and mining are more difficult in multigraphs than in simple graphs. Fortunately, the introduction of graph databases has facilitated multigraph management. We show the application of Neo4j and Titan DB, two popular graph databases, to build multigraphs that integrate gene expression, protein-protein interaction and protein-DNA interaction data extracted from Mycobacterium tuberculosis studies. The multigraphs also include information about operon organization and annotation metadata. A usual approach to exploring biological networks is to search for cliques of completely interconnected nodes; for multigraphs, however, it is better to work with more relaxed quasi-cliques. We designed seven prototype queries based on quasi-cliques addressing relevant biological questions, for example, finding subsets that have predetermined patterns of co-expression and protein interaction and contain a given protein. Additionally, the subset members can be required to satisfy certain centrality criteria. We scanned the multigraphs with implementations of our prototypes and found quasi-cliques including known regulatory proteins, which validated our approach, as well as proteins with potential regulatory activity.
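The flavour of such a prototype query can be sketched without a graph database: in a multigraph whose parallel edges carry different interaction types, find the partners of a given protein connected by both a co-expression and a PPI edge. The proteins and edge types below are invented for illustration.

```python
# Toy multigraph: parallel edges between the same node pair are stored
# as a set of edge types keyed by the (unordered) node pair.
from collections import defaultdict

edges = [
    ("rpoB", "rpoC", "coexpression"),
    ("rpoB", "rpoC", "ppi"),
    ("rpoB", "sigA", "coexpression"),
    ("sigA", "rpoC", "ppi"),
]

multigraph = defaultdict(set)
for u, v, etype in edges:
    multigraph[frozenset((u, v))].add(etype)

def partners_with_types(node, required):
    """Partners of `node` connected by every edge type in `required`."""
    return sorted(
        next(iter(pair - {node}))
        for pair, types in multigraph.items()
        if node in pair and required <= types
    )

hits = partners_with_types("rpoB", {"coexpression", "ppi"})
```

In Neo4j or Titan DB the same pattern would be expressed declaratively in the database's query language, with the quasi-clique relaxation applied over larger node subsets.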
Short Abstract: We recently developed a novel tool set called Pathview as a pathway-based solution for omics data visualization and analysis. Pathview maps, integrates and renders a large variety of biological data on molecular and genetic pathways, and produces interpretable, publication-quality graphs. Pathview quickly became a leading tool in pathway visualization, and has been widely adopted by tens of thousands of scientists and dozens of dependent applications worldwide.
In this work, we further developed the Pathview Web server to make the tool set more accessible and useful. First, the server provides an intuitive graphical user interface to Pathview. In addition, it implements a generic pathway analysis workflow applicable to a wide variety of omics data, including gene expression, genomics, proteomics and metabolomics data.
Pathview Web extends the static R package to an interactive bioinformatics server, with multiple unique and useful features: 1) The result graphs are interactive and hyperlinked to abundant external annotation data online. 2) The server provides the latest, most complete and accurate pathway definitions and graphs by regular synchronization with the KEGG source databases. 3) Users can review, replicate and share their analyses easily with free registered user accounts, which enables collaborative research and reproducible science. 4) User engagement features allow users to make comments and suggestions, or to ask for help on designated pages.
In conclusion, the Pathview Web server is a comprehensive and user-friendly online solution for pathway visualization and analytics. It is available at http://pathview.uncc.edu.
Short Abstract: Annotations in databases are often correlated with each other. To understand the relationships among annotations, we comprehensively examined how strongly annotations are correlated with each other through proteins. We selected Gene Ontology (GO) annotations in the integrated human gene database H-InvDB. We constructed a gene-GO matrix and decomposed it by non-negative matrix factorization. As a preliminary result using a 1,000-gene dataset, we identified 200 concept modules. The number of modules corresponds to the rank of the gene-GO matrix; this rank was estimated as optimal for the dataset by singular value decomposition (SVD) analysis. For each module, we extracted distinct concepts constructed from multiple genes and multiple GO terms as composite concepts. The results provide useful information for users to understand each database annotation.
Short Abstract: The vast majority of biomedical data lives buried within closed-access research articles, as opposed to a well-curated publicly accessible database. Though there are clear benefits to organizing such data and making it available to the greater community, doing so presents a number of practical and technical challenges. Here, we present our approach for the NeuroElectro project (www.neuroelectro.org), an effort to systematically text-mine data on neuronal electrical properties from the full texts of published articles.
We take a two-stage approach for data extraction, using automated algorithms followed by manual curation. First, we download large volumes of research articles as HTML and apply simple rule-based text-mining methods to identify entities of interest, such as the names of electrical properties, neuron types, and experimental metadata. Second, we employ a team of undergraduate curators to review and correct the text-mined content for the underlying semantics and context of the article. Performing this manual curation step was necessary, as our text-mining methods alone had an accuracy of 78% for correctly labeling electrical properties but only 30% for identifying neuron types. Following data extraction, each article’s data (~1K curated articles are available currently) are made accessible through a user-friendly web interface and a public API.
In addition to making use of the extracted content for scientific aims, ongoing steps include expanding the kinds and sources of mined data and better utilizing community ontologies for text-mining and manual curation. Of particular interest is involving the original authors in curation review.
Short Abstract: It has been proposed that changes in gene coexpression can be used to predict changes in gene function across different conditions. It has also been suggested that gene networks, if made tissue-specific, would improve results from Guilt By Association (GBA) gene function prediction. However, to our knowledge, the properties of RNA coexpression in different tissues have not been described before. Here we report an analysis of coexpression patterns in five human tissues, aimed at identifying both common and tissue-specific patterns of gene coexpression. For each tissue, we collected between 7 and 15 high-quality, publicly available microarray datasets and, following rigorous quality-control steps, generated coexpression networks from each of them. We describe an approach to identify coexpressed gene pairs (“links”) that are robust across tissues, or are present specifically in one tissue. Furthermore, we consider the role that expression levels play in the presence of differentially coexpressed links, hypothesizing that tissue-specific links tend to involve genes which are expressed in a tissue-specific manner. We identified a subset of links which are tissue-specific in the absence of notable changes in mean expression levels and also explored the potential functional implications of such patterns.
Short Abstract: The homeobox encodes a DNA-binding domain found in transcription factors regulating key developmental processes. The most notable examples of homeobox-containing genes are the Hox genes, arranged on chromosomes in the same order as their activation along the body axis. The mechanism responsible for the synchronous regulation of Hox genes and the molecular function of their colinearity remain unknown. Here we report the discovery of a conserved structural signature of the 180-base-pair DNA fragment comprising the homeobox. We demonstrate that the homeobox DNA has a characteristic 3-base-pair periodicity in the hydroxyl radical cleavage pattern. This periodic pattern is significant in most of the 39 mammalian Hox genes and in other homeobox-containing transcription factors. The signature is present in segmented bilaterian animals as evolutionarily distant as humans and flies. It remains conserved despite the fact that it would be disrupted by synonymous mutations, suggesting that evolutionary selective pressure exists on the structure of the coding DNA. The homeobox coding DNA may therefore have a secondary function, possibly acting as a regulatory element. The existence of such an element may have important consequences for understanding how these genes are regulated.
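A 3-base-pair periodicity in a cleavage-intensity profile can be detected by projecting the signal onto the corresponding Fourier component. The sketch below is a generic periodicity test under an assumed power-normalization convention, not the statistical procedure used in the study.

```python
import cmath

def power_at_period(signal, period):
    """Normalized DFT power of `signal` at the given period (in samples).

    Returns a value in [0, 1]: the fraction of (mean-centered) signal
    variance carried by the frequency bin n/period.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    k = n / period  # frequency bin corresponding to the period
    z = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, v in enumerate(centered))
    total = sum(v * v for v in centered) or 1.0
    return abs(z) ** 2 / (n * total)
```

Applied to a 180-sample profile (the homeobox length), a strong score at period 3 and a near-zero score at other periods would indicate the kind of signature the abstract describes.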
Short Abstract: Gene-Term Enrichment Analysis is a commonly used method to analyze the differentially expressed genes gathered by techniques such as microarray assays. The common methods of Enrichment Analysis use a list of genes or proteins without considering the protein-protein interaction network. Improvements by other researchers have led to network ontology analysis (NOA). Still, the quantity and complexity of data produced by such an analysis are often difficult to interpret directly.
As a solution to this, we developed the SAR (Segmentation, Analysis, and Reintegration) algorithm for a structural analysis of protein interaction networks. Segmentation – The source graph is broken into component subgraphs, such as bipartite-like or clique-like subgraphs, via an appropriate metric. Analysis – Each component subgraph is independently analyzed via the NOA method and the results are stored. Reintegration – The independent results, which contain overlapping ontological components, are reintegrated to form a single representation. We apply the SAR algorithm for functional analysis of protein network biomarkers for Adherens Junction and Breast Cancer. Results showed that the proposed algorithm produces a more concise and easily interpretable representation of the gene-term relationship compared with the representation produced using NOA only.
We developed a Subgraph Finder Utility as part of the segmentation process. This is an extensible utility for detecting subgraphs of various kinds within a larger graph. CyFinder, a Cytoscape app that wraps the Subgraph Finder Utility, is under development.
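The three SAR stages can be sketched end-to-end on a toy protein network. As a simplification, segmentation here uses connected components as a stand-in for the clique-like/bipartite-like metrics, and analysis is a bare term-to-gene mapping rather than full NOA; all data are hypothetical.

```python
def connected_components(graph):
    """Segmentation: split the network into subgraphs (a simple stand-in
    for SAR's clique-like / bipartite-like segmentation metrics)."""
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def analyze(comp, annotations):
    """Analysis: term -> genes of this subgraph carrying the term (NOA stand-in)."""
    result = {}
    for gene in comp:
        for term in annotations.get(gene, ()):
            result.setdefault(term, set()).add(gene)
    return result

def reintegrate(partials):
    """Reintegration: merge overlapping per-subgraph term maps into one."""
    merged = {}
    for part in partials:
        for term, genes in part.items():
            merged.setdefault(term, set()).update(genes)
    return merged
```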
Short Abstract: In order to better understand gene regulation, differential gene expression is measured across different cell types and/or conditions. This often provides long lists of genes whose expression has varied significantly from the control cells. Further techniques are then employed on these lists in order to understand the significance of these differentially expressed genes. A common technique used is gene-class testing, which computes the significance of various types of categories (e.g., function annotations, metabolic pathways, transcription factor targets). However, the appropriate null model for these statistical computations is a subject of some controversy. Most analytic methods, such as Fisher's Exact Test, assume that the genes are chosen independently and uniformly from a universe of genes. It is well known that the independence assumption fails for real gene sets; here, we show that the uniform assumption is incorrect as well, and assuming a uniform distribution greatly skews p-values in an anti-conservative direction. We demonstrate that the probability of a gene being part of an overexpressed gene set is not uniform but increases with the number of transcription factors that target it. Focusing our enrichment analysis on the category of transcription factor targets, we show that, while the uniform assumption leads us to overestimate the significance of virtually all transcription factors to a particular gene set, weighting genes based on the number of transcription factors that target them leads to a much more credible calculation of significance.
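The weighted null model can be illustrated with a Monte Carlo sketch: sample gene sets with probability proportional to a per-gene weight (e.g., the number of TFs targeting the gene) instead of uniformly, and compare overlap p-values. This is a simplified illustration, not the authors' method; sampling with replacement and then deduplicating is an assumed approximation of a weighted-urn draw.

```python
import random

def weighted_pvalue(gene_set, tf_targets, weights, universe, n_sim=2000, seed=0):
    """Empirical P(overlap >= observed) under a null that samples genes
    proportionally to `weights` (uniform null = all weights equal)."""
    rng = random.Random(seed)
    observed = len(gene_set & tf_targets)
    genes = list(universe)
    w = [weights[g] for g in genes]
    k = len(gene_set)
    hits = 0
    for _ in range(n_sim):
        # With-replacement weighted draw, deduplicated (a simplification).
        sample = set(rng.choices(genes, weights=w, k=k))
        if len(sample & tf_targets) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

With heavily targeted genes given higher weight, the same observed overlap yields a much larger (more conservative) p-value than the uniform null, mirroring the abstract's argument.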
Short Abstract: The immense complexity of biological processes is one of the obstacles to fully understanding the dynamical capacity of these systems. Computational models provide a vehicle for interrogating and studying the dynamics of biological systems from a holistic perspective through the synthesis of individual pieces of data and knowledge from laboratory experiments.
Cell Collective is a web-based environment that enables scientists to construct, publish, simulate, and analyze large-scale computational (logic-based) models in a highly interactive and collaborative fashion. Models in the platform can be constructed and edited in the context of the biochemical/biological regulatory mechanism, without the need for computer programming or manual construction of the underlying mathematical equations, making the models more accessible to a wider, experimental scientific community. Cell Collective allows scientists to simulate and analyze models' dynamics interactively on the web, including the ability to simulate loss/gain of function and test what-if scenarios in real time. The platform includes a Knowledge Base to facilitate collaborative annotation of models, and track experimental data and evidence associated with each model component and interaction. Cell Collective also provides an interactive repository of (50+) peer-reviewed, published models that can be directly simulated and analyzed within the environment. Furthermore, the software and computational models are also being utilized in a wide range of university life sciences courses as an engaging and interactive tool to teach about biological systems. Finally, models defined in the platform can be shared directly on the web or exported as SBML qual files and other file formats.
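The logic-based simulation and loss-of-function analysis that Cell Collective exposes through its web interface can be sketched with a minimal synchronous Boolean model. The three-node cascade below is a hypothetical toy, not an actual Cell Collective model.

```python
def simulate(rules, state, steps):
    """Synchronously update a Boolean model; `rules` maps each node to a
    function of the current state dict. Returns the state trajectory."""
    traj = [dict(state)]
    for _ in range(steps):
        state = {node: fn(state) for node, fn in rules.items()}
        traj.append(dict(state))
    return traj

def knockout(rules, node):
    """Loss-of-function experiment: clamp `node` to False regardless of inputs."""
    ko = dict(rules)
    ko[node] = lambda s: False
    return ko

# Hypothetical toy cascade: an external signal activates Ras, which activates ERK.
RULES = {
    "Signal": lambda s: True,
    "Ras":    lambda s: s["Signal"],
    "ERK":    lambda s: s["Ras"],
}
```

A gain-of-function variant would clamp the node to True instead; in Cell Collective these what-if scenarios are run interactively rather than in code.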
Short Abstract: While the general process of gene transcription is well understood, the mechanisms by which different genes are activated in different conditions or different cell types are not. Transcription must be precisely controlled for proper development and response to differing conditions, and determining exactly which part of the cellular machinery is responsible for changes in expression is an important task in biology. In order to determine exactly which transcription factors are responsible for very specific conditions, it can be helpful to examine which genes are differentially expressed in similar but slightly different conditions. Here, we consider the problem of taking two closely related differentially expressed gene sets and determining which transcription factors could be responsible for the differences. While identifying transcription factors whose targets are significantly enriched in a set of differentially expressed genes is a common computational task, here we address a subtly but importantly different question: which transcription factors' targets are more significantly overrepresented in one set than another. We present approaches to rank transcription factors based on their regulation of one set of genes as compared to another and apply them to gene expression sets associated with the Mediator complex, a complex essential for most transcription in eukaryotes which may play an important role in differential transcription. We apply our methods to investigate the regulatory differences between CDK8 and CDK19, homologous proteins that function similarly and can alternatively occupy the same position in Mediator. We show that our methods perform substantially better than naïve methods.
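One naive baseline for the comparative question above is to score each TF by the difference of its -log enrichment p-values in the two gene sets. The sketch below uses Fisher's exact upper tail (hypergeometric survival function) and hypothetical TF/gene names; the abstract's actual ranking methods are stated to outperform such naive approaches.

```python
from math import comb, log

def hypergeom_sf(k, M, n, N):
    """P(X >= k): overlap of a size-N draw with n marked genes in a
    universe of M genes (Fisher's exact test upper tail)."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

def rank_tfs(set_a, set_b, tf_targets, universe_size):
    """Rank TFs by how much more enriched their targets are in set A
    than in set B (difference of -log p-values; a naive baseline)."""
    scores = {}
    for tf, targets in tf_targets.items():
        pa = hypergeom_sf(len(set_a & targets), universe_size,
                          len(targets), len(set_a))
        pb = hypergeom_sf(len(set_b & targets), universe_size,
                          len(targets), len(set_b))
        scores[tf] = -log(pa) + log(pb)
    return sorted(scores, key=scores.get, reverse=True)
```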
Short Abstract: Dysregulation in signal transduction results in the development of complex diseases such as cancers. Systemic analyses of these proteins are important to develop better therapeutics. Here, we carried out a computational approach to identify the critical components in a signal transduction logical model consisting of key signaling pathways, including the RTK, GPCR, and integrin pathways. We performed systemic perturbation of 130 components by in silico knockout and over-expression under four environmental conditions stimulating apoptosis, cell growth, cell motility, and quiescence. The influence of each component on the rest of the network was measured by comparing each perturbation with the wild-type condition. Based on this influence, we ranked all 130 components under each environmental condition from most to least influential. Further, the biological relevance of the most influential components was established by analyzing biological functions, gene essentiality, and druggability of proteins. We observed that the most influential components are part of the inositol pathway under inactivating perturbations, and of the kinase and small cell lung cancer pathways under activating perturbations. Furthermore, these components are enriched with essential genes and druggable proteins. We identified regulatory motifs among the most influential components and predicted PI3K and IP3R1 as a combinatorial target for cancer therapeutics. We simulated the synergistic effect of over-expression of IP3R1 with PI3K inhibition. We hypothesize that combinatorial perturbation of PI3K and IP3R1 may increase the rate of apoptosis by increasing the activity of tumor-suppressor and apoptosis-related components, including PLA2 and arachidonic acid. However, further experimental validations are required.
Short Abstract: Historically, modeling metabolism has fallen under two major approaches. Ideally, a well-parameterized kinetic model would provide detailed insight into metabolic processes. However, owing to the challenge of obtaining large-scale kinetic measurements, this approach is not tractable for larger systems. In contrast, constraint-based methods, such as flux balance analysis (FBA), use mass-balance constraints to predict possible solutions of metabolic fluxes. However, the underdetermined nature of FBA requires the use of empirically determined objective functions to “select” an appropriate prediction. Here, we present a theoretical framework for modeling metabolism with statistical thermodynamics. This approach combines the simplicity of constraint-based methods, where no kinetic data are required, with the ability to model metabolism dynamically, as in traditional kinetic models. Metabolism is modeled as a series of states; using only standard free energies of reaction as input parameters, we achieve a stochastic simulation that propagates according to the principles of thermodynamics and maximum entropy. Therefore, no empirically determined objective function is needed to select the optimal solution. Metabolic pathways, such as glycolysis, were simulated, and the predicted metabolite concentrations agree with experimental measurements to within 0.5 log concentration units, with a correlation coefficient of over 0.9. Future directions of scaling up to encompass central metabolism and genome-scale models are also discussed.
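The core idea, sampling metabolic states from free energies alone, with no objective function, can be illustrated with a Metropolis sketch whose long-run occupancies approach the Boltzmann (maximum-entropy) distribution. This toy two-state example and its RT value (≈2.48 kJ/mol at 298 K) are illustrative assumptions, not the authors' simulation framework.

```python
import math
import random

def boltzmann_sample(delta_g, rt=2.48, n_steps=20000, seed=1):
    """Metropolis sampling over states using only standard free energies
    (kJ/mol). Long-run occupancy approaches the Boltzmann distribution,
    so no empirical objective function is needed to pick a solution."""
    rng = random.Random(seed)
    states = list(delta_g)
    current = states[0]
    counts = {s: 0 for s in states}
    for _ in range(n_steps):
        proposal = rng.choice(states)
        dg = delta_g[proposal] - delta_g[current]
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if dg <= 0 or rng.random() < math.exp(-dg / rt):
            current = proposal
        counts[current] += 1
    return {s: c / n_steps for s, c in counts.items()}
```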
Short Abstract: A gene regulatory network (GRN) consists of genes, transcription factors, and the regulatory connections between them that govern the level of expression of mRNA and proteins from those genes. Our open source MATLAB software package, GRNmap (http://kdahlquist.github.io/GRNmap/), uses ordinary differential equations to model the dynamics of medium-scale GRNs. The program uses a penalized least squares approach (Dahlquist et al. 2015, DOI: 10.1007/s11538-015-0092-6) to estimate production rates, expression thresholds, and regulatory weights for each transcription factor in the network based on gene expression data, and then performs a forward simulation of the dynamics of the network. GRNmap has options for using a sigmoidal or Michaelis-Menten production function. Parameters for a series of related networks, ranging in size from 15 to 35 genes, were optimized against DNA microarray data measuring the transcriptional response to cold shock in budding yeast, Saccharomyces cerevisiae, for the wild type strain and strains deleted for the transcription factors Cin5, Gln3, Hap4, Hmo1, and Zap1, giving biological insights into this process. GRNsight is an open source web application for visualizing such models of gene regulatory networks (http://dondi.github.io/GRNsight/index.html). GRNsight accepts GRNmap- or user-generated Excel spreadsheets containing an adjacency matrix representation of the GRN and automatically lays out the graph. The application colors the edges and adjusts their thicknesses based on the sign (activation or repression) and the strength (magnitude) of the regulatory relationship, respectively. Users can then modify the graph to define the best visual layout for the network. This work was partially supported by NSF award 0921038.
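The sigmoidal production model that GRNmap fits can be written as dx_i/dt = P_i·sigmoid(Σ_j w_ij x_j − b_i) − d_i x_i. Below is a minimal explicit-Euler forward simulation of that equation form for a hypothetical two-gene network; GRNmap itself is a MATLAB package and additionally performs the penalized least-squares parameter estimation, which is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_grn(weights, production, degradation, threshold, x0,
                 dt=0.01, steps=5000):
    """Forward-simulate dx_i/dt = P_i*sigmoid(sum_j w_ij*x_j - b_i) - d_i*x_i
    by explicit Euler. `weights[i][j]` is the effect of gene j on gene i."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        rates = [production[i]
                 * sigmoid(sum(weights[i][j] * x[j] for j in range(n))
                           - threshold[i])
                 - degradation[i] * x[i]
                 for i in range(n)]
        x = [x[i] + dt * rates[i] for i in range(n)]
    return x
```

With gene 0 constitutively on (low threshold) and gene 1 activated by gene 0, both genes settle to their sigmoid-determined steady states.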
Short Abstract: Life involves simultaneous progress toward many different biochemical objectives. While increased activity of some processes improves the activity of others, some alterations can hamper the workings of other essential processes. Biological systems need to assess the tradeoffs between the benefits associated with each of these processes and devise regulatory means to ensure the right mix of objective activity for different environmental settings. Mathematically, this type of analysis is called multi-objective optimization (MO). MO analyses in areas ranging from engineering and economics to biology solve for a set of results, called the Pareto frontier, that provides quantitative measurements of how the activity of each system objective affects the value of the others. For examining the complex interplay of biological objectives in genome-scale constraint-based models (GSMs) of cellular metabolism, informative mapping of the multi-dimensional (n>10) Pareto front requires the use of high-performance computing (HPC). Here we report on the development of an HPC tool called HDMOFA for MO analysis of GSMs. As a test case for our analyses, we used the well-curated GSM of Chlamydomonas reinhardtii. Specifically, we have used HDMOFA to examine the effects of the interplay among multiple processes on the phenotype of the mixotrophic metabolism of C. reinhardtii.
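The Pareto frontier mentioned above is the set of objective vectors not dominated by any other; a naive O(n²) filter makes the definition concrete (for maximization). This generic sketch says nothing about HDMOFA's actual HPC algorithm, which must scale to high-dimensional fronts.

```python
def pareto_front(points):
    """Return the maximization Pareto-optimal subset of objective vectors:
    a point is kept unless some other point is >= in every objective."""
    front = []
    for p in points:
        dominated = any(all(q[i] >= p[i] for i in range(len(p))) and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front
```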
Short Abstract:
Background: Neuropeptides are short peptides produced by neurons to communicate with and/or to regulate other cells. Neuropeptides normally perform their functions by interacting with specific receptors. These interactions have been broadly observed in all tissues to regulate many different functions. Recently, several studies demonstrated that specific neuropeptides interact with other neuropeptides. These results are potentially influential, given the widespread distribution and large number of neuropeptides. However, whether or not neuropeptides can be regulated by physical interaction with other neuropeptides is largely unknown at the proteomic level.
Results: 57 out of 110 neuropeptides bind to other neuropeptides to form 64 pairs of physical neuropeptide-neuropeptide interactions (NNIs). These NNIs form complex interaction networks that link to different signaling pathways. The neuropeptides in these networks/pathways are expressed in different tissues at different levels. The interaction sites of 26 pairs of NNIs were identified through an integrated informatics strategy. This integrated strategy also predicted another 28 pairs of NNIs that have not been experimentally validated.
Conclusion: Interaction between one neuropeptide and other neuropeptides may alter the interaction between this neuropeptide and its receptor, leading to the regulation of neuropeptides’ functions by NNIs. The NNI networks and associated signaling pathways are tissue specific, indicating complex patterns of regulation among neuropeptides. Integrated strategies using database search, ELM, Pfam, and coiled-coil prediction are powerful in identifying physical protein-protein interactions and interaction sites.
Short Abstract: Pleckstrin homology domain-containing family A member 7 (PLEKHA7) is a recently discovered adherens junction protein that links E-cadherin and microtubules. PLEKHA7 also interacts with other proteins to regulate different pathways, such as PIK4 and miRNA biogenesis. In addition to these observed molecular roles, PLEKHA7 has been pinpointed as a critical factor in several human diseases, including hypertension, glaucoma, and several types of cancer. However, the mechanisms of PLEKHA7 in these diseases have not been completely elucidated. In this study, we have developed a bioinformatics strategy that integrates database search, secondary structure analysis, miRNA prediction, miRNA target prediction, and pathway analysis to analyze miRNA biogenesis and associated functions. By using this strategy, an intron of PLEKHA7 mRNA was found to produce a miRNA. This miRNA is able to regulate about 50 proteins and multiple pathways. Among these targets, MPZ and PTPRM are two critical proteins in the cell adhesion molecules and adherens junction pathways. This result revealed a novel potential mechanism that is associated with the known function of PLEKHA7. Furthermore, several other pathways, which are regulated by PLEKHA7-associated miRNAs, were also predicted. These results may increase our knowledge of the function and associated mechanisms of PLEKHA7.
Short Abstract: Genome-scale metabolic models have been employed with great success for phenotypic studies of organisms over the last two decades. The most difficult step in the reconstruction of these metabolic models is the manual curation. Although various automated reconstruction methods have been developed to accelerate the reconstruction process, much effort must still be expended to supplement an automatically generated draft model with manually curated information. Here, we utilize a tool for likelihood-based gene annotation previously developed in our lab to create a method that “morphs” a manually curated metabolic model to a draft model of a closely related organism. Our method combines genes from the original, manually curated model with genes from an annotation database to create a final structure that contains gene-associated reactions from both sources. The benefits of the approach are twofold: firstly, the effort and accumulated knowledge that have gone into the construction of the original model are leveraged to create a metabolic model for a closely related organism. Secondly, starting from an already completed and functioning model allows the user to run simulations at every step as necessary, offering the ability to predict how modifications will affect the performance of the model. Using our manually curated model of Methanococcus maripaludis, iMR540 (Richards et al., 2016, in preparation), as the starting point, we employ our method to create morphed models of three related methanogenic archaea. Our morphing method provides a way to quickly reconstruct a clade of metabolic models for related organisms from one manually curated representative.
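The combination step, keeping source-model reactions whose genes map to the target organism and supplementing with draft-model reactions, can be sketched as set operations over gene-reaction associations. This is a drastically simplified stand-in for the likelihood-based morphing method; the data structures and ortholog map below are hypothetical.

```python
def morph_model(source_rxns, draft_rxns, orthologs):
    """Morph a curated model onto a related organism: keep source reactions
    whose genes have orthologs in the target organism, then add draft-model
    reactions the source lacks. Returns a reaction -> gene-set map."""
    morphed = {}
    for rxn, genes in source_rxns.items():
        mapped = {orthologs[g] for g in genes if g in orthologs}
        if mapped:                      # reaction survives only with gene support
            morphed[rxn] = mapped
    for rxn, genes in draft_rxns.items():
        morphed.setdefault(rxn, set()).update(genes)
    return morphed
```

Because the result stays a complete reaction map at every step, the intermediate model can in principle be simulated after each modification, as the abstract emphasizes.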
Short Abstract: High-throughput screening (HTS) assays (e.g. ToxCast™) rapidly gather toxicity data for a large number of chemicals; this provides a wealth of information to group chemicals based on a common mechanism. Mitochondrial toxicity has been extensively studied via cheminformatics, HTS, and in vivo toxicology making it an ideal case study. Previously developed structural alerts were utilized to profile chemicals with in vitro cytotoxicity and mitochondrial toxicity and in vivo data. Of the 380 chemicals identified as being “active” in at least one of the in vitro assays, 62 contained a structural alert for mitochondrial toxicity, which highlights the need to expand the number of structural alerts. Structural similarity, guided by the toxicity profile in the ToxCast assay battery, was then used to group those chemicals that either were not tested in a mitochondrial toxicity assay or were not considered a “hit” and read-across was performed. Preliminary analysis of the results showed a diversity of toxicological endpoints consistent with the pervasive nature of mitochondrial disruption. The Adverse Outcome Pathway (AOP) framework provides a scaffold for better defining the downstream mechanisms connecting mitochondrial toxicity to these diverse endpoints. Current work is focused on increasing our understanding of these mechanisms to improve the ability of the HTS, coupled with structural similarity, to predict in vivo toxicity. Additionally, these results provide a mechanism for predicting toxicity for chemicals based on structural similarity alone prior to HTS toxicity testing.
[This is an abstract or proposed presentation and does not necessarily reflect EPA policy.]
Short Abstract: Pathway diagrams are found everywhere: in textbooks, in review articles, on posters and on whiteboards. Their utility to biologists as conceptual models is obvious. They have also become immensely useful for computational analysis and interpretation of large-scale experimental data when properly modeled. We will highlight the latest developments and newest features of WikiPathways (www.wikipathways.org), a community-curated pathway database that enables researchers to capture rich, intuitive models of pathways. WikiPathways and the associated tools PathVisio and pathvisio.js are developed as open source projects with strong community engagement.
pathvisio.js (https://github.com/wikipathways/pathvisiojs/) is integrated into the WikiPathways website and enables users to zoom in and click on pathway elements to show linkouts to other databases. In the future, pathvisio.js will replace the Java applet editor and introduce a quick and simple way to curate and edit pathways.
The standalone pathway editor and analysis and visualization tool, PathVisio (www.pathvisio.org), was refactored with the goal of achieving a better, modular system that can be easily extended with plugins. Plugins are accessible through the new plugin repository and can be installed through the plugin manager from within the application. This is an important aspect of usability that will allow users to build an application with all the modules relevant for their work. The WikiPathways plugin of PathVisio allows searching and browsing WikiPathways from within PathVisio. Furthermore, users can upload new pathways or update existing pathways.
Short Abstract: One of the main challenges of the post-genomic era is the understanding of how gene expression is controlled. Variation in the levels of gene expression is behind diverse biological phenomena such as development, disease, and adaptation to different environmental conditions. Notably, despite the availability of well-established methods to identify these variations, tools to discern how gene regulation is orchestrated are still required. The regulation of gene expression is usually depicted as a Gene Regulatory Network (GRN), where changes in the network topology represent variations in gene regulation. Like other networks, GRNs are composed of basic building blocks: small induced subgraphs called graphlets. LoTo implements a method that uses several Graphlet Based Metrics (GBMs) to identify topological variations in different realizations of a GRN.
In this approach, GRNs are analyzed to determine the types of graphlet formed by triplets of nodes. Subsequently, these graphlets are compared to those formed by the same three nodes in the other realization of the GRN. In doing so, LoTo applies GBMs to assess the topological similarity between both networks. Experiments performed on randomized networks demonstrate that GBMs are more sensitive to topological variations than other metrics. Additional experiments demonstrate that GBMs are capable of identifying nodes whose relationship with other network components has changed. Notably, due to the explicit use of graphlets, LoTo captures topological variations that are not detected by other approaches. LoTo is freely available as an on-line web server (http://dlab.cl/loto).
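The per-triplet comparison described above, checking whether the same three nodes induce the same directed graphlet in two realizations of a GRN, can be sketched directly. This is an illustration of the comparison step only; LoTo's actual GBMs aggregate these differences into similarity metrics.

```python
from itertools import combinations

def triplet_signature(edges, triplet):
    """Signature of the induced subgraph on three named nodes: the set of
    directed edges among them, with nodes relabeled by sorted position.
    A fixed labeling is correct here because LoTo compares the *same*
    three nodes across the two network realizations."""
    nodes = sorted(triplet)
    idx = {v: i for i, v in enumerate(nodes)}
    return frozenset((idx[u], idx[v]) for u, v in edges
                     if u in idx and v in idx and u != v)

def changed_triplets(edges_a, edges_b, nodes):
    """Triplets whose induced graphlet differs between the two realizations."""
    return [t for t in combinations(sorted(nodes), 3)
            if triplet_signature(edges_a, t) != triplet_signature(edges_b, t)]
```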
Short Abstract: Deregulation of miRNAs is implicated in many diseases, in particular cancer, where miRNAs can act as tumour suppressors or oncogenes. As sequence-based miRNA target predictions do not provide context-specific information, many algorithms combine expression data for miRNAs and genes for prioritization of miRNA targets. However, common strategies prioritize miRNA-gene associations, although a miRNA may only target a subset of the alternative transcripts produced by a gene. Thus, current approaches are suboptimal. Here we address for the first time the problem of transcript-based, rather than gene-based, miRNA target prioritization. We show how to leverage methods that were developed for gene-expression-based miRNA target prioritization for transcripts. In addition, we introduce a new multitask learning (MTL) method that uses structured-sparsity-inducing regularization to improve the accuracy of the learning. The new MTL approach performs especially favorably in small-sample-size settings and for genes with many transcripts, and outcompetes the other approaches on simulated and liver cancer RNA-seq data.
Short Abstract: Gene regulation is largely mediated by intermolecular interactions of non-coding RNAs, and predicting such interactions is of great interest. Identifying the full map of RNA interactomes promises to prove useful in several areas, such as non-coding RNA functional studies, integrative miRNA and gene expression analysis, and interpreting the outcome of RNAi applications. However, such a task requires performing large-scale genome- and transcriptome-wide predictions. To meet these demands, we implemented an efficient method, RIsearch2, that uses suffix arrays to locate perfect complementary seed regions (including G-U wobble pairs) between query and target sequences and extends the possible interaction on both ends of the seeds using a simplified energy model. The large-scale capability is exemplified by a screen of ~2600 human miRNAs on the whole repeat-masked human genome, which takes roughly six hours using 16 CPU cores. This is orders of magnitude faster than other currently available methods. Furthermore, we use RIsearch2 interaction predictions in combination with accessibility and expression abundance information of binding sites to construct an efficient siRNA off-target discovery pipeline. We show that this pipeline can accurately predict the individual off-targets and overall off-targeting potential of siRNAs, which may further influence their repression efficiency.
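The suffix-array seed search at the heart of this approach can be illustrated with a naive sketch: sort all suffixes of the target, then binary-search for the seed. In practice one searches for the reverse complement of the miRNA seed, and RIsearch2 additionally allows G-U wobble pairs and extends hits with an energy model, none of which is shown; the quadratic suffix-array construction below is also far from RIsearch2's efficient implementation.

```python
def suffix_array(s):
    """Naive suffix array: indices of suffixes of `s` in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_seed(target, sa, seed):
    """All start positions of `seed` in `target`, via binary search over
    the suffix array (valid because k-prefixes of sorted suffixes are
    themselves in nondecreasing order)."""
    k = len(seed)
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= seed
        mid = (lo + hi) // 2
        if target[sa[mid]:sa[mid] + k] < seed:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # leftmost suffix with prefix > seed
        mid = (lo + hi) // 2
        if target[sa[mid]:sa[mid] + k] <= seed:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```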
Short Abstract: The ability to acquire, store, process, and use data to advance discovery is limited. Artificial intelligence provides unprecedented opportunities for accelerating scientific advances. Here we adapted IBM Watson technology at scale to the biological literature. Reading all the publications in a research field is universally desired, but impossible. In a first, we demonstrate end-to-end automated knowledge discovery via massive-scale exploration of subtle literature connections. General text-mining and hypothesis-generation algorithms scanned 21 million publication abstracts, read the relevant 240,000, and reasoned over a reliable 130,000 to predict undiscovered kinases likely to phosphorylate p53. Six of these predicted kinases passed experimental validation, including one, NEK2, which was examined in depth and which represses p53 and promotes cell division. This work opens the door to integrating vast corpora of written knowledge, at scale and beyond human capability, in order to compute significant new hypotheses and accelerate scientific discovery.
Short Abstract: 13C metabolic flux analysis (MFA) is widely considered to be the premier method for determining the fluxes through metabolic networks. NMR spectroscopy has played an important role in the accurate identification and quantitation of metabolite isotopomers. However, chemical shift overlap in the commonly used 1D and 2D experiments often hinders the assignment and quantitation of site-specific 13C labeling. 3D NMR techniques hold the potential to alleviate this problem, but have not found widespread use in 13C MFA due to the long acquisition times associated with conventional 3D NMR experiments. Recent advances in fast NMR techniques, such as Non-Uniform Sampling (NUS), provide a solution to this problem. Here we propose the use of a 3D TOCSY-HSQC experiment combined with NUS for 13C MFA. We show that 3D NMR can be used to identify and quantify 13C isotopomers in complex mixtures, such as biomass hydrolysates. The additional chemical shift dimensions provided by the 3D experiment enable easy identification and assignment of amino acid resonances in biomass hydrolysates. The information provided by this technique will improve metabolic flux analysis, especially for large-scale metabolic network models.
Short Abstract: Since myb transcription factor binding sites (TFBS) were first characterized for the C1 gene in maize, many studies have examined them across the whole A. thaliana genome, and prediction of novel myb TFBS in newly sequenced genomes should become feasible.
Significant over-representation of multiple TFBS was found in both promoter and non-coding genomic regions.
Genes co-regulated by the myb transcription factor (TF) were identified from the literature, databases, and other source materials; eight co-regulated genes expressed in closely related species (Arabidopsis, rice) were analyzed with bioinformatics tools to find conserved regions.
Our approach thus uses comparative analysis of these eight myb co-regulated genes to examine differences between monocot and eudicot plants.
In the future, the end goal is to map the TFBS of myb-regulated genes onto the phenylpropanoid biosynthesis pathway and its metabolic system.