
Abstract:
With the exponential increase in data, researchers need resources that can analyze such data. The rapidly growing number of resources developed for these analyses can be overwhelming to researchers. A collaboration between librarians and the biomedical community can help streamline the process of choosing a resource. Bioinformationists can help biomedical researchers by learning about resources, providing recommendations, and teaching researchers how to use them. This poster discusses the evolution of two bioinformationists’ experiences teaching two molecular network visualization tools, Cytoscape and VisANT. The training audience begins with researchers in small training classes but ultimately evolves to include other librarians who can bring that knowledge back to their own institutions and researchers. This evolution shows the increasing role of librarians in the field of biomedical informatics.
Abstract:
Genome-wide association studies are used to find associations between diseases and genes when a genetic link is suspected but it is not known which genes are involved. Such studies are often done in two stages. The first stage typically tests hundreds of thousands of SNPs to identify which SNPs are most likely linked to the disease. The second stage tests SNPs chosen from the first-stage analysis, either to confirm the association or to eliminate the SNP as a false positive. In practice, the number of SNPs chosen from the first stage for second-stage analysis is arbitrary. SNP results are not statistically independent, but are treated as such for the selection process. SNPs are typically selected based on which SNPs' first-stage results are least likely to have occurred by chance, without regard to the results of other SNPs. Since SNPs in the same location can be strongly linked to each other, this results in redundant testing at the expense of investigating other SNPs. Better technology means more SNPs are being tested during first-stage analysis, which makes this problem worse. This problem can be avoided by selecting the subset of SNPs which, taken together as a set, are the least likely to produce the first-stage results by chance alone.
Abstract:
The minimal Distance Constraint Model (mDCM) [1, 2] is a computational modeling scheme that integrates thermodynamic and mechanical descriptions to compute Quantitative Stability/Flexibility Relationships (QSFR) of protein structure. We have previously employed the mDCM to predict melting temperatures of c-type lysozyme mutants with a maximum error of 4.3%. Going further, we now assess the differences in other QSFR quantities across the dataset. The model is parameterized by fitting to experimental heat capacity curves. Subsequently, a large number of mechanical descriptions of protein flexibility are calculated. Pronounced changes in rigidity and flexibility occur over long ranges from the site of mutation. Drastic changes occur in the backbone flexibility at the mutation site and at neighboring residues within 6 Å. Increases in rigidity and increases in flexibility are frequent and nearly equal in number, and together they affect more sites than remain unchanged. Interestingly, the ?-domain becomes flexible in all cases, likely leading to domain unfolding related to aggregation. Changes in the underlying H-bond network as a result of mutation affect catalytic residues Glu35 and Asp53. We also present changes in pairwise residue-to-residue couplings that can affect functional collective motions.
[1] Jacobs, D.J., and Dallakyan, S. Elucidating protein thermodynamics from the three-dimensional structure of the native state using network rigidity. Biophys. J, 2005. 88: p. 903-15.
[2] Livesay, D.R., Dallakyan, S., Wood, G.G. and Jacobs, D.J. A flexible approach for understanding protein stability. FEBS Lett, 2004. 576: p. 468-67.
Abstract:
Alignment trimming is a common operation performed on alignments of multiple amino acid sequences prior to their use in phylogenetic analysis. The goal is to eliminate poorly aligned positions and divergent regions from the alignment before analysis. Several methods have been developed in recent years to automatically detect and remove these regions from alignments. Among the most widely used programs are Gblocks, trimAl, Noisy, and BMGE. However, there is considerable variation in how the methods define these poorly aligned positions and divergent regions. Regardless of the technique used, alignments run through a trimmer tend to be reduced in size in proportion to the number of columns that are phylogenetically uninformative or contain gaps. For alignments of typical diverse protein families, it is not unusual for a trimmer to reduce the number of columns analyzed by 40-60%.
This study presents an examination and evaluation of the effects of commonly used trimming programs on several protein families, including the ABCG, GST and ALDH protein families. Trimmed alignments are compared by constructing bootstrapped phylogenetic trees using distance-based phylogenetic methods. The effects on the resulting phylogenies are evaluated with bootstrap values on consensus trees, with SplitsTree to locate areas of uncertainty in the branch splitting, with tree comparison metrics from the TOPD/FMTS and Mesquite software packages, and by examining the effects on rogue taxa.
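As a rough illustration of what a trimmer does to an alignment, the sketch below drops every column whose gap fraction exceeds a cutoff. It is not the algorithm of Gblocks, trimAl, Noisy, or BMGE, each of which applies its own criteria; the function name, the `max_gap_fraction` parameter, and the toy sequences are assumptions made for this example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    This is a simple gap filter, not the method of any published trimmer.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for col in range(length):
        gaps = sum(1 for name in names if alignment[name][col] == "-")
        if gaps / len(names) <= max_gap_fraction:
            kept_columns.append(col)

    return {
        name: "".join(alignment[name][col] for col in kept_columns)
        for name in names
    }


# Hypothetical toy alignment (three sequences, eight columns).
toy = {
    "seqA": "MK-LV--A",
    "seqB": "MKTLV--A",
    "seqC": "M--LVG-A",
}
trimmed = trim_gappy_columns(toy, max_gap_fraction=0.5)
```

Here three of the eight columns are removed because more than half of their entries are gaps. A stricter or looser cutoff changes how much of the alignment survives, which is the kind of variation between trimmers that the study evaluates.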
Abstract:
This paper presents a fully automatic method for segmentation of Multiple Sclerosis (MS) lesions from multiple-sequence MR (T2-weighted and FLAIR) images. Our method treats MS lesions as outliers to the normal brain tissue distribution, and the separation is achieved by minimizing a statistically robust L2E measure, which is defined as the integrated squared difference between the true density and the assumed Gaussian mixture. The method is fully automatic and does not require any training, atlas or thresholding steps. The results of our method are compared with lesion delineations by human experts, and a high classification accuracy is demonstrated on 16 datasets containing small to moderate lesion loads.
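For reference, the L2E measure mentioned above is usually written as follows. The notation is an assumption introduced here, not taken from the paper: g is the unknown true intensity density, f(x | θ) is the Gaussian mixture with parameters θ, and x_1, ..., x_n are the observed voxel intensities.

```latex
\begin{align}
\mathrm{L_2E}(\theta) &= \int \bigl[ f(x \mid \theta) - g(x) \bigr]^2 \, dx \\
\hat{\theta} &= \arg\min_{\theta} \left[ \int f(x \mid \theta)^2 \, dx
  - \frac{2}{n} \sum_{i=1}^{n} f(x_i \mid \theta) \right]
\end{align}
```

The second line follows from expanding the square, dropping the term in g alone (it does not depend on θ), and replacing the cross term with its sample average. Observations that receive low density under the fitted mixture contribute little to the criterion, which is why outliers such as lesion voxels have limited influence on the fit.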
Abstract:
Next-generation sequencing (NGS) technologies have established themselves as a key research tool in biological and biomedical research. The large amounts of data produced by these platforms, however, can be overwhelming and have posed a great challenge for bioinformaticians. New tools are introduced every week for handling NGS data, particularly for aligning the data to reference genomes. Because impartial studies comparing these tools are lacking, choosing the most appropriate tool can be confusing. With this in mind, we have performed a comparative study of popular NGS alignment tools (soap, bowtie, bwa, bfast and ABI mapreads) using both simulated and real RNA-Seq data generated on the ABI SOLiD platform. Using our RNA-Seq data, we have also compared two RNA-Seq pipelines established to identify splice junctions: Tophat and the ABI Whole Transcriptome pipeline.
Abstract:
Large-scale protein-protein interaction datasets have been generated for several species, including yeast and human, and have enabled the identification, quantification and prediction of cellular molecular networks. Affinity purification-mass spectrometry (AP-MS) is the preeminent methodology for large-scale analysis of protein complexes. A typical AP-MS workflow expresses an epitope-tagged “bait” protein in cultured cells, recovers associated protein complexes using an antibody against the epitope tag, and then identifies and quantifies constituent proteins using mass spectrometry. The analysis and interpretation of these datasets are, however, not straightforward, in part due to the relatively low signal-to-noise ratio. Here we describe a framework for global analysis of large-scale interaction proteomics datasets of human proteins that incorporates elements of signal processing to improve the confidence of derived protein-protein interactions. The dataset consists of approximately 400 human bait proteins (of which approximately 50% have two or more replicate experiments). In this study, we use the D-score, which combines peptide abundance and frequency measures and gives weight to prey proteins that are replicated across bait experiments. We show how global co-occurrence profiles of proteins in large datasets may be used to infer protein-protein interactions. This presentation will outline how large-scale interaction proteomics datasets are generated and will describe our framework (see outline in Figure 1), the interaction scores, and how interactions may be validated using orthogonal public datasets. Finally, we present integrated models of the Eukaryotic Initiation Factor and Proteasomal networks derived from our dataset.
Abstract:
The 5'-cap structure of all polymerase II-transcribed RNAs plays important roles in mRNA processing, export, translation, and turnover, and loss of the cap is a regulated process that is considered to be the first irreversible step in mRNA decay. CAGE and deep sequencing identified capped transcripts missing portions of their 5' ends, presumably by decapping and exonuclease decay or by endonuclease cleavage. We identified a cytoplasmic complex capable of capping these RNAs, and in this study we identify its substrates using a dominant-negative form of cytoplasmic capping enzyme (DN-cCE) to interfere with re-capping. Uncapped mRNAs are then identified by their increased susceptibility to degradation in vitro by a 5' exonuclease (Xrn1). As a proof of principle, RNA from cells in which DN-cCE is OFF or ON was first analyzed via a novel algorithm that identifies transcripts showing partial 5' end loss from Affymetrix human Gene-ST arrays in which individual probes were matched to their corresponding exons. This identified ~4,500 transcripts that have significantly decreased expression (log fold change less than or equal to -1) for a contiguous 5' span of probes as compared with the remaining downstream probes in the transcript. We used these data to develop methods to graphically represent individual exons as a function of their position across each RefSeq transcript and applied this to several rounds of biological replicates analyzed on Affymetrix human Exon-ST arrays. We will present these data and their relationship to those generated by others using CAGE for the first definitive identification of re-capping substrates.
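The idea of flagging a contiguous 5' span of decreased probes can be sketched as below. This is a minimal illustration, not the authors' algorithm: it assumes the probes of a transcript are already ordered 5' to 3', it uses the log fold change cutoff of -1 quoted above, and the `min_span` parameter and the function name are hypothetical. No significance testing is attempted.

```python
def five_prime_loss_span(log_fc, threshold=-1.0, min_span=2):
    """Return the number of leading (5'-most) probes with log FC <= threshold.

    log_fc: per-probe log fold changes (DN-cCE ON vs. OFF), ordered 5' -> 3'.
    Returns 0 if the leading run is shorter than min_span, or if no
    downstream probe remains (i.e. the whole transcript is decreased,
    which is not a *partial* 5' end loss).
    """
    span = 0
    for value in log_fc:
        if value <= threshold:
            span += 1
        else:
            break

    has_downstream = span < len(log_fc)
    if span < min_span or not has_downstream:
        return 0
    return span


# Hypothetical per-probe values for three transcripts.
partial_loss = five_prime_loss_span([-1.8, -1.2, -1.1, -0.2, 0.1, -0.1])
single_probe = five_prime_loss_span([-1.5, 0.0, -1.4])
whole_transcript = five_prime_loss_span([-1.2, -1.3, -1.1])
```

Only the first transcript is flagged, with a span of three probes. The other two are rejected because the decreased run is too short or covers the entire transcript.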
Abstract:
Computational approaches are a necessary complement to empirical methods of identifying transcriptional cis-regulatory modules (CRMs, or enhancers). When such approaches have been applied to Drosophila, they have typically exploited knowledge about specific transcriptional networks and cannot be applied to organisms lacking Drosophila’s rich molecular genetic data, such as the arthropod emerging model organisms Nasonia, Tribolium, Apis, and Anopheles. The absence of discernible non-coding sequence alignment between Drosophila and any of these species makes comparative genomics approaches futile, and only a handful of CRMs have been characterized so far. We previously developed methods for “motif-blind” CRM discovery that do not depend on accurate prediction of transcription factor binding sites (TFBSs) and that use knowledge of existing CRMs to “supervise” the search. We have also presented evidence suggesting that Drosophila TF binding specificities are largely conserved in these species. We reasoned that it should be possible to begin with a set of related Drosophila CRMs and locate the orthologous CRMs in highly diverged arthropod genomes. To test this hypothesis, we performed supervised CRM prediction in these four species for over 30 regulatory networks from Drosophila. By examining whether the predicted CRMs fall in the proximity of the expected genes, we found 16 networks where our approach shows statistical evidence of recovering CRMs in at least two non-Drosophila species. We will describe the statistical and visualization tools developed for cross-genus CRM discovery and the results of both in silico and in vivo validation of predicted CRMs, and discuss how TFBS counts and arrangements have diverged in the course of evolution.
Abstract:
Although synonymous codon usage in mammals has been investigated for decades, the interpretation of the observed results remains controversial. Selectionism cannot explain the strong regularities in codon bias, while neutralism is unable to account for the unusually high frequency of G or C nucleotides (~60%) in the third codon positions across all mammals. We performed a genome-wide computational analysis of synonymous codon usage in relation to local genomic GC-content, evolutionary conservation of gene sequences, and the level of gene expression. In this study, the Codon Bias Index (CBI) is used to measure codon bias within individual genes. The data presented were gathered from multiple available databases such as the Codon Usage Database, the Animal Genome Size Database, and the Exon/Intron Database. Our results suggest that local GC-content is a major contributor to the non-randomness of codon usage in mammals. Based on the results obtained, we propose a unified hypothesis for the origin of GC-rich and GC-poor isochores and of codon bias in mammals and vertebrates.
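The third-position G or C frequency cited above (often called GC3) is straightforward to compute for a coding sequence. The sketch below is an illustration only; the function name and the example sequence are made up, and for simplicity it counts every codon, including codons such as ATG that have no synonymous alternative.

```python
def gc3_content(cds):
    """Fraction of codons in a coding sequence whose third base is G or C.

    cds: in-frame coding sequence (DNA alphabet), length a multiple of 3.
    """
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    third_positions = cds[2::3]  # every third base: indices 2, 5, 8, ...
    gc_count = sum(1 for base in third_positions if base in ("G", "C"))
    return gc_count / len(third_positions)


# Hypothetical five-codon sequence: ATG GCC AAG TTT GGC
example_gc3 = gc3_content("ATGGCCAAGTTTGGC")
```

Four of the five third positions in the example are G or C. Comparing such per-gene values with the GC-content of the surrounding genomic region is one way to examine the relationship the abstract describes.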
Abstract:
When genome sequences are obtained from organisms with different associated phenotypes, it should be possible to identify those sequence properties which confer a given phenotype. However, the evolutionary relationships between organisms lead to non-independence between the sequence properties. For example, the HIV-1 virus has a population structure reflecting both transmission between individuals and evolution of the HIV-1 quasispecies within each patient. This non-independence can introduce interdependence between unrelated mutations giving a false appearance of causation. These evolutionary relationships are an issue even in HIV-1 where recombination is rapid, and are pervasive in humans, where linkage disequilibrium is extensive. In human disease studies, this can sometimes be overcome by comparing siblings: alleles common only in “sick” siblings are likely true causative alleles. GENPHEN identifies, in a phylogenetic reconstruction, sibling lineages where the phenotype varies. Then, GENPHEN uses modified proportional hazard models to identify causal polymorphisms. GENPHEN’s advantages include: speed practical for high-throughput sequence data, estimates of relative strength or speed of different effects, and improved precision even vs. other tree-based methods: 50%-300% improvement in precision at same recall, either to predict experimental correlations (obtained from STRING: http://string-db.org/) or in simulations under biologically reasonable parameters on HIV quasispecies sequence trees.
Abstract:
Evolutionary genetics has long recognized that adaptive evolution is contingent upon the production of novel genotypes by mutation. Purifying selection must then purge genomes of deleterious mutations, while positive (Darwinian) selection must facilitate the fixation of adaptive genotypes. However, a large class of low-impact mutations exists in biological organisms for which selection is largely ineffective. Population size has been the primary focus of factors determining the efficacy of selection, but many other factors also contribute. This work uses the Avida software to study digital organisms, short self-replicating computer programs, in order to explore the evolutionary consequences of low-impact mutations in a computational system. Additionally, the population genetics simulation Mendel’s Accountant is utilized to examine the long-term effects of deleterious mutation accumulation with more precision. It is concluded that the neutral behavior of low-impact deleterious mutations may pose a threat to the long-term survival of many biological species, including humans, and that beneficial mutations may be incapable of reversing this process. Theoretical and medical implications of this work are also discussed, including the limits of natural selection, the role of mutation in human disease, and the use of lethal mutagenesis to cure pathogen infections and prevent pandemics.
Abstract:
Cis-regulatory elements (CREs) in promoters define transcription start sites (TSSs), facilitate transcription initiation, and aid in transcription regulation. One group of CREs, called core promoter elements, aids in transcription initiation, whereas another group, called upstream promoter elements, mainly participates in regulating transcription. A genome-wide analysis of cis-regulatory elements and alternative promoters in Chlamydomonas reinhardtii (Chlamy) will be vital in chloroplast transformation and flagellar regeneration studies, for which Chlamy is considered an important model organism. The goal of this project is to determine computationally the core and upstream promoter elements present in Chlamy. Two different approaches are being used to analyze Chlamy promoter elements: a statistical approach (LDSS) and a software approach (MEME). The LDSS approach revealed that the TATA box is the most prominent core promoter motif and the only major core promoter element in Chlamy. This phenomenon is also evident in yeast, but not in plants or animals. Moreover, the TATA box seems to be more degenerate in protists and plants than in animals. There is no indication of a universal upstream promoter element for all Chlamy genes, but it is widely known that functionally similar genes contain conserved upstream promoter elements. Genes were annotated based on the GO and KEGG databases. Currently, the annotation is used to cluster the genes into functionally similar groups, and the MEME software is used to discover conserved motifs within each group. Discovered motifs will be compared to known motifs to find any similar experimentally verified motifs in other species.
Abstract:
Gravitropism is a growth response that enables the plant to orient its organs for efficient utilization of available resources. Sedimentation of statoliths (graviperception) upon reorientation of inflorescence stems is translated into asymmetric auxin distribution, resulting in bending against the gravity vector (graviresponse). The gravity persistent signal (GPS) treatment uses a cold treatment to separate the signal transduction phase from the response and was used to identify mutants specifically defective in mechanisms prior to auxin transport. At 4°C, starch-containing amyloplasts in the endodermis of the inflorescence stems sediment normally, but auxin transport is abolished, indicating that the cold treatment affects early events of the signal transduction pathway. Although the GPS treatment has proven a robust screen for identifying genes and mutants, there are critical signaling components that it does not identify. The objective is to further exploit the GPS response to identify additional components of the signaling pathway using a proteomics approach. Proteins have been extracted at significant time points during the GPS treatment and will be assessed for differential expression using iTRAQ (isobaric tag for relative and absolute quantification) reagents. iTRAQ allows differential labeling of experimental and control fractions so that differentially expressed proteins can be quantified and identified in a single step: LC-MS/MS (liquid chromatography-tandem mass spectrometry). Proteins that show differential expression will be characterized and placed in the gravitropic response pathway.
Abstract:
Yin Yang 1 (YY1) is a ubiquitous zinc finger transcription factor (TF). It is highly conserved, targets a variety of cellular and viral genes, and plays an important role in coordinating multiple biological pathways through a complex transcriptional network.
MicroRNAs (miRNAs) are novel regulators of gene expression in eukaryotes that can bind to a specific target sequence within a much longer messenger RNA (mRNA), inhibiting its translation and thus controlling expression of the target gene.
Skeletal muscle cells provide a powerful model for understanding transcriptional regulation during cell differentiation. So far, several TF-miRNA-mediated regulatory modules have been verified. Considering the prevalence of YY1 binding sites across the whole genome, we speculate that a large number of miRNAs could form regulatory networks together with YY1 and play important roles in gene regulation during skeletal myogenesis. In order to construct this network, we computationally identified a list of YY1-regulated Protein Coding Genes (PCGs) and miRNAs. We then constructed a YY1/miRNA regulatory network by combining these in silico predictions with publicly available data from a variety of experimental platforms, such as microarray gene expression and RNA-seq, in skeletal myogenesis. We show that YY1 is predicted to be a major regulator of various miRNA-TF-PCG networks. Several identified sub-networks involving Hand2 and Ezh2 are supported by the literature.
Our results show that the combination of computational prediction with high-throughput experimental data can provide novel insights into TF-miRNA-mediated gene regulatory networks.
Abstract:
Muscle contraction is the result of the interaction of myosin with actin and ATP. The formation of this actomyosin interface may play a role in the generation of force in muscles. Kinetic studies of binding between myosin sub-fragment 1 (S1) and F-actin revealed a two-step binding process. Chemical cross-linking of S1 and F-actin demonstrated that the myosin head initially binds via the loop 635-647 to the N-terminus of one actin and then via the loop 567-574 to the N-terminus of the second actin. Depending on the degree of saturation of F-actin with S1, two structurally different complexes are formed: at complete saturation each S1 binds only one actin and its cleft is closed, while at partial saturation S1 interacts with two actins and its cleft is open. The transition between the one- and two-actin binding states of myosin, accompanied by opening of the cleft in the central domain of S1, might be associated with force generation. Early computational docking of S1 with F-actin demonstrated that both actin monomers are located in the same strand of F-actin. As such, we theorize that the formation of the actomyosin complex is a sequential multi-step process that begins with myosin weakly binding to the N-terminus of one actin, then rotating towards the barbed end of F-actin to create a second, stronger bond to the N-terminus of the second actin on the same strand. To investigate the formation of this complex, we will use experimental data to guide protein-protein docking techniques.
Abstract:
Numerous transcriptomic studies have indicated that genes are not independently but rather coordinately expressed, such that manipulation of a single gene has ripple effects on numerous others, located on all chromosomes and involved in a wide diversity of processes. Because of expression coordination, any gene is directly or indirectly related to every functional pathway. However, genes are not of equal importance in controlling the pathways. Therefore, we have developed the Prominent Gene Analysis (PGA) as an alternative to the Principal Component Analysis (PCA) to select the most relevant genes for a given biological process. The difference is that while PCA considers as “centroids” the most alterable genes in a set of conditions, PGA selects those forming the most interconnected and stably expressed web responsible for a given function in each condition. The selected web, termed functional genomic fabric, exhibits characteristic gene composition and topology that respects the “transcriptomic stoichiometry” of the functional pathway. The fabric remodels in response to various diseases, changes in environmental conditions, and the action of intra- and inter-fabric modulating transcriptomic networks that may cross cellular boundaries, mediated by intercellular communication. We have already applied PGA to determine the myelination fabric in oligodendrocytes and its modulation by intercellular cytokine and Ca2+ signaling, as well as the heart rhythm determinant fabric and its sex dichotomy and sensitivity to oxygen deprivation. Here, we introduce the “genomic landscape” to further characterize a genomic fabric and its regulation by intercellular signaling, and the “transcriptomic distance” to measure the overall transcriptomic differences between two conditions.
Abstract:
Multiple sclerosis (MS) is a neurodegenerative, often disabling, autoimmune demyelinating disease whose etiology is not fully understood and for which no complete cure is yet established. We hypothesize that the autoimmune destruction of myelin is induced by the crosstalk among myelination (MYE), apoptosis (APO) and immune/inflammatory response (IIR) genomic fabrics. Such crosstalk occurs when fabric remodeling in response to variable environmental conditions and hormonal activity exceeds critical limits (which depend on sex, genetics, age, ethnicity and climate). The genomic fabric paradigm is based on the observation that expression levels of individual genes are tied to each other in partially overlapping functional pathways to ensure proper “transcriptomic stoichiometry”. The topology of the genomic fabric is controlled by intra- and inter-fabric transcriptomic regulatory networks that may cross cell boundaries, mediated by intercellular signaling. We studied and quantified the topological alterations of the MYE, APO and IIR fabrics and the transcriptomic networks by which these fabrics modulate each other in the spinal cord of the MBP experimental autoimmune encephalomyelitis (EAE) mouse model of MS. Moreover, we have deconvoluted and ranked the pathways by which the sex hormone receptors control the MYE, APO and IIR fabrics. Results of this study may open and test novel therapeutic avenues in the gene therapy of MS.
Abstract:
Germline mutations in the folliculin (FLCN) gene are associated with Birt-Hogg-Dubé syndrome (BHDS), a disease characterized by papular skin lesions, spontaneous pneumothorax, and renal neoplasias. The majority of renal tumors that arise in BHDS-affected individuals are histologically similar to sporadic chromophobe renal cell carcinoma (RCC) and sporadic renal oncocytoma. However, most sporadic tumors lack FLCN mutations and the extent to which the BHDS-derived renal tumors share genetic defects associated with the sporadic tumors has not been well studied.
Comparative gene expression profiling analyses were carried out on renal tumors isolated from individuals afflicted with BHDS and on a panel of sporadic renal tumors of different subtypes using discriminant and clustering approaches, and qRT-PCR was used to confirm selected results. We further analyzed differentially expressed genes using gene set enrichment analysis and pathway analysis approaches, and confirmed the findings with independent pathway signatures and application to additional datasets.
BHDS-derived renal tumors showed distinct gene expression and cytogenetic characteristics from sporadic renal tumors. The most prominent molecular feature of these tumors was high expression of mitochondria- and oxidative phosphorylation-associated genes. This mitochondria expression phenotype was associated with deregulation of the PGC-1α-TFAM signaling axis. Loss of FLCN expression across various tumor types is also associated with increased nuclear mitochondrial gene expression.
Our results support a genetic distinction between BHDS-associated tumors and other renal neoplasias. In addition, deregulation of the PGC-1α-TFAM signaling axis is most pronounced in renal tumors that harbor FLCN mutations and in tumors from other organs that have relatively low expression of FLCN.
Abstract:
Rapid viral evolution due to errors in the replication machinery, a high mutation rate and frequent recombination leads to high variability of human immunodeficiency virus type 1 (HIV-1) and hence remains a challenge to vaccine development. Interaction with the host immune response also plays a major role in HIV-1 evolution. The persistent elimination pressure due to recognition of cytotoxic T lymphocyte (CTL) epitopes by Human Leukocyte Antigen (HLA) molecules often promotes escape mutations in HIV-1 proteins. The recognition of epitopes is further complicated by the enormous diversity of HLA class I alleles. Many HLA alleles share overlapping peptide repertoires and are thus grouped into supertypes.
We recently described a set of highly conserved HIV-1 epitopes which are strongly associated (Paul & Piontkivska, 2010). Of these epitopes some are restricted by supertype alleles. It is unclear whether such conserved epitopes change in response to immune pressure. Thus in this study we examined the substitution patterns of epitopes restricted by supertype (S) and non-supertype (NS) HLA alleles to determine whether higher cumulative frequencies of supertype alleles can be linked with higher nucleotide substitution rates.
We focused on 30 associated CTL epitopes from 3 different genes, namely Gag, Pol and Nef, from four geographical regions (China, India, Thailand and USSR). Notably, in all four populations the cumulative frequency of supertype HLA alleles exceeds 60%. Comparison of the rates of non-synonymous and synonymous substitution between supertype- and non-supertype-restricted epitopes revealed gene- and country-specific differences. Our results have important implications for vaccine design that targets non-variable regions in the HIV-1 genome.
Abstract:
Gravitropism refers to a plant’s bending in response to gravity, which can be divided into three steps: gravity perception, signal transduction and the growth response. Previous work has contributed extensively to the knowledge of each phase, especially the perception and response stages. However, a systematic, genomic approach has not been taken to define these events until now. Here we built an interaction network using a combination of microarray data and published experimental data on genes involved in gravitropism in the inflorescence stem. First, a gene interaction network was developed from interactome information collected from 7 different databases. This network includes 19,029 nodes (each representing a gene) and 329,825 edges (each representing a putative interaction). Second, a collection of microarray data (Affymetrix) from inflorescence stems was obtained from the AtGenExpress and NASC databases to serve as a baseline expression. Then, a network analysis was performed: the important hub genes and GO clusters were identified, and the crucial pathways were selected based on the AraCyc database. Next, the experimentally verified genes identified as involved in gravitropism were mapped to the existing network, and their functions and modules were proposed. Last, the microarray data on gravitropism signal transduction collected in our previous work were analyzed. Similar methods were applied and the results were compared with the background network modules. We expect that the results will reveal more information on gravitropism through systematic analysis and mining across different databases.
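One common way to pick out hub genes in an interaction network like the one described above is to rank nodes by degree, that is, by their number of distinct interaction partners. The abstract does not state which criterion was used, so the sketch below is an assumption; the function name, the `top_n` parameter, and the placeholder gene names are hypothetical.

```python
from collections import defaultdict


def hub_genes(edges, top_n=3):
    """Rank genes by degree in an undirected interaction network.

    edges: iterable of (gene_a, gene_b) pairs; duplicate and reversed
    pairs are counted once, and self-interactions are ignored.
    Returns a list of (gene, degree) tuples, highest degree first,
    with ties broken alphabetically.
    """
    neighbors = defaultdict(set)
    for gene_a, gene_b in edges:
        if gene_a == gene_b:
            continue
        neighbors[gene_a].add(gene_b)
        neighbors[gene_b].add(gene_a)

    ranked = sorted(neighbors, key=lambda gene: (-len(neighbors[gene]), gene))
    return [(gene, len(neighbors[gene])) for gene in ranked[:top_n]]


# Placeholder network; real input would be the merged interactome edges.
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
    ("geneD", "geneE"),
    ("geneB", "geneA"),  # reversed duplicate, counted once
]
top_hubs = hub_genes(toy_edges, top_n=2)
```

In the toy network, `geneA` has three partners and ranks first. For a network with hundreds of thousands of edges the same approach still runs in a single pass over the edge list.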
Abstract:
Metabolomics seeks to reverse engineer metabolic profiles in order to understand the dynamics of the reactions in a metabolic network. Given that the metabolic sub-network is very large, and the number of alternative scenarios about the dynamics of the metabolism is exponential, efficient computational algorithms are needed in metabolomics. The Steady-State metabolic network Dynamics Analysis (SMDA) Tool is a recently proposed computational tool that (i) employs a metabolic network database containing a multi-compartment, multi-tissue mammalian metabolic network, (ii) captures metabolic biochemistry principles in its computations, and (iii) given a set of metabolite measurements, locates all possible activation/inactivation scenarios of the biochemical network at steady state. The SMDA Tool is part of a metabolic analysis workbench called PathCase-MAW and is available in a web browser environment. The SMDA Tool allows a user to input (i) a metabolic sub-network of interest by choosing pathways, reactions and transport processes from its database, and (ii) metabolite measurements (either in-tissue or in-biofluid). It then (a) visualizes the sub-network of the user’s choice, (b) runs the SMDA algorithm online to produce a set of activation/inactivation scenarios for the selected network, and (c) visualizes the scenarios found in a sequential manner. The SMDA Tool is free and available online at http://nashua.case.edu/PathwaysMAW for the use of the research community.
Abstract:
As members of the flavivirus genus continue to establish themselves in new environments, greater efforts are made to understand these viruses. The vectors that transmit the viruses are able to inhabit new regions as a result of a variety of factors including human travel, deforestation, increases in population densities, and some have hypothesized climate change has a role as well. For the flaviviruses, the vector's genome can tell us much about their evolutionary history, as viruses exhibit great specificity towards the organisms they infect. Indeed, many viral pathogens depend wholly upon their host's replicative machinery, and as such it has been proposed that selection favors a genomic composition similar to that of their host. In this study, complete genomic sequences of 3,241 flavivirus strains were analyzed for mononucleotide, dinucleotide, codon and amino acid frequencies for both inter- and intraspecies comparisons. Between species, we have found compositional biases reflective of the hosts these strains infect. Strains exclusive to mammalian hosts exhibit suppression of CpG, and the degree to which this is observed reflects the time span they have been infecting these hosts. Within species, dinucleotide and codon usage correlations indicate either low tolerance for variation in these usage patterns, or else a possible bias in the samples used for sequence analysis. Through exhaustive analysis of all publicly available genomes of this genus, pathogen-host genomic composition compatibility is observed.
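As an illustration of the kind of dinucleotide statistic mentioned above, CpG suppression is commonly expressed as an observed/expected odds ratio. The sketch below is not the authors' code; the function name and the choice of a single-strand, non-symmetrized ratio are assumptions made for this example.

```python
from collections import Counter


def dinucleotide_odds_ratio(seq: str, dinuc: str = "CG") -> float:
    """Observed/expected frequency of one dinucleotide in a single sequence.

    Values well below 1 suggest the dinucleotide is suppressed;
    values near 1 suggest no compositional bias.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    # Count overlapping dinucleotides along the sequence.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    f_xy = pairs[dinuc] / (n - 1)
    f_x = mono[dinuc[0]] / n
    f_y = mono[dinuc[1]] / n
    if f_x == 0 or f_y == 0:
        return 0.0
    return f_xy / (f_x * f_y)
```

Applied to each genome in turn, such a ratio could then be compared between strains grouped by host.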
Abstract:
Genes are the biological units responsible for hereditary characteristics in living organisms. It takes years of arduous research to predict relations among genes using biological experiments. This research focuses on exploring an alternative solution by applying Bayesian networks to microarray experiments. Biological pathways are used to depict the interactions among genes during biological processes. In this study, such pathways were selected from the EcoCyc database for the bacterial strain E. coli MG1655. For the genes involved in the pathways, microarray data from seven experiments were obtained from the Many Microbe Microarray Database. The datasets were used for building directed acyclic graphs (DAGs) using Bayesian networks. DAGs were obtained for each experiment with a set of topological orders. A union graph was constructed based on the frequency of occurrence of each edge in the DAGs obtained. A final consensus graph was obtained by selecting a threshold frequency for the edges. A comparison of the resulting consensus graph to the biological pathway indicates that the existing gene relations can be replicated to a major extent. In addition to the existing gene relations described in the pathways, a few other edges were also found to have quite a high frequency of occurrence. Previous research based on biological experimental studies confirmed several gene relations that were revealed by these edges. The results indicate that Bayesian networks can play a vital role not only in confirming existing gene relations, but also in predicting possible gene interactions.
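The union-graph and threshold step described above can be sketched as follows. This is a minimal illustration, assuming each learned DAG is given as a list of directed (parent, child) edges; the function name and the gene labels in the usage note are hypothetical, and learning the DAGs themselves is not shown.

```python
from collections import Counter


def consensus_edges(dags, threshold):
    """Return edges whose frequency across the DAGs is at least `threshold`.

    dags: list of edge lists, one per learned DAG; each edge is (parent, child).
    threshold: minimum fraction of DAGs (0..1) in which an edge must occur.
    The result maps each retained edge to its observed frequency.
    """
    if not dags:
        return {}
    counts = Counter()
    for edges in dags:
        # Count each edge at most once per DAG.
        counts.update(set(edges))
    n = len(dags)
    return {edge: c / n for edge, c in counts.items() if c / n >= threshold}
```

For example, `consensus_edges([[("geneA", "geneB")], [("geneA", "geneB"), ("geneB", "geneC")]], 0.75)` keeps only the edge from `geneA` to `geneB`. Note that thresholding edge frequencies does not by itself guarantee that the consensus graph is acyclic.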
Abstract:
A number of statistical tests have been proposed to identify functional divergence in duplicate genes by detecting heterogeneous substitution rates in a phylogenetic tree. A common disadvantage of the existing methods is that autocorrelation of heterogeneous substitution rates along sequences is not well modeled. Therefore, the existing methods may lack the power to identify motifs/domains under functional divergence, since the parallel shift of substitution rates after duplication may be a critical feature of such motifs/domains. We designed a phylogenetic hidden Markov model to identify protein motifs/domains relevant to functional divergence. A C++ program was developed to estimate model parameters by maximizing the likelihood function and to identify regions under functional divergence using the estimated parameters. Simulation demonstrates that our program can successfully identify protein motifs/domains under functional divergence unless the discrepancies of evolutionary rates between subfamilies are very weak or the protein motifs/domains are very short. Applying the method to G protein alpha subunits in animals, we identified a candidate region overlapping with the alpha 4 helix and the alpha 4 - beta 6 loop in the GTPase domain. Previous studies suggest that both structures are important to receptor-G protein specificity. Therefore, the reported candidate region highlights that the functional divergence in G protein alpha subunits may be relevant to the change of receptor-G protein specificity. From these results, we conclude that our method may be useful to identify motifs/domains relevant to functional divergence in duplicate genes.
Abstract:
The reassortment of segments in RNA viruses has proved to be a common pathway in the change of viruses. Various reassortment-modeling techniques have proved innovative in predicting certain RNA reassortment patterns. While most models have been developed for reassortment events in viruses infecting humans, reassortment also occurs within viruses infecting other animals, plants and bacteria. Due to the different lifestyles of the hosts, different parameters must be considered. In an effort to better understand the role of reassortment within the RNA-based bacteriophages, a model was developed to simulate reassortment within these viruses and their bacterial host(s). The preliminary results of this work are described here, focusing in particular on the phage Φ8.
Abstract:
The identification and prediction of transcription factor binding sites (TFBSs) is the first step in elucidating the complex network of interactions that regulates expression of the genome. Computational prediction of TFBSs has often focused on proximal promoters, whose location near transcription start sites (TSSs) is relatively well defined. However, many TFs bind to sites within enhancers and other important DNA regulatory regions that are large distances from the TSSs [1]; this greater noise makes TFBS prediction a much more difficult task. Recently, the location of cell-type-specific enhancers has been more precisely predicted based on patterns of histone modifications [2], so-called “chromatin signatures”.
This project seeks to identify probable TFBSs within candidate human enhancer sequences predicted by their chromatin signatures. In addition to the analysis previously performed [3], we applied a new statistical model and motif finding algorithms specifically designed and developed for this project. By utilizing the WordSeeker genomic signature discovery toolkit [3] and additional post-processing methods, we were able to identify statistically over-represented words and motifs from putative enhancer regions. By considering chromatin signatures, we identified new putative TFBSs in gene regulatory regions extending beyond the promoters analyzed previously.
Abstract:
Physcomitrella patens, the model moss, has a protein-coding genome similar in size to that of Arabidopsis, is similar to yeast in the efficiency of gene targeting experiments, and has a dominant haploid form, making it an interesting and useful molecular genetic tool for plants. The model moss is fast becoming a tool for bioinformatic and molecular work due to its key phylogenetic position as sister to land plant lineages. We present here the first predicted protein-protein interactome (PPI) for a bryophyte, based on the interolog method using orthologs and outparalogs identified with the INPARANOID software package. We predicted more than 60,000 total interactions, including 41,936 unique interactions of 4,062 different P. patens proteins. The twenty most interactive proteins belong to strongly conserved pathways that have not altered significantly during eukaryotic evolution. Analysis of gene ontology revealed that the most significant categories represented include numerous metabolic processes, likely due to their conserved nature; protein binding, due to the physical interaction requirement for inclusion; and catalytic activities. The utility of predicted interactomes lies in the “guilt-by-association” model of predicting proteins in a pathway under the assumption that orthologous proteins have similar functions. We reconstructed the photosynthetic Calvin Cycle network to determine the number of proteins associated with this important process and discovered many new interactions. The addition of moss, a plant representative 200 million years diverged from Arabidopsis, to interactomic research greatly expands the possibility of conducting comparative analyses, thus giving tremendous insight into network evolution of land plants.
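The interolog idea can be sketched in a few lines: an interaction known in a model organism is transferred to the target species whenever both partners have orthologs there. The function name, the input shapes, and the protein identifiers below are assumptions for illustration; they do not reflect INPARANOID's output format.

```python
def predict_interologs(template_ppis, orthologs):
    """Transfer interactions from a model organism to a target species.

    template_ppis: iterable of (a, b) interacting protein pairs in the model organism.
    orthologs: dict mapping a model-organism protein to a set of target proteins.
    Returns a set of unordered target pairs, stored as sorted tuples.
    """
    predicted = set()
    for a, b in template_ppis:
        # Every ortholog of a is paired with every ortholog of b.
        for ta in orthologs.get(a, ()):
            for tb in orthologs.get(b, ()):
                predicted.add(tuple(sorted((ta, tb))))
    return predicted
```

Storing pairs as sorted tuples means the same interaction inferred from several template organisms is counted once, which is one way to separate total from unique predicted interactions.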
Abstract:
Coffee is an economically important species for which a large cDNA collection has been sequenced, providing a look at expressed coffee genes. Annotation of protein function is difficult work in the lab, and is often done by homology to proteins from model organisms. Here we present the first predicted protein-protein interactome of C. canephora var robusta, with 4587 protein-protein interactions between 939 proteins. These include interactions conserved across plants, animals and fungi, 325 that appear to be plant specific, and 53 that appear chloroplast specific and cyanobacterial in origin. These interactions are established by identification of orthologous genes in model organisms of the BIOGRID dataset. The connectivity (number of interacting partners) for the majority of proteins follows a power-law distribution. Small hubs (2-10 partners) make up 30% of the proteins. GO (gene ontology) annotation revealed significant enrichment for proteins involved in translation, response to salt stress, and the cytosolic ribosome, and a depletion of unknown proteins. This was expected, as only conserved interactions would be predicted using our methods, and these are the best studied. However, there were some highly conserved interactions in coffee between otherwise unknown proteins. Dividing the entire network into subnetworks (clusters) based on highly interconnected proteins, we identified potential functional modules. The strongest such cluster shows the connections between proteins of the large and small subunits of the ribosome, while other clusters were identified as the proteasome and transcription initiation complexes. Several previously unknown proteins were also detected within these clusters, indicating possible biological function.
Abstract:
Digital pathology is an image-based process that provides the ability to acquire, manage, and interpret pathology cases from digitized glass slides. Challenges exist and issues such as thick or folded tissue, vibrations during scanning, and incorrect focus points contribute to an overall lower quality of image. The most common outcome from a poor quality scan is blurriness, which leads to loss of important image data and results in a significant amount of staff time and effort to manually assess the whole slide images for image quality. The method described by this poster aims at automatic detection of blurred regions in the image and reducing the manual labor involved in quality assurance.
To distinguish blurred and non-blurred areas, regional homogeneity and contrast are measured using numerical gradients across sub-images. More gradients will be found in clear, in-focus regions as compared to blurry regions. Gradients are counted for each sub-image, normalized for the amount of tissue found in the sub-image, and graded, allowing for high-level visual inspection of blurring across the whole slide image. Additionally, a blurring metric for the entire image is reported. Evaluation is performed by comparing results to those generated by expert imaging technicians.
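A simplified sketch of the gradient-counting idea follows. It assumes a grayscale sub-image stored as a list of rows of integer intensities, treats pure white (255) as background, and uses a hypothetical gradient threshold; none of these choices is taken from the poster, and the actual algorithm may measure homogeneity and contrast differently.

```python
def gradient_count(tile, grad_threshold):
    """Count neighboring-pixel differences (right and down) above a threshold."""
    rows, cols = len(tile), len(tile[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and abs(tile[r][c + 1] - tile[r][c]) > grad_threshold:
                count += 1
            if r + 1 < rows and abs(tile[r + 1][c] - tile[r][c]) > grad_threshold:
                count += 1
    return count


def tile_sharpness(tile, grad_threshold, background=255):
    """Gradient count normalized by the number of tissue (non-background) pixels.

    Returns None for a tile with no tissue, so empty glass is not graded as blurry.
    """
    tissue = sum(1 for row in tile for px in row if px < background)
    if tissue == 0:
        return None
    return gradient_count(tile, grad_threshold) / tissue
```

Grading each sub-image by this normalized score gives a coarse map of likely blurred regions, and an aggregate over all tissue-bearing tiles could serve as a whole-slide metric.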
Blurring errors are common and result in distortion and loss of data, which poses a challenge to both technicians and pathologists. This algorithm aims at detecting blurred regions in images as the initial step in providing an automated quality assurance program for creating high-quality digital pathology images.
Abstract:
Next-generation genome re-sequencing technology has become an effective strategy to identify SNPs/INDELs in genetic disease research. We have developed a computational pipeline to detect variants generated from Whole Human Exome Capture (NimbleGen EZExome version 2) technology. We first aligned Illumina HiSeq paired-end sequencing reads against the human genome (hg19) using BWA and employed the SAMtools package to call SNPs/INDELs. Custom scripts were written to select variants from target regions. We then filtered out called variants that were most likely not homozygous by adding a filter on the allele frequency of reads; in this work, we were only interested in homozygous regions (allele frequency above 80%). We identified variants mapping to known SNPs in NCBI dbSNP132. For project-specific reporting, we created a series of filters to exclude known SNPs with a frequency > 5% in all populations and > 5% in the sample/patient’s sub-population. The resulting output contained novel SNPs and low-frequency (< 5%) ones in dbSNP and will be analyzed further.
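The two filters described above can be illustrated with a small predicate. This is one possible reading of the filtering rules, not the pipeline's actual scripts: the function name, argument layout, and the decision to drop a variant when any supplied population frequency exceeds 5% are assumptions.

```python
def keep_variant(alt_reads, total_reads, pop_freqs,
                 min_allele_frac=0.80, max_pop_freq=0.05):
    """Decide whether a called variant passes the homozygosity and rarity filters.

    alt_reads / total_reads: read counts supporting the variant at this site.
    pop_freqs: known population frequencies for the variant (empty if novel).
    """
    if total_reads == 0:
        return False
    # Homozygosity filter: allele frequency of reads must be above the cutoff.
    if alt_reads / total_reads <= min_allele_frac:
        return False
    # Rarity filter: novel variants (no known frequencies) always pass.
    return all(freq <= max_pop_freq for freq in pop_freqs)
```

For instance, `keep_variant(18, 20, [])` accepts a novel variant supported by 90% of reads, while `keep_variant(18, 20, [0.12])` rejects a common known SNP.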
Besides consolidating the information from SAMtools, our pipeline automatically annotates the class of variants (synonymous, non-synonymous, and stop codon if the hit falls into a coding region) and indicates whether the hit falls in the untranslated or intron region for each detected SNP using UCSC knownGene model. Finally, our pipeline prioritizes the hit class based on the importance order for each unique combination of SNP location and mapped gene ID. We believe our pipeline will detect and annotate variants generated under similar technologies.
Abstract:
Protein functionality in a broad spectrum of cellular processes is often dependent on protein-protein interactions. These interactions can be critical to enzyme performance and regulation, immune system response and a host of other cellular processes. A more thorough understanding of the protein features contributing to protein-protein interactions could lead to improved techniques for predicting interaction sites, identifying nearest-native interface structures and analyzing protein behavior in reaction networks. Tessellation is a computational geometry tool that has been used in reported studies to identify, isolate and characterize interface structures. This study extends the application of Delaunay Tessellation (DT) in particular, to analyze interacting protein structures and employs four-body nearest neighbor statistical potentials to compare the interfacial regions. DT decomposes the structure of a protein into a set of irregular tetrahedra, each containing 4 residues at its vertices that are nearest neighbors of each other in the structure. DT can be applied to a set of representative protein structures to determine the frequency of specific residue quadruplets, which in turn can be used to develop a statistical potential function. Potential functions derived from sets of entire protein structures as well as surface, buried and interfacial components are used to compute total potentials and potential contours for interaction regions, as well as the surfaces of the contributing monomers. These values are presented and compared for interface regions representative of several types of protein-protein interactions, including obligatory, non-obligatory, homo and heterodimers, oligomers and complexes.
Abstract:
End-of-life patient care and treatment in oncology requires important discussions among the patient, their physician, and support network. Decisions must be made on stopping therapy regimens, switching therapies, beginning palliative care, and/or enrolling in hospice. The accurate assessment of patient length of survival (LOS) may aid in these discussions. We looked at building a decision support system (Bayesian network) to predict LOS in an outpatient oncology clinic using basic information on lab work, medications, and measured weight. A study population of 1311 patients was used. Data consisted of lab values of hemoglobin (HGB) and albumin (ALB), patient weight (WT), erythropoietin (Aranesp or Procrit) medications administered, if (and when) last chemotherapy was administered, and the patient’s age. Other than age, the data are non-uniformly sampled time series or time-stamped data. The time-series data (HGB, ALB, WT) were analyzed (2 piece, 2nd order splines were fit to the data to assess the trends) and discretized. These values were combined with discretizations of the other variables to create a 9 variable BN that predicted LOS with 63% accuracy. Current work focuses on improving predictive performance with: (i) the inclusion of other predictive indicators for survival (clinical values – cancer staging and type; laboratory values – white blood count, creatinine, etc.) and (ii) different representations for the time-series data, with positive initial results. Eventually, the best predictive model(s) will be shown to clinicians and placed into the clinical work-flow to assess their effectiveness.
Abstract:
We have developed web-based tools to analyze gene expression and proteomics data. Gene expression experiments are compiled into files including tens of thousands of genes expressed over hundreds of samples. Analysis and correlation of these data sets is challenging, and easy-to-use tools with a variety of functions are scarce. The online tools we have developed provide three distinct services. The first function can generate coexpression networks of the specified genes, where genes are represented as nodes, and their correlations are represented by edges. Nodes are connected to each other if their Pearson’s correlation coefficient is above a certain cut-off value. Secondly, weighted protein-protein interaction networks can be generated, where proteins are represented as nodes and edges represent physical or genetic interactions. Each interaction is weighted with a logistic regression model that incorporates sub-cellular localization data, co-clustering of interacting proteins, number of observations of the interaction in literature, and mRNA coexpression data calculated from user submitted microarray experiments. The regression model then assigns a probability that the interaction exists. Finally, the last function calculates the bimodality of coexpression, which is a measure of association between two subnetworks of genes/proteins. This association can suggest interactions between entire networks of genes, as opposed to individual gene-gene interactions. These tools provide researchers with the ability to analyze large -omics datasets by integrating publicly available databases.
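The coexpression network construction described above (genes as nodes, an edge when Pearson's correlation exceeds a cut-off) can be sketched as below. The function names and the dictionary input format are assumptions for this example and are not the interface of the online tools.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def coexpression_edges(expr, cutoff):
    """Connect gene pairs whose correlation across samples is above `cutoff`.

    expr: dict mapping gene name to its list of expression values per sample.
    Returns a dict mapping (gene1, gene2) to the correlation coefficient.
    """
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if r > cutoff:
            edges[(g1, g2)] = r
    return edges
```

Whether strongly negative correlations should also produce edges is a design choice; this sketch follows the text literally and keeps only coefficients above the cut-off.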
Abstract:
The completion of the genome sequence has significantly facilitated the identification of genetic markers for complex traits. The marker of choice that has emerged for whole-genome linkage and association studies is the single nucleotide polymorphism (SNP), a sequence variation occurring in only one nucleotide in a long stretch of DNA [1]. Although there are multiple sources of genetic variation present in the population, SNPs are the most common type of sequence variation and can serve as powerful markers due to their abundance, stability, and relative ease of scoring. Therefore, whole-genome SNP typing is a powerful technique for genome-wide association studies (GWAS). In the traditional case-control GWAS, the SNP profiles from individuals in the case group are compared to those of the control group. If the statistical significance for a SNP can be established, it could potentially be used as a genetic marker for the trait. Researchers have used GWASs to identify genetic markers for type 2 diabetes, Parkinson's disease, heart disorders, obesity, Crohn's disease and prostate cancer [2]. Instead of using this traditional approach, we are interested in developing computational techniques that will allow researchers to identify subjects in various sample groups in an unsupervised manner. In this study we used a dataset generated from 117 dogs belonging to three breeds, German Shepherd Dog, Belgian Malinois (Belgian Shepherd Dog) and Labrador Retriever, using the Affymetrix Canine SNP Array (v.2), which contains probes for 127,132 SNPs.
The specific goal of this
Works Cited
1. http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.shtml
2. http://www.genome.gov/20019523
3. Gao X, Starmer J: Human population structure detection via multilocus genotype clustering. BMC Genetics. 2007, 8:34.
4. Ward JH, Hook ME: A Hierarchical Grouping Procedure Applied to a Problem of Grouping Profiles. Lackland Air Force Base, Texas: Personal Laboratory, Wright Air Development Division, March 1961.
Approved for public release; distribution unlimited (88ABW-2011-2154).
Abstract:
Bacterial pathogens pose a severe burden to health. Tuberculosis alone is responsible for more than two million deaths annually, and is re-emerging in Western nations. Our understanding of the complement of proteins that mediate pathogenesis is incomplete. Pathogens produce myriad small proteins which are essential to establish disease. Identification of these proteins by proteomics is limited by the accuracy of translated protein databases. Many gene products in bacteria are unannotated, misannotated, or frame-shifted, and so elude practical identification by proteomics. Pathogens with small research communities also suffer from limited opportunities to curate underlying genomic content. Approaches aimed at establishing empirical links between proteomics and genomics are called proteogenomics. Most of these efforts focus on higher organisms rather than pathogens. Here, we describe a software implementation to generate tiles of translated nucleotides throughout the genome. No existing ORFs were used to determine start/stop codons, and the genome was tiled in all frames. This tile set was combined with typical FASTA files and used as the search database within Mascot and Paragon, two proteomics search engines. Nano-HPLC/MS/MS data from mycobacterial protein digests were queried, and output tiles were compared using BLAST to identify known and unknown gene products. The word (tile size) and bite (tile-to-tile distance) can be varied depending on the genome investigated. These tiles are fully compatible with existing search software, comparable in size to existing databases, and compatible with computational methods for false-positive assessment, such as reversed-decoy strategies. Progress towards complete annotation of the small-protein space in Mycobacteria is presented.
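The tiling step can be illustrated with a short sketch that translates fixed-size windows in all six reading frames without reference to annotated ORFs. It is not the authors' implementation: the function names are hypothetical, word and bite are assumed to be given in nucleotides and to be multiples of three, stop codons are simply rendered as `*`, and the standard codon assignments are used.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon table built from the conventional TCAG ordering.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(nt):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def tile_genome(genome, word, bite):
    """Return (strand, frame, start, peptide) tiles over all six reading frames.

    word: tile size in nucleotides; bite: distance between tile starts.
    """
    if word <= 0 or bite <= 0 or word % 3 or bite % 3:
        raise ValueError("word and bite must be positive multiples of 3")
    genome = genome.upper()
    tiles = []
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            for start in range(frame, len(seq) - word + 1, bite):
                tiles.append((strand, frame, start, translate(seq[start:start + word])))
    return tiles
```

Choosing a bite smaller than the word makes consecutive tiles overlap, so a peptide spanning a tile boundary still appears whole in a neighboring tile, at the cost of a larger search database.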
Abstract:
The identification of genetic risk factors in complex disease has been an important goal in the field of genetic epidemiology, and the use of Bioinformatics techniques has been prevalent. Recently the role of gene interactions in susceptibility to disease has been investigated, not only as an aid in diagnosis, but also in the hopes of adding to the knowledge of the biological and biochemical pathways that are involved in specific diseases. Machine learning methods are used in the prediction of disease susceptibility, but are often criticized as being "black box" approaches that reveal little about the relationships of the factors (in this case the genes) that contribute to the prediction.
Our method for the detection of gene interactions is based upon training a supervised learning algorithm to detect disease, and then "opening up the black box" to find out how the input parameters are related. We apply our method for the detection of gene interactions to two approaches from the machine learning field: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). We test the versatility of our approach using 7 disease models, some of which model gene interactions and some of which model biological independence. Specifically our techniques reveal the presence or absence of gene interactions, and the nature of interactions (XOR, AND) if they do exist. Our method correctly identified all gene pairs in these 7 disease models as either interacting or independent.
Abstract:
The gps4 mutant phenotype in Arabidopsis thaliana represents a mutation in the gravitropism-signaling pathway. In root tips, a diminished gravitropic response is observed, while in inflorescence stems there is no observable tropic response to gravistimulation at 4°C. Many genes and signaling mechanisms are involved in this pathway. Affinity chromatography and bioinformatics have been utilized to identify possible protein-protein interactions between GPS4 and other proteins in the system. The affinity chromatography step used GPS4 inserted into the pET28c his-tag vector. Expression of the construct produced GPS4 proteins with a 6×his tail, which interacts with a nickel column. After his-tagged GPS4 was bound to the column, a set of extracted Arabidopsis proteins was introduced to the column. No significant conclusion can be drawn from the analysis of those proteins that exited the column initially, but those that eluted concurrently with the release of his-tagged GPS4 presumably interact with GPS4 in some capacity. Bioinformatics tools were then employed to identify the eluted proteins and their potential roles in gravitropism signal transduction.
Abstract:
The goal of this work was to enhance the capability of a novel, prototype software tool for the visualization and analysis of small molecule metabolite gas chromatography/mass spectrometry (GC/MS) and liquid chromatography/mass spectrometry (LC/MS) data for biomarker discovery. The key features of the Metabolite Differentiation and Discovery Lab (MeDDL) software platform include support for the manipulation of large data sets, tools to provide a multifaceted view of the individual experimental results, and a software architecture amenable to modification and addition of new algorithms and software components. The MeDDL tool, through its emphasis on visualization, provides unique opportunities by combining the following: easy use of both GC/MS and LC/MS data; use of both manufacturer specific data files as well as netCDF (network Common Data Form); preprocessing (peak registration and alignment in both time and mass); powerful visualization tools; and built-in data analysis functionality. The latest version of MeDDL incorporates a variety of additional features which focus on expanding the GC/MS analysis capability of the platform in support of Volatile Organic Compound (VOC) based biomarker research on-going in our laboratory.
Abstract:
The diversity of avian species in tropical Asia is still little known due to the dearth of modern analyses. The bird family Timaliidae (babblers) has many taxa that are poorly studied in terms of species diversification. To understand the phylogeny of one babbler species, Pomatorhinus ruficollis, we studied both mitochondrial and nuclear genes. Our objectives were to establish phylogenetic relationships of P. ruficollis sampled from various locations in China in order to determine if there is genetic and geographic structure across these populations, and to compare the influence of mitochondrial versus nuclear genes. We sequenced six genes for this analysis. Our data show that mitochondrial genes contain more informative characters than nuclear genes at this level of divergence. Our results indicate there are distinct populations of P. ruficollis on either side of the Pearl River. The nuclear gene tree does not show as clear a pattern of specimens from the same locality having similar genetic traits, suggesting a slower rate of evolution in nuclear versus mitochondrial genes. Our phylogenetic analysis of the combined data showed patterns consistent with previous studies indicating that these Chinese populations consist of at least two distinct evolutionary lineages. Our findings also bring awareness to the immense diversity found in this region and the need for additional research to resolve diversification patterns.
Abstract:
Predicting the three-dimensional structure of proteins from their amino-acid sequences continues to be a compelling and important problem in computational biology. Such structures are useful in the design of drugs for, or better understanding of, human diseases which may be protein related, such as Huntington’s or Parkinson’s disease. Protein contact maps, from which three-dimensional models can be created, are two-dimensional graphs with the amino-acid sequence of a protein chain along both axes and markings indicating whether or not two residues are spatially close to each other in three-dimensional space.
Many methods have been employed in the attempt to predict contact maps since such predictions first became an available category at the Critical Assessment of protein Structure Prediction, a global biennial competition. Many of these methods revolve around the evaluation of features of amino acid residues, and using those features to guess whether or not two residues are in contact. We propose to evaluate the efficacy of available amino acid residue features (such as flexibility) in determining whether or not two residues are in contact. We intend to ultimately produce an automated process of feature evaluation, which will allow others to build on our successful completion of research.
Abstract:
Advances in microarray technology have led to highly complex datasets often addressing similar or related biological questions. The statistical methodology of meta-analysis aims to combine results from independent but related studies. It is a relatively inexpensive option that has the potential to increase both the statistical power and generalizability of single-study analyses. For example, a meta-analysis of five circadian microarray studies of Drosophila helped researchers to identify a novel set of rhythmically expressed genes. We advocate here a related approach to potentially extend confirmed results to other species or organs. In translational medicine or biology, research is often based on measurements that have been obtained at different points in time. The biologist looks at these values not as individual points, but as a progression over time. Our program (SPOT) helps the researcher find these patterns in large sets of microarray data. A researcher proceeds through three successive steps: first, selecting microarray data of interesting experiments from NCBI GEO; second, translating the temporal measurements into time intervals; and third, defining temporal concepts like “peaks” based on those intervals. The researcher can then search for genes that exhibit that particular pattern within the previously selected data pool. We created a software tool using open-source platforms that supports the R statistical package, Bioconductor, and Web 2.0 knowledge representation standards using the open source Semantic Web tool Protégé-OWL. We report here on the web interface that connects to programs based on R and Bioconductor.
Abstract:
The National Center for Integrative Biomedical Informatics (NCIBI) at the University of Michigan provides Web services that 1) make available the data resulting from the natural language processing of biomedical literature deposited in PubMed and PubMed Central Open Access, 2) make available the data in the Michigan Molecular Interactions (MiMI) repository, 3) make available the data from NCIBI tools such as Gene2MeSH and Metab2MeSH, and 4) expose a public API to a set of computational and analytical methods.
Our natural language processing pipeline includes sentence segmentation, named entity tagging, sentence parsing and part-of-speech tagging, and information extraction to identify protein-protein interactions and annotations in biomedical abstracts and full-text articles. The intermediate and resulting data from this pipeline are made available through our Web services.
MiMI, the Michigan Molecular Interactions repository, integrates data from numerous protein interaction databases into a single database using “deep integration” of molecules merged to genes. The MiMI Web service supports querying by keywords, genes, lists, or interactions.
Data from tools such as Gene2MeSH (genes annotated with concepts defined in MeSH, Medical Subject Headings from the National Library of Medicine) and Metab2MeSH (chemical compounds annotated with MeSH concepts) are available through our Web services.
Finally, we provide Web service access to a set of computational and analytical methods for natural language processing and gene set enrichment testing using a technique called LRPath.
The Web services are designed to work in combination with the National Center for Biotechnology Information’s Entrez Programming Utilities service.
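As a rough illustration of the Entrez side of such a combination, the sketch below builds request URLs for the E-utilities `esearch` and `efetch` endpoints. The NCIBI service endpoints are not given in this abstract, so none are shown. Passing PubMed identifiers from one service to the other is an assumption about how the two might be chained, and the helper names are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlencode

# Public base URL of the NCBI Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build an ESearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids: Iterable[str], db: str = "pubmed") -> str:
    """Build an EFetch URL that retrieves the full records for the given IDs as XML."""
    params = {"db": db, "id": ",".join(ids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# The URLs are only constructed here; no network request is made.
search = esearch_url("BRCA1 AND breast cancer")
fetch = efetch_url(["12345", "67890"])  # placeholder IDs, not real results
```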
Abstract:
The major goal in the application of tryptophan fluorescence spectroscopy is to interpret fluorescence properties in terms of structural parameters and to predict structural changes in the protein. We have developed methods for the mathematical analysis of fluorescence spectra of multitryptophan proteins aimed at revealing the spectral components of individual tryptophan residues or clusters of tryptophan residues located close to each other (Burstein et al., 2001, Biophys. J., 81, 1699-1709; Reshetnyak and Burstein, 2001, Biophys. J., 81, 1710-1734). We have also created an algorithm for the structural analysis of the tryptophan environment in 3D atomic structures of proteins from the PDB (Reshetnyak et al., 2001, Biophys. J., 81, 1735-1758). The successful design of these methods of spectral and structural analysis opened an opportunity to establish a relationship between the spectral and structural properties of a protein. We have integrated the developed software modules, introduced new programs for the assignment of tryptophan residues to spectral-structural classes, and created a web-based toolkit, PFAST: Protein Fluorescence and Structural Toolkit (Shen et al., 2008, Proteins, 71, 1744-1754). PFAST contains three modules: 1) FCAT, the fluorescence-correlation analysis tool, which decomposes protein fluorescence spectra and assigns spectral components to one of five previously established spectral-structural classes; 2) SCAT, the structural-correlation analysis tool, which calculates the structural parameters of the environment of tryptophan residues from the atomic structures of proteins in the PDB and assigns tryptophan residues to one of five spectral-structural classes; and 3) the PFAST database.
Abstract:
The soybean genome contains several thousand copies of the GmOgre (Gmr9) retrotransposon. A ~20,000 bp consensus sequence was previously constructed and found to contain a minisatellite repeat region between the end of the coding region and the 3’ Long Terminal Repeat (LTR). The region contains five distinct minisatellite families with monomers ranging in length from 26 to 164 bp. The monomers are interspersed and repeated three to sixteen times within this region of GmOgre. The origin of this minisatellite region is not yet known. We developed a computational method to characterize other loci where these minisatellites might be present. We found a total of 77,265 monomer copies of the five minisatellites in assembled chromosomes from GenBank. In addition to those found in members of the GmOgre family, we found 486 copies of these minisatellites in 176 retrotransposons representing 21 additional retrotransposon families. An additional 23,413 monomers are located in regions of the soybean genome that are currently unannotated. However, the majority of these minisatellites are contained in GmOgre. PCR analysis suggested, and the computational analysis confirms, that the total lengths of some of the minisatellite clusters may be far longer than that found in the GmOgre consensus sequence.
Abstract:
Identifying the set of single nucleotide polymorphisms (SNPs) contributing to a condition or disease amongst the millions of SNPs that are present within the human population is far from trivial. Genome-wide association studies (GWAS) compare the variations between patients with and without a condition or disease. In order to ascertain that a SNP or set of SNPs is in fact connected to the condition, a large collection of both control data and test data is needed. Generating such a large set of patient data is typically not feasible for a single investigator, so multiple studies must be pooled together. There is, however, no universal format established for this data; rather, it is largely dependent upon the platform used for typing of patient SNPs and the availability of supplemental patient data. Given these differences, integrating such data, despite its potential benefits, is a logistical nightmare. Here we created a universal data structure for GWAS data to expedite and improve the exploration of variations that are associated with disease.
Abstract:
The Josephin Domain (JD) is a ubiquitin-binding and -cleaving site at the N-terminus of the ataxin-3 protein. The goal of this undergraduate research project was to compare the amino acid sequences of this domain from a variety of organisms and look for structural, functional, and phylogenetic relationships. JD sequences from fifty-one organisms with similarities to human JD ranging from 24% to 98% were collected and compared. Out of the 185 residues in the protein, five (N-20, L-84, H-119, S-135, and P-140) were totally conserved across all species and 75 were more than 60% conserved. Functions were assigned to many of these highly conserved residues. C-14, H-119, and N-134 were found to participate in the catalytic triad that cleaves ubiquitin chains. G-25, Y-27, F-28, W-87, and L-89 are the residues involved in HHR23 interaction. It was hypothesized that S-135 and N-20 may hydrogen bond to H-17 and E-26, respectively. Both of these bonds would be used to stabilize the helix alpha-1. L-84 was hypothesized to be involved in hydrophobic packing. It was concluded that the Josephin Domain is not strongly conserved across species, and it was hypothesized that this may be because it must conform itself to the ubiquitin sequence unique to each organism.
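The counts of "totally conserved" and "more than 60% conserved" residues can be illustrated with a per-column calculation on a multiple sequence alignment. The abstract does not state how conservation was scored, so the sketch below assumes one simple choice: the fraction of sequences carrying the most common residue at each position, with gaps counting against conservation. The toy alignment is invented.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of sequences that carry the most common residue in each column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores: List[float] = []
    for col in range(length):
        # Gaps ('-') are not counted as residues, so they lower the score.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(alignment))
    return scores


# Toy alignment of four sequences, four columns each.
toy = ["MNLH", "MNIH", "MQLH", "M-LH"]
scores = column_conservation(toy)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
above_60_percent = [i for i, s in enumerate(scores) if s > 0.6]
```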
Abstract:
Heat shock protein 90 (HSP90) is found in virtually all species from bacteria to mammals, functioning as a molecular chaperone and serving a protective function in the cell. Since ATP is necessary for the functioning of this protein, our previous study aligned the primary structure of the ATP binding site, the N-terminal domain, of human HSP90 with sequences from a variety of different species. The results showed high conservation in the critical residues associated with the binding and hydrolysis of ATP across all species. Other studies have indicated that the tertiary structure of the ATP-binding domain of HSP90 is highly homologous to that of DNA gyrase B, a bacterial type II topoisomerase that catalyzes the negative supercoiling of prokaryotic DNA. In this undergraduate research project, the primary and tertiary structures of twenty DNA gyrase B enzymes and thirty-nine HSP90 proteins were aligned to determine the critical residues involved in the binding and hydrolysis of ATP. It was found that Glu47, critical for ATP hydrolysis, and Asp93, which hydrogen bonds to the adenine moiety of ATP, were invariantly conserved in both HSP90 and DNA gyrase B. Not all of the fully conserved residues found in HSP90 alone were conserved in DNA gyrase B. Finally, it was seen that although the primary structures were quite dissimilar, the tertiary structures were highly analogous, as hypothesized and as supported by the tertiary structural alignment.
Abstract:
Protein phosphatase PtpB from Mycobacterium tuberculosis is hypothesized to remove the phosphate groups of unidentified proteins in host macrophages in order to protect the bacterium from lysosomal degradation. The goal of this undergraduate project was to compare related phosphatases from different organisms and determine conservation, function and phylogenetic relationships. Seventy-one phosphatase sequences from a variety of species were compared. Nine residues were fully conserved and 58 residues had a conservation of at least 60%. The functions of most of the fully conserved residues have been hypothesized. Cys160 is a catalytic residue and is close enough in the PTPB:PO4 structure to attack the substrate PO4 molecule. Arg166 can form hydrogen bonds with both the PO4 and the inhibitor OMTS. Additionally, there are several charged residues that appear to coordinate the structure of the PtpB molecule, including ion pairs between Arg56 and Glu32, as well as Arg29 and Asp188. Asn11 forms a hydrogen bond with Ser192. Arg13 is close to Arg29 and is part of a group of arginines, including Arg56, Arg166 and Arg191, all of which may aid in positioning the conserved P loop. One of the unique features of Mycobacterium PtpB is a lid, composed of part of helix α-7, helix α-8 and a loop, that covers the active site. The lid is hypothesized to keep Cys160 from inactivation by oxidation. Pattern and phylogenetic analyses reveal that this group of proteins diverges considerably.
Abstract:
The globins are a protein family involved in oxygen-binding and transport. They are widely considered to share a common ancestor. There are four globin proteins in Homo sapiens: hemoglobin, subunits A and B (HbA and HbB); myoglobin (Mb); cytoglobin (CYGB); and neuroglobin (NGB). The goal of this undergraduate research project was to compare the amino acid sequences of all four globins from a variety of organisms and look for structural, functional, and phylogenetic relationships. A total of 234 sequences (59 Hb-A, 62 Hb-B, 61 Mb, 22 CYGB, 30 NGB) were compared. Two (His87 and Phe43) were fully conserved and 40 were at least 60% conserved (all amino acid position numbers are taken from Homo sapiens HbA chain). Functions were identified for many of these conserved residues. His87 was involved in the binding of the heme-ring complex, coordinating the central iron atom, while His58 bound to the oxygen ligand. Also involved in supporting the ligand were Phe43, Phe46, and Val93. His58, Lys29, and Val62 are involved in oxygen-affinity for the protein (in high oxygen-affinity conditions, His58 is replaced by Tyr and Lys29 is replaced by Gln). It was concluded that the residue conservations of the globin family are mostly involved in the oxygen-binding property common to all the protein members, as well as to the stabilization of the oxygen and heme ligands. Phylogenetic analysis demonstrates that each type of globin protein does group separately in the tree.
Abstract:
The support vector machine (SVM) is a statistical classification algorithm that classifies data by separating two classes with a functional hyperplane. SVMs are known for good performance on noisy, high-dimensional data such as microarray data. A marginal region of the functional hyperplane, named the ‘danger zone’, is defined as the region between two parallel hyperplanes that are determined by the average distances of the support vectors from the two classes to the functional hyperplane. The main aim of this study was to determine the effect of margin distance, the width of the danger zone, on the accuracy of the classifier and to analyze the role of margin distance in feature selection. The study was carried out using three microarray datasets. For each dataset, the equation of the functional hyperplane separating the two classes of data was derived. The corresponding support vectors were obtained. The average distances between the support vectors from the two classes and the functional hyperplane were calculated. The relations between the width of the danger zone and the classification accuracy were investigated. The rate of change of the margin distance with respect to the number of features used for constructing the support vector machine was also examined. The results indicate that although the correlation between margin and accuracy is not very strong, the rate of change of classification accuracy with respect to margin distance can be employed to determine the optimal number of features for constructing a high-performance support vector machine for classifying microarray samples.
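The danger-zone width described above can be sketched as follows, assuming a linear SVM has already been trained and its weight vector w, bias b, and support vectors are available. Taking the width as the sum of the two per-class mean distances is one reading of the definition in this abstract, not a confirmed formula from the study. The function names and toy numbers are illustrative.

```python
import math
from typing import List, Sequence


def distance_to_hyperplane(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def danger_zone_width(sv_pos: List[Sequence[float]],
                      sv_neg: List[Sequence[float]],
                      w: Sequence[float], b: float) -> float:
    """Mean support-vector distance of each class, summed (assumed definition)."""
    mean_pos = sum(distance_to_hyperplane(x, w, b) for x in sv_pos) / len(sv_pos)
    mean_neg = sum(distance_to_hyperplane(x, w, b) for x in sv_neg) / len(sv_neg)
    return mean_pos + mean_neg


# Toy example: the hyperplane is the vertical line x1 = 0 in two dimensions.
w, b = [1.0, 0.0], 0.0
sv_pos = [[1.0, 0.0], [2.0, 1.0]]     # distances 1 and 2, mean 1.5
sv_neg = [[-1.0, 3.0], [-1.0, -2.0]]  # distances 1 and 1, mean 1.0
width = danger_zone_width(sv_pos, sv_neg, w, b)
```

Recomputing the width as features are added or removed gives the rate of change that the abstract proposes as a guide for choosing the number of features.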
Abstract:
One of the most active areas in cancer biomarker research is the development of statistical methods to identify “signature”, cancer-specific patterns that account for the heterogeneity of cancer. Tomlins et al. observed the heterogeneous pattern of oncogene activation in many cancer types and introduced a statistical method called Cancer Outlier Profile Analysis (COPA) to identify so-called “cancer outlier genes”. Several other statistical approaches have since been described, such as Outlier Sum (OS), outlier robust t-statistics (ORT), maximum ordered subset t-statistics (MOST), and least sum of ordered subset square t-statistics (LSOSS). Simulation models and case studies of these methods have demonstrated their power of detection and/or false discovery rate. However, there is no consensus regarding a suitable simulation approach to compare different test statistics. In this study, we explore an alternative simulation approach based upon the assumptions of different test statistics to analyze the power of detection of a traditional mean-based approach (e.g. t-test) and an outlier-based approach (e.g. OS). Surprisingly, the results of our simulation show that in a heterogeneous disease such as cancer, the power of OS is not always higher than that of the t-test. The performance of the t-test and outlier sum test depends on the non-normality of the data. As one group diverges from normality, our model predicts that the power of the outlier sum test increases and surpasses that of the t-test for a given effect size. This simulation model may enable informed decisions regarding the selection of test statistics for particular cancer datasets.
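To make the contrast between a mean-based and an outlier-based statistic concrete, the sketch below computes Welch's t-statistic and a simplified outlier-sum statistic for a single gene. The outlier sum here follows the general recipe (center by the pooled median, scale by the median absolute deviation, then sum case values above the upper quartile plus one interquartile range). The exact scaling and cutoff are assumptions and may differ from the published OS definition and from what this study used. The data are invented.

```python
from statistics import mean, median, quantiles, variance
from typing import List


def welch_t(cases: List[float], controls: List[float]) -> float:
    """Mean-based statistic: Welch's two-sample t."""
    se = (variance(cases) / len(cases) + variance(controls) / len(controls)) ** 0.5
    return (mean(cases) - mean(controls)) / se


def outlier_sum(cases: List[float], controls: List[float]) -> float:
    """Simplified outlier sum: total of robustly scaled case values above the cutoff."""
    pooled = cases + controls
    center = median(pooled)
    mad = median([abs(x - center) for x in pooled])
    if mad == 0:
        raise ValueError("median absolute deviation is zero; cannot scale")
    scaled = [(x - center) / mad for x in pooled]
    q1, _, q3 = quantiles(scaled, n=4)
    cutoff = q3 + (q3 - q1)  # upper quartile plus one interquartile range
    return sum(z for z in ((x - center) / mad for x in cases) if z > cutoff)


# Invented data: only two of six cancer samples over-express the gene.
normal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
cancer = [1.0, 1.1, 0.9, 1.0, 5.0, 6.0]
t_stat = welch_t(cancer, normal)
os_stat = outlier_sum(cancer, normal)
os_null = outlier_sum(list(normal), normal)  # no outliers, so nothing is summed
```

A power simulation in the spirit of the abstract would repeat these two calculations over many simulated datasets and compare how often each statistic exceeds its null threshold.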
Abstract:
A proteomic approach was used to study how bluegill sunfish (Lepomis macrochirus) respond to parasitic infection by the black grub (Uvulifer ambloplitis). Control (uninfected) and treatment (infected with parasite) protein expression data were used to observe fold changes in identified proteins after 24-48 hours, and one month post infection. Prior to infection, plasma was extracted at seven and fifteen weeks to study an effect of the time in tank on differentially expressed proteins. GenBank accession numbers associated with each protein were used to map proteins to a Gene Ontology (GO) annotation from the entire zebrafish GO annotation set. Fisher’s exact test was used to test whether the GO term (cellular component, molecular function, biological process) was represented more than would be expected by chance. Lowest p-values in the 24-48 hour, one month, and time in tank groups were associated with extracellular component. The 24-48 hour and one month groups were associated with lipid binding and metabolism. We conclude that parasitism results in significant changes in how bluegill sunfish process lipids.
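The over-representation test described above can be sketched with the one-sided Fisher's exact test, which is the upper tail of the hypergeometric distribution. The counts in the example are invented, and the function and argument names are illustrative. The study's actual software is not stated in this abstract.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    N: annotated proteins in the background set
    K: background proteins carrying the GO term
    n: differentially expressed proteins
    k: differentially expressed proteins carrying the GO term
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Invented counts: 5 of 20 background proteins carry the term,
# and 3 of the 4 differentially expressed proteins carry it.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
```

When many GO terms are tested, as here across cellular component, molecular function, and biological process, the resulting p-values would normally be adjusted for multiple testing. The abstract does not say which correction, if any, was applied.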
Abstract:
Auxin, indole-3-acetic acid (IAA), is the primary regulator of plant structure. IAA concentration gradients direct many processes including bending, cell differentiation, and leaf position. Gradients are established through IAA synthesis, degradation, and transport into and out of each cell. A literature review of auxin use and developmental patterns in algae and plants provided data for construction of a phylogenetic tree with important innovations in auxin regulation as synapomorphies. Based on phylogenetic relationships, auxin regulation of organ position predated land plants. The earliest land plants evolved auxin transport and transcriptional regulation based on auxin availability. Auxin regulation became increasingly complex throughout plant evolution, but lycophytes and euphyllophytes diverged after the major auxin regulation pathways had evolved. Phylogenetic analysis suggests all vascular plants share a novel system of auxin regulation: IAA conjugation. Another phylogenetic tree, based on auxin conjugates found in diverse plants, was constructed to test for conservation of derived conjugates. Specific conjugates are not highly conserved and IAA conjugation is a poor predictor of plant complexity. A further literature search found putative auxin regulatory genes. Evaluation of gene sequences in Selaginella (a lycophyte) indicates it has sufficient genes to synthesize and conjugate auxin; has orthologs of most genes in auxin regulation; and has all classes of auxin transporters. Selaginella is capable of regulating auxin through synthesis, conjugation, and transport via all major mechanisms known in Arabidopsis. This compilation represents the most complete picture of auxin regulation among lycophytes.
Abstract:
Ever since the Spanish flu pandemic in 1918, the impact of influenza A viruses on human health, and the associated increase in human morbidity and mortality, have been receiving a growing amount of attention. Depending on their surface antigens, influenza A viruses are classified into different subtypes, with a total of 16 Hemagglutinin (HA) and 9 Neuraminidase (NA) subtypes known. Current vaccine practice targets three isolated human influenza strains believed to be currently circulating. However, one of the major challenges in the development of a pandemic vaccine is that while vaccines must be altered each year to be effective against the constantly changing strains, it is difficult to predict which subtype will be prevalent next year. Further complication comes from the ability of the virus to cross the species barrier, as in recently observed avian influenza outbreaks. In their natural reservoirs of aquatic birds, influenza viruses cause asymptomatic infections. They remain in stasis until they spread to other species, at which point they can evolve, causing mild or severe illness in the host. Thus, it is important to target those sections of the influenza A genome that are shared, and hence relatively conserved, among strains harbored by different hosts. In this poster we present results of evolutionary analyses of complete genomes of avian, swine and human influenza strains that identify several such shared segments, and discuss the implications for efficient vaccine design.
Abstract:
Microbial life is a critical component of a variety of Earth’s ecosystems and includes an assortment of species, the most abundant member being the virus. For instance, within the world’s oceans and lakes alone, it is estimated that there are 10^4 to 10^8 virus-like particles per milliliter. The composition of an ecosystem’s viral diversity has a significant impact on other microbial species and likely drives fluctuations in bacterial density and diversity within the system. Examining the viral species present within ecological niches has been the subject of numerous studies within the literature; such examinations provide significant insight into the ecological system as a whole. By developing a more complete understanding of the ecological system, events involving aquatic environments can be more readily understood and dealt with from a human perspective.
While Lake Michigan waters are routinely tested during the summer months for particular bacterial species, little is known about the more abundant viral species present. Identification of specific viral species by sequence alone presents challenges, particularly if genomic data does not exist for the species or a near neighbor. Cultivating within the laboratory necessitates the availability of proper growth conditions, most notably host species availability. Chicago area near-shore waters were sampled and filtered for viral particles. Lytic phages capable of infecting Escherichia coli C were detected. Fitness assays and sequencing were conducted to aid in the classification of these viruses. Identification of these phages through amplification techniques and plasmid based methods will aid in developing a viral profile of the Lake Michigan ecosystem.
Abstract:
ChIP-Sequencing (ChIP-Seq) is a powerful tool to identify binding sites for DNA-binding proteins in vivo. Briefly, DNA obtained from immunoprecipitated chromatin is sequenced in a high throughput manner and the large number (million or more) of short (50-75 nucleotides) reads obtained are mapped to a reference genome to identify DNA-binding regions. On the other hand, RNA-Sequencing (RNA-Seq) is a relatively new approach aiming to sequence whole transcriptomes in order to discover gene expression levels, survey novel splicing events and SNPs across transcriptomes.
In this study, we profiled the maize transcriptome in the presence and absence of the MYB transcription factor P1, in order to discover novel genome-wide cellular and metabolic processes associated with P1 function. We used publicly available bioinformatics tools to align reads generated from mRNA and ChIP DNA extracted from pericarp cells to the maize genome. The analysis of ChIP-Seq data identified putative binding sites of P1. In addition, RNA-Seq revealed expression levels of transcripts, and we detected differentially expressed transcripts by comparing wild-type (P1-rr) and mutant (P1-ww) tissues. In-depth analysis of the RNA-Seq data revealed novel splicing events associated with the absence or presence of P1 and splicing events specific to developmental stages. We took advantage of publicly available RNA-Seq data for various maize tissues in order to determine pericarp-specific alternatively spliced transcripts.
Abstract:
Shigella dysenteriae is a gram-negative, intracellular human pathogen that is a causative agent of bacterial dysentery. Small, non-coding RNAs (sRNAs) have been shown to “fine tune” gene expression by acting at all levels of gene expression. RyhB has been shown to decrease steady state levels of the transcriptional activator virB, the essential activator of the Type Three Secretion System (TTSS), its secreted effectors and associated chaperones. The research described herein aims to further elucidate the molecular mechanism mediating RyhB-dependent repression of virB expression by determining which step of virB expression is regulated by RyhB and identifying protein factors required for the process.
mRNA stability assays and transcriptional reporter based assays were used to investigate the step of virB expression that is inhibited by RyhB. A mini-Tn5 transposon mutant library was used to identify mutants in which efficient RyhB-dependent repression of virB expression was lost. The transposon mutant library was screened using Congo red dye binding analysis and in vitro tissue culture based analysis.
RyhB mediated repression of virB expression is independent of the activity of VirF, the upstream activator of virB. Inactivation of glk, encoding glucokinase, or ptr, encoding a putative transcriptional regulator, results in a loss of RyhB-dependent repression of virulence associated phenotypes.
The down-regulation of virB is mediated at the level of transcription and is not a result of lowered functionality of VirF. Both glk and ptr appear to have a role in virulence gene expression; however, that role has yet to be completely elucidated.
Abstract:
Malaria is a disease caused by parasites belonging to the genus Plasmodium that affects 500 million people globally and causes over one million deaths annually. Despite research efforts on gene control mechanisms in Plasmodium species, little is known regarding its transcriptional regulation. While numerous regulatory element binding site prediction algorithms have been developed to aid in uncovering novel transcription factor binding sites (TFBS), they have proven ill-suited for analyzing organisms with AT-skewed genomes such as Plasmodium. Furthermore, experimentally identified TFBS in the genome vary greatly in size and include a significant degree of degeneracy between co-regulated genes as well as with other Plasmodium species. Herein, we discuss the development of a new algorithm for DNA motif discovery. Our model suggests enhanced motif prediction power by incorporating structural properties from known protein-DNA interaction sites and improved expectation values for statistical pattern recognition using motif-occurrence analysis. The ability to recognize TFBS within Plasmodium would provide invaluable insight into the regulation and functionality of the parasite’s genes within its two-host system. Knowledge of TFBS in Plasmodium would contribute to a deeper understanding of the parasite biology and could potentially lead to the discovery of new targets for controlling Malaria.
Abstract:
Operons are important features in bacterial genomes, which place many related genes under a common regulatory control mechanism. Often, operon genes are organized in the order in which they catalyze reactions in metabolic pathways. These features not only allow operons to control large blocks of metabolic functions in bacteria, but also give researchers a method to assign function to unknown genes encountered in genomic and metagenomic studies.
Several mechanisms have been suggested for operon evolution. However, none offers a universal method to investigate all types of operons. This work aims to discover how operons evolve over time, to identify different mechanisms of operon evolution, and to determine whether there is any preference in operon evolution mechanism based on the type of gene cluster the operon contains, as suggested by studies of metabolic enzyme evolution.
To determine how operons evolve, we investigated all experimentally determined operons found in RegulonDB that contained at least five genes. We then located organisms that have all or part of these operons within their genomes, and determined how closely related each is to our reference organism, E. coli. Our preliminary results support the view that the less related an organism is to the reference organism, the greater the divergence in its operonic structure, but phylogenetic distance is not the sole explanation for the patterns we see in operon de-structuring. It appears that operons can be preserved in distant species while being broken up in closely related ones. These initial findings support horizontal gene transfer as a major determinant in operon evolution.
Abstract:
The ability of a drug molecule to inhibit the function of its target may be reduced if the target receptor in question is up-regulated or amplified beyond normal levels. The genetic variation within a target receptor can help explain the outcome of a drug treatment, provided that the targets are well understood. Our approach is to find genetic characteristics, utilizing microarray-based approaches to define DNA copy number changes and transcriptional profiles that correlate with variation in drug outcome.
In order to visualize and interpret genome-wide patterns, we developed a novel data mining strategy, Integrated Small Molecule Assessment (ISMA), for unveiling distinct patterns in copy number and gene expression as a function of drug outcome. We hypothesize that by evaluating a tumor’s genetic characteristics we can determine the optimal drug therapy and ultimately predict drug outcome.
Abstract:
Recently, RNA sequencing (RNAseq) from Next Generation Sequencing (NGS) technology has been successfully used to study alternative splicing in humans and mice. Methods used in these analyses rely solely on high quality gene models. Consequently, these methods are not suitable for other organisms lacking high quality gene annotations. To overcome this problem, other methods have been developed; for example, one method does not rely on an existing annotation but instead constructs the gene models from sequence reads that are mapped to the genome. However, spurious gene models are built by this method because many sequence reads are ambiguously mapped to the genome.
We have developed a new pipeline, employing an assembly approach, to build gene models and identify alternative isoforms. We have been using this method to study alternative splicing in chickens from two inbred lines that are resistant and susceptible to Marek’s disease (MD). The method identified many novel genes and isoforms that are not included in existing gene models. Isoforms were confirmed by reads that mapped across the exons. All kinds of alternative splicing events were detected, including cassette exons, alternative splice sites, alternative 5’ and 3’ ends, mutually exclusive exons and intron retention. Moreover, the method has successfully detected alternative isoforms between the two chicken inbred lines that might contribute to genetic resistance to MD. Finally, this method does not rely on existing gene annotations; therefore, it can be applied to study alternative splicing in any organism.
Abstract:
The Biomedical Genomics Core (BGC) is a nationally recognized provider of expertise in multiple aspects of genomics analysis, integrating microarray, next-generation sequencing and bioinformatics into a single shared resource. Next-generation sequencing with our Illumina HiSeq 2000 sequencing system is a critical example of how technology is driving development and demand for bioinformatics. We serve as an interface between the research investigator and the multiple domains that are required to handle the size and complexity of genomic data. In just a few short months we have generated over 2000 GBases of sequence data, consuming 50 TB of data storage and tens of thousands of CPU hours for analysis. The output from the HiSeq 2000 will soon exceed 75 GBases per day, enabling sequencing of up to three human genomes, 48 exome capture libraries, and up to 200 microbial genomes in just over one week. As such, we are currently facing a tremendous challenge regarding how best to handle these enormous data sets. First, we will describe the laboratory information management system (LIMS) that we implemented to track samples as they move through the various wet-bench library preparation steps. Second, we will showcase development of our pipeline to automate primary sequence analysis: basecalling, alignment and quality control analysis. A major focus for our group has been the development of secondary analysis pipelines to discover and annotate human genetic variants that lead to disease. Finally, we will describe some of the tools we are making available to investigators for data visualization and tertiary analysis.
Abstract:
Genome partitioning technologies are a powerful and cost-effective tool to selectively target and sequence defined genomic regions. Targeted sequence capture of the exome has enabled identification of numerous mutations that result in human disease. A number of methodologies have been developed to enrich a desired target sequence. The most established are solution based hybridization selection techniques utilizing novel capture probes (baits) designed to target 1.7% of the whole genome (~50MB), corresponding to known exonic regions. We developed novel bioinformatics approaches to assess the performance of each technology and a pipeline that automatically computes these statistics. This pipeline has become an essential analytical component for quality control of samples being processed using the Illumina HiSeq 2000 in the Biomedical Genomics Core.
Samples were processed using four kits: Agilent SureSelect Exon 38MB and 50MB kits, Illumina TruSeq, and NimbleGen SeqCap. Raw sequencing reads were aligned to the UCSC hg19 reference genome. Custom software was developed to extract exonic locations and record the number of aligned reads at these locations. Exons with a median coverage above a certain threshold value were counted as covered and averaged within the chromosome, across all chromosomes, and across all samples in the same sequencing run. Similar methods were used to compute the percentage of exons that were targeted by the capture kits, and the percentage of targeted exons that met defined sequence coverage thresholds.
The results of this study are invaluable in guiding investigators in making the most appropriate technology choice for their exome capture based studies.
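The coverage statistic described above (exons counted as covered when their median depth exceeds a threshold) can be sketched as follows. This is a minimal illustration, not the core's pipeline: the function names, the per-base depth list and the toy intervals are all assumptions made for the example.

```python
from statistics import median

def exon_median_coverage(depth, exons):
    """Return the median per-base depth for each exon.

    depth: per-base read depth for one chromosome (0-based list).
    exons: half-open (start, end) intervals on that chromosome.
    """
    return [median(depth[start:end]) for start, end in exons]

def fraction_covered(depth, exons, threshold):
    """Fraction of exons whose median depth is above the threshold."""
    medians = exon_median_coverage(depth, exons)
    covered = sum(1 for m in medians if m > threshold)
    return covered / len(exons)

# Toy chromosome of 12 bases with three hypothetical exons.
depth = [0, 0, 5, 12, 14, 11, 0, 0, 30, 31, 29, 2]
exons = [(2, 6), (6, 8), (8, 11)]
print(exon_median_coverage(depth, exons))
print(fraction_covered(depth, exons, threshold=10))
```

Averaging the per-chromosome fractions across chromosomes and samples would then give the run-level summaries mentioned in the abstract.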
Abstract:
A number of medical ontology models provide broad coverage of clinical terms and their relationships. In-depth coverage, however, is often not available, limiting their use in specialized sub-domains such as Sleep Medicine. Current ontology use is often limited to data annotation and as a source for term lists. Applications that need additional information about terms, such as their data types and measurement units, must model this information separately.
An existing 'application ontology' for sleep medicine, the Sleep Domain Ontology (SDO), was developed by the PhysioMIMI project to provide a common framework for physiological and clinical data in Sleep Medicine and to provide adequate term definitions to guide user interfaces. In addition, the COMET project provides multiple data dictionaries in sleep medicine (such as COMET-APPLES and COMET-ASQ) that contain many new concepts to integrate into SDO. Since SDO uses upper-level and reference ontologies such as BFO, FMA, CPR and OGMS, we first introduce a Minimal Domain of Discourse (MiDas) algorithm that automatically extracts a common domain of discourse from existing domain ontologies to facilitate integration, and then add the new concepts to SDO. The result is a more complete and well-structured ontology model for Sleep Medicine.
Abstract:
Genes are neither randomly nor uniformly distributed along mammalian genomes. Within C. elegans and D. melanogaster, genes with more complex functions have longer intergenic distances, and in humans, genes on the boundaries of gene deserts are enriched for transcription- and development-related functions. Using simulations, we characterize the roles that intergenic distance, gene length and mappability play in introducing bias in gene enrichment results when using standard gene set enrichment testing approaches for ChIP-Seq data. This is a critical issue because in high-throughput experiments, assignment of physical features or binding sites to the nearest gene without regard to the underlying locus length can result in spurious enrichment of development, nervous system, or transcription-related categories.
We propose a method that uses an empirical weighting scheme to control for potential bias introduced by these factors, and illustrate the method with simulated data. Existing methods do not take mappability into account, and particularly for shorter read lengths this can affect the results.
Abstract:
To function, many RNA molecules fold into complex 3D structures, stabilized by recurrent edge-to-edge base-pairing interactions. Complex motifs often contain edge-to-edge hydrogen-bonding base triples. We analyzed atomic-resolution RNA 3D structures to identify, cluster, and classify RNA base triples into recurrent, geometric families. We find that the central base in almost all triples basepairs with the other two bases, providing a natural way to classify base triples. For example, if in a collection of three bases, bases 1 and 2 form a cis Watson-Crick/Watson-Crick (cWW) base pair, while bases 2 and 3 form a trans Hoogsteen/Watson-Crick (tHW) base pair, then we assign the resulting base triple to the family "cWW/tHW". Combinatoric enumeration predicts 108 potential base triple families. We find instances in the PDB/NDB database of base triples representing 68 of these 108 geometric triple families. Model building suggests that some of the remaining 40 families may be unlikely to form, for steric reasons. Because the RNA 3D database is small, we expect many more base triple combinations to exist in RNA structures. Over 3000 different base triples are predicted, but we find fewer than 300 in the current RNA 3D database. We inferred additional triples by examining sequence variations in high-quality sequence alignments at positions where the same base triple is found in the 3D structures of E. coli and T. thermophilus 5S, 16S and 23S rRNA. This yielded 110 new inferred base triple combinations in 24 different triple families.
Abstract:
Cell line systems are important tools in studying the deregulation of cellular processes underlying breast tumorigenesis as well as investigating molecular features which may predict therapeutic response. In cancer cells, epigenetic modifications such as DNA methylation become substantially perturbed, altering expression profiles to promote development of malignant phenotypes. We have generated genome-wide DNA methylation profiles for the LBNL breast cancer cell line panel via MethylCap-seq. The strategy involves immunocapturing methylated DNA with the MBD2 protein, followed by second-generation sequencing. MicroRNAs (miRNAs) are a major class of small RNA molecules that post-transcriptionally inhibit gene expression. We are utilizing the new TruSeq small RNA multiplexing protocol to generate microRNA profiles for the LBNL cell line panel. Recent studies have shown that transcription factors and miRNAs can cooperate in multi-gene transcriptional and post-transcriptional regulatory loops. Such loops may confer proliferative and invasive activity to breast cancer cells. Integrating diverse sets of data is necessary to draw conclusions regarding these and other biological phenotypes. Effective data visualization can bridge the divide between computational and experimental biologists engaged in integrated analyses. Anno-J is a REST-based genome annotation visualization program which facilitates real-time evaluation of data at single-base resolution. Visual interpretation of patterns may permit the researcher to observe phenomena which computational analyses do not detect. An integrated map of the genomic distributions of CpG methylation, miRNAs, and transcripts in breast cancer cells will help illuminate the scope and complexity of the interactions that exist between methylation and miRNA, and their impact on transcriptional regulation.
Abstract:
Chromatin immunoprecipitation followed by next-generation sequencing (ChIP-Seq) has been widely employed to identify in vivo protein-DNA interactions or histone modifications on a genome-wide scale. A growing number of software applications have been developed and shown to successfully identify transcription factor bindings or histone modifications from ChIP-Seq experiments. However, while peak lists reported by different programs tend to agree on strong binding signals, they can vary substantially for weaker binding. This may be partially due to the fact that applications use various background models and statistical distributions; it is likely compounded by the fact that most methods do not model variation among replicates/samples when available. To address this issue and better exploit the power of replicates, we developed a ChIP-Seq analysis pipeline that utilizes a sliding window approach with a negative binomial model to accommodate variation among replicates. To evaluate our pipeline performance, we apply ChIP-Seq data from wildtype and knockout mouse embryonic fibroblast samples for the transcription factor ATF4 to our pipeline and an alternative software, ERANGE. To assess the effect of introducing an estimate of replicate variation, we also compare performance between our novel pipeline and a simpler version using read-concatenation. We demonstrate here that our pipeline performance is markedly improved when taking into account the variance among replicates. We show our pipeline identified more peaks than an alternative peak finding software, ERANGE, and without sacrificing the percent of peaks that contained the canonical motif. We also compare the above pipelines with additional, publicly available replicated ChIP-Seq experiments.
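The general idea of scoring sliding windows against a negative binomial background can be sketched as below. The abstract does not give the pipeline's actual model, so everything here is an assumption for illustration: the method-of-moments dispersion estimate, the function names, and the toy counts (which stand in for background window counts pooled across replicates).

```python
import math
from statistics import mean, variance

def window_counts(positions, region_length, width, step):
    """Count read start positions falling in each sliding window."""
    counts = []
    for start in range(0, region_length - width + 1, step):
        end = start + width
        counts.append(sum(1 for p in positions if start <= p < end))
    return counts

def estimate_nb(background_counts):
    """Method-of-moments mean and size (dispersion) of a negative binomial."""
    mu = mean(background_counts)
    var = variance(background_counts)
    # var = mu + mu**2 / size; use a near-Poisson size if not overdispersed.
    size = mu * mu / (var - mu) if var > mu else 1e6
    return mu, size

def nb_pmf(k, mu, size):
    """P(X = k) for a negative binomial with mean mu > 0 and the given size."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def nb_upper_tail(k, mu, size):
    """P(X >= k): chance of at least k reads in a window under the background."""
    return max(0.0, 1.0 - sum(nb_pmf(i, mu, size) for i in range(k)))

# Toy background counts, made up for the example.
background = [2, 4, 6, 8, 20]
mu, size = estimate_nb(background)
print(window_counts([1, 2, 3, 10, 11, 25], region_length=30, width=10, step=5))
print(nb_upper_tail(10, mu, size), nb_upper_tail(40, mu, size))
```

Because the size parameter shrinks as replicate-to-replicate variance grows, noisy backgrounds yield larger p-values for the same window count, which is the motivation for modelling replicate variation rather than concatenating reads.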
Abstract:
Temperature is one of the most important environmental parameters to which living organisms adapt. At the most fundamental, molecular level, organisms adapt to different environments by changes in the sequences of their macromolecules (including proteins and RNA) as well as the compositions of their membranes. At high temperature, macromolecules must resist thermal denaturation, while at low temperatures, macromolecules must retain fluidity and flexibility for movement and catalytic action. Previous work has shown that structured RNA molecules adapt to high temperature by increasing the content of GC vs. AU Watson-Crick base pairs. However, up to 33% of basepairs in structured RNAs, such as the ribosomal RNAs, are non-Watson-Crick and little attention has been paid to the variations in these basepairs. We have assembled sequence alignments of bacterial 16S and 23S rRNAs, organized by optimal growth temperature and used them to examine sequence variations at positions that form non-Watson-Crick basepairs. We have examined sequence variations in recurrent motifs, comprising ordered arrays of non-Watson-Crick basepairs. We designed oligonucleotide RNA constructs containing sequence variants of the Sarcin/Ricin motif, a thermodynamically stable and widespread, recurrent 3D motif. The sequence variants contain isosteric base-pair substitutions observed with varying frequency in ribosomes from organisms that grow at different temperatures. The results of UV melting experiments to determine the thermodynamic parameters of a range of constructs will be presented and compared with bioinformatic analysis of our sequence datasets.
Abstract:
The T box riboswitch regulates gene expression at the level of transcription via the interaction of uncharged cognate tRNA with the 5′-untranslated region of the nascent mRNA. The interaction involves, in part, the base pairing of the tRNA acceptor end with four bases in the bulge region of the highly conserved T box antiterminator RNA element. This base pairing prevents the formation of an alternative terminator element and results in complete transcription of the gene. We have developed multiple fluorescence-based assays to monitor tRNA binding to the T box antiterminator element and to monitor the disruption of this complex by small molecules. Assay validation and optimization results will be presented along with comparative approaches to data analysis.
Abstract:
This project addressed the problem of taxonomic classification of metagenomic sequences. Specifically, it considered discovery of genomic signature classifiers; genomic sequence classification; and characterization of classification accuracy. We presented high fidelity genome signatures and accurate classifications that utilized (1) our team’s extensive experience in genomic pattern discovery, and (2) our genome analysis toolkit, WordSeeker, which used the power of a supercomputer to provide sophisticated statistical models of genomic patterns.
Our approach improved on previous methods in several ways. The discovery of genomic signature classifiers was performed by utilizing WordSeeker, an open source, enumerative genomic signature discovery software toolkit. WordSeeker utilized a supercomputer to search many possible genomic signatures and to compute high fidelity genomic signatures. This allowed the generation of more sophisticated genomic signature classifiers than have been used in previous approaches for classification of DNA sequence fragments. Sequence classification was improved through the use of high fidelity genomic signatures. Sophisticated methods for characterizing the accuracy of classification techniques not only provide confidence measures, but are useful for guiding the selection of classifiers and the classification process. Benefits from these improved methods include increased quality and quantity of genomic signature classifiers, and consequent improvement and advancement in the reliability and accuracy of genomic sequence classification. For testing the approach, 35 organisms from the SEED database[1] were selected, and we demonstrated that the incorporation of the precise genomic signatures produced by WordSeeker enhanced classification accuracy. This rigorous foundation was used to build a comprehensive classification pipeline.
Abstract:
Identification and prediction of structured regulatory modules shared by Androgen-Responsive (AR) genes will provide better insights into the regulatory mechanism of AR related genes as well as allow us to better understand the hormone-control network and ultimately benefit prostate cancer treatment. We focused on 23 experimentally proven AR regulated human gene promoter sequences located within 2kb of the transcriptional start site. We expected that some of these promoters share structured regulatory elements related to AR regulation (as well as elements critical for other transcription factors), including individual binding sites, motifs and modules that consist of multiple motifs.
To identify such shared elements and their arrangement architecture, a genomic analysis toolkit, WordSeeker, was employed. By utilizing WordSeeker, we were able to detect multiple statistically over-represented words, motifs and modules based on a Markov model and several scoring functions. A parallel radix tree was integrated into the software to complete enumeration and scoring tasks efficiently. Additionally, multiple post-processing approaches, including visualization, functional look-up, and conservation analysis, were applied to the target dataset, allowing us to extend and refine the results. The results gained from the 23 known AR-related promoter sequences were treated as the training result and applied to 40 AR candidate gene promoters to predict putative regulatory regions.
The ultimate goal of this innovative, unbiased approach was to identify novel regulatory arrangements that could impact the hormone control network and thus are critically important to assess for the design of therapy aimed to fully disrupt androgen signaling.
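Word enumeration with a background-model score, as described above, can be illustrated with a small sketch. It does not reproduce WordSeeker: it assumes a zeroth-order Markov background (independent base frequencies), a simple log-ratio score, and a plain dictionary instead of a parallel radix tree; all names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def count_words(sequences, k):
    """Count every overlapping word of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def base_frequencies(sequences):
    """Relative frequency of each base over all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def overrepresented_words(sequences, k, top=5):
    """Rank words by log2(observed / expected) under independent bases."""
    observed = count_words(sequences, k)
    freqs = base_frequencies(sequences)
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    scored = []
    for word, obs in observed.items():
        expected = positions * math.prod(freqs[b] for b in word)
        scored.append((math.log2(obs / expected), word, obs))
    scored.sort(reverse=True)
    return scored[:top]

# Toy sequences, made up for the example.
for score, word, obs in overrepresented_words(["ACACAC", "GTGTGT"], k=2):
    print(word, obs, round(score, 3))
```

A higher-order Markov background would replace the product of base frequencies with a product of conditional probabilities estimated from shorter words.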
Abstract:
The Rat Genome Database has created a web-based tool for authoring and displaying interactive physiological pathway diagrams. The tool is based on the Adobe Flex/BlazeDS technology and consists of two components: 1. Diagram Designer; 2. Diagram Player. The two components can be run in web browsers with Adobe Flash Player installed. This eliminates the hassle of downloading, installing and updating, which is required by most other diagram authoring tools. Using the Diagram Designer, diagram authors can draw physiological pathway diagrams online. Authors can also share and work on the same diagrams with others. By dragging and dropping images in the user-friendly Designer, a diagram can be created in minutes. Authors can add text and URLs to every image in a diagram. A diagram can have many groups and layers of components. Authors can decide what part of the diagram to show according to user inputs. A diagram is ready for viewing online upon its creation; no exporting is needed. Diagram viewers can “play” diagrams stored on the server via the Diagram Player. The Diagram Player is feature-packed. Features such as dynamic legends, full-screen display, a built-in PDF printer, diagram auto-resizing and a pop-up blocker detector make the viewing process convenient and enjoyable. The software allows authors to publish pathway diagrams of physiology, pathology and pharmacology, at the system level and tissue level, as a package and to link them to each other. Most commonly used image formats, and even video and audio clips, are supported by the software. The software also supports auto-backup and version control.
Abstract:
Overlapping genes occur frequently in all kingdoms of organisms. Their widespread occurrence in large genomes is unexpected considering the large intergenic space in these genomes and the negative evolutionary pressure that may be imposed on overlapping genes. A few studies have been done on the evolution and origination of overlaps in eukaryotes, but no genome-wide study of different types of overlapping genes has been conducted in plants. We have analyzed the overlapping genes in three cereal genomes: rice, maize and Brachypodium, in which 747, 1564 and 347 overlapping genes were identified, respectively. Even though there is no obvious correlation between the number of overlapping genes and the total number of genes or genome size, genome size does have an effect on the types of overlapping genes. The larger maize genome possesses significantly more nested overlapping genes than the smaller rice and Brachypodium genomes. The majority of overlapping genes are species-specific, indicating frequent creation and loss of gene overlaps. Our results also show that translocation and gene creation/deletion are the major mechanisms for the origination and loss of overlapping genes. Mapping of small RNA reads to overlapping genes suggests that overlapping genes are a major source of nat-siRNAs in all three genomes; however, most nat-siRNA generation is not well conserved.
Abstract:
PubMed is a freely available database comprised primarily of the MEDLINE database of references and abstracts on life sciences and biomedical topics. Although it is a treasure trove for the general analysis of publication metadata, comprehensive use of PubMed content requires a sound understanding of its structure and of the way in which the content is represented and queried. This poster discusses an ongoing effort to convert a very recent version of PubMed into a dataset using Semantic Web technologies, to facilitate the use of a standardized data format and schema and the straightforward querying and graph-based analysis of publication metadata. The Resource Description Framework (RDF) was used as the metadata format and a Web Ontology Language (OWL) document was created that describes the categories and relations that comprise the RDF vocabulary used in the PubMed dataset.
During the creation of this vocabulary and the conversion of the content into an RDF dataset, a number of issues arose and are described in the poster. There were challenges associated with the transformation of the very large XML documents that comprise the PubMed source. In addition, issues related to the allocation of meaningful identifiers and the various ways in which the architecture of the web can be leveraged for this are discussed. Some example queries are also discussed along with a brief summary of the dimensions of the data and how they were prepared.
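One way to handle very large XML sources and identifier allocation is to stream the records and emit one triple at a time. The sketch below is an illustration under stated assumptions, not the poster's converter: the `http://example.org/pubmed/` identifier base is hypothetical, the Dublin Core `title` property stands in for whatever vocabulary the OWL document defines, and the sample record is a made-up toy.

```python
import io
import xml.etree.ElementTree as ET

BASE = "http://example.org/pubmed/"         # hypothetical identifier scheme
TITLE = "http://purl.org/dc/terms/title"    # stand-in property for the example

def escape_literal(text):
    """Escape a string for use as an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def pubmed_to_ntriples(source):
    """Stream PubmedArticle records and yield one title triple per article."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        title = elem.findtext("MedlineCitation/Article/ArticleTitle")
        if pmid and title:
            yield f'<{BASE}{pmid}> <{TITLE}> "{escape_literal(title)}" .'
        elem.clear()  # release the finished record to keep memory bounded

# Toy record, not a real citation.
SAMPLE_XML = b"""<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>12345</PMID>
<Article><ArticleTitle>An "example" title</ArticleTitle></Article>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

for line in pubmed_to_ntriples(io.BytesIO(SAMPLE_XML)):
    print(line)
```

Minting the subject URI from the PMID gives each article a stable, dereferenceable-style identifier, which is one of the design questions the poster raises.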
Abstract:
Obesity has garnered much attention due to an increased prevalence of calorie-dense foods, sedentary lifestyles, and confounding genetic predispositions. Furthermore, about 1 in 4000 individuals is diagnosed with a metabolic disorder. At the leading edge of unraveling these metabolic disorders is leptin, a pleiotropic protein identified in mouse almost two decades ago. Since its discovery, research has focused extensively on its physiology in mammals, with fewer than 1% of studies carried out on non-mammalian systems. My study documents the overall magnitude and effect of leptin expression in response to fasting, a condition common to ectotherms. To accomplish this, I surveyed the NCBI database using key search words to compile a list of gene expression studies, which were linearized using a standardized mean difference, fixed effects model and graphed on a forest plot. The data were collected via EndNote, compiled in SharePoint, and interfaced via Access. The meta-analysis was then computed via Review Manager 5. In conclusion, this study will help in unraveling the complex physiology of leptin signaling in mammals using a simpler vertebrate system.
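A standardized mean difference combined under a fixed-effect model can be sketched as follows. This is a generic inverse-variance calculation using Cohen's d without the small-sample correction that Review Manager applies, so it should not be read as the study's exact computation; the function names and the input numbers are made up for illustration.

```python
import math

def standardized_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two groups and its approximate sampling variance."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    var = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var

def fixed_effect(effects):
    """Inverse-variance pooled estimate, standard error and 95% interval.

    effects: list of (d, variance) pairs, one per study.
    """
    weights = [1 / var for _, var in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)

# Two toy studies (fed vs. fasted expression), values invented for the example.
studies = [
    standardized_mean_difference(10.0, 2.0, 10, 8.0, 2.0, 10),
    standardized_mean_difference(5.0, 1.5, 8, 4.5, 1.5, 8),
]
pooled, se, interval = fixed_effect(studies)
print(round(pooled, 3), round(se, 3), interval)
```

Each study's row on a forest plot corresponds to one `(d, variance)` pair, and the pooled estimate with its interval is the summary diamond.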
Abstract:
We analyzed over 15,000 expressed gene products obtained from Affymetrix genechip data in response to chronic hypoxia and found 376 to be differentially expressed. Zebrafish have been shown to elicit a response to hypoxia along with an ability to regenerate heart tissue, leading research towards a better understanding of human heart diseases. Phenotypic plasticity in response to external stimuli, such as hypoxia, has been widely documented; however, these biochemical adaptations have only recently been examined from a transcriptomic approach. Some teleosts, such as zebrafish (Danio rerio) and common carp (Cyprinus carpio), are capable of withstanding low oxygen levels, which may have a direct corollary to their evolutionary host environments. Zebrafish, specifically, have garnered much attention in this area due to a completed and well-annotated genome along with commercially available genechips. Preliminary analysis of the data shows a significant reduction in aerobic metabolism enzymes involved in fatty acid oxidation along with reductions in suppressor of cytokine signaling 3 (SOCS3). SOCS3 has been implicated in metabolic disorders such as obesity. There was a significant decrease in overall signal transduction and vesicular trafficking, indicating a reduced efficiency in signaling and/or a reduction in energy utilization for peripheral signals. Of note, there was a reduction in creatine kinase, which is counterintuitive in a perceived state of injury such as hypoxia/apoptosis. In response to chronic hypoxia, ROS protection and delta-9 desaturase were significantly induced along with uncoupling protein 4 and M-Ras, suggesting that protection against apoptosis, maintenance of membrane fluidity, and protein stability are essential in the response to hypoxia.
Abstract:
Seasonal and pandemic strains of Influenza A have caused a severe health, as well as economic, crisis. According to the CDC, from April 2009 to April 2010 there were an estimated 61,000,000 cases of H1N1. Of these, 274,000 resulted in hospitalization and 12,470 resulted in death. This pandemic resulted in a very frightened public; hence there was a major economic impact, primarily in the realm of tourism.
Our evolutionary studies of the H1N1 proteome have revealed an intriguing phenomenon: regions of protein surfaces that are 100% conserved across multiple strains and years, even though on average the surface residues were found to be less conserved than the core residues for all 10 protein types of H1N1. Furthermore, the conservation was found to be associated exclusively with the intra-viral interactions, where proteins of H1N1 interact with each other or with viral RNAs. To test the hypothesis that the intra-viral interactions exhibit extreme conservation in different influenza subtypes, we have developed an automated pipeline. With this information, perhaps new antiviral targets can be identified and even better vaccines can be developed.
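The notion of 100% conserved regions can be made concrete with a per-column conservation score over a multiple sequence alignment. The sketch below is only an illustration of that step, not the authors' pipeline: it ignores surface versus core assignment (which needs structural data), treats gap characters as ordinary symbols, and uses made-up toy sequences and hypothetical function names.

```python
from collections import Counter

def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def fully_conserved_runs(scores, min_length):
    """Half-open (start, end) runs of columns with conservation of 1.0."""
    runs, start = [], None
    for i, score in enumerate(scores + [0.0]):  # sentinel closes a final run
        if score == 1.0 and start is None:
            start = i
        elif score != 1.0 and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs

# Toy alignment of three strains, invented for the example.
aligned = ["MKTAYIA", "MKTAYLA", "MRTAYIA"]
scores = column_conservation(aligned)
print(scores)
print(fully_conserved_runs(scores, min_length=3))
```

Intersecting such runs with residues known to lie on the protein surface or at interaction interfaces would be the next step toward the comparison described in the abstract.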
Abstract:
With the advent of next-generation sequencing technologies it is now possible to generate billions of short sequence fragments or reads. Studies using these new high-throughput technologies have shown there to be an astonishingly high level of diversity at the microbial level of our environment, far exceeding anything we have ever seen. Numerous different algorithmic approaches have already been developed to map these short reads onto reference genome sequences. In contrast with these resequencing projects, environmental sequencing projects often lack reference genomes of near-neighbor organisms. When sufficiently close reference genomes do not exist, less rigorous approaches must be taken, as is the case for mapping sequencing reads from diverse environmental samples. Herein, we have developed a new suite of data structures and algorithms specifically for the mapping of reads from environmental samples. This suite has fostered the development of a new pipeline that can rapidly map reads to genomes with many mismatches between the two. We have used this pipeline to map three environmental samples onto all currently sequenced bacteria, plasmids, and DNA-based viruses. Samples were taken from the near-shore waters of three Lake Michigan beaches, Loyola, Montrose, and 57th Street, for sequencing on the Illumina Genome Analyzer System. About 12 million short reads were generated for each beach. Downstream analysis of this mapping information revealed the presence of several different organisms within the collection of Lake Michigan samples as well as differences between the individual sampling sites.
Abstract:
Arabinogalactan-proteins (AGPs) are a family of highly glycosylated proteoglycans that are found in all plants as components of the cell wall, plasma membrane and cellular secretions. AGPs function in a variety of cellular processes including cell proliferation, cell expansion, somatic embryogenesis and cell death. They also possess valuable adhesive and emulsification properties that are utilized for commercial purposes. Little is known about the mechanisms and galactosyltransferase enzymes (GALTs) responsible for glycosylation of AGPs. Consequently, a bioinformatics approach was adopted to identify and characterize putative GALTs responsible for synthesizing β-(1,3)-Gal linkages. First, a BLASTP search was performed using protein sequences from three mammalian β-(1,3)-galactosyltransferases in The Arabidopsis Information Resource database. Twenty putative galactosyltransferases containing pfam 01762 were identified. Like the mammalian β-(1,3)-GALTs, all 20 Arabidopsis proteins contain a conserved galactosyltransferase sequence domain (pfam 01762) and are annotated as members of glycosyltransferase family GT31 in the CAZy database. Interestingly, only six of these contain another conserved sequence, which is classified as a galactoside binding lectin domain (pfam 00337). This domain is absent in mammalian β-(1,3)-GALTs and also absent in all other plant glycosyltransferases involved in N-glycan processing. Furthermore, phylogenetic analysis of these 20 Arabidopsis proteins and mammalian GALTs revealed that the six identified proteins cluster together. These six candidate proteins were run through several subcellular localization prediction programs (TargetP, Predator). Based on these predictions, all six proteins should be targeted to the secretory pathway and all of them are predicted to have a single transmembrane domain. These studies form the basis for future
GLBIO is an official conference of the International Society for Computational Biology