Poster presentations at ISMB 2020 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and will upload it to the virtual conference platform site along with a PDF of their poster.
All registered conference participants will have access to the posters and presentations throughout the conference, and to the content until October 31, 2020. A chat function provides Q&A opportunities for interaction between presenters and participants.
Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters
Ideally authors should be available for interactive chat during the times noted below:
Poster Session A: July 13 and July 14, 7:45 am - 9:15 am Eastern Daylight Time
Poster Session B: July 15 and July 16, 7:45 am - 9:15 am Eastern Daylight Time
Short Abstract: Early diagnosis of many zoonotic diseases has always been problematic because hosts often remain asymptomatic until it is too late, and accurate, sensitive diagnosis is often confounded by closely related and environmental bacteria. Small RNAs (sRNAs), a class of regulatory RNA, are often dysregulated in both noncommunicable and infectious diseases. Although intra- and extracellular sRNAs have been used extensively in cancer detection (host miRNA), few studies have addressed sRNAs as biomarkers for detecting pathogens within the host. Here, we develop a computational workflow to identify pathogen-specific sRNA diagnostic targets within infected hosts. We start with quality control of the sequenced samples, followed by alignment against host and pathogen reference genomes and standard sRNA databases. We then quantify the differentially expressed sRNAs and identify unique pathogen-specific candidates. We test the workflow with a recently published dataset of Mycobacterium avium paratuberculosis infecting Bos taurus (GSE129819) and with our bovine tuberculosis dataset in nonhuman hosts (unpublished) to identify unique pathogenic sRNAs. Next, we will use de novo approaches to predict novel pathogenic sRNAs. Taken together, our computational approach will help identify sRNA signatures unique to pathogens within infected hosts and facilitate early diagnosis of infection and bacteremia.
Short Abstract: Background
Host-directed therapy (HDT) is a promising avenue for combating infectious diseases that continue to result in high mortality rates due to the rise of drug-resistant pathogens. An attractive route to HDT is by repurposing existing FDA-approved drugs designed for non-communicable diseases to fight infectious diseases at reduced cost and time. We are developing a computational workflow to repurpose approved drugs for HDT in infectious diseases.
We first obtain infectious disease-related and drug-related expression datasets from public gene-expression databases (NCBI GEO and LINCS). Based on the hypothesis that HDT candidates reverse the effects of the disease, we then identify potential drug-disease pairs that show anti-correlation between the drug and disease expression signatures.
We are applying and testing this general workflow on both Mycobacterium tuberculosis and Staphylococcus aureus infection datasets to (i) identify critical genes, (ii) identify pathways that are consistently perturbed, and (iii) generate a list of candidate HDT drugs that could target the key dysregulated gene signatures/pathways.
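The anti-correlation step in this abstract can be illustrated with a minimal sketch. It assumes each signature is a mapping from gene to log fold change and uses Pearson correlation as the similarity measure; the abstract does not say which measure the authors use, and the gene names, drug names, and values below are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_reversing_drugs(disease_sig, drug_sigs, min_shared=3):
    """Rank drugs from most anti-correlated to most correlated with the disease.

    disease_sig: dict mapping gene -> log fold change (disease vs. control)
    drug_sigs:   dict mapping drug -> dict of gene -> log fold change
    """
    scores = {}
    for drug, sig in drug_sigs.items():
        shared = sorted(set(disease_sig) & set(sig))
        if len(shared) < min_shared:
            continue  # too few shared genes for a meaningful correlation
        scores[drug] = pearson(
            [disease_sig[g] for g in shared],
            [sig[g] for g in shared],
        )
    # Most negative correlation first: these drugs best "reverse" the disease.
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy signatures (not real GEO or LINCS data).
disease = {"G1": 2.0, "G2": -1.5, "G3": 0.5, "G4": -0.8}
drugs = {
    "drug_a": {"G1": -1.8, "G2": 1.2, "G3": -0.4, "G4": 0.9},
    "drug_b": {"G1": 1.0, "G2": -0.7, "G3": 0.2, "G4": -0.5},
}
ranked = rank_reversing_drugs(disease, drugs)
```

In this toy case `drug_a` ranks first because its signature moves in the opposite direction to the disease signature for every gene.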
Short Abstract: Influenza A virus, an important human respiratory pathogen, causes seasonal, endemic, and pandemic infections all over the world with high mortality rates. The main reason for reformulating the vaccine is the antigenic drift and shift of this virus. It is therefore of interest to design universal influenza peptide vaccines with high efficacy, which could be achieved through efficient CTL and HTL epitope prediction. Targeting the more conserved and antigenic neuraminidase protein could elicit a more dominant cross-protective immune response. In this study, we aim to develop a reliable deep neural network that can predict influenza A epitopes. This drastically reduces experimental effort and gives faster results. MHC-I and MHC-II binding of peptides, together with other major properties, becomes a deciding factor for potent vaccine candidates. To assess the model's performance, various standardized performance measures were considered. For both CTLs and HTLs, an optimal accuracy of 98 percent was obtained with all the features, and 88 percent with the peptide column alone. The model was then validated with experimentally determined epitopes that are already available. Given the various existing ANN-based servers, benchmarking with comparative rank measures still needs to be done.
Short Abstract: The importance of scientific conferences, symposia, workshops and satellite meetings is often discussed in scientific circles. When deciding where to submit a manuscript and present new findings, the quality of science and the opportunity to network with top researchers are of significant concern. To explore the question of the quality of science, we are interested in assessing the impact of primary research presented in the meetings in the field of bioinformatics and computational biology. We looked at five conferences with good reputation: ISMB, PSB, RECOMB, ECCB and BCB. We collected research papers published in these venues and extracted their citations. We then summarized these citations to quantify the long-term performance of an average paper, median paper, and median of the top 10 papers for each meeting in each year. Our results suggest that the original research presented in all conferences is influential, with ISMB being more impactful than ECCB and RECOMB, that themselves have comparable statistics. Despite some outstanding years and strong top papers, PSB ranked next, and ahead of BCB. Although citation-based measures of impact are imperfect, we submit that the results of our analyses provide a useful characterization of these venues.
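The three summary statistics named in this abstract (average paper, median paper, and median of the top 10 papers) can be sketched as follows. This is only an illustration of the arithmetic for one meeting in one year; the citation counts are hypothetical and the function name is an assumption.

```python
from statistics import mean, median


def summarize_citations(citations, top_n=10):
    """Summarize citation counts for the papers of one meeting in one year."""
    ranked = sorted(citations, reverse=True)
    return {
        "mean": mean(ranked),
        "median": median(ranked),
        # Median citation count among the top_n most-cited papers.
        "top_median": median(ranked[:top_n]),
    }


# Hypothetical citation counts for twelve papers.
counts = [120, 45, 30, 12, 8, 5, 3, 1, 0, 0, 200, 60]
summary = summarize_citations(counts)
```

The mean is pulled upward by a few highly cited papers, which is one reason to report the median and the top-10 median alongside it.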
Short Abstract: The U.S. Environmental Protection Agency is exploring the use of profiling methods for rapid bioactivity screening and hazard evaluation of environmental chemicals. ‘Cell Painting’ is an imaging-based profiling method that measures morphological features of cellular organelles. Here, we adapted this method for use in high-throughput bioactivity screening of environmental chemicals with a focus on two applications: (1) Estimation of potency thresholds (i.e. phenotype altering concentration, PAC) for chemical bioactivity; (2) Use of phenotypic profiles to discern putative mode-of-action (MOA).
To date, we have screened >1,200 chemicals in concentration-response in U-2 OS cells and identified PACs. For 420 chemicals, in vitro-to-in vivo extrapolation was performed to compare the potency estimates to available in vivo data. In 68% of cases, the high-throughput phenotypic profiling (HTPP) estimate was comparable to or more conservative than in vivo effect values.
Multiple characteristic phenotypic profiles were observed, with chemicals sharing a MOA often displaying similar profiles. For example, profiles for retinoic acid receptor agonists and glucocorticoids had high similarity to their respective model compound (retinoic acid, dexamethasone). We also noted profile clusters for different classes of pesticides (organochlorines, strobins, dinitroanilines). Overall, similar cellular effects were observed, both among structurally diverse and structurally related chemicals.
This abstract does not reflect USEPA policy.
Short Abstract: Antibodies are capable of potently and specifically binding individual antigens and, in some cases, disrupting their functions. The key challenge in generating antibody-based inhibitors is the lack of fundamental information relating sequences of antibodies to their unique properties as inhibitors. We develop a pipeline, Antibody Sequence Analysis Pipeline using Statistical testing and Machine Learning (ASAP-SML), to identify features that distinguish one set of antibody sequences from antibody sequences in a reference set. The pipeline extracts feature fingerprints from sequences. The fingerprints represent germline, CDR canonical structure, isoelectric point and frequent positional motifs. Machine learning and statistical significance testing techniques are applied to antibody sequences and extracted feature fingerprints to identify distinguishing feature values and combinations thereof. To demonstrate how it works, we applied the pipeline to sets of antibody sequences known to bind or inhibit the activities of matrix metalloproteinases (MMPs), a family of zinc-dependent enzymes that promote cancer progression and undesired inflammation under pathological conditions, and compared them against reference datasets of sequences that do not bind or inhibit MMPs. ASAP-SML identifies features and combinations of feature values found in the MMP-targeting sets that are distinct from those in the reference sets.
Short Abstract: We are presenting a research project that provides a semi-automatic means of conducting FAIR assessments of Bioinformatics tools and datasets. Our motivation stems from the growing interest in ensuring the transparency and reproducibility of the published scientific literature. A study of 149 biomedical articles (published between 2015 and 2017) by Wallach, Boyack and Ioannidis (2018) showed that only 19 (~18%) of 104 articles with empirical data discussed publicly available data, only one (~1.0%) included a link to a full study protocol while only 5 (~5.2%) of 97 articles had replication of previous studies.
Findability, Accessibility, Interoperability and Reusability (FAIR) assessments indicate how easy it is for a researcher to reproduce a study by scoring aspects such as whether a tool or dataset used in a study is easily available for download and use, whether a tool can be used on different OS platforms, and whether the tools or datasets are available from a reputable source, among others.
Our tool provides a good start to automate the assessment by attempting to search for the FAIR criteria on the internet and providing an option for the researcher to input missing details. It can also score entire pipelines of tools.
Short Abstract: Protein tyrosine phosphatase 1B (PTP1B) enzyme, a widely validated anti-diabetic molecular target, is essential in the regulation of metabolism. In the insulin and leptin signaling pathway, it is a crucial element in the pathogenesis of two major diseases: type 2 diabetes and obesity.
There are three main approaches to inhibit PTP1B: active site, allosteric, and bidentate inhibition. Due to the PTP1B subcellular location and structural properties, developing a potent and selective inhibitor of the enzyme is a demanding task.
We performed an extended search against the allosteric and active sites of the catalytic domain to find an inhibitor targeting PTP1B. The search included structure-based virtual screening (with iDock) and molecular docking (with AutoDock Vina). We screened 2485 chemical compounds from the ZINC database. Some of them bound to the active and allosteric sites, showing the best binding affinity to the enzyme (-34.0 to -27.15 kcal/mol). Finally, we modeled bioisosteres of the compounds that bound specifically to each site.
Molecular dynamics simulation showed stability for the complex formed by PTP1B and five compounds: irinotecan, doxycycline, tetracycline, demeclocycline, and riboflavin. Findings suggest that bioisosteres of irinotecan and doxycycline could potentially be competitive and non-competitive PTP1B inhibitors.
Short Abstract: As the interface between the external world and the human body, the human nose is the main entry point for numerous pathogens and commensal bacteria. Composition and homeostasis of the nasal microbiome profoundly impact the development of infectious diseases. Surprisingly, the contribution of many nasal microorganisms to human health remains undiscovered.
The severity of world-wide infections with hardly treatable pathogens motivates us to construct a community-level network of microbial species that populate the human nose. Our large-scale network reconstruction approach started with an extensive literature search for observed nasal microbes. By applying an analysis tool on the collected data, we created an initial interaction network of nasal microbes. This network became the foundation for formulating a generalized Lotka-Volterra (gLV) system to accurately estimate the interaction parameters and get an efficient and detailed biological interpretation of the nasal microbiome. Next, we examine the mathematical behavior of the network and its stability in response to perturbations.
Our endeavor will ultimately result in a reference map of the community structure of abundant and scarce microbes in the human nose, and it will deepen our understanding of the interactions within the nasal microbial community and their role in homeostasis, health, and disease.
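For readers unfamiliar with the generalized Lotka-Volterra (gLV) system mentioned in this abstract, the sketch below simulates its standard form, dx_i/dt = x_i (r_i + sum_j A_ij x_j), with a simple forward Euler scheme. It illustrates forward simulation only, not the parameter estimation the authors describe; the two-species growth rates and interaction matrix are hypothetical.

```python
def glv_step(x, r, A, dt):
    """One forward Euler step of dx_i/dt = x_i * (r_i + sum_j A[i][j] * x_j)."""
    n = len(x)
    updated = []
    for i in range(n):
        per_capita = r[i] + sum(A[i][j] * x[j] for j in range(n))
        # Abundances cannot be negative, so clamp at zero.
        updated.append(max(0.0, x[i] + dt * x[i] * per_capita))
    return updated


def simulate_glv(x0, r, A, dt=0.01, steps=5000):
    """Integrate the gLV system from x0 for `steps` Euler steps of size dt."""
    x = list(x0)
    for _ in range(steps):
        x = glv_step(x, r, A, dt)
    return x


# Hypothetical two-species community with mutual competition.
growth = [1.0, 0.8]          # intrinsic growth rates r_i
interactions = [             # A[i][j]: effect of species j on species i
    [-1.0, -0.5],
    [-0.4, -1.0],
]
final = simulate_glv([0.1, 0.1], growth, interactions)
```

For these parameters the coexistence equilibrium solves r + A x = 0, giving x = (0.75, 0.5), and the simulation settles there. Perturbing `final` and re-running is a simple way to probe the stability question raised in the abstract.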
Short Abstract: We present the BioDepot-workflow-builder (Bwb), an integrated graphical platform that facilitates the creation and execution of analytical workflows using reproducible Docker containers as modules. Bwb represents workflows as interconnected graphical nodes (widgets) that represent executable modules. Users drag and drop widgets onto the screen and connect them to construct a workflow, which can be executed locally, saved, or exported as a shell script. Each widget represents an executable encapsulated inside a software container, which automates installation and ensures reproducibility. Widgets can also spawn their own interactive graphics, which allows visualization tools to be included in workflows. Bwb is designed for biomedical scientists to interactively build, profile, execute and customize workflows without needing to write any code. We demonstrate the utility of Bwb to create and reproducibly execute well-established RNA sequencing workflows. We highlight a use case that demonstrates how Bwb's support for exporting GUIs and visualizations can be leveraged to incorporate Jupyter notebooks into workflows. Demonstration workflows and widgets are publicly available from GitHub and are included with the Bwb distribution.
Reference: Cell Systems 2019, volume 9, issue 5, pages 508-514.E3.
Source code: github.com/BioDepot/BioDepot-workflow-builder
Short Abstract: Tumours are heterogeneous tissues consisting of different subpopulations of cells. Estimating and accounting for tumour purity, the fraction of tumorous cells in contrast to immune or stromal cells in the same sample, have become common practice in identifying tumour-specific activation or suppression of gene expression. However, most analytical frameworks assume that the non-tumorous cells have no response to the tumour microenvironment at the transcriptomic level. Considering this potential confounding effect, we develop a novel analytical framework that jointly uncovers transcriptomic signatures of tumorous cells, normal cells and microenvironmental response in tumour-infiltrating immune or stromal cells. We leveraged RNA sequencing data of 1,250 paired tumour and normal samples of 16 types of tumour from The Cancer Genome Atlas. We estimated non-tumorous cell proportion in each sample and implemented our framework. We found that up to 739 genes exhibited significant tumour-specific immune or stromal responses to tumour microenvironment that were independent of tumour-specific dysregulation of gene expression. These genes were enriched in various anti-tumour immunity-associated pathways, including cytotoxicity and dysregulation of metabolism. Given the straightforward implementation and interpretation, this framework may be broadly generalized to molecular oncology research for refining characterization of the transcriptomic landscape of tumour and microenvironmental response.
Short Abstract: In single-cell transcriptomics, the clustering of cells relies on co-expressed gene sets. Usually, we compact a set of highly variable genes into a few PCA dimensions and, in this smaller space, apply a group-detection algorithm. Here we propose an alternative to this paradigm: by splitting the gene pool into co-expression modules, we can set up a multi-layer view of cell types and states, highlighting overlapping cell types.
We identified modules via two popular R packages (WGCNA and Monocle3) and one novel Bioconductor package, fcoex, which repurposes a feature selection method to identify seed genes for modules. After identification, the gene modules were used to divide cells into two groups: a module-positive group, which expresses module genes, and a module-negative, for which expression of module genes is absent. For all three packages, we could detect overlapping populations of functionally-related cells. As an example, in a peripheral blood mononuclear cells dataset, two of the populations identified by our approach corresponded, by their markers, to lymphocytes and antigen-presenting cells. Both these groups share an overlap at B cells, embracing the multiplicity of B-cell functional identities. This detection of heterogeneous, overlapping cell-types would be mathematically impossible using traditional single-layer clustering schemes.
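The module-positive versus module-negative split described above can be sketched as follows. This is not the fcoex, WGCNA, or Monocle3 implementation: the rule used here (mean expression of module genes above a threshold) is an assumption, and the cells, module names, and expression values are hypothetical.

```python
def module_positive_cells(expression, module_genes, threshold=0.0):
    """Return the set of cells whose mean module-gene expression exceeds threshold.

    expression:   dict mapping cell -> dict of gene -> expression value
    module_genes: list of genes that make up one co-expression module
    """
    positive = set()
    for cell, profile in expression.items():
        values = [profile.get(gene, 0.0) for gene in module_genes]
        if sum(values) / len(values) > threshold:
            positive.add(cell)
    return positive


# Hypothetical expression values for three cells.
expression = {
    "t_cell": {"CD3E": 4.0},
    "b_cell": {"CD79A": 5.0, "HLA-DRA": 3.0, "CD74": 4.0},
    "monocyte": {"HLA-DRA": 6.0, "CD74": 5.0},
}
# Hypothetical modules; each module defines its own layer of labels.
modules = {
    "lymphocyte": ["CD3E", "CD79A"],
    "antigen_presenting": ["HLA-DRA", "CD74"],
}
layers = {
    name: module_positive_cells(expression, genes)
    for name, genes in modules.items()
}
overlap = layers["lymphocyte"] & layers["antigen_presenting"]
```

Because each module labels cells independently, one cell can be positive in several layers, which a single-layer clustering cannot express. In this toy example the overlap contains only the B cell.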
Short Abstract: CRISPR-mediated homology-directed repair is a powerful method for investigating gene function, as it can introduce precise edits such as point mutations into the genome. A common difficulty with this type of experiment is the detection of successful CRISPR edits in injected embryos, particularly for organisms with longer generation periods where simply waiting for the F1 generation is less feasible. To address this problem, novel restriction enzyme recognition sites can be added to the repair oligo using synonymous substitutions. That way, if the repair oligo is successfully incorporated into the genome, it will have new restriction sites that will result in detectably shorter fragment lengths upon digestion with the enzyme. However, with hundreds of restriction enzymes and combinations of synonymous substitutions to choose from, designing repair oligos for this purpose can be extremely time- and labor-intensive. This poster introduces a computational tool that automates this process, outputting all viable CRISPR repair oligos with novel restriction sites for a given sequence and edit. It features an easy-to-read docx output and accompanying statistics for each output oligo that can be used to rank them. It is species-agnostic and can support both Cas9 and Cas12 (Cpf1) PAM sites.
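The core idea, introducing a restriction site through synonymous substitutions, can be sketched in a few lines. This is not the tool presented in the poster: it considers only single-codon substitutions in an in-frame coding sequence, uses the standard genetic code, and the function name and example sequence are hypothetical. EcoRI (recognition site GAATTC) serves as the example enzyme.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_site_edits(cds, site):
    """List single-codon synonymous substitutions that create a new `site`.

    cds:  in-frame coding sequence (length a multiple of 3, uppercase DNA)
    site: restriction enzyme recognition sequence, e.g. "GAATTC" for EcoRI
    Returns tuples of (codon index, original codon, new codon, edited sequence).
    """
    edits = []
    baseline = cds.count(site)
    for start in range(0, len(cds) - 2, 3):
        codon = cds[start:start + 3]
        for alt, aa in CODON_TABLE.items():
            # Keep only alternatives that encode the same amino acid.
            if alt == codon or aa != CODON_TABLE[codon]:
                continue
            edited = cds[:start] + alt + cds[start + 3:]
            if edited.count(site) > baseline:
                edits.append((start // 3, codon, alt, edited))
    return edits


# Hypothetical coding sequence: Met-Glu-Phe-Lys.
edits = synonymous_site_edits("ATGGAGTTCAAA", "GAATTC")
```

Here the only hit changes the glutamate codon GAG to GAA, which leaves the protein unchanged and creates an EcoRI site spanning the second and third codons.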
Short Abstract: Nelarabine is a nucleoside analogue and the prodrug of arabinosylguanine (AraG). It is commonly used for the treatment of T-cell acute lymphoblastic leukaemia but has limited efficacy in B-cell acute lymphoblastic leukaemia. Our approach combining bioinformatics with wet-lab investigation has identified SAMHD1, a deoxynucleoside triphosphate triphosphohydrolase (dNTPase), as a potential factor driving this lineage-specific discrepancy in nelarabine response. Pharmacogenomic screening data and data from an ALL cell line panel revealed SAMHD1 expression was associated with reduced nelarabine sensitivity across all ALL cell lines. Interestingly, SAMHD1 expression was significantly lower in T-ALL cell lines and patient-derived leukaemic blasts compared with B-ALL. Moreover, despite similar global methylation levels in B-ALL and T-ALL cell lines, SAMHD1 promoter methylation was significantly greater in T-ALL than B-ALL cell lines, suggesting a mechanistic association between increased SAMHD1 promoter methylation and reduced SAMHD1 expression in T-ALL. Results from ALL cell line panels supported these findings, with targeted SAMHD1 degradation by Vpx sensitising B-ALL cell lines to AraG and ectopic SAMHD1 expression inducing AraG resistance in SAMHD1-null T-ALL cells. Taken together, our results implicate SAMHD1 expression as a key factor in determining sensitivity to nucleoside analogues, which may present a useful biomarker for predicting drug response in T-ALL.
Short Abstract: CRISPR-Cas9 systems have become a leading tool for gene editing. However, the design of the guide RNAs used to target specific regions is not trivial. Design tools need to identify target sequences that maximise the likelihood of obtaining the desired cut and minimise the risk of off-target modifications. Achieving this across entire genomes is computationally challenging. There is a clear need for a tool that can meet both objectives while remaining practical to use on large genomes. Here, we present Crackling, a new method for whole-genome identification of suitable CRISPR targets. We test its performance on 12 genomes, ranging in length from 375 to 9965 megabases, and on data from validation studies. The method maximises the efficiency of the guides by combining the results of multiple scoring approaches. On experimental data, the set of guides it selects is better than those produced by existing tools. The method also incorporates a new approach for faster off-target scoring, based on Inverted Signature Slice Lists (ISSL). This approach provides a gain of an order of magnitude in speed, while preserving the same level of accuracy. This makes Crackling a faster and better method to design guide RNAs at scale.
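As background for the target-identification step, the sketch below enumerates candidate SpCas9 sites: a 20-nucleotide protospacer followed by an NGG PAM. It is not Crackling's implementation and covers neither its scoring nor ISSL; it scans the forward strand only, and the function name and example sequence are hypothetical.

```python
import re

# A lookahead lets overlapping candidates be reported as well.
CAS9_SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_cas9_targets(sequence):
    """Return (position, protospacer, PAM) for each forward-strand candidate."""
    sequence = sequence.upper()
    return [
        (match.start(), match.group(1), match.group(2))
        for match in CAS9_SITE.finditer(sequence)
    ]


# Hypothetical 24-nt sequence containing a single candidate site.
targets = find_cas9_targets("ACGTACGTACGTACGTACGT" + "TGG" + "A")
```

A whole-genome tool would also scan the reverse complement and then filter candidates by efficiency and off-target risk, which is where the computational cost described in the abstract arises.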
Short Abstract: Vast amounts of transcriptomic data reside in public repositories, but effective reuse remains challenging. Issues include unstructured dataset metadata, inconsistent data processing and quality control, and inconsistent probe-gene mappings across microarray technologies. Thus, extensive curation and data reprocessing are necessary prior to any reuse. The Gemma bioinformatics system (gemma.msl.ubc.ca) was created to resolve these problems. Gemma consists of a database of curated transcriptomic datasets, underlying analytical software, a web interface, and a web service. Here we present an update on Gemma’s holdings, our curation and analysis pipelines, and software features. As of April 2020, Gemma contains 10,136 manually curated datasets (primarily human, mouse, and rat), over 380,000 samples and hundreds of curated transcriptomic platforms (both microarray and RNA-sequencing). Datasets were annotated with 10,584 distinct terms from 12 ontologies, for a total of 52,010 annotations. While Gemma has broad coverage of conditions and tissues, brain-related datasets account for 33% of its holdings (the large majority of brain studies in GEO). Users can access the curated data and differential expression analyses through both the Gemma website and a RESTful service; an R package is also provided for ease of use (github.com/PavlidisLab/gemmaAPI.R).
Short Abstract: Computational prediction of immunogenic epitopes is a promising platform for therapeutic and preventive vaccine design. A potential target for this strategy is HIV-1, for which no vaccine is available. In particular, a therapeutic vaccine devised to eliminate infected cells would represent a key component of cure strategies. HIV peptides designed based on individual viro-immunological data from people living with HIV/AIDS have recently been shown to induce post-therapy viral set point abatement. However, the reproducibility and scalability of this method are curtailed by the errors and arbitrariness associated with manual design and by the time-consuming process.
We herein introduce Custommune, a user-friendly web tool to design personalized and population-targeted vaccines. When applied to HIV-1, Custommune predicted personalized epitopes using patient specific HLA alleles and viral sequences, as well as the expected HLA-peptide binding strength and potential immune escape mutations. Of note, Custommune predictions compared favorably with manually designed peptides administered in a phase II clinical trial (NCT02961829).
Furthermore, we utilized Custommune to design preventive vaccines targeted for populations highly affected by COVID-19. The results allowed the identification of peptides tailored for each population and predicted to elicit both CD8+ T-cell immunity and neutralizing antibodies against structurally conserved epitopes of SARS-CoV-2.
Short Abstract: Deep learning has become an innovative tool for detecting phosphorylation sites within a protein. However, the imbalance between negative and positive sites makes it challenging for a deep learning model to classify all sites accurately. Although identifying additional sites is possible, it is often costly and time-consuming with existing methods. Therefore, there is a demand for innovative modelling techniques that can overcome these limitations. To address these issues, we have designed a modelling scheme that utilises both convolutional and transformer-based neural networks. Specifically, we explore how both types of network can be combined and trained using a loss function employed in computer vision to form a robust architecture that is less likely to overfit to any one class when compared to previous baselines. We evaluate our model on a general phosphorylation site dataset, and a variety of kinase-specific datasets, including CDK, CK2, MAPK, PKA and PKC. Finally, to emphasise that this is an example of white-box deep learning, we show how one can visualise the model's features to gain a better understanding behind the prediction of each site.
Short Abstract: Parkinson's disease (PD) is the second most prevalent neurodegenerative disorder, affecting more than 1% of the population above the age of 60 years. Both genetic and environmental factors influence the risk of PD, but the molecular mechanisms underlying disease initiation and progression remain unknown. Studies of differential gene expression have identified molecular signatures associated with PD at the gene level. However, the expression landscape at the level of alternatively spliced transcripts is largely unexplored. From two independent cohorts of PD patients and healthy controls, we obtained RNA-seq data from fresh-frozen prefrontal cortex and investigated changes in the relative expression of transcripts in relation to the overall expression of the gene, also referred to as differential transcript usage (DTU). We found that DTU occurs in the PD brain and identified novel disease-associated genes that replicated across the two independent patient cohorts. Although the genes that exhibit DTU are novel, they were enriched in biological processes and functions that have already been reported in association with differential gene expression in neurodegeneration and PD.
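Differential transcript usage, as defined in this abstract, concerns each transcript's share of its gene's total expression rather than its absolute level. The sketch below shows that quantity for one gene; it omits the statistical testing a real DTU analysis requires, and the transcript names and counts are hypothetical.

```python
def transcript_usage(counts):
    """Convert per-transcript expression of one gene into usage proportions."""
    total = sum(counts.values())
    if total == 0:
        return {transcript: 0.0 for transcript in counts}
    return {transcript: value / total for transcript, value in counts.items()}


def usage_shift(control, case):
    """Change in usage proportion (case minus control) for each transcript."""
    control_usage = transcript_usage(control)
    case_usage = transcript_usage(case)
    transcripts = sorted(set(control_usage) | set(case_usage))
    return {
        t: case_usage.get(t, 0.0) - control_usage.get(t, 0.0)
        for t in transcripts
    }


# Hypothetical gene with two transcripts: total expression is 100 in both
# groups, so a gene-level comparison would show no change.
shift = usage_shift({"tx1": 80, "tx2": 20}, {"tx1": 50, "tx2": 50})
```

In this example the gene's total expression is identical in both groups, yet usage of `tx1` drops by 0.3 and usage of `tx2` rises by 0.3, the kind of change a gene-level analysis would miss.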
Short Abstract: A critical step in unsupervised clustering of single-cell RNA sequencing (scRNA-seq) data is feature selection, i.e. identification of a subset of genes that can separate cells into distinct clusters. Current methods for feature selection test each gene individually, thus ignoring expression correlations between genes. This is a major limitation, since cell-type-specific marker genes tend to be highly correlated with each other. We therefore developed DUBStepR (Determining the Underlying Basis using Stepwise Regression), a method that selects a basis set of strongly correlated genes that maximally explain variation in gene-correlation space. DUBStepR then expands this basis set to identify the features (genes) that optimize cluster separation.
We benchmarked DUBStepR on 12 datasets spanning 4 scRNA-seq protocols (10x, Drop-Seq, CEL-Seq2 and Smart-Seq2) and found that DUBStepR yielded greater cluster separation than 6 widely-used feature selection algorithms. Moreover, DUBStepR detected marker genes with consistently greater accuracy than the other methods. We applied DUBStepR to identify low-frequency cell populations in multiple scRNA-seq datasets, and even extended DUBStepR to delineate hematopoietic differentiation trajectories in human bone marrow single-cell ATAC sequencing (scATAC-seq) data.
DUBStepR is available as an R package on GitHub (github.com/bbbranjan/DUBStepR), and can directly be incorporated into existing single-cell data analysis workflows.
Short Abstract: EDAM is an ontology of well established, familiar concepts that are prevalent within bioinformatics, and bioscientific data analysis in general. The scope of EDAM includes types of data and data identifiers, data formats, operations, and topics. EDAM has a relatively simple structure, and comprises a set of concepts with terms, synonyms, definitions, relations, links, and some additional information (especially for data formats).
EDAM is developed in a participatory and transparent fashion, within a growing international community of contributors. The development of EDAM is coordinated with the development and curation of tools registries (e.g. Bio.tools, bio.tools); training materials registries (e.g. TeSS, tess.elixir-europe.org); packaging of open-source bioinformatics software (especially Debian Med and Bio-Linux, debian.org/devel/debian-med); the Common Workflow Language (www.commonwl.org); and other related communities and initiatives. These include developers of bioinformatics workbenches (mainly Galaxy, usegalaxy.org) and collaborations with specialised networks of experts, such as the development of EDAM-bioimaging (an extension of EDAM towards bioimage informatics and machine learning), to which a broad group of experts in bioimaging, image analysis, and deep learning has contributed.
In summary, EDAM functions as common terminology when sharing and integrating information about bioinformatics tools, workflows, training materials, and other resources.
Short Abstract: Recent experimental advances have transformed the way that data are being generated in the biological sciences, and demands for powerful computational techniques to analyse these data are accelerating at an unprecedented rate. It is now possible to profile biological processes using a systems-wide framework, often with a high degree of temporal resolution and at the level of single cells. Dynamic and stochastic extensions of flux balance analysis (FBA) have been developed to understand metabolic regulation at genome scale beyond steady state in populations and single cells. Although well defined mathematically, direct numerical simulation of dynamic and stochastic FBA models proves challenging due to the hybrid nature of embedding a linear programming problem into ordinary differential equations or the stochastic simulation algorithm. This talk will present new open-source tools for simulation of dynamic and stochastic FBA models in Python. These software packages serve as extensions of the widely used Python module COBRApy, and will be released as part of the openCOBRA code base. They have been designed to enable users with limited programming experience to intuitively build their models and access the most advanced algorithms for simulation. Ongoing developments, including models of microbial communities and whole cells, will also be discussed.
Short Abstract: Lung cancer is the leading cause of cancer deaths, and lung adenocarcinoma (LUAD) is its most prevalent subtype. Symptoms often appear in advanced stages when treatment options are limited. Identifying genetic risk factors for LUAD will enable better stratification of high-risk individuals, who can then benefit from increased surveillance and early detection programs.
Towards this end, we analyzed germline whole-exome sequencing data of 1,083 patients and 7,650 controls, by far the largest case-control study to date, split into discovery and validation cohorts. Specifically, we focused on rare deleterious variants (RDVs) that have high penetrance through a rigorous analysis framework. For increased statistical power, we pursued a collapsing approach, and compared the cumulative RDV burden in patients versus controls at the gene level using penalized logistic regression. We observed that RDVs in the ATM gene increase LUAD risk (ORcombined=4.6, p=1.7e-04, 95% CI=2.2–9.5). In support of these findings, ATM RDVs were also enriched in an independent cohort of 1,594 cases from the MSK-IMPACT study (0.63%).
Overall, in this unbiased, rigorous exome-wide analysis we identified ATM as a moderate-penetrance LUAD risk gene. Given that ATM is a recognized risk gene for other cancers, LUAD may be part of the spectrum of ATM-related cancer syndrome.
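The collapsing idea described above can be illustrated with a minimal sketch. It assumes per-individual allele counts for the qualifying variants in one gene and summarizes the carrier burden with a plain odds ratio and a Woolf confidence interval (with a 0.5 continuity correction). This is a simplified stand-in, not the penalized logistic regression used in the study; the function names and the example counts are hypothetical.

```python
import math


def collapse(genotypes):
    """Collapse per-variant allele counts into one carrier flag per individual.

    `genotypes` is a list of individuals, each a list of allele counts
    for the qualifying rare variants in a single gene.
    """
    return [int(any(g > 0 for g in person)) for person in genotypes]


def collapsed_burden_or(case_carriers, case_total, ctrl_carriers, ctrl_total, z=1.96):
    """Odds ratio and Woolf confidence interval for carrier status.

    A 0.5 (Haldane-Anscombe) correction is added to every cell so the
    estimate stays finite when a cell is zero. This is an illustrative
    simplification, not the penalized regression named in the abstract.
    """
    a = case_carriers + 0.5
    b = case_total - case_carriers + 0.5
    c = ctrl_carriers + 0.5
    d = ctrl_total - ctrl_carriers + 0.5
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical toy data: three cases and three controls, two variants each.
cases = collapse([[0, 1], [0, 0], [2, 0]])
controls = collapse([[0, 0], [0, 0], [0, 1]])
print(collapsed_burden_or(sum(cases), len(cases), sum(controls), len(controls)))
```

Collapsing to a single carrier flag trades per-variant resolution for power, which is the usual motivation when each individual variant is too rare to test on its own.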
Short Abstract: Predicting the effect of single point mutations in proteins is one of the most relevant objectives in biotechnology and protein engineering. Due to the nature of predictive algorithms, each sequence needs to be numerically coded to train an ML-based model. Techniques like one-hot encoding have been widely used, but they do not represent the characteristics of the sequence. Other approaches use physicochemical properties as descriptors of the residues, but these raise the problem of property selection. Recent studies based on text mining encode the sequences using embeddings from previously trained models, applying techniques such as word2vec or doc2vec. However, such models incur high computational costs. As an alternative to these techniques, we propose FFT-Predict. This tool generates predictive models based on the digitization of selected physicochemical properties through a combination of unsupervised learning techniques and representations in graph structures. The selected properties are digitized using the Fast Fourier Transform algorithm, and the predictive models are obtained from combinations of individual models using meta-learning techniques. This tool has been tested in different case studies, achieving high performance measures. It is therefore expected to be a significant contribution to the area of mutation design.
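The digitization step can be sketched as follows: each residue is mapped to a physicochemical value, the numeric signal is zero-padded, and a Fourier transform yields a frequency-domain representation. The sketch below is only an illustration under stated assumptions. It uses the Kyte-Doolittle hydropathy scale as the property (the abstract does not say which properties FFT-Predict selects), and the function names are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy index, used here as an example property (assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq, scale=KYTE_DOOLITTLE):
    """Replace each residue by its property value."""
    return [scale[aa] for aa in seq.upper()]


def pad_pow2(signal):
    """Zero-pad a signal to the next power-of-two length."""
    n = 1
    while n < len(signal):
        n *= 2
    return signal + [0.0] * (n - len(signal))


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrum(seq):
    """Magnitude spectrum (first half) of the property-encoded sequence."""
    signal = pad_pow2(encode(seq))
    return [abs(v) for v in fft(signal)[: len(signal) // 2]]


# Compare a hypothetical wild-type fragment with a single point mutant (L -> D).
wild_type = spectrum("MKTLLVAG")
mutant = spectrum("MKTDLVAG")
print([round(w - m, 3) for w, m in zip(wild_type, mutant)])
```

A single substitution changes every frequency component a little, which is why such spectra can serve as fixed-length feature vectors for a downstream model.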
Short Abstract: Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of expression data. The presence of transposable elements and sequence repeats in eukaryotic genomes adds to this complexity, as do overlapping genes and genes that produce numerous transcripts. Currently available software annotates genomes by relying on full-length cDNA or on a database of splice junctions to predict genes. We present FINDER1, which automates NCBI expression data download, read alignment, transcript assembly and gene prediction. FINDER1 is optimized to conduct read mapping with different settings to capture all biologically relevant alignments, with special attention to micro-exons (exon length less than 51 nucleotides). FINDER1 further reports transcripts and recognizes genes expressed under specific conditions. FINDER1 integrates prediction results from BRAKER2 with assemblies constructed from expression data to approach the goal of exhaustive genome annotation. On the entire set of Arabidopsis thaliana genes, FINDER1 achieves a transcript F1 score of 0.5, exceeding that of BRAKER2 by 0.23. FINDER1 vastly outperforms BRAKER2 across different categories of transcripts, including micro-exons and overlapping transcripts. The pipeline scores genes as high confidence or low confidence based on the available evidence. Finally, FINDER1 predicts and annotates non-coding genes using multiple approaches.
Short Abstract: Discontinuities in short read genome assemblies have presented challenges to many genomic analyses. As an alternative, long read sequencing has become a popular approach to generate highly contiguous, albeit error-prone, sequences. Even with high overall assembly accuracy (~99.8% identity), the remaining errors can introduce frameshifts and premature stop codons, hindering gene identification and annotation. We developed a software tool that identifies draft assembly errors through comparisons to multiple reference genomes. Reference genomes, or alternatively proteins, are aligned with the draft assembly, and conserved sites in these alignments are used to identify candidate error sites in the draft genome. These sites are then checked for raw read support for adjustment or confirmation. Software performance was assessed on two draft Pseudomonas spp. assemblies, constructed from either Oxford Nanopore data or Illumina data. Fewer than 3000 errors, or ~0.05% of the genome, were corrected in the long read assembly, reducing annotated pseudogenes from 23.3% to 5.6%, comparable to 4.4% in the Illumina assembly. For targets that lack high quality reference genome data, the software can utilize reference protein data for targeted gene correction. Overall, the developed software was effective in correcting errors causing gene fragmentation, markedly improving long read assembly quality.
Short Abstract: A major focus of systems biology and genomic medicine is to link genotype to phenotype, yet we remain far from accurately predicting disease states from genome sequence. Genetic interaction networks in model organisms have shed light on this problem, highlighting how combinations of genome variants can impact phenotypes. The disruptive CRISPR-based genome editing technology enables this combinatorial mutation approach in human cells. We systematically map genome-wide genetic interactions using CRISPR/Cas9 in human cells. We performed a large number of genome-wide screens with specific loss-of-function mutations in HAP1 cells, along with more than 20 screens in wild-type HAP1 cells to be used as a basis for robust scoring of genetic interactions. We developed a computational pipeline to identify quantitative genetic interactions (qGI) from these data. The qGI pipeline corrects unwanted effects such as surprisingly frequent interactions in wild-type HAP1 screens. Moreover, we applied a novel framework to explore the reproducibility of qGI scores. We believe that our observations generalize to other differential CRISPR screening platforms. In summary, we developed a computational pipeline that will guide the generation of a genome-wide reference genetic interaction network in human cells.
Short Abstract: Breast cancer is a heterogeneous disease and one of the leading causes of mortality in women worldwide. Clinically it has been categorized into three therapeutic groups: the estrogen receptor (ER) positive group, the HER2/ERBB2/Neu amplified group and triple-negative breast cancers (TNBCs), which lack receptors for estrogen, progesterone and HER2. The underlying molecular signatures associated with the disease have been used to determine their prognostic values. However, molecular signatures for these subtypes have not been effective in identifying therapy. The therapeutic options also vary due to the heterogeneous nature of the disease. Studies on the association between the tumor suppressor gene Adenomatous Polyposis Coli (APC) and breast cancer have revealed that certain genetic variants and/or epigenetically silenced versions of the gene are capable of activating the Wnt/β-catenin pathway and thus tumor development. Understanding the role of APC in chemotherapeutic resistance is critical for the development of new therapies for breast cancer. Towards this end, we have analyzed whole genome expression in the human TNBC cell line MDA-MB-157 compared to MDA-MB-157 with APC knockdown and treatments with cisplatin and paclitaxel. We will discuss our results on the genes associated with the key processes of Notch signaling, EMT, cell cycle and DNA damage.
Short Abstract: Background:
Technologies for next-generation sequencing have developed rapidly in this decade. Among all applications of such technologies, single-cell RNA sequencing (scRNA-seq) is at the forefront of genomic research. In the current literature, methods for scRNA-seq data analysis mostly belong to two categories: differential expression and clustering analysis. Another important category of analysis is the prediction/classification of cell phenotypes, which is always coupled with the task of feature/gene selection. This last type of analysis method is rarely discussed, since general-purpose prediction methods are well developed in the machine-learning community. However, there are special considerations in analyzing genomic data that cannot be addressed by general-purpose prediction methods.
In this paper, we first discuss the special considerations of genomic data analysis, which are related to high correlations among genes. Then we propose a novel algorithm to address these considerations and challenges in modelling highly correlated genes. To introduce our algorithm, we integrate it with the elastic net method. The elastic net model is not a critical component of our algorithm framework and can be replaced by other prediction models.
Using benchmark datasets and simulation studies, we show that our algorithm (integrated with elastic net) outperforms directly applying elastic net methods.
Short Abstract: Polyrhachis lamellidens is a temporary social parasitic ant, whose new queen founds a colony by invading the colonies of other ant species. Prior to this invasion, P. lamellidens performs a rubbing behavior on the host worker. Previous research suggested cuticular hydrocarbon (CHC) disguise as a possible explanation of this behavior, enabling avoidance of attack responses from the host. To verify this hypothesis, we carried out a quantitative metabolomics measurement of CHC. Firstly, CHC quantification by mass spectrometry confirmed that, after the rubbing behavior, the CHC profile of P. lamellidens shifted from very low levels before invading the host colony to pronounced peaks closely resembling those of the host workers. Secondly, to understand the system of CHC acquisition, we performed a bioassay using standard substances, and we also estimated target genes and profiled gene expression by transcriptomic and qPCR analysis. The results showed that the standard substances were acquired through the rubbing behavior, while there was no change in the expression of cytochrome P450 decarbonylase, which is a CHC synthesis-related factor. These results suggest that P. lamellidens directly obtains host CHC through the rubbing behavior, which enables disguise during colony invasion.
Short Abstract: Bacterial or viral infections often cause acute and severe systemic inflammation, which affects the lungs. Lipopolysaccharide (LPS), a pathogenic component of the membrane of gram-negative bacteria, stimulates innate immune cells such as monocytes and macrophages to produce pro-inflammatory cytokines, tumor necrosis factor-α, interleukin 1 beta, and inducible nitric oxide synthase (iNOS). The latter produces a high amount of nitric oxide (NO), resulting in host cell damage and cascading inflammation. The same events are present in viral spread processes, as in the case of COVID-19 infection. We believe that numerous biochemical processes trigger a cascade of inflammatory processes through the activation of iNOS, with uncontrolled generation of NO. iNOS is the cause of damage to host cells, with a consequent pulmonary thromboembolic phenomenon in a context of interstitial pneumonia. This study proposes the use of sildenafil to counter the inflammatory cascade and thromboembolic episodes.
Short Abstract: The coronaviruses (CoV) that cause MERS, SARS and COVID-19 produce severe acute, often fatal respiratory syndromes, prompting urgent research into the mechanisms of CoV infection.
The CoV S proteins contain an N-terminal RBD. Our study allowed us to verify that this type of protein has a configuration structurally similar to glycoproteins of the mucin type, with the terminal part consisting of saccharide groups, specifically sialic acid, that bind to the receptors of the host cells.
Their affinity to ACE2 receptors is similar to that of other viruses; in this case the sialic acid present in CoV, as the terminal part of the external neuraminidase-type glycoprotein, binds the ACE2 receptors.
We propose using the neuraminidase enzyme to break the bond that links the sialic acid to the rest of the glycoprotein, making it impossible for the virus to attach to the host cell.
Short Abstract: Systemic sclerosis (SSc) is a complex autoimmune disease. Its pathogenesis is unclear and, like that of other rheumatic diseases, complex. We sought common gene expression patterns associated with SSc in skin and blood. Gene expression data from skin and blood of SSc patients and healthy controls (HC) were downloaded from the Gene Expression Omnibus (GEO) database. We found the top 1000 differentially expressed genes (DEGs) for each dataset, and shared genes were identified. A total of 31 genes were present in all three lists for SSc skin. Among these were six that are involved in promoting homologous recombination in response to DNA damage and are possibly associated with autoimmunity. Forty genes were shared among SSc blood samples, among which four are part of a protein metabolic pathway. No genes were found to be common to both skin and blood samples in SSc patients. In conclusion, this study identified DEGs for SSc that may, as a group, be useful as a biomarker for early detection and treatment of SSc.
Short Abstract: Alzheimer’s disease (AD) is one of the major global health problems. Some previous studies have suggested a link between viral infection and the development of AD. This study estimated whether herpes and hepatitis viral infection increases the risk of Alzheimer’s disease using Korean National Health Insurance Service National Sample Cohort data, which consists of records on 1,025,340 patients across the country covering 2002 through 2013. 1,660 patients developed AD (19.99%), and the hazard ratio of AD was 1.62 times (95% CI, 1.53-1.71) greater for patients with herpes viral infection. Using a GEO data set, we identified 94 differentially expressed genes (DEGs). Gene ontology enrichment analysis in the biological process category showed that upregulated genes were associated with RNA metabolic process and RNA biosynthetic process. The integrated findings of this study suggest that AD biology is impacted by a complex constellation of viral and host factors acting on metabolic and biosynthetic processes. These results suggest a need for increased awareness of the molecular and pathological mechanisms of AD in patients after viral infection, including herpes and hepatitis virus.
Short Abstract: Multiple sclerosis (MS) is among the most common neurological autoimmune disorders, primarily developing in people under the age of 30. Currently, diagnosis of MS is difficult and requires analysis of brain tissue; identification of a biomarker for the condition could therefore considerably ease diagnosis. The aim of this study was to identify potential key candidate genes of MS. Microarray and RNA-seq data were merged and analyzed using bioinformatic tools. Gene expression data from blood and brain tissue of MS patients and healthy controls (HC) were downloaded from the Gene Expression Omnibus (GEO) and ArrayExpress databases. Using R, highly differentially expressed genes were identified in the microarray data; hundreds of shared genes were found in blood and brain tissues. 23 genes were found to be commonly upregulated between the two tissue types for the seven microarray studies analyzed. RNA-seq data from brain tissue were used to confirm candidate biomarker genes; 10 such genes were found to be shared between the microarray and RNA-seq data for brain tissue. These genes may serve as candidates for MS biomarkers. Further analysis will be conducted to identify the gene regulatory network associated with these genes.
Short Abstract: Melanoma is characterized by high heritability, yet a significant portion of this risk remains unexplained by known genetic risk factors. Immune surveillance mediated by the Major Histocompatibility Complex (MHC) has been shown to influence tumor formation and growth. We previously found genotype at the Human Leukocyte Antigen (HLA) locus encoding the MHC was associated with age at melanoma diagnosis. Here we investigated how MHC alleles could contribute to heritable predisposing or protective effects. Using the SKCM TCGA cohort as a discovery set, we identified ten MHC alleles (nine class-I and one class-II) associated with an average later disease onset of 5.21 years (p=0.0004) for individuals carrying at least one allele. Interestingly, 8 of these 10 alleles are known HLA specific risk alleles associated with the autoinflammatory skin conditions vitiligo and psoriasis, with the remaining alleles associated with other autoimmune conditions. Preliminary validation on a melanoma specific dbGaP cohort supported this trend with an average later disease onset of 5.56 years (p=0.0378). Analysis of these alleles suggests protection through better presentation of known melanoma specific drivers. Notably, HLA-B27:05 and HLA-B57:01 were found to effectively bind peptides harboring the BRAFV600E mutation common to advanced melanoma in mass spectrometry datasets characterizing peptide-MHC complexes.
Short Abstract: Formulating and investigating models of biochemical systems are key to understanding biological processes in Systems Biology. For systems with large molecular numbers of chemical species, classical chemical kinetics are used to model chemical reaction systems as ordinary differential equations (ODEs). But when species are present in low copy numbers, discrete and stochastic models, such as the Chemical Master Equation, are required for accurate modelling. Both models rely on experimentally-derived reaction rate parameters that are typically determined by fitting an ODE system to experimental data, which is effective for systems with large populations, but is not as applicable when some molecular counts are small. We show the applicability of techniques designed for inferring parameters of partially-observed Markov processes to the problem of stochastic biochemical system reaction rate inference, given partial and/or sparse observation data. Further, we present an accurate implementation of these techniques, and their integration into a set of software packages for stochastic biochemical simulation and inference in the R programming language. The proposed techniques are tested on several genetic networks of practical interest.
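To make the low-copy-number regime concrete, the following sketch simulates exact sample paths of the Chemical Master Equation for a single-species birth-death process using Gillespie's direct method. It is written in Python for illustration only (the abstract's packages are in R), it covers forward simulation rather than the parameter inference the abstract describes, and the function name and rate values are hypothetical.

```python
import random


def gillespie_birth_death(k_prod, k_deg, x0, t_end, rng):
    """Simulate 0 -> X (rate k_prod) and X -> 0 (rate k_deg * x).

    Returns the event times and the molecule count after each event.
    `rng` is a random.Random instance so runs can be made reproducible.
    """
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod        # propensity of production
        a_deg = k_deg * x      # propensity of degradation
        a_total = a_prod + a_deg
        if a_total == 0:
            break              # no reaction can fire any more
        t += rng.expovariate(a_total)   # waiting time to the next reaction
        if t > t_end:
            break
        # Choose which reaction fires, proportionally to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


# Hypothetical rates; the ODE steady state would be k_prod / k_deg = 100.
times, counts = gillespie_birth_death(10.0, 0.1, 0, 50.0, random.Random(1))
print(len(times), counts[-1])
```

Repeated runs with different seeds scatter around the ODE solution, and the relative scatter grows as the steady-state copy number shrinks, which is the situation where ODE-based fitting becomes less applicable.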
Short Abstract: Most genome-wide association studies (GWAS) primarily focus on European individuals; however, these results cannot always be accurately applied to non-European populations due to differences in genetic architecture. We sought to use results from ethnically diverse GWAS to perform transcriptome-wide association studies (TWAS) to find genes associated with complex traits. We performed TWAS using summary statistics from GWAS of 27 clinical and behavioral phenotypes in approximately 50,000 non-European individuals (Wojcik et al. 2019). We used S-PrediXcan combined with transcriptome prediction models trained using genotype and gene expression data from the Multi-Ethnic Study of Atherosclerosis (MESA) to find genes associated with traits across populations. In our preliminary analyses, we identified 229 unique genome-wide significant (P<5e-8) trait-associated genes, of which 217 replicated (P<0.05) in larger European TWAS from the PhenomeXcan database. One such gene is LMNA, which was associated with white blood cell count. The driving SNP for LMNA, rs517606, has a higher allele frequency in African populations, which could explain its predominance in the African American cohort. Deeper understanding of the degree of transferability of genetic association results and implicated biological mechanisms across populations is essential for equitable precision medicine implementation and requires complex trait studies in diverse populations.
Short Abstract: Dilated Cardiomyopathy (DCM) is a multifactorial condition that leads to heart failure in many clinical cases. Because a high proportion of DCM cases are reported as familial, a gene-level network-based study was conducted utilizing high throughput Next Generation Sequencing data. We exploited the exome and transcriptome sequencing data submitted to the NCBI-SRA database to construct a high confidence scale-free regulatory network consisting of lncRNAs, miRNAs, mRNAs and Transcription Factors (TFs). Analysis of RNA-Seq data revealed 477 differentially expressed coding transcripts and 77 lncRNAs. 268 miRNAs regulated either lncRNAs or mRNAs. Of the 477 deregulated coding transcripts, 82 were TFs. We identified three major hubs from the network: the lncRNA XIST, the miRNA hsa-miR-195-5p and the mRNA NOVA1. We also found putative disease associations of DCM with diabetes and of DCM with hypoventilation syndrome. Five highly connected modules were also identified from the network. The hubs showed significant connectivity with the modules. Through this study we were able to gain insights into the underlying lncRNA-miRNA-mRNA-TF network. From a high throughput dataset we have isolated a handful of probable targets that may be utilized for studying the mechanisms of DCM development and progression to heart failure.
Short Abstract: Sentinel lymph node (SLN) status is one of the most significant prognostic factors in patients with breast cancer. However, the traditional statistics-based clinical models developed to predict the risk of SLN involvement lack accuracy. Breast cancer is a heterogeneous syndrome related to both phenotypic and genotypic measurements, and processing data rationally for these two types of factors remains a significant challenge. Thus, we present an approach to integrate genotype information and clinicopathological information from electronic medical records for predicting SLN metastasis in Chinese female breast cancer patients. Using a real-world clinical dataset, we constructed a series of predictive models (support vector machine, naive Bayes, random forest, bagged CART, gradient boosting and logistic regression) based on clinicopathological and genotype data after t-SNE dimensional reduction of SNP profiles on samples along with DBSCAN clustering. All the models based on different algorithms were well trained, with median AUCs between 0.81 and 0.85, which is about a 20% improvement over current clinical practice. This study demonstrated discrimination ability across multiple models and offers new insight into the utility of dimensional reduction of genetic features in SLN metastasis prediction.
Short Abstract: Antimicrobial resistance (AMR) is a significant and growing public health threat. Therefore, automatic identification of resistant pathogens is pivotal for efficient, widespread detection via sequencing. While most of the curated records in AMR databases are genes that are described as resistant to specific molecules or AMR classes/mechanisms, there is a limited number of protein variants known to cause AMR if present, in an on/off fashion. We hypothesized this mechanism should involve protein structural variations influencing active/exposed sites. We analysed the secondary structure of AMR-conferring amino acid variants with Brewery, a state-of-the-art deep learning method implementing stacked bidirectional recurrent neural networks and convolutional neural networks. We found a strong distribution shift in resistant residues, with respect to solvent accessibility, when compared to the susceptible ones. We extended this hypothesis to map candidate AMR on/off variants on the large PATRIC database through an ad-hoc alignment pipeline. Our approach unveiled solvent accessibility differences, measured as the Δ in the solvent accessibility probability distribution between wild type and variant, in proteins from resistant and susceptible genomes. We found that these differences vary greatly depending on the AMR machinery considered. Based on these findings, we developed a novel scoring system determined by the solvent accessibility score.
Short Abstract: Network-based functional enrichment of gene ontology or gene sets is becoming a promising strategy to improve functional enrichment accuracy. Previously, we developed a method named LEGO (functional Link Enrichment of Gene Ontology or gene sets), which takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. The analysis results show that LEGO achieves better performance than Fisher and three other network-based methods (NOA, NEA, and EnrichNet), and is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. Here, we present the R implementation of the algorithm (LEGO), which is much faster and easier to use than the original Perl code. Currently, LEGO performs biological-term classification and enrichment analysis of gene clusters. LEGO supports gene set functional enrichment of multiple samples simultaneously. In addition, this package contains a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. As such, LEGO will be of great value for identifying functionally relevant gene sets and deriving novel hypotheses in functional genomics studies. The source code and documents are freely available at github.com/huzhenyu115/LEGO.
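For orientation, the baseline that LEGO extends, plain over-representation analysis (ORA), can be sketched with a hypergeometric tail probability. The sketch below is unweighted and does not reproduce LEGO's network-based gene weights; it is written in Python for illustration (LEGO itself is an R package), and the function names and toy gene sets are hypothetical.

```python
from math import comb


def ora_pvalue(n_universe, n_set, n_query, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe: number of background genes
    n_set:      genes in the gene set (within the background)
    n_query:    genes in the query list (within the background)
    n_overlap:  genes shared by the query list and the gene set
    """
    upper = min(n_set, n_query)
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrich(query, gene_sets, universe):
    """Rank gene sets by unweighted ORA p-value (smallest first)."""
    universe = set(universe)
    query = set(query) & universe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(query & members)
        p = ora_pvalue(len(universe), len(members), len(query), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy example with ten background genes.
universe = ["g%d" % i for i in range(10)]
gene_sets = {"setA": ["g0", "g1", "g2", "g3", "g4"], "setB": ["g5", "g6"]}
print(enrich(["g0", "g1", "g2", "g3", "g4"], gene_sets, universe))
```

A weighted variant would replace the simple overlap count with a sum of per-gene weights, which is the kind of extension the abstract attributes to LEGO.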
Short Abstract: There are around 19,000 known legume species in the world. They are among the few species that can convert nitrogen available in the air to a plant-usable form, and they are capable of increasing the nitrogen content in soil, which in turn helps in nitrogen fertilization for other crops. It is important to come up with solutions to cultivate more drought-tolerant legumes. Molecular genetic markers such as Simple Sequence Repeats (SSRs) are useful tools for measuring genetic diversity and allow the connection of hereditary traits with genomic variation. With the advent of Next-Generation Sequencing (NGS) technologies, it has become easier than ever for researchers to study the species of interest on a genome scale. Here, we present a comprehensive database resource, legumeSSRdb (bioinfo.usu.edu/legumeSSRdb/), to aid genetic fingerprinting and Marker Assisted Selection (MAS) and to bolster research in legume genomics.
We have identified 1,132,531 SSRs from 6 legume species. The web interface allows users to browse SSRs based on genomic region, chromosome, motif type, repeat motif, chromosome location, etc. It also allows users to design primers, run BLAST, visualize results using JBrowse and explore the genes closest to those SSRs. The future plan is to scale the web server to 15 species.
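As an illustration of what SSR identification involves, the sketch below scans a DNA sequence for perfect tandem repeats with motif lengths of 1 to 6 bases using regular expressions. The abstract does not state which tool or thresholds were used to build legumeSSRdb, so the minimum repeat counts here are assumed values chosen for the example, and the function names are hypothetical.

```python
import re

# Minimum number of repeat units per motif length (assumed thresholds).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (0-based start, motif, repeat count) for each perfect SSR."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # One captured motif followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


example = "GCGC" + "AT" * 6 + "TTA" + "CAG" * 5 + "T"
print(find_ssrs(example))  # [(4, 'AT', 6), (19, 'CAG', 5)]
```

The primitive-motif check keeps a dinucleotide run such as (AT)n from being reported a second time as a tetranucleotide (ATAT)n repeat.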
Short Abstract: Up to 5 million annual influenza infections cause substantial morbidity and mortality in seasonal epidemics world-wide. To ensure continued protection, the vaccine strains have to be regularly updated. In comparison to the expert-based vaccine strain recommendations of the WHO, we assess a vaccine strain selection involving a fully automated data collection and analysis by our SD-plots algorithm (Klingen et al., 2018).
From hemagglutinin sequences of H3N2 viruses from the GISAID database (Shu & McCauley, 2017), we calculate a genealogy to determine the changes in frequencies of amino acid changes associated with a particular clade. Changes that significantly increase in frequency over consecutive seasons are considered to provide a selective advantage. If viruses carrying these changes predominantly circulate in the population of the current season and the changes are located in antigenicity-altering regions of the surface protein, the method recommends an update of the vaccine strain with a suitable strain from the particular clade.
Both in retrospective testing and in live predictions we performed for future seasons, the SD-plots method performed favorably in comparison to recommendations made by the WHO in suggesting suitable strains for the seasonal influenza vaccine. Up-to-date predictions of suitable vaccine strains for human influenza viruses are available at github.com/hzi-bifo/SDplots_VaccineUpdates.
Short Abstract: Introduction
Local frustration has been extensively linked to functional aspects in proteins. Recently, we introduced a way to detect evolutionarily conserved frustration patterns (ECFPs) to study enzymatic activity. We extend our work to study related protein families to detect differential functional adaptations after the divergence from a common ancestor. Here we present our results of studying ECFPs in the globin superfamily.
Frustration was calculated using the Protein Frustratometer. ECFPs are detected by calculation of the information content over frustration results matched to homologous residues across multiple sequence alignments within the globin superfamily. ECFPs can be obtained both at the level of single residues and contact maps and weighted according to contacts occurrence frequency.
We analyzed the ECFPs of different members within the globin superfamily. Given that these families share a common ancestor, we conclude that the differential ECFPs in the extant superfamily members correspond to specific functional adaptations to the activity and context in which these proteins operate at present. We consider that ECFPs can be used to exploit the evolutionary history of protein families to detect their specific functional aspects and to better understand the relationship between sequence and function over evolutionary scales.
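The information-content calculation mentioned in the methods can be sketched for a single alignment column as follows. The sketch assumes that each homologous residue has already been assigned one of three frustration classes (minimally frustrated, neutral, highly frustrated) and scores conservation against a uniform background. The state labels, the uniform background and the function names are assumptions made for illustration, not necessarily the choices made in this work.

```python
import math
from collections import Counter

# Assumed labels for the three frustration classes.
STATES = ("MIN", "NEU", "MAX")


def information_content(column, states=STATES):
    """Information content (bits) of frustration states in one MSA column.

    Computed as log2(number of states) minus the Shannon entropy of the
    observed state frequencies; entries not in `states` (e.g. gaps) are skipped.
    """
    observed = [s for s in column if s in states]
    if not observed:
        return 0.0
    counts = Counter(observed)
    n = len(observed)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(len(states)) - entropy


def consensus_state(column, states=STATES):
    """Most frequent frustration state in the column, or None if empty."""
    observed = [s for s in column if s in states]
    return Counter(observed).most_common(1)[0][0] if observed else None


# Hypothetical column: one homologous position across six family members.
column = ["MAX", "MAX", "MAX", "MAX", "NEU", "-"]
print(consensus_state(column), round(information_content(column), 3))
```

A position whose frustration state is identical in every family member reaches the maximum of log2(3) bits, while a position with evenly mixed states scores close to zero.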
Short Abstract: Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease affecting nerve cells in the brain and the spinal cord. Many interactions between metabolites and enzymes remain undiscovered. Here, we present a new algorithm, MeaP (Metabolite enzyme association Predictor), which predicts the potential associations between metabolites and enzymes in metabolomics data. MeaP uses the enzyme-metabolite interaction data from the Kyoto Encyclopedia of Genes and Genomes (KEGG) reaction database and the protein-protein interaction data from the STRING database. We applied MeaP to our metabolomics profiling study in an ALS cohort. MeaP identified multiple ALS-specific metabolite-enzyme interactions, such as between methylmalonic acid and glutamate decarboxylase 1 and between aldehyde oxidase and riboflavin. In summary, MeaP is a novel framework to analyze metabolomics data and extract novel biological signals by integrating data from experimentally validated databases.
Availability and Implementation: MeaP is available at github.com/FADHLYEMEN/meap.
Short Abstract: The objective of this work was to (i) provide a systematic understanding of small molecule (including metabolites and drugs) - gene relationships, (ii) compile a database (MetaboNet) of available interaction data from publicly available information, and (iii) generate an interactive website linking the compiled information for querying and seamless information generation by the end-user.
Interaction data were parsed using R and Python scripts from 4 different publicly available databases: DrugBank, HMDB, PDBbind and BRENDA. Furthermore, the corresponding information from PubChem, UniProt and PubMed was extracted.
The PostgreSQL database is split into separate tables for each collated data source. These tables incorporate details such as compound name, gene ID, gene name, UniProt ID and PubChem ID. The PubChem, UniProt and PubMed tables provide detailed information on various properties. These relational tables are connected by the DBID (a unique identifier), PubChem IDs, UniProt IDs and PubMed publication IDs.
The interactive website presenting the database to the end-user is built mainly using R Shiny, together with a number of R libraries.
By combining published interaction data, the MetaboNet database and its application interface provide an overview as well as a network of drug or metabolite and gene interactions.
Short Abstract: CLIP-seq is a high-throughput method to detect RNA binding protein interactions guided by miRNAs based on cross-linking between the protein and mRNA and subsequent isolation of the Argonaute-protected mRNA via immunoprecipitation (IP). Analysis of miRNA using CLIP-seq data and correlating miRNA with binding target genes remains challenging, particularly so when narrowing the field of analysis to particular regions of interest (i.e. matching available miRNA abundance levels with peaks in 3’-UTR coverage of Argonaute). Focusing miRNA analysis to these areas of high coverage threshold or specific regions such as the 3’ UTR region, and creating a rank order by cell type, has been identified as an unmet need which miROGUE will attempt to address.
Utilizing a single interconnected pipeline and simple interface to keep user barrier-to-entry low, miROGUE accepts standard input FASTQ or BAM files from CLIP-seq or similar experiments. miROGUE will then determine appropriate coverage thresholds for these regions, aggregate data by cell types, create rank order tables of miRNA and binding targets, and find predicted binding events based on the abundant miRNAs and regions of interest where they might likely bind. miROGUE will also identify partial matches between the seed region of miRNA and the 3’UTR region of mRNA.
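Seed matching of the kind mentioned above can be sketched as follows. This is an illustrative example only, not miROGUE's implementation: it assumes the common convention that the seed is miRNA nucleotides 2-8, and it models a "partial match" as a site with a bounded number of mismatches. The function names and the `max_mismatches` parameter are hypothetical.

```python
def reverse_complement_rna(seq):
    """Reverse complement of an RNA string made of A, C, G and U."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def seed_sites(mirna, utr, max_mismatches=0):
    """Find candidate seed sites in a 3'UTR.

    Assumption: the seed is miRNA positions 2-8 (1-based), and a partial match
    is a window with at most `max_mismatches` mismatches against the reverse
    complement of the seed. Returns (0-based position, mismatch count) pairs.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    site = reverse_complement_rna(mirna[1:8])  # what the mRNA must contain
    hits = []
    for start in range(len(utr) - len(site) + 1):
        window = utr[start:start + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(seed_sites(let7a, "AAACUACCUCAAA"))                    # exact site
print(seed_sites(let7a, "AAACUACGUCAAA", max_mismatches=1))  # partial site
```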
Short Abstract: Mogrify is a computational framework that combines gene expression data and regulatory information to systematically predict the reprogramming factors necessary to induce cell conversion. The platform is developed to systematically control the cellular transcriptomic network underlying cellular identity, and consequently identify the key regulatory factors necessary to convert any cell type into any other cell type without going through the stem cell state, a process called transdifferentiation. We have applied Mogrify to 173 human cell types and 134 tissues, defining an atlas of cellular reprogramming including both known transcription factors used in transdifferentiations and new ones, never implicated before in these cellular conversions. Mogrify in silico predictions have been validated in vitro in over 20 cell conversions, including generation of endothelial cells, astrocytes and cardiomyocytes. This technology also allows the development of enhanced differentiations and reduces the costs of current cell therapies.
Short Abstract: In 2005, we released the Gene Set Enrichment Analysis (GSEA) software and its companion gene set collections, the Molecular Signatures Database (MSigDB). Since then, MSigDB has served as the premier resource providing experimentally derived expression profiles, collections of canonical pathways, and biological signatures to enable GSEA. GSEA and MSigDB have continually grown in popularity and utility over their 15 years of operation, serving as key resources for the interpretation of both microarray and RNA-sequencing data. MSigDB7 began a major modernization effort for this important resource, overhauling gene set collections to reflect significant improvements in the source resources and introducing new sets for a wide variety of new biology. MSigDB7 also ensures that data used in GSEA are fully concordant with MSigDB gene representations, for both microarrays and transcriptomes, by providing resources for gene symbol harmonization across datasets. Special attention was also paid to model organism research, with seamless orthology conversions for mouse and rat data. MSigDB 7.1 continues this evolution with the addition of modern resources for the analysis of gene expression regulatory programs by miRNAs and transcription factor targeting. These and future efforts ensure that MSigDB will remain a key resource hub for enrichment analysis of all data types.
Short Abstract: Despite the plethora of Workflow Management Systems (WMSs), guidelines (e.g. FAIR), source code repositories and open access journals, reproducibility is still a major issue in bioinformatics and computational biology. Contributing factors include the unwillingness of authors to share Research Objects (ROs, i.e. tools, data and workflows) and the introverted nature of modern WMSs, which are difficult to connect with other WMSs and require above-average IT knowledge. Here we present OpenBio.eu, an environment where users are incentivised to import their ROs by receiving credit when others use them. Importing an RO requires no knowledge beyond that needed to install it on a PC. Users can directly download, execute, rate and comment on any RO. They can also “fork” it to create a personal version that they can edit as they wish. Similarly, they can compose workflows by dragging and dropping ROs into a graph. Workflows can be imported from and exported to any environment that supports the Common Workflow Language, such as Galaxy. A discourse analysis environment visualizes the comment thread and can help researchers choose the right RO. Overall, OpenBio.eu is a free, extroverted and social environment that aims to maximize the visibility and reproducibility of research in the life sciences.
Short Abstract: Antibodies are key molecules of the adaptive immune response of vertebrates. The heavy chain is directly involved in antigen binding, and so it shows greater diversity, which derives from recombination of the V, D and J genes. According to recent data in the IMGT database, 25 V, 21 D and 4 J functional alleles have been defined for bovines. Most of the IMGT entries were derived from European Bos taurus breeds, so there is a need to characterize African bovine breeds. Annotating B cell receptor germline genes and predicting novel alleles is problematic because the analysis is affected by somatic hypermutation and sequencing errors. Several methods have been developed to determine germline VDJ alleles from RNA sequences.
In this project, we will first benchmark the performance of the annotation tools IgBlast, IMGT/HighV-QUEST and MiXCR using simulated bovine data. In addition, we will evaluate the germline allele discovery tools IgDiscover and TIgGER to determine their suitability for bovine germline allele discovery. Lastly, the evolution of the antibody repertoire in bovine B cells will be examined. IgM immunoglobulin sequences from three African bovine breeds (Ndama, Ankole and Boran) will be used in this analysis.
Short Abstract: In characterizing a disease, it is common to search for dysfunctional genes by assaying the transcriptome. The resulting differentially expressed genes are typically assessed for shared features, either through functional annotations or co-expression. However, most methods ignore the potential relevance of outlier genes, which may be expression outliers or differentially co-expressed. Here, we provide meta-analytically derived gene co-expression networks and a tool to assess differentially expressed genes with respect to them to detect outlier genes and modules. We illustrate its capability through a meta-analysis of Parkinson’s disease. We identify important biomarkers that are tagged as outliers in a non-specific gene network. Our results suggest characterizing genes as operating inside or outside typical pathways is a valuable step in assessing candidate and marker genes and may well lead to an improved understanding of mechanisms underlying other diseases. A simple and important practical implication is that relying only on enrichment-style methods to prioritize results for meaningful candidates may be misleading, in particular when the tissue or spatiotemporal expression is not accessible.
OutDeCo is implemented in R and can be obtained from: github.com/sarbal/OutDeCo
Short Abstract: The treatment of ovarian cancer, a disease with 22,000 new diagnoses each year in the United States, has long been frustrated by the development of resistance to first-line platinum-based drugs. Researchers have characterized the drug resistant state and identified differences such as proliferative capacity and drug export pathway modulations as potential mechanisms of resistance development, but so far efforts to target these pathways and reverse the resistant state have been unsuccessful. We sought to profile the resistant phenotype using single cell RNA-sequencing (scRNA-seq), which was made challenging by the sparsity of the data and the dominance of cell-cycle related signals. Here we present a method for quantifying relative pathway enrichment in the context of the progression through the cell cycle using a pseudotime trajectory and single sample gene set enrichment (ssGSEA). This method identifies pathways whose up- or down-regulation is associated with or independent of the cell cycle and identifies pathways whose differential regulation in certain parts of the cell cycle was not detectable with other methods. This method effectively profiles pathway activity in the context of proliferative changes in the chemo-resistant phenotype and may lead to the identification of mechanisms by which this phenotype may be reversed.
Short Abstract: Next Generation Sequencing has been applied in many areas of biology. To develop effective diagnostic and therapeutic approaches, we need to accurately identify and characterize sequencing errors and distinguish them from true genetic variants. In our model, misreads follow a binomial distribution, which can be approximated by a Poisson process for longer sequences. Insertion and deletion rates, however, are 1000 times lower than substitution error rates and are therefore less significant. The model assumes that an error at one position is independent of errors at other positions. Furthermore, errors in individual sequences can cause errors in studies based on multiple sequences, and these errors also follow a binomial-Poisson distribution: for example, an alignment is a merging of two binomial processes for short sequences, which can be approximated by a Poisson process for long sequences such as genomic sequences. The model provides a systematic way to evaluate accuracy in sequencing-based applications.
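The binomial-to-Poisson approximation invoked above can be checked numerically. The sketch below is illustrative only: the read lengths and per-base error rates are made-up values, not figures from this work, and the function names are hypothetical. With n independent positions and per-position error probability p, the error count is Binomial(n, p), which approaches Poisson(np) as n grows and p shrinks with np held fixed.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k errors) among n independent positions with error rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k errors) under a Poisson model with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_abs_difference(n, p, k_max=20):
    """Largest pointwise gap between the two models for k = 0..k_max."""
    lam = n * p
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(min(k_max, n) + 1)
    )


# Same expected number of errors (np = 10), different sequence lengths.
# The lengths and rates here are illustrative assumptions.
short_gap = max_abs_difference(n=100, p=0.1)
long_gap = max_abs_difference(n=10_000, p=0.001)
print(f"short sequence gap: {short_gap:.2e}")
print(f"long sequence gap:  {long_gap:.2e}")
```

The gap between the two models shrinks for the longer sequence, which is the sense in which the Poisson model becomes adequate for long sequences.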
Short Abstract: Livestock species raised in agricultural settings play an important role in our ecosystem. Recent developments in ‘OMICS’ technologies, such as genomic selection, will be especially beneficial in breeding for low-heritability disease traits that only manifest following exposure to pathogens or environmental stresses. Molecular genetic markers are useful tools for measuring genetic diversity among these agricultural species and allow hereditary traits to be connected with genomic variation.
Molecular marker technology has developed rapidly, and Simple Sequence Repeats (SSRs, or microsatellites) in particular are prevalent in modern genetic analysis. However, there is no resource available for the prediction and identification of SSRs in farm animals. Using recent bioinformatics technologies, we have developed an important resource to enhance research in livestock species such as cow, goat, sheep, horse, donkey, mule, camel, chicken, pig and buffalo.
RanchSSR database is a webserver that can be used to predict SSR markers. This will help researchers with gene tagging and genome mapping to enhance livestock breeding. We believe that ranchSATdb will be a critical resource for Marker Assisted Selection and for mapping Quantitative Trait Loci in order to practice genomic selection and improve farm animal health. The database is freely available at bioinfo.usu.edu/ranchSATdb/.
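As a rough illustration of what SSR prediction involves, the sketch below scans a DNA sequence for perfect tandem repeats using a regular expression. It is not the method used by the database described above; the function name and the minimum repeat counts per motif length are assumptions chosen for the example.

```python
import re

# Assumed thresholds: minimum number of tandem copies for each motif length.
DEFAULT_MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence, min_repeats=None):
    """Return (start, motif, copies) for perfect SSRs in a DNA sequence.

    Hypothetical helper: motifs of 1-6 bp are searched with a back-reference
    regex; motifs that are themselves repeats of a shorter unit (for example
    "ACAC") are skipped so that each repeat is reported once.
    """
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    sequence = sequence.upper()
    hits = []
    for motif_len, min_copies in sorted(thresholds.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_copies - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            is_redundant = any(
                motif == motif[:d] * (motif_len // d)
                for d in range(1, motif_len)
                if motif_len % d == 0
            )
            if not is_redundant:
                copies = len(match.group(0)) // motif_len
                hits.append((match.start(), motif, copies))
    return hits


example = "GGG" + "AC" * 8 + "TTT"
print(find_ssrs(example))
```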
Short Abstract: Clustered regularly interspaced short palindromic repeats (CRISPR) and associated proteins (Cas) form the CRISPR-Cas systems. They have generated a high level of interest in recent years, due to their applications in gene editing. While most research has focused on the Streptococcus pyogenes Cas9 (SpCas9), there are a number of other systems that could have valuable properties in a gene editing context (e.g. specificity, compactness). Public sequence repositories are a valuable source of new CRISPR systems, provided that they can be mined efficiently and at scale.
Here we present a new tool that exploits low-level programming and GPU parallelism via C++/CUDA for high-performance detection and classification of CRISPR systems.
Our tool accurately discovers all repeat-spacer arrays by using quality scores to identify the genuine array at each locus. We then identify Cas proteins in proximity to each CRISPR via a protein-level kmer-based approach. This approach enables discovery of novel Cas proteins on the basis of kmer similarity with known Cas proteins. Finally, CRISPR type classification is performed on the basis of signature Cas protein presence.
Our results show that we can detect and classify CRISPR systems one order of magnitude faster than the most widely used existing tool (CRISPRCasFinder).
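The idea of flagging candidate Cas proteins by k-mer similarity can be sketched in a few lines. The tool itself is written in C++/CUDA; this Python sketch is only a conceptual illustration and makes several assumptions: it uses Jaccard similarity over k-mer sets as the similarity measure, and the function names, the value of k, the threshold and the example sequences are all hypothetical.

```python
def kmers(sequence, k=3):
    """Set of all overlapping k-mers in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (an assumed measure)."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(query, known_proteins, k=3, threshold=0.3):
    """Return (name, score) of the most similar known protein, or None.

    `known_proteins` maps a label to a sequence; the threshold is illustrative.
    """
    scored = [(jaccard(query, seq, k), name) for name, seq in known_proteins.items()]
    if not scored:
        return None
    score, name = max(scored)
    return (name, score) if score >= threshold else None


# Made-up toy sequences, not real Cas proteins.
known = {"cas_like_a": "MKRLAGA", "unrelated": "WWWWWWW"}
print(best_match("MKRLAGG", known))
```

Because the comparison needs only shared k-mers rather than a full alignment, a protein with partial similarity to a known family can still score above the threshold, which is what makes discovery of novel candidates possible.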
Short Abstract: Clustering is one of the most critical steps in the analysis of scRNA-seq data, since it is essential for cell-type and marker gene identification. However, it remains challenging to cluster cells accurately in the presence of experimental noise, technical variation and batch effects. Reference Component Analysis (RCA), a supervised clustering approach guided by a set of reference transcriptomes, has been shown to be more accurate and less susceptible to batch effects than unsupervised clustering (Li et al., Nat Genet 2017). However, the original RCA software does not scale to the size of modern scRNA-seq data sets, has limited usability, visualization options and documentation, and includes only a single reference transcriptome panel.
Here, we present RCA2, an improved implementation of reference-based clustering addressing all of the above limitations. We have reduced runtime, incorporated memory-efficient graph based clustering, expanded the set of reference panels and facilitated the generation of custom reference panels from user-provided data. Also, RCA2 has easy-to-use plotting functions, e.g. expression heat maps, and 2D/3D UMAP visualizations. Finally, RCA2 includes extensive documentation and tutorials, describing the use of the software and its integration into widely used scRNA-seq pipelines such as Seurat. RCA2 is freely available on GitHub: github.com/prabhakarlab/RCAv2.
Short Abstract: Single-molecule long-read sequencing provides an unprecedented opportunity to measure the transcriptome from any sample. However, current methods for the analysis of transcriptomes from long reads rely on the comparison with a genome or transcriptome reference, or use multiple sequencing technologies. These approaches preclude the cost-effective study of species with no reference available, and the discovery of new genes and transcripts in individuals underrepresented in the reference. Methods for the assembly of DNA long-reads cannot be directly transferred to transcriptomes since their consensus sequences lack the interpretability as genes with multiple transcript isoforms. To address these challenges, we have developed RATTLE, the first method for the reference-free reconstruction and quantification of transcripts from long reads. Using simulated data, transcript isoform spike-ins, and sequencing data from human and mouse tissues, we demonstrate that RATTLE accurately performs read clustering and error-correction. Furthermore, RATTLE predicts transcript sequences and their abundances with accuracy comparable to reference-based methods. RATTLE enables rapid and cost-effective long-read transcriptomics in any sample and any species, without the need of a genome or annotation reference and without using additional technologies.
Short Abstract: Mesenchymal stem cells (MSCs) form a heterogeneous population of multipotent progenitor cells that contribute to tissue remodeling, repair and homeostasis. While differentiation of MSC populations towards soft and stiff tissue lineages is directed by matrix mechanics, single cells differ in their matrix-sensing potential and multilineage differentiation capacity. We cultured human MSCs on soft and stiff matrices that mimic fat and precalcified bone and exposed them to a bi-potential adipogenic/osteogenic induction medium. To study heterogeneity in lineage specification, we obtained single-cell transcriptomes of thousands of MSCs at early differentiating states using droplet-based single cell RNA (scRNA) profiling. While adipogenesis was favored on soft matrices and osteogenesis on stiff matrices, scRNA transcriptomes revealed matrix-directed lineage differentiation in only a fraction of cells. Reconstruction of the differentiation trajectories of subpopulations revealed a cell-fate decision-making bifurcation towards fat and bone fates. Adipogenesis was retarded on stiff matrices, and soft matrices activated chondrogenic markers in osteogenic cells. Differential gene expression screening between matrix-sensitive and matrix-insensitive cells revealed lineage- and matrix-specific cytoskeletal proteins whose signaling functions we validated via knockdown and overexpression assays. Taken together, our work provides dynamic mapping of MSC subpopulations characterized by multilineage differentiation capacity associated with cytoskeletal proteins that mediate mechanical signaling.
Short Abstract: Genome-scale CRISPR loss-of-function screens are an increasingly popular experimental platform for investigating potential genetic interactions and druggable targets. Recently, we developed a novel CRISPR screening system named CHyMErA that allows for the systematic perturbation of combinatorial genetic interactions, as well as the deletion of sizeable genomic fragments, by pairing a Cas9 guide with one or more Cas12a guides expressed from the same hybrid guide RNA (Gonatopoulos-Pournatzis et al. 2020). Key challenges for scoring hybrid guide RNAs include the presence of orthogonal guide RNAs and multiple orientations - gene A targeted by Cas9 and gene B targeted by Cas12a, or vice versa. Here, we present a novel computational workflow named ChymeraR that addresses these challenges by incorporating separate null models for different orientations. The ChymeraR scoring workflow enables the precise scoring of combinatorial screening data and is available as an open-source R package at github.com/HenryWard/chymeraR.
Short Abstract: In the era of Big Data, data collection underpins biological research more than ever before. In many cases this can be as time-consuming as the analysis itself, requiring researchers to download multiple public databases with different data structures and, in general, to spend days before answering any biological question. To solve this problem, we introduce an open-source cloud big data platform called Sherlock (earlham-sherlock.github.io/). Sherlock provides a way for biologists to store, convert, execute, share and generate biology data, ultimately streamlining bioinformatics data management. The Sherlock platform delivers a simple interface to leverage big data technologies, such as PrestoDB and MongoDB, that is designed to analyse, process, query and extract information from extremely complex and large data sets. Furthermore, Sherlock is capable of handling differently structured data (interaction, localization, or genomic sequence) from several sources and converting them to a common optimized storage format known as Optimized Row Columnar (ORC). This format facilitates Sherlock’s ability to quickly and easily execute analytical queries on extremely large data files as well as to share datasets between teams. In conclusion, Sherlock provides an open-source platform empowering data management, data analytics and collaboration through modern big data technologies.
Short Abstract: Short read (SR) transcriptome assembly can be performed ab initio (by means of a reference genome/transcriptome) or de novo (without any reference). A de novo assembly requires a much greater sequencing depth than an ab initio assembly, which makes it harder to detect every possible transcript isoform when a useful reference is missing.
With the advent of long reads (LRs), e.g. PacBio CCS, the assembly of a transcriptome became unnecessary. However, as of 2020, LRs are still considerably more expensive than SRs and might not cover some rarely expressed isoforms.
The pipeline we are developing, with the working title CGTA-deNovo-ish, aims to combine these two features: by clustering a transcriptome of LRs and aligning a set of SRs to the LRs, each cluster of SRs could then be assembled with a lower coverage requirement.
Short Abstract: Recently developed technologies for digital imaging and highly-multiplexed immunohistochemistry (mIHC) are enabling the field of histology to enter into a quantitative era, allowing for more complex descriptions of tissue architecture. Imaging cytometry by time of flight (CyTOF), multiplexed ion beam imaging, and co-detection by indexing (CODEX) can be used to simultaneously profile the expression of dozens of proteins in a tissue section with single-cell resolution. However, annotating cell populations or states that differ little in the profiled antigens or for which the antibody panel does not include specific markers is challenging. To overcome this obstacle, we have developed a computational approach for enriching mIHC images with single-cell RNA-seq data, building upon recent experimental procedures for augmenting single-cell transcriptomes with concurrent antigen measurements. Spatially-resolved Transcriptomics via Epitope Anchoring (STvEA) performs transcriptome-guided annotation of highly-multiplexed cytometry datasets. It increases the level of detail in histological analyses by enabling annotation of subtle cell populations, spatial patterns of transcription, and interactions between cell types. We demonstrate the utility of STvEA by uncovering the architecture of poorly characterized cell types in the murine spleen using published CODEX and CyTOF datasets, and a CITE-seq atlas we have generated.
Short Abstract: Genome-wide perturbation screens enable a systematic investigation of biological systems for functional information. To evaluate the functional signal captured by these screens, ground truth gold-standards are instrumental. However, the development of such gold-standards through manual curation is difficult, and therefore, they are often unavailable.
An alternative approach to generating a gold-standard is to summarize the agreement of many independently replicated experimental screens. We introduce a new computational tool, called JEDER (Joint Estimation of Data and Error Rates) that, given a set of replicate genetic screens, computes a maximum likelihood estimate of the False Positive Rate, False Negative Rate, along with a consensus profile using an MCMC-based approach. Based on this consensus profile, JEDER then provides detailed benchmarking of screen quality. We demonstrate the utility of JEDER by applying it to a collection of replicated CRISPR/Cas9-based single mutant screens and double mutant screens in human cells. Our analysis provides the first estimates of error rates for genome-wide genetic interaction screens and suggests substantially increased difficulty of scoring differential effects as compared to single mutant effects. We expect our approach to generalize to many other genomic/proteomic data settings to enable precise estimates of reproducibility without relying on external gold standards.
Short Abstract: Since its inception, RGD has hosted data not only for rat but also for human and mouse. We have added, and are still adding, more species, each with its own strengths as a disease model. Tools developed in-house help to translate model organism research to humans. To facilitate data loading from various sources, we developed an architecture that allows a small bioinformatics team to 1) quickly create new pipelines, 2) efficiently update, maintain and troubleshoot existing pipelines in a timely manner, and 3) easily add new species. New pipelines are created from a uniform Java-based code template comprising a versatile logging system, summary notification emails, database connection pooling, a database connection layer and convenient pipeline parameterization. At the core we have a staging relational database with production data. More than 80 pipelines run periodically to update the data within their area of responsibility using an incremental update strategy. Every pipeline performs adequate QC to ensure data correctness. Because multiple points of failure have been eliminated, the breakdown of one pipeline does not impact other pipelines. Troubleshooting is performed at the bioinformatician's convenience. The pipelines have been used to bring several orphan model organisms into RGD (e.g. chinchilla, bonobo, thirteen-lined ground squirrel) and provide a platform to efficiently bring in additional species.
Short Abstract: Individual MHC genotype constrains the mutational landscape during tumorigenesis. Immune checkpoint inhibition reactivates immunity against tumors that escaped immune surveillance in approximately 30% of cases. Recent studies demonstrated poorer response rates in female and younger patients. Although immune responses differ with sex and age, the role of MHC-based immune selection in this context is unknown. We found that tumors in younger and female individuals accumulated more poorly presented driver mutations than those in older and male patients, despite no differences in MHC genotype. Younger patients showed strongest effects of MHC-based driver mutation selection, with younger females showing compounded effects and nearly twice as much MHC-II based selection. This study presents the first evidence that strength of immune selection during tumor development varies with sex and age, and may influence the availability of mutant peptides capable of driving effective response to immune checkpoint inhibitor therapy.
Short Abstract: The human frontal cortex is unusually large compared with many other species. The expansion of the human frontal cortex is accompanied by both connectivity and transcriptional changes. Yet, the developmental origins generating variation in frontal cortex circuitry across species remain unresolved. Nineteen genes, which encode filaments, synapse, and voltage-gated channels are especially enriched in the supragranular layers of the human cerebral cortex, which suggests enhanced cortico-cortical projections emerging from layer III. We identify species differences in connections with the use of diffusion MR tractography as well as gene expression in adulthood and in development to identify developmental mechanisms generating variation in frontal cortical circuitry. We demonstrate that increased expression of supragranular-enriched genes in frontal cortex layer III is concomitant with an expansion in cortico-cortical pathways projecting within the frontal cortex in humans relative to mice. We also demonstrate that the growth of the frontal cortex white matter and transcriptional profiles of supragranular-enriched genes are protracted in humans relative to mice. The expansion of projections emerging from the human frontal cortex emerges by extending frontal cortical circuitry development. Integrating gene expression with neuroimaging level phenotypes is an effective strategy to assess deviations in developmental programs leading to species differences in connections.
Short Abstract: Odorant binding proteins (OBPs) are soluble proteins found in the sensillum lymph that can shape and modulate peripheral olfactory signaling by diverse mechanisms. However, the interaction of OBPs with odorants and their structural basis are poorly understood in tsetse flies. Herein, we analyzed the structural features of Glossina fuscipes fuscipes odorant binding proteins and used molecular docking simulations coupled with gene expression to address the functional role of OBPs using odours with known behavioural effects. We found structural variability between the different OBPs that affected their binding affinity to Waterbuck repellent compounds (WRB) (d-octalactone, geranyl acetone, guaiacol and pentanoic acid) and to the attractant 1-octen-3-ol. Additionally, we identified some of the putative OBPs for the WRB compounds and 1-octen-3-ol. The tissue-specific study showed that some OBPs were associated with tissue-specific expression and sexual dimorphism. Furthermore, the physicochemical property analysis showed that these OBPs vary in their number of hydrogen bonds, their hydrophobic interactions and the area of the binding pocket, which might determine their molecular interactions. We also believe that a better understanding of chemosensory proteins will contribute to more efficient development of olfactory-based tools.
Short Abstract: ADME (Absorption, Distribution, Metabolism, and Excretion) genes are key players in determining the pharmacokinetic and pharmacodynamic properties of a drug and in defining the drug-host response. The impact of variants on putative protein-drug interactions is critical for functional predictions in pharmacogenomic analysis. Therefore, sequence-based annotation alone offers only a narrow perspective for interpreting ADME variants. We describe the design of the SWAAT workflow (Structural Workflow for the Annotation of ADME Targets) for the characterization of ADME variants from high-throughput sequencing data. The tool integrates in-house Python and R scripts in Nextflow to process the variants from a VCF file and map them onto 26 structures from different genes. SWAAT was designed to account for both small-scale and large-scale conformational events and provides a machine learning model for binary classification of the impact of the variants. It allows the impact of variants on protein structural integrity and on the conformational landscape of the ADME proteins to be predicted. Moreover, SWAAT extends the analysis to map the effect on putative drug-ADME protein interaction hotspots. We plan to include the extended ADME genes for further improvement. The workflow is intended for application to a novel high-coverage data set of African WGS.
Short Abstract: Traditional approaches to genome-wide association studies (GWAS) on Parkinson’s disease (PD) are based largely on single-locus tests, despite the genetic complexity of the disease. Genetic interactions refer to combinations of two or more genes whose contribution to a phenotype cannot be fully explained by their independent effects. Detecting genetic interactions systematically with statistical significance remains a major challenge due to the daunting number of variant combinations possible in the human genome. We recently developed a method called BridGE for identifying genetic interactions between pathways from GWAS cohorts. Here, we describe improvements to the BridGE method along with its application to multiple PD cohorts. We identified 20 between-pathway interactions (FDR<0.05) and 12 within-pathway interactions (FDR<0.1) associated with PD risk, with a large fraction (10 of 32) of the interactions replicating in an independent cohort. Several of the pathways implicated in genetic interactions show clear relevance to PD. For example, many of the interactions detected involved the Parkinson’s disease gene set (KEGG), suggesting that many of the established risk variants are modified by variants in multiple, previously unappreciated distinct pathways. We expect that further exploration of the discovered interactions will be fruitful for understanding the underlying genetic basis of PD.
Short Abstract: Adiponectin (AdipoQ) is one of the most abundant adipocytokines secreted by adipose tissue. AdipoQ is emerging as a link between obesity, Type 2 Diabetes (T2D) and obesity/T2D-related tumors. Obesity and T2D are two of the three modifiable risk factors for Pancreatic Ductal Adenocarcinoma (PDA), and AdipoQ and its receptors (Adiponectin Receptor 1 and Adiponectin Receptor 2, ADIPOR1 and ADIPOR2 respectively) have been reported to be involved in its progression. We performed an integrative analysis of the main Adiponectin receptor genes in 32 types of cancer from The Cancer Genome Atlas. Patient transcriptomes were analyzed using hierarchical clustering. Statistical analyses of clinical data were performed using R and GraphPad Prism 8. The clustering analysis in each tumor type identified two groups of patients: one with high levels of ADIPOR1 and one with low levels of ADIPOR1. Statistical analysis of clinical features between the high and low ADIPOR1 expression groups revealed that tumors with high ADIPOR1 are more aggressive than those with low ADIPOR1. Our analysis suggests a key role for ADIPOR1 expression in cancer and indicates that targeting AdipoQ signaling could represent a novel therapeutic strategy for PDA and other obesity/T2D-related tumors.
Short Abstract: The practice of performing computational processes using workflows has taken hold in the life sciences. As scientific objects that encapsulate methods, workflows should be, like data, FAIR (Findable, Accessible, Interoperable, Reusable). They should also be citable, have managed metadata profiles and be openly available for review and analytics. Current workflow registries and repositories typically cater for specific workflow management systems (e.g. nf-core for Nextflow) and there is little standardisation of the metadata needed to describe them.
Workflow Hub (workflowhub.eu) is a new registry that is workflow management system agnostic, where workflows may remain in their native repositories in their native forms. A Bioschemas profile describes the metadata about a workflow, and the Common Workflow Language is encouraged to be used as a description of the workflow itself. Popular workflow management systems such as Galaxy are working with the Hub to seamlessly and automatically support packaging and registration.
An open community works together to co-develop the Hub, which is sponsored by the European EOSC Life Cluster and the ELIXIR Research Infrastructure. To respond to the COVID-19 crisis the Hub has had an early release to be a place for COVID related workflows.
Short Abstract: Satellite DNA (satDNA) consists of tandemly repeated sequences and comprises roughly 15% of the human genome. Although satDNA is implicated in genome stability, it is difficult to sequence and assemble. Therefore, new methods are needed to accurately characterize these regions and thereby improve our understanding of satDNA in genome organization. Recent innovations in oligonucleotide- (oligo-) based FISH methods have improved our ability to visualize chromosomes at the single cell level. Here we present Tigerfish, a software tool that allows for de novo design of oligo probes against satDNA genome-wide. Tigerfish uses a k-mer counting strategy to identify regions enriched with satDNA. After probes are designed against satDNA regions, Tigerfish predicts specificity by computing a quantitative probability of off-target binding and uses a machine learning model to identify an optimal probe set. Tigerfish finds enrichment of satDNA-specific probes in pericentromeric and subtelomeric regions on scaffolds in the hg38 human reference genome. We also applied Tigerfish to the fully assembled T2T chromosome X and have designed an additional twenty thousand new centromeric satDNA probes. We expect Tigerfish to be broadly used for the design of hybridization experiments and the investigation of the role of satDNA in genome organization.
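The k-mer counting idea mentioned in the abstract above can be illustrated with a small sketch. This is not Tigerfish's implementation: the function names, the window size, and both thresholds are assumptions chosen for a toy sequence. It counts every overlapping k-mer in a sequence and flags windows in which most k-mers occur many times, which is one simple way to spot tandem-repeat-like regions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_enriched_windows(seq, k=4, window=12, min_count=3, min_fraction=0.8):
    """Return start positions of windows in which at least min_fraction
    of the k-mers occur min_count or more times in the whole sequence.

    All parameter defaults are illustrative assumptions, not published values.
    """
    counts = kmer_counts(seq, k)
    hits = []
    for start in range(len(seq) - window + 1):
        kmers = [seq[i:i + k] for i in range(start, start + window - k + 1)]
        frequent = sum(1 for km in kmers if counts[km] >= min_count)
        if frequent / len(kmers) >= min_fraction:
            hits.append(start)
    return hits


# Toy sequence: unique flanks around six copies of a 5-bp unit.
toy = "ACGTTGCA" + "GATTC" * 6 + "TTAGGCAT"
hits = repeat_enriched_windows(toy)
print(hits)
```

On the toy sequence, the flagged windows fall inside the repeated block and the unique flanks are left out. A genome-scale tool would need a dedicated k-mer counter and a much larger k.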
Short Abstract: Most cancer chemotherapeutic agents are ineffective in a subset of patients; thus, it is important to consider the role of genetic variation in drug response. Lymphoblastoid cell lines (LCLs) in 1000 Genomes Project populations of diverse ancestries are a useful model that has been applied to determine how genetic variation contributes to variation in drug toxicity. In our study, LCLs were previously treated with increasing concentrations of eight chemotherapeutic drugs and cell growth inhibition was measured at each dose, with half maximal inhibitory concentration (IC50) or area under the dose-response curve (AUC) as our phenotype for each drug. After performing both genome- and transcriptome-wide association studies, we discovered several novel SNPs and two genes significantly associated with cellular sensitivity to drugs within and across diverse populations. For etoposide, increased predicted STARD5 gene expression was associated with decreased cellular sensitivity to etoposide (p=8.49e-08). Functional studies in A549, a lung cancer line, revealed that knockdown of STARD5 gene expression resulted in increased sensitivity to etoposide following exposure for 72 hours (P=0.033) and 96 hours (P=0.0001). By identifying variants associated with cytotoxicity across populations, we strive to understand the genetic factors impacting the effectiveness of chemotherapy drugs to contribute to future treatment decision support.
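As a rough illustration of the two phenotypes named in the abstract above, the sketch below computes an AUC with the trapezoidal rule and an IC50 by linear interpolation. The abstract does not say how either value was derived, so both methods are assumptions, as are the function names and the toy measurements. Real analyses often fit a sigmoidal curve on a log-dose scale instead.

```python
def auc_trapezoid(doses, viability):
    """Area under the dose-response curve by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(doses)):
        width = doses[i] - doses[i - 1]
        area += width * (viability[i] + viability[i - 1]) / 2.0
    return area


def ic50_interpolated(doses, viability, level=50.0):
    """Dose at which viability first drops to `level`, by linear
    interpolation between the two neighbouring measurements."""
    for i in range(1, len(doses)):
        upper, lower = viability[i - 1], viability[i]
        if upper >= level >= lower and upper != lower:
            frac = (upper - level) / (upper - lower)
            return doses[i - 1] + frac * (doses[i] - doses[i - 1])
    return None  # the curve never crosses `level` in the tested range


# Hypothetical measurements: dose (arbitrary units) vs. percent viability.
doses = [0.0, 1.0, 2.0, 4.0]
viability = [100.0, 80.0, 40.0, 10.0]

print(auc_trapezoid(doses, viability))      # 200.0
print(ic50_interpolated(doses, viability))  # 1.75
```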
Short Abstract: In equatorial Africa, Burkitt lymphoma (BL) accounts for about half of all childhood cancers, and in sub-Saharan Africa it is the most common childhood cancer (36%) and childhood lymphoma (70%). Genomics research is failing on demographic diversity, with most genomics data (81%) being generated for patients of European ancestry, followed by Asian ancestry (14%), and African ancestry a distant third (3%). As a contribution to UN SDG 3 (Good health and well-being), this study identified differentially expressed genes (DEGs) between African BL patients and normal individuals. The goal was to assess the associated gene ontology terms, molecular pathways and human phenotypes of the enriched genes towards the advancement of population-specific therapeutics. First, we assessed the quality of the reads with FastQC and trimmed low-quality bases and technical sequences with Trimmomatic. Reads that passed quality trimming were aligned to the human reference genome using BWA and bowtie2. Alignment with BWA achieved the highest exon mapping rates, greater than 94%. Data from BWA and bowtie2 provided 1604 and 1741 DEGs respectively between the two conditions, of which 1004 were common to both. The common gene list was then used in gene enrichment analyses, and approximately 81% of the genes were involved in pathways implicated in Burkitt lymphoma pathogenesis.
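The comparison of the two DEG lists in the abstract above amounts to a set intersection. The sketch below shows that step on toy gene lists and then summarises the reported counts (1604, 1741 and 1004) as a Jaccard index. The function names and the toy gene symbols are assumptions for illustration; the Jaccard index is an added summary and does not appear in the abstract.

```python
def overlap_summary(genes_a, genes_b):
    """Split two gene lists into shared and aligner-specific sets."""
    a, b = set(genes_a), set(genes_b)
    return {
        "common": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


def jaccard_from_counts(n_a, n_b, n_common):
    """Jaccard index from list sizes, using |A or B| = |A| + |B| - |A and B|."""
    union = n_a + n_b - n_common
    return n_common / union if union else 0.0


# Toy lists standing in for the BWA- and bowtie2-derived DEG lists.
bwa_degs = ["MYC", "TP53", "ID3", "CCND3"]
bowtie2_degs = ["MYC", "ID3", "TCF3"]
summary = overlap_summary(bwa_degs, bowtie2_degs)
print(summary["common"])  # ['ID3', 'MYC']

# Counts reported in the abstract: the union is 1604 + 1741 - 1004 = 2341 genes.
j = jaccard_from_counts(1604, 1741, 1004)
print(round(j, 3))  # 0.429
```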
Short Abstract: Variable Number Tandem Repeats (VNTRs) account for a significant amount of human genetic variation. VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by whole genome analysis pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks for fast read recruitment. On 55X whole genome data, adVNTR-NN genotyped each VNTR in less than 18 CPU-seconds, while maintaining 100% accuracy on 76% of VNTRs.
We used adVNTR-NN to genotype 10,264 VNTRs in 652 individuals from the GTEx project and associated VNTR length with gene expression in 46 tissues. We identified 163 'eVNTRs' that were significantly associated with gene expression. Of the 22 eVNTRs in blood where independent data was available, 21 (95%) were replicated in terms of significance and direction of association. Of the eVNTRs, 49% showed a strong and likely causal impact on the expression of genes and 80% had an effect size of at least 0.1. The impacted genes have important roles in complex phenotypes including Alzheimer's, obesity and familial cancers. Our results point to the importance of studying VNTRs for understanding the genetic basis of complex diseases.
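One simple way to associate VNTR length with gene expression is an ordinary least-squares fit of expression on repeat length, with the slope read as an effect size. The abstract does not state the model that was used, so the linear model, the function name, and the toy values below are all assumptions. A real eQTL-style analysis would also adjust for covariates and test significance.

```python
def least_squares(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: repeat count per individual and normalised expression.
vntr_length = [2, 3, 4, 5, 6]
expression = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept = least_squares(vntr_length, expression)
print(slope, intercept)
```

In this toy data, each extra repeat unit raises expression by 0.5, so the fitted slope is 0.5 and the intercept is 0.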