GIW XXXI/ISCB-Asia V
Original Research & Highlights

⏰ How to optimally sample a sequence for rapid analysis

Martin Frith, AIST & University of Tokyo, Japan
Jim Shaw, University of Toronto, Canada
John Spouge, National Library of Medicine, United States

We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers, and minimally-overlapping words, were developed by heuristic intuition, and are not optimal.

We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly-evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once), and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible.
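
To make "sampling a subset of positions" concrete, here is a minimal sketch of window minimizers, one of the earlier heuristic schemes named above. It is not the authors' optimized approach. The function name and the lexicographic k-mer ordering are assumptions for illustration; practical tools usually order k-mers by a hash.

```python
def minimizers(seq, k, w):
    """Return sorted start positions of the (w, k)-minimizers of seq.

    In every window of w consecutive k-mers, the smallest k-mer is sampled
    (leftmost on ties). Lexicographic order is an illustrative assumption.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sampled = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Comparing (k-mer, position) pairs breaks ties toward the left.
        best = min(window, key=lambda i: (kmers[i], i))
        sampled.add(best)
    return sorted(sampled)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=2))
```

By construction every window of w consecutive k-mers contains at least one sampled position, which is the "at least once" style of guarantee mentioned for universal hitting sets.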

⏰ SlowMoMan: A web app for discovery of important features along user-drawn trajectories in 2D embeddings

Kiran Deol, University of Alberta, Canada
Griffin Weber, Harvard University, United States
Yun William Yu, University of Toronto, Canada

Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where data sets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem. In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data. Thus, we present SlowMoMan (SLOW MOtions on MANifolds), a web application which allows the user to draw a 1-dimensional path onto the 2-dimensional embedding. Then, by back-projecting the manifold to the original, high-dimensional space, we sort the original features such that those most discriminative along the manifold are ranked highly. We show a number of pertinent use cases for our tools, including trajectory inference, spatial transcriptomics, and automatic cell classification.

⏰ Clinical and therapeutic application of adenosine to inosine RNA editing in glioma

Sean Chun-Chang Chen, Taipei Medical University, Taiwan
Sheng-Hau Lin, Rice University, United States

Glioma is the most common type of brain tumor in adults. Among gliomas, glioblastoma remains incurable with a 5-year survival rate of 5.1%. The unresponsiveness to treatment is mainly due to the high level of intratumor heterogeneity and the poor understanding of its molecular pathogenesis. Thus, the development of biomarkers for proper stratification of glioma patients is of utmost importance for a deeper understanding of the complex genetics of gliomas and for novel targeted therapeutics. Recent studies have revealed the clinical relevance of adenosine to inosine (A-to-I) RNA editing in human cancers. However, the prognostic and regulatory roles of A-to-I RNA editing in glioma remain unclear. By analyzing editing signatures of two independent glioma cohorts, we found that editing signatures predicted patient survival in a sex-dependent manner; hyper-editing was associated with poor survival in females but better survival in males. Moreover, a number of editing events were associated with glioma progression and affected mRNA abundance of their host genes, including genes associated with inflammatory response (EIF2AK2) and fatty acid oxidation (acyl-CoA oxidase 1). Our study indicates that RNA editing could serve as a promising prognostic biomarker and a potential therapeutic strategy against glioma, and highlights the importance of developing sexually dimorphic treatments.

⏰ Inference of single-cell network using mutual information for scRNA-seq data analysis

Lan-Yun Chang, National Yang Ming Chiao Tung University, Taiwan
Ting-Yi Hao, National Yang Ming Chiao Tung University, Taiwan
Wei-Jie Wang, National Yang Ming Chiao Tung University, Taiwan
Chun-Yu Lin, National Yang Ming Chiao Tung University, Taiwan

Motivation: With the advance in single-cell RNA sequencing (scRNA-seq) technology, deriving inherent biological system information from expression profiles at a single-cell resolution has become possible. It is known that network modeling by estimating the associations between genes could better reveal dynamic changes in biological systems. However, accurately constructing a single-cell network (SCN) to capture the network architecture of each cell and further explore cell-to-cell heterogeneity remains challenging.

Results: We introduce SINUM, a method for constructing the SIngle-cell Network Using Mutual information, which estimates mutual information between any two genes from scRNA-seq data to determine whether they are dependent or independent in a specific cell. Experiments on various scRNA-seq datasets validated the accuracy and robustness of SINUM in cell type identification, superior to the state-of-the-art SCN inference method. The SINUM SCNs exhibit high overlap with the human interactome and possess the scale-free property. Additionally, SINUM presents a view of biological systems at the network level to detect cell-type marker genes/gene pairs and investigate time-dependent changes in gene associations during embryo development.

Availability and implementation: Code for SINUM is freely available at https://github.com/SysMednet/SINUM. This research used the publicly available datasets published by other researchers as cited in the manuscript.
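
For readers unfamiliar with mutual information, the quantity at the core of the method can be sketched as below. This is the textbook plug-in estimate between two discretized expression vectors across cells, not SINUM's cell-specific estimator, which the abstract does not detail. The function name and the binarized inputs are assumptions for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


if __name__ == "__main__":
    gene_a = [0, 0, 1, 1]  # hypothetical binarized expression in four cells
    gene_b = [0, 0, 1, 1]
    print(mutual_information(gene_a, gene_b))
```

A value near zero suggests the two genes are independent in the sample, while larger values suggest dependence.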

⏰ Extracting T cell receptor reads from RNA-Seq data for studying non-recombined sequences in T cell lymphoma cell lines

Chen-Yan Hung, National Cheng Kung University, Taiwan
Cheng-Han Lin, National Cheng Kung University, Taiwan
Tsunglin Liu, National Cheng Kung University, Taiwan

Background: Different T cells can be distinguished via their unique T cell receptor (TCR) genes that undergo V(D)J recombination. The domination of an abundant TCR sequence often suggests expansion of the corresponding T cells. Thus, recombined TCR sequences can serve as a biomarker of immune-related diseases. A few recent works have shown the biomarker potential of non-recombined TCR sequences, i.e., those composed of only J gene segments. However, the mechanisms that generate them are little studied. Such a study should be at the single-cell level because recombination status may play a role and the status varies between T cells. RNA-Seq data of T cell lines are a great resource for studying non-recombined TCR sequences. However, currently no tool is available for extracting both recombined and non-recombined TCR sequences from RNA-Seq data.

Results: To develop a computational pipeline for extracting all TCRB reads, we applied two spliced alignment tools, HISAT2 and TRIg, and compared their performance. We found HISAT2 was more comprehensive in spliced alignments while only TRIg could identify reads containing some non-reference nucleotides added during VDJ recombination. Both HISAT2 and TRIg identified false TCRB reads, which could be filtered by comparing their alignments. On RNA-Seq data of 17 cell lines of T cell lymphoma, our pipeline found TCRB reads in 0.01-0.6% of the data. We found that recombination status was related to the expression of the TCRB gene. The coverage profiles of the TCRB reads revealed that the non-recombined TCRB J2-2P~J2-3 segments were relatively more abundant in T cells. When the top abundant VJ clone involved J2-1 or a J1 gene, especially when the top clone was non-productive, relatively more J2-2P~J2-3 segments were observed.

Conclusion: Our pipeline identified TCRB reads in the RNA-Seq data accurately and comprehensively. Analysis of the TCRB reads provides insights into the biogenesis of non-recombined TCRB sequences. Specifically, the J2-2P~J2-3 segments were likely by-products of non-desired splicing of VJ clones involving J2-1. For T cells whose top VJ clone involved a J1 gene, J2-2P~J2-3 segments may come from transcription from the D2 promoter.

⏰ An immune-suppressing protein in human endogenous retroviruses

Huan Zhang, University of Tokyo, Japan
Shengliang Ni, University of Tokyo, Japan
Martin Frith, AIST & University of Tokyo, Japan

Retroviruses are important contributors to disease and evolution in vertebrates. Sometimes, retrovirus DNA is heritably inserted in a vertebrate genome: an endogenous retrovirus (ERV). Vertebrate genomes have many such virus-derived fragments, usually with mutations disabling their original functions.

Some primate ERVs appear to encode an overlooked protein. This protein has significant homology to protein MC132 from Molluscum contagiosum virus, which is a human poxvirus, not a retrovirus. MC132 suppresses the immune system by targeting NF-κB, and it had no known homologs until now. The ERV homologs of MC132 in the human genome are mostly disrupted by mutations, but there is an intact copy on chromosome 4. We found homologs of MC132 in ERVs of apes, monkeys, and bushbaby, but not tarsiers, lemurs or non-primates. This suggests that some primate retroviruses had, or have, an extra immune-suppressing protein, which underwent horizontal genetic transfer between unrelated viruses.

⏰ Anomalous Relative Frequency of Trimeric Tandem Repeats in Animal Genomes

Chotinan Nitikitpaiboon, University of Tokyo, Japan
Martin Frith, AIST & University of Tokyo, Japan

Tandem repeats, such as CATCATCATCAT, are abundant in natural DNA. They tend to mutate and evolve rapidly by expansion and contraction. These mutations often cause disease, and are probably a major source of variation in evolution. A few previous studies have surveyed tandem repeats in eukaryote genomes, and found curious patterns such as 4-mer repeats being abundant in vertebrates, or 3-mer repeats being infrequent in mammals and birds. However, repeat identification depends on criteria such as minimum length threshold, and these surveys arguably used non-equivalent thresholds for different repeat unit sizes.

Here, we survey tandem repeats with 3 different methods (TANTAN, ULTRA, and TRF), considering both exact and inexact repeats. We confirm that 3-mer repeats are relatively infrequent in therian mammals. More generally, in chordates 3-mer repeats often have anomalous frequencies: high in some genomes and low in others. We test and reject several hypotheses: that repeat abundances are driven by evolutionary forces in genic regions, or by transposable elements, or by palindromic repeats. In conclusion, tandem repeat surveys are sensitive to repeat definition criteria, but 3-mer repeats often have anomalous abundance in chordate genomes, which seems to be a genome-wide phenomenon.
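
The dependence on a minimum length threshold can be seen in a small sketch of an exact tandem repeat scan. This is not TANTAN, ULTRA, or TRF; the function name, the parameters, and the example sequence are hypothetical, and inexact repeats are not handled.

```python
def exact_tandem_repeats(seq, unit_len, min_copies):
    """Find maximal exact tandem repeats with period unit_len.

    Returns half-open (start, end) intervals within which every base equals
    the base unit_len positions earlier, spanning at least min_copies copies.
    Note that runs with a shorter period (e.g. homopolymers) also qualify.
    """
    hits = []
    run_start = 0  # start of the current periodic run
    for i in range(unit_len, len(seq) + 1):
        if i < len(seq) and seq[i] == seq[i - unit_len]:
            continue  # the periodic run extends through position i
        # The run [run_start, i) ends here; report it if long enough.
        if i - run_start >= unit_len * min_copies:
            hits.append((run_start, i))
        # The next run cannot include the mismatching pair (i - unit_len, i).
        run_start = i - unit_len + 1
    return hits


if __name__ == "__main__":
    dna = "GGCATCATCATCATTT"
    print(exact_tandem_repeats(dna, unit_len=3, min_copies=3))
    print(exact_tandem_repeats(dna, unit_len=3, min_copies=5))
```

The same sequence yields a repeat under one threshold and none under a stricter one, which is why thresholds need to be comparable across unit sizes.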

⏰ Incorporating tissue-specific gene expression data to improve chemical-disease inference

Shan-Shan Wang, Kaohsiung Medical University, Taiwan
Chia-Chi Wang, National Taiwan University, Taiwan
Chien-Lun Wang, Taipei Medical University, Taiwan
Chun-Wei Tung, National Health Research Institutes, Taiwan

In silico toxicogenomics methods are resource- and time-efficient approaches for inferring chemical-protein-disease associations with potential mechanism information for exploring toxicological effects. However, current in silico toxicogenomics systems make inferences by considering only species-specific chemical-protein interactions, regardless of tissue-specific gene expression. As a result, inferred diseases could be overpredicted with false positives. In this study, a tissue-specific gene expression profile consisting of 16 tissues was analyzed and incorporated into the chemical-protein-disease inference process of the ChemDIS system by filtering out relatively low-expression genes. Using curated chemical-disease associations as the gold standard, the enrichment analysis incorporating tissue-specific gene expression data was largely improved, with up to 48.8% improvement in the mean enrichment rate. The methodology is expected to be useful and can be implemented for other enrichment analysis tools.

⏰ Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines

Y-H. Taguchi, Chuo University, Japan
Turki Turki, King Abdulaziz University, Saudi Arabia

In contrast to RNA-seq analysis, which has various standard methods, no standard methods for identifying differentially methylated cytosines (DMCs) exist. To identify DMCs, we tested principal component analysis and tensor decomposition-based unsupervised feature extraction with optimized standard deviation, whose effectiveness for differentially expressed gene (DEG) identification was recently recognized. The proposed methods can outperform certain conventional methods, including those that must assume a beta-binomial distribution for methylation (an assumption the proposed methods do not require), especially when applied to methylation profiles measured using high-throughput sequencing. DMCs identified by the proposed method also overlap significantly with various functional sites, including known differentially methylated regions, enhancers, and DNase I hypersensitive sites. This suggests that the proposed method is a promising candidate for a standard method for identifying DMCs.

⏰ A putative scenario of how de novo protein-coding genes originate in the Saccharomyces cerevisiae lineage

Tetsushi Yada, Kyushu Institute of Technology, Japan
Takeaki Taniguchi, Mitsubishi Research Institute, Inc., Japan

Novel protein-coding genes were considered to be born by re-organization of pre-existing genes, such as gene duplication and gene fusion. However, recent progress in genome research revealed that more protein-coding genes than expected were born de novo, that is, by accumulating mutations in non-genic DNA sequences. Nonetheless, the in-depth process (scenario) for de novo origination is not well understood. We have conceived a bioinformatics analysis for sketching a scenario for de novo origination of protein-coding genes. For each de novo protein-coding gene, we first identified the edge of a given phylogenetic tree where the gene was born, based on the parsimony principle. Then, from a multiple sequence alignment of the de novo gene and its orthologous regions, we constructed ancestral DNA sequences of the gene corresponding to both ends of the edge. We finally revealed statistical features observed in evolution between the two ancestral sequences. In the analysis of the Saccharomyces cerevisiae lineage, we have successfully sketched a putative scenario for de novo origination of protein-coding genes. (1) In the beginning were GC-rich genome regions. (2) Neutral mutations accumulated in the regions. (3) ORFs were extended/combined, and then (4) a translation signature (Kozak consensus sequence) was recruited. To the best of our knowledge, this is the first report outlining a scenario of de novo origination of protein-coding genes.

⏰ PSM-SMOTE: Propensity Score Matching and Synthetic Minority Oversampling for handling unbalanced microbiome data

Taesung Park, Seoul National University, South Korea
Jeongsup Moon, Seoul National University, South Korea

When building a model for predicting a phenotype of interest based on omics data, data imbalance, resulting from either unbalanced covariates or unbalanced groups, can lead to inaccurate training of the model. To handle the covariate imbalance between groups in the dataset, a matching algorithm, such as propensity score matching (PSM), can be used to reduce the bias of the unbalanced covariates on the phenotype of interest. Although PSM reduces the bias of the effect of a covariate, many samples can be filtered out. In addition to PSM, oversampling methods, such as borderline synthetic minority oversampling technique (borderline-SMOTE), can be used to reduce the bias from class imbalance when building prediction models. However, newly created borderline-SMOTE samples could be uninformative if they are not close to the borderline.

To manage the limitations of PSM and borderline-SMOTE, we propose a hybrid sampling method, PSM-SMOTE, that handles both covariate and class imbalance in microbiome data. Logistic regression (LR), random forest (RF), and support vector machine (SVM) models were used to evaluate PSM-SMOTE using three different microbiome datasets. The area under the receiver operating characteristic curve (AUC) was used to compare PSM-SMOTE with other existing methods. RF models using PSM-SMOTE performed best, achieving the highest AUC and an increase in AUC relative to models using PSM.

⏰ DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms

Kyle Santiago, De La Salle University, Philippines
Anish Man Singh Shrestha, De La Salle University, Philippines

Conventional differential gene expression analysis pipelines for non-model organisms require computationally expensive transcriptome assembly. We recently proposed an alternative strategy of directly aligning RNA-seq reads to a protein database, and demonstrated drastic improvements in speed, memory usage, and accuracy in identifying differentially expressed genes. Here we report a further speed-up by replacing DNA-protein alignment with quasi-mapping, making our pipeline >1000x faster than the assembly-based approach, yet more accurate. We also compare quasi-mapping to other mapping techniques and show it is faster, but at the cost of sensitivity.

⏰ Representation of Neuron Activity and Examination of Cell Detection in Calcium Imaging Data by Max-Min Intensity Images

Yu-Wei Tsay, Institute of Information Science, Academia Sinica, Taiwan
Chia-Ying Wu, Institute of Information Science, Academia Sinica, Taiwan
Wei-Hsin Chen, Institute of Information Science, Academia Sinica, Taiwan
Chien-Chang Chen, Institute of Information Science, Academia Sinica, Taiwan
Arthur Chun-Chieh Shih, Institute of Information Science, Academia Sinica, Taiwan

Two important issues in imaging neurons are how to represent the neuron activity of a calcium imaging video in a single image and how to predict a set of neurons with a high precision rate and high fluorescence level changes. In this paper, we propose a new pipeline to predict neuron cells from a calcium imaging video. Rather than analyzing the whole video, we generate the max-min intensity image as a representative image from an input video. Beyond visualization, the image can be used to predict neuron cells by directly applying Cellpose, a two-dimensional cell detection tool. We demonstrate that the max-min intensity image shows clear cell shapes with fluorescence change levels, compared with other reference images. Then, six calcium imaging videos with consensus annotations were used for the comparison. The precision rates of our method were mostly higher than those of two other popular tools, though the total numbers of neurons predicted by our method were lower. Moreover, the fluorescence change levels of the neurons predicted by our method were significantly higher than those of the unpredicted annotated ones. Thus, our method can satisfy neuroscientists seeking a predictor whose output favors quality over quantity.
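
The representative image can be sketched as follows, assuming that "max-min" means the per-pixel difference between the maximum and minimum intensity over time; the abstract does not spell out the exact definition. The function name and the nested-list video layout are hypothetical, and a real pipeline would more likely use an array library.

```python
def max_min_image(frames):
    """Per-pixel (max over time - min over time) for a video.

    frames: non-empty list of equally sized 2D lists of pixel intensities.
    """
    if not frames:
        raise ValueError("frames must contain at least one frame")
    rows, cols = len(frames[0]), len(frames[0][0])
    image = []
    for r in range(rows):
        row = []
        for c in range(cols):
            trace = [frame[r][c] for frame in frames]  # intensity over time
            row.append(max(trace) - min(trace))
        image.append(row)
    return image


if __name__ == "__main__":
    video = [
        [[1, 5], [2, 2]],
        [[4, 5], [0, 2]],
        [[2, 5], [3, 2]],
    ]
    print(max_min_image(video))
```

Pixels whose intensity never changes map to zero, so static background is suppressed and pixels with large fluorescence changes stand out.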

⏰ NuKit: A Deep Learning Platform for Fast Nucleus Segmentation of Histopathological Images

Ching-Nung Lin, H. Lee Moffitt Cancer Center and Research Institute, United States
Christine H Chung, H. Lee Moffitt Cancer Center and Research Institute, United States
Aik Choon Tan, H. Lee Moffitt Cancer Center and Research Institute, United States

Motivation: Nucleus segmentation represents the initial step for histopathological image analysis pipelines, and it remains a challenge in many quantitative analysis methods in terms of accuracy and speed. Recently, deep learning nucleus segmentation methods have been shown to outperform previous intensity- or pattern-based methods. However, the heavy computation of deep learning gives an impression of lagging response in real time and has hampered the adoption of these models in routine research.

Results: We developed and implemented NuKit, a deep learning platform which accelerates nucleus segmentation and provides prompt results to the users. The NuKit platform consists of two deep learning models coupled with an interactive graphical user interface (GUI) to provide fast and automatic nucleus segmentation “on the fly”. The two deep learning models are the whole image segmentation model and the click segmentation model. The two models perform complementary tasks in nucleus segmentation in the NuKit platform. The whole image segmentation model performs whole-image nucleus segmentation, whereas the click segmentation model supplements the nucleus segmentation with user-driven input to edit the segmented nuclei. We trained the NuKit whole image segmentation model on a large public training data set and tested its performance on seven independent public image data sets. The whole image segmentation model achieves average DICE = 0.814, average IoU = 0.689 and Nucleus Detected Ratio = 1.052. The outputs of NuKit can be exported into different file formats, and NuKit integrates seamlessly with other image analysis tools such as QuPath. NuKit can be executed on Windows, Mac and Linux using a regular personal computer. The deep learning inference time of NuKit is around 1.5 seconds.

Availability: The platform is available at http://tanlab.org/NuKit/.
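
For reference, the two overlap metrics reported above can be computed from binary masks as in this minimal sketch. The function name and the flat 0/1 mask representation are assumptions for illustration; how NuKit averages the scores over images is not described here.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two equal-length binary (0/1) masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_truth = sum(1 for t in truth if t)
    union = n_pred + n_truth - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    dice = 2 * inter / (n_pred + n_truth)
    iou = inter / union
    return dice, iou


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    annotated = [1, 0, 1, 0]
    print(dice_and_iou(predicted, annotated))
```

For a single pair of masks the two scores are linked by Dice = 2 IoU / (1 + IoU), so Dice is never lower than IoU.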

⏰ Highly-accurate prediction of colorectal cancer through low abundance uncultivated genomes recovered using metagenomic co-assembly and binning approach

Po-Ting Lin, National Taiwan University of Science and Technology, Taiwan
Yu-Wei Wu, Taipei Medical University, Taiwan

Background: Microbiome studies in recent years have established the association between the composition of gut microbiota and various diseases. Since 16S ribosomal RNA sequencing may suffer from problems such as lower taxonomic resolution or limited sensitivity, more and more studies have embraced the whole-metagenome approach, which has the potential of sequencing everything in the target microbiome, to conduct microbial association analysis. However, species profiling, which is the most popular analysis technique for whole-metagenome analysis, cannot detect uncultivated species. Since uncultivated species may also be indispensable in the gut environment, it is crucial to identify those uncultivated species and evaluate their importance in discerning disease from healthy samples.

Results: After conducting de novo co-assembly and genome binning procedures on two colorectal cancer (CRC) cohorts, one from the Asian population and the other from the Caucasian population, we found that the Asian and Caucasian cohorts shared a significant number of microbial species in their microbiota. In addition, we found that low-abundance genomes may be more important in classifying disease and healthy metagenomes. By sorting the genomes based on their random forest importance scores in differentiating disease and healthy samples and cumulatively evaluating the subset of genomes in predicting CRC status, we identified dozens of “important” genomes for each of the cohorts that were able to predict CRC with very high accuracy (0.90 and 0.98 AUROC for the Asian and Caucasian cohorts respectively). Uncultivated species were also identified among the selected genomes, highlighting the importance of extracting genomes of the uncultivated species in order to build better disease prediction models and evaluate the roles of the uncultivated species in disease formation or progression. Finally, we found that the “important” species for both cohorts did not overlap with each other, hinting that the species highly associated with CRC may differ between the Eastern and Western populations.

⏰ A comparative analysis of ENCODE and Cistrome in the context of TF binding signal

Stefano Perna, Nanyang Technological University, Singapore
Pietro Pinoli, Politecnico di Milano, Italy
Stefano Ceri, Politecnico di Milano, Italy
Limsoon Wong, National University of Singapore, Singapore

With the rise of publicly available genomic data repositories, it is common for scientists to rely on computational models and preprocessed data, either as control or for knowledge discovery. Nevertheless, there is no guarantee that different repositories adhere to the same principles and guidelines. Furthermore, data processing plays a huge role in the quality of the resulting datasets. In particular, two popular repositories for transcription factor binding sites, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Also, the output format of the processing (bed narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks, but offers valuable insight on the quality of the data. We provide evidence that data points with high signalValue(s) are more likely to be consistent between ENCODE and Cistrome in human cell lines HepG2, GM12878, and K562. Furthermore, we show that filtering according to high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome.
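
Filtering peaks by signalValue can be sketched as below. In the narrowPeak format, signalValue is the seventh tab-separated column. The function name, the threshold, and the example peaks are hypothetical; the abstract does not state which cutoff the authors recommend.

```python
def filter_narrowpeak(lines, min_signal):
    """Keep narrowPeak records whose signalValue (column 7) is >= min_signal."""
    kept = []
    for line in lines:
        record = line.rstrip("\n")
        # Skip blank lines and optional header lines.
        if not record.strip() or record.startswith(("#", "track", "browser")):
            continue
        fields = record.split("\t")
        signal_value = float(fields[6])  # 0-based index 6 = 7th column
        if signal_value >= min_signal:
            kept.append(record)
    return kept


if __name__ == "__main__":
    peaks = [
        "chr1\t100\t300\tpeak1\t0\t.\t12.5\t5.0\t3.0\t80\n",
        "chr1\t900\t1100\tpeak2\t0\t.\t2.1\t1.5\t0.5\t95\n",
    ]
    for record in filter_narrowpeak(peaks, min_signal=5.0):
        print(record)
```

Because signalValue scales can differ between processing pipelines, a fixed cutoff shared across repositories is itself an assumption of this sketch.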

⏰ MethylSeqLogo: DNA methylation smart sequence logos

Fei-Man Hsu, University of California at Los Angeles, United States
Paul Horton, National Cheng Kung University, Taiwan

Background: Sequence logos can effectively visualize position-specific base preferences evident in a collection of binding sites of some transcription factor. But those preferences usually fall far short of fully explaining binding specificity.

Interestingly, some transcription factors bind sites of potentially methylated DNA. For example, MYC binds CpG sites. This may increase binding specificity as such sites 1) are highly under-represented in the genome, and 2) offer additional, tissue-specific information in the form of hypo- or hyper-methylation. Fortunately, bisulfite sequencing data suitable to investigate this possibility is readily available.

Method: We developed MethylSeqLogo, an extension of sequence logos which adds DNA methylation information to sequence logos. MethylSeqLogo includes new elements to indicate DNA methylation and under-represented dimers in each position of a set of aligned binding sites. Our method displays information from both DNA strands, and takes into account the sequence context (CpG or other) and genome region (promoter versus whole genome) appropriate to properly assess the expected background dimer frequency and level of methylation.

When designing MethylSeqLogo, we took care to preserve the usual sequence logo meaning of heights, in which the relative height of nucleotides within a column represents their proportion in the binding sites, while the absolute height of each column represents information (relative entropy) and the height of all columns added together represents total information.
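
The usual meaning of heights described above can be written out as a short sketch for one logo column: the column height is the relative entropy of the observed base frequencies against a background, and each letter gets a share proportional to its frequency. The function name and the uniform background are assumptions for illustration; MethylSeqLogo's methylation and dimer elements are not shown.

```python
import math


def logo_column_heights(counts, background):
    """Return (per-base letter heights, column information in bits)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no observations")
    freqs = {base: c / total for base, c in counts.items()}
    # Relative entropy of the column against the background distribution.
    info = sum(
        f * math.log2(f / background[base])
        for base, f in freqs.items()
        if f > 0
    )
    heights = {base: f * info for base, f in freqs.items()}
    return heights, info


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    column = {"A": 2, "C": 2, "G": 0, "T": 0}
    print(logo_column_heights(column, uniform))
```

Summing the column information over all positions gives the total information of the logo.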

Results: We present several figures illustrating the utility of using MethylSeqLogo to summarize data from CpG binding transcription factors. The logos show that unmethylated CpG binding sites are a feature of transcription factors such as MYC and ZBTB33, while some other CpG binding transcription factors, such as CEBPB, appear methylation neutral.

We also compare MethylSeqLogo with two previously reported ways to create methylation-aware sequence logos.

Conclusions: Our freely available software enables users to explore large-scale bisulfite and ChIP sequencing data sets and, in the process, obtain publication-quality figures.

⏰ Predicting splicing patterns from the transcription factor binding sites in the promoter with deep learning

Tzu-Chieh Lin, Academia Sinica, Taiwan
Cheng-Hung Tsai, Academia Sinica, Taiwan
Cheng-Kai Shiau, Academia Sinica, Taiwan
Jia-Hsin Huang, Academia Sinica, Taiwan
Huai-Kuang Tsai, Academia Sinica, Taiwan

Alternative splicing is a crucial mechanism of post-transcriptional modification responsible for the transcriptome plasticity and proteome diversity of a metazoan cell. Although many splicing regulations around the exon/intron regions have been discovered, the relationship between promoter-bound transcription factors and the downstream alternative splicing remains largely unexplored. In this study, we present computational approaches to decipher the regulation relationship connecting the promoter-bound transcription factor binding sites (TFBSs) and the splicing patterns. We curated a fine data set, including DNase I hypersensitive sites sequencing and transcriptome in fifteen human tissues from ENCODE. Specifically, we proposed different representations of TF binding context and splicing patterns to tackle the associations between the promoter and downstream splicing events. Our results demonstrated that the convolutional neural network (CNN) models learned from the TF binding changes in the promoter to predict the splicing pattern changes. Furthermore, through an in silico perturbation-based analysis of the CNN models, we identified several TFs that considerably reduced the model performance of splicing prediction. In conclusion, our finding highlights the potential role of promoter-bound TFBSs in influencing the regulation of downstream splicing patterns and provides insights for discovering alternative splicing regulations.

⏰ ThermalProGAN: a sequence-based thermally stable protein generator trained using un-paired data

Motivation: The synthesis of proteins with novel desired properties is challenging but sought after by industry and academia. The dominant approach is based on trial and error, inducing point mutations, assisted by structural information or predictive models built with paired data that are difficult to collect. This study proposes a sequence-based unpaired sample of novel protein inventor (SUNI) to build ThermalProGAN for generating thermally stable proteins based on sequence information.

Results: ThermalProGAN can strongly mutate the input sequence, with a median of 32 mutated residues. A known normal protein, 1RG0, was used to generate a thermally stable form by mutating 51 residues. Superimposing the two structures shows high similarity, indicating that the basic function would be conserved. Results of 84 molecular dynamics simulations of 1RG0 and the Covid-19 vaccine candidates, with a total simulation time of 840 ns, indicate that the thermal stability increased.

Conclusion: This proof of concept demonstrated that the transfer of a desired protein property from one set of proteins is feasible.

Availability and implementation: The source code of ThermalProGAN can be freely accessed at https://github.com/markliou/ThermalProGAN/ with an MIT license. The website is https://thermalprogan.markliou.tw:433.

Supplementary information: Supplementary data are available on GitHub.

⏰ Evaluating network-based missing protein prediction using p-values, Bayes Factors, and probabilities

Wilson Wen Bin Goh, Nanyang Technological University, Singapore
Kong Weijia, Nanyang Technological University, Singapore
Limsoon Wong, Nanyang Technological University, Singapore

Methods for network-based missing protein prediction may report either p-values or probabilities, which makes direct cross-comparison difficult. Approaches such as the Bayes Factor upper Bound (BFB) for p-value conversions may not make correct assumptions. Here, using a well-established case study on renal cancer proteomics, we demonstrate how to compare methods based on false discovery rate (FDR) estimations, which do not make the same naïve assumptions as BFB conversions; and we introduce a powerful approach which we colloquially call “home ground testing”. Both approaches perform better than BFB conversions. Thus, we recommend that methods be compared by standardization to a common performance benchmark such as a global FDR. Where this is not possible, we recommend considering reciprocal “home ground testing”.
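
As an illustration only (not the authors' code), the two kinds of conversion mentioned in this abstract can be sketched in a few lines. The sketch assumes the BFB takes its usual form, 1 / (-e · p · ln p) for p < 1/e, and uses the Benjamini-Hochberg procedure as one common way to estimate a global FDR; the abstract does not state which FDR estimator the authors used, and the function names are hypothetical.

```python
import math


def bayes_factor_bound(p):
    """Upper bound on the Bayes factor in favour of the alternative.

    Assumed form: 1 / (-e * p * ln p) for 0 < p < 1/e.
    For p >= 1/e the bound is conventionally set to 1 (no evidence).
    """
    if p <= 0.0 or p > 1.0:
        raise ValueError("p must be in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p * math.log(p))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order.

    This is an assumed choice of FDR estimator, used here only to show
    how p-values from different methods can be put on a common scale.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    print(bayes_factor_bound(0.05))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.2]))
```

Under these assumptions, a p-value of 0.05 corresponds to a Bayes factor of at most about 2.5, which shows how modest the evidence behind a nominally significant p-value can be.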

⏰ A Deconvolution Approach to Unveiling the Immune Microenvironment of Complex Tissues and Tumors in Transcriptomics

Shu-Hwa Chen, Taipei Medical University, Taiwan
Bo-Yi Yu, the University of Tokyo, Japan
Wen-Yu Kuo, Academia Sinica, Taiwan
Ya-Bo Lin, Academia Sinica, Taiwan
Sheng-Yao Sue, Academia Sinica, Taiwan
I-Hsuan Lu, Academia Sinica, Taiwan
Chung-Yen Lin, Academia Sinica, Taiwan
Wei-Hsuan Chuang, Academia Sinica, Taiwan

Resolving the composition of tumor-infiltrating leukocytes is essential for expanding cancer immunotherapy strategies, which have achieved dramatic success in some clinical trials but remain limited in their application. In this study, we developed a two-step streamlined workflow to manage the complex bioinformatic processes involved in immune cell composition analysis. We developed a dockerized toolkit (DOCexpress_fastqc, https://hub.docker.com/r/lsbnb/docexpress_fastqc) to perform gene expression profiling from RNA sequencing raw reads by integrating the hisat2-stringtie pipeline and our scripts with Galaxy/Docker images. The output of DOCexpress_fastqc matches the input format of mySORT web, a web application that employs a deconvolution algorithm to determine the immune content of 21 cell subclasses. The usage of mySORT was also demonstrated using a pseudo-bulk pool built from single-cell datasets. Additionally, the consistency between the estimated values and the ground-truth immune-cell composition from the single-cell datasets confirmed the exceptional performance of mySORT. The mySORT demo website and Docker image can be accessed for free at https://mysort.iis.sinica.edu.tw and https://hub.docker.com/r/lsbnb/mysort_2022.
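
The pseudo-bulk validation idea mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the mySORT implementation: the input format (a list of cell-type label and gene-count pairs), the function name `make_pseudo_bulk`, and the toy gene counts are all hypothetical. Summing counts over labelled single cells gives a bulk-like expression profile, while the labels give the ground-truth composition against which a deconvolution estimate could be compared.

```python
from collections import Counter


def make_pseudo_bulk(cells):
    """Pool labelled single cells into one pseudo-bulk sample.

    cells: list of (cell_type, {gene: count}) pairs (hypothetical format).
    Returns (bulk_counts, composition), where bulk_counts sums counts per
    gene and composition is the fraction of cells of each type.
    """
    bulk = Counter()
    type_counts = Counter()
    for cell_type, counts in cells:
        bulk.update(counts)           # add this cell's counts gene by gene
        type_counts[cell_type] += 1   # track ground-truth cell-type labels
    total = sum(type_counts.values())
    composition = {t: n / total for t, n in type_counts.items()}
    return dict(bulk), composition


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_cells = [
        ("T cell", {"CD3E": 5, "CD19": 0}),
        ("T cell", {"CD3E": 3}),
        ("T cell", {"CD3E": 4}),
        ("B cell", {"CD19": 6}),
    ]
    bulk, truth = make_pseudo_bulk(toy_cells)
    print(bulk)
    print(truth)
```

A deconvolution tool would then be run on the pooled profile and its estimated fractions compared with the label-derived composition.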

⏰ Effective modeling of the chromatin structure by coarse-grained methods

Irina Tuszynska, University of Warsaw, Poland
Paweł Bednarz, University of Warsaw, Poland
Bartek Wilczynski, University of Warsaw, Poland

The interphase chromatin structure is extremely complex, precise and dynamic. Experimental methods can only show the frequency of interaction between the various parts of the chromatin. Therefore, it is extremely important to develop theoretical methods to predict the chromatin structure. In this publication, we describe the factors necessary for the effective modeling of the chromatin structure in Drosophila melanogaster. We also compared Monte Carlo with Molecular Dynamics methods. We showed that incorporating black, non-reactive chromatin is necessary for successful prediction of chromatin structure, while the loop extrusion model with a long-range attraction potential or Lennard-Jones potential (with local attraction force), as well as the use of Hi-C data as input, are not essential for the basic structure reconstruction. We also proposed a new way to calculate the similarity of the properties of contact maps, including the calculation of local similarity.

