Poster presentations at ISMB 2020 will be given virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform along with a PDF of their poster.
All registered conference participants will have access to the posters and presentations through the conference platform until October 31, 2020. Q&A opportunities are available through a chat function to allow interaction between presenters and participants.
Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters
Ideally, authors should be available for interactive chat during the times noted below:
Poster Session A: July 13 & July 14, 7:45 am - 9:15 am Eastern Daylight Time
Poster Session B: July 15 & July 16, 7:45 am - 9:15 am Eastern Daylight Time
Short Abstract: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research despite several previous attempts. In earlier work, both dictionary-based and machine learning-based methods have been tried for phenotype concept recognition. Dictionary-based methods can achieve high precision but suffer from low recall. Machine learning-based methods can recognize more phenotype concept variants through automatic feature learning; however, most require large corpora of manually annotated data for model training, which are difficult to obtain due to the high cost of human annotation. To address these problems, we propose a hybrid method combining dictionary-based and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary. Then we employ the dictionary-based method and HPO to automatically build a “weak” training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase into a corresponding concept ID. Finally, machine learning-based predictions are merged with the dictionary-based results for improved performance. Experimental results show that our method achieves state-of-the-art performance without any manually labeled training data.
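The dictionary-matching step of such a hybrid pipeline can be illustrated with a minimal, longest-match-first scan over token n-grams. This is a sketch only, not the authors' implementation; the dictionary below contains a handful of illustrative term-to-HPO-ID entries rather than the full ontology.

```python
# Toy dictionary-based concept matcher: scan text with longest-match-first
# lookup of token n-grams against a term -> concept-ID dictionary.
# Terms and HPO IDs below are illustrative examples only.
def build_dictionary():
    return {
        "short stature": "HP:0004322",
        "seizure": "HP:0001250",
        "atrial septal defect": "HP:0001631",
    }

def match_concepts(text, dictionary, max_len=4):
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                matches.append((phrase, dictionary[phrase]))
                i += n  # a match consumes its tokens
                break
        else:
            i += 1
    return matches
```

In the real system, matched phrases would then become candidates (and weak labels) for the deep learning classifier.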
Short Abstract: With thousands of new research articles published every day, it is increasingly challenging for scientists to understand the relevance of new findings in the context of what is already known in their field. This problem can be seen acutely in the COVID-19 pandemic, where a deluge of new mechanistic research about the SARS-CoV-2 virus must be considered in the context of what is known about related coronaviruses and basic human cell biology. Here we describe a model of COVID-19 disease mechanisms deployed within the Ecosystem of Machine-maintained Models with Automated Analysis (EMMAA). EMMAA continuously updates causal models of biological pathways with new information automatically extracted from literature and evaluates the ability of the updated models to explain a set of empirical observations. The EMMAA COVID-19 model is built on a corpus of relevant documents processed with multiple text mining systems and assembled into causal networks by the Integrated Network and Dynamical Assembler (INDRA). The model is used to systematically identify mechanistic rationales for COVID-19 drug therapies that have been proposed or empirically observed to be effective. The coupling of text mining to automated model testing allows newly extracted findings to be evaluated for their explanatory impact and plausibility.
Short Abstract: Despite the consequences of medication non-adherence, including worsening health conditions or even death, 50% of patients do not take medications as indicated. Current methods for identifying non-adherence have major limitations: for example, direct observation and questionnaires are expensive or rely heavily on patients' honesty.
We explore social media data to supplement current efforts in population-level medication non-adherence monitoring. Using 22,809 social media postings mentioning a medication name, we developed an automatic classifier to find explicit mentions of changes to medication regimen (dose, frequency, switching drugs, etc.) regardless of whether the change was made in consultation with a clinician - which is the aspect that would define non-adherence.
When classifying tweets, our baseline convolutional neural network achieved an F1-score of 0.41, improving to 0.50 by transferring knowledge from the classification of medication change in WebMD reviews of medications, a smaller but more balanced and focused corpus. We also used active learning to significantly reduce the number of training examples and compensate for the imbalance of our corpora.
We further annotated 5,000 tweets to manually identify non-adherence and categorize the reasons mentioned by Twitter users, demonstrating that it is indeed a complementary resource that can help understand beliefs and behaviors leading to non-adherence.
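The active learning mentioned above is typically pool-based uncertainty sampling: at each round, the examples the current classifier is least sure about are sent for annotation. The helper below is a hypothetical sketch of that selection step, not the authors' code; the probability scores stand in for a classifier's predictions.

```python
# Hypothetical uncertainty-sampling step for pool-based active learning:
# pick the k unlabeled items whose predicted positive-class probability
# is closest to 0.5 (i.e., where the classifier is least certain).
def select_uncertain(probs, k):
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]
```

Repeatedly labeling only these items concentrates annotation effort where it most improves the decision boundary, which is how the corpus imbalance and labeling cost were reduced.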
Short Abstract: Medical research is risky and expensive. Drug discovery, for example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criterion, this system quickly prioritizes plausible term pairs among entity sets, allowing us to recommend new research directions. We validate our system at scale with a temporal holdout wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical subdomains and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. This system achieves best-in-class performance on an established benchmark and demonstrates high recommendation scores across subdomains. Reproducibility: all code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha .
Short Abstract: When extracting information from large collections of biomedical documents, named entity recognition (NER) supports automatically finding chunks of text denoting biomedical entities. Even though NER and natural language processing as a whole have made large strides, applying NER to text describing cardiovascular disease and related phenomena remains difficult due to the highly specific language involved and the extensive domain knowledge needed to create training data. We introduce the E-TEA (end to end annotation) platform, a rapid, assisted annotation and active learning tool to address and resolve these issues. E-TEA automatically tags documents with disease entities through an NER model trained on the BC5CDR corpus and suggests more specific labels (i.e., matches in ICD11 and MeSH). With E-TEA, users can create annotation projects to construct training, validation and test sets for their own NER models. E-TEA is specifically adapted for working with documents about cardiovascular and metabolic diseases. In addition to creating static data sets for model building, E-TEA allows for continuous updates of models through an active learning mode that can continuously improve model performance. Cardiovascular researchers seeking to unify unstructured biomedical text may now build upon this platform to consistently organize sets of relevant knowledge.
Short Abstract: Motivation: To facilitate biomedical text mining research on COVID-19, we have developed an automated textual evidence mining system, EvidenceMiner. EvidenceMiner works on CORD-19, the COVID-19 Open Research Dataset: it takes a researcher’s query (e.g., “UV, kill, SARS-CoV-2”) and returns a ranked list of sentences containing compelling evidence, along with their associated research articles.
Results: EvidenceMiner is a web-based system with the following distinctive features: (1) it allows users to query a natural language statement or an inquired relationship at the meta-symbol level (e.g., CHEMICAL, PROTEIN) and automatically retrieves textual evidence from a background corpus of COVID-19 literature; (2) it has been constructed in a completely automated way, without requiring any human effort for training data annotation; (3) it achieves the best performance compared with baseline methods such as BM25 and LitSense. Case studies demonstrate that EvidenceMiner can help boost scientific discoveries on COVID-19; for example, scientists may learn from EvidenceMiner that UV inactivation also has potential for vaccine development against the coronavirus. The EvidenceMiner system will be demonstrated at ACL 2020 (the 2020 Annual Conference of the Association for Computational Linguistics).
Short Abstract: Gene and protein variations or expression patterns are often directly and specifically involved in human diseases. They are extensively researched with continuously improving omics technologies and used for clinical test assays or as potential drug targets. Consequently, the number of biomarker-related scientific articles is increasing dramatically, and it has become infeasible for researchers to know all of them. Moreover, access to structured information on biomarker-disease relationships in existing online databases is often restricted, limited to particular types of markers or diseases, outdated, and/or user-unfriendly. To address these issues, we implemented the novel biomarker database BIONDA, which provides structured information on all biomarker candidates published in PubMed articles, with no restriction to particular diseases. To this end, PubMed article abstracts and renowned databases such as UniProt and the Human Disease Ontology are used as sources for BIONDA’s database entries. These are acquired automatically and updated regularly using text mining methods. BIONDA is freely available via a user-friendly web interface. As a distinguishing feature, BIONDA’s database entries are rated by a scoring approach that estimates biomarker reliability. Thus, BIONDA is a valuable knowledge resource for biomedical research.
Short Abstract: Environmental stress factors, such as drought and heat, severely affect crop yield. Plants produce a wide variety of responses to endure environmental stress, such as changes in the rate of photosynthesis. Given the importance of this topic in agriculture, the number of studies is increasing, and so are the publications. Automatically mining information on plant genes and stress could greatly assist biologists conducting research in plant tolerance to stress. Thus, we have established a pipeline that integrates text mining methods to efficiently retrieve information on stress genes and their relation to function and processes in Arabidopsis. For this pipeline, we used Textpresso, ePMC annotations for Gene Ontology and GeneRIF, and annotations provided by PubTator and pGenN. Upon initial review of 428 abstracts related to Arabidopsis genes and stress, we were able to identify 215 genes and related GO biological process terms from 197 of these. This exercise revealed pain points in the pipeline that need to be improved. In the future, we would like to extend this pipeline to plants in general and present the data in iTextMine to enable integration with relation extraction tools that may help highlight interesting aspects of the underlying biology.
Short Abstract: Single-cell RNA sequencing (scRNA-Seq) is becoming widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-Seq experiments aim to identify and quantify all cell-types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are then mapped to cell-types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious effort that requires expert knowledge. We introduce CellMeSH - a new approach to identify cell-types based on gene-set queries directly from literature. CellMeSH combines a database of gene cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and scales automatically. The probabilistic query method enables reliable information retrieval even though the gene cell-type associations extracted from the literature are necessarily noisy. CellMeSH achieves 60% top-1 accuracy and 90% top-3 accuracy in annotating the cell-types on a human dataset, and up to 58.8% top-1 accuracy and 88.2% top-3 accuracy on three mouse datasets, which is consistently better than existing approaches.
Short Abstract: Motivation: To facilitate biomedical text mining research on COVID-19, we have developed CORD-NER, an automated, comprehensive named entity annotation and typing system. The system generates an annotated CORD-NER dataset based on the COVID-19 Open Research Dataset, covering 75 fine-grained entity types with high quality. Both the system and the annotated literature datasets will be updated regularly.
Results: The distinctive features of the CORD-NER dataset include: (1) it covers 75 fine-grained entity types: in addition to the common biomedical entity types (e.g., genes, chemicals, and diseases), it covers many new entity types specifically related to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates, and immune responses), which may benefit research on the COVID-19 virus, its spreading mechanisms, and potential vaccines; (2) it relies on distantly- and weakly-supervised NER methods, with no need for expensive human annotation of any articles or sub-corpus; and (3) its entity annotation quality surpasses SciSpacy, a fully supervised BioNER tool (over 10% higher on the F1 score based on a sample set of documents). Our NER system supports incrementally adding new documents, as well as adding new entity types when needed, given only dozens of seed examples as input.
Short Abstract: COVIDScholar (www.covidscholar.org) is a knowledge portal to aggregate, analyze, and make available all relevant research literature on COVID-19, featuring research papers, clinical trials, patents, and numerical datasets. To our knowledge, it is currently the largest collection of scientific literature relevant to COVID-19. We present an overview of the COVIDScholar data acquisition and analysis pipeline, which includes live scraping of new research as it appears from a diverse set of sources, data cleaning and repair, and NLP analysis tools. We also present the NLP-powered search tools we have developed for the site, which utilize modern transformer NN architectures to highlight relevant research in ways that go beyond traditional search. This includes an effort to link word embeddings for named entities derived from text to pre-existing, human expert-annotated ontologies such as the HPO, allowing unstructured textual input to be added directly into ML models for drug re-purposing, disease-gene relationships, virus-host interactions and more.
Short Abstract: In the domain of biomedical text mining, it is essential to find a good way to represent the underlying data (e.g., PubMed abstracts) for target applications (e.g., classifying whether a PubMed abstract is relevant to a query). Traditional methods rely on exact word matching and cannot capture different words with similar semantics, e.g., cancer and tumor. In recent years, deep learning models such as word embeddings have revolutionized natural language processing research by representing words as vectors of real numbers. The same applies to sentence embeddings and concept embeddings, which capture the semantics of biomedical sentences and entities, respectively.
In this work, we introduce three recently developed embeddings for biomedical text analysis: (1) BioWordVec: embeddings capturing the semantic relatedness between words, trained on PubMed abstracts and ~300K MeSH terms; (2) BioSentVec: embeddings capturing the semantic relatedness between sentences, trained on PubMed abstracts and MIMIC-III clinical notes; (3) BioConceptVec: embeddings capturing the semantic relatedness between biological concepts (e.g., genes), trained on PubMed abstracts annotated with the PubTator tool. These embeddings achieve superior performance and can be used in various downstream bioinformatics tasks, such as protein-protein interaction prediction, drug-drug interaction classification, variant search, and drug repurposing.
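The core operation behind all three embeddings is measuring relatedness as cosine similarity between dense vectors. A minimal sketch, with made-up two-dimensional vectors standing in for real embeddings that have hundreds of dimensions:

```python
import math

# Cosine similarity: the standard relatedness measure for word,
# sentence, and concept embeddings. Vectors here are toy examples.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Two terms with similar usage (e.g., "cancer" and "tumor") end up with nearby vectors and a cosine close to 1, whereas unrelated terms score near 0.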
Short Abstract: There is a vast amount of data within biomedical texts concerning cardiovascular disease. The search term “heart failure” returns >250,000 PubMed entries, illustrating the challenging scale of this resource. We can better understand these texts with computational analysis, using knowledge graphs. We have extracted automatic disease annotations for 95,101 PubMed articles related to heart failure and represented the annotation relationships in the form of a knowledge graph. We then calculated similarities between diseases based on the frequency of their co-annotation. With these similarity relationships, we can generate a list of related articles for each disease, identify articles where certain disease annotations may be overlooked by existing methods, and normalize disease terms to MeSH IDs. The ability to identify documents related to a given cardiovascular disease provides more comprehensive search results than relying on keywords and MeSH terms alone. Furthermore, the ability to identify related diseases can be used to support cardiovascular diagnostics. Thus, this project addresses both search capabilities and the organization of relational knowledge regarding heart disease. By making the knowledge graph available as a public resource, it can serve as a tool for analyzing and drawing inferences from large numbers of biomedical texts.
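One simple way to instantiate co-annotation-based disease similarity is the Jaccard index over the sets of articles each disease is annotated in. The abstract does not specify the exact measure, so this is an illustrative assumption; the article IDs are invented.

```python
# Jaccard similarity between two diseases, each represented by the set
# of article IDs in which it was annotated. One possible realization of
# co-annotation-frequency similarity; the real measure may differ.
def jaccard(articles_a, articles_b):
    a, b = set(articles_a), set(articles_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Pairs of diseases that frequently appear in the same papers score closer to 1, which is what drives the related-article and related-disease suggestions described above.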
Short Abstract: The information extracted from biomedical articles by expert curators is vital to enable analysis of high-throughput experimental results. Curators rely upon text mining systems to prioritize and enhance curation. In turn, most text-mining systems rely on gold standard corpora for development and tuning. The statements made in an article can be judged by the experimental evidence underlying the findings, which influences reasoning over them. Thus, it is important to capture the evidence supporting scientific claims together with the claims. Available corpora do not contain annotations of evidence based on the Evidence and Conclusion Ontology (ECO), which provides an extensive set of evidence-related concepts. Here we present the ECO-CollecTF corpus, a novel, freely available corpus of 54 documents, each annotated by 3 curators, containing high-quality annotations of evidence-based statements. ECO-CollecTF (DOI: 10.5281/zenodo.3763724) is available in multiple formats (BRAT and BioC) under the CC BY-NC 4.0 license. The corpus provides raw, individual annotations, enabling users to recompute inter-annotator agreement (IAA) and to generate their own consensus or subset corpora. Individual sentence annotations are self-contained, and can therefore be reused for IAA research or for text mining applications.
Short Abstract: Biological databases provide precise, well-organized information for supporting biomedical research. Developing such databases typically relies on manual curation. However, the large and rapidly increasing publication rate makes it impractical for curators to quickly identify all and only those documents of interest. As such, automated biomedical document classification has attracted much attention.
Images convey significant information for supporting biomedical document classification. Accordingly, using image captions, which provide a simple and accessible summary of the actual image content, has the potential to enhance classifier performance.
We present a classification scheme incorporating features gathered from captions in addition to titles-and-abstracts. We trained and tested our classifier on a large imbalanced dataset, originally curated by the Gene Expression Database (GXD), consisting of ~60,000 documents, where each document is represented by the text of its title, abstract, and captions.
Our classifier attains precision of 0.698, recall of 0.784, F-measure of 0.738, and a Matthews correlation coefficient of 0.711, exceeding the performance of current classifiers employed by biological databases and showing that our framework effectively addresses the imbalanced classification task faced by GXD. Moreover, our classifier's performance is significantly improved by utilizing information from captions compared with using titles-and-abstracts alone, which demonstrates that captions provide substantial information for supporting biomedical document classification.
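The Matthews correlation coefficient reported above is well suited to imbalanced data because it accounts for all four confusion-matrix cells. A sketch of its computation, with illustrative counts:

```python
import math

# Matthews correlation coefficient from a binary confusion matrix.
# Robust on imbalanced datasets: +1 is perfect, 0 is chance, -1 is
# total disagreement. Counts passed in are illustrative.
def mcc(tp, fp, tn, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike accuracy, MCC cannot be inflated by simply predicting the majority class, which is why it is reported alongside precision/recall for this imbalanced GXD task.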
Short Abstract: Although individually uncommon, collectively, rare diseases affect 6-8% of the population. Identifying rare diseases through phenotypes is a difficult task because symptoms overlap. Genetic investigations could standardize their definition and may lead to disease-agnostic interventions that target all rare diseases. Currently, text mining techniques have lower accuracy and more coverage gaps than manual curation, but the potential for higher productivity. Recently, there has been some progress in developing text mining applications to tackle this enormous problem. However, there is still a need to assess these applications and integrate the findings with genotypes. We developed a manual curation workflow which incorporated searching for genetic variants on PubMed, listing symptoms and phenotypes associated with each variant, converting cDNA or protein notation to standardized genetic notation, obtaining annotations from public data sources, and presenting the details in an online interface. Our project's aims include summing pathogenic variant frequencies in populations to estimate a birth prevalence for each rare disease. We created test data sets of known accuracy and coverage whose genotype-phenotype associations can serve as use cases. Text mining could scale this manually curated project and mine the rare disease phenotype-genotype associations for the more than 7,000 rare disease genes.
Short Abstract: With the urgent need for treatments for COVID-19, scientific literature about potential therapeutics is ballooning. To help biomedical experts easily track the latest scientific publications, we developed an NLP pipeline to extract drug and vaccine information about SARS-CoV-2 and other viruses. This system had to be developed quickly with no prior curated data for training. Our approach was to extend rule-based NLP software (based on University of Arizona’s REACH system) that we had adapted for virus-related information, including adding dictionaries and rules for identifying viruses, drugs, types of vaccines, stages of research, and entry receptor molecules. Our pipeline includes scripts that assess whether a paper is about a given virus and therapeutic based on quantity and features of the extracted information. This pipeline was applied to ~63,000 PubMed, preprint, and clinicaltrials.gov abstracts and identified documents for over 2,000 candidate drug-virus pairs (as of Apr 27). The pipeline is now being applied regularly to new papers; the results will be publicly available online in a dashboard called the COVID-19 Therapeutic Information Browser. A curation platform allows scientists to quickly review the NLP results.
Approved for Public Release; Distribution Unlimited. Public Release Case Number 20-1156
©2020 The MITRE Corporation. ALL RIGHTS RESERVED.
Short Abstract: The key tasks for curation comprise finding papers to consider, identifying curatable papers, indexing named entities and other types of data included in each paper, and ultimately extracting information from those papers. RGD has automated literature search and named entity recognition with ontology-driven custom text mining software called OntoMate, which is tightly integrated with curation software to identify relevant curatable papers with concept-based queries, as an alternative to the PubMed search engine. OntoMate was built with a scalable and open architecture. Named entity recognition for biocuration was implemented with plugins provided by bioNLP frameworks. With the use of bioNLP tools, RGD has automated much of the curation workflow, substantially reducing curation effort.
Short Abstract: Institutes are required to catalog their articles with proper subject headings so that users can easily retrieve relevant articles from institutional repositories (IRs) such as university libraries. However, the number of articles in these repositories is proliferating, making it extremely challenging to manually catalog newly added articles at the same pace. To address this challenge, we explore the feasibility of automatically annotating articles with the Library of Congress Subject Headings (LCSH), a controlled vocabulary used for subject indexing of articles. We first use web scraping to extract keywords for a collection of articles from the Repository Analytics and Metrics Portal (RAMP), a web service providing access to a large collection of IRs. Then, we map these keywords to LCSH names to develop a gold-standard dataset. As a case study, using the subset of mapped Biology-related LCSH concepts, we develop predictive models by framing this task as a supervised multi-label classification problem. We demonstrate the potential of this approach through our preliminary results, which have implications for IR managers/curators as well as IR users.
Short Abstract: Due to the contribution of prescription medications, such as stimulants or opioids, to the broader drug abuse crisis, several studies have explored the use of social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a lack of clear descriptions of how abuse information is presented in public social media, as well as few thorough annotation guidelines that may serve as the groundwork for long-term future research or annotated datasets usable for automatic characterization of social media chatter associated with abuse-prone medications.
We employed an iterative annotation strategy to create an annotated corpus of 16,433 tweets, mentioning one of 20 abuse-prone medications, labeled as potential abuse or misuse, non-abuse consumption, drug mention only, and unrelated. We experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. Among the machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271).
We expect that our annotation strategy, guidelines, and dataset will provide a significant boost to community-driven, data-centric approaches for the task of monitoring prescription medication misuse and abuse on Twitter.
Short Abstract: Documents describing biomedical observations contain a great variety of topic-specific terminology, the intended meaning of which cannot be fully ascertained without incorporating specific domain knowledge. Automated systems and ontologies for categorizing biomedical concepts and relationships are generally restricted to pre-defined controlled vocabularies, supporting consistent term normalization but limiting categorization of novel or domain-specific concepts. This presents a challenge to development of systems for biomedical information extraction, as advances rely upon the existence of labeled text data. We have developed ACROBAT (Annotation for Case Reports using Open Biomedical Annotation Terms), a typing system for detailed information extraction from clinical text. ACROBAT supports detailed identification and categorization of entities, events, and relations within clinical text documents, including clinical case reports (CCRs) and the free-text components of electronic health records. Using ACROBAT and the text of 200 CCRs, we annotated a wide variety of real-world clinical disease presentations. The resulting dataset (MACCROBAT2020) is a rich collection of annotated clinical language appropriate for training biomedical natural language processing systems. Taken together, ACROBAT and its accompanying dataset support further manual or automated information extraction while remaining interoperable with other biomedical annotation schemas. The dataset is freely available on Figshare at doi.org/10.6084/m9.figshare.9764942.
Short Abstract: Analysis of social media exchanges has long been established as an efficient approach to the observation and study of rapidly spreading health-related concerns. An unexpected emerging situation such as a novel, highly infectious virus has a major societal impact, as most of the world is currently experiencing.
Through the analysis of social media, it is possible to monitor in real time the spread of concerns related to the novel situation, for example about the positive or negative effects of certain drugs, but also to monitor the diffusion of fake news.
Concretely, we have used a Twitter dataset provided by the Panacea Lab at Georgia State University and have implemented a system capable of visualizing different aspects of this data, in particular:
- basic distributional statistics (per language, hashtag, URL, etc.)
- sentiment analysis by hashtag
- mentions of paper preprints
- mentions of drugs
- topics (through LDA topic models)
The current version of the system can be inspected at:
Additional extensions are underway, in particular the detection of tweets generated by bots as opposed to humans, which will contribute to the detection of fake news and the quantification of their impact.
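The "sentiment analysis by hashtag" view above can be sketched as a simple aggregation: average a per-tweet sentiment score over every hashtag the tweet contains. This is a hypothetical sketch; the tweets, scores, and the sentiment model itself are not described in the abstract and are invented here.

```python
import re
from collections import defaultdict

# Hypothetical sentiment-by-hashtag aggregation: given (text, score)
# pairs, average the sentiment score over each hashtag mentioned.
# The per-tweet scores would come from an upstream sentiment model.
def sentiment_by_hashtag(tweets):
    totals, counts = defaultdict(float), defaultdict(int)
    for text, score in tweets:
        for tag in re.findall(r"#\w+", text.lower()):
            totals[tag] += score
            counts[tag] += 1
    return {tag: totals[tag] / counts[tag] for tag in totals}
```

Tracking these per-hashtag averages over time is one way such a dashboard can surface shifting concerns, e.g. around a specific drug.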
Short Abstract: A critical application of single-cell RNA sequencing is identifying the cellular composition of tissue samples by linking clusters of cells with similar expression profiles to distinct cell types. Current best approaches rely on scRNA-seq atlases or manually-curated cell-type markers, but these methods introduce study and selection bias into the annotation process. We have developed an unsupervised approach based on natural language processing (NLP) of millions of PubMed abstracts to address this challenge. Our approach first automatically associates potential marker genes for hundreds of cell types in the UBERON–Cell Ontology. We combine this information with gene expression profiles from an scRNA-seq dataset and build one regularized logistic regression classifier per cluster that distinguishes that cluster from all others based on genes that have high NLP-based association scores and high expression in that cluster. The beta coefficients of the underlying model help prioritize the correct cell-type label for that cluster. This approach results in correctly identifying cell-type labels among the top-ranked terms for clusters in several scRNA-seq benchmark datasets. This work provides a proof of principle that NLP can be used to create unbiased lists of genes activated in specific cell types that can be used to correctly annotate anonymous cell clusters.
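The prioritization idea above can be reduced to a simple scoring sketch: combine NLP-derived gene-to-cell-type association scores with the cluster's expression of those genes and rank candidate labels. This is a simplification of the per-cluster regularized logistic regression described in the abstract, and the genes, scores, and cell types below are invented for illustration.

```python
# Simplified cluster-labeling sketch: score each candidate cell type by
# the sum of (NLP association score x mean cluster expression) over its
# literature-associated genes, then rank. A stand-in for the per-cluster
# logistic regression; all values below are illustrative.
def rank_cell_types(cluster_expr, nlp_scores):
    """cluster_expr: gene -> mean expression in the cluster.
    nlp_scores: cell_type -> {gene: NLP association score}."""
    ranking = []
    for cell_type, genes in nlp_scores.items():
        score = sum(w * cluster_expr.get(g, 0.0) for g, w in genes.items())
        ranking.append((cell_type, score))
    return sorted(ranking, key=lambda x: -x[1])
```

A cluster highly expressing genes that the literature strongly ties to one cell type will rank that label first, mirroring how the model's beta coefficients prioritize the correct annotation.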
Short Abstract: Objective:
We explore the use of natural language processing techniques to predict a protein-metabolite interaction network. Predicting likely protein-metabolite interactions will lead to a more accurate cellular network.
We used BioBERT, a deep learning biomedical language model pre-trained on PubMed articles, to estimate the relatedness of proteins and metabolites. BioBERT converts textual phrases seen in its training set to numerical vectors, whose relatedness can be quantified by cosine similarity.
We focus on designing an evaluation dataset composed of related and unrelated enzyme-metabolite pairs to measure the effectiveness of the method. Two datasets were created: 1) using Ki enzymatic values for enzyme names and metabolites; 2) finding the relatedness of enzyme-metabolite pairs by reviewing the relevant literature.
We found that 61% of high-Ki pairs had a higher cosine similarity score than the corresponding low-Ki pairs. For the second dataset, 68% of the related enzyme-metabolite pairs had a higher similarity score than their corresponding unrelated pairs.
Medical language models can flag possible protein-metabolite interactions to be verified in further in silico and experimental analyses. BioBERT is not specialized in protein-related language, and further training of this model on relevant literature could drastically improve the results.
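The pairwise evaluation described above reduces to a simple statistic: the fraction of matched pairs in which the related enzyme-metabolite pair receives a higher cosine similarity than its unrelated counterpart. A sketch, with placeholder similarity values standing in for BioBERT cosine similarities:

```python
# Fraction of matched (related, unrelated) pairs in which the related
# enzyme-metabolite pair scores a higher similarity. The similarity
# values in the test below are placeholders, not BioBERT outputs.
def fraction_correctly_ordered(pairs):
    """pairs: iterable of (related_sim, unrelated_sim) tuples."""
    pairs = list(pairs)
    wins = sum(1 for rel, unrel in pairs if rel > unrel)
    return wins / len(pairs)
```

Computed this way, a value of 0.68 corresponds to the 68% result reported for the literature-derived dataset.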