Antimicrobial outbreak publications outline the key factors involved in an uncontrolled spread of infection. Such factors include the environments, pathogens, hosts, and antimicrobial resistance genes involved. Individually, each paper published in this area gives a glimpse into the devastating impact drug resistance has on healthcare, agriculture, and livestock. When examined together, these papers reveal a story across time, from the discovery of new resistance genes to their dissemination to different pathogens, hosts, and environments.
My work aims to extract this information from publications by using the biomedical deep-learning language model, BioBERT. BioBERT is pre-trained on all abstracts found in PubMed and has state-of-the-art performance with language tasks using biomedical literature. I trained BioBERT on two tasks: entity recognition to identify AMR-relevant terms (i.e., AMR genes, taxonomy, environments, geographical locations, etc.); and relation extraction to determine which terms identified through entity recognition contextualize AMR genes. Datasets were generated semi-automatically to train BioBERT for these tasks. My work currently collates results from 204,094 antimicrobial publications worldwide and generates interpretable results about the sources where genes are commonly found. Overall, my work takes a large-scale approach to collect antimicrobial resistance data from a commonly overlooked resource.