Safro Research Group @ University of Delaware: literature based discovery


Thursday, January 4, 2024

Benchmarking Biomedical Literature-based Discovery and Hypothesis Generation Systems

Update: The paper has been accepted for publication in BMC Bioinformatics.

We introduce Dyport, a benchmarking framework for evaluating biomedical hypothesis generation (HG) and literature-based discovery (LBD) systems. Evaluation remains one of the major open problems for these systems, especially for fully automated, large-scale, general-purpose ones. For such systems, the kind of massive assessment that is standard in machine learning and general AI is often infeasible. One traditional evaluation approach is to make a system "rediscover" some of the landmark findings. However, this approach does not scale. Another traditional approach is to automatically extract information from biomedical texts, train the system on historical data, and test it on the held-out "future" information. While this approach does scale well, the reliability and biomedical importance of the extracted test set are far from clear.

By using curated datasets, Dyport tests HG/LBD systems under realistic conditions, which makes the evaluations more relevant. We integrate knowledge from curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. The benchmark therefore assesses not only the accuracy of hypotheses but also their potential impact on biomedical research, which significantly extends traditional link prediction benchmarks. We demonstrate the applicability of the benchmarking process on several link prediction systems applied to biomedical semantic knowledge graphs. Because it is flexible, the benchmarking system is designed for broad use in verifying hypothesis generation quality, with the aim of expanding the scope of scientific discovery within the biomedical research community.
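To illustrate the difference between scoring accuracy alone and also accounting for discovery importance, here is a minimal sketch. It is not Dyport's actual metric: the importance weights, the term pairs, and the function names `recall` and `importance_weighted_recall` are all hypothetical and chosen only for illustration.

```python
# Toy comparison of plain recall vs. importance-weighted recall.
# Assumption: each ground-truth connection carries a made-up importance
# weight; Dyport's real importance measure is defined in the paper.

def recall(predicted, truth):
    """Fraction of ground-truth connections that were predicted."""
    hits = predicted & set(truth)
    return len(hits) / len(truth)

def importance_weighted_recall(predicted, truth):
    """Share of the total importance mass covered by the predictions."""
    total = sum(truth.values())
    covered = sum(w for pair, w in truth.items() if pair in predicted)
    return covered / total

# Hypothetical ground truth: connection -> importance weight.
truth = {
    ("term_a", "term_b"): 0.9,
    ("term_c", "term_d"): 0.1,
    ("term_e", "term_f"): 0.5,
}
# Hypothetical system output: two correct hits and one miss.
predicted = {("term_c", "term_d"), ("term_e", "term_f"), ("term_x", "term_y")}

plain = recall(predicted, truth)
weighted = importance_weighted_recall(predicted, truth)
print(f"recall={plain:.3f} weighted={weighted:.3f}")
```

In this toy setup the system finds two of three true connections but misses the most important one, so the weighted score is lower than the plain recall.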

Dyport is available at https://github.com/IlyaTyagin/Dyport

Paper: https://arxiv.org/pdf/2312.03303.pdf

Wednesday, December 29, 2021

Accelerating COVID-19 scientific discovery with hypothesis generation system

In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," in which artificial intelligence experts were asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections.

Ilya Tyagin, Ankit Kulshrestha, Justin Sybrandt, Krish Matta, Michael Shtutman, Ilya Safro "Accelerating COVID-19 research with graph mining and transformer-based learning", Innovative Applications of Artificial Intelligence (AAAI/IAAI), preprint at https://www.biorxiv.org/content/10.1101/2021.02.11.430789v1, 2022

We present two automated general-purpose hypothesis generation systems for COVID-19 research, AGATHA-C and AGATHA-GP. The systems are based on graph mining and the transformer model. They are validated at scale using retrospective information rediscovery and proactive analysis with human-in-the-loop expert review. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, through a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.
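For readers unfamiliar with the metric, ROC AUC is a value between 0 and 1 that can be read as the probability that a randomly chosen true connection is scored above a randomly chosen false one. The sketch below computes it directly from that definition; the labels and scores are made-up toy values, not results from AGATHA-C or AGATHA-GP.

```python
# Rank-based ROC AUC computed from its pairwise definition.
# Toy data only; quadratic time, which is fine for illustration.

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# 1 = connection later confirmed, 0 = not confirmed (hypothetical).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.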

Wednesday, September 1, 2021

NIH funding for literature based discovery

NIH awarded a $2.11M grant to the collaborative project of the University of South Carolina (PI Shtutman) and the University of Delaware (PI Safro), "Knowledge discovery and machine learning to elucidate the mechanisms of HIV activity and interaction with substance use disorder." This work will leverage our hypothesis generation model AGATHA, which is based on information extracted from the full MEDLINE.

https://github.com/IlyaTyagin/AGATHA-C-GP

Here is its recent customization for COVID-19, in which MEDLINE is fused with CORD-19, the dataset of COVID-19-related papers.

https://arxiv.org/abs/2102.07631

Thursday, February 11, 2021

Literature-based knowledge discovery to accelerate COVID-19 research

Our new paper on the customization of the AGATHA knowledge discovery model for COVID-19 is out!

Ilya Tyagin, Ankit Kulshrestha, Justin Sybrandt, Krish Matta, Michael Shtutman, Ilya Safro

"Accelerating COVID-19 research with graph mining and transformer-based learning", 2021

https://www.biorxiv.org/content/10.1101/2021.02.11.430789v1

In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," in which artificial intelligence experts were asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present two automated general-purpose hypothesis generation systems for COVID-19 research, AGATHA-C and AGATHA-GP. The systems are based on graph mining and the transformer model. They are validated at scale using retrospective information rediscovery and proactive analysis with human-in-the-loop expert review. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, through a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.

Friday, August 21, 2020

Generating biomedical scientific hypotheses with AGATHA

Paper accepted at the 29th ACM International Conference on Information and Knowledge Management (CIKM)

Sybrandt, Tyagin, Shtutman, Safro "AGATHA: Automatic Graph-mining and Transformer based Hypothesis Generation Approach", preprint at http://arxiv.org/pdf/2002.05635.pdf

Medical research is risky and expensive. Drug discovery requires researchers to efficiently winnow thousands of potential targets to a small candidate set. However, scientists spend significant time and money long before seeing the intermediate results that ultimately determine this smaller set. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that learns a data-driven ranking criterion to recommend new biomedical connections. We validate our system at scale with a temporal holdout, in which we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical sub-domains and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. Furthermore, we perform an ablation study to examine the aspects of our semantic network that contribute most to recommendation quality. Overall, AGATHA achieves best-in-class recommendation quality when compared to other hypothesis generation systems built to predict across all available biomedical literature. Reproducibility: All code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha.
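The temporal holdout described above can be sketched as follows. This is a minimal illustration rather than AGATHA's actual pipeline: the `Connection` record, the `temporal_holdout` function, the placeholder term names, and the choice to treat pairs as undirected are all assumptions made for the example.

```python
from typing import NamedTuple

class Connection(NamedTuple):
    """A hypothetical record: two terms and the year a paper linked them."""
    source: str
    target: str
    year: int

def temporal_holdout(connections, cutoff_year):
    """Split connections into training data and held-out 'future' pairs.

    Training data: everything published up to and including cutoff_year.
    Test data: pairs that appear only after the cutoff, i.e. pairs that
    are genuinely new relative to the training period.
    Assumption: pairs are undirected, so (a, b) equals (b, a).
    """
    train = [c for c in connections if c.year <= cutoff_year]
    known = {frozenset((c.source, c.target)) for c in train}
    test = [
        c for c in connections
        if c.year > cutoff_year
        and frozenset((c.source, c.target)) not in known
    ]
    return train, test

# Toy data with placeholder term names.
connections = [
    Connection("term_a", "term_b", 2010),
    Connection("term_b", "term_c", 2014),
    Connection("term_a", "term_c", 2017),  # new after the cutoff
    Connection("term_b", "term_a", 2018),  # already known before the cutoff
]

train, test = temporal_holdout(connections, cutoff_year=2015)
print(len(train), test)
```

A system under evaluation would be trained on `train` only and then asked to rank the pairs in `test` above pairs that never appear in the literature.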

Tuesday, April 21, 2020

NSF grant to tackle COVID-19

Our team received an NSF grant to tackle COVID-19 using our AI hypothesis generation system AGATHA.
Clemson news coverage: Artificial intelligence could aid in fight against COVID-19

Ilya Safro of @socclemson said that his team will soon roll out a new artificial intelligence system aimed at helping researchers explore the scientific literature as they strive for new discoveries to combat the novel coronavirus. https://t.co/TwQyZqEUra
Clemson Engineering, Computing & Applied Sciences (@ClemsonCECAS), April 21, 2020

Wednesday, February 19, 2020

New papers on biomedical NLP and hypothesis generation

New papers on biomedical NLP+hypothesis generation.

Sybrandt, Safro "CBAG: Conditional Biomedical Abstract Generation", http://arxiv.org/pdf/2002.05637.pdf

Sybrandt, Tyagin, Shtutman, Safro "AGATHA: Automatic Graph-mining and Transformer based Hypothesis Generation Approach", http://arxiv.org/pdf/2002.05635.pdf

Friday, March 29, 2019

Future of medicine - man or machine

Some thoughts about literature based discovery and our automated biomedical hypothesis generation tool MOLIERE for Clemson World Magazine

Saturday, November 24, 2018

Two papers accepted at IEEE Big Data 2018

Sybrandt, Carrabba, Herzog, Safro "Are Abstracts Enough for Hypothesis Generation?", arXiv:1804.05942

Sybrandt, Shtutman, Safro "Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking", arXiv:1802.03793
