Safro Research Group @ University of Delaware: NLP

Thursday, January 4, 2024

Benchmarking Biomedical Literature-based Discovery and Hypothesis Generation Systems

Update: The paper has been accepted for publication in BMC Bioinformatics.

We introduce Dyport, a benchmarking framework for evaluating biomedical hypothesis generation (HG) and literature-based discovery (LBD) systems. Evaluation remains one of the major open problems for HG and LBD, especially for fully automated, large-scale, general-purpose systems. For these, the kind of massive assessment that is standard in machine learning and general AI is often infeasible. One traditional evaluation approach is to make a system “rediscover” some of the landmark findings. However, this approach does not scale. Another traditional approach is to automatically extract some information from biomedical texts, train the system on historical data, and test it on that "future" information. While this approach does scale well, the reliability and biomedical importance of the extracted test set are far from guaranteed.
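The second evaluation approach above, training on historical data and testing on "future" connections, can be sketched in a few lines. This is only an illustration under stated assumptions, not Dyport's implementation: the `Edge` record, the cutoff year, the toy concept labels, and the recall-at-k metric are all hypothetical choices made for this example.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set, Tuple


class Edge(NamedTuple):
    """A co-occurrence between two concepts (hypothetical record layout)."""
    source: str
    target: str
    year: int  # year in which the connection appears in the literature


Pair = FrozenSet[str]


def temporal_split(edges: Iterable[Edge], cutoff_year: int) -> Tuple[Set[Pair], Set[Pair]]:
    """Return (training pairs, future pairs) for a given cutoff year.

    Future pairs are those seen only after the cutoff, so a system trained
    on the historical pairs cannot have observed them.
    """
    edges = list(edges)
    train = {frozenset((e.source, e.target)) for e in edges if e.year <= cutoff_year}
    later = {frozenset((e.source, e.target)) for e in edges if e.year > cutoff_year}
    return train, later - train


def recall_at_k(ranked_pairs: List[Tuple[str, str]], future: Set[Pair], k: int) -> float:
    """Fraction of future pairs that appear among the top-k ranked predictions."""
    if not future:
        return 0.0
    hits = {frozenset(p) for p in ranked_pairs[:k]} & future
    return len(hits) / len(future)


# Toy data with placeholder concept names (not real biomedical findings).
edges = [
    Edge("A", "B", 2010),
    Edge("B", "C", 2012),
    Edge("A", "C", 2018),
    Edge("B", "A", 2019),  # already known before the cutoff, so not "new"
    Edge("C", "D", 2020),
]
train, future = temporal_split(edges, cutoff_year=2015)

# Pretend output of some HG/LBD system: candidate pairs, best first.
ranked = [("A", "C"), ("A", "D"), ("C", "D")]
print(recall_at_k(ranked, future, k=2))  # 0.5
```

A split like this scales easily, but it says nothing about whether the "future" pairs are reliable or biomedically important, which is the gap the post describes.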

Dyport uses curated datasets to test HG/LBD systems under realistic conditions, which makes the evaluations more relevant. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. As a result, the benchmark assesses not only the accuracy of hypotheses but also their potential impact on biomedical research, which significantly extends traditional link prediction benchmarks. We demonstrate the applicability of our benchmarking process on several link prediction systems applied to biomedical semantic knowledge graphs. Our benchmarking system is flexible and designed for broad application in verifying hypothesis generation quality, with the aim of expanding the scope of scientific discovery within the biomedical research community.

Dyport is available at https://github.com/IlyaTyagin/Dyport

Paper: https://arxiv.org/pdf/2312.03303.pdf

Wednesday, December 29, 2021

Accelerating COVID-19 scientific discovery with hypothesis generation system

In 2020, the White House released the “Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset,” in which artificial intelligence experts were asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections.

Ilya Tyagin, Ankit Kulshrestha, Justin Sybrandt, Krish Matta, Michael Shtutman, Ilya Safro "Accelerating COVID-19 research with graph mining and transformer-based learning", Innovative Applications of Artificial Intelligence (AAAI/IAAI), preprint at https://www.biorxiv.org/content/10.1101/2021.02.11.430789v1, 2022

We present two automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. The systems are based on graph mining and the transformer model. They are extensively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop experts. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) with fast computation times and are released to the broad scientific community to accelerate biomedical research. In addition, through a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.
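For readers unfamiliar with the metric, ROC AUC is a value between 0 and 1: the probability that a randomly chosen true connection receives a higher score than a randomly chosen false one. The sketch below computes it directly from that definition. It is a generic illustration, not the evaluation code used for AGATHA-C or AGATHA-GP, and the labels and scores are made-up toy values.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive is scored
    higher, 0.5 on a tie, and 0 otherwise. Quadratic time, which is fine
    for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = connection later confirmed, 0 = not confirmed.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.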

Wednesday, September 1, 2021

NIH funding for literature based discovery

NIH awarded a $2.11M grant to the University of South Carolina (PI Shtutman) and University of Delaware (PI Safro) collaborative project "Knowledge discovery and machine learning to elucidate the mechanisms of HIV activity and interaction with substance use disorder." This work will leverage our hypothesis generation model AGATHA, which is based on information extracted from the full MEDLINE database.

https://github.com/IlyaTyagin/AGATHA-C-GP

Here is its recent customization for COVID-19, in which MEDLINE is fused with CORD-19, the open dataset of COVID-19-related papers.

https://arxiv.org/abs/2102.07631

Thursday, February 11, 2021

Literature-based knowledge discovery to accelerate COVID-19 research

Our new paper on customizing the AGATHA knowledge discovery model for COVID-19 is out!

Ilya Tyagin, Ankit Kulshrestha, Justin Sybrandt, Krish Matta, Michael Shtutman, Ilya Safro

"Accelerating COVID-19 research with graph mining and transformer-based learning", 2021

https://www.biorxiv.org/content/10.1101/2021.02.11.430789v1

In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," in which artificial intelligence experts were asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present two automated general-purpose hypothesis generation systems, AGATHA-C and AGATHA-GP, for COVID-19 research. The systems are based on graph mining and the transformer model. They are extensively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop experts. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) with fast computation times and are released to the broad scientific community to accelerate biomedical research. In addition, through a study curated by domain experts, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the hormone oxytocin.

Saturday, December 12, 2020

Do we always need big data in text mining?

Do we always need Big Data in text mining? Can we filter it? Check out our new paper "Accelerating Text Mining Using Domain-Specific Stop Word Lists," accepted at IWBDR: https://arxiv.org/pdf/2012.02294.pdf

Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms or are not informative enough is a crucial storage-saving technique in text indexing and results in improved computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many common words differ from one domain to another yet carry no significance within their particular domain. Eliminating domain-specific common words in a corpus reduces the dimensionality of the feature space and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach. This new approach depends on the notion of a low-dimensional representation of a word in vector space and its distance from a hyperplane. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. We compare the hyperplane-based approach with other feature selection methods, namely χ² (chi-squared) and mutual information. An experimental study is performed on three different datasets and five classification algorithms, measuring the dimensionality reduction and the increase in classification performance. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information. The computational time to identify the domain-specific words is significantly lower than that of mutual information.
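The geometric ingredient mentioned in the abstract, the distance of a word vector from a hyperplane, can be illustrated as follows. This is a minimal sketch and not the paper's algorithm: the two-dimensional embeddings, the hyperplane (`normal`, `offset`), the threshold, and the choice to select words close to the hyperplane are all assumptions made up for this example. How the paper builds the hyperplane and which side of a threshold marks a domain-specific stop word should be taken from the paper itself.

```python
import math
from typing import Dict, List, Sequence


def distance_to_hyperplane(vector: Sequence[float], normal: Sequence[float], offset: float) -> float:
    """Euclidean distance from `vector` to the hyperplane normal . x + offset = 0."""
    norm = math.sqrt(sum(w * w for w in normal))
    if norm == 0.0:
        raise ValueError("the hyperplane normal must be non-zero")
    return abs(sum(w * x for w, x in zip(normal, vector)) + offset) / norm


def words_within(embeddings: Dict[str, Sequence[float]], normal: Sequence[float],
                 offset: float, max_distance: float) -> List[str]:
    """Words whose vectors lie within `max_distance` of the hyperplane (sorted)."""
    return sorted(
        word for word, vec in embeddings.items()
        if distance_to_hyperplane(vec, normal, offset) <= max_distance
    )


# Hypothetical 2-D word embeddings; real ones would be learned from a corpus.
embeddings = {
    "patient": (0.1, 0.1),
    "study": (0.2, -0.1),
    "oxytocin": (2.0, 1.5),
    "kinase": (-1.8, -2.2),
}
normal, offset = (1.0, 1.0), 0.0  # assumed hyperplane: x + y = 0

candidates = words_within(embeddings, normal, offset, max_distance=0.5)
print(candidates)  # ['patient', 'study']
```

Computing one dot product per word is cheap, which is consistent with the abstract's point that identifying the words takes far less time than mutual information.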
