RepoFromPaper: an approach to extract software code implementations from scientific publications

The increasing integration of complex software systems in scientific research has amplified the need for efficient methods to access and utilize these software implementations. To enhance reproducibility and transparency, researchers often include links to code repositories within their publications. However, the extraction of these repository links is challenging due to inconsistent citation practices, varied formatting, and the presence of multiple repository links within a single publication. This thesis introduces “RepoFromPaper,” an innovative approach designed to automate the extraction of repository links from scientific publications using advanced natural language processing (NLP) techniques.
RepoFromPaper systematically identifies and extracts the repository links proposed by authors, enhancing the discoverability and accessibility of research software. The methodology is a multi-step pipeline: data collection, PDF-to-text conversion, sentence extraction, and sentence classification using BERT models. RepoFromPaper is evaluated against a gold standard dataset, and its performance is benchmarked against existing approaches using Mean Reciprocal Rank (MRR), precision, recall, and F1 score.
This work also explores the integration of RepoFromPaper into the Research Software Extraction Framework (RSEF), enabling bidirectional repository link searches. The evaluation encompasses various scientific domains, highlighting citation practices and the tool’s applicability. The findings demonstrate the potential of automated solutions like RepoFromPaper in improving research workflows, promoting reproducibility, and facilitating the use of scientific software.
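
As an illustration of the Mean Reciprocal Rank metric named above, the following is a minimal sketch of the standard MRR definition, not the evaluation code used in this thesis. The function names, the example URLs, and the shape of the inputs (one ranked candidate list and one set of gold links per paper) are assumptions made for the example.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    gold: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all papers (standard MRR definition)."""
    if len(rankings) != len(gold):
        raise ValueError("rankings and gold must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold))
    return total / len(rankings)


# Hypothetical data: ranked candidate links for three papers and their gold links.
example_rankings = [
    ["https://example.org/repo-a", "https://example.org/repo-b"],  # gold at rank 1
    ["https://example.org/x", "https://example.org/y", "https://example.org/z"],  # gold at rank 3
    ["https://example.org/p"],  # gold link not retrieved
]
example_gold = [
    {"https://example.org/repo-a"},
    {"https://example.org/z"},
    {"https://example.org/q"},
]

example_mrr = mean_reciprocal_rank(example_rankings, example_gold)
```

In this sketch a paper whose gold link is never retrieved contributes 0 to the average, so the example evaluates to (1 + 1/3 + 0) / 3.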