Applying LLMs and Semantic Technologies for Data Extraction in Literature Reviews: A Pilot Study in LIS

Camille Demers, Université de Montréal

DOI: https://doi.org/10.29173/cais2020

Keywords:

LLMs, semantic technology, information extraction, knowledge synthesis

Abstract

This pilot study evaluates the capabilities of two LLMs, Mistral Small 3.1 and GPT-4o mini, in performing ontology-based data extraction to support literature reviews in library and information science (LIS). A sample of four published systematic reviews was selected as ground truth data. The open-access publications included in these reviews (n = 47) were collected as inputs for the models to perform semantic information extraction, using classes from the Document Components Ontology (DoCO). These preliminary findings highlight the opportunities and challenges of using AI and semantic technologies to streamline literature reviews in the social sciences.
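As an illustration only, the ontology-based extraction described in the abstract could be set up by mapping a few DoCO classes to fields the model is asked to fill, then checking the model's JSON reply against that mapping. The sketch below is a minimal Python example under stated assumptions: the selection of classes, the prompt wording, and the helper names (`build_prompt`, `validate_reply`) are hypothetical and are not taken from the study, and no model API is called.

```python
import json

# DoCO namespace IRI.
DOCO = "http://purl.org/spar/doco/"

# Hypothetical selection of DoCO classes; the classes actually used
# in the study are not listed in the abstract.
DOCO_CLASSES = {
    "Title": DOCO + "Title",
    "Abstract": DOCO + "Abstract",
    "Section": DOCO + "Section",
}


def build_prompt(document_text: str) -> str:
    """Build an extraction prompt (wording is an assumption, not the study's)."""
    labels = ", ".join(DOCO_CLASSES)
    return (
        "Extract the following document components and reply with a JSON "
        f"object whose keys are exactly: {labels}. "
        "Use null for components that are missing.\n\n"
        f"{document_text}"
    )


def validate_reply(reply: str) -> dict:
    """Parse a model reply and key the extracted values by DoCO class IRI."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - set(DOCO_CLASSES)
    if unknown:
        raise ValueError(f"labels not in the class mapping: {sorted(unknown)}")
    # Drop null values so only components the model actually found remain.
    return {DOCO_CLASSES[k]: v for k, v in data.items() if v is not None}
```

Rejecting labels outside the mapping keeps the output aligned with the ontology, which makes it easier to compare extractions against ground truth data afterwards.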


References

Affengruber, L., Maten, M. M. van der, Spiero, I., Nussbaumer-Streit, B., Mahmić-Kaknjo, M., Ellen, M. E., Goossen, K., Kantorova, L., Hooft, L., Riva, N., Poulentzas, G., Lalagkas, P. N., Silva, A. G., Sassano, M., Sfetcu, R., Marqués, M. E., Friessova, T., Baladia, E., Pezzullo, A. M., … Spijker, R. (2024, July 12). An exploration of available methods and tools to improve the efficiency of systematic review production - a scoping review. https://doi.org/10.21203/rs.3.rs-4595777/v1

Ali, A. & Gravino, C. (2018, December). An Ontology-Based Approach to Semi-Automate Systematic Literature Reviews. 2018 12th International Conference on Open Source Systems and Technologies (ICOSST) (pp. 9–16). https://doi.org/10.1109/ICOSST.2018.8632205

Augenstein, I., Das, M., Riedel, S., Vikraman, L. & McCallum, A. (2017, August). SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications. In S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. Cer & D. Jurgens (Eds.), SemEval 2017, Vancouver, Canada (pp. 546–555). https://doi.org/10.18653/v1/S17-2091

Baeza-Yates, R., Ribeiro-Neto, B., et al. (1999). Modern information retrieval (Vol. 463). ACM Press, New York.

Beltagy, I., Lo, K. & Cohan, A. (2019, November). SciBERT: A Pretrained Language Model for Scientific Text. In K. Inui, J. Jiang, V. Ng & X. Wan (Eds.), EMNLP-IJCNLP 2019, Hong Kong, China (pp. 3615–3620). https://doi.org/10.18653/v1/D19-1371

Bolanos, F., Salatino, A., Osborne, F. & Motta, E. (2024, February 13). Artificial Intelligence for Literature Reviews: Opportunities and Challenges. arXiv. https://doi.org/10.48550/arXiv.2402.08565

Constantin, A., Peroni, S., Pettifer, S., Shotton, D. & Vitali, F. (2016). The Document Components Ontology (DoCO). Semantic Web, 7(2), 167–181. https://doi.org/10.3233/SW-150177

Dagdelen, J., Dunn, A., Lee, S., Walker, N., Rosen, A. S., Ceder, G., Persson, K. A. & Jain, A. (2024). Structured information extraction from scientific text with large language models. Nature Communications, 15(1), 1418. https://doi.org/10.1038/s41467-024-45563-x

D’Arcus, B. & Giasson, F. (2016). Bibliographic Ontology (BIBO) in RDF. https://www.dublincore.org/specifications/bibo/bibo/bibo.rdf.xml

Datta, P., Datta, S. & Roy, D. (2025). RAGing Against the Literature: LLM-Powered Dataset Mention Extraction. New York, NY, USA. https://doi.org/10.1145/3677389.3702523

Färber, M., Lamprecht, D., Krause, J., Aung, L. & Haase, P. (2023). SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples. In T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos, L. Hollink, Z. Kaoudi, G. Cheng & J. Li (Eds.), Cham (pp. 94–112). https://doi.org/10.1007/978-3-031-47243-5_6

Foppiano, L., Lambard, G., Amagasa, T. & Ishii, M. (2024, December 31). Mining experimental data from materials science literature with large language models: an evaluation study. Science and Technology of Advanced Materials: Methods. Taylor & Francis. https://doi.org/10.1080/27660400.2024.2356506

Gartlehner, G., Kahwati, L., Hilscher, R., Thomas, I., Kugley, S., Crotty, K., Viswanathan, M., Nussbaumer-Streit, B., Booth, G., Erskine, N., et al. (2024). Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Research Synthesis Methods.

Hadi, M. U., Tashi, Q. A., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J. & Mirjalili, S. (2023). A Survey on Large Language Models: Applications, Challenges, Limitations, and Practical Usage. https://www.authorea.com/doi/full/10.36227/techrxiv.23589741.v1?commit=b1cb46f5b0f749cf5f2f33806f7c124904c14967

Hong, Z., Ward, L., Chard, K., Blaiszik, B. & Foster, I. (2021). Challenges and Advances in Information Extraction from Scientific Literature: a Review. JOM, 73(11), 3383–3400. https://doi.org/10.1007/s11837-021-04902-9

Jaradeh, M. Y., Oelen, A., Farfar, K. E., Prinz, M., D’Souza, J., Kismihók, G., Stocker, M. & Auer, S. (2019, September 23). Open Research Knowledge Graph: Next Generation Infrastructure for Semantic Scholarly Knowledge. New York, NY, USA (pp. 243–246). https://doi.org/10.1145/3360901.3364435

Khan, M. A., Ayub, U., Naqvi, S. A. A., Khakwani, K. Z. R., Sipra, Z. B. R., Raina, A., Zou, S., He, H., Hossein, S. A., Hasan, B., Rumble, R. B., Bitterman, D. S., Warner, J. L., Zou, J., Tevaarwerk, A. J., Leventakos, K., Kehl, K. L., Palmer, J. M., Murad, M. H., … Riaz, I. B. (2024). Collaborative Large Language Models for Automated Data Extraction in Living Systematic Reviews. medRxiv: The Preprint Server for Health Sciences, 2024.09.20.24314108. https://doi.org/10.1101/2024.09.20.24314108

Legate, A., Nimon, K. & Noblin, A. (2024, June 20). (Semi)automated approaches to data extraction for systematic reviews and meta-analyses in social sciences: A living review. F1000Research. https://doi.org/10.12688/f1000research.151493.1

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries (pp. 74–81).

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. (2019, July 26). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv. https://doi.org/10.48550/arXiv.1907.11692

Lopez, P. (2009). GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. In M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou & G. Tsakonas (Eds.), Berlin, Heidelberg (pp. 473–474). https://doi.org/10.1007/978-3-642-04346-8_62

Mitchell, A. & Mavergames, C. (2019). Using linked data for evidence synthesis. Systematic Searching: Practical ideas for improving results, 171.

Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G. & PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Annals of Internal Medicine, 151(4), 264–269. https://doi.org/10.7326/0003-4819-151-4-200908180-00135

Ng, J.-P. & Abrecht, V. (2015, September). Better Summarization Evaluation with Word Embeddings for ROUGE. In L. Màrquez, C. Callison-Burch & J. Su (Eds.), EMNLP 2015, Lisbon, Portugal (pp. 1925–1930). https://doi.org/10.18653/v1/D15-1222

Oelen, A., Jaradeh, M. Y., Stocker, M. & Auer, S. (2020, August). Generate FAIR Literature Surveys with Scholarly Knowledge Graphs. JCDL ’20: The ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event China (pp. 97–106). https://doi.org/10.1145/3383583.3398520

Oelen, A., Stocker, M. & Auer, S. (2021, April 14). Crowdsourcing Scholarly Discourse Annotations. IUI ’21: 26th International Conference on Intelligent User Interfaces, College Station TX USA (pp. 464–474). https://doi.org/10.1145/3397481.3450685

Peroni, S. (2014). The semantic publishing and referencing ontologies. Semantic Web Technologies and Legal Scholarly Publishing, 121–193.

Peroni, S. & Shotton, D. (2012). FaBiO and CiTO: Ontologies for describing bibliographic resources and citations. Journal of Web Semantics, 17, 33–43.

Peroni, S. & Shotton, D. (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1), 428–444.

Reimers, N. & Gurevych, I. (2019, August 27). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv. http://arxiv.org/abs/1908.10084

Sahlab, N., Kahoul, H., Jazdi, N. & Weyrich, M. (2022). A Knowledge Graph-Based Method for Automating Systematic Literature Reviews. Procedia Computer Science, 207, 2814–2822. https://doi.org/10.1016/j.procs.2022.09.339

Schmidt, L., Hair, K., Graziozi, S., Campbell, F., Kapp, C., Khanteymoori, A., Craig, D., Engelbert, M. & Thomas, J. (2024, May 23). Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study. arXiv. https://doi.org/10.48550/arXiv.2405.14445

van Dinter, R., Tekinerdogan, B. & Catal, C. (2021). Automation of Systematic Literature Reviews: A Systematic Literature Review. Information and Software Technology, 136, 106589. https://doi.org/10.1016/j.infsof.2021.106589

Wagner, G., Lukyanenko, R. & Paré, G. (2022). Artificial Intelligence and the Conduct of Literature Reviews. Journal of Information Technology, 37(2), 209–226. https://doi.org/10.1177/02683962211048201

Wang, Y., Zhang, C. & Li, K. (2022). A review on method entities in the academic literature: extraction, evaluation, and application. Scientometrics, 127(5), 2479–2520. https://doi.org/10.1007/s11192-022-04332-7
