Applying LLMs and Semantic Technologies for Data Extraction in Literature Reviews
A Pilot Study in LIS
DOI: https://doi.org/10.29173/cais2020

Keywords: LLMs, semantic technology, information extraction, knowledge synthesis

Abstract
This pilot study evaluates the capabilities of two LLMs, Mistral Small 3.1 and GPT-4o mini, in performing ontology-based data extraction to support literature reviews in library and information science (LIS). A sample of four published systematic reviews was selected as ground-truth data. The open-access publications included in these reviews (n = 47) were collected as inputs for the models to perform semantic information extraction, using classes from the Document Components Ontology (DoCO). These preliminary findings highlight the opportunities and challenges of using AI and semantic technologies to streamline literature reviews in the social sciences.
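The workflow the abstract describes can be sketched in miniature: assemble a prompt that asks a model to label an article's components with ontology classes, then validate the reply against the known class list. This is a minimal illustration only; the class subset, prompt wording, and JSON reply format are assumptions, not the study's actual protocol, and the model call is mocked rather than made.

```python
import json

# Illustrative subset of document-component class labels in the spirit of
# DoCO; the study's actual class list is not reproduced here.
DOCO_CLASSES = ["Abstract", "Introduction", "Methods", "Results", "Discussion"]

def build_extraction_prompt(article_text, classes=DOCO_CLASSES):
    """Assemble a prompt asking an LLM to label passages with ontology classes."""
    schema = ", ".join(classes)
    return (
        "Extract the following document components from the article below, "
        f"returning a JSON object with one key per class: {schema}.\n\n"
        f"Article:\n{article_text}"
    )

def parse_extraction(raw_response, classes=DOCO_CLASSES):
    """Parse the model's JSON reply, keeping only known ontology classes."""
    data = json.loads(raw_response)
    return {k: v for k, v in data.items() if k in classes}

# Example with a mocked model reply (no API call is made):
mock_reply = '{"Abstract": "This pilot study ...", "Methods": "Four reviews ...", "Venue": "x"}'
extracted = parse_extraction(mock_reply)
print(sorted(extracted))  # → ['Abstract', 'Methods']  (unknown key "Venue" is dropped)
```

Filtering the reply against the ontology's class list is one simple way to keep model output comparable with the ground-truth reviews, since any hallucinated keys are discarded before evaluation.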
License
Copyright (c) 2025 Camille Demers

This work is licensed under a Creative Commons Attribution 4.0 International License.