Evidence Summary

Academic Libraries can Expand Institutional Repository Holdings with Gold Open Access Publications Collected Through Web Scraping

A Review of:

Clark, B. (2023). Proactive institutional repository collection development techniques: Archiving gold open access articles and metadata retrieved with web scraping. Journal of Library Administration, 63(6), 743–765. https://doi.org/10.1080/01930826.2023.2240190

Reviewed by:

Kristy Hancock

Evidence Synthesis Coordinator

Maritime SPOR SUPPORT Unit

Halifax, Nova Scotia, Canada

Email: Kristy.Hancock@nshealth.ca

Received: 21 Sept. 2024 Accepted: 29 Oct. 2024

© 2024 Hancock. This is an Open Access article distributed under the terms of the Creative Commons‐Attribution‐Noncommercial‐Share Alike License 4.0 International (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly attributed, not used for commercial purposes, and, if transformed, the resulting work is redistributed under the same or similar license to this one.

DOI: 10.18438/eblip30614

Abstract

Objective – To describe a method for collecting gold open access publications from the web and packaging them for batch deposit in an institutional repository. The goal of the project was to expand institutional repository holdings and make the collection more comprehensive by adding gold open access content.

Design – Web scraping and analysis of institutional repository usage metrics.

Setting – A library at a public doctoral university with very high research activity in Alabama, United States.

Subjects – Articles and metadata from the Multidisciplinary Digital Publishing Institute (MDPI) website and the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) repository. MDPI is an open access publisher of over 400 journals spanning all disciplines. All articles published in MDPI journals are made freely and immediately accessible on the MDPI website. SCOAP3 is a global partnership of libraries, funding agencies, and research centers that support open access publishing in the field of high-energy physics. The SCOAP3 repository contains research funded by the organization and published in open access journals.

Methods – The MDPI website and SCOAP3 repository were selected because they contained a substantial amount of scholarship by University of Alabama affiliates. On the MDPI website, an author affiliation search across all journals retrieved University of Alabama publications. The Python library Beautiful Soup was used with the parser package lxml to collect articles and metadata. The first script iterated through the pages of search results, downloaded article PDFs, and wrote abstract page URLs to a text file. The second script collected metadata by iterating through the text file of abstract page URLs, parsing the HTML of each URL, and writing Dublin Core metadata to a CSV file. Articles already archived in the institutional repository were removed from the CSV file, and the remaining metadata were reviewed for errors. To pair each PDF with the correct metadata, the file names of all PDFs were added to the CSV file. Article PDFs and the metadata file were packaged using the DSpace CSV Archive and batch deposited in the University of Alabama’s institutional repository.
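The metadata step described above can be illustrated with a minimal sketch. It is not the author's script: it substitutes Python's standard-library html.parser for Beautiful Soup and lxml so that it runs without extra packages, and it assumes (hypothetically) that each abstract page exposes Dublin Core fields in meta tags whose names begin with "dc.". The sample page, the "filename" column, and the "||" separator for repeated fields are likewise illustrative assumptions.

```python
import csv
import io
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="dc.*" content="..."> pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content")
        if name.startswith("dc.") and content:
            # Assumption: repeated fields (e.g. several creators) are
            # joined with "||" so they fit in a single CSV cell.
            if name in self.fields:
                self.fields[name] += "||" + content
            else:
                self.fields[name] = content


def extract_metadata(html):
    """Return a dict of Dublin Core fields found in one page."""
    parser = MetaTagCollector()
    parser.feed(html)
    return parser.fields


def build_rows(pages, already_archived):
    """pages: (pdf_filename, html) pairs; already_archived: set of identifiers."""
    rows = []
    for pdf_name, html in pages:
        fields = extract_metadata(html)
        # Skip articles that the repository already holds.
        if fields.get("dc.identifier") in already_archived:
            continue
        # Record the PDF file name so each file is paired with its metadata.
        fields["filename"] = pdf_name
        rows.append(fields)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with one column per field."""
    columns = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical abstract page used only for illustration.
SAMPLE_PAGE = """<html><head>
<meta name="dc.title" content="Example Article">
<meta name="dc.creator" content="Doe, Jane">
<meta name="dc.creator" content="Roe, Richard">
<meta name="dc.identifier" content="10.0000/example.1">
</head><body></body></html>"""

rows = build_rows([("example-1.pdf", SAMPLE_PAGE)], already_archived=set())
print(rows_to_csv(rows))
```

In a real workflow the HTML would come from the saved list of abstract page URLs, and the resulting CSV would still need the manual error review that the author describes.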

In SCOAP3, an author affiliation search retrieved University of Alabama publications. The browser automation software Selenium was used to collect articles and metadata. The first script iterated through the pages of search results and wrote article record page URLs to a text file. The second script downloaded article PDFs and extracted DOIs to use for PDF file names. The third script collected metadata by using the article record page URLs to query the SCOAP3 metadata harvesting API and writing MARCXML metadata to a CSV file. To pair each PDF with the correct metadata, the DOI column in the CSV file was duplicated, and the “.pdf” extension was added to each DOI. The metadata in the CSV file were reviewed for errors, and citations and keywords were added manually. Articles and the metadata file were packaged and deposited using the MDPI method.
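The pairing of PDFs with metadata rows through a duplicated DOI column can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the column names "doi" and "filename" are hypothetical, and because a DOI contains a slash that cannot appear in a file name, the sketch replaces it with an underscore. The reviewed study does not say how the slash was handled.

```python
import csv
import io


def doi_to_filename(doi, slash_replacement="_"):
    """Turn a DOI into a PDF file name; the slash handling is an assumption."""
    return doi.strip().replace("/", slash_replacement) + ".pdf"


def add_filename_column(csv_text, doi_column="doi", filename_column="filename"):
    """Copy the DOI column into a new file name column with a .pdf extension."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [filename_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row[filename_column] = doi_to_filename(row[doi_column])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical metadata row used only for illustration.
sample = "doi,title\n10.0000/example.2,Example Article\n"
print(add_filename_column(sample))
```

Deriving the file name from the DOI means the same rule can be applied when the PDFs are saved and when the CSV is prepared, so the two stay aligned without a separate lookup table.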

The impact of SCOAP3 content on institutional repository downloads from the physics and astronomy collection was measured in the 100 days preceding and following the deposits.

Main Results – In total, 1,005 articles with corresponding metadata were collected from the MDPI website and SCOAP3 repository. After removing duplicate articles that were already archived in the University of Alabama institutional repository, 937 articles (272 from MDPI, 665 from SCOAP3) were deposited. The amount of faculty research available in the institutional repository increased from 1,576 articles before the project to 2,513 articles, with the 937 new deposits making up 37.3% of the total.

678 articles were added to the physics and astronomy collection, which reflects the fact that most of the deposited articles were from a subject repository. The rest of the deposited articles were from MDPI and spanned various disciplines. The next best represented collections were civil, construction, and environmental engineering (26 articles); biological sciences (26 articles); electrical and computer engineering (24 articles); and geography (22 articles). The SCOAP3 articles also contributed to a significant increase in downloads from the physics and astronomy collection. Total downloads increased from 5,765 in the 100 days preceding the deposits to 7,243 in the 100 days following the deposits, with SCOAP3 articles representing 3,421 downloads, or 47.2%.
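The reported counts and percentages can be cross-checked with simple arithmetic. The short sketch below uses only figures stated in the results; the variable names and the percent helper are illustrative.

```python
# Figures as reported in the results.
collected = 1005
deposited_mdpi = 272
deposited_scoap3 = 665
holdings_after = 2513

downloads_before = 5765
downloads_after = 7243
scoap3_downloads = 3421


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


deposited = deposited_mdpi + deposited_scoap3
duplicates_removed = collected - deposited
download_increase = downloads_after - downloads_before

print("Articles deposited:", deposited)
print("Duplicates removed before deposit:", duplicates_removed)
print("New deposits as share of holdings (%):", percent(deposited, holdings_after))
print("SCOAP3 share of later downloads (%):", percent(scoap3_downloads, downloads_after))
print("Increase in downloads (%):", percent(download_increase, downloads_before))
```

The 37.3% and 47.2% figures are shares of the post-deposit totals, whereas the rise in physics and astronomy downloads relative to the earlier 100-day period works out to roughly a quarter.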

Conclusion – This project was successful in proactively increasing the amount of scholarship in the institutional repository without faculty or researcher participation. This semi-automated workflow requires considerable technical skills but is manageable for one person. Since the articles and metadata were freely accessible and issued under permissive Creative Commons licenses, there was no need to consult publisher self-archiving policies or solicit permission to copy the articles to the institutional repository. This project did not make any research openly accessible that was otherwise unavailable or behind a paywall, but the added publications contribute to making the institution’s scholarly record more complete.

This approach may be particularly helpful for academic library staff looking to build the holdings of a brand-new institutional repository, or for those dealing with an underpopulated institutional repository due to low self-archiving rates. Additional repositories containing a substantial amount of University of Alabama scholarship will be identified and considered for web scraping, to continue expanding the institutional repository holdings. The MDPI website and SCOAP3 repository will also be re-scraped in the future for research added since this project.

Commentary

This study contributes to the literature on content recruitment strategies for institutional repositories. As emphasized by the author, self-archiving is critical to institutional repository longevity, and low self-archiving rates can lead to an underpopulated and underused repository. To address the issue of stagnant content growth, institutional repository staff have developed workflows for harvesting content from other sources and using it to populate their own repositories. For example, Lappalainen and Narayanan (2023) describe a semi-automated process for harvesting publication metadata and full text files, where available and appropriately licensed, from Scopus, Web of Science, Dimensions, and Unpaywall. Harvesting content is one of several strategies that staff employ to ensure that new institutional repository content is added regularly. The author’s contribution is a process for populating an institutional repository with gold open access publications from MDPI and SCOAP3.

The study was assessed using a critical appraisal tool developed by Perryman and Rathbun-Grubb (2014). The reason for the study is clear, and the literature review is extensive. In the literature review, the author covers topics such as the emergence of open access publishing and institutional repositories, the evolution of open access mandates in the United States, common reasons for low faculty and researcher self-archiving participation, and the use of mediated deposit models to ensure content growth. On the topic of web scraping, they outline methods, general etiquette, and legal considerations. Several existing content harvesting workflows are also highlighted.

In terms of data collection, the author clearly describes the unit of analysis and the reason for choosing this type of data. They provide an in-depth description of the web scraping methods for the MDPI website and the SCOAP3 repository. The critical appraisal tool also includes a question about whether the right kind of information was examined to address the problem. The author measured the impact of SCOAP3 articles on downloads but did not measure the same outcome for the MDPI articles. It may simply have been too challenging to collect download metrics for the MDPI articles, given that they were added to a range of collections, but analyzing downloads across disciplines would have strengthened the study.

According to the author, the deposited SCOAP3 articles had a significant impact on downloads from the physics and astronomy collection, but readers may wonder whether other factors also affected downloads. For example, some institutional repositories offer a personalized email alert feature, which allows users to keep up with new content. It would have been interesting to know more about the 100 days following the deposits, and whether users were alerted to the new content or discovered the new publications while browsing. This aspect of the study could have been reported more clearly.

Overall, this study is novel and interesting. The literature review is excellent, and the author acknowledges study weaknesses, identifies plans for continued content recruitment at their institution, and discusses the wider implications of their findings. The project did not address the issue of low faculty and researcher self-archiving participation, but the author achieved their goal of increasing the amount of scholarship in the University of Alabama institutional repository without having to plan and sustain outreach efforts or mediated deposit services. More research is needed to determine the return on investment for this content recruitment strategy, and to help institutional repository staff decide whether this strategy will align with their institution’s goals. Researchers can build on this study by further exploring the relationship between content volume and usage metrics across multiple disciplines in institutional repositories.

References

Clark, B. (2023). Proactive institutional repository collection development techniques: Archiving gold open access articles and metadata retrieved with web scraping. Journal of Library Administration, 63(6), 743–765. https://doi.org/10.1080/01930826.2023.2240190

Lappalainen, Y., & Narayanan, N. (2023). Harvesting publication data to the institutional repository from Scopus, Web of Science, Dimensions and Unpaywall using a custom R script. The Journal of Academic Librarianship, 49(1), Article 102653. https://doi.org/10.1016/j.acalib.2022.102653

Perryman, C., & Rathbun-Grubb, S. (2014). The CAT: A generic critical appraisal tool. http://www.jotform.us/cp1757/TheCat