Enhancing the Discovery of Chemistry Theses by Registering Substances and Depositing in PubChem


  • Vincent F. Scalfani The University of Alabama
  • Barbara J. Dahlbach The University of Alabama
  • Jacob Robertson The University of Alabama https://orcid.org/0000-0001-6356-9585




Keywords: PubChem, chemical information, data sharing, theses and dissertations


Chemical substances from theses are not widely accessible in searchable, machine-readable formats. In this article, we describe our workflow for extracting, registering, and sharing chemical substances from University of Alabama theses to enhance their discovery. In total, 73 theses were selected for the project, resulting in about 3,000 substances registered using the IUPAC International Chemical Identifier (InChI) and deposited in PubChem as either structure-data files (SDF) or Simplified Molecular-Input Line-Entry System (SMILES) notations. In addition to the PubChem deposition, an archive copy of the substances was deposited in the University of Alabama Institutional Repository. The PubChem records for the deposited substances include the full bibliographic reference and a link to the thesis full text, or to the thesis metadata when the full text is not yet available. Excluding mixtures, we found that 40% of the shared substances were new to PubChem at the time of deposition. We conclude this article with a detailed discussion of our experiences, challenges, and recommendations for librarians and curators engaged in sharing chemical substance data from theses and similar documents.
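As a rough illustration of the registration idea summarized above, the following minimal Python sketch groups extracted substances by their InChI string so that each unique structure is registered once, and then estimates the share of structures not found in a reference set. It is not the authors' workflow: the function names (`register`, `fraction_new`, `inchikey_lookup_url`), the local identifiers, and the sample records are hypothetical, and the InChI strings are assumed to have been generated beforehand by a cheminformatics toolkit. The lookup URL follows the public PubChem PUG-REST pattern for InChIKey searches; no network request is made here.

```python
from urllib.parse import quote

# Base address of the PubChem PUG-REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def register(records):
    """Group (local_id, inchi) pairs by InChI (hypothetical helper).

    Each unique InChI becomes one registry entry that lists every
    local identifier (for example, thesis plus compound label) drawn with it.
    """
    registry = {}
    for local_id, inchi in records:
        registry.setdefault(inchi, []).append(local_id)
    return registry


def fraction_new(registry, known_inchis):
    """Return the fraction of registered InChIs absent from known_inchis."""
    if not registry:
        return 0.0
    new = [inchi for inchi in registry if inchi not in known_inchis]
    return len(new) / len(registry)


def inchikey_lookup_url(inchikey):
    """Build a PUG-REST URL that asks PubChem for CIDs matching an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/JSON"


# Illustrative records only; the identifiers on the left are made up.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

records = [
    ("thesis01-cpd1", ETHANOL),
    ("thesis02-cpd4", ETHANOL),  # same structure drawn in a second thesis
    ("thesis01-cpd2", BENZENE),
]

registry = register(records)
share_new = fraction_new(registry, known_inchis={ETHANOL})
print(len(registry), share_new)
print(inchikey_lookup_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Keying the registry on the InChI string means that two drawings of the same structure collapse into one entry. In practice, a check against PubChem would replace the in-memory `known_inchis` set.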








How to Cite

Scalfani, V. F., Dahlbach, B. J., & Robertson, J. (2021). Enhancing the Discovery of Chemistry Theses by Registering Substances and Depositing in PubChem. Issues in Science and Technology Librarianship, (97). https://doi.org/10.29173/istl2566



Refereed Articles