Enhancing the Discovery of Chemistry Theses by Registering Substances and Depositing in PubChem

Chemical substances from theses are not widely accessible as searchable machine-readable formats. In this article, we describe our workflow for extracting, registering, and sharing chemical substances from the University of Alabama theses to enhance discovery. In total, 73 theses were selected for the project, resulting in about 3,000 substances registered using the IUPAC International Chemical Identifier and deposited in PubChem as either structure-data files or Simplified Molecular-Input Line-Entry System notations. In addition to substances being deposited in PubChem, an archive copy was also deposited in the University of Alabama Institutional Repository. The PubChem records for the substance depositions include the full bibliographic reference and link to the thesis full text or thesis metadata when the full text is not yet available. Excluding mixtures, we found that 40% of the shared substances were new to PubChem at the time of deposition. We conclude this article with a detailed discussion about our experiences, challenges, and recommendations for librarians and curators engaged in sharing chemical substance data from theses and similar documents.


Introduction and Literature Review
Chemical substances are core to the organization and retrieval of chemical information. Modern chemical databases contain millions of substances and associated data such as literature references, names, identifiers, and property data (Krallinger et al. 2017). Common to all chemical databases is the necessary step of substance data entry as a machine-readable format such as a chemical line notation or connection table (Weininger 1988;Dalby et al. 1992;Heller et al. 2015). After substance data entry, substances are registered (Warr 2011). Substance registration is an algorithmic process that uniquely identifies substances so that duplicates are not inadvertently entered into the database (Buntrock 2001;Gobbi & Lee 2012;Martin et al. 2012). Preventing duplicates ensures that associated known data, literature references, and identifiers can all be linked to one substance record. Computerized chemical substance registration systems have been developed over the past half century and are well documented in the literature. Examples of substance registration systems include the CAS REGISTRY from Chemical Abstracts Service (Dittmar et al. 1976), the Beilstein Registry File (Domokos 1991), Gmelin Factual Database (Roth et al. 1992;Lawson et al. 2014), the PubChem Compound Standardization system (Kim et al. 2016a;Hähnke et al. 2018) and systems based on the IUPAC International Chemical Identifier (InChI) such as UniChem and ChemSpider (Chambers et al. 2013;Heller et al. 2015;Richardson 2018).
Major efforts to extract and register substances into secondary chemical literature databases over the past half century have mainly focused on substances from journals, technical reports, and patents. To our knowledge, Reaxys (Elsevier 2021) and PubChem (Kim et al. 2016b) do not contain thesis bibliographic data nor substance data that is extracted and shared systematically from university theses. Both SciFinder (Gabrielson 2018) and ChemSpider (Pence & Williams 2010) contain data from the Selected Organic Reactions Database (SORD) (Garritano 2013; Royal Society of Chemistry 2020). The SORD database efforts in the early 2000s partnered with academic institutions to gain access to thesis collections and index reaction and substance data (Wife 2010). The latest information we could find about SORD data content suggests that substance and reaction data were extracted from 1,300 theses in total, mostly from Europe (Wife 2010). Within SciFinder, there is SORD reaction data from about 900 theses (Garritano 2013). And within ChemSpider, a data source search for SORD reveals 57,000 substances (Royal Society of Chemistry 2020). The SORD thesis substance data extraction efforts appear to no longer be active, as we were unable to locate a live website or other current information about SORD. From the SORD efforts, it was estimated that 80% of content in university chemistry theses is never published and was termed "Lost Chemistry" (de Laet et al. 2000;Wife 2010).
In addition to SORD content in CAS databases, CAS indexes thesis references (CAPlus) and registers substances from theses into the CAS REGISTRY (Garritano 2013). As of July 2020, CAS has about 670,000 thesis and dissertation records with substance and concept indexing in CAPlus (Chemical Abstracts Service, personal communication, July 13, 2020). In our experience with the chemistry theses at The University of Alabama (UA), the CAS REGISTRY typically contains two or fewer registered substances for each UA chemistry thesis. As chemistry theses with a synthetic focus typically contain many more than a handful of synthesized substances, there is an opportunity to increase thesis substance registration, sharing, and ultimately information discovery.
There are few reports in the literature focused specifically on extracting substance and related data from university chemistry theses. In fact, we are only aware of two such reports, one from Downing et al. (2010) and one from Andrews et al. (2016). Downing et al. developed automated text mining tools to extract chemical information such as names, experimental procedures, and characterization data directly from PDF/DOCX electronic theses. About 40 theses were used as a proof of concept and the extracted data was arranged in markup language format to allow for data repository storage and semantic searching. There were many challenges with data cleanup and false hits, but overall, their approach validated the use of machine extraction of chemical data from theses with acceptable precision in some cases (Downing et al. 2010). It is worth noting that there is a large amount of related ongoing research and available tools in cheminformatics focused on automated data extraction from digital documents, including methods to optically recognize chemical substance diagrams and convert them to machine-readable format (Filippov & Nicklaus 2009; Valko & Johnson 2009; Swain & Cole 2016; Krallinger et al. 2017; Nguyen et al. 2019). However, the Downing et al. (2010) article is the only article we are aware of that systematically evaluates machine extraction of chemistry thesis data. We suspect many of these automated text and chemical substance recognition methods could be applied to thesis documents, but the focus has been on other documents like patents and journals.
More recently, Andrews et al. (2016) reported on an initiative that extracted substances from United Kingdom chemistry theses and deposited the substances publicly in ChemSpider. After an initial evaluation of substance optical recognition software, Andrews et al. selected a manual extraction workflow largely due to the initial uncertainty of potential copyright restrictions associated with automated extraction and complexities associated with validating the accuracy of machine extracted substances. In total, about 45,000 substances from over 700 chemistry theses were manually extracted, redrawn, encoded in machine format and deposited publicly in ChemSpider (Andrews et al. 2016). Importantly, the bibliographic information was submitted along with the substances, which greatly enhances the discovery of the data as well as providing a provenance record for users. About 70% of the substances were new to ChemSpider at the time of deposition, which is close to the estimate of 80% "lost chemistry" from Wife (2010) and de Laet et al. (2000).
Interestingly, chemistry theses are rarely cited in the chemical literature. For example, an analysis of chemistry theses at Mississippi State University (Zhang 2013) and the University of Texas at Austin (Flaxbart 2018) found that citations to theses amounted to less than one percent of the total citations. Similar results were reported in a recent analysis of citations in ten different American Chemical Society journals where the citations for "other" information types, which includes theses, were found to be less than 5% (Rose-Wiles & Marzabadi 2018). The lack of discoverability and access to theses, including any substance content, could be one factor contributing to low chemistry thesis citation counts. We recognize this is a minor factor as peer-reviewed articles are the main information resource used in chemistry (Flaxbart 2018). Regardless, it is evident that chemistry theses contain useful data such as substances that are currently not electronically discoverable in chemical databases and, therefore, offer a unique opportunity for research libraries seeking to improve discoverability of chemical research at their institution.
The recent effort by Andrews et al. (2016) to manually extract thesis substances and deposit them openly in ChemSpider was a workflow we envisioned could be adapted for research libraries; that is, subject chemistry librarians along with data repository librarians could extract and register substances from their local university chemistry theses and share them in public disciplinary repositories such as ChemSpider or PubChem. Such efforts would greatly enhance the discovery and utility of the theses, as users would have the ability to discover not only the text of the thesis, but the actual substances via standard chemical specific searches such as by molecular structure, formula, or identifier. As we began our own efforts to extract and register substances from UA theses, we quickly realized the various complexities of registering substance data, and the general lack of available detailed workflows and guidelines for the research library community. While the Andrews et al. (2016) report was helpful to think about the overall goals and significance of the project, specific workflow details such as how to redraw the substances so that machines interpret them accurately, how to organize the substance data locally, or how to create substance-to-document links, were not discussed in detail.
In this article, we describe our workflow and results with registering nearly 3,000 substances from 73 UA theses. We manually extracted the substances, encoded them in machine-readable format, and shared the substances in PubChem openly with links to the original thesis document on our Institutional Repository (UA IR) or library record if full text was not available. In addition to sharing the data openly in PubChem, an archival copy of the substance data is available in the UA IR, and all data, programmatic scripts, and notes are openly available in GitHub (Supporting Information). We conclude this article with a discussion of the workflow challenges and our recommendations for librarians and curators related to registering and sharing substance data from university theses.

Methods
The following methods section was adapted from our GitHub repository README file (Scalfani 2020). The GitHub repository contains all data, scripts, and working notes from this project.

Theses Selected
A total of 73 UA Chemistry Ph.D. or M.S. theses were used. Theses selected were related to organic chemistry and contained synthetic details for small molecule preparations. None of the theses were embargoed; theses selected were available for public use, either digitally via the UA IR or in print from the UA Libraries. Nearly all theses selected were from 1984 through 2019. The few exceptions include three theses from the 1960s and one thesis from 1929. The thesis date was not the primary selection criterion; theses were selected based on their organic chemistry content as we discovered them. About 30% of the theses were available as full-text PDFs. The full-text PDFs were mostly theses that were born-digital (post-2009); a few were retroactively digitally scanned.

Software Environment
All software and data analysis were run on Linux Ubuntu 18.04, with the exception of Bio-Rad KnowItAll ChemWindow 2018 (Wiley Science Solutions 2020), which was run on Windows 10. The open source RDKit cheminformatics software package v2019.09.2 (Landrum 2020) was installed in an Anaconda 3 Linux conda environment with Python 3.6.

Substance Selection
In general, substances selected from theses included those that could be represented using the machine-readable Simplified Molecular-Input Line-Entry System (SMILES) line notation (Weininger 1988). These substances included small molecule organic chemistry with some limited organometallic and coordination compounds. In addition, selected substances had associated synthetic preparatory procedures and experimental characterization details such as nuclear magnetic resonance spectroscopy, infrared spectroscopy, melting point, elemental analysis, or mass spectrometry data. In rare cases, substances selected for registration included only a quantitative analytical test. The preparatory procedures were typically specific to the substance; however, in certain cases, substances were selected that had only general synthetic procedures associated with them; that is, when the same reaction was run across similar substrates. We avoided selecting substances where the synthetic preparatory method, as noted by the author, directly followed a prior reported literature preparation.

Substance Drawing
The majority of chemical substances were redrawn similarly to the depiction in the theses using ChemAxon MarvinSketch v19.27.0 (ChemAxon 2019a). Stereochemistry including double bond configuration and chiral centers were reproduced as originally defined. In cases where the substance name included racemic notation, (±), both enantiomers were drawn and included within one registry identifier. In rare situations where the author defined the stereochemistry drawn as absolute in the 2D depiction, but named the compound with relative notation symbols, R* or S*, the depiction was considered the correct absolute stereochemistry. When substances were drawn by an author with stereo non-specific wavy bonds, these were reproduced as drawn with the non-specific stereocenters, which is equivalent to plain bonds (Brecher 2006). However, when additional information was provided such that the final product was not an isolated stereoisomer, and instead an identified mixture of enantiomers or diastereomers, we drew both substance configurations and combined them into one registry identifier with two components. In cases where the diastereomeric mixture was not easily identifiable; that is, when it was not clear which stereocenter or bond to flip, or when the diastereomeric mixture was greater than two substance configurations, we drew those substances as stereo non-specific single component substances. Lastly, atropisomers were encoded as non-specific bonds.
For substances depicted as projections, special care was required to preserve the stereochemistry (Brecher 2008;Martin et al. 2012). Haworth projections were manually converted to Mills skeletal depictions (Brecher 2006) and drawn in ChemAxon MarvinSketch. When substances were presented as chair conformations or Fischer projections, Bio-Rad's KnowItAll ChemWindow 2018 software was used to draw the structures and determine the stereochemistry automatically (Abshear et al. 2018).
Some substances (< 5% estimated) required adjustments to the original representation to maintain the correct hydrogen count and represent the structures within the limitations of chemical valence rules and cheminformatics file formats. These internal adjustments are described in Scalfani (2020). Our intention was to accurately maintain the author's original chemical structures as drawn. As such, we endeavored to keep these local business rules (Hersey et al. 2015) to a minimum, and instead rely on the well documented and established PubChem Compound standardization process to standardize the structures (Hähnke et al. 2018).

Generation of Machine-Readable Substance File Formats
For substances drawn in ChemAxon MarvinSketch, the representations were exported as ChemAxon SMILES (v19.27.0, Daylight variant). ChemAxon extended SMILES (CXSMILES) (ChemAxon 2021) were used for substances containing radicals or carbenes. Next, the SMILES were compiled in a spreadsheet along with the thesis bibliographic information and processed with RDKit v2019.09.2 using a custom Python script to generate a structure-data file (SDfile) (Dalby et al. 1992) containing the molecular representation connection table, a registry identifier (UALIB-1 and increasing sequentially), RDKit calculated Kekule SMILES, InChI (v1.05), thesis bibliographic reference, and link to the full-text thesis or library record. Any dative bonds were then added to the RDKit processed SDfile manually using the PubChem nonstandard bond syntax (National Center for Biotechnology Information [date unknown-b]). For substances drawn in KnowItAll ChemWindow 2018, the representations were exported as SMILES and InChI (v1.05) and compiled into a CSV spreadsheet with a substance registry identifier, thesis bibliographic reference, and link to the full-text thesis or library record, without any further local processing.
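To make the SDfile layout concrete, the following is a minimal sketch of how data items are attached to a connection table in a single SDfile record. It is not the authors' RDKit script (that script is in the GitHub repository); it uses only the Python standard library, and the field names (REGISTRY_ID, SMILES, INCHI, REFERENCE, URL), the placeholder reference and URL, and the single-atom methane molblock are illustrative assumptions.

```python
# Sketch only: formats one SDfile record from a molblock plus data items.
# Field names and example values below are hypothetical, not the authors' schema.

EXAMPLE_MOLBLOCK = "\n".join([
    "methane",                                   # header line 1: molecule name
    "  example",                                 # header line 2: program/timestamp
    "",                                          # header line 3: comment
    "  1  0  0  0  0  0  0  0  0  0999 V2000",   # counts line: 1 atom, 0 bonds
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])


def sd_record(molblock, fields):
    """Return one SDfile record: molblock, data items, then the $$$$ delimiter."""
    lines = [molblock.rstrip("\n")]
    for tag, value in fields.items():
        lines.append(f"> <{tag}>")   # data header
        lines.append(str(value))     # data value
        lines.append("")             # blank line terminates each data item
    lines.append("$$$$")             # end-of-record delimiter
    return "\n".join(lines) + "\n"


record = sd_record(EXAMPLE_MOLBLOCK, {
    "REGISTRY_ID": "UALIB-1",
    "SMILES": "C",
    "INCHI": "InChI=1S/CH4/h1H4",
    "REFERENCE": "Author A. Year. Thesis title [dissertation].",  # placeholder
    "URL": "https://example.org/thesis",                          # placeholder
})
print(record)
```

In practice a toolkit would generate the molblock and the calculated identifiers; the sketch only shows where the registry identifier, bibliographic reference, and link sit relative to the connection table.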

Registration and Consistency Check Using the Standard InChI
Standard InChIKeys (v1.05) were calculated from ChemAxon MarvinSketch exported SMILES using the ChemAxon command line program, Molconverter (ChemAxon 2019b) in a terminal window as follows:
    $ molconvert -g "inchikey:SAbs,AuxNone" in.smi -o out.inchikey
The InChI absolute stereochemistry, SAbs, option was used to force the calculation of a Standard InChI (International Union of Pure and Applied Chemistry 2017). Next, duplicate substances were checked against a main local registry file containing the InChIKeys of previously registered substances. This step was completed in a Unix terminal:
    $ sort InChIKeys_list.inchikey | uniq --count --repeated
The above command outputs a list of any duplicate InChIKeys along with the number of times each occurs. If any duplicates were identified, the duplicate substances were assigned the original registry identifier. The same sort/uniq command was used to check for duplicate substances with the InChIKeys generated from KnowItAll.
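For readers working outside a Unix shell, the same duplicate check can be sketched in Python with the standard library. This is an illustrative equivalent of the sort/uniq step, not part of the original workflow; the function name and the three example keys (ethanol listed twice, aspirin once) are assumptions chosen for demonstration.

```python
from collections import Counter


def repeated_keys(inchikeys):
    """Return {InChIKey: count} for every key that occurs more than once."""
    counts = Counter(key.strip() for key in inchikeys if key.strip())
    return {key: n for key, n in counts.items() if n > 1}


# Example input: one key per line, as in a plain-text InChIKey list.
keys = [
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol again (a duplicate)
]
duplicates = repeated_keys(keys)
print(duplicates)
```

To run it against a file, the list can be replaced with the lines of the file, for example `repeated_keys(open(path))`, where `path` points to the local InChIKey list.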
InChIKeys were also used as an interoperability check when transferring data between cheminformatics toolkits locally; that is, the InChIKeys generated from ChemAxon Molconverter were compared to RDKit generated InChIKeys for consistency (Akhondi et al. 2012).
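A minimal sketch of such a consistency check is shown below. It assumes each toolkit's output has already been loaded into a mapping from registry identifier to InChIKey; the function name and the keys in the example are placeholders, not real substances from the project.

```python
def mismatched_keys(keys_a, keys_b):
    """Compare two {registry_id: InChIKey} mappings.

    Returns {registry_id: (key_a, key_b)} for every identifier whose keys
    differ, including identifiers present in only one mapping (None).
    """
    mismatches = {}
    for reg_id in sorted(set(keys_a) | set(keys_b)):
        a, b = keys_a.get(reg_id), keys_b.get(reg_id)
        if a != b:
            mismatches[reg_id] = (a, b)
    return mismatches


# Placeholder keys for illustration only.
toolkit_one = {
    "UALIB-1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "UALIB-2": "BBBBBBBBBBBBBB-CCCCCCCCSA-N",
}
toolkit_two = {
    "UALIB-1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "UALIB-2": "BBBBBBBBBBBBBB-UHFFFAOYSA-N",  # stereo layer differs
}
flagged = mismatched_keys(toolkit_one, toolkit_two)
print(flagged)
```

Any identifier returned by the function is a candidate for manual inspection before deposition.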

PubChem Deposition
The RDKit generated SDfiles and KnowItAll compiled CSV spreadsheet files were submitted to PubChem for processing into the database through the PubChem Upload web interface. Our local registry file was then updated with the deposited PubChem Substance Identifier (SID) and standardized Compound Identifiers (CID).

Institutional Repository Archiving
After depositing the substance data in PubChem, an archive copy of the substance data in SDfile or CSV format was deposited in UA's DSpace Institutional Repository (UA IR). A new record was created for each collection of thesis substance data. Each UA IR record used the Dublin Core metadata schema with the following elements: dc.contributor, dc.date.issued, dc.description (includes a description of the substance data and CC-BY 4.0 license), dc.publisher, dc.relation.isbasedon (reference to original thesis), dc.title, dc.type, local.GitHub.URL, local.SDFPubChemExternalIDs.URL. The latter two local metadata elements provide cross links to the substance data on GitHub and PubChem.

New Substance Count Data Collection
To find the number of newly deposited substances in PubChem, the total number of substances (SIDs for same, mixtures, and all) linked to each of the UA deposited compound standardized records were retrieved. If there was only one associated SID, the structure was considered new to PubChem, and represents a new deposition. The data was programmatically collected using a script written in MATLAB R2020a. The MATLAB script uses the PubChem Power User Gateway web requests to retrieve the data and is detailed in a separate article. The substance count data was collected in May 2020 for each of the UA substances deposited in PubChem.
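The "one associated SID means new" rule can be sketched in Python as follows. This is a simplified stand-in, not a port of the MATLAB script: it assumes the PUG REST request form `compound/cid/<CID>/sids/JSON` and a response shaped as `InformationList` > `Information` > `SID`, and it performs a single SID count rather than the separate same, mixture, and all counts described above. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sids_url(cid):
    """Build the (assumed) PUG REST URL listing SIDs linked to a CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/sids/JSON"


def count_sids(payload):
    """Count SIDs in a parsed response; the layout is an assumption."""
    info = payload.get("InformationList", {}).get("Information", [])
    return sum(len(entry.get("SID", [])) for entry in info)


def is_new_to_pubchem(payload):
    """Apply the rule from the text: exactly one linked SID means new."""
    return count_sids(payload) == 1


def fetch_sids(cid, timeout=30):
    """Retrieve and parse the SID list for one CID (requires network access)."""
    with urlopen(sids_url(cid), timeout=timeout) as response:
        return json.load(response)


# Offline illustration with hand-written payloads (hypothetical identifiers).
only_ours = {"InformationList": {"Information": [{"CID": 1, "SID": [101]}]}}
already_known = {"InformationList": {"Information": [{"CID": 2, "SID": [201, 202]}]}}
print(is_new_to_pubchem(only_ours), is_new_to_pubchem(already_known))
```

When looping `fetch_sids` over thousands of CIDs, requests should be spaced out to respect PubChem's usage policies.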

Thesis Content
The variety of chemistry encountered in the selected 73 organic chemistry theses was diverse and reported substances synthesized included, for example, ionic liquids, natural products, carbene complexes, silyl compounds, furanones, ribose derivatives, and boronates. On average each thesis contained 39 synthesized substances with associated characterization data. By evaluating thesis titles, we estimated that ~200 of the theses at UA from 1924-2020 are in the organic chemistry subject domain and have a significant focus on small molecule synthesis, and as a result, our selected sample represents about 40% of suitable theses at UA for organic chemistry substance registration. About 30% of the theses selected are available in digital full-text format.

Substance Drawing and Machine-Readable File Creation
Using the workflow described in the methods section, a total of 2,885 unique substances were manually redrawn. The majority of structures (~94%) were drawn in ChemAxon MarvinSketch with the remaining substances, originally depicted as perspective representations, drawn in KnowItAll ChemWindow. For reference, if no challenges were encountered, we could typically draw 60 substances in about 3 hours, and then complete the remainder of the workflow in minutes.

Substance Registration and Interoperability Check with InChI
In our workflow, we tracked substances by their calculated Standard InChIKey. A compiled tabular list of InChIKeys and a unique identifier (e.g., UALIB-1) served as our internal registry list. If the calculated Standard InChIKey for a substance was unique, the substance was determined as new and added to our internal registry list. If the substance was identified as a duplicate InChIKey within our substance registry list, it was assigned the previously known registry identifier. In all of the substances we selected, there were 76 duplicates identified (2.6%) using the Standard InChI. For these substances, the result is that they have more than one associated thesis reference in our local registry identifier list.
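The registration logic described above can be sketched as a small in-memory registry. This is an illustration of the idea rather than the authors' implementation (their registry was a tabular list); the class and method names are hypothetical, and the keys and references in the example are placeholders.

```python
class Registry:
    """Map Standard InChIKeys to sequential identifiers such as UALIB-1."""

    def __init__(self, prefix="UALIB"):
        self.prefix = prefix
        self.ids = {}    # InChIKey -> registry identifier
        self.refs = {}   # registry identifier -> list of thesis references

    def register(self, inchikey, thesis_ref):
        """Return the identifier for a key, creating one if the key is new."""
        reg_id = self.ids.get(inchikey)
        if reg_id is None:
            # New substance: assign the next sequential identifier.
            reg_id = f"{self.prefix}-{len(self.ids) + 1}"
            self.ids[inchikey] = reg_id
            self.refs[reg_id] = []
        # Duplicate substance: reuse the identifier, add the extra reference.
        if thesis_ref not in self.refs[reg_id]:
            self.refs[reg_id].append(thesis_ref)
        return reg_id


registry = Registry()
first = registry.register("KEY-ONE", "Thesis A")    # placeholder key
second = registry.register("KEY-TWO", "Thesis A")   # placeholder key
repeat = registry.register("KEY-ONE", "Thesis B")   # same substance, new thesis
print(first, second, repeat, registry.refs[first])
```

A duplicate therefore never receives a new identifier; it only extends the list of thesis references attached to the existing one.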
The Standard InChI was also used to check the consistency of the chemical substance data exchange between the cheminformatics toolkits. SMILES and file format reading differences between toolkits are known to exist (O'Boyle et al. 2018), and since we were transferring ChemAxon generated SMILES to RDKit, comparing the individual toolkit calculated Standard InChIs was a convenient way to check for consistency (Akhondi et al. 2012;O'Boyle et al. 2018). In total, there were 19 chemical substances (0.7%) where the calculated Standard InChIs did not match between the toolkits. We hypothesize that the differences are a result of a 2D drawing limitation (Clark et al. 2006;Frączek 2016) leading to a different calculated InChI (there are no coordinates in SMILES). Out of caution, we submitted these 19 substances as ChemAxon molfiles, which includes the original 2D coordinates, directly to PubChem, without any local transfer to other toolkits. No critical differences were observed compared to the original ChemAxon SMILES to RDKit molfile derived submission after the PubChem standardization process. Compounds were standardized in PubChem Compound in the same manner.

PubChem Deposition
The substance data was deposited in PubChem through their PubChem Upload interface as either an SDfile or a CSV text file. After submission, it typically took 3-7 days for PubChem to process the data and assign public PubChem SIDs from the Substance database along with the linked standardized PubChem CIDs from the Compound database. Each SID record in PubChem deposited by UA Libraries uses the External ID field to link to the full-text thesis in the UA IR or the catalog metadata record if the full text is not available yet (Figure 1).
We also added the full bibliographic citation of each thesis in the Depositor Comment Field. We notified PubChem staff that our depositions contained linked synthetic preparatory procedures in the original thesis reference. As a result, PubChem created a workflow on their end during the standardization process, which created a "Synthesis Reference" annotation from the bibliographic reference in the Depositor Comment field. The thesis reference is then displayed on the associated CID record page in the "Synthesis Reference" Literature section (Figure 2).

Evaluation of Substances Deposited
Using PubChem programmatic web requests, we found that 1,461 (51%) of the UA thesis PubChem standardized compounds had only one associated substance identifier (SID), and were, therefore, new to the PubChem database at the time of deposition. PubChem Compound considers mixtures as unique and since our depositions include mixtures, the unique percentage of 51% may be slightly inflated. We had a total of 298 mixture submissions. If we assume that all of the mixtures had known individual components, this brings the new compound percentage down to 40%.
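The two percentages can be reproduced with a short calculation. The counts are taken from the text; the choice of denominator (the 2,885 unique substances reported earlier) is our assumption, since the exact denominator is not stated explicitly.

```python
unique_substances = 2885   # unique substances redrawn (assumed denominator)
single_sid = 1461          # standardized compounds with exactly one linked SID
mixtures = 298             # mixture submissions

# Upper estimate: every compound with a single SID counts as new.
upper = single_sid / unique_substances

# Lower estimate: assume every mixture consists of already-known components.
lower = (single_sid - mixtures) / unique_substances

print(f"{upper:.1%}  {lower:.1%}")   # prints "50.6%  40.3%"
```

Rounded to whole percentages, these correspond to the 51% and 40% figures reported above.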

Thesis Selection, Full Text Limitations, and Copyright Considerations
Theses containing organic, and some limited organometallic, substances are great candidates for substance data sharing as these molecules are most easily represented as machine-readable formats with available cheminformatics software (Clark 2011;Warr 2011;Hähnke et al. 2018). We, therefore, considered organic chemistry theses to be the priority area for substance registration and substance data sharing. The majority of the theses we identified at UA from 1924 through 2020 as having an organic chemistry focus (~200), were only available in print. As such, we considered and experimented with retrospective digital scanning of theses and deposition of the full text in the UA IR as permitted by copyright (Copyright Advisory Network 2020). However, this manual scanning process of theses was too time consuming and deemed not essential to the goals of the substance registration project. As each substance registered and shared would include the thesis bibliographic information, users discovering the substance data can contact UA Libraries for the full text.
It is our personal understanding as academic researchers, not lawyers, that according to the Compendium of U.S. Copyright Office Practices (2017), chemical substances are excluded from copyright protection. However, it is not clear to us if automated machine extraction of chemical substances would be considered copying the thesis content and a violation of the author's copyright. Andrews et al. (2016) had similar concerns with machine extraction in their thesis data extraction pilot. Given this uncertainty of machine extraction and copyright law, combined with the fact that most of our theses were only available in print, we had to use a manual substance extraction approach, which created the necessity for us to redraw all substance structures, as opposed to any automated substance machine-extraction techniques.

Experiences and Challenges with Substance Drawing
The majority of substances we encountered could be redrawn in ChemAxon MarvinSketch similarly to how they were depicted in the original thesis; that is, the subsequent export of the machine-encoded SMILES faithfully preserved the input structure atoms, bonds, connectivity, and stereochemistry. These "well-behaved" substances (> 90%) were substances that were drawn with organic chemistry 2D skeletal formulas, which followed, or at least loosely followed, graphical representation standards from IUPAC (Brecher 2006;Brecher 2008). Some of the key features include using lines for bonds, omitting hydrogen atoms, atomic symbols for heteroatoms, plus or minus symbols for charges, and hashed or solid wedges/bonds for stereochemistry (Figure 3). These types of structures are most easily interpreted by cheminformatics software (Brecher 2008;Martin et al. 2012).
We found that we were efficient with drawing structures in ChemAxon MarvinSketch; however, different structure editors can certainly be used and there are a variety of other editors available depending upon preferences such as ChemDraw or PubChem Sketcher (Ihlenfeldt et al. 2009).

Figure 3. Examples of "well-behaved" substance drawings redrawn as depicted
Preserving stereochemistry from perspective chemical substance drawings (Figure 4) and handling stereochemical mixtures were the most challenging aspects of redrawing chemical substances. Perspective drawings including Haworth, chair, and Fischer projections are designed for humans and are generally not fully interpreted by most chemical drawing software tools, with the major limitation being loss of stereochemical information (Brecher 2008; Gobbi & Lee 2012; Martin et al. 2012). To our knowledge, the only consumer/academic software that can automatically assign stereochemistry in perspective drawings is the KnowItAll ChemWindow structure editor (Abshear et al. 2018). Given the software limitations of interpreting perspective drawings, we either had to manually infer the stereochemistry and redraw the structures with standard hash/solid wedges for stereochemistry or use the KnowItAll ChemWindow software to perceive the stereochemistry automatically. We found that it was most efficient for us to manually redraw Haworth projections as non-perspective drawings in MarvinSketch. However, for the chair and Fischer projections, it was faster for us to draw these in ChemWindow than to manually perceive the stereochemistry.
Ideally, for stereochemical mixtures including racemates, enantiomers, and diastereomers, we would use a file format such as molfile V3000 or ChemAxon Extended SMILES that supports relative configuration of stereocenters (Gobbi & Lee 2012; Martin et al. 2012). However, PubChem does not support relative stereochemistry or defined mixtures of stereoisomers as a single structure. Support for enhanced stereochemistry is technically possible and defined within the PubChem stereochemistry specification, but this feature is not currently supported (National Center for Biotechnology Information [date unknown-a]). Further, we selected the Standard InChI as our local substance uniqueness check, and this process considers the substances as only absolute stereochemistry. As a result of these limitations, using file formats that support enhanced stereochemistry was not an option for us and we instead represented stereochemical mixtures including racemates, enantiomers with any ratio, and diastereomers within one registry identifier as separate disconnected substances. Such representation limitations within chemical databases are discussed by Hersey et al. (2015), and there is, unfortunately, not currently an accepted standard across public databases for how to represent stereochemical mixtures; some sources choose to draw racemic mixtures as one substance with no stereochemistry, while others draw multiple enantiomers or diastereomers in a single record. Lastly, drawing multiple substances in a record creates a way to describe molecules by the AND operator, for example: (2S)-2-bromobutane AND (2R)-2-bromobutane. It is unclear how to best represent a substance with a defined "OR" scenario within one registry identifier in public databases, without using extended stereochemistry file formats, such as in the case of (2S)-2-bromobutane OR (2R)-2-bromobutane.
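In SMILES, the "separate disconnected substances" representation amounts to joining the component SMILES with a dot. The sketch below illustrates this for the two enantiomers of 2-bromobutane; the helper name is hypothetical, and the example deliberately does not label which string is R and which is S.

```python
def as_mixture(*component_smiles):
    """Join component SMILES into one dot-disconnected SMILES string."""
    parts = [s.strip() for s in component_smiles if s and s.strip()]
    if not parts:
        raise ValueError("at least one component SMILES is required")
    return ".".join(parts)


# The two enantiomers of 2-bromobutane differ only in @ versus @@.
racemate = as_mixture("CC[C@H](C)Br", "CC[C@@H](C)Br")
print(racemate)   # prints "CC[C@H](C)Br.CC[C@@H](C)Br"
```

A record built this way expresses the AND case only; it carries no information about the ratio of the components.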

Machine-Readable File Creation Experiences and Recommendations
As noted above, most of the substance representations were processed using the RDKit to create SDfiles. We selected RDKit because of its strong integration with the Python programming language and our familiarity with it. RDKit is a cheminformatics toolkit and does not contain a graphical structure editor. As such, this required drawing the structures in a separate program, in our case ChemAxon MarvinSketch, and then transferring the molecular representation data to RDKit. In hindsight, processing the chemical substance SMILES data in a separate cheminformatics toolkit to create an SDfile was unnecessary for data sharing in PubChem and necessitated the incorporation of a local data interoperability check using InChI. A more efficient approach is to compile the molecular representation data as SMILES along with the thesis bibliographic information in a spreadsheet application and then submit this file directly to PubChem, as we did in the case of substances drawn with KnowItAll ChemWindow; that is, the same spreadsheet workflow could have been used for substances drawn in ChemAxon MarvinSketch. The major limitation with submitting a spreadsheet of SMILES chemical representations to PubChem is that, to our knowledge, it is not possible to specify PubChem nonstandard bonds such as dative bonds defined in PubChem substance tags (National Center for Biotechnology Information [date unknown-b]) within the spreadsheet or represent features such as radicals, as these SMILES extensions are not recognized by PubChem. In these specific cases, a molfile/SDfile representation format would need to be used for PubChem submissions. Such a task can still be completed within a single toolkit, as both MarvinSketch and KnowItAll can export molfile/SDfile formats. 
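A spreadsheet of this kind can be produced with Python's standard csv module, as sketched below. The column names, the example reference, and the URL are hypothetical; they are not PubChem's required headers, and the depositor would map the columns to PubChem fields during upload.

```python
import csv
import io

# Hypothetical column names for a SMILES-based deposition spreadsheet.
COLUMNS = ["registry_id", "smiles", "reference", "url"]


def write_rows(rows, handle):
    """Write substance rows as CSV; the csv module quotes embedded commas."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


rows = [{
    "registry_id": "UALIB-1",
    "smiles": "CCO",
    # Placeholder citation; note the commas inside the field.
    "reference": "Author A. 2019. Example title, with commas [dissertation].",
    "url": "https://example.org/thesis/1",   # placeholder link
}]

buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing through the csv module rather than joining strings by hand avoids broken rows when a bibliographic reference contains commas or quotation marks.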
Finally, it should be noted that there are other differences and limitations with molfile/SDfile encoded molecular representations compared to SMILES and this may be a consideration when submitting data to PubChem (Daylight Chemical Information Systems 2011;Dassault Systemes 2017). However, as SMILES contain the entire representation on one line, we found SMILES much more convenient to work with compared to molfile/SDfiles.
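As a minimal sketch of the spreadsheet approach recommended above, the following Python writes SMILES and thesis bibliographic information to a comma-separated file. The column names, the registry identifier format, and the example values are all assumptions made for illustration; they are not the PubChem deposition template, which should be consulted for the required fields.

```python
import csv
import io

# Assumed column names for illustration only; not PubChem's actual template.
FIELDS = ["registry_id", "smiles", "thesis_citation", "thesis_url"]


def write_deposition_sheet(records, handle):
    """Write one row per substance, refusing rows with any empty field."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        missing = [f for f in FIELDS if not rec.get(f)]
        if missing:
            raise ValueError(f"{rec.get('registry_id', '?')}: missing {missing}")
        writer.writerow({f: rec[f] for f in FIELDS})


# Placeholder records; identifiers, citation text, and URLs are invented.
records = [
    {"registry_id": "EXAMPLE-000001", "smiles": "CCO",
     "thesis_citation": "Placeholder citation A",
     "thesis_url": "https://example.org/thesis/a"},
    {"registry_id": "EXAMPLE-000002", "smiles": "CC(=O)O",
     "thesis_citation": "Placeholder citation A",
     "thesis_url": "https://example.org/thesis/a"},
]

buffer = io.StringIO()  # swap for open("substances.csv", "w", newline="") to save to disk
write_deposition_sheet(records, buffer)
print(buffer.getvalue())
```

Because each SMILES occupies a single cell, a sheet like this can be reviewed and corrected by hand before submission, which is the convenience noted above.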

InChI Algorithm for Substance Registration
The use of InChI was critical to our substance registration process, as well as being useful as a consistency check within our overall workflow. InChI is an open non-proprietary chemical identifier, which is well supported across cheminformatics software. The InChI algorithm is currently used to check for structure uniqueness in several public chemical databases and cross-referencing services such as ChEMBL (Mendez et al. 2018;Hersey [date unknown]), ChEBI (Chambers et al. 2013;Hastings et al. 2016), ChemSpider (Richardson 2018), and UniChem (Chambers et al. 2013). Given the ability to compare InChIs across cheminformatics toolkits and the established record of using InChI as a uniqueness check, InChI proved to be well suited to our structure uniqueness check.
There are different levels of uniqueness that InChI can describe, depending on whether a standard or non-standard InChI is calculated. Standard InChIs, for example, are tautomer independent, represent organometallics with disconnected metals, and only support absolute stereochemistry. These limitations can be overcome by calculating a non-standard InChI, which allows for specific options related to tautomers, metal representation, stereochemistry and more (Heller et al. 2015). Both the Standard InChI and non-standard InChI are suitable choices for checking the uniqueness of chemical substances. ChEMBL and UniChem use the Standard InChI (Chambers et al. 2013;Hersey [date unknown]), while ChemSpider (Royal Society of Chemistry, personal communication, July 24, 2020) and ChEBI use a non-standard InChI (Chambers et al. 2013;Hastings et al. 2016). Chambers et al. (2013) argue that the community considers the Standard InChI to be an acceptable measure of substance uniqueness relevant to chemical biology and drug discovery. Ultimately, we selected the Standard InChI primarily for data reuse; that is, since all of our data, including intermediate working files and registry lists, are public, we felt it was best to share Standard InChIs for data exchange.
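The idea of using the Standard InChI as a uniqueness key can be sketched as follows. This is an illustration under stated assumptions, not our registration code: the class name `SubstanceRegistry` and the identifier format are hypothetical, and the InChI strings are assumed to have been computed beforehand in a cheminformatics toolkit. The only InChI-specific fact the sketch relies on is that Standard InChIs begin with the prefix `InChI=1S/`.

```python
class SubstanceRegistry:
    """Hypothetical in-memory registry keyed on the Standard InChI string."""

    def __init__(self, prefix="EXAMPLE"):
        self._by_inchi = {}   # Standard InChI -> registry identifier
        self._prefix = prefix

    def register(self, standard_inchi):
        """Return (registry_id, is_new); duplicates get the existing identifier."""
        if not standard_inchi.startswith("InChI=1S/"):
            raise ValueError("expected a Standard InChI (prefix 'InChI=1S/')")
        if standard_inchi in self._by_inchi:
            return self._by_inchi[standard_inchi], False
        registry_id = f"{self._prefix}-{len(self._by_inchi) + 1:06d}"
        self._by_inchi[standard_inchi] = registry_id
        return registry_id, True


registry = SubstanceRegistry()
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
water = "InChI=1S/H2O/h1H2"

first = registry.register(ethanol)   # new substance
second = registry.register(water)    # new substance
repeat = registry.register(ethanol)  # duplicate: same identifier, is_new is False
print(first, second, repeat)
```

A registry keyed this way inherits the Standard InChI behavior described above, so tautomers that share a Standard InChI would be treated as one substance.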

PubChem Data Sharing, Provenance, and Access
At the time that we submitted the UA thesis substances, PubChem had 103 million unique compounds. As such, the fact that 40% or more of our contributions were new to the database and unique from the already present 103 million compounds is highly significant, and we believe supports the claim that contributing substances from university theses is valuable to the community. For duplicate substances submitted, there is still value as the data is merged with other records in PubChem and adds a new bibliographic reference to the record.
There are many steps involved in sharing chemical substance data from theses, and with that come many opportunities for data loss or corruption. No matter how careful data depositors are locally, there is still a possibility that any of the substances shared in PubChem could be interpreted differently after being processed with PubChem's cheminformatics software and standardization workflow (Hähnke et al. 2018). To evaluate how our substance representations changed after PubChem deposition, we compared our locally computed Standard InChIs for all substances that passed PubChem standardization to the PubChem Compound Standard InChIs and found that 150 (5.2%) of the substances did not have identical InChIs after PubChem processing, suggesting a possible change in structure representation. Chemical substance interpretation differences, such as a stereochemistry loss or hydrogen count disagreement, highlight the importance of maintaining provenance to the original data and a link to the bibliographic record. While we can endeavor to limit errors (~95% precision based on Standard InChI comparison), ultimately the end user should always validate the data with the original source.
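A comparison of this kind can be sketched in a few lines of Python. The sketch assumes that both the local and the PubChem Standard InChIs are already available as strings; the function names are hypothetical, the layer parsing is deliberately simplified, and the example strings (including the stereo layers) were written by hand for illustration rather than taken from our data.

```python
def split_layers(inchi):
    """Split an InChI into layers keyed by prefix letter (simplified parsing)."""
    version, *rest = inchi.split("/")
    layers = {"version": [version]}
    if rest:
        layers["formula"] = [rest[0]]  # the formula layer has no prefix letter
    for segment in rest[1:]:
        if segment:
            # A list per key, since a prefix can recur in a full InChI.
            layers.setdefault(segment[0], []).append(segment[1:])
    return layers


def differing_layers(local, remote):
    """Return the sorted layer keys whose contents differ between two InChIs."""
    a, b = split_layers(local), split_layers(remote)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def mismatch_report(pairs):
    """pairs maps registry id -> (local InChI, PubChem InChI)."""
    diffs = {rid: differing_layers(loc, rem)
             for rid, (loc, rem) in pairs.items() if loc != rem}
    rate = len(diffs) / len(pairs) if pairs else 0.0
    return diffs, rate


# Illustrative strings only; the stereo layers are not verified toolkit output.
pairs = {
    "EX-1": ("InChI=1S/C4H9Br/c1-3-4(2)5/h4H,3H2,1-2H3/t4-/m0/s1",
             "InChI=1S/C4H9Br/c1-3-4(2)5/h4H,3H2,1-2H3"),
    "EX-2": ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
             "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
}
diffs, rate = mismatch_report(pairs)
print(diffs, rate)
```

Reporting which layers differ, here the stereochemical layers t, m, and s, distinguishes a stereochemistry loss from a hydrogen count disagreement in the h layer.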
One of the biggest advantages of depositing data in PubChem is that users can now search for UA thesis substance data with chemical specific search query options, such as by chemical structure, substructure, molecular formula, and identifier. Notably, there is limited information available about chemists' use of PubChem as citations to databases in the literature are rare (Tomaszewski 2019). In a recent information seeking behavior study of chemists, however, it was found that about 17 percent of the chemists surveyed use PubMed (Gordon et al. 2018), which is closely integrated with PubChem. Moreover, throughout 2020, PubChem had between two and four million unique users per month (Kim et al. 2021).
Full access to the UA thesis substance data is available through PubChem via the web interface or any of their programmatic interfaces such as PUG-REST (Kim et al. 2015), PUG-VIEW (Kim et al. 2019) or E-Utilities (National Center for Biotechnology Information 2021). We recommend accessing UA thesis substance data through PubChem, since PubChem standardizes the data and combines the data with related information. However, to maintain the provenance of the substance data, and allow users to validate the data, there is a link from the Source field in our PubChem deposited data directly to our UALIB_ChemStructures GitHub repository which contains notes about reuse (CC-BY 4.0), the original substance data files, and thesis bibliographic reference.
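As a brief illustration of programmatic access, the sketch below builds a PUG-REST request URL that looks up PubChem Compound identifiers (CIDs) by InChIKey. It only constructs the URL and does not contact PubChem; the function name is hypothetical, and the InChIKey shown is that of ethanol, chosen as a generic example rather than a substance from our deposition.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cids_url_for_inchikey(inchikey):
    """Build a PUG-REST URL requesting the CIDs that match an InChIKey, as JSON."""
    return f"{PUG_REST_BASE}/compound/inchikey/{quote(inchikey, safe='')}/cids/JSON"


# Ethanol's InChIKey, used here only as a generic example.
url = cids_url_for_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client, for example `urllib.request.urlopen(url)` from the Python standard library, subject to PubChem's usage policies.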
Another advantage of depositing substance data in PubChem is the ability to update records. If the updated data is submitted with the original registry identifier, PubChem will maintain substance record versioning and reprocess the data into PubChem Compound. This is important, and allows us to update our substance records, for example, as we become aware of errors or need to update a bibliographic reference link. Moreover, we expect to submit updates as cheminformatics file formats improve and as our workflows and understanding of how to handle chemical representation increases.

Cost versus Benefit Considerations
A reasonable question to consider is what is the cost versus benefit of spending the time to extract, register, and share substances retrospectively from theses? It is a hard question to answer, but we do have a couple of supporting quantitative data points. For example, we found that at least 40% of the substances we shared were new to PubChem. The 40% we found is less than the 70% reported by Andrews et al. (2016) for new substances deposited to ChemSpider from UK theses; however, it is still a large percentage of new substances deposited. There is also a potential to quantify any increased web traffic views of UA theses with substances shared versus theses that do not have their substances shared in machine-readable format. We hope to have some meaningful usage data to analyze after a few years, which should provide a reasonable time frame for discovery of the new data in PubChem.
More broadly, theses represent the history of the research at an institution (Scalfani 2017), and we feel strongly that one of the most important tasks a librarian should engage in is to help promote, share, and preserve their institutions' research for others to discover and build upon.
We acknowledge that a significant time investment will be necessary for the workflow setup and becoming familiar with chemical structures, software, and machine-readable chemical file formats. However, after a workflow is set up similarly to that described in this article, the actual process of redrawing structures and sharing them is reasonable and practical to incorporate within regular liaison workloads. With a bit of practice, we were able to complete an entire thesis with 60 substances in about 3 hours.

Conclusion
We successfully implemented a workflow to manually redraw chemical substances from UA theses and share them in machine-readable format in PubChem. The main workflow used a combination of ChemAxon MarvinSketch and RDKit to create a machine-readable SDfile containing the substance connection tables, SMILES, InChI and bibliographic reference. The greatest challenge was the manual redrawing of the chemical substances, particularly when encountering perspective drawings and stereochemical mixtures. In total, about 3,000 chemical substances from 73 UA theses were shared. At least 40% of the substances were new to PubChem at the time of deposition. Substance depositions in PubChem include the full thesis bibliographic information and link to the thesis full-text PDF or metadata record if the digital full text is not yet available. Users can now discover UA theses in PubChem using specific chemical literature search strategies like molecular formula, structure, and identifier searches.
For librarians and curators seeking to share chemical substance data from theses, it is necessary to first become familiar with chemical file formats and their limitations. It will take time to register and share a significant amount of retrospective thesis substance data from research libraries; however, we are hopeful that this article will help stimulate interest among chemistry librarians and support the idea that enhancing the discovery of theses is of value to the community and profession.