Correlation of term usage and term indexing frequencies
DOI:
https://doi.org/10.29173/cais1293
Abstract
There have been several studies on the distributions of index terms, title terms, authors and other elements used in searching bibliographic databases. What is needed is to relate this information to the actual usage of terms in searching. This study uses data from monitoring the actual usage of terms in an online catalog at the School of Library and Information Science at the University of Western Ontario. Every time a term was used in a search expression, a count in the dictionary file was updated. If a word was not in the dictionary, it was added. In this way we can see which words are in the catalog but were not searched, and also those which were searched but were not in the dictionary. As a check against other studies, the rank distribution of terms used in searching was examined and found to be of a general Zipf type.
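The counting scheme described above can be illustrated with a minimal Python sketch. It is not the study's actual monitoring code: the names `dictionary_counts`, `search_counts` and `record_search`, the whitespace tokenization, and the toy terms and queries are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy dictionary: term -> frequency in the catalog.
# The study's real dictionary file is not shown in the abstract.
dictionary_counts = Counter({"library": 40, "index": 25, "zipf": 3, "thesaurus": 1})

# Term -> number of times it appeared in a search expression.
search_counts = Counter()


def record_search(expression):
    """Update the usage count of every term in one search expression.

    Assumes terms are separated by whitespace; a Counter creates the
    entry automatically the first time an unseen word is used.
    """
    for term in expression.lower().split():
        search_counts[term] += 1


# Hypothetical search expressions.
for query in ["library index", "library automation", "index"]:
    record_search(query)

# Words in the dictionary that were never searched.
never_searched = sorted(t for t in dictionary_counts if t not in search_counts)

# Words that were searched but were not in the dictionary.
not_in_dictionary = sorted(t for t in search_counts if t not in dictionary_counts)

# Search terms ordered from most to least used; a Zipf-type distribution
# would show frequency falling off roughly as a power of rank.
ranked_search_terms = search_counts.most_common()
```

Listing `ranked_search_terms` by rank and frequency gives the kind of rank distribution that can then be compared against a Zipf-type curve.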
The main interest was to check whether high frequency terms in the catalog were also used frequently in searching. Several measures of this were tried. First, a scatterplot of each term's frequency in the catalog versus its frequency in searching was examined, and Pearson’s correlation coefficient was calculated. The correlation was reasonably high at 0.74. Since the total number of terms and the frequencies were much larger in the catalog, the rank of each term in the catalog was then plotted against its rank in searching.
This produced a very scattered plot, with less clustering at the origin than in the original plot of frequencies. The Spearman rank correlation coefficient was moderate at 0.58. Some studies have suggested that removing high frequency and very low frequency terms from the search vocabulary will improve retrieval performance. These data show that many general users actually use these terms in searching.
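The two coefficients mentioned above can be sketched in a few lines of Python. This is an illustration only: the frequency lists are invented toy values, not the study's data, and the function names `pearson`, `average_ranks` and `spearman` are hypothetical. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing the mean of their rank positions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank 1 is the largest value; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy frequencies for five terms (not the study's data).
catalog_freq = [120, 80, 45, 10, 3]
search_freq = [30, 5, 12, 4, 1]

r_frequency = pearson(catalog_freq, search_freq)
r_rank = spearman(catalog_freq, search_freq)
```

Working on ranks removes the effect of the much larger frequencies on the catalog side, which is why the two coefficients can differ noticeably for the same set of terms.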
Currently a sample of search words is being analyzed for factors that affect the correlation, such as errors in both the catalog and searching, the effect of truncations, and the effect of stemming (which was done on both the catalog vocabulary and the search terms).
License
Copyright (c) 2022 Michael J. Nelson
This work is licensed under a Creative Commons Attribution 4.0 International License.