Correlation of term usage and term indexing frequencies

Authors

  • Michael J. Nelson, School of Library and Information Science, University of Western Ontario

DOI:

https://doi.org/10.29173/cais1293

Abstract

There have been several studies on the distributions of index terms, title terms, authors and other elements used in searching bibliographic databases. What is needed is to relate this information to the actual usage of terms in searching. This study uses data from monitoring the actual usage of terms in an online catalog at the School of Library and Information Science at the University of Western Ontario. Every time a term was used in a search expression, a count in the dictionary file was updated. If a word was not in the dictionary, it was added. In this way we can see which words are in the catalog but were never searched, and which were searched but were not in the dictionary. As a check against other studies, the rank distribution of terms used in searching was examined and found to be of a general Zipf type.
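The counting procedure described above can be illustrated with a minimal Python sketch. It is not the system used in the study: the names `record_search` and `loglog_slope`, the whitespace tokenization, and all frequencies are hypothetical. The Zipf check here is only a rough one, fitting a least-squares slope to log frequency against log rank, where a slope near -1 would suggest a Zipf-type distribution.

```python
import math
from collections import Counter


def record_search(dictionary, search_counts, expression):
    """Tally every term of one search expression (assumed whitespace-separated)."""
    for term in expression.lower().split():
        if term not in dictionary:
            # Searched but absent from the catalog: add it with catalog frequency 0.
            dictionary[term] = 0
        search_counts[term] += 1


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical catalog frequencies and search expressions, for illustration only.
catalog = {"library": 12, "index": 7, "zipf": 1}
searches = Counter()
for expression in ["library index", "library automation"]:
    record_search(catalog, searches, expression)

never_searched = sorted(t for t in catalog if searches[t] == 0)
not_in_catalog = sorted(t for t in catalog if catalog[t] == 0)

# An idealized Zipf sequence (frequency proportional to 1/rank) has slope -1.
ideal_slope = loglog_slope([120, 60, 40, 30, 24])
```

Keeping catalog frequencies and search counts in separate mappings keyed by the same term makes both comparisons (in the catalog but never searched, searched but not in the catalog) a single pass over the dictionary.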

The main interest was to check whether high-frequency terms in the catalog were also used frequently in searching. Several measures of this were tried. First, a scatterplot of frequency in the catalog versus frequency in searching was examined and Pearson’s correlation coefficient was calculated. The correlation was reasonably high at 0.74. Since the total number of terms and the frequencies were much larger in the catalog, the rank of each term in the catalog was then plotted against its rank in searching.
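For readers who want to reproduce the first measure, Pearson's correlation coefficient between paired catalog and search frequencies can be computed as in the sketch below. The function name `pearson` and the sample frequencies are hypothetical and are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical paired frequencies for the same five terms.
catalog_freq = [120, 45, 30, 8, 3]
search_freq = [30, 12, 5, 4, 0]
r = pearson(catalog_freq, search_freq)
```

Because Pearson's coefficient works on raw frequencies, a few very frequent terms can dominate it, which is one reason to compare ranks as well.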

The rank plot was very scattered, with less clustering at the origin than in the original plot of frequencies. The Spearman rank correlation coefficient was moderate at 0.58. Some studies have suggested that removing high-frequency and very low-frequency terms from the search vocabulary will improve retrieval performance. These data show that many general users actually use these terms in searching.
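The Spearman coefficient mentioned above is Pearson's coefficient applied to ranks rather than raw frequencies. The sketch below assumes that tied frequencies share the average of their rank positions, a common convention that the abstract does not specify. The function names and sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's coefficient computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical paired frequencies; two catalog terms are tied at 45.
rho = spearman([120, 45, 45, 3], [30, 2, 9, 0])
```

Low-frequency vocabularies contain many ties (for example, terms that occur once), so the tie-handling convention can noticeably affect the resulting coefficient.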

Currently a sample of search words is being analyzed for factors which affect the correlation, such as errors in both the catalog and searching, the effect of truncation, and the effect of stemming (which was done on both the catalog vocabulary and the search terms).

Published

2022-03-26

How to Cite

Nelson, M. J. (2022). Correlation of term usage and term indexing frequencies. Proceedings of the Annual Conference of CAIS / Actes Du congrès Annuel De l’ACSI. https://doi.org/10.29173/cais1293

Section

Articles