Article

Summon, EBSCO Discovery Service, and Google Scholar: A Comparison of Search Performance Using User Queries

Karen Ciccone
Director of the Natural Resources Library and Research Librarian for Science Informatics
North Carolina State University Libraries
Raleigh, North Carolina, United States of America
Email: kacollin@ncsu.edu

John Vickery
Analytics Coordinator & Collection Manager for Social Science
North Carolina State University Libraries
Raleigh, North Carolina, United States of America
Email: john_vickery@ncsu.edu

Received: 11 Dec. 2014  Accepted: 6 Feb. 2015
© 2015 Ciccone and Vickery. This is an Open Access article distributed under the terms of the Creative Commons Attribution-Noncommercial-Share Alike 4.0 International License (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly attributed, not used for commercial purposes, and, if transformed, the resulting work is redistributed under the same or similar license to this one.
Abstract
Objectives - To evaluate and compare the results produced by
Summon and EBSCO Discovery
Service (EDS) for the types of searches
typically performed by library users at
North Carolina State University. Also, to compare the performance of these products to
Google Scholar for the same types of searches.
Methods - A study was conducted to compare
the search performance of two web-scale discovery services: ProQuest’s Summon and
EBSCO Discovery Service (EDS). The performance of these services was also
compared to Google Scholar. A sample of 183 actual user searches, randomly
selected from the NCSU Libraries’ 2013 Summon search logs, was used for the
study. For each query, searches were performed in Summon, EDS, and Google
Scholar. The results of known-item searches were compared for retrieval of the
known item, and the top ten results of topical searches were compared for the
number of relevant results.
Results - There was no significant
difference in the results between Summon and EDS for either known-item or
topical searches. There was also no significant difference between the
performance of the two discovery services and Google Scholar for known-item
searches. However, Google Scholar outperformed both discovery services for
topical searches.
Conclusions - There was no significant
difference in the relevance of search results between Summon and EDS. Thus, any
decision to purchase one of those products over the other should be based upon
other considerations (e.g., technical issues, cost, customer service, or user
interface).
Introduction
The North Carolina State University (NCSU) Libraries
is a large academic library at a major land-grant research university, serving
over 34,000 students. Like many similar libraries, it has invested a significant
amount of money and staff time in implementing a web-scale discovery service
for its collections, and it continues to invest a significant amount of
resources in managing its discovery service. We therefore consider it important
to periodically evaluate competing products that could potentially provide a
less expensive and more effective replacement for our current service, Summon.
Like other web-scale discovery products, Summon
provides a pre-harvested central index allowing users to search across a
library’s book and journal holdings through a single search box. At the NCSU
Libraries, a Web-Scale Discovery Product Team tests and evaluates ongoing
upgrades to Summon and provides critical feedback to the vendor, ProQuest. It
also investigates alternatives to the Libraries' current discovery service and
reference linking products and makes recommendations for changes or upgrades as
needed. The Web-Scale Discovery Product Team is composed of nine librarians
representing the library’s public services, technical services, collection
management, and information technology departments.
The NCSU Libraries purchased the Summon Discovery
Service in 2009, at which time there were few competitors on the market, none
offering all of the features of Summon. Specifically, we needed a product that
had an application programming interface (API) that could be used to populate
the Articles portion of our QuickSearch application
(http://search.lib.ncsu.edu/). A search in this tool presents separate results
for Articles, Books & Media, Our Website, and other categories of
information (Figure 1; see also Lown, Sierra, &
Boyer, 2013). In 2009, Summon was the only discovery service with this feature.
Since then, other products, in particular EBSCO Discovery Service (EDS), have
added an API that can be used this way. This and other developments led the
Web-Scale Discovery Product Team to decide that EDS warranted fresh
investigation and comparison to Summon. We obtained trial access to EDS in April and May of 2014, and this study was conducted during that period.
Literature Review
Ellero (2013) offers a literature review on the evaluation and assessment of
web-scale discovery services. This literature focuses primarily on usability
studies and criteria for choosing a web-scale discovery service, with little or
no emphasis on search performance. There are few studies specifically comparing
the search performance of web-scale discovery services with each other, or with
Google Scholar. Of those, most base their evaluation of search performance on a
very small sample of searches (e.g., Timpson & Sansom,
2011; Zhang, 2013). The studies below represent more extensive attempts to
compare the search performance of these products.
Figure 1
The NCSU Libraries’ QuickSearch interface.
Asher, Duke, and Wilson (2013) compared the search
performance of Google Scholar, Summon, and EDS. In their study, quality was
judged on the basis of whether each article was from a scholarly source or from
“non-peer-reviewed newspapers, magazines, and trade journals” (p. 470).
Performance scores were given to each product based on librarian quality
ratings of the resources selected by test subjects, who had been asked to
perform typical search tasks. Using this methodology, the authors found that
EDS produced the “highest quality” results.
While article quality is important, there is more than
one way to ensure that library patrons receive scholarly results. At the NCSU
Libraries, we pre-filter our users’ Summon results to include only journal
articles and book chapters. Thus, the relevance – i.e., the degree of
relatedness to the topic being searched – of the remaining results is a more
important factor for our users. There is therefore a need for a study comparing
the relevance of results of various web-scale discovery products, regardless of
format or peer review.
While the Asher et al. (2013) study was based on searches
performed by test subjects, more accurate assessments of user behavior and
experience can be made using actual queries from search logs. Tasks given in
test situations are artificial, and the behavior of test subjects is influenced
by the testing situation. In contrast, search logs contain the searches library
users actually perform. For this reason, search log data are likely to provide
better information about search performance as experienced by users in their
day-to-day use of web-scale discovery services.
Rochkind (2013) compared user preference for search results produced by EDS,
Summon, EBSCOhost “Traditional” API, Ex Libris Primo,
and Elsevier’s Scopus. His survey tool allowed subjects to enter search terms
of their own choosing and view side-by-side results, within the survey window,
for two randomly chosen products. Each product was configured to exclude
non-scholarly content. Users were asked to indicate which set of results they
preferred, with an option for “Can’t Decide/About the Same.” The study found no
significant difference in preference between products, with the exception of
Scopus (which was less preferred).
As with the Asher et al. study, the artificial testing situation and the use of test subjects in the Rochkind study are problematic. As Rochkind (2013) noted,
When I
experimented with the evaluation tool, I found that if I just entered a
hypothetical query, I really had no way to evaluate the results. I needed to
enter a query that was an actual research question I had, where I actually
wanted answers. Then I was able to know which set of results was better.
However, when observing others using the evaluation tool, I observed many
entering just the sort of hypothetical sample queries that I think are hard to
actually evaluate realistically.
Rochkind also noted that,
This issue does
not apply to known-item searches, where either the item you are looking for is
there at the top of the list, or it isn’t. Looking through the queries entered
by participants, there seem to be very few ‘known item’ searches (known
title/author), even though we know from user feedback that users want to do
such searches. So the study may not adequately cover this use case.
The use of search log data solves both of these
problems by providing queries representing users’ actual research questions as
well as a sample of searches that accurately reflects the relative frequency of
known-item to topical searches.
Objectives
Our primary objective was to answer the question of
whether Summon or EDS produced better results for the types of searches
typically performed by our users. We know, from examining user search logs, that about 24% of the searches performed by our users
are for known items – specific articles or books that the user attempts to
retrieve through title keywords, a combination of title and author keywords, or
a pasted-in citation. We also know that about 74% of the searches performed by
our users are topical, with subjects often defined through the use of only two
or three keywords (Table 1). Approximately 52% of the topical queries in our
sample, or 42% of the total sample, used three or fewer words. Known-item and
topical searches represent very different use cases and present very different
challenges for a discovery layer. We therefore wanted to separately evaluate
how well Summon and EBSCO performed for each of these types of searches.
While discovery services offer several advantages over
Google Scholar (e.g., API available, ability to save and email results, ability
to limit to peer-reviewed articles), we know that the latter is the go-to search tool for many researchers. For this reason, a secondary objective of
this study was to compare the search performance of Summon and EDS to that of
Google Scholar. Google Scholar’s terms of use do not allow for its results to
be presented in a context outside of Google Scholar, so replacing our discovery
service with Google’s free service is not currently tenable. Nonetheless, we
were interested to know whether or not our users have as much success searching
with our discovery service as they do when searching Google Scholar. We hoped
that our discovery service would perform at least as well as, if not better
than, Google Scholar, for the types of known-item and topical searches our
users typically perform.
Table 1
Examples of Known-item and Topical Search Queries From our Sample

Known-item search queries | Topical search queries
Personal Characteristics of the Ideal African American Marriage Partner A Survey of Adult Black Men and Women | adderall edema
National Cultures and Work Related Values The Hofstede Study | solar power coating nanoparticle
Adalimumab induces and maintains clinical remission in patients with moderate- to-severe ulcerative colitis | experimentation on animals
hill bond mulvey terenzio | conjugated ethylene uv vis
Sullivan A, Nord CE (2005) Probiotics and gastrointestinal diseases. J Intern Med 257: 78–92 | religiosity among phd
Bryan; Griffin et al; Tierney and Jun | sleep deprivation emotional effects
Bone graft substitutes Expert Rev Med Devices 2006 | Czech underground
Ann. Appl. BioI. 33: 14-59 | toxicology capsaicin
Methods
In order to truly measure how well the products
performed for our users, we used actual user search queries from our Summon search
logs. These searches sometimes contained typos, punctuation errors, extraneous
words, and other characters. They were also sometimes overly broad or otherwise
problematic from the standpoint of obtaining a useful list of article results
(Table 2). We kept these search terms as they were entered, both in order to
compare how well the two products dealt with the types of errors typically
found in user searches and to get a fair idea of our users’ actual search
experiences.
Table 2
Examples of Problematic Search Queries From our Sample

Problematic search queries:
Compendium of Transgenic Crop Gplants
suicide collegte
the over use of vaccinations
new class of drugs patent
technology behind online gaming
disney world facts
divorce effects on childreen
abortion is not morally permissible
By using a random sample of queries from our Summon
search logs, we hoped to create a dataset truly representative of our users’
searches, both in terms of the ratio of known-item to topical searches and in
the types of known-item and topical searches entered. The dataset also reflects
the range of disciplines represented by our users.
A computer-generated random sample of 225 search
queries was obtained from the approximately 664,000 Summon searches performed
between January 1 and December 31, 2013. The sample was obtained using PROC
SURVEYSELECT in SAS software. The sample size of 225 was deemed large enough to
be representative of the population yet small enough to be manageable by the
team doing the testing. The nine members of the Web-Scale Discovery Product
Team were each given a portion of this sample (25 queries) to analyze. Two team
members were unable to complete their portion of the testing, and two queries
in the sample were found to be uninterpretable, resulting in 183 queries being
tested.
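The sampling step itself is straightforward to reproduce. As a hedged illustration, the Python sketch below mimics the PROC SURVEYSELECT draw described above; the log file name, its column layout, and the seed are hypothetical placeholders rather than the authors' actual artifacts.

```python
# Illustrative re-creation of the sampling step (the authors used
# PROC SURVEYSELECT in SAS). File name, column layout, and seed are
# hypothetical, not taken from the study.
import pandas as pd

# One row per Summon search performed in 2013 (~664,000 rows).
logs = pd.read_csv("summon_search_logs_2013.csv")

# Simple random sample of 225 queries; fixing the seed makes the draw
# reproducible, analogous to the SEED= option in PROC SURVEYSELECT.
sample = logs.sample(n=225, random_state=42)
sample.to_csv("query_sample.csv", index=False)
```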
Team members classified queries as topical search queries or known-item search queries. For each query, team members performed the search in Summon, EDS, and Google Scholar and entered the data into a spreadsheet. The spreadsheets were then combined and the data were analyzed using SAS software.
For topical
search queries, the number of relevant results within the first ten results
was recorded. Team members were instructed to consider a result relevant if it
matched the presumed topic of the user’s search. Relevance was judged based on
information in the title and abstract only.
For known-item
search queries, team members coded “yes” or “no” responses to the following
questions:
● Did you find the item?
● Was it in the top three results?
Two versions of the analysis were performed, in order
to compare Summon directly to EDS as well as to compare both discovery services
to Google Scholar. The first analysis compared only the Summon and EDS data. The second analysis also included the Google
Scholar data.
For the Summon and EDS
comparison of performance for topical
search queries, we used three methods. We graphically compared the
distribution of the number of relevant results. We performed a matched pair
t-test to assess whether the mean numbers of relevant results for the two
discovery services were statistically different. Lastly, we performed a
bootstrap permutation test to compare the means of the paired data.
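To make the two significance tests concrete, here is a minimal Python sketch of both, assuming a pair of equal-length sequences holding the per-query relevant-result counts (0-10) for each service; the function and variable names are illustrative, and the authors ran their tests in SAS.

```python
# Sketch of the matched-pair t-test and the paired permutation test.
# Inputs are per-query relevant-result counts for each service.
import numpy as np
from scipy import stats

def paired_tests(summon, eds, n_resamples=10_000, seed=0):
    summon, eds = np.asarray(summon), np.asarray(eds)
    diffs = summon - eds

    # Matched-pair t-test: is the mean within-query difference zero?
    t_stat, t_p = stats.ttest_rel(summon, eds)

    # Permutation test: under the null, the Summon/EDS labels are
    # exchangeable within each query, so randomly flipping the sign of
    # each paired difference simulates the null distribution of the mean.
    rng = np.random.default_rng(seed)
    observed = diffs.mean()
    signs = rng.choice([-1, 1], size=(n_resamples, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    perm_p = np.mean(np.abs(null_means) >= abs(observed))

    return t_stat, t_p, observed, perm_p
```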
For the Summon and EDS
comparison of performance for known-item
search queries, we graphically compared the number of found known items.
For the comparison including Google Scholar, the topical search queries analysis was
expanded to include a permutation test for repeated measures analysis of
variance and pairwise permutation tests for comparing the means of the paired
data. The known-item search queries
analysis consisted of a graphical comparison of the number of found known items
and a Mantel-Haenszel analysis to examine the
relationship between discovery product and success of a known-item search.
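The repeated measures permutation test can likewise be sketched briefly. The version below follows the general logic of the approach in Howell (2006) rather than the authors' SAS code: the three product scores for each query are shuffled within that query, and the ANOVA F-statistic is recomputed for each resample. The (n queries × 3 products) array layout is an assumption for illustration.

```python
# Sketch of a permutation test for repeated-measures ANOVA. `scores`
# is an (n_queries x 3) array of relevant-result counts, one column
# each for Summon, EDS, and Google Scholar (illustrative layout).
import numpy as np

def rm_anova_f(scores):
    # F-statistic for the product effect in a repeated-measures design.
    n, k = scores.shape
    grand = scores.mean()
    ss_treat = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_treat - ss_subj
    return (ss_treat / (k - 1)) / (ss_err / ((n - 1) * (k - 1)))

def rm_permutation_test(scores, n_resamples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = rm_anova_f(scores)
    null_f = np.empty(n_resamples)
    for i in range(n_resamples):
        # Shuffle the product labels independently within each query.
        shuffled = np.apply_along_axis(rng.permutation, 1, scores)
        null_f[i] = rm_anova_f(shuffled)
    return observed, np.mean(null_f >= observed)
```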
Results
Summon and EDS comparison for topical searches
Of the 183 queries in our sample, 137 were classified
as topical search queries. Graphical comparison of the distributions of the number
of relevant results from Summon and EDS shows similar performance (Figure 2).
While Summon had a greater number of queries with ten relevant results, it also
had a greater number of queries with zero relevant results.
Figure 2
Frequency distribution of the number of relevant
results from EDS and Summon.
As each topical search query was tested in both Summon and EDS, the number of relevant results for each query was treated as a matched pair. The matched-pair t-test assesses whether the mean difference between the paired samples differs significantly from zero (“t test for related samples,” 2004). There was no significant difference in the mean number of relevant results for EDS (M=4.83, SD=3.62) and Summon (M=4.76, SD=3.81); t(137)=0.26, p=0.7924. Figure 3 shows the
distribution of the difference in the mean number of relevant results between
Summon and EDS. A post-hoc power analysis using SAS software showed that the sample size of N=137 was sufficient to detect a mean difference of at least 1 relevant result with power (1 - β) > 0.99, well above the conventional threshold of 0.80.
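For readers who want to replicate a power calculation of this kind, the sketch below uses statsmodels rather than SAS. The standard deviation of the paired differences is not reported in the paper, so the value used here is an assumed placeholder, and the resulting power will not exactly reproduce the figure above.

```python
# Hedged post-hoc power calculation for a paired t-test. The standard
# deviation of the paired differences is ASSUMED for illustration.
from statsmodels.stats.power import TTestPower

sd_diff = 3.0                  # assumed, not taken from the study
effect_size = 1.0 / sd_diff    # mean difference of 1 relevant result

power = TTestPower().power(effect_size=effect_size, nobs=137,
                           alpha=0.05, alternative="two-sided")
print(f"power = {power:.3f}")
```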
In order to account for possible violation of
assumptions for the t-test, we performed a bootstrap permutation test (Good,
2005; Anderson, 2001). The permutation test also showed that there was no
significant difference in the mean number of relevant results for Summon and
EDS. Figure 4 shows that the null distribution generated from 10,000 resamples of the Summon and EDS differences agrees with the matched-pair t-test.
Figure 3
Distribution of the difference in the mean number of
relevant results between EDS and Summon shows no significant difference, with
95% confidence interval for mean.
Figure 4
Bootstrap distribution under null hypothesis for
10,000 resamples shows that the observed difference in the mean number of
relevant results between Summon and EDS was not significant.
Summon and EDS comparison for known-item searches
Forty-four queries in our sample were classified as
known-item search queries. Between Summon and EDS, the frequency of items found
and not found was exactly equal (Figure 5).
The team also recorded whether or not a found known
item was returned within the top three results for each discovery service.
Summon and EDS performed nearly identically here as well. All but one of the
found known items for Summon and all but two for EDS were in the top three
results.
Topical search comparison including Google Scholar
As with the analysis of only Summon and EDS results, graphical comparison of the distributions of
the number of relevant results across all three products shows similar
performance. Figure 6 shows the frequency of the number of relevant results for
Summon, EDS, and Google Scholar.
Google Scholar had the highest number of queries with
ten relevant results. It also had the lowest number of queries with zero
relevant results.
The mean number of relevant results for each product
is listed in Table 3.
Figure 5
Frequency of known items found for EDS and Summon.
Figure 6
Frequency distribution of the number of relevant
results from EDS, Google Scholar, and Summon.
Table 3
Mean Number of Relevant Results for Each Discovery Product

Discovery Product | Mean number of relevant results
Summon | 4.76
EDS | 4.83
Google Scholar | 5.68
Figure 7
Bootstrap distribution of the F-statistic under the
null hypothesis for 10,000 resamples indicates an overall difference between
the mean numbers of relevant results for EDS, Google Scholar, and Summon.
A permutation test for repeated measures analysis of
variance was used to detect any overall difference between the three related
means (Good, 2005; Howell, 2006). Ten thousand simulations of the F-statistic
indicate that there is an overall difference (Figure 7).
Given the indication of an overall difference in the
mean number of results between the three products, pairwise permutation tests
were done to confirm where the difference occurred. These tests compared Summon
to EDS, EDS to Google Scholar, and Summon to Google Scholar. Given the
agreement between the matched-pair t-test and permutation test for Summon and
EDS, and the potential violation of t-test assumptions, we felt the permutation
tests alone would be appropriate for pairwise comparisons of Summon, EDS, and
Google Scholar. As with the original comparison between Summon and EDS, there
was no significant difference in the mean number of relevant results between
those two products. There was, however, a significant difference between the
mean number of relevant results for Google Scholar and both EDS and Summon.
In the observed data, Google Scholar outperformed EDS
by an average of 0.85 relevant results. As shown in Figure 8, 10,000 simulations
of the data indicate that it is highly unlikely that this difference was due to
chance alone.
Figure 8
Bootstrap distribution
under the null hypothesis for 10,000 resamples shows that the observed
difference in the mean number of relevant results between Google Scholar and
EDS was significant.
In the observed data, Google Scholar outperformed
Summon by an average of 0.91 relevant results. As with the EDS and Google
comparison above, 10,000 simulations of the data indicate that it is highly
unlikely that the difference was due to chance alone (Figure 9).
Figure 9
Bootstrap
distribution under the null hypothesis for 10,000 resamples shows that the
observed difference in the mean number of relevant results between Google
Scholar and Summon was significant.
Known-item search comparison including Google Scholar
As shown in Figure 10, the proportion of known items found by Summon, EDS, and Google Scholar was essentially the same.
The team
also recorded whether or not a found known item was returned within the top
three results for each product. All three performed nearly identically in this
regard. All but two of the found known items were in the top three results for
EDS and Google Scholar, and all but one was in the top three for Summon.
Figure 10
Frequency of
known items found for EDS, Google Scholar, and Summon.
Controlling for the sampled query, no significant difference was found among the Summon, EDS, and Google Scholar success rates, χ2 (2, N = 132) = 0.08, p = 0.96. However, the small sample size for known items (n=44) means that our test did not have the power to detect small differences (<40%) in the performance of the products.
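The authors' stratified Mantel-Haenszel analysis is not trivial to reproduce for three products, but Cochran's Q test is a closely related and readily available check for matched binary outcomes across several conditions. The sketch below uses it as a plainly named substitute technique; the 0/1 data file and its (44 × 3) layout are hypothetical.

```python
# Cochran's Q test as an accessible stand-in for the stratified
# Mantel-Haenszel analysis: one row per known-item query, one 0/1
# column each for Summon, EDS, and Google Scholar (hypothetical file).
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

found = np.loadtxt("known_item_results.csv", delimiter=",")
result = cochrans_q(found, return_object=True)
print(result)  # Q statistic, degrees of freedom, and p-value
```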
Discussion
Very few
studies have compared the search performance of web-scale discovery services,
such as Summon and EDS, to each other or to Google Scholar. Of those that have
been conducted, most base their evaluation on a very small sample of searches,
and all rely on the use of test subjects in artificial testing situations. This
study contributes to our knowledge of the comparative search performance of
these products by using a large number of actual user search queries as the
basis for analysis.
The
relevance of search results is of primary importance in comparing the
performance of search engines. While there are ways to pre-filter search
results to ensure that patrons receive scholarly results, all web-scale
discovery products must deliver results that patrons recognize as related to
the topic of their query. By focusing our evaluation of topical queries upon
relevance, this study fills a need for comparative information about the
relevance ranking algorithms of various web-scale discovery products and Google
Scholar.
Our analysis
showed no significant difference in the search performance of Summon and EDS
for either topical or known-item searches. The mean number of relevant results for topical searches did not differ significantly, and the number of known items found was identical. Any decision to purchase one product or the other, therefore, should be
based upon other considerations (e.g., technical issues, cost, customer
service, or user interface).
Google
Scholar performed similarly to Summon and EDS for known-item searches, but
outperformed both discovery products for topical searches. This finding has
implications for how users may perceive the effectiveness of Google Scholar in
comparison to purchased library databases.
In our
study, we looked only at the top ten results for each product. This focus is
justified by studies showing that users of library databases rarely look beyond
the first page of results for information (e.g., Asher, Duke, & Wilson, 2013). Our own knowledge of user behavior corroborates this finding.
Click-through statistics for the Articles portion of QuickSearch
show that 57% of users click on the first result in that module, and that 74%
click on one of the top three results. Only 21% of users click on the “see all
results” option.
Our methodology
required members of the research team to make educated assumptions about what
users were actually looking for when they entered the search terms in our
sample into Summon. It also required them to judge whether each search result
was relevant in relation to the presumed search topic. This methodology is
similar to that used by Google’s search evaluators to improve its relevance
ranking algorithm (Google, 2012). While this methodology introduced a certain
amount of subjectivity into this study, the effect on the results was likely
small. In practice, it was generally easy to interpret the intent behind each
search query, and only two uninterpretable search queries were removed from the
sample. (See examples of search queries in Tables 1 and 2). It was also
generally easy to decide whether a specific search result was relevant (i.e.,
on topic). For subsequent studies, the authors would suggest including a
measure of intercoder reliability.
A limitation
of this study was the small sample size for known item queries. While the
proportion of known item queries in our random sample (24%) matched our
expectations, it resulted in a sample size of only n=44 for known items. For
subsequent studies, the authors suggest using a larger initial sample size in
order to obtain a sample of known item queries large enough to discriminate
small performance differences between the products.
While the
relevance of results is an important search engine evaluation criterion, it is
useful to keep in mind that other factors could be of equal or greater
importance to our users. Our study did not take into consideration other
potential advantages or disadvantages of Google Scholar, e.g., its familiar and
clean user interface, lack of ability to limit to peer-reviewed articles, or
inability to pull results into the library’s QuickSearch
interface. Similarly, our study did not take into consideration interface
design, usability, and feature differences between Summon and EDS.
Unlike most
institutions subscribing to a web-scale discovery product, the NCSU Libraries
does not use Summon to create a single search box for articles, books, and
other formats of material. Instead, it uses Summon primarily to help novice
library users find scholarly articles. Of key importance to us is the ability
to use Summon to populate the Articles portion of our QuickSearch
interface. Because the majority of our users search Summon through QuickSearch, over half (72%) never even see the Summon interface. Because of this, the relevance of the top three Summon results is
particularly important to us, and we will continue to evaluate products that
could potentially provide the same functionality at lower cost.
Conclusion
A study was
conducted to compare the search performance of two web-scale discovery services,
ProQuest’s Summon and EBSCO Discovery Service (EDS). The performance of these
services was also compared to Google Scholar. A sample of 183 actual user
searches, randomly selected from the NCSU Libraries’ 2013 Summon search logs,
was used for the study. There was found to be no significant difference in
performance between Summon and EDS for either known-item or topical searches.
There was also no significant performance difference between the two discovery
services and Google Scholar for known-item searches. However, Google Scholar
outperformed both discovery services for topical searches. Because there was no
significant difference in the search performance of Summon and EDS, any
decision to purchase one product or the other should be based upon other
considerations (e.g., technical issues, cost, customer service, or user
interface).
References
Anderson, M. (2001). Permutation
tests for univariate or multivariate analysis of variance and regression. Canadian Journal of Fisheries & Aquatic
Sciences, 58(3), 626-639. http://dx.doi.org/10.1139/cjfas-58-3-626
Asher, A. D., Duke, L. M., & Wilson, S. (2013). Paths of discovery: Comparing the search effectiveness of EBSCO Discovery Service, Summon, Google Scholar, and conventional library resources. College & Research Libraries, 74(5), 464-488.
Ellero, N. (2013). An unexpected discovery: One library's
experience with web-scale discovery service (WSDS) evaluation and assessment. Journal
of Library Administration, 53(5/6), 323-343.
Good, P. (2005). Permutation, parametric and bootstrap tests
of hypotheses (3rd ed.). New York: Springer.
Google. (2012). Search quality rating guidelines
version 1.0. In Inside Search: How Search
Works. Retrieved 2 March 2015 from https://static.googleusercontent.com/media/www.google.com/en/us/insidesearch/howsearchworks/assets/searchqualityevaluatorguidelines.pdf
Howell, D. C. (2006). Repeated measures analysis of variance via
randomization. Retrieved 2
March 2015 from https://www.uvm.edu/~dhowell/StatPages/Resampling/RandomRepeatMeas/RepeatedMeasuresAnova.html
Lown, C., Sierra, T., & Boyer, J.
(2013). How users search the library from a single search box. College & Research Libraries, 74(3),
227-241. http://dx.doi.org/10.5860/crl-321
Rochkind, J. (2013). A comparison of article
search APIs via blinded experiment and developer review. Code4Lib Journal, 19. Retrieved from http://journal.code4lib.org/articles/7738
t test for related samples. (2004).
In D. Cramer and D. Howitt (Eds.), The
SAGE dictionary of statistics (p. 168). London, England: SAGE Publications,
Ltd. Retrieved from http://srmo.sagepub.com/view/the-sage-dictionary-of-statistics/SAGE.xml
Timpson, H., & Sansom,
G. (2011). A student perspective on e-resource discovery: Has the Google factor
changed publisher platform searching forever? The Serials Librarian, 61(2), 253-266.
Zhang, T. (2013). User-centered
evaluation of a discovery layer system with Google Scholar. In Design, user experience, and usability. Web,
mobile, and product design (pp. 313-322). Berlin: Springer.