Can database-level MEDLINE exclusion filters in Embase and CINAHL be used to remove duplicate records without loss of relevant studies in systematic reviews? An exploratory study

Objective: To investigate whether using database filters to remove MEDLINE records within Embase (OVID) and CINAHL (EBSCO) would result in fewer duplicate records without any loss of studies included in the final review. Methods: We reviewed the included studies from a sample set of 20 Cochrane Reviews published in 2015-2018, and replicated the search strategies from those reviews in MEDLINE and Embase (both on the OVID platform) and CINAHL (EBSCO). Results were exported to EndNote; then the relevant MEDLINE filters were applied within CINAHL and Embase, and results were exported again. Filtered results were analysed to determine whether the filters excluded included studies that were not identified in the original MEDLINE search. Results: Using the “Records from: Embase” filter resulted in no loss of included studies; however, the “Exclude MEDLINE journals” filter in Embase resulted in a failure to retrieve a large number of relevant studies. CINAHL’s filter for MEDLINE records resulted in a small number of studies being lost. Conclusions: The “Records from: Embase” filter may be safely used for deduplication, though as it removes conference abstracts, searchers may also want to review conference abstracts separately using the Conferences filter. CINAHL’s MEDLINE filter comes with a small risk of filtering out relevant studies, but may be appropriate to use. Although we did not set out to address this issue, our results demonstrate that searches of MEDLINE are still necessary to identify all relevant studies, as not all relevant results were found in Embase alone.


Introduction
Searching multiple databases for systematic reviews and other evidence syntheses results in the retrieval of large numbers of duplicate and irrelevant records. Clinically requested or other in-depth medical literature searches (not for evidence synthesis) may also require searching more than one database. Deduplication and screening of such records is time consuming, and expert searchers seek ways to reduce the number of duplicates without accidentally excluding relevant references.
Embase in particular is often the focus of these efforts. Embase uses the Emtree Thesaurus, which contains over 71 000 preferred terms (subject headings) and over 300 000 synonyms, compared to over 27 000 terms and around 220 000 synonyms in MEDLINE's MeSH Thesaurus [1]. Elsevier has also increased the number of subject headings it assigns to records in Embase [2], which increases the number of articles, both relevant and irrelevant, retrieved by a systematic search.
Various approaches have been proposed to reduce the number of irrelevant results and duplicate records between databases. It might seem logical to assume that, because Embase includes the MEDLINE database, it is not necessary to search MEDLINE separately; however, Bramer et al. found that neither Embase nor MEDLINE alone retrieves all included references in a test set of systematic reviews [3]. Others have investigated whether adequate recall is maintained if searchers focus (i.e., limit to major headings) their subject terms in Embase and/or MEDLINE. Glanville [6]. They found this option to be ineffective for deduplication, because it led to a high number of false positives, based on a sample set of search results from multiple databases, generated using a search strategy from an existing systematic review. They also acknowledged limitations, including the fact that they only used 2 "gold standard" sets of records from a single systematic review project. They also identified further directions for research on deduplication, including the use of in-database options. Our study builds on their research by investigating MEDLINE record filters using a different method.
We sought to investigate whether using MEDLINE record filters within Embase and CINAHL would reduce the number of retrieved records without leading to loss of unique records that were relevant to the review (i.e., included studies). Our hypothesis was that if a MEDLINE search strategy is highly sensitive, then filtering out MEDLINE records in other databases should not compromise the results obtained. To answer this question, we re-ran the searches from 20 Cochrane Reviews in all 3 databases. We then applied filters in each database and compared filtered to unfiltered results.

Methods
We searched the Cochrane Database of Systematic Reviews via the Ovid interface on November 29, 2018, looking for studies that contained the terms "Embase" and "CINAHL" in the abstract in order to identify reviews that searched both the CINAHL and Embase databases (we assumed MEDLINE would be used in all reviews). We sorted by update code and selected the first 20 reviews that met the following criteria:
• Search strategies were fully described in text format (as opposed to screen captures) to allow for copying and pasting.
• The reported search strategies used both subject headings and keywords.
• The review selected at least 20 and not more than 100 studies for inclusion. The lower end of 20 was selected to ensure a large enough result set, while the upper limit of 100 was set for the purposes of our own time management.
• The Ovid interface was used for the MEDLINE and Embase searches.
The Cochrane reviews included in our sample were published between 2015 and 2018. As this study is exploratory in nature, we did not aim to select reviews from different disciplines or types of interventions, though our set included reviews of drug therapies, devices, patient communication interventions, and complementary and alternative therapies, among others. A complete list of the Cochrane Reviews used in our study can be found in Appendix A.
The search strategies were copied line by line from the original Cochrane Review and re-run in each of MEDLINE (OVID), Embase (OVID) and CINAHL (EBSCO). Results were exported to EndNote. The process of re-running the searches and exporting the records was carried out in December 2018 and January 2019.
The CINAHL searches were then re-run with the limit "Exclude MEDLINE records," which filters out records containing a PMID [7], and the results were exported to EndNote.
The Embase searches were re-run separately with each of the following limits:
• Records from: Embase (excludes MEDLINE records that have been imported into Embase, and also excludes conference abstracts)
• Exclude MEDLINE Journals (excludes any records, whether MEDLINE or Embase, if the journal is indexed in MEDLINE)
The filtered results were also exported to EndNote. For each review, unfiltered and filtered results were analyzed to determine whether they contained all the references identified as relevant, i.e. the references to the review's included studies.
For each Cochrane review, we created a list of the included studies (found in the "references to studies included in the review" section of the Cochrane review), and then checked for each reference in each of the 6 data sets (MEDLINE records, Embase records unfiltered, Embase records using the "Exclude MEDLINE journals" filter, Embase records using the "Records from: Embase" filter, CINAHL records unfiltered, and CINAHL records using the "Exclude MEDLINE" filter). If the record for an included study was found in a data set, it was logged as a 1, and if it was not found, it was logged as a 0. We then compared the results across filtered versus unfiltered searches. We were particularly interested in references to included studies that were not found in the MEDLINE search, as these were the studies most likely to be impacted by the use of the MEDLINE exclusion filters in Embase or CINAHL.
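The logging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set labels, the study identifiers, and the helper names `presence_row` and `lost_by_filter` are hypothetical and do not come from the study's own data or tooling.

```python
from typing import Dict, List, Set

# Hypothetical labels for the 6 data sets described in the text.
DATASETS = [
    "medline",
    "embase_unfiltered",
    "embase_exclude_medline_journals",
    "embase_records_from_embase",
    "cinahl_unfiltered",
    "cinahl_exclude_medline",
]


def presence_row(study_id: str, retrieved: Dict[str, Set[str]]) -> Dict[str, int]:
    """Log 1 if the included study appears in a data set, otherwise 0."""
    return {name: int(study_id in retrieved[name]) for name in DATASETS}


def lost_by_filter(
    included: List[str],
    retrieved: Dict[str, Set[str]],
    unfiltered: str,
    filtered: str,
) -> List[str]:
    """Included studies missed by MEDLINE, found unfiltered, but gone once filtered."""
    return [
        s
        for s in included
        if s not in retrieved["medline"]
        and s in retrieved[unfiltered]
        and s not in retrieved[filtered]
    ]


# Toy example with made-up study identifiers (not data from the study).
included = ["s1", "s2", "s3", "s4"]
retrieved = {
    "medline": {"s1"},
    "embase_unfiltered": {"s1", "s2", "s3"},
    "embase_exclude_medline_journals": {"s3"},
    "embase_records_from_embase": {"s2", "s3"},
    "cinahl_unfiltered": {"s1", "s4"},
    "cinahl_exclude_medline": {"s4"},
}

matrix = {s: presence_row(s, retrieved) for s in included}
lost_journal = lost_by_filter(
    included, retrieved, "embase_unfiltered", "embase_exclude_medline_journals"
)
lost_record = lost_by_filter(
    included, retrieved, "embase_unfiltered", "embase_records_from_embase"
)
```

In this toy data, `s2` is the only study that MEDLINE missed and the journal-level filter then removed, which is the kind of loss the comparison is meant to surface.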
Conference abstracts presented a challenge, as they were often included in the references to included studies. We excluded conference abstracts that were clearly labelled as such from our analysis, unless a conference abstract was the only reference to a study, as we assumed that a published article would provide the fullest account of the study in question. Where both a conference abstract and a journal article were listed in the "references to studies included in the review," the journal article was starred to indicate that it was the source of the data. In some cases, it was not clear whether a publication was an article or conference abstract. We tagged these as suspected conference abstracts, based on whether they were published in supplementary issues, or were only 1-2 pages long. We then verified manually that these were conference abstracts, and then excluded them and reanalyzed our data. The rationale for doing so was that Embase indexes records with one of three statuses: Embase, MEDLINE or Conference Abstract. Records tagged as Conference Abstracts are excluded from the "Records from: Embase" limit.
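The tagging rule for suspected conference abstracts (published in a supplementary issue, or only 1-2 pages long) could be expressed as a simple flag, as in the sketch below. The `Reference` fields and the substring test for "suppl" are assumptions made for illustration; in the study the flagged items were then verified manually, which this sketch does not replace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reference:
    """Hypothetical minimal citation record used only for this sketch."""
    title: str
    issue: Optional[str]        # e.g. "3" or "Suppl 1"
    start_page: Optional[int]
    end_page: Optional[int]


def is_suspected_conference_abstract(ref: Reference) -> bool:
    """Flag references in supplement issues or spanning only 1-2 pages."""
    in_supplement = ref.issue is not None and "suppl" in ref.issue.lower()
    is_short = (
        ref.start_page is not None
        and ref.end_page is not None
        and 0 <= ref.end_page - ref.start_page <= 1
    )
    return in_supplement or is_short


# Made-up references for illustration only.
supplement_item = Reference("Abstract A", "Suppl 1", 10, 30)
short_item = Reference("Abstract B", "3", 100, 101)
full_article = Reference("Article C", "3", 100, 112)
unknown_pages = Reference("Item D", None, None, None)
```

A flag of this kind only narrows the list to check by hand; short letters or brief reports would also be flagged, so manual verification remains necessary.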

Results
Of the 923 studies included in the 20 Cochrane reviews we analyzed, 54 were found only in MEDLINE, 54 were found only in Embase, 10 studies were unique to CINAHL, and 132 included studies were not found in any of these 3 databases (see Figure 1).
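Counts of this kind (found only in one database, or in none) follow from simple set membership tests. The sketch below shows one way to compute them; the function name `overlap_summary` and the toy identifiers are hypothetical, and the numbers are not the study's data.

```python
from typing import Dict, Iterable, Set


def overlap_summary(
    included: Iterable[str],
    medline: Set[str],
    embase: Set[str],
    cinahl: Set[str],
) -> Dict[str, int]:
    """Count included studies found in exactly one database, several, or none."""
    counts = {
        "only_medline": 0,
        "only_embase": 0,
        "only_cinahl": 0,
        "multiple": 0,
        "not_found": 0,
    }
    sources = (("medline", medline), ("embase", embase), ("cinahl", cinahl))
    for study in included:
        hits = [name for name, results in sources if study in results]
        if not hits:
            counts["not_found"] += 1
        elif len(hits) == 1:
            counts["only_" + hits[0]] += 1
        else:
            counts["multiple"] += 1
    return counts


# Toy example with made-up identifiers.
summary = overlap_summary(
    included=["a", "b", "c", "d", "e", "f"],
    medline={"a", "b"},
    embase={"b", "c", "d"},
    cinahl={"d", "e"},
)
```

Note that a study found in both Embase and CINAHL but not in MEDLINE falls under "multiple" here, so it is not unique to either database even though the MEDLINE search missed it.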

Fig. 1: Aggregate data for all 20 Cochrane articles' "References to Included Studies"
As shown in Figure 2, 12 of the reviews we examined contained both unique MEDLINE references not found in another database, and unique Embase records not found in any other database. Four reviews (numbered 1, 11, 13, and 14 in Figure 2) had unique Embase records, but no unique MEDLINE records, and 4 others (numbered 2, 12, 15, and 16) had unique entries from MEDLINE but none from Embase. Our results (Figure 2) confirm Bramer et al.'s findings that searching Embase on its own does not retrieve all articles indexed in MEDLINE.
Since the filters are designed to remove records that are present in MEDLINE, we expected that the filtered results would lose records that were also present in MEDLINE. Therefore, for the next part of the analysis, we chose to focus on records that were found in each of the other database searches (Embase or CINAHL) but not found in the MEDLINE results.
As shown in Figure 3, 37 of the 55 references that were found in Embase (but not MEDLINE) were lost when we applied the "Exclude MEDLINE journals" filter in Embase. As shown in Figure 4, using the "Records from: Embase" filter (only original Embase records, excluding conferences and MEDLINE records) resulted in no loss of included studies; all 55 of the unique journal article references found in the Embase searches were also found with this filter applied.

Fig. 4: Blue line shows the number of studies that are found in Embase, but not in MEDLINE. The orange shading shows how many of these remain after the "Records from: Embase" filter is applied. Numbers in top row correspond to the number assigned to each Cochrane review referenced in Appendix A.
CINAHL's filter for removing MEDLINE records resulted in a small number of studies lost; of the 11 studies we found that were present in the CINAHL results, but not the MEDLINE results, 2 were lost when we applied this filter (see Figure 5).

Fig. 5: Blue line shows the number of studies that are found in CINAHL, but not in MEDLINE. The orange shading shows how many of these remain after the "Exclude MEDLINE records" filter is applied. Numbers in top row correspond to the number assigned to each Cochrane review referenced in Appendix A.
As the purpose of using these filters in a systematic search is to reduce the number of duplicates that the searcher must work to remove from the data set, we looked at the reduction of references that resulted from applying filters in the Embase and CINAHL searches (Figure 6).

Fig. 6: Percentage reduction in the number of references returned from the searches of Embase and CINAHL using the filters.
The percentage reduction varied depending on the review. The average percentage reduction when using the "Records from: Embase" filter was 33.7%, with lowest and highest values of 12.3% and 76.1%. In absolute values, the lowest, highest and average values were 88, 3929, and 815 references respectively. In CINAHL, using the "Exclude MEDLINE records" filter resulted in an average percentage reduction in the number of references retrieved of 67.0%, with lowest and highest values of 36.5% and 88.1% respectively. In absolute values, the lowest, highest and average values were 40, 3141, and 728 references respectively.
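The arithmetic behind these figures can be sketched as follows. The counts below are invented for illustration, and the sketch assumes that the average is the mean of the per-review percentages (rather than a pooled percentage over all records), which the text does not state explicitly.

```python
from statistics import mean
from typing import Dict, List, Tuple


def percent_reduction(unfiltered: int, filtered: int) -> float:
    """Percentage of records removed by a filter for one review."""
    if unfiltered <= 0:
        raise ValueError("unfiltered count must be positive")
    return 100.0 * (unfiltered - filtered) / unfiltered


def summarize(counts: List[Tuple[int, int]]) -> Dict[str, float]:
    """Lowest, highest and mean reduction, in percent and in absolute records."""
    reductions = [percent_reduction(u, f) for u, f in counts]
    absolute = [u - f for u, f in counts]
    return {
        "min_pct": min(reductions),
        "max_pct": max(reductions),
        "mean_pct": mean(reductions),   # assumption: mean of per-review percentages
        "min_abs": min(absolute),
        "max_abs": max(absolute),
        "mean_abs": mean(absolute),
    }


# Hypothetical (unfiltered, filtered) record counts for three reviews.
example_counts = [(1000, 800), (500, 250), (200, 150)]
stats = summarize(example_counts)
```

Reporting both percentages and absolute counts matters because a modest percentage on a very large result set can still remove more records than a large percentage on a small one.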

Discussion
Our hypothesis was that if a MEDLINE search is sufficiently rigorous and comprehensive (and therefore highly sensitive), then all relevant studies indexed by MEDLINE should be captured by the MEDLINE search. Therefore, using the CINAHL and Embase filters to remove MEDLINE records would not result in any loss of unique studies. This was partially supported. While we did not assess the sensitivity of the searches in the 20 reviews we analyzed, we assumed that searches undertaken for Cochrane reviews are indicative of commonly accepted best practice. Our conclusions around the appropriateness of each filter are described below. We would like to emphasize that use of filters for deduplication is optional, and searchers should make this decision on a case-by-case basis.
1. We cannot recommend using the Exclude MEDLINE Journals filter in Embase. Figure 3 shows that a large number of unique articles (37 of 55) were lost from Embase using this filter, possibly because the filter operates at the journal level and thus excludes Embase records for articles in any journal indexed by both Embase and MEDLINE. We also noted that some journals were indexed in Embase from an earlier date than in MEDLINE. Thus, older articles from these journals are present in Embase, but not in MEDLINE, and are excluded when using a journal-level filter. Given the high number of studies lost when using this filter, we do not recommend its use.
2. The "Records from: Embase" filter may be used without significant loss of unique studies if the searcher either is excluding conferences, or plans to search them separately with the Embase Conference filter.
There was no loss in recall of journal articles using this filter. In most of the reviews we examined, where conferences were cited, there was also a corresponding journal article that covered the same study in greater detail. It is possible that for some topics, searchers may want to do a thorough search of conference literature; if this is the case, we recommend applying the two limits "Records from: Embase" and "Conference Abstracts" together; this will exclude MEDLINE records, but retain all original Embase article and conference records.
3. CINAHL's Exclude MEDLINE filter may be used with caution.
Using this filter in CINAHL results in a considerable reduction of duplicate records (on average, 67% fewer records than without filtering). Of the 11 included studies found in CINAHL but not in MEDLINE across all of the reviews, 2 were lost by applying the filter. We examined these 2 records to determine why they had been lost. One of the lost papers was excluded from the MEDLINE results by the RCT filter applied in that search. The RCT filter did not exclude the study from the CINAHL results, but applying the MEDLINE exclusion filter did. The second lost article was filtered out because the searchers used a narrow subject heading in their MEDLINE search, but were forced to use a broader term in their CINAHL search. The article was indexed in MEDLINE, but not retrieved by the MEDLINE search. Due to the broader CINAHL search, the article was found, but then excluded when the MEDLINE filter was applied. It appears that choices made around the translation of search strategies can result in lost articles, even when the MEDLINE exclusion filter is working appropriately. This finding corroborates Kwon et al.'s statement that "…there may be articles retrieved from CINAHL that are indexed in MEDLINE but are not retrieved by the MEDLINE search" [6]. Because very few of the included studies in our test set were unique to CINAHL, and a very small number (2) of those were removed by the filter, a searcher may decide that the benefits of using the filter outweigh the risks.
This recommendation contradicts the advice of Kwon et al. [6] that the CINAHL deduplication filter results in too many false positives (citations that were wrongly identified as duplicates and removed). However, there is no way to know whether the false positives identified in that study would have eventually been included in a final review. We took a different approach of assessing the filters based on whether articles that did end up in the final review were excluded by the filters. Given that only 2 relevant studies were excluded across our set of 20 reviews, we recommend the use of this filter if the searcher feels that substantially reducing the number of duplicates outweighs the risk of potentially losing a small number of studies.
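The difference between the three filters discussed above can be illustrated with a small local emulation. This is a sketch based only on the filter descriptions given in the Methods; it is not the vendors' implementation, and the `Record` fields, journal names and PMIDs are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Record:
    """Hypothetical record; 'status' mimics the Embase record status."""
    rec_id: str
    status: str                 # "embase", "medline", "conference", or "cinahl"
    journal: str
    pmid: Optional[str] = None


def records_from_embase(records: List[Record]) -> List[Record]:
    """Record-level filter: keep only records with original Embase status."""
    return [r for r in records if r.status == "embase"]


def exclude_medline_journals(
    records: List[Record], medline_journals: Set[str]
) -> List[Record]:
    """Journal-level filter: drop every record from a MEDLINE-indexed journal."""
    return [r for r in records if r.journal not in medline_journals]


def exclude_medline_records(records: List[Record]) -> List[Record]:
    """CINAHL-style filter as described in the text: drop records with a PMID."""
    return [r for r in records if r.pmid is None]


# Toy Embase result set: "old-article" predates MEDLINE coverage of Journal A.
embase_results = [
    Record("old-article", "embase", "Journal A"),
    Record("imported", "medline", "Journal A", pmid="111"),
    Record("abstract", "conference", "Journal B"),
]
kept_record_level = [r.rec_id for r in records_from_embase(embase_results)]
kept_journal_level = [
    r.rec_id for r in exclude_medline_journals(embase_results, {"Journal A"})
]

# Toy CINAHL result set: the first record is indexed in MEDLINE (it has a PMID)
# but is assumed to have been missed by the MEDLINE search itself.
cinahl_results = [
    Record("missed-by-medline-search", "cinahl", "Journal C", pmid="222"),
    Record("cinahl-only", "cinahl", "Journal D"),
]
kept_cinahl = [r.rec_id for r in exclude_medline_records(cinahl_results)]
```

In this toy data the record-level filter keeps `old-article` while the journal-level filter drops it, and the PMID-based filter drops a record that the MEDLINE search never retrieved, which mirrors the two loss mechanisms described in items 1 and 3.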

How much time and effort do these filters save?
This varies depending on the overall size of the result set, but is usually in the hundreds or thousands of articles. Given that deduplication is a time-consuming process, the use of filters may result in significant time savings, especially for larger reviews. If you are using systematic review software that has very effective deduplication features, you may not realize much benefit from these filters; however, if you are using some combination of EndNote, Excel, and manual scanning, and your result set is in the thousands or even tens of thousands, these filters may save a significant amount of time.

Additional findings
We have heard anecdotally that some searchers no longer search MEDLINE, because MEDLINE records are imported into Embase, though Lam et al.'s analysis demonstrates that this practice is not widespread among searchers [8]. As 16 of the 20 reviews that we examined contained at least one included study that was only retrieved from the MEDLINE search, we do not recommend relying on Embase as the primary tool for MEDLINE searching.

Limitations
We were not able to test our method in Embase.com, so findings from this study may not be applicable to that interface.
This study was exploratory in nature; it is possible that a test set of 20 Cochrane Reviews is not sufficiently large or diverse to fully test these filters.
We examined the effects of these filters only on included studies from the 20 Cochrane Reviews, and did not assess the filters' effects on total records retrieved by each strategy, with and without the filters applied.
We selected our sample of 20 Cochrane reviews using eligibility criteria, from an initial result set that was sorted in reverse chronological order. We did not attempt to select specific review topics to match content that might be better indexed in Embase or CINAHL than in MEDLINE.

Future Directions for Research
As this study was exploratory in nature, one possible future direction is to repeat this analysis with a larger number of reviews, or to examine all references retrieved by the filters against the original, unfiltered search, rather than simply the studies selected for inclusion in the final review.

Conclusion
Deduplication is a tedious, time-consuming, and error-prone part of the data collection process for a systematic review. Reduction of duplicate studies by judicious use of MEDLINE filters in other databases may result in sizable time savings for review teams. We sought to assess the effectiveness of these filters in Embase and CINAHL in order to inform our own practice in the conduct of systematic reviews. Based on the results of our study, we are unable to recommend the "Exclude MEDLINE Journals" filter in Embase; however, we are confident in recommending the "Records from: Embase" filter (possibly in conjunction with the Conference Abstracts filter), and we recommend with reservations the "Exclude MEDLINE records" filter in CINAHL.

Data Availability Statement:
The data that support the findings of this study are openly available in the PRISM Dataverse: University of Calgary's Data Repository at https://doi.org/10.5683/SP2/MPREV8, V1