Treadwell J, Mitchell M, Eatmon K, et al. Imaging Tests for the Diagnosis and Staging of Pancreatic Adenocarcinoma [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2014 Sep. (Comparative Effectiveness Review, No. 141.)

Methods

The methods for this comparative effectiveness review (CER) follow the methods suggested in the Agency for Healthcare Research and Quality (AHRQ) “Methods Guide for Effectiveness and Comparative Effectiveness Reviews” (available at http://www.effectivehealthcare.ahrq.gov/methodsguide.cfm). The main sections in this chapter reflect the elements of the protocol established for the CER; certain methods map to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) checklist.29

Topic Refinement and Review Protocol

Initially a panel of key informants gave input on the Key Questions (KQs) to be examined; these KQs were posted on AHRQ’s website for public comment between February 26, 2013, and March 25, 2013, and revised as needed. We then drafted a protocol for the CER and recruited a panel of technical experts to provide high-level content and methodological expertise throughout the development of the review.

Literature Search Strategy

Medical librarians in the Evidence-based Practice Center (EPC) Information Center performed literature searches, following established systematic review protocols. We searched the following databases using controlled vocabulary and text words: Embase, MEDLINE, PubMed, and The Cochrane Library from 1980 through November 1, 2013. The full search strategy is shown in Appendix A.

Literature screening (for reviews or studies) was performed in duplicate using the Web-based systematic review software DistillerSR (Evidence Partners, Ottawa, Canada). Initially, we screened literature search results in duplicate for relevance. We screened relevant abstracts again, in duplicate, against the inclusion criteria. Studies that appeared to meet the inclusion criteria were retrieved in full and screened again, in duplicate, against the inclusion criteria. All disagreements were resolved by consensus discussion between the two original screeners and, if necessary, an additional third screener. For procedural harms of the imaging technologies of interest, we conducted a supplemental search that was not limited to the literature on pancreatic adenocarcinoma. We used Reference Manager™ software (Thomson Reuters, New York, NY) to manage references.

The literature searches will be updated during the peer review process, before finalization of this CER.

Study Selection

Our criteria are listed in five categories below: (1) publication criteria, (2) study design criteria, (3) patient criteria, (4) test criteria, and (5) data criteria.

Publication Criteria

  1. Full-length articles: The article must have been published as a full-length peer-reviewed study. Abstracts and meeting presentations were not included because they do not include sufficient details about experimental methods to permit an evaluation of study design and conduct, and they may also contain only a subset of measured outcomes.30,31 Additionally, it is not uncommon for abstracts that are published as part of conference proceedings to have inconsistencies when compared with the final publication of the study or to describe studies that are never published as full articles.32–36
  2. Redundancy: To avoid double-counting of patients, in instances in which several reports of the same or overlapping groups of patients were available, only outcome data based on the larger number of patients were included. However, we included data from publications with lower numbers of patients when either (a) a publication with lower patient enrollment reported an included outcome that was not reported by other publications of that study, or (b) a publication with lower patient enrollment reported longer followup data for an outcome.
  3. English language: Moher et al. (2000) demonstrated that exclusion of non-English-language studies from meta-analyses has little impact on the conclusions drawn.37 Juni et al. (2002) found that non-English-language studies were typically at higher risk of bias and that excluding them had little effect on effect-size estimates in the majority of the meta-analyses they examined.38 Although we recognize that exclusion of non-English-language studies could lead to bias in some situations, we believe that the few instances in which this may occur do not justify the time and cost typically necessary for translation of studies.
  4. Publication date: We included studies published since January 1, 2000, reasoning that older articles would likely involve outdated imaging technologies. Studies of the harms of imaging technologies that did not specifically involve pancreatic adenocarcinoma (i.e., studies of any clinical indication) must have been published since January 1, 2009. We chose this more recent date because we anticipated a large number of studies of imaging in patients with any clinical condition.

Study Design Criteria

  1. For KQs on single-test accuracy: For KQs 1a and 1b, which address the performance of a single imaging test against a reference standard, we included only systematic reviews. EPC guidance by White et al. (2009)39 describes how existing systematic reviews can be used to replace de novo processes in CERs. We referred to the PICOTS-SD for the pertinent subquestion, and these seven components (Populations, Interventions, Comparisons, Outcomes, Time points, Setting, Study design) served as the seven inclusion criteria. For quality assessment, see the end of Appendix D on risk of bias.
  2. For any KQs comparing two or more tests, the study must have compared both tests to a reference standard. The reference standard must not have been defined by either imaging test being assessed.
  3. For any KQs on single versus multiple tests, test experience, patient factors (e.g., age), or tumor characteristics (e.g., head or tail of pancreas), the study must have made a comparison of data to address the question. For example, for test experience, the difference between multidetector computed tomography (MDCT) and endoscopic ultrasound with fine-needle aspiration (EUS-FNA) may depend on the experience of the centers (e.g., higher case-volume centers may find less of a difference in these technologies than lower case-volume centers).
  4. For any KQs involving comparative clinical management or long-term survival or quality of life, some patients must have received one of the imaging tests, and a separate group of patients must have received a different imaging test. This design permits a comparison of how the choice of test may influence management and/or survival and/or quality of life.
  5. For KQ3 on the rates of procedural harms, we included any harms data on the imaging procedures of interest that were based on 50 or more patients in the context of diagnosis or staging of pancreatic adenocarcinoma, provided the study stated in its Methods section that it planned in advance to capture harms/complications data. Additionally, we included studies primarily of harms and adverse events associated with the use of each specific imaging modality, regardless of the type of cancer being detected, that were published in 2009 or later.
  6. For KQ3b on patient perspectives of imaging tests, any study design was accepted.
  7. For KQ4 on screening, we included any study that reported the performance of at least one included imaging test in the context of screening for either pancreatic adenocarcinoma itself or precursor lesions to pancreatic cancer.

Patient Criteria

  1. To be included, the study must have reported data obtained from groups of patients in which at least 85 percent of the patients were from one of the patient populations of interest. If a study reported multiple populations, it must have reported data separately for one or more of the populations of interest.
  2. Adults. At least 85 percent of patients must have been aged 18 years or older, or data must have been reported separately for those aged 18 years or older.
  3. Studies of screening, diagnosing, or staging primary pancreatic adenocarcinoma were included. Testing for recurrent pancreatic cancer was excluded.
  4. Data on imaging tests performed after any form of treatment (e.g., neoadjuvant chemotherapy) were excluded, but pretreatment imaging data were considered.

Test Criteria

  1. Type of test. Only studies of the imaging tests of interest were included (listed in the KQs above). Studies of computed tomography (CT) that did not explicitly state whether the CT was MDCT, or for which this could not be determined, were assumed to have used MDCT. Given our publication date criterion of 2000 and later, we believe it safe to assume that CT performed in such studies was MDCT.

Data Criteria

  1. The study must have reported data pertaining to one of the outcomes of interest (see the KQs section).
    • For accuracy outcomes (KQ1a through 1e, KQ2a through 2e, and KQ4), this means reporting enough information to calculate both sensitivity and specificity, along with corresponding confidence intervals (a minimal calculation sketch appears after this list).
    • For clinical management (KQ1f, KQ2f), this means reporting the percentage of patients who received a specific management strategy, after undergoing each imaging test (a separate group of patients corresponding to each imaging test).
    • For long-term survival (KQ1g, KQ2g), this means reporting median survival after each imaging test (separate groups of patients), mortality rates at a given time point (separate groups of patients), or another survival measure such as a hazard ratio.
    • For quality of life (KQ1g, KQ2g), this means reporting data on a previously tested quality-of-life instrument (such as the SF-36) after each imaging test (separate groups of patients).
    • For harms (KQ3), this means a statement in the Methods section that harms/complications would be measured, together with either a report of the occurrence of a procedure-related harm and the number of patients at risk, or a report that no harms or complications occurred as a result of the procedure.
    • For patient perspectives (KQ3b), this means reporting the results of asking patients about their opinions or experience after having undergone one or more of the imaging tests.
  2. Regarding the minimum patient enrollment, for studies comparing imaging tests (KQ1b through 1g and KQ2b through 2g), we required data on at least 10 patients per imaging test. We also used a minimum of 10 for KQ3b on patient perspectives of imaging tests. We used a minimum of 50 patients for data on harms (KQ3) or screening (KQ4).
  3. For all KQs, the reported data must have included at least 50 percent of the patients who had initially enrolled in the study.
  4. Studies that reported data by tumor (e.g., x percent of pancreatic adenocarcinoma tumors were correctly detected) instead of by patient (e.g., x percent of enrolled patients were correctly given a diagnosis of pancreatic adenocarcinoma) were not excluded for this difference. However, the tumor-based data were separated from the patient-based data because they measure different types of accuracy.
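
To make the accuracy data criterion above concrete: sensitivity and specificity (and their confidence intervals) are computable from the 2x2 counts of true positives, false positives, false negatives, and true negatives. The report does not state which interval method its source studies used; the minimal Python sketch below assumes the Wilson score interval purely for illustration.

    from math import sqrt

    def wilson_ci(successes, n, z=1.96):
        # Wilson score 95% confidence interval for a proportion
        p = successes / n
        center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
        half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    def sensitivity_specificity(tp, fp, fn, tn):
        # Sensitivity and specificity, each with a 95% CI, from 2x2 counts
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        return (sens, wilson_ci(tp, tp + fn)), (spec, wilson_ci(tn, tn + fp))

    # Hypothetical example: 90 TP, 20 FP, 10 FN, 80 TN
    (sens, sens_ci), (spec, spec_ci) = sensitivity_specificity(90, 20, 10, 80)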

Data Abstraction

We abstracted information from the included studies into forms created in Microsoft Excel (Microsoft Corporation, Redmond, WA). Comparative accuracy data were abstracted in duplicate to ensure accuracy, and all discrepancies were resolved by consensus discussion. Abstracted elements included general study characteristics (e.g., country, setting, study design, number enrolled), patient characteristics (e.g., age, sex, comorbidities), details of the imaging methodology (e.g., radiotracer, timing of test), risk-of-bias items, and outcome data. Appendix C contains all evidence tables except those involving risk of bias, which appear in Appendix D.

Risk of Bias Evaluation

For systematic reviews of single-test accuracy, EPC guidance by White et al. (2009)39 suggests that EPCs assess the quality of an existing systematic review by using a revised AMSTAR (Assessment of Multiple Systematic Reviews) instrument. The items we used appear in Appendix D. For each included review, two analysts independently answered 15 items and independently categorized the review as either high quality or not high quality (thus, for systematic reviews, we made no distinction between moderate and low quality). Discrepancies in the category assignment were resolved by consensus. A review was considered high quality if it met eight specific items (see Appendix D); reviews that did not meet all eight items were considered not high quality.
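
The categorization rule reduces to a simple set check: a review is high quality only if it meets every one of the eight required items. The identifiers below are hypothetical placeholders; the actual items appear in Appendix D.

    # Hypothetical identifiers standing in for the eight items in Appendix D
    REQUIRED_ITEMS = {"item_1", "item_2", "item_3", "item_4",
                      "item_5", "item_6", "item_7", "item_8"}

    def amstar_category(items_met):
        # High quality only if every required item was met
        return "high quality" if REQUIRED_ITEMS <= set(items_met) else "not high quality"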

For studies comparing two or more tests, we used a set of nine risk-of-bias items, developed after considering QUADAS-2 as well as additional issues that specifically address bias in comparisons of diagnostic tests, to categorize studies as having high, medium, or low risk of bias (see Appendix D).

Strength of Evidence Grading

We used the EPC system for grading comparative evidence from primary studies on diagnostic tests, as described in the EPC guidance chapter by Singh et al. (2012).40 This system uses up to eight domains as inputs (risk of bias, directness, consistency, precision, publication bias, dose-response association, whether all plausible confounders would reduce the effect, and strength of association). The output is a grade of the strength of evidence: high, moderate, low, or insufficient. A grade was assigned separately for each outcome of each comparison of each KQ. Definitions for these categories are provided in Table 2 below. Final strength of evidence grades (as well as each component that contributed to each grade) are provided in a table in the Conclusion section for the pertinent KQ.

Table 2. Strength of evidence grades and definitions.

The EPC system requires that reviewers select the most important outcomes of a review to be graded. For this report, we graded evidence on comparative accuracy for diagnosis and staging, clinical outcomes (clinical management, survival, quality of life), and screening accuracy; these were the most important outcomes, and the EPC guidance chapter by Singh et al. (2012)40 could be applied to them. We did not grade the strength of evidence from published systematic reviews on the accuracy of individual imaging tests or on the procedural harms of a single imaging test.

For each comparison and each outcome, we determined whether the evidence permitted an evidence-based conclusion. For comparative test accuracy, this meant deciding whether the evidence was sufficient to permit one of the following three types of conclusions: (1) test A is more accurate than test B, (2) test B is more accurate than test A, or (3) tests A and B are similarly accurate. The first two types of conclusions required a statistically significant difference for either sensitivity or specificity (or both), whereas the third type required a difference that was not statistically significant for both sensitivity and specificity, as well as independent judgments from two reviewers that the data were precise enough to indicate similar accuracy. If none of these three conclusions was appropriate, we graded the evidence insufficient. If the evidence was sufficient to permit a conclusion, the grade was high, moderate, or low. The grade was assigned by two independent raters, and discrepancies were resolved by consensus. Below, we discuss the eight domains and how they were considered:

Study limitations. Study limitations indicate the extent to which the studies included for a given outcome were designed and conducted to protect against bias. If the evidence permitted a conclusion, then, all else being equal, a set of studies at low risk of bias yielded a higher strength of evidence grade than a set of studies at medium or high risk of bias. The study limitations domain represents the overall risk of bias for a set of studies and was judged low, medium, or high.

Directness. For questions on test accuracy, data on accuracy directly addressed the question, so those data were considered direct. For questions on other outcomes (e.g., long-term survival), data on the actual outcomes were required for inclusion, so included data were likewise judged direct.

Consistency. For questions comparing the accuracy of two or more tests, and for other comparative questions, consistency was judged based on whether the studies’ findings suggested the same direction of effect.

Precision. For questions comparing the accuracy of two or more tests, and for other comparative questions, the evidence was considered sufficiently precise if the data showed a statistically significant difference (between groups or between tests) or if the confidence intervals were judged narrow enough to demonstrate similar results.

Reporting bias. This was addressed by noting the presence of abstracts or ClinicalTrials.gov entries describing studies that did not subsequently appear as full published articles; the existence of many such studies tends to decrease the strength of evidence. We also considered the funding source of studies, and we performed any appropriate quantitative analyses correlating study effect sizes with patient-enrollment end dates.

Dose-response association. This domain was relevant only with respect to the radiation dose for CT. One possibility is that higher doses result in higher accuracy of CT. If the evidence shows that CT is more accurate than another imaging technique and that the difference is even larger in studies that used higher CT doses, it would generally increase the strength of evidence.

All plausible confounders would reduce the effect. This domain applies when a set of studies may be biased against finding a difference between two interventions and yet still found an important difference; had the studies controlled for the confounders, the effect would presumably have been even larger. This domain was considered when statistical differences were found.

Strength of association. This domain was judged by EPC team members based on whether the size of a difference (e.g., the extent of difference in accuracy between two tests) was so large that the potential study biases could not explain it. If true, this domain will generally increase the grade of strength of evidence. This domain was considered when statistical differences were found.

Applicability

The applicability of the evidence involved four key aspects: patients, tests/interventions, comparisons, and settings. In considering the applicability of the findings to patients, we consulted large studies to ascertain the typical characteristics of patients newly given a diagnosis of pancreatic adenocarcinoma (e.g., age, sex) and then assessed whether the included studies enrolled similar patients. Some aspects of the interventions may also affect applicability, for example, if a study used an uncommon radiotracer. Settings of care were described, and, if data permitted, subgroups of studies by setting were analyzed separately. We did not provide categorical ratings of applicability; instead, we discussed applicability concerns in the Discussion section of the report.

Data Analysis and Synthesis

For comparing the accuracy of imaging tests, we synthesized the evidence on sensitivity and specificity using meta-analysis wherever appropriate and possible. Decisions about whether meta-analysis was appropriate were based on the judged clinical homogeneity of the different study populations, imaging and treatment protocols, and outcomes. Statistical heterogeneity was measured using tau-squared. When meta-analysis was not possible (because of limitations of reported data) or was judged to be inappropriate, the data were synthesized using a descriptive approach.
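
The pooling itself used a bivariate model, described in the next paragraph. As a simplified univariate illustration of how tau-squared quantifies between-study heterogeneity, the Python sketch below applies the DerSimonian-Laird estimator to logit-transformed sensitivities; this is an assumed illustration, not necessarily the exact computation used in the report.

    from math import log

    def logit_sens_and_var(tp, fn):
        # Logit-transformed sensitivity and its approximate variance
        sens = tp / (tp + fn)
        return log(sens / (1 - sens)), 1 / tp + 1 / fn

    def dersimonian_laird_tau2(estimates, variances):
        # DerSimonian-Laird estimate of between-study variance (tau-squared)
        w = [1 / v for v in variances]
        theta = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
        q = sum(wi * (yi - theta) ** 2 for wi, yi in zip(w, estimates))
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        return max(0.0, (q - (len(estimates) - 1)) / c)

    # Hypothetical (tp, fn) counts from three studies of one test
    studies = [(45, 5), (30, 10), (60, 12)]
    ys, vs = zip(*(logit_sens_and_var(tp, fn) for tp, fn in studies))
    tau2 = dersimonian_laird_tau2(ys, vs)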

For each pair of imaging tests compared directly by a group of studies (e.g., MDCT and EUS-FNA) for a given clinical purpose (e.g., diagnosis), we performed bivariate meta-analysis of each test’s accuracy data using the “metandi” command in Stata.41 If this model could not be fit for a given test (i.e., if there were 3 or fewer studies in the analysis or the model did not converge), we used Meta-DiSc (freeware developed by the Unit of Clinical Biostatistics, Ramón y Cajal Hospital, Madrid, Spain).42 Using the meta-analytic results, we applied equation 39 in Trikalinos et al. (2013)43 to compare the tests statistically (separately for sensitivity and specificity). For these tests, we set p<0.05 (two-tailed) as the threshold for statistical significance. If a comparison was not statistically significant, two reviewers independently judged whether the confidence interval around the difference was sufficiently narrow to permit a conclusion of similar accuracy. We did not set a specific degree of narrowness required to permit an equivalence conclusion, because we could find no consensus in the field as to the minimal important difference in accuracy; instead, this was a judgment made by two independent reviewers, with disagreements resolved by consensus. When studies reported accuracy data for multiple readers separately, we first selected the data from reader 1 only and then performed sensitivity analyses selecting all other permutations of readers. The selection of reader 1 was arbitrary.
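
We do not reproduce equation 39 of Trikalinos et al. here. As a generic stand-in, the sketch below performs a Wald-type z-test comparing two pooled logit-scale estimates (e.g., pooled logit sensitivities for tests A and B), treating the two meta-analytic estimates as independent; the published method may additionally account for within-study correlation when the same patients received both tests.

    from math import sqrt, erf

    def compare_pooled_logits(theta_a, se_a, theta_b, se_b, alpha=0.05):
        # Two-tailed Wald z-test for a difference between two pooled
        # logit-scale estimates, treating them as independent (a
        # simplification; paired designs induce correlation)
        z = (theta_a - theta_b) / sqrt(se_a ** 2 + se_b ** 2)
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
        return z, p, p < alpha

    # Hypothetical pooled logit sensitivities (and SEs) for tests A and B
    z, p, significant = compare_pooled_logits(1.8, 0.25, 1.2, 0.30)

If neither the sensitivity nor the specificity comparison reaches significance, the similar-accuracy judgment described under Strength of Evidence Grading then rests on the two reviewers’ independent assessment of the confidence interval around the difference.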

Some data were reported in terms of whether the precise T stage (or the overall TNM stage) was correctly assessed by an imaging test. For these studies, we computed an odds ratio of accurate staging based on paired binary data, assuming a test-test correlation of 0.5.
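
The report does not show this computation in detail. One plausible reading, sketched below, is that the assumed correlation of 0.5 serves to reconstruct discordant-pair counts from the two tests’ marginal staging-accuracy proportions, after which a conditional (McNemar-type) odds ratio can be computed; all names and formulas here are our illustrative assumptions.

    from math import sqrt

    def paired_staging_or(p1, p2, n, rho=0.5):
        # p1, p2: proportions staged correctly by tests 1 and 2
        # n: number of patients receiving both tests; rho: assumed correlation
        # Joint probability that both tests stage correctly, given rho
        p_both = p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
        n10 = round((p1 - p_both) * n)  # test 1 correct, test 2 incorrect
        n01 = round((p2 - p_both) * n)  # test 2 correct, test 1 incorrect
        odds_ratio = n10 / n01          # conditional OR from discordant pairs
        se_log_or = sqrt(1 / n10 + 1 / n01)
        return odds_ratio, se_log_or

    # Hypothetical example: 100 patients, 80% vs. 70% staged correctly
    or_, se = paired_staging_or(0.80, 0.70, 100)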

Peer Review and Publication

The review protocol was posted from August 9, 2013, to September 6, 2013, on the AHRQ Effective Health Care Program website. Peer reviewers are invited to provide written comments on the draft report based on their clinical, content, or methodologic expertise. Peer review comments on the preliminary draft of the report are considered by the EPC in preparation of the final draft of the report. The dispositions of the peer review comments are documented and will be published 3 months after the publication of the evidence report.
