Ideally, the body of evidence to support a conclusion would be strong. Often, however, the evidence suffers from various limitations concerning the possible risk of bias in available studies, small numbers of studies and patients, and/or inconsistent effects. These limitations often mean that the strength of the evidence is only moderate, weak, or even insufficient to permit any conclusion. In order to gauge the impact of these possible limitations, we applied a formal rating system that conforms with the CER Methods Guide Manual recommendations on grading the strength of evidence.

The strength of evidence supporting each major conclusion was graded as High, Moderate, Low, or Insufficient. The grade was developed by considering four important domains: the quality (risk of potential bias) of the evidence base, the size of the evidence base, the consistency (agreement across studies) of the findings, and the robustness of the findings (as determined by sensitivity analysis). The grading system moves stepwise to consider each important domain. These steps are described below.

Step 1. What is the quality of individual studies?

We used an internal validity rating scale for diagnostic studies to grade the internal validity of the evidence base. This scale is based on a modification of the QUADAS instrument [595]. Each question in the instrument addresses an aspect of study design or conduct that can help to protect against bias. Each question can be answered “yes,” “no,” or “not reported,” and each is phrased such that an answer of “yes” indicates that the study reported a protection against bias on that aspect. A summary score was computed with each “yes” given a +1, each “no” a −1, and each “not reported” a zero. As all of the factors captured by the questions on the quality instrument were thought to be of equal importance for this topic, no weighting was used in computing the summary score. This summary score was then normalized to a scale from 0 to 10; the lower the score, the greater the risk that the study was affected by bias.
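
The scoring rule in Step 1 can be sketched in a few lines of Python. The report does not state how the summary score was normalized, so the linear rescaling below (all “no” maps to 0, all “yes” maps to 10) is an assumption, and the function name quality_score is hypothetical.

```python
def quality_score(answers):
    """Return a 0-10 quality score from 'yes' / 'no' / 'not reported' answers."""
    points = {"yes": 1, "no": -1, "not reported": 0}
    n = len(answers)
    if n == 0:
        raise ValueError("at least one answer is required")
    # Unweighted summary score; ranges from -n (all "no") to +n (all "yes").
    raw = sum(points[a] for a in answers)
    # ASSUMPTION: linear rescaling of [-n, +n] onto [0, 10].
    return 10 * (raw + n) / (2 * n)


score = quality_score(["yes", "yes", "no", "not reported"])
```

Under this assumed rescaling, a study with every item “not reported” lands at the midpoint of the scale.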

Step 2. What is the overall quality of evidence?

To evaluate the overall quality of the evidence base for each conclusion, we computed the median quality score of the studies contributing to that conclusion. We used the median because it is the appropriate measure of central tendency to represent the “typical” quality score, and is less sensitive to outliers than the mean. An evidence base with a median score higher than 8.4 was considered to be of high quality; an evidence base with a median score 8.4 or less but greater than 6.7 was considered to be of moderate quality; an evidence base with a median score 6.7 or less but greater than 5.0 was considered to be of low quality; and an evidence base with a median score less than 5.0 was considered to be of insufficient quality. The quality rating was considered to be the “baseline” grade of strength of evidence.
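
As an illustration, the Step 2 thresholds can be written as a small lookup. The text leaves a median of exactly 5.0 unassigned (Low requires a score greater than 5.0, Insufficient a score less than 5.0); this sketch treats 5.0 as Insufficient, which is an assumption. The function name baseline_grade is hypothetical.

```python
from statistics import median


def baseline_grade(scores):
    """Map the median of the 0-10 study quality scores to a baseline grade."""
    m = median(scores)
    if m > 8.4:
        return "High"
    if m > 6.7:
        return "Moderate"
    if m > 5.0:
        return "Low"
    # ASSUMPTION: a median of exactly 5.0 is treated as Insufficient.
    return "Insufficient"


grade = baseline_grade([7.0, 8.0, 9.5])  # median is 8.0
```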

Step 3. Is the evidence base large enough to be informative?

For this step, we first counted the number of included studies. If there were fewer than three studies, the evidence grade was automatically set to Insufficient. Next, we determined whether the precision of the evidence base was sufficient to permit a conclusion. Precision depends to a large degree on the size of the evidence base: in general, as the evidence base increases in size, the confidence interval around the summary effect becomes tighter because of the increase in statistical power. If the effect was statistically or clinically significant, we concluded that the data were informative. For diagnostic test evaluations, we considered the precision of the primary measures of diagnostic test accuracy, sensitivity and specificity. If the confidence interval bounds were within 20% of the point estimate for these measures, we concluded that the data were informative. Other measures of diagnostic test accuracy, such as likelihood ratios and predictive values, are calculated from the same analysis and data used to calculate sensitivity and specificity and therefore were not rated separately. If the data were not sufficiently precise, we down-graded the evidence rating by one level; for example, a rating of High from Step 2 would be down-graded to Moderate.
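
The Step 3 rule for diagnostic accuracy measures might be sketched as follows. Two points are assumptions: “within 20% of the point estimate” is read here as a relative margin (bounds inside estimate × 0.8 and estimate × 1.2) rather than 20 percentage points, and a down-grade from Low is taken to yield Insufficient. The names GRADES, downgrade, and size_step are hypothetical, and the sketch does not cover the separate test of statistical or clinical significance.

```python
GRADES = ["Insufficient", "Low", "Moderate", "High"]


def downgrade(grade):
    """Lower a grade by one level, never going below Insufficient."""
    return GRADES[max(GRADES.index(grade) - 1, 0)]


def size_step(grade, n_studies, estimate, ci_low, ci_high, margin=0.20):
    """Apply the study-count rule and the precision rule to a baseline grade."""
    if n_studies < 3:
        return "Insufficient"
    # ASSUMPTION: the 20% margin is relative to the point estimate.
    precise = (ci_low >= estimate * (1 - margin)
               and ci_high <= estimate * (1 + margin))
    return grade if precise else downgrade(grade)


result = size_step("High", 5, estimate=0.90, ci_low=0.60, ci_high=0.95)
```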

Step 4. Are data consistent?

Consistency refers to the extent to which the study findings are similar. Quantitative consistency can be tested with Higgins and Thompson’s I² statistic. For this report, we considered an evidence base to be quantitatively consistent when I² < 50%. The evidence base was considered to be qualitatively consistent when the studies all reported the same qualitative conclusion. If the data were not sufficiently consistent, we down-graded the evidence rating by one level; for example, a rating of High from Steps 2 and 3 would be down-graded to Moderate.
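
For reference, I² is commonly computed from Cochran’s Q as I² = max(0, (Q − df) / Q) × 100%, where df is the number of studies minus one. The sketch below uses fixed-effect inverse-variance weights to obtain Q; the report does not say which software or weighting was used, so that choice and the function names are assumptions.

```python
def i_squared(effects, variances):
    """Return I-squared (in percent) from study effects and their variances."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0 or df <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def is_quantitatively_consistent(effects, variances, threshold=50.0):
    """Apply the report's criterion of I-squared below 50%."""
    return i_squared(effects, variances) < threshold
```

Two studies with effects 0.1 and 0.9 and variance 0.01 each give Q = 32 on 1 degree of freedom, so I² is about 96.9% and the evidence base would not count as quantitatively consistent.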

Step 5. Are data robust?

In this step we determined whether the findings were robust to minor alterations of the data. The appropriate robustness tests vary with the analysis. For example, if some data were imputed, the analysis should be repeated using reasonable variations in the value(s) of the imputed data. Other robustness tests may include removing one study at a time from the analysis, or performing cumulative meta-analyses. We considered findings not to be robust only if a robustness analysis significantly altered the conclusion (e.g., a statistically significant finding became non-significant as studies were added to the evidence base, or the point estimate changed by more than 20% after removal of any single study from the analysis). If the data were not sufficiently robust, we down-graded the evidence rating by one level; for example, a rating of High from Steps 2, 3, and 4 would be down-graded to Moderate.
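
The leave-one-out check described in Step 5 could look like the sketch below. It pools with fixed-effect inverse-variance weights and measures the 20% change relative to the full-data estimate; both are assumptions, since the report does not specify the pooling model or the reference value. The function names are hypothetical.

```python
def pooled_estimate(effects, variances):
    """Fixed-effect inverse-variance pooled estimate (assumed model)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def is_robust_leave_one_out(effects, variances, max_change=0.20):
    """Return False if dropping any single study shifts the estimate by > 20%."""
    if len(effects) < 2:
        raise ValueError("need at least two studies to remove one")
    full = pooled_estimate(effects, variances)
    for i in range(len(effects)):
        rest_e = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        reduced = pooled_estimate(rest_e, rest_v)
        # ASSUMPTION: change is measured relative to the full-data estimate.
        if abs(reduced - full) > max_change * abs(full):
            return False
    return True


stable = is_robust_leave_one_out([0.80, 0.82, 0.79, 0.81], [0.01] * 4)
fragile = is_robust_leave_one_out([0.9, 0.9, 0.2], [0.01] * 3)
```

In the second call, removing the outlying study moves the pooled estimate from about 0.67 to 0.90, a change of roughly 35%, so the finding would be rated not robust.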

In addition to the conclusions about diagnostic accuracy, we also rated the strength of evidence for one conclusion about a patient-oriented outcome, surgeries avoided. The method of rating patient-oriented outcomes is similar, but not identical, to the method of rating the conclusions about diagnostic accuracy. In addition to the four domains used for rating conclusions about diagnostic accuracy, we also considered the domain Strength of Association (the magnitude of the effect). If the size of the effect was determined to be very small, we down-graded the rating of the strength of evidence by one level; if the size of the effect was determined to be very large, we up-graded the strength of evidence by one level.
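
The Strength of Association adjustment can be sketched as a one-level shift on the grade scale. The report does not define numeric cut-offs for “very small” or “very large,” so the sketch takes that judgment as a label. Capping the result at High and at Insufficient is an assumption, as is the function name adjust_for_association.

```python
GRADES = ["Insufficient", "Low", "Moderate", "High"]


def adjust_for_association(grade, effect_size):
    """Shift a grade one level for a 'very small' or 'very large' effect."""
    i = GRADES.index(grade)
    if effect_size == "very small":
        i = max(i - 1, 0)                 # ASSUMPTION: floor at Insufficient
    elif effect_size == "very large":
        i = min(i + 1, len(GRADES) - 1)   # ASSUMPTION: ceiling at High
    return GRADES[i]


final = adjust_for_association("Moderate", "very large")
```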


