
Bruening W, Schoelles K, Treadwell J, et al. Comparative Effectiveness of Core-Needle and Open Surgical Biopsy for the Diagnosis of Breast Lesions [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2009 Dec. (Comparative Effectiveness Reviews, No. 19.)


2. Methods

Topic Development

In response to Section 1013 of the Medicare Modernization Act, AHRQ requested an evidence report to synthesize the evidence on the comparative effectiveness of core-needle and open surgical biopsy for the diagnosis of breast cancer. The topic was nominated in a public process. The Scientific Resource Center (SRC) for the AHRQ Effective Health Care Program recruited a technical expert panel (TEP) to give input on key steps, including the selection and refinement of the questions to be examined. The expert panel membership is provided in Appendix A.

Upon AHRQ approval, the draft Key Questions were posted for public comment. After receipt of public commentary, the SRC finalized the Key Questions and submitted them to AHRQ for approval. These Key Questions are presented in the Scope and Key Questions section of the Introduction.

Our EPC created a work plan for developing the evidence report. The process consisted of working with AHRQ, the SRC, and the technical experts to outline the report’s objectives, performing a comprehensive literature search, abstracting data, constructing evidence tables, synthesizing the data, and submitting the report for peer review.

In designing the study questions and methodology at the outset of this report, the EPC consulted several technical and content experts. Broad expertise and perspectives were sought. Divergent and conflicting opinions are common and are perceived as healthy scientific discourse that results in a thoughtful, relevant systematic review. Therefore, the final study questions, design, and methodologic approaches do not necessarily represent the views of individual technical and content experts.

Search Strategy

The medical literature was searched from December 1990 through November 10, 2008, and the PubMed and EMBASE searches were updated through September 11, 2009. The full strategy is provided in Appendix B. In brief, we searched 14 external and internal databases, including PubMed and EMBASE, for clinical trials addressing the Key Questions. To supplement the electronic searches, we also examined the bibliographies and reference lists of included studies and recent narrative reviews, scanned the contents of new issues of selected journals, and searched selected relevant gray literature sources.

Study Selection

We selected the studies considered in this report using a priori inclusion criteria. Some of the criteria were geared toward ensuring that we used only the most reliable evidence. Other criteria were developed to ensure that the evidence was not derived from atypical patients or interventions, or from outmoded technologies.

Studies of diagnostic test performance compare the results of the experimental test to those of a reference test. The reference test is intended to measure the “true” disease status of each patient. It is important that the results of the reference test be very close to the truth, or the performance of the experimental test will be poorly estimated. For the diagnosis of breast cancer, the “gold standard” reference test is open surgical biopsy. However, an issue with the use of open surgical biopsy as the reference standard in large cohort studies of screening-detected breast abnormalities is the difficulty of justifying open surgical biopsy in women with probably benign lesions. Furthermore, restricting the evidence base to studies that used open surgery as the reference standard for all enrolled subjects would eliminate the majority of the evidence. Therefore, we chose to use a combination of clinical and radiologic followup as well as open surgical biopsy as the reference standard for our analysis.

For Key Question 1, we used the following formal criteria to determine which studies would be included in our analysis. Many of our inclusion criteria for Key Question 1 were intended to reduce the potential for spectrum bias. Spectrum bias refers to the fact that diagnostic test performance is not constant across populations with different spectra of disease. For example, patients presenting with severe symptoms of disease may be easier to diagnose than asymptomatic patients in a screening population, and a diagnostic test that performs well in the former population may perform poorly in the latter. The results of our analysis are intended to apply to a general population of women at average risk of breast cancer participating in routine breast cancer screening programs (mammography, clinical examination, and self-examination). Therefore, many of our inclusion criteria were intended to eliminate studies that enrolled populations of women at very high risk of breast cancer due to family history, or populations of women at risk of recurrence of a previously diagnosed breast cancer.

  1. The study must have directly compared core-needle biopsy to open surgery or patient followup for six months or longer in the same group of patients.
    Although it is possible to estimate diagnostic accuracy from a two-group trial, the results of such indirect comparisons must be viewed with great caution. Diagnostic cohort studies, wherein each patient acts as her own control, are the preferred study design for evaluating the accuracy of a diagnostic test.20 Retrospective case-control studies and case reports were excluded. Retrospective case-control studies have been shown to overestimate the accuracy of diagnostic tests, and case reports often describe unusual situations or individuals that are unlikely to yield results applicable to general practice.20,21 Retrospective case studies (studies that selected cases on the basis of the type of lesion diagnosed by core-needle biopsy) were also excluded because the data such studies report cannot be used to calculate the overall diagnostic accuracy of core-needle biopsy. Studies may have performed open surgical procedures on all patients, or may have performed open surgical biopsy on some patients and followed the other patients with clinical examination and mammograms for at least six months.
  2. The study enrolled female human subjects.
    Animal studies or studies of “imaging phantoms” are outside the scope of the report. Studies of breast cancer in men are outside the scope of the report.
  3. The study must have enrolled patients referred for biopsy for the purpose of primary diagnosis of a breast abnormality.
    Studies that enrolled women who were referred for biopsy after discovery of a possible breast abnormality by screening mammography or routine physical examination were included. Studies that enrolled subjects who were undergoing biopsy for any of the following purposes were excluded as out of scope of the report: breast cancer staging, evaluation for a possible recurrence of breast cancer, monitoring response to treatment, evaluation of the axillary lymph nodes, evaluation of metastatic or suspected metastatic disease, or diagnosis of types of cancer other than primary breast cancer. Studies that enrolled patients from high-risk populations, such as BRCA1/2 mutation carriers, were also out of scope. If a study enrolled a mixed patient population and did not report data separately, it was excluded if more than 15% of the subjects did not fall into the “primary diagnosis of women at average risk presenting with an abnormality detected on routine screening” category.
  4. Fifty percent or more of the subjects must have completed the study.
    Studies with extremely high rates of attrition are prone to bias and were excluded.
  5. Study must be published in English.
    Moher et al. and Holenstein et al. have demonstrated that exclusion of non-English language studies from meta-analyses has little impact on the conclusions drawn.22,23 Although we recognize that requiring studies to be published in English could introduce bias, we judged the likelihood of such bias too low to justify the time and cost of translation.
  6. Study must be published as a peer-reviewed full article. Meeting abstracts were not included.
    Published meeting abstracts have not been peer-reviewed and often do not include sufficient details about experimental methods to permit one to verify that the study was well designed.24,25 In addition, it is not uncommon for abstracts that are published as part of conference proceedings to have inconsistencies when compared to the final publication of the study, or to describe studies that are never published as full articles.26–30
  7. The study must have enrolled 10 or more individuals per arm.
    The results of very small studies are unlikely to be applicable to general clinical practice. Small studies are unable to detect sufficient numbers of events for meaningful analyses to be performed, and are at risk of enrolling unique individuals.
  8. When several sequential reports from the same patients/study are available, only outcome data from the most recent report were included. However, we used relevant data from earlier, smaller reports if they presented pertinent data not included in the most recent report.
  9. Studies of biopsy instrumentation that are no longer commercially available were excluded.
    The ABBI device, the MIBB device, and SiteSelect have been discontinued by their manufacturers. Studies of the accuracy and harms related to the use of these devices are no longer clinically relevant.

To address Question 2, we recorded any harms information reported in the studies included to address Question 1. In addition, we collected any articles, regardless of design, that addressed part of Question 2, namely the dissemination of cancer cells by the biopsy procedure. To address Question 3, we consulted a variety of information sources, including published literature, cost-effectiveness analyses, evidence-based clinical practice guidelines, published expert panel consensus statements, and consultations with experts. Given the nature of Question 3, we did not use formal inclusion criteria for it; instead, we approached it as an “opinion/discussion” type of question.

To address the accuracy of open surgical biopsy, we first searched for clinical studies that performed open surgical biopsy, followed patients for six months or longer, and met the above listed inclusion criteria. However, we identified no clinical studies that met the inclusion criteria, so we searched for systematic and narrative reviews that addressed the accuracy and harms of open surgical biopsy.

The abstracts of articles identified by the literature searches were screened in duplicate for possible relevance by three research assistants. The first fifty abstracts screened by each research assistant were also screened in duplicate by the lead research analyst, and all exclusions at the abstract level were approved by the lead research analyst. The full-length articles of studies that appeared relevant at the abstract level were then obtained and three research assistants examined the articles in duplicate to see if they met the inclusion criteria. All conflicts were resolved by the lead research analyst. The excluded articles and primary reason for exclusion are shown in Appendix C.

Data Abstraction

Standardized data abstraction forms were created, and data were entered by each reviewer into the SRS© 4.0 database (see Appendix D). Three research assistants abstracted the data. The first fifty articles were abstracted in duplicate. All conflicts were resolved by the lead research analyst.

Study Quality Evaluation

We used an internal validity rating scale for diagnostic studies to grade the internal validity of the evidence base (Table 2). This instrument is based on a modification of the QUADAS instrument.31 Each question in the instrument addresses an aspect of study design or conduct that can help to protect against bias. Each question can be answered “yes,” “no,” or “not reported,” and each is phrased such that an answer of “yes” indicates that the study reported a protection against bias on that aspect. A summary quality score was computed in order to reduce the subjectivity of the assessment of the potential for bias present in the evidence base. The summary score was computed with each “yes” given a +1, each “no” a −1, and each “not reported” a zero. As all of the factors captured by the questions on the quality instrument were thought to be of equal importance for this topic, no weighting was used in computing the summary score. This summary score was then normalized (see Footnote a) to a scale from 0 to 10, with lower scores indicating a greater risk that the study was affected by bias. Consequently, a study employing all 14 features would score 10, a study employing none would score 0, and a study simply not reporting any of these features would score 5, thus acknowledging that published studies may not provide information on all study procedures that were actually carried out.
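To make the scoring arithmetic concrete, here is a minimal Python sketch of the normalization described above and in Footnote a. The function name and answer encoding are our own illustration; only the +1/−1/0 weights and the normalization formula come from the report.

```python
def normalized_quality_score(answers):
    """Normalize the 14-item summary quality score to the 0-10 scale.

    `answers` is a list of 14 responses, one per question on the
    instrument: "yes" (+1), "no" (-1), or "not reported" (0).
    """
    points = {"yes": 1, "no": -1, "not reported": 0}
    raw = sum(points[a] for a in answers)  # raw score ranges from -14 to +14
    return (raw + 14) / 28 * 10            # Footnote a: ((raw score + 14)/28) * 10

# A study reporting protection on all 14 items scores 10; one answering
# "no" everywhere scores 0; one reporting nothing scores 5.
print(normalized_quality_score(["yes"] * 14))           # 10.0
print(normalized_quality_score(["not reported"] * 14))  # 5.0
```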

Table 2. Quality assessment instrument.

To evaluate the overall quality of the evidence base for each conclusion, we computed the median quality score of the studies contributing to that conclusion. An evidence base with a median score higher than 8.4 was considered to be of high quality; an evidence base with a median score of 8.4 or less but greater than 6.7 was considered to be of moderate quality; an evidence base with a median score of 6.7 or less but greater than 5.0 was considered to be of low quality; and an evidence base with a median score less than 5.0 was considered to be of insufficient quality. Internal validity assessment findings are summarized for each outcome in the Results section. Responses to the questions in the quality assessment instrument for each study are presented in the Evidence Tables in the Appendix.
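The cut-points translate directly into a lookup. The following Python sketch mirrors the thresholds stated above; note that the report leaves a median of exactly 5.0 unassigned, so folding it into “insufficient” here is our assumption, not the report’s rule.

```python
from statistics import median

def evidence_quality(scores):
    """Map studies' 0-10 quality scores to the report's quality tiers."""
    m = median(scores)
    if m > 8.4:
        return "high"
    if m > 6.7:
        return "moderate"
    if m > 5.0:
        return "low"
    # The text labels medians below 5.0 "insufficient" and does not assign
    # a tier to exactly 5.0; it is folded into "insufficient" here.
    return "insufficient"

print(evidence_quality([9.0, 8.2, 7.5]))  # "moderate" (median 8.2)
```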

Strength of Evidence

The strength of evidence supporting each major conclusion was graded as High, Moderate, Low, or Insufficient. The grade was developed by considering various important domains as suggested in the CER Draft Methods Guide and in accordance with a strength and stability of evidence grading system developed by ECRI Institute.32 Four domains were evaluated: the quality (potential risk of bias, or “internal validity”) of the evidence base, the size of the evidence base, the consistency (agreement across studies) of the findings, and the robustness of the findings (as determined by sensitivity analysis). The domain of “directness” was incorporated into our analytic framework, but not into the grade, as downstream patient health outcomes are rarely reported in diagnostic studies.

The domain of “precision” was incorporated into our assessment of the size of the evidence base. The domain considered to be of overriding importance for this topic was the potential for bias in the evidence base. The potential for bias was measured by the quality of the evidence as described above. The quality rating was considered to be the highest strength of evidence grade that could be achieved for each conclusion. The other domains were evaluated as either “Sufficient” or “Insufficient,” and ratings of “Insufficient” for other domains caused a downgrading of the strength of evidence grade. Further details about grading the strength of evidence may be found in Appendix G.

Because of the nature of Question 3 and the sources of information used to address it, we did not draw many formal evidence-based conclusions for this question, nor, in most cases, did we attempt to rate the quality of the studies or grade the strength of the evidence. For one conclusion for Key Question 3 we considered the consistency, robustness, and strength of association between the type of biopsy and the outcome to be sufficient to support an evidence-based conclusion.

Applicability

The issue of applicability was chiefly addressed by excluding studies that enrolled patient populations that were not a general population of asymptomatic women participating in routine breast cancer screening programs. We defined this population as women at average risk of breast cancer participating in routine breast cancer screening programs (including mammography, clinical examination, and self-examination). We excluded studies that enrolled women who were referred for biopsy for the purposes of staging of already diagnosed breast cancers, evaluation of the axillary lymph nodes, evaluation of metastatic or suspected metastatic disease, or evaluation of recurrent or suspected recurrent disease, as well as studies that enrolled women thought to be at very high risk of breast cancer due to family history or carrier status for BRCA mutations. We also excluded studies of biopsy instrumentation that is no longer commercially available, on the grounds that the data reported are no longer applicable to clinical practice.

To verify that the evidence base enrolled a “typical” population, we examined the prevalence of breast cancers diagnosed. The prevalence of cancers in the general population sent for breast biopsy in the U.S. has been reported to be around 23%.15 If our evidence base were indeed typical of patients in the U.S., we would expect to see a similar prevalence of breast cancers.

Data Analysis and Synthesis

Several key assumptions were made: (1) the “reference standard,” open surgical biopsy and/or clinical and radiologic followup for at least six months, was 100% accurate; (2) the pathologists diagnosing the open surgical biopsy results were 100% accurate in diagnosing the material submitted to them; and (3) core-needle diagnoses of malignancy (invasive or in situ) that could not be confirmed by an open surgical procedure were assumed to have been correct diagnoses in which the lesion had been completely removed by the core-needle biopsy procedure.33 In addition, the majority of studies reported data on a per-lesion rather than a per-patient basis; we therefore analyzed the data on a per-lesion basis, assuming that statistical assumptions of data independence were not being violated.

We performed two primary types of analyses: a standard diagnostic accuracy analysis and an analysis of underestimation rates. For the diagnostic accuracy analysis,

  • true negatives were defined as lesions diagnosed as benign on core-needle biopsy that were found to be benign by the reference standard;
  • false negatives were defined as lesions diagnosed as benign on core-needle biopsy that were found to be malignant (invasive or in situ) by the reference standard;
  • true positives were defined as lesions diagnosed as malignant (invasive or in situ) on core-needle biopsy, as well as “high risk” lesions that were found to be malignant (invasive or in situ) by the reference standard; and
  • false positives were defined as lesions diagnosed as “high risk” (most commonly ADH lesions) on core-needle biopsy that were found not to be malignant (invasive or in situ) by the reference standard (see Table 3).
Table 3. Definitions of diagnostic test characteristics.
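Given the 2×2 counts defined above, the standard test characteristics follow directly. The Python sketch below is illustrative only; the counts in the usage example are hypothetical, not data from the report.

```python
def test_characteristics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from per-lesion 2x2 counts,
    following the Table 3 definitions (core-needle result compared with
    the reference standard)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)                           # population specific; see text
    negative_lr = (1 - sensitivity) / specificity  # used with Bayes' theorem below
    return sensitivity, specificity, npv, negative_lr

# Hypothetical counts for illustration only:
sens, spec, npv, nlr = test_characteristics(tp=230, fp=12, fn=5, tn=760)
print(f"sensitivity={sens:.3f}  specificity={spec:.3f}  NPV={npv:.3f}  LR-={nlr:.3f}")
```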

We meta-analyzed the data reported by the studies using a bivariate mixed-effects binomial regression model as described by Harbord et al.34 All such analyses were computed with the Stata 10.0 statistical software package using the “midas” command.35 The summary likelihood ratios and Bayes’ theorem were used to calculate the post-test probability of having a benign or malignant lesion. In cases where a bivariate binomial regression model could not be fit, we meta-analyzed the data using a random-effects model and the software package Meta-DiSc.36 Meta-regressions were also performed with the Meta-DiSc software package.
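For readers who want a feel for the random-effects step, the following Python sketch pools per-study sensitivities with the univariate DerSimonian-Laird method on the logit scale. This is a deliberately simplified stand-in, not the report’s method: the primary analysis used the bivariate mixed-effects binomial model of Harbord et al. (via Stata’s “midas”), which models sensitivity and specificity jointly. The counts in the example are hypothetical.

```python
import math

def pool_logit_sensitivity(tp_fn_pairs):
    """Univariate DerSimonian-Laird random-effects pooling of per-study
    sensitivities on the logit scale (a simplified illustration only)."""
    y, v = [], []
    for tp, fn in tp_fn_pairs:
        tp, fn = tp + 0.5, fn + 0.5      # continuity correction
        y.append(math.log(tp / fn))      # logit(sensitivity)
        v.append(1 / tp + 1 / fn)        # approximate within-study variance
    w = [1 / vi for vi in v]
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    tau2 = max(0.0, (q - (len(y) - 1)) /
               (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))
    w_star = [1 / (vi + tau2) for vi in v]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return 1 / (1 + math.exp(-pooled))   # back-transform to a proportion

# Hypothetical (true positive, false negative) counts from three studies:
print(pool_logit_sensitivity([(120, 4), (88, 2), (240, 9)]))
```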

All diagnostic tests involve a trade-off between minimizing false-negative and minimizing false-positive errors. False-positive errors on core-needle biopsy are not considered to be as clinically relevant as false-negative errors. Women who experience a false-positive error will be sent for an additional biopsy procedure and may suffer anxiety and minor temporary complications. However, women who experience a false-negative error may die from a delayed cancer diagnosis. In addition, because all “positive” diagnoses of malignancy on core-needle biopsy are assumed to be correct, the “true” false-positive rate is artificially reduced towards 0%. Thus false-positive errors, and the diagnostic test characteristics that evaluate their impact (specificity, positive predictive value, positive likelihood ratio), are not particularly relevant for evaluating this technology.

We focused on measures that evaluate the extent of false-negative errors: sensitivity and the negative likelihood ratio. A biopsy method with a very high sensitivity misses very few cancers. Negative likelihood ratios can be used along with Bayes’ theorem to directly compute an individual woman’s risk of having a malignancy following a “benign” diagnosis on core-needle biopsy. In general, the smaller the negative likelihood ratio, the more accurate the diagnostic test is in predicting the absence of disease. However, each woman’s post-test risk varies with her pre-test risk of malignancy. Simple nomograms are available for in-office use that allow clinicians to read individual patients’ post-test risk directly off a graph without having to go through the tedium of calculations. Negative predictive value is another commonly used measure of false-negative errors; however, negative predictive values are population specific. They can be used to predict how many women in a particular population do not have a malignancy following a “benign” diagnosis on core-needle biopsy, but they vary with the prevalence of disease and should not be applied to other populations with different prevalences of disease.
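The calculation behind those nomograms is short enough to show directly. In this minimal Python sketch, the 23% pre-test probability reuses the prevalence cited in the Applicability section, while the negative likelihood ratio of 0.03 is an assumed value chosen purely for illustration.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Bayes' theorem on the odds scale:
    post-test odds = pre-test odds x likelihood ratio."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# A woman with a 23% pre-test risk of malignancy who receives a "benign"
# core-needle result, assuming a negative likelihood ratio of 0.03:
print(post_test_probability(0.23, 0.03))  # about 0.009, i.e., ~0.9% residual risk
```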

The second type of analysis we performed was an analysis of underestimation rates. Lesions diagnosed as DCIS by core-needle biopsy that were found to be invasive by the reference standard were counted as underestimates. Similarly, “high risk” lesions (most commonly ADH) that were found to be malignant (in situ or invasive) by the reference standard were counted as underestimates (see Table 4). The underestimation rate was then calculated as the number of underestimates per number of DCIS (or “high risk”) diagnoses and expressed as a percentage (the percentage of DCIS or ADH diagnoses that were underestimates). We meta-analyzed the underestimation rates with a random-effects model using the CMA software package.37

Table 4. Definitions of underestimation rates.
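As a quick worked example of the rate defined above (a Python sketch with hypothetical counts, not data from the report):

```python
def underestimation_rate(underestimates, diagnoses):
    """Percentage of DCIS (or "high risk") core-needle diagnoses that the
    reference standard showed to be more severe (Table 4 definitions)."""
    return 100.0 * underestimates / diagnoses

# Hypothetical example: of 80 DCIS diagnoses on core-needle biopsy,
# 18 proved invasive on the reference standard:
print(underestimation_rate(18, 80))  # 22.5 (%)
```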

We meta-analyzed any other types of outcomes with a random-effects model using the CMA software package.37 We did not assess the possibility of publication bias because statistical methods developed to assess the possibility of publication bias in treatment studies have not been validated for use with studies of diagnostic accuracy.38,39

Peer Review and Public Commentary

A draft of the completed report was sent to the peer reviewers, representatives of AHRQ, and the Scientific Resource Center. The draft report was also posted to a Web site for public comment. In response to the comments of the peer reviewers and the public, revisions were made to the evidence report, and a summary of the comments and their disposition was submitted to AHRQ. Peer reviewer comments on a preliminary draft of this report were considered by the EPC in the preparation of this final report. The synthesis of the scientific literature presented here does not necessarily represent the views of individual reviewers.

Footnotes

a. Formula: ((raw score + 14)/28) × 10
