Evidence-based Case Review
Competing interests: None declared
See this article on our web site for links to other articles in the series.
Authors: Alessandro Liberati is affiliated with the Istituto di Statistica Medica, Università degli Studi di Modena e Reggio Emilia. Alessandro Liberati, Roberto Buzzetti, and Nicola Magrini are affiliated with the Centro Valutazione Efficacia Assistenza Sanitaria, Modena; and Alessandro Liberati and Roberto Grilli are associated with the Centro Cochrane Italiano, Istituto "Mario Negri," Milan, Italy. Dr Grilli is also with the Agenzia Servizi Sanitari Regionali in Rome.
This article was edited by Virginia A Moyer of the department of pediatrics, University of Texas Medical Center at Houston. Articles in this series are based on chapters from Moyer VA, Elliott EJ, Davis RL, et al, eds. Evidence-Based Pediatrics and Child Health. London: BMJ Books; 2000.
Correspondence to: Dr Liberati email@example.com
The parents of a healthy, asymptomatic 5-year-old boy are anxious about his health and ask about the appropriateness of undergoing a screening examination with urinalysis. You search for existing recommendations on this topic and find the book, Putting Prevention Into Practice.1 You find the 2 statements outlined below.
This clinical scenario raises a number of important questions:
Explicit recommendations for clinical practice, such as guidelines or diagnostic and therapeutic protocols, are published frequently, but many have conflicting recommendations. To decide which guidelines we should follow, we need common criteria to assess the quality of available evidence. Although it is generally agreed that practice guidelines should explicitly assess the quality of the evidence that supports different statements, this is still uncommon.2
Historically, the Canadian Task Force was the first to attempt to classify levels of evidence supporting clinical recommendations. It did this by reviewing the indications for preventive interventions and producing recommendations with an explicit grading of the supporting evidence.3 These were subsequently adopted by the US Preventive Services Task Force.4 The original approach used by the Canadian Task Force classified randomized controlled trials (RCTs) as the highest level of evidence, followed by non-RCTs, cohort and case-control studies (representing fair evidence), comparisons among times and places with or without the intervention, and at the lowest level, "expert opinion." This approach is simple to understand and easy to apply, but it implicitly assumes that RCTs, no matter how small or large or how properly conducted, always produce better evidence than nonexperimental studies such as cohort or case-control studies. This approach also ignores the issue of heterogeneity and, thus, what to do when results from several RCTs or other nonexperimental studies vary.
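The limitation described above can be made concrete with a minimal sketch. It assumes a ranking that looks only at study design, in the order the paragraph lists the designs; the numeric ranks, the `Study` fields, and the two example studies are hypothetical illustrations, not the Task Force's own labels or data.

```python
from dataclasses import dataclass
from enum import IntEnum


class DesignRank(IntEnum):
    """Illustrative ranks (lower = stronger design), in the order the text lists them."""
    RCT = 1
    NON_RANDOMIZED_TRIAL = 2
    COHORT_OR_CASE_CONTROL = 3
    TIME_OR_PLACE_COMPARISON = 4
    EXPERT_OPINION = 5


@dataclass(frozen=True)
class Study:
    label: str
    design: DesignRank
    participants: int      # hypothetical field; the design-only ranking never reads it
    well_conducted: bool   # hypothetical field; the design-only ranking never reads it


def best_evidence(studies):
    """Return the study that a design-only hierarchy would treat as strongest."""
    return min(studies, key=lambda s: s.design)


# Two made-up studies, chosen only to expose the implicit assumption.
studies = [
    Study("small, poorly conducted trial", DesignRank.RCT, 40, False),
    Study("large, careful cohort study", DesignRank.COHORT_OR_CASE_CONTROL, 20000, True),
]

print(best_evidence(studies).label)
```

Because the ranking reads only the `design` field, the small, poorly conducted trial is preferred over the large cohort study, which is the implicit assumption the text criticizes.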
Other scales proposed since that of the Canadian Task Force still rely on methodologic design of primary studies as the main criterion. These have incorporated systematic reviews and meta-analyses, which are placed above RCTs in the "hierarchy of evidence." Whereas this allows for a possibly more refined grading of levels of evidence, it suffers from the same limitation—ie, that attention is given to the a priori validity of the methods used. More recently, scales assessing the quality of study conduct and the consistency of results across different studies have been proposed.
The aims of this article are as follows:
We will not address how strength of recommendations has been assessed. This is a complex concept that implies value judgments and an explicit methodologic assessment of available studies. As recently suggested (A Oxman, S Flottorp, J Cooper, et al, "Levels of Evidence and Strength of Recommendations," unpublished data, 1999), "strength of recommendations" is a construct that should go beyond levels of evidence to incorporate more subjective considerations, such as patient- or setting-specific applicability; tradeoffs among risk, benefits, and costs; and the like.
WAYS TO CLASSIFY LEVELS OF EVIDENCE
When used for individual studies, quality assessment provides explicit criteria to separate valid from invalid studies (usually referred to as "internal or scientific validity"). When used in a systematic review, quality assessment can assist in qualifying the recommendations to be incorporated into practice guidelines or recommendations (figure).
A priori validity of study design
The validity of study design is the oldest and still most commonly used approach to levels of evidence classification. The 2 main advantages of this approach are its explicit nature and the fact that a general consensus exists regarding the hierarchy of different types of study designs in their ability to prevent bias.3,4 On the other hand, this approach relies exclusively on issues of design, thereby ignoring issues of study conduct and of the consistency and clinical and epidemiologic relevance of study findings.
Quality of study conduct
Despite its appeal, the feasibility of analyzing the quality of the conduct of the study is seriously jeopardized by the lack of consensus regarding the appropriate indicators of study validity (lack of an agreed-on gold standard). Not even for RCTs—the most standardized type of study design—is there an agreement on whether a quality score or a criteria-based system is better.5 Several years ago, Emerson et al6 failed to demonstrate the predictive validity of a widely used, detailed method for quantifying the quality of trials, which included evaluating adequacy of descriptions, blinding, and essential measurements. More recently, Juni et al7 reported substantial differences in the assessed "quality" of an article, depending on the method used to measure it. Thus far, the only item for which there is clear empiric evidence of bias prevention is the quality of the randomization process, defined as the extent to which the allocation process was concealed.8
Consistency of results across studies
Consistency of results is an important issue, although it must be adjusted for the study design and quality of study conduct. Dramatically large effects may be consistently reported in studies of lower methodologic quality (eg, a series of observational studies), but further tests based on more rigorous designs may then indicate much smaller, if any, effect.9 Relative to the quality of study conduct, consistency per se does not imply validity, as a series of individual studies can be systematically wrong if the same biases exist (such as in selecting the study population or using systematically inaccurate measurements).
Clinical relevance of study results
The difficulty with ensuring clinical relevance of results lies in defining generic criteria for relevant end points of interventions across diseases or conditions, and in the likely dependence of the judgment(s) on the perspective of the assessor(s) (ie, patient, provider, or purchaser).
EXISTING SCALES FOR CLASSIFYING LEVELS OF EVIDENCE
The table lists 9 scales available to assess levels of evidence.3,4,10,11,12,13,14,15,16
All scales explore the dimension of a priori study validity, but the level of detail varies from the simplest approach of the Canadian Task Force (4 levels) to the more complex and analytic taxonomy proposed by more recent scales. Only 4 scales also critically appraised the quality of the study conduct through predefined criteria, although they differed on the criteria applied and on operational definitions.13,14,15,16 Consistency of results is incorporated into 4 scales.12,14,15,16 However, heterogeneity is neither clearly nor consistently defined across scales.
Some scales, such as those of the Canadian Task Force and the US Preventive Services Task Force, separate levels of evidence from strength of recommendations. In the case illustrated in the opening paragraph, for example, the evidence for the use of routine urinalysis was level I, and the recommendation was "type E" (do not perform). In other scales, the 2 are more closely tied.
The state of the art is, therefore, still unsatisfactory. Although 3 scales look at all 3 dimensions listed in the table,14,15,16 the main challenge for a better approach to classifying levels of evidence is how to combine the 3 dimensions outlined earlier with the clinical and epidemiologic relevance of the study findings.
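The coverage counts reported in this section can be cross-checked with a short sketch. It assumes that each scale is identified only by the reference number cited in the text (3, 4, and 10 through 16); the variable names are hypothetical, and no scale names are inferred.

```python
# Scales are keyed by the reference numbers cited in the text.
ALL_SCALES = {3, 4, 10, 11, 12, 13, 14, 15, 16}

# Which scales address each dimension, as stated in the section.
coverage = {
    "a priori validity of study design": set(ALL_SCALES),
    "quality of study conduct": {13, 14, 15, 16},
    "consistency of results": {12, 14, 15, 16},
}

# Scales that address all 3 dimensions.
covers_all_three = set.intersection(*coverage.values())

# Scales that address study design but neither of the other 2 dimensions.
design_only = (
    ALL_SCALES
    - coverage["quality of study conduct"]
    - coverage["consistency of results"]
)

print(sorted(covers_all_three))  # [14, 15, 16]
print(sorted(design_only))       # [3, 4, 10, 11]
```

The intersection reproduces the 3 scales said to look at all 3 dimensions, and the remainder shows that 4 of the 9 scales rest on study design alone.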
NEED TO CONSIDER EPIDEMIOLOGIC AND CLINICAL RELEVANCE
When the Canadian Task Force scale was originally proposed, RCTs were less common and requirements for drug approval were less stringent, so that evidence from such trials was often not available. With the much wider availability of these trials, the scales have become insensitive to differences in the quality of supporting evidence. As a result, it may be inappropriate to accept the presence of 1 or 2 RCTs as sufficient evidence in favor of an intervention.
Critically appraising aspects of the question addressed is also important: was the study designed to explore long-term versus short-term use of the treatment, the type of skill or experience required by the providers, and the availability of the appropriate level of care? Two issues are central here: the nature of the end point (whether it is hard or soft, clinical or surrogate, and what its relationship is to the quality or quantity of life), and the appropriateness of the comparator chosen (whether different candidate interventions are directly compared or each is compared only with nothing or placebo).
Strong evidence of effect for an intervention does not necessarily translate into equally strong recommendations for its use. Cost, the values placed on the outcomes by physicians and patients, and feasibility must all be factored into the recommendation, along with the evidence (strong or otherwise). For instance, when assessing the evidence for and against breast cancer screening on a population level, although the evidence of effectiveness is strong (the usefulness of mammography screening in women >50 years is supported by several RCTs), it may still be inappropriate to recommend screening if the other criteria for implementation are not met. For example, a particular health district may have too few well-trained radiologists to read the mammograms, pathologists to interpret the biopsy specimens, or surgeons to perform appropriate surgery. On the other hand, evidence that is less strong may lead to strong recommendations when there are no viable alternatives and the do-nothing approach is not feasible.
CONCLUSIONS AND FUTURE DIRECTIONS
Although more recent scales take into account the quality of study conduct, we found no scale that explicitly includes the clinical and epidemiologic relevance of the question addressed by the studies. The use of only methodologically based quality assessment to judge the evidence supporting an intervention is inadequate, especially in an area of therapy where RCTs (ie, the highest methodologic level of evidence) are commonly available.
A possible solution is to abandon the idea that a generic scale can satisfactorily assess levels of evidence for a particular therapeutic or diagnostic question. A generic scale could be integrated with specific criteria targeted to the nature of the question being explored. The generic scale should look at the a priori quality of study design (ie, has the appropriate design for the question at issue been used?) and at the validity of the study conduct. Scales such as those discussed by Hadorn, Ball, Liddle, and Jovell and their co-workers13,14,15,16 are all good steps in this direction, although an effort to provide operational definitions is needed. The criterion-specific items might concentrate on the relevance of the end point and on the appropriateness of its timing, setting, and level of care.
What is the lesson for the decision to be taken in the clinical scenario at the start of this article? Going back to the original sources, you find that the US Preventive Services Task Force report indicates that evidence from both RCTs and observational studies supports the recommendation not to perform a screening test for asymptomatic bacteriuria in infants, children, and adolescents.4
The recommendation by the American Academy of Pediatrics is simply an unqualified consensus statement without any reference to the level of evidence supporting it.17
Despite the limitations of existing scales available to assess levels of evidence, having an explicit approach for ranking the methodologic quality of available studies is useful, at least for the time being. It is particularly helpful when comparing different recommendations allegedly drawn from the same type of