Mind those outcome measures! A guide to enhanced critical appraisal of outcomes

Illustration of the OPTBal test (see end for further detail)

Outcome data always need to be critically examined, writes Dr Helen Handoll, Co-ordinating Editor of the Cochrane Bone, Joint and Muscle Trauma Group, in order not to jeopardize your systematic review. Here she presents some recently developed guidance on how to do this.

Re-posted 5 August 2020 (this blog was originally posted on Cochrane Bone, Joint and Muscle Trauma on 27 July 2020).

This blog was prompted by the late realisation, fortunately before completion of the review, that the key outcome data presented in the trial reports of four included trials were incompatible with expected values. This led to the belated decision not to use these data in the review, initially provisional on clarification from the trial authors. These were extreme cases, but the experience pointed to the general need to facilitate a more critical examination of outcomes by systematic reviewers. To this end, I have developed a guide to additional critical appraisal relating to outcome measures, particularly continuous outcomes, that I hope will be helpful, including in the wider sense with data entry, presentation and interpretation as part of the review process. I then illustrate the guide with insights and examples drawn from my experience as both an editor and systematic reviewer. Please note that this is supplemental to advice in the Cochrane Handbook, including the guidance on statistical and data processing aspects.

Guide to enhanced critical appraisal of outcomes

1. Know the expected outcomes

These will be topic dependent, but the unexplained absence of an expected outcome warrants extra attention.

2. Know your outcome measures

Check the sources for outcome tools and find out the characteristics of the outcomes, including how these are measured; whether devices are required; whether the assessment applies to a point in time or to a period of time; the expected direction of effect; and, if appropriate, minimal important differences.

3. Find out and note what the authors did

When appraising articles, it is better not to automatically assume the authors are using or reporting the outcome measure as you know it. It is helpful to ask the following.

  • Do the authors define the measure used?
  • Do the authors reference the measure, especially if the primary outcome, and is it the correct reference?
  • Do the authors use a variant or adjusted version of the outcome measure?
  • Do the authors give the range and direction of effect and/or units where appropriate?
  • How did they measure / calculate the outcome result?
  • Does the text tally with the tables and figures?

4. Consider the plausibility of the results and claims

Are the results compatible with reasoned expectations? For example, consider what the normal values are in your population; the recovery or condition trajectory; whether there is an inappropriate focus on minor outcomes; and whether key outcomes that may feature in the justification for the trial are missing.

Note. This guide is not intended to promote a drastic loss of data used from trial reports but to enhance critical appraisal and to reduce the unquestioning use of data.

An expansion of the guide together with examples is provided below.

1. Know the expected outcomes
The lack of reporting of some key outcomes should be noted. Such outcomes will be topic dependent and some may feature in the 'Types of outcome measures' section of the review.

e.g. No reporting of adverse effects for surgical trials. This would be exceptional as a focus of surgical trials is often on surgical complications, with an anticipated difference between the tested interventions. This might also be an issue for selective reporting bias.

e.g. No reporting of mortality for a general population of hip fracture patients in a surgical trial. Typically, hip fracture is associated with between 6% and 10% mortality at one month. It is possible that participants who died have been retrospectively excluded from the trial population.
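
As a rough plausibility check on the mortality example, one can ask how likely it is that a trial would observe no deaths at all. The Python sketch below assumes, purely for illustration, that deaths are independent and that the trial population resembles a general hip fracture population. The function name and the sample size of 100 are hypothetical.

```python
def prob_no_deaths(n_participants, mortality_rate):
    """Probability of observing zero deaths among n participants,
    assuming independent deaths at a common rate (a simplification)."""
    if not 0.0 <= mortality_rate <= 1.0:
        raise ValueError("mortality_rate must be between 0 and 1")
    return (1.0 - mortality_rate) ** n_participants


# Hypothetical trial of 100 participants; 6% and 10% are the one-month
# mortality bounds quoted in the text.
for rate in (0.06, 0.10):
    p = prob_no_deaths(100, rate)
    print(f"rate={rate:.0%}: P(no deaths) = {p:.5f}")
```

Under these assumptions, the chance of no deaths in a trial of this size is well below 1%, which supports querying the report rather than taking the absence of mortality data at face value.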

2. Know your outcome measures
It is useful to establish commonly used, preferably validated, measures such as patient-reported outcome measures and functional outcomes at the protocol stage, with source citations (these can be URLs) where appropriate, knowledge of the scale and direction of effect and, where possible and appropriate, some idea of the minimal important difference (MID) for the measures, preferably for the condition under review. This includes, as illustrated below, awareness of when different versions of tools are available.

e.g. The IKDC (International Knee Documentation Committee) knee score has strikingly different versions, a difference that both trial authors and review authors can fail, and have failed, to observe. Readers are alerted to this in a Cochrane Review, published in 2006, as follows [see box].

It is important to note that the studies included in this review utilized two different versions of the IKDC knee form, which differ in their scoring systems. The scoring system used in the original (1995) version provides an overall grade (A, B, C or D) that incorporates the patient's subjective score (Hefti 1993; Irrgang 1998). The newer (2000) version provides an overall group grade (A, B, C or D) in addition to a patient-based subjective score out of 100 (Irrgang 2001).

Also useful to know is how the outcome is typically measured, such as whether devices are required (e.g. goniometer), and whether the assessment applies to a point in time or to a period of time (e.g. the Oxford Shoulder Score applies to the previous 4 weeks). These help you to assess whether the use of the tool was practical and even appropriate at the times of use reported in the trial report.

3. Find out and note what the authors did
When appraising articles, it is better not to automatically assume the authors are using or reporting the outcome measure as you know it. It helps to check and generally note the following.

Do the authors define the measure used?
e.g. There can be a mention of pain but not of the measure used or its range. It is therefore important to consider whether it is safe to assume that a 10 cm (or 100 mm) visual analogue scale was used; if made, such assumptions should be noted. Be aware that trial authors can think up lots of ways of measuring and reporting pain, as well as other outcomes, and that pain could be part (a subdomain) of a composite score.

Do the authors reference the measure, especially if the primary outcome, and is it the correct reference?
Referencing the outcome measure’s source can help but the reality can still differ from the outcome measure described in the reference article. For instance, where there are different versions of the tool, the wrong one may be referenced.
e.g. An error subsequently corrected, but only for range, by the trial authors [see box], relates to the Oxford Shoulder Score. The trial authors reference the original version published in 1996, but their data use the very different scoring system for a later version, published in 2009. Notably, the original scale also was a demerit scoring system ranging from 12 points (best) to 60 points (worst).

This article was updated on May 20, 2020, because of a previous error. On page 482, in the legend for Figure 5, the sentence that had read “The Oxford Shoulder Score consists of 12 questions concerning shoulder pain, shoulder function, and activities of daily living and ranges from 12 points (worst) to 60 points (best)” now reads “The Oxford Shoulder Score consists of 12 questions concerning shoulder pain, shoulder function, and activities of daily living and ranges from 0 points (worst) to 48 points (best).”
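
The two scoring systems for the Oxford Shoulder Score can be related by simple arithmetic, which is handy when checking which version a trial's data fit. The Python sketch below assumes that the original version scores each of the 12 items from 1 (best) to 5 (worst) and the later version from 0 (worst) to 4 (best), which is consistent with the ranges quoted above. The function names are hypothetical.

```python
def original_to_later(total_original):
    """Convert an original-version total (12 best to 60 worst) to the
    later scoring (0 worst to 48 best), assuming 12 items scored 1 to 5."""
    if not 12 <= total_original <= 60:
        raise ValueError("original-version totals range from 12 to 60")
    return 60 - total_original


def possible_versions(reported_mean):
    """List the scoring versions whose range contains the reported mean."""
    versions = []
    if 12 <= reported_mean <= 60:
        versions.append("original (12 best to 60 worst)")
    if 0 <= reported_mean <= 48:
        versions.append("later (0 worst to 48 best)")
    return versions


print(original_to_later(12))    # best original score, converted
print(possible_versions(8.0))   # below 12, so only one version fits
print(possible_versions(30.0))  # fits both ranges
```

A mean that fits both ranges cannot be assigned to a version on range alone, so the direction of effect stated or implied in the report still needs checking.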

Do the authors use a variant or adjusted version of the outcome measure?
Variations, such as shortened forms (SF-12 versus the SF-36; a commonly-used quality-of-life measure) and tools validated for different languages and settings, can be legitimate and arguably desirable. Even so, authors may have added questions and/or altered or removed others. This can result in a rather different outcome measure than the one alleged in the trial report. These changes may be difficult to detect but differences in the number of items or total scores can be indicators of undeclared adjustments.

Do the authors give the range and direction of effect and/or units where appropriate?
This information is needed in order to report and interpret study findings in a review. As illustrated for the Oxford Shoulder Score, an extreme example of a changed scoring approach, the second version of the score differed not only in the range but also the direction of effect.

As discussed above, the very common outcome of pain can be measured in many ways. Although it is typically a demerit score (i.e. higher scores mean worse pain), it may not be, and the scale may not be 0 to 10 cm either.

Units, such as for grip strength, range of motion, and time, can be deduced to some extent, but with caution; clearly it is better when the trial report states these.

How did they measure / calculate the outcome result?
Change (note from when as it may not be from baseline) and final scores (note at what time) are obvious characteristics to note and are considered in the Cochrane Handbook. There are others, however, such as scores adjusted for various characteristics (e.g. age, gender, dominant arm) and other difference scores. For example, the latter can be differences between the injured limb and the uninjured or ‘normal’ limb. Such scores tend to be presented as percentages, which also reflects the temptation for some trialists to rescale scores to 100.
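
To see why it matters how a result was calculated, the Python sketch below contrasts two derived scores that can both end up reported as a number out of 100. The function names and the example values are hypothetical and are not taken from any trial.

```python
def rescale_to_100(score, scale_min, scale_max):
    """Express a raw score as a percentage of its scale range."""
    return 100.0 * (score - scale_min) / (scale_max - scale_min)


def percent_of_uninjured(injured, uninjured):
    """Express the injured-limb value as a percentage of the uninjured limb."""
    return 100.0 * injured / uninjured


# Hypothetical values: a raw score of 36 on a 0 to 48 scale, and a grip
# strength of 24 kg in the injured hand versus 32 kg in the uninjured hand.
print(rescale_to_100(36, 0, 48))
print(percent_of_uninjured(24, 32))
```

Both calls give the same figure, yet one is a rescaled total score and the other a between-limb comparison, so the two should not be pooled or interpreted as the same outcome.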

Does the text tally with the tables and figures?
It is reassuring if these do but definitely of concern when they don’t.

4. Consider the plausibility of the results and claims
As well as your knowledge of the outcome measure, this is where topic-specific knowledge, such as typical recovery timings and extents from acute injuries, is important. These will help detect extreme values and contradictory or inconsistent results that should give pause for thought about the reliability of the data as reported.

e.g. Knowledge of usual values for ranges of movements, usually presented as degrees, of various joints can help alert one to potential problems [see here for example]. An extreme and clearly incorrect example was the finding of shoulder flexion in excess of 360 (while not specified in the report, if the unit of measurement was degrees, this was more than a full circle).
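
A check of this kind can be sketched in a few lines of Python. The upper limit of 180 degrees for shoulder flexion is an illustrative assumption, as are the names used; limits for a real review should come from a clinical reference.

```python
# Illustrative upper limit in degrees. This value is an assumption for the
# sketch and should be replaced with limits from a clinical reference.
USUAL_MAX_DEGREES = {"shoulder flexion": 180.0}


def check_range_of_motion(movement, reported_degrees):
    """Classify a reported range-of-motion value, assuming it is in degrees."""
    usual_max = USUAL_MAX_DEGREES[movement]
    if reported_degrees > 360.0:
        return "impossible: more than a full circle"
    if reported_degrees > usual_max:
        return "query: above the usual maximum"
    return "plausible"


for value in (150.0, 200.0, 365.0):
    print(value, check_range_of_motion("shoulder flexion", value))
```

The check only flags values for follow-up. Whether the problem lies in the units, the measurement method or the data themselves still has to be established from the report or the trial authors.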

e.g. This example, detailed in the box below, is more complicated and draws on knowledge of the Constant score, which is a commonly used composite objective and subjective score for assessing shoulder function (0: worst outcome; 100: best outcome). For this outcome, the patient is tested on maximum strength (25 points) and range of motion (40 points), and asked about pain (15 points) and activities of daily living (20 points). The trial compared two different methods of arm immobilisation after surgical fixation of a major shoulder fracture in older people. The Constant score at 30 days was the trial’s primary outcome; a statistically significant difference was reported for this and final follow-up.

In the following, I present a simplified version of the trial’s Constant score results in order to illustrate the additional aspects of concern relating to these.


            Day 1        Day 3 post op   Day 30 post op   6 to 24 months
Group 1     1.6 ± 0.8    81.1 ± 4.8      86.4 ± 4.7       88.3 ± 4.1
Group 2     1.5 ± 0.7    80.7 ± 5.5      76.0 ± 19.0      86.0 ± 5.0

Aspects that raise outcome-specific concerns are detailed below.

  • Assessing the strength and range of motion components of the Constant score is impractical and would be distressing for patients before and so soon (day 3) after surgery. There is no mention of measures taken to adapt the assessment process in the trial report.
  • Obtaining results at day 3 would be irrelevant too; these are very early days in the recovery trajectory for a serious injury.
  • A score of 80 at day 3 is very high and implausible: at this stage, the patients are likely to be in pain, unable to do anything much with their shoulder, stiff (wary of moving and generally advised to swing their arms in a gentle fashion only) and weak (and unable to endure the strength test).
  • The changes from day 3, through to day 30 and onwards to final follow-up are small and not clinically important (one fracture-specific estimate of a minimal clinically important difference is 11.2). This is contrary to the expected recovery pattern for these injuries.
  • The mean final values are still rather high for this population as many will not recover their former shoulder function. This is confirmed by checks of the results from other studies. With one unexplained exception, the SDs seem smaller than expected (but do not appear to be SEs). This gives an impression of a more homogeneous population than expected.

Key aspects that indicate the exclusion of these data from a review are the impracticality of measurement as well as the unexpectedly high level of functioning at 3 days after major shoulder surgery, and the lack of evidence of subsequent improvement.
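
The comparison of post-operative changes with the minimal important difference can be reproduced with a short Python sketch. The means are those in the simplified results above and 11.2 is the estimate quoted in the text. The labels "group_1" and "group_2" and the function names are placeholders introduced here.

```python
MID = 11.2  # fracture-specific estimate quoted in the text

# Post-operative mean Constant scores from the simplified results above.
# "group_1" and "group_2" are placeholder labels for the two trial groups.
TIME_POINTS = ["Day 3 post op", "Day 30 post op", "6 to 24 months"]
MEANS = {
    "group_1": [81.1, 86.4, 88.3],
    "group_2": [80.7, 76.0, 86.0],
}


def changes_between(means):
    """Change in mean score between each pair of consecutive time points."""
    return [round(later - earlier, 1) for earlier, later in zip(means, means[1:])]


def reaches_mid(change, mid=MID):
    """True if the absolute change is at least the minimal important difference."""
    return abs(change) >= mid


for group, means in MEANS.items():
    steps = zip(TIME_POINTS, TIME_POINTS[1:], changes_between(means))
    for start, end, change in steps:
        verdict = "reaches MID" if reaches_mid(change) else "below MID"
        print(f"{group}: {start} to {end}: {change:+.1f} ({verdict})")
```

None of the changes between consecutive post-operative time points reaches 11.2 points in either group, which is the pattern described above as contrary to the expected recovery trajectory.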

Clearly it is important to avoid the indiscriminate use of data, which is a key source of reservations about automated or unthinking data extraction. Although still unusually high, the data that generally would have been extracted for this example would have been for the final follow-up. Such a dataset used for the review would have disregarded or lost the data at 3 days and thus the clues that something was seriously amiss.

In summary, some data-specific questions you could ask when you first skim the trial report are as follows.

  • Are data missing for a key outcome but presented, even highlighted, without justification for a ‘minor’ outcome?
  • Are the values presented plausible for the population at the time of measurement?
  • If the condition is progressive (e.g. recovery), are the ‘jumps’ between different follow-up times unexpectedly low or high, and does the trend reflect a probable trajectory?

Final thoughts
Inadequate and space-limited reporting of trial methods and results means that reviewers often have to make assumptions when using and interpreting these. This guide should help increase awareness of this aspect, and review authors need to be aware of their own preconceptions and biases too. This guide is not intended to promote a drastic and unexplained cull of data used from trial reports, with the consequent evidence loss to the review, but to enhance critical appraisal and to reduce the unquestioning use of data. Ultimately, we are all best served by using and presenting reliable and meaningful outcome data.

This blog was prompted by discussion by the author with Joanne Elliott, the Managing Editor of the Cochrane BJMT Group. I am grateful to Joanne for discussion, feedback and corrections on successive drafts and for facilitating the publication.

The top figure illustrates the Older Person’s Tightrope Balance Test (OPTBal). The test is usually conducted outside, which restricts its use in some countries. There are two main variants of the test: one with a safety net and one without. There is no standardised length of the rope, this being determined by the setting. The use of balancing poles is discouraged but walking sticks may be carried. The results can be reported in several ways. These include number of successful attempts; time to cross (often the best of three crossings); number of steps for a set time period; number of falls and near misses; and fatalities. The OPTBal test may also be used to test fear of falling. Anyone encountering this test in a study report should think carefully about the applicability of the study findings.