Observer bias in randomised clinical trials with binary outcomes: systematic review of trials with both blinded and nonblinded outcome assessors.
Article abstract
OBJECTIVE:
To evaluate the impact of nonblinded outcome assessment on estimated treatment effects in randomised clinical trials with binary outcomes.
DESIGN:
Systematic review of trials with both blinded and nonblinded assessment of the same binary outcome. For each trial we calculated the ratio of the odds ratios: the odds ratio from nonblinded assessments relative to the corresponding odds ratio from blinded assessments. A ratio of odds ratios <1 indicated that nonblinded assessors generated more optimistic effect estimates than blinded assessors. We pooled the individual ratios of odds ratios with inverse variance random effects meta-analysis and explored reasons for variation in ratios of odds ratios with meta-regression. We also analysed rates of agreement between blinded and nonblinded assessors and calculated the number of patients needed to be reclassified to neutralise any bias.
DATA SOURCES:
PubMed, Embase, PsycINFO, CINAHL, Cochrane Central Register of Controlled Trials, HighWire Press, and Google Scholar.
ELIGIBILITY CRITERIA FOR SELECTING STUDIES:
Randomised clinical trials with blinded and nonblinded assessment of the same binary outcome.
RESULTS:
We included 21 trials in the main analysis (with 4391 patients); eight trials provided individual patient data. Outcomes in most trials were subjective, for example, qualitative assessment of the patient's function. The ratio of the odds ratios ranged from 0.02 to 14.4. The pooled ratio of odds ratios was 0.64 (95% confidence interval 0.43 to 0.96), indicating an average exaggeration of the nonblinded odds ratio by 36%. We found no significant association between low ratios of odds ratios and scores for outcome subjectivity (P=0.27); nonblinded assessor's overall involvement in the trial (P=0.60); or outcome vulnerability to nonblinded patients (P=0.52). Blinded and nonblinded assessors agreed in a median of 78% of assessments (interquartile range 64-90%) in the 12 trials with available data. The exaggeration of treatment effects associated with nonblinded assessors was induced by the misclassification of a median of 3% of the assessed patients per trial (1-7%).
CONCLUSIONS:
On average, nonblinded assessors of subjective binary outcomes generated substantially biased effect estimates in randomised clinical trials, exaggerating odds ratios by 36%. This bias was compatible with a high rate of agreement between blinded and nonblinded outcome assessors and driven by the misclassification of few patients.
Article content
Introduction
The randomised clinical trial is regarded as the most valid method for assessing the benefits and harms of healthcare interventions.1 One challenge to the validity of such trials is the tendency for assessments of outcomes to systematically deviate from the truth because of predispositions in observers, such as from hope or expectation.2
Such observer bias, also called ascertainment bias or detection bias, might be especially important when outcome assessors have strong predispositions and when outcomes are subjective (that is, involve personal judgment such as with qualitative scores or pattern recognition of images). Conversely, observer bias might have little practical importance when neutral assessors evaluate an objective outcome, such as death.
Many trials use blinded outcome assessors to avoid bias, though use of nonblinded outcome assessors is also common,3 4 especially in nonpharmacological trials. For example, one study of orthopaedic trauma trials reported that blinded outcome assessment had not been implemented in 90% of trials.3 To what degree the estimated effects of experimental interventions in randomised trials are affected by lack of blinding of the outcome assessors, and which factors influence the degree of bias, are empirical questions.
The most reliable way of studying the impact of nonblinded outcome assessors is to analyse trials that use both blinded and nonblinded assessors for the same outcome. One such trial by Noseworthy and colleagues is often cited, reporting that the effect of plasma exchange for multiple sclerosis was significant only with assessments by nonblinded neurologists.5 The finding, however, was inconsistent across time points, seen only for one of the two experimental interventions, and might be atypical. Other studies have been based on indirect comparisons with a considerable risk of confounding.6 7 8
It is prudent to suspect possible bias in trials with nonblinded assessors. Existing analyses, however, do not provide a reliable assessment of the typical degree of observer bias in randomised clinical trials. Thus, it is not clear whether observer bias in clinical trials, on average, is negligible or large or how variable the size and direction of any observer bias is or which factors in a trial are associated with a more pronounced degree of bias.
A reliable evaluation of the impact of nonblinded outcome assessors in randomised clinical trials is important, both to guide the design of future trials and to assist the balanced interpretation of trial results, for example in the assessment of the risk of bias in trials for meta-analysis.1 It also seems important for evidence based medicine to strengthen its own evidence base.
We systematically reviewed randomised trials with blinded and nonblinded assessors of binary outcomes to evaluate the impact of nonblinded outcome assessment on estimated treatment effects in randomised clinical trials and to examine reasons for its variation.
Methods
We included randomised clinical trials with blinded and nonblinded assessment of the same binary outcome. We excluded trials where it was unclear which group was experimental and which was control as such trials would not allow us to determine the direction of any bias; trials in which only a subgroup of patients had been evaluated by blinded and nonblinded assessors, unless they were selected at random; trials in which blinded and nonblinded assessors had access to each other’s results (for example, blinded assessments were provided to nonblinded assessors as a quality enhancement procedure); and trials where initially blinded assessors clearly had become unblinded—for example, when radiographs showed ceramic material indicative of the experimental intervention. Finally, we excluded trials with blinded end point committees adjudicating the assessments made by nonblinded clinicians because such adjudication often involves previous knowledge of the nonblinded assessment or is restricted to adjudication of events only.
We searched standard databases (PubMed, Embase, PsycINFO, CINAHL, Cochrane Central Register of Controlled Trials) and full text databases (HighWire Press and Google Scholar). Our core search string was: random* AND (“blind* and unblind*” OR “masked and unmasked”) with variations according to the specific database (see appendix on bmj.com). The last search was performed on 26 January 2010. We read the references of all included trials and asked authors of all the included trials if they knew of other trials.
One author (ASST) read all abstracts from standard databases and all text fragments from full text databases. If a study was potentially eligible, one author (ASST or AH) retrieved the full study report and excluded ineligible studies. Two authors (AH and ASST, SB, or BT) decided on the eligibility of the remaining studies. Disagreements were resolved by discussion.
We selected one binary outcome from each trial. If several outcomes had been assessed by both blinded and nonblinded assessors we selected the primary outcome of the trial, and if none was stated we selected the outcome we found most clinically relevant. We included the first assessment after the end of treatment, unless the primary outcome prescribed a different time point. Two authors (AH and either SB or BT) selected the outcome independently. Disagreements were resolved by discussion. For trials with more than two groups, we pooled the results in the experimental or the control groups.
We extracted background data for each trial (ASST and FE or AH and SB) and outcome data from each trial (AH and SB or BT): total number of failures and total number of successes in each group resulting from the blinded assessment and the nonblinded assessment. When possible we also extracted paired patient level data on blinded and nonblinded assessments, and constructed a 2×2 table (failure/success×blind/nonblind) for the experimental group and a corresponding table for the control group. Data from split body designed trials were treated as if they derived from parallel group trials.
If data were incomplete, we emailed the corresponding author and, if necessary, at least one additional author, followed up by telephone calls, and at least two reminders. Authors were asked whether they would share unpublished data with our group. We also searched the Food and Drug Administration (FDA) website for such data.
When authors chose to send us individual patient data (that is, all randomised patients listed by allocation group and result of blinded and nonblinded assessment), we checked whether all randomised patients were included in the dataset and tried to replicate a table or a main result of the published paper. Two authors (AH and BT or SB) independently derived outcome data. Any discrepancy was solved by discussion. We sent our results to the authors of the trial for comments.
For each trial, we evaluated five prespecified potential confounders in the comparison between blinded and nonblinded outcome assessments: a considerable time difference between these two assessments, different types of assessors (such as nurses v physicians), different types of procedures (such as direct visual assessment of wounds v assessment of photographs of wound), a substantial risk of ineffective blinding procedure, and nonidentical groups of patients assessed (such as a few patients evaluated only by the blinded outcome assessor). For 16 trials, two masked authors (IB and PR) independently evaluated the first four items at a different location from the rest of the group. Other masked authors (AH and BT or SB) scored five trials. Disagreements were resolved by discussion. The masking was implemented by manipulating pdfs of the trial reports so that tables, graphs, or text describing results of any comparison between blinded and nonblinded assessors were blanked out. There were no cases of accidental unmasking.
Using the same masking procedure, we also evaluated characteristics of each outcome assessment. Two authors (mainly IB and PR) independently scored three factors on a scale from 1 (low) to 5 (high): the degree of outcome subjectivity (that is, the degree of assessor judgment, high in assessment of global improvement and low in reading a laboratory sheet); the nonblinded outcome assessor’s overall involvement in the trial (that is, a proxy for the degree of personal preference for a result favourable to the experimental intervention); and the vulnerability of the outcome to the reporting and behaviour of nonblinded patients (as they might influence results considerably when outcomes are based on interviews and less so when outcomes are based on pure observations, such as inspection of radiographs). Disagreements were resolved by discussion.
We calculated the odds ratio for failures (such as an unhealed wound) in each trial for both the blinded and nonblinded assessments. An odds ratio under 1 indicates a beneficial effect of the experimental intervention. For each trial we summarised the impact of nonblinded outcome assessment as the ratio of the odds ratios (OR_{nonblind} / OR_{blind}). A ratio <1 indicates that nonblinded assessments are more optimistic.
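As a minimal sketch of this calculation, the Python below computes the odds ratio for failure under each type of assessment and then their ratio. The counts are hypothetical and are not taken from any included trial; the function names are illustrative assumptions.

```python
def odds_ratio(fail_exp, succ_exp, fail_ctl, succ_ctl):
    """Odds ratio for failure: odds in the experimental group over odds in the control group."""
    return (fail_exp / succ_exp) / (fail_ctl / succ_ctl)


def ratio_of_odds_ratios(nonblind, blind):
    """OR_nonblind / OR_blind; a value below 1 means the nonblinded estimate is more optimistic."""
    return odds_ratio(*nonblind) / odds_ratio(*blind)


# Hypothetical counts: (failures exp, successes exp, failures ctl, successes ctl)
blind = (20, 30, 25, 25)
nonblind = (15, 35, 25, 25)

or_blind = odds_ratio(*blind)        # (20/30) / (25/25)
or_nonblind = odds_ratio(*nonblind)  # (15/35) / (25/25)
ror = ratio_of_odds_ratios(nonblind, blind)
print(f"OR blind={or_blind:.3f}, OR nonblind={or_nonblind:.3f}, ROR={ror:.3f}")
```

In this made-up example the nonblinded assessors record five fewer failures in the experimental group only, which is enough to pull the ratio of odds ratios below 1.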
We meta-analysed the individual trial ratios of odds ratios with inverse variance methods using random effects models.9 The standard error of the ratio of odds ratios used for the main analysis disregarded the dependency between blinded and nonblinded assessments. The statistical software we used was Stata 11.
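The pooling step can be sketched as follows. This is an illustration under stated assumptions, not the Stata routine used for the review. It assumes the usual large-sample variance of a log odds ratio (sum of reciprocal cell counts) and the DerSimonian and Laird moment estimator for the between-trial variance, which this text does not name. It treats the blinded and nonblinded log odds ratios as independent by adding their variances, which is one reading of "disregarded the dependency". All trial counts are hypothetical.

```python
import math


def log_or_and_var(a, b, c, d):
    """Log odds ratio and its large-sample variance from a 2x2 table
    (a, b = failures, successes in experimental; c, d = same in control)."""
    return math.log((a / b) / (c / d)), 1 / a + 1 / b + 1 / c + 1 / d


def log_ror_independent(nonblind, blind):
    """Log ratio of odds ratios; variance ignores the pairing of assessments."""
    l_nb, v_nb = log_or_and_var(*nonblind)
    l_b, v_b = log_or_and_var(*blind)
    return l_nb - l_b, v_nb + v_b


def dersimonian_laird(effects, variances):
    """Inverse variance random effects pooling (assumed DerSimonian-Laird tau^2)."""
    w = [1 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2


# Hypothetical trials: (nonblinded table, blinded table)
trials = [
    ((15, 35, 25, 25), (20, 30, 25, 25)),
    ((8, 42, 12, 38), (10, 40, 12, 38)),
    ((30, 70, 40, 60), (33, 67, 39, 61)),
]
pairs = [log_ror_independent(nb, b) for nb, b in trials]
effects = [p[0] for p in pairs]
variances = [p[1] for p in pairs]
pooled, se, tau2 = dersimonian_laird(effects, variances)
print("pooled ROR:", math.exp(pooled))
print("95% CI:", math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))
```

Pooling is done on the log scale and exponentiated at the end, so the confidence interval is asymmetric around the pooled ratio, as in the reported results.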
We tested the robustness of our main analysis of the ratios of the odds ratios in sensitivity analyses. In these we used standard errors that took account of the dependence between blinded and nonblinded assessments (see appendix on bmj.com), gave all trials equal weight, and based the analysis on the ratio of risk ratios, as some readers might find risk ratios easier to interpret than odds ratios. We studied whether the effect differed in subgroups of trials involving various types of data; clinical problems; objectives, designs, and sources of funding; and type of nonblinded assessor; and according to risk of confounding. We also evaluated the influence of small sample size on estimated ratio of risk ratios by funnel plot inspection.1
We furthermore explored whether the variation in ratios of odds ratios was associated with the three prespecified outcome characteristics described above, using random effects meta-regression of the log ratio of odds ratios on the scores for each outcome characteristic.
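A simplified sketch of such a regression is given below. It is a weighted least squares fit of the log ratio of odds ratios on a single score, with weights 1/(variance + tau²). It assumes the between-trial variance tau² is supplied rather than estimated, so it is not a full random effects meta-regression as a statistical package would fit it. All values are hypothetical.

```python
import math


def weighted_meta_regression(log_rors, variances, scores, tau2=0.0):
    """Weighted least squares of log ROR on one covariate.
    tau2 is an assumed, externally supplied between-trial variance."""
    weights = [1.0 / (v + tau2) for v in variances]
    sw = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, scores)) / sw
    y_bar = sum(w * y for w, y in zip(weights, log_rors)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, scores))
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, scores, log_rors)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)  # treats the weights as known
    return intercept, slope, se_slope


# Hypothetical data lying exactly on the line log ROR = 0.1 - 0.2 * score
scores = [2, 3, 4, 5]
log_rors = [0.1 - 0.2 * s for s in scores]
variances = [0.20, 0.10, 0.15, 0.05]
intercept, slope, se_slope = weighted_meta_regression(
    log_rors, variances, scores, tau2=0.1
)
print(intercept, slope, se_slope)
```

A negative slope would correspond to lower (more optimistic) ratios of odds ratios at higher subjectivity scores.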
To analyse the pattern of misclassifications underlying any difference between the blinded and nonblinded outcome assessments we compared the total number of failure events during nonblinded and blinded assessments in the experimental and in the control group and also compared the rate of agreement between blinded and nonblinded assessments in each trial. Finally, we calculated how many reclassifications of nonblinded assessments were needed to neutralise a difference between the blinded and nonblinded treatment effects—that is, to drive the ratio of odds ratios to 1 (see appendix on bmj.com).
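The two quantities described here can be illustrated with a short sketch. The agreement rate follows directly from a paired 2×2 table. The reclassification count below is only one simple possibility and should not be read as the appendix method, which is not reproduced in this text: it assumes that only nonblinded assessments in the experimental group are reclassified, from success to failure, and it allows fractional patients. All counts are hypothetical.

```python
def agreement_rate(both_fail, blind_fail_only, nonblind_fail_only, both_success):
    """Proportion of patients given the same result by both assessors."""
    total = both_fail + blind_fail_only + nonblind_fail_only + both_success
    return (both_fail + both_success) / total


def reclassifications_to_neutralise(nonblind, blind):
    """Number x of nonblinded experimental-group successes that would have to be
    relabelled as failures so that OR_nonblind equals OR_blind (assumed scheme).
    Tables are (failures exp, successes exp, failures ctl, successes ctl).
    A negative x means the bias runs in the opposite direction."""
    a, b, c, d = nonblind
    a_b, b_b, c_b, d_b = blind
    or_blind = (a_b / b_b) / (c_b / d_b)
    target_odds = or_blind * (c / d)  # required experimental odds of failure
    # Solve (a + x) / (b - x) = target_odds for x
    return (target_odds * b - a) / (1.0 + target_odds)


# Hypothetical paired data for one experimental group of 50 patients:
# blinded failures = 14 + 6 = 20, nonblinded failures = 14 + 1 = 15
agreement = agreement_rate(both_fail=14, blind_fail_only=6,
                           nonblind_fail_only=1, both_success=29)

blind = (20, 30, 25, 25)
nonblind = (15, 35, 25, 25)
x = reclassifications_to_neutralise(nonblind, blind)
share = x / sum(nonblind)  # as a fraction of all assessed patients
print(agreement, x, share)
```

Solving for a continuous x rather than stepping patient by patient is a design choice that keeps the sketch short; it is consistent with the non-integer counts reported in the results but is not confirmed as the authors' approach.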
Results
We examined 537 publications based on 1835 hits in standard databases and 2200 hits in full text databases. We excluded 512 studies, mostly because they were not randomised clinical trials or lacked blinded or nonblinded outcome assessment (see appendix on bmj.com). Thus, we included 25 trials (tables 1⇓ and 2⇓).10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Of the 25 trials, six published outcomes for both types of assessments usable for our analysis.16 19 23 25 26 29 Contact with authors and searches of the FDA website increased the number of trials with outcome data to 21 (4391 randomised patients), of which eight trials provided individual patient data.11 12 13 14 15 21 23 24 Thirteen trials had strictly paired data (all patients had been assessed both by blinded and by nonblinded assessors), and eight trials provided predominantly paired data as a minority of patients had been assessed by only one type of assessor (see appendix on bmj.com).
In ten trials the validity of the nonblinded assessments was tested against the blinded assessments or nonblinded assessments were used as backup for missing blinded data.10 11 15 19 20 21 22 23 24 30 In four trials the main focus of the paper or abstract was a direct comparison between blinded and nonblinded outcome assessment, but it is unclear whether this was the original reason for using dual type assessors.21 23 24 30 In one trial refinement of the methods implied addition of blinded assessments without omission of the initially planned nonblinded assessments.26
Fifteen of the 21 trials (71%) studied the effect of surgery or a procedure, 19 were parallel group trials (90%), and the median sample size was 172 (10th-90th centile 35-368). The trials were conducted in general surgery, orthopaedic surgery, plastic surgery, cardiology, gynaecology, anaesthesiology, neurology, psychiatry, dermatology, otolaryngology, infectious diseases, and ophthalmology (table 1⇑).
The outcomes of the trials were in most cases subjective—for example, qualitative assessments of patients’ function (such as severity of angina or neurological deficit) or assessment of healing status (such as wounds or ulcers or fractures) (table 2⇑). Seventeen trials (81%) scored 4 or 5 for outcome subjectivity on the 1 to 5 scale.
The odds ratio point estimate was more optimistic when based on the nonblinded assessors in 15 trials (71%) (fig 1⇓). The ratio of odds ratios in the 21 trials ranged from 0.02 to 14.4 (fig 2⇓). The pooled ratio of odds ratios was 0.64 (95% confidence interval 0.43 to 0.96) with moderate heterogeneity (I^{2}=45%, P=0.015). Thus, on average, the odds ratios based on nonblinded assessments were exaggerated by 36% compared with the odds ratios based on blinded assessments.
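The step from the pooled ratio to the percentage exaggeration is simple arithmetic, sketched below using the figures reported in the text. The back-calculation of a standard error from the confidence interval assumes a normal approximation on the log scale, which is an assumption of this sketch rather than something the text states.

```python
import math


def exaggeration_percent(ror):
    """Percentage by which the nonblinded odds ratio is exaggerated: (1 - ROR) * 100."""
    return (1.0 - ror) * 100.0


def log_se_from_ci(lower, upper, z=1.96):
    """Standard error on the log scale implied by a 95% interval (normal approximation)."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


pooled, lower, upper = 0.64, 0.43, 0.96  # values reported in the text
exaggeration = exaggeration_percent(pooled)
log_se = log_se_from_ci(lower, upper)
# The interval is roughly symmetric around the pooled value on the log scale
log_midpoint = math.exp((math.log(lower) + math.log(upper)) / 2.0)
print(round(exaggeration), log_se, log_midpoint)
```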
Individual patient data provided 48% of the weight of the main analysis. The main result was robust, though sensitivity and subgroup analyses in general had wide confidence intervals (table 3⇓). In the 12 trials with data on the dependence between blinded and nonblinded assessments the pooled ratio of odds ratios was 0.76 (0.61 to 0.94). In these 12 trials, the standard error accounting for the dependence was a median of 25% smaller than the corresponding standard errors assuming independence. Reducing the standard errors of the nine additional trials (without data on the dependence between blinded and nonblinded assessments) by 25% resulted in a pooled ratio of odds ratios of 0.64 (0.44 to 0.93). No trial was free from any of the five predefined possible confounders, but results were not clearly affected (table 3⇓). The funnel plot was symmetrical on visual inspection (data not shown). Based on a qualitative assessment, the results in the four trials with incomplete or unclear outcome data did not seem to differ from the results in the trials we did meta-analyse (see appendix on bmj.com).
Meta-regression analyses showed no significant association between low ratios of odds ratios and scores for outcome subjectivity (P=0.27), nonblinded outcome assessor’s overall involvement in the trial (P=0.60), or outcome vulnerability to the reporting and behaviour of nonblinded patients (P=0.52). The slope of the regression line between log ratio of odds ratios and scores for outcome subjectivity, however, was in the expected direction. The 17 trials with clearly subjective outcomes (scores 4-5 on a 1-5 scale) had a pooled ratio of odds ratios of 0.55 (0.32 to 0.95). The four trials with moderately subjective outcomes (scores 2-3) had a pooled ratio of 0.93 (0.56 to 1.54).
The pattern of misclassifications underlying the difference between blinded and nonblinded results was characterised by more optimistic nonblinded assessments. The nonblinded assessors detected 26% fewer failure events (such as no wound healing) compared with the blinded assessors (984 v 1335). In the intervention groups the nonblinded assessors detected 35% fewer patients with treatment failure than the blinded assessors (421 v 649 events), whereas in the control group the proportion was 18% (563 v 686 events).
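The percentages in this paragraph can be reproduced from the event counts it reports, as in the short check below. The function name is an illustrative assumption.

```python
def percent_fewer(nonblinded_events, blinded_events):
    """Percentage fewer failure events recorded by nonblinded than by blinded assessors."""
    return (1.0 - nonblinded_events / blinded_events) * 100.0


# Failure event counts as reported in the text (nonblinded v blinded)
overall = percent_fewer(984, 1335)
intervention = percent_fewer(421, 649)
control = percent_fewer(563, 686)
print(round(overall), round(intervention), round(control))
```

The group counts also add up to the overall totals (421 + 563 = 984 and 649 + 686 = 1335), so the three percentages describe the same set of assessments.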
The pattern of misclassifications was also characterised by a preoccupation with the intervention group. In the 12 trials with data on agreement, the blinded and nonblinded assessors agreed in a median of 78% of patient assessments (interquartile range 63-91%). The proportion of concordant assessments, and the corresponding proportion of discordant assessments, however, seemed to differ according to the allocation group. The median proportion of discordant assessments between blinded and nonblinded assessors per trial was 28% (9-41%) in the intervention group and 16% (9-37%) in the control group (see appendix on bmj.com).
The number of reclassified assessments per trial needed to neutralise a difference between the estimated blinded and nonblinded treatment effects (that is, to drive the ratio of odds ratios to 1.00) ranged from 0 to 41.7, with a median of 2.5. This corresponded to 0-28% of the assessed patients per trial, with a median of 3% (see appendix on bmj.com).
Discussion
The estimated effects of experimental interventions in randomised clinical trials tended to be considerably more optimistic when they were based on nonblinded assessment of subjective outcomes compared with blinded assessment. The pooled ratio of odds ratios was 0.64 (0.43 to 0.96), indicating that the nonblinded outcome assessors generated odds ratios that, on average, were exaggerated by 36%. We interpret this as empirical evidence for substantial observer bias.
Strengths and weaknesses of the study
This result is based on contemporary trials representing a fair range of clinical specialties. The unique trial design with paired data implies a low risk of confounding. The data were high quality, as individual patient data provided about half of the weight of the main analysis. Our results were robust to modifications to both type of analysis and summary statistic. For example, the ratio of relative risks was 0.78 (0.63 to 0.96), indicating that nonblinded outcome assessors generated relative risks that, on average, were exaggerated by 22%.
We possibly did not identify all trials but we do not know whether they would report markedly different results. Publication bias is normally driven by the effect of a treatment35 and has less impact on our comparison between two types of assessments. Four trials in our study published papers with a main focus on observer bias. Though confidence intervals were wide, these four trials did not report significantly different findings compared with the 17 other trials.
Our cohort of trials is not representative of medical trials in general. We included no trials with clearly objective outcomes, such as total mortality. The trials we did include had mainly subjective outcomes—such as qualitative assessments of patients and evaluation of fracture or wound healing—and our result is applicable to trials with similar subjective outcomes. We would anticipate less observer bias with more objective outcomes, though it is an interesting question which medical outcomes should be considered clearly objective, apart from total mortality and some laboratory outcomes. Furthermore, the extrapolation of our results to randomised trials with binary subjective outcomes hinges on the assumption that the degree of observer bias in our trials with dual observation of outcomes is essentially similar to trials with only nonblinded observers.
We found no association between observer bias and five prespecified potential confounders. A special concern, however, is consensus classifications that could reduce observer variability and leave less room for observer bias. The only trial with consensus based nonblinded assessments11 found no observer bias (ratio of odds ratios 1.06, 0.79 to 1.43). It is unclear whether this is caused by the consensus classification, chance, or other trial characteristics.
We included one trial with probable reversed direction of bias.17 The trial compared an experimental oral prodrug, valganciclovir, for cytomegalovirus retinitis with the intravenous version of the same substance, ganciclovir. The comparison between nonblinded and blinded outcome resulted in a ratio of odds ratios that was extreme, but in the reversed direction. Comparable retinitis trials, also with blinded and nonblinded assessors, have reported similar results favouring the control intervention on time to event outcomes.36 We included the trial in our main analysis without reversing the direction of bias. Had we done so, the pooled ratio of odds ratios would have been 0.57 (0.39 to 0.84), indicating an average exaggeration of the effect estimate by 43%.
Several previous studies have compared treatment effects in “double blind” trials with similar trials not reported as “double blind.”7 8 An overview of seven such studies reported a pooled ratio of odds ratios of only 0.91 (0.83 to 1.00).7 Wood and colleagues’ reanalysis of three of the studies reported a similar overall result but with a ratio of odds ratios of 0.75 (0.61 to 0.93) for subjective outcomes.8 These studies do not directly evaluate the impact of blinded outcome assessors, are partly based on ambiguous terminology,3 37 and involve a considerable risk of confounding. Still, our findings are numerically roughly similar to those of Wood and colleagues.8
Mechanisms of observer bias
The pattern of misclassifications underlying the observer bias can be characterised by “optimism error” and “intervention preoccupation.” The nonblinded assessors detected fewer failures than blinded assessors. This optimism error, however, was much more pronounced in the intervention group than in the control group. Thus, the nonblinded outcome assessor did not “underrate” patients in the control group and “overrate” patients in the intervention group. Both groups were overrated but the intervention group considerably more so.
A third important feature of observer bias is the striking contrast between the substantial degree of observer bias we found and the surprisingly small number of misclassified patients needed to generate this bias. The median number of patients needed to be reclassified to neutralise bias in a trial was 2.5, or 3% of the assessed patients. The difference between numbers of events in the experimental group and the control group determines the estimated effect. Numbers of events are usually considerably smaller than the number of included patients, and still smaller is the number of misclassifications needed to bias the estimated effect. For example, in the trial by Noseworthy and colleagues,5 21 the ratio of odds ratios was 0.81 (0.40 to 1.61). This degree of bias was neutralised by reclassification of two of the 140 included patients. Binary outcomes seem sensitive to directional misclassifications of a few patients.
Fundamentally, observer bias is caused by the predispositions of the observers, which might vary unpredictably from trial to trial. Our cohort of trials probably consists of some trials with largely neutral assessors and some trials with predisposed assessors. The expected degree of observer bias in trials with predisposed assessors will be considerably larger than our averaged result. Thus, in any individual trial it is not possible to safely predict either the direction or the size of any bias. We would advise against using our pooled average as a simplistic correction factor. When the possible bias in a trial with nonblinded assessors is ascertained, the range of possible observer bias should be taken into account and not only our pooled average. Furthermore, it would be prudent to also consider the type of outcome involved and any indicators for predispositions in assessors.
Implications
Blinding outcome assessors might be seen as too cumbersome, unnecessary, or directly mistaken38 39; compared with the huge logistical challenges involved in setting up a trial, however, it is a minor procedure and one that improves reliability considerably. Fortunately, blinding the assessor is possible in nearly all trials, sometimes after the development of creative blinding procedures.40 41 In some trials a subsample of patients is blindly assessed and the result used to validate nonblinded assessments. Such comparisons are inherently underpowered and should be avoided.
Our result strengthens the hypothesis that blinding can also be important for other key people in a trial, especially patients,42 who can be seen as privileged outcome assessors of their own symptoms. Still, it is important to separately study the impact of blinding each key person. For example, one study found little impact of blinded outcome adjudicators in 10 large cardiovascular trials.43
We found no significant association between the degree of observer bias and degree of outcome subjectivity, though the association was in the expected direction. Future investigations could further analyse the role of outcome subjectivity and other factors that could modify the degree of observer bias.
The problem of observer bias goes beyond the randomised clinical trial. Comparisons between blinded and nonblinded observers in other types of empirical investigations have reported results indicative of observer bias, for example in an observational study of patients with primary dystonia,44 an evaluation of cancer staging,45 an assessment of surgical skills,46 and a neurophysiological laboratory study.47 Furthermore, observer bias has been reported or discussed within veterinary science,48 forensic science,49 special education studies,50 animal behaviour research,51 and broadly within psychology.52 53 Observation is fundamental to scientific activity; observer bias might be too.
In conclusion, randomised clinical trials with nonblinded assessors of subjective binary outcomes will, on average, generate substantially biased estimates of treatment effects. The bias is compatible with a high rate of agreement between blinded and nonblinded assessments and is driven by the misclassification of a few patients.
What is already known on this topic

Nonblinded assessors of binary outcomes are used in many randomised trials

It is prudent to suspect bias in randomised clinical trials with nonblinded outcome assessors

The typical impact of nonblinded outcome assessors on trial results is unclear, partly because previous studies have been based on indirect comparisons with high risk of confounding
What this study adds

Estimated effects in randomised clinical trials, measured as odds ratios, are exaggerated by an average of 36% when based on nonblinded assessments of subjective binary outcomes

The bias is compatible with a high rate of agreement between blinded and nonblinded outcome assessors and driven by the misclassification of few patients