Building High-Quality Training Data for Pathology AI: The Annotation Consensus Problem

There is a persistent oversimplification in how AI developers discuss training data quality: get enough annotations from enough pathologists and the model learns to generalize. The reality of building annotation sets for IHC biomarker scoring is considerably more demanding. Disagreement between annotators is not random noise that averages out at scale — it often reflects systematic differences in how pathologists have been trained to interpret the same visual evidence, and those systematic differences get encoded into training data in ways that produce models with systematic biases rather than statistical noise.

Understanding why annotation quality in IHC pathology is hard, and what rigorous annotation protocol design looks like, matters both for teams building AI pathology tools and for clinical institutions evaluating them. The training data quality question is directly relevant to a model's performance on your institution's slides, even if you will never see a line of the training code.

Why "Average of Three Pathologists" Is Not a Ground Truth Standard

The intuitive approach to resolving inter-pathologist disagreement is majority vote or averaged scores. Three readers, take the middle or the mean, use that as the training label. This is better than a single reader. It is not the same thing as a reliable ground truth, for several reasons.

First, pathologist disagreement on IHC scores is not always symmetrically distributed around a true value. In HER2 scoring, the systematic differences discussed earlier — calibration drift from different training lab environments, anchor effects from local staining intensity — mean that reader A may consistently call cases one grade higher than readers B and C, not randomly. A majority vote against reader A's annotations will encode a consistent directional bias that penalizes the model whenever it learns to see what reader A sees. The model learns the majority consensus, not the ground truth, and those are different things when systematic bias is present.
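One way to see whether disagreement is directional rather than random is to compute each reader's mean signed offset from the other readers. The sketch below is illustrative only: the scores are invented, the reader names are placeholders, and the function name signed_offsets is an assumption, not part of any published protocol.

```python
from statistics import mean

# Hypothetical HER2 IHC scores (0-3) from three readers on the same six cases.
scores = {
    "reader_a": [1, 2, 3, 3, 2, 1],
    "reader_b": [0, 1, 2, 3, 1, 1],
    "reader_c": [0, 2, 2, 3, 1, 0],
}


def signed_offsets(scores):
    """Mean signed difference between each reader and the mean of the others.

    A value near zero suggests no directional bias relative to the panel;
    a consistently positive or negative value suggests a systematic offset.
    """
    readers = list(scores)
    n_cases = len(scores[readers[0]])
    offsets = {}
    for reader in readers:
        others = [o for o in readers if o != reader]
        diffs = [
            scores[reader][i] - mean(scores[o][i] for o in others)
            for i in range(n_cases)
        ]
        offsets[reader] = mean(diffs)
    return offsets


offsets = signed_offsets(scores)
for reader, offset in offsets.items():
    print(f"{reader}: {offset:+.2f}")
```

With these made-up scores, reader_a sits about two-thirds of a grade above the other two readers on average. Note that the offset is relative to the panel, not to a true value, so it cannot by itself say which reader is closer to ground truth.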

Second, averaging continuous scores (H-score, Ki-67 percentage) across readers can obscure bimodal distributions. If two readers call a Ki-67 as 15% and one calls it 38%, the mean of 22.7% is a score that none of the three readers actually believed and that may sit at a clinically important decision boundary by accident. The bimodal distribution in this case is more informative than the mean — it signals that this case is genuinely ambiguous at the current scoring protocol's resolution.
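The Ki-67 example above can be worked through in a few lines. This is a minimal sketch: the function name summarize_reads and the 10-point spread threshold are assumptions chosen for illustration, not values from a scoring guideline.

```python
from statistics import mean, median


def summarize_reads(reads, max_spread=10.0):
    """Summarize continuous reads and flag wide inter-reader spread.

    max_spread is a hypothetical threshold in percentage points.
    """
    spread = max(reads) - min(reads)
    return {
        "mean": round(mean(reads), 1),
        "median": median(reads),
        "spread": spread,
        "discordant": spread > max_spread,
    }


# Two readers call 15%, one calls 38%.
summary = summarize_reads([15.0, 15.0, 38.0])
print(summary)
# {'mean': 22.7, 'median': 15.0, 'spread': 23.0, 'discordant': True}
```

Storing the spread and the discordance flag alongside the label keeps the information that the mean alone discards.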

Third, the pathologists used for annotation are themselves a sample, and if that sample is not representative of the range of trained pathologists who will encounter this case type in clinical practice, the consensus label is specific to that annotator population. An annotation set derived entirely from academic pathology subspecialists will reflect the scoring norms of academic subspecialty practice, which may differ from the norms at community hospital pathology departments, where part of the eventual user base is likely to practice.

Multi-Reader Adjudication Protocols

Rigorous annotation protocols for IHC scoring training data treat disagreement not as noise to be eliminated but as information to be classified. The key distinction is between resolvable disagreement and irreducible ambiguity.

Resolvable disagreement — where one reader made a clear factual error, missed a region, or applied the scoring criteria incorrectly — should be corrected through adjudication. The standard approach is a review session where discordant cases are presented to all readers together, the scoring guideline is applied explicitly to the specific contested features, and a consensus label is agreed upon with documented reasoning. This requires that annotators understand that adjudication is not a criticism of their initial read but a quality improvement step.

Irreducible ambiguity — where two or more readers have correctly applied the scoring criteria to the same features but reached different conclusions because the visual evidence genuinely lies on the boundary — should not be forced to a false consensus. These cases should be explicitly labeled as ambiguous, and training protocols should handle ambiguous-label cases differently from high-confidence cases. One approach is to use soft labels (probability distributions over categories rather than hard class assignments) for genuinely ambiguous cases, allowing the model to learn that the appropriate output for such cases is a probability distribution rather than a point estimate.
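One simple way to build a soft label is to use the fraction of readers who chose each category. The sketch below assumes the case has already been classified as irreducibly ambiguous during adjudication; the names HER2_CLASSES and soft_label are hypothetical, and vote fractions are only one of several possible ways to construct the distribution.

```python
from collections import Counter

# Ordered HER2 IHC categories (assumed label vocabulary for this sketch).
HER2_CLASSES = ("0", "1+", "2+", "3+")


def soft_label(votes, classes=HER2_CLASSES):
    """Convert reader votes into a probability distribution over classes."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(classes)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = len(votes)
    return [counts[c] / total for c in classes]


# Two readers call 2+, one calls 3+.
label = soft_label(["2+", "3+", "2+"])
print(dict(zip(HER2_CLASSES, label)))
```

A vector like this can be used as the target of a cross-entropy loss in place of a one-hot label, while high-confidence cases keep their hard labels.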

The practical challenge is building an annotation infrastructure that supports this distinction. Most annotation platforms used in pathology AI development were designed for binary or multiclass labeling and do not natively support soft labels or structured ambiguity documentation. Building or adapting annotation tooling to capture inter-reader disagreement metadata — not just the final label — is an infrastructure investment that pays dividends in training data quality.
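As a rough illustration of what capturing disagreement metadata could look like, the sketch below defines a per-case record that keeps every reader's read next to the adjudication outcome. All class names, field names, and status values here are assumptions made for the example; they do not describe any existing annotation platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReaderRead:
    reader_id: str
    score: str        # e.g. "2+" for HER2
    session_id: str   # lets drift be traced back to annotation sessions


@dataclass
class CaseAnnotation:
    case_id: str
    marker: str
    reads: List[ReaderRead] = field(default_factory=list)
    adjudicated_label: Optional[str] = None
    # Hypothetical status values: "unreviewed", "resolved", "irreducible".
    ambiguity: str = "unreviewed"
    adjudication_notes: str = ""

    def is_discordant(self) -> bool:
        """True when the readers did not all give the same score."""
        return len({r.score for r in self.reads}) > 1


record = CaseAnnotation(
    case_id="case-0001",
    marker="HER2",
    reads=[
        ReaderRead("reader_a", "3+", "session-01"),
        ReaderRead("reader_b", "2+", "session-01"),
        ReaderRead("reader_c", "2+", "session-02"),
    ],
)
if record.is_discordant():
    record.ambiguity = "irreducible"
    record.adjudication_notes = "Criteria applied correctly; staining at 2+/3+ boundary."
```

The point of the structure is that the individual reads and the reasoning survive adjudication instead of being overwritten by the final label.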

Annotator Calibration and Drift

Even within a single annotation project, annotator performance drifts over time. A pathologist who annotated 200 HER2 cases at the beginning of a project and 200 at the end, three months later, will typically show measurable calibration differences between the two batches. The underlying effect is well documented in the psychophysics literature: sustained exposure to a visual stimulus shifts perceptual thresholds. In batch annotation, regular exposure to clear-cut cases tends to shift annotators toward more confident classifications of borderline cases.

The standard mitigation is regular recalibration sessions — presenting annotators with a fixed set of calibration cases with pre-adjudicated ground truth labels, measuring the annotator's current performance against those cases, and providing feedback. This should happen at the beginning of each annotation session, not just at the beginning of the project. The calibration case set should include an adequate representation of boundary cases (2+/3+ boundary for HER2, the clinical cut-off region for Ki-67) because drift in boundary case discrimination is more clinically significant than drift in clear-cut case classification.
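A recalibration check of this kind can be sketched as follows. The example computes unweighted Cohen's kappa against the pre-adjudicated calibration labels, plus raw agreement restricted to boundary categories. The data is invented, the function names are assumptions, and for ordinal scores a weighted kappa is often preferred; the unweighted form is used here only to keep the sketch short.

```python
from collections import Counter


def cohen_kappa(reads, reference):
    """Unweighted Cohen's kappa between one annotator and the reference labels."""
    if len(reads) != len(reference) or not reads:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reads)
    observed = sum(r == ref for r, ref in zip(reads, reference)) / n
    read_counts, ref_counts = Counter(reads), Counter(reference)
    expected = sum(read_counts[k] * ref_counts[k] for k in read_counts) / (n * n)
    if expected == 1.0:  # both sides used a single identical category
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_on(reads, reference, categories):
    """Raw agreement restricted to cases whose reference label is in categories."""
    pairs = [(r, ref) for r, ref in zip(reads, reference) if ref in categories]
    if not pairs:
        raise ValueError("no calibration cases in the requested categories")
    return sum(r == ref for r, ref in pairs) / len(pairs)


# Hypothetical calibration set: pre-adjudicated labels and one session's reads.
reference = ["0", "1+", "2+", "3+", "2+", "3+", "2+", "3+"]
reads = ["0", "1+", "2+", "3+", "3+", "3+", "2+", "2+"]

kappa = cohen_kappa(reads, reference)
boundary_rate = agreement_on(reads, reference, {"2+", "3+"})
print(f"kappa={kappa:.3f}, 2+/3+ boundary agreement={boundary_rate:.3f}")
```

Tracking both numbers per session makes it possible to see boundary discrimination drifting even when overall agreement stays flat.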

We are not saying that annotator calibration protocols are universally implemented in IHC AI training datasets — they are not. What we are saying is that a model trained without systematic annotator calibration has an unknown amount of temporal drift baked into its training labels, and claims about model performance should be interpreted with that uncertainty in mind.

The Tissue Microarray Shortcut and Its Limitations

Tissue microarrays (TMAs) are a tempting source of pre-existing annotated IHC data. TMAs assemble small tissue cores from multiple cases on a single slide, enabling high-throughput staining and annotation. Published TMA datasets with pathologist annotations exist for HER2, Ki-67, and other markers, and they are reasonably tractable to digitize and use for training.

The limitation is that a TMA core is a 0.6–2.0 mm tissue punch, representing a small fraction of a whole tumor. Models trained primarily on TMA cores learn to score a small, pre-selected tissue area where the annotator knew to look. They may not generalize to the task of identifying the relevant scoring area within a full whole-slide image, where tissue context, field-of-view selection, and stroma-tumor boundary decisions are part of the scoring task.

A training dataset for clinical IHC scoring tools should include a substantial proportion of full-section whole-slide images — not just TMA cores — to ensure the model learns the complete scoring task, including the spatial reasoning about where within the slide to score. The exact proportion depends on the intended use; a tool designed for research use on TMA cores might legitimately train predominantly on TMA data, while a tool intended for clinical biopsy and surgical resection scoring should not.

Case Mix and the Class Imbalance Problem

IHC scoring training datasets naturally acquire the class distribution of the source case population. For HER2 in unselected breast cancer, roughly 15–20% of cases are 3+ positive; for Ki-67 using a 20% cutoff, the positive proportion varies by histologic subtype but is often 30–50% in enriched cancer populations. If a training dataset reflects these proportions, it will be representative of overall prevalence, but the model will see fewer examples of the minority class, which can degrade sensitivity and precision for the positive category.

Case-stratified sampling — intentionally oversampling boundary cases and low-prevalence positive cases — improves model performance at the clinically critical categories at the cost of representativeness of the overall case distribution. For a clinical IHC scoring tool, this trade-off generally favors stratified sampling because the diagnostic stakes are highest at the positive/negative boundary and at the equivocal 2+ band for HER2, not at the center of the negative distribution.
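A quota-based draw is one straightforward way to implement this kind of stratified sampling. In the sketch below, the stratum names, the pool composition, and the quotas are all invented for illustration, and stratified_sample is a hypothetical helper rather than an established API.

```python
import random
from collections import Counter


def stratified_sample(cases, quotas, seed=0):
    """Draw a fixed number of cases per stratum without replacement.

    cases:  list of (case_id, stratum) tuples
    quotas: mapping of stratum -> number of cases to draw
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_stratum = {}
    for case_id, stratum in cases:
        by_stratum.setdefault(stratum, []).append(case_id)
    sample = []
    for stratum, k in quotas.items():
        pool = by_stratum.get(stratum, [])
        if k > len(pool):
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} cases")
        sample.extend((case_id, stratum) for case_id in rng.sample(pool, k))
    return sample


# Hypothetical source pool: 70% negative, 15% equivocal, 15% positive.
cases = (
    [(f"neg-{i}", "negative") for i in range(70)]
    + [(f"eq-{i}", "equivocal_2+") for i in range(15)]
    + [(f"pos-{i}", "positive_3+") for i in range(15)]
)
# Oversample the boundary and positive strata relative to their prevalence.
quotas = {"negative": 20, "equivocal_2+": 15, "positive_3+": 15}

sample = stratified_sample(cases, quotas)
sample_mix = Counter(stratum for _, stratum in sample)
print(sample_mix)
```

Because the sampled mix no longer matches the source population, the original proportions should be recorded so that reported performance can be related back to real-world prevalence.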

Documenting the case mix of the training dataset — including the distribution of cases across scoring categories and the proportion of boundary cases — should be a standard part of any AI pathology tool's technical documentation. It is not sufficient to report overall concordance statistics without specifying what case mix those statistics were computed on. A model that achieves ICC 0.94 on a case mix of 80% unambiguous positives and negatives is less impressive than the same ICC on a case mix that includes 30% boundary cases — and the difference is material to how the model will perform on a clinical institution's actual case distribution.
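A case-mix summary of the kind described here can be generated directly from the label table. The sketch below is a minimal example under assumptions: the labels and boundary flags are invented, and the function name case_mix and the output keys are hypothetical rather than any reporting standard.

```python
from collections import Counter


def case_mix(labels, boundary_flags):
    """Summarize category shares and the share of boundary cases."""
    if len(labels) != len(boundary_flags) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels)
    counts = Counter(labels)
    return {
        "n_cases": n,
        "category_share": {k: counts[k] / n for k in sorted(counts)},
        "boundary_share": sum(boundary_flags) / n,
    }


# Hypothetical evaluation set of ten HER2 cases.
labels = ["0", "0", "1+", "2+", "2+", "3+", "0", "1+", "2+", "3+"]
boundary_flags = [False, False, False, True, True, False, False, False, True, False]

report = case_mix(labels, boundary_flags)
print(report)
```

Publishing a table like this next to each concordance statistic lets a reader judge how much of the reported agreement comes from unambiguous cases.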

The Investment Payoff

Rigorous annotation protocol design is expensive. Multi-reader annotation at full-slide level with structured adjudication, regular calibration sessions, and ambiguity documentation requires significantly more annotator time per case than a single-read or simple majority-vote approach. For a training dataset of 3,000–5,000 WSIs, the difference can be months of additional annotation work and a material cost difference.

The payoff is models that fail predictably rather than opaquely — models whose performance distribution is well-characterized, whose uncertainty signals are calibrated, and whose edge-case behavior is understood rather than discovered in deployment. In a clinical context where the model's output informs treatment decisions, predictable failure modes are substantially more valuable than high average performance with unknown edge-case behavior. The annotation investment is, at its core, an investment in knowing what the model has actually learned.