PD-L1 Scoring Across Assays: 22C3, 28-8, and SP142 Are Not Interchangeable

PD-L1 immunohistochemistry occupies a peculiar position in companion diagnostics: it is simultaneously one of the most clinically important biomarkers in oncology and one of the most technically fragmented. The regulatory approvals for checkpoint inhibitor therapies across non-small cell lung cancer, urothelial carcinoma, triple-negative breast cancer, and gastric cancer are each tied to specific antibody clones, specific scoring algorithms, and specific cut-off thresholds. The result is a landscape where a Tumor Proportion Score (TPS) of 49% measured with one assay does not translate directly into a TPS of 49% measured with another. Pathologist and algorithm alike must know which clone they are working with before the score means anything clinically.

The Three Major Companion Diagnostic Clones

The antibody clones approved as companion diagnostics in the United States each bind to distinct PD-L1 epitopes with different sensitivities and staining patterns:

22C3 (Dako/Agilent, run on Autostainer Link 48): The companion diagnostic for pembrolizumab (Keytruda) in NSCLC. TPS ≥ 1% is the positive threshold for second-line use; TPS ≥ 50% is required for first-line monotherapy. Also used with CPS (Combined Positive Score, which counts PD-L1-positive tumor cells, lymphocytes, and macrophages) for gastric, esophageal, cervical, and TNBC indications. The raw CPS ratio can exceed 100, but the reported score is capped at 100 by convention; the clinically meaningful thresholds are at 1 and 10.

28-8 (Dako/Agilent): The companion diagnostic for nivolumab (Opdivo) in NSCLC, HNSCC, and urothelial carcinoma. Uses TPS only (no CPS scoring). Positive threshold in most approved indications is TPS ≥ 1% or ≥ 5% depending on the cancer type and line of therapy.

SP142 (Ventana, run on BenchMark Ultra): The companion diagnostic for atezolizumab (Tecentriq) in TNBC (IC ≥ 1%) and urothelial carcinoma (IC ≥ 5%). SP142 is notably the only clone where the immune cell (IC) score is primary rather than supplementary: tumor cell expression alone at low levels does not predict atezolizumab response in the validated populations.

Why the Staining Patterns Differ

The non-interchangeability of these clones is not merely a regulatory formality. The antibodies bind to different epitopes on the PD-L1 protein (CD274), and the staining intensity distributions they produce on the same tissue section are measurably different. Multiple published concordance studies — most notably those supporting the Blueprint PD-L1 IHC Comparability Project — show that 22C3 and 28-8 produce broadly concordant TPS values on tumor cells across NSCLC cohorts, with Spearman correlations typically above 0.85. SP142, by contrast, stains fewer tumor cells and is selectively sensitive to immune cell expression; it consistently produces lower TC scores than either 22C3 or 28-8 on the same section, sometimes by 15–20 percentage points in the clinically critical range below 50% TPS.

A second important variable is the automated staining platform. 22C3 and 28-8 are validated for the Dako Autostainer Link 48; SP142 is validated for the Ventana BenchMark Ultra. Running SP142 on a Dako platform, or 22C3 on a BenchMark instrument, produces off-label staining that has not been validated for clinical decision-making. In practice, many academic centers run multiple companion diagnostic assays on different instruments and manage the workflow accordingly, but the scoring algorithm must know which platform-assay combination was used in order to apply the correct interpretation logic.
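A minimal sketch of how a scoring pipeline might check the platform-assay combination before interpreting a slide. The pairings are the ones stated in this article; the names Clone, Platform, VALIDATED_PLATFORM, and is_on_label are hypothetical and do not come from any vendor software.

```python
from enum import Enum


class Clone(Enum):
    C22C3 = "22C3"
    C28_8 = "28-8"
    SP142 = "SP142"


class Platform(Enum):
    DAKO_LINK_48 = "Dako Autostainer Link 48"
    VENTANA_BENCHMARK_ULTRA = "Ventana BenchMark Ultra"


# Clone-to-platform pairings as described in this article (illustrative only).
VALIDATED_PLATFORM = {
    Clone.C22C3: Platform.DAKO_LINK_48,
    Clone.C28_8: Platform.DAKO_LINK_48,
    Clone.SP142: Platform.VENTANA_BENCHMARK_ULTRA,
}


def is_on_label(clone: Clone, platform: Platform) -> bool:
    """Return True only if the clone was stained on its validated platform."""
    return VALIDATED_PLATFORM[clone] is platform


print(is_on_label(Clone.SP142, Platform.VENTANA_BENCHMARK_ULTRA))  # True
print(is_on_label(Clone.C22C3, Platform.VENTANA_BENCHMARK_ULTRA))  # False
```

In a real workflow an off-label combination would more plausibly route the slide to manual review than simply return False; the boolean here only marks where that decision would sit.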

The Scoring Algorithm Problem

Any algorithmic approach to PD-L1 IHC scoring faces a fundamental choice at the outset: develop a single universal model and normalize across clones, or train and validate separate models per clone-platform combination. The universal model approach is technically attractive — it reduces training data requirements and deployment complexity. It is also scientifically inadequate. The staining characteristics of SP142 on immune cells are categorically different from 22C3 tumor cell staining; a single model architecture would either be systematically biased toward the dominant clone in the training data or would require an explicit normalization step that itself requires validation.

The correct approach is per-clone models validated against companion diagnostic-equivalent reference reads. This means the algorithm must receive explicit clone identity as an input, rather than inferring it from staining pattern alone, and must apply the scoring logic and clinical thresholds appropriate to that clone. This infrastructure complexity is worth building because the alternative, clone-agnostic scoring, produces clinically consequential errors that cannot be detected from the algorithm's output alone.
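To make the "explicit clone identity" requirement concrete, here is a rough sketch of a threshold lookup that refuses to run without a known clone, indication, and score type. The cut-off values are copied from the clone descriptions earlier in this article and serve only as illustration, not as a regulatory reference; CUTOFFS and cutoffs_met are hypothetical names.

```python
from typing import Dict, List, Tuple

# Illustrative cut-offs taken from this article's clone descriptions.
# Keys are (clone, indication, score type). Not a regulatory reference.
CUTOFFS: Dict[Tuple[str, str, str], Tuple[float, ...]] = {
    ("22C3", "NSCLC", "TPS"): (1.0, 50.0),
    ("22C3", "TNBC", "CPS"): (10.0,),
    ("SP142", "urothelial", "IC"): (5.0,),
}


def cutoffs_met(clone: str, indication: str, score_type: str, score: float) -> List[float]:
    """Return the configured cut-offs that `score` meets or exceeds.

    `clone` is a required argument: the caller must state which assay
    produced the slide, and unknown combinations are rejected instead
    of being scored with some default logic.
    """
    key = (clone, indication, score_type)
    if key not in CUTOFFS:
        raise KeyError(f"no cut-offs configured for {key}")
    return [c for c in CUTOFFS[key] if score >= c]


print(cutoffs_met("22C3", "NSCLC", "TPS", 49.0))  # [1.0]
```

The design choice worth noting is the failure mode: a combination that has not been configured raises an error, so a score cannot silently be interpreted against another clone's thresholds.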

We are not saying that cross-clone normalization research is without value — published bridging studies between 22C3 and 28-8 in NSCLC are scientifically credible and potentially useful for retrospective research cohorts. What we are saying is that a clinical scoring tool applying companion diagnostic thresholds to treatment eligibility decisions must handle clone identity as a first-class parameter, not a footnote.

The CPS Complexity: Immune Cells Are Not Optional

For indications using CPS rather than TPS, the scoring task is substantially more complex. The Combined Positive Score is the number of PD-L1-positive cells (tumor cells, lymphocytes, and macrophages) divided by the total number of viable tumor cells, multiplied by 100. The denominator is tumor cells only; the numerator includes non-tumor cells. This means the algorithm must accurately classify cell types, not just detect PD-L1 positivity.
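The arithmetic can be sketched as follows, assuming per-cell-type counts are already available from an upstream cell classifier. The function and parameter names are hypothetical, and the functions return the raw ratios before any reporting convention is applied.

```python
def tps(pos_tumor: int, viable_tumor: int) -> float:
    """Tumor Proportion Score: PD-L1-positive tumor cells per 100 viable tumor cells."""
    if viable_tumor <= 0:
        raise ValueError("viable_tumor must be positive")
    return 100.0 * pos_tumor / viable_tumor


def cps_raw(pos_tumor: int, pos_lymphocytes: int, pos_macrophages: int,
            viable_tumor: int) -> float:
    """Raw Combined Positive Score ratio.

    The numerator counts positive tumor cells, lymphocytes, and macrophages;
    the denominator is viable tumor cells only.
    """
    if viable_tumor <= 0:
        raise ValueError("viable_tumor must be positive")
    positive = pos_tumor + pos_lymphocytes + pos_macrophages
    return 100.0 * positive / viable_tumor


# Same slide, two scores: the immune cells change the CPS but not the TPS.
print(tps(20, 200))              # 10.0
print(cps_raw(20, 15, 5, 200))   # 20.0
```

With identical tumor cell staining, the 20 positive immune cells double the score, which is why a misclassified lymphocyte or macrophage moves the CPS while leaving the TPS untouched.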

In a densely immune-infiltrated tumor microenvironment — a common feature in TNBC — tumor-infiltrating lymphocytes and macrophages can significantly inflate the CPS relative to the TPS, and the clinical meaning of that inflation depends entirely on whether the infiltrate is at the invasive margin, within tumor nests, or in stroma. An algorithm computing CPS without spatial classification of the immune infiltrate location can produce a technically correct numerical CPS that carries ambiguous clinical meaning.

Consider a TNBC core needle biopsy where the immune infiltrate is predominantly stromal rather than intraepithelial. The raw CPS might exceed 10, qualifying the patient for first-line pembrolizumab, but a pathologist reviewing the same slide might note the stromal pattern of the infiltrate and recommend careful clinical correlation. An algorithm that reports CPS = 14 without flagging the spatial distribution of immune cell staining is providing less decision support than the clinical situation requires.
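One way to picture the missing flag is a report object that carries the spatial breakdown alongside the number. This is a sketch under assumptions: the split of positive immune cells into "nest" and "stroma" counts is taken as given, and the 0.5 review threshold is an arbitrary placeholder, not a validated value. CpsReport and build_cps_report are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CpsReport:
    cps: float
    stromal_share: float  # share of PD-L1-positive immune cells located in stroma
    review_flag: bool     # True when the stromal share exceeds the chosen threshold


def build_cps_report(pos_tumor: int, pos_immune_nest: int, pos_immune_stroma: int,
                     viable_tumor: int, flag_above: float = 0.5) -> CpsReport:
    """Compute the raw CPS ratio and flag predominantly stromal immune staining.

    `flag_above` is a placeholder threshold chosen for illustration only.
    """
    if viable_tumor <= 0:
        raise ValueError("viable_tumor must be positive")
    pos_immune = pos_immune_nest + pos_immune_stroma
    cps = 100.0 * (pos_tumor + pos_immune) / viable_tumor
    stromal_share = pos_immune_stroma / pos_immune if pos_immune else 0.0
    return CpsReport(cps, stromal_share, stromal_share > flag_above)


report = build_cps_report(pos_tumor=4, pos_immune_nest=5,
                          pos_immune_stroma=19, viable_tumor=200)
print(report.cps, report.review_flag)  # 14.0 True
```

The numeric score is unchanged by the flag; the point is only that the output gives the reviewing pathologist a reason to look at where the positive immune cells sit.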

What Multi-Platform Validation Actually Requires

A pathology AI tool claiming PD-L1 scoring capability must show concordance against companion diagnostic-equivalent reference reads separately for each clone it supports, on the instrument platform each clone is validated for. The sample sizes for each clone should be adequate to demonstrate performance at each clinically relevant cut-point — not just overall concordance across the full score range.

The positive predictive value at cut-off thresholds (TPS ≥ 50% for 22C3, IC ≥ 1% for SP142 in TNBC) matters more than global correlation statistics, because these are the actual clinical decision boundaries. An algorithm with an intraclass correlation coefficient (ICC) of 0.92 for continuous TPS but only moderate PPV at the 50% threshold may still cause treatment allocation errors at precisely the boundary where the score matters most.
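The gap between close continuous agreement and cut-point performance is easy to show with a toy calculation. The paired scores below are invented for illustration and are not study data; ppv_at_cutoff is a hypothetical helper that treats the reference read as ground truth.

```python
from typing import Optional, Sequence


def ppv_at_cutoff(reference: Sequence[float], algorithm: Sequence[float],
                  cutoff: float) -> Optional[float]:
    """PPV of the algorithm's 'score >= cutoff' call against the reference read.

    Returns None when the algorithm calls no case positive.
    """
    if len(reference) != len(algorithm):
        raise ValueError("reference and algorithm must have the same length")
    true_pos = sum(1 for r, a in zip(reference, algorithm) if a >= cutoff and r >= cutoff)
    false_pos = sum(1 for r, a in zip(reference, algorithm) if a >= cutoff and r < cutoff)
    called_pos = true_pos + false_pos
    return true_pos / called_pos if called_pos else None


# Invented paired TPS values: every pair differs by at most 6 points.
reference = [5, 20, 45, 48, 52, 55, 80, 90]
algorithm = [6, 22, 51, 50, 49, 56, 78, 92]

print(ppv_at_cutoff(reference, algorithm, 50))  # 0.6
```

Every pair agrees to within a few percentage points, yet two of the five cases called positive at the 50% cut-off sit below it on the reference read, and one reference-positive case is missed entirely.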

Cross-platform bridging validation — showing that the algorithm calibrated on Dako instruments generalizes to the same clone stained on a different autostainer platform — is a distinct and additional requirement. The staining characteristics of the same antibody on different platforms are not identical, and a model trained entirely on one platform's images will have unknown performance on images from another.

Practical Implications for Oncology Pathology Labs

For academic cancer centers evaluating algorithmic PD-L1 scoring tools, the key questions are: does the tool explicitly handle clone identity, does it support both TPS and CPS scoring with the correct cell-type classification, and has it been validated separately for each clone on its approved platform? A positive answer to all three is a prerequisite for responsible deployment in companion diagnostic workflows. A tool that answers "yes" to the first and "no" to the others may still have research utility, but should not be used to inform treatment eligibility decisions without pathologist review of every borderline case near the approved clinical cut-off.

PD-L1 scoring with algorithmic support is not inherently more dangerous than PD-L1 scoring without it, but the accuracy requirements are high precisely because the downstream consequence of a borderline score is an immunotherapy treatment decision. The complexity of the assay landscape is a feature of the biology and the regulatory history, not a problem that software can dissolve. Software can, however, handle that complexity correctly or incorrectly, and the distinction matters.