May 6, 2025

Measuring Pre-Reader Performance Without Overstating AI Accuracy

By Dr. Mei Lin, CEO & Co-Founder, Histolyx

If you've spent any time evaluating radiology AI tools, you've seen the marketing numbers. Sensitivity of 94%. Specificity of 97%. AUC of 0.96. These figures appear with confidence in vendor materials, sometimes with a paper citation, sometimes without. They invite a specific question that doesn't get asked often enough: measured on what, under what conditions, compared to what reference standard?

We think about this problem a lot internally, because we're both producers and critics of these numbers. As a team building a pre-reader, we have to communicate performance honestly. As people who talk with imaging centers every week, we hear about the damage inflated performance claims do when sites deploy a tool expecting the numbers on the slide and get something different in the real world.

Here's how we think about the right framework for evaluating pre-reader performance — including how we apply it to ourselves.

Why Published AI Performance Numbers Are Often Dataset Artifacts

The standard method for validating an AI diagnostic tool is to test it on a held-out dataset and compute sensitivity, specificity, and AUC against a radiologist reference standard. This produces a clean number and, if the dataset is large enough, a narrow confidence interval. The problem is that the number is only valid for cases that resemble the ones in the training and test data.

Medical imaging datasets are not population-representative. Curated academic datasets tend to overrepresent confirmed positive cases, because you build a pulmonary nodule detection dataset by including studies where nodules are present and confirmed. In a real low-dose chest CT screening population, the prevalence of actionable nodules is much lower. When you deploy a model trained and validated on a dataset with 40% positive prevalence into a screening workflow with 5% positive prevalence, the positive predictive value drops sharply even if sensitivity and specificity stay exactly the same, because the same false positive rate now applies to a much larger share of negative studies.
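
The prevalence effect is easy to check with Bayes' rule. The sketch below is illustrative only: the 94% sensitivity and 97% specificity are the kind of marketing figures quoted at the top of this post, used here as assumed inputs rather than measured results for any product, and the function name is made up for the example.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' rule: true positives / (true positives + false positives)."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Assumed operating point for illustration (not a measured result).
SENSITIVITY = 0.94
SPECIFICITY = 0.97

ppv_curated = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.40)
ppv_screening = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.05)

print(f"PPV at 40% prevalence: {ppv_curated:.3f}")
print(f"PPV at  5% prevalence: {ppv_screening:.3f}")
```

With these assumed inputs, PPV is about 0.95 at 40% prevalence and about 0.62 at 5% prevalence. More than a third of flags are false alarms in the screening setting, even though the model has not changed at all.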

Acquisition protocol matters too. A model validated on 1.25 mm slice thickness CT studies doesn't necessarily perform equivalently on 3 mm reconstructions from an older scanner. A mammography AI validated on direct digital acquisition systems may behave differently on computed radiography or older FFDM systems. Vendors who validate on a narrow equipment profile but deploy across a heterogeneous install base are presenting a number that describes their validation dataset, not your scanner room.

The Right Comparison Baseline

Sensitivity and specificity numbers only have meaning relative to a defined comparison. The two common choices are: AI versus pathology ground truth, and AI versus radiologist consensus. These answer different questions.

AI versus pathology ground truth — where the reference standard is biopsy-confirmed diagnosis or follow-up imaging confirming growth or resolution — tells you how well the AI identifies lesions that turned out to matter. This is the most rigorous measurement and the hardest to obtain at scale, because it requires long follow-up and clean outcome linkage.

AI versus radiologist consensus is more common and much faster to produce, but it answers a different question: how well does the AI agree with experienced radiologists on a curated dataset? This is useful, but it can overstate performance if the consensus readers worked under better conditions than real-world reads: more time per case, only complex or interesting cases selected, and no time pressure when reviewing prior history.

A pre-reader is specifically designed to support radiologist performance in high-volume conditions. The most relevant comparison baseline for a pre-reader is therefore not AI versus expert consensus on curated cases — it's AI-assisted radiologist versus unassisted radiologist on representative cases under realistic workload conditions. That study design is expensive and rarely done pre-market, which is why you should be skeptical when vendors present standalone AI accuracy numbers as evidence for pre-reader value.

What We Actually Measure for Histolyx

We track several categories of metrics for Histolyx's pre-reading performance, and we're deliberate about which ones we present publicly.

Internally, we measure finding-level detection performance against radiologist ground truth on our internal retrospective dataset. We report these as ranges with confidence intervals because the numbers depend on study parameters (slice thickness, reconstruction kernel, patient population) and because a point estimate presented without context is misleading. We don't share these numbers publicly as standalone claims because we haven't validated them in a prospective deployment study — which is on our roadmap but not complete.
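
To illustrate why a point estimate without its sample size can mislead, here is a minimal sketch of a Wilson score interval for a detection rate. The Wilson method is one common choice, shown only as an example, and the counts are hypothetical numbers made up for the illustration, not Histolyx results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denominator = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denominator
    )
    return center - half_width, center + half_width


# Hypothetical counts: the same 94% detection rate at two sample sizes.
small_low, small_high = wilson_interval(successes=47, trials=50)
large_low, large_high = wilson_interval(successes=940, trials=1000)

print(f"47/50 findings detected:    {small_low:.3f} to {small_high:.3f}")
print(f"940/1000 findings detected: {large_low:.3f} to {large_high:.3f}")
```

Both rows would appear on a slide as "94% sensitivity", but the interval for the 50-finding sample is several times wider than the one for the 1000-finding sample.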

What we do present publicly are workflow performance metrics: average time from study arrival to pre-read completion, change detection accuracy on serial studies where prior and current measurements are compared to human-measured delta, and structured report field completion rates. These are less impressive-sounding than sensitivity/specificity numbers, but they're metrics we can defend because they're measured on real workflow data from actual system behavior, not curated validation datasets.

We're not saying standalone detection performance doesn't matter. It does. We're saying that a vendor who will only show you sensitivity and specificity on a curated dataset, and won't talk about how those numbers translate to your case mix and scanner equipment, is not being fully transparent.

The False Negative Cost Question

One evaluation dimension that gets less attention than sensitivity is the cost of different error types in the specific context of a pre-reader. A false negative from a pre-reader — a finding present in the study that the pre-reader missed — is qualitatively different from a false negative from an autonomous diagnostic system.

In an autonomous system, a false negative means the finding wasn't identified at all. In a pre-reader context, the radiologist still performs a complete read and retains clinical responsibility, so the pre-reader's miss isn't the final word. A false negative from Histolyx means the radiologist didn't get a head start on that finding, not that the finding dropped out of the diagnostic process.

This doesn't mean false negatives are unimportant for pre-readers. A pre-reader with a high false negative rate on a subtle finding type (small ground-glass nodules, non-calcified densities in dense breast tissue) that the radiologist is also likely to miss under high-volume conditions offers less value than one with higher sensitivity for exactly those findings. But the framing matters: we're measuring how well the pre-reader extends the radiologist's attention, not how well it replaces the radiologist's read.

A Practical Evaluation Framework for Imaging Centers

When an imaging center is evaluating a pre-reader, the questions that give you the most signal are: What dataset was used for validation, and how does its case mix and equipment compare to yours? What is the reference standard — radiologist consensus, pathology ground truth, or follow-up imaging? Is there data on performance stratified by equipment type, acquisition protocol, and patient population subgroups? What is the false positive rate in addition to sensitivity — and at what threshold were the reported numbers computed?

That last question is critical. A sensitivity and specificity pair is always computed at a specific operating threshold. Moving the threshold up (more conservative flags) reduces false positives at the cost of sensitivity. Moving it down catches more findings but adds noise. The threshold that was chosen for the reported numbers reflects a tradeoff the vendor made, and you should know what that tradeoff was and whether you can adjust it for your operating environment.
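
The threshold tradeoff can be seen on a toy example. The scores and labels below are invented for illustration, and the sketch assumes a study is flagged when its score is at or above the threshold; a real product may define its operating point differently.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold counts as a flag."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 4 studies with a finding, 6 without.
scores = [0.95, 0.80, 0.65, 0.40, 0.70, 0.55, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for threshold in (0.30, 0.50, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, raising the threshold from 0.30 to 0.75 takes sensitivity from 1.00 down to 0.50 while specificity rises from 0.50 to 1.00. A single reported pair is one point on that curve.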

Ideally, you'd run a prospective evaluation on your own case mix before deploying at scale. This is logistically demanding, but even a limited retrospective analysis — pulling a sample of studies with known outcomes from your own PACS and running them through the pre-reader — gives you more relevant signal than published validation numbers alone.
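
A limited retrospective check of this kind can be tallied with very little code. The sketch below assumes you have already exported, for each sampled study, a stratum label (for example scanner model or slice thickness), the known outcome, and whether the pre-reader flagged it. The `StudyResult` record, its field names, and the sample rows are all hypothetical placeholders, not an export format from any PACS or from Histolyx.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StudyResult:
    stratum: str           # hypothetical label, e.g. scanner model or slice thickness
    finding_present: bool  # from the known outcome (your reference standard)
    flagged: bool          # whether the pre-reader flagged the study


def stratified_rates(results: list[StudyResult]) -> dict[str, dict[str, Optional[float]]]:
    """Per-stratum sensitivity and false positive rate from study-level results."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for r in results:
        if r.finding_present:
            counts[r.stratum]["tp" if r.flagged else "fn"] += 1
        else:
            counts[r.stratum]["fp" if r.flagged else "tn"] += 1

    summary = {}
    for stratum, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        summary[stratum] = {
            "n": float(positives + negatives),
            # None when a stratum has no positives (or negatives) to measure against.
            "sensitivity": c["tp"] / positives if positives else None,
            "false_positive_rate": c["fp"] / negatives if negatives else None,
        }
    return summary


# Placeholder rows for illustration only.
sample = [
    StudyResult("thin-slice", True, True),
    StudyResult("thin-slice", True, True),
    StudyResult("thin-slice", False, False),
    StudyResult("thin-slice", False, True),
    StudyResult("thick-slice", True, True),
    StudyResult("thick-slice", True, False),
    StudyResult("thick-slice", False, False),
    StudyResult("thick-slice", False, False),
]

for stratum, rates in stratified_rates(sample).items():
    print(stratum, rates)
```

Reporting the count per stratum alongside the rates matters here: a stratum with a handful of studies can only support a very rough estimate, and a pooled number can hide a weak stratum entirely.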

The field is moving toward better performance transparency standards. Radiology society guidelines increasingly recommend that AI performance claims include dataset characteristics and comparison methodology. Until those standards are consistently applied across vendors, asking the right questions is the most effective tool available.