July 18, 2025

Mammography AI: What Flags Matter and What Creates Noise

By Dr. James Okafor, Chief Medical Officer, Histolyx

The radiologists we talked with early in Histolyx's development were fairly consistent about what they didn't want from mammography AI. More flags was not the goal. The first generation of computer-aided detection (CAD) tools for mammography had already taught that lesson over about fifteen years of clinical use. High sensitivity came with a flag rate that forced radiologists to triage AI outputs rather than use them as genuine aids. The mental model shifted from "what did the AI find?" to "which of these flags is real?"

That inversion — from confirmation to triage — is exactly the wrong direction for a pre-reading tool. Here's how we think about where that went wrong and what a calibrated approach looks like.

The Original CAD Problem Was a Threshold Problem

The legacy CAD tools deployed in the late 1990s and 2000s were tuned for maximum sensitivity. The clinical reasoning was defensible: missing a cancer is worse than an unnecessary callback, therefore detect everything. The result was systems that flagged 2 to 4 findings per study on average, many of them on obviously benign or irrelevant structures: dense fibroglandular tissue, the skin surface, pacemaker leads, and benign calcification clusters that any experienced breast imager would dismiss in seconds.

The problem wasn't the technology per se. It was that a high false positive rate creates a usability failure even with high sensitivity. When a system flags so frequently that radiologists start treating all flags as low-prior-probability noise, they lose the benefit of the occasional high-confidence flag that actually directs attention to something meaningful. The tool becomes white noise. Multiple studies showed that radiologists using first-generation CAD had callback rates that varied little from those without CAD for exactly this reason.
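A back-of-envelope calculation makes the usability failure concrete. The inputs below are illustrative assumptions, not measurements from any product or study: roughly 5 cancers per 1,000 screening studies, 3 flags per study (the middle of the range above) at 90% sensitivity for the legacy case, and 0.1 flags per study at 80% sensitivity for a more selective system. The function name is ours.

```python
def fraction_of_flags_on_cancer(cancers_per_1000, flags_per_study, sensitivity):
    """Share of all flags that mark a true cancer.

    Simplifying assumption: each detected cancer accounts for exactly one flag.
    """
    studies = 1000
    total_flags = flags_per_study * studies
    true_flags = cancers_per_1000 * sensitivity
    return true_flags / total_flags


# Illustrative inputs only (see assumptions in the text above).
legacy = fraction_of_flags_on_cancer(5, 3.0, 0.90)
selective = fraction_of_flags_on_cancer(5, 0.1, 0.80)

print(f"legacy:    {legacy:.4f}")     # 0.0015
print(f"selective: {selective:.4f}")  # 0.0400
```

Under these assumptions, fewer than 2 in 1,000 legacy flags point at a cancer, compared with about 1 in 25 for the selective system. The exact figures matter less than the order-of-magnitude gap in how much attention each flag deserves.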

The response from the field wasn't to abandon CAD — it was to recognize that sensitivity and specificity are not the only axes that matter. Confidence calibration, the ability to distinguish between high-confidence flags (this looks genuinely suspicious) and low-confidence flags (this warrants attention but probably isn't significant), is what determines whether a tool is usable or just technically sensitive.
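One common way to check whether confidence scores are calibrated is to bin flags by score and compare each bin's average score with the fraction of flags in that bin that turned out to be real. The sketch below computes a simple expected calibration error on toy data. It is a generic illustration rather than a description of any particular product's validation, and the data is made up.

```python
def expected_calibration_error(scores, labels, n_bins=5):
    """Weighted average gap between mean score and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Equal-width bins on [0, 1]; a score of exactly 1.0 goes in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_score - positive_rate)
    return ece


# Toy data: ten low-confidence flags (score 0.1) and ten high-confidence flags (score 0.9).
scores = [0.1] * 10 + [0.9] * 10

# Calibrated: 1 of 10 low-confidence and 9 of 10 high-confidence flags are real.
calibrated_labels = [1] + [0] * 9 + [1] * 9 + [0]
# Overconfident: only 3 of 10 high-confidence flags are real.
overconfident_labels = [1] + [0] * 9 + [1] * 3 + [0] * 7

print(round(expected_calibration_error(scores, calibrated_labels), 3))     # 0.0
print(round(expected_calibration_error(scores, overconfident_labels), 3))  # 0.3
```

A score of 0.9 that is right only 30% of the time is exactly the kind of number radiologists learn to ignore.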

What Well-Calibrated Flags Actually Look Like

For mammography specifically, the finding types where AI pre-reading adds meaningful value are those that benefit from systematic attention in a high-volume screening setting: subtle masses that could be obscured by overlying tissue, asymmetric densities that require comparison to the contralateral view or prior study, and clustered calcifications with morphology that warrants close evaluation.

These are also the finding types where radiologist performance is most variable under high-volume conditions. An experienced breast imager on a focused read of a single case will almost never miss a developing asymmetry. The same radiologist at case forty-five in a screening session, navigating quickly through a standard four-view study, is operating with less attentional bandwidth. That's where a high-confidence flag — one that the system assigns a meaningful suspicion score, not just a marginal threshold cross — has real value.

What this means in practice is that a useful confidence score for mammography AI isn't just a probability output from the detection model. It's a calibrated probability that accounts for the finding's morphologic characteristics (spiculated margin versus circumscribed, pleomorphic versus amorphous calcification pattern), whether the finding is new compared to a prior, and what the surrounding tissue context looks like. A system that reports confidence without these factors is presenting a number that doesn't reflect what experienced breast imagers actually use to prioritize attention.

The False Positive Rate Tradeoff and How to Think About It

Every mammography AI system operates at a threshold. Below a certain confidence score, findings are not flagged. Above it, they are. The threshold choice determines the sensitivity/specificity tradeoff, and there's no threshold that produces both perfect sensitivity and zero false positives on any real dataset.

The clinically relevant question is not "how high is the model's AUC?" AUC summarizes performance across every possible threshold and says nothing about which one to run at. The question is "at what threshold does the flag rate stay low enough that radiologists actually engage with the flags, while still capturing the findings they're most likely to miss without assistance?"

That question doesn't have a universal answer. It depends on the radiologist's baseline performance, their experience level with breast imaging specifically, the screening population's background prevalence of actionable findings, and the equipment used. A screening program that reads a high-volume population of average-risk patients will have a different optimal threshold than a program that screens a higher-risk or symptomatic population.

This is why configurable thresholds matter. A single system-wide threshold tuned on a benchmark dataset may be miscalibrated for a specific imaging center's population and radiologist mix. The ability to adjust the operating point — to raise the confidence threshold in a high-experience practice that wants fewer interruptions, or lower it in a setting with higher-risk patients or less specialized breast imaging expertise — is a practical requirement, not a nice-to-have.
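As a minimal sketch of what choosing an operating point can look like, the function below takes a site's validation findings and a flag budget, then returns the lowest threshold that stays within that budget. Because the flag rate can only fall as the threshold rises, the lowest qualifying threshold is also the most sensitive one. The function name, the budget parameter, and the toy data are all hypothetical, and each cancer is treated as a single scored finding for simplicity.

```python
def pick_threshold(findings, n_studies, max_flags_per_study):
    """Return (threshold, sensitivity, flags_per_study) for the lowest threshold
    whose flag rate fits the budget, or None if no threshold does.

    findings: list of (score, is_cancer) pairs from a local validation set.
    """
    total_cancers = sum(1 for _, is_cancer in findings if is_cancer)
    # Try candidate thresholds from lowest to highest.
    for threshold in sorted({score for score, _ in findings}):
        flagged = [(s, c) for s, c in findings if s >= threshold]
        rate = len(flagged) / n_studies
        if rate <= max_flags_per_study:
            detected = sum(1 for _, c in flagged if c)
            sensitivity = detected / total_cancers if total_cancers else 0.0
            return threshold, sensitivity, rate
    return None


# Toy validation set: 8 scored findings across 10 studies, 3 of them cancers.
findings = [
    (0.95, True), (0.80, True), (0.70, False), (0.55, True),
    (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

for budget in (0.5, 0.2):
    threshold, sensitivity, rate = pick_threshold(findings, 10, budget)
    print(f"budget {budget}: threshold {threshold:.2f}, "
          f"sensitivity {sensitivity:.2f}, flags/study {rate:.1f}")
```

On this toy data, the looser budget keeps all three cancers at a threshold of 0.40, while the tighter budget raises the threshold to 0.80 and drops one of them. That is the tradeoff a site is choosing when it sets its own operating point.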

Dense Breast Tissue as a Special Case

Breast density is the specific context where the threshold calibration question is most consequential. In heterogeneously dense or extremely dense breast tissue (ACR categories C and D), the sensitivity of mammography for mass detection drops significantly because masses can be obscured by overlying parenchyma. This is well established and is why supplemental screening (ultrasound, MRI) is recommended for dense breast patients in many guidelines.

For AI mammography pre-reading in dense breasts, the tradeoff shifts. The expected finding rate is higher in a dense-breast population referred for supplemental imaging, because that referral selects for higher-risk cases, while the mammogram alone has lower sensitivity for mass detection. A well-calibrated system should flag more conservatively on obvious masses in fatty breasts (where the mass is clearly visible and the radiologist won't miss it) and more aggressively on subtle densities in heterogeneous tissue (where the mass may be partly obscured and the pre-reader's attention flag has real clinical value).

Implementing this kind of tissue-context-aware calibration is harder than a single-threshold approach, and not all mammography AI systems do it. It's worth asking vendors specifically whether their detection model accounts for tissue density context in the confidence outputs, or whether it applies a uniform confidence score regardless of background parenchymal density.
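The simplest form of density awareness is a per-category flagging threshold, sketched below. This is a deliberate simplification: a production system would more likely fold tissue context into the confidence score itself. The threshold values, class name, and function name are hypothetical and chosen only to show the direction of the adjustment.

```python
from dataclasses import dataclass

# Hypothetical thresholds per ACR density category (illustrative values only).
# Lower thresholds in dense tissue mean subtle findings are flagged more readily.
THRESHOLD_BY_DENSITY = {"A": 0.70, "B": 0.65, "C": 0.50, "D": 0.45}


@dataclass
class Candidate:
    score: float   # calibrated suspicion score in [0, 1]
    density: str   # ACR density category, "A" through "D"


def should_flag(candidate: Candidate) -> bool:
    """Flag when the score meets the threshold for the study's density category."""
    category = candidate.density.upper()
    if category not in THRESHOLD_BY_DENSITY:
        raise ValueError(f"unknown density category: {candidate.density!r}")
    return candidate.score >= THRESHOLD_BY_DENSITY[category]


# The same score is suppressed in a fatty breast and flagged in a dense one.
print(should_flag(Candidate(score=0.55, density="A")))  # False
print(should_flag(Candidate(score=0.55, density="C")))  # True
```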

The BI-RADS Alignment Question

BI-RADS — the Breast Imaging Reporting and Data System from the ACR — provides the categorical framework for mammography reporting: categories 0 through 6, from incomplete/needs additional evaluation through known malignancy. The categories carry defined follow-up recommendations that drive screening program management.

For a pre-reader to be operationally useful, its outputs need to map cleanly to the BI-RADS framework. A flag that says "suspicious region, right upper-outer quadrant, 10 o'clock position, 1.2 cm depth" is useful. A flag that says "region of interest detected, score 0.73" is not useful unless the radiologist has been trained to interpret what 0.73 means in BI-RADS terms.

The structured report that Histolyx generates for mammography pre-reads uses BI-RADS category language for finding descriptors — not as a final category assignment, which is always the radiologist's determination, but as a structured vocabulary that maps the pre-reader's findings to the same language the radiologist uses to document them. This alignment matters because it makes the pre-reader's output readable and actionable within the existing reporting workflow rather than requiring translation.
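To illustrate the idea, here is a minimal sketch of a flag record that renders itself in descriptor language instead of exposing a bare score. It is not Histolyx's actual schema. The field names and the suspicion-band cutoffs (0.8 and 0.5) are assumptions made for the example, and the record deliberately carries no BI-RADS assessment category, since that remains the radiologist's call.

```python
from dataclasses import dataclass


@dataclass
class PreReadFlag:
    """Hypothetical pre-read flag using BI-RADS-style descriptor vocabulary."""
    finding_type: str     # e.g. "mass", "calcifications", "asymmetry"
    descriptor: str       # e.g. "spiculated margin", "amorphous"
    laterality: str       # "left" or "right"
    quadrant: str         # e.g. "upper outer"
    clock_position: int   # 1 through 12
    score: float          # calibrated suspicion score in [0, 1]

    def suspicion_band(self) -> str:
        # Illustrative cutoffs; a real system would derive these from validation data.
        if self.score >= 0.8:
            return "high"
        if self.score >= 0.5:
            return "moderate"
        return "low"

    def to_text(self) -> str:
        return (
            f"{self.finding_type}, {self.descriptor}; "
            f"{self.laterality} breast, {self.quadrant} quadrant, "
            f"{self.clock_position} o'clock; suspicion: {self.suspicion_band()}"
        )


flag = PreReadFlag("mass", "spiculated margin", "left", "upper outer", 2, 0.73)
print(flag.to_text())
# mass, spiculated margin; left breast, upper outer quadrant, 2 o'clock; suspicion: moderate
```

The raw score is still stored for audit and threshold tuning, but what the radiologist reads is already in the vocabulary of the report they are about to write.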

The broader principle: a pre-reader for mammography has to fit into the clinical framework breast imagers already use, not impose its own taxonomy. AI flags that translate cleanly into BI-RADS descriptors are useful. AI outputs that require mental translation from confidence scores to clinical categories add cognitive overhead rather than reducing it.