Blog November 7, 2025 7 min read

How to Evaluate AI Pre-Readers: Questions Imaging Directors Should Ask

By Dr. Mei Lin, CEO & Co-Founder, Histolyx


We wrote this post because we wanted to be asked these questions. Not because we have perfect answers to all of them — we do not — but because the questions that good imaging directors ask are the right ones to have on record. If you are evaluating AI pre-reading tools for your department, this is the list we would use if we were on your side of the table.

One framing point upfront: a vendor demo is optimized for the best case. The evaluation questions below are designed to expose what happens in the worst case — a scanner calibration that does not match the training distribution, an integration that takes twice as long as projected, a finding type where the model has known gaps. Any vendor who refuses to answer these questions, or answers them only with marketing language, is telling you something important.

On Training Data and Validation

What data was the model trained on, and how similar is it to our patient population?

This is the most important question and the one vendors are most likely to deflect. Training data characteristics determine where the model performs well and where it does not. A model trained predominantly on academic medical center data may perform differently on community imaging center populations. A chest CT model trained on a population skewed toward younger patients may have different performance characteristics on older populations with more calcified nodules and incidental findings.

You are not asking for the full training dataset specification — that is legitimately proprietary. You are asking: what institutions contributed data, what was the approximate demographic breakdown, what was the prevalence of the finding types the model is designed to detect, and was training data collected prospectively or retrospectively? Those questions have answers that do not compromise trade secrets.

Was performance validation done on an independent holdout set, or on a subset of the training data?

Internal validation on a subset of the training data inflates apparent performance. Independent external validation, ideally on data from institutions other than those that contributed training data, produces the number that matters. Ask specifically: was the validation set collected from the same institutions as the training set? If the answer is yes, the published performance metrics are almost certainly optimistic.
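If a vendor shares the lists of contributing sites, the overlap check is mechanical. The sketch below is illustrative only: the function name and the site names are hypothetical, and it assumes each list is a set of plain institution names.

```python
def shared_institutions(training_sites, validation_sites):
    """Return the institutions that appear in both lists, ignoring case and extra spaces."""
    def normalize(name):
        return " ".join(name.lower().split())

    training = {normalize(site) for site in training_sites}
    validation = {normalize(site) for site in validation_sites}
    return sorted(training & validation)


# Hypothetical site lists, for illustration only.
training_sites = ["Hospital A", "Hospital B", "Clinic C"]
validation_sites = ["hospital b", "Clinic D"]

overlap = shared_institutions(training_sites, validation_sites)
print(overlap)  # a non-empty list means validation was not fully external
```

An empty result does not prove independence on its own (sites can share scanners, protocols, or patient populations), but a non-empty one is a concrete reason to ask for a separate external validation figure.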

On Integration and Go-Live

What is your typical integration timeline from contract signature to go-live, and what does that depend on?

Any vendor who says "two weeks" without qualification is giving you a sales number. A realistic DICOMweb integration requires: confirming your PACS supports STOW-RS and WADO-RS, configuring routing rules so studies are forwarded to the AI endpoint, testing with a sample dataset, validating that annotations route back correctly into your viewer, and going through your IT security review. Each of those steps has dependencies outside the vendor's control.
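For readers who want to see what "supports STOW-RS and WADO-RS" means on the wire, here is a minimal sketch of the two request shapes as defined by the DICOMweb standard. It only builds request descriptions and sends nothing. The base URL and study UID are hypothetical placeholders, and the helper names are ours, not part of any library.

```python
def stow_rs_request(base_url):
    """Describe a STOW-RS store request: POST DICOM instances to the studies resource."""
    base = base_url.rstrip("/")
    return {
        "method": "POST",
        "url": f"{base}/studies",
        # A real client also appends a boundary parameter when it builds the multipart body.
        "headers": {"Content-Type": 'multipart/related; type="application/dicom"'},
    }


def wado_rs_request(base_url, study_uid):
    """Describe a WADO-RS retrieve request: GET all instances of one study."""
    base = base_url.rstrip("/")
    return {
        "method": "GET",
        "url": f"{base}/studies/{study_uid}",
        "headers": {"Accept": 'multipart/related; type="application/dicom"'},
    }


# Hypothetical endpoint and UID, for illustration only.
base_url = "https://pacs.example.org/dicomweb/"
store = stow_rs_request(base_url)
retrieve = wado_rs_request(base_url, "1.2.3.4.5")
print(store["method"], store["url"])
print(retrieve["method"], retrieve["url"])
```

Your PACS vendor's conformance statement is the authoritative source for which of these services are actually enabled and at what path.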

Ask for a specific implementation checklist and a description of what the vendor needs from your team — not just what they will deliver. Then ask for references from other customers who can speak to actual integration time versus projected integration time. The gap between those numbers tells you how good their project management is.

What happens if the DICOM routing is misconfigured and studies do not arrive at the AI system?

This is a failure mode that happens in nearly every deployment. Studies stop flowing to the AI endpoint — because of a PACS software update, a firewall rule change, or a routing misconfiguration. How does the vendor detect this? Do they monitor inbound study volume and alert when it drops unexpectedly? Or does the imaging center only discover the problem when a radiologist notices the annotations are missing?

A vendor with a mature monitoring system can tell you within hours if study volume drops. A vendor without one relies on your team to notice. That difference matters for operational continuity.
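The kind of check described above can be very simple. The following is a minimal sketch, not a description of how any particular vendor monitors volume: the function name, the hourly counts, and the 50% threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def volume_dropped(recent_hourly_counts, baseline_hourly_counts, min_ratio=0.5):
    """Return True when recent inbound study volume falls below min_ratio of the baseline."""
    if not recent_hourly_counts or not baseline_hourly_counts:
        raise ValueError("both windows need at least one hourly count")
    baseline = mean(baseline_hourly_counts)
    if baseline == 0:
        return False  # no traffic expected in this window, so nothing to compare against
    return mean(recent_hourly_counts) < min_ratio * baseline


# Hypothetical counts of studies received per hour.
baseline = [12, 15, 14, 13, 16, 14]   # same hours on previous weekdays
after_pacs_update = [0, 0, 1]         # last three hours
normal_morning = [13, 12, 15]

print(volume_dropped(after_pacs_update, baseline))  # True: raise an alert
print(volume_dropped(normal_morning, baseline))     # False: volume looks normal
```

The baseline should come from comparable hours, because study volume varies by time of day and day of week. Comparing a Sunday night against a weekday average would produce false alarms.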

On Performance Claims

What is the false-positive rate at your published sensitivity, and how does that translate to flag volume at our study volume?

Sensitivity without specificity is meaningless for workflow planning. If a vendor says their system flags 92% of positive findings, the relevant follow-up is: at what flag rate per study? If the system flags every study as potentially positive, the sensitivity number is trivially achievable but the tool creates noise, not signal.

Ask for the full operating point: sensitivity and specificity together, at the threshold the vendor uses by default. Then do the arithmetic: at your imaging center's daily study volume and the expected prevalence of the target finding in your population, how many studies per day would be flagged? How many of those flags would correspond to true findings versus false positives or incidental and indeterminate findings? That math tells you whether the tool reduces decision load or adds to it.
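That arithmetic fits in a few lines. The sketch below assumes study-level sensitivity and specificity and a single prevalence figure. Only the 92% sensitivity comes from the example above; the study volume, prevalence, and specificity are hypothetical inputs that you should replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class FlagEstimate:
    flagged_per_day: float
    true_positive_flags: float
    false_positive_flags: float
    ppv: float  # share of flags that correspond to a true finding


def estimate_daily_flags(studies_per_day, prevalence, sensitivity, specificity):
    """Expected daily flag counts at a single operating point."""
    for name, value in (("prevalence", prevalence),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    positives = studies_per_day * prevalence
    negatives = studies_per_day - positives
    true_pos = positives * sensitivity
    false_pos = negatives * (1.0 - specificity)
    flagged = true_pos + false_pos
    ppv = true_pos / flagged if flagged else 0.0
    return FlagEstimate(flagged, true_pos, false_pos, ppv)


# Hypothetical department: 200 studies per day, 5% prevalence, 90% specificity.
estimate = estimate_daily_flags(
    studies_per_day=200, prevalence=0.05, sensitivity=0.92, specificity=0.90
)
print(f"flags per day: {estimate.flagged_per_day:.1f}")
print(f"true-positive flags: {estimate.true_positive_flags:.1f}")
print(f"false-positive flags: {estimate.false_positive_flags:.1f}")
print(f"share of flags that are true findings: {estimate.ppv:.0%}")
```

With these illustrative inputs, roughly 28 studies a day are flagged and about a third of those flags are true findings. The same sensitivity at a lower prevalence produces a much lower share, which is why the vendor's headline number alone cannot tell you what your radiologists will see.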

Are there finding types or patient presentations where you know performance is weaker?

Every model has weak spots. The question is whether the vendor has characterized them honestly. For chest CT, common weak spots include: very small nodules near pleural surfaces, ground-glass nodules in heavily calcified lung parenchyma, findings in the lung apices (where imaging artifacts are more common), and patients with prior surgical changes. A vendor who cannot name any weak spots has either not done the analysis or is not sharing it. Either is a problem.

On Radiologist Adoption

How do radiologists typically change their reading behavior after six months of using the tool?

This is a question about real-world use, not demo-room use. Vendors who have deployed in multiple imaging centers typically observe three phases: an initial period of high engagement with the annotations, then a recalibration phase in which radiologists develop a personal threshold for how much weight to give the flags, and finally a steady state in which the tool is part of the workflow but does not dominate it.

The concerning pattern is radiologists who stop reading independently of the flags — who open a study, see no annotations, and conclude there is nothing to find. We are not saying this is common, but it is a documented concern in the clinical AI literature. Ask the vendor what they know about this behavioral pattern in their customer base and what the tool does to discourage over-reliance. The answer — or the absence of an answer — tells you a lot about how they think about clinical responsibility.

On Contract Terms and Exit

If we want to terminate the contract, how long does it take to fully remove the tool from our workflow?

This is not a hostile question; it is a business continuity question. If the AI system goes down, the vendor is acquired, or the pricing model changes, how long does it take to revert to a non-AI workflow? If the answer involves rearchitecting DICOM routing that took weeks to set up, your operational dependency is higher than you might want.

What happens to our PHI and historical annotation data if we terminate?

We addressed this in more detail in our earlier post on HIPAA BAA terms, but it is worth asking in the vendor evaluation context: is there a data return or deletion procedure? How long does it take? Who confirms that deletion is complete? Your BAA should address this, but understanding the operational process behind the contractual commitment matters.

What Confident Answers Sound Like

A vendor who handles these questions well will give you specific numbers when asked for numbers, acknowledge uncertainty when it exists, name their weak spots before you have to ask, and connect you with references who can speak to real deployment experience — including the parts that were harder than expected.

We try to answer all of these questions honestly in our sales and onboarding process. Not because the honest answers are always flattering — some are not — but because a customer who deploys Histolyx based on accurate expectations is a customer whose radiologists will actually use it. Adoption built on unrealistic expectations collapses at the first difficult case.