Medical imaging and AI for decision support

Medical Imaging AI lab meeting

Department of Data Science Methods, Julius Center, University Medical Center Utrecht

2024-07-08

Uses of AI in medical imaging

Use AI for medical imaging

to make healthcare easier or more efficient

  1. Acquisition (\(S \to X\))

    • k-space to MRI image
    • raw projection data to CT image
  2. Detection / segmentation (\(X \to X\))

    • segmenting organs at risk in radiotherapy
  3. Inference / diagnosis (\(X \to D\), both at prediction time)

    • medical diagnosis
    • pseudo-CT from MRI

Use AI for medical imaging

to make healthcare better (improve decisions)

  1. prognosis (\(X \to Y\), \(Y\) in the future)

    • expected survival time given CT-scan
  2. treatment effect (\(X\) determines effect of a treatment \(T\) on outcome \(Y\) in the future)

Why would you estimate treatment effects based on images?

  • treatments have different effects on patients based on their (disease) characteristics
  • for example, whether tamoxifen increases survival for breast cancer patients depends on whether their tumor is hormone sensitive
  • some characteristics may be well captured in medical imaging:
    • T-cell distributions around tumors are related to the effect of immunotherapy in cancer
    • heterogeneity of a tumor on CT may predict response to radiotherapy
    • a holistic view of ‘body composition’ on CT-scans determines whether a patient can tolerate chemotherapy

How to estimate treatment effects based on images?

In principle the same as estimating a subgroup treatment effect (e.g. male vs female)

  1. Conduct a randomized controlled trial where the treatments of interest are randomly allocated
  2. Collect (imaging) data at randomization timepoint
  3. Use a statistical learning technique like TARnet (Shalit, Johansson, and Sontag 2017) to estimate outcomes conditional on image and treatment
  4. Conditional treatment effect \(= f(X,T=1) - f(X,T=0)\); see the sketch below
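A minimal sketch of steps 3 and 4, following the TARnet idea of a shared representation with one outcome head per treatment arm (Shalit, Johansson, and Sontag 2017). It assumes tabular image-derived features, a binary treatment, and a continuous outcome; layer sizes and names are illustrative, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class TARNet(nn.Module):
    """Shared representation with one outcome head per treatment arm."""

    def __init__(self, n_features: int, n_hidden: int = 64):
        super().__init__()
        # shared representation Phi(x); for raw images this could be a CNN
        self.phi = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # separate heads estimate f(X, T=0) and f(X, T=1)
        self.head0 = nn.Linear(n_hidden, 1)
        self.head1 = nn.Linear(n_hidden, 1)

    def forward(self, x):
        z = self.phi(x)
        return self.head0(z).squeeze(-1), self.head1(z).squeeze(-1)

def factual_loss(model, x, t, y):
    # each patient contributes only through the head of their assigned arm
    y0_hat, y1_hat = model(x)
    y_hat = torch.where(t.bool(), y1_hat, y0_hat)
    return nn.functional.mse_loss(y_hat, y)

def conditional_treatment_effect(model, x):
    # step 4: f(X, T=1) - f(X, T=0)
    with torch.no_grad():
        y0_hat, y1_hat = model(x)
    return y1_hat - y0_hat
```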

What if you cannot do a (big enough) RCT?

Emulate / approximate the ideal trial in observational data you do have, using causal inference techniques

(which rely on untestable assumptions)

Improving decisions with AI

  1. prognosis (\(X \to Y\), \(Y\) in the future)
  2. treatment effect (\(X\) determines effect of a treatment \(T\) on outcome \(Y\) in the future)
  • Whereas treatment effect estimation is typically thought of as a causal task requiring causal approaches (e.g. randomized controlled trials),
  • prognosis models are often developed without any causal thinking (if it predicts, it predicts)
  • but are then advertised for making treatment decisions.

The in-between: using prediction models for (medical) decision making

  • prognosis (e.g. survival given medical image)

Using prediction models for decision making is often thought of as a good idea

For example:

  1. give chemotherapy to cancer patients with high predicted risk of recurrence
  2. give statins to patients with a high risk of a heart attack

TRIPOD+AI on prediction models (Collins et al. 2024)

“Their primary use is to support clinical decision making, such as … initiate treatment or lifestyle changes.”

This may lead to bad situations when:

  1. ignoring the treatments patients may have received during training / validation of the (AI) prediction model
  2. treating measures of predictive accuracy alone as sufficient evidence for safe deployment

When accurate prediction models yield harmful self-fulfilling prophecies

Building models for decision support without regard for the historic treatment policy is a bad idea

The question is not “is my model accurate before / after deployment”,

but did deploying the model improve patient outcomes?

Treatment-naive prediction models

\[ E[Y|X] = E_{t \sim \pi_0(X)}\big[E[Y|X,T=t]\big] \]
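Here \(\pi_0\) is the historic treatment policy that generated the training data. A minimal simulation of this identity (all numbers hypothetical) shows that the treatment-naive prediction is a policy-weighted average, so its value changes whenever the treatment policy changes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.binomial(1, 0.5, n)             # patient feature (e.g. image-derived)
pi0 = np.where(x == 1, 0.8, 0.2)        # historic treatment policy pi_0(X)
t = rng.binomial(1, pi0)                # treatments assigned under pi_0
y = rng.binomial(1, 0.3 * x + 0.4 * t)  # E[Y|X,T] = 0.3 X + 0.4 T

# The treatment-naive target E[Y|X=1] equals the policy-weighted average:
# pi_0(1) * E[Y|X=1,T=1] + (1 - pi_0(1)) * E[Y|X=1,T=0] = 0.8*0.7 + 0.2*0.3
print(y[x == 1].mean())        # ~0.62, the empirical E[Y|X=1]
print(0.8 * 0.7 + 0.2 * 0.3)   # 0.62; shifts as soon as pi_0 changes
```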

Treatment-naive prediction models

(Results from W. A. C. van Amsterdam, van Geloven, et al. 2024)

  1. good or bad discrimination post deployment may be a sign of a harmful or a beneficial policy change
  2. models that are perfectly calibrated before and after deployment cannot have been useful for decision making, because deploying them did not change the outcome distribution

Is this obvious?

Bigger data does not protect against harmful prediction models

More flexible models do not protect against harmful prediction models

What to do?

  1. Evaluate policy change (cluster randomized controlled trial)
  2. Build models that are likely to have value for decision making

Building and validating models for decision support

Deploying a model is an intervention that changes the way treatment decisions are made

How do we learn about the effect of an intervention?

With a randomized experiment

  • for using a decision support model, the unit of intervention is usually the doctor
  • randomly assign doctors to have access to the model or not
  • measure differences in treatment decisions and patient outcomes
  • this is called a cluster RCT
  • if using the model improves outcomes, adopt the model

Using cluster RCTs to evaluate models for decision making is not a new idea (Cooper et al. 1997)

“As one possibility, suppose that a trial is performed in which clinicians are randomized either to have or not to have access to such a decision aid in making decisions about where to treat patients who present with pneumonia.”

What we don’t learn

was the model predicting anything sensible?

So build treatment-naive prediction models and trial them for decision support?

Not a good idea

  • like baking a cake without a recipe
  • and hoping it turns into something nice
  • not pleasant for the people who have to taste the result of the experiment
    • (i.e. patients may have side-effects / die)

We should build models that are likely to be valuable for decision making

  • Build models that predict expected outcomes under hypothetical interventions (prediction-under-intervention models)
  • the doctor / patient can then pick the treatment with the best expected outcome, depending on the patient’s values and preferences
  • whereas treatment-naive prediction models average out over the historic treatment policy, prediction-under-intervention models let the user vary the treatment option

Hilden and Habbema on prognosis (Hilden and Habbema 1987)

“Prognosis cannot be divorced from contemplated medical action, nor from action to be taken by the patient in response to prognostication.”

  • prediction-under-intervention is not a new idea, but language and methods on causality have come a long way since (Hilden and Habbema 1987).

Estimand for prediction-under-intervention models

What is the estimand?

  • prediction: \(E[Y|X]\)
  • average treatment effect: \(E[Y|\text{do}(T=1)] - E[Y|\text{do}(T=0)]\)
  • conditional average treatment effect: \(E[Y|\text{do}(T=1),X] - E[Y|\text{do}(T=0),X]\)
  • prediction-under-intervention: \(E[Y|\text{do}(T=t),X]\)
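A minimal sketch of estimating this estimand from observational data via outcome regression (the g-formula), assuming all confounders of \(T\) and \(Y\) are captured in \(X\) and the outcome model is correctly specified; under those untestable assumptions \(E[Y|\text{do}(T=t),X] = E[Y|X,T=t]\). Data and model choice are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000

x = rng.normal(size=(n, 3))                            # measured confounders / features
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))        # confounded treatment assignment
y = rng.binomial(1, 1 / (1 + np.exp(-(x[:, 1] - t))))  # outcome depends on X and T

# outcome regression: model E[Y | X, T] with treatment as an input feature
model = LogisticRegression().fit(np.column_stack([x, t]), y)

# prediction-under-intervention: set T to each option for the same patients
x_new = x[:5]
risk_if_untreated = model.predict_proba(np.column_stack([x_new, np.zeros(5)]))[:, 1]
risk_if_treated = model.predict_proba(np.column_stack([x_new, np.ones(5)]))[:, 1]
# the difference of these two predictions is the conditional average treatment effect
```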

Using treatment-naive prediction models for decision support

Prediction-under-intervention

More on prediction-under-intervention models

Development: use data from RCTs where possible, or observational data combined with causal inference methods that account for confounding.

Evaluation of prediction-under-intervention models

  • prediction accuracy can be tested in RCTs, or in observational data with specialized methods accounting for confounding (e.g. Keogh and van Geloven 2024); a minimal sketch follows after this list
  • in confounded observational data, typical metrics (e.g. AUC or calibration) are not sufficient, as we want to predict well in data from a different distribution than the observed data (i.e. under other treatment decisions)
  • a new policy can be evaluated in historic RCTs (e.g. Karmali et al. 2018)
  • the ultimate test is a cluster RCT
  • even if not perfect, this is likely a better recipe than treatment-naive models
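As an illustration of the first point above, a sketch of arm-wise evaluation on RCT data: randomization makes each arm an unbiased sample of the trial population, so predictions under \(T=t\) can be scored against observed outcomes within the arm that received \(t\) (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def arm_wise_auc(y, t_observed, risk_under_t0, risk_under_t1):
    """Score predictions-under-intervention on RCT data, one arm at a time.

    Randomization removes confounding, so no adjustment is needed here;
    for observational data, see Keogh and van Geloven (2024)."""
    arm0 = t_observed == 0
    arm1 = t_observed == 1
    return {
        "auc_under_t0": roc_auc_score(y[arm0], risk_under_t0[arm0]),
        "auc_under_t1": roc_auc_score(y[arm1], risk_under_t1[arm1]),
    }
```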

Take-aways

  • deploying models for decision support is an intervention and should be evaluated as such
  • when developing or evaluating (AI) prediction models for medical decisions, think about
    • what is the effect of using this model on medical decisions?
    • what is the effect of this policy change on patient outcomes?
  • prediction-under-intervention models have a foreseeable effect on patient outcomes when used for decision making

From algorithms to action: improving patient care requires causality (W. A. C. van Amsterdam, Jong, et al. 2024)

When accurate prediction models yield harmful self-fulfilling prophecies (W. A. C. van Amsterdam, van Geloven, et al. 2024)

References

Alaa, Ahmed M., Deepti Gurdasani, Adrian L. Harris, Jem Rashbass, and Mihaela van der Schaar. 2021. “Machine Learning to Guide the Use of Adjuvant Therapies for Breast Cancer.” Nature Machine Intelligence, June, 1–11. https://doi.org/gk6bh7.
Amsterdam, W. A. C. van, Nan van Geloven, Jesse H. Krijthe, Rajesh Ranganath, and Giovanni Cinà. 2024. “When Accurate Prediction Models Yield Harmful Self-Fulfilling Prophecies.” February 8, 2024. https://doi.org/10.48550/arXiv.2312.01210.
Amsterdam, W. A. C. van, Pim A. de Jong, Joost J. C. Verhoeff, Tim Leiner, and Rajesh Ranganath. 2024. “From Algorithms to Action: Improving Patient Care Requires Causality.” BMC Medical Informatics and Decision Making 24 (1). https://doi.org/10.1186/s12911-024-02513-3.
Amsterdam, Wouter A. C. van, and Rajesh Ranganath. 2023. “Conditional Average Treatment Effect Estimation with Marginally Constrained Models.” Journal of Causal Inference 11 (1): 20220027. https://doi.org/10.1515/jci-2022-0027.
Amsterdam, Wouter A. C. van, Joost J. C. Verhoeff, Netanja I. Harlianto, Gijs A. Bartholomeus, Aahlad Manas Puli, Pim A. de Jong, Tim Leiner, Anne S. R. van Lindert, Marinus J. C. Eijkemans, and Rajesh Ranganath. 2022. “Individual Treatment Effect Estimation in the Presence of Unobserved Confounding Using Proxies: A Cohort Study in Stage III Non-Small Cell Lung Cancer.” Scientific Reports 12 (1): 5848. https://doi.org/10.1038/s41598-022-09775-9.
Candido dos Reis, Francisco J., Gordon C. Wishart, Ed M. Dicks, David Greenberg, Jem Rashbass, Marjanka K. Schmidt, Alexandra J. van den Broek, et al. 2017. “An Updated PREDICT Breast Cancer Prognostication and Treatment Benefit Prediction Model with Independent Validation.” Breast Cancer Research 19 (1): 58. https://doi.org/gbhgpq.
Collins, Gary S., Karel G. M. Moons, Paula Dhiman, Richard D. Riley, Andrew L. Beam, Ben Van Calster, Marzyeh Ghassemi, et al. 2024. “TRIPOD+AI Statement: Updated Guidance for Reporting Clinical Prediction Models That Use Regression or Machine Learning Methods.” BMJ 385 (April): e078378. https://doi.org/10.1136/bmj-2023-078378.
Cooper, Gregory F., Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G. Buchanan, Richard Caruana, Michael J. Fine, et al. 1997. “An Evaluation of Machine-Learning Methods for Predicting Pneumonia Mortality.” Artificial Intelligence in Medicine 9 (2): 107–38. https://doi.org/10.1016/S0933-3657(96)00367-3.
Hartford, Jason, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017. “Deep IV: A Flexible Approach for Counterfactual Prediction.” In International Conference on Machine Learning, 1414–23. PMLR. https://proceedings.mlr.press/v70/hartford17a.html.
Hilden, Jørgen, and J. Dik F. Habbema. 1987. “Prognosis in Medicine: An Analysis of Its Meaning and Rôles.” Theoretical Medicine 8 (3): 349–65. https://doi.org/10.1007/BF00489469.
Karmali, Kunal N., Donald M. Lloyd-Jones, Joep van der Leeuw, David C. Goff Jr, Salim Yusuf, Alberto Zanchetti, Paul Glasziou, et al. 2018. “Blood Pressure-Lowering Treatment Strategies Based on Cardiovascular Risk Versus Blood Pressure: A Meta-Analysis of Individual Participant Data.” PLOS Medicine 15 (3): e1002538. https://doi.org/10.1371/journal.pmed.1002538.
Keogh, Ruth H., and Nan van Geloven. 2024. “Prediction Under Interventions: Evaluation of Counterfactual Performance Using Longitudinal Observational Data.” January 10, 2024. https://doi.org/10.48550/arXiv.2304.10005.
Miao, Wang, Zhi Geng, and Eric J Tchetgen Tchetgen. 2018. “Identifying Causal Effects with Proxy Variables of an Unmeasured Confounder.” Biometrika 105 (4): 987–93. https://doi.org/10.1093/biomet/asy038.
Puli, Aahlad Manas, and Rajesh Ranganath. 2021. “General Control Functions for Causal Effect Estimation from Instrumental Variables.” http://arxiv.org/abs/1907.03451.
Shalit, Uri, Fredrik D. Johansson, and David Sontag. 2017. “Estimating Individual Treatment Effect: Generalization Bounds and Algorithms.” http://arxiv.org/abs/1606.03976.
Wald, Abraham. 1940. “The Fitting of Straight Lines If Both Variables Are Subject to Error.” The Annals of Mathematical Statistics 11 (3): 284–300. https://doi.org/10.1214/aoms/1177731868.