AI and its (mis)uses in medical research and practice

University Medical Center Utrecht

2024-04-18

What is AI?

What is artificial intelligence?

computers doing tasks that normally require intelligence

What is artificial general intelligence?

general-purpose AI that performs a wide range of tasks across different domains, as humans do

AI subsumes rule-based systems and machine learning

  • Rule-based AI: knowledge base of rules
  • Machine learning: statistical learning from examples
    • traditional machine learning: logistic regression, SVM, random forest, gradient boosting
    • modern machine learning: deep learning and foundation models

Rule-based systems are AI

  • rule: all cows are animals
  • observation: this is a cow \(\to\) it is an animal
  • applications:
    • medication interaction checkers
    • bedside patient monitors
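
A minimal sketch of how little machinery such a system needs: a hypothetical Python interaction checker with a hand-written rule base. The rules, drug names, and warnings below are invented for illustration, not clinical guidance.

```python
# Hypothetical rule-based medication interaction checker.
# The rule base below is illustrative only, not clinical guidance.
INTERACTION_RULES = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"simvastatin", "clarithromycin"}): "increased myopathy risk",
}

def check_interactions(medications):
    """Fire every rule whose drug pair occurs in the medication list."""
    meds = {m.lower() for m in medications}
    return [
        f"{' + '.join(sorted(pair))}: {warning}"
        for pair, warning in INTERACTION_RULES.items()
        if pair <= meds  # both drugs of the rule are present
    ]

print(check_interactions(["Warfarin", "Aspirin", "Metformin"]))
# -> ['aspirin + warfarin: increased bleeding risk']
```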

machine learning: statistical learning from examples

ML tasks

data:

i   length (cm)   weight (kg)   sex
1   137           30            boy
2   122           24            girl
3   101           18            girl

\[l_i,w_i,s_i \sim p(l,w,s)\]

ML tasks: generation

use samples to learn model \(p_{\theta}\) for joint distribution \(p\) \[ l_j,w_j,s_j \sim p_{\theta}(l,w,s) \]
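
A minimal sketch of this task, assuming a simple parametric model \(p_{\theta}\) (a Bernoulli for sex, a bivariate Gaussian for length and weight within each sex) and a fabricated larger sample in the format of the table above; all parameters are invented. Fixing the sex argument gives the conditional generation of the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated stand-in for a larger dataset in the format of the table
# above: length (cm), weight (kg), sex. Parameters are invented.
n = 500
sex = rng.choice(["boy", "girl"], size=n)
length = rng.normal(np.where(sex == "boy", 120.0, 115.0), 15.0)
weight = rng.normal(0.25 * length - 5.0, 2.0)  # weight loosely tracks length

# Fit p_theta(l, w, s) = p(s) * N(l, w | mu_s, Sigma_s): a Bernoulli for
# sex plus a bivariate Gaussian for (length, weight) within each sex.
theta = {}
for s in ("boy", "girl"):
    X = np.column_stack([length[sex == s], weight[sex == s]])
    theta[s] = {"p": (sex == s).mean(),
                "mean": X.mean(axis=0),
                "cov": np.cov(X, rowvar=False)}

def sample(condition_on_sex=None):
    """Generation: (l, w, s) ~ p_theta(l, w, s). Passing e.g. 'boy' gives
    conditional generation: (l, w) ~ p_theta(l, w | s='boy')."""
    s = condition_on_sex or ("boy" if rng.random() < theta["boy"]["p"] else "girl")
    l, w = rng.multivariate_normal(theta[s]["mean"], theta[s]["cov"])
    return round(l, 1), round(w, 1), s

print(sample())       # joint generation
print(sample("boy"))  # conditional generation
```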

ML tasks: conditional generation

use samples to learn model for conditional distribution \(p\) \[ l_j,w_j \sim p_{\theta}(l,w|s=\text{boy}) \]

task
generation \(l_j,w_j,s_j \sim p_{\theta}(l,w,s)\)

ML tasks: conditional generation 2

use samples to learn model for conditional distribution \(p\) of one variable \[ s_j \sim p_{\theta}(s|l=l',w=w') \]

task
generation \(l_j,w_j,s_j \sim p_{\theta}(l,w,s)\)
conditional generation \(l_j,w_j \sim p_{\theta}(l,w|s=\text{boy})\)

ML tasks: discrimination

call this one variable the outcome and classify when the predicted probability passes a threshold (e.g. 0.5); see the sketch after the task table below: \[ \hat{s}_j = p_{\theta}(s|l=l',w=w') > 0.5 \]

task
generation \(l_j,w_j,s_j \sim p_{\theta}(l,w,s)\)
conditional generation \(l_j,w_j \sim p_{\theta}(l,w|s=\text{boy})\)
discrimination \(p_{\theta}(s|l=l_i,w=w_i) > 0.5\)
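
A minimal sketch of discrimination on the same kind of fabricated toy data, using logistic regression as \(p_{\theta}(s|l,w)\) and the 0.5 threshold; the data-generating parameters are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Same kind of fabricated (length, weight, sex) toy data as before.
n = 500
is_boy = rng.random(n) < 0.5
length = rng.normal(np.where(is_boy, 120.0, 115.0), 15.0)
weight = rng.normal(0.25 * length - 5.0, 2.0)
X = np.column_stack([length, weight])

# Discrimination: model p_theta(s | l, w), then threshold at 0.5.
model = LogisticRegression().fit(X, is_boy)
p_boy = model.predict_proba([[130.0, 27.0]])[0, 1]
print(f"p(boy | l=130, w=27) = {p_boy:.2f} ->",
      "boy" if p_boy > 0.5 else "girl")
```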

ML tasks: reinforcement learning

  • e.g. computers playing games
  • perhaps less useful for clinical research, as it requires many interactive experiments

Machine learning is statistical learning with flexible models

- There is no fundamental difference between statistics and machine learning

- both optimize parameters to improve some criterion (loss / likelihood) that measures model fit to data

- models used in machine learning are more flexible

ML models can fit more functions, but are also more likely to overfit

Should pick the ‘right’ amount of model complexity
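
A minimal sketch of that trade-off: fitting polynomials of increasing degree to a noisy quadratic. The most flexible model (degree 15) fits the training data almost perfectly but generalizes worst; the setup and numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy quadratic ground truth."""
    x = rng.uniform(-3, 3, n)
    return x.reshape(-1, 1), 0.5 * x**2 + rng.normal(0, 1, n)

X_tr, y_tr = make_data(20)    # small training set
X_te, y_te = make_data(200)   # held-out test set

for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    mse = lambda X, y: np.mean((model.predict(X) - y) ** 2)
    print(f"degree {degree:2d}: train MSE {mse(X_tr, y_tr):.2f}, "
          f"test MSE {mse(X_te, y_te):.2f}")
# The degree-15 fit has near-zero train error but the worst test error.
```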

What is a large-language model like chatGPT?

What is chatGPT?

a stochastic auto-regressive next-word predictor with a chatbot interface

  • trained by predicting the next word
    • in a large corpus of text
    • with a large model
    • for a long time on expensive hardware

auto-regressive conditional generation:

\[\begin{align} \text{word}_1 &\sim p_{\text{chatGPT}}(\text{word}|\text{prompt})\\ \end{align}\]

auto-regressive conditional generation:

\[\begin{align} \text{word}_1 &\sim p_{\text{chatGPT}}(\text{word}|\text{prompt})\\ \text{word}_2 &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_1,\text{prompt}) \end{align}\]

auto-regressive conditional generation:

\[\begin{align} \text{word}_1 &\sim p_{\text{chatGPT}}(\text{word}|\text{prompt})\\ \text{word}_2 &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_1,\text{prompt})\\ \text{word}_n &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_{n-1},\ldots,\text{word}_1,\text{prompt}) \end{align}\]

auto-regressive conditional generation:

\[\begin{align} \text{word}_1 &\sim p_{\text{chatGPT}}(\text{word}|\text{prompt})\\ \text{word}_2 &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_1,\text{prompt})\\ \text{word}_n &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_{n-1},\ldots,\text{word}_1,\text{prompt})\\ \text{STOP} &\sim p_{\text{chatGPT}}(\text{word}|\text{word}_n,\ldots,\text{word}_1,\text{prompt}) \end{align}\]
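
A minimal sketch of this sampling loop, with a made-up next-word table standing in for \(p_{\text{chatGPT}}\). For brevity the toy model conditions only on the previous word; a real LLM conditions on the entire prompt plus all words generated so far.

```python
import random

random.seed(0)

# Toy stand-in for p_chatGPT: next-word probabilities given only the
# previous word (a real LLM conditions on the whole history + prompt).
NEXT_WORD = {
    "<prompt>":  {"the": 0.6, "a": 0.4},
    "the":       {"patient": 0.7, "doctor": 0.3},
    "a":         {"patient": 0.5, "doctor": 0.5},
    "patient":   {"recovered": 0.6, "<STOP>": 0.4},
    "doctor":    {"recovered": 0.2, "<STOP>": 0.8},
    "recovered": {"<STOP>": 1.0},
}

def generate(prompt="<prompt>"):
    """Auto-regressive generation: sample word_n given the words so far,
    repeat until the STOP token is drawn."""
    words, current = [], prompt
    while True:
        dist = NEXT_WORD[current]
        current = random.choices(list(dist), weights=dist.values())[0]
        if current == "<STOP>":
            return " ".join(words)
        words.append(current)

print(generate())  # e.g. "the patient recovered"
```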

GPT-4 scale

rule-based vs LLMs

rule-based systems:

  • deduction from explicit knowledge
  • knowledge verifiable and fast
  • constrained to what is deducible

LLMs:

  • knowledge extracted from observed data
  • unverifiable and compute-intensive
  • “chatGPT seems to know(?) a lot”

Using ML in research

ML versus statistics, when to use what

use machine learning when:

  • you have more data
  • you need more complex functions (e.g. images)

use statistics (e.g. GLMs) when:

  • you have less data
  • you have more domain knowledge

A sobering note

- ML in medicine has been ‘hot’ since at least the 90s (Cooper et al. 1997)

- not much evidence that it outperforms regression on most tasks (Christodoulou et al. 2019)

- though many studies are poorly conducted (Dhiman et al. 2022)

Safe use of AI models in medical practice

two questions

Question 1

  • prediction model of \(Y|X\) fits the data really well (AUC = 0.99 and perfect calibration)
  • will changing \(X\) induce a change in \(Y\)?

Question 2

  • Give statins when risk of cardiovascular event in 10 years exceeds 10%
  • an ML model based on age, medication history, and a cardiac CT scan predicts this risk very well
  • will using this model for treatment decisions improve patient outcomes?

Improving the world is a causal task

  • statistics / ML: what to expect when we passively observe the world
  • not how we can intervene to make things better; that requires causality
  • Question 1
    • yellowish fingers predict lung cancer: should we paint fingers a skin color?
    • weight loss predicts death in lung cancer: should we send patients to the couch with McDonald’s?

When accurate prediction models yield harmful self-fulfilling prophecies

Tip

building models for decision support without regards for the historic treatment policy is a bad idea

Note

The question is not “is my model accurate before / after deployment?”, but “did deploying the model improve patient outcomes?”
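
A hypothetical simulation of the mechanism, with all parameters invented: an effective treatment is historically given to severe patients, so a treatment-naive risk model learns that severe patients rarely die, and its low predictions would withhold exactly the treatment that produced those good outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 20_000

# Invented data-generating process: severity drives death risk, an
# effective treatment halves it, and the historic policy treats the severe.
severity = rng.normal(0, 1, n)
treated_hist = severity > 0.5                    # historic treatment policy
p_death = 1 / (1 + np.exp(-(severity - 1)))      # risk if untreated
p_death = np.where(treated_hist, 0.5 * p_death, p_death)
died = rng.random(n) < p_death

# Treatment-naive risk model: predicts death from severity alone,
# trained on outcomes generated *under* the historic policy.
model = LogisticRegression().fit(severity.reshape(-1, 1), died)

# A severe patient looks lower-risk than they would be untreated (~0.62),
# because similar patients were historically treated:
print(model.predict_proba([[1.5]])[0, 1])
# Using this prediction to withhold treatment removes exactly the
# treatment that made the historic risk look low: a harmful prophecy.
```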

Treatment-naive risk models

Is this obvious?

Tip

It may seem obvious that you should not ignore historical treatments in your prediction models if you want to improve treatment decisions. Yet many such models are published daily, and some guidelines even allow implementing them based on predictive performance alone.

Other risk models:

- condition on the given treatment and patient traits

- unobserved confounding (hat type) leads to wrong treatment decisions

Bigger data does not protect against harmful risk models

More flexible models do not protect against harmful risk models

Gap between prediction accuracy and value for decision making

What to do?

What to do?

  1. Evaluate policy change (cluster randomized controlled trial)

  2. Build models that are likely to have value for decision making

Prediction-under-intervention models

Predict the outcome under a hypothetical intervention, such as giving a certain treatment.
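
A minimal sketch of such a model on simulated data: treatment is included as an input, so the fitted model can be queried under the hypothetical interventions “treat” and “don’t treat”. This is only valid under assumptions such as no unobserved confounding and overlap in the historic policy; the simulation parameters are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20_000

# Same invented process as before, but with a noisy historic policy so
# both treated and untreated patients occur at every severity (overlap).
severity = rng.normal(0, 1, n)
treated = severity + rng.normal(0, 1, n) > 0.5
p_death = 1 / (1 + np.exp(-(severity - 1))) * np.where(treated, 0.5, 1.0)
died = rng.random(n) < p_death

# Include treatment as an input: p_theta(death | severity, treatment).
X = np.column_stack([severity, treated])
model = LogisticRegression().fit(X, died)

# Prediction under intervention for one patient (severity = 1.5):
risk_treated   = model.predict_proba([[1.5, 1]])[0, 1]
risk_untreated = model.predict_proba([[1.5, 0]])[0, 1]
print(f"risk if treated: {risk_treated:.2f}, if untreated: {risk_untreated:.2f}")
```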

When developing risk models, always discuss:

1. what is the effect on the treatment policy?

2. what is the effect on patient outcomes?

Don’t assume predicting well leads to good decisions

think about the policy change

From algorithms to action: improving patient care requires causality (Amsterdam, Jong, et al. 2024)

When accurate prediction models yield harmful self-fulfilling prophecies (Amsterdam, Geloven, et al. 2024)

take-aways

  • AI subsumes rule-based programs and machine learning
  • machine learning is statistical learning from data with flexible models
  • chatGPT does auto-regressive next-word prediction
  • chatGPT produces beautiful mistakes: eloquently written logical fallacies
  • prediction: what to expect when passively observing the world
  • causality: what happens when I change something?
  • prediction models can cause harmful self-fulfilling prophecies when used for decision making
  • when building prediction models for decision support, you cannot ignore the treatment decisions in the historic data
  • models for prediction-under-intervention have foreseeable effects when used for decision making
  • the ultimate test of model utility is patient outcomes in a (cluster) RCT

thank you

Amsterdam, Wouter A. C. van, Nan van Geloven, Jesse H. Krijthe, Rajesh Ranganath, and Giovanni Cinà. 2024. “When Accurate Prediction Models Yield Harmful Self-Fulfilling Prophecies.” arXiv. https://doi.org/10.48550/arXiv.2312.01210.
Amsterdam, Wouter A. C. van, Pim A. de Jong, Joost J. C. Verhoeff, Tim Leiner, and Rajesh Ranganath. 2024. “From Algorithms to Action: Improving Patient Care Requires Causality.” arXiv. https://doi.org/10.48550/arXiv.2209.07397.
Christodoulou, Evangelia, Jie Ma, Gary S. Collins, Ewout W. Steyerberg, Jan Y. Verbakel, and Ben Van Calster. 2019. “A Systematic Review Shows No Performance Benefit of Machine Learning over Logistic Regression for Clinical Prediction Models.” Journal of Clinical Epidemiology 110 (June): 12–22. https://doi.org/10.1016/j.jclinepi.2019.02.004.
Cooper, Gregory F., Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G. Buchanan, Richard Caruana, Michael J. Fine, et al. 1997. “An Evaluation of Machine-Learning Methods for Predicting Pneumonia Mortality.” Artificial Intelligence in Medicine 9 (2): 107–38. https://doi.org/10.1016/S0933-3657(96)00367-3.
Dhiman, Paula, Jie Ma, Constanza L. Andaur Navarro, Benjamin Speich, Garrett Bullock, Johanna A. A. Damen, Lotty Hooft, et al. 2022. “Methodological Conduct of Prognostic Prediction Models Developed Using Machine Learning in Oncology: A Systematic Review.” BMC Medical Research Methodology 22 (1): 101. https://doi.org/10.1186/s12874-022-01577-x.