
We can learn from the past in AI/Medicine

There is a lot of excitement about using LLMs, and AI more generally, in medicine, but it sometimes seems that enthusiasts have limited awareness of the history of AI in medicine. I think this is a mistake: we can learn from previous “booms” and “busts” in AI/Medicine (such as IBM Watson), while of course still hoping and expecting that things will be different this time.

This blog is motivated by a letter which I sent to the Economist news magazine, in response to a survey they published on AI/Medicine (this is my second letter about AI which they have published recently). Obviously a letter is very short, so I thought I’d expand on the topic in a blog. The Economist survey talks about AI in diagnosis, admin, patient-facing apps, and drug discovery, with frequent mention of AI in low-income countries; I’ll follow this structure, except that I won’t discuss drug discovery since I don’t know anything about it.

Incidentally, I have great respect for reporting in the Economist, mostly because it’s just about the only mass-media news venue where I see reasonable articles about topics I know about. I.e., if the BBC, Guardian, New York Times, etc. publish an article about a subject I know about, I am often horrified by what they say; whereas if the Economist does this, my usual reaction is “somewhat simplified, but not bad for something aimed at a mass-market audience”.

AI for Medical Diagnosis

AI and data-science researchers love to work on medical diagnosis, perhaps because the task fits the machine learning perspective well. I.e., in ML terms the goal is to build a classifier which outputs a diagnosis (a category) based on patient symptoms (input data), and there is a lot of historical data on which to train models.
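To make this framing concrete, here is a minimal sketch of the standard ML workflow: fit a classifier on historical records, then report accuracy on a held-out test set (the evaluation criterion discussed later in this post). This is purely illustrative, not any real diagnostic system; the “symptom” features and labels are synthetic, and it assumes scikit-learn and NumPy are available.

```python
# Minimal illustrative sketch: diagnosis framed as a classification problem.
# All data here is synthetic; no real patient records or symptoms are used.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic "patient records": each row is a patient, each column a symptom/measurement.
n_patients = 1000
X = rng.normal(size=(n_patients, 5))

# Synthetic "diagnosis" label, loosely dependent on the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)

# Standard ML workflow: train on historical data, evaluate on a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("Held-out test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Of course, as the rest of this post argues, a good score on this kind of held-out test set says little by itself about real-world clinical utility.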

There is very old work on this; in particular Paul Meehl showed in 1954 that simple linear regression models did a better job at some diagnostic tasks (at least as assessed by ML criteria) than the average doctor. Daniel Kahneman discusses this in his book Thinking, Fast and Slow, and points out that perhaps this shouldn’t be surprising, because humans (including doctors) aren’t very good at reasoning about numbers and probabilities. In other words, models can beat doctors at some diagnosis tasks because the human brain is not very good at decision-making based on numbers. Anyway, despite this, Meehl’s models were never used clinically on a wide scale.

In the 1970s and early 1980s, the rule-based MYCIN system was shown to outperform doctors at recommending some types of drug treatment. This was a well-publicised finding at the time (it certainly had an impact on me as an undergraduate student). However, again, MYCIN was never used clinically, despite its better-than-human performance on test sets.

Of course, Meehl’s work and MYCIN were academic projects. In the 2010s, IBM invested billions in medical decision support tools based on its Watson technology. This failed, despite the huge investment. I highly recommend Strickland’s excellent retrospective (blog), which amongst other things points out that Watson’s technology didn’t fit into the way doctors worked and also didn’t address the most important problems (“pain points”) in healthcare. In other words, a tool which helps doctors make decisions in an artificial lab context may not have a lot of real-world utility.

More recently, Yu et al.’s 2022 BMJ article points out that hundreds of Covid-related AI models were developed, but very few of these were deployed and none were widely adopted. The quote below from the paper summarises some of their conclusions.

To achieve a sustainable impact, researchers into AI should look beyond model development and consider how solutions can be practically and ethically implemented at the bedside. This approach demands a broader perspective that ensures integration with hospital systems, satisfies ethical standards to safeguard patients, and adapts to existing workflows in a way that acknowledges and leverages clinical expertise. If AI researchers do not adapt their work to real world clinical contexts, they risk producing models that are irrelevant, infeasible, or irresponsible to implement.

In short, the past 70 years (since Meehl’s 1954 work) show that it is certainly possible to build models that outperform doctors at some medical diagnosis tasks, at least as judged by the “performance on a held-out test set” criteria used in ML. However, it is much harder to build models which (A) are useful in real-world clinical contexts, (B) can be used in many hospitals and surgeries (scalability), and (C) provide a good “return on investment” compared to other opportunities.

Of course, LLM and AI technology is much better in 2024, so perhaps we can build diagnostic AI systems which meet the above criteria; certainly there are encouraging developments in radiography (BMJ). But success needs to be shown with regard to the real-world criteria above, not on the basis of accuracy on a test set.

Administrative tasks

I’ll discuss the other areas more briefly. There has long been an interest in using AI to automate administrative tasks in healthcare, perhaps because this raises fewer patient-safety issues. Outcomes are also easier to measure (increased productivity in writing reports is easier to measure than changes in 5-year mortality). Indeed, one of the first uses of speech recognition technology was in helping doctors write reports (Kurzweil Clinical Reporter, in the 1980s). Of course transcription technology has gotten far better in 2024, and is widely used in healthcare.

There is also a long-standing tradition of using NLG to help write medical reports; see for example Hüske-Kraus’s 2003 review article. A recent example is work by my student Francesco Moramarco and others on summarising doctor-patient consultations (paper). Arria has also worked on some projects in this area, which I cannot talk about.

So I think there is great potential to use AI to support administrative tasks, but again we need to look at real-world usage and “return on investment” (ROI). I cannot give details here, but I have seen cases where real-world ROI was disappointing, especially for exciting new technology that people had great hopes for.

Patient-facing

I personally think there is great potential in using AI to directly support patients, but again history shows this is challenging. My first project (around 2000) in this space was a tool to encourage smoking cessation; this was shown not to be effective in a clinical trial (paper).

Tim Bickmore has worked on apps to encourage health behaviour change for 20 years, using much better technology than I did. A few years ago he told me that it was very difficult to show long-term impact. I.e., people using an AI app might initially change their behaviour, but after a while would revert to their original behaviour. Of course, this is a common finding in health behaviour change more generally; for example, most people who start diets abandon them.

In my own group we’re working on explaining complex information to patients (blog), in close collaboration with colleagues from the medical school. I think this is very exciting, but even if we are successful, it will be years before we have anything which can be widely deployed in real usage.

Medical AI in Low-Income Countries

The Economist survey, and many other people, have speculated that AI could be especially beneficial in low-income countries with few doctors. This is a great vision, but again it’s useful to be aware of history.

For example, at its peak the Babyl system was used by 20% of the population of Rwanda; it has now been wound down. From an AI perspective, when Babyl was launched there was a big emphasis on AI, but over time I saw less being said about AI and more of an emphasis on telemedicine, i.e. letting villagers easily communicate with health professionals elsewhere. Perhaps the lesson here is that in a poor country with limited health services, there may be “quicker wins” than AI, i.e. simpler things which have a better ROI in health and monetary terms.

Final Thoughts

It is really nice to see so much excitement about using AI in healthcare, and I certainly hope that this will turn into systems and products which make a difference. But if our goal is real-world impact, we should be aware of what has been learned about AI in healthcare over the past 70 years.

Certainly a key recurring lesson is that creating something which has a real-world impact is far harder than showing that a model does well on a test set. Another lesson is that creating a system which works in one hospital is much easier than creating a system which can be deployed in many hospitals and other healthcare organisations.

I’m not trying to be pessimistic: I think there is huge potential in using AI in healthcare! But success is more likely if we keep the above lessons in mind.

One thought on “We can learn from the past in AI/Medicine”

  1. Excellent points.

    I would like to add two twists to the “diagnosis” part.

    1. Regarding the limited success of early systems like Mycin, a smart person (I can’t recall who it was, I’m sorry) argued that these systems answered the wrong question. In clinical medicine, the driving question is not “what is the patient’s diagnosis?”, but rather “what should I do?”, whether that is to confirm the diagnosis, reduce the number of differential diagnoses, prevent expected pathophysiological consequences, or understand the prognosis. Thus I would expand Yu’s quote: “To achieve a sustainable impact,… AI-based solutions need to answer the right questions.”
    2. Another point of concern is a system’s behavior at the border of its competence. Mycin would have spat out a “most likely diagnosis” even if I had given it the symptoms of my broken dishwasher. Whereas humans “degrade gracefully” when leaving their areas of competence (a neurosurgeon will still have an idea what to do with a broken leg) and, maybe more importantly, will understand that they are operating outside their expertise, the same cannot be said about the early expert systems. And I would add that LLMs are at risk of falling into the very same trap.

    As a side note to the quote by Yu (“To achieve a sustainable impact, researchers into AI should look beyond model development and consider how solutions can be practically and ethically implemented at the bedside”):
    “For example, in the context of medical applications, the failure to design expert systems according to the requirements of the context in which they are to be used (i.e., how they will fit into clinical work flow) is a major reason why so few of them have proved operationally useful.”

    This was written >30 years ago (Oravec, J. A., & Travis, L. (1992). If we could do it over, we’d… Learning from less-than-successful expert system projects. Journal of Systems and Software, 19(2), 113-122.)

