By John Halamka and Paul Cerrato*
We are often asked this question during interviews, podcasts and speaking engagements. It’s a complicated question that requires context. A closer look at the research and product offerings in digital health demonstrates that there are several high-quality, well-documented algorithms now available, but there are also several questionable vendors that have rushed to market with little evidence to support their claims. Separating the wheat from the chaff can be a full-time occupation.
We recently received a press release from a large U.S. company highlighting a new AI system that may be able to diagnose dementia using a patient’s vocal patterns. The vendor reported that its research generated an area under the curve (AUC) of 0.74 for its system, which, taken at face value, suggests that at least one in four patients with dementia would be overlooked. With these concerns in mind, the question is: What kind of guidelines can clinicians and technologists on the front lines turn to when they want to make informed choices?
Ideally, we need an impartial referee that can act as a Consumer Reports-style service, weighing the strengths and weaknesses of AI offerings, with brief summaries of the evidence on which it bases its conclusions. In lieu of that, there are criteria that stakeholders can use to help with the decision-making process. As we point out in a recent review in NEJM Catalyst, published by the NEJM Group, at the very least there should be prospective studies to support any diagnostic or therapeutic claims; too many vendors continue to rely on retrospective analyses to support their products. (1) In our NEJM Catalyst analysis, we include an appendix entitled “Randomized Controlled Trials and Prospective Studies on AI and Machine Learning,” which lists only 5 randomized controlled trials and 9 prospective, non-RCT studies. When one compares that to the thousands of AI services and products coming to market, it’s obvious that digital health still has a long way to go before it’s fully validated.
That’s not to suggest there are no useful, innovative AI and machine learning tools that are well supported, with several more coming through the pipeline. There are credible digital tools to estimate a patient’s risk of colorectal cancer (ColonFlag), manage type 1 diabetes (DreaMed), and screen for diabetic retinopathy (IDx-DR), all of which are supported by good evidence.** There is also a database of FDA-approved AI/ML-based medical technologies, compiled and summarized by Stan Benjamens and his associates in npj Digital Medicine. (2) (Keep in mind when reviewing this database, however, that some of the algorithms cleared by the FDA were based on very small numbers of patients.)
A recent virtual summit gathered several thought leaders in AI, digital health and clinical decision support (CDS) to create a list of principles by which such tools can be judged. Spearheaded by Roche and Galen/Atlantica, a management consulting firm, the summit produced a communique that refers to the project as “A multi-stakeholder initiative to advance non-regulatory approaches to CDS quality.” Emphasizing the need for better evidence, the communique states: “The development of CDS is driven by increasing access to electronic health care data and advancing analytical capabilities, including artificial intelligence and machine learning (AI/ML). Measures to ensure the quality of CDS systems, and that high-quality CDS can be shared across users, have not kept pace. This has led some corners of the market for CDS to be characterized by uneven quality, a situation participants likened to ‘the Wild West.’”
The thought leaders who gathered for the CDS summit certainly aren’t the only ones interested in improving the quality of AI/ML-enhanced algorithms. The SPIRIT-AI and CONSORT-AI initiative, an international collaborative group that aims to improve the way AI-related research is conducted and reported in the medical literature, has published two sets of guidelines to address the issues we mentioned above. The twin guidelines have been published in Nature Medicine, the BMJ and the Lancet Digital Health. (3,4) They are also available on the group’s website.
With all these thought leaders and experts on board, there’s no doubt the AI ecosystem is gradually transitioning from the “Wild West” into a set of well-defined and repeatable processes that health care stakeholders can trust.
*Paul Cerrato is a senior research analyst and communications specialist at Mayo Clinic Platform.
References
1. Halamka J, Cerrato P. The Digital Reconstruction of Health Care. NEJM Catalyst Innovations in Care Delivery. 2020;1(6).
2. Benjamens S, Dhunnoo P, Mesko B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. npj Digital Medicine. 2020;3:118. https://www.nature.com/articles/s41746-020-00324-0
3. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nature Medicine. 2020;26:1351–1363.
4. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine. 2020;26:1364–1374.
Awesome, JH!
We are working on surgical CDS and aligning ACS resources to assure appropriateness and current algorithms for optimal, end-to-end clinical care in surgical conditions such as cancer, trauma, etc.
Great work coming out of Mayo - terrific.
Frank
While I agree with most of your comments, an AUC of 0.74 does not mean that 26% of at-risk patients may be overlooked. It means that a randomly selected at-risk patient will score higher than a randomly selected patient who is not at risk about 74% of the time. Model AUC is also not comparable between models built on different data.
The next step, translating a model score into a decision, requires choosing a score threshold, which in turn determines the usual measures of sensitivity and specificity. Depending on where that threshold is set, sensitivity could be much better than 74%, or much worse.
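To make that concrete, here is a minimal sketch using made-up scores (not drawn from any real dementia model): it computes AUC as the fraction of at-risk/not-at-risk pairs in which the at-risk patient scores higher, then shows how sensitivity and specificity move as the cut-off moves.

```python
# Illustrative only: made-up scores, not output from any real dementia model.
import numpy as np

rng = np.random.default_rng(0)
at_risk = rng.normal(1.0, 1.0, 200)       # scores for patients who have the condition
not_at_risk = rng.normal(0.0, 1.0, 800)   # scores for patients who do not

# AUC as a probability: how often does a randomly chosen at-risk patient
# score higher than a randomly chosen not-at-risk patient?
auc = np.mean(at_risk[:, None] > not_at_risk[None, :])
print(f"pairwise AUC ~ {auc:.2f}")

# Sensitivity and specificity depend on where the cut-off is placed,
# not on the AUC alone.
for cutoff in (-0.5, 0.5, 1.5):
    sensitivity = np.mean(at_risk >= cutoff)     # at-risk patients flagged
    specificity = np.mean(not_at_risk < cutoff)  # not-at-risk patients cleared
    print(f"cut-off {cutoff:+.1f}: sensitivity {sensitivity:.2f}, "
          f"specificity {specificity:.2f}")
```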
I call this out not to nitpick, but because the community needs to ask better questions about model performance, both of vendors and as reviewers of models in the literature. Models should be compared against a simple model on the same data set. Please ask to see performance versus a 5- to 10-feature regression model fit to the same data.
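As a sketch of that kind of sanity check, assuming scikit-learn and a purely synthetic data set standing in for the vendor's data (the gradient boosting model below is just a stand-in for a more complex proprietary model), one could compare a small regression baseline with the complex model on the same data:

```python
# Illustrative only: synthetic data standing in for "the same data set"
# a vendor's model was trained on.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# The simple baseline: a small logistic regression on a handful of features.
baseline = LogisticRegression(max_iter=1000)

# A stand-in for the vendor's more complex model.
complex_model = GradientBoostingClassifier(random_state=0)

for name, model in [("8-feature logistic regression", baseline),
                    ("gradient boosting", complex_model)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC {auc:.3f}")

# If the complex model barely beats the simple baseline on the same data,
# the extra machinery deserves extra scrutiny.
```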