In addition to evaluating the safety of software as a medical device (SaMD), the FDA needs to devote more resources to evaluating its efficacy and quality.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
The FDA’s approach to software as a medical device (SaMD) has been evolving. Consider a few examples.
In 2018, IDx-DR, a software system used to improve screening for retinopathy, a common complication of diabetes that affects the eye, became the first AI-based medical device to receive U.S. Food and Drug Administration clearance to “detect greater than a mild level of … diabetic retinopathy in adults who have diabetes.” To arrive at that decision, the agency not only reviewed data to establish the device's safety but also took into account prospective studies, an essential form of evidence that clinicians look for when deciding whether a device or product is worth using. The software was the first medical device approved by the FDA that does not require the services of a specialist to interpret the results, making it a useful tool for health care providers who may not normally be involved in eye care. The FDA clearance emphasized that IDx-DR is a screening tool, not a diagnostic tool, stating that patients with positive results should be referred to an eye care professional. The algorithm built into the IDx-DR system is intended to be used with the Topcon NW400 retinal camera and a cloud server that contains the software.
Similarly, the FDA looked at a randomized prospective trial before approving a machine learning-based algorithm that can help endoscopists improve their ability to detect smaller, easily missed colonic polyps. Its recent clearance of GI Genius by Medtronic was based on a clinical trial published in Gastroenterology, in which investigators in Italy evaluated data from 685 patients, comparing a group that underwent the procedure with the help of the computer-aided detection (CADe) system to a group that served as controls. Repici et al. found that the adenoma detection rate was significantly higher in the CADe group, as was the detection rate for polyps 5 mm or smaller, which led to the conclusion: “Including CADe in colonoscopy examinations increases detection of adenomas without affecting safety.”
Their findings raise several questions. Is it reasonable to assume that the results of a study of more than 600 Italian patients would apply to a U.S. population, which has different demographic characteristics? More importantly, were the 685 patients representative of the general public, including adequate numbers of persons of color and those in lower socioeconomic groups? While the Gastroenterology study did report enough female patients, there is no mention of these other marginalized groups.
An independent 2021 analysis of FDA approvals has likewise raised concerns about the effectiveness and equity of several recently approved AI algorithms. Eric Wu from Stanford University and his colleagues examined the FDA’s clearance of 130 devices and found that the vast majority (126 of 130) were approved based on retrospective studies. And when they separated all 130 devices into low- and high-risk subgroups using FDA guidelines, they found that none of the 54 high-risk devices had been evaluated by prospective trials. Other shortcomings documented in Wu’s analysis included the following:
- Of the 130 approved products, 93 did not report multi-site evaluation.
- Fifty-nine of the approved AI devices included no mention of the sample size of the test population.
- Only 17 of the approved devices discussed a demographic subgroup.
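To put the counts above on a common scale, the short sketch below converts each one into a share of the 130 devices. The counts are taken directly from the figures quoted in this article; the label strings and the `share_of_total` helper are illustrative names of our own, not part of Wu's analysis.

```python
# Counts as quoted in this article's summary of Wu and colleagues' analysis.
TOTAL_DEVICES = 130

reported_counts = {
    "approved based on retrospective studies": 126,
    "no multi-site evaluation reported": 93,
    "no test sample size mentioned": 59,
    "demographic subgroup discussed": 17,
}


def share_of_total(count: int, total: int = TOTAL_DEVICES) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical helper output: percentage of the 130 devices for each finding.
shares = {label: share_of_total(n) for label, n in reported_counts.items()}

for label, pct in shares.items():
    print(f"{label}: {reported_counts[label]} of {TOTAL_DEVICES} ({pct}%)")
```

Seen this way, roughly 97% of the devices relied on retrospective studies, about 72% lacked a reported multi-site evaluation, about 45% did not mention a test sample size, and only about 13% discussed a demographic subgroup.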
We would certainly like to see the FDA take a more thorough approach to AI-based algorithm clearance, but in lieu of that, several leading academic medical centers, including Mayo Clinic, are contemplating a more holistic and comprehensive approach to algorithmic evaluation. It would include establishing a standard labeling schema to document the characteristics, behavior, efficacy, and equity of AI systems, to reveal the properties of systems necessary for stakeholders to assess them and build the trust necessary for safe adoption. The schema will also support assessment of the portability of systems to disparate datasets. The labeling schema will serve as an organizational framework that specifies the elements of the label. Label content will be specified in sections that will likely include:
- model details such as name, developer, date of release, and version,
- the intended use of the system,
- performance measures,
- accuracy metrics, and
- training data and evaluation data characteristics.
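One way to picture such a label is as a small structured record with one field per section. The sketch below is purely illustrative: the class names, field names, and types are our assumptions, and the example values are placeholders. It is not the schema under discussion, whose final sections have not been specified.

```python
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class ModelDetails:
    """Hypothetical 'model details' section of a label."""
    name: str
    developer: str
    release_date: date
    version: str


@dataclass
class AlgorithmLabel:
    """Hypothetical label with one field per section listed above."""
    model_details: ModelDetails
    intended_use: str
    performance_measures: dict[str, float] = field(default_factory=dict)
    accuracy_metrics: dict[str, float] = field(default_factory=dict)
    training_data: dict[str, str] = field(default_factory=dict)
    evaluation_data: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have been left empty."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values only; they do not describe any real product.
example_label = AlgorithmLabel(
    model_details=ModelDetails(
        name="example-model",
        developer="Example Developer",
        release_date=date(2021, 1, 1),
        version="0.1",
    ),
    intended_use="Placeholder description of the intended use",
)

print(example_label.missing_sections())
```

A check like `missing_sections` hints at why a standard structure is useful: a reviewer can see at a glance which parts of a label a developer has not filled in, such as the characteristics of the evaluation data.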
While it makes no sense to sacrifice the good in pursuit of the perfect, the current regulatory framework for evaluating SaMD is far from perfect. Combining a more robust FDA approval process with the expertise of the world’s leading medical centers will offer our patients the best of both worlds.