Dataset shift can thwart the best intentions of algorithm developers and tech-savvy clinicians, but there are solutions.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
Generalizability has always been a concern in health care, whether we're discussing the application of clinical trial findings or machine learning-based algorithms. A large randomized controlled trial that finds an intensive lifestyle program doesn't reduce the risk of cardiovascular complications in patients with Type 2 diabetes, for instance, suggests the diet/exercise regimen is not worth recommending to patients. But the question immediately comes to mind: Can that finding be generalized to the entire population of Type 2 patients? As we have pointed out in other publications, subgroup analysis has demonstrated that many patients do, in fact, benefit from such a program.
The same problem exists in health care IT. Several
algorithms have been developed to help classify diagnostic images, predict
disease complications, and more. A closer look at the datasets upon which these
digital tools are based indicates that many suffer from dataset shift. In plain English, dataset shift is what happens when the data used during the development of an algorithm changes over time and no longer matches the data the algorithm encounters once it is implemented. For example, the patient demographics used to create a model may no longer represent the patient population by the time the algorithm is put into clinical use. This happened when COVID-19 changed the demographic characteristics of patients, making the Epic sepsis prediction tool ineffective.
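To make that concrete, here is a minimal sketch of one common way a team might watch for this kind of shift: comparing the distribution of a single input feature in the development data against the data the deployed model is currently seeing. The feature (patient age), the two-sample Kolmogorov-Smirnov test, and the alert threshold are illustrative assumptions, not a description of any particular product.

```python
# Minimal sketch: flag possible dataset shift by comparing the training-time
# distribution of a feature with the distribution seen after deployment.
# The feature ("patient age") and the 0.01 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def check_feature_shift(train_values, live_values, alpha=0.01):
    """Two-sample Kolmogorov-Smirnov test on one feature.

    Returns (shifted, statistic, p_value); shifted is True when the live
    distribution differs significantly from the training distribution.
    """
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha, statistic, p_value

# Synthetic example: the post-deployment population skews older.
rng = np.random.default_rng(42)
train_age = rng.normal(loc=55, scale=12, size=5000)  # development cohort
live_age = rng.normal(loc=68, scale=10, size=1200)   # post-deployment cohort

shifted, stat, p = check_feature_shift(train_age, live_age)
print(f"shift detected: {shifted}, KS statistic: {stat:.3f}, p-value: {p:.3g}")
```

In practice, checks like this are run across many features and combined with clinical judgment; a single p-value is a prompt for investigation, not proof that a model has failed.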
Samuel Finlayson, PhD, of Harvard Medical School, and his colleagues described a long list of dataset shift scenarios that can compromise the accuracy and equity of AI-based algorithms, which in turn can compromise patient outcomes and patient safety. Finlayson et al. list 14 scenarios, which fall into three broad categories: changes in technology, changes in population and setting, and changes in behavior. Examples of ways in which dataset shift can create misleading outputs that send clinicians down the wrong road include:
- Changes in the X-ray scanner models used
- Changes in the way diagnostic codes are collected (e.g., switching from ICD-9 to ICD-10)
- Changes in patient population resulting from hospital mergers
Other potential problems to be cognizant of include changes in your facility's EHR system. Updates to the system can change how terms are defined, which in turn can affect predictive algorithms that rely on those definitions. If a term like elevated temperature or fever is changed to pyrexia in one of the EHR's drop-down menus, for example, it may no longer map to an algorithm that uses elevated temperature as one of the variables it needs to predict sepsis or any number of common infections. Similarly, if an ML-based model has been trained on a patient dataset from a medical specialty practice or hospital cohort, it will likely generate misleading outputs when applied in a primary care setting.
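As a hedged illustration of the kind of mapping fix described above, the sketch below normalizes raw EHR terms to the variable names a predictive model expects, so that a relabeled drop-down value such as "pyrexia" still reaches the model as the elevated-temperature variable. The synonym table and field names are hypothetical; production systems typically map to standard vocabularies rather than free text.

```python
# Minimal sketch: normalize EHR terminology to the variable names a
# predictive model was trained on. The synonym map and field names are
# hypothetical; real systems usually map to standard codes (e.g., SNOMED CT).
TERM_TO_MODEL_VARIABLE = {
    "elevated temperature": "elevated_temperature",
    "fever": "elevated_temperature",
    "pyrexia": "elevated_temperature",  # new drop-down label after an EHR update
}

def normalize_record(ehr_record):
    """Map raw EHR finding terms to the model's expected variable names."""
    model_inputs = {}
    for finding, value in ehr_record.items():
        variable = TERM_TO_MODEL_VARIABLE.get(finding.strip().lower())
        if variable is None:
            # Unmapped terms should be logged and reviewed, not silently dropped.
            print(f"warning: unmapped EHR term '{finding}'")
            continue
        model_inputs[variable] = value
    return model_inputs

# After the EHR update, "Pyrexia" still maps to the variable the sepsis model expects.
print(normalize_record({"Pyrexia": True}))  # {'elevated_temperature': True}
```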
Finlayson et al. mention another example to be aware of: changes in the way physicians practice can influence data collection. "Adoption of new order sets, or changes in their timing, can heavily affect predictive model output." Clearly, problems like this necessitate strong interdisciplinary ties, including an ongoing dialogue among the chief medical officer, clinical department heads, and the chief information officer and his or her team. Equally important is the need for clinicians in the trenches to look for subtle changes in practice patterns that can affect the predictive analytics tools currently in place. Many dataset mismatches can be resolved by updating the variable mapping, retraining or redesigning the algorithm, and conducting a multidisciplinary root cause analysis.
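One way to operationalize that ongoing vigilance, sketched below under assumed thresholds, is to track a deployed model's discrimination on recent adjudicated cases and flag it for root cause review, and possible retraining, when performance falls below an agreed floor. The AUC metric, the 0.75 floor, and the window size are illustrative assumptions rather than recommendations.

```python
# Minimal sketch: monitor a deployed model's discrimination on recent labeled
# cases and flag it for review/retraining when performance degrades.
# The 0.75 AUC floor and 500-case window are illustrative assumptions.
from collections import deque
from sklearn.metrics import roc_auc_score

class PerformanceMonitor:
    def __init__(self, auc_floor=0.75, window_size=500):
        self.auc_floor = auc_floor
        self.labels = deque(maxlen=window_size)
        self.scores = deque(maxlen=window_size)

    def record(self, true_label, predicted_risk):
        """Add one adjudicated case (actual outcome plus the model's risk score)."""
        self.labels.append(true_label)
        self.scores.append(predicted_risk)

    def needs_review(self):
        """Return True if the AUC over recent cases has fallen below the floor."""
        if len(set(self.labels)) < 2:  # AUC needs both outcomes present
            return False
        recent_auc = roc_auc_score(list(self.labels), list(self.scores))
        return recent_auc < self.auc_floor

# Usage: feed each case back in once its outcome is known.
monitor = PerformanceMonitor()
monitor.record(1, 0.30)  # patient developed sepsis; model scored low risk
monitor.record(0, 0.80)  # patient did not; model scored high risk
print(monitor.needs_review())  # True in this tiny example
```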
While addressing dataset shift will improve the effectiveness of your AI-based algorithms, it is only one of many stumbling blocks to contend with. One classic example demonstrating that computers are still incapable of matching human intelligence is the study that concluded that patients with asthma are less likely to die from pneumonia than those who don't have asthma. The machine learning tool that reached that unwarranted conclusion failed to take into account the fact that asthmatics often get faster, earlier, more intensive treatment when their condition flares up, which results in a lower mortality rate. Had clinicians acted on the misleading correlation between asthma and fewer pneumonia deaths, they might have decided that asthma patients don't necessarily need to be hospitalized when they develop pneumonia.
This kind of misdirection is relatively common and underscores the fact that ML-enhanced tools sometimes have trouble separating useless "noise" from meaningful signal. Another example worth noting: some algorithms designed to help detect COVID-19 by analyzing X-rays suffer from this shortcoming. Several of these deep learning algorithms rely on confounding variables instead of focusing on medical pathology, giving clinicians the impression that they are accurately identifying the infection or ruling out its presence. Unbeknownst to their users, these algorithms have been shown to key on text markers or patient positioning instead of pathology findings.
At Mayo Clinic, we have had to address similar problems. A
palliative care model that was trained on data from the Rochester, Minnesota,
community, for instance, did not work well in our health system because the
severity of patient disease in a tertiary care facility is very different from what's seen in a local community hospital. Similarly, one of our algorithms broke when a vendor did a point release in its software and changed the format of the results. We also had a vendor run 10 of our known stroke patients through its CT stroke detection algorithm, and the system was able to identify only one of them. The root cause: Mayo Clinic medical physicists have optimized our imaging protocols to use 25% of the industry-standard radiation dose in order to reduce patients' exposure, but that changed the signal-to-noise ratio of the images, and the vendor's system, which hadn't been trained on images with that ratio, couldn't detect the strokes.
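A lightweight defense against the kind of silent upstream format change described above, sketched here with hypothetical field names, is to validate each incoming vendor payload against the schema the model was built to consume and hold anything unexpected for review rather than scoring it.

```python
# Minimal sketch: validate an upstream result payload against the schema the
# model expects before scoring, so a silent format change fails loudly instead
# of producing misleading outputs. Field names and types are hypothetical.
EXPECTED_FIELDS = {
    "patient_id": str,
    "troponin_ng_per_l": float,
    "collected_at": str,  # ISO 8601 timestamp expected
}

def validate_payload(payload):
    """Return a list of problems; an empty list means the payload looks as expected."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"unexpected type for {field}: {type(payload[field]).__name__}"
            )
    return problems

# After a hypothetical vendor point release, a numeric result arrives as a string.
issues = validate_payload({
    "patient_id": "A123",
    "troponin_ng_per_l": "12.4",              # was a float before the update
    "collected_at": "2021-06-01T08:30:00Z",
})
if issues:
    print("holding record for review:", issues)  # don't score it silently
```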
Valentina Bellini, of the University of Parma in Parma, Italy, and her colleagues sum up the AI shortcut dilemma in a graphic that illustrates three broad problem areas: poor-quality data, ethical and legal issues, and a lack of educational programs for clinicians who may be skeptical or uninformed about the value and limitations of AI-enhanced algorithms in intensive care settings.
As we have pointed out in other blogs, ML-based algorithms rely on math, not magic. But when reliance on that math overshadows clinicians' diagnostic experience and common sense, clinicians need to partner with their IT colleagues to find ways to reconcile artificial and human intelligence.