Friday, July 30, 2021

The Future Belongs to Digital Pathology

Advances in artificial intelligence are slowly transforming the specialty, much the way radiology is being transformed by similar advances in digital technology.

John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.

Any patient who faces a potential cancer diagnosis knows how important an accurate, timely pathology report is.  Similarly, surgeons often require fast pathology results when they are performing a delicate procedure to determine their course of action during an operation. New technological developments are poised to meet the needs of patients and clinicians alike.

AI can improve pathology practice in numerous ways. The right digital tools can automate several repetitive tasks, including the detection of small foci. They can also help improve the staging of many malignancies, make the workflow process more efficient, and help classify images, which in turn gives pathologists a “second set of eyes.” And those “eyes” do not grow tired at the end of a long day or feel stressed out from too much work.

Such capabilities have far-reaching implications. With the right scanning hardware and the proper viewer software, pathologists and technicians can easily view and store whole slide images (WSIs). That view is markedly different from what they see through a microscope, which only allows a narrow field of view. In addition, digitization allows pathologists to mark up WSIs with non-destructive annotations, use the slides as teaching tools, search a laboratory’s archives to make comparisons with images that depict similar cases, give colleagues and patients access to the images, and create predictive models. And if the facility has cloud storage capabilities, it allows clinicians, patients, and pathologists around the world to access the data.

A 2020 prospective trial conducted by University of Michigan and Columbia University investigators illustrates just how profound the impact of AI and ML can be when applied to pathology. Todd Hollon and colleagues point out that intraoperative diagnosis of cancer relies on a “contracting, unevenly distributed pathology workforce.”1 The process can be quite inefficient, requiring a tissue specimen to travel from the OR to a lab, followed by specimen processing, slide preparation by a technician, and a pathologist’s review. At University of Michigan, they are now using stimulated Raman histology, an advanced optical imaging method, along with a convolutional neural network (CNN) to help interpret the images. The machine learning tools were trained to detect 13 histologic categories and include an inference algorithm to help make a diagnosis of brain cancer. Hollon et al conducted a 2-arm, prospective, multicenter, non-inferiority trial to compare the CNN results to those of human pathologists. The trial, which evaluated 278 specimens, demonstrated that the machine learning system was as accurate as pathologists’ interpretation (94.6% vs 93.9%). Equally important was the fact that it took under 15 seconds for surgeons to get their results with the AI system, compared to 20-30 minutes with conventional techniques. And that latter estimate does not represent the national average. In some community settings, slides have to be shipped by special courier to labs that are hours away.

Mayo Clinic is among several forward-thinking health systems that are in the process of implementing a variety of digital pathology services. Mayo Clinic has partnered with Google and is leveraging their technology in two ways. The program will extend Mayo Clinic’s comprehensive Longitudinal Patient Record profile with digitized pathology images to better serve and care for patients. And we are exploring new search capabilities to improve digital pathology analytics and AI. The Mayo/Google project is being conducted with the help of Sectra, a digital slide review and image storage and management system. Once proof of concept, system testing, and configuration activities are complete, the digital pathology solution will be introduced gradually to Mayo Clinic departments throughout Rochester, Florida, and Arizona, as well as the Mayo Clinic Health System.

The new digital capabilities taking hold in several pathology labs around the globe are likely to solve several vexing problems facing the specialty. Currently there is a shortage of pathologists worldwide, and in some countries, that shortage is severe. One estimate found there is one pathologist per 1.5 million people in parts of Africa. And China has one fourth the number of pathologists practicing in the U.S., on a per capita basis. Studies predict that the steady decline of the number of pathologists in the U.S. will continue over the next two decades. A lack of subspecialists is likewise a problem. Similarly, there are reports of poor accuracy and reproducibility, with many practitioners making subjective judgements based on a manual estimate of the percentage of positive cells for a biomarker. Finally, there is reason to believe that implementing digital pathology systems will likely improve a health system’s financial return on investment. One study has suggested that it can “improve the efficiency of pathology workloads by 13%.” 2

As we have said several times in these columns, AI and ML are certainly not a panacea, and they will never replace an experienced clinician or pathologist. But taking advantage of the tools generated by AI/ML will have a profound impact on diagnosis and treatment for the next several decades.



1. Hollon T, Pandian B, Adapa A et al. Near real-time intraoperative brain tumor diagnosis using stimulated Raman histology and deep neural networks. Nat. Med. 2020. 26:52-58.

2. Ho J, Ahlers SM, Stratman C, et al. Can digital pathology result in cost savings? a financial projection for digital pathology implementation at a large integrated health care organization. J Pathol Inform. 2014;5(1):33; doi: 10.4103/2153-3539.139714.

Wednesday, July 28, 2021

Shift Happens

Dataset shift can thwart the best intentions of algorithm developers and tech-savvy clinicians, but there are solutions.

John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.

Generalizability has always been a concern in health care, whether we’re discussing the application of clinical trials or machine-learning based algorithms. A large randomized controlled trial that finds an intensive lifestyle program doesn’t reduce the risk of cardiovascular complications in Type 2 diabetics, for instance, suggests the diet/exercise regimen is not worth recommending to patients. But the question immediately comes to mind: Can that finding be generalized to the entire population of Type 2 patients? As we have pointed out in other publications, subgroup analysis has demonstrated that many patients do, in fact, benefit from such a program.

The same problem exists in health care IT. Several algorithms have been developed to help classify diagnostic images, predict disease complications, and more. A closer look at the datasets upon which these digital tools are based indicates many suffer from dataset shift. In simple English, dataset shift is what happens when the data collected during the development of an algorithm differ from the data the algorithm encounters once it is implemented, usually because something has changed over time. For example, the patient demographics used to create a model may no longer represent the patient population when the algorithm is put into clinical use. This happened when COVID-19 changed the demographic characteristics of patients, making the Epic sepsis prediction tool ineffective.
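One simple way to watch for this kind of drift is to compare the distribution of a single input variable at development time with its distribution after deployment. The sketch below uses the population stability index (PSI), a common monitoring statistic that we are swapping in for illustration; it is not a method described by the studies cited here. The patient ages are invented, and the often-quoted rule of thumb that a PSI above roughly 0.25 signals a major shift is a convention, not a clinical standard.

```python
import math

def population_stability_index(expected, actual, n_bins=5):
    """PSI between a development-time sample and a deployment-time sample
    of one numeric feature. Bins are equal-width over the development range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are identical

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(n_bins - 1, idx))  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical patient ages: the development cohort vs. an older deployment cohort.
development_ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
deployment_ages = [60, 65, 70, 75, 80, 82, 85, 88, 90, 92]

psi_same = population_stability_index(development_ages, development_ages)
psi_shifted = population_stability_index(development_ages, deployment_ages)
```

Here `psi_same` is zero because the two samples are identical, while `psi_shifted` is large because most deployment patients fall into the oldest bin. In practice one would run a check like this per feature on a schedule and investigate any variable that drifts.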

Samuel Finlayson, PhD, with Harvard Medical School, and his colleagues described a long list of dataset shift scenarios that can compromise the accuracy and equity of AI-based algorithms, which in turn can compromise patient outcomes and patient safety. Finlayson et al list 14 scenarios, which fall into 3 broad categories: changes in technology; changes in population and setting; and changes in behavior. Examples of ways in which dataset shift can create misleading outputs that send clinicians down the wrong road include:

  • Changes in the X-ray scanner models used
  • Changes in the way diagnostic codes are collected (e.g. using ICD9 and then switching to ICD10)
  • Changes in patient population resulting from hospital mergers

Other potential problems to be cognizant of include changes in your facility’s EHR system. Sometimes updates to the system may result in changes in how terms are defined, which in turn can impact predictive algorithms that rely on those definitions. If a term like elevated temperature or fever is changed to pyrexia in one of the EHR drop down menus, for example, it may no longer map to the algorithm that uses elevated temperature as one of the variable definitions to predict sepsis, or any number of common infections. Similarly, if the ML-based model has been trained on a patient dataset for a medical specialty practice or hospital cohort, it’s likely that data will generate misleading outputs when applied to a primary care setting.
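The fever/pyrexia example can be made concrete with a small mapping layer that sits between the EHR and the model. This is a minimal sketch under our own assumptions: the synonym table, the canonical label "fever", and the function name are all hypothetical, and a production system would more likely map to a standard terminology than to a hand-built dictionary.

```python
# Hypothetical synonym table: every EHR label the model should treat as "fever".
CANONICAL_TERMS = {
    "elevated temperature": "fever",
    "fever": "fever",
    "pyrexia": "fever",
}

def normalize_terms(raw_terms):
    """Map raw EHR labels to canonical model inputs.

    Returns (mapped, unmapped). Unmapped labels are surfaced rather than
    silently dropped, so a renamed menu item shows up as a review task
    instead of a quiet loss of signal."""
    mapped, unmapped = [], []
    for term in raw_terms:
        key = term.strip().lower()
        if key in CANONICAL_TERMS:
            mapped.append(CANONICAL_TERMS[key])
        else:
            unmapped.append(term)
    return mapped, unmapped

mapped, unmapped = normalize_terms(["Pyrexia", "Febrile"])
```

The design choice worth noting is the second return value: a non-empty `unmapped` list after an EHR update is an early warning that a variable definition has changed underneath the algorithm.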

Finlayson et al mention another example to be aware of. Changes in the way physicians practice can influence data collection: “Adoption of new order sets, or changes in their timing, can heavily affect predictive model output.” Clearly, problems like this necessitate strong interdisciplinary ties, including an ongoing dialogue between the chief medical officer, clinical department heads, and chief information officer and his or her team. Equally important is the need for clinicians in the trenches to look for subtle changes in practice patterns that can impact the predictive analytics tools currently in place. Many dataset mismatches can be solved by updating variable mapping, retraining or redesigning the algorithm, and multidisciplinary root cause analysis.

While addressing dataset shift issues will improve the effectiveness of your AI-based algorithms, they are only one of many stumbling blocks to contend with. One classic example that demonstrates that computers are still incapable of matching human intelligence is the study that concluded that patients with asthma are less likely to die from pneumonia than those who don’t have asthma. The machine learning tool used to come to that unwarranted conclusion had failed to take into account the fact that many asthmatics often get faster, earlier, more intensive treatment when their condition flares up, which results in a lower mortality rate. Had clinicians acted on the misleading correlation between asthma and fewer deaths from pneumonia, they might have decided asthma patients don’t necessarily need to be hospitalized when they develop pneumonia.

This kind of misdirection is relatively common and emphasizes the fact that ML-enhanced tools sometimes have trouble separating useless “noise” from meaningful signal. Another example worth noting: Some algorithms designed to help detect COVID-19 by analyzing X-rays suffer from this shortcoming. Several of these deep learning algorithms rely on confounding variables instead of focusing on medical pathology, giving clinicians the impression that they are accurately identifying the infection or ruling out its presence. Unbeknownst to their users, the algorithms have been shown to rely on text markers or patient positioning instead of pathology findings.

At Mayo Clinic, we have had to address similar problems. A palliative care model that was trained on data from the Rochester, Minnesota, community, for instance, did not work well in our health system because the severity of patient disease in a tertiary care facility is very different from what’s seen in a local community hospital. Similarly, one of our algorithms broke when a vendor did a point release in its software and changed the format of the results. We also had a vendor with a CT stroke detection algorithm run 10 of our known stroke patients through its system, and it was only able to identify one patient. The root cause: Mayo Clinic medical physicists have optimized our radiation exposure to 25% of industry standards to reduce the dose patients receive, but that changed the signal to noise ratio of the images, and the vendor’s system, which wasn’t trained on that ratio, couldn’t interpret them.

Valentina Bellini, with University of Parma, Parma, Italy, and her colleagues sum up the AI shortcut dilemma in a graphic that illustrates 3 broad problem areas: Poor quality data, ethical and legal issues, and lack of educational programs for clinicians who may be skeptical or uninformed about the value and limitations of AI enhanced algorithms in intensive care settings.

As we have pointed out in other blogs, ML-based algorithms rely on math, not magic. But when reliance on that math overshadows clinicians’ diagnostic experience and common sense, they need to partner with their IT colleagues to find ways to reconcile artificial and human intelligence.

Friday, July 23, 2021

Causality in Medicine: Moving Beyond Correlation in Clinical Practice

A growing body of research suggests it’s time to abandon outdated ideas about how to identify effective medical therapies.

Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.

“Correlation is not causation.” It’s a truism that researchers take for granted, and for good reason. The fact that event A is followed by event B doesn’t mean that A caused B. An observational study of 1,000 adults, for example, that found those taking high doses of vitamin C were less likely to develop lung cancer doesn’t prove the nutrient protects against the cancer; it’s always possible that a third factor — a confounding variable — was responsible for both A and B. In other words, patients taking lots of vitamin C may be less likely to get lung cancer because they are more health conscious than the average person, and therefore more likely to avoid smoking, which in turn reduces their risk of the cancer.

As this example illustrates, confounding variables are the possible contributing factors that may mislead us into imagining a cause-and-effect relationship exists when there isn’t one. It’s the reason interventional trials like the randomized controlled trial (RCT) remain a more reliable way to determine causation than observational studies. But it’s important to point out that in clinical medicine, there are many treatment protocols in use that are not supported by RCTs. Similarly, there are many risk factors associated with various diseases but it’s often difficult to know for certain whether these risk factors are actually contributing causes of said diseases. 

While RCTs remain the gold standard in medicine, they can be impractical for a variety of reasons: they are often very expensive to perform; an RCT that exposes patients to a potentially harmful risk factor and compares them to those who aren’t exposed would be unethical; most trials require many exclusion and inclusion criteria that don’t exist in the everyday practice of medicine. For instance, they usually exclude patients with co-existing conditions, which may distort the study results.

One way to address this problem is by accepting less than perfect evidence and using a reliability scale or continuum to determine which treatments are worth using and which are not. That scale might look something like this, with evidential support growing stronger from left to right along the continuum: 

In the absence of RCTs, it’s feasible to consider using observational studies like case/control and cohort studies to justify using a specific therapy. And while such observational studies may still mislead because some confounding variables have been overlooked, there are epidemiological criteria that strengthen the weight given to these less than perfect studies:

  • A stronger association or correlation between two variables is more suggestive of a cause/effect relationship than a weaker association.

  • Temporality. The alleged effect must follow the suspected cause, not the other way around. It would make no sense to suggest that exposure to Mycobacterium tuberculosis causes TB if all the cases of the infection occurred before patients were exposed to the bacterium.

  • A dose-response relationship exists between alleged cause and effect. For example, if researchers find that a blood lead level of 10 mcg/dl is associated with mild learning disabilities in children, 15 mcg/dl is linked to moderate deficits, and 20 mcg/dl with severe deficits, this gradient strengthens the argument for causality.

  • A biologically plausible mechanism of action linking cause and effect strengthens the argument. In the case of lead poisoning, there is evidence pointing to neurological damage brought on by oxidative stress and a variety of other biochemical mechanisms.

  • Repeatability of the study findings: If the results of one group of investigators are duplicated by independent investigators, that lends further support to the cause/effect relationship.

While adherence to all these criteria suggests causality for observational studies, a statistical approach called causal inference can actually establish causality. The technique, which was spearheaded by Judea Pearl, Ph.D., winner of the 2011 Turing Award, is considered revolutionary by many thought leaders and will likely have profound implications for clinical medicine, and for the role of AI and machine learning. During the recent Mayo Clinic Artificial Intelligence Symposium, Adrian Keister, Ph.D., a senior data science analyst at Mayo Clinic, concluded that causal inference is “possibly the most important advance in the scientific method since the birth of modern statistics — maybe even more important than that.”

Conceptually, causal inference starts with the conversion of word-based statements into mathematical statements, with the help of a few new operators. While that may sound daunting to anyone not well-versed in statistics, it’s not much different than the way we communicate by using the language of arithmetic. A statement like fifteen times five equals seventy-five is converted to 15 x 5 = 75. In this case, x is an operator. The new mathematical language of causal inference might look like this if it were to represent an observational study that evaluated the association between a new drug and an increase in patients’ lifespan: P(L|D), where P is probability, L is lifespan, D is the drug, and | is an operator that means “conditioned on.”

An interventional trial such as an RCT, on the other hand, would be written as: D causes L if P(L|do(D)) > P(L), in which case the do-operator refers to the intervention, i.e., giving the drug in a controlled setting. This formula is a way of saying that D (the drug being tested) causes L (longer life) if the probability of a longer life under the intervention is greater than the probability of a longer life without administering the drug, in other words, the probability in the placebo group, namely P(L).
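To see why the conditional probability and the do-operator can give different answers, consider a toy calculation. Everything in it is invented for illustration: we assume a single confounder H (being health conscious) that makes people both more likely to take the drug and more likely to live longer, and we assume the drug itself does nothing. The interventional quantity is computed with the standard backdoor adjustment, which averages over H using its population distribution rather than its distribution among drug takers.

```python
# Hypothetical probabilities for a three-variable world: H -> D, H -> L, and no D -> L effect.
P_H = {1: 0.5, 0: 0.5}                 # P(H = h)
P_D1_GIVEN_H = {1: 0.8, 0: 0.2}        # P(D = 1 | H = h)
P_L1_GIVEN_DH = {                      # P(L = 1 | D = d, H = h); identical across d by design
    (1, 1): 0.7, (0, 1): 0.7,
    (1, 0): 0.3, (0, 0): 0.3,
}

def p_d_given_h(d, h):
    return P_D1_GIVEN_H[h] if d == 1 else 1 - P_D1_GIVEN_H[h]

def observational(d):
    """P(L = 1 | D = d): what an observational study would report."""
    joint = {h: p_d_given_h(d, h) * P_H[h] for h in P_H}   # P(D = d, H = h)
    p_d = sum(joint.values())
    return sum(P_L1_GIVEN_DH[(d, h)] * joint[h] / p_d for h in P_H)

def interventional(d):
    """P(L = 1 | do(D = d)) via backdoor adjustment over H."""
    return sum(P_L1_GIVEN_DH[(d, h)] * P_H[h] for h in P_H)

obs_gap = observational(1) - observational(0)
causal_gap = interventional(1) - interventional(0)
```

With these made-up numbers the observational gap is clearly positive even though the causal gap is zero, which is the vitamin C problem from the top of this article expressed in the new notation. The adjustment only works here because H is measured; an unmeasured confounder requires a different strategy.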

This innovative technique also uses causal graphs to show the relationship of a confounding variable to a proposed cause/effect relationship. Using this kind of graph, one can illustrate how the tool applies in a real-world scenario. Consider the relationship between smoking and lung cancer. For decades, statisticians and policy makers argued about whether smoking causes the cancer because all the evidence supporting the link was observational. The graph would look something like this.

Figure 1:

G is the confounding variable (a genetic predisposition, for example), S is smoking, and LC is lung cancer. The implication here is that if a third factor causes persons to smoke and causes cancer, one cannot necessarily conclude that smoking causes lung cancer. What Pearl and his associates discovered was that if an intermediate factor can be identified in the pathway between smoking and cancer, it’s then possible to establish a cause/effect relationship between the two with the help of a series of mathematical calculations and a few algebraic rewrite tools. As Figure 2 demonstrates, tar deposits in the smokers’ lungs are that intermediate factor.

Figure 2:
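The calculation that the intermediate factor makes possible is known as the front-door adjustment. The sketch below is a toy, not an analysis of real smoking data: every probability is invented, T stands for tar deposits, and we assume that G influences smoking and cancer but not tar directly, and that smoking affects cancer only through tar. The point is that the causal effect can be recovered from the observable variables S, T, and LC alone, with G summed out as if it had never been measured.

```python
from itertools import product

# Hypothetical structural probabilities; G is used only to generate the data.
P_G1 = 0.5
P_S1_GIVEN_G = {0: 0.2, 1: 0.8}     # P(S = 1 | G = g)
P_T1_GIVEN_S = {0: 0.1, 1: 0.9}     # P(T = 1 | S = s)

def p_lc1(t, g):
    """P(LC = 1 | T = t, G = g)."""
    return 0.1 + 0.5 * t + 0.3 * g

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

# Observable joint P(S, T, LC): the hidden G is marginalized away.
observed = {}
for g, s, t, lc in product((0, 1), repeat=4):
    p = (bern(P_G1, g) * bern(P_S1_GIVEN_G[g], s)
         * bern(P_T1_GIVEN_S[s], t) * bern(p_lc1(t, g), lc))
    observed[(s, t, lc)] = observed.get((s, t, lc), 0.0) + p

def prob(**fixed):
    """Probability of a partial assignment, e.g. prob(s=1, lc=1)."""
    total = 0.0
    for (s, t, lc), p in observed.items():
        row = {"s": s, "t": t, "lc": lc}
        if all(row[name] == value for name, value in fixed.items()):
            total += p
    return total

def front_door(s):
    """P(LC = 1 | do(S = s)) = sum_t P(t | s) * sum_s' P(LC = 1 | s', t) * P(s')."""
    total = 0.0
    for t in (0, 1):
        p_t_given_s = prob(s=s, t=t) / prob(s=s)
        inner = sum(prob(s=s2, t=t, lc=1) / prob(s=s2, t=t) * prob(s=s2)
                    for s2 in (0, 1))
        total += p_t_given_s * inner
    return total

def naive(s):
    """P(LC = 1 | S = s): the confounded observational estimate."""
    return prob(s=s, lc=1) / prob(s=s)
```

In this toy world the naive comparison overstates the effect of smoking because G inflates it, while `front_door` matches what the generating model says an intervention would do. The result depends entirely on the stated assumptions about the graph; if G also affected tar deposits directly, the formula would no longer apply.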

For a better understanding of how causal inference works, Judea Pearl’s The Book of Why is worth a closer look. It provides a plain English explanation of causal inference. For a deeper dive, there’s Causal Inference in Statistics: A Primer.

Had causal inference existed in the 1950s and 1960s, the argument by tobacco industry lobbyists would have been refuted, which in turn might have saved many millions of lives. The same approach holds tremendous potential as we begin to apply it to predictive algorithms and other machine-learning based digital tools. 

Monday, July 19, 2021

Taking Down the Fences that Divide Us

Innovation in healthcare requires new ways to think about interdisciplinary solutions.

Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.

During the 10 years we have worked together, John and I have written often about the power of words like transformation, optimism, cynicism, and misdiagnosis. Another word that needs more attention is “interdisciplinary.” It’s been uttered so many times in science, medicine, and technology that it’s lost much of its impact.  We all give lip service to the idea, but many aren’t willing or able to do the hard work required to make it a reality, and one that fosters innovation and better patient care.

Examples of the dangers of focusing too narrowly on one discipline are all around us. The disconnect between technology and medicine becomes obvious when you take a closer look at the invention of blue light emitting diodes (LEDs), for instance, for which Isamu Akasaki, Hiroshi Amano, and Shuji Nakamura won the Nobel Prize in Physics in 2014. While this technology reinvented the way we light our homes, providing a practical source of bright, energy-saving light, the researchers failed to take into account the health effects of their invention. Had they been encouraged to embrace an interdisciplinary mindset, they might have considered the neurological consequences of being exposed to too much blue light. Certain photoreceptive retinal cells detect blue light, which is plentiful in sunlight. As it turns out, the brain interprets LEDs much like it interprets sunlight, in effect telling us it’s time to wake up, making it difficult to get to sleep.

Problems like this only serve to emphasize what materials scientist Ainissa Ramirez, PhD  discusses in a recent essay: “The culture of research … does not incentivize looking beyond one’s own discipline … Academic silos barricade us from thinking broadly and holistically. In materials science, students are often taught that the key criteria for materials selection are limited to cost, availability, and the ease of manufacturing. The ethical dimension of a materials innovation is generally set aside as an elective class in accredited engineering schools. But thinking about the impacts of one’s work should be neither optional nor an afterthought.”

This is the same problem we face in digital health. Too many data scientists and venture capitalists have invested time and resources into developing impressive algorithms capable of screening for disease and improving its treatment. But some have failed to take a closer look at the data sets upon which these digital tools are built, data sets that misrepresent the populations they are trying to serve. The result has been an ethical dilemma that needs our immediate attention.

Consider the evidence: A large commercially available risk prediction data set used to guide healthcare decisions has been analyzed to find out how equitable it is. The data set was designed to determine which patients require more than the usual attention because of their complex needs. Ziad Obermeyer from the School of Public Health at the University of California, Berkeley, and his colleagues looked at over 43,000 White and about 6,000 Black primary care patients in the data set and discovered that when Black patients were assigned to the same level of risk as White patients by the algorithm based on the data set, they were actually sicker than their White counterparts. How did this racial bias creep into the algorithm? Obermeyer et al explain: “Bias occurs because the algorithm uses health costs as a proxy for health needs. Less money is spent on Black patients who have the same level of need, and the algorithm thus falsely concludes that Black patients are healthier than equally sick White patients.”
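The kind of check that exposes this problem is simple to state: hold the algorithm’s risk score fixed and compare a direct measure of health need across groups. The sketch below is our own illustration, not the investigators’ code. The record layout, the group labels, and the use of a chronic condition count as the measure of need are assumptions, and the handful of records are invented.

```python
from collections import defaultdict
from statistics import mean

def mean_need_by_group(records, score_bin):
    """Average health need per group among patients whose risk score
    falls in score_bin = (low, high), low inclusive and high exclusive.

    records: iterable of (group, risk_score, n_chronic_conditions)."""
    low, high = score_bin
    needs = defaultdict(list)
    for group, score, need in records:
        if low <= score < high:
            needs[group].append(need)
    return {group: mean(values) for group, values in needs.items()}

# Invented records: same risk-score band, different underlying burden of illness.
records = [
    ("A", 0.62, 2), ("A", 0.65, 3), ("A", 0.68, 2),
    ("B", 0.61, 4), ("B", 0.66, 5), ("B", 0.69, 4),
    ("A", 0.20, 0), ("B", 0.25, 1),
]

same_band = mean_need_by_group(records, (0.6, 0.7))
```

If the score were an unbiased measure of need, the two groups in `same_band` would have similar averages. A persistent gap at equal scores suggests the score is tracking something other than need, such as cost.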

Similarly, evidence from an Argentinian study that analyzed data from deep neural networks used on publicly available X-ray image datasets intended to help diagnose thoracic diseases revealed inequities. When the investigators compared gender-imbalanced datasets to datasets in which males and females were equally represented, they found that, “with a 25%/75% imbalance ratio, the average performance across all diseases in the minority class is significantly lower than a model trained with a perfectly balanced dataset.” Their analysis concluded that datasets that underrepresent one gender result in biased classifiers, which in turn may lead to misclassification of pathology in the minority group.

These disparities not only re-emphasize the need for technologists, clinicians, and ethicists to work together, they raise the question: How can we fix the problem now? Working from the assumption that any problem this complex needs to be precisely measured before it can be rectified, Mayo Clinic, Duke School of Medicine, and Optum/Change Healthcare are currently analyzing a massive data set with more than 35 billion healthcare events and about 16 billion encounters that are linked to data sets that include social determinants of health. That will enable us to stratify the data by race/ethnicity, income, geolocation, education, and the like. Creating a platform that systematically evaluates commercially available algorithms for fairness and accuracy is another tactic worth considering. Such a platform would create “food label” style data cards that include the essential features of each digital tool, including its input data sources and types, validation protocols, population composition, and performance metrics. There are also several analytical tools specifically designed to detect algorithmic bias, including Google’s TCAV, Audit-AI, and IBM’s AI Fairness 360.
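To make the “food label” idea tangible, here is one way such a data card might be represented in code. This is a sketch of our own devising; the class name, field names, and example values are hypothetical and do not describe any existing platform. The fields simply mirror the features listed above.

```python
from dataclasses import dataclass, asdict

@dataclass
class AlgorithmDataCard:
    """A 'food label' for one digital tool (hypothetical schema)."""
    name: str
    input_data_sources: list
    input_data_types: list
    validation_protocols: list
    population_composition: dict   # e.g. share of each demographic group in the training data
    performance_metrics: dict      # e.g. {"auroc": ..., "sensitivity": ...}

    def missing_sections(self):
        """Names of sections the developer left empty."""
        return [key for key, value in asdict(self).items() if not value]

# Invented example: a card submitted without any validation details.
card = AlgorithmDataCard(
    name="example-risk-model",
    input_data_sources=["EHR encounters"],
    input_data_types=["structured codes", "lab values"],
    validation_protocols=[],
    population_composition={"female": 0.25, "male": 0.75},
    performance_metrics={"auroc": 0.8},
)

gaps = card.missing_sections()
```

A reviewer could use something like `missing_sections()` as a first gate: a card with no validation protocol, or no breakdown of the training population, is flagged before anyone looks at the performance numbers.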

The fences that divide healthcare can be torn down. It just takes determination and enough craziness to believe it can be done — and lots of hard work.

Monday, July 12, 2021

Identifying the Best De-Identification Protocols

Keeping patient data private remains one of the biggest challenges in healthcare. A recently developed algorithm from nference is helping address the problem.

John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.

In the United States, healthcare organizations that manage or store protected health information (PHI) are required by law to keep that data secure and private. Ignoring that law, as spelled out in the HIPAA regulations, has cost several providers and insurers millions of dollars in fines, and serious damage to their reputations. HIPAA offers 2 acceptable ways to keep PHI safe: Certification by a recognized expert and the Safe Harbor approach, which requires organizations to hide 18 identifiers in patient records so that unauthorized users cannot identify patients. At Mayo Clinic, however, we believe we must do more.

In partnership with the data analytics firm nference, we have developed a de-identification approach that takes patient privacy to the next level, using a protocol on EHR clinical notes that includes attention-based deep learning models, rule-based methods, and heuristics. Murugadoss et al explain that “rule-based systems use pattern matching rules, regular expressions, and dictionary and public database look-ups to identify PII [personally identifiable information] elements.” The problem with relying solely on such rules is they miss things, especially in an EHR’s narrative notes, which often use non-standard expressions, including unusual spellings, typographic errors and the like. Such rules also consume a great deal of time to manually create. Similarly, traditional machine learning based systems, which may rely on support vector machines or conditional random fields, have their shortcomings and tend not to remain reliable across data sets.

The ensemble approach used at Mayo includes a next generation algorithm that incorporates natural language processing and machine learning. Upon detection of PHI, the system transforms detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from Mayo Clinic. We compared our approach with other existing tools considered best-in-class. The results indicated a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively.
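For readers less familiar with these two metrics: recall is the share of true identifiers the system caught, and precision is the share of flagged items that really were identifiers. The short sketch below shows the arithmetic on invented character spans. Scoring by exact span match is an assumption made for simplicity; the evaluation reported above may have used a different matching rule.

```python
def precision_recall(predicted_spans, gold_spans):
    """Precision and recall for detected identifiers.

    Each span is a (start, end) character offset pair; a prediction counts
    as correct only if it exactly matches an annotated span."""
    predicted, gold = set(predicted_spans), set(gold_spans)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented example: three annotated identifiers, four system detections.
gold = [(0, 4), (10, 22), (30, 40)]
predicted = [(0, 4), (10, 22), (30, 40), (50, 55)]

precision, recall = precision_recall(predicted, gold)
```

In this example every real identifier was found, so recall is perfect, but one of the four detections was a false alarm, which lowers precision. For de-identification, recall is usually the number to watch most closely, since a missed identifier is a privacy leak while a false alarm only removes a harmless word.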

While this protocol has many advantages over older systems, it’s only one component of a more comprehensive system used at Mayo to keep patient data private and secure. Experience has shown us that de-identified PHI, once released to the public, can sometimes be re-identified if a bad actor decides to compare these records to other publicly available data sets. There may be obscure variants within the data that humans can interpret as PHI but algorithms will not. For example, a computer algorithm expects phone numbers to be in the form area code, prefix, suffix, i.e. (800) 555-1212. What if a phone number is manually recorded into a note as 80055 51212? A human might dial that number to re-identify the record. Further, we expect dates to be in the form mm/dd/yyyy. What if a date of birth is manually typed into a note as 2104Febr (meaning 02/04/2021)? An algorithm might miss that.
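The two examples above are easy to reproduce with a deliberately simple rule-based scrubber. The patterns and surrogate values below are our own illustrative choices, not the rules used in the Mayo Clinic/nference system, and they cover only phone numbers and dates in their expected forms.

```python
import re

# Illustrative rules only: a US-style phone number and an mm/dd/yyyy date.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Hypothetical fixed surrogates; a real system would generate varied, plausible ones.
SURROGATE_PHONE = "(555) 555-0100"
SURROGATE_DATE = "01/01/1900"

def deidentify(note):
    """Replace identifiers that match the expected formats with surrogates."""
    note = PHONE.sub(SURROGATE_PHONE, note)
    note = DATE.sub(SURROGATE_DATE, note)
    return note

standard = deidentify("Call (800) 555-1212 re: DOB 02/04/2021")
unusual = deidentify("ph 80055 51212, born 2104Febr")
```

The first note is scrubbed as intended. The second passes through unchanged, because neither the oddly spaced phone number nor the improvised date matches a rule, yet a person reading the note could still recover both. That gap is the reason rules alone are not enough and why additional layers of protection matter.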

With these risks in mind, Mayo Clinic is using a multi-layered defense referred to as data behind glass. The concept of data behind glass is that the de-identified data is stored in an encrypted container, always under control of Mayo Clinic Cloud. Authorized cloud sub-tenants can be granted access such that their tools can access the de-identified data for algorithm development, but no data can be taken out of the container. This prevents merging the data with other external data sources.

At Mayo Clinic, the patient always comes first, so we have committed to continuously adopt novel technologies that keep information private.

Tuesday, July 6, 2021

Learning from AI’s Failures

A detailed picture of AI’s mistakes is the canvas upon which we create better digital solutions.

John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.

We all tend to ignore clichés because we’ve heard them so often, but some clichés are worth repeating. “We learn more from failure than success” comes to mind. While it may be overused, it nonetheless conveys an important truth for anyone involved in digital health. Two types of failures are worth closer scrutiny: algorithms that claim to improve diagnosis or treatment but fall short for lack of evidence or fairness; and failure to convince clinicians in community practice that evidence-based algorithms are worth using.

As we mentioned in an earlier column, a growing number of thought leaders in medicine have criticized the rush to generate AI-based algorithms because many lack the solid scientific foundation required to justify their use in direct patient care. Among the criticisms being leveled at AI developers are concerns about algorithms derived from a dataset that is not validated with a second, external dataset, overreliance on retrospective analysis, lack of generalizability, and various types of bias. A critical look at the hundreds of healthcare-related digital tools that are now coming to market indicates the need for more scrutiny, and the creation of a set of standards to help clinicians and other decision makers separate useful tools from junk science. 

The digital health marketplace is crowded with attention-getting tools. Among 59 FDA-approved medical devices that incorporated some form of machine learning, 49 unique devices were designed to improve clinical decision support, most of which are intended to assist with diagnosis or triage. Some were designed to automatically detect diabetic retinopathy, analyze specific heart sounds, measure ejection fraction and left ventricular volume, and quantify lung nodules and liver lesions, to name just a few. Unfortunately, the evidential support for many recently approved medical devices varies widely.

Among the AI-based algorithms that have attracted attention is one designed to help clinicians predict the onset of sepsis. The Epic Sepsis Model (ESM) has been used on tens of thousands of inpatients to gauge their risk of developing this life-threatening complication. Part of the Epic EHR system, it is a penalized logistic regression model that the vendor has tested on over 400,000 patients in 3 health systems. Unfortunately, because ESM is a proprietary algorithm, there’s a paucity of information available on the software’s inner workings or its long-term performance. Investigators from the University of Michigan just conducted a detailed analysis of the tool among over 27,600 patients and found it wanting. Andrew Wong and his associates found an area under the receiver operating characteristic curve (AUROC) of only 0.63. Their report states: “The ESM identified 183 of 2552 patients with sepsis (7%) who did not receive timely administration of antibiotics, highlighting the low sensitivity of the ESM in comparison with contemporary clinical practice. The ESM also did not identify 1709 patients with sepsis (67%) despite generating alerts for an ESM score of 6 or higher for 6971 of all 38,455 hospitalized patients (18%), thus creating a large burden of alert fatigue.” They go on to discuss the far-reaching implications of their investigation: “The increase and growth in deployment of proprietary models has led to an underbelly of confidential, non–peer-reviewed model performance documents that may not accurately reflect real-world model performance. Owing to the ease of integration within the EHR and loose federal regulations, hundreds of US hospitals have begun using these algorithms.”
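The counts in that quotation can be turned into the familiar screening metrics with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes that every sepsis case the model did identify was among the hospitalizations that triggered an alert. The variable names are ours, not the investigators'.

```python
# Counts taken from the quoted passage.
hospitalizations = 38455   # all hospitalized patients in the analysis
alerts = 6971              # hospitalizations with an ESM score of 6 or higher
sepsis_cases = 2552        # patients with sepsis
missed = 1709              # sepsis patients the ESM did not identify

# Assumption: identified cases are exactly the sepsis cases that triggered an alert.
detected = sepsis_cases - missed

sensitivity = detected / sepsis_cases   # share of sepsis cases flagged
alert_rate = alerts / hospitalizations  # share of hospitalizations that alert
ppv = detected / alerts                 # share of alerts that were true sepsis

print(f"Detected sepsis cases: {detected}")
print(f"Sensitivity: {sensitivity:.1%}")
print(f"Alert rate: {alert_rate:.1%}")
print(f"Positive predictive value: {ppv:.1%}")
```

Under that assumption the model catches roughly one in three sepsis cases, and only about one alert in eight corresponds to a patient who actually has sepsis. That is the alert fatigue problem expressed in numbers.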

Reports like this only serve to amplify the reservations many clinicians have about trusting AI-based clinical decision support tools. Unfortunately, they tend to make clinicians not just skeptical but cynical about all AI-based tools, which is a missed opportunity to improve patient care. As we pointed out in a recent NEJM Catalyst review, there are several algorithms that are supported by prospective studies, including a growing number of randomized controlled trials.

So how do we get scientifically well-documented digital health tools into clinicians’ hands and convince them to use them? One approach is to develop an evaluation system that impartially reviews all the specs for each product and generates model cards to give end users a quick snapshot of their strengths and weaknesses. But that’s only the first step. By way of analogy, consider the success of online stores hosted by Walmart or Amazon. They’ve invested heavily in state-of-the-art supply chains that ensure their products are available from warehouses as customers demand them. But without a delivery service that gets products into customers’ homes quickly and with a minimum of disruption, even the best products will sit on warehouse shelves. The delivery service has to seamlessly integrate into customers’ lives. The product has to show up on time, it has to be the right size garment, in a sturdy box, and so on. Similarly, the best diagnostic and predictive algorithms have to be delivered with careful forethought and insight, which requires design thinking, process improvement, workflow integration, and implementation science.
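As a rough illustration of what such a model card might capture, the sketch below defines a small record and derives a list of concerns from the criticisms raised earlier in this column (no external validation, reliance on retrospective analysis). The field names, the ModelCard class, and the example product are hypothetical. They do not describe any existing evaluation system or real tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelCard:
    """Hypothetical one-page snapshot of a clinical algorithm."""
    name: str
    intended_use: str
    external_validation: bool    # tested on a second, external dataset?
    prospective_evidence: bool   # supported by prospective studies or trials?
    auroc: Optional[float] = None
    known_limitations: List[str] = field(default_factory=list)

    def concerns(self) -> List[str]:
        """List the weaknesses an end user should see at a glance."""
        issues = list(self.known_limitations)
        if not self.external_validation:
            issues.append("not validated on an external dataset")
        if not self.prospective_evidence:
            issues.append("supported by retrospective analysis only")
        return issues


# Made-up example product, for illustration only.
card = ModelCard(
    name="Example risk score",
    intended_use="Flag inpatients at elevated risk for clinician review",
    external_validation=False,
    prospective_evidence=False,
    known_limitations=["proprietary; inner workings not published"],
)

for issue in card.concerns():
    print("-", issue)
```

A real evaluation system would need far more detail, such as the populations studied and performance across subgroups, but even a short record like this puts a tool's evidence gaps in front of the clinician before they decide whether to trust it.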

Ron Li and his colleagues at Stanford University describe this delivery service in detail, emphasizing the need to engage stakeholders from all related disciplines before even starting algorithm development to look for potential barriers to implementation. They also suggest the need for “empathy mapping” to look for potential power inequities among clinician groups who may be required to use these digital tools.  It is easy to forget that implementing any technological innovation must also take into account the social and cultural issues unique to the healthcare ecosystem, and to the individual facility where it is being implemented.

If we are to learn from AI’s failures, we need to evaluate its products and services more carefully and develop them within an interdisciplinary environment that respects all stakeholders.