Thank you for reading our blog. Paul and I will continue to publish "Dispatch from the Digital Health Frontier" on https://www.mayoclinicplatform.org/blog/.
Thursday, October 21, 2021
An analysis of over 2 billion lab test results suggests a deep learning model can help create personalized reference ranges, which in turn would enable clinicians to monitor health and disease better.
Paul Cerrato, MA, senior research analyst and communications specialist, Mayo Clinic Platform and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.
Almost every patient has blood drawn to measure a variety of metabolic markers. Typically, test results come back as a numeric or text value accompanied by a reference range that represents normal values. If total serum cholesterol level is below 200 mg/dl or serum thyroid hormone level is 4.5 to 12.0 mcg/dl, clinicians and patients assume all is well. But suppose Helen’s safe zone varies significantly from Mary’s safe zone. If that were the case, it would suggest a one-size-fits-all reference range misrepresents an individual’s health status. That position is supported by studies that found that the distributions of more than half of all lab test results, which rely on standard reference ranges, differ when personal characteristics are considered.1
With these concerns in mind, Israeli investigators from the Weizmann Institute and Tel Aviv Sourasky Medical Center extracted data on 2.1 billion lab measurements from EHRs, taken from 2.8 million adults for 92 different lab tests. Their goal was to create “data-driven reference ranges that consider age, sex, ethnicity, disease status, and other relevant characteristics.”1 To accomplish that goal, they used machine learning and computational modeling to segment patients into different “bins” based on health status, medication intake, and chronic disease.2 That in turn left the team with about half a billion lab results from the initial 2.8 million people, which they used to model a set of reference lab values that more precisely reflected the ranges of healthy persons. Those ranges could then be used to predict patients’ “future lab abnormalities and subsequent disease.”
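To make the idea of a data-driven reference range concrete, here is a minimal Python sketch. It is not the deep learning approach Cohen et al. used; it swaps in a much simpler stand-in, the central 95% of results within each sex and age band, which is the conventional way a reference interval is defined. The function names, the 20-year bands, the minimum group size, and the synthetic hemoglobin values are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import quantiles


def age_band(age, width=20):
    """Return a label such as '40-59' for the band containing `age`."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def reference_ranges(records, min_count=40):
    """Compute the central 95% interval of results per (sex, age band).

    `records` is an iterable of (sex, age, value) tuples that are assumed
    to come from people already screened as healthy.
    """
    groups = defaultdict(list)
    for sex, age, value in records:
        groups[(sex, age_band(age))].append(value)

    ranges = {}
    for key, values in groups.items():
        if len(values) < min_count:
            continue  # too few results to estimate the tails
        # n=40 yields 39 cut points at 2.5%, 5%, ..., 97.5%.
        cuts = quantiles(values, n=40, method="inclusive")
        ranges[key] = (cuts[0], cuts[-1])
    return ranges


def synthetic_hemoglobin(seed=0, per_group=500):
    """Generate made-up hemoglobin values (g/dL); not real patient data."""
    rng = random.Random(seed)
    records = []
    for sex, mean in (("F", 13.5), ("M", 15.0)):  # illustrative means only
        for _ in range(per_group):
            records.append((sex, rng.randint(40, 59), rng.gauss(mean, 1.0)))
    return records


if __name__ == "__main__":
    ranges = reference_ranges(synthetic_hemoglobin())
    for key, (low, high) in sorted(ranges.items()):
        print(key, round(low, 1), round(high, 1))
```

Even this crude stratification yields different "normal" intervals for the two synthetic groups, which is the intuition behind moving away from a single one-size-fits-all range.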
Taking their investigation one step further, Cohen et al. used their new algorithms to evaluate the risk of specific disorders among healthy individuals. When they looked at anemia cutoffs like those for hemoglobin and mean corpuscular volume, a measurement of red blood cell size, their newly created risk calculators were able to separate anemic patients into groups at high risk for microcytic and macrocytic anemia from those with a risk no higher than the average nonanemic population. Similar benefits were observed when the researchers applied their models to prediabetes: “…using a personalized risk model, we can improve the classification of patients who are prediabetic and identify patients at risk 2 years earlier compared to classification based merely on current glucose levels.”
William Morice, M.D., Ph.D., chair of the Department of Laboratory Medicine and Pathology (DLMP) at Mayo Clinic and president of Mayo Clinic Laboratories, immediately saw the value of this type of data analysis: “In the ‘era of big data and analytics,’ it is almost unconscionable that we still use ‘normal reference ranges’ that lack contextual data, and possibly statistical power, to guide clinicians in the clinical interpretation of quantitative lab results. I was taught this by Dr. Piero Rinaldo, a medical geneticist in our department and a pioneer in this field, who focuses on its application to screening for inborn errors of metabolism. He has developed an elegant tool that is now used globally for this application, Collaborative Laboratory Integrated Reports (CLIR).”
During a recent conversation with Piero Rinaldo, M.D., Ph.D., he explained that Mayo Clinic has been using a more personalized approach to lab testing since 2015 and stated that “CLIR is a shovel-ready software for the creation of collaborative precision reference ranges.” The web-based application has been used to create several personalized data sets that can improve clinicians’ interpretation of lab test results. It has been deployed by Dr. Rinaldo and his associates to improve the screening of newborns for congenital hypothyroidism.3 The software performs multivariate pattern recognition on lab values collected from 7 programs, including more than 1.9 million lab test results. CLIR is able to integrate covariate-adjusted results of different tests into a set of customized interpretive tools that physicians can use to better distinguish between false positive and true positive test results.
1. Tang A, Oskotsky T, Sirota M. Personalizing routine lab tests with machine learning. Nature Medicine. 2021;27:1510-1517.
2. Cohen N, Schwartzman O, Jaschek R, et al. Personalized lab test models to quantify disease potentials in healthy individuals. Nature Medicine. 2021;27:1582-1591.
3. Rowe AD, Stoway SD, Ahlman H, et al. A Novel Approach to Improve Newborn Screening for Congenital Hypothyroidism by Integrating Covariate-Adjusted Results of Different Tests into CLIR Customized Interpretive Tools. Inter J Neonatal Screening. 2021;7:23. https://doi.org/10.3390/ijns7020023
Wednesday, October 13, 2021
AI and machine learning have the potential to redefine the management of several GI disorders.
Colonoscopy is one of the true success stories in modern medicine. Studies have demonstrated that colonoscopy screening detects the cancer at a much earlier stage, reducing the risk of invasive tumors and metastatic disease, and reducing mortality. However, while colorectal cancer is highly preventable, it is the third leading cause of cancer-related deaths in the U.S. About 148,000 individuals develop the malignancy and over 53,000 die from it each year. We asked ourselves a question: can AI improve the detection of this and related gastrointestinal disorders?
As we explained in The Digital Reconstruction of Healthcare, one of the challenges in making an accurate diagnosis of GI disease is differentiating between disorders that look similar at the cellular level. For example, because environmental enteropathy and celiac disease overlap histopathologically, deep learning algorithms have been designed to analyze biopsy slides to detect the subtle differences between the two conditions. Syed et al.1 used a combination of convolutional and deconvolutional neural networks in a prospective analysis of over 3,000 biopsy images from 102 children. They were able to tell the differences between environmental enteropathy, celiac disease, and normal controls with an accuracy rating of 93.4%, and a false negative rate of 2.4%. Most of these mistakes occurred when comparing celiac patients to healthy controls.
The investigators also identified several biomarkers that may help separate the two GI disorders: interleukin 9, interleukin 6, interleukin 1b, and interferon-induced protein 10 were all helpful in making an accurate prediction regarding the correct diagnosis. The potential benefits to this deep learning approach become obvious when one considers the arduous process that patients have to endure to reach a definitive diagnosis of either disorder: typically, they must undergo 4 to 6 biopsies and may need several endoscopic procedures to sample various sections of the intestinal tract because the disorder may affect only specific areas along the lining and leave other areas intact.
Several randomized controlled trials have been conducted to support the use of ML in gastroenterology. Chinese investigators, working in conjunction with Beth Israel Deaconess Medical Center and Harvard Medical School, tested a convolutional neural network to determine if it was capable of improving the detection of precancerous colorectal polyps in real time.2 The need for a better system of detecting these growths is evident, given the fact that more than 1 in 4 adenomas are missed during colonoscopies. To address the problem, Wang et al. randomized more than 500 patients to routine colonoscopy and more than 500 to computer-assisted colonoscopies. In the final analysis, the adenoma detection rate (ADR) was higher in the ML-assisted group (29.1% vs. 20.3%, P < 0.001). The higher ADR occurred because the algorithm was capable of detecting a greater number of smaller adenomas (185 vs. 102). There were no significant differences in the detection of large polyps.
Nayantara Coelho-Prabhu, M.D., a gastroenterologist at Mayo Clinic, points out, however, that the clinical relevance of detection of diminutive polyps remains to be determined. “Yet, there is definite clinical importance in the subsequent development of computer assisted diagnosis (CADx) or polyp characterization algorithms. These will help clinicians determine clinically relevant polyps, and possibly advance the resect and discard practice. It also will help clinicians adequately assess margins of polyps, so that complete removal can be achieved, thus decreasing future recurrences.”
Randomized clinical trials demonstrated that a convolutional neural network in combination with deep reinforcement learning (collectively called the WISENSE system) can reduce the number of blind spots during endoscopy intended to evaluate the esophagus, stomach, and duodenum in real time. “A total of 324 patients were recruited and randomized; 153 and 150 patients were analysed in the WISENSE and control group, respectively. Blind spot rate was lower in WISENSE group compared with the control (5.86% vs 22.46%, p<0.001) . . .”3
Mayo Clinic’s Endoscopy Center, utilizing Mayo Clinic Platform’s resources, has also been exploring the value of machine learning in GI care with the assistance of Endonet, a comprehensive library of endoscopic videos and images, linked to clinical data including symptoms, diagnoses, pathology, and radiology. These data will include unedited full-length videos as well as video summaries of the procedure including landmarks, specific abnormalities, and anatomical identifiers. Dr. Coelho-Prabhu explains that the idea is to have different user interfaces:
“From the patient’s perspective, it will serve as an electronic video record of all their procedures, and future procedures can be tailored to survey prior abnormal areas as needed.
From a research perspective, this will be a diverse and rich library including large volumes of specialized populations such as Barrett’s esophagus, inflammatory bowel disease, familial polyposis syndromes. The additional strength is that Mayo Clinic provides highly specialized care, especially to these select populations. We can develop AI algorithms to advance medical care using this library. From a hospital system perspective, this would serve as a reference library, guiding endoscopists, including for advanced therapeutic procedures in the future. It also could be used to measure and monitor quality indicators in endoscopy. From an educational standpoint, this library can be developed into a teaching set for both trainee and advanced practitioners looking for CME opportunities. From industry perspective, this database could be used to train/validate commercial AI algorithms.”
AI and machine learning may not be the panacea some technology enthusiasts imagine them to be, but there’s little doubt they are becoming an important partner on the road to more personalized patient care.
1. Syed S, Al-Bone M, Khan MN, et al. Assessment of machine learning detection of environmental enteropathy and celiac disease in children. JAMA Network Open. 2019;2:e195822.
2. Wang P, Berzin TM, Brown JR, et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut. 2019;68:1813–1819.
3. Wu L, Zhang J, Zhou W, et al. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut. 2019;68:2161–2169.
Wednesday, October 6, 2021
We must make a serious commitment to increase financial resources and provide better analytics for real world evidence/real time data in support of public health.
Public health has been underfunded for decades. That neglect has had a profound impact since the COVID-19 pandemic took hold, and the crisis has awakened policy makers and thought leaders to the need for more investment.
Consider the statistics: The U.S. spends about $3.6 trillion each year on health but less than 3% of that amount on public health and prevention. A 2020 Forbes report likewise pointed out that “From the late 1960s to the 2010s, the federal share of total health expenditure for public health dropped from 45 percent to 15 percent.” This relative indifference to public health is partly responsible for the nation’s mixed response to the SARS-CoV-2 pandemic. A recent McKinsey & Company analysis concluded: “Government leaders remain focused on navigating the current crisis, but making smart investments now can both enhance the ongoing COVID-19 response and strengthen public-health systems to reduce the chance of future pandemics. Investments in public health and other public goods are sorely undervalued; investments in preventive measures, whose success is invisible, even more so.”
Among the other “public goods” that require more investment is population health management and analytics. Although experts continue to debate the differences between public health and population health, most are unimportant. For our purposes, population health refers to the status of a specific group of individuals, whether they reside in a specific city, state, or country. Public health usually casts a wider net, concerned about the status of the entire population. Managing the health of these subgroups requires an analytical approach that can take into account a long list of variables, including social determinants of health (SDoH), the content of their medical records, and much more. SDoH data from Change Healthcare, for instance, has demonstrated that the economic stability index (ESI) is a strong predictor of health care utilization. ESI is a cluster model that uses market behavior and financial attitudes to group individuals into one of 30 categories, with category 1 representing persons most likely to be economically stable and category 30 least likely to be stable. The figure, which links race, ESI and health care utilization in Kentucky, suggests that Blacks/African Americans are far less likely to be economically stable (category 1). The same analysis found that Blacks/African Americans were almost twice as likely to use the ED compared to Whites (30.5% vs 18.1%). A growing number of health care organizations are starting to see the value of such population health metrics and are incorporating these statistics into their decision making.
Among the valuable sources of data that can inform population health are patient surveys, clinical registries, and EHRs. Several traditional analytics tools are available to extract actionable insights from these data sources, including logistic regression. Over the decades, several major studies have also generated risk scoring systems to improve public health. The Framingham heart health risk score has been used for many years to assess the likelihood of developing cardiovascular disease over a 10-year period. Because the scoring system can help predict the onset of heart disease, it can also serve as a useful tool in creating population-based preventive programs to reduce that risk. The tool requires patients to provide their age, gender, smoking status, total cholesterol, HDL cholesterol, systolic blood pressure, and whether they are taking antihypertensive medication. The American Diabetes Association has developed its own risk scoring method to assess the likelihood of type 2 diabetes in the population. The tool takes into account age, gender, history of gestational diabetes, physical activity level, family history of diabetes, hypertension, height and weight. Another analytics methodology that has value in population health is the LACE Index. The acronym stands for length of stay, acuity of admission, Charlson comorbidity index (CCI), and number of emergency department visits in the preceding 6 months. More recently, several AI-based analytic tools have come into use to improve population health. A review of ML-related analytic methods found that neural network-based algorithms are the most commonly used (41%) in this context, compared to 25.5% for support vector machines, and 21% for random forest modeling.
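The LACE Index is simple enough to sketch in a few lines of Python. The four inputs come straight from the acronym described above; the point values are not given in this post and are taken from the commonly cited published version of the index as we understand it, so treat them as an assumption and verify them against the original publication before relying on them. The function and parameter names are ours.

```python
def lace_score(length_of_stay_days, acute_admission, charlson_index, ed_visits_6mo):
    """Return a LACE score (0 to 19) from its four components.

    Point values are assumed from the commonly cited published index;
    they are not stated in the text this sketch accompanies.
    """
    # L: length of stay in whole days
    if length_of_stay_days < 1:
        l_points = 0
    elif length_of_stay_days <= 3:
        l_points = length_of_stay_days
    elif length_of_stay_days <= 6:
        l_points = 4
    elif length_of_stay_days <= 13:
        l_points = 5
    else:
        l_points = 7

    # A: acuity of admission (emergent admission scores 3)
    a_points = 3 if acute_admission else 0

    # C: Charlson comorbidity index, with 4 or more mapped to 5 points
    c_points = charlson_index if charlson_index <= 3 else 5

    # E: emergency department visits in the preceding 6 months, capped at 4
    e_points = min(ed_visits_6mo, 4)

    return l_points + a_points + c_points + e_points


if __name__ == "__main__":
    # A 5-day emergent stay, Charlson index 2, one prior ED visit.
    print(lace_score(5, True, 2, 1))
```

A population health team could run a function like this across discharge records to rank patients for follow-up, although any threshold for "high risk" would need to come from the literature or local validation, not from this sketch.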
There is no way of knowing how the world would have coped with COVID-19 had policy makers fully invested in public and population health programs and analytics. But there’s little doubt that we’ll all fare much better during the next health crisis if we put more time, energy, and resources into these initiatives.
Thursday, September 30, 2021
In addition to evaluating the safety of software as a medical device (SaMD), the agency needs to devote more resources to evaluating its efficacy and quality.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
The FDA’s approach to software as a medical device (SaMD) has been evolving. Consider a few examples.
In 2018, IDx-DR, a software system used to improve screening for retinopathy, a common complication of diabetes that affects the eye, became the first AI-based medical device to receive US Food and Drug Administration clearance to “detect greater than a mild level of … diabetic retinopathy in adults who have diabetes.” To arrive at that decision, the agency not only reviewed data to establish its safety, it also took into account prospective studies, an essential form of evidence that clinicians look for when trying to decide if a device or product is worth using. The software was the first medical device approved by the FDA that does not require the services of a specialist to interpret the results, making it a useful tool for health care providers who may not normally be involved in eye care. The FDA clearance emphasized the fact that IDx-DR is a screening tool not a diagnostic tool, stating that patients with positive results should be referred to an eye care professional. The algorithm built into the IDx-DR system is intended to be used with the Topcon NW400 retinal camera and a cloud server that contains the software.
Similarly, FDA looked at a randomized prospective trial before approval of a machine learning-based algorithm that can help endoscopists improve their ability to detect smaller, easily missed colonic polyps. Its recent clearance of GI Genius by Medtronic was based on a clinical trial published in Gastroenterology, in which investigators in Italy evaluated data from 685 patients, comparing a group that underwent the procedure with the help of the computer-aided detection (CADe) system to a group who acted as controls. Repici et al. found that the adenoma detection rate was significantly higher in the CADe group, as was the detection rate for polyps 5 mm or smaller, which led to the conclusion: “Including CADe in colonoscopy examinations increases detection of adenomas without affecting safety.”
Their findings raise several questions: is it reasonable to assume that a study of 600+ Italians would apply to a U.S. population, which has different demographic characteristics? More importantly, were the 685 patients representative of the general public, including adequate numbers of persons of color and those in lower socioeconomic groups? While the Gastroenterology study did report enough female patients, there is no mention of these other marginalized groups.
An independent 2021 analysis of FDA approvals has likewise raised several concerns about the effectiveness and equity of several recently approved AI algorithms. Eric Wu from Stanford University and his colleagues examined the FDA’s clearance of 130 devices and found the vast majority were approved based on retrospective studies (126 of 130). And when they separated all 130 devices into low- and high-risk subgroups using FDA guidelines, they found none of the 54 high-risk devices had been evaluated by prospective trials. Other shortcomings documented in Wu’s analysis included the following:
- Of the 130 approved products, 93 did not report multi-site evaluation.
- Fifty-nine of the approved AI devices included no mention of the sample size of the test population.
- Only 17 of the approved devices discussed a demographic subgroup.
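Expressed as shares of the 130 devices, the counts above are easier to compare. The short Python snippet below does only that arithmetic; the dictionary labels are our paraphrases of the findings, not wording from Wu's paper.

```python
TOTAL_DEVICES = 130

# Counts as reported in the text; labels are paraphrased.
findings = {
    "approved based on retrospective studies": 126,
    "no multi-site evaluation reported": 93,
    "no test sample size mentioned": 59,
    "demographic subgroup discussed": 17,
}


def share(count, total=TOTAL_DEVICES):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {label: share(count) for label, count in findings.items()}

if __name__ == "__main__":
    for label, pct in shares.items():
        print(f"{label}: {pct}%")
```

In other words, roughly 97% of the devices relied on retrospective evidence, about 72% reported no multi-site evaluation, about 45% gave no sample size, and only about 13% discussed a demographic subgroup.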
We would certainly like to see the FDA take a more thorough approach to AI-based algorithm clearance, but in lieu of that, several leading academic medical centers, including Mayo Clinic, are contemplating a more holistic and comprehensive approach to algorithmic evaluation. It would include establishing a standard labeling schema to document the characteristics, behavior, efficacy, and equity of AI systems, to reveal the properties of systems necessary for stakeholders to assess them and build the trust necessary for safe adoption. The schema will also support assessment of the portability of systems to disparate datasets. The labeling schema will serve as an organizational framework that specifies the elements of the label. Label content will be specified in sections that will likely include:
- model details such as name, developer, date of release, and version,
- the intended use of the system,
- performance measures,
- accuracy metrics, and
- training data and evaluation data characteristics
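One way to picture such a label is as a small structured record. The Python sketch below is purely illustrative: the schema itself has not been published in this post, so the class name, field names, field types, and the example values are all assumptions that simply mirror the sections listed above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelLabel:
    """Hypothetical container for the label sections listed above."""

    # Model details
    name: str
    developer: str
    release_date: str  # ISO 8601 date string, e.g. "2021-09-30"
    version: str
    # Intended use of the system
    intended_use: str
    # Remaining sections, left empty until the developer documents them
    performance_measures: dict = field(default_factory=dict)
    accuracy_metrics: dict = field(default_factory=dict)
    training_data: dict = field(default_factory=dict)
    evaluation_data: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the names of sections that are still empty."""
        return [key for key, value in asdict(self).items() if not value]

    def to_json(self):
        """Serialize the label so it can travel with the algorithm."""
        return json.dumps(asdict(self), indent=2)


# Example values are invented for illustration only.
label = ModelLabel(
    name="example-model",
    developer="Example Developer",
    release_date="2021-09-30",
    version="0.1",
    intended_use="Hypothetical screening aid; not a real product.",
    training_data={"sites": 1},
)

if __name__ == "__main__":
    print(label.to_json())
    print("Undocumented sections:", label.missing_sections())
```

A check like `missing_sections()` hints at how a standard label could make gaps visible, for example a system that documents its training data but says nothing about evaluation data or accuracy.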
While it makes no sense to sacrifice the good in pursuit of the perfect, the current regulatory framework for evaluating SaMD is far from perfect. Combining a more robust FDA approval process with the expertise of the world’s leading medical centers will offer our patients the best of both worlds.
Thursday, September 9, 2021
By providing a safe, secure environment, novel approaches enable health care innovators to share data without opening the door to snoopers and thieves.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
We know that bringing together AI algorithms and data in ways that preserve privacy and intellectual property is one of the keys to delivering the next generation of clinical decision support. But meeting that challenge requires health care innovators to look to other innovators who themselves have created unique cybersecurity solutions. Among these “Think outside the box” solutions are products and services from vendors like TripleBlind, Verily, Beekeeper.AI/Microsoft, Terra, and Nvidia.
The concept of secure computing enclaves has been around for many years. Apple created its secure enclave, a subsystem built into its systems on a chip (SoC), which in turn is “an integrated circuit that incorporates multiple components into a single chip,” including an application processor, secure enclave, and other coprocessors. Apple explains that “The Secure Enclave is isolated from the main processor to provide an extra layer of security and is designed to keep sensitive user data secure even when the Application Processor kernel becomes compromised. It follows the same design principles as the SoC does—a boot ROM to establish a hardware root of trust, an AES [advanced encryption standard] engine for efficient and secure cryptographic operations, and protected memory. Although the Secure Enclave doesn’t include storage, it has a mechanism to store information securely on attached storage separate from the NAND flash storage that’s used by the Application Processor and operating system.” The secure enclave is embedded into the latest versions of Apple’s iPhone, iPad, Mac computers, Apple TV, Apple Watch, and HomePod.
While this security measure provides users with an extra layer of protection, because it’s a hardware-based solution, its uses are limited. With that in mind, several vendors have created software-based enclaves that are more readily adopted by customers. At Mayo Clinic Platform, we are deploying TripleBlind’s services to facilitate sharing data with our many external partners. It allows Mayo Clinic to test its algorithms using another organization’s data without either party losing control of its assets. Similarly, we can test an algorithm from one of our academic or commercial partners with Mayo Clinic data, or test an outside organization’s data with another outside organization’s data.
How is this “magic” performed? Of course, it’s always about the math. TripleBlind allows the use of distributed data that is accessed but never moved or revealed; it always remains one-way encrypted with no decryption possible. TripleBlind’s novel cryptographic approaches can operate on any type of data (structured or unstructured images, text, voice, video), and perform any operation, including training of and inferring from AI and ML algorithms. An organization’s data remains fully encrypted throughout the transaction, which means that a third party never sees the raw data because it is stored behind the data owner organization’s firewall. In fact, there is no decryption key available, ever. When two health care organizations partner to share data, for instance, TripleBlind software de-identifies their data via one-way encryption; then, both partners access each other’s one-way encrypted data through an Application Programming Interface (API). That means each partner can use the other’s data for training an algorithm, for example, which in turn allows them to generate a more generalizable, less biased algorithm. During a recent conversation with Riddhiman Das, CEO for TripleBlind, he explained: “To build robust algorithms, you want to be able to access diverse training data so that your model is accurate and can generalize to many types of data. Historically, health care organizations have had to send their data to one another to accomplish this goal, which creates unacceptable risks. TripleBlind performs one-way encryption from both interacting organizations, and because there is no decryption possible, you cannot reconstruct the data. In addition, the data can only be used by an algorithm for the specific purpose spelled out in the business agreement.”
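For readers curious how it can be possible to compute on data that no one else ever sees, the following Python sketch shows a textbook technique called additive secret sharing. To be clear, this is not TripleBlind's method, which is not described in enough detail here to reproduce; it is a generic, simplified illustration of the broader idea, and the modulus, function names, and patient counts are all assumptions.

```python
import secrets

# Arithmetic is done modulo a large prime so individual shares look random.
MODULUS = 2**61 - 1


def split(value, n_parties):
    """Split `value` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; any incomplete subset reveals nothing useful."""
    return sum(shares) % MODULUS


def secure_sum(private_values):
    """Sum one private value per party without any party revealing its value.

    Each party splits its value and hands share i to party i. Every party
    then adds up the shares it received, and only those partial sums are
    combined.
    """
    n = len(private_values)
    all_shares = [split(value, n) for value in private_values]
    partial_sums = [
        sum(shares[i] for shares in all_shares) % MODULUS for i in range(n)
    ]
    return reconstruct(partial_sums)


if __name__ == "__main__":
    # Two hypothetical hospitals learn their combined patient count
    # without disclosing their individual counts to each other.
    print(secure_sum([120, 305]))
```

Production systems combine ideas like this with many other safeguards, but the sketch shows why "accessed but never moved or revealed" is mathematically plausible and not magic.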
Developing innovative technological services is exciting work, with the potential to reshape the health care ecosystem worldwide. But along with the excitement is the challenge of keeping data safe and secure. Taking advantage of the many secure computing enclaves available on the market allows us to do just that.
Tuesday, August 31, 2021
The three risk assessment tools now in use fall far short. Using the latest deep learning techniques, investigators are developing more personalized ways to locate women at high risk.
The promise of personalized medicine will eventually allow clinicians to offer individual patients more precise advice on prevention, early detection and treatment. Of course, the operative word is eventually. A closer examination of the screening tools available to detect breast cancer demonstrates that we still have a way to go before we can fulfill that promise. But with the help of better technology, we are getting closer to that realization.
Disease screening is about risk assessment. Researchers collect data on thousands of patients who develop breast cancer, for instance, and discover that the age range, family history and menstruation history of those who develop the disease differ significantly from those who remain free of it. That in turn allows policy makers to create a screening protocol that suggests women of a certain age who have experienced early menarche or late menopause are more likely to develop the malignancy. That risk assessment is consistent with the fact that more reproductive years means more exposure to the hormones that contribute to breast cancer. Similarly, there’s evidence to show that women with first degree relatives with the cancer and those with a history of ovarian cancer or HRT use are at greater risk.
Statistics like these are the basis for several breast cancer risk scoring systems, including the Gail score, the IBIS score, and the BCSC tool. The National Cancer Institute, which uses the Gail model, explains: “The Breast Cancer Risk Assessment Tool allows health professionals to estimate a woman's risk of developing invasive breast cancer over the next 5 years and up to age 90 (lifetime risk). The tool uses a woman’s personal medical and reproductive history and the history of breast cancer among her first-degree relatives (mother, sisters, daughters) to estimate absolute breast cancer risk—her chance or probability of developing invasive breast cancer in a defined age interval.” While the screening tool saves lives, it can also be misleading. If, for example, it finds that a woman has a 1% likelihood of developing breast cancer, what that really means is a large population of women with those specific risk factors has a one in 100 risk of developing the disease. There is no way of knowing what the threat is for any one patient in that group. Similar problems exist for the International Breast Cancer Intervention Study (IBIS) score, based on the Tyrer-Cuzick Model, and the Breast Cancer Surveillance Consortium (BCSC) Risk Calculator. These 3 assessment tools can give patients a false sense of security if they don’t dive into the details. BCSC, for instance, cannot be applied to women younger than 35 or older than 74, nor does it accurately measure risk for anyone who has previously had ductal carcinoma in situ (DCIS), or had breast augmentation. Similarly, the NCI tool doesn’t accurately estimate risk in women with a BRCA1 or BRCA2 mutation, as well as certain other subgroups.
During a conversation with Tufia Haddad, M.D., a Mayo Clinic medical oncologist with specialty interest in precision medicine in breast cancer and artificial intelligence, she discussed the research she and her colleagues are doing to improve the risk assessment process and identify more high-risk women. Dr. Haddad pointed out that there are numerous obstacles that prevent women from obtaining the best possible risk assessment. Too many women do not have a primary care practitioner who might use a risk tool. And those who do have a PCP are more likely to have an evaluation based on the Breast Cancer Risk Assessment tool (the Gail model). “We prefer the Tyrer-Cuzick model in part because it incorporates more personal information for each individual patient including a detailed family history, a woman’s breast density from her mammogram, as well as her history of atypia or other high risk benign breast disease,” says Dr. Haddad. Unfortunately, the Tyrer-Cuzick method requires many more data elements to assess breast cancer risk, which discourages busy clinicians from using it.
Another obstacle to using any of these risk assessment tools is the fact that they don’t readily fit into the average physician’s clinical workflow. Ideally these tools should seamlessly integrate into the EHR system. Even better would be the incorporation of AI-enhanced algorithms that automate the abstraction of the required data elements from the patient’s record into the assessment tool. For example, the algorithm would flag a family history of breast cancer, increased breast density as determined during a mammogram, as well as hormone replacement therapy and insert those risk factors into the Tyrer-Cuzick tool.
Even with this AI-enhanced approach, all of the available risk models fall short because they take a population-based approach, as we mentioned above. Dr. Haddad and her colleagues are looking to make the assessment process more individualized, as are others working in this specialty. That model could incorporate each patient’s previous mammography results, their genetics and benign breast biopsy findings, and much more. Adam Yala and his colleagues at MIT recently developed a mammography-based deep learning model designed to take this more sophisticated approach. Called Mirai, it was trained on a large data set from Massachusetts General Hospital and from facilities in Sweden and Taiwan. The new model generated significantly better results for breast cancer risk prediction than the Tyrer-Cuzick model.
Breast cancer risk assessment continues to evolve. And with better utilization of existing assessment tools and the assistance of deep learning, we can look forward to better patient outcomes.
Monday, August 23, 2021
The evidence is mixed but suggests that these overlooked variables have a profound impact on each patient’s journey.
This article was written by Tim Suther, Nicole Hobbs, Jeff McGinn, Matt Turner with Change Healthcare, John Halamka, MD, MS, president of Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform.
By one estimate, social determinants of health (SDoH) influence up to 80% of health outcomes. Although reports like this suggest that these social factors have a major impact, thought leaders continue to debate whether they can also enhance the accuracy of predictive models. Resolving that debate is far from simple because the answer depends on the type, source and quality of the data, and the design of the model under consideration.
In general, we derive SDoH from subjective and objective sources. Subjective data includes self-reported or clinician-collected data such as patient reported outcomes, Z codes from ICD-10-CM that report factors that influence health status and interactions with health service providers, and other unstructured EHR data. Objective data includes individual-level and community-level data from government, public and private (and consumer behavior) sources; it’s usually more structured and often derived from national-level datasets.
Unfortunately, the research on the value of SDoH in predictive models varies widely. Some studies report no appreciable differences when SDoH are injected into models, while others report significant enhancements to predictive power. Unsurprisingly, these varying study results depend in part on levels of reliance on traditional clinical models and, most importantly, on the types and sources of SDoH data employed in the studies.
For example, a group from Johns Hopkins Bloomberg School of Public Health demonstrated SDoH predictive models can fail in part due to predictive model design as well as to EHR-level data that is unstructured and collected inconsistently. They also demonstrated that dependence on data from EHR-derived population health databases for SDoH can be problematic because the data tends to be used as a proxy for individual-level social factors. The problem lies in the fact that these proxies are often based on assumptions, not evidence. Other research supports the above and showcases the challenges of using SDoH data from sources that traditionally struggle with the comprehensive collection and standardization of these data types.
On a more positive note, several studies and healthcare articles have reported success by relying on objectively collected and/or highly structured and consistent data. For example, one study that used EHR-derived SDoH data sources found that the addition of structured data on median income, unemployment rate, and education from trustworthy non-EHR sources enhanced their model’s health prediction granularity for some of the most vulnerable subgroups of patients. In another study, a collaboration between Stanford, Harvard, and Imperial College London found that adding structured SDoH data from the US Census, along with using machine learning techniques, improved risk prediction model accuracy for hospitalization, death, and costs. They also showed that their models based on SDoH alone, as well as those based on clinical comorbidities alone, could predict health outcomes and costs. Similarly, researchers at The Ohio State University College of Medicine added community-level and consumer behavior data not available in standard EHR data and found it enhanced the study of and impact on obesity prevention. Juhn et al. at Mayo Clinic tapped telephone survey data and appended housing and neighborhood characteristic data from local government sources to create a socioeconomic status index (HOUSES). They first showed that HOUSES correlated well with outcome measures and later showed that HOUSES could even serve as a predictive tool for graft failure in patients.
Patient Level SDoH + Clinical Data = Predictive Power
Incorporating social factors into the healthcare equation can fill gaps at the point of care, and it also generates better healthcare predictions, but only when these determinants are patient level and linked to robust clinical data. Change Healthcare, for example, has curated such an integrated national-level dataset, linking billions of historical de-identified distinct medical claims with patient-level social, physical and behavioral determinants of health. One of this dataset’s most important uses is to understand the relative weight of specific patient SDoH factors, in comparison to clinical factors alone, for various therapeutic conditions, including COVID-19. For example, across Change Healthcare’s research, economic stability is repeatedly ranked as the highest or among the highest predictors of the healthcare experience. Despite this realization, most end users, including providers and payers, lack such visibility (or rely on geographic averages that are unhelpful in making accurate predictive models).
Incorporating SDoH data into predictive models holds much promise. Given the relative newness of SDoH data in predictive analytics, along with a lack of data standardization and scale, it’s not surprising to find varying degrees of success in using it to improve predictive health models. But as researchers learn more about the best types and sources of SDoH data to use, along with developing better-suited models for these types of data, we’re likely to see significant advances in healthcare predictive models. By combining the right data with the right models, SDoH are a powerful asset in predictive models of health, outcomes, and potential health disparities.
If you're still with us . . .
Please consider supporting Dr. Steve Parodi, Reed Abelson and me by "voting up" our panel at the upcoming South by Southwest conference in March of 2022. Our proposed panel, "Extending the Stethoscope Into the Home," will dive into a discussion about acute health care for patients in their home and the infrastructures needed to support it. If you are inclined to vote, please do so here.
Tuesday, August 17, 2021
To convince physicians and nurses that deep learning algorithms are worth using in everyday practice, developers need to explain how they work in plain clinical English.
Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.
AI’s so-called black box refers to the fact that much of the underlying technology behind machine learning-enhanced algorithms is difficult to explain. Oftentimes that’s the case because the advanced math or the data science behind the algorithms is too complex for the average user to understand without additional training. Several stakeholders in digital health maintain, however, that this lack of understanding isn’t that important. They argue that as long as an algorithm generates actionable insights, most clinicians don’t really care about what’s “under the hood.” Is that reasoning sound?
Some thought leaders point to the fact that there are many advanced, computer-enhanced diagnostic and therapeutic tools currently in use that physicians don’t fully understand, but nonetheless accept. The CHA2DS2-VASc score, for instance, is used to estimate the likelihood of a patient with non-valvular atrial fibrillation having a stroke. Few clinicians are familiar with the original research or detailed reasoning upon which the calculator is based, but they nonetheless use the tool. Similarly, many physicians use the FRAX score to estimate a patient’s 10-year risk of developing a bone fracture, despite the fact that they have not investigated the underlying math.
It’s important to point out, however, that the stroke risk tool and the FRAX tool both have major endorsements from organizations that physicians respect. The American Heart Association and the American College of Cardiology both recommend the CHA2DS2-VASc score, while the National Osteoporosis Foundation supports the use of the FRAX score. That gives physicians confidence in these tools even if they don’t grasp the underlying details. To date, there are no major professional associations recommending specific AI-enabled algorithms to supplement the diagnosis or treatment of disease. The American Diabetes Association did include a passing mention of an AI-based screening tool in its 2020 Standards of Medical Care in Diabetes, stating: “Artificial intelligence systems that detect more than mild diabetic retinopathy and diabetic macular edema authorized for use by the FDA represent an alternative to traditional screening approaches. However, the benefits and optimal utilization of this type of screening have yet to be fully determined.” That can hardly be considered a recommendation.
Given this scenario, most physicians have reason to be skeptical, and surveys bear out that skepticism. A survey of 91 primary care physicians found that understandability of AI is one of the important attributes they want before trusting its recommendations during breast cancer screening. Similarly, a survey of senior specialists in the UK found that understandability was one of their primary concerns about AI. Among New Zealand physicians, 88% were more likely to trust an AI algorithm that produced an understandable explanation of its decisions.
Of course, it may not be possible to fully explain the advanced mathematics used to create machine learning based algorithms. But there are other ways to describe the logic behind these tools that would satisfy clinicians. As we have mentioned in previous publications and oral presentations, there are tutorials available to simplify machine learning-related systems like neural networks, random forest modeling, clustering, and gradient boosting. Our most recent book contains an entire chapter on this digital toolbox. Similarly, JAMA has created clinician friendly video tutorials designed to graphically illustrate how deep learning is used in medical image analysis and how such algorithms can be used to help detect lymph node metastases in breast cancer patients.
These resources require clinicians to take the initiative and learn a few basic AI concepts, but developers and vendors also have an obligation to make their products more transparent. One way to accomplish that goal is through saliency maps and generative adversarial networks. Using such techniques, it’s possible to highlight the specific pixel grouping that a neural network has identified as a trouble spot, which the clinician can then view on a radiograph, for example. Alex DeGrave, with the University of Washington, and his colleagues used this approach to help explain why an algorithm designed to detect COVID-19-related changes in chest X-rays made its recommendations. Amirata Ghorbani and associates from Stanford University have taken a similar approach to help clinicians comprehend the echocardiography recommendations coming from a deep learning system. The researchers trained a convolutional neural network (CNN) on over 2.6 million echocardiogram images from more than 2,800 patients and demonstrated it was capable of identifying enlarged left atria, left ventricular hypertrophy, and several other abnormalities. To open up the black box, Ghorbani et al presented readers with “biologically plausible regions of interest” in the echocardiograms they analyzed so they could see for themselves the reason for the interpretation that the model had arrived at. For instance, if the CNN said it had identified a structure such as a pacemaker lead, it highlighted the pixels it had identified as the lead. Similar clinician-friendly images are presented for a severely dilated left atrium and for left ventricular hypertrophy.
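To make the idea of a saliency map concrete, here is a minimal sketch of occlusion sensitivity, one of the simplest ways to build such a map: mask each pixel in turn and record how much the model's score drops. This is not the method used by DeGrave or Ghorbani and their colleagues; the function names, the 4x4 "image," and the toy scoring function standing in for a trained network are all assumptions made for illustration.

```python
def occlusion_saliency(image, score_fn, baseline=0.0):
    """Return a map of how much the score drops when each pixel is masked."""
    base_score = score_fn(image)
    saliency = []
    for r, row in enumerate(image):
        saliency_row = []
        for c in range(len(row)):
            occluded = [list(pixels) for pixels in image]  # copy so the input is untouched
            occluded[r][c] = baseline                      # mask a single pixel
            saliency_row.append(base_score - score_fn(occluded))
        saliency.append(saliency_row)
    return saliency


def toy_score(image):
    """Hypothetical stand-in for a trained model: sums the central 2x2 region."""
    return sum(image[r][c] for r in (1, 2) for c in (1, 2))


def most_salient_pixel(saliency):
    """Return the (row, column) whose occlusion changes the score the most."""
    coords = [(r, c) for r in range(len(saliency)) for c in range(len(saliency[r]))]
    return max(coords, key=lambda rc: saliency[rc[0]][rc[1]])


# Hypothetical 4x4 image: dim background with one bright pixel at row 1, column 2.
IMAGE = [[0.1] * 4 for _ in range(4)]
IMAGE[1][2] = 0.9

if __name__ == "__main__":
    saliency_map = occlusion_saliency(IMAGE, toy_score)
    print("Most influential pixel:", most_salient_pixel(saliency_map))
```

In a real system the scoring function would be the network's output for a given finding, and the resulting map would be overlaid on the radiograph or echocardiogram so the clinician can see which region drove the prediction.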
Deep learning systems are slowly ushering in a new way to manage diagnosis and treatment, but to bring skeptical clinicians on board, we need to pull the curtain back. In addition to providing evidence that these tools are equitable and clinically effective, practitioners want reasonable explanations to demonstrate that they will do what they claim to do.
Friday, July 30, 2021
Advances in artificial intelligence are slowly transforming the specialty, much the way radiology is being transformed by similar advances in digital technology.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
Any patient who faces a potential cancer diagnosis knows how important an accurate, timely pathology report is. Similarly, surgeons often require fast pathology results when they are performing a delicate procedure to determine their course of action during an operation. New technological developments are poised to meet the needs of patients and clinicians alike.
AI can improve pathology practice in numerous ways. The right digital tools can automate several repetitive tasks, including the detection of small foci. They can also help improve the staging of many malignancies, make the workflow process more efficient, and help classify images, which in turn gives pathologists a “second set of eyes.” And those “eyes” do not grow tired at the end of a long day or feel stressed out from too much work.
Such capabilities have far-reaching implications. With the right scanning hardware and the proper viewer software, pathologists and technicians can easily view and store whole slide images (WSIs). That view is markedly different from what they see through a microscope, which only allows a narrow field of view. In addition, digitization allows pathologists to mark up WSIs with non-destructive annotations, use the slides as teaching tools, search a laboratory’s archives to make comparisons with images that depict similar cases, give colleagues and patients access to the images, and create predictive models. And if the facility has cloud storage capabilities, it allows clinicians, patients, and pathologists around the world to access the data.
A 2020 prospective trial conducted by University of Michigan and Columbia University investigators illustrates just how profound the impact of AI and ML can be when applied to pathology. Todd Hollon and colleagues point out that intraoperative diagnosis of cancer relies on a “contracting, unevenly distributed pathology workforce.”1 The process can be quite inefficient, requiring that a tissue specimen travel from the OR to a lab, followed by specimen processing, slide preparation by a technician, and a pathologist’s review. At University of Michigan, they are now using stimulated Raman histology, an advanced optical imaging method, along with a convolutional neural network (CNN) to help interpret the images. The machine learning tools were trained to detect 13 histologic categories and include an inference algorithm to help make a diagnosis of brain cancer. Hollon et al conducted a 2-arm, prospective, multicenter, non-inferiority trial to compare the CNN results to those of human pathologists. The trial, which evaluated 278 specimens, demonstrated that the machine learning system was as accurate as pathologists’ interpretation (94.6% vs 93.9%). Equally important was the fact that it took under 15 seconds for surgeons to get their results with the AI system, compared to 20-30 minutes with conventional techniques. And that latter estimate does not represent the national average. In some community settings, slides have to be shipped by special courier to labs that are hours away.
Mayo Clinic is among several forward-thinking health systems that are in the process of implementing a variety of digital pathology services. Mayo Clinic has partnered with Google and is leveraging their technology in two ways. The program will extend Mayo Clinic’s comprehensive Longitudinal Patient Record profile with digitized pathology images to better serve and care for patients. And we are exploring new search capabilities to improve digital pathology analytics and AI. The Mayo/Google project is being conducted with the help of Sectra, a digital slide review and image storage and management system. Once proof of concept, system testing, and configuration activities are complete, the digital pathology solution will be introduced gradually to Mayo Clinic departments throughout Rochester, Florida, and Arizona, as well as the Mayo Clinic Health System.
The new digital capabilities taking hold in several pathology labs around the globe are likely to solve several vexing problems facing the specialty. Currently there is a shortage of pathologists worldwide, and in some countries, that shortage is severe. One estimate found there is one pathologist per 1.5 million people in parts of Africa. And China has one fourth the number of pathologists practicing in the U.S., on a per capita basis. Studies predict that the steady decline of the number of pathologists in the U.S. will continue over the next two decades. A lack of subspecialists is likewise a problem. Similarly, there are reports of poor accuracy and reproducibility, with many practitioners making subjective judgements based on a manual estimate of the percentage of positive cells for a biomarker. Finally, there is reason to believe that implementing digital pathology systems will likely improve a health system’s financial return on investment. One study has suggested that it can “improve the efficiency of pathology workloads by 13%.” 2
As we have said several times in these columns, AI and ML are certainly not a panacea, and they will never replace an experienced clinician or pathologist. But taking advantage of the tools generated by AI/ML will have a profound impact on diagnosis and treatment for the next several decades.
1. Hollon T, Pandian B, Adapa A et al. Near real-time intraoperative brain tumor diagnosis using stimulated Raman histology and deep neural networks. Nat. Med. 2020. 26:52-58.
2. Ho J, Ahlers SM, Stratman C, et al. Can digital pathology result in cost savings? A financial projection for digital pathology implementation at a large integrated health care organization. 2014;5(1):33; doi: 10.4103/2153-3539.139714.
Wednesday, July 28, 2021
Dataset shift can thwart the best intentions of algorithm developers and tech-savvy clinicians, but there are solutions.
John Halamka, M.D., president, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, wrote this article.
Generalizability has always been a concern in health care, whether we’re discussing the application of clinical trials or machine-learning based algorithms. A large randomized controlled trial that finds an intensive lifestyle program doesn’t reduce the risk of cardiovascular complications in Type 2 diabetics, for instance, suggests the diet/exercise regimen is not worth recommending to patients. But the question immediately comes to mind: Can that finding be generalized to the entire population of Type 2 patients? As we have pointed out in other publications, subgroup analysis has demonstrated that many patients do, in fact, benefit from such a program.
The same problem exists in health care IT. Several algorithms have been developed to help classify diagnostic images, predict disease complications, and more. A closer look at the datasets upon which these digital tools are based indicates many suffer from dataset shift. In simple English, dataset shift is what happens when the data collected during the development of an algorithm changes over time and differs from the data seen when the algorithm is eventually implemented. For example, the patient demographics used to create a model may no longer represent the patient population when the algorithm is put into clinical use. This happened when COVID-19 changed the demographic characteristics of patients, making the Epic sepsis prediction tool ineffective.
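One simple way to watch for this kind of drift is to compare the distribution of an input variable at training time with its distribution after deployment. The sketch below uses the population stability index (PSI), a common monitoring statistic that we have chosen for illustration; it is not a method described in the studies cited here. The patient ages, the bin edges, and the function names are all hypothetical.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, floor=1e-6):
    """Fraction of values falling in each bin defined by sorted cut points.

    Empty bins are floored at a tiny value so the logarithm below is defined.
    """
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(training_values, live_values, edges):
    """PSI = sum over bins of (live - train) * ln(live / train)."""
    expected = bin_fractions(training_values, edges)
    actual = bin_fractions(live_values, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical patient ages when the model was built vs. after deployment.
TRAINING_AGES = [34, 41, 45, 52, 58, 63, 67, 71, 74, 80]
LIVE_AGES = [62, 66, 69, 72, 75, 78, 81, 84, 88, 90]
AGE_EDGES = [50, 65, 80]  # assumed age bands: <50, 50-64, 65-79, 80+

if __name__ == "__main__":
    psi = population_stability_index(TRAINING_AGES, LIVE_AGES, AGE_EDGES)
    print(f"PSI for age: {psi:.2f}")
```

A PSI of zero means the two distributions match across the chosen bins; larger values signal a bigger shift. Any alert threshold is a rule of thumb that each organization would need to set and validate for itself.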
Samuel Finlayson, PhD, with Harvard Medical School, and his colleagues described a long list of dataset shift scenarios that can compromise the accuracy and equity of AI-based algorithms, which in turn can compromise patient outcomes and patient safety. Finlayson et al list 14 scenarios, which fall into 3 broad categories: changes in technology; changes in population and setting; and changes in behavior. Examples of ways in which dataset shift can create misleading outputs that send clinicians down the wrong road include:
- Changes in the X-ray scanner models used
- Changes in the way diagnostic codes are collected (e.g., using ICD-9 and then switching to ICD-10)
- Changes in patient population resulting from hospital mergers
Other potential problems to be cognizant of include changes in your facility’s EHR system. Sometimes updates to the system may result in changes in how terms are defined, which in turn can impact predictive algorithms that rely on those definitions. If a term like elevated temperature or fever is changed to pyrexia in one of the EHR drop down menus, for example, it may no longer map to the algorithm that uses elevated temperature as one of the variable definitions to predict sepsis, or any number of common infections. Similarly, if the ML-based model has been trained on a patient dataset for a medical specialty practice or hospital cohort, it’s likely that data will generate misleading outputs when applied to a primary care setting.
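The fever/pyrexia problem can be guarded against with an explicit mapping layer between EHR terms and the variables a model expects. The following is a minimal sketch under our own assumptions: the mapping table, the variable name, and the function are hypothetical and do not come from any EHR vendor's interface. The key design choice is that an unrecognized term raises an error instead of being silently dropped.

```python
class UnmappedTermError(KeyError):
    """Raised when an EHR term has no known mapping to a model variable."""


# Hypothetical mapping from EHR menu terms to the variable a sepsis model expects.
TERM_TO_MODEL_VARIABLE = {
    "elevated temperature": "elevated_temperature",
    "fever": "elevated_temperature",
    "pyrexia": "elevated_temperature",
}


def to_model_variable(ehr_term):
    """Normalize case and whitespace, then look up the model variable."""
    key = " ".join(ehr_term.lower().split())
    if key not in TERM_TO_MODEL_VARIABLE:
        # Fail loudly so a renamed menu item is noticed before it skews predictions.
        raise UnmappedTermError(ehr_term)
    return TERM_TO_MODEL_VARIABLE[key]


if __name__ == "__main__":
    print(to_model_variable("Pyrexia"))
```

When an EHR update introduces a new term, the fix is then a one-line addition to the mapping table, and the loud failure gives the clinical and IT teams a prompt to review whether the new term really means the same thing.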
Finlayson et al mention another example to be aware of: changes in the way physicians practice can influence data collection: “Adoption of new order sets, or changes in their timing, can heavily affect predictive model output.” Clearly, problems like this necessitate strong interdisciplinary ties, including an ongoing dialogue between the chief medical officer, clinical department heads, and chief information officer and his or her team. Equally important is the need for clinicians in the trenches to look for subtle changes in practice patterns that can impact the predictive analytics tools currently in place. Many dataset mismatches can be solved by updating variable mapping, retraining or redesigning the algorithm, and multidisciplinary root cause analysis.
While addressing dataset shift issues will improve the effectiveness of your AI-based algorithms, they are only one of many stumbling blocks to contend with. One classic example that demonstrates that computers are still incapable of matching human intelligence is the study that concluded that patients with asthma are less likely to die from pneumonia than those who don’t have asthma. The machine learning tool used to come to that unwarranted conclusion had failed to take into account the fact that many asthmatics often get faster, earlier, more intensive treatment when their condition flares up, which results in a lower mortality rate. Had clinicians acted on the misleading correlation between asthma and fewer deaths from pneumonia, they might have decided asthma patients don’t necessarily need to be hospitalized when they develop pneumonia.
This kind of misdirection is relatively common and emphasizes the fact that ML-enhanced tools sometimes have trouble separating useless “noise” from meaningful signal. Another example worth noting: Some algorithms designed to help detect COVID-19 by analyzing X-rays suffer from this shortcoming. Several of these deep learning algorithms rely on confounding variables instead of focusing on medical pathology, giving clinicians the impression that they are accurately identifying the infection or ruling out its presence. Unbeknownst to their users, the algorithms have been shown to rely on text markers or patient positioning instead of pathology findings.
At Mayo Clinic, we have had to address similar problems. A palliative care model that was trained on data from the Rochester, Minnesota, community, for instance, did not work well in our health system because the severity of patient disease in a tertiary care facility is very different from what’s seen in a local community hospital. Similarly, one of our algorithms broke when a vendor did a point release in its software and changed the format of the results. We also had a vendor with a CT stroke detection algorithm run 10 of our known stroke patients through its system, and it was able to identify only one of them. The root cause: Mayo Clinic medical physicists have reduced our radiation dose to 25% of industry standards to limit patients’ exposure, but that changed the signal-to-noise ratio of the images; the vendor’s system wasn’t trained on that ratio and could not detect the strokes in the images.
Valentina Bellini, with University of Parma, Parma, Italy, and her colleagues sum up the AI shortcut dilemma in a graphic that illustrates 3 broad problem areas: Poor quality data, ethical and legal issues, and lack of educational programs for clinicians who may be skeptical or uninformed about the value and limitations of AI enhanced algorithms in intensive care settings.
As we have pointed out in other blogs, ML-based algorithms rely on math, not magic. But when reliance on that math overshadows clinicians’ diagnostic experience and common sense, they need to partner with their IT colleagues to find ways to reconcile artificial and human intelligence.
Friday, July 23, 2021
A growing body of research suggests it’s time to abandon outdated ideas about how to identify effective medical therapies.
Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.
“Correlation is not causation.” It’s a truism that researchers take for granted, and for good reason. The fact that event A is followed by event B doesn’t mean that A caused B. An observational study of 1,000 adults, for example, that found those taking high doses of vitamin C were less likely to develop lung cancer doesn’t prove the nutrient protects against the cancer; it’s always possible that a third factor — a confounding variable — was responsible for both A and B. In other words, patients taking lots of vitamin C may be less likely to get lung cancer because they are more health conscious than the average person, and therefore more likely to avoid smoking, which in turn reduces their risk of the cancer.
As this example illustrates, confounding variables are the possible contributing factors that may mislead us into imagining a cause-and-effect relationship exists when there isn’t one. It’s the reason interventional trials like the randomized controlled trial (RCT) remain a more reliable way to determine causation than observational studies. But it’s important to point out that in clinical medicine, there are many treatment protocols in use that are not supported by RCTs. Similarly, there are many risk factors associated with various diseases but it’s often difficult to know for certain whether these risk factors are actually contributing causes of said diseases.
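The vitamin C example can be worked through with a small calculation. The sketch below builds an exact probability table for an imaginary population in which vitamin C has no effect on lung cancer at all; health-conscious people are simply more likely to take it and less likely to smoke. Every probability here is invented for illustration, as are the function names.

```python
from itertools import product

# All numbers are hypothetical and chosen only to illustrate confounding.
P_HEALTH_CONSCIOUS = 0.4
P_VITC_GIVEN_H = {0: 0.1, 1: 0.6}         # P(takes vitamin C | health conscious?)
P_SMOKE_GIVEN_H = {0: 0.4, 1: 0.1}        # P(smokes | health conscious?)
P_CANCER_GIVEN_SMOKE = {0: 0.01, 1: 0.10}  # cancer depends on smoking only


def bern(p_one, value):
    """Probability that a yes/no variable takes `value` when P(yes) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint():
    """Exact joint probability over (health conscious, vitamin C, smoking, cancer)."""
    table = {}
    for h, v, k, c in product((0, 1), repeat=4):
        table[(h, v, k, c)] = (
            bern(P_HEALTH_CONSCIOUS, h)
            * bern(P_VITC_GIVEN_H[h], v)
            * bern(P_SMOKE_GIVEN_H[h], k)
            * bern(P_CANCER_GIVEN_SMOKE[k], c)
        )
    return table


def cancer_risk(table, vitc, smoke=None):
    """P(cancer | vitamin C status), optionally within one smoking stratum."""
    def matches(key):
        _, v, k, _ = key
        return v == vitc and (smoke is None or k == smoke)

    total = sum(p for key, p in table.items() if matches(key))
    cases = sum(p for key, p in table.items() if matches(key) and key[3] == 1)
    return cases / total


if __name__ == "__main__":
    t = joint()
    print("Overall risk, vitamin C vs none:", cancer_risk(t, 1), cancer_risk(t, 0))
    print("Non-smokers only:", cancer_risk(t, 1, smoke=0), cancer_risk(t, 0, smoke=0))
```

In this imaginary population, vitamin C users show a lower overall cancer risk, yet the difference disappears once smokers and non-smokers are examined separately. That is the signature of a confounded association.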
While RCTs remain the gold standard in medicine, they can be impractical for a variety of reasons: they are often very expensive to perform; an RCT that exposes patients to a potentially harmful risk factor and compares them to those who aren’t exposed would be unethical; and most trials require many exclusion and inclusion criteria that don’t exist in the everyday practice of medicine. For instance, they usually exclude patients with co-existing conditions, which may distort the study results.
One way to address this problem is by accepting less than perfect evidence and using a reliability scale or continuum to determine which treatments are worth using and which are not. That scale might look something like this, with evidential support growing stronger from left to right along the continuum:
In the absence of RCTs, it’s feasible to consider using observational studies like case/control and cohort trials to justify using a specific therapy. And while such observational studies may still mislead because some confounding variables have been overlooked, there are epidemiological criteria that strengthen the weight given to these less than perfect studies:
- A stronger association or correlation between two variables is more suggestive of a cause/effect relationship than a weaker association.
- Temporality. The alleged effect must follow the suspected cause not the other way around. It would make no sense to suggest that exposure to Mycobacterium tuberculosis causes TB if all the cases of the infection occurred before patients were exposed to the bacterium.
- A dose-response relationship exists between alleged cause and effect. For example, if researchers find that a blood lead level of 10 mcg/dl is associated with mild learning disabilities in children, 15 mcg/dl is linked to moderate deficit, and 20 mcg/dl with severe deficits, this gradient strengthens the argument for causality.
- A biologically plausible mechanism of action linking cause and effect strengthens the argument. In the case of lead poisoning, there is evidence pointing to neurological damage brought on by oxidative stress and a variety of other biochemical mechanisms.
- Repeatability of the study findings: If the results of one group of investigators are duplicated by independent investigators, that lends further support to the cause/effect relationship.
While adherence to all these criteria suggests causality for observational studies, a statistical approach called causal inference can actually establish causality. The technique, which was spearheaded by Judea Pearl, Ph.D., winner of the 2011 Turing Award, is considered revolutionary by many thought leaders and will likely have profound implications for clinical medicine, and for the role of AI and machine learning. During the recent Mayo Clinic Artificial Intelligence Symposium, Adrian Keister, Ph.D., a senior data science analyst at Mayo Clinic, concluded that causal inference is “possibly the most important advance in the scientific method since the birth of modern statistics — maybe even more important than that.”
Conceptually, causal inference starts with the conversion of word-based statements into mathematical statements, with the help of a few new operators. While that may sound daunting to anyone not well-versed in statistics, it’s not much different than the way we communicate by using the language of arithmetic. A statement like fifteen times five equals seventy-five is converted to 15 x 5 = 75. In this case, x is an operator. The new mathematical language of causal inference might look like this if it were to represent an observational study that evaluated the association between a new drug and an increase in patients’ lifespan: P(L | D), where P is probability, L is lifespan, D is the drug, and | is an operator that means “conditioned on.”
An interventional trial such as an RCT, on the other hand, would be written as: D causes L if P(L | do(D)) > P(L), in which the do-operator refers to the intervention, i.e., giving the drug in a controlled setting. This formula is a way of saying that D (the drug being tested) causes L (longer life) if the probability of a longer life under the intervention is greater than the probability of a longer life without administering the drug, in other words, the probability in the placebo group, namely P(L).
The technique also uses causal graphs to show the relationship of a confounding variable to a proposed cause/effect relationship. Using this kind of graph, one can illustrate how the tool applies in a real-world scenario. Consider the relationship between smoking and lung cancer. For decades, statisticians and policy makers argued about whether smoking causes the cancer because all the evidence supporting the link was observational. The graph would have three nodes: the confounding variable (a genetic predisposition, for example), S for smoking, and LC for lung cancer. The implication here is that if a third factor causes persons to smoke and also causes cancer, one cannot necessarily conclude that smoking causes lung cancer. What Pearl and his associates discovered was that if an intermediate factor can be identified in the pathway between smoking and cancer, it’s then possible to establish a cause/effect relationship between the two with the help of a series of mathematical calculations and a few algebraic rewrite tools. As figure 2 demonstrates, tar deposits in the smokers’ lungs are that intermediate factor.
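The calculation that uses an intermediate factor is generally known as Pearl’s front-door adjustment. The sketch below is a toy illustration under stated assumptions, not a reconstruction of any real smoking data: the hidden confounder U, the binary variables, and every probability are hypothetical, and T stands for tar deposits. The code builds the joint distribution of S, T, and LC that an observational study would see (with U summed out), estimates P(LC | do(S)) from that observed distribution alone, and then checks the estimate against the true value, which is available here only because we wrote the model ourselves.

```python
from itertools import product

# Hypothetical parameters, invented for illustration only.
P_U = {0: 0.5, 1: 0.5}                # hidden confounder, e.g. genetic predisposition
P_S1_GIVEN_U = {0: 0.2, 1: 0.8}       # P(S=1 | U=u): U influences smoking
P_T1_GIVEN_S = {0: 0.1, 1: 0.9}       # P(T=1 | S=s): smoking causes tar deposits
# P(LC=1 | T=t, U=u), keyed by (t, u): tar and U both influence cancer.
P_LC1_GIVEN_TU = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.7}


def bern(p1, value):
    """Probability that a binary variable with P(1)=p1 takes `value`."""
    return p1 if value == 1 else 1.0 - p1


def observed_joint():
    """P(S, T, LC) with U summed out: all an observational study can see."""
    joint = {}
    for u, s, t, lc in product((0, 1), repeat=4):
        p = (P_U[u] * bern(P_S1_GIVEN_U[u], s) * bern(P_T1_GIVEN_S[s], t)
             * bern(P_LC1_GIVEN_TU[(t, u)], lc))
        joint[(s, t, lc)] = joint.get((s, t, lc), 0.0) + p
    return joint


def prob(joint, **fixed):
    """Sum the joint over all entries matching the fixed values of s, t, lc."""
    names = ("s", "t", "lc")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))


def front_door(joint, s):
    """P(LC=1 | do(S=s)) = sum_t P(t|s) * sum_s2 P(LC=1 | t, s2) * P(s2)."""
    total = 0.0
    for t in (0, 1):
        p_t_given_s = prob(joint, s=s, t=t) / prob(joint, s=s)
        inner = sum(prob(joint, s=s2, t=t, lc=1) / prob(joint, s=s2, t=t)
                    * prob(joint, s=s2) for s2 in (0, 1))
        total += p_t_given_s * inner
    return total


def true_effect(s):
    """Ground truth computed with the hidden U (not available in practice)."""
    return sum(bern(P_T1_GIVEN_S[s], t) * P_U[u] * P_LC1_GIVEN_TU[(t, u)]
               for t in (0, 1) for u in (0, 1))


if __name__ == "__main__":
    joint = observed_joint()
    print(round(front_door(joint, 1), 2), round(true_effect(1), 2))  # 0.56 0.56
    print(round(front_door(joint, 0), 2), round(true_effect(0), 2))  # 0.24 0.24
```

The estimate recovers the true interventional probabilities without ever observing U. That only works under the assumptions built into this toy model: tar is affected by smoking alone, and smoking influences cancer only through tar.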
Had causal inference existed in the 1950s and 1960s, the argument by tobacco industry lobbyists would have been refuted, which in turn might have saved many millions of lives. The same approach holds tremendous potential as we begin to apply it to predictive algorithms and other machine learning-based digital tools.
Monday, July 19, 2021
Innovation in healthcare requires new ways to think about interdisciplinary solutions.
Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform and John Halamka, M.D., president, Mayo Clinic Platform, wrote this article.
During the 10 years we have worked together, John and I have written often about the power of words like transformation, optimism, cynicism, and misdiagnosis. Another word that needs more attention is “interdisciplinary.” It’s been uttered so many times in science, medicine, and technology that it’s lost much of its impact. We all give lip service to the idea, but many aren’t willing or able to do the hard work required to make it a reality, and one that fosters innovation and better patient care.
Examples of the dangers of focusing too narrowly on one discipline are all around us. The disconnect between technology and medicine becomes obvious when you take a closer look at the invention of blue light emitting diodes (LEDs), for instance, for which Isamu Akasaki, Hiroshi Amano, and Shuji Nakamura won the Nobel Prize in Physics in 2014. While this technology reinvented the way we light our homes, providing a practical source of bright, energy-saving light, the researchers failed to take into account the health effects of their invention. Had they been encouraged to embrace an interdisciplinary mindset, they might have considered the neurological consequences of being exposed to too much blue light. Certain photoreceptive retinal cells detect blue light, which is plentiful in sunlight. As it turns out, the brain interprets LEDs much like it interprets sunlight, in effect telling us it’s time to wake up, making it difficult to get to sleep.
Problems like this only serve to emphasize what materials scientist Ainissa Ramirez, PhD discusses in a recent essay: “The culture of research … does not incentivize looking beyond one’s own discipline … Academic silos barricade us from thinking broadly and holistically. In materials science, students are often taught that the key criteria for materials selection are limited to cost, availability, and the ease of manufacturing. The ethical dimension of a materials innovation is generally set aside as an elective class in accredited engineering schools. But thinking about the impacts of one’s work should be neither optional nor an afterthought.”
This is the same problem we face in digital health. Too many data scientists and venture capitalists have invested time and resources into developing impressive algorithms capable of screening for disease and improving its treatment. But some have failed to take a closer look at the data sets upon which these digital tools are built, data sets that misrepresent the populations they are trying to serve. The result has been an ethical dilemma that needs our immediate attention.
Consider the evidence: A large commercially available risk prediction data set used to guide healthcare decisions has been analyzed to find out how equitable it is. The data set was designed to determine which patients require more than the usual attention because of their complex needs. Ziad Obermeyer from the School of Public Health at the University of California, Berkeley, and his colleagues looked at over 43,000 White and about 6,000 Black primary care patients in the data set and discovered that when Black patients were assigned to the same level of risk as White patients by the algorithm based on the data set, they were actually sicker than their White counterparts. How did this racial bias creep into the algorithm? Obermeyer et al explain: “Bias occurs because the algorithm uses health costs as a proxy for health needs. Less money is spent on Black patients who have the same level of need, and the algorithm thus falsely concludes that Black patients are healthier than equally sick White patients.”
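The mechanism Obermeyer and colleagues describe can be shown with a deliberately simple sketch. It is not their algorithm or their data: the two groups, the spending figures, and the function names below are hypothetical, chosen only to show how a score that tracks cost can assign the same risk to patients with different levels of illness.

```python
# Toy model; every number here is invented for illustration.
# Hypothetical average dollars spent per chronic condition in each group.
SPEND_PER_CONDITION = {"group_a": 1000.0, "group_b": 700.0}


def cost_based_risk(group, conditions):
    """A score trained on cost: risk tracks expected spending, not illness."""
    return SPEND_PER_CONDITION[group] * conditions


def conditions_at_risk(group, risk):
    """Number of chronic conditions implied by a given risk score in `group`."""
    return risk / SPEND_PER_CONDITION[group]


if __name__ == "__main__":
    score = 7000.0
    print(conditions_at_risk("group_a", score))  # 7.0
    print(conditions_at_risk("group_b", score))  # 10.0
```

At the same score of 7,000, the patient in the lower-spending group has ten chronic conditions and the patient in the higher-spending group has seven. A fairness audit of a real tool would make the equivalent comparison with measured health outcomes at each risk level.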
Similarly, evidence from an Argentinian study that analyzed data from deep neural networks used on publicly available X-ray image datasets intended to help diagnose thoracic diseases revealed inequities. When the investigators compared gender-imbalanced datasets to datasets in which males and females were equally represented, they found that, “with a 25%/75% imbalance ratio, the average performance across all diseases in the minority class is significantly lower than a model trained with a perfectly balanced dataset.” Their analysis concluded that datasets that underrepresent one gender result in biased classifiers, which in turn may lead to misclassification of pathology in the minority group.
These disparities not only re-emphasize the need for technologists, clinicians, and ethicists to work together, they also raise the question: How can we fix the problem now? Working from the assumption that any problem this complex needs to be precisely measured before it can be rectified, Mayo Clinic, Duke School of Medicine, and Optum/Change Healthcare are currently analyzing a massive data set with more than 35 billion healthcare events and about 16 billion encounters that are linked to data sets that include social determinants of health. That will enable us to stratify the data by race/ethnicity, income, geolocation, education, and the like. Creating a platform that systematically evaluates commercially available algorithms for fairness and accuracy is another tactic worth considering. Such a platform would create “food label” style data cards that include the essential features of each digital tool, including its input data sources and types, validation protocols, population composition, and performance metrics. There are also several analytical tools specifically designed to detect algorithmic bias, including Google’s TCAV, Audit-AI, and IBM’s AI Fairness 360.
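As a rough sketch of what such a data card might hold, the Python structure below simply mirrors the features listed above. It is not a published schema: the class name, field names, the 10% threshold, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataCard:
    """Hypothetical 'food label' summary for one digital health tool."""
    tool_name: str
    input_data_sources: List[str]
    input_data_types: List[str]
    validation_protocols: List[str]
    population_composition: Dict[str, float]  # subgroup -> share of the data
    performance_metrics: Dict[str, float]     # metric name -> value

    def underrepresented(self, threshold: float = 0.10) -> List[str]:
        """Subgroups whose share of the data falls below `threshold`."""
        return sorted(group for group, share in self.population_composition.items()
                      if share < threshold)


if __name__ == "__main__":
    # Example values are invented purely to show the structure.
    card = DataCard(
        tool_name="example-risk-model",
        input_data_sources=["EHR encounters"],
        input_data_types=["diagnosis codes", "lab values"],
        validation_protocols=["held-out test set"],
        population_composition={"group_a": 0.88, "group_b": 0.12},
        performance_metrics={"auc": 0.80},
    )
    print(card.underrepresented(threshold=0.15))  # ['group_b']
```

A card like this would let a reviewer see at a glance whether the population behind a tool resembles the patients it will be used on.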
The fences that divide healthcare can be torn down. It just takes determination and enough craziness to believe it can be done — and lots of hard work.