Health Care A.I. Needs to Get Real
Unlocking the medical potential of artificial intelligence requires being more realistic about its limitations
This is Open Dialogue, an interview series from OneZero about technology and ethics.
I’m excited to talk with Muhammad Aurangzeb Ahmad. Muhammad is the principal research scientist at KenSci, Inc., a company specializing in A.I. in health care, and an affiliate professor in the department of computer science at the University of Washington Bothell. I’ve known Muhammad for a long time. When I started teaching philosophy at Rochester Institute of Technology, he was one of my first students. Over the years, we’ve kept in touch as Muhammad went on to get his PhD in computer science and eventually became a notable data scientist and A.I. researcher. Muhammad’s interests in philosophy have never wavered. I often turn to him when I need a holistic explanation of technological trends and controversies.
Our conversation has been edited and condensed for clarity.
Evan Selinger: With so much pain occurring during the pandemic, including massive financial challenges, adaptive responses were necessary. Many depended on technological innovation, from online communication platforms facilitating remote teaching and working to mobile robots keeping places clean. Despite high hopes for medical A.I., Stanford Medicine drew negative headlines. It selected software to determine which workers should receive priority for the Pfizer vaccine during the first distribution wave, and the algorithm produced poor results, favoring lower-risk doctors over medical residents who worked in close physical proximity to Covid-19 patients. Why did the system choose the wrong people?
Muhammad Ahmad: The press initially misdescribed the software as a machine-learning system. It was, however, a simple rule-based system. In other words, human programmers gave the software rules to follow. These rules created a protocol for who should get vaccinated first. Here’s the interesting thing. If one looks at each rule individually, they’re all sensible. None of them express anything objectionable. However, when the rules are combined, they turn out to be more than the sum of their parts. Collectively, they direct a computer to nonoptimal outcomes.
Basically, well-intentioned people believed automation could provide them with the best outcomes for a fraught situation. But when designing their system, they didn’t fully appreciate something essential to constructing any model: Nonlinear interactions can occur.
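Muhammad’s point about nonlinear rule interactions can be illustrated with a toy sketch. To be clear, this is not Stanford’s actual algorithm: the rules, weights, and record fields below are hypothetical, chosen only to show how individually sensible rules can combine badly.

```python
# Hypothetical sketch (not Stanford's actual algorithm): three individually
# sensible scoring rules that, combined, deprioritize the highest-exposure group.

def vaccine_priority(person):
    score = 0
    # Rule 1: older staff face higher mortality risk.
    score += 2 if person["age"] >= 60 else 0
    # Rule 2: credit staff by the Covid-19 case rate of their assigned unit.
    score += 3 * person.get("unit_covid_rate", 0.0)
    # Rule 3: credit formally assigned clinical roles.
    score += 1 if person["has_assigned_unit"] else 0
    return score

# A senior physician with a low-exposure assigned unit...
attending = {"age": 64, "unit_covid_rate": 0.1, "has_assigned_unit": True}
# ...outranks a young resident who rotates across units and therefore has no
# single assigned unit (and no unit-level case rate on record).
resident = {"age": 29, "unit_covid_rate": 0.0, "has_assigned_unit": False}

assert vaccine_priority(attending) > vaccine_priority(resident)
```

Each rule is defensible on its own; the failure only appears once the rules interact with how the data was recorded, which is exactly the kind of emergent behavior testing in realistic conditions is meant to catch.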
Evan: To reduce what happened to a slogan with broader appeal, can we say the following? Failing to acknowledge complexity can undermine equity!
Muhammad: That’s fair.
Evan: Since you’re emphasizing everyone’s good intentions, I’m wondering if there’s another sticking point. Do you think the pandemic’s urgency got in the way of sufficient testing — testing that might have revealed this problem?
Muhammad: That’s my best guess. I’m sure we’ll come back to this point throughout our conversation. Models that look good in laboratory conditions don’t always do well in the real world. Sometimes they’re rushed and should have spent more time in the lab. Other times, the guiding laboratory assumptions are wrong.
Evan: Continuing with this theme, what do you make of IBM considering selling Watson Health? IBM is a veteran multinational technology company, and it went all in with Watson. The A.I. system even became famous for putting its natural language processing skills on display during its Jeopardy win a decade ago. I’ll never forget when Ken Jennings, a contestant who won so many straight games that he earned over $3 million, ended his crushing defeat by telling the world, “I, for one, welcome our new computer overlords.” You don’t need to be Freudian to recognize when humor conveys an uncomfortable truth.
Muhammad: Watson is the most prominent example of A.I. hype, and we shouldn’t judge the limits of medical A.I. based on cases where industry promises more than it can deliver. Watson was misrepresented as an all-encompassing A.I. The guiding image presented to the public was: just feed Watson data, and it will spit out amazing diagnostic insights.
Evan: Like what?
Muhammad: Giving Watson inputs, like a person’s medical history and recent lab values, was supposed to lead the system to predict whether they’re likely to get cancer or diabetes in 20 years.
Evan: And computational hocus pocus isn’t the right way to make medical predictions?
Muhammad: Anyone who has been in the health care industry even for a short amount of time should know that when you’re dealing with large data sets, a massive amount of engineering effort is required to get valuable insights from data.
Evan: Are you saying IBM created false expectations by marketing Watson as a plug-and-play system that will take care of everything by itself once humans present it with enough information?
Muhammad: Exactly! A.I. isn’t just about models and algorithms. At the end of the day, data powers it, and data often requires a great deal of cleaning and preprocessing to transform it into a format that machine-learning systems can employ.
Look, to some extent, the Jeopardy competition was a PR stunt. I understand why IBM did it. To get the public excited about an A.I. system, it helps to show them something dazzling.
Evan: It’s not surprising that public relations efforts oversimplify a messy and complicated process. But why didn’t the range of technical people involved with Watson bring things to a more realistic level?
Muhammad: There are different ways to unpack your question. Think about the end-users of this system: medical and research hospitals. They often don’t have the in-depth technical knowledge to evaluate the capability of systems like Watson. Given IBM’s history and track record, you can understand why they’d be inclined to believe the hype and give the company the benefit of the doubt.
Then, there’s the situation at IBM. I don’t know the internal dynamics. Still, it’s reasonable to presume that many technical professionals weren’t even aware of everything that the PR team told the industry. I’ve seen this happen time and again — divisions within a company where different groups have a different understanding of what’s going on. But that’s not all. Let’s not forget a long-standing problem: People who work on the technical side of things often underestimate how difficult it will be to find solutions to complex problems.
A great xkcd comic, “Here to Help,” cuts to the heart of the matter. It starts with a professional from some field acknowledging how complicated an established problem is, shifts to a computer scientist enthusiastically declaring that algorithms will save the day, and concludes with the unsuccessful computer scientist eventually admitting, “Wow, this problem is really hard.”
Evan: Does your point about misguided optimism bring us back to your earlier remark about divergences between models and the real world?
Muhammad: Absolutely. Let me give you an example. A few years ago, Google created a deep-learning system designed to identify a complication of diabetes known as diabetic retinopathy by scanning retinal images. The system performed impressively, being over 90% accurate. The media widely hailed it as a triumph of deep learning. Later, the system was deployed in multiple parts of the world, including India. But guess what happened then? The accuracy went down.
Why? Because the data used to train the model came from an idealized setting — in this case, a high-resource environment with state-of-the-art cameras. In remote parts of rural India, you’re not going to find cameras with sufficiently high resolution, and lighting conditions are often poor. Low-resolution data is likely to degrade the performance of such A.I. models. Or worse, the models reject the inputs outright and don’t return any results.
Sadly, that’s not the only problem that transpired. When power outages occurred in rural India and doctors couldn’t run equipment that uses Google’s software, their diagnostic efficiency — the main reason why institutions adopt medical A.I. — went down.
Evan: I see two lessons here. The first is that it can be dangerous to simplify A.I. by conceptualizing it as software. Change the hardware, and you might change the outcome.
Muhammad: Right. More specifically, when you move A.I. models from a high- to a low-resource environment, more often than not, performance degrades. It’s not just changing the hardware but also changing the accompanying conditions that were used to create the A.I. model in the first place.
Evan: The second lesson is adopting medical A.I. can create new vulnerabilities by establishing new dependencies.
Muhammad: That’s why A.I. solutions created in the first world might not always work in the developing world.
Evan: It’s a classic technology transfer issue.
Muhammad: Yes, I remember talking about some of these issues when I took your philosophy of technology course many years ago. Unfortunately, they keep coming up. In the A.I. context, I think it’s partially because when people consider deploying it, they expect to use a purely technical tool. But it’s better to view A.I. as what humanities scholars call a “socio-technical system.” To approach a problem with A.I., you have to meet a technical challenge, account for human factors, including cultural biases, and also factor in infrastructure limits.
Since you can’t anticipate every possible problem with running an A.I. system, the best practical thing to do is ensure diverse people are working on it. People from different ethnicities, races, genders, socioeconomic backgrounds, worldviews, and disciplines — including the humanities.
Evan: It’s frustrating that we have to keep banging the drum for greater inclusion. After so many years of scholars and activists making the point, you’d think it would be clear, from the start, that people use technology in settings where social and cultural factors don’t magically disappear, the complexity of ethical considerations requires substantial deliberation to understand, and affected communities need to be empowered to voice anticipatory concerns for themselves. Do any quality, best practice documents exist that provide recommendations for the medical A.I. community to better address inclusivity?
Muhammad: This has been an active area of research and something I regularly encounter in my work. Fairness by design is gaining traction as a principle in machine learning. The basic idea is that instead of examining algorithmic bias as an afterthought, one should incorporate fairness and bias mitigation into the design process for medical A.I. Several recent reports and documents also encapsulate great recommendations and best practices. The useful resources that spring to mind are the MIT Design Lab’s “Exploring Fairness in Machine Learning for International Development,” Google’s responsible A.I. best practices, a full-length manuscript on fairness by Solon Barocas, Moritz Hardt, and Arvind Narayanan, and my own work with colleagues at KenSci and the University of Washington on fairness in health care A.I. For a lay audience, the Brookings Institution’s report on algorithmic bias is a good place to start.
Evan: I’d be remiss if I didn’t ask for your take on the much-debated topic of explicability. In cases where opaque, black-box mechanisms make it hard to know why a medical A.I. system seems to be producing accurate results, how should professionals think about the tension between understanding why a system works and getting maximum utility from it?
Muhammad: We should conceptualize the explicability of A.I. models in terms of trade-offs. There are trade-offs with predictive performance, model fairness, and, most importantly, risks. To illustrate, consider two extreme cases: Suppose a machine-learning model is used to predict mortality risk — that is, if someone is going to pass away in the next few months. If a physician uses this information to decide whether to send a patient to hospice care, there’s a high risk associated with making the wrong decision. After all, you’re sending the wrong patient to hospice when you should be providing them with additional care.
On the other end of the spectrum, consider a machine-learning model that predicts the number of patients expected in a well-funded hospital’s general ward. Even if the model is occasionally off, the staffing needs are likely to be met. The penalty for being wrong is minimal.
To compare cases, transparency is highly desirable for mortality prediction. But it’s not really important if you’re only concerned about how many staff members are needed in a context where there are plenty of available workers.
Evan: So far, we’ve addressed other people’s work. But we haven’t discussed your contributions to medical A.I. What’s a recent project?
Muhammad: One interesting recent project focused on the problem of leaving without being seen. It’s the challenge emergency waiting rooms deal with when people come in with various ailments, and some take off before medical professionals examine them. In the U.S., you can wait for a long time in the waiting room — hours and hours — and this is a frustrating experience.
It’s important to state that A.I. in health care and medicine is never a singular effort. I feel privileged to work with a wonderful team of people who inspire one another and go above and beyond. My team worked on creating decision-support software that helps emergency room professionals better manage the flow of people. The dashboard lists how likely it is that an individual will leave the hospital without receiving care. We’re getting positive feedback. A nurse said that after she saw someone flagged as likely to depart, she moved him up the line and let him know that he was next in the queue. He thanked her because he had been preparing to exit the hospital.
Evan: How does the system make these predictions?
Muhammad: It uses a model that factors in all kinds of information: the severity of a patient’s current medical issue, a patient’s history — not just medical background, like underlying conditions and medications used, but also how much time they’ve waited in the past — demographic data, and a range of additional details that are important but might not seem relevant at first glance.
Evan: Why demographic data?
Muhammad: To give an example, socioeconomic status and race can be linked due to a society’s systemic and structural problems. Suppose a job only allows a Black employee a limited amount of time away from the office, even when experiencing a medical emergency. In that case, it’s useful information to consider when one is prioritizing emergency room patients.
Evan: How do you determine what procedure is fair? First come, first served doesn’t take into account different levels of medical risk. And once you introduce demographics, some people are inevitably going to complain. Karens will say that they shouldn’t be de-prioritized because they’re more likely to complain about poor service and threaten to get staff fired than leave before being seen.
Muhammad: In the machine-learning literature, there are more than 60 different notions of fairness. For a large number of use cases, I actually think we can drill down to about six or seven relevant definitions. But even so, these versions of fairness are incompatible with one another. To pick one notion of fairness is to exclude others. And there’s no objectively right way to select any particular one. There will always be trade-offs.
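The incompatibility Muhammad describes can be shown with a toy calculation. All numbers below are hypothetical: when two groups have different base rates of a condition, a model with equal true-positive rates across groups cannot also have equal selection rates, so satisfying one fairness definition forces violating the other.

```python
# Toy illustration (hypothetical numbers) of two incompatible fairness notions.

def selection_rate(flagged, n):
    # Fraction of the group flagged high-risk (demographic parity compares these).
    return flagged / n

def true_positive_rate(flagged_pos, total_pos):
    # Fraction of truly high-risk patients who are flagged (equal opportunity).
    return flagged_pos / total_pos

# Two cohorts with different base rates of the condition.
n_a, pos_a = 100, 60   # Group A: 60% truly high-risk
n_b, pos_b = 100, 30   # Group B: 30% truly high-risk

# Suppose the model flags 80% of each group's truly high-risk patients
# (and, for simplicity, no one else).
flag_a, flag_b = 48, 24

# Equal opportunity is satisfied...
assert true_positive_rate(flag_a, pos_a) == true_positive_rate(flag_b, pos_b)
# ...but demographic parity is necessarily violated (0.48 vs. 0.24).
assert selection_rate(flag_a, n_a) != selection_rate(flag_b, n_b)
```

Forcing equal selection rates instead would require flagging different fractions of each group’s truly high-risk patients, which is the trade-off Muhammad is pointing at.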
There are plenty of good responses to Karens, including the point that worse health care options for vulnerable people can inflate everyone’s overall medical care costs. I doubt they’ll find this challenge to their privilege persuasive, though.
Evan: No, they won’t. Of course, this difficulty goes far beyond medical A.I. In a diverse and politically polarized society like the United States, we’ve seen, time and again, conflicts arise when people and institutions attempt to rectify systemic inequalities involving race, gender, or class. A.I. might be the latest innovation trend to be imbued with a futuristic aura. But the people who design it, deploy it, and have their lives altered by it are inherently linked to histories that influence who currently has power and who is disenfranchised.
How does this issue manifest in debates over how to optimize diagnostic medical A.I. models? Are there situations where it’s better to use a model that will lead to statistically worse outcomes for one group because it will improve the care another group receives?
Muhammad: Rectifying injustices and creating a more equitable future in tech and beyond will involve navigating rough seas where trade-offs between model performance and fairness may be needed across use cases, contexts, and stakeholders. Consider how different stakeholders might think about trade-offs in a disease prediction model for Covid-19. A physician may ask: Of the patients labeled high risk of dying from Covid-19 by the A.I. model, how many are actually going to be high risk based on my own medical knowledge and intuition? A patient who belongs to a minority or a vulnerable class might ask: What’s the probability that I will be incorrectly labeled as low-risk given that I am from a protected class? Will I be given the same clinical services according to the best evidence? From a societal perspective, one could ask: Are the risks balanced across all protected classes? Of course, it’s not always possible to satisfy all three criteria at once. Hence, we’re back to the question of trade-offs.
To illustrate, consider a simplified scenario where optimizing an A.I. model for a minority group would result in a slightly worse outcome for the majority group. If the disparity between the number of people affected before and after adjustments to the model isn’t significant, it’s an acceptable trade-off. There are, however, clever A.I. techniques to mitigate some of this disparity. For example, you could use one model for the overall population and another model for the minority population. This way, the majority model does not suffer because of the quest for parity. This approach works well, especially if the main goal is maximizing predictive performance. But for situations where the model predictions will be used for resource allocation, the problem becomes more complicated. Consider kidney organ donation, where African Americans are disadvantaged because kidney transplant algorithms use one set of rules for them and another set for everyone else. A recent study found that more than 700 African Americans would have gotten transplants if the algorithms did not treat African Americans differently. Algorithms like these make racial disparities in health care even worse.
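The two-model approach Muhammad mentions can be sketched minimally. Every name and threshold below is hypothetical; real systems would route patients to separately fitted models, but the structure is the same: recalibrating for one subpopulation leaves the other model untouched.

```python
# Minimal sketch of the "one model per subpopulation" idea. All names and
# thresholds here are hypothetical stand-ins for separately fitted models.

def make_grouped_predictor(models, default_model):
    """Route each patient to the model fitted for their group."""
    def predict(patient):
        model = models.get(patient["group"], default_model)
        return model(patient)
    return predict

def general_model(patient):
    # Decision threshold tuned on the overall population.
    return 1 if patient["risk_score"] > 0.7 else 0

def minority_model(patient):
    # Recalibrated threshold learned on the minority subpopulation alone;
    # adjusting it never changes the general model's behavior.
    return 1 if patient["risk_score"] > 0.5 else 0

predict = make_grouped_predictor({"minority": minority_model}, general_model)

# The same risk score yields group-calibrated decisions.
assert predict({"group": "minority", "risk_score": 0.6}) == 1
assert predict({"group": "majority", "risk_score": 0.6}) == 0
```

As the kidney transplant example shows, though, the same routing structure can entrench disparities when the group-specific rules themselves encode bias, so the choice of per-group rules matters as much as the architecture.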
Evan: Having emphasized a range of complications and problems, let’s end on a positive note. What are you most excited about in medical A.I.? I began this term’s philosophy of technology course by asking students for their pro/con A.I. list. What leaves them optimistic, and what leaves them worried? I think you can anticipate the responses — heightened anxiety about automated surveillance. Lots of hope for A.I. to contribute to medical breakthroughs, especially in personalized medicine.
Muhammad: What excites me the most is how this revolution in medicine and health care will unfold in the next few decades.
Specifically, highly personalized medicine and personalized behavioral nudging have great potential to improve population health. As data collection in health care and medicine increases and is coupled with privacy-preserving ways of doing machine learning in health care (like federated machine learning), it will become more common to do two things: Repurpose research datasets and identify phenotypes and genotypes to craft medical mitigation strategies that work well on sub- and even micro-populations of patients. A.I. also holds the promise of sifting through massive amounts of data to focus on historically neglected diseases, such as dengue, lymphatic filariasis, trachoma, and leishmaniasis. Since these tropical maladies mainly affect less-developed parts of the globe, not enough attention is paid to them. But with the democratization of A.I. and the greater dissemination of relevant skillsets, I foresee a near future where these conditions will finally be given long-overdue attention.
On the health and social equity front, A.I. is highlighting and quantifying problems that have existed for centuries. In tech culture, there is an adage that if something can be measured it can be changed. While that’s not entirely correct, it conveys the idea that quantifying problems can help us determine the extent of the problem and work towards clear end goals. For example, consider bias in human and machine systems. Since we can now collect data about decision-making, we can identify unfairness and discrimination statistically and put policies and measures in place to mitigate it.
The early modern philosopher Gottfried Wilhelm Leibniz envisioned a future where people will settle their disputes by following the mantra, “Let us calculate.” While that may never happen, the ideal might help us make more informed decisions to create improved and equitable outcomes in health care and beyond.