DeepMind’s Latest A.I. Health Breakthrough Has Some Problems
The Google machine learning company trumpeted its success in predicting a deadly kidney condition, but its results raise questions about data rights and patient diversity
Google-affiliated artificial intelligence firm DeepMind has been pushing into the healthcare sector for some time. Last week the London-based company synchronized the release of a set of new research articles — one with the U.S. Department of Veterans Affairs, and three with a North London hospital trust known as the Royal Free.
In one paper, published in the journal Nature, with co-authors from Veterans Affairs and University College London, DeepMind claimed its biggest healthcare breakthrough to date: that artificial intelligence (A.I.) can predict acute kidney injury (AKI) up to two days before it happens.
AKI — which occurs when the kidneys suddenly stop functioning, leading to a dangerous buildup of toxins in the bloodstream — is alarmingly common among seriously ill hospital patients, and contributes to hundreds of thousands of deaths in the United States each year. DeepMind’s bet is that if it can successfully predict which patients are likely to develop AKI well in advance, then doctors could stop or reverse its progression much more easily, saving lives along the way.
Beyond the headlines and the hope in the DeepMind papers, however, are three sober facts.
First, nothing has actually been predicted — and certainly not before it happens. Rather, what has happened is that DeepMind has taken a windfall dataset of historic incidents of kidney injury in American veterans, plus around 9,000 data-points for each person in the set, and has used a neural network to figure out a pattern between the two.
Second, that predictive pattern only works some of the time. The accuracy rate is 55.8% overall, falling the earlier the prediction is made, and the system generates two false positives for every accurate prediction.
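Read together, those two figures imply a low precision for the alerts themselves. A back-of-the-envelope sketch (the 55.8% detection rate and the two-to-one false-alert ratio are the reported figures; the function is purely illustrative):

```python
# Back-of-the-envelope arithmetic from the reported figures; purely illustrative.

def precision_from_fp_ratio(fp_per_tp: float) -> float:
    """Precision of an alert system that raises `fp_per_tp` false alerts
    for every correct one."""
    return 1.0 / (1.0 + fp_per_tp)

detection_rate = 0.558  # share of AKI episodes flagged in advance, as reported
fp_per_tp = 2.0         # two false positives per accurate prediction, as reported

precision = precision_from_fp_ratio(fp_per_tp)
print(f"precision = {precision:.1%}")  # about one alert in three is correct
```

In other words, even when the system does raise an alert, roughly two out of three alerts point at a patient who will not go on to develop the condition.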
Third, and most strikingly of all: the study was conducted almost exclusively on men — or rather, a dataset of veterans that is 93.6% male. Given the A.I. field’s crisis around lack of diversity and amplification of bias and discrimination, that fact is very important — and astonishingly understated.
A DeepMind spokesperson responded to this point by stating “the dataset is representative of the VA population and, as with all deep learning models, it would need further, representative data before being used more widely.” But this is at odds with how DeepMind has cast the study and its results — not as a tool for potential use with American veterans, or even as a tool that provides indicative results for men, but as a groundbreaking innovation with a general application.
Beyond these very significant deficiencies are a number of missed opportunities in DeepMind’s analysis. Some are foundational.
Kidney injury affects 13.4% of the patients in the Veterans Affairs dataset — a rate two-thirds of the 20% average the study cites for hospitalized U.S. patients in general. The difference is interesting in a population as specific as veterans, and suggests there may be something relevant in the VA’s patient characteristics, as well as in doctors’ choices and clinical practice. But the Nature study offers so little context about how VA clinicians actually detected and sought to prevent kidney injury, or about the features of the population and its variability, that this basic and essential information is impossible to interrogate.
Similarly, the researchers make no attempt to explain the A.I. model they used. How did it work? Why did they decide to construct the model in the way they did? Why was this a good conceptual fit for this particular dataset, and how effectively could it generalize to a wider population? How did the A.I. address the needs of specific patient types, and what impact would this algorithm have on them?
A spokesperson said that DeepMind aimed to justify all of the decisions made in the VA study through supplemental information and a non-peer-reviewed protocol paper. However, none of these questions were answered with precision and, as a result, the study offers few meaningful insights into either renal medicine or A.I. prediction. The study is littered with unexplained choices that may have been medically instructive, as well as omissions of details (such as 36 salient features discovered by the deep learning model, and what they mean) and outliers — distractions from a clean model, perhaps, but all representing real patients, at the end of the day.
But even if all these missed opportunities and deficiencies were addressed, there’s a much larger narrative to DeepMind’s U.S. health research. And that’s where the remainder of the newly released papers come in.
Veterans are far from DeepMind’s first attempt — and, quite plausibly, far from its first choice — at predicting kidney injury. The company has been trying to tackle avoidable patient harm since at least 2015, when it first struck a deal with the Royal Free London NHS Foundation Trust that gave it the fully identified health records of over 1.6 million patients.
By 2016, and as a direct result of receiving such a treasure trove of private information, DeepMind was embroiled in a major data scandal over legality. In 2017, the U.K.’s data watchdog ruled that patient rights had been breached in several major respects for what turned out to be DeepMind’s gain. The whole saga caused significant reputational damage that has simmered ever since, and by late 2018, Google moved unilaterally to absorb and rehabilitate DeepMind’s healthcare arm into its own Google Health division — a move that remains incomplete because none of DeepMind’s healthcare partners have agreed to transfer their contracts fully to Google.
DeepMind’s work with Veterans Affairs points to a crucial question. Clearly the dataset that DeepMind holds on patients in Britain is three times as large and significantly more diverse than the dataset on U.S. veterans. So why would DeepMind choose to do the research that led to the Nature paper with American veterans instead of Royal Free patients?
DeepMind rationalizes the choice as simply one of working with different partners for different projects. But the more plausible answer could be that the legal and reputational risk was perceived to be greater in the U.K., particularly around reusing the controversial Royal Free patient data.
Nevertheless, the Veterans Affairs data isn’t without complication as a point of comparison. Although individual patients were de-identified in the United States before the data was transferred to DeepMind, with 9,000 data-points per person collected across a period of up to 15 years, it remains likely that at least one patient could be reidentified using expert methods. That is all it takes to bring the processing within the scope of personal data, and therefore of the European Union’s General Data Protection Regulation.
Two other legacies remain from the initial data scandal between DeepMind and Royal Free. First, despite ongoing privacy concerns implicated by a decidedly dubious legal basis for processing, the full data trove has remained in DeepMind’s possession, with the blessing of the U.K. regulator. (DeepMind analogizes itself to a clinical data storage system, ready to serve up records on each and every Royal Free patient, even if those patients haven’t set foot in the hospital in nearly a decade and have no identifiable need for care.) This demonstrates, as does the gift from Veterans Affairs, that DeepMind is able to gain and maintain access to incredibly valuable datasets in a way not practically realizable and defensible by others. Second, both DeepMind and the Royal Free have been determined to make a good news story of it all, overlooking any inconvenient concerns.
It was in this spirit that in early 2017, Royal Free started pushing the clinical deployment of Streams, DeepMind’s clinical smartphone app built off the back of the data transfer and designed to interface with patient data and tests and to generate push alerts. Importantly, and ironically, Streams is not an A.I. tool. It is driven by standard tests and standard formulae. But DeepMind is an A.I. company, so Streams has always been destined to become an A.I. tool.
The other papers coordinated to be released with the Veterans Affairs study therefore make a connection that has been years in the making. Simultaneously, if rather shakily and with multiple deficiencies, some kind of A.I. model for predicting kidney injury — even if it is a model for American veteran males — is presented alongside a set of evaluations of Streams as a clinical tool.
Those evaluation papers, however, give a decidedly lukewarm impression of Streams. In a dramatic evaluation in Nature Digital Medicine, the use of Streams is shown to have no statistically significant beneficial impact on patients’ clinical outcomes, despite all the fanfare.
An associated user study in the Journal of Medical Internet Research (JMIR) — on 19 of the total set of 47 Royal Free clinicians who shared the six iPhones that carried Streams — in fact tells us that the app, for all its promises, creates more work, more anxiety, and probably would require hiring more people to monitor and respond to alerts, often unnecessarily.
But finally, an evaluation paper about costs, also in the JMIR, claims that, assuming no change in staffing (and therefore simply expecting clinicians to absorb the workload and anxiety identified in the user study), Streams could deliver mean cost savings of just under $2,600 per patient. Delving into the supplementary file containing the data behind the cost estimate, it appears that the control hospital that did not use Streams also saw significant reductions in the major cost contributors to that figure, several of them statistically significant, in a way that deserves comparison and could challenge the central claim.
But to DeepMind, perhaps these are extraneous details, given the kicker of the whole exercise. “We did not include the costs of providing the technology, and therefore, it is not possible to judge whether or not it would be cost saving overall,” states the paper, co-written by DeepMind co-founder Mustafa Suleyman. “Our results suggest that the digitally enabled care pathway would be cost saving, provided provision of the technology costs less than around £1,600 [$1,945] per patient spell.”
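The break-even condition in that quote is simple to state: the pathway only saves money if the per-patient cost of providing the technology, which the paper does not report, stays below the claimed threshold. A sketch using the paper's quoted figure (the example costs are hypothetical):

```python
# The paper's stated break-even condition; the example costs are hypothetical.

BREAKEVEN_GBP = 1600  # claimed threshold per patient spell, per the JMIR cost paper

def is_cost_saving(tech_cost_per_spell_gbp: float) -> bool:
    """True if providing the technology costs less than the claimed threshold."""
    return tech_cost_per_spell_gbp < BREAKEVEN_GBP

print(is_cost_saving(1000))  # True: below the threshold, a net saving
print(is_cost_saving(2000))  # False: above it, a net cost
```

The entire cost-saving claim, in other words, rests on a variable the paper leaves blank.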
The authors of the JMIR study about costs could not be reached for comment, but a DeepMind spokesperson emphasized that, although the Nature Digital Medicine study demonstrated no significant improvement in clinical outcomes through the use of Streams, the evaluations detailed improvements in the reliability and speed of AKI recognition, the time frames in which some key treatments and specialist care were delivered, as well as a claimed reduction in healthcare costs.
Streams, then, seems typical of DeepMind’s way of working. It offers few overall gains in clinical outcomes, creates anxiety and additional workload for physicians, and was built on the back of deeply controversial access to patients’ data. Whatever Google and DeepMind are planning to do in the United States, they need to overhaul their attitude to the most basic priorities of rights, explanations, and costs to humans, not machines. Those come first, before profit — or rushing to proclaim that A.I. has a central place in the future of medicine.