Why ‘Anonymized Data’ Isn’t So Anonymous

Cleaning data of ‘personally identifying information’ is harder than you might think

Credit: Pattanaphong Khuankaew / EyeEm/Getty Images

In 2015, Latanya Sweeney, a researcher who studies data anonymization and privacy, published research specifically targeting the deanonymization of HIPAA-protected data in Washington. In that state (and many others), it is possible for companies and individuals to purchase anonymized medical record data. Sweeney purchased data through legal channels that included, as she noted, “virtually all hospitalizations occurring in the state in a given year” and myriad details about those hospital visits, including diagnoses, procedures, the attending physician, a summary of charges, how the bill was paid, and more. The records were anonymous in that they did not contain the patients’ names or addresses, but they did include patients’ five-digit U.S. postal codes.

Then, using an archive of Washington state news sources, Sweeney searched for any article printed in 2011 that contained the word “hospitalized.” The search turned up 81 articles. By analyzing the newspaper articles and the anonymized dataset, Sweeney “uniquely and exactly matched medical records in the state database for 35 of the 81 news stories,” she wrote. Those news stories also contained the patient’s name, effectively nullifying the anonymization efforts for these 35 patients.
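
This kind of matching is often called a linkage attack: details from a public source that names a person are compared against the "anonymous" records on the fields they share. Here is a minimal Python sketch of the idea. The records, the news story, and the field names (`zip`, `admit_month`, `age`, `diagnosis`) are all made up for illustration, and the sketch does not reproduce Sweeney's actual procedure or data.

```python
# Hypothetical "anonymized" hospital records: no names, but other details remain.
hospital_records = [
    {"zip": "98101", "admit_month": "2011-03", "age": 61, "diagnosis": "fracture"},
    {"zip": "98101", "admit_month": "2011-03", "age": 34, "diagnosis": "concussion"},
    {"zip": "99201", "admit_month": "2011-07", "age": 61, "diagnosis": "burns"},
]

# Hypothetical details pulled from a news story that names the patient.
news_story = {"name": "J. Doe", "zip": "98101", "admit_month": "2011-03", "age": 34}


def link(story, records, keys=("zip", "admit_month", "age")):
    """Return the records that agree with the story on every shared key."""
    return [r for r in records if all(r[k] == story[k] for k in keys)]


matches = link(news_story, hospital_records)
if len(matches) == 1:
    # Exactly one match ties the name to the entire medical record.
    print(news_story["name"], "->", matches[0]["diagnosis"])
```

The important case is a single match. Once only one record fits, everything else in that record is attached to a name.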

Data powers the modern world. Data about us controls which news, movies, and advertisements we see. Data determines which of our friends’ posts arrive in our social media feeds. Data drives which potential romantic partners appear in our dating apps. Scientific research, which has long been data focused, continues to push further into the realm of big data. Researchers compile and process massive datasets — and the platforms of surveillance capitalism are right there with them.

Much of this data is sensitive. Google’s data stockpile can include your complete search history over time. Depending on what you search for, it might reveal a bout of depression, a private kink, a medical condition, and much more. Facebook’s stockpile of our past behavior, comments, and photos is quite revealing for many people. Few of us would be comfortable giving a new acquaintance the complete history of our credit card activity. Our medical data is protected by HIPAA because we recognize its sensitivity.

So, why do we give away our most private information? Most people glean significant benefits from this data collection. Google’s data makes search results better and helps Gmail filter out spam. Your credit card history helps your bank detect fraudulent purchases. Aggregate purchase history can help stores manage their inventory and reduce waste. Medical data helps researchers and doctors invent new drugs and better treatment plans. Indeed, nearly all forms of research science rely heavily on data to make and evaluate claims.

But these benefits do not come without risk. Governments, corporations, and research institutes continue to roll out massive data collections. This collection is only the start of your data’s journey. The data is repackaged, combined with data from other sources, and sold through data brokers, legitimate and otherwise. The following data is for sale through either legal or illegal channels — and frequently both.

Even if you have “nothing to hide,” in the wrong hands, this knowledge makes you more exploitable. Because of this, there are ongoing efforts to scrub data of personally identifiable information when storing or selling it. In some cases, there are legal requirements to anonymize data, such as HIPAA’s requirements on medical data (though HIPAA’s legal protections are not as strong as most people think). Similarly, the EU’s new General Data Protection Regulation (GDPR) places fewer restrictions on the use of anonymized data compared to data with personally identifying information.

In other cases, companies make efforts to anonymize the data they collect as part of their business strategy. Apple is a good example of this. Apple doesn’t sell customer data, and having a lot of data could make the company a target for hackers. Instead of collecting and processing massive datasets like Google and Facebook, Apple has reduced its data collection, made significant efforts to anonymize the data it does collect, and has leveraged its privacy efforts in its marketing materials.

These measures are laudable and worth pursuing. Unfortunately, research has shown that many attempts to anonymize data are vulnerable to reidentification tactics, especially when alternative data sources are available with some degree of overlap.

One of the landmark case studies in deanonymization, published in 2008, involved a dataset of Netflix users and their movie ratings. The dataset was anonymized and published as part of a competition to improve the Netflix recommendation engine. The anonymization tactics included randomly changing some of the ratings and rating dates for the roughly 480,000 users who were included in the dataset.

Despite these perturbations to the data, researchers concluded that “very little auxiliary information is needed [to] deanonymize an average subscriber record from the Netflix prize dataset. With eight movie ratings (of which two may be completely wrong) and dates that may have a 14-day error, 99% of records can be uniquely identified in the dataset.” The research showed that for many people, much less information is required to establish unicity: “For 68% [of users], two ratings and dates (with a three-day error) are sufficient.”
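
To see why a few approximate ratings can be enough, consider a toy version of the matching in Python. It is a simplified scoring invented for illustration, not the researchers' published algorithm. The records, movie labels, and the `max_days` tolerance are all assumptions.

```python
from datetime import date

# Hypothetical anonymized records: record id -> {movie: (rating, date rated)}.
records = {
    "u1": {"A": (5, date(2005, 1, 3)), "B": (2, date(2005, 2, 10)), "C": (4, date(2005, 3, 1))},
    "u2": {"A": (5, date(2005, 6, 1)), "B": (3, date(2005, 6, 2)), "D": (1, date(2005, 6, 9))},
}

# What an adversary knows about one target; the dates are only approximate.
aux = {"A": (5, date(2005, 1, 5)), "B": (2, date(2005, 2, 8))}


def score(record, aux, max_days=3):
    """Count known ratings that the record matches within the date tolerance."""
    hits = 0
    for movie, (rating, when) in aux.items():
        if movie in record:
            rec_rating, rec_date = record[movie]
            if rec_rating == rating and abs((rec_date - when).days) <= max_days:
                hits += 1
    return hits


scores = {uid: score(rec, aux) for uid, rec in records.items()}
best = max(scores, key=scores.get)  # the record that fits the known ratings best
```

In this toy data, record `u1` matches both known ratings and `u2` matches neither, so the target stands out even though the adversary's dates are off by a couple of days.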

Building on the result that a handful of ratings could be used to identify a unique — but still unnamed — individual, the researchers turned to IMDb’s publicly available ratings to prove they could also unmask individuals. After gathering a sample of ratings from 50 IMDb users, researchers applied their deanonymization methods and were able to identify two of the 50 users with very high confidence.

Movie ratings might seem innocuous — they are clearly less sensitive than medical records — but they can still be revealing. The researchers gave this example from one of the two identified individuals: Many of the movies this person rated on Netflix were not rated by this person on IMDb. Deanonymizing the Netflix dataset revealed information that was not already public. Among those movies were Power and Terror: Noam Chomsky in Our Times, Fahrenheit 9/11, Jesus of Nazareth, The Gospel of John, Bent, and Queer as Folk. Their ratings of these six movies could potentially reveal something about the subject’s political views, religious affiliation, and sexual orientation — all three of which are used to discriminate against individuals in various ways.

Obviously, enjoying (or hating) a couple of movies doesn’t really prove anything about someone’s ideology, but, especially in oppressive regimes, it might not matter. During the height of McCarthyism, many Americans were accused of being communists, blacklisted, and even jailed based on unsubstantiated claims. Modern authoritarian regimes are similarly uncommitted to proof beyond a reasonable doubt.

The result is remarkable given that both the Netflix and IMDb samples were random — there was no assurance that any of the 50 random IMDb users were even in the Netflix dataset, especially given the relatively small sample size of IMDb users. On one hand, the Netflix dataset included ratings from more than 480,000 subscribers, so deanonymizing two of them feels like a drop in the bucket. On the other hand, if the researchers had sampled 480,000 IMDb users, they could surely have identified many more.

For someone to come to harm, only their individual data needs to be deanonymized, not the whole dataset. Connecting one person of interest to their HIV-positive status, political affiliation, sexual orientation, or gender identity, among other things, can represent a serious breach of privacy for that individual and put them at risk. This represents a special challenge in our data-driven society: Data is more powerful in aggregate, but the more we collect, the easier it is to identify someone in the dataset. As more data about us becomes publicly available, these deanonymizing strategies become easier. The reason researchers scraped reviews from only 50 IMDb users was to comply with IMDb’s terms of service agreement — but not everyone plays by the rules.

Datasets are increasingly getting leaked and stolen. FEMA leaked records on 2.3 million people earlier this year. In the infamous Equifax hack, information on more than 145 million people was stolen. Troubling databases are sometimes left unsecured, like the one discovered by a security researcher containing names, addresses, and the supposed “breed readiness” of more than 1.8 million Chinese women.

According to the Privacy Rights Clearinghouse, a nonprofit that has maintained a list of database breaches since 2005, 8,804 data breaches have occurred in those 14 years, exposing more than 11.5 billion records. That means we’ve averaged 1.7 data breaches and 2.2 million records exposed per day since 2005. This is just what’s available due to crime and negligence. When motivated entities start putting all this data together, every new anonymized dataset will be increasingly susceptible to this kind of correlation.
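
The per-day averages are easy to check. A quick sketch in Python, assuming 365-day years and ignoring leap days:

```python
breaches = 8_804
records_exposed = 11_500_000_000
days = 14 * 365  # 14 years, ignoring leap days (an approximation)

breaches_per_day = breaches / days
records_per_day = records_exposed / days

print(f"{breaches_per_day:.1f} breaches per day")
print(f"{records_per_day / 1e6:.2f} million records per day")
```

Run as written, this works out to roughly 1.7 breaches and about 2.25 million records per day, in line with the figures above.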

These concerns aren’t necessarily news to privacy-focused academics. In 2010, privacy lawyer Paul Ohm published a detailed examination of these issues for the UCLA Law Review titled “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.” Nearly a decade ago, Ohm argued that “[a]lthough it is true that a malicious adversary can use PII [personally identifiable information] such as a name or social security number to link data to identity, as it turns out, the adversary can do the same thing using information that nobody would classify as personally identifiable.”

Ohm references some of Sweeney’s earlier research, where she found that 87% of individuals in the 1990 U.S. census could be uniquely identified by just three pieces of information: their birth date (day, month, and year), their sex, and their five-digit postal code. Ohm also references the Netflix competition research and other examples before concluding that “using traditional, release-and-forget, PII-focused anonymization techniques, any data that is even minutely useful can never be perfectly anonymous, and small gains in utility result in greater losses for privacy.”
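
Measuring this kind of uniqueness takes only a few lines. The Python sketch below counts how many rows in a tiny, made-up population are the only ones with their particular combination of values. The column names and rows are hypothetical and have nothing to do with the census data itself.

```python
from collections import Counter

# Hypothetical population with no names, only a few everyday attributes.
people = [
    {"birth_date": "1970-04-12", "sex": "F", "zip": "98101"},
    {"birth_date": "1970-04-12", "sex": "M", "zip": "98101"},
    {"birth_date": "1985-09-30", "sex": "F", "zip": "98101"},
    {"birth_date": "1985-09-30", "sex": "F", "zip": "98101"},
    {"birth_date": "1992-01-05", "sex": "M", "zip": "99201"},
]


def fraction_unique(rows, keys):
    """Share of rows whose combination of `keys` appears exactly once."""
    combos = [tuple(row[k] for k in keys) for row in rows]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(rows)


print(fraction_unique(people, ("birth_date", "zip")))         # 1 of 5 rows unique
print(fraction_unique(people, ("birth_date", "sex", "zip")))  # 3 of 5 rows unique
```

Each extra attribute splits the population into smaller groups, which is why a short list of ordinary facts can single out most people.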

Research continues to corroborate the core result — that a shockingly small amount of information might be personally identifying, especially given the enormous amount of data available for an adversary to correlate against.

In 2013, researchers found that location data is highly unique, making it harder to anonymize. The researchers found that with a dataset constructed by recording which cellphone tower a phone was connected to once per hour, 95% of devices could be uniquely identified by only four data points; 50% of devices could be uniquely identified with just two data points. If the data is more granular (GPS tracking instead of cellphone towers, or up to the minute rather than up to the hour), the matching becomes easier.

In 2018, the New York Times described how reporters were able to legally obtain a dataset of “anonymized” location data and then identify individuals in that dataset. For one person featured in the Times story, the dataset included a location record once every 21 minutes on average. It was detailed enough that Times reporters could identify when she went to the doctor, approximately how long she was there, when she visited her ex-boyfriend, when she went to the gym, and more.

A lot of anonymous datasets can indirectly give away your location, like in-person credit card purchases or hospital visits. But an adversary could easily go old school as well: If you know where someone lives, you can quickly filter a large anonymous dataset to just the individuals who are frequently nearby in the mornings and evenings. If you know where that person works, you can filter further. For a number of people in such a dataset, these two facts will be enough to deanonymize the rest of their location data.
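
The home-and-work filter can be sketched in a few lines of Python. The pings, device ids, cell names, and hour ranges below are all hypothetical, and a real dataset would be far larger and messier.

```python
# Hypothetical anonymized pings: (device id, hour of day, coarse location cell).
pings = [
    ("d1", 7, "cell_12"), ("d1", 13, "cell_40"), ("d1", 22, "cell_12"),
    ("d2", 7, "cell_12"), ("d2", 13, "cell_77"), ("d2", 22, "cell_12"),
    ("d3", 8, "cell_55"), ("d3", 13, "cell_40"), ("d3", 23, "cell_55"),
]


def devices_seen(pings, cell, hours):
    """Devices with at least one ping in `cell` during the given hours."""
    return {dev for dev, hour, where in pings if where == cell and hour in hours}


home_hours = set(range(0, 9)) | set(range(20, 24))  # mornings and evenings
work_hours = set(range(9, 18))

# The adversary already knows where the target lives and works (assumed cells).
near_home = devices_seen(pings, "cell_12", home_hours)
near_work = devices_seen(pings, "cell_40", work_hours)
candidates = near_home & near_work  # devices that fit both facts
```

Two devices fit the home address and two fit the workplace, but only `d1` fits both. Every other ping from that device now belongs to a known person.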

Location data can be extremely revealing. Imagine the past five years of your location data in the hands of a con man, an extortionist, an agent of an oppressive regime, or even just a less-than-scrupulous hiring manager. Are there places you’ve visited that could be used against you? Even in liberal Western democracies like the United States, people have been targeted for harassment, sent death threats, and even killed simply for being at a Planned Parenthood. Imagine what agents of North Korean leader Kim Jong Un or Philippine President Rodrigo Duterte might do to dissidents with broad swaths of location data.

The hardest part of this problem is that despite the potential for abuse, good data creates a lot of positive social value. We want medical researchers to create new drugs and treatments, and we want them to evaluate the effectiveness of those treatments. We want our houses to optimally govern their own temperatures to increase efficiency. We want Google to tell us there’s congestion on the road ahead and that we should reroute. We want the benefit of big data — without the deanonymization downsides.

There is no silver bullet. We have to make trade-offs. We’ve already ceded some of our privacy, and in all likelihood we’ll give up more in the future, but there are ways to reduce the potential for abuse.

Securing sensitive data and preventing unauthorized access to databases must be a priority for everyone who collects data. Security best practices have, sadly, been an afterthought for many collecting personal data. There will be more data breaches, but through organizational commitments to security, we can make them less common, harder to execute, and riskier for the attackers.

Regulators should continue improving data privacy rights for people across the globe. The GDPR incentivizes companies to store less data and make efforts to anonymize the data they do store — this is good even if it’s not 100% effective. If breaking into a database becomes less likely to yield immediately useful data, fewer people will do it. Regulators also need to take a harder look at data brokers and take action to ensure that data being sold is adequately anonymized.

Similarly, everyone involved in data collection and storage needs to stay up to date on the latest anonymity research. Tactics like differential privacy — where some amount of random noise is added to datasets before they are published — can reduce the effectiveness of data correlation attacks. Apple and Google have both made significant efforts to adopt differential privacy strategies, and others should follow suit.
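
As a rough illustration of the noise-adding idea, here is a small Python sketch of one common building block, the Laplace mechanism, applied to a simple count. The data, the `epsilon` value, and the function names are assumptions made for the example. It is not a description of how Apple or Google implement differential privacy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon, rng):
    """Count matching rows, then add Laplace noise with scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so a
    scale of 1/epsilon is the usual choice for counting queries.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 52, 67, 29, 38]  # made-up data
answer = noisy_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
```

Smaller values of `epsilon` mean more noise and more privacy, at the cost of less accurate answers. Python's `random` module is fine for a demonstration, but a production system would want a vetted library and a carefully chosen noise source.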

In his 2010 survey, Ohm noted that there was a fundamental trade-off between the usefulness of a dataset and its ability to be anonymized. As a society, we need to have a more candid conversation about this trade-off. Most of us genuinely want the power of big data to be unleashed, because it can genuinely improve the world — and our own lives. Nevertheless, the mere existence of massive amounts of data is a privacy risk in and of itself. When we give up too much privacy, society degrades, and in the wrong hands, big data could ravage our freedoms.

A curious human on a quest to watch the world learn. I teach computer programming and write about software’s overlap with society and politics.
