General Intelligence

Facebook Scraped 1 Billion Pictures From Instagram to Train Its A.I. — But Spared European Users

The team purposely excluded Instagram images from the European Union, likely because of GDPR

Photo illustration source: Alexander Koerner/Stringer/Getty Images

OneZero’s General Intelligence is a roundup of the most important artificial intelligence and facial recognition news of the week.

Facebook researchers announced a breakthrough yesterday: They have trained a “self-supervised” algorithm using 1 billion Instagram images, showing that the algorithm doesn’t need human-labeled images to learn to accurately recognize objects.

Typically, the most accurate image recognition algorithms require humans to label images as containing dogs, horses, people, or any other subject. The algorithm then finds similarities between images that humans have indicated contain the same objects. Facebook’s chief A.I. scientist Yann LeCun has been on a mission for decades to reduce A.I.’s reliance on labels, calling that goal the “holy grail” of A.I.

But Facebook didn’t just select any billion Instagram images to train the algorithm. The team purposely excluded Instagram images from the European Union, noting in its paper that images were “random, public, and non-EU images.” While the rest of the world’s Instagram images are fair game, EU residents don’t have to worry about their images being used to generate Facebook’s next big algorithm.

OneZero asked Facebook whether the exclusion was motivated by the EU’s GDPR, which gives users greater insight into how companies use their data and protects against data use without consent. A Facebook spokesperson acknowledged the question, but did not immediately reply to the request for comment.

Whether using the data would have violated GDPR or Facebook simply wanted to avoid the appearance of impropriety, it’s likely that the law had a chilling effect on the use of personal data.

Jules Polonetsky, CEO of Future of Privacy Forum, told OneZero in a message that it’s not unusual for companies to err on the side of caution when collecting data in Europe.

“[It’s] quite common for global companies to be more limited in how they use data subject to GDPR,” he wrote, noting that explicit informed consent is often required for use of sensitive data.

Instagram’s terms of use give Facebook enormous freedom to do whatever it wants with your data, by giving the company a license to use, replicate, and modify any information you upload to the platform. But EU courts and regulators have decided that large-scale scraping of personal data, especially images, violates GDPR. For instance, a German court decided Clearview AI’s data scraping practices violated the European privacy law. In another decision against web scraping, Polish regulators found that a digital marketing company had not adequately obtained users’ consent when processing their data.

Facebook’s data practices have been widely criticized around the world, whether under GDPR, under newer privacy laws, or, more recently, in the United States. A recent settlement in Illinois left Facebook with a $650 million bill for violating the state’s Biometric Information Privacy Act by processing images with facial recognition.

In early May 2018, just weeks before GDPR went into effect, Facebook released another research paper, one that relied on nearly a billion images scraped from Instagram. Back then, there was no EU carve-out. Even for 2019 research, after GDPR came online, Facebook didn’t specifically exclude EU users. But it seems like the company is finally coming around to the fact that it has to play by the legal rules, especially as it faces mounting pressure from European legislators in 2021.

Going forward, it seems EU Instagram users don’t have to worry about whether their images are being scooped up into this iteration of Facebook’s A.I. research, which automatically refreshes the dataset with new images every 90 days. That might especially be a boon to photographers or digital content creators whose work is being used to increase Facebook’s A.I. chops. Users in the U.S., however, will have to rely on a state-by-state patchwork of regulation. So far, only California has enacted a broad data privacy law, although it still falls short of GDPR.

Senior Writer at OneZero covering surveillance, facial recognition, DIY tech, and artificial intelligence. Previously: Qz, PopSci, and NYTimes.