A.I.’s Most Important Dataset Gets a Privacy Overhaul, a Decade Too Late
The authors of the image dataset that changed the world have made one long-overdue tweak
ImageNet is arguably the most important dataset in recent A.I. history. It’s a collection of millions of images that were compiled in 2009 to test a simple idea: If a computer vision algorithm had more examples to learn from, would it be more accurate? Were the underperforming algorithms of the day simply starved for data?
To encourage others to test the same hypothesis, the authors of ImageNet started a competition to see who could train the most accurate algorithm using the dataset. By 2012, results from the academic competition had attracted the full attention of tech industry giants, who began to compete and hire winners. It is no exaggeration to say that the results from the ImageNet competition gave rise to the A.I. boom we’re in today.
Now, more than a decade after its debut, ImageNet’s authors have made a tweak to the dataset that changed the world: They’ve blurred all the faces.
“The dataset was created to benchmark object recognition — at a time when it barely worked,” the researchers wrote in a blog post announcing the change. “Today, computer vision is in real-world systems impacting people’s Internet experience and daily lives. An emerging problem now is how to make sure computer vision is fair and preserves people’s privacy.”
ImageNet is actually two datasets: a full version that contains more than 14 million images, and a smaller one used for the competition, with just over 1 million images. The change has been made to the smaller, more widely available version, which spans 1,000 categories of images and is freely available on the ImageNet website. Only three of those categories are explicitly categories of people: “scuba diver,” “bridegroom,” and “baseball player.” The full version of ImageNet, which spans more than 10,000 classes, contains 2,832 subcategories in its “person” category and requires permission from the ImageNet team to download.
This isn’t the first time ImageNet authors have had to edit one of their datasets in response to an ethical problem, like racial inequities in the data. In 2019, artist Trevor Paglen released “ImageNet Roulette,” where people could upload selfies and see what ImageNet class they would be sorted into. People with darker skin were invariably sorted into offensive and reductive classes, like “Black African” or “Negroid,” rather than classes like “handsome” or “doctor.”
Soon after, ImageNet authors audited the dataset, released a paper detailing the results, and removed the offending sets of image categories from the larger dataset. The change deleted more than 600,000 images.
ImageNet’s history sheds some light on how these offensive and privacy-invasive images made it into the dataset in the first place. The dataset’s creators didn’t dream up these categories by themselves. Instead, they relied on a preexisting database of words called WordNet, which was created in the late 1980s by Princeton psychologist George Miller as an effort to organize the English language into a hierarchy. The “furniture” category, for example, would contain the words “desk” and “chair.” You might recognize this as similar to the way a computer’s folders work.
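To make the folder analogy concrete, here is a minimal Python sketch of a word hierarchy stored as nested dictionaries. It is not WordNet's actual data or format: the `HIERARCHY` contents (including "armchair") and the `path_to` helper are illustrative assumptions.

```python
# Toy word hierarchy in the spirit of WordNet.
# The entries are illustrative only, not real WordNet data.
HIERARCHY = {
    "furniture": {
        "desk": {},
        "chair": {"armchair": {}},
    },
}


def path_to(word, tree, prefix=()):
    """Return the chain of categories leading to `word`, or None if absent."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == word:
            return here
        found = path_to(word, children, here)
        if found is not None:
            return found
    return None


# Prints a folder-style path: furniture/chair/armchair
print("/".join(path_to("armchair", HIERARCHY)))
```

Each word sits inside its parent category the way a file sits inside a folder, which is why a single search can recover the full chain of categories above it.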
ImageNet authors decided to construct their dataset by taking the nouns from WordNet and scraping the internet to find examples. They then paid Amazon Mechanical Turk workers to match images to the correct category. This strategy inherited WordNet’s worst qualities and ultimately collected images of thousands of people without their consent. Data labeling through the use of Mechanical Turk has also been shown to have serious shortcomings, as workers might not understand the data they’re labeling or may label incorrectly to satisfy poorly defined objectives of the task. But compiling and tagging millions of images is labor-intensive and difficult to automate, because the labeled images are themselves the data needed to teach A.I. algorithms to do that tagging in the first place.
However, ImageNet’s authors also shared evidence that making A.I. datasets private by default doesn’t meaningfully hinder how well algorithms trained on them perform. Algorithms trained on the blurred images faced only a “marginal” loss of accuracy on certain tasks, the blog post says.
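For readers curious what blurring a face involves, the sketch below applies a simple box blur to one rectangular region of a tiny grayscale image. It is not the ImageNet team's method, which the blog post excerpt here does not describe: the `blur_region` function is hypothetical, the face's bounding box is assumed to be already known, and face detection itself is not shown.

```python
def blur_region(image, box, radius=1):
    """Return a copy of `image` with a box blur applied inside `box`.

    image: list of rows of grayscale ints.
    box: (top, left, bottom, right), with bottom and right exclusive.
    The box is assumed to come from a face detector, which is not shown.
    """
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    out = [row[:] for row in image]  # copy so the original stays untouched
    for y in range(max(top, 0), min(bottom, height)):
        for x in range(max(left, 0), min(right, width)):
            ys = range(max(y - radius, 0), min(y + radius + 1, height))
            xs = range(max(x - radius, 0), min(x + radius + 1, width))
            # Average the neighbours from the original image, not the copy.
            values = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(values) // len(values)
    return out


# A 4x4 image with one bright pixel standing in for facial detail.
image = [
    [0, 0, 0, 0],
    [0, 90, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
blurred = blur_region(image, (0, 0, 3, 3))
for row in blurred:
    print(row)
```

Inside the box, the single bright pixel is smeared across its neighbours so the original detail can no longer be read off directly, while pixels outside the box are left alone.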
ImageNet paved the way for large-scale A.I. databases in the 2010s. But as its flaws have been laid bare, it’s becoming refreshingly clear that nobody is exempt from the implications data collection can have—not even the dataset that made the field what it is today.