General Intelligence

A.I.’s Most Important Dataset Gets a Privacy Overhaul, a Decade Too Late

The authors of the image dataset that changed the world have made one long-overdue tweak

Dave Gershgorn
Published in OneZero
3 min read · Mar 19, 2021
Photo illustration source: John M. Lund Photography Inc.

ImageNet is arguably the most important dataset in recent A.I. history. It’s a collection of millions of images that were compiled in 2009 to test a simple idea: If a computer vision algorithm had more examples to learn from, would it be more accurate? Were the underperforming algorithms of the day simply starved for data?

To encourage others to test the same hypothesis, the authors of ImageNet started a competition to see who could train the most accurate algorithm using the dataset. By 2012, results from the academic competition had attracted the full attention of tech industry giants, who began to compete and hire winners. It is no exaggeration to say that the results from the ImageNet competition gave rise to the A.I. boom we’re in today.

Now, more than a decade after its debut, ImageNet’s authors have made a tweak to the dataset that changed the world: They’ve blurred all the faces.

“The dataset was created to benchmark object recognition — at a time when it barely worked,” the researchers wrote in a blog post announcing the change. “Today, computer vision is in real-world systems impacting people’s Internet experience and daily lives. An emerging problem now is how to make sure computer vision is fair and preserves people’s privacy.”

ImageNet is actually two datasets: a full version that contains more than 14 million images, and a smaller one used for the competition, with just over 1 million images. This change has been made to the smaller, more accessible version of the dataset, which hosts 1,000 categories of images and is freely available on the ImageNet website. Only three of those categories actually include images of people: “scuba diver,” “bridegroom,” and “baseball player.” The full version of ImageNet, which spans more than 10,000 classes, contains 2,832 subcategories in its “person” category and requires permission from the ImageNet team to download.

This isn’t the first time ImageNet authors have had to edit one of their datasets in response to an ethical problem…

Dave Gershgorn is a senior writer at OneZero covering surveillance, facial recognition, DIY tech, and artificial intelligence. Previously: Qz, PopSci, and NYTimes.