A Privacy Dustup at Microsoft Exposes Major Problems for A.I.
The most important datasets are rife with sexism and racism

The results you get when you search for an image on Google have something in common with Siri’s ability to listen to your commands. And they each share DNA with the facial recognition projects rolling out to the world’s airports and beyond.
All of these features are fed by enormous piles of data. These datasets might contain thousands of pictures of faces or gigabytes of audio logs of human speech. They are the raw material used by nearly everyone who wants to work with A.I., and they don’t come cheap. It takes expertise and investment to build them. That’s why corporate and academic institutions build their own datasets and only sometimes share them with the world; the ones they do share are known as open datasets.
But open doesn’t automatically mean good — or ethical. Last week, Microsoft took down its MS Celeb facial recognition dataset following a report from MegaPixels, a watchdog project dedicated to uncovering how open datasets are used to build surveillance infrastructure across the world. MegaPixels showed that Microsoft’s data included photographs not just of public people like celebrities, but of private citizens and journalists as well — and that it had been downloaded by private U.S. researchers and state-backed surveillance operations in China.
“There’s clearly a large misalignment between what researchers and the general public think is acceptable,” Adam Harvey, the creator of MegaPixels, tells OneZero.
MS Celeb was created for a competition in 2016. A.I. researchers used the dataset, which included 10 million images of celebrities collected from around the internet, to train their facial recognition algorithms, then competed for the highest accuracy on a standardized set of face images. Following the competition, MS Celeb was made freely available online for anybody to download and use to train their own facial recognition algorithms. But no one realized that the dataset included images of private people — none of whom knew they were included in the data — until MegaPixels pointed it out.
This isn’t just Microsoft’s problem, however. While MS Celeb is under intense scrutiny…