A Privacy Dustup at Microsoft Exposes Major Problems for A.I.

The most important datasets are rife with sexism and racism

Dave Gershgorn
OneZero


Credit: Andrej Karpathy/Stanford

The results you get when you search for an image on Google have something in common with Siri’s ability to listen to your commands. And they each share DNA with the facial recognition projects rolling out to the world’s airports and beyond.

All of these features are fed by enormous piles of data. These datasets might contain thousands of pictures of faces or gigabytes of audio logs of human speech. They are the raw material used by nearly everyone who wants to work with A.I., and they don’t come cheap. It takes expertise and investment to build them. That’s why corporate and academic institutions construct their own datasets and only sometimes share them with the world; the ones they do share are known as open datasets.

But open doesn’t automatically mean good, or ethical. Last week, Microsoft took down its MS Celeb facial recognition dataset following a report from MegaPixels, a watchdog project dedicated to uncovering how open datasets are used to build surveillance infrastructure across the world. MegaPixels showed that Microsoft’s data included photographs not just of public figures like celebrities, but of private citizens and journalists as well, and that it had been downloaded by…

Dave Gershgorn is a senior writer at OneZero covering surveillance, facial recognition, DIY tech, and artificial intelligence. Previously: Quartz, Popular Science, and The New York Times.