Machine Learning’s Crumbling Foundations

Doing ‘data science’ with bad data.

Cory Doctorow
Aug 19, 2021 · 7 min read


An industrial meat-grinder; on its intake belt is a procession of recycling bins heaped high with garbage; its output cone has been replaced with the glowing eye of HAL 9000, and it empties into a giant wheeled hopper full of ground-up trash. Image: Seydelmann (modified) CC BY-SA; Cryteria (modified) CC BY

Technological debt is insidious, a kind of socio-infrastructural subprime crisis that’s unfolding around us in slow motion. Our digital infrastructure is built atop layers and layers and layers of code that’s insecure due to a combination of bad practices and bad frameworks.

Even people who write secure code import insecure libraries, or plug their code into insecure authorization systems or databases. Like asbestos in the walls, this cruft has been fragmenting, drifting into our air a crumb at a time.

We ignored these breaches, treating them as little and containable, and now the walls are rupturing and choking clouds of toxic waste are everywhere.

The infosec apocalypse was decades in the making. The machine learning apocalypse, on the other hand…

ML has serious, institutional problems, the kind of thing you’d expect in a nascent discipline, which you’d hope would be worked out before it went into wide deployment.

ML is rife with all forms of statistical malpractice — AND it’s being used for high-speed, high-stakes automated classification and decision-making, as if it was a proven science whose professional ethos had the sober gravitas you’d expect from, say, civil engineering.

Civil engineers spend a lot of time making sure the buildings and bridges they design don’t kill the people who use them. Machine learning?

Hundreds of ML teams built models to automate covid detection, and every single one was useless or worse.

The models failed because their builders didn't observe basic statistical rigor. One common failure mode?

Treating data that was known to be of poor quality as if it was reliable because good data was not available.
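The remedy is mundane: audit your data's provenance before training, and reject records you can't vouch for rather than hoping they average out. Here's a minimal sketch of that idea; the field names (`label`, `source_quality`) and the `audit_dataset` function are invented for illustration, not taken from any real pipeline:

```python
# Hypothetical sketch: a pre-training audit that refuses known-bad records
# instead of quietly training on them. All field names are invented for
# illustration; a real pipeline would check actual provenance metadata.

def audit_dataset(records):
    """Split records into usable and rejected lists, with a reason per rejection."""
    usable, rejected = [], []
    for r in records:
        if r.get("label") is None:
            rejected.append((r, "missing label"))
        elif r.get("source_quality", "unknown") != "verified":
            # Known-poor or unverified provenance: exclude it rather than
            # treat it as reliable because nothing better is available.
            rejected.append((r, "unverified source"))
        else:
            usable.append(r)
    return usable, rejected

records = [
    {"id": 1, "label": "covid", "source_quality": "verified"},
    {"id": 2, "label": None, "source_quality": "verified"},
    {"id": 3, "label": "healthy", "source_quality": "scraped"},
]
usable, rejected = audit_dataset(records)
print(len(usable), len(rejected))  # prints "1 2"
```

The point isn't the ten lines of code; it's that the rejection path exists at all. Most of the failed covid models had no such path: bad data went in, and confident classifications came out.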

Obtaining good data and/or cleaning up bad data is tedious, repetitive grunt-work. It’s unglamorous, time-consuming, and low-waged. Cleaning data is the equivalent of sterilizing surgical implements — vital, high-skilled, and invisible unless someone…