Debugger
Google Promises reCAPTCHA Isn’t Exploiting Users. Should You Trust It?
An innovative security feature to separate humans from bots online comes with some major concerns
A surprising amount of work online goes into proving you’re not a robot. It’s the basis of those CAPTCHA questions often seen when logging into websites: blurry photos of crosswalks, traffic lights, and storefronts that users are tasked with identifying through a series of clicks.
They come in many forms, from blurry letters that must be identified and typed into a box to branded slogans like “Comfort Plus” on the Delta website — as if the sorry state of modern air travel wasn’t already dystopian enough. The most common, however, is Google’s reCAPTCHA, which launched its third version at the end of 2018. It’s designed to drastically reduce the number of challenges you must complete to log into a website, assigning an invisible score to users depending on how “human” their behavior is. CAPTCHA, after all, is designed to weed out bot accounts that flood systems for nefarious ends.
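To make that concrete, here is roughly how a reCAPTCHA v3 integration looks from a website’s side. This is a minimal sketch in TypeScript: the site key and the “login” action name are placeholders, and the backend route it posts to is hypothetical. No challenge appears; the script scores the visit in the background and hands the site an opaque token.

```ts
// Minimal client-side sketch of a reCAPTCHA v3 check.
// Assumes the api.js script tag (loaded with your public site key) is already
// on the page; it exposes the global `grecaptcha` object declared below.
declare const grecaptcha: {
  ready(callback: () => void): void;
  execute(siteKey: string, options: { action: string }): Promise<string>;
};

const SITE_KEY = "YOUR_SITE_KEY"; // placeholder public site key

grecaptcha.ready(async () => {
  // No visible challenge: Google scores the visit in the background and
  // returns an opaque token tied to the named action.
  const token = await grecaptcha.execute(SITE_KEY, { action: "login" });

  // The token goes to the site's own backend, which asks Google for the score.
  await fetch("/api/verify-captcha", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ token }),
  });
});
```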
But Google’s innovation has a downside: The new version monitors your every move across a website to determine whether you are, in fact, a person.
A necessary advancement?
Before we get into the how of this new technology, it’s useful to understand where it’s coming from. The new reCAPTCHA is the latest spin on a relatively ancient web technology, one that has been harnessed for plenty of things beyond security.
CAPTCHA — which stands for Completely Automated Public Turing test to tell Computers and Humans Apart — first appeared in the late ’90s, and it was designed by a team at the early search engine AltaVista. Before CAPTCHA, it was easy for people to program bots that would automatically sign up for services and post spam comments by the thousands. AltaVista’s technology was based on a printer manual’s advice for avoiding bad optical character recognition (OCR), and the iconic blurry text in a CAPTCHA was specifically designed to be difficult for a computer to read but legible for humans, thereby foiling bots.
By the early 2000s, these tests were everywhere. Then came reCAPTCHA, developed by researchers at Carnegie Mellon and purchased by Google in 2009. It used the same idea but in an innovative way: digitization programs scanning old books and newspapers would flag words their OCR couldn’t recognize, and those words would be paired with known examples in reCAPTCHA tests. Humans would verify the known word and, in the process, decipher the unknown one.
By 2011, Google had digitized the entire archive of the New York Times through reCAPTCHA alone. People would type in text from newspaper scans one blurry CAPTCHA at a time, ultimately allowing Google to make the Times’ back catalog searchable forever. While creating a velvet rope to keep bots off sites, Google had managed to conscript human users into doing the company’s grunt work.
With that achievement under its belt, reCAPTCHA switched to showing images from Google’s Street View service in 2014, as it does today. After clicking the “I’m not a robot” checkbox, you might be prompted to identify which of nine images contain bicycles or streetlights. Behind the scenes, Google reduced the frequency at which people were asked to complete these tests by performing behavioral analysis: reCAPTCHA can now run in the background and track how people use websites.
If a Google cookie is present on your machine, or if the way you use your mouse and keyboard on the page doesn’t seem suspiciously bot-like, you’ll skip the Street View test entirely. But some privacy-conscious users have complained that clearing their cookies or browsing in incognito mode drastically increases the number of reCAPTCHA tests they’re asked to complete.
Users have also pointed out that competing browsers like Firefox serve up more challenges than Google’s own Chrome, which naturally raises a question: Is Google using reCAPTCHA to cement its own dominance?
This raises serious privacy concerns, given that Google’s revenue comes primarily from its ad business, which relies on tracking data. You might worry that reCAPTCHA is essentially a secret ad tracker, hiding in plain sight just like the Facebook “like” button embedded on web pages.
Google’s perspective
To use its latest version of reCAPTCHA, Google asks that developers include its tracking tags on as many pages of their websites as possible to paint a better picture of the user. This doesn’t exist in a vacuum: Google also offers Google Analytics, for example, which helps developers and marketers understand how visitors use their website. It’s a fantastic tool, included on more than 100,000 of the top 1 million most-visited websites according to BuiltWith, but it’s also part of a strategy to monitor users’ habits across the internet.
The new version of reCAPTCHA fills in the missing pieces of that picture, extending Google’s reach into sites that might not use its Analytics tool. When pressed on this, Google told Fast Company that it won’t capture user data from reCAPTCHA for advertising and that the data it does collect is used for improving the service.
But that data remains sealed within a black box, even to the developers who implement the technology. The documentation for reCAPTCHA doesn’t mention user data, how users might be tracked, or where the information ends up — it simply discusses the practical parts of the implementation.
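For a sense of just how little developers see, here is a hedged sketch of the server-side half of a v3 integration, again in TypeScript. The siteverify endpoint is the one Google documents; the 0.5 score threshold and the environment variable name are illustrative choices. All a site gets back is a tiny JSON payload with a score between 0.0 and 1.0; none of the underlying signals are exposed.

```ts
// Server-side sketch of reCAPTCHA v3 token verification (Node 18+, built-in fetch).
// The response shape below is everything a developer receives; how the score
// was computed stays entirely on Google's side.
interface SiteVerifyResponse {
  success: boolean;
  score?: number; // 0.0 (likely a bot) to 1.0 (likely a human)
  action?: string;
  hostname?: string;
  "error-codes"?: string[];
}

const SECRET_KEY = process.env.RECAPTCHA_SECRET ?? ""; // placeholder secret key

async function verifyToken(token: string): Promise<boolean> {
  const params = new URLSearchParams({ secret: SECRET_KEY, response: token });
  const res = await fetch("https://www.google.com/recaptcha/api/siteverify", {
    method: "POST",
    body: params,
  });
  const data = (await res.json()) as SiteVerifyResponse;

  // Accept the request only if Google's opaque score clears a chosen threshold.
  return data.success && (data.score ?? 0) >= 0.5;
}
```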
I asked Google for more information, and about its commitment to the long-term independence of reCAPTCHA from its advertising business; just because the two aren’t bound together now doesn’t mean they couldn’t be in the future, after all.
A Google representative says “reCAPTCHA may only be used to fight spam and abuse” and that “the reCAPTCHA API works by collecting hardware and software information, such as device and application data, and sending these data to Google for analysis. The information collected in connection with your use of the service will be used for improving reCAPTCHA and for general security purposes. It will not be used for personalized advertising by Google.”
That’s great, and hopefully Google maintains this commitment. The problem is that there’s no reason to believe it will. The introduction of a powerful tracking technology like this should come with public scrutiny, because we’ve seen in the past how easily things can go sour. Facebook, for example, promised in 2014 that WhatsApp would remain independent, separate from its backend infrastructure, but went back on that decision after just two years. When Google acquired Nest, it promised to keep it independent but reversed course five years later, requiring owners to migrate to a Google account or lose functionality.
For the same reason Google is able to build reCAPTCHA in the first place — its vast resources and reach — we should be suspicious of where all this might lead us.
Unfortunately, there’s little we as users can do. There’s no way to opt out of reCAPTCHA on a site you need to use, forcing you to either accept being tracked or stop using a given service altogether. If you don’t like those full-body scanners at airports, you can at least opt out and get a manual pat-down instead. But if a site has reCAPTCHA, there’s no opting out at all.
If Google intends to build tools like this with the public good in mind rather than its bottom line, then the company must find better ways to reassure the world that it won’t change the rules when it’s convenient. If it were willing to open-source the project (as it has with many, many others), move it outside the company, or, at the very least, establish third-party oversight, perhaps we could start building that trust.