Night after night, Fien de Meulder sat in front of her Linux computer flagging names of people, places, and organizations in sentences pulled from Reuters newswire articles. De Meulder and her colleague, Erik Tjong Kim Sang, worked in language technology at the University of Antwerp. It was 2003, and a 60-hour workweek was typical in academic circles. She chugged Coke to stay awake.
The goal: develop an open source dataset to help machine learning (ML) models learn to identify and categorize entities in text. At the time, the field of named-entity recognition (NER), a subset of natural language processing, was beginning to gain momentum. It hinged on the idea that training A.I. to identify people, places, and organizations would be a key to A.I. being able to glean the meaning of text. So, for instance, a system trained on these types of datasets that is analyzing a piece of text including the names “Mary Barra,” “General Motors,” and “Detroit” may be able to infer that the person (Mary Barra) is associated with the company (General Motors) and either lives or works in the named place (Detroit).
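Datasets like the one de Meulder was building mark each word with a tag identifying whether it begins or continues an entity, and of what type (person, organization, location). A minimal sketch of how those tags group into entities, using a BIO-style tagging scheme similar to CoNLL's (the sentence and tags here are hand-written for illustration):

```python
# A toy illustration of named-entity tags in a BIO-style scheme like the
# one used by CoNLL datasets. The sentence and its tags are annotated by
# hand here, just as CoNLL-2003's sentences were annotated by hand.
tokens = ["Mary", "Barra", "leads", "General", "Motors", "in", "Detroit", "."]
tags   = ["B-PER", "I-PER", "O",     "B-ORG",   "I-ORG",  "O",  "B-LOC",   "O"]

def extract_entities(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # entity continues
            current.append(token)
        else:                               # outside any entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

print(extract_entities(tokens, tags))
# [('Mary Barra', 'PER'), ('General Motors', 'ORG'), ('Detroit', 'LOC')]
```

A model trained on thousands of sentences annotated this way learns to produce the tags itself, which is what lets it link the person, the company, and the place.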
In 2003, the entire process centered on supervised machine learning, or ML models trained on data that previously had been annotated by hand. To “learn” how to make these classifications, the A.I. had to be “shown” examples categorized by humans, and categorizing those examples involved a lot of grunt work.
Tjong Kim Sang and de Meulder didn’t think much about bias as they worked — at the time, few research teams were thinking about representation in datasets. But the dataset they were creating — known as CoNLL-2003 — was biased in an important way: The roughly 20,000 news wire sentences they annotated contained many more men’s names than women’s names, according to a recent experiment by data annotation firm Scale AI shared exclusively with OneZero.
CoNLL-2003 would soon become one of the most widely used open source datasets for building NLP systems. Over the past 17 years, it’s been cited more than 2,500 times in research literature. It’s difficult to pin down the specific commercial algorithms, platforms, and tools CoNLL-2003 has been used in — “Companies tend to be tight-lipped about what training data specifically they’re using to build their models,” says Jacob Andreas, PhD, an assistant professor at the Massachusetts Institute of Technology and part of MIT’s Language and Intelligence Group — but the dataset is widely considered to be one of the most popular of its kind. It has often been used to build general-purpose systems in industries like financial services and law.
Only this past February did someone bother to quantify its bias.
Using its own labeling pipeline — the process and tech used to teach humans to classify data that’ll then be used to train an algorithm — Scale AI found that, by the company’s own categorization, “male” names were mentioned almost five times more than “female” names in CoNLL-2003. Less than 2% of names were considered “gender-neutral.”
When Scale AI tested a model trained using CoNLL-2003 on a separate set of names, it was 5% more likely to miss a new woman’s name than a new man’s name (a notable discrepancy). When the company tested the algorithm on U.S. Census data — the 100 most popular men’s and women’s names for each year — it performed “significantly worse” on women’s names “for all years of the census,” according to the report.
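A gap like the one Scale AI reported comes down to recall: of the names in each group, what fraction does the model actually catch? A minimal sketch of that comparison — the name lists and the model's output below are hypothetical stand-ins, not Scale AI's actual data:

```python
# A minimal sketch of measuring a recognition gap between name groups,
# in the spirit of Scale AI's experiment. The name lists and the set of
# names "recognized" by the model are hypothetical stand-ins.
def recall(names, recognized):
    """Fraction of names the model correctly flagged as person entities."""
    hits = sum(1 for name in names if name in recognized)
    return hits / len(names)

mens_names = ["James", "John", "Robert", "Michael"]
womens_names = ["Mary", "Patricia", "Jennifer", "Linda"]

# Pretend output of some trained NER model: the names it tagged as PER.
model_recognized = {"James", "John", "Robert", "Michael",
                    "Mary", "Patricia", "Linda"}

print(f"men's recall:   {recall(mens_names, model_recognized):.2f}")    # 1.00
print(f"women's recall: {recall(womens_names, model_recognized):.2f}")  # 0.75
```

Run against thousands of names per census year, the same comparison surfaces the systematic gap Scale AI describes.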
All of this means that a model trained on CoNLL-2003 wouldn’t just fall short when it comes to identifying the current names included in the dataset — it would fall short in the future, too, and likely perform worse over time. It would have more trouble with women’s names, but it would also likely be worse at recognizing names more common to minorities, immigrants, young people, and any other group that wasn’t regularly covered in the news two decades ago.
To this day, CoNLL-2003 is relied on as an evaluation tool to validate some of the most-used language systems — “word embedding” models that turn words into numerical representations of meaning and context that A.I. can work with — including foundational models like BERT, ELMo, and GloVe. Everything influenced by CoNLL-2003 has, in turn, had its own ripple effects (for instance, GloVe has been cited more than 15,000 times in literature on Google Scholar).
Alexandr Wang, founder and CEO of Scale AI, describes ML as a “house of cards” of sorts, in that things are built atop each other so quickly that it’s not always apparent whether there’s a sturdy foundation underneath.
The dataset’s ripple effects are immeasurable. So are those of its bias.
Imagine a ruler, slightly bent, that’s seen as the universal standard for measurement.
In interviews, industry experts consistently referred to CoNLL-2003 with wording that reflects its influence: Benchmark. Grading system. Yardstick. For almost two decades, it’s been used as a building block or sharpening tool for countless algorithms.
“If people invent a new machine learning system,” Tjong Kim Sang says, “one of the datasets they will… test it on is this CoNLL-2003 dataset. That is the reason why it has become so popular. Because if people make something new, if it’s in 2005, 2010, 2015, or 2020, they will use this dataset.”
If an algorithm performs well after being run on CoNLL-2003, meaning the way it classified entities closely matches how humans classified them, then it’s viewed as successful — potentially a seminal work in the field. But in actuality, passing a test like this with flying colors is concerning: It suggests the model has absorbed some of the dataset’s initial bias. And what about the next model that comes along? If the new one outperforms the old, then it’s likely even more aligned with the dataset’s initial bias.
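"Performing well" on a benchmark like this is typically measured with a span-level F1 score: an entity only counts as correct if both its boundaries and its type exactly match the human annotation. A minimal sketch of that scoring (the gold and predicted entities below are illustrative, not taken from the dataset):

```python
# A minimal sketch of exact-match, span-level F1 scoring, the kind used
# to rank systems on CoNLL-style benchmarks. An entity counts as correct
# only if both its text span and its type match the human annotation.
def span_f1(gold, predicted):
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative example: the model misses one of three gold entities.
gold = [("Mary Barra", "PER"), ("General Motors", "ORG"), ("Detroit", "LOC")]
pred = [("Mary Barra", "PER"), ("General Motors", "ORG")]

print(f"F1: {span_f1(gold, pred):.2f}")  # F1: 0.80
```

The catch, as the experts note, is that a high score only certifies agreement with the human annotations — including whatever biases those annotations carry.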
“I consider ‘bias’ a euphemism,” says Brandeis Marshall, PhD, data scientist and CEO of DataedX, an edtech and data science firm. “The words that are used are varied: There’s fairness, there’s responsibility, there’s algorithmic bias, there’s a number of terms… but really, it’s dancing around the real topic… A dataset is inherently entrenched in systemic racism and sexism.”
In interviews with OneZero, the primary creators of CoNLL-2003 didn’t object to the idea that their dataset was biased.
De Meulder, Tjong Kim Sang, and Walter Daelemans, PhD (the team’s supervisor at the time) don’t recall considering bias much back then, especially since they created the dataset for a specific “shared task” — an exercise allowing different groups to test their algorithms’ performance on the same data — ahead of a conference in Canada. “It’s only after the fact, if the systems are used on different datasets, that the bias will become apparent,” writes de Meulder in an interview follow-up.
That’s exactly what happened.
The bias of a system trained on CoNLL-2003 could be as simple as your virtual assistant misreading your instructions to “call Dakota” as dialing a place rather than a person, or not recognizing which artist you’d like to stream via Spotify or Google Play. Maybe you’re looking up a famous actress, artist, or athlete, and a dedicated panel doesn’t pop up in your search results — costing them opportunities and recognition. It’s “exactly the kind of subtle, pervasive bias that can creep into many real-world systems,” writes James Lennon, who led the study at Scale AI, in his report.
“If you can’t recognize people’s names, then those people become invisible to all kinds of automated systems that are really important,” Andreas says. “Making it harder to Google people; making it harder to pull them out of one’s own address books; making it hard to build these nice, specialized user interfaces for people.”
This kind of bias can also lead to problems stemming from lack of recognition or erasure. Many algorithms analyze news coverage, social media posts, and message boards to determine public opinion on a topic or identify emerging trends for decision-makers and stock traders.
“Let’s say there were investors that identified companies to invest in based on ‘social media buzz,’ the number of mentions of that company or any of the senior executives of the company on social media,” writes Graham Neubig, PhD, an associate professor at Carnegie Mellon University’s Language Technology Institute, in an email to OneZero. “In this case, if an NER system failed to identify the name of any of the senior executives, then this ‘buzz’ would not register, and thus the company would be less likely to attract investment attention.”
Daelemans sees it as “a bit of laziness” that people are still using his team’s dataset as a benchmark. Computational linguistics has progressed, but CoNLL-2003 still offers an easy shortcut for proving a new model to be the latest and greatest. Building a better dataset means dedicating human labor to the unglamorous task of labeling sentences by hand, but today it can be done more quickly, and with fewer examples, than in 2003.
“It would not take that much energy to do a new, more balanced dataset as a benchmark,” Daelemans says. “But the focus is really on getting the next best model, and it’s highly competitive, so I don’t think a lot of research groups will want to invest time in doing a better version.”
Then there’s the question of what building a better dataset actually looks like.
Scale AI’s analysis of CoNLL-2003’s bias, for instance, isn’t without its own problems. When it comes to asking how recognition accuracy compares between the name categories, “that question itself is a whole can of worms,” Andreas says. “Because what does it mean to be a female name, and who are the annotators that are judging… and what about all the people in the world who are not males or females but identify with some other category and who’d maybe even be left out of an analysis like this?” (OneZero has chosen to refer to Scale AI’s “male” and “female” categories as “men’s names” and “women’s names.”)
To complete its analysis of CoNLL-2003’s bias, instead of using surrounding pronouns to infer gender, Scale AI used societal notions about the names themselves. The humans who tagged the data assumed, for example, that Tiffany must be a woman, John must be a man, and Alex goes in the gender-neutral category. An ML model that assigns gender externally based on any characteristic is “in complete contradiction with the idea that gender is something that people define for themselves,” says Rachel Thomas, PhD, director of the University of San Francisco’s Center for Applied Data Ethics.
Scale AI’s interest in conducting this experiment is partly propelled by its business model, which involves clients using the company’s labeling pipeline to comb through their own datasets, or the open source data they’re using, to gauge bias. The company created a new open source dataset, called CoNLL-Balanced, after adding more than 400 additional “women’s” names to the initial data. Scale AI’s preliminary results suggest the new algorithm performs comparably on both categories of names.
But this still may not solve the fundamental problem. In interview after interview, experts made it clear that increasing representation in datasets is merely a bandage — in many ways, the tech community wants to “find a tech solution for a social problem,” Marshall says. When it comes to shifting power into the hands of women, BIPOC, and LGBTQ+ individuals, there’s a lot of work still to be done — and reevaluating datasets alone isn’t going to change things. According to Marshall and Andreas, moving forward will take interdisciplinary work: bringing together leaders in machine learning with those in fields like anthropology, political science, and sociology.
“Representation in datasets is important,” Thomas says. “I worry that too many people think that’s just the sole issue — like once you’ve balanced your dataset, then you’re good — whereas bias really also involves all these questions… People [are] moving more towards talking about how different machine learning models shift power.”
That power mismatch can stem from the representation gap between the people creating these tools and those who could be affected by them. Closing it means bringing members of marginalized groups into the conversation and the development of these tools in a meaningful way, so they can think through dangers and potential misuses down the line.
“The academic community’s been playing with these datasets for decades, and we know that there are some human errors in the datasets — we know that there’s some bias,” says Xiang Ren, PhD, an assistant professor at the University of Southern California and part of USC’s NLP group. “But I think most of the time, people just kind of follow the popular evaluation protocols.”
Some experts think we’re beginning to see a reckoning for how ML models are evaluated — which, eventually, could lead to the retirement of datasets like CoNLL-2003.
The entire community is now “staring real closely at the datasets and thinking about… our whole scientific apparatus,” Andreas says. “The way in which we judge the effectiveness of systems is largely built around datasets that are like CoNLL-2003.”