Night after night, Fien de Meulder sat in front of her Linux computer flagging names of people, places, and organizations in sentences pulled from Reuters newswire articles. De Meulder and her colleague, Erik Tjong Kim Sang, worked in language technology at the University of Antwerp. It was 2003, and a 60-hour workweek was typical in academic circles. She chugged Coke to stay awake.
The goal: develop an open source dataset to help machine learning (ML) models learn to identify and categorize entities in text. At the time, the field of named-entity recognition (NER), a subfield of natural language processing (NLP), was beginning to gain momentum. It hinged on the idea that training A.I. to identify people, places, and organizations would be key to its ability to glean the meaning of text. So, for instance, a system trained on this type of dataset that analyzes a piece of text including the names “Mary Barra,” “General Motors,” and “Detroit” may be able to infer that the person (Mary Barra) is associated with the company (General Motors) and either lives or works in the named place (Detroit).
In 2003, the entire process centered on supervised machine learning: ML models trained on data that had already been annotated by hand. To “learn” how to make these classifications, the A.I. had to be “shown” examples categorized by humans, and categorizing those examples involved a lot of grunt work.
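To make the idea of hand-annotated examples concrete, here is a minimal Python sketch. The sentence, the `ANNOTATED` list, and the `extract_entities` helper are all hypothetical illustrations, not taken from CoNLL-2003 or from the researchers' own tooling. The tag scheme shown (B- for the start of an entity, I- for its continuation, O for everything else) is one common convention for this kind of labeling, with PER, ORG, and LOC standing for person, organization, and location.

```python
from typing import List, Optional, Tuple

# Hypothetical hand-labeled sentence (illustrative only, not from CoNLL-2003).
# A human annotator assigns one tag per token:
#   B-X starts an entity of type X, I-X continues it, O means "not an entity".
ANNOTATED: List[Tuple[str, str]] = [
    ("Mary", "B-PER"),
    ("Barra", "I-PER"),
    ("leads", "O"),
    ("General", "B-ORG"),
    ("Motors", "I-ORG"),
    ("in", "O"),
    ("Detroit", "B-LOC"),
    (".", "O"),
]


def extract_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None

    for token, tag in tagged:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any entity in progress, then open a new one.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Same type as the open entity: extend it.
            tokens.append(token)
        else:
            # An "O" tag ends whatever entity was open.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None

    # Flush an entity that runs to the end of the sentence.
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


if __name__ == "__main__":
    for text, kind in extract_entities(ANNOTATED):
        print(f"{kind}: {text}")
```

A supervised model would be trained on many thousands of sentences labeled this way, which is why the annotation itself was such slow, repetitive work.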
Tjong Kim Sang and de Meulder didn’t think much about bias as they worked; at the time, few research teams were thinking about representation in datasets. But the dataset they were creating, known as CoNLL-2003, was biased in an important way: The roughly 20,000 newswire sentences they annotated contained many more men’s names than women’s names, according to a recent experiment by data annotation firm Scale AI shared exclusively with OneZero.
CoNLL-2003 would soon become one of the most widely used open source datasets for building NLP systems. Over the past 17 years, it’s been cited more than 2,500 times in research literature. It’s difficult to pin down the specific commercial algorithms, platforms, and…