An A.I. Training Tool Has Been Passing Its Bias to Algorithms for Almost Two Decades
The data set’s ripple effects are immeasurable
Night after night, Fien de Meulder sat in front of her Linux computer flagging names of people, places, and organizations in sentences pulled from Reuters newswire articles. De Meulder and her colleague, Erik Tjong Kim Sang, worked in language technology at the University of Antwerp. It was 2003, and a 60-hour workweek was typical in academic circles. She chugged Coke to stay awake.
The goal: develop an open-source dataset to help machine learning (ML) models learn to identify and categorize entities in text. At the time, the field of named-entity recognition (NER), a subfield of natural language processing, was beginning to gain momentum. It hinged on the idea that training A.I. to identify people, places, and organizations would be key to helping it glean the meaning of text. For instance, a system trained on this kind of dataset that analyzes a piece of text containing the names “Mary Barra,” “General Motors,” and “Detroit” may be able to infer that the person (Mary Barra) is associated with the company (General Motors) and either lives or works in the named place (Detroit).
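To make the labeling task concrete, here is a minimal sketch of what one hand-flagged sentence might look like and how the flagged words could be grouped into named entities. The sentence, the tag names (PER, ORG, LOC, O), and the helper `group_entities` are illustrative assumptions, not the actual format or tooling De Meulder and Tjong Kim Sang used.

```python
from typing import List, Tuple

# Hypothetical hand-labeled sentence: each token carries exactly one tag.
# PER = person, ORG = organization, LOC = location, O = not an entity.
# Both the sentence and the tag names are assumptions for illustration.
LABELED_SENTENCE: List[Tuple[str, str]] = [
    ("Mary", "PER"), ("Barra", "PER"), ("runs", "O"),
    ("General", "ORG"), ("Motors", "ORG"), ("in", "O"),
    ("Detroit", "LOC"), (".", "O"),
]


def group_entities(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge runs of adjacent tokens that share a tag into (text, tag) entities."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_tag = "O"
    for word, tag in tokens:
        if tag != current_tag:
            # The tag changed, so close the entity in progress, if any.
            if current_tag != "O":
                entities.append((" ".join(current_words), current_tag))
            current_words = []
            current_tag = tag
        if tag != "O":
            current_words.append(word)
    # Close an entity that runs to the end of the sentence.
    if current_tag != "O":
        entities.append((" ".join(current_words), current_tag))
    return entities


ENTITIES = group_entities(LABELED_SENTENCE)
print(ENTITIES)
```

A supervised model would be trained on many thousands of sentences labeled this way and then asked to predict the tags for text it has not seen. This simple grouping would merge two back-to-back entities of the same type into one, a limitation that real annotation schemes handle with extra boundary markers.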
In 2003, the entire process centered on supervised machine learning, or ML models…