Why Is It So Hard to Make a Computer Talk Like a Human?

The next generation of computerized voices has to be human enough to connect with but not so human that we feel we’re being lied to. That’s no small feat.

Starre Julia Vartan
Published in OneZero
6 min read · Jan 15, 2019

Illustration: Will Harvey

When our machines first began speaking to us, it was in the simple language of children. Some of those voices were even designed for kids — my Speak & Spell was a box with a handle and a tiny green screen that tested my skills in a grating tone, but I still heard that voice sometimes in my dreams. Teddy Ruxpin’s words played from cassette tapes popped into his back, but his mouth moved at just the right cadence, which made him feel almost alive. At least to a kid.

For adults, however, the clunky computerized voices of the 1980s, ’90s, and early aughts were far from real. When the train’s voice announced that the next stop was Port Chester, using two words instead of “porchester,” we knew: That was a machine. It could not know that we New Yorkers pronounced this place as one word, not two. It was simple: A voice that sounded human was a person; a voice that sounded like a machine was a machine.

This was fine when all we needed were announcements that were basic, short phrases. But if there is a fire on the train, we all instinctively want to hear a human voice guiding us — and not just because it would calm our nerves. It’s because, as studies have shown, mechanized voices are very difficult for us to comprehend for anything longer than a short sentence. We’ve evolved to read nonverbal voice cues while we listen to our fellow humans, and we get distracted when they’re missing — that distraction is what makes computerized voices tough to follow.

If we are going to replace assistants (or ourselves) with Google Assistant, or if we want a real conversation with the Alexa of the future, it has to converse like a human — responding to verbal cues and following the rhythm, music, and often freewheeling flow of human conversation. To be truly useful to us, in other words, computers need to sound human. And that’s extremely difficult.

What stands in the way? Prosody. That’s the intonation, tone, stress, and rhythm that give our voices their unique stamp. It’s not the words we…

Starre Julia Vartan, writer for OneZero

AKA The Curious Human. Science journalist & nature nerd w/serious wanderlust. Former geologist. Still picks up rocks. Words in @NatGeo @SciAm @Slate @CNN, here.