OneZero

OneZero is a former publication from Medium about the impact of technology on people and the future. Currently inactive and not taking submissions.


Why Is It So Hard to Make a Computer Talk Like a Human?

Starre Julia Vartan
Published in OneZero
6 min read · Jan 15, 2019

Illustration: Will Harvey

When our machines first began speaking to us, it was in the simple language of children. Some of those voices were even designed for kids: my Speak & Spell was a box with a handle and a tiny green screen that tested my skills in a grating tone, but I still heard that voice sometimes in my dreams. Teddy Ruxpin’s words played from cassette tapes popped into his back, but his mouth moved at just the right cadence, which made him feel almost alive. At least to a kid.

For adults, however, the clunky computerized voices of the 1980s, ’90s, and early aughts were far from real. When the train’s voice announced that the next stop was Port Chester, using two words instead of “porchester,” we knew: That was a machine. It could not know that we New Yorkers pronounced this place as one word, not two. It was simple: A voice that sounded human was a person; a voice that sounded like a machine was a machine.

This was fine when all we needed were announcements that were basic, short phrases. But if there is a fire on the train, we all instinctively want to hear a human voice guiding us — and not just because it would calm our nerves. It’s because, as studies have shown, mechanized voices are very difficult for us to comprehend for anything longer than a short sentence. We’ve evolved to read nonverbal voice cues while we listen to our fellow humans, and we get distracted when they’re missing — that distraction is what makes computerized voices tough to follow.

If we are going to replace assistants (or ourselves) with Google Assistant, or if we want a real conversation with the Alexa of the future, it has to converse like a human, responding to verbal cues and following the rhythm, music, and often freewheeling flow of human conversation. To be truly useful to us, in other words, computers need to sound human. And that’s extremely difficult.

What stands in the way? Prosody. That’s the intonation, tone, stress, and rhythm that give our voices their unique stamp. It’s not the words we…

Written by Starre Julia Vartan

AKA The Curious Human. Science journalist & nature nerd w/serious wanderlust. Former geologist. Still picks up rocks. Words in @NatGeo @SciAm @Slate @CNN, here.


This reminds me of the work of Wittgenstein.
“If a lion could speak we would not be able to understand what he said”
Language is a product of an intelligence and not intelligence in itself. As he stated, we would not be able to understand a lion…

--

For adults, however, the clunky computerized voices of the 1980s, ’90s, and early aughts were far from real.

That’s when train stations in France implemented the voice synthesis solution of France Télécom, just devised in Lannion. Instead of constructing the utterances from separate phonemes, they constructed the utterances from what they…

--