The person who voiced the original iPhone Siri has revealed herself as Susan Bennett, although Apple won’t confirm it. But what interested me in the story was how the system was created.
For four hours a day, every day, in July 2005, Bennett holed up in her home recording booth. Hour after hour, she read nonsensical phrases and sentences so that the “ubergeeks” — as she affectionately calls them; they leave her awestruck — could work their magic by pulling out vowels, consonants, syllables and diphthongs, and playing with her pitch and speed.
These snippets were then synthesized in a process called concatenation that builds words, sentences, paragraphs. And that is how voices like hers find their way into GPS and telephone systems.
I used to wonder whether they had coded the voice to utter lots of words or even phrases that could be arranged in multiple ways to provide the answers to questions, but it seems like the basic component units are even smaller. Pretty impressive people, these ubergeeks.
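To make the idea of concatenation concrete, here is a toy sketch of how small recorded units might be stitched into a longer sound. It is not how Siri or any real system works: the unit names, the sample values in `UNITS`, and the helper names `crossfade_concat` and `synthesize` are all made up for illustration, and real systems choose among many recorded variants of each unit and adjust pitch and timing as well.

```python
def crossfade_concat(units, overlap=2):
    """Join lists of audio samples, blending `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


# Hypothetical inventory: unit name -> recorded samples (invented values).
UNITS = {
    "h": [0.0, 0.1, 0.2],
    "ai": [0.5, 0.6, 0.7, 0.8],
}


def synthesize(unit_names, inventory=UNITS, overlap=2):
    """Look up each named unit and concatenate them with a short crossfade."""
    return crossfade_concat([inventory[name] for name in unit_names], overlap)


samples = synthesize(["h", "ai"])
print(len(samples))
```

The crossfade at each seam is the simplest way to soften the joins; with `overlap=0` the units are simply butted together end to end.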
This does answer one question that puzzled me. I have an old iPhone 3G that was handed down from my daughter when she upgraded, so I don’t have Siri. But when my daughter was visiting for the first time after this came out, we were all playing around on her phone asking all manner of silly but innocent questions to see what Siri would say. We were startled when she suddenly said, “I am horny”, stopping us dead in our tracks.
We wondered why the people at Apple would program such a sentence but now it seems clear that that particular sentence was not recorded but reconstructed due to some combination of triggers in our questions. We could not get her to say it again.
But was the reconstruction a sheer fluke or were some of the ubergeeks having a bit of fun by secretly throwing in the possibility of such an answer?
RebeccaT says
Long ago, when I was a little DOS programmer, just finding her feet, I experimented with doing text-to-speech at the word level. While the individual words sounded WAY better than the text-to-speech that was available to me at the time, putting them together sounded bizarre. It was like the audio version of one of those kidnapping notes made up of cutouts from a newspaper. All the stress and inflection was weird and/or flat, when stitched together like that.
throwaway, never proofreads, every post a gamble says
Or maybe truth is stranger than that: she has developed a consciousness and is literally horny for a male counterpart! Hie thee to the Garmin!
Mano Singham says
Yes, I can imagine that if you took anyone’s speech (say reading a book) and switched the words into different sequences, the result would be pretty weird because the way we say something depends on the context, what came just before and just after. I hadn’t thought of any of this until I read this article and your comment.
Pen says
One of the games in our household at the moment is being incredibly rude to Siri just to hear the pathetic responses it comes up with -- ‘You’re entitled to your opinion’????
I really didn’t need to know there was a real person behind the machine, but I suppose it was inevitable.
DsylexicHippo says
“Who let the dogs out?”
Siri: Who? Who? Who? Who? Who?
Cuttlefish says
Mano, you might be interested (I was fascinated) by the offer Roger Ebert got, to have his own voice synthesized in (what seems to me) much the same manner. He had certainly accumulated enough recorded samples of his voice, and of course was (at the time) communicating only by keyboard.
Something came up (grading, probably), so I never did find out what became of that offer.
Mano Singham says
Reading the Siri story, I don’t think ordinary speech would be sufficient to give the range of sounds needed for really good synthesis.
mobius says
Your “I am horny” story brought to mind a somewhat (OK…just a tiny bit) similar story of my own.
Many years ago when VCRs were still in use, I spent about a year repairing them. One unit had come in with intermittent sound problems. So I plugged it in and let it just run and was listening as I worked on another VCR. The tape I was playing was Schwarzenegger’s Commando. Not that I was at all interested in it. It was just a tape I had available for troubleshooting. There was no sound from the machine until suddenly Arnie’s voice came on and said, “F*ck you”, and then the machine went silent again.
True story, believe it or not.
CaitieCat says
As a linguist who’s worked on speech-recognition software, yes, this is a big problem; it’s an even bigger one in the reverse direction, trying to pick speech out of the different inflections which occur at the phrase or sentence level. If I were able to be an academic right now, I’d probably be looking into whether it’s possible to teach a computer to filter someone’s accent away from their words by recognizing the accent. Say the speaker is a German and is devoicing all final consonants (“Hund” -- dog -- is pronounced with an almost-“t” at the end). When we’re doing the speech-rec, if we can pick out that someone does this, we’ve narrowed down their accent to a small list of languages. Other obvious cues could work similarly, by recognizing consistent patterns of interference: Arabic speakers putting reflexive pronouns in weird places, or Russians using (or failing to use) articles in unusual ways, and so on.
Being able to recognize these things could make speech-rec a lot more robust, by enabling filters that “convert” the speaker’s utterances into a hypothetical standard model.
Not possible now, as we don’t do speech-rec at the phoneme level yet, not with any reliability, and you’d need to do so reliably before you could do accent-rec. But I’d bet someone’s working on it, somewhere.
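As a rough sketch of the devoicing cue described above, and assuming the very thing this comment says we don't yet have (reliable phoneme-level output), the check might look something like this. The `DEVOICED` table, the function names, and the 0.8 threshold are hypothetical choices for illustration, not part of any real speech-rec system.

```python
# Voiced final consonant -> the voiceless sound it turns into when devoiced.
# A small illustrative table, not a complete phonological inventory.
DEVOICED = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}


def devoicing_rate(observations):
    """Fraction of expected voiced finals that were heard as voiceless.

    `observations` is a list of (expected_final, heard_final) phoneme pairs,
    which assumes a recognizer that can report phonemes reliably.
    """
    voiced = [(exp, heard) for exp, heard in observations if exp in DEVOICED]
    if not voiced:
        return 0.0
    hits = sum(1 for exp, heard in voiced if heard == DEVOICED[exp])
    return hits / len(voiced)


def looks_like_final_devoicing(observations, threshold=0.8):
    """True if the speaker devoices final consonants consistently enough."""
    return devoicing_rate(observations) >= threshold


# Invented sample: three voiced finals expected, two heard as voiceless.
sample = [("d", "t"), ("g", "k"), ("z", "z"), ("t", "t")]
print(devoicing_rate(sample))
print(looks_like_final_devoicing(sample))
```

A consistent "yes" here would only narrow the accent to the languages that devoice finals; it would take several such cues together to pin one down.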