eirias | The stream of speech

Remez, Robert E., and Rubin, Philip E. (1983). The stream of speech. Scandinavian Journal of Psychology, 24: 63-66.

What is it that makes speech speech? I asked my coworker this a couple of months ago, and she laughed and told me nobody's quite sure. This article takes a stab at that question.

A natural first guess - that each speech sound, or phoneme, is represented by a single set of prominent bands of frequencies, or formants - turns out to be quite wrong. If you record the words "dog" and "dig," for instance, and try to chop off the /d/s from each and do a frequency analysis on them, you'll find they don't look the same at all. This is because the vowel that comes after the consonant influences how the consonant itself sounds.

Okay, said researchers, then perhaps the secret is in the broadband transitions between the different phonemes, or the harmonic relationships between them. This was a plausible next guess, and it was widely held throughout the sixties and seventies - in fact, I think there are people who still believe it. However, Remez, Rubin, and their colleagues don't think this holds water, because they were able to synthesize sounds that had none of the traditional speech-recognition cues - instead of harmonically related broadband formants, they had just three sine waves, each a single frequency that varied over time the way a formant would - and yet subjects were still able to hear these sounds as speech.
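To make the "three sine waves" idea concrete, here's a rough sketch of what such a stimulus looks like in code. This is not Remez and Rubin's actual procedure: the function name, the sample rate, and the frequency glides are all made up for illustration, and real stimuli would follow formant tracks measured from a recorded utterance rather than straight-line glides.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value, not from the article)

def sine_wave_speech(tracks, duration, sample_rate=SAMPLE_RATE):
    """Sum one pure tone per formant track (hypothetical helper).

    tracks: list of (start_hz, end_hz, amplitude) tuples. Each tone glides
    linearly from start_hz to end_hz, standing in for a measured formant track.
    """
    n = round(duration * sample_rate)
    phases = [0.0] * len(tracks)
    samples = []
    for i in range(n):
        frac = i / n  # position within the utterance, from 0 up to (not including) 1
        value = 0.0
        for k, (start_hz, end_hz, amp) in enumerate(tracks):
            freq = start_hz + (end_hz - start_hz) * frac
            # Accumulate phase so the tone stays continuous as its frequency changes.
            phases[k] += 2 * math.pi * freq / sample_rate
            value += amp * math.sin(phases[k])
        samples.append(value)
    return samples

# Made-up glides, loosely in the range of the first three formants.
tracks = [(300, 700, 1.0), (2000, 1200, 0.5), (2600, 2500, 0.25)]
signal = sine_wave_speech(tracks, duration=0.2)
```

Note what's missing: no harmonics, no broadband energy, no fundamental frequency. The only thing left is how the three tones move relative to one another.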

There are half a dozen articles by these guys on their synthetic speech methods, some of which I'll probably review later. The point of this particular article was to compare these barebones yet interpretable speech stimuli with some visual work done by G. Johansson in the '70s. Without having read his work, I think the gist was this: he stuck lights on people's joints and had subjects watch these people walk around in a dark room. The subjects couldn't find any sense in the dots when the actors stood still, but once the actors started moving, a pattern emerged ("Hey, why's that idiot running around in the dark?!"). Remez & Rubin contend that their sine wave speech stimuli work on the same principle - namely, "the value of each element is established only by virtue of the coherent configuration to which it belongs."

Bonus points to Remez and Rubin for usage of the words "terpsichoric" and "dotty" in the same sentence.