Psycholinguistics/Perception of Continuous Speech

Introduction


The perception of continuous speech is a complex and difficult process, and yet we all do it daily. Whether we are listening to a friend’s funny story or to a powerful political speech, we rely on the steps involved in understanding the stream of language that allows us to communicate with others. Our ability to put together and understand continuous speech is what makes communication possible. Without it, we would be unable to understand what others are expressing, and we would be isolated. The perception of speech segments and continuous speech is a truly incredible feat of the human mind. Our perception of continuous speech differs from our perception of isolated words, because each sound in the stream is affected by the words and phonemes around it. We can never cleanly segment the stream into separate phonemes, because phonemes lack invariance: they change with their neighbors and blend together. And yet, without being able to separate the individual sounds, people are able to understand a whole stream of sounds that are almost unidentifiable in isolation, and make sense of them.

Consonant and Vowel Identification


Vowels and consonants are identified differently. When isolated, vowels have a long, steady quality. However, the sound of a vowel is strongly influenced by the sounds around it. Consonants cannot be perceived without an accompanying vowel, but they sound shorter and require less energy to produce. For example, a person cannot make the sound of a consonant such as “k” without producing a vowel sound with it. Consonants have certain properties encoded; for example, an unvoiced consonant is encoded differently than a stop consonant. However, in continuous speech, the cues used to identify vowels rely on the transition between the vowels and the consonants. Strange, Edman, and Jenkins (1979) performed a study in which people tried to identify vowels in different conditions and pairings (i.e., paired with different types of consonants in different orders). They discovered that vowels are easiest to identify when they are paired with consonants. Vowels were best identified when they were bordered on both sides by consonants, but the final consonant provided more identification information than the initial consonant (Strange, Edman & Jenkins, 1979). Thus, it appears that consonants need vowels to be perceived, and vowels need consonants to be perceived correctly.



When vowels and consonants are combined, there are many effects. One such effect is assimilation, in which the sounds of words are slurred together through coarticulation. Assimilation occurs when the features of adjacent phonemes overlap. Combining consonants and vowels can also produce the bidirectional effect, in which phonemes influence each other in both directions: a phoneme is pronounced differently in anticipation of the next phoneme or as a result of the previous one. Coarticulation can only occur when there is both a vowel and a consonant present. In the phrase “Come on,” for example, assimilation makes it impossible to hear exactly where the “m” sound stops and the second “o” sound begins.

Epenthesis is another effect of combining vowels and consonants in continuous speech. In this case, a sound or a syllable is inserted into the word to make pronunciation easier. An example of this is how the word “warmth” often sounds like it contains a “p” towards the end of the word. This makes enunciation of the word easier in continuous speech. The opposite can also occur. In elision, a sound is omitted from the pronunciation of a word. For example, in the word “February,” the first “r” sound is often omitted.

Assimilation, epenthesis and elision allow us to speak and understand speech more rapidly. They are adaptations of the language that assist with the flow of pronunciation. These same adjustments explain why we can understand a continuous flow of speech even though individual sounds, if isolated from that flow, would be incomprehensible. We use context and adjust to the speaker through normalization to be able to understand language, despite all the changes introduced by assimilation, epenthesis, and elision.

Parallel Transmission
Essentially, parallel transmission occurs when neighboring sounds affect the pronunciation of one another, so that information about more than one phoneme is carried at the same time. Coarticulation and the bidirectional effect (discussed above) are parts of parallel transmission. The existence of parallel transmission creates the segmentation problem: we cannot tell where one phoneme ends and the next begins. This is a problem for theories of speech perception that treat the phoneme as the basic unit. Some critics say that if the lines between phonemes are blurred, phonemes are not really individual units. Because of this criticism, it has been argued that the syllable is a better basic unit for speech perception.

Contextual variance seems to support the view that phonemes are not unchanging, individual sounds. The concept suggests that, in reality, phonemes lack invariance: the sound of a phoneme changes depending on the preceding or following sounds. Consonants, especially, have been found to be context dependent; their acoustic pattern depends on the vowel that comes after the consonant.

The segmentation problem and the problem of parallel transmission are especially apparent in automated speech recognition programs. Because people adjust the articulation of phonemes in anticipation of the next phoneme, phonemes do not always sound the same. This lack of invariance makes it difficult for programs that have been set to identify the prototypical phoneme to detect variations of that phoneme. The effects of assimilation, epenthesis and elision are especially troublesome for automated speech recognition programs. While humans are able to understand sentences with these changes, an automated program may fail to recognize the words or may misperceive them. For example, people can tell the difference between “come on” and “common,” because there is a slight variation in the phonemes and the context provides clues for interpretation. However, an automated program is likely to confuse one for the other. Automated programs are most effective when they have been adjusted to one person’s voice for a few key words and phrases.

Voice onset time (VOT)


Voice onset time (VOT) is the time between the release of a stop consonant and the onset of voicing, that is, the start of vibration of the vocal cords. VOT is often used to identify segments in continuous speech. For example, "pa" and "ba" are two different sounds that differ mainly in VOT: at about 20 milliseconds and shorter we perceive the sound "ba", and at about 50 milliseconds and longer we perceive the sound "pa". At about 30 milliseconds the sound is ambiguous between "pa" and "ba", making it difficult to decide which sound it actually is.

In categorical perception, the length of the VOT is used to identify sounds. People hear sounds as belonging to distinct categories. Within a phonemic category, people are not very good at discriminating between sounds. However, people are much better at discriminating between sounds from different phonemic categories. VOT and categorical perception allow people to distinguish between the sounds of different phonemes.
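
The idea of a category boundary can be sketched in a few lines of Python. This is only a toy illustration, not a model from the literature. It assumes the usual English pattern in which short VOTs are heard as the voiced "ba" and long VOTs as the voiceless "pa". The function names `classify_vot` and `same_category` and the single 30 ms boundary are hypothetical choices made for this example.

```python
def classify_vot(vot_ms, boundary_ms=30.0):
    """Toy categorical classifier: short VOT -> "ba", long VOT -> "pa".

    The 30 ms boundary is an assumed, illustrative value.
    """
    if vot_ms < boundary_ms:
        return "ba"
    if vot_ms > boundary_ms:
        return "pa"
    return "ambiguous"


def same_category(vot_a, vot_b, boundary_ms=30.0):
    """True when two stimuli fall on the same side of the boundary."""
    return classify_vot(vot_a, boundary_ms) == classify_vot(vot_b, boundary_ms)


# Two pairs with the same 20 ms physical difference:
print(same_category(0, 20))   # both inside the "ba" category
print(same_category(20, 40))  # the pair straddles the boundary
```

The two pairs differ by the same number of milliseconds, but only the second pair crosses the boundary. In categorical perception, that is the pair listeners would be expected to discriminate more easily.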

The perception of these segments is affected by knowledge of the language. English speakers who do not speak Spanish are unable to tell where the segments of Spanish speech are divided, but they can identify the segments in English.

Phoneme Prototypes
According to the concept of prototypes, people try to fit the vowel sounds they hear into a graded structure organized around what each phoneme is supposed to sound like under ideal conditions. People compare what they actually hear to what they know the phoneme should sound like. The magnitude of the difference between what the phoneme actually sounds like and what we expect it to sound like facilitates proper identification. It is easier to make the distinction between sounds that are farther away from the prototypical sound.
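
Comparing a heard vowel to stored prototypes can be sketched as a nearest-prototype lookup. The sketch below is a simplification under stated assumptions: vowels are reduced to two formant frequencies (F1 and F2), the prototype values are rough illustrative figures rather than measurements from any study cited here, and plain distance in Hz is used even though real listeners do not weight frequencies this way. The names `PROTOTYPES` and `identify_vowel` are hypothetical.

```python
import math

# Approximate first and second formant frequencies (Hz) for three vowels.
# Illustrative values only; real prototypes vary by speaker and language.
PROTOTYPES = {
    "i": (270.0, 2290.0),  # roughly the vowel in "heed"
    "a": (730.0, 1090.0),  # roughly the vowel in "hod"
    "u": (300.0, 870.0),   # roughly the vowel in "who'd"
}


def identify_vowel(token):
    """Return (closest prototype label, distance to that prototype in Hz)."""
    label = min(PROTOTYPES, key=lambda v: math.dist(token, PROTOTYPES[v]))
    return label, math.dist(token, PROTOTYPES[label])


# A heard vowel that is close to, but not exactly at, the "i" prototype.
label, distance = identify_vowel((310.0, 2100.0))
print(label, round(distance))
```

The returned distance plays the role of the "magnitude of the difference" described above: a small distance means the token is a good example of its category, and a large distance means it is a poor one.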

Pauses
Pauses are an important component of speech. Depending on their placement and their duration, they can indicate the importance and the salience of the semantic meaning of a statement. Political speeches are powerful examples of using pauses to hold an audience and place emphasis on certain statements. The link below is a short clip from President Kennedy’s famous inaugural speech. He uses pauses to show emphasis and create excitement.

http://www.youtube.com/watch?v=JLdA1ikkoEc



However, pauses are not always intentional, and some must be filtered out or they will create confusion in the perception of continuous speech. Duez (1985) studied the perception of silent pauses in speech. She tested utterance boundary pauses (pauses between syntactically independent sequences), clause boundary pauses (pauses between clauses in the utterance), constituent boundary pauses (pauses between phrases), and within-constituent pauses (pauses between connected elements, e.g., a pronoun and an article). In her tests, she used normal speech and inverted speech to test for differences in pause perception. She found that in 70% of the cases there was a similar perception of pauses whether the speech was played normally or inverted. These results emphasize the role of prosodic features of speech, such as intonation, rhythm, and tone. She attributes the remaining 30% of variation to differences in semantic context: people are more likely to hear a pause when it is expected, and sometimes do not hear a pause that is unexpected. Duez also determined that pause duration is positively correlated with the identification of pauses in speech (Duez, 1985).

Bottom-up vs. Top-down Processing
In bottom-up processing, people examine the sensory data and draw conclusions about what the sensory information means. In language, this means people hear the combination of phonemes and use that data to draw conclusions about which words are being produced. In top-down processing, the opposite happens. People think about what they expect to hear and, judging from the context, they fill in the blanks. Some researchers believe this phenomenon explains the phonemic restoration effect. The perception of continuous speech relies on the combination of bottom-up and top-down processing; people are able to listen to the available cues and understand words as they fit with the context. This combination of processes allows us to rapidly understand continuous speech.

Phonemic Restoration


Phonemic restoration is the mind’s ability to fill in a missing phoneme based on the context in which it is presented. Warren and Warren (1970) designed an experiment to examine phonemic restoration. Participants came into a lab and listened to a tape in which one phoneme had been removed. The researchers paired a cough with the fragment “eel” and placed it in sentences where the context affected how the subjects interpreted the word. For example, in a sentence involving shoes, participants heard “heel” rather than a cough followed by the word “eel” (Warren & Warren, 1970). The following are additional examples:


 * It was found that the (cough)eel was on the axle
 * (cough)eel was interpreted as “wheel”
 * It was found that the (cough)eel was on the shoe
 * (cough)eel was interpreted as “heel”
 * It was found that the (cough)eel was on the orange
 * (cough)eel was interpreted as “peel”
 * It was found that the (cough)eel was on the table
 * (cough)eel was interpreted as “meal”
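
The role of context in these examples can be sketched as a simple lookup. This is a toy illustration rather than a model of how restoration actually works: the table `CONTEXT_FIT` and the function `restore` are hypothetical, and the table simply hard-codes the four pairings listed above.

```python
# Hypothetical association table: which word best fits each context word.
CONTEXT_FIT = {
    "axle": "wheel",
    "shoe": "heel",
    "orange": "peel",
    "table": "meal",
}


def restore(sentence):
    """Guess the word behind "(cough)eel" from the last word of the sentence."""
    last_word = sentence.rstrip(".").split()[-1].lower()
    # With no helpful context, fall back on the fragment that was actually heard.
    return CONTEXT_FIT.get(last_word, "eel")


for sentence in [
    "It was found that the (cough)eel was on the axle",
    "It was found that the (cough)eel was on the shoe",
    "It was found that the (cough)eel was on the orange",
    "It was found that the (cough)eel was on the table",
]:
    print(restore(sentence))
```

Note that the disambiguating word comes at the end of each sentence, after the cough. The sketch therefore has to wait for the whole sentence before deciding, which mirrors the point that the listener's interpretation depends on context that arrives later.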

There are two main theories regarding how phonemic restoration works. The first theory is based on the sensory system. It states that the interaction between top-down and bottom-up processing affects how the word is processed through the sensitivity effect. According to this theory, the acoustic information of a sound is affected by the context in which it is perceived. The second theory is based on the cognitive representation of a word and incorporates the bias effect. In the bias effect, the top-down and bottom-up information are combined, but they do not interact as they do in the first theory. In this model, the output from the bottom-up processing system is biased by what a person is expecting via top-down processing.

Follow this link to engage in an example of phonemic restoration. http://www.youtube.com/watch?v=UlJs24j3i8E

Gating Task


People often guess what a word will be before it has been completely said, based on the context of the conversation. These predictions are more accurate when more phonemic information about the word is available. Grosjean (1980) created a task in which participants heard fragments of a word. Each fragment started at the beginning of the word and was a little longer than the one before. He found that there were three main points in understanding a word. The first is called the isolation point, at which the participant may be able to guess the word. The second is called the uniqueness point, at which the participant realizes there is only one word that could be correct. These two points do not necessarily yield the same word, but the idea of what the word could be is narrowed from a broad guess to a more informed evaluation. The last is the recognition point, at which the participant knows they have correctly identified the word (Grosjean, 1980). In everyday conversations, because speech moves so quickly, there is very little time between the isolation, uniqueness and recognition points.
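
The narrowing of candidates across gates can be sketched with a small word list. This is a simplified illustration under several assumptions: the six-word `LEXICON` is hypothetical, written letters stand in for the audio heard at each gate, and the uniqueness point is taken to be the first gate at which only the target word remains. It does not model the isolation or recognition points, which depend on the listener's guesses and confidence.

```python
# Tiny hypothetical lexicon; letters stand in for the audio of each gate.
LEXICON = ["cap", "captain", "captive", "capital", "cat", "candle"]


def candidates(fragment, lexicon=LEXICON):
    """Words still compatible with the fragment heard so far."""
    return [word for word in lexicon if word.startswith(fragment)]


def gate(word, lexicon=LEXICON):
    """Yield (fragment, remaining candidates) for successively longer gates."""
    for end in range(1, len(word) + 1):
        fragment = word[:end]
        yield fragment, candidates(fragment, lexicon)


def uniqueness_point(word, lexicon=LEXICON):
    """Length of the shortest fragment compatible with only the target word."""
    for fragment, remaining in gate(word, lexicon):
        if remaining == [word]:
            return len(fragment)
    return None  # never unique, e.g. "cap" is the start of other words


for fragment, remaining in gate("captain"):
    print(fragment, remaining)
print(uniqueness_point("captain"))
```

With this word list, "captain" competes with "captive" until the fifth letter, so the set of candidates shrinks from six words to one as the fragment grows. A word such as "cap" never becomes unique from the sound alone, which is one place where context would have to do the work.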

The Importance of Grammar


The grammar of a sentence can also strongly affect how people understand the semantics of the statement. In an experiment, Miller and Isard (1962) played sentences mixed with white noise. They found that the words in syntactically correct sentences were easier to recognize. Participants were able to recognize grammatical sentences, such as “Accidents kill motorists on highways,” better than anomalous (but still grammatical) sentences, such as “Accidents carry honey between the house.” The participants were least successful at perceiving words in ungrammatical sentences played with white noise, such as “Around accidents country honey shoot” (Miller & Isard, 1962). This supports the theory of top-down processing: the context of a word clearly affects people’s ability to understand it, because context shapes what they expect to hear.

Multimodal Perception
The perception of continuous speech clearly does not rely only on auditory input and acoustic representation, as demonstrated by phonemic restoration. However, cognitive input is not the only process that affects how sounds are interpreted; input from other sensory systems can change how sounds are heard. Visual input (such as lip-reading and interpreting gestures or expressions) often helps us understand what people are saying. However, sometimes this input conflicts with the auditory input of a stimulus. In the McGurk effect, the sound of one syllable (e.g., “bah”) combined with the visual input of another syllable (e.g., “gah”) can create the perception of a different sound altogether (e.g., “dah”). McGurk and MacDonald (1976) tried to explain this phenomenon by saying that “bah” has more in common acoustically with “dah” than with “gah.” Similarly, “gah” shares more visual similarities with “dah” than with “bah.” This creates a fused percept from the auditory and visual processing systems. Since “bah” and “gah” are similar neither acoustically nor visually, the brain finds a middle ground: “dah” (McGurk & MacDonald, 1976).
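
The "middle ground" explanation can be made concrete with a toy calculation. Everything in this sketch is an assumption made for illustration: the support values are invented, and multiplying the auditory and visual support is just one simple way to combine two sources of evidence, not a rule stated by McGurk and MacDonald.

```python
# Hypothetical support values (0 to 1) for each syllable from each modality.
AUDITORY_BAH = {"bah": 0.60, "dah": 0.35, "gah": 0.05}  # sound track says "bah"
VISUAL_GAH = {"bah": 0.05, "dah": 0.40, "gah": 0.55}    # lips show "gah"


def fuse(auditory, visual):
    """Multiply the two sources of support and normalize so they sum to 1."""
    combined = {s: auditory[s] * visual[s] for s in auditory}
    total = sum(combined.values())
    return {s: value / total for s, value in combined.items()}


fused = fuse(AUDITORY_BAH, VISUAL_GAH)
percept = max(fused, key=fused.get)
print(percept)
```

With these made-up numbers, "bah" is ruled out by the eyes and "gah" is ruled out by the ears, so "dah" is the only syllable that both modalities find at least moderately plausible.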

To illustrate this point, I have included a link to a video. Watch the following link three times – the first time, close your eyes and listen to the video; the second time, mute the video, and watch it; and the third time, watch the video with the sound on. The video uses a visual recording of a man saying “gah” and pairs it with the sound recording of a man saying “bah.” This creates the misperception of the word “dah.”

http://www.youtube.com/watch?v=aFPtc8BVdJk



Massaro (2001) summarizes a number of studies he has performed over the years. He discusses how views of speech perception have progressed and how, although it was once assumed to be unimodal, research has demonstrated its multimodal qualities. For example, in one study, he presented participants with an auditory, a visual, and a bimodal presentation of a phoneme. He found that people were better able to correctly identify the phonemes in the bimodal presentation than in the auditory-only or visual-only presentation. He also found that the McGurk effect applies to more than just individual sounds. The auditory track of “My bab pop me poo brive,” paired with the visual stimulus of “My gag kok me koo grive,” is understood as “My dad taught me to drive” (Massaro, 2001). Through the McGurk effect on individual phonemes, entire strings of nonsense words can be integrated into a meaningful sentence.

Green and Kuhl (1991) examined the effects that place and voicing have on the perception of continuous speech. Their research supports the theory behind the McGurk effect. They concluded that the way people classify place and voicing is not based only on the auditory system. They suggest that visual processes either work in parallel with auditory processes or are integrated into the auditory mechanisms (Green & Kuhl, 1991).

Normalization


When listening to a person speak, people are able to understand what the speaker is saying even if the speaker has an accent or speaks very quickly. This is possible because the brain engages in normalization. There are two main forms of normalization with regard to the perception of continuous speech. The first is vocal tract normalization, in which a listener adjusts to a particular speaker. Instead of trying to match what a person is saying to absolute prototypical phonemes, the listener is able to rely on the ratios between formants.
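
Why formant ratios help can be shown with a small numerical sketch. It rests on an idealized assumption that is not taken from this article: that a speaker with a shorter vocal tract produces the same vowels with every formant scaled up by a constant factor. The formant values, the factor of 1.2, and the name `formant_ratio` are all hypothetical.

```python
def formant_ratio(f1_hz, f2_hz):
    """Ratio of the second formant to the first formant."""
    return f2_hz / f1_hz


# Hypothetical speaker A, and speaker B whose formants are A's scaled by 1.2
# (an idealized stand-in for a shorter vocal tract).
SPEAKER_A = {"i": (300.0, 2400.0), "a": (700.0, 1100.0)}
SPEAKER_B = {v: (f1 * 1.2, f2 * 1.2) for v, (f1, f2) in SPEAKER_A.items()}

for vowel in SPEAKER_A:
    ratio_a = formant_ratio(*SPEAKER_A[vowel])
    ratio_b = formant_ratio(*SPEAKER_B[vowel])
    print(vowel, round(ratio_a, 2), round(ratio_b, 2))
```

The absolute frequencies differ between the two speakers, but the ratio for each vowel is the same, so a listener who attends to ratios does not need a separate set of prototypes for each voice. Real speakers do not differ by a perfectly uniform scale factor, so this is only a first approximation.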

Speech rate normalization is another adaptive process that allows us to adjust to the speed at which someone speaks. Rate normalization is defined as a decrease in VOT that corresponds with an increase in utterance rate. In a series of experiments, Diehl, Souther and Convis (1980) determined that a listener’s ability to normalize to speech rate was greatly affected by the gender of the speaker. They found that, overall, people are able to normalize to a male voice, but have significantly more difficulty normalizing to a female voice. In fact, the female voice seemed to have the opposite effect. Through experimentation, they found that there are certain conditions in which the gender effect can be neutralized, including adjustments in pitch and vocal-tract size (Diehl, Souther & Convis, 1980).

In more recent years, research on vocal tract length normalization and speech rate normalization has been used in the development of technology for automatic speech recognition. In experiments conducted by Pfau, Faltlhauser and Ruske (2001), researchers calculated an optimal warping factor to determine normalized feature vectors, which were used to aid the automatic recognition of continuous speech. They found that by applying these vectors and using normalization, under ideal conditions, there could be up to a 17% reduction in word error rate (Pfau, Faltlhauser & Ruske, 2001). These findings may indicate that there is a way to train automated speech recognition programs to a specific voice, and to recognize the changes in speech (assimilation through coarticulation and parallel transmission, epenthesis, and elision), for a more accurate perception of continuous speech.
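
Word error rate, the measure mentioned above, is commonly computed as the word-level edit distance between what was said and what the program recognized, divided by the number of words actually said. The sketch below shows that standard calculation. It is not the evaluation code of the study cited above, and the helper `relative_reduction` assumes the reduction is expressed relative to the baseline, which this article does not specify.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which the error rate dropped, relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# The "come on" / "common" confusion counts as one substitution plus one deletion.
wer = word_error_rate("come on over here", "common over here")
print(wer)
```

In this example two of the four spoken words are affected, so the word error rate is 0.5. Under the relative convention assumed here, a drop from a word error rate of 0.30 to 0.249 would be a 17% reduction.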

Learning Exercise
This Wikiversity Learning Exercise is designed to help you connect the concepts in this article to the real world. The first four questions are based on the text, but they require critical thinking about the subject matter, including drawing on your own experience and providing examples. The other questions are related to the video links.

1.	Explain the difference between bottom-up and top-down processing. State which processing system is more useful in the following situations, and why.
 * a.	When listening to someone with a strong accent
 * b.	When listening to someone who speaks very quickly
 * c.	When listening to someone discuss a subject you are not familiar with (hint: familiarity affects the context)

2.	There are two theories about how phonemic restoration works. Which do you agree with and why? Support your argument with information in this article.

3.	The three critical stages of word identification (isolation, uniqueness and recognition) allow people to predict what the word might be before it is said.
 * a.	Using an example, explain how each stage can yield different words as the listener gains more information.
 * b.	Explain how the gating task relates to top-down processing.

4.	Explain the importance of normalization with regard to evolution. Diehl, Souther and Convis (1980) found that people are able to normalize to male voices more easily than to female voices. Provide a reasonable explanation for the gender difference in normalization, and explain how Diehl, Souther and Convis (1980) were able to neutralize the gender difference in this study.

The perception of continuous speech is very different from the perception of individual words or phonemes. I created the following video to demonstrate the difference in how words sound in the context of continuous speech compared to when they are isolated. Watch the following video and try to pick out the first three words: http://www.youtube.com/watch?v=HwWbzgm0ySA

5.	Were the words easier to understand individually or in context? Using this video as an example, explain the importance of context with regard to the perception of continuous speech. In your answer, mention the effects top-down processing and parallel transmission have on the understanding of isolated words in continuous speech.

This video explains the usefulness of phonemic restoration in everyday life. Watch the following video and then answer the questions: http://www.youtube.com/watch?v=k74KCfSDCn8

6.	Rate the level of difficulty of understanding for each situation
 * a.	Gaps in the statement were filled with silence at the beginning of the test (prior to hearing the gaps filled with noise)
 * b.	Gaps in the statement were filled with noise
 * c.	Gaps in the statement were filled with silence at the end of the test (after hearing the gaps filled with noise)

7.	Why was it more difficult to understand the sentence in situation 6a than in situation 6c?

I edited the soundtrack from the 2001 movie, “Snatch,” to take out any offensive language. In the movie, one of the characters is notoriously difficult to understand. Please follow these steps to learn about how expectancy and top-down processing can affect understanding:

8.	Watch the clip and listen to the dialogue: http://www.youtube.com/watch?v=xSiBwuNA_Nc

9.	Read the following section of the script from the movie, “Snatch”


 * Narrator: There was a problem with pikeys or gypsies.
 * Mickey: What're you doing? Get out of the way, man.
 * Narrator: You can't understand what's being said.
 * Mickey: You Tommy? Come about the caravan? Call me Mickey.
 * Narrator: Not Irish, not English.
 * Tommy: How are you?
 * Mickey: Weather's been kind, but it's harsh, to be honest.
 * Narrator: It's just, well, it's just Pikey.
 * Mickey: Would you look at the size of him? How big are you? Kids, how big is he?
 * Kids: Big man, that's for sure.
 * Mickey: Hey, Mam, come and look at the size of this fella. Bet you box a little, can't you, sir? You look like a boxer.
 * Mickey’s Mom: Get out of the way. See if they'd like a drink.
 * Tommy: I could murder one.
 * Mickey’s Mom: Be no murdering done around here, I don't mind telling you.
 * Mickey: Get your hands out of there, you cheeky little boy. Cup of tea for the big fella?
 * Mickey’s Mom: Don't be silly, Mickey. Offer the man a proper drink, boy.

10.	Watch it again.

11.	Explain how expectations and top-down processing affected your perception of speech in this sound clip.