Speech Recognition/Performance

The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with the word error rate (WER), whereas speed is measured with the real time factor (RTF). Other measures of accuracy include the Single Word Error Rate (SWER) and the Command Success Rate (CSR).
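As a rough illustration of the two main measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript, and the real time factor as processing time divided by audio duration. The function names and the sample sentences are assumptions made for this example and do not come from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion of a reference word
                          curr[j - 1] + 1,      # insertion of a hypothesis word
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1 means the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds


# One substitution ("is" -> "was") and one insertion ("today") over 4 reference words
print(word_error_rate("the apple is red", "the apple was red today"))  # 0.5
# 3 seconds of processing for 6 seconds of audio
print(real_time_factor(3.0, 6.0))  # 0.5
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.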

Speech recognition by machine is, however, a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is distorted by background noise, echoes, and electrical characteristics (for example, of the microphone or transmission channel). Accuracy of speech recognition may vary with the following:
 * Vocabulary size and confusability
 * Speaker dependence versus independence
 * Isolated, discontinuous or continuous speech
 * Task and language constraints
 * Read versus spontaneous speech
 * Adverse conditions

Learning Tasks

 * Explain how the size of the vocabulary can be significantly reduced if the speaker selects an application domain for the speech recognition (e.g. a medical doctor dictates the findings of a computed tomography image, then drives home and dictates an e-mail to a friend about meeting for sports). Both applications need a specific vocabulary: a subset of the words used in the medical environment might not be used in the private environment, and vice versa. Other subsets of the vocabulary are used in both domains. Select a few words as examples for each of these sets, and select words in the intersection of both domains.
 * Reducing the size of the vocabulary helps to improve accuracy and performance, especially on mobile devices. How can the speech recognition system itself detect the application domain?
 * What are the options for applying Machine Learning to perform the domain detection?

Accuracy
As mentioned earlier in this article, accuracy of speech recognition may vary depending on the following factors:

Vocabulary Size

 * Error rates increase as the vocabulary size grows: e.g. the 10 digits "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5000 or 100000 may have error rates of 3%, 7% or 45% respectively.


 * Vocabulary is hard to recognize if it contains confusable words: e.g. the 26 letters of the English alphabet are difficult to discriminate because many of them sound alike (most notoriously, the E-set: "B, C, D, E, G, P, T, V, Z"); an 8% error rate is considered good for this vocabulary.


 * Speaker dependence vs. independence:
 * A speaker-dependent system is intended for use by a single speaker.
 * A speaker-independent system is intended for use by any speaker (more difficult).


 * Isolated, discontinuous or continuous speech:
 * With isolated speech, single words are used, which makes the speech easier to recognize.
 * With discontinuous speech, full sentences separated by silence are used, which makes the speech easier to recognize, as with isolated speech.
 * With continuous speech, naturally spoken sentences are used, which makes the speech harder to recognize than either isolated or discontinuous speech.


 * Task and language constraints:
 * e.g. a querying application may dismiss the hypothesis "The apple is red."
 * e.g. constraints may be semantic, rejecting "The apple is angry."
 * e.g. constraints may be syntactic, rejecting "Red is apple the."
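The semantic and syntactic examples in this list can be sketched as a toy hypothesis filter. The lexicon, the single accepted sentence pattern, and the table of permitted adjectives are all hypothetical and cover only these three sentences; a real recognizer would use a far richer grammar or a statistical language model.

```python
# Hypothetical toy lexicon: word -> part of speech
LEXICON = {"the": "DET", "apple": "NOUN", "is": "VERB",
           "red": "ADJ", "angry": "ADJ"}

# The only sentence pattern this toy grammar accepts (syntactic constraint)
PATTERN = ["DET", "NOUN", "VERB", "ADJ"]

# Hypothetical semantic constraint: adjectives that may describe each noun
ALLOWED_ADJECTIVES = {"apple": {"red"}}


def check_hypothesis(hypothesis: str) -> str:
    """Return 'accepted' or the kind of constraint the hypothesis violates."""
    words = hypothesis.lower().rstrip(".").split()
    tags = [LEXICON.get(word) for word in words]
    if tags != PATTERN:
        return "rejected: syntax"
    noun, adjective = words[1], words[3]
    if adjective not in ALLOWED_ADJECTIVES.get(noun, set()):
        return "rejected: semantics"
    return "accepted"


for sentence in ["The apple is red.", "The apple is angry.", "Red is apple the."]:
    print(sentence, "->", check_hypothesis(sentence))
```

The syntactic check runs first because the semantic check relies on the noun and adjective sitting in their expected positions.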

Grammar and Constraints
Constraints are often represented by a grammar.
 * Read vs. spontaneous speech – When a person reads, the text has usually been prepared in advance; spontaneous speech is more difficult to recognize because of disfluencies (such as "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.
 * Adverse conditions – Environmental noise (e.g. noise in a car or a factory) and acoustical distortions (e.g. echoes, room acoustics).

Speech recognition is a multi-levelled pattern recognition task.
 * Acoustical signals are structured into a hierarchy of units, e.g. phonemes, words, phrases, and sentences;
 * Each level provides additional constraints, e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at a lower level;
 * This hierarchy of constraints is exploited by combining decisions probabilistically at all lower levels, and making more deterministic decisions only at the highest level.

Speech recognition by a machine is a process broken into several phases. Computationally, it is a problem in which a sound pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. As the more complex sound signal is broken into smaller sub-sounds, different levels are created: at the top level are complex sounds, which are made of simpler sounds on the level below, and each further step down yields shorter, simpler and more basic sounds.

At the lowest level, where the sounds are the most fundamental, a machine would check simple, more probabilistic rules for what a sound should represent. Once these sounds are put together into a more complex sound on a higher level, a new set of more deterministic rules should predict what the new complex sound should represent. At the uppermost level, deterministic rules should figure out the meaning of complex expressions.

In order to expand our knowledge about speech recognition, we need to take neural networks into consideration. Neural network approaches involve four steps, the first two of which are:
 * Digitize the speech that we want to recognize; for telephone speech the sampling rate is 8000 samples per second;
 * Compute features of the spectral domain of the speech (with a Fourier transform); these are computed every 10 ms, with one 10 ms section called a frame.
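The digitizing and spectral-feature steps can be sketched with the figures given here: at 8000 samples per second, a 10 ms frame holds 80 samples. The code below is a minimal illustration under stated assumptions: a synthetic 1000 Hz tone stands in for digitized speech, the frames do not overlap, and a naive discrete Fourier transform is used for readability. Practical systems typically use overlapping, windowed frames and a fast Fourier transform. All names are chosen for this example.

```python
import cmath
import math

SAMPLE_RATE = 8000                            # samples per second (telephone speech)
FRAME_MS = 10                                 # frame length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 80 samples per frame


def split_into_frames(samples):
    """Cut the signal into consecutive, non-overlapping 10 ms frames."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; magnitudes for bins 0 .. N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# Stand-in for digitized speech: 0.1 s of a 1000 Hz tone (800 samples)
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]

frames = split_into_frames(signal)
spectrum = magnitude_spectrum(frames[0])

# Each bin is SAMPLE_RATE / FRAME_LEN = 100 Hz wide
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN

print(len(frames), "frames of", FRAME_LEN, "samples")
print("strongest frequency in first frame:", peak_hz, "Hz")
```

With 80 samples per frame the frequency resolution is 100 Hz per bin, so shorter frames give finer time resolution at the cost of coarser frequency resolution.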

To analyze these neural network approaches, some background on sound is needed. Sound is produced by the vibration of air (or some other medium), which humans register with their ears and machines with receivers. A basic sound creates a wave that has two descriptions: amplitude (how strong it is) and frequency (how often it vibrates per second).
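The two descriptions of a basic sound wave can be made concrete with a short sketch. It generates a sampled sine wave from a chosen amplitude and frequency, then recovers both values from the samples: the amplitude as the largest absolute sample value, and the frequency by counting upward zero crossings per second. The 440 Hz tone, the 8000 Hz sample rate and all names are assumptions for illustration, and the zero-crossing estimate is only approximate.

```python
import math

SAMPLE_RATE = 8000      # samples per second (assumed for this example)
AMPLITUDE = 0.5         # how strong the wave is
FREQUENCY = 440         # vibrations per second (Hz)
DURATION_S = 1          # length of the generated signal in seconds

# Sample the wave amplitude * sin(2 * pi * frequency * time)
wave = [AMPLITUDE * math.sin(2 * math.pi * FREQUENCY * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE * DURATION_S)]

# Amplitude estimate: the largest deviation from zero
estimated_amplitude = max(abs(sample) for sample in wave)

# Frequency estimate: each full vibration crosses zero upward once
upward_crossings = sum(1 for previous, current in zip(wave, wave[1:])
                       if previous < 0 <= current)
estimated_frequency = upward_crossings / DURATION_S

print("amplitude ~", round(estimated_amplitude, 3))
print("frequency ~", estimated_frequency, "Hz")
```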

Security concerns
Speech recognition can become a means of attack, theft, or accidental operation. For example, activation words like "Alexa" spoken in an audio or video broadcast can cause devices in homes and offices to start listening for input inappropriately, or possibly to take an unwanted action. Voice-controlled devices are also accessible to visitors to the building, or even to people outside the building if they can be heard inside. Attackers may be able to gain access to personal information such as calendar and address book contents, private messages, and documents. They may also be able to impersonate the user to send messages or make online purchases.

Two attacks have been demonstrated that use artificial sounds. One transmits ultrasound and attempts to send commands without nearby people noticing. The other adds small, inaudible distortions to other speech or music that are specially crafted to confuse a specific speech recognition system into recognizing music as speech, or to make what sounds like one command to a human sound like a different command to the system.
