Speech Recognition/End-to-End Automated Speech Recognition

End-to-end ASR typically uses a client-server infrastructure: the client (e.g. a mobile device) records the speech, the server performs the speech recognition, and the recognized text is returned to the client (e.g. the mobile device).

Open Community Approach
This learning resource is based on the Open Community Approach, which ensures that the data and software are open and adaptable by learners. Nevertheless, speech recognition is also integrated into mobile devices for voice control and into large-vocabulary speech recognition products in commercial settings. These commercial developments are therefore not excluded from this learning resource, because they are integrated into mobile devices, televisions, and other IT devices used in our homes.

Introduction: End-to-end Automatic Speech Recognition
Since 2014, there has been much research interest in "end-to-end" ASR. Traditional phonetic-based (i.e., HMM-based) approaches required separate components and training for the pronunciation, acoustic, and language models. End-to-end models jointly learn all the components of the speech recognizer. This simplifies training for the provider (training data can be acquired online from the users) as well as deployment, so that updates and improvements to the speech recognizer reach the users quickly. For example, an n-gram language model is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes of memory, making it impractical to deploy on mobile devices. Consequently, modern commercial ASR systems from Google and Apple (as of 2017) are deployed in the cloud and require a network connection, as opposed to running locally on the device.
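To see why an n-gram language model is hard to fit on a mobile device, a back-of-envelope estimate helps. The vocabulary size, corpus count, and bytes-per-entry figures below are illustrative assumptions, not measurements of any particular system:

```python
# Back-of-envelope estimate of n-gram language model size.
# All numbers are illustrative assumptions for a large-vocabulary system.

vocab_size = 100_000                   # assumed word list size
possible_trigrams = vocab_size ** 3    # 10^15 -- why only observed n-grams are stored
observed_trigrams = 500_000_000        # trigrams seen in a large corpus (assumed)
bytes_per_entry = 8                    # packed word ids + quantized probability (assumed)

size_gb = observed_trigrams * bytes_per_entry / 1e9
print(f"possible trigrams: {possible_trigrams:.0e}")
print(f"~{size_gb:.0f} GB for observed trigrams before pruning/compression")
```

Even after aggressive pruning and quantization, a model of this scale fits a server far more comfortably than a phone, which is one motivation for cloud deployment.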

The first attempt at end-to-end ASR was with Connectionist Temporal Classification (CTC)-based systems introduced by Alex Graves of Google DeepMind and Navdeep Jaitly of the University of Toronto in 2014. The model consisted of recurrent neural networks and a CTC layer. Jointly, the RNN-CTC model learns the pronunciation and acoustic model together; however, it is incapable of learning the language model due to conditional-independence assumptions similar to those of an HMM. Consequently, CTC models can directly learn to map speech acoustics to English characters, but they make many common spelling mistakes and must rely on a separate language model to clean up the transcripts. Later, Baidu expanded on the work with extremely large datasets and demonstrated commercial success in Mandarin Chinese and English. In 2016, the University of Oxford presented LipNet, the first end-to-end sentence-level lip-reading model, which used spatiotemporal convolutions coupled with an RNN-CTC architecture and surpassed human-level performance on a restricted-grammar dataset. A large-scale CNN-RNN-CTC architecture was presented in 2018 by Google DeepMind, achieving six times better performance than human experts.
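The collapse rule at the heart of CTC's "acoustics to characters" mapping can be sketched in a few lines: the network emits one label per acoustic frame (including a special blank symbol), and the frame sequence is reduced by first removing consecutive repeats and then removing blanks. The greedy best-path decoder below is a minimal illustration, not a full beam-search decoder:

```python
# Minimal sketch of greedy (best-path) CTC decoding. We assume the
# per-frame argmax labels are already available as characters, with
# "-" standing in for the CTC blank symbol.

BLANK = "-"

def ctc_greedy_decode(frame_labels):
    """Collapse a per-frame label sequence into an output string."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:          # rule 1: drop consecutive repeats
            collapsed.append(label)
        prev = label
    # rule 2: drop blanks after repeats are merged
    return "".join(c for c in collapsed if c != BLANK)

# A blank between the two "l" runs is what allows a doubled letter:
print(ctc_greedy_decode(list("hh-e-ll-ll-oo")))  # hello
```

Because each frame's label is predicted independently given the acoustics, nothing in this decoding step prefers "hello" over the misspelling "helo", which is why CTC systems typically add an external language model during search.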

An alternative to CTC-based models are attention-based models. Attention-based ASR models were introduced simultaneously by Chan et al. of Carnegie Mellon University and Google Brain and by Bahdanau et al. of the University of Montreal in 2016. The model, named "Listen, Attend and Spell" (LAS), literally "listens" to the acoustic signal, pays "attention" to different parts of the signal, and "spells" out the transcript one character at a time. Unlike CTC-based models, attention-based models make no conditional-independence assumptions and can learn all the components of a speech recognizer, including the pronunciation, acoustic, and language models, directly. This means that, during deployment, there is no need to carry around a language model, making such models very practical for applications with limited memory. By the end of 2016, attention-based models had seen considerable success, including outperforming CTC models (with or without an external language model). Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by Carnegie Mellon University, MIT, and Google Brain to directly emit sub-word units, which are more natural than English characters; the University of Oxford and Google DeepMind extended LAS to "Watch, Listen, Attend and Spell" (WLAS) to handle lip reading, surpassing human-level performance.
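The "attend" step of a model like LAS can be illustrated with plain dot-product attention: the current decoder state is scored against every encoder frame, the scores are softmax-normalized into weights, and a weighted sum of the encoder states forms a context vector for predicting the next character. The tiny dimensions and values below are assumptions for illustration; LAS itself uses learned, more elaborate scoring functions:

```python
# Minimal sketch of dot-product attention over encoder states.
# All vectors here are toy values chosen for illustration.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score each encoder frame against the decoder state, then
    return (attention weights, context vector)."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Three encoder frames; the decoder state is most similar to frame 1,
# so that frame receives the largest attention weight.
enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([0.0, 2.0], enc)
print(weights)
```

Because the context vector is recomputed at every output step, the decoder can condition each character on all previously emitted characters plus the relevant slice of audio, which is how the language model ends up learned inside the same network.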

Learning Task

 * Explain why a centralized server/cloud-based speech recognition service simplifies the training and deployment processes.
 * Combine this approach with the concept of Commercial Data Harvesting. What information could violate privacy when speech recognition is applied in sensitive settings (medical environments, government, development units in companies, ...)?
 * Compare open-source speech recognition developments such as KDE Simon, which work offline, with commercial end-to-end automated speech recognition that is cloud-based. How can users decide which speech recognition samples can be shared with a centralized service and when speech recognition should be kept local due to the privacy concerns of the user or of the institution or company that uses speech recognition in its daily workflow?