Speech Recognition

This learning resource is about the automatic conversion of spoken language into text that can be stored as documents or processed as commands to control devices, e.g. for people with disabilities or elderly people, or, in a commercial setting, to order goods and services by voice command. The learning resource is based on the Open Community Approach, so the tools used are Open Source to ensure that learners have access to them.



Learning Tasks

 * (Applications of Speech Recognition) Analyse the possible applications of speech recognition and identify challenges of the application!
 * (Human Speech Recognition) Compare human comprehension of speech with the algorithmic speech recognition approach. What are the similarities and differences of human and algorithmic speech recognition?
 * (Speech and Detection of Emotions) Speech contains more information than the encoded text. Is it possible to detect emotions in the speech with methods developed in computer science?
 * What are the similarities and differences between text recognition and emotion recognition in speech analysis?
 * What are possible application areas in digital assistants for both speech recognition and emotion recognition?
 * Analyze the different types of information systems and identify different areas of application of speech recognition and include mobile devices in your consideration!
 * (History) Analyse the history of speech recognition and compare the steps of development with current applications. Identify the major steps that are required for the current applications of speech recognition!
 * (Risk Literacy) Identify possible areas of risk and possible risk mitigation strategies if speech recognition is implemented in mobile devices or, more generally, used for voice control in the Internet of Things. What capacity building measures are required for business, research and development?
 * (Commercial Data Harvesting) Apply the concept of speech recognition to commercial data harvesting. What are the potential benefits for generating advertisements tailored to users according to their generated profile? How does speech recognition contribute to a user profile? How do offline and online speech recognition systems differ with respect to the submission of recognized text or audio files to remote servers?
 * (Context Awareness of Speech Recognition) The word "Fire" spoken with a candle in your hand creates a different context, and different expectations in the listener, than the same word spoken with a burning house in the background. Explain why context awareness can help to improve recognition accuracy. How can a speech recognition system detect the context of an utterance automatically, i.e. without a user setting that switches to a dictation mode, e.g. for medical reports on X-ray images?
 * (Audio-Video-Compression) Go to the learning resource about Audio-Video-Compression and explain how Speech Recognition can be used in conjunction with Speech Synthesis to reduce the consumption of bandwidth for Video conferencing.
 * (/Performance/) Explain why the performance and accuracy of speech recognition are relevant in many applications. Discuss applications in cars or in vehicles in general. Which voice commands can be applied in a traffic situation, and which commands, if not recognized accurately, could cause trouble or even an accident for the driver? Order the theoretical applications of speech recognition (e.g. "turn right at crossing", "switch on/off music", ...) in terms of the performance and accuracy they require, and compare this with the ability of currently available technologies to perform each command in an acceptable way.
 * (HTML5 Speech Recognition) Analyze the source code of the OpenSource web application demo with PocketSphinx (use browser Firefox/Chromium or Chrome).
 * Explain how the recognized words are encoded for speech recognition in the demo application (digits, cities, operating systems).
 * Explain how speech recognition can support people with disabilities in navigating a WebApp or an offline AppLSAC for digital learning environments.
 * (Size of Vocabulary) Explain how the size of the recognized vocabulary determines the precision of recognition.
 * (People with Disabilities) Explore the available Open Source frameworks and offline infrastructure for speech recognition that work without sending audio streams to a remote server for processing. Identify options for controlling robots by voice, or for using voice recognition in the context of Ambient Assisted Living.
 * (Version Control) Explore the concept of Version Control and apply that specifically to the Open Community Approach:
 * Collaborative development of the Open Source code base of the speech recognition infrastructure,
 * Application on the collaborative development of a domain specific vocabulary for speech recognition for specific application scenarios.
 * Application on Open Educational Resources that support learners in using speech recognition and Open Source developers in integrating Open Source frameworks into learning environments.
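
For the Performance task above, one common way to quantify recognition accuracy is the word error rate (WER): the number of substituted, deleted, and inserted words needed to turn the recognized text into the reference text, divided by the number of reference words. The following is a minimal Python sketch. The function name word_error_rate and the example commands are illustrative assumptions and do not belong to any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference
    # words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical in-car command with one misrecognized word out of four.
    print(word_error_rate("turn right at crossing", "turn light at crossing"))  # 0.25
```

Note that a low WER does not by itself mean a command is safe: in the example, a single substituted word changes the meaning of a driving instruction, which is why the task asks you to rank commands by the accuracy they require.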

Definition
Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It incorporates knowledge and research from linguistics, computer science, and electrical engineering.

Training of Speech Recognition Algorithms
Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent" systems. Systems that use training are called "speaker dependent".

Applications
Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. finding a podcast where particular words were spoken), simple data entry (e.g. entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g. word processors, emails, and generating a string-searchable transcript from an audio track), and aircraft control (usually termed direct voice input).

The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

Models, methods, and algorithms
Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation.


 * /Hidden Markov Model/
 * /Dynamic Time Warping/


 * /Neural Networks/


 * /End-to-End Automated Speech Recognition/
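
To make the role of Hidden Markov models more concrete, the following minimal Python sketch runs Viterbi decoding, the standard way of finding the most likely hidden state sequence for a series of observations. The two states ("sil" for silence and "speech"), the two observation symbols ("low" and "high" signal energy), and all probabilities are made-up assumptions chosen for illustration. Real recognizers use phoneme-level states and continuous acoustic features.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its log-probability) for a discrete HMM.

    All probabilities must be strictly positive, because the sketch
    works with logarithms to avoid numerical underflow.
    """
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev_state

    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model (hypothetical numbers): silence tends to emit low energy,
# speech tends to emit high energy, and both states tend to persist.
STATES = ("sil", "speech")
START_P = {"sil": 0.8, "speech": 0.2}
TRANS_P = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
EMIT_P = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.2, "high": 0.8}}

if __name__ == "__main__":
    path, log_p = viterbi(["low", "high", "high", "low"],
                          STATES, START_P, TRANS_P, EMIT_P)
    print(path)  # ['sil', 'speech', 'speech', 'sil']
```

In a full recognizer, the emission probabilities play the role of the acoustic model, while the language model constrains which word sequences the decoder considers plausible.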

Learning Task: Applications
The following learning tasks focus on different applications of Speech Recognition. Explore the different applications.
 * /In-Car Systems/
 * /People with Disabilities/
 * /Health Care/
 * Telephone Support Systems

Usage in education and daily life
Speech recognition can be useful for learning a second language. It can teach proper pronunciation and help a person develop fluency in speaking.

Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard.

Students who are physically disabled or suffer from repetitive strain injury or other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also use speech recognition technology to search the Internet or use a computer at home without having to physically operate a mouse and keyboard.

Speech recognition can help students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing and be relieved of concerns regarding spelling, punctuation, and other mechanics of writing. See also Learning disability.

The use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software, has been reported to help restore damaged short-term memory capacity in individuals who have had a stroke or a craniotomy.

Further applications

 * Aerospace (e.g. space exploration, spacecraft, etc.): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the Lander
 * Automatic subtitling with speech recognition
 * Automatic emotion recognition
 * Automatic translation
 * Court reporting (Real time Speech Writing)
 * eDiscovery (Legal discovery)
 * Hands-free computing: Speech recognition computer user interface
 * Home automation
 * Interactive voice response
 * Mobile telephony, including mobile email
 * Multimodal interaction
 * Pronunciation evaluation in computer-aided language learning applications
 * Real Time Captioning
 * Robotics
 * Speech to text (transcription of speech into text, real time video captioning, court reporting)
 * Telematics (e.g. vehicle Navigation Systems)
 * Transcription (digital speech-to-text)
 * Video games, with Tom Clancy's EndWar and Lifeline as working examples
 * Virtual assistant (e.g. Apple's Siri)

Conferences and journals
Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and since Sept 2014 renamed IEEE/ACM Transactions on Audio, Speech and Language Processing—after merging with an ACM publication), Computer Speech and Language, and Speech Communication.

Books
Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner (1993) can be useful for acquiring basic knowledge but may not be fully up to date. Other good sources are "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing" (2001) by Xuedong Huang et al. More up to date are "Computer Speech" by Manfred R. Schroeder, whose second edition was published in 2004, and "Speech Processing: A Dynamic and Optimization-Oriented Approach", published in 2003 by Li Deng and Doug O'Shaughnessy. The updated textbook "Speech and Language Processing" (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition uses the same features, most of the same front-end processing, and the same classification techniques as speech recognition. A more recent comprehensive textbook, "Fundamentals of Speaker Recognition", is an in-depth source for up-to-date details on the theory and practice. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 was the GALE project, which involves both speech recognition and translation components).

A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by Roberto Pieraccini (2012).

The most recent book on speech recognition is Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer) written by D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.

Software
In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start, both to learn about speech recognition and to begin experimenting. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent, state-of-the-art techniques, the Kaldi toolkit can be used. In 2017 Mozilla launched the open source project Common Voice to gather a large database of voices that would help build the free speech recognition project DeepSpeech (available free on GitHub), which uses Google's open source platform TensorFlow.

A demonstration of an on-line speech recognizer is available on Cobalt's webpage.

For more software resources, see List of speech recognition software.

Page Information
This page was based on the following wikipedia-source page:
 * Speech Recognition https://en.wikipedia.org/wiki/Speech%20Recognition
 * Date: 7/2/2019 - Source History
 * Wikipedia2Wikiversity-Converter: https://niebert.github.com/Wikipedia2Wikiversity