Speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken aloud and convert them into readable text. Rudimentary speech recognition software has a limited vocabulary of words and phrases, and it may only identify these if they are spoken very clearly. More sophisticated software can handle natural speech, different accents and multiple languages.
Speech recognition draws on several fields of research in computer science, linguistics and computer engineering. Many modern devices and text-focused programs include speech recognition functions to allow for easier or hands-free use.
The terms speech recognition and voice recognition are sometimes used interchangeably, but they mean different things. Speech recognition identifies the words in spoken language. Voice recognition is a biometric technology that identifies a particular individual's voice, also known as speaker identification.
How it works
Speech recognition works by applying algorithms for acoustic modeling and language modeling. Acoustic modeling represents the relationship between linguistic units of speech and audio signals; language modeling matches sounds with word sequences to help distinguish between words that sound similar.
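As a rough illustration of how the two models work together, the sketch below picks between three words that sound identical. All scores are made-up numbers chosen for the example, and the names `acoustic_logp`, `lm_logp` and `pick_word` are hypothetical; a real recognizer would compute such scores from trained models.

```python
import math

# Hypothetical acoustic scores: log P(audio | word). The three homophones
# sound the same, so the acoustic model alone cannot separate them.
acoustic_logp = {
    "to": math.log(0.33),
    "too": math.log(0.33),
    "two": math.log(0.33),
}

# Hypothetical language-model scores: log P(word | previous word).
lm_logp = {
    ("these", "to"): math.log(0.01),
    ("these", "too"): math.log(0.01),
    ("these", "two"): math.log(0.20),
}


def pick_word(previous_word, candidates):
    """Return the candidate with the highest combined log score."""
    def score(word):
        # Adding log scores is the same as multiplying probabilities.
        return acoustic_logp[word] + lm_logp[(previous_word, word)]
    return max(candidates, key=score)


best = pick_word("these", ["to", "too", "two"])
print(best)
```

Because the acoustic scores are tied, the choice is decided entirely by the language-model score for the preceding word.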
Hidden Markov models (HMMs) are often used as well to recognize temporal patterns in speech and improve the accuracy of the system. An HMM models a randomly changing system under the assumption that the next state depends only on the current state, not on the states that came before it. Other methods used in speech recognition include natural language processing (NLP) and N-grams. NLP makes the speech recognition process easier and faster. N-grams, on the other hand, are a relatively simple approach to language modeling. They help create a probability distribution over a sequence of words by estimating how likely each word is given the few words before it.
More advanced speech recognition software uses AI and machine learning. These systems draw on grammar, structure and syntax, as well as the composition of audio and voice signals, to process speech. Software that uses machine learning improves the more it is used, which can make it easier for it to handle variations such as accents.
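The N-gram idea can be sketched with a bigram model, which estimates the probability of each word from the single word before it. This is a minimal sketch, assuming a tiny invented corpus and no smoothing; the function names `train_bigrams` and `sequence_probability` are hypothetical and not taken from any particular library.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count adjacent word pairs and convert counts to conditional probabilities."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


def sequence_probability(model, sentence):
    """Multiply bigram probabilities; unseen pairs get probability 0 (no smoothing)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob


# Invented training corpus, for illustration only.
corpus = [
    "recognize speech",
    "recognize speech",
    "recognize speech quickly",
    "wreck a nice beach",
]
model = train_bigrams(corpus)
p_speech = sequence_probability(model, "recognize speech")
p_beach = sequence_probability(model, "wreck a nice beach")
print(p_speech, p_beach)
```

With this corpus the model assigns a higher probability to "recognize speech" than to the similar-sounding "wreck a nice beach". Practical systems typically add smoothing so that unseen word pairs do not force the whole sequence to zero.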
Applications
Within the enterprise, speech recognition is used most often on mobile devices. For example, individuals can use it on smartphones for call routing, speech-to-text processing, voice dialing and voice search. A smartphone user could use speech recognition to respond to a text without having to look down at their phone. Speech recognition on iPhones, for example, is tied to other functions, like the keyboard and Siri. If a user adds a secondary language to their keyboard, they can then use speech recognition in that language, as long as the secondary language is selected on the keyboard when speech recognition is activated. To use other functions like Siri in the secondary language, the user would have to change the language settings.
Speech recognition can also be found in word processing applications like Microsoft Word, where users can dictate what they want to show up as text.
Pros and cons
While convenient, speech recognition technology is still under continuous development and has a few issues to work through. Its main advantages are that it is easy to use and readily available. Speech recognition software is now frequently installed on computers and mobile devices, allowing for easy access.
The downsides of speech recognition include its occasional failure to capture words because of variations in pronunciation, its lack of support for some languages and its difficulty filtering out background noise. These factors can lead to inaccuracies. Some speech recognition software may also be relatively slow to process speech.
Performance
Speech recognition performance is measured by accuracy and speed. Accuracy is measured with word error rate (WER): the number of substituted, deleted and inserted words divided by the number of words in the reference transcript. WER works at the word level and identifies inaccuracies in transcription, although it cannot identify how the error occurred. Speed is measured with the real-time factor, the ratio of processing time to the duration of the audio. A variety of factors can affect computer speech recognition performance, including pronunciation, accent, pitch, volume and background noise.
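Both measures are simple to compute. The sketch below assumes whitespace-separated words and uses a standard word-level edit distance for WER; the function names and the sample sentences and timings are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1.0 mean the audio is processed faster than real time."""
    return processing_seconds / audio_seconds


# One substitution ("sat" -> "sit") and one deletion (the second "the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(3.0, 12.0)
print(round(wer, 3), rtf)
```

Here two errors against a six-word reference give a WER of about 0.333, and 3 seconds of processing for 12 seconds of audio gives a real-time factor of 0.25. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.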