In the past two years, automatic speech recognition (ASR) has made significant strides in commercial use. One indicator is that several enterprise-grade, neural-network-based ASR systems have been launched successfully, such as Alexa, Rev, AssemblyAI, and ASAPP. In 2016, Microsoft Research published a paper announcing that its model had achieved human-level performance (measured by word error rate) on the 25-year-old “Switchboard” dataset. ASR accuracy continues to improve, gradually reaching human level on more datasets and in more use cases.
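To make the metric mentioned above concrete, here is a minimal sketch of how word error rate is typically computed: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sample sentences are invented for illustration.

```python
# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# normalized by the length of the reference. Sample strings are invented.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.167
```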
[Figure. Source: Awni Hannun’s blog post “Speech Recognition is not Solved”]
With ASR accuracy greatly improved and application scenarios multiplying, we believe commercial ASR has not yet peaked; both the research and the market applications in this field remain to be explored. We expect AI speech research and commercial systems to focus on the following five areas over the next decade:
1 Multilingual ASR models
“Over the next decade, we’ll deploy true multilingual models in production, enabling developers to build applications that can understand anyone, in any language, truly unleashing the power of speech recognition to the world.”
[Figure. Source: “Unsupervised cross-lingual representation learning for speech recognition,” Alexis Conneau et al., 2020]
Today’s commercial ASR models are trained mainly on English datasets and therefore achieve higher accuracy on English input. Academia and industry have long focused on English because of data availability and market demand. Although recognition accuracy for commercially popular languages such as French, Spanish, Portuguese, and German is reasonable, there is clearly a long tail of languages with limited training data and relatively low ASR output quality.
Furthermore, most commercial systems handle a single language and so cannot serve the multilingual settings common in many societies. Multilingualism can take the form of back-to-back languages, such as media programming in bilingual countries; Amazon has recently made strides on this problem with a product that integrates language identification (LID) and ASR. In contrast, code-switching is a linguistic system, used by individual speakers, that mixes the words and grammar of two languages within the same sentence. This is an area where academia continues to make interesting progress.
Just as natural language processing has taken a multilingual turn, ASR will follow suit over the next decade. As we learn to leverage emerging end-to-end techniques, we will train large-scale multilingual models that transfer across many languages. Meta’s XLS-R is a good example: in one demo, a user could speak any of 21 languages without specifying which, and the model would translate the speech into English. By understanding and exploiting similarities between languages, these smarter ASR systems will deliver high-quality recognition for low-resource and mixed-language use cases and will enable commercial-grade applications.
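A minimal sketch of the LID-plus-ASR integration described above, assuming a hypothetical `identify_language` step and per-language recognizer placeholders rather than any vendor’s actual API; a true multilingual model such as XLS-R would handle all languages inside a single network instead of routing between them.

```python
# Sketch of a two-stage pipeline: run language identification first, then
# route the audio to a language-specific recognizer. All model objects and
# helper functions here are hypothetical placeholders.

from typing import Callable, Dict

# Hypothetical per-language recognizers keyed by ISO 639-1 code.
MODELS: Dict[str, Callable[[bytes], str]] = {
    "en": lambda audio: "...",  # stand-in for an English ASR model
    "es": lambda audio: "...",  # stand-in for a Spanish ASR model
    "fr": lambda audio: "...",  # stand-in for a French ASR model
}

def identify_language(audio: bytes) -> str:
    """Hypothetical LID step; a real system would score the audio against
    all supported languages and return the most likely one."""
    return "en"

def transcribe(audio: bytes) -> str:
    lang = identify_language(audio)
    model = MODELS.get(lang, MODELS["en"])  # fall back to English if unsupported
    return model(audio)
```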
2 Rich standardized output objects
“Over the next decade, we believe commercial ASR systems will output richer transcription objects that contain far more than simple words. Furthermore, we expect this richer output to be recognized by standards organizations such as the W3C, so that all APIs return similarly structured outputs. This will further unlock the potential of speech applications for everyone in the world.”
Although the National Institute of Standards and Technology (NIST) has a long tradition of exploring “rich transcription”, the field is still rudimentary when it comes to folding that richness into a standardized, extensible format for ASR output. Rich transcription originally covered capitalization, punctuation, and diarization, and has to some extent been extended to speaker roles and a range of non-linguistic speech events. Innovations we expect include transcribing overlapping speech from different speakers, emotions and other paralinguistic features, a range of non-linguistic and even non-human scenes and events, and information about textual or linguistic variation. Tanaka et al. describe a scenario in which a user may wish to choose among transcription options of varying richness, and we predict that the amount of additional information and the properties included will be specifiable, depending on the downstream application.
Traditional ASR systems can generate lattices of multiple hypotheses while recognizing speech, and these have proven highly valuable in human-assisted transcription, spoken dialogue systems, and information retrieval. Including n-best information in a rich output format would encourage more users to adopt ASR systems and improve the user experience. While no standard yet exists for structuring or storing the additional information that is, or could be, generated during speech decoding, CallMiner’s Open Voice Transcription Standard (OVTS) is a solid step in this direction, making it easier for businesses to explore and adopt multiple ASR providers.
We predict that future ASR systems will produce richer output in standard formats, enabling more powerful downstream applications. For example, an ASR system might output full hypothesis lattices, and an application could use this additional data for intelligent automated corrections when editing the transcript. Similarly, transcriptions that include extra metadata such as detected regional dialects, accents, ambient noise, or sentiment would enable more powerful search applications.
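To illustrate what such a richer transcription object might look like, here is a sketch using Python dataclasses. The field names (timestamps, speaker labels, n-best alternatives, dialect and noise metadata) are illustrative assumptions, not an existing W3C or OVTS schema.

```python
# One possible shape for a "rich" transcription object: words with timing,
# confidence, and n-best alternatives, plus utterance- and file-level
# metadata. Field names are illustrative, not an existing standard.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WordAlternative:
    text: str
    confidence: float            # 0.0 - 1.0

@dataclass
class Word:
    text: str
    start: float                 # seconds from the start of the audio
    end: float
    confidence: float
    alternatives: List[WordAlternative] = field(default_factory=list)  # n-best

@dataclass
class Utterance:
    speaker: Optional[str]       # diarization label, e.g. "spk_0"
    words: List[Word]
    emotion: Optional[str] = None        # paralinguistic annotation
    overlaps_with: Optional[str] = None  # overlapping speaker, if any

@dataclass
class Transcript:
    language: str                # e.g. "en-US"
    dialect: Optional[str]
    ambient_noise: Optional[str]
    utterances: List[Utterance] = field(default_factory=list)
```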
3 ASR at scale for everyone
“In this decade, ASR at scale (i.e., private, affordable, reliable, and fast) will become part of everyone’s daily life. These systems will let us search video, index all the media content we engage with, and make every video in the world accessible to hearing-impaired consumers everywhere. ASR will be the key to making all audio and video accessible and actionable.”
We probably all use a lot of audio and video software: podcasts, social media streams, online video, live group chats, Zoom meetings, and more. Yet very little of that content is actually transcribed. Content transcription is already one of the largest markets for ASR APIs, and it will grow exponentially over the next decade, especially as accuracy improves and prices fall. That said, ASR transcription is currently used only for specific applications (broadcast video, certain conferences and podcasts, and so on). As a result, many people cannot access this media content, and it is hard to find relevant information after the broadcast or event.
In the future, this will change. At some point, as Matt Thompson predicted in 2010, ASR will become so cheap and widespread that we will experience what he calls “speechiness.” We envision a future in which almost all audio and video content is transcribed and can be instantly accessed, stored, and searched at scale. But ASR’s development will not stop there: we also want this content to be actionable. We expect every piece of audio and video we consume or engage with to come with additional context, such as automatically generated insights from podcasts or meetings, or automated summaries of key moments in a video, and we expect NLP systems to process this routinely.
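As a toy illustration of what “searchable at scale” could mean in practice, the sketch below builds an inverted index from transcribed words and their timestamps so that a text query returns the media file and playback offset; the data structures and sample data are simplified assumptions.

```python
# Toy inverted index over transcripts: maps each word to (media_id, timestamp)
# pairs so a text query can jump straight to the matching moment in the audio
# or video. Real systems would add stemming, ranking, and phrase queries.

from collections import defaultdict
from typing import Dict, List, Tuple

index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

def add_transcript(media_id: str, words: List[Tuple[str, float]]) -> None:
    """words is a list of (word, start_time_in_seconds) pairs."""
    for word, start in words:
        index[word.lower()].append((media_id, start))

def search(query: str) -> List[Tuple[str, float]]:
    return index.get(query.lower(), [])

add_transcript("meeting_2023_01.mp4", [("budget", 62.4), ("review", 63.1)])
print(search("budget"))  # [('meeting_2023_01.mp4', 62.4)]
```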
4 Human-machine collaboration
“By the end of the decade, we will have evolving ASR systems that behave like living organisms, learning continuously with human help or through self-supervision. These systems will learn from diverse real-world sources, pick up new words and language variants in real time rather than asynchronously, debug themselves, and automatically monitor different usages.”
Human-machine collaboration will play a key role as ASR becomes mainstream and covers more and more use cases. Training an ASR model illustrates this well. Today, open-source datasets and pretrained models lower the barrier to entry for ASR vendors. However, the training process is still fairly basic: collect data, annotate it, train the model, evaluate the results, improve the model. This process is slow and, in many cases, error-prone because tuning is difficult or data is insufficient. Garnerin et al. observed that missing metadata and inconsistent representations across corpora make it hard to guarantee comparable accuracy in ASR performance, a problem Reid and Walker tried to address when developing metadata standards.
In the future, humans will play an ever more important role in accelerating machine learning by supervising ASR training in intelligent ways. A human-in-the-loop approach places human reviewers inside the machine learning feedback loop, where they continuously review and adjust model outputs. This makes learning faster and more efficient and yields higher-quality output. Earlier this year, we discussed how improvements to ASR let Rev’s human transcribers (called “Revvers”) post-edit ASR drafts, making their work more efficient. Revvers’ transcriptions can in turn be fed directly back into the ASR model to improve it, forming a virtuous circle.
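A minimal sketch of this human-in-the-loop cycle, with `asr_model`, `human_post_edit`, and `retrain` as hypothetical placeholders: the model drafts a transcript, a reviewer corrects it, and the corrected pair is queued as training data for the next model update.

```python
# Minimal human-in-the-loop sketch: ASR produces a draft, a human reviewer
# post-edits it, and the (audio, corrected text) pair is queued for the next
# training round. All functions and objects here are hypothetical placeholders.

training_queue = []

def human_in_the_loop(audio, asr_model, human_post_edit, retrain, batch_size=1000):
    draft = asr_model.transcribe(audio)        # machine draft
    corrected = human_post_edit(audio, draft)  # reviewer fixes the draft
    training_queue.append((audio, corrected))  # feed the correction back
    if len(training_queue) >= batch_size:      # periodically retrain the model
        retrain(asr_model, training_queue)
        training_queue.clear()
    return corrected
```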
One area of ASR where human linguists remain indispensable is inverse text normalization (ITN), converting recognized strings (like “five dollars”) into their expected written form (like “$5”). Pusateri et al. proposed a hybrid approach using “hand-crafted grammars and statistical models”, and Zhang et al. continued this line of work by constraining RNNs with hand-crafted FSTs.
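As a toy counterpart to the hand-crafted grammars mentioned above, the sketch below applies a few regular-expression rules to turn spoken forms into written forms; real ITN systems rely on weighted FSTs (or FSTs constraining a neural model) and far larger rule sets.

```python
# Toy inverse text normalization (ITN): rewrite spoken forms into written
# forms with a few hand-written rules. Production systems use weighted FSTs
# and far more extensive grammars.

import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}

def itn(text: str) -> str:
    # "five dollars" -> "$5"
    def currency(match: re.Match) -> str:
        return "$" + NUMBER_WORDS[match.group(1)]
    text = re.sub(r"\b(" + "|".join(NUMBER_WORDS) + r") dollars?\b", currency, text)
    # "three percent" -> "3%"
    def percent(match: re.Match) -> str:
        return NUMBER_WORDS[match.group(1)] + "%"
    text = re.sub(r"\b(" + "|".join(NUMBER_WORDS) + r") percent\b", percent, text)
    return text

print(itn("it costs five dollars"))     # "it costs $5"
print(itn("rates rose three percent"))  # "rates rose 3%"
```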
5 Responsible ASR
“Like all AI systems, future ASR systems will adhere to stricter AI ethics principles, so that the system treats all people equally, is more explainable, is accountable for its decisions, and respects the privacy of users and their data.”
In the next decade, AI speech recognition will develop in these five directions
Future ASR systems will follow four principles of AI ethics: fairness, explainability, respect for privacy, and accountability.
Fairness: A fair ASR system recognizes speech regardless of the speaker’s background, socioeconomic status, or other characteristics. Notably, building such a system requires identifying and reducing bias in our models and training data. Fortunately, governments, NGOs, and businesses have begun building the infrastructure to identify and mitigate bias (a minimal auditing sketch follows this list of principles).
Interpretability: ASR systems will no longer be “black boxes”: on demand, they will explain how data was collected and analyzed, how the model performs, and how outputs were produced. This added transparency enables better human oversight of model training and performance. Like Gerlings et al., we view interpretability from the perspective of a range of stakeholders, including researchers, developers, customers, and, in Rev’s case, transcriptionists. Researchers may want to know why incorrect text was output so they can mitigate the problem, while transcriptionists may want evidence for why the ASR system made a particular choice, to help them judge its reliability, especially in noisy conditions where ASR may “hear” better than people do. Weitz et al. take important first steps toward end-user interpretability in the context of audio keyword recognition. Laguarta and Subirana have incorporated clinician-guided interpretation into a speech biomarker system for Alzheimer’s detection.
Respect for privacy: “voice” is considered “personal data” under various U.S. and international laws, so the collection and processing of voice recordings are subject to strict personal privacy protections. At Rev, we already provide data security and access controls, and future ASR systems will go further in respecting the privacy of both user data and the models themselves. In many cases, this will likely mean pushing ASR models to the edge (on device or in the browser). The VoicePrivacy Challenge is driving research in this area, and many jurisdictions, such as the European Union, are already working on legislation. The field of privacy-preserving machine learning promises to draw attention to this critical aspect of the technology, allowing it to be widely accepted and trusted by the public.
Accountability: ASR systems will be monitored to ensure compliance with the first three principles. This, in turn, requires resources and infrastructure to design and build the necessary monitoring systems and to act on their findings. Companies deploying ASR systems will be responsible for how their technology is used and will make deliberate efforts to comply with ASR ethics principles.
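A minimal sketch of the bias audit referenced under the fairness principle: compute word error rate separately for each speaker group and compare. It assumes the open-source jiwer package is installed; the group labels and sample records are invented for illustration.

```python
# Sketch of a simple fairness audit: compute word error rate separately for
# each speaker group so large gaps become visible. Uses the open-source
# `jiwer` package for WER; the sample records and group labels are invented.

from collections import defaultdict
import jiwer

# Each record: (speaker_group, reference_transcript, asr_hypothesis)
records = [
    ("group_a", "turn the lights off", "turn the lights off"),
    ("group_b", "turn the lights off", "turn the light of"),
]

refs, hyps = defaultdict(list), defaultdict(list)
for group, ref, hyp in records:
    refs[group].append(ref)
    hyps[group].append(hyp)

for group in refs:
    score = jiwer.wer(refs[group], hyps[group])  # aggregate WER per group
    print(f"{group}: WER = {score:.2f}")
```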
It is worth mentioning that, as designers, maintainers, and consumers of ASR systems, humans will be responsible for implementing and enforcing these principles—another example of human-machine collaboration.