In the Realm of Artificial Intelligence and Machine Learning: Exploring Amazon Chime SDK’s Audio Science Innovations


Source: Amazon Business Productivity

Introduction

I recently had the chance to sit down with Mike Goodwin, Sr. Manager of Applied Science for the Amazon Chime SDK, to discuss his team’s work and gain a deeper understanding of their latest innovations. As a driving force behind new audio technologies, the Amazon Chime SDK team is continually pushing the boundaries of what is possible in the realm of audio AI/ML. Join us as we delve into their current projects, explore their areas of expertise, and learn about the solutions they are developing to address our customers’ use cases. Whether you are a tech enthusiast or simply curious about the fascinating world of speech and audio science, this blog is your gateway to the captivating realm of science in the Amazon Chime SDK.

Q. Recently an audio science conference called the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) was held, where topics such as speech recognition are discussed. I am not a scientist, so in simpler terms, can you tell me about the papers your team wrote?

A. We had four papers at ICASSP. Two of them are related to improving speech coding with Machine Learning (ML), one is about simplifying adaptive filtering with ML, and one is about a flexible speech denoiser. Let’s talk about the speech coding papers first.

A speech coder has two basic components, an encoder and a decoder. The goal of the encoder is to derive an efficient representation of the speech signal. This problem has been studied for quite a few decades now and some of the classical signal processing techniques such as linear predictive coding are still used in modern encoders. So the encoder converts a speech signal into a set of representative parameters that enable you to efficiently store or transmit the speech. The decoder, on the other hand, has to recreate the speech signal from that parametric representation. That conversion process, where you generate the speech signal from the parameters, is typically called vocoding.
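For readers who want to see the encoder/decoder split in code, below is a minimal sketch of classical linear predictive coding (LPC) analysis and synthesis in Python with NumPy. It is purely illustrative: the model order is arbitrary, there is no quantization, and this is not the Opus implementation.

    # Minimal LPC sketch: the "encoder" turns a speech frame into predictor
    # coefficients plus a residual (excitation); the "decoder" resynthesizes
    # the frame from those parameters. Illustrative only, not Opus.
    import numpy as np

    def lpc_coefficients(frame, order=16):
        """Estimate LPC coefficients via autocorrelation and Levinson-Durbin."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-9          # small regularization avoids divide-by-zero
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                                # reflection coefficient
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
            err *= (1.0 - k * k)
        return a                                          # a[0] = 1, a[1:] = predictor

    def encode_frame(frame, order=16):
        """'Encoder': derive LPC parameters and the prediction residual."""
        a = lpc_coefficients(frame, order)
        residual = np.convolve(frame, a)[:len(frame)]     # whitened excitation
        return a, residual

    def decode_frame(a, residual):
        """'Decoder' (vocoder): rebuild the frame from parameters + excitation."""
        order = len(a) - 1
        out = np.zeros(len(residual))
        for n in range(len(residual)):
            hist = out[max(0, n - order):n][::-1]         # out[n-1], out[n-2], ...
            out[n] = residual[n] - np.dot(a[1:len(hist) + 1], hist)
        return out

    # Usage: round-trip a 20 ms frame of 16 kHz audio.
    frame = np.random.default_rng(0).standard_normal(320)
    a, residual = encode_frame(frame)
    rebuilt = decode_frame(a, residual)   # reconstructs `frame` up to floating-point error

A real codec would quantize the parameters and residual before transmission; the point here is simply the split between a parametric representation and the vocoding step that turns it back into audio.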

Q. You said speech coders have been around for a few decades. What are the everyday applications and what is your team doing that’s new?

A. One of the big applications of speech coders is transmitting speech signals across the internet, for example in video conference calls. The most broadly used coder for that application is called Opus, an open-source codec that has been in development since 2007. Over the last ten or so years, ML has blossomed, to say the least. What we are doing now on the team, and demonstrating in these papers, is that we can leverage modern ML by adding purpose-built models to Opus to improve its quality and robustness. The first paper (Low-bitrate redundancy coding of speech using a rate-distortion-optimized variational autoencoder) addresses the problem of packet loss resilience. When you transmit speech signals across the internet, the information is conveyed in packets, which can get lost when the network conditions are bad. For instance, when you have a low-bandwidth connection, it may not be possible to convey all of the incoming information across it, especially when there is competing traffic. In those kinds of scenarios you can lose some of the audio packets in the information stream, and as a result you will not get all of the speech that was intended to make it to you. What we’ve done in this paper is to create ML models that can encode the speech redundantly so that you don’t actually need to receive all the packets to be able to restore the speech. In fact, you can lose up to a full second of audio and recover it from the packet you receive after the gap. The approach can efficiently encode 50x redundancy: to a packet that carries 20 milliseconds of the baseline high-quality encoded speech, we add another full second of speech encoded at a much lower resolution, but still at high enough quality to be intelligible. Then we can recover from drastic packet losses. Our next step here is to work to add this ML-based capability into the Opus standard, essentially creating a next-generation Opus to bring this advance to billions of devices and people around the world.
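To make the 50x redundancy idea concrete, here is a toy packetization sketch in Python. It is not the DRED bitstream or the actual Opus extension; the data structures, function names, and the split between “high-quality” and “low-resolution” encodings are assumptions made for illustration. Each packet carries its own 20 ms frame plus coarse copies of the previous 50 frames, so a receiver can rebuild up to one second of lost audio from the first packet that arrives after a gap.

    # Toy sketch of frame-level redundancy (not the actual DRED bitstream):
    # each packet carries one 20 ms frame at full quality plus coarse copies
    # of the previous 50 frames (50 x 20 ms = 1 s of recoverable history).
    from dataclasses import dataclass, field

    REDUNDANCY_DEPTH = 50    # one second's worth of 20 ms frames

    @dataclass
    class Packet:
        seq: int
        primary: bytes                                   # high-quality frame `seq`
        redundant: dict = field(default_factory=dict)    # seq -> low-res encoding

    def build_packet(seq, history, encode_hq, encode_lowres):
        """history maps frame index -> raw audio frame for the last second."""
        redundant = {
            s: encode_lowres(history[s])
            for s in range(max(0, seq - REDUNDANCY_DEPTH), seq)
            if s in history
        }
        return Packet(seq=seq, primary=encode_hq(history[seq]), redundant=redundant)

    def receive(packets, decode_hq, decode_lowres, total_frames):
        """Prefer each frame's primary encoding; fall back to redundancy
        carried by later packets; otherwise the frame is truly lost."""
        primary = {p.seq: p.primary for p in packets}
        fallback = {}
        for p in sorted(packets, key=lambda p: p.seq):
            fallback.update(p.redundant)       # newer copies overwrite older ones
        decoded = []
        for seq in range(total_frames):
            if seq in primary:
                decoded.append(decode_hq(primary[seq]))
            elif seq in fallback:
                decoded.append(decode_lowres(fallback[seq]))   # recovered from redundancy
            else:
                decoded.append(None)           # conceal with packet-loss concealment
        return decoded

With this layout, any run of up to 50 consecutive lost packets can be filled in from the first packet received after the gap; the paper’s contribution is achieving that with far tighter compression of the redundant copies than this toy version suggests.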

Q. So the approach in the first paper is to help when you lose audio due to network problems. What does the second paper focus on?

A. The second paper (Framewise WaveGAN: high speed adversarial vocoder in time domain with very low computational complexity) is about the vocoding process. What we’re doing with the Framewise WaveGAN is training a novel machine learning model to synthesize high-quality speech from the parametric representation that the encoder derived. And we’re pushing the limits of the model in an effort to make it both more efficient and higher quality than prior approaches to ML-based vocoding. With respect to efficiency, previous approaches to ML-based vocoding generated speech signals sample by sample. Our paper proposes a new approach which can instead generate a full frame of samples in each inference pass. It’s computationally much cheaper than synthesizing the speech output sample by sample. And even at this much lower computational cost, we’re still able to create high-quality output. It’s also worth noting that this innovation is not limited to the speech coding domain, but has other applications such as text-to-speech synthesis.
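A rough back-of-envelope makes the efficiency argument clear: a sample-by-sample vocoder runs the neural network once per audio sample, while a framewise model runs it once per frame. The 16 kHz sample rate and 10 ms frame size below are illustrative assumptions, not the exact configuration in the paper.

    # Back-of-envelope: network inference passes per second of synthesized audio.
    SAMPLE_RATE = 16_000                              # samples per second (assumed)
    FRAME_MS = 10                                     # audio generated per pass (assumed)
    FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 160 samples per frame

    passes_samplewise = SAMPLE_RATE                   # one network call per sample
    passes_framewise = SAMPLE_RATE // FRAME_SAMPLES   # one network call per frame

    print(f"sample-by-sample: {passes_samplewise} passes per second of audio")
    print(f"framewise:        {passes_framewise} passes per second of audio "
          f"({passes_samplewise // passes_framewise}x fewer)")

Under these assumptions the framewise model makes 160 times fewer network calls per second of audio, which is why frame-level generation is so much cheaper even if each pass does somewhat more work.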

The low-bitrate redundancy coding and framewise WaveGAN papers are both cutting edge ML contributions to speech processing and coding research. We’re really excited to have this work published now and about getting it closer to our customers through the standardization process.

Q. These two innovations can work independently or together, correct? Is it “and” or “one or the other”?

A. That’s right, the innovations are independent but they can work together. You can use the low-bitrate redundancy coding in conjunction with a legacy vocoder. Or if you want to bring in the Framewise WaveGAN for higher quality vocoding at lower cost, you can design the speech coder to use the Framewise WaveGAN either as the core decoder for all received packets or just for synthesizing the audio for packets that have been lost and recovered through the new redundancy coding technique. On the other hand, you could use the Framewise WaveGAN independently in a speech coder without incorporating the redundancy coding at all.

Q. A few years ago your team announced Amazon Voice Focus – is that something that you are still working on, and if so, what is new with that? I use it all the time and I am always shocked when people say they can’t hear my darling Chihuahua barking in the background.

A. One of our areas of work is speech enhancement, which is about cleaning up a speech signal that’s been corrupted in some way, maybe there’s some background noise like a fan or even foreground noise like typing or eating potato chips. For better communication sessions, we want to help reduce those disruptions so that only the speech goes through to the other party on the call and it’s easier for them to focus on what’s being said. Amazon Voice Focus is our generalized speech enhancement model which passes speech and suppresses non-speech interfering sounds. In some use cases, though, it’s a limitation that the model lets all the speech through. More specifically, if there’s interfering speech, it doesn’t differentiate that from the primary speech. All the speech from one end of a teleconference meeting gets passed through the model and delivered to the rest of the meeting attendees. In some cases, though, you may want to only preserve a single desired talker while suppressing other interfering speech, for instance if that speech is extraneous to the meeting. To address that use case, we created a model that could latch onto a target voice. We published a paper about this at Interspeech 2021. In that approach, the model is provided a conditioning vector as input that describes the target speech; given that information, the model can separate the target voice out of a mixture of voices and suppress other interfering noises as well. In some cases, though, we don’t necessarily have that conditioning information about the talker that we’re trying to latch on to. To cover both of these scenarios, what we wanted to do was to make a unified model that could either do personalized speech enhancement or legacy generalized speech enhancement. In this year’s ICASSP paper, we present that unified model which can do either generalized or personalized speech enhancement depending on whether or not the model is provided the extra conditioning information about a target voice.
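From a caller’s point of view, a unified model like that can be thought of as one function with an optional conditioning input. The sketch below is a hypothetical interface, not the Amazon Voice Focus API or the UPN model itself; the function names, embedding size, and placeholder network are assumptions for illustration.

    # Hypothetical caller-side view of a unified enhancement model: the same
    # network runs generalized (no conditioning) or personalized (conditioned
    # on a target-speaker embedding). Names and shapes are illustrative.
    from typing import Optional
    import numpy as np

    EMBEDDING_DIM = 128    # assumed size of the speaker "conditioning vector"

    def extract_speaker_embedding(enrollment_audio: np.ndarray) -> np.ndarray:
        """Placeholder speaker encoder: a real system would use a trained model
        to map a short enrollment clip to a fixed-length voice signature."""
        rng = np.random.default_rng(len(enrollment_audio))
        return rng.standard_normal(EMBEDDING_DIM).astype(np.float32)

    def _model_forward(frame: np.ndarray, conditioning: np.ndarray) -> np.ndarray:
        """Stand-in for the trained network, which would map
        (noisy frame, conditioning vector) -> enhanced frame."""
        return frame    # passthrough placeholder

    def enhance(noisy_frame: np.ndarray,
                speaker_embedding: Optional[np.ndarray] = None) -> np.ndarray:
        """No embedding  -> generalized mode: keep all speech, suppress noise.
        With embedding -> personalized mode: keep only the target talker."""
        conditioning = (np.zeros(EMBEDDING_DIM, dtype=np.float32)
                        if speaker_embedding is None else speaker_embedding)
        return _model_forward(noisy_frame, conditioning)

    # Usage: generalized by default; personalized once a voice is enrolled.
    frame = np.zeros(320, dtype=np.float32)                  # 20 ms at 16 kHz
    generalized_out = enhance(frame)
    target = extract_speaker_embedding(np.ones(16_000, dtype=np.float32))
    personalized_out = enhance(frame, speaker_embedding=target)

The design point is that a single deployed model covers both modes, and the caller switches behavior at runtime simply by providing or omitting the conditioning vector.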

Q. How does real-time personalized and non-personalized speech enhancement accommodate different languages and accents?

A. We generalize across languages and accents in two ways. First, through data we train our speech enhancement models on a wide variety of languages and accents and voices: thousands and thousands of hours of audio data consisting of labeled mixtures of clean speech and interfering sounds. Training a model on a diverse set of talkers enables it to generalize well to other talkers during inference. Second, our models leverage the fundamentals of speech production and speech perception. Our models are based in part on an understanding of a speech production mechanism consisting of vocal cords and a vocal tract. Since this mechanism is generally applicable across many different languages, accents, and talkers, it helps our model generalize across different kinds of talkers.
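The “vocal cords plus vocal tract” picture Mike describes is the classic source-filter model of speech production. Below is a tiny, illustrative synthesis of a vowel-like sound in Python: a periodic pulse train (the source) shaped by a few resonances (the filter). The pitch and formant values are rough textbook-style numbers, not anything taken from the team’s models.

    # Tiny source-filter illustration: a glottal-like pulse train (source)
    # shaped by vocal-tract-like resonances (filter). Pitch and formant values
    # are rough illustrative numbers, not taken from any production model.
    import numpy as np
    from scipy.signal import lfilter

    SAMPLE_RATE = 16_000
    PITCH_HZ = 120                       # typical-ish speaking pitch
    FORMANTS_HZ = [700, 1200, 2600]      # rough formants for an /a/-like vowel

    # Source: impulse train at the pitch period.
    n = np.arange(SAMPLE_RATE)           # one second of audio
    source = (n % (SAMPLE_RATE // PITCH_HZ) == 0).astype(float)

    # Filter: cascade of second-order resonators, one per formant.
    signal = source
    for f in FORMANTS_HZ:
        r = 0.97                                         # resonance sharpness
        theta = 2 * np.pi * f / SAMPLE_RATE
        a = [1.0, -2 * r * np.cos(theta), r * r]         # all-pole resonator
        signal = lfilter([1.0], a, signal)

    signal /= np.max(np.abs(signal))     # normalize to [-1, 1]

Because this excitation-plus-resonance structure is shared across languages and talkers (only the pitch, formants, and timing change), models built around it tend to transfer well to voices they have not seen in training.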

So we generalize through data diversity and by basing our approaches on fundamental speech properties. Then, as part of model evaluation before we release a model into production, we do fairness and bias testing across many demographic categories. We have data sets that cover different languages, different regional accents, different age groups, etc., and we test our models for these various demographics to help ensure that performance is comparable across different groups.
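As a rough illustration of that kind of check, the sketch below computes an average quality score per demographic slice and flags groups that fall noticeably below the overall mean. The metric, margin, and group names are assumptions made for the sketch, not the team’s actual evaluation harness.

    # Illustrative per-group fairness check (not the actual test harness):
    # flag any demographic slice whose mean quality score trails the overall
    # mean by more than a chosen margin.
    from statistics import mean

    def flag_underperforming_groups(scores_by_group, margin=0.1):
        """scores_by_group: e.g. {"en-US": [...], "hi-IN": [...], "60+": [...]},
        each list holding per-utterance quality scores (higher is better)."""
        overall = mean(s for scores in scores_by_group.values() for s in scores)
        return [group for group, scores in scores_by_group.items()
                if mean(scores) < overall - margin]

    # Usage with made-up scores:
    results = {"en-US": [4.2, 4.1], "es-MX": [4.0, 4.2], "60+": [3.6, 3.7]}
    print(flag_underperforming_groups(results))    # -> ['60+'] for these toy numbers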

Q. Let’s talk about immersive audio experience, can you tell me a little bit about what that means to you?

A. That’s a densely mathematical topic that might seem a little abstract. But the idea behind the complex math is to build a framework that simplifies complicated problems such as acoustic echo cancellation. Suppose I’m talking to you in a video conference and my voice gets picked up by your microphone and comes back to me so that I hear myself. That’s something that I’m sure everybody’s experienced in meetings, and typically it’s handled with what’s called an acoustic echo canceler (AEC). The AEC has to go through some fairly elaborate computations to try to figure out how to remove that echo. And those computations can get off track, in which case you wind up hearing your own voice coming back. It’s especially a problem if two parties on a call are talking simultaneously (doubletalk) or if the acoustics are changing over time and the AEC can’t keep up with an accurate estimate. Our ICASSP paper on manifold learning is about taking that type of complicated estimation problem and projecting it into a simpler domain. With this framework, we take a complicated system response, like the acoustics of a room, and project it in a structured way onto a lower-dimensional manifold that captures the important parts of the response. Then, in problems where you need to track an estimate of a complicated response (like how an echo reflects around a room), that complicated response gets simplified. Now you can solve problems like echo cancellation more robustly because they are simpler on this lower-dimensional manifold. Other applications include spatial audio for augmented or virtual reality (AR/VR), where you’re trying to realistically reproduce the acoustics of specific environments to generate truly immersive audio experiences. With our approach, you can use a low-dimensional representation to capture the essence of a desired acoustic environment from only a handful of measurements and predict the acoustics at unseen locations. This reduces the overhead of taking real-world measurements and enables the generation of high-quality AR and VR audio experiences at a low computational cost.
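To give a feel for the idea without the deep-learning machinery, here is a greatly simplified linear stand-in in Python: learn a low-dimensional basis for room impulse responses from example measurements (plain PCA via SVD here, where the paper uses a learned generative model), then track a handful of basis coefficients instead of thousands of filter taps. The dimensions and synthetic data are assumptions for the sketch.

    # Greatly simplified linear stand-in for the manifold idea (the paper uses
    # a learned generative model, not PCA): represent a long room impulse
    # response by a few coordinates on a low-dimensional subspace learned
    # from example responses, and adapt those coordinates instead of the taps.
    import numpy as np

    TAPS = 4096         # length of an impulse response (illustrative)
    LATENT_DIM = 16     # size of the low-dimensional representation (illustrative)

    def learn_basis(example_responses):
        """example_responses: (num_rooms, TAPS) array of measured responses."""
        mean = example_responses.mean(axis=0)
        _, _, vt = np.linalg.svd(example_responses - mean, full_matrices=False)
        return mean, vt[:LATENT_DIM]                 # (TAPS,), (LATENT_DIM, TAPS)

    def project(response, mean, basis):
        """Complicated response -> LATENT_DIM coordinates on the learned subspace."""
        return basis @ (response - mean)

    def reconstruct(coords, mean, basis):
        """Coordinates -> full-length impulse response estimate."""
        return mean + basis.T @ coords

    # Usage with synthetic decaying noise standing in for measured responses.
    rng = np.random.default_rng(0)
    examples = rng.standard_normal((200, TAPS)) * np.exp(-np.arange(TAPS) / 500)
    mean, basis = learn_basis(examples)
    coords = project(examples[0], mean, basis)       # 16 numbers to track/adapt
    estimate = reconstruct(coords, mean, basis)      # back to a 4096-tap response

Tracking 16 coordinates is far easier and more robust than tracking 4096 taps directly, which is the intuition behind doing echo cancellation or spatial-audio estimation on a learned manifold.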

Q. My mind works in real-time scenarios, can you give me some use cases that the readers could identify with?

A. First, the manifold learning paper has a very relatable use case because everyone has experienced echo problems. And that innovation helps us to resolve the echo problem more robustly and with less computation than previous approaches. For the personalized speech enhancement paper, one use case many people have experienced is a contact center scenario where there are a number of conversations going on in the customer service agent’s background but you only want to hear the voice of the particular agent you’re talking to. Our approach helps suppress those background voices and leads to a clearer conversation. Basically, in any use case where there are background voices, our system can help keep those from disrupting your conversation. For the other papers, we talked about the use cases a bit already. For the redundancy coding, it’s really about maintaining intelligible speech communication in bad network conditions where packets are being lost. We’ve all been in the situation where you’re talking to somebody and you just don’t get all their audio, so you have to ask them to repeat themselves. Our approach helps overcome that pain point by recovering the missing parts to fill in the gaps, so you still get all the speech information even if the network is congested. The Framewise WaveGAN paper, lastly, is about using machine learning to efficiently derive high-quality speech from a coded speech representation, which is really a universal aspect of modern speech communication systems.

Q. What interesting projects is your team currently focused on in the field of audio science?

A. We’re currently focusing on three different threads of work. We have talked about speech coding a fair bit. That’s one thread, and those scientists are working on building that next-generation Opus standard and driving the ML-based advances toward broad availability. Another thread we’ve also talked about is speech enhancement. We take a noisy, corrupted, or low-quality speech signal, get rid of the background noise, clean it up, and make it a higher-quality signal to basically improve the fidelity of your communication session. This helps users understand what’s being said, so it improves productivity, reduces listening fatigue, and so on. The third thread is analytics. We want to understand what’s happening in communication sessions. For instance, our voice tone analysis capability estimates a talker’s sentiment based on both the words they say and how they say them. Unlike many other sentiment estimation approaches, though, it doesn’t use an explicit transcript. In our ML model there’s an implicit analysis of the linguistics (the words being said), and an analysis of the acoustics (how the words are said) is layered onto that.

Q. In closing, in what ways do the investments in the voice and video space that your team is focused on differ from other AWS ML/AI services?

A. What we focus on here are real-time solutions for voice and, to a lesser extent, video processing that address customer problems. Our customers want to understand what’s happening in their communication sessions, both in the speech and, if it’s a video session, in the video. For instance, let’s talk about Personally Identifiable Information (PII) that might be shared on the screen in a video session. We need to detect very quickly that this is happening so the PII can be redacted. The computation needs to be lightweight, because everybody is cost conscious. So our algorithms need to run at low cost with very low latency. These kinds of constraints are first and foremost when we develop our ML models; these are the critical considerations for solving customer problems in real-time communication use cases. Now, that’s not to say that we don’t do some things offline. For instance, our call recording feature includes the capability of enhancing call recordings in an offline process. In those cases we can take advantage of working offline by adding more latency to boost model performance.

Going back to real-time solutions though, we focus on these use cases because so many of our customers want to get insights about their communication sessions on the fly. Let’s say, for instance, that the topics and sentiment in a customer support conversation are being used to provide automated agent assistance or to determine whether to escalate to a manager. You need the information now; you can’t wait until after the call is over, because by then it’s too late. So many of these things have to happen with very little delay. Within a few seconds you need to know what’s going on so you can raise the right flags and make the right decisions for your business.

Conclusion

Throughout this interview with Mike, we learned about the Amazon Chime SDK team’s innovative recent work in audio science and machine learning. We hope that this interview has provided you with a better understanding of some of these developments and left you inspired by the possibilities that lie ahead. Thank you for joining us, and stay tuned for more captivating interviews about the Amazon Chime SDK’s customer-obsessed science initiatives.

Paper summaries from the authors

Low-bitrate redundancy coding of speech using a rate-distortion-optimized variational autoencoder

Packet loss is a big problem for real-time voice communication over the Internet. Everyone has been in the situation where the network is becoming unreliable and enough packets are getting lost that it’s hard — or impossible — to make out what the other person is saying. In this paper we propose a Deep REDundancy (DRED) algorithm that significantly improves quality and intelligibility under packet loss conditions by efficiently retransmitting each audio packet up to 50 times with minimal overhead, enabling robust recovery from dropouts up to a second in duration.

Framewise WaveGAN: high speed adversarial vocoder in time domain with very low computational complexity

A vocoder refers to a method for creating a time-domain voice signal from a representation such as a set of speech-related parameters, a spectrogram, or acoustic/phonetic features. Vocoders based on generative adversarial networks (GANs) are the current state of the art, but these models are computationally prohibitive for low-resource devices since they synthesize the voice signal sample by sample. This paper proposes a new architecture for GAN vocoding that substantially reduces the complexity by instead generating the voice signal in a framewise manner while still maintaining high quality. The approach has applications including low-rate speech coding, text-to-speech synthesis, and speech enhancement.

A framework for unified real-time personalized and non-personalized speech enhancement

In this paper, we present the Unified PercepNet (UPN) model, which eliminates the need to train and maintain independent speech enhancement models for personalized and non-personalized use cases. The UPN model is controlled by a user input that specifies the enhancement mode and can be toggled in real time. By default, it behaves as a non-personalized speech enhancement model that suppresses only environmental sound; when the personalized mode is activated, the model isolates the target speaker and suppresses both background noise and interfering speech from other talkers.

Generative modeling based manifold learning for adaptive filtering guidance

Fast and accurate estimation of room impulse responses is vital for audio signal processing applications such as acoustic echo cancellation and augmented/virtual reality. In this paper we propose an integrated ML/DSP solution to the underlying ill-posed estimation problem by exploiting the fact that impulse responses are not arbitrary but instead lie on a structured manifold that can be learned by means of properly designed deep neural networks.

Learn More

A quick guide to Amazon’s 40-plus papers at ICASSP
Neural encoding enables more-efficient recovery of lost audio packets
Blog: Label multiple speakers on a call with speaker search
Amazon Chime SDK launches call analytics
Amazon Chime SDK in the AWS Console
Amazon Chime SDK
