Machine Learning on the Edge – Speech Command Recognition

Part 1

The Good, The Bad, and The GenAI

Today, generative AI (or “gen AI” for short) is all the rage. Most often we see it used for text generation – be it chatbots, text summarisers or, well, spam bots. Then there’s image and video generation, which is perhaps the most controversial of the lot, not to mention the “deep fake” side of things. The final pillar is audio and speech generation, which typically doesn’t get as much notice and attention as the others, except for the voice cloning used with deep fakes. While there are clearly many ways of abusing gen AI tech, there are also positive uses. Speech generation quality has massively improved thanks to machine learning (ML) techniques, and we’re now at the point where computers are able to generate natural-sounding voices after having been stuck in the uncanny valley for a long time. Currently, that level of speech quality still requires significant hardware, but I fully expect that in the rather near future it will be optimised to the point where it’s accessible even on microcontrollers. Naturally, this is where I start getting excited, working as I do in the embedded space.

A New Era: Speech Recognition

The inverse of speech generation is speech recognition (SR). Here too, ML is making strides in advancing the ability to recognise human speech in a manner that can be acted upon. Note that speech recognition should not be confused with voice recognition, which is the ability to distinguish different voices and identify the speaker. While that can be a useful tool as well, it is somewhat more limited in its applicability and often used as a security enhancement. Speech recognition, however, has seen a huge uptick in both use and popularity, with various mobile and home assistants such as Apple’s Siri, Google Assistant and Amazon’s Alexa. These are backed by powerful cloud services and generally require Internet connectivity to function. I admit to not knowing how well, if at all, any of these function when they are unable to reach their corresponding backends. The last voice assistant I actively used was Cortana on my old Windows phone, and I was really quite impressed. Plus, it retained limited functionality even when used offline. I’m sad that Windows Phone got discontinued, because it was quite a polished product and Cortana was actually useful. Switching to Siri after Cortana was such a downgrade in functionality and usability that I gave up on it altogether. But I digress.

Speech recognition has clearly advanced to the point where it’s both usable and less resource intensive than it once was. While many systems do rely on cloud-backed services, it is perfectly possible to do at least limited speech recognition purely on an edge device, including on microcontrollers.

Wake Word Technology Changing the Game

One of the primary ways resource requirements have been reduced is through the development of “wake word” technology. With a wake word (or wake phrase), a device does not have to constantly “listen” to and interpret all the speech in its vicinity. Doing that while relying on a cloud service for the speech recognition would be prohibitively costly in terms of both bandwidth and server resources. Instead, a much smaller ML model trained purely to recognise the wake word can run locally on the edge device, and only after it has detected the wake word will the device enable full speech recognition functionality, whether that be local or cloud assisted.
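
In code, the gating pattern itself is simple. The sketch below is purely illustrative; the three helper functions are hypothetical placeholders rather than any particular vendor’s API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placeholders - not any particular vendor's API. */
const int16_t *read_audio_frame(void);           /* grab one frame of microphone samples */
bool wake_word_detected(const int16_t *frame);   /* tiny, always-on local model          */
void run_full_speech_recognition(void);          /* heavyweight local or cloud-assisted SR */

void listen_loop(void)
{
    for (;;) {
        const int16_t *frame = read_audio_frame();
        /* Only the small wake word model runs continuously... */
        if (wake_word_detected(frame)) {
            /* ...and the expensive recognition path is engaged on demand. */
            run_full_speech_recognition();
        }
    }
}
```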

The Balancing Act: Cloud Quality vs. Edge Efficiency

Cloud-backed speech recognition and generation have a clear advantage in terms of quality, but they’re not without drawbacks. Besides requiring a stable Internet connection, there is the issue of latency. Cloud-backed speech interaction can easily feel sluggish, and it doesn’t take much to put a person off. A primary driver for speech interaction is its immediacy, and if responses aren’t prompt enough the experience will generally be perceived poorly. This is where pure edge speech handling has the advantage: the time required to process speech locally is generally far less than the combined time of cloud-side processing plus network latency, even if the processing itself is slower on the edge device.

One way on-device speech generation / Text-to-Speech (TTS) can be sped up is through the use of dedicated speech synthesis hardware modules, such as the SYN6988 or XFS5152CE. The latter also purportedly supports some speech recognition in addition to generation. The voice quality of these chips can’t rival what larger ML models can produce, however – plug those model numbers into YouTube if you want to hear what they can (and can’t) do. Modules such as these are mostly useful if your main microcontroller doesn’t have the power or memory to run a TTS engine. These days, it’s often more cost-effective to choose a more capable microcontroller than to add a separate speech synthesis module. Still, they remain an option.
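
Driving one of these modules typically amounts to pushing framed text over a UART. Below is a rough sketch using ESP-IDF’s standard UART driver; the pin numbers, baud rate and especially the frame layout are illustrative assumptions only (each module has its own header, command and checksum scheme), so consult the datasheet of the module you actually use.

```c
#include <string.h>
#include <stdint.h>
#include "driver/uart.h"

#define TTS_UART    UART_NUM_1
#define TTS_TX_PIN  17          /* placeholder: match your wiring */
#define TTS_RX_PIN  18

void tts_module_init(void)
{
    const uart_config_t cfg = {
        .baud_rate  = 9600,     /* a common default for such modules */
        .data_bits  = UART_DATA_8_BITS,
        .parity     = UART_PARITY_DISABLE,
        .stop_bits  = UART_STOP_BITS_1,
        .flow_ctrl  = UART_HW_FLOWCTRL_DISABLE,
        .source_clk = UART_SCLK_DEFAULT,
    };
    ESP_ERROR_CHECK(uart_driver_install(TTS_UART, 256, 0, 0, NULL, 0));
    ESP_ERROR_CHECK(uart_param_config(TTS_UART, &cfg));
    ESP_ERROR_CHECK(uart_set_pin(TTS_UART, TTS_TX_PIN, TTS_RX_PIN,
                                 UART_PIN_NO_CHANGE, UART_PIN_NO_CHANGE));
}

/* Illustrative framing only: real modules expect a vendor-specific frame
 * (header, length, command, text encoding, payload, often a checksum).
 * Check the datasheet before relying on the bytes below. */
void tts_module_say(const char *text)
{
    uint8_t frame[256];
    size_t len = strlen(text);
    if (len > sizeof(frame) - 5) {
        len = sizeof(frame) - 5;            /* keep the sketch simple */
    }
    size_t pos = 0;
    frame[pos++] = 0xFD;                    /* frame header            */
    frame[pos++] = (uint8_t)((len + 2) >> 8);
    frame[pos++] = (uint8_t)((len + 2) & 0xFF);
    frame[pos++] = 0x01;                    /* "synthesise text"       */
    frame[pos++] = 0x00;                    /* text encoding selector  */
    memcpy(&frame[pos], text, len);
    pos += len;
    uart_write_bytes(TTS_UART, frame, pos);
}
```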

The Privacy Bonus with Edge Computing

So far I’ve largely covered the technical aspects and differences between on-device and cloud-backed speech processing. There is one other major differentiator which I’d be remiss not to call out – privacy, or at least the perception thereof. With a device that uses cloud services for speech processing, the user must be comfortable with having their conversations shared with the company behind it. Even if a device appears to make use of a wake word, and therefore shouldn’t be eavesdropping on every word uttered in its vicinity, in reality there is no guarantee that it isn’t doing exactly that. A device which doesn’t have an Internet connection and does all the speech processing locally, on the other hand, does not suffer from this, or at least not to anywhere near the same extent.

To Recap

There are numerous reasons why on-device/edge speech processing can be a better choice, depending on the use case:

  • Not reliant on Internet connectivity
  • No ongoing cloud costs for speech processing
  • Far less latency / quicker speech response
  • Improved privacy and perception thereof

Part 2

Voice Quest: The ESP32-S3 Encounter

One microcontroller I’ve been eager to test out both speech recognition and speech generation on is Espressif’s ESP32-S3. Aside from the usual niceties of the ESP32 range, the S3 features vector instructions for accelerated neural network processing and support for larger PSRAM capacities. Espressif also provides out-of-the-box support for speech recognition with their esp-sr component, and general neural net support via the esp-dl component. The esp-sr component also comes with several configurable wake words, with the possibility of requesting (or buying) custom wake words as well. One missing feature I’d love to see is the ability to train wake words yourself. I could see this being incredibly useful for allowing an end user to name individual devices. Be it “hey fridge” or “dear amaranth”, giving the user control over which device responds when would be game-changing!
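
To give a feel for what using esp-sr looks like, here is a rough sketch of loading a wake word model, based on my reading of the esp-skainet examples. Treat the exact function names (esp_srmodel_init, esp_wn_handle_from_name and friends) as assumptions to be checked against the esp-sr release you’re using.

```c
#include "esp_wn_iface.h"
#include "esp_wn_models.h"
#include "model_path.h"

/* Load whichever WakeNet model was selected in menuconfig and ready it
 * for detection. Names per my reading of the esp-skainet examples -
 * double-check against your esp-sr version. */
void wakeword_setup(void)
{
    /* The models live in a dedicated "model" flash partition. */
    srmodel_list_t *models = esp_srmodel_init("model");

    /* Pick the first wake word model from the list. */
    char *wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL);

    const esp_wn_iface_t *wakenet = esp_wn_handle_from_name(wn_name);
    model_iface_data_t *wn_data = wakenet->create(wn_name, DET_MODE_90);

    /* Feed get_samp_chunksize() samples of 16kHz mono audio per call to
     * wakenet->detect(wn_data, buffer); a WAKENET_DETECTED result means
     * the wake word ("computer", say) was heard. */
    int chunk_size = wakenet->get_samp_chunksize(wn_data);
    (void)chunk_size;
}
```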

By the way, those of you wanting to live out your Star Trek fantasies of interacting with a ship’s computer via speech will be pleased to know that one of the supported wake words is in fact “computer”.

There is also Text-to-Speech support, but at the time of writing, only the Chinese TTS engine is available, with an English engine listed as being under development.

Teaching the ESP32-S3 to Speak English and More

As I wanted to take the ESP32-S3 for a spin and see how well it handles both SR and TTS, the lack of English TTS was a bit of a hiccup. So, I did what any self-respecting geek would do – I ported a TTS engine myself. In this case PicoTTS, an open source TTS engine released by SVOX. Aside from British and American English, it also supports German, French, Italian and Spanish speech generation. PicoTTS stands out because it is designed for embedded use, and can get by on only 2.5MB of RAM. Of course, when your total available RAM is only 8MB, and you also need to fit in the ML models for wake word and general speech recognition support, that’s still quite significant.
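
For those curious what the engine’s C API looks like, here is a minimal sketch of the standard SVOX Pico calling sequence (initialise, load the two language resources, build a voice, feed text, pull out PCM). It follows the upstream picoapi.h with error handling omitted, and isn’t necessarily identical to how my ESP-IDF component ends up wrapping it.

```c
#include <string.h>
#include <stdint.h>
#include "picoapi.h"

#define PICO_MEM_SIZE (2500 * 1024)   /* the ~2.5MB of working memory upstream asks for
                                         (on an ESP32-S3 this would live in PSRAM) */

static char        pico_mem[PICO_MEM_SIZE];
static pico_System pico_sys;
static pico_Engine pico_engine;

/* Initialise the engine with the British English voice and synthesise one phrase. */
void pico_say(const char *text)
{
    pico_Resource ta_res, sg_res;
    char ta_name[PICO_MAX_RESOURCE_NAME_SIZE], sg_name[PICO_MAX_RESOURCE_NAME_SIZE];

    pico_initialize((void *)pico_mem, PICO_MEM_SIZE, &pico_sys);

    /* Each language ships as a text-analysis and a signal-generation resource file. */
    pico_loadResource(pico_sys, (const pico_Char *)"en-GB_ta.bin", &ta_res);
    pico_loadResource(pico_sys, (const pico_Char *)"en-GB_kh0_sg.bin", &sg_res);
    pico_getResourceName(pico_sys, ta_res, ta_name);
    pico_getResourceName(pico_sys, sg_res, sg_name);

    pico_createVoiceDefinition(pico_sys, (const pico_Char *)"PicoVoice");
    pico_addResourceToVoiceDefinition(pico_sys, (const pico_Char *)"PicoVoice",
                                      (const pico_Char *)ta_name);
    pico_addResourceToVoiceDefinition(pico_sys, (const pico_Char *)"PicoVoice",
                                      (const pico_Char *)sg_name);
    pico_newEngine(pico_sys, (const pico_Char *)"PicoVoice", &pico_engine);

    /* Feed in UTF-8 text, then pull out 16kHz 16-bit mono PCM until the engine goes idle. */
    pico_Int16 sent = 0, received = 0, dtype = 0;
    pico_Int16 remaining = (pico_Int16)(strlen(text) + 1);   /* include the terminating NUL */
    const pico_Char *p = (const pico_Char *)text;
    int16_t pcm[512];
    pico_Status status;

    while (remaining > 0) {
        pico_putTextUtf8(pico_engine, p, remaining, &sent);
        p += sent;
        remaining -= sent;
        do {
            status = pico_getData(pico_engine, pcm, sizeof(pcm), &received, &dtype);
            /* 'received' bytes of audio now sit in 'pcm' - hand them to the audio output */
        } while (status == PICO_STEP_BUSY);
    }
}
```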

Having a bit of a history of optimising RAM usage on Espressif’s chips, I had an idea for how to further reduce the RAM requirements. I’m a contributor to the NodeMCU firmware project, which enables scripting in Lua on the ESP8266 and ESP32 range of chips. One of my early contributions was a hardware trap handler which enabled relocating constant data out of RAM and into flash, freeing up a significant amount of memory at the expense of some extra cycles to access said data. You’ll find this same feature included in the standard ESP-IDF these days.

So, I already knew that for PicoTTS I wanted to ensure its resource files could be kept in flash rather than having to be loaded into precious RAM. The exposed API, however, did not lend itself to augmentation, so I had to resort to some “clever” approaches to introduce new resource loading and unloading functions. I don’t like having to be clever, because it can easily come back to haunt you, but sometimes there’s little choice. So, one ugly include-the-C-file modification later, I was able to reimplement the resource loading routines. This took a bit of light reverse engineering of the data format to understand it well enough that I was confident in writing the replacement loader. The custom loader now effectively mmap()s the bulk of the resource data directly from flash, with the net result that the PicoTTS engine needs only 1.1MB of RAM instead of 2.5MB. Given the functionality it provides, that’s quite reasonable in my view.
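
The flash-mapping side of that is fairly standard ESP-IDF fare. A sketch of the idea, using the IDF v5 partition API and a hypothetical data partition named pico_lang holding the resource image, looks something like this (my actual component differs in the details):

```c
#include "esp_partition.h"
#include "esp_log.h"

static const char *TAG = "pico_res";

/* Map a (hypothetical) data partition holding the PicoTTS language
 * resources straight into the address space, so the engine can read
 * it from flash instead of copying it into RAM. */
const void *map_pico_resources(void)
{
    const esp_partition_t *part = esp_partition_find_first(
        ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_ANY, "pico_lang");
    if (!part) {
        ESP_LOGE(TAG, "resource partition not found");
        return NULL;
    }

    const void *mapped = NULL;
    esp_partition_mmap_handle_t handle;
    ESP_ERROR_CHECK(esp_partition_mmap(part, 0, part->size,
                                       ESP_PARTITION_MMAP_DATA, &mapped, &handle));
    ESP_LOGI(TAG, "mapped %u bytes of TTS resources at %p",
             (unsigned)part->size, mapped);
    return mapped;   /* cache-backed reads from flash, no internal RAM consumed */
}
```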

Of course, generating the speech data is one thing; turning it into audible sound is another. Normally I’d expect a Board Support Package (BSP) to provide a high-level interface for audio playback, but for reasons unknown I could not get the ESP32-S3-BOX to output sound properly. I tried both the hardware driver from the “esp-skainet” repo (yes, they went there) and one of the speech recognition demos, but without luck. When I reached for the “proper” BSP from the esp-bsp repo, I ran into component conflicts – it seems I hit the transition period from IDFv4 to IDFv5, where everything wasn’t quite ready for the IDFv5 version I was using. Not to be deterred, I grabbed the parts I genuinely needed, and got sound working properly after a little while. For a real project I would’ve had to spend the time to properly resolve the component conflicts, but for a quick demo I could justify the shortcut.
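
For reference, a minimal IDF v5 standard-mode I2S transmit setup looks roughly like the sketch below (this is generic, not the exact code from the demo). The GPIO numbers are placeholders rather than the real ESP32-S3-BOX pinout, and the board’s ES8311 codec still needs to be configured separately over I2C before any sound comes out.

```c
#include <stdint.h>
#include "driver/i2s_std.h"
#include "driver/gpio.h"

static i2s_chan_handle_t tx_chan;

/* Minimal IDF v5 standard-mode I2S transmit setup.
 * GPIO numbers are placeholders - use your board's actual pinout. */
void audio_out_init(void)
{
    i2s_chan_config_t chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG(I2S_NUM_AUTO, I2S_ROLE_MASTER);
    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_chan, NULL));

    i2s_std_config_t std_cfg = {
        .clk_cfg  = I2S_STD_CLK_DEFAULT_CONFIG(16000),   /* PicoTTS outputs 16kHz mono */
        .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_16BIT,
                                                        I2S_SLOT_MODE_MONO),
        .gpio_cfg = {
            .mclk = GPIO_NUM_2,    /* placeholder pins */
            .bclk = GPIO_NUM_17,
            .ws   = GPIO_NUM_47,
            .dout = GPIO_NUM_15,
            .din  = I2S_GPIO_UNUSED,
        },
    };
    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_chan, &std_cfg));
    ESP_ERROR_CHECK(i2s_channel_enable(tx_chan));
}

/* Push a buffer of 16-bit PCM samples out to the codec. */
void audio_out_write(const int16_t *samples, size_t bytes)
{
    size_t written = 0;
    ESP_ERROR_CHECK(i2s_channel_write(tx_chan, samples, bytes, &written, 1000 /* ms */));
}
```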

With both sound output and input working, it was time to combine the two. Of course, as Murphy would have it, it wasn’t quite that easy. It would appear there either wasn’t enough bandwidth to read the stereo microphone data while simultaneously transmitting the speaker data, or I ran out of processing cycles; whatever the cause, the produced sound got awfully clicky and poppy. Not having time to investigate this properly, I opted for basic resource arbitration – either it’s listening, or it’s talking. Of course, the interface I’d provided to the TTS engine did not expose enough information to know whether it was still busy producing data or not, so I had to revisit the PicoTTS component and add that feature. On the upside, doing that led me to realise I’d stuffed up a bit and left the TTS task chewing up cycles needlessly, so that got fixed while I reorganised things to provide the needed TTS-is-idle notification. With that in place, the resource arbitration worked well enough.
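
The arbitration itself doesn’t need to be anything fancy. A sketch of the idea using a FreeRTOS event group is shown below; the function and task names are hypothetical, and this isn’t the exact implementation from the demo.

```c
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"

#define TTS_IDLE_BIT (1 << 0)

static EventGroupHandle_t speech_state;

void arbitration_init(void)
{
    speech_state = xEventGroupCreate();
    xEventGroupSetBits(speech_state, TTS_IDLE_BIT);   /* nothing is talking yet */
}

/* Called by the TTS side around speech output. */
void tts_mark_busy(void) { xEventGroupClearBits(speech_state, TTS_IDLE_BIT); }
void tts_mark_idle(void) { xEventGroupSetBits(speech_state, TTS_IDLE_BIT); }

/* Microphone/recognition task: only capture audio while nothing is speaking. */
void recognition_task(void *arg)
{
    for (;;) {
        /* Block until the TTS engine reports it is idle... */
        xEventGroupWaitBits(speech_state, TTS_IDLE_BIT,
                            pdFALSE /* don't clear the bit */,
                            pdTRUE  /* wait for all bits    */,
                            portMAX_DELAY);
        /* ...then read a chunk of microphone data and feed it to the
         * wake word / command recognition models (omitted here). */
    }
}
```
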
When it comes to speech recognition, it is common for the commands to first need to be translated into phonemes, the basic building blocks of speech. This is not a straightforward, trivial task (English is not exactly consistent!), but the esp-sr component comes with a handy Python utility for it. The resulting phoneme string is then what gets handed to the speech recognition engine. Espressif provides support for both compile-time and runtime configured command strings. According to the documentation, if you give a non-phoneme string at runtime it will use an internal translation which isn’t as good as the offline method. I had no luck with this, however, and in the end I used the Python tool to do the phoneme translation. Only later did I realise this feature is only available for one particular ML model, and not the one I was using.
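
For reference, the runtime registration path looks roughly like the following, per my reading of the esp-sr documentation; the esp_mn_commands_* names should be verified against the esp-sr version in use, and the phoneme string is left as a placeholder rather than real output from the conversion tool.

```c
#include "esp_mn_speech_commands.h"

/* Register recognisable phrases at runtime. For the model I was using, the
 * strings must already be phoneme sequences produced offline with the esp-sr
 * Python tool; a placeholder is shown below. Call this only after the
 * multinet model (and its command storage) has been set up. */
void register_joke_commands(void)
{
    esp_mn_commands_clear();
    esp_mn_commands_add(1, "...");   /* phoneme string for "tell me a joke" goes here */
    esp_mn_commands_update();
}
```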

Demonstrably true: SR + TTS = Fun

In the end, everything came together into a working demo, showcasing offline speech recognition and speech generation on an ESP32-S3 microcontroller. In many scenarios there is no need for actual TTS on an edge device, as pre-generated phrases may be a better option. When the number of phrases gets large enough, however, a TTS engine really shows its value. For this demo, I opted to pull in a large “dad joke” database, and let the user ask for jokes. I would like to dedicate this little project to my colleague Trevor, who is well known for having a wealth of terrible jokes up his sleeve, and a willingness to share them – Trevor, this one’s for you 🙂

While I’d heartily recommend people try out the demo project, I acknowledge that not everyone has the required hardware (or time, or skills, or interest), so I recorded a brief demo video. It’s somewhat boring, considering I did not add any visual interface so all you see is the console output, but it’ll give you an idea of how well it all works. No timeline editing has been done on the video; what you hear are real-time responses.

I’ve also published the PicoTTS port as a proper ESP-IDF component, complete with documentation, so it’s available for others to reuse.

Reflections

In conclusion, it’s evident that pure edge speech recognition and speech generation are already possible, and they’re only going to get better. In time, we’ll hopefully see even higher quality open source TTS engines become available. I’m also curious to see what Espressif eventually releases, considering the extensive R&D they appear to be doing in this field.

As already mentioned, edge SR and TTS provide a different value proposition to that of cloud-backed services, and should be viewed and considered in that light. As always, there are trade-offs, and the right choice will vary between projects.

I’ll leave you with these final words:

  “Want to hear a construction joke?”

  “Sorry, I’m still working on it.”
