GoogleTTS Lemonbalm: Work smarter, not harder, to make a text-to-speech voice speak right without compromising on intonation.
For those who have been following the channel, I think you've seen my so-called "obsession" with the Google TTS Lemonbalm voices, and it's not baseless. While there are voices that have better intonation, and many AI ONNX models available, all these AI models do the same thing: they focus the neural network on the way the voice sounds, rather than on the way the TTS reads.
GoogleTTS flips that idea over and uses the neural network mostly on how the voice reads, and only as much as needed on its intonation and prosody.
First of all, let's name some examples of text that trips up most TTS engines:
1. 48486568427475867482854346532755723857238578342
2. llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
3. Bryanah
For more TTS traps, I've posted a TTS nightmare stress test on the channel, so go check that out.
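To show why the first example is a trap, here's a small Python sketch of the decision a TTS front end has to make: read a run of digits as one number, or spell it out digit by digit. This is my own illustration, not Lemonbalm's code. The function name `read_digit_run` and the 9-digit cutoff are made up for the example.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Hypothetical cutoff: runs longer than this are spelled out digit by digit.
MAX_CARDINAL_DIGITS = 9


def read_digit_run(token: str) -> str:
    """Return a speakable form for a run of ASCII digits."""
    if not (token.isascii() and token.isdigit()):
        raise ValueError("expected a non-empty run of ASCII digits")
    if len(token) > MAX_CARDINAL_DIGITS:
        # Too long to read as a single number: name every digit.
        return " ".join(DIGIT_NAMES[int(ch)] for ch in token)
    # Short enough: leave it for a cardinal-number reader (not shown here).
    return token
```

With this rule, the 47-digit monster above comes out as 47 digit names, while something like "2026" is left alone for the number reader.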
So what really happens?
Lemonbalm has a deep neural G2P (grapheme-to-phoneme) network, amazing phoneme models, and a sound player, to put it in simple terms.
The sound is literally very precise and advanced speech concatenation, which is why voices like this take an incredible amount of time to produce. If you've used any of these voices, you've seen that the voice actors didn't take this lightly and recorded an enormous amount of data to train them. If a concatenated speech voice can produce such intonation, then it's clear we're talking truly, truly advanced data synthesis and sampling.
Then where does the neural part happen?
Everywhere except the sound!
While all these other voices focus on the way the voice sounds and add 100-400 ms of delay between synthesis calls just to try to make the voice sound better, Google found out something else: focus the neural networks on the things that take the least time to generate, like text processing, the parser, G2P, and the prosody model itself, then play the sound using the concatenated speech engine.
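To make that split easier to picture, here's a toy Python sketch of the stages in order: text processing, G2P, prosody, then concatenation. It is only an illustration of the idea, not Google's implementation. The tiny lexicon stands in for the neural G2P model, the fixed durations stand in for the prosody model, and every name in it (`TOY_LEXICON`, `Unit`, `synthesize`, and so on) is hypothetical.

```python
from dataclasses import dataclass

# Toy lexicon standing in for a neural G2P model (assumption for illustration).
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


@dataclass
class Unit:
    phoneme: str
    duration_ms: int  # filled in by the prosody stage


def normalize(text: str) -> list[str]:
    """Front end: lowercase, split on whitespace, drop edge punctuation."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in words if w]


def g2p(words: list[str]) -> list[str]:
    """Look each word up; fall back to spelling unknown words letter by letter."""
    phonemes: list[str] = []
    for word in words:
        phonemes.extend(TOY_LEXICON.get(word, list(word.upper())))
    return phonemes


def prosody(phonemes: list[str]) -> list[Unit]:
    """Give every phoneme a duration and lengthen the final one."""
    units = [Unit(p, 80) for p in phonemes]
    if units:
        units[-1].duration_ms = 120
    return units


def concatenate(units: list[Unit]) -> list[str]:
    """Back end: one stored unit per phoneme, joined in order (no neural vocoder)."""
    return [f"{u.phoneme}:{u.duration_ms}ms" for u in units]


def synthesize(text: str) -> list[str]:
    return concatenate(prosody(g2p(normalize(text))))
```

Calling `synthesize("Hello, world!")` walks the text through all four stages. The point of the layout is that the expensive "thinking" happens on text and symbols, and the last step only looks up and joins stored units.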
This is how we're able to get under 30 ms of latency for speech. In Lemonbalm, the neural network focuses on sound just enough to make the magic work. It is not the best intonation available, and I'll admit there are things that sound better, but those don't make for a reliable embedded model. Shame on SamsungTTS, for example! They claim their voices are neural, yet they still read "ADB debugging" as "Ad B debugging" and UWB (Ultra-Wideband) as "O B", along with more such issues that should absolutely not exist in a neural voice.
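Misreading an abbreviation like ADB is a text-processing failure, not a sound failure, and a rough Python sketch shows how cheap the fix can be. This is a deliberately simple heuristic of my own, not what Lemonbalm or SamsungTTS actually do: the exception list `SPOKEN_AS_WORDS` and the 2-to-5-letter rule are assumptions for the example.

```python
import re

# Hypothetical exception list: all-caps tokens that are spoken as words.
SPOKEN_AS_WORDS = {"NASA", "RAM", "LASER"}

# Two to five capital letters standing alone as a token.
INITIALISM = re.compile(r"\b[A-Z]{2,5}\b")


def spell_out(match: re.Match) -> str:
    token = match.group(0)
    if token in SPOKEN_AS_WORDS:
        return token
    # Separate the letters so later stages read them one at a time.
    return " ".join(token)


def expand_initialisms(text: str) -> str:
    """Rewrite stand-alone initialisms as space-separated letters."""
    return INITIALISM.sub(spell_out, text)
```

Running `expand_initialisms("Enable ADB debugging for UWB")` gives "Enable A D B debugging for U W B", so the later stages never get the chance to read "Ad B". A real engine needs far more context than a regex, but it shows that this class of mistake lives in the front end.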
Conclusion:
Let's appreciate Lemonbalm!
While it's not meant to be run on 2018-2019 phones (even flagships from that time struggle a bit with it), Lemonbalm is perfect for later flagships. It needs a bit of power to run, and it drains the battery quite significantly, but, at the end of the day, it's not too bad.
Let's appreciate Lemonbalm for what it does: it completely flips the AI TTS world upside down and, at this point, it's the absolute best offline AI TTS model available on any platform.
The stress tests I listed in this post aren't to be taken lightly. They are hard to pronounce and trip up pretty much every TTS out there. The fact that there exists an embedded solution that can pronounce numbers like that, that monster of a place name, and that "Who would've thought" of an American name, while still delivering output in under 30 ms when it has the power, is a leap.
Now, in 2026, we're seeing a trash TTS engine that can't even properly pronounce the name of the TTS solution made by RoApps, and everyone goes "Wow" when Codefactory should feel absolutely ashamed of it!
GoogleTTS Lemonbalm is here and now. While the market is filled with AI voices that lag, delay, and wait a million years before they speak, only to say "Ad B" after all that waiting, Lemonbalm flips the script and focuses on what matters.
A TTS voice's job is to pronounce things for us.
The more human it sounds *WHILE AT THE SAME TIME pronouncing things the right way*, the better.
And Lemonbalm checks both boxes.
Thank you for reading until