The State Of The Art: Linux Text To Speech (TTS)

With Alexa, Siri and Google happily chatting around, let’s take a snapshot of what is available on Linux. Hint – not much. But let’s take a look.

It’s 2018 and Text-to-Speech (TTS) and, of course, the other way round (Speech-to-Text, STT) are at the core of all those new services promising to make our lives easier. But all those new services come with a catch: Everything you do or say might be recorded by a number of parties. If this is not what you want – bad luck.

If you want to knit your own environment to be able to have your weather reports, ebooks or whatever read to you, you need something that translates your texts into spoken words. You can of course ask a number of providers to convert those texts via the Internet, but – well – that might not be a good idea.

Fortunately, there are a number of technologies available to us that can be installed right on our Linux boxes. They convert text to speech without the need to talk to anybody else, so nothing leaks and we’re ok.

Festival: A general framework for building speech synthesis systems as well as including examples of various modules.
CMU Flite: A small, fast run-time open source text to speech synthesis engine developed at CMU and primarily designed for small embedded machines and/or large servers.
Mimic: Mycroft’s TTS engine, based on CMU’s Flite (Festival Lite)
SVOX Pico2Wave: The open source Android TTS engine adapted for Linux
Cepstral: A commercial TTS engine, available for Linux and even for the Raspberry Pi
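To give an idea of what "installed right on our Linux boxes" looks like in practice, here is a minimal Python sketch that writes a WAV file with one of the open source engines from the list. The command lines reflect the usual invocations of flite, mimic and pico2wave as I understand them; the helper names (build_tts_command, speak_to_file), the en-US language code and the output file name are assumptions made for illustration. Festival is left out because its text2wave helper reads its text from a file or stdin rather than from an argument.

```python
import shutil
import subprocess


def build_tts_command(engine, text, wav_path):
    """Return the argument list for a local TTS engine (hypothetical helper).

    The flag layouts below are assumptions based on each tool's usual
    command-line usage; check the man page of your installed version.
    """
    commands = {
        "flite": ["flite", "-t", text, "-o", wav_path],
        "mimic": ["mimic", "-t", text, "-o", wav_path],
        # pico2wave expects the output file name to end in .wav
        "pico2wave": ["pico2wave", "-l", "en-US", "-w", wav_path, text],
    }
    if engine not in commands:
        raise ValueError(f"unknown engine: {engine}")
    return commands[engine]


def speak_to_file(engine, text, wav_path):
    """Run the chosen engine locally; nothing is sent over the network."""
    command = build_tts_command(engine, text, wav_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    speak_to_file("flite", "Today will be sunny with a light breeze.", "weather.wav")
```

Swapping "flite" for "mimic" or "pico2wave" in the last line is enough to compare the engines on the same sentence.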

There are a few more, but I left them out because the sound quality was simply too poor, because I couldn’t get them installed on my Debian Stretch, or because they didn’t provide a demo online.

Again – all engines (except Apple’s, below) can be installed on a Linux box and don’t use any cloud or Internet service to do their magic.

Please notice: I used the suggested voices for this test. Some have a Scottish accent (think: Sean Connery) because, well, Professor Alan W Black, among the world leaders in the area of speech synthesis and father of Festival and Flite, is a Scotsman.

Here’s some text you should know, spoken by “Festival”:

Sounds like a scene from “Braveheart” with Mel Gibson a bit drunk. In any case, “Festival” is a huge piece of technology that offers tons of tweaks, configuration utilities and many different voices. If you want it a bit smaller, think “Flite”. It’s the little brother of “Festival” and it comes with a much smaller footprint.

And while we are talking “Flite”, let’s talk a bit about “Mycroft”. This company is working to build an open source alternative to “Alexa” & Co. Here’s a bit of their Kickstarter promo-video:

Now – that sounds pretty good. Unfortunately, they didn’t use an open-source TTS for their promotion.

So .. that wasn’t open source. Their current engine, called “mimic”, sounds much like the technology it is based upon (CMU Flite):

Not bad, but not comparable with the voice in the promo-video. They have their code on GitHub and people are welcome to use it. I couldn’t hear much improvement compared to “Flite”, but your mileage may vary. They are, however, working on mimic2, which is based on the Tacotron speech synthesis in TensorFlow; demos are available here.

Next up, “Pico2Wave”. This technology was created by the Swiss company SVOX AG and was selected by Google for Android, subsequently open-sourced, and finally made its way into the Linux environment. It’s very lightweight, fast and supports different languages.

There are tons of voices available – but unfortunately only for Android. I haven’t found any other voices I could combine with “Pico2Wave” but you know the old saying: Hope dies last.

Last but not least, the “Catrin” voice from the commercial voice developer “Cepstral”. These voices have been around for quite some years now and they are (as far as I know) still updated. Here’s what I have:

While the “Cepstral” voices are available for Linux, they are not open source and they are not free. If you want to have Catrin talking to you, you will have to pay for her (it?).

So kids, that’s the current status as of January 2018. None of the TTS technologies available to us comes even close to what we can get from Amazon Alexa, Google or other online providers. But is it fair to compare stand-alone “offline” engines against the combined server power of Alexa or Google?

Here’s a TTS example created offline on a Mac machine:

While that is still not the same as “Siri” or “Alexa” – it is much better than anything available to us. Why is that so? I asked Prof. Black (“Festival” and “Flite”).

He graciously took the time to answer and pretty much told me that there’s a difference between research and commercialization. While Prof. Black and his teams are doing the research, Apple invested money to propel TTS technology from a research state into a product. That requires, among other things, a lot of money to create professionally recorded databases and more.

So – without anybody investing in TTS (for Linux), we’re stuck for the time being. At least until Prof. Black comes up with a research result that elevates the technology available to us to the next level. I hope that will be soon. Because the more we get used to the capabilities of Alexa & Co., the more we will be hooked by their services and surveillance. After all – who wants to listen to “Robot-Charly” when they can have “Siri”?

UPDATE: We were looking for an offline TTS engine for our Lumosur environment. There are a few new developments (e.g. from Mozilla) pending, but for now, we settled for Pico2Wave – with a few updates. It sounds like this now ..

That’s not Alexa for sure. But it is acceptable for now while we are waiting for future technologies to become available. In order to make pico2wave a little clearer, we passed the audio through sox, decreased the bass and increased the treble. Everybody here thinks it is somewhat clearer, but – well – you’ll be the judge. We are going to look at the Tacotron technology soon, but voice, whether TTS or STT, is not really a super important issue for Lumosur, so we’ll concentrate on other problems for the time being.
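For anybody who wants to try the same trick, here is a rough Python sketch of the pico2wave-through-sox idea. It is not our exact setup: the gain values (-6 dB bass, +6 dB treble), the en-US language code, the file names and the helper names (build_pico_command, build_sox_command, speak_clearer) are all assumptions chosen for illustration, so tune the numbers by ear.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_pico_command(text, wav_path, lang="en-US"):
    """Argument list for pico2wave; the output name must end in .wav."""
    return ["pico2wave", "-l", lang, "-w", str(wav_path), text]


def build_sox_command(in_wav, out_wav, bass_db=-6.0, treble_db=6.0):
    """Argument list for sox using its 'bass' and 'treble' effects.

    The default gains are illustrative assumptions, not measured values.
    """
    return [
        "sox", str(in_wav), str(out_wav),
        "bass", f"{bass_db:+g}",      # negative gain cuts the low end
        "treble", f"{treble_db:+g}",  # positive gain lifts the high end
    ]


def speak_clearer(text, out_wav):
    """Synthesize with pico2wave, then re-equalize the result with sox."""
    for tool in ("pico2wave", "sox"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} is not installed")
    with tempfile.TemporaryDirectory() as tmp:
        raw_wav = Path(tmp) / "raw.wav"
        subprocess.run(build_pico_command(text, raw_wav), check=True)
        subprocess.run(build_sox_command(raw_wav, out_wav), check=True)


if __name__ == "__main__":
    speak_clearer("Today will be sunny with a light breeze.", "weather.wav")
```

Keeping the command builders separate from the function that runs them makes it easy to print the commands and try them by hand in a shell first.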

About the author:

Michaela Merz is an entrepreneur and first generation hacker. Her career started even before the Internet was available. She invented and developed a number of technologies now considered to be standard in modern web environments. Among other things, she developed, founded, managed and sold Germany’s third largest Internet online service, “germany.net”. She is very much active in the Internet business and enjoys “hacking” modern technologies like blockchain, IoT and mobile-, voice- and web-based services.

9 thoughts on “ The State Of The Art: Linux Text To Speech (TTS) ”

Bruno Alterescu says:

Hi Michaela,
Your article was very helpful. Pls let me know if you made any progress with Tacotron. Thank you.
Bruno