speech synth vs voice actors


Sol_HSA
12-17-2005, 11:36 AM
As a project at the school I'm attending, I did a little research into whether a speech synth (a TTS engine) could stand in for voice actors in games that have budget limitations, size limitations, or both (as in mobile games).

One of the limitations of the bleeding edge speech synths is figuring out the emphasis on text. This wouldn't really be a limitation on a game, as the developer could add emphasis tags or some such in the text to make it sound better. It would still mean some work, but not too much different from other audio work (sound effects / music).
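To make the emphasis-tag idea a bit more concrete, here is a minimal Python sketch. The `*word*` markup and the pitch multiplier are invented for this example; they are not taken from any real TTS engine, which would have its own tag syntax.

```python
import re

def parse_emphasis(line):
    """Split a dialogue line into (text, emphasized) chunks.

    Words wrapped in *asterisks* count as emphasized. This markup is
    hypothetical and only serves the illustration.
    """
    chunks = []
    for part in re.split(r"(\*[^*]+\*)", line):
        if not part:
            continue  # re.split can yield empty strings at the edges
        if len(part) > 2 and part.startswith("*") and part.endswith("*"):
            chunks.append((part[1:-1], True))
        else:
            chunks.append((part, False))
    return chunks

def to_prosody(chunks, base_pitch=1.0, boost=1.3):
    """Attach a made-up pitch multiplier to each chunk for the synth to use."""
    return [(text, base_pitch * boost if emph else base_pitch)
            for text, emph in chunks]

script_line = "the train arrives on track *three*"
print(to_prosody(parse_emphasis(script_line)))
```

The point is only that the writer marks the stress by hand, so the synth never has to guess it from the text.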

Based on my (granted, brief) research I found two things that astonish me.

First is the apparent lack of progress. The bleeding-edge speech synth could pass for a monotone person with a slight flu talking to you over the phone from the other side of the world. The downside is that it needs a couple of gigabytes of recorded speech to pull that off.

The second is the small number of free/open source low-end speech synths. There's bound to be loads of TTS engine implementations out there, considering that just about every computer I've owned has had one, be it a Spectrum 48K, an Amiga, a Z80-based CP/M machine, or a Macintosh.

The synths out there are basically based on three different technologies, or combinations thereof.

1. Emulating real human speech organs.
Typically useful only for figuring out what the heck is going on. Not suitable for real implementations due to complexity.

2. Formant synthesis.
Based on resonators and filters. Can be implemented with analog hardware! The good side is small size and low computing power requirements; the bad side is that this is what we've come to recognize as "robot talk".

3. Concatenative ("combinatory") synthesis.
"the train arrives on track.. THREE". The most common sort of synthesis: prerecorded samples are stitched together. If the samples are shrunk to single phonemes, the data set becomes small enough, but the result doesn't sound natural. The bleeding edge of speech synthesis is based on a huge library of words, from which the synthesizer cuts and patches words and parts of words together as required.

Out of these, the formant synthesis is pretty much the way to go.. but won't be really natural.
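For comparison, the cut-and-patch idea behind the third approach can be sketched in a few lines of Python. This is a toy under stated assumptions: the `library` of per-word recordings is hypothetical (plain lists of float samples), and the short linear crossfade at each seam is just one simple way to hide the joins, not what any particular engine does.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, blending `overlap` samples at each seam."""
    out = list(units[0]) if units else []
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            t = (i + 1) / (n + 1)        # fade position, strictly between 0 and 1
            j = len(out) - n + i         # index into the tail of what we have so far
            out[j] = out[j] * (1.0 - t) + unit[i] * t
        out.extend(unit[n:])             # append whatever was not blended
    return out

def speak(words, library, overlap=2):
    """Look up each word's recording and stitch them together."""
    return crossfade_concat([library[w] for w in words], overlap)

# Hypothetical stand-ins for recorded words.
library = {"track": [1.0, 1.0, 1.0, 1.0], "three": [0.0, 0.0, 0.0, 0.0]}
print(speak(["track", "three"], library))
```

The size problem is visible even here: every word (or word fragment) the game can say has to exist in the library first.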

One possibility would be to use the synthesizers built into the OS, but what the game would gain in small size it would lose in control. And when going cross-platform, a different synth would be used on each platform, leading to inconsistent results.

Pretty much the only thing I can think of is to find and use some very low-end speech synth with some extra filters plugged in, display the spoken text on screen as it is being said, and make all the speakers robots, aliens, or some such, as the voice is definitely not going to be natural.
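As a rough idea of how small a low-end formant synth can be, here is a minimal Python sketch of the "resonators and filters" approach: a pulse train fed through a cascade of two-pole resonators, one per formant. The resonator coefficients follow the usual textbook form for a digital two-pole resonator; the formant frequencies and bandwidths for the "ah" vowel are rough assumed values, not measurements, and all function names are made up for this example.

```python
import math

def resonator(signal, freq, bandwidth, rate):
    """Two-pole digital resonator tuned to one formant (unity gain at DC)."""
    t = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def pulse_train(pitch, duration, rate):
    """Crude voice source: one unit impulse per pitch period."""
    period = int(rate / pitch)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(duration * rate))]

def vowel(formants, pitch=110.0, duration=0.25, rate=8000):
    """Run the pulse train through one resonator per formant, in cascade."""
    signal = pulse_train(pitch, duration, rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, rate)
    return signal

# (frequency Hz, bandwidth Hz) pairs for an "ah"-like vowel; assumed values.
AH = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = vowel(AH)
```

The whole "voice" is a handful of numbers per sound, which is where the small size comes from, and also why it ends up sounding like a robot.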

Note that I didn't look into commercial speech synths other than those that have web-based samples available; there MAY be something good out there, but it's most likely out of reach for indie budgets. (Some price tags I saw were $500 per node or some such.)

Anyhoo, if someone has any thoughts about this, comments, or corrections, toss them my way =)

Chozabu
12-17-2005, 12:56 PM
I've seen a few commercial games (N64?) use mumbling that sounds like speech, with subtitles. There are a few different emotions for a few characters, and it repeats quite a bit.
I'm not sure if it would be OK for your project, but I'm rather happy with it.

Sol_HSA
12-17-2005, 02:13 PM
I don't have a game project in mind as such; this was just some research on the possibilities. Few indie games and pretty much no mobile games use speech, due to budget and/or size issues. But wouldn't it be cool if..

Escapist Games
12-17-2005, 02:50 PM
It's been a while since I played it, but as I recall the GameCube title "Animal Crossing" had an interesting solution for its talking characters. The animal NPCs all had talk balloons, but as they were speaking you heard muffled (almost musical) tones. It definitely gave the impression they were talking, but it was completely nonsensical.

It wasn't as compelling as actual speech... but it was better than just text.

mahlzeit
12-17-2005, 04:18 PM
Here's a possibility: develop a speech synth that actually does sound good and then license it to other game developers. ;)

Julio Gorge
12-17-2005, 05:26 PM
We used the integrated text-to-speech capabilities of Mac OS X and Windows in one published title. It is a word-puzzle game, and every time a word is completed it is spelled out loud by the speech engine.

If somebody wants to give it a try, here is a link to download the game:

Windows - http://www.bigfishgames.com/downloads/wordem/
Mac OS X - http://www.pobros.com/games.php?category=pc_puzzle&gameid=2&section=1&page=1

The code is pretty straightforward, especially on OS X.

PS: Sorry about the download size, but we were hired to develop the game, and had to 'follow orders' regarding the included assets :mad:

Vectrex
12-17-2005, 11:26 PM
The scary thing is that the speech synth on my Amiga 500 from... 15 years ago is significantly better than the Windows one :| I have heard an online one that sounded AWESOME, I'll have to track it down again.

Fabio
12-18-2005, 04:38 AM
Vectrex wrote: "the scary thing is the speech synth on my Amiga 500 from... 15 years ago is significantly better than the windows one :| I have heard an online one that sounded AWESOME, i'll have to track it down again."

Only the speech synth (narrator.device)? What about multitasking? Scrolling? Mouse feel? Should I go on? :mad:
BTW: I recall a techno song made with the sweet Amiga voice. :)

mahlzeit
12-18-2005, 05:17 AM
AT&T Natural Voices is the best I've heard so far. Online demo (http://www.naturalvoices.att.com/demos/)

Anthony Flack
12-18-2005, 05:25 AM
http://www.squashysoftware.com/civilizatron.mp3

I used the Amiga voice synth in a song once. (A bit of background is probably needed here. This was a one-off performance piece, based on the idea that since there is a Christian version of every kind of music, what would the Christian Kraftwerk sound like? We called the group Salvatron, and it was made with three of us playing cheap keyboards - no sequencing)

Vectrex
12-18-2005, 06:51 AM
mahlzeit wrote: "AT&T Natural Voices is the best I heard so far. Online demo (http://www.naturalvoices.att.com/demos/)"

Yes! That's the one. The UK English girl is the best.

Sol_HSA
12-18-2005, 07:44 AM
There's plenty of TTS demos online - http://www.google.com/search?q=tts+demo

If you think the AT&T is the best there is, you may wish to try
http://actor.loquendo.com/actordemo/default.asp?language=en

.. in any case, the "natural" sounding speech synths either require huge amounts of voice data, are too expensive for any kind of small-scale game projects, or both.

Fabio
12-18-2005, 08:31 AM
BTW: the Amiga one wasn't recording-based, it was a true synthesizer.

Momor
12-18-2005, 11:37 PM
Yeah, I remember this Amiga TTS; it was very impressive for its time. Even if the rendering was still very computer-ish, I would love to be able to reproduce this kind of TTS in my programs (without using the Windows SAPI). It could fit really well in some cases.

Does anyone have a clue how it was done? I believed it used the phonemes + smoothing method, but perhaps I'm wrong?

Fabio
12-19-2005, 12:46 AM
I analyzed it but it was like 15 years ago, thus many EEG's ago.

If my mad-cow-diseased emmenthal-switzerland-shaped brain is not at fault, I recall I got the impression it used formant technology.

Basically it had something like 5 oscillators (controllable in frequency and amplitude) for vowels + 1 noise generator (controllable too, this one for consonants), and a final modulator (e.g. modulate with a sawtooth waveform).

My guess was that those oscillators were driven in real time by some data (representing phonemes), and that in between phonemes there was some sort of morphing (probably some interpolation of this data + more complex control).

This is how I imagined I would do it, btw, and how I still imagine I will do it.
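The oscillator scheme guessed at above could look roughly like this in Python. It is only a sketch of the guess, not a reconstruction of narrator.device: the phoneme table, its numbers, and all names are invented for illustration, it uses three oscillators instead of five to stay short, and the final modulator is left out.

```python
import math
import random

# Hypothetical control frames: (frequency Hz, amplitude) per oscillator,
# plus a noise level. All values are invented for this illustration.
PHONEMES = {
    "a": {"osc": [(700, 1.0), (1200, 0.5), (2600, 0.2)], "noise": 0.0},
    "i": {"osc": [(300, 1.0), (2300, 0.4), (3000, 0.2)], "noise": 0.0},
    "s": {"osc": [(300, 0.0), (2300, 0.0), (3000, 0.0)], "noise": 0.6},
}

def lerp(a, b, t):
    """Linear interpolation between a and b for t in 0..1."""
    return a + (b - a) * t

def render(sequence, rate=8000, hold_n=640, glide_n=320, seed=0):
    """Hold each phoneme frame for hold_n samples, then glide to the next."""
    rng = random.Random(seed)
    n_osc = len(PHONEMES[sequence[0]]["osc"])
    phases = [0.0] * n_osc
    out = []
    for idx, name in enumerate(sequence):
        cur = PHONEMES[name]
        nxt = PHONEMES[sequence[idx + 1]] if idx + 1 < len(sequence) else cur
        for i in range(hold_n + glide_n):
            t = 0.0 if i < hold_n else (i - hold_n) / glide_n
            sample = 0.0
            for k in range(n_osc):
                freq = lerp(cur["osc"][k][0], nxt["osc"][k][0], t)
                amp = lerp(cur["osc"][k][1], nxt["osc"][k][1], t)
                # Accumulating phase keeps the tone continuous while freq changes.
                phases[k] += 2.0 * math.pi * freq / rate
                sample += amp * math.sin(phases[k])
            noise = lerp(cur["noise"], nxt["noise"], t)
            sample += noise * rng.uniform(-1.0, 1.0)
            out.append(sample)
    return out

samples = render(["a", "s", "i"])
```

The morphing here is plain linear interpolation of the control data; the "more complex control" part of the guess is not attempted.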

And, just like you, I don't care that it sounds "robotish".. I think it's fsking cool, and I'd like to have it in my games too. Actually I'd like to have a whole Amiga in my games, but that's another story I'm afraid.

Phil Steinmeyer
12-19-2005, 06:54 AM
I looked into voice synth tech a year ago. Was thinking about a social/adventure game, where the NPCs you interact with would respond with dynamic text (based on all kinds of game state variables). A voice synth would be ideal for this, but unfortunately, all that I found were still much too 'computer-y'.

I've always wondered if/when somebody would try to develop one that worked by a physical model of the human throat/mouth/tongue/vocal cords, and actually 'rendered' the sound that is made as you speak (i.e. breathe out air while forming your mouth/tongue to various positions). While hard to do, in theory you could get a near perfect sound (in the same way that 3D renderers rasterize objects with full lighting). And changing vocal qualities (e.g. from a male to a female voice) would be a matter of adjusting the physical properties of the model (smaller mouth and vocal cords for female), and changing some of the target positions for different sounds (e.g. a foreign accent is often just a difference in where the tongue is positioned for certain sounds).

Indiepath.T
12-19-2005, 06:58 AM
All the vocalisation on GEOM was done with a speech synth; the raw samples were tidied up a bit and some magic applied to make it sound more natural. I think the results are pretty damn good though.

Fabio
12-20-2005, 07:22 AM
Phil Steinmeyer wrote: "I looked into voice synth tech a year ago. Was thinking about a social/adventure game, where the NPCs you interact with would respond with dynamic text (based on all kinds of game state variables). A voice synth would be ideal for this, but unfortunately, all that I found were still much too 'computer-y'.

I've always wondered if/when somebody would try to develop one that worked by a physical model of the human throat/mouth/tongue/vocal cords, and actually 'rendered' the sound that is made as you speak (i.e. breathe out air while forming your mouth/tongue to various positions). While hard to do, in theory you could get a near perfect sound (in the same way that 3D renderers rasterize objects with full lighting). And changing vocal qualities (i.e. from a male to a female voice), would be a matter of adjusting the physical properties of the model (smaller mouth and vocal cords for female), and changing some of the target positions for different sounds (i.e. a foreign accent is often just the difference of where the tongue is positioned for certain sounds)"

A lot really depends on our (human) ability to change tone and inflection to add detail to our speech. While a physical model would work for the sound, I think that to get true realism you can't do without full AI (i.e. the computer understanding what it is saying, and with what feelings and emotions). And AI is still a young science after all.

It's like translation: you can't get one that really works (I mean of a quality comparable to a skilled human's) without actually understanding the meaning of the text (and even the context) you're translating. Deep understanding is what will make computers more human; without it, physical modeling alone or mechanically perfect robots won't ever play, sound, seem, or act human-like. And don't forget to add the big bag of defects we humans have!

Fry Crayola
12-20-2005, 07:50 AM
Escapist Games wrote: "It's been awhile since I played it, but as I recall the Gamecube title 'Animal Crossing' had an interesting solution for their talking characters. The animal NPCs all had talk balloons, but as they were speaking you heard muffled (almost musical) tones. It definitely gave the impression they were talking but it was completely nonsensical.

It wasn't as compelling as actual speech... but it was better than just text."

The funny thing about Animalese is that it just pronounces the actual letters, in a high-pitched voice, and very fast. So although it's utter gibberish, each word does actually have a unique sound to it.
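A toy Python sketch of that letter-by-letter trick, under loud assumptions: `letter_sounds` is a hypothetical table of one short recording per letter (plain lists of samples), and the speed-up is naive resampling, which makes the sound both faster and higher pitched when played at the original rate. How the actual game does it is not known from this thread.

```python
def speed_up(samples, factor):
    """Naive resampling: step through the recording `factor` samples at a time."""
    n = int(len(samples) / factor)
    return [samples[int(i * factor)] for i in range(n)]

def animalese(text, letter_sounds, factor=2.0, gap=0):
    """Say each letter's (sped-up) recording in order; unknown characters are skipped."""
    out = []
    for ch in text.lower():
        if ch in letter_sounds:
            out.extend(speed_up(letter_sounds[ch], factor))
        elif ch == " ":
            out.extend([0.0] * gap)      # optional short silence between words
    return out

# Hypothetical stand-ins for recorded letter names.
letter_sounds = {"h": [1.0] * 8, "i": [2.0] * 6}
print(animalese("Hi", letter_sounds))
```

Because the output is driven by the actual letters, every word gets its own sound pattern even though none of it is intelligible.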

bitshit
01-10-2006, 07:34 AM
The most impressive vocal synthesiser I ever heard was from Yamaha, it sings! :-)

http://www.vocaloid.com/en/introduction.html

ManuelFLara
01-10-2006, 07:37 AM
bitshit wrote: "The most impressive vocal synthesiser I ever heard was from Yamaha, it sings! :-)

http://www.vocaloid.com/en/introduction.html"

Actually that's a Yamaha-funded project, but it was developed by the Music Technology Group crew at the Pompeu Fabra University (http://www.upf.edu) in Barcelona.