Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!pipex!uunet!ukma!darwin.sura.net!convex!linac!sunova!dominic!blocker
From: blocker@dominic.ssc.gov (Rich Hall)
Subject: Re: Mac/Amiga speech, non-english languages
Message-ID: <1992Oct2.222025.27565@sunova.ssc.gov>
Keywords: dsp,sound effects,ad,da,guitar,pc
Sender: blocker@dominic (Craig Blocker)
Nntp-Posting-Host: dominic
Organization: Superconducting Super Collider Laboratory
References:  <1992Oct2.164714.3268@fel.tno.nl>
Date: Fri, 2 Oct 1992 22:20:25 GMT
Lines: 197

I am posting this article in reference to all of the articles I have read
so far in this newsgroup.
There were several requests for Mac speech synthesizers. MacInTalk is a
utility that came with my version of Borland's Turbo Pascal for the Mac.
If I remember correctly, you could write to it just as easily as writing
to standard output or a data file.
The default mode was full text-to-speech conversion, but there was also a
phoneme mode. 

Answering the non-English language question, I would be very surprised if 
no one has used MacInTalk with a Borland development product to create
phonemes from non-English languages. 

My Russian class at the University of Texas at Arlington used Macs to
practice with. I don't remember the name of the package, but one of its
features was that it would say any phrase that students typed in correctly
(which is hard to do when using a US keyboard to produce Russian
characters). I am positive that this 'synthesizer' was using digitized
words. It used a HyperCard interface.

I am a little more familiar with Amiga speech synthesis. The first AREXX
script I wrote was one to take special characters out of text files so
that the Amiga's narrator.device would have an easier time reading the
file. The problem was mainly with on-line documentation that used darn
near any character as an underscore. The result was often the word 'telga'
repeated 80 times, which became very annoying. Also, some tables of
contents use repeated periods to visually align headings with page
numbers. The over-efficient narrator.device converts every pair of repeated
periods to "and so on" in cases like this, so the computer would repeat
'and so on' about 20 times per line. There were other problems as well.

So I wrote an AREXX script that reads a text file and replaces each run of
repeated special characters with a single one (except that up to three
consecutive periods are allowed). Then all special characters that cause
trouble are deleted. The script also left-justifies all text, and sends
output to the SPEAK: device one "sentence" (the text between periods) at
a time, waiting to write the next sentence to SPEAK: until the previous
sentence is through. The alternate method, sending the file to SPEAK: all
at once, is sometimes useful, but usually results in periodic breaks where
SPEAK: refills its buffer. These breaks usually occur in the middle of
words, and sound quite unnatural.
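The cleanup steps above can be sketched in modern Python (the original was
an ARexx script; the particular set of 'troublesome' characters below is
illustrative, not the script's actual list):

```python
import re

def clean_for_speech(text):
    """Tidy a text file for a speech synthesizer: collapse runs of
    repeated special characters, allowing up to three consecutive
    periods, then delete known troublemakers and left-justify."""
    # Collapse any run of four or more periods down to three ("...")
    text = re.sub(r"\.{4,}", "...", text)
    # Collapse runs of other repeated punctuation to a single character
    text = re.sub(r"([^\w\s.])\1+", r"\1", text)
    # Delete characters that trip up the synthesizer (illustrative set)
    text = re.sub(r"[_*#|\\]", "", text)
    # Left-justify: strip leading whitespace from every line
    return "\n".join(line.lstrip() for line in text.splitlines())

def sentences(text):
    """Yield one 'sentence' (the text between periods) at a time, so the
    caller can wait for the synthesizer to finish before sending more."""
    for s in text.split("."):
        s = s.strip()
        if s:
            yield s
```

Feeding the output one sentence at a time, rather than all at once, is what
avoids the buffer-refill breaks described above.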

My script also opens an output window and writes the original file to
the screen in its original, untranslated format. The visual output is sent
one line at a time, somewhat synchronized with the speech so that you can
more easily follow along.

I also recently acquired some PD 'jive' and 'valley girl' source code that
replaces certain phrases in documents with colloquialized lingo. My next
project is to add command-line options to the script that will let you
choose such translations.
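Such a translator is essentially a table-driven substitution pass. A
minimal sketch in Python (the phrase table here is made up for
illustration; the real PD 'jive'/'valley girl' rules are far more
elaborate):

```python
import re

# Hypothetical phrase table -- not the actual PD 'valley girl' rules.
VALLEY = {
    "very": "like, totally",
    "yes": "fer sure",
    "amazing": "tubular",
}

def colloquialize(text, table):
    """Replace whole words found in the table, leaving others alone."""
    def repl(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", repl, text)
```

A command-line option would simply select which table gets passed in.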

I was also impressed with a PD program called 'talk' that someone made by
digitizing his own phonemes. It sounded OK, but the rate was not
adjustable.

I use my script (originally called 'talk', but now renamed 'aloud' so as
not to conflict with the other 'talk') to read documentation files that I
know I should read all the way through, but just don't feel like it. I
have a tendency to scan doc files quickly, so this method makes sure that
I have both read and heard all of the warnings. (Such as: 'Press gadget
B to execute your favorite hard drive formatting utility'. That was the
last time I tinkered with shell replacements without carefully studying
the documents FIRST. :)

As for non-English language translators for the Amiga, I have to
conjecture that there are many, considering Amiga's dominant European
market share. To find a good one, try asking the Amiga experts in
comp.sys.amiga.software or other Amiga newsgroups.

Anyone remember the TI 99/4A's speech synthesizer? It sounded exactly like
a "Speak & Spell" (probably the same circuitry) and had a very limited
vocabulary. I think there were some foreign language modules for it, too.
The entire system cost me about $70 in the early eighties, and should cost
even less now. (Heck, I'd be happy to sell my entire system, including a
$90 extended BASIC cartridge, for $60+P&H).

On the topic of voice recognition, I must differ with the claim
that the Amiga had the first voice-recognition system. My first computer,
a 1980 TRS-80 Model III, had a peripheral called a VoxBox (or something)
that could do some primitive voice processing to trigger TRS-DOS commands
and for use in some other limited ways. I think this was also intended for
the visually-impaired. I'm not saying that the VoxBox was the first voice
recognition system, but it did come out before the Amiga 1000. Maybe a
professional student of voice technology could clear this one up, and
offer some insight into some good history/reference/technical books on the 
topic?

One of the more recent commercial voice recognition systems is the Voice
Navigator on the Macintosh platform. It is reportedly good at replacing a
mouse for menu selections and the keyboard for shortcut macros, but not at
converting speech into text. The possibility of a computer system that
uses a keyboard for text, a mouse or similar pointer for drawing, and a 
voice input device for commands, macros, and changing tools is intriguing.
Such a system diverges from the trend toward the 'penpad' computers, which
consolidate all input and output devices into a pen and a pad.

The three-input system (keyboard/pointer/listener) would seem to allow more
efficient control of input. With a one-handed keyboard device like the
Bat, a user could have one hand on the keyboard, one hand on the pointer,
and one voice interacting with the microphone, with the possibility of
these three forms of input being processed simultaneously. Furthermore,
the pointer could always remain positioned in the window area rather than
going back and forth to menus and buttons. This would be good for detailed
work that might require issuing a command while the cursor stays on a
specific pixel. Now, users with that problem choose a menu item and then
proceed on a laborious journey to find that pixel they were just pointing
to a second ago (obviously, this is one reason why keyboard shortcuts are
so important in window environments).

I brought up the handwriting readers for a reason. I have heard that the
pattern-matching algorithms developed for optical character readers were
developed to the point that a small set of routines can recognize
characters of all fonts, styles, and sizes with a very high degree of
accuracy. Then these OCR algorithms were adapted, with minor modifications,
to reading handwriting. Are these same pattern-matching routines being
used to match vocal sound waves to those of standard phonemes?
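I can't say whether real recognizers work that way (in practice they tend
to compare spectral features rather than raw waveforms), but the basic
template-matching idea can be illustrated with a normalized correlation
over sample windows. A toy Python sketch, with made-up phoneme templates:

```python
import math

def correlation(a, b):
    """Normalized correlation between two equal-length sample windows;
    1.0 means a perfect match, 0.0 means no linear relationship."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da and db else 0.0

def best_phoneme(window, templates):
    """Return the name of the template waveform that best matches the
    window, in the spirit of OCR-style pattern matching."""
    return max(templates, key=lambda name: correlation(window, templates[name]))
```

The phoneme templates here would be short recorded waveforms; a real
system would first transform both sides into something more robust than
raw samples.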

There have been some posts here about expression-reading or lipreading
through digitized images. It would seem feasible that image-enhancement
software could convert facial images to high-contrast monochrome
snapshots. Then, adaptations of the OCR pattern-matching routines could be
developed to read lips. But don't voiced and unvoiced consonants
(like 'v' and 'f') present a problem, in that they use the same lip
formation?
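The high-contrast monochrome conversion could be as simple as thresholding
each pixel. A toy sketch in Python, assuming an 8-bit grayscale image
stored as a list of rows:

```python
def to_monochrome(gray, threshold=128):
    """Threshold an 8-bit grayscale image (rows of 0-255 values) into
    pure black (0) and white (255), e.g. for lip-outline extraction."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]

def auto_threshold(gray):
    """A crude automatic threshold: the mean pixel value of the image."""
    pixels = [px for row in gray for px in row]
    return sum(pixels) // len(pixels)
```

Real image-processing packages offer far better adaptive thresholding, but
this conveys the idea of reducing a face to a pattern-matchable outline.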

A non-expert suggestion as to how to get a large database of
head-and-shoulders pictures of people speaking is to get access to a
computer equipped with a large amount of empty storage space, a sound
digitizer, a live video digitizer, and some specialized software that can:

(a) synchronously digitize sound and video into one editable format,
(b) edit images to provide high-detail/high-contrast output.

(Examples: Mac Quadra w/Video Spigot, MacRecorder, Adobe Premiere, and
Photoshop; Amiga 2000/4000 with Video Toaster and the Sunrize
16-bit audio system.)

QuickTime on the Mac and MicroChannel IBMs have made significant advances 
toward competing with the Amiga's separate Audio and Video coprocessors, 
and have the advantage of being more widely available. Multimedia packages 
like Premiere and MovieMaker can show periodic frames of the video 
accompanied by the waveform of the sound track, and any subtitles you care 
to add.

Another step would be to develop a standard speech for volunteers to read
that incorporates all known phonemes (well, maybe just the most commonly
used ones!). Then, spend your free time parading your friends and
relatives of all nationalities in front of the camera to read your speech
at a 'normal' rate. Fill up your hard drives (or a few SyQuest,
Bernoulli, optical, or magneto-optical cartridges) with your samples.
Later, go back and delete all of the extraneous stuttering that might have
occurred during the recording.

If the tools are available, enhance the images to the optimum monochrome
contrast/detail level available. (On the Amiga Video Toaster, there is an
easy option for this... quite spectacular, I might add.) Then, choose the
option that shows some video frames accompanied by the soundwaves.
Software is available that can find the number of times that a certain
waveform coincides with a certain mouth shape, and use those to generate a
generic video and waveform that is a composite of all of the digitized
images and sounds. If you can then load this data back into the movie
editor, you should test the result to see if this composite is visually
and audibly intelligible. A study that I saw on some technology show did
this. The voice was monotonous, but quite understandable, and the image
looked a little like Casper the Friendly Ghost, with shaggy black hair and
beady eyes.

From what I understand, the trick is to do some OCR-like visual analysis
of the 'composited' frames with their corresponding waveforms. After
generating some preliminary algorithms that work on the generic sample,
they are iteratively applied to the original samples to 'fine-tune' these
theoretical algorithms to adapt to reality. When the revised algorithms
can achieve a reasonable level of accuracy, say 70% or so, you are on your way.

The project covered on 'Beyond 2000,' or some similar show, was still
underway, but had given up on standard video technology due to lack of
detail. There was mention of High Definition Television, electronically
scanning analog 35-mm movie film to a high-DPI file format, immense data
constraints, and not much success. I'm sure someone reading this newsgroup
has seen the program I saw and understands the whole thing better. Please
post your replies so that we can all benefit from your knowledge of the
topic.

Thanks for letting me ponder my hobbies in this forum... Please email me if 
you think I am too long-winded or if you would like a copy of the AREXX
script I mentioned, or if you need help with a HyperCard stack or AREXX
script.

   -- Rich Hall, Technologist
      Superconducting Super Collider Laboratory
      Dallas, Texas

Please send any email to blocker@dominic.ssc.gov rather than other nodes.
My views are my own, and do not necessarily reflect those of my employers,
the Universities Research Association, the US Department of Energy, or any
other government entity or contractor.
