Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.mathworks.com!news.kei.com!wang!news
From: bruck@actcom.co.il (Uri Bruck)
Subject: Re: Nonsense text generation
Organization: ACTCOM - Internet Services in Israel
Date: Mon, 10 Jul 1995 18:55:07 GMT
Message-ID: <DBIKJy.Cr9@actcom.co.il>
References: <3t3v0n$el1@blackrabbit.cs.uoregon.edu> <1995Jul5.090521.1@ctdvx5.priv.ornl.gov>
Sender: news@wang.com
Lines: 48

: In article <3t3v0n$el1@blackrabbit.cs.uoregon.edu>, bhelm@cs.uoregon.edu (B. Robert Helm) writes:
: > I need to automatically generate paragraphs of English-like nonsense
: > text.  Could someone point me to references on linguistics that might
: > be relevant?  I vaguely recall one of the popular science magazines
: > (Byte? Scientific American?) doing a column on a "travesty generator"
: > which apparently generated nonsense words.  I'm interested in
: > generating text that not only has plausible-looking words,but also a
: > plausible distribution of word lengths and repetitions.  Do any of the
: > standard probability distributions fit?  Are there specialized
: > distributions (Zipf's law?) that work well?
: > 
I don't have the date of the issue, but it was Scientific American's Computer
Recreations column. I can't remember the title, but they used Markov chains.

Markov chains do pretty much what you describe. I'll try to give a short
description. Feel free to e-mail me for a longer one.
The simplest form is: take an input text, any text. For every pair of words that
appears in the text, create a table entry. The entry is the pair of words,
and the word that follows them. F'rinstance, 'Scientific American' is followed
by 'doing' in the quote from your post, so that would be one table entry.
It is followed by 'computer' in my reply, so that would be another.
(I am ignoring the paragraph that begins with 'Markov chains' to avoid endless
self-referential loops.)
You do this for every consecutive pair of words in the text
(word 1 with word 2, then word 2 with word 3, etc.).
You also need to count how many times each pair of words is followed by
each specific word.
In the above example, 'Scientific American' was followed once by 'doing'
and once by 'computer', which means 50% of the time it was followed by
'doing' and 50% of the time it was followed by 'computer'.
If a pair of words is followed once by one word, and 9 times by another,
then you note that the first word appears 10% of the time, and the second
appears 90%.
So far so good?
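The table-building step above can be sketched in a few lines of Python (the
function name and sample text are my own, just for illustration):

```python
# Map each consecutive pair of words to counts of the words that follow it.
from collections import defaultdict

def build_table(text):
    words = text.split()
    table = defaultdict(lambda: defaultdict(int))
    # Slide over every consecutive triple: pair (w1, w2) is followed by w3.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)][w3] += 1
    return table

table = build_table("the cat sat on the cat ran on the mat")
# table[('on', 'the')] records that 'cat' and 'mat' each followed once,
# i.e. a 50-50 split, just like 'Scientific American' above.
```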
Now for generating nonsense.
Pick any pair of words and look at the entire set of words, each of which
follows that pair somewhere in the text. Pick one of those at random,
with the probability of selecting each word equal to its distribution
in the text. If you picked 'Scientific American' you would give a 50-50
chance to 'doing' and 'computer'; in other cases the probabilities would
be different.
Having chosen the third word, take the 2nd and 3rd as your pair and
repeat the process.
You might like to treat some token as end of sentence. There are
many ways to enhance this, limited only by your imagination.
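And the generation step, again as a rough sketch (it reuses the hypothetical
table-builder from before; random.choices does the weighted pick):

```python
# Generate nonsense by repeatedly picking a follower of the current pair,
# weighted by how often it followed that pair in the source text.
import random
from collections import defaultdict

def build_table(text):
    words = text.split()
    table = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        table[(w1, w2)][w3] += 1
    return table

def generate(table, pair, length):
    out = list(pair)
    for _ in range(length - 2):
        followers = table.get(pair)
        if not followers:      # dead end: this pair never occurs mid-text
            break
        words = list(followers)
        weights = list(followers.values())
        nxt = random.choices(words, weights=weights)[0]
        out.append(nxt)
        pair = (pair[1], nxt)  # shift: 2nd and 3rd become the new pair
    return " ".join(out)

text = "the cat sat on the mat and the cat sat on the rug"
print(generate(build_table(text), ("the", "cat"), 8))
```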
Uri Bruck
bruck@actcom.co.il

