Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.mathworks.com!newsfeed.internetmci.com!news.sprintlink.net!howland.reston.ans.net!tank.news.pipex.net!pipex!uknet!newsfeed.ed.ac.uk!edcogsci!steve
From: steve@cogsci.ed.ac.uk (Steve Finch)
Subject: Re: Chomsky, Significance, and Current Trends
Message-ID: <DDAw66.38G@cogsci.ed.ac.uk>
Organization: Centre for Cognitive Science, Edinburgh, UK
References: <4084i9$dml@newsbf02.news.aol.com> <DD5CLH.2nJ@cogsci.ed.ac.uk> <40gcuc$rl9@ruccs.rutgers.edu> <DD7CpI.BoA@cogsci.ed.ac.uk> <40mfdk$du2@senator-bedfellow.MIT.EDU>
Date: Mon, 14 Aug 1995 12:31:49 GMT
Lines: 62

David Pesetsky <pesetsk@mit.edu> writes:

>steve@cogsci.ed.ac.uk (Steve Finch) wrote:
>>
>>My favourite example is aligning parallel texts; the task of saying
>>for direct translations "this sentence is a translation of that
>>sentence (or those two sentences, etc)".  One would think the
>>vocabulary used in the sentences might have the most important
>>influence here.  Not a bit of it.  By far the most effective way found
>>so far is to align on sentence lengths (number of words).  Some might
>>say "but that's just a hack".  It works far too well to be "just a
>>hack".  There is something about the nature of languages which
>>translates short sentences to short sentences and long ones to long
>>ones and traditional linguistics simply does not address that part of
>>the nature of language at the level of detail required to build an
>>aligner.

>Is this really a mystery of language, or is it something fairly trivial (albeit 
>one that may touch on questions of deeper interest)?  Aren't languages fairly 
>similar in the sorts of meanings associated with morphemes? Since translations aim 
>to preserve meaning, doesn't one *expect* that the number of morphemes needed to 
>express an idea in roughly the same way should be fairly stable across languages?  
>And if you pick a pair of languages that group morphemes into words in roughly the 
>same way (e.g. English/French), doesn't one expect word counts to inherit their 
>stability from the morpheme counts across the languages?

Well, it may not be surprising that such simple statistics are so
informative for this task, but then wouldn't we expect a human
sentence processing mechanism (HSPM) to be able to exploit the
regularities evident in such simple statistics (although maybe not in
such an artificial task)?
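As an aside, the dynamic-programming idea behind length-based
alignment is simple enough to sketch.  This is my own toy code, in the
spirit of Gale & Church's aligner but not their implementation: the
cost function (absolute difference of total word counts) and the
restricted move set are illustrative simplifications.

```python
def align(src_lens, tgt_lens):
    """Align two lists of sentence lengths by dynamic programming.

    Allowed beads: 1-1, 1-2, and 2-1 alignments.  The cost of a bead
    is the absolute difference of the grouped lengths -- a crude
    stand-in for Gale & Church's probabilistic cost.
    Returns a list of (src_indices, tgt_indices) pairs.
    """
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    # best[i][j] = minimal cost of aligning the first i source
    # sentences with the first j target sentences.
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    moves = [(1, 1), (1, 2), (2, 1)]  # (src count, tgt count) per bead
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in moves:
                if i + di <= n and j + dj <= m:
                    cost = abs(sum(src_lens[i:i + di]) -
                               sum(tgt_lens[j:j + dj]))
                    if best[i][j] + cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + cost
                        back[i + di][j + dj] = (di, dj)
    # Trace back the optimal sequence of beads.
    beads, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]

# Toy example: the second source sentence was split in translation.
src = [10, 22, 8]        # word counts of source sentences
tgt = [11, 12, 9, 8]     # word counts of target sentences
print(align(src, tgt))   # -> [([0], [0]), ([1], [1, 2]), ([2], [3])]
```

Note that nothing here looks at the words themselves; the word counts
alone carry enough signal to recover the 1-2 bead in the middle.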

I agree with the observation about morpheme counts; it is indeed
interesting, and one which deserves thorough investigation by looking
at corpora.
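The stability claim is easy to quantify once you have aligned sentence
pairs: just correlate the word counts on the two sides.  A toy sketch
(the word counts below are invented for illustration, not taken from
any real corpus):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented word counts for seven aligned English/French sentence pairs.
en = [12, 7, 23, 15, 9, 31, 18]
fr = [13, 8, 25, 16, 10, 33, 20]
print(pearson(en, fr))
```

For closely related languages like English and French the measured
correlation on real parallel text is very high, which is exactly why
the length-based aligner works; running the same measurement on a
polysynthetic language would test the conjecture directly.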

>Has anyone tried the *word* counting method on, say, Chinese/Chukchi parallel 
>texts? Chukchee is heavily "polysynthetic". Much that would take up several words 
>in English is packed into a single word in Chukchi.  One would expect sentence 
>word counts to be heavily depressed in Chukchee, making the alignment technique 
>that works so well for the Canadian parliament proceedings a bit more troublesome.

I know of Chinese/English alignment (which works very well), but one
unfortunate problem for corpus linguistics is that languages like
Chukchi don't yet have machine-readable corpora large enough to apply
statistical techniques to (let alone large enough Chinese/Chukchi
parallel corpora).  Indeed, gathering large enough corpora for study
is an important part of corpus linguistics, and it may well turn out
that different classes of language call for different statistical
methods.

I agree that since the nature of words varies across languages,
techniques which operate at the word level may well need to differ
from language to language.  One might speculate that the statistical
models underlying language processing are similar (since we're all
humans with a similar bag of tools for interpreting and processing
information), but that the linguistic units to which the models refer
(the way the regularities are expressed) may differ.

>-David Pesetsky

Steve.
