Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!portc02.blue.aol.com!news.bbnplanet.com!cpk-news-hub1.bbnplanet.com!EU.net!sun4nl!phcoms4.seri.philips.nl!newssvr!news
From: Danny Kersten <kerstend@natlab.research.philips.com>
Subject: Re: Wanted: Algorithm for approximate document comparison
Sender: news@natlab.research.philips.com (USENET News System)
Message-ID: <32BEAE69.167E@natlab.research.philips.com>
Cc: 850579750snz@cdsl.demon.co.uk
Date: Mon, 23 Dec 1996 16:08:09 GMT
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
References: <850579750snz@cdsl.demon.co.uk> <5948mf$8qq@dove.nist.gov> <59fctc$jcm@netnews.upenn.edu>
Mime-Version: 1.0
X-Mailer: Mozilla 2.02 (X11; I; IRIX 5.3 IP22)
Organization: Institute for Perception Research (IPO)
Lines: 25

Michael John Collins wrote:
> 
> : In article <850579750snz@cdsl.demon.co.uk>, peter@cdsl.demon.co.uk (Peter Hayward) writes:
> :> I'm looking for an algorithm (= metric or discriminent
> :> function) to ...

> Checking how many lines are identical (or near-identical) between the
> two files ....

But that is not what he wanted. Now, for another approach:

For every word (or for some chosen words) in a file, take the count of
that word divided by the total number of words in the file.

For example, suppose we look at only three words, 'file', 'system'
and 'distributed':

total word count = 200
word count ('file') = 6
word count ('system') = 1
word count ('distributed') = 4

then the metric would be the vector (6/200, 1/200, 4/200), or a single
number such as (6/200) + (1/200) * x + (4/200) * x^2 with x some constant.
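A rough sketch of this in Python (the function name, the keyword list and
the sample texts are my own invention, just to illustrate the idea):

```python
from collections import Counter

def frequency_vector(text, keywords):
    """Relative frequency of each keyword: count(word) / total words."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1          # avoid division by zero on empty text
    return [counts[w] / total for w in keywords]

# Two documents can then be compared by the distance between their vectors.
def distance(text_a, text_b, keywords):
    va = frequency_vector(text_a, keywords)
    vb = frequency_vector(text_b, keywords)
    return sum((a - b) ** 2 for a, b in zip(va, vb)) ** 0.5

keywords = ['file', 'system', 'distributed']
print(frequency_vector("file system file distributed", keywords))
```

A small distance would then suggest the two documents use the chosen
words in similar proportions.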

Danny.
