Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsfeed.cit.cornell.edu!newsstand.cit.cornell.edu!news.kei.com!newsfeed.internetmci.com!in2.uu.net!daver!dlb!megatest!news
From: Dave Jones <djones>
Subject: Re: Q: Relative entropy or Kullback-Leibler divergence
Content-Type: text/plain; charset=us-ascii
Message-ID: <DHus2x.I67@Megatest.COM>
To: mglinws@aol.com
Sender: news@Megatest.COM (News Admin)
Nntp-Posting-Host: pluto
Content-Transfer-Encoding: 7bit
Organization: Megatest Corporation
References: <47igt6$7cs@newsbf02.news.aol.com>
Mime-Version: 1.0
Date: Sat, 11 Nov 1995 00:30:32 GMT
X-Mailer: Mozilla 1.1N (X11; I; SunOS 5.4 sun4m)
X-Url: news:47igt6$7cs@newsbf02.news.aol.com
Lines: 60

Mark Laubach sez...

> How does K-L compare with the Shannon information entropy?

The following is mostly speculation, which I would like for someone to
confirm or deny. As I said before, I invented some stuff a few months ago,
and I now have some reason to believe that it was already known, and going
by the name of "Kullback-Leibler".

Could someone please peruse the following and tell me the standard
names for the ideas? I would very much appreciate it.

We can make an estimate of Shannon entropy, which I have called "average
confusion". Suppose we have an exhaustive set S of mutually exclusive
classes ("events") by which samples may be classified, and a probability
distribution P over S.


           Entropy = Sum     { P(E) log(1/P(E)) }
                    E in S                    

(It is traditional to use base 2 logarithms.)
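For concreteness, the formula might be computed in Python like so (the function name is my own invention):

```python
import math

def shannon_entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p given as a
    dict mapping class -> probability. Zero-probability classes contribute
    nothing (the usual 0 log 0 = 0 convention)."""
    return sum(p_e * math.log2(1.0 / p_e) for p_e in p.values() if p_e > 0)

# A fair coin carries exactly one bit of entropy:
print(shannon_entropy({"heads": 0.5, "tails": 0.5}))  # -> 1.0
```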

To make an estimate of the entropy, we perform some tests, and call the
estimate "average confusion", or AC.

           AC = Average   log(1/P(C(t)))
                 tests t

          where C(t) is the class to which test t is assigned.

You can see that if the probability P is well-calibrated -- that is, if
the tests really are drawn according to P -- then AC converges to the
Shannon entropy with probability 1 as the number of tests grows without
bound, by the strong law of large numbers.
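That convergence is easy to check by simulation. Here is a sketch
(names mine) that draws tests from P and averages log2(1/P(C(t))):

```python
import math
import random

def average_confusion(p, n_tests, rng):
    """Estimate the entropy of p by drawing n_tests classes from p and
    averaging log2(1/P(C(t))) over the tests."""
    classes = list(p)
    weights = [p[c] for c in classes]
    draws = rng.choices(classes, weights=weights, k=n_tests)
    return sum(math.log2(1.0 / p[c]) for c in draws) / n_tests

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true entropy = 1.5 bits
print(average_confusion(p, 100_000, random.Random(0)))  # close to 1.5
```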

There is a kind of "fuzzy" generalization of the above, where you replace
the "1" in the numerator with a "discovered" probability Q. In other words,
rather than discovering that test t is classified (certainly) as an
instance of the event C(t), and not any other, you discover instead a
refined probability distribution Q(.|t) over S. Call the information that
the new probability adds the "confusion-drop", or CD.

Then the formula becomes (with the convention that 0 log 0 = 0)


          CD  = Average   Sum    { Q(E|t) log( Q(E|t) / P(E|t) ) }
                tests t  E in S

This is an estimate of how much information the new odds line has
added. The Shannon entropy estimate is a special case,
where the discovered probability is "God's odds", which assigns a probability
of 1 to every winner (correct classification), and 0 to every loser,
without fail. We mortals can only discover God's odds after learning
the result of the test.
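The special case can be checked numerically. Below, the per-test inner
sum is computed with each log term weighted by Q(E|t) and with 0 log 0
taken as 0, so the losers contribute nothing; the names are mine:

```python
import math

def confusion_drop(q, p):
    """Per-test inner sum: Sum_E Q(E) log2(Q(E) / P(E)), with 0 log 0 = 0.
    q and p are dicts mapping class -> probability over the same set S."""
    return sum(q_e * math.log2(q_e / p[e]) for e, q_e in q.items() if q_e > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}

# "God's odds" for a test whose correct classification turned out to be "b":
gods_odds = {"a": 0.0, "b": 1.0, "c": 0.0}
print(confusion_drop(gods_odds, p))  # -> 2.0, i.e. log2(1/P(b)), the AC term

# A merely refined probability adds less information than God's odds:
q = {"a": 0.25, "b": 0.625, "c": 0.125}
print(confusion_drop(q, p))
```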




            Jive

