Newsgroups: comp.ai.neural-nets,comp.answers,news.answers
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!oitnews.harvard.edu!news.sesqui.net!uuneo.neosoft.com!news.blkbox.COM!academ!bcm.tmc.edu!news.msfc.nasa.gov!elroy.jpl.nasa.gov!swrinde!newsfeed.internetmci.com!news.mathworks.com!uhog.mit.edu!news.mtholyoke.edu!world!mv!barney.gvi.net!redstone.interpath.net!sas!mozart.unx.sas.com!hotellng.unx.sas.com!saswss
From: saswss@unx.sas.com (Warren Sarle)
Subject: comp.ai.neural-nets FAQ, Part 4 of 7: Datasets
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <nn4.posting_823026431@hotellng.unx.sas.com>
Supersedes: <nn4.posting_820193064@hotellng.unx.sas.com>
Approved: news-answers-request@MIT.EDU
Date: Tue, 30 Jan 1996 18:27:15 GMT
Expires: Tue, 5 Mar 1996 18:27:11 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
Reply-To: saswss@unx.sas.com (Warren Sarle)
Organization: SAS Institute Inc., Cary, NC, USA
Keywords: frequently asked questions, answers
Followup-To: comp.ai.neural-nets
Lines: 275
Xref: glinda.oz.cs.cmu.edu comp.ai.neural-nets:29579 comp.answers:16670 news.answers:63362


Archive-name: ai-faq/neural-nets/part4
Last-modified: 1996-01-06
URL: ftp://ftp.sas.com/pub/neural/FAQ4.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

This is part 4 (of 7) of a monthly posting to the Usenet newsgroup
comp.ai.neural-nets. See part 1 of this posting for full information on
what it is all about.

========== Questions ========== 
********************************

Part 1: Introduction

   What is this newsgroup for? How shall it be used?
   What is a neural network (NN)?
   What can you do with a Neural Network and what not?
   Who is concerned with Neural Networks?

Part 2: Learning

   What does 'backprop' mean? What is 'overfitting'?
   Why use a bias input? Why activation functions?
   How many hidden units should I use?
   How many learning methods for NNs exist? Which?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?
   How are NNs related to statistical methods?

Part 3: Information resources

   Good introductory literature about Neural Networks?
   Any journals and magazines about Neural Networks?
   The most important conferences concerned with Neural Networks?
   Neural Network Associations?
   Other sources of information about NNs?

Part 4: Datasets

   Databases for experimentation with NNs?

Part 5: Free software

   Freely available software packages for NN simulation?

Part 6: Commercial software

   Commercial software packages for NN simulation?

Part 7: Hardware

   Neural Network hardware?

------------------------------------------------------------------------

Subject: Databases for experimentation with NNs?
================================================

1. The neural-bench Benchmark collection
++++++++++++++++++++++++++++++++++++++++

   Accessible via anonymous FTP on ftp.cs.cmu.edu [128.2.206.173] in
   directory /afs/cs/project/connect/bench. In case of problems or if you
   want to donate data, email contact is "neural-bench@cs.cmu.edu". The data
   sets in this repository include the 'nettalk' data, 'two spirals',
   protein structure prediction, vowel recognition, sonar signal
   classification, and a few others. 
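
   Anonymous FTP retrieval can also be scripted. The following is a
   minimal sketch using Python's standard ftplib; the host and directory
   are the ones quoted above, while the function names are hypothetical
   and the sketch assumes the server still accepts anonymous logins.

```python
from ftplib import FTP

# Host and directory as quoted in this FAQ entry.
HOST = "ftp.cs.cmu.edu"
DIRECTORY = "/afs/cs/project/connect/bench"


def list_directory(host, directory, timeout=30):
    """Return the file names in `directory` on an anonymous FTP server."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()          # no arguments means an anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


def fetch_file(host, remote_path, local_path, timeout=30):
    """Download one file in binary mode and store it at `local_path`."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + remote_path, out.write)


# Example call (needs network access, so it is left commented out):
# print(list_directory(HOST, DIRECTORY))
```

   The same two functions would apply to the other FTP sites listed in
   this part, given their host names and paths.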

2. Proben1
++++++++++

   Proben1 is a collection of 12 learning problems consisting of real data.
   The datafiles all share a single simple common format. Along with the
   data comes a technical report describing a set of rules and conventions
   for performing and reporting benchmark tests and their results.
   Accessible via anonymous FTP on ftp.cs.cmu.edu [128.2.206.173] as
   /afs/cs/project/connect/bench/contrib/prechelt/proben1.tar.gz and also
   on ftp.ira.uka.de [129.13.10.90] as /pub/neuron/proben.tar.gz. The file
   is about 1.8 MB and unpacks into about 20 MB. 
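
   A single common format means one small reader can serve all 12
   problems. The sketch below is hypothetical: it assumes a header of
   seven "key=value" lines followed by rows of whitespace-separated
   numbers, inputs first and outputs last. That layout is not described
   in this FAQ, so check it against the technical report before relying
   on it. The sample data are made up.

```python
import io

# Assumed header keys; verify against the Proben1 technical report.
HEADER_KEYS = ("bool_in", "real_in", "bool_out", "real_out",
               "training_examples", "validation_examples", "test_examples")


def read_proben1(stream):
    """Split a Proben1-style file into header, training, validation, test."""
    header, rows = {}, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = int(value)
        else:
            rows.append([float(token) for token in line.split()])
    missing = [key for key in HEADER_KEYS if key not in header]
    if missing:
        raise ValueError("missing header fields: " + ", ".join(missing))
    n_in = header["bool_in"] + header["real_in"]
    n_train = header["training_examples"]
    n_valid = header["validation_examples"]
    # Each example becomes an (inputs, outputs) pair.
    examples = [(row[:n_in], row[n_in:]) for row in rows]
    return (header,
            examples[:n_train],
            examples[n_train:n_train + n_valid],
            examples[n_train + n_valid:])


# Made-up miniature file: 2 inputs, 2 outputs, 4 examples.
SAMPLE = """bool_in=0
real_in=2
bool_out=2
real_out=0
training_examples=2
validation_examples=1
test_examples=1
0.1 0.2 1 0
0.3 0.4 0 1
0.5 0.6 1 0
0.7 0.8 0 1
"""

header, train, valid, test = read_proben1(io.StringIO(SAMPLE))
print(len(train), len(valid), len(test))
```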

3. UCI machine learning database
++++++++++++++++++++++++++++++++

   Accessible via anonymous FTP on ics.uci.edu [128.195.1.1] in directory 
   /pub/machine-learning-databases. 

4. NIST special databases of the National Institute Of Standards
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   And Technology:
   +++++++++++++++

   Several large databases, each delivered on a CD-ROM. Here is a quick
   list. 
    o NIST Binary Images of Printed Digits, Alphas, and Text 
    o NIST Structured Forms Reference Set of Binary Images 
    o NIST Binary Images of Handwritten Segmented Characters 
    o NIST 8-bit Gray Scale Images of Fingerprint Image Groups 
    o NIST Structured Forms Reference Set 2 of Binary Images 
    o NIST Test Data 1: Binary Images of Hand-Printed Segmented Characters 
    o NIST Machine-Print Database of Gray Scale and Binary Images 
    o NIST 8-Bit Gray Scale Images of Mated Fingerprint Card Pairs 
    o NIST Supplemental Fingerprint Card Data (SFCD) for NIST Special
      Database 9 
    o NIST Binary Image Databases of Census Miniforms (MFDB) 
    o NIST Mated Fingerprint Card Pairs 2 (MFCP 2) 
    o NIST Scoring Package Release 1.0 
    o NIST FORM-BASED HANDPRINT RECOGNITION SYSTEM 
   Here are example descriptions of two of these databases: 

   NIST special database 2: Structured Forms Reference Set (SFRS)
   --------------------------------------------------------------

   The NIST database of structured forms contains 5,590 full page images of
   simulated tax forms completed using machine print. THERE IS NO REAL TAX
   DATA IN THIS DATABASE. The structured forms used in this database are 12
   different forms from the 1988 IRS 1040 Package X. These include Forms
   1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F
   and SE. Eight of these forms contain two pages (form faces), making a
   total of 20 form faces represented in the database. Each image is stored
   in bi-level black and white raster format. The images in this database
   appear to be real forms prepared by individuals but the images have been
   automatically derived and synthesized using a computer and contain no
   "real" tax data. The entry field values on the forms have been
   automatically generated by a computer in order to make the data available
   without the danger of distributing privileged tax information. In
   addition to the images the database includes 5,590 answer files, one for
   each image. Each answer file contains an ASCII representation of the data
   found in the entry fields on the corresponding image. Image format
   documentation and example software are also provided. The uncompressed
   database totals approximately 5.9 gigabytes of data. 
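
   The figures in this description can be cross-checked with a few
   lines of arithmetic. This is only a sketch: it assumes "gigabytes"
   means 10**9 bytes and that the images are of roughly equal size.

```python
# Forms and schedules as listed in the description above.
forms = ["1040", "2106", "2441", "4562", "6251"]
schedules = ["A", "B", "C", "D", "E", "F", "SE"]

n_forms = len(forms) + len(schedules)     # distinct forms
two_page_forms = 8                        # forms with a second face
form_faces = n_forms + two_page_forms     # one face each, plus the extras

images = 5590
total_bytes = 5.9e9                       # assumption: 1 GB = 10**9 bytes
avg_mb_per_image = total_bytes / images / 1e6

print(n_forms, form_faces, round(avg_mb_per_image, 2))
```

   Under these assumptions the 12 forms and 20 form faces agree with
   the text, and each image averages a little over one megabyte.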

   NIST special database 3: Binary Images of Handwritten Segmented
   ---------------------------------------------------------------
   Characters (HWSC)
   -----------------

   Contains 313,389 isolated character images segmented from the 2,100
   full-page images distributed with "NIST Special Database 1": 223,125
   digits, 44,951 upper-case, and 45,313 lower-case character images. Each
   character image has been centered in a separate 128 by 128 pixel region;
   the error rate of the segmentation and assigned classification is less
   than 0.1%. The uncompressed database totals approximately 2.75 gigabytes
   of image data and includes image format documentation and example
   software.
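
   The class counts above can be checked against the quoted total, and
   they show how strongly digits dominate the set. A small sketch (the
   variable names are illustrative only):

```python
# Counts quoted for NIST Special Database 3.
counts = {"digits": 223_125, "upper-case": 44_951, "lower-case": 45_313}
quoted_total = 313_389
pages = 2_100

total = sum(counts.values())
shares = {name: count / total for name, count in counts.items()}
chars_per_page = total / pages

print("sum matches quoted total:", total == quoted_total)
for name, share in shares.items():
    print(f"{name:>10}: {share:.1%}")
print(f"characters per page: {chars_per_page:.1f}")
```

   Roughly seven in ten images are digits, which is worth keeping in
   mind when splitting the data for a letter recognition task.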

   The system requirements for all databases are a 5.25" CD-ROM drive with
   software to read ISO-9660 format. Contact: Darrin L. Dimmick;
   dld@magi.ncsl.nist.gov; (301)975-4147

   The prices of the databases are between US$ 250 and US$ 1895. If you wish to
   order a database, please contact: Standard Reference Data; National
   Institute of Standards and Technology; 221/A323; Gaithersburg, MD 20899;
   Phone: (301)975-2208; FAX: (301)926-0416

   Samples of the data can be found by ftp on sequoyah.ncsl.nist.gov in
   directory /pub/data. A more complete description of the available
   databases can be obtained from the same host as 
   /pub/databases/catalog.txt 

5. CEDAR CD-ROM 1: Database of Handwritten Cities, States,
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   ZIP Codes, Digits, and Alphabetic Characters
   ++++++++++++++++++++++++++++++++++++++++++++

   The Center of Excellence for Document Analysis and Recognition (CEDAR),
   State University of New York at Buffalo, announces the availability of
   CEDAR CD-ROM 1: USPS Office of Advanced Technology. The database contains
   handwritten words and ZIP Codes in high resolution grayscale (300 ppi
   8-bit) as well as binary handwritten digits and alphabetic characters
   (300 ppi 1-bit). This database is intended to encourage research in
   off-line handwriting recognition by providing access to handwriting
   samples digitized from envelopes in a working post office. 

        Specifications of the database include:
        +    300 ppi 8-bit grayscale handwritten words (cities,
             states, ZIP Codes)
             o    5632 city words
             o    4938 state words
             o    9454 ZIP Codes
        +    300 ppi binary handwritten characters and digits:
             o    27,837 mixed alphas and numerics segmented
                  from address blocks
             o    21,179 digits segmented from ZIP Codes
        +    every image supplied with a manually determined
             truth value
        +    extracted from live mail in a working U.S. Post
             Office
        +    word images in the test set supplied with
             dictionaries of postal words that simulate partial
             recognition of the corresponding ZIP Code.
        +    digit images included in the test set that simulate
             automatic ZIP Code segmentation. Results on these
             data can be projected to overall ZIP Code
             recognition performance.
        +    image format documentation and software included

   System requirements are a 5.25" CD-ROM drive with software to read
   ISO-9660 format. For any further information, including how to order the
   database, please contact: Jonathan J. Hull, Associate Director, CEDAR,
   226 Bell Hall, State University of New York at Buffalo, Buffalo, NY
   14260; hull@cs.buffalo.edu (email) 

6. AI-CD-ROM (see question 'Other sources of information')
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

7. Time series archive
++++++++++++++++++++++

   Various datasets of time series (to be used for prediction learning
   problems) are available by anonymous ftp from ftp.santafe.edu
   [192.12.12.1] in /pub/Time-Series. Examples include: fluctuations in a
   far-infrared laser; physiological data of patients with sleep apnea;
   high-frequency currency exchange rate data; intensity of a white dwarf
   star; J.S. Bach's final (unfinished) fugue from "Die Kunst der Fuge".

   Some of the datasets were used in a prediction contest and are described
   in detail in the book "Time series prediction: Forecasting the future and
   understanding the past", edited by Weigend/Gershenfeld, Proceedings
   Volume XV in the Santa Fe Institute Studies in the Sciences of Complexity
   series, published by Addison-Wesley (1994). 

8. USENIX Faces
+++++++++++++++

   The USENIX faces archive is a public database, accessible by ftp, that
   can be of use to people working in the fields of human face recognition,
   classification and the like. It currently contains 5592 different faces
   (taken at USENIX conferences) and is updated twice each year. The images
   are mostly 96x128 greyscale frontal images and are stored in ASCII files
   in a way that makes it easy to convert them to any common graphics format
   (GIF, PCX, PBM etc.). Source code for viewers, filters, etc. is provided.
   Each image file takes approximately 25K. 
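
   The quoted size of about 25K per file is what one would expect for a
   96x128 image if each 8-bit pixel is written as two hexadecimal
   characters. That encoding is an assumption made here for
   illustration; the actual file layout is documented in the archive
   itself. Under the same assumption, turning the pixel text into a
   binary PGM image takes only a few lines (hex_to_pgm is a
   hypothetical helper, not part of the archive's software):

```python
WIDTH, HEIGHT = 96, 128
HEX_CHARS_PER_PIXEL = 2          # assumption: one byte as two hex digits

pixels = WIDTH * HEIGHT
ascii_bytes = pixels * HEX_CHARS_PER_PIXEL
print(pixels, ascii_bytes)       # pixel count and estimated file size


def hex_to_pgm(hex_text, width, height):
    """Convert whitespace-separated hex pixel text to binary PGM bytes."""
    data = bytes.fromhex("".join(hex_text.split()))
    if len(data) != width * height:
        raise ValueError("expected %d pixels, got %d"
                         % (width * height, len(data)))
    header = ("P5\n%d %d\n255\n" % (width, height)).encode("ascii")
    return header + data


# Tiny made-up 2x2 image to show the call.
tiny = hex_to_pgm("00 ff\n80 40", 2, 2)
```

   The estimate comes to 24576 bytes of pixel text, close to the 25K
   quoted once a short header is added.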

   For further information, see 
   ftp://src.doc.ic.ac.uk/pub/packages/faces/README. Do NOT do a directory
   listing in the top directory of the face archive, as it contains over
   2500 entries! 

   According to the archive administrator, Barbara L. Dijker
   (barb.dijker@labyrinth.com), there are no restrictions on their use.
   However, the image files are stored in separate directories corresponding
   to the Internet site to which the person represented in the image
   belongs, with each directory containing a small number of images (two on
   average). This makes it difficult to retrieve even a small part of the
   database by ftp, as you have to get each file individually.
   A solution, as Barbara suggested to me, would be to compress the whole
   set of images (in separate files of, say, 100 images) and maintain them
   as a specific archive for research on face processing, similar to the
   ones that already exist for fingerprints and others. The whole compressed
   database would take some 30 megabytes of disk space. I encourage anyone
   willing to host this database at his/her site, available for anonymous
   ftp, to contact her for details (unfortunately I don't have the resources
   to set up such a site). 

   Please consider that UUNET has graciously provided the ftp server for the
   FaceSaver archive and may discontinue that service if it becomes a
   burden. This means that people should not download more than maybe 10
   faces at a time from uunet. 

   A last remark: each file represents a different person (except for
   isolated cases). This makes the database quite unsuitable for training
   neural networks, since for proper generalisation several instances of the
   same subject are required. However, it is still useful as a test set
   for a trained network. 

   ------------------------------------------------------------------------

   Next part is part 5 (of 7). Previous part is part 3. 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
