CS294-3: Algorithms in the Real World (Guy Blelloch, Fall 97)

next up previous
Topic 9: Indexing and Searching

[ Topics | Scribe Notes | Readings | Text Books | Links ]


Topic Outline

  • Introduction
  • Types of queries (logical, proximity, keyword sets, wildcards, edit-distance)
  • General techniques (case folding, stemming, stop words)
  • Inverted indices
  • Compressing posting lists
  • Lexicon access
  • Sorted list, front coding, B-trees, tries, perfect hashing
  • Wildcards using n-grams or rotated lexicon
  • Merging posting lists for queries
  • Signature files
  • Generating term and document signatures
  • Boolean logic on signatures
  • Efficient queries on signatures
  • Vector space models
  • Selecting weights (frequency and "information content")
  • Similarity measures (dot-product, cosine)
  • Efficient queries using augmented inverted indices
  • Used with relevance feedback
  • Clustering
  • Latent semantic indexing
  • The Singular value decomposition (SVD) and properties
  • Indexing by taking first k columns/rows of the SVD
  • Applications and performance
  • Scribe Notes

  • Indexing 1 (draft) (Helen Wang)
  • Indexing 2 (draft) (Ben Horowitz)
  • Evolutionary Trees and Indexing 3 (draft) (Amar Chaudhary)
  • Readings

  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.
  • Chapter 3: Indexing
  • Text Books

  • Christos Faloutsos. Searching Multimedia Data Bases by Content. Kluwer Academic, 1996.
  • Frakes and Baeza-Yates (ed.). Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992
  • Frants V. J., Shapiro J., Voiskunskii V. G. Automated Information Retrieval Theory and Methods Academic Press, Aug 1997.
  • Lesk M. Practical Digital Libraries Books, Bytes & Bucks Morgan Kaufman Publishers, 1997.
  • Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
  • Sparck Jones K. and Willett P. (editors). Readings in Information Retrieval. Morgan Kaufman Publishers, 1997.
  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.
  • Further Readings and Links

  • Digital Libraries
  • The digital libraries initiative. Includes projects at Carnegie Mellon U., Stanford U., UC Berkeley, UC Santa Barbara, U. Illinois, U. Michigan.
  • Digital Library Related Information and Resources
  • Systems for Indexing/Searching your own site
  • A long list of indexing/search engines that can be used to search your own site.
  • Web Site search engines. A nice overview of the various search engines available for indexing your own site
  • Glimpse
  • Information on Web search engines
  • Lycos: Design choices in an Internet search service
  • Other relevant info
  • The Text Retrieval Conference (TREC). "The goal of the conference series is to encourage research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results." Sponsored by NIST and DARPA.
  • Latent Semantic Indexing (LSI)
  • J. Kleinberg. Authoritative sources in a hyperlinked environment. This is the paper that David Wagner discussed in class.

  • Back to the Algorithms in the Real World home page.
    Guy Blelloch, guyb@cs.berkeley.edu.