15-853: Algorithms in the Real World (Guy Blelloch)

up previous
Topic 9: Indexing and Searching

[ Topics | Scribe Notes | Readings | Text Books | Links ]


Topic Outline

  • Introduction
  • Types of queries (logical, proximity, keyword sets, wildcards, edit-distance)
  • General techniques (case folding, stemming, stop words)
  • Inverted indices
  • Compressing posting lists
  • Lexicon access
  • Sorted list, front coding, B-trees, tries, perfect hashing
  • Wildcards using n-grams or rotated lexicon
  • Merging posting lists for queries
  • Signature files
  • Generating term and document signatures
  • Boolean logic on signatures
  • Efficient queries on signatures
  • Vector space models
  • Selecting weights (frequency and "information content")
  • Similarity measures (dot-product, cosine)
  • Efficient queries using augmented inverted indices
  • Used with relevance feedback
  • Clustering
  • Latent semantic indexing
  • The Singular value decomposition (SVD) and properties
  • Indexing by taking first k columns/rows of the SVD
  • Applications and performance
  • Link-Based analysis
  • Hubs and Authorities: HITS and Clever
  • Page Ranking: Google
  • Scribe Notes

  • Indexing 1 (draft) (Helen Wang)
  • Indexing 2 (draft) (Ben Horowitz)
  • Evolutionary Trees and Indexing 3 (draft) (Amar Chaudhary)
  • Readings

  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.
  • Chapter 3: Indexing
  • Text Books

  • Christos Faloutsos. Searching Multimedia Data Bases by Content. Kluwer Academic, 1996.
  • Frakes and Baeza-Yates (ed.). Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992
  • Frants V. J., Shapiro J., Voiskunskii V. G. Automated Information Retrieval Theory and Methods Academic Press, Aug 1997.
  • Robert R. Korfhage Information Storage and Retrieval. John Wiley & Sons, 1997.
  • Lesk M. Practical Digital Libraries Books, Bytes & Bucks Morgan Kaufman Publishers, 1997.
  • Salton. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
  • Sparck Jones K. and Willett P. (editors). Readings in Information Retrieval. Morgan Kaufman Publishers, 1997.
  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Academic Press/Morgan Kaufmann, 1999.
  • Further Readings and Links

  • Summaries
  • A nice tutorial by Andrei Broder and Monika Henzinger on Information Retrieval on the Web.
  • Information on Web search engines
  • Lycos: Design choices in an Internet search service
  • Search Engine Watch. Information on the various search engines, including size and features and some information on how they work.
  • Latent Semantic Indexing
  • Latent Semantic Indexing (LSI)
  • Latent semantic indexing: A probabilistic analysis by Papadimitriou et. al. Analyzes an information retrieval technique related to principle components analysis. In PODS 98.
  • Link Based Analysis for Indexing
  • Authoritative sources in a hyperlinked environment by J. Kleinberg. Overviews the ideas behind the HITS method for finding Hubs and Authorities.
  • Automatic resource compilation by analyzing hyperlink structure and associated text by Chakrabarti et. al. Describes how HITS can be used to find pages within given clasifications to fill the topics pages of a Yahoo like system automatically.
  • The Clever project at IBM Almaden.
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page. This describes the Google search engine, at least the original version from Stanford. It briefly describes the page ranking scheme.
  • The Connectivity Server: fast access to linkage information on the Web by Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumara, and Suresh Venkatasubramanian.
  • Digital Libraries
  • The digital libraries initiative. Includes projects at Carnegie Mellon U., Stanford U., UC Berkeley, UC Santa Barbara, U. Illinois, U. Michigan.
  • Digital Library Related Information and Resources
  • Systems for Indexing/Searching your own site
  • A long list of indexing/search engines that can be used to search your own site.
  • Web Site search engines. A nice overview of the various search engines available for indexing your own site
  • Glimpse
  • Other relevant info
  • The Text Retrieval Conference (TREC). "The goal of the conference series is to encourage research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results." Sponsored by NIST and DARPA.

  • Back to the Algorithms in the Real World home page.
    Guy Blelloch, guyb@cs.cmu.edu.