11-741: Information Retrieval

Welcome to the IR course for CMU graduate-level students!

General Information

  • Time: TR 12:00-1:20pm
  • Location: NSH 3002
  • Instructors:
  • Textbooks: Modern Information Retrieval (MIR) (errata), Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 2000. Course notes and papers will also be distributed in lecture or made available online (/afs/cs.cmu.edu/academic/class/11741-s01/hand). In addition, selected papers and chapters from the following books (on reserve in the Engineering and Science Library, WEH 4th floor) will be used as reading material:
    • Foundations of Statistical Natural Language Processing (SNLP), C. Manning and H. Schütze. (1999) MIT Press.
    • Managing Gigabytes (2nd Ed.) Ian H. Witten, Alistair Moffat and Timothy C. Bell. (1999) Morgan Kaufmann, San Francisco, California.
    • Information Retrieval: Data Structures & Algorithms, Frakes, W.B. and Baeza-Yates, R.B. (1992) Prentice Hall, Englewood Cliffs, New Jersey (currently not available).
    • Information Retrieval, C.J. van Rijsbergen. (1979) Available on-line.
    • Pattern Classification and Scene Analysis, Duda & Hart. (1973) Wiley-Interscience, New York.
    • Linear Algebra and Its Applications (3rd Ed.) Gilbert Strang. (1988) Harcourt Brace & Company, Orlando, Florida.
  • Course Description: This course studies the theory, design, and implementation of text-based information systems. The Information Retrieval core components of the course include statistical characteristics of text, representation of information needs and documents, several important retrieval models (Boolean, vector space, probabilistic, inference net, language modeling), clustering algorithms, automatic text categorization, and experimental evaluation. The software architecture components include design and implementation of high-capacity text retrieval and text filtering systems. A variety of current research topics are also covered, including cross-lingual retrieval, document summarization, machine learning, topic detection and tracking, and multi-media retrieval.

    Prerequisites:

    • Programming and data structures at the level of 15-212 or higher.
    • Algorithms comparable to the undergraduate CS algorithms course (15-451) or higher.
    • Basic linear algebra (21-241 or 21-341) and basic statistics (36-202) or higher.

    Homework (anticipated): 4 programming assignments, 1 problem set, and a weekly reading summary.

    Grading: Grades will be based on 4 programming assignments (10% each, 40% total), 1 problem set (10%), weekly summaries of pre-lecture readings (10% total), a midterm exam (20%), and a final exam (20%).

    Workload: Moderate; an estimated 12 hours/week.

    Teaching Assistant: Paul Ogilvie (pto@cs.cmu.edu)

    • Office hours: Tu/Th 2:30-3:30 in NSH 3612A

    Sit-in: Sitting in requires the instructors' approval. Please email them to apply.

Syllabus (anticipated)

| Day | Important Events | Subject | Readings | Lecturer |
|---|---|---|---|---|
| lec 1. 1/15 | | Course policies; IR overview. Notes: .ps.gz, pdf, big pdf | MIR 1.0 - 1.4 | Callan |
| | | Text Retrieval | | |
| lec 2. 1/17 | | Task definition & evaluation; characteristics of text. Notes: .ps.gz, pdf, big pdf | MIR 2.0 - 2.3, 3.0 - 3.4 | Callan |
| lec 3. 1/22 | | Crawling the Web. Notes: .ps.gz, pdf, big pdf | | Callan |
| lec 4. 1/24 | | Indexing techniques, Part 1. Notes: .ps.gz, pdf, big pdf | MIR 7.0 - 7.3, 6.0 - 6.3 | Callan |
| lec 5. 1/29 | HW1 out | Indexing techniques, Part 2. Notes: .ps.gz, pdf, big pdf | MIR 4.0 - 4.6 | Callan |
| lec 6. 1/31 | | Data structures & algorithms. Notes: .ps.gz, pdf, big pdf | MIR 8.0 - 8.4, 8.6, 8.7 | Callan |
| lec 7. 2/5 | | Retrieval models: Boolean. Notes: .ps.gz, pdf, big pdf | MIR 2.4 - 2.5 | Callan |
| lec 8. 2/7 | | Retrieval models: vector space. Notes: .ps.gz, pdf, big pdf | | Callan |
| lec 9. 2/12 | HW1 due | Retrieval models: probabilistic, inference net. Notes: .ps.gz, pdf, big pdf | van Rijsbergen, chapter 6; Turtle, SIGIR90 | Callan |
| lec 10. 2/14 | | Retrieval models: language models. Notes: .ps.gz, pdf, big pdf | Ponte, SIGIR98; Miller, SIGIR99 | Callan |
| lec 11. 2/19 | HW2 out | Distributed IR. Notes: .ps.gz, pdf, big pdf | MIR 9.0 - 9.4; Callan, CIIR00 | Callan |
| lec 12. 2/21 | | Relevance feedback and query expansion. Notes: .ps.gz, pdf, big pdf | MIR 5.0 - 5.5; Xu, SIGIR'96 | Callan |
| lec 13. 2/26 | | Text summarization, Visualization. Summarization Notes: .ps.gz, pdf, big pdf. Visualization Notes: .ps.gz, pdf, big pdf | Goldstein, SIGIR99; Jing, SIGIR99. Reading summaries for lecture 12 are due this week, too. | Callan |
| lec 14. 2/28 | | Implementation Issues. Notes: .ps.gz, pdf, big pdf | | Callan |
| lec 15. 3/5 | HW2 due, HW3 out | Cross-Language IR, Part 1, Note (ps) | Yang, AIJ'98; Dumais, SIGIR-96 CLIR Workshop | Yang |
| 3/7 | Mid Semester Break | | | |
| 3/12 | Midterm Exam | Study Guide: Last year's midterm; This year's midterm (and answers) | | |
| lec 16. 3/14 | | Cross-Language IR, Part 2, Note (ps.gz) | Nie, SIGIR-99; Franz, TREC-7 | Yang |
| | | Navigation & Interfaces | | |
| lec 17. 3/19 | | Hypertext retrieval, Note (pdf) | Kleinberg, ACM-SIAM'98; Ng et al., IJCAI'01 | Yang |
| | | Text Classification | | |
| lec 18. 3/21 | HW3 due, HW4 out | Nearest Neighbor, Note (pdf.gz) | Duda & Hart, pp 95-105 (reserved); Yang IRJ'99 | Yang |
| lec 19. 3/26 | | Naive Bayes Methods, Note (pdf.gz) | McCallum, ICML'98; McCallum, AAAI'98 TC workshop | Yang |
| lec 20. 3/28 | | Support Vector Machines, Note (pdf.gz) | Joachims ECML'98; Dumais & Chen, SIGIR'00 | Yang |
| 4/2 | Spring Break! | | | |
| 4/4 | Spring Break! | | | |
| lec 21. 4/9 | | Feature Selection, Note (pdf.gz) | Yang ICML'97; Baker SIGIR'98 | Yang |
| lec 22. 4/11 | HW4 due; HW5 out | Significance tests, Note (ps.gz) | Hull, SIGIR'93; Yang SIGIR'99 | Yang |
| | | Datamining & Automatic Discovery | | |
| lec 23. 4/16 | | Document Clustering, Part 1 | SNLP: 14.0 - 14.2.1 (reserved) | Yang |
| lec 24. 4/18 | | Document Clustering, Part 2 | Hofmann, IJCAI'99; Hofmann, SIGIR'99 | Yang |
| lec 25. 4/23 | | Information Extraction, Part 1 | Cardie, AIM'97; Riloff, WVLC'98 | Yang |
| lec 26. 4/25 | HW5 due | Information Extraction, Part 2 | Bikel, CMP-IL'98; Seymore, AAAI'99 workshop | Yang |
| lec 27. 4/30 | | Question Answering | | Czuba |
| lec 28. 5/2 | | Topic Detection and Tracking | Yang IEEE'99 | Yang |
| 5/6 | Final Exam: Location: TBD, Time: TBD | spr00 final (ps) | | |


Yiming Yang (yiming@cs.cmu.edu)