Le Zhao

PhD (Graduated 2012)
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Advisor: Jamie Callan and our group

   lezhao@ cs. cmu. edu
Lives in Foster City, CA

                           My Google Scholar Page


What's New

  • Sept 4, 2012, started as Software Engineer at Google, Mountain View, working on Web search relevance.
  • Aug 27, 2012, dissertation final version approved.
  • Apr 16, 2012, my job talk video, thanks to Microsoft Research.


I'm generally interested in computer-facilitated human problem solving. My current research interests are in information retrieval: using computational modeling and techniques to help users with their search, information processing, and decision making. I drive my research with a deep understanding of every aspect of the search process, drawing insights from retrieval theory as well as areas such as user behavior and natural language understanding. I verify the insights with data analyses, sometimes at large scale, and implement them as statistical inference or structured retrieval models.

My thesis research is the first to quantitatively study the vocabulary mismatch problem in retrieval. It leads to effective ways of predicting whether a query term is likely to mismatch relevant documents, and to a number of principled interventions that use the mismatch predictions to significantly improve retrieval.

Another topic of interest is structured retrieval, enabled by advanced query languages and diverse document structure. We develop and apply the structured retrieval capabilities of the Indri search engine of the Lemur project to problems ranging from ad hoc retrieval, relevance feedback, and pseudo-relevance feedback to applications such as question answering, intelligent tutoring, XML retrieval, and information extraction. I also work on legal search and bio/medical/chemical patent search, which are structure-heavy in their own ways.

During my time at CMU, I also worked on other cool research projects. I identified areas of the Read The Web knowledge base that needed better coverage and ran focused crawls of the Web to fill those gaps. I worked on crawl seeding, PageRank prioritization, and language identification for the Hadoop-based ClueWeb09 billion-page crawl. And almost every summer, I would do a TREC retrieval evaluation task.

Selected Publications (full list here)

  • WikiQuery, a wiki for high quality queries. code.
  • Wei Xu, Raphael Hoffmann, Le Zhao and Ralph Grishman. "Filling knowledge base gaps for distant supervision of relation extraction". ACL 2013.
  • Le Zhao. "Modeling and solving term mismatch for full-text retrieval". PhD Dissertation. Carnegie Mellon University. 2012. document (pdf html) video slides.
  • Le Zhao, Xiaozhong Liu and Jamie Callan. "WikiQuery -- An interactive collaboration interface for creating, storing and sharing effective CNF queries". SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 16, 2012. paper slides code.
  • Le Zhao and Jamie Callan. "Automatic term mismatch diagnosis for selective query expansion". SIGIR 2012. paper slides.
  • Le Zhao. "Modeling and predicting term mismatch for full-text retrieval". PhD Thesis Proposal. Carnegie Mellon University. 2011. proposal slides.
  • Le Zhao and Jamie Callan. "Term necessity prediction". In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010). Toronto, Canada. paper slides.
  • Le Zhao and Jamie Callan. "Effective and efficient structured retrieval" (poster description). In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009). Hong Kong. paper poster.
  • Le Zhao and Jamie Callan. "A generative retrieval model for structured documents". In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008). Napa Valley, USA. paper slides.
  • Yangbo Zhu, Le Zhao, Jamie Callan and Jaime Carbonell. "Structured queries for legal search". TREC 2007. November 2007. paper TREC poster.
  • Le Zhao, Min Zhang and Shaoping Ma. "The nature of novelty detection". Journal of Information Retrieval, 9(5): pp. 521-541. November 2006. paper.


  • April, 2012, Modeling and Solving Term Mismatch for Full-text Retrieval. Microsoft Research. video.
  • March, 2012, Modeling and Solving Term Mismatch for Full-text Retrieval. University of California, Santa Cruz.
  • Feb 15th, 2010, Tutorial on MapReduce (Hadoop) and Large Scale Processing. slides.
  • Fall, 2009, Teaching Assistant (off-cycle) for 11-741 Information Retrieval, creating homework assignments for MapReduce-based indexing and retrieval. Prepared slides and documentation to quickly get the students started with Hadoop. We worked with the Open Cloud group, which provided and maintained the computer cluster.
  • Spring 2009, Teaching Assistant for 11-741 Information Retrieval, by Jamie Callan and Yiming Yang. Gave a lecture on probabilistic retrieval models, graded homework and exams, and handled Q&A.
  • Spring 2004, Teaching Assistant for Data Structure (CS 0024-0074), by Prof. Junhui Deng.
  • Fall 2003, Teaching Assistant for Artificial Intelligence, by Prof. Shaoping Ma.


  • Sep. 2012 - present, Software Engineer at Google Inc. (on Web search relevance).
  • Jun. 2010 - Sep. 2010, Research Intern at Microsoft Research Redmond, ISRC group on crawling and indexing selection. Mentors: Chao Liu @MSR, Xiaodong Fan and Yan Ke @Bing.
  • Mar. 2006 - Jun. 2006, Software Intern at Sogou.com, a Chinese search engine company, worked to improve the relevance ranking of web documents.

Courses and General Interests

  • Query structure (aspects) and Document structure (annotations, parse trees, etc.), Structured retrieval models, XML Retrieval, Search Engine Indexing.
  • Text Retrieval (Sentence level novelty detection, probabilistic models, language modeling, formalizing the notion of Relevance) and Web Search IR 11-741, Advanced IR seminar 11-743.
  • Natural Language Processing (syntactic and semantic theories of natural language understanding, statistical or rule-based): Algorithms in NLP 11-711, Language and Statistics I 11-761, Language and Statistics II 11-762, Grammar Formalisms 11-722 (There are not many courses about how to get the semantics, e.g. event-patient-agent, out of natural language texts, and this is a great basic course on that.)
  • Data Mining & Database: Multimedia Databases and Data Mining 15-826 (In the real world: many bursty distributions, power laws, fractals... ideas about graph mining and analyzing real-world data.)
  • Machine Learning (with its relation to language and intelligence, mostly applying/devising ML tools for NLP): Advanced ML seminar 11-745 (statistical analysis, problems, methods, bias-variance, etc.), Graphical Models 10-708 (modeling random variables and the relations among them in a graph, and solving inference problems efficiently).

Useful Resources

  • CMU SCS Thesis Latex template.
  • Topical Words: Top 100 popular words in low grade level ranges (5-8 in K-12), and popular words in topics such as Arts, Business, Computer, Health, Science, Society, Sports, Music, MovieAndTheater, Biology, Fitness, Religion, Politics, LawAndCrime, History etc. Starts from 3rd column. (Created around Jan. 2008).
  • Kid sites list: This is a long list of about 1,800 websites that are of low reading difficulty level. Not every one of them is very good, but there are many interesting ones. The first number is a popularity score, indicating whether the site has many low-difficulty pages. Boys and girls, enjoy! (Created around Jan. 2008).

Other Interests

  • Logic and Linguistics (anything related to human intelligence, and computerizing it).
  • Philosophy, Psychology, Buddhism, Aesthetics, Abstract theories of human intellects.
  • Literature (reading), Classical Music (Beethoven, Chopin, Mozart), Movies (esp. good documentaries).
  • Volleyball (LTI won Championship in 2007, 2008 and 2009!), Badminton, Swimming, Skiing.
  • Cuisine (keep improving), Houseology (figuring out ways to keep housework simple while keeping the house neat; I once came across this term on the web and borrowed it here).
  • I am nerdier than 92% of all people.


Last Update: 2015-05-25