Using Knowledge Resources
To Improve Information Retrieval

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Project Overview

Current search engines understand how humans use language, but they do not understand the language itself. To find documents that might satisfy a query, they match the words in the query to the words in a document and to words that are linked to the document (e.g., 'Click here to get the employee handbook'). They then rank these potential matches using statistical methods and the behavior of other people who searched for similar information. Although current technology works well most of the time, it sometimes fails badly because the search engine does not really understand the meanings of the documents that it ranks.

Recently, companies, research organizations, and volunteer communities have begun to create large knowledge graphs that describe important, essential, or well-known information. Knowledge graphs are similar in spirit to Wikipedia, but they are designed to be used by computers instead of humans. For example, a knowledge graph might contain the entities Cleveland Cavaliers and LeBron James, and these two entities might be connected by an employs relationship. Information can be entered by people with moderate expertise and by machine learning software, so it is practical to build large knowledge graphs that cover a wide range of human knowledge. Freebase, which is now owned by Google, is a well-known knowledge graph that contains 2.5 billion 'facts' about 44 million 'topics' and is growing rapidly. Currently, knowledge graphs are used for just a few well-defined tasks, for example, to produce the info boxes that Google displays next to some search results. New methods of using knowledge graphs for more varied tasks are of significant scientific and commercial interest.

This project develops new methods of using knowledge graphs to improve the accuracy of search engines, especially for vague, ambiguous, or poorly specified queries. The search engine uses the knowledge graph to identify the probable meanings of query terms, and then uses this knowledge to improve its ability to identify documents that match those meanings. The project is of practical significance for its potential to improve search engine accuracy on queries that are currently difficult. It is of scientific significance for its potential to inject greater understanding of meaning and relationships into search engines. The project is of educational significance because it provides opportunities for graduate students to do class projects and independent studies that lead to participation in the National Institute of Standards and Technology's (NIST's) annual TREC conference, a semi-competitive event that attracts some of the best research groups from around the world.
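The idea of using a knowledge graph to identify the probable meanings of query terms can be illustrated with a minimal Python sketch. It is not the project's actual method. Only the Cleveland Cavaliers / LeBron James 'employs' triple comes from the description above; the other triples, the alias table, and the voting scheme in `probable_meanings` are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

# Toy knowledge graph stored as (subject, relation, object) triples.
# Only the first triple comes from the text; the rest are assumptions.
TRIPLES = [
    ("Cleveland Cavaliers", "employs", "LeBron James"),
    ("Cleveland Cavaliers", "plays", "basketball"),
    ("Cleveland Browns", "plays", "football"),
]

# Hypothetical alias table: query term -> entities the term might name.
ALIASES = {
    "cavaliers": ["Cleveland Cavaliers"],
    "cleveland": ["Cleveland Cavaliers", "Cleveland Browns"],
    "lebron": ["LeBron James"],
}


def neighbors(entity):
    """Return every entity directly connected to `entity` by some triple."""
    linked = set()
    for subject, _relation, obj in TRIPLES:
        if subject == entity:
            linked.add(obj)
        elif obj == entity:
            linked.add(subject)
    return linked


def probable_meanings(query):
    """Score candidate entities for a query.

    An entity gets one vote for each query term that may name it and one
    vote for each query term that may name one of its graph neighbors.
    """
    terms = query.lower().split()
    candidates = {e for term in terms for e in ALIASES.get(term, [])}
    scores = Counter()
    for entity in candidates:
        linked = neighbors(entity)
        for term in terms:
            named = set(ALIASES.get(term, []))
            if entity in named:
                scores[entity] += 1
            if named & linked:
                scores[entity] += 1
    return scores.most_common()


if __name__ == "__main__":
    for entity, votes in probable_meanings("cleveland lebron"):
        print(entity, votes)
```

For the query "cleveland lebron", the Cavaliers reading of the ambiguous term "cleveland" receives two votes and the Browns reading receives one, because only the Cavaliers entity is linked to an entity named by another query term. A search engine could then favor documents that match the higher-scoring meaning.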

Knowledge graphs are less structured than typical relational databases and semantic web resources but more structured than the text stored in full-text search engines. The weak semantics used in these semi-structured information resources is sufficient to support interesting applications, and it can also accommodate contradictions, inconsistencies, and mistakes, which makes the resources easier to scale to large amounts of information. The typical use of a semi-structured resource treats it like a structured resource that has somewhat restricted functionality. The application must understand the semantics associated with each type of entity, attribute, and relation that it uses. Although this approach is effective, the need to understand the semantics of entity types and relation types limits the application's ability to automatically incorporate new types of information as the resource evolves and grows.

This project develops new methods of using semi-structured information resources that make fewer assumptions about the structure and semantics of the resource, enabling these methods to make full use of the resource as it grows and evolves. The resource is treated as a network of entities and relations that are each described by a 'bag of words' description. Entities and relations are retrieved using extensions of full-text retrieval methods. Evidence such as estimates of authority or related language models can be associated with entity and relation types, and propagated along specific network links to improve entity and relation models. This project applies this general architecture to make several improvements in the accuracy of a full-text search engine, for example, providing an alternative method of answering entity-attribute queries and a more stable and effective method of query expansion.
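The architecture described above (entities with 'bag of words' descriptions, full-text retrieval over those descriptions, and evidence propagated along links) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the project's implementation: the entity descriptions, the single link, the choice of Jelinek-Mercer smoothed query likelihood as the retrieval method, and the mixing weight `alpha` are all hypothetical. For brevity it models entities only, although the description above also gives relations their own bags of words.

```python
from collections import Counter

# Hypothetical bag-of-words descriptions for three entities.
NODES = {
    "Cleveland Cavaliers": "professional basketball team in cleveland ohio",
    "LeBron James": "basketball player forward born in akron ohio",
    "Cleveland Browns": "professional football team in cleveland ohio",
}

# Hypothetical undirected links between entities (e.g., an 'employs' relation).
LINKS = [("Cleveland Cavaliers", "LeBron James")]


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


BAGS = {name: Counter(tokenize(text)) for name, text in NODES.items()}
COLLECTION = sum(BAGS.values(), Counter())  # term counts over all descriptions


def score(query, bag, lam=0.5):
    """Query likelihood with Jelinek-Mercer smoothing (an assumed choice)."""
    bag_total = sum(bag.values())
    collection_total = sum(COLLECTION.values())
    probability = 1.0
    for term in tokenize(query):
        p_entity = bag[term] / bag_total
        p_collection = COLLECTION[term] / collection_total
        probability *= (1 - lam) * p_entity + lam * p_collection
    return probability


def retrieve(query, alpha=0.3):
    """Rank entities, mixing each score with the mean score of its neighbors."""
    own = {name: score(query, bag) for name, bag in BAGS.items()}
    final = {}
    for name in own:
        linked = [b if a == name else a for a, b in LINKS if name in (a, b)]
        if linked:
            neighbor_mean = sum(own[m] for m in linked) / len(linked)
            final[name] = (1 - alpha) * own[name] + alpha * neighbor_mean
        else:
            final[name] = own[name]  # isolated entity: nothing to propagate
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value in retrieve("basketball player"):
        print(f"{value:.5f}  {name}")
```

In this toy example, the query "basketball player" matches the LeBron James description best, and the Cleveland Cavaliers entity gains score from its link to that entity. Because the code never inspects entity or relation types, new kinds of entities could be added to `NODES` without changing it, which is the property the paragraph above emphasizes.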

 

Project Personnel

Jamie Callan, Principal Investigator
Chenyan Xiong, Research Assistant

 

Dissemination of Research Results

Research results are disseminated through research publications, through our Virtual Appendices web page, and as part of the open-source Lemur Project.

 


This research is sponsored by National Science Foundation grant IIS-1422676, a Google Faculty Research Award, and a fellowship from the Allen Institute for Artificial Intelligence. Prior research was sponsored by Google through its support of the Worldly Knowledge project. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.

Updated on Oct 10, 2018.
Jamie Callan