If it's NELL, it knows what it "reads" on the web — and then it tweets about it
By Tom Imerito
Can a computer system form beliefs? Carnegie Mellon's Never Ending Language Learner does. More than half a million beliefs, in fact--and still growing.
Created by Tom Mitchell, head of the Machine Learning Department, and his research team, NELL autonomously and continuously reads the web; compiles words and their relationships to each other into a knowledge base from which it formulates beliefs; and then tweets its thoughts to more than 1,700 Twitter followers. It uses the words "I think" when it tweets a new belief, whether mundane ("I think "ground cayenne pepper" is a #Condiment") or profound ("I think "art wedding photography" is form of #VisualArt").
The project is funded by the Defense Advanced Research Projects Agency, the National Science Foundation, Google, Yahoo! and Brazil's National Council for Scientific and Technological Development. The research goals behind NELL are creating machine learning programs that can run autonomously for decades; improving machine understanding of human language; and building the world's biggest digital knowledge base.
When NELL was launched in January 2010, its ontology (a collection of words and associations) included 123 noun categories and 55 possible relations between them. It's since grown to 500 categories and relations that include more than 20 million noun phrases and 50 million metadata phrases.
NELL's recent success comes after three years of effort, stymied in part by what Mitchell calls the tendency of relatively small ontologies to produce inaccurate results. Ambiguous and irrelevant words were periodically sucked into NELL's original, or seed, ontology. That produced increasingly inappropriate associations that multiplied in scope every time NELL read its 500 million web pages (a process that, amazingly, only takes about four hours). Mitchell calls them "avalanches of inaccuracies."
A breakthrough came in late 2009 when Mitchell and his collaborators, including Burr Settles, a post-doctoral fellow in the MLD, and William Cohen, an associate research professor, had the idea that NELL might perform more accurately if they gave it more to do rather than less. They were right. They gave NELL a larger ontology, which essentially gave the polluting or irrelevant words their own categories--and the avalanches stopped.
Settles illustrated the phenomenon with an example. "At first NELL confused spoken languages with programming languages, so it thought Fortran was a human language," he says. "When we gave it a programming language category, pollution of the language category stopped. The more (categories) we add, the better job it does."
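The effect Settles describes can be illustrated with a toy sketch. The cue phrases, category names and scoring rule below are hypothetical stand-ins, not NELL's actual data or method; the point is only that a competing category gives an ambiguous noun somewhere better to go.

```python
# Hypothetical context cues per category; not NELL's real data.
CUES = {
    "language": {"speak", "written in", "fluent in"},
    "programming_language": {"written in", "compile", "code in"},
}

def categorize(contexts, categories):
    """Assign a noun to the category whose cues overlap its contexts the most."""
    scores = {c: len(CUES[c] & contexts) for c in categories}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Contexts in which the noun "Fortran" is assumed to appear.
fortran_contexts = {"written in", "compile", "code in"}

# With only one category available, the shared cue "written in" pollutes it.
print(categorize(fortran_contexts, ["language"]))

# Once a competing category exists, it wins on the stronger overlap.
print(categorize(fortran_contexts, ["language", "programming_language"]))
```

In this sketch the first call files Fortran under "language" and the second files it under "programming_language," mirroring the confusion and the fix in Settles' example.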
NELL reads the web using a method called macro-reading, which analyzes associations by looking at patterns of phrases; the structure of sentences; the occasions when certain words appear together; and the surface structure (or "morphology") of specific nouns. Every day, NELL reads and re-reads a local collection of 500 million web pages that are periodically crunched by Yahoo!'s 4,000-node M45 supercomputer to enable local analysis and processing. (The local collection represents about 10 percent of all pages on the web.)
To improve NELL's accuracy, four individual learning components--each working on a different principle--analyze its web reading. The multiple methods minimize the likelihood that any two components will validate an erroneous belief. One module scans NELL's resident 500 million local web pages for word and phrase patterns, while another issues queries to Google, based on that initial reading, and extracts information from the web in real time. A third module looks for new rules based on the patterns of existing ones. The fourth module analyzes word morphologies. For example, the suffix "-ing" on a noun often indicates that the noun describes an activity or hobby. If a noun is preceded by the word "Mount," the resulting phrase is likely to describe a mountain peak.
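The two morphology cues mentioned above can be written as a pair of simple rules. This is a minimal sketch, not NELL's morphology module; the function name and the category labels "mountain" and "activity" are assumptions made for illustration.

```python
def morphology_guess(noun_phrase):
    """Guess a category from surface form alone (hypothetical rules and labels)."""
    words = noun_phrase.split()
    # A phrase that begins with "Mount" probably names a mountain peak.
    if len(words) > 1 and words[0] == "Mount":
        return "mountain"
    # A final word ending in "-ing" often names an activity or hobby.
    if words and words[-1].lower().endswith("ing"):
        return "activity"
    return None

for phrase in ["Mount Everest", "scuba diving", "Pittsburgh", "Beijing"]:
    print(phrase, "->", morphology_guess(phrase))
```

Note that the "-ing" rule also labels "Beijing" an activity. Surface cues alone are unreliable, which is one reason to cross-check them against components that work on different principles.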
Data from each of the subsystem modules are then compared for consistency and sent to NELL's Knowledge Integrator as "Candidate Facts," which are run back through the subsystem components for validation. Candidates that survive all of the validation processes become NELL's "Beliefs."
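The promotion step can be sketched as a filter that keeps a candidate only when every validator accepts it. The validators and evidence tables below are hypothetical stand-ins for the learning components; the real Knowledge Integrator is certainly more involved.

```python
# Hypothetical evidence tables standing in for two learning components.
PATTERN_EVIDENCE = {("Mount Everest", "mountain"), ("Fortran", "language")}
QUERY_EVIDENCE = {("Mount Everest", "mountain")}

def pattern_validator(fact):
    """Accept a fact if the (assumed) pattern reader has seen support for it."""
    return fact in PATTERN_EVIDENCE

def query_validator(fact):
    """Accept a fact if the (assumed) live web query found support for it."""
    return fact in QUERY_EVIDENCE

def promote(candidates, validators):
    """Keep only the candidate facts that every validator accepts."""
    return [fact for fact in candidates if all(check(fact) for check in validators)]

candidates = [("Mount Everest", "mountain"), ("Fortran", "language")]
beliefs = promote(candidates, [pattern_validator, query_validator])
print(beliefs)
```

Here the Fortran candidate is dropped because only one of the two validators supports it, while the Mount Everest candidate survives both checks.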
Once accepted, the "Beliefs" are assigned confidence levels ranging from 50 to 100 percent. Those below 50 percent are discarded. To improve its confidence levels, NELL can re-read its web pages, query Google or ask its Twitter followers for verification. Soon, NELL also will use its "Beliefs" as the basis of an online game called "Polarity." The multiple-choice game, developed by Settles, computer science professor Luis von Ahn and graduate student Edith Law, asks two players to categorize a low-confidence word served up by NELL. The game compares player responses to NELL's own assessment of the word and decides where and how well it fits into the knowledge base.
At the heart of NELL's ability to read and learn is bootstrapping--the process of discovering new categories, relations and rules in response to recurring word, phrase and sentence patterns.
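A common, simplified form of bootstrapping alternates between two steps: learn the contexts in which known examples appear, then collect new examples that appear in those contexts. The sketch below assumes a tiny hand-made corpus of (noun, context) pairs and a single hypothetical "city" category; it illustrates the general idea rather than NELL's implementation.

```python
def bootstrap(pairs, seeds, rounds=2):
    """Grow a set of category instances from seeds by alternating two steps."""
    instances = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: any context that holds a known instance becomes a pattern.
        patterns |= {ctx for noun, ctx in pairs if noun in instances}
        # Step 2: any noun found in a known pattern becomes an instance.
        instances |= {noun for noun, ctx in pairs if ctx in patterns}
    return instances, patterns

# Hypothetical corpus: each pair is (noun, context with a slot for the noun).
pairs = [
    ("Pittsburgh", "mayor of _"),
    ("Boston", "mayor of _"),
    ("Boston", "flights to _"),
    ("Denver", "flights to _"),
    ("Fortran", "written in _"),
]

instances, patterns = bootstrap(pairs, seeds={"Pittsburgh"})
print(sorted(instances))
print(sorted(patterns))
```

Starting from one seed, the sketch picks up Boston in the first round and Denver in the second, and never reaches Fortran. It also hints at the risk: a loose pattern such as "flights to _" would admit any noun that fits it, and every later round would build on that mistake.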
Mitchell gave one example: "NELL read that the Mets played against the Braves and that the Mets play baseball. Therefore, it believes that the Braves play baseball. In itself that's not surprising, but the fact that NELL discovered it on its own is amazing. I think the point will come where NELL will be discovering things that we weren't aware of."
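Mitchell's example amounts to applying a rule of the form "if X plays against Y, and X plays sport S, then Y plays sport S." A minimal sketch of that inference follows; the function name, the data layout and the assumption that "plays against" works in both directions are illustrative choices, not details from NELL.

```python
def infer_sport(plays_against, plays_sport):
    """Spread known sports to opponents until no new belief can be added."""
    inferred = dict(plays_sport)
    changed = True
    while changed:
        changed = False
        for a, b in plays_against:
            # Assumption: "plays against" is symmetric, so try both directions.
            for known, other in ((a, b), (b, a)):
                if known in inferred and other not in inferred:
                    inferred[other] = inferred[known]
                    changed = True
    return inferred

beliefs = infer_sport(
    plays_against=[("Mets", "Braves")],
    plays_sport={"Mets": "baseball"},
)
print(beliefs)
```

Given only the two facts from the quote, the sketch concludes that the Braves play baseball.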