We are drowning in information yet struggle to find knowledge, that is, useful and actionable information. Recent studies estimate that humanity has stored in excess of 295 exabytes (1 exabyte = 10^18 bytes) of data. Much of this data is stored as unstructured text, such as news articles, message boards and forums, texts, emails, status updates, tweets, and nearly a billion webpages.
In this thesis, we present a solution for extracting the knowledge present in untold amounts of unstructured text. We define our problem as one of relation extraction: given a document, extract all instantiations of well-defined binary relations present in the text. To this end, we use distant supervision and a novel probabilistic first-order logic system, combined with co-reference resolution, to identify candidate relation instances. These candidates are then classified by a series of cost-augmented, binary one-vs-all support vector machines to produce the final relation extractions. On a corpus of 5.7 million newswire articles and 29 different relations, the system achieves an F1 of 37.32%.
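
As a rough illustration of the distant supervision idea mentioned above, the following minimal sketch labels sentences as candidate relation instances whenever both entities of a known fact appear in them. It is not the thesis's implementation: the toy knowledge base, the relation names, the example sentences, and the function `distant_label` are all hypothetical, and plain substring matching stands in for real entity recognition and co-reference resolution.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (subject, object) -> relation name.
KNOWN_FACTS = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Larry Page"): "founded_by",
}


def distant_label(sentences, known_facts):
    """Pair each sentence with every known fact whose two entities both occur in it.

    Substring matching is an assumption made for brevity; a real system
    would match entity mentions, not raw strings.
    """
    candidates = defaultdict(list)
    for sentence in sentences:
        for (subj, obj), relation in known_facts.items():
            if subj in sentence and obj in sentence:
                candidates[relation].append((subj, obj, sentence))
    return dict(candidates)


sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Larry Page and Sergey Brin started Google in 1998.",
    "Barack Obama visited Honolulu last week.",
]

labeled = distant_label(sentences, KNOWN_FACTS)
for relation, instances in labeled.items():
    for subj, obj, sentence in instances:
        print(f"{relation}({subj}, {obj}) <- {sentence}")
```

The third sentence is labeled `born_in` even though it does not express that relation. Noisy candidates of this kind are why a classification stage after candidate generation is useful.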
tracyf [atsymbol] cs.cmu.edu