Trained systems that apply machine learning to very large datasets, such as web search and IBM's Watson question-answering system, are among the most important and sophisticated software systems being constructed today. Such trained systems are frequently based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. For example, a good feature for a search engine's relevance ranker might be the number of times the user's query term was mentioned in a given Web page. The success of a modern trained system depends substantially on the quality of its features.
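As a rough illustration of the example feature above, the following minimal sketch counts how often a query term appears in a page's text. The function name `query_term_count` and the choice of case-insensitive, whole-word matching are assumptions made for illustration; they are not taken from the talk or from any particular search engine.

```python
import re


def query_term_count(query_term: str, page_text: str) -> int:
    """Count whole-word, case-insensitive occurrences of query_term in page_text.

    Hypothetical helper: the matching rules here are an assumption,
    not a description of any real relevance ranker.
    """
    if not query_term:
        return 0
    # Escape the term so punctuation in it is matched literally.
    pattern = r"\b" + re.escape(query_term) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


# A raw data object (page text) distilled into a one-element feature vector.
page = "Zombie movies are popular. A zombie film festival runs yearly."
features = [query_term_count("zombie", page)]
```

A real ranker would combine many such values per (query, page) pair; this sketch shows only the single signal mentioned in the text.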
Unfortunately, feature engineering--the process of writing code that takes raw data objects as input and outputs feature vectors that are suitable for a machine learning algorithm--is a tedious, time-consuming, miserable experience. Because "big data" inputs are so diverse, feature engineering is often a trial-and-error process that requires many small iterative code changes; because the inputs are so large, each code change can entail a time-consuming data processing task, such as processing each page in a Web crawl. We introduce Zombie, a data-centric system that accelerates feature engineering by performing intelligent input selection, thereby optimizing the "inner loop" of the feature engineering process. It can evaluate a feature engineer's code much faster than current practice, thereby enabling a feature engineer to be substantially more productive.
Michael Cafarella is an assistant professor in the Division of Computer Science and Engineering at the University of Michigan. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in venues such as SIGMOD and VLDB. Mike received his PhD from the University of Washington, Seattle, in 2009 with advisors Oren Etzioni and Dan Suciu. He received the NSF CAREER award in 2011. In addition to his academic work, he co-started (with Doug Cutting) the Hadoop open-source project, which is widely used at Facebook, Yahoo!, and elsewhere.
Faculty Host: Andy Pavlo
Partially sponsored by Yahoo! Labs
jennsbl [atsymbol] cs.cmu.edu