Operations Research Seminar

  • Professor
  • Department of Computer Science
  • University of Pittsburgh

Algorithms for in-database machine learning

The current standard practice for a data scientist confronted with a machine learning task on relational data is to issue a feature extraction query that joins multiple tables to materialize a design matrix of (carefully curated) data, and then to import this design matrix into a machine learning tool to train the model. This practice is wasteful: computing relational joins is expensive, and the resulting design matrix will likely contain much redundant information and consume far more space than the original tables, so the machine learning algorithm takes more time than should conceptually be necessary.

I will discuss our nascent efforts to develop "in-database algorithms" for common machine learning problems and to understand the limits of what can be solved "in-database". Informally, an "in-database algorithm" is one that works directly on the relational data, without forming the design matrix, and that is much faster than the standard-practice algorithm.
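
To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only and not the speaker's algorithm. The tables R and S, their contents, and both function names are hypothetical. The quantity computed is the sum of x*y over the join of R and S, the kind of aggregate that appears as an entry of the Gram matrix in least-squares regression. The first function follows the standard practice and materializes the join; the second aggregates each table per join key and never forms the design matrix.

```python
from collections import defaultdict

# Hypothetical toy tables: R(key, x) and S(key, y), joined on key.
R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (2, 7.0), (3, 1.0)]


def gram_entry_via_join(R, S):
    """Standard practice: materialize the join, then aggregate x*y over its rows.

    Returns the aggregate and the number of rows in the materialized join.
    """
    design_matrix = [(x, y) for (k1, x) in R for (k2, y) in S if k1 == k2]
    total = sum(x * y for x, y in design_matrix)
    return total, len(design_matrix)


def gram_entry_in_database(R, S):
    """Aggregate each table per key, then combine; no join is materialized.

    Uses the identity: the sum of x*y over the join equals the sum over keys k
    of (sum of x in R with key k) * (sum of y in S with key k).
    """
    sum_x = defaultdict(float)
    sum_y = defaultdict(float)
    for k, x in R:
        sum_x[k] += x
    for k, y in S:
        sum_y[k] += y
    # Only keys present in both tables contribute to the join.
    return sum(sum_x[k] * sum_y[k] for k in sum_x if k in sum_y)


joined_total, joined_rows = gram_entry_via_join(R, S)
direct_total = gram_entry_in_database(R, S)
print(joined_total, joined_rows, direct_total)
```

In this sketch the second function makes one pass over each table, so its cost grows with the size of the tables, whereas the join can be much larger than the tables when many rows share a key. How far this idea extends to full learning problems is the subject of the talk, not something this toy example settles.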

For More Information, Please Contact: