Many data mining algorithms employ to some extent a generate-and-test approach: large amounts of partial or complete hypotheses are generated and evaluated during the data mining process. This evaluation usually involves testing the hypothesis on a large data set, a process which is typically linear in the size of the data set. Examples of such data mining algorithms are APRIORI , decision tree algorithms [25,11], algorithms inducing decision rules , etc.
Even though the search through the hypothesis space is seldom exhaustive in practical situations, and clever branch-and-bound or greedy search strategies are employed, the number of hypotheses generated and evaluated by these approaches may still be huge. This is especially true when a complex hypothesis space is used, as is often the case in inductive logic programming (ILP), where the sheer size of the hypothesis space is an important contribution to the high computational complexity of most ILP approaches. This computational complexity can be reduced, however, by exploiting the fact that there are many similarities between hypotheses.
Most ILP systems build a hypothesis one clause at a time. This search for a single clause is what we will be concerned with in the rest of this paper, and so the word ``hypothesis'' further on will usually refer to a single clause. The clause search space is typically structured as a lattice. Because clauses close to one another in the lattice are similar, the computations involved in evaluating them will be similar as well. In other words, many of the computations that are performed when evaluating one clause (which boils down to executing a query consisting of the body of the clause) will have to be performed again when evaluating the next clause. Storing certain intermediate results during the computation for later use could be a solution, e.g., tabling as in the XSB Prolog engine , but may be infeasible in practice because of its memory requirements. It becomes more feasible if the search is reorganised so that intermediate results are always used shortly after they have been computed; this can be achieved to some extent by rearranging the computations. The best way of removing the redundancy, however, seems to be to re-implement the execution strategy of the queries in such a way that as much computation as possible is effectively shared.
In this paper we discuss a strategy for executing sets of queries, organised in so-called query packs, that avoids the redundant computations. The strategy is presented as an adaptation of the standard Prolog execution mechanism. The adapted execution mechanism has been implemented in ILPROLOG, a Prolog system dedicated to inductive logic programming. Several inductive logic programming systems have been re-implemented to make use of this dedicated engine, and using these new implementations we obtained experimental results showing in some cases a speed-up of more than an order of magnitude. Thus, our work significantly contributes to the applicability of inductive logic programming to real world data mining tasks. In addition, we believe it may contribute to the state of the art in query optimisation in relational databases. Indeed, in the latter field there has been a lot of work on the optimisation of individual queries or relatively small sets of queries, but much less on the optimisation of large groups of very similar queries, which understandably did not get much attention before the advent of data mining. Optimisation of groups of queries for relational databases seems an interesting research area now, and we believe techniques similar to the ones proposed here might be relevant in that area.
The remainder of this paper is structured as follows. In Section 2 we precisely describe the ILP problem setting in which this work is set. In Section 3 we define the notion of a query pack and indicate how it would be executed by a standard Prolog interpreter and what computational redundancy this causes. We further describe an execution mechanism for query packs that makes it possible to avoid the redundant computations that would arise if all queries in the pack were run separately, and show how it can be implemented by making a few small but significant extensions to the WAM, the standard Prolog execution mechanism. In Section 4 we describe how the query pack execution strategy can be incorporated in two existing inductive logic programming algorithms (TILDE and WARMR). In Section 5 we present experimental results that illustrate the speed-up that these systems achieve by using the query pack execution mechanism. In Section 6 we discuss related work and in Section 7 we present conclusions and some directions for future work.