A Toolkit for Multiple-Instance Learning 
and its Experiments with Information Retrieval

Jun Yang   (juny@cs.cmu.edu)

Abstract

Multiple-instance learning (MIL) is a way to model ambiguity in a semi-supervised learning setting, where each training example is a bag of instances and labels are assigned to bags rather than to individual instances. Since its introduction by Dietterich et al. (1997), MIL has become an active research area and a number of MIL algorithms have been proposed. Meanwhile, the nature of MIL makes it applicable to a range of problems, from drug activity prediction to text and multimedia information retrieval. Research on and applications of MIL can be promoted by (1) a public toolkit that consists of several widely-used MIL algorithms, and (2) a comparative study of the performance of these algorithms on several common applications, neither of which is available at this point. In this project, I implemented MILL (Multiple Instance Learning Library), a Matlab toolkit consisting of several popular MIL algorithms, including Diverse Density [Maron and Lozano-Perez, 1998], EM-DD [Zhang and Goldman, 2001], SVM variants for MIL [Andrews et al., 2002], and kNN for MIL [Wang and Zucker, 2000]. Using MILL, we evaluated the performance of these algorithms on several typical MIL problems, including drug activity prediction, text categorization, and image classification/retrieval.

Introduction

Multiple-instance learning (MIL) is an emerging topic in semi-supervised learning, where only incomplete knowledge about the labels of the training data is available. Specifically, instances in MIL are grouped into a set of bags. The labels of the bags are provided, but the labels of the instances in the bags are unknown. Instead, a bag is labeled positive if at least one instance in the bag is positive, and negative if all the instances in it are negative. MIL algorithms attempt to learn a classification function that can predict the labels of bags and/or instances in the testing data.
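To make the bag-labeling assumption concrete, the following minimal Matlab sketch (illustrative only, not part of MILL) derives bag labels from instance labels, assuming bags are stored as a cell array in which each cell holds a 0/1 vector of instance labels:

```matlab
% Minimal sketch of the MIL bag-labeling assumption (not part of MILL).
% instance_labels: cell array; instance_labels{b} is a 0/1 vector holding
% the (normally unobserved) labels of the instances in bag b.
% A bag is positive iff at least one of its instances is positive.
function bag_labels = bags_to_labels(instance_labels)
    num_bags = numel(instance_labels);
    bag_labels = zeros(num_bags, 1);
    for b = 1:num_bags
        bag_labels(b) = double(any(instance_labels{b} == 1));
    end
end
```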

 

The nature of MIL lends itself to a variety of applications beyond the famous drug activity prediction problem that motivated the research on MIL. In text categorization, for example, a document (bag) may be labeled relevant to a topic only because certain unspecified paragraphs (instances) of the document are relevant, while an irrelevant document usually has no relevant paragraphs. With only the document-level labels provided, MIL algorithms can help us figure out the relevant paragraphs and therefore build a possibly better text categorization model. The same is true in image retrieval, where it remains ambiguous which regions within a relevant image contain the target object.

 

I believe that research on and applications of MIL can be promoted if (1) a public toolkit consisting of several popular MIL algorithms is built, and (2) a comparative study of the performance of these algorithms is available. Neither is currently available. As an effort towards these two objectives, I implemented MILL (Multiple Instance Learning Library), a Matlab toolkit consisting of several popular MIL algorithms such as Diverse Density, EM-DD, SVM variants for MIL, and kNN for MIL. A detailed description and a tutorial for MILL have been provided. Using MILL, we evaluated the performance of the algorithms on several typical MIL problems, including drug activity prediction, text categorization, and image classification/retrieval. The preliminary results are also presented in this report.

 

Literature Review

This section reviews the motivation behind multiple-instance learning (MIL), several popular MIL algorithms, and applications of MIL in the area of information retrieval. (Full details) (PDF file)

MILL Toolkit and its Tutorial  

I have implemented MILL -- a Matlab library that consists of several popular multiple-instance learning algorithms. The MILL library can be downloaded from here. Please also check out its tutorial.

 

The MILL toolkit implements six widely-used MIL algorithms: iterdiscrim_APR [Dietterich et al., 1997], Diverse Density (DD) [Maron and Lozano-Perez, 1998], EM-DD [Zhang and Goldman, 2001], two SVM variants for MIL (mi-SVM and MI-SVM) [Andrews et al., 2002], and Citation-kNN for MIL [Wang and Zucker, 2000]. Note that iterdiscrim_APR is the best-performing variant among the set of MIL algorithms proposed in [Dietterich et al., 1997], which learn an axis-parallel rectangle (APR) as the target concept. On top of these base algorithms, MILL provides a wrapper that supports various evaluation methods, including cross-validation, leave-one-out, train-only, and test-only.
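To illustrate how the SVM variants cope with the unknown instance labels, the sketch below implements the alternating heuristic behind mi-SVM-style algorithms: instance labels are initialized from the bag labels, an ordinary SVM is trained, and the labels of instances in positive bags are re-estimated until they stop changing. This is a simplified illustration under my own data layout (a cell array of instance matrices plus bag labels in {-1, +1}) and assumes MATLAB's fitcsvm from the Statistics and Machine Learning Toolbox; it is not MILL's actual implementation.

```matlab
% Simplified sketch of the mi-SVM alternating heuristic (not MILL's code).
% bags       : cell array; bags{b} is an (n_b x d) matrix of instance features
% bag_labels : vector of bag labels in {-1, +1}
% C          : SVM cost (box constraint)
function model = mi_svm_sketch(bags, bag_labels, C)
    X = cell2mat(bags(:));                       % stack all instances
    bag_of = [];                                 % bag index of each instance
    for b = 1:numel(bags)
        bag_of = [bag_of; repmat(b, size(bags{b}, 1), 1)];
    end
    y = bag_labels(bag_of);                      % init instance labels from bag labels
    y = y(:);
    for iter = 1:50
        model = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', C);
        [pred, score] = predict(model, X);
        pos_col = find(model.ClassNames == 1);   % score column of the +1 class
        y_new = y;
        for b = 1:numel(bags)
            idx = find(bag_of == b);
            if bag_labels(b) == -1
                y_new(idx) = -1;                 % negative bags stay all negative
            else
                y_new(idx) = pred(idx);
                if all(pred(idx) == -1)          % force at least one positive instance
                    [~, best] = max(score(idx, pos_col));
                    y_new(idx(best)) = 1;
                end
            end
        end
        if isequal(y_new, y), break; end         % instance labels converged
        y = y_new;
    end
end
```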

Experiments with the Drug Activity Prediction Problem

Multiple-instance learning was first introduced by Dietterich et al. (1997) to tackle the problem of drug activity prediction. The objective is to predict whether a drug molecule can bind well to a target protein related to certain disease states, which is primarily determined by the shape of the molecule. Example molecules that can bind well and some that cannot are provided as training data. However, since a molecule is flexible, it can adopt a wide range of shapes (called conformations). A molecule can bind well if at least one of its shapes can bind well, while it cannot bind well if none of its shapes can. Therefore, knowing that a molecule can bind well does not reveal what shape it needs to take in order to bind. In order to predict the binding of unclassified molecules, one needs to know the "right shape" that dictates the success of a binding. By modeling a molecule as a bag and the shapes it takes as the instances in the bag, one can use MIL algorithms to learn the right shape that determines the binding, and therefore predict whether a new molecule can bind to the target protein or not.

 

Our experiment is conducted on the public MUSK1 dataset [Dietterich et al., 1997] for the drug activity prediction problem, which has been used as a benchmark for virtually every MIL algorithm. This dataset has 476 instances grouped into 92 bags, and each instance is described by a 166-dimensional feature vector. We tested every algorithm implemented in MILL on the MUSK1 dataset. For each algorithm, we computed the average classification accuracy obtained from 10-fold cross-validation. The following table summarizes the performance comparison of these algorithms. Note that for each algorithm we explored several parameter settings, such as kernel types and parameters for the SVM variants, and the number of neighbors and the rank of citers for kNN. The best choices are also indicated in the table.
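The evaluation here is ordinary k-fold cross-validation carried out at the bag level, so that instances from the same bag never end up in both the training and test folds. The sketch below shows that loop; the handles train_fn and predict_fn are placeholders for whatever MIL learner and bag-level predictor are being evaluated, not MILL's actual interface.

```matlab
% Sketch of bag-level k-fold cross-validation (placeholders, not MILL's API).
% bags, bag_labels           : as in the mi-SVM sketch above
% train_fn(bags, labels)     : returns a trained model
% predict_fn(model, bags)    : returns predicted bag labels
function acc = bag_level_cv(bags, bag_labels, k, train_fn, predict_fn)
    bag_labels = bag_labels(:);
    n = numel(bags);
    fold = mod(randperm(n), k) + 1;          % random assignment of bags to folds
    correct = 0;
    for f = 1:k
        tr = (fold ~= f);                    % training bags for this fold
        te = ~tr;                            % held-out bags for this fold
        model = train_fn(bags(tr), bag_labels(tr));
        pred  = predict_fn(model, bags(te));
        correct = correct + sum(pred(:) == bag_labels(te));
    end
    acc = correct / n;                       % bag-level classification accuracy
end
```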

 

Table 1: Performance comparison of MIL algorithms on the MUSK1 dataset

| Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|
| Iterdiscrim-APR | N/A | N/A | .913 |
| mi-SVM (linear kernel) | cost = 1-100 | cost = 100 | .76 |
| mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 3, cost = 10 | .87 |
| mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 100 | .826 |
| MI-SVM (linear kernel) | cost = 1-100 | cost = 1 | .804 |
| MI-SVM (polynomial kernel) | degree = 2-4, cost = 1-100 | degree = 2, cost = 1-10 | .81-.815 |
| MI-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 100 | .83 |
| Standard-kNN | K = 1-5 | K = 1 | .852 |
| Citation-kNN | K = 1-4, C = 1-5 | K = 2, C = 4-5 | .893-.913 |
| Diverse Density (DD) | # starting points = 10 | N/A | .862 |
| EM-DD | # starting points = 10 | N/A | .848 |

 

As we can see, iterdiscrim_APR outperforms essentially all the other MIL algorithms on the MUSK1 dataset. The accuracy of 0.913 it achieves is close to the 0.924 accuracy reported in [Dietterich et al., 1997] on the same dataset. This partially confirms the argument in the literature that iterdiscrim_APR is particularly tailored to the drug activity prediction problem. Among the other algorithms, the citation version of the Nearest Neighbor algorithm performs second best. The Citation-kNN algorithm not only examines the labels of the top K neighbors of the current bag, but also examines the labels of the bags whose top C neighbors include the current bag. This citation mechanism is supposed to improve the robustness of the standard kNN algorithm, which is confirmed by our experimental results. The other algorithms, including DD, EM-DD, mi-SVM, and MI-SVM, perform at roughly the same level, and the accuracies they achieve are close to those reported in [Andrews et al., 2002], [Maron and Ratan, 1998], and [Zhang and Goldman, 2001].
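To make the citation mechanism concrete, the sketch below shows one way to implement the Citation-kNN vote, using the minimal Hausdorff distance (the smallest pairwise instance distance) to compare bags, as in [Wang and Zucker, 2000]. It is an illustrative re-implementation under my own simplifying assumptions (ties are broken toward the negative class, and pdist2 from the Statistics and Machine Learning Toolbox is assumed), not MILL's code.

```matlab
% Illustrative sketch of Citation-kNN voting (not MILL's implementation).
% bags       : cell array of (n_b x d) instance matrices (training bags)
% bag_labels : vector of 0/1 labels for the training bags
% query_bag  : (m x d) instance matrix of the bag to classify
% K          : number of references (nearest training bags of the query)
% C          : rank within which a training bag must place the query to "cite" it
function label = citation_knn_sketch(bags, bag_labels, query_bag, K, C)
    n = numel(bags);
    d_query = zeros(n, 1);
    for i = 1:n
        d_query(i) = min_hausdorff(bags{i}, query_bag);
    end
    [~, order] = sort(d_query);
    refs = order(1:min(K, n));                    % references: K nearest training bags

    citers = [];                                  % citers: bags that rank the query
    for i = 1:n                                   % among their C nearest neighbors
        others = setdiff(1:n, i);
        d_others = zeros(numel(others), 1);
        for j = 1:numel(others)
            d_others(j) = min_hausdorff(bags{i}, bags{others(j)});
        end
        if sum(d_others < d_query(i)) < C         % query within bag i's C nearest
            citers(end+1) = i;                    %#ok<AGROW>
        end
    end

    votes = bag_labels([refs(:); citers(:)]);
    label = double(sum(votes == 1) > sum(votes == 0));   % tie -> negative
end

function d = min_hausdorff(A, B)
    % minimal Hausdorff distance between bags: smallest pairwise instance distance
    D = pdist2(A, B);
    d = min(D(:));
end
```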

 

Although not compared quantitatively, the running times of the algorithms relate as follows. DD takes the longest to train, spending several hours on 10-fold cross-validation, with each run using 5 different starting points. EM-DD takes about half that time or less. Iterdiscrim-APR and the two SVM variants are relatively fast, with running times ranging from a few minutes to an hour. All the algorithms other than kNN spend most of their running time on training, and the time they need for prediction is almost negligible. kNN is a memory-based algorithm that requires no training but a large amount of time for prediction. The overall running time of kNN is longer than that of the SVM variants or iterdiscrim-APR, but significantly shorter than that of DD and EM-DD.
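The cost of DD and EM-DD comes from repeatedly maximizing the diverse-density objective from many starting points, each requiring a full gradient-based search over the concept point and the feature scalings. As a reference for what each restart optimizes, the sketch below evaluates the negative log diverse density of a candidate concept point t with per-feature scalings s under the noisy-or model of [Maron and Lozano-Perez, 1998]; the variable names and data layout are my own.

```matlab
% Sketch of the negative log Diverse Density objective (noisy-or model).
% bags       : cell array of (n_b x d) instance matrices
% bag_labels : vector of 0/1 bag labels
% t          : 1 x d candidate concept point
% s          : 1 x d per-feature scaling weights
function nll = neg_log_dd(bags, bag_labels, t, s)
    nll = 0;
    for b = 1:numel(bags)
        X = bags{b};
        % probability that each instance belongs to the concept
        p_inst = exp(-sum(((X - t) .* s) .^ 2, 2));
        % noisy-or: probability that the bag is positive
        p_bag = 1 - prod(1 - p_inst);
        if bag_labels(b) == 1
            nll = nll - log(max(p_bag, eps));
        else
            nll = nll - log(max(1 - p_bag, eps));
        end
    end
end
```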

 

Given these speed and performance results, we choose mi-SVM and Citation-kNN for our further experiments, which involve much larger datasets and higher feature dimensions. We skip iterdiscrim-APR because it is specialized for the MUSK dataset, and extensive effort would be needed to make it work on other datasets.

Experiments with Text Categorization

Although there is only one prior work [Andrews et al., 2002] on applying MIL to the text categorization (TC) problem, we believe that MIL has the potential to model document categories better given the limited example documents for each category. In text categorization, a document is classified as relevant to a topic (category) because at least a part of the document is on that topic; it is not required that every piece of the document be about that topic. This is especially the case for long, multi-topic documents. However, most TC methods are built on features extracted from whole documents, which may not be the most precise representation of a topic/category. Building a TC model from the relevant regions is an appealing alternative. By formulating a document as a bag and the paragraphs in the document as the instances in the bag, we turn a binary TC problem into a standard MIL problem, where the objective is to learn a classifier that can classify both paragraphs and documents.

 

The test data we used for TC is drawn from the publicly available TREC9 data, also known as OHSUMED. Specifically, we used the same subset of the data as [Andrews et al., 2002] in order to make the results comparable. We performed two TC tasks on two sets of documents, each consisting of 400 documents, with 200 relevant and 200 irrelevant ones. Each document is split into overlapping passages of 50 words each, which results in over 3000 instances in each of the two datasets. A passage is indexed using MeSH terms (Medical Subject Headings). Because of the variety of MeSH terms, each passage is indexed by a feature vector of extremely high dimension (over 60,000), yet for each passage only a few features have non-zero values. This poses the challenge of handling very sparse feature vectors for MIL algorithms.
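For concreteness, splitting a document into overlapping fixed-length passages can be done roughly as in the Matlab sketch below. Only the 50-word window length comes from the setup above; the 25-word step size is an assumption, and the sketch returns word lists rather than MeSH-indexed feature vectors.

```matlab
% Sketch: split a document into overlapping fixed-length word passages.
% The 25-word step size is an assumption; only the 50-word window length
% is taken from the experimental setup described above.
function passages = split_into_passages(doc_text, passage_len, step)
    if nargin < 2, passage_len = 50; end
    if nargin < 3, step = 25; end
    words = strsplit(strtrim(doc_text));      % whitespace tokenization
    n = numel(words);
    passages = {};
    for start = 1:step:max(n - passage_len + 1, 1)
        stop = min(start + passage_len - 1, n);
        passages{end+1} = words(start:stop);  %#ok<AGROW>
    end
end
```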

 

As discussed above, we use two MIL algorithms in this experiment, namely mi-SVM and Citation-kNN. For comparison purposes, we also performed the TC task using global features extracted from whole documents, as most existing TC methods do. We use an SVM as the supervised learning algorithm for this baseline and choose the best empirical parameters for it. Tables 2 and 3 summarize the average classification accuracy achieved by these algorithms under 10-fold cross-validation, as well as the best parameters for each algorithm.

 

Table 2: Performance comparison on text categorization set 1

| Setting | Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|---|
| Supervised | SVM | various kernels & parameters | linear, cost = 10 | .94 |
| Multi-instance learning | mi-SVM (linear kernel) | cost = 1-100 | cost = 1-10 | .935-.945 |
| Multi-instance learning | mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 2, cost = 10-100 | .90-.91 |
| Multi-instance learning | mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 1-10 | .945-.950 |
| Multi-instance learning | Citation-kNN | K = 5-10, C = 7-12 | K = 10, C = 12 | .87 |

 

Table 3: Performance comparison on text categorization set 2

| Setting | Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|---|
| Supervised | SVM | various kernels & parameters | linear, cost = 10 | .805 |
| Multi-instance learning | mi-SVM (linear kernel) | cost = 1-100 | cost = 1 | .82 |
| Multi-instance learning | mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 2, cost = 100 | .73 |
| Multi-instance learning | mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 1-10 | .825 |
| Multi-instance learning | Citation-kNN | K = 5-10, C = 7-12 | K = 10, C = 12 | .81 |

 

As we can see, MIL algorithms working on passage-level features perform about as well overall as the SVM working on document-level features. Among the MIL algorithms, the results clearly show that mi-SVM with either the linear or the RBF kernel is a better choice than Citation-kNN. Besides its inferior performance, Citation-kNN is inefficient in this TC application, mainly due to the extremely high dimension of the feature vectors. Although the two TC datasets are at different difficulty levels, their results point to similar parameter settings. For example, mi-SVM achieves its highest performance when a cost factor between 1 and 10 is used with the linear kernel, or a gamma around 0.05 with the RBF kernel.

Experiments with Image Retrieval

There has been some preliminary work on applying MIL to content-based image retrieval in [Yang and Lozano-Perez, 2000] and [Zhang et al., 2002]. Like a document, an image can be regarded as a bag of regions that contain different visual objects. Assuming that a user of an image retrieval system is searching for a target object, an image is regarded as relevant if at least one of its regions contains the target object, while the other regions can be either relevant or non-relevant. Therefore, an image classifier built on the local features of the relevant regions is preferable to one built on the global features of the relevant images, although the latter is what most current image retrieval systems use. Since user evaluations of individual regions are rarely provided, the labels of regions are unknown. This provides a natural application for MIL algorithms, which can help identify the relevant regions and build a potentially better model for judging the relevance of images.

 

We conduct the experiments on images from the Corel Image Database, which is widely used in the image retrieval community. Specifically, we perform three retrieval tasks, classifying images of "elephant", "fox", and "tiger". For each task, we use a test collection of 200 images, with 100 relevant images and 100 irrelevant ones. All the images are segmented using the Blobworld system [Carson et al., 1999] into a set of regions, and for each region a 320-dimensional feature vector is extracted to represent its color, texture, and shape characteristics. After segmentation, the number of instances (regions) for the three retrieval tasks is 1390, 1319, and 1219, respectively. As in our text categorization experiments, we use mi-SVM and Citation-kNN as the two MIL algorithms and compare their performance against a baseline that applies a standard SVM to global image features. Tables 5, 6, and 7 report the classification accuracy achieved by these algorithms under 10-fold cross-validation.

 

Table 5: Performance comparison on image retrieval of "Elephant"

| Setting | Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|---|
| Supervised | SVM | various kernels & parameters | RBF, gamma = .01-.05, cost = 10-100 | .82-.84 |
| Multi-instance learning | mi-SVM (linear kernel) | cost = 1-100 | cost = 10 | .82 |
| Multi-instance learning | mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 2-3, cost = 1-10 | .80-.83 |
| Multi-instance learning | mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 1-10 | .81 |
| Multi-instance learning | Citation-kNN | K = 5-10, C = 7-12 | K = 5, C = 7 | .805 |

 

Table 6: Performance comparison on image retrieval of "Tiger"

| Setting | Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|---|
| Supervised | SVM | various kernels & parameters | linear, cost = 10-100 | .805-.815 |
| Multi-instance learning | mi-SVM (linear kernel) | cost = 1-100 | cost = 1 | .86 |
| Multi-instance learning | mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 2-3, cost = 1-10 | .76-.79 |
| Multi-instance learning | mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 1-10 | .81-.82 |
| Multi-instance learning | Citation-kNN | K = 5-10, C = 7-12 | K = 5-10, C = 7-12 | .775-.78 |

 

Table 7: Performance comparison on image retrieval of "Fox"

| Setting | Algorithm | Tested Values | Best Choices | Classification Accuracy |
|---|---|---|---|---|
| Supervised | SVM | various kernels & parameters | RBF, gamma = 0.01, cost = 10-100 | .615-.635 |
| Multi-instance learning | mi-SVM (linear kernel) | cost = 1-100 | cost = 10 | .60 |
| Multi-instance learning | mi-SVM (polynomial kernel) | degree = 2-3, cost = 1-100 | degree = 3, cost = 10 | .59-.61 |
| Multi-instance learning | mi-SVM (RBF kernel) | gamma = 0.01-0.1, cost = 1-100 | gamma = 0.05, cost = 10 | .60 |
| Multi-instance learning | Citation-kNN | K = 5-10, C = 7-12 | K = 5-10, C = 7-12 | .58-.60 |

 

As we can see, the MIL algorithms running on local features achieve performance comparable to that of the SVM running on global image features. Specifically, the MIL algorithms outperform the SVM on the "Tiger" dataset by a non-trivial margin, but they show no advantage on the other two datasets. Among the MIL algorithms, we found that mi-SVM with the linear kernel is a good choice for this application, achieving top or near-top performance on all three datasets. Citation-kNN performs slightly worse than mi-SVM, despite its relatively long running time.

 

Conclusions

We implemented and presented MILL, a Matlab toolkit for multiple-instance learning. The experiments conducted with the algorithms in MILL on drug activity prediction, text categorization, and image retrieval datasets demonstrate the potential of MIL algorithms in these applications. However, the experiments presented above are still limited by the size and variety of the datasets and by the range of parameter settings tested. We plan to conduct more extensive experiments in future work.

 

References

 

[Amar et al., 2001] R. A. Amar, D. R. Dooly, S. A. Goldman, and Q. Zhang. Multiple-instance learning of real-valued data. Proc. of the 18th Int. Conf. on Machine Learning, pp.3-10, 2001.

 

[Andrews et al., 2002] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support Vector Machines for Multiple-Instance Learning. Advances in Neural Information Processing Systems 15, 2002.

 

[Carson et al., 1999] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A system for region-based image indexing and retrieval. In Proc. 3rd Int’l Conf. on Visual Information and Information Systems, volume 1614 of Lecture Notes in Computer Science, pages 509--516. Springer, 1999.

 

[Chevaleyre and Zucker, 2001] Y. Chevaleyre, J. D. Zucker, Solving multiple-instance and multiple-part learning problems with decision trees and rule sets. Application to the mutagenesis problem. In: Stroulia, E., Matwin, S. (eds.): Lecture Notes in Artificial Intelligence, Vol. 2056. Springer, Berlin (2001) 204-214.

 

[Dietterich et al., 1997] T. G. Dietterich, R. H. Lathrop, T. Lozano-Perez. Solving the Multiple-Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence Journal, 89, 1997.

 

[Maron and Lozano-Perez, 1998] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. Advances in Neural Information Processing Systems 10, pp. 570-576, 1998.

 

[Maron and Ratan, 1998] O. Maron, A. L. Ratan. Multiple-Instance Learning for Natural Scene Classification. Proc. 15th Int. Conf. on Machine Learning, pp. 341-349, 1998.

 

[Wang and Zucker, 2000] J. Wang and J.-D. Zucker. Solving the multiple-instance problem: a lazy learning approach. Proc. 17th Int'l Conf. on Machine Learning, pp. 1119-1125, 2000.

 

[Xu and Croft, 1996] J. Xu and W. B. Croft. Query Expansion using Global and Local Document Analysis. SIGIR 1996.

 

[Yang and Lozano-Perez, 2000] C. Yang and T. Lozano-Perez. Image database retrieval with multiple-instance learning techniques. Proc. of the 16th Int. Conf. on Data Engineering, pp.233-243, 2000.

 

[Zhang and Goldman, 2001] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In Neural Information Processing Systems 14, 2001.

 

[Zhang et al., 2002] Q. Zhang, S. A. Goldman, W. Yu, and J. E. Fritts. Content-Based Image Retrieval Using Multiple-Instance Learning. Proc. of the 19th Int. Conf. on Machine Learning, July 2002.

[Zhou and Zhang, 2002] Z. H. Zhou, M. L. Zhang. Neural networks for multi-instance learning. In Proc. of the Int'l Conf. on Intelligent Information Technology, (2002) pp. 455-459.

[Zhou and Zhang, 2003] Z.-H. Zhou and M.-L. Zhang. Ensembles of multi-instance learners. In Proc of the 14th European Conf on Machine Learning, pages 492--501. Springer, 2003.