Evaluation of Image Segmentation Algorithms


Caroline Pantofaru
Ranjith Unnikrishnan
Martial Hebert


The first segmentation is clearly better than the second one. Is there a principled way to compute a score that captures that fact?

Unsupervised image segmentation is an important component in many image understanding algorithms and practical vision systems. However, evaluation of segmentation algorithms thus far has been largely subjective, leaving a system designer to judge the effectiveness of a technique based only on intuition and results in the form of a few example segmented images. This is largely due to image segmentation being an ill-defined problem—there is no unique ground-truth segmentation of an image against which the output of an algorithm may be compared.

In this work, we propose a new measure of similarity, the Normalized Probabilistic Rand (NPR) index, which can be used to perform a quantitative comparison between image segmentation algorithms using a hand-labeled set of ground-truth segmentations. The measure is a generalization of the Rand index commonly used in statistics, and it has desirable properties falling broadly into four categories.
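The underlying Rand index scores agreement between two segmentations by counting pixel pairs that are treated consistently (grouped together in both, or separated in both). The Probabilistic Rand index softens this for a set of ground truths: each pair's grouping in the test segmentation is compared against the empirical probability, across the ground-truth set, that the pair shares a label. A minimal sketch of that idea, in the naive O(N²) pairwise form for clarity only (function and variable names are illustrative; real images call for the contingency-table formulation):

```python
import numpy as np
from itertools import combinations

def probabilistic_rand(test, ground_truths):
    """Probabilistic Rand index of a test segmentation against a set
    of ground-truth segmentations (all flat label arrays, same length).

    Naive pairwise form, for illustration only.
    """
    test = np.ravel(test)
    gts = [np.ravel(g) for g in ground_truths]
    n = test.size
    total, score = 0, 0.0
    for i, j in combinations(range(n), 2):
        # Empirical probability that pixels i and j share a label,
        # estimated across the ground-truth set.
        p = np.mean([g[i] == g[j] for g in gts])
        # 1 if the test segmentation groups the pair, else 0.
        c = float(test[i] == test[j])
        # Reward agreement: grouped where p is high,
        # separated where p is low.
        score += c * p + (1 - c) * (1 - p)
        total += 1
    return score / total
```

With a single ground truth this reduces to the ordinary Rand index, since each pairwise probability is then exactly 0 or 1.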

In practice, the measure allows principled comparisons between segmentations created by different algorithms, as well as between segmentations of different images. It is well-suited for comparing segmentations because it matches the "natural" ordering of segmentations, i.e., the ordering defined by human subjects. For example, the NPR index is low for both over- and under-segmentation, and high for "good" segmentations that match the reference segmentations well:

[Figure: an input segmentation (mean shift) compared against manual segmentations from the Berkeley set. Two poor segmentations score NPR = -0.59 and NPR = -0.74; a "good" segmentation scores NPR = 0.85.]
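The negative scores above are a consequence of normalization: the Probabilistic Rand score is shifted and scaled so that its expected value under a baseline (estimated from the ground-truth set in the paper) maps to zero and its maximum maps to one. A minimal sketch of that scaling step (the function name is illustrative; estimating the expected index is the substantive part and is not shown here):

```python
def normalized_pr(pr, expected_pr, max_pr=1.0):
    """Map a Probabilistic Rand score so that the baseline expected
    score becomes 0 and the maximum achievable score becomes 1.
    Scores below the baseline come out negative.
    """
    return (pr - expected_pr) / (max_pr - expected_pr)
```

With this scaling, a segmentation no better than the baseline scores 0, which is why heavily over- or under-segmented results can score well below zero.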

This can be seen in a different way by comparing the scores obtained with the NPR index against those obtained with three leading approaches for segmentation scoring. In general, our measure conforms to the natural ordering of the segmentations, including a sharper separation between useful segmentations and massively over- or under-segmented exemplars:
[Figure: an input image and 7 possible segmentations, with the NPR score for all 7 compared against other popular similarity measures.]

We developed a procedure for algorithm evaluation through an example evaluation of several familiar algorithms: the mean-shift-based algorithm, an efficient graph-based segmentation algorithm, a hybrid algorithm that combines the strengths of both methods, and expectation maximization. We used the 300 images in the publicly available Berkeley Segmentation Data Set to validate the measure.


R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward Objective Evaluation of Image Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, June, 2007, pp. 929-944.

R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward Objective Evaluation of Image Segmentation Algorithms, Tech. Report CMU-RI-TR-05-40, Robotics Institute, Carnegie Mellon University, September, 2005.

R. Unnikrishnan, C. Pantofaru, and M. Hebert. A Measure for Objective Evaluation of Image Segmentation Algorithms. Workshop on Empirical Evaluation Methods in Computer Vision, IEEE Conference on Computer Vision and Pattern Recognition (CVPR '05), June, 2005.

R. Unnikrishnan and M. Hebert.  Measures of Similarity. Seventh IEEE Workshop on Applications of Computer Vision, January, 2005, pp. 394-400.
