
Evaluation

In order to evaluate our approach we need to assess how well the automatically learned ontologies reflect a given domain. One possibility would be to compute how many of the superconcept relations in the automatically learned ontology are correct, as done for example by [33] or [8]. However, because our approach, like many others (compare [34,46,30]), does not produce appropriate names for the abstract concepts it generates, it seems difficult to assess the validity of a given superconcept relation. Another possibility is to compute how 'similar' the automatically learned concept hierarchy is to a given reference hierarchy for the domain in question. The crucial question here is how to define similarity between concept hierarchies. Though there is a great deal of work in the AI community on computing the similarity between trees [63,28], concept lattices [4], conceptual graphs [42,45] and (plain) graphs [11,64], it is not clear how these similarity measures translate to concept hierarchies. An interesting work along these lines is the one presented by [40], in which ontologies are compared along different levels: semiotic, syntactic and pragmatic. In particular, the authors present measures to compare the lexical and taxonomic overlap between two ontologies. Furthermore, they also present an interesting study in which different subjects were asked to model a tourism ontology. The resulting ontologies are compared in terms of the defined similarity measures, thus quantifying the agreement of different subjects on the task of modeling an ontology. In order to formally define our evaluation measures, we introduce a core ontology model in line with the ontological model presented by [59]:

Definition 6 (Core Ontology)
A core ontology is a structure $O := (C, \mathit{root}, \leq_C)$ consisting of (i) a set $C$ of concept identifiers, (ii) a designated root element $\mathit{root}$ representing the top element of the (iii) partial order $\leq_C$ on $C \cup \{\mathit{root}\}$ such that $\forall c \in C:\; c \leq_C \mathit{root}$, called concept hierarchy or taxonomy.
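To make the definition concrete, here is a minimal Python sketch of this structure; the class and attribute names are our own illustrative choices, not part of the paper, and the partial order is induced by child-to-parent edges:

```python
# Hypothetical sketch of Definition 6: a set of concept identifiers,
# a designated root, and a partial order induced by child -> parent edges.
from dataclasses import dataclass


@dataclass
class CoreOntology:
    concepts: set          # C, the concept identifiers
    parent: dict           # child -> parent edges inducing <=_C
    root: str = "root"     # designated top element

    def leq(self, a, b):
        # a <=_C b: reflexive-transitive closure of the parent edges;
        # every concept is below the designated root.
        while a != b and a != self.root:
            a = self.parent.get(a, self.root)
        return a == b


o = CoreOntology({"car", "vehicle"}, {"car": "vehicle", "vehicle": "root"})
print(o.leq("car", "vehicle"), o.leq("vehicle", "root"))  # True True
```

This encodes only tree-shaped hierarchies; a general partial order would need a set of parents per concept.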

For the sake of notational simplicity we adopt the following convention: given an ontology $O$, the corresponding set of concepts will be denoted by $C_O$ and the partial order representing the concept hierarchy by $\leq_O$. It is important to mention that in the approach presented here, terms are directly identified with concepts, i.e. we neglect the fact that terms can be polysemous. Now, the Lexical Recall (LR) of two ontologies $O_1$ and $O_2$ is measured as follows:

$LR(O_1, O_2) := \frac{|C_{O_1} \cap C_{O_2}|}{|C_{O_2}|}$
Take for example the concept hierarchies depicted in Figure 4. In this example, the Lexical Recall is the proportion of concepts of the reference hierarchy that also appear in the automatically learned one.
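As a sketch, Lexical Recall can be computed as follows; the toy concept sets below are illustrative and not those of Figure 4:

```python
# Lexical Recall: fraction of reference concepts also present in the
# learned hierarchy (toy concept sets, not those of Figure 4).
def lexical_recall(learned, reference):
    return len(set(learned) & set(reference)) / len(set(reference))


lr = lexical_recall({"car", "bike", "driveable"},
                    {"car", "bike", "vehicle", "excursion"})
print(lr)  # 2 shared concepts out of 4 reference concepts -> 0.5
```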
In order to compare the taxonomies of two ontologies, we use the Semantic Cotopy (SC) presented by [40]. The Semantic Cotopy of a concept is defined as the set of all its super- and subconcepts:

$SC(c_i, O) := \{c_j \in C_O \mid c_i \leq_O c_j \vee c_j \leq_O c_i\}$
In what follows we illustrate these and other definitions on the basis of several example concept hierarchies. Take for instance the concept hierarchies in Figure 5. We assume that the left concept hierarchy has been automatically learned with our FCA approach and that the concept hierarchy on the right is a handcrafted one. Further, it is important to point out that the left ontology is, in terms of the arrangement of the leaf nodes and abstracting from the labels of the inner nodes, a perfectly learned concept hierarchy. This should thus be reflected by a maximum similarity between both ontologies. The Semantic Cotopy of the concept vehicle in the right ontology in Figure 5 is, for example, {car, bike, two-wheeled vehicle, vehicle, object-to-rent}, and the Semantic Cotopy of driveable in the left ontology is {bike, car, rideable, driveable, rentable, bookable}.
It thus becomes clear that comparing the cotopies of both concepts will not yield the desired result, i.e. a maximum similarity between both concepts. We therefore use a modified version SC' of the Semantic Cotopy in which we only consider the concepts common to both concept hierarchies (compare [14,15]), i.e.

$SC'(c_i, O_1, O_2) := \{c_j \in C_{O_1} \cap C_{O_2} \mid c_i \leq_{O_1} c_j \vee c_j \leq_{O_1} c_i\}$
By using this Common Semantic Cotopy we thus exclude from the comparison concepts such as runable, offerable, needable, activity, vehicle etc. which appear in only one ontology. So the Common Semantic Cotopy of the concepts vehicle and driveable is identical in both ontologies in Figure 5, i.e. {bike, car}, thus representing a perfect overlap between both concepts, which certainly corresponds to our intuitions about their similarity. However, let's now consider the concept hierarchy in Figure 6. The common cotopy of the concept bike is {bike} in both concept hierarchies. In fact, every leaf concept in the left concept hierarchy has a maximum overlap with the corresponding concept in the right ontology. This is certainly undesirable and in fact leads to very high baselines when comparing such trivial concept hierarchies with a reference standard (compare our earlier results [14,15]). Thus, we introduce a further modification of the Semantic Cotopy by excluding the concept itself from its Common Semantic Cotopy, i.e.:

$SC''(c_i, O_1, O_2) := \{c_j \in C_{O_1} \cap C_{O_2} \mid c_i <_{O_1} c_j \vee c_j <_{O_1} c_i\}$
This maintains the perfect overlap between vehicle and driveable in the concept hierarchies in Figure 5, while yielding empty common cotopies for all the leaf concepts in the left ontology of Figure 6.
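The three cotopy variants can be sketched as follows; the hierarchy below is a child-to-parent map loosely following the right-hand ontology of Figure 5 (labels taken from the text, structure simplified, so the full figure may contain further concepts):

```python
# Hedged sketch of SC, SC' and SC'' over a tree given as child -> parent.
def ancestors(c, parent):
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out


def semantic_cotopy(c, parent):
    # SC: the concept itself plus all its super- and subconcepts
    subs = {x for x in parent if c in ancestors(x, parent)}
    return {c} | ancestors(c, parent) | subs


def common_cotopy(c, parent, shared):
    # SC': restrict the Semantic Cotopy to concepts in both hierarchies
    return semantic_cotopy(c, parent) & shared


def strict_common_cotopy(c, parent, shared):
    # SC'': additionally exclude the concept itself
    return common_cotopy(c, parent, shared) - {c}


right = {"car": "vehicle", "bike": "two-wheeled vehicle",
         "two-wheeled vehicle": "vehicle", "vehicle": "object-to-rent"}
print(sorted(semantic_cotopy("vehicle", right)))
# ['bike', 'car', 'object-to-rent', 'two-wheeled vehicle', 'vehicle']
print(sorted(strict_common_cotopy("vehicle", right, {"car", "bike"})))
# ['bike', 'car']
```

The first output matches the Semantic Cotopy of vehicle given in the text; restricting to the shared leaf concepts leaves exactly {bike, car}.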
Now, according to Maedche et al., the Taxonomic Overlap $TO(O_1, O_2)$ of two ontologies $O_1$ and $O_2$ is computed as follows:

$TO(O_1, O_2) := \frac{1}{|C_{O_1}|} \sum_{c \in C_{O_1}} TO(c, O_1, O_2)$

where

$TO(c, O_1, O_2) := \begin{cases} TO'(c, O_1, O_2) & \text{if } c \in C_{O_2} \\ TO''(c, O_1, O_2) & \text{if } c \notin C_{O_2} \end{cases}$

and TO' and TO'' are defined as follows:

$TO'(c, O_1, O_2) := \frac{|SC(c, O_1) \cap SC(c, O_2)|}{|SC(c, O_1) \cup SC(c, O_2)|}$

$TO''(c, O_1, O_2) := \max_{c' \in C_{O_2}} \frac{|SC(c, O_1) \cap SC(c', O_2)|}{|SC(c, O_1) \cup SC(c', O_2)|}$
So, TO' gives the similarity between concepts which are in both ontologies by comparing their respective semantic cotopies. In contrast, TO'' gives the similarity between a concept $c$ and that concept $c'$ in the other ontology which maximizes the overlap of the respective semantic cotopies, i.e. it makes an optimistic estimation assuming an overlap that just does not happen to show up at the immediate lexical surface (compare [40]). The Taxonomic Overlap between the two ontologies is then calculated by averaging over all the taxonomic overlaps of the concepts in $O_1$. In our case it does not make sense to calculate the Semantic Cotopy for concepts which are in both ontologies, as these represent leaf nodes and their common semantic cotopies $SC''$ are thus empty. We therefore calculate the Taxonomic Overlap between two ontologies as follows:

$\overline{TO}(O_1, O_2) := \frac{1}{|C_{O_1} \setminus C_{O_2}|} \sum_{c \in C_{O_1} \setminus C_{O_2}} \max_{c' \in C_{O_2}} \frac{|SC''(c, O_1, O_2) \cap SC''(c', O_2, O_1)|}{|SC''(c, O_1, O_2) \cup SC''(c', O_2, O_1)|}$
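This direction-wise overlap can be sketched as follows; the two hierarchies are child-to-parent maps loosely following Figure 5 (labels assumed, structure simplified), and the sketch averages the best Jaccard overlap of the $SC''$ cotopies over the concepts appearing in only one ontology:

```python
# Minimal sketch of the ontology-level Taxonomic Overlap over SC''.
def ancestors(c, parent):
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out


def sc2(c, parent, shared):
    # SC'': shared super- and subconcepts, excluding the concept itself
    subs = {x for x in parent if c in ancestors(x, parent)}
    return (ancestors(c, parent) | subs) & (shared - {c})


def concepts(parent):
    return set(parent) | set(parent.values())


def taxonomic_overlap(p1, p2):
    c1, c2 = concepts(p1), concepts(p2)
    shared = c1 & c2
    only1 = c1 - c2          # concepts of O1 that are not in O2
    def jac(a, b):
        return len(a & b) / len(a | b) if a or b else 1.0
    if not only1:
        return 1.0
    return sum(max(jac(sc2(c, p1, shared), sc2(c_, p2, shared))
                   for c_ in c2)
               for c in only1) / len(only1)


learned = {"car": "driveable", "bike": "rideable",
           "rideable": "driveable", "driveable": "rentable"}
reference = {"car": "vehicle", "bike": "two-wheeled vehicle",
             "two-wheeled vehicle": "vehicle", "vehicle": "object-to-rent"}
p = taxonomic_overlap(learned, reference)
r = taxonomic_overlap(reference, learned)
print(p, r)  # 1.0 1.0 -- a perfectly learned tree in both directions
```

Computing the overlap once in each direction is exactly what the precision and recall introduced next do.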
Finally, as we do not only want to compute the Taxonomic Overlap in one direction, we introduce precision, recall and an F-Measure calculating the harmonic mean of both:

$P(O_1, O_2) := \overline{TO}(O_1, O_2)$

$R(O_1, O_2) := \overline{TO}(O_2, O_1)$

$F(O_1, O_2) := \frac{2 \cdot P(O_1, O_2) \cdot R(O_1, O_2)}{P(O_1, O_2) + R(O_1, O_2)}$

where $O_1$ denotes the automatically learned and $O_2$ the reference concept hierarchy.
The importance of balancing recall and precision against each other will become clear in the discussion of a few examples below. Let's consider for example the concept hierarchy in Figure 5. For the five concepts bookable, joinable, rentable, driveable and rideable we find a corresponding concept in the reference hierarchy with a maximum Taxonomic Overlap, and the other way round for the concepts activity, object-to-rent, vehicle and two-wheeled-vehicle, such that precision, recall and F-Measure are all 100%.
In the concept hierarchy shown in Figure 7 the precision is still 100% for the same reasons as above, but due to the fact that the rideable concept has been removed there is no corresponding concept for two-wheeled-vehicle. The concept maximizing the taxonomic similarity for two-wheeled-vehicle is driveable, with a Taxonomic Overlap of 0.5. The recall thus drops to $(3 + 0.5)/4 = 87.5\%$ and the F-Measure decreases to $\frac{2 \cdot 1 \cdot 0.875}{1.875} \approx 93.33\%$.
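The arithmetic for this Figure 7 scenario can be checked with a short sketch, assuming the four reference-only concepts discussed above, of which three are matched perfectly and one (two-wheeled-vehicle) at 0.5:

```python
# Checking the Figure 7 arithmetic: precision stays at 1.0, recall
# averages four per-concept overlaps of which one dropped to 0.5.
def f_measure(p, r):
    return 2 * p * r / (p + r)


precision = 1.0
recall = (1.0 + 1.0 + 1.0 + 0.5) / 4      # 0.875
print(recall, round(f_measure(precision, recall), 4))  # 0.875 0.9333
```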
In the concept hierarchy of Figure 8, an additional concept planable has been introduced, which reduces the precision, while the recall obviously stays the same, so that the F-Measure also decreases. It thus becomes clear why it is important to measure both the precision and the recall of the automatically learned concept hierarchies and to balance them against each other by the harmonic mean or F-Measure. The precision, recall and F-Measure of the automatically learned concept hierarchy in Figure 4 are computed analogously.
As a comparison, for the trivial concept hierarchy in Figure 6 the precision is maximal per definition, while recall and F-Measure are lower. It is important to mention that though in our toy examples the difference with respect to these measures between the automatically learned concept hierarchy and the trivial concept hierarchy is not very big, when considering real-world concept hierarchies with a much higher number of concepts the F-Measures for trivial concept hierarchies will clearly be very low (see the results in Section 6). Finally, we also calculate the harmonic mean of the Lexical Recall and the F-Measure as follows:

$F'(O_1, O_2) := \frac{2 \cdot LR(O_1, O_2) \cdot F(O_1, O_2)}{LR(O_1, O_2) + F(O_1, O_2)}$
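This combined measure is a plain harmonic mean of the two scores; a one-line sketch with hypothetical input values:

```python
# F': harmonic mean of Lexical Recall and F-Measure (inputs hypothetical).
def f_prime(lexical_recall, f_measure):
    return 2 * lexical_recall * f_measure / (lexical_recall + f_measure)


print(round(f_prime(0.5, 0.9), 4))  # 2*0.45/1.4 -> 0.6429
```

As with any harmonic mean, a low Lexical Recall drags $F'$ down even when the taxonomic F-Measure is high.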
For the automatically learned concept hierarchy of Figure 4, we thus obtain the corresponding $F'$ value from its Lexical Recall and F-Measure.
Philipp Cimiano 2005-08-04