Semantic File Systems
David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, James W. O'Toole
Operating Systems Review, v25 n5 1991
(summary by Ross A. Knepper)

The semantic file system is an information storage system that sits halfway between a traditional hierarchical file system and a traditional database. Hierarchical file systems position data within a tree and rely on human memory and simple naming to locate files. Data in such file systems is not content-addressable. At the opposite end of the spectrum, databases contain records with a number of uniform attributes, and data are accessed by searching attributes rather than by an assigned location. The traditional database model has never been fully appropriate as a file system, primarily because no uniform set of attributes fully and appropriately characterizes all records within a file system.

The semantic file system attempts to bridge the gap between tree-structured file systems and databases by individually generating a set of attributes appropriate for each file. Once all the files in the system have been indexed, a search for a particular value of a given attribute returns only those files that have that value for the attribute.

The primary purpose of the paper is to describe the authors' implementation of a semantic file system that is syntactically compatible with the traditional UNIX hierarchical file system. A directory called /sfs is added at the root of the UNIX file system, as the root from which all queries begin. From this directory, queries are made by specifying pairs of directory names. The first directory name specifies the field to search, and it typically ends with a colon. The second directory name specifies the value of that field to search for. Further searches may be done inside the resulting directory to narrow the search.
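
As a rough illustration of the naming scheme just described, the following Python sketch splits a path under /sfs into (field, value) pairs. The function name parse_sfs_path is hypothetical, and requiring every field name to end with a colon is a simplifying assumption (the summary says only that it typically does). The authors' server resolves such names inside its NFS handlers, and this sketch does not reproduce that logic.

```python
from typing import List, Optional, Tuple

SFS_ROOT = "/sfs"  # root of all queries, as named in the summary


def parse_sfs_path(path: str) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Split an /sfs path into (field, value) pairs.

    Returns (pairs, pending_field). pending_field is set when the path
    ends on a field directory such as /sfs/owner:, which corresponds to
    asking for the possible values of that field.
    """
    if path != SFS_ROOT and not path.startswith(SFS_ROOT + "/"):
        raise ValueError(f"not a semantic path: {path!r}")
    parts = [p for p in path[len(SFS_ROOT):].split("/") if p]

    pairs: List[Tuple[str, str]] = []
    pending: Optional[str] = None
    for part in parts:
        if pending is None:
            # Assumption for this sketch: a field name always ends with ':'.
            if not part.endswith(":"):
                raise ValueError(f"expected a field name ending in ':', got {part!r}")
            pending = part[:-1]
        else:
            # The component after a field name is the value to match.
            pairs.append((pending, part))
            pending = None
    return pairs, pending


print(parse_sfs_path("/sfs/exports:/lookup_fault/ext:/c"))
print(parse_sfs_path("/sfs/owner:"))
```

Each additional field and value pair in the path narrows the result, which is why a sequence of pairs is a natural representation for a nested query.
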
The use of this system is illustrated in the following example:

    % cd /sfs/exports:/lookup_fault
    % ls -F
    virtdir_query.c@  virtdir_query.o@
    % cd ext:/c
    % ls -F
    virtdir_query.c@
    %

Here, we first search for all files that export the symbol "lookup_fault". These files can be source files, object files, or executables. Next, we search within the results of the first search to determine which of these files also have the extension ".c". Each query result is displayed as a symbolic link to the file in its normal place within the UNIX file system. If the possible completions for a given field are desired, they can be listed as well, as follows:

    % ls -F /sfs/owner:
    jones/  root/  smith/
    %

However, it is important to distinguish between the case of a missing value (as above) and that of a missing field. A listing of /sfs will not list all possible fields, because there is an unlimited number of fields, as specified by user-defined transducers.

A transducer is a program that assigns attributes to a given input file. There is a different transducer for each file type that the user wishes to classify. For a file to reside within the semantic file system, its file type must be identified and the appropriate transducer must be run on it. Some attributes, such as "owner:", may apply equally to every file type and transducer, while others, such as "exports:", may be output by only a specific transducer. The information from these transducers is passed to an indexer database, which performs attribute lookups when a search is requested.

The authors implemented the semantic file system by interposing on top of the Sun NFS protocol. Their server resides on the local machine and communicates with an indexer process. Initially, the entire UNIX file system must be indexed to build up an attribute database. While this process can take considerable time, it needs to be done only once. Afterward, a file need only be re-indexed after it is written.
However, rather than re-transducing a file immediately upon close, the server waits a short period of time, since a closed file is very frequently reopened soon afterward. It is therefore preferable to guarantee only that the data will eventually be fully searchable, rather than to ensure that the index is always up to date.

When the SFS server receives a request for a directory listing, it dynamically generates search queries and passes them to the indexer. The indexer is responsible for returning a list of files satisfying each query. The SFS server is then able to generate virtual directories in the semantic file system on demand and populate them with query results.

The authors present some results from their implementation, which they use to demonstrate that the semantic file system is both fast enough to be usable and capable of enhanced search, allowing people to locate files they would otherwise have had difficulty finding. For typical searches, they find that the initial ls of a directory takes approximately 2 seconds on their Microvax-3, a short enough time that the user does not grow impatient. Subsequent directory accesses are nearly instant with the help of cached data. In contrast, their Vax takes 1 hour, 36 minutes to go through a 326MB file system, resulting in 68MB of transduced data in 7771 files. This slowness is a one-time startup cost, however. The evidence the authors present for effectiveness and usability is somewhat more anecdotal.

In conclusion, the authors claim to have demonstrated that a semantic file system is an efficient and effective solution to the problem of content-addressable file access. Furthermore, they have demonstrated that such a file system can effectively be layered on top of a traditional hierarchical UNIX file system, using overloading of symbols and inclusion of virtual directories.
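
The division of labor summarized above (transducers emit attributes, an indexer stores them, and the server turns query results into virtual directories of links) can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the class Index, the two transducers, the crude rule for detecting exported C symbols, and the /src paths are all hypothetical and not taken from the paper. The sketch also ignores NFS, the delayed re-indexing policy, and link-name collisions.

```python
import os
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Attribute = Tuple[str, str]  # (field, value)
Transducer = Callable[[str, str], Iterable[Attribute]]  # (path, text) -> attributes


def ext_transducer(path: str, text: str) -> Iterable[Attribute]:
    """Type-independent transducer: emit the file extension as an attribute."""
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext:
        yield ("ext", ext)


def c_exports_transducer(path: str, text: str) -> Iterable[Attribute]:
    """Toy C transducer: a line starting with 'name(' counts as an export."""
    if not path.endswith(".c"):
        return
    for line in text.splitlines():
        head = line.split("(", 1)[0]
        if "(" in line and head.isidentifier():
            yield ("exports", head)


class Index:
    """In-memory stand-in for the indexer: maps attributes to sets of files."""

    def __init__(self, transducers: Iterable[Transducer]) -> None:
        self.transducers = list(transducers)
        self.files_by_attr: Dict[Attribute, Set[str]] = defaultdict(set)
        self.attrs_by_file: Dict[str, Set[Attribute]] = {}

    def index_file(self, path: str, text: str) -> None:
        """(Re)index one file: drop its old attributes, then re-run transducers."""
        for attr in self.attrs_by_file.pop(path, set()):
            self.files_by_attr[attr].discard(path)
        attrs = {a for t in self.transducers for a in t(path, text)}
        self.attrs_by_file[path] = attrs
        for attr in attrs:
            self.files_by_attr[attr].add(path)

    def query(self, pairs: List[Attribute]) -> Set[str]:
        """Return the files that match every (field, value) pair."""
        result = set(self.attrs_by_file)
        for pair in pairs:
            result &= self.files_by_attr.get(pair, set())
        return result

    def values(self, field: str) -> List[str]:
        """Values of a field held by at least one file, as in 'ls /sfs/owner:'."""
        return sorted({v for (f, v), files in self.files_by_attr.items()
                       if f == field and files})


def virtual_directory(index: Index, pairs: List[Attribute]) -> Dict[str, str]:
    """Entries of a virtual directory: link name -> path of the real file."""
    return {os.path.basename(p): p for p in sorted(index.query(pairs))}


index = Index([ext_transducer, c_exports_transducer])
index.index_file("/src/virtdir_query.c", "lookup_fault(int x)\n{\n}\n")
index.index_file("/src/main.c", "main(void)\n{\n}\n")
print(virtual_directory(index, [("exports", "lookup_fault"), ("ext", "c")]))
print(index.values("ext"))
```

Calling index_file again on a path replaces its old attributes, which mirrors the point that a file only needs re-indexing after it has been written.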