Semantic File Systems

David K. Gifford, Pierre Jouvelot, Mark A. Sheldon, James W. O'Toole
Operating Systems Review, v25 n5 1991
(summary by Ross A. Knepper)

The semantic file system is an information storage system which is 
halfway between a traditional hierarchical file system and a 
traditional database.  Hierarchical file systems position data within 
a tree and rely on human memory and simple naming to locate files.  
Data in such file systems is not content-addressable.  On the opposite 
end of the spectrum, databases contain records with a number of 
uniform attributes, and data are accessed by searching attributes 
rather than by an assigned location.  The traditional database model 
has never been fully suited to the task of a file system, 
primarily because there is no uniform set of attributes which fully 
and appropriately characterizes all records within a file system.  
The semantic file system attempts to bridge the gap between 
tree-structured file systems and databases by individually generating 
a set of attributes appropriate for each file.  Once all the files 
in the system have been properly indexed, searches can be done for 
a particular value of a given attribute, producing only those files 
which have the appropriate value of the specified attribute.
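
The attribute search described above can be sketched as a small 
inverted index.  This is only an illustration, not the authors' 
indexer: the class name AttributeIndex, its methods, and the file 
paths are all hypothetical.

```python
from collections import defaultdict


class AttributeIndex:
    """Toy index mapping (field, value) pairs to sets of file paths."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, path, attributes):
        """Record every (field, value) pair that describes `path`."""
        for field, value in attributes:
            self._postings[(field, value)].add(path)

    def lookup(self, field, value):
        """Return the files whose `field` attribute has the given value."""
        return set(self._postings.get((field, value), set()))


# Hypothetical files and attributes, chosen to mirror the summary's examples.
index = AttributeIndex()
index.add("/src/virtdir_query.c",
          [("owner", "smith"), ("ext", "c"), ("exports", "lookup_fault")])
index.add("/src/virtdir_query.o",
          [("owner", "smith"), ("ext", "o"), ("exports", "lookup_fault")])
index.add("/doc/notes.txt", [("owner", "jones"), ("ext", "txt")])

print(sorted(index.lookup("exports", "lookup_fault")))
```

Each file carries its own set of attributes, so no uniform schema 
is needed; a lookup simply returns the files recorded under one 
(field, value) pair.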

The primary purpose of the paper is to describe the authors' 
implementation of a semantic file system which is syntactically
compatible with the traditional UNIX hierarchical file system.  A
directory called /sfs is added at the root of the UNIX file system,
as the root from which all queries begin.  From this directory, 
queries can be made by specifying pairs of directory names.  The 
first directory name specifies the request field, and it typically 
ends with a colon.  The second directory name specifies the value 
of the field which we want to search for.  Further searches may be 
done inside of the resulting directory to narrow the search.  The 
use of this system is illustrated in the following example:

% cd /sfs/exports:/lookup_fault
% ls -F
virtdir_query.c@	virtdir_query.o@
% cd ext:/c
% ls -F
virtdir_query.c@
% 

Here, we first do a search for all files which export the symbol
"lookup_fault".  These files can be code, object files, or
executables.  Next, we do a search based on the results of the 
first search, to determine which of these files also have the 
extension ".c".  Each of the query results is displayed in the 
form of a symbolic link to the file in its normal place within 
the UNIX file system.  If the possible completions for a given 
field are desired, they can be listed as well, as follows:

% ls -F /sfs/owner:
jones/		root/		smith/
%

However, it is important to distinguish between the case of a 
missing value (as above) and that of a missing field.  A listing 
of /sfs will not list all possible fields, because the set of 
fields is unbounded: user-defined transducers can introduce new 
ones.
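
The path convention used in the examples above (alternating field 
and value directory names under /sfs) can be sketched as a small 
parser.  This is a simplified reading of the syntax shown in this 
summary, not the authors' code; the function name parse_query_path 
and its return shape are assumptions.

```python
def parse_query_path(path, root="/sfs"):
    """Split a path under `root` into (field, value) pairs.

    Returns (pairs, pending_field).  `pending_field` is the name of a
    trailing field with no value yet (as in /sfs/owner:), else None.
    """
    if path != root and not path.startswith(root + "/"):
        raise ValueError("not a semantic path: " + path)
    parts = [p for p in path[len(root):].split("/") if p]
    pairs = []
    pending = None
    for part in parts:
        if pending is None:
            # A field name is expected here; in this sketch it must end in ':'.
            if not part.endswith(":"):
                raise ValueError("expected a field name ending in ':', got " + part)
            pending = part[:-1]
        else:
            # The component after a field name is the value to match.
            pairs.append((pending, part))
            pending = None
    return pairs, pending


print(parse_query_path("/sfs/exports:/lookup_fault/ext:/c"))
print(parse_query_path("/sfs/owner:"))
```

A path that ends in a field name, such as /sfs/owner:, leaves a 
pending field, which corresponds to the request to list that 
field's possible values.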

A transducer is a program which assigns attributes to a given 
input file.  There is a different transducer for each file type 
for which the user desires a classification.  For a file to 
reside within the semantic file system, its type must be 
identified, and the appropriate transducer must be run on it.  
Some attributes, such as "owner:", may apply equally to 
every file type and transducer, while others, such as "exports:", 
may be output by only specific transducers.  The information 
from these transducers is passed to an indexer database, which 
does attribute lookups when a search is requested.
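
A transducer can be pictured as a function from a file to a list 
of (field, value) pairs.  The sketch below is hypothetical: the 
function names, the dispatch on file extension, and the naive 
regular expression for C function definitions are all assumptions 
made for illustration, and a real transducer would need a proper 
parser.

```python
import os
import re


def generic_transducer(path, text):
    """Attributes that make sense for any file type (here, only the extension)."""
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext:
        yield ("ext", ext)


# Naive pattern: a non-static definition starting in column 0, e.g.
# "int lookup_fault(int addr)" followed by an opening brace.
_C_FUNC = re.compile(
    r"^(?!static\b)[A-Za-z_][\w \t\*]*?\b([A-Za-z_]\w*)[ \t]*\([^;{}]*\)[ \t\n]*\{",
    re.MULTILINE,
)


def c_transducer(path, text):
    """Type-specific transducer: report exported function names in C source."""
    for match in _C_FUNC.finditer(text):
        yield ("exports", match.group(1))


# Hypothetical registry mapping a file type to its transducer.
TRANSDUCERS_BY_EXT = {"c": c_transducer}


def transduce(path, text):
    """Run the generic transducer, then the type-specific one if any."""
    attrs = list(generic_transducer(path, text))
    ext = os.path.splitext(path)[1].lstrip(".")
    specific = TRANSDUCERS_BY_EXT.get(ext)
    if specific is not None:
        attrs.extend(specific(path, text))
    return attrs


SAMPLE = """static int helper(void) {
    return 0;
}

int lookup_fault(int addr)
{
    return helper() + addr;
}
"""

print(transduce("/src/virtdir_query.c", SAMPLE))
```

The output of transduce is what would be handed to the indexer.  
Adding support for a new file type means registering one more 
function.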

The authors implemented the semantic file system by interposing 
on top of the Sun NFS protocol.  Their server resides on the local 
machine, communicating with an indexer process.  Initially, the 
entire UNIX file system must be indexed to build up an attribute 
database.  While this process can take considerable time, it only 
needs to be done once.  Later, a file only need be re-indexed 
after it is written.  However, rather than immediately re-transducing 
a file upon close, the server waits a short period of time, since 
very frequently a closed file is soon re-opened.  It is 
therefore preferable to guarantee only that the data will eventually 
be fully searchable, rather than ensuring that the index is always 
up-to-date.  
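
The delayed re-indexing policy can be sketched as a per-file quiet 
period that restarts on every close.  The class below is an 
illustration only: the names are hypothetical, the 30-second delay 
is an assumed value (the summary says only "a short period of 
time"), and timestamps are passed in explicitly so the example is 
deterministic.

```python
class DelayedReindexer:
    """Re-index a file only after it has stayed closed for `delay_seconds`."""

    def __init__(self, delay_seconds, reindex):
        self.delay = delay_seconds
        self.reindex = reindex      # callback invoked with the file path
        self._due = {}              # path -> time at which it becomes eligible

    def file_closed(self, path, now):
        """A written file was closed; start or restart its quiet period."""
        self._due[path] = now + self.delay

    def tick(self, now):
        """Re-index every file whose quiet period has elapsed; return them."""
        ready = sorted(p for p, due in self._due.items() if due <= now)
        for path in ready:
            del self._due[path]
            self.reindex(path)
        return ready


done = []
reindexer = DelayedReindexer(30.0, done.append)   # 30 s is an assumed delay
reindexer.file_closed("/src/a.c", now=0.0)
reindexer.file_closed("/src/a.c", now=20.0)       # re-written: timer restarts
early = reindexer.tick(now=35.0)                  # still inside the quiet period
late = reindexer.tick(now=50.0)                   # quiet period over
print(early, late, done)
```

Two closes in quick succession cost a single re-index, at the 
price of a window during which the index is stale.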

When the SFS server receives a request for a directory listing, 
it dynamically generates search queries and passes them to the 
indexer.  The indexer is responsible for returning a list of files 
satisfying each query.  The SFS server is then able to dynamically 
generate virtual directories in the semantic file system on demand 
and populate them with query results.  
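
How a virtual directory might be populated from query results can 
be sketched as follows.  The index contents, the paths, and the 
function name list_virtual_directory are hypothetical; the sketch 
assumes that nested queries are combined by intersecting their 
result sets, and that each entry is a symbolic link named after 
the target file.

```python
import posixpath

# Hypothetical indexer state: (field, value) -> set of matching file paths.
INDEX = {
    ("exports", "lookup_fault"): {"/src/virtdir_query.c", "/obj/virtdir_query.o"},
    ("ext", "c"): {"/src/virtdir_query.c", "/src/main.c"},
    ("ext", "o"): {"/obj/virtdir_query.o"},
}


def list_virtual_directory(pairs, index=INDEX):
    """Return {entry name: symlink target} for a conjunction of (field, value) pairs."""
    matches = None
    for pair in pairs:
        found = index.get(pair, set())
        # Each further pair narrows the result set of the previous ones.
        matches = set(found) if matches is None else matches & found
    if matches is None:
        matches = set()
    # Entries are named after the target's base name; this sketch does not
    # handle two matching files that share a base name.
    return {posixpath.basename(p): p for p in sorted(matches)}


print(list_virtual_directory([("exports", "lookup_fault")]))
print(list_virtual_directory([("exports", "lookup_fault"), ("ext", "c")]))
```

Nothing is stored for the directory itself: the listing is 
recomputed from the index whenever it is requested.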

The authors present some results from their implementation, which 
they use to demonstrate that the semantic file system is both fast 
enough to be usable and provides enhanced search ability, allowing 
people to locate files they would otherwise have had difficulty 
finding.  For typical searches, they find that the initial ls of 
a directory takes approximately 2 seconds on their Microvax-3 -- 
a short enough time that the user does not grow impatient.  
Subsequent directory accesses are nearly instant thanks to 
cached data.  In contrast, indexing is slow: their Vax takes 
1 hour, 36 minutes to go through a 326MB file system, resulting 
in 68MB of transduced data in 7771 files.  This slowness is a 
one-time startup cost, however.  The effectiveness and usability 
evidence which the authors present is somewhat more anecdotal.  

In conclusion, the authors claim that they have demonstrated that a
semantic file system is an efficient and effective solution to the 
problem of content-addressable file access.  Furthermore, they have 
demonstrated that such a file system can effectively be layered on 
top of a traditional hierarchical UNIX file system, using 
overloading of symbols and inclusion of virtual directories.