\section{MHFS Design Issues}
\label{issues}

 The \B\ structure allows for relatively fast lookups (the tree has a
maximum of 8 levels), which leads to fast {\tt stat} operations
because, as described in section \ref{description}, all the file
information is stored in the \C.  This is particularly useful on the
native Macintosh because path traversal involves {\tt stat}ing every
file at each directory level \footnote{For example, traversing the
path {\bf a:b:c} in either a dialog box or Finder involves opening
{\bf a}, {\tt stat}ing all its items, having the user select {\bf b},
then {\tt stat}ing all the items in {\bf b} \ldots}.  However,
inserting and deleting file system metadata can be very expensive.
There are several other issues that arose in our implementation.

\subsection{\B\ Complexity}

A large part of the B-Tree code is needed to handle extremely uncommon
cases which could have been eliminated if the MHFS design were less
flexible.  Complexity is added by the fact that each \B\ has an
internal, non-fixed size and noncontiguous bitmap spread among the
regular nodes. This bitmap records the allocation status of the \B\
nodes. This extra code is involved only if the number of files is
extremely large (30,000) because the first portion of the bitmap is
stored at a known location.  If the number of B-Tree records were
bounded, the complete bitmap could be stored at a known location.

  The feature that is responsible for most of the extra complexity of
the B-Tree code is that both trees are files, which can dynamically
grow and are not necessarily contiguous. The first three extent
descriptors of both B-Trees are stored in the volume information
block.  Any further extent descriptors are stored in the \E.  This
feature is not documented so we tried to experiment with a native
Macintosh file system. We noticed that the \C\ grows relatively
quickly because of the high size of a catalog record (5 records in a
node on average) while the \E\ grows very slowly because of a general
low fragmentation and because the extent record is small (40 records
in a node on average).  Our experiments have shown that the additional
extents for \C\ are stored in the \E.  We were unable to sufficiently
fragment a native Macintosh volume to cause the \E to grow beyond
three extents.  Therefore, in our implementation we refuse to grow the
extents tree beyond the three extents that can be stored in the volume
information block.  Growing the \E beyond three extents and storing
the additional extent descriptors in the \E\ itself would lead to
circularity and possible deadlock.

  Storing the extents of the \C\ in \E\ can lead to very expensive
metadata accesses even in the case of a simple lookup. This can happen
when the path from the root to the target leaf contains nodes located
in different extents that must be searched in the \E. To alleviate the
problem we currently cache the \C's extent descriptors.

\subsection{Reentrancy}

  Making an MHFS-compatible file system reentrant and highly
concurrent amounts to permitting concurrent operations on the B-Trees.
This turns out to be a difficult task.  We note that Apple's
implementation is not reentrant.

  Fine grain synchronization -- at the B-Tree node level -- would
maximize concurrency.  Algorithms exist for such synchronization, but
they determine whether a node needs to be locked or not (unsafe versus
safe) by exploiting bounds on the number of keys in a node.  For
example, \cite{Bayer} notes that for a degree $d$ B-Tree, a node is
safe on insert if it has fewer than $2d$ keys because if one of its
children needs to be split, there is room for the middle key on the
node.  However, the MHFS B-Tree nodes hold variable length keys,
making it much more expensive to determine the whether there is enough
room for another key.  Further, growing an MHFS B-Tree into a new extent
requires a second, coarser level of synchronization.

%
% The truth
%
  Currently, we simply lock the entire \C\ or \E\ when used.  This
does not limit concurrency much because catalog records are
instantiated in memory when a file is opened, and few files have more
than the three extent descriptors stored in the catalog record.  Even
on a heavily fragmented volume, concurrency between the \C\ and \E\
remains.

 

%  We are in the process of implementing a multiple readers/one writer
%exclusion algorithm that blocks the entire \B\ when one writer needs
%to alter the tree but allows for multiple threads performing searches.
%This solution should perform well because insertions and deletions in
%the \C\ occur only when creating or deleting files and are much less
%frequent that lookup operations.  The \E\ is not usually on the critical
%path, so this will not affect \E\ operations.

  Our conclusion is that the designers of the Macintosh file system
considered the flexibility of the file system data structures the most
important issue. We feel that with simpler and possibly fixed size and
position data structures the file system would perform better and
would also allow for better reentrant implementations.






