Efficient Monitoring of Operating Systems

Tools for monitoring the behavior of an operating system are invaluable in performance tuning and debugging. However, monitoring operating systems under realistic workloads is a difficult problem. The objective of this project is to develop a monitoring tool that allows the performance of the system to be studied under such workloads.

Unlike previous hardware and software monitoring tools, the tool proposed here can collect long execution traces with very little perturbation to the system being monitored. The duration of the monitored execution is estimated to be orders of magnitude longer than is possible with any existing technique. The tool allows system performance to be studied under real-time workloads and network activity, both of which are very difficult to monitor with existing software techniques because of their excessive processing overhead. The implementation is entirely in software and relies on the instruction counters that are becoming available in an increasing number of modern processors. Because monitoring has very little effect on the execution of the system, it can be used to collect very long traces while the system is in production use. This will allow more ambitious performance studies than are possible today. The monitor can also be used for debugging during the final testing phases to uncover intermittent bugs. These bugs, also known as Heisenbugs, are the most difficult to discover and are the reason for the perceived unreliability of systems software in the user community. Unfortunately, techniques for uncovering these bugs once the system is in production use do not exist. The work presented here offers a fresh approach to this important problem.

Extensions of these ideas to rollback-recovery and distributed systems will also be investigated. The work has the potential to support performance studies of distributed and parallel programs, which constitute the next challenging frontier.