Authors: A. J. Oliner and A. Aiken
Published: International Conference on Supercomputing (ICS), 2010.
When something unexpected happens in a large production system—a program crashes, a node’s performance flags, a power supply overheats—administrators face several problems at once. First, they may be unable to describe the event any more accurately than the approximate time it occurred. Second, they must diagnose the problem using only the data that was recorded when the issue manifested (primarily log files); this data may be noisy and may not describe all components and their interactions. Third, the system may have many components (tens to thousands), and the administrators must identify which components and component interactions are likely to have been involved.
Consider the following example. Users notice that their jobs are failing more frequently. The typical process for a system administrator is to search the job logs to figure out what components were used by these jobs, scour the system logs from those components for any messages that might hint at a cause, and possibly expand the search to other related components based on their expert knowledge of the system. The key observation is that this is fundamentally a search problem—one for which the state-of-practice is primarily manual, tedious, and ad hoc—where the administrator asks, “What components and interactions are likely to be involved with these job failures?” The input to the search is the available measurements from instrumentation and a simple description of the behavior we wish to understand; the goal of the search is to identify the components and interactions that are likely to be involved.
In a paper to appear at ICS 2010, we present a method for using simple user specifications of when and where a problem manifested, together with existing instrumentation, to compute the components and interactions that are likely to be involved with the problem. Our method computes which system components statistically influence the behavior of other components and which components are statistically linked with the problem.
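To make the idea concrete, here is a minimal illustrative sketch of statistical linkage, not the paper's actual algorithm: given only per-component log timestamps and a user-specified window describing roughly when the problem manifested, we can bin each component's message rate over time and rank components by how strongly their signal correlates with a binary problem indicator. All names and data below are hypothetical.

```python
# Illustrative sketch (hypothetical, not QI's actual algorithm): rank
# components by correlation between their log-message rate and a binary
# indicator of when the user says the problem occurred.
from math import sqrt

def rate_signal(timestamps, t_start, t_end, bin_width):
    """Bin message timestamps into a per-interval message-rate signal."""
    n_bins = int((t_end - t_start) / bin_width)
    signal = [0.0] * n_bins
    for t in timestamps:
        i = int((t - t_start) / bin_width)
        if 0 <= i < n_bins:
            signal[i] += 1.0
    return signal

def pearson(x, y):
    """Pearson correlation between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_components(component_logs, problem_windows, t_start, t_end,
                    bin_width=1.0):
    """Score each component's signal against the problem indicator."""
    n_bins = int((t_end - t_start) / bin_width)
    # Binary indicator: 1 in any time bin overlapping a problem window.
    indicator = [0.0] * n_bins
    for (w0, w1) in problem_windows:
        for i in range(n_bins):
            b0 = t_start + i * bin_width
            if b0 < w1 and b0 + bin_width > w0:
                indicator[i] = 1.0
    scores = {
        name: pearson(rate_signal(ts, t_start, t_end, bin_width), indicator)
        for name, ts in component_logs.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical data: "disk" messages burst inside the problem window [4, 6).
logs = {
    "disk":    [0.5, 4.1, 4.3, 4.7, 5.2, 5.8],
    "network": [0.2, 1.1, 2.3, 3.4, 7.1, 8.9],
}
ranking = rank_components(logs, [(4.0, 6.0)], t_start=0.0, t_end=10.0)
```

This toy ranking captures only the "linked with the problem" half of the method; inferring which components influence one another requires comparing component signals against each other as well, which the paper develops in full.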
Our system, QI (pronounced ‘chee’), does not require modifications or perturbations to the system, access to source code, or even knowledge of all the components in the system or their dependencies on one another. Our assumptions are considerably weaker than those of most previous work, and they reflect, in our experience, the reality faced by administrators when they must diagnose a problem. The answers QI provides are limited by these constraints: a passive, black-box technique can, at best, suggest the components and interactions that seem statistically most likely to be involved with a problem. The main advantage is that, because of the weak assumptions, such a system can leverage all of the information available. This is precisely what our method provides, and it does so in a way that is computationally efficient and applicable to a wide variety of systems.
In particular, we evaluate QI using nearly 1.22 billion lines of log data from unmodified production systems: four supercomputers, two embedded systems, and a server cluster. On these data, we correctly answer a wide variety of exploratory and diagnostic questions about dynamic system behavior, usually in a couple of seconds.