Archive for the School Category

Authors: A. J. Oliner and A. Aiken

Title: A Query Language for Understanding Component Interactions in Production Systems [pdf] [slides]

Published: International Conference on Supercomputing (ICS), 2010.

When something unexpected happens in a large production system—a program crashes, a node’s performance flags, a power supply overheats—administrators face several problems at once. First, they may be unable to describe the event any more accurately than the approximate time it occurred. Second, they must diagnose the problem using only the data that was recorded when the issue manifested (primarily log files); this data may be noisy and may not describe all components and their interactions. Third, the system may have many components (tens to thousands), and the administrators must identify which components and component interactions are likely to have been involved.

Consider the following example. Users notice that their jobs are failing more frequently. The typical process for a system administrator is to search the job logs to figure out what components were used by these jobs, scour the system logs from those components for any messages that might hint at a cause, and possibly expand the search to other related components based on their expert knowledge of the system. The key observation is that this is fundamentally a search problem—one for which the state-of-practice is primarily manual, tedious, and ad hoc—where the administrator asks, “What components and interactions are likely to be involved with these job failures?” The input to the search is the available measurements from instrumentation and a simple description of the behavior we wish to understand; the goal of the search is to identify the components and interactions that are likely to be involved.

In a paper to appear at ICS 2010, we present a method for using simple user specifications of when and where a problem manifested, together with existing instrumentation, to compute the components and interactions that are likely to be involved with the problem. Our method computes which system components statistically influence the behavior of other components and which components are statistically linked with the problem.

Our system, QI (pronounce ‘chee’), does not require modifications or perturbations to the system, access to source code, or even knowledge of all the components in the system or their dependencies on one another. Our assumptions are considerably weaker than most previous work and they reflect, in our experience, the reality faced by administrators when they must diagnose a problem. The answers QI provides are limited by these contraints: a passive, black-box technique can, at best, suggest the components and interactions that seem statistically most likely to be involved with a problem. The main advantage is that, because of the weak assumptions, such a system can leverage all of the information available. This is precisely what our method provides, and it does so in a way that is computationally efficient and applicable to a wide variety of systems.

In particular, we evaluate QI using nearly 1.22 billion lines of code from unmodified production systems: four supercomputers, two embedded systems, and a server cluster. On these data, we correctly answer a wide variety of exploratory and diagnostic questions about dynamic system behavior, usually in a couple of seconds.

Authors: A. J. Oliner, A. V. Kulkarni, and A. Aiken

Title: Using Correlated Surprise to Infer Shared Influence [pdf] [slides]

Published: International Conference on Dependable Systems and Networks (DSN), 2010.

During the DARPA Grand Challenge race in 2005, the autonomous vehicle named Stanley, Stanford’s entry, slowed down and swerved around an obstacle that was not actually there. Stanley did this several times over the course of the race, nearly causing it to be disqualified. Although Stanley went on to win the competition, the Stanford Racing Team was justifiably vexed: why had Stanley hallucinated these obstacles?

Using a hand-crafted dependency diagram—the golden ideal to which all previous work on dependency inference aspires—it took the designers of the system early two months to isolate the problem, which was originating from a buffer component shared by the laser sensors. This shared buffer was intermittently dropping measurements, causing Stanley to see stale, inconsistent data about the world around him, which sometimes meant seeing obstacles where there were none. Every other component of the system was behaving according to specification, which partially explains why the dependency diagram was so unhelpful: the source of the problem and its outward manifestation (swerving) were on opposite logical ends of the system but the dependency diagram advised looking at almost every component in between. The shared buffer was not even on the diagram.

In a paper to appear at DSN 2010, we introduce the idea of computing influence, a type of component interaction that is orthogonal to dependencies and allows us to capture implicit interactions among components and subsystems. For the Stanley swerving bug, our method not only infers an influence directly between the swerving behavior and misbehavior near the laser sensors, it also implicates an uninstrumented component shared by those lasers: the true cause of the problem. The Racing Team says the results of our analysis, which took only a few seconds to compute, would have saved them two months of debugging.

Computing the strength of shared influence between components is straightforward:

  1. Represent the behavior of each component as a function of surprise over time, called an anomaly signal.
  2. See how well these functions “line-up” using a standard technique called cross-correlation.
  3. Summarize cross-correlations in a Structure-of-Influence Graph (SIG), where the edges indicate the strength and time-delay of the influence.

Our paper gives a mathematical foundation for influence, as described above, and evaluates it using both simulations of idealized systems and case studies with real systems, including Stanley, his successor (Junior), and the Thunderbird supercomputer.

A few weeks ago, my advisor and I made the arduous 10 mile journey to the Googleplex to give a talk about our work on understanding complex systems. This was part of Google’s lecture series called Tech Talks. Alex actually gives the presentation, but I take the blame for the crudely drawn slides. Here’s the abstract:

We propose a method for identifying the sources of problems in complex production systems where, due to the prohibitive costs of instrumentation, the data available for analysis may be noisy or incomplete. In particular, we may not have complete knowledge of all components and their interactions. We define influences as a class of component interactions that includes direct communication and resource contention. Our method infers the influences among components in a system by looking for time-correlated divergence from models of individual component behavior. We summarize the strength and directionality of shared influences using a Structure-of-Influence Graph (SIG). This talk explains how to construct a SIG and use it to isolate performance bugs, and presents both simulations and an in-depth case study using data from two autonomous vehicles.

I’ve been meaning to start this up again for a while, and now seems as good a time as any, what with that change bug going around. From now on, this blog will be more focused on my research and career activities, rather than personal anecdotes and rants. One major reason for the shift in content is also the reason why it’s been so long since I’ve posted; social networking sites have subsumed this site’s primary social function: status updates. Instead of weaving the epic pageantry of graduate student life into a rich tapestry of personal stories, resplendent with charming details, I could just type, “Adam is haha lollerskates,” and let facebook do the rest. So much for that.

Meanwhile, I’ve been keeping busy. Here’s a recent sampling:

  • The US Patent Office granted me a second patent. The first one, meanwhile, has already passed into obscurity.
  • I got an alert detection paper [pdf] accepted to ICDM in Pisa, Italy, where I’ll be going in December to talk about it.
  • I served on the program committee for a new workshop called WASL and am on the program committee for NAS ’09.
  • I applied to graduate… sort of. The requirements for a Master’s are a subset of those for my Ph.D., so I filled out the paperwork to pick up an extra degree. I already have an M.Eng. from MIT, so this one’s redundant and will be obsolete whenever I get my doctorate.
  • I gave a few invited talks, most recently at a workshop at LACSS [slides].
  • I agreed to serve on the CS Department’s faculty search committee.

I’m working on a system modeling paper for submission on Monday. Wish me luck!

For those of you who couldn’t attend DARPATech this week, or had no desire to, or don’t know what it is, please find below a parodic sample talk. Some of it is verbatim, some of the technology is real, and this is more similar to the actual presentations than you think.

[A man in a blue suit strides to the podium, the enormous ballroom is filled to overflowing with scientists and military officers. Cameras are focused on him from all sides, projecting his visage onto the screens behind him and into the many satellite viewing rooms throughout the hotel.]

Good morning. I’m going to talk to you today about the future: a vision of the future as seen by the DARPA Made-Up Technologies Office. The best of the best. DARPA’s DARPA. Rambo to STO‘s Barney Fife.

Imagine a world in which soldiers cannot die. In which their armor adapts to new threats instantaneously, their weapons target flawlessly and inflict the desired damage, and their hair maintains its shine and bounce, even in the harshest of combat conditions. Imagine a world where a global information network is accessible at your fingertips, or even closer, like at your knuckles or wrists. Where you can detect enemies breathing behind concrete walls, clot and repair a bleeding femoral artery with a simple tourniquet, and where a universal replacement part can assume whatever shape or function you desire. A wrench becomes a hammer. Wings take dream.

We at the MUTO are imagining exactly that.

Soldiers must fight in extremes. In the snow dunes of the arctic, the sand drifts of the desert, deep beneath the ocean, on mountain peaks, and, someday, in outer space and in the center of our sun.

[Slide show displays the Sun. Speaker gestures meaningfully.]

Our opponents are smart, capable, well-trained, and fighting on their home turf. Some of them can yodel. Most of our soldiers can barely manage a passable Star Spangled Banner. Our Army Rangers train in the mountains of Georgia, while Afghani fighters are acclimated to altitudes tens of thousands, no, millions of feet higher. Geese can handle these altitudes, why can’t our warriors?

Our enemies have rockets launchers. Some of them have elephants. They may even have figured out how to put rocket launchers on elephants. You can’t prove they haven’t. And when they do, will you be able to say you did everything possible to prepare?

The work we do at MUTO represents not merely fundamentally unique technological achievements, but entirely new fields of research. A calculus of awesomeness, if you will. It revolutionizes not only urban combat, but warfare in its entirety. And also poetry.

Allow me to give you a moment for your brains to stop smoking.

[Stares wistfully into the distance.]

Now that you have some idea of the preponderance of cutting edge research that is discussed at length in our office, let me introduce the next speaker, who will frighten you with outrageously melodramatic nightmare scenarios, entice you with nonexistent but sexy technology, and ease you into a peaceful and meditative state with utopian vistas of the future. Your future.

But only if the money keeps flowing to DARPA. Thank you.

The talks were obviously not the main attraction of the conference, for me. Rather, I enjoyed walking around the exhibit hall and learning about the amazing projects already underway. I especially liked some of the simpler ones, like the sniper rifle equipped with a cross-wind detector, which would indicate where one should aim in order to compensate.

The project I was there to help present is called Vernier, which aims to leverage application communities to detect and control exploits. We had a live demo that showed Vernier successfully detecting, controlling, and recovering from a self-propagating worm as it spread through a community of twenty nodes.

It was strange seeing military officers, including a three-star general complete with military entourage, checking out the latest geeky wares. Then again… there, but for the funding from DARPA, go I.