Research

I am a third-year Ph.D. student in the Department of Computer Science at Stanford University, advised by Alex Aiken. I am a DOE HPCS Fellow and an Honorary Stanford Graduate Fellow. My research spiel and lists of publications and presentations can be found below; you can also check out my resumé.

Office: Gates 420
Email: lastname at cs dot stanford dot edu

Spiel

In order to understand complex systems, we must discern dependencies among components. My research takes steps toward accomplishing this by applying two important insights: (1) anomalies that are correlated in time across components are almost certainly indicative of a shared influence, and (2) the timing of events in a system can reveal the semantics of their behavior.

My previous work has addressed challenges in high-performance computing by making systems reliability and robustness a first-class research focus. I have designed more robust algorithms for job scheduling, techniques for identifying and predicting faults, and methods for leveraging those predictions to significantly improve checkpointing and Quality of Service. I have also conducted the most extensive study of system logs to date.

Publications

  1. Bad Words: Finding Faults in Spirit’s Syslogs. J. Stearley and A. J. Oliner. In Workshop on Resiliency in High-Performance Computing (Resilience-2008), Lyon, France, 2008. [pdf]
  2. RA: ResearchAssistant for the Computational Sciences. D. Ramage and A. J. Oliner. In Workshop on Experimental Computer Science (ExpCS), San Diego, CA, 2007. [pdf] [slides]
  3. What Supercomputers Say: A Study of Five System Logs. A. J. Oliner and J. Stearley. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Edinburgh, UK, 2007. [pdf] [slides]
  4. Cooperative Checkpointing: A Robust Approach to Large-scale Systems Reliability. A. J. Oliner, L. Rudolph, R. K. Sahoo. In Proceedings of the 20th Annual International Conference on Supercomputing (ICS), Cairns, Australia, June 2006. [pdf] [slides]
  5. Evaluating Cooperative Checkpointing for Supercomputing Systems. A. J. Oliner, R. K. Sahoo. In Proceedings of IPDPS, Workshop on System Management Tools for Large-Scale Parallel Systems, Rhodes Island, Greece, April 2006. [pdf] [slides]
  6. Cooperative Checkpointing Theory. A. J. Oliner, L. Rudolph, R. K. Sahoo. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), Rhodes Island, Greece, April 2006. [pdf] [slides]
  7. Cooperative Checkpointing for Supercomputing Systems. A. J. Oliner. Master of Engineering thesis at MIT, 2005. Advised by L. Rudolph. [pdf]
  8. Probabilistic QoS Guarantees for Supercomputing Systems. A. J. Oliner, L. Rudolph, R. K. Sahoo, J. E. Moreira, M. Gupta. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 2005. [pdf] [slides]
  9. Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems. A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta. In Proceedings of the First Workshop on System Management Tools for Large-Scale Parallel Systems at the International Parallel and Distributed Processing Symposium (IPDPS), Denver, CO, 2005. [pdf] [slides]
  10. Fault-aware Job Scheduling for BlueGene/L Systems. A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, A. Sivasubramaniam. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), Santa Fe, NM, April 2004. [pdf]
  11. Critical Event Prediction for Proactive Management in Large-scale Computer Clusters. R. Sahoo, A. Oliner, I. Rish, M. Gupta, J. Moreira, S. Ma, R. Vilalta, A. Sivasubramaniam. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, August 2003. [pdf]
  12. Autonomic Computing Features for Large-scale Server Management and Control. R. K. Sahoo, I. Rish, A. J. Oliner, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta and A. Sivasubramaniam. In the IJCAI-03 Workshop on AI and Autonomic Computing, Acapulco, Mexico, August 2003. [pdf]
  13. An Overview of The BlueGene/L Supercomputer. The BlueGene/L Team. In Proceedings of Supercomputing and IBM Research Report, 2002. [pdf]

Presentations

Note: Slides from conference and workshop presentations are linked next to the associated papers, above.

  1. Why Stanley Swerved: Correlated Anomalies in an Autonomous Vehicle. A. J. Oliner. Invited Talk at Open Source Quality (OSQ) Retreat. May 15, 2008. [slides]
  2. A Scientific Approach to Systems Reliability. A. J. Oliner. Invited Talk at IBM Conference on Interaction between Architecture, Circuits, and Compilers (P=ac^2). April 1, 2008. [slides]
  3. Syzygy: Community Epidemic Detection. A. J. Oliner, N. Semsarilar, and A. Aiken. Application Communities Project, DARPA PI Meeting. July 10, 2007. [slides]
  4. Anomalies in Complex Systems. A. J. Oliner and A. Aiken. Presented to Stanford DARPA Grand Challenge Team. March 08, 2007. [slides]
  5. Leveraging Communities to Control Epidemics. A. J. Oliner, N. Semsarilar, H. Saidi, and A. Aiken. Vernier Project, DARPA Site Visit. April 12, 2007. [slides]
  6. Intelligent High Performance Computing. A. J. Oliner, R. K. Sahoo, J. E. Moreira, and M. Gupta. SQUALL Lunch Talk, CMU. November 09, 2004.