Automatic failure diagnosis at the end-user's site. Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, leaving the diagnosis of production run failures unaddressed. We propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. By leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, Triage can provide a detailed diagnosis report. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging, using a variety of existing diagnosis techniques. Further, we propose a new failure diagnosis technique, delta analysis, to identify failure related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead to under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage reduces the total time to fix by almost half.