Principles for Operating Large-Scale Production Systems With AI-Augmented Operations
Introduction
Today’s global digital platforms are powered by hundreds of microservices running behind the frontend that users interact with. These services must all operate at scale and in concert with one another. The end-user experience is therefore determined by the composite availability of these systems, which are engineered so that the overall service keeps operating even when individual subsystems experience outages. At an availability standard of "five nines", systems that are available 99.999% of the time are […]
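To make the 99.999% figure and the idea of composite availability concrete, here is a minimal arithmetic sketch. It is not taken from the article: the function names are hypothetical, failures are assumed to be statistically independent, and a year is taken as 365.25 days. Real systems rarely satisfy the independence assumption, so treat the results as rough bounds rather than predictions.

```python
import math

# Assumption: an average year of 365.25 days, expressed in minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities: list[float]) -> float:
    """Composite availability when a request needs every service to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of a group of independent replicas where any one suffices."""
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    # Downtime budget for a single five-nines system.
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes/year")

    # A request path that depends on 100 five-nines services in series.
    print(f"{serial_availability([0.99999] * 100):.6f}")

    # Two independent replicas of a three-nines service.
    print(f"{redundant_availability(0.999, 2):.6f}")
```

Under these assumptions, a single five-nines system has a budget of roughly five minutes of downtime per year, while a request that must traverse 100 such services in series sees only about three nines. Redundancy works in the opposite direction, which is why the overall service can stay up even when individual subsystems fail.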