Multi-Region failover with ease: The Netflix way
Running in multiple regions is better for your users through increased availability and lower latencies, and it won’t cost as much as you think. We’ve turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach, and our understanding, as we’ve matured.
Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it’s a matter of routine that usually concludes with a brief “all is well” email.
This talk dives into the experiences of operating in multiple regions at scale and the algebraic models, code, and incident management playbooks we’ve developed to tame, refine, and leverage our approach. Once you’ve decided to go multi-region, three major questions arise: How many regions? How should we steer users to regions? How do we actually perform the failover? In addition to the story of how we got to where we are, I’ll present the design considerations and system models we used to make those decisions.