Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world, including the concept of an "error budget" that creates "empathy embedded in automation."
In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale," the Internet taming biz explains that it built fault-tolerant infrastructure that can continue to operate with "little or no impact" to its services. But as explained by Jet Marsical, CTO of Infrastructure Engineering, and Systems Engineers Aakash Shah and Yilin Xiong, when servers went down, the Data Center Operations team relied on manual processes to identify dead boxes. And these processes can take "hours for a single server alone and [can] easily consume an engineer's entire day."
Which doesn't work at hyperscale.
Even worse, dead servers sometimes stay on, costing Cloudflare money without producing anything of value.