Cloudflare reveals how it automated empathy to help its SREs

Sources: The Register • One Scales on YouTube • Published 6 months ago

Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world, including the concept of an "error budget" that creates "empathy embedded in automation."

In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale," the internet-taming biz explains that it has built fault-tolerant infrastructure that can continue to operate with "little or no impact" to its services. But as Jet Marsical of Infrastructure Engineering and systems engineers Aakash Shah and Yilin Xiong explain, when servers went down, the Data Center Operations team relied on manual processes to identify dead boxes. Those processes can take "hours for a single server alone and [can] easily consume an engineer's entire day."

Which doesn't work at hyperscale.

Even worse, dead servers sometimes stay powered on, costing Cloudflare money without producing anything of value.