At NGINX Conf 2019, we recorded more than 50 sessions covering a wide range of subjects, but in this blog I’ll share takeaways from one of the hottest topics in the industry: Site Reliability Engineering (and the related topic of Chaos Engineering). I’ll focus on just three key takeaways, but you’re encouraged to watch the entire session here.
1. SRE Definition
The conversation started with how the panelists defined the term Site Reliability Engineering, with the consistent comment that it is essentially: “Anything to make sure a site is up and running.” But beyond that, they also emphasized “going really deep and fixing the issue as quickly as possible when any problem occurs” and “empowering development teams with a customer-centric mindset.” Did you notice some similarities with traditional Network Operations teams in those descriptions? Yes, me too, but one panelist read my mind in highlighting that, “Some organizations establish an SRE team just by renaming their Network Ops team, but that is not the best way.” There was some discussion on this point, but my takeaway is that the biggest difference between SRE and NetOps is that SRE personnel “sit on a Dev team or customer-facing team and truly focus on business goals.”
2. Chaos Engineering and Failure Injection
One of the key topics for an SRE function is the concept of Chaos Engineering. I will defer the detailed explanation of Chaos Engineering to this article, but in this session it was framed as “an approach to identify critical failures and get them fixed quickly” – something like a fire drill. The goal of Chaos Engineering is broader than a fire drill’s, though, in that it focuses on quantitatively analyzing recovery, durability, and availability metrics.
Failure Injection is a fairly common method, introduced by Netflix back in 2014. It is a testing approach that pushes failure simulation metadata into the production environment for testing purposes, but in a controlled way. These efforts are typically led by SRE teams to ensure higher availability and reliability of the service (or site).
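To make the idea concrete, here is a minimal Python sketch of controlled failure injection along these lines. It is illustrative only – the X-Fault-Inject header, the FailurePlan fields, and the “recommendations” service name are assumptions for the example, not part of Netflix’s FIT or any particular tool.

```python
import random
import time
from dataclasses import dataclass

@dataclass
class FailurePlan:
    """Describes one controlled failure experiment."""
    target_service: str           # downstream dependency to affect (hypothetical name)
    failure_rate: float = 0.01    # fraction of tagged requests to affect
    added_latency_s: float = 0.0  # extra latency to inject, in seconds
    raise_error: bool = False     # raise an exception instead of only delaying

def maybe_inject_failure(request_headers: dict, plan: FailurePlan) -> None:
    """Inject a failure only for requests explicitly tagged for the experiment,
    and only for a small, configurable fraction of them."""
    # Only requests carrying the (hypothetical) X-Fault-Inject header are candidates.
    if request_headers.get("X-Fault-Inject") != plan.target_service:
        return
    # failure_rate keeps the blast radius small.
    if random.random() >= plan.failure_rate:
        return
    if plan.added_latency_s:
        time.sleep(plan.added_latency_s)
    if plan.raise_error:
        raise RuntimeError(f"Injected failure for {plan.target_service}")

# Example: roughly 1% of tagged calls to a 'recommendations' dependency
# see 2 seconds of extra latency and then an error.
plan = FailurePlan("recommendations", failure_rate=0.01,
                   added_latency_s=2.0, raise_error=True)

# In a request handler, call this before invoking the real dependency:
# maybe_inject_failure(request_headers, plan)
```

The point of the sketch is the control: only traffic explicitly tagged for the experiment is eligible, and failure_rate caps how much of that traffic is ever affected.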
3. KPIs and the SRE Skillset
There was some interesting discussion around how SRE should be measured. While several points were made about MTTD (Mean Time to Detect) and MTTR (Mean Time to Respond) being significant metrics, all panelists agreed that the right metrics will differ depending on the industry you’re in, as well as the systems or sites you operate. A good suggestion captured from the discussion was, “You can start by asking this question: ‘What are your top 5 most critical systems?’ and that will help you prioritize things.”
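For illustration, here is a minimal Python sketch of how MTTD and MTTR might be computed from incident records. The Incident fields, the sample data, and the choice to measure MTTR from detection to resolution are assumptions for the example; teams draw these boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    started: datetime   # when the failure actually began
    detected: datetime  # when monitoring or on-call noticed it
    resolved: datetime  # when normal service was restored

def mttd_minutes(incidents: list) -> float:
    """Mean Time to Detect: average gap between failure start and detection."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)

def mttr_minutes(incidents: list) -> float:
    """Mean Time to Respond: average gap between detection and resolution."""
    return mean((i.resolved - i.detected).total_seconds() / 60 for i in incidents)

# Hypothetical incident log for illustration.
incidents = [
    Incident(datetime(2019, 11, 2, 9, 0),  datetime(2019, 11, 2, 9, 4),  datetime(2019, 11, 2, 9, 40)),
    Incident(datetime(2019, 11, 9, 14, 0), datetime(2019, 11, 9, 14, 10), datetime(2019, 11, 9, 15, 0)),
]
print(f"MTTD: {mttd_minutes(incidents):.1f} min, MTTR: {mttr_minutes(incidents):.1f} min")
```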
The preferred skillset for an SRE position was another topic covered. According to the panelists, this also depends on what system you run. (For example, if you are running NGINX, then NGINX experience would be crucial for an SRE hire.) A great suggestion from the group was to explore ways to rotate SRE personnel across different areas of the company and different systems, to scale – and better equip – SRE resources. Panelists also recommended ensuring that your SRE teams participate in SRE community events and activities such as training, offsites, dedicated Slack channels, and ‘game days,’ among other helpful suggestions.
Conclusion – Is 2020 the Time to Define Your Own SRE Strategy?
In a nutshell, the discussion revealed that many organizations are still learning how to define and leverage the concept and role of SRE – and as the panelists reiterated, these definitions will often vary across industries and systems (and even individual companies). Overall, Chaos Engineering is something organizations will continue to tackle next year – maybe this is the perfect time to start thinking through what it means for you and your organization?