Need For Site Reliability Engineering

Introduction

Before defining SRE, let’s look at its history. SRE practices originated at Google in 2003. Operations teams faced challenges such as scaling and operational stability once they inherited a product from the Development team. Google addressed this with SRE: the team running operations followed software engineering practices. SREs were given tools similar to those of developers and focused on improving product reliability. If Google Search stopped working, the SREs, not the Development team, would be the first to address the issue.

Outside of Google, these problems were addressed by adopting DevOps.

As people left Google, SRE spread to more organizations, each of which customized it to its own requirements. This resulted in different implementations of SRE.

Site Reliability Engineering is a Software Engineering discipline that helps organizations sustainably achieve appropriate levels of reliability.

Site Reliability Engineer Role

The role of an SRE is to identify and manage asset reliability risks that could adversely affect business operations.

SREs spend up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention.

SREs should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a highly skilled system administrator with knowledge of coding and automation.

Skills

  • Strong experience in the technology the product is built on
  • Application logging, monitoring, and diagnostic tools
  • Application performance management tools
  • Strong knowledge of technology best practices
  • Scripting and automation tools

Certifications

  • SRE Foundation, DevOps Institute
  • Site Reliability Engineer, DevOps Institute
  • Azure DevOps Engineer Expert

Why SRE?

Challenges faced in product development:

  • Development velocity vs. operational stability
  • Instability in production
  • Lack of focus on product security and reliability features
  • Lack of focus on automation and process improvements
  • Lack of awareness of IT standards across multiple teams

DevOps vs SRE

  • DevOps and SRE are two different, parallel attempts to address the above challenges
  • SRE is an engineering discipline that focuses on reliability, whereas DevOps is a cultural movement that emerged to break down the silos typically associated with separate Development and Operations organizations
  • SRE is not the next evolutionary step after DevOps; it is not "DevOps 2.0"

SRE is not just about improving the reliability of a system today, but about making it better as it changes and grows over time.

Key principles and practices

Service Level Indicator (SLI) – an indicator of service health, measured as a point-in-time metric

Example – request latency, requests per second, failures per request
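The three example SLIs above can be computed from raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record and a fixed measurement window; the field names and the choice of mean latency (rather than a percentile) are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One observed request (hypothetical record shape, for illustration only)."""
    timestamp_s: float  # arrival time in seconds
    latency_ms: float   # time taken to serve the request
    ok: bool            # True if the request succeeded


def compute_slis(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute three simple SLIs over one measurement window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    failures = sum(1 for r in requests if not r.ok)
    return {
        "mean_latency_ms": mean(r.latency_ms for r in requests),
        "requests_per_second": len(requests) / window_s,
        "failures_per_request": failures / len(requests),
    }


# Four requests observed over a 4-second window; one of them failed.
sample = [
    Request(0.0, 120.0, True),
    Request(1.0, 80.0, True),
    Request(2.0, 100.0, False),
    Request(3.0, 100.0, True),
]
slis = compute_slis(sample, window_s=4.0)
print(slis)
```

In practice these values would typically come from a monitoring system rather than an in-memory list, but the definitions stay the same.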

Service Level Objective (SLO) – a binding target for a collection of SLIs, represented as lower bound <= SLI <= upper bound

Example – availability should be at least 80%
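The "lower bound <= SLI <= upper bound" form can be expressed directly in code. This is a minimal sketch; the `SLO` class and its field names are hypothetical, the 80% availability target comes from the example above, and the 300 ms latency bound is an assumed value used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI: lower <= SLI <= upper."""
    name: str
    lower: float = float("-inf")  # no lower bound unless specified
    upper: float = float("inf")   # no upper bound unless specified

    def is_met(self, sli_value: float) -> bool:
        """Return True if the measured SLI falls inside the target range."""
        return self.lower <= sli_value <= self.upper


# Availability target from the example above: at least 80%.
availability_slo = SLO(name="availability_percent", lower=80.0, upper=100.0)
# Hypothetical latency target: at most 300 ms (assumed value).
latency_slo = SLO(name="latency_ms", upper=300.0)

availability_ok = availability_slo.is_met(92.5)  # inside the range
latency_ok = latency_slo.is_met(450.0)           # above the upper bound
print(availability_ok, latency_ok)
```

Leaving one bound open covers one-sided targets such as "latency below X" without a separate code path.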

Service Level Agreement (SLA) – a business agreement between a customer and a service provider, typically based on SLOs

Error budget – the difference between perfect reliability (100%) and the service’s desired reliability (its SLO). It is calculated for a set period (monthly or quarterly).
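For an availability SLO, the error budget can also be read as allowed downtime over the period. The sketch below assumes the SLO is given as a percentage and the period in days; the function names are hypothetical.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Error budget as a fraction: perfect reliability (100%) minus the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be between 0 and 100 percent")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, period_days: int) -> float:
    """Minutes of downtime the budget permits over the given period."""
    period_minutes = period_days * 24 * 60
    return error_budget_fraction(slo_percent) * period_minutes


# An 80% availability SLO over a 30-day month versus a stricter 99.9% SLO.
relaxed = allowed_downtime_minutes(80.0, period_days=30)
strict = allowed_downtime_minutes(99.9, period_days=30)
print(f"80% SLO:   {relaxed:.1f} minutes of downtime allowed")
print(f"99.9% SLO: {strict:.1f} minutes of downtime allowed")
```

The comparison shows how quickly the budget shrinks as the SLO is tightened.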

How to set up SLIs & SLOs?

  • SLOs are agreed between SREs and Product Owners
  • The SLIs used to measure SLOs differ based on how the application is used
  • Choose reasonable SLOs at the beginning and make them stricter over time, depending on requirements

Example for SLOs and Error budgets

  • SLO - 80% of web requests per month should be successful
  • Error budget (20%) - If there are 1 million web requests per month, then up to 200K are allowed to fail.
  • Error budget policy - If more than 200K web requests fail in a month, the team has to work on improving system reliability instead of feature development
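The arithmetic in this example can be checked with a short sketch. The `BudgetStatus` type and `evaluate_error_budget` function are hypothetical names; the policy encoded here (pause feature work once failures exceed the budget) is the one described in the example above.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_failures: int    # total failures the error budget permits
    remaining_failures: int  # budget left; negative once it is overspent
    freeze_features: bool    # True if the policy says to focus on reliability


def evaluate_error_budget(total_requests: int, failed_requests: int,
                          slo_percent: int) -> BudgetStatus:
    """Apply a request-based SLO and its error budget policy for one period."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    allowed = total_requests * (100 - slo_percent) // 100
    remaining = allowed - failed_requests
    return BudgetStatus(allowed, remaining, freeze_features=failed_requests > allowed)


# 1 million requests in the month with an 80% SLO: up to 200K may fail.
within_budget = evaluate_error_budget(1_000_000, 150_000, slo_percent=80)
over_budget = evaluate_error_budget(1_000_000, 250_000, slo_percent=80)
print(within_budget)
print(over_budget)
```

With 150K failures the team still has 50K failures of budget left; with 250K failures the budget is overspent by 50K and the policy switches the team to reliability work.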

How to control error budgets?

  • Review SLOs and adjust them as per team capacity
  • Avoid big-bang releases; plan a release in every Sprint so that changes are easier to track
  • Set up a CI/CD pipeline that can be triggered automatically to reduce downtime
  • Make the system architecture fault-tolerant
  • Remediate technical debt in the product; this helps avoid latent bugs and helps with performance optimization
  • Ensure SLI monitoring and alerting are set up correctly
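One way to make the monitoring and alerting point concrete is to track how much of the error budget has been consumed partway through the period. This is a minimal sketch; the function names, the alert levels, and the 50% and 90% thresholds are all assumptions chosen for illustration, not recommended values.

```python
def budget_consumed_fraction(failed_so_far: int, allowed_failures: int) -> float:
    """Fraction of the period's error budget already used."""
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    return failed_so_far / allowed_failures


def alert_level(consumed: float, warn_at: float = 0.5, page_at: float = 0.9) -> str:
    """Map budget consumption to an alert level (thresholds are assumed values)."""
    if consumed >= page_at:
        return "page"
    if consumed >= warn_at:
        return "warn"
    return "ok"


# Using the 200K-failure monthly budget from the earlier example.
for failed in (40_000, 120_000, 190_000):
    consumed = budget_consumed_fraction(failed, allowed_failures=200_000)
    print(f"{failed} failures -> {consumed:.0%} consumed -> {alert_level(consumed)}")
```

Alerting on budget consumption, rather than only on individual failures, lets the team slow down releases before the budget is exhausted.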

Conclusion

Each organization has come up with its own customized implementation of SRE as per its needs. It is up to the organization to understand its reliability targets product by product and set up SRE practices accordingly.