SRE - DevOps - Operations

Still under construction

Things about operating sites

Oper what?

In ancient times (Mico Maco is already of a certain age or… better, has a lot of experience!!) there were Systems Administrators (SysAdmins), often also known as Infrastructure guys, and there were also Developers. Well, some people might say that there was also the almighty Mainframe and its unreachable guys, but this is too complex to develop further ;-)

Good system administrators, aka Operators, tend to program in order to automate the operation of the systems they are responsible for. Developers program the service/product functionality required by the business.

The split between both roles led to very strange situations, where administrators were responsible for operating and keeping things running that they did not even know existed, let alone how they were built. Because of this separation of roles and knowledge, service dependencies were a problem, troubleshooting was hard and time-consuming, and there was a lot of finger-pointing when something failed.

Time has passed and, although many SysAdmins used to program (yes, we also build things !!) and developers sometimes had to administer a system, many of us found it quite logical to close the gap between these two disciplines, bridging both worlds.

In the end we might agree that automating systems is a type of programming. So, in fact, we can also agree that if we extend the programming capabilities of sysadmins and extend the system management (operation) skills of developers, we come to a common ground that points to what Google names an SRE (Site Reliability Engineer).

Nowadays, one can encounter many words/concepts related to this merge: DevOps, SRE, Infrastructure as Code, etc. In the end, the need to scale has forced this mix, with astonishing results both in scale and in services. Think of Google and GCP or Amazon and AWS.

Many may argue that SRE and DevOps are different concepts. I recommend reading “SRE vs. DevOps: competing standards or close friends?” to see that they look the same, or that they aim for the same thing, or that one implements the other, or that… To Mico Maco this discussion, as with “any resistance”, is futile.

So, bridging these two worlds, you end up with a team that creates applications to fully automate the administration of systems or services. In other words, one can apply all the software development concepts to system administration, including metrics, SOA / API-oriented services, etc.

As Liz Fong-Jones explains in this conference, SRE operational practices can be grouped into:

SLIs, SLOs and SLAs

We can define an SLI (Service Level Indicator) as a “test” of a specific functionality of a service/product. This concept goes a bit further than typical metrics like CPU, memory or number of hits…

An SLI should be a probe that checks the status of something meaningful for you, the product manager or the business. On the “internet” this can be a request to a web site emulating different devices. This request has to be well thought out. If you define an SLI by just requesting the home page, without calling any functionality, and just looking at response time and return code, make sure you understand that this SLI will “only” tell you that the web server is up and running and how fast it responded, not much else. This SLI is a very good one to check “general” availability, but it will give you just this info and no more. If you plan to have an SLI indicating whether your site is performing well from a user's point of view, you will need to define a few more requests pointing to previously identified critical parts of the service/product. Examples might be the search engine, or simulating a user login or sign-in, etc.
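A probe like the one described above could be sketched as follows. This is only a minimal illustration using Python's standard library: the names `ProbeResult`, `probe` and `is_good`, the 1-second latency limit and the idea of emulating a device through the `User-Agent` header are assumptions made for the example, not something prescribed by any SRE tool.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status: int        # HTTP status code, 0 if no response was received
    latency_s: float   # wall-clock seconds until the response body was read


def is_good(result: ProbeResult, max_latency_s: float = 1.0) -> bool:
    """A probe counts as 'good' when it returns 2xx within the latency limit."""
    return 200 <= result.status < 300 and result.latency_s <= max_latency_s


def probe(url: str, user_agent: str, timeout_s: float = 5.0) -> ProbeResult:
    """Request one URL, emulating a device through its User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, OSError):
        status = 0          # no answer at all: DNS, connection or timeout
    return ProbeResult(status, time.monotonic() - start)


# Hypothetical usage: one probe per critical path (home page, search, login...).
# result = probe("https://shop.example/search?q=book", "hypothetical-mobile-agent")
# print(is_good(result))
```

Pointing `probe` at the home page only reproduces the “general availability” SLI; the user-facing SLIs come from pointing it at the search, login and other critical URLs of your own service.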

Good SLIs are needed to be able to define good SLOs and SLAs.

Availability - Reliability

Beware of targeting too many ‘nines’. While it might sound good to have 5 or 6 ‘nines’ of availability (99.999% or 99.9999%), make sure that the whole path from your service to the user has the same target percentage… Ask yourself questions like whether the user's mobile network or their domestic fiber connection also has 5 or 6 ‘nines’. If not, you should not invest money and time in getting an extra ‘nine’. Each ‘nine’ you add brings an exponential growth in cost and time that you should probably spend on another part of your service (for instance, faster searches or lighter pages…).
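To see why the weakest link matters, here is a small sketch that multiplies the availability of each link in the path. It assumes the links are in series and fail independently, which is a simplification, and the 99% figure for the user's network is purely hypothetical.

```python
from math import prod


def end_to_end_availability(*components: float) -> float:
    """Availability of a serial chain: every component must be up at once."""
    return prod(components)


service = 0.99999       # five nines, the target discussed above
user_network = 0.99     # hypothetical availability of the user's connection

print(f"{end_to_end_availability(service, user_network):.5%}")  # prints 98.99901%
```

Under these assumptions the user sees roughly 99%, whether the service itself has five or six ‘nines’.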

Also, always keep in mind how much downtime each ‘nine’ allows:

| Availability | Downtime per year | Downtime per quarter | Downtime per 30 days |
|---|---|---|---|
| Six nines or 99.9999% | 32 seconds or less | | |
| Five nines or 99.999% | 5 minutes, 15 seconds or less | 1.30 minutes | 25.9 seconds |
| Four nines or 99.99% | 52 minutes, 36 seconds or less | 12.96 minutes | 4.32 minutes |
| 99.95% | 4.38 hours or less | 1.08 hours | 21.6 minutes |
| Three nines or 99.9% | 8 hours, 46 minutes or less | 2.16 hours | 43.2 minutes |
| 99.5% | 1.83 days or less | 10.8 hours | 3.6 hours |
| Two nines or 99% | 3 days, 15 hours and 40 minutes or less | 21.6 hours | 7.2 hours |
| 95% | 18.25 days or less | 4.5 days | 1.5 days |
| One nine or 90% | 36.5 days or less | 9 days | 3 days |
| A single ‘9’, that is 9% | 332 days or less !! | | |
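The figures in the table above are plain arithmetic, so they can be recomputed for any target. The sketch below assumes a year of 365 days and a quarter of 90 days; with other conventions (for example 365.25 days) the results differ slightly from the rounded values in the table.

```python
from datetime import timedelta

# Assumed period lengths; adjust them to your own reporting conventions.
PERIODS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=90),
    "30 days": timedelta(days=30),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime in a period that still meets the availability target."""
    unavailable_fraction = (100 - availability_percent) / 100
    return period * unavailable_fraction


for target in (99.999, 99.99, 99.9, 99.0):
    row = ", ".join(
        f"{name}: {allowed_downtime(target, length)}"
        for name, length in PERIODS.items()
    )
    print(f"{target}% -> {row}")
```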

Reliability goes beyond availability, and covers all the practices and tools that enforce availability, resiliency, recovery, scalability, etc. You can find tools for adding all these characteristics to a system in the NALS section.

But what is a good measure of reliability? A first approach is to divide the “good minutes” of the service by its “total minutes”. While easy to understand, it is simple, if not naive. A better approach is to divide “good interactions” by “total interactions”. An “interaction” is something to be defined (see SLI above), but can be summarized as an interaction successfully completed by a user. This approach copes better with distributed systems and with cases where, for instance, 1 of 3 servers is down.
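The “1 of 3 servers is down” case can be worked through with made-up numbers. In the sketch below everything is hypothetical: one day of traffic, 900 requests per minute spread evenly over 3 servers, and one server failing every request it receives for 60 minutes.

```python
def sli_ratio(good: float, total: float) -> float:
    """Share of good units (minutes or interactions) over the total."""
    return good / total if total else 1.0


# Hypothetical day: 3 load-balanced servers, 1 of them down for 60 minutes.
total_minutes = 24 * 60
degraded_minutes = 60
requests_per_minute = 900
failed_per_degraded_minute = requests_per_minute // 3  # sent to the dead server

# Time-based view: a degraded minute is either fully bad or fully good.
pessimistic_time_sli = sli_ratio(total_minutes - degraded_minutes, total_minutes)
optimistic_time_sli = sli_ratio(total_minutes, total_minutes)

# Interaction-based view: only the requests that really failed count as bad.
total_requests = total_minutes * requests_per_minute
bad_requests = degraded_minutes * failed_per_degraded_minute
interaction_sli = sli_ratio(total_requests - bad_requests, total_requests)

print(f"time-based, degraded minutes are bad:  {pessimistic_time_sli:.4%}")
print(f"time-based, degraded minutes are good: {optimistic_time_sli:.4%}")
print(f"interaction-based:                     {interaction_sli:.4%}")
```

With these numbers the time-based measure says either about 95.8% or 100%, depending on how you label the degraded hour, while the interaction-based measure gives about 98.6%, which is what the users actually experienced.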

Let's take a virtual shop example (Amazon?) where users search for and find a product, add it to the cart and pay for it. We could just measure how long each of these steps took, but a better approach is to set a time window for the whole sequence of actions (keeping each individual metric as well !!). You might think it is hard, as people might buy several products, and therefore run several searches, etc. Well, you might want to have the search part weighted by the number of products, for instance. In the end, it is you, your product manager and your business who know the internals of your products and systems !!
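One possible way to express such a window, weighted by the number of products, is sketched below. The `Journey` structure and the per-step allowances (2 seconds per search, 1 second per add-to-cart, 5 seconds for payment) are invented for the illustration; the real values have to come from you and your product manager.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Journey:
    search_s: List[float]       # one duration per product searched
    add_to_cart_s: List[float]  # one duration per product added to the cart
    payment_s: float            # duration of the payment step


def journey_window_s(n_products: int, per_search_s: float = 2.0,
                     per_cart_s: float = 1.0, payment_s: float = 5.0) -> float:
    """Time window for the whole journey, growing with the number of products."""
    return n_products * (per_search_s + per_cart_s) + payment_s


def journey_is_good(journey: Journey) -> bool:
    """The journey is a good interaction if the whole flow fits in its window."""
    total = sum(journey.search_s) + sum(journey.add_to_cart_s) + journey.payment_s
    return total <= journey_window_s(len(journey.search_s))


fast = Journey(search_s=[1.5, 2.0], add_to_cart_s=[0.5, 0.5], payment_s=4.0)
slow = Journey(search_s=[6.0, 5.0], add_to_cart_s=[0.5, 0.5], payment_s=4.0)
print(journey_is_good(fast), journey_is_good(slow))  # prints True False
```

The individual step durations stay available in `Journey`, so each metric can still be tracked on its own.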

Error budget

As 100% reliability is almost impossible, the difference between the SLO and 100% is the error budget. Google states that this budget is a shared responsibility of developers and SREs. If the SLO is defined at 99.99% and is based on time, the budget allows for about 52 minutes a year of not providing service, or about 12 minutes per quarter, or about 4 minutes per month… The unit of the budget depends on the SLO metric. If it is “good interactions”, you can calculate the number of “bad interactions” that your error budget allows.
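Both flavours of the budget can be computed in a few lines. The sketch below is illustrative only: the function names are made up, the SLO is passed as a string so that `Fraction` can treat it exactly, and the one million interactions are a hypothetical volume.

```python
from datetime import timedelta
from fractions import Fraction


def budget_fraction(slo_percent: str) -> Fraction:
    """Error budget as an exact fraction, e.g. '99.99' -> 1/10000."""
    return 1 - Fraction(slo_percent) / 100


def time_budget_minutes(slo_percent: str, period: timedelta) -> float:
    """Minutes of unavailability a time-based SLO tolerates in the period."""
    return period.total_seconds() / 60 * float(budget_fraction(slo_percent))


def allowed_bad_interactions(slo_percent: str, total_interactions: int) -> int:
    """Failed interactions an interaction-based SLO tolerates (rounded down)."""
    return int(total_interactions * budget_fraction(slo_percent))


print(time_budget_minutes("99.99", timedelta(days=365)))  # about 52.56
print(allowed_bad_interactions("99.99", 1_000_000))       # prints 100
```

Subtracting the downtime or the failed interactions already observed from these numbers gives the budget that is still left to spend on changes.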

A key point is that the error budget is shared between SREs and developers, so it is a way to balance stability vs. change (or uptime vs. innovation).

Toil budget

In this context, “toil defines work tied to running a production service that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows.” (definition by Liz Fong-Jones at Google)

As with availability, sometimes it is not worth automating some toil, as the automation might cost more than the toil itself. Keeping some toil can be useful to let new engineers get to know the system. There has to be a balance between toil and engineering. In other words, there has to be a balance between administering and improving the system.
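A rough way to decide whether a piece of toil deserves automation is a break-even estimate. The sketch below is a back-of-the-envelope model with hypothetical inputs (hours to build the automation, hours of toil per month, hours per month needed to maintain the automation); it ignores benefits such as fewer human errors.

```python
def break_even_months(automation_cost_hours: float,
                      toil_hours_per_month: float,
                      maintenance_hours_per_month: float = 0.0) -> float:
    """Months until the automation has paid for itself, or infinity if never."""
    saved_per_month = toil_hours_per_month - maintenance_hours_per_month
    if saved_per_month <= 0:
        return float("inf")  # maintaining the automation costs as much as the toil
    return automation_cost_hours / saved_per_month


print(break_even_months(40, 10))    # prints 4.0
print(break_even_months(40, 2, 2))  # prints inf
```

Because toil tends to grow with the service, the monthly toil figure is worth re-estimating from time to time: something not worth automating today may be worth it next year.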

class SRE implements DevOps

Evolution of On-call @ Spotify
Francesc Zacarias talks about “Operational Responsibility” for developers.

Ops love APIs: Here is Why

What’s a senior engineer’s job?
On Being A Senior Engineer