Recently I recorded a very interesting podcast episode with my old friend Adrian Ratnapala, who works as a site reliability engineer at Google. (Sadly the episode isn’t published yet, but I’ll tell you once it is.)
Among other things we discussed service level objectives: what performance you’re promising your customers (internal or external).
And we touched on something that’s actually quite pervasive, yet still rarely actively dealt with: the existence of both implicit and explicit contracts in parallel.
Adrian gave the example of a database that guaranteed to return within 500ms 99% of the time. Well, what do you know: the engineers who built it knew their jobs, and the actual 99th-percentile response time was closer to 10ms.
The internal customers of course twigged quickly that the database was faster than promised. I’m sure you know where this story is going: they came to expect adherence to an informal contract, an unspoken (but delivered) faster SLO.
If you dare to go back to what you actually promised, don’t expect any thanks.
This is the risk of implicit contracts. You might be tempted to artificially worsen your product’s performance to what you actually promised – but that sounds stupid, doesn’t it?
I think this is where the chaos engineering approach really shines: you can usually deliver at your best speed, but both during testing and in production you can artificially delay some responses to exactly what you promised, and keep your developers aware that those are the actual rules.
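To make that concrete, here is a minimal sketch of what such a delay injection could look like in Python. It is my own illustration, not something from Adrian or from any particular chaos engineering tool: the name `pad_to_slo`, the 1% injection rate, and the `lookup` stand-in are all assumptions, and only the 500ms figure comes from the story above. The idea is to wrap a fast call and, for a small random share of requests, hold the response back until the promised latency has elapsed.

```python
import functools
import random
import time


def pad_to_slo(slo_seconds=0.5, inject_rate=0.01,
               rng=random.random, sleep=time.sleep, clock=time.monotonic):
    """Hypothetical decorator: occasionally stretch a fast call to the promised latency.

    slo_seconds -- the latency you actually promised (500ms in the story)
    inject_rate -- assumed fraction of calls to slow down (1% here)
    rng, sleep, clock -- injectable so the behaviour can be tested without waiting
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            result = func(*args, **kwargs)
            # Only a small, random fraction of calls is slowed down.
            if rng() < inject_rate:
                # Wait out whatever is left of the promised budget, never more.
                remaining = slo_seconds - (clock() - start)
                if remaining > 0:
                    sleep(remaining)
            return result
        return wrapper
    return decorator


@pad_to_slo(slo_seconds=0.5, inject_rate=0.01)
def lookup(key):
    # Stand-in for the real (fast) database call.
    return {"answer": 42}.get(key)
```

Most calls to `lookup` return at full speed, but roughly one in a hundred takes the full 500ms, so a client that silently depends on 10ms responses shows up in testing rather than on the day the database has a bad afternoon. Padding only up to the promised figure means the injected delay itself never breaks the explicit contract.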