If you don’t understand stuff, it’s magic. And if you’re relying on something that’s magic, your platform can disappear in a puff of smoke. This especially true of newer things – I don’t understand MySQL but it’s long in the tooth enough I can (mostly) trust it. Some of the newer NoSQL techs do not have that lineage…
Open Source allows you to get under the hood of all these things, to look behind the curtain and reverse-engineer what is going on. You invariably have to as the documentation is a TODO item. This means that when you do hit these extreme edge cases situations you can fix them, eventually.
But that’s only once you’ve really understood the problem. In black-box situations it’s all too easy to pull the levers you have until it seems the problem has gone away, but all you’ve done is masked, displaced, or deferred it. You have to understand the whole stack and not just “your bit”. (This reminded me a bit of a conversation with a friend who does network security, where decisions not to collect some data for “safety” actually made potential targets more obvious)
There are no gremlins
My favourite point was this: Computers are (mostly) deterministic.
We talk about bugs, issues, intermittent and transient faults – almost resigning ourselves to sometimes “things just happen”.
As Artur points out, computers are deterministic state machines, this randomness doesn’t really exist. Yes, the complex interplay of our interconnected systems can give the appearance of a random system, but that is just the appearance.
There is pattern in there, and when find it, you can fix it. How? Lots of monitoring, lots of measuring, and good old-fashioned investigation.
Stop throwing boxes & sharding at things
The easy availability of horizontal scale-out makes us lazy and complacent: “we’ll just throw another amazon instance at the problem”. That can be a valid approach, but only when your existing instances are actually spending all of their time doing meaningful work and not stuck queuing on some random service. If you’re site is sluggish because of poor code, database performance or tuning, you’re not really solving the problems.
Latency is even more critical(Google PDF), and scaling out a broken system may just let more people use it slowly – not make it faster.
Post-Cloud Call to Arms?
Scaling was hard: ordering servers took ages and it was all confusing. CDNs cost lots of money, were hard to use and only for the big boys.
Then “The Cloud” appeared: people like amazon and others made stuff cheaper and faster to get machines from. For a while we could ignore the complexity and just throw money at it.
But latency isn’t as simple as capacity, and we’re back to the situation that isn’t always about throwing more boxes into the battle.