Cloud, the cost and value of everything

Forget scalability, speed to change, and flexibility: I think the single most important thing about cloud hosting is that it puts an explicit cost on everything…

Last night I gave a lightning talk at the newly tweaked #metabeertalks. These guys are great friends of mine, and their topic was “is realtime Fashion or Fad?”.

Modern hosting approaches, aka “the cloud”, have many advantages: they scale trivially, encourage best practices in how you architect and deploy, and flex as your application changes.

The single most powerful thing, though, is that it puts a cost on every element of your application. We can debate whether it’s cheaper, more expensive, or about the same as hosting on tin, but you know what your components cost.

Your application isn’t bundled up with a load of others on a server, with your IT team complaining that they have to install a new one, with about a month’s lead time, every three months.

You host your application on an instance the right size for it, be it small or huge, a single box or a fleet of 20. You use the storage you need, when you need it, without playing that impossible game of “how much storage will we need by the time the storage system actually arrives?”.

And all this comes with transparency: set your system up with the right tags, and all the costs of an application are known.
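To make that concrete, here’s a rough sketch of the tagging side using Python and boto3 (the instance and volume IDs and the tag keys are made up for illustration, not part of any real setup): tag every resource with the application it belongs to.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Hypothetical resources that make up one application.
resources = ["i-0abc123example", "vol-0def456example"]

# Tag everything with the application and environment it belongs to,
# so billing reports can be broken down per application.
ec2.create_tags(
    Resources=resources,
    Tags=[
        {"Key": "application", "Value": "realtime-analytics"},
        {"Key": "environment", "Value": "production"},
    ],
)
```

Switch those keys on as cost-allocation tags in the billing settings and each line item on the bill gets attributed to an application.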

Knowing those, you can start flexing: if you need 10 machines to keep up with realtime analysis, they’re yours. If you don’t want to pay that, bid for some cheaper instances and batch the work overnight.
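The “bid for cheaper instances” option looks roughly like this with boto3 (the AMI ID, bid price and instance type are placeholders, not recommendations): ask for spot capacity, let the overnight batch workers chew through the backlog, and accept that the instances can be taken away if the spot price rises above your bid.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Bid for cheap overnight capacity instead of paying on-demand rates.
# The AMI ID, bid price and instance type below are placeholders.
response = ec2.request_spot_instances(
    SpotPrice="0.05",        # the most we're willing to pay per instance-hour
    InstanceCount=10,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0example",
        "InstanceType": "m3.large",
    },
)

for request in response["SpotInstanceRequests"]:
    print(request["SpotInstanceRequestId"], request["State"])
```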

Within reason, you can do anything, if you can afford it. So you can take a call about which bits of information are valuable enough to justify being realtime.

When Netflix launched House of Cards series two, you could hear them talking about the “Play Start” messages coming in. That kind of realtime information is amazingly helpful for debugging.

The deeper stats, like how many people watched and how many episodes they binged on, could probably wait a few hours for a batch run…

My take on realtime: do it where it’s valuable, and where you can justify the cost.

Which is exactly the same for all elements of cloud hosting.

Articles like this are why people think the cloud is oversold

When Malaysian 370 went missing, someone suggested “The Cloud” could solve all the problems.

The cloud can solve many problems, and is rightly seen as one of the easiest ways to launch web services. But it isn’t magical, and articles like this are why people think the cloud is being oversold. The cloud is not the solution to finding missing Malaysian flight 370:

But if MH370 had been fitted with technology that made use of the cloud it may never have been lost in the first place. The cloud is a cluster of computers that provides reliable computing and storage as a service to large numbers of requests from computers with limited capabilities, such as those on board a plane or inside a mobile phone.

What the author is really saying is “planes should dial back to a server with their telemetry”.

This may be true, but as a comment on the article points out: that doesn’t need the cloud.

It needs a server in a data-centre. Now you may choose to deploy that server as a virtualised box in the cloud, but this is not an application where you need the main virtues of cloud type platforms.

Over and above machine virtualisation, I tend to think of ‘cloud’ as meaning some combination of these things:

  1. You scale your resources when you need them, not ahead of time. The best example of this is storage: you don’t have to pre-size your storage allocation in Amazon or Azure. [1]
  2. Your application makes use of the two main scaling patterns, incoming load balancers [2] and asynchronous message passing [3], to dynamically change the amount of processing capacity you have (there’s a rough sketch of the queue-driven version after this list).
  3. Not a technical thing, but your costs should be scaling in line with your usage. Having the incentive to save money by doing as little as possible when you’re idle will encourage you to properly scale.
  4. You start treating your servers as livestock and not pets. If virtualisation separates your instances from the physical hardware, cloud deployment should separate your application from the instances.
  5. Your deployment should be cheap. It should take minutes, and be painless, and shouldn’t make your ops team bite their nails in fear. It needs to be a routine, accepted, automated process. This also requires you to have your config held in more durable places than a file on an instance, which could disappear at any moment.
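Here’s the queue-driven sketch promised in point 2, again with boto3 (the queue URL, the auto scaling group name, and the messages-per-worker ratio are all assumptions invented for this example). A production setup would more likely hang this off CloudWatch alarms and scaling policies, but the idea is the same: size the fleet to the backlog.

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Placeholders: swap in your real queue and auto scaling group.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/work-queue"
GROUP_NAME = "worker-fleet"
MESSAGES_PER_INSTANCE = 500   # rough throughput of one worker, an assumption


def scale_to_queue_depth():
    # How much work is waiting on the queue?
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Size the worker fleet to the backlog, within sane bounds (1 to 20).
    desired = min(max(backlog // MESSAGES_PER_INSTANCE, 1), 20)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=GROUP_NAME,
        DesiredCapacity=desired,
        HonorCooldown=True,
    )
    return backlog, desired


if __name__ == "__main__":
    print(scale_to_queue_depth())
```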

The dial-back type of solution that could help us find missing planes doesn’t really need many of these characteristics. The data formats would be relatively static, and the load wouldn’t peak to such levels that you’d need to put it all behind a massive load balancer. You’d care about reliability, but I don’t see masses of room for flexing things here.

Yes, dial-back is a good opportunity to improve visibility (the ACARS data from Air France flight 447 provided a trace of the accident), but what really could have helped us in the case of Malaysian 370 would have been something that continued to report back position information after ACARS was disabled.

We don’t know if ACARS was disabled manually by the flight crew, or as a result of electrical systems being de-powered by a fire. We do know the Inmarsat satellite modem was still functioning for some time, and responded to a network-level ping. That only gave us confirmation that the modem was still in range of a satellite beam, and unfortunately it was a large beam covering a wide area.

Had the plane been fitted with a newer Inmarsat system, it would have been connecting to satellite beams with smaller footprints, which could have narrowed the search area.

What might have helped would have been a GPS receiver integrated with the satellite modem, so that even without the main ACARS system, position could at least still be reported.

That isn’t in the cloud however, that’s on the plane.

Better reporting back could have helped the investigators here, but no, this is not another problem to be solved by “the cloud”.

  [1] Much to the cheers of capacity planners.
  [2] When more people hit your website, you launch more servers.
  [3] Instead of doing an operation when a request comes in, you put it on a queue. When the queues start growing too big, you start additional instances.

Google makes VM Immortal – but how useful?

Google lets you migrate machines between data-centres while they’re still running.

It’s a nice feature, and something VMware has been able to do for a while, but I can’t help feeling it’s an anti-pattern in cloud infrastructures. Yes, there are some applications you can’t easily design as message-consuming stateless data-beasts, but in general, to take advantage of scaling (for capacity or to save money), you need to design your applications so that they can survive machine failure, be it from Chaos Monkey or otherwise.

Performance: still hard

Performance is still hard: Artur Bergman of Fastly talks about what you’re doing wrong.

I watched @crucially’s video from the Velocity conference; once again it’s a good talk, in which he plays the Grumpy Bastard with aplomb. Soon, soon I promise, I will “Buy a Fucking SSD”.

That’s Magic!

If you don’t understand stuff, it’s magic. And if you’re relying on something that’s magic, your platform can disappear in a puff of smoke. This is especially true of newer things – I don’t understand MySQL, but it’s long enough in the tooth that I can (mostly) trust it. Some of the newer NoSQL techs do not have that lineage…

Open Source allows you to get under the hood of all these things, to look behind the curtain and reverse-engineer what is going on. You invariably have to, as the documentation is a TODO item. This means that when you do hit those extreme edge-case situations, you can fix them, eventually.

But that’s only once you’ve really understood the problem. In black-box situations it’s all too easy to pull the levers you have until it seems the problem has gone away, but all you’ve done is masked, displaced, or deferred it. You have to understand the whole stack and not just “your bit”. (This reminded me a bit of a conversation with a friend who does network security, where decisions not to collect some data for “safety” actually made potential targets more obvious.)

There are no gremlins

My favourite point was this: Computers are (mostly) deterministic.

We talk about bugs, issues, intermittent and transient faults – almost resigning ourselves to the idea that sometimes “things just happen”.

As Artur points out, computers are deterministic state machines; this randomness doesn’t really exist. Yes, the complex interplay of our interconnected systems can give the appearance of a random system, but that is just an appearance.

There is a pattern in there, and when you find it, you can fix it. How? Lots of monitoring, lots of measuring, and good old-fashioned investigation.

Stop throwing boxes & sharding at things

The easy availability of horizontal scale-out makes us lazy and complacent: “we’ll just throw another Amazon instance at the problem”. That can be a valid approach, but only when your existing instances are actually spending all of their time doing meaningful work and not stuck queuing on some random service. If your site is sluggish because of poor code, database performance or tuning, you’re not really solving the problem.

Latency is even more critical (Google PDF), and scaling out a broken system may just let more people use it slowly – not make it faster.

Post-Cloud Call to Arms?

Scaling was hard: ordering servers took ages and it was all confusing. CDNs cost lots of money, were hard to use, and were only for the big boys.

Then “The Cloud” appeared: Amazon and others made it cheaper and faster to get machines. For a while we could ignore the complexity and just throw money at it.

But latency isn’t as simple as capacity, and we’re back to a situation where the answer isn’t always throwing more boxes into the battle.