“What will the performance be like?”

Despite the documentation, blog posts and experiences, you’re just going to have to test it…

AWS offers a dazzling array of services for similar things. Dazzling is a great synonym for “confusing”. Feels like “Speed of Innovation”, “Range of Services” and “Coherence of Offerings” is a “pick 2 of 3” situation…

An often-asked question in the AWS communities I hang out in is “can we use a Lambda for that?”

People asking this are often primed with some of the problems of serverless compute, and jump to the advanced features to mitigate them:

  • “I’m worried about cold starts, do I need to pay for provisioned concurrency?”
  • “I’m accessing a database, do I need to add an RDS Proxy?”
  • “I worry about response time, do I need to enable API Gateway caching?”
  • “I’m worried about out-of-order delivery, do I need First-In First-Out queues?”1

These are valid questions, with valid solutions, but paraphrasing a global sportswear brand: Just Try It.

Deploy your code, and see how it performs/responds. Then apply the tricks to make it better.

Starting with the dirtiest hack of all: Make it bigger.

Lambda CPU allocation is only controllable via the memory setting, and I’ve been in multiple situations where there was a step-change in performance jumping from 64 to 128 megabytes. There’s tooling that can help you find the right size for your application.

It’s lazy brute force, but it takes literally seconds to change the memory setting and re-test.
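If you’re deploying with CDK, it really is a one-property change. Below is a minimal, hypothetical sketch (the function name and asset path are made up); tools like AWS Lambda Power Tuning can then search the memory/cost trade-off for you.

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// The only knob for Lambda CPU is memorySize, so try a bigger value and
// re-test before reaching for fancier fixes.
class ResizeStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);
    new lambda.Function(this, 'ResizeImages', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/resize-images'), // hypothetical path
      memorySize: 1024, // was 128: a one-line change, then re-test
      timeout: Duration.seconds(30),
    });
  }
}

new ResizeStack(new App(), 'ResizeStack');
```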

I’m a big proponent of ‘innovation as Business as Usual’: you shouldn’t need the excuse of an innovation day to try new things. That only works if you start simple.

But I’m an even bigger proponent of not stressing yourself or your teams unnecessarily: if you don’t have time to test in your current project, do what you know works. Hit your shipping dates; have an easier sprint.

Hopefully you can have a play in the future and add something else to the list of what “works”. It’s nice to expand that list, but only when you’ve enough headroom to do so.

In conclusion, while sometimes I can tell you that something is unlikely to work, most of the time you’re just going to have to try it.

Sorry.

  1. I understand FIFO messaging has its place, but I work in content, not transactions. Including a ‘reliable’ timestamp in the event (e.g. the time an article was published, not the time the message was sent) is cheaper and lets consumers decide whether they should process an event, which also makes recovery easier if databases have to be re-crawled. ↩︎
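To make footnote 1 concrete, here’s a rough sketch of that ‘reliable timestamp’ idea – every field and variable name here is hypothetical.

```typescript
// The timestamp reflects when the article was published, not when the
// message was sent, so re-delivery, out-of-order delivery, or a re-crawl
// of the database is harmless.
interface ArticleEvent {
  articleId: string;
  publishedAt: string; // ISO-8601 UTC, set by the publishing system
}

// Hypothetical record of the newest version we've processed per article.
const lastProcessed = new Map<string, string>();

export function handleEvent(event: ArticleEvent): void {
  const previous = lastProcessed.get(event.articleId);
  // Older than (or the same as) what we already hold: just drop it,
  // no FIFO queue required.
  if (previous !== undefined && previous >= event.publishedAt) {
    return;
  }
  lastProcessed.set(event.articleId, event.publishedAt);
  // ...do the actual processing here...
}
```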

Systems are designed in a context

I’ve been following a bit of Internet Of Things drama because a company Cease & Desisted a developer who was polling their API for an unofficial Home Assistant Integration.

Thankfully, it looks like the company is now engaging with the Developer and The Home Assistant team, so maybe it can be resolved.

But one of the recurring comments initially was “if they’re saying it costs too much money, they should use a cheaper host than AWS”

Disclosure, my career is helping companies use the right bits of AWS well, and for all my discomfort at the centralisation of the internet into the Hyperscalers – the ability to deploy a website that costs nothing if nobody uses it, yet scales to match demand, is pretty compelling.

“What Context do you mean Gareth?”

Backing up a second, this integration polls the endpoint pretty aggressively, and so could be causing a noticeable spike in API calls, and in the costs experienced by Haier.

If there were about 500 active installs, that could cost somewhere between 500 and 1,500 USD per month, based on the polling rates: probably immaterial in the grand scheme of things, but noticeable.
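As a rough back-of-an-envelope (the poll interval and per-request price below are my assumptions, not Haier’s real numbers):

```typescript
// Hypothetical figures: 500 installs, each polling every 5 seconds,
// priced like an API Gateway REST API at ~3.50 USD per million requests.
const installs = 500;
const pollsPerInstallPerMonth = (30 * 24 * 60 * 60) / 5; // ≈ 518,400
const pricePerMillionRequests = 3.5;

const monthlyCost =
  (installs * pollsPerInstallPerMonth / 1_000_000) * pricePerMillionRequests;
console.log(monthlyCost.toFixed(0)); // ≈ 907 USD, before any backend costs
```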

The context I’m talking about is that they built this API knowing it wasn’t going to be called that much… many of the interactions customers saw would be driven by push notifications or other event-driven things. It’s only called by the app/website when people visit… it was never costed for being called continuously.

They didn’t optimise that cost, because it didn’t make sense to.

But still, why aren’t they doing it cheaper?

Bluntly, because “total cost of ownership” (TCO) is about more than just the compute.

Sure I could just “spin up a VM and do it all myself” but then I’m doing it all myself. I’m suddenly in the realm of patching. I’ve got to do my own auth.

If AWS are terminating your HTTPS endpoint, then AWS are on the hook for the latest HTTP desynchronisation exploit.

When you host stuff on managed services (of any provider) you’re really going hands-off: you don’t need an ops team because, ultimately, there generally isn’t anything you can operate… If there’s an outage, often you just have to wait for the provider to fix it.

That might feel disempowering, but also, there’s no point having an on-call team and waking someone at 3am to go “Yup, it’s fucked” but be unable to actually fix it.

In summary

There’s usually a cheaper way to deploy things.

The question is “is it actually cheaper, everything else considered?”

Don’t try to second-guess other people’s architecture, and if you are going to poll an API, try to do it considerately…

Save Money, deploy IPv6 in your VPC

Uncertainty or unwillingness to use IPv6 is actively costing you money.

I (mostly) have IPv6 deployed in my home network – alas, a bug on my router currently prevents it from working completely – but it’s been enabled for years, mostly as a novelty to feel like I was ready for the grand future, without really noticing any differences.

Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.

The Joy of NAT

Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed services as I am) a managed NAT gateway.

As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.

In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.

The egress-only internet gateway charges only ‘standard’ egress rates, and it has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.

I was expecting to be told “Only one of my APIs supports IPv6” but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.

This was not what I was expecting – despite remembering that, for the last few years, wherever I can deploy a dual-stack endpoint, I have been…

So why haven’t we

There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:

  1. IPv6 didn’t have an obvious advantage. In most cases the IPv4 setup works OK… so what’s the point of adding it to a working setup…
  2. The deployment tooling doesn’t support it – this was the case here, the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily (see the sketch after this list for one workaround)
  3. We like NAT’s security by default: with NAT, your compute resources aren’t exposed to the internet, and all that endpoints see are one or two shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 egress-only gateway doesn’t allow inbound connections either, you do feel more exposed, and so much of security is an annoying vibes-type thing
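Working around point 2 typically means dropping down to the CloudFormation-level (‘L1’) resources. This is only a rough sketch under that assumption – subnet IPv6 addressing is omitted – but it shows the shape of it:

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Bolt IPv6 onto a CDK VPC using the L1 resources, since the high-level
// Vpc construct didn't do this for us at the time.
class DualStackVpcStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    // 1. Ask AWS for an IPv6 /56 for the VPC.
    const ipv6Block = new ec2.CfnVPCCidrBlock(this, 'Ipv6Block', {
      vpcId: vpc.vpcId,
      amazonProvidedIpv6CidrBlock: true,
    });

    // 2. Egress-only internet gateway: outbound-only IPv6, no hourly charge.
    const egressOnly = new ec2.CfnEgressOnlyInternetGateway(this, 'EgressOnly', {
      vpcId: vpc.vpcId,
    });

    // 3. Route private-subnet IPv6 traffic out through it.
    vpc.privateSubnets.forEach((subnet, i) => {
      const route = new ec2.CfnRoute(this, `Ipv6Route${i}`, {
        routeTableId: subnet.routeTable.routeTableId,
        destinationIpv6CidrBlock: '::/0',
        egressOnlyInternetGatewayId: egressOnly.ref,
      });
      route.addDependency(ipv6Block);
    });
    // Subnets still need their own IPv6 CIDRs assigned – omitted for brevity.
  }
}

new DualStackVpcStack(new App(), 'DualStackVpc');
```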

Some steps to take

It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.

  1. Enable IPv6 on any managed services you use, by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and are running it outside of your VPC. There is no security risk to you, and by enabling it, you give other people the benefit
  2. Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to give your load-balancers IPv6 addresses. This makes life better for other people, and gives you learning for the next step
  3. Enable IPv6 with an egress-only internet gateway, for any subnets that have NAT: this is where you start saving money, as hopefully you’re NATing less traffic and saving on those charges

Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.

About “That” Prime Engineering article

Everyone who has worked with managed cloud services has experienced the moment when it made sense to move away from managed services.

Turns out, so does Prime Video.

Amazon Prime Video recently wrote about how changing away from managed services and writing a more integrated application saved them money. Despite being a few months old, this appeared to blow-up this week, and predictably has caused some cries of “SEE, SEE YOU SHOULD JUST RUN EVERYTHING YOURSELF”.

But to those of us who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.

I say this as someone who is an annoying cheerleader for serverless compute and managed services, but despite that, I have home-rolled things, when it made sense.

How do you solve a problem

When you’re solving a problem, you look at the managed services you have available, considering factors like:

  • Your team’s experience with the service
  • Limitations on the service, and what it was intended for, against what you’re doing
  • What quotas apply that you might hit
  • How the pricing model works

While pricing for managed-services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.
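For example (made-up traffic numbers, and CDN-ish list prices that will vary by provider, region and tier):

```typescript
// Hypothetical: 100 million requests/month for ~10 KB thumbnails.
const requestsPerMonth = 100_000_000;
const objectSizeKb = 10;

// Assumed prices, roughly in line with CDN list pricing at time of writing:
const pricePerMillionRequests = 1.0; // USD
const pricePerGbTransferred = 0.085; // USD

const requestCost = (requestsPerMonth / 1_000_000) * pricePerMillionRequests;
const bandwidthCost =
  ((requestsPerMonth * objectSizeKb) / (1024 * 1024)) * pricePerGbTransferred;

console.log({ requestCost, bandwidthCost });
// requestCost ≈ 100 USD vs bandwidthCost ≈ 81 USD: for small objects the
// per-request charge is already the bigger lever, and it only gets more
// lopsided as objects shrink.
```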

I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.

My example

We’d built out a system that was performing 3 distinct processing steps on large files.

The system had been built out incrementally, and we had the 3 steps running on three different auto-scaling groups, fed by queues.

While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.

The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.

Attempt 1: “EFS looks good”

We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file, subsequent operations used this download.

This also had the advantage that, since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.

We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.

Filesystem throughput was allocated based on the amount stored on the filesystem, but as ours was a transient workload, the credits didn’t replenish quickly enough, and we didn’t like the idea of just making large random files to earn more.

Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.
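If you’re trying the same thing today, the symptom is at least visible ahead of time: EFS publishes a BurstCreditBalance metric to CloudWatch. A rough sketch (hypothetical filesystem ID and threshold) of the alarm that would have saved us that afternoon:

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Alarm when the EFS burst credit balance starts draining.
class EfsBurstAlarmStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);

    const burstCredits = new cloudwatch.Metric({
      namespace: 'AWS/EFS',
      metricName: 'BurstCreditBalance',
      dimensionsMap: { FileSystemId: 'fs-12345678' }, // hypothetical ID
      statistic: 'Minimum',
      period: Duration.minutes(5),
    });

    new cloudwatch.Alarm(this, 'BurstCreditsLow', {
      metric: burstCredits,
      threshold: 1_000_000_000_000, // the metric is in bytes; pick your own line
      comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
      evaluationPeriods: 3,
    });
  }
}

new EfsBurstAlarmStack(new App(), 'EfsBurstAlarm');
```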

Attempt 2: “Sod it, just combine them”

The processes varied in complexity, there was effectively a trivial, a medium, and a high complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power sized for the highest-complexity task would zoom through the other two, and so we combined them into what I jokingly called “the microlith”.

We didn’t touch the other parts of the pipeline or the database; they remained in other services, but combining the 3 steps worked.

What did we gain

The system was faster, and even more usefully to operators, more predictable.

Once processing had started you could tell, based on the file size, when the items would be ready…

Much like “a higher p90 but a lower maximum” can feel better for user experience, this consistency was great.

What did we lose

Two of the three components had external dependencies, and this did mean this component was one of the less ‘safe’ to deploy, and while tests built up to defend against that… the impact of a failed deploy was larger than you’d want.

In Conclusion

There are always times when breaking your patterns makes sense; the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.

Prime Video refining an architecture to better meet scaling and cost models, making it less “pure”, isn’t the gotcha against managed services that some people would have you believe.

“Pure” design doesn’t win prizes.

Suitable design does.

Monoliths or Microservices: how about a middle way?

Should we deploy as microservices or monoliths? How about neither.

The latest argument we’re having (again) is how we should deploy our systems: “microservices” or “monolith”?

Now, I’ll try to skip past what we mean by all of those things (because it’s covered better elsewhere), but in essence, we’re asking “does our software live in 80 repos or 1 repo?”.

TL;DR How about we aim for 8?

What does good deployment/development look like

In an ideal world, we’d have the following properties in our deployment:

  • It would have appropriate tests and automation, so deployment is easy and doesn’t feel risky
  • The potential impact of a deployment should be predictable; something over there shouldn’t impact something over here
  • It should be clear where to look to change code

Problems with Micro-services

  • If you go properly granular, it can be difficult to know which repo code resides in – if you pick up a ticket, you shouldn’t need to spend 15 minutes to identify where that code lives
  • Deployment of related services may need to be coordinated more closely than you’d like, ensuring that downstream components are ready to accept any new messages/API calls when they arrive
  • Setting up deployment for each new component can be time-consuming (Although with things like CDK/Terraform etc, it should be possible to template much of this to a config file for the deployment system)

Problems with Monoliths

  • Code can potentially leak into production more easily – requiring more robust feature-flagging to hide non-live code. While this is good practice, it becomes a requirement in larger repos – you can avoid this by ‘dev’ deployments not being in trunk, but that’s a different kind of deployment complexity
  • Spinning up another instance of “the system” for testing of a single component may be more expensive and fiddlier than duplicating an individual component
  • The impact of a deployment may not be known, you may need to assess if other commits included in what you’re putting live could break things, this may increase deployment friction

How about Service Cluster Deployments

In a prior engagement, we built what was really a task management system.

  1. Messages would arrive which could potentially make new tasks for the system, or update existing tasks: these were handled by the task-creator-and-updater
  2. The task-viewer would access the database of tasks, cross reference with other services, and create a unified view of the task list
  3. An automation component would use the output of the task-viewer to initiate actions to resolve the tasks, which would ultimately result in more messages arriving, which then updated the task database

In our deployment, these components were all in 3 different projects, the microservice model. And it worked, but it’s also an example of where these 3 components could have been combined into one functional service repo.

This makes sense to me because the 3 services are closely coupled, especially between the task-creator-and-updater and the task-viewer. So maybe they could have been in a combined task-management repo.
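As a rough sketch of what that could look like (hypothetical stack names, and assuming a CDK-style setup): one task-management repo, three independently deployable pieces.

```typescript
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

// One repo ("task-management"), three deployable components: each can still
// ship on its own, but the closely-coupled code lives together.
class TaskCreatorAndUpdaterStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // queue consumers that create/update tasks would be defined here
  }
}

class TaskViewerStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // the API that cross-references tasks with other services
  }
}

class AutomationStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // the component that acts on the task-viewer's output
  }
}

const app = new App();
new TaskCreatorAndUpdaterStack(app, 'TaskCreatorAndUpdater');
new TaskViewerStack(app, 'TaskViewer');
new AutomationStack(app, 'Automation');
// `cdk deploy TaskViewer` still deploys just one component.
```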

With this setup I could still feel safe doing a deployment on a Friday afternoon to one component, because even if the task management system failed entirely, the manual processes were in place to allow recovery until the system could be rolled back.

Meanwhile, for another one of our components, the cost of a failed deployment was so high – and, even if it could be recovered, its time-critical nature so unforgiving – that we only deployed during ‘off-peak’ periods of the week. Could it have been made more robust? Probably, but it was also a relatively static system – that effort was better spent on other components that were more ‘active’.

In summary

Your deployment should work for your team. It should be based on templated conventions that allow easy configuration of new deployments, and it should be as granular as makes sense.

Instead of worrying about being “truly micro-services” or “fire & forget monolith”, find the smallest number of functional groups to keep your code in. That way you can have scope-limited deployments, without having hundreds of repos.

Finally, please, just-name-your-repos-like-this, it’s funny at first giving things amusing names, but honestly, kitchen-cooking-oven is far more supportable than the-name-of-a-dragon because it gets really hot.

The One Boring Reason Why People Use the AWS Service

One of my clients recently started using a relatively new AWS CI/CD Service, and I just stumbled on a defensive/marketing type post from one of the traditional providers. And it made me realise how much vendors can miss the reason people choose to go with the AWS/GCP/Azure service, even if it’s inferior.

Aside: I’m not going to link to the article because they don’t deserve the clicks.

Back to their post, it went through a familiar structure:

  1. “But it doesn’t have all the features, our lovely features”
  2. “You can’t self-host, you’re LOCKED-IN!”
  3. “Why not buy into our broader platform?”

I’ll go through these in turn, before getting to the actual reasons.

“It doesn’t have the features…”

It doesn’t. It’s version 1 of an AWS product… they always launch very lean and gain new things.

And yes, it only supports 3 integrations while the Vendor supports around 30. Turns out, though, those 3 are the most important ones. Others will be added I’m sure, but only where people will use them.

“You can’t self-host, you’re LOCKED-IN”

Good. I literally don’t want to.

I know that some Ops-Teams feel happier that they can touch a container or an instance, but this is a product that can be replaced quite easily, including by this Vendor should the need arise.

They do have a SaaS offering you can pay for, but it’s relatively expensive for small-teams. (And we’ll come onto legal things later)

“Why not buy into our broader platform?”

Lock-in to your cloud provider is bad, but if you use all of their products you can get a great unified experience… which sounds a little like, erm, lock-in.

The simple reason people choose the service on their Cloud… procurement

Companies generally make buying stuff difficult. Every new vendor is a new round of legal review, potentially procurement exercises. It’s a painful affair.

This Vendor does sell their SaaS platform on the AWS marketplace, but it’s another End User License Agreement (EULA) that needs to be accepted. And that means it has to be evaluated by a legal team: like most other EULAs the lawyers will probably go “Yeah, it’s got a bunch of stuff in it that nobody could ever enforce, so proceed at a tiny risk”.

When you already have a cloud-provider, and the legal/finance agreements are in place, it’s just easier to use the provided service.

The ‘default’ product may well be inferior, have fewer features, and even be more expensive: but if I can click “use this” without involving legal – it’s the one I’ll likely choose.

My workload is too special for Serverless

A few years back it was “My workload would cost more in the cloud”, which while I’m sure is true for some workloads, it was a small and falling amount. It fell even more when you actually costed in all the admin you were doing for your “cheap” servers.

Now it’s “my workload is cheaper on servers than serverless”. Again, this will be true for some workloads, but that percentage is falling every month as features increase.

Time for the Horror Story…

With every new technology, we need the horror story to dismiss it.

“bUt wHAT aBOUT tHe COld-StArT PeNalTy, thaT meANS tHiS IS uNusABlE fOr ME”

Serverless Function Refusenik

Yes, cold-starts are clunky, and if you’re on Amazon (at time of writing this), you cannot feasibly start a Lambda inside a VPC because the startup penalty is too painful. This is apparently on their roadmap for this year.

Microsoft are launching a pricing model that allows you to pay for some pre-warmed functions, which could give you the best of both worlds – easy scaling without the cold starts – if the pricing is acceptable.

Anyway, for a lot of these things, the API Gateway cache, or a CDN in front of your APIs, should be offloading a lot of traffic and ensuring that common items are rapidly available.
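Turning on the API Gateway cache is a small change. Here’s a CDK-flavoured sketch under that assumption (the backend function and asset path are made up):

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';

// A cached REST API stage: repeat requests for common items never reach the
// Lambda, so they never pay the cold-start penalty either.
class CachedApiStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);

    const handler = new lambda.Function(this, 'Handler', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/api'), // hypothetical path
    });

    new apigateway.LambdaRestApi(this, 'Api', {
      handler,
      deployOptions: {
        cacheClusterEnabled: true,
        cachingEnabled: true,
        cacheTtl: Duration.minutes(5),
      },
    });
  }
}

new CachedApiStack(new App(), 'CachedApi');
```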

Stop swimming upstream

All the effort in IT infrastructure is heading towards serverless functions, container orchestration, and containers without actively running container hosts. The choice of hosted database or database-like storage services we are offered can make it confusing to decide. The answer is almost never “I’ll run something myself”.

Shunning these modern hosting options because you genuinely feel that your service is so special is choosing to take the hard path for little reason, in nearly all cases. And someone else will use them, have the advantage of working far more on functional code and far less on overheads, and could offer a cheaper/better product than you.

Yes, I know when you are at the scale of one of the top ten internet giants it can make sense – Dropbox moved their storage to their own appliances, but you’re not really Dropbox, are you?

AWS Launches MediaConnect and almost gives us multicast

It’s Re:invent time, and Amazon have launched a new service to make video routing to the cloud reliable and easier to set-up.

A few weeks back I was at the brilliant DPP Leaders Summit, which was under the Chatham House Rule.1 There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional-services, THEN IT ISN’T AN ACTUAL PRODUCT.”2

Anyway one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”. Which I can understand, for playout or production applications, the need for those two is pretty clear.

It’s now Re:invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, which would then fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue, and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail it here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with/without Forward Error Correction) or the more reliable but commercial Zixi protocol. The video has an Amazon ARN identifier, which then means that external accounts can have permissions to subscribe to the stream, the documentation says a ‘flow’ can have up to 20 outputs.

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could very easily build a global contribution network to backhaul outside-broadcasts
  3. Contribution from a playout appliance: if your cloud playout outputs to a MediaConnect flow, then you can output that flow to your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC, it’s not PTP, I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain3.

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here, I don’t have one of those hanging around to test

Cloud, the cost and value of everything

Forget scalability, speed to change and flexibility, I think the single most important thing about cloud hosting is putting an explicit cost on everything…

Last night I gave a lightning talk at the newly tweaked #metabeertalks – these guys are great friends of mine – and their topic was “is realtime Fashion or Fad?”.

Modern hosting approaches, aka “the cloud” have many advantages: They scale trivially, encourage you to use best practices in how you architect and deploy, and are flexible to change as your application does.

The single most powerful thing though, is that it puts a cost on every element of your application. We can debate if it’s cheaper, more expensive, or about the same as hosting on tin: but you know what your components cost.

Your application isn’t being bundled up with a load of others on a server, with your IT team complaining they have to install a new one, with about 1 month’s lead time, every 3 months.

You host your application on an instance the right size for it, be it small or huge, a single machine or a fleet of 20. You use the storage you need, when you need it, without playing that impossible game of “how much storage will we need by the time the storage system actually arrives”.

And all this comes with transparency: set your system up with the right tags, and all the costs of an application are known.
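With today’s tooling, ‘the right tags’ can be as simple as this CDK-flavoured sketch (the tag keys and values are just examples): activate them as cost-allocation tags and Cost Explorer can slice spend per application.

```typescript
import { App, Stack, Tags } from 'aws-cdk-lib';

// Tag everything the app creates so the bill can be broken down by
// application, team and environment.
const app = new App();
const stack = new Stack(app, 'RealtimeAnalytics'); // hypothetical application

Tags.of(app).add('application', 'realtime-analytics');
Tags.of(app).add('team', 'data-platform');
Tags.of(stack).add('environment', 'production');
```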

Knowing those, you can start flexing: If you need 10 machines to keep up with realtime analysis, they’re yours. Or if you don’t want to pay that, bid for some cheaper instances and batch the work overnight.

Within reason, you can do anything, if you can afford it. So you can take a call about which bits of information are valuable enough to justify being realtime.

When Netflix launched House of Cards series two, you can hear them talking about the “Play Start” messages coming in. That kind of realtime information is amazingly helpful for debugging.

The deeper stats of how many people watched, and how many episodes they binged on, that information could probably wait a few hours to batch…

My take on realtime: do it where it’s valuable, and where you can justify the cost.

Which is exactly the same for all elements of cloud hosting.

Articles like this are why people think the cloud is oversold

When Malaysian 370 went missing, someone suggested “The Cloud” could solve all the problems.

The cloud can solve many problems, and is rightly seen as one of the easiest ways to launch web services. But it isn’t magical, and articles like this are why people think the cloud is being oversold: The cloud is not the solution to finding missing Malaysian flight 370:

But if MH370 had been fitted with technology that made use of the cloud it may never have been lost in the first place. The cloud is a cluster of computers that provides reliable computing and storage as a service to large numbers of requests from computers with limited capabilities, such as those on board a plane or inside a mobile phone.

What the author says is really “planes should dial-back to a server with their telemetry”

This may be true, but as a comment on the article points out: that doesn’t need the cloud.

It needs a server in a data-centre. Now you may choose to deploy that server as a virtualised box in the cloud, but this is not an application where you need the main virtues of cloud type platforms.

Over and above machine virtualisation, I tend to think about ‘cloud’ meaning some combination of these things:

  1. You scale your resources when you need them, not ahead of time. The best example of this is storage: you don’t have to pre-size your storage allocation in Amazon or Azure.1
  2. Your application is making use of the two main scaling patterns, incoming load balancers2 and asynchronous message passing3, to dynamically change the amount of processing capacity that you have (see the sketch after this list).
  3. Not a technical thing, but your costs should be scaling in line with your usage. Having the incentive to save money by doing as little as possible when you’re idle will encourage you to properly scale.
  4. You start treating your servers as livestock and not pets. If virtualisation separates your instances from the physical hardware, cloud deployment should separate your application from the instances.
  5. Your deployment should be cheap. It should take minutes, and be painless, and shouldn’t make your ops team bite their nails in fear. It needs to be a routine, accepted, automated process. This also requires you to have your config held in more durable places than a file on an instance, which could disappear at any moment.
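To make point 2’s queue-driven scaling concrete, here’s a rough sketch with modern tooling – the queue, instance type and thresholds are all placeholder choices:

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as sqs from 'aws-cdk-lib/aws-sqs';

// When the work queue backs up, add instances; when it drains, remove them.
class QueueScalingStack extends Stack {
  constructor(app: App, id: string) {
    super(app, id);

    const queue = new sqs.Queue(this, 'WorkQueue');
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });

    const workers = new autoscaling.AutoScalingGroup(this, 'Workers', {
      vpc,
      instanceType: new ec2.InstanceType('t3.small'),
      machineImage: ec2.MachineImage.latestAmazonLinux2(),
      minCapacity: 1,
      maxCapacity: 20,
    });

    workers.scaleOnMetric('ScaleOnBacklog', {
      metric: queue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: [
        { upper: 100, change: -1 },  // queue nearly empty: scale in
        { lower: 1000, change: +2 }, // backlog growing: scale out
      ],
      adjustmentType: autoscaling.AdjustmentType.CHANGE_IN_CAPACITY,
    });
  }
}

new QueueScalingStack(new App(), 'QueueScaling');
```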

The dial-back type solution that could help us find missing planes doesn’t really need many of these characteristics. The data formats would be relatively static, and the loading wouldn’t peak to such levels that you needed to place it all behind a massive loadbalancer. You’d care about reliability, but I don’t see masses of room for flexing things here.

Yes dial-back is a good opportunity to improve visibility (the ACARS data from Air France flight 447 provided a trace of the accident), but what really could have helped us in the case of Malaysian 370 would have been something that continued to report back position information after ACARS was disabled.

We don’t know if ACARS was disabled manually by the flight crew, or by a result of electrical systems being de-powered due to a fire. We do know the Inmarsat satellite modem was still functioning for some time, and responded to a network level ping. This only gave us a confirmation that the modem was still in range of a satellite beam, and unfortunately it was a large satellite beam which covers a wide area.

Had the plane been fitted with a newer Inmarsat system, it would have been connecting to satellite beams with smaller footprints, which could have narrowed the search area.

What might have helped would have been another GPS receiver integrated with the satellite modem, so that even without the main ACARS system, at least the position could still be reported.

That isn’t in the cloud however, that’s on the plane.

Better reporting back could have helped the investigators here, but no, that is not another solution in search of “the cloud”.

  1. Much to the cheers of capacity planners
  2. When more people hit your website, you launch more servers
  3. Instead of doing an operation when a request comes in, you put it on a queue. When the queues start growing too big, you start additional instances