Save Money, deploy IPv6 in your VPC

Getting past uncertainty or unwillingness to use IPv6 is actively costing you money.

I (mostly) have IPv6 deployed in my home network. Alas, a bug on my router currently prevents it working completely, but it’s been enabled for years, mostly as a novelty so I could feel ready for the grand future, without my really noticing any difference.

Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.

The Joy of NAT

Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed service as I am), a managed NAT gateway.

As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.

In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.

The Egress only IPv6 gateway only charges ‘standard’ egress rates and it has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.

I was expecting to be told “Only one of my APIs supports IPv6” but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.

This was not what I was expecting, despite remembering that for the last few years, wherever I could deploy a dual-stack endpoint, I have been…

So why haven’t we

There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:

  1. IPv6 didn’t have any advantages. In most cases the IPv4 setup works ok… So what’s the point of adding it to a working setup…
  2. The deployment tooling doesn’t support it – this was the case here, the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily
  3. We like NAT’s Security by default: With NAT, your compute resources aren’t exposed to the internet, all that endpoints see are 1 or 2 shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 Egress Only gateway doesn’t allow inbound connections, you do feel more exposed, and so much of security is an annoying vibes type thing

Some steps to take

It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.

  1. Enable IPv6 on any managed services you use by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and are running it outside of your VPC. There is no security risk to you, and by creating those, you give other people benefit
  2. Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to give your load-balancers IPv6 addresses. This makes life better for other people, and gives you learning for the next step
  3. Enable IPv6 with Egress only gateway, for any subnets that have NAT: this is where you start saving money, as hopefully you’re NAT’ing less traffic and saving on those charges

Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.
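To get a feel for whether this is worth your time, here is a rough sketch of the sums. The prices are placeholders I have assumed for illustration, not quoted AWS rates, and the function names are made up for this example. It only counts the NAT-specific charges (rental plus per-GB processing), since ‘standard’ egress is paid either way, and it assumes you keep your NAT gateways for the remaining IPv4 traffic.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # rough average month length in hours


@dataclass(frozen=True)
class NatPrices:
    """Assumed, illustrative prices; check the current price list for your region."""
    hourly: float = 0.045   # assumed rental per NAT gateway per hour
    per_gb: float = 0.045   # assumed data-processing charge per GB through NAT


def nat_charges(gb_through_nat: float, gateways: int,
                prices: NatPrices = NatPrices()) -> float:
    """NAT-specific monthly charges: rental plus per-GB processing."""
    rental = gateways * prices.hourly * HOURS_PER_MONTH
    processing = gb_through_nat * prices.per_gb
    return rental + processing


def monthly_saving(total_gb: float, ipv6_share: float, gateways: int,
                   prices: NatPrices = NatPrices()) -> float:
    """Saving when `ipv6_share` of the traffic bypasses NAT via the egress-only gateway."""
    if not 0.0 <= ipv6_share <= 1.0:
        raise ValueError("ipv6_share must be between 0 and 1")
    before = nat_charges(total_gb, gateways, prices)
    after = nat_charges(total_gb * (1.0 - ipv6_share), gateways, prices)
    return before - after


if __name__ == "__main__":
    # Hypothetical workload: 10 TB a month, 80% of it to IPv6-capable APIs.
    print(f"Estimated monthly saving: {monthly_saving(10_000, 0.8, 2):.2f}")
```

The rental charge cancels out in this model, so the saving is driven entirely by how much traffic you can move off NAT. If you only push a few GB a month, the answer will be “not much”.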

About “That” Prime Engineering article

Everyone who has worked with managed cloud services has experienced the moment when it made sense to move away from managed services.

Turns out, so does Prime Video.

Amazon Prime Video recently wrote about how changing away from managed services and writing a more integrated application saved them money. Despite being a few months old, this appeared to blow-up this week, and predictably has caused some cries of “SEE, SEE YOU SHOULD JUST RUN EVERYTHING YOURSELF”.

But to those of us who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.

I say this as someone who is an annoying cheerleader for serverless compute and managed services, but despite that, I have home-rolled things, when it made sense.

How do you solve a problem

When you’re solving a problem, you look at the managed services that you have available, considering factors like:

  • Your team’s experience with the service
  • Limitations on the service, and what it was intended for, against what you’re doing
  • What quotas may apply that you hit
  • How the pricing model works

While pricing for managed-services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.
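A quick sketch of that small-images example: the two prices below are invented for illustration (they are not any provider’s real rates), and serving_cost is a hypothetical helper. The point is only the shape of the result, not the numbers.

```python
def serving_cost(requests: int, avg_bytes: int,
                 price_per_10k_requests: float = 0.01,   # assumed price
                 price_per_gb: float = 0.085) -> tuple:  # assumed price
    """Return (request_cost, bandwidth_cost) for a month of traffic."""
    request_cost = requests / 10_000 * price_per_10k_requests
    bandwidth_cost = requests * avg_bytes / 1e9 * price_per_gb
    return request_cost, bandwidth_cost


if __name__ == "__main__":
    # Same request count, very different object sizes.
    small = serving_cost(100_000_000, 2_000)       # 2 KB thumbnails
    large = serving_cost(100_000_000, 2_000_000)   # 2 MB images
    print("small images (requests, bandwidth):", small)
    print("large images (requests, bandwidth):", large)
```

With these assumed prices the thumbnails are dominated by the per-request charge, while the large images are dominated by bandwidth, which is why the same service can be cheap for one workload and painful for another.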

I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.

My example

We’d built out a system that was performing 3 distinct processing steps on large files.

The system had been built out incrementally, and we had the 3 steps on three different auto-scale groups, fed by queues.

While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.

The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.

Attempt 1: “EFS looks good”

We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file, subsequent operations used this download.

This also had the advantage that, since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.

We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.

Filesystem throughput was allocated based on the amount stored on the filesystem, but as this was a transient process, we didn’t replenish it quickly enough, and didn’t like the idea of just making large random files to earn it.
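For anyone who hasn’t been bitten by this, a toy model of the problem: baseline throughput scales with how much you store, and anything you use above that drains a credit balance. The rate and credit figures here are assumptions for illustration, not the service’s documented values, and the function is a hypothetical sketch rather than anything we actually ran.

```python
from typing import Optional


def hours_until_credits_exhausted(stored_gib: float, used_mib_s: float,
                                  credit_mib: float,
                                  baseline_mib_s_per_gib: float = 0.05) -> Optional[float]:
    """Hours of sustained use before the burst credits run out.

    baseline_mib_s_per_gib is an assumed earn rate per GiB stored.
    Returns None when usage is at or below the baseline (credits never drain).
    """
    baseline = stored_gib * baseline_mib_s_per_gib
    drain = used_mib_s - baseline
    if drain <= 0:
        return None
    return credit_mib / drain / 3600


if __name__ == "__main__":
    # Transient workload: very little stored at any moment, lots of throughput.
    print(hours_until_credits_exhausted(10, 100, 1_000_000))
    # Same throughput with a large amount stored: the baseline covers it.
    print(hours_until_credits_exhausted(4000, 100, 1_000_000))
```

A transient pipeline sits firmly in the first case: lots of throughput, almost nothing stored, so the balance only ever goes one way.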

Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.

Attempt 2: “Sod it, just combine them”

The processes varied in complexity, there was effectively a trivial, a medium, and a high complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power for the highest task would zoom through the other two tasks, and so we combined them into what I jokingly called “the microlith”.

We didn’t touch the other parts of the pipeline, or the database, they remained in other services, but combining the 3 steps worked.

What did we gain

The system was faster, and even more usefully to operators, more predictable.

Once processing had started you could tell, based on the file size, when the items would be ready…

Much like “lower p90 but higher maximum” feels better for user experience, this consistency was great.

What did we lose

Two of the three components had external dependencies, and this did mean this component was one of the less ‘safe’ to deploy, and while tests built up to defend against that… the impact of a failed deploy was larger than you’d want.

In Conclusion

There are always times when breaking your patterns makes sense, the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.

Prime Video refining an architecture to better meet scaling and cost models, making it less “Pure”, isn’t the gotcha against these services that some people would have you believe.

“Pure” design doesn’t win prizes.

Suitable design does.

Monoliths or Microservices: how about a middle way?

Should we deploy as micro-services or monoliths? How about neither.

The latest argument that we’re having again, is how we should deploy our systems, and we’re asking “micro services” or “monolith”.

Now, I’ll try to skip past what we mean by all of those things (because it’s covered better elsewhere), but in essence, we’re asking “does our software live in 80 repos or 1 repo?”.

TL;DR How about we aim for 8?

What does good deployment/development look like

In an ideal world, we’d have the following properties in our deployment:

  • It would have appropriate tests and automation, so deployment is easy and doesn’t feel risky
  • The potential impact of a deployment should be predictable: something over there shouldn’t impact something over here
  • It should be clear where to look to change code

Problems with Micro-services

  • If you go properly granular, it can be difficult to know which repo code resides in – if you pick up a ticket, you shouldn’t need to spend 15 minutes identifying where that code lives
  • Deployment of related services may need to be coordinated more closely than you’d like, ensuring that downstream components are ready to accept any new messages/API calls when they arrive
  • Setting up deployment for each new component can be time-consuming (Although with things like CDK/Terraform etc, it should be possible to template much of this to a config file for the deployment system)

Problems with Monoliths

  • Code can potentially leak into production more easily – requiring more robust feature-flagging to hide non-live code. While this is good practice, it becomes a requirement in larger repos – you can avoid this by ‘dev’ deployments not being in trunk, but that’s a different kind of deployment complexity
  • Spinning up another instance of “the system” for testing of a single component may be more expensive and fiddlier than duplicating an individual component
  • The impact of a deployment may not be known, you may need to assess if other commits included in what you’re putting live could break things, this may increase deployment friction

How about Service Cluster Deployments

In a prior engagement, we built what was really a task management system.

  1. Messages would arrive which could potentially make new tasks for the system, or update existing tasks: these were handled by the task-creator-and-updater
  2. The task-viewer would access the database of tasks, cross reference with other services, and create a unified view of the task list
  3. An automation component would use the output of task-viewer to initiate actions to resolve the tasks, which would ultimately result in more messages arriving, which then updated the task database

In our deployment, these components were all in 3 different projects, the micro service model. And it worked, but is also an example of where these 3 components could be combined into one functional service repo.

This makes sense to me because the 3 services are closely coupled, especially between the task-creator-and-updater and the task-viewer. So maybe they could have been in a combined repo task-management

With this setup I could still feel safe doing a deployment on a Friday afternoon to one component, because even if the task management system failed entirely, the manual processes were in place to allow recovery until the system could be rolled back.

Meanwhile, for another one of our components, the cost of a failed deployment was so high, and its work so time-critical even if the deployment was recovered, that we only deployed during ‘off-peak’ periods of the week. Could it have been made more robust? Probably, but it was also a relatively static system – that effort was better spent on other components that were more ‘active’.

In summary

Your deployment should work for your team. It should be based on templated conventions that allow easy configuration of new deployments, and it should be as granular as makes sense.

Instead of worrying about being “truly micro-services” or “fire & forget monolith” find the smallest number of functional groups to keep your code in. That way you can have scope-limited deployments, without having hundreds of repos.

Finally, please, just-name-your-repos-like-this, it’s funny at first giving things amusing names, but honestly, kitchen-cooking-oven is far more supportable than the-name-of-a-dragon because it gets really hot.

The One Boring Reason Why People Use the AWS Service

One of my clients recently started using a relatively new AWS CI/CD Service, and I just stumbled on a defensive/marketing type post from one of the traditional providers. And it made me realise how much vendors can miss the reason people choose to go with the AWS/GCP/Azure service, even if it’s inferior.

Aside: I’m not going to link to the article because they don’t deserve the clicks.

Back to their post, it went through a familiar structure:

  1. “But it doesn’t have all the features, our lovely features”
  2. “You can’t self-host, you’re LOCKED-IN!”
  3. “Why not buy into our broader platform?”

I’ll go through these in turn, before getting to the actual reasons.

“It doesn’t have the features…”

It doesn’t. It’s version 1 of an AWS product… they always launch very lean and gain new things.

And yes, it only supports 3 integrations while Vendor supports around 30. Turns out though those 3 are the most important ones. Others will be added I’m sure, but only where people will use them.

“You can’t self-host, you’re LOCKED-IN”

Good. I literally don’t want to.

I know that some Ops-Teams feel happier that they can touch a container or an instance, but this is a product that can be replaced quite easily, including by this Vendor should the need arise.

They do have a SaaS offering you can pay for, but it’s relatively expensive for small-teams. (And we’ll come onto legal things later)

“Why not buy into our broader platform?”

Lock-in to your cloud provider is bad, but if you use all of their products you can get a great unified experience… which sounds a little like, erm, lock-in.

The simple reason people choose the service on their Cloud… procurement

Companies generally make buying stuff difficult. Every new vendor is a new round of legal review, potentially procurement exercises. It’s a painful affair.

This Vendor does sell their SaaS platform on the AWS marketplace, but it’s another End User License Agreement (EULA) that needs to be accepted. And that means it has to be evaluated by a legal-team: like most other EULAs the lawyers will probably go “Yeah, it’s got a bunch of stuff in it that nobody could ever enforce, so proceed at a tiny risk”.

When you already have a cloud-provider, and the legal/finance agreements are in place, it’s just easier to use the provided service.

The ‘default’ product may well be inferior, have fewer features, and even be more expensive: but if I can click “use this” without involving legal – it’s the one I’ll likely choose.

My workload is too special for Serverless

A few years back it was “My workload would cost more in the cloud”, which while I’m sure is true for some workloads, it was a small and falling amount. It fell even more when you actually costed in all the admin you were doing for your “cheap” servers.

Now it’s “my workload is cheaper on servers than serverless”. Now, again, this will be true for some workloads, but again, this percentage is falling every month as features increase.

Time for the Horror Story…

With every new technology, we need the horror story to dismiss it.

“bUt wHAT aBOUT tHe COld-StArT PeNalTy, thaT meANS tHiS IS uNusABlE fOr ME”

Serverless Function Refusenik

Yes, cold-starts are clunky, and if you’re on Amazon (at time of writing this), you cannot feasibly start a lambda into a VPC because the startup penalty is too painful. This is apparently on their roadmap for this year.

Microsoft are launching a pricing model that allows you to pay for some pre-warmed functions, which could give you the best combination of easy scaling, if the pricing is acceptable.

Anyway, for a lot of these things, the API-Gateway memory cache, or CDNs in front of your APIs should be offloading a lot of traffic and ensuring that common items are rapidly available

Stop swimming upstream

All the effort in IT infrastructure is heading towards serverless functions, container orchestration, containers without actively running container hosts. The choice of hosted database or database-like storage services we are offered can make it confusing to decide. The answer is almost never “I’ll run something myself”.

Shunning these modern hosting options because you genuinely feel that your service is so special is choosing just to take the hard path for little reason, in nearly all cases. And someone else will use them, have the advantage of working far more on functional code, and far less on overheads, and could offer a cheaper/better product than you.

Yes, I know when you are at the scale of one of the top ten internet giants it can make sense – Dropbox moved their storage to their own appliances, but you’re not really Dropbox, are you?

AWS Launches MediaConnect and almost gives us multicast

It’s Re:invent time, and Amazon have launched a new service to make video routing to the cloud reliable and easier to set-up.

A few weeks back I was at the brilliant DPP Leaders Summit, which was held under the Chatham House Rule.1 There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional-services, THEN IT ISN’T AN ACTUAL PRODUCT.”2

Anyway one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”. Which I can understand, for playout or production applications, the need for those two is pretty clear.

It’s now Re:invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, and then that would fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue, and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with/without Forward Error Correction) or the more reliable but commercial Zixi protocol. The video has an Amazon ARN identifier, which then means that external accounts can have permissions to subscribe to the stream, the documentation says a ‘flow’ can have up to 20 outputs.

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could build a global contribution network to backhaul outside-broadcasts very easily
  3. Contribution from a Playout appliance: if your cloud playout outputs to a MediaConnect flow, you can then output that flow to your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC, it’s not PTP, and I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain3.

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here, I don’t have one of those hanging around to test

Cloud, the cost and value of everything

Forget scalability, speed to change and flexibility, I think the single most important thing about cloud hosting is putting an explicit cost on everything…

Last night I gave a lightning talk at the newly tweaked #metabeertalks, these guys are great friends of mine, and their topic was “is realtime Fashion or Fad?”.

Modern hosting approaches, aka “the cloud” have many advantages: They scale trivially, encourage you to use best practices in how you architect and deploy, and are flexible to change as your application does.

The single most powerful thing though, is that it puts a cost on every element of your application. We can debate if it’s cheaper, more expensive, or about the same as hosting on tin: but you know what your components cost.

Your application isn’t being bundled up with a load of others on a server, with your IT team complaining they have to install a new one with about 1 month’s lead time every 3 months.

You host your application on an instance the right size for it, be it small or huge, single or a fleet of 20. You use the storage you need, when you need it, without playing that impossible game of “how much storage will we need by the time the storage system actually arrives”.

And all this comes with transparency: set your system up with the right tags, and all the costs of an application are known.
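As a sketch of what “the right tags” buys you: once every billed item carries an application tag, working out what each application costs is a simple roll-up. The record shape and the tag key below are made up for illustration; a real billing export will look different.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key="application"):
    """Sum cost per value of `tag_key`; items missing the tag land under 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical billing line items.
    items = [
        {"service": "compute", "cost": 1.5, "tags": {"application": "analytics"}},
        {"service": "storage", "cost": 2.5, "tags": {"application": "analytics"}},
        {"service": "compute", "cost": 4.0, "tags": {"application": "website"}},
        {"service": "compute", "cost": 0.5},  # nobody tagged this one
    ]
    print(cost_by_tag(items))
```

The ‘untagged’ bucket is the useful bit: it is the part of the bill nobody can explain yet.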

Knowing those, you can start flexing: If you need 10 machines to keep up with realtime analysis, they’re yours. Or if you don’t want to pay that, bid for some cheaper instances and batch the work overnight.

Within reason, you can do anything, if you can afford it. So you can take a call about which bits of information are valuable enough to justify being realtime.

When Netflix launched House of Cards series two, you can hear them talking about the “Play Start” messages coming in. That kind of realtime information is amazingly helpful for debugging.

The deeper stats of how many people watched, and how many episodes they binged on, that information could probably wait a few hours to batch…

My take on realtime: do it where it’s valuable, and where you can justify the cost.

Which is exactly the same for all elements of cloud hosting.

Articles like this are why people think the cloud is oversold

When Malaysian 370 went missing, someone suggested “The Cloud” could solve all the problems.

The cloud can solve many problems, and is rightly seen as one of the easiest ways to launch web services. But it isn’t magical, and articles like this are why people think the cloud is being oversold: The cloud is not the solution to finding missing Malaysian flight 370:

But if MH370 had been fitted with technology that made use of the cloud it may never have been lost in the first place. The cloud is a cluster of computers that provides reliable computing and storage as a service to large numbers of requests from computers with limited capabilities, such as those on board a plane or inside a mobile phone.

What the author says is really “planes should dial-back to a server with their telemetry”

This may be true, but as a comment on the article points out: that doesn’t need the cloud.

It needs a server in a data-centre. Now you may choose to deploy that server as a virtualised box in the cloud, but this is not an application where you need the main virtues of cloud type platforms.

Over and above machine virtualisation, I tend to think about ‘cloud’ meaning some combination of these things:

  1. You scale your resources when you need them, not ahead of time. The best example of this is storage: you don’t have to pre-size your storage allocation in Amazon or Azure. 1
  2. Your application is making use of the two main scaling patterns, incoming load balancers2 and asynchronous message passing3, to dynamically change the amount of processing capacity that you have.
  3. Not a technical thing, but your costs should be scaling in line with your usage. Having the incentive to save money by doing as little as possible when you’re idle will encourage you to properly scale.
  4. You start treating your servers as livestock and not pets. If virtualisation separates your instances from the physical hardware, cloud deployment should separate your application from the instances.
  5. Your deployment should be cheap. It should take minutes, and be painless, and shouldn’t make your ops team bite their nails in fear. It needs to be a routine, accepted, automated process. This also requires you to have your config held in more durable places than a file on an instance, which could disappear at any moment.
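The queue-based pattern in point 2 boils down to a small calculation: look at how much work is waiting and work out how many instances would clear it in a time you are happy with. This is a minimal sketch with made-up parameter names and limits, not any provider’s actual auto-scaling policy.

```python
import math


def desired_workers(queue_depth: int, per_worker_per_minute: float,
                    target_drain_minutes: float,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """How many workers are needed to drain the queue within the target time."""
    capacity_per_worker = per_worker_per_minute * target_drain_minutes
    needed = math.ceil(queue_depth / capacity_per_worker)
    # Clamp so we never scale to zero or beyond what we are willing to pay for.
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for depth in (0, 120, 1_000, 10_000):
        print(depth, "messages ->", desired_workers(depth, 10, 5), "workers")
```

The upper clamp is where the cost conversation happens: past that point the queue just takes longer to drain, which is often perfectly acceptable.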

The dial-back type solution that could help us find missing planes doesn’t really need many of these characteristics. The data formats would be relatively static, and the loading wouldn’t peak to such levels that you needed to place it all behind a massive loadbalancer. You’d care about reliability, but I don’t see masses of room for flexing things here.

Yes, dial-back is a good opportunity to improve visibility (the ACARS data from Air France flight 447 provided a trace of the accident), but what really could have helped us in the case of Malaysian 370 would have been something continuing to report back position information after ACARS was disabled.

We don’t know if ACARS was disabled manually by the flight crew, or by a result of electrical systems being de-powered due to a fire. We do know the Inmarsat satellite modem was still functioning for some time, and responded to a network level ping. This only gave us a confirmation that the modem was still in range of a satellite beam, and unfortunately it was a large satellite beam which covers a wide area.

Had the plane been fitted with a newer Inmarsat system, it would have been connecting to satellite beams with smaller footprints, which could have narrowed the search area.

What might have helped would have been if there was another GPS receiver integrated with the satellite modem, so that even without the main ACARS system, at least position could still be reported.

That isn’t in the cloud however, that’s on the plane.

Better reporting back could have helped the investigators here, but no, that is not another solution in search of “the cloud”.

  1. Much to the cheers of capacity planners
  2. When more people hit your website, you launch more servers
  3. Instead of doing an operation when a request comes in, you put it on a queue. When the queues start growing too big, you start additional instances

Google makes VM Immortal – but how useful?

Google let you migrate machines between data-centres while they still run

It’s a nice feature, and something that VMware has been able to do for a while, but I can’t help feeling it’s an anti-pattern in cloud infrastructures. Yes, there are some applications that you can’t easily design as message-consuming stateless data-beasts, but in general, to take advantage of scaling (for capacity or to save money), you need to design your applications so that they can survive machine failure, be it from chaos monkey or otherwise.

Performance: still hard

Performance is still hard: Artur Bergman of Fastly talks about what you’re doing wrong.

I watched @crucially’s video from the Velocity conference, and once again it’s a good talk where he plays the Grumpy Bastard with aplomb. Soon, soon I promise, I will “Buy a Fucking SSD”.

That’s Magic!

If you don’t understand stuff, it’s magic. And if you’re relying on something that’s magic, your platform can disappear in a puff of smoke. This is especially true of newer things – I don’t understand MySQL but it’s long in the tooth enough I can (mostly) trust it. Some of the newer NoSQL techs do not have that lineage…

Open Source allows you to get under the hood of all these things, to look behind the curtain and reverse-engineer what is going on. You invariably have to, as the documentation is a TODO item. This means that when you do hit these extreme edge-case situations you can fix them, eventually.

But that’s only once you’ve really understood the problem. In black-box situations it’s all too easy to pull the levers you have until it seems the problem has gone away, but all you’ve done is masked, displaced, or deferred it. You have to understand the whole stack and not just “your bit”. (This reminded me a bit of a conversation with a friend who does network security, where decisions not to collect some data for “safety” actually made potential targets more obvious)

There are no gremlins

My favourite point was this: Computers are (mostly) deterministic.

We talk about bugs, issues, intermittent and transient faults – almost resigning ourselves to sometimes “things just happen”.

As Artur points out, computers are deterministic state machines, this randomness doesn’t really exist. Yes, the complex interplay of our interconnected systems can give the appearance of a random system, but that is just the appearance.

There is a pattern in there, and when you find it, you can fix it. How? Lots of monitoring, lots of measuring, and good old-fashioned investigation.

Stop throwing boxes & sharding at things

The easy availability of horizontal scale-out makes us lazy and complacent: “we’ll just throw another Amazon instance at the problem”. That can be a valid approach, but only when your existing instances are actually spending all of their time doing meaningful work and not stuck queuing on some random service. If your site is sluggish because of poor code, database performance or tuning, you’re not really solving the problems.

Latency is even more critical (Google PDF), and scaling out a broken system may just let more people use it slowly – not make it faster.
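A back-of-envelope way to see this uses Little’s law (requests in flight = throughput × latency). The sketch below is my own illustration rather than anything from the talk, the numbers are hypothetical, and it generously assumes the slow dependency doesn’t get any slower under the extra load.

```python
def capacity_and_latency(instances: int, concurrency_per_instance: int,
                         latency_ms: float) -> tuple:
    """Return (max requests per second, latency each user sees in ms).

    Throughput comes from Little's law: in-flight requests / time per request.
    Latency is passed straight through, because adding boxes does not change it.
    """
    in_flight = instances * concurrency_per_instance
    throughput_rps = in_flight / (latency_ms / 1000)
    return throughput_rps, latency_ms


if __name__ == "__main__":
    # Requests take 2 seconds because they queue on a slow shared service.
    print("4 instances:", capacity_and_latency(4, 10, 2000))
    print("8 instances:", capacity_and_latency(8, 10, 2000))
```

Doubling the instances doubles how many requests you can serve, but every one of them still takes the same 2 seconds. The only thing that makes it faster is fixing whatever the requests are waiting on.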

Post-Cloud Call to Arms?

Scaling was hard: ordering servers took ages and it was all confusing. CDNs cost lots of money, were hard to use and only for the big boys.

Then “The Cloud” appeared: people like Amazon and others made it cheaper and faster to get machines. For a while we could ignore the complexity and just throw money at it.

But latency isn’t as simple as capacity, and we’re back to a situation that isn’t always about throwing more boxes into the battle.